Wikimedia Technical Conference/2018/Session notes/Identifying and extracting data trapped in our content

Facilitator Instructions: /Session_Guide#Session_Guidance_for_facilitators

Phab: https://phabricator.wikimedia.org/T206073

= Questions to answer during this session =

= Attendees list =


 * Daren, Cindy, Gergo, Josh Minor, Kate, Aaron, Magnus, Subbu, Santosh, Michael Holloway, Ramsey, Jon Katz, Danny Horn, Lydia, Cheol, Marko

= Structured notes =

There are five sections to the notes:


 * 1) Questions and answers: Answers the questions of the session
 * 2) Features and goals: What we should do based on the answers to the questions of this session
 * 3) Important decisions to make: Decisions which block progress in this area
 * 4) Action items: Next actions to take from this session
 * 5) New questions: New questions revealed during this session

= Questions and answers =

Please write in your original questions. If you came up with additional important questions that you answered, please also write them in. (Do not include “new” questions that you did not answer; instead, add them to the new questions section.)

= Features and goals =

= Important decisions to make =

= Action items =

= New Questions =

= Detailed notes =

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.


 * E.g. Wiktionary wants to have every word in every language.  Let’s say we want to do something with that on Mobile. That’s very hard because there’s not really a consistent structure within a language wiktionary -- let alone between languages.  E.g. MCS only works with English Wiktionary.  [Note: This is for illustrative purposes only, because Wikidata is fixing this particular problem.]
 * Brainstorming examples of data to extract (from whiteboard)
 * Infoboxes
 * Categories -- subcategories
 * See also
 * ToC
 * Quality assessment /warnings / stubs
 * Navboxes (article relations)
 * Template data
 * Display title
 * Template styles
 * Workflow state (afd/afc/draft)
 * Double underscore switches __NOINDEX__ __HIDDENCAT__
 * Coordinates
 * Tables/Lists
 * Semantic statements as described in the text “portland has a population of”
 * Spoken article
 * Article series
 * Proofread progress/index
 * All of commons data
 * Semantic mediawiki data
 * I’ll visit the board and note this down.
 * Magnus: Ideally, these things should exist on Wikidata.
 * If they are copied, they will get out of sync.  So we need to display wikidata data on various Wikipedias.  But on big wikis, there is social resistance. So we need to be careful where we store [this data].
 * Cindy: what is the domain, do we include external wikis in this?
 * Michael: No, not really, though we should recognize that Semantic MediaWiki etc. exists.  [“our content” === the Wikimedia projects]
 * Magnus: Depends on the wiki owner
 * Aaron: Is behavioral data (e.g. editor reputation based on content persistence measures) out of scope? (Yes)
 * Magnus: Also lists.  Lots of manually curated information.  Lists should be re-creatable from Wikidata by a query.
 * Josh M: Disambiguation pages!
 * Magnus: Yes.  Could also go the opposite route.
 * Josh M: New portal has a bunch of semantics buried.  Timelines. Did you know facts.
 * Michael: Summarized as “main page content”
 * Gergo: Lists of frequently asked questions
 * Darren: Relationships between pages in the form of links.
 * Aaron: Cleanup templates!  (In scope!)
 * SSmith: We have been working on a taxonomy for that.
 * Darren: See relationships between content based on who contributes.
 * Magnus: Could cluster interests based on who works on what with a link graph
 * Josh M: ???
 * Gergo: User boxes and other content on user pages.

Cutting off brainstorming. Break into groups.
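Magnus's point above that manually curated lists should be re-creatable from Wikidata by a query can be made concrete with a SPARQL sketch (held here in a Python string). The identifiers (P31 "instance of", P170 "creator", Q3305213 "painting", Q5582 "Vincent van Gogh") are real Wikidata IDs, but the query itself is only an illustration of the pattern, not an existing list page's source:

```python
# A query of this shape, run against query.wikidata.org, could regenerate
# a hand-curated list page (here: a list of paintings by Van Gogh).
LIST_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q3305213 ;   # instance of: painting
        wdt:P170 wd:Q5582 .     # creator: Vincent van Gogh
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
print(LIST_QUERY.strip().splitlines()[0])
```

The local wiki would then only store (or reference) the query, while the list membership itself lives in Wikidata.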


 * Should it be stored in Wikidata, the host wiki, or somewhere else?

DJ: Many of the problems can be solved by data, logic, presentation. Keep presentation local. Keep data and logic centralized.

Important points

 * Extract can mean many different things (extract at source or delivery)
 * Storing data in structured way vs provide a structured view of data
 * Do you need to record the revision of the data from wikidata when put in a page
 * What does “versioning” mean?
 * Differences in curation models have different social blockers
 * Many issue solvable by splitting data/logic/selection/presentation.

Group 1: Daren, Gergo, Lydia, Cindy, DJ

 * Q1:
 * Lydia: Wiktionary. We started storing that in Wikidata. So basically we already changed the location where we want to store it.
 * What is the problem: structural,
 * Some stuff doesn’t belong in Wikidata, right?
 * So decide on what needs to be decentralized vs. what needs to be extracted from wikitext. Two different things for different data.
 * Categories: MCR slot, structured, done.
 * Basically everything you put at the top or the bottom of an article is likely an MCR slot?
 * Wikidata: infoboxes (partially), navboxes (can be generated from it), coordinates, tables (some)
 * Centralized but not Wikidata: user info
 * MCR slot: categories, page issues, proofread
 * Article series
 * Strategy: Store it as wikitext in an MCR slot and stop bothering with it (either hide that there is an MCR slot, or not).
 * Q2: What difficulties do you expect?
 * Can we have separate storage strategies for different things?
 * Give them a parser function that takes a piece of string and a key name and extracts that data into an MCR slot.
 * People try to store data that isn’t actually structured or formatted… what to do with that?
 * What we need is,
 * So storage and curation of data happen in MCR and Wikidata, but that doesn’t necessarily mean we don’t still have shadow indexed tables and APIs to then reuse that information again.
 * We need to recognize that not all encoded data needs the same approach to extracting this information from the wikitext.
 * Where does it belong, (de)centralized (and does it need an override)?
 * Where is the canonical information stored?
 * How is the information exposed?
 * Which of these types of data need versioning?
 * Q3: Versioning:
 * Non-versioned: userpage info (maybe? If it’s only private?)
 * Do you locally snapshot revision information if it is coming from an external or centralized location?
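Group 1's parser-function idea could look something like the sketch below. The `{{#structured:…}}` name and syntax are invented here purely for illustration (no such parser function exists in MediaWiki); the point is that authors keep editing wikitext while keyed values are siphoned off into a structured store that could back an MCR slot:

```python
import re

# Hypothetical marker syntax, e.g. {{#structured:population|583776}}.
# The "#structured" name is made up for this sketch.
STRUCTURED_RE = re.compile(r"\{\{#structured:([^|{}]+)\|([^{}]*)\}\}")

def extract_structured(wikitext):
    """Collect {{#structured:key|value}} pairs into a dict that could
    populate a structured MCR slot alongside the main wikitext slot."""
    slot = {}
    for key, value in STRUCTURED_RE.findall(wikitext):
        slot[key.strip()] = value.strip()
    return slot

page = "Portland has a population of {{#structured:population|583776}}."
print(extract_structured(page))  # {'population': '583776'}
```

On save, the extracted dict would be written to the structured slot, keeping the wikitext as the canonical, editable source.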

Group: Josh, Greg, Marko, Danny, Cheol
Where do you store it?

Picking a specific -- Categories? But that already is a property of MediaWiki, compared to a template. The concept of category exists in the core data structures, but infobox doesn't. Categories are a semantic network. They're more like tags, not ontological.

"See also" sections are all wikitext. They're stored as raw strings. If you wanted a voice agent to read them to you one after another, that would be interesting.

The model can be like a CSS model: a series of overrides. As a local community, you can override the way that infoboxes work for a lot of things. There's a general person infobox, then other people build on that. This is what happened with Wikidata descriptions: English Wikipedia had a problem with them, we said you can override them, and now the override is kept locally as an unstructured magic word. I think that's a general principle that can solve this -- if there's a global default, you can do local overrides when you want them.
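The "global default, local override" principle described above is just a fallback lookup. A minimal sketch (the function and data here are invented for illustration; the Q42 description mirrors the real Wikidata style but is not authoritative):

```python
def resolve_description(local_overrides, global_defaults, item_id):
    """Return the locally curated value when the community has set one,
    otherwise fall back to the centralized (Wikidata-style) default."""
    if item_id in local_overrides:
        return local_overrides[item_id]
    return global_defaults.get(item_id)

wikidata = {"Q42": "English writer and humourist"}
local = {"Q42": "Author of The Hitchhiker's Guide to the Galaxy"}

print(resolve_description(local, wikidata, "Q42"))  # local override wins
print(resolve_description({}, wikidata, "Q42"))     # falls back to global
```

The same resolution order works for descriptions, infobox fields, or any other value with a central source and an optional local magic-word override.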

If you can override the semantics, it should go somewhere else, or to Wikidata. Which one is better? I don't think you can answer that in general, definitely not for the whole community. If we do come up with a nice way to do descriptions with Wikidata, and people say no, then we have to redo things. It's hard to say "this is the place."

Wikipedias have specific concerns about Wikidata, and if Wikidata wants to be the global database of all truth, then it needs to address those concerns. One concern is citations: Wikidata uses a different model. At some point, there's going to be some kind of compromise on that. If they want to store everything, they need to prioritize those issues.

How do you get stuff out of these places?

There's getting it out, and there's also getting it one time rather than getting it every time. You get the see alsos, but then that's going to change over time. Do you do it as a template?

Say you have the content translation tool -- is that a forking tool or a syncing tool? Right now, we fell into a forking model, but that wasn't a deliberate decision.

It's much harder than translate, because you have to take into account the wiki, each has their own rules & way they do things. That usually means somebody started something, and everyone copied. 800 projects, you have to check at least 800 cases, assuming there's only one template per project.

The question should be how do we talk to all of these communities, to get a single way of doing things that are so common? There's a simple technical way, but community is hard. We struggle with -- we want these things for us for our own technical purposes, but we need to make it work for them. It needs to help them in some way. People are happy with the way they've been doing work for years.

Maybe get them to allow us to edit their edits? if you're happy with doing the template this way, then just allow me to reformat it on save, the way it's better. This is how we ended up with Michael's team, we do this structuring ourselves. We take the main pages, and his team extracts the data we need. We had to talk to individual template owners on some languages where things were especially weird. But we have to keep up with changes, it drifts over time.

Another thing that comes up, often the templates are copied. Basically, most templates are copied from english or russian, because they've gone around advocating for their templates and offering help. It's almost like a genetic process, there are two families -- the English-derived templates and Russian-derived templates.

For extraction, there's also the question of do you enter wikitext, stored in wikitext, or do you just have the semantic layer, where you can enter see alsos into a structured list? It's a big ask.

When you submit the edit -- it looks like you're trying to add structured data, can I help? It's like Clippy, but maybe it's Jimmy. :)

Funny thing is: if we just make the output look the same and we do the transition work, people would probably be fine. It's only when it visually changes that people care. :)

What needs versioning?

When things move, or there are errors. Everything we're talking about can have vandalism and problems.

For debugging purposes, it would be good to look at the parsed-out wikitext, as the template has been expanded and turned into normal storage form. But it's not deterministic; there's a lack of stored revisions in the parser. The main problems are the templates themselves: you can't do deterministic parses, because the templates are not self-enclosed, so the render is screwed up completely and you can't have structured parsing. There are some exceptions, where people refuse to correct it or just don't know it's a problem. Structured templates would be much easier. Wikitext 2.0!

Maybe that's something for this whole discussion: rather than thinking about structuring individual things, fix the template system first, so that all you're doing is editing wikitext for the template. Most of these items are in some kind of template anyway, except for "see also". Portals are all templates.

Do we have a project group for category structure on Wikipedia? Not really. Categories are so complex and used in so many different ways that we haven't tried to get into it. There's no group on English Wikipedia, no category project. There are category minders, who argue about the structure -- but mostly in specific topic areas. Categories are also used for things that aren't content, like marking articles for deletion. It's so general-purpose that we don't want to change it much; anything we do would break another use case.

The cool group: Star, Subbu, Halfak, pheudx, Ramsey
'''# Where to store? Wikidata, Local Wiki, Somewhere else?'''

Categories

Magnus: Depends on the type of data extracted. E.g. categories should be stored on the local wiki. Maybe some common mapping.

Aaron: So the categories would be stored in a local wiki, but a mapping across wikis would be stored in Wikidata.

Sam: Would it make sense to have both local and global wikibase?

Magnus: I hope this happens for structured commons.

Info boxes

Magnus: Wikidata

Table of contents

Not on wikidata. Maybe stored in the local wiki for querying.

Subbu: Derived data?

Nav Bars:

Magnus: Maybe on the local wiki, but not the content. Could use wikidata.

Subbu: The data could be in wikidata and derived from the information there.

Magnus: I think navboxes are way more subjective.

Aaron: Maybe the local wiki would store the query and Wikidata would store the data.

Coordinates:

Magnus: Monuments in a city can be generated from a list. Wikidata. Except for when there are multiple things listed on the page unless the list is generated from Wikidata.

Subbu: Function in Wikitext to pull things from Wikidata.
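Subbu's suggestion of a function to pull things from Wikidata boils down to walking the entity JSON that the `wbgetentities` API returns. A sketch for the coordinates case (P625 is Wikidata's real "coordinate location" property; the nesting below mirrors the real entity JSON shape, but the `portland` entity is a trimmed-down, hand-made example):

```python
def coordinates_from_entity(entity):
    """Extract (latitude, longitude) from a Wikidata entity's first
    P625 (coordinate location) claim, or None if it has no such claim."""
    try:
        value = entity["claims"]["P625"][0]["mainsnak"]["datavalue"]["value"]
        return value["latitude"], value["longitude"]
    except (KeyError, IndexError):
        return None

# A trimmed-down entity, shaped like wbgetentities output:
portland = {
    "claims": {
        "P625": [{
            "mainsnak": {
                "datavalue": {
                    "value": {"latitude": 45.52, "longitude": -122.68}
                }
            }
        }]
    }
}
print(coordinates_from_entity(portland))  # (45.52, -122.68)
```

A wikitext-side function would wrap this kind of lookup so that articles render the centralized value instead of storing a local copy that can drift out of sync.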

'''# What strategies can we use to extract this data? What difficulties do you anticipate?'''

Magnus: Pitchforks. Technically this is easy.

Subbu: Curation. Why are the pitchforks coming out?

Magnus: If we allow people to edit locally, there will be some other reason. It's about giving up control. "Your project is not German Wikipedia.  Data is here.  Images are here.  Text is here."

Sam: Think about how much more time you would have governing on-wiki content if you don't have to govern the external data.

Magnus: We spent too much time curating our lists. We don't want to move them to Wikidata. "So you want to throw good money after bad money?"

Subbu: Is a single point of failure a concern?

Ramsey: Pitchforks from GLAM communities. OK to point to a Wikidata item. But should the image entities point to a Q-id? Should that information be derived from Wikidata, or should the data be duplicated?

Magnus: The Commons data structure is about the file. The file depicts "Starry Night". Commons is already including information from Wikidata. Could show the information on the Commons page, and that would remove quite a lot of the concerns.

Aaron: I would rather link things to Wikidata rather than filling it in -- as a lazy person.

Magnus: Can create a form with mandatory vs. optional information. Just to streamline the process.

Ramsey: Going back to governance. Wikidata community doesn't want commons creating new Qids that might not meet the notability requirements on Wikidata.

Magnus: Wikidata needs to ??? a bit. If I create an article on a Wikipedia, then it is by default notable in Wikidata. It's just not a text. It's a picture. It's about ''something'', and that should meet notability by default.

Subbu: Biggest concern is primarily a governance issue. Is Wikidata open to all of the other wikis coming in.

Ramsey: Properties too. They have to go to Wikidata to create properties. Commons identifies the properties that they need and want and they have to go to Wikidata to do that.

'''# Which of these need versioning and which do not?'''

Subbu: maybe what we want is derived data.

(We ask Michael)

Magnus: Is it enough for it to be versioned in the text?

Sam: What are clients getting out of having this data versioned?

Categories

Currently versioned in Wikitext.

Should be versioned in structured form.

InfoBox data

Versioned.

Table of contents

Not versioned

Nav Bar