Wikimedia Technical Conference/2018/Session notes/Architecting Core: concepts

From mediawiki.org

Theme: Architecting our code for change and sustainability

Type: Technical Challenges

Leader(s): Tim

Facilitator: Greg

Scribe: Irene

Description: In this session we will discuss what concepts and operations should be modeled in MediaWiki - what are the nouns and verbs that MediaWiki understands? This will help us better define components within the code base, interfaces for extensions, as well as APIs for interacting with clients.

Some concepts are obvious like “pages” and “edits”, but some concepts which are currently emulated by extensions may be helpful to introduce into MediaWiki proper, such as “workflows”, “drafts”, or “assessments”. Also, refining our definition and understanding of established concepts could be helpful, for instance to establish whether any curatable object on the wiki should be considered a “page”. -- phab:T206084

Attendees: Tim + Greg, Michael Holloway, Halfak, Raz, Fisch, Corey, Subbu, Zackariah, Jon Katz, Ramsay, Dmitry, Antoine, Cheol, daren, adam baso

Questions to answer during this session

Question Significance:

Why is this question important? What is blocked by it remaining unanswered?

Which concepts are essential to MediaWiki functionality? Establish a shared understanding of what the “core” of MediaWiki is about, and what functionality the “platform” should provide.
Which additional concepts are widely used (via gadgets, extensions, bots, etc.), but not explicitly modeled by MediaWiki? Which of these should be modeled in MediaWiki? This allows us to develop a plan to remove technical debt, inconsistencies and overhead caused by the need to somehow glue concepts into MediaWiki which it doesn’t support. Paving the cow paths reduces friction; replacing hacks with well-defined concepts makes the code more maintainable.
Conversely, what functionality currently in core should be factored out into an extension? What extension points would be necessary to do this? This ties this session into the discussions about modularization and extension interfaces.

Moving code for optional functionality into extensions makes core easier to maintain.

What are the criteria for deciding whether a concept should be modeled in core, and, if it is supported, to what extent it is fleshed out or left as a mere extension point? Having a decision matrix template for this question will allow us to make decisions about what concepts should be in core more quickly, and with more confidence.

Detailed notes

See also comment-annotations in the original gdoc.

  • Pondering what a concept is - what we mean is more than the plain English meaning. A term that is broader than an extension interface or class, and exists at a higher level of abstraction. A concept is represented not by a specific point in the code, but conveyed by multiple parts of it
  • Wants to focus on missing or incomplete concepts about the core specifically
  • A “page” is a concept - what exactly is a page?
    • A photo is a page?
    • Pages have at least revisions
    • Currently have titles, but should they? Titles are a bit awkward
    • In Wikidata, titles are the item IDs
  • Q from Corey - basic reason why a title exists is as an ID? Yes. Daniel proposed that a page and a title should be distinct
  • Actions are concepts - deleting and un-deleting. Deletion of a page and of a file are the same user[?]
    • At the moment deletion for us is an archive
  • “Permanently delete” extension which simply provides a different definition of delete
  • Looking at extensions: they use hooks, or have a proliferation of hooks that are implemented in an awkward way, because of a lack of [?] concepts
  • Flagged Revisions hooks in to introduce the idea of a public timestamp
    • A whole lot of core patches that introduced more parameters more hooks all over the place, all to introduce the concept of a publication timestamp
  • A page is a set of revisions in core
  • CentralAuth is another good example: it hooks into the User object and overrides a bunch of different things
    • We have an auth plug-in, but we are missing most of what CentralAuth actually means.
    • Need a concept of a user store
    • It hooks into the user getpassword feature, so the password comes from CentralAuth, not the local database; a missing concept there that we need
    • Has to hook some method in a 4,000 line class; missing some concept there
  • Don’t want to get too far into how hooks will work, because that’s the next session
  • Exercise: write down concepts that are missing or incomplete in core on post-its, specifically considering whether they are existing/tech debt, or future/JIT
    • Preferably future as in planned projects, not hypothetical future
  • Clarification from Corey - there are concepts that exist outside of core; moving them in allows us to change the interface so it’s not so hard to manage, and we can simplify
    • A: FlaggedRevs has tech debt because there is no concept in core of a public timestamp
  • Q from Raz - can we have an example of two/three concepts?
    • A: pages, editing; an action can be a concept: viewing, deleting. File repository is a concept that is also a class hierarchy, and a concept that works well in core
    • Raz: A/B testing library? Is that a concept?
    • Tim: Testing library is a concept; once you add specifics like PHP or..? then it becomes too narrow
    • Corey: Pages are the main concept, so curation is hung off pages; but for Wikidata, pages are built on top of an extraction of data, so the curatable thing isn’t the page; rather, the page sits on top of the curation
    • Jon: Are [?token] a concept? JonKatz doesn’t know what’s important
    • Tim: Yes, tokens are in core, but not properly extracted. Words within a revision - a parser token? No, not in core
    • Currently no concept of a sentence or paragraph
  • Corey: Wikipedia doesn’t know it’s Wikipedia (eg: there are concepts that make up an “encyclopedia” that are not in core)
  • Lots of unique ideas, not a lot of duplicates
  • Core Concepts as identified in the session:
    • Existing/tech debt:
      • Conversational elements
        • Topic, thread, response
      • Revert
      • Subpage handle
      • View, model, controller
      • Parser?
      • Content model
      • Output model
      • Content handler
      • Versioned content (abstracted from storage)
        • Has versions; changes show up in recent changes and can be reverted, appended, etc
      • Lists
        • Clarification: things like watchlists, this could be generalized, list of pages or lists of users
      • Reference
        • Eg: citations
      • Referring to files by name is problematic (eg: file renames on Commons)
        • Missing concept: a File ID
      • Category
      • First class objects
        • Generalized category
    • Future/JIT:
      • Sibling MediaWiki (a concept within a wiki farm)
      • Generics
        • Raz: Doesn’t exist yet in PHP (or is maybe coming in the next update?). Want to operate on a method on a type, don’t care which type, but want to maintain type safety; need the support in core so we can use it everywhere
      • Curatable generic object
        • Entity - or “curatable entity?” this is causing contention
        • Any noun in the wiki. A page, user, log item, etc. This is useful for referring eg: user’s page or etc
        • What ties these together?
        • If we have a metadata store, an entity is what that refers to, shows up as log item or page, only nouns. Whereas we could have curatable content about a verb
      • Username vs real names
        • Usernames: dense and not easy to understand, make a preference for real name
      • Break pages into concepts, properties, attributes, observations
        • Each with revision history and API
        • Subpage elements?
        • Daren: can currently only pull the whole page, instead would want to parse it out
        • [I can’t hear Daren]
      • A/B testing lib
      • Concept (or wikidata entity)
      • Quorum backed relation
      • Customizable segment
        • Has to do with indicating that certain portions of content could be customizable based on user preferences or customization
        • Thumbnail size? Things like that? Perhaps
        • Fragments of page able to be customized
        • “I like two columns but you might like only one”
      • Content snip
        • Generalization of text extracts
        • start/end range of text
        • Text range/ content range
      • Authorship
        • “Article history is the authorship”, not usable by anyone, people usually say “Wikipedia wrote it”
      • Content Gap
        • Closer to semantic MW
        • The notion that we have to achieve these things to model a particular thing
        • Q from corey: page issue? Missing citation? A: yes could be a thing potentially, if something was actively repped by a topic, missing something that is essential
        • Aaron: like redlinks
        • Gender, diversity, etc.: in Wikidata you can currently say “hey, this is missing”; not a thing in core
      • Materializable Block
        • Something lower level than a page could change somewhere else, as it relates to presentation
      • Fact (triplet or other)
        • Wikidata has a concept, core does not
      • Site map
        • Currently have an extension for sites, but at the moment we just have a big unordered list of sites
        • Or, a hierarchical list of links to navigate? Yes, if I want to do a documentation portal, for example
        • Confluence has a page hierarchy as their default way of organizing the site
      • External wiki federation
      • Admin tools
        • The concept from the 3rd party session of eg an “Admin panel” for managing a wiki
      • Workflow
  • Next step: figuring out which is especially interesting and which should we narrow down to focus on
  • Jon: Discussion elements are really [important/missing?], given the fact that we don’t have that
  • Corey: Does this prevent us from building what we need, or?
  • J: Yes, comes down to lack of communication that makes this harder
  • T: How do you see this happening in core? Should core be providing a stub / extendable concept for Flow to implement?
  • J: I can’t speak to Flow vs extension; it should just be a first class citizen, and its absence is missed
  • A: Part of the reason Flow failed is that it’s so difficult to integrate; can’t suppress the Flow for the period it’s active. It would make a future Flow easier -> bit of an existential threat because DBs are going to explode in the next few years; pretty urgent
  • T: a few have been challenging that ^. On one hand we want curatable content; we have a page, which is curatable content (CC). In Flow they didn’t use that page concept, which then missed out on a bunch of features like suppression
  • T: Roan regretted that decision, a big mistake in Flow, to not make topics be pages; during the JADE discussion: let’s not make the same mistake again, let’s make JADE curatable via pages
  • T: Concern from DBAs: what will happen if we have 10 mil of these, 100 mil? Can we have 100 mil pages?
  • T: Do you want to have a separate concept of a JADE that is split out solely for DB architecture, or do we want to scale the DB architecture to have conceptual unity?
  • (halfak is nodding)
  • Tim: Concepts are expensive; two instead of one costs us in terms of structure of code, you can’t share code between them properly, plus all the other problems with Flow, not [sharing?] content of a curatable page. Deal with database issues in a way that doesn’t require us to split concepts; scale up the concepts instead
  • A: I think I know… supposed that we want to apply JADE assessments at the level of checking the truthiness of a sentence: is that saying that the general concept of a page could be applied to a sentence, so maybe that is a revision and curated?
  • T: yeah, it comes down to the fact that pages are two things right now: curatable, and titled (discoverable because they have titles). That’s the problem with 100 mil pages: you can’t have a special list of all these pages, and problems might come from this. Currently pages are indexed in many different ways; if you have 100 mil, that doesn’t make sense. The curatable part of the page is relatively scalable, a few bytes per row for a page
  • Aaron: as I understand it, revision is a problem from a DBA’s perspective, but
  • T: we’ve already vertically split the revision table, so that it’s relatively efficient. We’ve talked this week about sharding/splitting horizontally. Used to be very bulky. Now it’s just IDs linking to other things; still a bit bulky, 14 bytes when it could be five
  • C: are you saying we don’t want a separate thing, and the page and title are split?
  • T: at some point it may be necessary to have a curatable object that is not discoverable; is saying that the current problem with many revisions is a necessary cost of versioning, there is not another way to do that; currently talking about splitting horizontally soon
  • A: that’s the coolest thing I’ve heard in a long time
  • T: other concepts that people want to discuss/need clarification?
  • A: question applying to lists: are lists curatable? And should they be? Or do they belong to a user and thus aren’t curatable, just time stamped
  • Dmitry: yes, well, in the most general case a list is a list; it can be public/private, watched/not watched, have an expiration
  • C: is a disambiguation page a version of a list?
  • D: don’t know, there’s ambiguation; curatable, maybe?
  • T: watchlists are not versioned at the moment; if someone deletes it all, there’s no way to get it back. If a watchlist is a kind of list, then lists are not versioned
  • C: need for both, noncuratable and curatable, and also the page lists and documentation
  • Adam: for work backlogs, the resurrection of Gather, and private reading and watch lists
  • Aaron: was Gather a page or a curatable thing? Casualty of curation? (answer: no)
  • Aaron: Maybe not necessary for watchlists?
  • ACTION: Tim was going to be responsible for having revisions scale
  • Concepts are really behind, so we need to update that, esp with lists [who owns that]
  • IMPORTANT: List as a concept
  • IMPORTANT: conversational elements/discussion
  • Proposed item: what thing should core have in order to support what people want to do with discussion? What is the core/not core split there?
  • [didn’t catch subbu’s]
  • Danny is planning to hold a consultation
  • Jk: the thing separate from curatable item is the kind of edit; this edit has an attribute, do we have a way to do this?
  • Aaron: relationships between curatable things, which is what I was getting at with entities; to have a relationship between two different things you need the concept of the entities
  • Corey: also mixing this with tokens like paragraph or sentence: if there is a reply or discussion attached JUST to a sentence, we need to be able to link those things
  • T: the existing change tags table is polymorphic: rev ID and log ID, whatever thing you’re attaching a change tag to; there are like 4 different things, and they’re nulled if they’re not relevant
  • The fact that it’s polymorphic suggests we don’t have thing change [missed that]
  • Subbu: question: to what degree can we model these concepts by a curatable ID? Potentially abstractable concept. A curatable thing that has a content model and a content handler that has transformations… input/output model
  • Tim: yeah, a discussion could fit into that model; discussions can have a content model. Heard from one person that a discussion shouldn’t be versioned or curatable; don’t buy that, not in step with current internet standards, made the right choice with that
  • Subbu: Output model: treat this as content modelling, different rendering or different targets
  • Corey: Business logic and the view logic is split in how you think about it and how you extract it;
  • Corey: concept of getting….[missed, dangit]
  • Aaron: concept of relationship between things; the right way forward is to make a description of what you mean by reference, editing, change, and send that to TechCom. It’s not a specific proposal, it’s definition work - Aaron owns that, writing an RFC on what an entity is and on relationships
  • Tim: yes, reasonable
  • ACTION: Aaron is volunteering to write an RFC to define what an entity is
  • Corey: lack of context, need a larger context for this
  • Aaron: maybe this needs a wikipage somewhere?
  • Jk: present this as a brainstorming session
  • T: don’t have time to discuss them all, glad we had time for a few
  • Did we have people assigned for working on the discussion concepts?
  • Danny’s the guy in product, but will need to pair with someone in Technology
  • Might be Aaron or Roan, who has been weighing in on Flow
  • Adam is here, and maybe Kaldari
  • ACTION: Adam to talk to Kaldari re discussion concepts
  • Subbu: potential missing concept for the future: addressable page fragment; it’s coming up in many different things
  • T: Related to content snip
  • When I was proposing inline discussion, we could have a “source range” and map that to a range in the DOM. DOM ranges are a concept we could use...
  • We need something to fulfill this niche
  • ACTION: Subbu something RFC something re page fragment addressing
  • JK: wants to make sure when we’re designing this that we’re narrowing our focus and designing for just one or two instead of 20
  • ACTION: Corey: a set of concepts, and working through a process (RFCs) for bringing in new concepts
    • Follow up with Tim, Daniel, Corey, a place to discuss the concept of concepts
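
The point made in the notes that pages are currently two things at once (curatable, and discoverable because they have titles) can be illustrated with a small sketch. This is only one possible reading of that discussion, not a proposal from the session and not MediaWiki code: MediaWiki itself is PHP, Python is used here purely for illustration, and every name below (`Revision`, `CuratableObject`, `TitleIndex`) is hypothetical. The assumption is that versioning and suppression live on a curatable object, while titles live in a separate index that an object may or may not be registered in.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Revision:
    rev_id: int
    content: str
    suppressed: bool = False


@dataclass
class CuratableObject:
    # Hypothetical concept: versioned content that can be suppressed,
    # whether or not anyone can find it by title.
    object_id: int
    revisions: List[Revision] = field(default_factory=list)

    def save(self, content: str) -> Revision:
        # Revision ids are per-object here purely to keep the sketch small.
        rev = Revision(rev_id=len(self.revisions) + 1, content=content)
        self.revisions.append(rev)
        return rev

    def suppress(self, rev_id: int) -> None:
        for rev in self.revisions:
            if rev.rev_id == rev_id:
                rev.suppressed = True

    def current(self) -> Optional[Revision]:
        # The latest revision that has not been suppressed, if any.
        visible = [r for r in self.revisions if not r.suppressed]
        return visible[-1] if visible else None


class TitleIndex:
    # Hypothetical discoverability layer: only objects registered here
    # have titles and show up in listings.
    def __init__(self) -> None:
        self._by_title: Dict[str, CuratableObject] = {}

    def register(self, title: str, obj: CuratableObject) -> None:
        if title in self._by_title:
            raise ValueError(f"title already in use: {title}")
        self._by_title[title] = obj

    def lookup(self, title: str) -> Optional[CuratableObject]:
        return self._by_title.get(title)

    def all_titles(self) -> List[str]:
        return sorted(self._by_title)


index = TitleIndex()

# A classic page: curatable and discoverable by title.
article = CuratableObject(object_id=1)
article.save("Article text")
index.register("Main Page", article)

# An assessment-like object: curatable, but never given a title.
assessment = CuratableObject(object_id=2)
assessment.save("Looks accurate")
assessment.save("spam")
assessment.suppress(2)

print(index.all_titles())            # ['Main Page']
print(assessment.current().content)  # Looks accurate
```

In this sketch the untitled object still gets history and suppression, which is the feature the notes say Flow missed out on, while the title index only grows with objects that are meant to be discoverable.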