Flow/Architecture

From MediaWiki.org

DISCLAIMER: All decisions, especially with regard to storage layout, are fluid. We fully expect to delete the data created by the prototype on extensiondb1 before deploying the final solution.

Big ideas

  • Flow is about workflow management. A "discussion" is a type of workflow – a very simple one.
  • Flow is cross-wiki. Eventually a discussion can take place across wikis and appear on pages on different wikis.

Templates are often used to implement ad-hoc workflows

In many cases local wikis use templates to encourage workflow within them. The goal for Flow's workflow models is to be dynamic enough to be managed by local wiki administrators to cover use cases currently handled by workflow suggestions inside templates. In other words, Flow will implement a whole bunch of Lego pieces, and the individual communities will stick them together into the various workflows they need.

Cross-wiki database

Flow metadata will be vertically partitioned away from core MediaWiki into a single database shared by all wikis. In the future, if necessary, it may be sharded across multiple masters. While the Flow MVP ("minimum viable product", our initial release) will store comments from all wikis in a single database, it will not display a piece of data on any wiki other than the one it was created on. There are checks in a variety of places to ensure cross-wiki data is not displayed until we have a chance to focus on the implications (user IDs, page IDs, configuration differences of the wiki and its extensions, and many things we don't even know yet).

Flow revisions are kept in ExternalStore, probably configured in additional memcached instances. See #Deployment section.

Data layer

Also see Flow/Database

ID generation

Flow uses 88-bit timestamped identifiers that are unique across machines, in order to (eventually) permit a Flow board that is a "feed" of items from multiple wikis. It stores them in the database in binary(11) columns. Flow has a UUID model class to deal with IDs in either binary or alphadecimal (a-z0-9) form as necessary; internally this generates IDs using the UIDGenerator class in MediaWiki core. Because these identifiers are timestamped, sorting by UUID gives a time-based sort. Additionally, Flow does not need to store timestamps directly in most places; it just uses the timestamp within the UUID.
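
To illustrate why a timestamp-prefixed identifier sorts by time and fits in 11 bytes, here is a minimal Python sketch. It is not Flow's UUID class or the UIDGenerator implementation: the bit layout (a 46-bit millisecond timestamp followed by random bits) and all function names are assumptions made for the example.

```python
import os
import time

ID_BITS = 88
TIMESTAMP_BITS = 46                    # assumed width of the millisecond timestamp
SUFFIX_BITS = ID_BITS - TIMESTAMP_BITS  # remaining bits; random here, for illustration only
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def new_timestamped_id(now_ms=None):
    """Return an 88-bit integer whose high bits hold a millisecond timestamp."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    suffix = int.from_bytes(os.urandom(6), "big") & ((1 << SUFFIX_BITS) - 1)
    return (now_ms << SUFFIX_BITS) | suffix


def to_binary(uid):
    """Pack the id into 11 bytes, the size of a binary(11) column."""
    return uid.to_bytes(ID_BITS // 8, "big")


def to_alnum(uid):
    """Encode the id using the characters a-z0-9 (base 36)."""
    out = ""
    while uid:
        uid, rem = divmod(uid, 36)
        out = DIGITS[rem] + out
    return out or "0"


def timestamp_ms(uid):
    """Recover the creation time from the id, so no separate timestamp column is needed."""
    return uid >> SUFFIX_BITS


older = new_timestamped_id(1_400_000_000_000)
newer = new_timestamped_id(1_400_000_000_001)
```

Because the timestamp occupies the most significant bits, comparing the integers (or the big-endian binary forms) orders the ids by creation time.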

Workflow

A Workflow is a globally uniquely identified instance of a workflow. A discussion topic is a Workflow instance (for example, /wiki/User_talk:Maryana?workflow=0506af29cfae6e5b09a3fa163e68c4ac). A request for deletion of page Abc is also a Workflow instance, and so on.

  • ID - A unique identifier of this workflow
  • wikiId - The wiki id of the owning Title
  • pageId - The article id of the owning Title
  • namespace - The numeric namespace of the owning Title
  • titleText - The db key of the owning Title
  • userId - The user id of the initial workflow creator
  • userText - A user name of the initial workflow creator
  • definitionId - Id of the flow Definition this workflow is a type of

To do (potentially)

  • contentLanguage
    • Workflows may occur in different languages; this can act as a filter for workflows a user can understand.

TopicListEntry

The topic list is an N to M relation between workflows. The initial use case is a parent discussion workflow that is related to many topic workflows within the discussion. These topics can then be included in other discussions. (We are considering renaming it and adding a type field, to allow generic N to M relations if use cases arise.)

  • topicListId - UID of the parent workflow
  • topicId - UID of the child workflow

AbstractRevision

  • revId - UIDGenerator::newTimestampedUID128()
  • userId - Id of the user that created this (from the wiki id of the owning workflow, identified by the concrete revisions)
  • userText - The user name that created this
  • parentId - A revision id that this revision is based on
  • changeType - A string identifying the type of action that created this revision (not user generated)
  • type - A string identifying the concrete revision type
  • content - The content or, if ExternalStore is enabled, a URL from ES
  • flags - Array (comma separated in storage) of string flags that apply specifically to the content. Examples include utf-8, html, etc.
  • modState - String identifier of the revision's current moderation state
  • modUserId - Denormalized Id of the user that most recently moderated this revision (from the wiki id of the owning workflow)
  • modUserText - Denormalized user name of the moderating user
  • modTimestamp - Denormalized wfTimestampNow() created when moderation most recently occurred
  • lastEditId - Denormalized UID of the revision that is the last content edit
  • lastEditUserId - Denormalized id of the user (from the wiki id of the owning workflow) that performed the last content edit
  • lastEditUserText - Denormalized name of the user that performed the last content edit

Notes:

  • The canonical source of who moderated what, and when, is the changeType of all related revisions, together with who created the revisions that have moderation-related change types. The denormalized moderation fields record the most recent moderation action performed against a series of revisions, without changing the canonical information those revisions store.
  • The canonical source of who edited the content, and when, is to check which revisions' content does not match that of their parent. This is denormalized so the data model can expose the most recent content editor to the UI without performing extra lookups.
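
The second note can be sketched in Python as follows. This is an illustration only, not Flow's code: the `Rev` class, its integer ids, and the change type strings are hypothetical stand-ins for the AbstractRevision fields listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rev:
    rev_id: int               # stand-in for the timestamped revId
    parent_id: Optional[int]  # parentId; None for the first revision
    user_text: str
    change_type: str          # illustrative values, not a definitive list
    content: str


def last_content_edit(revisions):
    """Return the newest revision whose content differs from its parent's content."""
    by_id = {r.rev_id: r for r in revisions}
    last = None
    # Timestamped ids sort by creation time, so sorting by id walks oldest-first.
    for rev in sorted(revisions, key=lambda r: r.rev_id):
        parent = by_id.get(rev.parent_id)
        if parent is None or parent.content != rev.content:
            last = rev
    return last


history = [
    Rev(1, None, "Alice", "new-post", "Hello"),
    Rev(2, 1, "Bob", "edit-post", "Hello world"),
    Rev(3, 2, "Carol", "hide-post", "Hello world"),  # moderation only, content unchanged
]
editor = last_content_edit(history)
```

Here the moderation by Carol creates a newer revision, but the last content edit is still Bob's. Storing lastEditId, lastEditUserId, and lastEditUserText on each revision avoids repeating this walk on every page view.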

HeaderRevision

A very simple piece of revisionable content displayed at the top of every discussion, with no custom aspects beyond AbstractRevision.

  • workflowId - The workflow that this header belongs to

PostRevision

A revisionable piece of content related to other posts in a 1 to N parent/child relationship.

While not explicitly defined in each post, the owning workflow has the same UID as the postId of the root post, which is the only post in a tree of posts with a null replyToId. This root post represents the title of the topic. This is cached elsewhere.

  • replyToId - UID of the post's parent
  • postId - UID of this post
  • revId - UID of the related AbstractRevision
  • origCreateTime - Denormalized creation time (extracted from UID) of the first revision of this post
  • origUserId - Denormalized user id (from the wiki id of the owning workflow) that created the first revision of this post
  • origUserText - Denormalized user name that created the first revision of this post
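
As a rough illustration of the parent/child relationship, the following Python sketch rebuilds a topic's tree from (postId, replyToId) pairs and identifies the root post. The function name and the short string ids are hypothetical; real ids are the timestamped UIDs described earlier.

```python
from collections import defaultdict


def build_tree(posts):
    """posts: iterable of (post_id, reply_to_id) pairs.

    Returns (root_id, children), where children maps a post id to its replies.
    """
    children = defaultdict(list)
    roots = []
    for post_id, reply_to_id in posts:
        if reply_to_id is None:
            roots.append(post_id)  # the topic title post
        else:
            children[reply_to_id].append(post_id)
    if len(roots) != 1:
        raise ValueError("a topic must have exactly one post with a null replyToId")
    return roots[0], dict(children)


posts = [("t1", None), ("p1", "t1"), ("p2", "t1"), ("p3", "p1")]
root_id, children = build_tree(posts)
```

In this sketch `root_id` plays the role of the UID that the owning workflow shares with the root post.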


Revisions

Posts, headers, and possibly a wide variety of other content within Flow need to be revisioned, much like normal wiki pages.

  • Flow revisions have a different use case than wiki Article revisions
    • Articles have hundreds to (tens of?) thousands of revisions
    • Flow currently stores a revision per post; most posts will only ever have a single revision. Posts that go through a few edit and moderation cycles will still only have tens of revisions.
    • Other pieces of Flow, such as discussion headers, will have perhaps dozens or hundreds (thousands wouldn't be unreasonable) of revisions, but still not nearly as many as an article.
  • Flow revisions need different metadata depending on where the piece is used (header, post, etc).
  • Content is stored in External Store (same as article revisions in core MW). You can configure the ExternalStore servers that Flow uses independently of core.
    • The actual content when rendering pages will typically come from memcache, so we are less worried about needing multiple queries or joins to get revision + content.
    • External Store has been extended to batch requests for multiple pieces of content such that only 1 query is issued per server.
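
The batching idea can be sketched as grouping content URLs by server before fetching. This Python example is an assumption-laden illustration, not the ExternalStore code: the `DB://cluster/id` URL shape and the function name are assumed for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_server(urls):
    """Group content URLs by host so that each server is queried only once."""
    batches = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        # netloc is the server or cluster name; the path holds the content id.
        batches[parsed.netloc].append(parsed.path.lstrip("/"))
    return dict(batches)


urls = ["DB://cluster1/10", "DB://cluster2/7", "DB://cluster1/11"]
batches = group_by_server(urls)
```

Three pieces of content on two servers then need two queries rather than three.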

Other thoughts

  • Content storage is a separate concern from the revision metadata. The revisions should continue to point somewhere else (text table, external store; the revisions don't care) for the actual storage.
    • Currently we fetch the content and store it with the metadata in memcache. This is primarily fetched via multiGet, hopefully negating much of the performance penalty related to fetching the content separately.
    • We would like to transition to the content storage API that gwicke is working on for Parsoid when it is ready.
    • See /Discussion_Storage#Content_Storage (outdated)

Abuse Filter, SpamBlacklist, SpamRegex, etc.

There are a variety of spam prevention methods within MediaWiki, both automated and human powered. Flow currently only handles the automated spam prevention methods.

Within Flow the `Flow\SpamFilter\Controller` class is used to apply the automated spam prevention techniques. Before any new revision is written out, it is passed to the Controller, which responds with a `Status` object. Small wrappers for each of the core spam prevention implementations, such as AbuseFilter and SpamBlacklist, satisfy the `Flow\SpamFilter\SpamFilter` interface. The Controller queries these implementations individually; all SpamFilter implementations must agree that the content is safe for it to pass.
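
The "all filters must agree" pattern can be sketched in Python as below. This is not the PHP implementation: the `Status` and `Controller` classes here are simplified stand-ins, and both filter functions and their rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Status:
    ok: bool
    message: str = ""


class Controller:
    """Runs content through every filter; one objection rejects the revision."""

    def __init__(self, filters):
        self.filters = list(filters)

    def validate(self, content):
        for spam_filter in self.filters:
            status = spam_filter(content)
            if not status.ok:
                return status  # stop at the first filter that objects
        return Status(True)


def blacklist_filter(content):
    # Hypothetical rule standing in for a SpamBlacklist wrapper.
    if "spam.example" in content:
        return Status(False, "blacklisted link")
    return Status(True)


def empty_filter(content):
    # Hypothetical rule standing in for another filter wrapper.
    if not content.strip():
        return Status(False, "empty content")
    return Status(True)


controller = Controller([blacklist_filter, empty_filter])
accepted = controller.validate("A normal reply")
rejected = controller.validate("Buy now at spam.example")
```

Returning at the first failure is one possible design; a controller could equally collect every failure before responding.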

Front-end architecture

Flow PHP code sends complete static HTML for the initial set of 10 topics on a Flow board, then does JS progressive enhancement to turn many actions (reply, add new topic, paginate 10 more topics, etc.) into in-page API calls. The API calls return more HTML that is inserted in the page and enhanced. This gives us a common code base for no-JavaScript and JS; the tradeoff is the cost of sending verbose HTML containing links and buttons for no-JavaScript rather than minimal JSON data. ErikB has spent some time profiling Flow's PHP code that generates this HTML; obviously we're very interested in the templating RFC.

Remembering what a user has already seen (future)

In addition to individual objects being subscribable, to generate feeds matching the current UI prototype we also need to know which Workflows the user has already seen.

  • Remembering a boolean true seen status for every model instance multiplied by all editors is a ton of data and seems sub-optimal.
  • Could remember the last date a user viewed items from that Workflow within the subscription and only display items newer than that
    • Each topic within a discussion is its own Workflow, so the memory is more granular than just the main page, but is it enough?

Suppressed revisions

Search (future)

Handled client-side in JavaScript, in the backend, or both (probably both)?

  • Elasticsearch is on its way. We will integrate with it where possible.
  • We think Elasticsearch can likely handle our tagging searches as well.
  • What elements does the UI call for in search results?

Tags (future)

Individual discussion topics must be taggable. How should the tagging implementation work?

  • Tags can be public or private
  • Should public and private tags be stored in the same table?
    • Conceptually it might be simpler to store them separately.
    • Storing them separately also provides a much stronger guarantee of not accidentally displaying private tags.
  • We have not delved into Elasticsearch yet, but it seems like it would be a strong contender for simplifying tagging.

Mapping URLs to Flow workflows

Performance considerations

How much data will Flow need to store? For estimation purposes, Wikipedia Statistics shows there are approximately 26M articles across all wikis. Not all of these will have talk pages, but many will. Assuming they all have talk pages ranging from just a post or two to a couple of thousand posts on the largest, we should expect that a lower bound of perhaps 100M individual replies will need to be stored in perhaps 20M separate discussion graphs (not today, or even within the first year, but an approximation of size within a few years). If each reply consumes 1 KB of space, that puts a lower bound of at least 100 GB on post content before we even get into metadata, indexes, etc.

And that is just the discussions; Flow will eventually handle many more workflows than just discussions.

To help get an idea of the space required I applied the EchoDiscussionParser to one of the enwiki database dumps ( enwiki-latest-pages-meta-current10.xml-p000925001p001325000 ). Within this file it detected:

  • 43952 pages in either Talk: or *_Talk: namespaces
  • 211904 individual section headers
  • 514916 user signatures

That works out to an average of:

  • 5 sections per talk page
  • 2.5 signatures per section
  • 12 signatures per talk page
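
For reference, the arithmetic behind these rounded averages, and behind the storage lower bound given earlier, can be reproduced with a few lines of Python. The counts come from the dump analysis above; the 1,000 bytes per reply figure assumes decimal kilobytes.

```python
pages = 43952
sections = 211904
signatures = 514916

sections_per_page = sections / pages            # a little under 5
signatures_per_section = signatures / sections  # a little under 2.5
signatures_per_page = signatures / pages        # a little under 12

# Storage lower bound: 100M replies at roughly 1 KB each.
replies = 100_000_000
bytes_per_reply = 1_000
total_gb = replies * bytes_per_reply / 1e9
```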

These pages were of course built up over time, but give us a general idea of the size of the problem we need to handle.

It may be worthwhile to re-run this code and split the stats between article talk pages and user talk pages. There will be several orders of magnitude more article talk pages than user talk pages, so if they have different characteristics that may be useful information.

Caching possibilities

See /Memcache

Interactions with other systems

When Flow is enabled on a page (usually a Talk page), the page becomes a Flow board. MediaWiki creates a new revision of the page with a different contentmodel property ('flow-board' instead of 'wikitext').

A Flow board is different from a wiki page: it stores its content, revisions, and metadata in an external cross-wiki Flow database. If you query the MediaWiki API for a Flow board's content, or use Special:Export on a Flow board, you will see only a pointer to a UUID in this external database.

  • On WMF wikis, Flow stores posts as HTML, using the output from Parsoid. When you edit a post, you see the original wikitext stored by Parsoid in HTML attributes.

Flow implements many expected interactions with other parts of MediaWiki.

  • Flow generates Echo notifications
  • Flow adds flow-{edit-post,hide,delete,suppress} rights to $wgAvailableRights so that they are available for global groups/staff rights
  • Rendering of content:
    • Fire wikipage.content hook when new posts are loaded.
  • Filtering of edits:
    • Flow content is run through AbuseFilter, SpamBlacklist, and SpamRegex
    • Images in Flow content should be checked against wfIsBadImage() (bug 61772).
  • Logging of user actions:
    • Flow inserts rows in RecentChanges, and formats them for display
    • Flow edits don't yet show up in the "get edits" function of CheckUser (bug 60275).
  • Flow inserts rows in Special:Contributions, and formats them for display
    • User actions in Flow appear in Special:Contributions
  • Flow creates entries in deletion and suppression logs, and formats them for display
  • Flow creates entries in the IRC feed
  • On a Flow board, [View history] is replaced with its own implementation, as is viewing differences. Users can see history of an individual topic as well.

Link handling

Flow detects references in items, stores them in new flow_wiki_ref and flow_ext_ref tables, and appends to the standard MediaWiki link tables. So Special:WhatLinksHere works for pages, links between Flow boards, images, and templates; and the "File usage" section of a File: page shows the Flow boards using the image.

Gerrit change 115860 adds a WhatLinksHereProps hook in MW 1.24 so that Flow can add "from the _header_; from a _post_" links to the line in WhatLinksHere.

Statistics

Adding a new topic increments the page and edit counts in Special:Statistics, since it creates a new page in the Topic namespace. But posts replying to a topic and edits to Flow board header/topic titles/posts do not, since they are not revisions in the regular MediaWiki page tables.

URL actions

You can add ?action=something to Flow URLs, although many regular page actions don't apply (delete, edit, veaction=edit, etc.). $wgFlowCoreActionWhitelist lists the actions that Flow doesn't override, including 'protect', 'watch', etc.

Flow boards accept actions such as view (default), and new Flow actions board-history, edit-header, etc.

Many more actions apply to individual topics and posts, identified by workflow=UUID. If the browser has JavaScript enabled, many GET requests with action URLs are replaced by in-page /API calls.

Deployment

  • Our test server is ee-flow.wmflabs.org
  • Flow is deployed to the beta cluster and QA runs browser tests on it
  • In December 2012 we launched Flow on a handful of talk pages on mediawiki.org (bug 56506).
  • Flow/Rollout records the pages on which Flow was enabled.

In January we deploy on two WikiProject talk pages on enwiki (bug 60178).

Backing out/the kill switch

Also see Flow/Admin

Flow only has an effect if it is enabled on a page. That is controlled by $wgFlowOccupyPages, set per-wiki by wmgFlowOccupyPages in wmf-config/InitialiseSettings.php. Comment this out and those pages revert to regular wiki pages; this should have the same effect as disabling the extension altogether.

If Flow does cause problems outside of the pages on which it's specifically enabled, then it is safe to stop loading the Flow extension in wmf-config/CommonSettings.php. This may leave some oddly formatted lines in RecentChanges (only on Flow-specific lines), but is safe and non-destructive to data.

An emergency patch to make Flow read-only by disabling 'edit-title', 'reply', and 'edit-post' and failing validation of new topics is in git #a3bf4c40.

Flushing memcache

Flow heavily caches the elements of Flow boards. If content appears incorrectly in development, flushing memcached will often fix this.

  • On a Labs instance, run sudo /etc/init.d/memcached restart.
  • In production, you can bump $wgFlowCacheVersion in wmf-config/CommonSettings.php and sync-file that to invalidate the entire cache. Note: because the Flow database is cross-wiki, this applies to all wikis running all versions of the Flow code. As of May 2014 cross-wiki sharing of Flow topics is not really implemented, so it may be possible to have different releases (and thus different wikis) running different cache versions.
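
Why bumping a version number invalidates the whole cache can be shown with a small Python sketch. The key layout and the `cache_key` helper are hypothetical, and a plain dict stands in for memcached; this only illustrates the general technique of versioned cache keys.

```python
def cache_key(version, *parts):
    """Build a cache key that embeds the cache version (hypothetical layout)."""
    return ":".join(["flow", str(version), *map(str, parts)])


cache = {}  # stand-in for memcached

# Written while the cache version is 1.
cache[cache_key(1, "workflow", "abc")] = "rendered topic"

hit = cache.get(cache_key(1, "workflow", "abc"))   # same version: found
miss = cache.get(cache_key(2, "workflow", "abc"))  # bumped version: not found
```

Old entries are never deleted explicitly; they simply stop being read and eventually expire or are evicted.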

Reference Material