Wikimedia Developer Summit/2017/Mediawiki History


SESSION OVERVIEW


Title: Unconference: MediaWiki history for analytics.

Day & Time: Day 2 11:00

Room: Prince

Phabricator Task Link:

Facilitator(s):

Note-Taker(s): Andrew Otto

Remote Moderator:

Advocate:

SESSION SUMMARY


Marcel:


 * Problem: MW data is scattered in ~900 databases, many tables, not suited for analytics purposes.
 * Is this a new editor? Is this user a bot? There is no historical context (was a user a bot in the past?). "Active editor" is an inherently historical notion, while a user's bot status can change over time.
 * Want to be able to solve these problems with simple queries. Fewer joins are better.

Joseph:
 * We start with all the many databases and tables, and process them to build a single table with all (event based) data.

Q: This is a single table? With just revisions?


 * Every event record is either related to a revision, page or user entity. Different types of events for each entity.
 * Many things can change about a page or a user. For every event, it is interesting to know the state of a value at the time of the event, as well as the latest current value.  E.g. page_title at the time of an old revision, and page_title at current time.
 * This history is difficult to rebuild: page_ids are not stored in the logging table; page_titles are, but they change over time.
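The difficulty above can be sketched as follows. This is a hedged, illustrative example (the real reconstruction is done in Scala/Spark): move-log events record only titles, not page_ids, so earlier titles must be recovered by chaining renames backwards from the current title. The tuple shape and function name are hypothetical, not the actual MediaWiki schema.

```python
# Hypothetical sketch: recovering a page's earlier titles from move-log
# events that record only titles (old_title -> new_title), not page_ids.

def title_history(current_title, moves):
    """Walk rename events backwards from the current title, chaining
    new_title -> old_title to find what the page was called earlier.
    Each move is a (timestamp, old_title, new_title) tuple."""
    remaining = sorted(moves, key=lambda m: m[0], reverse=True)
    history = []
    title = current_title
    while True:
        match = next((m for m in remaining if m[2] == title), None)
        if match is None:
            return history
        history.append((match[0], match[2]))
        remaining.remove(match)  # each move event is consumed once
        title = match[1]

moves = [
    (100, "Foo", "Bar"),
    (200, "Bar", "Baz"),
]
# "Baz" was renamed from "Bar" at t=200, which was renamed from "Foo" at t=100.
print(title_history("Baz", moves))  # [(200, 'Baz'), (100, 'Bar')]
```

The real problem is harder than this sketch: titles can be reused by different pages, so title-based chaining alone is ambiguous.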


 * First imported all databases into Hadoop using Sqoop.
 * Then we process these using Spark (Scala).
 * First rebuild history of pages and users. Reconstruct history of renames over time.
 * User and page history then joined with enhanced revisions info to create this big denormalized event history table.
 * Finally, records can be joined by page_id or by the latest page title.
 * Data is stored in Parquet in Hadoop. Query using Hive or Spark.
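The denormalization step above can be sketched roughly as below. This is an illustrative Python sketch under assumed field names (the actual job is Scala/Spark and the real schema differs): each revision is enriched with both the page state valid at the revision's timestamp and the latest page state.

```python
# Illustrative sketch of denormalization: enrich each revision with the
# page state valid at the revision's time, plus the latest state.
# Field names are hypothetical, not the real Data Lake schema.

def denormalize(revisions, page_states):
    """revisions: list of dicts with rev_id, page_id, timestamp.
    page_states: list of dicts with page_id, title, valid_from, valid_to."""
    events = []
    for rev in revisions:
        history = [s for s in page_states if s["page_id"] == rev["page_id"]]
        # State whose validity interval contains the revision timestamp.
        at_time = next((s for s in history
                        if s["valid_from"] <= rev["timestamp"] < s["valid_to"]), None)
        # Most recent state overall, i.e. the current value.
        latest = max(history, key=lambda s: s["valid_from"], default=None)
        if at_time and latest:
            events.append({
                "rev_id": rev["rev_id"],
                "page_id": rev["page_id"],
                "timestamp": rev["timestamp"],
                "page_title_historical": at_time["title"],
                "page_title_latest": latest["title"],
            })
    return events

states = [
    {"page_id": 1, "title": "Foo", "valid_from": 0, "valid_to": 150},
    {"page_id": 1, "title": "Bar", "valid_from": 150, "valid_to": float("inf")},
]
revs = [
    {"rev_id": 10, "page_id": 1, "timestamp": 100},
    {"rev_id": 11, "page_id": 1, "timestamp": 200},
]
events = denormalize(revs, states)
# rev 10 happened while the page was titled "Foo"; its latest title is "Bar".
```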

Q: Have we considered amending the log table to include ids?
 * Not yet. We talked to people here this week, and folks are excited about that.
 * The logging table has a field called 'log_params' whose values are PHP-serialized, so historically only PHP could parse them. This makes queries complex.
 * We wrote a scala parser for this.
 * We could put this data in log table, but we aren't sure if it would break things.
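To give a flavor of what parsing log_params involves, here is a minimal hedged sketch (the team's actual parser is written in Scala and handles far more cases): PHP serialization encodes scalars like `s:8:"NewTitle";` for strings and `i:42;` for integers.

```python
# Minimal sketch of parsing PHP-serialized scalars like those stored in
# log_params. The real parser also handles nested arrays and other PHP
# types; this only illustrates the idea.

def parse_php_scalar(blob):
    if blob.startswith("i:"):
        # i:<number>;
        return int(blob[2:].rstrip(";"))
    if blob.startswith("s:"):
        # s:<byte_length>:"<value>";
        _, _, rest = blob.partition(":")    # drop the "s" marker
        _, _, quoted = rest.partition(":")  # drop the byte length
        return quoted[1:-2]                 # strip quotes and trailing ;
    return None  # arrays, booleans, etc. not handled in this sketch

print(parse_php_scalar('s:8:"NewTitle";'))  # NewTitle
print(parse_php_scalar("i:42;"))            # 42
```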

Pivot: not really possible.
 * We have this data in Hadoop. We also loaded it into Druid which allows us to use Pivot to play with the data visually.
 * Demo of Pivot with Pageview data.
 * Pivot has dimensions, fields with multiple values, and measures: what you are counting.
 * With MW history, the event entity (page, revision, user) is a dimension. A measure is the 'events' count.
 * How to know which revisions were reverted? You have to compare SHAs of revisions. Doing this in MySQL is impractically slow.
 * This has been processed into the event history, so you can query this easily.
 * Filter on entity = revision, split on "revision is identity revert", and count.
 * Now also have revision_is_productive_edit, meaning it was not reverted within 24 hours.
 * Can combine these to get revisions that are productive, but also reverted.
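The SHA-based revert detection mentioned above can be sketched like this (a hedged illustration with made-up field names; the real computation runs in Scala/Spark over full page histories): a revision is an identity revert if its content SHA1 matches an earlier revision of the same page, i.e. the edit restores identical content.

```python
# Sketch of identity-revert detection: a revision is an identity revert
# if its content SHA1 matches an earlier revision on the same page.
# Field names are illustrative.

def identity_reverts(history):
    """history: list of (rev_id, timestamp, sha1) tuples for one page.
    Returns the set of rev_ids that restore an earlier revision's content."""
    seen_shas = set()
    reverts = set()
    for rev_id, _, sha1 in sorted(history, key=lambda r: r[1]):
        if sha1 in seen_shas:
            reverts.add(rev_id)
        seen_shas.add(sha1)
    return reverts

page = [
    (1, 10, "aaa"),  # original content
    (2, 20, "bbb"),  # vandalism
    (3, 30, "aaa"),  # restores rev 1 -> identity revert
]
print(identity_reverts(page))  # {3}
```

The "productive edit" flag would be derived along the same lines, by checking whether any such revert lands within 24 hours of the revision.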


 * Expect a huge change in Analytics capacity.

Q: How is that Pivot demo so fast?
 * Pivot runs on Druid, which allows fast querying of time-oriented information. It only works with time-series data. Columnar storage, precomputation, sharded and distributed. This data is about 1TB in Druid, stored on 3 machines.


 * Data is built from production DBs, so it is not sanitized and we can't make it public.
 * Jaime had a presentation yesterday about new LabsDB servers with pre-sanitized data. Hopefully we can build this data from the pre-sanitized stuff, and then release it publicly.


 * Experimenting with ClickHouse (instead of Druid) for fast SQL queries. Druid uses a custom JSON-based query format; it would be nice to query with SQL instead. We want a fast SQL query engine.
 * In the meantime we have Hive and Spark. Not as fast as Druid, but it works.


 * We don't have content information, but we'd like to include it (page links, etc.)

Discussion:

Can we fix the logging table? Fix the current code, and also augment it with this new computed history?

Q: AndyRussG: work has been done on visualization tools: Limn, Pivot, Dashiki. Is there any thought about convergence? Maybe using the graph extension on wiki? Could there be a queryable version of this that the graph tool could access?

Dan: We'd like to put a query API on top of the data for what Wikistats needs. Then other tools could use this too. This is precanned data though; for slicing and dicing, use Pivot. We'd like to make this queryable, perhaps via something like Quarry or notebooks.

Q: DerkJan: How are you going to deal with changes in stats definitions?

Right now we just recompute everything all the time. Problems we fix will go into new numbers. If we significantly change a definition, we will make a blog post. We do log changes.

Post Session Docs:
 * pivot.wikimedia.org (only for people internal to the Foundation; LDAP needed for access)
 * https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake (and related pages) for documentation on the project
 * https://gerrit.wikimedia.org/r/#/c/325312/ The actual code to compute history reconstruction and denormalization