Wikimedia Developer Summit/2017/Labsdb

Session recording
Session slides

Overview of LabsDBs

  • databases for Tool Labs shared accounts
  • production database replicas

Get access by having a Labs project or being a Tool Labs project member, or use Quarry (<https://quarry.wmflabs.org/>).

  • LabsDB (the old setup) had a number of problems:
    • Old servers with limited storage
    • Data drift compared to the production databases
    • Little documentation, hard to use
    • Occasionally overloaded by some applications
    • No failover; a single point of failure (SPOF)
  • The replica DBs hold almost the same data as the production databases, except:
    • Private data such as user IP addresses (filtered out for privacy reasons)
    • Wikitext content and Flow discussions, which are stored on a different set of servers
    • Data not stored in MySQL (e.g. RESTBase content or jobs)
  • Typical use cases where the replica DBs are not the best fit:
    • Acting on every edit; better to use IRC or RCStream
    • Analysing data that doesn't need to be realtime; better to use the dumps
    • Some actions might be better served by the APIs
  • Things to avoid when using the replica DBs:
    • Persistent connections (due to resource consumption/contention with other users of the shared database service)
      • Idle connections will be automatically shut down
    • Do implement reconnection logic in your application to deal with forced kills or server failover
      • "We can guarantee the service, but not the server": restarts are needed for maintenance and other reasons
    • Poorly optimised queries
      • Make sure to use indexes and minimise the number of queries
      • If necessary, break bigger queries down into smaller ones
      • Do not use MySQL's "LIMIT" for offsets; paginate with an indexed field and a WHERE clause (e.g. WHERE id >= 1 AND id <= 1000)
      • "EXPLAIN ..." queries do not work on the replicas. Use "SHOW PROCESSLIST" + "SHOW EXPLAIN FOR <connection_id>" instead (<http://s.petrunia.net/blog/?p=89>)
      • When the user aborts, kill your query and close the connection
      • Use MariaDB's execution-limit features, e.g. SELECT ... LIMIT ... ROWS EXAMINED rows_limit;
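The reconnection advice above can be sketched as a small retry wrapper. This is illustrative only: the `connect()` callable, the `TransientDBError` class, and the backoff numbers are stand-ins rather than a real Tool Labs API; a real client would catch its own driver's "server has gone away" / killed-connection errors (e.g. pymysql's `OperationalError`).

```python
# Minimal reconnect-and-retry sketch for a shared replica service where
# connections may be killed at any time for maintenance or failover.
import time


class TransientDBError(Exception):
    """Stand-in for a driver's killed-connection / 'server gone away' error."""


def run_with_retry(connect, sql, attempts=3, backoff=1.0):
    """Run sql via a fresh connection, retrying on transient errors.

    connect() -> a connection object with .query(sql) and .close().
    """
    last_err = None
    for attempt in range(attempts):
        try:
            conn = connect()
            try:
                return conn.query(sql)
            finally:
                conn.close()  # never hold a persistent connection open
        except TransientDBError as err:
            last_err = err
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise last_err
```

The key point is that each attempt opens a fresh connection and closes it promptly, so a killed server or connection only costs a retry, not a crashed tool.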

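The "indexed field plus WHERE clause" pattern recommended above can be sketched like this. The `page` table and `page_id` column are real MediaWiki names; `run_query` and the helper functions are illustrative stand-ins for whatever database client a tool uses.

```python
# Sketch of replica-friendly batching: iterate an indexed id column in
# fixed-size inclusive ranges instead of paging with LIMIT/OFFSET,
# which forces the server to re-scan skipped rows on every page.

def keyset_batches(start_id, end_id, batch_size):
    """Yield inclusive (lo, hi) id ranges covering [start_id, end_id]."""
    lo = start_id
    while lo <= end_id:
        hi = min(lo + batch_size - 1, end_id)
        yield lo, hi
        lo = hi + 1


def fetch_all(run_query, start_id, end_id, batch_size=1000):
    """Fetch rows batch by batch; run_query(sql) -> list of rows."""
    rows = []
    for lo, hi in keyset_batches(start_id, end_id, batch_size):
        sql = ("SELECT page_id, page_title FROM page "
               "WHERE page_id >= %d AND page_id <= %d" % (lo, hi))
        rows.extend(run_query(sql))
    return rows
```

Each batch is a cheap, index-backed range scan, and the tool can pause, reconnect, or abort between batches without losing its place.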

Questions

Q (?): When should you use the API vs the database? If I need 50,000 things, which do I choose?
A: "It depends" :) If you run very complex queries against very large tables (image, revision, etc.), the right indices may not exist. In some cases the API has been optimized with continuation logic to fetch large batches of data. In general, if it is easy and fast with the API, use it. The LabsDBs offer more flexibility but far fewer resources, because they are not production. Also, the LabsDBs can be subject to drastic schema changes that are not backwards compatible; you should take this into account.
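As a sketch of the continuation logic mentioned in the answer: the MediaWiki action API returns a "continue" block that the client merges into its next request until it stops appearing. Here `fetch` stands in for an HTTP call (for example with the requests library) and is injected so the sketch stays library-agnostic; the parameter names in the test mirror real API responses, but the helper itself is illustrative.

```python
# Follow the MediaWiki action API's continuation protocol: merge the
# server's "continue" block into the request parameters and repeat
# until the server stops sending one.

def continued_queries(params, fetch):
    """Yield one parsed JSON response per request, following continuation.

    fetch(params) -> dict parsed from the API's JSON response.
    """
    params = dict(params)
    params.setdefault("continue", "")  # opt in to the modern continuation style
    while True:
        data = fetch(params)
        yield data
        if "continue" not in data:
            break
        params.update(data["continue"])
```

Used with e.g. `{"action": "query", "list": "allpages", "aplimit": "max", "format": "json"}`, each iteration yields one batch, so 50,000 items arrive in a handful of server-optimized requests.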

Q: Can you show an example of "SHOW EXPLAIN"?
A: See the slides from "Query Optimization: From 0 to 10 (and up to 5.7)" (<http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57>). In the output, "type: ALL" means a full table scan; "rows: N" is the number of rows the engine needed to look at; "type: ref" means an index was used (good). Creating new indices is not out of the question for the replica DBs, and many/most things are already indexed somehow. For example, the page table already has an index to optimize queries on page titles, which includes the namespace.
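A minimal sketch of the workflow described here: from a second connection, locate the running query with `SHOW PROCESSLIST`, then ask MariaDB to explain it in place with `SHOW EXPLAIN FOR <connection_id>`. The `cursor` is assumed to be a DB-API cursor (for example from pymysql), and the column positions assume the standard PROCESSLIST layout (Id, User, Host, db, Command, Time, State, Info); the helper function itself is illustrative.

```python
# From a *second* session: find a long-running query by a substring of
# its SQL, then get MariaDB's live plan for it with SHOW EXPLAIN FOR.

def explain_running_query(cursor, needle):
    """Return the live EXPLAIN rows for the first query containing `needle`,
    or None if no matching query is currently running."""
    cursor.execute("SHOW PROCESSLIST")
    for row in cursor.fetchall():
        conn_id, info = row[0], row[7]  # Id and Info columns
        # Skip idle threads and our own PROCESSLIST statement.
        if info and needle in info and "PROCESSLIST" not in info:
            cursor.execute("SHOW EXPLAIN FOR %d" % conn_id)
            return cursor.fetchall()
    return None
```

Unlike a plain EXPLAIN (which does not work on the replicas), this inspects the plan of the query as it is actually executing.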

Q: How can we tell which indices actually exist in the Labs replicas?
A: You are actually querying a view, which may change things. The production schema is in the MediaWiki source, in maintenance/tables.sql (<https://phabricator.wikimedia.org/diffusion/MW/browse/master/maintenance/tables.sql>). There is also the information_schema_p metadata database, and the schema documentation on mediawiki.org: every table is documented under https://www.mediawiki.org/wiki/Manual:<TABLE_NAME>_table

Q (Dan): Explains the collection of data to feed the MediaWiki history reconstruction. Says it would be cool to use the LabsDBs to collect that data, because the LabsDB data are already sanitized.
A: It makes sense; we're open to this kind of request.

Q (Timo): With the switch to InnoDB, are there any side effects we should expect?
A: InnoDB is going to perform better and be more reliable. On some occasions, because the data is so heavily compressed, single-row selects may be slower. In terms of latency, we have seen a 5x performance improvement.

Q (darkblue-b): Three miscellaneous Labs comments.
First: Jupyter Notebook has been very well received in other groups, and Yuvi showed it at Berkeley to a great response. It fills a wide range of needs, from complete beginners with a pre-made notebook to advanced users, and it has graphing and interactive query capabilities.
Second: on the presenters' question to the audience about "user databases": I suggest a standard VM image for users to run on their own machines with their own storage, with a few simple presets for connecting and running queries.
Third: as the presenter said, there is no black-and-white answer to what sort of queries should be allowed or how to form them. For example, a real-world query doing summary stats on about 40 variables over roughly 200,000 rows, selected from a set of 600,000: the first attempt, based on some difficult criteria, ran for a day and a half and did not finish; after several rewrites, the actual solution ran in about six minutes on the same hardware. And I was practicing every day.


  • Future of the replica DBs, potential ideas:
    • Pre-generated summary tables
    • Different indexes/plans than production
    • User databases
    • API to setup resource limits


Announcement

Problems: For months the DBAs tried to fix the LabsDBs, which were suffering performance problems on very old servers. It took 6 months to reload English Wikipedia, and there was no automatic failover.

Solution:

  • 3 new LabsDB servers with 12 TB of SSD storage
  • 2 filtering servers
  • MariaDB 10.1
  • Load balancing and failover via a high-availability proxy
  • Row-based replication
  • The new servers use InnoDB instead of the unreliable TokuDB engine used by the old labsdb hosts

=> S1 and S3 have been fully reloaded and sanitized for 800 wikis, and are available for querying! (Some larger wikis are still missing and will be loaded soon.)

  • labsdb-web.eqiad.wmnet - short, high-throughput requests
  • labsdb-analytics.eqiad.wmnet - long-running queries with limited throughput
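As a sketch, a tool could route its queries between these two endpoints like this. The hostnames come from the notes above; the helper itself is hypothetical and any real tool would also need the usual credentials and client library.

```python
# Illustrative routing between the two replica endpoints named above:
# interactive/web requests go to the web host, long-running analytical
# queries to the analytics host.

WEB_HOST = "labsdb-web.eqiad.wmnet"              # short, high-throughput requests
ANALYTICS_HOST = "labsdb-analytics.eqiad.wmnet"  # long-running, limited throughput


def pick_host(long_running):
    """Return the replica endpoint appropriate for the query's expected runtime."""
    return ANALYTICS_HOST if long_running else WEB_HOST
```

Routing long-running queries to the analytics endpoint keeps them from competing with short interactive requests on the web endpoint.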