Wikimedia Developer Summit/2017/Labsdbs

Overview of LabsDBs

 * databases for Tool Labs shared accounts
 * production database replicas

Get access by having a Labs project or being a Tool Labs project member Use Quarry ()

(slides will be posted)


 * labsDb (the old setup) had a number of problems:
 * Old servers with limited storage
 * Data drift compared to production databases
 * Little documentation, hard to use
 * Occasionally overloaded by some applications
 * No failover, SPOF


 * Almost the same data in replica DBS as in production dabases, except
 * private data like user IP addresses (filtered out for privacy reasons)
 * Wikitext content and Flow discussions are stored on different set of servers
 * Data not stored in MySQL (like restbase or jobs)


 * Typical uses cases where replica DBs are not suited best:
 * Actions operating on every commit; better use IRC or rcstream
 * Analysing data which doesn't require realtime; better use dumps
 * Some actions might better be served by APIs


 * Things to avoid when using replica DBS:
 * Persistent connections (due to resource consumption/contention with other users of the shared database service)
 * Idle connections will be automatically shut down
 * Do implement reconnection logic in your application to deal with force kills or server failover
 * "We can guaratee the service, but not the server" Restarts needed for maintenace and other reasons
 * Avoid poorly optimised queries
 * Make sure to use indexes and minimise queries
 * If necessary break down bigger queries into smaller ones
 * Do not use MySQL's "LIMIT" for offsets, use an indexed field and the where clause (e.g. WHERE id >= 1 and id <= 1000)
 * "EXPLAIN ...." queries do not work on the replicas. Use "SHOW PROCESS LIST" + "SHOW EXPLAIN FOR " instead ()
 * When the user aborts, kill your query/close connection
 * Use execution limit features of MariaDB like SELECT ... LIMIT ... ROWS EXAMINED rows_limit;

QUESTIONS
Q (?): When should you use the API vs the database? If I need 50,000 things which do I choose? A: "It depends" :) If you try to do very complex queries against very large tables (images, revisions, etc) there may not be the correct indices etc. In some cases the API has been optimized to deal with continuation logic to get large batches of data. In general if it is easy and fast with the API, then use it. LabsDBs have more flexiibility but way less resources, because they are not production. Also, LabsDBs can be subject to drastic schema changes that are not backwards compatible, you should take this into account.

Q: Can you show an example of "SHOW EXPLAIN" A: Slides from "Query Optimization: From 0 to 10 (and up to 5.7)" () "type: ALL" == full table scan "rows: N" == number of rows that the engine needed to look at Creating new indices is not out of the question for the replica dbs. Many/most things are already indexed somehow. The page table has already an index to optimize queries on page titles named page_title, which includes the namespace. "type: ref" == used an index (good)

Q: How can we tell which indices actually exist in the Labs replicas? A: You are actually using a view which may change things. Go to mediawiki code (mediawiki.sql file?) and you'll find production schema. () Information_schema_p metadata table mediawiki.org db documentation All schemas are documented under https://www.mediawiki.org/wiki/Manual:_table

Q (Dan): Explains the collection of the data to feed the mediawiki reconstruction. Says it would be cool to use labsDBs to collect that data, because the labsDBs data are already sanitized. A: It makes sense, we're open to this kind of request.

Q (Timo): Switch to Innodb, any collateral effect that we have to expect? A: Innodb, is going to perform better and be more reliable. In some occasions, because it is so heavily compressed, single row selects may be slower? In latency, we have seen 5x increase in performance.

ANNOUNCEMENT
Problems: During months the DBAs tried to fix labsDBs that were having performance problems, also very old servers. It took 6 months to reload english wikipedia. No automatic failover.

Solution:
 * 3 new LabsDBs servers 12 TBs of SSD
 * 2 filtering servers
 * MariaDB 10.1
 * Load balancing and failover with high availability proxy
 * row-based replication
 * New servers are using InnoDB instead of the unreliable TokuDB engine used by labsdb

=> S1 and S3 fully reloaded and sanitized for 800 wikis, available for querying! (some larger wikis missing, will be loaded soon)

labsdb-web.eqiad.wmnet - short, high thoughput requests labsdb-analytics.eqiad.wmnet - long running queries with limited throughput

Future of replica DBS, potential ideas

 * Pre-generated summary tables
 * Different indexes/plans than production
 * User databases
 * API to setup resource limits