User:ASarabadani (WMF)/Database for developers toolkit/How-to/Make queries against production

From mediawiki.org

How to query the database directly. For example, to confirm whether data in production is stored correctly, to investigate a bug, to perform some analysis for a research or product report, etc.

Safe[edit]

  • Prefer Quarry whenever possible, which queries the world-public Wiki Replicas in Wikimedia Cloud. This is is suitable for all queries that do not involve private information like e-mail addresses, checkuser data, etc. Quarry has a nice web interface and provides sharable links for the results. If you prefer a command-line REPL interface, you can also SSH to Toolforge and invoke sql there, e.g. sql enwiki.
  • Use the analytics replicas, which are full replicas that include unredacted private data, but are dedicated to non-production purposes and as such have no live web traffic pointed at them. These are suitable for slow queries that require a long time to run slow and/or queries that involve private data. (The replicas in Wikimedia Cloud permit queries that take upto 1 hour).

Risky[edit]

The MediaWiki maintenance server is part of the production environment and has cron jobs that perform live analysis and/or change data through background tasks (see platform diagram). It interacts with the same live production databases as the web servers.

Note: Avoid slow queries that may take more than a second to complete as this would risk causing replication lag and impact site reliability.

To access these, ssh to the active mwmaint server and use the sql or mwscript mysql.php commands accordingly as documented at https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debugging_databases. For example, sql enwiki establishes a replica connection that lets you immediately perform unchecked SQL queries.

Available features include:

  • Connect to hosts in the x1, ExternalStore text, and ParserCache clusters.
  • Connect to a database in the secondary datacenter. Use database tree to find relevant hostnames.

Security Note: This is highly dangerous. If you need to permanently modify core database rows on the production master, this can be done via the write option.

  • Before doing so, write queries should be peer-reviewed by a SRE Data Persistence or experienced MediaWiki developer.
  • Always "!log" your write queries on IRC in #wikimedia-operations and reference a Phrabricator task (if the query mentions a user name or contains private data, describe the query instead). Any read subquery must be as fast as possible. If unsure, ask Data Persistence team first.
  • Example use case: An emergency where all of Wikipedia is down due to an undefined database table. Again, their use must be extremely rare.
  • The write option can be used like so: sql testwiki --write.