MySQL queries

MySQL queries allow you to pull data from the Toolserver's replica databases.

Database layout
The database layout is available at the MediaWiki wiki: mw:Manual:Database layout.

There are also two commands you can use to view the layout. SHOW TABLES will show the available tables in a database. DESCRIBE table_name will show the available columns in a specific table.

Clusters
The various databases are stored in three different clusters: S1, S2, and S3. A table of this information is available here.
 * S1 &mdash; enwiki_p
 * S2 &mdash; bgwiki_p, bgwiktionary_p, commonswiki_p, cswiki_p, dewiki_p, enwikiquote_p, enwiktionary_p, eowiki_p, fiwiki_p, idwiki_p, itwiki_p, nlwiki_p, nowiki_p, plwiki_p, ptwiki_p, svwiki_p, thwiki_p, trwiki_p, and zhwiki_p
 * S3 &mdash; all other wikis

Data storage
There are a few tricks to how data is stored in the various tables.
 * 1) Page titles use underscores and never include the namespace prefix.
 * 2) User names use spaces, not underscores.
 * 3) Namespaces are integers. A key to the integers is available here. The actual names for the namespaces can be obtained from the Toolserver database (explained below).

Views
The Toolserver has exact replicas of Wikimedia's databases, however, certain information is restricted to Toolserver users using MySQL views.

For example, the user table does not show things like user_password or user_email to Toolserver users.

Accessing the data
There are a variety of ways to access the databases. From the command line, a shell script exists that automatically selects the correct cluster (S1, S2, or S3) for you. $ sql enwiki_p This command will connect to the appropriate cluster (S1) and give you a MySQL prompt where you can run queries. $ sql enwiki_p < test-query.sql > test-query.txt This command takes a .sql file that contains your query, selects the appropriate cluster, runs the query, and outputs to a text file. The advantage here is that it doesn't add lines around the data (making a table), it instead outputs the data in a tab-delimited format. $ sql enwiki_p < test-query.sql > test-query.txt; gzip test-query.txt This command does the same thing as the command above, but after outputting to a text file, it gzips the data. This can be very helpful if you're dealing with large amounts of data.

The sql shell script is a wrapper around the mysql command. Either method works, though the sql shell script selects the appropriate cluster for you.

If you wish to use the mysql command directly, a sample query would look like this: $ mysql -h sql-s1 enwiki_p -e "DESCRIBE user;" The -h option tells MySQL which host to access (in this case sql-s1). The -e option tells MySQL to echo the results of the query to the terminal. You can also have MySQL output the results to a file. $ mysql -h sql-s1 enwiki_p < test-query.sql > test-query.txt

Writing queries
Because the Toolserver database is read-only, nearly all of the queries you will want to run will be SELECT queries.

This query selects all columns from the user table. More information about MySQL queries are available below (in the example queries) and in the MySQL manual.

Queries end in a semi-colon. If you want to cancel the query, end it in \c. If you want to output in a non-table format, use \G.

Toolserver table
FIXME

Atypical log entries
Explanation: This pulls log entries from the deletion log that aren't restore or delete actions.

Broken redirects
Explanation: This pulls all broken redirects.

Cross-namespace redirects
Explanation: This pulls redirects from (Main) to any other namespace.

Deleted red-linked categories
Explanation: This pulls non-existent used categories that have previously been deleted.

Empty categories
Explanation: This pulls empty categories that aren't in specific categories and don't transclude a specific template.

Fully-protected articles with excessively long expiries
Explanation: Articles that are fully-protected from editing for more than one year.

Excessively long IP blocks
Explanation: Blocks of anonymous users that are longer than two years (but not indefinite).

Indefinitely fully-protected articles
Explanation: Articles indefinitely fully-protected from editing.

Long pages
Explanation: Pages over 175,000 bytes in length; excludes titles with "/" in them (to avoid archives, etc.).

Redirects obscuring page content
Explanation: Redirects with large page lengths. This usually indicates there is text below the redirect that should not be there.

Mistagged non-free content
Explanation: This pulls files on a local wiki that are in non-free category, but also exist at Commons. This indicates that either the local image or the Commons image should be deleted. It excludes the SHA1 empty string due to bad database rows.

Pages with the most revisions
Explanation: This pulls the pages with the most revisions. On large wikis, it can take several hours (or days) to run.

Page counts by namespace
Explanation: This pulls the number of redirects and non-redirects in each namespace.

Orphaned talk pages
Explanation: This (very, very hackishly) pulls all pages in the talk namespaces that don't have a corresponding subject-space page. It JOINs against Commons to ensure that File_talk: pages are truly orphaned. It also has some en.wiki-specific template checks in it.

Ownerless pages in the user space
Explanation: This pulls all User: and User_talk: pages not belonging to a registered user. Pages belonging to an anonymous user are excluded.

Polluted categories
Explanation: This pulls categories that contain pages in the (Main) namespace and the User: namespace. Generally categories hold one or the other.

Categories categorized in red-linked categories
Explanation: This pulls categories categorized in red-linked categories.

Articles containing red-linked files
Explanation: This pulls articles that contain red-linked images. It checks both Commons and the local wiki.

Self-categorized categories
Explanation: This pulls self-categorized categories.

Uncategorized categories
Explanation: This pulls uncategorized categories.

Pages in a specific category using a specific template
Explanation: This pulls pages that are in a specific category and are using a specific template.