MySQL queries

MySQL queries allow you to pull data from the Toolserver's replicated databases.

Database layout
The database layout is available at the MediaWiki wiki: mw:Manual:Database layout.

There are also two commands you can use to view the layout. SHOW TABLES will show the available tables in a database. DESCRIBE table_name will show the available columns in a specific table.

Clusters
The various databases are stored in clusters. The clusters are named with a preceding S and a digit, e.g., S1, S2, etc. A table of this information is available here: Wiki server assignments.

Data storage
There are a few tricks to how data is stored in the various tables.


 * 1) Page titles use underscores and never include the namespace prefix.
 * 2) User names use spaces, not underscores.
 * 3) Namespaces are integers. A key to the integers is available here. The localized names for the namespaces can be obtained from the Toolserver database.

Views
The Toolserver has exact replicas of Wikimedia's databases, however, certain information is restricted to Toolserver users using MySQL views.

For example, the user table does not show things like user_password or user_email to Toolserver users.

Accessing the data
There are a variety of ways to access the databases. From the command line, a shell script exists that automatically selects the correct cluster for you. $ sql enwiki_p This command will connect to the appropriate cluster (in this case, S1) and give you a MySQL prompt where you can run queries. $ sql enwiki_p < test-query.sql > test-query.txt This command takes a .sql file that contains your query, selects the appropriate cluster, runs the query, and outputs to a text file. The advantage here is that it doesn't add lines around the data (making a table), it instead outputs the data in a tab-delimited format. $ sql enwiki_p < test-query.sql | gzip >test-query.txt.gz This command does the same thing as the command above, but after outputting to a text file, it gzips the data. This can be very helpful if you're dealing with large amounts of data.

The sql shell script is a wrapper around the mysql command. Either method works, though the sql shell script selects the appropriate cluster for you.

If you wish to use the mysql command directly, a sample query would look like this: $ mysql -h sql-s1 enwiki_p -e "DESCRIBE `user`;" The -h option tells MySQL which host to access (in this case sql-s1). The -e option tells MySQL to echo the results of the query to the terminal. You can also have MySQL output the results to a file. $ mysql -h sql-s1 enwiki_p < test-query.sql > test-query.txt

Writing queries
Because the Toolserver database is read-only, nearly all of the queries you will want to run will be SELECT queries.

This query selects all columns from the user table. More information about MySQL queries are available below (in the example queries) and in the MySQL manual.

Queries end in a semi-colon. If you want to cancel the query, end it in \c. If you want to output in a non-table format, use \G.

Toolserver database
The "toolserver" database contains metadata about the various wikis and databases. It consists of four tables: language, namespace, namespacename, and wiki.

Example queries

 * w:en:Wikipedia:Database reports (and its equivalents in other languages) contain many useful examples.

Atypical log entries
Explanation: This pulls log entries from the deletion log that aren't restore or delete actions.

Broken redirects

 * See also source code of Special:BrokenRedirects

Explanation: This pulls all broken redirects.

Cross-namespace redirects
Explanation: This pulls redirects from (Main) to any other namespace.

Deleted red-linked categories
Explanation: This pulls non-existent used categories that have previously been deleted.

Empty categories
Explanation: This pulls empty categories that aren't in specific categories and don't transclude a specific template.

Fully-protected articles with excessively long expiries
Explanation: Articles that are fully-protected from editing for more than one year.

Excessively long IP blocks
Explanation: Blocks of anonymous users that are longer than two years (but not indefinite).

Indefinitely fully-protected articles
Explanation: Articles indefinitely fully-protected from editing.

Long pages
Explanation: Pages over 175,000 bytes in length; excludes titles with "/" in them (to avoid archives, etc.).

Redirects obscuring page content
Explanation: Redirects with large page lengths. This usually indicates there is text below the redirect that should not be there.

Mistagged non-free content
Explanation: This pulls files on a local wiki that are in non-free category, but also exist at Commons. This indicates that either the local image or the Commons image should be deleted. It excludes the SHA1 empty string due to bad database rows.

Pages with the most revisions
Explanation: This pulls the pages with the most revisions. On large wikis, it can take several hours (or days) to run.

Page counts by namespace
Explanation: This pulls the number of redirects and non-redirects in each namespace.

Orphaned talk pages
Explanation: This (very, very hackishly) pulls all pages in the talk namespaces that don't have a corresponding subject-space page. It JOINs against Commons to ensure that File_talk: pages are truly orphaned. It also has some en.wiki-specific template checks in it.

Ownerless pages in the user space
Explanation: This pulls all User: and User_talk: pages not belonging to a registered user. Pages belonging to an anonymous user are excluded.

Polluted categories
Explanation: This pulls categories that contain pages in the (Main) namespace and the User: namespace. Generally categories hold one or the other.

Categories categorized in red-linked categories
Explanation: This pulls categories categorized in red-linked categories.

Articles containing red-linked files
Explanation: This pulls articles that contain red-linked images. It checks both Commons and the local wiki.

Self-categorized categories
Explanation: This pulls self-categorized categories.

Uncategorized categories
Explanation: This pulls uncategorized categories.

Pages in a specific category using a specific template
Explanation: This pulls pages that are in a specific category and are using a specific template.

Top edit timestamp for a category of users
Explanation: This will take the user pages that do not contain a forward slash ("/") in "Category_name_goes_here" and get the top edit timestamp for each user.

Top pages by in a specific namespace for a specific user
Explanation: This will pull the page titles of a the most-edited pages by a specific user in a specific namespace.

Number of deletions per day for a specific user
Explanation: This will group the number of deletions per day by a specific user.

Number of deletions per day
Explanation: This will pull the number of deletions per day on a given wiki.

Most common deletion summaries
Explanation: This will pull the most common deletion summaries on a given wiki.

Most common deletion summaries for a specific user
Explanation: This will pull the most commonly used deletion summaries for a specific user.

Most common edit summaries for a specific user
Explanation: This will pull the most commonly used edit summaries for a specific user.

Number of revisions per day
Explanation: This will pull the number of (non-deleted) revisions to a particular wiki and group the numbers by day.

Number of revision per day by a specific user
Explanation: This will pull the number of (non-deleted) revisions by a specific user per day.

File description pages without an associated file
Explanation: This will pull file description pages that do not have an associated file, either locally or in the Commons repo.

Files without an associated file description page
Explanation: This will pull files that have no associated file description page, either locally or in the Commons repo.

File description pages containing no templates
Explanation: This will pull file description pages containing no templates that do not have an associated file description page in the Commons repo.

File description pages containing no templates or categories
Explanation: This will pull file description pages containing no templates or categories that do not have an associated file description page in the Commons repo.

Links on a particular page
Explanation: This will pull the names of the links on a particular page. For example, the text of the "Don't poke the bear" page contains the link "[ [bear]]" so the output of this query will list "Bear" as a result.

Links to a particular page
Explanation: This will pull the names of the links to a particular page. This is the equivalent of Special:WhatLinksHere. For example, the page "What not to do" may contain the link "[ [Don't poke the bear]]"; this query would output "What not to do" as a result.

Pages containing 0 page links
Explanation: This will pull pages that contain 0 page links. This will not account for things like image links, template links, category links, or external links.

Pages with 0 links to them
Explanation: This will pull pages that have 0 links to them (Special:WhatLinksHere would be empty!).

Short pages with a single author (excluding user pages and redirects)
Explanation: This will list short pages with a single author (excluding user pages and redirects).

Unused templates
Explanation: This will pull unused templates.

Broken image links
Explanation: List references to images that are not present locally nor on Commons (nowiki).

Revision counting for specific user till specific time
Explanation: Return the number of revisions by a specific user up to a certain time.

Log count in recent time by users not in group
A query like this outputs the users with:
 * the most log actions of the kind specified (in the example, patrol/patrol minus autopatrolled edits),
 * in the interval specified (in the example, last 24 months),
 * excluding users in some user groups (in the example, sysop and rollbacker).

In wikis where "patrol" right is given to all users, the example query can be used to find active patrollers who would be likely candidates for additional rights (like autropatrolled, rollbacker, sysop etc. depending on the wiki's configuration).

Frequency of a log action by day
Replace "create" (unregistered user signup) with your log action; and "8) AS DAY" with "6) AS MONTH" and so on for lower resolution.

Redirects with more than one revision
Explanation: This query lists redirect pages with more than one revision.

Short non-disambiguation pages
Explanation: This query lists 1000 non-disambiguation short pages.

List of external links
Explanation: This query compiles a list of external links (from the main namespace) grouped by website.

List of external links from all wikis
List of external links to a given domain from all namespaces on all Wikimedia projects; requires iterating over wikis, with a bash script in the example, syntax like  (outputs DB name, page title, linked URL).

List of interwiki links from all wikis
As above; use lowercase interwikiki prefix, like.

User Activity
Activity of registered users, as edits per day. This query filters out flagged bots but might still miss unflagged bot accounts though.

Survived editors
Editor retention by month considering time between first and last edit, excluding bots, orderd by month of first edit.

Article Size
This query works on the WMF Slave db

Top 10 Articles in Size
This query works on the WMF Slave db

Top links, red and blue
This query works on the WMF Slave db

List of users with a certain log action (not) among global users
For instance, a query to have a list of autocreated accounts (which have an entry in autocreate log) which are not known by CentralAuth (61876):

Editors of a page by edit count
Something like this can be used for instance to quickly list users who (don't) meet requirements for voting in a poll, though usually requirements are about more than edit count (cf. accounteligibility tool for complex examples).

Top most active editors under some conditions
For instance the following gives the top 50 most active translators (editors of the "Translations" namespace, where available). A simple JOIN should give most of the filtering options you'll need.

Is an IP address
This query matches 999.999.999.999 as well as valid addresses like 129.47.61.5