MySQL queries

MySQL queries allow you to pull data from the Toolserver's replica databases.

Database layout
The database layout is available at the MediaWiki wiki: mw:Manual:Database layout.

There are also two commands you can use to view the layout. SHOW TABLES will show the available tables in a database. DESCRIBE table_name will show the available columns in a specific table.

Clusters
The various databases are stored in clusters. The clusters are named with a preceding S and a digit, e.g., S1, S2, etc. A table of this information is available here: Wiki server assignments.

Data storage
There are a few tricks to how data is stored in the various tables.
 * 1) Page titles use underscores and never include the namespace prefix.
 * 2) User names use spaces, not underscores.
 * 3) Namespaces are integers. A key to the integers is available here. The localized names for the namespaces can be obtained from the Toolserver database.

Views
The Toolserver has exact replicas of Wikimedia's databases, however, certain information is restricted to Toolserver users using MySQL views.

For example, the user table does not show things like user_password or user_email to Toolserver users.

Accessing the data
There are a variety of ways to access the databases. From the command line, a shell script exists that automatically selects the correct cluster for you. $ sql enwiki_p This command will connect to the appropriate cluster (in this case, S1) and give you a MySQL prompt where you can run queries. $ sql enwiki_p < test-query.sql > test-query.txt This command takes a .sql file that contains your query, selects the appropriate cluster, runs the query, and outputs to a text file. The advantage here is that it doesn't add lines around the data (making a table), it instead outputs the data in a tab-delimited format. $ sql enwiki_p < test-query.sql | gzip >test-query.txt.gz This command does the same thing as the command above, but after outputting to a text file, it gzips the data. This can be very helpful if you're dealing with large amounts of data.

The sql shell script is a wrapper around the mysql command. Either method works, though the sql shell script selects the appropriate cluster for you.

If you wish to use the mysql command directly, a sample query would look like this: $ mysql -h sql-s1 enwiki_p -e "DESCRIBE `user`;" The -h option tells MySQL which host to access (in this case sql-s1). The -e option tells MySQL to echo the results of the query to the terminal. You can also have MySQL output the results to a file. $ mysql -h sql-s1 enwiki_p < test-query.sql > test-query.txt

Writing queries
Because the Toolserver database is read-only, nearly all of the queries you will want to run will be SELECT queries.

This query selects all columns from the user table. More information about MySQL queries are available below (in the example queries) and in the MySQL manual.

Queries end in a semi-colon. If you want to cancel the query, end it in \c. If you want to output in a non-table format, use \G.

Toolserver database
The "toolserver" database contains metadata about the various wikis and databases. It consists of four tables: language, namespace, namespacename, and wiki.

Atypical log entries
Explanation: This pulls log entries from the deletion log that aren't restore or delete actions.

Broken redirects

 * See also source code of Special:BrokenRedirects

Explanation: This pulls all broken redirects.

Cross-namespace redirects
Explanation: This pulls redirects from (Main) to any other namespace.

Deleted red-linked categories
Explanation: This pulls non-existent used categories that have previously been deleted.

Empty categories
Explanation: This pulls empty categories that aren't in specific categories and don't transclude a specific template.

Fully-protected articles with excessively long expiries
Explanation: Articles that are fully-protected from editing for more than one year.

Excessively long IP blocks
Explanation: Blocks of anonymous users that are longer than two years (but not indefinite).

Indefinitely fully-protected articles
Explanation: Articles indefinitely fully-protected from editing.

Long pages
Explanation: Pages over 175,000 bytes in length; excludes titles with "/" in them (to avoid archives, etc.).

Redirects obscuring page content
Explanation: Redirects with large page lengths. This usually indicates there is text below the redirect that should not be there.

Mistagged non-free content
Explanation: This pulls files on a local wiki that are in non-free category, but also exist at Commons. This indicates that either the local image or the Commons image should be deleted. It excludes the SHA1 empty string due to bad database rows.

Pages with the most revisions
Explanation: This pulls the pages with the most revisions. On large wikis, it can take several hours (or days) to run.

Page counts by namespace
Explanation: This pulls the number of redirects and non-redirects in each namespace.

Orphaned talk pages
Explanation: This (very, very hackishly) pulls all pages in the talk namespaces that don't have a corresponding subject-space page. It JOINs against Commons to ensure that File_talk: pages are truly orphaned. It also has some en.wiki-specific template checks in it.

Ownerless pages in the user space
Explanation: This pulls all User: and User_talk: pages not belonging to a registered user. Pages belonging to an anonymous user are excluded.

Polluted categories
Explanation: This pulls categories that contain pages in the (Main) namespace and the User: namespace. Generally categories hold one or the other.

Categories categorized in red-linked categories
Explanation: This pulls categories categorized in red-linked categories.

Articles containing red-linked files
Explanation: This pulls articles that contain red-linked images. It checks both Commons and the local wiki.

Self-categorized categories
Explanation: This pulls self-categorized categories.

Uncategorized categories
Explanation: This pulls uncategorized categories.

Pages in a specific category using a specific template
Explanation: This pulls pages that are in a specific category and are using a specific template.

Top edit timestamp for a category of users
Explanation: This will take the user pages that do not contain a forward slash ("/") in "Category_name_goes_here" and get the top edit timestamp for each user.

Top pages by in a specific namespace for a specific user
Explanation: This will pull the page titles of a the most-edited pages by a specific user in a specific namespace.

Number of deletions per day for a specific user
Explanation: This will group the number of deletions per day by a specific user.

Number of deletions per day
Explanation: This will pull the number of deletions per day on a given wiki.

Most common deletion summaries
Explanation: This will pull the most common deletion summaries on a given wiki.

Most common deletion summaries for a specific user
Explanation: This will pull the most commonly used deletion summaries for a specific user.

Most common edit summaries for a specific user
Explanation: This will pull the most commonly used edit summaries for a specific user.

Number of revisions per day
Explanation: This will pull the number of (non-deleted) revisions to a particular wiki and group the numbers by day.

Number of revision per day by a specific user
Explanation: This will pull the number of (non-deleted) revisions by a specific user per day.

File description pages without an associated file
Explanation: This will pull file description pages that do not have an associated file, either locally or in the Commons repo.

Files without an associated file description page
Explanation: This will pull files that have no associated file description page, either locally or in the Commons repo.

File description pages containing no templates
Explanation: This will pull file description pages containing no templates that do not have an associated file description page in the Commons repo.

File description pages containing no templates or categories
Explanation: This will pull file description pages containing no templates or categories that do not have an associated file description page in the Commons repo.

Links on a particular page
Explanation: This will pull the names of the links on a particular page. For example, the text of the "Don't poke the bear" page contains the link "[ [bear]]" so the output of this query will list "Bear" as a result.

Links to a particular page
Explanation: This will pull the names of the links to a particular page. This is the equivalent of Special:WhatLinksHere. For example, the page "What not to do" may contain the link "[ [Don't poke the bear]]"; this query would output "What not to do" as a result.

Pages containing 0 page links
Explanation: This will pull pages that contain 0 page links. This will not account for things like image links, template links, category links, or external links.

Pages with 0 links to them
Explanation: This will pull pages that have 0 links to them (Special:WhatLinksHere would be empty!).

short pages with a single author (excluding user pages and redirects)
Explanation: This will List of short pages with a single author (excluding user pages and redirects)

Unused templates enwiki
Explanation: This will pull Templates that they don't use

Broken image links
Explanation: list references to images that are not present locally nor on Commons (nowiki)

revision counting for specific user till specific time
Explanation: return number of specific user's revisions till specific time

Redirects that their history has more than one revision
Explanation: this Query lists redirect pages that their history has more that one revisions and they have to be merged their history.

Short pages not disambiguation
Explanation: this Query lists 1000 short pages that they are not disambiguation.

List of external links
Explanation: This query compiles a list of external links (from the main namespace) grouped by website.