Toolserver:Database access

One of the key features of a Toolserver account is access to a replicated copy of the Wikimedia databases. This page explains how to access them.

Overview
There are seven clusters of databases, defined by the Wikimedia Foundation and numbered from s1 to s7. The Toolserver replicates the clusters as they are. At the Toolserver, the commons database is available at all clusters, though Commons is properly a part of s4.

There are two different types of server for each cluster:


 * The RR server : This is the server you would usually use. It will connect you to a randomly selected database server which contains the requested cluster.
 * The user server : This is the server you should connect to if you need user database access (explained below)

By default, your database credentials are stored in. The code examples below show you how to read that file in, rather than putting passwords in your script directly, which is a security risk.

Database names
Each Wikimedia project has a specific database name. The general pattern is langcodeproject name_p. The full list is available here (though without the _p suffix).

Replication lag
The servers, for the most part, are real-time copies of the live databases, but at times there may be replication lag (replag). Replication lag causes updates to the copy of the database on the Toolserver to lag behind the master server, so the database represents the state of the master at some point in the past.

Connecting to a database
There are two methods of connecting to a database: by cluster, and by database name. Many tools still use the cluster method, since originally this was the only available method. However, the database name method is generally easier.

By database name
To connect to a database by name, use the hostname XX.rrdb.toolserver.org (for the RR server) or XX.userdb.toolserver.org (for the user server), where XX is the database name. (Example: enwiki-p.rrdb.toolserver.org.)

Please notice that XX.userdb.toolserver.org is used only for user databases that are stored in servers that also contain a copy of the Wikimedia databases stored on XX.rrdb.toolserver.org. User databases that do not require that access are typically stored in sql.toolserver.org, which does not have an alias.

By cluster
To connect to a database by cluster, you first need to determine which cluster the database is located on. Toolserver database explains how to do that; in short, connect to the toolserver</tt> database on the sql-toolserver</tt> server, and use a query like:  This will return an integer indicating the cluster, e.g. "1" for enwiki_p</tt>.

Then, depending on whether you want the RR or user server, connect to the hostname sql-sX-rr.toolserver.org</tt> or sql-sX-user.toolserver.org</tt>, where X</tt> is the cluster number. (Example: sql-s1-rr.toolserver.org</tt>.)

Command-line access
To connect to a specific database (for example, enwiki_p) from the Unix command line, run:. This will automatically look up the correct server and connect to it.

Batch access
If you plan to run long queries on the database, e.g. for scheduled report jobs, we recommend using batch job scheduling to access the database. This will ensure that the database doesn't become overloaded.

Program access

 * See also: Iterating over wikis

In most programming languages, it will be sufficient to tell MySQL to use the database credentials found in $HOME/.my.cnf</tt>

Below are various examples in a few common programming languages.

Python

 * See also: Django

Using oursql (uses less memory): Extra flags are required for oursql to force binary mode since VARCHAR fields on sql-s2 are mislabeled with latin-1. Manual decoding is required even after upgrading since the fields will be VARBINARY instead.

Slow queries and the query killer
To prevent replication lag caused by long queries which put load on the database, the Toolserver uses an automated query killer to kill queries. Query-Killer only runs when replication lag (replag) is noticeable and kills queries based on how long they have been running.

There are two kinds of queries: Normal- and SLOW_OK-query. If you know that you query will run for some time, you can add  (see below for an example) to your query and Query-Killer will be more patient with your query; please add a LIMIT: (see next paragraph) too.

If and when Query-Killer kills a query depends on the actual database replication lag. See the following table.

SLOW_OK-example:

Non-SELECT-queries with SLOW_OK will not be killed until the replag reaches 1800s.

Sometimes, you might want to artificially restrict the runtime of a query. You can do this using the  directive in the query: This will kill the query automatically if it runs for longer than 60 seconds.

This is useful for web tools, where you want to restrict the runtime of a query and present an error to the user if it runs too long. (Rather than allowing the connection to time out, or the user to get bored and go away.) It can also be used to limit execution time of a query which is usually fast, but sometimes runs for longer than desired.

There is also the possibility to get no e-mails if a LIMIT: query is killed. The syntax would be.

''N.B. If connecting using the command line 'mysql' tool, comments are not passed to the server by default, and so will not be visible to the query killer. Use the '-c' flag when connecting to suppress this behaviour.''

phpMyAdmin
phpMyAdmin is a browser-based GUI interface for MySQL. To access phpMyAdmin on the Toolserver, visit https://phpmyadmin.toolserver.org and use your Toolserver username and your LDAP password for the first dialog. When you see the phpMyAdmin login page use your username and your MySQL password (which can be found in the .my.cnf file in your home directory).

Database schema
The current MediaWiki database schema is available in the MediaWiki svn tree at /trunk/phase3/maintenance/tables.sql. A (possibly out of date) description of the schema is available at mw:Manual:Database layout.

Page text access
FIXME

General caveats
FIXME

Toolserver database
The toolserver (meta-)database is a database called "toolserver", containing information about other databases on the toolserver. It can be read by anyone and is available on all database servers. The host name of the primary copy is sql-toolserver</tt>.

The toolserver database is updated automatically on a daily basis. It is maintained by User:Duesentrieb, but please report any problems to JIRA.

Tables in the toolserver database:

wiki
Information about the wikis for which the Toolserver has data.

So, for example, to get SQL server, database name and the API URLs of the 10 largest Wikipedias, you could do this:

namespace (deprecated)
This table contains the primary namespace names for the wikis in the wiki table. Use of this table is deprecated, use the namespacename table instead, which covers all valid names for each namespace in each wiki, not just the primary names.

namespacename
This table contains the all valid namespace names for the wikis in the wiki table. Use of this table is preferred over the namespace table, since it covers all valid names for each namespace in each wiki, not just the primary names.

This can be used to build wiki-syntax links from information from the database, e.g.:

language
This table contains language names, by code, in English and the respective language.

For example, to get the languages of the ten largest Wikipedias:

User databases
By default, once a user is created an empty  database is created as well. Users can create additional databases by following the  pattern. For example, if user dab wanted to create a database named, he would name it.

You can allow other users access to your databases by adding the  suffix to the end of the database names. For example, if you wanted to share the previously created image database, you could rename it:.

If you don't need a backup of one of your databases, please include the string  in the name (for example:  ).

User databases are usually stored on the sql</tt> server, and this is where you should connect to use them. However, sometimes it is helpful to create a user database on the same server as one of the replicated databases, e.g. to run joins. To do this, connect to the user server instead of the RR server (described above), which will allow you to create user databases.

To estimate the size of a user database, use the query:

Database status files
There are status-files about each cluster (and the user databases) at /var/www/status_sclusternumber or  http ://toolserver.org/status_sclusternumber</tt> (e.g., http://toolserver.org/status_s1). In both cases, clusternumber can be an integer from 1 to 6 inclusive.

These files are simple ASCII text files. Lines beginning with " " are comment lines and can safely be ignored. A non-comment-line " " indicates that everything is alright and no known maintenance or service interruption is scheduled.

LOAD DATA LOCAL
By default, MySQL does not allow the LOAD DATA LOCAL command to be used for security reasons. If you want to allow this, you can edit your $HOME/.my.cnf</tt> file to add the local-infile</tt> option:

[client] ... [mysql] local-infile

For more information, see the MySQL manual.

In oursql/Python it can be set for an individual connection:, see Python/MySQL bulk load.

External access
Creating a tunnel in Windows/PuTTY: In configuration, Connection > SSH > Tunnels. Source port: 3306, Destination: enwiki-p.userdb.toolserver.org:3306, check Local, click Add. Creating a tunnel in Unix/Linux: ssh -L 3307:enwiki-p.userdb.toolserver.org:3306 username@nightshade.toolserver.org Follow the steps above to for creating a tunnel, then follow Connect Excel to MySQL Database.
 * Excel

Managing
rmytop -h sql-s1-rr -h sql-s1-rr -h sql-s2-rr -h sql-s2-rr -h sql-s3-rr -h sql-s3-rr