Toolserver:Database access

One of the key features of a Toolserver account is access to a replicated copy of the Wikimedia databases. This page explains how to access them.

Overview
There are seven clusters of databases, defined by the Wikimedia Foundation and numbered from s1 to s7. The Toolserver replicates the clusters as they are. At the Toolserver, the commons database is available at all clusters, though Commons is properly a part of s4.

There are two different types of server for each cluster:


 * The RR server : This is the server you would usually use. It will connect you to a randomly selected database server which contains the requested cluster.
 * The user server : This is the server you should connect to if you need user database access (explained below)

By default, your database credentials are stored in. The code examples below show you how to read that file in, rather than putting passwords in your script directly, which is a security risk.

Database names
Each Wikimedia project has a specific database name. The general pattern is langcodeproject name_p. The full list is available here (though without the _p suffix).

Replication lag
The servers, for the most part, are real-time copies of the live databases, but at times there may be replication lag (replag). Replication lag causes updates to the copy of the database on the Toolserver to lag behind the master server, so the database represents the state of the master at some point in the past.

Connecting to a database
There are two methods of connecting to a database: by cluster, and by database name. Many tools still use the cluster method, since originally this was the only available method. However, the database name method is generally easier.

By database name
To connect to a database by name, use the hostname XX.rrdb.toolserver.org (for the RR server) or XX.userdb.toolserver.org (for the user server), where XX is the database name. (Example: enwiki-p.rrdb.toolserver.org.)

Please notice that XX.userdb.toolserver.org is used only for user databases that are stored in servers that also contain a copy of the Wikimedia databases stored on XX.rrdb.toolserver.org. User databases that do not require that access are typically stored in sql.toolserver.org, which does not have an alias.

By cluster
To connect to a database by cluster, you first need to determine which cluster the database is located on. Toolserver database explains how to do that; in short, connect to the toolserver</tt> database on the sql-toolserver</tt> server, and use a query like:  This will return an integer indicating the cluster, e.g. "1" for enwiki_p</tt>.

Then, depending on whether you want the RR or user server, connect to the hostname sql-sX-rr.toolserver.org</tt> or sql-sX-user.toolserver.org</tt>, where X</tt> is the cluster number. (Example: sql-s1-rr.toolserver.org</tt>.)

Command-line access
To connect to a specific database (for example, enwiki_p) from the Unix command line, run: This will automatically look up the correct server and connect to it.

Batch access
If you plan to run long queries on the database, e.g. for scheduled report jobs, we recommend using batch job scheduling to access the database. This will ensure that the database doesn't become overloaded.

Program access
In most programming languages, it will be sufficient to tell MySQL to use the database credentials found in $HOME/.my.cnf</tt>

Below are various examples in a few common programming languages.

Python

 * See also: Django

Using MySQLdb:

Using oursql (uses less memory): Extra flags are required for oursql to force binary mode since VARCHAR fields on sql-s2 are mislabeled with latin-1. Manual decoding is required even after upgrading since the fields will be VARBINARY instead.

Slow queries and the query killer
To prevent replication lag caused by long queries which put load on the database, the Toolserver uses an automated query killer to kill queries. The query killer only runs when replication lag is noticeable high (currently, over 10 minutes), and kills queries based on how long they have been running.

The query killer has two phases, the "low" phase and the "crit" phase. If replication lag is under 30 minutes, it runs in the low phase; otherwise, it runs in the crit phase.

You can prevent your query from being killed in the low phase by including the string SLOW_OK</tt> somewhere in your query, for example: There is no way to prevent queries being killed in the crit phase (except by not writing slow queries).

Note that, as long as there is no replication lag, there is also no limit on query execution time. The query killer only comes into effect when the database is unable to keep up with the load.

Query time limits
Sometimes, you might want to artificially restrict the runtime of a query. You can do this using the LIMIT:<n></tt> directive in the query: This will kill the query automatically if it runs for longer than 60 seconds.

This is useful for web tools, where you want to restrict the runtime of a query and present an error to the user if it runs too long. (Rather than allowing the connection to time out, or the user to get bored and go away.) It can also be used to limit execution time of a query which is usually fast, but sometimes runs for longer than desired.

phpMyAdmin
phpMyAdmin is a browser-based GUI interface for MySQL. To access phpMyAdmin on the Toolserver, visit https://phpmyadmin.toolserver.org and use your Toolserver username and your LDAP password for the first dialog. When you see the phpMyAdmin login page use your username and your MySQL password (which can be found in the .my.cnf file in your home directory).

Database schema
The current MediaWiki database schema is available in the MediaWiki svn tree at /trunk/phase3/maintenance/tables.sql. A (possibly out of date) description of the schema is available at mw:Manual:Database layout.

Page text access
FIXME

General caveats
FIXME

Toolserver database
The toolserver (meta-)database is a database called "toolserver", containing information about other databases on the toolserver. It can be read by anyone and is available on all database servers. The host name of the primary copy is sql-toolserver</tt>.

The toolserver database is updated automatically on a daily basis. It is maintained by User:Duesentrieb, but please report any problems to JIRA.

Tables in the toolserver database:

wiki
Information about the wikis for which the Toolserver has data.

So, for example, to get SQL server, database name and the API URLs of the 10 largest Wikipedias, you could do this:

namespace (deprecated)
This table contains the primary namespace names for the wikis in the wiki table. Use of this table is deprecated, use the namespacename table instead, which covers all valid names for each namespace in each wiki, not just the primary names.

namespacename
This table contains the all valid namespace names for the wikis in the wiki table. Use of this table is preferred over the namespace table, since it covers all valid names for each namespace in each wiki, not just the primary names.

This can be used to build wiki-syntax links from information from the database, e.g.:

language
This table contains language names, by code, in English and the respective language.

For example, to get the languages of the top 10 Wikipedias:

User databases
By default, once a user is created an empty  database is created as well. Users can create additional databases by following the  pattern. For example, if user dab wanted to create a database named, he would name it.

You can allow other users access to your databases by adding the  suffix to the end of the database names. For example, if you wanted to share the previously created image database, you could rename it:

User databases are usually stored on the sql</tt> server, and this is where you should connect to use them. However, sometimes it is helpful to create a user database on the same server as one of the replicated databases, e.g. to run joins. To do this, connect to the user server instead of the RR server (described above), which will allow you to create user databases.

Database status files
There are status-files about each cluster (and the user databases) at /var/www/status_sclusternumber or  http ://toolserver.org/status_sclusternumber</tt> (e.g., http://toolserver.org/status_s1). In both cases, clusternumber can be an integer from 1 to 6 inclusive.

These files are simple ASCII text files. Lines beginning with " " are comment lines and can safely be ignored. A non-comment-line " " indicates that everything is alright and no known maintenance or service interruption is scheduled.

LOAD DATA LOCAL
By default, MySQL does not allow the LOAD DATA LOCAL command to be used for security reasons. If you want to allow this, you can edit your $HOME/.my.cnf</tt> file to add the local-infile</tt> option:

[client] ... [mysql] local-infile

For more information, see the MySQL manual.

External access
Creating a tunnel in Windows/PuTTY: In configuration, Connection > SSH > Tunnels. Source port: 3306, Destination: enwiki-p.userdb.toolserver.org:3306, check Local, click Add. Creating a tunnel in Unix/Linux: ssh -L 3307:enwiki-p.userdb.toolserver.org:3306 username@nightshade.toolserver.org Follow the steps above to for creating a tunnel, then follow Connect Excel to MySQL Database.
 * Excel

Managing
rmytop -h sql-s1-user -h sql-s2-user -h sql-s3-user