Toolserver:Code snippets

From mediawiki.org

This page was moved from the Toolserver wiki.
Toolserver has been replaced by Toolforge. As such, the instructions here may no longer work, but may still be of historical interest.
Please help by updating examples, links, template links, etc. If a page is still relevant, move it to a normal title and leave a redirect.

MySQL[edit]

Calculate replag[edit]

SELECT UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) FROM recentchanges;

NB: This actually shows how long ago the last edit to the wiki was. If the wiki hasn't been edited for a while, it might appear the replag is higher than it really is, because the last edit was some time ago.

Show your processes[edit]

show full processlist\G

Select a server[edit]

See Queries#Accessing the data for a description of the "sql" script. To switch databases such commonwiki_p to enwiki_p, USE enwiki_p

Copying user database[edit]

You can dump your user database from sql and import it to sql-s1, 2 and 3 using:

mysqldump -h sql u_username | mysql -h sql-s1 u_username

You can also produce a database dump during the process (can be useful to restore the database in case of problems):

mysqldump -h sql u_username | tee u_username.sql | mysql -h sql-s1 u_username

Copy a table within a database[edit]

You can copy the structure of table "foo" to a new table "bar" with the following:

CREATE TABLE bar LIKE foo;

You can then copy the data with the following:

INSERT INTO bar SELECT * FROM foo;

This produces a copy of the table foo.

Fix UTF-8 encoded as latin-1[edit]

Versions of MySQL before 5 did not have binary nor UTF-8 support (or at least nobody cared). A common workaround was to encode the text as a UTF-8 byte sequence and give that to MySQL which read the byte in as if they were latin-1 characters. This worked alright until MySQL and VARCHAR became more Unicode aware. Now depending on settings, it may convert the latin-1 bytes to Unicode code points then encode to your specified encoding. So 'è' (U+00E8, UTF-8: c3 a8) becomes 'è' (U+00C3 U+00A8, UTF-8: c3 83 c2 a8), the equivalent python transformation is u'è'.encode('utf-8').decode('latin').encode('utf-8'). This double encoding is the reason why JOINs will fail with the newer VARBINARY fields used everywhere by MediaWiki now. Now how to convert your pseudo UTF-8 fields into actual utf8:

-- Leave out the outer CONVERT if only binary data is required
CONVERT( CONVERT( CONVERT( your_column USING latin1) USING binary) USING utf8);
Notes
  • LIKE is case sensitive with VARBINARY fields which are standard now in MediaWiki
  • VARCHAR will treat input as Unicode code points, encoding is specified by the collation attribute. Multibyte encoding like utf8, VARCHAR(255) may be larger than 255 bytes.
  • Since one of the clusters is still running MySQL 4, it is recommended to avoid using VARCHAR in user databases
  • Information in part based on Fixing character sets in MySQL

shell[edit]

Kill processes[edit]

To get the process ID (pid), see viewing processes. Once you have the pid, you can kill it:

kill pid

If the process doesn't exit, kill it harder:

kill -9 pid

But be aware that kill -9 doesn't give the process any opportunity to clean up.

If the process is running in the foreground (i.e. from the shell), you can also kill it with CTRL-C. If that doesn't work, try CTRL-\ (which like kill -9, gives the process no opportunity to clean up).

To kill all your processes at once, including any and all ssh processes:

pkill -u username

Calculate size of home[edit]

du -hs ~

Find folders taking up space[edit]

du—max-depth=1 | sort -k 1 -nr | head -n 10

Sort log files[edit]

Say if you've accidently, or intentionally, concatenated a bunch of Apache access logs together so they're not sorted anymore, the following sort will generally get you sorted.

sort -t ' ' -k 4.9,4.12n -k 4.5,4.7M -k 4.2,4.3n -k 4.14,4.15n -k 4.17,4.18n -k 4.20,4.21n

Check uptime of the cluster[edit]

ruptime -a

Bash[edit]

These scripts and settings should be placed in your .bash_profile file. After updating the file, you need to source it in order for the changes to become active ($ source ~/.bash_profile).

The env command gives a list of variables with their results for your current environment.

Setting a default editor[edit]

The default editor on nightshade is currently nano. If you want to change your personal default editor (for example, to use joe), use:

export EDITOR=joe

in ~/.bash_profile

MySQL queries[edit]

If you primarily deal with running queries on one particular database, there are bash scripts that can make life easier. The "sql" script is great for selecting the appropriate server, however it requires a lot of typing for commonly needed functions. And when a query finishes, you don't know how many results were output.

#!/bin/bash
query () { sql enwiki_p < "$1.sql" > "$1.txt"; echo `expr $(cat "$1.txt" | wc -l) - 1` rows; }
gquery () { sql enwiki_p < "$1.sql" > "$1.txt"; echo `expr $(cat "$1.txt" | wc -l) - 1` rows; gzip "$1.txt"; }

The "query" script runs an .sql file on the particular database and then prints the number of lines returned from the query.

The "gquery" scripts run an .sql file on the particular database, prints the number of lines returned from the query, and then gzips the output.

Both scripts are invoked using the script name followed by the file name (without an extension), for example:

query articles-ns-0

Symlinks[edit]

If you are constantly needing to make files public from a particular directory, it's possible to create a "mkpub" shortcut to symlink the files to your public directory (public_html/).

#!/bin/bash
mkpub () { ln -s "$PWD/$*" $HOME/public_html ; }

Search array[edit]

If you need to know whether something is in an array of things:

#!/bin/bash
isin() { local needle=$1 element; shift; for element; do [[ $element = "$needle" ]] && return; done; return 1; }

then use it:

if isin "$dir" "${arrayofdirs[@]}"; then
    #...
fi

Output to pastebin[edit]

Add this alias to ~/.bash_aliases:

$ echo "alias pastebin=\"curl -F 'sprunge=<-' http://sprunge.us\"" >> ~/.bash_aliases

Then, to easily send data to a pastebin, pipe it into that alias:

$ ls | pastebin
 http://sprunge.us/ZJZZ

Untar a file[edit]

$ tar -zxvf yourfile.tar.gz

Regular expressions[edit]

IP address notations[edit]

Dotted decimal IPv4 address formats[edit]

Simple regexp for addresses in lenient formats[edit]

This detects all IPv4 addresses made with 1 to 3 digits (including extra leading zeroes) per component:

([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))<!--
 -->\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))<!--
 -->\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))<!--
 -->\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))

When there's a match, the four decimal components are returned in \1, \3, \5, \7 (only one substring for each). If the regexp engine supports counters and can return an array of matches, it can even be compacted into:

([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))<!--
-->(\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3}

where the four components are returned in \1 (one substring), \3 (three substrings).

Note that this will unexpectedly match "000.000.000.000" or "010.01.001.1" (which are NOT usable as valid hostnames in IPv4 address format ; this means that such names will have to be resolved into some unknown address via a DNS hostname resolution query). Note that such pseudo-IP address formats are currently being used as valid user account names in Wikimedia sites, but NOT used by anonymous users just identified by their IPv4 address.

Improved regexp for addresses in shortest canonical format[edit]

This is a more restrictive version which detects only IPv4 addresses with a valid decimal format, without any extra leading zeroes in components, where the matched decimal 8-bit components are returned in \1, \3, \5, \7 (one substring for each):

(0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)<!--
-->\.(0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)<!--
-->\.(0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)<!--
-->\.(0|1[0-9]?[0-9]?|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)

If the regexp engine supports counters and can return arrays for matches, you may also use this equivalent compact regexp where \1 matches the first decimal 8-bit component, and \3 is matched 3 times (once for each of the last three components):

(0|1[0-9]{0,2}|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)<!--
-->(\.(0|1[0-9]{0,2}|2([0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?)){3}
Quick and easy PHP boolean check[edit]

This is a faster, more compact, and easier to remember than using a regex with preg_match in PHP. It converts the given string to a valid Internet standard format string, and attempts to convert it back. If the string can be converted both ways, then it is a valid IP. If not, it is invalid.

if( long2ip( ip2long( "127.0.0.1" ) ) == "127.0.0.1" ) {
    echo "This is a valid IP address.";
}
if( long2ip( ip2long( "127.0.0" ) ) == "127.0.0" ) {
    echo "This is not a valid IP address.";
}

Hexadecimal-only IPv6 address formats[edit]

The valid IPv6 address format is less restricted than the one for IPv4, because its use within URL's is possible ONLY within delimiting [square brackets], with which such host address will never be confused with a valid DNS hostname. For this reason, extra leading zeroes are accepted (provided that each component doesn't have more than 4 hex digits). Additionally, the letter case of hex digits is not significant.

Addresses in unabbreviated formats[edit]

Here is a simple deterministic regexp matching only unabbreviated IPv6 addresses, where the hex components are returned in \1 (one string) and \3 (seven strings):

([0-9A-Fa-f]{1,4})(:([0-9A-Fa-f]{1,4})){7}
Addresses in canonical shortest unabbreviated format[edit]

Such unabbreviated address may still be canonicalized into uppercase and in its shortest form (without unnecessary leading zeroes), in which case the following regexp will only match this shortest unabbreviated form (which should be the one used for anonymous IPv6 users on MediaWiki sites where the first character of user account names is forced to uppercase), where the hex components are also returned in \1 (one string) and \3 (seven strings):

(0|[1-9A-F][0-9A-F]{0,3})(:(0|[1-9A-F][0-9A-F]{0,3})){7}
Addresses in abbreviated formats[edit]

However, the complexity of the standard notation is that one or more successive 16-bit hex components within the eight (separated by colons) may be dropped in the abbreviated notation (provided these components are all zero), by accepting that ONLY ONE of the specified components being empty (with such syntax, the total number of colons may be between 2 and 8, instead of just 7, and there may be between 0 and 8 hexadecimal 16-bit numbers) ; this will result in one possible occurrence of double colons "::".

A deterministic regexp to match all abbreviated and unabbreviated IPv6 addresses follows (if you want to understand how it is structured, look at the wiki code source):

<!--

-->(::<!-- -->([0-9A-Fa-f]{1,4}(<!-- -->:[0-9A-Fa-f]{1,4}<!-- -->){0,6})?<!-- -->|[0-9A-Fa-f]{1,4}<!--

  -->:(:<!--   -->([0-9A-Fa-f]{1,4}(<!--   -->:[0-9A-Fa-f]{1,4}<!--   -->){0,5})?<!--
   -->|[0-9A-Fa-f]{1,4}<!--
      -->:(:<!--   -->([0-9A-Fa-f]{1,4}(<!--   -->:[0-9A-Fa-f]{1,4}<!--   -->){0,4})?<!--
       -->|[0-9A-Fa-f]{1,4}<!--
          -->:(:<!--   -->([0-9A-Fa-f]{1,4}(<!--   -->:[0-9A-Fa-f]{1,4}<!--   -->){0,3})?<!--
           -->|[0-9A-Fa-f]{1,4}<!--
              -->:(:<!--   -->([0-9A-Fa-f]{1,4}(<!--   -->:[0-9A-Fa-f]{1,4}<!--   -->){0,2})?<!--
               -->|[0-9A-Fa-f]{1,4}<!--
                  -->:(:<!--   -->([0-9A-Fa-f]{1,4}(<!--   -->:[0-9A-Fa-f]{1,4}<!--   -->)?)?<!--
                   -->|[0-9A-Fa-f]{1,4}<!--
                      -->:(:<!--   -->([0-9A-Fa-f]{1,4})?<!--
                       -->|[0-9A-Fa-f]{1,4}<!--
                          -->:(:<!--
                           -->|[0-9A-Fa-f]{1,4}<!--
                           -->)<!--
                       -->)<!--
                   -->)<!--
               -->)<!--
           -->)<!--
       -->)<!--
   -->)<!--
-->)

For example, this regexp will accept "::", or "::0", or "0::", or "0::0", or the shortest unabbreviated format "0:0:0:0:0:0:0:0", or the full format "0000:0000:0000:0000:0000:0000:0000:0000" as they are all equivalent IPv6 addresses with a valid syntax. The strings "0:0:0:0:0:0:0:0::", or "00000000", or "::0::", or "0::0::0" will not be accepted as they are not in a valid IPv6 address format (too many colons, or missing colons with too many digits in a component, or ambiguous notation in the last two cases).

It is still preferable to canonicalize all abbreviated IPv6 addresses into the shortest unabbreviated form and with a single lettercase as shown above, notably if the address is used as the default user name (or talk page name) for contributing users not connected with their own named account, and that will be recorded in edit histories. (More information would be welcome from MediaWiki developers about which canonical format they will use as the default user name, for IPv6 users editing pages without being logged on, or for the private webserver logs.)

Dotted decimal or hexadecimal IPv4-mapped IPv6 address abbreviated formats[edit]

IPv4 addresses can also be mapped to IPv6 addresses to which they are implicitly bound on all network interfaces supporting the two routing protocols; for convenience, these can noted with the abbreviated IPv6 notation, by prefixing the "::" shorthand to the IPv4 decimal dotted notation. Although the addresses seem similar, they must be considered distinct because they may be routed differently, when datagram are sent to different routers with one of them not supporting both network address protocols simultaneously with the same routes and similar QOS and security options.

Here also, such IPv6 address can only be used within [squared brackets] as a hostname in URLs, and due to this, the restriction on extra leading zeroes does not apply (but each decimal components must still be limited to at most 3 digits, and at least one); for example "::000.000.000.000" is a valid IPv6 address format, equivalent to the IPv6 address "::" (and it is mapped with the IPv4 "0.0.0.0" address):

:<!--
  -->:(<!--
       -->([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))<!--
    -->(\.([01]?[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3}<!--
   -->|([0-9A-Fa-f]{1,4}):([0-9A-Fa-f]{1,4})<!--
-->)

where the matches are returned either:

  • in \2 (one substring) and \4 (three substrings) for each decimal 8-bit component; or
  • in \5 (one substring) and \7 (seven substrings) for each hexadecimal 16-bit components.

Complete IPv6 address formats[edit]

The abbreviated notation for IPv4-mapped IPv6 addresses can also be used as a suffix that follows any 96-bit IPv6 prefix (other than just "::", i.e. the null prefix), and so should be combined within the IPv6 address syntax everywhere there can be at least two hexadecimal 16-bit segments (separated by a single colon) at end of the notation.

Perl[edit]

'Unit' testing[edit]

If you need to check various pieces of data against multiple tests and see what data passes each test a quick and dirty way to do that is like so

Dealing with UTF-8[edit]

One problem you might run into, especially when dealing with non-English wikis, is that page titles can contain utf8. Perl uses an internal flag to denote whether or not a string is utf8 or not. Concatenating or processing strings with mixed flags can often lead to strange side effects such as double encoding or getting mangled. Doing the following can help avoid most, if not all, problems:

#!/usr/bin/perl
use Encode;
binmode STDOUT, ":encoding(utf8)"; #Of course having a terminal that handles utf8 is required.
...
$string = decode('utf8',$string); #For example a page_title fetched from the DB
...

Category:Documentation