Manual talk:Categorylinks table

From MediaWiki.org
Jump to: navigation, search

Suggested Database Engines[edit]

What are the suggested mysql database engines for this table? Wissenslogistiker 15:55, 5 March 2006 (UTC)

InnoDB as for nearly all tables in a MediaWiki database. --88.130.126.202 02:01, 25 January 2012 (UTC)

cl_type[edit]

This field is not currently documented. Incnis Mrsi 13:37, 3 January 2012 (UTC)

Fixed
Added. --88.130.69.106 13:41, 30 January 2012 (UTC)

Sortkeys in database vs. sortkeys provided in category links[edit]

I think the Fields section here needs to clarify the connection between sortkeys used in the database (cl_sortkey and cl_sortkey_prefix) and sorkeys provided by wiki editors in category links of the form [[:Category:Title|sortkey]]. I've just edited the lead to (hopefully?) lay the groundwork for this, but I don't know if my changes to the Fields section would even be totally correct, since I only understand category links from a wiki editor's perspective). Can someone please clarify whether the parenthetical remark "aka the sortkey is unspecified" in the description of cl_sortkey_prefix is referring to user-supplied sortkeys (after the pipe in category links) and whether "Otherwise it is the human readable version of cl_sortkey" means that it contains the literal user-supplied sortkey text? - dcljr (talk) 07:39, 22 March 2012 (UTC)

Extracting category information by simple text searching of the sql dump file[edit]

I've been wanting to get a set of categories for each page of an entire wiki using the sql categorylinks database dump. That's helpful, e.g. when identifying the licensing for a given image file wikimedia commons. Instructions on the internet for using these SQL dumpfiles assume you have a running mySQL database to load the dumpfiles into. But this is slow and inconvenient if you just want to grab a basic list of categories.

To help extract data from these wikimedia sql dump files, I've cobbled together the following perl script. It extracts e.g. the current version of the wiki commons dump (http://dumps.wikimedia.org/commonswiki/20130216/commonswiki-20130216-categorylinks.sql.gz) and prints the categories out to a text file. This file has the category names listed on the line corresponding to the page_id, for easy access. For example, the 2 millionth line in the output, when run on this categorylinks file is:

2000000:'Narda','Our_Lady_of_Sorrows_churches_in_Hungary','PD-self','Roman_Catholic_Churches_in_Vas_County'

Showing the categories for page_id 2000000 (http://commons.wikimedia.org/wiki/File:Kisnarda01.jpg).

This script takes under an hour to process the 18GB commons file on my old computer. I suspect it can also be sped up by some clever perl programmer. Perhaps other people will find such a script useful? It is rather dependent on the exact layout of the sql dump file, though.

#!/usr/bin/perl
use warnings;
use strict;

# Use on wikimedia databases.
#
# Put the category entries (2nd field in the table) on the line
# corresponding to their page_id (first field in the table)
# Requires each entry to be ordered by page_id. 
#
# Call as e.g.
# > thisScript.pl commonswiki-20130216-categorylinks.sql > Outfile.unicodetext

my $started = 0;
my $apostrophes = 0;
my $fullline="";
my $prev_id=100000; # hacky. This is a number greater than the first page_id
$/ = '(';
while (my $line = <>) {
	unless ($started) {
		$started = 1 if ($line =~ /INSERT INTO .categorylinks. VALUES\s+\($/);
	} else {
		#count the number of unescaped apostrophes
		$apostrophes += () = ($line =~ m/(?<!\\)(?:\\\\)*'/g); #see http://stackoverflow.com/questions/56554/what-is-the-proper-regular-expression-for-an-unescaped-backslash-before-a-charac
		$fullline.=$line;
		unless ($apostrophes % 2) {
			$apostrophes=0;
			if ($fullline =~ /^(\d+),\s*('.+?(?<!\\)(?:\\\\)*')/) {
				my $diff = $1 - $prev_id;
				$2 =~ s/\n\r//g;
				if ($diff==0) {
					print ','.$2;
				} elsif ($diff>0) {
					print "\n" x $diff;
					print $1.':'.$2;
				} else { #either this is the first one, or we have problems
					die "Page_ids not in order for ($fullline)" if ($started==2);
					$started=2;
					print "\n" x ($1-1);
					print $1.':'.$2;
				}
				$fullline="";
				$prev_id=$1;			
			}
		}
	}
}

HYanWong (talk) 02:28, 1 March 2013 (UTC)

And some equivalent C code that does it in 10 mins
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(int argc, char *argv[] ) {
    FILE *in;
    FILE *out;
    long long unsigned int i=0;
    long long int page_id;
    long long int prev_id=1;
    wint_t c;
    int quotes_in_line=0;
    int nesting=0;
    int escapes=0;
    int cutoff = 2;
    int n;
    wchar_t *buffer = NULL;
    const wchar_t *start_after=L"INSERT INTO `categorylinks` VALUES";
    buffer = malloc(wcslen(start_after+1) * sizeof(wchar_t));
 
    printf("Reading wiki SQL dump from %s\n", argv[1]);
    in=fopen(argv[1], "r");
    printf("Writing to %s\n", argv[2]);
    out=fopen(argv[2], "w");
 
    printf("Looking for a line starting: \"%ls\" ... ", start_after);
    do { /* discard the file up to a line beginning with the start_after string*/
    	fgetws(buffer, wcslen(start_after)+1,in);
    	if (buffer==NULL) return 2;
    } while (wcscmp(buffer, start_after));
	if (ferror(in) || feof(in)) {
	   printf(" Oops - error!\n");
      	return(3);
    };
    ungetwc(')',in);

    printf("Found.\nStarting to read data: outputting a | every million page_ids read\n");
    fputwc('1', out); /* hack - the first one dean't print a number */
    while(c=fgetwc(in)) { /*read character-by-character */
    	switch(c) {
    		case '\\': /* backslash used to escape, e.g. quote marks */
    			escapes++;
      			if (quotes_in_line < cutoff) fputwc(c, out);
      			break;
    		case '\'': /* only bother with quotes that aren't escaped */
      			if (quotes_in_line < cutoff) fputwc(c, out);
				if ((escapes % 2) == 0) {
					nesting = 1-nesting;
					quotes_in_line++;
				};
    			escapes=0;
	    		break;

    		case ')': /* this could be the end of a record, if ) is not in a quoted string*/
       			if (nesting) {
                          if (quotes_in_line < cutoff) fputwc(c, out);
                       } else {
      				quotes_in_line=0;
      				while (fgetwc(in)!='(') {/* zoom to the next record */
      					if (feof(in)) {
      						printf(" Done!\n");
      						return(0);
      					};
      					if (ferror(in)) {
							printf(" Oops- error!\n");
      						return(1);
      					};
      				}; 
      				if (fscanf(in, "%Ld", &page_id)) { /* only print out the record number on the first go */
      					if (page_id-prev_id != 0) {
      						if (page_id > prev_id) {
	      							for(n=page_id-prev_id;n;n--) {
				      					fputwc('\n', out);
		      						};
	      						fprintf(out, "%Ld", page_id);
      						} else {
								printf("Page_ids not in order for page_ids %Ld and %Ld\n", prev_id, page_id);
      						};
      						prev_id=page_id;
      					}
      				} else {
      					if (fscanf(in, ";")) { //end of this input statement
      					};
						printf("Something wrong near line %Ld\n", prev_id); 
						return 1;
      				};
      				if ((++i % 100000) == 0) {(i % 1000000) ? printf(".") : printf("|"); fflush(stdout);};
      			}
    			escapes=0;
	    		break;
    		case WEOF:
    			printf(" Oops. Come to the end of the file but haven't finished a record. Is the file truncated?\n");
	    		return(1);
    		default:
      			if (quotes_in_line < cutoff) fputwc(c, out);
	    		escapes=0;
    		}
      }
    fclose(in);
    fclose(out);
    return 0;
}
HYanWong (talk) 14:23, 1 March 2013 (UTC)

Primary key[edit]

With regard to https://bugzilla.wikimedia.org/show_bug.cgi?id=15441#c2 , does this page belong in Category:MediaWiki database tables that lack a primary key? cl_from and cl_to are PRI. Leucosticte (talk) 02:10, 24 September 2014 (UTC)