Wik2dict

I want to have some Wiki information on my hard drive - and use it with the DICT interface. So I wrote a little conversion program in Python. It is also capable of downloading Wikipedia, Wiktionary and Wikibooks SQL dumps for you.

Warning: It might work for you, or it might not. It works fine for me. :) Usage reports or bug reports are welcome.

Bugs

 * English Wikipedia is still too big for my computer's memory (640 MB). All the rest seems to work fine. Guaka 19:27, 28 Jul 2004 (UTC)

install-dicts-debian.sh
    #!/bin/bash
    /etc/init.d/dictd stop
    cp $1*dict.dz $1*index /usr/share/dictd/
    dictdconfig -w                        # DON'T FORGET TO USE dictdconfig -w
    /etc/init.d/dictd start

wik2dict.py

#! /usr/bin/env python
# -*- coding: utf-8 -*-

#   This program is free software; you can redistribute it and/or modify  #
#   it under the terms of the GNU General Public License as published by  #
#   the Free Software Foundation; either version 2 of the License, or     #
#   (at your option) any later version.                                   #

program = "wik2dict.py"
# (c) 2004 Guaka
# convert MediaWiki SQL dumps into dict format

version, date = "0.2.1", "July 28th, 2004"

"""
install-dicts-debian.sh

#!/bin/bash
/etc/init.d/dictd stop
cp $1*dict.dz $1*index /usr/share/dictd/
dictdconfig -w                        # DON'T FORGET TO USE dictdconfig -w
/etc/init.d/dictd start
"""
# 0.2.1, July 28th, 2004
# - More readable name of downloaded SQL dump.
# - Now uses dump filename if Wikititlesuffix is not asciiletters or
#   if it is "Wikipedia", "Wikibooks" or "Wiktionary".
# 0.2, July 27th, 2004
# - improved layout
# - more functionality
#   - more files at once, keeping MySQL database
#   - downloading Wikimedia SQL dumps
# 0.1, July 25th, 2004
# - basic functionality

# The English Wikipedia (~300.000 articles) is too big for my computer's memory.
# The German Wikipedia (~100.000) took quite a while, but converted.

# TODO:
# - Speedups.
#   - More regular expressions.
#   - string.translate?  regex.compile?
#   - Preprocess messages.
#   - It seems that only cur_namespace in [0, 8, 10] is used; dropping the rest
#     might speed things up a bit:
#     "DELETE FROM cur WHERE NOT (cur_namespace = 0 OR cur_namespace = 8 OR cur_namespace = 10);"
# - Improve layout stuff.
#   - Do something useful with Wiki-tables.
#   - Bugsquishing.
# - Error handling.
# - Unicode; current problems might be due to stuff still being non-Unicode.
# - Option for deleting processed SQL dump files.
# - Get to know more about the dict format's possibilities:
#   - Links to other stuff than what's shown?
#     (So that other things can link to something.)
#   - External links without "{bla bla (http://test.com)}" mark-up?
#   - Bold or italic text?

# TODO, low priority (for Guaka at least):
# - NUMBEROFARTICLES is probably higher than what's used online.
# - i18n of.
# - More sophisticated way of dumping the SQL dump into MySQL.
# - Make dictzip optional.
# - Fine-tune downloading of recent Wikimedia database dumps.

# Has worked for:
# * Wikipedia: nl
# * Wiktionary: en, es, fr, nl
# * Wikiquote: en
# * Wikibooks: en
# Worked, but with weird characters in filenames, which dictd does not like:
# * Wikipedia: es, fr, it
# Worked, but the filename was the same as for the English version:
# * Wikipedia: de
# * Wiktionary: de
# Doesn't work yet for:
# * Wikipedia: en
# * all the Wiktionaries that haven't set Wikititlesuffix --
#   they tend to overwrite each other

# Requirements:
#   Python >= 2.3  (possibly >= 2.3.4, definitely >= 2.3.0)
#   running MySQL server  (possibly >= 4.0.17)
#   Python modules:
#     dictdlib  (possibly >= 2.0.3)
#     MySQLdb   (possibly >= 0.9.2)
#   dictzip

# standard Python modules
import os
import sys
import re
import time
import string
import commands
import textwrap
import htmlentitydefs
import urllib
from sets import Set
from optparse import OptionParser

# non-standard Python modules
import MySQLdb
import dictdlib

# list from MediaWiki 1.2.6's languages/Language.php:
lang_iso_codes = {
    "aa" : "Afar",                      # Afar
    "ab" : "Abkhazian",                 # Abkhazian - FIXME
    "af" : "Afrikaans",                 # Afrikaans
    "ak" : "Akana",                     # Akan
    "als" : "Els&auml;ssisch",          # Alsatian
    "am" : "&#4768;&#4635;&#4653;&#4763;",      # Amharic
    "ar" : "&#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;",      # Arabic
    "arc" : "&#1813;&#1829;&#1810;&#1834;&#1848;&#1821;&#1819;",     # Aramaic
    "as" : "&#2437;&#2488;&#2478;&#2496;&#2527;&#2494;",     # Assamese
    "av" : "&#1040;&#1074;&#1072;&#1088;",      # Avar
    "ay" : "Aymar",                     # Aymara
    "az" : "Az&#601;rbaycan",           # Azerbaijani
    "ba" : "&#1041;&#1072;&#1096;&#1185;&#1086;&#1088;&#1090;",      # Bashkir
    "be" : "&#1041;&#1077;&#1083;&#1072;&#1088;&#1091;&#1089;&#1082;&#1072;&#1103;",     # Belarusian or Byelarussian
    "bg" : "&#1041;&#1098;&#1083;&#1075;&#1072;&#1088;&#1089;&#1082;&#1080;",    # Bulgarian
    "bh" : "Bihara",
    "bi" : "Bislama",                   # Bislama
    "bn" : "&#2476;&#2494;&#2434;&#2482;&#2494; - (Bangla)",         # Bengali
    "bo" : "Bod skad",                  # Tibetan
    "br" : "Brezhoneg",                 # Breton
    "bs" : "Bosanski",                  # Bosnian
    "ca" : "Catal&agrave;",             # Catalan
    "ce" : "&#1053;&#1086;&#1093;&#1095;&#1080;&#1081;&#1085;",      # Chechen
    "ch" : "Chamoru",                   # Chamorro
    "chy" : "Tsets&ecirc;hest&acirc;hese",      # Cheyenne
    "co" : "Corsu",                     # Corsican
    "cr" : "Nehiyaw",                   # Cree
    "cs" : "&#268;esky",                # Czech
    "csb" : "Cassubian",                # Cassubian - FIXME
    "cv" : "&#1063;&#1233;&#1074;&#1072;&#1096; - (&#264;&#259;va&#349;)",       # Chuvash
    "cy" : "Cymraeg",                   # Welsh
    "da" : "Dansk",                     # Danish
    "de" : "Deutsch",                   # German
    "dk" : "Dansk",                     # 'da' is correct for the language.
    "dv" : "Dhivehi",                   # Dhivehi
    "dz" : "(Dzongkha)",                # Bhutani
    "ee" : "Eve",                       # Eve
    "el" : "&#917;&#955;&#955;&#951;&#957;&#953;&#954;&#940;",       # Greek
    "en" : "English",                   # English
    "eo" : "Esperanto",                 # Esperanto
    "es" : "Espa&ntilde;ol",            # Spanish
    "et" : "Eesti",                     # Estonian
    "eu" : "Euskara",                   # Basque
    "fa" : "&#1601;&#1575;&#1585;&#1587;&#1740;",    # Persian
    "ff" : "Fulfulde",                  # Fulfulde
    "fi" : "Suomi",                     # Finnish
    "fj" : "Na Vosa Vakaviti",          # Fijian
    "fo" : "F&oslash;royskt",           # Faroese
    "fr" : "Fran&ccedil;ais",           # French
    "fy" : "Frysk",                     # Frisian
    "ga" : "Gaeilge",                   # Irish
    "gd" : "G&agrave;idhlig",           # Scots Gaelic
    "gl" : "Galego",                    # Gallegan
    "gn" : "Ava&ntilde;e'&#7869;",      # Guarani
    "gu" : "&#2711;&#2753;&#2716;&#2736;&#2750;&#2724;&#2752;",      # Gujarati
    "gv" : "Gaelg",                     # Manx
    "ha" : "&#1607;&#1614;&#1608;&#1615;&#1587;&#1614;",     # Hausa
    "haw" : "Hawai`i",                  # Hawaiian
    "he" : "&#1506;&#1489;&#1512;&#1497;&#1514;",    # Hebrew
    "hi" : "&#2361;&#2367;&#2344;&#2381;&#2342;&#2368;",     # Hindi
    "hr" : "Hrvatski",                  # Croatian
    "hu" : "Magyar",                    # Hungarian
    "hy" : "&#1344;&#1377;&#1397;&#1381;&#1408;&#1381;&#1398;",      # Armenian
    "hz" : "Otsiherero",                # Herero
    "ia" : "Interlingua",               # Interlingua (IALA)
    "id" : "Bahasa Indonesia",          # Indonesian
    "ie" : "Interlingue",               # Interlingue (Occidental)
    "ig" : "Igbo",                      # Igbo
    "ik" : "I&ntilde;upiak",            # Inupiak
    "io" : "Ido",                       # Ido
    "is" : "&Iacute;slensk",            # Icelandic
    "it" : "Italiano",                  # Italian
    "iu" : "&#5123;&#5316;&#5251;&#5198;&#5200;&#5222;",     # Inuktitut
    "ja" : "&#26085;&#26412;&#35486;",  # Japanese
    "jv" : "Bahasa Jawa",               # Javanese
    "ka" : "&#4325;&#4304;&#4320;&#4311;&#4323;&#4314;&#4312;",      # Georgian
    "kk" : "&#1179;&#1072;&#1079;&#1072;&#1179;&#1096;&#1072;",      # Kazakh
    "kl" : "Kalaallisut",               # Greenlandic
    "km" : "&#6039;&#6070;&#6047;&#6070;&#6017;&#6098;&#6040;&#6082;&#6042;",    # Cambodian
    "kn" : "&#3221;&#3240;&#3277;&#3240;&#3233;",    # Kannada
    "ko" : "&#54620;&#44397;&#50612;",  # Korean
    "ks" : "&#2325;&#2358;&#2381;&#2350;&#2368;&#2352;&#2368; - (&#1603;&#1588;&#1605;&#1610;&#1585;&#1610;)",   # Kashmiri
    "ku" : "Kurd&icirc;",               # Kurdish
    "kw" : "Kernewek",                  # Cornish
    "ky" : "Kirghiz",
    "la" : "Latina",                    # Latin
    "lb" : "L&euml;tzebuergesch",       # Luxemburguish
    "li" : "Limburgs",                  # Limburgian
    "ln" : "Lingala",                   # Lingala
    "lo" : "(Pha xa lao)",              # Laotian
    "lt" : "Lietuvi&#371;",             # Lithuanian
    "lv" : "Latvie&scaron;u",           # Latvian
    "mg" : "Malagasy",                  # Malagasy - FIXME
    "mh" : "Ebon",                      # Marshallese
    "mi" : "M&#257;ori",                # Maori
    "mk" : "&#1052;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1089;&#1082;&#1080;",     # Macedonian
    "ml" : "&#3374;&#3378;&#3375;&#3390;&#3379;&#3330;",     # Malayalam
    "mn" : "&#1052;&#1086;&#1085;&#1075;&#1086;&#1083;",     # Mongolian
    "mo" : "Moldoveana",                # Moldovan
    "mr" : "&#2350;&#2352;&#2366;&#2336;&#2368;",    # Marathi
    "ms" : "Bahasa Melayu",             # Malay
    "mt" : "bil-Malti",                 # Maltese
    "my" : "(Myanmasa)",                # Burmese
    "na" : "Nauru",                     # Nauruan
    "nah" : "Nahuatl",
    "nds" : "Platd&uuml;&uuml;tsch",    # Low German or Low Saxon
    "ne" : "&#2344;&#2375;&#2346;&#2366;&#2354;&#2368;",     # Nepali
    "nl" : "Nederlands",                # Dutch
    "nb" : "Norsk",                     # Norwegian [currently using old no code]
    "nn" : "Nynorsk",                   # (Norwegian) Nynorsk
    "no" : "Norsk",
    "nv" : "Din&eacute; bizaad",        # Navajo
    "ny" : "Chi-Chewa",                 # Chichewa
    "oc" : "Occitan",                   # Occitan
    "om" : "Oromoo",                    # Oromo
    "or" : "Oriya",                     # Oriya - FIXME
    "pa" : "&#2346;&#2306;&#2332;&#2366;&#2348;&#2368; / &#2602;&#2588;&#2622;&#2604;&#2624; / &#1662;&#1606;&#1580;&#1575;&#1576;&#1610;",      # Punjabi
    "pi" : "&#2346;&#2366;&#2367;&#2356;",      # Pali
    "pl" : "Polski",                    # Polish
    "ps" : "&#1662;&#1690;&#1578;&#1608;",      # Pashto
    "pt" : "Portugu&ecirc;s",           # Portuguese
    "qu" : "Runa Simi",                 # Quechua
    "rm" : "Rumantsch",                 # Raeto-Romance
    "rn" : "Kirundi",                   # Kirundi
    "ro" : "Rom&acirc;n&#259;",         # Romanian
    "ru" : "&#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081;",      # Russian
    "rw" : "Kinyarwanda",
    "sa" : "&#2360;&#2306;&#2360;&#2381;&#2325;&#2371;&#2340;",      # Sanskrit
    "sc" : "Sardu",                     # Sardinian
    "sd" : "&#2360;&#2367;&#2344;&#2343;&#2367;",    # Sindhi
    "se" : "S&aacute;megiella",         # (Northern) Sami
    "sg" : "Sangro",
    "si" : "(Simhala)",                 # Sinhalese
    "simple" : "Simple English",
    "sk" : "Sloven&#269;ina",           # Slovak
    "sl" : "Sloven&scaron;&#269;ina",   # Slovenian
    "sm" : "Gagana Samoa",              # Samoan
    "sn" : "chiShona",                  # Shona
    "so" : "Soomaaliga",                # Somali
    "sq" : "Shqip",                     # Albanian
    "sr" : "&#1057;&#1088;&#1087;&#1089;&#1082;&#1080; / Srpski",    # Serbian
    "ss" : "SiSwati",                   # Swati
    "st" : "seSotho",                   # (Southern) Sotho
    "su" : "Bahasa Sunda",              # Sundanese
    "sv" : "Svenska",                   # Swedish
    "sw" : "Kiswahili",                 # Swahili
    "ta" : "&#2980;&#2990;&#3007;&#2996;&#3021;",    # Tamil
    "te" : "&#3108;&#3142;&#3122;&#3137;&#3095;&#3137;",     # Telugu
    "tg" : "&#1058;&#1086;&#1207;&#1080;&#1082;&#1251;",     # Tajik
    "th" : "&#3652;&#3607;&#3618;",     # Thai
    "ti" : "Tigrinya",                  # Tigrinya - FIXME
    "tk" : "&#1578;&#1585;&#1603;&#1605;&#1606; / &#1058;&#1091;&#1088;&#1082;&#1084;&#1077;&#1085;",    # Turkmen
    "tl" : "Tagalog",                   # Tagalog (Filipino)
    "tn" : "Setswana",                  # Setswana
    "to" : "Tonga",                     # Tonga - FIXME
    "tpi" : "Tok Pisin",                # Tok Pisin
    "tr" : "T&uuml;rk&ccedil;e",        # Turkish
    "ts" : "Xitsonga",                  # Tsonga
    "tt" : "Tatar",                     # Tatar
    "tw" : "Twi",                       # Twi - FIXME
    "ty" : "Reo M&#257;`ohi",           # Tahitian
    "ug" : "Oyghurque",                 # Uyghur
    "uk" : "&#1059;&#1082;&#1088;&#1072;&#1111;&#1085;&#1089;&#1100;&#1082;&#1072;",     # Ukrainian
    "ur" : "&#1575;&#1585;&#1583;&#1608;",      # Urdu
    "uz" : "&#1038;&#1079;&#1073;&#1077;&#1082;",    # Uzbek
    "ve" : "Venda",                     # Venda
    "vi" : "Ti&#7871;ng Vi&#7879;t",    # Vietnamese
    "vo" : "Volap&uuml;k",              # Volapuk
    "wa" : "Walon",                     # Walloon
    "wo" : "Wollof",                    # Wolof
    "xh" : "isiXhosa",                  # Xhosan
    "yi" : "&#1497;&#1497;&#1460;&#1491;&#1497;&#1513;",     # Yiddish
    "yo" : "Yor&ugrave;b&aacute;",      # Yoruba
    "za" : "(Cuengh)",                  # Zhuang
    "zh" : "&#20013;&#25991;",          # (Zh&#333;ng W...n) - Chinese
    "zh-cn" : "&#20013;&#25991;(&#31616;&#20307;)",  # Simplified
    "zh-tw" : "&#20013;&#25991;(&#32321;&#20307;)",  # Traditional
    "zu" : "isiZulu",                   # Zulu
}

def create_mysql_table(options, name, wik_datafile):
    user = options.user
    con = MySQLdb.connect("localhost", user)

    db_name = "wik2dict_" + name

    # create table and dump SQL file into MySQL
    try:
        con.query("CREATE DATABASE " + db_name)
        database_existed = False
    except MySQLdb._mysql.ProgrammingError:
        database_existed = True

    if not database_existed or not options.dont_dump:
        w = os.path.splitext(wik_datafile)
        if w[1] == ".bz2":
            if not os.path.exists(w[0]):
                print "* Decompressing " + wik_datafile
                commands.getstatusoutput("bzip2 -d " + wik_datafile)
            wik_datafile = w[0]

        print "* Dumping " + wik_datafile + " into MySQL database " + db_name
        commands.getstatusoutput("mysql -u " + user + " " + db_name + " < " + wik_datafile)
    else:
        print "* Using existing database " + db_name

    cursor = con.cursor()
    cursor.execute("USE " + db_name)
    return (con, cursor, db_name)

def get_text(cursor, title):
    cursor.execute("SELECT cur_text FROM cur WHERE cur_title = '" + title + "'")
    return cursor.fetchone()[0]

def get_one(cursor, sql):
    cursor.execute(sql)
    return str(cursor.fetchone()[0])
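Note that get_text above builds its SQL by string concatenation, which breaks as soon as a title contains a quote. A minimal sketch of the parameterized alternative, demonstrated with the stdlib sqlite3 module since a MySQL server may not be at hand (MySQLdb uses %s placeholders instead of ?; the table contents here are invented):

```python
import sqlite3

def get_text_param(cursor, title):
    # placeholder binding quotes the title safely, no matter what it contains
    cursor.execute("SELECT cur_text FROM cur WHERE cur_title = ?", (title,))
    row = cursor.fetchone()
    return row[0] if row else None

# toy stand-in for the MediaWiki "cur" table
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE cur (cur_title TEXT, cur_text TEXT)")
cur.execute("INSERT INTO cur VALUES (?, ?)", ("O'Brien", "some text"))
```

A title like O'Brien would terminate the quoted string in the concatenated version, but binds cleanly here.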

def get_info(cursor, name):
    # try, catch
    dictname = get_text(cursor, "Wikititlesuffix").strip().replace(" ", "_")
    if (dictname in ["Wikipedia", "Wikibooks", "Wiktionary"]
        or not reduce(lambda x, y: x and (y in string.ascii_letters + "-_"),
                      dictname)):
        dictname = name

    url = get_text(cursor, "Printsubtitle")
    if url.find("http") > 0:
        url = url[url.index("http"):url.index("org") + 3]
    info = (dictname + """

Available under the GNU Free Documentation License.

Up-to-date version can be found online at """ + url + """.

The MediaWiki MySQL database was converted into the dict format on
""" + time.ctime() + " by " + program + " " + version + """, which is
available under the GNU General Public License at
http://meta.wikimedia.org/wiki/User:Guaka/wik2dict.py.
""")
    dictfilename = dictname.replace(" ", "-")
    return (dictname, info, url, dictfilename)

def get_redirects(cursor):
    print "* Getting redirects...",
    n = cursor.execute("SELECT cur_title, cur_text FROM cur"
                       " WHERE SUBSTRING(cur_text, 1, 9) = '#redirect'")
    redirs = {}
    for i in range(n):
        x = cursor.fetchone()
        k = x[1][10:].replace("[[", "").replace("]]", "").strip()
        redirs[k] = x[0]
    print "Number of redirects:", len(redirs)
    return redirs
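The slicing and bracket-stripping in get_redirects can be exercised without a MySQL server; a minimal sketch with invented sample rows:

```python
def build_redirect_map(rows):
    # rows are (cur_title, cur_text) pairs whose text starts with "#redirect"
    redirs = {}
    for title, text in rows:
        # drop the "#redirect " prefix (10 chars) and the [[...]] link brackets
        target = text[10:].replace("[[", "").replace("]]", "").strip()
        redirs[target] = title   # maps target title -> redirecting title
    return redirs

sample = [("Colour", "#redirect [[Color]]"),
          ("Lorry", "#redirect [[Truck]]")]
```

The map is keyed by the redirect *target*, so add_entries can later file a redirecting title as an extra headword for the article it points to.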

def remove_if_exists(f):
    if os.path.exists(f):
        os.remove(f)
        print "* Removed", f

def create_dict(dictfilename):
    remove_if_exists(dictfilename + ".dict")
    remove_if_exists(dictfilename + ".dict.dz")
    remove_if_exists(dictfilename + ".index")
    return dictdlib.DictDB(dictfilename, mode='write', quiet=0)

class Article_Processor:
    def __init__(self, cursor, options):
        self.cursor = cursor
        self.only_first = options.only_first
        self.verbose = options.verbose
        self.width = options.width
        self.left_margin = 5
        self.left_margin_c = self.left_margin * " "

        self.Nstab_image = get_text(cursor, "Nstab-image").upper()
        self.Nstab_category = get_text(cursor, "Nstab-category").upper()
        self.Nstab_special = get_text(cursor, "Nstab-special").upper()
        self.Nstab_template = get_text(cursor, "Nstab-template").upper()

        self.re_comment = re.compile("<!--.*?-->", re.DOTALL)
        self.re_html = re.compile("<.*?>", re.DOTALL)
        self.re_msg = re.compile("(?<={{).*?(?=}})")
        self.re_htmlentity = re.compile("(?<=&).*?(?=;)")
        self.re_bullets = re.compile("\**")

        self.re_notoc_etc = re.compile("__(START|NOTOC|END|NOEDITSECTION)__")

        self.re_extlinks_nodesc = re.compile("(\[)(\S*?)(\])")
        self.extlinks_nodesc_func = lambda x: x.group(2)
        self.re_extlinks_desc = re.compile("(\[)(.*?) (.*?)(\])")
        self.extlinks_desc_func = lambda x: "{" + x.group(3) + " (" + x.group(2) + ")}"

        self.bullets = "*#*-" + "." * 60

        self.lang_iso_codes = lang_iso_codes
        for l in self.lang_iso_codes:
            self.lang_iso_codes[l] = self.process_html(self.lang_iso_codes[l])
        self.max_len_isocodes = max(map(len, self.lang_iso_codes.keys()))

        self.redirs = get_redirects(self.cursor)
        self.messages = self.get_messages()

    def get_messages(self):
        print "* Getting messages...",
        n = self.cursor.execute("SELECT cur_title, cur_text FROM cur"
                                " WHERE cur_namespace = 10")
        messages = {}
        for i in range(n):
            x = self.cursor.fetchone()
            k = x[0].upper()
            messages[k] = self.process_html(x[1])

        for k, v in messages.items():
            if k in self.redirs and v.upper().find("REDIRECT") > 0:
                nk = self.redirs[k]
                nks = nk.split(":")
                if nks[0].upper() == self.Nstab_template:
                    nk = ":".join(nks[1:])
                if nk in messages:   # without this check I had a KeyError: 'LGBTA' for en-wp
                    messages[k] = messages[nk]

        messages["CURRENTYEAR"] = get_one(self.cursor, "SELECT YEAR(MAX(cur_timestamp)) FROM cur;")
        messages["CURRENTMONTH"] = get_one(self.cursor, "SELECT MONTH(MAX(cur_timestamp)) FROM cur;")
        messages["CURRENTMONTHNAME"] = get_one(self.cursor, "SELECT MONTHNAME(MAX(cur_timestamp)) FROM cur;")
        messages["CURRENTDAY"] = get_one(self.cursor, "SELECT DAYOFMONTH(MAX(cur_timestamp)) FROM cur;")
        messages["CURRENTDAYNAME"] = get_one(self.cursor, "SELECT DAYNAME(MAX(cur_timestamp)) FROM cur;")
        messages["CURRENTTIME"] = get_one(self.cursor, "SELECT DATE_FORMAT(MAX(cur_timestamp), '%H:%i:%s') FROM cur;")
        messages["NUMBEROFARTICLES"] = get_one(self.cursor, "SELECT COUNT(cur_title) FROM cur WHERE cur_namespace = 0;")

        # Quicker to process messages first.
        #self.messages = messages
        #messages = map(lambda x: (x[0], self.replace_messages(x[1])),
        #               messages.items())
        print "Number of messages:", len(messages)
        return messages

    def internal_links(self, s):
        r = []
        for t in s.split("[["):
            no_interwiki = False
            if t.find(":") == 1:
                t = t[2:]
                no_interwiki = True

            t_splitdp = t.split(":")
            if t_splitdp[0] in self.lang_iso_codes.keys():
                # interwiki links
                if no_interwiki:
                    t = "{" + "".join(t_splitdp[1:]).replace("]]", "}")
                else:
                    t = self.lang_iso_codes[t_splitdp[0]] + ": {" + "".join(t_splitdp[1:]).replace("]]", "}")

            elif t_splitdp[0].upper() in [self.Nstab_image, self.Nstab_category]:
                t = t[t.find("]]") + 2:]   # drop images and categories

            else:
                vert_pos = t.find("|")
                if vert_pos > 0 and vert_pos < t.find("]]"):
                    t = t[vert_pos + 1:]
                t = "{" + t.replace("]]", "}")

            r.append(t)
        r[0] = r[0][1:]
        return "".join(r)

    def external_links(self, s):
        return s

        """
        self.re_extlinks = re.compile("\[.*?\]")

        if self.re_extlinks.search(s):
            l = s.split("[")
            r = [l[0]]
            for t in l[1:]:
                space_pos = t.find(" ")
                rbrack_pos = t.find("]")
                if space_pos > 0 and space_pos < rbrack_pos:
                    t = "{" + t[space_pos:rbrack_pos] + " (" + t[:space_pos] + ")}" + t[rbrack_pos+1:]
                r.append(t)
            return "".join(r)
        else:
            return s
        """

    #def replace_messages(self, s):
    def repl_message(self, match_obj):
        m = match_obj.group(0)
        k = m.upper()
        if k in self.messages:
            return self.messages[k]
        elif k == "PAGENAME" and hasattr(self, "PAGENAME"):
            return self.PAGENAME
        else:
            #print "Message", m, "not found"
            return m
        #return self.re_msg.sub(repl_func, s)

    def repl_htmlentities(self, match_obj):
        k = match_obj.group(0)
        if k in htmlentitydefs.entitydefs:
            return htmlentitydefs.entitydefs[k]
        else:
            return "&" + k + ";"

    def process_html(self, s):
        s = self.re_comment.sub("", s)   # first remove HTML comments
        s = self.re_html.sub("", s)      # then remove other HTML stuff

        return self.re_htmlentity.sub(self.repl_htmlentities, s)
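The entity replacement can be tried in isolation; a small self-contained sketch. The entity table here is a tiny invented stand-in for htmlentitydefs.entitydefs, and the pattern deliberately differs from the script's lookaround version in that it consumes the "&" and ";" delimiters:

```python
import re

# tiny stand-in for Python 2's htmlentitydefs.entitydefs (only a few entries)
entitydefs = {"amp": "&", "auml": "\xe4", "eacute": "\xe9"}

# unlike the script's "(?<=&).*?(?=;)", this variant consumes the delimiters
re_entity = re.compile(r"&(\w+);")

def expand_entities(s):
    def repl(m):
        # unknown entities are left untouched, as repl_htmlentities does
        return entitydefs.get(m.group(1), m.group(0))
    return re_entity.sub(repl, s)
```

This is the mechanism used to turn names like "Els&auml;ssisch" in the language table into displayable text.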

    def process_line(self, line):
        if line[0] == "#":
            self.numbering += 1
            line = str(self.numbering) + ") " + line[1:]
        else:
            self.numbering = 0

        nr_bullets = 0   # so the wrap step below also works for heading and table lines
        if line[0] == "=":
            line = "\n " + line.replace("=", "")

        elif line[:4] == "----":
            line = " " * 3 + (self.width - 3) * "_" + "\n"

        # tables
        elif line[:2] == "{|" or line[:2] == "|-" or line[:2] == "|}":
            line = ""
        elif line[0] == "|" or line[0] == "!":
            line = line.replace("!!", "||")
            pos_vert = line[1:].find("|")
            pos_dvert = line[1:].find("||")
            if pos_vert != pos_dvert:   # table layout stuff
                line = self.left_margin_c + line[pos_vert + 1:]
            else:
                line = self.left_margin_c + line[1:]

            cols = []
            for col in line.split("||"):
                col = col.strip()
                if col[0:1] == "|":
                    col = col[1:].strip()
                if col:
                    col = self.process_line(col)   # hmm...
                cols.append(col)
            line = "\n".join(cols)
        else:
            nr_bullets = self.re_bullets.search(line).end()
            if nr_bullets:
                if len(line) <= nr_bullets:
                    line = " " * nr_bullets + self.bullets[nr_bullets - 1]
                elif line[nr_bullets] == " ":
                    line = " " * nr_bullets + self.bullets[nr_bullets - 1] + line[nr_bullets:]
                else:
                    line = " " * nr_bullets + self.bullets[nr_bullets - 1] + " " + line[nr_bullets:]

        if self.numbering:
            wsize = self.width - 5
        else:
            wsize = self.width

        w = textwrap.wrap(line, wsize)
        spaces = self.left_margin_c
        line = spaces
        if self.numbering or nr_bullets:
            spaces = spaces + self.left_margin_c
        line += ("\n" + spaces).join(w)
        return line
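The final wrapping step of process_line can be shown on its own; a minimal sketch in the same spirit, where the margin of 5 matches the script's left_margin and the helper name is made up:

```python
import textwrap

def wrap_with_margin(line, width=40, margin=5):
    # wrap to the usable width, then indent every resulting line by the margin
    spaces = " " * margin
    wrapped = textwrap.wrap(line, width - margin)
    return spaces + ("\n" + spaces).join(wrapped)
```

Subtracting the margin from the wrap width keeps the total line length within the configured page width.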

    def process_entry(self, s):
        s = self.process_html(s)
        s = self.re_notoc_etc.sub("", s)

        s = self.re_msg.sub(self.repl_message, s)
        s = self.re_msg.sub(self.repl_message, s)   # hmmm, twice...

        #s = self.replace_messages(s)
        #s = self.replace_messages(s)
        self.numbering = 0
        r = []
        for line in s.split("\n"):
            line = self.internal_links(line)
            line = self.re_extlinks_nodesc.sub(self.extlinks_nodesc_func, line)
            line = self.re_extlinks_desc.sub(self.extlinks_desc_func, line)
            if line:   # if len(line) > 0:
                line = self.process_line(line)
                r.append(line)
        s = "\n".join(r)
        return s

    def add_entries(self, dict_object):
        q = "SELECT cur_title, cur_text FROM cur WHERE cur_namespace = 0"   # ORDER BY cur_title
        if self.only_first:
            q = q + " LIMIT " + str(self.only_first)
        self.cursor.execute(q)

        l = self.cursor.rowcount
        # Can this be more precise?  (For the "records", not for the SELECT...)
        # For small stuff it is too high, and for big stuff it is too low.
        print "* Estimation of records to process:", l
        for i in range(l):
            fetch = self.cursor.fetchone()

            index = fetch[0]
            if self.verbose:
                print index
            if not index in self.redirs.values():
                indices = Set([index])
                if index in self.redirs:
                    indices.add(self.redirs[index])
                for e in list(indices):
                    indices.add(e.replace("_", " "))

                self.PAGENAME = index
                entry = self.process_entry(fetch[1])
                index = index.replace("_", " ")
                dict_object.addentry(index + "\n" + entry, indices)
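The headword aliasing in add_entries can be isolated as a small pure function; a sketch mirroring its underscore/space handling (the titles and redirect map are invented):

```python
def headword_variants(title, redirects):
    # collect all headwords under which one article should be findable
    indices = set([title])
    if title in redirects:
        indices.add(redirects[title])   # the redirecting title also points here
    for e in list(indices):
        indices.add(e.replace("_", " "))   # add space variants of underscored titles
    return indices
```

So an article stored as New_York_City, with an NYC redirect pointing at it, ends up reachable under both spellings and the redirect name.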

def convert_sqldump_to_dict(wik_datafile, options):
    file_tail = os.path.split(wik_datafile)[1]
    name = file_tail.split(".")[0]
    name = name.replace("-", "_").replace("_cur_table", "")

    (con, cursor, db_name) = create_mysql_table(options, name, wik_datafile)
    (dictname, info, url, dictfilename) = get_info(cursor, name)

    dict_object = create_dict(dictfilename)

    artproc = Article_Processor(cursor, options)
    artproc.add_entries(dict_object)

    dict_object.setlonginfo(info)
    dict_object.setshortname(dictname.replace("_", " "))
    dict_object.seturl(url)
    dict_object.finish()

    if options.dont_dump:
        print "* Keeping MySQL database " + db_name
    else:
        print "* Dropping MySQL database " + db_name
        con.query("DROP DATABASE " + db_name)

    print "* Applying dictzip"
    commands.getstatusoutput("dictzip " + dictfilename + ".dict")
    print "* Created", dictfilename + ".dict.dz and", dictfilename + ".index."

    return dictfilename

def download(which):
    files = []
    base_url = "http://download.wikimedia.org/archives/"
    file_ending = "cur_table.sql.bz2"
    for w in which.split(":"):
        if w == "Wikipedia":
            base_url = "http://download.wikimedia.org/archives/"
            base_file = "Wikipedia-"
        elif w == "Wiktionary":
            base_url = "http://download.wikimedia.org/archives_wiktionary/"
            base_file = "Wiktionary-"
        elif w == "Wikibooks":
            base_url = "http://download.wikimedia.org/archives_wikibooks/"
            base_file = "Wikibooks-"
        elif w == "Wikiquote":
            base_url = "http://download.wikimedia.org/archives_wikiquote/"
            base_file = "Wikiquote-"
        elif w == "Special":
            base_url = "http://download.wikimedia.org/archives_special/"
            base_file = "Wikimedia-"
        else:
            for isocode in w.split(","):
                date = ""   # "20040714_"
                isocode = isocode.upper()
                url = base_url + isocode + "/" + file_ending
                fn = base_file + isocode + "_" + file_ending
                print "* Fetching " + url
                (fn, httpmessage) = urllib.urlretrieve(url, fn)
                print "* Saved " + fn
                files.append(fn)
                print "* Downloaded " + httpmessage.dict["content-length"] + " bytes"
    return files
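The URL construction inside download() is easy to check without hitting the network; a sketch that rebuilds it the same way. These 2004-era download.wikimedia.org paths are long gone, so this is illustration only:

```python
def dump_url(project, isocode):
    # same base-URL-per-project scheme as download(), for a subset of projects
    bases = {"Wikipedia": "http://download.wikimedia.org/archives/",
             "Wiktionary": "http://download.wikimedia.org/archives_wiktionary/",
             "Wikibooks": "http://download.wikimedia.org/archives_wikibooks/"}
    # language codes are uppercased in the archive layout
    return bases[project] + isocode.upper() + "/" + "cur_table.sql.bz2"
```

For example, the Dutch Wikipedia dump was fetched from archives/NL/.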

def main(profiling = False):
    usage = """
wik2dict.py [OPTIONS] FILE(S) ...

Possible problems:
* Check your free disk space!
* max_allowed_packet is too low.  Increasing it from 1M to 4M might help."""

    if profiling:
        sys.argv = ['./wik2dict.py', 'wp-nl-20040721_cur_table.sql']
    parser = OptionParser(usage)
    parser.add_option("-o", "--only-first", type="int", dest="only_first",
                      help="only process the first .. articles.  Very useful for debugging.")
    parser.add_option("-v", "--verbose",
                      action="store_true", dest="verbose")
    parser.add_option("-d", "--dontdump",
                      action="store_true", dest="dont_dump")
    parser.add_option("-w", "--width", type="int", dest="width",
                      help="text width", default=80)
    parser.add_option("-u", "--user", dest="user",
                      help="MySQL user", default="root")
    parser.add_option("-f", "--fetch", dest="fetch",
                      help="for example -fWikipedia:nl,de,fr,en:Wiktionary:nl,de,fr,en")
    parser.add_option("-D", "--delete", action="store_true", dest="delete",
                      help="delete SQL dumps after conversion")
    parser.add_option("-n", "--fetch_and_no_conversion", action="store_true",
                      dest="no_conversion",
                      help="don't convert the fetched (with the --fetch option) dumps into dict")

    (options, args) = parser.parse_args()

    print """wik2dict.py
This is Free Software, available under the GNU General Public License
(c) Guaka 2004

Conversion of MediaWiki SQL dumps into dict files.
Optionally downloads Wikimedia SQL dumps.
"""

    if options.fetch:
        files = download(options.fetch)
        print "* Finished downloading"
        if options.no_conversion:
            sys.exit()
        else:
            print
            print "* Starting wik2dict conversion"
            print
    else:
        if not args:
            parser.error("Add one or more MySQL dump files as arguments"
                         " or use the --fetch option to download them.")
        else:
            files = args

    c = []
    for f in files:
        c.append(convert_sqldump_to_dict(f, options))
        print
    nr = len(files)
    print "Converted", nr, "file" + (nr > 1 and "s:" or ":")
    for i in range(nr):
        print " - " + files[i] + " into " + c[i] + ".dict.dz and " + c[i] + ".index"

    if options.delete:
        for f in files:
            remove_if_exists(f)

if __name__ == "__main__":
    main()