Toolserver:Unified Wiktionary API

=Unified Wiktionary API=

The intention is to create an API to the contents of the various Wiktionary projects. Dictionary content needs far more structure than encyclopedia content, so each project devised a structure for its entries. The problem is that almost every Wiktionary uses a different structure, which impedes interoperability enormously.

It would be nice to overcome this by having an API that can be queried in a standard way and that passes back results in a standard format.

A later goal is to feed this standard format back to the API and edit entries that way, merging the new information in where feasible.

==Language names and language templates==
A first hurdle is that some Wiktionaries use the names of the languages fully spelled out, while others use two-letter (iso2) or three-letter (iso3) ISO 639 codes in templates.

So it needs to be possible to convert from one to another.
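As a rough illustration of such a conversion, here is a minimal Python sketch. The table `LANGUAGES`, the helper `build_index` and the function `convert` are hypothetical names invented for this example; a real table would be loaded from the files ISO publishes and extended with the codes MediaWiki uses. The lookup returns a list because a name or code may turn out to be ambiguous.

```python
# Hypothetical sample rows; a real table would be imported from the ISO 639 files.
LANGUAGES = [
    {"iso2": "en", "iso3": "eng", "name": "English"},
    {"iso2": "nl", "iso3": "nld", "name": "Dutch"},
    {"iso2": "de", "iso3": "deu", "name": "German"},
    {"iso2": None, "iso3": "fur", "name": "Friulian"},  # no two-letter code
]


def build_index(records):
    """Map every known identifier (code or name, lower-cased) to its records."""
    index = {}
    for record in records:
        for key in ("iso2", "iso3", "name"):
            value = record[key]
            if value:
                index.setdefault(value.lower(), []).append(record)
    return index


INDEX = build_index(LANGUAGES)


def convert(identifier, target):
    """Return all `target` forms ('iso2', 'iso3' or 'name') for an identifier.

    The result is always a list: empty when nothing matches or when the
    language has no value of the requested kind.
    """
    matches = INDEX.get(identifier.strip().lower(), [])
    return [record[target] for record in matches if record[target]]


print(convert("nl", "iso3"))      # three-letter code for a two-letter code
print(convert("German", "iso2"))  # code for a spelled-out name
print(convert("fur", "iso2"))     # empty list: no two-letter code exists
```

Codes such as zh-min-nan, which fall outside ISO, would need extra rows or a separate mapping; the sketch leaves that open.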

==Headings==
We will need to keep a table for each Wiktionary recording the headings it uses and whether or not templates are used for them.
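One possible shape for that table, sketched in Python. Everything here is an assumption made for illustration: the class `HeadingConvention`, the function `local_heading`, the section keys, and especially the "xx" row, which is an invented wiki with invented template names. Only the English heading texts are taken from en.wikt.

```python
from dataclasses import dataclass


@dataclass
class HeadingConvention:
    headings: dict        # interchange section key -> local heading text or template name
    uses_templates: bool  # True when headings are produced by templates


# Hypothetical table; real rows would have to be collected from each project.
CONVENTIONS = {
    "en": HeadingConvention(
        {"etym": "Etymology", "pronunc": "Pronunciation", "trad": "Translations"},
        uses_templates=False,
    ),
    # "xx" is a made-up wiki that hides its headings in made-up templates.
    "xx": HeadingConvention(
        {"etym": "-etym-", "pronunc": "-pron-", "trad": "-trans-"},
        uses_templates=True,
    ),
}


def local_heading(wiki, section):
    """Return the text to look for in the wikitext of `wiki`, or None if unknown."""
    convention = CONVENTIONS.get(wiki)
    if convention is None:
        return None
    text = convention.headings.get(section)
    if text is None:
        return None
    return "{{" + text + "}}" if convention.uses_templates else text


print(local_heading("en", "trad"))
print(local_heading("xx", "trad"))
```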

==Other conventions==
Is each language separated from the next by a ?

Is there more than one language present on the same page? (la.wikt has a page per language/spelling combination.)

==Requests to the API==
It should be possible to ask which languages a given spelling has entries for on a given Wiktionary. It should be possible to ask for the entire contents of a given spelling for a certain language. It should be possible to ask which sections exist for a given spelling in a given language. A representation should be agreed upon for each possible section (etymology, pronunciation, definitions, translations, synonyms, etc.).
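The three requests could be sketched as follows. This is only an illustration under stated assumptions: `STORE` is a hypothetical in-memory stand-in for whatever a per-wiki parser would produce, and the function names `languages_for`, `sections_for` and `entry_for` are invented here, not part of any existing API.

```python
# Hypothetical stand-in for parsed content:
# wiki -> spelling -> language code -> section name -> content.
STORE = {
    "en": {
        "helium": {
            "eng": {
                "definitions": ["A colorless and inert gas ..."],
                "translations": {"deu": ["Helium"], "fra": ["hélium"]},
            },
            "fin": {"definitions": ["helium"]},
        }
    }
}


def languages_for(wiki, spelling):
    """Which languages does this spelling have entries for on this wiki?"""
    return sorted(STORE.get(wiki, {}).get(spelling, {}))


def sections_for(wiki, spelling, language):
    """Which sections exist for this spelling in this language?"""
    return sorted(STORE.get(wiki, {}).get(spelling, {}).get(language, {}))


def entry_for(wiki, spelling, language, section=None):
    """Return the whole entry, or one section of it; None when it is missing."""
    entry = STORE.get(wiki, {}).get(spelling, {}).get(language)
    if entry is None or section is None:
        return entry
    return entry.get(section)


print(languages_for("en", "helium"))
print(sections_for("en", "helium", "eng"))
print(entry_for("en", "helium", "fin", "definitions"))
```

Missing pages, missing languages and missing sections all come back as empty results rather than errors, which leaves room for incomplete information.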

==So what could the interchange format look like?==
This is only a proposal:

XML is a rather heavy format; JSON is a lot friendlier to the eyes of mere mortals. So I'm going to try representing an entry in it:

Hippietrail warned me not to overengineer it, but I would still like to try encoding the entry for Helium, to get a feel for what it could be(come). The first level would be an ordered list of what is present on a page: [preamble, lang1, lang2, ..., categories, interwikis].

In the preamble level there is room for links to other spellings (capitalized, with diacritics, etc.).

I've indented the levels for better readability, but that's not a requirement.

An entry can have more than one etymology. If there is only one, etym1 is used. The next uses etym2, and so on.

Sometimes a given spelling can be pronounced in different ways. It's not clear what to do when a particular pronunciation has more than one etymology, or when it's the other way around. I think it would be best to simply repeat the pronunciations when that occurs, so that the structure stays fixed.
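To get a feel for how a consumer might read the etymN keys, here is a small Python sketch. The sample data and the function name `etymologies` are invented for the illustration; the only things taken from the proposal are the key naming (etym1, etym2, ...) and the idea that each etymology is a list holding an optional text followed by an object with the rest, with the pronunciation repeated under each etymology.

```python
import json

# Made-up entry with two etymologies that share one pronunciation.
SAMPLE = json.loads("""
{
  "eng": {
    "etym1": ["first origin",  {"pronunc": "p1", "noun": ["", {"def1": ["sense A"]}]}],
    "etym2": ["second origin", {"pronunc": "p1", "verb": ["", {"def1": ["sense B"]}]}]
  }
}
""")


def etymologies(language_entry):
    """Return (number, etymology text, body) tuples in numeric order.

    The etymology text may be absent, in which case an empty string is used.
    """
    found = []
    for key, value in language_entry.items():
        if key.startswith("etym") and key[4:].isdigit():
            text = value[0] if value and isinstance(value[0], str) else ""
            body = next((item for item in value if isinstance(item, dict)), {})
            found.append((int(key[4:]), text, body))
    return sorted(found, key=lambda item: item[0])


for number, text, body in etymologies(SAMPLE["eng"]):
    print(number, text, body["pronunc"])
```

Sorting on the numeric suffix rather than on the key string keeps etym10 after etym2.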

 [
   [ "Helium", "hélium" ],
   {
     "eng": {
       "etym1": [
         "Modern Latin, from  ‘sun’ (because its presence was first theorised in the sun's atmosphere).",
         {
           "pronunc": ",, ",
           "hyph": "he&middot;li&middot;um",
           "noun": [
             "",
             {
               "def1": [
                 "A colorless and inert gas, and the second lightest chemical element (symbol He) with an atomic number of 2 and atomic weight of 4.002602.",
                 {
                   "deriv": [ "heliair", "heliox" ],
                   "rel": [ "helio-", "Helios" ],
                   "trad": {
                     "gloss": [ [ "chemical element" ] ],
                     "afr": [ [ "helium" ] ],
                     "deu": [ [ "Helium", "g=n" ] ],
                     "fra": [ [ "hélium", "g=m" ] ],
                     "pol": [ [ "hel" ] ],
                     "spa": [ [ "helio", "g=m" ] ],
                     "tur": [ [ "helyum" ] ],
                     "swe": [ [ "helium" ] ],
                     "Mandarin": [ [ "氦", "tr=hài" ] ],
                     "zh-yue": [ [ "氦" ] ],
                     "zh-min-nan": [ [ "goân-sò" ] ],
                     "zh-classical": [ [ "氦" ] ],
                     "nld": [ [ "helium", "g=n" ] ],
                     "epo": [ [ "heliumo" ], [ "helio" ] ],
                     "fur": [ [ "eli", "g=m" ] ],
                     "fas": [ [ "هلیم", "sc=fa-Arab" ] ],
                     "rus": [ [ "гелий", "g=m", "tr=gélij", "sc=Cyrl" ] ],
                     "tam": [ [ "ஹீலியம்" ], [ "பரிதியம்", "tr=paridhiyam" ] ],
                     "yid": [ [ "העליום", "tr=helium", "g=n" ] ]
                   },
                   "syn": [ [ "E939", "rem=when used as a packaging gas" ] ],
                   "ext": [ "For etymology and more information refer to: http://elements.vanderkrogt.net/elem/he.html (A lot of the translations were taken from that site with permission from the author)" ],
                   "cat": [ "Category:Chemical elements" ]
                 }
               ]
             }
           ]
         }
       ]
     }
   },
   {
     "fin": {
       "etym1": [
         {
           "noun": [
             "",
             {
               "def1": [ " helium" ]
             }
           ]
         }
       ]
     }
   },
   [ "ast", "cs", "de", "el", "fa", "fr", "io", "kk", "la", "lt", "nl", "no", "pl", "pt", "ro", "ru", "simple", "fi", "sv", "ta", "vi", "tr", "zh" ]
 ]

Things that I couldn't place:

[18:50]  and is it feasible to include that in the API of MediaWiki?
[18:50] that would come later but yes i have lookedd into it and asked rowan
[18:51] i would warn people not to overengineer the exchange format. focus on one small part first and just a couple of wiktionaries
[18:51] and allow for missing information
[18:51]  oh
[18:52]  that's a bit counter intuitive, isn't it?
[18:52] lingro has already done some of this stuff by the way
[18:52]  I would think it should include as much as possible
[18:52] perhaps but i speak from experience
[18:52]  who is lingro?
[18:52]  you mean because otherwise it never gets finished?
[18:53] then you will spend as much time as possible designing it and fighting with people who think the design should differ in unimportant ways
[18:53] lingro is a language learning site
[18:53] uses wiktionary as one of its data sources
[18:53] or even never gets started. "release early release often"
[18:54]  aha, lingro.com?
[18:54] look at omegawiktionary. overdesigned hard to use missing lots of info wiktionary has had for years
[18:54] yep. i invited them and some other projects such as ninjawords to contribute to a wiktparser project
[18:55]  sounds good
[18:55]  is there discussion on line about this somewhere?
[18:58] on the mailing list and on the beer parlour both maybe 6 months ago
[18:59] well they do all the work themselves. if the community did the work a lot more would get done
[19:04]  the community is already compiling a dictionary :-)
[19:07] yeah!
[19:11]  Do you have something like this in mind:
[19:12]  that you can query the server for all the definitions for a given word from a given wiktionary in a given language?
[19:12]  then query for all synonyms and translations?
[19:13]  or would you have to start by querying for the etymologies?
[19:13]  then dig deeper from there?
[19:14]  or would it simply be: give me all you have about a certain spelling (and maybe its variations) in an XML format?
[19:15] <P0lygl0t> and when the parser cannot convert towards XML flag the entry by putting it in an rfc cat?
[19:15] you should be able to query any field
[19:15] <P0lygl0t> yes, but you wanted to keep it simple :-)
[19:16] <P0lygl0t> for getting started
[19:16] as the api got more use we would review parts of the format yes
[19:16] i focused on the translation section
[19:16] <P0lygl0t> so you always get the result in xml
[19:16] <P0lygl0t> and you can tell the server what field/section you are interested in
[19:17] if you look at wiktparser on the toolserver you will find a tool that extracts spanish translations from english articles
[19:17] i would prefer json but the api works with lots of formats
[19:17] <P0lygl0t> what is json?
[19:18] json is derived from javascript objects but is now a standard used by many languages due to being much more lightweight than xml. ajax uses it a lot
[19:18] <P0lygl0t> do you also foresee a way to feed the API xml that gets included in the entry?
[19:18] yes later
[19:19] <P0lygl0t> aha, I should probably look into json
[19:19] of course we could always make a proxy to the api on the toolserver or somewhere before actually adding to the api directly
[19:19] <P0lygl0t> why don't we get started to implement something that can do the extraction for en.wikt?
[19:20] i think its easier to just think in terms of data structure rather than how to encode that structure as text
[19:20] sure. en wikt is easy because our heading levels already reveal a lot of structure and the heading levels are not hidden within templates
[19:21] my first level parse just returned a structure reflecting the headings with the wikitext of each section within
[19:22] <P0lygl0t> ic, but that's not quite good enough
[19:23] one caveat wth our heading format is that it has a surface structure and a deep structure most obviously when comparing a sngle etym article to a multi etym article
[19:23] thats the first level
[19:24] in fact theres a level before that which is just to retrieve the wikitext of a particular language entry of a page
[19:24] there is also metadata such as the stuff before the first language include the also template and the stuff after the last language such as some categories an the interwiki links
[19:25] <P0lygl0t> indeed
[19:26] the beauty of a multi level parse is you can parse completely the bits you understand completey which conform to a specifed format and other bits you can still get to the wikitext of
[19:26] for instance the etymology and pronunciation sections are too messy for a machine to parse
[19:27] also when you grab the wikitext you can do stuff like store it all then get some stats on what the most common ways are that people write them
[19:28] i was working on some articles on the format or "grammar" of a wiktionary page
[19:28] <P0lygl0t> so multilevel sounds like a good idea, but other language wiktionaries will have a harder time implementing it
[19:29] <P0lygl0t> because they lack depth in their headers
[19:29] such as page := prolog article+ epilog
[19:29] or the headings are hidden in templates
[19:30] of course a parser can also look into templates especially if its working from a dump file
[19:31] yes thats why an interchange format is good. the en codec could convert to the interchange format using the multilevel stuff but other wikts might have to use more brute force methods
[19:31] other wikts wont have the same prolog and epilog as us but they will have at least some of the same metadata
[19:32] so the interchange format would include metadata but might not include prolog or epilog
[19:33] a page on most wikts will consist of one or more language articles, but latin will only ever have one language or article per page
[19:33] other wikts might not have a concept of translingual
[19:52] anyway taking all that into consideration i would focus on the articleentry level first because all wikts will have it in common
[19:53] you would want to be able to query "does the xx wiktionary have an article in the yy language?"
[19:54] so the codec will have to be able to find a page and then the part of that page dealing with a specific language
[19:55] and something will have to deal with language name impedence mismatches with 2 letter and 3 letter codes as well as the mismatch between "chinese" and "mandarn" etc
[19:56] i would not leave that stuff to the codec if possible or the result will be lots of variety each user will have to handle and inevitibly wll do so differently
[19:57] <P0lygl0t> so the API would know how to convert from iso2 to iso3 to actual language names in all languages?
[19:59] to some degree. not all because each wikt will have the laguage names in its own language!
[20:13] <P0lygl0t> hippietrail: if I want to look something up on en.wikt or de.wikt it would be nice if I could simply pass it the iso3 code
[20:14] it would be but that is to assume no wikt has any ambiguities in language names or codes
[20:14] and also that you know which iso3 code you want when you only know the language name "zapotec"
[20:14] does wiktionary strictly follow ISO codes?
[20:15] <P0lygl0t> if I only know the language name, I would want the API to resolve that to iso3 for nl.wikt
[20:15] because f.e. we use "ksh" and that doesnt seem to be in the ISO file i imported
[20:15] also "translingual" has no language code and i think we still have non language entries for letters and symbols
[20:15] <P0lygl0t> and iso2 for a lot of other ones
[20:15] hippietrail: translingual = mul
[20:15] <P0lygl0t> what does ksh stand for?
[20:15] Kölsch
[20:15] Ripuarian languages
[20:15] well thats one interpretation
[20:15] mul could also mean give me every language
[20:16] it means Translingual for {{infl
[20:16] <P0lygl0t> I saw vls was used by the Wikipedia of West Flanders
[20:16] <P0lygl0t> instead of all of Flanders...
[20:16] --> GerardM- has joined this channel (n=chatzill@i181119.upc-i.chello.nl).
[20:16] iso does what ethnologue does and ethnologue is a splitter
[20:16] <LinkyAenwk> http://en.wiktionary.org/wiki/splitter
[20:17] <P0lygl0t> I don't think mediawiki follows ISO strictly
[20:17] nor would most of the smaller wikts
[20:18] <P0lygl0t> we have stuff like zh-min-nan
[20:18] talk ISO into doing what wiktionary does? :P
[20:18] <P0lygl0t> so, I guess we need full lang name, ISO2, ISO3 and WM convention
[20:18] <P0lygl0t> i mean MW
[20:18] probably some system where you can query "chinese" and get all enties that are part of that "macrolanguage" as well as ones merely labelled "chinese"
[20:18] is it MW or rather just en.wikt?
[20:19] dont see language codes being defined in standard MW install
[20:19] <P0lygl0t> not sure, but we are talking about an API to all Wiktionaries
[20:19] i can guarantee you will not even find a consensus between the mw projects
[20:19] <P0lygl0t> some do things one way, others another way
[20:20] <P0lygl0t> great, is that something for the API to resolve then?
[20:20] the other approach is to just state that anything falling outside the strict iso3 language codes will fail until the articles ar edited to conform
[20:20] yes for the api. which is why one central open is best
[20:21] <P0lygl0t> are you suggesting that en.wikt converts all language names to iso3?
[20:21] devs wont have to tackle the problem oveer and over and come up with differing solutions
[20:21] i make no such suggestion. only making the problems clearer
[20:22] and with a central open api devs can add new ways to resolve the problems that they understand and all users will benefit
[20:22] maybe the first feature of the API should be a function to request a list of language codes from each wiki, then
[20:22] <P0lygl0t> I was trying to find out if the API would include converting from one to the other
[20:23] so each wiktionary would have to make a standard special page, which outputs the local code->language table
[20:24] <P0lygl0t> hopefully that language code would be iso3 then?
[20:24] then the API could request that first and stick to it, before making other requests using the code
[20:24] the api should probably know 2 two 3 letter conversions and synonyms and know which sets of codes map to each macrolanguage
[20:24] mapping codes to language names might belong in each codec
[20:24] <P0lygl0t> ok, I think that's the first thing to work out
[20:25] <P0lygl0t> code wise
[20:25] <P0lygl0t> I think it's almost a separate little project
[20:26] know-it-all already has a mysql database with 2 and 3 letter codes and english names, btw
[20:26] <P0lygl0t> good starting point, I think
[20:26] <P0lygl0t> in what language is it coded?
[20:26] you can also be pragmatic and keep this problem in mind when working on other stuff because its a hard problem that becomes clearer with experience and the experience is hard to get if this problem blocks all other work
[20:26] PHP
[20:27] iso publishes all this stuff in txt files designed to be slurped easily into databases
[20:27] yes, i used one of those to impotz
[20:27] <P0lygl0t> good, I think the resulting API should also be coded in PHP, isn't it hippietrail?
[20:27] import
[20:27] including synonyms macrolanguages and french names as well as english
[20:27] php makes sense if in the long run you want to integrate it into the mw api
[20:28] <P0lygl0t> but it needs to be appended with what MW started using
[20:28] <P0lygl0t> I thought that was the intention, or not?
[20:28] but to start i would make it as a proxy on a site such as toolserver
[20:28] <P0lygl0t> what other language would make sense?
[20:28] can i upload .zip or something to wiktionary, no..hm?
[20:29] python perl and java will probably be mentioned by other people if you ask them
[20:29] <P0lygl0t> but if you want to integrate later on, I don't think any other language would make sense
[20:29] i agree
[20:29] <P0lygl0t> and I think the whole goal would be to integrate it
[20:29] if necessary it will make it possible to use bits o mw code verbatin too
[20:29] <P0lygl0t> so it is as close as possible to the DB
[20:30] whatever language it will surely be rewritten from scratch a couple of times anyway
[20:30] <P0lygl0t> I would have been on who mentioned Python, of course, but it's far more important to be practical
[20:30] the mw api code is notoriously difficult to grok. even brion has trouble with it
[20:31] php is also multiplatform works on the web and the command line has lots of libraries and if it doesnt have good unicode support built in we can use the unicode stuff from mw
[20:31] <P0lygl0t> I was hoping somebody else would take care of the integration
[20:32] also learning it will help when you need to hack mw later
[20:32] <P0lygl0t> but I would not make it even harder by coding in another language
[20:32] done then
[20:32] motion passed
[20:32] <P0lygl0t> I have no plans in that direction
[20:33] think of other people that might want to help on the api
[20:34] it should be an open project with an svn or cvs repository on the toolserver or somewhere
[20:34] P0lygl0t: http://en.wiktionary.org/wiki/User:Mutante/langcodes
[20:34] http://s23.org/~mutante/codes.sql
[20:35] but "latin1" should be changed to utf-8
[20:35] i would probably start with a way to retrieve the wikitext of a given spelling in a given language and make codecs for en, la, and one other wikt
[20:36] in some of my projects that need to pass languges around i pass a structure with a name field and a code field
[20:38] for the language impedence problem i would use chinese in test cases from the very beginning
[20:39] also its probably a good idea to allow lists almost everywhere in the api
[20:39] <P0lygl0t> nl or de, maybe
[20:39] <P0lygl0t> nl uses iso3
[20:39] i would pick three very differently formatted ones
[20:39] <P0lygl0t> de uses something totally different
[20:39] <P0lygl0t> but I trust there to be quite a few people
[20:40] preferably at least two of which you are already familiar with
[20:40] <P0lygl0t> who would code to the API for de anyway
[20:40] <P0lygl0t> I only have some familiarity with nl, but not as much as I'd like
[20:40] if you can get a test page on toolserver as soon as possible even just a mockup it will get people interested
[20:41] updated User:Mutante/langcodes with some PHP code and links to the import CSV/text files
[20:41] <P0lygl0t> mutante: thanks
[20:42] <P0lygl0t> so, I don't think we have to do de ourselves
[20:42] have three fields: wiktionary, language, get the wikitext for the right part of the right page from the right wikt. call that wikts api to render it into html. post that html into the page using ajax
[20:43] something like that and allow for errors like no such page in that wikt, that wikt has such a page but no entry in that language, page not parseable, etc
[20:44] <P0lygl0t> sounds good
[20:46] <P0lygl0t> the challenge with nl.wikt is that it has no depth in its headers
[20:47] <P0lygl0t> and their use of iso3, or maybe that's 'a good thing'
[20:49] <P0lygl0t> sounds good hippietrail
[20:50] i would say a campaign to make all wiktionaries use iso3 like nl, because it is a good thing
[20:50] <P0lygl0t> hippietrail: do you have an account on toolservr?
[20:50] after all, thats what ISO is for
[20:50] to solve those problems
[20:50] <P0lygl0t> mutante: I don't see it happen on en.wikt already
[20:51] <P0lygl0t> the other wiktionaries had a chance 3,5 years ago to do things right
[20:51] <P0lygl0t> and some still chose not to
[20:51] maybe if it hurts more because then they dont have an API :p
[20:51] <P0lygl0t> in fact, I think only nl.wikt is using iso3
[20:51] heh
[20:52] <P0lygl0t> they can still have an API, only the API becomes harder to implement
[20:53] <P0lygl0t> I don't see en.wikt convert to using iso3 instead of lang names
[20:53] <P0lygl0t> I estimate half of the Wiktionary projects use iso2 templates
[20:53] <P0lygl0t> and the others use plain language names
[20:54] <P0lygl0t> also consider what Wikipedia is using, stuff like zh-min-nan
[20:54] <P0lygl0t> what kind of convention is that, where did it come from?
[20:55] it may be disappointing to users that not even all wiktionaries follow the same standard, when they might expect it from all Wikimedia projects already
[20:55] <P0lygl0t> is there no iso3 for them?
[20:56] i dont know, does ISO ever update and add new languages by request?
[20:57] <P0lygl0t> I'm sure they do, but I'm also convinced it takes them a while
[20:57] <P0lygl0t> Mandarin and Cantonese should have been on their radar from the beginning though
[21:00] <P0lygl0t> so when can I start making requests to the 'Unified Wiktionary API'?
[21:00] <P0lygl0t> hippietrail: still there?
[21:02] <P0lygl0t> I'm going to reread all that we wrote
[21:05] step out for snacks and medicine and now a huge backlog! (-:
[21:06] i do have an account on ts yes but havent accessed it in ages
[21:07] mutante are you volunteering to get all wiktionaries to start using iso3 then? should we hold off on the api until theyve all converted all their articles?
[21:08] is it easier to convert the world to one religion or to deal with the world weve got?
[21:09] if you want one standard there is omegawiktionary if you want wiktionaries with lots of content there are a bunch with no standards
[21:10] ok backlog read and responded to
[21:11] <P0lygl0t> who is volunteering to jot down what we discussed?
[21:11] not me!
[21:11] <P0lygl0t> I already sensed that :-)
[21:11] <P0lygl0t> I don't think it's realistic to have the wiktionaries change in revolutionary ways
[21:12] i see the mapping problem as easier than the mass conversion of projects but the two can go on simultaneously each propelled by the people who favour them. so really there is nothing to argue over
[21:12] <P0lygl0t> so a unified API it is
[21:12] its like expecting one true image format or video format of wiki software
[21:12] unify the language notation and everything else still differs anyway
[21:12] <P0lygl0t> I think that at a certain point there will be evolution, just like there is with en.wikt
[21:13] hippietrail: uhm, no, that would take too long
[21:13] <P0lygl0t> a few years ago I planted a seed with those t templates and now it has been worked out and starts being used
[21:14] you are right hippie
[21:14] there is more reason to standardize when there are more useful things that depend on things being standard
[21:14] <P0lygl0t> indeed, I believe that doing this API is going to make change happen
[21:15] <P0lygl0t> as they then have something to work towards and maybe it will be perceived as better (hopefully)
[21:16] now don't forget translingual. a user might look up something in english but its entry is actually under translingual. that should "just work". sometimes there will be both. both should be returned. thus always a list rather than a single item
[21:16] <P0lygl0t> anyway, I guess I'm the Chinese volunteer, if I want to see this happen
[21:16] preaching about some far off goal wont move many people
[21:17] <P0lygl0t> indeed, that's why there needs to be something 'tangible', like this API
[21:17] i have a javascript extension that goes through the translation tables and checks if each blue link has an entry for the right lanuage. it has to deal wth the chinese problem among others
[21:18] i encourage you to do a quick and dirty python proof of concept to put live while you learn php
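
The first-level parse mentioned in the log (page := prolog article+ epilog) could be prototyped along these lines. This is a quick Python proof of concept under several assumptions, not the wiktparser code: it assumes en.wikt-style level-2 language headings (==Language==), treats trailing lines of the form [[prefix:...]] as the epilog, and leaves each language article as unparsed wikitext. The function name `first_level_parse` and the sample page are invented for the illustration.

```python
import re

# Level-2 headings only: "==English==" matches, "===Noun===" does not.
LEVEL2 = re.compile(r"^==([^=].*?)==[ \t]*$", re.MULTILINE)
# Heuristic for epilog lines such as category and interwiki links.
TRAILER = re.compile(r"^\[\[[A-Za-z-]+:[^\]]*\]\]$")


def first_level_parse(wikitext):
    """Split a page into prolog, per-language articles (raw wikitext) and epilog."""
    matches = list(LEVEL2.finditer(wikitext))
    if not matches:
        return {"prolog": wikitext.strip(), "articles": [], "epilog": ""}
    prolog = wikitext[: matches[0].start()].strip()
    articles = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        articles.append(
            {"language": match.group(1).strip(), "wikitext": wikitext[match.end():end].strip()}
        )
    # Peel trailing link lines off the last article; they belong to the page.
    lines = articles[-1]["wikitext"].split("\n")
    epilog = []
    while lines and (not lines[-1].strip() or TRAILER.match(lines[-1].strip())):
        line = lines.pop().strip()
        if line:
            epilog.insert(0, line)
    articles[-1]["wikitext"] = "\n".join(lines).strip()
    return {"prolog": prolog, "articles": articles, "epilog": "\n".join(epilog)}


# Made-up sample page in en.wikt style.
PAGE = """{{also|Helium}}
==English==
===Noun===
# A gas.

==Finnish==
===Noun===
# helium

[[Category:Chemical elements]]
[[fr:helium]]
"""

parsed = first_level_parse(PAGE)
print(parsed["prolog"])
print([article["language"] for article in parsed["articles"]])
print(parsed["epilog"])
```

Deeper levels can then parse only the sections that follow a known format, while everything else stays reachable as wikitext.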