Toolserver:Unified Wiktionary API

=Unified Wiktionary API=

The intention is to create an API to the contents of the various Wiktionary projects. Dictionary content needs a lot more structure than encyclopedia content, so each Wiktionary has developed a structure for its entries. The problem is that almost every Wiktionary uses a different structure, which impedes interoperability enormously.

It would be nice to overcome this with an API that can be queried in a standard way and that passes back results in a standard format.

Later, the goal would be to feed this standard format back to the API as well and edit entries that way, merging in the new information where feasible.

Language names and language templates
A first hurdle is that some Wiktionaries use the names of the languages fully spelled out, while others use two-letter (iso2) or three-letter (iso3) ISO codes in templates.

So it needs to be possible to convert from one to the other.
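The conversion described above could be sketched roughly as follows. This is only an illustration: the table holds four well-known languages, and the names `LANGUAGES`, `FIELDS` and `convert_language` are made up for this example. A real service would load the complete ISO lists and would also have to cope with ambiguous or non-ISO names.

```python
# Illustrative table only: a real service would load the full ISO code lists.
# Each row: (English name, two-letter code, three-letter code).
LANGUAGES = [
    ("English", "en", "eng"),
    ("Dutch", "nl", "nld"),
    ("German", "de", "deu"),
    ("Latin", "la", "lat"),
]

# Which column of a row holds which kind of identifier.
FIELDS = {"name": 0, "iso2": 1, "iso3": 2}


def convert_language(value, source, target):
    """Convert a language identifier between 'name', 'iso2' and 'iso3'.

    Returns None when the value is unknown, so that callers can allow for
    missing information instead of failing.
    """
    src, dst = FIELDS[source], FIELDS[target]
    for row in LANGUAGES:
        if row[src].lower() == value.lower():
            return row[dst]
    return None


print(convert_language("Dutch", "name", "iso3"))  # nld
print(convert_language("de", "iso2", "name"))     # German
```

Returning None for unknown values is a design choice of this sketch; raising an error would work just as well.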

Headings
We will need to keep a table for each Wiktionary recording the headings it uses and whether those headings are produced by templates.
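Such a table might look something like the sketch below. All names here (`HEADING_CONVENTIONS`, `local_heading`, the abstract section keys) are hypothetical, and the "xx" project with its template names is imaginary; only the en.wikt heading texts "Etymology" and "Translations" are taken from actual practice.

```python
# Hypothetical per-Wiktionary convention table. The real values would have
# to be collected for each project; "xx" is an imaginary Wiktionary.
HEADING_CONVENTIONS = {
    "en": {
        "uses_templates": False,
        "headings": {
            "etymology": "Etymology",
            "translations": "Translations",
        },
    },
    "xx": {
        "uses_templates": True,
        "headings": {
            "etymology": "{{-etym-}}",      # made-up template name
            "translations": "{{-trans-}}",  # made-up template name
        },
    },
}


def local_heading(wikt, section):
    """Return (heading text, is_template) for a section, or None if unknown."""
    conventions = HEADING_CONVENTIONS.get(wikt)
    if conventions is None or section not in conventions["headings"]:
        return None
    return conventions["headings"][section], conventions["uses_templates"]


print(local_heading("en", "etymology"))  # ('Etymology', False)
```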

Other conventions
Is each language separated from the next by a ?

Is there more than one language present on the same page? (la.wikt has a page per language/spelling combination.)

Requests to the API
It should be possible to ask which languages a given spelling has entries for on a given Wiktionary. It should be possible to ask for the entire contents of a given spelling for a certain language. It should be possible to ask which sections exist for a given spelling in a given language. A representation should be agreed upon for each possible section (etymology, pronunciation, definitions, translations, synonyms, etc.).
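The three kinds of request could be sketched like this, against an in-memory stand-in for one Wiktionary. The function names, the `ENTRIES` layout and the use of three-letter codes as keys are assumptions made for the example, and the "..." strings are placeholders rather than real entry content.

```python
# Hypothetical stand-in for one Wiktionary: spelling -> language -> sections.
# The "..." values are placeholders, not real entry content.
ENTRIES = {
    "helium": {
        "eng": {"etymology": "...", "pronunciation": "...", "definitions": ["..."]},
        "nld": {"definitions": ["..."], "translations": {}},
    },
}


def languages_for(spelling):
    """Which languages does this spelling have entries for?"""
    return sorted(ENTRIES.get(spelling, {}))


def entry_for(spelling, language):
    """Entire contents of a spelling for one language, or None if absent."""
    return ENTRIES.get(spelling, {}).get(language)


def sections_for(spelling, language):
    """Which sections exist for this spelling in this language?"""
    entry = entry_for(spelling, language)
    return sorted(entry) if entry is not None else []


print(languages_for("helium"))        # ['eng', 'nld']
print(sections_for("helium", "nld"))  # ['definitions', 'translations']
```

An unknown spelling or language yields an empty list or None here, which leaves room for missing information.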

So what could the interchange format look like?
This is only a proposal:

XML is a rather heavy format; JSON is a lot friendlier to the eyes of mere mortals. So I'm going to try representing an entry in it:

Hippietrail warned me not to overengineer it, but I would still like to try encoding the entry for Helium, to get a feel for what it could be(come). The first level would be an ordered list of what is present on a page: [preamble, lang1, lang2, ..., categories, interwikis]

At the preamble level there is room for links to other spellings (capitalized, with diacritics, etc.).
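As a rough sketch of that first level, the ordered list could be assembled and serialized as follows. The key names (`see_also`, `lang`) and the function `build_page` are invented for this illustration and are not part of any agreed format; the language entries are left as empty stubs.

```python
import json


def build_page(preamble, language_entries, categories, interwikis):
    """Assemble the first level as an ordered list:
    [preamble, lang1, lang2, ..., categories, interwikis]."""
    return [preamble, *language_entries, categories, interwikis]


page = build_page(
    preamble={"see_also": ["helium"]},                    # links to other spellings
    language_entries=[{"lang": "eng"}, {"lang": "deu"}],  # stubs only
    categories=[],
    interwikis=[],
)

# Indentation is only for readability; the data is the same without it.
print(json.dumps(page, indent=2))
```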

I've indented the examples for better readability, but that's not a requirement.

An entry can have more than one etymology. If there is only one, etym1 is used; the next uses etym2, and so on.

Sometimes a given spelling can be pronounced in different ways. It's not clear what to do when a particular pronunciation has more than one etymology, or when it's the other way around. I think it would be best to simply repeat the pronunciations if that occurs. That way the structure stays fixed.
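A minimal sketch of that numbering scheme, assuming pronunciations are simply copied under every etymology. The function name `build_language_entry` and the keys `etymology`, `pron` and `ipa` are hypothetical, and the strings are placeholders.

```python
import copy


def build_language_entry(etymologies, shared_pronunciations):
    """Number etymologies as etym1, etym2, ... and repeat the pronunciations
    under each one, so that every etymology block has the same shape."""
    entry = {}
    for number, etymology in enumerate(etymologies, start=1):
        entry[f"etym{number}"] = {
            "etymology": etymology,
            # Copied rather than shared: duplication keeps the structure fixed.
            "pron": copy.deepcopy(shared_pronunciations),
        }
    return entry


entry = build_language_entry(
    ["first origin (placeholder)", "second origin (placeholder)"],
    [{"ipa": "..."}],
)
print(list(entry))  # ['etym1', 'etym2']
```

With a single etymology the result contains only etym1, matching the rule above.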

Trying the same for Helium on de.wiktionary

[18:50]  and is it feasible to include that in the API of MediaWiki?
[18:50] that would come later but yes i have lookedd into it and asked rowan
[18:51] i would warn people not to overengineer the exchange format. focus on one small part first and just a couple of wiktionaries
[18:51] and allow for missing information
[18:51]  oh
[18:52]  that's a bit counter intuitive, isn't it?
[18:52] lingro has already done some of this stuff by the way
[18:52]  I would think it should include as much as possible
[18:52] perhaps but i speak from experience
[18:52]  who is lingro?
[18:52]  you mean because otherwise it never gets finished?
[18:53] then you will spend as much time as possible designing it and fighting with people who think the design should differ in unimportant ways
[18:53] lingro is a language learning site
[18:53] uses wiktionary as one of its data sources
[18:53] or even never gets started. "release early release often"
[18:54]  aha, lingro.com?
[18:54] look at omegawiktionary. overdesigned hard to use missing lots of info wiktionary has had for years
[18:54] yep. i invited them and some other projects such as ninjawords to contribute to a wiktparser project
[18:55]  sounds good
[18:55]  is there discussion on line about this somewhere?
[18:58] on the mailing list and on the beer parlour both maybe 6 months ago
[18:59] well they do all the work themselves. if the community did the work a lot more would get done
[19:04]  the community is already compiling a dictionary :-)
[19:07] yeah!
[19:11]  Do you have something like this in mind:
[19:12]  that you can query the server for all the definitions for a given word from a given wiktionary in a given language?
[19:12]  then query for all synonyms and translations?
[19:13]  or would you have to start by querying for the etymologies?
[19:13]  then dig deeper from there?
[19:14]  or would it simply be: give me all you have about a certain spelling (and maybe its variations) in an XML format?
[19:15] <P0lygl0t> and when the parser cannot convert towards XML flag the entry by putting it in an rfc cat?
[19:15] you should be able to query any field
[19:15] <P0lygl0t> yes, but you wanted to keep it simple :-)
[19:16] <P0lygl0t> for getting started
[19:16] as the api got more use we would review parts of the format yes
[19:16] i focused on the translation section
[19:16] <P0lygl0t> so you always get the result in xml
[19:16] <P0lygl0t> and you can tell the server what field/section you are interested in
[19:17] if you look at wiktparser on the toolserver you will find a tool that extracts spanish translations from english articles
[19:17] i would prefer json but the api works with lots of formats
[19:17] <P0lygl0t> what is json?
[19:18] json is derived from javascript objects but is now a standard used by many languages due to being much more lightweight than xml. ajax uses it a lot
[19:18] <P0lygl0t> do you also foresee a way to feed the API xml that gets included in the entry?
[19:18] yes later
[19:19] <P0lygl0t> aha, I should probably look into json
[19:19] of course we could always make a proxy to the api on the toolserver or somewhere before actually adding to the api directly
[19:19] <P0lygl0t> why don't we get started to implement something that can do the extraction for en.wikt?
[19:20] i think its easier to just think in terms of data structure rather than how to encode that structure as text
[19:20] sure. en wikt is easy because our heading levels already reveal a lot of structure and the heading levels are not hidden within templates
[19:21] my first level parse just returned a structure reflecting the headings with the wikitext of each section within
[19:22] <P0lygl0t> ic, but that's not quite good enough
[19:23] one caveat wth our heading format is that it has a surface structure and a deep structure most obviously when comparing a sngle etym article to a multi etym article
[19:23] thats the first level
[19:24] in fact theres a level before that which is just to retrieve the wikitext of a particular language entry of a page
[19:24] there is also metadata such as the stuff before the first language include the also template and the stuff after the last language such as some categories an the interwiki links
[19:25] <P0lygl0t> indeed
[19:26] the beauty of a multi level parse is you can parse completely the bits you understand completey which conform to a specifed format and other bits you can still get to the wikitext of
[19:26] for instance the etymology and pronunciation sections are too messy for a machine to parse
[19:27] also when you grab the wikitext you can do stuff like store it all then get some stats on what the most common ways are that people write them
[19:28] i was working on some articles on the format or "grammar" of a wiktionary page
[19:28] <P0lygl0t> so multilevel sounds like a good idea, but other language wiktionaries will have a harder time implementing it
[19:29] <P0lygl0t> because they lack depth in their headers
[19:29] such as page := prolog article+ epilog
[19:29] or the headings are hidden in templates
[19:30] of course a parser can also look into templates especially if its working from a dump file
[19:31] yes thats why an interchange format is good. the en codec could convert to the interchange format using the multilevel stuff but other wikts might have to use more brute force methods
[19:31] other wikts wont have the same prolog and epilog as us but they will have at least some of the same metadata
[19:32] so the interchange format would include metadata but might not include prolog or epilog
[19:33] a page on most wikts will consist of one or more language articles, but latin will only ever have one language or article per page
[19:33] other wikts might not have a concept of translingual
[19:52] anyway taking all that into consideration i would focus on the articleentry level first because all wikts will have it in common
[19:53] you would want to be able to query "does the xx wiktionary have an article in the yy language?"
[19:54] so the codec will have to be able to find a page and then the part of that page dealing with a specific language
[19:55] and something will have to deal with language name impedence mismatches with 2 letter and 3 letter codes as well as the mismatch between "chinese" and "mandarn" etc
[19:56] i would not leave that stuff to the codec if possible or the result will be lots of variety each user will have to handle and inevitibly wll do so differently
[19:57] <P0lygl0t> so the API would know how to convert from iso2 to iso3 to actual language names in all languages?
[19:59] to some degree. not all because each wikt will have the laguage names in its own language!
[20:13] <P0lygl0t> hippietrail: if I want to look something up on en.wikt or de.wikt it would be nice if I could simply pass it the iso3 code
[20:14] it would be but that is to assume no wikt has any ambiguities in language names or codes
[20:14] and also that you know which iso3 code you want when you only know the language name "zapotec"
[20:14] does wiktionary strictly follow ISO codes?
[20:15] <P0lygl0t> if I only know the language name, I would want the API to resolve that to iso3 for nl.wikt
[20:15] because f.e. we use "ksh" and that doesnt seem to be in the ISO file i imported
[20:15] also "translingual" has no language code and i think we still have non language entries for letters and symbols
[20:15] <P0lygl0t> and iso2 for a lot of other ones
[20:15] hippietrail: translingual = mul
[20:15] <P0lygl0t> what does ksh stand for?
[20:15] Kölsch
[20:15] Ripuarian languages
[20:15] well thats one interpretation
[20:15] mul could also mean give me every language
[20:16] it means Translingual for {{infl
[20:16] <P0lygl0t> I saw vls was used by the Wikipedia of West Flanders
[20:16] <P0lygl0t> instead of all of Flanders...
[20:16] --> GerardM- has joined this channel (n=chatzill@i181119.upc-i.chello.nl).
[20:16] iso does what ethnologue does and ethnologue is a splitter
[20:16] <LinkyAenwk> http://en.wiktionary.org/wiki/splitter
[20:17] <P0lygl0t> I don't think mediawiki follows ISO strictly
[20:17] nor would most of the smaller wikts
[20:18] <P0lygl0t> we have stuff like zh-min-nan
[20:18] talk ISO into doing what wiktionary does? :P
[20:18] <P0lygl0t> so, I guess we need full lang name, ISO2, ISO3 and WM convention
[20:18] <P0lygl0t> i mean MW
[20:18] probably some system where you can query "chinese" and get all enties that are part of that "macrolanguage" as well as ones merely labelled "chinese"
[20:18] is it MW or rather just en.wikt?
[20:19] dont see language codes being defined in standard MW install
[20:19] <P0lygl0t> not sure, but we are talking about an API to all Wiktionaries
[20:19] i can guarantee you will not even find a consensus between the mw projects
[20:19] <P0lygl0t> some do things one way, others another way
[20:20] <P0lygl0t> great, is that something for the API to resolve then?
[20:20] the other approach is to just state that anything falling outside the strict iso3 language codes will fail until the articles ar edited to conform
[20:20] yes for the api. which is why one central open is best
[20:21] <P0lygl0t> are you suggesting that en.wikt converts all language names to iso3?
[20:21] devs wont have to tackle the problem oveer and over and come up with differing solutions
[20:21] i make no such suggestion. only making the problems clearer
[20:22] and with a central open api devs can add new ways to resolve the problems that they understand and all users will benefit
[20:22] maybe the first feature of the API should be a function to request a list of language codes from each wiki, then
[20:22] <P0lygl0t> I was trying to find out if the API would include converting from one to the other
[20:23] so each wiktionary would have to make a standard special page, which outputs the local code->language table
[20:24] <P0lygl0t> hopefully that language code would be iso3 then?
[20:24] then the API could request that first and stick to it, before making other requests using the code
[20:24] the api should probably know 2 two 3 letter conversions and synonyms and know which sets of codes map to each macrolanguage
[20:24] mapping codes to language names might belong in each codec
[20:24] <P0lygl0t> ok, I think that's the first thing to work out
[20:25] <P0lygl0t> code wise
[20:25] <P0lygl0t> I think it's almost a separate little project
[20:26] know-it-all already has a mysql database with 2 and 3 letter codes and english names, btw
[20:26] <P0lygl0t> good starting point, I think
[20:26] <P0lygl0t> in what language is it coded?
[20:26] you can also be pragmatic and keep this problem in mind when working on other stuff because its a hard problem that becomes clearer with experience and the experience is hard to get if this problem blocks all other work
[20:26] PHP
[20:27] iso publishes all this stuff in txt files designed to be slurped easily into databases
[20:27] yes, i used one of those to impotz
[20:27] <P0lygl0t> good, I think the resulting API should also be coded in PHP, isn't it hippietrail?
[20:27] import
[20:27] including synonyms macrolanguages and french names as well as english
[20:27] php makes sense if in the long run you want to integrate it into the mw api
[20:28] <P0lygl0t> but it needs to be appended with what MW started using
[20:28] <P0lygl0t> I thought that was the intention, or not?
[20:28] but to start i would make it as a proxy on a site such as toolserver
[20:28] <P0lygl0t> what other language would make sense?
[20:28] can i upload .zip or something to wiktionary, no..hm?
[20:29] python perl and java will probably be mentioned by other people if you ask them
[20:29] <P0lygl0t> but if you want to integrate later on, I don't think any other language would make sense
[20:29] i agree
[20:29] <P0lygl0t> and I think the whole goal would be to integrate it
[20:29] if necessary it will make it possible to use bits o mw code verbatin too
[20:29] <P0lygl0t> so it is as close as possible to the DB
[20:30] whatever language it will surely be rewritten from scratch a couple of times anyway
[20:30] <P0lygl0t> I would have been on who mentioned Python, of course, but it's far more important to be practical
[20:30] the mw api code is notoriously difficult to grok. even brion has trouble with it
[20:31] php is also multiplatform works on the web and the command line has lots of libraries and if it doesnt have good unicode support built in we can use the unicode stuff from mw
[20:31] <P0lygl0t> I was hoping somebody else would take care of the integration
[20:32] also learning it will help when you need to hack mw later
[20:32] <P0lygl0t> but I would not make it even harder by coding in another language
[20:32] done then
[20:32] motion passed
[20:32] <P0lygl0t> I have no plans in that direction
[20:33] think of other people that might want to help on the api
[20:34] it should be an open project with an svn or cvs repository on the toolserver or somewhere
[20:34] P0lygl0t: http://en.wiktionary.org/wiki/User:Mutante/langcodes
[20:34] http://s23.org/~mutante/codes.sql
[20:35] but "latin1" should be changed to utf-8
[20:35] i would probably start with a way to retrieve the wikitext of a given spelling in a given language and make codecs for en, la, and one other wikt
[20:36] in some of my projects that need to pass languges around i pass a structure with a name field and a code field
[20:38] for the language impedence problem i would use chinese in test cases from the very beginning
[20:39] also its probably a good idea to allow lists almost everywhere in the api
[20:39] <P0lygl0t> nl or de, maybe
[20:39] <P0lygl0t> nl uses iso3
[20:39] i would pick three very differently formatted ones
[20:39] <P0lygl0t> de uses something totally different
[20:39] <P0lygl0t> but I trust there to be quite a few people
[20:40] preferably at least two of which you are already familiar with
[20:40] <P0lygl0t> who would code to the API for de anyway
[20:40] <P0lygl0t> I only have some familiarity with nl, but not as much as I'd like
[20:40] if you can get a test page on toolserver as soon as possible even just a mockup it will get people interested
[20:41] updated User:Mutante/langcodes with some PHP code and links to the import CSV/text files
[20:41] <P0lygl0t> mutante: thanks
[20:42] <P0lygl0t> so, I don't think we have to do de ourselves
[20:42] have three fields: wiktionary, language, get the wikitext for the right part of the right page from the right wikt. call that wikts api to render it into html. post that html into the page using ajax
[20:43] something like that and allow for errors like no such page in that wikt, that wikt has such a page but no entry in that language, page not parseable, etc
[20:44] <P0lygl0t> sounds good
[20:46] <P0lygl0t> the challenge with nl.wikt is that it has no depth in its headers
[20:47] <P0lygl0t> and their use of iso3, or maybe that's 'a good thing'
[20:49] <P0lygl0t> sounds good hippietrail
[20:50] i would say a campaign to make all wiktionaries use iso3 like nl, because it is a good thing
[20:50] <P0lygl0t> hippietrail: do you have an account on toolservr?
[20:50] after all, thats what ISO is for
[20:50] to solve those problems
[20:50] <P0lygl0t> mutante: I don't see it happen on en.wikt already
[20:51] <P0lygl0t> the other wiktionaries had a chance 3,5 years ago to do things right
[20:51] <P0lygl0t> and some still chose not to
[20:51] maybe if it hurts more because then they dont have an API :p
[20:51] <P0lygl0t> in fact, I think only nl.wikt is using iso3
[20:51] heh
[20:52] <P0lygl0t> they can still have an API, only the API becomes harder to implement
[20:53] <P0lygl0t> I don't see en.wikt convert to using iso3 instead of lang names
[20:53] <P0lygl0t> I estimate half of the Wiktionary projects use iso2 templates
[20:53] <P0lygl0t> and the others use plain language names
[20:54] <P0lygl0t> also consider what Wikipedia is using, stuff like zh-min-nan
[20:54] <P0lygl0t> what kind of convention is that, where did it come from?
[20:55] it may be disappointing to users that not even all wiktionaries follow the same standard, when they might expect it from all Wikimedia projects already
[20:55] <P0lygl0t> is there no iso3 for them?
[20:56] i dont know, does ISO ever update and add new languages by request?
[20:57] <P0lygl0t> I'm sure they do, but I'm also convinced it takes them a while
[20:57] <P0lygl0t> Mandarin and Cantonese should have been on their radar from the beginning though
[21:00] <P0lygl0t> so when can I start making requests to the 'Unified Wiktionary API'?
[21:00] <P0lygl0t> hippietrail: still there?
[21:02] <P0lygl0t> I'm going to reread all that we wrote
[21:05] step out for snacks and medicine and now a huge backlog! (-:
[21:06] i do have an account on ts yes but havent accessed it in ages
[21:07] mutante are you volunteering to get all wiktionaries to start using iso3 then? should we hold off on the api until theyve all converted all their articles?
[21:08] is it easier to convert the world to one religion or to deal with the world weve got?
[21:09] if you want one standard there is omegawiktionary if you want wiktionaries with lots of content there are a bunch with no standards
[21:10] ok backlog read and responded to
[21:11] <P0lygl0t> who is volunteering to jot down what we discussed?
[21:11] not me!
[21:11] <P0lygl0t> I already sensed that :-)
[21:11] <P0lygl0t> I don't think it's realistic to have the wiktionaries change in revolutionary ways
[21:12] i see the mapping problem as easier than the mass conversion of projects but the two can go on simultaneously each propelled by the people who favour them. so really there is nothing to argue over
[21:12] <P0lygl0t> so a unified API it is
[21:12] its like expecting one true image format or video format of wiki software
[21:12] unify the language notation and everything else still differs anyway
[21:12] <P0lygl0t> I think that at a certain point there will be evolution, just like there is with en.wikt
[21:13] hippietrail: uhm, no, that would take too long
[21:13] <P0lygl0t> a few years ago I planted a seed with those t templates and now it has been worked out and starts being used
[21:14] you are right hippie
[21:14] there is more reason to standardize when there are more useful things that depend on things being standard
[21:14] <P0lygl0t> indeed, I believe that doing this API is going to make change happen
[21:15] <P0lygl0t> as they then have something to work towards and maybe it will be perceived as better (hopefully)
[21:16] now don't forget translingual. a user might look up something in english but its entry is actually under translingual. that should "just work". sometimes there will be both. both should be returned. thus always a list rather than a single item
[21:16] <P0lygl0t> anyway, I guess I'm the Chinese volunteer, if I want to see this happen
[21:16] preaching about some far off goal wont move many people
[21:17] <P0lygl0t> indeed, that's why there needs to be something 'tangible', like this API
[21:17] i have a javascript extension that goes through the translation tables and checks if each blue link has an entry for the right lanuage. it has to deal wth the chinese problem among others
[21:18] i encourage you to do a quick and dirty python proof of concept to put live while you learn php
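
The first-level parse mentioned in the log (page := prolog article+ epilog) could be tried out with a quick Python sketch along these lines. It assumes en.wikt-style pages where each language article opens with a level-2 heading such as ==English==; the names `LANGUAGE_HEADING` and `split_page` are made up, and the epilog (categories, interwiki links) is not separated here but simply stays attached to the last article.

```python
import re

# A level-2 heading such as "==English==" opens a language article.
# Deeper headings ("===Noun===") do not match because of the [^=] guard.
LANGUAGE_HEADING = re.compile(r"^==([^=].*?)==[ \t]*$", re.MULTILINE)


def split_page(wikitext):
    """Split a page into (prolog, [(language, article wikitext), ...]).

    The prolog is everything before the first language heading. Pages
    without any language heading come back as prolog only.
    """
    matches = list(LANGUAGE_HEADING.finditer(wikitext))
    if not matches:
        return wikitext, []
    prolog = wikitext[:matches[0].start()]
    articles = []
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(wikitext)
        articles.append((current.group(1).strip(), wikitext[current.end():end]))
    return prolog, articles


sample = "{{also|Helium}}\n==English==\n===Noun===\nhelium\n==Finnish==\n===Noun===\nhelium\n"
prolog, articles = split_page(sample)
print([language for language, _ in articles])  # ['English', 'Finnish']
```

The article bodies are returned as raw wikitext, so sections that are too messy to parse further can still be handed on unchanged.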

[20:42] I think you will have a problem requiring everyone to convert to one true format [20:42] but having mappings in the back end should be the workaround [20:43] <P0lygl0t> I'm not trying to convert anyone [20:43] <P0lygl0t> it's meant as an interchange format [20:43] <P0lygl0t> not as heavy as xml [20:44] <P0lygl0t> It's also a very first draft [20:44] <P0lygl0t> just trying to brainstorm [20:45] <P0lygl0t> atglenn: did you have a look at the document I linked to? [20:47] <P0lygl0t> creating that mapping in the background is going to prove to be a challenge too, I'm afraid [20:47] yes it will [20:47] <Thogo> hi P0lygl0t [20:47] I did a test case on en wikt and el wikt for an extension that I [20:47] still have not committed but I will Real Soon Now [20:47] <P0lygl0t> but it's the first step in achieving what I want to create [20:48] I think we might be working on versions of the same thing [20:48] which is cool... [20:48] <P0lygl0t> indeed, especially if we can find a way to merge efforts [20:49] <P0lygl0t> in order to come to a universal solution to the bigger issue [20:50] yup [20:50] <P0lygl0t> you made me curious: what are you working on? [20:52] have an extension for showing subsets of words based on context tag, lang, translations .... optionally with a snippet from the def and optionally with translations [20:52] dynamic glossary (one use) [20:52] something I feel we've been missing on el [20:53] but I've been wanting .. not unified format across all wikts [20:53] but conertable format.. so everythig tagged on each project [20:53] however it gets tagged locally. 
so that we can retrieve it [20:53] and so that later a mapping can be done [20:53] <P0lygl0t> I'm not trying to go for the impossible: unified format across all wiktionaries [20:53] <P0lygl0t> rather I'd like a way to translate from and to such unified format [20:54] uh u=huh [20:54] so I start with the simple stuff: being able to locate defs, knowning the language of the entry, finding translations [20:54] do that first well... then we move outward [20:54] <P0lygl0t> in which programming language? [20:54] extension so it's mediawiki (php) [20:55] <P0lygl0t> that's what has my biggest interest as well [20:55] using auxiliary tables. [20:55] I;ve had it going a couple of months but no time to get back ti it [20:55] <P0lygl0t> tables as in DB tables, or as in data structures? [20:55] * atglenn bumps it up a bit on the todo queue [20:55] db tables [20:55] <P0lygl0t> is it somewhere public? [20:57] it will be as soon as I shove it into svn, my bad I haven't done it [20:57] next couple of days. [20:57] <P0lygl0t> great [20:58] <P0lygl0t> I don't have a lot of experience with PHP, but I feel this sort of thing needs to be written in PHP [20:58] <P0lygl0t> so it can become part of the MW API [20:59] uh huh [20:59] well my bit was specifically an extension for the dynamic glossary stuff which could be expanded over time... 
[20:59] however the api piece of that could be abstracted at a future date if this got adopted [21:00] the interface is rough, it's based on someone clueful on the project setting up regexps to do the work [21:00] <P0lygl0t> I'm not entirely sure what it entails to create an API [21:00] <P0lygl0t> so I'm concentrating on the interchange format, for the moment [21:07] <P0lygl0t> I'm sure somebody brighter than myself will come and to fill in the blanks [21:08] <P0lygl0t> or at least hopeful [21:08] <P0lygl0t> atglenn: right now, I'm not sure what to do about the templates [21:09] <P0lygl0t> the user querying the DB is not interested in the name of the template [21:09] <P0lygl0t> but rather in what it expands into [21:09] right [21:10] the user is going to be allowed to give the name they see [21:10] <P0lygl0t> and even then, it should maybe spelled out what is a plural, what a superlative etc [21:10] this name must be mapped to what string lives in the wikitext [21:10] <P0lygl0t> indeed, but then how does it work the other way around? [21:11] <P0lygl0t> when somebody tries to feed a word and its plural [21:11] i.e. a table that contains both, indexed on both [21:11] a word and its plural? [21:11] ah, you don't mean grammatical terms... or do you? [21:11] <P0lygl0t> say that an inflection template expands to a word and its corresponding (possibly irregular) plural [21:11] <P0lygl0t> it's easy to expand that and pass it to the user [21:12] oh, I see [21:12] <P0lygl0t> but when the information comes back, how to turn it back into the template that was used in the first place? [21:12] mm hmm [21:12] <P0lygl0t> I feel the API should be able to function in two directions [21:13] <P0lygl0t> maybe both should be exported... but that still doesn't solve the whole problem [21:13] <P0lygl0t> both the template and what it expands into [21:13] <P0lygl0t> I mean [21:14] why do you want to convert back to the template in this case? 
[21:14] --> know-it-all has joined this channel (i=edgar@cl-345.dus-01.de.sixxs.net). [21:15] <P0lygl0t> I would like to be able to create/append to an article throught this API [21:15] <P0lygl0t> say, I find a translation in a new language on another wiktionary [21:16] <P0lygl0t> it would be nice if I could feed that back to the other wiktionaries [21:16] <P0lygl0t> through this API [21:18] <P0lygl0t> but then the codex needs to know about all the templates and when to use them, based on the forms it receives [21:18] yes. I think that is out of the question [21:19] <P0lygl0t> the problem then becomes that when I ask for the contents of an entry [21:19] <P0lygl0t> then modify/correct it and ask the API to put it back [21:19] the api could allow the user to specify a section that the user wants to tweak (possibly) [21:19] <P0lygl0t> the inflection templates would be replaced by plain text once again [21:20] but storing knowledge of template expansion is a losing battle. [21:20] <P0lygl0t> maybe comparing what the user supplied and what the template expands into is a possibility? [21:21] <P0lygl0t> but then it would still be hard to choose the template that would have been appropriate [21:21] <P0lygl0t> maybe the only option in that case is to tag the entry with an RFC [21:22] <P0lygl0t> so a human can have a look at it [21:22] <P0lygl0t> I don't think it's actually going to be common with the inflection templates [21:23] <P0lygl0t> but it's certainly possible that a word got a certain gender on en.wikt [21:23] <P0lygl0t> and that we find out on other Wiktionaries that this is wrong [21:24] <P0lygl0t> my intention in the end is to find this sort of inconsistencies [21:24] <P0lygl0t> and flag them [21:25] hmm [21:25] I would say [21:25] wait, it's a huge problem, take small solvable things first and build around them. 
[21:25] <P0lygl0t> you are right [21:26] <P0lygl0t> hippietrail was also saying that I shouldn't be overengineering [21:26] <P0lygl0t> anyway, this means we go for export in the first place [21:26] <P0lygl0t> and templates get expanded [21:28] <P0lygl0t> now I'm going to try and integrate what de.wikt has for Helium into that json [23:25] hey P0lygl0t [23:25] sorry, been with a friend [23:25] (if that's allowed?) [23:27] so this API is going to mirror en.wikt format? [23:35] <P0lygl0t> of course [23:35] <P0lygl0t> cirwin: every once in a while [23:35] hmm [23:35] ok [23:35] huh? [23:35] <P0lygl0t> :-) [23:35] I'll bear it in mind :p [23:35] hihi atglenn [23:36] <P0lygl0t> did you have a look at what I scribbled/jotted down on toolserver? [23:36] * cirwin strongly thinks api format should not nest definitions within etymology sections [23:36] it's going to mirror the format? i thought it was going to map to a universal format.... [23:36] yup [23:36] maybe I missed some bits somewhere [23:36] atglenn: he was answering the quesiton in brackets [23:36] nope [23:36] even [23:36] shouldn't... [23:36] * cirwin stops [23:36] P0lygl0t: explain :p [23:36] <P0lygl0t> uh oh [23:37] <P0lygl0t> maybe I should put my answer to a question in brackets, in brackets as well [23:37] I do that [23:37] (it's fun) [23:37] <P0lygl0t> the intention is to reach a universal format [23:37] ((and leads to nested opportunities)) [23:37] <P0lygl0t> cirwin: lol [23:37] I think the sensible way to do that is to have a list of ety and a list of pron [23:37] <P0lygl0t> or should that have been (((lol)))? [23:37] and refer to them by number [23:37] (maybe :p) [23:38] <P0lygl0t> really cirwin? you don't like that they are nested the way I did them? [23:38] no [23:38] <P0lygl0t> ok [23:38] <P0lygl0t> so you assign them numbers and then refer to them later on? 
[23:38] yes
[23:39] <P0lygl0t> ok, no duplication of data anymore
[23:39] that way the problem of sharing pron sections doesn't matter, and if there are no etys given you don't need an ety1 section
[23:39] <P0lygl0t> and the nesting becomes less deep, which also has to be worth something
[23:39] indeed
[23:40] I think you cannot expect or enforce any nesting except:
[23:40] defn will be in a pos someplace
[23:40] <P0lygl0t> other remarks? did you like the choice of json?
[23:40] also, Template:wikipedia should probably be put into a "links":
[23:40] and translations will go with a def (somehow).
[23:40] yes
[23:40] (yes to json)
[23:40] any more than that, at this stage, is asking for trouble.
[23:40] for now, copying en.wikt gloss is enough
[23:40] get buy-in to simple stuff first that's attainable.
[23:40] <P0lygl0t> translations, synonyms are all nested under a def
[23:41] great
[23:41] ok
[23:41] could be tricky
[23:41] uh huh
[23:41] <P0lygl0t> I was trying to do something comparable for de.wikt
[23:41] <P0lygl0t> that proved hard
[23:41] that's the one main thing I'd like to change about en.wikt format
[23:41] imagine trying to get everyone everywhere to change.. and agree on the change. :-P
[23:42] <P0lygl0t> what do you mean by gloss? that thing that's supposed to refer back to the def?
[23:42] hehe
[23:42] you won't win that battle now...
[23:42] P0lygl0t: yah
[23:42] atglenn: I'm aware of this
[23:42] I will build a better wiktionary
[23:42] ok
[23:42] and subvert all the contributors
[23:42] see ya in 50 yrs :-P
[23:42] 50 hrs, max :p
[23:42] <P0lygl0t> atglenn: I realize getting anyone at all to change, will prove impossible
[23:43] <P0lygl0t> that's why I was trying to come up with this, as a band aid
[23:43] you can rope em in slowly
[23:43] but get em hooked on why it's useful first, with stuff they don't have to change
[23:43] yeah
[23:43] like no creating form-of entries
[23:43] <P0lygl0t> cirwin: what do you mean by gloss? that thing that's supposed to refer back to the def?
[23:43] automagic links to correct def
[23:43] P0lygl0t: yes
[23:43] you can use it to guess with which def trans go
[23:43] about 98% of the time
[23:44] <P0lygl0t> I'd prefer to have them nested, that way there is no confusion at all
[23:44] yes
[23:44] I agree, that is better
[23:44] <P0lygl0t> I would simply take the gloss along, in order to be able to do the reverse operation
[23:44] ok
[23:44] <P0lygl0t> send something in through the API and have the entry updated
[23:45] from a human point of view
[23:45] it'd be nicer not to abbreviate in the json format
[23:45] <P0lygl0t> even though that's not the initial intention to deploy
[23:45] since when is trad short for translation :p
[23:45] <P0lygl0t> hehe, traduki, traduction, etc
[23:46] <P0lygl0t> anyway, the reason why I was abbreviating, was to get it unstuck/away from English a bit
[23:46] put template:Wikipedia into ext if you find it
[23:46] ok
[23:46] maybe a good thing
[23:46] <P0lygl0t> I didn't want to go Esperanto all the way (this time)
[23:46] good
[23:46] ;)
[23:47] <P0lygl0t> I was a bit disappointed I couldn't use subst for substantive (noun) though
[23:47] could get confusing though, if section names are the same as ISO-639 codes
[23:47] <P0lygl0t> would be too confusing because of substitution
[23:47] subst?
[23:47] yes...
[23:47] <P0lygl0t> nomen substantivum is a noun in Latin
[23:47] and because it's a noun
[23:47] also, you have def1
[23:48] shouldn't you just have defs : [ {}, {} } ?
[23:48] <P0lygl0t> that's why I would have liked to use subst/adj/adv etc, but that will be too confusing
[23:48] [ {}, {} ]
[23:48] noun adjective adverb sounds great to me :)
[23:48] <P0lygl0t> let me think about that, I did it because I also did it for etymologies
[23:49] probably rename "cat" to "topics"
[23:49] so it is more general than just wiktionary
[23:49] <P0lygl0t> of course, they do, but they might not for people coming from other languages
[23:49] (you don't need to include the linguistic ones as they are derived from the entry)
[23:49] <P0lygl0t> I don't want to make it too English centric
[23:49] it's probably better to stick to one language than to mix many
[23:49] whatever you choose to do it will alienate some
[23:50] <P0lygl0t> the one line of German in between is because de.wikt had a picture
[23:50] german?
[23:50] <P0lygl0t> I started with a new example when I was encoding de.wikt though
[23:50] <P0lygl0t> but that's not there yet
[23:51] <P0lygl0t> were you talking about the Finnish entry?
[23:51] not really
[23:51] <P0lygl0t> I was simply trying to encode everything en.wikt happens to have on one page
[23:51] oh, ok
[23:51] it's pretty good, tbh
[23:51] I'm just picky :p
[23:51] <P0lygl0t> good, spent all my afternoon on it :-)
[23:52] <P0lygl0t> picky is good
[23:52] <P0lygl0t> as long as it's constructive
[23:52] ok, the only major change is to unnest etymology
[23:52] <P0lygl0t> yep, I agree with that and I'm glad you come up with a good way to accomplish it
[23:53] <P0lygl0t> come up=suggest
[23:53] <P0lygl0t> the tricky part now is to verify whether it's possible for the other wiktionaries to encode their entries this way
[23:54] <P0lygl0t> they have both more and less information
[23:54] indeed
[23:54] nesting stuff under the correct def is the hardest
[23:54] <P0lygl0t> de.wikt has Oberbegriffe, words that are 'above' the term
[23:55] <P0lygl0t> and under it
[23:55] hypernyms?
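Editorial sketch of the un-nesting agreed on above: etymologies and pronunciations become flat lists, each part-of-speech block refers to them by index, and definitions are a plain list with translations and synonyms nested under each definition. All key names here (`etymologies`, `parts_of_speech`, `links`, `topics` and so on) are assumptions for illustration, spelled out rather than abbreviated as suggested in the chat; the actual proposal may have used different names.

```python
import json

# Hypothetical un-nested entry: etymologies and pronunciations are flat lists,
# and each part-of-speech block refers to them by index instead of nesting.
entry = {
    "spelling": "example",
    "language": "en",
    "etymologies": ["first etymology text", "second etymology text"],
    "pronunciations": ["first pronunciation"],
    "parts_of_speech": [
        {
            "type": "noun",
            "etymology": 0,          # index into "etymologies"
            "pronunciations": [0],   # indexes into "pronunciations"
            "definitions": [
                {
                    "gloss": "short gloss",
                    "text": "full definition text",
                    "translations": {"nl": ["voorbeeld"]},
                    "synonyms": [],
                },
            ],
        },
        {
            "type": "verb",
            "etymology": 1,
            "pronunciations": [0],   # shared with the noun, stored only once
            "definitions": [],
        },
    ],
    "links": [],    # e.g. a Wikipedia link found on the page
    "topics": [],
}


def resolve(entry, block):
    """Return (etymology text or None, list of pronunciation texts) for one block."""
    index = block.get("etymology")
    etymology = None if index is None else entry["etymologies"][index]
    pronunciations = [entry["pronunciations"][i] for i in block.get("pronunciations", [])]
    return etymology, pronunciations


if __name__ == "__main__":
    print(json.dumps(entry, indent=2))
    for block in entry["parts_of_speech"]:
        print(block["type"], resolve(entry, block))
```

A block without an `etymology` key simply resolves to no etymology, so a page that gives none needs no placeholder section.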
[23:55] <P0lygl0t> for dog, they would list animal for instance
[23:55] yeah
[23:55] <P0lygl0t> or mammal
[23:55] define hypernym
[23:55] <know-it-all> 'hypernym' is English: (semantics) A superordinate grouping word or phrase which includes subordinate terms. "Musical instrument" is a hypernym of "guitar" because musical instruments include guitars.
[23:55] <P0lygl0t> yep
[23:55] we have some entries that have them too
[23:55] <P0lygl0t> ok, they are rather fanatic about that
[23:55] coroutine springs to mind, but there are others
[23:55] <LinkyAenwk> http://en.wiktionary.org/wiki/coroutine
[23:55] heh
[23:55] good thing
[23:56] <P0lygl0t> almost all their entries seem to have them
[23:56] no reason why not
[23:56] and they're easy to add if you think about them
[23:56] <P0lygl0t> and the opposite, but that's more comparable to our categories
[23:56] <P0lygl0t> but you are right, connecting trans to defs is going to be the trickiest part
[23:57] yah
[23:57] <P0lygl0t> but then, this will make them think about that and evolve in the right direction
[23:57] without a gloss, you just have to assume ordering
[23:57] which will be wrong
[23:57] <P0lygl0t> nl.wikt doesn't use #, but actual numbers to be able to refer back
[23:57] <P0lygl0t> that also works
[23:58] oh, cool
[23:58] <P0lygl0t> we could have done that too, but for some reason the automagical numbering with # seemed so much better back in the beginning...
[23:59] humans are bad at numbering stuff
[00:00] <P0lygl0t> anyway, I'm glad you looked at it
[00:00] <P0lygl0t> I think it's a necessary component
[00:00] yes
[00:00] <P0lygl0t> to have before being able to do the rest we were discussing
[00:00] if we can get it implemented on a few wikts
[00:00] then we are going to be sorted
[00:00] :)
[00:00] yeah
[00:01] <P0lygl0t> and it will be useful for a lot of other purposes as well
[00:01] indeed
[00:01] including the relational database of Leftmost
[00:02] <P0lygl0t> the only problem I have is how to implement it
[00:02] <P0lygl0t> it would be best to do it in PHP
[00:02] why?
[00:02] <P0lygl0t> but hippietrail suggested I make a prototype in Python first anyway
[00:02] <P0lygl0t> so it can run on the MediaWiki servers locally
[00:02] it can do that anyway
[00:02] <P0lygl0t> and become part of the MW API
[00:02] though it'd have to make a separate api request
[00:02] hmm
[00:03] I think that would be a bad idea
[00:03] should be a separate API
[00:03] <P0lygl0t> really?
[00:03] though it could still be a MW extension
[00:03] yes
[00:03] <P0lygl0t> then it doesn't matter in what language it's written
[00:03] the API has a large feature set
[00:03] nope
[00:03] it's a matter of politics, probably
[00:03] <P0lygl0t> has to be PHP?
[00:03] i dunno
[00:03] would wikimedia let anything else run on their servers
[00:04] probably depends on a lot of things
[00:04] but if it's not written as an extension
[00:04] you can run it on a totally different server
[00:04] <P0lygl0t> If I were MW, I would want everybody to stick to one thing
[00:04] and it will still work - even if wikimedia refuse to install it
[00:04] yeah
[00:04] there's that
[00:05] <P0lygl0t> but you are saying it doesn't have to become part of the API
[00:05] no
[00:05] it shouldn't
[00:05] the API is for MW
[00:05] this is not MW
[00:05] <P0lygl0t> wouldn't it be slower if it has to run from another server?
[00:05] a little
[00:05] but you only have to request a page
[00:05] so not much more
[00:05] <P0lygl0t> shouldn't be an issue
[00:05] <P0lygl0t> after all
[00:06] <P0lygl0t> so code it in Python after all?
[00:06] I'd say better well written python than hacked together php
[00:06] * P0lygl0t hates building a prototype and then having to reimplement the whole thing
[00:06] heh
[00:07] * P0lygl0t is not sure whether he can hack well written Python :-)
[00:07] <P0lygl0t> but the PHP will certainly be in a worse state
[00:07] by the end of this you'll be able to :p
[00:08] <P0lygl0t> oh and there is always the possibility to improve on it
[00:08] * P0lygl0t is glad jsonlint.com exists
[00:09] neat
[00:09] <P0lygl0t> would have gone mad without it (although trying to make what I came up with the first time comply almost drove me nuts as well)
[00:11] <P0lygl0t> an anon turned our irc discussion into a poem
[00:12] kind of him
[00:12] <P0lygl0t> by tagging it with tags
[00:12] yeah
[00:12] I noticed it wasn't a mess anymore :p
[00:12] <P0lygl0t> would have taken him quite some time as well otherwise
[00:13] * P0lygl0t should probably put tags around the json
[00:13]
[00:14] <P0lygl0t> oh, ok
[00:14] <P0lygl0t> do we get code highlighting then?
[00:14] yip
[00:14] <P0lygl0t> pity there is not a single js command then
[00:15] ?
[00:15] <P0lygl0t> well it's javascript's data format, but there is no js code there
[00:16] <P0lygl0t> so there won't be a lot of code to highlight
[00:16] it will make the numbers and strings coloured
[00:16] which is fun, if pointless
[00:17] <P0lygl0t> good
[00:17] <P0lygl0t> I like colors
[00:17] makes them all pale blue
[00:17] maybe not so nice
[00:17] but your choice
[00:17] black is ugly too
[00:18] <P0lygl0t> I'll see and if I don't like it I'll turn it into source lang=pgplsql
[00:18] <P0lygl0t> or was that plpgsql
[00:18] * P0lygl0t is getting confused
[00:19] see, that's the problem with abbreviations ;)
[00:19] <P0lygl0t> yep, true enough, but my abbreviations were meant to make it less English centric
[00:20] <P0lygl0t> it's bad enough that what it is now is a simplified representation of how en.wikt formats its entries...
[00:20] simplified is good?
[00:21] <P0lygl0t> I hope I find the time to have a look at de.wikt, fr.wikt and es.wikt to see how they do things
[00:21] <P0lygl0t> and I hope Hippietrail can do the same for la.wikt
[00:21] la.wikt is fun
[00:21] <P0lygl0t> and the Chinese Wiktionaries
[00:21] they have separate pages for word and word_(en)
[00:21] which is a good idea
[00:22] <P0lygl0t> yes, they do it completely differently
[00:22] but makes them even worse for standardising
[00:22] <P0lygl0t> which is why we have to see whether the format accommodates them
[00:22] this does accommodate them
[00:22] it just requires more pageloads
[00:22] <P0lygl0t> it should be possible to query the API for one spelling in one language
[00:22] yes
[00:23] which will be easier for them
[00:23] <P0lygl0t> or for one spelling in all languages that are known to that wiktionary
[00:23] <P0lygl0t> which will cause them to load more pages
[00:23] not a big problem
[00:23] <-- Tosca has left this server (Read error: 113 (No route to host)).
[00:23] <P0lygl0t> it may even take them to loop all combinations, but I suppose they have a list on top of the page like we do as well?
[00:24] yes
[00:24] if not /prefixIndex will show them
[00:24] <P0lygl0t> np at all then
[00:25] <P0lygl0t> oh and I would almost forget, but I'll go and have a look at nl.wikt as well
[00:25] <P0lygl0t> they use iso3
[00:25] ok
[00:25] <P0lygl0t> I think it was Hippietrail who said that we should maybe do that as well
[00:25] are you still planning to use iso639.py?
[00:25] yes
[00:25] much better than mixing standards
[00:25] <P0lygl0t> of course
[00:26] <P0lygl0t> the thing is, it should be possible to convert in all four directions:
[00:26] ok - once I've debugged the feature additions to creation.js I'll try to allow explicit version of standard
[00:26] <P0lygl0t> full lang name, iso2, iso3 and what MW does zh-min-nan
[00:26] gah, them too
[00:26] <P0lygl0t> that's what the wikipedias use as a prefix
[00:27] <P0lygl0t> and the wiktionaries too
[00:27] yup
[00:27] <P0lygl0t> there's a funny one; simple
[00:27] uhoh...
[00:27] <P0lygl0t> no idea how to convert that to iso2
[00:27] <P0lygl0t> maybe just en ;-)
[00:27] x-simple
[00:28] <P0lygl0t> or 2-simple
[00:28] depends if it matters or not
[00:28] <P0lygl0t> not to me, I don't expect them to have more info than en.wikt does
[00:29] true
[00:29] <P0lygl0t> I usually skip them when harvesting wikipedias
[00:29] <P0lygl0t> anyway, you got me thinking in smaller blocks
[00:29] <-- Ahonc has left this server (Connection timed out).
[00:30] <P0lygl0t> and even though this one isn't exactly small
[00:30] <P0lygl0t> I think it's good to separate this functionality into a separate project
[00:30] yeah
[00:31] <P0lygl0t> would it be possible to develop it on toolserver and have it run there if we ask nicely?
[00:31] yup
[00:31] I have an account there
[00:31] * P0lygl0t recalls you said they have a shadow copy of the data?
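Editorial sketch of the four-way conversion mentioned above (full language name, two-letter code, three-letter code, MediaWiki prefix). This is not the iso639.py module referred to in the chat; the `Lang` record, the `convert` function and the tiny table are assumptions for illustration, and mapping the `simple` prefix to English follows the "maybe just en" suggestion rather than any standard.

```python
from typing import NamedTuple, Optional


class Lang(NamedTuple):
    name: str             # full language name in English
    iso2: Optional[str]   # two-letter code, None when the language has none
    iso3: str             # three-letter code
    mw: str               # prefix used by the Wikimedia projects


# A handful of rows only; a real table would be loaded from a complete source.
LANGS = [
    Lang("English", "en", "eng", "en"),
    Lang("German", "de", "deu", "de"),
    Lang("Dutch", "nl", "nld", "nl"),
    Lang("Min Nan", None, "nan", "zh-min-nan"),
]

# Prefixes that are not language codes at all (assumption: treat as English).
MW_ALIASES = {"simple": "en"}


def convert(value, source, target):
    """Convert between 'name', 'iso2', 'iso3' and 'mw'; return None if unknown."""
    if source == "mw":
        value = MW_ALIASES.get(value, value)
    for lang in LANGS:
        if getattr(lang, source) == value:
            return getattr(lang, target)
    return None


if __name__ == "__main__":
    print(convert("Dutch", "name", "iso3"))
    print(convert("zh-min-nan", "mw", "name"))
    print(convert("simple", "mw", "iso2"))
```

Keeping one record per language means every direction is the same lookup, and a missing two-letter code shows up as `None` instead of a guessed value.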
[00:31] yup
[00:31] but i think using the api is cleaner
[00:31] <P0lygl0t> me too now, but only for the wiki at the moment
[00:32] ok
[00:32] <P0lygl0t> that's true
[00:32] <P0lygl0t> but if we use the API we're accessing the server in the US then?
[00:32] <P0lygl0t> or is the API smart enough to talk to a local mirror?
[00:33] it'll use the local
[00:33] I think...
[00:33] <P0lygl0t> and toolserver is not picky regarding the programming language we want to use?
[00:33] no
[00:33] not at all
[00:34] <P0lygl0t> good, not that it matters all that much whether it uses local or not
[00:34] <P0lygl0t> but it would make me feel better :-)
[00:34] <P0lygl0t> ok, great, so Python it is then
[00:34] well
[00:34] don't worry about where the data comes from
[00:34] just have a module that wraps it
[00:34] we can then move it easily
[00:35] <P0lygl0t> ok, so using the API with mwclient is clear to me now
[00:35] <P0lygl0t> what we are going to use as a data format is on its way to crystallizing
[00:35] <P0lygl0t> but how does this API present itself to the outside world?
[00:36] <P0lygl0t> is it a socket?
[00:36] <P0lygl0t> do we have to program the client ourselves as well, or simply extend mwclient?
[00:37] <P0lygl0t> or is it like a web server?
[00:37] <P0lygl0t> accepting http requests?
[00:37] web server
[00:38] <P0lygl0t> accepting get, post and put?
[00:38] <P0lygl0t> what python module accomplishes that?
[00:38] *** cirwin is now known as cirwin|biab.
[00:40] *** cirwin|biab is now known as cirwin.
[00:40] sorry, being moaned at
[00:41] WSGI
[00:41] but the web interface is not part of the main program
[00:41] which is a parser
[00:41] the webserver calls functions on the main program
[00:47] <P0lygl0t> ok
[00:48] <P0lygl0t> Hippietrail was talking about a codec
[00:48] codec?
[00:48] oh
[00:48] that would create json from your objects
[00:48] <P0lygl0t> how do you see that? Just a big structure of what translates into what?
[00:48] <P0lygl0t> yes, something like that
[00:48] so webinterface says "parse me X" and the app returns a python object
[00:49] the webinterface then chucks it into json (or whatever) and outputs it
[00:50] <P0lygl0t> but for each and every wiktionary it needs a different set of data to parse the page into the objects
[00:50] yes
[00:50] <P0lygl0t> is that hard coded in the Python
[00:50] you'd have a set of parser objects
[00:50] that create python objects
[00:50] (probably a parser for each wikt)
[00:50] yes
[00:50] <P0lygl0t> so if, for a Wiktionary, that we can't do ourselves
[00:51] <P0lygl0t> somebody else has to do it, they'll have to learn some Python
[00:51] we give them most of the solution, and ask for ten minutes of time to fix
[00:51] hmm
[00:51] the alternative is to invent a new wiki-syntax-parser language
[00:51] which no-one would know
[00:51] <P0lygl0t> so Python is probably easier both for us and them?
[00:51] <-- Nadando has left this server (Read error: 60 (Operation timed out)).
[00:52] yes
[00:52] and once we've written one or two parsers
[00:52] <P0lygl0t> ok
[00:52] we can probably mungs the bits that are similar
[00:52] *munge
[00:52] <P0lygl0t> ok, we try to reuse as much code as possible
[00:52] yup
[00:53] probably class Parser
[00:53] <P0lygl0t> so the only data structures we'll need is the names of the headers they are using?
[00:53] and then class EnParser(Parser)
[00:53] probably
[00:53] though I wouldn't be too sure yet :)
[00:54] <P0lygl0t> and which style ===Translations===, {{=trans=}} or === {{trans}} ===
[00:54] not all use headers for everything
[00:54] <P0lygl0t> oh yes, a subclass for each Wiktionary
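Editorial sketch of the architecture discussed above: a `Parser` base class with one subclass per Wiktionary, and a thin WSGI function that asks the parser for a Python object and serialises it to JSON. Only the names `Parser` and `EnParser` come from the chat. The heading tables, the `TemplateHeadingParser` class, the `xx` wiki key, the `wiki` and `title` query parameters and the canned `fetch_wikitext` stand-in are all assumptions, and the line-based parsing is far simpler than real entries would need.

```python
import json
import re
from urllib.parse import parse_qs


class Parser:
    """Shared logic; subclasses describe how one Wiktionary marks its sections."""

    headings = {}  # local heading text -> neutral section name
    heading_re = re.compile(r"^=+\s*(.+?)\s*=+\s*$")  # ==Heading== style

    def parse(self, wikitext):
        """Return {section name: [non-empty lines]} for the known headings."""
        sections, current = {}, None
        for line in wikitext.splitlines():
            match = self.heading_re.match(line)
            if match:
                current = self.headings.get(match.group(1))
                if current is not None:
                    sections[current] = []
            elif current is not None and line.strip():
                sections[current].append(line.strip())
        return sections


class EnParser(Parser):
    """Plain-text headings such as ===Translations===."""
    headings = {"Etymology": "etymology", "Noun": "noun",
                "Translations": "translations"}


class TemplateHeadingParser(Parser):
    """Hypothetical wiki whose headings are templates such as {{=trans=}}."""
    headings = {"noun": "noun", "trans": "translations"}
    heading_re = re.compile(r"^\{\{=(\w+)=\}\}\s*$")


PARSERS = {"en": EnParser(), "xx": TemplateHeadingParser()}


def fetch_wikitext(wiki, title):
    """Stand-in for the module that wraps data access; returns canned text."""
    return ("==English==\n===Noun===\n# a sample definition\n"
            "===Translations===\n* Dutch: voorbeeld\n")


def application(environ, start_response):
    """WSGI entry point: parse the requested page and output it as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    wiki = query.get("wiki", ["en"])[0]
    title = query.get("title", [""])[0]
    parser = PARSERS.get(wiki)
    if parser is None:
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "unknown wiki"}']
    body = json.dumps(parser.parse(fetch_wikitext(wiki, title))).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]
```

For a quick local try, `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` from the standard library would serve it; the parsers never see the web layer, so they can be tested on plain strings.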