Parsoid/Language conversion

This page contains some notes about LanguageConverter.

For improvements (syntactic regularization) to the PHP implementation of LanguageConverter, see T54661. These improvements have resulted in some articles that need to be fixed up: see Parsoid/Language_conversion/Preprocessor_fixups. For the Parsoid implementation of LanguageConverter, see T43716.

Documentation

 * Syntax docs: Writing systems/Syntax; Chinese help, Microsoft translation, Google translation
 * Wikis using Language Converter: m:Wikipedias in multiple writing systems
 * Language converter
 * meta:Automatic conversion between simplified and traditional Chinese
 * bug 41716
 * Category:放置于模板的noteTA

Chat with Liangent
;code2: .. }-?
[09:17] I know, but it's not difficult to describe in DOM
[09:17] yes it can
[09:17] ouch
[09:17] I feared that this would be the case
[09:18] and of course the contents don't need to be balanced
[09:18] as it is all string-based
[09:18] in practice it's rarely unbalanced
[09:18] but technically it's completely possible
[09:18] gwicke, we can treat those as independent parsing contexts
[09:18] similar to how extensions/include tag content is handled in tokenizer
[09:18] yes, we should really
[09:19] and possibly represent the variants as sub-doms marked with the language
[09:19] we've figured out hierarchical parsing scopes in the tokenizer and we just apply it wherever possible.
[09:20] or as html in attributes (not pretty, but might avoid rendering issues)
[09:21] we don't want all variants to show up / be indexed at once
[09:21] gwicke: subbu: do you want to see this? https://zh.wikipedia.org/w/index.php?title=User:Liangent/test&action=edit
[09:21] zh-cn users see a blue box, while zh-tw users see a red one...
[09:22] i still think we start drawing up wikitext constructs that we consider deprecated uses -- and a good subproject for someone would be use info in dom based on auto-correct and other flags to signal "deprecated/erroneous wikitext" to editors .. kind of a linting tool.
[09:22] but not sure we have the bandwidth to do it :)
[09:22] probably not before the July release
[09:22] seems like a far out project.
[09:22] definitely not!
[09:22] GSOC next year ;)
[09:23] so I found some docs on the flags:
[09:23] 'A' => 'A', // add rule for convert code (all text convert)
[09:23] 'T' => 'T', // title convert
[09:23] 'R' => 'R', // raw content
[09:23] 'D' => 'D', // convert description (subclass implement)
[09:23] '-' => '-', // remove convert (not implement)
[09:23] 'H' => 'H', // add rule for convert code
[09:23] // (but no display in placed code)
[09:23] 'N' => 'N' // current variant name
[09:24] yep
[09:25] some of these sound like they would affect global converter state?
[09:25] or do these only apply to one block?
[09:25] liangent, so, those divs are closed inside each then
[09:25] // 'S' show converted text
[09:25] // '+' add rules for alltext
[09:25] // 'E' the gave flags is error
[09:26] we'll need to understand these flags to be able to represent them sanely I think
[09:27] gwicke: it's currently done in this way, but ideally it could be fixed to only modify the state for the current conversion run
[09:28] *it's currently done to modify global state
[09:28] subbu: I don't really understand your question
[09:29] liangent: global state creates all sorts of problems for us
[09:29] gwicke: I haven't seen S/+/E in real world... maybe I have to have a look first
[09:29] we plan to re-render parts of a page, which we can only do if this can be done independent of some magical global state
[09:29] gwicke: if you can avoid that ... don't do that in its current php way
[09:30] I doubt that we'll complete this before July then
[09:31] gwicke: well you didn't paste the next line "// these flags above are reserved for program"
[09:31] we'd have to convert a delta encoding to a list of options per block
[09:32] ok, so 'A' etc are not used in wikitext
[09:33] after I read the code ...
[09:33] which flags are used in wikitext if S/+/E doesn't show up either?
[09:33] would these be the only candidates?
[09:33] 'A' is converted to 'S' and '+' internally
[09:34] and 'S' is just as if there's no flag
[09:34] ah, that comment refers to the ones before it
[09:34] the comment style in the converter is slightly non-conventional
[09:34] it seems like having liangent do a quick translation (and store that on the wiki somewhere) would be useful for future documentation (to en-speaking hackers)
[09:35] yes, that would be great
[09:35] gwicke: it was written years ago and almost no one modernize it
[09:36] even a link to the Chinese documentation might be useful
[09:36] could feed that through a translate tool
[09:36] I can't find any place really using 'E' except for that comment
[09:36] gwicke: you want it? https://zh.wikipedia.org/wiki/Help:%E5%AD%97%E8%AF%8D%E8%BD%AC%E6%8D%A2%E8%AF%AD%E6%B3%95
[09:37] hmm- didn't google offer website translations at some point?
[09:38] gwicke: maybe only if you're using chrome now?
[09:38] looks like they made that a commercial service
[09:38] google translate was one of their deprecated labs apis, but then they undeprecated it
[09:38] but presumably with some tos changes
[09:39] * gwicke tries bing
[09:39] http://www.microsofttranslator.com/bv.aspx?from=&to=en&a=http%3A%2F%2Fzh.wikipedia.org%2Fwiki%2FHelp%3A%E5%AD%97%E8%AF%8D%E8%BD%AC%E6%8D%A2%E8%AF%AD%E6%B3%95
[09:41] http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&eotf=1&u=http%3A%2F%2Fzh.wikipedia.org%2Fwiki%2FHelp%3A%25E5%25AD%2597%25E8%25AF%258D%25E8%25BD%25AC%25E6%258D%25A2%25E8%25AF%25AD%25E6%25B3%2595&act=url
[09:41] pasting a url into translate.google.com produces a clickable url as the translation
[09:41] should we add those links (and liangent's translation) to https://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese ?
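The flag handling discussed between 09:23 and 09:34 can be sketched roughly as follows. This is a minimal Python sketch, not the converter's actual code: the names `WIKITEXT_FLAGS`, `INTERNAL_FLAGS`, `split_block` and `normalize_flags` are hypothetical, and the assumption that flags sit before a `|` and are separated by `;` is not stated in the chat. Only two facts come from the log: 'A' is turned into 'S' and '+' internally, and 'S' behaves as if no flag were given.

```python
# Flags quoted in the chat as usable in wikitext.
WIKITEXT_FLAGS = {"A", "T", "R", "D", "-", "H", "N"}
# Flags described in the chat as "reserved for program".
INTERNAL_FLAGS = {"S", "+", "E"}


def split_block(inner):
    """Split the text between -{ and }- into (flags, body).

    Assumption: flags come before the first '|' and are ';'-separated.
    If the part before '|' is not made of known flags, the whole text is body.
    """
    head, sep, body = inner.partition("|")
    if not sep:
        return set(), inner
    flags = {f.strip() for f in head.split(";") if f.strip()}
    if flags and flags <= WIKITEXT_FLAGS:
        return flags, body
    return set(), inner


def normalize_flags(flags):
    """Expand 'A' into 'S' and '+', and treat 'no flag' as 'S'."""
    flags = set(flags)
    if "A" in flags:
        flags.discard("A")
        flags |= {"S", "+"}
    return flags or {"S"}


if __name__ == "__main__":
    for inner in ["A|zh-hans:computer; zh-hant:ELECTRONICBRAIN;", "H|x", "foo"]:
        flags, body = split_block(inner)
        print(sorted(normalize_flags(flags)), repr(body))
```

Keeping the wikitext flags and the internal flags in separate sets mirrors the comment quoted at 09:31; whether a real implementation needs that split is left open here.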
[09:42] cscott: no
[09:42] should be a page on MW.org
[09:42] yeah, software docs should be on mw.org
[09:43] we initially used meta for software too, before mw.org existed
[09:44] liangent's translation would be appreciated; the auto-translated version is not very intelligible
[09:46] although sometimes amusingly misleading "-{foo}-" means emits "foo" literally --> becomes "a total ban on the word conversion, in some cases (such as just where prohibited word conversion)"
[09:46] cscott: is the "syntax" section enough?
[09:46] it gives some usage examples
[09:46] when i first read that, i thought it was referring to censoring out expletives ("prohibited words")
[09:46] liangent: i think so. it's hard to know what's important from the other parts w/o understanding what they are trying to say ;)
[09:47] i'll trust your judgement
[09:49] like, the first two paragraphs *seem* to be boilerplate about web standards and markup. if there's some actual interesting point in there, a translation would be nice.  (what does "wikitext also no clear definition of normalized work is still in progress, which mark is the label unclear" mean?)
[09:49] * cscott -> lunch
[09:50] * gwicke -> office ;)
[09:50] if brion is easily-pokable, i hope the latest core patch will finally put the manual thumbs to rest
[09:50] I'm out of poking distance right now, but that will change in about 15 minutes
[09:50] thx
[09:51] i'm tired of that patchset
[09:51] ;)
[10:11] --> You have joined the channel #mediawiki-parsoid (~gabriel@wikimedia/gwicke).
[10:11] *** The channel topic is "Round-tripping parser and runtime for MediaWiki | gwicke, subbu, marktraceur, and awight are the team | https://www.mediawiki.org/wiki/Parsoid | always fresh round-trip testing results at http://parsoid.wmflabs.org:8001/".
[10:34] cscott: We haven't figured it out yet (I dunno if anyone answered)
[10:46] marktraceur: ok, i'm going to document how image tags currently work; hopefully that will get pushed quickly so you can patch the tests as well as the source when you make your changes
[10:46] cscott: https://www.mediawiki.org/wiki/Writing_systems/Syntax
[10:48] you might be interested in https://bugzilla.wikimedia.org/show_bug.cgi?id=43547 - will implementing this be useful for parsoid team>
[10:48] could we do bidirectional links with the chinese source for the page?
[10:48] might make it slightly more likely they will get updated when changes are made
[10:49] the variant bug does look interesting; it would be a decent way to get more i18n tests into parsertests
[10:50] currently parsertests relies on the kaa wiki
[10:50] * cscott loves non-roman script languages
[10:50] cscott: and mw core parser tests use language=sr and language=zh for the converter
[10:52] ah, i see that now
[10:52] crosswiki link done
[10:52] i didn't read that section as closely; i was working on linktrail/linkprefix bugs at the time
[10:53] i didn't parse the variant= option to parserTests when I implemented language= support for parsoid, maybe i should do that
[10:53] of course, that requires grokking variants fully first, which it looks like gwicke is much closer to having done
[10:54] if you read that parserTest file ... you need to make sure that you can recognize chinese characters
[10:55] though some of tests are in sr -- which requires cyrl
[10:57] cscott: TBH conversion between roman scripts are more difficult, because you must be careful not to touch html tags etc.
[10:57] which is never a problem when you convert Chinese
[10:58] my evil long-term plan is to have parsoid pass all the parser tests
[10:58] so i should take a closer look and see which variant tests are passing/failing
[10:58] it doesn't seem like character set should really be an issue. UTF8 FTW.
[10:58] I believe all variant tests are failing
[10:59] you must be echoing -{}- currently
[10:59] i should have said, "after we parse the variant= option" -- they are certainly failing now because we're just ignoring the language variant
[10:59] ah
[11:00] can you point to the specific test you're thinking of when you say, "TBH conversion between roman scripts are more difficult, because you must be careful not to touch html tags etc."
[11:00] i think I also need the TBH TLA expanded
[11:01] to be honest
[11:03] oh, right, i see now
[11:03] i was trying to make this more complicated than it is
[11:04] yeah, parsoid doesn't really have much problem with separating out the markup, that's all general and done before we'd look at the characters
[11:05] we have more of a problem making sure we're using the right localized configuration, especially with all the crazy fallback and inheritance stuff that goes on in core
[11:07] sadly zh has so much such stuff, even only in the converter part
[11:07] see that table I translated just now
[11:07] for example, zh-cn reuses rules from zh-hans but doesn't reuse everything...
[11:12] wow, there are a lot of zh-* codes which I wouldn't think of as "chinese" as all.
[11:13] zh-min, for example -- shouldn't that be a separate language?
[11:15] cscott: that's a legacy one. the standard code for it is nan
[11:15] so http://www.iana.org/assignments/language-tags/language-tags.xml makes it seem like the zh- languages are converging on a zh-{language name}-{country} format
[11:15] * gwicke returns from another interview
[11:15] see https://bugzilla.wikimedia.org/show_bug.cgi?id=8217
[11:16] cscott: you mean zh-Hans-CN ?
[11:17] zh-hans-cn, zh-gan, zh-min, zh-hakka
[11:17] but these might just be legacy names, that bugzilla seems like they've renamed some
[11:17] 'nan' makes more sense than zh-min-nan, where 'zh' was really just standing for the character set, not tht language
[11:18] zh-hans-cn looks verbose
[11:18] http://www.w3.org/International/articles/language-tags/Overview.en.php
[11:19] zh-gan is gan now
[11:19] the second should be zh-min-nan not zh-min
[11:20] currently it still resides at zh-min-nan.wikipedia.org, but its html says
[11:20] not sure what zh-hakka is. maybe just hak?
[11:22] do we need to actually implement conversion for successful editing?
[11:22] to me it seems that we mainly need to properly handle conversion tags/blocks
[11:22] without necessarily doing any actualy conversion at first
[11:23] the biggest problem I see is the global nature of conversion rules
[11:24] liangent: will -{zh-hans:computer; zh-hant:ELECTRONICBRAIN;}- cause all instances of ELECTRONICBRAIN to be replaced with computer when converting to hans?
[11:25] or only from that point of the document on, or only inside the block?
[11:27] gwicke: yes
[11:28] oh no
[11:28] no further conversion, unless it has an H or A flag
[11:29] do we emit something like computer ELECTRONICBRAIN , or something like
[11:29] ah, k- and those apply from that point in the document on I guess
[11:31] it seems the answer is "the former, unless H is used, in which case it's the latter; unless A is used, in which case it's both"
[11:31] i wonder if we could get away with changing the scope of the H and A rules without breaking too many pages
[11:32] hoisting them to the nearest enclosing block-level element, or even hoisting them to be per-document, would be much cleaner long-term
[11:32] ie, it would be better to have a short table of "additional rules for this page" included in the page's meta-data, like NOTOC and the rest.
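The behaviour described between 11:24 and 11:31 (an unflagged block only picks a variant in place, while H and A also add a rule that applies from that point in the document on) can be illustrated with a toy model. This is a hedged Python sketch under simplifying assumptions, not the PHP converter or Parsoid: the regular expression, `parse_pairs`, `apply_rules` and `convert` are all hypothetical, flags are assumed to be capital letters before a `|`, and rules are applied by naive string replacement.

```python
import re

# Assumed block shape: -{FLAGS|body}- or -{body}-
BLOCK = re.compile(r"-\{(?:([A-Z+\-]+)\|)?(.*?)\}-", re.S)


def parse_pairs(body):
    """Turn 'zh-hans:computer; zh-hant:ELECTRONICBRAIN;' into {variant: text}."""
    pairs = {}
    for item in body.split(";"):
        if ":" in item:
            variant, text = item.split(":", 1)
            pairs[variant.strip()] = text.strip()
    return pairs


def apply_rules(text, rules):
    """Naive replacement of every rule source with its target."""
    for src, dst in rules.items():
        text = text.replace(src, dst)
    return text


def convert(wikitext, variant):
    """Scan left to right; rules only affect text AFTER the block that adds them."""
    rules = {}
    out = []
    pos = 0
    for m in BLOCK.finditer(wikitext):
        out.append(apply_rules(wikitext[pos:m.start()], rules))
        flags, body = m.group(1) or "", m.group(2)
        pairs = parse_pairs(body)
        if not pairs:
            out.append(body)  # -{foo}- emits "foo" literally
        else:
            target = pairs.get(variant, "")
            if target and ("H" in flags or "A" in flags):
                for other, text in pairs.items():
                    if other != variant:
                        rules[text] = target  # rule for the rest of the document
            if "H" not in flags:
                out.append(target)  # H adds the rule but displays nothing
        pos = m.end()
    out.append(apply_rules(wikitext[pos:], rules))
    return "".join(out)


if __name__ == "__main__":
    rule = "zh-hans:computer; zh-hant:ELECTRONICBRAIN;"
    print(convert("ELECTRONICBRAIN -{" + rule + "}- ELECTRONICBRAIN", "zh-hans"))
    print(convert("ELECTRONICBRAIN -{A|" + rule + "}- ELECTRONICBRAIN", "zh-hans"))
```

The `rules` dictionary that grows during the scan is exactly the "from this point in the document on" state that the chat identifies as hard to express in a tree-scoped DOM; hoisting the rules to page level would amount to filling `rules` before the scan starts.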
[11:32] cscott: neither I would say
[11:33] choose one between computer and ELECTRONICBRAIN based on user's pref
[11:34] or if you really want to parse it you can expand the spec to multiple attribs
[11:35] i suspect that all user-pref things will get the most general form put into parsoid's output, and then a postprocessor will munge the output for the user's display preferences
[11:35] redlinks and stubs would be handled the same way
[11:35] date preferences probably too
[11:36] the postprocessor might even be client-side javascript or CSS in some cases
[11:36] eh ... then just use the current converter as the postprocessor is enough ...
[11:36] yeah, but we should ideally move conversion *rules* to the head
[11:36] it's designed to work on HTML currently
[11:37] ie, in the computer ELECTRONICBRAIN  example i was assuming the user's stylesheet would display:none one or the other
[11:37] so that they apply to the entire page
[11:37] why to the head?
[11:37] with some backwards-compatibility selser stuff saying where in the original document the rule was found
[11:37] inline conversion pairs that don't add general rules can be reprocessed independently
[11:38] liangent: so that we can re-process a part of a page correctly
[11:38] i think it's also a cleaner representation in the DOM, since HTML doesn't have any "from this point in the document on" state
[11:38] not really -- A or H tags don't (technically) affect the whole page
[11:38] all DOM attributes are tree-scoped
[11:39] it only affects the part after the markup...
[11:39] liangent: is that actually needed?
[11:39] liangent: right, that's the part we think we might change (short or long term)
[11:39] deprecate
[11:39] that's a better word
[11:39] hmm there're some real world usage on zhwiki
[11:39] primarily used in templates
[11:40] the hypothetical GSoC bot could roam through existing wikis and warn of usage that will break
[11:40] to avoid pollution of pages using the template
[11:40] ah, we could scope rules to a template I guess
[11:40] liangent: the template would add the rule and then remove it?
[11:40] yeah, block scoped is fine
[11:40] we parse template independently already
[11:40] cscott: yeah
[11:40] but it's not always template level
[11:41] for some common dictionaries that are not in mediawiki core, we put lots of -{H| }- in some templates
[11:41] liangent: I guess -{ }- can always be used to suppress general rules?
[11:41] i guess concretely what i'm saying is that we'd like to deprecate usage which can't be scoped
[11:41] then transclude those templates in articles which need that
[11:41] gwicke: oh, right -- we can be compatible by adding synthetic blockers to uses of the rule before its "definition"
[11:42] gwicke: yes
[11:42] but i still feel like the view that VE exposes should be scoped or global; that would be the long-term preferred usage.
[11:42] liangent: if there was an easy way to add those entries globally, would that help?
[11:43] or an easy way to add those entries to a given page
[11:43] with globally, I mean for the entire wiki
[11:44] gwicke: no. those rules are transcluded when really needed, and usually group by article scope, eg IT, Movie, Sport etc
[11:44] in a way that can also make it easy to add those entries to the built-in dictionaries
[11:44] i don't think the "transcluding a page with defs" is actually necessarily bad, so long as those defs would get scoped to the entire page
[11:44] otherwise it creates unneeded conversion
[11:44] liangent: is unneeded conversion purely a performance problem?
[11:45] I should say excessive conversion
[11:45] or are there words with multiple uses/meanings where the language converter would be overaggressive if a rule was in global scope
[11:45] the latter
[11:46] representing the sum of all the world's knowledge, in all the world's languages, is not an easy task.
[11:46] some way to specify a subject-specific dictionary could be added to the head too
[11:46] gwicke: there was an attempt in core, with a parser function
[11:46] but it hasn't been done
[11:46] templates are the wrong tool for that
[11:46] and removed finally
[11:47] for simplicity, i really like "all defs at head". that reasonably generalizes to including subject-specific dictionaries. but it doesn't handle the "defs scoped to a particular template" case.
[11:47] at least from our point of view, which is very much geared to independent reprocessing and template expansion
[11:48] defs scoped to a template could be fine
[11:48] allowing defs to be block scoped handles the template case, but transcluded pages might still prove problematic (unless the transcluded page didn't create its own scope)
[11:48] but defs leaking out of a template is a no-go for our processing model
[11:49] and what's the difference between templates and transcluded pages?
[11:49] same thing
[11:49] liangent: defs leaking out
[11:49] cscott: there is no difference between pages and templates
[11:49] different namespaces, that's all
[11:50] i'm referring to the use case, not the implementation
[11:50] gwicke: no we store dictionaries in Template namespace too...
[11:50] use case of transcluded pages is explicitly to allow the defs to leak out
[11:50] use case of templates usually (we hope) doesn't let the defs leak
[11:50] cscott: templates are parsed in their own global context
[11:51] ok, lunch time
[11:51] so global defs in there are fine (but would need work in the PHP preprocessor), and would not leak out of the template expansion
[11:51] gwicke: right. but we'd break the pages transcluded just for their defs.
[11:51] transcluded pages are not supported
[11:52] also you may want to think of the performance
[11:52] we have some plan on how to still support unbalanced templates combining to a single dom structure
[11:52] there was really some tradeoff in the converter design
[11:53] by wrapping them in an extension-like tag
[11:53] for example, the choice of using a global dictionary
[11:53] my suggestion would be to provide another means to "include an extra dict", make the defs page-scoped, and explicitly add {literal} escapes for compatibility where needed.
[11:53] ZhConversion.php has 19k lines
[11:53] cscott: yes, that is my preference too
[11:54] defs in templates would need to combine somehow for a given top-level template expansion
[11:55] VE would then not directly expose the A/H/- flags, it would just add conversion dictionary support to that lovely page-level options dialog we saw in the wednesday meeting.
[11:55] yes, plus add the option to disable conversion or add manual conversion pairs locally without introducing rules
[11:55] with a way to both add individual unidirectional/bidirectional mappings, and also an equivalent to "transclude this page of defs"
[11:56] right, the non-flagged versions of the syntax are fine
[11:56] liangent: are templates depending on inheriting the page translation context too?
[11:57] you mean, do changes in the page mappings affect included templates?
[11:58] gwicke: it inherits currently
[11:58] is that feature widely used?
[11:58] gwicke: it shouldn't really be much of a problem; changing the page-level meta-options will likely force a re-render of the entire page regardless
[11:59] oh, but then we'd have to track more dependencies between the template and the page context
[11:59] depends on the type of the template -- for example, we may want it for infoboxes, or some inline-meta-templates, like, but not navboxes (it's even sometimes harmful to inherit in this case)
[12:00] a simple thing would be to have a single "uses conversion" flag that we'd set if rendering the template depended on the page context in any way during a full render
[12:00] if "uses conversion" is false, then we can substitute updated template content into the page easily
[12:01] if "uses conversion" is true, then we need to do a full rerender of the page when the template changes
[12:01] one hopes that "uses conversion" is usually false. if it's not, then we can implement more fine-grained dependency tracking.
[12:01] cscott: actually we have some complex transclusion chain when using those dictionary-templates
[12:02] if I have to add it to every level (like ), that's somehow problematic
[12:02] liangent: no, this is something parsoid's storage backend would internally track
[12:03] part of knowing when to invalidate a stored article representation / reparse its templates.
[12:03] then how can I specify that for specific template calls?
[12:03] could the translation context be category-based?
[12:04] (problem then would be that categories don't necessarily transfer across wikis)
[12:04] ah, right, because the current inheritance behavior is sometimes-wanted and sometimes-a-problem.
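The "uses conversion" flag proposed at 12:00 to 12:01 could be tracked along these lines. This is only a sketch under assumptions: `Expansion`, `plan_update` and the returned strings are hypothetical names, and nothing here reflects how Parsoid's storage backend is actually structured.

```python
from dataclasses import dataclass


@dataclass
class Expansion:
    """One cached template expansion inside a stored page (hypothetical)."""
    template: str
    html: str
    # True if the last full render consulted the page's conversion context.
    uses_conversion: bool


def plan_update(expansions, changed_template):
    """Decide how to refresh a stored page after a template was edited."""
    affected = [e for e in expansions if e.template == changed_template]
    if not affected:
        return "nothing-to-do"
    if any(e.uses_conversion for e in affected):
        # The output depended on page-level rules: re-render the whole page.
        return "full-rerender"
    # Independent of page context: swap in the new template output.
    return "substitute-in-place"


if __name__ == "__main__":
    page = [
        Expansion("Navbox", "<div>nav</div>", uses_conversion=False),
        Expansion("Infobox", "<table>info</table>", uses_conversion=True),
    ]
    print(plan_update(page, "Navbox"))
    print(plan_update(page, "Infobox"))
```

A single boolean per expansion is the coarse version described in the chat; finer-grained dependency tracking would replace it with a record of which rules were actually consulted.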
[12:04] gwicke: more magic words (sigh)
[12:04] no, I mean standard page categories
[12:05]
[12:05] and i was saying you could add magic words to indicate which categories should get which behavior
[12:05] customized per-wiki with the existing magic words mechanism
[12:06] but maybe there are too many categories to consider doing that
[12:06] well, a translation namespace with pages corresponding to category names maybe?
[12:06] or define translations directly in the category meta info
[12:06] categories already have inheritance
[12:07] gwicke: not necessary all articles transcluding the sports dictionary are directly in the Sports category
[12:07] what if it has ?
[12:08] i would think you'd want something more like Tennis
[12:08] normally that category is directly or indirectly in sports
[12:08] rather than overloading the category
[12:08] cscott: that's just our previous GROUPCONVERT idea
[12:08] and it keeps popping up from time to time
[12:09] well, except using something other than the template mechanism as an implementation
[12:09] but i basically like the GROUPCONVERT idea
[12:09] also with it we can provide a better dictionary editing interface
[12:09] with ContentHandler
[12:09] liangent: is the problem with directly using categories multiple inheritance?
[12:10] if you want to have a look of what it looks like currently - https://zh.wikipedia.org/wiki/Template:CGroup/IT?action=edit&uselang=en
[12:10] ok, i'm going to step away from the keyboard for a bit to walk the dog
[12:12] liangent: that example you posted -- am i seeing nested dictionaries there? ie, the IT dictionary includes the Games, Windows, Electronics, and Communications dictionaries?
[12:12] gwicke: maybe - but will this allow people to attack the server
[12:13] liangent: depends on how the translation info is protected I guess
[12:13] could be restricted
[12:13] gwicke: I mean, what if someone changes *this setting* on a very-top-level category?
[12:14] cscott: no they're plain links. but there're some nested dictionaries. let me find one
[12:14] about the same thing as changing a dictionary template that is transcluded everywhere
[12:15] nested dictionaries are a bit like inheritance in categories
[12:15] complete with multiple inheritance
[12:15] would need a good rule on how to handle duplicate / conflicting rules
[12:16] anyway, I think this is post-July stuff
[12:17] cscott: I can't find it now but I've seen one
[12:17] liangent: Alolita is also very interested in this- maybe she could even organize some funding for work on this
[12:17] it's just a Template:CGroup/* with some s in it
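The chat leaves open how duplicate or conflicting rules from nested dictionaries should be handled (12:15). One possible policy, shown purely as an illustration, is: a group's own rules override anything it includes, the first included definition wins among siblings, and every sibling disagreement is reported. The function `resolve`, the `groups` layout and the group names below are hypothetical and do not describe the existing CGroup templates.

```python
def resolve(group, groups, seen=None):
    """Return (rules, conflicts) for a dictionary group and all groups it includes.

    groups maps a group name to {"includes": [names], "rules": {src: dst}}.
    """
    seen = set() if seen is None else seen
    if group in seen:  # already merged elsewhere; also guards against cycles
        return {}, []
    seen.add(group)

    rules, conflicts = {}, []
    for child in groups[group].get("includes", []):
        child_rules, child_conflicts = resolve(child, groups, seen)
        conflicts.extend(child_conflicts)
        for src, dst in child_rules.items():
            if src in rules and rules[src] != dst:
                conflicts.append((src, rules[src], dst))  # keep the first one
            else:
                rules[src] = dst
    # The group's own rules take precedence over inherited ones.
    rules.update(groups[group].get("rules", {}))
    return rules, conflicts


if __name__ == "__main__":
    groups = {
        "Parent": {"includes": ["ChildA", "ChildB"], "rules": {"term1": "parent"}},
        "ChildA": {"rules": {"term1": "a", "term2": "a"}},
        "ChildB": {"rules": {"term2": "b"}},
    }
    rules, conflicts = resolve("Parent", groups)
    print(rules)
    print(conflicts)
```

Reporting conflicts instead of silently picking a winner would also give a dictionary editing interface something concrete to show to editors; whether "first include wins" is the right default is an open design question.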