Northern Sami Wikipedia notes 2011

These notes about small projects cover problems and possibilities identified at Northern Sami Wikipedia. This project is very small and has identified a number of problems with the current software. The reasoning in what follows is to make a few editors more productive by removing time-consuming, unnecessary steps and by cannibalizing material from larger projects. Hopefully this can be sufficient aid for a small group to produce a basic lexicon in a small language from larger lexicons in other languages.

Small projects need as much help as possible to create initial content and backbone structure. Initial content is such things as data for infoboxes, content reflecting infoboxes and content used inside infoboxes. Structures are, for example, categories and templates. Construction of such content and structures is somewhat difficult for beginners, and it should be possible to reuse work across projects.

Update link trails
In Northern Sami, years are often written in the form 1997:as, 1997:is, etc. Written with the suffix directly after the link, this form fails to make proper links, so the whole string must be written out as the link text, even though it should be possible to write the suffix as a link trail. If this is to be solved, some of the regex patterns must be updated on a per-language basis. For examples, see the article Ole Henrik Magga at Northern Sami Wikipedia. This could be implemented as a simple list of accepted strings added as a prefix or postfix to a link, possibly with optional checks of whether the link itself has a specific form.

Closely related to this is the problem with letters outside the defined set of legal letters. In the previous example with Ole Henrik Magga there is the string 1993rájes (actually an error) where the letter á makes the link fail to incorporate the whole postfix string. To make this work as expected it should be possible to define the string of legal letters, at least in the settings somehow. For the moment there is only the link trail variable and this is somewhat limited.

For the Northern Sami language, the link trail variable should reflect the characters given on the page Sámi locales, in the subpart Sorting.

Even though an old system message Linktrail exists, it does not work. The regex must be changed by filing a bug in Bugzilla, like Bug 31194 - Add $linkTrail for Northern Sami Wikipedia.
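The link trail behaviour discussed in this section can be sketched in Python. This is only an illustration and not MediaWiki's implementation: the function names, the suffix list and the letter set (the ASCII letters plus á, č, đ, ŋ, š, ŧ and ž) are assumptions made for the example.

```python
import re

# Assumed letter set for Northern Sami; the authoritative list is the one
# referred to on the page Sámi locales.
SAMI_LETTERS = "a-záčđŋšŧž"
# Accepted strings that may follow a link, taken from the year examples.
ACCEPTED_SUFFIXES = (":as", ":is")


def build_trail_pattern(letters, suffixes=()):
    """Build a regex that splits the text after a link into (trail, rest)."""
    parts = [re.escape(s) for s in suffixes] + [f"[{letters}]+"]
    return re.compile("^(" + "|".join(parts) + ")?(.*)$", re.DOTALL)


def split_trail(following_text, pattern):
    """Return the part absorbed into the link and the remaining text."""
    match = pattern.match(following_text)
    return match.group(1) or "", match.group(2)


ascii_only = build_trail_pattern("a-z")
sami = build_trail_pattern(SAMI_LETTERS, ACCEPTED_SUFFIXES)

print(split_trail("rájes.", ascii_only))  # the trail stops before "á"
print(split_trail("rájes.", sami))        # the whole word becomes the trail
print(split_trail(":as ja", sami))        # the accepted suffix becomes the trail
```

With only ASCII letters the trail breaks at the first non-ASCII letter, which is the failure described for 1993rájes.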

Rewrite rules for link targets
Words are often changed according to more or less well-defined prefix, infix and suffix rules. A link could be checked by a successive "back-off" model whereby more and more aggressive rules are applied. If a run of a particular regex pattern or combination of such patterns produces a match on a particular page, then it is used. The costs of all applied rules add up, and if the total seems to be too radical the rewrite is dropped altogether. Such rules will sometimes fail, but they will save a lot of typing. For example, in Norwegian and in English an inflected word can often be linked in a compact way, while in Northern Sami the corresponding link has to be written out in full. Note that the Northern Sami form can't be rewritten to take advantage of the link trail pattern. Similar problems arise in several other languages, like the Uralic languages (for example Finnish) and to a lesser degree the North Germanic languages (for example Norwegian). It clearly would be nice if we could avoid all the extra typing.

This could be reformulated as a stemming problem or as a lemmatisation problem, and as such there are several approaches. In this context we need something simple enough that works in identifying a likely target page, even if the proposed link and the actual page use words with different inflections. If a sequence of rewrite attempts still does not end up with a target page, the process is terminated and the initial red page link is used as is. If the rewrite hits several pages and it is inconclusive which is more likely to be correct, it will link to a special page that constructs a disambiguation page on the fly.

Often both the page title and the link text have to be stemmed or lemmatized, that is reduced to one or several base inflected forms, and it is this common base form that is used to connect the entries. This creates a problem, as there must be a store for the base form of the page title. This could be a redirect created by the system when the page is first created or when it is moved to a new name. A problem with this approach is that the string of base inflected forms of the page title may not be a legal phrase in the content language.
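A minimal Python sketch of such a base form store follows. It assumes an in-memory index instead of system-created redirects, and `toy_base_form` is a hypothetical stand-in for a real stemmer or lemmatizer; its suffix list is made up and not linguistically accurate.

```python
from collections import defaultdict


def toy_base_form(phrase):
    """Hypothetical stand-in for a real stemmer or lemmatizer."""
    words = []
    for word in phrase.lower().split():
        for ending in ("as", "s"):
            # Only strip an ending when a reasonably long stem remains.
            if word.endswith(ending) and len(word) > len(ending) + 2:
                word = word[: -len(ending)]
                break
        words.append(word)
    return " ".join(words)


class BaseFormIndex:
    """Maps the base form of each page title to the titles sharing it."""

    def __init__(self, base_form=toy_base_form):
        self.base_form = base_form
        self.titles_by_key = defaultdict(set)

    def add_page(self, title):
        self.titles_by_key[self.base_form(title)].add(title)

    def resolve(self, link_text):
        key = self.base_form(link_text)
        candidates = sorted(self.titles_by_key.get(key, ()))
        if not candidates:
            return ("redlink", [link_text])      # keep the red link as is
        if len(candidates) == 1:
            return ("page", candidates)          # a single likely target
        return ("disambiguation", candidates)    # build a list on the fly


index = BaseFormIndex()
index.add_page("Aurdal")
print(index.resolve("Aurdalas"))
print(index.resolve("Oslo"))
```

The three outcomes correspond to the cases in the text: a single target page, no target (the red link is kept), and several candidates for an on-the-fly disambiguation page.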

Snowball can be used as a solution for written languages which rely heavily on suffix rules. This includes English and Norwegian, but not Northern Sami. Even if Snowball is mostly known as an effective solution for stemming, it is really a programming language for affix rules in general. We could do more arbitrary rewrites of the string, and perhaps even try several strategies and terminate the attempts if one of them succeeds.

A possible and rather efficient method that succeeds more often than not is to use a set of patterns to rewrite the link entry into a base inflected form. Each of the patterns is given a weight, and the weights add up when patterns are chained together. A sequence of rewrites is deemed better if its sum of weights is lower than that of some other sequence of rewrites that also succeeds in reaching a possible target page. Typically the weights are adjusted edit distances, where the weights follow legal transforms in the given language.
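The weighted rewrite idea can be sketched as a cheapest-first search in Python. The rules and weights below are invented for the illustration and do not describe real Northern Sami morphology; `max_cost` plays the role of the limit at which a rewrite is considered too radical.

```python
import heapq
import re

# (pattern, replacement, weight); toy values, not real morphology.
RULES = [
    (r"as$", "", 1.0),
    (r"a$", "", 1.5),
    (r"s$", "", 2.0),
]


def resolve_link(text, titles, rules=RULES, max_cost=3.0):
    """Return (title, cost) for the cheapest rewrite chain, or None."""
    queue = [(0.0, text)]
    seen = set()
    while queue:
        cost, form = heapq.heappop(queue)
        if form in titles:
            return form, cost          # cheapest chain reaching a page
        if form in seen:
            continue
        seen.add(form)
        for pattern, replacement, weight in rules:
            new_form, count = re.subn(pattern, replacement, form)
            new_cost = cost + weight
            # Drop chains whose summed weight becomes too radical.
            if count and new_form and new_cost <= max_cost:
                heapq.heappush(queue, (new_cost, new_form))
    return None                        # fall back to the red link


print(resolve_link("Aurdalas", {"Aurdal"}))
```

Because the queue always expands the cheapest form first, the first page title reached belongs to the rewrite sequence with the lowest sum of weights.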

Hunspell is a kind of a morphological rewrite engine that basically runs a series of rewrite rules according to a lexicon, generating the full dictionary on the fly. The dictionary can then be used for generating the string of page title base forms, which will then be used to construct a stored common inflected form. When the same thing is done to the link text it is possible to identify common strings of base forms.

For the Hunspell solution it is worth noting that all processes can use a common parser daemon, and that all redlinks can be resolved in one additional database query. It is also possible to make the additional base forms before querying the database at all, reducing the number of database queries to one, but at the cost of always running a Hunspell query. The Hunspell query can be shunted by storing the last spell check results in memcached.
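The single additional query can be illustrated with Python's `sqlite3` module. The table `base_forms` and its columns are hypothetical, and the base forms are assumed to have been computed already (for example by a Hunspell daemon, which is not shown here).

```python
import sqlite3


def resolve_redlinks(connection, base_forms):
    """Look up the base forms of all redlinks on a page in one query."""
    if not base_forms:
        return {}
    placeholders = ",".join("?" for _ in base_forms)
    rows = connection.execute(
        "SELECT base_form, title FROM base_forms "
        f"WHERE base_form IN ({placeholders})",
        list(base_forms),
    )
    found = {}
    for base_form, title in rows:
        found.setdefault(base_form, []).append(title)
    return found


# Hypothetical store of page title base forms.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE base_forms (base_form TEXT, title TEXT)")
connection.executemany(
    "INSERT INTO base_forms VALUES (?, ?)",
    [("aurdal", "Aurdal"), ("máze", "Máze")],
)

print(resolve_redlinks(connection, ["aurdal", "oslo"]))
```

Base forms that are missing from the result remain red links.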

For other languages it can be easier to use some kind of locality-sensitive hash function or some kind of stochastic algorithm.

If a page uses a name with parentheses it is not obvious how the name should be rewritten, but probably the string within the parentheses should be kept both in the link and the target page. Both should therefore have the same parenthesized string in the name.
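One way to keep the parenthesized string untouched is to split it off before rewriting and put it back afterwards. The Python sketch below assumes a single trailing parenthesized qualifier; the function names and the example title are hypothetical.

```python
import re

# A name followed by one trailing parenthesized qualifier.
QUALIFIER = re.compile(r"^(.*?)\s*(\([^()]*\))$")


def split_qualifier(name):
    """Split 'Name (qualifier)' into ('Name', '(qualifier)')."""
    match = QUALIFIER.match(name.strip())
    if match:
        return match.group(1), match.group(2)
    return name.strip(), ""


def rewrite_keeping_qualifier(name, rewrite):
    """Apply `rewrite` to the name only and keep the qualifier as is."""
    base, qualifier = split_qualifier(name)
    rewritten = rewrite(base)
    return f"{rewritten} {qualifier}" if qualifier else rewritten


# Toy rewrite that strips a two-letter ending.
print(rewrite_keeping_qualifier("Aurdalas (gielda)", lambda s: s[:-2]))
```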

Bumping category members
If it were possible to declare common categories and rules for initiating them, it would greatly help such small projects. Perhaps this could be a page where category names can be declared and connected to other more general names in larger projects, with fallbacks if the given granularity isn't useful for the specific project. If there are too few articles about some geographical feature at the municipality level, then bump them up to the county. If the county is still too granular, then bump them up to the country level.

For example, Máze could be categorized at Northern Sami Wikipedia as a small place in Guovdageaidnu, but any entry in a category at the municipality level, Guovdageainnu báikkit, would be bumped to the category Finnmárkku báikkit at the county level as long as the municipality category contains too few category members.

If it were possible to bump placement in categories like this, it would also be possible to reuse categorization from another project. If Ole Henrik Magga at Norwegian (bokmål) Wikipedia is categorized at too fine a granularity, then the article is bumped upwards at Northern Sami Wikipedia until the categories work out. If subcategories are defined, then articles move down to lower ones as appropriate.

To make this work some other site must identify some central properties of the categories, most notably whether it is a single inheritance with increasing granularity and which one of the parent categories to bump to. Each site could have a global map of categories, and other sites would import parts of them as appropriate, or there could be a fallback sequence of sites to check.

This relates very closely to ideas about a common global repository of iw-links.
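The bumping rule can be sketched in Python as a walk up a chain of parent categories. The member counts, the threshold and the country-level category name Norgga báikkit are assumptions made for the example.

```python
def bump_category(category, parents, member_counts, minimum=5):
    """Climb the parent chain until a category has enough members."""
    current = category
    seen = {current}
    while member_counts.get(current, 0) < minimum:
        parent = parents.get(current)
        if parent is None or parent in seen:
            break                      # top of the chain, or a loop
        seen.add(parent)
        current = parent
    return current


# Single-inheritance chain: municipality -> county -> country.
parents = {
    "Guovdageainnu báikkit": "Finnmárkku báikkit",
    "Finnmárkku báikkit": "Norgga báikkit",   # hypothetical name
}
member_counts = {                              # made-up numbers
    "Guovdageainnu báikkit": 1,
    "Finnmárkku báikkit": 12,
    "Norgga báikkit": 80,
}

print(bump_category("Guovdageainnu báikkit", parents, member_counts))
```

When the municipality category later gains enough members, the same call returns the municipality category itself, so articles move back down without being edited.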

Generalized reusable templates
Templates are very difficult for beginners to make. It would be very helpful if a set of carefully crafted and localizable templates were available from the very beginning. This includes navigational aids, infoboxes and maintenance templates. Some of these are pretty easy to make reusable while others are very difficult. If possible, such templates could be stored at Commons, Meta or MediaWiki. For an example of reusable maintenance templates see $maintenance at Norwegian (bokmål) Wikipedia. This template is built for both localization and reusability.

Most of what is necessary is already partly implemented as part of image transclusion. What is necessary in addition is to be able to distribute system messages in an efficient manner. If a template is to be really useful, it must be possible to construct the call locally and use locally defined data, not only to construct the call on an external site.

Note that this kind of feature generated a lot of discussion on Commons. See for example Commons also as a repository for templates and pages and the bug Bug 4547 - Support crosswiki template inclusion (transclusion => interwiki templates, etc.). Probably this is better done at Meta and only enabled on small projects that truly need the feature.

Import infobox entries
Construction of articles from infoboxes on other projects could be very interesting, but right now it is very difficult to even reuse information across projects through cut and paste. The whole discussion about a data commons is a little too involved for this; what is necessary to get this to work is a simple syntax that makes importing data a straightforward task. There should be a basic form, a form extended to another page on the same wiki, and a form extended outside the wiki through the iw scheme. It is, however, more easily understandable to simply use it as a normal but expensive parser function.

An alternative way to do this is to use an XPath parser function and, for security reasons, limit it to registered iw schemes. The XPath part starts after the last (?) colon.

The most interesting thing with this is that it will make it possible to create special templates that construct articles from infoboxes defined at other projects. An infobox like the one at Sør-Aurdal in Norwegian (bokmål) Wikipedia can be reused to create an article Mátta-Aurdála gielda at Northern Sami Wikipedia. This won't solve the maintainability problem, but it will make it easier to create new articles from a baseline.
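As a rough Python sketch of the import step, the parameters of an infobox call can be read into a dictionary and then reused. The template name, the parameter names and the sample wikitext are hypothetical, and the parser is deliberately naive: nested templates and piped links inside values are not handled.

```python
import re


def parse_infobox(wikitext, template_name):
    """Extract 'key = value' parameters from a simple template call."""
    pattern = re.compile(
        r"\{\{\s*" + re.escape(template_name) + r"\s*\|(.*?)\}\}",
        re.DOTALL,
    )
    match = pattern.search(wikitext)
    if not match:
        return {}
    fields = {}
    for part in match.group(1).split("|"):
        key, separator, value = part.partition("=")
        if separator:                  # skip unnamed parameters
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical source wikitext from the other project.
source = """{{Infoboks kommune
| navn = Sør-Aurdal
| fylke = Oppland
}}"""

fields = parse_infobox(source, "Infoboks kommune")
print(fields)
```

A local template could then use these values as its own parameters to produce the first version of an article.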

List formatting
Because infoboxes may include lists of entries, there must be some method to transform such lists into readable sentences. In particular, a list of items should be transformed into something that has a head, a middle part and an end. Before, after, and in between all of those there should be joiners. In Northern Sami the joiner between the middle and end part is "ja", in English it is "and" and in Norwegian it is "og". All other joiners are set to a default according to the given language; for Norwegian that would be a comma. Other languages may have other joiners. On a per-instance basis it should be possible to override this. Assume we have a list A, B and C in Northern Sami, contained in a parameter test; it should then render as → A, B ja C. In English we could choose to override the final joiner with "or", addressed by counting back from the end → A, B or C, or, using identifiers, in Norwegian → A, B men ikke C.

In the format for the parser function, the joiner strings can also be set with numeric indices; one (1) being the first joiner and minus one (-1) being the last one.
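The joiner logic can be sketched in Python as follows. The function name, the table layout and the language codes used as keys are assumptions; only the joiner words and the index convention (1 for the first joiner, -1 for the last) come from the text. Joiners before and after the whole list are left out of the sketch.

```python
# Per-language defaults: a general joiner plus the last joiner (-1).
DEFAULT_JOINERS = {
    "se": {"default": ", ", -1: " ja "},
    "en": {"default": ", ", -1: " and "},
    "no": {"default": ", ", -1: " og "},
}


def join_list(items, lang, overrides=None):
    """Join items with per-language joiners and per-call overrides."""
    items = list(items)
    if not items:
        return ""
    joiner_count = len(items) - 1
    table = dict(DEFAULT_JOINERS[lang])
    table.update(overrides or {})
    pieces = [items[0]]
    for position in range(1, joiner_count + 1):
        negative = position - joiner_count - 1   # -1 is the last joiner
        joiner = table.get(position, table.get(negative, table["default"]))
        pieces.append(joiner)
        pieces.append(items[position])
    return "".join(pieces)


print(join_list(["A", "B", "C"], "se"))
print(join_list(["A", "B", "C"], "en", {-1: " or "}))
print(join_list(["A", "B", "C"], "no", {-1: " men ikke "}))
```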

It is also interesting to pick up values from a template on the page for reuse somewhere else. This must also handle list content. There could also be a more XPath-like identification of the entries, or perhaps the parser function could switch between XPath for parsed content and an alternative for direct traversal of the wiki markup.

Note that it is not possible to easily use lists inside templates, as the parsing of the template eats the initial spaces. This should be fixed somehow. Probably the template parameters should be checked for their first character, and if any list identifier is found then a list should be generated.

It should be possible to use a list as the first argument to plural, thereby switching between alternate strings. There could also be a parser function that counts the number of list entries and returns that number. This is rather simple for ordered and unordered lists, but not so simple for data lists. It is not obvious whether we should count the dt or dd elements, or perhaps both of them.
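A counting function of this kind might look like the Python sketch below. It assumes the usual wiki list markers at the start of a line (`*`, `#`, `;` for dt and `:` for dd) and leaves the dt/dd question open by making it a parameter; the function and parameter names are hypothetical.

```python
def count_list_entries(wikitext, data_list_mode="dt"):
    """Count list entries in wiki markup, line by line."""
    counts = {"*": 0, "#": 0, ";": 0, ":": 0}
    for line in wikitext.splitlines():
        marker = line[:1]
        if marker in counts:
            counts[marker] += 1
    if counts["*"] or counts["#"]:
        # Ordered and unordered lists: every marked line is an entry.
        return counts["*"] + counts["#"]
    if data_list_mode == "dt":
        return counts[";"]
    if data_list_mode == "dd":
        return counts[":"]
    return counts[";"] + counts[":"]   # any other mode counts both


def choose_form(count, singular, plural):
    """Simplified two-form plural switch driven by the entry count."""
    return singular if count == 1 else plural


items = "* A\n* B\n* C"
print(count_list_entries(items))
print(choose_form(count_list_entries(items), "entry", "entries"))
```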

Constraint translations
It is possible to find a legal string of transformations from one known form in some source language into another similar form in another destination language. This is not a free form translation from the source language and it is not necessarily even a correct translation in the destination language. It is a guided (constrained) generation of a limited string in the destination language.

Typically there will be a variant word or phrase from one language that will be processed in a fixed way in a template. The template can possibly also be substituted to get a first version of a "translated" article, but often parts of the article will continue to be adapted on the fly.

The parameter to an infobox will have a specific form, but this form does not fit in a new role in running text. In Northern Sami this will for example be the situation where a place name is imported and used in a compound to name a church. In Norwegian we write Aurdal kirke, while in English this becomes Aurdal church. The name of the place does not change. In Northern Sami this becomes Aurdalas girku. This kind of transformation is defined as part of the tools created by Sámi Giellatekno, a language project at the University of Tromsø. The syntax they use is pretty simple and straightforward, and we could generate it in a parser function and call out to a daemon to do the actual transformation.

In this specific example we have Aurdal+N+Prop+Plc+Sg+Loc → Aurdalas, assuming the source word Aurdal is in Norwegian and the destination word is in Northern Sami. We can make a default assumption about the destination language, which should be the content language, but the source language isn't obvious. Unless set, it would also be the content language, and we can then transform from one inflection to another in the same content language. We can, however, do as in lang-links and prepend a language code before a colon: no:Aurdal+N+Prop+Plc+Sg+Loc → Aurdalas. This form is pretty easy to parse and to interpret in transducers like the ones built by Sámi Giellatekno, but it could be somewhat difficult to differentiate between the inline form and the parser form that is described later.

The pattern for extracting the strings will be something like the following, but note that it would be somewhat more complex to handle additional operators: [lang:]word+token1[+token2][…][+tokenN]
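The extraction pattern can be written as a regular expression. The Python sketch below handles only the plain `+token` form shown in the pattern, not additional operators, and the function name is hypothetical.

```python
import re

# [lang:]word+token1[+token2]...[+tokenN]
TAGGED_FORM = re.compile(
    r"^(?:(?P<lang>[A-Za-z-]+):)?"    # optional language code and colon
    r"(?P<word>[^+:]+)"               # the word to transform
    r"(?P<tokens>(?:\+[^+]+)+)$"      # one or more +token parts
)


def parse_tagged_form(text):
    """Split 'lang:word+tok1+tok2' into (lang, word, [tok1, tok2])."""
    match = TAGGED_FORM.match(text.strip())
    if not match:
        raise ValueError(f"not a tagged form: {text!r}")
    tokens = match.group("tokens").split("+")[1:]
    return match.group("lang"), match.group("word"), tokens


print(parse_tagged_form("no:Aurdal+N+Prop+Plc+Sg+Loc"))
print(parse_tagged_form("Aurdal+N+Prop+Plc+Sg+Loc"))
```

When the language code is absent the function returns `None` for it, and the caller can fall back to the content language.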

This form would then be rewritten as something like one of these two forms

Our previous example with Aurdal will then be turned into No:aurdal+n+prop+plc+sg+loc → Aurdalas

For brevity we can lump together the tokens and their operators and simply call them parameters.

The second form can be used with several words at once, identifying relations between words and params by position. We can also add a pattern match, which would be sufficient to identify which word to attach which param to. The patterns would be a set of tokens to search for, possibly by either exact or approximate match. By using such patterns it would be possible to switch word positions in the string.

This can be generalized to translation by example, and it should also be possible to change parameter sets generated from such examples. A call of this kind simply says "translate the Norwegian Aurdal like the Northern Sami form Alvdalas". The word Alvdalas will then be analyzed, producing a more complete form which is used as parameters for generating Aurdalas from Aurdal.

Sometimes the results from the analysis will be insufficient and we will have to refine them by adding or removing switches, so that for example Aurdal becomes Aurdala and not Aurdalas. In addition there are times when we don't know if we have a complete match. In those circumstances we want to get as close as possible to a given form, which could be given by an example word.

It should be possible to combine the function with other parser functions, especially functions for system messages like plural.

Note that analysis and synthesis can be asymmetric between language pairs, and several cleanup phases can also be necessary. Note especially that this is the case for the Northern Sami to Norwegian (bokmål) pair as it is done for the Apertium engine.

Final notes
It is important to note that this kind of article production is not about translating existing articles; it is about creating articles from well-defined infoboxes in other languages. It seems that statistical translation will not work very well, but rule-based translations like the ones produced by Apertium can work.