Content translation/Language anomalies


 * Note that this is only a proposal, not something that is scheduled for implementation.

Language anomalies can be detected by using a recurrent neural network with word embeddings. This is a fairly straightforward task, and it can be used to detect strange language constructs. If the checked text comes from a machine translation engine or Content translation, then the generated text can contain language anomalies, which can then be detected. Examples of language anomalies are lack of agreement or wrong gender.

Assume an ordinary setup of a recurrent neural network, possibly a bidirectional one and most likely built from gated recurrent units (GRU), where an unknown word is guessed from its context. This is the grammar model: a language model without explicit words, where all words are represented as word embeddings. The input sequence comes from the wikitext to be checked. Even if the "unknown word" is really known, the network still tries to guess its value given the grammar model. The guessed value is given as an embedding in the word space.
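The data flow of such a grammar model can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights are random and untrained, the embeddings are toy vectors, and the names `GRUCell`, `guess_embedding` and `w_out` are hypothetical, not part of any existing framework. The gate convention used here (new state = (1 - z) * old state + z * candidate) is one of two common ones. A real model would be trained and would normally be built with a deep learning library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """One gated recurrent unit with untrained (random) weights, biases omitted."""

    def __init__(self, dim, rng):
        # One input matrix (W) and one hidden-state matrix (U) per gate.
        self.Wz, self.Uz = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)
        self.Wr, self.Ur = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)
        self.Wc, self.Uc = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state, computed from the input and the reset-gated old state.
        gated = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(self.Wc, x), matvec(self.Uc, gated))]
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


def guess_embedding(left, right, fwd, bwd, w_out, dim):
    """Guess the embedding of the hidden word from both sides of its context."""
    h_fwd = [0.0] * dim
    for x in left:                 # read the left context left-to-right
        h_fwd = fwd.step(x, h_fwd)
    h_bwd = [0.0] * dim
    for x in reversed(right):      # read the right context right-to-left
        h_bwd = bwd.step(x, h_bwd)
    # Project the concatenated states back into the word space.
    return matvec(w_out, h_fwd + h_bwd)


rng = random.Random(0)
DIM = 4
fwd, bwd = GRUCell(DIM, rng), GRUCell(DIM, rng)
w_out = rand_matrix(DIM, 2 * DIM, rng)

# Five toy word embeddings; the word at index 2 is treated as "unknown".
sentence = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
position = 2
guess = guess_embedding(sentence[:position], sentence[position + 1:], fwd, bwd, w_out, DIM)
```

Note that the embedding of the word being checked never enters the computation; only its left and right context does.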

Now take the "unknown word" (which is really known) and find its representation as a word embedding inside the word space. This word space is the dictionary model. If the dictionary embedding for the word is close to the embedding previously estimated by the grammar model, then the word is most likely correct. If it is far from the estimated embedding, then the word can be faulty, but the grammar model can also be wrong. The distance between the known word and the estimated one is an anomaly estimate for that specific word given the context.
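A minimal sketch of this distance check follows. It assumes cosine distance as the measure (the proposal does not prescribe one), and the three-dimensional vectors, the `dictionary` contents and the function names are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as carrying no evidence either way
    return 1.0 - dot / (norm_a * norm_b)


def anomaly_score(estimated, word, dictionary):
    """Distance between the grammar model's guess and the word actually used."""
    return cosine_distance(estimated, dictionary[word])


# Toy dictionary model: word -> embedding (hypothetical values).
dictionary = {
    "bird": [0.9, 0.1, 0.0],
    "sings": [0.1, 0.9, 0.1],
}

# Pretend this is what the grammar model guessed for the checked position.
estimated = [0.8, 0.2, 0.0]

score_ok = anomaly_score(estimated, "bird", dictionary)    # small distance
score_odd = anomaly_score(estimated, "sings", dictionary)  # large distance
```

A word would be flagged when its score exceeds some threshold. Where to put that threshold is a tuning question that this sketch does not answer.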

It is possible to learn a better classifier by using contrastive learning, that is, by also giving known erroneous sequences, and thus learning a grammar space with a multimodal distribution. The example sequence in the figure uses the word "bird", but even a superficial check makes it clear that this specific embedding is just one mode of several. Replacing the rather simple distance measure with a better model, i.e. another neural network, can lower the false positive rate considerably.
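One common way to express a contrastive objective is a margin (hinge) loss over a triple of embeddings: the guessed one, a known correct one and a known erroneous one. The sketch below shows only that loss for a single triple. It is one possible formulation, not something the proposal specifies, and the function name, the Euclidean distance and the margin value are assumptions. It covers neither the training loop nor the multimodal aspect.

```python
import math


def contrastive_margin_loss(estimate, correct, erroneous, margin=1.0):
    """Zero when the correct word is closer than the erroneous one by at least `margin`."""
    d_pos = math.dist(estimate, correct)    # should become small during training
    d_neg = math.dist(estimate, erroneous)  # should become large during training
    return max(0.0, margin + d_pos - d_neg)


# The correct word is close and the erroneous one is far away: nothing to learn.
loss_good = contrastive_margin_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0])

# The erroneous word is closer than the correct one: the loss is positive.
loss_bad = contrastive_margin_loss([0.0, 0.0], [1.0, 0.0], [0.5, 0.0])
```

Minimizing such a loss over many correct and erroneous sequences pushes guesses toward acceptable words and away from known errors.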

It should be possible to implement such an estimator within the current ORES framework, even though the output from the estimator identifies locations within a text. The output would simply be a JSON-formatted report that can be used for decorating text within VisualEditor, such that words that might be anomalies get a colored wavy underline. It would then be the editor's choice whether the text should be changed. In the Content translation editor this would highlight weird constructs heavily, making the editor aware that the text most likely needs further editing.
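What such a report could look like is sketched below. The field names (`offset`, `length`, `word`, `score`, `threshold`) and the function `build_report` are purely hypothetical. This is not an existing ORES response format, and the sketch assumes words separated by single spaces so that character offsets are easy to compute.

```python
import json


def build_report(words, scores, threshold=0.5):
    """Collect the words whose anomaly score reaches the threshold."""
    anomalies = []
    offset = 0
    for word, score in zip(words, scores):
        if score >= threshold:
            anomalies.append({
                "offset": offset,       # character position in the text
                "length": len(word),    # span to underline
                "word": word,
                "score": round(score, 3),
            })
        offset += len(word) + 1         # +1 for the separating space
    return {"threshold": threshold, "anomalies": anomalies}


words = ["the", "bird", "sing", "loudly"]
scores = [0.02, 0.05, 0.81, 0.10]       # made-up anomaly scores per word
report = build_report(words, scores)
print(json.dumps(report, indent=2))
```

An editor front end could then map each `offset` and `length` pair to a text range and decorate it, leaving the decision about any change to the human editor.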

Note that output from an estimator for language anomalies can't be used as a quality measure for some kind of "correct translation"; it is a detector for weird or unlikely constructs. It might be correct to write that a horse could perform a senator's duties, even if the estimator would balk at the reference.