Content translation/Language anomalies

Language anomalies can be detected with a recurrent neural network over word embeddings. This is a fairly straightforward task and can be used to flag strange language constructs. If the checked text comes from a machine translation engine or from Content translation, the generated text may contain language anomalies, which can then be detected. Examples of language anomalies are lack of agreement or wrong gender.

Assume an ordinary recurrent neural network setup, possibly bidirectional and most likely built from gated recurrent units, in which an unknown word is guessed. This is the grammar model, and all words are represented as word embeddings. The sequence comes from the known wikitext, and even if the unknown word is in fact known, the network tries to guess its value with the grammar model. This value is given as an embedding in the word space.
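The idea can be sketched in a few lines. This is a minimal toy, not an implementation: the vocabulary and embedding values are invented, and a simple average of the context embeddings stands in for the real grammar model, which would be a (bidirectional) gated recurrent network trained to predict the hidden word's embedding from its context.

```python
import numpy as np

# Toy vocabulary with 4-dimensional word embeddings (hypothetical values;
# a real system would use embeddings trained on a large corpus).
embeddings = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "sat": np.array([0.0, 0.1, 0.9, 0.8]),
    "mat": np.array([0.7, 0.1, 0.3, 0.9]),
}

def predict_embedding(context_words):
    """Stand-in for the grammar model: guess the embedding of the hidden
    word from its context. A real system would run a gated recurrent
    network over the context embeddings; averaging them is the simplest
    possible placeholder for illustration."""
    vecs = [embeddings[w] for w in context_words]
    return np.mean(vecs, axis=0)

# Hide "cat" in "the cat sat" and let the model guess its embedding.
guess = predict_embedding(["the", "sat"])
print(guess.shape)  # a vector in the same word space, here (4,)
```

The output of this step is not a word but a point in the embedding space, which is exactly what the comparison in the next step needs.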

Now take the unknown (but really known) word and look up its representation as a word embedding in the word space. This is the dictionary model. If the dictionary embedding is close to the embedding previously estimated by the grammar model, the word is most likely correct. If it is far from the estimated embedding, the word may be faulty, although the grammar model can also be wrong. The distance of the known word's embedding from the estimated one expresses the degree of anomaly of that word in the given context.
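The comparison itself is a distance in the embedding space. A minimal sketch, assuming cosine distance as the measure (other metrics would work too) and hypothetical embedding values:

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for identical direction, 2 for opposite."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def anomaly_score(predicted, actual):
    # The further the dictionary embedding of the actual word lies from
    # the grammar model's prediction, the more anomalous the word is.
    return cosine_distance(predicted, actual)

predicted = np.array([0.9, 0.8, 0.1, 0.0])      # grammar model's guess (hypothetical)
plausible = np.array([0.85, 0.82, 0.15, 0.05])  # dictionary embedding of a fitting word
implausible = np.array([0.0, 0.1, 0.9, 0.8])    # dictionary embedding of a misfit word

print(anomaly_score(predicted, plausible) < anomaly_score(predicted, implausible))  # True
```

The score is a continuous degree of anomaly rather than a hard yes/no, which matches the observation that a large distance can also mean the grammar model itself is wrong.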

It is possible to learn a better classifier by using contrastive learning, that is, by also providing known erroneous sequences, and thereby learning a more irregular grammar space. This can lower the false positive rate.
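A toy illustration of the gain from known-erroneous examples, under invented assumptions: anomaly scores for correct and erroneous sequences are drawn from made-up distributions, and a simple threshold search stands in for the classifier. A real contrastive setup would train the boundary (and a better-separated embedding space) end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anomaly scores: low for correct sequences, higher for
# sequences with known errors (e.g. wrong gender, broken agreement).
correct_scores = rng.normal(loc=0.1, scale=0.05, size=200)
erroneous_scores = rng.normal(loc=0.6, scale=0.15, size=200)

def best_threshold(pos, neg):
    """Pick the score threshold that best separates correct from
    erroneous sequences. With negative (erroneous) examples available,
    the threshold is fitted instead of guessed, which is what lowers
    the false positive rate."""
    best_t, best_acc = 0.0, 0.0
    for t in np.concatenate([pos, neg]):
        acc = (np.mean(pos < t) + np.mean(neg >= t)) / 2
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc

t, acc = best_threshold(correct_scores, erroneous_scores)
print(acc > 0.9)  # with well-separated score distributions the split is near perfect
```

Without the erroneous examples, the threshold would have to be set by hand, and every unusual but correct construct above it would become a false positive.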