Content translation/Caching

In content translation we use Redis for caching.

Why are we caching?

We allow an article to be edited in translation over multiple sessions. A user might stop while the translation is still incomplete and come back later. The translation data can also be saved to guard against browser crashes, closed tabs and similar interruptions.

Preparing translation support data is expensive. Some of the data might come from a paid service.

What are we caching?

Segmented article content
Link translations per language pair (Wikidata results)
Dictionary information per language pair
Machine translation for the requested page, per language pair, so that an article loads faster the second time it is opened
Data from any other translation tools, so that it can be served faster on repeat access

Cache expiry

To be decided after doing some tests with real data. Perhaps an LRU algorithm with fixed memory size.
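
Since the expiry policy is still undecided, the following is only a minimal sketch of what "LRU with a fixed memory size" means, written as a toy in-process cache. The class name `ByteBoundedLRU` and its byte accounting are assumptions made for illustration. Redis itself can be configured with a `maxmemory` limit and an LRU-style `maxmemory-policy`, which would likely replace any hand-written logic like this.

```python
from collections import OrderedDict
from typing import Optional


class ByteBoundedLRU:
    """Toy LRU cache that evicts least recently used entries past a byte budget."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def __contains__(self, key: str) -> bool:
        return key in self._items

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        if key in self._items:
            # Replacing an entry: release the bytes of the old value first.
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        # Evict from the least recently used end until the budget is met.
        while self.used_bytes > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)
```

Only the size of the values is counted here. A real memory limit would also need to account for keys and per-entry overhead.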

Cache invalidation

A cache hit happens when the revision ID, article and language pair all match. If any of these variables changes, we need a selective cache refresh.
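
The hit condition can be sketched as a comparison of four fields. The `CacheKey` type and the helper names below are hypothetical and only illustrate the rule; they are not part of the existing code.

```python
from typing import List, NamedTuple


class CacheKey(NamedTuple):
    """The variables that must all match for a cache hit (names are assumptions)."""
    source_language: str
    target_language: str
    title: str
    revision_id: int


def is_cache_hit(cached: CacheKey, requested: CacheKey) -> bool:
    # A hit requires every field to match.
    return cached == requested


def changed_fields(cached: CacheKey, requested: CacheKey) -> List[str]:
    # Lists which variables differ, to drive a selective refresh.
    return [
        name for name in CacheKey._fields
        if getattr(cached, name) != getattr(requested, name)
    ]
```

Knowing which fields changed, rather than only that something changed, is what makes a selective refresh possible.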

The following is a possible cache invalidation strategy:

If the revision changes, that is, if the article was edited and a new version is available, rerun the whole translation support data calculation, but do it per segment: calculate the hash of each segment, and if the cache already holds an entry with a matching hash, reuse it.

Note: Perhaps surprisingly, MT engines almost always consider each segment in isolation and do not take larger context into account to improve translation quality.

This means that, in Redis, the keys for segments and other cached items should be SHA hashes of their content, so that checking whether content has changed is easy.
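
A minimal sketch of hash-keyed segment reuse follows. The page only says "SHA hash", so SHA-256 is an assumption here, as are the function names. A plain `dict` stands in for the Redis structure that holds machine translation results for one language pair, and `translate` stands in for the expensive MT call.

```python
import hashlib
from typing import Callable, Dict, List


def segment_key(segment_text: str) -> str:
    """Content hash used as the cache key for a segment (SHA-256 is an assumption)."""
    return hashlib.sha256(segment_text.encode("utf-8")).hexdigest()


def translate_segments(
    segments: List[str],
    cache: Dict[str, str],
    translate: Callable[[str], str],
) -> List[str]:
    """Translate each segment, calling `translate` only for unseen content."""
    results = []
    for text in segments:
        key = segment_key(text)
        if key not in cache:
            # Cache miss: the segment is new or its text has changed.
            cache[key] = translate(text)
        results.append(cache[key])
    return results
```

Because the key depends only on the segment text, a segment that survives an edit unchanged maps to the same key in the new revision, and its cached result is found without any explicit diffing.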

Things we can reuse on changes:

title change -> nothing.
source language change -> nothing.
target language change -> segmented content remains the same, link sources remain the same.
revision change -> some segments are the same, determined by comparing their hashes.
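
The reuse rules above can be written down as a small lookup table. This is an illustrative sketch only: the item names (`segmentedContent`, `linkSources`, `matchingSegments`) and the function names are assumptions chosen for the example.

```python
from typing import Iterable, List, Set

# What can be kept from the cache for each kind of change (names are assumptions).
REUSE_RULES = {
    "title": set(),
    "source_language": set(),
    "target_language": {"segmentedContent", "linkSources"},
    "revision": {"matchingSegments"},
}


def reusable_items(change: str) -> Set[str]:
    """Return the cached items that survive the given kind of change."""
    return set(REUSE_RULES[change])


def matching_segments(old_hashes: Iterable[str], new_hashes: Iterable[str]) -> List[str]:
    """For a revision change: hashes of new segments already present in the cache."""
    known = set(old_hashes)
    return [h for h in new_hashes if h in known]
```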

Redis data structure

The data model for the cache needs to be a hierarchy of key-value pairs. The following approach is suggested:

Use the source language as the top-level key.
Under the source language we have the article titles.
Under the articles we have the target languages.
The source language will have segments, segmentedContent, links, mt etc.
mt, links and segments will be hashes (the Redis hash data type). segmentedContent is just a plain string.
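
Redis keys are flat strings, so one common way to express such a hierarchy is to join the levels with a separator. The sketch below assumes a `cx` prefix, a `:` separator, and a particular placement of items (article-level versus language-pair-level). None of these are specified above, and the helper names are hypothetical.

```python
def article_key(source_language: str, title: str, item: str) -> str:
    """Key for data that depends only on the source article, e.g. segmentedContent."""
    return ":".join(("cx", source_language, title, item))


def pair_key(source_language: str, title: str, target_language: str, item: str) -> str:
    """Key for data that depends on the language pair, e.g. mt or links."""
    return ":".join(("cx", source_language, title, target_language, item))


# Example layout for the English article "Moon" translated to Spanish:
segmented_content_key = article_key("en", "Moon", "segmentedContent")  # plain string value
segments_key = article_key("en", "Moon", "segments")                   # hash: segment hash -> text
mt_key = pair_key("en", "Moon", "es", "mt")                            # hash: segment hash -> translation
links_key = pair_key("en", "Moon", "es", "links")                      # hash: source link -> target link
```

With a client such as redis-py, the hash-typed keys would be written with `hset` and read with `hget`, and the plain string with `set` and `get`. Titles that contain the separator character would need escaping, which this sketch does not handle.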