Topic on Talk:RESTBase

Data store retention policy

Ciencia Al Poder (talkcontribs)

RESTBase acts as a data store, and its storage backend is either SQLite or Cassandra.

What I can't find anywhere in the documentation is the retention policy of the stored data.

After some months or years of usage, the SQLite file grows to the order of gigabytes, which is quite unmanageable. I simply stopped the service, deleted the SQLite file and started the service again, and everything seems to work. Is there any purge or maintenance process that I should run periodically to clean up unneeded data? Stopping the service, manually removing the file and starting it again is not particularly clean (although doable from a shell script run from cron). More importantly, is it safe? As I said, everything apparently works, but I'd like to be sure it's safe to do, or to know whether there are cleaner ways to remove old data.
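For illustration, the stop / move the file / start sequence described above could be scripted roughly as follows. This is only a sketch under assumptions that do not come from the RESTBase documentation: the systemd unit name `restbase`, the path in `DB_PATH` and the size threshold are all hypothetical, and the file is renamed instead of deleted so that it can be put back. It does not answer whether resetting the store is actually safe.

```python
import subprocess
from pathlib import Path

# Hypothetical values: adjust to the local installation.
SERVICE = "restbase"                            # assumed systemd unit name
DB_PATH = Path("/var/lib/restbase/db.sqlite3")  # assumed location of the SQLite file
MAX_BYTES = 5 * 1024**3                         # arbitrary 5 GiB threshold


def needs_reset(db_path, max_bytes):
    """Return True if the database file exists and is larger than max_bytes."""
    return db_path.exists() and db_path.stat().st_size > max_bytes


def reset_store(db_path, service, run=subprocess.run):
    """Stop the service, move the database file aside, start the service again.

    The file is renamed with an "_old" suffix instead of being deleted.
    Any SQLite sidecar files (-wal, -shm) are not handled here.
    """
    old_path = db_path.with_name(db_path.name + "_old")
    run(["systemctl", "stop", service], check=True)
    try:
        if db_path.exists():
            db_path.replace(old_path)  # overwrites a previous "_old" copy
    finally:
        # Always try to bring the service back up, even if the rename failed.
        run(["systemctl", "start", service], check=True)
    return old_path


if __name__ == "__main__":
    if needs_reset(DB_PATH, MAX_BYTES):
        reset_store(DB_PATH, SERVICE)
```

Run from cron, the script does nothing until the file exceeds the threshold. The `run` parameter exists only so the function can be exercised without touching a real service.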

I want to migrate the storage backend to Cassandra, and I fear I'll have the same problem: data storage growing without limit, with the only way to clean it up being to delete the storage entirely and let it recreate itself.

WMF must have much bigger storage needs across all its wikis. How does WMF handle this? Do they really have terabytes of data in their Cassandra cluster? Is all of that data really needed?

Ciencia Al Poder (talkcontribs)

After revisiting this and finding that the SQLite db file had grown to 7.5 GB, I decided to rename the db file to _old and restart the service.

I tried to reload a page that contained math formulas and, surprisingly, the formulas stopped working. Opening the image URL in the browser returned a JSON error saying it wasn't able to find a file. Quite shocking. The error disappeared once I edited the page.

Apparently, math images aren't regenerated on the fly, which means the database should never be purged; otherwise all math formulas will start failing until each page is edited and previewed, or force-purged.

I didn't expect the RESTBase database to be used to store PNG images. That looks pretty inefficient.
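To check what is actually stored before removing anything, the SQLite file can be inspected directly. The following is a minimal sketch using Python's standard `sqlite3` module; it makes no assumption about RESTBase's table names and simply lists every table with its row count. The function name `table_row_counts` is made up for this example, and counting rows on a multi-gigabyte file may be slow.

```python
import sqlite3
import sys


def table_row_counts(db_file):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Open read-only so the file is never modified by this inspection.
    conn = sqlite3.connect(f"file:{db_file}?mode=ro", uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'  # quote the identifier
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_db.py /path/to/db.sqlite3
    for table, count in table_row_counts(sys.argv[1]).items():
        print(f"{count:>12}  {table}")
```

Tables with unexpectedly large row counts are a reasonable starting point for finding out which kind of content (for example rendered math images) accounts for the growth.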
