Topic on Talk:Wikimedia Services/Revision storage for HTML and structured data: Use cases

NRuiz (WMF) (talkcontribs)
Erik Zachte (talkcontribs)
@NRuiz (WMF): Hadoop is about data that WMF extracts from the databases. But dumps are also used a lot outside WMF for all kinds of purposes, and shrinking those dumps might improve download and processing times. Compression can only do so much. A random example I encountered today: https://ti.wikipedia.org/w/index.php?title=%E1%89%B2%E1%89%AA&action=history, where 25 of the 29 revisions were about interwiki links. For popular topics there might be up to 200 or so language links, and as many revisions.
Those interwiki links have been migrated to Wikidata, but the edit history is still there. My suggestion was to migrate those edits and replace them with dummy edits (only a timestamp and user, no raw text). I know this sounds radical, and not exactly trivial to implement, but shouldn't we deal with our history bloat someday?
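To make the idea a bit more concrete, here is a rough Python sketch of a pass over a pages-meta-history dump that flags revisions whose only change was to interwiki link lines (the ones that could become dummy edits) and tallies how much revision text they carry. The interwiki regex, the dump filename, and the element handling are illustrative assumptions on my side, not a worked-out design.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Lines that consist solely of an interwiki link, e.g. [[de:Titel]].
# Rough heuristic only; not the validation MediaWiki itself applies.
INTERWIKI_LINE = re.compile(r"^\[\[[a-z][a-z0-9-]{1,11}:[^\]]+\]\]\s*$")

def non_interwiki_lines(text):
    """Return the revision text with interwiki-only lines removed."""
    return [line for line in (text or "").splitlines()
            if not INTERWIKI_LINE.match(line.strip())]

def is_interwiki_only_change(old_text, new_text):
    """True if two consecutive revisions differ only in interwiki link lines."""
    return non_interwiki_lines(old_text) == non_interwiki_lines(new_text)

def scan_dump(path):
    """Count revisions whose only change was to interwiki links,
    and how many bytes of raw text they carry in the dump."""
    total_revs = interwiki_revs = droppable_bytes = 0
    prev_text = None
    local = lambda tag: tag.rsplit("}", 1)[-1]   # strip the export-schema namespace
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):         # end events only, streaming
            tag = local(elem.tag)
            if tag == "revision":
                total_revs += 1
                text = elem.findtext("{*}text") or ""
                if prev_text is not None and is_interwiki_only_change(prev_text, text):
                    interwiki_revs += 1
                    droppable_bytes += len(text.encode("utf-8"))
                prev_text = text
                elem.clear()                     # keep memory flat on large dumps
            elif tag == "page":
                prev_text = None                 # next page starts a fresh history
                elem.clear()
    return total_revs, interwiki_revs, droppable_bytes

if __name__ == "__main__":
    # Hypothetical dump file, just for illustration.
    total, iw, size = scan_dump("tiwiki-latest-pages-meta-history.xml.bz2")
    print(f"{iw} of {total} revisions look interwiki-only; "
          f"~{size} bytes of text could become dummy edits")
```

Comparing only consecutive revisions within a page is enough for this estimate, since interwiki bot edits normally just add or update those link lines; running something like this over a few of the smaller wikis would already give a rough sense of how much of the history is this kind of bloat.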
Reply to "Dumps"