User:Flow cleanup bot

From mediawiki.org

This bot converts Flow boards to an XML export with history. Neat. Code at https://gitlab.wikimedia.org/pppery/flow-export-with-history

The main bot run is finished. The bot is now being run supervised to clean up pages that the main run punted for later manual review.

Known issues and oddities:

  • Old posts with <pre> tags get garbled. This is due to a Parsoid issue that can't be fixed cleanly; see phab:T383645 (now fixed in most cases).
  • Multiline templates in replies aren't supported.
  • Flow apparently allowed commenting on a hidden topic. This will appear as an edit with "(no difference)", as the content isn't shown until the topic is unhidden.
  • Occasionally posts will be overindented for no good reason, due to vandalism being hidden or the Flow data model not quite matching wikitext conventions.
    Contrarily, sometimes the bot will decide a post's wikitext is too complicated to indent and awkwardly leave it unindented. It spews a warning for me to look at when it does this, but sometimes I agree with it and leave a post unindented even if indenting would be more technically correct. I also didn't look at these warnings for the support desk archives, since those pages are so huge/broken/janky that dealing with just the other things drained all of my time and energy.
  • Flow seems to have allowed edits that made no change to the post, which are faithfully reproduced as normally impossible wikitext null revisions.
  • Flow posts that were "hidden" by a non-admin are exported in the history, but Flow posts that were "deleted" by an admin are not.
  • Vandalism and spam are included too. It's not the bot's fault ...
  • Multiline wikitext, such as tables, is poorly supported.
  • For certain very old Flow pages (those that used Flow before 2014) it was possible to have a Flow board on top of a wikitext page. The edit summaries and usernames of that pre-Flow wikitext page are preserved (where applicable) and imported, but the content is lost without separately importing it from an old database dump (or other database skullduggery only the sysadmins can do). Related to T337807.
    On this wiki I manually imported history from dumps for Talk:Design and Talk:MediaWiki UI, Talk:Structured Discussions/2013b, and Talk:Flow QA.
    User talk:Jorm (WMF), Talk:Beta Features/Nearby Pages/Archive 2, Talk:VisualEditor/Beta Features/Language, Talk:Search/Old, Talk:Winter/Archive 2/Flow export, Talk:Page Previews/2014/03 and Talk:Content translation/2014 were also affected, but the history was trivial enough (just a page move) to not be worth importing.
  • A very long time ago Flow allowed posting topics with no body. The current Flow API crashes when dealing with such topics, so I'm excluding them from the export entirely. There weren't any except on old sandbox-type pages.
  • Sometimes (but not always) an image gets silently deleted. Another Parsoid issue. See T388687
  • Very rarely a Flow post doesn't appear in the output of the Flow history API for no apparent reason. If the post is replied to then the bot spews a warning and I dig up the old post and add it in a post-export edit. If it is the last post in a topic then this cannot be detected.
  • Flow parsed headers using a subset of wikitext. Flow cleanup bot doesn't attempt to deal with this, so templates (or other odd wikitext) in headers can cause problems.
  • ...
  • All of the above issues are much more likely to occur for old posts than for new ones. It seems like some change to Parsoid HTML in 2017 significantly reduced the rate at which stuff gets garbled.
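
To illustrate why multiline templates, tables and <pre> blocks are awkward to indent, here is a minimal Python sketch of the general problem. It is not the bot's actual code: the function name `indent_reply`, the regular expression and the brace-counting check are all hypothetical and chosen only for illustration. Talk-page replies are conventionally indented by prefixing each line with colons, but constructs that must begin at the start of a line, or that span several lines, stop working once prefixed, so a cautious converter might leave such posts unindented and flag them for review.

```python
import re

# Hypothetical heuristic: table markup and <pre> must start a line,
# so prefixing them with ":" would change how they render.
_LINE_START_ONLY = re.compile(r"^(\{\||\|\}|<pre\b)", re.IGNORECASE)


def indent_reply(wikitext: str, depth: int) -> tuple[str, bool]:
    """Return (text, indented).

    Prefix every line with `depth` colons when that looks safe;
    otherwise return the text unchanged with indented=False so the
    caller can emit a warning for manual review.
    """
    if depth <= 0:
        return wikitext, True

    lines = wikitext.split("\n")

    # Tables or <pre> blocks: too complicated to indent.
    if any(_LINE_START_ONLY.match(line) for line in lines):
        return wikitext, False

    # A line with unbalanced "{{" / "}}" suggests a template that
    # spans several lines (a crude check, assumed for this sketch).
    if any(line.count("{{") != line.count("}}") for line in lines):
        return wikitext, False

    prefix = ":" * depth
    return "\n".join(prefix + line for line in lines), True


if __name__ == "__main__":
    text, ok = indent_reply("{{quote\n|text=hi\n}}", 1)
    if not ok:
        print("warning: post left unindented for manual review")
```

A simple reply such as "Hello\nworld" at depth 2 comes back with "::" in front of each line, while a post containing a table or a multiline template is returned untouched together with a False flag, which mirrors the "leave it unindented and warn" behaviour described in the list.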

Please feel free to post any other issues you notice at Talk:Structured Discussions/Deprecation or User talk:Pppery.