This is my brainstorming and notes page.
Parser Migration Tool
The project might have three components:
- Backend tool: based on the rt server, it grunges through pages looking for “something” and dumps the results into a db.
- UI: a nice interface that lets you create new “somethings”, shows the results of the search and the proposed edits, and lets you approve edits manually or automatically, or coordinate manual edits.
- Possibly a separate piece to actually apply approved edits, if that's not trivial to integrate into #2.
The broad architecture looks like -
[Pages]
   |
   V
Backend Tool
   |
   V
[db with list of pages to fix + what to fix]
   |
   + <----- Bots Fixing Issues
   |
   V
Webapp
The architecture is going to be based on the rt server. It will grunge through pages "looking for something" and dump the results into a database.
We still need to work out what "looking for something" and "creating a new something" will look like. A "something" could be a regexp or a piece of JS code, and it could be created directly from the UI, or through Gerrit and a redeploy, etc. But to start with, we will just generate corrections based on what we already know in Parsoid. We will start by defining a few "interesting rules" manually, e.g. templates creating misnested HTML.
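To make the idea of a "something" more concrete, here is a minimal sketch of how a rule could be represented so that a regexp and a piece of code look the same to the backend. It is written in Python purely for illustration. The names `Rule`, `regex_rule`, `run_rules`, and both example rules (`center-tag`, `unclosed-small`) are hypothetical, not anything that exists in Parsoid or the rt server.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the wikitext


@dataclass
class Rule:
    name: str
    find: Callable[[str], List[Span]]  # returns the spans the rule matched


def regex_rule(name: str, pattern: str) -> Rule:
    """Build a rule from a regexp (the simplest kind of "something")."""
    compiled = re.compile(pattern)
    return Rule(name, lambda text: [m.span() for m in compiled.finditer(text)])


def unclosed_small(text: str) -> List[Span]:
    """Code-based rule: naively report <small> opens that outnumber the closes."""
    opens = [m.span() for m in re.finditer(r"<small>", text)]
    closes = len(re.findall(r"</small>", text))
    return opens[closes:]


def run_rules(text: str, rules: List[Rule]) -> List[Tuple[str, int, int]]:
    """Apply every rule to one page and collect (rule name, start, end)."""
    return [(rule.name, start, end)
            for rule in rules
            for start, end in rule.find(text)]


# Hypothetical starting set: one regexp rule and one code rule.
RULES = [regex_rule("center-tag", r"<center>"),
         Rule("unclosed-small", unclosed_small)]
```

Because both kinds of rule expose the same `find` callable, a rule created from the UI and a rule shipped through a redeploy could in principle be run by the same loop.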
We might integrate the tool with the regular parsing pipeline and might use some of the rt server code.
What kind of info do we need to extract?
While parsing a page, we know a bunch of things:
(a) unclosed tags
(b) fosterable content
And possibly others, which we could make a list of. To actually fix these up, we need additional information, such as what kind of problem it is, possibly the substring, possibly the source offsets, and possibly how it needs to be fixed.
The design below is based on a MW dump rather than the rt server. It's a raw design and it doesn't consider broken text from templates/extensions. We will first have to collect stats on how templates are used (primarily balanced vs. unbalanced) and then add that to this design.
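One possible shape for the extracted information is a small record per problem. This is only a sketch in Python: the `Issue` type, its field names, and the `make_issue` and `apply_fix` helpers are assumptions made for illustration, and the suggested fix is optional because we may not always know how a problem should be fixed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    kind: str        # what kind of problem it is, e.g. "unclosed-tag"
    start: int       # source offset where the problem starts
    end: int         # source offset where it ends (exclusive)
    substring: str   # the offending piece of wikitext
    suggested_fix: Optional[str] = None  # replacement text, if known


def make_issue(kind: str, wikitext: str, start: int, end: int,
               suggested_fix: Optional[str] = None) -> Issue:
    """Record a problem, copying the substring out of the source."""
    return Issue(kind, start, end, wikitext[start:end], suggested_fix)


def apply_fix(wikitext: str, issue: Issue) -> str:
    """Return the wikitext with the suggested fix applied, if there is one."""
    if issue.suggested_fix is None:
        return wikitext
    return wikitext[:issue.start] + issue.suggested_fix + wikitext[issue.end:]
```

Storing both the offsets and the substring is redundant, but it would let the webapp notice when a page has changed since the issue was recorded.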
This is a raw architecture for the backend tool -
- For each article in the MW dump, fetch the wikitext source for that revision of the article if needed (ideally it would already be included in the dump).
- Parse wikitext to DOM.
- For each job, run the <job source code> and fetch results.
- If result is non-empty, create row in 'job_result' table
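The steps above can be sketched roughly as follows, in Python with an SQLite database standing in for the real one. Everything here is an assumption made for illustration: `parse_to_dom` is a placeholder for the real wikitext parser, jobs are modelled as plain callables rather than stored source code, and the `job_result` columns (`job_name`, `page_title`, `result`) are guesses at what the table might hold.

```python
import json
import sqlite3


def parse_to_dom(wikitext):
    # Placeholder: a real tool would call the wikitext parser here.
    # This stub just wraps the source so jobs have something to inspect.
    return {"source": wikitext}


def run_jobs(articles, jobs, conn):
    """articles: iterable of (title, wikitext); jobs: dict of name -> callable(dom)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_result "
        "(job_name TEXT, page_title TEXT, result TEXT)"
    )
    for title, wikitext in articles:
        dom = parse_to_dom(wikitext)
        for job_name, job in jobs.items():
            result = job(dom)
            if result:  # only non-empty results get a row
                conn.execute(
                    "INSERT INTO job_result VALUES (?, ?, ?)",
                    (job_name, title, json.dumps(result)),
                )
    conn.commit()
```

A run might then look like `run_jobs(articles, jobs, sqlite3.connect("results.db"))`, where `results.db` is a made-up file name; the webapp would read pending fixes from the same table.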