This is my brainstorming and notes page.

Parser Migration Tool

Project might have three components:

  1. Backend tool: based on the rt server, it grinds through pages looking for “something” and dumps the results into a db.
  2. A nice UI that lets you create new “somethings”, shows the results of the search and the proposed edits, and lets you manually or automatically approve edits or coordinate manual fixes.
  3. Possibly a separate piece to actually apply approved edits, if that's not trivial to integrate into #2.


The broad architecture looks like -

    [Logger] -- emits issues --> [db with list of pages to fix + what to fix] <-- bots fixing issues

Backend Tool

The architecture is going to be based on the rt server. It will grind through pages "looking for something" and dump the results into a database.

We still need to work out what "looking for something" and "creating a new something" will look like. A "something" could be a regexp or a piece of JS code; it could be created directly from the UI, or through Gerrit and a redeploy, etc. But to start with, we will just generate corrections based on what we already know in Parsoid, and we will begin by defining a few "interesting rules" manually. E.g. templates creating misnested HTML.
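As a rough illustration, a "something" defined in JS could be an object pairing a rule id with a function that scans wikitext and emits issue records. This is only a sketch; the names (`unclosedSmallRule`, `find`, the issue fields) are hypothetical, not any existing Parsoid API:

```javascript
// Hypothetical shape for a rule ("something"): an id plus a finder function
// that returns zero or more issue records with source offsets.
const unclosedSmallRule = {
  id: 'unclosed-small-tag',
  find(wikitext) {
    const issues = [];
    const re = /<small\b[^>]*>/g; // matches opening <small> tags only
    let m;
    while ((m = re.exec(wikitext)) !== null) {
      // Naive check: is there any </small> after this opening tag?
      if (wikitext.indexOf('</small>', re.lastIndex) === -1) {
        issues.push({ type: 'unclosed-tag', start: m.index, end: re.lastIndex });
      }
    }
    return issues;
  },
};

console.log(unclosedSmallRule.find('Some <small>text with no closing tag'));
// one issue: { type: 'unclosed-tag', start: 5, end: 12 }
```

A regexp-based rule like this is the simplest case; rules created "through gerrit & redeploy" could instead operate on the parsed DOM.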

We might integrate the tool with the regular parsing pipeline and might use some of the rt server code.

What kind of info do we need to extract?

While parsing a page, we know a bunch of things:

  1. Unclosed tags;
  2. Fosterable content;
  3. Mis-nesting.

And possibly others, which we can make a list of. To actually fix these up, we will need additional information: what kind of problem it is, possibly the offending substring, possibly source offsets, and possibly how it needs to be fixed.
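One possible record shape for such an extracted issue, combining the fields listed above (kind, substring, offsets, suggested fix). The field names here are assumptions for illustration, not an existing schema:

```javascript
// Hypothetical issue record: kind + offending substring + source offsets + fix hint.
function makeIssue(kind, src, start, end, fix) {
  return {
    kind,                             // 'unclosed-tag' | 'fosterable-content' | 'mis-nesting'
    substring: src.slice(start, end), // the offending wikitext fragment
    start,                            // source offset where the problem begins
    end,                              // source offset where it ends
    fix,                              // optional hint on how it could be fixed
  };
}

const src = 'a <b>bold <i>both</b> italic</i> mess';
const issue = makeIssue('mis-nesting', src, 2, 32, 'reorder closing tags');
console.log(issue.substring); // "<b>bold <i>both</b> italic</i>"
```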

Raw Design

This design is based on a MW dump rather than the rt server. It's a rough design, and it doesn't consider broken text from templates/extensions. We will first have to collect stats on how templates are used (balanced vs. unbalanced, primarily) and then fold that into this design.

This is a raw architecture for the backend tool -

  1. For each article in the MW dump, fetch the wikitext source for that revision of the article if needed (ideally it would already be included in the dump);
  2. Parse wikitext to DOM;
  3. For each job, run the <job source code> and collect the results;
  4. If the result is non-empty, create a row in the <job_result> table;
  5. Fill in the issue information in the <job_issue> table.
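The five steps above can be sketched as a single loop. Everything here is a stand-in: `parseToDOM` stubs the real parse, and `makeDb` fakes the database layer with in-memory arrays; the real tool would use the dump reader, Parsoid, and an actual db:

```javascript
// Stub parser and in-memory "tables" (stand-ins for Parsoid and the real db).
function parseToDOM(wikitext) { return { wikitext }; }

function makeDb() {
  const jobResults = [];
  const jobIssues = [];
  return {
    jobResults,
    jobIssues,
    insertJobResult(jobId, title, result) {
      jobResults.push({ id: jobResults.length + 1, jobId, title, result });
      return jobResults.length; // row id of the inserted job_result
    },
    insertJobIssue(resultId, issue) {
      jobIssues.push({ resultId, ...issue });
    },
  };
}

// Steps 1-5 from the list above, as one loop over the dump.
function runJobs(dump, jobs, db) {
  for (const article of dump) {
    const wikitext = article.wikitext;       // step 1 (already in the dump)
    const dom = parseToDOM(wikitext);        // step 2
    for (const job of jobs) {
      const result = job.run(dom, wikitext); // step 3
      if (result.issues.length > 0) {        // step 4
        const resultId = db.insertJobResult(job.id, article.title, result);
        for (const issue of result.issues) { // step 5
          db.insertJobIssue(resultId, issue);
        }
      }
    }
  }
}

// Demo: one job that flags pages containing an unclosed <small>.
const db = makeDb();
const jobs = [{
  id: 1,
  run(dom) {
    const bad = dom.wikitext.includes('<small>') && !dom.wikitext.includes('</small>');
    return { issues: bad ? [{ type: 'unclosed-tag' }] : [] };
  },
}];
runJobs(
  [{ title: 'Page1', wikitext: 'a <small>b' }, { title: 'Page2', wikitext: 'fine' }],
  jobs,
  db
);
console.log(db.jobResults.length, db.jobIssues.length); // 1 1
```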

Data Model

File:Data Model.svg (data model diagram)

Notes -

  • The Job table stores each job performed, along with its submitter, its source code, etc.
  • The Job_Result table stores the results of each job performed. A job may emit multiple issues, and each issue has a unique id (<issue_id>) and a type (<issue_type>).
  • The <result> field in Job_Result is a JSON field with keys such as url, location, and the like. It can have an advice field, but we can also look up a default advice using the job_type field.

Eg - { issues: { unbalancedTags: [ { start: 1022, end: 1040, replacement: '....', type: 'missing start', ... }, ... ] } }

  • Finally, we will have a Job_Issues table that stores information about the different issues emitted.
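Turning a <result> JSON blob of that shape into flat Job_Issues rows might look like the following sketch. The field names follow the example above; the function name and the jobResultId foreign key are assumptions:

```javascript
// Flatten a <result> JSON value ({ issues: { <issueType>: [ ... ] } })
// into one row per issue, tagged with its issue type and parent result id.
function flattenResult(jobResultId, result) {
  const rows = [];
  for (const [issueType, items] of Object.entries(result.issues)) {
    for (const item of items) {
      rows.push({ jobResultId, issueType, ...item });
    }
  }
  return rows;
}

// The example <result> value from above.
const result = {
  issues: {
    unbalancedTags: [
      { start: 1022, end: 1040, replacement: '....', type: 'missing start' },
    ],
  },
};

const rows = flattenResult(7, result);
console.log(rows.length, rows[0].issueType); // 1 unbalancedTags
```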