Extension:RPED/Development notes

Whenever a page containing wikilinks is displayed, the RemotePageExistenceDetection (RPED) extension will refer to a local list of the names of all pages that exist on a specific remote wiki (e.g. Wikipedia) and:
 * Turn links red if the page exists on neither the local nor the remote wiki;
 * Turn links blue if the page exists on the local wiki but not the remote wiki; and
 * Turn links green if the page exists on the remote wiki.

If you click on a red link, it prompts you to create a new page on the local wiki. If you click on a blue link, it takes you to the page on the local wiki. If you click on a green link, it takes you to the page on the remote wiki. By default, the remote wiki article takes precedence if a page exists on both the local wiki and the remote wiki, so the syntax will need to allow for an override of that behavior. I was thinking that a "local:" prefix or a tag or something might suffice.
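The color rules above can be sketched as a small function. This is only an illustration, not the extension's actual code: the name classify_link, the force_local flag (standing in for whatever override syntax is eventually chosen), and the use of in-memory sets are all assumptions.

```python
def classify_link(title, local_titles, remote_titles, force_local=False):
    """Return 'green', 'blue', or 'red' for a wikilink target.

    local_titles and remote_titles are sets of page names (hypothetical
    stand-ins for the local page table and the remote page list).
    force_local models the proposed override, e.g. a "local:" prefix.
    """
    # The remote article takes precedence unless the link is overridden.
    if title in remote_titles and not force_local:
        return "green"
    # Otherwise fall back to the normal local behavior.
    if title in local_titles:
        return "blue"
    return "red"
```

With the override set, a page that exists on both wikis is treated as a plain local link. A page that exists only remotely would then come out red, which may or may not be the behavior we want.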

For accessibility reasons, the extension will also allow different fonts, rather than different colors, to be displayed, depending on how the extension is configured and, perhaps, user preferences.

Rationale
The reason for this change is to help sites like Libertarian Wiki, which have a lot of articles (e.g. anarcho-capitalism) that really don't need to exist as (usually outdated) content forks on the local wiki. It would be better if they just linked to the pertinent article on Wikipedia. But the problem with interwiki links as we know them today is that they have no existence detection. This proposal seeks to solve that.

Parsing
The parsing details are still being sorted out. One option might be to parse the page for wikilinks (e.g. [[libertarianism]], which normally renders as libertarianism), test each wikilink for page existence on the remote wiki, and, if the page exists there, convert the wikilink to an external link pointing at the remote page (which will still render as libertarianism). If the page does not exist there, the wikilink will remain untouched (and thus, after going through the rest of the parser, will turn blue if the page exists on the local wiki, and red if it does not). Once the extension's parsing pass is complete, the revised content will be passed to the rest of MediaWiki's parser. It will be necessary to identify the proper hook for doing this.
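A rough sketch of that rewriting pass, in Python rather than the PHP a real extension would use. The regular expression, the REMOTE_BASE URL, and the exists_remotely callback are assumptions made for illustration. The sketch ignores title normalization (first-letter case, namespaces) and anything inside nowiki or similar tags.

```python
import re
from urllib.parse import quote

# Matches [[Target]] and [[Target|label]]; deliberately simplistic.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# Assumed base URL for article pages on the remote wiki.
REMOTE_BASE = "http://en.wikipedia.org/wiki/"


def rewrite_links(wikitext, exists_remotely):
    """Convert wikilinks to external links when the remote page exists.

    exists_remotely is a hypothetical callback taking a title and
    returning True or False; other wikilinks are left untouched so the
    normal parser can color them blue or red.
    """
    def replace(match):
        target = match.group(1).strip()
        label = match.group(2) or target
        if not exists_remotely(target):
            return match.group(0)
        url = REMOTE_BASE + quote(target.replace(" ", "_"))
        return "[" + url + " " + label + "]"

    return WIKILINK.sub(replace, wikitext)
```

The output is still wikitext, so it can be handed on to the rest of the parser unchanged.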

An API query to Wikipedia every pageload
If there will be only a small number of pageloads on the local wiki, this method would result in less server load on Wikipedia, since it obviates the need to constantly query Wikipedia for information with which to update a list of pages that exist on Wikipedia (see next section). Determinations of which pages exist on Wikipedia could be made through remote API queries such as this:

http://en.wikipedia.org/w/api.php?action=query&titles=1911%20Encyclopaedia%20Britannica|1911%20Encyclopaedia%20Britannica|Adam%20Smith|Advocates%20for%20Self%20Government|Age%20of%20Enlightenment|Aggression|Agorism|Agorist|Albert%20Jay%20Nock|Alfred%20Jules%20%C9mile%20Fouill%E9e|Alliance%20of%20the%20Libertarian%20Left|Anarchism|Anarchist%20communism|Anarchist%20communist|Anarchist%20economics|Anarchist%20law|Anarchist%20school%20of%20thought|Anarcho-capitalism|Anarcho-capitalism%20and%20minarchism|Anarcho-capitalist|Anarcho-capitalists|Anarcho-communism|Anarcho-syndicalism|Anarchy,%20State,%20and%20Utopia|Anthony%20de%20Jasay|Anti-communism|Anti-globalisation|Anti-state|Anti-statism|Anti-war|Anticapitalist|Arizona|Atlas%20Shrugged|Auguste%20Comte|Augustine%20of%20Hippo|Australian%20Capital%20Territory%20general%20election,%202001|Australian%20Capital%20Territory%20general%20election,%202004|Australian%20Electoral%20Commission|Australian%20federal%20election,%202007|Austria|Austrian%20School|Austrian%20School%20of%20economics|Autarchism|Authoritarianism|Autonomist|Ayn%20Rand|Baird%20Callicott|Barry%20Goldwater|Bellum%20omnium%20contra%20omnes|Bernard%20Bosanquet%20%28philosopher%29

Some pageloads might require several such queries, since the API accepts at most 50 titles per request (500 with bot access).
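Splitting the titles into batches and reading the answers might look roughly like this. The function names are made up for illustration, and the sketch assumes the API's default JSON layout, in which nonexistent pages carry a "missing" key. It only builds URLs and interprets an already-decoded response; the HTTP request itself is left out.

```python
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"
MAX_TITLES = 50  # per-request title limit for ordinary clients


def build_query_urls(titles):
    """Return one action=query URL per batch of up to MAX_TITLES titles."""
    unique = sorted(set(titles))  # no point asking about a title twice
    urls = []
    for start in range(0, len(unique), MAX_TITLES):
        batch = unique[start:start + MAX_TITLES]
        params = {
            "action": "query",
            "format": "json",
            "titles": "|".join(batch),
        }
        urls.append(API_URL + "?" + urlencode(params))
    return urls


def existing_titles(response):
    """Given a decoded JSON response, return the titles that exist."""
    pages = response.get("query", {}).get("pages", {})
    return {
        page["title"]
        for page in pages.values()
        if "missing" not in page and "invalid" not in page
    }
```

Note that the API reports titles in normalized form, so results may need to be mapped back to the spelling used in the wikilink.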

Local list
If a high volume of pageloads to the local wiki is anticipated, then it would be more efficient to create and maintain an SQL table containing the names of all the pages that currently exist on Wikipedia. This results in a fixed number of queries to Wikipedia, rather than a number that varies depending on how many pageloads occur on the local wiki. This is more in accordance with the spirit of Wikimedia's live mirrors policy.
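As a minimal sketch of such a table, here is one possible shape using Python's built-in sqlite3 module as a stand-in for the wiki's real database. The table name rped_page and the helper names are hypothetical.

```python
import sqlite3


def open_title_store(path=":memory:"):
    """Open (or create) the store of remote page names."""
    conn = sqlite3.connect(path)
    # The primary key gives an index, so existence checks stay cheap.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rped_page (title TEXT PRIMARY KEY)"
    )
    return conn


def add_titles(conn, titles):
    """Insert titles, silently skipping any that are already present."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO rped_page (title) VALUES (?)",
            ((title,) for title in titles),
        )


def exists_remotely(conn, title):
    """Return True if the title is in the local copy of the remote list."""
    row = conn.execute(
        "SELECT 1 FROM rped_page WHERE title = ?", (title,)
    ).fetchone()
    return row is not None
```

Each pageload then costs only local lookups, one per wikilink, and no request to Wikipedia at all.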

Centralized list
If there will be many high-volume wikis using this extension, a centralized list (stored as a MySQL database) should be generated and kept current on one website (with mirrors in case it fails). All of the other wikis should obtain their initial page list and differential updates from that website, rather than from Wikipedia, so that Wikipedia doesn't have to be the object of duplicative querying. The centralized list of pages that exist on Wikipedia will be generated by a script collecting all the page names from AllPages, 500 pages at a time (or 5,000 at a time, if bot access is obtained), using the API; or, possibly, AutoWikiBrowser could generate a list of all pages with the database scanner. Another possibility is using all-titles-in-ns0.gz to get an (outdated) list of all pages; using logs of all page creations and deletions in the past couple of weeks, this could be turned into an up-to-date list.
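The AllPages collection loop could be sketched like this. It assumes the API's list=allpages module and its current "continue" style of continuation. The fetch argument is a hypothetical callable that sends the given parameters to the API and returns the decoded JSON, kept separate so that the paging logic can be tried without touching Wikipedia.

```python
def collect_all_titles(fetch, limit=500):
    """Collect every main-namespace title by paging through allpages.

    fetch(params) is an assumed helper that performs the API request
    and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": 0,
        "aplimit": limit,  # 500 normally, 5000 with bot access
    }
    continuation = {}
    titles = []
    while True:
        data = fetch({**params, **continuation})
        titles.extend(page["title"] for page in data["query"]["allpages"])
        if "continue" not in data:
            return titles  # no more batches
        # Send the continuation values back with the next request.
        continuation = data["continue"]
```

A real script would also want to pause between requests and write titles out as it goes rather than holding millions of them in memory.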

Once created, the list will be updated every 60 seconds, or more frequently, using NewPages and the deletion log.
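Applying such updates amounts to replaying creation and deletion events in time order. A minimal sketch, assuming each event has already been reduced to a (timestamp, action, title) tuple with a sortable timestamp string; that tuple format and the function name are inventions for illustration. Page moves are not handled.

```python
def apply_changes(titles, events):
    """Update a set of page names from creation and deletion events.

    events is an iterable of (timestamp, action, title) tuples, where
    action is 'create' or 'delete' (a hypothetical, simplified format).
    """
    # Replay in chronological order so that a page created and then
    # deleted (or deleted and then recreated) ends in the right state.
    for _timestamp, action, title in sorted(events):
        if action == "create":
            titles.add(title)
        elif action == "delete":
            titles.discard(title)
    return titles
```

The same replay would also bring an outdated all-titles dump up to date, provided the logs reach back at least as far as the dump.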