User:Peter17/Reasonably efficient interwiki transclusion

This is a draft for my GSoC-2010 project, reasonably efficient interwiki transclusion, written after discussing with my mentor User:Catrope and updated several times during the project.

Initial state
Currently, some existing functions allow interwiki transclusion of distant templates.

The interwiki table contains the known interwiki prefixes. For each of them, the value iw_trans can be set to 1 to allow (or 0 to disallow) transclusion from that wiki.

If the relevant configuration setting is set to true, when a transclusion call refers to a page of another wiki and transclusion from that wiki is allowed by iw_trans, then the software:
 * checks whether the template was cached less than 1h ago (by default)
 * if so, uses the cached template
 * if not, makes a GET request to retrieve the content from the distant wiki
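
The steps listed above can be sketched as follows. This is only an illustration of the time-based logic, not the MediaWiki code: the class name, the `fetch` callable (which would perform the GET request) and the injectable `clock` are assumptions made for the example.

```python
import time

CACHE_EXPIRY_SECONDS = 3600  # the 1h default described above


class TransclusionCache:
    """In-memory stand-in for the cache of distant templates (hypothetical)."""

    def __init__(self, fetch, expiry=CACHE_EXPIRY_SECONDS, clock=time.time):
        self._fetch = fetch    # callable(url) -> text; assumed to perform the GET
        self._expiry = expiry
        self._clock = clock    # injectable so the expiry logic can be exercised
        self._entries = {}     # url -> (time stored, text)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._expiry:
            return entry[1]        # cached less than `expiry` seconds ago
        text = self._fetch(url)    # missing or too old: ask the distant wiki
        self._entries[url] = (now, text)
        return text
```

Nothing in this logic depends on whether the distant template actually changed, which is the weakness discussed next.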

The content can be retrieved (and cached) in two different formats.

The flaw of this system is that the data is cached for an arbitrary time, which means:
 * When a template is almost never modified, the cache is still refreshed although this is useless, so we lose some performance.
 * When a template is actually modified, in the worst case, the cache will only be updated 1h later, and the users of the local wiki will not see the changes made to the template during that time.

So, the cache should be updated if and only if necessary.

I made some tests on May 10th.

Good points

 * It works: I can transclude my userpage from this wiki (mediawiki.org) to a wiki hosted on my computer using the interwiki transclusion syntax!
 * The links become full links: a link to the subpage /Reasonably efficient interwiki transclusion becomes http://www.mediawiki.org/wiki/User:Peter17/Reasonably_efficient_interwiki_transclusion
 * If the wanted page calls some templates or parser functions, then, they are used to render the content, which is good!

Issues

 * When transcluding, what is transcluded is just the content of the page templateName of the distant wiki, which means:
   * The parameters are totally ignored.
   * The inclusion-control instructions behave as if the page were not being transcluded, which is the opposite of the expected behavior.
 * The transcluded content really is cached for 1h, which means that even purging the cache will not update it.
 * The fact that links point to the distant pages might not be the expected behavior.
 * All the parsing is done by the distant wiki, which is expensive for it. If foreign wikis want our templates, they could at least parse them themselves.

Proposed approach
After a discussion on wikitech-l, some people, notably Chad and Aryeh Gregor, suggested using an approach similar to that of FileRepo (see Manual:$wgForeignFileRepos). FileRepo is a class meant to allow the inclusion of distant files. It uses different backends in different cases, described below (see "Done work").

Questions and remarks

 * Is it possible to rely on the templatelinks table to obtain the list of all templates called by a page? If A calls B and B calls C, then: if B is modified and calls D instead of C, will this be taken into account immediately in the list of A's template links?
   * It will be taken into account, yes, although possibly not immediately (deferred through job queue). --Catrope
 * The requested wikitext might itself call distant templates, which might themselves call other distant templates. Some infinite loops might appear. This could be resolved by checking whether we have already retrieved a template before requesting it.
   * This does not seem to be a problem with the current approach:
     * API-retrieved templates are cached if not already done
     * DB-retrieved templates work the same as local templates
 * It should be possible to transclude only a section of an article, as Extension:Labeled Section Transclusion does. When using the API, there is a way to do this, using API:Parsing wikitext and setting the corresponding argument.
 * When a template is used by several distant wikis, it would be great to display an alert encouraging the administrators to protect this template, so that it is not modified too often
 * The distant messages are not retrieved
 * The retrieved templates look ugly when they use a specific style from Common.css
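
The loop concern raised in the list above (checking whether a template has already been retrieved before requesting it) can be illustrated with a short sketch. The function name and the two callables `fetch_wikitext` and `list_subtemplates` are hypothetical placeholders for whatever actually retrieves a page and lists the templates it calls.

```python
def fetch_with_subtemplates(title, fetch_wikitext, list_subtemplates):
    """Retrieve `title` and everything it transcludes, each page at most once."""
    retrieved = {}      # title -> wikitext; doubles as the "already seen" set
    pending = [title]
    while pending:
        current = pending.pop()
        if current in retrieved:
            continue    # already retrieved: this is what breaks A -> B -> A loops
        text = fetch_wikitext(current)
        retrieved[current] = text
        pending.extend(list_subtemplates(text))
    return retrieved
```

Because a title is never requested twice, the traversal terminates even when two distant templates call each other.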

Create a globaltemplatelinks table
Inside a wiki farm, it's quite easy to automagically purge the pages that use a distant template when this template is edited. We want to:
 * track the use of each template (know which distant pages are using it and update when pages are edited, deleted or moved)
 * when a template (or a subtemplate of this template!) is edited or deleted, invalidate the cache of those pages by updating page_touched in the page table

The approach proposed below needs to be discussed.

Inside a wiki farm, the transclusion links between the local pages and the distant pages would be stored in a shared DB, so that the distant wiki always knows which other wikis of the farm are using its templates and each page of each wiki knows which distant templates it uses.

The "calling" wikis could write to a globaltemplatelinks table on the shared database to store their usage of the templates. When a distant template is edited, the distant wiki will look in the globaltemplatelinks table to see who is transcluding the template. Then, it will access the DB of each calling wiki and invalidate the cache of the pages concerned.

When a calling page is edited/moved/deleted, it will update the globaltemplatelinks table to reflect the change.

The disadvantage of this approach is that each wiki must know:
 * its own wiki ID (used by the distant wikis to access its DB)
 * the wiki ID of all the wikis allowed to transclude its templates (in order to access their DBs for the cache invalidation job)

In the case of WMF's wikis, this could be solved by using a nice interwiki prefixes system, with a unique prefix for each wiki (enwikisource instead of en:s or s:en, frwikipedia instead of fr:w or w:fr...), at least for interwiki transclusion.


 * Proposed structure for the globaltemplatelinks table

As in the templatelinks table, from designates the page that calls the link and to the page pointed to by the link.

For the calling page, we need to store its wiki ID, its page ID (for cache invalidation) and its full title (namespace text and title, for display). For the pointed page, we also need to store its wiki ID.

 * Proposed schema

-- Table tracking interwiki transclusions in the spirit of templatelinks.
-- This table tracks transclusions of this wiki's templates on another wiki
-- The gtl_from_* fields describe the (remote) page the template is transcluded from
-- The gtl_to_* fields describe the (local) template being transcluded
CREATE TABLE /*_*/globaltemplatelinks (
  -- The wiki ID of the remote wiki
  gtl_from_wiki varchar(64) NOT NULL,

  -- The page ID of the calling page on the remote wiki
  gtl_from_page int unsigned NOT NULL,

  -- The namespace ID of the calling page on the remote wiki
  gtl_from_namespace_id int NOT NULL,

  -- The namespace name of the calling page on the remote wiki
  -- Needed for display purposes, since the foreign namespace ID doesn't necessarily match a local one
  gtl_from_namespace varchar(255) NOT NULL,

  -- The title of the calling page on the remote wiki
  -- Needed for display purposes
  gtl_from_title varchar(255) binary NOT NULL,

  -- The wiki ID of the wiki that hosts the transcluded page
  gtl_to_wiki varchar(64) NOT NULL,

  -- The namespace of the transcluded page on that wiki
  gtl_to_namespace int NOT NULL,

  -- The title of the transcluded page on that wiki
  gtl_to_title varchar(255) binary NOT NULL
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/gtl_to_from ON /*_*/globaltemplatelinks (gtl_to_wiki, gtl_to_namespace, gtl_to_title, gtl_from_wiki, gtl_from_page);
CREATE UNIQUE INDEX /*i*/gtl_from_to ON /*_*/globaltemplatelinks (gtl_from_wiki, gtl_from_page, gtl_to_wiki, gtl_to_namespace, gtl_to_title);
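
To make the intended use of this table more concrete, here is a small sketch that runs a simplified version of the schema in an in-memory SQLite database and looks up the pages whose cache would need to be invalidated when a template is edited. The column types are adapted to SQLite, the MediaWiki prefix and table-option comments are dropped, and the wiki IDs, page IDs and titles are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE globaltemplatelinks (
        gtl_from_wiki         TEXT    NOT NULL,
        gtl_from_page         INTEGER NOT NULL,
        gtl_from_namespace_id INTEGER NOT NULL,
        gtl_from_namespace    TEXT    NOT NULL,
        gtl_from_title        TEXT    NOT NULL,
        gtl_to_wiki           TEXT    NOT NULL,
        gtl_to_namespace      INTEGER NOT NULL,
        gtl_to_title          TEXT    NOT NULL
    )
""")
conn.execute("""
    CREATE UNIQUE INDEX gtl_to_from ON globaltemplatelinks
        (gtl_to_wiki, gtl_to_namespace, gtl_to_title, gtl_from_wiki, gtl_from_page)
""")

# Made-up rows: two pages transclude enwiki's Template:Infobox (namespace 10).
rows = [
    ("frwiki", 12, 0, "", "Paris", "enwiki", 10, "Infobox"),
    ("dewiki", 34, 2, "Benutzer", "Example", "enwiki", 10, "Infobox"),
    ("frwiki", 56, 0, "", "Lyon", "enwiki", 10, "Other"),
]
conn.executemany(
    "INSERT INTO globaltemplatelinks VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
)


def pages_to_invalidate(conn, to_wiki, to_namespace, to_title):
    """Return (wiki ID, page ID) of every page transcluding the given template."""
    cursor = conn.execute(
        "SELECT gtl_from_wiki, gtl_from_page FROM globaltemplatelinks"
        " WHERE gtl_to_wiki = ? AND gtl_to_namespace = ? AND gtl_to_title = ?"
        " ORDER BY gtl_from_wiki, gtl_from_page",
        (to_wiki, to_namespace, to_title),
    )
    return cursor.fetchall()


targets = pages_to_invalidate(conn, "enwiki", 10, "Infobox")
```

The lookup filters on the leading columns of gtl_to_from, which is the kind of query that index is meant to serve. The wiki would then update page_touched for each returned page in the database of the corresponding calling wiki.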

Add fields to the interwiki table
The former structure of the interwiki table was this one:

+-----------+------------+------+-----+---------+-------+
| Field     | Type       | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+-------+
| iw_prefix | char(32)   | NO   | PRI |         |       |
| iw_url    | blob       | NO   |     |         |       |
| iw_local  | bool       | NO   |     |         |       |
| iw_trans  | tinyint(1) | NO   |     | 0       |       |
+-----------+------------+------+-----+---------+-------+

Here is the new structure I proposed for the interwiki table:

+-----------+------------+------+-----+---------+-------+
| Field     | Type       | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+-------+
| iw_prefix | char(32)   | NO   | PRI |         |       |
| iw_url    | blob       | NO   |     |         |       |
| iw_api    | blob       | NO   |     |         |       |
| iw_wikiid | char(64)   | NO   |     |         |       |
| iw_local  | bool       | NO   |     |         |       |
| iw_trans  | tinyint(1) | NO   |     | 0       |       |
+-----------+------------+------+-----+---------+-------+

So, my changes consisted of adding two optional fields:
 * iw_api: the URL of the API of that wiki
 * iw_wikiid: the ID of that wiki (used in wfGetDB( DB_SLAVE, array(), $wikiID );)

 * Explanations:

Currently, iw_trans allows the administrator to decide whether the templates from a particular wiki can be transcluded in the current wiki:
 * 0 will forbid this
 * 1 will allow this

With this structure, the software can allow transclusions in two different ways (using the API or using direct DB access). When iw_trans is set to 1, the presence of iw_wikiid will indicate whether to use the DB access (iw_wikiid set) or the API (iw_wikiid not set).
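
As an illustration, the choice between the two ways could be expressed like this. The function name and the returned strings are hypothetical, and returning nothing when neither iw_wikiid nor iw_api is set is an assumption of this sketch, since that case is not spelled out above.

```python
def transclusion_backend(iw_trans, iw_wikiid, iw_api):
    """Pick how to fetch a distant template for one row of the interwiki table."""
    if iw_trans != 1:
        return None     # transclusion from that wiki is forbidden
    if iw_wikiid:
        return "db"     # wiki ID known: read the foreign DB directly
    if iw_api:
        return "api"    # no wiki ID, but an API address: use the API
    return None         # assumption: nothing usable is configured
```

For example, a row with iw_trans = 1, an empty iw_wikiid and an API URL would go through the API, while the same row with a wiki ID would use direct DB access.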

Retrieve and cache the distant templates through the API
As explained before, the address of the API of a foreign wiki can be stored in the iw_api field of the interwiki table.

For a given interwiki prefix, if no iw_wikiid is given and an API address is defined, a transclusion call will retrieve the wanted wikitext through an API call, cache it and return it to the parser. The key is

Moreover, if the wanted page is returned, the software will retrieve the list of all templates called by this wikitext (subtemplates). It will then determine which of them are not in the cache and cache them, making API requests in groups of at most 50 templates.

This way, on the next loop of the parser, the subtemplates will be found in the cache and will not be requested.
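
The grouping of subtemplates described above can be sketched as follows. The function names are hypothetical, the cache is modeled as a plain dictionary, and `fetch_batch` stands for one API request returning the wikitext of up to 50 titles.

```python
def batches(titles, size=50):
    """Yield successive groups of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def cache_missing_subtemplates(subtemplates, cache, fetch_batch):
    """Cache the subtemplates that are not cached yet; return the request count."""
    # Drop duplicates (keeping order) and anything already in the cache.
    missing = [t for t in dict.fromkeys(subtemplates) if t not in cache]
    requests = 0
    for group in batches(missing):
        cache.update(fetch_batch(group))    # one API request per group
        requests += 1
    return requests
```

With 110 uncached subtemplates, this makes three requests (50, 50 and 10 titles) instead of 110 individual ones.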

Retrieve and cache the distant templates through DB access
This is the case of the wiki farms. The interwiki table contains the iw_wikiid of the foreign wiki.

In this case, the most efficient solution is to read the wanted wikitext directly from the database of the foreign wiki. The software can access the DB inside the wiki farm with: $dbr = wfGetDB( DB_SLAVE, array(), $wikiID );

When a distant template is called, the software retrieves the corresponding wikitext.

Accessing the distant DB is just as expensive as accessing the local one, so no caching is needed. Only a globaltemplatelinks table is required, to invalidate the cache of the pages that call a template when this template is edited.

Final behavior
No parsing is done by the foreign wiki; all of it is done by the local wiki.

This way, the behavior of the interwiki transclusion is quite simple:
 * everything is parsed locally, exactly like the local templates
 * except that the templates and their subtemplates are the distant ones
 * the parameters of those templates are interpreted as on the local wiki, which means that the local templates will be used if they are present in the arguments of the distant templates
 * the links are interpreted as local links, pointing to local pages

 * Example

On the local wiki, Template:Foo contains "Hello world!".

On the foreign wiki, Template:Bar contains "{{{1}}} Hi!"

Then, an article of the local wiki calling the foreign Template:Bar with the local Template:Foo as its argument will produce "Hello world! Hi!"