User:Peter17/Reasonably efficient interwiki transclusion

This is a draft for my GSoC-2010 project, reasonably efficient interwiki transclusion, written after discussing with my mentor User:Catrope.

Conventionally, we will use the expressions:
 * home wiki for the wiki which hosts a template
 * distant wiki for another wiki that wants to use that template
 * wanted template for the wiki page hosted on the home wiki and called by a page of the distant wiki. It can be any public page of the home wiki.

Of course, there might be a lot of pages and a lot of distant wikis that request the wanted template.

Current state
Currently, two functions (interwikiTransclude() and fetchScaryTemplateMaybeFromCache() in Parser.php) allow interwiki transclusion from a home wiki to a distant wiki.

The interwiki table contains the known interwiki prefixes. For each of them, the value iw_trans can be set to 1 to allow (or 0 to disallow) transclusion from that wiki.

If $wgEnableScaryTranscluding is set to true, when a transclusion call refers to an article of another wiki and transclusion from that wiki is allowed by iw_trans, then MediaWiki:
 * checks whether the template has been cached less than 1h (by default) ago
 * if yes, the cached template is used
 * if not, a GET request is made to retrieve the content from the home wiki

There are two possible formats to retrieve (and cache) the content: raw wikitext (action=raw) and rendered HTML (action=render).
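For reference, enabling the current mechanism on a distant wiki roughly amounts to the following configuration (a minimal sketch of my understanding; 3600 is the 1h default mentioned above, and the corresponding interwiki row must also have iw_trans set to 1):

 // LocalSettings.php of the distant wiki
 $wgEnableScaryTranscluding = true;  // allow interwiki transclusion
 $wgTranscludeCacheExpiry   = 3600;  // keep fetched content for 1h (default)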

The drawback of this system is that the data is cached for a fixed, arbitrary time, which means:
 * When a template is almost never modified, the cache is still refreshed even though this is useless, so we lose some performance.
 * When a template is actually modified, in the worst case the cache will only be updated 1h later, and the users of the distant wiki will not see the changes made to the template during that time.

So, the cache should be updated if and only if necessary.

I made some tests on May 10th.

Good points

 * It is working: I can transclude my user page from this wiki (mediawiki.org) to a distant wiki hosted on my computer, using the usual template syntax with an interwiki prefix!
 * Links in the transcluded page become full links: /Reasonably efficient interwiki transclusion becomes http://www.mediawiki.org/wiki/User:Peter17/Reasonably_efficient_interwiki_transclusion
 * If the wanted page calls some templates or parser functions, then, they are used to render the content, which is good!

Current issues

 * When transcluding, what is transcluded is just the content of the page templateName of the home wiki, which means:
 * the parameters are totally ignored
 * the <noinclude> and <includeonly> instructions behave as if the page were not being transcluded, which is the opposite of the expected behavior...
 * The transcluded content is actually cached for 1h, which means even purging the cache will not update it...

First version
Here, we assume that each distant wiki caches the templates it transcludes from the home wiki.

On the distant wiki, when a page that calls a template is rendered:
 * First, the distant wiki makes a request to the API of the home wiki to get the last modification timestamp of the wanted template.
 * Then it looks at the local cache to compare that timestamp with the timestamp of the cached template.
 * Then, the cached template is updated only if necessary and used to render the page.

The infrastructure for this request already exists in the API.
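As an illustration, the comparison could look like the following sketch (plain PHP, not actual MediaWiki code; the API request shown, action=query&prop=revisions&rvprop=timestamp, already exists, but the helper functions and the way the cached timestamp is stored are hypothetical):

 // Sketch: ask the home wiki's API for the last modification timestamp of the
 // wanted template and compare it with the timestamp of the cached copy.
 function remoteTemplateTimestamp( $apiUrl, $title ) {
     $url = $apiUrl . '?action=query&prop=revisions&rvprop=timestamp'
          . '&titles=' . urlencode( $title ) . '&format=json';
     $data = json_decode( file_get_contents( $url ), true );
     $page = array_shift( $data['query']['pages'] );
     return $page['revisions'][0]['timestamp']; // e.g. "2010-05-10T12:34:56Z"
 }

 function cacheNeedsUpdate( $apiUrl, $title, $cachedTimestamp ) {
     // ISO 8601 timestamps compare correctly as strings
     return remoteTemplateTimestamp( $apiUrl, $title ) > $cachedTimestamp;
 }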

Second version
Since the template might be requested by a lot of distant wikis, we think it would be more appropriate to cache the data on the home wiki instead of doing so on each distant wiki.

Actually, if, as in the previous version, the cache is updated only when the page is rendered, it seems useless to cache the template on the distant wiki(s), because the page(s) that call the template are themselves cached.

Consequence
In both the first and second versions above, the page and template are updated only when the page is rendered. As a consequence, if the template has been modified in the meantime, purging the page's cache is necessary to properly see the transcluded template.

Third version
Instead of checking whether a template has changed when parsing the page, we could use a more efficient solution, taking inspiration from Extension:GlobalUsage. In this solution, we would use a shared database containing a dedicated table. Every home and distant wiki could read from and write to this table:
 * When a distant wiki parses a page, all its remote template calls will be listed in this table, with:
 * the address of the page that calls the template (maybe interwiki prefix + page name?)
 * the address of the called template (maybe interwiki prefix + template name?)
 * When a remote template is modified (or deleted), the home wiki will purge the cache of all the pages that link to it (listed in the table) through API:Purge, as sketched below.

This way, distant pages are always up-to-date. However, a question remains about wikis that don't have access to the shared database.
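For instance, the purge step could look like the following sketch (plain PHP with cURL; API:Purge is real, but the layout of the shared table and the helper function are made up here):

 // Sketch: after a template is edited on the home wiki, purge every distant
 // page listed in the shared table. $rows would come from that table, e.g.
 // array( array( 'apiUrl' => 'http://example.org/w/api.php', 'page' => 'Foo' ), ... )
 function purgeDistantPages( array $rows ) {
     foreach ( $rows as $row ) {
         $ch = curl_init( $row['apiUrl'] );
         curl_setopt( $ch, CURLOPT_POST, true );
         curl_setopt( $ch, CURLOPT_POSTFIELDS, http_build_query(
             array( 'action' => 'purge', 'titles' => $row['page'] ) ) );
         curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
         curl_exec( $ch );
         curl_close( $ch );
     }
 }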

Issues with this solution
Let's divide the problem into two cases:

 * Transclusions inside a wiki farm, like WMF's wikis

Making API queries for transclusions inside a farm seems too expensive, because it requires HTTP GET requests; if the WMF projects use a lot of interwiki transclusions, this will induce an extra load on the servers, which should be avoided.

 * Transclusions called by an external wiki

API queries seem to be a good solution for this case, but only if they don't ask for very complex tasks that would require a lot of parsing. The proposed solution is to preprocess the wikitext, not to parse it completely; that is less expensive. The final parsing should be done on the distant wiki, so that it does not consume too much of WMF's CPU time.

Fourth version (currently preferred)
After a discussion on wikitech-l, some people, notably Chad and Aryeh Gregor, have suggested using an approach similar to the one used by FileRepo (see Manual:$wgForeignFileRepos). FileRepo is a class meant to allow the inclusion of distant files; it uses different backends in different cases.

First, repositories should be defined. For now, we imagine two kinds of repositories: one to transclude content inside a wiki farm, directly accessing the DB of the home wiki, and one to allow transclusion calls from external websites (Wikia or any other wiki), through the API. The behavior of the system would be slightly different in each case.

Transclusions inside a wiki farm, like WMF's wikis
In this case, the most efficient solution would be to access the wanted wikitext directly by reading the home wiki's database. This is how ForeignDBRepo works. The only needed parameter is the DB name (we can then access the DB inside the wiki farm with: $dbw = wfGetLB( 'dbname' )->getConnection( DB_MASTER, array(), 'dbname' );). When transcluding from several home wikis, this parameter should be given for each of them. Thus, it might be a good idea to store all this information in the interwiki table.

 * Retrieving and parsing the content

When a distant template is called, the system should retrieve the corresponding wikitext plus all the templates that are needed to render it.

All the retrieved content (wikitext and other needed templates) would be passed to the parser of the distant wiki. No parsing would be done by the home wiki.
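A rough sketch of such a retrieval, assuming MediaWiki's classic page/revision/text schema (the function name is hypothetical; wfGetLB(), the tables and Revision::getRevisionText() are existing MediaWiki pieces):

 // Sketch (runs inside MediaWiki on the distant wiki): read the latest
 // wikitext of a template straight from the home wiki's database, the way
 // ForeignDBRepo does for files.
 function fetchForeignTemplateText( $dbName, $templateName ) {
     $dbr = wfGetLB( $dbName )->getConnection( DB_SLAVE, array(), $dbName );
     $row = $dbr->selectRow(
         array( 'page', 'revision', 'text' ),
         array( 'old_text', 'old_flags' ),
         array(
             'page_namespace' => NS_TEMPLATE,
             'page_title'     => str_replace( ' ', '_', $templateName ),
             'rev_id = page_latest',
             'old_id = rev_text_id',
         ),
         __METHOD__
     );
     return $row ? Revision::getRevisionText( $row ) : false;
 }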

 * Caching

In this case, there would be a table in the home wiki's database to store the interwiki transclusion links. This would be used to invalidate the cache of the pages that include a template when this template is modified. API calls seem too expensive, so the home wiki should have the right to access the distant wikis' DBs... It seems useless to cache the wikitext of the transcluded templates in the local DB of the distant wikis, since that is not less expensive than accessing the home wiki's DB.

Transclusions called by an external wiki
To allow external wikis to transclude content from WMF wikis, we could create a second backend that would use the API to provide them with the requested content. As a distant wiki might want to transclude content from a lot of different home wikis, it might be a good idea to store all the API addresses in the interwiki table.

 * Retrieving the content

A transclusion call would retrieve the wanted wikitext plus all the content needed to render that wikitext (all templates called by this wikitext). The API might have to be modified to allow such a request (get the content plus all the templates it uses). Currently, the API can provide the list of the templates used in a page (with action=parse&prop=templates).

All the retrieved content (wikitext and other needed templates) would be passed to the parser of the distant wiki. No parsing would be done by the home wiki.
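As an illustration, the retrieval could be done with two standard API queries (a sketch in plain PHP; the query parameters shown, including generator=templates, already exist in the API, but the helper function is hypothetical):

 // Sketch: retrieve the wikitext of a remote page together with the wikitext
 // of every template it uses. generator=templates only returns the templates,
 // so a separate request fetches the page itself.
 function fetchRemotePageAndTemplates( $apiUrl, $title ) {
     $base = $apiUrl . '?format=json&action=query&prop=revisions&rvprop=content';
     $page = json_decode( file_get_contents(
         $base . '&titles=' . urlencode( $title ) ), true );
     $templates = json_decode( file_get_contents(
         $base . '&generator=templates&titles=' . urlencode( $title ) ), true );
     return array( 'page' => $page, 'templates' => $templates );
 }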

 * Caching

In this case, the external wiki could cache the content for an arbitrary time, as is currently done for images by ForeignAPIRepo. The external wiki could store in its local DB:
 * the complete name of the template (interwiki prefix + template name)
 * the timestamp at which it was retrieved
 * the wikitext of the template

When retrieving a page and its used templates, all the templates should be stored, and the correct interwiki prefix added before their names.

Questions and remarks

 * Is it possible to rely on the templatelinks table to obtain the list of all templates called by a page? If A calls B and B calls C, then: if B is modified and now calls D instead of C, will this be taken into account immediately in the list of A's template links?
 * It will be taken into account, yes, although possibly not immediately (deferred through job queue). --Catrope
 * The requested wikitext might itself call distant templates, which might themselves call other distant templates, so infinite loops might appear. This could be avoided by checking whether we have already retrieved a template before requesting it (see the sketch below).
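A minimal sketch of that check (plain PHP; both helper functions called here, fetchRemoteTemplateText() and listTemplatesUsedBy(), are hypothetical):

 // Sketch: remember which templates have already been fetched so that
 // circular dependencies cannot cause infinite recursion.
 function fetchWithDependencies( $title, array &$seen = array() ) {
     if ( isset( $seen[$title] ) ) {
         return; // already retrieved: stop here to break the loop
     }
     $seen[$title] = fetchRemoteTemplateText( $title );        // hypothetical
     foreach ( listTemplatesUsedBy( $seen[$title] ) as $sub ) { // hypothetical
         fetchWithDependencies( $sub, $seen );
     }
 }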

Adding fields to the interwiki table
The current structure of the interwiki table is this one:

 +-----------+------------+------+-----+---------+-------+
 | Field     | Type       | Null | Key | Default | Extra |
 +-----------+------------+------+-----+---------+-------+
 | iw_prefix | char(32)   | NO   | PRI |         |       |
 | iw_url    | blob       | NO   |     |         |       |
 | iw_local  | bool       | NO   |     |         |       |
 | iw_trans  | tinyint(1) | NO   |     | 0       |       |
 +-----------+------------+------+-----+---------+-------+

Here is the new structure I propose for the interwiki table:

 +-----------+------------+------+-----+---------+-------+
 | Field     | Type       | Null | Key | Default | Extra |
 +-----------+------------+------+-----+---------+-------+
 | iw_prefix | char(32)   | NO   | PRI |         |       |
 | iw_url    | blob       | NO   |     |         |       |
 | iw_api    | blob       | NO   |     |         |       |
 | iw_dbname | char(32)   | NO   |     |         |       |
 | iw_local  | bool       | NO   |     |         |       |
 | iw_trans  | tinyint(1) | NO   |     | 0       |       |
 +-----------+------------+------+-----+---------+-------+

So, my changes consist in adding 2 optional fields:
 * iw_api: the URL of the API (api.php) of that wiki
 * iw_dbname: the DB name of that wiki

Currently, iw_trans allows the administrator to decide whether the templates from a particular wiki can be transcluded into the current wiki:
 * 0 will forbid this
 * 1 will allow this

As explained above, my code would allow transclusions in two different ways (using the API or using a direct DB access).

What I propose is to allow three values for iw_trans:
 * 0 will forbid any transclusion from the corresponding wiki
 * 1 will allow transclusion from that wiki, using the API. In this case, iw_api must contain the API URL.
 * 2 will allow transclusion from that wiki, using the direct DB access. In this case, iw_dbname must contain the DB name.

For now, iw_trans will remain a boolean. When it is set to 1, the presence of iw_dbname will indicate whether to transclude using the API or using the direct DB access.
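For illustration, two interwiki rows on a distant wiki could then look like this (a hypothetical example: the prefixes, URLs and DB name are made up, and the row with iw_dbname set would be transcluded through the direct DB access):

 // Sketch: one home wiki reached through its API, one through a direct DB
 // connection (the presence of iw_dbname decides, as described above).
 $dbw = wfGetDB( DB_MASTER );
 $dbw->insert( 'interwiki', array(
     array(
         'iw_prefix' => 'wikipedia',
         'iw_url'    => 'http://en.wikipedia.org/wiki/$1',
         'iw_api'    => 'http://en.wikipedia.org/w/api.php',
         'iw_dbname' => '',
         'iw_local'  => 1,
         'iw_trans'  => 1,
     ),
     array(
         'iw_prefix' => 'commons',
         'iw_url'    => 'http://commons.wikimedia.org/wiki/$1',
         'iw_api'    => '',
         'iw_dbname' => 'commonswiki',
         'iw_local'  => 1,
         'iw_trans'  => 1,
     ),
 ), __METHOD__ );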

Remarks

 * The template parameters might sometimes lead to large requests, so a POST request to the API would be better suited than the current GET.
 * It would be convenient to have the API addresses of each home wiki in the interwiki table, in order to use them to perform the requests (retrieving the templates, purging the caches...).
 * We are not forced to use one shared database for all interwiki transclusions: we could also add database information in the interwiki table to know which shared table to use when transcluding templates from each particular home wiki.
 * It should be possible to transclude only a section of an article, as Extension:Labeled Section Transclusion does. When using the API, there is a way to do this, using API:Parsing wikitext and defining the section argument.

Special cases
It seems quite easy to understand how this approach would work for simple cases, such as transcluding a template which has no parameter and simply returns some HTML content.

However, some more complex cases may occur.

Complex templates
What should happen if the wanted template requires the transclusion of other templates or the execution of parser functions?

Simply getting its wikitext and parsing it on the distant wiki might lead to some problems:
 * the needed templates might not exist on the distant wiki and so should be transcluded from the home wiki (and so on if they include other templates)
 * if the templates exist on the distant wiki, they could differ between two distinct distant wikis, which would lead to the same transclusion producing different results on different distant wikis

So, it seems to be a better idea to preprocess the required template on the home wiki and then send the result to the distant wiki. This way, template calls and parser function calls are done on the home wiki and the result is the same for any distant wiki.

If some simple (plain text) parameters are given, they should be substituted during this preprocessing.

It seems that the API is already capable of doing such a thing (see API:Parsing wikitext): http://fr.wikipedia.org/w/api.php?action=expandtemplates&text=... (with a template call as the text) returns:

 <span style="white-space:nowrap">1&nbsp;km</span>

So, the parser functions of the template are executed and the parameters are substituted.

Templates with complex parameters
Now, let's assume that the value of a parameter is the result of a parser function or template.

If we pass those parameters to the home wiki when requesting the template, that wiki might not know what to do with them, because it might not know the corresponding templates.

So, we think that those complex parameters should be parsed on the distant wiki, so that they keep their usual behavior on this wiki, and the resulting values should then be passed as arguments to the wanted template, as sketched below.
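One way to do this would be to expand each parameter value through the distant wiki's own API before sending it to the home wiki (a sketch in plain PHP; only action=expandtemplates is real, the helper function is hypothetical):

 // Sketch: expand each parameter value on the distant wiki, so that local
 // templates and parser functions keep their local meaning; the expanded
 // values are then ready to be sent with the remote transclusion request.
 function expandParamsLocally( $localApiUrl, array $params ) {
     $expanded = array();
     foreach ( $params as $name => $value ) {
         $url = $localApiUrl . '?action=expandtemplates&format=json'
              . '&text=' . urlencode( $value );
         $data = json_decode( file_get_contents( $url ), true );
         $expanded[$name] = $data['expandtemplates']['*'];
     }
     return $expanded;
 }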

Caching
Whether every template request made via the API should be cached, or just some of them (requests for templates that take no parameters), still needs to be discussed.

On one hand, we should cache as much as possible to avoid parsing the same templates again and again. On the other hand, some templates will be called with a lot of different arguments, and maybe it is not necessary to cache all of them...

Some tests should be made to decide which solution is better.