Requests for comment/New sites system

From mediawiki.org
Request for comment (RFC)
New sites system
Component General
Creation date
Author(s) Daniel Friesen
Document status declined

This proposal is currently abandoned due to lack of time/interest; please feel free to take over and reactivate this RfC. Sharihareswara (WMF) (talk) 16:12, 17 June 2014 (UTC)

Our interwiki system is dated. There are a number of issues with it that need to be fixed by replacing it with a new system.

Issues that need to be solved in a new system:

  • (1) Global IDs: Right now the ID for an interwiki link is local to the wiki. Even when wikis share data there is no global id for interwiki links. The new system should have a global id field which will be unique and the same across all wikis sharing the same data. (Wikidata needs this)
  • (2) Multiple IDs: For any individual wiki one interwiki/site may be referred to by multiple interwiki links. In the current interwiki system to do this you have to create duplicate rows. In the new interwiki system duplication should be avoided by using a list (likely a separate table) of multiple local prefixes for one site row.
  • (3) Types and typed data: Our interwiki system currently refers to more than just MediaWiki sites but treats all of them with MediaWiki URL handling. This breaks some links, such as google:, and leaves other features underpowered. The new system should understand different types of sites so that each type can get its own link handling. It should also have type-specific data, where we can relocate the API location and store other information.
  • (4) Languages: Our current interwiki table only understands languages for interwiki links that are interlanguage links using a language code as a prefix. The new sites system should annotate all websites with information about what language the site is in.
  • (5) Arbitrary interlanguage links: Our current interlanguage system only supports interlanguage links that are the same as a language code. The new sites system should instead make this an explicit flag so that some prefixes that don't match known language codes can be used as interlanguage links.
    • [Note] Where should we put this data? The obvious idea would be to put it on the sites table itself. But thinking about it Wikipedia: would be a normal interwiki while en: would be an interlanguage link. Perhaps we should put that boolean on the table that lists prefixes and tie it to a prefix. Special page UIs can just have two separate inputs for interwiki prefixes and interlanguage prefixes to make a sane input method.
    • [Note] Wikidata's proposed sites table called this site_link_navigation; a boolean true would make a link go into the navigation / langlinks instead of acting as an interwiki. However, perhaps this isn't something we only want for interlanguage links. Maybe we want sister sites too? We may want to make this varbinary instead of a boolean to support the possibility of another type of navigation link.
    • [Note] We may also want to think about how this fits into the context of the idea of putting all interlanguage links into a central database.
  • (6) Groups: Wikidata's sites table suggested that we should have a column to group sites together. eg: All Wikipedias are in one group.
  • (7) Custom URLs: The API of the sites system should take into account the possibility of types that introduce very custom url patterns.
  • (8) Unprefixed sites: The new sites system should permit the creation of sites that don't have any local interwiki or interlanguage prefix.
    • [Use case] This could help synchronization by letting every wiki just copy everything instead of only what it has prefixes for.
    • [Use case] Wikidata will need this in some form for external language links that don't have a local prefix.
    • [Use case] Besides Wikidata, this could also be useful if we try replacing interlanguage links with something that doesn't use prefixes, as in bug 167.
    • [Use case] This could also be useful if someone ever comes up with a UI that needs to let you pick sites from a list to make a link for some purpose.
    • [Note] Wikidata's proposed sites table tries to deal with languagelinks by always having a site_local_id even if it is not an interwiki. To do this right we should probably do something else such as replacing ll_lang with a column pointing to a site_id. This needs some more thought and discussion.
  • (9) Synchronization: Large projects like Wikimedia have many wikis, and redoing the interwiki table by hand for every single wiki is impractical. Wikis in a large project will need some way to share or synchronize their list of sites with each other.
    • [Note] Sites have data that is global and also data specific to the individual wiki. Do we want to split the data into two different tables?
  • (10) [???] Do we want to include a site title into this information? There is a possibility some of our new use cases may have a use for such a thing. We also never discussed whether interwiki links should actually use titles like "Foo - Wikipedia" instead of "Wikipedia:Foo".
  • (11) [Existing] Our MediaWiki type needs to know how to access the API of another wiki. Our current interwiki table uses an iw_api column for this. The new one will probably derive the same information from type-specific site data (such as a scriptpath instead of an API URL).
    • Are there reasons for "needing" this beyond mythical interwiki transclusions (I'm just curious, we'll probably still want it for eventual iw transclusions even if there isn't plans)? Bawolff (talk) 12:26, 14 August 2012 (UTC)
      • I think there are a few other mythical cases that could use it. Like doing page existence checks on 3rd party wikis so that interwikis are redlinks (see bug 11). Also my bug 39199 would definitely need the api. Right now it appears the Interlanguage extension is currently using it to do purges. ie: To update the interlanguage links on other wikis that are pulling from the central wiki. Daniel Friesen (Dantman) (talk) 13:19, 14 August 2012 (UTC)
  • (12) [Existing] Our MediaWiki type needs to know about wikiids that can be used to access the database of another wiki, ie: with wfGetLB( 'wikiid' ). The current interwiki table has an iw_wikiid column for this. The new system will probably use type-specific data for this.
  • (13) [Existing] We need an equivalent to iw_local which indicates that when the title is an interwiki link it should support redirection. This should probably apply to all types and be a general part of the site row.
  • (14) [Existing] We need an equivalent to iw_trans that indicates scary transclusion should work. This will probably be type-specific data, stored in a data key that we don't even bother setting if it's not enabled for that wiki.
  • (15) [Bonus, UI] We have no standard UI to make this editable. That means for most people this is a black box they can never touch. We need a new special page for editing sites after we write the new system.
    • [Idea] We have no versioning of this data, so we're entirely dependent on half-baked logs for changes. We may want to consider treating the sites table as an index: a UI would edit versioned information instead of the sites table, and the index would then be built from that data.
    • [Idea] Detection: We expose the location of the API with an EditURI pointing to a RSD file in supported versions of MediaWiki, and in siteinfo we expose all the information we need to know to link to that wiki. The UI should probably take advantage of that and let people register MediaWiki urls simply by sticking a single link into them that the special page uses detection on to extract all the info it needs.
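Issues (2), (3) and (5) can be illustrated together. What follows is only a rough sketch, not a proposed implementation: the class names, the dictionaries, the URL patterns and the resolve() helper are all hypothetical, and it assumes that the google: breakage comes from MediaWiki-style title normalization (spaces becoming underscores) being applied to a non-MediaWiki site.

```python
from urllib.parse import quote


class Site:
    """Hypothetical generic site type: percent-encodes the page into $1."""

    def __init__(self, global_key, url_pattern):
        self.global_key = global_key
        self.url_pattern = url_pattern

    def page_url(self, page):
        return self.url_pattern.replace("$1", quote(page, safe=""))


class MediaWikiSite(Site):
    """Hypothetical MediaWiki type: titles use underscores for spaces."""

    def page_url(self, page):
        title = page.replace(" ", "_")
        return self.url_pattern.replace("$1", quote(title, safe="/:"))


# One row per site, keyed by its global id (issue 1).
SITES = {
    "enwiki": MediaWikiSite("enwiki", "https://en.wikipedia.org/wiki/$1"),
    "google": Site("google", "https://www.google.com/search?q=$1"),
}

# Several local prefixes may point at one site (issue 2), and the
# interwiki/interlanguage distinction is tied to the prefix (issue 5).
PREFIXES = {
    "wikipedia": ("enwiki", "interwiki"),
    "w": ("enwiki", "interwiki"),
    "en": ("enwiki", "langlink"),
    "google": ("google", "interwiki"),
}


def resolve(prefix, page):
    """Return (prefix kind, target URL) for a prefixed link."""
    global_key, kind = PREFIXES[prefix]
    return kind, SITES[global_key].page_url(page)


print(resolve("w", "New York"))
print(resolve("google", "new sites system"))
```

With a per-type page_url(), the search query keeps its spaces (percent-encoded) while the wiki title gets underscores, and no site row has to be duplicated to gain a second prefix.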

The old system

Because it's important to know what currently exists before recreating it, I (bawolff) have added some notes on the current system for managing sites in MediaWiki:

  • The interwiki table: See Manual:Interwiki_table. The only columns that are heavily used are iw_prefix, iw_url and iw_local. iw_wikiid could potentially be used to match up interwikis with other knowledge about known sites.
  • The interwiki cache (not to be confused with the in-process cache of interwikis, or using memcached for caching interwikis, which we also do): Basically it's a cdb file that contains a bunch of keys of the form 'wiki-id:iw_prefix' => '<iw_local status> <iw_url>'. It also has keys for global prefixes that act on all wikis, keys for mapping wiki-ids to site names (enwiki => wikipedia, frwikinews => wikinews, etc. presumably), as well as keys for interwiki links that only apply to certain "sites". This is used by Wikimedia, and one file serves all wikis.
  • The SiteConfiguration class (aka $wgConf): This contains all sorts of information about other wikis on the wikifarm. It is somewhat Wikimedia specific. See specifically the WikiMap and WikiReference classes, which can be used to make links to other websites known by the wiki, for example WikiMap::makeForeignLink( 'frwiki', 'some page', 'text of link' ). Several "global" wiki extensions (like GlobalUsage for cross-wiki file usage) use this facility.
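To make the interwiki cache description more concrete, here is a small sketch of a lookup against such a file, with a plain dict standing in for the cdb file. Only the 'wiki-id:iw_prefix' => '<iw_local status> <iw_url>' shape comes from the notes above; the '__sites:', '_site:' and '__global:' key names, the fallback order and the sample entries are assumptions made for illustration and should be checked against the real cache before relying on them.

```python
# A dict stands in for the cdb file; all entries are made-up samples.
CACHE = {
    "__sites:enwiki": "wikipedia",  # wiki-id => site name (assumed key name)
    "enwiki:commons": "0 https://commons.wikimedia.org/wiki/$1",
    "_wikipedia:fr": "1 https://fr.wikipedia.org/wiki/$1",  # per-site (assumed)
    "__global:mw": "0 https://www.mediawiki.org/wiki/$1",   # global (assumed)
}


def lookup(cache, wiki_id, prefix):
    """Resolve a prefix for one wiki: wiki-specific, then site, then global."""
    site = cache.get("__sites:" + wiki_id)
    candidates = [f"{wiki_id}:{prefix}"]
    if site:
        candidates.append(f"_{site}:{prefix}")
    candidates.append(f"__global:{prefix}")
    for key in candidates:
        value = cache.get(key)
        if value:
            # Value format: '<iw_local status> <iw_url>'
            local, url = value.split(" ", 1)
            return {"local": local == "1", "url": url}
    return None


print(lookup(CACHE, "enwiki", "fr"))
print(lookup(CACHE, "enwiki", "mw"))
```

The point of the sketch is that one flat key/value file can serve every wiki in the farm, which is the property a new sites system would need to preserve or replace.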

Database schema

-- Holds all the sites known to the wiki.
-- This includes their associated data and handling configuration.
-- In case a synchronization tool is used (eg: Wikibase), the table
-- can be obtained from an external source, in which case
-- global columns should not be modified locally.
table site(site_) {
	-- [Meta] The auto-incrementing site id
	id rowid,

	-- [Global] Global identifier for the site, eg: 'enwiktionary'
	global_key string(32) unique,

	-- [Global] Type of the site, eg: 'mediawiki'
	type string(32),

	-- [Global] Group of the site, eg: 'wikipedia'
	group string(32) default(''),

	-- [Meta] Source of the site data, eg: 'local', 'wikidata', 'my-magical-repo'
	source string(32) default('local'),

	-- [Global] Domain of the site in reverse order with a trailing dot, eg: 'org.mediawiki.www.'
	-- This field is an index for lookups and is built from type specific data in site_data.
	domain string(255),

	-- [Global] Protocol of the site, eg: 'http://', 'irc://', '//'
	-- This field is an index for lookups and is built from type specific data in site_data.
	protocol string(255),

	-- [Global] Language code of the site's primary language.
	-- We do not have real multilingual handling here by design,
	-- as implementing it would require expensive changes in core
	-- and would overcomplicate things. If you have a multilingual
	-- site, for instance imdb, you can just create multiple rows
	-- for it, eg: imdben and imdbbe.
	language string(32),

	-- [Global] Type dependent site data.
	data string,

	-- [Local] Whether site.tld/path/key:pageTitle should forward users to the page on
	-- the actual site, where "key" is the local identifier.
	forward bool default(false),

	-- [Local] Type dependent site config.
	-- For instance if template transclusion should be allowed if it's a MediaWiki.
	config string,
}

-- Holds all the local prefixes and prefix types for the sites in site
table site_prefix(sp_) {
	-- local key value, eg: 'en' or 'wiktionary'
	key string(32) primary,

	-- local key type, eg: 'interwiki' or 'langlink'
	type string(32),

	-- Key to site.site_id
	site reference(site.site_id),
}
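As a way to try the schema out, the two tables above can be approximated in SQLite. This is a sketch under assumptions, not the proposed DDL: the column types, the NOT NULL constraints, storing site_data as JSON with a page_url key, and the add_site(), find_by_prefix() and reversed_domain() helpers are all made up for illustration. Only the column names (with their site_ and sp_ prefixes) and the reversed-domain format follow the schema.

```python
import json
import sqlite3
from urllib.parse import urlsplit

SCHEMA = """
CREATE TABLE site (
    site_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    site_global_key TEXT NOT NULL UNIQUE,
    site_type       TEXT NOT NULL,
    site_group      TEXT NOT NULL DEFAULT '',
    site_source     TEXT NOT NULL DEFAULT 'local',
    site_domain     TEXT NOT NULL,
    site_protocol   TEXT NOT NULL,
    site_language   TEXT NOT NULL,
    site_data       TEXT NOT NULL,
    site_forward    INTEGER NOT NULL DEFAULT 0,
    site_config     TEXT NOT NULL DEFAULT ''
);
CREATE TABLE site_prefix (
    sp_key  TEXT PRIMARY KEY,
    sp_type TEXT NOT NULL,
    sp_site INTEGER NOT NULL REFERENCES site(site_id)
);
"""


def reversed_domain(url):
    """'http://www.mediawiki.org/' -> 'org.mediawiki.www.'"""
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split("."))) + "."


def add_site(db, global_key, site_type, language, page_url, prefixes):
    """Insert one site row plus any number of local prefixes for it."""
    scheme = urlsplit(page_url).scheme
    protocol = (scheme + "://") if scheme else "//"
    cur = db.execute(
        "INSERT INTO site (site_global_key, site_type, site_domain,"
        " site_protocol, site_language, site_data) VALUES (?, ?, ?, ?, ?, ?)",
        (global_key, site_type, reversed_domain(page_url), protocol,
         language, json.dumps({"page_url": page_url})),
    )
    for key, kind in prefixes:
        db.execute("INSERT INTO site_prefix VALUES (?, ?, ?)",
                   (key, kind, cur.lastrowid))
    return cur.lastrowid


def find_by_prefix(db, key):
    """Return (global key, prefix type) for a local prefix, or None."""
    return db.execute(
        "SELECT site_global_key, sp_type FROM site_prefix"
        " JOIN site ON sp_site = site_id WHERE sp_key = ?", (key,)
    ).fetchone()


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
add_site(db, "enwiktionary", "mediawiki", "en",
         "https://en.wiktionary.org/wiki/$1",
         [("wiktionary", "interwiki"), ("en", "langlink")])
print(find_by_prefix(db, "en"))
```

Passing an empty prefix list to add_site() gives an unprefixed site, which is the case issue (8) asks for.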

External API

[...To Be Discussed...]