Topic on Talk:Requests for comment/New sites system

Database schema proposal

20
Dantman (talkcontribs)

Denny Vrandečić proposed this database schema on the page.

-- Holds all the sites known to the wiki.
-- This includes their associated data and handling configuration.
-- In case a synchronization tool is used (ie Wikibase), the table
-- can be obtained from an external source, in which case
-- they should not be modified locally.
CREATE TABLE /*_*/site (
  -- Numeric id of the site
  site_id                    int unsigned        NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Global identifier for the site, ie enwiktionary
  site_global_key            varchar(25)         NOT NULL,

  -- Type of the site, ie SITE_TYPE_MW
  site_type                  int unsigned        NOT NULL,

  -- Group of the site, ie SITE_GROUP_WIKIPEDIA
  site_group                 int unsigned        NOT NULL,

  -- Base URL of the site, ie http://en.wikipedia.org
  site_url                   varchar(255)        NOT NULL,

  -- Path of pages relative to the base url, ie /wiki/$1
  site_page_path             varchar(255)        NOT NULL,

  -- Path of files relative to the base url, ie /w/
  site_file_path             varchar(255)        NOT NULL,

  -- Language code of the sites primary language.
  -- We do not have real multilingual handling here by design,
  -- as implementing it would require expensive changes in core
  -- and would overcomplicate things. If you have a multilingual
   -- site, for instance imdb, you can just create multiple rows
   -- for it, ie imdben and imdbbe.
  site_language              varchar(10)         NOT NULL, 

  -- Type dependent site data.
  site_data                  blob                NOT NULL
) /*$wgDBTableOptions*/;

-- Holds all the local site keys and data for the sites in site
CREATE TABLE /*_*/sitelocal (
  -- local key
  sitelocal_key              VARCHAR(25)         NOT NULL,
  --   Key to site.site_id
  site_id                    int unsigned        NOT NULL,
  -- If the site should be linkable inline as an "interwiki link" using
  -- [[site_local_key:pageTitle]].
  sitelocal_link_inline      bool                NOT NULL,

  -- If equivalent pages of this site should be listed.
  -- For example in the "language links" section.
  sitelocal_link_navigation  bool                NOT NULL,

  -- If site.tld/path/key:pageTitle should forward users to  the page on
  -- the actual site, where "key" is the local identifier.
  sitelocal_forward          bool                NOT NULL,

  -- Type dependent site config.
  -- For instance if template transclusion should be allowed if it's a MediaWiki.
  sitelocal_config           blob                NOT NULL
) /*$wgDBTableOptions*/;

Mapping of use cases to schema

  • (1) GlobalIDs: site.site_global_key
  • (2) Multiple IDs: there can be several sitelocal_key for a single site_id
  • (3) Types and Typed data: site_type and site_data, for local differences also sitelocal_config
  • (4) Languages: site_language
  • (5) Arbitrary language links: sitelocal_link_inline, sitelocal_link_navigation
  • (6) Groups: site_group
  • (7) Custom URLs: site_url, site_page_path, site_file_path, as well as further data in site_data and sitelocal_config
  • (8) Unprefixed sites: no entry in sitelocal for a site_id means an unprefixed site, basically
  • (9) Synchronization: the split between site and sitelocal splits the global and local data
  • (10) Site title: not present now. Is there consensus for this?
  • (11) iw_api: covered by site_url and site_file_path (or site_data)
  • (12) iw_wikiid: covered either by site_global_key or site_data
  • (13) iw_local: sitelocal_forward
  • (14) iw_trans: sitelocal_config or site_data
  • (15) UI: not covered by schema
Denny Vrandečić (WMDE) (talkcontribs)
Dantman (talkcontribs)

It's missing the type and group being varbinary instead of ints as was said to be fixed there. Also the dropping of site_file_path. However I still don't like the site_url and site_page_path separation. IMHO the actual url still looks like type specific data.

Btw when I said "Do we want to split the data into two different tables?" I was mostly serious with that as a question. I haven't quite figured out if that's a good idea or not. The one technical reason to do that I can think of would be to share the database. But when I think about it again trying to manage it in that situation where a tiny change on one wiki suddenly affects every wiki makes me think things could quickly go wrong. So I don't know which is the option to use yet. We also need to have a separate discussion on whether the site table is going to be a first-class table of data or an index built up of configured sources. That decision will probably also affect what we do in this part of the schema.

Notes on sitelocal:

  • Presumably sitelocal is a multi-row 0+ table. Do we need link_inline anymore? Is there actually a situation where we can have an interwiki prefix and have it be a language link but not be an interwiki link?
  • Also sitelocal_config. Do we really want separate site config for each prefix? ie: If Asdf: and en: point to the same site, is there any reason only one of them should be usable in interwiki transclusion?
    • If we can't come up with a use for it I'd like to avoid having the table storing our prefixes have extra data besides the flag that says whether if the prefix is an interlanguage. The UI for separating interlanguage and interwiki links is one thing. For that we would just let the user input a comma-separated list of interwiki prefixes and a separate box would have a comma-separated list of interlanguage prefixes. But if we add any more data than that to the prefix the UI suddenly explodes from just editing a simple form of site data to something with subforms containing configuration for every single prefix.
Jeroen De Dauw (talkcontribs)
We also need to have a separate discussion on whether the site table is going to be a first-class table of data or an index built up of configured sources.

The thing obviously needs to work on single wiki installs. So if we make this inherently be an index, we need to introduce another thing storing the site data, which would presumably be similar to what we're proposing now, and be completely useless for very nearly everyone. I don't quite understand how seeing the table as primary data by default will cause problems when some code decides to use it as index - obviously other code interacting with the table should be aware of this, or use some suitable interface to the table that is, but what would be different in the table??

Jeroen De Dauw (talkcontribs)

This is more like what I had in mind:

-- Holds all the sites known to the wiki.
-- This includes their associated data and handling configuration.
-- In case a synchronization tool is used (ie Wikibase), the table
-- can be obtained from an external source, in which case
-- they should not be modified locally.
CREATE TABLE /*_*/site (
  -- Numeric id of the site
  site_id                    int unsigned        NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Global identifier for the site, ie 'enwiktionary'
  site_global_key            varbinary(32)       NOT NULL,

  -- Type of the site, ie 'mediawiki'
  site_type                  varbinary(32)        NOT NULL,

  -- Group of the site, ie 'wikipedia'
  site_group                 varbinary(32)        NOT NULL,

  -- Source of the site data, ie 'local', 'wikidata', 'my-magical-repo'
  site_source                varbinary(32)        NOT NULL,

  -- Domain of the site in reverse order, ie 'org.mediawiki.www'
  -- This field is an index for lookups and is build from type specific data in site_data.
  site_domain               varchar(255)        NOT NULL,

  -- Protocol of the site, ie 'http://', 'irc://', '//'
  -- This field is an index for lookups and is build from type specific data in site_data.
  site_protocol             varchar(255)        NOT NULL,

  -- Language code of the sites primary language.
  -- We do not have real multilingual handling here by design,
  -- as implementing it would require expensive changes in core
  -- and would overcomplicate things. If you have a multilingual
   -- site, for instance imdb, you can just create multiple rows
   -- for it, ie imdben and imdbbe.
  site_language              varbinary(32)       NOT NULL, 

  -- Type dependent site data.
  site_data                  blob                NOT NULL,

  -- If site.tld/path/key:pageTitle should forward users to  the page on
  -- the actual site, where "key" is the local identifier.
  site_forward              bool                NOT NULL,

  -- Type dependent site config.
  -- For instance if template transclusion should be allowed if it's a MediaWiki.
  site_config               blob                NOT NULL
) /*$wgDBTableOptions*/;

-- Holds all the local site keys and data for the sites in site
CREATE TABLE /*_*/site_identifiers (
  --   Key to site.site_id
  si_site                    int unsigned        NOT NULL,

  -- local key type, ie 'interwiki' or 'langlink'
  si_type                   varbinary(32)       NOT NULL,

  -- local key value, ie 'en' or 'wiktionary'
  si_key                    varbinary(32)       NOT NULL

) /*$wgDBTableOptions*/;

-- unique key on ( si_key_type, si_key )
  • Can now have an arbitrary amount of langlink or interwiki identifiers per site, eliminating the case where we where previously forced to duplicate stuff.
  • Killed site_path field as it's type specific (site_page_path is something we need for every site and something I want to keep separate from site_url so we can easily change the later or select on it).
  • Modified field types of identifiers, types and lang to varbinary(32) to be consistent w/ core.
  • Removed split of config from the main table - I see more hassle arise with having it split then when not split.
  • Removed site_link_inline and site_link_navigation as they are obsolete due to the key link table
Dantman (talkcontribs)

Yeah this looks better. Course I'm still not keen on the site_url/site_page_path separation an think it should be type data.

I think we can drop site_link_inline. It was needed before because local_key could be a copy of the global key and not be a prefix. But now that a site with no interwiki links simply is one with no site_identifiers (site_prefix?) rows I can't think of a purpose for site_link_inline.

Also if you don't mind the bikeshedding:

  • site_global_key can probably just be site_global, or maybe just site_key since we're already talking about the unique data
  • si_key_type is basically the replacement for site_link_navigation, so we don't need site_link_navigation.
  • For si_site_id using si_site would match the other tables, see rev_page.
  • si_key_type can probably be just si_type. (Btw, I like 'type', I couldn't figure out what to call it before)
Jeroen De Dauw (talkcontribs)
Course I'm still not keen on the site_url/site_page_path separation an think it should be type data.

In order to be able to make a link to a site, we need to know where to put the page name. In the current interwiki table this is done with a single field holding for instance https://encrypted.google.com/?q=$1. I've now split this up into the base url and the part being appended. I don't see how this is type specific. It's true that the part being appended has a typical format per type, and often even tends to have the same value per type of site (ie /wiki/ suggests MediaWiki). But those are the values of the field being type specific, not the field itself.

I'd be great to find an approach with which we're both happy ofc - I'd love to see a suggestion coming from you. Just moving the field into ALL of the type classes would be rather stupid obviously (and makes it apparent it's not type specific), so that's not an approach I could settle for.

I think we can drop site_link_inline.

Definitely, gone now :)

site_global_key can probably just be site_global, or maybe just site_key since we're already talking about the unique data

Disagree - the current name is clearer, and the 4 extra bytes are not going to kill anyone. site_global could be confused with a setting indicating if it's a global site or not (ie you just wasted a minute of every new dev looking at the code).

For si_site_id using si_site would match the other tables, see rev_page.

You're right, did not know of this "convention". Updated now.

si_key_type can probably be just si_type.

Yeah.

Dantman (talkcontribs)

I don't really see moving the url stuff into the type data as strange, though I also don't see types using the same format for this.

Here, I'll give some examples of the possible situation I've been thinking of the whole time. Where site_url and site_*_path are gone and we just use site_data. (Using JSON so you can read it)

A GenericSite type site_data (just a url with a $1 replacement)

{ "url": "https://encrypted.google.com/?q=$1" }

A MediaWikiSite type site_data (data in the same format we always work with):

{ "server": "//mediawiki.org", "script_path": "/w", "article_path": "/wiki/$1" }

A GerritSite type site_data (A base url, if we used something like https://git.wikimedia.org/gerrit/r/4016 instead of https://gerrit.wikimedia.org/r/4016 the base_url would be https://git.wikimedia.org/gerrit so it's not the same as server in MWSite) that knows the differences between change numbers, change ids, and commit hashes and knows what url to build:

{ "base_url": "https://gerrit.wikimedia.org" }

A very custom TwitterSite type site_data which doesn't need any url and does special things like making [[twitter:@nadir_seen_fire]] point a profile while [[twitter:MediaWiki]] links to a search of tweets (yes this one is a little ridiculous but I have a feeling we'll end up with some people wanting some types so custom that the type itself doesn't want any instruction what the url is):

{}
Jeroen De Dauw (talkcontribs)

You can do that yes, but then you cannot:

  • Link to a sites domain (w/o doing evil regex stuff)
  • Lists sites by domain (w/o doing evil regex stuff)
  • Select sites based on domain
  • Display all sites on a specific domain
  • Update the domain of a site (w/o doing evil regex stuff)

Really, what's the harm done in having a page_path field for all sites? It will make the above things a lot easier/nicer and for those weird edge case site types you can always override behavior in their associate site class.

Dantman (talkcontribs)

Hmmmm... ok I do see a use for those. (Although I don't know if a column is necessary for 1 or 5, and 2-4 are almost the same thing)

How would you handle protocol then?

site_url is basically a user-inputted string not only can can it be http:// https:// it can also be protocol relative // and technically there is nothing stopping someone from adding (freenode, irc://irc.freenode.net/$1) so they can make irc links like freenode:mediawiki ( ;) in fact that's already being done!!!).

So using site_url to do things by-domain might not work so well.

How about this instead.

We store the url data inside of site_data. A type class has a method that returns the domain of a site (we can probably make a trick default implementation that calls $this->getUrl( '' ); and then uses wfParseUrl to get the domain). Using that method we store a new site_domain column we index and use in queries.

And if you really do want to make it so we can use this in the ui instead of storing the raw domain name we'll store a reverse-dot-postfixed domain like "org.mediawiki.www." so that we can optimize queries like site_domain LIKE 'org.mediawiki.%'. The reversal keeps the heaviest information at the start properly inside the index data and the trailing dot lets us make a mediawiki.org query match www.mediawiki.org without matching mediawiki-sucks.org in the same query or requiring a separate test for complete equality.

Jeroen De Dauw (talkcontribs)

Ok, modified schema accordingly. And since we're adding such an index anyway, also include a field for the protocol.

Dantman (talkcontribs)

Don't need site_page_path anymore, right?

Also site_domain needs that trailing . for LIKE queries to be effective.

Jeroen De Dauw (talkcontribs)

Oops - maintaining the stuff in LQT sort of is a pain :)

Don't see why we need the dot - can you explain? Either way, that does not affect the schema :)

Dantman (talkcontribs)

I mentioned before but I'll try to clarify.

The dot acts as a guarantee that every chunk of the domain will terminate with a dot even when it is the last chunk. It allows us to do partial matches on a domain using a single LIKE instead of both a LIKE and an or a LIKE that will make bad matches.

For example if we want to do a match on anything ending in mediawiki.org (which is almost always what you really want) we would do a query like this site_domain LIKE 'org.mediawiki.%'. Because mediawiki.org is 'org.mediawiki.' this LIKE query will match both mediawiki.org and www.mediawiki.org.

However if we use just 'org.mediawiki' instead we can't do this with one simple test. If we do site_domain LIKE 'org.mediawiki.%' then we won't match mediawiki.org without the www. If we do site_domain LIKE 'org.mediawiki' we will match mediawiki-something.org in addition to mediawiki.org, which is a different domain name. The only option is a long query site_domain = 'org.mediawiki' OR site_domain LIKE 'org.mediawiki.%'.

So reversing and postfixing together keep the important parts in the index and make it possible to match the base domain name in a single query (the query we will likely be using the most).

Additionally if databases are using sorting similar to `sort` then it is also important to keep domain names together.

org.mediawiki
org.mediawiki-something.www
org.mediawiki.www
org.mediawikisomething.www

vs.

org.mediawiki-something.www.
org.mediawiki.
org.mediawiki.www.
org.mediawikisomething.www.

Basically without the trailing . an ORDER BY would sort mediawiki.org separately from www.mediawiki.org if a mediawiki-something.org or www.mediawiki-something.org was also in the database.

Dantman (talkcontribs)

Oh right. We don't allow an interlanguage link, etc... to have the same prefix as an interwiki link, vice versa, and whatnot.

So rather than a UNIQUE on (si_key, si_type) we probably want si_key to be our PRIMARY index.

Jeroen De Dauw (talkcontribs)

Really - we don't allow for that? You sure?

I thought we did allow for this. If we don't, this seems like either policy or unrelated technical restriction, neither of which should affect design of this. This table should not be specific to the interwiki and "interlanguage" stuff at all - if you need another type of identifier that can be the same as one of another type, that'd just work with the current schema proposal, while I'd require changes if we put that primary index on si_key.

Dantman (talkcontribs)

Yes we only allow one single unique prefix. Whether it's an interwiki or a language link is just a simple bit of meta information on that.

We created this table practically just for the interwiki prefix stuff.

How are our interwiki links supposed to work when [[foo:Bar]] can simultaneously point to an interwiki, an intersite, and a sister site, and possibly three different sites at the same time?

Where do you see local site idenitfiers being used besides our interwiki/interlanguage system?

Dantman (talkcontribs)

Lets make it site_prefix(sp_). The prefix si_ is already used by searchindex while sp_ is unused.

Jeroen De Dauw (talkcontribs)

Where do you see having these similar key prefixes cause problems? We never need to join this table against searchindex I would think...

If it could cause problems, we ought to change the prefix, but not at the cost of the table name being good. site_prefix implies it contains prefixes, while it does not (the identifiers/keys might be used as one obviously, but this does not make them prefixes).

Reply to "Database schema proposal"