Topic on Talk:Requests for comment/New sites system

First-class data or index

Dantman (talkcontribs)

We need to figure out what the site table is going to be: our first-class data source, or a table like pagelinks that is indexed from other sources.

First-class

If it's a first-class data source, like interwiki was, we'll be doing all of our editing directly on the site table. Anything done through a web interface will rely on our limited log system. We'll need to come up with some way to do synchronization without making the UI and sync fight each other.

Advantages:

  • We don't have to write code for rebuilding the table.

Index

If the site table is an index, we'll have a setting for configuring the source that site data comes from, and the rows in the site table will be rebuilt from that data instead of edited.

Advantages:

  • Synchronization can be done using that source configuration. We'll just have a source type that uses the site data from another wiki. Probably two, one that looks at the site table in another wiki's database and another that uses the API.
  • We will not be restricted to editing the site table. We can take our time implementing the UI for the sites system if we first implement a source that reads sites from a text file, or from a wiki page, and use that. Additionally, when we do implement the web UI, we can implement it with a proper system that tracks the history of modifications to each site instead of just using a log table.
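
As a rough sketch of the index model (illustrative only): the table is rebuilt wholesale from whatever source is configured, and never edited row by row. The `SiteSource` interface, the `TextSource` class, the `rebuild_site_table` function, and the one-pair-per-line text format are all hypothetical names and assumptions, not MediaWiki code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Protocol


@dataclass(frozen=True)
class Site:
    global_id: str
    url: str


class SiteSource(Protocol):
    """Anything that can produce the full list of sites (hypothetical interface)."""

    def load(self) -> Iterable[Site]: ...


class TextSource:
    """Reads sites from text, one 'global_id url' pair per line (assumed format)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def load(self) -> Iterable[Site]:
        for line in self.text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            global_id, url = line.split(maxsplit=1)
            yield Site(global_id, url)


def rebuild_site_table(source: SiteSource) -> Dict[str, Site]:
    """Rebuild the whole table from the source; rows are never edited in place."""
    return {site.global_id: site for site in source.load()}


# Example data only; a database or API source would implement the same load().
table = rebuild_site_table(
    TextSource("enwiki https://en.wikipedia.org\n# comment\ndewiki https://de.wikipedia.org")
)
```

A source that reads another wiki's database or API would only need to provide the same `load()` method, which is what would let synchronization reuse the source configuration.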

Jeroen De Dauw (talkcontribs)

I'm a bit confused by apparently having to decide between either having it as first class data or as index. Seems like we can easily make it work as either depending on the use case.

This is what I propose:

  • The site table is first class by default but can act like an index
  • All interaction with the table is done through an interface that knows if it's an index or not using some new wiki configuration

That would be all in the initial patchset. We would follow up with:

  • Wikidata makes use of this interface and modifies the wiki config to indicate the site config is behaving as an index

Other people could write editing UIs or whatever on top of it without having to modify the site interface, and without even having to care whether the info is coming from the table or some random other source.

Dantman (talkcontribs)

Mmmmm... ok, yeah I was starting to lean a little towards somewhere in the middle a little earlier.

How about some notes/changes to that:

  • Note that if we get the web UI out and make it the recommended method the site table will quickly turn into something that is always treated as an index.
  • This web UI is actually probably going to end up used for editing the local stuff even when you're using something like Wikidata to fill it.
  • How about instead of making it a firstclass/index boolean we call it a 'source'. If the source is the string that the web UI uses it knows that it can edit the global data since it inserted it. If the source is something like 'wikidata' then it knows that it can't edit it and must only let the user modify the local data that it manages.
  • If you don't mind, while we're not aiming for multiple sources here, we could actually put source as a column in the database. It could be useful in situations like a wiki transitioning from local to global data: we would know which sites on a wiki that is being switched over to just reading the global data have not been put into the central database yet. It would also let us safely purge data from an old source without damaging new data added to the table.
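
A minimal sketch of what a source column could enable, using an in-memory SQLite table. The `site_source` column name comes up later in this thread; the other column names, the source strings `local` and `wikidata`, and both helper functions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sites (site_global_id TEXT PRIMARY KEY, site_url TEXT, site_source TEXT)"
)
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [
        ("enwiki", "https://en.wikipedia.org", "local"),
        ("dewiki", "https://de.wikipedia.org", "local"),
        ("frwiki", "https://fr.wikipedia.org", "wikidata"),
    ],
)


def sites_from(conn, source):
    """Sites still attributed to the given source, e.g. not yet migrated."""
    rows = conn.execute(
        "SELECT site_global_id FROM sites WHERE site_source = ? ORDER BY site_global_id",
        (source,),
    )
    return [row[0] for row in rows]


def purge_source(conn, source):
    """Delete only the rows that came from one source; returns the number removed."""
    cursor = conn.execute("DELETE FROM sites WHERE site_source = ?", (source,))
    return cursor.rowcount


pending = sites_from(conn, "local")      # sites not yet moved to the central data
removed = purge_source(conn, "local")    # rows from other sources are untouched
remaining = sites_from(conn, "wikidata")
```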

Jeroen De Dauw (talkcontribs)

Note that if we get the web UI out and make it the recommended method the site table will quickly turn into something that is always treated as an index.

I don't understand this. Are you talking about what will happen for Wikipedia, or for MediaWiki installs in general? For the latter I really don't see how having an edit UI really affects the table being an index or not.

This web UI is actually probably going to end up used for editing the local stuff even when you're using something like Wikidata to fill it.

I don't understand this either - do you mean it could behave incorrectly here if the below is not implemented at all?

How about instead of making it a firstclass/index boolean we call it a 'source'. If the source is the string that the web UI uses it knows that it can edit the global data since it inserted it. If the source is something like 'wikidata' then it knows that it can't edit it and must only let the user modify the local data that it manages.

I'd prefer having settings such as:

  • site data = one of ( primary, editable index, non-editable index )
  • site config = one of ( primary, editable index, non-editable index )

This way the UI is not forced to care about which source it's actually coming from, and the logic to determine if it should allow editing remains in the site interface, which is IMO where it should be (since you can obviously have many editing UIs, APIs, etc.).
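
A sketch of how such settings might look, assuming a hypothetical `SiteInterface` class that every UI or API calls instead of inspecting the source itself. The three mode values are taken from the list above; the class and method names are made up for illustration.

```python
from enum import Enum


class Mode(Enum):
    PRIMARY = "primary"
    EDITABLE_INDEX = "editable index"
    NON_EDITABLE_INDEX = "non-editable index"


class SiteInterface:
    """Hypothetical facade: callers ask it whether editing is allowed."""

    def __init__(self, site_data: Mode, site_config: Mode) -> None:
        self.site_data = site_data
        self.site_config = site_config

    def can_edit_data(self) -> bool:
        # Only a non-editable index forbids edits to the site data.
        return self.site_data is not Mode.NON_EDITABLE_INDEX

    def can_edit_config(self) -> bool:
        # The config setting is checked independently of the data setting.
        return self.site_config is not Mode.NON_EDITABLE_INDEX


# Example: site data synced from elsewhere, configuration kept locally.
synced = SiteInterface(site_data=Mode.NON_EDITABLE_INDEX, site_config=Mode.PRIMARY)
```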

put source as a column in the database

Sure. This will work if only the site data can come from an external source. If we also want to allow this for the config, we need a second field. So what would you suggest doing? Add site_source, add site_data_source and site_config_source, or something else?

Dantman (talkcontribs)

Note that if we get the web UI out and make it the recommended method the site table will quickly turn into something that is always treated as an index.

I don't understand this. Are you talking about what will happen for Wikipedia, or for MediaWiki installs in general? For the latter I really don't see how having an edit UI really affects the table being an index or not.

I just mean that if I make the web UI use a revision system, like we do for edits, instead of modifying the site table directly, and everyone starts using that for actual editing of the sites list, then we'll basically have a web UI using sites as an index and a sync system using sites as an index, and pretty quickly almost no one will be using site as a first-class table.

This web UI is actually probably going to end up used for editing the local stuff even when you're using something like Wikidata to fill it.

I don't understand this either - do you mean it could behave incorrectly here if the below is not implemented at all?

This was just a note to give the background on why we might want a source instead of a boolean. It's combined with the above.

How about instead of making it a firstclass/index boolean we call it a 'source'. If the source is the string that the web UI uses it knows that it can edit the global data since it inserted it. If the source is something like 'wikidata' then it knows that it can't edit it and must only let the user modify the local data that it manages.

I'd prefer having settings such as:

  • site data = one of ( primary, editable index, non-editable index )
  • site config = one of ( primary, editable index, non-editable index )

This way the UI is not forced to care about which source it's actually coming from, and the logic to determine if it should allow editing remains in the site interface, which is IMO where it should be (since you can obviously have many editing UIs, APIs, etc.).

The line of thought behind a source string instead of primary/non-editable (what is an editable index?) is this:

  • The web UI manages its data with a revision system and it indexes the site table off that, so it says the site table is an index
  • Wikidata/MediaWiki sync site data from a foreign source, so it says the site table is an index
  • In both situations the site table is set as an index. How does the web UI tell the difference and know if it is allowed to update the site table with its locally edited revision data?

put source as a column in the database

Sure. This will work if only the site data can come from an external source. If we also want to allow this for the config, we need a second field. So what would you suggest doing? Add site_source, add site_data_source and site_config_source, or something else?

Hmmm... I didn't think about that before.

I know I talked about things like having a text and page based interwiki source. But those are actually only things that come to mind when I think about what we'd do to make this easy for users to use soon. Honestly, what I really want is one really good web UI. Once that is done and we're using this global id based system with local prefixes, I actually don't really see any use for any other user interface anymore.

I do believe that if I put global data about sites inside of revisions in a web UI I'm probably going to do the same thing with local data. So whether we want site_source or two columns for this will depend on if you believe we're going to have multiple editing interfaces on one single wiki disagreeing on where the local data comes from. The idea of site_source came up because global data can come from anywhere (direct local db edit, wiki UI edit, synced from wikidata, synced from another database, synced from a wiki's API) which at the start I didn't think applied to local data that always came from somewhere on the same wiki.

Jeroen De Dauw (talkcontribs)

I just mean that if I make the web UI use a revision system, like we do for edits, instead of modifying the site table directly, and everyone starts using that for actual editing of the sites list, then we'll basically have a web UI using sites as an index and a sync system using sites as an index, and pretty quickly almost no one will be using site as a first-class table.

Oh, sure, if you add revisioning, then yes. That's not trivial to do nicely though, unless you wait till the ContentHandler stuff is fully merged into core and use that. Either way, we cannot block on creation of such a system, so we need this to be able to work as primary data as well.

Jeroen De Dauw (talkcontribs)

The line of thought behind a source string instead of primary/non-editable (what is an editable index?) is this:

  • The web UI manages its data with a revision system and it indexes the site table off that, so it says the site table is an index
  • Wikidata/MediaWiki sync site data from a foreign source, so it says the site table is an index
  • In both situations the site table is set as an index. How does the web UI tell the difference and know if it is allowed to update the site table with its locally edited revision data?

That is a reason to have a source field in the table, not to expose this field to the UI. The site interface would figure out if it's editable or not and provide this info to the UI, for instance in the form I suggested. So I suspect we're actually agreeing here? :)

Dantman (talkcontribs)

Sure, as long as the site system actually understands the difference. Because 'editable' does not necessarily mean the table contents can be edited, but could actually mean something more like "Can I update this with my data?", and the answer to that might be no to one interface if another added it.
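
To make the "Can I update this with my data?" question concrete, here is a small sketch in which the answer depends on who is asking. The `SiteRow` class, the `update_site` function, and the writer names `webui` and `wikidata` are hypothetical, and the rule that only the inserting writer may overwrite a row is an assumption for illustration, not something settled in this thread.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteRow:
    url: str
    site_source: str  # which writer inserted this row


def update_site(table: Dict[str, SiteRow], global_id: str, url: str, writer: str) -> None:
    """Insert or overwrite a row, but only if this writer owns it (or it is new)."""
    row = table.get(global_id)
    if row is not None and row.site_source != writer:
        raise PermissionError(
            f"{global_id} is managed by {row.site_source!r}, not {writer!r}"
        )
    table[global_id] = SiteRow(url, writer)


table = {
    "enwiki": SiteRow("https://en.wikipedia.org", "webui"),
    "frwiki": SiteRow("https://fr.wikipedia.org", "wikidata"),
}
update_site(table, "enwiki", "https://en.wikipedia.org/wiki/", "webui")  # allowed
```

Calling `update_site(table, "frwiki", ..., "webui")` would raise `PermissionError` here, because that row was added by a different writer.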

Jeroen De Dauw (talkcontribs)

To address

Add site_source, add site_data_source and site_config_source, or something else?

I guess the question comes down to whether we want to assume configuration will either be local or come from the same source as the actual site data. I suspect this assumption is going to hold for a while, and will keep holding forever for nearly all use cases. So what about just having the single site_source field for now? If at a later point more info is needed, we can add a new field (or even do something else and drop this one). This ought to be easy, as the table is small and the only thing aware of the fields and how they should be handled should be the sites interface.

Dantman (talkcontribs)

Yup, a single site_source defining the source of the global data for now pending future issues should work.
