Extension:Wikibase Quality Extensions

The Wikibase Quality Extensions are three extensions intended to ensure data quality in Wikibase.

Quality is the base extension that handles potentially incorrect data found by the other two extensions (view on Gerrit).

Constraints checks constraints and will work on statements on properties (no longer on templates) (view on Gerrit) (Special Page Constraint Report).

External Validation compares and validates data with external databases (view on Gerrit).

Installation instructions can be found on the GitHub mirrors (Quality, Constraints, External Validation).

The extensions are developed in a students' bachelor project called Wikidata Quality at Hasso-Plattner-Institute in Potsdam, Germany, in collaboration with the team of Wikidata.

If you have any questions or something else you want to let us know, please write to [mailto:tamara.slosarek@student.hpi.de tamara.slosarek@student.hpi.de].

What is currently possible
We developed the Special Page Constraint Report, which lets the user perform live constraint checks on any given entity (item or property) by typing its ID into the text box. Note that you cannot enter the entity's label.

The check result is presented in a table showing all constraints on the entity's claims and whether they are violated. For each violation you can hover over a "[?]" to get a tooltip with the constraint explanation (on touch devices, tap it instead). You can also view the specific constraint parameters (such as the conflicting items in a 'Conflicts with' constraint or the minimum and maximum values for a 'Range' constraint) by clicking on "[...]" right next to the constraint name.
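As a rough illustration of what a row in such a report could contain, the following Python sketch checks a 'Range' and a 'Conflicts with' constraint. It is not the extension's code: the function names, the status strings "compliance" and "violation", and the dictionary layout of the statements are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    constraint: str   # constraint name shown in the report
    status: str       # assumed values: "compliance" or "violation"
    parameters: dict = field(default_factory=dict)


def check_range(value, minimum, maximum):
    """A value complies when it lies within [minimum, maximum]."""
    status = "compliance" if minimum <= value <= maximum else "violation"
    return CheckResult("Range", status, {"min": minimum, "max": maximum})


def check_conflicts_with(statements, conflicting_property, conflicting_items=None):
    """Check one entity against a 'Conflicts with' constraint.

    `statements` maps a property ID to the set of item IDs used as its values
    (a hypothetical layout). The constraint is violated when the conflicting
    property is present and, if specific items are listed, at least one of
    them is among its values.
    """
    values = statements.get(conflicting_property, set())
    if conflicting_items:
        violated = bool(values & set(conflicting_items))
    else:
        violated = conflicting_property in statements
    return CheckResult(
        "Conflicts with",
        "violation" if violated else "compliance",
        {"property": conflicting_property, "items": list(conflicting_items or [])},
    )


if __name__ == "__main__":
    print(check_range(200, 0, 150))
    print(check_conflicts_with({"P31": {"Q5"}}, "P31", ["Q5"]))
```

The `parameters` field corresponds to what the report reveals behind "[...]"; the status is what decides whether a "[?]" explanation is worth showing.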

Currently the checks are based on the constraints that were defined on the talk pages. To do this on a special page in reasonable time, and in particular on live data, we parsed every talk page and built a database table containing every constraint with its corresponding parameters.
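The parsing step can be pictured with the following Python sketch. It assumes the constraints appear in the talk page wikitext as templates of the form {{Constraint:Name|key=value|...}} and deliberately ignores nested templates inside parameter values; the function name and the row layout are hypothetical and do not describe the actual database schema.

```python
import re

# Matches {{Constraint:Name}} or {{Constraint:Name|key=value|...}}.
# Simplification: parameter values containing nested templates are not handled.
TEMPLATE_RE = re.compile(r"\{\{\s*Constraint:([^|{}]+)((?:\|[^{}]*)?)\}\}")


def parse_constraints(property_id, wikitext):
    """Return one row per constraint template found in the wikitext."""
    rows = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip()
        parameters = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                parameters[key.strip()] = value.strip()
        rows.append(
            {"property": property_id, "constraint": name, "parameters": parameters}
        )
    return rows


if __name__ == "__main__":
    # "P999" and the talk page text are made-up inputs for illustration.
    talk_page = "{{Constraint:Range|min=0|max=150}}\n{{Constraint:Single value}}"
    for row in parse_constraints("P999", talk_page):
        print(row)
```

Storing the parsed rows once means the special page only needs a table lookup per property instead of fetching and parsing a talk page for every check.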

There are currently 20 constraint types defined via the Constraint template, of which we currently check 16.

What currently doesn't work
The other four constraints are usually displayed with the status "todo", indicating that we still have work to do on them.

The Unique value constraint is currently not checked because it is not possible to check the uniqueness of a value among 14 million items when there is no index defined on the tables.
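To see why an index matters here, consider this small Python sketch. It builds an in-memory mapping from each value to the items that use it, which is essentially what a database index on the value column would provide; without such a structure, every single check would have to scan all statements for the property. The function name and the input layout are assumptions for this example only.

```python
from collections import defaultdict


def find_duplicate_values(statements):
    """`statements` is an iterable of (item_id, value) pairs for one property.

    Returns a dict mapping each value used by more than one item to the set
    of items that use it.
    """
    index = defaultdict(set)
    for item_id, value in statements:
        index[value].add(item_id)
    return {value: items for value, items in index.items() if len(items) > 1}


if __name__ == "__main__":
    # Made-up item IDs and values.
    sample = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A1")]
    print(find_duplicate_values(sample))
```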

Unfortunately, the check for the Format constraint is also not possible at this time for security reasons.

Since anyone can create a new constraint template, it is normal that some constraints exist that have not been implemented yet. This is the case for the Value only constraint (the property must not be used as a qualifier or reference) and the Source constraint (the property must only be used as a reference). We are working on an implementation for these constraints, which should become available in the next version.

Currently, the checks are based on the constraints we parse periodically from the property talk pages. This is obviously not the best solution. We are working on a Pywikibot script to migrate the constraints to statements on properties, and on hooks to keep the constraint table up to date.

Special Page Cross-Check with external databases
The Special Page that performs cross-checks is currently stuck in the review process and will be available in a few weeks.

What will be possible in a few weeks
We developed another Special Page for cross-checking data in Wikidata against data from external databases such as the Integrated Authority File. To compare an item with the external data, enter its ID (not its label) into the text box and click 'Check'. You should then get a table of results. The external sources in use are listed on the Special Page Lists of external databases.

What currently doesn't work
We compare some coordinates, but please treat those results with caution. We already lower the precision, but it is still definitely possible that two coordinates describe the same thing and are nevertheless too far apart, which is reported as a mismatch. Please ignore those for the moment.
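The difficulty can be illustrated with a short Python sketch. Comparing coordinates at reduced precision (here simply by rounding) can report a mismatch for two points that lie only a few metres apart but on opposite sides of a rounding boundary. A distance threshold, shown for contrast, avoids that particular effect, although it does not help when two sources pick genuinely different points for a large object such as a city. Both functions, the precision of three decimal places and the 100 m tolerance are assumptions for illustration and not the extension's actual comparison logic.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def match_by_rounding(a, b, decimals=3):
    """Compare (lat, lon) pairs in degrees after rounding to `decimals` places."""
    return (round(a[0], decimals), round(a[1], decimals)) == (
        round(b[0], decimals),
        round(b[1], decimals),
    )


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def match_by_distance(a, b, tolerance_m=100.0):
    """Treat two coordinates as matching when they are within `tolerance_m` metres."""
    return haversine_m(a, b) <= tolerance_m


if __name__ == "__main__":
    wikidata_value = (52.3944, 13.0)   # made-up coordinates
    external_value = (52.3946, 13.0)
    print(match_by_rounding(wikidata_value, external_value))
    print(match_by_distance(wikidata_value, external_value))
```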

In addition to Match and Mismatch there will be another status, Exception, which indicates that the external value is wrong. Since we do not yet store any of the mismatches we find, this will come in a later version (hopefully the next).

Vision - Show problems directly on Item Pages
Our vision for this project is that every user who visits an item page sees a small indicator when there is a constraint violation. Clicking on it should show a short text explaining which constraint has been violated, and offer the opportunity to fix it or, if the user is really sure that this is not a violation, to add it as an exception.

We also want to assist the user in correcting the violation. When, for example, the symmetric constraint is violated, the user should get a prompt asking them to add the missing statement to the other item. Of course we have to make sure that this doesn't cause errors to spread. Therefore, we are considering filling in the missing value automatically only when there is a reference proving its correctness. This would have the nice side effect that the number of references in the system grows over time.
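A minimal Python sketch of this idea follows. It looks for statements of a symmetric property whose reverse statement is missing and marks a suggestion as safe to add automatically only when the original statement carries a reference. The tuple layout, the function name and the `auto_add` flag are hypothetical and only illustrate the rule described above.

```python
def suggest_symmetric_statements(statements, property_id):
    """Suggest missing reverse statements for a symmetric property.

    `statements` is a list of (subject, property, object, has_reference)
    tuples. A suggestion is flagged `auto_add` only if the original statement
    has a reference; otherwise the user would have to confirm it.
    """
    existing = {(s, o) for s, p, o, _ in statements if p == property_id}
    suggestions = []
    for subject, prop, obj, has_reference in statements:
        if prop != property_id:
            continue
        if (obj, subject) not in existing:
            suggestions.append(
                {
                    "item": obj,
                    "property": property_id,
                    "value": subject,
                    "auto_add": has_reference,
                }
            )
    return suggestions


if __name__ == "__main__":
    # Made-up item IDs; "P26" stands in for some symmetric property.
    sample = [
        ("Q1", "P26", "Q2", True),
        ("Q2", "P26", "Q1", False),   # reverse already present
        ("Q3", "P26", "Q4", False),   # missing reverse, no reference
        ("Q5", "P26", "Q6", True),    # missing reverse, referenced
    ]
    for suggestion in suggest_symmetric_statements(sample, "P26"):
        print(suggestion)
```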

In the end, the result of this constraint check should be displayed right beside the statements when you visit a particular page, but this will take a while. Until then, we want to migrate the constraints to the statements on properties, so that our special page can work without the usage of the database table we generated from the talk pages.

When everything we mentioned above works, we want to go one step further. In some cases (such as a Unique Value Constraint) it is hard to check for a violation in real time, but for most constraints this should be possible. We are planning to check nearly everything a user enters as a statement against all constraints defined on the property, right when the user wants to save the statement. This prevents false data from getting into the system and helps to keep the data consistent.

We want to show the user a prompt listing the constraints the entry would violate, with the opportunity to change the entry or to add it as an exception for the constraint (maybe this should only be possible after a reference has been added...). When the entry violates a constraint that connects two items (Symmetric/Inverse/Target required claim/...), the prompt could also suggest (semi-)automatically adding a statement to the other item.

As you see, there are many questions on how to integrate violations in the UI and in the workflow, and we are really looking forward to getting in contact with you. There are many shades of automation and we want to find a way that works well both for the system and for everyone working with it.