Extension:Wikibase Quality Extensions


The Wikibase Quality Extensions consist of three extensions that are intended to ensure data quality in Wikibase.

Quality is the base extension that handles potentially incorrect data found by the other two extensions (view on Gerrit).

Constraints checks constraints and will work with constraints defined as statements on properties (no longer as templates) (view on Gerrit) (Special Page Constraint Report).

External Validation compares and validates data with external databases (view on Gerrit).

Install instructions can be found on the GitHub mirrors (Quality, Constraints, External Validation).

The extensions are developed in a student bachelor's project called Wikidata Quality at the Hasso-Plattner-Institute in Potsdam, Germany, in collaboration with the Wikidata team.

If you have any questions or anything else you want to let us know, please write to tamara.slosarek@student.hpi.de.

Special Page Constraint Report

Result on Special Page Constraint Report

What is currently possible

Tooltip with constraint information
Constraint parameters

We developed the Special Page Constraint Report, which enables users to perform live constraint checks on any given entity (item or property) by simply typing its ID into the text box. Note that you cannot enter the entity's label.

The check result is presented in a table showing all constraints on the claims and whether they are violated or not. For each violation you can hover over a "[?]" to get a tooltip (on touch devices you get the tooltip by tapping it). The tooltip provides the explanation of the constraint. Furthermore, you can view the specific constraint parameters (like the conflicting items in a 'Conflicts with' constraint or the minimum and maximum values for a 'Range' constraint) by clicking on "[...]" right next to the constraint name.
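To illustrate the two constraint types named above, here is a minimal sketch in Python of what a 'Range' check and a 'Conflicts with' check could look like. It is not the extension's actual code: the `CheckResult` class, the function names, the status strings "violation" and "compliance", and the data layout (a dict mapping property IDs to lists of values) are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """One row of a hypothetical result table."""
    property_id: str
    constraint: str
    status: str                      # "compliance" or "violation" (assumed names)
    parameters: dict = field(default_factory=dict)


def check_range(property_id, value, minimum, maximum):
    """'Range': the value must lie between minimum and maximum (inclusive)."""
    ok = minimum <= value <= maximum
    return CheckResult(property_id, "Range",
                       "compliance" if ok else "violation",
                       {"min": minimum, "max": maximum})


def check_conflicts_with(property_id, item_statements,
                         conflicting_property, conflicting_items=None):
    """'Conflicts with': the item must not also use the conflicting property.

    If conflicting_items is given, only those values count as a conflict.
    item_statements maps property IDs to lists of values (assumed layout).
    """
    values = item_statements.get(conflicting_property, [])
    if conflicting_items:
        conflict = any(v in conflicting_items for v in values)
    else:
        conflict = bool(values)
    return CheckResult(property_id, "Conflicts with",
                       "violation" if conflict else "compliance",
                       {"property": conflicting_property,
                        "items": list(conflicting_items or [])})
```

The `parameters` field plays the role of the information shown behind "[...]" in the report: it keeps the values the check was run with next to its outcome.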

Currently we do this based on the constraints that were defined on the property talk pages. To be able to do this on a special page in reasonable time, and particularly on live data, we parsed every talk page and built a database table containing every constraint with its corresponding parameters.
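The parsing step described above could be sketched roughly as follows. This is only an illustration: the template syntax `{{Constraint:Name|key=value}}`, the table layout, and the function names are assumptions, and the regular expression deliberately ignores nested templates, which a real parser would have to handle.

```python
import json
import re
import sqlite3

# Matches flat templates such as {{Constraint:Range|min=0|max=100}}.
# Assumption: parameters contain no nested templates.
TEMPLATE_RE = re.compile(r"\{\{\s*Constraint:([^|}]+)((?:\|[^{}]*)?)\}\}")


def parse_constraints(property_id, wikitext):
    """Extract constraint names and parameters from talk page wikitext."""
    rows = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip()
        params = {}
        # group(2) is "" or "|key=value|key=value"; skip the empty first part.
        for part in match.group(2).split("|")[1:]:
            key, sep, value = part.partition("=")
            if sep:
                params[key.strip()] = value.strip()
        rows.append({"property": property_id,
                     "constraint": name,
                     "parameters": params})
    return rows


def build_table(rows):
    """Store the parsed constraints in an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE constraints "
                 "(property TEXT, constraint_name TEXT, parameters TEXT)")
    conn.executemany(
        "INSERT INTO constraints VALUES (?, ?, ?)",
        [(r["property"], r["constraint"], json.dumps(r["parameters"]))
         for r in rows])
    conn.commit()
    return conn
```

With such a table in place, a live check only needs to look up the rows for the properties used on the entity instead of fetching and parsing talk pages on every request.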

There are currently 20 constraint types defined via the Constraint template, of which we currently check 16.

What currently doesn't work

The other four constraints are usually displayed with the status "todo", indicating that we still have work to do on them.

The Unique value constraint is currently not checked because it is not possible to check the uniqueness of a value among 14 million items when there is no index defined on the tables.
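The following sketch shows, in general terms, why an index matters for this check. It does not reflect the actual database schema: the data layout (a list of `(item_id, value)` pairs for one property) and all function names are assumptions for this example.

```python
from collections import defaultdict


def find_duplicates_by_scan(statements, item_id, value):
    """Without an index: every statement has to be inspected for each check."""
    return sorted(other for other, v in statements
                  if v == value and other != item_id)


def build_value_index(statements):
    """Build a value -> items mapping once, similar in spirit to a database index."""
    index = defaultdict(set)
    for item, value in statements:
        index[value].add(item)
    return index


def find_duplicates_with_index(index, item_id, value):
    """With an index: a single lookup returns all items sharing the value."""
    return sorted(index.get(value, set()) - {item_id})
```

Both functions return the same answer; the difference is that the scan touches every statement for each check, while the indexed lookup only touches the items that actually share the value.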

Unfortunately, the check for the Format constraint is also not possible at this time for security reasons.

Since everyone can create a new constraint template, it is to be expected that some constraints exist which we have not implemented yet. This is the case for the Value only constraint (the property must not be used as a qualifier or reference) and the Source constraint (the property must only be used as a reference). We are working on an implementation for these constraints, which should become available in the next version.

Currently, the checks are based on the constraints we parse periodically from the property talk pages. This is obviously not the best solution. We are currently working on a Pywikibot script to migrate the constraints to statements on properties, and on hooks to keep this table up to date.

Special Page Cross-Check with external databases

Result on Special Page Cross-check

The Special Page that performs cross-checks is currently stuck in the review process and will be available in a few weeks.

Special Page List of external databases

What will be possible in a few weeks

We developed another Special Page for cross-checks of data in Wikidata with data from external databases like the Integrated Authority File. To compare an Item with the external data, enter its ID into the text box (not the label, the ID) and click 'Check'.

Item Page with icons for constraint violations
Item Page with extended information

Then you should get a table with the following columns:

Column            Description
status            Shows if there is a Match or a Mismatch
property          The property that was checked
Wikidata value    The value stated in Wikidata
External values   The value (or values) stated in the external source
References        Shows if references are stated or missing for the statement concerned
External source   Links to the Item of the external source in Wikidata
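As an illustration of how one row of this table could be produced, here is a small sketch. The comparison rule (case-insensitive equality after trimming whitespace) is an assumption chosen for simplicity, since the actual comparison logic is not described here; the class and function names, the reference strings "stated" and "missing", and the IDs used in the usage note are likewise placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrossCheckRow:
    """One row of the cross-check result table (field names are assumptions)."""
    status: str                 # "Match" or "Mismatch"
    property_id: str
    wikidata_value: str
    external_values: List[str]
    references: str             # "stated" or "missing" (assumed wording)
    external_source: str        # ID of the item describing the external source


def cross_check(property_id, wikidata_value, external_values,
                has_references, external_source):
    """Compare one Wikidata value against the values from an external source."""
    def normalize(value):
        # Assumed rule: ignore surrounding whitespace and letter case.
        return value.strip().casefold()

    matched = normalize(wikidata_value) in {normalize(v) for v in external_values}
    return CrossCheckRow(
        status="Match" if matched else "Mismatch",
        property_id=property_id,
        wikidata_value=wikidata_value,
        external_values=list(external_values),
        references="stated" if has_references else "missing",
        external_source=external_source,
    )
```

For example, `cross_check("P569", "1879-03-14", ["1879-03-14"], False, "Q100")` would yield a row with status "Match" and references "missing", where "Q100" stands in for the item of the external source.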

The external sources in use are listed on the Special Page List of external databases, with the following information:

Column          Description
Name            Link to the Item of the external source in Wikidata
Dump ID         A string we gave the dumps to keep them apart
Import date     The date the external data was imported into the database in Wikidata
Data language   The language of the data in the external source
Source URL      The URL where we got the dumps from
Size            Size of the dump
License         License for the external data

What currently doesn't work

We compare some coordinates, but please treat those results with caution. We already lower the precision, but it is still definitely possible that two coordinates describe the same thing and are nevertheless too far apart, which results in a mismatch. Please ignore those for the moment.
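The difficulty can be illustrated with a small sketch. The first function compares coordinates after rounding, which is one possible reading of "lowering the precision"; the second uses a distance tolerance instead, which is an alternative technique and not something the extension is stated to do. The function names, the number of decimals, and the tolerance are all assumptions.

```python
import math


def match_by_rounding(a, b, decimals=1):
    """Treat two (lat, lon) pairs as equal if they agree after rounding."""
    return (round(a[0], decimals) == round(b[0], decimals)
            and round(a[1], decimals) == round(b[1], decimals))


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = math.radians(b[0] - a[0])
    dlambda = math.radians(b[1] - a[1])
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def match_by_distance(a, b, tolerance_km=1.0):
    """Treat two coordinates as equal if they are within tolerance_km."""
    return haversine_km(a, b) <= tolerance_km
```

Neither approach removes the underlying problem: if two sources place the same large object, such as a city or a river, at different points, any fixed precision or tolerance can still report a mismatch.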

In addition to Match and Mismatch, there will be another status, Exception, which indicates that the external value is wrong. Since we do not yet store any of the mismatches we find, this will come in a later version (hopefully the next).

Vision - Show problems directly on Item Pages

Our vision for this project is that every user who visits an item page sees a small indicator when there is a constraint violation. Clicking on it should show a short text explaining which constraint has been violated, and give the user the opportunity to fix it or to add it as an exception when they are really sure that it is not a violation.

We also want to assist the user in correcting the violation. When, for example, the Symmetric constraint is violated, they should get a prompt asking them to add the missing statement to the other item. Of course, we have to make sure that this doesn't cause errors to spread. Therefore, we think another possibility is to fill in the missing value automatically only when there is a reference proving its correctness. This would have the nice side effect that the number of references in the system grows over time.
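A minimal sketch of this idea for the Symmetric constraint might look as follows. The data layout (a dict mapping item IDs to properties, each holding a list of statements with a target and a list of references) and the function name are assumptions made for illustration; no such interface is described for the extension.

```python
def suggest_symmetric_statements(items, property_id):
    """Find missing reverse statements for a symmetric property.

    items maps item IDs to {property_id: [statement, ...]}, where each
    statement is a dict with "target" and "references" (assumed layout).
    Returns one suggestion per missing reverse statement.
    """
    suggestions = []
    for subject, statements in items.items():
        for statement in statements.get(property_id, []):
            target = statement["target"]
            reverse = items.get(target, {}).get(property_id, [])
            if any(s["target"] == subject for s in reverse):
                continue  # the reverse statement already exists
            suggestions.append({
                "item": target,
                "property": property_id,
                "value": subject,
                # Only propose automatic filling when the original statement
                # is backed by at least one reference; otherwise ask the user.
                "auto_fill": bool(statement["references"]),
            })
    return suggestions
```

Statements without references still produce a suggestion, but with `auto_fill` set to `False`, so the decision stays with the user and unreferenced errors are not copied automatically.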

In the end, the result of this constraint check should be displayed right beside the statements when you visit a particular page, but this will take a while. Until then, we want to migrate the constraints to statements on properties, so that our special page can work without the database table we generated from the talk pages.

When everything we mentioned above works, we want to go one step further. In some cases (for example with a Unique value constraint), it is hard to check in real time whether there is a violation, but for most constraints this should be possible. We are planning to check nearly everything a user enters as a statement against all constraints that are defined on the property, right when they want to save the statement. This prevents false data from getting into the system and helps to keep the data consistent.

We want to give the user a prompt with a list of the constraints they would violate with this entry, and give them the opportunity to change it or add it as an exception for this constraint (maybe this should only be possible after they add a reference...). When they violate a constraint that connects two items (Symmetric/Inverse/Target required claim/...), they could also get a prompt suggesting that they (semi-)automatically add a statement to the other item.

As you see, there are many questions on how to integrate violations in the UI and in the workflow, and we are really looking forward to getting in contact with you. There are many shades of automation and we want to find a way that works well both for the system and for everyone working with it.