Extension:Wikibase Quality Extensions

Welcome to Wikidata Quality!

We are a team of students from the Hasso Plattner Institute in Potsdam, Germany. For our bachelor project, we are working with the Wikidata team to improve Wikidata's data quality.

In consultation with the Wikidata community, two projects have emerged. One part of our team wants to improve the usage and visualization of constraints (as suggested by the community here), while the other part is working on a tool that validates Wikidata by comparing it with external databases.

Improving Constraint Reports
When we started working on this project, the only way to define constraints was on the talk page of a property, and only by editing templates. This is neither user-friendly nor easy to maintain. During our analysis we found more than 4000 hand-written constraints, some of which do not exactly match the template definitions, e.g. Single_value instead of Single value. It is very difficult for a bot to check the data against the corresponding constraints when some of them are misspelled.

So this is the status quo: constraints live on the talk pages of properties, and a bot checks the data it finds in Wikidata dumps against those constraints and generates these constraint reports. While this definitely adds value, the reports are not nice to read, the underlying constraints are a pain to maintain, and checking against a dump is of course not as accurate as checking against live data.

Luckily, it is now possible to create statements on properties. Based on that feature, we are planning to migrate the constraints from the talk pages, which enables us to generate meaningful constraint reports right where they are needed. Our vision for this project is that every user who visits an item page sees a small indicator when there is a constraint violation. Clicking on it should show a short text that explains which constraint has been violated and offers the opportunity to fix it, or to add it as an exception if the user is really sure that it is not a violation.

We also want to assist the user in correcting the violation. When, for example, the symmetric constraint is violated, the user should get a prompt asking them to add the missing statement to the other item. Of course, we have to make sure that this doesn't cause errors to spread. Therefore, we are considering filling in the missing value automatically only when there is a reference proving its correctness. This would have the nice side effect that the number of references in the system grows over time.
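As a rough illustration of the symmetric check and the reference rule described above, here is a minimal sketch. The in-memory data model, the function names and the `referenced` set are assumptions made for this example only; they are not part of the extension.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data model: item ID -> property ID -> set of target item IDs.
Statements = Dict[str, Dict[str, Set[str]]]


def find_symmetric_violations(items: Statements, prop: str) -> List[Tuple[str, str]]:
    """Return (target, source) pairs where `target` lacks the statement back to `source`."""
    missing = []
    for source, props in sorted(items.items()):
        for target in sorted(props.get(prop, set())):
            back_links = items.get(target, {}).get(prop, set())
            if source not in back_links:
                missing.append((target, source))
    return missing


def plan_fixes(
    items: Statements, prop: str, referenced: Set[Tuple[str, str]]
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split missing statements into auto-fillable ones and ones that need a prompt.

    `referenced` holds (source, target) pairs whose existing statement carries
    a reference; only those are considered safe to mirror automatically.
    """
    auto, ask = [], []
    for target, source in find_symmetric_violations(items, prop):
        if (source, target) in referenced:
            auto.append((target, source))
        else:
            ask.append((target, source))
    return auto, ask


# Illustrative item and property IDs.
items = {
    "Q1": {"P26": {"Q2"}},
    "Q2": {"P26": set()},
    "Q3": {"P26": {"Q4"}},
}
auto, ask = plan_fixes(items, "P26", referenced={("Q1", "Q2")})
```

In this example the missing statement on Q2 could be filled in automatically because the Q1 statement is referenced, while the one on Q4 would only trigger a prompt.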

Currently, we are building a special page where you enter an Item ID and we generate a table with the constraint report. Right now, we do this based on the constraints that were defined on the talk pages. To be able to do this on a special page in reasonable time, and in particular on live data, we parsed every talk page and built a table that lists every constraint together with its parameters.
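The parsing step can be sketched roughly as follows. The template syntax (`{{Constraint:Name|key=value}}`), the row layout and all names are assumptions for illustration; the real talk pages contain more variations than this simple pattern handles.

```python
import re
from typing import Dict, List, NamedTuple


class ConstraintRow(NamedTuple):
    property_id: str
    constraint: str
    parameters: Dict[str, str]


# Assumed template form: {{Constraint:<name>|<key>=<value>|...}} without nested templates.
TEMPLATE_RE = re.compile(r"\{\{\s*Constraint:([^|}]+)((?:\|[^{}]*)?)\}\}")


def parse_constraints(property_id: str, wikitext: str) -> List[ConstraintRow]:
    """Turn the constraint templates found in `wikitext` into table rows."""
    rows = []
    for match in TEMPLATE_RE.finditer(wikitext):
        # Normalise spelling variants such as "Single_value" to "Single value".
        name = " ".join(match.group(1).replace("_", " ").split())
        params = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip()
        rows.append(ConstraintRow(property_id, name, params))
    return rows


talk_page = "{{Constraint:Single_value}}\n{{Constraint:Type|class=Q5|relation=instance}}"
rows = parse_constraints("P569", talk_page)
```

Normalising the template name while parsing is one way to absorb the spelling variants mentioned earlier, so the checks can look constraints up by a single canonical name.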

In the end, the result of this check should be displayed right beside the statements when you visit an item page, but this will take a while. Until then, we want to migrate the constraints to statements on properties, so that our special page can work without relying on the table we generated from the talk pages.

And here, we need your help
There are several possible approaches to representing constraints with the statements-on-properties model. After discussing them with several members of the Wikidata community, we agree with the proposal Ivan A. Krestinin made on the discussion page for property proposals. This approach requires many new properties to be created, and every new property needs the approval of some community members. It would be really great if you could read this suggestion, maybe discuss it, and, if you like the approach, give your approval for the suggested properties and maybe create them. We don't want to make the important decision of how constraints are handled in Wikidata in the future on our own, but we really need them represented as statements on properties to continue our work and generate meaningful constraint reports that, in the end, hopefully improve the overall quality of the data in Wikidata.

More features to come
When everything we mentioned above works, we want to go one step further. In some cases (for example with a Unique Value Constraint), it is hard to check in real time whether there is a violation, but with most constraints this should be possible. We are planning to check nearly everything a user enters as a statement against all constraints defined on the property, right when the user wants to save the statement. This prevents false data from getting into the system and helps to keep the data consistent.

We want to show the user a prompt with a list of the constraints this entry would violate, and give them the opportunity to change the entry or add it as an exception to the constraint (maybe this should only be possible after adding a reference...). When the entry violates a constraint that connects two items (Symmetric/Inverse/Target required claim/...), the prompt could also suggest automatically adding a statement to the other item.
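A save-time check along these lines might look like the sketch below. The checker functions, the parameter names (`pattern`, `values`) and the split into "violated" and "deferred" constraints are assumptions for illustration, not the extension's actual design.

```python
import re
from typing import Callable, Dict, List, Tuple

# A checker gets the values already on the item, the new value and the
# constraint parameters, and returns True when the constraint is satisfied.
Checker = Callable[[List[str], str, Dict[str, str]], bool]


def single_value_ok(existing: List[str], new_value: str, params: Dict[str, str]) -> bool:
    return len(existing) == 0


def format_ok(existing: List[str], new_value: str, params: Dict[str, str]) -> bool:
    return re.fullmatch(params["pattern"], new_value) is not None


def one_of_ok(existing: List[str], new_value: str, params: Dict[str, str]) -> bool:
    return new_value in params["values"].split(",")


# Only constraints that can be decided from the item itself are listed here.
CHECKERS: Dict[str, Checker] = {
    "Single value": single_value_ok,
    "Format": format_ok,
    "One of": one_of_ok,
}


def violations_on_save(
    existing: List[str], new_value: str, constraints: List[Tuple[str, Dict[str, str]]]
) -> Tuple[List[str], List[str]]:
    """Return (violated, deferred) constraint names for a statement about to be saved."""
    violated, deferred = [], []
    for name, params in constraints:
        checker = CHECKERS.get(name)
        if checker is None:
            # E.g. "Unique value" needs a lookup across all items, so it is
            # left to a later, non-real-time check.
            deferred.append(name)
        elif not checker(existing, new_value, params):
            violated.append(name)
    return violated, deferred


constraints = [("Format", {"pattern": r"\d{4}"}), ("Single value", {}), ("Unique value", {})]
violated, deferred = violations_on_save(["1234"], "12a4", constraints)
```

The `violated` list is what a prompt would show to the user; the `deferred` list keeps expensive constraints out of the save path.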

As you can see, there are many open questions about how to integrate violations into the UI and the workflow, and we are really looking forward to getting in contact with you. There are many shades of automation, and we want to find a way that works well for the system and for everyone working with it.

External Validation
In our second project, we want to validate Wikidata by comparing it with external databases.

Currently, we are building a special page that takes an Item ID and performs a cross-check with data from the GND, the Integrated Authority File of the German National Library (we will be working on other external databases soon).

To obtain the comparison data, we analyze the GND data dump and write it into a table that holds all our external data.

For this, we need a mapping that tells us which Properties can be validated with which data from the dump. Because the GND dump is in XML, the mapping is expressed in XPath; other formats will require other query languages.
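To make the idea of an XPath mapping concrete, here is a minimal sketch using Python's standard library, which supports a limited XPath subset. The XML layout, the element names, the record ID and the property-to-XPath pairs are all made up for this example and do not reflect the real GND dump format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical mapping: Wikidata property ID -> XPath relative to one record.
MAPPING: Dict[str, str] = {
    "P569": "./dateOfBirth",
    "P19": "./placeOfBirth",
}

# Made-up sample in an invented format, only to make the example runnable.
SAMPLE_DUMP = """
<records>
  <record id="000000000">
    <dateOfBirth>01.02.1900</dateOfBirth>
    <placeOfBirth>Example City</placeOfBirth>
  </record>
</records>
"""


def extract_external_values(
    dump_xml: str, mapping: Dict[str, str]
) -> Dict[str, Dict[str, List[str]]]:
    """Build a table: external ID -> property ID -> values found in the dump."""
    root = ET.fromstring(dump_xml.strip())
    table: Dict[str, Dict[str, List[str]]] = {}
    for record in root.findall("./record"):
        external_id = record.get("id")
        table[external_id] = {
            property_id: [el.text.strip() for el in record.findall(xpath) if el.text]
            for property_id, xpath in mapping.items()
        }
    return table


external_table = extract_external_values(SAMPLE_DUMP, MAPPING)
```

Keeping the mapping as plain data means that another database would only need its own mapping (and, for non-XML formats, another query language) rather than new extraction code.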

For the actual cross-check, we look for a GND identifier in the Item (and, in the future, other identifiers). With our mapping, we can then find the corresponding entries in our table and compare them to the Item's statements. In practice, it is a bit more complicated because of different formats (for example German date formats, or names given in the form "Surname, Forename") and other exceptions and edge cases that need to be considered. In the end, the cross-check could be implemented as a live tool that can be executed from the Item's page itself (see mock ups).

Alternatively, it could be executed as a cron job, for example once a month.

Either way, any discrepancies found could be treated as a "Cross-check mismatch" constraint violation (see the first project).
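The format differences mentioned above (German date formats, names written as "Surname, Forename") mean that values have to be normalized before they are compared. A minimal sketch, with hypothetical function names and made-up sample values:

```python
from datetime import datetime
from typing import Callable, Iterable, List


def normalize_name(name: str) -> str:
    """Turn "Surname, Forename" into "Forename Surname"; leave other names as they are."""
    if "," in name:
        surname, forename = (part.strip() for part in name.split(",", 1))
        return f"{forename} {surname}"
    return name.strip()


def normalize_german_date(text: str) -> str:
    """Turn a German-style date such as "01.02.1900" into ISO form "1900-02-01"."""
    return datetime.strptime(text.strip(), "%d.%m.%Y").date().isoformat()


def cross_check(
    wikidata_values: Iterable[str],
    external_values: Iterable[str],
    normalize: Callable[[str], str],
) -> List[str]:
    """Return the Wikidata values that have no match among the normalized external values."""
    external = {normalize(value) for value in external_values}
    return [value for value in wikidata_values if value not in external]


name_mismatches = cross_check(["Jane Doe"], ["Doe, Jane"], normalize_name)
date_mismatches = cross_check(["1900-02-03"], ["01.02.1900"], normalize_german_date)
```

Each value left in the result would be a candidate for a "Cross-check mismatch" violation; real data will need more normalizers and exception handling than shown here.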

For further information about us or our projects, please visit our GitHub Wiki.