Extension:Wikibase Quality Extensions

Welcome to Wikidata Quality!

We are a team of students from the Hasso Plattner Institute in Potsdam, Germany. For our bachelor project, we are working together with the Wikidata team to help ensure the quality of Wikidata's data.

In consultation with the Wikidata community, two projects have emerged. One part of our team wants to improve the usage and visualization of constraints (as suggested by the community here), whereas the second part is currently working on a tool that validates Wikidata by comparing it with external databases.

Improving Constraint Reports
When we started working on this project, the only way to define constraints was on the talk page of a property, and only by editing templates. This is neither user-friendly nor easy to maintain. On the contrary: during our studies, we found more than 4000 hand-written constraints, some of which don't exactly match the template definitions, e.g. Single_value instead of Single value. It is very difficult for a bot to check the data against the corresponding constraints when some of them are written incorrectly.

So this is the status quo: there are constraints on the talk pages of properties, and a bot checks the data it finds in Wikidata dumps against those constraints to generate these constraint reports. While this definitely adds value, the reports aren't nice to read, the underlying constraints are a pain to maintain, and checking against a dump is of course not as accurate as checking against live data.

Luckily, it is now possible to create statements on properties. Based on that feature, we are planning to migrate the constraints from the talk pages, enabling us to generate meaningful constraint reports right where they are needed. Our vision for this project is that every user who visits an item page sees a small indicator when there is a constraint violation. Clicking on it should show a short text explaining which constraint has been violated and offer the opportunity to fix it, or to add it as an exception if the user is really sure that it is not a violation.

We also want to assist the user in correcting the violation. When, for example, the symmetric constraint is violated, the user should get a prompt asking them to add the missing statement to the other item. Of course, we have to take care that this doesn't cause errors to spread. Therefore, we are considering filling in the missing value automatically only when there is a reference proving its correctness. This would have the nice side effect that the number of references in the system grows over time.
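
To illustrate the symmetric case, here is a minimal sketch in Python. It is not our actual implementation: the in-memory data layout, the function name and the IDs (Q_A, Q_B, P_spouse) are placeholders invented for this example.

```python
def missing_symmetric_statements(statements, property_id):
    """Find statements whose reverse statement is absent.

    `statements` maps an item ID to a dict of property ID -> set of item IDs
    (an assumed, simplified layout for this sketch).
    Returns (item, property, value) triples that could be suggested to the user.
    """
    suggestions = []
    for subject, claims in statements.items():
        for value in sorted(claims.get(property_id, set())):
            reverse = statements.get(value, {}).get(property_id, set())
            if subject not in reverse:
                # The other item lacks the statement pointing back.
                suggestions.append((value, property_id, subject))
    return suggestions


# Placeholder data: Q_A names Q_B as spouse, but Q_B has no statement back.
data = {
    "Q_A": {"P_spouse": {"Q_B"}},
    "Q_B": {},
}
suggestions = missing_symmetric_statements(data, "P_spouse")
```

Each returned triple is only a suggestion. Whether it is applied automatically or shown as a prompt is exactly the open question described above.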

Currently, we are building a special page where you enter an entity ID (both items and properties are possible) and we generate a table with the constraint report. Right now, we do this based on the constraints that were defined on the talk pages. To be able to do this on a special page in reasonable time, and particularly on live data, we parsed every talk page and built a database table containing every constraint with its corresponding parameters.
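
As a rough sketch of that parsing step (not our actual parser), the following Python snippet extracts constraint templates from talk-page wikitext and turns each into one row. The template syntax shown, the row layout and the property ID P999 are assumptions made for this example. Normalizing underscores takes care of hand-written variants such as Single_value.

```python
import re

# Assumed template syntax, for illustration only:
#   {{Constraint:Single value}}
#   {{Constraint:Type|class=Q5|relation=instance}}
TEMPLATE_RE = re.compile(r"\{\{\s*Constraint:([^|}]+)((?:\|[^}]*)?)\}\}")


def parse_constraints(property_id, wikitext):
    """Return one row (dict) per constraint template found in the wikitext."""
    rows = []
    for match in TEMPLATE_RE.finditer(wikitext):
        # Normalize hand-written variants such as "Single_value".
        name = match.group(1).replace("_", " ").strip()
        params = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip()
        rows.append({"property": property_id, "constraint": name, "params": params})
    return rows


# P999 is a placeholder property ID.
talk_page = """
{{Constraint:Single_value}}
{{Constraint:Type|class=Q5|relation=instance}}
"""
rows = parse_constraints("P999", talk_page)
```

A regular expression like this one does not handle nested templates, so it is only a starting point.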

In the end, the result of this check should be displayed right beside the statements when you visit a particular page, but this will take a while. Until then, we want to migrate the constraints to statements on properties, so that our special page can work without the table we generated from the talk pages.

And here, we need your help
There are several possible approaches to representing constraints with the statements-on-properties model, but after discussing them with several members of the Wikidata community, we agree with the proposal Ivan A. Krestinin made on the discussion page for property proposal. This approach requires creating many new properties and a handful of new items, and every property needs the approval of some community members. It would be really great if you could read this suggestion, maybe discuss it, and ideally, if you like the approach, give your approval for the suggested properties and maybe create them. We don't want to make the important decision of how to handle constraints in the future of Wikidata ourselves, but we really need them represented as statements on properties to continue our work and generate meaningful constraint reports that, in the end, hopefully improve the quality of the data in Wikidata.

More features to come
When everything we mentioned above works, we want to go one step further. In some cases (such as a Unique Value Constraint), it is hard to check for a violation in real time, but with most constraints this should be possible. We are planning to check nearly everything a user enters as a statement against all constraints defined on that property, right when the user wants to save the statement. This prevents false data from getting into the system and helps to keep the data consistent.

We want to show the user a prompt listing the constraints this entry would violate and offer the opportunity to change it or add it as an exception to the constraint (maybe this should only be possible after adding a reference...). When the entry violates a constraint that connects two items (Symmetric/Inverse/Target required claim/...), the user could also get a prompt suggesting to (semi-)automatically add a statement to the other item.
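
A check at save time could look roughly like the following Python sketch. It is an illustration under assumptions, not a design decision: the two constraint types, their parameter names and the checker functions are chosen only for this example.

```python
import re


def check_format(value, params):
    """The value must match a regular expression completely."""
    return re.fullmatch(params["pattern"], value) is not None


def check_one_of(value, params):
    """The value must be one of an explicit set of allowed values."""
    return value in params["values"]


# Hypothetical registry: constraint name -> checker function.
CHECKERS = {"Format": check_format, "One of": check_one_of}


def violated_constraints(value, constraints):
    """Return the names of all constraints the value would violate."""
    violations = []
    for name, params in constraints:
        checker = CHECKERS.get(name)
        # Constraints without a real-time checker are skipped here.
        if checker is not None and not checker(value, params):
            violations.append(name)
    return violations


constraints = [
    ("Format", {"pattern": r"\d{4}"}),
    ("One of", {"values": {"1954", "1955"}}),
]
ok = violated_constraints("1954", constraints)
bad = violated_constraints("19x4", constraints)
```

The resulting list is what the prompt would show. Constraints that cannot be checked in real time would simply have no entry in the registry.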

As you see, there are many questions on how to integrate violations in the UI and in the workflow, and we are really looking forward to getting in contact with you. There are many shades of automation and we want to find a way that works well both for the system and for everyone working with it.

External Validation
Our second project is about comparing items in Wikidata to their respective records in external databases to verify their statements.

The vision is that, with enough external data, we can compare and validate the majority of all statements to ensure and improve data quality.

Another part of that project might be enriching statements with references to the external database, but we have not yet elaborated this approach sufficiently.

For a start, we have implemented a cross-check with the GND, which, among other things, holds information about persons such as date and place of birth or family connections.

(That might not seem much, but little information from many databases will result in much information ... that's the plan :).)

The data for the comparison comes from data dumps that are kindly provided by the databases' operators and come in various forms and formats. The GND dump, for example, is in XML format.

In preparation for the cross-check we analyse each dump to fill a table in our database that holds all our external data.

Each row holds one data-object containing one record from the dump that relates to one item.

For the GND dump example, we write a new row for each person's record, including an XML blob and its GND identifier.
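
A minimal sketch of such a table, using Python's built-in sqlite3 module, might look like this. The table name, column names, the identifier and the XML snippet are all made up for this example and do not reflect our actual schema or the real GND format.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE external_data (
        database_name TEXT NOT NULL,  -- e.g. "GND"
        external_id   TEXT NOT NULL,  -- identifier within that database
        record_blob   TEXT NOT NULL,  -- raw record from the dump (XML here)
        PRIMARY KEY (database_name, external_id)
    )
    """
)


def store_record(database_name, external_id, record_blob):
    """Write one dump record as one row."""
    conn.execute(
        "INSERT INTO external_data VALUES (?, ?, ?)",
        (database_name, external_id, record_blob),
    )


def load_record(database_name, external_id):
    """Return the stored blob for an identifier, or None if it is unknown."""
    row = conn.execute(
        "SELECT record_blob FROM external_data"
        " WHERE database_name = ? AND external_id = ?",
        (database_name, external_id),
    ).fetchone()
    return row[0] if row else None


# Placeholder identifier and record.
store_record("GND", "000000000", "<record><person/></record>")
blob = load_record("GND", "000000000")
```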

The cross-check takes an item ID Qxy and first looks for external identifiers on that item that we have already implemented.

For each identifier, we have to work out and write down a mapping that associates validatable properties with the path to the corresponding entry in the data-object. The mapping is expressed in a suitable query language (for XML we use XPath, ...).

So for each validatable statement (consisting of a validatable property and a value), we pick out the corresponding entry.

Then we can compare the value of the statement to the value from the data-object, collect all results and produce an output.

For example, for Angela Merkel with the statements "date of birth: 17 July 1954" and "spouse: Joachim Sauer" we select the corresponding values "Exakte Lebensdaten (exact dates of birth and death): 17.07.1954-" and "Ehemann (husband): Sauer, Joachim" in the data-object, compare the values and write an output that says that all statements could be validated successfully.
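
The steps above can be sketched in Python with the standard library's ElementTree, which supports a subset of XPath. The XML structure, the element names and the mapping are invented for this example and are not the real GND format. The sketch also assumes that the statement values have already been normalized to the external notation (name order, date format), which a real comparison would have to do first.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-object; the real GND records look different.
RECORD = """
<record>
  <person>
    <dateOfBirth>1954-07-17</dateOfBirth>
    <spouse>Sauer, Joachim</spouse>
  </person>
</record>
""".strip()

# Mapping: validatable property -> path to the entry in the data-object.
MAPPING = {
    "date of birth": "./person/dateOfBirth",
    "spouse": "./person/spouse",
}


def cross_check(statements, record_xml, mapping):
    """Compare each validatable statement with its external entry."""
    root = ET.fromstring(record_xml)
    results = {}
    for prop, value in statements.items():
        path = mapping.get(prop)
        if path is None:
            continue  # property is not validatable with this database
        node = root.find(path)
        if node is None or node.text is None:
            results[prop] = "no external value"
        elif node.text.strip() == value:
            results[prop] = "match"
        else:
            results[prop] = "mismatch"
    return results


# Values are assumed to be normalized to the external notation already.
statements = {"date of birth": "1954-07-17", "spouse": "Sauer, Joachim"}
results = cross_check(statements, RECORD, MAPPING)
```

Properties without a mapping entry are left out of the result, so the output only covers statements that can actually be validated against this database.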

We have not yet decided which final form this output will have. To test our tool, we have written a special page that takes an item ID and produces tabular output, but we actually want to implement either a live tool that is integrated into an item's page (see mock-ups) or a cron job that may be executed monthly.

Either way, mismatches found between the values could be treated and displayed as "cross-check mismatch" constraint violations (see other project).

For further information about us or our projects, please visit our GitHub wiki. We appreciate your input! Please write to [mailto:tamara.slosarek@student.hpi.de tamara.slosarek@student.hpi.de].