Extension:Wikibase Quality Extensions/old

From mediawiki.org

The Wikibase Quality Extensions consist of three extensions that are intended to ensure data quality in Wikibase.

Quality is the base extension that handles potentially incorrect data found by the other two extensions (view on Gerrit).

Constraints checks constraints and will work on statements on properties (no longer on templates) (view on Gerrit).

External Validation compares and validates data with external databases (view on Gerrit).

Instructions can be found on the GitHub mirrors (Quality, Constraints, External Validation).

The extensions are developed in a student bachelor's project called Wikidata Quality at the Hasso Plattner Institute in Potsdam, Germany, in collaboration with the Wikidata team.

If you have any questions or anything else you want to let us know, please write to tamara.slosarek@student.hpi.de.

Improving Constraint Reports

Current constraint report

The related community discussion can be found here.

When we started working on this project, the only way to define constraints was on the talk page of a property, and only by editing templates. This is neither user-friendly nor easy to maintain. On the contrary: during our studies, we found more than 4000 hand-written constraints, some of which don't exactly match the template definitions, e.g. Single_value instead of Single value. It is very difficult for a bot to check the data against the corresponding constraints when some of them are written incorrectly.

So this is the status quo: there are constraints on the talk pages of properties, and a bot checks the data it finds in Wikidata dumps against those constraints to generate the constraint reports. While this definitely adds value, the reports aren't nice to read, the underlying constraints are a pain to maintain, and checking against a dump is of course not as accurate as checking against live data.

Luckily, it is now possible to create statements on properties. Based on that feature, we are planning to migrate the constraints from the talk pages, enabling us to generate meaningful constraint reports right where they are needed.

Vision of how a violation could be represented

Our vision for this project is that every user who visits an item page gets a small indicator when there is a constraint violation. Clicking on it should show a short text explaining which constraint has been violated and offering the opportunity to fix it or to add it as an exception if the user is really sure that this is not a violation.

We also want to assist the user in correcting the violation. When, for example, the symmetric constraint is violated, the user should get a prompt asking them to add the missing statement to the other item. Of course, we have to make sure that this doesn't cause errors to spread. Therefore, we are considering filling in the missing value automatically only when there is a reference proving its correctness. This would have the nice side effect that the number of references in the system grows over time.
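As a rough illustration of this idea (not the extension's actual code), the following sketch finds statements whose reverse statement is missing and marks a suggestion as eligible for automatic filling only when the original statement carries at least one reference. The flat dictionary of `(subject, property, value)` triples, the function name `suggest_symmetric_fixes`, and the placeholder IDs are assumptions made for the example.

```python
def suggest_symmetric_fixes(statements, property_id):
    """Suggest reverse statements that would satisfy a symmetric constraint.

    `statements` maps (subject, property, value) triples to a list of
    references. This flat layout is hypothetical and only used for
    illustration; it is not the Wikibase data model.
    """
    suggestions = []
    for (subject, prop, value), references in sorted(statements.items()):
        if prop != property_id:
            continue
        if (value, prop, subject) in statements:
            continue  # the reverse statement already exists
        suggestions.append({
            "add": (value, prop, subject),
            # Only offer automatic filling when a reference backs the claim.
            "auto_fill": bool(references),
            "references": list(references),
        })
    return suggestions


# Placeholder IDs, not real Wikidata entities.
example = {
    ("Q_A", "P_spouse", "Q_B"): ["ref-1"],  # referenced, reverse missing
    ("Q_C", "P_spouse", "Q_D"): [],         # reverse exists
    ("Q_D", "P_spouse", "Q_C"): [],
    ("Q_E", "P_spouse", "Q_F"): [],         # unreferenced, reverse missing
}
for suggestion in suggest_symmetric_fixes(example, "P_spouse"):
    print(suggestion)
```

Unreferenced statements still produce a suggestion, but with `auto_fill` set to `False`, so the user would have to confirm them manually.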

Currently, we are building a special page where you enter an entity ID (both items and properties are possible) and we generate a table with the constraint report. Right now, we do this based on the constraints that were defined on the talk pages. To be able to do this on a special page in reasonable time, and particularly on live data, we parsed every talk page and built a database table containing every constraint with its corresponding parameters.

In the end, the result of this check should be displayed right beside the statements when you visit a particular page, but this will take a while. Until then, we want to migrate the constraints to statements on properties, so that our special page can work without the table we generated from the talk pages.
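To give an idea of what such a parsing step involves, here is a minimal sketch. It assumes constraint templates of the shape `{{Constraint:Name|key=value|...}}` without nested templates, and it treats `Single_value` and `Single value` as the same name. The function names and the row layout are hypothetical and do not describe the project's actual parser or table schema.

```python
import re

# Assumed template shape: {{Constraint:Name|key=value|...}}, no nesting.
TEMPLATE_RE = re.compile(r"\{\{\s*Constraint:([^|{}]+)((?:\|[^{}]*)?)\}\}")


def normalize_name(raw):
    """Map spelling variants such as 'Single_value' to 'Single value'."""
    return " ".join(raw.replace("_", " ").split())


def parse_constraints(property_id, wikitext):
    """Return one row per constraint template found in the talk-page text."""
    rows = []
    for match in TEMPLATE_RE.finditer(wikitext):
        parameters = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                parameters[key.strip()] = value.strip()
        rows.append({
            "property": property_id,
            "constraint": normalize_name(match.group(1)),
            "parameters": parameters,
        })
    return rows


# "P_example" is a placeholder property ID.
talk_page = "{{Constraint:Single_value}}\n{{Constraint:Type|class=Q5|relation=instance}}"
for row in parse_constraints("P_example", talk_page):
    print(row)
```

Each resulting row could then be written to a database table keyed by property, so that a check on live data does not need to parse talk pages again.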

And here, we need your help

There are several possible approaches to representing constraints with the statements-on-properties model, but after discussing them with several members of the Wikidata community, we agree with the proposal Ivan A. Krestinin made on the discussion page for property proposals. This approach requires many new properties and a handful of new items to be created, and every property needs the approval of some community members. It would be really great if you could read this suggestion, maybe discuss it, and ideally, if you like the approach, give your approval for the suggested properties and maybe create them. For an extended summary of what the new constraints would look like, which entities have to be created, and especially how they should or could be used, you can read this page. If you have any questions or suggestions, feel free to edit this page.

Constraint check while editing

We don't want to be the ones who make the important decision on how Wikidata handles constraints in the future, but we really need them represented as statements on properties to continue our work and generate meaningful constraint reports that, in the end, hopefully improve the quality of the data in Wikidata.

Guide to fight inconsistency

More features to come

When everything we mentioned above works, we want to go one step further. In some cases (for example, a Unique Value Constraint), it is hard to check for a violation in real time, but for most constraints this should be possible. We are planning to check nearly everything a user enters as a statement against all constraints defined on the property, right when the user wants to save the statement. This prevents incorrect data from getting into the system and helps keep the data consistent.

We want to show the user a prompt listing the constraints this entry would violate, with the opportunity to change the entry or add it as an exception to the constraint (maybe this should only be possible after adding a reference...). When the entry violates a constraint that connects two items (Symmetric/Inverse/Target required claim/...), the user could also get a prompt suggesting to (semi-)automatically add a statement to the other item.
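A save-time check of this kind could look roughly like the sketch below. It is only an illustration under several assumptions: the constraint records are plain dictionaries, only two locally checkable constraint types ("Format" and "One of") are implemented as examples, and constraints that need a global lookup (such as a unique value check) are simply skipped. None of the names reflect the extension's real interfaces.

```python
import re


def check_format(value, parameters):
    """The value must match the given regular expression completely."""
    return re.fullmatch(parameters["pattern"], value) is not None


def check_one_of(value, parameters):
    """The value must be one of an explicit list of allowed values."""
    return value in parameters["values"]


# Hypothetical registry of constraints that can be checked without a lookup.
CHECKERS = {"Format": check_format, "One of": check_one_of}


def violations_on_save(item_id, value, constraints):
    """Return the names of all constraints the new value would violate."""
    violated = []
    for constraint in constraints:
        checker = CHECKERS.get(constraint["name"])
        if checker is None:
            continue  # not checkable in real time in this sketch
        if item_id in constraint.get("exceptions", ()):
            continue  # the item was registered as an exception
        if not checker(value, constraint["parameters"]):
            violated.append(constraint["name"])
    return violated


# Placeholder constraint definitions and item IDs.
constraints = [
    {"name": "Format", "parameters": {"pattern": r"\d{4}"}, "exceptions": ["Q_EXC"]},
    {"name": "One of", "parameters": {"values": ["1954", "1955"]}},
    {"name": "Unique value", "parameters": {}},
]
print(violations_on_save("Q_A", "19x4", constraints))
```

The returned list is what a prompt could present to the user before the statement is saved.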

As you see, there are many questions on how to integrate violations in the UI and in the workflow, and we are really looking forward to getting in contact with you. There are many shades of automation and we want to find a way that works well both for the system and for everyone working with it.

External Validation

Mock up Live Tool: Validation successful

Our second project is about comparing items in Wikidata to their respective records in external databases in order to verify their content.

The vision is that, with enough external data, we can compare and validate the majority of all statements to ensure and improve data quality.

Another part of that project might be the enrichment of statements with references to the external database, but we have not yet worked out this approach in sufficient detail.

For a start, we have implemented a cross-check with the GND, which - among other things - holds information about persons, such as the date and place of birth or family connections.

(That might not seem much, but little information from many databases will result in much information ... that's the plan :).)

The data for the comparison comes from data dumps that are kindly provided by the databases' operators and come in various forms and formats.

Mock up Live Tool: Mismatch found

The GND dump, for example, is in XML format.

In preparation for the cross-check we analyse each dump to fill a table in our database that holds all our external data.

Each row holds one data object containing one record from the dump that relates to one item.

For the GND dump example, we write a new row for each person's record, including an XML blob and its GND identifier.

The cross-check takes an item ID (Qxy) and first looks for external identifiers that we have already implemented.

For each identifier, we have to work out and write down a mapping that associates validatable properties with the path to the corresponding entry in the data object. The mapping is expressed in a suitable query language (for XML we use XPath, ...).

So for each validatable statement (consisting of a validatable property and a value), we pick out the corresponding entry.

Then we can compare the value of the statement to the value from the data-object, collect all results and produce an output.

For example, for Angela Merkel with the statements "date of birth: 17 July 1954" and "spouse: Joachim Sauer" we select the corresponding values "Exakte Lebensdaten (exact dates of birth and death): 17.07.1954-" and "Ehemann (husband): Sauer, Joachim" in the data-object, compare the values and write an output that says that all statements could be validated successfully.
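The following sketch walks through this example with Python's standard `xml.etree.ElementTree` module, which supports a subset of XPath. The record layout (`lifespan`, `husband`), the mapping, and the normalizer functions are invented for illustration; the real GND dump uses a different schema, and the project's mapping is not shown in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real GND schema is different.
RECORD = """
<record>
  <lifespan>17.07.1954-</lifespan>
  <husband>Sauer, Joachim</husband>
</record>
"""


def normalize_date(text):
    """Turn the birth part of '17.07.1954-' into '1954-07-17'."""
    day, month, year = text.split("-")[0].split(".")
    return f"{year}-{month}-{day}"


def normalize_person_name(text):
    """Turn 'Sauer, Joachim' into 'Joachim Sauer'."""
    family, given = [part.strip() for part in text.split(",", 1)]
    return f"{given} {family}"


# Mapping from a validatable property to (path in the record, normalizer).
MAPPING = {
    "date of birth": ("./lifespan", normalize_date),
    "spouse": ("./husband", normalize_person_name),
}


def cross_check(statements, record_xml, mapping):
    """Compare each validatable statement with its external counterpart."""
    root = ET.fromstring(record_xml.strip())
    results = {}
    for prop, value in statements.items():
        if prop not in mapping:
            continue  # no mapping, so the property is not validatable
        path, normalize = mapping[prop]
        external = root.findtext(path)
        if external is None:
            results[prop] = "missing in external record"
        elif normalize(external) == value:
            results[prop] = "match"
        else:
            results[prop] = "mismatch"
    return results


statements = {"date of birth": "1954-07-17", "spouse": "Joachim Sauer"}
print(cross_check(statements, RECORD, MAPPING))
```

Keeping the path and the normalizer together in the mapping means that supporting another external database would mostly require a new mapping rather than new comparison logic.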


We have not yet decided which final form this output will have. To test our tool, we have written a special page that takes an item ID and produces a tabular output, but we actually want to implement either a live tool that is integrated into an item's page (see mock-ups) or a cron job that may be executed monthly.

Either way, any mismatches found between the values could be treated and displayed as "cross-check mismatch" constraint violations (see the other project).