Wikidata annotation tool
This page is currently a draft.
An annotation tool that extracts information from books and feeds it to Wikidata
Announcement of Proposal:
Project Completion Report
Name and Contact Information
Web Page / Blog / Microblog:
- Roorkee, Uttarakhand, India
Typical Working Hours:
- 10:00-13:00, 15:30-19:00, 22:00-03:00 (IST); 04:30-07:30, 10:00-13:30, 16:30-21:30 (UTC)
The project rests on the belief that user interaction with Wikidata can be improved, and that sharing and saving data can be made much easier, by a tool that lets the user highlight a statement, provides a GUI to fix its structure, and then feeds it to Wikidata. Wikidata is a free knowledge base that can be read and edited by humans and machines alike. It centralizes access to structured data and manages it so that every piece of data is easily available and accessible. With the plugin, people can save their important notes and quotes directly on Wikidata, making them more accessible to everyone.
Statements, or annotations, link web data together and bind it into one whole. Items, properties and values, which are worthless without their interconnections, are brought to life through these statements. The tool aims to help people create annotations, thereby gluing the web of data together and enriching it with a tremendous amount of knowledge. The project is needed because things must be linked together continuously, so that this network of data becomes more and more valuable as more and more people annotate.
- You are at home, reading a book on Wikisource, and you want to take notes on the important points. You can annotate important quotes and data and feed them, together with their source, directly into the Wikidata knowledge base. Furthermore, later readers of the book will be able to see your notes, which saves them time. All of this can be done just by activating the plugin.
- Imagine an office scenario. You are attending a presentation or seminar, and an important fact or data point is shared, e.g. your national statistical institute has just released the latest population data on its website. You can annotate it, click, and it is on Wikidata.
- You are reading the news in the browser on your tablet, and a new prime minister is being nominated. You can select the relevant text and insert this information into Wikidata.
- Given a statement from Wikidata (or another source), we can use this tool to mark up a reference and import that reference into Wikidata. This could help provide references for the millions of statements (claims) that currently don't have one. The more people annotate with this tool, the more references are added to Wikidata, so many claims can be converted into proper statements.
Information about project
Glossary of Wikidata terms used:
- Item: A page in the Wikidata main namespace representing a real-life topic, concept, or subject. Items are identified by a prefixed ID, by a sitelink to an external page, or by a unique combination of multilingual label and description.
- Property: A descriptor of a value for a particular item. In other words, it is an attribute of an item.
- Statement: A piece of data about an item, recorded on the item's page. A statement consists of a claim (a property-value pair such as "Season: Winter" about an item, together with optional qualifiers), supported by optional references (giving the source for the claim).
- Claim: Simply a statement without references.
- Value: A piece of information about an item that says something about one of its properties.
- Qualifier: A part of the claim that says something about that specific claim, often in a descriptive way.
The accompanying picture illustrates the glossary terms above, using an item named London.
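The glossary terms can also be made concrete with a small data model. The sketch below is illustrative only: the class and field names are assumptions made for this example and are not the actual Wikibase data model. The IDs used (Q84 for London, P17 for "country", Q145 for the United Kingdom) are meant as an example and should be checked against Wikidata before being relied on.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Reference:
    """Source backing a claim, e.g. a URL plus the quoted passage."""
    source_url: str
    quotation: Optional[str] = None


@dataclass
class Claim:
    """A property-value pair about an item, with optional qualifiers."""
    property_id: str
    value: str
    # Qualifiers keyed by property ID (a simplification for this sketch).
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Statement:
    """A claim supported by optional references."""
    claim: Claim
    references: List[Reference] = field(default_factory=list)

    def is_referenced(self) -> bool:
        # Without references, the statement is only a bare claim.
        return len(self.references) > 0


@dataclass
class Item:
    """A Wikidata item identified by a prefixed ID and a label."""
    item_id: str
    label: str
    statements: List[Statement] = field(default_factory=list)


# Example: "London (Q84) - country (P17) - United Kingdom (Q145)".
london = Item(item_id="Q84", label="London")
country = Statement(Claim(property_id="P17", value="Q145"))
london.statements.append(country)

bare_claim = not country.is_referenced()  # no reference attached yet
country.references.append(Reference("https://example.org/some-source"))
```

In this model a claim turns into a referenced statement as soon as one `Reference` is attached, which is the transition the tool is meant to make easy.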
Now let's come to Pundit. Pundit creates annotations as follows: a person selects a sentence, which opens a triple composer with the sentence as the subject of the annotation. The predicate can be selected from a preloaded list, while the objects are fetched from various servers such as DBpedia. This procedure creates a triple; further triples can be composed to add more information about the statement, such as one for references. Finally, all these triples are saved to the Pundit annotation server as an annotation based on the RDF model.
How will it work?
The following scheme shows in detail how the extension will work:
- First, we will use the API to check whether the user is logged in and, if not, redirect to the login page. The user can still annotate text anonymously, just as an anonymous user can edit pages on MediaWiki.
- I will package Pundit, integrated with the Wikidata vocabulary (fetched from Wikidata as needed) and selectors, and with a whole new GUI (different from the existing Pundit GUI), as a browser plugin and as a bookmarklet.
- I will provide a GUI so that the user can annotate text. Note: Pundit already provides a GUI; we will alter it to suit our needs.
- Next, the interface should propose to:
- choose a subject (i.e. an item); by default it will be the sentence the user highlighted.
- choose a predicate (i.e. a property)
- choose an object (i.e. a data value or statement)
- The proposed predicate should already exist on Wikidata; if it does not, we will present the user with an interface titled:
- 'Can't find what you are looking for? Propose a property', and then we will redirect the user to the property proposal page (a page where you can propose new properties for Wikidata). After this step, the annotation has become a claim.
- In the next step we will gather the sources of the annotation, such as the website URL, the book's name (on Wikisource) and more. If we can't find sources, we will provide an interface for the user to enter them, so that the claim is converted into a statement through references.
- Pundit will analyze the annotation as subject, predicate and object, pack it as a statement and then save it on the Pundit server.
- The flow will be unidirectional: the user creates annotations, which are saved on the Pundit server and also synchronized with the Wikidata item's page.
- A further extension to this project could be bidirectional synchronization.
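The decision points in the steps above can be sketched as a small function. This is a minimal sketch under stated assumptions, not the planned implementation: `Annotation`, `next_step`, the step names and the set of known property labels are all hypothetical names introduced for illustration, and the real tool would obtain the list of existing properties from Wikidata.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical names for the stages an annotation passes through.
INCOMPLETE = "incomplete"
PROPOSE_PROPERTY = "propose_property"
ASK_FOR_SOURCE = "ask_for_source"
READY_AS_STATEMENT = "ready_as_statement"


@dataclass
class Annotation:
    subject: str                      # by default the highlighted sentence
    predicate: Optional[str] = None   # label of the chosen property
    obj: Optional[str] = None         # chosen data value
    sources: List[str] = field(default_factory=list)


def next_step(annotation: Annotation, known_properties: Set[str]) -> str:
    """Return what the interface should do next for this annotation."""
    if annotation.predicate is None or annotation.obj is None:
        return INCOMPLETE
    if annotation.predicate not in known_properties:
        # "Can't find what you are looking for? Propose a property"
        return PROPOSE_PROPERTY
    if not annotation.sources:
        # Still only a claim; references are needed to make it a statement.
        return ASK_FOR_SOURCE
    return READY_AS_STATEMENT


# Assumed example data; in the real tool these labels would come from Wikidata.
known = {"country", "population"}
draft = Annotation(subject="London is in the United Kingdom.")
step_empty = next_step(draft, known)

draft.predicate, draft.obj = "country", "United Kingdom"
step_claim = next_step(draft, known)

draft.sources.append("https://example.org/some-source")
step_statement = next_step(draft, known)

unknown = Annotation("Some sentence.", predicate="favourite colour", obj="blue")
step_unknown = next_step(unknown, known)
```

Keeping the flow in one pure function like this makes each branch easy to cover with unit tests before any GUI exists.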
Tools to be used
- Wikibase API: I will use the Wikidata API for all interaction with Wikidata; it is currently stable and regularly maintained. I will interact with Wikidata item pages through this API. The API will also be used to retrieve items, values and properties from Wikidata and present them to the user, who can then create their own statements. The login status of the user will be checked through this API as well.
- QUnit: I will use the QUnit test framework provided by the jQuery Foundation to test my code against many scenarios. It is openly available software licensed under the MIT license.
- In some cases the code may be extended from existing external tools available at Wikimedia Labs.
- Create a plugin that provides the ability to annotate text and then feed the annotated text to Wikidata.
- The plugin must use existing properties available on Wikidata; if none is available, it asks for a new one to be created.
- The plugin must allow users to create items on the fly.
- Provide references by taking the source URL and quotations into consideration.
- Show users who have the plugin activated the annotations made by previous users on the same source.
- Create a MediaWiki extension for the same purpose, thus increasing the reach.
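The two Wikibase API uses named under "Tools to be used" (looking up existing properties and checking the login status) could look roughly like the sketch below. It only builds request URLs and interprets an already decoded JSON response, so no network access is involved. The function names are assumptions for this example; `wbsearchentities` and `meta=userinfo` are existing API modules, but the exact shape of the `anon` flag depends on the response format version, which the code treats defensively.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def property_search_url(text: str, language: str = "en", limit: int = 5) -> str:
    """Build a wbsearchentities URL that looks for properties matching text."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "type": "property",
        "limit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def userinfo_url() -> str:
    """Build the URL that returns information about the current user."""
    params = {"action": "query", "meta": "userinfo", "format": "json"}
    return API_ENDPOINT + "?" + urlencode(params)


def is_logged_in(userinfo_response: dict) -> bool:
    """Interpret a decoded meta=userinfo response.

    Assumption: anonymous users carry an "anon" key (an empty string or
    True, depending on the format version); logged-in users do not.
    """
    user = userinfo_response.get("query", {}).get("userinfo", {})
    anon = user.get("anon", False)
    return bool(user) and anon is False


search_url = property_search_url("country")
anonymous = {"query": {"userinfo": {"id": 0, "name": "127.0.0.1", "anon": ""}}}
registered = {"query": {"userinfo": {"id": 42, "name": "ExampleUser"}}}
```

The plugin itself would issue these requests from the browser; the Python form is used here only to keep the example short and testable.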
|1||April 22 - May 3||Familiarise myself with the mentors, the codebase and my project. As I have already been in contact with my mentors for a long time and have gone through the codebase, this task won't take long.|
|2||May 4 - May 16||Create a prototype plugin for the Annotator and write its initial code. Also write the login functionality.|
|3||May 17 - May 23||Write unit tests and test the initial code.|
|4||May 23 - June 5||Use Pundit's entity extractor to extract entities from a sentence and start altering the GUI provided by the existing Pundit base. Consult the Wikimedia Design team about design improvements.|
|5||June 5 - June 15||Write more unit tests and rigorously test the GUI and the extension across various scenarios. Fix any bugs found during testing.|
|6||June 15 - June 25||Use the Wikidata API to retrieve existing properties for the selected text and show suggestions.|
|June 23||Midterm Evaluation|
|7||June 25 - July 10||Pass the statements (subject-predicate-object) to Wikidata and save them on item pages using the Wikidata API. Also save them on the Pundit server to show later users what was annotated before them.|
|8||July 10 - July 19||Create a bot that regularly fetches annotations from the Pundit server and feeds them to Wikidata.|
|9||July 19 - July 22||Brush up documentation, add comments and package the plugin.|
|10||July 22 - August 1||Write unit tests and test repeatedly. Fix any bugs found during tests. Package the plugin for public use.|
|11||August 2 - August 15||Find and fix all remaining bugs and clean up the extension.|
|August 18||Firm pencils-down date.|
|12||August 21||Submit the extension to Google and launch the plugin on an initial scale with the help of MediaWiki and my mentors.|
Details on Timeline
- Task 1:
- Task 2:
- Since I am creating a browser plugin that can also be easily saved as a bookmark, it won't take much time to implement, because Pundit already has the functionality of packaging itself as a bookmarklet; most of the time will be spent setting up the plugin. The initial code will therefore focus on setting up Pundit to be in synchronization with MediaWiki. In this phase the login functionality through MediaWiki will also be implemented. Login functionality will be based on the data provided by the API, as explained here.
- Task 3:
- Writing unit tests and then testing the code is an essential and integral part of this project, so it will be done at many stages of the project. I will use jQuery's QUnit to test my code, and the test suite can be extended regularly as the code grows.
- Task 4:
- Task 5:
- Again, unit tests written with QUnit will exercise the whole code to find out whether anything is broken, improving the overall stability of the code. If any errors are found, they will be fixed during this period.
- Task 6:
- Task 7:
- Task 8:
- During this time I will create a bot based on PyWikipediaBot that will regularly fetch annotations from the Pundit server and feed them to Wikidata in a proper manner based on the RDF model. The subject of an annotation on the Pundit server will become the item, the predicate the property and the object the value, and these will be fed to the item's page on Wikidata using the Wikibase API.
- Task 9:
- Task 10:
- Again, write QUnit tests (QUnit is available from the jQuery Foundation) to test the whole code, and fix any bugs found that could break the plugin.
- Task 11:
- Launch the plugin to the public and actively fix any bugs they find.
- Task 12:
- Submit the source code to Google for final evaluation and officially launch the plugin for public use, thereby finishing the project.
- Optional Task:
- If time permits, I will create a MediaWiki extension for the plugin and host a test setup at Wikimedia Labs, thus increasing the reach of the project.
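The mapping described under Task 8 (subject to item, predicate to property, object to value) can be illustrated with a short sketch. It does not use PyWikipediaBot; instead it only builds the parameters for a `wbcreateclaim` request to the Wikibase API, and sending the request (as an authenticated POST with a valid token) is left out. The `Triple` type and the `create_claim_params` helper are hypothetical names, and the sketch assumes that the subject and object of the annotation have already been resolved to Wikidata item IDs.

```python
import json
from typing import Dict, NamedTuple


class Triple(NamedTuple):
    """An annotation triple whose parts are already resolved to Wikidata IDs."""
    subject: str    # item ID, e.g. "Q84"
    predicate: str  # property ID, e.g. "P17"
    obj: str        # item ID of the value, e.g. "Q145"


def create_claim_params(triple: Triple, csrf_token: str) -> Dict[str, str]:
    """Build wbcreateclaim parameters for a claim whose value is an item."""
    if not (triple.obj.startswith("Q") and triple.obj[1:].isdigit()):
        raise ValueError("object must be an item ID such as 'Q145'")
    # Item values are passed as a JSON object holding the numeric part of the ID.
    value = {"entity-type": "item", "numeric-id": int(triple.obj[1:])}
    return {
        "action": "wbcreateclaim",
        "entity": triple.subject,
        "property": triple.predicate,
        "snaktype": "value",
        "value": json.dumps(value),
        "token": csrf_token,
        "format": "json",
    }


# Placeholder token; a real one must be requested from the API after login.
params = create_claim_params(Triple("Q84", "P17", "Q145"), csrf_token="PLACEHOLDER")
```

A bot would run this mapping over each annotation fetched from the Pundit server and skip, or queue for review, any triple that cannot be resolved to existing items and properties.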
For me, it was and always will be: Sharing is Caring. This has always helped me get along well with the Wikimedia community. I will publish my completion reports on my blog weekly. All source code I write will be published to my GitHub repository and also pushed to a branch in the Pundit repository to support collaboration. I always try to stay online on IRC and reply to emails regularly, which helps me blend into the community. Testing and documentation will be added to the Wikitech Mail page. I am mostly available on #mediawiki, #wikimedia-dev and #wikidata during my working hours. I usually hang out with my mentors to discuss ideas, and I always post our discussions and questions on our Google group, which is free for everyone to join.
I have summer vacation from the end of April to mid-July. After that I will have to spend 28 hours per week on studies, which doesn't affect my working hours, so I think I will be able to complete my project in time, and I will continue developing it further once GSoC is completed. Coming from a remote village in the valleys of Himachal Pradesh, I love the idea of open source and believe that 'Sharing is Caring', and I hope that this idea will spread further through communities like MediaWiki and programs like GSoC.
I am highly passionate about open-source software and security. You can see my other open-source projects on my GitHub profile mentioned below. I assure dedication of at least 40 hours per week to the work and that I do not have any other obligations during the period of the program, with the obvious exception of regular academics. A major part of this duration is summer holidays where I’ll be working from my home. Also, if any part of the proposal is not clear, I'll be very happy to clarify.
Past Projects & Contributions
- GitHub profile.
- Built a web app for a local startup at IIT Roorkee, Roorkee Delivers.
- Created a code-sharing website, OpenCode.
- A web app that makes matches on the basis of common interests between two people.
- jQuery plugin for shopping cart ( jCart ) and cookies ( jCookie ).
- Created an application for alumni to share their experiences at IIT Roorkee.
- Contributions to MediaWiki (Gerrit repo).
- I have mostly worked on improving the Multimedia Viewer extension.
- I have also contributed to the open-source project Moodle.
- Worked on our own lab music player in Node.js, based on Play by GitHub.
Any Other Info
- A simple UI model under construction can be seen at this link.
This involves two simple tasks:
- Showcase a simple Pundit setup webpage.
- Make a simple web app that uses the Wikidata API.
Source code can be found at this link.