User:Tuxilina/OPWproposal

Wikipedia article translation metrics

 * Public URL: Wikipedia article translation metrics URL
 * Announcement: Announcement on Wikitech-l mailing lists

Name and contact information

 * Name: Roxana Necula
 * Email: necula.roxana91@gmail.com
 * Blog: http://www.tuxilina.com
 * IRC or IM networks/handle(s): tuxilina
 * Resume (LinkedIn): https://www.linkedin.com/in/roxananecula
 * Location: Bucharest, Romania
 * Typical working hours: 10:00 - 18:00 EET
 * GitHub account: https://github.com/nroxana

Synopsis
This project consists of finding different editing patterns of users, analysing the number of speakers of a language, the penetration of broadband internet connections in the area where the language is spoken, etc. All this can be done by logging information about the users, such as: the languages the user speaks (whether the user is multilingual or not), the location from which the user makes the change, the level of bilingualism in that region, the percentage of articles translated into different languages by the same user, etc.

The foundation behind Wikipedia has characterized the encyclopedia as trying to provide access to "the sum of all human knowledge." This can be approached by reaching a uniform percentage of translated articles across all languages: every article in English should be translated into French, Italian, etc., and vice versa, every article in French, Italian, etc. should also exist in English.

These findings will contribute to a better understanding of content development in Wikipedias in different languages and to the development of the ContentTranslation project.


 * Mentors: Amir E. Aharoni and Joel Sahleen

Deliverables
Please describe the details and the timeline of the work you plan to accomplish on the project you are most interested in (discuss these first with the mentor of the project).

The main stages of this project are:
 * 1) gathering the required data through the IRC logger
 * 2) based on the data acquired, gathering additional data from the database about the users who translated the articles, such as region, level of bilingualism, whether the user is multilingual or not, number of translated articles, etc.
 * 3) generating local XML / JSON files from the data acquired in the previous steps
 * 4) sending the XML / JSON files to the Carrot clustering engine to build detailed search results
 * 5) modeling the search results with graphical techniques, using the R language
 * 6) creating documentation based on the final data sets generated.
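
As a rough illustration of stage 3, the sketch below serializes a list of translation records to JSON. It is only a minimal example under stated assumptions: the `TranslationRecord` class and all of its field names are hypothetical and were chosen for this sketch; the real files would follow whatever layout the mentors and the clustering engine require.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TranslationRecord:
    # Hypothetical fields, chosen only for this sketch.
    user: str
    source_language: str
    target_language: str
    title: str


def records_to_json(records):
    """Serialize records as a JSON array, one object per translated article."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)


# Made-up sample record, not real data.
records = [TranslationRecord("ExampleUser", "en", "ro", "Example article")]
print(records_to_json(records))
```

Keeping one flat object per article should make it easy to regenerate the same data as XML if that format turns out to be more convenient.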

Participation
The progress will be publicly available at OPW Internship Report and it will contain a weekly Plan & Progress section.

The Plan will be a brief paragraph on my weekly goals (written at the start of the week), and the Progress section (written at the end of the week) will consist of a detailed report on the tasks and microtasks done that week.

I also plan to lurk on IRC, as I have done so far, ask for help there and/or on the wikitech-l mailing list, and communicate with the mentors via Skype or email.

The code will be available on my GitHub account (in the spirit of open source).

About you
I have completed my bachelor's degree studies at the University Politehnica of Bucharest, Romania, but have not yet submitted the final thesis. My plan is to make this project my final thesis. I am majoring in Computer Science at the Faculty of Automatic Control and Computer Science. As for programming experience, I was a Software Engineer at Electronic Arts Romania between June 2013 and April 2014. Since April 2014 I have been working as a Junior Web Developer at a startup company.
 * Education completed or in progress:

I heard about the GSoC / OPW programs from friends and colleagues at university. I think that the OPW program in particular is an efficient way to create diversity among programmers and contributors. What I like about this organization is that I am able to contribute (with or without being accepted to OPW) to one of the most popular websites, which hosts the largest and most popular general reference work. Every bit of code matters, and I believe that this project will help me gain insight into such a large and complex project.
 * How did you hear about this program?

If I am accepted, I will suspend my current job for 3 months, until the end of the program. I have also planned a 2-week Christmas vacation. :D So basically I will be free until the beginning of March, except for the Christmas vacation.
 * Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

Past experience
So far I have submitted a few patches to fix a bug related to improving the user feedback in the HTML-detected upload error. This helped me get used to the MediaWiki core and the code review process.
 * Please describe your experience with any other FOSS projects as a user and as a contributor:

The microtask suggested on the project page was to find different ways in which users mark articles as translated and analyze them.
 * Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them:

What I did first was to read the 'Multilinguals and Wikipedia editing' article and another paper on measuring self-focus bias to familiarize myself with the main concept.

Then I ran a small Java program that logs the Wikipedia IRC raw feed to see when an article is translated (or created). To choose which Wikipedias to follow, I used wikistats and took the first 20 Wikipedias listed there. This helped me grasp the main idea and define the tasks needed to complete this project.
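
To illustrate the kind of parsing such a logger needs, here is a minimal sketch in Python (not the Java program mentioned above). It assumes that a raw feed line, once the mIRC colour codes are stripped, has the shape `[[Title]] flags url * user * (+size) comment`, and that the flag `N` marks a newly created page. The sample line and the `parse_rc_line` helper are made up for this example.

```python
import re
from urllib.parse import urlsplit

# mIRC colour and formatting control codes (assumed decoration of the raw feed).
CONTROL_CODES = re.compile(r"\x03\d{0,2}(?:,\d{1,2})?|[\x02\x0f\x16\x1d\x1f]")

# Assumed line shape: [[Title]] flags url * user * (+size) comment
LINE = re.compile(
    r"^\[\[(?P<title>.+?)\]\]\s+(?P<flags>\S*)\s+(?P<url>\S+)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]\d+)\)\s?(?P<comment>.*)$"
)


def parse_rc_line(raw):
    """Return a dict describing one feed line, or None if it does not match."""
    match = LINE.match(CONTROL_CODES.sub("", raw).strip())
    if match is None:
        return None
    fields = match.groupdict()
    host = urlsplit(fields["url"]).hostname or ""
    return {
        "title": fields["title"],
        "language": host.split(".")[0],    # e.g. "ro" from ro.wikipedia.org
        "is_new": "N" in fields["flags"],  # assumed meaning of the N flag
        "user": fields["user"],
        "delta": int(fields["delta"]),     # size change in bytes
        "comment": fields["comment"],
    }


# Made-up sample line, only approximating the real feed.
SAMPLE = (
    "\x0314[[\x0307Example article\x0314]]\x034 N\x0310 "
    "\x0302https://ro.wikipedia.org/w/index.php?oldid=1&rcid=2\x03 \x035*\x03 "
    "\x0303ExampleUser\x03 \x035*\x03 (+1234) \x0310Created by translating a page\x03"
)
print(parse_rc_line(SAMPLE))
```

Whether a new page is actually a translation cannot be read from the flags alone; that would have to come from the edit comment or from the ways users mark articles as translated, which is what the microtask investigates.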

Microtask
Besides the previous work, which was based on searching for and finding the basics of the Wikipedia metrics used through data dumps in IRC, I have also identified a few interesting points in the metrics used by the Content Translation extension and elaborated the main idea of the implementation.

I chose two main types of core metrics: Quantity of content and Evolution in time.

Number of articles created. Articles created per week, per user, per language.
This can be logged using an EventLogging schema. When the article is successfully published, a hook can be used to trigger an event that logs this information.

It is important to specify how this data is logged, so that the stored data contains the articles created per week, per user, per language (that is, the number of articles written in a given language by one user over a whole week).
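
The aggregation described above can be sketched as follows. This is only an illustration of the analysis side, assuming each logged event carries a publication date, a user name and a language code; the `weekly_counts` helper and the sample events are hypothetical, and weeks are taken to be ISO weeks.

```python
from collections import Counter
from datetime import date


def weekly_counts(events):
    """Count published articles per (ISO year, ISO week, user, language)."""
    counts = Counter()
    for published_on, user, language in events:
        iso_year, iso_week, _ = published_on.isocalendar()
        counts[(iso_year, iso_week, user, language)] += 1
    return counts


# Made-up events: (publication date, user, language).
events = [
    (date(2014, 12, 1), "ExampleUser", "ro"),
    (date(2014, 12, 3), "ExampleUser", "ro"),
    (date(2014, 12, 9), "ExampleUser", "ro"),
]
for key, count in sorted(weekly_counts(events).items()):
    print(key, count)
```

Keying on the ISO year as well as the week number avoids mixing up the same week number from different years.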

Possible schema:

Description: "EventLogging schema used for the number of articles created (per week, per language, per user)"

Properties:

Length of articles
This can also use an EventLogging schema. Because we only care about the size of successfully published articles, we can again use a hook triggered on the article publishing event. Then we can log the size for later analysis.

Possible schema:

Description: "EventLogging schema used for measuring created articles size"

Properties: For this type of metric, the  magic word can be used in the logging function.

Time spent in creating a translation
Possible schema:

Description: "EventLogging schema used for measuring time spent in creating a translation"

Properties:

Short summary
For the 'Number of articles created' and 'Length of articles' measurements, PHP can be used for defining the logging functions, implemented by developers, and the results would then be used by analysts. Unlike these two, the 'Time spent in creating a translation' metric requires JavaScript for measuring the duration.
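
The duration measurement itself is simple arithmetic on two timestamps. The sketch below shows it in Python purely as an illustration; in the extension it would be measured client-side as described above. The function name and the idea of recording a "started" and a "published" timestamp are assumptions made for this example.

```python
from datetime import datetime, timezone


def translation_duration_seconds(started_at, published_at):
    """Return the seconds elapsed between starting and publishing a translation."""
    if published_at < started_at:
        raise ValueError("published_at precedes started_at")
    return (published_at - started_at).total_seconds()


# Made-up timestamps.
started = datetime(2014, 12, 1, 10, 0, 0, tzinfo=timezone.utc)
published = datetime(2014, 12, 1, 10, 42, 30, tzinfo=timezone.utc)
print(translation_duration_seconds(started, published))  # 2550.0
```

A wall-clock difference like this also counts idle time, so a real measurement might need to pause while the translator is away from the page.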

Deletion rate
The Schema:PageDeletion can be used for this type of metric.

By analyzing Wikipedia dumps, we can monitor the number of deleted articles in certain languages.
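
Once the counts are available, a per-language deletion rate could be computed as in this minimal sketch. It assumes two mappings from language code to number of created and deleted articles; the `deletion_rate` helper and the numbers are invented for illustration and do not come from real dumps.

```python
def deletion_rate(created, deleted):
    """Return deleted/created per language, skipping languages with no created articles."""
    return {
        language: deleted.get(language, 0) / count
        for language, count in created.items()
        if count > 0
    }


# Made-up counts per language.
created = {"ro": 200, "fr": 500}
deleted = {"ro": 10}
print(deletion_rate(created, deleted))  # {'ro': 0.05, 'fr': 0.0}
```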

Any other info
The frameworks/tools/programming languages that I will use are:
 * Wikipedia IRC Logger
 * SQL for accessing the database to gather additional data
 * Carrot framework to build search results based on the data acquired
 * R language to generate graphical interpretations of the data