User:Tuxilina/OPWproposal

Wikipedia article translation metrics

 * Public URL: Wikipedia article translation metrics URL
 * Announcement: Announcement on Wikitech-l mailing lists

Name and contact information

 * Name: Roxana Necula
 * Email: necula.roxana91@gmail.com
 * IRC or IM networks/handle(s): tuxilina
 * Resume (LinkedIn): https://www.linkedin.com/profile/view?id=240054360
 * Location: Bucharest, Romania
 * Typical working hours: 10:00 - 18:00 EET
 * GitHub account: https://github.com/nroxana

Synopsis
This project consists of finding different editing patterns of users and analysing factors such as the number of speakers of a language and the penetration of broadband internet connections in the area where the language is spoken. This can be done by logging information about each user, such as: the languages the user speaks (whether the user is multilingual or not), the location from which the user makes the change, the level of bilingualism in that region, the percentage of articles translated into different languages by the same user, etc.

The Wikimedia Foundation has characterized the encyclopedia as trying to provide access to "the sum of all human knowledge." This can be achieved by reaching a uniform percentage of translated articles across all languages: every article in English should be translated into French, Italian, etc., and vice versa, every article in French, Italian, etc. should also exist in English.

These findings will contribute to a better understanding of content development in Wikipedias in different languages and to the development of the ContentTranslation project.


 * Possible mentors: Amir E. Aharoni

Deliverables
Please describe the details and the timeline of the work you plan to accomplish on the project you are most interested in (discuss these first with the mentor of the project).

The main stages of this project are:
 * 1) gather the required data through the IRC logger
 * 2) based on the data acquired, gather additional data from the database about the user who translated the articles, such as region, level of bilingualism, whether the user is multilingual or not, number of translated articles, etc.
 * 3) generate local XML / JSON files based on the data acquired in the previous steps
 * 4) send the XML / JSON files to the Carrot clustering engine to build detailed search results
 * 5) model the search results using graphical techniques, with the help of the R language
 * 6) create documentation based on the final data sets generated.
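
As a rough illustration of stage 3, the sketch below groups logged edits by user and serializes the result as JSON. It is written in Python only for brevity. The record fields (`user`, `wiki`, `is_new`) and the rule that a user counts as multilingual when they edit more than one wiki are assumptions made for this example, not part of the project specification.

```python
import json
from collections import defaultdict


def summarize_by_user(records):
    """Group logged edits by user (hypothetical record layout)."""
    summary = defaultdict(lambda: {"new_articles": 0, "wikis": set()})
    for rec in records:
        entry = summary[rec["user"]]
        entry["wikis"].add(rec["wiki"])
        if rec["is_new"]:
            entry["new_articles"] += 1
    # Convert sets to sorted lists so the result is JSON-serializable.
    return {
        user: {
            "new_articles": entry["new_articles"],
            "wikis": sorted(entry["wikis"]),
            # Assumption: editing more than one wiki is a proxy for multilingualism.
            "multilingual": len(entry["wikis"]) > 1,
        }
        for user, entry in summary.items()
    }


def to_json(summary):
    """Serialize the per-user summary for the later clustering stage."""
    return json.dumps(summary, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == "__main__":
    sample = [
        {"user": "Ana", "wiki": "ro", "is_new": True},
        {"user": "Ana", "wiki": "en", "is_new": False},
        {"user": "Bob", "wiki": "fr", "is_new": True},
    ]
    print(to_json(summarize_by_user(sample)))
```

The real per-user attributes (region, level of bilingualism) would come from the database queries in stage 2 and could be merged into the same structure.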

Participation
My progress will be publicly available at OPW Internship Report, which will contain a weekly Plan & Progress section.

The Plan section will be a brief paragraph on my weekly goals (written at the start of the week), and the Progress section (written at the end of the week) will consist of a detailed report on the tasks & microtasks done that week.

I also plan to lurk on IRC, as I have done so far, to ask for help there, to communicate via the wikitech-l mailing list, and to keep in touch with the mentor via Skype or email.

The code will be available on my GitHub account (in the spirit of open source).

About you
I have completed my bachelor's degree studies at the University Politehnica of Bucharest, Romania, but have not yet submitted the final thesis. My plan is to make this project my final thesis. I am majoring in Computer Science at the Faculty of Automatic Control and Computer Science. As for programming experience, I was a Software Engineer at Electronic Arts Romania between June 2013 and April 2014. Since April 2014 I have been working as a Junior Web Developer at a startup company.
 * Education completed or in progress:

I heard about the GSOC / OPW programs from friends and colleagues at university. I think that the OPW program in particular is an efficient way to create diversity among programmers and contributors. What I like about this organization is that I am able to contribute (with or without being accepted to OPW) to one of the most popular websites, home to the largest and most popular general reference work. Every bit of code matters, and I believe that this project will help me gain insight into such a large and complex project.
 * How did you hear about this program?

If I am accepted, I will suspend my current job for 3 months, until the end of the program. I have also planned a 2-week Christmas vacation. :D So basically I will be free until the beginning of March, except for the Christmas vacation.
 * Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

Past experience
So far I have submitted a few patches for a bug about improving the user feedback shown for the HTML-detected upload error. This helped me get used to MediaWiki core and the code review process.
 * Please describe your experience with any other FOSS projects as a user and as a contributor:

The microtask suggested on the project page was to find different ways in which users mark articles as translated and analyze them.
 * Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them:

What I did first was to read the 'Multilinguals and Wikipedia editing' article and another paper on measuring self-focus bias to familiarize myself with the main concept.

Then I ran a small Java program that logs the Wikipedia IRC raw feed to see when an article is translated (or created). To pick the top 20 Wikipedias, I used wikistats and took the first 20 entries. This helped me grasp the main idea and define the tasks needed to complete this project.
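
To make the logging step more concrete, here is a minimal sketch of how one line of the raw feed could be parsed. It is written in Python rather than Java, purely for brevity, and it is not the program mentioned above. The line layout it assumes (`[[Title]] FLAGS URL * User * (+bytes) comment`, wrapped in mIRC colour codes, with `N` marking a new page) is an assumption about the feed and should be checked against real log lines. The function name `parse_feed_line` and the sample line are hypothetical.

```python
import re

# mIRC formatting: \x03 plus an optional colour number, and bold/reset/etc.
# Caveat: a token that starts with a digit right after a bare \x03 would lose it.
_FORMATTING = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?|[\x02\x0f\x16\x1d\x1f]")

# Assumed layout after stripping: [[Title]] FLAGS URL * User * (+123) comment
_LINE = re.compile(
    r"^\[\[(?P<title>.+?)\]\]\s*(?P<flags>[!NMB]*)\s*(?P<url>https?://\S+)"
    r"\s+\*\s+(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]?\d+)\)\s?(?P<comment>.*)$"
)


def parse_feed_line(raw):
    """Return a dict describing one edit, or None if the line does not match."""
    match = _LINE.match(_FORMATTING.sub("", raw).strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "title": fields["title"],
        "user": fields["user"],
        "url": fields["url"],
        "delta": int(fields["delta"]),
        "is_new": "N" in fields["flags"],  # assumption: N flags a page creation
        "comment": fields["comment"],
    }


if __name__ == "__main__":
    # Hand-written sample in the assumed format, not a real log line.
    sample = (
        "\x0314[[\x0307Bucharest\x0314]]\x034 N\x0310 "
        "\x0302https://ro.wikipedia.org/w/index.php?oldid=1\x03 \x035*\x03 "
        "\x0303Example\x03 \x035*\x03 (\x02+120\x02) \x0310translated from en\x03"
    )
    print(parse_feed_line(sample))
```

Whether a newly created page is a translation cannot be read from the flags alone. Here the edit comment is kept so that the different ways users mark translations can be analyzed later.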

Any other info
The frameworks/tools/programming languages that I will use are:
 * Wikipedia IRC Logger
 * SQL for accessing the database to gather additional data
 * Carrot framework to build search results based on the data acquired
 * R language to generate graphical interpretations of the data