User:Tuxilina/OPWproposal

From MediaWiki.org

Wikipedia article translation metrics

Public URL

Description: Wikipedia article translation metrics URL
Project page: Wikipedia article translation metrics

Announcement
Announcement on Wikitech-l mailing lists
Progress report
User:Tuxilina/OPW_Report

Name and contact information

Name
Roxana Necula
Email
necula.roxana91@gmail.com
Blog
http://www.tuxilina.com
IRC or IM networks/handle(s)
tuxilina
Resume (LinkedIn)
https://www.linkedin.com/in/roxananecula
Location
Bucharest, Romania
Typical working hours
10:00 - 18:00 EET
Github account
https://github.com/nroxana

Synopsis

This project consists of finding different editing patterns of users and analysing factors such as the number of speakers of a language and the penetration of broadband internet connections in the area where the language is spoken. All this can be done by logging information about the users, such as: the languages they speak (whether the user is multilingual or not), the location from which the user makes the change, the level of bilingualism in that region, the percentage of articles translated into different languages by the same user, etc.

The foundation behind Wikipedia has characterized the encyclopedia as trying to provide access to "the sum of all human knowledge." [1] This can be achieved by reaching a uniform percentage of translated articles in all languages: all articles in English should be translated into French, Italian, etc., and vice versa, all articles in French, Italian, etc. should also exist in English.
These findings will contribute to a better understanding of content development in Wikipedias in different languages and to the development of the ContentTranslation project.

Mentors
Amir E. Aharoni and Joel Sahleen

Deliverables

Please describe the details and the timeline of the work you plan to accomplish on the project you are most interested in (discuss these first with the mentor of the project).

The main stages of this project are:

  1. gather the required data through the IRC logger
  2. based on the data acquired, gather additional data from the database about the users who translated the articles, such as region, level of bilingualism, whether the user is multilingual or not, number of translated articles, etc.
  3. generate local XML / JSON files from the data acquired so far
  4. send the XML / JSON files to the Carrot clustering engine[2] to build detailed search results
  5. model the search results using graphical techniques, with the help of the R language[3]
  6. create documentation based on the final data sets generated.
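
A minimal sketch of how stages 2 and 3 could fit together, written in Python for brevity. The record fields (`user`, `wiki`, `title`), the sample data and the rule used to flag a user as multilingual are all assumptions made for illustration; the real fields would depend on what the IRC logger and the database actually provide.

```python
import json
from collections import defaultdict

# Hypothetical records, as they might come out of the IRC logger (stage 1).
# The field names are assumptions for illustration only.
edits = [
    {"user": "Alice", "wiki": "fr", "title": "Bucarest"},
    {"user": "Alice", "wiki": "it", "title": "Bucarest"},
    {"user": "Bob", "wiki": "fr", "title": "Danube"},
]


def summarize_by_user(records):
    """Group edit records by user and derive simple per-user fields."""
    by_user = defaultdict(lambda: {"wikis": set(), "articles": 0})
    for rec in records:
        entry = by_user[rec["user"]]
        entry["wikis"].add(rec["wiki"])
        entry["articles"] += 1
    return {
        user: {
            "wikis": sorted(entry["wikis"]),
            "articles": entry["articles"],
            # Assumption: a user seen on more than one language edition
            # is treated as multilingual.
            "multilingual": len(entry["wikis"]) > 1,
        }
        for user, entry in by_user.items()
    }


summary = summarize_by_user(edits)

# Stage 3: serialize the per-user summary as JSON (here printed, not saved).
print(json.dumps(summary, indent=2, sort_keys=True))
```

The same dictionary could be written to a local file with `json.dump` before being handed to the clustering step.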
Milestone No. | Timeline (Calendar) | Task
1 | Nov. 13 - Nov. 30 | Community bonding: talking with the mentor as well as other community members to plan the main stages of the project in detail
2 | Dec. 1 - Dec. 11 | Generate and gather data from the IRC Logger about the translated articles
3 | Dec. 12 - Dec. 21 | Generate additional data related to the users that have been translating
4 | Dec. 24 - Dec. 26 and Dec. 31 - Jan. 1 | Christmas vacation and New Year's Eve celebration :D
5 | Jan. 5 - Jan. 12 | Set up and investigate the Carrot framework
6 | Jan. 13 - Jan. 31 | Build detailed search results on the acquired data using Carrot
7 | Feb. 1 - Feb. 5 | Set up and investigate the R language
8 | Feb. 6 - Feb. 22 | Build a graphical interpretation of the data
9 | Feb. 23 - Mar. 9 | Improve and optimize the code, document the final project and make final tweaks

Participation

The progress will be publicly available at OPW Internship Report, which will contain a weekly Plan & Progress section.
The Plan section will be a brief paragraph on my goals for the week (written at the start of the week), and the Progress section (written at the end of the week) will consist of a detailed report on the tasks & microtasks done that week.

I also plan to lurk on IRC, as I have done so far, to ask for help there and/or on the wikitech-l mailing list, and to communicate with the mentor via Skype or email.

The code will be available on my GitHub account [4] (in the spirit of open source).

About you

Education completed or in progress

I have completed my bachelor's degree studies at the University Politehnica of Bucharest, Romania, but have not yet submitted the final thesis. My plan is to make this project my final thesis. I am majoring in Computer Science at the Faculty of Automatic Control and Computer Science. As for programming experience, I was a Software Engineer at Electronic Arts Romania between June 2013 and April 2014. Since April 2014 I have been working as a Junior Web Developer at a startup company.

How did you hear about this program?

I heard about the GSOC / OPW programs from friends and colleagues at university. I think that the OPW program especially is an efficient way to create diversity among programmers and contributors. What I like about this organization is that I am able to contribute (with or without being accepted to OPW) to one of the most popular websites, which hosts the largest and most popular general reference work. Every bit of code matters and I believe that this project will help me gain insight into such a large and complex project.

Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

If I am accepted, I will suspend my current job for 3 months, until the end of the program. I have also planned a two-week Christmas vacation. :D So basically I will be free until the beginning of March, except for the Christmas vacation.


Past experience

Please describe your experience with any other FOSS projects as a user and as a contributor

So far I have submitted a few patches to fix a bug related to improving the user feedback for the "html detected" upload error. This helped me get used to the MediaWiki core and the code review process.

Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them

The microtask suggested on the project page[5] was to find the different ways in which users mark articles as translated and to analyze them.
First, I read the 'Multilinguals and Wikipedia editing' article and another paper on measuring self-focus bias to familiarize myself with the main concepts.
Then I ran a small Java program[6] that logs the Wikipedia IRC raw feed to see when an article is translated (or created). I ran it on the top 20 Wikipedias, which I picked from wikistats. [7] This helped me grasp the main idea and define the tasks needed to complete this project.

Microtask

Besides the previous work, which consisted of exploring the basics of Wikipedia metrics through data dumps in IRC, I also identified a few interesting points in the metrics used by the Content Translation extension and elaborated the main idea of how to implement them.

I chose two main types of core metrics: Quantity of content and Evolution in time.

Quantity of content

Number of articles created. Articles created per week, per user, per language.

This can be logged using an EventLogging schema. When an article is successfully published, a hook can be used to trigger an event that logs this information.
It is important to specify how this data is logged, so that the stored data can be aggregated into articles created per week, per user, per language (for example, the number of articles written in a certain language in a whole week by one user).
Possible schema:
Description: "EventLogging schema used for the number of articles created (per week, per language, per user)"
Properties:

property | type | required | description
pageID | integer | true | the page ID
userID | integer | true | the user ID
timestamp | string | true | a timestamp of the article creation date
language | string | true | language code for the article language
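
As an illustration of the later analysis step, here is a minimal Python sketch that counts articles per week, per user, per language from events shaped like the schema above. The sample events are invented, and the ISO 8601 timestamp format and the use of ISO week numbers are assumptions; the schema itself only says that the timestamp is a string.

```python
from collections import Counter
from datetime import datetime

# Hypothetical events following the schema sketched above.
# Assumption: timestamps are ISO 8601 strings.
events = [
    {"pageID": 101, "userID": 7, "timestamp": "2014-12-01T10:15:00", "language": "ro"},
    {"pageID": 102, "userID": 7, "timestamp": "2014-12-03T18:40:00", "language": "ro"},
    {"pageID": 103, "userID": 7, "timestamp": "2014-12-09T09:00:00", "language": "ro"},
    {"pageID": 104, "userID": 9, "timestamp": "2014-12-02T12:00:00", "language": "fr"},
]


def articles_per_week(evts):
    """Count created articles per (ISO year, ISO week, user, language)."""
    counts = Counter()
    for event in evts:
        created = datetime.fromisoformat(event["timestamp"])
        iso_year, iso_week, _weekday = created.isocalendar()
        key = (iso_year, iso_week, event["userID"], event["language"])
        counts[key] += 1
    return counts


weekly = articles_per_week(events)
for key, count in sorted(weekly.items()):
    print(key, count)
```

Keeping one event per published article and aggregating afterwards means the same log can also be grouped by month or by language alone without changing the schema.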

Length of articles

This can also use an EventLogging schema. Because we only care about the size of successfully published articles, we can again use a hook triggered when an article is published. Then we can log the size for later analysis.
Possible schema:
Description: "EventLogging schema used for measuring created articles size"
Properties:

property | type | required | description
pageID | integer | true | the page ID
pageSize | string | true | the size of the page

For this type of metric, the {{PAGESIZE}} magic word can be used in the logging function.

Time spent in creating a translation

Possible schema:
Description: "EventLogging schema used for measuring time spent in creating a translation"
Properties:

property | type | required | description
pageID | integer | true | the page ID
duration | integer | true | duration between beginningTime and endTime of a created article, in milliseconds
language | string | true | language code for the article language
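
The duration field could be derived as in the following minimal sketch. It is written in Python only to show the arithmetic; in practice the measurement would happen in the browser. The function name `build_duration_event` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_duration_event(page_id, language, beginning_time, end_time):
    """Build an event matching the schema above from two timestamps."""
    if end_time < beginning_time:
        raise ValueError("endTime must not be earlier than beginningTime")
    # Integer division by one millisecond avoids floating-point rounding.
    duration_ms = (end_time - beginning_time) // timedelta(milliseconds=1)
    return {"pageID": page_id, "duration": duration_ms, "language": language}


# Hypothetical translation session of 25 minutes and 30 seconds.
start = datetime(2015, 1, 13, 10, 0, 0, tzinfo=timezone.utc)
end = datetime(2015, 1, 13, 10, 25, 30, tzinfo=timezone.utc)

event = build_duration_event(101, "ro", start, end)
print(event)
```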

Short summary

For the 'Number of articles created' and 'Length of articles' measurements, PHP can be used to define the logging functions, which would be implemented by developers, and the results would then be used by analysts. In contrast, the 'Time spent in creating a translation' metric requires JavaScript to measure the duration.

Evolution in time

Deletion rate

Schema:PageDeletion can be used for this type of metric.
By analyzing Wikipedia dumps, we can monitor the number of deleted articles in certain languages.
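
One possible way to turn such counts into a rate is sketched below in Python. The definition used here (deleted articles divided by created articles, per language) is an assumption, since the metric is not defined more precisely above, and the numbers are invented.

```python
def deletion_rate(created, deleted):
    """Return deleted / created per language, skipping languages with no creations."""
    rates = {}
    for language, created_count in created.items():
        if created_count == 0:
            continue  # the rate is undefined when nothing was created
        rates[language] = deleted.get(language, 0) / created_count
    return rates


# Hypothetical per-language counts for one period.
created = {"ro": 40, "fr": 200, "it": 0}
deleted = {"ro": 4, "fr": 10}

rates = deletion_rate(created, deleted)
for language, rate in sorted(rates.items()):
    print(f"{language}: {rate:.1%}")
```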

Any other info

The frameworks/tools/programming languages that I will use are:

  • Wikipedia IRC Logger [8]
  • SQL for accessing the database to gather additional data
  • Carrot framework to build search results based on the data acquired
  • R language to generate graphical interpretation on the data
