You can find more about it in my OPW-related blog post (note that I have a huge backlog).
Community Bonding Period
- Project plan agreed with mentors. My old proposal would be integrated into the plan as we go.
- Project on Phabricator.
Landing and first meetings with mentors
As I'm new to the FOSS community, and to coding on the web, most of the things I encounter are new and overwhelming. Taking this into account, I think that working with the mentors (especially Amir) couldn't have been more helpful. We met twice (we live in the same city) and he gave me tips on how best to work and stay in touch with the community. Most importantly, we had a really productive brainstorming session to work out where I should start my research. I think getting into the world of MediaWiki is not easy (there are lots of technical issues, along with norms to understand), and having someone to guide you through it is essential.
I also had a really nice Skype meeting with Roxana, the second intern on the project, to get to know each other.
Lessons learned since 22 November
- Got familiar with the Content Translation product and parts of its code.
- Learned to use IRC and the mailing lists as a means of communication (along with Phabricator).
- Got familiar with Wikidata and some tools there that are related to our project.
- Did some research on the MediaWiki tables and statistics tools available.
- Signed up to the Wikidata and wiki research mailing lists.
- Communicate with the community and relevant personnel over IRC.
- Chats and emails for short questions or long updates.
- Complete weekly progress reports and blog posts about my progress.
- Update my tasks on Phabricator so everyone knows what the others are doing.
- Weekly meetings with the team.
Deliverables (Before mid-term evaluation)
- Having all the data we need to start checking our model on two languages (and hopefully having a proper model at hand).
- Write a description of the way people translate pages.
- Have test cases for translated pages.
- Have test cases for pages that weren't translated.
- Know which metrics are important and how to check them (date of posting, changes to other pages, etc.).
Week 1: December 9 - December 15
- Asked editors on the Hebrew village pump about their editing methods. See conversation (in Hebrew) here.
- Had a group meeting via Google Hangout with Amir, Joel and Roxana.
- Started working on my first task - Research how articles are translated manually.
- Opened some smaller tasks in Phabricator for me to figure out what I need to do.
- Continued to learn how Wikidata works.
Week 2: December 16 - December 22
In this week I focused on task T78818 - finding an easy way to find all the articles that exist in two languages.
- Downloaded the relevant Hebrew Dumps.
- Learned about the Wikipedia API sandbox. Decided that this is not the way.
- Tried to understand if it is possible to work with Wikidata - wikibase tables (took a lot of time to understand what I am looking for).
- Decided for now that this is not the way as there is no easy access to it and the dumps are not "easy to use".
- Finished the task for the dumps direction: wrote an SQL query that returns all the Hebrew articles that have a counterpart in English, along with their IDs.
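A minimal sketch of the kind of query described above, assuming the standard MediaWiki `page` and `langlinks` tables. It runs against a tiny in-memory SQLite database with made-up rows instead of the real Hebrew dump, and the function names are hypothetical; the actual query may well differ.

```python
import sqlite3

# Every main-namespace, non-redirect page that has an interlanguage link
# to English, together with its page ID and the English title.
QUERY = """
SELECT p.page_id, p.page_title, ll.ll_title AS en_title
FROM page AS p
JOIN langlinks AS ll ON ll.ll_from = p.page_id
WHERE ll.ll_lang = 'en'
  AND p.page_namespace = 0
  AND p.page_is_redirect = 0
ORDER BY p.page_id
"""


def build_toy_db():
    """Create an in-memory database with illustrative (made-up) rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT,
            page_is_redirect INTEGER
        );
        CREATE TABLE langlinks (ll_from INTEGER, ll_lang TEXT, ll_title TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?, ?)",
        [
            (1, 0, "ירושלים", 0),   # article with an English counterpart
            (2, 0, "חנוכה", 0),     # article with an English counterpart
            (3, 0, "דף_מקומי", 0),  # article with no English counterpart
            (4, 1, "ירושלים", 0),   # talk page, filtered out by namespace
        ],
    )
    conn.executemany(
        "INSERT INTO langlinks VALUES (?, ?, ?)",
        [
            (1, "en", "Jerusalem"),
            (1, "fr", "Jérusalem"),
            (2, "en", "Hanukkah"),
            (4, "en", "Talk:Jerusalem"),
        ],
    )
    return conn


def articles_with_english_counterpart(conn):
    """Return (page_id, hebrew_title, english_title) tuples."""
    return conn.execute(QUERY).fetchall()


rows = articles_with_english_counterpart(build_toy_db())
for row in rows:
    print(row)
```

Filtering on namespace 0 and skipping redirects keeps the result to actual articles.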
Week 3: December 23 - December 29
- Met Amir - Reviewed what I did until now and what I am planning to do.
- Started learning PHP.
- Asked for (and was granted) access to the Wikipedia database (we realized that I would be working with too much data to download the dumps and run the database locally).
- Read about other projects related to language that people are doing with Wikipedia and Wikidata.
- Wandered at random through the different Wikipedias in search of properties that define translated pages. Learned a lot about how things work but still have a lot to do.
- Celebrated Hanukkah!
Week 4: December 30 - January 5
- Finally understood how Autolist works! (relating to two weeks ago's subject). Tried to understand its code, moderate success.
- Gave up on working with the Wikidata databases and Toolkit (because, as I understand it, they are not yet set up for easy access. Update: the data is easily available with the Toolkit, but that means I would need to learn Java, so it will wait).
- Continued to research properties that define translated pages. Made big progress. The findings are summarized here.
Week 5: January 6 - January 13
- Finishing task T85410. Sent an email to analytics about it.
- Started writing a grant proposal to Google Translate so we will have a tool to translate the headlines.
Week 6: January 14 - January 19
- Wrapping up the research phase and all the related tasks - findings can be found here under the project page.
- Starting to research how and if we can find and check each parameter.
Week 7: January 20 - January 26
- Continuing with researching how and if we can find and check each parameter.
- Sent a proposal to Google Translate.
Week 8: January 27 - February 2
- Built a nice SQL query that returns Hebrew articles that are candidates to be translated (and their first revision). The query also returns the counterpart English revisions.
- Talked to Marc Miquel about my project. We now have an open communication channel.
- Switched to a MacBook Pro!
- Turned out the text table is not available via ssh. I have to download a dump or use the API. Right now, both options are terrible.
- Downloaded a small version of the revision table through the incremental (daily) XML dumps. Used Aaron Halfaker's Python library to open them.
- Started collecting a sample of Hebrew Wikipedia articles (using random article) and their English counterparts, and categorizing them as translated or not. This is ongoing.
- Corresponded with Tighe Flanagan and Asaf Bartov about getting examples for translated pages in Arabic.
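To illustrate what reading revisions out of an XML dump involves, here is a minimal sketch that uses only the Python standard library, not Aaron Halfaker's library mentioned above. The sample XML, its schema version and the function names are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the shape of a MediaWiki XML export.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2015-01-27T10:00:00Z</timestamp>
      <text>first version</text>
    </revision>
    <revision>
      <id>11</id>
      <timestamp>2015-01-28T12:30:00Z</timestamp>
      <text>second version</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (page_title, rev_id, timestamp, text), one page at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                # Only direct children of <revision> are read here.
                fields = {local_name(c.tag): c.text for c in child}
                yield (
                    title,
                    int(fields["id"]),
                    fields["timestamp"],
                    fields.get("text") or "",
                )
        elem.clear()  # free the memory held by the finished page


revisions = list(iter_revisions(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
for rev in revisions:
    print(rev)
```

For a real compressed dump, a file object from `bz2.open(path, "rb")` could be passed in place of the `BytesIO` sample.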
Week 9: February 3 - February 9
- Started learning Unix (to make better use of ssh).
- Checked whether to use the API for the data of the revisions text. Answer (courtesy of YuviPanda on IRC): Maybe for small sets but I will use the dumps for the data itself.
- Downloaded and learned a bit about pywikibot.
- Wrote a Python script that returns the revision text from the API using pywikibot (it's not the optimal solution; talked with Amir Ladsgroup about it).
- Designed an SQL query that chooses the right Hebrew revision.
- Met Amir.
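For small sets, fetching a revision's text from the API comes down to one query request. The sketch below does not use pywikibot; it only builds the request URL for the MediaWiki Action API and pulls the text out of a response. The function names and the revision ID are hypothetical, and the response layout is my understanding of `formatversion=2` output, so it is worth checking against the API sandbox.

```python
from urllib.parse import urlencode

API_URL = "https://he.wikipedia.org/w/api.php"


def build_revision_request(rev_id):
    """Build the URL that asks the API for the content of one revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "ids|content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    }
    return API_URL + "?" + urlencode(params)


def extract_revision_text(response):
    """Return the wikitext from a decoded JSON response (assumed layout)."""
    page = response["query"]["pages"][0]
    revision = page["revisions"][0]
    return revision["slots"]["main"]["content"]


# Hand-written response in the assumed layout, so nothing is fetched here.
sample_response = {
    "query": {
        "pages": [
            {
                "pageid": 1,
                "title": "Example",
                "revisions": [
                    {"revid": 12345, "slots": {"main": {"content": "some wikitext"}}}
                ],
            }
        ]
    }
}

url = build_revision_request(12345)
text = extract_revision_text(sample_response)
print(url)
print(text)
```

One request per revision is slow for large samples, which fits the advice above to keep the API for small sets and use the dumps for the bulk of the data.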
Week 10: February 10 - February 16
- Mostly studied for a test.
Week 11: February 17 - February 23
- Slow week - it snowed! for real this time :)
- Looked for a way to process the compressed XML dumps with Hadoop (e.g., using github.com/whym/wikihadoop). No decision yet.
- Read two relevant articles:
- "The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context" by Brent Hecht and Darren Gergle.
- "Measuring Self-Focus Bias in Community-Maintained Knowledge Repositories" by Brent Hecht and Darren Gergle.
- Finished learning the basics of Unix to make better use of the terminal.
Week 12: February 24 - March 2
- Had three days off due to a conference from the university.
- Read another relevant article: "Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia" by Bharath Dandala, Rada Mihalcea and Razvan Bunescu.
- Learned to use basic regular expressions.
- Started programming, in Python, the first function, which checks the talk page text for an indication that the page was translated.
- Started a new semester.
Week 13: March 3 - March 9
- Celebrated Purim.
- Learned Git and finally opened a GitHub account (https://github.com/Livnetata).
- Published new blog posts.
- Received from Marc Miquel a script that should help extract data from the XML dumps. I have not integrated it yet, but it should make my work easier.
- Met Amir to wrap up the internship nicely and to update him on the status of the project.
- Learned to use regular expressions in Hebrew!
- Finished programming the function transltatedFromDisscusion, which checks the talk page for an indication of translation.
- Programmed a function that checks the edit summaries for an indication that the page was translated.
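The two checks above can be sketched with a single regular expression shared by both functions. This is only an illustration: the marker phrases, the function names and the sample strings are assumptions, not the patterns or code from the project itself.

```python
import re

# Hypothetical marker phrases; the real ones would come from the research
# findings on how editors mark translated pages.
TRANSLATION_PATTERNS = [
    r"תורגם",             # Hebrew: "was translated"
    r"תרגום\s+מ",         # Hebrew: "translation from ..."
    r"translated\s+from",  # English phrase, also common in edit summaries
]
TRANSLATION_RE = re.compile("|".join(TRANSLATION_PATTERNS), re.IGNORECASE)


def mentions_translation(text):
    """True if the text contains any of the marker phrases."""
    return bool(TRANSLATION_RE.search(text or ""))


def translated_per_talk_page(talk_text):
    """Check the talk page wikitext for an indication of translation."""
    return mentions_translation(talk_text)


def translated_per_edit_summaries(summaries):
    """Check whether any edit summary indicates a translation."""
    return any(mentions_translation(summary) for summary in summaries)


print(translated_per_talk_page("הערך תורגם מוויקיפדיה האנגלית"))
print(translated_per_edit_summaries(["typo fix", "Translated from en:Jerusalem"]))
```

Python 3 strings are Unicode, so Hebrew patterns need no special handling beyond saving the source file as UTF-8.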