Parsoid/Linting/GSoC 2014 Application

Parsoid-based online-detection of broken wikitext
Public URL: https://www.mediawiki.org/wiki/User:Hardik95/GSoC_2014_Application
Bugzilla report: Bug Report
Announcement: http://lists.wikimedia.org/pipermail/wikitech-l/2014-March/075217.html

Name and contact information
Name: Hardik Juneja
Email: hardikjuneja.hj@gmail.com
IRC or IM networks/handle(s): hardikj
Location: India
Typical working hours: Online from 12pm to 3am until August, 6pm to 2am after August.

Project Summary
This GSoC project aims to use Parsoid to detect broken and deprecated wikitext on wiki pages and, in some cases, to suggest possible fixups. During parsing, Parsoid has access to information that can tell wiki editors where wikitext is broken and how they can fix it. Communicating this information to editors should make the tool useful to the community. Since we don't want to reinvent the wheel, we will reuse existing UI and fixup workflows by feeding the fixup information generated by Parsoid to the existing CheckWiki WikiProject. The tool will also help Parsoid developers collect statistics about the use of templates in balanced and unbalanced contexts.

The project aims to implement a generator with the following features:


 * 1) Finding issues such as broken and deprecated wikitext and reporting them to CheckWiki.
 * 2) Generating fixup information for each issue using Parsoid.
 * 3) Feeding this information to CheckWiki, or providing a web service from which CheckWiki can pull the data.

Project Scope

 * 1) Finding issues using a Parsoid-based linter
   * Reusing parts of the logging infrastructure, which is already used to log production errors and for tracing and debugging during development, to create a Parsoid-based linter.
   * Creating an event whenever a particular issue is found.
 * 2) Generating fixup information
   * Planning the database structure and creating the database.
   * Creating an interface that listens to the events generated by the linter and saves them into the database.
 * 3) Feeding this information to CheckWiki or providing a web service
   * Creating web APIs for CheckWiki so that it can pull data from our database.
   * Creating a database sync service that keeps both databases in sync.
 * 4) Filtering and optimization
   * Filtering and optimizing the issue generation process.
   * Generating fixup information for hard problems, such as balanced/unbalanced templates, using Parsoid. This will be used to collect statistics about the use of templates in balanced and unbalanced contexts. Such information helps categorize templates into those that essentially always produce balanced output and those that often produce unbalanced output.
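The flow from linter events to stored issues described in the scope above could look roughly like the following. This is only a minimal sketch in Python, not Parsoid's actual logger: the `LintLogger` class, the event names such as `lint/missing-end-tag`, and the `issues` table layout are all assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict


class LintLogger:
    """Tiny publish/subscribe hub standing in for the real logger (hypothetical)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def on(self, event_type, callback):
        # Register a callback for one event type.
        self._listeners[event_type].append(callback)

    def emit(self, event_type, payload):
        # Deliver the payload to every callback registered for this event type.
        for callback in self._listeners[event_type]:
            callback(event_type, payload)


def create_store():
    # In-memory database; the column layout is an assumed, illustrative schema.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE issues (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               page TEXT NOT NULL,
               issue_type TEXT NOT NULL,
               start_offset INTEGER NOT NULL,
               end_offset INTEGER NOT NULL,
               suggestion TEXT
           )"""
    )
    return conn


def make_db_listener(conn):
    # Build a listener that saves each emitted issue as one row.
    def save_issue(event_type, payload):
        conn.execute(
            "INSERT INTO issues"
            " (page, issue_type, start_offset, end_offset, suggestion)"
            " VALUES (?, ?, ?, ?, ?)",
            (
                payload["page"],
                event_type,
                payload["start"],
                payload["end"],
                payload.get("suggestion"),  # not every issue has a fixup
            ),
        )
        conn.commit()

    return save_issue


logger = LintLogger()
store = create_store()
listener = make_db_listener(store)
logger.on("lint/missing-end-tag", listener)
logger.on("lint/deprecated-tag", listener)

# Hypothetical issues, as a linter might report them while parsing a page.
logger.emit(
    "lint/missing-end-tag",
    {"page": "Sandbox", "start": 10, "end": 13, "suggestion": "add </b>"},
)
logger.emit("lint/deprecated-tag", {"page": "Sandbox", "start": 40, "end": 46})

rows = store.execute(
    "SELECT page, issue_type, suggestion FROM issues ORDER BY id"
).fetchall()
```

Keeping the listener separate from the logger means the storage backend can change without touching the code that detects issues.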

Deliverables

 * 1) A Parsoid-based linter that uses the logger to generate an event for each issue
 * 2) A robust backend to store issues
 * 3) APIs for bots and CheckWiki to pull information from the backend
 * 4) A testing framework
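As a rough illustration of the pull-style API in the deliverables, a bot could ask for every issue recorded after the last one it has seen. The sketch below is one possible shape in Python, not a committed design: the function name `pull_issues`, the `since_id` cursor, the JSON layout and the demo rows are all assumptions.

```python
import json
import sqlite3


def open_demo_store():
    # Demo backend with a few made-up issues; the schema is illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets rows be converted to dicts by column name
    conn.execute(
        "CREATE TABLE issues"
        " (id INTEGER PRIMARY KEY, page TEXT, issue_type TEXT, suggestion TEXT)"
    )
    conn.executemany(
        "INSERT INTO issues (id, page, issue_type, suggestion) VALUES (?, ?, ?, ?)",
        [
            (1, "Sandbox", "missing-end-tag", "add </b>"),
            (2, "Sandbox", "deprecated-tag", None),
            (3, "Main Page", "missing-end-tag", "add </i>"),
        ],
    )
    return conn


def pull_issues(conn, since_id=0, limit=100):
    """Return a JSON document with up to `limit` issues whose id is above `since_id`."""
    rows = conn.execute(
        "SELECT id, page, issue_type, suggestion FROM issues"
        " WHERE id > ? ORDER BY id LIMIT ?",
        (since_id, limit),
    ).fetchall()
    issues = [dict(row) for row in rows]
    # The client passes next_since back on its next call to resume from there.
    next_since = issues[-1]["id"] if issues else since_id
    return json.dumps({"issues": issues, "next_since": next_since})


conn = open_demo_store()
first = json.loads(pull_issues(conn, since_id=0, limit=2))
second = json.loads(pull_issues(conn, since_id=first["next_since"], limit=2))
```

A cursor based on the issue id lets a client resume where it stopped without the server having to track per-client state.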

Estimated project timeline
Throughout the GSoC period, I'll be working on a repo and will commit as I go. I'll also be testing as I go, possibly with a parser testing framework where we can feed in broken wikitext and verify the fixup output.
 * Community bonding period (2-3 weeks)
   * Study the logger code and familiarize myself with its structure.
   * Discuss the project design with the community.
   * Fix some bugs along the way and get my hands dirty.

 * Logger integration (2 weeks)
   * Decide which events the logger needs to generate.
   * Get the logger up and running.

 * Data model and event listeners (2 weeks)
   * Build the data models.
   * Build event listeners for each event emitted by the logger.

 * Community feedback period (1-2 weeks)
   * I'd like to share my work with the community and get feedback on it.
   * This gives me time to interact with the community, explain the progress of my project and incorporate popular suggestions.

Milestone: Prototype of the fixup information generator.

Milestone: Working project prototype, ready for integrated testing.

 * APIs (2 weeks)
   * Build the API for bots and CheckWiki.

 * Filtering and optimization (2 weeks)
 * Proper testing using some demo pages on a sandbox (1 week)
 * Testing and documentation (2 weeks)
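The testing approach mentioned in this timeline, feeding in broken wikitext and checking what gets reported, could be organised as a table of cases. The sketch below is in Python and uses a toy regex-based check called `toy_lint` purely as a stand-in: the real detection would happen inside Parsoid during parsing, and the issue names and test cases here are assumptions.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)[^>]*>")
# Tags that are obsolete in HTML5; a real linter would use a fuller list.
DEPRECATED_TAGS = {"center", "font", "tt", "strike"}


def toy_lint(wikitext):
    """Toy stand-in for the linter: returns (issue_type, tag_name, offset) tuples."""
    issues = []
    open_tags = []  # stack of (tag_name, offset) for tags still waiting to be closed
    for match in TAG_RE.finditer(wikitext):
        if match.group(0).endswith("/>"):
            continue  # self-closing tag, nothing to balance
        is_closing = match.group(1) == "/"
        name = match.group(2).lower()
        if not is_closing:
            if name in DEPRECATED_TAGS:
                issues.append(("deprecated-tag", name, match.start()))
            open_tags.append((name, match.start()))
        elif open_tags and open_tags[-1][0] == name:
            open_tags.pop()  # properly nested close
        else:
            issues.append(("stray-end-tag", name, match.start()))
    # Anything left on the stack was opened but never closed.
    issues.extend(("missing-end-tag", name, offset) for name, offset in open_tags)
    return issues


# Each case pairs an input snippet with the issues we expect to be reported.
CASES = [
    ("'''bold''' and <b>fine</b>", []),
    ("<b>never closed", [("missing-end-tag", "b", 0)]),
    ("<center>old</center>", [("deprecated-tag", "center", 0)]),
    ("text</i>", [("stray-end-tag", "i", 4)]),
]


def run_cases(cases):
    """Return the cases whose actual output differs from the expected output."""
    failures = []
    for wikitext, expected in cases:
        actual = toy_lint(wikitext)
        if actual != expected:
            failures.append((wikitext, expected, actual))
    return failures


failures = run_cases(CASES)
```

With the cases kept as plain data, every new kind of broken wikitext becomes one more row rather than a new test function.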

About you
I am Hardik Juneja, a B.Tech Computer Science Engineering student at JIIT, Noida, India. I love building things that are useful to people and fun to build, with a steep learning curve. I mostly code in JavaScript and Python. I got interested in this project after seeing it in the list of ideas on the GSoC ideas page. I am excited to work on functionality that will make life easier for many wiki editors and for Parsoid developers.

Participation
I stay online on IRC during my working hours and can be found on #mediawiki and #mediawiki-parsoid. For community feedback and discussion, I use the mailing lists (Wikitech-l and Wikidata-l). I will try to maintain a copy of my work on my GitHub. For development, I will use a local environment of Parsoid and MediaWiki. I'll try to commit early and often to my branch. I think documentation is an important part of a project, so I will try to document my work when possible and also test it regularly.

Past open source experience
I am an active member of the Open Source Developers Club at my university. Ever since I was introduced to open source, everything I develop and use has been open source. As a contributor, I've also attended a few open source meetups, including PyCon India 2013, Jsconf Delhi 2013 and a few local meetups of the Linux user group, Firefox, etc.
 * I have contributed a few patches to the Mozilla AMO project.
 * I have also contributed a patch to the Parsoid project.
 * I have also created a plugin for the Sublime editor.
 * I am also a contributor to the Eden project of the Sahana Software Foundation; here are some of my patches.
 * I was also a Google Code-in mentor for the Sahana Software Foundation this year.
 * All my other projects can be found on my GitHub profile.

Any other info

 * Micro task assigned to me - patch
 * Notes related to the project - Notes
 * Check Wiki - Project_Check_Wiki