User:Damonwang

Identity
Name: Damon Wang

Email: damonwang uchicago edu

Project title: Towards a Python port of Texvc

Contact/working info
Timezone: Chicago or New York (UTC-5 or -4)

Typical working hours: noon to four, eight to midnight (UTC 17:00-21:00, 1:00-5:00)

IRC or IM networks/handle(s): damonwang on freenode

Project summary
Mediawiki pretty-prints math formulae by taking a subset of AMS-TeX and displaying either a PNG rendered via dvipng or, if the formula is simple enough, an HTML approximation. However, AMS-TeX has numerous features that allow arbitrary code execution, unbounded render time, and similarly undesirable behavior. The current solution is texvc, an OCaml program that accepts only a safe subset of AMS-TeX and produces the appropriate HTML or PNG output. Unfortunately, the obscurity of the implementation language seems to have discouraged maintenance: texvc has some fifty open bugs going back six years, including a request to make it easier to install; of the texvc bugs that have been resolved, very few involved edits to the OCaml code. It was therefore suggested that texvc be ported to a more popular language.

A PHP port would have been the ideal solution. Mediawiki already has a large pool of PHP developers, calling the external binary is currently the source of a major outstanding bug, and a PHP dependency for a project written in PHP is as good as no dependency at all. In fact, a PHP port has been attempted, but it omitted all validation of the input, probably because PHP has no existing parsing packages. Judging from the LaTeX2MathML code, which does implement a LaTeX parser in PHP, it would be a step backward to write the parser manually. Much of the clarity, concision, and robustness of the OCaml texvc comes from the fact that we need only maintain a BNF-like input to a parser generator, rather than the parser itself. As it would be well beyond the scope of GSoC to write a general PHP parsing package, I propose instead to port texvc to Python, which offers the following benefits:


 * popularity: easier to find maintainers, easier for sysadmins to install, easier for new developers to "read themselves into" the codebase for quick patches
 * ubiquity: although it does need an interpreter at run-time, this is not a difficult dependency to satisfy; probably even easier than an OCaml compiler at install-time or a binary distribution
 * several mature parsing packages: potentially no regression from the OCaml version in terms of concision, clarity, and "elegance"
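
To make "accepts only a safe subset" concrete, here is a minimal sketch using only the standard library. It is not how texvc is implemented and not how the port would be written (that would use a grammar fed to one of the parsing packages); the whitelists `ALLOWED_COMMANDS` and `ALLOWED_SYMBOLS` are hypothetical and far smaller than what texvc really supports.

```python
import re

# Hypothetical whitelists for illustration only; texvc supports far more.
ALLOWED_COMMANDS = {"frac", "sqrt", "alpha", "beta", "sum", "int"}
ALLOWED_SYMBOLS = {"\\{", "\\}", "\\,", "\\;"}

# A token is a control word (\name), a control symbol (backslash plus
# one character), or any other single character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|.", re.DOTALL)


def is_safe(tex):
    """Accept only whitelisted control sequences with balanced braces."""
    depth = 0
    for token in TOKEN.findall(tex):
        if token.startswith("\\"):
            name = token[1:]
            if name.isalpha():
                # Control word: must be explicitly whitelisted.
                if name not in ALLOWED_COMMANDS:
                    return False
            elif token not in ALLOWED_SYMBOLS:
                # Control symbol or a lone trailing backslash.
                return False
        elif token == "{":
            depth += 1
        elif token == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    for example in (r"\frac{a}{b}", r"\input{/etc/passwd}", r"\frac{a}{b"):
        print(example, is_safe(example))
```

The point of the sketch is the default-deny policy: anything not on the whitelist is rejected, so a command nobody thought about cannot slip through.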

About you
I am a third-year undergraduate who first started programming in a high school data structures course taught with Scheme and C. Perhaps because I started out with homework exercises, my interest was in small toy problems such as those offered by Project Euler, the USA Computing Olympiad, the Python Challenge, and the system scripts on my laptop. As a relative newcomer to open-source software, especially on the scale of Mediawiki, it's a relief to find a rather isolated corner of the project like texvc where I can make a concrete contribution without needing to grok the entire codebase.

Although Scheme was my first language in high school and Haskell my first language at university, my work as a sysadmin on campus has taught me that big libraries and side effects are often the easiest way to get a job done. For practical programming, Python is my first choice, with shell a close second and Perl tying with Common Lisp as very distant thirds.

We don't just care about your project -- you are a person, and that matters to us! What drives you? What makes you want to make this the most awesomest wiki enhancement ever?

You don't need to write out your life story (we can read your blog if we want that), but we want to know a little about what makes you tick. Are you a Wikipedia addict wanting to make your own experience better? Did a wiki with usability problems run over your dog, and you're seeking revenge? What does making this project happen mean to you?

Deliverables
Since this is a port, and I'll essentially be regression-testing against the OCaml texvc, "correct" means "like the OCaml texvc" unless otherwise noted. Whether I reproduce erroneous behavior will be decided on a case-by-case basis: as a rule of thumb, I won't give special effort to either recreating bugs or fixing them.

It should be possible to break down your project into some bullet points describing particular features or milestones which can be reached individually. Consider that we may wish to roll out the system for testing when at an intermediate stage of completion, and that time estimates might vary, leaving you with more time than you expected or (more likely) a lot less -- some features can be pushed back if you end up short.

texvctest.sh
A testing script which determines whether two versions of texvc produce identical output on stdout (accept/reject, HTML or MathML as necessary). This script will not attempt to check the PNGs, since we trust TeX to render the correct output once texvc proper decides that the input is safe enough and complicated enough to pass on. In order to account for intended deviations, it will either check that two texvc binaries produce the same output when run, or that a single texvc binary produces output identical to what is cached in a file.
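
The two comparison modes could be sketched roughly as follows. This is written in Python rather than shell only to keep one language across the examples in this proposal. The command lists and the `"{tex}"` placeholder are assumptions of the sketch; it does not assume anything about texvc's real argument order, which the caller would supply.

```python
import subprocess
import sys


def run(cmd, tex):
    """Run cmd with every "{tex}" argument replaced by the formula.

    Returns the raw bytes written to stdout; stderr is not inspected.
    """
    argv = [tex if arg == "{tex}" else arg for arg in cmd]
    return subprocess.run(argv, stdout=subprocess.PIPE, check=False).stdout


def same_output(cmd_a, cmd_b, tex):
    """Mode 1: two binaries must print identical stdout for one formula."""
    return run(cmd_a, tex) == run(cmd_b, tex)


def matches_cache(cmd, tex, cache_path):
    """Mode 2: one binary must reproduce the output cached in a file."""
    with open(cache_path, "rb") as cached:
        return run(cmd, tex) == cached.read()


if __name__ == "__main__":
    # Stand-in command so the sketch runs without texvc installed:
    # it simply echoes the formula back on stdout.
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.argv[1])", "{tex}"]
    print(same_output(echo, echo, r"\frac{a}{b}"))
```

The cached-file mode is what allows an intended deviation from the OCaml version: the expected output is edited once in the cache and reviewed there.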

test cases
Many TeX files mimicking valid and invalid Math input, grouped into a few sets approximately in order of descending importance:


 * 1) manually constructed invalid examples involving various security holes (calling other executables, catcodes, etc.)
 * 2) manually constructed valid examples involving each of the supported commands, both in isolation and in simple combinations
 * 3) manually constructed invalid examples involving common mistakes like unsupported commands, mismatched delimiters (braces or \begin...\end blocks), etc.
 * 4) a dump of all math on enwiki and dewiki, to guarantee no regression against anything currently accepted

correct accept/reject decision
OCaml and Python versions should agree on what is valid and invalid input. In particular, Python texvc should accept all examples from enwiki and dewiki, since Bugzilla has no bug reports for the OCaml texvc in this area. It should also reject all of the contrived examples involving various security holes (this takes precedence over all other aspects of the Python texvc's behavior).

rendering statistics
Write a script which, given a dump of the <math> elements from a Mediawiki instance, counts the possible rendering modes. This will primarily be used to prioritize different areas of the texvc port: if, for example, we find that half of enwiki's math can be rendered as conservative HTML and almost all of the rest needs PNGs, then moderate and liberal HTML support would probably become optional deliverables.
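
A rough sketch of the counting step, assuming the dump is plain wikitext in which formulae appear as attribute-free `<math>...</math>` elements. The `classify` callable is hypothetical: in practice it would wrap a call to texvc and map its output to a label such as "conservative HTML" or "PNG".

```python
import re
from collections import Counter

# Assumes attribute-free <math> tags; real dumps may need a looser pattern.
MATH_ELEMENT = re.compile(r"<math>(.*?)</math>", re.DOTALL)


def extract_math(wikitext):
    """Return the TeX source of every <math> element, whitespace-trimmed."""
    return [body.strip() for body in MATH_ELEMENT.findall(wikitext)]


def count_modes(formulas, classify):
    """Tally the rendering mode that classify() reports for each formula."""
    return Counter(classify(tex) for tex in formulas)


if __name__ == "__main__":
    page = "Area: <math>\\pi r^2</math>, trivially <math>x</math>."

    def toy_classify(tex):
        # Stand-in for texvc: treat anything with a backslash as needing a PNG.
        return "PNG" if "\\" in tex else "HTML"

    print(count_modes(extract_math(page), toy_classify))
```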

correct HTML decision and output
Python and OCaml versions should agree on what can be rendered as HTML and how that should be done. I will definitely support the conservative HTML output, and leave the less stringent options as optional deliverables depending on how often they are actually used according to the stats script.

documentation
In three formats:


 * 1) Python docstrings in compliance with Google's Python style guide as might be expected by future Python programmers
 * 2) inline documentation in compliance with the Mediawiki project's Doxygen style as might be expected by future maintainers with other Mediawiki development experience
 * 3) manually written, full-paragraph English documentation explaining:
    * the overall architecture, including not only the code but also the test suite and my development workflow
    * the reasoning behind the less trivial design decisions
    * pointers for common expected modifications (e.g., "to add support for a new command, use these modules and edit these files and run these tests...")

As a concession to logistics, I'll probably only maintain the docstrings during development, and port them over to Doxygen at various milestones. Since the two systems contain the same information, I don't expect this to generate any oversights.

If time permits
Some of these ideas are pretty pie-in-the-sky. I think my required deliverables are sufficiently ambitious that few of the optional ones will get done, but I'd be happy to be mistaken.

higher-resolution statistics
Extend the stats script to count occurrences of each supported math command. At the most basic, this would give coverage information for the wikidump tests to show how well we were really testing.
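
As a sketch of the basic coverage count, assuming "command" means a TeX control word (a backslash followed by letters). Environments and other constructs would need extra handling, and the `supported` set passed to `untested` is a hypothetical stand-in for texvc's real command list.

```python
import re
from collections import Counter

# Match any control sequence, but capture only control words. Consuming
# control symbols too means "\\" (a line break) followed by letters is
# not miscounted as a command.
CONTROL_SEQUENCE = re.compile(r"\\(?:([A-Za-z]+)|.)", re.DOTALL)


def count_commands(formulas):
    """Count how often each control word appears across all formulas."""
    counts = Counter()
    for tex in formulas:
        counts.update(name for name in CONTROL_SEQUENCE.findall(tex) if name)
    return counts


def untested(supported, counts):
    """List supported commands that never occur in the sample."""
    return sorted(set(supported) - set(counts))


if __name__ == "__main__":
    sample = [r"\frac{\alpha}{2}", r"\alpha + \beta"]
    counts = count_commands(sample)
    print(counts.most_common())
    print(untested({"frac", "alpha", "beta", "sqrt"}, counts))
```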

These data would not be used to justify dropping any currently supported commands, since that would be a very user-visible regression.

If I have time, there could be some interesting work to be done with the data, which are essentially thousands of points in a space of as many dimensions as there are supported math commands. We could, for example, try some principal component analysis or clustering. I've been itching for a chance to really learn these techniques hands-on, so it'd be an exciting epilogue after I finished all my required deliverables.

texvc no-PNG flag
If automated testing (which might require running texvc on dozens of examples during development and thousands or more during overnight tests) turns out to be really slow, add a flag to prevent texvc from actually calling tex or dvipng in the case that it decides to produce PNG output. Since the image is not examined by texvctest.sh, this is unnecessary work and just slows down automated testing.

For every test, check texvc output with and without no-PNG before accepting the no-PNG output for testing purposes. While we're testing the no-PNG flag, log how great an improvement it provides. This isn't worth merging back into the trunk unless a significant speedup is observed.

texvc-fuzz
As the very last finishing touch, write a generator which can produce valid and almost-valid TeX randomly for "fuzz-testing".
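
One possible shape for such a generator, as a sketch only: build a formula from a tiny hand-written grammar, then optionally damage it with a single random edit. The grammar, the atom list, and the function names are all made up for illustration and cover a small fraction of what texvc accepts. A mutated formula may or may not still be valid; the OCaml texvc would serve as the oracle either way.

```python
import random

# Hypothetical building blocks; a real generator would cover every command.
ATOMS = ["x", "y", "2", r"\alpha", r"\beta"]


def random_formula(rng, depth=3):
    """Generate a formula that is valid under the toy grammar."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(ATOMS)
    kind = rng.choice(["frac", "sqrt", "sup", "sum"])
    left = random_formula(rng, depth - 1)
    right = random_formula(rng, depth - 1)
    if kind == "frac":
        return r"\frac{%s}{%s}" % (left, right)
    if kind == "sqrt":
        return r"\sqrt{%s}" % left
    if kind == "sup":
        return "{%s}^{%s}" % (left, right)
    return "%s + %s" % (left, right)


def mutate(rng, tex):
    """Apply one random edit to produce almost-valid input."""
    i = rng.randrange(len(tex))
    edit = rng.choice(["delete", "duplicate", "insert"])
    if edit == "delete":
        return tex[:i] + tex[i + 1:]
    if edit == "duplicate":
        return tex[:i] + tex[i] + tex[i:]
    # Insert a character that is likely to unbalance or break the input.
    return tex[:i] + rng.choice("{}\\") + tex[i:]


if __name__ == "__main__":
    rng = random.Random(2010)  # fixed seed so failures can be reproduced
    for _ in range(3):
        formula = random_formula(rng)
        print(formula, "->", mutate(rng, formula))
```

Seeding the generator matters more than the grammar here: a failing input is only useful if the same run can be reproduced later.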

Project schedule
This schedule is highly provisional. As mentioned in the participation section, I hope to discuss it weekly with my mentor in order to adjust it for the last week's progress.

The official start of coding comes two weeks before the end of spring term at my university but, conveniently, the official end of coding comes five weeks before the start of autumn term. Would it be all right if we were to shift my Google Summer of Code schedule two weeks later? It makes evaluations a little odd, since I'd have only three weeks before midterms, but I think we could make it work.

Weeks are numbered with zero at my finals week, so that I would start coding in week 1 assuming my schedule change is accepted. Ranges are inclusive.


 * week -7
    * 21 April: accepted students announced


 * weeks -6 through -2
    * find out if it would be possible to log invalid input on a production Mediawiki server
    * try to find someone who would be willing to answer specific OCaml questions. Supposedly my department has two world-class OCaml researchers as faculty, and I know at least one grad student whose research involves ML.
    * ask on wikitech-l whether people want Math split off as an extension, and how best to handle the upgrade path
    * "read myself into" the project:
       * the Mediawiki style guide, supplemented by the Google style guide for the Python-specific bits
       * development policy
       * quick-start guide, although I shouldn't need to touch the database or the rest of the codebase except for testing installation
       * outstanding texvc bugs
    * set up a development environment: an OCaml compiler, a local subversion repository, MantisBT, backups onto another machine, etc.
    * try to fix bug #13518, which breaks Math on Windows. Even if I can't actually solve it, I'll get some experience with the relevant parts of the Mediawiki codebase and contribution guidelines


 * weeks -1 through 0
    * study for finals. Take finals.


 * week 1

Try to break your deliverables into "milestone" points which can be reached in sequence. Block out your estimated schedule of when you'll reach each functional milestone. Don't forget that real time may change -- leave enough wiggle room for your required features to be completed!

Participation
From a previous project at work, I've discovered a tendency to get lost in the nitpicky details unless I break the project into smaller tasks with their own deadlines and remind myself frequently of the big picture. A few things that helped:


 * a bug tracker which I used as a glorified to-do list and scratch pad
 * writing myself an email each night summarizing my progress for the day and laying out my goals for the next
 * explicitly scheduling time for things like eating, sleeping, and miscellaneous not-programming fun, in order to avoid burn-out
 * taking copious notes as I went, like documentation for myself but less formally structured, usually entered into the bug tracker or a permanent scratch file
 * putting my code in source control and actually exercising some SCM discipline, like writing useful commit logs and committing frequently
 * an automated way to run some fairly comprehensive test suites with a few commands

I'd like to propose three different kinds of communication:
 * specific questions about the code, like stubborn bugs, done very quickly and informally over email or IRC
 * a weekly email to my mentor specifically to report my progress and plan the next week, and a meeting on IRC to discuss the schedule and vaguer, higher-level questions that need more back-and-forth than would be convenient over email
 * code review after each milestone

Generally, my programming could be called some kind of infantile Agile or test-driven methodology:
 * 1) set up a development environment
 * 2) write a testing framework
 * 3) write a functional specification more detailed than "like the OCaml texvc does it"
 * 4) repeat until everything has been implemented:
    * write some tests that the OCaml texvc never fails, which define some minimally functional slice of texvc
    * code until those tests pass
    * write more stringent tests, fixing the code as necessary so they too pass
    * document this
 * 5) refactor (using the test suites to avoid regression) until I run out of time

We don't just want to know what you plan to accomplish; we want to know how. Briefly describe your work style: how you plan to communicate progress, where you plan to publish your source code while you're working, how and where you plan to ask for help. (We will tend to favor applicants that demonstrate a clear vision for what it means to be an active participant in our development community)

Past open source experience
Do you have any past experience working in open source projects (MediaWiki or otherwise)? If so, tell us about it!

Any other info
If there's other relevant information -- UI mockups, references to related projects, a link to your proof of concept code, whatever. There are no specific requirements, but we love to see people who love what they're doing. Show us you're excited about this project and have an interest in the background and are considering how best to make your idea work.