User:Damonwang

This is my proposal for Summer of Code 2010. I would welcome any comments on the associated discussion page, and especially answers to the questions in my notes. Thank you for taking the time to read this!

Identity
Name: Damon Wang

Email: damonwang uchicago edu

Project title: Towards a Python port of Texvc

Contact/working info
Timezone: Chicago or New York (UTC-5 or -4)

Typical working hours: noon to four, eight to midnight (UTC 17:00-21:00, 1:00-5:00)

IRC or IM networks/handle(s): damonwang on freenode

Project summary
Mediawiki pretty-prints math formulae by taking a subset of AMS-TeX and displaying either a PNG rendered via dvipng or, if the formula is simple enough, an HTML approximation. However, AMS-TeX has numerous features allowing arbitrary code execution, unbounded render time, and similarly undesirable behavior. The current solution is texvc, an OCaml parser which accepts only a safe subset of AMS-TeX, and produces the appropriate HTML or PNG output. Unfortunately, the obscurity of the source language seems to have discouraged maintenance: texvc has some fifty open bugs going back six years and including a request to make it easier to install; of the texvc bugs that have been resolved, very few involved edits to the OCaml code. It was therefore suggested that texvc be ported to a more popular language.

A PHP port would have been the ideal solution. Mediawiki already has a large pool of PHP developers, calling the external binary is currently the source of a major outstanding bug, and a PHP dependency for a project written in PHP is as good as eliminating a dependency entirely. In fact, a PHP port has been attempted, but it omitted all validation of the input, probably because PHP has no existing parsing packages. Judging from the LaTeX2MathML code, which does implement a LaTeX parser in PHP, it would be a step backward to write the parser manually. Much of the clarity, concision, and robustness of the OCaml texvc comes from the fact that we need only maintain a BNF-like input to a parser-generator, rather than the parser itself. As it would be well beyond the scope of GSoC to write a general PHP parsing package, I propose instead to port texvc into Python, which offers the following benefits:


 * popularity: easier to find maintainers, easier for sysadmins to install, easier for new developers to "read themselves into" the codebase for quick patches
 * ubiquity: although it does need an interpreter at run-time, this is not a difficult dependency to satisfy; probably even easier than an OCaml compiler at install-time or a binary distribution
 * several mature parsing packages: potentially no regression from the OCaml version in terms of concision, clarity, and "elegance"

About you
''We don't just care about your project -- you are a person, and that matters to us! What drives you? What makes you want to make this the most awesomest wiki enhancement ever?''

''You don't need to write out your life story (we can read your blog if we want that), but we want to know a little about what makes you tick. Are you a Wikipedia addict wanting to make your own experience better? Did a wiki with usability problems run over your dog, and you're seeking revenge? What does making this project happen mean to you?''

I am a third-year undergraduate who first started programming in a high school data structures course taught with Scheme and C. Perhaps because I started out with homework exercises, my interest was in small toy problems such as those offered by Project Euler, the USA Computing Olympiad, the Python Challenge, and the system scripts on my laptop. As a relative newcomer to open-source software, especially on the scale of Mediawiki, it's a relief to find a rather isolated corner of the project where I can make a concrete contribution without needing to grok the entire codebase.

Although Scheme was my first language in high school and Haskell my first language at university, my work as a sysadmin at the Mac Lab on campus these last two years has taught me that big libraries and side effects are often the easiest way to get a job done. For practical programming, Python is my first choice, with shell a close second for scripting and Perl tying with Common Lisp as very distant thirds. The texvc port offers almost the best of both worlds: exposure to a new functional language while improving my Python by coding to the standards of a larger, mature project.

Working at the MacLab has also generated a healthy respect for written records. Since our policy of hiring exclusively students enforces high turnover, inadequate documentation would quickly turn sysadmins into detectives or archaeologists. At one point, I wasted two hours configuring the wrong web server before someone pointed out that this machine had a second Apache installation in a nonstandard location. I've since set up a lab bug tracker to help people work together without overlaps or gaps, and a searchable mail archive to record discussions on our mailing list. Although I don't like writing documentation any more than the next guy, I do love using it, and so one of the core deliverables for my Python texvc will be a complete set of documentation.

Deliverables
''It should be possible to break down your project into some bullet points describing particular features or milestones which can be reached individually. Consider that we may wish to roll out the system for testing when at an intermediate stage of completion, and that time estimates might vary, leaving you with more time than you expected or (more likely) a lot less -- some features can be pushed back if you end up short.''

Since this is a port, and I'll essentially be regression-testing against the OCaml texvc, "correct" means "like the OCaml texvc". Whether I reproduce erroneous behavior will be decided on a case-by-case basis: as a rule of thumb, I won't give special effort to either recreating bugs or fixing them, unless specifically noted.

texvctest.sh
A testing script which determines if two versions of texvc produce identical output on stdout (accept/reject, HTML or MathML as necessary). This script will not attempt to check the PNGs, since we trust TeX to render the correct output once texvc proper decides that the input is safe enough and complicated enough to pass on. In order to account for intended deviations, it will either check that two texvc binaries produce the same output when run, or that a single texvc binary produces output identical to what is cached in a file.
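The two comparison modes described above could be sketched roughly as follows. This is only an illustration in Python rather than shell, and it assumes that a texvc-like command takes the formula as its last argument; the real texvc command line takes further arguments that are left out here. The names `run_texvc`, `compare_commands`, `compare_to_cache`, `ECHO` and `UPPER` are hypothetical.

```python
import subprocess
import sys


def run_texvc(command, formula):
    """Return the stdout of one texvc-like command run on one formula.

    Assumption: the formula is passed as the last argument. The real texvc
    command line takes further arguments, which are left out of this sketch.
    """
    result = subprocess.run(
        list(command) + [formula],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def compare_commands(command_a, command_b, formulas):
    """Return the formulas on which the two commands print different stdout."""
    return [
        formula
        for formula in formulas
        if run_texvc(command_a, formula) != run_texvc(command_b, formula)
    ]


def compare_to_cache(command, cached):
    """Return the formulas whose stdout differs from a cached {formula: stdout} map."""
    return [
        formula
        for formula, expected in cached.items()
        if run_texvc(command, formula) != expected
    ]


# Stand-in "binaries" for trying the functions out: Python one-liners, not texvc.
ECHO = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
UPPER = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
```

An empty list from either function means no deviation was found; anything else is the list of formulas to inspect by hand or to record as an intended deviation.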

texvcstats
Write a script which, given a dump of the <math> elements from a Mediawiki instance, counts the possible rendering modes. This will primarily be used to decide how much HTML support to incorporate into the Python texvc as a required deliverable. If, for example, we find that half of enwiki's math can be rendered as conservative HTML and almost all of the rest needs PNGs, then moderate and liberal HTML support would probably become optional deliverables.

test cases
Many TeX files mimicking valid and invalid Math input, grouped into a few sets approximately in order of descending importance:


 * 1) manually constructed invalid examples involving various security holes (calling other executables, catcodes, etc.)
 * 2) manually constructed valid examples involving each of the supported commands, both in isolation and in simple combinations
 * 3) manually constructed invalid examples involving common mistakes like unsupported commands, mismatched delimiters (braces or   blocks), etc.
 * 4) a dump of all math on enwiki and dewiki, to guarantee no regression against anything currently accepted

correct accept/reject decision
OCaml and Python versions should agree on what is valid and invalid input. In particular, Python texvc should accept all examples from enwiki and dewiki, since Bugzilla has no bug reports for the OCaml texvc in this area. It should also reject all of the contrived examples involving various security holes (this takes precedence over all other aspects of the Python texvc's behavior).
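To make the accept/reject idea concrete, here is a deliberately tiny whitelist check. It is a hand-written sketch and not the proposed design, which would use a parser-generator and take its grammar from the OCaml texvc. The command and symbol sets below are illustrative assumptions, and the check is far weaker than a real security review of TeX would require.

```python
import re

# Illustrative subsets only; the real lists would come from the OCaml texvc.
ALLOWED_COMMANDS = {"alpha", "beta", "frac", "sqrt", "sum", "int"}
ALLOWED_SYMBOLS = {",", ";", "!", "{", "}"}

# One token: a control word, a control symbol, a brace, or a run of plain text.
TOKEN = re.compile(r"\\([A-Za-z]+)|\\(.)|([{}])|([^\\{}]+)", re.DOTALL)


def is_accepted(formula):
    """Accept only whitelisted control sequences and balanced braces."""
    depth = 0
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            return False  # for example, a lone backslash at the end
        word, symbol, brace, _text = match.groups()
        if word is not None and word not in ALLOWED_COMMANDS:
            return False  # unknown commands such as \input are rejected
        if symbol is not None and symbol not in ALLOWED_SYMBOLS:
            return False
        if brace == "{":
            depth += 1
        elif brace == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace without an opener
        pos = match.end()
    return depth == 0
```

The point of the sketch is the default: anything not explicitly listed is rejected, so a new dangerous command cannot slip through merely because nobody thought to forbid it.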

correct HTML decision and output
Python and OCaml versions should agree what can be rendered as HTML and how that should be done. I will definitely support the conservative HTML output, and leave the less stringent options as optional deliverables depending on how often they are actually used according to the stats script.

documentation
In three formats:


 * 1) Python docstrings in compliance with Google's Python style guide as might be expected by future Python programmers
 * 2) inline documentation in compliance with the Mediawiki project's Doxygen style as might be expected by future maintainers with other Mediawiki development experience
 * 3) Manually written full-paragraphs-of-English documentation explaining
   * the overall architecture, including not only the code but also the test suite and my development workflow
   * reasoning behind more nontrivial design decisions
   * pointers for common expected modifications (e.g., "to add support for a new command, use these modules and edit these files and run these tests...")

As a concession to logistics, I'll probably only maintain the docstrings during development, and port them over to Doxygen at various milestones. Since the two systems contain the same information, I don't expect this to generate any oversights.

higher-resolution statistics
Extend the stats script to count occurrences of each supported math command. At the most basic, this would give coverage information for the wikidump tests to show how well we were really testing.
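A rough sketch of such a count, assuming the dump is plain text in which each formula sits between literal `<math>` and `</math>` tags. The function name and the regular expressions are illustrative; in particular the control-word pattern is only an approximation of TeX's tokenizer (for instance, it would misread a `\\` line break followed directly by letters).

```python
import re
from collections import Counter

# Assumption: formulas are delimited by plain <math>...</math> tags.
MATH_ELEMENT = re.compile(r"<math>(.*?)</math>", re.DOTALL)
# Approximation of a TeX control word: a backslash followed by letters.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")


def count_commands(wikitext, per_block=False):
    """Count command usage across all <math> elements in the text.

    With per_block=False, every occurrence is counted. With per_block=True,
    a command is counted once for each formula that uses it.
    """
    counts = Counter()
    for formula in MATH_ELEMENT.findall(wikitext):
        words = CONTROL_WORD.findall(formula)
        counts.update(set(words) if per_block else words)
    return counts
```

The per-block count is the one that indicates coverage: it says how many formulas would be affected if a given command were missing or broken.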

These data would not be used to justify dropping any currently supported commands, since that would be a very user-visible regression.

If I have time, there could be some interesting work to be done with the data, which are essentially thousands of points in a space of as many dimensions as supported math commands. We could, for example, try some principal component analysis or clustering. I've been itching for a chance to really learn these techniques hands-on, so it'd be an exciting epilogue after I finished all my required deliverables.

texvc no-PNG flag
If automated testing (which might require running texvc on dozens of examples during development and thousands or more during overnight tests) turns out to be really slow, add a flag to prevent texvc from actually calling tex or dvipng in the case that it decides to produce PNG output. Since the image is not examined by texvctest.sh, this is unnecessary work and just slows down automated testing.

For every test, check texvc output with and without no-PNG before accepting the no-PNG output for testing purposes. While we're testing the no-PNG flag, log how great an improvement it provides. This isn't worth merging back into the trunk unless a significant speed-up is observed.

Windows support
Fix bug 13518 preventing Math support on Windows servers. This depends on getting a Windows development environment, so I've scheduled it quite late.

texvc-fuzz
As the very last finishing touch, write a generator which can produce valid and almost-valid TeX randomly for "fuzz-testing". This is so pie-in-the-sky I won't even leave room for it in the schedule.
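Just to show the shape such a generator might take, here is a minimal sketch. The command and atom lists are illustrative assumptions, and "almost-valid" is approximated crudely by deleting one random character, which often but not always makes the formula invalid.

```python
import random

# Illustrative building blocks; a real generator would use the supported command list.
COMMANDS = ["\\alpha", "\\beta", "\\pi"]
ATOMS = ["x", "y", "1", "2", "+", "-", "="]


def random_formula(rng, depth=3):
    """Return a brace-balanced formula built only from the pieces above."""
    parts = []
    for _ in range(rng.randint(1, 4)):
        # At depth 0 only atoms are produced, so the recursion always ends.
        kind = rng.choice(["atom", "command", "group"]) if depth > 0 else "atom"
        if kind == "atom":
            parts.append(rng.choice(ATOMS))
        elif kind == "command":
            parts.append(rng.choice(COMMANDS))
        else:
            parts.append("{" + random_formula(rng, depth - 1) + "}")
    return " ".join(parts)


def corrupt(formula, rng):
    """Delete one random character to get an almost-valid variant."""
    index = rng.randrange(len(formula))
    return formula[:index] + formula[index + 1:]
```

Passing in a seeded `random.Random` instance keeps every run reproducible, so a formula that exposes a disagreement between the two texvc versions can be regenerated later.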

MathML support
Supposedly, texvc also supports MathML to some extent. Although dropping it in the Python port would technically be a regression against the OCaml texvc, it doesn't seem like a very important feature. IE requires a (probably never installed) plugin for MathML support and Opera has rendering problems, so correct MathML export would be of limited use to most users anyway. I'd prefer to focus my time on more commonly used features like PNG and HTML rendering.

Project schedule
''Try to break your deliverables into "milestone" points which can be reached in sequence. Block out your estimated schedule of when you'll reach each functional milestone. Don't forget that real time may change -- leave enough wiggle room for your required features to be completed!''

This schedule is highly provisional in two senses. As mentioned in the participation section, I hope to discuss it weekly with my mentor in order to adjust it for the last week's progress. Also, if I start falling behind, I will cut features before blowing a deadline, with the intention of having source that will compile and run at some level of functionality. This is especially true in weeks 10 and 11, where the number of existing bugtracker tickets I close will be dictated purely by how much time I have left.

The official start of coding comes two weeks before the end of spring term at my university, but, conveniently, the official end of coding comes five weeks before the start of autumn term. Would it be all right if we were to shift Google Summer of Code down two weeks? It makes evaluations a little odd, since I'd have only three weeks before midterms, but I think we could make it work.

Weeks are numbered with zero at my finals week, so that I would start coding in week 1 assuming my schedule change is accepted. Ranges are inclusive.

weeks -7 through -2

 * 21 April (week -7)
 * accepted students announced


 * 26 May (week -2)
 * official start of coding


 * find out if it would be possible to log invalid input on a production Mediawiki server
 * try to find someone who would be willing to answer specific OCaml questions. Supposedly my department has two world-class OCaml researchers as faculty, and I know at least one grad student whose research involves ML.
 * ask on wikitech-l whether people want Math split off as an extension, and how best to handle the upgrade path
 * research security holes offered by TeX. A few ways I might start:
 * look through the TeXBook, TeX: The Program and TeX by Topic
 * ask on wikitech-l (particularly Conrad Irwin), texhax@tug.org, and comp.text.tex
 * "read myself into" the project:
 * the Mediawiki style guide, supplemented by the Google style guide for the Python-specific bits
 * development policy
 * quick-start guide, although I shouldn't need to touch the database or the rest of the codebase except for testing installation
 * outstanding texvc bugs
 * set up a development environment: an OCaml compiler, a local subversion repository, MantisBT, backups onto another machine, etc.
 * update
 * try to fix bug #13518 which breaks Math on Windows, just to get some experience with relevant parts of the Mediawiki codebase and contribution guidelines. I'd have to test on a desktop Windows environment, so the patch would have to be reviewed by someone with a server.

weeks -1 through 0

 * 3-4 June
 * university reading period


 * 7-11 June
 * university finals

week 1: set-up

 * 13 June
 * actual start of coding


 * write and document texvctest.sh
 * write and document the statistics script
 * produce some visualisations showing how many of the math blocks on enwiki require each of the supported commands, so that through the rest of the summer we know how bad immediate deployment would be.
 * decide at what level to support HTML output

At the end, I should know what counts as a feature-complete port.

week 2: test suites

 * write test cases covering all the dangerous bits of TeX discovered during weeks -7 through -2 (very simple usage, not any kind of elaborate exploit)
 * write test cases covering all the supported commands (in very simple usage, just to make sure the parser recognizes them as valid tokens)
 * sort the enwiki dump according to the commands in each snippet, to assemble test suites for each of the downstream milestones
 * check that OCaml texvc passes all the valid tests and fails all the invalid tests
 * depending on how long it took to run the full suite the first time, optionally implement a no-PNG flag so that runs complete in reasonable time, and make sure its output matches OCaml texvc without no-PNG, so future testing goes faster
 * decide whether dewiki significantly adds to the diversity of the test suites, and, if so, sort its dump the same way
 * discuss with mentor the possibility of reordering the milestones so that the parser supports as much of enwiki's math as early as possible
 * document the layout of the test cases

At the end, I should have test suites for all the supported commands and all of the forbidden ones.

weeks 3 through 6: parsing

 * 7 July (week 4)
 * start midterms


 * 14 July (week 5)
 * end midterms


 * 1) Translate all the BNF-like parsing rules over from OCaml, with stubs for all the parser actions and a single hardcoded example for each token
 * 2) Fill in the tokenizer patterns and the parser actions

The order in which I implement these may change if it turns out some commands are notably more popular than others, but at the moment, based on the Math help page, I'd divide the parser into four categories:


 * 1) grouping with braces
 * 2) commands that just substitute a symbol, like the Greek alphabet, operators, etc., in the order listed
 * 3) commands that modify display, like typefaces and color
 * 4) commands which modify layout, like in "Larger Expressions"

At the end, Python texvc should call TeX on all valid cases and reject all invalid cases. I will also provide documentation for that parser, and a document in the style of a quick-start guide which explains how to add support for a new command.

weeks 7 through 8: HTML
At the end, Python texvc should match OCaml texvc for the supported HTML output modes.


 * 1) extend texvctest.sh to allow each of the three HTML rendering policies: always PNG, HTML if simple, HTML if possible
 * 2) implement conservative HTML output
 * 3) possibly also implement moderate and liberal HTML output, depending on statistics and scheduling

Depending on how far we've fallen behind by then, whether due to schedule slip or feature creep, I'd discuss with my mentor how important moderate and liberal HTML support is.

week 9: Windows compatibility or refactoring as extension

 * 11 August (week 9)
 * "pencils down"

I've left this until last because it's not certain whether I'll be able to get a Windows testing environment. I'll have been working throughout the previous weeks to try to find one, or, failing that, find someone with one who could work with me. However, if at all possible I would like to test the Python texvc on Windows.

If I haven't solved bug 13518 yet, this would be the time to pick it up again.

At the end, the Python texvc should work when called from Cygwin, the native prompt ("PowerShell"?), and the web server.

If I haven't been able to get a Windows testing environment, then I'll refactor the Math code as an extension without pulling the original PHP out of core. Apparently there's some disagreement over whether Math should now be moved to an extension, and that hasn't been settled yet. In particular, upgrading older installations past the core-to-extension transition might be confusing.

weeks 10 through 11: clean-up

 * 18 August (week 10)
 * official end of coding, start of finals


 * check that Python texvc doesn't need write access to its own executable like OCaml texvc
 * go through my bugtracker and close as many tickets as possible
 * check the Python texvc for each bug reported against the OCaml texvc in the official Bugzilla; solve them, if possible
 * submit remaining tickets from my bugtracker to the official Bugzilla

week 12: documentation

 * 3 September (week 12)
 * start code submissions, actual end of coding


 * revise and copy-edit inline documentation
 * port the docstrings over to Doxygen-style inline documentation as required by Mediawiki style guide
 * Now that I have the entire summer's experience behind me, look over my notes and write up a document in the style of How to become a MediaWiki hacker, with solutions to any unobvious problems a new contributor might run into setting up to write a quick patch for Python texvc
 * Add a Math blurb to How to become a MediaWiki hacker
 * copy any remaining open tickets from my bugtracker to the official Bugzilla
 * summarize my notes into an English document in the style of Mediawiki's Manual:Code, explaining at a high level the architecture of Python texvc

Participation
''We don't just want to know what you plan to accomplish; we want to know how. Briefly describe your work style: how you plan to communicate progress, where you plan to publish your source code while you're working, how and where you plan to ask for help. (We will tend to favor applicants that demonstrate a clear vision for what it means to be an active participant in our development community)''

From a previous project at work, I've discovered a tendency to get lost in the nitpicky details unless I break the project into smaller tasks with their own deadlines, and remind myself frequently of the big picture. A few things that helped:


 * a bug tracker which I used as a glorified to-do list and scratch pad
 * writing myself an email each night summarizing my progress for the day and laying out my goals for the next
 * explicitly scheduling time for things like eating, sleeping, and miscellaneous not-programming fun, in order to avoid burn-out
 * taking copious notes as I went, like documentation for myself but less formally structured, usually entered into the bug tracker or a permanent scratch file
 * putting my code in source control and actually exercising some SCM discipline, like writing useful commit logs and committing frequently
 * an automated way to run some fairly comprehensive test suites with a few commands

I'd like to propose three different kinds of communication:
 * specific questions about the code, like stubborn bugs, done very quickly and informally over email or IRC
 * a weekly email to my mentor specifically to report my progress and plan the next week, and a meeting on IRC to discuss the schedule and vaguer, higher-level questions that need more back-and-forth than would be convenient over email
 * code review after each milestone

Generally, my programming could be called some kind of infantile Agile or test-driven methodology:
 * 1) set up a development environment
 * 2) write a testing framework
 * 3) write a functional specification more detailed than "like the OCaml texvc does it"
 * 4) repeat until everything has been implemented:
 * 5) write some tests that the OCaml texvc never fails, which define some minimally functional slice of texvc
 * 6) code until those tests pass
 * 7) write more stringent tests, fixing the code as necessary so they too pass
 * 8) document this
 * 9) Refactor (using the test suites to avoid regression) until I run out of time

Past open source experience
''Do you have any past experience working in open source projects (MediaWiki or otherwise). If so, tell us about it!''

I haven't got any projects whose code would be available for people to read, but my work as a sysadmin has occasionally required minor bits of software development. For example, I have maintained an in-house print system for two years, which involved Python and MySQL in the server process and a PHP web interface. Since a core feature was the ability to filter out print jobs which violated lab policy, my modifications included improved copy-counting for MSOffice documents that rendered to raw PostScript, a temporary ban feature so users could not flood the system with decoy jobs, a new keep-alive mechanism using Mac OS X's LaunchDaemon system, and various other modifications as users developed new strategies for obfuscating their documents or deceiving the desk staff.

Although this project is not open-source, it was similar to distributed open-source development in one critical way: I picked up a mature project which was actively developed by multiple people at the same time, brought myself up to speed by reviewing the code and what documentation there was, and communicated with my fellow contributors and the original developers primarily by email. I hope the same skills will serve me well as I port texvc.

Any other info
This project has been discussed on wikitech-l in a very long thread titled, "GSoC project advice: port texvc to Python?".

Second mentor
Prof. John Reppy has agreed to answer any OCaml-specific questions whose answers I couldn't find in the manual or online. I hope that, with his expertise available, the issue of not actually knowing any OCaml will not be a serious problem in this project.

PHP or Python?
One of the most difficult decisions in drawing up this proposal was choosing between Python and PHP. Both are similarly popular, well-supported, and mature. Neither would constitute a significant burden upon sysadmins to set up or maintain, and either would enormously expand texvc's pool of potential contributors and maintainers.

Since PHP is the language of choice for the rest of Mediawiki, though, a PHP port would have a few advantages which Python could not match, namely,


 * a PHP dependency is essentially not a dependency at all, if the user wants to install the rest of Mediawiki
 * while it wouldn't be difficult to find Python experience, Mediawiki already has plenty of experienced PHP developers
 * a native PHP port would immediately eliminate the quoting problems that currently break TeX support on Windows

The reason I chose Python anyway was the availability of mature parser-generator packages, which PHP cannot match. To summarize my argument on the wikitech-l mailing list, if the port is motivated by maintainability concerns due to the obscurity of the OCaml language, then it would be a step backwards to port to PHP because at this point, manually written LALR parsers would be even more obscure than OCaml.

Recently, James Salsman turned my argument on its head when he suggested LIME, which purports to be a

"Complete LALR(1) parser generator and engine (like BISON or YACC) but it's all done in PHP, and the input grammar is easier and more maintainable."

If this is actually true, I'd be all in favor of a PHP port. However, I'm concerned about LIME's stability. It hasn't been updated in almost two years, and I can't find anybody talking about it or filing bugs against it, which suggests that it has lots of hidden flaws that haven't been discovered yet because it's not ready for general use and it's not being actively tested or developed any more. Aside from a recommendation on stackoverflow, I can't find anyone who will recommend it. I'd love to be proven wrong, but I don't think LIME is stable enough for production use.