User:GorillaWarfare/GSoC

Hi! I'm Molly White, a sophomore computer engineering student at Northeastern University. I'm hoping to participate in Google Summer of Code 2013 and/or the Outreach Program for Women.

I love the Wikimedia projects in general, but have a few specific interests that could be particularly fun to pursue:
 * Exporting pages to various formats (particularly LaTeX)
 * Wikisource and proofreading improvements

Exporting/parsing to various document formats
The Wikimedia projects use a variety of tools to make pages available in different formats. These include:
 * File->Print
   * Uses CSS from MediaWiki:Print.css to remove metadata, navboxes, disambiguation links, etc., reformat some links, and uncollapse tables.
 * Book extension
   * Exports to PDF, ODF, DocBook XML, ZIM, or printed book from PediaPress
   * Uses Extension:PDF Writer, Extension:OpenDocument Export, Extension:XML Bridge, and Extension:Collection/openZIM.
 * Special:Export
   * Exports text and history to XML so that a page can be imported to another wiki. Not intended to create a document.
 * Wikisource:WSexport (more documentation at fr:s:Wikisource:Wsexport)
   * Exports to EPUB2. EPUB3, XHTML, and ODT are in development.
   * Wikisource-specific. Not in use on other wikis.
   * Relatively inflexible.
 * Other output extensions (Category:Output extensions)
   * Extension:Book (experimental, only tested on MediaWiki 1.12)
   * Extension:EPubExport (beta, somewhat limited language support)
   * Extension:Pdf Export (beta, meant for a single article)
   * Extension:PdfBook (stable, meant for compiling several articles)
   * Extension:Wiki2LaTeX (stable, but has a security vulnerability; creates LaTeX files and generates PDFs from them; no wikimarkup error correction)
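
As a small illustration of the Special:Export entry above, here is a sketch of how the export URL for a page could be built. The function name `export_url` and the example wiki base URL are my own choices for illustration, and the sketch assumes the wiki serves pages under a `/wiki` path.

```python
from urllib.parse import quote


def export_url(wiki_base: str, title: str) -> str:
    """Build a Special:Export URL for one page (hypothetical helper).

    wiki_base is assumed to be the article path prefix,
    e.g. "https://en.wikisource.org/wiki".
    """
    # MediaWiki titles use underscores in place of spaces.
    normalized = title.replace(" ", "_")
    # Keep ":" and "/" readable, since they are common in namespaced titles.
    return f"{wiki_base.rstrip('/')}/Special:Export/{quote(normalized, safe=':/')}"


if __name__ == "__main__":
    print(export_url("https://en.wikisource.org/wiki", "Main Page"))
```

Fetching that URL returns the page's XML dump, which is meant for importing into another wiki rather than for reading.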

There is currently no working extension that uses LaTeX to create these PDFs (see Wiki2LaTeX above). It would also be interesting to explore how these extensions, or a new extension, could use the HTML created by Parsoid to generate other formats. I've explored the idea of integrating pandoc to perform the conversion, but its MediaWiki reader can only handle very simple wikimarkup, and its HTML reader does not cope well with Parsoid's somewhat unusual HTML. See User:GorillaWarfare/pandoc and my issue report on the matter.
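
For reference, the pandoc experiment amounts to something like the following sketch. It assumes pandoc is installed and on the PATH; the helper names `build_pandoc_command` and `convert` are hypothetical, and this only shows the basic invocation, not a working export pipeline.

```python
import shutil
import subprocess
from typing import List


def build_pandoc_command(source_format: str, target_format: str) -> List[str]:
    """Return the pandoc command line for a conversion (hypothetical helper)."""
    # -f selects the input format, -t the output format.
    return ["pandoc", "-f", source_format, "-t", target_format]


def convert(text: str, source_format: str = "mediawiki",
            target_format: str = "latex") -> str:
    """Pipe text through pandoc and return the converted output."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    result = subprocess.run(
        build_pandoc_command(source_format, target_format),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc reports a failure
    )
    return result.stdout


if __name__ == "__main__":
    # Simple markup like this converts; templates and parser functions do not.
    print(convert("''Hello'' '''world'''"))
```

Passing `source_format="html"` instead would be the way to try feeding it Parsoid's output, which is where the HTML problems mentioned above show up.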

There are quite a few document export-related bugs that could be incorporated into a project:
 * : PDF export: Use LaTeX formulas instead of inline images
 * : Support exporting content in DocBook format
 * : PDF export extension fails to render Arabic characters in monospace text (, preformatted)
 * : PDF export extension problem with wiki table
 * : PDF export extension problem with HTML tags in RTL wikis
 * : Create an export for FreeDict entry formats
 * : Provide EPUB sanitizer
 * : "Download as PDF" doesn't show references inside template

VisualEditor plugins
Particularly interesting plugins would be those that could improve Wikisource, and plugins for sheet music and source code (with syntax highlighting!).

Bugs:
 * : Refactor the edition of Page: pages JavaScript module

Wikisource Proofread Page extension
This could be a fun one—Proofread Page could really use some improvements.
 * Refactor code/write unit tests
 * Allow compatibility with the VisualEditor
 * Wikisource-specific VisualEditor modules
 * Improve editing toolbar?

Bugs:
 * : Refactor the edition of Page: pages JavaScript module
 * : Add support of Page: pages of Wikisource
 * : Add option to change the position of scanned page
 * : ProofreadPage does not use image's full resolution when zooming in

Wikisource OCR
Wikisource's OCR is lackluster. I believe it uses Tesseract, though there is next to no documentation. It would be interesting if the tool could be trained per book.

Extension_talk:Proofread_Page

Other interesting bugs

 * : Audio pronunciation: Automatic text-to-speech to convert IPA to sound
 * : Review and Deploy Wikicaptcha