User:GorillaWarfare/GSoC

Hi! I'm Molly White, a sophomore computer engineering student at Northeastern University. I'm hoping to participate in Summer of Code 2013 and/or Outreach Program for Women.

I love the Wikimedia projects in general, but have a few specific interests that could be particularly fun to pursue:
 * Exporting pages to various formats (particularly LaTeX)
 * Wikisource and proofreading improvements

Improvements to Extension:Collection
The Collection extension is a valuable tool that allows users to save articles to PDF, ODF, XML, or ZIM, or to order a printed book from PediaPress. It can be used for a single article or, as the name suggests, to compile a collection of articles. Many features and improvements could be added to this tool, too many to take on in a single GSoC project. These improvements include:

Ease of use
Many Wikipedia readers do not realize that Collection exists. A survey performed two years ago indicated that Wikipedia readers are very interested in being able to use Wikipedia offline, save articles for later viewing, and download/print articles. Ideas have already been proposed for a Collection Extension 2, but the project is stalled. That proposal suggests improvements to the placement, layout, and wording: change the confusing book metaphor to a reading list metaphor, move the print/collection functionality to the main article area, and create an improved "manage collections" page.

More output formats
Collection can currently create PDF, ODF, XML, and ZIM files. There are a good number of other formats that could be added, many of which have already been requested:
 * Plain text
 * HTML
 * LaTeX, DVI
 * EPUB, also see Wikisource:WSexport
 * FictionBook

Add formatting options
Allow the user to optionally change some of the output options, such as:
 * Grayscale/color
 * Include/exclude TOC with page numbers, anchors
 * Set base font size
 * Option to exclude links
 * Exclude sections that aren't as useful in printed versions (e.g., external links, see also)
 * Exclude images (particularly useful for wikis like Wikinews that include large numbers of images, often resulting in several pages of nothing else)
 * Begin new page for every article
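
As a rough illustration of how these options could be represented, here is a minimal Python sketch. Everything in it is hypothetical: `ExportOptions`, its field names, its defaults, and `filter_sections` are invented for this example and are not part of the Collection extension.

```python
from dataclasses import dataclass


@dataclass
class ExportOptions:
    """Hypothetical per-export settings; these names do not exist in Collection."""
    color: bool = True                   # False would render in grayscale
    include_toc: bool = True             # table of contents with page numbers/anchors
    base_font_size_pt: int = 10          # assumed default, chosen arbitrarily
    include_links: bool = True
    include_images: bool = True
    page_break_per_article: bool = False
    # Section titles (lowercase) to leave out of printed output.
    excluded_sections: frozenset = frozenset({"external links", "see also"})


def filter_sections(sections, options):
    """Drop (title, body) pairs whose title is excluded, ignoring case."""
    return [
        (title, body)
        for title, body in sections
        if title.strip().lower() not in options.excluded_sections
    ]
```

A renderer could consult one such object per export, so that each option stays independent of the output format.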

Better support for wikis with their own methods of collecting articles
The Collection extension is designed to export groups of articles, but does not translate well to wikis that have their own ways of collecting articles (Wikisource, Wikibooks, Wikiversity). The "Create a book/collection" and "Download as PDF" links in the sidebar simply create a PDF version of that particular page, usually just a sort of index/landing page. The expected result would be to create a PDF version of the full book.

Miscellaneous improvements

 * Currently no support for Extension:Quiz, which is an issue for Wikiversity
 * Format math equations in TeX (currently rendered as bitmap images)
 * Test and improve support for frequently-used templates
   * Infoboxes
   * Quote boxes
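
To make the math item concrete, here is a minimal Python sketch of passing equations through as TeX instead of rasterizing them. It assumes the simplest case only: plain `<math>...</math>` tags with no attributes, in a document headed for LaTeX output. The function name is invented for this example and this is not how Collection currently handles math.

```python
import re

# Matches a bare <math>...</math> pair; tags with attributes are not handled.
MATH_TAG = re.compile(r"<math>(.*?)</math>", re.DOTALL)


def math_tags_to_latex(wikitext):
    """Replace each <math>...</math> span with inline LaTeX math, $...$."""
    # A function replacement keeps backslashes in the TeX source untouched.
    return MATH_TAG.sub(lambda m: "$" + m.group(1).strip() + "$", wikitext)
```

For example, `math_tags_to_latex("Energy: <math>E = mc^2</math>.")` yields `Energy: $E = mc^2$.`, which a LaTeX backend can typeset directly.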

Exporting/parsing to various document formats
The various projects use all sorts of tools to make pages accessible in different formats. These include:
 * File->Print
   * Uses CSS from MediaWiki:Print.css to remove metadata, navboxes, disambiguation links, etc., as well as to reformat some links, uncollapse tables, etc.
 * Book extension
   * Exports to PDF, ODF, DocBook XML, ZIM, or printed book from PediaPress
   * Uses Extension:PDF Writer, Extension:OpenDocument Export, Extension:XML Bridge, and Extension:Collection/openZIM.
 * Special:Export
   * Exports text and history to XML so that a page can be imported to another wiki. Not intended to create a document.
 * Wikisource:WSexport (more documentation at fr:s:Wikisource:Wsexport)
   * Exports to EPUB2. EPUB3, XHTML, and ODT are in development.
   * Wikisource-specific. Not in use on other wikis.
   * Relatively inflexible.
 * Other output extensions (Category:Output extensions)
   * Extension:Book (experimental, only tested on MediaWiki 1.12)
   * Extension:EPubExport (beta, somewhat limited language support)
   * Extension:Pdf Export (beta, meant for single article)
   * Extension:PdfBook (stable, meant for compiling several articles)
   * Extension:Wiki2LaTeX (stable, but has a security vulnerability. Creates LaTeX files and PDFs from these. No wikimarkup error correction.)

There is currently no working extension that uses LaTeX to create these PDFs (see Wiki2LaTeX above). It would also be interesting to explore how these extensions or a new extension could use the HTML created by Parsoid to generate other formats. I've explored the idea of integrating pandoc to perform the conversion, but its MediaWiki reader can only handle very simple wikimarkup, and its HTML reader does not seem to appreciate Parsoid's somewhat unique HTML. See User:GorillaWarfare/pandoc and my issue report on the matter.
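
For reference, the pandoc experiment amounts to something like the following Python sketch. It assumes pandoc is installed and on the PATH and uses pandoc's standard `--from`/`--to` options with its `html`, `mediawiki`, and `latex` format names; the helper function names are invented for this example, and this is a rough outline, not a proposed integration.

```python
import shutil
import subprocess


def build_pandoc_command(source_format, target_format):
    """Build a pandoc invocation that reads stdin and writes stdout."""
    return ["pandoc", "--from", source_format, "--to", target_format]


def convert(text, source_format="html", target_format="latex"):
    """Convert text with pandoc; raise if pandoc is not available."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    result = subprocess.run(
        build_pandoc_command(source_format, target_format),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc fails
    )
    return result.stdout
```

Calling `convert(parsoid_html)` would exercise the HTML route, and `convert(wikitext, source_format="mediawiki")` the wikimarkup route. Both run into the limitations described above.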

There are quite a few document export-related bugs that could be incorporated into a project:
 * : PDF export: Use LaTeX formulas instead of inline images
 * : Support exporting content in DocBook format
 * : PDF export extension fails to render Arabic characters in monospace text (, preformatted)
 * : PDF export extension problem with wiki table
 * : PDF export extension problem with HTML tags in RTL wikis
 * : Create an export for FreeDict entry formats
 * : Provide EPUB sanitizer
 * : "Download as PDF" doesn't show references inside template

VisualEditor plugins
Particularly interesting plugins would be those that could improve Wikisource, and plugins for sheet music and source code (with syntax highlighting!).

Bugs:
 * : Refactor the edition of Page: pages JavaScript module

Wikisource Proofread Page extension
This could be a fun one: Proofread Page could really use some improvements.
 * Refactor code/write unit tests
 * Allow compatibility with the VisualEditor
 * Wikisource-specific VisualEditor modules
 * Improve editing toolbar?

Bugs:
 * : Refactor the edition of Page: pages JavaScript module
 * : Add support of Page: pages of Wikisource
 * : Add option to change the position of scanned page
 * : ProofreadPage does not use image's full resolution when zooming in

Wikisource OCR
Wikisource's OCR is lackluster. It uses Tesseract, I believe, though there is next to no documentation. It would be interesting if the tool could be trained per book.

Extension_talk:Proofread_Page

Other Wikisource project ideas
Some of these are too small for a full project, but could be grouped into a Wikisource project umbrella or included as an "if I have extra time" thing.
 * One-click import from Internet Archive
   * Could also be expanded to other resources: Project Gutenberg or similar
 * Create functionality to group multiple pages into one book (also potentially useful for Wikibooks)
 * There are some good ideas from a GSoC proposal from last year that was not accepted: User:Aashish.mittal/GSoC Application, bug report comment, Books on meta
 * Line numbers on Wikisource
 * User automation tools

Other interesting bugs

 * : Audio pronunciation: Automatic text-to-speech to convert IPA to sound
 * : Review and Deploy Wikicaptcha