Help:Extension:ProofreadPage/2013 draft

= Proofread =

Proofreading produces the works on Wikisource from page scans. Page scans are normally DjVu or PDF files that are uploaded to Wikimedia Commons. Proofreading takes place in the Index and Page namespaces, and the result is then transcluded into the main namespace. The proofreading process is split into phases, which are indicated by each page's page status. Wikisource has a style guide and certain formatting conventions that should be used during proofreading to make sure that our texts look correct and function properly. The proofreading function is provided by the ProofreadPage extension.

Users who are new to proofreading can experiment with the concept and test their abilities with the simple introductory tests on the Distributed Proofreaders website.

The Proofread of the Month (PotM) is a good place to start for people who want to learn how proofreading works on Wikisource. This project runs a new work each month and invites all users to take part.

Help

 * Page scans
 * Index pages
 * Page numbers
 * Page status
 * Formatting conventions
 * Transclusion

= Proofread Page extension =

The Proofread Page extension can render a book either as a column of OCR text beside a column of scanned images, or broken into its logical organization (such as chapters or poems) using transclusion.

The extension is intended to allow easy comparison of text to the original and allow rendering of a text in several ways without duplicating data. Since the pages are not in the main namespace, they are not included in the statistical count of text units.

The extension is installed on all Wikisource wikis. However, for this to work the editor's browser (and extensions such as NoScript) must allow script processing. Your Special:Preferences page (section "Gadgets") allows you to control certain features, such as whether the OCR button is enabled and whether the text by default appears side by side or one above another.

Anybody is able to proofread and correct most pages at Wikisource. However, editors must log into an account in order to change the proofread status. IP addresses cannot change this status. When corrections and formatting are complete, the page is marked as proofread and is ready for the main namespace. Leave the page as 'not proofread' until it is done. Mark it as problematic if appropriate.

Extension

 * 1) Install LabeledSectionTransclusion (not required but strongly recommended)
 * 2) Download the files from Git or download a snapshot (select your version of MediaWiki) and place the files under $IP/extensions/ProofreadPage. Warning: the current master branch of the Git repository is only compatible with MediaWiki 1.21 and above. To use ProofreadPage with MediaWiki 1.19 or 1.20, use the REL1_19 branch.
 * 3) Add the line that loads the extension to the end of LocalSettings.php.
 * 4) Add the required tables to the database by running MediaWiki's update script (maintenance/update.php) from the command line. (Note: your designated database user needs to have CREATE rights on your MediaWiki database.)
 * 5) Installation can now be verified through Special:Version on your wiki

Thumbnailing
The extension links directly to image thumbnails, which often don't exist. You must catch 404 errors and generate the missing thumbnails. You can do this with any one of these solutions:
 * Set an Apache RewriteRule in .htaccess that sends requests for missing thumbnails to thumb.php.
 * Set the Apache 404 handler to Wikimedia's thumb-handler. This is a general-purpose 404 handler with Wikimedia-specific code, not simply a thumbnail generator.
 * For MediaWiki >= 1.20, simply redirect to thumb_handler.php, either in .htaccess or in apache2.conf.

WARNING: There is an .htaccess file in the images directory that may interfere with any .htaccess rules you install.

Namespaces
By default, ProofreadPage creates two custom namespaces, named "Page" and "Index" in English, with ids 250 and 252 respectively.

Their names are translated if your wiki uses another language.

You can customize their names or their ids: create the namespaces by hand and set their ids in LocalSettings.php using the $wgProofreadPageNamespaceIds global.

Configuration

 * In order to use the page quality system, it is necessary to create five categories, one per quality level. The names of these categories must be defined in s:Mediawiki:Proofreadpage_quality0_category to s:Mediawiki:Proofreadpage_quality4_category.
 * Ensure that you have installed Extension:ParserFunctions

Configuration of index namespace
 * You need to create MediaWiki:Proofreadpage_index_template in order to display index pages. This page is a template that receives the entries of the edit form as parameters.
 * You need to create MediaWiki:Proofreadpage_index_data_config, which contains the configuration of the index form.

The configuration is a JSON array of properties. All the parameters of a property are optional and have default values. The data parameter can take one of the following values: "type", "language", "title", "author", "translator", "illustrator", "editor", "school", "year", "publisher", "place" or "progress".

Creating your first page

 * Before following these steps, ensure that you have followed the instructions in Using DjVu with MediaWiki.
 * Create a page in the "Page" namespace (or the internationalized name if you use a non-English wiki). For example, if your namespace is 'Page', create 'Page:Alice in Wonderland.djvu'
 * Create the corresponding file for this page File:Alice in Wonderland.djvu
 * Create the index page 'Index:Alice in Wonderland.djvu'
 * To edit page 5 of the book navigate to 'Page:Alice_in_Wonderland.djvu/5' and click edit
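
The naming pattern used in the steps above can be sketched in a few lines of Python. This is only an illustration: the helper name page_title is hypothetical, and the sketch assumes that spaces in a title may be written as underscores, as the last step does.

```python
def page_title(file_name, page_number, namespace="Page"):
    """Return the title of one scanned page, e.g. Page:Book.djvu/5.

    'namespace' defaults to the English name; a non-English wiki would
    pass its own localized namespace name instead.
    """
    # Spaces are written as underscores, as in the example URL-style title.
    return f"{namespace}:{file_name.replace(' ', '_')}/{page_number}"


print(page_title("Alice in Wonderland.djvu", 5))
```

The page number after the slash is the position of the page in the file, not the number printed on the scanned page.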

OAI-PMH
Since 28904, the extension has an OAI-PMH API for index pages. This API is implemented in a new special page Special:ProofreadIndexOai using a basic OAI-PMH protocol with Simple Dublin Core (oai_dc) and Qualified Dublin Core (prp_qdc). This repository provides the data stored in index pages. [//wikisource.org/wiki/Special:ProofreadIndexOai?verb=ListRecords&metadataPrefix=prp_qdc Example in oldwikisource].
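
As a rough sketch, a request URL like the one in the example link can be assembled with Python's standard library. The helper name oai_request_url is hypothetical, and the https scheme is an assumption, since the example link is protocol-relative.

```python
from urllib.parse import urlencode

# Assumption: the special page is reachable over HTTPS on the host
# used in the example link.
BASE_URL = "https://wikisource.org/wiki/Special:ProofreadIndexOai"


def oai_request_url(verb, metadata_prefix=None, base_url=BASE_URL):
    """Build an OAI-PMH request URL for the index-page repository."""
    params = {"verb": verb}
    if metadata_prefix is not None:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)


# Same query as the example link: all records in Qualified Dublin Core.
print(oai_request_url("ListRecords", "prp_qdc"))
```

Passing "oai_dc" as the metadata prefix would request the Simple Dublin Core format instead.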

Sets based on MediaWiki categories can be configured in Mediawiki:Proofreadpage_index_oai_sets, which contains a JSON array.

Todolist
 * The output is currently wrapped as follows: // wrap the output in a div, to prevent the parser from inserting paragraphs: $out = "<div>\n$out\n</div>"; $out = $parser->recursiveTagParse( $out ); return $out; There are two ways to fix this: remove the outer div at the return point, or find out whether there is a better way to avoid the paragraph insertion.
 * should not contain the header & footer fields, but only the transcluded part of pages. This is needed for two-column books. This requires an update of all pages using the database : for the moment, javascript workaround
 * should be followed by \n
 * float images in paragraph : do not use \n to glue pages. use ' ' in some cases ?
 * This break template:nop on en which protect the last linefeed on a page to be removed, replacing '\n' by ' ' generate the following sequence in this case "\n first line on the next page" : which mediawiki handle as a  . See  where &amp;#32; is proposed to workaround this problem.
 * deprecate pagenum template
 * layout integration: remove containers
 * two columns : use magic templates instead of javascript classes
 * Code generated by transclusion through <pages /> is wrapped in a div. This is buggy when several pages tags are needed on one page without enclosing each part in an HTML block tag. The code shows that adding the div is intentional.
 * Empty pages are not displayed in the ribbon displaying the proofread status of transcluded pages. This misleads the reader into thinking that the status of the current text is good even if it's incomplete. In toc mode, empty pages are correctly displayed.
 * Possible solution: fix the way the total page count is computed in non-toc mode


 * The LEFT JOIN part filters out non-existing pages. The query w/o the left join has been tried on the toolserver and it returned the right number of pages.


 * ( low priority and probably not necessary ) add a parameter for the MediaWiki:Proofreadpage_header_template template containing the name of the index page. This could be useful to insert links to the index page in the template, but maybe redundant with the source tab. Zaran 13:00, 31 August 2011 (UTC)
 * page numbering: in some case the page number is missing for the last page, an example on la - the problem appears to be caused when pages are skipped
 * This appears to be caused by the last page number (which is processed first) having a y-position difference between it and the initial y-position being computed to be negative. You can solve this by adding a dummy page number at the end of the page, processing the page numbers and removing it again. Example replacement init_page_numbers can be found here. If the initial y-position could be computed correctly in the first place, that would be better. Inductiveload 05:04, 9 September 2011 (UTC)
 * I see, you add a dummy #pagenum span acting as a guard when calculating offset of page_numbers. I see the initial call to refresh_pagenumbers is protected but refresh_pagenumbers is also called when layout change, isn't it bugged in this case too or the initial setup is sufficient ? — Phe 08:57, 9 September 2011 (UTC)
 * It was bugged, when changing the layout the last page number was lost again, I applied what you did but directly in refresh_pagenumbers not in init_page_numbers. Thanks to have figured out a fix for this longstanding problem. — Phe 17:22, 9 September 2011 (UTC)
 * but layout 3 is still broken on la: — Phe 18:32, 9 September 2011 (UTC)
 * Fixed now — Phe 19:22, 9 September 2011 (UTC)


 * Yet another trouble with pagenumber page 63 to 66, page number overlap with chrome Mac and linux (14.0.835.186 64 bits for linux), not yet checked if the last change described above is related. — Phe 13:00, 1 October 2011 (UTC)

Text layer extraction from djvu file
DjVu files may contain a text layer, typically for the OCR text. This text is extracted when a page is edited for the first time, and added to the edit window.


 * Examples:
 * en:Page:Light waves and their uses.djvu/104 (the page was deleted for the purpose of the demonstration).
 * fr:Livre:Hugo - La Légende des siècles, 1e série, édition Hetzel, 1859, tome 2.djvu (click on pages)

 * Configuration: the file description page might need to be purged if the DjVu file was uploaded before the feature was added.

Configurable Headers and Footers
The default content of page headers and footers can be configured in Mediawiki:Proofreadpage_default_header and Mediawiki:Proofreadpage_default_footer.

In addition, this default value can be adapted to each book. For this, admins need to add 'header' and 'footer' fields to the index pages.

Proofreading path
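ProofreadPage has five quality levels, numbered 0 to 4: without text (0), not proofread (1), problematic (2), proofread (3) and validated (4).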

The <pagelist /> command
Used on index pages, to display links to pages. The name of the index page must match the name of the djvu file.

 where X, Y, Z, A, B are page numbers
 * Syntax:

The "from...to" parameters define an interval of pages. Example :

The "AtoB" parameter applies a style to an interval of pages. Style parameters may also be applied to a single page. Available styles are : "roman", "highroman", "empty". Other strings are passed to the link. Example :  In this example, '1to10' is an interval, and 11 is a single page.

It is possible to define overlapping intervals, or to modify a single page within an interval. Example : 

Counters : if a numeric parameter is applied to a page number, it resets the page counter. Example :


 * Examples :
 * see here for an example
 * [//es.wikisource.org/w/index.php?title=%C3%8Dndice:Dar%C3%ADo_-_Eleven_Poems.djvu&diff=next&oldid=554168 see here for another example with Roman numbers first]
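
The numbering rules described in this section (style intervals, single-page overrides and counter resets) can be sketched in Python. This is a simplified model written for illustration, not the extension's code: the function names and the resets/styles arguments are hypothetical, and the "empty" style is left out because its rendering is not described here.

```python
def to_roman(n):
    """Convert a positive integer to an uppercase Roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_labels(first, last, resets=None, styles=None):
    """Map each physical page to its displayed label.

    resets: {physical_page: number} restarts the counter at that page.
    styles: list of ((a, b), style); later entries override earlier ones,
            which models overlapping intervals and single-page overrides.
    """
    resets = resets or {}
    styles = styles or []
    labels = {}
    offset = 0
    for page in range(first, last + 1):
        if page in resets:
            offset = page - resets[page]
        number = page - offset
        style = None
        for (a, b), name in styles:
            if a <= page <= b:
                style = name
        if style == "roman":
            labels[page] = to_roman(number).lower()
        elif style == "highroman":
            labels[page] = to_roman(number)
        elif style is not None:
            labels[page] = style  # any other string is used as the label
        else:
            labels[page] = str(number)
    return labels


# Pages 1 to 10 in lowercase Roman numerals, counter reset to 1 at page 11.
labels = page_labels(1, 12, resets={11: 1}, styles=[((1, 10), "roman")])
print(labels[4], labels[10], labels[11], labels[12])
```

A single-page override is expressed as an interval of length one, for example ((5, 5), "Title") placed after the ((1, 10), "roman") entry.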

The <pages /> command
This command transcludes a series of pages from an index. It also inserts links between pages, with the page numbers taken from the index page.

 * Syntax: with DjVu indexes, parameters should be integers: <pages index="foo.djvu" from=100 to=200 />. With other indexes, parameters should be page names: <pages index=foo from=foo_page1.jpg to=foo_page15.jpg />. Section transclusion is possible for the first and last page: <pages index="foo.djvu" from=100 to=200 fromsection=section2 tosection=section1 />. Section transclusion can also be applied to all pages (this cannot be combined with fromsection and tosection): <pages index="foo.djvu" from=100 to=200 onlysection=english />.


 * Options that refine the transclusion of multi-page books (with a .djvu or .pdf file):
 * step: transclude only one page in every n. For example, <pages from=1 to=10 step=2 /> shows the 1st, 3rd, 5th, 7th and 9th pages.
 * exclude: leave out the listed pages. For example, <pages from=1 to=10 exclude="2-5,9" /> shows the 1st, 6th, 7th, 8th and 10th pages.
 * include: add the listed pages. For example, <pages include="2-5,9" /> shows the 2nd, 3rd, 4th, 5th and 9th pages.

All of these attributes can be combined on the same tag. For example, <pages from=1 to=10 include="31" exclude="2-4" step="2" /> shows the 1st, 5th, 7th, 9th and 31st pages.
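
The page selection described by these options can be modelled with a short Python sketch. The function names are hypothetical, and the order of operations (step over the from/to range, then exclude, then include) is an assumption chosen because it reproduces the examples in this section; the extension's actual order is not confirmed here.

```python
def parse_ranges(spec):
    """Parse a list such as "2-5,9" into a set of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            pages.update(range(int(low), int(high) + 1))
        else:
            pages.add(int(part))
    return pages


def select_pages(start=None, end=None, step=1, include="", exclude=""):
    """Return the sorted page numbers selected by the given attributes."""
    selected = set()
    if start is not None and end is not None:
        # Assumed order: step is applied to the from/to range first.
        selected.update(range(start, end + 1, step))
    selected -= parse_ranges(exclude)
    selected |= parse_ranges(include)
    return sorted(selected)


print(select_pages(1, 10, step=2))
print(select_pages(1, 10, exclude="2-5,9"))
print(select_pages(include="2-5,9"))
print(select_pages(1, 10, step=2, include="31", exclude="2-4"))
```

The four calls correspond, in order, to the step, exclude, include and combined examples in the text.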

Note: Filename components need to be wrapped in "quotation marks" if they contain spaces, or else the spaces in the filenames need to be replaced with underscores (_). Quotation marks also must be used if the filename contains a non-ASCII character.

 * Configuration: the template Mediawiki:Proofreadpage_pagenum_template is inserted before each transcluded page. It is used to display page numbers, in the text or in the margin. It accepts two parameters: 'page' for the page, 'num' for the page number.

Note: This transclusion method inserts a space between all pages. Thus, it is not possible to divide a word across two pages and have it displayed correctly. The recommendation is not to divide words.

User options
The following option can be made available in the user's preferences, as a gadget:
 * Default layout of the edit window can be horizontal instead of vertical: en:MediaWiki:Gadget-pr layout.js

The following option is available in the user's preferences:
 * Show the headers/footers in the edit window (in Preferences/Editing). Name in software: proofreadpage-showheaders

Configuring index pages
Index pages can be configured by modifying two templates:
 * MediaWiki:Proofreadpage index template: this template defines how the index page is rendered.
 * MediaWiki:Proofreadpage index attributes: this template defines the list of fields in the edit form.

In addition, some fields of the index page can be passed to the headers/footers. They must be listed in:
 * MediaWiki:Proofreadpage js attributes

For language interwiki
 * MediaWiki:Proofreadpage specialpage text

About journal issues and partial publication
It is not a good idea to create an index page for a few pages of a book, or for a few pages of a journal issue. Another person might create another index with other pages from the same journal issue, and might not know that another index already exists for the same book.

If you want to publish pages from a journal issue, please name the index after the journal, not after the author of the article you are publishing.

If you create a djvu file, try to create a djvu of the whole book/issue, even if you are planning to publish only a few pages from that issue. You should not worry that the index pages will look unfinished. Centralizing all the pages of a given book/journal issue will help users who publish excerpts from the same book/issue.

Headers and Navigation
The 'pages' command can generate headers automatically. For this the command must include a "header" parameter.

 * Example: fr:La Petite Dorrit/Tome 2/Chapitre 5

The header is defined in MediaWiki:Proofreadpage_header_template. It is a template that reads parameters extracted from the index page. In addition, it can provide navigation links through dedicated parameters.

In order to find the previous and next chapters, the index page is used as a table of contents. All links from the index page to the main namespace are interpreted as 'chapters', except the first one, which is expected to belong to the "title" field. (Note: if your wiki does not have an author namespace, this will not work, because the links to author/translator pages will be wrongly interpreted as chapters.)

All parameters defined in MediaWiki:Proofreadpage js attributes are passed to the header template. Additionally, you can pass any named parameter to this template with <pages index="..." my_parameter=value />; such parameters need to be handled by the template. The same mechanism can be used to override a parameter value: for example, <pages index="..." Author=value /> replaces the default value that the extension takes from the Index page.

Page numbers are also available as parameters.

Finally, the value assigned to the "header" parameter is also available to the template. This can be combined with parser functions in order to define several styles of headers.

The extension treats a call to <pages /> without from and to parameters as a special case: the value toc is assigned, and the table of contents is transcluded from the Index: page.

Proofreading status indicator
A coloured proofreading status indicator is displayed in the main namespace, under the title of pages that use transclusion. It shows the proofreading status of transcluded pages from the "Page" namespace. Its tooltip reads: "This line indicates the proofreading status of the text. Green = Validated; Yellow = Proofread once; Red = Not proofread."

In this example the text is 40% validated, 30% proofread, 25% raw, and 5% of the transcluded pages are problematic. It also contains a (hidden) backlink to the index page, that can be captured by local javascript.
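
The shares shown by the indicator are simple proportions of the transcluded pages. A minimal Python sketch of that arithmetic follows; the function name and the page counts are made up for illustration, and "not proofread" stands for what the text above calls raw.

```python
def status_percentages(counts):
    """Return the percentage of pages in each proofreading status."""
    total = sum(counts.values())
    if total == 0:
        # No transcluded pages: report zero for every status.
        return {status: 0.0 for status in counts}
    return {status: 100 * n / total for status, n in counts.items()}


# Hypothetical chapter made of 20 transcluded pages.
counts = {"validated": 8, "proofread": 6, "not proofread": 5, "problematic": 1}
for status, share in status_percentages(counts).items():
    print(f"{status}: {share:.0f}%")
```

With these counts the shares are 40%, 30%, 25% and 5%, matching the example described above.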

This indicator is defined by the system message Mediawiki:Proofreadpage_quality_template, which admins can configure.

In the Swedish Wikisource, similar bar graphs are also generated by the template Statusstapel, for example on sv:Wikisource:Statistik.

Special:IndexPages
This special page lists index pages and their proofreading status. Index pages that were created before the introduction of this feature need to be purged in order to be displayed in the list.

Pages are ordered using the following criterion : 2*(#validated) + (#proofread). This is intended to reflect the number of proofreading actions. In the future more options will be available.
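
The ordering criterion can be illustrated with a small Python sketch. The index names and page counts are invented, and sorting from the highest score to the lowest is an assumption, since the direction is not stated above.

```python
def proofreading_score(validated, proofread):
    """Score from the text: 2*(#validated) + (#proofread).

    A validated page has gone through two proofreading actions,
    a proofread page through one.
    """
    return 2 * validated + proofread


# Hypothetical entries: (index page, validated pages, proofread pages).
indexes = [
    ("Index:A.djvu", 10, 5),
    ("Index:B.djvu", 3, 30),
    ("Index:C.djvu", 0, 12),
]

# Assumption: the most worked-on index comes first.
ranked = sorted(
    indexes,
    key=lambda entry: proofreading_score(entry[1], entry[2]),
    reverse=True,
)
for name, validated, proofread in ranked:
    print(name, proofreading_score(validated, proofread))
```

Note that an index with many proofread pages can rank above one with more validated pages, because the score counts actions rather than finished pages.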