Help:Extension:Wikisource/Wikimedia OCR

The Wikimedia OCR feature of the Wikisource extension adds controls to the main editing toolbar when editing in the Page namespace. These controls quickly extract text from the page image and add it to the page body text box. OCR stands for Optical Character Recognition, the process by which text in a photographic image is turned into editable text that can be added to a wiki.

To use this feature, click the 'Extract text' button at the right side of the main editing toolbar. This will run the OCR process, and add the resulting text to the page body field in the editing form (replacing any text that is already there). An 'undo' button is shown at the top of the body field, allowing you to return to the previous state of the field if desired.

In its basic form, that is all there is to Wikimedia OCR. A few advanced features, useful in some circumstances, are available via the dropdown menu to the right of the main 'Extract text' button. They allow you to choose a different OCR engine, set a list of languages to help the software detect words, or select a smaller area of the page from which to extract text. All are explained below. Other than engine choice, they are reached via the 'Advanced options' menu item, which opens a new tab.

Engines
There are currently three OCR engines available: Tesseract, Google and Transkribus. Tesseract is an open-source tool that runs in-house and supports a wide range of languages and other options. Google OCR is a proprietary service that also supports many languages, but with fewer options. Transkribus is supported by the EU cooperative READ-COOP, which has partnered with the Wikimedia Foundation to provide a limited number of free credits in support of the Wikisource Loves Manuscripts project.

Which engine works best depends on the nature of the image being processed.

To switch engines, select the relevant radio button in the dropdown menu. Your choice will be remembered for your current device, and can be changed at any time.

Languages
Clicking the 'Advanced options' menu item opens a new tab with a transcription form containing a field for selecting the language or languages that are used in the page of text being extracted. This is useful because the OCR engines can be much more accurate when they know what languages to expect.

Note that not all languages are supported by all engines, and if you change the engine then the list of available languages will change too.

If your language is not in the list, you can leave the Languages field empty and the OCR engine will attempt to extract what text it can. Results vary, but it is worth trying.
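The relationship between engines and languages can be pictured with a minimal Python sketch. It is an illustration only, not the extension's code: the `SUPPORTED_LANGUAGES` table, the `languages_for_engine` function and the language codes are all hypothetical placeholders, and the real per-engine lists are different and much longer.

```python
# Hypothetical support table. The codes are placeholders, not the
# engines' real language lists.
SUPPORTED_LANGUAGES = {
    "tesseract": {"lang-a", "lang-b", "lang-c"},
    "google": {"lang-a", "lang-b"},
}


def languages_for_engine(engine, selected):
    """Keep only the selected languages that the given engine supports."""
    supported = SUPPORTED_LANGUAGES[engine]
    # Preserve the user's order. An empty result corresponds to leaving the
    # Languages field empty, so the engine gets no language hint.
    return [code for code in selected if code in supported]


selection = ["lang-c", "lang-a"]
tesseract_langs = languages_for_engine("tesseract", selection)
google_langs = languages_for_engine("google", selection)
print(tesseract_langs)
print(google_langs)
```

In this sketch, switching from the first engine to the second would drop `lang-c` from the selection because the second engine's placeholder list does not include it.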

Crop area
To extract text from only a part of an image (for example, a single column of a page from a newspaper), it is possible to select a crop area. Do this by first clicking the crop button (see screenshot at right), and then clicking and dragging over the page image to draw a rectangle. The image can be zoomed and panned, and the crop rectangle moved and resized as required. There are buttons above the image with which to switch between moving and cropping. Once you've selected the desired area, click 'Extract area' and the text for only that area will be shown in the right-side text box.
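As a rough illustration of what a crop area amounts to, the following Python sketch turns the two end points of a drag into a rectangle in image pixel coordinates. It is a sketch under assumptions, not how the tool is implemented: the `CropArea` type and the `crop_from_drag` function are hypothetical names introduced here, and the real interface may represent the selection differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropArea:
    """A crop rectangle in image pixel coordinates (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def crop_from_drag(start, end, image_width, image_height):
    """Turn two drag points into a crop rectangle clamped to the image."""
    # Clamp both points so a drag that leaves the image still gives a valid area.
    x0 = min(max(start[0], 0), image_width)
    y0 = min(max(start[1], 0), image_height)
    x1 = min(max(end[0], 0), image_width)
    y1 = min(max(end[1], 0), image_height)
    # The drag may go in any direction, so order the corners explicitly.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return CropArea(left, top, right - left, bottom - top)


# Dragging up and to the left over the first column of a 1000x1500 page image.
area = crop_from_drag((480, 1400), (-20, 100), 1000, 1500)
print(area)
```

Whichever direction the rectangle is drawn in, the result describes the same region, which is why the crop rectangle can be drawn from any corner.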

Returning from Advanced options
After using the advanced options form to extract text, it's necessary to copy and paste the resulting text back into the body field of the page editing form. To make this a bit quicker, a 'Copy to clipboard' button is provided.

First-time use
The first time you open a page for editing, a pulsating blue dot is shown on the 'Extract text' button. Clicking this dot or either of the buttons will open a popup explaining what this feature is. Once this popup has been dismissed, it will not be shown again on the same device.

Issues
If you encounter any issues using Wikimedia OCR, please report them on Phabricator under the Wikisource OCR tag.