Extension talk:Proofread Page

please discuss bugs and feature requests with the wikisource community, at oldwikisource:Wikisource:ProofreadPage

Missing explanations
Part of the explanation is probably missing because, after following all the instructions, all I get is black images or display errors when I click on the image tab. I tried installing it on a web host and on my own computer, and that is all I get every time. Mode41 16:23, 6 May 2010 (UTC)

How to perform steps 3 and 4?
Forgive the beginner, but how does one execute the SQL file? I tried executing ProofreadPage.sql in the command prompt and got nothing. And what about #4: am I correct in assuming that that is telling me to edit ProofreadPage.sql to use the correct prefix?

I'm assuming that my failure to complete these two steps is what caused me to get the error:
 * Database returned error

Can anyone help? --Spangineer 04:26, 7 February 2011 (UTC)
 * There are two ways:
 * run update.php, or
 * copy and paste the content of the SQL file into phpMyAdmin.
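
On the table prefix question (step 4), here is a minimal Python sketch of the substitution, offered as an illustration only. It assumes the SQL file marks the prefix with one of the placeholders MediaWiki patch files commonly use (`/*_*/` or `/*$wgDBprefix*/`); whether ProofreadPage.sql uses either one is an assumption, so check the file first. The table name in the sample is hypothetical.

```python
import re

# Placeholders commonly used for the table prefix in MediaWiki-style SQL files.
# Assumption: the file you are editing uses one of them.
PREFIX_PLACEHOLDERS = re.compile(r"/\*_\*/|/\*\$wgDBprefix\*/")


def apply_table_prefix(sql_text: str, prefix: str) -> str:
    """Return sql_text with every prefix placeholder replaced by prefix."""
    # A function is used as the replacement so that backslashes in the
    # prefix are never interpreted as regex escapes.
    return PREFIX_PLACEHOLDERS.sub(lambda _match: prefix, sql_text)


# Hypothetical statement, not copied from ProofreadPage.sql.
sample = "CREATE TABLE /*_*/example_table (id INT NOT NULL);"
print(apply_table_prefix(sample, "mw_"))
```

The result can then be pasted into phpMyAdmin as described above. Running update.php avoids the manual step altogether.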

Proofread error: "no such file"
Hi, technical question.

I've tried to install Proofread on a fresh MW site, and it gave the error above every time I tried to make an index page (the source came from archive.org and uploaded fine, even though there is no thumbnail on the File description page). I've done this several times on id.wikisource too, so I've been stuck with this error for more than a year now. Can anyone help me?

Specs: MediaWiki 	1.16.0 PHP 	5.2.10 (apache2handler) MySQL 	5.0.77

CheckUser (Version 2.3) Collection (Version 1.4) Cite ParserFunctions Poem PDF Handler ConfirmEdit ProofreadPage (Version 2009-04-20) SpamBlacklist

confirmEditSetup, pr_main, wfRssExtension and wfSetupParserFunctions ,, and expr, if, ifeq, ifexist, ifexpr, rel2abs, switch, time, timel and titleparts

I've added new namespaces (100-103) and updated the database, but I've got the following error when I saved the index page:

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function "". Database returned error "1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '0,0,0,0,0)' at line 1 (localhost)".

Even though it generates an error, the index page was saved successfully. But it gave the "Error: no such file" message, like the one in Index:Federal_Cases,_Volume_19.djvu. While File:Federal_Cases,_Volume_19.djvu was corrupt in that case, mine was okay. So again, I don't know what's wrong with my installation.

Many thanks in advance. Bennylin (talk) 13:15, 2 January 2012 (UTC) (crossposted from English Wikisource's Scriptorium)

No Image
I installed the extension on my wiki, but when I create a Page:xyz.jpg there is no image on the right side. Via Scan I can see the image. What's wrong? --87.146.17.135 13:10, 4 November 2012 (UTC)
 * You may not have configured support for PDF, TIFF or DjVu files correctly. Tpt (talk) 16:43, 4 November 2012 (UTC)
 * These extensions are working fine. --87.146.7.92 18:30, 4 November 2012 (UTC)
 * I'm having some trouble with the "Image" tab in the "Page:" namespace. It returns a 404 error. I've added the Apache rewrite rule to the .htaccess in my MediaWiki folder, but the error persists. --Inops (talk) 23:59, 4 November 2012 (UTC)
 * Are you using InstantCommons? It would help if you gave me the error messages you see and the format of the files you use. Tpt (talk) 08:25, 5 November 2012 (UTC)
 * I am not using InstantCommons (I was thinking of using it, though; does it have an adverse effect on the extension?). I am attempting to use the extension with a PDF file. The extension otherwise functions perfectly well with PDF (I don't require .djvu), so this error is a minor niggle of mine. The "Image" tab returns a generic 404 error (the displayed message depends on the browser, obviously).
 * The URL of the tab is, e.g., "/mediawiki/images/thumb/1/13/Example.pdf/page10-1275px-Example.pdf.jpg"; this returns a 404 error. However, "/mediawiki/thumb.php?f=Example.pdf&width=1275&page=10" will return the JPG rendering of the specific page of the file, and thereafter "/mediawiki/images/thumb/1/13/Example.pdf/page10-1275px-Example.pdf.jpg" will correctly return that image.
 * It seems that "mediawiki/images/thumb/" often produces the error, but "mediawiki/thumb.php?" does not, and requesting the latter fixes the former for that particular image. I assume this has to do with how thumbnails are created in my installation: thumb.php creates the thumbnail, while /thumb/ only serves one that already exists. If the tab linked to the thumb.php address, this wouldn't be an issue. Is there a fix for this? Thanks, Jordan. --Inops (talk) 12:21, 5 November 2012 (UTC)
 * Maybe it's because the thumbnail system is not configured correctly. I hope that adding the URL rewriting configuration described there will solve your problem. Tpt (talk) 09:05, 6 November 2012 (UTC)
 * I've tried the code you suggested, but this didn't seem to make a difference. I've decided to disable the broken (for me) function by disabling the code in ProofreadPage/ProofreadPage.body.php. Thanks for the help. :) --Inops (talk) 09:54, 6 November 2012 (UTC)
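
For anyone debugging the same 404, the relationship between the two URLs quoted in this thread can be sketched in Python as below. This is only an illustration built from the single example above: the path layout (`/thumb/<hash>/<hash>/<file>/page<N>-<W>px-...`) is inferred from that example, and the `script_path` default of `/mediawiki` is an assumption about the installation.

```python
import re
from urllib.parse import unquote, urlencode

# Layout inferred from the example URL in this thread; other setups may differ.
THUMB_PATH = re.compile(
    r"/thumb/[0-9a-f]/[0-9a-f]{2}/(?P<file>[^/]+)/"
    r"(?:page(?P<page>\d+)-)?(?P<width>\d+)px-[^/]+$"
)


def thumb_php_url(thumb_path: str, script_path: str = "/mediawiki") -> str:
    """Build the thumb.php request that renders the given static thumbnail path."""
    match = THUMB_PATH.search(thumb_path)
    if match is None:
        raise ValueError(f"not a recognised thumbnail path: {thumb_path!r}")
    params = {"f": unquote(match.group("file")), "width": match.group("width")}
    if match.group("page"):
        # Only multi-page files (PDF, DjVu) carry a page number.
        params["page"] = match.group("page")
    return f"{script_path}/thumb.php?{urlencode(params)}"


print(thumb_php_url(
    "/mediawiki/images/thumb/1/13/Example.pdf/page10-1275px-Example.pdf.jpg"
))
```

A rewrite rule that sends missing files under /images/thumb/ to thumb.php performs the same mapping on the server side, which is why requesting the thumb.php URL once makes the static URL work afterwards.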

5050 error on save
I'm unable to save any page in the Page namespace. It returns a 5050 error. So, the extension is rendered useless. My error log gives:
 * [07-Nov-2012 18:17:02] PHP Fatal error: Undefined class constant 'READ_LATEST' in D:\www\mediawiki\extensions\ProofreadPage\ProofreadPage.body.php on line 1238

Though, I am not sure if that's anything to do with it. Any help with this would be great. 90.220.162.151 18:22, 7 November 2012 (UTC)
 * That constant was introduced in MediaWiki 1.20, which was released today, so upgrading your MediaWiki installation to 1.20 will fix the problem. If you don't want to upgrade to 1.20, you can remove ", Revision::READ_LATEST" from line 1238 of the file D:\www\mediawiki\extensions\ProofreadPage\ProofreadPage.body.php and it will work (but this can introduce a bug if two people save the page at the same time).
 * Thanks! Works perfectly. I had to install a newer version of Extension:Vector though. 90.220.162.151 00:25, 8 November 2012 (UTC)

Localization
How to add new translations of the namespace-names? I don't see it on translatewiki. --Bjarki S (talk) 01:20, 19 January 2013 (UTC)
 * I've updated translatewiki:Translating:MediaWiki. In short, you can now file a request on bugzilla, MediaWiki extensions>ProofreadPage. Previously all wikis had to configure it locally. --Nemo 06:43, 19 January 2013 (UTC)

Alright. Thanks! --Bjarki S (talk) 20:58, 19 January 2013 (UTC)

no such file at wikilivres
The Wikilivres.ca website has been getting the "no such file" errors reported here at Proofread error: "no such file" for maybe six months. Could anyone here suggest a solution? Please see wikilivres:wikilivres:Community_Portal/en for examples of files with the problem, together with links to a database error occurring when the pagelist tag is used. Thank you. -84user (talk) 23:04, 19 January 2013 (UTC)
 * Are you sure that DjVu support is configured correctly on your server? I think that this is the cause of the issue. See DjVu for more information. Tpt (talk) 20:29, 20 January 2013 (UTC)

OCR software
I can't seem to find any information on the OCR software that this extension uses. We have just finished setting this up on the Icelandic Wikisource but there is a marked difference in the quality of the OCR retrieved from an English test document and an Icelandic one with the same font and font size. I am wondering if the support for the Icelandic language is not built into the OCR software or if there are any ways to improve it. Does the software learn by itself to recognize strange new characters like þ and ð if given enough practice or would that be a waste of time and effort? --Bjarki S (talk) 04:59, 20 January 2013 (UTC)
 * The extension doesn't include any OCR software, it only extracts the text embedded in the PDF/DjVU files you're using: check what generated them. s:en:Help:DjVu files has a lot of advice on how to get DjVu with decent OCR; for instance, if you use archive.org don't forget to specify the language of the document in your metadata. --Nemo 09:44, 20 January 2013 (UTC)
 * Hmm. So what does the "OCR" button do? Seems to work fine with PDFs without embedded text layer. --Bjarki S (talk) 17:15, 20 January 2013 (UTC)
 * My information may be outdated, but are you sure it's a PDF without text or is just your PDF reader unable to read/select/copy it? pdfinfo or pdftotext commands would tell you. --Nemo 18:00, 20 January 2013 (UTC)
 * The OCR button is managed by a script stored on oldwikisource that calls a script on the toolserver, which uses the free but not very good Tesseract OCR software configured for the English language. So if you use the script on an English text and on an Icelandic text, it's normal that the OCR is better for the English one. Tpt (talk) 20:34, 20 January 2013 (UTC)
 * Alright, thanks! Seems like there is no free option available for Icelandic. --Bjarki S (talk) 04:18, 29 January 2013 (UTC)
 * Eh, there are options available for Icelandic. There are three online services that have Icelandic OCR support: Archive.org, ocr-extract.com and newocr.com. Plus, there is an Icelandic language file for Tesseract on Google Code, and that is enough to make Tesseract compatible with Icelandic.--Snaevar (talk) 15:50, 29 January 2013 (UTC)
 * I installed support for Icelandic on the toolserver. As for all languages with diacritics, the accuracy of the results depends a lot on the quality of the scan. Besides that, I also upgraded Tesseract, so results should be a bit better for all supported languages. Phe (talk) 22:29, 30 January 2013 (UTC)
 * Hi, I'm a newbie at this, so forgive me if my questions are too simple. I've uploaded this file: http://commons.wikimedia.org/wiki/File:ChFSA_FD1197205170%281%29.djvu, which is in Spanish. Proofread Page doesn't seem to pick up the OCR. I converted the file to .djvu using Any2DjVu (Medium (300 dpi); Lossless; OCR (only works reliably for english text, locate columns automatically.)). At this point I really don't care about the quality of the OCR (because it's in Spanish), only that Proofread Page actually performs the transclusion to the Wikisource page. Am I doing something wrong here? Thanks!--3BRBS (talk) 04:20, 2 February 2013 (UTC)
 * Hi! If I've understood you correctly, the issue is that you can't get the text layer of the DjVu (which contains a text layer) when you try to create pages in the Page: namespace? If so, that's very strange, because extraction of the text layer of this file works fine for me on my test wiki. Tpt (talk) 15:57, 2 February 2013 (UTC)
 * Yes, you got it right, but after I wrote the message above, I uploaded a new version of the file, which I made sure had a text layer (I ran the Google script pdf2djvu, and quit using the online converter website). I had to install a viewer program to check that, but in the end I gave up, because I couldn't check whether Proofread Page could actually extract the text. If you say it works... that's great... but how can I check it on my own? Could you run the program on the three different versions of the file I uploaded, so I can figure out what the problem was? (I'm thinking that it is a failure of the website, because it said it ran the OCR, but I couldn't extract anything from it.) Thanks!! :D--3BRBS (talk) 13:45, 5 February 2013 (UTC)
 * The first two versions of your file have no text layer according to evince and djView4. So Proofread Page doesn't extract a text layer because there is no text layer in the file, and there is no bug in Proofread Page. Tpt (talk) 17:06, 5 February 2013 (UTC)
 * Thanks for taking the time to check. I believe there is a bug with the French website then, since it "OCR"ed the file but added no text layer. The third time, I used the Google script! Best.--3BRBS (talk) 21:59, 9 February 2013 (UTC)
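
To answer the "how can I check it on my own?" question in a scriptable way, here is a rough Python sketch. It relies on the pdftotext command mentioned above for PDF files; for DjVu it assumes the djvutxt tool from DjVuLibre, which nobody in this thread mentioned, so treat that part as an assumption. The 20-character threshold is arbitrary.

```python
import shutil
import subprocess
from typing import List


def has_text_layer(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess whether a file has a text layer from the text extracted from it.

    Extraction tools may still emit whitespace and page breaks for files
    without text, so only non-whitespace characters are counted.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible >= min_chars


def extraction_command(path: str) -> List[str]:
    """Pick the external command that prints the text layer to stdout."""
    lower = path.lower()
    if lower.endswith(".pdf"):
        return ["pdftotext", path, "-"]  # "-" sends the text to stdout
    if lower.endswith(".djvu"):
        return ["djvutxt", path]  # assumption: DjVuLibre is installed
    raise ValueError(f"unsupported file type: {path!r}")


def extract_text(path: str) -> str:
    """Run the extraction tool and return whatever text it prints."""
    command = extraction_command(path)
    if shutil.which(command[0]) is None:
        raise RuntimeError(f"{command[0]} is not installed")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `has_text_layer(extract_text("book.djvu"))` on a local copy of the file should then report whether there is anything for Proofread Page to extract.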

Error
I am using Mediawiki in "Malayalam" language. I installed this extension, but still getting my Index pages similar to this. Please help.--Balasankarc (talk) 19:54, 16 April 2013 (UTC)
 * You have to edit Mediawiki:Proofreadpage index template (the template that is output on index pages) and use in it the parameters set up in Mediawiki:proofreadpage index attributes. Tpt (talk) 07:01, 25 April 2013 (UTC)

Refactoring of Code
I came across the project idea "Proofread Page extension needs to be refactored". I would like to know what the main problems in the code are that we would like to overcome when we refactor. Is there a set of features that the refactored code has to support in future releases? --Aarti Dwivedi
 * The main problems are related to the editing system for Page: pages, which is currently a horrible hack; this part of the code has become too complicated to be modified easily without breaking everything. The goal is to implement it cleanly and make it compatible with the Visual Editor (see 46616). Tpt (talk) 06:57, 25 April 2013 (UTC)

Edit the format of text area
Hi,

I would like to use a Semantic Form instead of the default text area on the left side. I intend to do this for digitizing a dictionary, so I need the image on the right and a semantic form on the left. Is there any way I can do this?--Balasankarc (talk) 16:26, 16 May 2013 (UTC)

OCR for Bengali wikisource
Hi, I am from Bengali Wikisource (bn.wikisource.org). There is an open-source OCR available at https://code.google.com/p/banglaocr/. Could you please add it for Bengali Wikisource? Jayantanth (talk) 08:48, 24 October 2013 (UTC)
 * Hi, do you know if tesseract-ocr-3.02.ben.tar.gz (Bengali language data for Tesseract 3.02) is the same thing and can be installed instead? Phe (talk) 16:29, 25 October 2013 (UTC)
 * I installed the file I mentioned above and tried it. I'm unsure how bad the results are, but the OCR quality seems very poor :/ 18:45, 25 October 2013 (UTC)
 * Thank you for installing it. I know that the OCR still needs some development. I have checked it several times offline on my desktop, and it works fine with good 300 dpi images, but here it's not responding. I am not sure about Tesseract 3.02. I am trying to contact the main developers, Md. Abul Hasnat & Murtoza Habib.Jayantanth (talk) 13:28, 26 October 2013 (UTC)

Hi Phe, I contacted the main developer, Md. Abul Hasnat. He replied with the answers below.

"Do you know if tesseract-ocr-3.02.ben.tar.gz Bengali language data for Tesseract 3.02 at [1] is the same thing and can be installed instead ?"
 * My comment: In general, if you replace the Tesseract training file in the tessdata sub-directory inside the BanglaOCR software, it should produce results accordingly. The reason is that BanglaOCR uses Tesseract as an external OCR engine. However, several sources have told me that Tesseract 3.02 still works better with the old training data than with tesseract-ocr-3.02.ben.tar.gz. I have not had a chance to validate this myself.

Many people may complain that BanglaOCR with tesseract provides extremely poor results.
 * My comment: BanglaOCR is the first complete OCR framework that was released as open source with the aim of continuous development. Unfortunately, nobody has worked on it since version 0.7, so there has not been enough progress in the past five years. It gave reasonable results for certain types of documents, but it was never extended to handle arbitrary documents.

What will you suggest for tesseract-ocr-3 version?

Jayantanth (talk) 16:50, 13 April 2014 (UTC)

"Proofreadpage index data config" list of types
Shouldn't the list of types contain "langcode", as it is used by the "Language" field?

Currently, it says:

"Possibles values: string, number, page"

I don't have enough knowledge in this area to feel comfortable editing it myself. Are there other types that aren't listed?

OCR Languages
Is there a list somewhere of the languages that are supported by the toolserver script? It seems that it does not support Hebrew. Who should I ask to add support for Hebrew (it seems that Tesseract-OCR has a data file for Hebrew). Inkbug (talk) 06:47, 29 January 2014 (UTC)
 * I installed it. Phe (talk) 13:04, 31 January 2014 (UTC)
 * Thank you! Inkbug (talk) 16:14, 1 February 2014 (UTC)

API Documentation & Improvement
The Proofread Page extension adds two API hooks to the Query module: one meta hook for configuration information ("proofreadinfo") and one property hook for the quality status ("proofread") of Proofread pages.

meta = proofreadinfo
Meta information about the configuration of Proofread extension.


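 * Parameters
 * piprop: Which proofread properties to get. Values (separate with '|'): namespaces, qualitylevels
 * namespaces: Information about Page & Index namespaces (Default)
 * qualitylevels: List of proofread quality levels (Default)


 * Example

http://en.wikisource.org/w/api.php?action=query&meta=proofreadinfo&piprop=namespaces|qualitylevels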

prop = proofread
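Proofread status of the given Index/Page. Index means the entire book. Note: the allpages generator expects a namespace parameter (gapnamespace), which is returned by the previous API call (meta=proofreadinfo).


 * Generator parameters
 * gapprefix: Name of the Index/Page
 * gapnamespace: Page/Index namespace returned from the meta=proofreadinfo query
 * gapcontinue: Appended for pagination.
 * gaplimit: Limit the number of results per API call


 * Example

http://en.wikisource.org/w/api.php?action=query&generator=allpages&gapnamespace=104&gapprefix=Love_among_the_chickens_(1909).djvu&prop=proofread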

Inputs Notes/Sources
In the same context, any thoughts or pointers that would help compile API-related notes are welcome.

Points could include:


 * 1) Use-case of API
 * 2) Existing components/projects/bots already using proofread API features
 * 3) Anything else.

w.r.t IRC chat
TPT suggests that we need to document the serialization format of Page: and Index: pages. For the format of Page: pages, see:

https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FProofreadPage.git/1c5685425ba4bc41c174552d5e61b1d4de343043/includes%2Fpage%2FPageContentHandler.php#L35

Some samples of the Page: pages serialization:

https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FProofreadPage.git/1c5685425ba4bc41c174552d5e61b1d4de343043/tests%2Fincludes%2Fpage%2FPageContentHandlerTest.php#L26

Need expansion here

TPT && PHE conversation:

Every book has an index page, which contains all the meta information about the book.

For example:

https://en.wikisource.org/wiki/Index:Love_among_the_chickens_(1909).djvu

After that, to get the proofread quality of each page in the book, we can use the API in the following way. Two important things are needed for the API hook to work:


 * 1) GAPNAMESPACE
 * 2) GAPPREFIX

Note: all the parameters are still to be documented.

GAPPREFIX contains the name of the book's index page, and GAPNAMESPACE can be derived by querying meta=proofreadinfo.

http://en.wikisource.org/w/api.php?action=query&meta=proofreadinfo&piprop=namespaces|qualitylevels

Note that the namespace number differs between wikis. An example API call is therefore:

http://en.wikisource.org/w/api.php?action=query&generator=allpages&gapnamespace=104&gapprefix=Love_among_the_chickens_(1909).djvu&prop=proofread

The index page for the same book can be found here:

https://en.wikisource.org/wiki/Index:Love_among_the_chickens_(1909).djvu

Note the GAPCONTINUE parameter in the result. It is used to iterate over the remaining pages of the book.

GAPLIMIT can also be used to limit the number of pages returned by one API call.
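
The calls described above could be combined roughly as in the Python sketch below. It is a draft under several assumptions rather than a reference client: it follows the current "continue" style of API continuation (merging the returned continue object into the next request), whereas older MediaWiki versions reported gapcontinue under "query-continue"; the namespace number 104 is specific to en.wikisource; and the User-Agent string is a placeholder.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikisource.org/w/api.php"


def proofread_params(index_file: str, page_namespace: int, limit: int = 50) -> Dict[str, str]:
    """Parameters for prop=proofread over every page whose title starts with index_file."""
    return {
        "action": "query",
        "format": "json",
        "generator": "allpages",
        "gapnamespace": str(page_namespace),  # from meta=proofreadinfo
        "gapprefix": index_file,
        "gaplimit": str(limit),
        "prop": "proofread",
    }


def fetch_json(params: Dict[str, str]) -> dict:
    """Perform one GET request against API_URL and decode the JSON reply."""
    # Placeholder User-Agent; replace it with your own tool name and contact.
    request = Request(
        f"{API_URL}?{urlencode(params)}",
        headers={"User-Agent": "proofread-status-sketch/0.1"},
    )
    with urlopen(request) as response:
        return json.load(response)


def iter_pages(params: Dict[str, str],
               fetch: Callable[[Dict[str, str]], dict] = fetch_json) -> Iterator[dict]:
    """Yield every page object, following continuation until the list ends."""
    request = dict(params)
    while True:
        data = fetch(request)
        yield from data.get("query", {}).get("pages", {}).values()
        continuation = data.get("continue")
        if not continuation:
            break
        # Start again from the original parameters and add the continuation values.
        request = {**params, **continuation}
```

For the book used in the example above, `iter_pages(proofread_params("Love_among_the_chickens_(1909).djvu", 104))` would walk all of its Page: pages; each yielded object carries whatever prop=proofread returns for that page.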

w.r.t. mailing list replies
Gaurav Vaidya has created a Perl module to download an entire book from Wikisource.

It downloads a hypothetical “Index:Entire book.pdf” by:


 * Using prop=imageinfo to get the number of pages for “File:Entire book.djvu".
 * Using prop=revisions to download the Wikitext for each individual page from “Page:Entire book.djvu/1” to “Page:Entire book.djvu/9999” (if the image had 9,999 pages).

This will work for Wikisources that redirect “File:”, “Index:” and “Page:” into their local namespaces.
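
For comparison, the two-step approach described above might look like this in Python. This is not Gaurav's Perl module, only a sketch of the request parameters under stated assumptions: that prop=imageinfo with iiprop=size reports a page count for multi-page files, and that titles are sent in batches of 50 (the usual per-request limit for ordinary accounts). The helper names are hypothetical.

```python
from typing import Dict, Iterator, List


def pagecount_params(file_name: str) -> Dict[str, str]:
    """Parameters for asking how many pages the scan has (step 1)."""
    return {
        "action": "query",
        "format": "json",
        "titles": f"File:{file_name}",
        "prop": "imageinfo",
        "iiprop": "size",  # assumption: includes the page count for DjVu/PDF
    }


def page_titles(file_name: str, page_count: int) -> List[str]:
    """Titles of the individual Page: pages, numbered from 1."""
    return [f"Page:{file_name}/{n}" for n in range(1, page_count + 1)]


def batches(items: List[str], size: int = 50) -> Iterator[List[str]]:
    """Split the titles into chunks small enough for one API request."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def revisions_params(titles: List[str]) -> Dict[str, str]:
    """Parameters for downloading the wikitext of one batch of pages (step 2)."""
    return {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),
        "prop": "revisions",
        "rvprop": "content",
    }


# Example: a 120-page book needs three revision requests.
for batch in batches(page_titles("Entire book.djvu", 120)):
    print(len(batch), revisions_params(batch)["titles"][:40])
```

Pages that have not been created yet are reported as missing by the API, so a client has to tolerate gaps in the numbering.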

He suggests it might be helpful to have an API query that could return the proofread status for every page in an Index page.

TPT suggests describing both hooks properly, which is our end goal anyway. I'm writing a rough draft of this and will post it in this section. For reference, I'll go with Thomas's suggestion: https://www.mediawiki.org/wiki/API:Properties#imageinfo_.2F_ii.

--Kishanio (talk) 12:35, 17 May 2014 (UTC)