Extension talk:PdfHandler





Searching content of PDF files

Tommyheyser (talkcontribs)

I'm sure this topic has come up many times before, but the answers I found by searching were usually along the lines of "just use PdfHandler", without much detail. I've got PdfHandler working: it shows thumbnails on the File pages and creates text files of the PDFs in the images folder. How does the MW built-in search engine, or another search engine (I have CirrusSearch/Elastica/Elasticsearch running), make use of those text files?

Is there a configuration setting I need to turn on for MW to recognise the generated text files when indexing contents?

I'm asking because I still don't see the content of the PDF in the search results, either using MW built-in search engine or CirrusSearch.

I hope it's alright that I'm posting this here. I've posted a similar question to this one in the Extension talk:CirrusSearch page as well.

Tommyheyser (talkcontribs)

Okay, not sure what happened, but since I'm running MW on Windows Server (sorry, I forgot to mention this before), the standard PdfHandler extension with my "workaround" wasn't working 100%. Thumbnail creation was okay and I thought pdftotext was working fine, but apparently it wasn't.

I tried SeongMoon's version of PdfHandler, then ran maintenance/update.php, refreshImageMetadata.php, rebuildImages.php and extensions/CirrusSearch/maintenance/forceSearchIndex.php, as described in the CirrusSearch README (https://phabricator.wikimedia.org/source/extension-cirrussearch/browse/master/README). It now seems to work, and PDF contents are showing up in search results.

Reply to "Searching content of PDF files"

No PDF/thumbnail, issue executing pdfinfo/pdftotext, Windows Server 2012 R2, IIS 8.5, MW 1.31

Tommyheyser (talkcontribs)

MW 1.31.1 running on Windows Server 2012 R2 IIS 8.5

I'm getting the following error (taken from the $wgDebugLogFile output) for every execution of pdfinfo and pdftotext.

[exec] Error running "pdfinfo" "-enc" "UTF-8" "-meta" "C:/inetpub/wwwroot/w/images/f/f4/Phone_List.pdf": 'pdfinfo" "-enc" "UTF-8" "-meta" "C:' is not recognized as an internal or external command, operable program or batch file.

I'm not sure if this is the result of the new Shell framework introduced in 1.30, Manual:Shell framework, which replaces wfShellExec(). The debug log line before the error is:

[exec] MediaWiki\Shell\Command::execute: "pdfinfo" "-enc" "UTF-8" "-meta" "C:/inetpub/wwwroot/w/images/f/f4/Phone_List.pdf"

Tommyheyser (talkcontribs)

In case someone else running MW 1.31 on Windows Server 2012 R2 has this issue of PDFs not showing, here is what I did:

  1. I added the path to pdfinfo.exe and pdftotext.exe to the system Path variable (mine was C:\Program Files\xpdf-tools-win-4.00\bin64).
  2. Then I edited the function retrieveMetaData in {mediawiki install path}/extensions/PdfHandler/includes/PdfImage.php:

a. Replacing:

$cmdMeta = [
$wgPdfInfo,
'-enc', 'UTF-8', # Report metadata as UTF-8 text...
'-meta',         # Report XMP metadata
$this->mFilename,
];

with

$cmdMeta = "pdfinfo.exe -enc UTF-8 -meta " . $this->mFilename;

b. Replacing

$cmdPages = [
$wgPdfInfo,
'-enc', 'UTF-8', # Report metadata as UTF-8 text...
'-l', '9999999', # Report page sizes for all pages
$this->mFilename,
];

with

$cmdPages = "pdfinfo.exe -enc UTF-8 -l 9999999 " . $this->mFilename;

c. Replacing

$cmd = [ $wgPdftoText,  $this->mFilename, '-' ];

with

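# Caution: this drops the trailing '-' argument, so pdftotext writes a .txt file
# next to the PDF instead of printing the extracted text to stdout.
$cmd = "pdftotext.exe " . $this->mFilename;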


It's a bit of a hack, but it works. This should last until the issue is properly fixed.

173.77.3.157 (talkcontribs)
Reply to "No PDF/thumbnail, issue executing pdfinfo/pdftotext, Windows Server 2012 R2, IIS 8.5, MW 1.31"

Direct linking to PDF page, When clicking to direct media

Gmillerd (talkcontribs)

Does anyone have a modification of the extension so that, when a page is specified, clicking through to the PDF opens it at that page? That is, turning

/mediawiki/index.php?title=File:Filename.pdf&page=25

into the following, so that the browser skips to the specified page:

/images/0/0b/Filename.pdf#page=25

I am able to do it in JavaScript, but the PHP evades me.

// getUrlParameter() is a helper that is not shown in this post.
$("#file.fullImageLink").find("a:first").each(function() {
    $(this).attr("href", $(this).attr("href") + "#page=" + getUrlParameter("page"));
});

212.59.13.226 (talkcontribs)

Use # instead of ...&page=25
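
For anyone adapting the snippet above: it relies on a getUrlParameter helper that is not included in the thread. The following is a minimal sketch of the same link rewrite as a pure function, assuming the standard URLSearchParams API is available. The function name withPageFragment is hypothetical and not part of PdfHandler or MediaWiki.

```javascript
// Hypothetical helper: returns the direct-media href with "#page=N" appended
// when the query string carries a numeric "page" parameter.
function withPageFragment(href, search) {
  const page = new URLSearchParams(search).get("page");
  // Leave the link untouched when no valid page number was requested.
  if (page === null || !/^[0-9]+$/.test(page)) {
    return href;
  }
  // Drop any existing fragment so the result carries exactly one "#page=N".
  const base = href.split("#")[0];
  return base + "#page=" + page;
}

console.log(withPageFragment("/images/0/0b/Filename.pdf", "?title=File:Filename.pdf&page=25"));
```

In a browser, the second argument would typically be window.location.search, and the returned value would be written back to the link's href attribute, as the jQuery snippet above does.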

Reply to "Direct linking to PDF page, When clicking to direct media"

Error creating thumbnail images

151.61.39.181 (talkcontribs)

Using MediaWiki 1.30 and the extension release for this version (PdfHandler-REL1_30-53d9884.tar.gz), I could not get thumbnails generated on the image page. Instead of images, I got a text error like:

Error creating thumbnail: convert: no decode delegate for this image format `' @ error/constitute.c/ReadImage/504. convert: no images defined `/var/www/fountainpen.it/mediawiki/images/tmp/transform_7d1af7cbffc4.jpg' @ error/convert.c/ConvertImageCommand/32

Looking at the debug log, I found the commands used to create the thumbnail. They are invoked around line 188 of PdfHandler_body.php as a pipe between gs and convert. The problem is reported by convert, but it is caused by Ghostscript, which, for the PDF files I was using, wrote lines like these to standard output (although the -q option is present):

  **** Warning: considering '0000000000 XXXXX n' as a free entry.
  **** Warning: considering '0000000000 XXXXX n' as a free entry.
  **** Warning: considering '0000000000 XXXXX n' as a free entry.
  **** Warning: considering '0000000000 XXXXX n' as a free entry.

Those lines ended up ahead of the JPEG data sent over the pipe to convert, which then failed the conversion. Saving the image and processing it manually gave no error. I solved the issue by adding the line:

"-sstdout=/dev/null",

to the parameters passed to ghostscript inside PdfHandler_body.php, with a patch like this:

--- a/PdfHandler/PdfHandler_body.php    2018-04-30 23:14:14.000000000 +0200
+++ b/PdfHandler/PdfHandler_body.php    2018-11-01 13:02:12.744146598 +0100
@@ -195,6 +195,7 @@
            "-r{$wgPdfHandlerDpi}",
            "-dBATCH",
            "-dNOPAUSE",
+            "-sstdout=/dev/null",
            "-q",
            $srcPath
        );
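
To check whether a thumbnail failure is this kind of stdout pollution, one option is to save the Ghostscript output to a file and look at where the JPEG data actually starts. The sketch below is an illustration only, not part of PdfHandler: it assumes Node.js and uses a hypothetical function name, jpegStartOffset. It looks for the JPEG start-of-image marker (bytes FF D8 FF). An offset greater than 0 means something was written ahead of the image.

```javascript
// Returns the byte offset of the first JPEG start-of-image marker (FF D8 FF),
// or -1 if none is found. Offset 0 means the stream begins with a clean JPEG.
function jpegStartOffset(bytes) {
  for (let i = 0; i + 2 < bytes.length; i++) {
    if (bytes[i] === 0xff && bytes[i + 1] === 0xd8 && bytes[i + 2] === 0xff) {
      return i;
    }
  }
  return -1;
}

// Simulated data: one Ghostscript warning line followed by the start of a JPEG.
const warning = Buffer.from(
  "  **** Warning: considering '0000000000 XXXXX n' as a free entry.\n",
  "latin1"
);
const jpegHead = Buffer.from([0xff, 0xd8, 0xff, 0xe0]);
const polluted = Buffer.concat([warning, jpegHead]);

console.log(jpegStartOffset(jpegHead)); // 0: clean JPEG
console.log(jpegStartOffset(polluted) === warning.length); // true: text precedes the image
```

With real output, the bytes would come from a file produced by running the logged gs command with its stdout redirected to disk, for example via fs.readFileSync.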
Reply to "Error creating thumbnail images"

Thumbnail creation exits with code '134'

Octfx (talkcontribs)

Trying to create thumbnails results in error code 134. Output from the Debug-Log:

PdfHandler::doTransform: called wfMkdirParents(/tmp)

MediaWiki\Shell\Command::execute: /bin/bash /var/www/<path>/includes/shell/limit.sh (/usr/bin/gs -sDEVICE=jpeg -sOutputFile=- -dFirstPage=1 -dLastPage=1 -dSAFER -r150 -dBATCH -dNOPAUSE -q <pathToPDF> | /usr/bin/convert -depth 8 -quality 95 -resize 120 - /tmp/transform_9f856aed71d9.jpg) MW_INCLUDE_STDERR=1;MW_CPU_LIMIT=0; MW_CGROUP=; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes

[exec] Probably exited with signal 6: /bin/bash /var/www/<path>/includes/shell/limit.sh (/usr/bin/gs -sDEVICE=jpeg -sOutputFile=- -dFirstPage=1 -dLastPage=1 -dSAFER -r150 -dBATCH -dNOPAUSE -q <pathToPDF> | /usr/bin/convert -depth 8 -quality 95 -resize 120 - /tmp/transform_9f856aed71d9.jpg) MW_INCLUDE_STDERR=1;MW_CPU_LIMIT=0; MW_CGROUP=; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes

RETURN CODE: 134

ERROR: /bin/bash: line 1: 27183 Done                    /usr/bin/gs -sDEVICE=jpeg -sOutputFile=- -dFirstPage=1 -dLastPage=1 -dSAFER -r150 -dBATCH -dNOPAUSE -q <pathToPDF>
27184 Aborted                 | /usr/bin/convert -depth 8 -quality 95 -resize 120 - /tmp/transform_9f856aed71d9.jpg

[thumbnail] Removing bad 0-byte thumbnail "/tmp/transform_9f856aed71d9.jpg". unlink() succeeded

The extension was set up following Extension:PdfHandler#Debian.

MW Version: 1.30

PHP: 7.1.2

Ghostscript / Poppler-Utils / Imagick are installed and functioning

151.61.39.181 (talkcontribs)

I had a similar problem: thumbnails were not created, though with a different error message (I'm reporting it separately). I solved it by adding "-sstdout=/dev/null" to the parameters used for the Ghostscript command invocation.

Reply to "Thumbnail creation exits with code '134'"

Wrong font

Brunodapei (talkcontribs)

Reply to "Wrong font"

When updating PDF, thumbnails not being regenerated

213.61.254.67 (talkcontribs)

When I upload a new version of a PDF, the thumbnails generated from "Version 1" are still shown. The Version 2 thumbnails are not being created.

Reply to "When updating PDF, thumbnails not being regenerated"

PDFHandler doesn't show images

213.211.236.242 (talkcontribs)

I'm running MW 1.30.0 (PHP 5.6.35) and installed the PdfHandler version of 26 April 2018.

I embedded a PDF in one of my wiki pages with: [[File:test.pdf|page=1|thumb|My PDF]]

But only the link to the file is shown and no image of the first page.

In the files overview of the wiki, only the default PDF icon for the document is shown.

I'm using the wiki on Windows 10 and these are the lines in my LocalSettings.php:

wfLoadExtension( 'PdfHandler' );

$wgGenerateThumbnailOnParse = true;


$wgUseImageMagick = true;

$wgImageMagickConvertCommand = 'C:\wamp64\ImageMagick-7.0.7-Q16\convert.exe';

$wgPdfProcessor = 'C:\wamp64\gs\gs9.23\bin\gswin64.exe';

$wgPdfPostProcessor = $wgImageMagickConvertCommand;

$wgPdfInfo = 'C:\wamp64\xpdf-tools-win-4.00\bin64\pdfinfo.exe';

$wgPdftoText = 'C:\wamp64\xpdf-tools-win-4.00\bin64\pdftotext.exe';

$wgPdfCreateThumbnailsInJobQueue = "false";

There are no error-logs generated and running these maintenance scripts also doesn't help:

php C:\wamp64\www\mediawiki\maintenance\refreshImageMetadata.php 
php C:\wamp64\www\mediawiki\maintenance\rebuildImages.php 
php C:\wamp64\www\mediawiki\maintenance\runjobs.php

Any idea what I can try or how I can test or debug the PDFHandler?

Reply to "PDFHandler doesn't show images"

MobileFrontend

Brunodapei (talkcontribs)

Does PdfHandler work with the MobileFrontend extension?

Thank you.

Reply to "MobileFrontend"

MW Exception when uploading PDF

TieMichael (talkcontribs)

With PdfHandler enabled, I get an MWException when uploading PDF files to my wiki (using MW 1.28). If PdfHandler is disabled in LocalSettings.php, everything is fine. Also, thumbnails for existing PDFs seem to be created normally.

Here is a log of the Exception:

[Mime] MimeAnalyzer::doGuessMimeType: analyzing head and tail of /tmp/phpddXZ86 for magic numbers.
[Mime] MimeAnalyzer::doGuessMimeType: magic header in /tmp/phpddXZ86 recognized as application/pdf
[Mime] MimeAnalyzer::guessMimeType: guessed mime type of /tmp/phpddXZ86: application/pdf
[Mime] MimeAnalyzer::improveTypeFromExtension: improved mime type for .pdf: application/pdf
wfShellExec: /bin/bash '/var/www/html/mediawiki/includes/limit.sh' ''\''pdfinfo'\'' -enc UTF-8  -l 9999999  -meta '\''/tmp/phpddXZ86'\''' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=180; MW_CGROUP='\'''\''; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes'
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://ns.adobe.com/pdf/1.3/:Producer>.
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://purl.org/dc/elements/1.1/:format>.
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://ns.adobe.com/xap/1.0/mm/:DocumentID>.
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://ns.adobe.com/xap/1.0/mm/:InstanceID>.
PdfImage::retrieveMetaData: 'pdftotext' '/tmp/phpddXZ86' -
wfShellExec: /bin/bash '/var/www/html/mediawiki/includes/limit.sh' ''\''pdftotext'\'' '\''/tmp/phpddXZ86'\'' - ' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=180; MW_CGROUP='\'''\''; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes'
wfShellExec: /bin/bash '/var/www/html/mediawiki/includes/limit.sh' ''\''pdfinfo'\'' -enc UTF-8  -l 9999999  -meta '\''/tmp/phpddXZ86'\''' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=180; MW_CGROUP='\'''\''; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes'
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://ns.adobe.com/pdf/1.3/:Producer>.
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://purl.org/dc/elements/1.1/:format>.
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://ns.adobe.com/xap/1.0/mm/:DocumentID>.
[XMP] XMPReader::startElementModeInitial Ignoring unrecognized element <http://ns.adobe.com/xap/1.0/mm/:InstanceID>.
PdfImage::retrieveMetaData: 'pdftotext' '/tmp/phpddXZ86' -
wfShellExec: /bin/bash '/var/www/html/mediawiki/includes/limit.sh' ''\''pdftotext'\'' '\''/tmp/phpddXZ86'\'' - ' 'MW_INCLUDE_STDERR=;MW_CPU_LIMIT=180; MW_CGROUP='\'''\''; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes'
UploadBase::verifyExtension: mime type application/pdf matches extension pdf, passing file
[exception] [cfc57b2ef5a59ebc2d422eea] /api.php   MWException from line 425 of /var/www/html/mediawiki/includes/filerepo/file/LocalFile.php: Could not find data for image 'SKMBT_C20317011809250_MT.pdf'.
#0 /var/www/html/mediawiki/includes/filerepo/file/LocalFile.php(553): LocalFile->loadExtraFromDB()
#1 /var/www/html/mediawiki/includes/filerepo/file/LocalFile.php(795): LocalFile->load(integer)
#2 /var/www/html/mediawiki/extensions/PdfHandler/CreatePdfThumbnailsJob.class.php(110): LocalFile->getMetadata()
#3 /var/www/html/mediawiki/includes/Hooks.php(195): CreatePdfThumbnailsJob::insertJobs(UploadFromFile, string, boolean)
#4 /var/www/html/mediawiki/includes/upload/UploadBase.php(481): Hooks::run(string, array)
#5 /var/www/html/mediawiki/includes/upload/UploadBase.php(347): UploadBase->verifyFile()
#6 /var/www/html/mediawiki/includes/upload/UploadFromFile.php(95): UploadBase->verifyUpload()
#7 /var/www/html/mediawiki/includes/api/ApiUpload.php(569): UploadFromFile->verifyUpload()
#8 /var/www/html/mediawiki/includes/api/ApiUpload.php(96): ApiUpload->verifyUpload()
#9 /var/www/html/mediawiki/includes/api/ApiMain.php(1434): ApiUpload->execute()
#10 /var/www/html/mediawiki/includes/api/ApiMain.php(509): ApiMain->executeAction()
#11 /var/www/html/mediawiki/includes/api/ApiMain.php(480): ApiMain->executeActionWithErrorHandling()
#12 /var/www/html/mediawiki/api.php(83): ApiMain->execute()
#13 {main}

Here is the config for PdfHandler in LocalSettings.php:

wfLoadExtension( 'PdfHandler' );
$wgUseImageMagick = true;
$wgPdfProcessor = 'gs';
$wgPdfInfo = 'pdfinfo';
$wgPdftoText = 'pdftotext';
$wgPdfPostProcessor = 'convert';
$wgPdfOutputExtension = "jpg";
$wgPdfHandlerDpi = "400";
$wgPdfHandlerJpegQuality = "95" ;
$wgPdfCreateThumbnailsInJobQueue = "false";

Any idea what I am doing wrong? Thanks!

93.57.2.190 (talkcontribs)

I've the same issue...

أحمد (talkcontribs)

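I got the same exception and found phabricator:T50700.

As a workaround to avoid the exception for now, set the option to the boolean false (unquoted; the string "false" is non-empty, so PHP evaluates it as true):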

$wgPdfCreateThumbnailsInJobQueue = false;

Reply to "MW Exception when uploading PDF"