Extension:PandocUltimateConverter
| Release status | beta |
|---|---|
| Implementation | Special page, Data extraction, API, Artificial intelligence |
| Description | Import documents/webpages into wiki pages and export wiki pages to external formats — powered by Pandoc. Supports DOCX, ODT, PDF, DOC, and more, with automatic image handling and optional AI cleanup. |
| Author(s) | Urfiner (Nikolai Kochkin) |
| Latest version | 0.7.3 |
| Compatibility policy | Main branch maintains backward compatibility. |
| MediaWiki | 1.42+ |
| PHP | 7.4+ |
| Database changes | No |
| Licence | MIT License |
| Download | |
| Example | Demos on GitHub |
PandocUltimateConverter is a MediaWiki extension for importing documents and webpages into wiki pages and exporting wiki pages to external document formats — powered by Pandoc.
It is inspired by Microsoft's PandocUpload extension but rewritten from scratch, adding image import, export, batch conversion, OCR support, AI cleanup, and a modern Codex UI.
MediaWiki 1.39–1.41 are partially supported in branch REL1_39.
You can see demos here.
Features
- Supports everything Pandoc supports. Tested: DOCX, ODT, PDF, DOC.
- Import: convert DOCX, ODT, PDF, DOC, or a webpage URL into a wiki page (with images)
- LLM-powered cleanup: optionally polish converted wikitext using an LLM (OpenAI or Claude) to remove formatting artifacts and improve readability
- Migration from Confluence: migrate a Confluence space to MediaWiki
- Export: download wiki pages as DOCX, ODT, EPUB, PDF, HTML, RTF, or TXT
- PDF export uses a configurable engine (default: LibreOffice pipeline, no LaTeX required; or any Pandoc-supported --pdf-engine)
| Format | Pipeline | Extra dependency |
|---|---|---|
| DOCX, ODT | Pandoc → wikitext | — |
| DOC | LibreOffice → DOCX → Pandoc | LibreOffice or any PDF engine supported by Pandoc |
| PDF (text) | pdftohtml → HTML → Pandoc | Poppler utils |
| PDF (scanned) | pdftoppm → Tesseract OCR → wikitext | Poppler utils + Tesseract |
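The pipelines in the table can be approximated by hand, which helps when checking that each dependency works on its own. The functions below are a sketch built from standard Pandoc, LibreOffice, Poppler, and Tesseract options; the function names and output file names are hypothetical, and the exact command lines the extension runs may differ.

```shell
# Sketch only: approximates the import pipelines from the table above.

# Print the tools a given input file needs (PDF entry covers text-based PDFs).
required_tools_for() {
  case "${1##*.}" in
    doc) echo "soffice pandoc" ;;
    pdf) echo "pdftohtml pandoc" ;;
    *)   echo "pandoc" ;;
  esac
}

# DOCX/ODT: Pandoc converts directly and extracts embedded images.
import_docx() {
  pandoc "$1" -t mediawiki --extract-media=./media -o "${1%.*}.wiki"
}

# DOC: LibreOffice converts to DOCX in the current directory, then Pandoc runs.
import_doc() {
  soffice --headless --convert-to docx --outdir . "$1" &&
    import_docx "$(basename "${1%.*}").docx"
}

# Text PDF: Poppler writes a single HTML file, Pandoc converts it to wikitext.
import_pdf_text() {
  pdftohtml -s -noframes "$1" "${1%.*}" &&
    pandoc "${1%.*}.html" -f html -t mediawiki -o "${1%.*}.wiki"
}

# Scanned PDF: render pages to PNG, then OCR each page to a .txt file.
import_pdf_scanned() {
  pdftoppm -r 300 -png "$1" page &&
    for img in page-*.png; do tesseract "$img" "${img%.png}" -l eng; done
}
```

For example, `import_docx report.docx` would write `report.wiki` next to the input and place extracted images under `./media`.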
Limitations
- Scanned PDFs (image-only) require Tesseract OCR; without it, only text-based PDFs are supported.
- Colors are not preserved by default. Set $wgPandocUltimateConverter_UseColorProcessors = true; to enable experimental color support.
- URL import may include extra formatting; this is a known Pandoc behavior. Enable AI cleanup to mitigate this.
- AI cleanup sends page content to an external API — consider privacy implications for sensitive content.
Installation
- Install Pandoc
- Enable uploads and allow the file extensions you need:
$wgEnableUploads = true;
$wgFileExtensions[] = 'docx';
$wgFileExtensions[] = 'odt';
$wgFileExtensions[] = 'pdf';
$wgFileExtensions[] = 'doc';
- Download and place the file(s) in a directory called PandocUltimateConverter in your extensions/ folder.
- Add the following code at the bottom of your LocalSettings.php file:
wfLoadExtension( 'PandocUltimateConverter' );
- (Optional) On Windows, set the path to Pandoc if it is not in PATH:
$wgPandocUltimateConverter_PandocExecutablePath = 'C:\Program Files\Pandoc\pandoc.exe';
- (Optional) Set a custom temp folder:
$wgPandocUltimateConverter_TempFolderPath = 'D:\_TMP';
- (Optional) Install Poppler to enable PDF import:
- Linux
sudo apt install poppler-utils # Debian/Ubuntu
sudo dnf install poppler-utils # RHEL/Fedora
- Windows
choco install poppler
- Or download from poppler-windows releases, extract, and either add bin/ to PATH or set:
$wgPandocUltimateConverter_PdfToHtmlExecutablePath = 'C:\poppler\Library\bin\pdftohtml.exe';
$wgPandocUltimateConverter_PdfToPpmExecutablePath = 'C:\poppler\Library\bin\pdftoppm.exe';
$wgPandocUltimateConverter_PdfToTextExecutablePath = 'C:\poppler\Library\bin\pdftotext.exe';
- (Optional) Install Tesseract for scanned PDF / OCR (also requires Poppler):
- Linux
sudo apt install tesseract-ocr # Debian/Ubuntu
sudo apt install tesseract-ocr-deu # additional languages
sudo dnf install tesseract # RHEL/Fedora
- Windows
choco install tesseract
- Or download from UB-Mannheim/tesseract and add to PATH, or set:
$wgPandocUltimateConverter_TesseractExecutablePath = 'C:\Program Files\Tesseract-OCR\tesseract.exe';
- (Optional) Install LibreOffice for DOC import and PDF export (default engine):
- Linux
sudo apt install libreoffice # Debian/Ubuntu sudo dnf install libreoffice # RHEL/Fedora
- Windows
- Download from libreoffice.org and add the program/ folder to PATH, or set:
$wgPandocUltimateConverter_LibreOfficeExecutablePath = 'C:\Program Files\LibreOffice\program\soffice.exe';
- To use a different PDF export engine instead of LibreOffice:
$wgPandocUltimateConverter_PdfExportEngine = '/path/to/xelatex'; // or 'pdflatex', 'lualatex', 'wkhtmltopdf', 'weasyprint', etc.
- (Optional) Configure an LLM provider to enable the AI cleanup feature by adding the following to LocalSettings.php:
$wgPandocUltimateConverter_LlmProvider = 'openai'; // or 'claude'
$wgPandocUltimateConverter_LlmApiKey = 'sk-...'; // your API key
Done – Navigate to Special:Version on your wiki to verify that the extension is successfully installed.
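Before testing a conversion, it can help to confirm which of the external tools are reachable from the command line. The helper below is a minimal sketch; the function name is hypothetical, and the tool names are the usual binary names for the packages mentioned in the steps above.

```shell
# Print the name of every given tool that is NOT found on PATH,
# one per line. Prints nothing when all tools are available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Typical call (uncomment to run):
# missing_tools pandoc pdftohtml pdftoppm pdftotext tesseract soffice
```

The web server may run with a different PATH than your login shell, so a tool that passes this check can still need one of the explicit ExecutablePath settings.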
Usage
Import (Special:PandocUltimateConverter)
Go to Special:PandocUltimateConverter to convert files or URLs into wiki pages.
- Choose source: file upload or URL (you can add multiple items for batch conversion)
- Enter the target page name for each item
- (Optional) Enable the "AI cleanup" checkbox to polish the converted wikitext automatically after conversion
- Click convert — you will be redirected to the new page when done
What happens during conversion:
- Images are extracted and uploaded to the wiki automatically (duplicates are skipped)
- The uploaded source file is removed after conversion
- Temporary files are cleaned up
- If AI cleanup is enabled, the converted wikitext is sent to the configured LLM for post-processing
The status column shows the current state of each item:
- Conversion done — Pandoc conversion completed successfully
- AI cleanup done — LLM polish completed successfully after conversion
- You can also trigger AI cleanup manually on any already-converted item using the ✨ button
Export (Special:PandocExport)
Export one or more wiki pages to an external document format.
Go to Special:PandocExport or use the Export action in the page tools menu (the same menu where "Delete" and "Move" appear).
Supported export formats: DOCX, ODT, EPUB, PDF, HTML, RTF, TXT.
Features:
- Export a single page or multiple pages into one document
- Export entire categories (subcategories are resolved recursively)
- "Separate files" option bundles each page as an individual file in a ZIP archive
- Images referenced in wikitext are embedded into the output document
- PDF export uses a configurable engine (default: LibreOffice pipeline via Pandoc → DOCX → LibreOffice, no LaTeX required). Any other value (e.g. xelatex, pdflatex, lualatex, wkhtmltopdf, weasyprint) is passed directly to Pandoc's --pdf-engine option.
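The two PDF export routes can be reproduced manually to see which one suits a given wiki. This is a sketch under assumptions: the input is a file containing wikitext, the function names are hypothetical, and the extension's real invocation may add further options.

```shell
# Describe which route a configured engine value selects (hypothetical helper).
export_route() {
  if [ "$1" = "libreoffice" ]; then
    echo "pandoc-docx-soffice"
  else
    echo "pandoc --pdf-engine=$1"
  fi
}

# Default route: Pandoc writes DOCX, LibreOffice renders it to PDF (no LaTeX).
export_pdf_libreoffice() {
  pandoc -f mediawiki -t docx "$1" -o "${1%.*}.docx" &&
    soffice --headless --convert-to pdf --outdir "$(dirname "$1")" "${1%.*}.docx"
}

# Alternative route: Pandoc calls the named engine itself.
# $1 = wikitext file, $2 = engine name or full path (e.g. xelatex).
export_pdf_engine() {
  pandoc -f mediawiki "$1" -o "${1%.*}.pdf" --pdf-engine="$2"
}
```

The LibreOffice route avoids a LaTeX installation at the cost of an intermediate DOCX file; the direct route gives finer typographic control when a LaTeX or HTML-based engine is already installed.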
Confluence Migration (Special:ConfluenceMigration)
Mass-migrate an entire Confluence space (Cloud or Server) to this wiki in one operation.
What gets migrated:
- All pages in the specified space are fetched via the Confluence REST API v1.
- Page content (Confluence "storage format" HTML) is converted to MediaWiki wikitext using Pandoc.
- Common Confluence macros (code blocks, info/note/warning/tip panels) are converted to their MediaWiki equivalents.
- File attachments are downloaded from Confluence and uploaded to the MediaWiki file repository.
- Pages are created with the edit summary "Imported from Confluence".
- When auto-categorize is enabled, pages with sub-pages get a matching category; nested sub-pages produce nested categories.
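To check in advance what a migration would see, the same kind of request can be made against Confluence by hand. The sketch below builds a paged URL for the Confluence REST API v1 content resource; the function name, site URL, space key, and credentials are placeholders, and the requests the extension actually sends are not documented here.

```shell
# Build the URL for one batch of pages in a space.
# $1 = base URL (Cloud sites include /wiki), $2 = space key,
# $3 = start offset (default 0), $4 = batch size (default 25).
confluence_pages_url() {
  printf '%s/rest/api/content?spaceKey=%s&type=page&expand=body.storage&start=%s&limit=%s\n' \
    "${1%/}" "$2" "${3:-0}" "${4:-25}"
}

# Fetch the first batch as JSON (uncomment and fill in real credentials):
# curl -s -u "user@example.org:API_TOKEN" \
#   "$(confluence_pages_url https://example.atlassian.net/wiki DOCS)"
```

Each returned page carries its storage-format HTML under body.storage, which is the input the list above describes as being converted with Pandoc.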
The migration is processed as a background job via the MediaWiki job queue, so you do not have to keep your browser open. When the migration finishes, you receive an Echo notification (requires Extension:Echo).
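On wikis where queued jobs are only processed slowly, a large migration may appear stalled. The standard MediaWiki maintenance scripts can inspect and drain the queue manually; the wrapper below is a sketch with a hypothetical function name, and it assumes the run.php entry point available in MediaWiki 1.40 and later.

```shell
# Show the number of queued jobs, then run up to N of them.
# $1 = MediaWiki installation directory, $2 = max jobs (default 50).
run_wiki_jobs() {
  local wiki_dir="$1"
  if [ ! -f "$wiki_dir/maintenance/run.php" ]; then
    echo "maintenance/run.php not found in $wiki_dir" >&2
    return 1
  fi
  php "$wiki_dir/maintenance/run.php" showJobs
  php "$wiki_dir/maintenance/run.php" runJobs --maxjobs "${2:-50}"
}
```

On the older REL1_39 branch, the equivalent scripts are maintenance/showJobs.php and maintenance/runJobs.php.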
To disable this feature, set $wgPandocUltimateConverter_EnableConfluenceMigration = false; in LocalSettings.php.
AI cleanup (LLM polish)
After conversion, the resulting wikitext may contain formatting artifacts such as leftover HTML tags, extra whitespace, empty spans, or website navigation elements (when converting from URLs). The optional AI cleanup feature sends the converted wikitext to an LLM to clean up these artifacts while preserving all content, wiki links, templates, and MediaWiki markup.
Supported providers:
- OpenAI, default model: gpt-5.4-nano
- Claude (Anthropic), default model: claude-3-5-haiku-20241022
There are two ways to use AI cleanup:
- Batch mode — check the "Polish with AI" checkbox before clicking Convert all. Each item is converted first, then automatically queued for AI cleanup. The conversion queue and the AI cleanup queue run in parallel.
- Per-item — click the ✨ button on any already-converted item to run AI cleanup on demand.
If AI cleanup fails, a per-item error is shown with a Retry button.
Configuration
All parameters are set in LocalSettings.php. The "$wgPandocUltimateConverter_" prefix has been omitted in each parameter for brevity.
| Category | Parameter | Description | Default |
|---|---|---|---|
| General | PandocExecutablePath | Path to Pandoc binary. Not needed if in PATH. | null |
| General | TempFolderPath | Temp folder for conversion files. Uses system default if not set. | null |
| General | PandocCustomUserRight | Restrict access to a specific user right. | "" |
| General | MediaFileExtensionsToSkip | File extensions to skip during image upload (e.g. ["emf"]). | [] |
| General | FiltersToUse | Custom Pandoc Lua filters to apply. Must be in the filters/ folder. | [] |
| General | UseColorProcessors | Preserve text/background colors from DOCX/ODT (experimental). | false |
| General | ShowExportInPageTools | Show "Export" in the page Actions menu. | true |
| General | EnableConfluenceMigration | Set to false to disable Special:ConfluenceMigration. | true |
| Optional dependencies | PdfToHtmlExecutablePath | Path to Poppler's pdftohtml. Not needed if in PATH. | null |
| Optional dependencies | PdfToPpmExecutablePath | Path to Poppler's pdftoppm. Not needed if in PATH. | null |
| Optional dependencies | PdfToTextExecutablePath | Path to Poppler's pdftotext. Not needed if in PATH. | null |
| Optional dependencies | LibreOfficeExecutablePath | Path to soffice/libreoffice. Not needed if in PATH. | null |
| Optional dependencies | TesseractExecutablePath | Path to Tesseract OCR binary. Not needed if in PATH. | null |
| Optional dependencies | OcrLanguage | Tesseract language code(s). Use + for multiple, e.g. "eng+deu". | "eng" |
| Optional dependencies | PdfExportEngine | Engine used for PDF export. "libreoffice" uses a two-step pipeline (Pandoc → DOCX → PDF via LibreOffice, no LaTeX needed). Any other value (e.g. "xelatex", "pdflatex", "lualatex", "wkhtmltopdf", "weasyprint") is passed directly to Pandoc's --pdf-engine option. Preferably specify the full path to the engine binary. | "libreoffice" |
| AI cleanup (LLM) | LlmProvider | LLM provider: "openai" or "claude". Leave null to disable. | null |
| AI cleanup (LLM) | LlmApiKey | API key for the configured provider. | null |
| AI cleanup (LLM) | LlmModel | Model name. Defaults to gpt-5.4-nano (OpenAI) or claude-3-5-haiku-20241022 (Claude). | null |
| AI cleanup (LLM) | LlmPrompt | Custom cleanup prompt. Uses a sensible built-in default if not set. | null |
Custom user rights example
$wgPandocUltimateConverter_PandocCustomUserRight = 'pandoc';
$wgGroupPermissions['your_group_here']['pandoc'] = true;
Pandoc Lua filters
Add filters via:
$wgPandocUltimateConverter_FiltersToUse[] = 'increase_heading_level.lua';
Built-in filters (in the filters/ subfolder):
- increase_heading_level.lua: increase heading levels by 1 (useful when documents start at H1)
- colorize_mark_class.lua: highlight "mark" classes with yellow background (see Issue #14)
API
The extension exposes three action API modules. Write operations (pandocconvert, pandocllmpolish) require a CSRF token and POST.
action=pandocconvert
Convert a file or URL into a wiki page.
| Parameter | Required | Description |
|---|---|---|
| pagename | yes | Target wiki page title. |
| filename | one of | Uploaded file name (mutually exclusive with url). |
| url | one of | http/https URL to fetch (mutually exclusive with filename). |
| forceoverwrite | no | 1 to overwrite existing page (default: 0). |
| token | yes | CSRF token. |
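A request to this module can be sketched with curl. The example assumes an api.php URL, a cookie jar from an already logged-in session, and a CSRF token obtained through the standard MediaWiki action=query&meta=tokens call; the function names are hypothetical, and the shape of the JSON response is not documented here.

```shell
# Choose the source parameter: http/https sources go in "url",
# everything else is treated as an uploaded file name.
source_field() {
  case "$1" in
    http://*|https://*) printf 'url=%s\n' "$1" ;;
    *)                  printf 'filename=%s\n' "$1" ;;
  esac
}

# POST to action=pandocconvert.
# $1 = api.php URL, $2 = cookie jar file, $3 = CSRF token,
# $4 = target page name, $5 = file name or URL, $6 = forceoverwrite (0 or 1).
pandocconvert() {
  curl -s -b "$2" "$1" \
    --data-urlencode "action=pandocconvert" \
    --data-urlencode "format=json" \
    --data-urlencode "pagename=$4" \
    --data-urlencode "$(source_field "$5")" \
    --data-urlencode "forceoverwrite=${6:-0}" \
    --data-urlencode "token=$3"
}

# Example (placeholders; uncomment to run against a real wiki):
# api="https://wiki.example.org/w/api.php"
# token=$(curl -s -b cookies.txt "$api?action=query&meta=tokens&type=csrf&format=json" \
#   | jq -r .query.tokens.csrftoken)   # jq is an extra tool, assumed installed
# pandocconvert "$api" cookies.txt "$token" "Imported report" "Report.docx" 1
```

Sending exactly one of filename or url mirrors the "mutually exclusive" rule in the table, so the helper never sets both.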
action=pandocllmpolish
Run AI cleanup on an existing wiki page's wikitext.
| Parameter | Required | Description |
|---|---|---|
| pagename | yes | Target wiki page title (must exist and contain wikitext). |
| token | yes | CSRF token. |
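The call takes only the two parameters from the table. The wrapper below is a sketch with a hypothetical function name; it assumes a cookie jar from a logged-in session and a valid CSRF token, and it does not interpret the response.

```shell
# POST to action=pandocllmpolish.
# Arguments: api.php URL, cookie jar file, CSRF token, page name.
pandocllmpolish() {
  if [ "$#" -ne 4 ]; then
    echo "usage: pandocllmpolish API_URL COOKIE_JAR CSRF_TOKEN PAGENAME" >&2
    return 2
  fi
  curl -s -b "$2" "$1" \
    --data-urlencode "action=pandocllmpolish" \
    --data-urlencode "format=json" \
    --data-urlencode "pagename=$4" \
    --data-urlencode "token=$3"
}
```

Because the page content is sent to the configured LLM provider, a request like this may take noticeably longer than an ordinary API write.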
action=pandocurltitle
Fetch remote URLs and extract their HTML <title> tags. Used internally by the Codex UI to suggest page names for URL imports. GET request, no token required.
Accepts multiple URLs (pipe-separated). Only http/https URLs are accepted.
| Parameter | Required | Description |
|---|---|---|
| urls | yes | One or more URLs (pipe-separated) to fetch titles from. |
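Since this module is a plain GET, it is easy to try from a terminal. The sketch below joins several URLs with the pipe separator and lets curl handle the URL encoding; the function names and the wiki address are placeholders.

```shell
# Join all arguments with "|" (the multi-value separator used by the API).
join_urls() {
  ( IFS='|'; printf '%s\n' "$*" )
}

# GET action=pandocurltitle. $1 = api.php URL, remaining args = URLs.
pandocurltitle() {
  api="$1"; shift
  curl -s -G "$api" \
    --data-urlencode "action=pandocurltitle" \
    --data-urlencode "format=json" \
    --data-urlencode "urls=$(join_urls "$@")"
}

# Example (uncomment to run against a real wiki):
# pandocurltitle "https://wiki.example.org/w/api.php" \
#   "https://example.org/" "https://example.com/docs"
```

Using -G with --data-urlencode keeps the pipe and any query characters inside the target URLs correctly escaped in the request.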
Debugging
Add to LocalSettings.php:
$wgShowExceptionDetails = true;
$wgDebugLogGroups['PandocUltimateConverter'] = '/var/log/mediawiki/pandoc.log';
See also
- Extension:FlexForm: convert to wikitext using Pandoc
- Extension:ConvertPDF2Wiki — convert PDF files to wikitext
- Extension:SimplePasteImage — paste text with images in Visual Editor
