Extension:PandocUltimateConverter

From mediawiki.org
MediaWiki extensions manual
PandocUltimateConverter
Release status: beta
Implementation: Special page, Data extraction, API, Artificial intelligence
Description: Import documents and webpages into wiki pages and export wiki pages to external formats, powered by Pandoc. Supports DOCX, ODT, PDF, DOC, and more, with automatic image handling and optional AI cleanup.
Author(s): Urfiner (Nikolai Kochkin)
Latest version: 0.7.3
Compatibility policy: Main branch maintains backward compatibility.
MediaWiki: 1.42+
PHP: 7.4+
Database changes: No
Parameters:
  • $wgPandocUltimateConverter_TesseractExecutablePath
  • $wgPandocUltimateConverter_MediaFileExtensionsToSkip
  • $wgPandocUltimateConverter_PandocCustomUserRight
  • $wgPandocUltimateConverter_ShowExportInPageTools
  • $wgPandocUltimateConverter_LlmProvider
  • $wgPandocUltimateConverter_PdfToPpmExecutablePath
  • $wgPandocUltimateConverter_LibreOfficeExecutablePath
  • $wgPandocUltimateConverter_UseColorProcessors
  • $wgPandocUltimateConverter_PdfToHtmlExecutablePath
  • $wgPandocUltimateConverter_TempFolderPath
  • $wgPandocUltimateConverter_PdfExportEngine
  • $wgPandocUltimateConverter_OcrLanguage
  • $wgPandocUltimateConverter_LlmApiKey
  • $wgPandocUltimateConverter_EnableConfluenceMigration
  • $wgPandocUltimateConverter_LlmBaseUrl
  • $wgPandocUltimateConverter_FiltersToUse
  • $wgPandocUltimateConverter_LlmPrompt
  • $wgPandocUltimateConverter_LlmModel
  • $wgPandocUltimateConverter_PdfToTextExecutablePath
  • $wgPandocUltimateConverter_PandocExecutablePath
Licence: MIT License
Download
Example: Demos on GitHub

PandocUltimateConverter is a MediaWiki extension for importing documents and webpages into wiki pages and exporting wiki pages to external document formats — powered by Pandoc.

It is inspired by Microsoft's PandocUpload extension but was rewritten from scratch, adding image import, export, batch conversion, OCR support, AI cleanup, and a modern Codex UI.

MediaWiki 1.39–1.41 are partially supported in branch REL1_39.

Demos are available on GitHub.

Please use GitHub Issues for feedback, bug reports, and feature requests.

Features

  • Supports every format that Pandoc supports. Tested with DOCX, ODT, PDF, and DOC.
  • Import: convert DOCX, ODT, PDF, DOC, or a webpage URL into a wiki page (with images)
    • LLM-powered cleanup: optionally polish converted wikitext using an LLM (OpenAI or Claude) to remove formatting artifacts and improve readability
    • Migration from Confluence: migrate a Confluence space to MediaWiki
  • Export: download wiki pages as DOCX, ODT, EPUB, PDF, HTML, RTF, or TXT
    • PDF export uses a configurable engine (default: LibreOffice pipeline, no LaTeX required; or any Pandoc-supported --pdf-engine)
Import pipelines:

Format | Pipeline | Extra dependency
DOCX, ODT | Pandoc → wikitext | (none)
DOC | LibreOffice → DOCX → Pandoc | LibreOffice or any PDF engine supported by Pandoc
PDF (text) | pdftohtml → HTML → Pandoc | Poppler utils
PDF (scanned) | pdftoppm → Tesseract OCR → wikitext | Poppler utils + Tesseract

Limitations

  • Scanned PDFs (image-only) require Tesseract OCR — without it, only text-based PDFs are supported.
  • Colors are not preserved by default. Set $wgPandocUltimateConverter_UseColorProcessors = true; to enable experimental color support.
  • URL import may include extra formatting — this is a known Pandoc behavior. Enable AI cleanup to mitigate this.
  • AI cleanup sends page content to an external API — consider privacy implications for sensitive content.

Installation

  • Install Pandoc
  • Enable uploads and allow the file extensions you need:
    $wgEnableUploads = true;
    $wgFileExtensions[] = 'docx';
    $wgFileExtensions[] = 'odt';
    $wgFileExtensions[] = 'pdf';
    $wgFileExtensions[] = 'doc';
    
  • Download and place the file(s) in a directory called PandocUltimateConverter in your extensions/ folder.
  • Add the following code at the bottom of your LocalSettings.php file:
    wfLoadExtension( 'PandocUltimateConverter' );
    
  • (Optional) On Windows, set the path to Pandoc if it is not in PATH:
    $wgPandocUltimateConverter_PandocExecutablePath = 'C:\Program Files\Pandoc\pandoc.exe';
  • (Optional) Set a custom temp folder:
    $wgPandocUltimateConverter_TempFolderPath = 'D:\_TMP';
  • (Optional) Install Poppler to enable PDF import:
    Linux:
    sudo apt install poppler-utils          # Debian/Ubuntu
    sudo dnf install poppler-utils          # RHEL/Fedora
    Windows:
    choco install poppler
    Or download from poppler-windows releases, extract, and either add bin/ to PATH or set:
    $wgPandocUltimateConverter_PdfToHtmlExecutablePath = 'C:\poppler\Library\bin\pdftohtml.exe';
    $wgPandocUltimateConverter_PdfToPpmExecutablePath  = 'C:\poppler\Library\bin\pdftoppm.exe';
    $wgPandocUltimateConverter_PdfToTextExecutablePath = 'C:\poppler\Library\bin\pdftotext.exe';
  • (Optional) Install Tesseract for scanned PDF / OCR (also requires Poppler):
    Linux:
    sudo apt install tesseract-ocr          # Debian/Ubuntu
    sudo apt install tesseract-ocr-deu      # additional languages
    sudo dnf install tesseract              # RHEL/Fedora
    Windows:
    choco install tesseract
    Or download from UB-Mannheim/tesseract and add to PATH, or set:
    $wgPandocUltimateConverter_TesseractExecutablePath = 'C:\Program Files\Tesseract-OCR\tesseract.exe';
  • (Optional) Install LibreOffice for DOC import and PDF export (default engine):
    Linux:
    sudo apt install libreoffice            # Debian/Ubuntu
    sudo dnf install libreoffice            # RHEL/Fedora
    Windows:
    Download from libreoffice.org and add the program/ folder to PATH, or set:
    $wgPandocUltimateConverter_LibreOfficeExecutablePath = 'C:\Program Files\LibreOffice\program\soffice.exe';
    To use a different PDF export engine instead of LibreOffice:
    $wgPandocUltimateConverter_PdfExportEngine = '/path/to/xelatex';   // or 'pdflatex', 'lualatex', 'wkhtmltopdf', 'weasyprint', etc.
  • (Optional) Configure an LLM provider to enable the AI cleanup feature by adding the following to LocalSettings.php:
    $wgPandocUltimateConverter_LlmProvider = 'openai';   // or 'claude'
    $wgPandocUltimateConverter_LlmApiKey   = 'sk-...';   // your API key
  • Done: navigate to Special:Version on your wiki to verify that the extension is successfully installed.
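
Before configuring explicit paths, it can help to check which of the external tools from the steps above are already reachable through PATH. The following is a minimal sketch, not part of the extension: the executable names in TOOLS are assumptions based on the default package names (on some Linux systems LibreOffice is installed as libreoffice rather than soffice), and locate_tools and missing_tools are hypothetical helper names.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
TOOLS = {
    "pandoc": "all conversions",
    "pdftohtml": "Poppler, text-based PDF import",
    "pdftoppm": "Poppler, scanned PDF import",
    "pdftotext": "Poppler",
    "tesseract": "OCR for scanned PDFs",
    "soffice": "LibreOffice, DOC import and default PDF export",
}


def locate_tools(names, which=shutil.which):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


def missing_tools(found):
    """Return the sorted names of tools that were not found."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = locate_tools(TOOLS)
    for name, purpose in TOOLS.items():
        print(f"{name:10} {found[name] or 'NOT FOUND'}  ({purpose})")
```

Run the script as the same user as the web server, since that user's PATH may differ from an interactive shell. Any tool reported as missing either needs to be installed or pointed to with the matching ...ExecutablePath setting.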

Usage

Import (Special:PandocUltimateConverter)

Note: A legacy (non-Codex) form is available at "Special:PandocUltimateConverter?codex=0".
Warning: The target page and all of its images are overwritten if they already exist.

Go to Special:PandocUltimateConverter to convert files or URLs into wiki pages.

  1. Choose source: file upload or URL (you can add multiple items for batch conversion)
  2. Enter the target page name for each item
  3. (Optional) Enable "AI cleanup" checkbox to automatically polish converted wikitext after conversion
  4. Click convert — you will be redirected to the new page when done

What happens during conversion:

  • Images are extracted and uploaded to the wiki automatically (duplicates are skipped)
  • The uploaded source file is removed after conversion
  • Temporary files are cleaned up
  • If AI cleanup is enabled, the converted wikitext is sent to the configured LLM for post-processing

The status column shows the current state of each item:

  • Conversion done — Pandoc conversion completed successfully
  • AI cleanup done — LLM polish completed successfully after conversion

You can also trigger AI cleanup manually on any already-converted item using the ✨ button.

Export (Special:PandocExport)

Export one or more wiki pages to an external document format.

Go to Special:PandocExport or use the Export action in the page tools menu (the same menu where "Delete" and "Move" appear).

Supported export formats: DOCX, ODT, EPUB, PDF, HTML, RTF, TXT.

Features:

  • Export a single page or multiple pages into one document
  • Export entire categories (subcategories are resolved recursively)
  • "Separate files" option bundles each page as an individual file in a ZIP archive
  • Images referenced in wikitext are embedded into the output document
  • PDF export uses a configurable engine (default: LibreOffice pipeline via Pandoc → DOCX → LibreOffice, no LaTeX required). Any other value (e.g. xelatex, pdflatex, lualatex, wkhtmltopdf, weasyprint) is passed directly to Pandoc's --pdf-engine option.
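
The engine selection described in the last item can be sketched as a function that returns the external commands to run. This is an illustration under stated assumptions, not the extension's actual code: pdf_export_commands is a hypothetical name, and the exact command lines the extension builds are not documented here. The options used (Pandoc's -f, -o and --pdf-engine, and LibreOffice's --headless, --convert-to and --outdir) are standard options of those tools.

```python
from pathlib import Path


def pdf_export_commands(engine, wikitext_file, output_pdf):
    """Return the commands a PDF export could run, in order (sketch only)."""
    src = Path(wikitext_file)
    out = Path(output_pdf)
    if engine == "libreoffice":
        # Two-step pipeline: Pandoc writes DOCX, LibreOffice renders it to PDF.
        # LibreOffice names the result after the input file, so the DOCX
        # shares the PDF's base name and directory.
        docx = out.with_suffix(".docx")
        return [
            ["pandoc", "-f", "mediawiki", str(src), "-o", str(docx)],
            ["soffice", "--headless", "--convert-to", "pdf",
             "--outdir", str(out.parent), str(docx)],
        ]
    # Any other value is handed to Pandoc unchanged as its PDF engine.
    return [
        ["pandoc", "-f", "mediawiki", str(src), "-o", str(out),
         f"--pdf-engine={engine}"],
    ]
```

Returning argument lists rather than shell strings keeps paths with spaces (common on Windows) intact when the commands are passed to a process runner.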

Confluence Migration (Special:ConfluenceMigration)

Mass-migrate an entire Confluence space (Cloud or Server) to this wiki in one operation.

What gets migrated:

  • All pages in the specified space are fetched via the Confluence REST API v1.
  • Page content (Confluence "storage format" HTML) is converted to MediaWiki wikitext using Pandoc.
  • Common Confluence macros (code blocks, info/note/warning/tip panels) are converted to their MediaWiki equivalents.
  • File attachments are downloaded from Confluence and uploaded to the MediaWiki file repository.
  • Pages are created with the edit summary "Imported from Confluence".
  • When auto-categorize is enabled, pages with sub-pages get a matching category; nested sub-pages produce nested categories.
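
To illustrate the first two steps, the sketch below pages through a space using the content resource of the Confluence REST API v1 and requests each page body in storage format. It is an assumption-laden example, not the extension's implementation: content_url and iter_pages are hypothetical names, fetch_json stands for any caller-supplied function that performs an authenticated GET and returns the decoded JSON, and on Confluence Cloud the base URL is assumed to end in /wiki.

```python
import urllib.parse


def content_url(base_url, space_key, start=0, limit=25):
    """Build a REST v1 content listing URL for one batch of pages."""
    query = urllib.parse.urlencode({
        "spaceKey": space_key,
        "type": "page",
        "expand": "body.storage",  # include the storage-format body
        "start": start,
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/rest/api/content?{query}"


def iter_pages(fetch_json, base_url, space_key, limit=25):
    """Yield every page in the space, requesting one batch at a time."""
    start = 0
    while True:
        batch = fetch_json(content_url(base_url, space_key, start, limit))
        results = batch.get("results", [])
        if not results:
            return  # an empty batch means the listing is exhausted
        yield from results
        start += len(results)
```

Advancing by the number of results actually returned, rather than by the requested limit, keeps the loop correct when the server returns smaller batches than requested.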

The migration is processed as a background job via the MediaWiki job queue. You do not have to keep your browser open. When the migration finishes, you receive an Echo notification (requires Extension:Echo).

To disable this feature, set $wgPandocUltimateConverter_EnableConfluenceMigration = false; in LocalSettings.php.

AI cleanup (LLM polish)

AI cleanup sends your page content to an external API (OpenAI or Anthropic). Make sure this is acceptable for your wiki's content and privacy requirements.

After conversion, the resulting wikitext may contain formatting artifacts such as leftover HTML tags, extra whitespace, empty spans, or website navigation elements (when converting from URLs). The optional AI cleanup feature sends the converted wikitext to an LLM to clean up these artifacts while preserving all content, wiki links, templates, and MediaWiki markup.

Supported providers:

  • OpenAI — default model: gpt-5.4-nano
  • Claude (Anthropic) — default model: claude-3-5-haiku-20241022

There are two ways to use AI cleanup:

  1. Batch mode — check the "Polish with AI" checkbox before clicking Convert all. Each item is converted first, then automatically queued for AI cleanup. The conversion queue and the AI cleanup queue run in parallel.
  2. Per-item — click the ✨ button on any already-converted item to run AI cleanup on demand.

If AI cleanup fails, a per-item error is shown with a Retry button.

Configuration

All parameters are set in LocalSettings.php. The "$wgPandocUltimateConverter_" prefix has been omitted from each parameter name for brevity.

Category | Parameter | Description | Default
General | PandocExecutablePath | Path to Pandoc binary. Not needed if in PATH. | null
General | TempFolderPath | Temp folder for conversion files. Uses system default if not set. | null
General | PandocCustomUserRight | Restrict access to a specific user right. | ""
General | MediaFileExtensionsToSkip | File extensions to skip during image upload (e.g. ["emf"]). | []
General | FiltersToUse | Custom Pandoc Lua filters to apply. Must be in the filters/ folder. | []
General | UseColorProcessors | Preserve text/background colors from DOCX/ODT (experimental). | false
General | ShowExportInPageTools | Show "Export" in the page Actions menu. | true
General | EnableConfluenceMigration | Set to false to disable Special:ConfluenceMigration. | true
Optional dependencies | PdfToHtmlExecutablePath | Path to Poppler's pdftohtml. Not needed if in PATH. | null
Optional dependencies | PdfToPpmExecutablePath | Path to Poppler's pdftoppm. Not needed if in PATH. | null
Optional dependencies | PdfToTextExecutablePath | Path to Poppler's pdftotext. Not needed if in PATH. | null
Optional dependencies | LibreOfficeExecutablePath | Path to soffice/libreoffice. Not needed if in PATH. | null
Optional dependencies | TesseractExecutablePath | Path to Tesseract OCR binary. Not needed if in PATH. | null
Optional dependencies | OcrLanguage | Tesseract language code(s). Use + for multiple, e.g. "eng+deu". | "eng"
Optional dependencies | PdfExportEngine | Engine used for PDF export. "libreoffice" uses a two-step pipeline (Pandoc → DOCX → PDF via LibreOffice, no LaTeX needed). Any other value (e.g. "xelatex", "pdflatex", "lualatex", "wkhtmltopdf", "weasyprint") is passed directly to Pandoc's --pdf-engine option. Preferably specify the full path to the engine binary. | "libreoffice"
AI cleanup (LLM) | LlmProvider | LLM provider: "openai" or "claude". Leave null to disable. | null
AI cleanup (LLM) | LlmApiKey | API key for the configured provider. | null
AI cleanup (LLM) | LlmModel | Model name. Defaults to gpt-5.4-nano (OpenAI) or claude-3-5-haiku-20241022 (Claude). | null
AI cleanup (LLM) | LlmPrompt | Custom cleanup prompt. Uses a sensible built-in default if not set. | null

Custom user rights example

$wgPandocUltimateConverter_PandocCustomUserRight = 'pandoc';
$wgGroupPermissions['your_group_here']['pandoc'] = true;

Pandoc Lua filters

Add filters via:

$wgPandocUltimateConverter_FiltersToUse[] = 'increase_heading_level.lua';

Built-in filters (in the filters/ subfolder):

  • increase_heading_level.lua — increase heading levels by 1 (useful when documents start at H1)
  • colorize_mark_class.lua — highlight "mark" classes with yellow background (see Issue #14)

API

The extension exposes three action API modules. Write operations (pandocconvert, pandocllmpolish) require a CSRF token and POST.

action=pandocconvert

Convert a file or URL into a wiki page.

Parameter | Required | Description
pagename | yes | Target wiki page title.
filename | one of | Uploaded file name (mutually exclusive with url).
url | one of | http/https URL to fetch (mutually exclusive with filename).
forceoverwrite | no | 1 to overwrite existing page (default: 0).
token | yes | CSRF token.
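
A client can enforce the parameter rules from the table before sending anything. The sketch below only assembles and validates the POST parameters. build_convert_params is a hypothetical helper name, format=json is the standard MediaWiki API output selector rather than something specific to this module, and the shape of the response is not documented here, so it is not shown.

```python
import urllib.parse


def build_convert_params(pagename, token, filename=None, url=None,
                         forceoverwrite=False):
    """Assemble POST parameters for action=pandocconvert (sketch only)."""
    # filename and url are mutually exclusive, and one of them is required.
    if (filename is None) == (url is None):
        raise ValueError("pass exactly one of filename or url")
    params = {
        "action": "pandocconvert",
        "format": "json",
        "pagename": pagename,
        "token": token,
    }
    if filename is not None:
        params["filename"] = filename
    else:
        if urllib.parse.urlparse(url).scheme not in ("http", "https"):
            raise ValueError("only http/https URLs are accepted")
        params["url"] = url
    # Send the flag only when set, so the documented default (0) applies otherwise.
    if forceoverwrite:
        params["forceoverwrite"] = "1"
    return params
```

For example, build_convert_params("Imported/Report", token, url="https://example.org/report") yields a dictionary that can be form-encoded and posted to the wiki's api.php within a logged-in session.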

action=pandocllmpolish

Run AI cleanup on an existing wiki page's wikitext.

Parameter | Required | Description
pagename | yes | Target wiki page title (must exist and contain wikitext).
token | yes | CSRF token.
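
Because this is a write operation, the request has to be a form-encoded POST that carries the CSRF token. The sketch below builds such a request with the Python standard library without sending it. polish_request is a hypothetical name, the api.php location depends on the wiki, and the token is assumed to have been obtained beforehand through the standard MediaWiki action=query&meta=tokens call in the same logged-in session.

```python
import urllib.parse
import urllib.request


def polish_request(api_url, pagename, token):
    """Build (but do not send) a POST request for action=pandocllmpolish."""
    body = urllib.parse.urlencode({
        "action": "pandocllmpolish",
        "format": "json",
        "pagename": pagename,
        # CSRF tokens end in "+\", which urlencode escapes correctly.
        "token": token,
    }).encode("utf-8")
    return urllib.request.Request(api_url, data=body, method="POST")
```

Sending the request requires an opener that holds the session cookies of the logged-in user, for example one built with urllib.request.build_opener and an HTTPCookieProcessor.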

action=pandocurltitle

Fetch remote URLs and extract their HTML <title> tags. Used internally by the Codex UI to suggest page names for URL imports. The module is called with a GET request and requires no token.

Accepts multiple URLs (pipe-separated). Only http/https URLs are accepted.

Parameter | Required | Description
urls | yes | One or more URLs (pipe-separated) to fetch titles from.
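
Since the URLs are joined with a pipe and then placed in a query string, each one has to be checked and the combined value percent-encoded. A minimal sketch, assuming a hypothetical helper name url_title_query and the standard format=json selector; the response format is not documented here and is therefore not parsed.

```python
import urllib.parse


def url_title_query(api_url, urls):
    """Build the GET URL for action=pandocurltitle (sketch only)."""
    if not urls:
        raise ValueError("at least one URL is required")
    for u in urls:
        if urllib.parse.urlparse(u).scheme not in ("http", "https"):
            raise ValueError(f"only http/https URLs are accepted: {u}")
        if "|" in u:
            # A literal pipe would be read as a separator between URLs.
            raise ValueError(f"URL must not contain '|': {u}")
    query = urllib.parse.urlencode({
        "action": "pandocurltitle",
        "format": "json",
        "urls": "|".join(urls),
    })
    return f"{api_url}?{query}"
```

The resulting URL can be fetched with any HTTP client, because the module accepts GET requests without a token.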

Debugging

Add to LocalSettings.php:

$wgShowExceptionDetails = true;
$wgDebugLogGroups['PandocUltimateConverter'] = '/var/log/mediawiki/pandoc.log';
