Extension:TikaAllTheFiles

From mediawiki.org
MediaWiki extensions manual
TikaAllTheFiles
Release status: beta
Implementation: Media, Search
Description: Using Apache Tika, provides text and metadata extraction for thousands of file types, enabling full-text search of almost any uploaded file
Author(s): Matt Marjanovic (CtapMaddog)
Maintainer(s): Center for Transparent Analysis and Policy
Latest version: 2.0.0 (2024-04-20)
Compatibility policy: Master maintains backward compatibility.
MediaWiki: 1.37+
PHP: 8.1+
Database changes: No
Composer: centertap/tika-all-the-files
License: GNU General Public License 3.0 or later
Download
README.md
RELEASE-NOTES.md

The TikaAllTheFiles (TATF) extension enables full-text search over uploaded files by using the Apache Tika content analysis toolkit, which "detects and extracts metadata and text from over a thousand different file types".

In practical terms: if you already have Extension:CirrusSearch set up and working on your wiki, TATF will allow you to perform full-text searches over the contents of almost any uploaded file, not just PDFs.

TATF's features and capabilities:

  • extract embedded digital text from any type of uploaded file so that it can be indexed for full-text search;
  • extract and index printed text from bitmap image files and from images embedded in document files, e.g., image-only PDFs (requires Tesseract OCR);
  • extract metadata from any type of uploaded file for display on File: pages;
  • index metadata properties along with text, to enable simple searching for properties within full-text search.

Installation

This extension can be installed using Composer; the package name is centertap/tika-all-the-files.

The complete installation and configuration instructions can be found in README.md.
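
As a rough sketch of what a Composer-based install usually looks like for a MediaWiki extension, the commands below use the package name listed in the infobox above. The MediaWiki root path is a hypothetical placeholder, the use of composer.local.json follows MediaWiki's general convention rather than anything stated on this page, and the wfLoadExtension name is assumed to match the extension name. README.md remains the authoritative reference, and any further setup it requires (such as Apache Tika itself) is not shown.

```shell
# Assumption: MW_ROOT is the directory where MediaWiki is installed; adjust it.
MW_ROOT=/var/www/mediawiki
cd "$MW_ROOT"

# Record the extension in composer.local.json (MediaWiki's usual place for
# site-specific packages) without installing anything yet.
COMPOSER=composer.local.json composer require --no-update centertap/tika-all-the-files

# Resolve and install dependencies, skipping development-only packages.
composer update --no-dev

# Assumption: the extension is enabled with the standard loader call.
# Appends one line to LocalSettings.php; edit the file by hand if you prefer.
echo "wfLoadExtension( 'TikaAllTheFiles' );" >> LocalSettings.php
```

After installing, the wiki's Special:Version page should list the extension if it loaded correctly.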

Configuration parameters

The complete description of configuration parameters can be found in README.md.