Release status: beta
|Description||Using Apache Tika, provides text and metadata extraction for thousands of file types, enabling full-text search of almost any uploaded file|
|Author(s)||Matt Marjanovic (CtapMaddog)|
|Maintainer(s)||Center for Transparent Analysis and Policy|
|Latest version||1.0.0 (2021-12-13)|
|Compatibility policy||Master maintains backward compatibility.|
|License||GNU General Public License 3.0 or later|
The TikaAllTheFiles (TATF) extension enables full-text search over uploaded files by using the Apache Tika content analysis toolkit, which "detects and extracts metadata and text from over a thousand different file types".
In practical terms: if you already have Extension:CirrusSearch set up and working on your wiki, TATF will allow you to perform full-text searches over the contents of almost any uploaded file, not just PDFs.
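As a rough illustration of what such a search looks like from a client, the sketch below builds a standard MediaWiki API search request restricted to the File namespace (namespace 6). The wiki URL and the search phrase are placeholders, and the helper name `build_file_search_url` is hypothetical; nothing here is specific to TATF beyond the assumption that file contents have already been indexed.

```python
from urllib.parse import urlencode


def build_file_search_url(api_url: str, phrase: str, limit: int = 10) -> str:
    """Build a MediaWiki API full-text search URL limited to uploaded files.

    `api_url` is the wiki's api.php endpoint (a placeholder in this sketch).
    """
    params = {
        "action": "query",
        "list": "search",       # the core full-text search module
        "srsearch": phrase,     # text to look for
        "srnamespace": 6,       # 6 is the File namespace
        "srlimit": limit,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical wiki; replace with your own api.php location.
    print(build_file_search_url("https://wiki.example.org/w/api.php",
                                "quarterly budget"))
```

Fetching the resulting URL should return matching File pages as JSON, provided the wiki's search backend has indexed the extracted text.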
TATF's features and capabilities:
- extract embedded digital text from any type of uploaded file so that it can be indexed for full-text search;
- extract and index printed text from bitmap image files and from images embedded in document files, e.g., image-only PDFs (requires Tesseract OCR);
- extract metadata from any type of uploaded file for display on
- index metadata properties along with text, to enable simple searching for properties within full-text search.
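To get a feel for the two kinds of extraction listed above (text and metadata), the following sketch prepares requests against a standalone Apache Tika server, which exposes `/tika` for text and `/meta` for metadata. This is only an illustration of Tika's capabilities: it assumes a Tika server running at `http://localhost:9998`, the helper name `build_tika_requests` is hypothetical, and it says nothing about how TATF itself invokes Tika.

```python
import urllib.request


def build_tika_requests(file_bytes: bytes,
                        server: str = "http://localhost:9998"):
    """Prepare (but do not send) text and metadata extraction requests.

    Assumes a standalone Tika server at `server`; this is an assumption of
    the sketch, not a description of TATF's internals.
    """
    # PUT /tika returns the extracted text of the uploaded document.
    text_req = urllib.request.Request(
        f"{server}/tika", data=file_bytes, method="PUT",
        headers={"Accept": "text/plain"},
    )
    # PUT /meta returns the document's metadata properties.
    meta_req = urllib.request.Request(
        f"{server}/meta", data=file_bytes, method="PUT",
        headers={"Accept": "application/json"},
    )
    return text_req, meta_req


if __name__ == "__main__":
    text_req, meta_req = build_tika_requests(b"%PDF-1.4 example bytes")
    print(text_req.get_method(), text_req.full_url)
    print(meta_req.get_method(), meta_req.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the extraction if a Tika server is actually listening at that address.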
This extension can be installed using
The complete installation and configuration instructions, including a full description of the configuration parameters, can be found in README.md.