Datashare is a self-hosted search engine for documents. It uses Apache Tika to extract text from over a thousand file types and Tesseract to read scanned documents and images. Datashare is developed by the International Consortium of Investigative Journalists (ICIJ), known for its groundbreaking investigations into the offshore world (Pandora Papers, Panama Papers, etc.).
It also provides:
- Many search filters (file types, creation date, languages, tags, etc.)
- Batch search (from a CSV file)
- Download of search results
- Tagging and recommendation
- Named entity recognition with CoreNLP
- Optical character recognition (OCR) with Tesseract