Open-Source document text extractor and transformer
Norconex Importer is a Java library and command-line application that parses a computer file and extracts its content as plain text, whatever its native format (HTML, PDF, Word, etc.). It also lets you manipulate the extracted text before importing or using it in your own service or application.
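To picture the extract-then-transform flow described above, here is a minimal, self-contained Java sketch. It does not use the Norconex Importer API: the class `ExtractTransformSketch` and the methods `extractText` and `transform` are hypothetical names invented for illustration, and the regex-based tag stripping is only a crude stand-in for real format-aware parsing.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Illustration only: hypothetical names, NOT the Norconex Importer API.
public class ExtractTransformSketch {

    // Stand-in for "parse and extract": removes tags from simple HTML
    // and collapses whitespace. Real parsers handle far more than this.
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Stand-in for "manipulate the extracted text": applies each
    // transformation step in order to the output of the previous one.
    static String transform(String text, List<UnaryOperator<String>> steps) {
        String result = text;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Hello</h1><p>Sample  document</p></body></html>";

        // Step 1: extract plain text from the source format.
        String text = extractText(html);

        // Step 2: run the text through a chain of transformations
        // before handing it to your own service or application.
        List<UnaryOperator<String>> steps = List.of(
                s -> s.toLowerCase(Locale.ROOT),
                s -> s.replace("sample", "example"));

        System.out.println(transform(text, steps));
    }
}
```

The point of the sketch is the two-stage shape: extraction is independent of the source format's consumer, and transformations are small, ordered steps that can be added or removed without touching the extraction code.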
A typical use, though not the only one, is to "import" crawled content for use by a search engine. For this purpose, we invite you to consider one of the Norconex Collectors, which rely on Norconex Importer.
Have a look at the supported file formats.