This feature release of Norconex Importer brings bug fixes, enhancements, and great new features, such as OCR and translation support. Keep reading for all the details on some of this release’s most interesting changes. While Java can be used to configure and use the Importer, XML configuration is used here for demonstration purposes. You can find all Importer configuration options here.
About Norconex Importer
Norconex Importer is an open-source product for extracting and manipulating text and metadata from files of various formats. It works for stand-alone use or as a Java library. It’s an essential component of Norconex Collectors for processing crawled documents. You can make Norconex Importer an essential piece of your ETL pipeline.
OCR support
Norconex Importer now leverages Apache Tika 1.7’s newly introduced OCR capability. To convert popular image formats (PNG, TIFF, JPEG, etc.) to text, download a copy of Tesseract OCR for your operating system, and reference its install location in your Importer configuration. When enabled, OCR also processes embedded images (e.g., a PDF that uses images for its text). The class to configure to enable OCR support is GenericDocumentParserFactory.
<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <ocr path="(path to Tesseract OCR software install)">
    <languages>eng,fra</languages>
  </ocr>
</documentParserFactory>
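Since OCR depends on an external Tesseract install, it can help to sanity-check the configured path before running the Importer. The following is a minimal sketch using only the JDK's XML API. The class OcrConfigCheck, its helper methods, and the /opt/tesseract path are hypothetical names chosen for illustration; none of this is part of Norconex Importer itself.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not a Norconex class): reads the <ocr> settings
// from a configuration fragment so they can be checked up front.
public class OcrConfigCheck {

    // Parses the XML and returns its first <ocr> element, or null if absent.
    private static Element ocrElement(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("ocr");
        return nodes.getLength() == 0 ? null : (Element) nodes.item(0);
    }

    // Returns the "path" attribute of <ocr>, or null when there is no <ocr>.
    public static String ocrPath(String xml) throws Exception {
        Element ocr = ocrElement(xml);
        return ocr == null ? null : ocr.getAttribute("path");
    }

    // Returns the comma-separated <languages> values as a trimmed list.
    public static List<String> ocrLanguages(String xml) throws Exception {
        List<String> result = new ArrayList<>();
        Element ocr = ocrElement(xml);
        if (ocr == null) {
            return result;
        }
        NodeList langs = ocr.getElementsByTagName("languages");
        if (langs.getLength() == 0) {
            return result;
        }
        for (String lang : langs.item(0).getTextContent().split(",")) {
            if (!lang.trim().isEmpty()) {
                result.add(lang.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // "/opt/tesseract" is a placeholder; use your own install location.
        String xml = "<documentParserFactory class=\"com.norconex.importer.parser.GenericDocumentParserFactory\">"
                + "<ocr path=\"/opt/tesseract\"><languages>eng,fra</languages></ocr>"
                + "</documentParserFactory>";
        String path = ocrPath(xml);
        System.out.println("OCR path: " + path);
        System.out.println("Languages: " + ocrLanguages(xml));
        System.out.println("Path is a directory: " + Files.isDirectory(Paths.get(path)));
    }
}
```

A check like this only confirms that the path is present and points to a directory; it does not verify that Tesseract itself runs or that the listed language packs are installed.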
Translation support
With the new TranslatorSplitter class, you can now hook Norconex Importer up to a translation API. The Apache Tika API has been extended to provide the ability to translate a mix of document content or specific document fields. The translation APIs supported out of the box are Microsoft, Google, Lingo24, and Moses.
<postParseHandlers>
  <splitter class="com.norconex.importer.handler.splitter.impl.TranslatorSplitter" api="microsoft">
    <clientId>YOUR_CLIENT_ID</clientId>
    <secretId>YOUR_SECRET_ID</secretId>
  </splitter>
</postParseHandlers>
Dynamic title creation
Too many documents do not have a valid title, when they have a title at all. What if you need a title to represent each document? What do you do in such cases? Do you take the file name as the title? Not so nice. Do you take the document property called “title”? Not reliable. You now have a new option with the TitleGeneratorTagger. It will try to detect a decent title out of your document. In cases where it can’t, it offers a few alternate options. You always get something back.
<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
      toField="generated_title"
      fallbackMaxLength="250"
      detectHeading="true"
      detectHeadingMinLength="10"
      detectHeadingMaxLength="500" />
</postParseHandlers>
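To give a feel for how options like these could interact, here is a minimal sketch of one plausible reading of them: accept the first line as a heading when its length falls within the min/max bounds, and otherwise fall back to the leading characters of the text. TitleFallbackSketch and its method are hypothetical, and the logic is an assumption made for illustration, not the tagger's actual algorithm. It also skips whatever title detection the tagger performs between those two steps.

```java
// Hypothetical illustration only; not the TitleGeneratorTagger implementation.
public class TitleFallbackSketch {

    /**
     * Returns the first line when it looks like a heading (its length is
     * within [headingMin, headingMax]); otherwise returns the leading
     * characters of the text, truncated to fallbackMaxLength.
     */
    public static String title(
            String text, int headingMin, int headingMax, int fallbackMaxLength) {
        String trimmed = text == null ? "" : text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }

        // Heading detection: only a first line followed by a line break
        // qualifies as a stand-alone heading candidate.
        int newline = trimmed.indexOf('\n');
        if (newline > 0) {
            String firstLine = trimmed.substring(0, newline).trim();
            if (firstLine.length() >= headingMin
                    && firstLine.length() <= headingMax) {
                return firstLine;
            }
        }

        // Fallback: collapse whitespace and keep the leading characters.
        String flat = trimmed.replaceAll("\\s+", " ");
        return flat.length() <= fallbackMaxLength
                ? flat
                : flat.substring(0, fallbackMaxLength);
    }

    public static void main(String[] args) {
        String doc = "Annual Report 2014\n\nThis document summarizes the year.";
        // Same bounds as the configuration above: 10, 500, and 250.
        System.out.println(title(doc, 10, 500, 250));
    }
}
```

The point of the fallback branch is the guarantee described above: as long as the document has any text, some title comes back.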
Saving of parsing errors
A new top-level configuration option was introduced so that every file generating parsing errors gets saved in a location of your choice. Each file is saved together with the metadata obtained so far (if any) and the Java exception that was thrown. This is a great addition to help troubleshoot parsing failures.
<importer>
  <parseErrorsSaveDir>/path/to/store/bad/files</parseErrorsSaveDir>
</importer>
Document parsing improvements
The content type detection accuracy and performance were improved with this release. In addition, document parsing features the following additions and improvements:
- Better PDF support with the addition of PDF XFA (dynamic forms) text extraction, as well as improved space detection (eliminating many space-stripping issues). Also, PDFs with JBIG2 and JPEG 2000 image formats are now parsed properly.
- New XFDL parser (PureEdge Extensible Forms Description Language). Supports both Gzipped/Base64 encoded and plain text versions.
- New, much improved WordPerfect parser now parsing WordPerfect documents according to WordPerfect file specifications.
- New Quattro Pro parser for parsing Quattro Pro documents according to Quattro Pro file specifications.
- JBIG2 and JPEG 2000 image formats are now recognized.
You want more?
The list of changes and improvements doesn’t stop here. Read the product release notes for a complete list of changes.
Unfamiliar with this product? No sweat — read this “Getting Started” page.
If not already out when you read this, the next feature release of Norconex HTTP Collector and Norconex Filesystem Collector will both ship with this version of the Importer. Can’t wait for the release? Manually upgrade the Norconex Importer library to take advantage of these new features in your favorite crawler.