Norconex Importer 2.1.0 released

This feature release of Norconex Importer brings bug fixes, enhancements, and great new features, such as OCR and translation support. Keep reading for all the details on some of this release’s most interesting changes. While Java can be used to configure and use the Importer, XML configuration is used here for demonstration purposes. You can find all Importer configuration options here.

About Norconex Importer

Norconex Importer is an open-source product for extracting and manipulating text and metadata from files of various formats. It works for stand-alone use or as a Java library. It’s an essential component of Norconex Collectors for processing crawled documents. You can make Norconex Importer an essential piece of your ETL pipeline.

OCR support

[ezcol_1half]

Norconex Importer now leverages the OCR capability newly introduced in Apache Tika 1.7. To convert popular image formats (PNG, TIFF, JPEG, etc.) to text, download a copy of Tesseract OCR for your operating system and reference its install location in your Importer configuration. When enabled, OCR also processes embedded images (e.g., a PDF whose text is stored as images). The class to configure to enable OCR support is GenericDocumentParserFactory.

[/ezcol_1half]

[ezcol_1half_end]

<documentParserFactory 
    class="com.norconex.importer.parser.GenericDocumentParserFactory" >
  <ocr path="(path to Tesseract OCR software install)">
    <languages>eng,fra</languages>
  </ocr>
</documentParserFactory>

[/ezcol_1half_end]

Translation support

[ezcol_1half]

With the new TranslatorSplitter class, it’s now possible to hook Norconex Importer up to a translation API. The Apache Tika API has been extended to provide the ability to translate document content, specific document fields, or both. The translation APIs supported out-of-the-box are Microsoft, Google, Lingo24, and Moses.

[/ezcol_1half]

[ezcol_1half_end]

<postParseHandlers>
  <splitter
      class="com.norconex.importer.handler.splitter.impl.TranslatorSplitter"
      api="microsoft">
    <clientId>YOUR_CLIENT_ID</clientId>
    <secretId>YOUR_SECRET_ID</secretId>
  </splitter>
</postParseHandlers>

[/ezcol_1half_end]

Dynamic title creation

[ezcol_1half]

Too many documents lack a valid title, if they have a title at all. What if you need a title to represent each document? Do you take the file name as the title? Not so nice. Do you take the document property called “title”? Not reliable. You now have a new option with the TitleGeneratorTagger. It tries to detect a decent title from your document. In cases where it can’t, it offers a few alternate options. You always get something back.

[/ezcol_1half]

[ezcol_1half_end]

<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
          toField="generated_title"
          fallbackMaxLength="250"
          detectHeading="true"
          detectHeadingMinLength="10"
          detectHeadingMaxLength="500" />
</postParseHandlers>

[/ezcol_1half_end]

Saving of parsing errors

[ezcol_1half]

A new top-level configuration option was introduced so that every file generating parsing errors gets saved in a location of your choice. Each file is saved with the metadata obtained so far (if any) and the Java exception that was thrown. This is a great addition to help troubleshoot parsing failures.

[/ezcol_1half]

[ezcol_1half_end]

<importer>
  <parseErrorsSaveDir>/path/to/store/bad/files</parseErrorsSaveDir>
</importer>

[/ezcol_1half_end]
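
To see how the pieces fit together, here is a minimal sketch that combines the snippets from the sections above into a single configuration. It only reuses elements and attribute values already shown in this post. It assumes the documentParserFactory and postParseHandlers elements sit directly under the importer root next to parseErrorsSaveDir, and that a splitter and a tagger can share one postParseHandlers block. Treat both as assumptions and check them against the Importer configuration options for your version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: element placement under <importer> is an assumption. -->
<importer>

  <!-- Files that fail parsing are saved here for troubleshooting. -->
  <parseErrorsSaveDir>/path/to/store/bad/files</parseErrorsSaveDir>

  <!-- OCR: "path" points to your Tesseract OCR install. -->
  <documentParserFactory
      class="com.norconex.importer.parser.GenericDocumentParserFactory">
    <ocr path="(path to Tesseract OCR software install)">
      <languages>eng,fra</languages>
    </ocr>
  </documentParserFactory>

  <postParseHandlers>
    <!-- Translation through the Microsoft API (placeholder credentials). -->
    <splitter
        class="com.norconex.importer.handler.splitter.impl.TranslatorSplitter"
        api="microsoft">
      <clientId>YOUR_CLIENT_ID</clientId>
      <secretId>YOUR_SECRET_ID</secretId>
    </splitter>

    <!-- Title detection, stored in the "generated_title" field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
            toField="generated_title"
            fallbackMaxLength="250"
            detectHeading="true"
            detectHeadingMinLength="10"
            detectHeadingMaxLength="500" />
  </postParseHandlers>

</importer>
```

Each block is optional, so you can drop the ones you do not need. For example, keep only parseErrorsSaveDir while troubleshooting parsing failures.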

Document parsing improvements

The content type detection accuracy and performance were improved with this release. In addition, document parsing features the following additions and improvements:

Better PDF support with the addition of PDF XFA (dynamic forms) text extraction, as well as improved space detection (eliminating many space-stripping issues). Also, PDFs with JBIG2 and JPEG 2000 image formats are now parsed properly.
New XFDL parser (PureEdge Extensible Forms Description Language). Supports both Gzipped/Base64-encoded and plain-text versions.
New, much-improved WordPerfect parser that parses WordPerfect documents according to WordPerfect file specifications.
New Quattro Pro parser that parses Quattro Pro documents according to Quattro Pro file specifications.
JBIG2 and JPEG 2000 image formats are now recognized.

You want more?

The list of changes and improvements doesn’t stop here. Read the product release notes for a complete list of changes.

Unfamiliar with this product? No sweat — read this “Getting Started” page.

If not already out when you read this, the next feature release of Norconex HTTP Collector and Norconex Filesystem Collector will both ship with this version of the Importer. Can’t wait for the release? Manually upgrade the Norconex Importer library to take advantage of these new features in your favorite crawler.

Download Norconex Importer 2.1.0.

Pascal Essiembre

Pascal Essiembre was a successful Enterprise Application Developer for several years before founding Norconex in 2007, and he remains its president to this day. Pascal has been responsible for several successful Norconex enterprise search projects across North America. He also heads the Product Division of Norconex and leads Norconex Open-Source initiatives.