Norconex Importer

Release History

Version Date Description
2.10.0 2019-12-22 Feature release
2.9.0 2018-06-17 Feature release
2.8.0 2017-11-26 Feature release
2.7.2 2017-05-26 Bugfix release
2.7.1 2017-05-25 Maintenance release
2.7.0 2017-04-26 Feature release
2.6.1 2016-12-14 Minor release
2.6.0 2016-08-25 Feature release
2.5.2 2016-05-31 Maintenance release
2.5.1 2016-03-22 Bug fix release
2.5.0 2016-02-28 Feature release
2.4.0 2015-11-02 Feature release
2.3.1 2015-08-07 Maintenance release
2.3.0 2015-07-21 Feature release
2.2.0 2015-06-15 Feature release
2.1.1 2015-04-08 Maintenance release
2.1.0 2015-03-31 Feature release
2.0.0 2014-11-25 Major release
1.3.0 2014-08-18 Feature release
1.2.0 2014-03-09 Feature release
1.1.0 2013-08-20 Minor release
1.0.1 2013-08-02 Maintenance release
1.0.0 2013-06-04 Open Source release

2.10.0 Feature release Download 2019-12-22

New FieldReportTagger for discovering fields being crawled to file (with sample values).
HierarchyTagger now has a boolean "regex" attribute to specify whether the separator should match a regular expression.
RenameTagger now has a boolean "regex" attribute to specify whether the fromField and toField are a regular expression pattern and replacement.
Maven dependency updates: Apache Tika 1.18, Norconex Commons Lang 1.15.1.
HierarchyTagger no longer keeps empty segments by default. A new "keepEmptySegments" attribute has been added to keep them.
OCR configuration now expects the full path of the Tesseract executable (as opposed to the installation folder).
Fixed HierarchyTagger not constructing paths properly. #91
Fixed ClassCastException when an IDocumentFilter does not implement IOnMatchFilter.
Fixed LanguageTagger choosing main language as the one with lowest probability. #82
Upgraded pdfbox to 2.0.11 due to potential security issue.
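
The hierarchy splitting and the effect of keeping or dropping empty segments can be illustrated with a minimal sketch. It is not the HierarchyTagger implementation: the class name, the expand method, and the sample value are hypothetical, the separator is treated as a literal string (the "regex" option is not modeled), and the exact output format of the real tagger may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class HierarchySketch {

    // Hypothetical helper, not the HierarchyTagger API: expands "a/b/c"
    // into ["a", "a/b", "a/b/c"], one entry per level of the branch.
    static List<String> expand(String value, String separator,
            boolean keepEmptySegments) {
        List<String> paths = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean first = true;
        // The separator is quoted so it is matched literally; a limit of -1
        // makes split() keep trailing empty segments as well.
        for (String segment : value.split(Pattern.quote(separator), -1)) {
            if (segment.isEmpty() && !keepEmptySegments) {
                continue; // drop segments produced by doubled separators
            }
            if (!first) {
                current.append(separator);
            }
            current.append(segment);
            first = false;
            paths.add(current.toString());
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(expand("a//b/c", "/", false)); // [a, a/b, a/b/c]
        System.out.println(expand("a//b/c", "/", true));  // [a, a/, a//b, a//b/c]
    }
}
```

With empty segments dropped, a doubled separator no longer produces an extra hierarchy level.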

2.9.0 Feature release Download 2018-06-17

New PDFPageSplitter to split PDF pages, treating them as individual documents.
ImporterResponse and ImporterStatus now display nicely in the logs (toString implemented).
Maven dependency updates: Norconex Commons Lang 1.15.0.
Fixed TitleGeneratorTagger throwing NullPointerException when "fromField" is specified but does not exist (is null). #74
Fixed "buffer underrun" exception sometimes appearing when parsing some .msg files with embedded files. #72

2.8.0 Feature release Download 2017-11-26

New TruncateTagger class.
New ExternalTagger class. #64
ExternalTransformer and ExternalParser can now supply/retrieve metadata as files to external applications and can also pass the document reference as argument. New command line tokens: ${INPUT_META} ${OUTPUT_META} ${REFERENCE}. #63
New configuration option for DOMTagger, DOMSplitter and DOMContentFilter for specifying which parser to use ("html" or "xml").
TextPatternTagger can now extract field names in addition to field values. #52
New RegexUtil and RegexFieldExtractor classes.
TextPatternTagger case sensitivity is now applied to individual patterns.
ReplaceTagger and ReplaceTransformer now support empty/null replacement values, resulting in replacing matches with nothing.
ExternalTransformer and ExternalParser can now specify regex match groups for field names and field values.
Now uses WordPerfect and Quattro Pro parsers contributed to Apache Tika.
Maven dependency updates: Apache Tika 1.16, Norconex Commons Lang 1.14.0.
Fixed ExternalTransformer and ExternalParser having issues with arguments with spaces in them. #64
Removed copies of Apache Tika classes that are now fixed in Apache Tika: ListTables, ImageParser, ListManager, PDF2XHTML, CharsetDetector.
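
As a rough illustration of how command-line tokens such as ${INPUT_META}, ${OUTPUT_META}, and ${REFERENCE} could be expanded before launching an external application, here is a minimal sketch. The class, the expand method, the command name, and the file paths are all hypothetical; how ExternalTransformer and ExternalParser actually perform the substitution is not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CommandTokenSketch {

    // Hypothetical helper: replaces each ${NAME} token found in the command
    // template with its value. Tokens without a value are left untouched.
    static String expand(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // String#replace performs a literal (non-regex) replacement.
            result = result.replace(
                    "${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("INPUT_META", "/tmp/in.meta");    // assumed temp file
        values.put("OUTPUT_META", "/tmp/out.meta");  // assumed temp file
        values.put("REFERENCE", "http://example.com/doc.pdf");
        System.out.println(expand(
                "myapp -i ${INPUT_META} -o ${OUTPUT_META} -r ${REFERENCE}",
                values));
    }
}
```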

2.7.2 Bugfix release Download 2017-05-26

Fixed "caseSensitive" flag sometimes having no effect in DOMContentFilter, RegexContentFilter, RegexMetadataFilter, and RegexReferenceFilter.

2.7.1 Maintenance release Download 2017-05-25

ImporterConfig#saveToXML(...) now written with xml:space="preserve".
Maven dependency updates: Norconex Commons Lang 1.13.1.

2.7.0 Feature release Download 2017-04-26

Added Lua scripting support to ScriptFilter, ScriptTagger, and ScriptTransformer.
New ExternalTransformer for transforming documents and extracting metadata using an external application.
Added schema-based XML configuration validation, which can be triggered from the command prompt with this new flag: -k or --checkcfg
New RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
New MergeTagger for combining multiple fields into one.
New SubstringTransformer for reducing content (e.g., truncating) to a substring defined by begin and end indexes.
New UUIDTagger for adding a random universally unique identifier (UUID) to documents.
CharacterCaseTagger now supports "swap" and "string" to swap character case and capitalize beginning of a string, respectively.
New ConstantTagger#setOnConflict(...) method to specify if the constant should be added to existing values, replace them, or do nothing.
Now distributed with utility scripts.
Dependency updates: Apache Tika 1.14, Norconex Commons Lang 1.13.0, JSoup 1.10.2, OOXML-Schemas 1.3 (fixes some bad Visio parsing), Apache Commons Collections 3.2.2.
ExternalParser was rewritten. Now offers more metadata extraction options and environment variable support.
Modified Javadoc to include an XML usage example for all XML-configurable classes.
Dependent libraries for JPEG 2000 and JBIG2 image formats are no longer distributed with this product due to licensing incompatibilities. To enable them, you will need JAR files found at these locations:
Fixed NoClassDefFoundError on some MS Visio files: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
Fixed NullPointerException from parsing some Word documents. #41
Removed FixedHtmlEncodingDetector class in favor of the fixed version of HtmlEncodingDetector.
Removed deprecated Importer HTMLParser and PDFParser classes.
Removed deprecated IDocumentSplittableEmbeddedParser interface.
Removed Importer EnhancedPDFParser and EnhancedPDF2XHTML in favor of upgraded TIKA PDFParser and PDF2XHTML versions.
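
The idea behind adding a random UUID to a document can be sketched with the JDK alone. This is only an illustration, not the UUIDTagger code: the metadata is modeled as a plain map, and the addUuid method and the "document.uuid" field name are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class UuidFieldSketch {

    // Hypothetical helper: stores a random (version 4) UUID under the given
    // field, creating the list of values for that field if needed.
    static void addUuid(Map<String, List<String>> metadata, String field) {
        metadata.computeIfAbsent(field, key -> new ArrayList<>())
                .add(UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        addUuid(metadata, "document.uuid"); // assumed field name
        System.out.println(metadata);
    }
}
```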

2.6.1 Minor release Download 2016-12-14

DOMTagger now supports a new flag called "matchBlanks" to extract elements that contain empty values or values made of white spaces only. #39
ReplaceTagger now supports new flags: "wholeMatch" and "replaceAll".
The default value in DOMTagger can now be an empty string or a string made of white spaces. #39
Dependency updates: Norconex Commons Lang 1.12.3, Joda Time 2.9.4, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, JJ2000 5.3, JAI ImageIO jpeg2000 1.3.1
Fixed ReplaceTagger not adding replaced value to "toField" when it is the same as original value. #29
Fixed NoSuchMethodError when performing OCR on some PDFs with JPEG 2000 images in them.
Fixed "No ImageWriter found for 'jpx' format" when performing OCR on some PDFs with JPX images in them.

2.6.0 Feature release Download 2016-08-25

New CountMatchesTagger that will count occurrences of matching substring or regular expression in a field value or document content and store the count in a target field.
DateFormatTagger now accepts multiple source formats when attempting to convert dates, trying them in order provided.
DOMTagger can now apply DOM selection on an optional "fromField" and can also use a "defaultValue" when there is no match. #28
New DOM selector possibility for DOMContentFilter and DOMTagger: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).
TranslatorSplitter now supports Yandex translation service.
GenericDocumentParserFactory/AbstractTikaParser now allows you to control which embedded documents you do not want extracted from their containers.
GenericDocumentParserFactory/AbstractTikaParser now allows you to control which document containers should not have their embedded documents extracted.
GenericDocumentParserFactory/AbstractTikaParser now allows you to specify, via regular expression, which content types should have their embedded documents "split".
GenericDocumentParserFactory now allows you to define and configure parsers via XML.
New IHintsAwareParser interface for parsers that can benefit from global configuration settings.
New ParseHints class holding generic configuration settings to be set on parsers implementing the new IHintsAwareParser.
New EmbeddedConfig class holding configuration settings related to embedded documents. Used by ParseHints on GenericDocumentParserFactory.
Can now pass optional -e or --contentEncoding to command line to explicitly set the character encoding (charset).
LanguageTagger now uses Tika language detection (supports at least 70 languages).
GenericDocumentParserFactory has been modified to introduce the concept of ParseHints, which holds configuration settings every parser has the option to support or not. Generic embedded and OCR configuration settings have been moved to the new ParseHints class.
The following GenericDocumentParserFactory methods are now deprecated: setSplitEmbedded(boolean), isSplitEmbedded(), setOCRConfig(OCRConfig), and getOCRConfig().
It is now possible to configure ExternalParser via XML.
Now validates configuration and variable file paths when launched on the command line (throws errors on invalid paths).
Dependency updates: Tika 1.13 (which now uses PDFBox 2.x), Norconex Commons Lang 1.9.1, JSoup 1.9.2.
OCRConfig#setContentTypes(String) and equivalent configuration option in GenericDocumentParserFactory now expects a regular expression as opposed to a comma-separated list of content types.
DebugTagger now assumes UTF-8 instead of OS default charset when printing content.
Subclasses of AbstractStringTagger will now see tagTextDocument(...) method invoked at least once even if there is no content supplied.
Fixed DOMTagger ignoring subsequent selectors when one selector has no match. #21
Fixed ContentTypeDetector not closing TikaInputStream properly resulting in temporary "apache-tika-XXX.tmp" files not being deleted properly.
Fixed infinite loop with DOMSplitter when some selectors are too generic.
AbstractCharStreamTagger now tolerates null content stream.
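
Trying several source date formats in the order provided can be sketched as follows. This is a minimal illustration using java.time, not the DateFormatTagger implementation; the class, the parseFirstMatch method, and the sample patterns are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class MultiFormatDateSketch {

    // Hypothetical helper: returns the date parsed with the first pattern
    // that accepts the value, or an empty Optional when none of them do.
    static Optional<LocalDate> parseFirstMatch(
            String value, List<String> patterns) {
        for (String pattern : patterns) {
            try {
                DateTimeFormatter formatter =
                        DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH);
                return Optional.of(LocalDate.parse(value, formatter));
            } catch (DateTimeParseException e) {
                // This pattern does not fit the value; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("yyyy-MM-dd", "dd/MM/yyyy");
        System.out.println(parseFirstMatch("25/08/2016", patterns));
        System.out.println(parseFirstMatch("not a date", patterns));
    }
}
```

Because the patterns are tried in order, the most specific or most common format is best listed first.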

2.5.2 Maintenance release Download 2016-05-31

It is now possible to specify a locale when parsing/formatting dates with CurrentDateTagger and DateFormatTagger.
Dependency updates: PDFBox 2.0.0 (final release).

2.5.1 Bug fix release Download 2016-03-22

Text-based transformers extending AbstractCharStreamTransformer now log a warning when character encoding could not be detected, suggesting to make sure the content being transformed is text.
StripBetweenTransformer now accepts multiple strip endpoints with the same "start" regex.

2.5.0 Feature release Download 2016-02-28

DOMTagger and DOMFilter can now be told how to return matching elements values (i.e., text, html, or outerHtml).
New CharsetTagger to convert the character encoding of specified document metadata field into the desired target character encoding.
New CharsetTransformer to convert the character encoding of a document content into the desired target character encoding.
New CharsetUtil class offering simplified charset detection and conversion methods.
The "" file has been moved from classes to the installation root directory.
DOMTagger now returns matching element text as opposed to HTML (can be configured back to HTML).
When used as pre-parse handlers, most handlers dealing with text now accept a charset to use for parsing content, or will detect encoding when no charset is specified. This eliminates many bad character issues.
Metadata document.contentEncoding is now always set when passed to importDocument method.
Dependency updates: Apache Tika 1.12, Norconex Commons Lang 1.9.0, Jempbox 1.8.11 (still required by Tika JPegParser), PDFBox 2.0.0-RC3, Apache Commons CLI 1.3.1.
Importer-specific version of Tika PDFParser was updated to work around PDFBox 2.0 no longer depending on Jempbox.
Importer now logs a WARN instead of a DEBUG message for errors sometimes thrown when importing fails.
Fixed invalid zip bomb detection on PDF with elements nested more than 100 levels deep.
Fixed charset in HTML comments being wrongfully considered when charset is being detected.
Fixed NullPointerException being thrown with some PDFs when extracting multilingual items.

2.4.0 Feature release Download 2015-11-02

The following new handlers enable using scripting languages to define processing logic: ScriptFilter, ScriptTagger, and ScriptTransformer.
New DOMContentFilter to filter out XML/HTML documents containing identified element or element value using a friendly syntax to navigate a DOM-tree structure. #48
New DOMSplitter handler to split XML/HTML documents into multiple documents based on a specified element.
New DOMTagger handler to extract text elements from XML/HTML documents using a friendly syntax to navigate a DOM-tree structure.
CharacterCaseTagger can now be applied to field names (in addition to, or instead of, values).
New CommonRestrictions class to obtain restrictions commonly associated with certain documents.
New methods on AbstractImporterHandler to deal with restrictions: #addRestriction(PropertyMatcher...), #addRestrictions(List), #removeRestriction(String), #getRestrictions(), #removeRestriction(PropertyMatcher), #clearRestrictions()
New file formats supported (brought by Tika update): GCMD DIF, Geographic ISO 19139 files, CBOR.
Dependency updates: Apache Tika 1.10, JSoup 1.8.3, Norconex Commons Lang 1.8.0.
Importer ExternalParser now uses corrected ExternalParser from Tika.
AbstractStringTransformer#transformStringContent(...) now throws an ImporterHandlerException.
Saved and loaded configuration-related classes are now equal. Methods equals/hashCode/toString for those classes are now implemented uniformly and were added where missing.
Fixed some configuration classes not always being saved to XML properly or giving errors.

2.3.1 Maintenance release Download 2015-08-07

Dependency updates: Norconex Commons Lang 1.7.0.

2.3.0 Feature release Download 2015-07-21

New TextPatternTagger for extracting text matching regular expressions out of a document content and storing matches into a field. New unit tests created for it.
Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Javadoc fixes and updates.
Library updates: Norconex Commons Lang 1.6.2.
Fixed NullPointerException in DebugTagger when a field contains a null value.

2.2.0 Feature release Download 2015-06-15

New DocumentLengthTagger for adding the document byte length as a field to imported documents.
New CurrentDateTagger for adding the current date as a field to imported documents.
New NumericMetadataFilter for filtering documents based on whether a numeric field value matches a given numeric range.
New DateMetadataFilter for filtering documents based on whether a date field value matches a given date range.
New ExternalParser class which is used to run an external process for parsing files (e.g. pdftotext) of the associated content type.
By default PDF parsing is now done with this flag set to true: "suppressDuplicateOverlappingText". This should eliminate the extraction of duplicate text in PDF where bolding is done by having multiple instances of the same string on top of each other.
Complete rewrite of AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer to limit the memory taken for loading the content. Now the memory is specified in absolute terms instead of dynamically allocating it based on free memory (an approach that could cause OutOfMemory errors). All subclasses now accept a "maxReadSize" configuration option to set the maximum number of characters to process at once. #9
The abstract methods accepting a "partial" boolean argument on AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer have been changed to now accept a "sectionIndex" integer, representing the document content section being processed. Only larger documents will be processed one section of text at a time (to preserve memory).
AbstractCharStreamTransformer#transformTextDocument(...) now throws an ImporterHandlerException instead of IOException to be consistent with other handlers.
TitleGeneratorTagger was rewritten and no longer uses Carrot2, to reduce library dependencies.
Removed custom Tika mappings for Microsoft Visio now that they have been added to default Tika mappings in Tika 1.8. Reference:
ReplaceTagger: now case insensitive by default. Added a new flag to turn case-sensitivity on/off. #addReplacement(...) methods have been deprecated in favor of addReplacement(Replacement).
Regular expressions in RegexContentFilter, RegexMetadataFilter, ReplaceTagger, TextBetweenTagger, ReplaceTransformer, StripAfterTransformer, StripBeforeTransformer, and StripBetweenTransformer now always have the Pattern.DOTALL flag enabled. When case sensitivity is enabled for regex, Pattern.UNICODE_CASE is now always used.
Library updates: Apache Tika 1.8, Norconex Commons Lang 1.6.1, Apache Commons CLI 1.3, Apache Jempbox 1.8.9, Jempbox 2.0.0. Removed these library "direct" dependencies: Carrot2 (3.9.4), Lucene Analyzers (5.0.0), and Stax2 API (3.1.4).
Javadoc fixes and updates.
New unit tests to cover all filter onMatch use cases.
Fixed filters not working properly when using onMatch="include". Affects all subclasses of AbstractDocumentFilter, which now details the include/exclude logic in its Javadoc (github collector-http#108).
Fixed "Too many open files" exception.
Fixed the "restrictTo" feature not always working for AbstractImporterHandler subclasses. #7
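
The effect of the Pattern.DOTALL and Pattern.UNICODE_CASE flags mentioned above can be seen with plain java.util.regex. The sketch below is not taken from the handlers themselves; the class and the compile method are hypothetical. One assumption is made explicit: in java.util.regex, UNICODE_CASE only has an effect together with CASE_INSENSITIVE, so the sketch applies both when matching is not case sensitive.

```java
import java.util.regex.Pattern;

final class RegexFlagsSketch {

    // Hypothetical helper: always enables DOTALL so "." also matches line
    // terminators. When matching is not case sensitive, UNICODE_CASE is
    // combined with CASE_INSENSITIVE so non-ASCII letters are folded too.
    static Pattern compile(String regex, boolean caseSensitive) {
        int flags = Pattern.DOTALL;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        // "." spans the line break because of DOTALL.
        System.out.println(
                compile("start.*end", true).matcher("start\nend").matches());
        // \u00e9 and \u00c9 are lower and upper case "e with acute accent".
        System.out.println(
                compile("\u00e9", false).matcher("\u00c9").matches());
    }
}
```

Without UNICODE_CASE, case-insensitive matching in Java only folds US-ASCII characters, which is why the second check would otherwise fail.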

2.1.1 Maintenance release Download 2015-04-08

PDFBox now uses latest snapshot (as opposed to a frozen one).
Javadoc fixes.
Library updates: SLF4J 1.7.12.

2.1.0 Feature release Download 2015-03-31

Added OCR support using Tesseract open-source product. Configured by setting an OCRConfig to GenericDocumentParserFactory.
Added document translation support with the new TranslatorSplitter. Supports these translation APIs: Microsoft, Google, Lingo24, and Moses. The document content, chosen fields, or both can be translated.
New TitleGeneratorTagger to dynamically generate titles out of documents, using Carrot2 to extract the best terms.
New EnhancedPDFParser and EnhancedPDF2XHTML classes modifying original Tika PDFParser to add support for PDF XFA (dynamic forms) text extraction as well as adding support for PDFBox 2.0.0 (which fixes the stripping of space characters between words in many PDFs).
New XFDLParser for parsing PureEdge Extensible Forms Description Language files (XFDL). Supports both Gzipped+Base64 and plain text versions.
New WordPerfectParser class for parsing WordPerfect documents according to WordPerfect file specifications.
New QuattroProParser class for parsing QuattroPro documents according to QuattroPro file specifications.
New configuration "parseErrorsSaveDir" on importer configuration for saving files that caused parsing errors along with their exception and metadata if any.
KeepOnlyTagger and DeleteTagger now support regular expressions for identifying fields to keep/delete. The field="" attribute has been replaced by a element.
Added support for JBIG2 and jpeg2000 image formats.
Improved content detection of MS Office and Corel Office documents when importing an input stream with no specified extension.
Improved overall content detection accuracy and performance.
Default allocated memory for caching of document content was increased by a factor of 10 (10MB max per document, 100MB max total).
AbstractTikaParser can now be extended to modify Tika ParseContext.
importer.bat and will now load the from the ./classes folder.
Now always flush output stream from parsers so implementors do not have to be concerned with this.
Easier to extend GenericDocumentParserFactory to provide custom parsers. Dropped "registerNamedParser", "registerFallbackParser", and "getFallbackParser" methods in favor of new "createFallbackParser" and "createNamedParsers" methods.
HTMLParser and PDFParser are now deprecated. HTML and PDF are now handled by the fall-back parser (auto-detected).
IDocumentSplittableEmbeddedParser is now deprecated and has no effect. Will be deleted in a future release.
Minor javadoc improvements and fixes.
No longer adds null handlers (possible when configuration loading failed for a handler).
Improved exception handling for configuration loading.
Library updates: Tika 1.7, Norconex Commons Lang 1.6.0, JUnit 4.12, PDFBox 2.0.0 (SNAPSHOT-2015-03-28), Apache Commons Codec 1.10, Lucene Analyzer Common 5.0.0.
Updated several maven plugins and added SonarQube maven plugin.
Added Sonatype repository to pom.xml for snapshot releases.
Added more unit tests for various content type parsing.
Fixed embedded objects not always having the right content-type.
Fixed invalid mapping between "application/wordperfect" content type and WordPerfectParser.
Fixed AbstractCharStreamTagger subclasses badly detecting character encoding and failing documents as a consequence.

2.0.0 Major release Download 2014-11-25

Importing now returns an ImporterResponse, which may hold the imported document, along with nested documents, and an ImporterStatus.
New IDocumentSplitter handler and related classes, allowing implementations to split documents into more documents.
DefaultDocumentParserFactory can now be configured to treat embedded documents as distinct documents (committed separately). Parsers can now implement IDocumentSplittableEmbeddedParser to indicate they are supporting document splitting.
DefaultDocumentParserFactory can now ignore parsing specified content-types.
New IImporterResponseProcessor to process the import response.
Document encoding can now be explicitly specified when importing and the value gets stored as a metadata field.
New ContentTypeDetector for detecting the content-type from documents.
New ImporterDocument, holding all objects related to a document being imported.
New ImporterMetadata, extending Properties to provide additional import-related convenience methods and constants.
New CsvSplitter class for splitting comma-separated value files into multiple records/documents to be indexed.
New RegexContentFilter for accepting/rejecting documents based on a successful regular expression match on their content.
New CharacterCaseTagger for modifying the character case of a metadata field value.
New DateFormatTagger for parsing/formatting date from specified metadata fields.
New DebugTagger for logging document content and/or metadata to help with implementation and troubleshooting.
New LanguageTagger which analyzes a document content to automatically detect and store as metadata the document language.
New TextStatisticsTagger that stores as metadata statistical information about a document content (word count, average words per sentence, etc.).
New AbstractDocument* class for each type of handler, facilitating handler implementation.
Directory where temporary files are created is now configurable.
Added support for parsing .iso files.
Now licensed under The Apache License, Version 2.0.
Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used. This reduces I/O and improves performance.
Every handler except filters can now be restricted to matching metadata values (configurable).
*.tagger, *.filter, and *.transformer handlers were moved to *.handler.tagger, *.handler.filter, and *.handler.transformer.
com.norconex.importer.ContentType has been replaced with com.norconex.commons.lang.file.ContentType.
For consistency, several references to metadata field names were renamed to use the term "field" (instead of "property" or other terms).
DefaultDocumentParserFactory was renamed to GenericDocumentParserFactory.
Handler "contentTypeRegex" tag was removed from handlers that supported it in favor of the more flexible "restrictTo" tag(s).

1.3.0 Feature release Download 2014-08-18

Now stores the content "family" for each document as "importer.contentFamily". This is a higher level representation of a file's content type.
New SplitTagger: Split values into multiple-values using a separator of choice.
New CopyTagger: copies document metadata fields to other fields.
New HierarchyTagger: splits a field string into multiple segments representing each node of a hierarchical branch.
Improved detection of certain mime types, such as those previously appearing as application/x-tika-*.
ReplaceTagger now supports regular expressions (via a new "regex" flag).
Can now detect these MS Visio mime-types properly: vsdx, vstc, vssx, vsdm, vstm, vssm.
AbstractCharStreamTransformer now enforces streaming as UTF8.
Now requires Java 7 or higher.
ReplaceTagger regular matching now only replaces matching "fromValue".

1.2.0 Feature release Download 2014-03-09

Now extracts text from WordPerfect documents (new WordPerfectParser class).
New transformer "ReduceConsecutivesTransformer" to reduce consecutive instances of the same string to only one instance.
New transformer "ReplaceTransformer" to perform search and replace on document content using regular expression.
New filter "EmptyMetadataFilter" to exclude/include documents with no data for one or more specified metadata properties.
Library updates: Tika 1.5, Norconex Commons Lang 1.3.0.
Now attempts to detect the character encoding from a character stream by looking at a Content-Type metadata. If none is present, defaults to UTF-8.
Fixed NPE in AbstractTextRestrictiveHandler when no content-type is found when used before parsing.

1.1.0 Minor release Download 2013-08-20

New tagger "TextBetweenTagger" to extract strings from a document and store them into document metadata fields.
New AbstractRestrictiveHandler and AbstractTextRestrictiveHandler abstract classes to facilitate re-use of common capabilities in handlers.
New BufferUtil and MemoryUtil classes.
AbstractRestrictiveTransformer now deprecated.
Upgraded norconex-commons-lang to 1.1.0.

1.0.1 Maintenance release Download 2013-08-02

Upgraded Apache Tika from 1.3 to 1.4.
Removed dependency on aspectjrt due to GPL licencing incompatibility. If you need .iso parsing, you can manually download it and add it to the classpath.

1.0.0 Open Source release Download 2013-06-04

Starting with this release, Norconex Importer is open-source under GPL.

Copyright © 2013-2020 Norconex Inc.