Norconex Importer

3.x Release Notes

Release History

Version	Date	Description
3.0.1	2023-07-09
3.0.0	2022-01-02	Major release. NOT a drop-in replacement for 2.x.

3.0.1 Release date 2023-07-09

New	New DOMPreserveTransformer.	#76
Updated	Maven dependency updates: norconex-commons-maven-parent 1.0.2-SNAPSHOT.
Fixed	Fix RegexTagger not picking up XML-configured "fieldMatcher".

3.0.0 Major release. NOT a drop-in replacement for 2.x. Release date 2022-01-02

Updated	Updated transitive dependencies with known vulnerabilities.
Updated	Updated dependencies to avoid logging library detection conflict.
Updated	Maven dependency updates: Apache Tika 1.27 (and its many transitive dependencies), UCAR jj2000 5.4, Opencsv 5.5.2, JAI Image-IO jpeg2000 1.4.0, JBIG2 ImageIO 2.0.
Fixed	Fixed invalid configuration in POM "maven-dependency-plugin".
New	Handlers now support XML "flow", which adds support for if/ifNot/condition/then/else tags in XML configuration.
New	New "condition" classes for XML "flow" configuration: BlankCondition, DateCondition, DOMCondition, NumericCondition, ReferenceCondition, ScriptCondition, and TextCondition.
New	New RejectFilter.
New	New CharsetUtil#firstNonBlankOrUTF8(...) methods.
New	When not already set, an attempt to detect document character encoding is now always made before invoking handlers.
New	New CommonMatchers class.
New	New ImageTransformer class.
New	New NoContentTransformer class.
New	New -f or "outputMetaFormat" command-line argument for saving exported metadata fields in alternate formats.
New	New TextFilter class.
New	New ReferenceFilter class.
New	New ExternalHandler class.
New	New DOMFilter class.
New	New EmptyFilter class.
New	New RegexTagger class.
New	New URLExtractorTagger class.
New	New DOMDeleteTransformer class.
New	New XMLStreamSplitter class.
New	New HandlerDoc to ease handler implementations.
New	Importer now uses an EventManager and triggers several events: IMPORTER_HANDLER_BEGIN, IMPORTER_HANDLER_END, IMPORTER_HANDLER_ERROR, IMPORTER_PARSER_BEGIN, IMPORTER_PARSER_END, IMPORTER_PARSER_ERROR.
New	New ImporterDocument#getStreamFactory() method.
New	ReplaceTagger now has the option to discard values that are unchanged after replacement.
New	New options on CharacterCaseTagger: "wordsFully", "stringFully", "sentences", and "sentencesFully".
New	Most configurable classes adding/setting metadata values now have an extra "onSet" option for dictating how values are set: append, prepend, replace, optional.
New	New DocInfo class.
New	New ImporterRequest class.
New	New option in DOMTagger to delete elements matched by a selector.
New	Added time zone support to DateMetadataFilter.
New	Added support for the WebP image format.
Updated	Now requires Java 8 or higher.
Updated	Importer#importDocument(...) now expects an ImporterRequest or a Doc.
Updated	Default allocated memory for caching of document content was increased by a factor of 10 (100MB max per document, 1GB max total).
Updated	The XML tag names of handlers in XML configuration changed from "filter", "tagger", "transformer", "splitter" to simply "handler".
Updated	JBIG2 image support is now included under the Apache License.
Updated	Logging now uses SLF4J.
Updated	Maven dependency updates: Norconex Commons Lang 2.0.0, Apache Tika 1.22, Apache Commons CLI 1.4, JUnit 5.
Updated	RegexFieldExtractor and RegexUtil have been deprecated in favor of Norconex Commons Lang FieldValueExtractor and Regex.
Updated	RegexContentFilter and RegexMetadataFilter have been deprecated in favor of TextFilter.
Updated	RegexReferenceFilter has been deprecated in favor of ReferenceFilter.
Updated	DOMContentFilter has been deprecated in favor of DOMFilter.
Updated	EmptyMetadataFilter has been deprecated in favor of EmptyFilter.
Updated	TextPatternTagger has been deprecated in favor of RegexTagger.
Updated	TextBetweenTagger now has "inclusive" and "caseSensitive" options configurable for each set of "between" details.
Updated	Now using Path instead of File in many cases.
Updated	Parsing no longer attempted on zero-length content.
Updated	List of PropertyMatcher replaced with PropertyMatchers.
Updated	ContentTypeDetector methods are now static.
Updated	Eliminated Apache Tika log warnings on startup when specific optional libraries not packaged due to licensing (e.g., JPEG 2000, JBIG2) are missing.
Updated	Occurrences of accessors for overwrite="[false|true]" and onConflict="..." have been deprecated in favor of new onSet="...".
Updated	Most places where regular expressions could be used now also support "basic" matching and "wildcard" as well as being able to ignore diacritical marks (e.g., accents).
Updated	Most occurrences of "caseSensitive" or "caseInsensitive" configuration options are now replaced with "ignoreCase".
Updated	Filters implementing AbstractStringFilter will now have their isStringContentMatching(...) method invoked at least once, even if there is no document content.
Updated	"parsed" boolean arguments were replaced by ParseState.PRE and ParseState.POST.
Updated	Many methods taking a combination of reference, input stream, and metadata were updated to accept a Doc instance instead.
Removed	Removed some of the methods deprecated in previous releases.
Removed	Removed SplittableDocument.
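
The "onSet" option introduced in 3.0.0 names four ways of setting a metadata value: append, prepend, replace, and optional. The following is a minimal, self-contained sketch of how those four strategies might behave on a multi-valued field. It does not use the Importer API: the OnSetSketch class, the OnSet enum, and the apply(...) method are hypothetical names made up for illustration, and the assumption that "optional" means "set only when the field has no value yet" should be checked against the library documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OnSetSketch {

    // Hypothetical enum mirroring the four "onSet" values listed in the notes.
    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    // Applies the given strategy to a multi-valued metadata field.
    // Illustration only; not the library's implementation.
    static void apply(Map<String, List<String>> metadata, OnSet onSet,
            String field, String... values) {
        List<String> incoming = Arrays.asList(values);
        List<String> existing =
                metadata.computeIfAbsent(field, k -> new ArrayList<>());
        switch (onSet) {
            case APPEND:
                // New values go after the existing ones.
                existing.addAll(incoming);
                break;
            case PREPEND:
                // New values go before the existing ones.
                existing.addAll(0, incoming);
                break;
            case REPLACE:
                // Existing values are discarded.
                existing.clear();
                existing.addAll(incoming);
                break;
            case OPTIONAL:
                // Assumed meaning: set only if the field has no value yet.
                if (existing.isEmpty()) {
                    existing.addAll(incoming);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        apply(metadata, OnSet.APPEND, "title", "b");
        apply(metadata, OnSet.APPEND, "title", "c");
        apply(metadata, OnSet.PREPEND, "title", "a");
        apply(metadata, OnSet.OPTIONAL, "title", "ignored");
        System.out.println(metadata.get("title"));
        apply(metadata, OnSet.REPLACE, "title", "x");
        System.out.println(metadata.get("title"));
    }
}
```

In this sketch, a single option covers what the deprecated overwrite and onConflict settings expressed separately, which is one plausible reason for consolidating them into "onSet".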