Norconex Importer

Configuration

While Norconex Importer works out-of-the-box with its default settings, you will only unlock its full potential if you take time to configure it properly using Java or XML.

Refer to the following for an XML-based configuration. Entries with a "class" attribute expect an implementation of your choice. The Importer API already offers several concrete implementations. Developers can also create their own by implementing the proper Java interfaces. Refer to the Importer JavaDoc, or see further down for the interfaces you can implement to provide custom functionality. Go to the Extend the Importer section for more details on adding your own implementations.

<importer>
 
    <tempDir></tempDir>
    <maxFileCacheSize></maxFileCacheSize>
    <maxFilePoolCacheSize></maxFilePoolCacheSize>
    <parseErrorsSaveDir></parseErrorsSaveDir>
 
    <preParseHandlers>
        <!-- These tags can be mixed, in the desired order of execution. -->
        <tagger class="..." />
        <transformer class="..." />
        <filter class="..." />
        <splitter class="..." />        
    </preParseHandlers>
 
    <documentParserFactory class="..." />
 
    <postParseHandlers>
        <!-- These tags can be mixed, in the desired order of execution. -->
        <tagger class="..." />
        <transformer class="..." />
        <filter class="..." />
        <splitter class="..." />        
    </postParseHandlers>
 
    <responseProcessors>
        <responseProcessor class="..." />
    </responseProcessors>        
 
</importer>

The table below lists the interfaces you can implement, along with the implementations available out-of-the-box.

In the configuration file, you have to use the fully qualified name, as defined in the Javadoc (you can use variables to shorten package names). Click on a class or interface name to go directly to its full documentation, with extra configuration options.
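As a minimal sketch of shortening package names with a variable, the fragment below assumes the configuration file is processed by Apache Velocity before it is read as XML, so that a `#set` directive can define the variable. The variable name `tagger` is chosen here for illustration only.

```xml
#set($tagger = "com.norconex.importer.handler.tagger.impl")

<importer>
    <postParseHandlers>
        <!-- Assumed to resolve to com.norconex.importer.handler.tagger.impl.ConstantTagger -->
        <tagger class="${tagger}.ConstantTagger">
            <constant name="doctype">News</constant>
        </tagger>
    </postParseHandlers>
</importer>
```

Defining the package prefix once keeps each "class" attribute short, and only one line needs to change if the package name ever does.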

When a default implementation exists for a configuration option taking a class attribute, it is highlighted.

Tag Description Classes Interface
tempDir Path to temporary directory. Defaults to system temp directory. N/A N/A
maxFileCacheSize Maximum size (in bytes) a file's content can take in memory before being written to disk instead. Default is 1MB. N/A N/A
maxFilePoolCacheSize Maximum total size the combined content of all files can take in memory before being written to disk instead. Default is 10MB. N/A N/A
parseErrorsSaveDir Optional directory in which to save files that cause a parsing error. Files are saved along with the metadata extracted so far and the error details (Java stack trace). N/A N/A
documentParserFactory Factory dictating which document parser to use for each content type encountered. GenericDocumentParserFactory IDocumentParserFactory
tagger Taggers add to or modify existing document metadata. CharacterCaseTagger
CharsetTagger
ConstantTagger
CopyTagger
CountMatchesTagger
CurrentDateTagger
DateFormatTagger
DebugTagger
DeleteTagger
DocumentLengthTagger
DOMTagger
ExternalTagger
FieldReportTagger
ForceSingleValueTagger
HierarchyTagger
KeepOnlyTagger
LanguageTagger
MergeTagger
RenameTagger
ReplaceTagger
ScriptTagger
SplitTagger
TextBetweenTagger
TextPatternTagger
TextStatisticsTagger
TitleGeneratorTagger
TruncateTagger
UUIDTagger
IDocumentTagger
transformer Transformers manipulate and convert extracted text, then save the modified text back. CharsetTransformer
ExternalTransformer
ReduceConsecutivesTransformer
ReplaceTransformer
ScriptTransformer
StripAfterTransformer
StripBeforeTransformer
StripBetweenTransformer
SubstringTransformer
IDocumentTransformer
filter Filters out certain incoming documents. DateMetadataFilter
DOMContentFilter
EmptyMetadataFilter
NumericMetadataFilter
RegexContentFilter
RegexMetadataFilter
RegexReferenceFilter
ScriptFilter
IDocumentFilter
splitter Splits a document into multiple ones. CsvSplitter
DOMSplitter
PDFPageSplitter
TranslatorSplitter
IDocumentSplitter
responseProcessor Allows for custom processing of the importer response before it is returned. None IImporterResponseProcessor
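To illustrate the four top-level settings described in the table, here is a sketch that sets each one explicitly. The directory paths are hypothetical. The two sizes restate the documented defaults (1MB and 10MB) and are written as plain byte counts on the assumption that these tags accept a number of bytes.

```xml
<importer>

    <!-- Hypothetical path; when omitted, the system temp directory is used. -->
    <tempDir>/tmp/importer</tempDir>

    <!-- 1MB per file, assumed to be expressed in bytes. -->
    <maxFileCacheSize>1048576</maxFileCacheSize>

    <!-- 10MB for all files combined, assumed to be expressed in bytes. -->
    <maxFilePoolCacheSize>10485760</maxFilePoolCacheSize>

    <!-- Hypothetical path; optional. -->
    <parseErrorsSaveDir>/tmp/importer/parse-errors</parseErrorsSaveDir>

</importer>
```

Raising the two cache sizes trades memory for fewer disk writes when processing large files.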

Example

Suppose you are building a service that offers content extracted from documents of various kinds. You have a special batch that you want your system to treat as "News" documents, so you want to add a metadata value to each of these documents to mark them as such. You also noticed that some of these documents are HTML files with two "title" meta tags, and you want to keep only the first one encountered to avoid possible issues. The following will accomplish this for you:

<importer>
 
    <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
            <constant name="doctype">News</constant>
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
            <singleValue field="title" action="keepFirst"/>
        </tagger>
    </postParseHandlers>
 
</importer>
 

More Options

There is a lot more you can do to structure your configuration files the way you like. Refer to this additional documentation for more configuration options such as creating reusable configuration fragments and using variables to make your file easier to maintain and more portable across different environments.
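As a rough sketch of what a reusable fragment and an environment-specific variable could look like, the example below assumes the configuration file is processed by Apache Velocity, so that the `#parse` directive can pull in another file. The file name `shared-taggers.xml` and the variable `workdir` are hypothetical. Check the additional documentation for how variables are actually supplied.

```xml
<importer>

    <!-- ${workdir} is a hypothetical variable, assumed to be defined outside this file. -->
    <tempDir>${workdir}/temp</tempDir>

    <postParseHandlers>
        <!-- Hypothetical fragment file containing <tagger> entries shared by several configurations. -->
        #parse("shared-taggers.xml")
    </postParseHandlers>

</importer>
```

Keeping environment-specific values such as paths in variables lets the same file be used unchanged across development and production setups.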