Norconex is proud to release version 2.4.0 of its Norconex Importer open-source product. In addition to the usual bug fixes and stability enhancements, this release provides more possibilities for parsing and enriching your documents. Most significantly, Importer 2.4.0 allows for scripting and DOM navigation. Keep reading for more details and usage samples.
Scripting
[ezcol_1half]
While it has always been possible to extend the Importer in Java to implement your own document processing logic, you can now add that logic directly in the Importer configuration using a scripting language. The following new handlers enable the use of scripting languages to manipulate documents: ScriptFilter, ScriptTagger, and ScriptTransformer.
These classes use the “JavaScript” script engine that is already present as part of your Java installation. The JavaScript engine used by the Oracle implementation of Java is based on Mozilla Rhino. You can find extensive JavaScript documentation on the Mozilla Rhino site.
Java developers can extend the Importer to add support for additional scripting languages. The new classes rely on the JSR 223 API (javax.script), which allows you to “plug” in any compatible script engine to support your favorite scripting language.
[/ezcol_1half]
[ezcol_1half_end]
<!-- Reject documents that are not about "apple". -->
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
  <script><![CDATA[
    isAppleDoc = metadata.getString('fruit') == 'apple'
        || content.indexOf('Apple') > -1;
    /*return*/ isAppleDoc;
  ]]></script>
</filter>

<!-- Add a "fruit" metadata field with the value "apple". -->
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
  <script><![CDATA[
    metadata.addString('fruit', 'apple');
  ]]></script>
</tagger>

<!-- Replace all occurrences of "Alice" with "Roger". -->
<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
  <script><![CDATA[
    modifiedContent = content.replace(/Alice/g, 'Roger');
    /*return*/ modifiedContent;
  ]]></script>
</transformer>
[/ezcol_1half_end]
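If you want to try the script logic before putting it in your configuration, the sketch below runs the same expressions as the samples above in plain JavaScript, outside the Importer. The makeMetadata helper is a hypothetical stand-in that mimics only the getString and addString calls used above; it is not the Importer's metadata class, and returning the first stored value from getString is an assumption made for illustration.

```javascript
// Hypothetical stand-in for the Importer's metadata object. Only the two
// methods used in the samples (getString, addString) are mimicked, and
// getString returning the first stored value is an assumption.
function makeMetadata() {
  const fields = {};
  return {
    addString(name, value) {
      (fields[name] = fields[name] || []).push(value);
    },
    getString(name) {
      return fields[name] ? fields[name][0] : null;
    },
  };
}

// Same expression as the ScriptFilter sample: true means "about apple".
function isAppleDoc(metadata, content) {
  return metadata.getString('fruit') == 'apple' || content.indexOf('Apple') > -1;
}

// Same replacement as the ScriptTransformer sample.
function renameAlice(content) {
  return content.replace(/Alice/g, 'Roger');
}

const metadata = makeMetadata();
// No "fruit" field yet and no "Apple" in the content.
console.log(isAppleDoc(metadata, 'Pears and plums'));
// Do what the ScriptTagger sample does, then check again.
metadata.addString('fruit', 'apple');
console.log(isAppleDoc(metadata, 'Pears and plums'));
console.log(renameAlice('Alice met Alice.'));
```

Run with Node.js, this prints false, then true, then "Roger met Roger.". In the Importer itself, the metadata and content variables are supplied to your script, so only the expressions inside the functions belong in the configuration.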
DOM navigation
[ezcol_1half]
It is now possible to reference elements of an HTML or XML document using a friendly CSS or jQuery-like selector syntax to navigate its Document Object Model (DOM). The jsoup parser is used to load document content into a DOM tree.
The new DOMContentFilter can be used to reject documents containing a specific HTML/XML path or element. The DOMSplitter can be used to break HTML/XML with “list” elements into different documents. Finally, the DOMTagger allows you to extract specific HTML/XML tag values or attributes and store them in your own fields (e.g., extract <h1> tags into a “title” field).
[/ezcol_1half]
[ezcol_1half_end]
<!-- Exclude documents containing GIF images. -->
<filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
    selector="img[src$=.gif]" onMatch="exclude" />

<!-- Store H1 tags in a title field. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="h1" toField="title" overwrite="false" />
</tagger>

<!-- Create a new contact document for each occurrence of the "contact" tag. -->
<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
    selector="contact" />
[/ezcol_1half_end]
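For readers less familiar with CSS selectors, the sketch below spells out in plain JavaScript what two of the selectors above mean: img[src$=.gif] matches img elements whose src attribute ends with ".gif", and h1 matches every h1 element. The element records and function names are hypothetical and simplified for illustration. The Importer works on a real DOM tree built by jsoup, and details such as case handling are ignored here.

```javascript
// Hypothetical, simplified element records; the Importer itself works on a
// DOM tree built by jsoup, not on plain objects like these.
const elements = [
  { tag: 'h1', attrs: {}, text: 'Fruit catalog' },
  { tag: 'img', attrs: { src: 'logo.png' }, text: '' },
  { tag: 'img', attrs: { src: 'banner.gif' }, text: '' },
];

// Meaning of the selector img[src$=.gif]:
// the tag name is "img" AND the src attribute ends with ".gif".
function matchesGifImage(el) {
  return el.tag === 'img'
    && typeof el.attrs.src === 'string'
    && el.attrs.src.endsWith('.gif');
}

// Meaning of the selector h1: any element whose tag name is "h1".
function matchesH1(el) {
  return el.tag === 'h1';
}

// A filter with onMatch="exclude" would reject a document where this is true.
const hasGif = elements.some(matchesGifImage);
// A tagger would store these values in the target field.
const titles = elements.filter(matchesH1).map((el) => el.text);

console.log(hasGif);
console.log(titles);
```

With the sample records, hasGif is true because of banner.gif, and titles contains the single value "Fruit catalog".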
Other features
[ezcol_1half]
This release features several other helpful and interesting changes and additions. For instance, CharacterCaseTagger can now be used to adjust the character case of field names (in addition to values). A few additional file formats are also supported. For a complete list of changes, see the release notes.
[/ezcol_1half]
[ezcol_1half_end]
<!-- Make every instance of "title" field name lowercase. -->
<tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
  <characterCase fieldName="title" type="lower" applyTo="field" />
</tagger>
[/ezcol_1half_end]
Useful links
- Download Norconex Importer 2.4.0.
- Find out how to get started.
- Report your issues and questions on GitHub.
- Use the Importer as part of one of Norconex Collectors (open-source crawlers).