Norconex HTTP and Filesystem Collector 2.7.0 released

Posted on April 26, 2017 by in Latest Releases

Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector.  This update, along with related component updates, introduces several interesting features.

HTTP Collector changes

The following items are specific to the HTTP Collector.  For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.

Crawling of JavaScript-driven pages

The alternative document fetcher PhantomJSDocumentFetcher now makes it possible to crawl web pages with JavaScript-generated content. This much awaited feature is now available thanks to integration with the open-source PhantomJS headless browser.   As a bonus, you can also take screenshots of web pages you crawl.

More ways to extract links

This release introduces two new link extractors.  You can now use the XMLFeedLinkExtractor to extract links from RSS or Atom feeds. For maximum flexibility, the RegexLinkExtractor can be used to extract links using regular expressions.

Generic changes

The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).

Much improved XML configuration validation

You no longer have to hunt for a misconfiguration.  Schema-based XML configuration validation was added and you will now get errors if you have a bad XML syntax for any configuration options.   This validation can be trigged on command prompt with this new flag: -k or --checkcfg.

Enter durations in human-readable format

Having to convert a duration in milliseconds is not the most friendly. Anywhere in your XML configuration where a duration is expected, you can now use a human-readable representation (English only) as an alternative.

Lua scripting language

Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer.  This gives you one more scripting option available out-of-the-box besides JavaScript/ECMAScript.

Modify documents using an external application

With the new ExternalTransformer, you can now use an external application to perform document transformation.  This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.

Combine document fields

The new MergeTagger can be used for combining multiple fields into one. The target field can be either multi-value or single-value separated with the character of your choice.

New Committers

Whether you do not have a target repository (Solr, Elasticsearch, etc) ready at the time of crawling, or whether you are not using a repository at all, Norconex Collectors now ships with two file-based Committers for easy consumption by your own process: XMLFileCommitter and JSONFileCommitter. All available committers can be found here.


Several additional features or changes can be found in the latest Collector releases.  Among them:

  • New Importer RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
  • New SubstringTransformer for truncating content.
  • New UUIDTagger for giving a unique id to each documents.
  • CharacterCaseTagger now supports “swap” and “string” to swap character case and capitalize beginning of a string, respectively.
  • ConstantTagger offers options when dealing with existing values: add to existing values, replace them, or do nothing.
  • Components such as Importer, Committers, etc., are all easier to install thanks to new utility scripts.
  • Document Access-Control-List (ACL) information is now extracted from SMB/CIFS file systems (Filesytem Collector).
  • New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops.
  • Added “removeTrailingHash” as a new GenericURLNormalizer option (HTTP Collector).
  • New “detectContentType” and “detectCharset” options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detect them instead (Filesytem Collector).
  • Start URLs and start paths can now be dynamically created thanks to IStartURLsProvider and IStartPathsProvider (HTTP Collector and Filesystem Collector).

To get the complete list of changes, refer to the HTTP Collector release notes, Filesystem Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.


Pascal Essiembre has been a successful Enterprise Application Developer for several years before founding Norconex in 2007 and remaining its president to this day. Pascal has been responsible for several successful Norconex enterprise search projects across North America. Pascal is also heading the Product Division of Norconex and leading Norconex Open-Source initiatives.