Norconex just released major upgrades to all its Norconex Collectors and related projects.  That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available committers (Solr, Elasticsearch, HP IDOL, etc.), were upgraded to version 2.0.0.

With these major product upgrades comes a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website.  At a quick glance you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

  • Can now split a document into multiple documents.
  • Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).
  • Language detection (50+ languages).
  • Parsing and formatting of dates from/to any format.
  • Character case modifiers.
  • Can now index basic content statistics with each document (word count, average word length, average words per sentence, etc.).
  • Can now supply a “seed file” for listing start URLs or start paths to your crawler.
  • Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used.  This reduces I/O and improves performance.
  • New event model where listeners can listen for any type of crawler events.
  • Can now ignore parsing of specific content types.
  • Can filter documents based on arbitrary regular expressions performed on the document content.
  • Enhanced debugging options, where you can print out the content of specific fields as documents are being processed.
  • HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).
  • More…
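
To illustrate the in-memory content handling mentioned in the list above, here is a minimal Python sketch of the general "keep it in memory up to a limit, then fall back to the filesystem" technique.  It is not Norconex code and does not reflect the actual implementation or configuration; the `SpillBuffer` class, its `max_memory` parameter and the temporary file prefix are hypothetical names chosen for this example only.

```python
import io
import os
import tempfile


class SpillBuffer:
    """Hypothetical buffer: bytes stay in memory until max_memory would be
    exceeded, then everything moves to a temporary file on disk."""

    def __init__(self, max_memory=1024 * 1024):
        self.max_memory = max_memory   # assumed threshold, in bytes
        self._buf = io.BytesIO()       # in-memory storage used at first
        self._path = None              # set once content has spilled to disk

    @property
    def on_disk(self):
        return self._path is not None

    def write(self, data):
        # Spill before the write that would push us past the limit.
        if not self.on_disk and self._buf.tell() + len(data) > self.max_memory:
            self._spill()
        self._buf.write(data)

    def _spill(self):
        # Copy what was held in memory into a temp file and keep writing there.
        fd, self._path = tempfile.mkstemp(prefix="doc-content-")
        disk = os.fdopen(fd, "w+b")
        disk.write(self._buf.getvalue())
        self._buf = disk

    def read_all(self):
        # Read everything back, then return to the end so writes can continue.
        self._buf.seek(0)
        data = self._buf.read()
        self._buf.seek(0, os.SEEK_END)
        return data

    def close(self):
        # Release the buffer and delete the temp file if one was created.
        self._buf.close()
        if self._path is not None:
            os.remove(self._path)
            self._path = None
```

Small documents never touch the disk in this sketch, which is where the I/O savings would come from; only content larger than the threshold pays the cost of a temporary file.  Python's standard library offers comparable behavior through `tempfile.SpooledTemporaryFile` and its `max_size` argument.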

Another significant change is that all Norconex open-source projects are now licensed under the Apache License 2.0.  We hope this will facilitate adoption in third-party commercial offerings.

It is important to note that the 2.0.0 versions are not compatible with their previous 1.x versions.  The configuration options changed in many areas, so do not expect to run your existing configuration under 2.0.0.  Please refer to the latest documentation for new and modified configuration options.

Visit the new Norconex Collectors website now.

HP Autonomy users, take control of your web crawling. Norconex recently released an HP Autonomy IDOL Committer module for its open-source web crawler, Norconex HTTP Collector. You can now enjoy the features of the Norconex crawler and experience the freedom of open source when crawling your sites for indexing into IDOL.