Norconex just released version 1.2 of Norconex HTTP Collector, its open-source web crawler. Along with it comes a complete redesign of the product web site and a new logo: a lovely web-crawling spider wearing a Norconex hat.
Some changes in this feature release:
- New optional MongoDB-based URL database implementation.
- New optional TikaURLExtractor class providing an alternate URL extraction mechanism based on Apache Tika HTMLParser.
- New SegmentCountURLFilter class for filtering URLs that have a specified number of segments (it can also check for duplicate segments).
- Configuration samples now point to Norconex test pages to ensure their stability.
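To give a rough idea of what segment-based filtering means, here is a minimal sketch in plain Java. It is not the SegmentCountURLFilter class itself and does not use the Norconex API: the class name SegmentCountSketch, its method names, and the example.com URLs are all made up for illustration, and it assumes that a "segment" is a non-empty piece of the URL path between slashes.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; NOT the Norconex SegmentCountURLFilter.
class SegmentCountSketch {

    // Returns the non-empty path segments of a URL, in order.
    static List<String> segments(String url) {
        String path = URI.create(url).getPath();
        List<String> result = new ArrayList<>();
        if (path == null) {
            return result; // opaque URIs have no path
        }
        for (String s : path.split("/")) {
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    // True when the URL path has at least `threshold` segments.
    static boolean hasAtLeast(String url, int threshold) {
        return segments(url).size() >= threshold;
    }

    // True when any single segment value occurs at least `threshold` times,
    // which is a common sign of a crawler trap.
    static boolean hasRepeatedSegment(String url, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : segments(url)) {
            int n = counts.merge(s, 1, Integer::sum);
            if (n >= threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String deep = "http://example.com/a/b/c/page.html";
        String loop = "http://example.com/a/b/a/b/a/b/page.html";
        System.out.println(hasAtLeast(deep, 5));         // 4 segments
        System.out.println(hasRepeatedSegment(loop, 3)); // "a" appears 3 times
    }
}
```

A crawler could use checks like these to reject URLs that are nested too deeply or that keep repeating the same path segment. For the actual class and its configuration options, refer to the product documentation.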
To view a complete list of changes, read the Release Notes.
This release also takes advantage of the new 1.1.0 release of Norconex Committer, which simplifies making your own committer implementations.
As always, we welcome your feedback.
Download Norconex HTTP Collector 1.2 now!