Norconex is proud to announce the next major release of its popular open-source web crawler (also referred to as “Norconex HTTP Collector”). After a couple of years of development, you will find this new version was well worth the wait.
Not only does it introduce many new features, but it is also more flexible and better documented. Many of these improvements come from community feedback, so long-term users deserve a pat on the back. This release is also yours.
If you are too eager to get started, you can download it now and follow its website documentation. Otherwise, keep reading for a glance at the new features.
What’s New?
The new features are too numerous to list here, but we’ll highlight some of the most significant.
Crawling of JavaScript-Driven Websites
Thanks to browser automation provided by Selenium WebDrivers, you can now use your favorite browser to crawl web pages that rely on JavaScript to fully render. Generally speaking, if your browser can render content, the crawler can fetch it. You can also take screenshots of the pages you crawl.
Multiple Committers
Committers are used to store crawled information in a target location or repository of your choice. This version allows you to specify any number of committers to have your data sent to multiple targets at once (database, search engine, filesystem, etc.). It is also possible to perform simple routing.
Easier to Deploy
Variables in configuration files can now be resolved against system properties and environment variables. Logging has been abstracted using SLF4J and now prints to STDOUT by default. These changes facilitate deployment in containerized environments (e.g., Docker).
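To picture what resolving a variable against system properties and environment variables means, here is a minimal sketch in plain Java. It is not the crawler's actual implementation: the `VariableResolver` class, the `${name}` placeholder syntax, the sample property name, and the lookup order (explicit values, then system properties, then environment variables) are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of the crawler).
public class VariableResolver {

    // Assumed placeholder syntax: ${some.name}
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z0-9_.]+)\\}");

    /** Replaces each placeholder with the first value found for its name. */
    public static String resolve(String text, Map<String, String> explicit) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = lookup(m.group(1), explicit);
            // Leave the placeholder untouched when nothing defines it.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Assumed precedence: explicit values, system properties, environment.
    private static String lookup(String name, Map<String, String> explicit) {
        if (explicit.containsKey(name)) {
            return explicit.get(name);
        }
        String property = System.getProperty(name);
        if (property != null) {
            return property;
        }
        return System.getenv(name);
    }

    public static void main(String[] args) {
        // Equivalent to starting the JVM with -Dcrawler.workDir=/data/crawl
        System.setProperty("crawler.workDir", "/data/crawl");
        String line = "workDir=${crawler.workDir}";
        System.out.println(resolve(line, Collections.emptyMap()));
    }
}
```

In a container, the same value could come from an environment variable set on the image or at run time, which is why this kind of resolution makes a single configuration file easier to reuse across environments.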
Lots of Events
Event management has been redesigned and simplified. There are now more than 60 different event types being triggered for programmers to listen to and act upon. They range from new Committer and Importer events to the expected Web Crawler events.
XML Configuration Improvements
Similar XML configuration options are now specified in a consistent way. In addition, it is now possible to provide partial class names (e.g., class="ExtensionReferenceFilter" instead of class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"). The Importer module also allows you to use XML “flow” to facilitate configuration logic. That is, you can now make use of special XML tags: <if>, <ifNot>, <condition>, <conditions>, <else>, and <then>.
Richer Documentation
Documentation has been improved as well:
- A new Online Manual is now available, giving great insight into installation and XML configuration.
- Dynamic XML documentation now combines options from all modules making up the web crawler in a single location.
- The JavaDoc now has formatted XML documentation and XML usage, which is easy to copy and paste into your own configuration.
Config Starter
A very simple yet useful configuration generator is now available online. It will help you create your first configuration file. You provide your “start” URL, answer a few questions and your configuration file will be generated for you.
More?
Some additional features:
- Can send deletion requests to Committers upon encountering specific events.
- Can prevent duplicate documents from being sent to Committers during the same crawling session.
- Now supports these HTTP standards:
- Can now extract links after document importing/parsing as well as from metadata.
- The Crawler can be configured to stop itself after encountering specific events.
- New command-line options for cleaning previous crawls (starting fresh) and for exporting/importing the crawler's internal data store.
- Can now transform crawled images.
- Additional content and metadata manipulation options.
- Committers can now retry failing batches, reducing the batch size between each attempt.
- New out-of-the-box CSV Committer.
We recommend you have a look at the release notes for more.
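As an illustration of the batch retry item in the list above, here is a minimal sketch of what "retrying a failing batch with a smaller batch size" can look like. It is not the Committer code itself: the `ShrinkingBatchRetry` class, the `send` method, the halving strategy, and the `maxRetries` parameter are hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, for illustration only (not the Committer API).
public class ShrinkingBatchRetry {

    /**
     * Sends all items through the sender, halving the chunk size after each
     * failure (an assumed strategy). Returns the number of failed attempts.
     */
    public static <T> int send(
            List<T> batch, Consumer<List<T>> sender, int maxRetries) {
        int size = batch.size();
        int from = 0;
        int failures = 0;
        while (from < batch.size()) {
            int to = Math.min(from + size, batch.size());
            try {
                sender.accept(batch.subList(from, to));
                from = to; // this chunk went through, move on to the next
            } catch (RuntimeException e) {
                failures++;
                if (failures > maxRetries) {
                    throw e; // give up once the retry budget is spent
                }
                size = Math.max(1, size / 2); // retry with a smaller chunk
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        List<Integer> received = new ArrayList<>();
        // Simulated target that rejects any chunk larger than 2 items.
        int failures = send(docs, chunk -> {
            if (chunk.size() > 2) {
                throw new IllegalStateException("chunk too large");
            }
            received.addAll(chunk);
        }, 5);
        System.out.println(failures + " failed attempts, received " + received);
    }
}
```

The sketch keeps track of how far it got, so chunks that were already accepted are not sent again after a later failure.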
What next?
If you are coming from Norconex HTTP Collector version 2, we recommend you have a look at the version 3 migration notes.
As always, community support is still available on GitHub. While on GitHub, take a moment to “Star” the project.
Come back once in a while, as we’ll publish more in-depth articles on specific features or use cases you did not even think were possible to address with our web crawler.
Finally, we always love to know who is using the Norconex Web Crawler. Let us know and you may get listed on our wall of fame.
Enjoy!