Norconex is proud to announce the next major release of its popular open-source web crawler (also referred to as “Norconex HTTP Collector”).  After a couple of years of development, you will find this new version was well worth the wait.

Not only does it introduce many new features, but it is also more flexible with even more documentation.  Many of these improvements come from community feedback so long-term users deserve a pat on the back. This release is also yours.

If you are too eager to get started, you can download it now and follow its website documentation. Otherwise, keep reading for a glance at the new features.

What’s New?

The new features are too many to list here, but we’ll highlight some of the most significant.

Crawling of JavaScript-Driven Websites

Thanks to browser automation provided by Selenium WebDrivers, you can now use your favorite browser to crawl web pages relying on JavaScript to fully render.  Generally speaking, if your browser can render content, the crawler can fetch it.  It provides you with the ability to take screenshots of pages you crawl as well.

Multiple Committers

Committers are used to store crawled information into a target location, or repository of your choice.  This version allows you to specify any number of committers to have your data sent to multiple targets at once (database, search engine, filesystem, etc.).  It is also possible to perform simple routing between them.
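
As a rough sketch of what a two-committer setup could look like in a version 3 configuration file: the <committers> wrapper, the short class names, and the <directory> values below are assumptions for illustration, not taken from this announcement, and should be checked against the Committer documentation.

```xml
<!-- Illustrative only: element names, class names, and paths are assumptions. -->
<committers>
  <!-- First target: crawled documents written out as XML files. -->
  <committer class="XMLFileCommitter">
    <directory>/tmp/crawl-output/xml</directory>
  </committer>
  <!-- Second target: the same documents written out as JSON files. -->
  <committer class="JSONFileCommitter">
    <directory>/tmp/crawl-output/json</directory>
  </committer>
</committers>
```

Each crawled document would be sent to every committer listed, so adding a target amounts to adding another <committer> entry.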

Easier to deploy

Variables in configuration files can now be resolved against system properties and environment variables. Logging has been abstracted using SLF4J and now prints to STDOUT by default. These changes facilitate deployment in containerized environments (e.g., Docker).

Lots of Events

The event management has been redesigned and simplified. There are now more than 60 different event types being triggered for programmers to listen to and act upon.  They range from new Committer and Importer events to the expected Web Crawler events.

XML Configuration improvements

Similar XML configuration options are now specified in a consistent way. In addition, it is now possible to provide partial class names (e.g., class="ExtensionReferenceFilter" instead of class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"). The Importer module also lets you use an XML “flow” to express configuration logic with these special XML tags: <if>, <ifNot>, <condition>, <conditions>, <else>, and <then>.
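
To make the two ideas concrete, here is a minimal sketch. The class attribute values for the filter come from the example above; the <filter> element name, the condition class, and the placement of handlers inside <then> and <else> are assumptions for illustration only, so consult the XML documentation for the exact structure.

```xml
<!-- Fully qualified class name (still accepted). -->
<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"/>

<!-- Equivalent partial class name. -->
<filter class="ExtensionReferenceFilter"/>

<!-- XML "flow" skeleton using the tags listed above.
     "SomeCondition" is a hypothetical placeholder, not a real class. -->
<if>
  <condition class="SomeCondition"/>
  <then>
    <!-- handlers to run when the condition matches -->
  </then>
  <else>
    <!-- handlers to run otherwise -->
  </else>
</if>
```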

Richer documentation

Documentation has been improved as well:

  • A new Online Manual is now available, giving great insight into installation and XML configuration.
  • Dynamic XML documentation combining options from all modules making up the web crawler into a single location.

The JavaDoc now has formatted XML documentation and XML usage, which is easy to copy and paste into your own configuration.

Config Starter

A very simple yet useful configuration generator is now available online. It will help you create your first configuration file. You provide your “start” URL, answer a few questions and your configuration file will be generated for you.

More?

Some additional features:

  • Can send deletion requests to Committers upon encountering specific events.
  • Can prevent duplicate documents from being sent to Committers during the same crawling session.
  • Now supports these HTTP standards:
  • Can now extract links after document importing/parsing as well as from metadata.
  • The Crawler can be configured to stop itself after encountering specific events.
  • New command-line options for cleaning previous crawls (starting fresh) and to export/import the crawler internal data store.
  • Can now transform crawled images.
  • Additional content and metadata manipulation options.
  • Committers can now retry failing batches, reducing the batch size between each attempt.
  • New out-of-the-box CSV Committer.

We recommend you have a look at the release notes for more. 

What next?

If you are coming from Norconex HTTP Collector version 2, we recommend you have a look at the version 3 migration notes.

As always, community support is still available on GitHub. While on GitHub, take a moment to “Star” the project.

Come back once in a while, as we’ll publish more in-depth articles on specific features or use cases you did not even think were possible to address with our web crawler.

Finally, we always love to know who is using the Norconex Web Crawler.  Let us know and you may get listed on our wall of fame.

Enjoy!

Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.

CMIS support

Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content management systems (CMS) content. Extra information can be extracted, such as document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.

Additional ACL support

ACL from your CMS is not the only new type of ACL you can extract.  This new Norconex Filesystem Collector release introduces support for obtaining local filesystem ACL.  These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (since 2.7.0).

Field discovery

You can’t always tell upfront what metadata your crawler will find.  One way to discover your fields is to send them all to your Committer.  This approach is not always possible nor desirable.  You can now store to a local file all fields found by the crawler. Each field will be saved once, with sample values to give you a better idea of their nature.

New URL normalization rules

The HTTP Collector adds a few new rules to GenericURLNormalizer. Those are:

  • removeQueryString
  • lowerCase
  • lowerCasePath
  • lowerCaseQuery
  • lowerCaseQueryParameterNames
  • lowerCaseQueryParameterValues
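
As a hedged illustration, two of the new rules might be enabled like this in a 2.x configuration. The rule names come from the list above; the <urlNormalizer> and <normalizations> element names and the fully qualified class name are assumptions based on the 2.x configuration style, so verify them against the GenericURLNormalizer documentation.

```xml
<!-- Sketch only: lower-cases the URL path and drops the query string. -->
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    lowerCasePath, removeQueryString
  </normalizations>
</urlNormalizer>
```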

Subdomains being part of a domain

When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider sub-domains as being the same site (includeSubdomains="true").
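
A minimal sketch of that setting follows. The two attributes are the ones named above; placing them on a <startURLs> element and the example.com URL are assumptions for illustration.

```xml
<!-- Sketch only: stay on the start URL's domain, treating sub-domains
     of that domain as part of the same site. -->
<startURLs stayOnDomain="true" includeSubdomains="true">
  <url>https://www.example.com/</url>
</startURLs>
```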

Other changes

For a complete list of all additions and changes, refer to the following release notes:

Download

 

Kafka users rejoice! You can now use Norconex open-source crawlers with Apache Kafka, thanks to the Norconex Apache Kafka Committer.

We owe this contribution to Joseph Paulo Mantuano (Senior Developer at The Red Flag Group) and Dan Davis.

The Norconex Collectors community keeps growing. We are thrilled to see the number of integrations grow with it as well.  If you know of any Norconex Committer implementations out there, let us know and we’ll add them to the list!

Not yet familiar with Norconex crawlers?  Head over to Norconex HTTP Collector or Norconex Filesystem Collector websites to learn more.

Great news! There is now a Google Cloud Search Committer for Norconex Crawlers!

This addition to the Norconex Collector family should delight Google Cloud Search fans.  They too can now enjoy the full-featured crawling capabilities offered by Norconex open-source crawlers.

Since this Committer is developed and maintained by Google, you will find installation and configuration documentation on the Google Developers website.

New to Norconex crawlers? Head over to the Norconex Collectors website to start crawling.

Happy crawling!

Norconex crawlers and Neo4j graph database are now a love match! Neo4j is arguably the most popular graph database out there. Use Norconex crawlers to harvest relationships from websites and filesystems and feed them to your favorite graph engine.

This was made possible thanks to none other than French contributor Sylvain Roussy, a Neo4j reference and author of two Neo4j books. Norconex is proud to have been able to partner with Sylvain to develop a Neo4j Committer for use with its Norconex HTTP and Filesystem Collectors.

To our French-speaking European friends, Sylvain will host a series of Neo4j Meetups at different locations. He will explain how Norconex crawlers can be used to gather graph data from the web to use in Neo4j. The first of the series is taking place on January 24th, in Genève:

Useful Links:

 

Norconex is proud to announce the release of Norconex HTTP Collector version 2.8.0.  This release is accompanied by new releases of many related Norconex open-source products (Filesystem Collector, Importer, Committers, etc.), and together they bring dozens of new features and enhancements highlighted below.

 

Extract a “Featured Image” from web pages

In addition to taking screenshots of webpages, you can now extract the main image of a web page thanks to the new FeaturedImageProcessor. You can specify conditions to identify the image (the first one encountered matching a minimum size or a given pattern). You also have the option to store the image on file or as a BASE64 string with the crawled document (after scaling it to your preferred dimensions), or simply store a reference to it.

Limit link extraction to specific page portions

The GenericLinkExtractor now makes it possible to extract links to follow only from one or more specific sections of a web page. For instance, you may want to extract only the links found in navigation menus and not those found in content areas, in case the latter usually point to other sites you do not want to crawl.

Truncate long field values

The new TruncateTagger offers the ability to truncate long values and the option to replace the truncated portion with a hash to help preserve uniqueness when required. This is especially useful in preventing errors with search engines (or other repositories) that have field length limitations.

Add metadata to a document using an external application

The new ExternalTagger allows you to point to an external (i.e., command-line) application to “decorate” a document with extra metadata information. Both the existing document content and metadata can be supplied to the external application. The application output can be in a specific format (JSON, XML, properties) or free-form, combined with metadata extraction patterns you can configure. Either standard streams or files can be supplied as arguments to the external application. To transform the content using an external application instead, have a look at the ExternalTransformer, which has also been updated to support metadata.

Other improvements

This release includes many more new features and enhancements:

  • To create a document checksum, you can now combine metadata with content.
  • The TextPatternTagger can now extract field names dynamically in addition to values.
  • The ReplaceTagger and ReplaceTransformer now support empty/null replacement values.
  • There are new configuration options on the GenericHttpClientFactory:
    • “authFormParams” to add arbitrary parameters to authentication forms.
    • “authPreemptive” to use preemptive authentication with BASIC authentication.
  • The Amazon CloudSearch and Elasticsearch Committers both have a new “fixBadIds” flag to safely handle URLs that do not meet product limitations.

For the complete list of changes, refer to these product release notes:

Useful links

10 Year Anniversary

 

Letter from the President,

 

Do you know that Norconex turned 10 this year? That’s right, Norconex was founded in 2007 and I could not be prouder to be president of Norconex as we cross this important milestone.

Our company’s numerous achievements would not have been possible without our amazing employees. They are smart, committed, loyal, and all have client satisfaction at heart. Having such a great team is precious beyond words.

I am also taking this occasion to thank every one of you, customers and partners, for having played a vital role in Norconex success. We can’t thank you enough for choosing our services and products, making us the success that we are.

We plan to keep growing our relationship in the years to come and continue to offer you the best.

 

We are looking forward to the next 10 years!

 

 

Sincerely,

Pascal Essiembre

President


WHAT THE FIRST 10 YEARS AT NORCONEX LOOKED LIKE

In today’s business landscape, company lifecycles are shorter and exits are more frequent and rapid, so turning 10 is an enormous accomplishment for any company. Successful organizations know that many factors play a role: hard work, team dynamics, dedication, and perseverance.

In fact, some of the key principles to longevity have helped Norconex navigate throughout the years.

Norconex got its start 10 years ago as a small professional services company and has since established itself as a specialist in enterprise search products and services.  We have also grown to provide professional support to customers for enterprise search and crawling solutions. As the cloud became more secure and gained in popularity, Norconex began offering SaaS (Search as a Solution) and implemented its first fully hosted application.

Norconex also launched two search/discovery analytics products:

With thousands of users, Norconex made its mark in the open-source space by launching universal filesystem and web crawlers that integrate with any search engine or repository (such as Solr, Elasticsearch, HP IDOL, Azure Search, AWS CloudSearch, etc.).

Allowing us to integrate seamlessly are two elite products from our line known as:

As industries changed and evolved over time, we eventually saw an important shift to open-source search solutions. With that change, Norconex has helped organizations convert from commercial architecture to open source. Even as Google announced the discontinuation of its popular “Google Search Appliance” service, our company has been consulting with GSA customers to help migrate their search needs to other platforms.

With the company operating successfully for the past 10 years and key products and services in place, our organization has taken steps to give back to the community in several ways. Since 2015, we have been supporting women’s soccer in Canada as a proud sponsor of several girls’ soccer teams near our headquarters.

The journey has been a fun ride with many lessons, successes, and challenges along the way, but we would not be here without our amazing staff and clients. Thank you, and here’s to the next 10 years!

Norconex released an SQL Committer for its open-source crawlers (Norconex Collectors).  This enables you to store your crawled information into an SQL database of your choice.

To define an SQL database as your crawler’s target repository, follow these steps:

  1. Download the SQL Search Committer.
  2. Follow the install instructions.
  3. Add this minimalist configuration snippet to your Collector configuration file. It uses an H2 database as an example only; replace the values with your own settings:
  4. Get familiar with additional Committer configuration options.  For instance, while the above example will create a table and fields for you, you can also use an existing table, or provide the CREATE statement used to create a table.

For further information:

Norconex just released a Microsoft Azure Search Committer for its open-source crawlers (Norconex Collectors).  This empowers Azure Search users with full-featured file system and web crawlers.

If you have not yet discovered Norconex Collectors, head over to the Norconex Collectors website to see what you’ve been missing.

To enable Azure Search as your crawler’s target search engine, follow these steps:

  1. Download the Azure Search Committer.
  2. Follow the install instructions.
  3. Add this minimum required configuration snippet to your Collector configuration file:
  4. Configure your index schema, and get the endpoint, the index name, and the admin API key from your Azure Search Service dashboard.

The complete list of Committer configuration options is available here.  You will need to make sure the fields crawled match those you defined in your Azure Search index (can be achieved from your Collector configuration).

For further information:

Norconex just made it easier to understand the inner workings of its crawlers by creating clickable flow diagrams. Those diagrams are now available as part of both the Norconex HTTP Collector and Norconex Filesystem Collector websites.

Clicking on a shape will bring up relevant information and offer links to the corresponding documentation in the Collector configuration page.

While not all features are represented in those diagrams, there should be enough to improve your overall understanding and help you better configure your crawling solution.

Have a look now: