Latest Releases – Norconex Inc

This year marks the 15th anniversary of Norconex. It is fair to say it has had a rather significant impact on my life so far. Norconex has brought all kinds of life experiences to me, including pride, a sense of accomplishment, and yes, occasional stressful moments. During my time with the business, I also got to witness significant changes in the enterprise search industry. While I reminisce, I thought I’d share some of my recollections with you.

Yours truly founded Norconex in 2007 and I remain president to this day. Norconex positioned itself early on as an independent enterprise search company. We started with three people, offering professional services and support, mainly on Verity, Autonomy, and other commercial search products.

As the enterprise search market was booming, large companies wanted their piece of the pie. What is the easiest option to get in the ring when you are a multi-billion dollar company? Acquisitions, of course. Consequently, we saw several vendor acquisitions during that time, allowing bigger companies to integrate their newly acquired search software into their more specialized product suite. Examples include Microsoft acquiring FAST to the benefit of SharePoint, Oracle getting Endeca, and HP infamously overpaying for Autonomy.

Standard Approaches

While there are still no widely accepted “standard approaches” to interaction with the various enterprise search solutions, the passing of time brought us a certain commoditization of core search features. Full-text search, federated search, faceting, stemming, lemmatization, relevancy tuning, thesaurus management, geo-location search, document-level security, and horizontal scalability are just a few examples of the features expected of any respectable search engine these days. Does this mean enterprise search has stopped evolving? Not at all! For instance, advancements in artificial intelligence and machine learning can play a big role in enterprise search solutions; while many have yet to see those computational domains as more than buzzwords that only big players can afford to put into action, that’s changing and the future looks promising.

Open-source Software Recognition

We have also seen the long-overdue increase in open-source software recognition and adoption by organizations across the globe. It became increasingly difficult for product owners to justify the high cost of commercial enterprise search software when Apache Lucene-based open-source products like Solr or Elasticsearch now check all the core feature boxes, and are often better supported by their respective communities than their more expensive alternatives. Add to this the advent of the cloud and the ability to get search-as-a-service and you get a massive transition toward open-source search solutions.

This scenery change was reflected in our client base as well. We successfully migrated several of our customers from a commercial on-premise platform to cloud and open-source ones, greatly benefiting their budgets.

Looking Back

Norconex has seen a few changes itself over the years, as well. We have grown to a steady (but still small) group of employees. We are now working on an expanded range of projects for all kinds of industries. Furthermore, in addition to professional services, support, and platform migration for our customers, we now develop products, both commercial and open-source. Without a doubt, our open-source web crawler is our most popular product and, I must say, I feel particularly proud of its worldwide adoption. While it brought Norconex new customers from different corners of the world, open-source has also brought me new connections with a wide array of people, relationships that I cherish.

The People

About people… when I look back, I recall lots of memories and a range of emotions, but what stands out at the forefront are people. I am still as passionate about what I do, but passion alone does not explain Norconex’s longevity and success. I believe a passion can’t take root and flourish without people who share it. For me, it includes family, colleagues, customers, the wonderful open-source community, the many friends I have made along the way, and you, reading these words. To all of you, I say: thank you for the last 15 years and thank you for helping the Norconex team to forge ahead on its journey. We have more crazy projects coming up, so buckle up! Somehow, it feels like we’re just getting started.

Norconex is proud to announce the next major release of its popular open-source web crawler (also referred to as “Norconex HTTP Collector”). After a couple of years of development, you will find this new version was well worth the wait.

Not only does it introduce many new features, but it is also more flexible with even more documentation. Many of these improvements come from community feedback so long-term users deserve a pat on the back. This release is also yours.

If you are too eager to get started, you can download it now and follow its website documentation. Otherwise, keep reading for a glance at the new features.

What’s New?

The new features are too numerous to list here, but we’ll highlight some of the most significant.

Crawling of JavaScript-Driven Websites

Thanks to browser automation provided by Selenium WebDrivers, you can now use your favorite browser to crawl web pages relying on JavaScript to fully render. Generally speaking, if your browser can render content, the crawler can fetch it. It provides you with the ability to take screenshots of pages you crawl as well.

Multiple Committers

Committers are used to store crawled information into a target location or repository of your choice. This version allows you to specify any number of committers to have your data sent to multiple targets at once (database, search engine, filesystem, etc.). It is also possible to perform simple routing.

Easier to deploy

Variables in configuration files can now be resolved against system properties and environment variables. Logging has been abstracted using SLF4J and now prints to STDOUT by default. These changes facilitate deployment in containerized environments (e.g., Docker).
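To give a rough idea of what resolving configuration variables against the environment involves, here is a small Python sketch. It is an illustration only, not the crawler's implementation: the `${NAME}` placeholder syntax, the `resolve_variables` helper, and the `SOLR_URL` variable name are all assumptions made for this example.

```python
import os
import re

# Hypothetical placeholder syntax for this sketch: ${NAME}
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_variables(text, env=None):
    """Replace ${NAME} placeholders with values taken from env.

    env defaults to the process environment (os.environ).
    """
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched so the problem stays visible.
        return env.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, text)


config = "<url>${SOLR_URL}</url>"
print(resolve_variables(config, {"SOLR_URL": "http://localhost:8983"}))
```

In a container, the same configuration file can then be reused across environments by changing only the variables passed to it.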

Lots of Events

The event management has been redesigned and simplified. There are now more than 60 different event types being triggered for programmers to listen to and act upon. They range from new Committer and Importer events to the expected Web Crawler events.

XML Configuration improvements

Similar XML configuration options are now specified in a consistent way. In addition, it is now possible to provide partial class names (e.g., class="ExtensionReferenceFilter" instead of class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"). The Importer module also allows you to use XML “flow” to facilitate configuration logic. That is, you can now make use of special XML tags: <if>, <ifNot>, <condition>, <conditions>, <else>, and <then>.

Richer documentation

Documentation has been improved as well:

A new Online Manual is now available, giving great insight into installation and XML configuration.
Dynamic XML documentation now combines options from all modules making up the web crawler into a single location.
The JavaDoc now has formatted XML documentation and XML usage, which is easy to copy and paste into your own configuration.

Config Starter

A very simple yet useful configuration generator is now available online. It will help you create your first configuration file. You provide your “start” URL, answer a few questions and your configuration file will be generated for you.

More?

Some additional features:

Can send deletion requests to Committers upon encountering specific events.
Can prevent duplicate documents from being sent to Committers during the same crawling session.
Now supports these HTTP standards:
- ETag/If-None-Match
- HTTP Strict Transport Security (HSTS)
- If-Modified-Since
Can now extract links after document importing/parsing as well as from metadata.
The Crawler can be configured to stop itself after encountering specific events.
New command-line options for cleaning previous crawls (starting fresh) and to export/import the crawler internal data store.
Can now transform crawled images.
Additional content and metadata manipulation options.
Committers can now retry failing batches, reducing the batch size between each attempt.
New out-of-the-box CSV Committer.
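To illustrate the idea behind retrying failing batches with a smaller batch size, here is a minimal Python sketch. It is not the Committer code: the `commit_with_retry` function, the `send` callable, and the choice to halve the batch size on each failure are assumptions made for this example.

```python
def commit_with_retry(batch, send, min_size=1):
    """Send all items of batch, shrinking the chunk size after each failure.

    send is a hypothetical callable that raises an exception when the
    target repository rejects a chunk.
    """
    size = len(batch)
    while batch:
        chunk = batch[:size]
        try:
            send(chunk)
            batch = batch[size:]  # chunk accepted: move on to the rest
        except Exception:
            if size <= min_size:
                raise  # even the smallest chunk fails: give up
            size = max(min_size, size // 2)  # assumed policy: halve the size


sent = []


def send(chunk):
    # Pretend the repository rejects requests holding more than 2 documents.
    if len(chunk) > 2:
        raise RuntimeError("request too large")
    sent.append(list(chunk))


commit_with_retry(["a", "b", "c", "d"], send)
print(sent)
```

Shrinking the batch helps when a failure is caused by request size or by a single problematic document, since the remaining documents still get through.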

We recommend you have a look at the release notes for more.

What next?

If you are coming from Norconex HTTP Collector version 2, we recommend you have a look at the version 3 migration notes.

As always, community support is still available on GitHub. While on GitHub, take a moment to “Star” the project.

Come back once in a while as we’ll publish more in-depth articles on specific features or use cases you did not even think were possible to address with our web crawler.

Finally, we always love to know who is using the Norconex Web Crawler. Let us know and you may get listed on our wall of fame.

Enjoy!

Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.

CMIS support

Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content management systems (CMS) content. Extra information can be extracted, such as document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.

<startPaths>
    <path>cmis-atom:https://norconex.com/mycms/cmisatom!/my/starting/path</path>
</startPaths>

Additional ACL support

ACL from your CMS is not the only new type of ACL you can extract. This new Norconex Filesystem Collector release introduces support for obtaining local filesystem ACL. These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (since 2.7.0).

Field discovery

You can’t always tell upfront what metadata your crawler will find. One way to discover your fields is to send them all to your Committer, but this approach is not always possible or desirable. You can now store all fields found by the crawler in a local file. Each field will be saved once, with sample values to give you a better idea of their nature.

<tagger class="com.norconex.importer.handler.tagger.impl.FieldReportTagger" 
    maxSamples="2" file="/path/to/report/myfields.csv" />
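Conceptually, such a field report boils down to keeping each field name once along with a few sample values. The Python sketch below shows that idea only. The `collect_field_samples` function and the sample documents are made up for this example, and it does not reproduce the FieldReportTagger code or its CSV output.

```python
def collect_field_samples(documents, max_samples=2):
    """Map each field name to at most max_samples distinct sample values.

    documents is an iterable of dicts mapping a field name to a list of values.
    """
    report = {}
    for metadata in documents:
        for field, values in metadata.items():
            samples = report.setdefault(field, [])
            for value in values:
                if len(samples) < max_samples and value not in samples:
                    samples.append(value)
    return report


docs = [
    {"title": ["Home"], "author": ["Ann", "Bob", "Cid"]},
    {"title": ["About", "Contact"]},
]
for field, samples in collect_field_samples(docs).items():
    print(field, samples)
```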

New URL normalization rules

The HTTP Collector adds a few new rules to GenericURLNormalizer. Those are:

removeQueryString
lowerCase
lowerCasePath
lowerCaseQuery
lowerCaseQueryParameterNames
lowerCaseQueryParameterValues
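To show what two of these rules mean in practice, here is a hedged Python sketch built on the standard `urllib.parse` module. The function names simply mirror the rule names. The exact behavior of GenericURLNormalizer (for instance, how it treats fragments or re-encodes values) may differ, so treat this as an approximation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_query_string(url):
    """Drop everything after the '?' (the fragment is kept in this sketch)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def lower_case_query_parameter_names(url):
    """Lower-case query parameter names while leaving their values alone."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Note: urlencode re-encodes values, which may alter their escaping.
    query = urlencode([(name.lower(), value) for name, value in pairs])
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(remove_query_string("http://example.com/a?b=1"))
print(lower_case_query_parameter_names("http://example.com/Path?ID=Abc&Lang=EN"))
```

Normalizing URLs this way helps the crawler recognize that two differently written URLs point to the same page.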

Subdomains being part of a domain

When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider sub-domains as being the same site (includeSubdomains="true").
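The logic can be pictured with the short Python sketch below. It is a simplification under stated assumptions: the `is_same_site` helper is hypothetical, and it compares host names with a plain suffix check, which may not match how the crawler determines a domain.

```python
from urllib.parse import urlsplit


def is_same_site(start_url, candidate_url, include_subdomains=False):
    """Tell whether candidate_url is on the same site as start_url."""
    start_host = (urlsplit(start_url).hostname or "").lower()
    host = (urlsplit(candidate_url).hostname or "").lower()
    if host == start_host:
        return True
    # Sub-domains only count when explicitly requested.
    return include_subdomains and host.endswith("." + start_host)


print(is_same_site("https://example.com/", "https://blog.example.com/post"))
print(is_same_site("https://example.com/", "https://blog.example.com/post",
                   include_subdomains=True))
```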

Other changes

For a complete list of all additions and changes, refer to the following release notes:

Download

Kafka users rejoice! You can now use Norconex open-source crawlers with Apache Kafka, thanks to the Norconex Apache Kafka Committer.

We owe this contribution to Joseph Paulo Mantuano (Senior Developer at The Red Flag Group) and Dan Davis.

The Norconex Collectors community keeps growing. We are thrilled to see the number of integrations grow with it as well. If you know of any Norconex Committer implementation out there, let us know and we’ll add them to the list!

Not yet familiar with Norconex crawlers? Head over to Norconex HTTP Collector or Norconex Filesystem Collector websites to learn more.

Great news! There is now a Google Cloud Search Committer for Norconex Crawlers!

This addition to the Norconex Collector family should delight Google Cloud Search fans. They too can now enjoy the full-featured crawling capabilities offered by Norconex open-source crawlers.

Since this Committer is developed and maintained by Google, you will find installation and configuration documentation on the Google Developers website.

New to Norconex crawlers? Head over to the Norconex Collectors website to start crawling.

Happy crawling!

Norconex crawlers and Neo4j graph database are now a love match! Neo4j is arguably the most popular graph database out there. Use Norconex crawlers to harvest relationships from websites and filesystems and feed them to your favorite graph engine.

This was made possible thanks to none other than French contributor Sylvain Roussy, a Neo4j reference and author of two Neo4j books. Norconex is proud to have been able to partner with Sylvain to develop a Neo4j Committer for use with its Norconex HTTP and Filesystem Collectors.

To our French-speaking European friends, Sylvain will host a series of Neo4j Meetups at different locations. He will explain how Norconex crawlers can be used to gather graph data from the web to use in Neo4j. The first of the series is taking place on January 24th, in Genève:

Useful Links:

Norconex is proud to announce the release of Norconex HTTP Collector version 2.8.0. This release is accompanied by new releases of many related Norconex open-source products (Filesystem Collector, Importer, Committers, etc.), and together they bring dozens of new features and enhancements highlighted below.

Extract a “Featured Image” from web pages

In addition to taking screenshots of webpages, you can now extract the main image of a web page thanks to the new FeaturedImageProcessor. You can specify conditions to identify the image (first one encountered matching a minimum size or a given pattern). You also have the option to store the image on file or as a BASE64 string with the crawled document (after scaling it to your preferred dimensions) or simply store a reference to it.

<preImportProcessors>
  <processor class="com.norconex.collector.http.processor.impl.FeaturedImageProcessor">
    <minDimensions>300x400</minDimensions>
    <scaleDimensions>50</scaleDimensions>
    <imageFormat>jpg</imageFormat>
    <scaleQuality>max</scaleQuality>
    <storage>inline</storage>
  </processor>
</preImportProcessors>

Limit link extraction to specific page portions

The GenericLinkExtractor now makes it possible to extract only the links found within one or more specific sections of a web page. For instance, you may want to extract only the links found in navigation menus and not those found in content areas, in case the latter usually point to other sites you do not want to crawl.

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">

  <extractBetween>
    <start><![CDATA[<!-- BEGIN NAV LINKS -->]]></start>
    <end><![CDATA[<!-- END NAV LINKS -->]]></end>
  </extractBetween>

  <noExtractBetween>
    <start><![CDATA[<!-- BEGIN EXTERNAL SITES -->]]></start>
    <end><![CDATA[<!-- END EXTERNAL SITES -->]]></end>
  </noExtractBetween>

</extractor>

Truncate long field values

The new TruncateTagger offers the ability to truncate long values and the option to replace the truncated portion with a hash to help preserve uniqueness when required. This is especially useful in preventing errors with search engines (or other repositories) that impose field length limitations.

<tagger class="com.norconex.importer.handler.tagger.impl.TruncateTagger"
    fromField="mySuperLongField"
    maxLength="500"
    toField="myTruncatedField"
    overwrite="true"
    appendHash="true"
    suffix="!" />
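To see why appending a hash helps preserve uniqueness, consider the Python sketch below. It is only an illustration. The `truncate_value` function, the use of SHA-256, the 8-character hash length, and the choice to count the suffix and hash within the maximum length are all assumptions made here, not a description of how TruncateTagger computes its output.

```python
import hashlib

HASH_LENGTH = 8  # hypothetical: number of hex digits kept from the digest


def truncate_value(value, max_length, append_hash=False, suffix=""):
    """Truncate value to max_length, optionally ending with suffix + hash."""
    if len(value) <= max_length:
        return value
    tail_length = len(suffix) + (HASH_LENGTH if append_hash else 0)
    keep = max(0, max_length - tail_length)
    tail = suffix
    if append_hash:
        # Hash the removed portion so different long values stay distinct.
        removed = value[keep:]
        tail += hashlib.sha256(removed.encode("utf-8")).hexdigest()[:HASH_LENGTH]
    return value[:keep] + tail


first = truncate_value("a" * 600, 500, append_hash=True, suffix="!")
second = truncate_value("a" * 599 + "b", 500, append_hash=True, suffix="!")
print(len(first), first != second)
```

Without the hash, both values above would be truncated to the same string, which is a problem when the field is used as an identifier.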

Add metadata to a document using an external application

The new ExternalTagger allows you to point to an external (i.e., command-line) application to “decorate” a document with extra metadata information. Both the existing document content and metadata can be supplied to the external application. The application output can be in a specific format (json, xml, properties) or free-form combined with metadata extraction patterns you can configure. Either standard streams or files can be supplied as arguments to the external application. To transform the content using an external application instead, have a look at the ExternalTransformer, which has also been updated to support metadata.

<tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <command>
    /app/addressExtractor ${INPUT} ${INPUT_META} ${REFERENCE}
  </command>
  <metadata inputFormat="json">
    <pattern field="address" valueGroup="1">
      ^address=(.*)$
    </pattern>
  </metadata>
</tagger>

Other improvements

This release includes many more new features and enhancements:

To create a document checksum, you can now combine metadata with content.
The TextPatternTagger can now extract field names dynamically in addition to values.
The ReplaceTagger and ReplaceTransformer now support empty/null replacement values.
There are new configuration options on the GenericHttpClientFactory:
- “authFormParams” to add arbitrary parameters to authentication forms.
- “authPreemptive” to use preemptive authentication with BASIC authentication.
The Amazon CloudSearch and Elasticsearch Committers both have a new “fixBadIds” flag to safely handle URLs that do not meet product limitations.

For the complete list of changes, refer to these product release notes:

Useful links

Download Norconex HTTP Collector
Get started with Norconex HTTP Collector
Report your issues and questions on GitHub
Contact Norconex

Norconex released an SQL Committer for its open-source crawlers (Norconex Collectors). This enables you to store your crawled information into an SQL database of your choice.

To define an SQL database as your crawler’s target repository, follow these steps:

Download the SQL Search Committer.
Follow the install instructions.

Add this minimalist configuration snippet to your Collector configuration file. It uses an H2 database as an example only; replace the values with your own settings:

<committer class="com.norconex.committer.sql.SQLCommitter">
  <driverPath>/path/to/driver/h2.jar</driverPath>
  <driverClass>org.h2.Driver</driverClass>
  <connectionUrl>jdbc:h2:file:///path/to/db/h2</connectionUrl>
  <tableName>test_table</tableName>
  <createMissing>true</createMissing>
</committer>

Get familiar with additional Committer configuration options. For instance, while the above example will create a table and fields for you, you can also use an existing table, or provide the CREATE statement used to create a table.

For further information:

Norconex just released a Microsoft Azure Search Committer for its open-source crawlers (Norconex Collectors). This empowers Azure Search users with full-featured file system and web crawlers.

If you have not yet discovered Norconex Collectors, head over to the Norconex Collectors website to see what you’ve been missing.

To enable Azure Search as your crawler’s target search engine, follow these steps:

Download the Azure Search Committer.
Follow the install instructions.

Add this minimum required configuration snippet to your Collector configuration file:

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <endpoint>https://example.search.windows.net</endpoint>
  <apiKey>1234567890ABCDEF1234567890ABCDEF</apiKey>
  <indexName>sample-index</indexName>
</committer>

You need to configure your index schema, the endpoint, and the index name from your Azure Search dashboard. You will also obtain the admin API key from the Azure Search Service dashboard.

The complete list of Committer configuration options is available here. You will need to make sure the fields crawled match those you defined in your Azure Search index (can be achieved from your Collector configuration).

For further information:

Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector. This update, along with related component updates, introduces several interesting features.

HTTP Collector changes

The following items are specific to the HTTP Collector. For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.

Crawling of JavaScript-driven pages

The alternative document fetcher PhantomJSDocumentFetcher now makes it possible to crawl web pages with JavaScript-generated content. This much-awaited feature is now available thanks to integration with the open-source PhantomJS headless browser. As a bonus, you can also take screenshots of web pages you crawl.

<documentFetcher
    class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <exePath>/path/to/phantomjs.exe</exePath>
  <renderWaitTime>5000</renderWaitTime>
  <referencePattern>^.*\.html$</referencePattern>
</documentFetcher>

Generic changes

The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).

Much improved XML configuration validation

You no longer have to hunt for a misconfiguration. Schema-based XML configuration validation was added and you will now get errors if you have bad XML syntax for any configuration option. This validation can be triggered from the command prompt with this new flag: -k or --checkcfg.

# -k can be used on its own, but when combined with -a (like below),
# it will prevent the collector from executing if there are any errors.

collector-http.sh -a start -c examples/minimum/minimum-config.xml -k

# Error sample:
ERROR (XML) ReplaceTagger: cvc-attribute.3: The value 'asdf' of attribute 'regex' on element 'replace' is not valid with respect to its type, 'boolean'.

Enter durations in human-readable format

Having to convert a duration to milliseconds is not very user-friendly. Anywhere in your XML configuration where a duration is expected, you can now use a human-readable representation (English only) as an alternative.

<!-- Example using "5 seconds" and "1 second" as opposed to milliseconds -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
    default="5 seconds" ignoreRobotsCrawlDelay="true" scope="site" >
  <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
</delay>
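For the curious, converting such text to milliseconds can be sketched in a few lines of Python. This is an illustration under assumptions: the `parse_duration` function and the list of supported units are chosen for this example and may not match what the crawler actually accepts.

```python
import re

# Assumed units for this sketch; the crawler may accept others.
_UNIT_MILLIS = {
    "millisecond": 1,
    "second": 1000,
    "minute": 60_000,
    "hour": 3_600_000,
    "day": 86_400_000,
}
_PART = re.compile(r"(\d+)\s*([a-z]+)")


def parse_duration(text):
    """Convert text such as '5 seconds' or '1 hour 30 minutes' to milliseconds."""
    text = text.strip().lower()
    if text.isdigit():
        return int(text)  # a plain number is taken as milliseconds
    parts = _PART.findall(text)
    if not parts:
        raise ValueError("Unrecognized duration: " + text)
    total = 0
    for amount, unit in parts:
        key = unit[:-1] if unit.endswith("s") else unit  # accept plural forms
        if key not in _UNIT_MILLIS:
            raise ValueError("Unknown unit: " + unit)
        total += int(amount) * _UNIT_MILLIS[key]
    return total


print(parse_duration("5 seconds"))
print(parse_duration("1 hour 30 minutes"))
```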

Lua scripting language

Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer. This gives you one more scripting option available out-of-the-box besides JavaScript/ECMAScript.

<!-- Add "apple" to a "fruit" metadata field: -->
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger"
    engineName="lua">
  <script><![CDATA[
    metadata:addString('fruit', {'apple'});
  ]]></script>
</tagger>

Modify documents using an external application

With the new ExternalTransformer, you can now use an external application to perform document transformation. This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.

<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <match field="docnumber">DocNo:(\d+)</match>
  </metadata>
</transformer>

Combine document fields

The new MergeTagger can be used for combining multiple fields into one. The target field can be either multi-value or single-value separated with the character of your choice.

<tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
  <merge toField="title" deleteFromFields="true"
      singleValue="true" singleValueSeparator=",">
    <fromFields>title,dc.title,dc:title,doctitle</fromFields>
  </merge>
</tagger>
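The effect of such a merge on a document's metadata can be pictured with this small Python sketch. The `merge_fields` function and its parameter names merely echo the XML attributes above. It is an approximation for illustration, not the MergeTagger implementation.

```python
def merge_fields(metadata, from_fields, to_field,
                 delete_from_fields=False, single_value=False, separator=""):
    """Merge the values of from_fields into to_field and return metadata.

    metadata maps a field name to a list of values and is modified in place.
    """
    merged = []
    for field in from_fields:
        merged.extend(metadata.get(field, []))
    if delete_from_fields:
        for field in from_fields:
            metadata.pop(field, None)
    # Either one joined value or all collected values as a multi-value field.
    metadata[to_field] = [separator.join(merged)] if single_value else merged
    return metadata


doc = {"title": ["Home"], "dc.title": ["Welcome"], "lang": ["en"]}
print(merge_fields(doc, ["title", "dc.title", "dc:title", "doctitle"], "title",
                   delete_from_fields=True, single_value=True, separator=","))
```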

New Committers

Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at the time of crawling, or whether you are not using a repository at all, Norconex Collectors now ship with two file-based Committers for easy consumption by your own process: XMLFileCommitter and JSONFileCommitter. All available committers can be found here.

<committer class="com.norconex.committer.core.impl.XMLFileCommitter">
 <directory>/path/my-xmls/</directory>
 <pretty>true</pretty>
 <docsPerFile>100</docsPerFile>
 <compress>false</compress>
 <splitAddDelete>false</splitAddDelete>
</committer>

Several additional features or changes can be found in the latest Collector releases. Among them:

New Importer RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
New SubstringTransformer for truncating content.
New UUIDTagger for giving a unique id to each document.
CharacterCaseTagger now supports “swap” and “string” to swap character case and capitalize beginning of a string, respectively.
ConstantTagger offers options when dealing with existing values: add to existing values, replace them, or do nothing.
Components such as Importer, Committers, etc., are all easier to install thanks to new utility scripts.
Document Access-Control-List (ACL) information is now extracted from SMB/CIFS file systems (Filesystem Collector).
New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops.
Added “removeTrailingHash” as a new GenericURLNormalizer option (HTTP Collector).
New “detectContentType” and “detectCharset” options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detecting them instead (Filesystem Collector).
Start URLs and start paths can now be dynamically created thanks to IStartURLsProvider and IStartPathsProvider (HTTP Collector and Filesystem Collector).

To get the complete list of changes, refer to the HTTP Collector release notes, Filesystem Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.
