Pascal Essiembre – Norconex Inc

This year marks the 15th anniversary of Norconex. It is fair to say it has had a rather significant impact on my life so far. Norconex has brought all kinds of life experiences to me, including pride, a sense of accomplishment, and yes, occasional stressful moments. During my time with the business, I also got to witness significant changes in the enterprise search industry. While I reminisce, I thought I’d share some of my recollections with you.

Yours truly founded Norconex in 2007 and I remain president to this day. Norconex positioned itself early on as an independent enterprise search company. We started with three people, offering professional services and support, mainly on Verity, Autonomy, and other commercial search products.

As the enterprise search market was booming, large companies wanted their piece of the pie. What is the easiest option to get in the ring when you are a multi-billion dollar company? Acquisitions, of course. Consequently, we saw several vendor acquisitions during that time, allowing bigger companies to integrate their newly acquired search software into their more specialized product suite. Examples include Microsoft acquiring FAST to the benefit of SharePoint, Oracle getting Endeca, and HP infamously overpaying for Autonomy.

Standard Approaches

While there are still no widely accepted “standard approaches” to interaction with the various enterprise search solutions, the passing of time brought us a certain commoditization of core search features. Full-text search, federated search, faceting, stemming, lemmatization, relevancy tuning, thesaurus management, geo-location search, document-level security, and horizontal scalability are just a few examples of the features expected of any respectable search engine these days. Does this mean enterprise search has stopped evolving? Not at all! For instance, advancements in artificial intelligence and machine learning can play a big role in enterprise search solutions; while many have yet to see those computational domains as more than buzzwords that only big players can afford to put into action, that’s changing and the future looks promising.

Open-source Software Recognition

We have also seen the long-overdue increase in open-source software recognition and adoption by organizations across the globe. It became increasingly difficult for product owners to justify the high cost of commercial enterprise search software when Apache Lucene-based open-source products like Solr or Elasticsearch now check all the core feature boxes and are often better supported by their respective communities than their more expensive alternatives. Add to this the advent of the cloud and the ability to get search-as-a-service and you get a massive transition toward open-source search solutions.

This scenery change was reflected in our client base as well. We successfully migrated several of our customers from a commercial on-premise platform to cloud and open-source ones, greatly benefiting their budgets.

Looking Back

Norconex has seen a few changes itself over the years, as well. We have grown to a steady (but still small) group of employees. We are now working on an expanded range of projects for all kinds of industries. Furthermore, in addition to professional services, support, and platform migration for our customers, we now develop products, both commercial and open-source. Without a doubt, our open-source web crawler is our most popular product and, I must say, I feel particularly proud of its worldwide adoption. While it brought Norconex new customers from different corners of the world, open-source has also brought me new connections with a wide array of people, relationships that I cherish.

The People

About people… when I look back, I recall lots of memories and a range of emotions, but what stands out at the forefront are people. I am still as passionate about what I do, but passion alone does not explain Norconex’s longevity and success. I believe a passion can’t take root and flourish without people who share it. For me, it includes family, colleagues, customers, the wonderful open-source community, the many friends I have made along the way, and you, reading these words. To all of you, I say: thank you for the last 15 years and thank you for helping the Norconex team to forge ahead on its journey. We have more crazy projects coming up, so buckle up! Somehow, it feels like we’re just getting started.

Norconex is proud to announce the next major release of its popular open-source web crawler (also referred to as “Norconex HTTP Collector”). After a couple of years of development, you will find this new version was well worth the wait.

Not only does it introduce many new features, but it is also more flexible with even more documentation. Many of these improvements come from community feedback so long-term users deserve a pat on the back. This release is also yours.

If you are too eager to get started, you can download it now and follow its website documentation. Otherwise, keep reading for a glance at the new features.

What’s New?

Introduced features are too many to list here, but we’ll highlight some of the most significant.

Crawling of JavaScript-Driven Websites

Thanks to browser automation provided by Selenium WebDrivers, you can now use your favorite browser to crawl web pages relying on JavaScript to fully render. Generally speaking, if your browser can render content, the crawler can fetch it. It provides you with the ability to take screenshots of pages you crawl as well.

Multiple Committers

Committers are used to store crawled information into a target location or repository of your choice. This version allows you to specify any number of committers to have your data sent to multiple targets at once (database, search engine, filesystem, etc.). It is also possible to perform simple routing.

Easier to deploy

Variables in configuration files can now be resolved against system properties and environment variables. Logging has been abstracted using SLF4J and now prints to STDOUT by default. These changes facilitate deployment in containerized environments (e.g., Docker).
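To illustrate the general idea of resolving configuration variables from the runtime environment, here is a minimal Java sketch. It is not the crawler's actual implementation: the `VariableResolver` class, the `${name}` placeholder syntax, and the lookup order (system properties first, then environment variables) are all assumptions made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class VariableResolver {
    // Matches placeholders such as ${index.name} or ${INDEX_NAME}.
    private static final Pattern VAR =
            Pattern.compile("\\$\\{([A-Za-z0-9_.]+)\\}");

    // Replaces each placeholder with a system property if one is set,
    // otherwise with an entry from the supplied environment map.
    // Names found in neither place are left untouched.
    static String resolve(String text, Map<String, String> env) {
        Matcher m = VAR.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            String value = System.getProperty(name);
            if (value == null) {
                value = env.get(name);
            }
            if (value == null) {
                value = m.group(); // keep the unresolved placeholder
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In a container, you would typically call something like `VariableResolver.resolve(configText, System.getenv())` so that values passed with `docker run -e` end up in the configuration without editing the file.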

Lots of Events

The event management has been redesigned and simplified. There are now more than 60 different event types being triggered for programmers to listen to and act upon. They range from new Committer and Importer events to the expected Web Crawler events.

XML Configuration improvements

Similar XML configuration options are now specified in a consistent way. In addition, it is now possible to provide partial class names (e.g., class="ExtensionReferenceFilter" instead of class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"). The Importer module also allows you to use XML “flow” to facilitate configuration logic. That is, you can now make use of special XML tags: <if>, <ifNot>, <condition>, <conditions>, <else>, and <then>.

Richer documentation

Documentation has been improved as well:

A new Online Manual is now available, giving great insight into installation and XML configuration.
Dynamic XML documentation now combines options from all modules making up the web crawler into a single location.
The JavaDoc now has formatted XML documentation and XML usage, which is easy to copy and paste into your own configuration.

Config Starter

A very simple yet useful configuration generator is now available online. It will help you create your first configuration file. You provide your “start” URL, answer a few questions and your configuration file will be generated for you.

More?

Some additional features:

Can send deletion requests to Committers upon encountering specific events.
Can prevent duplicate documents from being sent to Committers during the same crawling session.
Now supports these HTTP standards:
- ETag/If-None-Match
- HTTP Strict Transport Security (HSTS)
- If-Modified-Since
Can now extract links after document importing/parsing as well as from metadata.
The Crawler can be configured to stop itself after encountering specific events.
New command-line options for cleaning previous crawls (starting fresh) and to export/import the crawler internal data store.
Can now transform crawled images.
Additional content and metadata manipulation options.
Committers can now retry failing batches, reducing the batch size between each attempt.
New out-of-the-box CSV Committer.
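
The batch retry behavior mentioned in the list above can be pictured with the following Java sketch. It is a hypothetical illustration, not the Committer code: the `ShrinkingBatchRetry` class, the choice to halve the batch size after each failure, and the `maxRetries` parameter are assumptions made for this example.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only.
final class ShrinkingBatchRetry {

    // Sends all documents in batches. When a batch fails, the batch size
    // is halved (never below 1) and the same documents are retried.
    // Gives up after maxRetries consecutive failures.
    // Returns the total number of send attempts.
    static <T> int commit(List<T> docs, int batchSize, int maxRetries,
            Consumer<List<T>> sender) {
        int attempts = 0;
        int size = batchSize;
        int failures = 0;
        int from = 0;
        while (from < docs.size()) {
            int to = Math.min(from + size, docs.size());
            attempts++;
            try {
                sender.accept(docs.subList(from, to));
                from = to;      // batch accepted, move on
                failures = 0;
            } catch (RuntimeException e) {
                if (++failures > maxRetries) {
                    throw e;    // too many consecutive failures
                }
                size = Math.max(1, size / 2);
            }
        }
        return attempts;
    }
}
```

Shrinking the batch between attempts helps when the target rejects a request because of its size or because of a single bad document: smaller batches narrow down the problem while still letting the rest through.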

We recommend you have a look at the release notes for more.

What next?

If you are coming from Norconex HTTP Collector version 2, we recommend you have a look at the version 3 migration notes.

As always, community support is still available on GitHub. While on GitHub, take a moment to “Star” the project.

Come back once in a while as we’ll publish more in-depth articles on specific features or use cases you did not even think were possible to address with our web crawler.

Finally, we always love to know who is using the Norconex Web Crawler. Let us know and you may get listed on our wall of fame.

Enjoy!

Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.

CMIS support

Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content management systems (CMS) content. Extra information can be extracted, such as document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.

<startPaths>
    <path>cmis-atom:https://norconex.com/mycms/cmisatom!/my/starting/path</path>
</startPaths>

Additional ACL support

ACL from your CMS is not the only new type of ACL you can extract. This new Norconex Filesystem Collector release introduces support for obtaining local filesystem ACL. These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (since 2.7.0).

Field discovery

You can’t always tell upfront what metadata your crawler will find. One way to discover your fields is to send them all to your Committer. This approach is not always possible nor desirable. You can now store to a local file all fields found by the crawler. Each field will be saved once, with sample values to give you a better idea of their nature.

<tagger class="com.norconex.importer.handler.tagger.impl.FieldReportTagger" 
    maxSamples="2" file="/path/to/report/myfields.csv" />
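
Conceptually, such a field report keeps each field name once along with a handful of sample values. The Java sketch below shows that idea only; the `FieldReport` class and its methods are hypothetical names for this example and do not reflect how FieldReportTagger is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
final class FieldReport {
    private final int maxSamples;
    // Field name -> distinct sample values, sorted by field name.
    private final Map<String, List<String>> samples = new TreeMap<>();

    FieldReport(int maxSamples) {
        this.maxSamples = maxSamples;
    }

    // Records the fields of one crawled document, keeping at most
    // maxSamples distinct values per field.
    void record(Map<String, String> documentFields) {
        documentFields.forEach((field, value) -> {
            List<String> values =
                    samples.computeIfAbsent(field, k -> new ArrayList<>());
            if (values.size() < maxSamples && !values.contains(value)) {
                values.add(value);
            }
        });
    }

    Map<String, List<String>> asMap() {
        return samples;
    }
}
```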

New URL normalization rules

The HTTP Collector adds a few new rules to GenericURLNormalizer. Those are:

removeQueryString
lowerCase
lowerCasePath
lowerCaseQuery
lowerCaseQueryParameterNames
lowerCaseQueryParameterValues
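
To give a feel for what two of these rules do to a URL, here is a small Java sketch. The rule names come from the list above, but the `UrlRules` class is hypothetical and the behavior shown is our reading of the rule names rather than the GenericURLNormalizer code. For brevity, it assumes URLs without a fragment (`#...`).

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class UrlRules {

    // Drops everything from the first '?' onward.
    static String removeQueryString(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }

    // Lowercases parameter names while leaving their values untouched.
    static String lowerCaseQueryParameterNames(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url;
        }
        String query = Arrays.stream(url.substring(q + 1).split("&"))
                .map(UrlRules::lowerCaseName)
                .collect(Collectors.joining("&"));
        return url.substring(0, q + 1) + query;
    }

    // Lowercases the part of "name=value" that precedes the '='.
    private static String lowerCaseName(String param) {
        int eq = param.indexOf('=');
        if (eq < 0) {
            return param.toLowerCase(Locale.ROOT);
        }
        return param.substring(0, eq).toLowerCase(Locale.ROOT)
                + param.substring(eq);
    }
}
```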

Subdomains being part of a domain

When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider sub-domains as being the same site (includeSubdomains="true").
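
The difference between the two settings can be sketched as a simple host comparison. This Java snippet is an illustration under assumptions: the `SiteScope` class is hypothetical, and it treats the start URL's host as the domain, whereas the crawler may determine the domain differently.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class SiteScope {

    // Returns true when candidateHost is the start host itself or,
    // if includeSubdomains is set, one of its sub-domains.
    static boolean isSameSite(String startHost, String candidateHost,
            boolean includeSubdomains) {
        String start = startHost.toLowerCase(Locale.ROOT);
        String candidate = candidateHost.toLowerCase(Locale.ROOT);
        if (candidate.equals(start)) {
            return true;
        }
        // The leading dot prevents "notexample.com" from matching.
        return includeSubdomains && candidate.endsWith("." + start);
    }
}
```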

Other changes

For a complete list of all additions and changes, refer to the following release notes:

Download

Kafka users rejoice! You can now use Norconex open-source crawlers with Apache Kafka, thanks to the Norconex Apache Kafka Committer.

We owe this contribution to Joseph Paulo Mantuano (Senior Developer at The Red Flag Group) and Dan Davis.

The Norconex Collectors community keeps growing. We are thrilled to see the number of integrations grow with it as well. If you know of any Norconex Committer implementation out there, let us know and we’ll add them to the list!

Not yet familiar with Norconex crawlers? Head over to Norconex HTTP Collector or Norconex Filesystem Collector websites to learn more.

Great news! There is now a Google Cloud Search Committer for Norconex Crawlers!

This addition to the Norconex Collector family should delight Google Cloud Search fans. They too can now enjoy the full-featured crawling capabilities offered by Norconex open-source crawlers.

Since this Committer is developed and maintained by Google, you will find installation and configuration documentation on the Google Developers website.

New to Norconex crawlers? Head over to the Norconex Collectors website to start crawling.

Happy crawling!

Norconex crawlers and Neo4j graph database are now a love match! Neo4j is arguably the most popular graph database out there. Use Norconex crawlers to harvest relationships from websites and filesystems and feed them to your favorite graph engine.

This was made possible thanks to none other than French contributor Sylvain Roussy, a Neo4j reference and author of two Neo4j books. Norconex is proud to have been able to partner with Sylvain to develop a Neo4j Committer for use with its Norconex HTTP and Filesystem Collectors.

To our French-speaking European friends, Sylvain will host a series of Neo4j Meetups at different locations. He will explain how Norconex crawlers can be used to gather graph data from the web to use in Neo4j. The first of the series is taking place on January 24th, in Genève:

Useful Links:

Norconex is proud to announce the release of Norconex HTTP Collector version 2.8.0. This release is accompanied by new releases of many related Norconex open-source products (Filesystem Collector, Importer, Committers, etc.), and together they bring dozens of new features and enhancements highlighted below.

Extract a “Featured Image” from web pages

In addition to taking screenshots of webpages, you can now extract the main image of a web page thanks to the new FeaturedImageProcessor. You can specify conditions to identify the image (first one encountered matching a minimum size or a given pattern). You also have the option to store the image on file or as a BASE64 string with the crawled document (after scaling it to your preferred dimensions) or simply store a reference to it.

<preImportProcessors>
  <processor class="com.norconex.collector.http.processor.impl.FeaturedImageProcessor">
    <minDimensions>300x400</minDimensions>
    <scaleDimensions>50</scaleDimensions>
    <imageFormat>jpg</imageFormat>
    <scaleQuality>max</scaleQuality>
    <storage>inline</storage>
  </processor>
</preImportProcessors>

Limit link extraction to specific page portions

The GenericLinkExtractor now makes it possible to extract links to follow only from one or more specific sections of a web page. For instance, you may want to extract only links found in navigation menus and not those found in content areas, in case the latter usually point to other sites you do not want to crawl.

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">

  <extractBetween>
    <start><![CDATA[<!-- BEGIN NAV LINKS -->]]></start>
    <end><![CDATA[<!-- END NAV LINKS -->]]></end>
  </extractBetween>

  <noExtractBetween>
    <start><![CDATA[<!-- BEGIN EXTERNAL SITES -->]]></start>
    <end><![CDATA[<!-- END EXTERNAL SITES -->]]></end>
  </noExtractBetween>

</extractor>
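
The effect of an extractBetween section can be approximated with a few lines of Java. This is a simplified sketch, not the GenericLinkExtractor logic: the `SectionLinkExtractor` class is hypothetical, and it only looks at double-quoted `href` attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class SectionLinkExtractor {
    // Simplification: only double-quoted href attributes are considered.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"");

    // Returns links found between each start/end marker pair and
    // ignores links located anywhere else on the page.
    static List<String> extractBetween(String html, String start,
            String end) {
        List<String> links = new ArrayList<>();
        int from = html.indexOf(start);
        while (from >= 0) {
            int contentStart = from + start.length();
            int to = html.indexOf(end, contentStart);
            if (to < 0) {
                break; // unterminated section: stop looking
            }
            Matcher m = HREF.matcher(html.substring(contentStart, to));
            while (m.find()) {
                links.add(m.group(1));
            }
            from = html.indexOf(start, to + end.length());
        }
        return links;
    }
}
```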

Truncate long field values

The new TruncateTagger offers the ability to truncate long values and the option to replace the truncated portion with a hash to help preserve uniqueness when required. This is especially useful in preventing errors with search engines (or other repositories) and field length limitations.

<tagger class="com.norconex.importer.handler.tagger.impl.TruncateTagger"
    fromField="mySuperLongField"
    maxLength="500"
    toField="myTruncatedField"
    overwrite="true"
    appendHash="true"
    suffix="!" />
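
Why does a hash help preserve uniqueness? The Java sketch below shows the principle. It is not the TruncateTagger implementation: the `Truncation` class is hypothetical, and the hash function (`String.hashCode()` of the removed text), its placement after the suffix, and the fact that the result may exceed `maxLength` are assumptions made for this example.

```java
// Hypothetical helper, for illustration only.
final class Truncation {

    // Keeps the first maxLength characters, then appends the suffix and,
    // optionally, a hash of the removed text so that two long values
    // sharing the same beginning remain distinguishable.
    static String truncate(String value, int maxLength, String suffix,
            boolean appendHash) {
        if (value.length() <= maxLength) {
            return value; // short enough, nothing to do
        }
        String kept = value.substring(0, maxLength);
        String removed = value.substring(maxLength);
        String hash = appendHash
                ? Integer.toHexString(removed.hashCode())
                : "";
        return kept + suffix + hash;
    }
}
```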

Add metadata to a document using an external application

The new ExternalTagger allows you to point to an external (i.e., command-line) application to “decorate” a document with extra metadata information. Both the existing document content and metadata can be supplied to the external application. The application output can be in a specific format (json, xml, properties) or free-form combined with metadata extraction patterns you can configure. Either standard streams or files can be supplied as arguments to the external application. To transform the content using an external application instead, have a look at the ExternalTransformer, which has also been updated to support metadata.

<tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <command>
    /app/addressExtractor ${INPUT} ${INPUT_META} ${REFERENCE}
  </command>
  <metadata inputFormat="json">
    <pattern field="address" valueGroup="1">
      ^address=(.*)$
    </pattern>
  </metadata>
</tagger>
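
To show how a pattern such as `^address=(.*)$` pulls values out of free-form output, here is a Java sketch that applies it line by line. The `ExternalOutputParser` class is hypothetical, the sample output is made up, and the use of multi-line matching is an assumption for this example rather than a description of how ExternalTagger works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ExternalOutputParser {

    // Applies the regular expression to each line of the output and
    // collects the requested capture group of every match.
    static List<String> extract(String output, String regex,
            int valueGroup) {
        // MULTILINE lets ^ and $ match at the start and end of each line.
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE)
                .matcher(output);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(valueGroup));
        }
        return values;
    }
}
```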

Other improvements

This release includes many more new features and enhancements:

To create a document checksum, you can now combine metadata with content.
The TextPatternTagger can now extract field names dynamically in addition to values.
The ReplaceTagger and ReplaceTransformer now support empty/null replacement values.
There are new configuration options on the GenericHttpClientFactory:
- “authFormParams” to add arbitrary parameters to authentication forms.
- “authPreemptive” to use preemptive authentication with BASIC authentication.
The Amazon CloudSearch and Elasticsearch Committers both have a new “fixBadIds” flag to safely handle URLs that do not meet product limitations.

For the complete list of changes, refer to these product release notes:

Useful links

Download Norconex HTTP Collector
Get started with Norconex HTTP Collector
Report your issues and questions on GitHub
Contact Norconex

Letter from the President,

Do you know that Norconex turned 10 this year? That’s right, Norconex was founded in 2007 and I could not be prouder to be president of Norconex as we cross this important milestone.

Our company’s numerous achievements would not have been possible without our amazing employees. They are smart, committed, loyal, and all have client satisfaction at heart. Having such a great team is precious beyond words.

I am also taking this occasion to thank every one of you, customers and partners, for having played a vital role in Norconex’s success. We can’t thank you enough for choosing our services and products, making us the success that we are.

We plan to keep growing our relationship in the years to come and continue to offer you the best.

We are looking forward to the next 10 years!

Sincerely,

Pascal Essiembre

President

What the First 10 Years at Norconex Looked Like

In today’s business landscape, company lifecycles are shorter and exits are far more frequent and rapid. Turning 10 is an enormous accomplishment for any company. Successful organizations know that many factors play a role: hard work, team dynamics, dedication, and perseverance.

In fact, some of these key principles of longevity have helped Norconex navigate the years.

Norconex got its start as a small professional services company 10 years ago and has since established itself as a specialist in enterprise search products and services. We have also grown to provide professional support to customers for enterprise search and crawling solutions. As the cloud became more secure and gained in popularity, Norconex began offering SaaS (Search as a Solution) and implemented its first fully hosted application.

Norconex also launched two search/discovery analytics products:

With thousands of users, Norconex made its mark in the open-source space by launching universal filesystem and web crawlers that integrate with any search engine or repository (such as Solr, Elasticsearch, HP IDOL, Azure Search, AWS CloudSearch, etc.).

Allowing us to integrate seamlessly are two elite products from our line known as:

As industries changed and evolved over time, we eventually saw an important shift to open-source search solutions. With that change, Norconex has helped organizations convert from commercial architecture to open-source. Even as Google announced the discontinuation of its popular “Google Search Appliance” service, our company has been consulting with GSA customers to help migrate their search needs to other platforms.

With the successful operation of our company over the past 10 years and the implementation of key products and services, our organization has taken the steps necessary to give back to the community in several different forms. Since 2015, we’ve been supporting the women’s soccer movement in Canada and have become a proud sponsor of several girls’ soccer teams near our headquarters.

The journey has been a fun ride with many learnings, successes and challenges along the way but we wouldn’t be able to be here without our amazing staff and clients. Thank you, and here’s to the next 10 years!!

Norconex released an SQL Committer for its open-source crawlers (Norconex Collectors). This enables you to store your crawled information into an SQL database of your choice.

To define an SQL database as your crawler’s target repository, follow these steps:

Download the SQL Search Committer.
Follow the install instructions.

Add this minimalist configuration snippet to your Collector configuration file. It uses an H2 database as an example only. Replace it with your own settings:

<committer class="com.norconex.committer.sql.SQLCommitter">
  <driverPath>/path/to/driver/h2.jar</driverPath>
  <driverClass>org.h2.Driver</driverClass>
  <connectionUrl>jdbc:h2:file:///path/to/db/h2</connectionUrl>
  <tableName>test_table</tableName>
  <createMissing>true</createMissing>
</committer>

Get familiar with additional Committer configuration options. For instance, while the above example will create a table and fields for you, you can also use an existing table, or provide the CREATE statement used to create a table.

For further information:

Norconex just released a Microsoft Azure Search Committer for its open-source crawlers (Norconex Collectors). This empowers Azure Search users with full-featured file system and web crawlers.

If you have not yet discovered Norconex Collectors, head over to the Norconex Collectors website to see what you’ve been missing.

To enable Azure Search as your crawler’s target search engine, follow these steps:

Download the Azure Search Committer.
Follow the install instructions.

Add this minimum required configuration snippet to your Collector configuration file:

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <endpoint>https://example.search.windows.net</endpoint>
  <apiKey>1234567890ABCDEF1234567890ABCDEF</apiKey>
  <indexName>sample-index</indexName>
</committer>

You need to configure your index schema, endpoint, and index name from your Azure Search dashboard. You will also obtain the admin API key from the Azure Search Service dashboard.

The complete list of Committer configuration options is available here. You will need to make sure the fields crawled match those you defined in your Azure Search index (can be achieved from your Collector configuration).

For further information: