Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.

CMIS support

Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content stored in content management systems (CMS). Extra information can be extracted, such as a document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.

Additional ACL support

ACL from your CMS is not the only new type of ACL you can extract.  This new Norconex Filesystem Collector release introduces support for obtaining local filesystem ACL.  These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (since 2.7.0).

Field discovery

You can’t always tell upfront what metadata your crawler will find. One way to discover your fields is to send them all to your Committer, but this approach is not always possible or desirable. You can now store all fields found by the crawler in a local file. Each field is saved once, with sample values to give you a better idea of its nature.
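To make the idea concrete, here is a minimal Python sketch of field discovery. It is not the Collector's implementation: the function name `discover_fields`, the `max_samples` parameter, and the in-memory document format are assumptions made for illustration only.

```python
from collections import OrderedDict


def discover_fields(documents, max_samples=3):
    """Record each field name once, with up to max_samples distinct sample values.

    `documents` is assumed to be an iterable of dicts mapping a field name
    to a list of values (a hypothetical stand-in for crawled metadata).
    """
    fields = OrderedDict()
    for doc in documents:
        for name, values in doc.items():
            samples = fields.setdefault(name, [])
            for value in values:
                if value not in samples and len(samples) < max_samples:
                    samples.append(value)
    return fields


docs = [
    {"title": ["Home"], "content-type": ["text/html"]},
    {"title": ["About us"], "author": ["Jane"], "content-type": ["text/html"]},
]
for name, samples in discover_fields(docs).items():
    print(f"{name}: {samples}")
```

Each field appears once in the result no matter how many documents carry it, which is what makes such a listing useful for planning a repository schema.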

New URL normalization rules

The HTTP Collector adds a few new rules to its GenericURLNormalizer:

  • removeQueryString
  • lowerCase
  • lowerCasePath
  • lowerCaseQuery
  • lowerCaseQueryParameterNames
  • lowerCaseQueryParameterValues
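As a rough illustration of what such rules do, the Python sketch below mimics three of them using only the standard library. The behavior is inferred from the rule names alone and is an assumption; the Python function names are hypothetical and are not part of the Collector.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_query_string(url):
    """Drop everything after the '?' (inferred from the rule name removeQueryString)."""
    return urlunsplit(urlsplit(url)._replace(query=""))


def lower_case_path(url):
    """Lowercase only the path segment (inferred from lowerCasePath)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path.lower()))


def lower_case_query_parameter_names(url):
    """Lowercase parameter names and leave their values untouched."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Note: urlencode re-encodes special characters in names and values.
    query = urlencode([(name.lower(), value) for name, value in pairs])
    return urlunsplit(parts._replace(query=query))


url = "http://Example.com/Some/Path?Name=Value&X=1"
print(remove_query_string(url))
print(lower_case_path(url))
print(lower_case_query_parameter_names(url))
```

Normalizing this way lets a crawler treat URL variants that differ only in letter case or query string as the same document.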

Subdomains being part of a domain

When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider subdomains as being part of the same site (includeSubdomains="true").
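The kind of check involved can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Collector's logic: the function `same_site` is hypothetical, and it treats the start URL's host as the domain rather than deriving a registered domain.

```python
from urllib.parse import urlsplit


def same_site(start_url, candidate_url, include_subdomains=False):
    """Return True if candidate_url belongs to the same site as start_url."""
    start_host = (urlsplit(start_url).hostname or "").lower()
    host = (urlsplit(candidate_url).hostname or "").lower()
    if not start_host or not host:
        return False
    if host == start_host:
        return True
    # A subdomain ends with ".<start host>"; the leading dot avoids
    # matching unrelated hosts such as "notexample.com".
    return include_subdomains and host.endswith("." + start_host)


print(same_site("https://example.com/", "https://blog.example.com/post"))
print(same_site("https://example.com/", "https://blog.example.com/post", True))
```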

Other changes

For a complete list of all additions and changes, refer to the following release notes:

Download

 

Kafka users rejoice! You can now use Norconex open-source crawlers with Apache Kafka, thanks to the Norconex Apache Kafka Committer.

We owe this contribution to Joseph Paulo Mantuano (Senior Developer at The Red Flag Group) and Dan Davis.

The Norconex Collectors community keeps growing. We are thrilled to see the number of integrations grow with it as well. If you know of any Norconex Committer implementation out there, let us know and we’ll add it to the list!

Not yet familiar with Norconex crawlers? Head over to the Norconex HTTP Collector or Norconex Filesystem Collector website to learn more.

Great news! There is now a Google Cloud Search Committer for Norconex Crawlers!

This addition to the Norconex Collector family should delight Google Cloud Search fans. They too can now enjoy the full-featured crawling capabilities offered by Norconex open-source crawlers.

Since this Committer is developed and maintained by Google, you will find installation and configuration documentation on the Google Developers website.

New to Norconex crawlers? Head over to the Norconex Collectors website to start crawling.

Happy crawling!

Amazon Web Services (AWS) and the Canadian Public Sector organized another excellent Public Sector Summit on May 15, 2019. AWS hosted the first such summit in Ottawa last year, but this year’s event attracted a much larger crowd. Thousands of attendees filled Shaw Centre’s entire third floor.

In the keynote sessions, it was great to hear Alex Benay (deputy minister at the Treasury Board of Canada) talk about the government’s modern digital initiative. He discussed the approach, successes, and challenges of the government’s Cloud migration journey. Another excellent speaker was Mohamed Frendi (director of IT at Innovation, Science and Economic Development Canada). He covered Canada’s API Store and how it uses the Cloud to make government data more accessible.

The afternoon session was led by Darin Briskman, an AWS developer evangelist. He talked about Amazon’s self-service analytics tool, called AWS Lake Formation, which combines data from multiple sources to resolve data-driven challenges in a timely manner. Machine learning and AI help in making informed decisions and solving problems. This service is a great fit for Norconex’s open-source crawler products, HTTP Collector and Filesystem Collector, which fetch data from unstructured data sources to make it easy to consume. Collected content and metadata are natively stored in various existing repositories (or formats), including AWS-specific ones like Amazon Elasticsearch Service, Amazon Open Distro Elasticsearch, and Amazon CloudSearch, as well as many others, such as relational databases, Apache Solr, Google Cloud Search, Neo4j, Microsoft Azure Search, Lucidworks, IDOL, and more.

 

The diagrams below provide further explanation. The one showing the crawling spider is particularly exciting, because Norconex crawlers have much potential to help in this area.  See available Norconex Committers.

     

 

AWS Public Sector Summit Event Pass

Selfies with Darin Briskman, Developer Evangelist, AWS and Stevan Beara, Solutions Architect Manager, AWS.

   

 

Norconex crawlers and Neo4j graph database are now a love match! Neo4j is arguably the most popular graph database out there. Use Norconex crawlers to harvest relationships from websites and filesystems and feed them to your favorite graph engine.

This was made possible thanks to none other than French contributor Sylvain Roussy, a recognized Neo4j expert and the author of two Neo4j books. Norconex is proud to have been able to partner with Sylvain to develop a Neo4j Committer for use with its Norconex HTTP and Filesystem Collectors.

To our French-speaking European friends, Sylvain will host a series of Neo4j Meetups at different locations. He will explain how Norconex crawlers can be used to gather graph data from the web to use in Neo4j. The first of the series is taking place on January 24th, in Genève:

Useful Links:

 

Norconex is proud to announce the release of Norconex HTTP Collector version 2.8.0.  This release is accompanied by new releases of many related Norconex open-source products (Filesystem Collector, Importer, Committers, etc.), and together they bring dozens of new features and enhancements highlighted below.

 

Extract a “Featured Image” from web pages

In addition to taking screenshots of web pages, you can now extract the main image of a web page thanks to the new FeaturedImageProcessor. You can specify conditions to identify the image (the first one encountered that matches a minimum size or a given pattern). You also have the option to store the image to file or as a BASE64 string with the crawled document (after scaling it to your preferred dimensions), or to simply store a reference to it.

Limit link extraction to specific page portions

The GenericLinkExtractor can now restrict link extraction to one or more specific sections of a web page. For instance, you may want to follow only the links found in navigation menus and ignore those found in content areas when the latter usually point to other sites you do not want to crawl.
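The idea of limiting link extraction to a page section can be sketched with Python's standard HTML parser. This is an illustration only: the class `SectionLinkParser` is hypothetical, and the choice of a `nav` element as the section boundary is an assumption, not how the GenericLinkExtractor is configured.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect href values only from <a> tags nested inside a given element."""

    def __init__(self, section_tag="nav"):
        super().__init__()
        self.section_tag = section_tag
        self.depth = 0  # how many enclosing section elements are open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == self.section_tag:
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self.section_tag and self.depth > 0:
            self.depth -= 1


page = (
    '<nav><a href="/products">Products</a><a href="/about">About</a></nav>'
    '<div><a href="http://other.example/">Elsewhere</a></div>'
)
parser = SectionLinkParser()
parser.feed(page)
print(parser.links)
```

Only the two navigation links are collected; the link in the content area is ignored.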

Truncate long field values

The new TruncateTagger offers the ability to truncate long values, with the option to replace the truncated portion with a hash to help preserve uniqueness when required. This is especially useful in preventing errors with search engines (or other repositories) that impose field length limitations.
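The truncate-and-hash technique can be illustrated with a short Python sketch. The function `truncate`, its parameters, and the choice of SHA-256 are assumptions made for this example; the TruncateTagger's actual hashing scheme is not described here.

```python
import hashlib


def truncate(value, max_length, append_hash=False, hash_length=8):
    """Shorten value to max_length characters.

    With append_hash=True, the removed tail is replaced by a short hash of
    that tail, so two long values sharing a prefix remain distinguishable.
    """
    if len(value) <= max_length:
        return value
    if not append_hash:
        return value[:max_length]
    keep = max_length - hash_length
    if keep < 0:
        raise ValueError("max_length must be at least hash_length")
    digest = hashlib.sha256(value[keep:].encode("utf-8")).hexdigest()
    return value[:keep] + digest[:hash_length]


first = "a" * 50 + "b"
second = "a" * 50 + "c"
print(truncate(first, 20))
print(truncate(first, 20, append_hash=True))
print(truncate(second, 20, append_hash=True))
```

Without the hash, both sample values collapse to the same 20 characters; with it, they stay within the limit and remain different.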

Add metadata to a document using an external application

The new ExternalTagger allows you to point to an external (i.e., command-line) application to “decorate” a document with extra metadata information. Both the existing document content and metadata can be supplied to the external application. The application output can be in a specific format (JSON, XML, properties) or free-form combined with metadata extraction patterns you can configure. Either standard streams or files can be supplied as arguments to the external application. To transform the content using an external application instead, have a look at the ExternalTransformer, which has also been updated to support metadata.
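The general pattern (pipe a document to a command-line program and merge what it returns into the metadata) can be sketched in Python. Everything here is an assumption for illustration: the function `tag_with_external_app` is hypothetical, and the "external application" is just a Python one-liner that prints a JSON word count.

```python
import json
import subprocess
import sys


def tag_with_external_app(command, content, metadata):
    """Send content to an external command on stdin and merge its JSON output.

    The command is assumed to print a flat JSON object on stdout.
    """
    result = subprocess.run(
        command, input=content, capture_output=True, text=True, check=True
    )
    extra = json.loads(result.stdout)
    merged = {name: list(values) for name, values in metadata.items()}
    for name, value in extra.items():
        merged.setdefault(name, []).append(str(value))
    return merged


# Hypothetical external application: counts the words it reads from stdin.
WORD_COUNT_COMMAND = [
    sys.executable,
    "-c",
    "import sys, json; "
    "print(json.dumps({'wordCount': len(sys.stdin.read().split())}))",
]

tagged = tag_with_external_app(
    WORD_COUNT_COMMAND, "hello crawling world", {"title": ["Demo"]}
)
print(tagged)
```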

Other improvements

This release includes many more new features and enhancements:

  • To create a document checksum, you can now combine metadata with content.
  • The TextPatternTagger can now extract field names dynamically in addition to values.
  • The ReplaceTagger and ReplaceTransformer now support empty/null replacement values.
  • There are new configuration options on the GenericHttpClientFactory:
    • “authFormParams” to add arbitrary parameters to authentication forms.
    • “authPreemptive” to use preemptive authentication with BASIC authentication.
  • The Amazon CloudSearch and Elasticsearch Committers both have a new “fixBadIds” flag to safely handle URLs that violate the ID limitations of those products.
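The first item in the list above, combining metadata with content in a checksum, can be pictured with the Python sketch below. The function `document_checksum`, the field selection, and the use of SHA-256 are assumptions for illustration and do not reflect the Collector's checksummer classes.

```python
import hashlib


def document_checksum(content, metadata, fields):
    """Hash the selected metadata fields together with the document content.

    content is bytes; metadata maps a field name to a list of string values.
    """
    digest = hashlib.sha256()
    for name in sorted(fields):  # sorted for a stable, order-independent result
        for value in metadata.get(name, []):
            digest.update(f"{name}={value};".encode("utf-8"))
    digest.update(content)
    return digest.hexdigest()


body = b"<html>same body</html>"
before = document_checksum(body, {"last-modified": ["2017-01-01"]}, ["last-modified"])
after = document_checksum(body, {"last-modified": ["2017-02-01"]}, ["last-modified"])
print(before != after)
```

With metadata included, a document whose content is unchanged but whose selected fields differ is still detected as modified.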

For the complete list of changes, refer to these product release notes:

Useful links

Norconex released an SQL Committer for its open-source crawlers (Norconex Collectors).  This enables you to store your crawled information into an SQL database of your choice.

To define an SQL database as your crawler’s target repository, follow these steps:

  1. Download the SQL Committer.
  2. Follow the install instructions.
  3. Add this minimalist configuration snippet to your Collector configuration file. It uses an H2 database as an example only; replace the values with your own settings:
  4. Get familiar with additional Committer configuration options.  For instance, while the above example will create a table and fields for you, you can also use an existing table, or provide the CREATE statement used to create a table.

For further information:

Norconex just made it easier to understand the inner workings of its crawlers by creating clickable flow diagrams. Those diagrams are now available on both the Norconex HTTP Collector and Norconex Filesystem Collector websites.

Clicking on a shape will bring up relevant information and offer links to the corresponding documentation on the Collector configuration page.

While not all features are represented in those diagrams, there should be enough to improve your overall understanding and help you better configure your crawling solution.

Have a look now:

Amazon Web Services (AWS) has been all the rage lately, used by many organizations, companies and even individuals. This rise in popularity can be attributed to the sheer number of services AWS provides, such as Elastic Compute Cloud (EC2), Elastic Beanstalk, Amazon S3, DynamoDB and so on. One service that has been getting more exposure recently is Amazon CloudSearch, a platform built on top of the Apache Solr search engine that enables the indexing and searching of documents with a multitude of features.

The main focus of this blog post is crawling and indexing sites. Before delving into that, however, I will briefly go over the steps to configure a simple AWS CloudSearch domain. If you’re already familiar with creating a domain, you may skip to the next section of the post.

 

Starting a Domain

A CloudSearch domain is the search instance where all your documents will be indexed and stored. The level of usage of these domains is what dictates the pricing. Visit this link for more details.

Luckily, the web interface is visually appealing, intuitive and user friendly. First of all, you need an AWS account. If you don’t have one already, you can create one now by visiting the Amazon website. Once you have an account, simply follow these steps:

1) Click the CloudSearch icon (under the Analytics section) in the AWS console.

2) Click the “Create new search domain” button. Give the domain a name that conforms to the rules given in the first line of the popup menu, and select the instance type and replication factor you want. I’ll go for the default options to keep it simple.

3) Choose how you want your index fields to be added. I recommend starting off with the manual configuration option because it gives you the choice of adding the index fields at any time. You can find the description of each index field type here:

4) Set the access policies of your domain. You can start with the first option because it is the most straightforward and sensible way to start.

5) Review your selected options and edit what needs to be edited. Once you’re satisfied with the configurations, click “Confirm” to finalize the process.

 

It’ll take a few minutes for the domain to be ready for use, as indicated by the yellow “LOADING” label that shows up next to the domain name. A green “ACTIVE” label shows up once the loading is done.

Now that the domain is fully loaded and ready to be used, you can choose to upload documents to it, add index fields, add suggesters, add analysis schemes and so on. Note, however, that the domain will need to be re-indexed for every change that you apply. This can be done by clicking the “Run indexing” button that pops up with every change. The time it takes for the re-indexing to finish depends on the number of documents contained in the domain.

As mentioned previously, the main focus of this post is crawling sites and indexing the data to a CloudSearch domain. At the time of this writing, very few crawlers are able to commit to a CloudSearch domain, and the ones that do are unintuitive and needlessly complicated. The Norconex HTTP Collector is the only crawler whose CloudSearch support is both intuitive and straightforward. The remainder of this blog post guides you through the steps needed to set up a crawler and index the data to a CloudSearch domain, keeping each step as simple and informative as possible.

 

Setting up the Norconex HTTP Collector

The Norconex HTTP Collector will be installed and configured in a Linux environment using Unix syntax. You can still, however, install on Windows, and the instructions are just as simple.

Unzip the downloaded file and navigate to the extracted folder. If needed, make sure to set the directory as readable and writable using the chmod command. Once that’s done, follow these steps:

1) Create a directory and name it testCrawl. In that directory, create a file named config.xml and populate it with the minimal configuration file, which you can find in the examples/minimum directory.

2) Give the crawler a name in the <httpcollector id="..."> tag. I’ll name my crawler TestCrawl.

3) Set progress and log directories in their respective tags:

 

4) Within <crawlerDefaults>, set the work directory where the files will be stored during the crawling process:

5) Type the site you want crawled in the [tag name] tag:

Another method is to create a file with a list of URLs you want crawled, and point to the file:

6) If needed, set a limit on how deep (from the start URL) the crawler can go and a limit on the number of documents to process:

7) If needed, you can set the crawler to ignore documents with specific file extensions. This is done by using the ExtensionReferenceFilter class as follows:

8) You will most likely want to use an importer to parse the crawled data before it’s sent to your CloudSearch domain. The Norconex Importer is an intuitive, easy-to-use tool with a plethora of configuration options, offering a multitude of pre- and post-parse taggers, transformers, filters and splitters, all of which can be found here. As a starting point, you may want to use the KeepOnlyTagger as a post-parse handler, where you get to decide which metadata fields to keep:

Be sure that your CloudSearch domain has been configured to support the metadata fields described above. Also, make sure to have a ‘content’ field in your CloudSearch domain as the committer assumes that there’s one.

The config.xml file should look something like this:

 

The Norconex CloudSearch Committer

The Norconex HTTP Collector is compatible with several committers such as Solr, Lucidworks, Elasticsearch, etc. Visit this website to find out what other committers are available. The latest addition to this set is the AWS CloudSearch committer. It is especially useful since the very few publicly available CloudSearch committers are needlessly complicated and unintuitive. Luckily for you, Norconex solves this issue by offering a very simple and straightforward CloudSearch committer. All you have to do is:

1) Download the JAR file from here, and move it to the lib folder of the HTTP Collector installation.

2) Add the following towards the end of the <crawler></crawler> block (right after specifying the importer) in your config.xml file:

You can obtain the URL for your document endpoint from your CloudSearch domain’s main page. As for the AWS credentials, specifying them in the config file could result in an error due to a bug in the committer. Therefore, we strongly recommend that you DO NOT include the <accessKey> and <secretAccessKey> variables. Instead, we recommend that you set two environment variables, AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY with their respective values. To obtain and use these values, refer to the AWS documentation.
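Since the credentials come from the environment, it can help to verify that both variables are set before launching a crawl. The Python sketch below is one way to do such a pre-flight check; the function `missing_credentials` is hypothetical, and only the two variable names are taken from the text above.

```python
import os

REQUIRED_VARIABLES = ("AWS_ACCESS_KEY", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these environment variables before crawling: " + ", ".join(missing))
else:
    print("AWS credentials found in the environment.")
```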

 

Run the Crawler!

All that is left to do is to run the HTTP Collector using the Linux shell script (from the main collector directory):

Give the crawler some time to crawl the specified URLs until it reaches the <maxDepth> or <maxDocuments> limit, or until it finds no more URLs to crawl. Once the crawling is complete, the successfully processed documents will be committed to the domain specified in the <documentEndpoint> option.
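How those stopping conditions interact can be sketched with a small breadth-first traversal in Python. This is a toy model, not the Collector's crawl loop: the function `crawl`, the `get_links` callback, and the use of -1 for "no limit" are assumptions made for the example.

```python
from collections import deque


def crawl(start_urls, get_links, max_depth=-1, max_documents=-1):
    """Breadth-first crawl that stops at a depth limit, a document limit,
    or when no URLs remain. A negative limit means unlimited."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    processed = []
    while queue:
        if max_documents >= 0 and len(processed) >= max_documents:
            break
        url, depth = queue.popleft()
        processed.append(url)
        if max_depth >= 0 and depth >= max_depth:
            continue  # process this page but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return processed


# Hypothetical site structure standing in for real pages.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": ["/d"], "/d": []}


def links_of(url):
    return site.get(url, [])


print(crawl(["/"], links_of, max_depth=1))
print(crawl(["/"], links_of, max_documents=2))
print(crawl(["/"], links_of))
```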

To confirm that the documents have indeed been uploaded, you can go to the domain’s main page and see how many documents are stored and run a test search.

Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector.  This update, along with related component updates, introduces several interesting features.

HTTP Collector changes

The following items are specific to the HTTP Collector.  For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.

Crawling of JavaScript-driven pages

The alternative document fetcher PhantomJSDocumentFetcher now makes it possible to crawl web pages with JavaScript-generated content. This much-awaited feature is now available thanks to integration with the open-source PhantomJS headless browser. As a bonus, you can also take screenshots of web pages you crawl.

More ways to extract links

This release introduces two new link extractors. You can now use the XMLFeedLinkExtractor to extract links from RSS or Atom feeds. For maximum flexibility, the RegexLinkExtractor can be used to extract links using regular expressions.
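Regex-based link extraction in general can be sketched as follows in Python. The pattern and the function `extract_links` are assumptions chosen for the example; they are not the RegexLinkExtractor's defaults.

```python
import re

# Illustrative pattern: an http(s) URL running up to whitespace, a quote,
# or an angle bracket.
LINK_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_links(text):
    """Return unique URLs in order of first appearance."""
    links = []
    for match in LINK_PATTERN.finditer(text):
        url = match.group(0)
        if url not in links:
            links.append(url)
    return links


sample = 'See https://example.com/a and "http://example.org/b?x=1" then https://example.com/a again'
print(extract_links(sample))
```

A regular expression makes it possible to pull links out of content that is not HTML at all, such as plain text or script blocks.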

Generic changes

The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).

Much improved XML configuration validation

You no longer have to hunt for a misconfiguration. Schema-based XML configuration validation was added, and you will now get errors if any configuration option has invalid XML syntax. This validation can be triggered from the command prompt with this new flag: -k or --checkcfg.

Enter durations in human-readable format

Having to convert durations to milliseconds is not very user-friendly. Anywhere in your XML configuration where a duration is expected, you can now use a human-readable representation (English only) as an alternative.
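To illustrate the kind of conversion involved, here is a minimal Python sketch that turns an English duration into milliseconds. The accepted grammar, the unit table, and the function `parse_duration` are assumptions for this example; the exact expressions the Collector accepts are not specified here.

```python
import re

UNIT_MILLIS = {
    "millisecond": 1,
    "second": 1000,
    "minute": 60 * 1000,
    "hour": 60 * 60 * 1000,
    "day": 24 * 60 * 60 * 1000,
}
TOKEN = re.compile(r"(\d+)\s*([a-zA-Z]+)")


def parse_duration(text):
    """Convert text such as '5 minutes and 30 seconds' to milliseconds.

    A plain integer is taken to already be a number of milliseconds.
    """
    text = text.strip()
    if text.isdigit():
        return int(text)
    tokens = TOKEN.findall(text)
    if not tokens:
        raise ValueError(f"Cannot parse duration: {text!r}")
    total = 0
    for amount, unit in tokens:
        unit = unit.lower().rstrip("s")  # accept singular and plural forms
        if unit not in UNIT_MILLIS:
            raise ValueError(f"Unknown unit: {unit!r}")
        total += int(amount) * UNIT_MILLIS[unit]
    return total


print(parse_duration("5 minutes and 30 seconds"))
print(parse_duration("2 hours"))
print(parse_duration("1500"))
```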

Lua scripting language

Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer. This gives you one more scripting option available out-of-the-box besides JavaScript/ECMAScript.

Modify documents using an external application

With the new ExternalTransformer, you can now use an external application to perform document transformation. This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.

Combine document fields

The new MergeTagger can be used for combining multiple fields into one. The target field can be either multi-valued or a single value in which the merged values are separated by a character of your choice.
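The two merge modes can be illustrated with a short Python sketch. The function `merge_fields` and its parameters are hypothetical and only model the behavior described above, not the MergeTagger's configuration.

```python
def merge_fields(metadata, from_fields, to_field, single_value=False, separator=", "):
    """Combine the values of several fields into one target field.

    The target is either a multi-valued list or one string in which the
    values are joined with the separator.
    """
    merged = []
    for name in from_fields:
        merged.extend(metadata.get(name, []))
    result = dict(metadata)
    result[to_field] = [separator.join(merged)] if single_value else merged
    return result


doc = {"author": ["Jane"], "contributor": ["Ali", "Chen"]}
print(merge_fields(doc, ["author", "contributor"], "people"))
print(merge_fields(doc, ["author", "contributor"], "people", single_value=True, separator="; "))
```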

New Committers

Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at the time of crawling, or you are not using a repository at all, Norconex Collectors now ship with two file-based Committers for easy consumption by your own process: XMLFileCommitter and JSONFileCommitter. All available committers can be found here.

More

Several additional features or changes can be found in the latest Collector releases.  Among them:

  • New Importer RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
  • New SubstringTransformer for truncating content.
  • New UUIDTagger for giving a unique id to each document.
  • CharacterCaseTagger now supports “swap” and “string” to swap character case and capitalize beginning of a string, respectively.
  • ConstantTagger offers options when dealing with existing values: add to existing values, replace them, or do nothing.
  • Components such as Importer, Committers, etc., are all easier to install thanks to new utility scripts.
  • Document Access-Control-List (ACL) information is now extracted from SMB/CIFS file systems (Filesystem Collector).
  • New ICollectorLifeCycleListener interface that can be added to the collector configuration to be notified and take action when the collector starts and stops.
  • Added “removeTrailingHash” as a new GenericURLNormalizer option (HTTP Collector).
  • New “detectContentType” and “detectCharset” options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detecting them instead (HTTP Collector).
  • Start URLs and start paths can now be dynamically created thanks to IStartURLsProvider and IStartPathsProvider (HTTP Collector and Filesystem Collector).

To get the complete list of changes, refer to the HTTP Collector release notes, the Filesystem Collector release notes, or the release notes of dependent Norconex libraries such as the Importer release notes and Collector Core release notes.

Download