web crawler – Norconex Inc

This blog post will show you how to use Prometheus with your Norconex crawler. This process is possible thanks to Norconex crawlers offering useful metrics via JMX. Using this solution, you can conveniently track the advancement of a crawling task with a quick glance which is especially useful when you have several crawling jobs running simultaneously.

If you don’t already have Prometheus installed, we will also guide you through the installation process using Docker. Already have Prometheus installed? Go ahead and skip the first section.
The required setup consists of three main components: Prometheus, JMX agent, and Norconex web crawler.

StandUp a Prometheus Server

Create a “prometheus-test” folder to store config files.
Create a custom YAML file: premetheus_config.yaml and add the following:

global: 
  scrape_interval: 15s 
  evaluation_interval: 15s 
  scrape_timeout: 10s 

scrape_configs: 
  # job_name: the name you give, usually one for each collector 
  - job_name: 'collector-http' 
    static_configs: 
    - targets:   ['host.docker.internal:9123']

Create a Dockerfile in the same folder. In it, add the Prometheus image to be used, and then add the premetheus_config.yaml file created earlier.

FROM prom/prometheus 
ADD prometheus_config.yaml /etc/prometheus/

Now it is time to build and start up the Prometheus container by running:

docker build -t my-prometheus-image . 
docker run -dp 9090:9090 my-prometheus-image

Confirm the service is running:

docker ps

You should get something like this:

Open your browser, and access Prometheus: http://localhost:9090

JMX Exporter / Prometheus Java Agent

Once Prometheus is up and running, you need to download the Prometheus JMX Java agent plugin. This agent reads information exposed by the crawler registered JMX mBeans and is intended to be run as a Java Virtual Machine (JVM) agent.

The latest plugin version will be used (version 0.18 as of this writing). Download the jar file, and save it in the prometheus-test folder. This agent requires Java 18. If you don’t already have it installed, download it here.

Next, you will create a jmx_config.yaml file to define the settings used by the JMX agent. Add the following to the file:

--- startDelaySeconds: 0 

ssl: false 
lowercaseOutputName: false 
lowercaseOutputLabelNames: false

Norconex Web Crawler

Norconex has two types of crawlers: web and file-system. We will use the web version in our test, so go ahead and download the crawler if you haven’t already done so.

To start crawling, you need to define the start URL and other settings, which should be defined in the crawler-config.xml file. Let’s create one now.

In the “prometheus-test” folder, create an XML file called “crawler_config.xml”. Then add the following:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="prometheus-test-collector">

  <!-- Decide where to store generated files. -->
  <workDir>${workdir}</workDir>
  <deferredShutdownDuration>10 seconds</deferredShutdownDuration>
  
  <crawlers>
    <crawler id="prometheus-test-crawler">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
        <url>https://www.britannica.com</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>1</maxDepth>
	  <numThreads>${numThreads|'3'}</numThreads>
	  <maxDocuments>${maxDocuments|'1000'}</maxDocuments>
	  <canonicalLinkDetector ignore="true" />
	  <robotsTxt ignore="true" />
	  <robotsMeta ignore="true" />
	  <orphansStrategy>IGNORE</orphansStrategy>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolver ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="2 seconds" />
      
      <referenceFilters>
        <filter class="ReferenceFilter" onMatch="exclude">
	  <valueMatcher method="regex">.*literature.*</valueMatcher>
        </filter>
      </referenceFilters>
      
      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <handler class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fieldMatcher method="csv">title,document.reference</fieldMatcher>      
          </handler>
        </postParseHandlers>
      </importer> 
      
      <!-- Decide what to do with your files by specifying a Committer. -->
      <committers>
        <committer class="core3.fs.impl.XMLFileCommitter">
          <docsPerFile>250</docsPerFile>
	  <indent>4</indent>
	  <splitUpsertDelete>false</splitUpsertDelete>
        </committer>
      </committers>

    </crawler>
  </crawlers>

</httpcollector>

Start the Crawler

Initiating the crawling task and enabling Prometheus to fetch metrics from the crawler is a straightforward process. But to ensure reproducibility, create a batch file (make it an equivalent script file on Unix/Linux) that contains the necessary command. This way, you can effortlessly launch the crawler whenever required.

In the “prometheus-test” folder, create a run-job.bat file. Then add the following:

@echo off 

set CRAWLER_HOME=path\to\Norconex\web\crawler\folder\ 
set TEST_DIR=path\to\prometheus\test\folder 

java -javaagent:%TEST_DIR%\jmx_prometheus_javaagent-0.18.0.jar=9123:%TEST_DIR%\jmx_config.yaml ^ 
     -DenableJMX=true ^ 
     -Dlog4j2.configurationFile="%CRAWLER_HOME%\log4j2.xml" ^ 
     -Dfile.encoding=UTF8 ^ -Dworkdir="%TEST_DIR%\workdirs" ^ 
     -cp "%CRAWLER_HOME%\lib\*" ^ 
     com.norconex.collector.http.HttpCollector start -clean -config=%TEST_DIR%\crawler_config.xml

Notice that a port is specified in the command. The port corresponds to the one set in the prometheus_config.yaml: scrape_config section. You can define more than one job at a time using the same hostname and a different port number for each job.

Run the run-job.bat file to start the crawler.

After starting the crawler, you will see logs being written to the console. You can now switch over to your Prometheus Dashboard and try one of the following queries:

{job=”collector-http”}
{job=”collector-http”, key=~”DOCUMENT_QUEUED|DOCUMENT_COMMITTED.*”}
{job=”collector-http”, key=~”DOCUMENT_FETCHED|DOCUMENT_COMMITTED.*”}
{job=”collector-http”, key=~”DOCUMENT_QUEUED|DOCUMENT_FETCHED|DOCUMENT_COMMITTED.*”}
{job=”collector-http”, key=~”DOCUMENT_.*|.*REJECTED.*”}

These queries would return the number of documents queued, fetched, and committed. Plus, the query results will show you the number of rejected documents. The job name refers to the job defined in the prometheus_config.yaml file: scrape_config section. The key in each query corresponds to an event type gathered by the crawler, importer, committer, and collector core. By specifying different events in the key, you can view the information you’re interested in regarding a specific crawling job.

You have abundant options for what events you include in your search. Here are some common ones:

DOCUMENT_COMMITTED_DELETE
DOCUMENT_COMMITTED_UPSERT
DOCUMENT_FETCHED
DOCUMENT_QUEUED
DOCUMENT_PROCESSED
REJECTED_UNMODIFIED
REJECTED_DUPLICATE
REJECTED_BAD_STATUS

As you enter the query in the search box, the result will be displayed almost instantly. You can then view it in either Table or Graph format.

Table Format:

Graph Format:

The graph generated by Prometheus offers a visual depiction of the crawling job’s advancement. As shown in the above graph, the golden line represents the number of documents queued, while the purple line depicts the number of processed documents. Eventually, these two lines will intersect after all documents have been processed, as demonstrated below. This depiction allows you to quickly assess the progress of the crawling job, without having to access and inspect the logs.

Conclusion

With Prometheus, monitoring the progress of single or multiple crawling jobs is no longer a hassle. There’s no need to open multiple consoles for each crawler to check the progress—Prometheus can take care of it all to give you an instant, at-a-glance visual. Just select the events you’re interested in, and then display them visually to save time on your daily monitoring task.

While your interest in events may vary, setting up this configuration requires less than an hour. We strongly recommend giving it a shot using our web or file system crawler. So go ahead and experiment with different combinations of events that align with your monitoring requirements and preferences.

Feel free to leave us feedback on what you think of our crawlers or what type of crawler monitoring you find the most useful. We’d love to hear your thoughts!

This vulnerability impacts Log4J version 2.x, version 1.2 is not affected (source). Norconex HTTP Collector version 2.x use Log4J v1.2.17 and thus are not affected. Version 3 of the Collector uses Log4J v2.17.1, which Apache has patched.

Note: Unless you made it so on purpose, the HTTP Collector does not run as a service accessible from the internet.

Norconex is proud to announce the next major release of its popular open-source web crawler (also referred to as “Norconex HTTP Collector”). After a couple of years of development, you will find this new version was well worth the wait.

Not only does it introduce many new features, but it is also more flexible with even more documentation. Many of these improvements come from community feedback so long-term users deserve a pat on the back. This release is also yours.

If you are too eager to get started, you can download it now and follow its website documentation. Otherwise, keep reading for a glance at the new features.

What’s New?

Introduced features are too many to list here, but we’ll highlight some of the most significant.

Crawling of JavaScript-Driven Websites

Thanks to browser automation provided by Selenium WebDrivers, you can now use your favorite browser to crawl web pages relying on JavaScript to fully render. Generally speaking, if your browser can render content, the crawler can fetch it. It provides you with the ability to take screenshots of pages you crawl as well.

Multiple Committers

Committers are used to store crawled information into a target location, or repository of your choice. This version allows you to specify any number of committers to have your data sent to multiple targets at once (database, search engine, filesystem, etc.). It is also possible to perform simple routing as well.

Easier to deploy

Variables in configuration files can now be resolved against system properties and environment variables. Logging has been abstracted using SLF4J and now prints to STDOUT by default. These changes facilitate deployment in containerized environments (e.g., Docker).

Lots of Events

The event management has been redesigned and simplified. There are now more than 60 different event types being triggered for programmers to listen to and act upon. Ranging from new Committer and Importer events, as well as expected Web Crawler events.

XML Configuration improvements

Similar XML configuration options are now specified in a consistent way. In addition, it is now possible to provide partial class names (e.g., class=“ExtensionReferenceFilter“ instead of class=“com.norconex.collector.core.filter.impl.ExtensionReferenceFilter“). The Importer module also allows you to use XML “flow” to facilitate configuration logic. That is, you can now make use of special XML tags: <if>, <ifNot>, <condition>, <conditions>, <else>, and <then>.

Richer documentation

Documentation has been improved as well:

A new Online Manual is now available, giving great insight into installation and XML configuration.
Dynamic XML documentation combining options from all modules making up the web crawler into a single location.

The JavaDoc now has formatted XML documentation and XML usage, which is easy to copy and paste into your own configuration.

Config Starter

A very simple yet useful configuration generator is now available online. It will help you create your first configuration file. You provide your “start” URL, answer a few questions and your configuration file will be generated for you.

More?

Some additional features:

Can send deletion requests to Committers upon encountering specific events.
Can prevent duplicate documents to be sent to Committers during the same crawling sessions.
Now supports these HTTP standards:
- ETag/If-None-Match
- HTTP Strict Transport Security (HSTS)
- If-Modified-Since
Can now extra links after document importing/parsing as well as from metadata.
The Crawler can be configured to stop itself after encountering specific events.
New command-line options for cleaning previous crawls (starting fresh) and to export/import the crawler internal data store.
Can now transform crawled images.
Additional content and metadata manipulation options.
Committers can now retry failing batches, reducing the batch size between each attempt.
New out-of-the-box CSV Committer.

We recommend you have a look at the release notes for more.

What next?

If you are coming from Norconex HTTP Collector version 2, we recommend you have a look at the version 3 migration notes.

As always, community support is still available on GitHub. While on GitHub, take a moment to “Star” the project.

Come back once in a while as we’ll publish more in-depth articles on specific features or use cases you did not even think was possible to address with our web crawler.

Finally, we always love to know who is using the Norconex Web Crawler. Let us know and you may get listed on our wall of fame.

Enjoy!

In my previous article, I talked about the new Config Starter and its features. This article serves as a follow-up. Now that you know how to generate a crawler configuration file, I will highlight the steps you can undertake to get you started on your own website crawling activities.

We will be using the TOKYO 2020 Olympic Games’ website as the crawl site in this article. The steps are as follows:

First, you will need to generate a basic configuration file targeting the Olympic website, using the Config Starter. In this example, I am targeting English content only, so I am excluding all URLs corresponding to the other languages on the website.

*Note that it is not mandatory to use the Config Starter to generate your configuration file as it only makes a basic configuration file. If you are looking for a more complete solution, you can make your own configuration file with the documentation here.

With your configuration file generated, the next step is to download the Norconex HTTP Collector on your computer from the Norconex Open-Source website and unzip it. If you are using the Config Starter, you will need to download version 3.x.

Once you have the HTTP Collector downloaded on your computer, open your command-line terminal in the location of the folder you just created with your download. To do this, simply use the following command with your file directory: cd C:\file\directory\of\the\collector

With your command-line terminal open, you must now enter the following line with the path to your configuration file:

Windows: collector-http.bat start -config= -config=/path/to/config.xml

Linux: collector-http.sh start -config= -config=/path/to/config.xml

Congratulations! You are now running your crawler. If all went according to plan, you should see something similar to the next image and the data crawled should now be located in the created committer directory (if you are using the same committer as me, it should be in the “work” folder).

Now that you have crawled the Olympic site, go and collect your gold medal!

If you encounter any issues during the process, you can find resolutions on the HTTP Collector GitHub issues page.

[Try] out the new Norconex HTTP Collector Config Starter.

[Download] and [get started] with the Norconex HTTP Collector.

[Learn] more about the inner workings of the Norconex HTTP Collector.

When starting to play with the programming world, it doesn’t take long to notice the sheer amount of options. With so many options available, it’s easy to become overwhelmed by all the information. Sorting through it all is no easy task for the less tech-savvy among us. For that reason, Norconex has put a lot of effort into making its products more user-friendly. With the new Config Starter, everyone can now easily generate a basic configuration file for the Norconex HTTP Collector version 3.

The Norconex HTTP Collector Config Starter is a “wizard” that will generate the configuration file for you to run the Norconex HTTP Collector. Whether you don’t know anything about the world of programming or you just want to quickly set up your crawler, this Config Starter is made for you. In this article, we will go through each section of this wizard and show you how to start your crawler with it.

Collect section

First, we have the Collect section. This section is important because it will collect all the documents from the webpage of your website. Here, you will provide the URL where the crawler collects the data that you need. The collector will start by downloading that page and then following all the links from that webpage. In this section, you can also, if applicable, enter any sections of your site that you do not want to crawl.

Import section

Next, we have the import section. This is where you will choose what to keep from the information collected by the Collector. The first option here lets you remove the data collected from your website’s header, footer and nav sections. Those sections are generally used to navigate through a website and rarely contain any relevant information to crawl.

The second option lets you choose the fields that you want to extract from the data collected. This is useful if your goal is to extract specific data from your website, which will be sent to the committer. If you want to keep all the data, just leave this section empty.

Commit section

Last, we have the Commit section. This section tells your crawler where to save the data you have collected. At the time of writing, we have four available committer options: Local JSON File, Local XML File, Apache Solr or Elasticsearch. If you select one of the last two, you will need to provide the localization of the committer.

If you want to manually configure any section of your configuration file, or use another committer, you can leave any section empty or easily edit it afterward.

When you are done, you can proceed to generate your configuration file; you should get something similar to the configuration below. This configuration comes with a detailed description for all the fields if you want to further customize your crawler later.

<?xml version="1.0" encoding="UTF-8"?>
<!--
 _   _  ___  ____   ____ ___  _   _ _______  __
| \ | |/ _ \|  _ \ / ___/ _ \| \ | | ____\ \/ /
|  \| | | | | |_) | |  | | | |  \| |  _|  \  /
| |\  | |_| |  _ <| |__| |_| | |\  | |___ /  \
|_| \_|\___/|_| \_\\____\___/|_| \_|_____/_/\_\
===============================================

HTTP Collector Configuration File.
 
Generated by: https://opensource.norconex.com/collectors/http/v3/config/starter
Website:      https://opensource.norconex.com/collectors/http/
Manual:       https://opensource.norconex.com/docs/collectors/http/

-->
<httpcollector id="config-id">

  <!--
    Crawler "work" directory.  This is where files downloaded or created as
    part of crawling activities get stored.
    It should be unique to each crawlers.
    -->
  <workDir>./work</workDir>

  <crawlers>
    <crawler id="crawler-id">

      <!--
        Mandatory starting URL(s) where crawling begins.  If you put more
        than one URL, they will be processed together.  You can also
        point to one or more URLs files (i.e., seed lists), or
        point to a sitemap.xml.
        -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://mywebsite.com</url>
      </startURLs>

      <!-- Normalizes incoming URLs. -->
      <urlNormalizer class="GenericURLNormalizer">
        <normalizations>
            removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
            decodeUnreservedCharacters, removeDefaultPort,
            encodeNonURICharacters
        </normalizations>
      </urlNormalizer>

      <!--Handles interval between each page download-->
      <delay default="3000" />

      <!--
        How many threads you want a crawler to use.  Regardless of how many
        thread you have running, the frequency of each URL being invoked
        will remain dictated by the &lt;delay/&gt option above.  Using more
        than one thread is a good idea to ensure the delay is respected
        in case you run into single downloads taking more time than the
        delay specified. Default is 2 threads.
        -->
      <numThreads>2</numThreads>

      <!--
        How many level deep can the crawler go. I.e, within how many clicks
        away from the main page (start URL) each page can be to be considered.
        Beyond the depth specified, pages are rejected.
        The starting URLs all have a zero-depth.  Default is -1 (unlimited)
        -->
      <maxDepth>10</maxDepth>

      <!--
        Crawler "work" directory.  This is where files downloaded or created as
        part of crawling activities (besides logs and progress) get stored.
        It should be unique to each crawlers.
        -->
      <maxDocuments>-1</maxDocuments>

      <!--
        What to do with orphan documents.  Orphans are valid
        documents, which on subsequent crawls can no longer be reached when
        running the crawler (e.g. there are no links pointing to that page
        anymore).  Available options are:
        IGNORE, DELETE, and PROCESS (default).
        -->
      <orphansStrategy>PROCESS</orphansStrategy>

      <!-- Handle robots.txt files. -->
      <robotsTxt ignore="false" />

      <!-- Detects and processes sitemap files. -->
      <sitemapResolver ignore="false" />

      <!-- 
        Detects pages with a canonical link and rejects them in favor of 
        the canonical one.
        -->
      <canonicalLinkDetector ignore="false" />

      <!--
        Filter out matching URLs before they are downloaded. If you 
        want links extracted before a page gets rejected, it needs to be 
        rejected after it was downloaded. You can use <documentFilters>
        instead to achieve this. 
        -->
      <referenceFilters>
        <filter class="RegexReferenceFilter" onMatch="exclude">.*/login/.*</filter>
      </referenceFilters>


      <!--
        Import a document using Norconex Importer module. Here is your chance
        to manipulate a document content and its metadata fields using 
        import handlers before it is sent to your target repository.  
        -->
      <importer>

        <!--
          Pre-parse handlers take place BEFORE a document is converted to 
          plain-text. If you need to deal with the original document format
          (HTML, XML, binary, etc.), define your import handlers here.
          -->
        <preParseHandlers>
          <!-- Remove navigation elements from HTML pages. -->
          <handler class="DOMDeleteTransformer">
            <dom selector="header" />
            <dom selector="footer" />
            <dom selector="nav" />
            <dom selector="noindex" />
          </handler>
        </preParseHandlers>

        <!--
          Post-parse handlers take place AFTER a document is converted to 
          plain-text. At this stage, content should be stripped of formatting
          (e.g., HTML tags) and you should no longer encounter binary content.
          -->
        <postParseHandlers>

          <!-- Rename extracted fields to what you want. -->
          <handler class="RenameTagger">
            <rename toField="title" onSet="replace">
              <fieldMatcher method="csv">dc:title, og:title</fieldMatcher>
            </rename>
          </handler>

          <!-- Make sure we are sending only one value per field. -->
          <handler class="ForceSingleValueTagger" action="keepFirst">
            <fieldMatcher method="csv">title</fieldMatcher>
          </handler>

          <!-- Keep only those fields and discard the rest. -->
          <handler class="KeepOnlyTagger">
            <fieldMatcher method="csv">title</fieldMatcher>
          </handler>
        </postParseHandlers>
      </importer>

      <!--
        Commits a document to a data source of your choice.
        This step calls the Committer module.  The
        committer is a different module with its own set of XML configuration
        options.  Please refer to committer for complete documentation.
        Below is an example using the FileSystemCommitter.
        -->
      <committers>

        <!--
          JSON File Committer.
          
          Store crawled documents to the local file system, in JSON Format.
          
          Web site:
            https://opensource.norconex.com/committers/core/

          Configuration options and output format:  
            https://opensource.norconex.com/committers/core/v3/apidocs/com/norconex/committer/core3/fs/impl/JSONFileCommitter.html
          -->
        <committer class="JSONFileCommitter"/>

      </committers>

    </crawler>
  </crawlers>

</httpcollector>

To start running your crawler, just refer to the location of the configuration file in the start command in the command-line console.

[Try] out the new Norconex HTTP Collector Config Starter.

[Download] and [get started] with the Norconex HTTP Collector.

[Learn] more about the inner workings of the Norconex HTTP Collector.

Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.

CMIS support

Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content management systems (CMS) content. Extra information can be extracted, such as document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.

<startPaths>
    <path>cmis-atom:https://norconex.com/mycms/cmisatom!/my/starting/path</path>
</startPaths>

Additional ACL support

ACL from your CMS is not the only new type of ACL you can extract. This new Norconex Filesystem Collector release introduces support for obtaining local filesystem ACL. These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (since 2.7.0).

Field discovery

You can’t always tell upfront what metadata your crawler will find. One way to discover your fields is to send them all to your Committer. This approach is not always possible nor desirable. You can now store to a local file all fields found by the crawler. Each field will be saved once, with sample values to give you a better idea of their nature.

<tagger class="com.norconex.importer.handler.tagger.impl.FieldReportTagger" 
    maxSamples="2" file="/path/to/report/myfields.csv" />

New URL normalization rules

The HTTP Collector adds a few new rules GenericURLNormalizer. Those are:

removeQueryString
lowerCase
lowerCasePath
lowerCaseQuery
lowerCaseQueryParameterNames
lowerCaseQueryParameterValues

Subdomains being part of a domain

When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider sub-domains as being the same site (includeSubdomains="true").

Other changes

For a complete list of all additions and changes, refer to the following release notes:

Download

Great news! There is now a Google Cloud Search Committer for Norconex Crawlers!

This addition to Norconex Collector family should delight Google Cloud Search fans. They too can now enjoy the full-featured crawling capabilities offered by Norconex Open-Source crawlers.

Since this Committer is developed and maintained by Google, you will find installation and configuration documentation on the Google Developers website.

New to Norconex crawlers? Head over to the Norconex Collectors website to start crawling.

Happy crawling!

Norconex is proud to announce the release of Norconex HTTP Collector version 2.8.0. This release is accompanied by new releases of many related Norconex open-source products (Filesystem Collector, Importer, Committers, etc.), and together they bring dozens of new features and enhancements highlighted below.

Extract a “Featured Image” from web pages

[ezcol_1half]

In addition to taking screenshots of webpages, you can now extract the main image of a web page thanks to the new FeaturedImageProcessor. You can specify conditions to identify the image (first one encountered matching a minimum site or a given pattern). You also have the option to store the image on file or as a BASE64 string with the crawled document (after scaling it to your preferred dimensions) or simply store a reference to it.

[/ezcol_1half]

[ezcol_1half_end]

<preImportProcessors>
  <processor class="com.norconex.collector.http.processor.impl.FeaturedImageProcessor">
    <minDimensions>300x400</minDimensions>
    <scaleDimensions>50</scaleDimensions>
    <imageFormat>jpg</imageFormat>
    <scaleQuality>max</scaleQuality>  	
    <storage>inline</storage>
  </processor>
</preImportProcessors>

[/ezcol_1half_end]

Limit link extraction to specific page portions

[ezcol_1half]

The GenericLinkExtractor now makes it possible to only extract links to be followed found within one or more specific sections of a web page. For instance, you may want to only extract links found in navigation menus and not those found in content areas in case the links usually point to other sites you do not want to crawl.

[/ezcol_1half]

[ezcol_1half_end]

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
 
  <extractBetween>
    <start><![CDATA[<!-- BEGIN NAV LINKS -->]]></start>
    <end><![CDATA[<!-- END NAV LINKS -->]]></end>
  </extractBetween>
 
  <noExtractBetween>
    <start><![CDATA[<!-- BEGIN EXTERNAL SITES -->]]></start>
    <end><![CDATA[<!-- END EXTERNAL SITES -->]]></end>
  </noExtractBetween>
 
</extractor>

[/ezcol_1half_end]

Truncate long field values

[ezcol_1half]

The new TruncateTagger offers the ability to truncate long values and the option to replace the truncated portion with a hash to help preserve uniqueness when required. This is especially useful in preventing errors with search engines (or other repositories) and field length limitations.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.TruncateTagger"
    fromField="mySuperLongField"
    maxLength="500"
    toField="myTruncatedField"
    overwrite="true"
    appendHash="true"
    suffix="!" />

[/ezcol_1half_end]

Add metadata to a document using an external application

[ezcol_1half]

The new ExternalTagger allows you to point to an external (i.e., command-line) application to “decorate” a document with extra metadata information. Both the existing document content and metadata can be supplied to the external application. The application output can be in a specific format (json, xml, properties) or free-form combined with metadata extraction patterns you can configure. Either standard streams or files can be supplied as arguments to the external application. To transform the content using an external application instead, have a look at the ExternalTranformer, which has also been updated to support metadata.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <command>
    /app/addressExtractor ${INPUT} ${INPUT_META} ${REFERENCE}
  </command>
  <metadata inputFormat="json">
    <pattern field="address" valueGroup="1">
      ^address=(.*)$
    </pattern>
  </metadata>
</tagger>

[/ezcol_1half_end]

Other improvements

This release includes many more new features and enhancements:

To create a document checksum, you can now combine metadata with content.
The TextPatternTagger can now extract field names dynamically in addition to values.
The ReplaceTagger and ReplaceTransformer now support empty/null replacement values.
There are new configuration options on the GenericHttpClientFactory:
- “authFormParams” to add arbitrary parameters to authentication forms.
- “authPreemptive” to use preemptive authentication with BASIC authentication.
The Amazon CloudSearch and Elasticsearch Committers both have a new “fixBadIds” flag to safely handle URLs that do not meet product limitations.

For the complete list of changes, refer to these product release notes:

Useful links

Download Norconex HTTP Collector
Get started with Norconex HTTP Collector
Report your issues and questions on Github
Contact Norconex

Norconex just made it easier to understand the inner-workings of its crawlers by creating clickable flow diagrams. Those diagrams are now available as part of both the Norconex HTTP Collector and Norconex Filesystem Collector websites.

Clicking on a shape will bring up relevant information and offer links to the corresponding documentation in the Collector configuration page.

While not all features are represented in those diagrams, there should be enough to improve your overall understanding and help you better configure your crawling solution.

Have a look now:

Amazon Web Services (AWS) have been all the rage lately, used by many organizations, companies and even individuals. This rise in popularity can be attributed to the sheer number of services provided by AWS, such as Elastic Compute (EC2), Elastic Beanstalk, Amazon S3, DynamoDB and so on. One particular service that has been getting more exposure very recently is the Amazon CloudSearch service. It is a platform that is built on top of the Apache Solr search engine and enables the indexing and searching of documents with a multitude of features.
The main focus of this blog post is crawling and indexing sites. Before delving into that, however, I will briefly go over the steps to configure a simple AWS CloudSearch domain. If you’re already familiar with creating a domain, you may skip to the next section of the post.

Starting a Domain

A CloudSearch domain is the search instance where all your documents will be indexed and stored. The level of usage of these domains is what dictates the pricing. Visit this link for more details.
Luckily, the web interface is visually appealing, intuitive and user friendly. First of all, you need an AWS account. If you don’t have one already, you can create one now by visiting the Amazon website. Once you have an account, simply follow these steps:

1) Click the CloudSearch icon (under the Analytics section) in the AWS console.

2) Click the “Create new search domain” button. Give the domain a name that conforms to the rules given in the first line of the popup menu, and select the instance type and replication factor you want. I’ll go for the default options to keep it simple.

3) Choose how you want your index fields to be added. I recommend starting off with the manual configuration option because it gives you the choice of adding the index fields at any time. You can find the description of each index field type here:

4) Set the access policies of your domain. You can start with the first option because it is the most straightforward and sensible way to start.

5) Review your selected options and edit what needs to be edited. Once you’re satisfied with the configurations, click “Confirm” to finalize the process.

It’ll take a few minutes for the domain to be ready for use, as indicated by the yellow “LOADING” label that shows up next to the domain name. A green “ACTIVE” label shows up once the loading is done.

Now that the domain is fully loaded and ready to be used, you can choose to upload documents to it, add index fields, add suggesters, add analysis schemes and so on. Note, however, that the domain will need to be re-indexed for every change that you apply. This can be done by clicking the “Run indexing” button that pops up with every change. The time it takes for the re-indexing to finish depends on the number of documents contained in the domain.

As mentioned previously, the main focus of this post is crawling sites and indexing the data to a CloudSearch domain. At the time of this writing, there are very few crawlers that are able to commit to a CloudSearch domain, and the ones that do are unintuitive and needlessly complicated. The Norconex HTTP Collector is the only crawler that has CloudSearch support that is very intuitive and straightforward. The remainder of this blog post aims to guide you through the steps necessary to set up a crawler and index the data to a CloudSearch domain in as simple and informative steps as possible.

Setting up the Norconex HTTP Collector

The Norconex HTTP Collector will be installed and configured in a Linux environment using Unix syntax. You can still, however, install on Windows, and the instructions are just as simple.

Unzip the downloaded file and navigate to the extracted folder. If needed, make sure to set the directory as readable and writable using the chmod command. Once that’s done, follow these steps:

1) Create a directory and name it testCrawl. In the folder myCrawler, create a file config.xml and populate it with the minimal configuration file, which you can find in the examples/minimum directory.

2) Give the crawler a name in the <httpcollector id="..."> I’ll name my crawler TestCrawl.

3) Set progress and log directories in their respective tags:

<progressDir>./testCrawl/progressdir</progressDir>
<logsDir>./testCrawl/logsDir</logsDir>

4) Within <crawlerDefaults>, set the work directory where the files will be stored during the crawling process:

<workDir>./testCrawl/workDir</workDir>

5) Type the site you want crawled in the [tag name] tag:

<url>http://beta2.norconex.com/</url>

Another method is to create a file with a list of URLs you want crawled, and point to the file:

<urlsFile>./urls/urlFile</urlsFile>

6) If needed, set a limit on how deep (from the start URL) the crawler can go and a limit on the number of documents to process:

<maxDepth>2</maxDepth>
<maxDocuments>10</maxDocuments>

7) If needed, you can set the crawler to ignore documents with specific file extensions. This is done by using the ExtensionReferenceFilter class as follows:

<referenceFilters>
  	<filter
     	class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"                                                                             
            onMatch="exclude" caseSensitive="false">
         	png,gif,jpg,jpeg,js,css
  	</filter>
</referenceFilters>

8) You will most likely want to use an importer to parse the crawled data before it’s sent to your CloudSearch domain. The Norconex importer is a very intuitive and easy-to-use tool with a plethora of different configuration options, offering a multitude of pre- and post-parse taggers, transforms, filters and splitters, all of which can be found here. As a starting point, you may want to use the KeepOnlyTagger as a post-parse handler, where you get to decide on what metadata fields to keep:

<importer>
      <postParseHandlers>
         <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,description</fields>
         </tagger>
       </postParseHandlers>
</importer>

Be sure that your CloudSearch domain has been configured to support the metadata fields described above. Also, make sure to have a ‘content’ field in your CloudSearch domain as the committer assumes that there’s one.

The config.xml file should look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2015 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="TestCrawl">

  <!-- Decide where to store generated files. -->
  <progressDir>../myCrawler/testCrawl/progress</progressDir>
  <logsDir>../myCrawler/testCrawl/logs</logsDir>

  <crawlers>
    <crawler id="CloudSearch">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://beta2.norconex.com/</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>../myCrawler/testCrawl</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>2</maxDepth>
      <maxDocuments>10</maxDocuments>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Before 2.3.0: -->
      <sitemap ignore="true" />
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

	  
      <referenceFilters>
      	<filter class="$filterExtension" 
			onMatch="exclude"
			caseSensitive="false" >
			png,gif,jpg,jpeg,js,css
		</filter>
      </referenceFilters>

      
      <!-- Document importing -->
 
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,description/fields>
          </tagger>
        </postParseHandlers>
      </importer> 

 	</crawler>
  </crawlers> 
</httpcollector>

The Norconex CloudSearch Committer

The Norconex http collector is compatible with several committers such as Solr, Lucidworks, Elasticsearch, etc. Visit this website to find out what other committers are available. The latest addition to this set of committers is the AWS CloudSearch committer. This is an especially useful committer since the very few publicly available CloudSearch committers are needlessly complicated and unintuitive. Luckily for you, Norconex solves this issue by offering a very simple and straightforward CloudSearch committer. All you have to do is:

1) Download the JAR file from here, and move it to the lib folder of the http collector folder.

2) Add the following towards the end of the <craweler></crawler> block (right after the specifying the importer) in your config.xml file:

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <documentEndpoint></documentEndpoint>
    <accessKey></accessKey>
    <secretAccessKey></secretAccessKey>
</committer>

You can obtain the URL for your document endpoint from your CloudSearch domain’s main page. As for the AWS credentials, specifying them in the config file could result in an error due to a bug in the committer. Therefore, we strongly recommend that you DO NOT include the <accessKey> and <secretAccessKey> variables. Instead, we recommend that you set two environment variables, AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY with their respective values. To obtain and use these values, refer to the AWS documentation.

Run the Crawler!

All that is left to do is to run the http collector using the Linux shell script (from the main collector directory):

./collector-http.sh -a start -c ./myCrawler/config.xml

Give the crawler some time to crawl the specified URLs, until it reaches the <maxDepth> or <maxDocuments> constraints, or if it finds no more URLs to crawl. Once the crawling is complete, the successfully processed documents will be committed to the domain specified in the <documentEndpoint> option.

To confirm that the documents have indeed been uploaded, you can go to the domain’s main page and see how many documents are stored and run a test search.