Latest Articles – Norconex Inc

Introduction

When building a robust web crawling operation, regional access restrictions can pose serious challenges, especially if your infrastructure is centralized in a location like the United States, but the data you need resides behind geo-fenced websites in countries like Japan.

The Problem

Traditionally, the go-to workaround would be to deploy crawling nodes in each target region, such as setting up infrastructure in Asia Pacific (Tokyo) to reach Japanese-only content. While technically effective, this approach quickly becomes costly, operationally complex, and hard to scale.

What Are IPRoyal Proxies?

IPRoyal offers a wide range of proxy services designed to help users browse or crawl the web as if they were located in another country. Their residential proxies, in particular, are backed by real IP addresses from real devices, making them highly effective for:

Bypassing geo-restrictions and IP-based content filters
Avoiding CAPTCHAs and IP bans during large-scale crawling
Accessing localized content without deploying regional infrastructure

Unlike datacenter proxies, residential proxies are more trusted by target websites, as they appear like legitimate traffic from real users. This makes IPRoyal an ideal choice for crawling region-sensitive websites, especially when paired with flexible tools like Norconex web-crawler.

A Smarter Alternative: Use Residential Proxies

Instead of managing infrastructure in multiple regions, you can route traffic through IPRoyal’s residential proxies to simulate access from anywhere in the world. This allows your existing US-based Norconex crawler to behave like it’s operating inside Japan; No physical relocation required.

Let’s walk through a real-world example.

Bypassing Region-Redirects

Many Japanese websites appear globally accessible at first glance. You might load jp.example.com just fine, but behind the scenes, the server may redirect you to a regional page like sample.co.jp, which then blocks all non-Japanese IPs.

So even though the crawler gets through the first request, it hits a geo-block on the redirect target and fails to capture the actual content, leaving gaps in your data collection.

With IPRoyal’s Japan proxy IPs, the crawler can seamlessly follow redirects and access the final, localized content as if it were based in Japan.

How to Configure Norconex to Use IPRoyal Proxies

Below is a simple example of how to configure Norconex’s HTTP collector to use IPRoyal proxies only when accessing Japanese content:

<httpFetcherFactory class="GenericHttpFetcherFactory">
  <proxySettings>
    <host>proxy.iproyal.com</host>
    <port>12321</port> <!-- Replace with your assigned port -->
    <scheme>http</scheme> <!-- Or "https" if using TLS-enabled proxies -->
    <credentials>
      <username>your-username</username>
      <password>your-password</password>
    </credentials>
    <urlFilter>.*sample\.co\.jp.*</urlFilter> <!-- Apply proxy only to target domains -->
  </proxySettings>
</httpFetcherFactory>

<httpFetcherFactory class="GenericHttpFetcherFactory">
  <proxySettings>
    <host>proxy.iproyal.com</host>
    <port>12321</port> <!-- Replace with your assigned port -->
    <scheme>http</scheme> <!-- Or "https" if using TLS-enabled proxies -->
    <credentials>
      <username>your-username</username>
      <password>your-password</password>
    </credentials>
    <urlFilter>.*sample\.co\.jp.*</urlFilter> <!-- Apply proxy only to target domains -->
  </proxySettings>
</httpFetcherFactory>

Tip: With “urlFilter”, you can target specific domains for proxy usage, avoiding unnecessary proxy calls and optimizing speed and cost.

Final Thoughts

Pairing Norconex’s flexible crawling architecture with IPRoyal’s location-targeted proxies gives you the power to:

Crawl localized content across the globe
Bypass IP-based redirects and geo-blocks
Avoid costly regional infrastructure

It’s a lightweight, scalable way to unlock data from restricted regions—without the overhead.

In an ideal world, you would leverage users to identify “good” and “bad” search results and then tweak your search engine. However, this strategy may not be possible in the world of enterprise search due to budget or resource constraints. This raises a question: how can relevancy changes be tracked automatically? This blog presents one of many possible ways to track relevancy changes in an enterprise search solution. After the initial setup, this solution can largely be automated.

Introduction

Managing vast amounts of data stored across various file systems can be a daunting task. But it doesn’t have to be! Norconex File System Crawler comes to the rescue, offering a robust solution for efficiently extracting, organizing, and indexing your files.

But did you know you can extend its capabilities without writing a single line of code? In this blog post, you’ll learn how to connect an external application to the Crawler and unleash its full potential.

The Use Case

Both Norconex File System Crawler and Norconex Web Crawler utilize Norconex Importer to extract data from documents. Right out of the box, the Importer supports various file formats, as documented here. But you may encounter a scenario where the Importer cannot parse a document.

One such example is a RAR5 document. At the time of this writing, the latest version of File System Crawler is 2.9.1. Extracting a RAR5 file with this version throws the following exception.

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@35f95a13
...
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@35f95a13
...
Caused by: java.lang.NullPointerException: mainheader is null
...

As you can see, Apache Tika’s RarParser class cannot extract the document. You’ll see how to work around this issue below.

Note: This blog post will focus on a no-code solution. However, if you can code, writing your own custom parser is highly recommended. Look at the Extend the File System Crawler section of the documentation on accomplishing just that.

ExternalTransformer to the Resuce

Many applications support the extraction of RAR files. One such application is 7zip. If you need to, go ahead and install 7zip on your machine now. You’ll need the application moving forward.

Overview

You will run 2 crawlers separately. The first crawls everything normally while ignoring RAR files. It will use the ExternalTransformer to extract the RAR file contents to folder X and do no further processing of the file. The second will crawl the extracted files in folder X.

Configs

Config for the first crawler is as follows, with helpful comment explanations of various options.

<?xml version="1.0" encoding="UTF-8"?>
<fscollector id="fs-collector-main">

#set($workdir = .\workdir-main)
#set($startDir = .\input)
#set($extractedDir = .\extracted)
#set($tagger = "com.norconex.importer.handler.tagger.impl")
#set($filter = "com.norconex.importer.handler.filter.impl")
#set($transformer = "com.norconex.importer.handler.transformer.impl")

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>
	<crawler id="fs-crawler-main">
  	<workDir>${workdir}</workDir>
  	<startPaths>
    	<path>${startDir}</path>
  	</startPaths>
 	 
  	<importer>
    	<!-- do the following before attempting to parse a file -->
    	<preParseHandlers>
      	<transformer class="${transformer}.ExternalTransformer">
        	<!-- apply this transfomer to .rar files only -->
        	<restrictTo field="document.reference">.*\.rar$</restrictTo>
        	<!--
          	calls on 7zip to uncompress the file and place the contents in `extracted` dir
        	-->
        	<command>'C:\Program Files\7-Zip\7z.exe' e ${INPUT} -o${extractedDir} -y</command>
        	<metadata>
          	<pattern toField="extracted_paths" valueGroup="1">
            	^Path = (.*)$
          	</pattern>
        	</metadata>
        	<tempDir>${workdir}/temp</tempDir>
      	</transformer>

      	<!-- stop further processing of .rar files -->
      	<filter class="${filter}.RegexReferenceFilter" onMatch="exclude">
        	<regex>.*\.rar$</regex>
      	</filter>
   	 
    	</preParseHandlers>
  	</importer>
 	 
  	<!--
    	commit extracted files to the local FileSystem
    	You can substitute this with any of the available committers
  	-->
  	<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
    	<directory>${workdir}/crawledFiles</directory>
  	</committer>
	</crawler>
    
  </crawlers>

</fscollector>

This crawler will parse all files normally, except RAR files. When encountering a RAR file, the Crawler will call upon 7zip to extract RAR files and place the extracted files under an extracted folder. No further processing will be done on these RAR files.

The second crawler is configured to simply extract files within the extracted folder. Here is the configuration:

<?xml version="1.0" encoding="UTF-8"?>
<fscollector id="fs-71-collector-extracted">

#set($workdir = .\workdir-extracted)
#set($startDir = .\extracted)

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>

	<crawler id="fs-crawler-extracted">
  	<startPaths>
    	<path>${startDir}</path>
  	</startPaths>

  	<!--
    	commit extracted files to the local FileSystem
    	You can substitute this with any of the available committers
  	-->
  	<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
    	<directory>${workdir}/crawledFiles</directory>
  	</committer>
	</crawler>
    
  </crawlers>

</fscollector>

There you have it! You just extended the capabilities of the File System Crawler without writing a single line of code – a testament to the incredible flexibility offered by the Crawler.

Conclusion

Norconex File System Crawler is undeniably a remarkable tool for web crawling and data extraction. Even more impressive is the ease with which you can extend the Crawler’s capabilities, all without the need for coding expertise. Whether you’re a seasoned professional or just getting started, let the Norconex File System Crawler – free from the complexities of coding – become your trusted companion in unleashing the full potential of your data management endeavours. Happy indexing!

Introduction

Norconex Web Crawler is a full-featured, open-source web crawling solution meticulously crafted to parse, extract, and index web content. The Crawler is flexible, adaptable and user-friendly, making it a top-notch selection for extracting data from the web.

As the volume and complexity of web crawling tasks increase, organizations face challenges in efficiently scaling the Crawler to meet organizational needs. Scaling effectively involves addressing issues related to configuration management, resource allocation, and the handling of large data sets to enable seamless scalability while maintaining data quality and integrity.

In this blog post you will learn how to handle configuration management for medium to large Crawler installations.

The Problem

Norconex Web Crawler only needs to be installed once, no matter how many sites you’re crawling. If you need to crawl different websites requiring different configuration options, you will likely need multiple configuration files. And as Crawling needs further grow, yet more configuration files will be needed. Some parts of these configuration files will inevitably have common elements as well. How can you minimize the duplication between configs?

The Solution: Apache Velocity Templates

Norconex Web Crawler configuration is not a plain XML file, but rather, a Apache Velocity template. Broadly speaking, the configuration file is interpreted by the Velocity Engine before being applied to the Crawler.
You can leverage the Velocity Engine to dynamically provide the appropriate values. The following sections walk you through exactly how to do so.

Variables

To keep things simple, consider a crawling solution that contains just 2 configuration files; one for siteA and one for siteB.

Note: This scenario is for demonstration purposes only. If you only have 2 sites to crawl, the following approach is not recommended.

Default configurations

The configurations for the 2 sites may look as follows.

siteA configuration

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-siteA">
  <workDir>./workDir</workDir>
  <crawlers>
    <crawler id="crawler-siteA">
      <startURLs stayOnDomain="true">
   	  <url>www.siteA.com</url>
      </startURLs>
      <maxDepth>-1</maxDepth>
      <!-- redacted for brevity -->     
    </crawler>
  </crawlers>
</httpcollector>

siteB configuration

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-siteB">
  <workDir>./workDir</workDir>
  <crawlers>
    <crawler id="crawler-siteB">
      <startURLs stayOnDomain="true">
        <url>www.siteB.com</url>
      </startURLs>
      <maxDepth>0</maxDepth>
      <!-- redacted for brevity -->
    </crawler>
  </crawlers>
</httpcollector>

As you can probably see, just 4 differences exist between the two configurations:

httpcollector id
crawler id
StartURLs
maxDepth

The common elements in both configurations should be shared. Below, you’ll learn how to share them with Velocity variables.

Optimized configuration

The following steps will optimize the configuration by extracting dynamic data to dedicated files thereby removing duplication.

First, extract unique items into their respective properties file

siteA.properties

domain=www.siteA.com
maxDepth=-1

siteB.properties

domain=www.cmp-cpm.forces.gc.ca
maxDepth=0

Then, add variables to the Crawler configuration and save it as my-config.xml at the root of your Crawler installation. The syntax to add a variable is ${variableName}.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-${domain}"> <!-- variable added here -->
  <workDir>./workDir</workDir>
  <crawlers>
    <crawler id="crawler-${domain}"> <!-- variable added here -->
      <startURLs stayOnDomain="true">
        <url>${domain}</url> <!-- variable added here -->
      </startURLs>   		 
      <maxDepth>${maxDepth}</maxDepth> <!-- variable added here -->
      <!-- redacted for brevity -->
    </crawler>
  </crawlers>
</httpcollector>

With the variables in place in the Crawler config, the variables file simply needs to be specified to the Crawler start script. This is accomplished with the -variables flag, as follows.

siteA

>collector-http.bat start -clean -config=my-config.xml -variables=siteA.properties

siteB

>collector-http.bat start -clean -config=my-config.xml -variables=siteB.properties

The Crawler will replace the variables in the config XML with what it finds in the .properties file.

The example above is for a Windows machine. If you are on Linux, use the collector-http.sh script instead.

Tip: If you’re interested in seeing what the config will look like after variables are replaced, use the configrender option.

>collector-http.bat configrender -c=my-config.xml -variables=siteA.properties -o=full_config.xml

So far, we have only seen the basics of storing data in variables. But what if siteA and siteB needed to commit documents to separate repositories? Below you’ll see how to leverage the power of Apache Velocity Engine to accomplish just that.

Importing Files

Using variables goes a long way toward organizing multiple configuration files. You can also dynamically include chunks of configuration by utilizing Velocity’s #parse() script element.

To demonstrate, consider that siteA is required to commit documents to Azure Cognitive Search and siteB to Elasticsearch. The steps below will walk you through how to accomplish just that.

First, you need 2 committer XML files.

committer-azure.xml

<committer class="AzureSearchCommitter">
  <endpoint>https://....search.windows.net</endpoint>   			 
  <apiKey>...</apiKey>
  <indexName>my_index</indexName>
</committer>

committer-es.xml

<committer class="ElasticsearchCommitter">
  <nodes>https://localhost:9200</nodes>
  <indexName>my_index</indexName>
</committer>

Then, augment the Crawler config (my-config.xml), and add the <committers> section

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-${domain}">
  <workDir>./workDir</workDir>
    <crawlers>
      <crawler id="crawler-${domain}">
        <startURLs stayOnDomain="true">
          <url>${domain}</url>
        </startURLs>
  		 
  	<maxDepth>${maxDepth}</maxDepth>
  	
  	<!-- add this section -->
	<committers>
	  #parse("${committer}")
        </committers>
    </crawler>
  </crawlers>
</httpcollector>

Finally, the .properties files must be updated to specify the committer file we required for each.

siteA.properties

domain=www.siteA.com
maxDepth=-1
committer=committer-azure.xml

siteB.properties

domain=www.siteB.com
maxDepth=0
committer=committer-es.xml

Now you can use the configrender option to see the final configuration for each site.

siteA

>collector-http.bat configrender -c=my-config.xml -variables=siteA.properties -o=full_config.xml

Relevant snippet from full_config.xml.

<committers>
  <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
    <endpoint>https://....search.windows.net</endpoint>
    <apiKey>...</apiKey>
    <indexName>my_index</indexName>
    <!-- redacted for brevity -->
  </committer>
</committers>

siteB

>collector-http.bat configrender -c=my-config.xml -variables=siteB.properties -o=full_config.xml

Relevant snippet from full_config.xml.

<committers>
  <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <nodes>https://localhost:9200</nodes>
    <indexName>my_index</indexName>
    <!-- redacted for brevity -->
  </committer>
</committers>

And there you have it! With those simple steps, you can add the correct <committer> to the final configuration for each site.

Conclusion

As the scale and complexity of your projects grow, so does the challenge of managing multiple configuration files. Herein lies the beauty of harnessing the Apache Velocity Template Engine. By leveraging its power, you can streamline and organize your configurations to minimize redundancy and maximize efficiency. Say goodbye to duplicated efforts, and welcome a more streamlined, manageable, and scalable approach to web crawling. Happy indexing!

Introduction

Amazon CloudSearch, a powerful and scalable search and analytics service, has revolutionized how businesses handle data search and analysis. This blog post will walk you through how to set up and leverage Norconex Web Crawler to seamlessly index data to your Amazon CloudSearch domain.

Understanding Norconex Web Crawler

Norconex Web Crawler is an open-source web crawler designed to extract, parse, and index content from the web. For extracting data from the web, Crawler’s flexibility and ease of use make it an excellent choice. Norconex offers a range of committers that index data to various repositories. See https://opensource.norconex.com/committers/ for a complete list of supported target repositories. If the provided committers do not meet your requirements, extend the Committer Core and then create a custom committer to fit your needs.

This blog post will focus on indexing data to Amazon CloudSearch.

Prerequisites

Amazon CloudSearch

Follow the steps below to create a new Amazon CloudSearch Domain.

Browse to https://console.aws.amazon.com/cloudsearch/home, and click the Create a new search domain link.

Enter a Search Domain Name. Next, select search.small and 1 for Desired Instance Type and Desired Replication Count, respectively.

Select Manual configuration from the list of options.

Add 3 fields – title, description, and content, of type text.

Authorize your IP address to send data to this CloudSearch instance. Click on Allow access to all services from specific IP(s). Then enter your public IP address.

That’s it! You have now created your own Amazon CloudSearch domain. AWS will take a few minutes to complete the setup procedure.

Important: You will need the accessKey and secretKey for your AWS account. Not sure where to get these values? Contact your AWS administrator.

After a few minutes, go to your CloudSearch Dashboard and make a note of the Document Endpoint.

Norconex Web Crawler

Download the latest version of Crawler from Norconex’s website. At the time of this writing, version 3.0.2 is the most recent.

Download the latest version of Amazon CloudSearch Committer. At the time of this writing, version 2.0.0 is the most recent.

Follow the Automated Install instructions to install Amazon CloudSearch Committer libraries in the Crawler.

Crawler Configuration

The following Crawler configuration will be used for this test. First, place the configuration in the root folder of your Crawler installation. Then, name it my-config.xml.

Ensure that you supply appropriate values for serviceEndpoint, accessKey, and secretKey. On your CloudSearch Dashboard, serviceEndpoint is the Document Endpoint.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Norconex HTTP Crawler">
  <!-- Decide where to store generated files. -->
  <workDir>./output</workDir>
  <crawlers>
    <crawler id="Norconex Amazon CloudSearch Committer Demo">

      <startURLs
   	 stayOnDomain="true"
   	 stayOnPort="true"
   	 stayOnProtocol="true">
   	 <url>https://github.com/</url>
      </startURLs>

      <!-- only crawl 1 page -->     
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolver ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5 seconds" />

      <importer>
 	  <postParseHandlers>
   		 <!-- only keep `description` and `title` fields -->
  		 <handler class="KeepOnlyTagger">
  		   <fieldMatcher method="csv">
   			description,title
   		   </fieldMatcher>
  		</handler>
         </postParseHandlers>
  	 </importer>

      <committers>
	  <!-- send documents to Amazon CloudSearch -->
        <committer class="CloudSearchCommitter">   	
          <serviceEndpoint>...</serviceEndpoint>
          <accessKey>...</accessKey>
          <secretKey>...</secretKey>
  	  </committer>
      </committers>
	 
    </crawler>
  </crawlers>
</httpcollector>

Note that this configuration is the minimal required. To suit your needs, you can set many other parameters. Norconex’s documentation does an excellent job of detailing all the available parameters.

Important: For the purposes of this blog, AWS credentials are specified directly in the Crawler configuration as plain text. This practice is not recommended due to the obvious security issues doing so creates. Accordingly, please consult AWS documentation to learn about securely storing your AWS credentials.

Start the Crawler

Norconex Web Crawler comes packaged with shell scripts to start the application. To start the Crawler, run the following command in the console. The example below is for a Windows machine. If you are on Linux, use the collector-http.sh script instead.

C:\Norconex\norconex-collector-http-3.0.2>collector-http.bat start -clean -config=.\my-config.xml

Recall that you saved the configuration at the root of your Crawler installation.

The crawl job will take only a few seconds since only a single page is being indexed. Once the job completes, browse to your CloudSearch Dashboard. Then run a Test Search with the word github to see that the page was indeed indexed!

Conclusion

Indexing data to Amazon CloudSearch using Norconex Web Crawler opens a world of possibilities for data management and search functionality. Following the steps outlined in this guide, you can seamlessly integrate your data to Amazon CloudSearch, empowering your business with faster, more efficient search capabilities. Happy indexing!

Introduction

In the era of data-driven decision-making, efficient data indexing is pivotal in empowering businesses to extract valuable insights from vast amounts of information. Elasticsearch, a powerful and scalable search and analytics service, has become popular for organizations seeking to implement robust search functionality. Norconex Web Crawler offers a seamless and effective solution for indexing web data to Elasticsearch.

In this blog post, you will learn how to utilize Norconex Web Crawler to index data to Elasticsearch and enhance your organization’s search capabilities.

Understanding Norconex Web Crawler

Norconex Web Crawler is an open-source web crawler designed to extract, parse, and index content from the web. The crawler’s flexibility and ease of use make it an excellent choice for extracting data from the web. Plus, Norconex offers a range of committers that index data to various repositories. See https://opensource.norconex.com/committers/ for a complete list of supported target repositories. If the provided committers do not meet your organizational requirements, you can extend the Committer Core and create a custom committer.

This blog post will focus on indexing data to Elasticsearch.

Prerequisites

Elasticsearch

To keep things simple, we will rely on Docker to stand up an Elasticsearch container locally. If you don’t have Docker installed, follow the installation instructions on their website. Once Docker is installed, open a command prompt and run the following command.

docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:7.17.10

This command does the following

requests version 7.17.10 of Elasticsearch
maps ports 9200 and 9600
sets the discovery type to “single-node”
disables the security plugin
Starts the Elasticsearch container

Once the container is up, browse to http://localhost:9200 in your favourite browser. You will get a response that looks like this:

{
  "name" : "c6ce36ceee17",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "gGbNNtDHTKCSJnYaycuWzQ",
  "version" : {
  "number" : "7.17.10",
  "build_flavor" : "default",
  "build_type" : "docker",
  "build_hash" : "fecd68e3150eda0c307ab9a9d7557f5d5fd71349",
  "build_date" : "2023-04-23T05:33:18.138275597Z",
  "build_snapshot" : false,
  "lucene_version" : "8.11.1",
  "minimum_wire_compatibility_version" : "6.8.0",
  "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Elasticsearch container is now up and running!

Norconex Web Crawler

Download the latest version of the Web Crawler from Norconex’s website. At the time of this writing, version 3.0.2 is the most recent version.

Download the latest version of Elasticsearch Committer. At the time of this writing, version 5.0.0 is the most recent version.

Follow the automated installation instructions to install the Elasticsearch Committer libraries into the Crawler.

Crawler Configuration

We will use the following Crawler configuration for this test. Place this configuration in the root folder of your Crawler installation, with the filename my-config.xml.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Norconex HTTP Collector">
    <!-- Decide where to store generated files. -->
  <workDir>./output</workDir>
  <crawlers>
	<crawler id="Norconex Elasticsearch Committer Demo">
  	<startURLs 
		stayOnDomain="true" 
		stayOnPort="true" 
		stayOnProtocol="true">
		<url>https://github.com/</url>
  	</startURLs>
  	<!-- only crawl 1 page --> 	 
  	<maxDepth>0</maxDepth>
  	<!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
  	<sitemapResolver ignore="true" />
  	<!-- Be as nice as you can to sites you crawl. -->
  	<delay default="5 seconds" />
  	<importer>
  	  	<postParseHandlers>
  	  	  	<!-- only keep `description` and `title` fields -->
  	  	  	<handler class="KeepOnlyTagger">
  	  	  	  	<fieldMatcher method="csv">
  	  	  	  	  	description,title
  	  	  	  	</fieldMatcher>
  	  	  	</handler>
  	  	</postParseHandlers>
   	</importer>
  	<committers>
 		 <!-- send documents to Elasticsearch -->
   		<committer class="ElasticsearchCommitter">
			<nodes>http://localhost:9200</nodes>
			<indexName>my-index</indexName>
   		</committer>
  	</committers>
 	 
    </crawler>
  </crawlers>
</httpcollector>

Note that this is the minimal configuration required. There are many more parameters you can set to suit your needs. Norconex’s documentation does an excellent job of detailing all the parameters.

Start the Crawler

Norconex Web Crawler comes packaged with shell scripts to start the application. To start the crawler, run the following command in a shell terminal. The example below is for a Windows machine. If you are on Linux, use the collector-http.sh script instead.

C:\Norconex\norconex-collector-http-3.0.2>collector-http.bat start -clean -config=.\my-config.xml

Recall that you saved the Crawler configuration at the root of your Crawler installation.

Since only a single page is being indexed, the crawl job will take only a few seconds. Once the job completes, query the Elasticsearch container by browsing to http://localhost:9200/my-index/_search in your browser. You will see something like this:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
	"total": 1,
	"successful": 1,
	"skipped": 0,
	"failed": 0
  },
  "hits": {
	"total": {
  	"value": 1,
  	"relation": "eq"
	},
	"max_score": 1,
	"hits": [
  	{
    	"_index": "my-index",
    	"_id": "https://github.com/",
    	"_score": 1,
    	"_source": {
      	"title": "GitHub: Let's build from here · GitHub",
      	"description": "GitHub is where over 100 million developers shape the future of software, together. Contribute to the open source community, manage your Git repositories, review code like a pro, track bugs and features, power your CI/CD and DevOps workflows, and secure code before you commit it.",
      	"content": "<redacted for brevity>"
    	}
  	}
	]
  }
}

You can see that the document was indeed indexed!

Conclusion

Norconex Web Crawler streamlines the process of indexing web data into Elasticsearch, making valuable information readily available for search and analytics.
This guide provides step-by-step instructions for integrating your data with Elasticsearch, unleashing potent search capabilities for your organization’s applications. Embrace the powerful synergy of Norconex Web Crawler and Elasticsearch to revolutionize your data indexing journey, empowering your business with real-time insights and effortless data discovery. Happy indexing!

Introduction

Azure Cognitive Search is a robust cloud-based service that enables organizations to build sophisticated search experiences. In this blog post, you will learn how to utilize Norconex Web Crawler to index data into Azure Cognitive Search and enhance your organization’s search capabilities.

Understanding Norconex Web Crawler

This blog post will focus on indexing data to Microsoft Azure Cognitive Search.

Prerequisites

Azure Cognitive Search

Before getting started, make sure you’ve already set up an Azure Cognitive Search service instance through your Azure portal. Consult the official Microsoft documentation for guidance on setting up this service.
After completing the setup, create an Index where you will index/commit your data. Then configure the index with the following fields:

Note: For this exercise, the English – Lucene analyzer will be used for the title, description, and content fields.

Note that the following 3 items are required to configure the Norconex Azure Cognitive Search Committer:

URL (listed on the Overview page of your Azure Cognitive Search portal)
Admin API key (listed under Settings -> Keys)
Index name

Norconex Web Crawler

Download the latest version of the Web Crawler from Norconex’s website. At the time of this writing, version 3.0.2 is the most recent version.

Download the latest version of Azure Search Committer. At the time of this writing, version 2.0.0 is the most recent version.

Follow the Automated Install instructions to install the Azure Search Committer libraries into the Crawler.

Crawler Configuration

We will use the following Crawler configuration for this test. Place this configuration in the root folder of your Crawler installation, with the filename my-config.xml.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Norconex HTTP Collector">
  
  <!-- Decide where to store generated files. -->
  <workDir>./output</workDir>
  <crawlers>
    <crawler id="Norconex Azure Committer Demo">
      <startURLs 
        stayOnDomain="true" 
	stayOnPort="true" 
	stayOnProtocol="true">
	<url>https://github.com/</url>
      </startURLs>
      <!-- only crawl 1 page --> 	 
      <maxDepth>0</maxDepth>
      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolver ignore="true" />
      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5 seconds" />
      <importer>
        <postParseHandlers>
          <!-- only keep `description` and `title` fields -->
          <handler class="KeepOnlyTagger">
            <fieldMatcher method="csv">
              description,title
            </fieldMatcher>
          </handler>
        </postParseHandlers>
      </importer>
      <committers>
        <!-- send documents to Azure Cognitive Search -->
   	<committer class="AzureSearchCommitter">
          <endpoint>https://....search.windows.net</endpoint>			    
            <apiKey>...</apiKey>
            <indexName>...</indexName>
        </committer>
      </committers> 
    </crawler>
  </crawlers>
</httpcollector>

Be sure to appropriately set the endpoint, apiKey, and indexName under the section. Recall that you noted this information while satisfying the Azure Search Prerequisites.

Start the Crawler

Norconex Web Crawler comes packaged with shell scripts to start the application. To start the crawler, run the following command in a shell terminal. The example below is for a Windows machine. If you are using Linux, use the collector-http.sh script instead.

C:\Norconex\norconex-collector-http-3.0.2>collector-http.bat start -clean -config=.\my-config.xml

Recall that you saved the Crawler configuration at the root of your Crawler installation.

Since only a single page is being indexed, the crawl job will only take a few seconds. Once the job completes, you can query the Azure Cognitive Search portal and see the document was indexed!

Common pitfalls

Invalid API key

If the API key is invalid, the Crawler will throw a “Forbidden” error.

Invalid HTTP response: "Forbidden". Azure Response:

Ensure that you use the Admin API key

Invalid index name

If the indexName provided in the Crawler config does not match what is in your Azure Search, you will see this error.

CommitterException: Invalid HTTP response: "Not Found". Azure Response: {"error":{"code":"","message":"The index 'test2' for service 'norconexdemo' was not found."}}

Misconfigured fields in the Azure Search index

If you did not add title, description and content fields to your index, the Crawler will throw an exception referencing the missing field.

CommitterException: Invalid HTTP response: "Bad Request". Azure Response: {"error":{"code":"","message":"The request is invalid. Details: parameters : The property 'content' does not exist on type 'search.documentFields'. Make sure to only use property names that are defined by the type."}}

Conclusion

Azure Cognitive Search, combined with the powerful data ingestion capabilities of Norconex Web Crawler, offers a potent solution for indexing and searching data from various sources. Following the steps outlined in this blog post, you can seamlessly integrate and update your organization’s Azure search index with fresh, relevant data. Leveraging the flexibility and scalability of Azure Cognitive Search will allow you to deliver exceptional search experiences to your users and gain valuable insights from your data. Happy indexing!

This blog post will show you how to use Prometheus with your Norconex crawler. This process is possible thanks to Norconex crawlers offering useful metrics via JMX. Using this solution, you can conveniently track the advancement of a crawling task with a quick glance which is especially useful when you have several crawling jobs running simultaneously.

If you don’t already have Prometheus installed, we will also guide you through the installation process using Docker. Already have Prometheus installed? Go ahead and skip the first section.
The required setup consists of three main components: Prometheus, JMX agent, and Norconex web crawler.

StandUp a Prometheus Server

Create a “prometheus-test” folder to store config files.
Create a custom YAML file: premetheus_config.yaml and add the following:

global: 
  scrape_interval: 15s 
  evaluation_interval: 15s 
  scrape_timeout: 10s 

scrape_configs: 
  # job_name: the name you give, usually one for each collector 
  - job_name: 'collector-http' 
    static_configs: 
    - targets:   ['host.docker.internal:9123']

Create a Dockerfile in the same folder. In it, add the Prometheus image to be used, and then add the premetheus_config.yaml file created earlier.

FROM prom/prometheus 
ADD prometheus_config.yaml /etc/prometheus/

Now it is time to build and start up the Prometheus container by running:

docker build -t my-prometheus-image . 
docker run -dp 9090:9090 my-prometheus-image

Confirm the service is running:

docker ps

You should get something like this:

Open your browser, and access Prometheus: http://localhost:9090

JMX Exporter / Prometheus Java Agent

Once Prometheus is up and running, you need to download the Prometheus JMX Java agent plugin. This agent reads information exposed by the crawler registered JMX mBeans and is intended to be run as a Java Virtual Machine (JVM) agent.

The latest plugin version will be used (version 0.18 as of this writing). Download the jar file, and save it in the prometheus-test folder. This agent requires Java 18. If you don’t already have it installed, download it here.

Next, you will create a jmx_config.yaml file to define the settings used by the JMX agent. Add the following to the file:

--- startDelaySeconds: 0 

ssl: false 
lowercaseOutputName: false 
lowercaseOutputLabelNames: false

Norconex Web Crawler

Norconex has two types of crawlers: web and file-system. We will use the web version in our test, so go ahead and download the crawler if you haven’t already done so.

To start crawling, you need to define the start URL and other settings, which should be defined in the crawler-config.xml file. Let’s create one now.

In the “prometheus-test” folder, create an XML file called “crawler_config.xml”. Then add the following:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="prometheus-test-collector">

  <!-- Decide where to store generated files. -->
  <workDir>${workdir}</workDir>
  <deferredShutdownDuration>10 seconds</deferredShutdownDuration>
  
  <crawlers>
    <crawler id="prometheus-test-crawler">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
        <url>https://www.britannica.com</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>1</maxDepth>
	  <numThreads>${numThreads|'3'}</numThreads>
	  <maxDocuments>${maxDocuments|'1000'}</maxDocuments>
	  <canonicalLinkDetector ignore="true" />
	  <robotsTxt ignore="true" />
	  <robotsMeta ignore="true" />
	  <orphansStrategy>IGNORE</orphansStrategy>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolver ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="2 seconds" />
      
      <referenceFilters>
        <filter class="ReferenceFilter" onMatch="exclude">
	  <valueMatcher method="regex">.*literature.*</valueMatcher>
        </filter>
      </referenceFilters>
      
      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <handler class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fieldMatcher method="csv">title,document.reference</fieldMatcher>      
          </handler>
        </postParseHandlers>
      </importer> 
      
      <!-- Decide what to do with your files by specifying a Committer. -->
      <committers>
        <committer class="core3.fs.impl.XMLFileCommitter">
          <docsPerFile>250</docsPerFile>
	  <indent>4</indent>
	  <splitUpsertDelete>false</splitUpsertDelete>
        </committer>
      </committers>

    </crawler>
  </crawlers>

</httpcollector>

Start the Crawler

Initiating the crawling task and enabling Prometheus to fetch metrics from the crawler is a straightforward process. But to ensure reproducibility, create a batch file (make it an equivalent script file on Unix/Linux) that contains the necessary command. This way, you can effortlessly launch the crawler whenever required.

In the “prometheus-test” folder, create a run-job.bat file. Then add the following:

@echo off 

set CRAWLER_HOME=path\to\Norconex\web\crawler\folder\ 
set TEST_DIR=path\to\prometheus\test\folder 

java -javaagent:%TEST_DIR%\jmx_prometheus_javaagent-0.18.0.jar=9123:%TEST_DIR%\jmx_config.yaml ^ 
     -DenableJMX=true ^ 
     -Dlog4j2.configurationFile="%CRAWLER_HOME%\log4j2.xml" ^ 
     -Dfile.encoding=UTF8 ^ -Dworkdir="%TEST_DIR%\workdirs" ^ 
     -cp "%CRAWLER_HOME%\lib\*" ^ 
     com.norconex.collector.http.HttpCollector start -clean -config=%TEST_DIR%\crawler_config.xml

Notice that a port is specified in the command. The port corresponds to the one set in the prometheus_config.yaml: scrape_config section. You can define more than one job at a time using the same hostname and a different port number for each job.

Run the run-job.bat file to start the crawler.

After starting the crawler, you will see logs being written to the console. You can now switch over to your Prometheus Dashboard and try one of the following queries:

{job=”collector-http”}
{job=”collector-http”, key=~”DOCUMENT_QUEUED|DOCUMENT_COMMITTED.*”}
{job=”collector-http”, key=~”DOCUMENT_FETCHED|DOCUMENT_COMMITTED.*”}
{job=”collector-http”, key=~”DOCUMENT_QUEUED|DOCUMENT_FETCHED|DOCUMENT_COMMITTED.*”}
{job=”collector-http”, key=~”DOCUMENT_.*|.*REJECTED.*”}

These queries would return the number of documents queued, fetched, and committed. Plus, the query results will show you the number of rejected documents. The job name refers to the job defined in the prometheus_config.yaml file: scrape_config section. The key in each query corresponds to an event type gathered by the crawler, importer, committer, and collector core. By specifying different events in the key, you can view the information you’re interested in regarding a specific crawling job.

You have abundant options for what events you include in your search. Here are some common ones:

DOCUMENT_COMMITTED_DELETE
DOCUMENT_COMMITTED_UPSERT
DOCUMENT_FETCHED
DOCUMENT_QUEUED
DOCUMENT_PROCESSED
REJECTED_UNMODIFIED
REJECTED_DUPLICATE
REJECTED_BAD_STATUS

As you enter the query in the search box, the result will be displayed almost instantly. You can then view it in either Table or Graph format.

Table Format:

Graph Format:

The graph generated by Prometheus offers a visual depiction of the crawling job’s advancement. As shown in the above graph, the golden line represents the number of documents queued, while the purple line depicts the number of processed documents. Eventually, these two lines will intersect after all documents have been processed, as demonstrated below. This depiction allows you to quickly assess the progress of the crawling job, without having to access and inspect the logs.

Conclusion

With Prometheus, monitoring the progress of single or multiple crawling jobs is no longer a hassle. There’s no need to open multiple consoles for each crawler to check the progress—Prometheus can take care of it all to give you an instant, at-a-glance visual. Just select the events you’re interested in, and then display them visually to save time on your daily monitoring task.

While your interest in events may vary, setting up this configuration requires less than an hour. We strongly recommend giving it a shot using our web or file system crawler. So go ahead and experiment with different combinations of events that align with your monitoring requirements and preferences.

Feel free to leave us feedback on what you think of our crawlers or what type of crawler monitoring you find the most useful. We’d love to hear your thoughts!

This vulnerability impacts Log4J version 2.x, version 1.2 is not affected (source). Norconex HTTP Collector version 2.x use Log4J v1.2.17 and thus are not affected. Version 3 of the Collector uses Log4J v2.17.1, which Apache has patched.

Note: Unless you made it so on purpose, the HTTP Collector does not run as a service accessible from the internet.

In my previous article, I talked about the new Config Starter and its features. This article serves as a follow-up. Now that you know how to generate a crawler configuration file, I will highlight the steps you can undertake to get you started on your own website crawling activities.

We will be using the TOKYO 2020 Olympic Games’ website as the crawl site in this article. The steps are as follows:

First, you will need to generate a basic configuration file targeting the Olympic website, using the Config Starter. In this example, I am targeting English content only, so I am excluding all URLs corresponding to the other languages on the website.

*Note that it is not mandatory to use the Config Starter to generate your configuration file as it only makes a basic configuration file. If you are looking for a more complete solution, you can make your own configuration file with the documentation here.

With your configuration file generated, the next step is to download the Norconex HTTP Collector on your computer from the Norconex Open-Source website and unzip it. If you are using the Config Starter, you will need to download version 3.x.

Once you have the HTTP Collector downloaded on your computer, open your command-line terminal in the location of the folder you just created with your download. To do this, simply use the following command with your file directory: cd C:\file\directory\of\the\collector

With your command-line terminal open, you must now enter the following line with the path to your configuration file:

Windows: collector-http.bat start -config= -config=/path/to/config.xml

Linux: collector-http.sh start -config= -config=/path/to/config.xml

Congratulations! You are now running your crawler. If all went according to plan, you should see something similar to the next image and the data crawled should now be located in the created committer directory (if you are using the same committer as me, it should be in the “work” folder).

Now that you have crawled the Olympic site, go and collect your gold medal!

If you encounter any issues during the process, you can find resolutions on the HTTP Collector GitHub issues page.

[Try] out the new Norconex HTTP Collector Config Starter.

[Download] and [get started] with the Norconex HTTP Collector.

[Learn] more about the inner workings of the Norconex HTTP Collector.

When starting to play with the programming world, it doesn’t take long to notice the sheer amount of options. With so many options available, it’s easy to become overwhelmed by all the information. Sorting through it all is no easy task for the less tech-savvy among us. For that reason, Norconex has put a lot of effort into making its products more user-friendly. With the new Config Starter, everyone can now easily generate a basic configuration file for the Norconex HTTP Collector version 3.

The Norconex HTTP Collector Config Starter is a “wizard” that will generate the configuration file for you to run the Norconex HTTP Collector. Whether you don’t know anything about the world of programming or you just want to quickly set up your crawler, this Config Starter is made for you. In this article, we will go through each section of this wizard and show you how to start your crawler with it.

Collect section

First, we have the Collect section. This section is important because it will collect all the documents from the webpage of your website. Here, you will provide the URL where the crawler collects the data that you need. The collector will start by downloading that page and then following all the links from that webpage. In this section, you can also, if applicable, enter any sections of your site that you do not want to crawl.

Import section

Next, we have the import section. This is where you will choose what to keep from the information collected by the Collector. The first option here lets you remove the data collected from your website’s header, footer and nav sections. Those sections are generally used to navigate through a website and rarely contain any relevant information to crawl.

The second option lets you choose the fields that you want to extract from the data collected. This is useful if your goal is to extract specific data from your website, which will be sent to the committer. If you want to keep all the data, just leave this section empty.

Commit section

Last, we have the Commit section. This section tells your crawler where to save the data you have collected. At the time of writing, we have four available committer options: Local JSON File, Local XML File, Apache Solr or Elasticsearch. If you select one of the last two, you will need to provide the localization of the committer.

If you want to manually configure any section of your configuration file, or use another committer, you can leave any section empty or easily edit it afterward.

When you are done, you can proceed to generate your configuration file; you should get something similar to the configuration below. This configuration comes with a detailed description for all the fields if you want to further customize your crawler later.

<?xml version="1.0" encoding="UTF-8"?>
<!--
 _   _  ___  ____   ____ ___  _   _ _______  __
| \ | |/ _ \|  _ \ / ___/ _ \| \ | | ____\ \/ /
|  \| | | | | |_) | |  | | | |  \| |  _|  \  /
| |\  | |_| |  _ <| |__| |_| | |\  | |___ /  \
|_| \_|\___/|_| \_\\____\___/|_| \_|_____/_/\_\
===============================================

HTTP Collector Configuration File.
 
Generated by: https://opensource.norconex.com/collectors/http/v3/config/starter
Website:      https://opensource.norconex.com/collectors/http/
Manual:       https://opensource.norconex.com/docs/collectors/http/

-->
<httpcollector id="config-id">

  <!--
    Crawler "work" directory.  This is where files downloaded or created as
    part of crawling activities get stored.
    It should be unique to each crawlers.
    -->
  <workDir>./work</workDir>

  <crawlers>
    <crawler id="crawler-id">

      <!--
        Mandatory starting URL(s) where crawling begins.  If you put more
        than one URL, they will be processed together.  You can also
        point to one or more URLs files (i.e., seed lists), or
        point to a sitemap.xml.
        -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://mywebsite.com</url>
      </startURLs>

      <!-- Normalizes incoming URLs. -->
      <urlNormalizer class="GenericURLNormalizer">
        <normalizations>
            removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
            decodeUnreservedCharacters, removeDefaultPort,
            encodeNonURICharacters
        </normalizations>
      </urlNormalizer>

      <!--Handles interval between each page download-->
      <delay default="3000" />

      <!--
        How many threads you want a crawler to use.  Regardless of how many
        thread you have running, the frequency of each URL being invoked
        will remain dictated by the &lt;delay/&gt option above.  Using more
        than one thread is a good idea to ensure the delay is respected
        in case you run into single downloads taking more time than the
        delay specified. Default is 2 threads.
        -->
      <numThreads>2</numThreads>

      <!--
        How many level deep can the crawler go. I.e, within how many clicks
        away from the main page (start URL) each page can be to be considered.
        Beyond the depth specified, pages are rejected.
        The starting URLs all have a zero-depth.  Default is -1 (unlimited)
        -->
      <maxDepth>10</maxDepth>

      <!--
        Crawler "work" directory.  This is where files downloaded or created as
        part of crawling activities (besides logs and progress) get stored.
        It should be unique to each crawlers.
        -->
      <maxDocuments>-1</maxDocuments>

      <!--
        What to do with orphan documents.  Orphans are valid
        documents, which on subsequent crawls can no longer be reached when
        running the crawler (e.g. there are no links pointing to that page
        anymore).  Available options are:
        IGNORE, DELETE, and PROCESS (default).
        -->
      <orphansStrategy>PROCESS</orphansStrategy>

      <!-- Handle robots.txt files. -->
      <robotsTxt ignore="false" />

      <!-- Detects and processes sitemap files. -->
      <sitemapResolver ignore="false" />

      <!-- 
        Detects pages with a canonical link and rejects them in favor of 
        the canonical one.
        -->
      <canonicalLinkDetector ignore="false" />

      <!--
        Filter out matching URLs before they are downloaded. If you 
        want links extracted before a page gets rejected, it needs to be 
        rejected after it was downloaded. You can use <documentFilters>
        instead to achieve this. 
        -->
      <referenceFilters>
        <filter class="RegexReferenceFilter" onMatch="exclude">.*/login/.*</filter>
      </referenceFilters>


      <!--
        Import a document using Norconex Importer module. Here is your chance
        to manipulate a document content and its metadata fields using 
        import handlers before it is sent to your target repository.  
        -->
      <importer>

        <!--
          Pre-parse handlers take place BEFORE a document is converted to 
          plain-text. If you need to deal with the original document format
          (HTML, XML, binary, etc.), define your import handlers here.
          -->
        <preParseHandlers>
          <!-- Remove navigation elements from HTML pages. -->
          <handler class="DOMDeleteTransformer">
            <dom selector="header" />
            <dom selector="footer" />
            <dom selector="nav" />
            <dom selector="noindex" />
          </handler>
        </preParseHandlers>

        <!--
          Post-parse handlers take place AFTER a document is converted to 
          plain-text. At this stage, content should be stripped of formatting
          (e.g., HTML tags) and you should no longer encounter binary content.
          -->
        <postParseHandlers>

          <!-- Rename extracted fields to what you want. -->
          <handler class="RenameTagger">
            <rename toField="title" onSet="replace">
              <fieldMatcher method="csv">dc:title, og:title</fieldMatcher>
            </rename>
          </handler>

          <!-- Make sure we are sending only one value per field. -->
          <handler class="ForceSingleValueTagger" action="keepFirst">
            <fieldMatcher method="csv">title</fieldMatcher>
          </handler>

          <!-- Keep only those fields and discard the rest. -->
          <handler class="KeepOnlyTagger">
            <fieldMatcher method="csv">title</fieldMatcher>
          </handler>
        </postParseHandlers>
      </importer>

      <!--
        Commits a document to a data source of your choice.
        This step calls the Committer module.  The
        committer is a different module with its own set of XML configuration
        options.  Please refer to committer for complete documentation.
        Below is an example using the FileSystemCommitter.
        -->
      <committers>

        <!--
          JSON File Committer.
          
          Store crawled documents to the local file system, in JSON Format.
          
          Web site:
            https://opensource.norconex.com/committers/core/

          Configuration options and output format:  
            https://opensource.norconex.com/committers/core/v3/apidocs/com/norconex/committer/core3/fs/impl/JSONFileCommitter.html
          -->
        <committer class="JSONFileCommitter"/>

      </committers>

    </crawler>
  </crawlers>

</httpcollector>

To start running your crawler, just refer to the location of the configuration file in the start command in the command-line console.

[Try] out the new Norconex HTTP Collector Config Starter.

[Download] and [get started] with the Norconex HTTP Collector.

[Learn] more about the inner workings of the Norconex HTTP Collector.