Introduction

Managing vast amounts of data stored across various file systems can be a daunting task. But it doesn’t have to be! Norconex File System Crawler comes to the rescue, offering a robust solution for efficiently extracting, organizing, and indexing your files.

But did you know you can extend its capabilities without writing a single line of code? In this blog post, you’ll learn how to connect an external application to the Crawler and unleash its full potential.

The Use Case

Both Norconex File System Crawler and Norconex Web Crawler utilize Norconex Importer to extract data from documents. Right out of the box, the Importer supports various file formats, as documented here. But you may encounter a scenario where the Importer cannot parse a document. 

One such example is a RAR5 document. At the time of this writing, the latest version of the File System Crawler is 2.9.1, and extracting a RAR5 file with it throws the following exception.

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@35f95a13
...
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@35f95a13
...
Caused by: java.lang.NullPointerException: mainheader is null
...

As you can see, Apache Tika’s RarParser class cannot extract the document. You’ll see how to work around this issue below.

Note: This blog post will focus on a no-code solution. However, if you can code, writing your own custom parser is highly recommended. See the Extend the File System Crawler section of the documentation to learn how to accomplish just that.

ExternalTransformer to the Rescue

Many applications can extract RAR files. One such application is 7zip. If you need to, go ahead and install 7zip on your machine now, as you'll need it moving forward.

Overview

You will run two crawlers separately. The first crawls everything normally, except RAR files: for those, it uses the ExternalTransformer to extract their contents to an extracted folder and then skips any further processing of the file. The second crawler then crawls the extracted files in that folder.

Configs

The configuration for the first crawler is as follows, with comments explaining the various options.

<?xml version="1.0" encoding="UTF-8"?>
<fscollector id="fs-collector-main">

#set($workdir = ".\workdir-main")
#set($startDir = ".\input")
#set($extractedDir = ".\extracted")
#set($tagger = "com.norconex.importer.handler.tagger.impl")
#set($filter = "com.norconex.importer.handler.filter.impl")
#set($transformer = "com.norconex.importer.handler.transformer.impl")

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>
    <crawler id="fs-crawler-main">
      <workDir>${workdir}</workDir>
      <startPaths>
        <path>${startDir}</path>
      </startPaths>

      <importer>
        <!-- do the following before attempting to parse a file -->
        <preParseHandlers>
          <transformer class="${transformer}.ExternalTransformer">
            <!-- apply this transformer to .rar files only -->
            <restrictTo field="document.reference">.*\.rar$</restrictTo>
            <!--
              calls on 7zip to uncompress the file and place the contents
              in the `extracted` dir
            -->
            <command>'C:\Program Files\7-Zip\7z.exe' e ${INPUT} -o${extractedDir} -y</command>
            <metadata>
              <pattern toField="extracted_paths" valueGroup="1">
                ^Path = (.*)$
              </pattern>
            </metadata>
            <tempDir>${workdir}/temp</tempDir>
          </transformer>

          <!-- stop further processing of .rar files -->
          <filter class="${filter}.RegexReferenceFilter" onMatch="exclude">
            <regex>.*\.rar$</regex>
          </filter>

        </preParseHandlers>
      </importer>

      <!--
        commit crawled files to the local file system;
        you can substitute this with any of the available committers
      -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>${workdir}/crawledFiles</directory>
      </committer>
    </crawler>

  </crawlers>

</fscollector>

This crawler parses all files normally, except RAR files. When it encounters a RAR file, the Crawler calls upon 7zip to extract its contents and place them under the extracted folder. No further processing is done on the RAR file itself.
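Conceptually, the two pre-parse handlers above behave like the following Python sketch. The file names and the 7zip output lines are invented for illustration; only the two regular expressions come from the configuration:

```python
import re

def is_rar(reference):
    """Mimics the restrictTo/filter pattern: only .rar references match."""
    return re.search(r"\.rar$", reference) is not None

def extract_paths(seven_zip_output):
    """Mimics the <metadata> pattern: capture group 1 of 'Path = ...' lines."""
    return re.findall(r"^Path = (.*)$", seven_zip_output, flags=re.MULTILINE)

# Illustrative values only
print(is_rar("C:/input/archive.rar"))   # True
print(is_rar("C:/input/report.pdf"))    # False
print(extract_paths("Path = docs/a.txt\nSize = 12\nPath = docs/b.txt"))
```

References matching `is_rar` are handed to 7zip and then excluded; everything else continues through the normal pipeline.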

The second crawler is configured to simply crawl the files within the extracted folder. Here is the configuration:

<?xml version="1.0" encoding="UTF-8"?>
<fscollector id="fs-71-collector-extracted">

#set($workdir = ".\workdir-extracted")
#set($startDir = ".\extracted")

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>

    <crawler id="fs-crawler-extracted">
      <startPaths>
        <path>${startDir}</path>
      </startPaths>

      <!--
        commit extracted files to the local file system;
        you can substitute this with any of the available committers
      -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>${workdir}/crawledFiles</directory>
      </committer>
    </crawler>

  </crawlers>

</fscollector>

There you have it! You just extended the capabilities of the File System Crawler without writing a single line of code – a testament to the incredible flexibility offered by the Crawler.

Conclusion

Norconex File System Crawler is undeniably a remarkable tool for file system crawling and data extraction. Even more impressive is the ease with which you can extend the Crawler’s capabilities, all without the need for coding expertise. Whether you’re a seasoned professional or just getting started, let the Norconex File System Crawler – free from the complexities of coding – become your trusted companion in unleashing the full potential of your data management endeavours. Happy indexing!

Introduction

Norconex Web Crawler is a full-featured, open-source web crawling solution meticulously crafted to parse, extract, and index web content. The Crawler is flexible, adaptable and user-friendly, making it a top-notch selection for extracting data from the web.

As the volume and complexity of web crawling tasks increase, organizations face challenges in efficiently scaling the Crawler to meet organizational needs. Scaling effectively involves addressing issues related to configuration management, resource allocation, and the handling of large data sets to enable seamless scalability while maintaining data quality and integrity.

In this blog post you will learn how to handle configuration management for medium to large Crawler installations.

The Problem

Norconex Web Crawler only needs to be installed once, no matter how many sites you’re crawling. If you need to crawl different websites requiring different configuration options, you will likely need multiple configuration files. And as your crawling needs grow, yet more configuration files will be needed. Parts of these configuration files will inevitably share common elements. How can you minimize the duplication between configs?

The Solution: Apache Velocity Templates

Norconex Web Crawler configuration is not a plain XML file but rather an Apache Velocity template. Broadly speaking, the configuration file is interpreted by the Velocity Engine before being applied to the Crawler.
You can leverage the Velocity Engine to dynamically provide the appropriate values. The following sections walk you through exactly how to do so.

Variables

To keep things simple, consider a crawling solution that contains just two configuration files: one for siteA and one for siteB.

Note: This scenario is for demonstration purposes only. If you only have 2 sites to crawl, the following approach is not recommended.

Default configurations

The configurations for the two sites may look as follows.

siteA configuration

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-siteA">
  <workDir>./workDir</workDir>
  <crawlers>
    <crawler id="crawler-siteA">
      <startURLs stayOnDomain="true">
        <url>www.siteA.com</url>
      </startURLs>
      <maxDepth>-1</maxDepth>
      <!-- redacted for brevity -->     
    </crawler>
  </crawlers>
</httpcollector>

siteB configuration

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-siteB">
  <workDir>./workDir</workDir>
  <crawlers>
    <crawler id="crawler-siteB">
      <startURLs stayOnDomain="true">
        <url>www.siteB.com</url>
      </startURLs>
      <maxDepth>0</maxDepth>
      <!-- redacted for brevity -->
    </crawler>
  </crawlers>
</httpcollector>

As you can see, just four differences exist between the two configurations:

  • httpcollector id
  • crawler id
  • StartURLs
  • maxDepth

The common elements in both configurations should be shared. Below, you’ll learn how to share them with Velocity variables.

Optimized configuration

The following steps optimize the configuration by extracting dynamic data into dedicated files, thereby removing duplication.

First, extract the unique items into their respective properties files.

siteA.properties

domain=www.siteA.com
maxDepth=-1

siteB.properties

domain=www.siteB.com
maxDepth=0

Then, add variables to the Crawler configuration and save it as my-config.xml at the root of your Crawler installation. The syntax to add a variable is ${variableName}.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-${domain}"> <!-- variable added here -->
  <workDir>./workDir</workDir>
  <crawlers>
    <crawler id="crawler-${domain}"> <!-- variable added here -->
      <startURLs stayOnDomain="true">
        <url>${domain}</url> <!-- variable added here -->
      </startURLs>
      <maxDepth>${maxDepth}</maxDepth> <!-- variable added here -->
      <!-- redacted for brevity -->
    </crawler>
  </crawlers>
</httpcollector>

With the variables in place in the Crawler config, you simply need to point the Crawler start script at the appropriate variables file. This is accomplished with the -variables flag, as follows.

siteA

>collector-http.bat start -clean -config=my-config.xml -variables=siteA.properties

siteB

>collector-http.bat start -clean -config=my-config.xml -variables=siteB.properties

The Crawler will replace the variables in the config XML with what it finds in the .properties file.

The example above is for a Windows machine. If you are on Linux, use the collector-http.sh script instead.
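Under the hood, the effect is similar to this simplified Python sketch. Real Velocity supports far more than plain substitution; the property names here simply mirror the files above:

```python
import re

def load_properties(text):
    """Parse simple key=value lines, as in siteA.properties."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

def render(template, props):
    """Replace ${name} placeholders with their property values."""
    return re.sub(r"\$\{(\w+)\}", lambda m: props[m.group(1)], template)

props = load_properties("domain=www.siteA.com\nmaxDepth=-1")
print(render('<crawler id="crawler-${domain}">', props))
# <crawler id="crawler-www.siteA.com">
```

The same template renders differently for each site, depending only on which properties file is supplied.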

Tip: If you’re interested in seeing what the config will look like after variables are replaced, use the configrender option.

>collector-http.bat configrender -c=my-config.xml -variables=siteA.properties -o=full_config.xml

So far, we have only seen the basics of storing data in variables. But what if siteA and siteB needed to commit documents to separate repositories? Below you’ll see how to leverage the power of Apache Velocity Engine to accomplish just that.

Importing Files

Using variables goes a long way toward organizing multiple configuration files. You can also dynamically include chunks of configuration by utilizing Velocity’s #parse() script element.
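The effect of #parse() can be pictured as a textual include step, roughly like this Python sketch. The dict stands in for the file system, and the file name is illustrative; Velocity itself resolves the path through its resource loader:

```python
import re

def expand_includes(template, files):
    """Replace #parse("name") directives with the named file's contents.

    `files` stands in for the file system: a dict of name -> content.
    """
    def include(match):
        return files[match.group(1)]
    return re.sub(r'#parse\("([^"]+)"\)', include, template)

files = {"committer-es.xml": '<committer class="ElasticsearchCommitter"/>'}
print(expand_includes('<committers>#parse("committer-es.xml")</committers>', files))
```

Because variables are substituted as well, a directive like #parse("${committer}") lets each properties file choose which chunk of configuration gets included.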

To demonstrate, consider that siteA is required to commit documents to Azure Cognitive Search and siteB to Elasticsearch. The steps below will walk you through how to accomplish just that.

First, you need 2 committer XML files.

committer-azure.xml

<committer class="AzureSearchCommitter">
  <endpoint>https://....search.windows.net</endpoint>
  <apiKey>...</apiKey>
  <indexName>my_index</indexName>
</committer>

committer-es.xml

<committer class="ElasticsearchCommitter">
  <nodes>https://localhost:9200</nodes>
  <indexName>my_index</indexName>
</committer>

Then, augment the Crawler config (my-config.xml) by adding the <committers> section:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="collector-${domain}">
  <workDir>./workDir</workDir>
  <crawlers>
    <crawler id="crawler-${domain}">
      <startURLs stayOnDomain="true">
        <url>${domain}</url>
      </startURLs>

      <maxDepth>${maxDepth}</maxDepth>

      <!-- add this section -->
      <committers>
        #parse("${committer}")
      </committers>
    </crawler>
  </crawlers>
</httpcollector>

Finally, the .properties files must be updated to specify the committer file required for each site.

siteA.properties

domain=www.siteA.com
maxDepth=-1
committer=committer-azure.xml

siteB.properties

domain=www.siteB.com
maxDepth=0
committer=committer-es.xml

Now you can use the configrender option to see the final configuration for each site.

siteA

>collector-http.bat configrender -c=my-config.xml -variables=siteA.properties -o=full_config.xml

Relevant snippet from full_config.xml.

<committers>
  <committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
    <endpoint>https://....search.windows.net</endpoint>
    <apiKey>...</apiKey>
    <indexName>my_index</indexName>
    <!-- redacted for brevity -->
  </committer>
</committers>

siteB

>collector-http.bat configrender -c=my-config.xml -variables=siteB.properties -o=full_config.xml

Relevant snippet from full_config.xml.

<committers>
  <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <nodes>https://localhost:9200</nodes>
    <indexName>my_index</indexName>
    <!-- redacted for brevity -->
  </committer>
</committers>

And there you have it! With those simple steps, you can add the correct <committer> to the final configuration for each site.

Conclusion

As the scale and complexity of your projects grow, so does the challenge of managing multiple configuration files. Herein lies the beauty of harnessing the Apache Velocity Template Engine. By leveraging its power, you can streamline and organize your configurations to minimize redundancy and maximize efficiency. Say goodbye to duplicated efforts, and welcome a more streamlined, manageable, and scalable approach to web crawling. Happy indexing!

Introduction

Amazon CloudSearch, a powerful and scalable search and analytics service, has revolutionized how businesses handle data search and analysis. This blog post will walk you through how to set up and leverage Norconex Web Crawler to seamlessly index data to your Amazon CloudSearch domain.

Understanding Norconex Web Crawler

Norconex Web Crawler is an open-source web crawler designed to extract, parse, and index content from the web. Its flexibility and ease of use make it an excellent choice for extracting web data. Norconex offers a range of committers that index data to various repositories; see https://opensource.norconex.com/committers/ for a complete list of supported target repositories. If the provided committers do not meet your requirements, you can extend the Committer Core to create a custom committer that fits your needs.

This blog post will focus on indexing data to Amazon CloudSearch.

Prerequisites

Amazon CloudSearch

Follow the steps below to create a new Amazon CloudSearch Domain.

  • Enter a Search Domain Name. Next, select search.small and 1 for Desired Instance Type and Desired Replication Count, respectively.
  • Select Manual configuration from the list of options.
  • Add 3 fields – title, description, and content, of type text.
  • Authorize your IP address to send data to this CloudSearch instance. Click on Allow access to all services from specific IP(s). Then enter your public IP address.
  • That’s it! You have now created your own Amazon CloudSearch domain. AWS will take a few minutes to complete the setup procedure.

Important: You will need the accessKey and secretKey for your AWS account. Not sure where to get these values? Contact your AWS administrator.

After a few minutes, go to your CloudSearch Dashboard and make a note of the Document Endpoint.
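The CloudSearch Committer handles uploads for you, but it helps to know what it sends to that Document Endpoint: CloudSearch accepts JSON batches of document operations. The sketch below builds one such batch locally; the id and field values are invented for illustration:

```python
import json

def build_batch(docs):
    """Build a CloudSearch-style document batch: a list of 'add' operations.

    `docs` maps a document id to its fields dict.
    """
    return json.dumps(
        [{"type": "add", "id": doc_id, "fields": fields}
         for doc_id, fields in docs.items()]
    )

batch = build_batch({
    "doc-1": {"title": "Example", "description": "A sample page"}
})
print(batch)
```

The three fields you created on the domain (title, description, content) are exactly the field names such a batch would populate.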

Norconex Web Crawler

Download the latest version of Crawler from Norconex’s website. At the time of this writing, version 3.0.2 is the most recent.

Download the latest version of Amazon CloudSearch Committer. At the time of this writing, version 2.0.0 is the most recent.

Follow the Automated Install instructions to install Amazon CloudSearch Committer libraries in the Crawler.

Crawler Configuration

The following Crawler configuration will be used for this test. Place the configuration in the root folder of your Crawler installation and name it my-config.xml.

Ensure that you supply appropriate values for serviceEndpoint, accessKey, and secretKey. On your CloudSearch Dashboard, serviceEndpoint is the Document Endpoint.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Norconex HTTP Crawler">
  <!-- Decide where to store generated files. -->
  <workDir>./output</workDir>
  <crawlers>
    <crawler id="Norconex Amazon CloudSearch Committer Demo">

      <startURLs
          stayOnDomain="true"
          stayOnPort="true"
          stayOnProtocol="true">
        <url>https://github.com/</url>
      </startURLs>

      <!-- only crawl 1 page -->
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore the sitemap. -->
      <sitemapResolver ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5 seconds" />

      <importer>
        <postParseHandlers>
          <!-- only keep `description` and `title` fields -->
          <handler class="KeepOnlyTagger">
            <fieldMatcher method="csv">
              description,title
            </fieldMatcher>
          </handler>
        </postParseHandlers>
      </importer>

      <committers>
        <!-- send documents to Amazon CloudSearch -->
        <committer class="CloudSearchCommitter">
          <serviceEndpoint>...</serviceEndpoint>
          <accessKey>...</accessKey>
          <secretKey>...</secretKey>
        </committer>
      </committers>

    </crawler>
  </crawlers>
</httpcollector>

Note that this configuration is the minimum required. You can set many other parameters to suit your needs, and Norconex’s documentation does an excellent job of detailing all of them.
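The KeepOnlyTagger step in the config can be pictured as a simple filter over each document's metadata. The field names mirror the configuration; the sample values are invented:

```python
def keep_only(metadata, fields):
    """Drop every metadata field not in the allow-list, like KeepOnlyTagger."""
    return {k: v for k, v in metadata.items() if k in fields}

doc = {"title": "GitHub", "description": "Where software is built",
       "content-type": "text/html", "author": "unknown"}
print(keep_only(doc, {"description", "title"}))
# {'title': 'GitHub', 'description': 'Where software is built'}
```

Trimming metadata like this before committing keeps the index limited to the fields you actually defined on the CloudSearch domain.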

Important: For the purposes of this blog, AWS credentials are specified directly in the Crawler configuration as plain text. This practice is not recommended due to the obvious security issues it creates. Please consult the AWS documentation to learn about securely storing your AWS credentials.

Start the Crawler

Norconex Web Crawler comes packaged with shell scripts to start the application. To start the Crawler, run the following command in the console. The example below is for a Windows machine. If you are on Linux, use the collector-http.sh script instead.

C:\Norconex\norconex-collector-http-3.0.2>collector-http.bat start -clean -config=.\my-config.xml

Recall that you saved the configuration at the root of your Crawler installation.

The crawl job will take only a few seconds since only a single page is being indexed. Once the job completes, browse to your CloudSearch Dashboard. Then run a Test Search with the word github to see that the page was indeed indexed!

Conclusion

Indexing data to Amazon CloudSearch using Norconex Web Crawler opens a world of possibilities for data management and search functionality. Following the steps outlined in this guide, you can seamlessly integrate your data with Amazon CloudSearch, empowering your business with faster, more efficient search capabilities. Happy indexing!