
This blog post will show you how to use Prometheus with your Norconex crawler. This is possible because Norconex crawlers expose useful metrics via JMX. With this solution, you can conveniently track the progress of a crawling task at a glance, which is especially useful when you have several crawling jobs running simultaneously.

If you don’t already have Prometheus installed, we will also guide you through the installation process using Docker. Already have Prometheus installed? Go ahead and skip the first section.
The required setup consists of three main components: Prometheus, JMX agent, and Norconex web crawler.

Stand Up a Prometheus Server

  1. Create a “prometheus-test” folder to store config files.
  2. Create a custom YAML file: prometheus_config.yaml and add the following:
global: 
  scrape_interval: 15s 
  evaluation_interval: 15s 
  scrape_timeout: 10s 

scrape_configs: 
  # job_name: the name you give, usually one for each collector 
  - job_name: 'collector-http' 
    static_configs: 
    - targets:   ['host.docker.internal:9123']
  3. Create a Dockerfile in the same folder. In it, add the Prometheus image to be used, and then add the prometheus_config.yaml file created earlier.
FROM prom/prometheus 
ADD prometheus_config.yaml /etc/prometheus/
  4. Now it is time to build and start up the Prometheus container by running:
docker build -t my-prometheus-image . 
docker run -dp 9090:9090 my-prometheus-image
  5. Confirm the service is running:
docker ps
  6. You should see the Prometheus container listed in the output.
  7. Open your browser, and access Prometheus: http://localhost:9090

JMX Exporter / Prometheus Java Agent

Once Prometheus is up and running, you need to download the Prometheus JMX Java agent plugin. This agent reads the information exposed by the crawler's registered JMX MBeans and is intended to be run as a Java Virtual Machine (JVM) agent.

The latest plugin version will be used (version 0.18.0 as of this writing). Download the JAR file and save it in the prometheus-test folder. This agent requires Java 18. If you don’t already have it installed, download it here.

Next, you will create a jmx_config.yaml file to define the settings used by the JMX agent. Add the following to the file:

---
startDelaySeconds: 0
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false

Norconex Web Crawler

Norconex has two types of crawlers: web and file-system. We will use the web version in our test, so go ahead and download the crawler if you haven’t already done so.

To start crawling, you need to define the start URL and other settings in a crawler configuration XML file. Let’s create one now.

In the “prometheus-test” folder, create an XML file called “crawler_config.xml”. Then add the following:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="prometheus-test-collector">

  <!-- Decide where to store generated files. -->
  <workDir>${workdir}</workDir>
  <deferredShutdownDuration>10 seconds</deferredShutdownDuration>
  
  <crawlers>
    <crawler id="prometheus-test-crawler">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
        <url>https://www.britannica.com</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>1</maxDepth>
      <numThreads>${numThreads|'3'}</numThreads>
      <maxDocuments>${maxDocuments|'1000'}</maxDocuments>
      <canonicalLinkDetector ignore="true" />
      <robotsTxt ignore="true" />
      <robotsMeta ignore="true" />
      <orphansStrategy>IGNORE</orphansStrategy>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolver ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="2 seconds" />
      
      <referenceFilters>
        <filter class="ReferenceFilter" onMatch="exclude">
          <valueMatcher method="regex">.*literature.*</valueMatcher>
        </filter>
      </referenceFilters>
      
      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <handler class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fieldMatcher method="csv">title,document.reference</fieldMatcher>      
          </handler>
        </postParseHandlers>
      </importer> 
      
      <!-- Decide what to do with your files by specifying a Committer. -->
      <committers>
        <committer class="core3.fs.impl.XMLFileCommitter">
          <docsPerFile>250</docsPerFile>
          <indent>4</indent>
          <splitUpsertDelete>false</splitUpsertDelete>
        </committer>
      </committers>

    </crawler>
  </crawlers>

</httpcollector>

Start the Crawler

Initiating the crawling task and enabling Prometheus to fetch metrics from the crawler is a straightforward process. To ensure reproducibility, create a batch file (or an equivalent shell script on Unix/Linux) that contains the necessary command. This way, you can effortlessly launch the crawler whenever required.

In the “prometheus-test” folder, create a run-job.bat file. Then add the following:

@echo off 

set CRAWLER_HOME=path\to\Norconex\web\crawler\folder\ 
set TEST_DIR=path\to\prometheus\test\folder 

java -javaagent:%TEST_DIR%\jmx_prometheus_javaagent-0.18.0.jar=9123:%TEST_DIR%\jmx_config.yaml ^ 
     -DenableJMX=true ^ 
     -Dlog4j2.configurationFile="%CRAWLER_HOME%\log4j2.xml" ^ 
     -Dfile.encoding=UTF8 ^
     -Dworkdir="%TEST_DIR%\workdirs" ^
     -cp "%CRAWLER_HOME%\lib\*" ^ 
     com.norconex.collector.http.HttpCollector start -clean -config=%TEST_DIR%\crawler_config.xml
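
On Unix/Linux, a rough equivalent shell script (for example, run-job.sh) could look like the sketch below. The paths are placeholders you must adapt; the command itself simply mirrors the batch file above.

#!/bin/sh
# Adjust these two paths to your own environment.
CRAWLER_HOME=/path/to/norconex/web/crawler/folder
TEST_DIR=/path/to/prometheus/test/folder

java -javaagent:"$TEST_DIR/jmx_prometheus_javaagent-0.18.0.jar=9123:$TEST_DIR/jmx_config.yaml" \
     -DenableJMX=true \
     -Dlog4j2.configurationFile="$CRAWLER_HOME/log4j2.xml" \
     -Dfile.encoding=UTF8 \
     -Dworkdir="$TEST_DIR/workdirs" \
     -cp "$CRAWLER_HOME/lib/*" \
     com.norconex.collector.http.HttpCollector start -clean -config="$TEST_DIR/crawler_config.xml"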

Notice that a port is specified in the command. It corresponds to the target port set in the scrape_configs section of prometheus_config.yaml. You can define more than one job at a time, using the same hostname and a different port number for each job.
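
For example, to also scrape a second crawling job exposing its metrics on port 9124 (the port number and job name here are illustrative assumptions), you could extend the scrape_configs section of prometheus_config.yaml like this:

scrape_configs:
  - job_name: 'collector-http'
    static_configs:
    - targets: ['host.docker.internal:9123']
  # Second crawler, started with its own agent port (assumed to be 9124).
  - job_name: 'collector-http-2'
    static_configs:
    - targets: ['host.docker.internal:9124']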

Run the run-job.bat file to start the crawler.

After starting the crawler, you will see logs being written to the console. You can now switch over to your Prometheus Dashboard and try one of the following queries:

  • {job="collector-http"}
  • {job="collector-http", key=~"DOCUMENT_QUEUED|DOCUMENT_COMMITTED.*"}
  • {job="collector-http", key=~"DOCUMENT_FETCHED|DOCUMENT_COMMITTED.*"}
  • {job="collector-http", key=~"DOCUMENT_QUEUED|DOCUMENT_FETCHED|DOCUMENT_COMMITTED.*"}
  • {job="collector-http", key=~"DOCUMENT_.*|.*REJECTED.*"}

These queries return the number of documents queued, fetched, and committed. The results also show the number of rejected documents. The job name refers to the job defined in the scrape_configs section of the prometheus_config.yaml file. The key in each query corresponds to an event type raised by the crawler, importer, committer, and collector core. By specifying different events in the key, you can view the information you’re interested in for a specific crawling job.

You have abundant options for what events you include in your search. Here are some common ones:

  • DOCUMENT_COMMITTED_DELETE
  • DOCUMENT_COMMITTED_UPSERT
  • DOCUMENT_FETCHED
  • DOCUMENT_QUEUED
  • DOCUMENT_PROCESSED
  • REJECTED_UNMODIFIED
  • REJECTED_DUPLICATE
  • REJECTED_BAD_STATUS

As you enter the query in the search box, the result will be displayed almost instantly. You can then view it in either Table or Graph format.

Table Format:

Graph Format:

The graph generated by Prometheus offers a visual depiction of the crawling job’s progress. In the graph above, the golden line represents the number of documents queued, while the purple line depicts the number of processed documents. The two lines eventually intersect once all documents have been processed, as demonstrated below. This view lets you quickly assess the progress of the crawling job without having to access and inspect the logs.

Conclusion

With Prometheus, monitoring the progress of single or multiple crawling jobs is no longer a hassle. There’s no need to open multiple consoles for each crawler to check the progress—Prometheus can take care of it all to give you an instant, at-a-glance visual. Just select the events you’re interested in, and then display them visually to save time on your daily monitoring task.

While your interest in events may vary, setting up this configuration requires less than an hour. We strongly recommend giving it a shot using our web or file system crawler. So go ahead and experiment with different combinations of events that align with your monitoring requirements and preferences.

Feel free to leave us feedback on what you think of our crawlers or what type of crawler monitoring you find the most useful. We’d love to hear your thoughts!

This vulnerability impacts Log4J version 2.x; version 1.2 is not affected (source). Norconex HTTP Collector version 2.x uses Log4J v1.2.17 and is therefore not affected. Version 3 of the Collector uses Log4J v2.17.1, which Apache has patched.

Note: Unless you made it so on purpose, the HTTP Collector does not run as a service accessible from the internet. 

Norconex is proud to announce the next major release of its popular open-source web crawler (also referred to as “Norconex HTTP Collector”).  After a couple of years of development, you will find this new version was well worth the wait.

Not only does it introduce many new features, but it is also more flexible and comes with even more documentation. Many of these improvements come from community feedback, so long-term users deserve a pat on the back. This release is also yours.

If you are too eager to get started, you can download it now and follow its website documentation. Otherwise, keep reading for a glance at the new features.

What’s New?

Introduced features are too many to list here, but we’ll highlight some of the most significant.

Crawling of JavaScript-Driven Websites

Thanks to browser automation provided by Selenium WebDriver, you can now use your favorite browser to crawl web pages that rely on JavaScript to fully render. Generally speaking, if your browser can render the content, the crawler can fetch it. It also gives you the ability to take screenshots of the pages you crawl.
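
As a rough idea of what this can look like in a crawler configuration, the sketch below declares a WebDriver-based HTTP fetcher using Chrome. Treat the element and class names as assumptions and verify them against the version 3 documentation.

<httpFetchers>
  <!-- Sketch only: verify class and element names in the v3 documentation. -->
  <fetcher class="WebDriverHttpFetcher">
    <browser>chrome</browser>
  </fetcher>
</httpFetchers>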

Multiple Committers

Committers are used to store crawled information in a target location, or repository, of your choice. This version allows you to specify any number of Committers to have your data sent to multiple targets at once (database, search engine, filesystem, etc.). It is also possible to perform simple routing.
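
As an illustration, the sketch below sends crawled documents both to XML files on disk and to an Elasticsearch index. The Elasticsearch committer settings shown (class, nodes, indexName) are assumptions to check against the corresponding Committer documentation.

<committers>
  <!-- Keep a local copy of every document as XML files. -->
  <committer class="core3.fs.impl.XMLFileCommitter">
    <docsPerFile>250</docsPerFile>
  </committer>
  <!-- Also index documents in Elasticsearch (settings are assumptions). -->
  <committer class="ElasticsearchCommitter">
    <nodes>http://localhost:9200</nodes>
    <indexName>my-index</indexName>
  </committer>
</committers>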

Easier to deploy

Variables in configuration files can now be resolved against system properties and environment variables. Logging has been abstracted using SLF4J and now prints to STDOUT by default. These changes facilitate deployment in containerized environments (e.g., Docker).

Lots of Events

Event management has been redesigned and simplified. There are now more than 60 different event types being triggered for programmers to listen to and act upon, ranging from new Committer and Importer events to the expected Web Crawler events.

XML Configuration improvements

Similar XML configuration options are now specified in a consistent way. In addition, it is now possible to provide partial class names (e.g., class="ExtensionReferenceFilter" instead of class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"). The Importer module also lets you use XML "flow" to facilitate configuration logic. That is, you can now make use of special XML tags: <if>, <ifNot>, <condition>, <conditions>, <else>, and <then>.
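
As a hedged sketch of what such a flow could look like, the snippet below keeps only a couple of fields for PDF documents; the condition class name and matcher syntax are assumptions to confirm against the Importer documentation.

<postParseHandlers>
  <!-- Sketch only: condition class and matcher syntax are assumptions. -->
  <if>
    <condition class="ReferenceCondition">
      <valueMatcher method="wildcard">*.pdf</valueMatcher>
    </condition>
    <then>
      <handler class="KeepOnlyTagger">
        <fieldMatcher method="csv">title,document.reference</fieldMatcher>
      </handler>
    </then>
  </if>
</postParseHandlers>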

Richer documentation

Documentation has been improved as well:

  • A new Online Manual is now available, giving great insight into installation and XML configuration.
  • Dynamic XML documentation combining options from all modules making up the web crawler into a single location.

The JavaDoc now has formatted XML documentation and XML usage, which is easy to copy and paste into your own configuration.

Config Starter

A very simple yet useful configuration generator is now available online. It will help you create your first configuration file. You provide your “start” URL, answer a few questions and your configuration file will be generated for you.

More?

Some additional features:

  • Can send deletion requests to Committers upon encountering specific events.
  • Can prevent duplicate documents from being sent to Committers during the same crawling session.
  • Now supports additional HTTP standards.
  • Can now extract links after document importing/parsing as well as from metadata.
  • The Crawler can be configured to stop itself after encountering specific events.
  • New command-line options for cleaning previous crawls (starting fresh) and for exporting/importing the crawler's internal data store.
  • Can now transform crawled images.
  • Additional content and metadata manipulation options.
  • Committers can now retry failing batches, reducing the batch size between each attempt.
  • New out-of-the-box CSV Committer.

We recommend you have a look at the release notes for more. 

What next?

If you are coming from Norconex HTTP Collector version 2, we recommend you have a look at the version 3 migration notes.

As always, community support is still available on GitHub. While on GitHub, take a moment to “Star” the project.

Come back once in a while as we’ll publish more in-depth articles on specific features or use cases you did not even think were possible to address with our web crawler.

Finally, we always love to know who is using the Norconex Web Crawler.  Let us know and you may get listed on our wall of fame.

Enjoy!

In my previous article, I talked about the new Config Starter and its features. This article serves as a follow-up. Now that you know how to generate a crawler configuration file, I will highlight the steps you can undertake to get you started on your own website crawling activities.

We will be using the TOKYO 2020 Olympic Games’ website as the crawl site in this article. The steps are as follows:

  1. First, you will need to generate a basic configuration file targeting the Olympic website, using the Config Starter. In this example, I am targeting English content only, so I am excluding all URLs corresponding to the other languages on the website.

*Note that it is not mandatory to use the Config Starter to generate your configuration file as it only makes a basic configuration file. If you are looking for a more complete solution, you can make your own configuration file with the documentation here.

  2. With your configuration file generated, the next step is to download the Norconex HTTP Collector to your computer from the Norconex Open-Source website and unzip it. If you are using the Config Starter, you will need to download version 3.x.
  3. Once you have the HTTP Collector downloaded on your computer, open your command-line terminal in the folder you just extracted from your download. To do this, simply use the following command with your file directory: cd C:\file\directory\of\the\collector
  4. With your command-line terminal open, you must now enter the following line with the path to your configuration file:

Windows: collector-http.bat start -config=/path/to/config.xml

Linux: collector-http.sh start -config=/path/to/config.xml

Congratulations! You are now running your crawler. If all went according to plan, you should see something similar to the next image and the data crawled should now be located in the created committer directory (if you are using the same committer as me, it should be in the “work” folder).

Now that you have crawled the Olympic site, go and collect your gold medal!

If you encounter any issues during the process, you can find resolutions on the HTTP Collector GitHub issues page.

[Try] out the new Norconex HTTP Collector Config Starter.

[Download] and [get started] with the Norconex HTTP Collector.

[Learn] more about the inner workings of the Norconex HTTP Collector.

This year I was given the privilege to attend my first KubeCon + CloudNativeCon North America 2020 virtually. This event spans four days consisting of virtual activities such as visiting vendor booths, learning about Cloud Native projects, and exploring the advancement of cloud native computing.

The keynote started off by paying respects to the late, legendary Dan Kohn. Kohn’s influence changed everything from how we shop online to how we research on the internet, and it paved the way for new evolutions of The Linux Foundation and the Cloud Native Computing Foundation, promising an exciting future for many generations to come while supporting the creation of sustainable open-source ecosystems.

There were glitches while live streaming from the virtual conference platform, which was to be expected: it amounted to the kind of real-time heavy load test no production environment would want. Fortunately, on-demand recordings of the presentations are now available.

Slack channels can be joined from cloud-native.slack.com to communicate with others in channels like #kubecon-mixandmingle and others covering KubeCon-related topics. This provides a great way to connect with the KubeCon audience virtually, even after the event is over.

KubeCon provides many 101 learning and tutorial events about the services CNCF projects offer and how they can help with the three main pillars I am involved with daily: automation, DevOps, and observability. These pillars are usually implemented in parallel. For instance, continuous development and continuous deployment require automation to build the pipeline, which involves writing code and understanding operations architecture planning. Once deployed, observability of the running services is required to monitor them and ensure smooth delivery to users. Many CNCF projects provide the services that help create this development flow, from committing code that gets deployed to cloud services to providing monitoring capabilities for secured mesh services.

At Norconex, our upcoming Norconex Collector version 3.0.0 could be used in combination with containerd, Helm, and Kubernetes, with builds and deployments automated via Jenkins. One way to get started is to figure out how to package the Norconex Collector and Norconex Committer into a container-runnable image with a container tool such as Docker, and to run builds for development and testing. After figuring out how to build the container image, I have to decide where to store it, so that the Kubernetes cluster can pull the image from that registry and run it with a Kubernetes CronJob on a schedule. The Kubernetes Job then creates a Pod that runs the crawl using the Norconex Collector and commits the indexed data. Finally, I chose Jenkins as the build tool for this experiment to help automate updates and deployments.
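
Before getting into the steps, here is a minimal Dockerfile sketch for packaging an unzipped collector distribution into a runnable image. The base image, folder names, and configuration path are assumptions you would adapt to your own layout.

# Sketch only: base image, folder names, and config path are assumptions.
FROM eclipse-temurin:11-jre
WORKDIR /opt/collector
# Copy the unzipped Norconex HTTP Collector distribution and a crawler config.
COPY norconex-collector-http/ .
COPY crawler-config.xml .
ENTRYPOINT ["./collector-http.sh", "start", "-config=/opt/collector/crawler-config.xml"]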

Below are steps that provide an overview for my quick demo experiment setup:

  1. Demo use of the default Norconex Collector:
    • Download the Norconex HTTP Collector with the Filesystem Committer. Other Committer choices can be found at Norconex Committers.
    • Build container image using Dockerfile
    • Setup a Git repository file structure for container image build
    • Guide to build and test-run using the created Dockerfile
      • Demo set up locally using Docker Desktop to run Kubernetes
        • Tutorials for setting up local Kubernetes
  2. Determine where to push the container image; can be public or private image registry such as Docker Hub
  3. Create a Helm Chart template using the Helm Chart v3
    • Demo will start with default template creation of Helm Chart
    • Demo to use the Kubernetes Node filesystem for persistent storage
      • Other storage options can be used, for instance, in AWS use EBS volume or EFS
    • Helm template and yaml configuration
      • cronjob.yaml to deploy a Kubernetes CronJob that creates a new Kubernetes Job to run on a schedule (see the sketch after this list)
      • pvc.yaml to create the Kubernetes persistent volume and persistent volume claim that the Norconex Collector crawl job will use on the next recrawl run
  4. Simple build using Jenkins
    • Overview of Jenkins build job pipeline script
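
To make the cronjob.yaml from step 3 more concrete, here is a minimal sketch of a Kubernetes CronJob that runs a crawl image on a schedule. The image name, schedule, and volume claim name are assumptions for illustration.

# Sketch only: image name, schedule, and claim name are assumptions.
apiVersion: batch/v1            # use batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: norconex-crawl
spec:
  schedule: "0 2 * * *"         # run the crawl every day at 2 a.m.
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: collector
              image: my-registry/norconex-collector:latest
              volumeMounts:
                - name: workdir
                  mountPath: /opt/collector/work
          volumes:
            - name: workdir
              persistentVolumeClaim:
                claimName: norconex-workdir-pvc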

I hope you enjoyed this recap of Kubecon!


More details of the codes and tutorials can be found here:

https://github.com/somphouang/norconex-devops-demo


Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.

CMIS support

Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content management systems (CMS) content. Extra information can be extracted, such as document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.

<startPaths>
    <path>cmis-atom:https://norconex.com/mycms/cmisatom!/my/starting/path</path>
</startPaths>

Additional ACL support

ACLs from your CMS are not the only new type of ACL you can extract. This new Norconex Filesystem Collector release also introduces support for obtaining local filesystem ACLs. These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (available since 2.7.0).

Field discovery

You can’t always tell upfront what metadata your crawler will find. One way to discover your fields is to send them all to your Committer. This approach is not always possible nor desirable. You can now store all fields found by the crawler in a local file. Each field is saved once, with sample values to give you a better idea of its nature.

<tagger class="com.norconex.importer.handler.tagger.impl.FieldReportTagger" 
    maxSamples="2" file="/path/to/report/myfields.csv" />

New URL normalization rules

The HTTP Collector adds a few new rules to its GenericURLNormalizer (see the configuration sketch after the list). They are:

  • removeQueryString
  • lowerCase
  • lowerCasePath
  • lowerCaseQuery
  • lowerCaseQueryParameterNames
  • lowerCaseQueryParameterValues
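
As a configuration sketch, enabling a few of these rules could look like the snippet below; verify the exact element names against the GenericURLNormalizer documentation.

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <!-- Sketch only: rule names come from the list above. -->
  <normalizations>
    removeQueryString, lowerCasePath, lowerCaseQueryParameterNames
  </normalizations>
</urlNormalizer>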

Subdomains being part of a domain

When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider sub-domains as being the same site (includeSubdomains="true").
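
In configuration terms, here is a sketch of how the two attributes could be combined on your start URLs (attribute placement is an assumption to verify against the documentation):

<startURLs stayOnDomain="true" includeSubdomains="true">
  <!-- Sketch only: treats blog.example.com as part of example.com. -->
  <url>https://example.com</url>
</startURLs>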

Other changes

For a complete list of all additions and changes, refer to the following release notes:

Download


Kafka users rejoice! You can now use Norconex open-source crawlers with Apache Kafka, thanks to the Norconex Apache Kafka Committer.

We owe this contribution to Joseph Paulo Mantuano (Senior Developer at The Red Flag Group) and Dan Davis.

The Norconex Collectors community keeps growing. We are thrilled to see the number of integrations grow with it as well.  If you know of any Norconex Committer implementation out there, let us know and we’ll add them to the list!

Not yet familiar with Norconex crawlers?  Head over to Norconex HTTP Collector or Norconex Filesystem Collector websites to learn more.

Great news! There is now a Google Cloud Search Committer for Norconex Crawlers!

This addition to the Norconex Collector family should delight Google Cloud Search fans. They too can now enjoy the full-featured crawling capabilities offered by Norconex open-source crawlers.

Since this Committer is developed and maintained by Google, you will find installation and configuration documentation on the Google Developers website.

New to Norconex crawlers? Head over to the Norconex Collectors website to start crawling.

Happy crawling!

Amazon Web Services (AWS) and the Canadian Public Sector organized another excellent Public Sector Summit on May 15, 2019. AWS hosted the first such summit in Ottawa last year, but this year’s event attracted a much larger crowd. Thousands of attendees filled Shaw Centre’s entire third floor.

In the keynote sessions, it was great to hear Alex Benay (deputy minister at the Treasury Board of Canada) talk about the government’s modern digital initiative. He discussed the approach, successes, and challenges of the government’s Cloud migration journey. Another excellent speaker was Mohamed Frendi (director of IT, innovation, science, and economic development for the government of Canada). He covered Canada’s API Store and how it uses the Cloud to make government data more accessible.

The afternoon session was led by Darin Briskman, an AWS developer evangelist. He talked about Amazon’s self-service analytics tool, called AWS Lake Formation, which combines data from multiple sources to resolve data-driven challenges in a timely manner. Machine learning and AI help in making informed decisions and solving problems. This service is a great fit for Norconex’s open-source crawler products HTTP Collector and Filesystem Collector, which fetch data from unstructured data sources to make it easy to consume. Collected content and metadata are natively stored in various existing repositories (or formats), including AWS-specific ones like Amazon Elasticsearch Service, Amazon Open Distro Elasticsearch, and Amazon CloudSearch, as well as many others, such as relational databases, Apache Solr, Google Cloud Search, Neo4J, Microsoft Azure Search, Lucidworks, IDOL, and more.


The diagrams below provide further explanation. The one showing the crawling spider is particularly exciting, because Norconex crawlers have much potential to help in this area.  See available Norconex Committers.


AWS Public Sector Summit Event Pass

Selfies with Darin Briskman, Developer Evangelist, AWS and Stevan Beara, Solutions Architect Manager, AWS.


Norconex crawlers and Neo4j graph database are now a love match! Neo4j is arguably the most popular graph database out there. Use Norconex crawlers to harvest relationships from websites and filesystems and feed them to your favorite graph engine.

This was made possible thanks to none other than French contributor Sylvain Roussy, a Neo4j reference and the author of two Neo4j books. Norconex is proud to have partnered with Sylvain to develop a Neo4j Committer for use with its Norconex HTTP and Filesystem Collectors.

To our French-speaking European friends: Sylvain will host a series of Neo4j Meetups at different locations, where he will explain how Norconex crawlers can be used to gather graph data from the web for use in Neo4j. The first of the series takes place on January 24th in Genève:

Useful Links: