How to run Norconex Collector in Docker

Introduction

Docker is popular because it makes it easy to package and deliver programs. This article shows how to run the Java-based, open-source crawler Norconex HTTP Collector, together with the Norconex Elasticsearch Committer, in Docker to crawl a website and index its content into Elasticsearch. Links to download the complete, fully functional files are at the end of the article.

Overview

The setup consists of a “Dockerfile” to build the Docker image, “entrypoint.sh” and “start.sh” in the “bin/” directory to configure and run the Docker container, and “es-config.xml” in “examples/elasticsearch”, the Norconex Collector configuration file used to crawl a website and index its content into Elasticsearch.

Installation

We are using Docker Community Edition in this tutorial. See Install Docker for more information.

Download the latest Norconex Collector and extract the downloaded .zip file. See Getting Started for more details.

Download the latest Norconex Elasticsearch Committer and install it. See Installation for more details.

Collector Configuration

Create “es-config.xml” in the “examples/elasticsearch” directory. In this tutorial, we will crawl /product/collector-http-test/complex1.php and /product/collector-http-test/complex2.php and index them into Elasticsearch, which is running on 127.0.0.1:9200, using an index named “norconex”. See Norconex Collector Configuration as a reference.

Entrypoint and Start Scripts

Create a directory, “docker”, to store the configuration and execution scripts.

entrypoint.sh:
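As a minimal sketch of what an entrypoint script can look like, the commands below write a small “entrypoint.sh” and make it executable. The “docker/bin” location and the script body are assumptions made for illustration, not the original listing: the script simply hands control to whatever command Docker passes to it.

```shell
# Assumed layout: scripts live in docker/bin (hypothetical path).
mkdir -p docker/bin

# Write a minimal entrypoint script. The quoted 'EOF' keeps "$@" literal.
cat > docker/bin/entrypoint.sh <<'EOF'
#!/bin/sh
# Stop on the first failing command.
set -e
# Container setup steps, if any, would go here.
# Hand control to the command supplied by CMD or "docker run".
exec "$@"
EOF

chmod +x docker/bin/entrypoint.sh
```

Using exec means the crawler process replaces the shell, so it receives stop signals from Docker directly.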

start.sh:
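A start script for this setup could look roughly like the sketch below. It assumes the Dockerfile exports CRAWLER_HOME, that the scripts live in “docker/bin”, and that the collector is launched through its “collector-http.sh” script with the start action and the configuration file created earlier. Treat it as an illustration rather than the original listing.

```shell
# Assumed layout: scripts live in docker/bin (hypothetical path).
mkdir -p docker/bin

# Write a minimal start script. The quoted 'EOF' keeps the variables literal
# so they are resolved inside the container, not while writing the file.
cat > docker/bin/start.sh <<'EOF'
#!/bin/sh
# CRAWLER_HOME is expected to be set by the Dockerfile.
exec "${CRAWLER_HOME}/collector-http.sh" -a start \
  -c "${CRAWLER_HOME}/examples/elasticsearch/es-config.xml"
EOF

chmod +x docker/bin/start.sh
```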

Dockerfile

A Dockerfile is a simple text file that contains a list of commands that the Docker client runs while creating an image. Create a new file, “Dockerfile”, in the “norconex-collector-http-2.8.0” directory.

Let’s start with the base image “java:8-jdk” using the FROM keyword.

Next, set the DOCKER_HOME and CRAWLER_HOME environment variables and create a user and group, both named “crawler”, in the image.

The following commands will create DOCKER_HOME and CRAWLER_HOME directories in the container and copy the content from the “norconex-collector-http-2.8.0” directory into CRAWLER_HOME.

The following commands change ownership and permissions for DOCKER_HOME, set the entrypoint, and execute the crawler.
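Putting the steps above together, a Dockerfile along these lines could be written from the “norconex-collector-http-2.8.0” directory. This is a sketch under stated assumptions: the values /docker and /docker/crawler, the “docker/bin” script paths, and the USER instruction are illustrative choices, while the base image, variable names, and the “crawler” user and group come from the steps described. Note also that the “java” image has since been deprecated on Docker Hub, so a maintained Java 8 image may be needed instead.

```shell
# Write the Dockerfile into the current directory.
# The quoted 'EOF' keeps ${...} literal so Docker expands it at build time.
cat > Dockerfile <<'EOF'
# Base image named in this tutorial.
FROM java:8-jdk

# Assumed values; only the variable names are given by the tutorial.
ENV DOCKER_HOME=/docker
ENV CRAWLER_HOME=/docker/crawler

# Create the "crawler" group and user.
RUN groupadd crawler && useradd -g crawler crawler

# Create both directories and copy the extracted collector into CRAWLER_HOME.
RUN mkdir -p ${DOCKER_HOME} ${CRAWLER_HOME}
COPY ./ ${CRAWLER_HOME}

# Give "crawler" ownership and make the scripts executable.
RUN chown -R crawler:crawler ${DOCKER_HOME} && chmod -R 755 ${DOCKER_HOME}

# Assumption: run the crawler as the unprivileged user.
USER crawler

# Exec form does not expand variables, so the paths are written out.
ENTRYPOINT ["/docker/crawler/docker/bin/entrypoint.sh"]
CMD ["/docker/crawler/docker/bin/start.sh"]
EOF
```

Because CRAWLER_HOME sits inside DOCKER_HOME in this sketch, a single recursive chown and chmod on DOCKER_HOME covers the crawler files as well.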

Almost There

Build a Docker image of Norconex Collector with the following command:
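A build command for this step might look like the sketch below. The image name and tag are assumptions chosen for illustration; any name works as long as the same one is used when starting the container.

```shell
# Assumed image name and tag (hypothetical; pick your own).
IMAGE="norconex-collector:2.8.0"

# Build the image from the Dockerfile in the current directory.
build_image() {
  docker build -t "$IMAGE" .
}
```

Call build_image from the “norconex-collector-http-2.8.0” directory, where the Dockerfile lives.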

When the build completes, Docker prints a success message.

Start Elasticsearch for development with the following command (see Install Elasticsearch with Docker for more details):
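For a single-node development instance, the command could be sketched as follows. The version tag is an assumption; choose one that your Elasticsearch Committer release supports.

```shell
# Assumed version tag (hypothetical choice); the registry path is Elastic's own.
ES_IMAGE="docker.elastic.co/elasticsearch/elasticsearch:6.3.1"

# Run a throwaway single-node Elasticsearch, published on port 9200.
start_elasticsearch() {
  docker run --rm -p 9200:9200 -p 9300:9300 \
    -e "discovery.type=single-node" "$ES_IMAGE"
}
```

Call start_elasticsearch in its own terminal. The --rm flag removes the container on exit, so indexed data does not survive a restart.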

Start Norconex Collector.
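One way to start the container is sketched below, assuming the image was tagged “norconex-collector:2.8.0” (a hypothetical name) and that Docker runs on Linux. Host networking is used here so that 127.0.0.1:9200 inside the container reaches the Elasticsearch port published on the host.

```shell
# Assumed image name and tag; use whatever name the image was built with.
IMAGE="norconex-collector:2.8.0"

# Run the crawler on the host network and remove the container when it exits.
start_collector() {
  docker run --rm --network host "$IMAGE"
}
```

Call start_collector once Elasticsearch is up. The crawler logs appear in the terminal until the crawl finishes.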

Let’s verify the crawling result. Visit http://127.0.0.1:9200/norconex/_search?pretty=true&q=* and you will see two indexed documents.
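The same check can be run from a terminal. This sketch assumes curl is installed and wraps the request in a small helper function (a hypothetical name) so the URL is quoted only once.

```shell
# Search URL from this tutorial; single quotes keep & and * away from the shell.
SEARCH_URL='http://127.0.0.1:9200/norconex/_search?pretty=true&q=*'

# Print the search response for the "norconex" index.
check_index() {
  curl -s "$SEARCH_URL"
}
```

Call check_index after the crawl completes and look for the two documents in the returned hits.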

Conclusion

This setup is intended for development and testing. Before using it in a production environment, we recommend that you consider data persistence for the Elasticsearch Docker container, security, and other requirements specific to your case.

Useful Links

Download Norconex Collector
Download Norconex Elasticsearch Committer

Meng Zhai is an experienced full-stack software designer and developer with a Master’s Degree from the University of Ottawa. Within a short time of joining the Norconex team he became development lead for one of Norconex’s clients. He continues to work on multiple client projects and also collaborates on the Norconex suite of products.