Introduction
Docker is popular because it makes it easy to package and deliver programs. This article will show you how to run the Java-based, open-source crawler, Norconex HTTP Collector and Elasticsearch Committer in Docker to crawl a website and index its content into Elasticsearch. At the end of this article, you can find links to download the complete, fully functional files.
Overview
Here is the whole structure, which contains a “Dockerfile” to make a Docker image, “entrypoint.sh” and “start.sh” in “bin/” directory to configure and execute the Docker container, and “es-config.xml” in “examples/elasticsearch” as Norconex-Collector’s configuration file to crawl a website and index contents into Elasticsearch.
Installation
We are using Docker Community Edition in this tutorial. See Install Docker for more information.
Download the latest Norconex Collector and extract the downloaded .zip file. See Getting Started for more details.
Download the latest Norconex Elasticsearch Committer and install it. See Installation for more details.
Collector Configuration
Create “es-config.xml” in the “examples/elasticsearch” directory. In this tutorial, we will crawl /product/collector-http-test/complex1.php and /product/collector-http-test/complex2.php and index them to Elasticsearch, which is running on 127.0.0.1:9200, with an index named “norconex.” See Norconex Collector Configuration as a reference.
<?xml version="1.0" encoding="UTF-8"?> <!-- Copyright 2010-2017 Norconex Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> <httpcollector id="Norconex Complex Collector"> #set($http = "com.norconex.collector.http") #set($core = "com.norconex.collector.core") #set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer") #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter") #set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter") #set($committerClass = "com.norconex.committer.elasticsearch.ElasticsearchCommitter") #set($searchUrl = "http://127.0.0.1:9200") <progressDir>../crawlers/norconex/progress</progressDir> <logsDir>../crawlers/norconex/logs</logsDir> <crawlerDefaults> <referenceFilters> <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js</filter> </referenceFilters> <urlNormalizer class="$urlNormalizer"> <normalizations> removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters, removeDotSegments </normalizations> </urlNormalizer> <maxDepth>0</maxDepth> <workDir>../crawlers/norconex/workDir</workDir> <!-- We know we don't want to crawl the entire site, so ignore sitemap. --> <sitemapResolverFactory ignore="true" /> </crawlerDefaults> <crawlers> <crawler id="Norconex Complex Test Page 1"> <startURLs> <url>/product/collector-http-test/complex1.php</url> </startURLs> <committer class="$committerClass"> <nodes>$searchUrl</nodes> <indexName>norconex</indexName> <typeName>web</typeName> <queueDir>../crawlers/norconex/committer-queue</queueDir> <targetContentField>body</targetContentField> <queueSize>100</queueSize> <commitBatchSize>500</commitBatchSize> </committer> </crawler> <crawler id="Norconex Complex Test Page 2"> <startURLs> <url>/product/collector-http-test/complex2.php</url> </startURLs> <committer class="$committerClass"> <nodes>$searchUrl</nodes> <indexName>norconex</indexName> <typeName>web</typeName> <queueDir>../crawlers/norconex/committer-queue</queueDir> <targetContentField>body</targetContentField> <queueSize>100</queueSize> <commitBatchSize>500</commitBatchSize> </committer> </crawler> </crawlers> </httpcollector>
Entrypoint and Start Scripts
Create a directory, “docker”, to store the configuration and execute scripts.
Entrypoint.sh:
#!/bin/sh set -x set -e set -- /docker/crawler/docker/start.sh "$@" exec "$@"
start.sh:
#!/bin/sh set -x set -e ${CRAWLER_HOME}/collector-http.sh -a start -c examples/elasticsearch/es-config.xml
Dockerfile
A Dockerfile is a simple text -file that contains a list of commands that the Docker client calls on while creating an image. Create a new file, “Dockerfile”, in the “norconex-collector-http-2.8.0” directory.
Let’s start with the base image “java:8-jdk” using FROM
keyword.
FROM java:8-jdk
Set environment variables and create a user and group in the image. We’ll set DOCKER_HOME
and CRAWLER_HOME
environment variables and create the user and group, “crawler”.
ENV DOCKER_HOME /docker ENV CRAWLER_HOME /docker/crawler RUN groupadd crawler && useradd -g crawler crawler
The following commands will create DOCKER_HOME
and CRAWLER_HOME
directories in the container and copy the content from the “norconex-collector-http-2.8.0” directory into CRAWLER_HOME
.
RUN mkdir -p ${DOCKER_HOME} RUN mkdir -p ${CRAWLER_HOME} COPY ./ ${CRAWLER_HOME}
The following commands change ownership and permissions for DOCKER_HOME
, set entrypoint, and execute the crawler.
RUN chown -R crawler:crawler ${DOCKER_HOME} && chmod -R 755 ${DOCKER_HOME} ENTRYPOINT [ "/docker/crawler/docker/entrypoint.sh" ] CMD [ "/docker/crawler/docker/start.sh" ]
Almost There
Build a Docker image of Norconex Collector with the following command:
$ docker build -t norconex-collector:2.8.0 .
You will see this success message:
Successfully built 43298c7de13f Successfully tagged norconex-collector:2.8.0
Start Elasticsearch for development with the following command (see Install Elasticsearch with Docker for more details):
$ docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.1.2
Start Norconex Collector.
$ docker run --net=host norconex-collector:2.8.0
Let’s verify the crawling result. Visit http://127.0.0.1:9200/norconex/_search?pretty=true&q=* and you will see two indexed documents.
Conclusion
This tutorial is for development or testing use. If you would like to use it in a production environment, then we recommend that you consider the data persistence of Elasticsearch Docker container, security, and so forth, based on your particular case.
Useful Links
Download Norconex Collector
Download Norconex Elasticsearch Committer