elasticsearch

Introduction

In the era of data-driven decision-making, efficient data indexing is pivotal in empowering businesses to extract valuable insights from vast amounts of information. Elasticsearch, a powerful and scalable search and analytics service, has become popular for organizations seeking to implement robust search functionality. Norconex Web Crawler offers a seamless and effective solution for indexing web data to Elasticsearch.

In this blog post, you will learn how to utilize Norconex Web Crawler to index data to Elasticsearch and enhance your organization’s search capabilities.

Understanding Norconex Web Crawler

Norconex Web Crawler is an open-source web crawler designed to extract, parse, and index content from the web. The crawler’s flexibility and ease of use make it an excellent choice for extracting data from the web. Plus, Norconex offers a range of committers that index data to various repositories. See https://opensource.norconex.com/committers/ for a complete list of supported target repositories. If the provided committers do not meet your organizational requirements, you can extend the Committer Core and create a custom committer.

This blog post will focus on indexing data to Elasticsearch.

Prerequisites

To keep things simple, we will rely on Docker to stand up an Elasticsearch container locally. If you don’t have Docker installed, follow the installation instructions on the Docker website. Once Docker is installed, open a command prompt and run the following command:

docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:7.17.10

This command does the following:

requests version 7.17.10 of Elasticsearch
maps ports 9200 and 9600
sets the discovery type to “single-node”
disables the Elasticsearch security features (xpack.security.enabled=false)
starts the Elasticsearch container in the background (-d)

Once the container is up, browse to http://localhost:9200 in your favourite browser. You will get a response that looks like this:

{
  "name" : "c6ce36ceee17",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "gGbNNtDHTKCSJnYaycuWzQ",
  "version" : {
    "number" : "7.17.10",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "fecd68e3150eda0c307ab9a9d7557f5d5fd71349",
    "build_date" : "2023-04-23T05:33:18.138275597Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

The Elasticsearch container is now up and running!
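The container can take a few seconds before it answers on port 9200. The following Python sketch polls the root endpoint until it replies and then returns the version number from the JSON shown above. It is a minimal illustration using only the standard library; the function names and the `fetch` hook are hypothetical and not part of Elasticsearch or Norconex.

```python
import json
import time
from urllib.request import urlopen


def read_version(body: str) -> str:
    """Return the "version.number" field from the root endpoint's JSON reply."""
    return json.loads(body)["version"]["number"]


def wait_for_elasticsearch(url="http://localhost:9200", attempts=30, delay=2.0,
                           fetch=None):
    """Poll the root endpoint until it answers, then return the version string.

    `fetch` is a hypothetical hook (a callable taking a URL and returning the
    response body as text) so the function can be exercised without a server.
    """
    if fetch is None:
        def fetch(target):
            with urlopen(target, timeout=5) as response:
                return response.read().decode("utf-8")

    last_error = None
    for _ in range(attempts):
        try:
            return read_version(fetch(url))
        except (OSError, ValueError, KeyError) as error:
            # Connection refused, or a reply that is not the expected JSON yet.
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"Elasticsearch did not answer at {url}") from last_error
```

Calling `wait_for_elasticsearch()` with no arguments targets http://localhost:9200, the address used in this post.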

Norconex Web Crawler

Download the latest version of the Web Crawler from Norconex’s website. At the time of this writing, version 3.0.2 is the most recent version.

Download the latest version of Elasticsearch Committer. At the time of this writing, version 5.0.0 is the most recent version.

Follow the automated installation instructions to install the Elasticsearch Committer libraries into the Crawler.

Crawler Configuration

We will use the following Crawler configuration for this test. Place this configuration in the root folder of your Crawler installation, with the filename my-config.xml.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Norconex HTTP Collector">
  <!-- Decide where to store generated files. -->
  <workDir>./output</workDir>
  <crawlers>
    <crawler id="Norconex Elasticsearch Committer Demo">
      <startURLs
          stayOnDomain="true"
          stayOnPort="true"
          stayOnProtocol="true">
        <url>https://github.com/</url>
      </startURLs>
      <!-- Only crawl 1 page. -->
      <maxDepth>0</maxDepth>
      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolver ignore="true" />
      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5 seconds" />
      <importer>
        <postParseHandlers>
          <!-- Only keep `description` and `title` fields. -->
          <handler class="KeepOnlyTagger">
            <fieldMatcher method="csv">
              description,title
            </fieldMatcher>
          </handler>
        </postParseHandlers>
      </importer>
      <committers>
        <!-- Send documents to Elasticsearch. -->
        <committer class="ElasticsearchCommitter">
          <nodes>http://localhost:9200</nodes>
          <indexName>my-index</indexName>
        </committer>
      </committers>
    </crawler>
  </crawlers>
</httpcollector>

Note that this is the minimal configuration required. There are many more parameters you can set to suit your needs. Norconex’s documentation does an excellent job of detailing all the parameters.
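Before starting a crawl, it can help to confirm that the configuration is well-formed XML and points at the intended Elasticsearch node and index. The sketch below is one way to do that with Python's standard library. It only checks XML syntax and reads a few elements; it does not validate the file against Norconex's own schema. `SAMPLE_CONFIG` is a trimmed copy of the configuration above, and `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of my-config.xml, kept inline so the sketch is self-contained.
SAMPLE_CONFIG = """
<httpcollector id="Norconex HTTP Collector">
  <workDir>./output</workDir>
  <crawlers>
    <crawler id="Norconex Elasticsearch Committer Demo">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://github.com/</url>
      </startURLs>
      <maxDepth>0</maxDepth>
      <committers>
        <committer class="ElasticsearchCommitter">
          <nodes>http://localhost:9200</nodes>
          <indexName>my-index</indexName>
        </committer>
      </committers>
    </crawler>
  </crawlers>
</httpcollector>
"""


def summarize(config_xml: str) -> list[dict]:
    """List, per crawler, the start URLs and the committer targets."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError on malformed XML.
    root = ET.fromstring(config_xml.strip())
    summary = []
    for crawler in root.iter("crawler"):
        summary.append({
            "crawler": crawler.get("id"),
            "start_urls": [url.text.strip() for url in crawler.iter("url")],
            "targets": [
                (c.findtext("nodes", "").strip(),
                 c.findtext("indexName", "").strip())
                for c in crawler.iter("committer")
            ],
        })
    return summary


for entry in summarize(SAMPLE_CONFIG):
    print(entry)
```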

Start the Crawler

Norconex Web Crawler comes packaged with shell scripts to start the application. To start the crawler, run the following command in a shell terminal. The example below is for a Windows machine. If you are on Linux, use the collector-http.sh script instead.

C:\Norconex\norconex-collector-http-3.0.2>collector-http.bat start -clean -config=.\my-config.xml

Recall that you saved the Crawler configuration at the root of your Crawler installation.

Since only a single page is being indexed, the crawl job will take only a few seconds. Once the job completes, query the Elasticsearch container by browsing to http://localhost:9200/my-index/_search in your browser. You will see something like this:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index",
        "_id": "https://github.com/",
        "_score": 1,
        "_source": {
          "title": "GitHub: Let's build from here · GitHub",
          "description": "GitHub is where over 100 million developers shape the future of software, together. Contribute to the open source community, manage your Git repositories, review code like a pro, track bugs and features, power your CI/CD and DevOps workflows, and secure code before you commit it.",
          "content": "<redacted for brevity>"
        }
      }
    ]
  }
}

You can see that the document was indeed indexed!
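If you would rather check the result from a script than from a browser, the response can be read as ordinary JSON. The sketch below pulls the document id and title out of a `_search` response body. `SAMPLE_RESPONSE` is a shortened copy of the response above, and `list_documents` is a hypothetical helper name.

```python
import json

# Shortened copy of the _search response shown above.
SAMPLE_RESPONSE = """
{
  "hits": {
    "total": {"value": 1, "relation": "eq"},
    "hits": [
      {
        "_index": "my-index",
        "_id": "https://github.com/",
        "_source": {"title": "GitHub: Let's build from here · GitHub"}
      }
    ]
  }
}
"""


def list_documents(response_text: str) -> list[tuple[str, str]]:
    """Return (document id, title) pairs from a _search response body."""
    hits = json.loads(response_text)["hits"]["hits"]
    return [(hit["_id"], hit["_source"].get("title", "")) for hit in hits]


for doc_id, title in list_documents(SAMPLE_RESPONSE):
    print(doc_id, "->", title)
```

Note that the document id is the crawled URL, which makes it easy to match an indexed document back to its source page.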

Conclusion

Norconex Web Crawler streamlines the process of indexing web data into Elasticsearch, making valuable information readily available for search and analytics.
This guide provides step-by-step instructions for integrating your data with Elasticsearch, unleashing potent search capabilities for your organization’s applications. Embrace the powerful synergy of Norconex Web Crawler and Elasticsearch to revolutionize your data indexing journey, empowering your business with real-time insights and effortless data discovery. Happy indexing!

This was my first year joining the open-road Elastic{ON} Tour 2019 event in Toronto on September 18, 2019. My experience at this event was fully charged with excitement from meeting with Elastic architects, operations folks, security pros, and developers alike.

The event was hosted at The Carlu in downtown Toronto. In the morning, the opening keynote was presented by Nick Drost, Senior Director of Elastic, on search solutions such as app search, site search, and enterprise search, security using SIEM, and more. One of the most exciting keynote updates was about using Elastic Cloud on Kubernetes to help simplify processes of deployment, security, scaling, upgrades, snapshots, and high availability.

The next presenter, Michael Basnight, Software Engineer at Elastic, provided an Elastic Stack roadmap with demos of the latest and upcoming features. Kibana has added new capabilities to become much more than just the main user interface of the Elastic Stack, with user interfaces for infrastructure and logs. He introduced Fleet, which provides centralized config deployment, Beats monitoring, and upgrade management. Frozen indices allow for more index storage by keeping indices available without taking up heap memory until they are requested. He also highlighted advanced machine learning analytics for outlier detection, supervised model training for regression and classification, and an ingest prediction processor. Elasticsearch performance has increased by employing Weak AND (also called “WAND”), providing improvements as high as 3,700% to term search and improving other query types between 28% and 292%.

Another feature added to the Elastic Stack is advanced scoring to help boost documents at query time, using rank_features and distance_features. The new Geo UI uses map layers.

One of the most interesting new Beats to watch for is Functionbeat, a serverless data shipper that can subscribe to AWS SQS event topics and CloudWatch Logs and provisions an AWS Lambda function to ship data to Elasticsearch or Elastic Cloud Enterprise.

Elastic’s lightweight data shippers, the Beats (Filebeat for log files, Metricbeat for metrics, Packetbeat for network data, Winlogbeat for Windows event logs, Auditbeat for audit data, Heartbeat for uptime monitoring, and the latest, Functionbeat, for serverless shipping), can be complemented with Norconex open-source products. Norconex HTTP Collector or Norconex Filesystem Collector can crawl metadata from the web or a filesystem, and the open-source Norconex Elasticsearch Committer can then push that data to an Elasticsearch index, either on Elastic Cloud Enterprise or on an on-prem Elastic Stack. Norconex can help with collecting metadata from enterprise web architecture or enterprise filesystems for quick searching and relevant results.

Packed into the morning session, Jason Rhodes, Senior Software Engineer at Elastic, presented on unified observability, combining logs, metrics, and traces.

The afternoon session, Search for All with Elastic Enterprise Search and a Site Search demo and feature walkthrough, was presented by Diane Tetrault, Director of Product Marketing at Elastic. The latest UI gives the user the ability to configure content sources they search for and connect to their own data sources. Elastic Common Schema, introduced as an open-source specification, defines a common set of document fields for data ingested into Elasticsearch (https://www.elastic.co/blog/introducing-the-elastic-common-schema).

The Security with Elastic Stack session was presented by Neil Desai, Security Specialist at Elastic. He discussed the latest security capabilities to enable analysis automation to defend from cyber threats.

The Kibana and geo update features in Canvas and Elastic Maps were presented by Raya Fratkina, Kibana Team Lead at Elastic. Learning about ways to use these functionalities makes data more actionable.

I also learned tips at Elastic Architecture at Scale, a presentation by Artem Pogossian, Solutions Architect at Elastic. He discussed scaling from local laptops to multi-clusters and cross-clusters using case deployments.

A useful new feature in machine learning and analytics was introduced by Rich Collier, Solutions Architect and ML Specialist at Elastic. He demonstrated a use case using data frames, also called transforms, a feature that allows transformation of an existing index to a secondary, summarized index. Rich showed in a demo a possible use case from a digital retailer, using time series modeling to look for anomalies and forecasting in the shopper’s purchases, integrating Canvas UI designed in Kibana to build real-time data models. It was amazing to see the ability in demo to detect possible fraudulent purchases without having to be a data science expert.

Finally, after all these informational sessions, thanks to the Elastic event organizers for adding a closing happy hour, where I grabbed a drink with fellow attendees and Elastic folks. This was a great way to close a very extensive learning session. I look forward to being at next year’s Elastic{ON} tour.

Event pass — Elastic{ON} Tour 2019 in Toronto event pass.

Elastic Team — On the right, Osman Ishaq at Elastic at the Ask Me Anything Booth

Raya Fratkina, Team Lead, Kibana at Elastic

Happy hour closing — Closing happy hour, drink with Elastic folks and other attendees.

Introduction

Docker is popular because it makes it easy to package and deliver programs. This article will show you how to run the Java-based, open-source crawler, Norconex HTTP Collector and Elasticsearch Committer in Docker to crawl a website and index its content into Elasticsearch. At the end of this article, you can find links to download the complete, fully functional files.

Overview

Here is the whole structure: a “Dockerfile” to build the Docker image, “entrypoint.sh” and “start.sh” in the “bin/” directory to configure and run the Docker container, and “es-config.xml” in “examples/elasticsearch” as the Norconex Collector configuration file used to crawl a website and index its content into Elasticsearch.

Installation

We are using Docker Community Edition in this tutorial. See Install Docker for more information.

Download the latest Norconex Collector and extract the downloaded .zip file. See Getting Started for more details.

Download the latest Norconex Elasticsearch Committer and install it. See Installation for more details.

Collector Configuration

Create “es-config.xml” in the “examples/elasticsearch” directory. In this tutorial, we will crawl /product/collector-http-test/complex1.php and /product/collector-http-test/complex2.php and index them to Elasticsearch, which is running on 127.0.0.1:9200, with an index named “norconex”. See Norconex Collector Configuration as a reference.

<?xml version="1.0" encoding="UTF-8"?>
<!--
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<httpcollector id="Norconex Complex Collector">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")
  #set($committerClass = "com.norconex.committer.elasticsearch.ElasticsearchCommitter")
  #set($searchUrl = "http://127.0.0.1:9200")

  <progressDir>../crawlers/norconex/progress</progressDir>
  <logsDir>../crawlers/norconex/logs</logsDir>

  <crawlerDefaults>
    <referenceFilters>
      <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js</filter>
    </referenceFilters>
    <urlNormalizer class="$urlNormalizer">
      <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters,
        removeDotSegments
      </normalizations>
    </urlNormalizer>
    <maxDepth>0</maxDepth>
    <workDir>../crawlers/norconex/workDir</workDir>
    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />
  </crawlerDefaults>
  <crawlers>
    <crawler id="Norconex Complex Test Page 1">
      <startURLs>
        <url>/product/collector-http-test/complex1.php</url>
      </startURLs>
      <committer class="$committerClass">
        <nodes>$searchUrl</nodes>
        <indexName>norconex</indexName>
        <typeName>web</typeName>
        <queueDir>../crawlers/norconex/committer-queue</queueDir>
        <targetContentField>body</targetContentField>
        <queueSize>100</queueSize>
        <commitBatchSize>500</commitBatchSize>
      </committer>
    </crawler>
    <crawler id="Norconex Complex Test Page 2">
      <startURLs>
        <url>/product/collector-http-test/complex2.php</url>
      </startURLs>
      <committer class="$committerClass">
        <nodes>$searchUrl</nodes>
        <indexName>norconex</indexName>
        <typeName>web</typeName>
        <queueDir>../crawlers/norconex/committer-queue</queueDir>
        <targetContentField>body</targetContentField>
        <queueSize>100</queueSize>
        <commitBatchSize>500</commitBatchSize>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
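The `#set` lines in this configuration are Apache Velocity directives: each one defines a variable that later `$name` or `${name}` references expand to. To make the expansion concrete, the sketch below imitates it with Python's `string.Template`, which happens to use the same `$name` and `${name}` notation. This is only an illustration of simple variable substitution, not how Norconex or Velocity is implemented; `build_context` and `resolve` are hypothetical helper names.

```python
from string import Template

# (name, value) pairs, in the order of some #set directives from es-config.xml.
DIRECTIVES = [
    ("http", "com.norconex.collector.http"),
    ("core", "com.norconex.collector.core"),
    ("urlNormalizer", "${http}.url.impl.GenericURLNormalizer"),
    ("filterExtension", "${core}.filter.impl.ExtensionReferenceFilter"),
    ("searchUrl", "http://127.0.0.1:9200"),
]


def build_context(directives):
    """Evaluate the directives in order; later values may use earlier names."""
    context = {}
    for name, value in directives:
        context[name] = Template(value).substitute(context)
    return context


def resolve(line, context):
    """Expand every $name or ${name} reference found in a configuration line."""
    return Template(line).substitute(context)


context = build_context(DIRECTIVES)
print(resolve('<urlNormalizer class="$urlNormalizer">', context))
print(resolve("<nodes>$searchUrl</nodes>", context))
```

Defining the package prefixes and the Elasticsearch URL once means both crawlers share the same values, so a change only has to be made in one place.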

Entrypoint and Start Scripts

Create a directory, “docker”, to store the scripts that configure and start the container.

entrypoint.sh:

#!/bin/sh
set -x
set -e

set -- /docker/crawler/docker/start.sh "$@"

exec "$@"

start.sh:

#!/bin/sh
set -x
set -e
${CRAWLER_HOME}/collector-http.sh -a start -c examples/elasticsearch/es-config.xml

Dockerfile

A Dockerfile is a simple text file that contains a list of commands that the Docker client calls on while creating an image. Create a new file, “Dockerfile”, in the “norconex-collector-http-2.8.0” directory.
Let’s start with the base image “java:8-jdk” using the FROM keyword.

FROM java:8-jdk

Set environment variables and create a user and group in the image. We’ll set DOCKER_HOME and CRAWLER_HOME environment variables and create the user and group, “crawler”.

ENV DOCKER_HOME /docker
ENV CRAWLER_HOME /docker/crawler
RUN groupadd crawler && useradd -g crawler crawler

The following commands will create DOCKER_HOME and CRAWLER_HOME directories in the container and copy the content from the “norconex-collector-http-2.8.0” directory into CRAWLER_HOME.

RUN mkdir -p ${DOCKER_HOME}
RUN mkdir -p ${CRAWLER_HOME}
COPY ./ ${CRAWLER_HOME}

The following commands change ownership and permissions for DOCKER_HOME, set entrypoint, and execute the crawler.

RUN chown -R crawler:crawler ${DOCKER_HOME} && chmod -R 755 ${DOCKER_HOME}
ENTRYPOINT [ "/docker/crawler/docker/entrypoint.sh" ]
CMD [ "/docker/crawler/docker/start.sh" ]
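To see how ENTRYPOINT, CMD, and entrypoint.sh fit together, it helps to trace the argument list. With both instructions in exec form, Docker appends CMD to ENTRYPOINT, and arguments passed to `docker run` replace CMD. The entrypoint script then prepends start.sh to whatever it received and executes the result. The Python sketch below only simulates that bookkeeping; the function names are hypothetical.

```python
ENTRYPOINT = ["/docker/crawler/docker/entrypoint.sh"]
CMD = ["/docker/crawler/docker/start.sh"]
START_SCRIPT = "/docker/crawler/docker/start.sh"


def container_argv(entrypoint, cmd, run_args=None):
    """Command line Docker launches: ENTRYPOINT followed by CMD.

    Arguments given after the image name in `docker run` replace CMD.
    """
    return entrypoint + (run_args if run_args else cmd)


def after_entrypoint(argv):
    """Mimic entrypoint.sh: `set -- start.sh "$@"` followed by `exec "$@"`."""
    script_args = argv[1:]  # "$@" as seen inside entrypoint.sh
    return [START_SCRIPT] + script_args


launched = container_argv(ENTRYPOINT, CMD)
print(launched)
print(after_entrypoint(launched))
```

With the default CMD, start.sh ends up being run with its own path as a single argument. Since start.sh does not read its arguments, this is harmless.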

Almost There

Build a Docker image of Norconex Collector with the following command:

$ docker build -t norconex-collector:2.8.0 .

You will see this success message:

Successfully built 43298c7de13f
Successfully tagged norconex-collector:2.8.0

Start Elasticsearch for development with the following command (see Install Elasticsearch with Docker for more details):

$ docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.1.2

Start Norconex Collector.

$ docker run --net=host norconex-collector:2.8.0

Let’s verify the crawling result. Visit http://127.0.0.1:9200/norconex/_search?pretty=true&q=* and you will see two indexed documents.
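The same check can be scripted. One detail to keep in mind: Elasticsearch 6.x, used in this tutorial, reports `hits.total` as a plain number, while 7.x reports an object with a `value` key. The sketch below builds the search URL and reads the total in either form. The helper names are hypothetical and the sample responses are reduced to the fields being read.

```python
import json
from urllib.parse import urlencode


def search_url(base, index, query="*", pretty=True):
    """Build a URI search request such as the one used above."""
    params = {"q": query}
    if pretty:
        params["pretty"] = "true"
    return f"{base}/{index}/_search?{urlencode(params)}"


def total_hits(response_text):
    """Return the number of matching documents from a _search response."""
    total = json.loads(response_text)["hits"]["total"]
    # 6.x: a plain number. 7.x: {"value": n, "relation": "eq"}.
    return total["value"] if isinstance(total, dict) else total


print(search_url("http://127.0.0.1:9200", "norconex"))
print(total_hits('{"hits": {"total": 2, "hits": []}}'))
```

`urlencode` percent-encodes the `*` wildcard as `%2A`, which is equivalent to the URL typed in the browser.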

Conclusion

This tutorial is for development or testing use. If you would like to use it in a production environment, then we recommend that you consider the data persistence of Elasticsearch Docker container, security, and so forth, based on your particular case.

Useful Links

Download Norconex Collector
Download Norconex Elasticsearch Committer

Norconex is proud to announce the release of Norconex HTTP Collector version 2.8.0. This release is accompanied by new releases of many related Norconex open-source products (Filesystem Collector, Importer, Committers, etc.), and together they bring dozens of new features and enhancements highlighted below.

Extract a “Featured Image” from web pages

In addition to taking screenshots of web pages, you can now extract the main image of a web page thanks to the new FeaturedImageProcessor. You can specify conditions to identify the image (the first one encountered that matches a minimum size or a given pattern). You also have the option to store the image on file or as a BASE64 string with the crawled document (after scaling it to your preferred dimensions) or simply store a reference to it.

<preImportProcessors>
  <processor class="com.norconex.collector.http.processor.impl.FeaturedImageProcessor">
    <minDimensions>300x400</minDimensions>
    <scaleDimensions>50</scaleDimensions>
    <imageFormat>jpg</imageFormat>
    <scaleQuality>max</scaleQuality>
    <storage>inline</storage>
  </processor>
</preImportProcessors>

Limit link extraction to specific page portions

The GenericLinkExtractor can now restrict the links it extracts (and follows) to those found within one or more specific sections of a web page. For instance, you may want to extract only the links found in navigation menus and skip those found in content areas, in case the latter usually point to other sites you do not want to crawl.

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">

  <extractBetween>
    <start><![CDATA[<!-- BEGIN NAV LINKS -->]]></start>
    <end><![CDATA[<!-- END NAV LINKS -->]]></end>
  </extractBetween>

  <noExtractBetween>
    <start><![CDATA[<!-- BEGIN EXTERNAL SITES -->]]></start>
    <end><![CDATA[<!-- END EXTERNAL SITES -->]]></end>
  </noExtractBetween>

</extractor>
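To illustrate the idea behind extractBetween and noExtractBetween, here is a small Python sketch that keeps only the page portions between the “NAV LINKS” markers, drops anything between the “EXTERNAL SITES” markers, and then collects the links. It is a simplified imitation, not the GenericLinkExtractor implementation: the order in which the two rules are applied, the regular expressions, and the sample page are all assumptions made for this example.

```python
import re

HREF = re.compile(r'href="([^"]+)"')


def keep_between(html, start, end):
    """Keep only the text found between each start/end marker pair."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return "".join(re.findall(pattern, html, flags=re.DOTALL))


def strip_between(html, start, end):
    """Remove the text found between each start/end marker pair."""
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", html, flags=re.DOTALL)


def extract_links(html):
    kept = keep_between(html, "<!-- BEGIN NAV LINKS -->", "<!-- END NAV LINKS -->")
    kept = strip_between(kept, "<!-- BEGIN EXTERNAL SITES -->",
                         "<!-- END EXTERNAL SITES -->")
    return HREF.findall(kept)


# Made-up page used only for this illustration.
PAGE = """<a href="/ignored">Outside the navigation</a>
<!-- BEGIN NAV LINKS -->
<a href="/products">Products</a>
<!-- BEGIN EXTERNAL SITES --><a href="https://example.com/">Partner</a><!-- END EXTERNAL SITES -->
<a href="/about">About</a>
<!-- END NAV LINKS -->
<a href="/footer">Footer</a>"""

print(extract_links(PAGE))
```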

Truncate long field values

The new TruncateTagger offers the ability to truncate long values and the option to replace the truncated portion with a hash to help preserve uniqueness when required. This is especially useful in preventing errors with search engines (or other repositories) that have field length limitations.

<tagger class="com.norconex.importer.handler.tagger.impl.TruncateTagger"
    fromField="mySuperLongField"
    maxLength="500"
    toField="myTruncatedField"
    overwrite="true"
    appendHash="true"
    suffix="!" />
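The reason for replacing the truncated portion with a hash is that two long values sharing the same first characters would otherwise become identical after truncation. The Python sketch below shows the principle. It is not the TruncateTagger implementation: the hash algorithm (SHA-256, shortened to 8 characters), the placement of the suffix before the hash, and the fact that neither counts toward the maximum length are assumptions made for this example.

```python
import hashlib


def truncate(value, max_length, append_hash=False, suffix=""):
    """Cut `value` to `max_length` characters, optionally marking the cut."""
    if len(value) <= max_length:
        return value  # short enough, left untouched
    kept = value[:max_length] + suffix
    if append_hash:
        # Hash only the removed portion so different tails stay distinguishable.
        removed = value[max_length:].encode("utf-8")
        kept += hashlib.sha256(removed).hexdigest()[:8]
    return kept


first = truncate("abcdef", 3, append_hash=True, suffix="!")
second = truncate("abcxyz", 3, append_hash=True, suffix="!")
print(first, second, first != second)
```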

Add metadata to a document using an external application

The new ExternalTagger allows you to point to an external (i.e., command-line) application to “decorate” a document with extra metadata information. Both the existing document content and metadata can be supplied to the external application. The application output can be in a specific format (json, xml, properties) or free-form combined with metadata extraction patterns you can configure. Either standard streams or files can be supplied as arguments to the external application. To transform the content using an external application instead, have a look at the ExternalTransformer, which has also been updated to support metadata.

<tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <command>
    /app/addressExtractor ${INPUT} ${INPUT_META} ${REFERENCE}
  </command>
  <metadata inputFormat="json">
    <pattern field="address" valueGroup="1">
      ^address=(.*)$
    </pattern>
  </metadata>
</tagger>
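The extraction pattern in this example captures everything after `address=` on a line, and valueGroup="1" selects the first capture group as the field value. The Python sketch below applies the same regular expression to some made-up application output to show what would be captured. Matching line by line (the MULTILINE flag) and trimming the value are assumptions about how the pattern is applied, and `extract_field` is a hypothetical helper name.

```python
import re

# Same expression as in the configuration; MULTILINE makes ^ and $ work per line.
ADDRESS = re.compile(r"^address=(.*)$", re.MULTILINE)


def extract_field(output, pattern=ADDRESS, value_group=1):
    """Return the captured group of every line matching the pattern."""
    return [match.group(value_group).strip() for match in pattern.finditer(output)]


# Made-up output of the external application, for illustration only.
SAMPLE_OUTPUT = "name=Example Corp\naddress=123 Main Street\nphone=555-0100\n"

print(extract_field(SAMPLE_OUTPUT))
```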

Other improvements

This release includes many more new features and enhancements:

To create a document checksum, you can now combine metadata with content.
The TextPatternTagger can now extract field names dynamically in addition to values.
The ReplaceTagger and ReplaceTransformer now support empty/null replacement values.
There are new configuration options on the GenericHttpClientFactory:
- “authFormParams” to add arbitrary parameters to authentication forms.
- “authPreemptive” to use preemptive authentication with BASIC authentication.
The Amazon CloudSearch and Elasticsearch Committers both have a new “fixBadIds” flag to safely handle URLs that do not meet product limitations.

For the complete list of changes, refer to these product release notes:

Useful links

Download Norconex HTTP Collector
Get started with Norconex HTTP Collector
Report your issues and questions on Github
Contact Norconex

Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector. This update, along with related component updates, introduces several interesting features.

HTTP Collector changes

The following items are specific to the HTTP Collector. For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.

Crawling of JavaScript-driven pages

The alternative document fetcher PhantomJSDocumentFetcher now makes it possible to crawl web pages with JavaScript-generated content. This much-awaited feature is now available thanks to integration with the open-source PhantomJS headless browser. As a bonus, you can also take screenshots of web pages you crawl.

<documentFetcher
    class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <exePath>/path/to/phantomjs.exe</exePath>
  <renderWaitTime>5000</renderWaitTime>
  <referencePattern>^.*\.html$</referencePattern>
</documentFetcher>

Generic changes

The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).

Much improved XML configuration validation

You no longer have to hunt for a misconfiguration. Schema-based XML configuration validation was added, and you will now get errors if the XML syntax of any configuration option is invalid. This validation can be triggered from the command prompt with this new flag: -k or --checkcfg.

# -k can be used on its own, but when combined with -a (like below),
# it will prevent the collector from executing if there are any errors.

collector-http.sh -a start -c examples/minimum/minimum-config.xml -k

# Error sample:
ERROR (XML) ReplaceTagger: cvc-attribute.3: The value 'asdf' of attribute 'regex' on element 'replace' is not valid with respect to its type, 'boolean'.

Enter durations in human-readable format

Having to convert a duration to milliseconds is not the most friendly. Anywhere in your XML configuration where a duration is expected, you can now use a human-readable representation (English only) as an alternative.

<!-- Example using "5 seconds" and "1 second" as opposed to milliseconds -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
    default="5 seconds" ignoreRobotsCrawlDelay="true" scope="site" >
  <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
</delay>
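To show what such a conversion involves, here is a small Python sketch that turns strings like “5 seconds” into milliseconds. It is a deliberately simplified parser written for this example, not the one used by Norconex: the set of supported units and the handling of plain numbers as milliseconds are assumptions.

```python
import re

UNIT_MS = {
    "millisecond": 1,
    "second": 1_000,
    "minute": 60_000,
    "hour": 3_600_000,
    "day": 86_400_000,
}
TOKEN = re.compile(r"(\d+)\s*([a-z]+)")


def to_millis(text):
    """Convert a duration such as "1 hour 30 minutes" to milliseconds."""
    text = text.strip().lower()
    if text.isdigit():
        return int(text)  # a plain number is taken as milliseconds already
    matches = TOKEN.findall(text)
    if not matches:
        raise ValueError(f"not a duration: {text!r}")
    total = 0
    for amount, unit in matches:
        unit = unit.rstrip("s")  # accept both "second" and "seconds"
        if unit not in UNIT_MS:
            raise ValueError(f"unknown unit: {unit!r}")
        total += int(amount) * UNIT_MS[unit]
    return total


print(to_millis("5 seconds"), to_millis("1 second"))
```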

Lua scripting language

Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer. This gives you one more scripting option available out-of-the-box besides JavaScript/ECMAScript.

<!-- Add "apple" to a "fruit" metadata field: -->
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger"
    engineName="lua">
  <script><![CDATA[
    metadata:addString('fruit', {'apple'});
  ]]></script>
</tagger>

Modify documents using an external application

With the new ExternalTransformer, you can now use an external application to perform document transformation. This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.

<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <match field="docnumber">DocNo:(\d+)</match>
  </metadata>
</transformer>

Combine document fields

The new MergeTagger can be used for combining multiple fields into one. The target field can be either multi-valued or a single value in which the merged values are separated by the character of your choice.

<tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
  <merge toField="title" deleteFromFields="true"
      singleValue="true" singleValueSeparator=",">
    <fromFields>title,dc.title,dc:title,doctitle</fromFields>
  </merge>
</tagger>
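The effect of this configuration on a document's metadata can be sketched in a few lines of Python. The function below gathers the values of the source fields, optionally deletes those fields, and writes the result to the target field as either a list or a single joined string. It is an approximation written for this example, not the MergeTagger code: representing metadata as a dictionary of lists and overwriting the target field are assumptions.

```python
def merge_fields(metadata, from_fields, to_field,
                 delete_from_fields=False, single_value=False, separator=""):
    """Merge the values of `from_fields` into `to_field` (modifies `metadata`)."""
    values = []
    for name in from_fields:
        values.extend(metadata.get(name, []))  # missing fields are skipped
    if delete_from_fields:
        for name in from_fields:
            metadata.pop(name, None)
    if values:
        metadata[to_field] = [separator.join(values)] if single_value else values
    return metadata


# Made-up document metadata, for illustration only.
doc = {"title": ["Home"], "dc.title": ["Welcome"], "author": ["Jane"]}
merge_fields(doc, ["title", "dc.title", "dc:title", "doctitle"], "title",
             delete_from_fields=True, single_value=True, separator=",")
print(doc)
```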

New Committers

Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at the time of crawling, or you are not using a repository at all, Norconex Collectors now ship with two file-based Committers for easy consumption by your own process: XMLFileCommitter and JSONFileCommitter. All available committers can be found here.

<committer class="com.norconex.committer.core.impl.XMLFileCommitter">
  <directory>/path/my-xmls/</directory>
  <pretty>true</pretty>
  <docsPerFile>100</docsPerFile>
  <compress>false</compress>
  <splitAddDelete>false</splitAddDelete>
</committer>

Several additional features or changes can be found in the latest Collector releases. Among them:

New Importer RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
New SubstringTransformer for truncating content.
New UUIDTagger for giving a unique id to each document.
CharacterCaseTagger now supports “swap” and “string” to swap character case and capitalize beginning of a string, respectively.
ConstantTagger offers options when dealing with existing values: add to existing values, replace them, or do nothing.
Components such as Importer, Committers, etc., are all easier to install thanks to new utility scripts.
Document Access-Control-List (ACL) information is now extracted from SMB/CIFS file systems (Filesystem Collector).
New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops.
Added “removeTrailingHash” as a new GenericURLNormalizer option (HTTP Collector).
New “detectContentType” and “detectCharset” options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detecting them instead (Filesystem Collector).
Start URLs and start paths can now be dynamically created thanks to IStartURLsProvider and IStartPathsProvider (HTTP Collector and Filesystem Collector).

To get the complete list of changes, refer to the HTTP Collector release notes, Filesystem Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.

Download

Norconex just released major upgrades to all its Norconex Collectors and related projects. That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available committers (Solr, Elasticsearch, HP IDOL, etc), were all upgraded to version 2.0.0.

With these major product upgrades comes a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website. At a quick glance you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

Can now split a document into multiple documents.

Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).

Language detection (50+ languages).

Parsing and formatting of dates from/to any format.

Character case modifiers.

Can now index basic content statistics with each document (word count, average word length, average words per sentence, etc).

Can now supply a “seed file” for listing start URLs or start paths to your crawler.

Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used. This reduces I/O and improves performance.

New event model where listeners can listen for any type of crawler events.

Can now ignore parsing of specific content types.

Can filter documents based on arbitrary regular expressions performed on the document content.

Enhanced debugging options, where you can print out specific field content as they are being processed.

HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).

More…

Another significant change is all Norconex open-source projects are now licensed under The Apache License 2.0. We hope this will facilitate adoption with third party commercial offerings.

It is important to note that the 2.0.0 versions are not compatible with their previous 1.x versions. The configuration options changed in many areas, so do not expect to run your existing configuration under 2.0.0. Please refer to the latest documentation for new and modified configuration options.

Visit the new Norconex Collectors website now.

Norconex Committer and all its current concrete implementations (Solr, Elasticsearch, IDOL) have been upgraded and have seen a redesign of their websites. Committers are libraries responsible for posting data to various repositories (typically search engines). They are used in other products or projects, such as Norconex HTTP Collector.
