
Amazon Web Services (AWS) has been all the rage lately, used by many organizations, companies and even individuals. This rise in popularity can be attributed to the sheer number of services AWS provides, such as Elastic Compute Cloud (EC2), Elastic Beanstalk, Amazon S3, DynamoDB, and so on. One service that has been getting more exposure recently is Amazon CloudSearch. It is a platform built on top of the Apache Solr search engine that enables the indexing and searching of documents with a multitude of features.
The main focus of this blog post is crawling and indexing sites. Before delving into that, however, I will briefly go over the steps to configure a simple AWS CloudSearch domain. If you’re already familiar with creating a domain, you may skip to the next section of the post.

 

Starting a Domain

A CloudSearch domain is the search instance where all your documents will be indexed and stored. The level of usage of these domains is what dictates the pricing. Visit this link for more details.
Luckily, the web interface is visually appealing, intuitive and user friendly. First of all, you need an AWS account. If you don’t have one already, you can create one now by visiting the Amazon website. Once you have an account, simply follow these steps:

1) Click the CloudSearch icon (under the Analytics section) in the AWS console.

2) Click the “Create new search domain” button. Give the domain a name that conforms to the rules given in the first line of the popup menu, and select the instance type and replication factor you want. I’ll go for the default options to keep it simple.

3) Choose how you want your index fields to be added. I recommend starting off with the manual configuration option because it gives you the choice of adding the index fields at any time. You can find the description of each index field type here.

4) Set the access policies of your domain. You can start with the first option because it is the most straightforward and sensible way to start.

5) Review your selected options and edit what needs to be edited. Once you’re satisfied with the configurations, click “Confirm” to finalize the process.

 

It’ll take a few minutes for the domain to be ready for use, as indicated by the yellow “LOADING” label that shows up next to the domain name. A green “ACTIVE” label shows up once the loading is done.

Now that the domain is fully loaded and ready to be used, you can choose to upload documents to it, add index fields, add suggesters, add analysis schemes and so on. Note, however, that the domain will need to be re-indexed for every change that you apply. This can be done by clicking the “Run indexing” button that pops up with every change. The time it takes for the re-indexing to finish depends on the number of documents contained in the domain.
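If you prefer the command line to the console, documents can also be pushed to the domain in batches with the AWS CLI. The following is only a rough sketch: the endpoint URL and file name are placeholders, and the batch file follows the CloudSearch JSON batch format.

# my-batch.json holds entries such as:
# [{"type": "add", "id": "doc1", "fields": {"title": "Sample", "content": "Sample text"}}]
aws cloudsearchdomain upload-documents \
  --endpoint-url https://doc-example-domain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com \
  --content-type application/json \
  --documents my-batch.json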

As mentioned previously, the main focus of this post is crawling sites and indexing the data to a CloudSearch domain. At the time of this writing, very few crawlers can commit to a CloudSearch domain, and those that do are unintuitive and needlessly complicated. The Norconex HTTP Collector is the only crawler with CloudSearch support that is genuinely intuitive and straightforward. The remainder of this blog post guides you through the steps necessary to set up a crawler and index its data to a CloudSearch domain as simply as possible.

 

Setting up the Norconex HTTP Collector

The Norconex HTTP Collector will be installed and configured in a Linux environment using Unix syntax. You can, however, just as easily install it on Windows; the instructions are essentially the same.

Unzip the downloaded file and navigate to the extracted folder. If needed, make sure to set the directory as readable and writable using the chmod command. Once that’s done, follow these steps:

1) Create a directory and name it testCrawl (it will hold the files generated by the crawler). In the folder myCrawler, create a file named config.xml and populate it with the minimal configuration file, which you can find in the examples/minimum directory.
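From the shell, the setup so far might look like the following sketch (the archive name is a placeholder for the version you actually downloaded):

unzip norconex-collector-http-x.x.x.zip
cd norconex-collector-http-x.x.x
chmod -R u+rwX .
mkdir testCrawl myCrawler
cp examples/minimum/minimum-config.xml myCrawler/config.xml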

2) Give the crawler a name in the <httpcollector id="..."> tag. I’ll name my crawler TestCrawl.
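The opening tag of the configuration then becomes:

<httpcollector id="TestCrawl">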

3) Set progress and log directories in their respective tags:

<progressDir>./testCrawl/progressdir</progressDir>
<logsDir>./testCrawl/logsDir</logsDir>

 

4) Within <crawlerDefaults>, set the work directory where the files will be stored during the crawling process:

<workDir>./testCrawl/workDir</workDir>

5) Type the site you want crawled in the <url> tag (nested inside <startURLs>):

<url>http://beta2.norconex.com/</url>

Another method is to create a file with a list of URLs you want crawled, and point to the file:

<urlsFile>./urls/urlFile</urlsFile>
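To my knowledge, the URL file is plain text with one URL per line, for example:

http://beta2.norconex.com/
http://www.example.com/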

6) If needed, set a limit on how deep (from the start URL) the crawler can go and a limit on the number of documents to process:

<maxDepth>2</maxDepth>
<maxDocuments>10</maxDocuments>

7) If needed, you can set the crawler to ignore documents with specific file extensions. This is done by using the ExtensionReferenceFilter class as follows:

<referenceFilters>
  <filter
      class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
      onMatch="exclude" caseSensitive="false">
    png,gif,jpg,jpeg,js,css
  </filter>
</referenceFilters>

8) You will most likely want to use an importer to parse the crawled data before it’s sent to your CloudSearch domain. The Norconex Importer is a very intuitive and easy-to-use tool with a plethora of configuration options, offering a multitude of pre- and post-parse taggers, transformers, filters and splitters, all of which can be found here. As a starting point, you may want to use the KeepOnlyTagger as a post-parse handler, which lets you decide which metadata fields to keep:

<importer>
      <postParseHandlers>
         <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,description</fields>
         </tagger>
       </postParseHandlers>
</importer>

Be sure that your CloudSearch domain has been configured with index fields matching the metadata fields described above. Also, make sure your domain has a ‘content’ field, as the committer assumes that one exists.
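If you prefer not to click through the console, the index fields can also be defined with the AWS CLI; a rough sketch, in which the domain name is a placeholder:

aws cloudsearch define-index-field --domain-name my-domain --name title --type text
aws cloudsearch define-index-field --domain-name my-domain --name description --type text
aws cloudsearch define-index-field --domain-name my-domain --name content --type text
aws cloudsearch index-documents --domain-name my-domain

The last command triggers the re-indexing needed for the new fields to take effect.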

The config.xml file should look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2015 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="TestCrawl">

  <!-- Decide where to store generated files. -->
  <progressDir>../myCrawler/testCrawl/progress</progressDir>
  <logsDir>../myCrawler/testCrawl/logs</logsDir>

  <crawlers>
    <crawler id="CloudSearch">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://beta2.norconex.com/</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>../myCrawler/testCrawl</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>2</maxDepth>
      <maxDocuments>10</maxDocuments>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Before 2.3.0: -->
      <sitemap ignore="true" />
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

	  
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="exclude" caseSensitive="false">
          png,gif,jpg,jpeg,js,css
        </filter>
      </referenceFilters>

      
      <!-- Document importing -->
 
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,description</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

 	</crawler>
  </crawlers> 
</httpcollector>

 

The Norconex CloudSearch Committer

The Norconex HTTP Collector is compatible with several committers such as Solr, Lucidworks, Elasticsearch, etc. Visit this website to find out what other committers are available. The latest addition to this set of committers is the AWS CloudSearch committer. It is especially useful because the few other publicly available CloudSearch committers are needlessly complicated and unintuitive. Luckily, Norconex solves this issue by offering a very simple and straightforward CloudSearch committer. All you have to do is:

1) Download the JAR file from here, and move it to the lib folder of the http collector folder.

2) Add the following towards the end of the <crawler></crawler> block (right after specifying the importer) in your config.xml file:

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <documentEndpoint></documentEndpoint>
    <accessKey></accessKey>
    <secretAccessKey></secretAccessKey>
</committer>

You can obtain the URL for your document endpoint from your CloudSearch domain’s main page. As for the AWS credentials, specifying them in the config file could result in an error due to a bug in the committer. Therefore, we strongly recommend that you DO NOT include the <accessKey> and <secretAccessKey> variables. Instead, we recommend that you set two environment variables, AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY with their respective values. To obtain and use these values, refer to the AWS documentation.
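For example, in the shell session that will launch the crawler (the values shown are placeholders):

export AWS_ACCESS_KEY=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx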

 

Run the Crawler!

All that is left to do is to run the http collector using the Linux shell script (from the main collector directory):

./collector-http.sh -a start -c ./myCrawler/config.xml

Give the crawler some time to crawl the specified URLs; it will stop once it reaches the <maxDepth> or <maxDocuments> constraints, or once it finds no more URLs to crawl. Once the crawling is complete, the successfully processed documents will be committed to the domain specified in the <documentEndpoint> option.
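Should you need to interrupt a crawl before it completes, the same launch script accepts a stop action; for example:

./collector-http.sh -a stop -c ./myCrawler/config.xml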

To confirm that the documents have indeed been uploaded, you can go to the domain’s main page and see how many documents are stored and run a test search.

Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector.  This update, along with related component updates, introduces several interesting features.

HTTP Collector changes

The following items are specific to the HTTP Collector.  For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.

Crawling of JavaScript-driven pages

[ezcol_1half]

The alternative document fetcher PhantomJSDocumentFetcher now makes it possible to crawl web pages with JavaScript-generated content. This much-awaited feature is now available thanks to integration with the open-source PhantomJS headless browser. As a bonus, you can also take screenshots of the web pages you crawl.

[/ezcol_1half]

[ezcol_1half_end]

<documentFetcher 
    class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <exePath>/path/to/phantomjs.exe</exePath>
  <renderWaitTime>5000</renderWaitTime>
  <referencePattern>^.*\.html$</referencePattern> 
</documentFetcher>

[/ezcol_1half_end]

More ways to extract links

[ezcol_1half]

This release introduces two new link extractors.  You can now use the XMLFeedLinkExtractor to extract links from RSS or Atom feeds. For maximum flexibility, the RegexLinkExtractor can be used to extract links using regular expressions.

[/ezcol_1half]

[ezcol_1half_end]

<extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern group="1">\[(http.*?)\]</pattern>
  </linkExtractionPatterns>
</extractor>
<extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor">
  <applyToReferencePattern>.*rss$</applyToReferencePattern>
</extractor>

[/ezcol_1half_end]

Generic changes

The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).

Much improved XML configuration validation

[ezcol_1half]

You no longer have to hunt for a misconfiguration. Schema-based XML configuration validation was added, and you will now get errors if you have bad XML syntax for any configuration option. This validation can be triggered at the command prompt with a new flag: -k or --checkcfg.

[/ezcol_1half]

[ezcol_1half_end]

# -k can be used on its own, but when combined with -a (like below),
# it will prevent the collector from executing if there are any errors.

collector-http.sh -a start -c examples/minimum/minimum-config.xml -k

# Error sample:
ERROR (XML) ReplaceTagger: cvc-attribute.3: The value 'asdf' of attribute 'regex' on element 'replace' is not valid with respect to its type, 'boolean'.

[/ezcol_1half_end]

Enter durations in human-readable format

[ezcol_1half]

Having to convert durations into milliseconds is not the friendliest. Anywhere a duration is expected in your XML configuration, you can now use a human-readable representation (English only) as an alternative.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Example using "5 seconds" and "1 second" as opposed to milliseconds -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
    default="5 seconds" ignoreRobotsCrawlDelay="true" scope="site" >
  <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
</delay>

[/ezcol_1half_end]

Lua scripting language

[ezcol_1half]

Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer.  This gives you one more scripting option available out-of-the-box besides JavaScript/ECMAScript.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Add "apple" to a "fruit" metadata field: -->
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger"
    engineName="lua">
  <script><![CDATA[
    metadata:addString('fruit', {'apple'});
  ]]></script>
</tagger>

[/ezcol_1half_end]

Modify documents using an external application

[ezcol_1half]

With the new ExternalTransformer, you can now use an external application to perform document transformation.  This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.

[/ezcol_1half]

[ezcol_1half_end]

<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <match field="docnumber">DocNo:(\d+)</match>
  </metadata>
</transformer>

[/ezcol_1half_end]

Combine document fields

[ezcol_1half]

The new MergeTagger can be used for combining multiple fields into one. The target field can be either multi-value or single-value separated with the character of your choice.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
  <merge toField="title" deleteFromFields="true" 
      singleValue="true" singleValueSeparator=",">
    <fromFields>title,dc.title,dc:title,doctitle</fromFields>
  </merge>
</tagger>

[/ezcol_1half_end]

New Committers

[ezcol_1half]

Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at crawl time, or you are not using a repository at all, Norconex Collectors now ship with two file-based Committers whose output can easily be consumed by your own processes: XMLFileCommitter and JSONFileCommitter. All available committers can be found here.

[/ezcol_1half]

[ezcol_1half_end]

<committer class="com.norconex.committer.core.impl.XMLFileCommitter">
 <directory>/path/my-xmls/</directory>
 <pretty>true</pretty>
 <docsPerFile>100</docsPerFile>
 <compress>false</compress>
 <splitAddDelete>false</splitAddDelete>
</committer>

[/ezcol_1half_end]

More

Several additional features or changes can be found in the latest Collector releases.  Among them:

  • New Importer RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
  • New SubstringTransformer for truncating content.
  • New UUIDTagger for giving a unique id to each document (see the sketch after this list).
  • CharacterCaseTagger now supports “swap” and “string” to swap character case and capitalize the beginning of a string, respectively.
  • ConstantTagger offers options when dealing with existing values: add to existing values, replace them, or do nothing.
  • Components such as Importer, Committers, etc., are all easier to install thanks to new utility scripts.
  • Document Access-Control-List (ACL) information is now extracted from SMB/CIFS file systems (Filesystem Collector).
  • New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops.
  • Added “removeTrailingHash” as a new GenericURLNormalizer option (HTTP Collector).
  • New “detectContentType” and “detectCharset” options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detecting them instead (Filesystem Collector).
  • Start URLs and start paths can now be dynamically created thanks to IStartURLsProvider and IStartPathsProvider (HTTP Collector and Filesystem Collector).
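The UUIDTagger mentioned above, for instance, could be added as an Importer post-parse handler. The snippet below is only a sketch: the “field” and “overwrite” attributes are my assumptions, so check the UUIDTagger documentation for the exact syntax.

<importer>
  <postParseHandlers>
    <!-- Store a unique identifier for each document in a "uuid" field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.UUIDTagger"
        field="uuid" overwrite="true" />
  </postParseHandlers>
</importer>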

To get the complete list of changes, refer to the HTTP Collector release notes, Filesystem Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.

Download

HTTP Collector 2.6

Norconex has released version 2.6.0 of its HTTP Collector web crawler! Among new features, an upgrade of its Importer module brings new document parsing and manipulating capabilities. Some of the changes highlighted here also benefit the Norconex Filesystem Collector.

New URL normalization to remove trailing slashes

[ezcol_1half]

The GenericURLNormalizer has a new pre-defined normalization rule: “removeTrailingSlash”. When used, it removes any forward slash (/) found at the end of a URL so that such URLs are treated the same as those without a trailing slash. As an example:

  • https://norconex.com/ will become https://norconex.com
  • https://norconex.com/blah/ will become https://norconex.com/blah

It can be used with the 20 other normalization rules offered, and you can still provide your own.

[/ezcol_1half]

[ezcol_1half_end]

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, removeTrailingSlash
  </normalizations>
</urlNormalizer>

[/ezcol_1half_end]

Prevent sitemap detection attempts

[ezcol_1half]

By default, StandardSitemapResolverFactory is enabled and tries to detect whether a sitemap file exists at the “/sitemap.xml” or “/sitemap_index.xml” URL path. For websites without sitemap files at these locations, this creates unnecessary failed HTTP requests. It is now possible to specify an empty “path” so that such discovery does not take place. In that case, the crawler will rely on sitemap URLs explicitly provided as “start URLs” or on sitemaps defined in “robots.txt” files.

[/ezcol_1half]

[ezcol_1half_end]

<sitemapResolverFactory>
  <path/>
</sitemapResolverFactory>

[/ezcol_1half_end]

Count occurrences of matching text

[ezcol_1half]

Thanks to the new CountMatchesTagger, it is now possible to count the number of times any piece of text or regular expression occurs in a document's content or in one of its fields. A sample use case is to use the obtained count as a relevancy factor in search engines. For instance, one may use this new feature to find out how many segments are in a document URL, giving less importance to documents with many segments.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.CountMatchesTagger"> 
  <countMatches 
      fromField="document.reference"
      toField="urlSegmentCount" 
      regex="true">
    /[^/]+
  </countMatches>
</tagger>

[/ezcol_1half_end]

Multiple date formats

[ezcol_1half]

DateFormatTagger now accepts multiple source formats when attempting to convert dates from one format to another. This is particularly useful when the date formats found in documents or web pages are not consistent. Some products, such as Apache Solr, usually expect dates to be of a specific format only.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
    fromField="Last-Modified"
    toField="solr_date"
    toFormat="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'">
  <fromFormat>EEE, dd MMM yyyy HH:mm:ss zzz</fromFormat>
  <fromFormat>EPOCH</fromFormat>
</tagger>

[/ezcol_1half_end]

DOM enhancements

[ezcol_1half]

DOM-related features just got better. First, the DOMTagger, which allows one to extract values from an XML/HTML document using a DOM-like structure, now supports an optional “fromField” to read the markup from a field instead of the document content. It also supports a new “defaultValue” attribute to store a value of your choice when there is no match for your DOM selector. In addition, both DOMContentFilter and DOMTagger now support many more selector extraction options: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="div.contact" toField="htmlContacts" extract="html" />
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    fromField="htmlContacts">
  <dom selector="div.firstName" toField="firstNames" 
       extract="ownText" defaultValue="NO_FIRST_NAME" />
  <dom selector="div.lastName"  toField="lastNames" 
       extract="ownText" defaultValue="NO_LAST_NAME" />
</tagger>

[/ezcol_1half_end]

More control of embedded documents parsing

[ezcol_1half]

GenericDocumentParserFactory now allows you to control which embedded documents you do not want extracted from their containing document (e.g., do not extract embedded images). It also allows you to control which containing documents should not have their embedded documents extracted (e.g., do not extract documents embedded in MS Office documents). Finally, it now allows you to specify, via regular expression, which content types should have their embedded documents “split” into separate files, as if they were standalone documents (e.g., documents contained in a zip file).

[/ezcol_1half]

[ezcol_1half_end]

<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <embedded>
    <splitContentTypes>application/zip</splitContentTypes>
    <noExtractEmbeddedContentTypes>image/.*</noExtractEmbeddedContentTypes>
    <noExtractContainerContentTypes>
      application/(msword|vnd\.ms-.*|vnd\.openxmlformats-officedocument\..*)
    </noExtractContainerContentTypes>
  </embedded>
</documentParserFactory>

[/ezcol_1half_end]

Document parsers now XML configurable

[ezcol_1half]

GenericDocumentParserFactory now makes it possible to overwrite one or more parsers the Importer module uses by default via regular XML configuration. For any content type, you can specify your custom parser, including an external parser.

[/ezcol_1half]

[ezcol_1half_end]

<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <parsers>
    <parser contentType="text/html" 
        class="com.example.MyCustomHTMLParser" />
    <parser contentType="application/pdf" 
        class="com.norconex.importer.parser.impl.ExternalParser">
      <command>java -jar c:\Apps\pdfbox-app-2.0.2.jar ExtractText ${INPUT} ${OUTPUT}</command>
    </parser>
  </parsers>
</documentParserFactory>

[/ezcol_1half_end]

More languages detected

[ezcol_1half]

LanguageTagger now uses Tika language detection, which supports at least 70 languages.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger">
  <languages>en, fr</languages>
</tagger>

[/ezcol_1half_end]

What else?

Other changes and stability improvements were made to this release. A few examples:

  • New “checkcfg” launch action that helps detect configuration issues before an actual launch.
  • Can now specify “notFoundStatusCodes” on GenericMetadataFetcher.
  • GenericLinkExtractor no longer extracts URLs from HTML/XML comments by default.
  • URL referrer data is now always preserved by default.

To get the complete list of changes, refer to the HTTP Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.

Useful links

HTTP Collector 2.5

Norconex has released Norconex HTTP Collector version 2.5.0! This new version of our open source web crawler was released to help minimize your re-crawling frequencies and download delays, and it allows you to specify a locale for date parsing/formatting. The following highlights these key changes and additions:

Minimum re-crawl frequency

[ezcol_1half]

Not all web pages and documents are updated with the same regularity. In addition, not all types of content need their updates captured right away. Re-crawling every page every time to find out whether it changed can be time consuming (and sometimes taxing) on larger sites. For instance, you may want to re-crawl news pages more often than other types of pages on a given site. Luckily, some websites provide sitemaps that give crawlers pointers to their documents' update frequencies.

This release introduces “recrawlable resolvers” to help control the frequency of document re-crawls. You can now specify a minimum re-crawl delay, based on a document matching content type or reference pattern. The default implementation is GenericRecrawlableResolver, which supports sitemap “lastmod” and “changefreq” in addition to custom re-crawl frequencies.

[/ezcol_1half]

[ezcol_1half_end]

<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="last" >
  <minFrequency applyTo="contentType" value="monthly">application/pdf</minFrequency>
  <minFrequency applyTo="reference" value="1800000">.*latest-news.*\.html</minFrequency>
</recrawlableResolver>

[/ezcol_1half_end]

Download delays based on document URL

[ezcol_1half]

ReferenceDelayResolver is a new “delay resolver” that controls the delay between each document download. It allows you to define different delays for different URL patterns. This can be useful for more fragile websites negatively impacted by the fast download of several big documents (e.g., PDFs). In such cases, introducing a delay between certain types of downloads can help keep the crawled website's performance intact.

[/ezcol_1half]

[ezcol_1half_end]

<delay class="com.norconex.collector.http.delay.impl.ReferenceDelayResolver"
    default="2000"
    ignoreRobotsCrawlDelay="true"
    scope="crawler" >
  <pattern delay="10000">.*\.pdf$</pattern>
</delay>

[/ezcol_1half_end]

Specify a locale in date parsing/formatting

[ezcol_1half]

Thanks to the Norconex Importer 2.5.2 dependency update, it is now possible to specify a locale when parsing/formatting dates with CurrentDateTagger and DateFormatTagger.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
    fromField="date"
    fromFormat="EEE, dd MMM yyyy HH:mm:ss 'GMT'"
    fromLocale="fr"
    toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"
    keepBadDates="false"
    overwrite="true" />

[/ezcol_1half_end]

 

Useful links

  • Download Norconex HTTP Collector
  • Get started with Norconex HTTP Collector
  • Report your issues and questions on Github
  • Norconex HTTP Collector Release Notes

 

Norconex just released an Amazon CloudSearch Committer module for its open-source crawlers (Norconex “Collectors”). This is an especially useful contribution to CloudSearch users given that CloudSearch does not have its own crawlers.

If you’re not yet familiar with Norconex Collectors, head over to the Norconex Collectors website to see what you’ve been missing.
Assuming you’re already familiar with Norconex Collectors, you can enable CloudSearch as your crawler’s target search engine by following these steps:

  1. Download the CloudSearch Committer.
  2. Extract the zip, and copy the content of the “lib” folder to the “lib” folder of your existing Collector installation.
  3. Add this minimum required configuration snippet to your Collector configuration file:
    <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
      <serviceEndpoint>(CloudSearch service endpoint)</serviceEndpoint>
      <accessKey>
         (Optional CloudSearch access key. Will be taken from environment when blank.)
      </accessKey>
      <secretKey>
         (Optional CloudSearch secret key. Will be taken from environment when blank.)
      </secretKey>
    </committer>
  4. The document endpoint represents the CloudSearch domain you’ll want to use to store your crawled documents. It can be obtained from your CloudSearch domain’s main page.

CloudSearch main page

As for the AWS access and secret keys, they can also be stored outside the configuration file using one of the methods described here.
The complete list of configuration options is available here.

For further information:

Google Search Appliance is Being Phased Out… Now What?

Google Search Appliance (GSA) was introduced in 2002, and since then, thousands of organizations have acquired Google's “search in a box” to meet their search needs. Earlier this year, Google announced it is discontinuing sales of this appliance past 2016 and will not provide support beyond 2018. If you are currently using GSA for your search needs, what does this mean for your organization?

Google suggests migrating from GSA to its Google Cloud Platform. Specifically, its BigQuery service offers a fully scalable, fully managed data warehouse with search capabilities and analytics to provide meaningful insights. This may be a great option, but what if your organization or government agency needs to keep significant portions of its infrastructure in-house, behind firewalls? This new Google offering may be ill-suited as a replacement for GSA.

There are some other important elements you will want to consider before making your decision, such as protecting sensitive data, investment stability, customizability, feature set, ongoing costs, and more.

Let’s look at some of the options together.

1. COMMERCIAL APPLIANCES

Examples: SearchBlox, Thunderstone, Mindbreeze

Pros

Commercial appliances can be fast to deploy if you have little requirement for customization. As such, they may need little or no professional services involvement.

To Watch

Because appliance products aim to be stand-alone, black box solutions, they may be less customizable to meet specific needs, and may not be able to easily integrate with many other technologies. Because the hardware is set for you, if your requirements change over time, you may end up with a product that no longer meets your needs. You may also be tied to the vendor for ongoing support, and as with GSA, there is no guarantee the vendor won’t discontinue the product and have you starting over again to find your next solution.

2. CLOUD-BASED SOLUTIONS

Examples: Google Cloud (BigQuery), Amazon CloudSearch, etc.

Pros

A cloud-based solution can be both cost-effective and fast to deploy, and will require little to no internal IT support depending on your needs. Because the solution is based in the cloud, most of the infrastructure and associated costs will be covered by the provider as part of the solution pricing.

To Watch

Cloud solutions may not work for organizations with sensitive data. While cloud-based solutions try to provide easy-to-use and flexible APIs, there might be customizations that can't be performed or that must be done by the provider, and your organization may not own any of the ongoing development. Also, it may be difficult or costly to leave a cloud provider if you rely heavily on it for warehousing large portions of your data.

3. COMMERCIAL SOFTWARE SOLUTIONS

Examples: Coveo, OpenText Search, HP IDOL, Lexmark Perceptive Platform, IBM Watson Explorer, Sinequa ES, Attivio

Pros

Commercial solutions work well behind firewalls, and you maintain control of your data within your own environment. Commercial products often make configuration assumptions that can save deployment time when minimal customization is required. Commercial vendors try to differentiate themselves by offering “specializations”, along with rich feature sets and administrative tools out of the box. If most of your requirements fit within their main offerings, you may have less need for customization, potentially leading to professional services savings.

To Watch

Because there are so many commercial products out there, your organization may need to complete lengthy studies, potentially with the assistance of a consultant, to compare product offerings, determine which will work with your platform(s), and weigh feature sets to find the best fit. Customization may be difficult or costly, and some products may not scale equally well to match your organization's changing and growing needs. Finally, there is always a risk that commercial products get discontinued, purchased, or otherwise vanish from the market, forcing you to migrate your environment to another solution once more. We have seen this with Verity K2, Fast, Fulcrum search, and several others.

4. CUSTOM OPEN SOURCE SOLUTIONS

Examples: Apache Solr, Elasticsearch

Pros

Going open source is often the most flexible solution you can implement. Having full access to a product's source code makes the customization potential almost unlimited. There are no acquisition or ongoing licensing costs, so the overall cost to deploy can be much less than for commercial products, and you can focus your spending on creating a tailored solution rather than a pre-built commercial product. You will have the flexibility to change and add on to your search solution as your needs change. It is also good to point out that the risk of the product being discontinued is almost zero given the widespread adoption of open source for search. Being open source, add-on component options are plentiful, and these options grow every day thanks to an active online community – and many of these options are also free!

To Watch

Depending on the number and complexity of your search requirements, the expertise required may be greater and an open source solution may take longer to deploy. You often need good developers to implement an open source solution; you will need key in-house resources, or be prepared to hire external experts to assist with implementation. If using an expert shop, you will want to pre-define your requirements to ensure the project stays within budget. It is good to note that, unlike some commercial products, open source products usually keep a stronger focus on the search engine itself. This means they often lack many of the accompanying components and features that often ship with commercial products (like crawlers for many data sources, built-in analytics reporting, industry-specific ontologies, etc.). Luckily, open source solutions often integrate easily with several commercial or open source components that can be used to fill these gaps.

I hope this brief overview helps you begin your assessment on how to replace your Google Search Appliance, or implement other Search solutions.

 

Norconex HTTP Collector 2.3.0

Norconex is proud to release version 2.3.0 of its Norconex HTTP Collector open-source web crawler.  Thanks to incredible community feedback and efforts, we have implemented several feature requests, and your favorite crawler is now more stable than ever. The following describes only a handful of these new features with a focus on XML configuration. Refer to the product release notes for a complete list of changes.

Restrict crawling to a specific site

[ezcol_1half]

Up until now, you could restrict crawling to a specific domain, protocol, and port using one or more reference filters (e.g., RegexReferenceFilter). Norconex HTTP Collector 2.3.0 features new configuration options to “stay on a site”, called stayOnProtocol, stayOnDomain, and stayOnPort.  These new settings can be applied to the <startURLs> tag of your XML configuration.  They are particularly useful when you have many “start URLs” defined and you do not want to create many reference filters to stay on those sites.

[/ezcol_1half]

[ezcol_1half_end]

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>http://mysite.com</url>
</startURLs>

[/ezcol_1half_end]

 

Add HTTP request headers

[ezcol_1half]

GenericHttpClientFactory now allows you to set HTTP request headers on every HTTP call that a crawler makes. This new feature can save the day for sites expecting certain header values to be present to render properly. For instance, some sites may rely on the “Accept-Language” request header to decide which language to use when rendering a page.

[/ezcol_1half]

[ezcol_1half_end]

<httpClientFactory>
  <headers>
    <header name="Accept-Language">fr</header>
    <header name="From">john@smith.com</header>
  </headers>
</httpClientFactory>

[/ezcol_1half_end]

Specify a sitemap as a start URL

[ezcol_1half]

It is now possible to specify one or more sitemap URLs as “start URLs.”  This is in addition to the crawler attempting to detect sitemaps at standard locations. To only use the sitemap URL provided as a start URL, you can disable the sitemap discovery process by adding ignore="true" to <sitemapResolverFactory> as shown in the code sample.  To only crawl pages listed in sitemap files and not further follow links found in those pages, remember to set the <maxDepth> to zero.

[/ezcol_1half]

[ezcol_1half_end]

<startURLs>
  <sitemap>http://mysite.com/sitemap.xml</sitemap>
</startURLs>
<sitemapResolverFactory ignore="true" />

[/ezcol_1half_end]

Basic URL normalization always performed

[ezcol_1half]

URL normalization is now in effect by default using GenericURLNormalizer. The following are the default normalization rules applied:

  • Removing the URL fragment (the “#” character and everything after)
  • Converting the scheme and host to lower case
  • Capitalizing letters in escape sequences
  • Decoding percent-encoded unreserved characters
  • Removing the default port
  • Encoding non-URI characters

You can always overwrite the default normalization settings or turn off normalization altogether by adding the disabled="true" attribute to the <urlNormalizer> tag.

[/ezcol_1half]

[ezcol_1half_end]

<urlNormalizer>
  <normalizations>
    lowerCaseSchemeHost, upperCaseEscapeSequence, removeDefaultPort, 
    removeDotSegments, removeDirectoryIndex, removeFragment, addWWW 
  </normalizations>
  <replacements>
    <replace><match>&amp;view=print</match></replace>
    <replace>
       <match>(&amp;type=)(summary)</match>
       <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>

[/ezcol_1half_end]

Scripting Language and DOM navigation

We introduced additional features when we upgraded the Norconex Importer dependency to its latest version (2.4.0). You can now use scripting languages to insert your own document processing logic or reference DOM elements of an XML or HTML file using a friendly syntax. Refer to the Importer 2.4.0 release announcement for more details.

Useful links

There is so much more offered by this release. Use the following links to find out more about Norconex HTTP Collector.

Norconex is proud to release version 2.4.0 of its Norconex Importer open-source product.  In addition to the usual bug fixes and stability enhancements, this release provides more possibilities for parsing and enriching your documents.  Most significantly, Importer 2.4.0 allows for scripting and DOM navigation.  Keep reading for more details and usage samples.

Scripting

[ezcol_1half]

While it has always been possible to extend the Importer to implement your own document processing logic, you can now inject that logic via configuration using a scripting language. The following new handlers enable the use of scripting languages to manipulate documents: ScriptFilter, ScriptTagger, and ScriptTransformer.

The “JavaScript” script engine, which is already present as part of your Java installation, is the script engine used by these classes.  The JavaScript engine used by the Oracle implementation of Java is based on Mozilla Rhino. You can find extensive JavaScript documentation on the Mozilla Rhino site.

Java developers can extend the Importer to add support for additional scripting languages. These new classes rely on the JSR 223 API, which allows you to “plug in” any script engine to support your favorite scripting language.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Reject documents that are not about "apple". -->
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
  <script><![CDATA[
      isAppleDoc = metadata.getString('fruit') == 'apple'
              || content.indexOf('Apple') > -1;
      /*return*/ isAppleDoc;
  ]]></script>
</filter>

<!-- Add a "fruit" metadata field with the value "apple". --> 
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
  <script><![CDATA[
      metadata.addString('fruit', 'apple');
  ]]></script>
</tagger>

<!-- Replace all occurrences of "Alice" with "Roger". -->
<transformer 
    class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
  <script><![CDATA[
      modifiedContent = content.replace(/Alice/g, 'Roger');
      /*return*/ modifiedContent;
  ]]></script>
</transformer>

 [/ezcol_1half_end]

DOM navigation

[ezcol_1half]

It is now possible to reference elements of an HTML or XML document using friendly CSS or jQuery-like syntax to navigate its document object model (DOM). The jsoup parser is used to load document content into a DOM tree.

The new DOMContentFilter can be used to reject documents containing a specific HTML/XML path or element. The DOMSplitter can be used to break HTML/XML with “list” elements into different documents. Finally, the DOMTagger allows you to extract specific HTML/XML tag values or attributes and store them in your own fields (e.g., extract <h1> tags into a “title” field).

[/ezcol_1half]

[ezcol_1half_end]

<!-- Exclude documents containing GIF images. -->
<filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
      selector="img[src$=.gif]" onMatch="exclude" />

<!-- Store H1 tags in a title field. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="h1" toField="title" overwrite="false" />
</tagger>

<!-- Create a new contact document for each occurrence of the "contact" tag. -->
<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
    selector="contact" />

 [/ezcol_1half_end]

Other features

[ezcol_1half]

This release features several other helpful and interesting changes and additions.  For instance, CharacterCaseTagger can now be used to adjust the character case of field names (in addition to values). A few additional file formats are also supported.  For a complete list of changes, see the release notes.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Make every instance of "title" field name lowercase. -->
<tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
  <characterCase fieldName="title" type="lower" applyTo="field" />
</tagger>

 [/ezcol_1half_end]

Useful links

The latest release of Norconex HTTP Collector provides more content transformation capabilities, canonical URL support, increased stability, and additional features.

Norconex HTTP Collector 2.2 now available

As the Internet grows, so does the demand for better ways to extract and process web data. Several commercial and open-source/free web crawling solutions have been available for years now. Unfortunately, most are limited by one or more of the following:

  • Feature set is too limited
  • Unfriendly and complex to setup
  • Poorly documented
  • Require strong programming skills
  • No longer supported or active
  • Integrates with a single search engine or repository
  • Geared solely toward big data solutions (as the popular Apache Nutch has become)
  • Difficult to extend with your own features
  • High cost of ownership

Norconex is changing this with its full-featured, enterprise-class, open-source web crawler solution. Norconex HTTP Collector is entirely configurable using simple XML, yet offers many extension points for adventurous Java programmers. It integrates with virtually any repository or search engine (Solr, Elasticsearch, IDOL, GSA, etc.). You will find it is thoroughly documented in a single location, with sample configuration files working out of the box on any operating system.

The latest release builds upon the great community requests and feedback to provide the following highlights:

Canonical Links Detector

[ezcol_1half]

Canonical links are a way for the webmaster to help crawlers avoid duplicates by indicating the preferred URL for accessing a web page. The HTTP Collector now detects canonical links found in both HTML and HTTP headers.

The GenericCanonicalLinkDetector looks within the HTML <head> tags for a <link> tag following this pattern:

<link rel="canonical" href="https://norconex.com/sample" />

It also looks for an HTTP response header field named “Link” with a value following this pattern:

<https://norconex.com/sample.pdf> rel="canonical"

The advantage for webmasters in defining canonical URLs in the HTTP response header rather than in an HTML page is twofold. First, it allows web crawlers to reject non-canonical pages before they are downloaded (saving bandwidth). Second, it can apply to any content type, not just HTML pages.

[/ezcol_1half]

[ezcol_1half_end]

<canonicalLinkDetector
    class="com.norconex.collector.http.url.impl.GenericCanonicalLinkDetector"
    ignore="false">
</canonicalLinkDetector>

[/ezcol_1half_end]

URL Reports Creation

[ezcol_1half]

URLStatusCrawlerEventListener is a new crawler event listener that can produce spreadsheet-friendly reports on fetched URLs and their statuses. Among other things, it can be useful for finding broken links on a site being crawled.

[/ezcol_1half]

[ezcol_1half_end]

<listener
    class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
  <statusCodes>404</statusCodes>
  <outputDir>/a/path/broken-links.tsv</outputDir>
</listener>

[/ezcol_1half_end]

Spoiled State Resolver

[ezcol_1half]

A new class called GenericSpoiledReferenceStrategizer allows you to specify how to handle URLs that were once valid but turned “bad” on a subsequent crawl. You can choose to delete them from your repository, give them a single chance to recover on the next crawl, or simply ignore them.

[/ezcol_1half]

[ezcol_1half_end]

<spoiledReferenceStrategizer 
    class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer"
    fallbackStrategy="IGNORE">
  <mapping state="NOT_FOUND" strategy="DELETE" />
  <mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
  <mapping state="ERROR" strategy="IGNORE" />
</spoiledReferenceStrategizer>

[/ezcol_1half_end]

Extra Filtering and Data Manipulation Options

Norconex HTTP Collector internally relies on the Norconex Importer library for parsing documents and manipulating text and metadata. The latest release of the Importer brings you several new options, such as:

  • CurrentDateTagger: Add the current date to a document.
  • DateMetadataFilter: Accepts or rejects a document based on the date value of a metadata field.
  • NumericMetadataFilter: Accepts or rejects a document based on the numeric value of a metadata field.
  • TextPatternTagger: Extracts and adds all text values matching the regular expression provided to a metadata field.
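As an illustration of the last item, a TextPatternTagger snippet for pulling email-like strings out of the content into an “emails” field might look as follows. This is a sketch only; the exact element and attribute names should be verified against the TextPatternTagger documentation.

<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <!-- Store every email-like string found in the content in an "emails" field. -->
  <pattern field="emails">[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}</pattern>
</tagger>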

Want to crawl a filesystem instead?

Whether you are interested in crawling a local drive, a network drive, an FTP site, WebDAV, or any other type of filesystem, Norconex Filesystem Collector is for you; it was recently upgraded to version 2.2.0 as well. Check its release notes for details.

Useful Links

This release of Norconex Importer brings many fixes, increased stability, and nice new features. The following highlights some of the additions with XML configuration or Java code samples.

Retrieve a document Length

[ezcol_1half]

Thanks to the new DocumentLengthTagger, you can now store a document's byte length in a metadata field of your choice. The length can be obtained at any document processing stage; for instance, before any transformation has taken place, or after the document has been parsed.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DocumentLengthTagger"
  field="doc-length" overwrite="true" >
</tagger>

 [/ezcol_1half_end]

Add the current date to a document

[ezcol_1half]

The new CurrentDateTagger allows you to add the current date to a metadata field, in the date format of your choice. This can be useful to indicate when a document was actually processed by the Importer.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
  field="date-imported" format="yyyy-MM-dd" />

 [/ezcol_1half_end]

Filter documents on numeric or date range

[ezcol_1half]

NumericMetadataFilter and DateMetadataFilter now allow you to filter documents based on metadata field numeric or date values, respectively. You can define both closed ranges and open-ended ranges.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Numeric range filter -->
<filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
      onMatch="include" field="age" >
  <condition operator="ge" number="20" />
  <condition operator="lt" number="30" />
</filter>

<!-- Date range filter -->
<filter class="com.norconex.importer.handler.filter.impl.DateMetadataFilter"
      onMatch="include" field="publish_date" >
  <condition operator="ge" date="TODAY-7" />
  <condition operator="lt" date="TODAY" />
</filter>

 [/ezcol_1half_end]

Use external parsers

[ezcol_1half]

Wrapping a Tika class of the same name, the new ExternalParser allows Java programmers to point to external command-line applications to parse documents. One example is using “pdftotext” to parse PDFs instead of the default PDFBox-based parser, which is much slower (but generally does a better job).

[/ezcol_1half]

[ezcol_1half_end]

import java.util.Map;

import com.norconex.commons.lang.file.ContentType;
import com.norconex.importer.parser.GenericDocumentParserFactory;
import com.norconex.importer.parser.IDocumentParser;
import com.norconex.importer.parser.impl.ExternalParser;

public class CustomDocumentParserFactory extends GenericDocumentParserFactory {

    @Override
    protected Map<ContentType, IDocumentParser> createNamedParsers() {
        Map<ContentType, IDocumentParser> parsers = super.createNamedParsers();

        ExternalParser pdfParser = new ExternalParser();
        pdfParser.setCommand(
                // Replace this with your own executable path
                "C:\\Apps\\pdftotext.exe", 
                "-enc", "UTF-8", "-raw", "-q", "-eol", "unix",                 
                ExternalParser.INPUT_FILE_TOKEN, 
                ExternalParser.OUTPUT_FILE_TOKEN);
        parsers.put(ContentType.PDF, pdfParser);
        return parsers;
    }
}

  [/ezcol_1half_end]

Other improvements

There are more changes under the hood, like an upgrade to Apache Tika 1.8, as well as fixes for OutOfMemory errors and for document parsing sometimes never returning. You can find the complete list of changes in the release notes.

Several of these improvements were made possible thanks to the great feedback of the open-source community. Keep doing so: you make a difference.

Useful links