An Open-Source Crawler for Autonomy IDOL

HP Autonomy users, take control over your web crawling. Norconex recently released an HP Autonomy IDOL Committer module for its open-source web crawler, Norconex HTTP Collector. You can now enjoy the features of the Norconex crawler and experience the freedom of open source when crawling your sites for indexing into IDOL.

We have published the 100% Java source code for this software on GitHub, where it is freely available to anyone who wants to download it. We encourage people to use it, report issues, and contribute to the project by helping with the documentation and providing patches.

Most key features of the HP Autonomy HTTP Connector are available in Norconex HTTP Collector, including detection of document changes on incremental crawls and purging from IDOL of documents whose web pages have been deleted. New features are introduced as well, such as different hit intervals at different times of the day and the ability to override nearly every part of the web crawling flow with your own implementation. The IDOL Committer has been tested on a variety of public and internal web sites and performed well.

If you have not used the Norconex HTTP Collector before, get familiar with it first. The following section focuses on the configuration related to the Norconex IDOL Committer module (version 1.0 as of this writing).

IDOL Committer Configuration Explained

The IDOL Committer automatically populates key fields, such as DREREFERENCE and DRECONTENT, as well as any other extracted fields you wish to keep. While you can configure the Committer programmatically, most users will prefer to rely on the well-documented XML configuration.

Below is the XML template for the IDOL Committer, to be inserted in the Norconex HTTP Collector configuration. This snippet can also be found in the IDOL Committer Javadoc.

<committer class="com.norconex.committer.idol.IdolCommitter">
    <host>(Host to IDOL.)</host>
    <aciPort>(Port to IDOL ACI)</aciPort>
    <indexPort>(Port to IDOL Index.)</indexPort>
    <databaseName>(IDOL Database Name where to store documents.)</databaseName>
    <dreAddDataParams>
        <param name="(parameter name)">(parameter value)</param>
    </dreAddDataParams>
    <dreDeleteRefParams>
        <param name="(parameter name)">(parameter value)</param>
    </dreDeleteRefParams>
    <idSourceField keep="[false|true]">
        (Name of source field that will be mapped to the IDOL "DREREFERENCE"
        field or whatever "idTargetField" specified.
        Default is the document reference metadata field:
        "document.reference".  Once re-mapped, the metadata source field is
        deleted, unless "keep" is set to true.)
    </idSourceField>
    <idTargetField>
        (Name of IDOL target field where to store a document unique
        identifier (idSourceField).  If not specified, default
        is "DREREFERENCE".)
    </idTargetField>
    <contentSourceField keep="[false|true]">
        (If you wish to use a metadata field to act as the document
        "content", you can specify that field here.  Default
        does not take a metadata field but rather the document content.
        Once re-mapped, the metadata source field is deleted,
        unless "keep" is set to true.)
    </contentSourceField>
    <contentTargetField>
        (IDOL target field name for a document content/body.
        Default is: DRECONTENT)
    </contentTargetField>
    <queueDir>(optional path where to queue files)</queueDir>
    <queueSize>(max queue size before committing)</queueSize>
    <commitBatchSize>
        (max number of docs to send IDOL at once)
    </commitBatchSize>
</committer>

 

The <host>, <aciPort>, <indexPort>, and <databaseName> tags should be familiar to IDOL users. The <dreAddDataParams> and <dreDeleteRefParams> tags can be used to modify the URL command sent to IDOL for additions or deletions. For instance, to set the priority for each batch of documents sent to IDOL, you can use the following:

<dreAddDataParams>
    <param name="Priority">100</param>
</dreAddDataParams>
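
For the connection tags, a filled-in committer might look like the sketch below. The host, port numbers, and database name are placeholder assumptions chosen for illustration; they do not come from the Norconex documentation or from any particular IDOL installation, so substitute your own values.

```xml
<committer class="com.norconex.committer.idol.IdolCommitter">
    <!-- Hypothetical values: replace with your own IDOL server settings. -->
    <host>localhost</host>
    <aciPort>9000</aciPort>
    <indexPort>9001</indexPort>
    <databaseName>WebPages</databaseName>
</committer>
```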

 

By default, the Committer uses the metadata field “document.reference” as the document's unique identifier, but you can map a different source field using <idSourceField keep="[false|true]">.
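
As an illustration of the identifier mapping, the sketch below takes the document identifier from a metadata field named "canonicalUrl" and keeps that field after mapping. The field name is a hypothetical example, not something the crawler is known to produce; the tags and the "keep" attribute are those from the template above.

```xml
<committer class="com.norconex.committer.idol.IdolCommitter">
    <!-- "canonicalUrl" is a hypothetical field name used for illustration. -->
    <idSourceField keep="true">canonicalUrl</idSourceField>
    <!-- Optional: DREREFERENCE is already the default target field. -->
    <idTargetField>DREREFERENCE</idTargetField>
</committer>
```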

If you do not want the full content of a web page stored in the content field, you can choose a metadata field to use as the content instead with <contentSourceField keep="[false|true]">.

If you would like the content of a web page to be stored in a field other than DRECONTENT, you can accomplish this with <contentTargetField>.
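
A minimal sketch of both content options together is shown below. It assumes you want the "description" metadata field to serve as the document content and a field named "PAGE_BODY" to receive it; both choices are illustrative assumptions, and "PAGE_BODY" is a hypothetical IDOL field name.

```xml
<committer class="com.norconex.committer.idol.IdolCommitter">
    <!-- Use the "description" metadata field as content (illustrative
         choice) and keep the original field after mapping. -->
    <contentSourceField keep="true">description</contentSourceField>
    <!-- "PAGE_BODY" is a hypothetical IDOL field name. -->
    <contentTargetField>PAGE_BODY</contentTargetField>
</committer>
```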

The following three configuration options relate to how documents are queued and sent to IDOL:

  • <queueDir> is an optional path where files are queued before being sent to IDOL. Default is ./committer-queue.
  • <queueSize> is the maximum queue size before sending all queued documents to IDOL.
  • <commitBatchSize> is the maximum number of documents to send to IDOL at once.
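
The three queue options above could be set as in the following sketch. The queue directory shown is the default path mentioned above; the two numbers are arbitrary illustrative values, not the module's defaults.

```xml
<committer class="com.norconex.committer.idol.IdolCommitter">
    <!-- Same path as the stated default; written out only for clarity. -->
    <queueDir>./committer-queue</queueDir>
    <!-- Illustrative values only; tune them for your own installation. -->
    <queueSize>500</queueSize>
    <commitBatchSize>50</commitBatchSize>
</committer>
```

Going by the option descriptions, these values would let documents accumulate in the queue until 500 are waiting, then send them to IDOL in batches of at most 50.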

By default, the Norconex HTTP Collector extracts all fields it can find, which can be more than you want to index. You can use the KeepOnlyTagger from the Importer module to keep only the fields you care about:

<tagger fields="title,keywords,description,document.reference"/>

In the above example, document.reference is kept because it is the field used by default to populate the DREREFERENCE field.

Further Information

If you are unhappy with your current web crawling solution for your IDOL installation or are curious about trying an alternate solution, we invite you to give both the Norconex HTTP Collector and the IDOL Committer a try.
