Norconex Collector Core
Releases

Release History

Version Date Description
1.10.0 2019-12-22 Feature release
1.9.1 2018-07-29 Maintenance release
1.9.0 2017-11-26 Feature release
1.8.2 2017-05-26 Bugfix release
1.8.1 2017-05-25 Maintenance release
1.8.0 2017-04-26 Feature release
1.7.0 2016-12-14 Feature release
1.6.0 2016-08-25 Feature release
1.5.0 2016-06-03 Feature release
1.4.0 2016-02-28 Maintenance release
1.3.0 2015-11-06 Feature release
1.2.1 2015-08-07 Maintenance release
1.2.0 2015-07-22 Feature release
1.1.0 2015-04-08 Feature release
1.0.2 2015-02-04 Bug fix release
1.0.1 2014-12-03 Bug fix release
1.0.0 2014-11-26 Initial release

1.10.0 Feature release Download 2019-12-22

Added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging.
New "maxParallelCrawlers" collector configuration option. It limits how many crawlers run at any given time, queuing the others. #25
Added SSL support to MongoDB crawl data store.
New AbstractCollector#getState() method.
Added advanced configuration parameters to MVStoreCrawlDataStoreFactory.
Maven dependency updates: Norconex Commons Lang 1.15.1, Norconex Importer 2.10.0, Norconex Committer Core 2.1.3, Norconex JEF 4.1.2, H2 1.4.199.
SpoiledReferenceStrategy of GRACE_ONCE now properly deletes a document on a subsequent failure.
Added a retry to MongoDB upserts to fix constraint violations on concurrent upserts. #24
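
The "maxParallelCrawlers" option above caps how many crawlers run at once and queues the rest. The sketch below only illustrates that behavior with a fixed-size thread pool. It is not the collector's actual implementation, and the class name, method name, and dummy "crawler" tasks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: run dummy "crawlers" with a cap on how many are active.
class ParallelCrawlerSketch {

    // Runs crawlerCount dummy crawlers, at most maxParallel at a time, and
    // returns the highest number that were observed running simultaneously.
    static int runAll(int crawlerCount, int maxParallel) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        // A fixed pool executes at most maxParallel tasks; others wait in its queue.
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < crawlerCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(20); // stand-in for real crawling work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every crawler to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }
}
```

Calling ParallelCrawlerSketch.runAll(6, 2) starts six dummy crawlers but never lets more than two run at the same time.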

1.9.1 Maintenance release Download 2018-07-29

Significant performance improvement on MongoCrawlDataStore#isQueueEmpty().
Dependency updates: Norconex Importer 2.9.0, Norconex Commons Lang 1.15.0.
AbstractCrawler now logs documents it could not process as INFO.
MongoCrawlDataStore#buildMongoClient and #buildMongoCredentials methods were moved to MongoConnectionDetails. #15
Fixed embedded document checksum creation pulling the wrong cached checksum, causing embedded documents to always appear new when metadataChecksummer is disabled.
Fixed showing wrong path in error message when command-line variable file is invalid. #16
Fixed NullPointerException under some conditions for AbstractCrawlerConfig#saveToXML(...).

1.9.0 Feature release Download 2017-11-26

New "sourceFieldsRegex" option on GenericMetadataChecksummer and MD5DocumentChecksummer allowing the use of regular expressions to match the fields to use for building the checksum.
New "combineFieldsAndContent" option on MD5DocumentChecksummer to use both fields and content for building the checksum.
Can now specify custom collection names when using MongoCrawlDataStore and AbstractMongoCrawlDataStoreFactory implementations.
New "stopOnExceptions" option added to crawler configuration to force the crawler to stop upon encountering any of the specified exceptions.
The MongoCrawlDataStore now accepts references longer than 1024 characters.
AbstractCrawler no longer creates the work directory on object construction, but rather does it when the crawler starts.
Dependency updates: Norconex Importer 2.8.0, Norconex Commons Lang 1.14.0, Norconex Committer Core 2.1.2, Apache Commons DbUtils 1.7, MongoDB Java Driver 3.5.0, H2 Database 1.4.196.
When the orphan strategy is "PROCESS", the crawler now always attempts to process a document, regardless of sitemap delays or recrawlable delays, since the reason for it becoming an orphan may be deletion, and we do not want to wait for a future crawl cycle to find out.
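
To illustrate the idea behind the "sourceFieldsRegex" option above (selecting metadata fields by regular expression and building a checksum from them), here is a minimal sketch. It is not the code of GenericMetadataChecksummer or MD5DocumentChecksummer. The class name, the single-valued metadata map, and the "name=value" serialization are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical sketch: MD5 checksum over the metadata fields matching a regex.
class FieldChecksumSketch {

    static String checksum(Map<String, String> metadata, String sourceFieldsRegex) {
        Pattern pattern = Pattern.compile(sourceFieldsRegex);
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        // Iterate in sorted field-name order so the checksum is stable
        // regardless of the map's own iteration order.
        for (Map.Entry<String, String> e : new TreeMap<>(metadata).entrySet()) {
            // Only fields whose whole name matches the regex contribute.
            if (pattern.matcher(e.getKey()).matches()) {
                String line = e.getKey() + "=" + e.getValue() + "\n";
                md5.update(line.getBytes(StandardCharsets.UTF_8));
            }
        }
        // Render the 16-byte digest as 32 lowercase hex characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

With a regex such as "title|dc\\..*" (an arbitrary example), changes to fields that do not match the expression leave the checksum untouched, while a change to "title" produces a different checksum.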

1.8.2 Bugfix release Download 2017-05-26

Dependency updates: Norconex Importer 2.7.2.
Fixed "caseSensitive" flag sometimes having no effect in RegexMetadataFilter and RegexReferenceFilter.

1.8.1 Maintenance release Download 2017-05-25

MongoCrawlDataStore now supports specifying the MongoDB authentication mechanism to use (MONGODB-CR or SCRAM-SHA-1).
Classes related to MongoDB crawl store implementation were updated to use MongoDB 3.x API.
Dependency updates: Norconex Importer 2.7.1, Norconex Committer Core 2.1.1, MongoDB Driver 3.4.2, Fongo 2.0.13 (for tests).
AbstractCollector#saveToXML(...) now writes XML with xml:space="preserve".
Fixed "importer" config section not being inherited from "crawlerDefaults" when a specific crawler configuration does not declare one.

1.8.0 Feature release Download 2017-04-26

Added schema-based XML configuration validation, which can be triggered from the command prompt with this new flag: -k or --checkcfg
New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops.
Two new crawler events were added for crawler event listeners: CRAWLER_STOPPING and CRAWLER_STOPPED.
AbstractMongoCrawlDataStoreFactory now accepts encrypted passwords.
Now distributed with utility scripts.
Crawler events REJECTED_FILTER, REJECTED_BAD_STATUS, REJECTED_IMPORT, and REJECTED_ERROR are now DEBUG in log4j.properties.
When their log level is DEBUG, the word "Subject:" has been removed from crawler event messages and "No additional information available." is shown when there is no extra info to show.
Dependency updates: Norconex Commons Lang 1.13.0, Norconex Importer 2.7.0, Norconex JEF API 4.1.0, Norconex Committer Core 2.1.0, JSoup 1.10.2.
Modified Javadoc to include an XML usage example for all XML-configurable classes.
ICrawlerConfig no longer implements Cloneable.
Document, metadata, and reference filters now log an appropriate message when there is no "include" match, when log level is DEBUG.
Fixed crawler defaults not always being applied as they should.
Fixed minor errors in writing IXMLConfigurable classes to XML.
Throwable exceptions no longer make a crawler hang under certain conditions when importing/parsing a file.
Removed code deprecated in version 1.2 or older.
Removed MapDB and Apache Derby crawlstore dependencies/implementations which were deprecated in version 1.6.

1.7.0 Feature release Download 2016-12-14

It is now possible to add JEF-related listeners on the collector configuration.
JMX support is now disabled by default to improve performance. It can be enabled by adding the JVM argument: -DenableJMX=true
Dependency updates: Norconex Commons Lang 1.12.3, Norconex Importer 2.6.1, Norconex JEF API 4.0.8, Joda Time 2.9.4, JJ2000 5.3, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, Apache Commons Logging 1.2.
Fixed NullPointerException when stopping a crawler that did not previously run.
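
The notes above name the JVM argument -DenableJMX=true but do not say how it is read. As an illustration only, a common way for Java code to check such a flag is Boolean.getBoolean, which returns false when the system property is absent. The class and method names below are hypothetical.

```java
// Hypothetical sketch: checking a -DenableJMX=true style JVM argument.
class JmxFlagSketch {

    // True if the "enableJMX" system property is set to "true" (ignoring case),
    // for example via -DenableJMX=true; false if it is missing or anything else.
    static boolean isJmxEnabled() {
        return Boolean.getBoolean("enableJMX");
    }
}
```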

1.6.0 Feature release Download 2016-08-25

New "checkcfg" launch action that will load a configuration without doing anything with it (to help resolve config issues).
New CrawlState#isSkipped() method to indicate if a document was unmodified or premature.
New AbstractCrawler#beforeFinalizeDocumentProcessing() method to let crawler implementations act on a document before it is finalized.
MVStoreCrawlDataStoreFactory is now the default crawl store factory (replacing now deprecated MapDB implementation).
Dependency updates: Norconex Importer 2.6.0, Norconex Committer Core 2.0.5, JSoup 1.9.2, Apache Commons DBCP 2.1.1, H2 Database 1.4.192.
API break: method signature changed for AbstractCrawler from applyCrawlData(ICrawlData crawlData, ImporterDocument document) to initCrawlData(ICrawlData crawlData, ICrawlData cachedCrawlData, ImporterDocument document).

1.5.0 Feature release Download 2016-06-03

New BasicJDBCCrawlDataStoreFactory implementation for collector implementations with basic crawl storage needs.
New document crawl state: PREMATURE.
New crawler event: REJECTED_PREMATURE.
Default database implementation for AbstractJDBCDataStoreFactory when invoked with an empty constructor is now H2.
When provided by collectors, document "crawl date" and content type can be added to the crawl data and will be stored in the crawl data store (affects all ICrawlDataStoreFactory implementations).
Dependency updates: Norconex Importer 2.5.2, MapDB 1.0.9, H2 1.4.191, Fongo 1.6.2.
Event string value for DOCUMENT_COMMITTED_REMOVE changed from DOCUMENT_COMMITTED_REMOV to DOCUMENT_COMMITTED_REMOVE.

1.4.0 Maintenance release Download 2016-02-28

Dependency updates: Norconex Importer 2.5.0.
ExtensionReferenceFilter is now smarter at detecting extension. #2
ExtensionReferenceFilter now allows white spaces around extensions in XML config.
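
As a rough illustration of the two ExtensionReferenceFilter changes above, the sketch below trims white space around configured extensions and reads the extension from the path part of a reference only. It is not the filter's actual code. It assumes a comma-separated extension list and references that are valid URIs, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: tolerant extension parsing and matching.
class ExtensionListSketch {

    // Splits a comma-separated list, trimming white space, lower-casing
    // each entry, and skipping empty entries.
    static Set<String> parseExtensions(String raw) {
        Set<String> result = new LinkedHashSet<>();
        for (String part : raw.split(",")) {
            String ext = part.trim().toLowerCase(Locale.ROOT);
            if (!ext.isEmpty()) {
                result.add(ext);
            }
        }
        return result;
    }

    // Looks at the last path segment only, so the host name, query string,
    // and fragment are never mistaken for a file extension.
    static boolean accepts(Set<String> extensions, String reference) {
        String path = URI.create(reference).getPath();
        if (path == null) {
            return false; // opaque URI with no path
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension at all
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return extensions.contains(ext);
    }
}
```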

1.3.0 Feature release Download 2015-11-06

Specifying an invalid path on the command-line for the config file or variable file now returns a meaningful message.
Maven direct dependency updates: Norconex Importer 2.4.0, Norconex JEF 4.0.7, Mongo Java Driver 2.13.3, Apache Derby 10.12.1.1.
Now logs (level INFO) a less alarming message when a module version cannot be found.
Now logs module version information in file.
A new metadata boolean field called "collector.is-crawl-new" is now added before document importing. It indicates whether the document is already known from the crawler, from a previous run.
The cached instance of a reference's data is now passed around as opposed to being obtained from the reference cache each time it is needed.
Saved and loaded configuration-related classes are now equal. Methods equals/hashCode/toString for those classes are now implemented uniformly and were added where missing.
Fixed some configuration classes not always being saved to XML properly or giving errors.
Fixed IOException when "keepDownloads" is true. This was occurring for URLs with no path (just the host name). Created domain directories and files are now prefixed with "d." and "f." respectively.

1.2.1 Maintenance release Download 2015-08-07

AbstractCrawler is no longer deleting remaining orphans after they have been processed (when orphan strategy is PROCESS).
Verbose logging in AbstractCrawler#processNextReference(...) has been changed from loglevel DEBUG to TRACE.
Dependency updates: Norconex Importer 2.3.1 and Norconex Committer Core 2.0.2.

1.2.0 Feature release Download 2015-07-22

New configurable option: ISpoiledStateStrategyResolver. It allows one to customize what strategy to adopt when a reference is in a bad crawl state (ignore, delete, or grace once). A default implementation is provided: GenericSpoiledStateStrategyResolver.
New GenericMetadataChecksummer for choosing one or many metadata fields and their values to create a checksum.
Now printing release versions of Norconex libraries used when a collector is launched.
New NOT_FOUND state constant added to CrawlState (migrated from the HTTP Collector).
AbstractCrawler is now firing REJECTED_ERROR events when an exception prevented proper processing of a reference.
Documents with a bad crawl state other than "NOT_FOUND" are now given one chance to recover before a deletion request gets sent. This can be overridden.
The OrphansStrategy default in crawler config is now PROCESS, to get around cases where temporary conditions prevent access to some documents that should normally be reachable (and that should not be left unprocessed on incremental crawls).
MD5DocumentChecksummer#setField(String) has been deprecated in favor of MD5DocumentChecksummer#setFields(String...).
CrawlState#isCommittable() has been deprecated in favor of CrawlState#isNewOrModified().
Setter method signatures accepting an array in AbstractCrawlerConfig were updated to accept "varargs" instead (variable arguments).
Uses default port when no Mongo port is specified when using Mongo data store.
When the saving of documents is enabled, each saved document is no longer printed to STDOUT but logged as a Log4j debug statement instead.
Regular expressions in RegexMetadataFilter and RegexReferenceFilter now always have the Pattern.DOTALL flag enabled and when case sensitivity is enabled for regex, Pattern.UNICODE_CASE is now always used.
Library updates: Norconex JEF 4.0.6, Norconex Importer 2.3.0, Norconex Commons Lang 1.6.2, Mongo Java Driver 2.13.2, H2 database 1.4.187. New dependency: JUnit 4.12 (test scope).
Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Javadoc fixes and updates.
Updated Mongo indexes to use stage instead of state (GitHub collector-http#97).
Stopping a job that has been resumed now works as expected.
ICrawlDataStore#isVanished(ICrawlData) has been deprecated.
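
The entry above about RegexMetadataFilter and RegexReferenceFilter refers to two standard java.util.regex flags. Pattern.DOTALL makes "." match line terminators too, and Pattern.UNICODE_CASE extends case-insensitive matching beyond US-ASCII (it only has an effect together with Pattern.CASE_INSENSITIVE). The sketch below shows one plausible way to combine them. The exact flag logic inside the filters is an assumption here, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: compiling a filter regex with DOTALL always on.
class RegexFlagsSketch {

    static Pattern compile(String regex, boolean caseSensitive) {
        // DOTALL is always set, so "." also matches line terminators.
        int flags = Pattern.DOTALL;
        if (!caseSensitive) {
            // UNICODE_CASE makes case-insensitive matching Unicode-aware.
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags);
    }
}
```

For instance, RegexFlagsSketch.compile("a.*b", true) matches a value containing a line break between "a" and "b", which the same expression compiled without DOTALL would reject.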

1.1.0 Feature release Download 2015-04-08

New methods and configuration attribute to disable checksum creation in MD5DocumentChecksummer.
Library updates: Norconex Committer Core 2.0.1, Norconex Importer 2.1.1, Norconex JEF 4.0.4, MapDB 1.0.7, Apache Commons BeanUtils 1.9.2, Apache Commons DBCP2 2.1, Mongo Java Driver 2.13.0, H2 1.4.186.
Added Sonatype repository to pom.xml for snapshot releases.
Updated several maven plugins and added SonarQube maven plugin.
Removed pom.xml dependency on Norconex Commons Lang, which is already provided by other dependencies.
Subject in event logging is now only shown on DEBUG log level.
The database XML configuration in AbstractJDBCDataStoreFactory is now case-insensitive.
H2 database now has a write delay of zero to ensure durability on JVM crash.
MapDB and MVStore implementations of ICrawlDataStore now force a commit on every addition, at the expense of performance, to ensure durability on JVM/OS/System crash.
BaseCrawlData#setDocumentChecksum(String) is now deprecated in favor of BaseCrawlData#setContentChecksum(String) to fix content checksum not being saved in crawl data store properly.
Fixed NullPointerException when running an incremental crawl over one that previously failed due to invalid configuration.
Fixed incremental run not always handling non-modified documents properly (sometimes deleting, sometimes re-adding).
Fixed NPE in AbstractJDBCDataStoreFactory#createCrawlDataStore(...) when database is null.

1.0.2 Bug fix release Download 2015-02-04

When splitting documents, crawlers will now trigger individual processing/deletion of children/embedded documents that no longer exist on incremental runs (based on your "orphansStrategy" configuration). When deleting orphans, deletion of a parent document will also trigger deletion requests to its children/embedded documents.
Fixed an infinite loop that sometimes occurred when dealing with multiple threads and the configured maxDocument is reached (and greater than zero). This could prevent a collector from ever stopping.
Fixed invalid detection of crawler execution state, affecting ability to stop a collector.
Crawl data is no longer added to document metadata after the import phase (which could conflict with some handlers, like KeepOnlyTagger).
Default logging of Crawler events is now better aligned.
Updated JEF API to version 4.0.2.
Javadoc corrections.

1.0.1 Bug fix release Download 2014-12-03

When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions.
Crawler id is now set on JEF JobSuite when a new thread starts to improve logging.
Upgraded norconex-jef to 4.0.1.
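
The "f." and "d." prefixes mentioned above avoid a name clash when the same name must exist both as a saved file and as a directory, for example "example.com/a" (a file named "a") and "example.com/a/b" (which needs a directory named "a"). The sketch below illustrates the idea only. It is not the collector's actual path logic, it assumes references given without a scheme, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: map a reference to a local path with "d."/"f." prefixes.
class DownloadPathSketch {

    // Every segment except the last becomes a "d."-prefixed directory;
    // the last segment becomes an "f."-prefixed file.
    static String toLocalPath(String reference) {
        List<String> segments = new ArrayList<>();
        for (String s : reference.split("/")) {
            if (!s.isEmpty()) {
                segments.add(s); // ignore empty segments such as a trailing "/"
            }
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < segments.size(); i++) {
            boolean last = (i == segments.size() - 1);
            if (i > 0) {
                path.append('/');
            }
            path.append(last ? "f." : "d.").append(segments.get(i));
        }
        return path.toString();
    }
}
```

Under this scheme "example.com/a" maps to "d.example.com/f.a" and "example.com/a/b" maps to "d.example.com/d.a/f.b", so the file "f.a" and the directory "d.a" can coexist. A host-only reference such as "example.com" maps to the file "f.example.com", which does not clash with the directory "d.example.com".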

1.0.0 Initial release Download 2014-11-26

Initial release.

Copyright © 2013-2020 Norconex Inc.