Norconex Filesystem Collector

Release History

Version Date Description
2.9.0 2019-12-22 Feature release
2.8.0 2017-11-26 Feature release
2.7.1 2017-05-26 Maintenance release
2.7.0 2017-04-26 Feature release
2.6.1 2016-12-14 Maintenance release
2.6.0 2016-08-25 Feature release
2.5.0 2016-06-03 Minor release
2.4.0 2016-02-28 Minor release
2.3.0 2015-11-06 Feature release
2.2.0 2015-07-22 Feature release
2.1.0 2015-04-08 Feature release
2.0.2 2015-02-04 Bug fix release
2.0.1 2014-12-03 Bug fix release
2.0.0 2014-11-26 Major release.
1.0.0 2014-08-25 Initial release

2.9.0 Feature release Download 2019-12-22

Now extracts ACL from local files.
From Collector Core update, added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging.
Now supports CMIS (Atom), the open standard for content management systems (e.g., Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum).
Dependency updates: Norconex Collector Core 1.10.0, Norconex Commons Lang 1.15.1.
Fixed files whose names contain a pound sign being ignored and/or having the pound sign URL-encoded. #47
Fixed NullPointerException under some conditions for FilesystemCrawlerConfig#saveToXML(...). #29
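The pound-sign fix above concerns a character that URI syntax reserves as the fragment delimiter. The following is a minimal Java sketch, not the collector's own code, that uses only java.net.URI to show why such a file name can be cut short when a raw string is parsed as a URI, and how the same name looks once percent-encoded. The class name, method names, and sample path are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustration only: shows how java.net.URI treats '#' in a file path.
final class PoundSignSketch {

    // Parses the raw string as a URI. '#' starts the fragment, so the
    // returned path stops right before it.
    static String pathWhenParsedRaw(String rawUri) {
        return URI.create(rawUri).getPath();
    }

    // Builds a file URI from components. The multi-argument constructor
    // percent-encodes characters that are not legal in a path, '#' included.
    static URI encodeAsFileUri(String absolutePath) {
        try {
            return new URI("file", null, absolutePath, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The file name is truncated at '#'.
        System.out.println(pathWhenParsedRaw("file:///tmp/notes#draft.txt"));
        URI encoded = encodeAsFileUri("/tmp/notes#draft.txt");
        // Encoded form, then the decoded path with the '#' restored.
        System.out.println(encoded.getRawPath());
        System.out.println(encoded.getPath());
    }
}
```

Building the URI from components keeps the full file name recoverable through getPath(), whereas parsing the raw string silently moves everything after the pound sign into the fragment.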

2.8.0 Feature release Download 2017-11-26

Several new features (new TruncateTagger, ExternalTagger, etc.) are included with this release, mainly through Norconex Collector Core and Norconex Importer dependency updates. Refer to related release notes for more details.
Dependency updates: Norconex Collector Core 1.9.0, Norconex Commons Lang 1.14.0.

2.7.1 Maintenance release Download 2017-05-26

Dependency updates: Norconex Collector Core 1.9.0.

2.7.0 Feature release Download 2017-04-26

Added schema-based XML configuration validation, which can be triggered from the command prompt with this new flag: -k or --checkcfg
New configurable GenericFilesystemOptionsProvider, which allows configuring how different file systems are accessed (authentication, FTP(S), HTTP, WebDAV, etc.). A custom implementation can be provided with IFilesystemOptionsProvider.
ACL is now extracted from SMB/CIFS file systems.
Custom metadata extraction is now possible via IFileMetadataFetcher. Default implementation is GenericFileMetadataFetcher.
Custom document extraction is now possible via IFileDocumentFetcher. Default implementation is GenericFileDocumentFetcher.
Can now provide start paths dynamically with new IStartPathsProvider.
New features from dependency updates. Collector Core: ICollectorLifeCycleListener. Importer: MergeTagger, ExternalTransformer.
MongoCrawlDataStoreFactory now accepts encrypted passwords.
Now distributed with utility scripts.
XML configuration entries expecting millisecond durations can now be provided in human-readable format (e.g., "5 minutes and 30 seconds" or "5m30s").
Dependency updates: Norconex Collector Core 1.8.0, Norconex Commons Lang 1.13.0, JCIFS 1.3.17, Apache Commons VFS Sandbox 2.1.
FilesystemCollectorException now deprecated in favor of CollectorException.
Modified Javadoc to include an XML usage example for all XML-configurable classes.
Fixed minor errors in writing IXMLConfigurable classes to XML.
Removed JDBCCrawlDataStoreFactory, deprecated since 1.5 and replaced by BasicJDBCCrawlDataStoreFactory.
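To illustrate the human-readable durations mentioned in this release, here is a minimal Java sketch that converts strings such as "5m30s" or "5 minutes and 30 seconds" to milliseconds. It is a simplified stand-in, not the parser shipped with the Norconex libraries; the class name and the set of accepted unit names are assumptions made for the example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a simplified parser, not the Norconex implementation.
final class DurationSketch {

    // A number followed by optional whitespace and a unit word or letter.
    private static final Pattern PART = Pattern.compile("(\\d+)\\s*([a-z]+)");

    // Converts text such as "5m30s" or "5 minutes and 30 seconds" to
    // milliseconds. A plain number is taken as milliseconds already.
    static long toMillis(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.matches("\\d+")) {
            return Long.parseLong(s);
        }
        long total = 0;
        boolean found = false;
        Matcher m = PART.matcher(s);
        // Filler words such as "and" are skipped because they do not
        // follow a number.
        while (m.find()) {
            total += Long.parseLong(m.group(1)) * unitMillis(m.group(2));
            found = true;
        }
        if (!found) {
            throw new IllegalArgumentException("Unrecognized duration: " + text);
        }
        return total;
    }

    // The unit names accepted here are an assumption made for this sketch.
    private static long unitMillis(String unit) {
        switch (unit) {
            case "ms": case "millisecond": case "milliseconds": return 1L;
            case "s": case "second": case "seconds": return 1_000L;
            case "m": case "minute": case "minutes": return 60_000L;
            case "h": case "hour": case "hours": return 3_600_000L;
            case "d": case "day": case "days": return 86_400_000L;
            default:
                throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }
}
```

Both sample strings from the release note resolve to the same value in this sketch, and a bare number is passed through unchanged so that existing millisecond values keep working.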

2.6.1 Maintenance release Download 2016-12-14

Dependency updates: Norconex Commons Lang 1.12.3, JJ2000 5.3, Norconex Collector Core 1.7.0, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, Apache Commons Codec 1.10, Apache Commons Net 3.5, Apache HttpClient 3.1.
Fixed FTP file system. Added third-party dependencies and FTP configuration required for the FTP file system to work. #11

2.6.0 Feature release Download 2016-08-25

Dependency updates: Norconex Collector Core 1.6.0, Apache Commons VFS 2.1, Joda Time 2.9.4, JSoup 1.8.3, and Norconex Importer 2.6.0, which introduces new document parsing/manipulation features.

2.5.0 Minor release Download 2016-06-03

MVStore is now the default URL crawl store.
Dependency updates: Norconex Collector Core 1.5.0.
JDBCCrawlDataStoreFactory now deprecated in favor of BasicJDBCCrawlDataStoreFactory from Collector Core.

2.4.0 Minor release Download 2016-02-28

Now supports specifying relative paths in startPaths (for local file systems only).
The "" file has been moved from classes to the installation root directory.
Dependency updates: Norconex Collector Core 1.4.0, Joda Time 2.9.2.
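As a rough illustration of relative start paths, the Java sketch below resolves a start path against a base directory using java.nio.file. Which directory the collector actually uses as the base is not stated in these notes, so the base is passed in explicitly; the class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustration only: resolves a start path against an explicit base.
final class StartPathSketch {

    // Absolute start paths are kept as they are; relative ones are
    // resolved against baseDir. "." and ".." segments are then collapsed.
    static Path resolve(Path baseDir, String startPath) {
        Path p = Paths.get(startPath);
        Path resolved = p.isAbsolute() ? p : baseDir.resolve(p);
        return resolved.normalize();
    }

    public static void main(String[] args) {
        Path base = Paths.get("/opt/collector");
        System.out.println(resolve(base, "../data/docs"));
        System.out.println(resolve(base, "samples"));
    }
}
```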

2.3.0 Feature release Download 2015-11-06

Dependency updates: Norconex Collector Core 1.3.0 and Norconex Importer 2.4.0, which introduces many new features.

2.2.0 Feature release Download 2015-07-22

New CurrentDateTagger, DateMetadataFilter, NumericMetadataFilter, TextPatternTagger, GenericSpoiledReferenceStrategizer and more new features introduced by dependency upgrades.
New FileMetadataChecksummer#setDisabled(boolean) method to disable this default metadata checksummer.
Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Dependency updates: Norconex Collector Core 1.2.0.
Improved/fixed Javadoc.

2.1.0 Feature release Download 2015-04-08

Several new features, updates and fixes were added by upgrading the Norconex Collector Core and Norconex Importer dependencies. Those include support for OCR, translation, a title generator, new content type parsing, and more. Refer to dependency release notes for more details.
Library updates: Norconex Collector Core 1.1.0, JUnit 4.12, Joda-Time 2.7.
Added Sonatype repository to pom.xml for snapshot releases.
Updated several Maven plugins and added SonarQube Maven plugin.
Fixed log4j log levels incorrectly ending with a semicolon.

2.0.2 Bug fix release Download 2015-02-04

Fixed the collector "stop" action having no effect.
Fixed crawl data wrongfully applied as metadata after the import phase.
Fixed incorrect deletion behavior for embedded orphan documents.
Improved logging options for crawler events.
Upgraded Norconex Collector Core dependency to 1.0.2.

2.0.1 Bug fix release Download 2014-12-03

From collector-core-1.0.1: When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions. #44
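The prefixing scheme can be pictured with a short Java sketch. It is an assumption-laden illustration, not the Collector Core code: the class name, method name, and exact on-disk layout are hypothetical. Only the "f." and "d." prefixes come from the note above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: maps a document reference to a local relative path.
final class DownloadPathSketch {

    // Every directory segment gets a "d." prefix and the final segment
    // gets an "f." prefix, so a file and a directory that share a name
    // never map to the same local entry.
    static String toLocalPath(String reference) {
        List<String> segments = new ArrayList<>();
        for (String segment : reference.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.size(); i++) {
            boolean last = (i == segments.size() - 1);
            if (i > 0) {
                out.append('/');
            }
            out.append(last ? "f." : "d.").append(segments.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "2014" saved as a file, then used as a directory name.
        System.out.println(toLocalPath("/share/reports/2014"));
        System.out.println(toLocalPath("/share/reports/2014/q1.pdf"));
    }
}
```

Without the prefixes, saving a document named "2014" would block the later creation of a "2014" directory in the same location, which is the kind of collision the change avoids.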

2.0.0 Major release. Download 2014-11-26

Upgraded Norconex Importer to version 2.0.0, which brings many new features to Norconex Filesystem Collector, such as document content splitting, splitting of embedded documents into individual documents, and new taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more. Please read the Norconex Importer release notes for a complete list of changes.
Can now supply a "pathsFile" as part of the startPaths, acting as a seed list.
New H2 database implementation for the reference database (crawl data store).
Now keeps track of parent references (for embedded/split documents).
New replaceable FileMetadataChecksummer which takes the document modified date and size to create a unique representation of a file.
New IFileDocumentProcessor to manipulate crawled documents before and after the import module is invoked.
New support for file filtering based on metadata.
New support for document filtering.
New ability to keep files fetched from a file system in a local location.
New JMX/MBean support added on crawlers.
Now licensed under The Apache License, Version 2.0.
Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
The collector now references document content as a reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
Refactored to use the new Norconex Collector Core library. A significant portion of the Norconex Filesystem Collector code has been moved to that core library.
New and more scalable crawler event model along with new listeners.
Refactored to use JEF 4.0.0, which makes the Filesystem Collector easier to monitor.
Other library upgrades: Norconex Committer to 2.0.0 and Norconex Commons Lang to 1.5.0.
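The idea behind a metadata checksum built from the modified date and size, as described for FileMetadataChecksummer in this release, can be sketched in a few lines of Java. This is a hypothetical illustration and not the class itself: the names and the "timestamp_size" string format are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: a change-detection value built from file metadata.
final class MetadataChecksumSketch {

    // Combines last-modified time and size into one comparable string.
    // The "timestamp_size" format is an assumption made for this sketch.
    static String checksum(long lastModifiedMillis, long sizeBytes) {
        return lastModifiedMillis + "_" + sizeBytes;
    }

    // Reads both values from the file system without opening the content.
    static String checksum(Path file) {
        try {
            return checksum(Files.getLastModifiedTime(file).toMillis(),
                    Files.size(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A file is treated as unchanged when the stored value from the
    // previous crawl equals the newly computed one.
    static boolean isUnchanged(String previous, String current) {
        return previous != null && previous.equals(current);
    }
}
```

Comparing metadata this way avoids reading file content, at the cost of missing edits that preserve both the size and the modification time.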

1.0.0 Initial release Download 2014-08-25

Initial release.

Copyright © 2013-2020 Norconex Inc.