Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.
CMIS support
Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content management systems (CMS) content. Extra information can be extracted, such as document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.
<startPaths>
  <path>cmis-atom:https://norconex.com/mycms/cmisatom!/my/starting/path</path>
</startPaths>
Additional ACL support
ACL from your CMS is not the only new type of ACL you can extract. This new Norconex Filesystem Collector release introduces support for obtaining local filesystem ACL. These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (since 2.7.0).
Field discovery
You can’t always tell upfront what metadata your crawler will find. One way to discover your fields is to send them all to your Committer, but this approach is not always possible or desirable. You can now save all fields found by the crawler to a local file. Each field is saved once, with sample values to give you a better idea of its nature.
<tagger class="com.norconex.importer.handler.tagger.impl.FieldReportTagger" maxSamples="2" file="/path/to/report/myfields.csv" />
New URL normalization rules
The HTTP Collector adds a few new rules to GenericURLNormalizer. Those are:
- removeQueryString
- lowerCase
- lowerCasePath
- lowerCaseQuery
- lowerCaseQueryParameterNames
- lowerCaseQueryParameterValues
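As a rough illustration, the rules listed above might be enabled in a crawler configuration as follows. This is a minimal sketch: it assumes the normalizer is declared with its usual fully qualified class name and that rules are given as a comma-separated list inside a normalizations element. The two rules picked here are arbitrary, so check the GenericURLNormalizer documentation for the exact syntax in your version.

```xml
<!-- Sketch only: element layout and class package are assumptions to verify -->
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeQueryString, lowerCasePath
  </normalizations>
</urlNormalizer>
```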
Subdomains being part of a domain
When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider sub-domains as being the same site (includeSubdomains="true").
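A configuration using both settings could look something like the sketch below. It assumes the two attributes are set on the startURLs element of the crawler configuration. The start URL is a placeholder, not taken from the release notes.

```xml
<!-- Sketch only: attribute placement on startURLs is an assumption;
     https://example.com/ is a placeholder start URL -->
<startURLs stayOnDomain="true" includeSubdomains="true">
  <url>https://example.com/</url>
</startURLs>
```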
Other changes
For a complete list of all additions and changes, refer to the following release notes:
- HTTP Collector 2.9.0 release notes
- Filesystem Collector 2.9.0 release notes
- Collector Core 1.10 release notes
- Importer 2.10 release notes
Download