Posted on April 26, 2017 by Pascal Essiembre in Latest Releases
Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector. This update, along with related component updates, introduces several interesting features.
The following items are specific to the HTTP Collector. For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.
The alternative document fetcher PhantomJSDocumentFetcher now makes it possible to crawl web pages with JavaScript-generated content. This much-awaited feature is available thanks to integration with the open-source PhantomJS headless browser. As a bonus, you can also take screenshots of the web pages you crawl.
<documentFetcher
    class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <exePath>/path/to/phantomjs.exe</exePath>
  <renderWaitTime>5000</renderWaitTime>
  <referencePattern>^.*\.html$</referencePattern>
</documentFetcher>
This release introduces two new link extractors. You can now use the XMLFeedLinkExtractor to extract links from RSS or Atom feeds. For maximum flexibility, the RegexLinkExtractor can be used to extract links using regular expressions.
<extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern group="1">\[(http.*?)\]</pattern>
  </linkExtractionPatterns>
</extractor>
<extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor">
  <applyToReferencePattern>.*rss$</applyToReferencePattern>
</extractor>
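To see what the RegexLinkExtractor pattern above would capture, here is a minimal Python sketch. It only illustrates the regular expression and its capture group 1; it is not Norconex's implementation, and the `extract_links` helper and the sample text are hypothetical.

```python
import re

# Same expression as in the configuration above; group 1 holds the URL
# found between square brackets.
LINK_PATTERN = re.compile(r"\[(http.*?)\]")


def extract_links(text, pattern=LINK_PATTERN, group=1):
    """Return the chosen capture group of every match, in document order."""
    return [match.group(group) for match in pattern.finditer(text)]


if __name__ == "__main__":
    # Hypothetical page content with bracketed links.
    sample = "See [http://example.com/a] and [https://example.com/b?x=1]."
    for url in extract_links(sample):
        print(url)
```

Because the quantifier is lazy (`.*?`), each match stops at the first closing bracket, so two bracketed links on the same line are returned separately.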
The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).
You no longer have to hunt for a misconfiguration. Schema-based XML configuration validation was added, and you will now get errors if any configuration option has invalid XML syntax. This validation can be triggered from the command prompt with the new -k (or --checkcfg) flag.
# -k can be used on its own, but when combined with -a (like below),
# it will prevent the collector from executing if there are any errors.
collector-http.sh -a start -c examples/minimum/minimum-config.xml -k

# Error sample:
ERROR (XML) ReplaceTagger: cvc-attribute.3: The value 'asdf' of attribute 'regex' on element 'replace' is not valid with respect to its type, 'boolean'.
Converting durations to milliseconds by hand is not very user-friendly. Anywhere in your XML configuration where a duration is expected, you can now use a human-readable representation (English only) as an alternative.
<!-- Example using "5 seconds" and "1 second" as opposed to milliseconds -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
    default="5 seconds" ignoreRobotsCrawlDelay="true" scope="site" >
  <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
</delay>
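As a rough illustration of what a human-readable duration means in milliseconds, the following Python sketch converts strings such as "5 seconds" into a number. It is not the parser Norconex uses: the `to_millis` function and the list of accepted units are assumptions made for this example, and the actual set of supported units may differ.

```python
import re

# Assumed unit table for illustration only; the units Norconex accepts
# may differ from this list.
UNIT_MILLIS = {
    "millisecond": 1,
    "second": 1000,
    "minute": 60 * 1000,
    "hour": 60 * 60 * 1000,
    "day": 24 * 60 * 60 * 1000,
}

# A number followed by a unit word, e.g. "5 seconds".
TOKEN = re.compile(r"(\d+)\s*([a-z]+)")


def to_millis(text):
    """Convert text such as '5 seconds' or '1 hour 30 minutes' to milliseconds.

    A string made only of digits is treated as milliseconds already.
    """
    text = text.strip().lower()
    if text.isdigit():
        return int(text)

    tokens = TOKEN.findall(text)
    if not tokens:
        raise ValueError(f"Not a duration: {text!r}")

    total = 0
    for amount, unit in tokens:
        if unit.endswith("s"):
            unit = unit[:-1]  # plural to singular: "seconds" -> "second"
        if unit not in UNIT_MILLIS:
            raise ValueError(f"Unknown duration unit: {unit!r}")
        total += int(amount) * UNIT_MILLIS[unit]
    return total


if __name__ == "__main__":
    for value in ("5 seconds", "1 second", "5000"):
        print(value, "->", to_millis(value))
```

Under these assumptions, the `default="5 seconds"` attribute above corresponds to 5000 milliseconds, and the weekend schedule of "1 second" to 1000.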
Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer. This gives you one more scripting option available out-of-the-box besides JavaScript/ECMAScript.
<!-- Add "apple" to a "fruit" metadata field: -->
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger"
    engineName="lua">
  <script><![CDATA[
    metadata:addString('fruit', {'apple'});
  ]]></script>
</tagger>
With the new ExternalTransformer, you can now use an external application to perform document transformation. This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.
<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <match field="docnumber">DocNo:(\d+)</match>
  </metadata>
</transformer>
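The general idea behind the configuration above (write the document to an input file, run a command, read the result from an output file, and pull metadata out with a regular expression) can be sketched in Python as follows. This is not how the ExternalTransformer is implemented. The `external_transform` function is hypothetical, and the sketch assumes that metadata patterns are matched against each line the command prints to standard output or standard error.

```python
import os
import re
import subprocess
import sys
import tempfile


def external_transform(command, content, metadata_patterns):
    """Run an external command on content; return (new_content, metadata).

    command is a list of arguments in which the placeholders ${INPUT} and
    ${OUTPUT} are replaced with temporary file paths.
    metadata_patterns maps a field name to a regex with one capture group.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        input_path = os.path.join(tmp_dir, "input.txt")
        output_path = os.path.join(tmp_dir, "output.txt")
        with open(input_path, "w", encoding="utf-8") as f:
            f.write(content)

        args = [
            arg.replace("${INPUT}", input_path).replace("${OUTPUT}", output_path)
            for arg in command
        ]
        result = subprocess.run(args, capture_output=True, text=True, check=True)

        with open(output_path, encoding="utf-8") as f:
            transformed = f.read()

    # Assumption: patterns are applied to every line the command printed.
    metadata = {}
    for line in (result.stdout + "\n" + result.stderr).splitlines():
        for field, pattern in metadata_patterns.items():
            match = re.search(pattern, line)
            if match:
                metadata.setdefault(field, []).append(match.group(1))
    return transformed, metadata


if __name__ == "__main__":
    # Stand-in for a real transformation application: a one-line Python
    # script that upper-cases the input file and prints a document number.
    script = (
        "import sys; "
        "text = open(sys.argv[1], encoding='utf-8').read(); "
        "open(sys.argv[2], 'w', encoding='utf-8').write(text.upper()); "
        "print('DocNo:12345')"
    )
    demo_command = [sys.executable, "-c", script, "${INPUT}", "${OUTPUT}"]
    new_content, fields = external_transform(
        demo_command, "hello", {"docnumber": r"DocNo:(\d+)"}
    )
    print(new_content, fields)
```

The stand-in script plays the role of `/path/transform/app`; any program that reads its first argument and writes its second could be substituted.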
The new MergeTagger can be used to combine multiple fields into one. The target field can be either multi-valued, or single-valued with the merged values joined by a separator of your choice.
<tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
  <merge toField="title" deleteFromFields="true"
      singleValue="true" singleValueSeparator=",">
    <fromFields>title,dc.title,dc:title,doctitle</fromFields>
  </merge>
</tagger>
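The effect of such a merge on a document's metadata can be pictured with a short Python sketch. The `merge_fields` function is hypothetical and only mirrors the options shown above; it assumes that any existing value of the target field is overwritten unless the target is also listed as a source, which may not match the MergeTagger's actual behavior.

```python
def merge_fields(metadata, from_fields, to_field,
                 delete_from_fields=False, single_value=False,
                 single_value_separator=""):
    """Merge the values of from_fields into to_field, modifying metadata.

    metadata maps field names to lists of string values. Missing source
    fields are skipped.
    """
    # Collect values in the order the source fields are listed.
    merged = []
    for name in from_fields:
        merged.extend(metadata.get(name, []))

    # Remove the sources before writing the target, so a target that is
    # also a source (like "title" above) keeps only the merged result.
    if delete_from_fields:
        for name in from_fields:
            metadata.pop(name, None)

    if single_value:
        metadata[to_field] = [single_value_separator.join(merged)]
    else:
        metadata[to_field] = merged
    return metadata


if __name__ == "__main__":
    # Hypothetical document metadata.
    doc = {"title": ["A"], "dc.title": ["B"], "doctitle": ["C"], "author": ["X"]}
    merge_fields(doc, ["title", "dc.title", "dc:title", "doctitle"], "title",
                 delete_from_fields=True, single_value=True,
                 single_value_separator=",")
    print(doc)
```

With `single_value` left off, the target would instead hold the three titles as separate values.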
Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at the time of crawling, or you are not using a repository at all, Norconex Collectors now ship with two file-based Committers for easy consumption by your own process: XMLFileCommitter and JSONFileCommitter. All available committers can be found here.
<committer class="com.norconex.committer.core.impl.XMLFileCommitter">
  <directory>/path/my-xmls/</directory>
  <pretty>true</pretty>
  <docsPerFile>100</docsPerFile>
  <compress>false</compress>
  <splitAddDelete>false</splitAddDelete>
</committer>
Several additional features or changes can be found in the latest Collector releases. Among them:
To get the complete list of changes, refer to the HTTP Collector release notes, the Filesystem Collector release notes, or the release notes of dependent Norconex libraries, such as the Importer release notes and the Collector Core release notes.