Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector. This update, along with related component updates, introduces several interesting features.
HTTP Collector changes
The following items are specific to the HTTP Collector. For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.
This release introduces two new link extractors. You can now use the XMLFeedLinkExtractor to extract links from RSS or Atom feeds. For maximum flexibility, the RegexLinkExtractor can be used to extract links using regular expressions.
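To give a feel for regex-based link extraction, here is a minimal Python sketch. It is not the RegexLinkExtractor implementation or its configuration; the function name, the sample text, and the pattern are assumptions made up for illustration.

```python
import re


def extract_links(text, pattern, group=1):
    """Return the chosen capture group of every match of `pattern` in `text`.

    Hypothetical helper: it only illustrates the idea of pulling URLs out of
    arbitrary text with a regular expression.
    """
    return [match.group(group) for match in re.finditer(pattern, text)]


# Example: links wrapped in a wiki-like [[...]] markup (made-up format).
sample = "See [[https://example.com/a]] and [[https://example.com/b]]."
links = extract_links(sample, r"\[\[(https?://[^\]]+)\]\]")
```

The capture group lets the pattern match surrounding markup while keeping only the URL itself.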
Generic changes
The following changes apply to both the Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).
Much improved XML configuration validation
You no longer have to hunt for a misconfiguration. Schema-based XML configuration validation was added, and you will now get errors if any configuration option has invalid XML syntax. This validation can be triggered from the command prompt with this new flag: -k or --checkcfg.
# -k can be used on its own, but when combined with -a (like below),
# it will prevent the collector from executing if there are any errors.
collector-http.sh -a start -c examples/minimum/minimum-config.xml -k
# Error sample:
ERROR (XML) ReplaceTagger: cvc-attribute.3: The value 'asdf' of attribute 'regex' on element 'replace' is not valid with respect to its type, 'boolean'.
Enter durations in human-readable format
Having to convert a duration into milliseconds is not very friendly. Anywhere in your XML configuration where a duration is expected, you can now use a human-readable representation (English only) as an alternative.
<!-- Example using "5 seconds" and "1 second" as opposed to milliseconds -->
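As a rough illustration of what such a conversion involves, the following Python sketch turns strings like "5 seconds" into milliseconds. It is an assumption-laden example, not the parser used by the Collectors: the supported units and the function name are hypothetical, and the exact syntax accepted by the real configuration may differ.

```python
import re

# Milliseconds per unit (illustrative subset, English only).
UNIT_MS = {
    "millisecond": 1,
    "second": 1_000,
    "minute": 60_000,
    "hour": 3_600_000,
    "day": 86_400_000,
}


def duration_to_millis(text):
    """Convert '5 seconds' or '2 minutes and 30 seconds' to milliseconds.

    A plain number is taken to already be in milliseconds.
    """
    text = text.strip().lower()
    if text.isdigit():
        return int(text)
    tokens = re.findall(r"(\d+)\s*([a-z]+)", text)
    if not tokens:
        raise ValueError(f"Not a duration: {text!r}")
    total = 0
    for amount, unit in tokens:
        # Accept both singular and plural unit names.
        unit = unit[:-1] if unit.endswith("s") else unit
        if unit not in UNIT_MS:
            raise ValueError(f"Unknown duration unit: {unit!r}")
        total += int(amount) * UNIT_MS[unit]
    return total
```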
With the new ExternalTransformer, you can now use an external application to perform document transformation. This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.
Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at the time of crawling, or you are not using a repository at all, Norconex Collectors now ship with two file-based Committers for easy consumption by your own process: XMLFileCommitter and JSONFileCommitter. All available committers can be found here.
New “detectContentType” and “detectCharset” options on GenericDocumentFetcher let you ignore the content type and character encoding obtained from the HTTP response headers and detect them instead (Filesystem Collector).
Norconex has released version 2.6.0 of its HTTP Collector web crawler! Among new features, an upgrade of its Importer module brings new document parsing and manipulating capabilities. Some of the changes highlighted here also benefit the Norconex Filesystem Collector.
New URL normalization to remove trailing slashes
The GenericURLNormalizer has a new pre-defined normalization rule: “removeTrailingSlash”. When used, it removes the forward slash (/) found at the end of a URL, so that such URLs are treated the same as those not ending with that character. As an example:
https://norconex.com/ will become https://norconex.com
https://norconex.com/blah/ will become https://norconex.com/blah
It can be used with the 20 other normalization rules offered, and you can still provide your own.
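The effect of this rule can be approximated with a few lines of Python. This is only a sketch of the behavior described above, not the GenericURLNormalizer code; the function name is hypothetical, and the handling of query strings is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_trailing_slash(url):
    """Drop a single trailing '/' from the URL path, if present.

    Assumption: the query string and fragment are left untouched.
    """
    parts = urlsplit(url)
    path = parts.path[:-1] if parts.path.endswith("/") else parts.path
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```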
By default, StandardSitemapResolverFactory is enabled and tries to detect whether a sitemap file exists at the “/sitemap.xml” or “/sitemap_index.xml” URL path. For websites without sitemap files at these locations, this creates unnecessary HTTP request failures. It is now possible to specify an empty “path” so that such discovery does not take place. In that case, the crawler relies on sitemap URLs explicitly provided as “start URLs” or on sitemaps defined in “robots.txt” files.
Count occurrences of matching text
Thanks to the new CountMatchesTagger, it is now possible to count the number of times any piece of text or regular expression occurs in a document content or one of its fields. A sample use case may be to use the obtained count as a relevancy factor in search engines. For instance, one may use this new feature to find out how many segments are found in a document URL, giving less importance to documents with many segments.
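The URL-segment use case can be sketched in Python as follows. This is an illustration of the counting idea only, not the CountMatchesTagger itself; the function names and the segment pattern are assumptions.

```python
import re
from urllib.parse import urlsplit


def count_matches(text, pattern):
    """Count non-overlapping occurrences of a regular expression in text."""
    return len(re.findall(pattern, text))


def count_url_segments(url):
    """Count path segments, treating each '/something' as one segment."""
    return count_matches(urlsplit(url).path, r"/[^/]+")
```

A search engine could then be told to boost documents with a low segment count, on the theory that pages closer to the site root tend to matter more.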
DateFormatTagger now accepts multiple source formats when attempting to convert dates from one format to another. This is particularly useful when the date formats found in documents or web pages are not consistent. Some products, such as Apache Solr, usually expect dates to be of a specific format only.
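The general approach of trying several source formats in turn can be sketched like this in Python. It is not the DateFormatTagger implementation; the function name, the sample formats, and the target format are assumptions chosen for the example.

```python
from datetime import datetime


def reformat_date(value, source_formats, target_format="%Y-%m-%dT%H:%M:%SZ"):
    """Parse `value` with the first matching source format, then reformat it.

    Returns None when no source format matches.
    """
    for fmt in source_formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # Try the next candidate format.
        return parsed.strftime(target_format)
    return None


# Hypothetical formats found on an inconsistent site.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y %H:%M"]
```

The order of the formats matters: the first one that parses successfully wins, so more specific formats should be listed first.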
DOM-related features just got better. First, the DOMTagger, which allows one to extract values from an XML/HTML document using a DOM-like structure, now supports an optional “fromField” to read the markup content from a field instead of the document content. It also supports a new “defaultValue” attribute to store a value of your choice when there are no matches with your DOM selector. In addition, both DOMContentFilter and DOMTagger now support many more selector extraction options: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).
GenericDocumentParserFactory now lets you control which embedded documents are not extracted from their containing document (e.g., do not extract embedded images). It also lets you specify which container documents should not have their embedded documents extracted (e.g., do not extract documents embedded in MS Office documents). Finally, you can now specify, via regular expression, which content types should have their embedded documents “split” into separate files, as if they were standalone documents (e.g., documents contained in a zip file).
GenericDocumentParserFactory now makes it possible to overwrite one or more parsers the Importer module uses by default via regular XML configuration. For any content type, you can specify your custom parser, including an external parser.
Since the first FIFA Women’s World Cup in 1991, interest in playing and watching women’s soccer has only increased. Around the world, more girls than ever before are playing the beautiful game that not only provides obvious health benefits but also helps boost girls’ confidence and self-esteem at the time in their lives when they need it most.
Norconex is proud to renew its sponsorship of women’s soccer teams in the Association de Soccer de Hull (Gatineau, Quebec, Canada) for the 2016 season. In addition to renewing its support for five local teams with players between 10 and 16 years of age, Norconex now sponsors two competitive women’s teams (U12 and U15).
At the upcoming women’s soccer tournament in this year’s Summer Olympics, girls will be able to cheer for their soccer idols once again, and Norconex will be cheering along with them.
Norconex has released Norconex HTTP Collector version 2.5.0! This new version of our open source web crawler was released to help minimize your re-crawling frequencies and download delays, and it allows you to specify a locale for date parsing/formatting. The following highlights these key changes and additions:
Minimum re-crawl frequency
Not all web pages and documents are updated equally often. In addition, updates are not equally important to capture right away for all types of content. Re-crawling every page every time to find out whether it changed can be time consuming (and sometimes taxing) on larger sites. For instance, you may want to re-crawl news pages more regularly than other types of pages on a given site. Luckily, some websites provide sitemaps, which give crawlers pointers to their document update frequencies.
This release introduces “recrawlable resolvers” to help control the frequency of document re-crawls. You can now specify a minimum re-crawl delay, based on a document matching content type or reference pattern. The default implementation is GenericRecrawlableResolver, which supports sitemap “lastmod” and “changefreq” in addition to custom re-crawl frequencies.
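The decision such a resolver makes can be pictured with a small Python sketch. It is not GenericRecrawlableResolver: the rule list, the patterns, and the function name are hypothetical, and sitemap “lastmod”/“changefreq” handling is left out entirely.

```python
import re
from datetime import datetime, timedelta

# Hypothetical rules: (reference pattern, minimum delay between crawls).
MIN_FREQUENCIES = [
    (re.compile(r".*/news/.*"), timedelta(hours=1)),
    (re.compile(r".*\.pdf$"), timedelta(days=30)),
]


def is_recrawlable(reference, last_crawled, now, rules=MIN_FREQUENCIES):
    """Return True if enough time has passed to crawl `reference` again.

    The first matching rule decides; references matching no rule are
    always considered recrawlable.
    """
    for pattern, min_delay in rules:
        if pattern.match(reference):
            return now - last_crawled >= min_delay
    return True
```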
ReferenceDelayResolver is a new “delay resolver” that controls delays between each document download. It allows you to define different delays for different URL patterns. This can be useful for more fragile websites negatively impacted by the fast download of several big documents (e.g., PDFs). In such cases, introducing a delay between certain types of download can help keep the crawled website performance intact.
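Conceptually, a per-pattern delay lookup works along these lines. The Python below is only a sketch under stated assumptions: the patterns, the delay values, and the default are made up, and this is not the ReferenceDelayResolver code.

```python
import re

# Hypothetical rules: (URL pattern, delay in milliseconds before download).
DELAY_RULES = [
    (re.compile(r".*\.pdf$", re.IGNORECASE), 10_000),  # big documents: slow down
    (re.compile(r".*/archive/.*"), 5_000),
]


def resolve_delay_ms(url, rules=DELAY_RULES, default_ms=3_000):
    """Return the delay of the first rule matching `url`, else the default."""
    for pattern, delay_ms in rules:
        if pattern.match(url):
            return delay_ms
    return default_ms
```

A crawler would wait for the returned delay before fetching the URL, so only the fragile parts of a site are crawled slowly.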
Norconex just released an Amazon CloudSearch Committer module for its open-source crawlers (Norconex “Collectors”). This is an especially useful contribution to CloudSearch users given that CloudSearch does not have its own crawlers.
If you’re not yet familiar with Norconex Collectors, head over to the Norconex Collectors website to see what you’ve been missing.
Assuming you’re already familiar with Norconex Collectors, you can enable CloudSearch as your crawler’s target search engine by following these steps:
Norconex is proud to release version 2.3.0 of its Norconex HTTP Collector open-source web crawler. Thanks to incredible community feedback and efforts, we have implemented several feature requests, and your favorite crawler is now more stable than ever. The following describes only a handful of these new features with a focus on XML configuration. Refer to the product release notes for a complete list of changes.
Restrict crawling to a specific site
Up until now, you could restrict crawling to a specific domain, protocol, and port using one or more reference filters (e.g., RegexReferenceFilter). Norconex HTTP Collector 2.3.0 features new configuration options to “stay on a site”, called stayOnProtocol, stayOnDomain, and stayOnPort. These new settings can be applied to the <startURLs> tag of your XML configuration. They are particularly useful when you have many “start URLs” defined and you do not want to create many reference filters to stay on those sites.
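To make the three options concrete, here is a Python sketch of the kind of check they imply. It is not the crawler's implementation: the function and parameter names are hypothetical, hosts are compared exactly (how the real options treat sub-domains is not covered here), and a missing port is assumed to mean the scheme's default port.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def stays_on_site(start_url, candidate,
                  stay_on_protocol=True, stay_on_domain=True, stay_on_port=True):
    """Return True if `candidate` may be crawled given its start URL."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    if stay_on_protocol and start.scheme != cand.scheme:
        return False
    if stay_on_domain and start.hostname != cand.hostname:
        return False
    if stay_on_port:
        start_port = start.port or DEFAULT_PORTS.get(start.scheme)
        cand_port = cand.port or DEFAULT_PORTS.get(cand.scheme)
        if start_port != cand_port:
            return False
    return True
```

With many start URLs, each candidate link is checked only against the start URL it was discovered from, which is what spares you from writing one reference filter per site.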
GenericHttpClientFactory now allows you to set HTTP request headers on every HTTP call that a crawler makes. This new feature can save the day for sites that expect certain header values to be present in order to render properly. For instance, some sites may rely on the “Accept-Language” request header to decide which language to use when rendering a page.
Norconex HTTP Collector configuration example
Specify a sitemap as a start URL
It is now possible to specify one or more sitemap URLs as “start URLs.” This is in addition to the crawler attempting to detect sitemaps at standard locations. To only use the sitemap URL provided as a start URL, you can disable the sitemap discovery process by adding ignore="true" to <sitemapResolverFactory> as shown in the code sample. To only crawl pages listed in sitemap files and not further follow links found in those pages, remember to set the <maxDepth> to zero.
Norconex HTTP Collector configuration example
<sitemapResolverFactory ignore="true" />
Basic URL normalization always performed
URL normalization is now in effect by default using GenericURLNormalizer. The following are the default normalization rules applied:
Removing the URL fragment (the “#” character and everything after)
Converting the scheme and host to lower case
Capitalizing letters in escape sequences
Decoding percent-encoded unreserved characters
Removing the default port
Encoding non-URI characters
You can always overwrite the default normalization settings or turn off normalization altogether by adding the disabled="true" attribute to the <urlNormalizer> tag.
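Several of the rules listed above can be approximated in a few lines of Python. This sketch covers only four of them (fragment removal, lower-casing scheme and host, upper-casing escape sequences, default port removal) and is not GenericURLNormalizer; the function name is hypothetical, and user-info and IPv6 hosts are deliberately ignored.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def normalize(url):
    """Apply a subset of the default normalization rules to `url`."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default one for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"

    def upper(match):
        # Capitalize the hex digits of one %xx escape sequence.
        return match.group(0).upper()

    path = ESCAPE.sub(upper, parts.path)
    query = ESCAPE.sub(upper, parts.query)
    # Passing an empty fragment drops '#' and everything after it.
    return urlunsplit((scheme, netloc, path, query, ""))
```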
We introduced additional features when we upgraded the Norconex Importer dependency to its latest version (2.4.0). You can now use scripting languages to insert your own document processing logic or reference DOM elements of an XML or HTML file using a friendly syntax. Refer to the Importer 2.4.0 release announcement for more details.
There is so much more offered by this release. Use the following links to find out more about Norconex HTTP Collector.
Norconex is proud to release version 2.4.0 of its Norconex Importer open-source product. In addition to the usual bug fixes and stability enhancements, this release provides more possibilities for parsing and enriching your documents. Most significantly, Importer 2.4.0 allows for scripting and DOM navigation. Keep reading for more details and usage samples.
Whereas it has always been possible to extend the Importer to implement your own document processing logic, you can now inject that logic via configuration using a scripting language. The following new handlers enable the use of scripting languages to manipulate documents: ScriptFilter, ScriptTagger, and ScriptTransformer.
Java developers can extend the Importer to add support for additional scripting languages. These new classes rely on the JSR 223 API, which allows you to “plug in” any script engine to support your favorite scripting language.
Importer Configuration Samples
<!-- Reject documents that are not about "apple". -->
It is now possible to reference elements of an HTML or XML document using a friendly CSS or jQuery-like syntax to navigate its Document Object Model (DOM). The jsoup parser is used to load document content into a DOM tree.
The new DOMContentFilter can be used to reject documents containing a specific HTML/XML path or element. The DOMSplitter can be used to break HTML/XML with “list” elements into different documents. Finally, the DOMTagger allows you to extract specific HTML/XML tag values or attributes and store them in your own fields (e.g., extract <h1> tags into a “title” field).
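The idea of extracting tag values into a field can be illustrated with Python's standard library. This sketch uses `html.parser` rather than jsoup and matches by tag name only, not by CSS selector, so it is a simplified stand-in and not the DOMTagger; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text content of every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while inside the target tag
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.values[-1] += data


def extract_tag_text(html, tag):
    """Return the stripped text of each `tag` element found in `html`."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


# Store the extracted <h1> values in a hypothetical "title" metadata field.
page = "<html><body><h1>Hello <b>World</b></h1><p>x</p><h1>Two</h1></body></html>"
metadata = {"title": extract_tag_text(page, "h1")}
```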
This release features several other helpful and interesting changes and additions. For instance, CharacterCaseTagger can now be used to adjust the character case of field names (in addition to values). A few additional file formats are also supported. For a complete list of changes, see the release notes.
Importer Configuration Sample
<!-- Make every instance of "title" field name lowercase. -->
This year’s conference was held in Austin, Texas on October 15-16, 2015. It gathered around 600 Lucene and Solr enthusiasts from 26 countries, including many of the Solr committers. Pascal Dimassimo and Pascal Essiembre attended the event on behalf of Norconex. While the talks were varied, there were a few recurrent themes such as search relevance, analytics, and infrastructure scaling. The following relates the experiences of the attendees with the content of conference sessions they attended. These talks should become available for viewing on YouTube shortly.
There were at least 10 talks related to the topic of relevancy alone. They offered ideas on how to improve relevancy, including intent detection, using machine learning principles, fuzzy matching, and more.
Of those standing out, Trey Grainger (co-author of Solr in Action) showed us how he created a knowledge graph built on top of Solr to improve CareerBuilder.com results.
Another noteworthy presentation came from Michael Nilsson and Diego Ceccarelli of Bloomberg, who broke their documents into features and used a matrix to decide the ranking of each feature. They reminded us there is nothing wrong with doing multiple passes to Solr to better serve up search requests.
Yonik Seeley, co-creator of Solr and now Solr Dude at Cloudera, presented the new Solr JSON Facet API. This new API (which is already available in Solr 5.3) has been completely rewritten for Solr 5 and allows for first-class analytics support. You can now easily have nested facets, metrics, and statistics. This is similar to Aggregations in Elasticsearch. According to the numbers presented, this new facet module performs much better than the original Solr facet module.
Erick Erickson presented the new Solr Streaming Aggregation API (also available in Solr 5.3). Solr has never been very good at accessing lots of search results because of deep paging issues and memory requirements. However, this new API builds on the existing exporting capabilities to allow us to stream concentrated data out of SolrCloud with new possibilities, like memory-efficient set operations (union, intersection, complement, join and unique). It also introduces new worker collections on the SolrCloud cluster to handle this processing. The goal is to build a general purpose, distributed computation framework right on top of Solr. This is still a work in progress, and the next speaker, Joel Bernstein, showed us what we can expect next. Leveraging the Streaming Aggregation API and JSON Facet API, Solr 6 should offer us a very powerful feature: SQL queries over Solr!
For those using Spark, Lucidworks’ Timothy Potter introduced us to the tool they’ve built to use Solr as a Spark SQL DataSource. This allows Solr to be used with an existing Spark analysis pipeline. This tool also permits the writing of data into Solr from Spark.
Shenghua Wan and Rahul Gupta from WalmartLabs described their experiences using different technologies to perform distributed indexing. They experimented with MapReduce, Hadoop and others to distribute and enhance their XML data across several Solr shards, merging those shards in the end.
Riak’s developer Fred Dushin showed us Yokozuna, their new implementation of Riak Search. Riak is a distributed key/value store and with Yokozuna, Solr brings search to Riak. But Yokozuna also brings something to Solr. Because of its distributed nature, it makes it possible to use Riak to distribute Solr instead of using SolrCloud.
Mark Miller, Software Engineer at Cloudera, told us that open-source technologies have taken over the search ecosystem, especially Solr and Lucene. In the future, those search engines will get integrated with multiple systems. Cloudera wants to integrate Solr with Hadoop. Miller claims that at the moment, Solr search at scale is still flaky, even with SolrCloud, though he admitted that it is good enough for general usage. According to Miller, Hadoop can help, so his firm created Cloudera Search, which uses Solr and Hadoop together.
The aforementioned topics were not the only ones covered at the conference. There were others of varying technicality. Toke Eskildsen, representing the State and University Library in Denmark, gave a low-level and very interesting talk about facet optimization. He demonstrated the code improvements he made to improve Solr facet performance and achieve impressive benchmark results.
David Smiley, who has long been involved in all things related to Solr geospatial research, showed us the latest work on spatial 2-D faceting, also known as heat maps. He also took the time to retrace the history of various geospatial functionalities in Solr and Lucene.
We’ve only scratched the surface of the conference proceedings at the Lucene/Solr Revolution 2015. We also thoroughly enjoyed the hospitality of the city of Austin, a community which offered a warm welcome and many wonderful sights. We hope our experiences stimulate further interest among others in attending future conferences, and we welcome further inquiries regarding our experiences in Austin.
The latest release of Norconex HTTP Collector provides more content transformation capabilities, canonical URL support, increased stability, and more.
As the Internet grows, so does the demand for better ways to extract and process web data. Several commercial and open-source/free web crawling solutions have been available for years now. Unfortunately, most are limited by one or more of the following:
Feature set is too limited
Unfriendly and complex to set up
Require strong programming skills
No longer supported or active
Integrates with a single search engine or repository
Geared solely toward big data solutions (like the popular Apache Nutch has become)
Difficult to extend with your own features
High cost of ownership
Norconex is changing this with its full-featured, enterprise-class, open-source web crawler solution. Norconex HTTP Collector is entirely configurable using simple XML, yet offers many extension points for adventurous Java programmers. It integrates with virtually any repository or search engine (Solr, Elasticsearch, IDOL, GSA, etc.). You will find it is thoroughly documented in a single location, with sample configuration files working out of the box on any operating system.
The latest release builds upon the great community requests and feedback to provide the following highlights:
Canonical Links Detector
Canonical links are a way for the webmaster to help crawlers avoid duplicates by indicating the preferred URL for accessing a web page. The HTTP Collector now detects canonical links found in both HTML and HTTP headers.
It also looks for an HTTP response header field named “Link” with a value following this pattern:
The advantage for webmasters in defining canonical URLs in the HTTP response header rather than in an HTML page is twofold. First, it allows web crawlers to reject non-canonical pages before they are downloaded (saving bandwidth). Second, it can apply to any content type, not just HTML pages.
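For readers unfamiliar with the header form, a canonical link is commonly sent as a “Link” response header such as `<https://www.example.com/sample.pdf>; rel="canonical"`. The Python sketch below shows one way to read the canonical URL from such a value. It is not the Collector's detection code; the function name is hypothetical, and the parsing is simplified (commas inside quoted parameters are not handled).

```python
import re

# One link-value: <target> followed by parameters up to the next comma.
LINK_VALUE = re.compile(r"<([^>]+)>([^,]*)")
REL_PARAM = re.compile(r'rel\s*=\s*"?([^";]+)"?', re.IGNORECASE)


def canonical_from_link_header(value):
    """Return the canonical URL found in a Link header value, or None."""
    for target, params in LINK_VALUE.findall(value):
        rel = REL_PARAM.search(params)
        # A rel parameter may hold several space-separated relation types.
        if rel and "canonical" in rel.group(1).lower().split():
            return target
    return None
```

Because the header arrives before the body, a crawler can make this check and skip the download when the canonical URL differs from the one requested.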
URLStatusCrawlerEventListener is a new crawler event listener that can produce spreadsheet-friendly reports on fetched URLs and their statuses. Among other things, it can be useful for finding broken links on a site being crawled.
A new class called GenericSpoiledReferenceStrategizer allows you to specify how to handle URLs that were once valid, but turned “bad” on a subsequent crawl. You can choose to delete them from your repository, give them a single chance to recover on the next crawl, or simply ignore them.
Norconex HTTP Collector internally relies on the Norconex Importer library for parsing documents and manipulating text and metadata. The latest release of the Importer brings you several new options, such as:
TextPatternTagger: Extracts and adds all text values matching the regular expression provided to a metadata field.
Want to crawl a filesystem instead?
Whether you are interested in crawling a local drive, a network drive, an FTP site, WebDAV, or any other type of filesystem, Norconex Filesystem Collector is for you; it was recently upgraded to version 2.2.0 as well. Check its release notes for details.