Norconex HTTP Collector 2.3.0

Norconex is proud to release version 2.3.0 of its Norconex HTTP Collector open-source web crawler.  Thanks to incredible community feedback and efforts, we have implemented several feature requests, and your favorite crawler is now more stable than ever. The following describes only a handful of these new features with a focus on XML configuration. Refer to the product release notes for a complete list of changes.

Restrict crawling to a specific site

[ezcol_1half]

Up until now, you could restrict crawling to a specific domain, protocol, and port using one or more reference filters (e.g., RegexReferenceFilter). Norconex HTTP Collector 2.3.0 features new configuration options to “stay on a site”, called stayOnProtocol, stayOnDomain, and stayOnPort.  These new settings can be applied to the <startURLs> tag of your XML configuration.  They are particularly useful when you have many “start URLs” defined and you do not want to create many reference filters to stay on those sites.

[/ezcol_1half]

[ezcol_1half_end]

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>http://mysite.com</url>
</startURLs>

[/ezcol_1half_end]

 

Add HTTP request headers

[ezcol_1half]

GenericHttpClientFactory now allows you to set HTTP request headers on every HTTP call a crawler makes. This new feature can save the day for sites that expect certain header values to be present to render properly. For instance, some sites may rely on the “Accept-Language” request header to decide which language to use when rendering a page.

[/ezcol_1half]

[ezcol_1half_end]

<httpClientFactory>
  <headers>
    <header name="Accept-Language">fr</header>
    <header name="From">john@smith.com</header>
  </headers>
</httpClientFactory>

[/ezcol_1half_end]

Specify a sitemap as a start URL

[ezcol_1half]

It is now possible to specify one or more sitemap URLs as “start URLs.”  This is in addition to the crawler attempting to detect sitemaps at standard locations. To use only the sitemap URLs provided as start URLs, you can disable the sitemap discovery process by adding ignore="true" to <sitemapResolverFactory>, as shown in the code sample.  To crawl only the pages listed in sitemap files, without following links found in those pages, remember to set <maxDepth> to zero (a combined sketch follows the sample).

[/ezcol_1half]

[ezcol_1half_end]

<startURLs>
  <sitemap>http://mysite.com/sitemap.xml</sitemap>
</startURLs>
<sitemapResolverFactory ignore="true" />

[/ezcol_1half_end]
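For instance, to crawl only what a sitemap lists, the two settings above can be combined with a zero maximum depth. This is a minimal sketch; it assumes <maxDepth> sits at the crawler level of your configuration, next to <startURLs>:

<startURLs>
  <sitemap>http://mysite.com/sitemap.xml</sitemap>
</startURLs>
<sitemapResolverFactory ignore="true" />
<maxDepth>0</maxDepth>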

Basic URL normalization always performed

[ezcol_1half]

URL normalization is now in effect by default using GenericURLNormalizer. The following are the default normalization rules applied (an explicit equivalent is sketched after the configuration sample):

  • Removing the URL fragment (the “#” character and everything after)
  • Converting the scheme and host to lower case
  • Capitalizing letters in escape sequences
  • Decoding percent-encoded unreserved characters
  • Removing the default port
  • Encoding non-URI characters

You can always override the default normalization settings, or turn off normalization altogether by adding the disabled="true" attribute to the <urlNormalizer> tag.

[/ezcol_1half]

[ezcol_1half_end]

<urlNormalizer>
  <normalizations>
    lowerCaseSchemeHost, upperCaseEscapeSequence, removeDefaultPort, 
    removeDotSegments, removeDirectoryIndex, removeFragment, addWWW 
  </normalizations>
  <replacements>
    <replace><match>&amp;view=print</match></replace>
    <replace>
       <match>(&amp;type=)(summary)</match>
       <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>

[/ezcol_1half_end]
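For reference, here is a sketch of what an explicit equivalent of the default rules might look like. Four of the token names appear in the sample above; decodeUnreservedCharacters and encodeNonURICharacters are assumptions inferred from the rule names, so verify them against the GenericURLNormalizer documentation before use:

<urlNormalizer>
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters
  </normalizations>
</urlNormalizer>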

Scripting Language and DOM navigation

We introduced additional features when we upgraded the Norconex Importer dependency to its latest version (2.4.0). You can now use scripting languages to insert your own document processing logic, or reference DOM elements of an XML or HTML file using a friendly syntax.  Refer to the Importer 2.4.0 release announcement for more details.

Useful links

There is so much more offered by this release. Refer to the product website and release notes to find out more about Norconex HTTP Collector.

Norconex is proud to release version 2.4.0 of its Norconex Importer open-source product.  In addition to the usual bug fixes and stability enhancements, this release provides more possibilities for parsing and enriching your documents.  Most significantly, Importer 2.4.0 allows for scripting and DOM navigation.  Keep reading for more details and usage samples.

Scripting

[ezcol_1half]

While it has always been possible to extend the Importer to implement your own document processing logic, you can now inject that logic through configuration using a scripting language. The following new handlers enable the use of scripting languages to manipulate documents: ScriptFilter, ScriptTagger, and ScriptTransformer.

These classes use the “JavaScript” script engine that already ships with your Java installation.  The JavaScript engine bundled with the Oracle implementation of Java is based on Mozilla Rhino, and you can find extensive JavaScript documentation on the Mozilla Rhino site.

Java developers can extend the Importer to add support for additional scripting languages. These new classes rely on the JSR 223 API, which allows you to “plug in” any script engine to support your favorite scripting language (a generic JSR 223 sketch follows the configuration samples).

[/ezcol_1half]

[ezcol_1half_end]

<!-- Reject documents that are not about "apple". -->
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter">
  <script><![CDATA[
      isAppleDoc = metadata.getString('fruit') == 'apple'
              || content.indexOf('Apple') > -1;
      /*return*/ isAppleDoc;
  ]]></script>
</filter>

<!-- Add a "fruit" metadata field with the value "apple". --> 
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
  <script><![CDATA[
      metadata.addString('fruit', 'apple');
  ]]></script>
</tagger>

<!-- Replace all occurrences of "Alice" with "Roger". -->
<transformer 
    class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
  <script><![CDATA[
      modifiedContent = content.replace(/Alice/g, 'Roger');
      /*return*/ modifiedContent;
  ]]></script>
</transformer>

 [/ezcol_1half_end]
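For Java developers curious about JSR 223 itself, here is a small, standalone sketch of the standard javax.script API these handlers build on. It is not Importer internals, just the generic engine-lookup mechanism:

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class Jsr223Example {
    public static void main(String[] args) throws ScriptException {
        // Look up a script engine by name, the same mechanism used to
        // resolve the "JavaScript" engine bundled with the JVM.
        ScriptEngine engine =
                new ScriptEngineManager().getEngineByName("JavaScript");

        // Expose a variable to the script, then evaluate an expression.
        engine.put("content", "A document about Apple.");
        Object isAppleDoc = engine.eval("content.indexOf('Apple') > -1;");
        System.out.println(isAppleDoc); // prints: true
    }
}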

DOM navigation

[ezcol_1half]

It is now possible to reference elements of an HTML or XML document using friendly CSS- or jQuery-like syntax to navigate its Document Object Model (DOM). The jsoup parser is used to load document content into a DOM tree.

The new DOMContentFilter can be used to reject documents containing a specific HTML/XML path or element. The DOMSplitter can be used to break HTML/XML with “list” elements into different documents. Finally, the DOMTagger allows you to extract specific HTML/XML tag values or attributes and store them in your own fields (e.g., extract <h1> tags into a “title” field).

[/ezcol_1half]

[ezcol_1half_end]

<!-- Exclude documents containing GIF images. -->
<filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
      selector="img[src$=.gif]" onMatch="exclude" />

<!-- Store H1 tags in a title field. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="h1" toField="title" overwrite="false" />
</tagger>

<!-- Create a new contact document for each occurrence of the "contact" tag. -->
<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
    selector="contact" />

 [/ezcol_1half_end]

Other features

[ezcol_1half]

This release features several other helpful and interesting changes and additions.  For instance, CharacterCaseTagger can now be used to adjust the character case of field names (in addition to values). A few additional file formats are also supported.  For a complete list of changes, see the release notes.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Make every instance of "title" field name lowercase. -->
<tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
  <characterCase fieldName="title" type="lower" applyTo="field" />
</tagger>

 [/ezcol_1half_end]


Solr committers present at the event

This year’s Lucene/Solr Revolution conference was held in Austin, Texas, on October 15-16, 2015. It gathered around 600 Lucene and Solr enthusiasts from 26 countries, including many of the Solr committers. Pascal Dimassimo and Pascal Essiembre attended the event on behalf of Norconex. While the talks were varied, a few themes kept coming back, such as search relevance, analytics, and infrastructure scaling. The following recounts our experience of the conference sessions we attended. The talks should become available for viewing on YouTube shortly.

Relevancy

There were at least 10 talks related to the topic of relevancy alone. They offered ideas on how to improve relevancy, including intent detection, using machine learning principles, fuzzy matching, and more.

Among the standouts, Trey Grainger (co-author of Solr in Action) showed us how he created a knowledge graph built on top of Solr to improve CareerBuilder.com results.

Another noteworthy presentation came from Michael Nilsson and Diego Ceccarelli of Bloomberg, who broke their documents into features and used a matrix to decide the ranking of each feature. They reminded us there is nothing wrong with making multiple passes to Solr to better serve search requests.

Analytics

Whether it was analyzing search logs or user search behaviors, developers are working hard to build powerful analytics capabilities on top of Solr. Kiran Chitturi of Lucidworks suggested an easy way to capture user events using the Snowplow JavaScript event tracker. He also highlighted the potential benefits of sending those events to their new Lucidworks Fusion product.

Kiran Chitturi discussing events processing in Lucidworks Fusion

Yonik Seeley, co-creator of Solr and now Solr Dude at Cloudera, presented the new Solr JSON Facet API. This new API (which is actually available in Solr 5.3) has been completely re-written for Solr 5 and allows for first-class analytics support. You can now easily have nested facets, metrics, and statistics, similar to aggregations in Elasticsearch. According to the numbers presented, this new facet module performs much better than the original Solr facet module.
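As a minimal sketch of what a nested facet with a metric looks like (assuming Solr 5.3+ and the techproducts example used later in this post):

$ curl "http://localhost:8983/solr/techproducts/select" -d 'q=*:*&rows=0&wt=json&json.facet={categories:{terms:{field:cat,facet:{avg_price:"avg(price)"}}}}'

This computes the average price per category as part of a regular search request.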

Erick Erickson presented the new Solr Streaming Aggregation API (also available in Solr 5.3). Solr has never been very good at accessing lots of search results because of deep paging issues and memory requirements. However, this new API builds on the existing exporting capabilities to allow us to stream concentrated data out of SolrCloud with new possibilities, like memory-efficient set operations (union, intersection, complement, join and unique). It also introduces new worker collections on the SolrCloud cluster to handle this processing. The goal is to build a general purpose, distributed computation framework right on top of Solr. This is still a work in progress, and the next speaker, Joel Bernstein, showed us what we can expect next. Leveraging the Streaming Aggregation API and JSON Facet API, Solr 6 should offer us a very powerful feature: SQL queries over Solr!

For those using Spark, Lucidworks’ Timothy Potter introduced us to the tool they’ve built to use Solr as a Spark SQL DataSource. This allows Solr to be used with an existing Spark analysis pipeline. The tool also permits writing data into Solr from Spark.

Infrastructure Scaling

Shenghua Wan and Rahul Gupta sharing their experiments

Shenghua Wan and Rahul Gupta from WalmartLabs described their experiences using different technologies to perform distributed indexing.  They experimented with MapReduce, Hadoop and others to distribute and enhance their XML data across several Solr shards, merging those shards in the end.

Riak developer Fred Dushin showed us Yokozuna, the new implementation of Riak Search. Riak is a distributed key/value store, and with Yokozuna, Solr brings search to Riak. But Yokozuna also brings something to Solr: because of its distributed nature, it makes it possible to use Riak to distribute Solr instead of using SolrCloud.

Mark Miller, Software Engineer at Cloudera, told us that open-source technologies have taken over the search ecosystem, especially Solr and Lucene. In the future, those search engines will get integrated with multiple systems. Cloudera wants to integrate Solr with Hadoop. Miller claims that at the moment, Solr search at scale is still flaky, even with SolrCloud, though he admitted that it is good enough for general usage. According to Miller, Hadoop can help, so his firm created Cloudera Search, which uses Solr and Hadoop together.

Other Topics

The aforementioned topics were not the only ones covered at the conference. There were others of varying technicality. Toke Eskildsen, representing the State and University Library in Denmark, gave a low-level and very interesting talk about facet optimization. He demonstrated the code improvements he made to improve Solr facet performance and achieve impressive benchmark results.

Pascal Essiembre enjoying Austin

David Smiley, who has long been involved in all things related to Solr geospatial research, showed us the latest work on spatial 2-D faceting, also known as heat maps. He also took the time to retrace the history of various geospatial functionalities in Solr and Lucene.

We’ve only scratched the surface of the Lucene/Solr Revolution 2015 conference proceedings. We also thoroughly enjoyed the hospitality of the city of Austin, which offered a warm welcome and many wonderful sights. We hope our experiences stimulate further interest in attending future conferences, and we welcome inquiries about our time in Austin.

The latest release of Norconex HTTP Collector provides more content transformation capabilities, canonical URL support, increased stability, and several additional features.

Norconex HTTP Collector 2.2 now available

As the Internet grows, so does the demand for better ways to extract and process web data. Several commercial and open-source/free web crawling solutions have been available for years now. Unfortunately, most are limited by one or more of the following:

  • Feature set is too limited
  • Unfriendly and complex to set up
  • Poorly documented
  • Require strong programming skills
  • No longer supported or active
  • Integrates with a single search engine or repository
  • Geared solely toward big data solutions (like what the popular Apache Nutch has become)
  • Difficult to extend with your own features
  • High cost of ownership

Norconex is changing this with its full-featured, enterprise-class, open-source web crawler solution. Norconex HTTP Collector is entirely configurable using simple XML, yet offers many extension points for adventurous Java programmers. It integrates with virtually any repository or search engine (Solr, Elasticsearch, IDOL, GSA, etc.). You will find it thoroughly documented in a single location, with sample configuration files that work out of the box on any operating system.

The latest release builds upon the great community requests and feedback to provide the following highlights:

Canonical Links Detector

[ezcol_1half]

Canonical links are a way for the webmaster to help crawlers avoid duplicates by indicating the preferred URL for accessing a web page. The HTTP Collector now detects canonical links found in both HTML and HTTP headers.

The GenericCanonicalLinkDetector looks within the HTML <head> tags for a <link> tag following this pattern:

<link rel="canonical" href="https://norconex.com/sample" />

It also looks for an HTTP response header field named “Link” with a value following this pattern:

<https://norconex.com/sample.pdf> rel="canonical"

The advantage for webmasters in defining canonical URLs in the HTTP response header rather than in the HTML page is twofold. First, it allows web crawlers to reject non-canonical pages before they are downloaded (saving bandwidth). Second, it works for any content type, not just HTML pages.

[/ezcol_1half]

[ezcol_1half_end]

<canonicalLinkDetector
    class="com.norconex.collector.http.url.impl.GenericCanonicalLinkDetector"
    ignore="false">
</canonicalLinkDetector>

[/ezcol_1half_end]

URL Reports Creation

[ezcol_1half]

URLStatusCrawlerEventListener is a new crawler event listener that can produce spreadsheet-friendly reports on fetched URLs and their statuses. Among other things, it can be useful for finding broken links on a site being crawled.

[/ezcol_1half]

[ezcol_1half_end]

<listener
    class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
  <statusCodes>404</statusCodes>
  <outputDir>/a/path/broken-links.tsv</outputDir>
</listener>

[/ezcol_1half_end]

Spoiled State Resolver

[ezcol_1half]

A new class called GenericSpoiledReferenceStrategizer allows you to specify how to handle URLs that were once valid but turned “bad” on a subsequent crawl. You can choose to delete them from your repository, give them a single chance to recover on the next crawl, or simply ignore them.

[/ezcol_1half]

[ezcol_1half_end]

<spoiledReferenceStrategizer 
    class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer"
    fallbackStrategy="IGNORE">
  <mapping state="NOT_FOUND" strategy="DELETE" />
  <mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
  <mapping state="ERROR" strategy="IGNORE" />
</spoiledReferenceStrategizer>

[/ezcol_1half_end]

Extra Filtering and Data Manipulation Options

Norconex HTTP Collector internally relies on the Norconex Importer library for parsing documents and manipulating text and metadata. The latest release of the Importer brings you several new options, such as the following (a combined configuration sketch follows this list):

  • CurrentDateTagger: Add the current date to a document.
  • DateMetadataFilter: Accepts or rejects a document based on the date value of a metadata field.
  • NumericMetadataFilter: Accepts or rejects a document based on the numeric value of a metadata field.
  • TextPatternTagger: Extracts all text values matching the provided regular expression and adds them to a metadata field.
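As a minimal sketch (reusing the exact tagger and filter samples shown in the Importer highlights further below), two of these options could be combined inside your crawler’s <importer> section like this:

<importer>
  <postParseHandlers>
    <!-- Record when the document was processed. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
        field="date-imported" format="yyyy-MM-dd" />
    <!-- Keep only documents whose "age" field is between 20 and 30. -->
    <filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
        onMatch="include" field="age" >
      <condition operator="ge" number="20" />
      <condition operator="lt" number="30" />
    </filter>
  </postParseHandlers>
</importer>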

Want to crawl a filesystem instead?

Whether you are interested in crawling a local drive, a network drive, an FTP site, WebDAV, or any other type of filesystem, Norconex Filesystem Collector is for you; it was recently upgraded to version 2.2.0 as well. Check its release notes for details.


This release of Norconex Importer brings many fixes, increased stability, and nice new features. The following highlights some of the additions with XML configuration or Java code samples.

Retrieve a document length

[ezcol_1half]

Thanks to the new DocumentLengthTagger, you can now store a document's byte length in a metadata field of your choice. The length can be obtained at any document processing stage.  For instance, it can be obtained before any transformation takes place, or after the document is parsed.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.DocumentLengthTagger"
  field="doc-length" overwrite="true" >
</tagger>

 [/ezcol_1half_end]

Add the current date to a document

[ezcol_1half]

The new CurrentDateTagger allows you to add the current date to a metadata field, in the date format of your choice. This can be useful to indicate when a document was actually processed by the Importer.

[/ezcol_1half]

[ezcol_1half_end]

<tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
  field="date-imported" format="yyyy-MM-dd" />

 [/ezcol_1half_end]

Filter documents on numeric or date range

[ezcol_1half]

NumericMetadataFilter and DateMetadataFilter now allow you to filter documents based on metadata field numeric or date values, respectively. You can define both closed ranges and open-ended ranges.

[/ezcol_1half]

[ezcol_1half_end]

<!-- Numeric range filter -->
<filter class="com.norconex.importer.handler.filter.impl.NumericMetadataFilter"
      onMatch="include" field="age" >
  <condition operator="ge" number="20" />
  <condition operator="lt" number="30" />
</filter>

<!-- Date range filter -->
<filter class="com.norconex.importer.handler.filter.impl.DateMetadataFilter"
      onMatch="include" field="publish_date" >
  <condition operator="ge" date="TODAY-7" />
  <condition operator="lt" date="TODAY" />
</filter>

 [/ezcol_1half_end]

Use external parsers

[ezcol_1half]

Wrapping a Tika class of the same name, the new ExternalParser allows Java programmers to point to external command-line applications to parse documents. One example is using “pdftotext” to parse PDFs instead of the default PDFBox-based PDF parser, which is much slower (but generally does a better job).

[/ezcol_1half]

[ezcol_1half_end]

import java.util.Map;

import com.norconex.commons.lang.file.ContentType;
import com.norconex.importer.parser.GenericDocumentParserFactory;
import com.norconex.importer.parser.IDocumentParser;
import com.norconex.importer.parser.impl.ExternalParser;

public class CustomDocumentParserFactory extends GenericDocumentParserFactory {

    @Override
    protected Map<ContentType, IDocumentParser> createNamedParsers() {
        Map<ContentType, IDocumentParser> parsers = super.createNamedParsers();

        ExternalParser pdfParser = new ExternalParser();
        pdfParser.setCommand(
                // Replace this with your own executable path
                "C:\\Apps\\pdftotext.exe", 
                "-enc", "UTF-8", "-raw", "-q", "-eol", "unix",                 
                ExternalParser.INPUT_FILE_TOKEN, 
                ExternalParser.OUTPUT_FILE_TOKEN);
        parsers.put(ContentType.PDF, pdfParser);
        return parsers;
    }
}

  [/ezcol_1half_end]

Other improvements

There are more changes under the hood, like upgrading to Apache Tika 1.8, as well as fixes for OutOfMemory errors and for document parsing sometimes never returning. You can find the complete list of changes in the release notes.

Several of these improvements were made possible thanks to the great feedback of the open-source community. Keep doing so: you make a difference.


[ezcol_1half]

With the FIFA Women’s World Cup 2015 starting in Canada on June 6th, many young girls are eagerly waiting to see their favorite players compete for their country. Soccer (or “football” outside North America) has been the most practiced sport among Canadian kids aged 5 to 14 for over a decade now. Girls represent over 40% of these young players.

Norconex is proud to help attract and retain even more girls in the world’s most beloved sport by sponsoring five girls-only teams, from U10 to U14, that are part of the “Association de Soccer de Hull”.

We can’t predict who will be the next Christine Sinclair, but we hope an increasing number of girls will enjoy soccer and even dream of being on the national team one day.

If you’re in Canada this June (or early July), make sure you have your World Cup tickets. Matches will be played in 6 cities across the country.

Go girls!

[/ezcol_1half]

[ezcol_1half_end]
ASH U12 D1 Girl Soccer Team

[/ezcol_1half_end]

Docker

Docker is all the rage at the moment! It was recently selected as a Gartner Cool Vendor in DevOps. As you may already know, Docker is a platform to build and deploy applications as self-contained units. Those units, called containers, can be executed consistently on a developer laptop or a production server. Since containers include all their dependencies, they are truly portable. And, compared to normal virtual machine images, Docker containers are much more lightweight because they don’t need as much infrastructure as a normal VM. Docker containers are built from images, which are themselves described by a Dockerfile, a simple text file listing the steps needed to assemble the image.

The goal of this blog post is not to be a Docker tutorial. If you need one, there are plenty of good resources to get you started, like the Docker User Guide, the series of video tutorials recently published on the Docker blog, or the nice 10-minute tutorial where you can try Docker online. In this post, we will be using Docker 1.6.

We recently encountered a situation where we needed to use Solr 5 on a server already installed with Java 6. But Solr 5 requires at least Java 7. And, for different reasons, upgrading to Java 7 on this server was not an option. The solution? Run Solr 5 in a Docker container using the appropriate Java version! Containers are completely isolated, so this has no impact on the other applications running on the server.

It’s easy to build a Docker image for Solr 5. But, it’s even easier to use an already-existing image! Unfortunately, Docker does not offer an official Solr image (like it does for Elasticsearch). But the community has built multiple good-quality Solr images. We decided to use makuk66/docker-solr, which is actively maintained and has the options we needed. For example, this image has options to use SolrCloud. For this post, we will limit ourselves to using Solr cores.

First, you need to pull the image:

$ docker pull makuk66/docker-solr

Then, you can simply start a container with:

$ docker run -d -p 8983:8983 --name solr5 makuk66/docker-solr

You should be able to connect to Solr on port 8983.

But, as it is, you can’t add a core to this Solr installation. Solr requires the core files (solrconfig.xml, schema.xml, etc.) to already be on the server (in this case, the container), which the makuk66/docker-solr image does not provide. So we have to provide the core configuration files to the Docker container. The easy way to do so is to use Docker volumes, which link a directory on the host server to a directory of the Docker container. For example, let’s assume we create the necessary configuration files for the Solr core at ~/solr5/myindex on our server. This directory should contain a conf sub-directory with all the usual files, like solrconfig.xml and schema.xml. The myindex directory should also have a core.properties file with the content name=myindex.

$ cd ~/solr5/myindex
$ tree .
.
├── conf
│   ├── admin-extra.html
│   ├── admin-extra.menu-bottom.html
│   ├── admin-extra.menu-top.html
│   ├── _rest_managed.json
│   ├── schema.xml
│   └── solrconfig.xml
└── core.properties

1 directory, 7 files

Docker will need write access to the myindex directory (to create the data directory containing the Lucene index, for example). There are multiple ways to accomplish this, but here we simply change the group owner of the myindex directory to be docker and allow group members write access to the folder:

$ chgrp docker ~/solr5/myindex
$ chmod g+w ~/solr5/myindex

Now that the myindex directory is ready, we need to link it so that it is available under the solr.home directory of the Docker container. What is the solr.home directory of the container? It’s easy to get this from Solr: when connecting to the Solr instance on port 8983, you should be redirected to the Solr dashboard. On this page, you should see the list of JVM parameters, one of which is -Dsolr.solr.home.

Docker Solr5 Dashboard
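If you prefer the command line, a quick way to check it is to query Solr’s system info handler (a sketch, assuming the standard admin endpoint):

$ curl -s "http://localhost:8983/solr/admin/info/system?wt=json" | grep solr_home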

We can now remove the previous container:

$ docker rm -f solr5

and start a new one with a volume:

$ docker run -d -p 8983:8983 -v ~/solr5/myindex:/opt/solr/server/solr/myindex --name solr5 makuk66/docker-solr

Notice the -v parameter. It links the ~/solr5/myindex directory of the server to the /opt/solr/server/solr/myindex directory of the container. Every time Solr reads or writes data in /opt/solr/server/solr/myindex, it will actually be accessing our ~/solr5/myindex directory. This is where Solr will create the data directory, which is great because Docker recommends that all files created by the container be held outside of the container. If you access the Solr instance on port 8983, you should now have the myindex core available.
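You can also confirm from the command line that the core was picked up, using the standard CoreAdmin API (a sketch):

$ curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=myindex&wt=json"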

The Docker container was started with basic JVM settings. What if we need to allocate more memory to Solr or other options? Docker allows us to override the default startup command defined in the image. For example, here is how we could start the container with more memory (don’t forget to remove the previous container):

$ docker run -d -p 8983:8983 -v ~/solr5/myindex:/opt/solr/server/solr/myindex --name solr5 makuk66/docker-solr "/bin/bash" "-c" "/opt/solr/bin/solr -m 1g -f"

To confirm that everything is fine with our Solr container, you can consult the logs generated by Solr with:

$ docker logs solr5

Conclusion

There is a lot more to be said about Docker and Solr 5, like how to use a specific Solr version or how to use SolrCloud. Hopefully this blog post was enough to get you started!

Introduction

You already know that Solr is a great search application, but did you know that Solr 5 could be used as a platform to slice and dice your data?  With Pivot Facet working hand in hand with Stats Module, you can now drill down into your dataset and get relevant aggregated statistics like average, min, max, and standard deviation for multi-level Facets.

In this tutorial, I will explain the main concepts behind this new Pivot Facet/Stats Module feature. I will walk you through each concept, such as Pivot Facet, Stats Module, and Local Parameter in query. Once you fully understand those concepts, you will be able to build queries that quickly slice and dice datasets and extract meaningful information.


Facet

If you’re reading this blog post, you’re probably already familiar with the Facet concept in Solr. A facet is a way to count or aggregate how many elements are available for a given category. Facets also allow users to drill down and refine their searches. One common use of facets is for online stores.

Here’s a facet example for books with the word “Solr” in them, taken from Amazon.


To understand how Solr does it, go to the command line and fire up the techproducts example from Solr 5 by executing the following command:

pathToSolr/bin/solr -e techproducts

If you’re curious to know where the source data for the techproducts database are located, look at the files under pathToSolr/example/exampledocs/ (the *.xml files).

Here’s an example of a document that gets added to the techproducts database.

Notice the cat and manu field names. We will be using them when creating facets.

<add><doc>
<field name="id">MA147LL/A</field>
 <field name="name">Apple 60 GB iPod with Video Playback Black</field>
 <field name="manu">Apple Computer Inc.</field>
 <!-- Join -->
 <field name="manu_id_s">apple</field>
 <field name="cat">electronics</field>
 <field name="cat">music</field>
 <field name="features">iTunes, Podcasts, Audiobooks</field>
 <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field>
 <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field>
 <field name="features">Up to 20 hours of battery life</field>
 <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field>
 <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>
 <field name="includes">earbud headphones, USB cable</field>
 <field name="weight">5.5</field>
 <field name="price">399.00</field>
 <field name="popularity">10</field>
 <field name="inStock">true</field>
 <!-- Dodge City store -->
 <field name="store">37.7752,-100.0232</field>
 <field name="manufacturedate_dt">2005-10-12T08:00:00Z</field>
</doc></add>

Open the following link in your favorite browser:

http://localhost:8983/solr/techproducts/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=manu

Notice the 2 parameters:

  • facet=true
  • facet.field=manu

If everything worked as planned, you should get an answer that looks like the one below, showing how many documents exist for each manufacturer.

…
"response":{"numFound":32,"start":0,"docs":[]
 },
 "facet_counts":{
   "facet_queries":{},
   "facet_fields":{
     "manu":[
       "inc",8,
       "apache",2,
       "bank",2,
       "belkin",2,
…

Facet Pivot

Pivots are sometimes also called decision trees. Pivots allow you to quickly summarize and analyze large amounts of data in lists, independent of the original data layout stored in Solr.

One real-world example is the need to show, for each province, its universities and the number of classes offered in each province and university. Before facet pivots, it was not possible to accomplish this without changing the structure of the Solr data.

With Solr, you drive the pivot by using the facet.pivot parameter with a comma-separated list of fields.

The example below shows the count for each category (cat) under each manufacturer (manu).

http://localhost:8983/solr/techproducts/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.pivot=manu,cat

Notice the parameters:

  • facet=true
  • facet.pivot=manu,cat

"facet_pivot":{
     "manu,cat":[{
         "field":"manu",
         "value":"inc",
         "count":8,
         "pivot":[{
             "field":"cat",
             "value":"electronics",
             "count":7},
           {
             "field":"cat",
             "value":"memory",
             "count":3},
           {
             "field":"cat",
             "value":"camera",
             "count":1},
           {
             "field":"cat",
             "value":"copier",
             "count":1},
           {
             "field":"cat",
             "value":"electronics and computer1",
             "count":1},
           {
             "field":"cat",
             "value":"graphics card",
             "count":1},
           {
             "field":"cat",
             "value":"multifunction printer",
             "count":1},
           {
             "field":"cat",
             "value":"music",
             "count":1},
           {
             "field":"cat",
             "value":"printer",
             "count":1},
           {
             "field":"cat",
             "value":"scanner",
             "count":1}]},

Stats Component

The Stats Component has been around for some time (since Solr 1.4). It’s a great tool for returning simple statistics, such as sum, average, and standard deviation, for an indexed numeric field.

Here is an example of how to use the Stats Component on the field price with the techproducts sample database. Notice the parameters:

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&stats=true&stats.field=price&rows=0&indent=true

  • stats=true
  • stats.field=price
...

"response":{"numFound":32,"start":0,"docs":[]
 },
 "stats":{
   "stats_fields":{
     "price":{
       "min":0.0,
       "max":2199.0,
       "count":16,
       "missing":16,
       "sum":5251.270030975342,
       "sumOfSquares":6038619.175900028,
       "mean":328.20437693595886,
       "stddev":536.3536996709846,
       "facets":{}}}}}

...

Mixing Stats Component and Facets

Now that you’re aware of what the stats module can do, wouldn’t it be nice if you could mix and match the Stats Component with Facets? To continue from our previous example, if you wanted to know the average price for an item sold by a given manufacturer, this is what the query would look like:

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&stats=true&stats.field=price&stats.facet=manu&rows=0&indent=true

Notice the parameters:

  • stats=true
  • stats.field=price
  • stats.facet=manu
…
"stats_fields":{
     "price":{
       "min":0.0,
       "max":2199.0,
       "count":16,
       "missing":16,
       "sum":5251.270030975342,
       "sumOfSquares":6038619.175900028,
       "mean":328.20437693595886,
       "stddev":536.3536996709846,
       "facets":{
         "manu":{
           "canon":{
             "min":179.99000549316406,
             "max":329.95001220703125,
             ...
             "stddev":106.03773765415568,
             "facets":{}},

"belkin":{
             "min":11.5,
             "max":19.950000762939453,
             ...
             "stddev":5.975052840505987,
             "facets":{}}

…

The problem with putting the facet inside the Stats Component is that the Stats Component will always return every term from the stats.facet field, without supporting options such as facet.limit and facet.sort. There are also a lot of problems with multivalued facet fields and non-string facet fields.

Solr 5 Brings Stats to Facet

One of Solr 5’s new features is the ability to put a stats.field under a facet pivot. This is a great thing, because you can now leverage the power of the existing facet code, such as ordering and filtering, and simply delegate the math functions, such as min, max, and standard deviation, to the Stats Component.

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&indent=true&rows=0&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}manu

Notice the parameters:

  • facet=true
  • stats=true
  • stats.field={!tag=t1}price
  • facet.pivot={!stats=t1}manu
...

"facet_counts":{
   "facet_queries":{},
   "facet_fields":{},
   "facet_dates":{},
   "facet_ranges":{},
   "facet_intervals":{},
   "facet_pivot":{
     "manu":[{
         "field":"manu",
         "value":"inc",
         "count":8,
         "stats":{
           "stats_fields":{
             "price":{
               "min":74.98999786376953,
               "max":2199.0,
...
               "sumOfSquares":5406265.926629987,
               "mean":549.697146824428,
               "stddev":740.6188014133371,
               "facets":{}}}}},
       {

...

The expressions {!tag=t1} and {!stats=t1} are called “local parameters” in queries. To specify a local parameter, you follow these steps:

  1. Begin with {!
  2. Insert any number of key=value pairs separated by whitespace.
  3. End with } and immediately follow with the query argument.

In the example above, I refer to the stats field instance by referring to the arbitrarily named tag that I created, i.e., t1.

You can also have multiple facet levels by using facet.pivot with comma-separated fields, and the stats will be computed for the child facets.

For example: facet.pivot={!stats=t1}manu,cat

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&indent=true&rows=0&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}manu,cat

...

"facet_pivot":{
     "manu,cat":[{
         "field":"manu",
         "value":"inc",
         "count":8,
         "pivot":[{
             "field":"cat",
             "value":"electronics",
             "count":7,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":74.98999786376953,
                   "max":479.95001220703125,
...
                   "stddev":153.31712383138424,
                   "facets":{}}}}},
           {

...

You can also mix and match overlapping sets, and you will get the computed facet.pivot hierarchies.

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&indent=true&rows=0&facet=true&stats=true&stats.field={!tag=t1,t2}price&facet.pivot={!stats=t1}cat,inStock&facet.pivot={!stats=t2}manu,inStock

Notice the parameters:

  • stats.field={!tag=t1,t2}price
  • facet.pivot={!stats=t1}cat,inStock
  • facet.pivot={!stats=t2}manu,inStock

This section represents a sample of the following sequence: facet.pivot={!stats=t1}cat,inStock

 "facet_pivot":{
     "cat,inStock":[{
         "field":"cat",
         "value":"electronics",
         "count":12,
         "pivot":[{
             "field":"inStock",
             "value":true,
             "count":8,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":74.98999786376953,
                   "max":399.0,
             ...
                   "facets":{}}}}},
           {
             "field":"inStock",
             "value":false,
             "count":4,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":11.5,
                   "max":649.989990234375,
...
                   "facets":{}}}}}],
         "stats":{
           "stats_fields":{
             "price":{
               "min":11.5,
               "max":649.989990234375,
...
               "facets":{}}}}},

The second pivot, facet.pivot={!stats=t2}manu,inStock, produces an analogous facet_pivot section in the same response, this time keyed on manu,inStock, with stats computed for each manufacturer and stock status.

 "facet_pivot":{
     "cat,inStock":[{
         "field":"cat",
         "value":"electronics",
         "count":12,
         "pivot":[{
             "field":"inStock",
             "value":true,
             "count":8,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":74.98999786376953,
                   "max":399.0,
             ...
                   "facets":{}}}}},
           {
             "field":"inStock",
             "value":false,
             "count":4,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":11.5,
                   "max":649.989990234375,
...
                   "facets":{}}}}}],
         "stats":{
           "stats_fields":{
             "price":{
               "min":11.5,
               "max":649.989990234375,
...
               "facets":{}}}}},

How about Solr Cloud?

With Solr 5, it’s now possible to compute field stats for each pivot facet constraint in a distributed environment, such as SolrCloud. A lot of hard work went into solving this very complex problem. Getting the results from each shard and merging them quickly and effectively required a lot of refactoring and optimization. Each level of facet pivots needs to be analyzed and will influence that level’s child facets. There is a refinement process that iteratively selects and rejects items at each facet level as results come in from the different shards.

Does Pivot Faceting Scale Well?

As I mentioned above, pivot faceting can be expensive in a distributed environment. I would be careful to set appropriate facet.limit parameters at each facet pivot level. If you’re not careful, the number of dimensions requested can grow exponentially, and having too many dimensions can and will eat up all the system resources.  The online documentation refers to multimillions of documents spread across multiple shards getting sub-millisecond response times for complex queries.
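For instance, per-field overrides can cap how many values are returned at each pivot level. This is a sketch against the techproducts example; it assumes the standard f.<field>.facet.limit per-field syntax applies to pivot fields:

http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&wt=json&indent=true&facet=true&facet.pivot=manu,cat&f.manu.facet.limit=5&f.cat.facet.limit=3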

Conclusion

This tutorial should have given you a solid foundation to get started on slicing and dicing your data. I have defined the concepts of Pivot Facet, Stats Module, and Local Parameter, and I have shown you query examples using those concepts along with their results. You should now be able to build your own solution. You can also give us a call if you need help; we provide training and consulting services that will get you up and running in no time.

Do you have any experience building analytical systems with Solr? Please share your experience below.

Optical character recognition (OCR), content translation, title generation, and text detection and extraction from more file formats are among the new features now part of your favorite crawlers: Norconex HTTP Collector 2.1.0 and Norconex Filesystem Collector 2.1.0. They are both available now and can be downloaded for free.  They both ship with and use the latest version of the Norconex Importer module, which is in large part responsible for many of these new features.

For more details and usage examples, check this article.

These two Collector releases also include bug fixes and stability improvements.  We recommend that existing users upgrade.

Get your copy

Download Norconex HTTP Collector

Download Norconex Filesystem Collector

This feature release of Norconex Importer brings bug fixes, enhancements, and great new features, such as OCR and translation support.  Keep reading for all the details on some of this release’s most interesting changes. While Java can be used to configure and use the Importer, XML configuration is used here for demonstration purposes.  You can find all Importer configuration options here.

About Norconex Importer

Norconex Importer is an open-source product for extracting and manipulating text and metadata from files of various formats.  It works for stand-alone use or as a Java library.  It’s an essential component of Norconex Collectors for processing crawled documents.  You can make Norconex Importer an essential piece of your ETL pipeline.

OCR support

[ezcol_1half]

Norconex Importer now leverages Apache Tika 1.7’s newly introduced OCR capability. To convert popular image formats (PNG, TIFF, JPEG, etc.) to text, download a copy of Tesseract OCR for your operating system, and reference its install location in your Importer configuration.  When enabled, OCR will also process embedded images (e.g., a PDF containing images of text). The class to configure to enable OCR support is GenericDocumentParserFactory.

[/ezcol_1half]

[ezcol_1half_end]

<documentParserFactory 
    class="com.norconex.importer.parser.GenericDocumentParserFactory" >
  <ocr path="(path to Tesseract OCR software install)">
    <languages>eng,fra</languages>
  </ocr>
</documentParserFactory>

 [/ezcol_1half_end]

Translation support

[ezcol_1half]

With the new TranslatorSplitter class, it’s now possible to hook Norconex Importer up to a translation API.  The Apache Tika API has been extended to provide the ability to translate a mix of document content or specific document fields.  The translation APIs supported out of the box are Microsoft, Google, Lingo24, and Moses.

[/ezcol_1half]

[ezcol_1half_end]

<postParseHandlers>
  <splitter
      class="com.norconex.importer.handler.splitter.impl.TranslatorSplitter"
      api="microsoft">
    <clientId>YOUR_CLIENT_ID</clientId>
    <secretId>YOUR_SECRET_ID</secretId>
  </splitter>
</postParseHandlers>

 [/ezcol_1half_end]

Dynamic title creation

[ezcol_1half]

Too many documents do not have a valid title, when they have a title at all.  What if you need a title to represent each document?  What do you do in such cases?   Do you take the file name as the title? Not so nice.  Do you take the document property called “title”?  Not reliable.  You now have a new option with the TitleGeneratorTagger.  It will try to detect a decent title out of your document.  In cases where it can’t, it offers a few alternate options. You always get something back.

[/ezcol_1half]

[ezcol_1half_end]

<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
          toField="generated_title"
          fallbackMaxLength="250"
          detectHeading="true"
          detectHeadingMinLength="10"
          detectHeadingMaxLength="500" />
</postParseHandlers>

 [/ezcol_1half_end]

Saving of parsing errors

[ezcol_1half]

A new top-level configuration option was introduced so that every file generating a parsing error gets saved in a location of your choice.  These files are saved along with any metadata obtained so far and the Java exception that was thrown. This is a great addition to help troubleshoot parsing failures.

[/ezcol_1half]

[ezcol_1half_end]

<importer>
  <parseErrorsSaveDir>/path/to/store/bad/files</parseErrorsSaveDir>
</importer>

 [/ezcol_1half_end]

Document parsing improvements

The content type detection accuracy and performance were improved with this release.  In addition, document parsing features the following additions and improvements:

  • Better PDF support with addition of PDF XFA (dynamic forms) text extraction, as well as improved space detection (eliminating many space-stripping issues).  Also, PDFs with JBIG2 and jpeg2000 image formats are now parsed properly.
  • New XFDL parser (PureEdge Extensible Forms Description Language).  Supports both Gzipped/Base64 encoded and plain text versions.
  • New, much improved WordPerfect parser now parsing WordPerfect documents according to WordPerfect file specifications.
  • New Quattro Pro parser for parsing Quattro Pro documents according to Quattro Pro file specifications.
  • JBIG2 and jpeg2000 image formats are now recognized.

You want more?

The list of changes and improvements doesn’t stop here.  Read the product release notes for a complete list of changes.

Unfamiliar with this product? No sweat — read this “Getting Started” page.

If not already out when you read this, the next feature release of Norconex HTTP Collector and Norconex Filesystem Collector will both ship with this version of the Importer.  Can’t wait for the release? Manually upgrade the Norconex Importer library to take advantage of these new features in your favorite crawler.

Download Norconex Importer 2.1.0.