
Norconex released version 2.7.0 of both its HTTP Collector and Filesystem Collector.  This update, along with related component updates, introduces several interesting features.

HTTP Collector changes

The following items are specific to the HTTP Collector.  For changes applying to both the HTTP Collector and the Filesystem Collector, you can proceed to the “Generic changes” section.

Crawling of JavaScript-driven pages


The alternative document fetcher PhantomJSDocumentFetcher now makes it possible to crawl web pages with JavaScript-generated content. This much-awaited feature is now available thanks to integration with the open-source PhantomJS headless browser. As a bonus, you can also take screenshots of the web pages you crawl.


<documentFetcher 
    class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <exePath>/path/to/phantomjs.exe</exePath>
  <renderWaitTime>5000</renderWaitTime>
  <referencePattern>^.*\.html$</referencePattern> 
</documentFetcher>


More ways to extract links


This release introduces two new link extractors.  You can now use the XMLFeedLinkExtractor to extract links from RSS or Atom feeds. For maximum flexibility, the RegexLinkExtractor can be used to extract links using regular expressions.


<extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
  <linkExtractionPatterns>
    <pattern group="1">\[(http.*?)\]</pattern>
  </linkExtractionPatterns>
</extractor>
<extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor">
  <applyToReferencePattern>.*rss$</applyToReferencePattern>
</extractor>


Generic changes

The following changes apply to both Filesystem and HTTP Collectors. Most of these changes come from an update to the Norconex Importer module (now also at version 2.7.0).

Much improved XML configuration validation


You no longer have to hunt for misconfigurations. Schema-based XML configuration validation was added, and you will now get errors if the XML syntax of any configuration option is invalid. This validation can be triggered on the command line with the new -k (or --checkcfg) flag.


# -k can be used on its own, but when combined with -a (like below),
# it will prevent the collector from executing if there are any errors.

collector-http.sh -a start -c examples/minimum/minimum-config.xml -k

# Error sample:
ERROR (XML) ReplaceTagger: cvc-attribute.3: The value 'asdf' of attribute 'regex' on element 'replace' is not valid with respect to its type, 'boolean'.


Enter durations in human-readable format


Having to convert durations to milliseconds is not the most user-friendly approach. Anywhere a duration is expected in your XML configuration, you can now use a human-readable representation (English only) as an alternative.


<!-- Example using "5 seconds" and "1 second" as opposed to milliseconds -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
    default="5 seconds" ignoreRobotsCrawlDelay="true" scope="site" >
  <schedule dayOfWeek="from Saturday to Sunday">1 second</schedule>
</delay>


Lua scripting language


Support for Lua scripting has been added to ScriptFilter, ScriptTagger, and ScriptTransformer.  This gives you one more scripting option available out-of-the-box besides JavaScript/ECMAScript.


<!-- Add "apple" to a "fruit" metadata field: -->
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger"
    engineName="lua">
  <script><![CDATA[
    metadata:addString('fruit', {'apple'});
  ]]></script>
</tagger>


Modify documents using an external application


With the new ExternalTransformer, you can now use an external application to perform document transformation.  This is an alternative to the existing ExternalParser, which was enhanced to provide the same environment variables and metadata extraction support as the ExternalTransformer.


<transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
  <command>/path/transform/app ${INPUT} ${OUTPUT}</command>
  <metadata>
    <match field="docnumber">DocNo:(\d+)</match>
  </metadata>
</transformer>


Combine document fields


The new MergeTagger can be used to combine multiple fields into one. The target field can be either multi-valued or single-valued, with values separated by the character of your choice.


<tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
  <merge toField="title" deleteFromFields="true" 
      singleValue="true" singleValueSeparator=",">
    <fromFields>title,dc.title,dc:title,doctitle</fromFields>
  </merge>
</tagger>


New Committers


Whether you do not have a target repository (Solr, Elasticsearch, etc.) ready at crawl time, or you are not using a repository at all, Norconex Collectors now ship with two file-based Committers your own processes can easily consume: XMLFileCommitter and JSONFileCommitter. All available Committers can be found here.


<committer class="com.norconex.committer.core.impl.XMLFileCommitter">
 <directory>/path/my-xmls/</directory>
 <pretty>true</pretty>
 <docsPerFile>100</docsPerFile>
 <compress>false</compress>
 <splitAddDelete>false</splitAddDelete>
</committer>


More

Several additional features or changes can be found in the latest Collector releases.  Among them:

  • New Importer RegexReferenceFilter for filtering documents based on matching references (e.g. URL); see the sketch after this list.
  • New SubstringTransformer for truncating content.
  • New UUIDTagger for giving a unique id to each document.
  • CharacterCaseTagger now supports “swap” and “string” to swap character case and capitalize beginning of a string, respectively.
  • ConstantTagger offers options when dealing with existing values: add to existing values, replace them, or do nothing.
  • Components such as Importer, Committers, etc., are all easier to install thanks to new utility scripts.
  • Document Access-Control-List (ACL) information is now extracted from SMB/CIFS file systems (Filesystem Collector).
  • New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops.
  • Added “removeTrailingHash” as a new GenericURLNormalizer option (HTTP Collector).
  • New “detectContentType” and “detectCharset” options on GenericDocumentFetcher for ignoring the content type and character encoding obtained from the HTTP response headers and detecting them instead (HTTP Collector).
  • Start URLs and start paths can now be dynamically created thanks to IStartURLsProvider and IStartPathsProvider (HTTP Collector and Filesystem Collector).
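Here is a hedged sketch of what a RegexReferenceFilter configuration could look like for keeping only HTML pages, as promised in the first item above. The element and attribute names are assumptions modeled on the other handler examples in this post; check the Importer documentation for the exact syntax.

<!-- Hypothetical sketch: keep only references ending in .html -->
<filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
    onMatch="include">
  .*\.html$
</filter>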

To get the complete list of changes, refer to the HTTP Collector release notes, Filesystem Collector release notes, or the release notes of dependent Norconex libraries such as: Importer release notes and Collector Core release notes.

Download


Solr committers present at the event

This year’s conference was held in Austin, Texas on October 15-16, 2015. It gathered around 600 Lucene and Solr enthusiasts from 26 countries, including many of the Solr committers. Pascal Dimassimo and Pascal Essiembre attended the event on behalf of Norconex. While the talks were varied, a few themes recurred, such as search relevance, analytics, and infrastructure scaling. What follows relates the attendees’ impressions of the conference sessions they attended. These talks should become available for viewing on YouTube shortly.

Relevancy

There were at least 10 talks related to the topic of relevancy alone. They offered ideas on how to improve relevancy, including intent detection, using machine learning principles, fuzzy matching, and more.

Of those standing out, Trey Grainger (co-author of Solr in Action) showed us how he created a knowledge graph built on top of Solr to improve CareerBuilder.com results.

Another noteworthy presentation came from Michael Nilsson and Diego Ceccarelli of Bloomberg, who broke their documents into features and used a matrix to decide the ranking of each feature. They reminded us there is nothing wrong with doing multiple passes to Solr to better serve up search requests.

Analytics

Whether it was analyzing search logs or user search behaviors, developers are working hard to build powerful analytics capabilities within Solr. Kiran Chitturi of Lucidworks suggested an easy way to capture user events using the Snowplow JavaScript event tracker. He also highlighted the potential benefits of sending those events to their new LucidWorks Fusion product.

Kiran Chitturi discussing events processing in LucidWorks Fusion

Yonik Seeley, co-creator of Solr and now Solr Dude at Cloudera, presented the new Solr JSON Facet API. This new API (which is actually available in Solr 5.3) has been completely re-written for Solr 5 and allows for first-class analytics support. You can now easily have nested facets, metrics, and statistics. This is similar to Aggregations in Elasticsearch. According to the numbers presented, this new facet module performs much better than the original Solr facet module.

Erick Erickson presented the new Solr Streaming Aggregation API (also available in Solr 5.3). Solr has never been very good at accessing lots of search results because of deep paging issues and memory requirements. However, this new API builds on the existing exporting capabilities to allow us to stream concentrated data out of SolrCloud with new possibilities, like memory-efficient set operations (union, intersection, complement, join and unique). It also introduces new worker collections on the SolrCloud cluster to handle this processing. The goal is to build a general purpose, distributed computation framework right on top of Solr. This is still a work in progress, and the next speaker, Joel Bernstein, showed us what we can expect next. Leveraging the Streaming Aggregation API and JSON Facet API, Solr 6 should offer us a very powerful feature: SQL queries over Solr!

For those using Spark, Lucidworks’ Timothy Potter introduced us to the tool they’ve built to use Solr as a Spark SQL DataSource. This allows Solr to be used with an existing Spark analysis pipeline. This tool also permits the writing of data into Solr from Spark.

Infrastructure Scaling

Shenghua Wan and Rahul Gupta sharing their experiments

Shenghua Wan and Rahul Gupta from WalmartLabs described their experiences using different technologies to perform distributed indexing.  They experimented with MapReduce, Hadoop and others to distribute and enhance their XML data across several Solr shards, merging those shards in the end.

Riak’s developer Fred Dushin showed us Yokozuna, their new implementation of Riak Search. Riak is a distributed key/value store and with Yokozuna, Solr brings search to Riak. But Yokozuna also brings something to Solr. Because of its distributed nature, it makes it possible to use Riak to distribute Solr instead of using SolrCloud.

Mark Miller, Software Engineer at Cloudera, told us that open-source technologies have taken over the search ecosystem, especially Solr and Lucene. In the future, those search engines will get integrated with multiple systems. Cloudera wants to integrate Solr with Hadoop. Miller claims that at the moment, Solr search at scale is still flaky, even with SolrCloud, though he admitted that it is good enough for general usage. According to Miller, Hadoop can help, so his firm created Cloudera Search, which uses Solr and Hadoop together.

Other Topics

The aforementioned topics were not the only ones covered at the conference. There were others of varying technicality. Toke Eskildsen, representing the State and University Library in Denmark, gave a low-level and very interesting talk about facet optimization. He demonstrated the code improvements he made to improve Solr facet performance and achieve impressive benchmark results.

Pascal Essiembre enjoying Austin

David Smiley, who has long been involved in all things related to Solr geospatial research, showed us the latest work on spatial 2-D faceting, also known as heat maps. He also took the time to retrace the history of various geospatial functionalities in Solr and Lucene.

We’ve only scratched the surface of the Lucene/Solr Revolution 2015 proceedings. We also thoroughly enjoyed the hospitality of the city of Austin, a community that offered a warm welcome and many wonderful sights. We hope our experiences stimulate further interest in attending future conferences, and we welcome inquiries about our time in Austin.

Docker is all the rage at the moment! It was recently selected as a Gartner Cool Vendor in DevOps. As you may already know, Docker is a platform to build and deploy applications as self-contained units. Those units, called containers, can be executed consistently on a developer laptop or production server. Since containers include all their dependencies, they are truly portable. And, compared to normal virtual machine images, Docker containers are much more lightweight because they don’t need as much infrastructure as a normal VM. A Docker container is built from an image, which is itself described by a Dockerfile: a simple text file listing the steps needed to assemble and execute the container. But the goal of this blog post is not to be a Docker tutorial. If you need one, there are plenty of good resources to get you started, like the Docker User Guide, the series of video tutorials recently published on their blog, or the nice 10-minute tutorial where you can try Docker online. In this post, we will be using Docker 1.6.

We recently encountered a situation where we needed to use Solr 5 on a server already installed with Java 6. But Solr 5 requires at least Java 7. And, for different reasons, upgrading to Java 7 on this server was not an option. The solution? Run Solr 5 in a Docker container using the appropriate Java version! Containers are completely isolated, so this has no impact on the other applications running on the server.

It’s easy to build a Docker image for Solr 5. But, it’s even easier to use an already-existing image! Unfortunately, Docker does not offer an official Solr image (like it does for Elasticsearch). But the community has built multiple good-quality Solr images. We decided to use makuk66/docker-solr, which is actively maintained and has the options we needed. For example, this image has options to use SolrCloud. For this post, we will limit ourselves to using Solr cores.

First, you need to pull the image:

$ docker pull makuk66/docker-solr

Then, you can simply start a container with:

$ docker run -d -p 8983:8983 --name solr5 makuk66/docker-solr

You should be able to connect to Solr on port 8983.
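A quick way to verify from the host (assuming curl is installed; any successful HTTP response means Jetty is listening):

$ curl -sI http://localhost:8983/solr/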

But, as it is, you can’t add a core to this Solr installation. Solr requires the core files (solrconfig.xml, schema.xml, etc.) to already be on the server (in this case, the container), which the makuk66/docker-solr image does not provide. So we have to supply the core configuration files to the Docker container. The easy way to do so is to use Docker volumes, which link a directory on the host server to a directory of the Docker container. For example, let’s assume we create the necessary configuration files for the Solr core at ~/solr5/myindex on our server. This directory should contain a sub-directory conf with all the usual files, like solrconfig.xml and schema.xml. The myindex directory should also have a core.properties file with the content name=myindex.

$ cd ~/solr5/myindex
$ tree .
.
├── conf
│   ├── admin-extra.html
│   ├── admin-extra.menu-bottom.html
│   ├── admin-extra.menu-top.html
│   ├── _rest_managed.json
│   ├── schema.xml
│   └── solrconfig.xml
└── core.properties

1 directory, 7 files
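For reference, the core.properties file mentioned above only needs to declare the core name:

# ~/solr5/myindex/core.properties
name=myindex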

Docker will need write access to the myindex directory (to create the data directory containing the Lucene index, for example). There are multiple ways to accomplish this, but here we simply change the group owner of the myindex directory to be docker and allow group members write access to the folder:

$ chgrp docker ~/solr5/myindex
$ chmod g+w ~/solr5/myindex

Now that the myindex directory is ready, we need to link it so that it is available under the solr.home directory of the Docker container. What is the solr.home directory of the container? It’s easy to get this from Solr. When connecting to the Solr instance on port 8983, you should be redirected to the Solr dashboard. On this page, you should see the list of JVM parameters, and one of them is -Dsolr.solr.home.

Docker Solr5 Dashboard

We can now remove the previous container:

$ docker rm -f solr5

and start a new one with a volume:

$ docker run -d -p 8983:8983 -v ~/solr5/myindex:/opt/solr/server/solr/myindex --name solr5 makuk66/docker-solr

Notice the -v parameter. It links the ~/solr5/myindex directory of the server to the /opt/solr/server/solr/myindex directory of the container. Every time Solr reads or writes data to the /opt/solr/server/solr/myindex directory, it will actually be accessing our ~/solr5/myindex directory. This is where Solr will create the data directory, which is great because Docker recommends that all files created by the container be held outside of the container. If you access the Solr instance on port 8983, you should now have the myindex core available.

The Docker container was started with basic JVM settings. What if we need to allocate more memory to Solr or set other options? Docker allows us to override the default startup command defined in the image. For example, here is how we could start the container with more memory (don’t forget to remove the previous container first):

$ docker run -d -p 8983:8983 -v ~/solr5/myindex:/opt/solr/server/solr/myindex --name solr5 makuk66/docker-solr "/bin/bash" "-c" "/opt/solr/bin/solr -m 1g -f"

To confirm that everything is fine with our Solr container, you can consult the logs generated by Solr with:

$ docker logs solr5

Conclusion

There is a lot more to be said about Docker and Solr 5, like how to use a specific Solr version or how to use SolrCloud. Hopefully this blog post was enough to get you started!

Introduction

You already know that Solr is a great search application, but did you know that Solr 5 could be used as a platform to slice and dice your data?  With Pivot Facet working hand in hand with Stats Module, you can now drill down into your dataset and get relevant aggregated statistics like average, min, max, and standard deviation for multi-level Facets.

In this tutorial, I will explain the main concepts behind this new Pivot Facet/Stats Module feature. I will walk you through each concept, such as Pivot Facet, Stats Module, and Local Parameter in query. Once you fully understand those concepts, you will be able to build queries that quickly slice and dice datasets and extract meaningful information.

Applications to Download

Facet

If you’re reading this blog post, you’re probably already familiar with the Facet concept in Solr. A facet is a way to count or aggregate how many elements are available for a given category. Facets also allow users to drill down and refine their searches. One common use of facets is for online stores.

Here’s a facet example for books with the word “Solr” in them, taken from Amazon.


To understand how Solr does it, go to the command line and fire up the techproducts example from Solr 5 by executing the following command:

pathToSolr/bin/solr -e techproducts

If you’re curious about where the source data for the techproducts database is located, look at the *.xml files in the folder pathToSolr/example/exampledocs/.

Here’s an example of a document that’s added to the techproducts database.

Notice the cat and manu field names. We will be using them in the creation of facets.

<add><doc>
<field name="id">MA147LL/A</field>
 <field name="name">Apple 60 GB iPod with Video Playback Black</field>
 <field name="manu">Apple Computer Inc.</field>
 <!-- Join -->
 <field name="manu_id_s">apple</field>
 <field name="cat">electronics</field>
 <field name="cat">music</field>
 <field name="features">iTunes, Podcasts, Audiobooks</field>
 <field name="features">Stores up to 15,000 songs, 25,000 photos, or 150 hours of video</field>
 <field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field>
 <field name="features">Up to 20 hours of battery life</field>
 <field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field>
 <field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>
 <field name="includes">earbud headphones, USB cable</field>
 <field name="weight">5.5</field>
 <field name="price">399.00</field>
 <field name="popularity">10</field>
 <field name="inStock">true</field>
 <!-- Dodge City store -->
 <field name="store">37.7752,-100.0232</field>
 <field name="manufacturedate_dt">2005-10-12T08:00:00Z</field>
</doc></add>

Open the following link in your favorite browser:

http://localhost:8983/solr/techproducts/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=manu

Notice the 2 parameters:

  • facet=true
  • facet.field=manu

If everything worked as planned, you should get an answer that looks like the one below. You should see the results show how many elements are included for each manufacturer.

…
"response":{"numFound":32,"start":0,"docs":[]
 },
 "facet_counts":{
   "facet_queries":{},
   "facet_fields":{
     "manu":[
       "inc",8,
       "apache",2,
       "bank",2,
       "belkin",2,
…

Facet Pivot

Pivots are sometimes also called decision trees. Pivots allow you to quickly summarize and analyze large amounts of data in lists, independent of the original data layout stored in Solr.

One real-world example is the need to show universities per province, along with the number of classes offered in each province and at each university. Until facet pivots, it was not possible to accomplish this without changing the structure of the Solr data.

With Solr, you drive the pivot by using the facet.pivot parameter with a comma-separated field list.

The example below shows the count for each category (cat) under each manufacturer (manu).

http://localhost:8983/solr/techproducts/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.pivot=manu,cat

Notice the fields:

  • facet=true
  • facet.pivot=manu,cat
"facet_pivot":{
     "manu,cat":[{
         "field":"manu",
         "value":"inc",
         "count":8,
         "pivot":[{
             "field":"cat",
             "value":"electronics",
             "count":7},
           {
             "field":"cat",
             "value":"memory",
             "count":3},
           {
             "field":"cat",
             "value":"camera",
             "count":1},
           {
             "field":"cat",
             "value":"copier",
             "count":1},
           {
             "field":"cat",
             "value":"electronics and computer1",
             "count":1},
           {
             "field":"cat",
             "value":"graphics card",
             "count":1},
           {
             "field":"cat",
             "value":"multifunction printer",
             "count":1},
           {
             "field":"cat",
             "value":"music",
             "count":1},
           {
             "field":"cat",
             "value":"printer",
             "count":1},
           {
             "field":"cat",
             "value":"scanner",
             "count":1}]},

Stats Component

The Stats Component has been around for some time (since Solr 1.4). It’s a great tool to return simple math functions, such as sum, average, standard deviation, and so on for an indexed numeric field.

Here is an example of how to use the Stats Component on the field price with the techproducts sample database. Notice the parameters:

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&stats=true&stats.field=price&rows=0&indent=true

  • stats=true
  • stats.field=price
...

"response":{"numFound":32,"start":0,"docs":[]
 },
 "stats":{
   "stats_fields":{
     "price":{
       "min":0.0,
       "max":2199.0,
       "count":16,
       "missing":16,
       "sum":5251.270030975342,
       "sumOfSquares":6038619.175900028,
       "mean":328.20437693595886,
       "stddev":536.3536996709846,
       "facets":{}}}}}

...

Mixing Stats Component and Facets

Now that you’re aware of what the stats module can do, wouldn’t it be nice if you could mix and match the Stats Component with Facets? To continue from our previous example, if you wanted to know the average price for an item sold by a given manufacturer, this is what the query would look like:

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&stats=true&stats.field=price&stats.facet=manu&rows=0&indent=true

Notice the parameters:

  • stats=true
  • stats.field=price
  • stats.facet=manu
…
"stats_fields":{
     "price":{
       "min":0.0,
       "max":2199.0,
       "count":16,
       "missing":16,
       "sum":5251.270030975342,
       "sumOfSquares":6038619.175900028,
       "mean":328.20437693595886,
       "stddev":536.3536996709846,
       "facets":{
         "manu":{
           "canon":{
             "min":179.99000549316406,
             "max":329.95001220703125,
             ...
             "stddev":106.03773765415568,
             "facets":{}},

"belkin":{
             "min":11.5,
             "max":19.950000762939453,
             ...
             "stddev":5.975052840505987,
             "facets":{}}

…

The problem with putting the facet inside the Stats Component is that the Stats Component will always return every term from the stats.facet field, without supporting simple options such as facet.limit and facet.sort. There are also a lot of problems with multivalued or non-string facet fields.

Solr 5 Brings Stats to Facet

One of Solr 5’s new features is to bring the stats.field under a Facet Pivot. This is a great thing because you can now leverage the power of the existing facet code, such as ordering and filtering, and simply delegate the math functions, such as min, max, and standard deviation, to the Stats Component.

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&indent=true&rows=0&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}manu

Notice the parameters:

  • facet=true
  • stats=true
  • stats.field={!tag=t1}price
  • facet.pivot={!stats=t1}manu
...

"facet_counts":{
   "facet_queries":{},
   "facet_fields":{},
   "facet_dates":{},
   "facet_ranges":{},
   "facet_intervals":{},
   "facet_pivot":{
     "manu":[{
         "field":"manu",
         "value":"inc",
         "count":8,
         "stats":{
           "stats_fields":{
             "price":{
               "min":74.98999786376953,
               "max":2199.0,
...
               "sumOfSquares":5406265.926629987,
               "mean":549.697146824428,
               "stddev":740.6188014133371,
               "facets":{}}}}},
       {

...

The expressions {!tag=t1} and {!stats=t1} are known as “Local Parameters in Queries”. To specify a local parameter, you need to follow these steps (a short example follows the list):

  1. Begin with {!
  2. Insert any number of key=value pairs separated by whitespace.
  3. End with } and immediately follow with the query argument.
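Applied to the query above, those steps produce the local parameters that tie the stats field to the pivot through the tag t1:

stats.field={!tag=t1}price
facet.pivot={!stats=t1}manu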

In the example above, I refer to the stats field instance through the arbitrarily named tag that I created, i.e., t1.

You can also have multiple facet levels by using facet.pivot with comma-separated fields, and the stats will be computed for the child facets.

For example: facet.pivot={!stats=t1}manu,cat

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&indent=true&rows=0&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}manu,cat

...

"facet_pivot":{
     "manu,cat":[{
         "field":"manu",
         "value":"inc",
         "count":8,
         "pivot":[{
             "field":"cat",
             "value":"electronics",
             "count":7,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":74.98999786376953,
                   "max":479.95001220703125,
...
                   "stddev":153.31712383138424,
                   "facets":{}}}}},
           {

...

You can also mix and match overlapping sets, and you will get the computed facet.pivot hierarchies.

http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&indent=true&rows=0&facet=true&stats=true&stats.field={!tag=t1,t2}price&facet.pivot={!stats=t1}cat,inStock&facet.pivot={!stats=t2}manu,inStock

Notice the parameters:

  • stats.field={!tag=t1,t2}price
  • facet.pivot={!stats=t1}cat,inStock
  • facet.pivot={!stats=t2}manu,inStock

This section represents a sample of the following sequence: facet.pivot={!stats=t1}cat,inStock

 "facet_pivot":{
     "cat,inStock":[{
         "field":"cat",
         "value":"electronics",
         "count":12,
         "pivot":[{
             "field":"inStock",
             "value":true,
             "count":8,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":74.98999786376953,
                   "max":399.0,
             ...
                   "facets":{}}}}},
           {
             "field":"inStock",
             "value":false,
             "count":4,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":11.5,
                   "max":649.989990234375,
...
                   "facets":{}}}}}],
         "stats":{
           "stats_fields":{
             "price":{
               "min":11.5,
               "max":649.989990234375,
...
               "facets":{}}}}},

The same query also returns a corresponding facet_pivot section for the second sequence, facet.pivot={!stats=t2}manu,inStock. Its structure mirrors the cat,inStock sample above, so it is not repeated here.

 "facet_pivot":{
     "cat,inStock":[{
         "field":"cat",
         "value":"electronics",
         "count":12,
         "pivot":[{
             "field":"inStock",
             "value":true,
             "count":8,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":74.98999786376953,
                   "max":399.0,
             ...
                   "facets":{}}}}},
           {
             "field":"inStock",
             "value":false,
             "count":4,
             "stats":{
               "stats_fields":{
                 "price":{
                   "min":11.5,
                   "max":649.989990234375,
...
                   "facets":{}}}}}],
         "stats":{
           "stats_fields":{
             "price":{
               "min":11.5,
               "max":649.989990234375,
...
               "facets":{}}}}},

How about Solr Cloud?

With Solr 5, it’s now possible to compute field stats for each pivot facet constraint in a distributed environment, such as Solr Cloud. A lot of hard work went into solving this very complex problem. Getting the results from each shard and merging them quickly and effectively required a lot of refactoring and optimization. Each level of facet pivots needs to be analyzed and will influence that level’s child facets. There is a refinement process that iteratively selects and rejects items at each facet level as results come in from all the different shards.

Does Pivot Faceting Scale Well?

As I mentioned above, Pivot Faceting can be expensive in a distributed environment. I would be careful and set appropriate facet.limit parameters at each facet pivot level. If you’re not careful, the number of dimensions requested can grow exponentially, and having too many dimensions can and will eat up all the system resources. The online documentation refers to multi-millions of documents spread across multiple shards returning sub-millisecond response times for complex queries.
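For instance, per-field facet limits (a standard Solr parameter; the values below are only illustrative) can keep the number of constraints in check at each pivot level:

facet.pivot={!stats=t1}manu,cat&f.manu.facet.limit=10&f.cat.facet.limit=5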

Conclusion

This tutorial should have given you a solid foundation to get you started on slicing and dicing your data. I have defined the concepts Pivot Facet, Stats Module, and Local Parameter. I also have shown you query examples using those concepts and their results. You should now be able to go out on your own and build your own solution. You can also give us a call if you need help. We provide training and consulting services that will get you up and running in no time.

Do you have any experience building analytical systems with Solr? Please share your experience below.

In this tutorial, I will show you how to run Solr as a Microsoft Windows service. Up to version 5.0.0, it was possible to run Solr inside the Java web application container of your choice. However, since the release of version 5.0.0, the Solr team at Apache no longer releases the solr.war file. This file was necessary to run Solr from a different web application container such as Tomcat. Starting with version 5.0.0, Solr will be distributed only as a self-contained web application, using an embedded version of Jetty as a container.

Unfortunately, Jetty does not have a nice utility like Tomcat’s to register itself as a service on Microsoft Windows. I had to research and experiment to come up with a clean and easily-reproduced solution. I tried to follow the Jetty website instructions and adapt them to make Jetty work with Solr, but I was not able to stop the service cleanly. When I would request a “stop” from the Windows Service Manager, the service was flip-flopping between “starting” and “stopping” statuses. Then I discovered a simple tool, NSSM, that did exactly what I wanted. I will be using the NSSM tool in this tutorial.

Applications to Download

File System Setup

Taking Solr 5.0.0 as an example, first extract Solr and NSSM to the following paths on your file system (adapt the paths as necessary).

C:\Program Files\solr-5.0.0
C:\Program Files\nssm

Setting up Solr as a service

On the command line, type the following:

"c:\Program Files\nssm\win64\nssm" install solr5

Fill out the path to the solr.cmd script, and the startup directory should be filled in automatically. Don’t forget to input the -f (foreground) parameter so that NSSM can kill it when it needs to be stopped or restarted.

Application tab on NSSM Service Editor screen capture to show path to Solr start script
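If you prefer scripting the setup instead of clicking through the GUI, NSSM accepts the same settings from its command line. This is only a hedged sketch: the Solr path, port, and startup parameters below are assumptions to adapt to your installation.

:: Hypothetical equivalent of the GUI steps above
"c:\Program Files\nssm\win64\nssm" set solr5 Application "C:\Program Files\solr-5.0.0\bin\solr.cmd"
"c:\Program Files\nssm\win64\nssm" set solr5 AppParameters "start -f -p 8983"
"c:\Program Files\nssm\win64\nssm" set solr5 AppDirectory "C:\Program Files\solr-5.0.0"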

The following step is optional, but I prefer having a clean and descriptive name in my Windows Service Manager. Under the details tab, fill out the Display name and Description.

Details tab for NSSM service installer for setting up Solr 5 as a service on Microsoft Windows

Click on Install service.

NSSM confirmation box saying "Solr5" installed successfully

Check that the service is running.

Microsoft Windows Component Services Running Solr 5

Go to your favorite web browser and make sure Solr is up and running.

Solr 5 running as a service on Microsoft Windows

Conclusion

I spent a few hours finding this simple solution, and I hope this tutorial will help you set up Solr as a Microsoft Windows service in no time. I invite you to view the solr.cmd file content to find the parameters that will help you customize your Solr setup. For instance, while looking inside this file, I realized that I needed to add the -f parameter to run Solr in the foreground. That was key to getting it running the way I needed it.

If you successfully used a different approach to register Solr 5 as a service, please share it in the comments section below.

I am very excited about the new Solr 5. I had the opportunity to download and install the latest release, and I have to say that I am impressed with the work that has been done to make Solr easy and fun to use right out of the box.

When I first looked at the bin folder, I noticed that the ./bin/solr script from Solr 4.10.x was still there, but when I checked the help for that command, I saw that there are new parameters. In Solr 4.10, we only had the following parameters: start, stop, restart, and healthcheck. Now in Solr 5.0, we have additional options that make life a little easier: status, create, create_core, create_collection, and delete.

The create_core and create_collection parameters are self-explanatory. What is interesting is that the create parameter is smart enough to detect the mode in which Solr is running, i.e., “Solr Cloud” or “Solr Core” mode, and create the proper core or collection.
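For example, a single create call works in either mode (the core/collection name below is made up):

# Creates a core in standalone mode, or a collection in SolrCloud mode
./bin/solr create -c myindex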

The status parameter returns a JSON formatted answer that looks like the following. It could be used by a tool like Nagios or JEF Monitor to do some remote monitoring.

Found 1 Solr nodes:
Solr process 6922 running on port 8983
{
"solr_home":"/Applications/solr-5.0.0/server/solr/",
"version":"5.0.0 1659987 - anshumgupta - 2015-02-15 12:26:10",
"startTime":"2015-02-27T17:19:22.455Z",
"uptime":"0 days, 0 hours, 2 minutes, 18 seconds",
"memory":"53.1 MB (%10.8) of 490.7 MB"}

Solr Core demo

Since version 4.10, the ./bin/solr start command has a parameter that lets you test Solr with a few interesting examples: -e <example>. To run Solr Core with sample data in 4.10, you would run the following command: ./bin/solr start -e default. That would give you an example of what could be done with a Solr search engine. In version 5.0, the default option has been replaced by ./bin/solr start -e techproducts, which illustrates many of the Solr Core capabilities.
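For quick reference, here are the two example commands side by side, as described above:

# Solr 4.10
./bin/solr start -e default

# Solr 5.0
./bin/solr start -e techproducts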

Solr Cloud demo

Configuring a Solr Cloud cluster used to be a very complicated process. Several moving pieces needed to be put together perfectly to configure a working Solr Cloud server. Solr 5.0 still has the ./bin/solr start -e cloud option present in 4.10. This option lets you create a Solr Cloud instance by answering a few questions driven by a wizard. You can see an example of the type of questions asked below.

Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]
Ok, let's start up 2 Solr nodes using for your example SolrCloud cluster.
...
Now let's create a new collection for indexing documents in your 2-node cluster.
Please provide a name for your new collection: [gettingstarted]
gettingstarted
How many shards would you like to split gettingstarted into? [2]
2
How many replicas per shard would you like to create? [2]
2
...

SolrCloud example running, please visit http://localhost:8983/solr


Finally, a script to install Solr as a service

Solr now has a script named install_solr_service.sh that installs Solr as a service on Linux and Unix machines. When I tested Solr 5, I ran the code from a Mac OS X box, so the script did not work for me. I received an error message telling me my Linux distribution was not supported and that I needed to set up Solr as a service manually using the documentation provided in the Solr Reference Guide. Even though the install script did not work for me on a Mac, this tool is a great addition for system administrators who like to configure their machines using automated tools like Puppet.

We use Tomcat at work, so where did my WAR go?

As of Solr 5.0, the only supported container is the Jetty one that ships by default with the download file. It is possible to repackage the exploded files into a war, but you will end up with an unsupported installation of Solr. I cannot recommend that route.

Adding documents has never been easier

In Solr 5.0, adding documents has never been easier. We now have access to a new tool named ./bin/post. This tool can take almost any input document imaginable and post it to Solr. It has support for JSON, XML, CSV, and rich text documents like Microsoft Office documents. The post tool can also act as a crawler to extract information out of a website. During my test, I was not able to get the content off of a web page. The information extracted was metadata like the title, authors, and keywords. Maybe there is a way to obtain this content, but I was not able to find a parameter or a config file that would let me do so. I think the post utility is a very good tool to get started, but for my day-to-day work, I will stick with our good old open-source crawler and Solr Committer that we use here at Norconex.

Here is a quick list of the parameters one can use from the post command:

* JSON file: ./post -c wizbang events.json
* XML files: ./post -c records article*.xml
* CSV file: ./post -c signals LATEST-signals.csv
* Directory of files: ./post -c myfiles ~/Documents
* Web crawl: ./post -c gettingstarted http://lucidworks.com -recursive 1 -delay 1
* Standard input (stdin): echo '{commit: {}}' | ./post -c my_collection -type application/json -out yes -d
* Data as string: ./post -c signals -type text/csv -out yes -d $'id,value\n1,0.47'

Solr 5.0 supports even more document types thanks to Tika 1.7

Solr 5 now comes with Tika 1.7. This means that Solr now has support for OCR via Tesseract (which you will need to install separately). With Tika 1.7, Solr also has better support for PST and MATLAB files. Date and spatial unit handling have also been improved in this new release.

More Exciting new features

Solr 5.0 now lets you slice and dice your data the way you want it. What this means is stats and facets are now working together. For example, you can automatically get the min, max, and average price for a book. You can find more about this new feature here.

The folks at Apache also improved the schema API to let us add fields programmatically. A core reload will be done automatically if you use the API. Check out the details on how to use that feature.
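For illustration, adding a field through the Schema API looks roughly like the following; the field name and type are made-up examples, and the core name assumes the techproducts example mentioned earlier.

curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/techproducts/schema -d '{
  "add-field": { "name": "publisher", "type": "string", "stored": true }
}'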

We can also manage the request handler via the API.

What are the main “gotchas” to look for when upgrading to Solr 5.0?

Solr 5 does not support reading Solr/Lucene 3.x and earlier indexes. You have to make sure that you run the Lucene IndexUpgrader tool included with the Solr 4.10 release. Another way to go about it would be to fully optimise your index with a Solr 4.10 installation.
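If you go the index-upgrade route, the invocation looks roughly like the following; the jar version and index path are assumptions to adapt to your setup.

java -cp lucene-core-4.10.4.jar org.apache.lucene.index.IndexUpgrader -delete-prior-commits /path/to/index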

Solr 5 does not support the pre-Solr 4.3 solr.xml format and moves entirely to core discovery. If you need more information about moving to the latest solr.xml file format, I suggest this article: moving to the new solr.xml.

Solr 5 only supports creating and removing SolrCloud collections through the Collections API. You might still be able to manage collections the former way, but there is no guarantee that it will work in future releases, and the documentation strongly advises against it.

Conclusion

It looks like most of the work done in this release was geared toward ease of use. The inclusion of tools to easily add data to the index with a very versatile script was encouraging. I also liked the idea of moving to a Jetty-only model and approaching Solr as a self-contained piece of software. One significant advantage of going this route is that it will make providing support easier for the Solr team, who will also be able to optimise the code for a specific container.

Norconex just released major upgrades to all its Norconex Collectors and related projects.  That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available committers (Solr, Elasticsearch, HP IDOL, etc), were all upgraded to version 2.0.0.

With these major product upgrades comes a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website. At a quick glance, you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

  • Can now split a document into multiple documents.
  • Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).
  • Language detection (50+ languages).
  • Parsing and formatting of dates from/to any format.
  • Character case modifiers.
  • Can now index basic content statistics with each document (word count, average word length, average words per sentence, etc.).
  • Can now supply a “seed file” for listing start URLs or start paths to your crawler.
  • Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used.  This reduces I/O and improves performance.
  • New event model where listeners can listen for any type of crawler events.
  • Can now ignore parsing of specific content types.
  • Can filter documents based on arbitrary regular expressions performed on the document content.
  • Enhanced debugging options, where you can print out specific field content as they are being processed.
  • HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).
  • More…

Another significant change is all Norconex open-source projects are now licensed under The Apache License 2.0.   We hope this will facilitate adoption with third party commercial offerings.

It is important to note that version 2.0.0 releases are not compatible with their previous 1.x versions. The configuration options changed in many areas, so do not expect to run your existing configuration under 2.0.0. Please refer to the latest documentation for new and modified configuration options.

Visit the new Norconex Collectors website now.

Norconex Committer and all its current concrete implementations (Solr, Elasticsearch, IDOL) have been upgraded and have seen a redesign of their websites. Committers are libraries responsible for posting data to various repositories (typically search engines). They are used in other products or projects, such as Norconex HTTP Collector.

Autocomplete (also known as live suggestions or search suggestions) is very popular with search applications. It is generally used to return either query suggestions (à la Google Autocomplete) or to propose existing search results (à la Facebook).

Open source search platforms like Solr and Elasticsearch support this feature.