analytics

This year’s Lucene/Solr Revolution conference was held in Austin, Texas, on October 15-16, 2015. It gathered around 600 Lucene and Solr enthusiasts from 26 countries, including many of the Solr committers. Pascal Dimassimo and Pascal Essiembre attended the event on behalf of Norconex. While the talks were varied, a few themes kept recurring: search relevance, analytics, and infrastructure scaling. What follows is their account of the sessions they attended. These talks should become available for viewing on YouTube shortly.

Relevancy

At least 10 talks dealt with relevancy alone. They offered ideas on how to improve it, including intent detection, machine learning, fuzzy matching, and more.

Among the standouts, Trey Grainger (co-author of Solr in Action) showed us how he built a knowledge graph on top of Solr to improve CareerBuilder.com results.

Another noteworthy presentation came from Michael Nilsson and Diego Ceccarelli of Bloomberg, who break their documents into features and use a matrix to decide the ranking of each feature. They reminded us there is nothing wrong with making multiple passes against Solr to better serve search requests.
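To make the multiple-pass idea concrete, here is a minimal sketch of a second pass that re-orders candidates returned by a first query. It is not Bloomberg's model: the feature names, the weights, and the `rerank` function are all hypothetical, and a plain weight vector stands in for the matrix mentioned in the talk.

```python
def rerank(candidates, weights, top_k=3):
    """Second pass: re-score first-pass candidates with a weighted sum of features.

    `candidates` is a list of dicts with an "id" and a "features" mapping.
    `weights` maps feature names to hypothetical, hand-picked weights.
    """
    def score(doc):
        # Features without a configured weight contribute nothing.
        return sum(weights.get(name, 0.0) * value
                   for name, value in doc["features"].items())

    return sorted(candidates, key=score, reverse=True)[:top_k]


# Pretend these came back from a first, recall-oriented Solr query.
first_pass = [
    {"id": "a", "features": {"text_score": 2.0, "freshness": 0.1}},
    {"id": "b", "features": {"text_score": 1.5, "freshness": 0.9}},
    {"id": "c", "features": {"text_score": 0.5, "freshness": 0.2}},
]
weights = {"text_score": 1.0, "freshness": 2.0}  # assumed values

reranked_ids = [doc["id"] for doc in rerank(first_pass, weights)]
print(reranked_ids)
```

The first pass stays cheap and broad, while the second pass can afford a more expensive scoring function because it only looks at a handful of documents.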

Analytics

Whether by analyzing search logs or user search behaviors, developers are working hard to build powerful analytics capabilities within Solr. Kiran Chitturi of Lucidworks suggested an easy way to capture user events using the Snowplow JavaScript event tracker. He also highlighted the potential benefits of sending those events to the company's new Lucidworks Fusion product.

*Kiran Chitturi discussing events processing in Lucidworks Fusion*

Yonik Seeley, co-creator of Solr and now Solr Dude at Cloudera, presented the new Solr JSON Facet API. This API (already available in Solr 5.3) has been completely rewritten for Solr 5 and brings first-class analytics support. You can now easily request nested facets, metrics and statistics, much like Aggregations in Elasticsearch. According to the numbers presented, the new facet module performs much better than the original Solr facet module.
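As a rough illustration of what nested facets and metrics look like, the sketch below builds a JSON Facet request body in Python. The field names (`cat`, `price`, `manu_id_s`) and the `build_facet_request` helper are assumptions made for the example. The body would typically be POSTed to a collection's query handler, which is not shown here.

```python
import json


def build_facet_request(category_field, price_field, brand_field, top_n=5):
    """Build a JSON request body with a nested terms facet and two metrics."""
    return {
        "query": "*:*",
        "limit": 0,  # we only want the facet results, not the documents
        "facet": {
            # Top categories, each with its own nested analytics.
            "categories": {
                "type": "terms",
                "field": category_field,
                "limit": top_n,
                "facet": {
                    # Metric computed per category bucket.
                    "avg_price": f"avg({price_field})",
                    # Facet nested inside each category bucket.
                    "top_brands": {
                        "type": "terms",
                        "field": brand_field,
                        "limit": 3,
                    },
                },
            },
            # Metric computed over the whole result set.
            "total_value": f"sum({price_field})",
        },
    }


request_body = build_facet_request("cat", "price", "manu_id_s")
payload = json.dumps(request_body, indent=2)
print(payload)
```

Each bucket of the outer `categories` facet carries its own average price and its own list of top brands, which is the kind of nesting the older facet parameters made awkward.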

Erick Erickson presented the new Solr Streaming Aggregation API (also available in Solr 5.3). Solr has never been very good at returning very large result sets because of deep paging issues and memory requirements. However, this new API builds on the existing export capabilities to let us stream concentrated data out of SolrCloud, which opens up new possibilities such as memory-efficient set operations (union, intersection, complement, join and unique). It also introduces new worker collections on the SolrCloud cluster to handle this processing. The goal is to build a general-purpose, distributed computation framework right on top of Solr. This is still a work in progress, and the next speaker, Joel Bernstein, showed us what to expect next. Leveraging the Streaming Aggregation API and JSON Facet API, Solr 6 should offer us a very powerful feature: SQL queries over Solr!
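Why can set operations on streams be memory-efficient? When both inputs arrive sorted on the same key, each operation only needs to hold one item per stream at a time. The sketch below illustrates that idea with plain Python generators. It is not Solr's implementation, and the function names are ours.

```python
_END = object()  # sentinel marking an exhausted stream


def intersect(a, b, key):
    """Yield items of sorted stream `a` whose key also appears in sorted stream `b`."""
    b_iter = iter(b)
    b_item = next(b_iter, _END)
    for a_item in a:
        # Advance b until it catches up with the current key of a.
        while b_item is not _END and key(b_item) < key(a_item):
            b_item = next(b_iter, _END)
        if b_item is _END:
            return  # nothing left in b, so no further matches are possible
        if key(b_item) == key(a_item):
            yield a_item


def complement(a, b, key):
    """Yield items of sorted stream `a` whose key does NOT appear in sorted stream `b`."""
    b_iter = iter(b)
    b_item = next(b_iter, _END)
    for a_item in a:
        while b_item is not _END and key(b_item) < key(a_item):
            b_item = next(b_iter, _END)
        if b_item is _END or key(b_item) != key(a_item):
            yield a_item


def unique(stream, key):
    """Yield the first item for each key of a sorted stream."""
    previous = _END
    for item in stream:
        current = key(item)
        if previous is _END or current != previous:
            yield item
        previous = current


def by_id(doc):
    return doc["id"]


left = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 4}]
right = [{"id": 2}, {"id": 3}, {"id": 4}]

common_ids = [d["id"] for d in intersect(left, right, by_id)]
only_left_ids = [d["id"] for d in complement(left, right, by_id)]
unique_ids = [d["id"] for d in unique(left, by_id)]
print(common_ids, only_left_ids, unique_ids)
```

Because nothing is buffered beyond the current item of each stream, the same approach works whether the inputs hold four documents or four hundred million, as long as they are sorted consistently.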

For those using Spark, Lucidworks’ Timothy Potter introduced us to the tool they’ve built to use Solr as a Spark SQL DataSource. This allows Solr to be used within an existing Spark analysis pipeline. The tool also supports writing data into Solr from Spark.

Infrastructure Scaling

*Shenghua Wan and Rahul Gupta sharing their experiments*

Shenghua Wan and Rahul Gupta from WalmartLabs described their experiences using different technologies to perform distributed indexing. They experimented with MapReduce, Hadoop and others to distribute and enhance their XML data across several Solr shards, merging those shards in the end.

Riak developer Fred Dushin showed us Yokozuna, their new implementation of Riak Search. Riak is a distributed key/value store, and with Yokozuna, Solr brings search to Riak. But Yokozuna also brings something to Solr: because Riak is distributed by nature, it can be used to distribute Solr instead of SolrCloud.

Mark Miller, Software Engineer at Cloudera, told us that open-source technologies have taken over the search ecosystem, especially Solr and Lucene. In the future, those search engines will be integrated with multiple systems. Cloudera wants to integrate Solr with Hadoop. Miller claims that at the moment, Solr search at scale is still flaky, even with SolrCloud, though he admitted that it is good enough for general usage. According to Miller, Hadoop can help, so his firm created Cloudera Search, which uses Solr and Hadoop together.

Other Topics

These were not the only topics covered at the conference. Others ranged widely in technical depth. Toke Eskildsen, representing the State and University Library in Denmark, gave a low-level and very interesting talk about facet optimization. He walked through the code changes he made to speed up Solr faceting and the impressive benchmark results they achieved.

David Smiley, who has long been involved in all things related to Solr geospatial research, showed us the latest work on spatial 2-D faceting, also known as heat maps. He also took the time to retrace the history of various geospatial functionalities in Solr and Lucene.
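For readers unfamiliar with the idea, a heat map is essentially a two-dimensional facet: the bounding box is divided into a grid and each cell reports how many documents fall inside it. The sketch below shows that counting step in plain Python. It illustrates the concept only and does not reflect how Solr computes it, and the `heatmap_counts` function and sample coordinates are our own.

```python
from collections import Counter


def heatmap_counts(points, min_lon, min_lat, max_lon, max_lat, columns, rows):
    """Count (lon, lat) points per grid cell; row 0 is the southernmost row."""
    cell_width = (max_lon - min_lon) / columns
    cell_height = (max_lat - min_lat) / rows
    counts = Counter()
    for lon, lat in points:
        # Ignore points outside the requested bounding box.
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue
        # Points on the far edge are clamped into the last cell.
        col = min(int((lon - min_lon) / cell_width), columns - 1)
        row = min(int((lat - min_lat) / cell_height), rows - 1)
        counts[(row, col)] += 1
    return [[counts[(r, c)] for c in range(columns)] for r in range(rows)]


# Two points near Austin, one near Gatineau, one outside the box.
points = [(-97.74, 30.27), (-97.70, 30.30), (-75.70, 45.40), (2.35, 48.85)]
grid = heatmap_counts(points, -100.0, 25.0, -70.0, 50.0, columns=3, rows=2)
print(grid)
```

A client can then map each count to a color intensity to draw the heat map, without ever fetching the individual documents.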

We’ve only scratched the surface of the proceedings at Lucene/Solr Revolution 2015. We also thoroughly enjoyed the hospitality of the city of Austin, a community which offered a warm welcome and many wonderful sights. We hope our account encourages others to attend future conferences, and we welcome any questions about our experiences in Austin.

GATINEAU, QC, CANADA — Thursday, September 22, 2014—Norconex is excited to announce the launch of Norconex Content Analytics, enabling organizations to get deep insights on their current information assets.

Norconex believes its Content Analytics product will provide customers with valuable statistical reports on documents from all kinds of enterprise repository sources, ranging from local file systems to remote secure servers, at a fraction of the cost of compiling reports manually or with competing products.

“I can already assess that this affordable enterprise solution will save some of our customers a fortune on their data migration projects,” said David Gaulin, Vice President of Professional Services at Norconex.

Norconex Content Analytics Availability

Norconex Content Analytics is a product driven by customer feedback and is part of Norconex’s commitment to delivering quality commercial products. Norconex Content Analytics is available immediately for purchase. Additional information can be found at /enterprise-search-software/content-analytics/.

About Norconex

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help with the processing and analysis of structured and unstructured data.

For more information on Norconex Content Analytics:

Website: /enterprise-search-software/content-analytics/

Email: info@norconex.com