Norconex is proud to announce the 2.9.0 release of its HTTP and Filesystem crawlers. Keep reading for a few release highlights.

CMIS support

Norconex Filesystem Collector now supports Content Management Interoperability Services (CMIS). CMIS is an open standard for accessing content management systems (CMS) content. Extra information can be extracted, such as document ACL (Access Control List) for document-level security. It is now easier than ever to crawl your favorite CMS. CMIS is supported by Alfresco, Interwoven, Magnolia, SharePoint server, OpenCMS, OpenText Documentum, and more.

<startPaths>
    <path>cmis-atom:https://norconex.com/mycms/cmisatom!/my/starting/path</path>
</startPaths>

Additional ACL support

ACL from your CMS is not the only new type of ACL you can extract.  This new Norconex Filesystem Collector release introduces support for obtaining local filesystem ACL.  These new ACL types are in addition to the already existing support for CIFS/SMB ACL extraction (since 2.7.0).

Field discovery

You can’t always tell upfront what metadata your crawler will find.  One way to discover your fields is to send them all to your Committer.  This approach is not always possible nor desirable.  You can now store to a local file all fields found by the crawler. Each field will be saved once, with sample values to give you a better idea of their nature.

<tagger class="com.norconex.importer.handler.tagger.impl.FieldReportTagger" 
    maxSamples="2" file="/path/to/report/myfields.csv" />

New URL normalization rules

The HTTP Collector adds a few new rules GenericURLNormalizer. Those are:

  • removeQueryString
  • lowerCase
  • lowerCasePath
  • lowerCaseQuery
  • lowerCaseQueryParameterNames
  • lowerCaseQueryParameterValues

Subdomains being part of a domain

When you configure your HTTP crawler to stay on the current site (stayOnDomain="true"), you can now tell it to consider sub-domains as being the same site (includeSubdomains="true").

Other changes

For a complete list of all additions and changes, refer to the following release notes:

Download

 

This was my first year joining the open-road Elastic{ON} Tour 2019 event in Toronto on September 18, 2019. My experience at this event was fully charged with excitement from meeting with Elastic architects, operations folks, security pros, and developers alike.

The event was hosted at The Carlu in downtown Toronto. In the morning, the opening keynote was presented by Nick Drost, Senior Director of Elastic, on search solutions such as app search, site search, and enterprise search, security using SIEM, and more. One of the most exciting keynote updates was about using Elastic Cloud on Kubernetes to help simplify processes of deployment, security, scaling, upgrades, snapshots, and high availability.

The next presenter, Michael Basnight, Software Engineer at Elastic, provided an Elastic Stack roadmap with demos of the latest and upcoming features. Kibana has added new capabilities to become much more than just the main user interface of Elastic Stack, with infrastructure and logs user interface. He introduced Fleet, which provides centralized config deployment, Beats monitoring, and upgrade management. Frozen indices allows for more index storage by having indices available and not taking up HEAP memory space until the indices are requested. Also, he provided highlights on Advanced Machine Learning analytics for outlier detection, supervised model training for regression and classification, and ingest prediction processor. Elasticsearch performance has increased by employing Weak AND (also called “WAND”), providing improvements as high as 3,700% to term search and improving other query types between 28% and 292%.

Another added feature to Elasticsearch stack is advanced scoring to help boost document query, using rank_features and distance_features. The new Geo UI uses map layers.

One of the most interesting new Beats to watch for is Functionbeat, which is a serverless data shipper that can subscribe to AWS SQS event topics and CloudWatch Logs, provisions the AWS Lambda function to ship data to Elasticsearch or Elastic Cloud Enterprise.

Elastic lightweight data shippers, Beats such as Filebeat for log files, Metricbeat for metrics, Packetbeat for network data, Winlogbeat for Windows event logs, Auditbeat for audit data, Heartbeat for uptime monitoring, and the latest Functionbeat for serverless shipper can be complemented with Norconex open-source products such as Norconex HTTP Collector or Norconex Filesystem Collector to crawl meta-data from the web or filesystem, then used with the open-source Norconex Elasticsearch Committer to push data to the Elasticsearch index, directly to Elastic Cloud Enterprise or the on-prem Elasticsearch Stack. Norconex can help with collecting meta-data from enterprise web architecture or enterprise filesystems for quick searching and to get relevant results.

Packed into the morning session, Jason Rhodes, Senior Software Engineer at Elastic, presented on unified observability, combining logs, metrics, and traces.

The afternoon session, Search for All with Elastic Enterprise Search and a Site Search demo and feature walkthrough, was presented by Diane Tetrault, Director of Product Marketing at Elastic. The latest UI gives the user the ability to configure content sources they search for and connect to their own data sources. Elastic Common Schema, introduced as an open-source specification, defines a common set of document fields for data ingested into Elasticsearch (https://www.elastic.co/blog/introducing-the-elastic-common-schema).

The Security with Elastic Stack session was presented by Neil Desai, Security Specialist at Elastic. He discussed the latest security capabilities to enable analysis automation to defend from cyber threats.

The Kibana and geo update features in Canvas and Elastic Maps were presented by Raya Fratkina, Kibana Team Lead at Elastic. Learning about ways to use these functionalities makes data more actionable.

I also learned tips at Elastic Architecture at Scale, a presentation by Artem Pogossian, Solutions Architect at Elastic. He discussed scaling from local laptops to multi-clusters and cross-clusters using case deployments.

A useful new feature in machine learning and analytics was introduced by Rich Collier, Solutions Architect and ML Specialist at Elastic. He demonstrated a use case using data frames, also called transforms, a feature that allows transformation of an existing index to a secondary, summarized index. Rich showed in a demo a possible use case from a digital retailer, using time series modeling to look for anomalies and forecasting in the shopper’s purchases, integrating Canvas UI designed in Kibana to build real-time data models. It was amazing to see the ability in demo to detect possible fraudulent purchases without having to be a data science expert.

Finally, after all these informational sessions, thanks to the Elastic event organizers for adding a closing happy hour, where I grabbed a drink with fellow attendees and Elastic folks. This was a great way to close a very extensive learning session. I look forward to being at the next year’s Elastic{ON} tour.

Event pass
Elastic{ON} Tour 2019 in Toronto event pass.
Elastic Team
On the right, Osman Ishaq at Elastic at the Ask Me Anything Booth
Raya Fratikina, Team Lead, Kibana at Elastic
Happy hour closing
Closing happy hour, drink with Elastic folks and other attendees.

Kafka users rejoice! You can now use Norconex open-source crawlers with Apache Kafka, thanks to the Norconex Apache Kafka Committer.

We owe this contribution to Joseph Paulo Mantuano (Senior Developer at The Red Flag Group) and Dan Davis.

The Norconex Collectors community keeps growing. We are thrilled to see the number of integrations grow with it as well.  If you know of any Norconex Committer implementation out there, let us know and we’ll add them to the list!

Not yet familiar with Norconex crawlers?  Head over to Norconex HTTP Collector or Norconex Filesystem Collector websites to learn more.

Great news! There is now a Google Cloud Search Committer for Norconex Crawlers!

This addition to Norconex Collector family should delight Google Cloud Search fans.  They too can now enjoy the full-featured crawling capabilities offered by Norconex Open-Source crawlers.

Since this Committer is developed and maintained by Google, you will find installation and configuration documentation on the Google Developers website.

New to Norconex crawlers? Head over to the Norconex Collectors website to start crawling.

Happy crawling!

Amazon Web Services (AWS) and the Canadian Public Sector organized another excellent Public Sector Summit on May 15, 2019. AWS hosted the first such summit in Ottawa last year, but this year’s event attracted a much larger crowd. Thousands of attendees filled Shaw Centre’s entire third floor.

In the keynote sessions, it was great to hear Alex Benay (deputy minister at the Treasury Board of Canada) talk about the government’s modern digital initiative. He discussed the approach, successes, and challenges of the government’s Cloud migration journey. Another excellent speaker was Mohamed Frendi (director of IT, innovation, science, and economic development for the government of Canada). He covered Canada’s API Store and how it uses the Cloud to make government data more accessible.

The afternoon session was led by Darin Briskman, an AWS developer evangelist. He talked about Amazon’s self-service analytics tool, called AWS Lake Formation, which combines data from multiple sources to resolve data-driven challenges in a timely manner. Machine learning and AI help in making informed decisions and solving problems. This service is a great fit for Norconex’s open-source crawler products HTTP Collector and Filesystem Collector, which fetch data from unstructured data sources to make it easy to consume. Collected content and metadata are natively stored in various existing repositories (or formats), including AWS-specific ones like Amazon Elasticsearch Service, Amazon Open Distro Elasticsearch, and Amazon CloudSearch, as well as many others, such as relational databases, Apache Solr, Google Cloud Search, Neo4J, Microsoft Azure Search, Lucidworks, IDOL, and more.

 

The diagrams below provide further explanation. The one showing the crawling spider is particularly exciting, because Norconex crawlers have much potential to help in this area.  See available Norconex Committers.

     

 

AWS Public Sector Summit Event Pass

Selfies with Darin Briskman, Developer Evangelist, AWS and Stevan Beara, Solutions Architect Manager, AWS.

   

 

Norconex crawlers and Neo4j graph database are now a love match! Neo4j is arguably the most popular graph database out there. Use Norconex crawlers to harvest relationships from websites and filesystems and feed them to your favorite graph engine.

This was made possible thanks to no other than France contributor Sylvain Roussy, a Neo4j reference, and author of 2 Neo4j books. Norconex is proud to have been able to partner with Sylvain to develop a Neo4j Committer for use with its Norconex HTTP and Filesystem Collectors.

To our French-speaking European friends, Sylvain will host a series of Neo4j Meetups at different locations. He will explain how Norconex crawlers can be used to gather graph data from the web to use in Neo4j. The first of the series is taking place on January 24th, in Genève:

Useful Links: