Somewhere between the White House and the Trump International Hotel, between the anti-Trump and anti-pipeline protests, there was another peaceful gathering in Washington, D.C. last week… KM World 2016!

This was the 20th anniversary of the event. Norconex attended the Enterprise Search & Discovery stream, and it was obvious that 20 years of experience have matured the event, with quality information sessions and strong vendor participation.

On the topic of Search, several sessions mentioned that users want their Search to “work like Google”. With Google employing tens of thousands of people dedicated to Search, and the average company dedicating less than one person to the same, it is no wonder that end users are sometimes left using a product that doesn’t fully meet their expectations.

[Image: The White House]

[Image: Trump International Hotel]
In many cases, users are abandoning their Search application altogether and manually looking for the content they need. This can cost a company in reduced productivity and, in the case of online retailers, lost revenue. But there’s hope! With advancing technologies and dedicated vendors and service providers to work with, any company, no matter its size, can deploy a solution that meets its needs.

Some of the key areas of discussion I’d like to touch on in this article are Open Source, Machine Learning, the Cloud, User Interface, and Analytics.

OPEN SOURCE

Open Source continues to expand and is more and more widely accepted as a viable option for organizations of every size. Reasons include saving on licensing fees, but also gaining more flexibility in how your Search is developed. In some cases, open source Search is being built alongside other products that include Search functionality (like SharePoint) to enhance the Search experience beyond the standard offering.

MACHINE LEARNING

Machine Learning has also come a long way, and a few vendors were on hand to show off their products. I was impressed with one product demonstration where the Search results were displayed in an easily viewable chart format rather than a list. However, it was said at the event that statistics show only 60-70% accuracy for these tools, and that they need very high query volumes to reach the higher end of that range. This means only Search applications handling thousands or millions of queries are getting the full advantage of Artificial Intelligence today. If 60-70% relevancy is not enough, you will likely need some good old-fashioned human intervention to get the results to meet your expectations.

Also, if your organization is indexing all content, you may want to rethink this strategy and look at your content to determine what actually requires indexing. It was said that 60% of business data is not really business data at all, but things like invitations to golf tournaments, pictures from the annual holiday party, duplicate documents, or general user content such as personal emails that likely do not need to be included in your Search. A Content Analytics tool can help you narrow down what content needs to be indexed, improving the relevancy of Search results.

THE CLOUD

Another hot topic was moving your data and Search application to the Cloud. The fear with moving to the Cloud had always been whether your data would be secure. Much like open source, organizations of every size are now embracing a move to the Cloud. Many smaller companies with limited IT resources are realizing that the big Cloud providers have security teams in place, which can make their content more secure than if they hosted it on premises.

The newer challenge around the Cloud is for multinational organizations with data in countries where data privacy laws are in place, such as Europe’s Safe Harbour framework and, more recently, Russia’s data protection laws. This legislation can regulate privacy, where data can be stored, and how (or whether) that data can travel outside the country. Multinationals need a strategy to comply with these laws, potentially piecing together various Cloud providers with data centres in the countries in question, or using a hybrid of Cloud and on-premises infrastructure.

USER INTERFACE AND SEARCH ANALYTICS

Once you’ve built out your Search infrastructure, what your end users see is the User Interface and the results displayed for their queries. Rather than having a dedicated “Search Page”, more and more companies are integrating the Search UI into their core user applications so users don’t have to “search for the Search”.

If you are going to include a user feedback option, the best participation was recorded when the feedback was placed near the Search UI, but you will often still get limited responses. This is where Search analytics comes into play… combining user feedback (if available) with information on your Search users’ behaviours to keep a pulse on how Search is performing and whether your users are finding the content they were looking for. A good Search Analytics product can help you organize your Search data in a dashboard view and provide an overall health check, giving you quick insights into where your Search is working and where it needs intervention to keep running at an optimal level.

Regardless of whether you implement Search in-house or hire a team of experts, with all of the advancements in Search technology, you can put together the right pieces to provide a great Search tool for your employees and customers.

Looking for Information

There are many business applications where web crawling can be of benefit. You or your team likely have ongoing research projects or smaller projects that come up from time to time. You may do a lot of manual web searching (think Google) looking for random information, but what if you need to do targeted reviews to pull specific data from numerous websites? A manual web search can be time consuming and prone to human error, and some important information could be overlooked. An application powered by a custom crawler can be an invaluable tool to save the manpower required to extract relevant content. This can allow you more time to actually review and analyze the data, putting it to work for your business.

A web crawler can be set up to locate and gather complete or partial content from public websites, and the information can be provided to you in an easily manageable format. The data can be stored in a search engine or database, integrated with an in-house system or tailored to any other target. There are multiple ways to access the data you gathered. It can be as simple as receiving a scheduled e-mail message with a .csv file or setting up search pages or a web app. You can also add functionality to sort the content, such as pulling data from a specific timeframe, by certain keywords or whatever you need.
If you have developers in house and want to build your own solution, you don’t even have to start from scratch. There are many tools available to get you started, such as our free crawler, Norconex HTTP Collector.
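
To give you an idea of what “not starting from scratch” looks like, here is a minimal configuration sketch for Norconex HTTP Collector. It is illustrative only and assumes the HTTP Collector 2.x XML configuration format; the collector and crawler IDs, the start URL, and the output directory are placeholders you would replace with your own values.

<httpcollector id="Example Collector">
  <crawlers>
    <crawler id="Example Crawler">
      <!-- Replace with the site(s) you want to gather content from. -->
      <startURLs>
        <url>https://example.com/</url>
      </startURLs>
      <!-- Keep the crawl shallow while testing. -->
      <maxDepth>2</maxDepth>
      <!-- Write extracted content to local files; swap this for a search
           engine or database committer once you know where the data
           should ultimately go. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawled-output</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>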

If you hire a company to build your web crawler, you will want to use a reputable company that will respect all website terms of use. The solution can be set up and then “handed over” to your organization for you to run on an ongoing basis. For a hosted solution, the crawler and any associated applications will be set up and managed for you. This means any changes to your needs like adding/removing what sites to monitor or changing the parameters of what information you want to extract can be managed and supported as needed with minimal effort by your team.

Here are some examples of how businesses might use web crawling:

MONITORING THE NEWS AND SOCIAL MEDIA

What is being said about your organization in the media? Do you review industry forums? Are there comments posted on external sites by your customers that you might not even be aware of to which your team should be responding? A web crawler can monitor news sites, social media sites (Facebook, LinkedIn, Twitter, etc.), industry forums and others to get information on what is being said about you and your competitors. This kind of information could be invaluable to your marketing team to keep a pulse on your company image through sentiment analysis. This can help you know more about your customers’ perceptions and how you are comparing against your competition.

COMPETITIVE INFORMATION

Are people on your sales, marketing or product management teams tasked with going online to find out what new products or services are being provided by your competitors? Are you searching the competition to review pricing to make sure you are priced competitively in your space? What about comparing how your competitors are promoting their products to customers? A web crawler can be set up to grab that information, and then it can be provided to you so you can concentrate on analyzing that data rather than finding it. If you’re not currently monitoring your competition in this way, maybe you should be.

LEAD GENERATION

Does your business rely on information from other websites to help you generate a portion of your revenues? If you had better, faster access to that information, what additional revenues might that influence? An example is companies that specialize in staffing and job placement. When they know which companies are hiring, it provides them with an opportunity to reach out to those companies and help them fill those positions. They may wish to crawl the websites of key or target accounts, public job sites, job groups on LinkedIn and Facebook or forums on sites like Quora or Freelance to find all new job postings or details about companies looking for help with various business requirements. Capturing all those leads and returning them in a useable format can help generate more business.

TARGET LISTS

A crawler can be set up to do entity extraction from websites. Say, for example, an automobile association needs to reach out to all car dealerships and manufacturers to promote services or industry events. A crawler can be set up to crawl target websites that provide relevant company listings to pull things like addresses, contact names and phone numbers (if available), and that content can be provided in a single, usable repository.

POSTING ALERTS

Do you have partners whose websites you need to monitor for information in order to grow your business? Think of the real estate or rental agent who is constantly scouring the MLS (Multiple Listing Service) and other realtor listing sites to find that perfect home or commercial property for a client they are serving. A web crawler can be set up to extract and send all new listings matching their requirements from multiple sites directly to their inbox as soon as they are posted to give them a leg up on their competition.

SUPPLIER PRICING AND AVAILABILITY

If you are purchasing product from various suppliers, you are likely going back and forth between their sites to compare offerings, pricing and availability. Being able to compare this information without going from website to website could save your business a lot of time and ensure you don’t miss out on the best deals!

These are just some of the many examples of how web crawling can be of benefit. The number of business cases where web crawlers can be applied is endless. What are yours?
HTTP Collector 2.6

Norconex has released version 2.6.0 of its HTTP Collector web crawler! Among the new features, an upgrade of its Importer module brings new document parsing and manipulation capabilities. Some of the changes highlighted here also benefit the Norconex Filesystem Collector.

New URL normalization to remove trailing slashes

The GenericURLNormalizer has a new pre-defined normalization rule: “removeTrailingSlash”. When used, it removes the forward slash (/) found at the end of URLs so that such URLs are treated the same as those without a trailing slash. As an example:

  • https://norconex.com/ will become https://norconex.com
  • https://norconex.com/blah/ will become https://norconex.com/blah

It can be used with the 20 other normalization rules offered, and you can still provide your own.

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, removeTrailingSlash
  </normalizations>
</urlNormalizer>

Prevent sitemap detection attempts

By default, StandardSitemapResolverFactory is enabled and tries to detect whether a sitemap file exists at the “/sitemap.xml” or “/sitemap_index.xml” URL path. For websites without sitemap files at these locations, this creates unnecessary HTTP request failures. It is now possible to specify an empty “path” so that such discovery does not take place. In that case, the crawler will rely on sitemap URLs explicitly provided as “start URLs” or on sitemaps defined in “robots.txt” files.

<sitemapResolverFactory>
  <path/>
</sitemapResolverFactory>

Count occurrences of matching text

Thanks to the new CountMatchesTagger, it is now possible to count the number of times any piece of text or regular expression occurs in a document’s content or in one of its fields. A sample use case is to use the obtained count as a relevancy factor in search engines. For instance, one may use this new feature to find out how many segments are found in a document URL, giving less importance to documents with many segments.

<tagger class="com.norconex.importer.handler.tagger.impl.CountMatchesTagger"> 
  <countMatches 
      fromField="document.reference"
      toField="urlSegmentCount" 
      regex="true">
    /[^/]+
  </countMatches>
</tagger>

Multiple date formats

DateFormatTagger now accepts multiple source formats when attempting to convert dates from one format to another. This is particularly useful when the date formats found in documents or web pages are not consistent. Some products, such as Apache Solr, usually expect dates to be of a specific format only.

<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
    fromField="Last-Modified"
    toField="solr_date"
    toFormat="yyyy-MM-dd'T'HH:mm:ss.SSS'Z'">
  <fromFormat>EEE, dd MMM yyyy HH:mm:ss zzz</fromFormat>
  <fromFormat>EPOCH</fromFormat>
</tagger>

DOM enhancements

DOM-related features just got better. First, the DOMTagger, which allows one to extract values from an XML/HTML document using a DOM-like structure, now supports an optional “fromField” to read the markup content from a field instead of the document content. It also supports a new “defaultValue” attribute to store a value of your choice when there are no matches for your DOM selector. In addition, both DOMContentFilter and DOMTagger now support many more selector extraction options: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="div.contact" toField="htmlContacts" extract="html" />
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"
    fromField="htmlContacts">
  <dom selector="div.firstName" toField="firstNames" 
       extract="ownText" defaultValue="NO_FIRST_NAME" />
  <dom selector="div.lastName"  toField="lastNames" 
       extract="ownText" defaultValue="NO_LAST_NAME" />
</tagger>

More control of embedded documents parsing

GenericDocumentParserFactory now allows you to control which embedded documents you do not want extracted from their containing document (e.g., do not extract embedded images). It also allows you to control which container documents should not have their embedded documents extracted at all (e.g., do not extract documents embedded in MS Office documents). Finally, it lets you specify, via regular expression, which content types should have their embedded documents “split” into separate files, as if they were standalone documents (e.g., documents contained in a zip file).

<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <embedded>
    <splitContentTypes>application/zip</splitContentTypes>
    <noExtractEmbeddedContentTypes>image/.*</noExtractEmbeddedContentTypes>
    <noExtractContainerContentTypes>
      application/(msword|vnd\.ms-.*|vnd\.openxmlformats-officedocument\..*)
    </noExtractContainerContentTypes>
  </embedded>
</documentParserFactory>

Document parsers now XML configurable

GenericDocumentParserFactory now makes it possible to override, via regular XML configuration, one or more of the parsers the Importer module uses by default. For any content type, you can specify your own custom parser, including an external parser.

<documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
  <parsers>
    <parser contentType="text/html" 
        class="com.example.MyCustomHTMLParser" />
    <parser contentType="application/pdf" 
        class="com.norconex.importer.parser.impl.ExternalParser">
      <command>java -jar c:\Apps\pdfbox-app-2.0.2.jar ExtractText ${INPUT} ${OUTPUT}</command>
    </parser>
  </parsers>
</documentParserFactory>

More languages detected

LanguageTagger now uses Tika language detection, which supports at least 70 languages.

<tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger">
  <languages>en, fr</languages>
</tagger>

What else?

Other changes and stability improvements were made to this release. A few examples:

  • New “checkcfg” launch action that helps detect configuration issues before an actual launch.
  • Can now specify “notFoundStatusCodes” on GenericMetadataFetcher (see the sketch after this list).
  • GenericLinkExtractor no longer extracts URL from HTML/XML comments by default.
  • URL referrer data is now always preserved by default.
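
Below is a quick, hypothetical illustration of the “notFoundStatusCodes” option. It is a sketch only: it assumes the comma-separated status-code format used by similar crawler options, and the 404/410 values are example choices.

<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
  <!-- HTTP status codes to treat as "document not found" (example values). -->
  <notFoundStatusCodes>404,410</notFoundStatusCodes>
</metadataFetcher>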

To get the complete list of changes, refer to the HTTP Collector release notes, or to the release notes of dependent Norconex libraries, such as the Importer release notes and the Collector Core release notes.

Since the first FIFA Women’s World Cup in 1991, interest in playing and watching women’s soccer has only increased. Around the world, more girls than ever before are playing the beautiful game that not only provides obvious health benefits but also helps boost girls’ confidence and self-esteem at the time in their lives when they need it most.

Norconex is proud to renew its sponsorship of women’s soccer teams in the Association de Soccer de Hull (Gatineau, Quebec, Canada) for the 2016 season. In addition to renewing its support for five local teams with players between 10 and 16 years of age, Norconex now sponsors two competitive women’s teams (U12 and U15).

At the upcoming women’s soccer tournament in this year’s Summer Olympics, girls will be able to cheer for their soccer idols once again, and Norconex will be cheering along with them.

HTTP Collector 2.5

Norconex has released Norconex HTTP Collector version 2.5.0! This new version of our open source web crawler gives you more control over re-crawl frequencies and download delays, and it allows you to specify a locale for date parsing/formatting. The following highlights these key changes and additions:

Minimum re-crawl frequency

Not all web pages and documents are updated with the same frequency. In addition, not all types of content need their updates captured right away. Re-crawling every page every time to find out whether it has changed can be time consuming (and sometimes taxing) on larger sites. For instance, you may want to re-crawl news pages more regularly than other types of pages on a given site. Luckily, some websites provide sitemaps, which give crawlers pointers to their document update frequencies.

This release introduces “recrawlable resolvers” to help control the frequency of document re-crawls. You can now specify a minimum re-crawl delay based on whether a document matches a given content type or reference pattern. The default implementation is GenericRecrawlableResolver, which supports sitemap “lastmod” and “changefreq” in addition to custom re-crawl frequencies.

<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="last" >
  <minFrequency applyTo="contentType" value="monthly">application/pdf</minFrequency>
  <minFrequency applyTo="reference" value="1800000">.*latest-news.*\.html</minFrequency>
</recrawlableResolver>

Download delays based on document URL

ReferenceDelayResolver is a new “delay resolver” that controls the delay between each document download. It allows you to define different delays for different URL patterns. This can be useful for more fragile websites negatively impacted by the fast download of several big documents (e.g., PDFs). In such cases, introducing a delay between certain types of downloads can help keep the crawled website’s performance intact.

<delay class="com.norconex.collector.http.delay.impl.ReferenceDelayResolver"
    default="2000"
    ignoreRobotsCrawlDelay="true"
    scope="crawler" >
  <pattern delay="10000">.*\.pdf$</pattern>
</delay>

Specify a locale in date parsing/formatting

Thanks to the Norconex Importer 2.5.2 dependency update, it is now possible to specify a locale when parsing/formatting dates with CurrentDateTagger and DateFormatTagger.

<tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
    fromField="date"
    fromFormat="EEE, dd MMM yyyy HH:mm:ss 'GMT'"
    fromLocale="fr"
    toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'"
    keepBadDates="false"
    overwrite="true" />

Useful links

  • Download Norconex HTTP Collector
  • Get started with Norconex HTTP Collector
  • Report your issues and questions on Github
  • Norconex HTTP Collector Release Notes

 

Norconex just released an Amazon CloudSearch Committer module for its open-source crawlers (Norconex “Collectors”). This is an especially useful contribution for CloudSearch users, given that CloudSearch does not have its own crawlers.

If you’re not yet familiar with Norconex Collectors, head over to the Norconex Collectors website to see what you’ve been missing.

Assuming you’re already familiar with Norconex Collectors, you can enable CloudSearch as your crawler’s target search engine by following these steps:

  1. Download the CloudSearch Committer.
  2. Extract the zip, and copy the content of the “lib” folder to the “lib” folder of your existing Collector installation.
  3. Add this minimum required configuration snippet to your Collector configuration file:
    <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
      <serviceEndpoint>(CloudSearch service endpoint)</serviceEndpoint>
      <accessKey>
         (Optional CloudSearch access key. Will be taken from environment when blank.)
      </accessKey>
      <secretKey>
         (Optional CloudSearch secret key. Will be taken from environment when blank.)
      </secretKey>
    </committer>
  4. The document endpoint represents the CloudSearch domain you’ll want to use to store your crawled documents. It can be obtained from your CloudSearch domain’s main page.

[Image: CloudSearch main page]

As for the AWS access and secret keys, they can also be stored outside the configuration file using one of the methods described here.
The complete list of configuration options is available here.

Google Search Appliance is Being Phased Out… Now What?

Google Search Appliance (GSA) was introduced in 2002, and since then, thousands of organizations have acquired Google “search in a box” to meet their search needs. Earlier this year, Google announced they are discontinuing sales of this appliance past 2016 and will not provide support beyond 2018. If you are currently using GSA for your search needs, what does this mean for your organization?

Google suggests migrating from GSA to their Google Cloud Platform. Specifically, their BigQuery service offers a fully scalable, fully managed data warehouse with search capabilities and analytics to provide meaningful insights. This may be a great option, but what if your organization or government agency needs to keep significant portions of its infrastructure in-house, behind firewalls? In that case, this new Google offering may be ill-suited as a replacement for GSA.

There are other important elements you will want to consider before making your decision, such as protecting sensitive data, investment stability, customizability, feature set, ongoing costs, and more.

Let’s look at some of the options together.

1. COMMERCIAL APPLIANCES

Examples: SearchBlox, Thunderstone, Mindbreeze

Pros

Commercial appliances can be fast to deploy if you have little requirement for customization. As such, they may need little or no professional services involvement.

To Watch

Because appliance products aim to be stand-alone, black-box solutions, they may be less customizable to meet specific needs and may not integrate easily with many other technologies. Because the hardware is set for you, if your requirements change over time, you may end up with a product that no longer meets your needs. You may also be tied to the vendor for ongoing support, and as with GSA, there is no guarantee the vendor won’t discontinue the product, leaving you to start over again to find your next solution.

2. CLOUD-BASED SOLUTIONS

Examples: Google Cloud (BigQuery), Amazon CloudSearch, etc.

Pros

A cloud-based solution can be both cost-effective and fast to deploy, and will require little to no internal IT support depending on your needs. Because the solution is based in the cloud, most of the infrastructure and associated costs will be covered by the provider as part of the solution pricing.

To Watch

Cloud solutions may not work for organizations with sensitive data. While cloud-based solutions try to provide easy-to-use and flexible APIs, there might be customizations that can’t be performed or that must be done by the provider. Your organization may not own any ongoing development. Also, it may be difficult or costly to leave a cloud provider if you rely heavily on them for warehousing large portions of your data.

3. COMMERCIAL SOFTWARE SOLUTIONS

Examples: Coveo, OpenText Search, HP IDOL, Lexmark Perceptive Platform, IBM Watson Explorer, Sinequa ES, Attivio

Pros

Commercial solutions work great behind firewalls, and you can maintain control of your data within your own environment. Commercial products often make configuration assumptions that can save deployment time when minimal customization is required. Commercial vendors try to differentiate themselves by offering “specializations”, along with rich feature sets and administrative tools out of the box. If most of your requirements fit within their main offerings, you may have less need for customization, potentially leading to professional services savings.

To Watch

Because there are so many commercial products out there, your organization may need to complete lengthy studies, potentially with the assistance of a consultant, to compare product offerings and feature sets to see which will work with your platform(s) and find the best fit. Customization may be difficult or costly, and some products may not scale equally well to match your organization’s changing and growing needs. Finally, there is always a risk that commercial products get discontinued, acquired, or otherwise vanish from the market, forcing you to migrate your environment to another solution once more. We have seen this with Verity K2, FAST, Fulcrum Search, and several others.

4. CUSTOM OPEN SOURCE SOLUTIONS

Examples: Apache Solr, Elasticsearch

Pros

Going open source is often the most flexible solution you can implement. Having full access to a product’s source code makes the customization potential almost unlimited. There are no acquisition or ongoing licensing costs, so the overall cost to deploy can be much less than for commercial products, and you can focus your spending on creating a tailored solution rather than on a pre-built commercial product. You will have the flexibility to change and add to your search solution as your needs change. It is also worth pointing out that the risk of the product being discontinued is almost zero, given the widespread adoption of open source for Search. Being open source, add-on component options are plentiful and grow every day thanks to an active online community – and many of these options are also free!

To Watch

Depending on the number and complexity of your search requirements, the expertise required may be greater, and an open source solution may take longer to deploy. You often need good developers to implement an open source solution; you will need key in-house resources or be prepared to hire external experts to assist with implementation. If using an expert shop, you will want to pre-define your requirements to ensure the project stays within budget. It is worth noting that, unlike some commercial products, open source products usually keep a stronger focus on the search engine itself. This means they often lack many of the accompanying components and features that ship with commercial products (like crawlers for many data sources, built-in analytics reporting, industry-specific ontologies, etc.). Luckily, open source solutions often integrate easily with commercial or open source components that can fill these gaps.

I hope this brief overview helps you begin your assessment of how to replace your Google Search Appliance, or how to implement other Search solutions.