Covid-19 has affected almost every country around the globe and has left everyone looking for the latest information. Below are just some of those who are searching for data:

  • Government agencies trying to manage information for the public
  • Healthcare organizations trying to keep abreast of the latest research
  • Businesses looking for the latest updates on government subsidies and how to properly plan and prepare to reopen
  • Parents following information on school closures and how to keep their families safe
  • Individuals staying home and trying to navigate through the constant updates and search for products that have become harder to source during the outbreak

In these scenarios and so many more, searchers need access to the most current and relevant information.

Norconex has assisted with a couple of projects related to the coronavirus outbreak, and we wanted to share the details of one of them.

Covid-19 Content Monitor

Right before Covid-19 emerged, Norconex had built a search testbed for Canadian federal government departments. The testbed application demonstrated the many features of a modern search engine and how they can be applied to search initiatives across the Government of Canada. As part of this initiative, we implemented search for Health Canada using data related to health and safety recalls.

When Covid-19 hit, it became more important than ever for Health Canada to ensure that the government disseminates accurate and up-to-date information to the Canadian population. Each department has the ongoing responsibility to properly inform its audience, efficiently share new directives and detail how the virus impacts department services. This raised some questions. How do you validate the quality of information shared with the public across various departments? How do you ensure a consistent message?

Norconex was happy to help when asked for a quick solution to these issues.

Building upon the pre-existing testbed, Norconex developed a search solution that crawls relevant content from specific data sources. Health Canada employees can search through all of it using various faceting options and get results back in a fast, simple-to-use interface. The solution monitors content in both of Canada’s official languages across all departments. Among its time-saving features, the search tool offers the following:

  • Automated classification of content
  • Continuous detection of new and updated content
  • Easy filtering of content
  • Detection of “alerts” found in pages so alerts can be verified more frequently to ensure continued relevance
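
The automated classification feature above could, in its simplest form, work like a keyword lookup. The following sketch is purely illustrative; the categories and keywords are invented for this example and are not the tool’s actual taxonomy:

```python
# Minimal keyword-based content classifier (illustrative categories only).
CATEGORY_KEYWORDS = {
    "travel": ["border", "quarantine", "flight"],
    "health": ["symptom", "vaccine", "testing"],
    "business": ["subsidy", "reopening", "wage"],
}

def classify(text):
    """Return every category whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(word in lowered for word in keywords)
    )

print(classify("New wage subsidy announced for businesses reopening in June"))
# → ['business']
```

A production classifier would typically be trained on labelled examples rather than a hand-built keyword list, but the idea of mapping page text to categories is the same.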

This search and monitoring tool is currently hosted for free on the Norconex cloud and accessed daily by the team at Health Canada, saving precious time as they gather the information needed to help keep Canadians safe.


Somewhere between the White House and the Trump International Hotel, between the anti-Trump and anti-pipeline protests, there was another peaceful gathering in Washington, D.C. last week… KM World 2016!

This was the 20th anniversary of the event. Norconex attended the Enterprise Search & Discovery stream, and it was obvious that the event has matured over its 20 years, offering quality information sessions and strong vendor participation.

On the topic of Search, it was mentioned in several sessions that users want their Search to “work like Google”. With Google employing tens of thousands of Search-dedicated employees and the average company dedicating less than one full-time person to the same, it is no wonder that end users are sometimes left with a product that doesn’t fully meet their expectations.


The White House


Trump International Hotel
In many cases, users are abandoning their Search application altogether to look manually for the content they need. This can cost a company in reduced productivity and, in the case of online retailers, lost revenue. But there’s hope! With advancing technologies and dedicated vendors and service providers to work with, any company, no matter its size, can deploy a solution that meets its needs.

Some of the key areas of discussion I’d like to touch on in this article are Open Source, Machine Learning, the Cloud, User Interface, and Analytics.


Open Source continues to expand and is increasingly accepted as a viable option for organizations of every size. Adopting it can save on licensing fees, but it also provides more flexibility in how your Search is developed. In some cases, open source Search is deployed alongside other products that include Search functionality (like SharePoint) to enhance the Search experience beyond the standard offering.


Machine Learning has also come a long way, and a few vendors were on hand to show off their products. I was impressed with one product demonstration where the Search results were displayed in an easily viewable chart format rather than a list. However, it was said at the event that statistics show only 60-70% accuracy for these tools, and that they need very high query volumes to reach the higher end of that range. This means only Search applications with thousands or millions of queries are getting full advantage of Artificial Intelligence today. If 60-70% relevancy is not enough for you, you will likely need some good old-fashioned human intervention to get the results to meet your expectations.

Also, if your organization is indexing all content, you may want to rethink this strategy and determine what actually requires indexing. It was said that 60% of business data is not business data at all, but things like invitations to golf tournaments, pictures from the annual holiday party, duplicate documents or general user content such as personal emails that likely do not need to be included in your Search. A Content Analytics tool can help you narrow down what content needs to be indexed, improving the relevancy of Search results.


Another hot topic was moving your data and Search application to the Cloud. The fear with moving to the Cloud had always been whether your data would be secure. Much like open source, organizations of every size are now embracing a move to the Cloud. Many smaller companies with limited IT resources are realizing that the big Cloud providers have security teams in place that can make their content more secure than hosting it on premises would.

The newer challenge around the Cloud is for multinational organizations with data in countries where data privacy laws are in place, such as Europe’s Safe Harbour and, more recently, Russia’s data protection laws. This legislation can regulate privacy, where data can be stored, and how and whether that data can travel outside the country. Multinationals need a strategy to work within these laws, potentially piecing together various Cloud providers with data centres in the countries in question, or running a hybrid of Cloud and on-premises infrastructure.


Once you’ve built out your Search infrastructure, what your end users see is the User Interface and the results displayed for their queries. Rather than having a “Search Page”, more and more companies are integrating the Search UI into their core user applications so users don’t have to “search for the Search”.

If you are going to include a user feedback option, the best participation was recorded when the feedback was placed near the Search UI, but you will often get limited responses. This is where Search analytics comes into play: taking user feedback (if available) along with your Search users’ behaviour to keep a pulse on how Search is performing and whether your users are finding the content they were looking for. A good Search Analytics product can organize your Search data in a dashboard view and provide an overall health check, giving you quick insight into where your Search is working and where it needs intervention to keep running at an optimal level.
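
As a minimal illustration of the kind of health check described above, the following sketch computes a zero-result rate and click-through rate from a query log; the log format and numbers are made up for the example:

```python
# Toy search-analytics health check over a query log (hypothetical log format).
queries = [
    {"query": "vacation policy", "results": 12, "clicked": True},
    {"query": "vacation polcy", "results": 0, "clicked": False},
    {"query": "expense form", "results": 3, "clicked": True},
    {"query": "2017 holidays", "results": 8, "clicked": False},
]

# Share of queries returning nothing (a classic sign of content or synonym gaps).
zero_rate = sum(q["results"] == 0 for q in queries) / len(queries)
# Share of queries where the user clicked a result.
ctr = sum(q["clicked"] for q in queries) / len(queries)

print(f"zero-result rate: {zero_rate:.0%}, click-through rate: {ctr:.0%}")
# → zero-result rate: 25%, click-through rate: 50%
```

A real analytics product would track these metrics over time and per query, but even these two numbers hint at where intervention is needed.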

Regardless of whether you implement Search in-house or hire a team of experts, with all of the advancement in Search technology, you can put together all of the right pieces to provide a great Search tool for your employees and customers.

Looking for Information

There are many business applications where web crawling can be of benefit. You or your team likely have ongoing research projects or smaller projects that come up from time to time. You may do a lot of manual web searching (think Google) looking for random information, but what if you need to do targeted reviews to pull specific data from numerous websites? A manual web search can be time-consuming and prone to human error, and important information can be overlooked. An application powered by a custom crawler can be an invaluable tool, saving the manpower required to extract relevant content. This leaves you more time to actually review and analyze the data, putting it to work for your business.

A web crawler can be set up to locate and gather complete or partial content from public websites, and the information can be provided to you in an easily manageable format. The data can be stored in a search engine or database, integrated with an in-house system or tailored to any other target. There are multiple ways to access the data you gathered. It can be as simple as receiving a scheduled e-mail message with a .csv file or setting up search pages or a web app. You can also add functionality to sort the content, such as pulling data from a specific timeframe, by certain keywords or whatever you need.
If you have developers in house and want to build your own solution, you don’t even have to start from scratch. There are many tools available to get you started, such as our free crawler, Norconex HTTP Collector.
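
For example, the keyword and timeframe filtering described above can be sketched in a few lines; the record layout and sample data are hypothetical, not the output of any particular crawler:

```python
import csv
import io
from datetime import date

# Crawled records as a crawler might hand them off (sample data, not real output).
records = [
    {"url": "https://example.com/a", "date": date(2016, 11, 1), "title": "Widget pricing update"},
    {"url": "https://example.com/b", "date": date(2016, 9, 15), "title": "Annual report"},
    {"url": "https://example.com/c", "date": date(2016, 11, 20), "title": "New widget released"},
]

def filter_records(records, keyword, since):
    """Keep records mentioning the keyword and published on or after `since`."""
    return [r for r in records
            if keyword.lower() in r["title"].lower() and r["date"] >= since]

# Build a CSV in memory, as a scheduled e-mail attachment might be prepared.
matches = filter_records(records, "widget", date(2016, 10, 1))
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "date", "title"])
writer.writeheader()
writer.writerows(matches)
print(buffer.getvalue())
```

The same filtering could just as easily feed a search index or a database instead of a CSV file.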

If you hire a company to build your web crawler, you will want to use a reputable company that will respect all website terms of use. The solution can be set up and then “handed over” to your organization for you to run on an ongoing basis. For a hosted solution, the crawler and any associated applications will be set up and managed for you. This means any changes to your needs like adding/removing what sites to monitor or changing the parameters of what information you want to extract can be managed and supported as needed with minimal effort by your team.

Here are some examples of how businesses might use web crawling:


What is being said about your organization in the media? Do you review industry forums? Are there comments posted on external sites by your customers, perhaps ones you are not even aware of, that your team should be responding to? A web crawler can monitor news sites, social media sites (Facebook, LinkedIn, Twitter, etc.), industry forums and others to gather what is being said about you and your competitors. This kind of information can be invaluable to your marketing team, helping you keep a pulse on your company image through sentiment analysis, learn more about your customers’ perceptions and see how you compare against the competition.
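
A very naive form of the sentiment analysis mentioned above can be sketched with a small word lexicon; real tools use far richer models, and the lexicon and mentions here are invented for the example:

```python
# Naive lexicon-based sentiment scoring of crawled mentions (toy lexicon).
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "terrible", "rude"}

def sentiment(mention):
    """Label a mention by counting positive vs. negative words."""
    words = {w.strip(".,!?").lower() for w in mention.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

mentions = [
    "Great support, very helpful team!",
    "Checkout is slow and the search is broken.",
]
print([sentiment(m) for m in mentions])
# → ['positive', 'negative']
```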


Are people on your sales, marketing or product management teams tasked with going online to find out what new products or services are being provided by your competitors? Are you searching the competition to review pricing to make sure you are priced competitively in your space? What about comparing how your competitors are promoting their products to customers? A web crawler can be set up to grab that information, and then it can be provided to you so you can concentrate on analyzing that data rather than finding it. If you’re not currently monitoring your competition in this way, maybe you should be.


Does your business rely on information from other websites to help you generate a portion of your revenues? If you had better, faster access to that information, what additional revenues might that influence? An example is companies that specialize in staffing and job placement. When they know which companies are hiring, it provides them with an opportunity to reach out to those companies and help them fill those positions. They may wish to crawl the websites of key or target accounts, public job sites, job groups on LinkedIn and Facebook or forums on sites like Quora or Freelance to find all new job postings or details about companies looking for help with various business requirements. Capturing all those leads and returning them in a useable format can help generate more business.


A crawler can be set up to do entity extraction from websites. Say, for example, an automobile association needs to reach out to all car dealerships and manufacturers to promote services or industry events. A crawler can be set up to crawl target websites that provide relevant company listings to pull things like addresses, contact names and phone numbers (if available), and that content can be provided in a single, usable repository.
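
As a minimal sketch of this kind of entity extraction, assuming listings arrive as simple HTML with a known layout (the snippet, names and phone numbers below are made up):

```python
import re

# Toy entity extraction: pull dealership names and phone numbers from a
# listing page. The HTML layout and patterns are illustrative only.
html = """
<li>Maple Motors | 613-555-0142</li>
<li>Capital Auto Group | 819-555-0199</li>
"""

PHONE = re.compile(r"(\d{3}-\d{3}-\d{4})")
NAME = re.compile(r"<li>([^|<]+)\|")

entities = [
    {"name": name.strip(), "phone": phone}
    for name, phone in zip(NAME.findall(html), PHONE.findall(html))
]
print(entities)
```

Real listing sites rarely share one tidy layout, so production extractors typically combine per-site templates with a proper HTML parser rather than raw regular expressions.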


Do you have partners whose websites you need to monitor for information in order to grow your business? Think of the real estate or rental agent who is constantly scouring the MLS (Multiple Listing Service) and other realtor listing sites to find that perfect home or commercial property for a client they are serving. A web crawler can be set up to extract and send all new listings matching their requirements from multiple sites directly to their inbox as soon as they are posted to give them a leg up on their competition.


If you are purchasing products from various suppliers, you are likely going back and forth between their sites to compare offerings, pricing and availability. Being able to compare this information without hopping from website to website could save your business a lot of time and ensure you don’t miss out on the best deals!
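
Once offers from several supplier sites have been crawled into one place, the comparison itself is trivial; the suppliers, SKU and prices below are invented for the sketch:

```python
# Pick the best in-stock deal for one product across supplier feeds
# (supplier names, SKU and prices are made up).
offers = [
    {"supplier": "AcmeParts", "sku": "BOLT-10", "price": 4.99, "in_stock": True},
    {"supplier": "BoltDepot", "sku": "BOLT-10", "price": 4.25, "in_stock": True},
    {"supplier": "FastenCo", "sku": "BOLT-10", "price": 3.80, "in_stock": False},
]

# Ignore out-of-stock offers, then take the lowest price.
available = [o for o in offers if o["in_stock"]]
best = min(available, key=lambda o: o["price"])
print(f"{best['supplier']} has BOLT-10 for ${best['price']:.2f}")
# → BoltDepot has BOLT-10 for $4.25
```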

These are just some of the many examples of how web crawling can be of benefit. The number of business cases where web crawlers can be applied is endless. What are yours?


Google Search Appliance is Being Phased Out… Now What?

Google Search Appliance (GSA) was introduced in 2002, and since then, thousands of organizations have acquired Google’s “search in a box” to meet their search needs. Earlier this year, Google announced it is discontinuing sales of the appliance past 2016 and will not provide support beyond 2018. If you are currently using GSA for your search needs, what does this mean for your organization?

Google suggests migrating from GSA to its Google Cloud Platform. Specifically, the BigQuery service offers a fully scalable, fully managed data warehouse with search capabilities and analytics to provide meaningful insights. This may be a great option, but what if your organization or government agency needs to keep significant portions of its infrastructure in-house, behind firewalls? This new Google offering may be ill-suited as a replacement for GSA.

There are some other important elements you will want to consider before making your decision, such as protecting sensitive data, investment stability, customizability, feature set, ongoing costs, and more.

Let’s look at some of the options together.


Commercial Search Appliances

Examples: SearchBlox, Thunderstone, Mindbreeze


Commercial appliances can be fast to deploy if you have few customization requirements. As such, they may need little or no professional services involvement.

To Watch

Because appliance products aim to be stand-alone, black-box solutions, they may be less customizable to meet specific needs and may not integrate easily with other technologies. Because the hardware is fixed, if your requirements change over time, you may end up with a product that no longer meets your needs. You may also be tied to the vendor for ongoing support, and, as with GSA, there is no guarantee the vendor won’t discontinue the product, leaving you to start over again in search of your next solution.


Cloud-Based Solutions

Examples: Google Cloud (BigQuery), Amazon CloudSearch, etc.


A cloud-based solution can be both cost-effective and fast to deploy, and it will require little to no internal IT support, depending on your needs. Because the solution is based in the cloud, most of the infrastructure and associated costs are covered by the provider as part of the solution pricing.

To Watch

Cloud solutions may not work for organizations with sensitive data. While cloud-based solutions try to provide easy-to-use and flexible APIs, some customizations can’t be performed or must be done by the provider. Your organization may not own any of the ongoing development. Also, it may be difficult or costly to leave a cloud provider if you rely heavily on them for warehousing large portions of your data.


Commercial Search Software

Examples: Coveo, OpenText Search, HP IDOL, Lexmark Perceptive Platform, IBM Watson Explorer, Sinequa ES, Attivio


Commercial solutions work great behind firewalls, letting you maintain control of your data within your own environment. Commercial products often make configuration assumptions that can save deployment time when minimal customization is required. Commercial vendors try to differentiate themselves by offering “specializations”, along with rich feature sets and administrative tools out of the box. If most of your requirements fit within their main offerings, you may have less need for customization, potentially leading to professional services savings.

To Watch

Because there are so many commercial products out there, your organization may need to complete lengthy studies, potentially with the assistance of a consultant, comparing product offerings and feature sets to find the best fit for your platform(s). Customization may be difficult or costly, and some products may not scale well enough to match your organization’s changing and growing needs. Finally, there is always a risk that a commercial product gets discontinued, purchased, or otherwise vanishes from the market, forcing you to migrate your environment to another solution once more. We have seen this with Verity K2, FAST, Fulcrum Search, and several others.


Open Source Solutions

Examples: Apache Solr, Elasticsearch


Going open source is often the most flexible solution you can implement. Having full access to a product’s source code makes the customization potential almost unlimited. There are no acquisition or ongoing licensing costs, so the overall cost to deploy can be much less than for commercial products, and you can focus your spending on creating a tailored solution rather than a pre-built commercial product. You will have the flexibility to change and extend your search solution as your needs evolve. It is also worth pointing out that the risk of the product being discontinued is almost zero, given the widespread adoption of open source for Search. Being open source, add-on components are plentiful, and the options grow every day thanks to an active online community; many of them are also free!

To Watch

Depending on the number and complexity of your search requirements, the expertise required may be greater, and an open source solution may take longer to deploy. You often need good developers to implement an open source solution: you will need key in-house resources or be prepared to hire external experts to assist with implementation. If using an expert shop, you will want to pre-define your requirements to ensure the project stays within budget. Note that, unlike some commercial products, open source products usually keep a stronger focus on the search engine itself. This means they often lack the accompanying components and features that ship with commercial products (crawlers for many data sources, built-in analytics reporting, industry-specific ontologies, etc.). Luckily, open source solutions often integrate easily with commercial or open source components that can fill these gaps.

I hope this brief overview helps you begin your assessment of how to replace your Google Search Appliance, or implement other Search solutions.