2013 – Norconex Inc

I am currently on a plane, coming back from three wonderful days of training on the Google Search Appliance 7.0 at the Google headquarters in Mountain View, California. I must admit that I was as impressed by the environment of the Googleplex as I was by the GSA itself.

On the campus, you will find numerous kitchens and restaurants, a full-size T-Rex skeleton, ping pong tables, spas, pools, a beach volleyball court, a bowling alley, GBikes, and more. At first, it looks more like a family vacation destination than an office complex. But don’t be fooled by the distractions: real work is going on within those walls. (more…)

Norconex is glad to help Sophie Carrier-Laforte, an outstanding amateur athlete who is once again aiming for the highest honors this year, go even further. I have had the privilege of knowing Sophie for a long while now, and I have watched her progress as an athlete. I must say that her hard work and her dedication to her sport are a true inspiration to me and the team here at Norconex. It took her many years to get where she is now, among the top Canadian juniors, but her journey is just beginning. (more…)

Norconex Commons Lang is a generic Java library providing useful utility classes that extend the base Java API. Its name is shamelessly borrowed from Apache Commons Lang, so people can quickly guess what it is about. It is by no means an effort to replace Apache Commons Lang. Quite the opposite: we try to favor Apache Commons libraries whenever possible. Norconex uses this Commons Lang library as a catch-all, providing all kinds of generic utilities, some of which have extra dependencies beyond the base Java API. While this library is used by Norconex in its enterprise search projects, it is not tied to search and can be used in any context.

The following explores some of the key features it offers as of this writing. (more…)

For a third consecutive year, Norconex will participate in the CIBC Run for the Cure event to help out the Canadian Breast Cancer Foundation. The run will take place at Tunney’s Pasture in Ottawa on October 6, 2013. Help us reach our objective by making a donation.

As always, you are also welcome to join our team and run with us!

During a recent client project, I was required to crawl several websites, each with its own specific requirements. For example, one of the websites required:

a meta tag's content to be used as a replacement for the actual URL,
the header, footer, and any repetitive content to be excluded from each page,
robots.txt to be ignored, since it is meant for external crawlers only (Google, Bing, etc.), and
the pages to be indexed in LucidWorks.

The LucidWorks built-in web crawler is based on Aperture. It is great for basic web crawls, but I needed more advanced features that it could not provide. I had to configure LucidWorks with an external crawler that had more advanced built-in capabilities and could be extended with new functionality.
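To make the first two requirements in the list above more concrete, here is a minimal sketch in plain Java. It is not the configuration used on the project and it does not use any crawler API. The meta tag name "real-url", the assumption that repeated content sits inside header, footer, or nav elements, and the class and method names are all hypothetical, chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageRequirementsSketch {

    // Assumption: the replacement URL is stored in a meta tag named "real-url",
    // written with the name attribute first and double-quoted values.
    private static final Pattern META_URL = Pattern.compile(
            "<meta\\s+name=\"real-url\"\\s+content=\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Assumption: repetitive content is wrapped in <header>, <footer>, or <nav>.
    private static final Pattern REPEATED = Pattern.compile(
            "<(header|footer|nav)(?:\\s[^>]*)?>.*?</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the meta tag value when present, otherwise the crawled URL.
    static String resolveReference(String crawledUrl, String html) {
        Matcher m = META_URL.matcher(html);
        return m.find() ? m.group(1) : crawledUrl;
    }

    // Removes the elements assumed to hold repeated content.
    static String stripRepeatedContent(String html) {
        return REPEATED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"real-url\" content=\"http://example.com/public/page\">"
                + "</head><body><header>Menu</header><p>Body text</p>"
                + "<footer>Copyright</footer></body></html>";
        System.out.println(resolveReference("http://internal/page?id=1", html));
        System.out.println(stripRepeatedContent(html));
    }
}
```

Regular expressions are used here only to keep the sketch short. A real crawl would more likely rely on an HTML parser and on the extension points of whichever crawler is chosen.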

(more…)

Norconex just released version 1.1 of HTTP Collector, its free web crawler. This is an important upgrade from the Norconex Development Team, giving you the following great new features and enhancements:

Much faster and more consistent crawling performance, especially at high volumes (millions).
Support for sitemap.xml and sitemap index (plain or gzip).
Support for BASIC and DIGEST authentication.
Support for in-page robot instructions.
Support for ftp:// URLs.

For a complete list of changes, see the Release Notes.

This release also takes advantage of the new 1.1 release of Norconex Importer, which adds the ability to extract parts of documents using regular expressions and to store them as document metadata for indexing (for example, the content of H1, H2, or bold tags, to influence ranking).
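As a rough illustration of the idea, the following sketch uses only java.util.regex to pull the text of selected tags out of an HTML string and collect it into a map of metadata fields. It does not use the Norconex Importer API. The class name, method names, and field names are hypothetical, and the simple pattern assumes well-formed, non-nested tags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMetadataSketch {

    // Builds a pattern capturing the text between <tag ...> and </tag>.
    static Pattern tagPattern(String tag) {
        return Pattern.compile(
                "<" + tag + "(?:\\s[^>]*)?>(.*?)</" + tag + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // Collects the trimmed text of every match of capture group 1.
    static List<String> extractAll(Pattern pattern, String content) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    // Stores the extracted values under one metadata field per tag name.
    static Map<String, List<String>> extractMetadata(String html) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (String tag : new String[] {"h1", "h2", "b"}) {
            metadata.put(tag, extractAll(tagPattern(tag), html));
        }
        return metadata;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1 class=\"t\">Release 1.1</h1>"
                + "<p>Some <b>bold</b> text</p>"
                + "<h2>Features</h2><h2> Fixes </h2></body></html>";
        System.out.println(extractMetadata(html));
    }
}
```

An indexer could then give these fields a higher weight than the body text, which is the ranking influence the release announcement alludes to.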

We would love to hear your feedback on this release and the features you would like to see implemented next.

Download Norconex HTTP Collector 1.1 now!

2013/06/05

Say hello to Norconex HTTP Collector! At Norconex, we have always recognized the value open-source brings to software development, and to a greater extent, the world. It benefits us when building custom solutions for our customers and ourselves. As long-time consumers of open-source, it is time for us to give back.

As a result, Norconex is proud to announce the open-sourcing of a handful of its libraries and products, so that the community can save time and money just as open source did for us. The Norconex HTTP Collector is an HTTP crawler designed to give developers and integrators the greatest possible flexibility. (more…)

Over the last few weeks, I have had the opportunity to work and play around with the Google Search Appliance Version 7.0, and I must say that it’s an interesting piece of technology. Varying opinions, such as “Yeah, GSA is cool, but it can’t do X, Y, or Z,” have prevented many people from adopting the Google offering as a real Enterprise Search solution. Thanks to the new version, most of those limitations are a thing of the past. The platform is much more mature than previous versions. You can sense that Google is really making the effort required to improve its search platform so that it properly addresses the needs of the Enterprise market. It’s not just about web content anymore. (more…)