GATINEAU, QC, CANADA – Monday, December 1, 2014 – Norconex announces the launch of its Google Search Appliance (GSA) Committer module for the Norconex Collectors Crawler Suite. Enterprise search developers and enthusiasts now have a flexible and extensible option for feeding documents to their GSA infrastructure. GSA becomes a target repository for documents crawled by Norconex HTTP Collector, Norconex Filesystem Collector, and any future Collector released by Norconex. These Collectors can reside on any server (like remote filesystems) and send discovered documents across the network to a GSA installation. The GSA Committer is the latest addition to the growing list of Committers already available to Norconex Collector users: Apache Solr, Elasticsearch, HP IDOL, and Lucidworks.

“The increasing popularity of our universal crawlers motivates us to provide support for more search engines. Search engines come and go in an organization, but your investment in your crawling infrastructure can be protected by having re-usable crawler setups that can outlast any search engine installation,” said Norconex President Pascal Essiembre.

GSA Committer Availability

GSA Committer is part of Norconex’s commitment to delivering quality open-source products backed by community or commercial support. GSA Committer is available for immediate download at /collectors/committer-gsa.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help process and analyze structured and unstructured data.

For more information on GSA Committer:

GSA Committer Website: /collectors/committer-gsa
Norconex Collectors: /collectors
Email: info@norconex.com

Norconex just released major upgrades to all its Norconex Collectors and related projects. That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available committers (Solr, Elasticsearch, HP IDOL, etc.), were all upgraded to version 2.0.0.

With these major product upgrades comes a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website. At a quick glance, you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

  • Can now split a document into multiple documents.
  • Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).
  • Language detection (50+ languages).
  • Parsing and formatting of dates from/to any format.
  • Character case modifiers.
  • Can now index basic content statistics with each document (word count, average word length, average words per sentence, etc.).
  • Can now supply a “seed file” for listing start URLs or start paths to your crawler.
  • Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used.  This reduces I/O and improves performance.
  • New event model where listeners can listen for any type of crawler events.
  • Can now ignore parsing of specific content types.
  • Can filter documents based on arbitrary regular expressions performed on the document content.
  • Enhanced debugging options, where you can print out specific field content as documents are being processed.
  • HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).
  • More…

Another significant change is that all Norconex open-source projects are now licensed under the Apache License 2.0. We hope this will facilitate adoption alongside third-party commercial offerings.

It is important to note that version 2.0.0 releases are not compatible with their previous 1.x versions. The configuration options changed in many areas, so do not expect to run your existing configuration under 2.0.0 as-is. Please refer to the latest documentation for new and modified configuration options.

Visit the new Norconex Collectors website now.

This feature release brings the following additions…

Simple Pipeline

Useful if you want to quickly assemble multiple tasks to be run as a single “pipeline” while keeping things ultra simple. The following example does it all in a single class only to keep it short.

// The pipeline classes ship with Norconex Commons Lang; the package below is
// assumed to be com.norconex.commons.lang.pipeline (check your version's Javadoc).
import com.norconex.commons.lang.pipeline.IPipelineStage;
import com.norconex.commons.lang.pipeline.Pipeline;

public class MyPipeline extends Pipeline<String> {

    public MyPipeline() {
        addStage(new MyTask1());
        addStage(new MyTask2());
    }
    
    // Class: Task1
    private class MyTask1 implements IPipelineStage<String> {
        @Override
        public boolean execute(String context) {
            System.out.println("Task 1 executed: " + context);
            return true;
        }
    }  

    // Class: Task2
    private class MyTask2 implements IPipelineStage<String> {
        @Override
        public boolean execute(String context) {
            System.out.println("Task 2 executed: " + context);
            return true;
        }
    }  
    
    public static void main(String[] args) {
        new MyPipeline().execute("hello");
        
        // Will print out:
        //     Task 1 executed: hello
        //     Task 2 executed: hello
    }
}

Cacheable Streams

There are already several excellent object caching mechanisms available to Java if you need something sophisticated. This release offers a very lightweight cache implementation that can make an InputStream and an OutputStream reusable. It stores the stream in memory until a configurable threshold is reached, after which it switches to fast file lookup. A CachedStreamFactory is used to obtain cached streams sharing the same pool of memory.

        // Note: IOUtils below comes from Apache Commons IO.
        int size10mb = 10 * 1024 * 1024;
        int size1mb  = 1024 * 1024;
        InputStream is = null; // <-- your original input stream
        OutputStream os = null; // <-- your original output stream
        
        CachedStreamFactory streamFactory = new CachedStreamFactory(size10mb, size1mb);
        
        //--- Reuse the input stream ---
        CachedInputStream cachedInput = streamFactory.newInputStream(is);
        
        // Read the input stream the first time
        System.out.println(IOUtils.toString(cachedInput));
        // Read the input stream a second time
        System.out.println(IOUtils.toString(cachedInput));
        
        // Release the cached data, preventing further re-use
        cachedInput.dispose();

        //--- Reuse the output stream ---
        CachedOutputStream cachedOutput = streamFactory.newOuputStream(os);
        
        IOUtils.write("lots of data", cachedOutput);
        
        // Obtain a new input stream from the output
        CachedInputStream newInputStream = cachedOutput.getInputStream();
        
        // Do what you want with this input stream

Enhanced XML Writing

The Java XMLStreamWriter is a useful class, but it is a bit annoying to use when you are not always writing strings. The EnhancedXMLStreamWriter adds convenience methods for primitive types and others.

        Writer out = null; // <-- your target writer
        
        EnhancedXMLStreamWriter xml = new EnhancedXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("item");
        
        xml.writeElementInteger("quantity", 23);
        
        xml.writeElementString("name", "something");
        
        xml.writeStartElement("size");
        xml.writeAttributeInteger("height", 24);
        xml.writeAttributeInteger("width", 26);
        xml.writeEndElement();

        xml.writeElementBoolean("sealwrapped", true);

        xml.writeEndElement();
        xml.writeEndDocument();
        
        /* Will write:
          
          <?xml version="1.0" encoding="UTF-8"?>
          <item>
              <quantity>23</quantity>
              <name>something</name>
              <size height="24" width="26" />
              <sealwrapped>true</sealwrapped>
          </item>
         */

More Equality Checks

More methods were added to EqualsUtil:

        // True if "toMatch" equals any of the candidates, ignoring case
        EqualsUtil.equalsAnyIgnoreCase("toMatch", "candidate1", "candidate2");
        // True if "toMatch" equals all of the candidates, ignoring case
        EqualsUtil.equalsAllIgnoreCase("toMatch", "candidate1", "candidate2");
        // True if "toMatch" equals none of the candidates, ignoring case
        EqualsUtil.equalsNoneIgnoreCase("toMatch", "candidate1", "candidate2");

Discover More Features

A few more features and updates were made to the Norconex Commons Lang library.   For more information, check out the full release notes.

Download your copy now.

The scene at GTEC made for another exciting year. GTEC is Canada’s Government Technology Event. As usual, there were many engaging keynotes, presentations, panel discussions, and informative vendor exhibitors.

GTEC offers a great opportunity for buyers and clients to get out of the office and personally talk to vendors. It’s also an opportunity for vendors to talk to clients about their products and services and gather leads. GTEC is the one event that gets all the key parties under one roof.

Enterprise Search – Still very much a core issue within the GoC

At the Norconex booth, we had the opportunity to talk to many government employees. When we discussed what we do, the general response confirmed that search is still absolutely imperative for knowledge workers within the Government of Canada.

Unfortunately, though, the reality of the public service doesn’t offer that same support. Search seems to be low on the list of priorities and doesn’t appear to be getting the attention it deserves. The number of public servants who indicated that they were dissatisfied with the quality of their internal search was surprising.

They understood the importance of their document management systems and why it was necessary to keep stored information organized. However, they emphasized the need to find information without knowing exactly where it is located. They wanted more attention spent on how to get content “out” in order to leverage that information, enabling them to do their jobs more efficiently.

GTEC 2014 iPad Draw

And the winner is…

Norconex is pleased to announce that the winner of the iPad mini is Douglas North from Shared Services Canada. We were lucky enough to have an employee from Canada Revenue Agency pull the winning ballot.

Norconex is currently showcasing its new Norconex Content Analytics product at the GTEC event in Ottawa. Mike Clark and Khalid Alhomoud are having a good time meeting new faces and existing customers. If you are near Ottawa, come visit us at booth 908 in the Ottawa Convention Center (Shaw Center) for a free demo that could change the way you look at your data. The event ends tomorrow (Wednesday, October 29th).

Khalid and Mike at Ottawa GTEC

GATINEAU, QC, CANADA – Thursday, September 22, 2014 – Norconex is excited to announce the launch of Norconex Content Analytics, enabling organizations to gain deep insights into their current information assets.

Norconex believes its Content Analytics product will provide customers with valuable statistical reports on documents from all kinds of enterprise repository sources, ranging from local file systems to remote secure servers, at a fraction of the cost of compiling reports manually or with competing products.

“I can already assess that this affordable enterprise solution will save some of our customers a fortune on their data migration projects,” said David Gaulin, Vice President of Professional Services at Norconex.

Norconex Content Analytics Availability

Norconex Content Analytics is a product driven by customer feedback and is part of Norconex’s commitment to delivering quality commercial products. Norconex Content Analytics is available immediately for purchase. Additional information can be found at /enterprise-search-software/content-analytics/.

About Norconex

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help with the processing and analysis of structured and unstructured data.

Norconex Content Analytics

For more information on Norconex Content Analytics:

Website: /enterprise-search-software/content-analytics/

Email: info@norconex.com

GATINEAU, QC, CANADA – Thursday, August 25, 2014 – Norconex announces the launch of Norconex Filesystem Collector, providing organizations with a free “universal” filesystem crawler. The Norconex Filesystem Collector enables document indexing into target repositories of choice, such as enterprise search engines.

Following on the success of the Norconex HTTP Collector web crawler, Norconex Filesystem Collector is the second open-source crawler contribution to the Norconex “Collector” suite. Norconex believes this crawler allows customers to adopt a full-featured, enterprise-class local or remote file system crawling solution that outlasts their enterprise search solution or other data repository.

“This not only facilitates any future migrations but also allows customers to add their own ETL logic into a very flexible crawling architecture, whether using Autonomy, Solr/LucidWorks, Elasticsearch, or any other data repository,” said Norconex President Pascal Essiembre.

Norconex Filesystem Collector Availability

Norconex Filesystem Collector is part of Norconex’s commitment to deliver quality open-source products, backed by community or commercial support. Norconex Filesystem Collector is available for immediate download at /collectors/collector-filesystem/download.

Founded in 2007, Norconex is a leader in enterprise search and data discovery. The company offers a wide range of products and services designed to help with the processing and analysis of structured and unstructured data.

For more information on Norconex Filesystem Collector:

Website: /collectors/collector-filesystem

Email: info@norconex.com

###

Release 1.3.0 of Norconex Importer is now available.  Release overview:

  • Now stores the content “family” for each document as “importer.contentFamily”.
  • New SplitTagger: splits values into multiple values using a separator of choice.
  • New CopyTagger: copies document metadata fields to other fields.
  • New HierarchyTagger: splits a field string into multiple segments representing each node of a hierarchical branch (see the sketch after this list).
  • ReplaceTagger now supports regular expressions.
  • Improved MIME type detection.
  • More…
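
To make the HierarchyTagger behaviour concrete, here is a small plain-Java sketch. It is not the tagger’s actual API; it only shows the kind of transformation described above, turning a path-like value (the “/vegetable/potato/sweet” sample is made up) into every node of its hierarchical branch.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class HierarchySample {

    // Builds every node of a hierarchical branch from a path-like value,
    // e.g. "/vegetable/potato/sweet" becomes
    // "/vegetable", "/vegetable/potato", "/vegetable/potato/sweet".
    public static List<String> hierarchy(String value, String separator) {
        List<String> segments = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String part : value.split(Pattern.quote(separator))) {
            if (part.isEmpty()) {
                continue;
            }
            current.append(separator).append(part);
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(hierarchy("/vegetable/potato/sweet", "/"));
        // Prints: [/vegetable, /vegetable/potato, /vegetable/potato/sweet]
    }
}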

Download it now.

Web site: /collectors/importer/

During the development of our latest product, Norconex Content Analytics, we decided to add facets to the search interface. Facets make it easy to explore the indexed content. Solr and Elasticsearch both have facet implementations that work on top of Lucene, but Lucene also offers simple facet implementations that can be used out of the box. And because Norconex Content Analytics is based on Lucene, we decided to go with those implementations.

We’ll look at those facet implementations in this blog post, but first, let’s talk about a new feature of Lucene 4 that is used by all of them.
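
To give a quick preview, here is a minimal, self-contained sketch of the taxonomy-based facet API available in recent Lucene 4.x releases (4.7 or later). It is an illustration only, not code taken from Norconex Content Analytics, and the “contentFamily” facet dimension is a made-up example.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SimpleFacetExample {
    public static void main(String[] args) throws Exception {
        Directory indexDir = new RAMDirectory();
        Directory taxoDir = new RAMDirectory();
        FacetsConfig config = new FacetsConfig();

        // Index two documents, each carrying a "contentFamily" facet value.
        IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
        DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

        Document doc1 = new Document();
        doc1.add(new FacetField("contentFamily", "spreadsheet"));
        writer.addDocument(config.build(taxoWriter, doc1));

        Document doc2 = new Document();
        doc2.add(new FacetField("contentFamily", "word-processor"));
        writer.addDocument(config.build(taxoWriter, doc2));

        writer.close();
        taxoWriter.close();

        // Run a query and count facet values across all matching documents.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(indexDir));
        TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);

        Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
        FacetResult result = facets.getTopChildren(10, "contentFamily");
        System.out.println(result); // one count per contentFamily value

        taxoReader.close();
    }
}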

Norconex Commons Lang 1.4.0 was just released.

New features:

  • New DataUnit class to perform data unit (KB, MB, GB, etc.) conversions, much like the Java TimeUnit class (see the sketch after this list).
  • New DataUnitFormatter to format any data unit to a human-readable format, taking into account locale and decimals.
  • New percentage formatter.
  • New ContentType class to represent a file media/MIME type and obtain its usual name, content family, and file extension(s).
  • New ContentFamily class to represent a group of files of similar content types. Useful for content categorization.
  • New ObservableMap class.
  • More…
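
As a rough idea of what this can look like in practice, here is a short hypothetical sketch built on the TimeUnit analogy above. The package, constructor arguments, and method names shown are assumptions rather than confirmed API, so refer to the Norconex Commons Lang Javadoc for the exact signatures.

// Hypothetical usage sketch: class and method names assumed from the TimeUnit
// analogy (package assumed: com.norconex.commons.lang.unit); check the Javadoc.
import java.util.Locale;

import com.norconex.commons.lang.unit.DataUnit;
import com.norconex.commons.lang.unit.DataUnitFormatter;

public class DataUnitSample {
    public static void main(String[] args) {
        // Convert 3 megabytes to bytes, TimeUnit-style.
        long bytes = DataUnit.MB.toBytes(3);
        System.out.println(bytes); // 3145728

        // Format a byte count into a human-readable string, honoring
        // a locale and a maximum number of decimals (arguments assumed).
        DataUnitFormatter formatter = new DataUnitFormatter(Locale.CANADA_FRENCH, 2);
        System.out.println(formatter.format(2345678, DataUnit.B));
    }
}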

Download it now.

Web site: /product/commons-lang/