This release of Norconex Importer brings many fixes, increased stability, and nice new features. The following highlights some of the additions with XML configuration or Java code samples.

Retrieve a document Length

[ezcol_1half]

Thanks to the new DocumentLengthTagger, you can now store a document byte length in a metadata field of your choice. The length can be obtained at any document processing stage.  For instance, it can be obtained before any transformation took place, or after it was parsed.

[/ezcol_1half]

[ezcol_1half_end]

 [/ezcol_1half_end]

Add the current date to a document

[ezcol_1half]

The new CurrentDateTagger allows to add the current date to a metadata field and date format of your choice. This can be useful to indicate when a document was actually processed by the Importer.

[/ezcol_1half]

[ezcol_1half_end]

 [/ezcol_1half_end]

Filter documents on numeric or date range

[ezcol_1half]

NumericMetadataFilter and DateMetadataFilter now allow you to filter documents based on metadata field numeric or date values, respectively. You can define both closed ranges and open-ended ranges.

[/ezcol_1half]

[ezcol_1half_end]

 [/ezcol_1half_end]

Use external parsers

[ezcol_1half]

Wrapping a Tika class of the same name, the new ExternalParser allows Java programmers to point to external command-line applications to parse documents. One example can be for using “pdftotext” to parse PDFs instead of the default PDF parser based on PDFBox, which is much slower (but does a better job overall).

[/ezcol_1half]

[ezcol_1half_end]

  [/ezcol_1half_end]

Other improvements

There are more changes under the hood, like upgrading to Apache Tika 1.8, as well as the fixing of OutOfMemory errors and document parsing sometimes never returning. You can find the complete list of changes in the release notes.

Several of these improvements were made possible thanks to the great feedback of the open-source community. Keep doing so: you make a difference.

Useful links

 

[ezcol_1half]

In the wake of FIFA Women’s World Cup 2015 starting in Canada on June 6th, many young girls are anxiously waiting to see their favorite players compete for their country. Soccer (or “football” outside North America) has been the most practiced sport in Canada by kids from age 5 to 14 for over a decade now. Girls represent over 40% of these young players.

Norconex is proud to help attract and retain even more girls to play the world’s most beloved sport by sponsoring 5 girl-only teams from U10 to U14 part of the “Association de Soccer de Hull”.

We can’t predict who will be the next Christine Sinclair, but we hope an increasing number of girls will enjoy soccer and even dream to be on the national team one day.

If you’re in Canada this June (or early July), make sure you have your World Cup tickets. Matches will be played in 6 cities across the country.

Go girls!

[/ezcol_1half]

[ezcol_1half_end]
ASH U12 Girl Soccer Team

ASH U12 D1 Girl Soccer Team

[/ezcol_1half_end]

Optical character recognition (ORC), content translation, title generation, detection and text extraction from more file formats, are among the new features now part of your favorite crawlers: Norconex HTTP Collector 2.1.0 and Norconex Filesystem Collector 2.1.0. They are both available now and can be downloaded for free.  They both ship with and use the latest version of the Norconex Importer module, which is in big part responsible for many of these new features.

For more details and usage examples, check this article.

These two Collector releases also include bug fixes and stability improvements.  We recommend to existing users to upgrade.

Get your copy

Download Norconex HTTP Collector

Download Norconex Filesystem Collector

This feature release of Norconex Importer brings bug fixes, enhancements, and great new features, such as OCR and translation support.  Keep reading for all the details on some of this release’s most interesting changes. While Java can be used to configure and use the Importer, XML configuration is used here for demonstration purposes.  You can find all Importer configuration options here.

About Norconex Importer

Norconex Importer is an open-source product for extracting and manipulating text and metadata from files of various formats.  It works for stand-alone use or as a Java library.  It’s an essential component of Norconex Collectors for processing crawled documents.  You can make Norconex Importer an essential piece of your ETL pipeline.

OCR support

[ezcol_1half]

Norconex Importer now leverages Apache Tika 1.7’s newly introduced ORC capability. To convert popular image formats (PNG, TIFF, JPEG, etc.) to text, download a copy of Tesseract OCR for your operating system, and reference its install location in your Importer configuration.  When enabled, OCR will process embedded images too (e.g., PDF with image for text). The class configure to enable OCR support is GenericDocumentParserFactory.

[/ezcol_1half]

[ezcol_1half_end]

 [/ezcol_1half_end]

Translation support

[ezcol_1half]

With the new TranslatorSplitter class, it’s now possible to hook Norconex Importer with a translation API.  The Apache Tika API has been extended to provide the ability to translate a mix of document content or specific document fields.  The translation APIs supported out-of-the-box are Microsoft, Google, Lingo24, and Moses.

[/ezcol_1half]

[ezcol_1half_end]

 [/ezcol_1half_end]

Dynamic title creation

[ezcol_1half]

Too many documents do not have a valid title, when they have a title at all.  What if you need a title to represent each document?  What do you do in such cases?   Do you take the file name as the title? Not so nice.  Do you take the document property called “title”?  Not reliable.  You now have a new option with the TitleGeneratorTagger.  It will try to detect a decent title out of your document.  In cases where it can’t, it offers a few alternate options. You always get something back.

[/ezcol_1half]

[ezcol_1half_end]

 [/ezcol_1half_end]

Saving of parsing errors

[ezcol_1half]

A new top-level configuration option was introduced so that every file generating parsing errors gets saved in a location of your choice.  These files will be saved along with the metadata obtained so far (if any), along with the Java exception that was thrown. This is a great addition to help troubleshoot parsing failures.

[/ezcol_1half]

[ezcol_1half_end]

 [/ezcol_1half_end]

Document parsing improvements

The content type detection accuracy and performance were improved with this release.  In addition, document parsing features the following additions and improvements:

  • Better PDF support with addition of PDF XFA (dynamic forms) text extraction, as well as improved space detection (eliminating many space-stripping issues).  Also, PDFs with JBIG2 and jpeg2000 image formats are now parsed properly.
  • New XFDL parser (PureEdge Extensible Forms Description Language).  Supports both Gzipped/Base64 encoded and plain text versions.
  • New, much improved WordPerfect parser now parsing WordPerfect documents according to WordPerfect file specifications.
  • New Quattro Pro parser for parsing Quattro Pro documents according to Quattro Pro file specifications.
  • JBIG2 and jpeg2000 image formats are now recognized.

You want more?

The list of changes and improvements doesn’t stop here.  Read the product release notes for a complete list of changes.

Unfamiliar with this product? No sweat — read this “Getting Started” page.

If not already out when you read this, the next feature release of Norconex HTTP Collector and Norconex Filesystem Collector will both ship with this version of the Importer.  Can’t wait for the release? Manually upgrade the Norconex Importer library to take advantage of these new features in your favorite crawler.

Download Norconex Importer 2.1.0.

Release 1.6.0 of Norconex Commons Lang provides new Java utility classes and enhancements to existing ones:

New Classes

TimeIdGenerator

[ezcol_1half]

Use TimeIdGenerator when you need to generate numeric IDs that are unique within a JVM. It generates Java long values that are guaranteed to be in order (but can have gaps).  Can generate up to 1 million unique IDs per milliseconds. Read Javadoc.

[/ezcol_1half]

[ezcol_1half_end]

[/ezcol_1half_end]

TextReader

[ezcol_1half]

A new class for reading large text, one chunk at a time, based on a specified maximum read size. When a text is too large, it tries to split it wisely at each paragraphs, sentences, or words (whichever one is possible). Read Javadoc.

[/ezcol_1half]

[ezcol_1half_end]

[/ezcol_1half_end]

ByteArrayOutputStream

[ezcol_1half]

An alternate version of Java and Apache Commons ByteArrayOutputStream. Like the Apache version, this version is faster than Java ByteArrayOutputStream. In addition, it provides additional methods for obtaining a subset of bytes ranging from zero to the total number of bytes written so far. Read Javadoc.

[/ezcol_1half]

[ezcol_1half_end]

[/ezcol_1half_end]

Enhancements

IOUtil enhancements

The following utility methods were added to the IOUtil class:

Other improvements

Get your copy

Download Norconex Commons Lang 1.6.0.

You can also view the release notes for a complete list of changes.

 

Broken linkThis tutorial will show you how to extend Norconex HTTP Collector using Java to create a link checker to ensure all URLs in your web pages are valid. The link checker will crawl your target site(s) and create a report file of bad URLs. It can be used with any existing HTTP Collector configuration (i.e., crawl a website to extract its content while simultaneously reporting on its broken links).  If you are not familiar with Norconex HTTP Collector already, you can refer to our Getting Started guide.

The link checker we will create will record:

  • URLs that were not found (404 HTTP status code)
  • URLs that generated other invalid HTTP status codes
  • URLs that generated an error from the HTTP Collector

The links will be stored in a tab-delimited-format, where the first row holds the column headers. The columns will be:

  • Referrer: the page containing the bad URL
  • Bad URL: the culprit
  • Cause: one of “Not Found,” “Bad Status,” or “Crawler Error”

One of the goals of this tutorial is to hopefully show you how easy it is to add your own code to the Norconex HTTP Collector. You can download the files used to create this tutorial at the bottom of this page. You can jump right there if you are already familiar with Norconex HTTP Collector. Otherwise, keep reading for more information.

Get your workspace setup

To perform this tutorial in your own environment, you have two main choices. If you are a seasoned Java developer and an Apache Maven enthusiast, you can create a new Maven project including Norconex HTTP Collector as a dependency. You can find the dependency information at the bottom of its download page.

If you want a simpler option, first download the latest version of Norconex HTTP Collector and unzip the file to a location of your choice. Then create a Java project in your favorite IDE.   At this point, you will need to add to your project classpath all Jar files found in the “lib” folder under your install location. To avoid copying compiled files manually every time you change them, you can change the compile output directory of your project to be the “classes” folder found under your install location. That way, the collector will automatically detect your compiled code when you start it.

You are now ready to code your link checker.

Listen to crawler events

There are several interfaces offered by the Norconex HTTP Collector that we could implement to achieve the functionality we seek. One of the easiest approaches in this case is probably to listen for crawler events. The collector provides an interface for this called ICrawlerEventListener. You can have any number of event listeners for your crawler, but we only need to create one. We can implement this interface with our link checking logic:

As you can see, the previous code focuses only on the crawler events we are interested in and stores URL information associated with these events. We do not have to worry about other aspects of web crawling in that implementation. The above code is all the Java we need to write for our link checker.

Configure your crawler

If you have not seen a Norconex HTTP Collector configuration file before, you can find sample ones for download, along with all options available, on the product configuration page.

This is how we reference the link checker we created:

By default, the Norconex HTTP Collector does not keep track of referring pages with every URL it extracts (to minimize information storage and increase performance). Because having a broken URL without knowing which page holds it is not very useful, we want to keep these referring pages. Luckily, this is just a flag to enable on an existing class:

In addition to these configuration settings, you will want to apply more options, such as restricting your link checker scope to only your site or a specific sub-section or your site. Use the configuration file sample at the bottom of this page as your starting point and modify it according to your needs.

You are ready

Once you have your configuration file ready and the compiled Link Checker listener in place, you can give it a try (replace .bat with .sh on *nix platforms):

The bad link report file will be written at the location you specified above.

Source files

Download Download the source files used to create this article

 

Despite all the “noise” on social media sites, we can’t deny how valuable information found on social media networks can be for some organizations. Somewhat less obvious is how to harvest that information for your own use. You can find many posts online asking about the best ways to crawl this or that social media service: Shall I write a custom web scraper? Should I purchase a product to do so?

This article will show you how to crawl Facebook posts (more…)

On large environments, it’s common to have many crawlers running at once, or at scheduled intervals, in order to keep your collected content up-to-date. For example, this is a typical requirement of search engines installations. They need their internal indices updated frequently in order to keep their search results relevant.

Keeping track of individual crawler execution can be challenging. How many are currently running? For how long? Any of them failed? Sure you can log in on the servers where these crawlers are running to get valuable insights. Your operating system can list running processes, and you can analyze each crawler logs. What if your supervisor or a non-technical person wants to know the current crawl status?   You can quickly become a bottleneck.

This approach is not ideal to say the least.

Luckily, Norconex Collectors were designed to take advantage of the Norconex JEF (Job Execution Framework) library.   As a result, all Norconex Collector crawlers you have defined are just waiting to be monitored by Norconex JEF Monitor, a web-based progress and status monitoring application. What’s best is you do not need to change anything in your crawler configurations to get this monitoring.

 JEF Monitor Main Screen

If you already have a JEF Monitor installation up and running, feel free to scroll down to skip the JEF Monitor installation.

Install JEF Monitor

Download the latest stable copy of JEF Monitor (4.0 as of this writing).   Decompress the obtained zip file in a directory of your choice, on the same server where one or more Norconex Collectors are installed.

This will create the following files and directory structure:

 To start JEF Monitor, execute jef-monitor.bat or jef-monitor.sh whether you are on a Windows or *nix environment. Open your favorite browser, and access JEF Monitor using this URL:

               http://localhost:8080/

Replace localhost with the proper server name if your browser was not started from the same server where you installed JEF Monitor.

With version 4.0, the default port is 8080. To change that port or to have JEF Monitor accessible via https only, modify the config/setup.properties file accordingly before starting JEF Monitor.

First-time configuration

The first time JEF Monitor is accessed, you have to go through a few initial configuration screens:

JEF Monitor Introduction Screen

Hit “Let’s Go!”

JEF Monitor Installation Name

JEF Monitor Installation Name

You can have several JEF Monitor installations. Any installation can report on other installations to give you a unified view of all your jobs (in this case, crawler jobs).   For this reason, you need to give a unique name to this installation.   It can be anything you like.

This tutorial will pretend we are only monitoring crawlers found on a dedicated server. We’ll call this installation “Crawler Server”.

Noroconex Collector Jobs to Monitor

This is where we tell JEF Monitor where our crawlers are running. For JEF, a Norconex Collector and its configured crawlers are treated as “jobs.” When running, each Norconex Collector configured creates an .index file in a subdirectory of the collector progress directory called “latest”.   A collector progress directory can be configured using the <progressDir> configuration option.

We need to tell JEF Monitor about your Collector jobs. Click on Add Files…

In this tutorial, we’re pretending we have an HTTP Collector set up to crawl Wikipedia. We called it “Wikipedia Crawl” with two crawlers: “Wikipedia English” and “Wikipedia French” (to be shown in JEF Monitor later).

The index file can be found in this location:

[…]/wikipedia/progress/latest/Wikipedia_32_Crawl.index

Select your own index file, and click the “Choose” button.

You should see your selection in the list of jobs to monitor. If you have more than one Norconex Collector installation you want to monitor, repeat the exercise. Alternatively, if you have multiple progress files in a directory, have sub-directories, or have not yet executed your Norconex Collector installation, you can add a directory to be monitored.   Index files found under the selected directories will show up when they get created.

When you are done, click “Continue”.

With each JEF “job” being monitored, you can optionally perform “actions.”   With the default installation of JEF Monitor, two actions for viewing the logs in your browser are available and already configured.   Leave those there, and click “Continue”.

Happy Monitoring!

Launch your Norconex Collector as you normally do, and you should eventually see its progress automatically updated.

To monitor additional Norconex Collector installations, click on “Monitored Jobs” under the “Settings” menu. You will then be presented with the now familiar “Jobs to monitor” screen (similar to the one higher up).

More options are available in JEF Monitor, such as tracking remote JEF Monitor installation from this one.

Experiment and have fun.

Norconex just released major upgrades to all its Norconex Collectors and related projects.  That is, Norconex HTTP Collector and Norconex Filesystem Collector, along with the Norconex Importer module and all available committers (Solr, Elasticsearch, HP IDOL, etc), were all upgraded to version 2.0.0.

With these major product upgrades come a new website that makes it easier to get all the software you need in one location: the Norconex Collectors website.  At a quick glance you can find all Norconex Collectors and Committers available for download.

Among the new features added to your crawling arsenal you will find:

  • Can now split a document into multiple documents.
  • Can now treat embedded documents as individual documents (like documents found in zip files or in other documents such as Word files).
  • Language detection (50+ languages).
  • Parsing and formatting of dates from/to any format.
  • Character case modifiers.
  • Can now index basic content statistics with each documents (word count, average word length, average words per sentences, etc).
  • Can now supply a “seed file” for listing start URLs or start paths to your crawler.
  • Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used.  This reduces I/O and improves performance.
  • New event model where listeners can listen for any type of crawler events.
  • Can now  ignore parsing of specific content types.
  • Can filter documents based on arbitrary regular expressions performed on the document content.
  • Enhanced debugging options, where you can print out specific field content as they are being processed.
  • HTTP Collector: Can add link names to the document the links are pointing to (e.g. to create cleaner titles).
  • More…

Another significant change is all Norconex open-source projects are now licensed under The Apache License 2.0.   We hope this will facilitate adoption with third party commercial offerings.

It is important to note version 2.0.0 are not compatible with their previous 1.x version.  The configuration options changed in many areas so do not expect to run your existing configuration under 2.0.0.   Please refer to the latest documentation for new and modified configuration options.

Visit to the new Norconex Collectors website now.

This feature release brings the following additions…

Simple Pipeline

Useful if you want to quickly assemble multiple tasks to be run into a single “pipeline” while keeping it ultra simple.  The following example does it all in a single class only to keep it short.

 Cacheable Streams

There are several excellent object caching mechanism available to Java already if you need something sophisticated.   This release offers a very lightweight cache implementation that can make InputStream and OutputStream reusable.  It stores the stream in memory until a configurable threshold is reached, after which it switches to fast file lookup.   A CachedStreamFactory is used to obtain cached streams sharing the same pool of memory.

 Enhanced XML Writing

The Java XMLStreamWriter is a useful class, but is a bit annoying to use when you are not always writing strings.   The EnhancedXMLStreamWriter add convenience method for primary types and others.

More Equality checks

More methods were added to EqualUtils:

Discover More Features

A few more features and updates were made to the Norconex Commons Lang library.   For more information, check out the full release notes.

Download your copy now.