A No-Code Solution for Extending Norconex File System Crawler

Introduction

Managing vast amounts of data stored across various file systems can be a daunting task. But it doesn’t have to be! Norconex File System Crawler comes to the rescue, offering a robust solution for efficiently extracting, organizing, and indexing your files.

But did you know you can extend its capabilities without writing a single line of code? In this blog post, you’ll learn how to connect an external application to the Crawler and unleash its full potential.

The Use Case

Both Norconex File System Crawler and Norconex Web Crawler utilize Norconex Importer to extract data from documents. Right out of the box, the Importer supports various file formats, as documented here. But you may encounter a scenario where the Importer cannot parse a document. 

One such example is a RAR5 document. At the time of this writing, the latest version of File System Crawler is 2.9.1. Extracting a RAR5 file with this version throws the following exception.

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@35f95a13
...
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@35f95a13
...
Caused by: java.lang.NullPointerException: mainheader is null
...

As you can see, Apache Tika’s RarParser class cannot extract the document. You’ll see how to work around this issue below.

Note: This blog post will focus on a no-code solution. However, if you can code, writing your own custom parser is highly recommended. Look at the Extend the File System Crawler section of the documentation on accomplishing just that.

ExternalTransformer to the Resuce

Many applications support the extraction of RAR files. One such application is 7zip. If you need to, go ahead and install 7zip on your machine now. You’ll need the application moving forward.

Overview

You will run 2 crawlers separately. The first crawls everything normally while ignoring RAR files. It will use the ExternalTransformer to extract the RAR file contents to folder X and do no further processing of the file. The second will crawl the extracted files in folder X.

Configs

Config for the first crawler is as follows, with helpful comment explanations of various options.

<?xml version="1.0" encoding="UTF-8"?>
<fscollector id="fs-collector-main">

#set($workdir = .\workdir-main)
#set($startDir = .\input)
#set($extractedDir = .\extracted)
#set($tagger = "com.norconex.importer.handler.tagger.impl")
#set($filter = "com.norconex.importer.handler.filter.impl")
#set($transformer = "com.norconex.importer.handler.transformer.impl")

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>
	<crawler id="fs-crawler-main">
  	<workDir>${workdir}</workDir>
  	<startPaths>
    	<path>${startDir}</path>
  	</startPaths>
 	 
  	<importer>
    	<!-- do the following before attempting to parse a file -->
    	<preParseHandlers>
      	<transformer class="${transformer}.ExternalTransformer">
        	<!-- apply this transfomer to .rar files only -->
        	<restrictTo field="document.reference">.*\.rar$</restrictTo>
        	<!--
          	calls on 7zip to uncompress the file and place the contents in `extracted` dir
        	-->
        	<command>'C:\Program Files\7-Zip\7z.exe' e ${INPUT} -o${extractedDir} -y</command>
        	<metadata>
          	<pattern toField="extracted_paths" valueGroup="1">
            	^Path = (.*)$
          	</pattern>
        	</metadata>
        	<tempDir>${workdir}/temp</tempDir>
      	</transformer>

      	<!-- stop further processing of .rar files -->
      	<filter class="${filter}.RegexReferenceFilter" onMatch="exclude">
        	<regex>.*\.rar$</regex>
      	</filter>
   	 
    	</preParseHandlers>
  	</importer>
 	 
  	<!--
    	commit extracted files to the local FileSystem
    	You can substitute this with any of the available committers
  	-->
  	<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
    	<directory>${workdir}/crawledFiles</directory>
  	</committer>
	</crawler>
    
  </crawlers>

</fscollector>

This crawler will parse all files normally, except RAR files. When encountering a RAR file, the Crawler will call upon 7zip to extract RAR files and place the extracted files under an extracted folder. No further processing will be done on these RAR files.

The second crawler is configured to simply extract files within the extracted folder. Here is the configuration:

<?xml version="1.0" encoding="UTF-8"?>
<fscollector id="fs-71-collector-extracted">

#set($workdir = .\workdir-extracted)
#set($startDir = .\extracted)

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>

	<crawler id="fs-crawler-extracted">
  	<startPaths>
    	<path>${startDir}</path>
  	</startPaths>

  	<!--
    	commit extracted files to the local FileSystem
    	You can substitute this with any of the available committers
  	-->
  	<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
    	<directory>${workdir}/crawledFiles</directory>
  	</committer>
	</crawler>
    
  </crawlers>

</fscollector>

There you have it! You just extended the capabilities of the File System Crawler without writing a single line of code – a testament to the incredible flexibility offered by the Crawler.

Conclusion

Norconex File System Crawler is undeniably a remarkable tool for web crawling and data extraction. Even more impressive is the ease with which you can extend the Crawler’s capabilities, all without the need for coding expertise. Whether you’re a seasoned professional or just getting started, let the Norconex File System Crawler – free from the complexities of coding – become your trusted companion in unleashing the full potential of your data management endeavours. Happy indexing!

Harinder has over 15 years of Software Development and Consulting experience with small and large organizations. His specialty is in the world of Enterprise Search.