During a recent client project, I had to crawl several websites, each with its own specific requirements. For example, one of the websites required:
- the content of a meta tag to be used in place of each page's actual URL (see the sketch after this list),
- the header, footer, and any other repetitive content to be excluded from each page,
- robots.txt to be ignored, since it is meant for external crawlers only (Google, Bing, etc.), and
- the pages to be indexed in LucidWorks.
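
To make the first two requirements concrete, here is a minimal sketch of the kind of per-page processing they imply, assuming jsoup for HTML parsing. The meta tag name (`crawl-url`) and the header/footer selectors are hypothetical placeholders rather than the client's actual markup; robots.txt handling is not shown because it belongs in the crawler's fetch configuration, not in page processing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PagePreprocessor {

    /** The URL to index the page under, plus the cleaned text to send to the indexer. */
    public record ProcessedPage(String indexUrl, String content) {}

    public static ProcessedPage process(String fetchedUrl, String html) {
        Document doc = Jsoup.parse(html, fetchedUrl);

        // Requirement 1: if the page declares a replacement URL in a meta tag,
        // index the document under that URL instead of the one it was fetched from.
        // "crawl-url" is a hypothetical meta tag name.
        String metaUrl = doc.select("meta[name=crawl-url]").attr("content");
        String indexUrl = metaUrl.isEmpty() ? fetchedUrl : metaUrl;

        // Requirement 2: strip the header, footer, and other repetitive blocks
        // before extracting the text that will actually be indexed.
        doc.select("header, footer, nav, .sidebar").remove();

        return new ProcessedPage(indexUrl, doc.body().text());
    }
}
```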
LucidWorks' built-in web crawler is based on Aperture. It works well for basic web crawls, but it could not provide the more advanced features I needed. I had to configure LucidWorks with an external crawler that offered more advanced built-in capabilities and could be extended with new functionality.