In my previous article, I talked about the new Config Starter and its features. This article is a follow-up: now that you know how to generate a crawler configuration file, I will walk you through the steps needed to start crawling a website yourself.
We will be using the Tokyo 2020 Olympic Games website as the crawl site in this article. The steps are as follows:
- First, you will need to generate a basic configuration file targeting the Olympic website, using the Config Starter. In this example, I am targeting English content only, so I am excluding all URLs corresponding to the other languages on the website.
*Note that using the Config Starter is optional, as it only produces a basic configuration file. If you need a more complete setup, you can write your own configuration file by following the HTTP Collector documentation.
- With your configuration file generated, the next step is to download the Norconex HTTP Collector to your computer from the Norconex Open-Source website and unzip it. If you are using the Config Starter, you will need to download version 3.x.
- Once the HTTP Collector is downloaded and unzipped, open a command-line terminal and change to the folder you just extracted. To do this, use the following command with your own directory (Windows path shown): cd C:\file\directory\of\the\collector
- With your command-line terminal open in that folder, enter the following line, replacing the path with the path to your configuration file:
Windows: collector-http.bat start -config=/path/to/config.xml
Linux: collector-http.sh start -config=/path/to/config.xml
Congratulations! You are now running your crawler. If all went according to plan, you should see something similar to the next image, and the crawled data should now be in the directory created by the committer (if you are using the same committer as me, it should be in the “work” folder).
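If you plan to launch the crawl more than once on Linux, the last two steps can be wrapped in a small shell function. The following is only a sketch under a few assumptions: `run_crawl` and its two arguments are names I made up for illustration, the launcher script is assumed to be executable, and the configuration file is assumed to be given as an absolute path, since the function changes directory before starting the crawler.

```shell
#!/bin/sh
# Sketch only: run_crawl is an illustrative helper, not part of the HTTP Collector.
# Usage: run_crawl <collector-folder> <absolute-path-to-config.xml>
run_crawl() {
  collector_dir=$1
  config_file=$2

  # Fail early if the unzipped collector folder does not contain the launcher.
  if [ ! -f "$collector_dir/collector-http.sh" ]; then
    echo "launcher not found in: $collector_dir" >&2
    return 1
  fi

  # Fail early if the configuration file is missing.
  if [ ! -f "$config_file" ]; then
    echo "configuration file not found: $config_file" >&2
    return 2
  fi

  # Run the same command as in the last step, from inside the collector folder.
  # The subshell keeps the caller's working directory unchanged.
  (cd "$collector_dir" && ./collector-http.sh start -config="$config_file")
}
```

You would then call it as `run_crawl /path/to/collector /path/to/config.xml`. The two checks exist only to give a clearer message than the crawler would when one of the paths is mistyped.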
Now that you have crawled the Olympic site, go and collect your gold medal!
If you encounter any issues during the process, you can find resolutions on the HTTP Collector GitHub issues page.
Try out the new Norconex HTTP Collector Config Starter.
Learn more about the inner workings of the Norconex HTTP Collector.