The Architecture of a Crawler

I’m going to explain how crawlers work. A crawler has three main tasks to take care of.

  1. Find new hosts to crawl.
  2. Request data from a host that is being crawled.
  3. Present the gathered data to the user.

This design lends itself well to being distributed. Several host crawlers (those that perform task 2) can all work in parallel and independently. All the host crawlers need is a coordinator (the one that performs task 1) to feed them lists of hosts with no duplicates. The host crawlers send their responses back to the coordinator, which extracts new hosts from the responses and then stores them. Lastly, the aggregator or statistics generator (the one that performs task 3) periodically runs through all the data collected and creates useful ways for the user to view this information.
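The loop above can be sketched in a few lines. This is a minimal single-process sketch, not a real distributed system: `fetch` is a hypothetical stand-in for whatever protocol exchange the host crawler actually performs, and it is assumed to return a response plus any newly discovered hosts. The coordinator's deduplication is just a shared set.

```python
import queue
import threading

def crawl(seed_hosts, fetch, num_crawlers=4):
    """Coordinator (task 1) feeding parallel host crawlers (task 2).

    fetch(host) -> (response, new_hosts) is an assumed callable standing
    in for the real network exchange with one host.
    """
    to_crawl = queue.Queue()
    seen = set()                   # coordinator's dedup: never hand out a host twice
    seen_lock = threading.Lock()
    results = {}                   # host -> response, stored by the coordinator
    results_lock = threading.Lock()

    def enqueue(host):
        with seen_lock:
            if host in seen:
                return
            seen.add(host)
        to_crawl.put(host)

    def host_crawler():
        while True:
            host = to_crawl.get()
            if host is None:       # sentinel: shut this worker down
                to_crawl.task_done()
                return
            response, new_hosts = fetch(host)
            with results_lock:
                results[host] = response
            for h in new_hosts:    # feed discoveries back to the coordinator
                enqueue(h)
            to_crawl.task_done()

    workers = [threading.Thread(target=host_crawler) for _ in range(num_crawlers)]
    for w in workers:
        w.start()
    for h in seed_hosts:
        enqueue(h)
    to_crawl.join()                # all reachable hosts crawled
    for _ in workers:
        to_crawl.put(None)
    for w in workers:
        w.join()
    return results
```

Note that workers enqueue newly found hosts *before* marking the current one done, so `to_crawl.join()` cannot return while discoveries are still outstanding.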

That’s how a crawler works in general terms. But the fun begins when a crawler has to actually be built. One of the most important decisions is how to store the data collected and how to store and distribute the list of hosts that need to be crawled next.

Relational Database Approach

This is the approach that g2paranha takes. It’s a straightforward way to store data for anyone trained on these traditional databases. But the data that a crawler needs to store is for the most part not relational, except for the links between hosts. Another problem is that many relational databases use coarse-grained locking, blocking access to large chunks of the data whenever it is being written or read. This creates a huge bottleneck in a distributed environment where lots of reads and writes happen concurrently. So while it is easy to implement, it may not be an optimal solution. On the positive side, the extremely powerful SQL language is available for extracting statistics from the data.

Non-relational Database Approach

This is the direction I have been heading in for the crawler. It seems to be a good fit for the type of data that the crawler needs to store. But I have very little experience in this area. So far my only forays into this field have been with CouchDB, which is still in heavy development. CouchDB looks promising but I haven’t had much luck with getting it to work.
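To make the “good fit” claim concrete: a document store like CouchDB holds schema-free JSON documents, so each crawl response can be stored whole, links and all, with no joins. The document shape below is entirely hypothetical, just one plausible way to lay out a response.

```python
import json

# Hypothetical CouchDB-style document for one crawl response:
# schema-free JSON, one document per response.
doc = {
    "_id": "10.0.0.1:6346",     # the host address can double as the document id
    "vendor": "RAZA",           # illustrative sample values
    "leaves": 120,
    "neighbours": ["10.0.0.2:6346", "10.0.0.3:6346"],
}
body = json.dumps(doc)          # what would actually be sent to the database
```

A new field in a later crawler version is just a new key in new documents; nothing like an `ALTER TABLE` migration is needed, which is part of the appeal for this kind of loosely structured data.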

So if anyone has experience in non-relational databases or in creating distributed crawlers I’d like to hear from you.