Scalable Web Fetches using Serverless

Let's say you have a list of URLs you need to fetch for some reason - perhaps to check that they still exist, perhaps to parse their content for updates, whatever.

If the list is small - say up to 1000 URLs - this is pretty easy to do using just curl(1) or wget(1), e.g.:

INPUT=urls.txt
wget --execute robots=off --adjust-extension --convert-links \
  --force-directories --no-check-certificate --no-verbose \
  --timeout=120 --tries=3 -P ./tmp --warc-file="${INPUT%.txt}" \
  -i "$INPUT"

This iterates over all the URLs in the input file (urls.txt, passed in as $INPUT) and fetches them one by one, capturing them in WARC format. Easy.

But if your URL list is long - thousands or millions of URLs - this is going to be too slow to be practical. This is a classic embarrassingly parallel problem, so the obvious way to make it scalable is to split your input file up, run multiple fetches in parallel, and then merge your output files (i.e. a kind of map-reduce job).
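To make the split / parallel fetch / merge idea concrete, here is a minimal sketch in Go. It is only an illustration: the names chunk and processAll are hypothetical, and the work function is a stand-in that does no real fetching.

```go
package main

import (
	"fmt"
	"sync"
)

// chunk splits urls into consecutive slices of at most size elements.
func chunk(urls []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(urls); start += size {
		end := start + size
		if end > len(urls) {
			end = len(urls)
		}
		out = append(out, urls[start:end])
	}
	return out
}

// processAll runs work on every chunk using at most workers goroutines
// (the "map" step) and merges the per-chunk results in input order
// (the "reduce" step).
func processAll(chunks [][]string, workers int, work func([]string) []string) []string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, len(chunks))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handled by exactly one goroutine.
				results[i] = work(chunks[i])
			}
		}()
	}
	for i := range chunks {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	var merged []string
	for _, r := range results {
		merged = append(merged, r...)
	}
	return merged
}

func main() {
	urls := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		urls = append(urls, fmt.Sprintf("https://example.com/page/%d", i))
	}
	chunks := chunk(urls, 300)

	// Stand-in for a real fetch: just tags each URL as processed.
	work := func(part []string) []string {
		out := make([]string, len(part))
		for i, u := range part {
			out[i] = "fetched " + u
		}
		return out
	}
	merged := processAll(chunks, 4, work)
	fmt.Println(len(chunks), len(merged)) // prints: 4 1000
}
```

On a single machine this only gets you as far as one box's bandwidth and CPU, which is why the next step is to spread the chunks across many instances.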

The catch is that you now need to run this on multiple machines, and setting up, managing and tearing down those machines becomes the core of the problem. Really, you don't want to worry about machines at all; you just want an operating system instance available that you can make use of.

This is the promise of so-called serverless architectures such as AWS "Lambda" and Google Cloud's "Cloud Functions", which provide a container-like environment for computing, without actually having to worry about managing the containers. The serverless environment spins up instances on demand, and then tears them down after a fixed period of time or when your job completes.

So to try out this serverless paradigm on our web fetch problem, I've written cloudfunc-geturilist, a Google Cloud Platform "Cloud Function" written in Go. It is triggered by input files being written into an input Google Cloud Storage (GCS) bucket, and writes its output files to a separate GCS output bucket.
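As a rough idea of the work each invocation does for one input file, here is a small Go sketch. It is not the actual cloudfunc-geturilist code: the GCS trigger and bucket I/O are left out entirely, the names processList, fetchFunc and httpFetch are hypothetical, and the output format (URL, tab, status code) is an assumption made for illustration. The 120 second timeout simply mirrors the wget example.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

// fetchFunc returns the HTTP status code for a URL, or an error.
type fetchFunc func(url string) (int, error)

// processList calls fetch for every non-blank line of input and returns
// one result line per URL: "<url>\t<status>" or "<url>\terror: <msg>".
func processList(input string, fetch fetchFunc) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		url := strings.TrimSpace(sc.Text())
		if url == "" {
			continue
		}
		status, err := fetch(url)
		if err != nil {
			out = append(out, fmt.Sprintf("%s\terror: %v", url, err))
			continue
		}
		out = append(out, fmt.Sprintf("%s\t%d", url, status))
	}
	return out
}

// httpFetch builds a fetchFunc backed by a real HTTP client.
func httpFetch(client *http.Client) fetchFunc {
	return func(url string) (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: fetchlist <file-of-urls>")
		return
	}
	// In a cloud function the input would come from the triggering
	// bucket object rather than a local file (assumption).
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client := &http.Client{Timeout: 120 * time.Second}
	for _, line := range processList(string(data), httpFetch(client)) {
		fmt.Println(line)
	}
}
```

Passing the fetcher in as a function keeps the per-file logic testable without any network access, which is handy when the real thing only runs inside a cloud environment.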

See the README instructions if you'd like to try it out (which you can do using a GCP free tier account).

In terms of scalability, this seems to work pretty well. The biggest file I've run so far has been 100k URLs, split into 334 input files of up to 300 URLs each. With MAX_INSTANCES=20, cloudfunc-geturilist processes these 100k URLs in about 18 minutes; with MAX_INSTANCES=100 that drops to about 5 minutes. All at a cost of a few cents.
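For a sense of what those numbers mean as throughput, here is the back-of-envelope arithmetic as a short Go snippet. It uses only the figures quoted above (which are approximate timings, so the rates are too); the helper names are hypothetical.

```go
package main

import "fmt"

// filesNeeded returns how many input files are needed to hold total
// URLs at perFile URLs each (ceiling division).
func filesNeeded(total, perFile int) int {
	return (total + perFile - 1) / perFile
}

// urlsPerSecond returns the average throughput for total URLs
// processed in the given number of minutes.
func urlsPerSecond(total int, minutes float64) float64 {
	return float64(total) / (minutes * 60)
}

func main() {
	const total = 100000

	fmt.Println("input files:", filesNeeded(total, 300))

	runs := []struct {
		instances int
		minutes   float64
	}{
		{20, 18},
		{100, 5},
	}
	for _, run := range runs {
		rate := urlsPerSecond(total, run.minutes)
		fmt.Printf("MAX_INSTANCES=%d: %.1f URLs/s overall, %.1f per instance\n",
			run.instances, rate, rate/float64(run.instances))
	}
}

// Output:
// input files: 334
// MAX_INSTANCES=20: 92.6 URLs/s overall, 4.6 per instance
// MAX_INSTANCES=100: 333.3 URLs/s overall, 3.3 per instance
```

So five times the instances gives roughly 3.6 times the throughput on these timings: not perfectly linear, but close enough to be useful.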

That's a fair bit quicker than having to spin up 100 container instances myself, or than using wget!