Scalable Web Fetches using Serverless

Let's say you have a list of URLs you need to fetch for some reason - perhaps to check that they still exist, perhaps to parse their content for updates, whatever.

If the list is small - say up to 1000 urls - this is pretty easy to do using just curl(1) or wget(1) e.g.

wget --execute robots=off --adjust-extension --convert-links \
  --force-directories --no-check-certificate --no-verbose \
  --timeout=120 --tries=3 -P ./tmp --warc-file="${INPUT%.txt}" \
  -i "$INPUT"

This iterates over all the urls in the file named by $INPUT (e.g. urls.txt) and fetches them one by one, capturing them in WARC format. Easy.

But if your url list is long - thousands or millions of urls - this is going to be too slow to be practical. This is a classic Embarrassingly Parallel problem, so to make this scalable the obvious solution is to split your input file up and run multiple fetches in parallel, and then merge your output files (i.e. a kind of map-reduce job).
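As a rough illustration of the "split" step, here is a minimal Python sketch that breaks a URL list into fixed-size chunks and writes each chunk to its own input file. The function names, the `urls-NNNNN.txt` file naming, and the default chunk size of 300 are assumptions made for this example, not anything taken from a particular tool.

```python
from pathlib import Path
from typing import Iterator, List


def chunk_urls(urls: List[str], chunk_size: int = 300) -> Iterator[List[str]]:
    """Yield successive chunks of at most chunk_size URLs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(urls), chunk_size):
        yield urls[start:start + chunk_size]


def write_chunks(urls: List[str], out_dir: str, chunk_size: int = 300) -> List[Path]:
    """Write each chunk to its own file (hypothetical naming: urls-00000.txt, ...)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunk_urls(urls, chunk_size)):
        path = out / f"urls-{i:05d}.txt"
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Each resulting file can then be handed to a separate fetcher, and the per-file outputs merged afterwards. Note that the last chunk will usually be smaller than the rest.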

But then your problem becomes that you need to run this on multiple machines, and setting up and managing and tearing down those machines becomes the core of the problem. But really, you don't want to worry about machines, you just want an operating system instance available that you can make use of.

This is the promise of so-called serverless architectures such as AWS "Lambda" and Google Cloud's "Cloud Functions", which provide a container-like environment for computing, without actually having to worry about managing the containers. The serverless environment spins up instances on demand, and then tears them down after a fixed period of time or when your job completes.

So to try out this serverless paradigm on our web fetch problem, I've written cloudfunc-geturilist, a Google Cloud Platform "Cloud Function" written in go, that is triggered by input files being written into an input Google Cloud Storage bucket, and writes its output files to another GCS output bucket.

See the README instructions if you'd like to try it out (which you can do using a GCP free tier account).

In terms of scalability, this seems to work pretty well. The biggest file I've run so far has been 100k URLs, split into 334 input files each containing 300 URLs. With MAX_INSTANCES=20, cloudfunc-geturilist processes these 100k URLs in about 18 minutes; with MAX_INSTANCES=100 that drops to 5 minutes. All at a cost of a few cents.

That's a fair bit quicker than having to run up 100 container instances myself, or than using wget!

My Personal URL Shortener

I wrote a really simple personal URL shortener a couple of years ago, and have been using it happily ever since. It's called shrtn ("shorten"), and is just a simple perl script that captures (or generates) a mapping between a URL and a code, records in a simple text db, and then generates a static html file that uses HTML meta-redirects to point your browser towards the URL.

It was originally based on posts from Dave Winer and Phil Windley, but was interesting enough that I felt the itch to implement my own.

I just run it on my laptop (shrtn <url> [<code>]), and it has settings to commit the mapping to git and push it out to a remote repo (for backup), and to push the generated html files up to a webserver somewhere (for serving the html).
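For anyone curious what the static-html side of this looks like, here is a minimal sketch in Python (shrtn itself is perl, and this is not its code). The function names, the allowed code alphabet, and the `<code>.html` file layout are all assumptions for illustration only.

```python
import html
import re
from pathlib import Path

# Assumed alphabet for short codes; keeps codes safe to use as file names.
CODE_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def redirect_page(url: str) -> str:
    """Return a static HTML page that meta-redirects the browser to url."""
    safe_url = html.escape(url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="0; url={safe_url}">\n'
        "<title>Redirecting</title>\n"
        "</head><body>\n"
        f'<p>Redirecting to <a href="{safe_url}">{safe_url}</a></p>\n'
        "</body></html>\n"
    )


def write_redirect(code: str, url: str, out_dir: str) -> Path:
    """Write the redirect page for code into out_dir (hypothetical layout)."""
    if not CODE_PATTERN.fullmatch(code):
        raise ValueError(f"invalid code: {code!r}")
    path = Path(out_dir) / f"{code}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(redirect_page(url), encoding="utf-8")
    return path
```

The fallback link in the body is there for clients that ignore the meta refresh.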

Most people seem to like the analytics side of personal URL shorteners (seeing who clicks your links), but I don't really track that side of it at all (it would be easy enough to add Google Analytics to your html files to do that, or just do some analysis on the access logs). I mostly wanted it initially to post nice short links when microblogging, where post length is an issue.

Surprisingly though, the most interesting use case in practice is the ability to give custom mnemonic codes to URLs I use reasonably often, or cite to other people a bit. If I find myself sharing a URL with more than a couple of people, it's easier just to create a shortened version and use that instead - it's simpler, easier to type, and easier to remember for next time.

So my shortener has sort of become a cross between a Level 1 URL cache and a poor man's bookmarking service. For instance:

If you don't have a personal url shortener you should give it a try - it's a surprisingly interesting addition to one's personal cloud. And all you need to try it out is a domain and some static webspace somewhere to host your html files.

Too easy.

[ Technical Note: html-based meta-redirects work just fine with browsers, including mobile and text-only ones. They don't work with most spiders and bots, however, which may be a bug or a feature, depending on your usage. For a personal url shortener meta-redirects probably work just fine, and you gain all the performance and stability advantages of static html over dynamic content. For a corporate url shortener where you want bots to be able to follow your links, as well as people, you probably want to use http-level redirects instead. In which case you either go with a hosted option, or look at something like YOURLS for a slightly more heavyweight self-hosted option. ]

e-billing Still Sucks

You'd think that 20 years into the Web we'd have billing all sorted out. (I've got in view here primarily bill/invoice delivery, rather than payments, and consumer-focussed billing, rather than B2B invoicing).

We don't. Our bills are probably as likely to still come on paper as in digital versions, and the current "e-billing" options all come with significant limitations (at least here in Australia - I'd love to hear about awesome implementations elsewhere!)

Here, for example, are a representative set of my current vendors, and their billing delivery options (I'm not picking on anyone here, just grounding the discussion in some specific examples).

Vendor            Billing Delivery Options
Citibank          email, paper, web
iinet             email, paper?, web
Kuringai Council  BPayView, paper
Origin Energy     email, paper
Sydney Water      Australia Post Digital Mailbox, BPayView, paper

So that all looks pretty reasonable, you might say. All your vendors have some kind of e-billing option. What's the problem?

The current e-billing options

Here's how I'd rate the various options available:

  • email: email is IMO the best current option for bill delivery - it's decentralised, lightweight, push-rather-than-pull, and relatively easy to integrate/automate. Unfortunately, not everyone offers it, and sometimes (e.g. Citibank) they insist on putting passwords on the documents they send out via email on the grounds of 'security'. (On the other hand, emails are notoriously easy to fake, so faking a bill email is a straightforward attack vector if you can figure out customer-vendor relationships.)

    (Note too that most of the non-email e-billing options still use email for sending alerts about a new bill, they just don't also send the bill through as an attachment.)

  • web (i.e. a company portal of some kind which you log into and can then download your bill): this is efficient for the vendor, but pretty inefficient for the customer - it requires going to the particular website, logging in, and navigating to the correct location before you can view or download your bill. So it's an inefficient, pull-based solution, requiring yet another username/password, and with few integration/automation options (and security issues if you try).

  • BPayView / Australia Post Digital Mailbox: for non-Australians, these are free (for consumers) solutions for storing and paying bills offered by a consortium of banks (BPayView) and Australia Post (Digital Mailbox) respectively. These provide a pretty decent user experience in that your bills are centralised, and they can often parse the bill payment options and make the payment process easy and less error-prone. On the other hand, centralisation is a two-edged sword, as it makes it harder to change providers (can you get your data out of these providers?); it narrows your choices in terms of bill payment (or at least makes certain kinds of payment options easier than others); and it's basically still a web-based solution, requiring login and navigation, and very difficult to automate or integrate elsewhere. I'm also suspicious of 'free' services from corporates - clearly there is value in driving you through their preferred payment solutions and/or in the transaction data itself, or they wouldn't be offering it to you.

    Also, why are there limited providers at all? There should be a standard in place so that vendors don't have to integrate separately with each provider, and so that customers have maximum choice in whom they wish to deal with. Wins all-round.

And then there's the issue of formats. I'm not aware of any Australian vendors that bill customers in any format except PDF - are there any?

PDFs are reasonable for human consumption, but billing should really be done (instead of, or as well as) in a format meant for computer consumption, so they can be parsed and processed reliably. This presumably means billing in a standardised XML or JSON format of some kind (XBRL?).

How billing should work

Here's a strawman workflow for how I think billing should work:

  • the customer's profile with the vendor includes a billing delivery URL, which is a vendor-specific location supplied by the customer to which their bills are to be HTTP POST-ed. It should be an HTTPS URL to secure the content during transmission, and the URL should be treated by the vendor as sensitive, since its possession would allow someone to post fake invoices to the customer

  • if the vendor supports more than one bill/invoice format, the customer should be able to select the format they'd like

  • the vendor posts invoices to the customer's URL and gets back a URL referencing the customer's record of that invoice. (The vendor might, for instance, be able to query that record for status information, or they might supply a webhook of their own to have status updates on the invoice pushed back to them.)

  • the customer's billing system should check that the posted invoice has the correct customer details (at least, for instance, the vendor/customer account number), and ideally should also check the bill payment methods against an authoritative set maintained by the vendor (this provides protection against someone injecting a fake invoice into the system with bogus bill payment details)

  • the customer's billing system is then responsible for facilitating the bill payment manually or automatically at or before the due date, using the customer's preferred payment method. This might involve billing calendar feeds, global or per-vendor preferred payment methods, automatic checks on invoice size against vendor history, etc.

  • all billing data (ideally fully parsed, categorised, and tagged) is then available for further automation / integration e.g. personal financial analytics, custom graphing, etc.

This kind of solution would give the customer full control over their billing data, the ability to choose a billing provider that's separate from (and more agile than) their vendors and banks, as well as significant flexibility to integrate and automate further. It should also be pretty straightforward on the vendor side - it just requires a standard HTTP POST and provides immediate feedback to the vendor on success or failure.
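To make the checking step in the strawman above a bit more concrete, here is a minimal Python sketch of what the customer's billing system might do when an invoice is posted. Everything here is hypothetical: the `Vendor` and `Invoice` shapes, the field names, and the payment-method strings are assumptions for illustration, not part of any existing standard.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Vendor:
    name: str
    account_number: str              # the customer's account number with this vendor
    payment_methods: FrozenSet[str]  # authoritative set published by the vendor


@dataclass(frozen=True)
class Invoice:
    vendor_name: str
    account_number: str
    amount_cents: int
    payment_method: str


def validate_invoice(invoice: Invoice, vendors: Dict[str, Vendor]) -> List[str]:
    """Return a list of problems found; an empty list means the invoice is accepted."""
    vendor = vendors.get(invoice.vendor_name)
    if vendor is None:
        return ["unknown vendor"]
    problems = []
    if invoice.account_number != vendor.account_number:
        problems.append("account number mismatch")
    if invoice.payment_method not in vendor.payment_methods:
        problems.append("unrecognised payment method")
    if invoice.amount_cents <= 0:
        problems.append("non-positive amount")
    return problems
```

An invoice with bogus payment details fails the second check even if the account number is right, which is the protection against injected fake invoices described above.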

Why doesn't this exist already - it doesn't seem hard?

Missing Delicious Feeds

I've been playing with using delicious as a lightweight URL database lately, mostly for use by greasemonkey scripts of various kinds (e.g. squatter_redirect).

For this kind of use I really just need a lightweight anonymous http interface to the bookmarks, and delicious provides a number of nice lightweight RSS and JSON feeds suitable for this purpose.

But it turns out the feed I really need isn't currently available. I mostly want to be able to ask, "Give me the set of bookmarks stored for URL X by user Y", or even better, "Give me the set of bookmarks stored for URL X by users Y, Z, and A".

Delicious have a feed for recent bookmarks by URL: {format}/url/{url md5}

and a feed for all a user's bookmarks: {format}/{username}

and feeds for a user's bookmarks limited by tag(s): {format}/{username}/{tag[+tag+...+tag]}

but not one for a user limited by URL, or for URL limited by user.

Neither alternative approach is both feasible and reliable: searching by url will only return the most recent set of N bookmarks; and searching by user and walking the entire (potentially large) set of their bookmarks is just too slow.

So for now I'm having to work around the problem by adding a special hostname tag to my bookmarks, and then using the username+tag feed as a proxy for my username+domain search.
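A minimal Python sketch of that workaround might look like the following. The `host:` tag prefix is a made-up convention for this example, and the functions only build the path portion of the feed patterns listed above (the feed base URL is deliberately left out).

```python
import hashlib
from typing import Sequence
from urllib.parse import quote, urlsplit


def url_md5(url: str) -> str:
    """MD5 hex digest of a URL, as used in the {format}/url/{url md5} pattern."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def hostname_tag(url: str, prefix: str = "host:") -> str:
    """Build a hostname tag for a bookmark (the prefix is a hypothetical convention)."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no hostname in {url!r}")
    return prefix + host


def url_feed_path(fmt: str, url: str) -> str:
    """Path for the recent-bookmarks-by-URL feed."""
    return f"{fmt}/url/{url_md5(url)}"


def user_tag_feed_path(fmt: str, username: str, tags: Sequence[str]) -> str:
    """Path for the user-limited-by-tag(s) feed, with each tag percent-encoded."""
    joined = "+".join(quote(tag, safe="") for tag in tags)
    return f"{fmt}/{quote(username, safe='')}/{joined}"
```

The obvious limitation is that this only matches on hostname rather than the full URL, so any per-URL filtering still has to happen on the client side.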

Any cluesticks out there? Any nice delicious folk want to whip up a shiny new feed for the adoring throngs? :-)

Testing Disqus

I'm trying out disqus, since I like the idea of being able to track/collate my comments across multiple endpoints, rather than have them locked in to various blogging systems. So this is a test post to try out commenting. Please feel free to comment ad nauseam below (and sign up for a disqus account, if you don't already have one).

Questions That Cannot Be Answered

Was thinking this morning about my interactions with the web over the last couple of weeks, and how I've been frustrated with not being able to (simply) get answers to relatively straightforward questions from the automated web. This is late 2008, and Google and Google Maps and Wikipedia and Freebase etc. etc. have clearly pushed back the knowledge boundaries here hugely, but at the same time lots of relatively simple questions are as yet largely unanswerable.

By way of qualification, I mean that they are not answerable in an automated fashion, not that they cannot be answered by asking the humans on the web (invoking the 'lazyweb'). I also don't mean that these questions are impossible to answer given the time and energy to collate the results available - I mean that they are not simply and reasonably trivially answerable, more or less without work on my part. (e.g. "How do I get to address X" was kind of answerable before Google Maps, but they were arguably the ones who made it more-or-less trivial, and thereby really solved the problem.)

So in the interests of helping delineate some remaining frontiers, and challenging ourselves, here's my catalogue of questions from the last couple of weeks:

  • what indoor climbing gyms are there in Sydney?

  • where are the indoor climbing gyms in Sydney (on a map)?

  • what are the closest gyms to my house?

  • how much are the casual rates for adults and children for the gyms near my house?

  • what are the opening hours for the gyms near my house?

  • what shops near my house sell the Nintendo Wii?

  • what shops near my house have the Wii in stock?

  • what shops near my house are selling Wii bundles?

  • what is the pricing for the Wii and Wii bundles from shops near my house?

  • of the shops near my house that sell the Wii, who's open late on Thursdays?

  • of the shops near my house that sell the Wii, what has been the best pricing on bundles over the last 6 months?

  • trading off distance to travel against price, where should I buy a Wii?

  • what are the "specials" at the supermarkets near my house this week?

  • given our grocery shopping habits and the current specials, which supermarket should I shop at this week?

  • I need cereal X - do any of the supermarkets have it on special?

That's a useful starting set from the last two weeks. Anyone else? What are your recent questions-that-cannot-be-answered? (And if you blog, tag with #qtcba pretty please).

Banking for Geeks

Heard via @chieftech on twitter that the Banking Technology 2008 conference is on today. It's great to see the financial world engaging with developments online and thinking about new technologies and the Web 2.0 space, but the agenda strikes me as somewhat weird, perhaps driven mainly by the vendors they could get willing to spruik their wares?

How, for instance, can you have a "Banking Technology" conference and not have at least one session on 'online banking'? Isn't this the place where your technology interfaces with your customers? Weird.

My impression of the state of online banking in Australia is pretty underwhelming. As a geek who'd love to see some real technology innovation impact our online banking experiences, here are some wishlist items dedicated to the participants of Banking Technology 2008. I'd love to see the following:

  • Multiple logins to an account e.g. a readonly account for downloading things, a bill-paying account that can make payments to existing vendors, but not configure new ones, etc. This kind of differentiation would allow automation (scripts/services) using 'safe' accounts, without having to put your master online banking details at risk.

  • API access to certain functions e.g. balance checking, transaction downloads, bill payment to existing vendors, internal transfers, etc. Presumably dependent upon having multiple logins (previous), to help mitigate security issues.

  • Tagging functionality - the ability to interactively tag transactions (e.g. 'utilities', 'groceries', 'leisure', etc.), and to get those tags included in transaction reporting and/or downloading. Further, allow autotagging of transactions via descriptions/type/other party details etc.

  • Alert conditions - the ability to set up various kinds of alerts on various conditions, like low or negative balances, large withdrawals, payroll deposit, etc. I'm not so much thinking of plugging into particular alert channels here (email, SMS, IM, etc), just the ability to set 'flags' on conditions.

  • RSS support - the ability to configure various kinds of RSS feeds of 'interesting' data. Authenticated, of course. Examples: per-account transaction feeds, an alert condition feed (low balance, transaction bouncing/reversal, etc.), bill payment feed, etc. Supplying RSS feeds also means that such things can be plugged into other channels like email, IM, twitter, SMS, etc.

  • Web-friendly interfaces - as Eric Schmidt of Google says, "Don't fight the internet". In the online banking context, this means DON'T use technologies that work against the goodness of the web (e.g. frames, graphic-heavy design, Flash, RIA silos, etc.), and DO focus on simplicity, functionality, mobile clients, and web standards (HTML, CSS, REST, etc.).

  • Web 2.0 goodness - on the nice-to-have front (and with the proviso that it degrades nicely for non-javascript clients) it would be nice to see some ajax goodness allowing more friendly and usable interfaces and faster response times.
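As an indication of how little is needed for the autotagging item in the list above, here is a minimal rule-based sketch in Python. The rule format (a substring to look for in the transaction description, paired with a tag) and the example descriptions are assumptions for illustration only; a real bank would presumably also match on transaction type and other party details.

```python
from typing import List, Sequence, Tuple

# A rule is (substring to look for in the description, tag to apply).
Rule = Tuple[str, str]


def autotag(description: str, rules: Sequence[Rule]) -> List[str]:
    """Return the tags whose rule substring appears in the description.

    Matching is case-insensitive, and each tag is returned at most once,
    in the order its first matching rule appears.
    """
    desc = description.lower()
    tags: List[str] = []
    for needle, tag in rules:
        if needle.lower() in desc and tag not in tags:
            tags.append(tag)
    return tags
```

Tags produced this way could then be included in transaction downloads alongside any tags added interactively.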

Other things I've missed? Are there banks out there already offering any of these?

Super Simple Public Location Broadcasting

I've been thinking about Yahoo's new fire eagle location-broking service over the last few days. I think it is a really exciting service - potentially a game changer - and has the potential to move publishing and using location data from a niche product to something really mainstream. Really good stuff.

But as I posted here, I also think fire eagle (at least as it's currently formulated) is probably only usable by a relatively small section of the web - roughly the relatively sophisticated "web 2.0" sites who are comfortable with web services and api keys and protocols like OAuth.

For the rest of the web - the long web 1.0 tail - the technical bar is simply too high for fire eagle as it stands to be useful and usable.

In addition, fire eagle as it currently stands is unicast, acting as a mediator between you and some particular app acting as a producer or a consumer of your location data. But, at least on the consumer side, I want some kind of broadcast service, not just a per-app unicast one. I want to be able to say "here's my current location for consumption by anyone", and allow that to be effectively broadcast to anyone I'm interacting with.

Clearly my granularity/privacy settings might be different for my public location, and I might want to be able to blacklist certain sites or parties if they prove to be abusers of my data, but for lots of uses a broadcast public location is exactly what I want.

How might this work in the web context? Say I'm interacting with an e-commerce site, and if they had some broad idea of my location (say, postcode, state, country) they could default shipping addresses for me, and show me shipping costs earlier in the transaction (subject to change, of course, if I want to ship somewhere else). How can I communicate my public location data to this site?

So here's a crazy super-simple proposal: use Microformat HTTP Request Headers.

HTTP Request Headers are the only way the browser can pass information to a website (unless you consider cookies a separate mechanism, and they aren't really useful here because they're domain specific). The HTTP spec even carries over the "From" header from email, to allow browsers to communicate who the user is to the website, so there's some kind of precedent for using HTTP headers for user info.

Microformats are useful here because they're really simple, and they provide useful standardised vocabularies around addresses (adr) and geocoding (geo).

So how about (for example) we define a couple of custom HTTP request headers for public location data, and use some kind of microformat-inspired serialisation (like e.g. key-value pairs) for the location data? For instance:

X-Adr-Current: locality=Sydney; region=NSW; postal-code=2000; country-name=Australia
X-Geo-Current: latitude=-33.717718; longitude=151.117158

For websites, the usage is then about as trivial as possible: check for the existence of the HTTP header, do some very simple parsing, and use the data to personalise the user experience in whatever ways are appropriate for the site.
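To show how simple the "very simple parsing" could be, here is a minimal Python sketch for the key-value serialisation proposed above. The function names are made up for this example, and the serialisation itself is only the strawman from this post, not an existing standard.

```python
from typing import Dict, Optional, Tuple


def parse_location_header(value: str) -> Dict[str, str]:
    """Parse a 'key=value; key=value' header value into a dict.

    Keys are lower-cased; malformed parts (no '=') are skipped.
    """
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if not sep or not key.strip():
            continue
        fields[key.strip().lower()] = val.strip()
    return fields


def parse_geo(value: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from a geo header value, or None if unusable."""
    fields = parse_location_header(value)
    try:
        lat = float(fields["latitude"])
        lon = float(fields["longitude"])
    except (KeyError, ValueError):
        return None
    # Reject out-of-range values (this also rejects NaN).
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon
```

Since the headers are supplied by the client, a site should treat the values as hints for defaults (shipping forms, store locators) rather than as anything trustworthy.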

On the browser side we'd need some kind of simple fire eagle client that would pull location updates from fire eagle and then publish them via these HTTP headers. A firefox plugin would probably be a good proof of concept.

I think this is simple, interesting and useful, though it obviously requires websites to make use of it before it's of much value in the real world.

So is this crazy, or interesting?

The Long Tail of Location

Brady Forrest asked in a recent post what kinds of applications people would most like to see working with Yahoo's new location-broking service Fire Eagle (currently in private beta).

It's clear that most of the shiny new web 2.0 sites and apps might be able to benefit from such personal location info:

  • photo sites that can do automagic geotagging

  • calendar apps that adapt to our current timezone

  • search engines that can take proximity into account when weighting results

  • social networks that can show us people in town when we're somewhere new

  • maps and mashups that start where you are, rather than with some static default


And such sites and apps will no doubt be early adopters of fire eagle and whatever other location brokers might bubble up in the next little while.

Two things struck me with this list though. First, that's a lot of sites and apps right there, and unless the friction of authorising new apps to have access to my location data is very low, the pain of micromanaging access is going to get old fast. Is there some kind of 'public' client level access in fire eagle that isn't going to require individual app approval?

Second, I can't help thinking that this still leaves most of the web out in the cold. Think about all the non-ajax sites that you interact with doing relatively simple stuff that could still benefit from access to your public location data:

  • the shipping address forms you fill out at every e-commerce site you buy from

  • store locators and hours pages that ask for a postcode to help you (every time!)

  • timetables that could start with nearby stations or routes or lines if they knew where you were

  • intelligent defaults or entry points for sites doing everything from movie listings to real estate to classifieds

This is the long tail of location: the 80% of the web that won't be using ajax or comet or OAuth or web service APIs anytime soon. I'd really like my location data to be useful on this end of the web as well, and it's just not going to happen if it requires sites to register api keys and use OAuth and make web service api calls. The bar is just too high for lots of casual web developers, and an awful lot of the web is still custom php or asp scripts written by relative newbies (or maybe that's just here in Australia!). If it's not almost trivially easy, it won't be used.

So I'm interested in how we do location at this end of the web. What do we need on top of fire eagle or similar services to make our location data ubiquitous and immediately useful to relatively non-sophisticated websites? How do we deal with the long tail?

Notes on TheSchwartz

I've been playing around with SixApart's TheSchwartz for the last few days. TheSchwartz is a lightweight reliable job queue, typically used for handling relatively high latency jobs that you don't want to try and handle from a web process e.g. for sending out emails, placing orders into some external system, etc. Basically interacting with anything which might be down or slow or which you don't really need right away.

Actually, TheSchwartz is a job queue library rather than a job queue system, so some assembly is required. Like most Danga/SixApart software, it's lightweight, performant, and well-designed, but also pretty light on documentation. If you're not comfortable reading the (perl) source, it might be a challenging environment to set up.

Notes from the last few days:

  • Don't use the version on CPAN, get the latest code from subversion instead. At the moment the CPAN version is 1.04, but current svn is at 1.07, and has some significant additional functionality.

  • Conceptually TheSchwartz is very simple - jobs with opaque function names and arguments are inserted into a database for workers with a particular 'ability'; workers periodically check the database for jobs matching the abilities they have, and grab and execute them. Jobs that succeed are marked completed and removed from the queue; jobs that fail are logged and left on the queue to be retried after some time period up to a configurable number of retries.

  • TheSchwartz has two kinds of clients - those that submit jobs, and workers that perform jobs. Both are considered clients, which is confusing if you're thinking in terms of client-server interaction.

  • There are three main classes to deal with: TheSchwartz, which is the main client functionality class; TheSchwartz::Job, which models the jobs that are submitted to the job queue; and TheSchwartz::Worker, which is a role-type class modelling a particular ability that a worker is able to perform.

  • New worker abilities are defined by subclassing TheSchwartz::Worker and defining your new functionality in a work() method. work() receives the job object from the queue as its only argument and does its stuff, marking the job as completed or failed after processing. A useful real example worker is TheSchwartz::Worker::SendEmail (also by Brad Fitzpatrick, and available on CPAN) for sending emails from TheSchwartz.

  • Depending on your application, it may make sense for workers to just have a single ability, or for them to have multiple abilities and service more than one type of job. In the latter case, TheSchwartz tries to use unused abilities whenever it can to avoid certain kinds of jobs getting starved.

  • You can also subclass TheSchwartz itself to modify the standard functionality, and I've found that useful where I've wanted more visibility of what workers are doing than you get out of the box. You don't appear at this point to be able to subclass TheSchwartz::Job however - TheSchwartz always uses this as the class when autovivifying jobs for workers.

  • There are a bunch of other features I haven't played with yet, including job priorities, the ability to coalesce jobs into groups to be processed together, and the ability to delay jobs until a certain time.
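For readers who want the conceptual model from the notes above in code form, here is a toy in-memory sketch in Python. It is emphatically not TheSchwartz's API or implementation (TheSchwartz is perl and database-backed); the class and method names, the retry counting, and the retry delay are all assumptions for illustration.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Job:
    funcname: str            # opaque function name, i.e. the required ability
    arg: Any                 # opaque argument passed to the worker
    failures: int = 0
    run_after: float = 0.0   # earliest time the job may be grabbed


class MiniQueue:
    """Toy job queue: insert jobs, let workers grab those matching their abilities."""

    def __init__(self, max_retries: int = 3, retry_delay: float = 0.0) -> None:
        self.jobs: List[Job] = []
        self.failed_permanently: List[Job] = []
        self.max_retries = max_retries
        self.retry_delay = retry_delay

    def insert(self, funcname: str, arg: Any) -> Job:
        job = Job(funcname, arg)
        self.jobs.append(job)
        return job

    def grab(self, abilities: Dict[str, Callable[[Any], None]]) -> Optional[Job]:
        """Remove and return the first runnable job matching an ability, if any."""
        now = time.time()
        for job in self.jobs:
            if job.funcname in abilities and job.run_after <= now:
                self.jobs.remove(job)
                return job
        return None

    def work_once(self, abilities: Dict[str, Callable[[Any], None]]) -> bool:
        """Run one job; return False if nothing suitable was available."""
        job = self.grab(abilities)
        if job is None:
            return False
        try:
            abilities[job.funcname](job.arg)   # success: job stays off the queue
        except Exception:
            job.failures += 1
            if job.failures <= self.max_retries:
                job.run_after = time.time() + self.retry_delay
                self.jobs.append(job)          # requeue for a later retry
            else:
                self.failed_permanently.append(job)
        return True
```

A worker is then just a loop calling `work_once` with a dict mapping function names to callables, sleeping briefly whenever it returns False.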

I've actually been using it to set up a job queue system for a cluster, which is a slightly different application than it was intended for, but so far it's been working really well.

I still feel like I'm only just getting to grips with the breadth of things it could be used for though - more experimentation required. I'd be interested in hearing of examples of what people are using it for as well.


Paying Bills

Was thinking in the weekend about places where I waste time, areas of inefficiency in my extremely well-ordered life (cough splutter).

One of the more obvious was bill handling. I receive paper bills during the month from the likes of Energy Australia, Sydney Water, David Jones, our local council for rates, etc. These all go into a pending file in the filing cabinet, in date order, and I then periodically check that file during the month and pay any bills that are coming due. If I get busy or forgetful I may miss a due date and pay a bill late. If a bill gets lost in the post I may not pay it at all. And the process is all dependent on me polling my billing file at some reasonable frequency.

There are variants to this process too. Some of my friends do all their bills once a month, and just queue the payments in their bank accounts for future payment on or near the due date. That's a lower workload system than mine, but for some (mostly illogical) reason I find myself not really trusting future-dated bill payments in the same way as immediate ones.

There's also a free (for users) service available in Australia called BPay View which allows you to receive your bills electronically directly into your internet banking account, and pay them from there. This is nice in that it removes the paper and data entry pieces of the problem, but it's still a pull model - I still have to remember to check the BPay View page periodically - and it's limited to vendors that have signed up for the program.

As I see it, there are two main areas of friction in this process:

  1. using a pull model i.e. the process all being dependent on me remembering to check my bill status periodically and pay those that are coming due. My mental world is quite cluttered enough without having to remember administrivia like bills.

  2. the automation friction around paper-based or PDF-based bills, and the consequent data entry requirements, the scope for user errors, etc.

BPay View mostly solves the second of these, but it's a solution that's closely coupled with your Internet Banking provider. This has security benefits, but it also limits you to your Internet Banking platform. For me, the first of these is a bigger issue, so I'd probably prefer a solution that was decoupled from my internet banking, and accept a few more issues with #2.

So here's what I want:

  • a billing service that receives bills from vendors on my behalf and enters them into its system. Ideally this is via email (or even a web service) and an XML bill attachment; in the real world it probably still involves paper bills and data entry for the short to medium term.

  • a flexible notification system that pushes alerts to me when bills are due based on per-vendor criteria I configure. This should include at least options like email, IM, SMS, twitter, etc. Notifications could be fire-once or fire-until-acknowledged, as the user chooses.

  • for bonus points, an easy method of transferring bills into my internet banking. The dumb solution is probably just a per-bill view from which I can cut and paste fields; smarter solutions would be great, but are probably dependent on the internet banking side. Or maybe we do some kind of per-vendor pay online magic, if it's possible to figure out the security side of not storing credit card info. Hmmm.
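The notification piece of the wishlist above boils down to a small push-side check run on a schedule. Here is a minimal Python sketch, assuming a hypothetical `Bill` record and a per-vendor "lead days" setting (how many days before the due date to start alerting); delivery via email, IM, SMS, etc. is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Bill:
    vendor: str
    due: date
    paid: bool = False


def bills_needing_alert(
    bills: Iterable[Bill],
    today: date,
    lead_days: Dict[str, int],
    default_lead: int = 7,
) -> List[Bill]:
    """Return unpaid bills whose alert window has opened, soonest due first."""
    due_soon = []
    for bill in bills:
        if bill.paid:
            continue
        lead = timedelta(days=lead_days.get(bill.vendor, default_lead))
        if today >= bill.due - lead:
            due_soon.append(bill)
    return sorted(due_soon, key=lambda b: b.due)
```

Running this daily and pushing the results out gives fire-until-acknowledged behaviour for free, since an unpaid bill keeps reappearing until it is marked paid.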

That sounds pretty tractable. Anyone know anything like this?

Transient RSS Feeds

As the use of RSS and Atom becomes increasingly widespread (we have people talking about Syndication-Oriented Architecture now), it seems to me that one of the use cases that isn't particularly well covered off is transient or short-term feeds.

In this category are things like short-term blogs (e.g. the feeds on the advent blogs I was reading this year: Catalyst 2007 and 24 Ways 2007), or comment feeds, for tracking the comments on a particular post.

Transient feeds require at least the ability to auto-expire a feed after some period of time (e.g. 30 days after the last entry) or after a certain date, and secondarily, the ability to add feeds almost trivially to your newsreader (I'm currently just using the thunderbird news reader, which is reasonable, but requires about 5 clicks to add a feed).
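The expiry rule is simple enough to sketch. The function name and parameters below are made up for illustration; a real newsreader would run something like this over its subscription list:

```python
from datetime import date, timedelta


def feed_expired(last_entry, today, idle_days=30, expires_on=None):
    """Return True if a transient feed should be dropped.

    A feed expires when more than idle_days have passed since its last
    entry, or when an optional fixed expires_on date has passed.
    """
    if expires_on is not None and today > expires_on:
        return True
    return today - last_entry > timedelta(days=idle_days)
```

For an advent blog whose last entry was on 25 December, the default 30-day rule would expire the feed towards the end of January.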

Anyone know of newsreaders that offer this functionality?

The Future of Advertising

Great quote from Dave Winer on Why Google launched OpenSocial:

Advertising is on its way to being obsolete. Facebook is just another step along the path. Advertising will get more and more targeted until it disappears, because perfectly targeted advertising is just information.

I don't see Facebook seriously threatening Google, as Dave does, but that quote is a classic, and long-term (surely!) spot on the money.

I'm much more in agreement with Tim O'Reilly's critique of OpenSocial. Somehow OpenSocial seems all backwards from the company whose maps openness helped make mashups a whole new class of application.

It smells a lot like OpenSocial was hastily conceived just to get something out the door in advance of the Facebook announcements today, by Googlers who don't quite grok the power of the open juice.

Rant: How To Not Sell Stuff

Today I've been reminded that while the web revolution continues apace - witness Web 2.0, ajax, mashups, RESTful web services, etc. - much of the web hasn't yet made it to Web 1.0, let alone Web 2.0.

Take ecommerce.

One of this afternoon's tasks was this: order some graphics cards for a batch of workstations. We had a pretty good idea of the kind of cards we wanted - PCIe Nvidia 8600GT-based cards. The unusual twist today was this: ideally we wanted ones that would only take up a single PCIe slot, so we could use them okay even if the neighbouring slot was filled i.e.

select * from graphics_cards
where chipset_vendor = 'nvidia'
and chipset = '8600GT'
order by width asc;

or something. Note that we don't even really care much about price. We just need some retailer to expose the data on their cards in a useful sortable fashion, and they would get our order.
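To make the wish concrete, here is a minimal sketch of that kind of query run against an in-memory SQLite table. The table layout and the sample rows are invented for illustration (here `width` is the number of slots a card occupies), and the results are sorted narrowest first:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table graphics_cards ("
    " name text, chipset_vendor text, chipset text, width integer)"
)
# Invented sample rows, not real products.
conn.executemany(
    "insert into graphics_cards values (?, ?, ?, ?)",
    [
        ("Card A", "nvidia", "8600GT", 2),
        ("Card B", "nvidia", "8600GT", 1),
        ("Card C", "nvidia", "8800GTS", 2),
        ("Card D", "ati", "HD2600", 1),
    ],
)


def narrowest_cards(conn, vendor, chipset):
    """Return (name, width) rows for a chipset, narrowest cards first."""
    rows = conn.execute(
        "select name, width from graphics_cards"
        " where chipset_vendor = ? and chipset = ?"
        " order by width asc",
        (vendor, chipset),
    )
    return rows.fetchall()
```

Any merchant whose catalogue could answer `narrowest_cards(conn, "nvidia", "8600GT")` would have had the order.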

In practice, this is Mission Impossible.

Mostly, merchants will just allow me to drill down to their graphics cards page and browse the gazillion cards they have available. If I'm lucky, I'll be able to get a view that only includes Nvidia PCIe cards. If I'm very lucky, I might even be able to drill down to only 8000-series cards, or even 8600GTs.

Some merchants also allow ordering on certain columns, which is actually pretty useful when you're buying on price. But none seem to expose RAM or clockspeeds in list view, let alone card dimensions.

And even when I manually drill down to the cards themselves, very few have much useful information there. I did find two sites that actually quoted the physical dimensions for some cards, but in both cases the numbers they were quoting seemed bogus.

Okay, so how about we try and figure it out from the manufacturers' websites?

This turns out to be Mission Impossible II. The manufacturers' websites are all controlled by their marketing departments and largely consist of flash demos and brochureware. Even finding a particular card is an impressive feat, even if you have the merchant's approximation of its name. And when you do, they often have less information than the retailers'. If there is any significant data available for a card, it's usually in a pdf datasheet or a manual, rather than available on a webpage.


So here are a few free suggestions for all and sundry, born out of today's frustration.

For manufacturers:

  • use part numbers - all products need a unique identifier, like books have an ISBN. That means I don't have to try and guess whether your 'SoFast HyperFlapdoodle 8600GT' is the same thing as the random mislabel the merchant put on it.

  • provide a standard url for getting to a product page given your part number. I know, that's pretty revolutionary, but maybe take a few tips from google instead of just listening to your marketing department e.g.

  • keep old product pages around, since people don't just buy your latest and greatest, and products take a long time to clear in some parts of the world

  • include some data on your product pages, rather than just your brochureware. Put it way down the bottom of the page so your marketing people don't complain as much. For bonus points, mark it up with semantic microformat-type classes to make parsing easier.

  • alternatively, provide dedicated data product pages, perhaps in xml, optimised for machine use rather than marketing. They don't even have to be visible via browse paths, just available via search urls given product ids.
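As a rough illustration of why class-marked data makes parsing easier, here is a sketch that pulls specs out of a product page. The `spec-` class prefix, the sample page, and the part number are all hypothetical, not an existing microformat or product:

```python
from html.parser import HTMLParser


class SpecParser(HTMLParser):
    """Collect the text of elements whose class starts with 'spec-'."""

    def __init__(self):
        super().__init__()
        self.specs = {}
        self._current = None  # spec name we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        for cls in (dict(attrs).get("class") or "").split():
            if cls.startswith("spec-"):
                self._current = cls[len("spec-"):]

    def handle_data(self, data):
        if self._current and data.strip():
            self.specs[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


# Hypothetical product page fragment.
PAGE = """
<table class="product-data">
  <tr><th>Part number</th><td class="spec-part-number">XYZ-8600GT-256</td></tr>
  <tr><th>Slots</th><td class="spec-slots">1</td></tr>
  <tr><th>Length</th><td class="spec-length">170mm</td></tr>
</table>
"""

parser = SpecParser()
parser.feed(PAGE)
```

After feeding the page, `parser.specs` holds the part number, slot count, and length keyed by name, which is all a shopping aggregator would need to build the filters described above.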

For merchants:

  • include manufacturer's part numbers, even if you want to use your own as the primary key. It's good to let your customers get additional information from the manufacturer, of course.

  • provide links at least to the manufacturer's home page, and ideally to individual product pages

  • invest in your web interface, particularly in terms of filtering results. If you have 5 items that are going to meet my requirements, I want to be able to filter down to exactly and only those five, instead of having to hunt for them among 50. Price is usually an important determiner of shopping decisions, of course, but if I have two merchants with similar pricing, one of whom lets me find exactly the target set I was interested in, guess who I'm going to buy from?

  • do provide as much data as possible as conveniently as possible for shopping aggregators, particularly product information and stock levels. People will build useful interfaces on top of your data if you let them, and will send traffic your way for free. Pricing is important, but it's only one piece of the equation.

  • simple and useful beats pretty and painful - in particular, don't use frames, since they break lots of standard web magic like bookmarking and back buttons; don't do things like magic javascript links that don't work in standard browser fashion; and don't open content in new windows for me - I can do that myself

  • actively solicit feedback from your customers - very few people will give you feedback unless you make it very clear you welcome and appreciate it, and when you get it, take it seriously

End of rant.

So tell me, are there any clueful manufacturers and merchants out there? I don't like just hurling brickbats ...

Top Firefox Extensions

I've been meaning to document the set of firefox extensions I'm currently using, partly to share with others, partly so they're easy to find and install when I start using a new machine, and partly to track the way my usage changes over time. Here's the current list:

Obligatory Extensions

  • Greasemonkey - the fantastic firefox user script manager, allowing client-side javascript scripts to totally transform any web page before it gets to you. For me, this is firefox's "killer feature" (and see below for the user scripts I recommend).

  • Flash Block - disable flash and shockwave content from running automatically, adding placeholders to allow running manually if desired (plus per-site whitelists, etc.)

  • AdBlock Plus - block ad images via a right-click menu option

  • Chris Pederick's Web Developer Toolbar - a fantastic collection of tools for web developers

  • Joe Hewitt's Firebug - the premier firefox web debugging tool - its html and css inspection features are especially cool

  • Daniel Lindkvist's Add Bookmark Here extension, adding a menu item to bookmark toolbar dropdowns to add the current page directly in the right location

Optional Extensions

  • Michael Kaply's Operator - a very nice microformats toolbar, for discovering the shiny new microformats embedded in web pages, and providing operations you can perform on them

  • Zotero - a very interesting extension to help capture and organise research information, including webpages, notes, citations, and bibliographic information

  • Colorful Tabs - tabs + eye candy - mmmmm!

  • Chris Pederick's User Agent Switcher - for braindead websites that only think they need IE

  • ForecastFox - nice weather forecast widgets in your firefox status bar (and not just US-centric)

Greasemonkey User Scripts

So what am I missing here?


Since this post, I've added the following to my must-have list:

  • Tony Murray's Print Hint - helps you find print stylesheets and/or printer-friendly versions of pages

  • the Style Sheet Chooser II extension, which extends firefox's standard alternate stylesheet selection functionality

  • Ron Beck's JSView extension, allowing you to view external javascript and css styles used by a page

  • The It's All Text extension, allowing textareas to be edited using the external editor of your choice.

  • The Live HTTP Headers plugin - invaluable for times when you need to see exactly what is going on between your browser and the server

  • Gareth Hunt's Modify Headers plugin, for setting arbitrary HTTP headers for web development

  • Sebastian Tschan's Autofill Forms extension - amazingly useful for autofilling forms quickly and efficiently

Data Blogging Scenarios 1 - Reviews

Following on from my earlier data blogging post, and along the lines of Jon Udell's lifebits scenarios, here's the first in a series of posts exploring some ideas about how data blogging might be interesting in today's Web 2.0 world.

Easy one first: Reviews.

When I write a review on my blog of a book I've read or a movie I've seen, it should be trivial to syndicate this as a review to multiple relevant websites. My book reviews might go to Amazon (who else does good user book review aggregation out there?), movie reviews to IMDB, Yahoo Movies, Netflix, etc.

I'm already writing prose, so I should just be able to mark it up as a microformats hReview, add some tags to control syndication, and have that content available via one or more RSS or Atom feeds.
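For illustration, here is a minimal sketch of what generating that markup might look like. The class names (`hreview`, `item`, `fn`, `rating`, `reviewer`, `dtreviewed`, `description`) come from the hReview microformat; the function itself and its parameters are hypothetical, and a blogging tool would more likely do this in a template:

```python
from html import escape


def render_hreview(item, rating, reviewer, reviewed, body):
    """Render a review as an HTML fragment using hReview class names."""
    return (
        '<div class="hreview">\n'
        f'  <span class="item"><span class="fn">{escape(item)}</span></span>\n'
        f'  <span class="rating">{rating}</span>\n'
        f'  <span class="reviewer vcard"><span class="fn">{escape(reviewer)}</span></span>\n'
        f'  <abbr class="dtreviewed" title="{reviewed}">{reviewed}</abbr>\n'
        f'  <div class="description">{escape(body)}</div>\n'
        "</div>"
    )
```

The resulting fragment goes into the post body as usual, so the same entry serves human readers and any service that consumes the feed.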

I should then just be able to go to my Amazon account, give it the url for the feed I want it to monitor for reviews, and - voila! - instant user-driven content syndication.

This is a win-win isn't it? Amazon gets to use my review on its website, but I get to retain a lot more control in the process:

  • I can author content using my choice of tools instead of filling out a textarea on the Amazon website

  • I can easily syndicate content to multiple sites, and/or syndicate content selectively as well

  • I can make updates and corrections according to my policies, rather than Amazon's (Amazon would of course still be able to decide what to do with such updates)

  • I should be able to revoke access to my content to specific websites if they do stupid stuff

  • I and my readers get the benefit of retaining and aggregating my content on my blog, and all your standard blogging magic (comments, trackbacks, tagclouds, etc.) still apply

It would probably also be nice if Amazon included a link back to the review on my blog which would drive additional traffic my way, and allow interested Amazon users to follow any further conversations (comments and trackbacks etc.) that have happened there.

So are there any sites out there already doing this?

Data Blogging for Fun and Profit

I've been spending some time thinking about a couple of intriguing posts by Jon Udell, in which he discusses a hypothetical "lifebits" service which would host his currently scattered "digital assets" and syndicate them out to various services.

Jon's partly interested in the storage and persistence guarantees such a service could offer, but I find myself most intrigued by the way in which he inverts the current web model, applying the publish-and-subscribe pull-model of the blogging world to traditional upload/push environments like Flickr or MySpace, email, and even health records.

The basic idea is that instead of creating your data in some online app, or uploading your data to some Web 2.0 service, you instead create it in your own space - blog it, if you like - and then syndicate it to the service you want to share it with. You retain control and authority over your content, you get to syndicate it to multiple services instead of having it tied to just one, and you still get the nice aggregation and folksonomy effects from the social networks you're part of.

I think it's a fascinating idea.

One way to think of this is as a kind of "data blogging", where we blog not ideas for consumption by human readers, but structured data of various kinds for consumption by upstream applications and services. Data blogs act as drivers of applications and transactions, rather than of conversations.

The syndication piece is presumably pretty well covered via RSS and Atom. We really just need to define some standard data formats between the producers - that's us, remember! - and the consumers - which are the applications and services - and we've got most of the necessary components ready to go.

Some of the specialised XML vocabularies out there are presumably useful on the data formats side. But perhaps the most interesting possibility is the new swag of microformats currently being put to use in adding structured data to web pages. If we can blog people and organisations, events, bookmarks, map points, tags, and social networks, we've got halfway decent coverage of a lot of the Web 2.0 landscape.

Anyone else interested in inverting the web?


I've been trying out a few of my blosxom wishlist ideas over the last few days, and have now got an experimental version of blosxom I'm calling blosphemy (Gr. to speak against, to speak evil of).

It supports the following features over current blosxom:

  • loads the main blosxom config from an external config file (e.g. blosxom.conf) rather than from inline in blosxom.cgi. This is similar to what is currently done in the debian blosxom package.

  • supports loading the list of plugins to use from an external config file (e.g. plugins.conf) rather than deriving it by walking the plugin directory (but falls back to current behaviour for backwards compatibility).

  • uses standard perl @INC to load blosxom plugins, instead of hardcoding the blosxom plugin directory. This allows blosxom to support CPAN blosxom plugins as well as stock $plugin_dir ones.

  • uses a multi-value $plugin_path instead of a single value $plugin_dir to search for plugins. The intention with this is to allow, for instance, standard plugins to reside in /var/www/blosxom/plugins, but to allow the user to add their own or modify existing ones by copying them to (say) $HOME/blosxom/plugins.
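The plugin lookup described in the last few points can be sketched as below. This is a Python illustration of the idea only, not blosphemy's actual (perl) code; the function names and the `#` comment convention for the plugin list file are assumptions:

```python
import os


def find_plugin(name, plugin_path):
    """Return the first file called name found along plugin_path, or None.

    Directories earlier in plugin_path win, so listing the user's
    directory first lets their copy override the standard one.
    """
    for directory in plugin_path:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def plugin_names(plugin_list_file, plugin_path):
    """Read plugin names from a list file, else fall back to a directory walk."""
    if os.path.isfile(plugin_list_file):
        with open(plugin_list_file) as fh:
            lines = (line.strip() for line in fh)
            # Assumed format: one plugin per line, '#' starts a comment line.
            return [line for line in lines if line and not line.startswith("#")]
    names = set()
    for directory in plugin_path:
        if os.path.isdir(directory):
            names.update(os.listdir(directory))
    return sorted(names)
```

With a path like `[$HOME/blosxom/plugins, /var/www/blosxom/plugins]`, a plugin copied into the home directory shadows the stock one of the same name without the stock file being touched.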

These changes isolate blosxom configuration from the cgi and plugin directories (configs can live in e.g. $HOME/blosxom/config for tarball/home directory installs, or /etc/blosxom for package installs), allowing nice clean upgrades. I've been upgrading using RPMs while developing, and the RPM upgrades are now working really smoothly.

If anyone would like to try it out, releases are at:

I've tried to keep the changes fairly minimalist and clean, so that some or all of them can be migrated upstream easily if desired. They should also be pretty much fully backward compatible with the current blosxom.

Comments and feedback welcome.