GNU sort traps with field separators

GNU sort is an excellent utility that is a mainstay of the linux command line. It has all kinds of tricks up its sleeves, including support for uniquifying records, stable sorts, files larger than memory, parallelisation, controlled memory usage, etc. Go read the man page for all the gory details.

It also supports sorting with field separators, but unfortunately this support has some nasty traps for the unwary. Hence this post.

First, GNU sort cannot do general sorts of CSV-style datasets, because it doesn't understand CSV features like quoting rules, quote-escaping, separator-escaping, etc. If you have very simple CSV files that don't do any escaping and you can avoid quotes altogether (or always use them), you might be able to use GNU sort - but it can get difficult fast.

Here I'm only interested in very simple delimited files - no quotes or escaping at all. Even here, though, there are some nasty traps to watch out for.

Here's a super-simple example file with just two lines and three fields, called dsort.csv:

$ cat dsort.csv
a,b,c
a,b+c,c

If we do a vanilla sort on this file, we get the following (I'm also running it through md5sum to highlight when the output changes):

$ sort dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

The longer line sorts before the shorter line because the '+' sign collates before the second comma in the short line - this is sorting on the whole line, not on the individual fields.

Okay, so if I want to do an individual field sort, I can just use the -t option, right? You would think so, but unfortunately:

$ sort -t, dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

Huh? Why doesn't that sort the short line first, like we'd expect? Maybe it's not sorting on all the fields or something? Do I need to explicitly include all fields? Let's see:

$ sort -t, -k1,3 dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

Huh? What the heck is going on here?

It turns out this unintuitive behaviour is because of the way sort interprets the -k option - -kM,N (where M != N) doesn't mean 'sort by field M, then field M+1, ... then by field N', it means 'use everything from the start of field M to the end of field N (separators included) as a single sort key'. Ugh!

So I just need to specify the fields individually? Unfortunately, even that's not enough:

$ sort -t, -k1 -k2 -k3 dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f  -

This is because the first option here - -k1 - is interpreted as -k1,3 (since the last field here is field 3): the default 'end-field' of a key is the last field on the line. Double-ugh!

So the takeaway is: if you want an individual-field sort you have to specify every field individually, AND you have to use -kN,N syntax, like so:

$ sort -t, -k1,1 -k2,2 -k3,3 dsort.csv | tee /dev/stderr | md5sum
a,b,c
a,b+c,c
493ce7ca60040fa184f1bf7db7758516  -

Yay, finally what we're after!

Also, unfortunately, there doesn't seem to be a generic way of specifying 'all fields' or 'up to the last field' or 'M-N' fields - you have to specify them all individually. It's verbose and ugly, but it works.
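
If you're doing this a lot, you can script away some of the verbosity. Here's a minimal, hypothetical bash helper (the name and interface are mine) that counts the fields on the first line and generates the -kN,N keys for you:

# Sort a delimited file on every field individually
fieldsort() {
  local file=$1 sep=${2:-,}
  local nf i keys=()
  nf=$(head -n1 "$file" | awk -F"$sep" '{ print NF }')
  for ((i = 1; i <= nf; i++)); do keys+=( "-k$i,$i" ); done
  sort -t"$sep" "${keys[@]}" "$file"
}

# e.g.
fieldsort dsort.csv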

And for some good news, you can use sort suffixes on those individual options (like n for numerics, r for reverse sorts, etc.) just fine.
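
For example, to sort field 1 lexically, field 2 numerically, and field 3 in reverse (on our toy file again):

sort -t, -k1,1 -k2,2n -k3,3r dsort.csv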

Happy sorting!

ctap (or: reasons I like Go and modern tooling)

So this weekend I scratched a long-standing itch and whipped up a little utility for colourising output. It's called ctap and is now available on Github.

I have a bunch of processes that spit out TAP output at various points as tests are run on outputs, and having the output coloured for fast scanning is a surprisingly significant UX improvement.

I've been a happy user of the javascript tap-colorize utility for quite a while, and it does the job nicely. Unfortunately, it's not packaged (afaict) on either CentOS or Ubuntu, which means you typically have to install from source via npm, which is fine on a laptop, but a PITA on a server.

And though I'm pretty comfortable building from source, I've always found node packages to be more bother than they're worth.

Which brings us back to ctap this weekend. Given how simple the TAP protocol is, I figured building an MVP colouriser couldn't take more than an hour or two - and so it turned out.
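
For context, TAP output is just line-oriented text like this:

1..3
ok 1 - parse config
not ok 2 - fetch remote data
ok 3 - teardown # SKIP no network

so colourising it is (presumably) just a matter of piping your test output through it - something like prove -v | ctap if your tests come from perl's prove.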

But what's really nice is all the modern tooling that provides you with impressive boosts as you go these days. On this project that included:

These are fun times to be a software engineer!

Top 5 Bash Functions

Here are a few bash functions that I find myself using all the time.

Functions are great where you have something that's slightly more complex than an alias, or wants to parse out its arguments, but isn't big enough to turn into a proper script. Drop these into your ~/.bashrc file (and source ~/.bashrc) and you're good to go!

Hope one or two of these are helpful/interesting.

1. ssht

Only came across this one fairly recently, but it's nice - you can combine the joys of ssh and tmux to drop you automagically into a given named session - I use sshgc (with my initials), so as not to clobber anyone else's session. (Because ssh and then tmux attach is so much typing!)

ssht() {
  local SESSION_NAME=sshgc
  command ssh -t "$1" tmux new-session -A -s $SESSION_NAME
}
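
Usage is just like ssh (the hostname here is a placeholder):

# ssh to myhost and attach to (or create) the 'sshgc' tmux session there
ssht myhost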

2. lead

lead is a combination of ls -l and head, showing you the most recent N files in the given (or current) directory:

lead() {
  if [[ "$2" =~ ^[0-9]+$ ]]; then
    command ls -lt "$1" | head -n $2
  else
    # This version works with multiple args or globs
    command ls -lt "$@" | head -n 30
  fi
}

Or if you're using exa instead of ls, you can use:

lead() {
  if [[ "$2" =~ ^[0-9]+$ ]]; then
    command exa -l -s newest -r --git --color always "$1" | head -n $2
  else
    command exa -l -s newest -r --git --color always "$@" | head -n 30
  fi
}

Usage:

# Show the 30 most recent items in the current directory
lead
# Show the 30 most recent items in the given directory
lead /etc
# Show the 50 most recent items in the current directory
lead . 50
# Show the most recent items beginning with `abc`
lead abc*

3. l1

This ("lowercase L - one", in case it's hard to read) is similar in spirit to lead, but it just returns the filename of the most recently modified item in the current directory.

l1() {
  command ls -t | head -n1
}

This can be used in places where you'd use bash's !$ e.g. to edit or view some file you just created:

solve_the_meaning_of_life >| meaning.txt
cat !$
42!
# OR: cat `l1`

But l1 can also be used in situations where the filename isn't present in the previous command. For instance, I have a script that produces a pdf invoice from a given text file, where the pdf name is auto-derived from the text file name. With l1, I can just do:

invoice ~/Invoices/cust/Hours.2107
evince `l1`

4. xtitle

This is a cute hack that lets you set the title of your terminal to the first argument that doesn't begin with a '-':

function xtitle() {
  if [ -n "$DISPLAY" ]; then
    # Try and prune arguments that look like options
    while [ "${1:0:1}" == '-' ]; do
      shift
    done
    local TITLE=${1:-${HOSTNAME%%.*}}
    echo -ne "\033]0;${TITLE}\007"
  fi
}

Usage:

# Set your terminal title to 'foo'
xtitle foo
# Set your terminal title to the first label of your hostname
xtitle

I find this nice to use with ssh (or incorporated into ssht above) e.g.

function sshx() {
  xtitle "$@"
  command ssh -t "$@"
  local RC=$?
  xtitle
  return $RC
}

This (hopefully) sets your terminal title to the hostname you're ssh-ing to, and then resets it when you exit.

5. line

This function lets you select a particular line or set of lines from a text file:

function line() {
  # Usage: line <line> [<window>] [<file>]
  local LINE=$1
  shift
  local WINDOW=1
  local LEN=$LINE
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    WINDOW=$1
    LEN=$(( $LINE + $WINDOW/2 ))
    shift
  fi
  head -n "$LEN" "$@" | tail -n "$WINDOW"
}

Usage:

# Selecting from a file with numbered lines:
$ line 5 lines.txt
This is line 5
$ line 5 3 lines.txt
This is line 4
This is line 5
This is line 6
$ line 10 6 lines.txt
This is line 8
This is line 9
This is line 10
This is line 11
This is line 12
This is line 13

And a bonus alias:

alias bashrc="$EDITOR ~/.bashrc && source ~/.bashrc"

Scalable Web Fetches using Serverless

Let's say you have a list of URLs you need to fetch for some reason - perhaps to check that they still exist, perhaps to parse their content for updates, whatever.

If the list is small - say up to 1000 urls - this is pretty easy to do using just curl(1) or wget(1) e.g.

INPUT=urls.txt
wget --execute robots=off --adjust-extension --convert-links \
  --force-directories --no-check-certificate --no-verbose \
  --timeout=120 --tries=3 -P ./tmp --warc-file=${INPUT%.txt} \
  -i "$INPUT"

This iterates over all the urls in urls.txt and fetches them one by one, capturing them in WARC format. Easy.

But if your url list is long - thousands or millions of urls - this is going to be too slow to be practical. This is a classic Embarrassingly Parallel problem, so to make this scalable the obvious solution is to split your input file up and run multiple fetches in parallel, and then merge your output files (i.e. a kind of map-reduce job).
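
If a single machine with a decent pipe is enough, that splitting can be as simple as the following sketch (the chunk size and concurrency here are arbitrary):

# Split the input into 300-url chunks and run up to 20 wget fetches in parallel
split -l 300 urls.txt chunk_
ls chunk_* | parallel -j20 \
  'wget --no-verbose --timeout=120 --tries=3 -P ./tmp --warc-file={} -i {}'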

But to go faster than a single machine allows you need to run this on multiple machines, and setting up, managing, and tearing down those machines becomes the bulk of the work. But really, you don't want to worry about machines at all - you just want an operating system instance available that you can make use of.

This is the promise of so-called serverless architectures such as AWS "Lambda" and Google Cloud's "Cloud Functions", which provide a container-like environment for computing, without actually having to worry about managing the containers. The serverless environment spins up instances on demand, and then tears them down after a fixed period of time or when your job completes.

So to try out this serverless paradigm on our web fetch problem, I've written cloudfunc-geturilist, a Google Cloud Platform "Cloud Function" written in Go, that is triggered by input files being written into an input Google Cloud Storage bucket, and writes its output files to another GCS output bucket.

See the README instructions if you'd like to try it out (which you can do using a GCP free tier account).

In terms of scalability, this seems to work pretty well. The biggest job I've run so far has been 100k URLs, split into 334 input files each containing 300 URLs. With MAX_INSTANCES=20, cloudfunc-geturilist processes these 100k URLs in about 18 minutes; with MAX_INSTANCES=100 that drops to 5 minutes. All at a cost of a few cents.

That's a fair bit quicker than having to run up 100 container instances myself, or than using wget!

Perl to Go

I started using perl back in 1996, around version 5.003, while working at UBC in Vancouver. We were working on a student management system at the time, written in C, interfacing to an Oracle database. We started experimenting with this Common Gateway Interface thing (CGI) that had recently appeared, and let you write interactive applications on the web (!). perl was the tool of choice for CGI, and we were able to talk to Oracle using a perl module that spoke Oracle Call Interface (OCI).

That project turned out to be pretty successful, and I've been developing in perl ever since. Perl has a reputation for being obscure and abstruse, but I find it a lovely language - powerful and expressive. Yes it's probably too easy to write bad/unreadable perl, but it's also straightforward to write elegant, readable perl. I routinely pick up code I wrote 5 years ago and have no problems reading it again.

That said, perl is showing its age in places, and over the last few years I've also been using other languages where different project requirements made that sensible. I've written significant code in C, Java, python, ruby, and javascript/nodejs, but for me none of them were sufficiently attractive to threaten perl as my language of choice.

About a year ago I started playing with Go at $dayjob, partly interested in the performance gains of a compiled language, and partly to try out the concurrency features I'd heard about. Learning a new language is always challenging, but Go's small footprint, consistency, and well-written libraries really made picking it up pretty straightforward.

And for me the killer feature is this: Go is the only language besides perl in which I regularly find myself writing a good chunk of code, getting it syntactically correct, and then testing it and finding that it Just Works, first time. The friction between thinking and coding seems low enough (at least the way my brain works) that I can formulate what I'm thinking with a pretty high chance of getting it right. I still get surprised when it happens, but it's great when it does!

Other things help too - it's crazy fast, especially coming from mostly interpreted languages recently; the concurrency stuff really is nice, and lets you think about concurrent flows pretty intuitively; and lots of the language decisions like formatting, tooling, and composition just seem to sit pretty well with me.

So while I'm still very happy writing perl, especially for less performance-intensive applications, I'm now a happy little Go developer as well, and enjoying exploring some more advanced corners of a new language home.

Yay Go!


If you're interested in learning Go, the online docs are pretty great: start with the Tour of Go and How to write Go code, and then read Effective Go.

If you're after a book-length treatment, the standard text is The Go Programming Language by Donovan and Kernighan. It's excellent but pretty dense, more textbook than tutorial.

I've also read Go in Practice, which is more accessible and cookbook-style. I thought it was okay, and learnt a few things, but I wouldn't go out of my way for it.

Simple dual upstream gateways in CentOS, revisited

Recently had to set up a few servers that needed dual upstream gateways, and used an ancient blog post I wrote 11 years ago (!) to get it all working. This time around I hit a gotcha that I hadn't noted in that post, and used a simpler method to define the rules, so this is an updated version of that post.

Situation: you have two upstream gateways (gw1 and gw2) on separate interfaces and subnets on your linux server. Your default route is via gw1 (so all outward traffic, and most incoming traffic goes via that), but you want to be able to use gw2 as an alternative ingress pathway, so that packets that have come in on gw2 go back out that interface.

(Everything below done as root, so sudo -i first if you need to.)

1) First, define a few variables to make things easier to modify/understand:

# The device/interface on the `gw2` subnet
GW2_DEV=eth1
# The ip address of our `gw2` router
GW2_ROUTER_ADDR=172.16.2.254
# Our local ip address on the `gw2` subnet i.e. $GW2_DEV's address
GW2_LOCAL_ADDR=172.16.2.10

2) The gotcha I hit was that 'strict reverse-path filtering' in the kernel will drop all asymmetrically routed packets entirely, which will kill our response traffic. So the first thing to do is make sure that is either turned off or set to 'loose' instead of 'strict':

# Check the rp_filter setting for $GW2_DEV
# A value of '0' means rp_filtering is off, '1' means 'strict', and '2' means 'loose'
$ cat /proc/sys/net/ipv4/conf/$GW2_DEV/rp_filter
1
# For our purposes values of either '0' or '2' will work. '2' is slightly
# more conservative, so we'll go with that.
echo 2 > /proc/sys/net/ipv4/conf/$GW2_DEV/rp_filter
$ cat /proc/sys/net/ipv4/conf/$GW2_DEV/rp_filter
2
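
Note that the echo above only lasts until the next reboot. To make the setting persistent, add the equivalent line to /etc/sysctl.conf (or a file under /etc/sysctl.d/ on newer releases) and apply it with sysctl -p:

# Loose reverse-path filtering on the gw2 interface (eth1 here)
net.ipv4.conf.eth1.rp_filter = 2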

3) Define an extra routing table called gw2 e.g.

$ cat /etc/iproute2/rt_tables
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local tables
#
102     gw2
#

4) Add a default route via gw2 (here 172.16.2.254) to the gw2 routing table:

$ echo "default table gw2 via $GW2_ROUTER_ADDR" > /etc/sysconfig/network-scripts/route-${GW2_DEV}
$ cat /etc/sysconfig/network-scripts/route-${GW2_DEV}
default table gw2 via 172.16.2.254

5) Add an iproute 'rule' saying that packets that come in on our $GW2_LOCAL_ADDR should use routing table gw2:

$ echo "from $GW2_LOCAL_ADDR table gw2" > /etc/sysconfig/network-scripts/rule-${GW2_DEV}
$ cat /etc/sysconfig/network-scripts/rule-${GW2_DEV}
from 172.16.2.10 table gw2

6) Take $GW2_DEV down and back up again, and test:

$ ifdown $GW2_DEV
$ ifup $GW2_DEV

# Test that incoming traffic works as expected e.g. on an external server
$ ssh -v server-via-gw2
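
You can also double-check that the rule and the gw2 routing table are actually in place:

# Show the policy routing rules, and the contents of the gw2 table
ip rule show
ip route show table gw2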

For more, see:

Symlink redirects in nginx

Solved an interesting problem this week using nginx.

We have an internal nginx webserver for distributing datasets with dated filenames, like foobar-20190213.tar.gz. We also create a symlink called foobar-latest.tar.gz, that is updated to point to the latest dataset each time a new version is released. This allows users to just use a fixed url to grab the latest release, rather than having to scrape the page to figure out which version is the latest.

Which generally works well. However, one wrinkle is that when you download via the symlink you end up with a file named with the symlink filename (foobar-latest.tar.gz), rather than a dated one. For some use cases this is fine, but for others you actually want to know what version of the dataset you are using.

What would be ideal would be a way to tell nginx to handle symlinks differently from other files. Specifically, if the requested file is a symlink, look up the file the symlink points to and issue a redirect to request that file. So you'd request foobar-latest.tar.gz, but you'd then be redirected to foobar-20190213.tar.gz instead. This gets you the best of both worlds - a fixed url to request, but a dated filename delivered. (If you don't need dated filenames, of course, you just save to a fixed name of your choice.)

Nginx doesn't support this functionality directly, but it turns out it's pretty easy to configure - at least as long as your symlinks are strictly local (i.e. your target and your symlink both live in the same directory), and as long as you have the nginx embedded perl module included in your nginx install (the one from RHEL/CentOS EPEL does, for instance.)
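
A quick way to check whether your nginx build includes the perl module (it may be compiled in directly, or shipped as a dynamic module that needs a load_module line) is to look at the compile-time configure arguments:

nginx -V 2>&1 | grep -o http_perl_module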

Here's how:

1. Add a couple of helper directives in the http context (that's outside/as a sibling to your server section):

# Setup a variable with the dirname of the current uri
# cf. https://serverfault.com/questions/381545/how-to-extract-only-the-file-name-from-the-request-uri
map $uri $uri_dirname {
  ~^(?<capture>.*)/ $capture;
}

# Use the embedded perl module to return (relative) symlink target filenames
# cf. https://serverfault.com/questions/305228/nginx-serve-the-latest-download
perl_set $symlink_target_rel '
  sub {
    my $r = shift;
    my $filename = $r->filename;
    return "" if ! -l $filename;
    my $target = readlink($filename);
    $target =~ s!^.*/!!;    # strip path (if any)
    return $target;
  }
';

2. In a location section (or similar), just add a test on $symlink_target_rel and issue a redirect using the variables we defined previously:

location / {
  autoindex on;

  # Symlink redirects FTW!
  if ($symlink_target_rel != "") {
    # Note this assumes that your symlink and target are in the same directory
    return 301 https://www.example.com$uri_dirname/$symlink_target_rel;
  }
}

Now when you make a request to a symlinked resource you get redirected instead to the target, but everything else is handled using the standard nginx pathways.

$ curl -i -X HEAD https://www.example.com/foobar/foobar-latest.tar.gz
HTTP/1.1 301 Moved Permanently
Server: nginx/1.12.2
Date: Wed, 13 Feb 2019 05:23:11 GMT
Location: https://www.example.com/foobar/foobar-20190213.tar.gz

incron tips and traps

(Updated April 2020: added new no. 7 after being newly bitten...)

incron is a useful little cron-like utility that lets you run arbitrary jobs (like cron), but instead of being triggered at certain times, your jobs are triggered by changes to files or directories.

It uses the linux kernel inotify facility (hence the name), and so it isn't cross-platform, but on linux it can be really useful for monitoring file changes or uploads, reporting or forwarding based on status files, simple synchronisation schemes, etc.

Again like cron, incron supports the notion of job 'tables' where commands are configured, and users can manage their own tables using an incrontab command, while root can manage multiple system tables.

So it's a really useful linux utility, but it's also fairly old (the last release, v0.5.10, is from 2012), doesn't appear to be under active development any more, and it has a few frustrating quirks that can make using it unnecessarily difficult.

So this post is intended to highlight a few of the 'gotchas' I've experienced using incron:

  1. You can't monitor recursively i.e. if you create a watch on a directory incron will only be triggered on events in that directory itself, not in any subdirectories below it. This isn't really an incron issue since it's a limitation of the underlying inotify mechanism, but it's definitely something you'll want to be aware of going in.

  2. The incron interface is enough like cron (incrontab -l, incrontab -e, man 5 incrontab, etc.) that you might think that all your nice crontab features are available. Unfortunately that's not the case - most significantly, you can't have comments in incron tables (incron will try and parse your comment lines and fail), and you can't set environment variables to be available for your commands. (This includes PATH, so you might need to explicitly set a PATH inside your incron scripts if you need non-standard locations. The default PATH is documented as /usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin.)

  3. That means that cron MAILTO support is not available, and in general there's no easy way of getting access to the stdout or stderr of your jobs. You can't even use shell redirects in your command to capture the output (e.g. echo $@/$# >> /tmp/incron.log doesn't work). If you're debugging, the best you can do is add a layer of indirection by using a wrapper script that does the redirection you need (e.g. echo "$1" >> /tmp/incron.log 2>&1) and calling the wrapper script in your incrontab with the incron arguments (e.g. debug.sh $@/$#) - there's a minimal sketch of such a wrapper at the end of this list. This all makes debugging misbehaving commands pretty painful. The main place to check if your commands are running is the cron log (/var/log/cron) on RHEL/CentOS, and syslog (/var/log/syslog) on Ubuntu/Debian.

  4. incron is also very picky about whitespace in your incrontab. If you put more than one space (or a tab) between the inotify masks and your command, you'll get an error in your cron log saying cannot exec process: No such file or directory, because incron will have included everything after the first space as part of your command e.g. (gavin) CMD ( echo /home/gavin/tmp/foo) (note the evil space before the echo).

  5. It's often difficult (and non-intuitive) to figure out what inotify events you want to trigger on in your incrontab masks. For instance, does 'IN_CREATE' get fired when you replace an existing file with a new version? What events are fired when you do a mv or a cp? If you're wanting to trigger on an incoming remote file copy, should you use 'IN_CREATE' or 'IN_CLOSE_WRITE'? In general, you don't want to guess, you actually want to test and see what events actually get fired on the operations you're interested in. The easiest way to do this is use inotifywait from the inotify-tools package, and run it using inotifywait -m <dir>, which will report to you all the inotify events that get triggered on that directory (hit <Ctrl-C> to exit).

  6. The "If you're wanting to trigger on an incoming remote file copy, should you use 'IN_CREATE' or 'IN_CLOSE_WRITE'?" above was a trick question - it turns out it depends how you're doing the copy! If you're just going a simple copy in-place (e.g. with scp), then (assuming you want the completed file) you're going to want to trigger on 'IN_CLOSE_WRITE', since that's signalling all writing is complete and the full file will be available. If you're using a vanilla rsync, though, that's not going to work, as rsync does a clever write-to-a-hidden-file trick, and then moves the hidden file to the destination name atomically. So in that case you're going to want to trigger on 'IN_MOVED_TO', which will give you the destination filename once the rsync is completed. So again, make sure you test thoroughly before you deploy.

  7. Though cron works fine with symlinks to crontab files (in e.g. /etc/cron.d), incron doesn't support this in /etc/incron.d - symlinks just seem to be quietly ignored. (Maybe this is for security, but it's not documented, afaict.)
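
Here's the kind of minimal debug wrapper mentioned in (3) - the path and filename are just examples:

#!/bin/sh
# /usr/local/bin/incron-debug.sh - log the incron arguments somewhere visible
echo "$(date '+%F %T') $*" >> /tmp/incron.log 2>&1

which you'd call from an incrontab entry like /data/incoming IN_CLOSE_WRITE /usr/local/bin/incron-debug.sh $@/$# (remembering the single-space rule from (4)).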

Have I missed any? Any other nasties bitten you using incron?

MongoDB Database Upgrades

I've been doing a few upgrades of old standalone (not replica set) mongodb databases lately, taking them from 2.6, 3.0, or 3.2 up to 3.6. Here's my upgrade process on RHEL/CentOS, which has been working pretty smoothly (cf. the mongodb notes here: https://docs.mongodb.com/manual/tutorial/change-standalone-wiredtiger/).

First, the WiredTiger storage engine (the default since mongodb 3.2) "strongly" recommends using the xfs filesystem on linux, rather than ext4 (see https://docs.mongodb.com/manual/administration/production-notes/#prod-notes-linux-file-system for details). So the first thing to do is reorganise your disk to make sure you have an xfs filesystem available to hold your upgraded database. If you have the disk space, this may be reasonably straightforward; if you don't, it's a serious PITA.

Once your filesystems are sorted, here's the upgrade procedure.

1. Take a full mongodump of all your databases

cd /data    # Any path with plenty of disk available
for DB in db1 db2 db3; do
  mongodump -d $DB -o mongodump-$DB-$(date +%Y%m%d)
done

2. Shut the current mongod down

systemctl stop mongod
# Save the current mongodb.conf for reference
mv /etc/mongodb.conf /etc/mongod.conf.rpmsave

3. Hide the current /var/lib/mongo directory to avoid getting confused later. Create your new mongo directory on the xfs filesystem you've prepared e.g. /opt.

cd /var/lib
mv mongo mongo-old
# Create new mongo directory on your (new?) xfs filesystem
mkdir /opt/mongo
chown mongod:mongod /opt/mongo

4. Upgrade to mongo v3.6

vi /etc/yum.repos.d/mongodb.repo

# Add the following section, and disable any other mongodb repos you might have
[mongodb-org-3.6]
name=MongoDB 3.6 Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.6/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-3.6.asc

# Then do a yum update on the mongodb packages
yum update mongodb-org-{server,tools,shell}

5. Check/modify the new mongod.conf. See https://docs.mongodb.com/v3.6/reference/configuration-options/ for all the details on the 3.6 config file options. In particular, dbPath should point to the new xfs-based mongo directory you created in (3) above.

vi /etc/mongod.conf

# Your 'storage' settings should look something like this:
storage:
  dbPath: /opt/mongo
  journal:
    enabled: true
  engine: "wiredTiger"
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true

6. Restart mongod

systemctl daemon-reload
systemctl enable mongod
systemctl start mongod
systemctl status mongod

7. If all looks good, reload the mongodump data from (1):

cd /data
for DB in db1 db2 db3; do
  mongorestore --drop mongodump-$DB-$(date +%Y%m%d)
done
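
A quick way to eyeball that the restore worked is to list the databases and their sizes via the mongo shell:

# List database names and on-disk sizes
mongo --quiet --eval 'db.adminCommand("listDatabases").databases.forEach(function(d) { print(d.name, d.sizeOnDisk) })'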

All done!

These are the basics anyway. This doesn't cover configuring access control on your new database, or wrangling SELinux permissions on your database directory, but if you're doing those currently you should be able to figure those out.

Setting up a file transfer host

Had to set up a new file transfer host recently, with the following requirements:

  • individual login accounts required (for customers, no anonymous access)
  • support for (secure) downloads, ideally via a browser (no special software required)
  • support for (secure) uploads, ideally via sftp (most of our customers are familiar with ftp)

Our target was RHEL/CentOS 7, but this should transfer to other linuxes pretty readily.

Here's the scheme we ended up settling on, which seems to give us a good mix of security and flexibility.

  • use apache with HTTPS, and PAM authentication against local accounts (one per customer, with nologin shells)
  • users have their own groups (group=$USER), and also belong to the sftp group
  • we use the users group for internal company accounts, but NOT for customers
  • customer data directories live in /data
  • we use a 3-layer directory hierarchy for security: /data, /data/chroot_$USER, and /data/chroot_$USER/$USER (accounts are created with a nologin shell)
  • the /data/chroot_$USER directory must be owned by root:$USER, with permissions 750, and is used for an sftp chroot directory (not writeable by the user)
  • the next-level /data/chroot_$USER/$USER directory should be owned by $USER:users, with permissions 2770 (where users is our internal company user group, so both the customer and our internal users can write here)
  • we also add an ACL to /data/chroot_$USER to allow the company-internal users group read/search access (but not write)

We just use openssh internal-sftp to provide sftp access, with the following config:

Subsystem sftp internal-sftp
Match Group sftp
  ChrootDirectory /data/chroot_%u
  X11Forwarding no
  AllowTcpForwarding no
  ForceCommand internal-sftp -d /%u

So we chroot sftp connections to /data/chroot_$USER and then (via the ForceCommand) chdir to /data/chroot_$USER/$USER, so they start off in the writeable part of their tree. (If they bother to pwd, they see that they're in /$USER, and they can chdir up a level, but there's nothing else there except their $USER directory, and they can't write to the chroot.)

Here's a slightly simplified version of the newuser script we use:

#!/bin/bash

die() {
  echo $*
  exit 1
}

test -n "$1" || die "usage: $(basename $0) <username>"

USERNAME=$1

# Create the user and home directories
mkdir -p /data/chroot_$USERNAME/$USERNAME
useradd --user-group -G sftp -d /data/chroot_$USERNAME/$USERNAME -s /sbin/nologin $USERNAME

# Set home directory permissions
chown root:$USERNAME /data/chroot_$USERNAME
chmod 750 /data/chroot_$USERNAME
setfacl -m group:users:rx /data/chroot_$USERNAME
chown $USERNAME:users /data/chroot_$USERNAME/$USERNAME
chmod 2770 /data/chroot_$USERNAME/$USERNAME

# Set user password manually
passwd $USERNAME
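
Then creating a customer account is just the following (run as root; 'acme' is a made-up example, and this assumes you've saved the script as newuser):

./newuser acme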

And we add an apache config file like the following to /etc/httpd/user.d:

Alias /CUSTOMER /data/chroot_CUSTOMER/CUSTOMER
<Directory /data/chroot_CUSTOMER/CUSTOMER>
Options +Indexes
Include "conf/auth.conf"
Require user CUSTOMER
</Directory>

(with CUSTOMER changed to the local username), and where conf/auth.conf has the authentication configuration against our local PAM users and allows internal company users access.

So far so good, but how do we restrict customers to their own /CUSTOMER tree?

That's pretty easy too - we just disallow customers from accessing our apache document root, and redirect them to a magic '/user' endpoint using an ErrorDocument 403 directive:

<Directory /var/www/html>
Options +Indexes +FollowSymLinks
Include "conf/auth.conf"
# Any user not in auth.conf, redirect to /user
ErrorDocument 403 "/user"
</Directory>

with /user defined as follows:

# Magic /user endpoint, redirecting to /$USERNAME
<Location /user>
Include "conf/auth.conf"
Require valid-user
RewriteEngine On
RewriteCond %{LA-U:REMOTE_USER} ^[a-z].*
RewriteRule ^\/(.*)$ /%{LA-U:REMOTE_USER}/ [R]
</Location>

The combination of these two says that any valid user NOT in auth.conf should be redirected to their own /CUSTOMER endpoint, so each customer user lands there, and can't get anywhere else.

Works well, no additional software is required over vanilla apache and openssh, and it still feels relatively simple, while meeting our security requirements.

Linux Software RAID Mirror In-Place Upgrade

Ran out of space on an old CentOS 6.8 server in the weekend, and so had to upgrade the main data mirror from a pair of Hitachi 2TB HDDs to a pair of 4TB WD Reds I had lying around.

The volume was using mdadm, aka Linux Software RAID, and is a simple mirror (RAID1), with LVM volumes on top of the mirror. The safest upgrade path is to build a new mirror on the new disks and sync the data across, but there weren't any free SATA ports on the motherboard, so instead I opted to do an in-place upgrade. I haven't done this for a good few years, and hit a couple of wrinkles on the way, so here are the notes from the journey.

Below, the physical disk partitions are /dev/sdb1 and /dev/sdd1, the mirror is /dev/md1, and the LVM volume group is extra.

1. Backup your data (or check you have known good rock-solid backups in place), because this is a complex process with plenty that could go wrong. You want an exit strategy.

2. Break the mirror, failing and removing one of the old disks

mdadm --manage /dev/md1 --fail /dev/sdb1
mdadm --manage /dev/md1 --remove /dev/sdb1

3. Shutdown the server, remove the disk you've just failed, and insert your replacement. Boot up again.

4. Since these are brand new disks, we need to partition them. And since these are 4TB disks, we need to use parted rather than the older fdisk.

parted /dev/sdb
print
mklabel gpt
# Create a partition, skipping the 1st MB at beginning and end
mkpart primary 1 -1
unit s
print
# Not sure if this flag is required, but whatever
set 1 raid on
quit

5. Then add the new partition back into the mirror. Although this is much bigger, it will just sync up at the old size, which is what we want for now.

mdadm --manage /dev/md1 --add /dev/sdb1
# This will take a few hours to resync, so let's keep an eye on progress
watch -n5 cat /proc/mdstat

6. Once all resynched, rinse and repeat with the other disk - fail and remove /dev/sdd1, shutdown and swap the new disk in, boot up again, partition the new disk, and add the new partition into the mirror.

7. Once all resynched again, you'll be back where you started - a nice stable mirror of your original size, but with shiny new hardware underneath. Now we can grow the mirror to take advantage of all this new space we've got.

mdadm --grow /dev/md1 --size=max
mdadm: component size of /dev/md1 has been set to 2147479552K

Ooops! That size doesn't look right, that's 2TB, but these are 4TB disks?! Turns out there's a 2TB limit on mdadm metadata version 0.90, which this mirror is using, as documented on https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#The_version-0.90_Superblock_Format.

mdadm --detail /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Thu Aug 26 21:03:47 2010
    Raid Level : raid1
    Array Size : 2147483648 (2048.00 GiB 2199.02 GB)
  Used Dev Size : -1
  Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Nov 27 11:49:44 2017
          State : clean 
Active Devices : 2
Working Devices : 2
Failed Devices : 0
  Spare Devices : 0

          UUID : f76c75fb:7506bc25:dab805d9:e8e5d879
        Events : 0.1438

    Number   Major   Minor   RaidDevice State
      0       8       17        0      active sync   /dev/sdb1
      1       8       49        1      active sync   /dev/sdd1

Unfortunately, mdadm doesn't support upgrading the metadata version. But there is a workaround documented on that wiki page, so let's try that:

mdadm --detail /dev/md1
# (as above)
# Stop/remove the mirror
mdadm --stop /dev/md1
mdadm: Cannot get exclusive access to /dev/md1:Perhaps a running process, mounted filesystem or active volume group?
# Okay, deactivate our volume group first
vgchange --activate n extra
# Retry stop
mdadm --stop /dev/md1
mdadm: stopped /dev/md1
# Recreate the mirror with 1.0 metadata (you can't go to 1.1 or 1.2, because they're located differently)
# Note that you should specify all your parameters in case the defaults have changed
mdadm --create /dev/md1 -l1 -n2 --metadata=1.0 --assume-clean --size=2147483648 /dev/sdb1 /dev/sdd1

That outputs:

mdadm: /dev/sdb1 appears to be part of a raid array:
      level=raid1 devices=2 ctime=Thu Aug 26 21:03:47 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
      level=raid1 devices=2 ctime=Thu Aug 26 21:03:47 2010
mdadm: largest drive (/dev/sdb1) exceeds size (2147483648K) by more than 1%
Continue creating array? y
mdadm: array /dev/md1 started.

Success! Now let's reactivate that volume group again:

vgchange --activate y extra
3 logical volume(s) in volume group "extra" now active

Another wrinkle is that recreating the mirror will have changed the array UUID, so we need to update the old UUID in /etc/mdadm.conf:

# Double-check metadata version, and record volume UUID
mdadm --detail /dev/md1
# Update the /dev/md1 entry UUID in /etc/mdadm.conf
$EDITOR /etc/mdadm.conf

So now, let's try that mdadm --grow command again:

mdadm --grow /dev/md1 --size=max
mdadm: component size of /dev/md1 has been set to 3907016564K
# Much better! This will take a while to synch up again now:
watch -n5 cat /proc/mdstat

8. (You can wait for this to finish resynching first, but it's optional.) Now we need to let LVM know that the physical volume underneath it has changed size:

# Check our starting point
pvdisplay /dev/md1
--- Physical volume ---
PV Name               /dev/md1
VG Name               extra
PV Size               1.82 TiB / not usable 14.50 MiB
Allocatable           yes 
PE Size               64.00 MiB
Total PE              29808
Free PE               1072
Allocated PE          28736
PV UUID               mzLeMW-USCr-WmkC-552k-FqNk-96N0-bPh8ip
# Resize the LVM physical volume
pvresize /dev/md1
Read-only locking type set. Write locks are prohibited.
Can't get lock for system
Cannot process volume group system
Read-only locking type set. Write locks are prohibited.
Can't get lock for extra
Cannot process volume group extra
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm1
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_pool
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm2
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for system
Cannot process volume group system
Read-only locking type set. Write locks are prohibited.
Can't get lock for extra
Cannot process volume group extra
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm1
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_pool
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm2
Cannot process standalone physical volumes
Failed to find physical volume "/dev/md1".
0 physical volume(s) resized / 0 physical volume(s) not resized

Oops - that doesn't look good. But it turns out it's just a weird locking type default. If we tell pvresize it can use local filesystem write locks we should be good (cf. /etc/lvm/lvm.conf):

# Let's try that again...
pvresize --config 'global {locking_type=1}' /dev/md1
Physical volume "/dev/md1" changed
1 physical volume(s) resized / 0 physical volume(s) not resized
# Double-check the PV Size
pvdisplay /dev/md1
--- Physical volume ---
PV Name               /dev/md1
VG Name               extra
PV Size               3.64 TiB / not usable 21.68 MiB
Allocatable           yes 
PE Size               64.00 MiB
Total PE              59616
Free PE               30880
Allocated PE          28736
PV UUID               mzLeMW-USCr-WmkC-552k-FqNk-96N0-bPh8ip

Success!

Finally, you can now resize your logical volumes using lvresize as you usually would.
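
For example, to grow a hypothetical logical volume called data in the extra volume group, plus the filesystem on it (assuming ext4):

# Add 1TB to the 'data' LV, then grow the filesystem to match
lvresize --size +1T /dev/extra/data
resize2fs /dev/extra/data
# (or use 'lvresize --resizefs' to do both steps in one go)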

The iconv slurp misfeature

Since I got bitten by this recently, let me blog a quick warning here: glibc iconv - a utility for character set conversions, like iso8859-1 or windows-1252 to utf-8 - has a nasty misfeature/bug: if you give it data on stdin it will slurp the entire file into memory before it does a single character conversion.

Which is fine if you're running small input files. If you're trying to convert a 10G file on a VPS with 2G of RAM, however ... not so good!

This looks to be a known issue, with patches submitted to fix it in August 2015, but I'm not sure if they've been merged, or into which version of glibc. Certainly RHEL/CentOS 7 (with glibc 2.17) and Ubuntu 14.04 (with glibc 2.19) are both affected.

Once you know about the issue, it's easy enough to workaround - there's an iconv-chunks wrapper on github that breaks the input into chunks before feeding it to iconv, or you can do much the same thing using the lovely GNU parallel e.g.

gunzip -c monster.csv.gz | parallel --pipe -k iconv -f windows-1252 -t utf8

Nasty OOM avoided!

ThinkPad X250 on CentOS 7

Wow, almost a year since the last post. Definitely time to reboot the blog.

Got to replace my aging ThinkPad X201 with a lovely shiny new ThinkPad X250 over the weekend. Specs are:

  • CPU: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
  • RAM: 16GB PC3-12800 DDR3L SDRAM 1600MHz SODIMM
  • Disk: 256GB SSD (swapped out for existing Samsung SSD)
  • Display: 12.5" 1920x1080 IPS display, 400nit, non-touch
  • Graphics: Intel Graphics 5500
  • Wireless: Intel 7265 AC/B/G/N Dual Band Wireless and Bluetooth 4.0
  • Batteries: 1 3-cell internal, 1 6-cell hot-swappable

A very nice piece of kit!

Just wanted to document what works and what doesn't (so far) on my standard OS, CentOS 7, with RH kernel 3.10.0-229.11.1. I had to install the following additional packages:

  • iwl7265-firmware (for wireless support)
  • acpid (for the media buttons)

Working so far:

  • media buttons (Fn + F1/mute, F2/softer, F3/louder - see below for configuration)
  • wifi button (Fn + F8 - worked out of the box)
  • keyboard backlight (Fn + space, out of the box)
  • sleep/resume (out of the box)
  • touchpad hard buttons (see below)
  • touchpad soft buttons (out of the box)

Not working / unconfigured so far:

  • brightness buttons (Fn + F5/F6)
  • fingerprint reader (supposedly works with fprintd)

Not working / no ACPI codes:

  • mute microphone button (Fn + F4)
  • application buttons (Fn + F9-F12)

Uncertain/not tested yet:

  • switch video mode (Fn + F7)

To get the touchpad working I needed to use the "evdev" driver rather than the "Synaptics" one - I added the following as /etc/X11/xorg.conf.d/90-evdev.conf:

Section "InputClass"
  Identifier "Touchpad/TrackPoint"
  MatchProduct "PS/2 Synaptics TouchPad"
  MatchDriver "evdev"
  Option "EmulateWheel" "1"
  Option "EmulateWheelButton" "2"
  Option "Emulate3Buttons" "0"
  Option "XAxisMapping" "6 7"
  Option "YAxisMapping" "4 5"
EndSection

This gives me 3 working hard buttons above the touchpad, including the middle mouse button for paste.

To get fonts scaling properly I needed to add a monitor section as /etc/X11/xorg.conf.d/50-monitor.conf, specifically for the DisplaySize:

Section "Monitor"
  Identifier    "Monitor0"
  VendorName    "Lenovo ThinkPad"
  ModelName     "X250"
  DisplaySize   276 155
  Option        "DPMS"
EndSection

and also set the dpi properly in my ~/.Xdefaults:

*.dpi: 177

This fixes font size nicely in Firefox/Chrome and terminals for me.

I also found my mouse movement was too slow, which I fixed with:

xinput set-prop 11 "Device Accel Constant Deceleration" 0.7

(which I put in my old-school ~/.xsession file).

Finally, getting the media keys involved installing acpid and setting up the appropriate magic in 3 files in /etc/acpi/events:

# /etc/acpi/events/volumedown
event=button/volumedown
action=/etc/acpi/actions/volume.sh down

# /etc/acpi/events/volumeup
event=button/volumeup
action=/etc/acpi/actions/volume.sh up

# /etc/acpi/events/volumemute
event=button/mute
action=/etc/acpi/actions/volume.sh mute

Those files capture the ACPI events and handle them via a custom script in /etc/acpi/actions/volume.sh, which uses amixer from alsa-utils. Volume control worked just fine, but muting was a real pain to get working correctly due to what seems like a bug in amixer - amixer -c1 sset Master playback toggle doesn't toggle correctly - it mutes fine, but then doesn't unmute all the channels it mutes!

I worked around it by figuring out the specific channels that sset Master was muting, and then handling them individually, but it's definitely not as clean:

#!/bin/sh
#
# /etc/acpi/actions/volume.sh (must be executable)
#

PATH=/usr/bin

die() {
  echo $*
  exit 1
}
usage() {
  die "usage: $(basename $0) up|down|mute"
}

test -n "$1" || usage

ACTION=$1
shift

case $ACTION in
  up)
    amixer -q -c1 -M sset Master 5%+ unmute
    ;;
  down)
    amixer -q -c1 -M sset Master 5%- unmute
    ;;
  mute)
    # Ideally the next command should work, but it doesn't unmute correctly
#   amixer -q -c1 sset Master playback toggle

    # Manual version for ThinkPad X250 channels
    # If adapting for another machine, 'amixer -C$DEV contents' is your friend (NOT 'scontents'!)
    SOUND_IS_OFF=$(amixer -c1 cget iface=MIXER,name='Master Playback Switch' | grep 'values=off')
    if [ -n "$SOUND_IS_OFF" ]; then
      amixer -q -c1 cset iface=MIXER,name='Master Playback Switch'    on
      amixer -q -c1 cset iface=MIXER,name='Headphone Playback Switch' on
      amixer -q -c1 cset iface=MIXER,name='Speaker Playback Switch'   on
    else
      amixer -q -c1 cset iface=MIXER,name='Master Playback Switch'    off
      amixer -q -c1 cset iface=MIXER,name='Headphone Playback Switch' off
      amixer -q -c1 cset iface=MIXER,name='Speaker Playback Switch'   off
    fi
    ;;
  *)
    usage
    ;;
esac

So in short, really pleased with the X250 so far - the screen is lovely, battery life seems great, I'm enjoying the keyboard, and most things have Just Worked or have been pretty easily configurable with CentOS. Happy camper!

References:

My Personal URL Shortener

I wrote a really simple personal URL shortener a couple of years ago, and have been using it happily ever since. It's called shrtn ("shorten"), and is just a simple perl script that captures (or generates) a mapping between a URL and a code, records it in a simple text db, and then generates a static html file that uses an HTML meta-redirect to point your browser towards the URL.

It was originally based on posts from Dave Winer and Phil Windley, but was interesting enough that I felt the itch to implement my own.

I just run it on my laptop (shrtn <url> [<code>]), and it has settings to commit the mapping to git and push it out to a remote repo (for backup), and to push the generated html files up to a webserver somewhere (for serving the html).

Most people seem to like the analytics side of personal URL shorteners (seeing who clicks your links), but I don't really track that side of it at all (it would be easy enough to add Google Analytics to your html files to do that, or just do some analysis on the access logs). I mostly wanted it initially to post nice short links when microblogging, where post length is an issue.

Surprisingly though, the most interesting use case in practice is the ability to give custom mnemonic codes to URLs I use reasonably often, or cite to other people a bit. If I find myself sharing a URL with more than a couple of people, it's easier just to create a shortened version and use that instead - it's simpler, easier to type, and easier to remember for next time.

So my shortener has sort of become a cross between a Level 1 URL cache and a poor man's bookmarking service. For instance:

If you don't have a personal url shortener you should give it a try - it's a surprisingly interesting addition to one's personal cloud. And all you need to try it out is a domain and some static webspace somewhere to host your html files.

Too easy.

[ Technical Note: html-based meta-redirects work just fine with browsers, including mobile and text-only ones. They don't work with most spiders and bots, however, which may be a bug or a feature, depending on your usage. For a personal url shortener meta-redirects probably work just fine, and you gain all the performance and stability advantages of static html over dynamic content. For a corporate url shortener where you want bots to be able to follow your links, as well as people, you probably want to use http-level redirects instead. In which case you either go with a hosted option, or look at something like YOURLS for a slightly more heavyweight self-hosted option. ]

Fujitsu ScanSnap 1300i on RHEL/CentOS

Just picked up a shiny new Fujitsu ScanSnap 1300i ADF scanner to get more serious about less paper.

Fujitsu ScanSnap 1300i

I chose the 1300i on the basis of the nice small form factor, and that SANE reports it having 'good' support with current SANE backends. I'd also been able to find success stories of other linux users getting the similar S1300 working okay:

Here's my experience getting the S1300i up and running on CentOS 6.

I had the following packages already installed on my CentOS 6 workstation, so I didn't need to install any new software:

  • sane-backends
  • sane-backends-libs
  • sane-frontends
  • xsane
  • gscan2pdf (from rpmforge)
  • gocr (from rpmforge)
  • tesseract (from my repo)

I plugged the S1300i in (via the dual USB cables instead of the power supply - nice!), turned it on (by opening the top cover) and then ran sudo sane-find-scanner. All good:

found USB scanner (vendor=0x04c5 [FUJITSU], product=0x128d [ScanSnap S1300i]) at libusb:001:013
# Your USB scanner was (probably) detected. It may or may not be supported by
# SANE. Try scanimage -L and read the backend's manpage.

Ran sudo scanimage -L - no scanner found.

I downloaded the S1300 firmware Luuk had provided in his post and installed it into /usr/share/sane/epjitsu, and then updated /etc/sane.d/epjitsu.conf to reference it:

# Fujitsu S1300i
firmware /usr/share/sane/epjitsu/1300_0C26.nal
usb 0x04c5 0x128d

Ran sudo scanimage -L - still no scanner found. Hmmm.

Rebooted into windows, downloaded the Fujitsu ScanSnap Manager package and installed it. Grubbed around in C:/Windows and found the following 4 firmware packages:

Copied the firmware onto another box, and rebooted back into linux. Copied the 4 new firmware files into /usr/share/sane/epjitsu and updated /etc/sane.d/epjitsu.conf to try the 1300i firmware:

# Fujitsu S1300i
firmware /usr/share/sane/epjitsu/1300i_0D12.nal
usb 0x04c5 0x128d

Close and re-open the S1300i (i.e. restart, just to be sure), and retried sudo scanimage -L. And lo, this time the scanner whirrs briefly and ... victory!

$ sudo scanimage -L
device 'epjitsu:libusb:001:014' is a FUJITSU ScanSnap S1300i scanner

I start gscan2pdf to try some scanning goodness. Eeerk: "No devices found". Hmmm. How about sudo gscan2pdf? Ahah, success - "FUJITSU ScanSnap S1300i" shows up in the Device dropdown.

I exit, and google how to deal with the permissions problem. Looks like the usb device gets created by udev as root:root 0664, and I need 'rw' permissions for scanning:

$ ls -l /dev/bus/usb/001/014
crw-rw-r--. 1 root root 189, 13 Sep 20 20:50 /dev/bus/usb/001/014

The fix is to add a scanner group and local udev rule to use that group when creating the device path:

# Add a scanner group (analogous to the existing lp, cdrom, tape, dialout groups)
$ sudo groupadd -r scanner
# Add myself to the scanner group
$ sudo usermod -aG scanner gavin
# Add a udev local rule for the S1300i
$ sudo vim /etc/udev/rules.d/99-local.rules
# Added:
# Fujitsu ScanSnap S1300i
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="128d", MODE="0664", GROUP="scanner", ENV{libsane_matched}="yes"

Then logout and log back in to pickup the change in groups, and close and re-open the S1300i. If all is well, I'm now in the scanner group and can control the scanner sans sudo:

# Check I'm in the scanner group now
$ id
uid=900(gavin) gid=100(users) groups=100(users),10(wheel),487(scanner)
# Check I can scanimage without sudo
$ scanimage -L
device 'epjitsu:libusb:001:016' is a FUJITSU ScanSnap S1300i scanner
# Confirm the permissions on the udev path (adjusted to match the new libusb path)
$ ls -l /dev/bus/usb/001/016
crw-rw-r--. 1 root scanner 189, 15 Sep 20 21:30 /dev/bus/usb/001/016
# Success!

Try gscan2pdf again, and this time it works fine without sudo!

And so far gscan2pdf 1.2.5 seems to work pretty nicely. It handles both simplex and duplex scans, and both the cleanup phase (using unpaper) and the OCR phase (with either gocr or tesseract) work without problems. tesseract seems to perform markedly better than gocr so far, as seems pretty typical.

So thus far I'm a pretty happy purchaser. On to a paperless searchable future!

Parallelising With Perl

Did a talk at the Sydney Perl Mongers group on Tuesday night, called "Parallelising with Perl", covering AnyEvent, MCE, and GNU Parallel.

Slides

e-billing Still Sucks

You'd think that 20 years into the Web we'd have billing all sorted out. (I've got in view here primarily bill/invoice delivery, rather than payments, and consumer-focussed billing, rather than B2B invoicing).

We don't. Our bills are probably as likely to still come on paper as in digital versions, and the current "e-billing" options all come with significant limitations (at least here in Australia - I'd love to hear about awesome implementations elsewhere!)

Here, for example, are a representative set of my current vendors, and their billing delivery options (I'm not picking on anyone here, just grounding the discussion in some specific examples).

Vendor             Billing Delivery Options
Citibank           email, paper, web
iinet              email, paper?, web
Kuringai Council   BPayView, paper
Origin Energy      email, paper
Sydney Water       Australia Post Digital Mailbox, BPayView, paper

So that all looks pretty reasonable, you might say. All your vendors have some kind of e-billing option. What's the problem?

The current e-billing options

Here's how I'd rate the various options available:

  • email: email is IMO the best current option for bill delivery - it's decentralised, lightweight, push-rather-than-pull, and relatively easy to integrate/automate. Unfortunately, not everyone offers it, and sometimes (e.g. Citibank) they insist on putting passwords on the documents they send out via email on the grounds of 'security'. (On the other hand, emails are notoriously easy to fake, so faking a bill email is a straightforward attack vector if you can figure out customer-vendor relationships.)

    (Note too that most of the non-email e-billing options still use email for sending alerts about a new bill, they just don't also send the bill through as an attachment.)

  • web (i.e. a company portal of some kind which you log into and can then download your bill): this is efficient for the vendor, but pretty inefficient for the customer - it requires going to the particular website, logging in, and navigating to the correct location before you can view or download your bill. So it's an inefficient, pull-based solution, requiring yet another username/password, and with few integration/automation options (and security issues if you try).

  • BPayView / Australia Post Digital Mailbox: for non-Australians, these are free (for consumers) solutions for storing and paying bills offered by a consortium of banks (BPayView) and Australia Post (Digital Mailbox) respectively. These provide a pretty decent user experience in that your bills are centralised, and they can often parse the bill payment options and make the payment process easy and less error-prone. On the other hand, centralisation is a two-edged sword, as it makes it harder to change providers (can you get your data out of these providers?); it narrows your choices in terms of bill payment (or at least makes certain kinds of payment options easier than others); and it's basically still a web-based solution, requiring login and navigation, and very difficult to automate or integrate elsewhere. I'm also suspicious of 'free' services from corporates - clearly there is value in driving you through their preferred payment solutions and/or in the transaction data itself, or they wouldn't be offering it to you.

    Also, why are there limited providers at all? There should be a standard in place so that vendors don't have to integrate separately with each provider, and so that customers have maximum choice in whom they wish to deal with. Wins all-round.

And then there's the issue of formats. I'm not aware of any Australian vendors that bill customers in any format except PDF - are there any?

PDFs are reasonable for human consumption, but billing should really also (or instead) be done in a format meant for computer consumption, so bills can be parsed and processed reliably. This presumably means a standardised XML or JSON format of some kind (XBRL?).

How billing should work

Here's a strawman workflow for how I think billing should work:

  • the customer's profile with the vendor includes a billing delivery URL, which is a vendor-specific location supplied by the customer to which their bills are to be HTTP POST-ed. It should be an HTTPS URL to secure the content during transmission, and the vendor should treat the URL as sensitive, since possessing it would allow someone to post fake invoices to the customer.

  • if the vendor supports more than one bill/invoice format, the customer should be able to select the format they'd like.

  • the vendor posts invoices to the customer's URL and gets back a URL referencing the customer's record of that invoice. (The vendor might, for instance, be able to query that record for status information, or they might supply a webhook of their own to have status updates on the invoice pushed back to them.)

  • the customer's billing system should check that the posted invoice has the correct customer details (at a minimum, the customer's account number with the vendor), and ideally should also check the bill payment methods against an authoritative set maintained by the vendor (this provides protection against someone injecting a fake invoice with bogus bill payment details into the system).

  • the customer's billing system is then responsible for facilitating the bill payment manually or automatically at or before the due date, using the customer's preferred payment method. This might involve billing calendar feeds, global or per-vendor preferred payment methods, automatic checks on invoice size against vendor history, etc.

  • all billing data (ideally fully parsed, categorised, and tagged) is then available for further automation / integration e.g. personal financial analytics, custom graphing, etc.

This kind of solution would give the customer full control over their billing data, the ability to choose a billing provider that's separate from (and more agile than) their vendors and banks, as well as significant flexibility to integrate and automate further. It should also be pretty straightforward on the vendor side - it just requires a standard HTTP POST and provides immediate feedback to the vendor on success or failure.
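
To make the strawman a little more concrete, here's a purely hypothetical sketch of what the vendor-side POST might look like. The URL, token, field names, and JSON layout are all invented for illustration - no such standard exists (yet):

# Hypothetical invoice delivery from the vendor's side - every detail here is made up
curl -sS -X POST 'https://bills.example.net/inbox/CUSTOMER-SUPPLIED-TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
        "vendor": "Acme Energy",
        "account_number": "12345678",
        "amount": "245.67",
        "currency": "AUD",
        "due_date": "2015-08-01",
        "payment_methods": [ { "type": "bpay", "biller_code": "99999", "reference": "12345678" } ]
      }'

# Hypothetical response, giving the vendor a handle on the customer's record of this invoice:
# { "invoice_url": "https://bills.example.net/invoices/2015-07-acme-energy" }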

Why doesn't this exist already? It doesn't seem that hard.

CSV wrangling with App::CCSV

Well past time to get back on the blogging horse.

I'm now working on a big data web mining startup, and spending an inordinate amount of time buried in large data files, often some variant of CSV.

My favourite new tool over the last few months is Karlheinz Zoechling's App::CCSV perl module, which lets you do some really powerful CSV processing using perl one-liners, instead of having to write a trivial/throwaway script.

If you're familiar with perl's standard autosplit functionality (perl -a) then App::CCSV will look pretty similar - it autosplits its input into an array on your CSV delimiters for further processing. It handles embedded delimiters and CSV quoting conventions correctly, though, which perl's standard autosplitting doesn't.

App::CCSV uses @f to hold the autosplit fields, and provides utility functions csay and cprint for doing say and print on the CSV-joins of your array. So for example:

# Print just the first 3 fields of your file
perl -MApp::CCSV -ne 'csay @f[0..2]' < file.csv

# Print only lines where the second field is 'Y' or 'T'
perl -MApp::CCSV -ne 'csay @f if $f[1] =~ /^[YT]$/' < file.csv

# Print the CSV header and all lines where field 3 is negative
perl -MApp::CCSV -ne 'csay @f if $. == 1 || ($f[2]||0) < 0' < file.csv

# Insert a new country code field after the first field
# (get_country_code() here is just an illustrative helper, not part of App::CCSV)
perl -MApp::CCSV -ne '$cc = get_country_code($f[0]); csay $f[0],$cc,@f[1..$#f]' < file.csv

App::CCSV can use a config file to handle different kinds of CSV input. Here's what I'm using, which lives at ~/.CCSVConf in my home directory:

<CCSV>
sep_char ,
quote_char """
<names>
  <comma>
    sep_char ","
    quote_char """
  </comma>
  <tabs>
    sep_char "  "
    quote_char """
  </tabs>
  <pipe>
    sep_char "|"
    quote_char """
  </pipe>
  <commanq>
    sep_char ","
    quote_char ""
  </commanq>
  <tabsnq>
    sep_char "  "
    quote_char ""
  </tabsnq>
  <pipenq>
    sep_char "|"
    quote_char ""
  </pipenq>
</names>
</CCSV>

That defines six names in two groups: comma, tabs, and pipe for [,\t|] delimiters with standard CSV quote conventions; and three nq ("no-quote") variants - commanq, tabsnq, and pipenq - to handle inputs that aren't using standard CSV quoting. The top-level settings also make the comma behaviour the default.

You use one of the names by specifying it when loading the module, after an =:

perl -MApp::CCSV=comma ...
perl -MApp::CCSV=tabs ...
perl -MApp::CCSV=pipe ...

You can also convert between formats by specifying two names, in <input>,<output> format, e.g.:

perl -MApp::CCSV=comma,pipe ...
perl -MApp::CCSV=tabs,comma ...
perl -MApp::CCSV=pipe,tabs ...
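
For instance, with the config above, converting a standard CSV file to pipe-separated output should look something like this (a sketch, using the names defined earlier):

# Re-emit every record using the 'pipe' output settings
perl -MApp::CCSV=comma,pipe -ne 'csay @f' < file.csv > file.psv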

And just to round things off, I have a few aliases defined in my bashrc file to make these even easier to use:

alias perlcsv='perl -CSAD -MApp::CCSV'
alias perlpsv='perl -CSAD -MApp::CCSV=pipe'
alias perltsv='perl -CSAD -MApp::CCSV=tabs'
alias perlcsvnq='perl -CSAD -MApp::CCSV=commanq'
alias perlpsvnq='perl -CSAD -MApp::CCSV=pipenq'
alias perltsvnq='perl -CSAD -MApp::CCSV=tabsnq'

That simplifies my standard invocation to something like:

perlcsv -ne 'csay @f[0..2]' < file.csv

Happy data wrangling!

Checking for a Dell DRAC card on linux

Note to self: this seems to be the most reliable way of checking whether a Dell machine has a DRAC card installed:

sudo ipmitool sdr elist mcloc

If there is one, you'll see some kind of DRAC card:

iDRAC6           | 00h | ok  |  7.1 | Dynamic MC @ 20h

If there isn't, you'll see only a base management controller:

BMC              | 00h | ok  |  7.1 | Dynamic MC @ 20h

You need IPMI set up for this (if you haven't already):

# on RHEL/CentOS etc.
yum install OpenIPMI
service ipmi start
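
And if you want to script the check, something like the following should work, based on the sample output above (the exact sensor name varies by DRAC generation, so the match is deliberately loose):

# Sketch: detect a DRAC card from the ipmitool output shown above
if sudo ipmitool sdr elist mcloc | grep -qi drac; then
  echo "DRAC card detected"
else
  echo "no DRAC card found (BMC only)"
fi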

Subtracting Text Files

This has bitten me a couple of times now, and each time I've had to re-google the utility and figure out the appropriate incantation. So note to self: to subtract text files use comm(1).

Input files have to be sorted, but comm accepts a - argument for stdin, so you can sort on the fly if you like.
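
For example, to sort one input on the fly (using the 'unique to FILE1' form from the cheatsheet below):

# one.txt is sorted on the fly via stdin; two.txt is assumed to be sorted already
sort one.txt | comm -23 - two.txt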

I also find the -1 -2 -3 options pretty counter-intuitive: they indicate which columns you want to suppress, when what I really want is to say which ones to select. But whatever.

Here's the cheatsheet:

FILE1=one.txt
FILE2=two.txt

# FILE1 - FILE2 (lines unique to FILE1)
comm -23 $FILE1 $FILE2

# FILE2 - FILE1 (lines unique to FILE2)
comm -13 $FILE1 $FILE2

# intersection (common lines)
comm -12 $FILE1 $FILE2

# xor (non-common lines, either FILE)
comm -3 $FILE1 $FILE2
# or without the column delimiters:
comm -3 --output-delimiter=' ' $FILE1 $FILE2 | sed 's/^ *//'

# union (all lines)
comm $FILE1 $FILE2
# or without the column delimiters:
comm --output-delimiter=' ' $FILE1 $FILE2 | sed 's/^ *//'