Fujitsu ScanSnap 1300i on RHEL/CentOS

Just picked up a shiny new Fujitsu ScanSnap 1300i ADF scanner to get more serious about less paper.

Fujitsu ScanSnap 1300i

I chose the 1300i on the basis of its nice small form factor, and because SANE reports it as having 'good' support with current SANE backends. I'd also found success stories from other linux users getting the similar S1300 working okay.

Here's my experience getting the S1300i up and running on CentOS 6.

I had the following packages already installed on my CentOS 6 workstation, so I didn't need to install any new software:

  • sane-backends
  • sane-backends-libs
  • sane-frontends
  • xsane
  • gscan2pdf (from rpmforge)
  • gocr (from rpmforge)
  • tesseract (from my repo)

I plugged the S1300i in (via the dual USB cables instead of the power supply - nice!), turned it on (by opening the top cover) and then ran sudo sane-find-scanner. All good:

found USB scanner (vendor=0x04c5 [FUJITSU], product=0x128d [ScanSnap S1300i]) at libusb:001:013
# Your USB scanner was (probably) detected. It may or may not be supported by
# SANE. Try scanimage -L and read the backend's manpage.

Ran sudo scanimage -L - no scanner found.

I downloaded the S1300 firmware Luuk had provided in his post and installed it into /usr/share/sane/epjitsu, and then updated /etc/sane.d/epjitsu.conf to reference it:

# Fujitsu S1300i
firmware /usr/share/sane/epjitsu/1300_0C26.nal
usb 0x04c5 0x128d

Ran sudo scanimage -L - still no scanner found. Hmmm.

Rebooted into windows, downloaded the Fujitsu ScanSnap Manager package and installed it. Grubbed around in C:/Windows and found four firmware files.

Copied the firmware onto another box, and rebooted back into linux. Copied the 4 new firmware files into /usr/share/sane/epjitsu and updated /etc/sane.d/epjitsu.conf to try the 1300i firmware:

# Fujitsu S1300i
firmware /usr/share/sane/epjitsu/1300i_0D12.nal
usb 0x04c5 0x128d

Closed and re-opened the S1300i (i.e. restarted it, just to be sure), and retried sudo scanimage -L. And lo, this time the scanner whirrs briefly and ... victory!

$ sudo scanimage -L
device 'epjitsu:libusb:001:014' is a FUJITSU ScanSnap S1300i scanner

I start gscan2pdf to try some scanning goodness. Eeerk: "No devices found". Hmmm. How about sudo gscan2pdf? Ahah, success - "FUJITSU ScanSnap S1300i" shows up in the Device dropdown.

I exit, and google how to deal with the permissions problem. Looks like the usb device gets created by udev as root:root 0664, and I need 'rw' permissions for scanning:

$ ls -l /dev/bus/usb/001/014
crw-rw-r--. 1 root root 189, 13 Sep 20 20:50 /dev/bus/usb/001/014

The fix is to add a scanner group and local udev rule to use that group when creating the device path:

# Add a scanner group (analogous to the existing lp, cdrom, tape, dialout groups)
$ sudo groupadd -r scanner
# Add myself to the scanner group
$ sudo usermod -aG scanner gavin
# Add a udev local rule for the S1300i
$ sudo vim /etc/udev/rules.d/99-local.rules
# Added:
# Fujitsu ScanSnap S1300i
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="128d", MODE="0664", GROUP="scanner", ENV{libsane_matched}="yes"
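
On CentOS 6 you can get udev to pick up the new rule without a reboot - something like this should do it (you still need to power-cycle the scanner afterwards so the device node gets recreated with the new group):

# Reload udev rules and re-trigger device events
$ sudo udevadm control --reload-rules
$ sudo udevadm trigger --subsystem-match=usb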

Then logout and log back in to pick up the change in groups, and close and re-open the S1300i. If all is well, I'm now in the scanner group and can control the scanner sans sudo:

# Check I'm in the scanner group now
$ id
uid=900(gavin) gid=100(users) groups=100(users),10(wheel),487(scanner)
# Check I can scanimage without sudo
$ scanimage -L
device 'epjitsu:libusb:001:016' is a FUJITSU ScanSnap S1300i scanner
# Confirm the permissions on the udev path (adjusted to match the new libusb path)
$ ls -l /dev/bus/usb/001/016
crw-rw-r--. 1 root scanner 189, 15 Sep 20 21:30 /dev/bus/usb/001/016
# Success!

Try gscan2pdf again, and this time it works fine without sudo!

And so far gscan2pdf 1.2.5 seems to work pretty nicely. It handles both simplex and duplex scans, and both the cleanup phase (using unpaper) and the OCR phase (with either gocr or tesseract) work without problems. tesseract seems to perform markedly better than gocr so far, as seems pretty typical.

So thus far I'm a pretty happy purchaser. On to a paperless searchable future!

Parallelising With Perl

Did a talk at the Sydney Perl Mongers group on Tuesday night, called "Parallelising with Perl", covering AnyEvent, MCE, and GNU Parallel.

Slides

e-billing Still Sucks

You'd think that 20 years into the Web we'd have billing all sorted out. (I've got in view here primarily bill/invoice delivery, rather than payments, and consumer-focussed billing, rather than B2B invoicing).

We don't. Our bills are probably as likely to still come on paper as in digital versions, and the current "e-billing" options all come with significant limitations (at least here in Australia - I'd love to hear about awesome implementations elsewhere!).

Here, for example, are a representative set of my current vendors, and their billing delivery options (I'm not picking on anyone here, just grounding the discussion in some specific examples).

Vendor              Billing Delivery Options
------              ------------------------
Citibank            email, paper, web
iinet               email, paper?, web
Kuringai Council    BPayView, paper
Origin Energy       email, paper
Sydney Water        Australia Post Digital Mailbox, BPayView, paper

So that all looks pretty reasonable, you might say. All your vendors have some kind of e-billing option. What's the problem?

The current e-billing options

Here's how I'd rate the various options available:

  • email: email is IMO the best current option for bill delivery - it's decentralised, lightweight, push-rather-than-pull, and relatively easy to integrate/automate. Unfortunately, not everyone offers it, and sometimes (e.g. Citibank) they insist on putting passwords on the documents they send out via email on the grounds of 'security'. (On the other hand, emails are notoriously easy to fake, so faking a bill email is a straightforward attack vector if you can figure out customer-vendor relationships.)

    (Note too that most of the non-email e-billing options still use email for sending alerts about a new bill, they just don't also send the bill through as an attachment.)

  • web (i.e. a company portal of some kind which you log into and can then download your bill): this is efficient for the vendor, but pretty inefficient for the customer - it requires going to the particular website, logging in, and navigating to the correct location before you can view or download your bill. So it's an inefficient, pull-based solution, requiring yet another username/password, and with few integration/automation options (and security issues if you try).

  • BPayView / Australia Post Digital Mailbox: for non-Australians, these are free (for consumers) solutions for storing and paying bills offered by a consortium of banks (BPayView) and Australia Post (Digital Mailbox) respectively. These provide a pretty decent user experience in that your bills are centralised, and they can often parse the bill payment options and make the payment process easy and less error-prone. On the other hand, centralisation is a two-edged sword, as it makes it harder to change providers (can you get your data out of these providers?); it narrows your choices in terms of bill payment (or at least makes certain kinds of payment options easier than others); and it's basically still a web-based solution, requiring login and navigation, and very difficult to automate or integrate elsewhere. I'm also suspicious of 'free' services from corporates - clearly there is value in driving you through their preferred payment solutions and/or in the transaction data itself, or they wouldn't be offering it to you.

    Also, why are there limited providers at all? There should be a standard in place so that vendors don't have to integrate separately with each provider, and so that customers have maximum choice in whom they wish to deal with. Wins all-round.

And then there's the issue of formats. I'm not aware of any Australian vendors that bill customers in any format except PDF - are there any?

PDFs are reasonable for human consumption, but billing should really also (or instead) be done in a format meant for computer consumption, so bills can be parsed and processed reliably. This presumably means billing in a standardised XML or JSON format of some kind (XBRL?).

How billing should work

Here's a strawman workflow for how I think billing should work:

  • the customer's profile with the vendor includes a billing delivery URL, which is a vendor-specific location supplied by the customer to which their bills are to be HTTP POST-ed. It should be an HTTPS URL to secure the content during transmission, and the URL should be treated by the vendor as sensitive, since its possession would allow someone to post fake invoices to the customer

  • if the vendor supports more than one bill/invoice format, the customer should be able to select the format they'd like

  • the vendor posts invoices to the customer's URL and gets back a URL referencing the customer's record of that invoice. (The vendor might, for instance, be able to query that record for status information, or they might supply a webhook of their own to have status updates on the invoice pushed back to them.)

  • the customer's billing system should check that the posted invoice has the correct customer details (at least, for instance, the vendor/customer account number), and ideally should also check the bill payment methods against an authoritative set maintained by the vendor (this provides protection against someone injecting a fake invoice into the system with bogus bill payment details)

  • the customer's billing system is then responsible for facilitating the bill payment manually or automatically at or before the due date, using the customer's preferred payment method. This might involve billing calendar feeds, global or per-vendor preferred payment methods, automatic checks on invoice size against vendor history, etc.

  • all billing data (ideally fully parsed, categorised, and tagged) is then available for further automation / integration e.g. personal financial analytics, custom graphing, etc.

This kind of solution would give the customer full control over their billing data, the ability to choose a billing provider that's separate from (and more agile than) their vendors and banks, as well as significant flexibility to integrate and automate further. It should also be pretty straightforward on the vendor side - it just requires a standard HTTP POST and provides immediate feedback to the vendor on success or failure.
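
To make the delivery step concrete, the vendor-side POST could be as simple as something like the following sketch (the URL, filename, and payload here are entirely made up for illustration):

# Hypothetical vendor-side bill delivery: POST the invoice to the
# customer-supplied HTTPS URL; the response body would carry the URL
# of the customer's record of this invoice
curl -s -X POST \
     -H 'Content-Type: application/json' \
     --data @invoice-2014-09.json \
     https://bills.example.net/inbox/3f9c0a7e2b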

Why doesn't this exist already - it doesn't seem hard?

CSV wrangling with App::CCSV

Well past time to get back on the blogging horse.

I'm now working on a big data web mining startup, and spending an inordinate amount of time buried in large data files, often some variant of CSV.

My favourite new tool over the last few months is Karlheinz Zoechling's App::CCSV perl module, which lets you do some really powerful CSV processing using perl one-liners, instead of having to write a trivial/throwaway script.

If you're familiar with perl's standard autosplit functionality (perl -a) then App::CCSV will look pretty similar - it autosplits its input into an array on your CSV delimiters for further processing. It handles embedded delimiters and CSV quoting conventions correctly, though, which perl's standard autosplitting doesn't.

App::CCSV uses @f to hold the autosplit fields, and provides utility functions csay and cprint for doing say and print on the CSV-joins of your array. So for example:

# Print just the first 3 fields of your file
perl -MApp::CCSV -ne 'csay @f[0..2]' < file.csv

# Print only lines where the second field is 'Y' or 'T'
perl -MApp::CCSV -ne 'csay @f if $f[1] =~ /^[YT]$/' < file.csv

# Print the CSV header and all lines where field 3 is negative
perl -MApp::CCSV -ne 'csay @f if $. == 1 || ($f[2]||0) < 0' < file.csv

# Insert a new country code field after the first field
perl -MApp::CCSV -ne '$cc = get_country_code($f[0]); csay $f[0],$cc,@f[1..$#f]' < file.csv

App::CCSV can use a config file to handle different kinds of CSV input. Here's what I'm using, which lives in my home directory in ~/.CCSVConf:

<CCSV>
sep_char ,
quote_char """
<names>
  <comma>
    sep_char ","
    quote_char """
  </comma>
  <tabs>
    sep_char "  "
    quote_char """
  </tabs>
  <pipe>
    sep_char "|"
    quote_char """
  </pipe>
  <commanq>
    sep_char ","
    quote_char ""
  </commanq>
  <tabsnq>
    sep_char "  "
    quote_char ""
  </tabsnq>
  <pipenq>
    sep_char "|"
    quote_char ""
  </pipenq>
</names>
</CCSV>

That just defines two sets of names for different kinds of input: comma, tabs, and pipe for [,\t|] delimiters with standard CSV quote conventions; and three nq ("no-quote") variants - commanq, tabsnq, and pipenq - to handle inputs that aren't using standard CSV quoting. It also makes the comma behaviour the default.

You use one of the names by specifying it when loading the module, after an =:

perl -MApp::CCSV=comma ...
perl -MApp::CCSV=tabs ...
perl -MApp::CCSV=pipe ...

You can also convert between formats by specifying two names, in <input>,<output> format e.g.

perl -MApp::CCSV=comma,pipe ...
perl -MApp::CCSV=tabs,comma ...
perl -MApp::CCSV=pipe,tabs ...
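
Assuming csay writes using the output format (which is what the <input>,<output> pairing suggests), converting, say, a tab-separated file to standard CSV is then just:

perl -MApp::CCSV=tabs,comma -ne 'csay @f' < file.tsv > file.csv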

And just to round things off, I have a few aliases defined in my bashrc file to make these even easier to use:

alias perlcsv='perl -CSAD -MApp::CCSV'
alias perlpsv='perl -CSAD -MApp::CCSV=pipe'
alias perltsv='perl -CSAD -MApp::CCSV=tabs'
alias perlcsvnq='perl -CSAD -MApp::CCSV=commanq'
alias perlpsvnq='perl -CSAD -MApp::CCSV=pipenq'
alias perltsvnq='perl -CSAD -MApp::CCSV=tabsnq'

That simplifies my standard invocation to something like:

perlcsv -ne 'csay @f[0..2]' < file.csv

Happy data wrangling!

Checking for a Dell DRAC card on linux

Note to self: this seems to be the most reliable way of checking whether
a Dell machine has a DRAC card installed:
sudo ipmitool sdr elist mcloc

If there is, you'll see some kind of DRAC card:

iDRAC6           | 00h | ok  |  7.1 | Dynamic MC @ 20h

If there isn't, you'll see only a base management controller:

BMC              | 00h | ok  |  7.1 | Dynamic MC @ 20h

You need ipmi set up for this (if you haven't already):

# on RHEL/CentOS etc.
yum install OpenIPMI
service ipmi start
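
And if you want to script the check, a simple grep over that output is enough - a minimal sketch:

# Report whether a DRAC card is present, based on the output above
if sudo ipmitool sdr elist mcloc | grep -qi drac; then
    echo "DRAC card present"
else
    echo "no DRAC card found (BMC only)"
fi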

Subtracting Text Files

This has bitten me a couple of times now, and each time I've had to re-google the utility and figure out the appropriate incantation. So note to self: to subtract text files use comm(1).

Input files have to be sorted, but comm accepts a - argument for stdin, so you can sort on the fly if you like.
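
For unsorted inputs, something like this works in bash (using - for stdin, and process substitution for the second file):

# lines unique to one.txt, sorting both inputs on the fly
sort one.txt | comm -23 - <(sort two.txt)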

I also find the -1 -2 -3 options pretty counter-intuitive, as they indicate what you want to suppress, whereas I want to indicate what I want to select. But whatever.

Here's the cheatsheet:

FILE1=one.txt
FILE2=two.txt

# FILE1 - FILE2 (lines unique to FILE1)
comm -23 $FILE1 $FILE2

# FILE2 - FILE1 (lines unique to FILE2)
comm -13 $FILE1 $FILE2

# intersection (common lines)
comm -12 $FILE1 $FILE2

# xor (non-common lines, either FILE)
comm -3 $FILE1 $FILE2
# or without the column delimiters:
comm -3 --output-delimiter=' ' $FILE1 $FILE2 | sed 's/^ *//'

# union (all lines)
comm $FILE1 $FILE2
# or without the column delimiters:
comm --output-delimiter=' ' $FILE1 $FILE2 | sed 's/^ *//'

Extract - some devops yang

If you're a modern sysadmin you've probably been sipping at the devops koolaid and trying out one or more of the current system configuration management tools like puppet or chef.

These tools are awesome - particularly for homogeneous large-scale deployments of identical nodes.

In practice in the enterprise, though, things get more messy. You can have legacy nodes that can't be puppetised due to their sensitivity and importance; or nodes that are sufficiently unusual that the payoff of putting them under configuration management doesn't justify the work; or just systems which you don't have full control over.

We've been using a simple tool called extract in these kinds of environments, which pulls a given set of files from remote hosts and stores them under version control in a set of local per-host trees.

You can think of it as the yang to puppet or chef's yin - instead of pushing configs onto remote nodes, it's about pulling configs off nodes, and storing them for tracking and change control.

We've been primarily using it in a RedHat/CentOS environment, so we use it in conjunction with rpm-find-changes, which identifies all the config files under /etc that have been changed from their deployment versions, or are custom files not belonging to a package.

Extract doesn't care where its list of files to extract comes from, so it should be easily customised for other environments.

It uses a simple extract.conf shell-variable-style config file, like this:

# Where extracted files are to be stored (in per-host trees)
EXTRACT_ROOT=/data/extract

# Hosts from which to extract (space separated)
EXTRACT_HOSTS="host1 host2 host3"

# File containing list of files to extract (on the remote host, not locally)
EXTRACT_FILES_REMOTE=/var/cache/rpm-find-changes/etc.txt

Extract also allows arbitrary scripts to be called at the beginning (setup) and end (teardown) of a run, and before and/or after each host. Extract ships with some example shell scripts for loading ssh keys, and checking extracted changes into git or bzr. These hooks are also configured in the extract.conf config e.g.:

# Pre-process scripts
# PRE_EXTRACT_SETUP - run once only, before any extracts are done
PRE_EXTRACT_SETUP=pre_extract_load_ssh_keys
# PRE_EXTRACT_HOST - run before each host extraction
#PRE_EXTRACT_HOST=pre_extract_noop

# Post process scripts
# POST_EXTRACT_HOST - run after each host extraction
POST_EXTRACT_HOST=post_extract_git
# POST_EXTRACT_TEARDOWN - run once only, after all extracts are completed
#POST_EXTRACT_TEARDOWN=post_extract_touch
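
For illustration, a post-extract hook along these lines might look something like the following sketch - the argument convention and paths here are assumptions, not extract's actual interface, so check the shipped examples for the real thing:

#!/bin/sh
# Hypothetical POST_EXTRACT_HOST hook: commit any changes in the
# per-host tree that has just been extracted ($1 assumed to be the hostname)
host="$1"
cd "/data/extract/$host" || exit 1
git add -A .
# git commit fails if there's nothing to commit, so ignore that case
git commit -q -m "extract: $host $(date '+%Y-%m-%d %H:%M')" || true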

Extract is available on github, and packages for RHEL/CentOS 5 and 6 are available from my repository.

Feedback/pull requests always welcome.

Yum Error Performing Checksum

Ok, this has bitten me enough times now that I'm going to blog it so I don't forget it again.

Symptom: you're doing a yum update on a centos5 or rhel5 box, using rpms from a repository on a centos6 or rhel6 server (or anywhere else with a more modern createrepo available), and you get errors like this:

http://example.com/repodata/filelists.sqlite.bz2: [Errno -3] Error performing checksum
http://example.com/repodata/primary.sqlite.bz2: [Errno -3] Error performing checksum

What this really means is that yum is too stupid to calculate the sha256 checksum correctly (and also too stupid to give you a sensible error message like "Sorry, primary.sqlite.bz2 is using a sha256 checksum, but I don't know how to calculate that").

The fix is simple:

yum install python-hashlib

from either rpmforge or epel, which makes the necessary libraries available for yum to calculate the new checksums correctly. Sorted.

A Low Tech Pub-Sub Pattern

I've been enjoying playing around with ZeroMQ lately, and exploring some of the ways it changes the way you approach system architecture.

One of the revelations for me has been how powerful the pub-sub (Publish-Subscribe) pattern is. An architecture that makes it straightforward for multiple consumers to process a given piece of data promotes lots of small simple consumers, each performing a single task, instead of a complex monolithic processor.

This is both simpler and more complex, since you end up with more pieces, but each piece is radically simpler. It's also more flexible and more scalable, since you can move components around individually, and it allows greater language and library flexibility, since you can write individual components in completely different languages.

What's also interesting is that the benefits of this pattern don't necessarily require an advanced toolkit like ZeroMQ, particularly for low-volume applications. Here's a sketch of a low-tech pub-sub pattern that uses files as the pub-sub inflection point, and incron, the 'inotify cron' daemon, as our dispatcher.

Recipe:

  1. Install incron, the inotify cron daemon, to monitor our data directory for changes. On RHEL/CentOS this is available from the rpmforge or EPEL repositories: yum install incron.

  2. Capture data to files in our data directory in some useful format e.g. json, yaml, text, whatever.

  3. Set up an incrontab entry for each consumer, monitoring CREATE operations on our data directory, e.g.

    /data/directory IN_CREATE /path/to/consumer1 $@/$#
    /data/directory IN_CREATE /path/to/consumer2 $@/$#
    /data/directory IN_CREATE /path/to/consumer3 $@/$#
    

    The $@/$# magic passes the full file path to your consumer - see man 5 incrontab for details and further options.

Done. Working pub-sub with minimal moving parts.
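
For completeness, here's a minimal sketch of what one of those consumers might look like - it just takes the full path incron passes in and appends a one-line summary to a log (the log path is arbitrary):

#!/bin/sh
# consumer1: invoked by incron with the path of the newly created data file
file="$1"
echo "$(date '+%F %T') processed $file ($(wc -c < "$file") bytes)" \
    >> /var/log/consumer1.log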

AoE on RHEL/CentOS

Updated 2012-07-24: packages updated to aoe-80 and aoetools-34.


I'm a big fan of Coraid and their relatively low-cost storage units. I've been using them for 5+ years now, and they've always been pretty well engineered, reliable, and performant.

They talk ATA-over-Ethernet (AoE), which is a very simple non-routable protocol for transmitting ATA commands directly via Ethernet frames, without the overhead of higher level layers like IP and TCP. So it's a lighter protocol than something like iSCSI, and theoretically higher performance.

One issue with them on linux is that the in-kernel 'aoe' driver is typically pretty old. Coraid's latest aoe driver is version 78, for instance, while the RHEL6 kernel (2.6.32) comes with aoe v47, and the RHEL5 kernel (2.6.18) comes with aoe v22. So updating to the latest version is highly recommended, but also a bit of a pain, because if you do it manually it has to be recompiled for each new kernel update.

The modern way to handle this is to use a kernel-ABI tracking kmod, which gives you a driver that will work across multiple kernel updates for a given EL generation, without having to recompile each time.

So I've created a kmod-aoe package that seems to work nicely here. It's downloadable below, or you can install it from my yum repository. The kmod depends on the 'aoetools' package, which supplies the command line utilities for managing your AoE devices.

kmod-aoe (v80):

aoetools (v34):

There's an init script in the aoetools package that loads the kernel module, activates any configured LVM volume groups, and mounts any filesystems. All configuration is done via /etc/sysconfig/aoe.
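
So on a box with my repo configured, getting up and running should look something like this (aoe-discover and aoe-stat come with the aoetools package):

# Install the kmod and tools, load the driver, and see what the shelf exports
yum install kmod-aoe aoetools
modprobe aoe
aoe-discover
aoe-stat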

OpenLDAP Tips and Tricks

Having spent too much of this week debugging problems around migrating ldap servers from RHEL5 to RHEL6, here are some miscellaneous notes to self:

  1. The service is named ldap on RHEL5, and slapd on RHEL6, e.g. you do service ldap start on RHEL5, but service slapd start on RHEL6

  2. On RHEL6, you want all of the following packages installed on your clients:

    yum install openldap-clients pam_ldap nss-pam-ldapd
    
  3. This seems to be the magic incantation that works for me (with real SSL certificates, though):

    authconfig --enableldap --enableldapauth \
      --ldapserver ldap.example.com \
      --ldapbasedn="dc=example,dc=com" \
      --update
    
  4. Be aware that there are multiple ldap configuration files involved now. All of the following end up with ldap config entries in them and need to be checked:

    • /etc/openldap/ldap.conf
    • /etc/pam_ldap.conf
    • /etc/nslcd.conf
    • /etc/sssd/sssd.conf

    Note too that /etc/openldap/ldap.conf uses uppercased directives (e.g. URI) that get lowercased in the other files (URI -> uri). Additionally, some directives are confusingly renamed as well - e.g. TLS_CACERT in /etc/openldap/ldap.conf becomes tls_cacertfile in most of the others. :-(

  5. If you want to do SSL or TLS, you should know that the default behaviour is for ldap clients to verify certificates, and give misleading bind errors if they can't validate them. This means:

    • if you're using self-signed certificates, add TLS_REQCERT allow to /etc/openldap/ldap.conf on your clients, which means allow certificates the clients can't validate

    • if you're using CA-signed certificates, and want to verify them, add your CA PEM certificate to a directory of your choice (e.g. /etc/openldap/certs, or /etc/pki/tls/certs, for instance), and point to it using TLS_CACERT in /etc/openldap/ldap.conf, and tls_cacertfile in /etc/ldap.conf (see the example below).
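
    Pulling those together, a minimal client /etc/openldap/ldap.conf might look something like this (hostname and CA path are placeholders):

    URI         ldaps://ldap.example.com
    BASE        dc=example,dc=com
    TLS_CACERT  /etc/openldap/certs/ca.pem
    # or, for self-signed certificates you can't validate:
    #TLS_REQCERT allow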

  6. RHEL6 uses a new-fangled /etc/openldap/slapd.d directory for the old /etc/openldap/slapd.conf config data, and the RHEL6 Migration Guide tells you how to convert from one to the other. But if you simply rename the default slapd.d directory, slapd will use the old-style slapd.conf file quite happily, which is much easier to read/modify/debug, at least while you're getting things working.

  7. If you run into problems on the server, there are lots of helpful utilities included with the openldap-servers package. Check out the manpages for slaptest(8), slapcat(8), slapacl(8), slapadd(8), etc.

rpm-find-changes

rpm-find-changes is a little script I wrote a while ago for rpm-based systems (RedHat, CentOS, Mandriva, etc.). It finds files in a filesystem tree that are not owned by any rpm package (orphans), or are modified from the version distributed with their rpm. In other words, any file that has been introduced or changed from its distributed version.

It's intended to help identify candidates for backup, or just for tracking interesting changes. I run it nightly on /etc on most of my machines, producing a list of files that I copy off the machine (using another tool, which I'll blog about later) and store in a git repository.

I've also used it for tracking changes to critical configuration trees across multiple machines, to make sure everything is kept in sync, and to be able to track changes over time.

Available on github: https://github.com/gavincarr/rpm-find-changes

Google Hangout on CentOS 6

Kudos to Google for providing linux plugins for their Google+ Hangouts (a multi-way video chat system), for both debian-based and rpm-based systems. The library requirements don't seem to be documented anywhere though, so here's the magic incantation required for installation on CentOS6 x86_64:

yum install libstdc++.i686 gtk2.i686 \
  libXrandr.i686 libXcomposite.i686 libXfixes.i686 \
  pulseaudio-libs.i686 alsa-lib.i686

Poor Man's NTP

One of today's annoyances was a third party complaining about clock skew on a server at their site that they're testing against. No, they don't have a local ntp server, and no, they couldn't allow us to connect out to a designated ntp server externally. All we have is an ssh forward in.

They wanted me to manually set the clock on the server whenever they noticed it was out of synch! Real professionals.

I thought about tunneling an ntp stream out, but that requires udp-to-tcp fudging at each end, or using ssh's Tunnel facility, which requires root at both ends.

In the end, I settled for the low-tech approach - a once-a-day cron job that resets the server's clock from the local one. Ugly, but good enough:

ssh root@server date --utc $(date --utc "+%m%d%H%M%Y.%S")
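
In root's crontab on the local box that ends up looking something like this (note the backslash-escaped % signs, which cron would otherwise turn into newlines):

# the time of day is arbitrary
15 3 * * * ssh root@server date --utc $(date --utc "+\%m\%d\%H\%M\%Y.\%S")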

Hosttag - Tagging for Hosts

When you have more than a handful of hosts on your network, you need to start keeping track of what services are living where, what roles particular servers have, etc. This can be documentation-based (say on a wiki, or offline), or it can be implicit in a configuration management system. Old-school sysadmins often used dns TXT records for these kind of notes, on the basis that it was easy to look them up from the command line from anywhere.

I've been experimenting with the idea of using lightweight tags attached to hostnames for this kind of data, and it's been working really nicely. Hosttag is just a couple of ruby command line utilities, one (hosttag or ht) for doing tag or host lookups, and one (htset/htdel) for doing adds and deletes. Both are network based, so you can do lookups from wherever you are, rather than having to go to somewhere centralised.

Hosttag uses a redis server to store the hostname-tag and tag-hostname mappings as redis sets, which makes queries lightning fast, and setup straightforward.

So let's see it in action (rpms available in my yum repo):

# Installation - first install redis somewhere, and setup a 'hosttag'
# dns alias to the redis host (or use the `-s <server>` option in
# the examples that follow). e.g. on CentOS:
$ yum install redis rubygem-redis

# Install hosttag as an rpm package (from my yum repo).
# Also requires/installs the redis rubygem.
$ yum install hosttag
# gem version coming soon (gem install hosttag)

# Setup some test data (sudo is required for setting and deleting)
# Usage: htset --tag <host> <tag1> <tag2> <tag3> ...
$ sudo htset --tag server1 dns dell ldap server centos centos5 i386 syd
$ sudo htset --tag server2 dns dell ldap server debian debian6 x86_64 mel
$ sudo htset --tag server3 hp nfs server centos centos6 x86_64 syd
$ sudo htset --tag lappy laptop ubuntu maverick i386 syd

# Now run some queries
# Query by tag
$ ht dns
server1 server2
$ ht i386
lappy server1

# Query by host
$ ht server2
debian debian6 dell dns ldap mel server x86_64

# Multiple arguments
$ ht --or centos debian
server1 server2 server3
$ ht --and dns ldap
server1 server2

# All hosts
$ ht --all
lappy server1 server2 server3
# All tags
$ ht --all-tags
centos centos5 centos6 debian debian6 dell dns hp i386 laptop ldap \
maverick mel nfs server syd ubuntu x86_64

An obvious use case is to perform actions on multiple hosts using your ssh loop of choice e.g.

$ sshr $(ht centos) 'yum -y update'

Finally, a warning: hosttag doesn't have any security built in yet, so it should only be used on trusted networks.

Source code is on github - patches welcome :-).

Introducing PlanetAUX

I wrote a post a couple of years ago surveying the 20 biggest Australian listed companies in terms of how they syndicate the ASX announcements they make when they have news for the market.

All of the companies made their announcements available on their (or an associated) website, but that obviously requires that an interested investor or follower of the company regularly check the announcements page of each company they're interested in - not ideal, timely, or scalable.

"Push" or broadcast approaches are much more useful for this sort of thing, and back then 14 or the 20 did make their announcements available via email - which is useful enough - and 7 of the 20 also published RSS or Atom feeds, which are better still, because they're more lightweight, centralised, and standardised/re-mixable.

On the other hand, 7/20 is a pretty ordinary score, especially considering these are the biggest Australian listed companies. Smaller companies with fewer resources are presumably going to fare even worse.

Two years on the number of ASX20 companies with announcements feeds has now leapt to ... 9. :-/

And so towards the end of last year I decided to scratch this particular itch, and built an announcement aggregator site called PlanetAUX. It aggregates the announcement feeds of the ASX20 companies that have them, and generates announcement feeds for those that don't.

Of the companies in the ASX20, there was only one whose HTML was so terrible that I couldn't manage to parse it. I figured 19/20 isn't bad for starters.

Plans are to add more companies as time and interest demands, presumably the remainder of the ASX50 initially, and then we'll see how things go.

Happy feeding.

RHEL6 GDM Sessions Workaround

Update: ilaiho has provided a better solution in the comments, which is to install the xorg-x11-xinit-session package, which adds a "User script" session option. This will invoke your (executable) ~/.xsession or ~/.Xclients configs, if selected, and works well, so I'd recommend you go that route instead of using this patch now.


The GDM Greeter in RHEL6 seems to have lost the ability to select 'session types' (or window managers), which apparently means you're stuck using Gnome, even if you have other better options installed. One workaround is to install KDM instead, and set DISPLAYMANAGER=KDE in your /etc/sysconfig/desktop config, as KDM does still support selectable session types.

Since I've become a big fan of tiling window managers in general, and ion in particular, this was pretty annoying, so I wasted a few hours today working through the /etc/X11 scripts and figuring out how they hung together on RHEL6.

So for any other gnome-haters out there who don't want to have to go to KDM, here's a patch to /etc/X11/xinit/Xsession that ignores the default 'gnome-session' set by GDM, which allows proper window manager selection either by user .xsession or .Xclients files, or by the /etc/sysconfig/desktop DISPLAY setting.

diff --git a/xinit/Xsession b/xinit/Xsession
index e12e0ee..ab94d28 100755
--- a/xinit/Xsession
+++ b/xinit/Xsession
@@ -30,6 +30,14 @@ SWITCHDESKPATH=/usr/share/switchdesk
 # Xsession and xinitrc scripts which has been factored out to avoid duplication
 . /etc/X11/xinit/xinitrc-common

+# RHEL6 GDM doesn't seem to support selectable sessions, and always requests a
+# gnome-session. So we unset this default here, to allow things like user
+# .xsession or .Xclients files to be checked, and /etc/sysconfig/desktop
+# settings (via /etc/X11/xinit/Xclients) honoured.
+if [ -n "$GDMSESSION" -a $# -eq 1 -a "$1" = gnome-session ]; then
+  shift
+fi
+
 # This Xsession.d implementation, is intended to obsolte and replace the
 # various mechanisms present in the 'case' statement which follows, and to
 # eventually be able to easily remove all hard coded window manager specific

Apply as root:

cd /etc/X11
patch -p1 < /tmp/xsession.patch
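
For reference, the user-side ~/.xsession (or ~/.Xclients) this enables can be as simple as the following - ion3 here is just an example, use whatever window manager you prefer, and remember the file needs to be executable:

#!/bin/sh
# Minimal executable ~/.xsession
exec ion3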

Rebuild Inventory

Here's what I use to take a quick inventory of a machine before a rebuild, both to act as a reference during the rebuild itself, and in case something goes pear-shaped. The whole chunk after script up to exit is cut-and-pastable.

# as root, where you want your inventory file
script $(hostname).inventory
export PS1='\h:\w\$ '               # reset prompt to avoid ctrl chars
fdisk -l /dev/sd?                   # list partition tables
cat /proc/mdstat                    # list raid devices
pvs                                 # list lvm stuff
vgs
lvs
df -h                               # list mounts
ip addr                             # list network interfaces
ip route                            # list network routes
cat /etc/resolv.conf                # show resolv.conf
exit

# Cleanup control characters in the inventory
perl -i -pe 's/\r//g; s/\033\]\d+;//g; s/\033\[\d+m//g; s/\007/\//g' \
  $(hostname).inventory

# And then copy it somewhere else in case of problems ;-)
scp $(hostname).inventory somewhere:

Anything else useful I've missed?

Cronologue

Came across cronologger (blog post) recently (via Dean Wilson), a simple wrapper script you use around your cron(8) jobs that captures any stdout and stderr output and logs it to a couchdb database, instead of the traditional behaviour of sending it to you as email.

It's a nice idea, particularly for jobs with important output where it would be nice to be able to look back in time more easily than by trawling through a noisy inbox, or for sites with lots of cron jobs where the sheer volume is difficult to handle usefully as email.

Cronologger comes with a simple web interface for displaying your cron jobs, but so far it's pretty rudimentary. I quickly realised that this was another place (cf. blosxom4nagios) where blosxom could be used to provide a pretty useful gui with very little work.

Thus: cronologue.

cronologue(1) is the wrapper, written in perl, which logs job records and stdout/stderr output via standard HTTP PUTs back to a designated apache server, as flat text files. Parameters can be used to control whether job records are always created, or only when there is output produced. There's also a --passthru mode in which stdout and stderr streams are still output, allowing both email and cronologue output to be produced.
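
A wrapped cron job then ends up looking something like this - the exact invocation is illustrative only, so check cronologue(1) for the real options:

# hypothetical crontab entry wrapping an existing job with cronologue
30 2 * * * cronologue --passthru /usr/local/bin/nightly-backup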

On the server side a custom blosxom install is used to display the job records, which can be filtered by hostname or by date. There's also an RSS feed available.

Obligatory screenshot:

Cronologue GUI

Update: I should add that RPMs for CentOS5 (but which will probably work on most RPM-based distros) are available from my yum repository.

Parallel Processing Perl Modules

Needed to parallelise some processing in perl the last few days, and did a quick survey of some of the parallel processing modules on CPAN, of which there is the normal bewildering diversity.

As usual, it depends exactly what you're trying to do. In my case I just needed to be able to fork a bunch of processes off, have them process some data, and hand the results back to the parent.

So here are my notes on a random selection of the available modules. The example each time is basically a parallel version of the following map:

my %out = map { $_ => $_ ** 2 } 1 .. 50;

Parallel::ForkManager

Object oriented wrapper around 'fork'. Supports parent callbacks. Passing data back to the parent uses files, and feels a little bit clunky. Dependencies: none.

use Parallel::ForkManager 0.7.6;

my @num = 1 .. 50;

my $pm = Parallel::ForkManager->new(5);

my %out;
$pm->run_on_finish(sub {    # must be declared before first 'start'
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    $out{ $data->[0] } = $data->[1];
});

for my $num (@num) {
    $pm->start and next;   # Parent nexts

    # Child
    my $sq = $num ** 2;

    $pm->finish(0, [ $num, $sq ]);   # Child exits
}
$pm->wait_all_children;

[Version 0.7.9]

Parallel::Iterator

Basically a parallel version of 'map'. Dependencies: none.

use Parallel::Iterator qw(iterate);

my @num = 1 .. 50;

my $it = iterate( sub {
    # sub is a closure, return outputs
    my ($id, $num) = @_;
    return $num ** 2;
}, \@num );

my %out = ();
while (my ($index, $square) = $it->()) {
  $out{ $num[$index] } = $square;   # the iterator yields the input array index
}

[Version 1.00]

Parallel::Loops

Provides parallel versions of 'foreach' and 'while'. It uses 'tie' to allow shared data structures between the parent and children. Dependencies: Parallel::ForkManager.

use Parallel::Loops;

my @num = 1 .. 50;

my $pl = Parallel::Loops->new(5);

my %out;
$pl->share(\%out);

$pl->foreach( \@num, sub {
    my $num = $_;           # note this uses $_, not @_
    $out{$num} = $num ** 2;
});

You can also return values from the subroutine like Iterator, avoiding the explicit 'share':

my %out = $pl->foreach( \@num, sub {
    my $num = $_;           # note this uses $_, not @_
    return ( $num, $num ** 2 );
});

[Version 0.03]

Proc::Fork

Provides an interesting perlish forking interface using blocks. No built-in support for returning data from children, but provides examples using pipes. Dependencies: Exporter::Tidy.

use Proc::Fork;
use IO::Pipe;
use Storable qw(freeze thaw);

my @num = 1 .. 50;
my @children;

for my $num (@num) {
    my $pipe = IO::Pipe->new;

    run_fork{ child {
        # Child
        $pipe->writer;
        print $pipe freeze([ $num, $num ** 2 ]);
        exit;
    } };

    # Parent
    $pipe->reader;
    push @children, $pipe;
}

my %out;
for my $pipe (@children) {
    my $entry = thaw( do { local $/; <$pipe> } );   # slurp - frozen data may contain newlines
    $out{ $entry->[0] } = $entry->[1];
}

[Version 0.71]

Parallel::Prefork

Like Parallel::ForkManager, but adds better signal handling. Doesn't seem to provide built-in support for returning data from children. Dependencies: Proc::Wait3.

[Version 0.08]

Parallel::Forker

More complex module, loosely based on ForkManager (?). Includes better signal handling, and supports scheduling and dependencies between different groups of subprocesses. Doesn't appear to provide built-in support for passing data back from children.

[Version 1.232]