Parallelising With Perl
Did a talk at the Sydney Perl Mongers group on Tuesday night, called "Parallelising with Perl", covering AnyEvent, MCE, and GNU Parallel.
Well past time to get back on the blogging horse.
I'm now working on a big data web mining startup, and spending an inordinate amount of time buried in large data files, often some variant of CSV.
My favourite new tool over the last few months is Karlheinz Zoechling's App::CCSV perl module, which lets you do some really powerful CSV processing using perl one-liners, instead of having to write a trivial/throwaway script.
If you're familiar with perl's standard autosplit functionality (perl -a), then App::CCSV will look pretty similar - it autosplits its input into an array on your CSV delimiters for further processing. It handles embedded delimiters and CSV quoting conventions correctly, though, which perl's standard autosplitting doesn't.
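To see the difference, here's a contrived example (made-up data, and assuming App::CCSV's default comma/double-quote settings; the csay helper and @f array are explained just below):

echo '"Smith, John",42' | perl -F, -ane 'print $F[0]'
# prints: "Smith          (the quoted field is split in half)

echo '"Smith, John",42' | perl -MApp::CCSV -ne 'csay $f[0]'
# prints: "Smith, John"   (the field survives intact, re-quoted on output)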
App::CCSV uses @f to hold the autosplit fields, and provides utility functions csay and cprint for doing say and print on the CSV-joins of your array. So for example:
# Print just the first 3 fields of your file
perl -MApp::CCSV -ne 'csay @f[0..2]' < file.csv

# Print only lines where the second field is 'Y' or 'T'
perl -MApp::CCSV -ne 'csay @f if $f[1] =~ /^[YT]$/' < file.csv

# Print the CSV header and all lines where field 3 is negative
perl -MApp::CCSV -ne 'csay @f if $. == 1 || ($f[2]||0) < 0' < file.csv

# Insert a new country code field after the first field
perl -MApp::CCSV -ne '$cc = get_country_code($f[0]); csay $f[0],$cc,@f[1..$#f]' < file.csv
App::CCSV can use a config file to handle different kinds of CSV input. Here's what I'm using, which lives in my home directory in ~/.CCSVConf:
<CCSV>
  sep_char ,
  quote_char """
  <names>
    <comma>
      sep_char ","
      quote_char """
    </comma>
    <tabs>
      sep_char " "
      quote_char """
    </tabs>
    <pipe>
      sep_char "|"
      quote_char """
    </pipe>
    <commanq>
      sep_char ","
      quote_char ""
    </commanq>
    <tabsnq>
      sep_char " "
      quote_char ""
    </tabsnq>
    <pipenq>
      sep_char "|"
      quote_char ""
    </pipenq>
  </names>
</CCSV>
That just defines two sets of names for different kinds of input: comma, tabs, and pipe for [,\t|] delimiters with standard CSV quote conventions; and three nq ("no-quote") variants - commanq, tabsnq, and pipenq - to handle inputs that aren't using standard CSV quoting. It also makes the comma behaviour the default.
You use one of the names by specifying it when loading the module, after an '=':
perl -MApp::CCSV=comma ...
perl -MApp::CCSV=tabs ...
perl -MApp::CCSV=pipe ...
You can also convert between formats by specifying two names, in <input>,<output> format e.g.
perl -MApp::CCSV=comma,pipe ...
perl -MApp::CCSV=tabs,comma ...
perl -MApp::CCSV=pipe,tabs ...
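For instance, to turn a pipe-separated file into regular CSV (file.psv here is just a made-up input name), you can combine this with csay:

# read pipe-separated input, write comma-separated output
perl -MApp::CCSV=pipe,comma -ne 'csay @f' < file.psv > file.csv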
And just to round things off, I have a few aliases defined in my bashrc file to make these even easier to use:
alias perlcsv='perl -CSAD -MApp::CCSV'
alias perlpsv='perl -CSAD -MApp::CCSV=pipe'
alias perltsv='perl -CSAD -MApp::CCSV=tabs'
alias perlcsvnq='perl -CSAD -MApp::CCSV=commanq'
alias perlpsvnq='perl -CSAD -MApp::CCSV=pipenq'
alias perltsvnq='perl -CSAD -MApp::CCSV=tabsnq'
That simplifies my standard invocation to something like:
perlcsv -ne 'csay @f[0..2]' < file.csv
Happy data wrangling!
Needed to parallelise some processing in perl the last few days, and did a quick survey of some of the parallel processing modules on CPAN, of which there is the normal bewildering diversity.
As usual, it depends exactly what you're trying to do. In my case I just needed to be able to fork a bunch of processes off, have them process some data, and hand the results back to the parent.
So here are my notes on a random selection of the available modules. The example each time is basically a parallel version of the following map:
my %out = map { $_ => $_ ** 2 } 1 .. 50;
Parallel::ForkManager is an object-oriented wrapper around 'fork'. Supports parent callbacks. Passing data back to the parent uses files, and feels a little bit clunky. Dependencies: none.
use Parallel::ForkManager 0.7.6;

my @num = 1 .. 50;

my $pm = Parallel::ForkManager->new(5);

my %out;
$pm->run_on_finish(sub {    # must be declared before first 'start'
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    $out{ $data->[0] } = $data->[1];
});

for my $num (@num) {
    $pm->start and next;    # Parent nexts

    # Child
    my $sq = $num ** 2;
    $pm->finish(0, [ $num, $sq ]);    # Child exits
}

$pm->wait_all_children;
[Version 0.7.9]
Parallel::Iterator is basically a parallel version of 'map'. Dependencies: none.
use Parallel::Iterator qw(iterate);

my @num = 1 .. 50;

my $it = iterate( sub {
    # sub is a closure, return outputs
    my ($id, $num) = @_;
    return $num ** 2;
}, \@num );

my %out = ();
while (my ($num, $square) = $it->()) {
    $out{$num} = $square;
}
[Version 1.00]
Parallel::Loops provides parallel versions of 'foreach' and 'while'. It uses 'tie' to allow shared data structures between the parent and children. Dependencies: Parallel::ForkManager.
use Parallel::Loops;

my @num = 1 .. 50;

my $pl = Parallel::Loops->new(5);

my %out;
$pl->share(\%out);

$pl->foreach( \@num, sub {
    my $num = $_;             # note this uses $_, not @_
    $out{$num} = $num ** 2;
});
You can also return values from the subroutine like Iterator, avoiding the explicit 'share':
my %out = $pl->foreach( \@num, sub {
    my $num = $_;             # note this uses $_, not @_
    return ( $num, $num ** 2 );
});
[Version 0.03]
Proc::Fork provides an interesting perlish forking interface using blocks. No built-in support for returning data from children, but it provides examples using pipes. Dependencies: Exporter::Tidy.
use Proc::Fork;
use IO::Pipe;
use Storable qw(freeze thaw);

my @num = 1 .. 50;

my @children;
for my $num (@num) {
    my $pipe = IO::Pipe->new;

    run_fork{ child {
        # Child
        $pipe->writer;
        print $pipe freeze([ $num, $num ** 2 ]);
        exit;
    } };

    # Parent
    $pipe->reader;
    push @children, $pipe;
}

my %out;
for my $pipe (@children) {
    my $entry = thaw( <$pipe> );
    $out{ $entry->[0] } = $entry->[1];
}
[Version 0.71]
Like Parallel::ForkManager, but adds better signal handling. Doesn't seem to provide built-in support for returning data from children. Dependencies: Proc::Wait3.
[Version 0.08]
More complex module, loosely based on ForkManager (?). Includes better signal handling, and supports scheduling and dependencies between different groups of subprocesses. Doesn't appear to provide built-in support for passing data back from children.
[Version 1.232]
Here's a quick hack demonstrating a nice juxtaposition between the power of a CPAN module - in this case Christopher Laco's Finance::Currency::Convert::WebserviceX - and the elegance and utility of the little known osd_cat, putting together a desktop currency rates widget in a handful of lines:
#!/usr/bin/perl

use strict;
use IO::File;
use Finance::Currency::Convert::WebserviceX;

# Configuration
# Use currencies given as arguments, or default to USD and GBP
my @currencies = map { uc } @ARGV ? @ARGV : qw(USD GBP);
my $base_currency = 'AUD';
my $refresh = 300;              # seconds
my $font = '9x15bold';
# X colours: http://sedition.com/perl/rgb.html
my $colour = 'goldenrod3';
my $align = 'right';
my $pos = 'top';
my $offset = 25;

my $lines = scalar @currencies;
my $osd_refresh = $refresh + 1;
my $osd = IO::File->new(
    "|osd_cat -l $lines -d $osd_refresh -c '$colour' -f $font -p $pos -A $align -o $offset"
) or die "can't open to osd_cat $!";
$osd->autoflush(1);
local $SIG{PIPE} = sub { die "pipe failed: $!" };

my $cc = Finance::Currency::Convert::WebserviceX->new;

while (1) {
    my $output = '';
    $output .= "$_ " . $cc->convert(1, $base_currency, $_) . "\n" for @currencies;
    $osd->print($output);
    sleep $refresh;
}
Most of this is just housekeeping around splitting out various osd_cat options for tweaking, and allowing the set of currencies to display to be passed in as arguments. I haven't bothered setting up any option handling in order to keep the example short, but that would be straightforward.
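For what it's worth, the option handling could be bolted on with the core Getopt::Long module along these lines - a sketch only, with made-up option names, not something the script above actually supports:

#!/usr/bin/perl
use strict;
use Getopt::Long;

# Defaults, as in the script above
my $base_currency = 'AUD';
my $refresh = 300;            # seconds
my $colour = 'goldenrod3';

GetOptions(
    'base=s'    => \$base_currency,
    'refresh=i' => \$refresh,
    'colour=s'  => \$colour,
) or die "usage: $0 [--base CUR] [--refresh SECS] [--colour COLOUR] [CURRENCY ...]\n";

# Any remaining arguments are the currencies to display
my @currencies = @ARGV ? map { uc } @ARGV : qw(USD GBP);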
To use, you just run from the command line in the background:
./currency_osd &
and it shows up in the top right corner of your screen, like so:
Tweak to taste, of course.
I've been playing with the very nice YSlow firefox plugin recently, while doing some front-end optimisation on a Catalyst web project.
Most of YSlow's tuning tips were reasonably straightforward, but I wasn't sure how to approach the concatenation and minification of CSS and javascript files that they recommend.
Turns out - as is often the case - there's a very nice packaged solution on CPAN.
The File::Assets module provides concatenation and minification for CSS and Javascript 'assets' for a web page, using the CSS::Minifier (::XS) and JavaScript::Minifier (::XS) modules for minification. To use, you add a series of .css and .js files in building your page, and then 'export' them at the end, which generates a concatenated and minified version of each type in an export directory, and an appropriate link to the exported version. You can do separate exports for CSS and Javascript if you want to follow the Yahoo/YSlow recommendation of putting your stylesheets at the top and your scripts at the bottom.
There's also a Catalyst::Plugin::Assets module to facilitate using File::Assets from Catalyst.
I use Mason for my Catalyst views (I prefer using perl in my views rather than having another mini-language to learn) and so use this as follows.
First, you have to configure Catalyst::Plugin::Assets in your project config file (e.g. $PROJECT_HOME/project.yml):
Plugin::Assets:
path: /static
output_path: build/
minify: 1
Next, I set the per-page javascript and css files I want to include as mason page attributes in my views (using an arrayref if there's more than one item of the given type) e.g.
%# in my person view
<%attr>
js => [ 'jquery.color.js', 'person.js' ]
css => 'person.css'
</%attr>
Then in my top-level autohandler, I include both global and per-page assets like this:
<%init>
# Asset collation, javascript (globals, then per-page)
$c->assets->include('js/jquery.min.js');
$c->assets->include('js/global.js');
if (my $js = $m->request_comp->attr_if_exists('js')) {
    if (ref $js && ref $js eq 'ARRAY') {
        $c->assets->include("js/$_") foreach @$js;
    }
    else {
        $c->assets->include("js/$js");
    }
}
# The CSS version is left as an exercise for the reader ...
# ...
</%init>
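For completeness, the CSS half of that 'exercise' might look something like the following, dropped into the same <%init> block - the css/global.css name is just an assumption for illustration:

# Asset collation, CSS (globals, then per-page)
$c->assets->include('css/global.css');
if (my $css = $m->request_comp->attr_if_exists('css')) {
    if (ref $css && ref $css eq 'ARRAY') {
        $c->assets->include("css/$_") foreach @$css;
    }
    else {
        $c->assets->include("css/$css");
    }
}

presumably with a matching <% $c->assets->export('text/css') %> up in the page head where the stylesheet links should go.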
Then, elsewhere in the autohandler, you add an exported link at the appropriate point in the page:
<% $c->assets->export('text/javascript') %>
This generates a link something like the following (wrapped here):
<script src="http://www.example.com/static/build/assets-ec556d1e.js" type="text/javascript"></script>
Beautiful, easy, maintainable.
I'm an old-school developer, doing all my hacking using terms, the command line, and vim, not a heavyweight IDE. Hacking perl Catalyst projects (and I imagine other MVC-type frameworks) can be slightly more challenging in this kind of environment because of the widely-branching directory structure. A single conceptual change can easily touch controller classes, model classes, view templates, and static javascript or css files, for instance.
I've found GNU screen to work really well in this environment. I use per-project screen sessions set up specifically for Catalyst - for my 'usercss' project, for instance, I have a ~/.screenrc-usercss config that looks like this:
source $HOME/.screenrc

setenv PROJDIR ~/work/usercss
setenv PROJ UserCSS

screen -t home       stuff "cd ~^Mclear^M"
screen -t top        stuff "cd $PROJDIR^Mclear^M"
screen -t lib        stuff "cd $PROJDIR/lib/$PROJ^Mclear^M"
screen -t controller stuff "cd $PROJDIR/lib/Controller^Mclear^M"
screen -t schema     stuff "cd $PROJDIR/lib/$PROJ/Schema/Result^Mclear^M"
screen -t htdocs     stuff "cd $PROJDIR/root/htdocs^Mclear^M"
screen -t static     stuff "cd $PROJDIR/root/static^Mclear^M"
screen -t sql        stuff "cd $PROJDIR^Mclear^M"

select 0
(the ^M sequences there are actual Ctrl-M newline characters).
So a:
screen -c ~/.screenrc-usercss
will give me a set of eight labelled screen windows: home, top, lib, controller, schema, htdocs, static, and sql. I usually run a couple of these in separate terms, like this:
To make this completely brainless, I also have the following bash function defined in my ~/.bashrc file:
sc () {
    SC_SESSION=$(screen -ls | egrep -e "\.$1.*Detached" | \
        awk '{ print $1 }' | head -1);
    if [ -n "$SC_SESSION" ]; then
        xtitle $1;
        screen -R $SC_SESSION;
    elif [ -f ~/.screenrc-$1 ]; then
        xtitle $1;
        screen -S $1 -c ~/.screenrc-$1
    else
        echo "Unknown session type '$1'!"
    fi
}
which lets me just type sc usercss to reattach to the first detached 'usercss' screen session, if one is available, or start up a new one.
Fast, flexible, lightweight. Choose any 3.
I wasted 15 minutes the other day trying to remember how to do this, so here it is for the future: to find out if and when a perl module got added to the core, you want Richard Clamp's excellent Module::CoreList.
Recent versions have a 'corelist' frontend command, so I typically use that e.g.
$ corelist File::Basename
File::Basename was first released with perl 5

$ corelist warnings
warnings was first released with perl 5.006

$ corelist /^File::Spec/
File::Spec was first released with perl 5.00405
File::Spec::Cygwin was first released with perl 5.006002
File::Spec::Epoc was first released with perl 5.006001
File::Spec::Functions was first released with perl 5.00504
File::Spec::Mac was first released with perl 5.00405
File::Spec::OS2 was first released with perl 5.00405
File::Spec::Unix was first released with perl 5.00405
File::Spec::VMS was first released with perl 5.00405
File::Spec::Win32 was first released with perl 5.00405

$ corelist URI::Escape
URI::Escape was not in CORE (or so I think)
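The same information is available programmatically via Module::CoreList's first_release method, if you'd rather do it from inside a script - a quick sketch:

#!/usr/bin/perl
use strict;
use Module::CoreList;

for my $module (qw(File::Basename warnings URI::Escape)) {
    # first_release returns the perl version the module first shipped with,
    # or undef if it has never been in core
    my $first = Module::CoreList->first_release($module);
    if ($first) {
        print "$module was first released with perl $first\n";
    }
    else {
        print "$module was not in CORE (or so I think)\n";
    }
}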
I've written a CPAN module to do stuff like this - File::DirCompare - so I thought I'd check how straightforward the same task would be using it:
#!/usr/bin/perl

use strict;
use File::Basename;
use File::DirCompare;
use File::Compare qw(compare);
use File::stat;

die "Usage: " . basename($0) . " dir1 dir2\n" unless @ARGV == 2;
my ($dir1, $dir2) = @ARGV;

File::DirCompare->compare($dir1, $dir2, sub {
    my ($a, $b) = @_;

    if (! $b) {
        printf "Only in %s: %s\n", dirname($a), basename($a);
    }
    elsif (! $a) {
        printf "Only in %s: %s\n", dirname($b), basename($b);
    }
    else {
        my $stata = stat $a;
        my $statb = stat $b;

        # Return unless different
        return unless compare($a, $b) != 0
            || $stata->mode != $statb->mode
            || $stata->uid  != $statb->uid
            || $stata->gid  != $statb->gid;

        # Report
        printf "%04o %s %s %s\t\t%04o %s %s %s\n",
            $stata->mode & 07777, basename($a),
            (getpwuid($stata->uid))[0], (getgrgid($stata->gid))[0],
            $statb->mode & 07777, basename($b),
            (getpwuid($statb->uid))[0], (getgrgid($statb->gid))[0];
    }
}, { ignore_cmp => 1 });
So this reports all entries that are different in content or permissions or ownership e.g. given a tree like this (slightly modified from Mike's example):
$ ls -lR scripts1 scripts2
scripts1:
total 28
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script1
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script1.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script3
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script3.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:49 script4

scripts2:
total 28
-rw-r--r-- 1 gavin users 0 Mar 17 16:41 script1
-rw-r--r-- 1 gavin users 0 Mar 17 16:41 script1.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2.bak
-rwxr-xr-x 1 gavin gavin 0 Mar 17 16:41 script3*
-rwxr-xr-x 1 gavin gavin 0 Mar 17 16:41 script3.bak*
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:49 script5
it will give output like the following:
$ ./pdiff2 scripts1 scripts2
0644 script1 gavin gavin		0644 script1 gavin users
0644 script1.bak gavin gavin		0644 script1.bak gavin users
0644 script3 gavin gavin		0755 script3 gavin gavin
0644 script3.bak gavin gavin		0755 script3.bak gavin gavin
Only in scripts1: script4
Only in scripts2: script5
This obviously has dependencies that Mike's version doesn't have, but it comes out much shorter and clearer, I think. It also doesn't fork and parse an external ls, so it should be more portable and less fragile. I should probably be caching the getpwuid lookups too, but that would have made it 5 lines longer. ;-)
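If I did want the caching, a couple of lazily-populated hashes would do it - something like this sketch (not in the script above):

# Cache user/group name lookups by uid/gid
my (%user, %group);
sub user  { my $uid = shift; $user{$uid}  ||= (getpwuid($uid))[0] }
sub group { my $gid = shift; $group{$gid} ||= (getgrgid($gid))[0] }

# ... then in the printf, use user($stata->uid) and group($stata->gid)
# instead of calling getpwuid/getgrgid every time.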