Well past time to get back on the blogging horse.
I'm now working on a big data web mining startup, and spending an inordinate amount of time buried in large data files, often some variant of CSV.
My favourite new tool over the last few months is is Karlheinz Zoechling's App::CCSV perl module, which lets you do some really powerful CSV processing using perl one-liners, instead of having to write a trivial/throwaway script.
If you're familiar with perl's standard autosplit functionality (
then App::CCSV will look pretty similar - it autosplits its input into an
array on your CSV delimiters for further processing. It handles
embedded delimiters and CSV quoting conventions correctly, though, which
perl's standard autosplitting doesn't.
@f to hold the autosplit fields, and provides utility
cprint for doing
# Print just the first 3 fields of your file perl -MApp::CCSV -ne 'csay @f[0..2]' < file.csv # Print only lines where the second field is 'Y' or 'T' perl -MApp::CCSV -ne 'csay @f if $f =~ /^[YT]$/' < file.csv # Print the CSV header and all lines where field 3 is negative perl -MApp::CCSV -ne 'csay @f if $. == 1 || ($f||0) < 0' < file.csv # Insert a new country code field after the first field perl -MApp::CCSV -ne '$cc = get_country_code($f); csay $f,$cc,@f[1..$#f]' < file.csv
App::CCSV can use a config file to handle different kinds of CSV input.
Here's what I'm using, which lives in my home directory in
<CCSV> sep_char , quote_char """ <names> <comma> sep_char "," quote_char """ </comma> <tabs> sep_char " " quote_char """ </tabs> <pipe> sep_char "|" quote_char """ </pipe> <commanq> sep_char "," quote_char "" </comma> <tabsnq> sep_char " " quote_char "" </tabs> <pipenq> sep_char "|" quote_char "" </pipe> </names> </CCSV>
That just defines two sets of names for different kinds of input:
[,\t|] delimiters with standard CSV quote conventions;
nq ("no-quote") variants -
pipenq - to
handle inputs that aren't using standard CSV quoting. It also makes the
behaviour the default.
You use one of the names by specifying it when loading the module, after an
perl -MApp::CCSV=comma ... perl -MApp::CCSV=tabs ... perl -MApp::CCSV=pipe ...
You can also convert between formats by specifying two names, in <input>,<output> format e.g.
perl -MApp::CCSV=comma,pipe ... perl -MApp::CCSV=tabs,comma ... perl -MApp::CCSV=pipe,tabs ...
And just to round things off, I have a few aliases defined in my
to make these even easier to use:
alias perlcsv='perl -CSAD -MApp::CCSV' alias perlpsv='perl -CSAD -MApp::CCSV=pipe' alias perltsv='perl -CSAD -MApp::CCSV=tabs' alias perlcsvnq='perl -CSAD -MApp::CCSV=commanq' alias perlpsvnq='perl -CSAD -MApp::CCSV=pipenq' alias perltsvnq='perl -CSAD -MApp::CCSV=tabsnq'
That simplifies my standard invocation to something like:
perlcsv -ne 'csay @f[0..2]' < file.csv
Happy data wrangling!