Eleven Awesome Things You Can Do with csvkit

(Bobcatnorth via Flickr)

Tool csvkit Eleven Awesome Things You Can Do with csvkit

A Listicle Introduction to the Swiss Army Knife of CSVs

Back in 2009, Aaron Bycoffe inspired Joe Germuska to write a thirty-line Python script called csvcut for making it easy to select a subset of columns from a CSV file. For the next two years the two of them went back and forth forking and making small revisions to this tool. We got a lot of mileage out of it at the Chicago Tribune, where Joe and I were constantly wrestling with the annoyances of processing datasets efficiently in the command line. In 2011, in a fit of exuberance, I rewrote csvcut and began building other tools for making this sort of work easier. Thus csvkit was born and has been under continuous development ever since.

Over the last several months there have been two major releases of csvkit. These releases have brought long-awaited features such as Python 3 support, a csvformat utility and a new csvkit tutorial—not to mention a slew of bug fixes. To celebrate the latest release, here are eleven of my favorite awesome things you can do with csvkit. If you aren’t using it yet, hopefully this will convince you.

1. Ditch Excel (for real)

Who needs it?

in2csv file1.xls > file1.csv
in2csv file2.xlsx > file2.csv

Reference: in2csv

2. Conquer fixed-width formats

Fixed-width files are a particularly frustrating to parse. Save yourself some frustration by using a CSV-formatted schema to convert your fixed-width file into a CSV!

[schema.csv]

column,start,length
name,1,5 
age,6,2
cash,8,3
[data.fixed]

Chris44 72
Brian26110
Ryan 18145
Joe  34 83
in2csv -f fixed -s schema.csv data.fixed > data.csv
[data.csv]

name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83

Be sure to check out FFS for a database of fixed-width schemas for common files, such as those produced by the Bureau of Labor Statistics.

Reference: in2csv

3. Find cells matching a regular expression

[data.csv]

name,phone_number
Chris,555-999-1111
Brian,555-123-4567
Ryan,555-123-8901
Joe,555-777-1111

Find phone numbers following the pattern “ddd–123-dddd”:

csvgrep -c phone_number -r "\d{3}-123-\d{4}" data.csv > matching.csv
[matching.csv]

name,phone_number
Brian,555-123-4567
Ryan,555-123-8901

Reference: csvgrep

4. Turn your data into a JSON lookup table

Do you have data that with a unique id column? Would you like to be able to load that data into your browser keyed by its unique id so that you can use it as a lookup table? Well then.

[geo.csv]

slug,place,latitude,longitude
dcl,Downtown Coffee Lounge,32.35066,-95.30181
tyler-museum,Tyler Museum of Art,32.33396,-95.28174
genecov,Genecov Sculpture,32.299076986939205,-95.31571447849274
csvjson --key slug --indent 4 geo.csv
[keyed.json]

{
    "dcl": {
        "slug": "dcl",
        "place": "Downtown Coffee Lounge",
        "latitude": "32.35066",
        "longitude": "-95.30181"
    },
    "tyler-museum": {
        "slug": "tyler-museum",
        "place": "Tyler Museum of Art",
        "latitude": "32.33396",
        "longitude": "-95.28174"
    },
    "genecov": {
        "slug": "genecov",
        "place": "Genecov Sculpture",
        "latitude": "32.299076986939205",
        "longitude": "-95.31571447849274"
    }
}%

Reference: csvjson

5. Turn a CSV with latitude and longitude columns into GeoJSON

Is your data geographic, like in our previous example? Then don’t stop with JSON—go one step further and make it GeoJSON!

csvjson --lat latitude --lon longitude --key slug --crs EPSG:4269 --indent 4 geo.csv > geo.json
[geo.json]

{
    "type": "FeatureCollection",
    "bbox": [
        -95.31571447849274,
        32.299076986939205,
        -95.28174,
        32.35066
    ],
    "features": [
        {
            "type": "Feature",
            "id": "dcl",
            "geometry": {
                "type": "Point",
                "coordinates": [
                    -95.30181,
                    32.35066
                ]
            },
            "properties": {
                "place": "Downtown Coffee Lounge"
            }
        },
        {
            "type": "Feature",
            "id": "tyler-museum",
            "geometry": {
                "type": "Point",
                "coordinates": [
                    -95.28174,
...

Reference: csvjson

6. Generate summary statistics for any CSV file

[data.csv]

name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1
$ csvstat data.csv

  1. name
        <type 'unicode'>
        Nulls: False
        Values: Chris, Brian, Jeremy, Joe, Ryan
  2. age
        <type 'int'>
        Nulls: False
        Min: 18
        Max: 98
        Sum: 247
        Mean: 41.1666666667
        Median: 30.5
        Standard Deviation: 26.6359197743
        Unique values: 6
  3. cash
        <type 'int'>
        Nulls: False
        Min: 1
        Max: 300
        Sum: 711
        Mean: 118.5
        Median: 96.5
        Standard Deviation: 92.1461701139
        Unique values: 6

Row count: 6

csvstat will calculate different statistics based on the type of each column.

Reference: csvstat

7. Execute a SQL query directly on a CSV file

[data.csv]

name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1
csvsql --query "select name from data where age > 30" data.csv > old.csv
[old.csv]

name
Chris
Joe
Jeremy

Care to take it up a notch? Reference two CSV files and use SQL to JOIN them (this example uses data from the csvkit repository:

csvsql --query "select m.usda_id, avg(i.sepal_length) as mean_sepal_length from iris as i join irismeta as m on (i.species = m.species) group by m.species" examples/iris.csv examples/irismeta.csv
usda_id,mean_sepal_length
IRSE,5.006
IRVE2,5.936
IRVI,6.588

(Second example courtesy Jeroen Janssens.)

Reference: csvsql

8. Automatically create a SQL table and import a CSV into a database

[data.csv]

name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1
createdb demographics
csvsql --db postgresql:///demographics --insert data.csv

Too good to be true? Check it out:

psql -q demographics -c "\d data"

            Table "public.data"
 Column |         Type         | Modifiers
--------+----------------------+-----------
 name   | character varying(6) | not null
 age    | integer              | not null
 cash   | integer              | not null
 
psql -q demographics -c "select * from data"

  name  | age | cash
--------+-----+------
 Chris  |  44 |   72
 Brian  |  26 |  110
 Ryan   |  18 |  145
 Joe    |  34 |   83
 Ryan   |  27 |  300
 Jeremy |  98 |    1
(6 rows)

Reference: csvsql

9. Extract a table from a SQL database into a CSV

sql2csv --db postgresql:///demographics --query "select * from data" > extract.csv
[extract.csv]

name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1

Reference: sql2csv

10. Turn your Github issues into a CSV

curl https://api.github.com/repos/onyxfish/csvkit/issues?state=open | in2csv -f json > issues.csv
[issues.csv]

url,labels_url,comments_url,events_url,html_url,id,number,title,user,labels,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,pull_request,body
https://api.github.com/repos/onyxfish/csvkit/issues/329,https://api.github.com/repos/onyxfish/csvkit/issues/329/labels{/name},https://api.github.com/repos/onyxfish/csvkit/issues/329/comments,https://api.github.com/repos/onyxfish/csvkit/issues/329/events,https://github.com/onyxfish/csvkit/pull/329,42344699,329,Adding support for multiline json documents,,,open,False,,,0,2014-09-09T20:03:02Z,2014-09-09T20:03:02Z,,,Fixes #275
https://api.github.com/repos/onyxfish/csvkit/issues/328,https://api.github.com/repos/onyxfish/csvkit/issues/328/labels{/name},https://api.github.com/repos/onyxfish/csvkit/issues/328/comments,https://api.github.com/repos/onyxfish/csvkit/issues/328/events,https://github.com/onyxfish/csvkit/issues/328,42259481,328,Docs should lead-off with an example of why it's awesome,,,open,False,,,0,2014-09-09T00:36:58Z,2014-09-09T00:36:58Z,,,
...

Reference: in2csv

11. Slice a 9,655 column file using index ranges

Perhaps the most audacious use of csvkit I’ve seen:

(Credit to Dan Hill.)

Reference: csvcut

What awesome thing are you going to do with csvkit?

I like to think of csvkit as a “proper” open source project, which is to say, it is what its users have made of it. These examples didn’t come straight out of my brain—they are the product of use, suggestions and feedback from hundreds of users. So in closing, I’d like to say thanks to the 46 other people who have so far contributed code to csvkit. The next planned release of csvkit will be the fabled 1.0. I hope you will all join me by filing or fixing a bug. If you’ve never contributed to an open source project before, this is a great space to start. Join us in bringing some sanity to the world of command-line data processing.

About Christopher Groskopf

comments powered by Disqus