Data Packages for Fast, Reproducible Python Analysis

by Aneesh Karve · 4/25/2017

The tragedy of data science is that 79% of an analyst’s time goes to data preparation. Data preparation is not only tedious; it steals time from analysis.

A data package is an abstraction that encapsulates and automates data preparation. More specifically, a data package is a tree of serialized data wrapped in a Python module. Each data package has a unique handle, a revision history, and a web page. Packages are stored in a server-side registry that enforces access control.

Example: Bike for Your Rights
Suppose you wish to analyze bicycle traffic on Seattle’s Fremont Bridge. You could locate the source data, download it, parse it, index the date column, etc. — as Jake Vanderplas demonstrates — or you could install the data as a package in less than a minute:

$ pip install quilt # requires HDF5; details below
$ quilt install akarve/fremont_bike

Now we can load the data directly into Python:

from quilt.data.akarve import fremont_bike

In contrast to files, data packages require very little data preparation. Package users can jump straight to the analysis.

Less is More
The Jupyter notebooks shown in Fig. 1 perform the same analysis on the same data. The notebooks differ only in data injection. On the left we see a typical file-based workflow: download files, discover file formats, write scripts to parse, clean, and load the data, run the scripts, and finally begin analysis. On the right we see a package-based workflow: install the data, import the data, and begin the analysis. The key takeaway is that file-based workflows require substantial data preparation (red) prior to analysis (green).

[Figure 1 image: before/after notebooks, side by side]

Figure 1. File-based workflows (left) require significantly more prep than package-based workflows (right).

(Both notebooks are available on GitHub.)

Data Packages in Detail

Get the Package Manager
To run the code samples in this article you’ll need HDF5 1.8¹ (here’s how to install HDF5) and the Quilt package manager:

$ pip install quilt

Get a Data Package
Recall how we acquired the Fremont Bridge data:

$ quilt install akarve/fremont_bike

quilt install connects to a remote registry and materializes a package on the calling machine. quilt install is similar in spirit to git clone or npm install, but it scales to big data, keeps your source code history clean, and handles serialization.

Work with Package Data
To simplify dependency injection, Quilt rolls data packages into a Python module so that you can import data like you import code:

# python
from quilt.data.akarve import fremont_bike

Importing large data packages is fast since disk I/O is deferred until the data are referenced in code. At the moment of reference, binary data are copied from disk into main memory. Since there’s no parsing overhead, deserialization is five to twenty times faster than loading data from text files.
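
To see the deferred load in action, you can time the first call to data() yourself. This is an informal sketch; exact numbers will vary by machine and disk:

# python
import time

from quilt.data.akarve import fremont_bike

start = time.time()
df = fremont_bike.counts.data()  # binary copy from disk into memory; no parsing
print('%d rows loaded in %.3f seconds' % (len(df), time.time() - start))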

We can see that fremont_bike is a group containing two items:

# python
>>> fremont_bike
<GroupNode '/Users/akarve/quilt_packages/akarve/fremont_bike':''>
README
counts

A group contains other groups and, at its leaves, contains data:

# python
>>> fremont_bike.counts.data()
                      West Sidewalk East Sidewalk
Date
2012-10-03 00:00:00   4             9
2012-10-03 01:00:00   4             6
2012-10-03 02:00:00   1             1
...
[39384 rows x 2 columns]
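
The result is an ordinary pandas DataFrame, so analysis can start immediately. For instance, a quick sketch of weekly totals (column names taken from the output above):

# python
from quilt.data.akarve import fremont_bike

df = fremont_bike.counts.data()
weekly = df.resample('W').sum()  # roll hourly counts up to weekly totals
weekly['Total'] = weekly['West Sidewalk'] + weekly['East Sidewalk']
print(weekly.tail())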

Create a Package
Let’s start with some source data. How do we convert source files into a data package? We’ll need a configuration file, conventionally called build.yml. build.yml tells quilt how to structure a package. Fortunately, we don’t need to write build.yml by hand. quilt generate creates a build file that mirrors the contents of any directory:

$ quilt generate src

Let’s open the file that we just generated, src/build.yml:

contents:
  Fremont_Hourly_Bicycle_Counts_October_2012_to_present:
    file: Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv
  README:
    file: README.md

contents dictates the structure of a package.

Let’s edit build.yml to shorten the Python name for our data. Oh, and let’s index on the “Date” column:

contents:
  counts:
    file: Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv
    index_col: Date
    parse_dates: True
  README:
    file: README.md

counts — or any name that we write in its place — is the name that package users will type to access the data extracted from the CSV file. Behind the scenes, index_col and parse_dates are passed to pandas.read_csv as keyword arguments.
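
For intuition, here is roughly the hand-written prep that this build.yml automates. It’s a sketch of the equivalent pandas call, not Quilt’s actual internals:

# python
import pandas as pd

df = pd.read_csv(
    'src/Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv',
    index_col='Date',   # becomes the DataFrame index
    parse_dates=True,   # parse the index as datetimes
)

With a package, this parse happens once at build time rather than in every user’s script.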

Now we can build our package:

$ quilt build YOUR_NAME/fremont_bike src/build.yml
...
src/Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv...
100%|███████████████████████████| 1.13M/1.13M [00:09<00:00, 125KB/s]
Saving as binary dataframe...
Built YOUR_NAME/fremont_bike successfully.

You’ll notice that quilt build takes a few seconds to construct the date index.

The build process has two key advantages: 1) parsing and serialization are automated; 2) packages are built once for the benefit of all users — there’s no repetitive data prep.

Push to the Registry
We’re ready to push our package to the registry, where it’s stored for anyone who needs it:

$ quilt login # accounts are free; only registered users can push
$ quilt push YOUR_NAME/fremont_bike

The package now resides in the registry and has a landing page populated by src/README.md. Landing pages look like this.

Packages are private by default, so you’ll see a 404 until and unless you log in to the registry. To publish a package, use access add:

$ quilt access add YOUR_NAME/fremont_bike public

To share a package with a specific user, replace public with their Quilt username.
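
For example, to grant access to a hypothetical user named some_colleague:

$ quilt access add YOUR_NAME/fremont_bike some_colleague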

Reproducibility

Package handles, such as akarve/fremont_bike, provide a common frame of reference that can be reproduced by any user on any machine. But what happens if the data change? quilt log tracks changes over time:

# run in the same directory where you ran quilt install akarve/fremont_bike
$ quilt log akarve/fremont_bike
Hash                         Pushed              Author
495992b6b9109a1f9d5e209d6... 2017-04-14 14:33:40 akarve
24bb9d6e9d80000d9bc5fdc1e... 2017-03-29 20:42:43 akarve
03d2450e755cf45fbbf9c3635... 2017-03-29 17:40:47 akarve

quilt install -x allows us to install historical snapshots:

$ quilt install akarve/fremont_bike -x 24bb9d6e9d80000d9bc5fdc1e89a0a77c40da33da5a054b05cdec29755ac408b

The upshot for reproducibility is that we no longer run models on “some data,” but on specific hash versions of specific packages.
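
For example, an analysis script can pin its input by recording the snapshot hash alongside the import (a sketch, using the hash installed above):

# python
# Data pinned via: quilt install akarve/fremont_bike -x FREMONT_BIKE_HASH
FREMONT_BIKE_HASH = '24bb9d6e9d80000d9bc5fdc1e89a0a77c40da33da5a054b05cdec29755ac408b'

from quilt.data.akarve import fremont_bike
df = fremont_bike.counts.data()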

Conclusion

Data packages make for fast, reproducible analysis by simplifying data prep, eliminating parsing, and versioning data. In round numbers, data packages speed both I/O and data preparation by a factor of 10.

In future articles we’ll virtualize data packages across Python, Spark, and R.

To learn more visit QuiltData.com.

Open Source

The Quilt client is open source. Visit our GitHub repository to contribute.

Appendix: Command summary

[Figure 2 image: big-picture diagram of Quilt commands]

In brief: quilt generate creates a build file from a directory of source files; quilt build compiles a package from a build file; quilt login authenticates you; quilt push uploads a package to the registry; quilt access add grants access; quilt install downloads a package (add -x HASH for a specific snapshot); and quilt log lists a package’s revision history.


Notes
1. We plan to transition to Apache Parquet in the near future.


Author

  • Aneesh Karve

    Aneesh is cofounder and CTO of Quilt Data (YC W16). He specializes in data and visualization. His research interests include machine learning, abstract algebra, and user interfaces.