by Aneesh Karve · 4/25/2017
The tragedy of data science is that 79% of an analyst’s time goes to data preparation. Data preparation is not only tedious; it steals time from analysis.
A data package is an abstraction that encapsulates and automates data preparation. More specifically, a data package is a tree of serialized data wrapped in a Python module. Each data package has a unique handle, a revision history, and a web page. Packages are stored in a server-side registry that enforces access control.
Example: Bike for Your Rights
Suppose you wish to analyze bicycle traffic on Seattle’s Fremont Bridge. You could locate the source data, download it, parse it, index the date column, etc. — as Jake Vanderplas demonstrates — or you could install the data as a package in less than a minute:
$ pip install quilt # requires HDF5; details below
$ quilt install akarve/fremont_bike
Now we can load the data directly into Python:
from quilt.data.akarve import fremont_bike
In contrast to files, data packages require very little data preparation. Package users can jump straight to the analysis.
Less is More
The Jupyter notebooks shown in Fig. 1 perform the same analysis on the same data. The notebooks differ only in data injection. On the left we see a typical file-based workflow: download files, discover file formats, write scripts to parse, clean, and load the data, run the scripts, and finally begin analysis. On the right we see a package-based workflow: install the data, import the data, and begin the analysis. The key takeaway is that file-based workflows require substantial data preparation (red) prior to analysis (green).
(Both notebooks are available on GitHub.)
Get the Package Manager
To run the code samples in this article you’ll need HDF5 1.8 [1] (here’s how to install HDF5) and the Quilt package manager:
$ pip install quilt
Get a Data Package
Recall how we acquired the Fremont Bridge data:
$ quilt install akarve/fremont_bike
`quilt install` connects to a remote registry and materializes a package on the calling machine. `quilt install` is similar in spirit to `git clone` or `npm install`, but it scales to big data, keeps your source code history clean, and handles serialization.
Work with Package Data
To simplify dependency injection, Quilt rolls data packages into a Python module so that you can import data like you import code:
# python
from quilt.data.akarve import fremont_bike
Importing large data packages is fast since disk I/O is deferred until the data are referenced in code. At the moment of reference, binary data are copied from disk into main memory. Since there’s no parsing overhead, deserialization is five to twenty times faster than loading data from text files.
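The deferred-I/O behavior described above can be sketched with a lazy node: nothing touches the disk until the data are first referenced. This is an illustration of the idea, not Quilt’s actual implementation; the class name and loader are hypothetical:

```python
import json
import tempfile


class LazyNode:
    """Defer disk I/O until the data are first referenced."""

    def __init__(self, path, loader):
        self._path = path      # where the serialized data live
        self._loader = loader  # how to deserialize them
        self._cache = None     # populated on first access

    def data(self):
        # Read from disk only once, at the moment of first reference
        if self._cache is None:
            self._cache = self._loader(self._path)
        return self._cache


# Demo with a throwaway JSON file standing in for binary storage
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"West Sidewalk": [4, 4, 1]}, f)
    path = f.name

node = LazyNode(path, lambda p: json.load(open(p)))
print(node.data()["West Sidewalk"])  # I/O happens here, not at construction
```

Because the loader runs only once, repeated references to `node.data()` hit the in-memory cache rather than the disk.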
We can see that `fremont_bike` is a group containing two items:
# python
>>> fremont_bike
<GroupNode '/Users/akarve/quilt_packages/akarve/fremont_bike':''>
README
counts
A group contains other groups and, at its leaves, contains data:
# python
>>> fremont_bike.counts.data()
West Sidewalk East Sidewalk
Date
2012-10-03 00:00:00 4 9
2012-10-03 01:00:00 4 6
2012-10-03 02:00:00 1 1
...
[39384 rows x 2 columns]
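From here, analysis is ordinary pandas. For example, the hourly counts can be resampled to daily totals. The sketch below uses a small synthetic frame with the same schema as `fremont_bike.counts.data()`, so it runs without the package installed:

```python
import pandas as pd

# Synthetic stand-in for the package data: hourly counts indexed by Date,
# one column per sidewalk (constant values keep the demo easy to check)
idx = pd.date_range("2012-10-03", periods=48, freq="60min")
counts = pd.DataFrame({"West Sidewalk": 1, "East Sidewalk": 2}, index=idx)
counts.index.name = "Date"

# Resample the hourly counts to daily totals
daily = counts.resample("D").sum()
print(daily)
```

With the real package, the same `resample` call applies directly to `fremont_bike.counts.data()`.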
Create a Package
Let’s start with some source data. How do we convert source files into a data package? We’ll need a configuration file, conventionally called `build.yml`. `build.yml` tells `quilt` how to structure a package. Fortunately, we don’t need to write `build.yml` by hand. `quilt generate` creates a build file that mirrors the contents of any directory:
$ quilt generate src
Let’s open the file that we just generated, src/build.yml
:
contents:
  Fremont_Hourly_Bicycle_Counts_October_2012_to_present:
    file: Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv
  README:
    file: README.md
`contents` dictates the structure of a package.
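A minimal sketch of what such a generator might do: walk a directory and emit one `contents` node per file, deriving a Python-friendly node name from each file name. This is an illustration under those assumptions, not Quilt’s actual code:

```python
import os
import re
import tempfile


def generate_contents(directory):
    """Map each file in `directory` to a package node, build.yml-style."""
    contents = {}
    for name in sorted(os.listdir(directory)):
        if os.path.isfile(os.path.join(directory, name)):
            # Sanitize the file name (minus extension) into an identifier
            node = re.sub(r"\W", "_", os.path.splitext(name)[0])
            contents[node] = {"file": name}
    return {"contents": contents}


# Demo on a throwaway directory with two empty source files
src = tempfile.mkdtemp()
for fname in ("counts.csv", "README.md"):
    open(os.path.join(src, fname), "w").close()

print(generate_contents(src))
```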
Let’s edit `build.yml` to shorten the Python name for our data. Oh, and let’s index on the “Date” column:
contents:
  counts:
    file: Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv
    index_col: Date
    parse_dates: True
  README:
    file: README.md
`counts` (or any name that we write in its place) is the name that package users will type to access the data extracted from the CSV file. Behind the scenes, `index_col` and `parse_dates` are passed to `pandas.read_csv` as keyword arguments.
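Those same keyword arguments can be exercised directly against `pandas.read_csv`. Here is a standalone illustration using an in-memory CSV with the package’s schema:

```python
import io

import pandas as pd

# A tiny CSV with the same columns as the Fremont Bridge data
csv = io.StringIO(
    "Date,West Sidewalk,East Sidewalk\n"
    "2012-10-03 00:00:00,4,9\n"
    "2012-10-03 01:00:00,4,6\n"
)

# The kwargs that build.yml forwards to pandas.read_csv:
# index on the Date column and parse it as timestamps
counts = pd.read_csv(csv, index_col="Date", parse_dates=True)
print(counts.index.dtype)  # datetime64[ns], thanks to parse_dates
```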
Now we can build our package:
$ quilt build YOUR_NAME/fremont_bike src/build.yml
...
src/Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv...
100%|███████████████████████████| 1.13M/1.13M [00:09<00:00, 125KB/s]
Saving as binary dataframe...
Built YOUR_NAME/fremont_bike successfully.
You’ll notice that `quilt build` takes a few seconds to construct the date index.
The build process has two key advantages: 1) parsing and serialization are automated; 2) packages are built once for the benefit of all users — there’s no repetitive data prep.
Push to the Registry
We’re ready to push our package to the registry, where it’s stored for anyone who needs it:
quilt login # accounts are free; only registered users can push
quilt push YOUR_NAME/fremont_bike
The package now resides in the registry and has a landing page populated by src/README.md
. Landing pages look like this.
Packages are private by default, so you’ll see a 404 until and unless you log in to the registry. To publish a package, use `quilt access add`:
quilt access add YOUR_NAME/fremont_bike public
To share a package with a specific user, replace `public` with their Quilt username.
Package handles, such as `akarve/fremont_bike`, provide a common frame of reference that can be reproduced by any user on any machine. But what happens if the data changes? `quilt log` tracks changes over time:
# run in same directory as you ran quilt install akarve/fremont_bike
$ quilt log akarve/fremont_bike
Hash Pushed Author
495992b6b9109a1f9d5e209d6... 2017-04-14 14:33:40 akarve
24bb9d6e9d80000d9bc5fdc1e... 2017-03-29 20:42:43 akarve
03d2450e755cf45fbbf9c3635... 2017-03-29 17:40:47 akarve
`quilt install -x` allows us to install historical snapshots:
quilt install akarve/fremont_bike -x 24bb9d6e9d80000d9bc5fdc1e89a0a77c40da33da5a054b05cdec29755ac408b
The upshot for reproducibility is that we no longer run models on “some data,” but on specific hash versions of specific packages.
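Pinning by hash is ordinary content addressing: identical bytes always produce the identical digest, and any change produces a new one. A minimal sketch of the idea with `hashlib` (Quilt’s actual hashing scheme covers the whole package tree and may differ in detail):

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest: the same bytes always yield the same hash."""
    return hashlib.sha256(data).hexdigest()


v1 = content_hash(b"West,East\n4,9\n")
v2 = content_hash(b"West,East\n4,9\n5,7\n")  # data changed => new hash

print(v1 == content_hash(b"West,East\n4,9\n"))  # True: reproducible
print(v1 == v2)                                  # False: change detected
```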
Data packages make for fast, reproducible analysis by simplifying data prep, eliminating parsing, and versioning data. In round numbers, data packages speed both I/O and data preparation by a factor of 10.
In future articles we’ll virtualize data packages across Python, Spark, and R.
To learn more visit QuiltData.com.
The Quilt client is open source. Visit our GitHub repository to contribute.
Notes
[1] We plan to transition to Apache Parquet in the near future.
Aneesh is cofounder and CTO of Quilt Data (YC W16). He specializes in data and visualization. His research interests include machine learning, abstract algebra, and user interfaces.