Skip to content

Load external data files with the DuckDB backend

Here we use the register method to load external data files and join them.

We're going to download one month of NYC Taxi data in parquet format and also download the "Taxi Zone Lookup Table" which is a csv

https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-01.parquet https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv

Create an in-memory DuckDB connection via ibis

>>> import ibis
>>> con = ibis.duckdb.connect()  # in-memory database
>>> con.list_tables()
[]

Now we call register with the filepath (the table_name argument is optional, if it isn't specified, Ibis will use the filename minus the extension)

>>> con.register("taxi+_zone_lookup.csv", table_name="taxi_zone_lookup")
AlchemyTable: taxi+_zone_lookup
  LocationID   int32
  Borough      string
  Zone         string
  service_zone string

>>> con.register("green_tripdata_2022-01.parquet", table_name="tripdata")
AlchemyTable: green_tripdata_2022_01
  VendorID              int64
  lpep_pickup_datetime  timestamp
  lpep_dropoff_datetime timestamp
  store_and_fwd_flag    string
  RatecodeID            float64
  PULocationID          int64
  DOLocationID          int64
  passenger_count       float64
  trip_distance         float64
  fare_amount           float64
  extra                 float64
  mta_tax               float64
  tip_amount            float64
  tolls_amount          float64
  ehail_fee             int32
  improvement_surcharge float64
  total_amount          float64
  payment_type          float64
  trip_type             float64
  congestion_surcharge  float64
>>> con.list_tables()
['tripdata, 'taxi_zone_lookup']

We now have a schema parsed from the files and corresponding tables (they are actually views that are lazily-loaded) are available.

Now we can interact with these tables just like a table or view in any backend connection:

>>> lookup = con.table("taxi_zone_lookup")
>>> tripdata = con.table("tripdata")

>>> tripdata.columns
['VendorID', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'RatecodeID', 'PULocationID', 'DOLocationID', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge']

>>> lookup.columns
['LocationID', 'Borough', 'Zone', 'service_zone']

We can grab a small subset of the tripdata columns and then join them to the lookup table to get human-readable values for the pickup locations:

>>> ibis.options.interactive = True

>>> tripdata = tripdata[["lpep_pickup_datetime", "PULocationID"]]

>>> tripdata.head()
  lpep_pickup_datetime  PULocationID
0  2022-01-01 00:14:21            42
1  2022-01-01 00:20:55           116
2  2022-01-01 00:57:02            41
3  2022-01-01 00:07:42           181
4  2022-01-01 00:07:50            33

>>> tripdata.join(lookup, tripdata.PULocationID == lookup.LocationID).head()
  lpep_pickup_datetime  PULocationID  LocationID    Borough                  Zone service_zone
0  2022-01-01 00:14:21            42          42  Manhattan  Central Harlem North    Boro Zone
1  2022-01-01 00:20:55           116         116  Manhattan      Hamilton Heights    Boro Zone
2  2022-01-01 00:57:02            41          41  Manhattan        Central Harlem    Boro Zone
3  2022-01-01 00:07:42           181         181   Brooklyn            Park Slope    Boro Zone
4  2022-01-01 00:07:50            33          33   Brooklyn      Brooklyn Heights    Boro Zone

That's it!

Ibis+duckdb currently supports registering parquet, csv, and csv.gz.

You can pass in the filename and the filetype will be inferred from the extension, or you can pass it explicitly using a file URI, e.g.

con.register("csv://some_csv_file_without_an_extension")
con.register("csv.gz://a_compressed_csv_file.csv")
con.register("parquet://a_parquet_file_with_truncated_extension.parq")

Last update: June 22, 2023