Preprocess your data with recipes

Prepare data for modeling with modular preprocessing steps.

Introduction

In this article, we’ll explore Recipes, which are designed to help you preprocess your data before training your model. Recipes are built as a series of preprocessing steps, such as:

  • converting qualitative predictors to indicator variables (also known as dummy variables),

  • transforming data to be on a different scale (e.g., taking the logarithm of a variable),

  • transforming whole groups of predictors together,

  • extracting key features from raw variables (e.g., getting the day of the week out of a date variable),

and so on. If you are familiar with scikit-learn’s dataset transformations, a lot of this might sound familiar and like what a transformer already does. Recipes can be used to do many of the same things, but they can scale your workloads on any Ibis-supported backend. This article shows how to use recipes for modeling.

To use code in this article, you will need to install the following packages: Ibis, IbisML, and scikit-learn.

pip install 'ibis-framework[duckdb,examples]' ibis-ml scikit-learn

The New York City flight data

Let’s use the nycflights13 data to predict whether a plane arrives more than 30 minutes late. This dataset contains information on 325,819 flights departing near New York City in 2013. Let’s start by loading the data and making a few changes to the variables:

import ibis

con = ibis.connect("duckdb://nycflights13.ddb")
con.create_table(
    "flights", ibis.examples.nycflights13_flights.fetch().to_pyarrow(), overwrite=True
)
con.create_table(
    "weather", ibis.examples.nycflights13_weather.fetch().to_pyarrow(), overwrite=True
)

You can now see the example dataset copied over to the database:

con = ibis.connect("duckdb://nycflights13.ddb")
con.list_tables()
['flights', 'weather']

We’ll turn on interactive mode, which partially executes queries to give users a preview of the results.

ibis.options.interactive = True
flights = con.table("flights")
flights = flights.mutate(
    dep_time=(
        flights.dep_time.lpad(4, "0").substr(0, 2)
        + ":"
        + flights.dep_time.substr(-2, 2)
        + ":00"
    ).try_cast("time"),
    arr_delay=flights.arr_delay.try_cast(int),
    air_time=flights.air_time.try_cast(int),
)
flights
┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ year   month  day    dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  arr_delay  carrier  flight  tailnum  origin  dest    air_time  distance  hour   minute  time_hour           ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int64int64int64timeint64stringstringint64int64stringint64stringstringstringint64int64int64int64timestamp(6)        │
├───────┼───────┼───────┼──────────┼────────────────┼───────────┼──────────┼────────────────┼───────────┼─────────┼────────┼─────────┼────────┼────────┼──────────┼──────────┼───────┼────────┼─────────────────────┤
│  20131105:17:005152        830     81911UA     1545N14228 EWR   IAH   22714005152013-01-01 10:00:00 │
│  20131105:33:005294        850     83020UA     1714N24211 LGA   IAH   22714165292013-01-01 10:00:00 │
│  20131105:42:005402        923     85033AA     1141N619AA JFK   MIA   16010895402013-01-01 10:00:00 │
│  20131105:44:00545-1       1004    1022-18B6     725N804JB JFK   BQN   18315765452013-01-01 10:00:00 │
│  20131105:54:00600-6       812     837-25DL     461N668DN LGA   ATL   116762602013-01-01 11:00:00 │
│  20131105:54:00558-4       740     72812UA     1696N39463 EWR   ORD   1507195582013-01-01 10:00:00 │
│  20131105:55:00600-5       913     85419B6     507N516JB EWR   FLL   1581065602013-01-01 11:00:00 │
│  20131105:57:00600-3       709     723-14EV     5708N829AS LGA   IAD   53229602013-01-01 11:00:00 │
│  20131105:57:00600-3       838     846-8B6     79N593JB JFK   MCO   140944602013-01-01 11:00:00 │
│  20131105:58:00600-2       753     7458AA     301N3ALAA LGA   ORD   138733602013-01-01 11:00:00 │
│                        │
└───────┴───────┴───────┴──────────┴────────────────┴───────────┴──────────┴────────────────┴───────────┴─────────┴────────┴─────────┴────────┴────────┴──────────┴──────────┴───────┴────────┴─────────────────────┘
weather = con.table("weather")
weather
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ origin  year   month  day    hour   temp    dewp    humid   wind_dir  wind_speed          wind_gust  precip   pressure  visib    time_hour           ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringint64int64int64int64stringstringstringstringstringstringfloat64stringfloat64timestamp(6)        │
├────────┼───────┼───────┼───────┼───────┼────────┼────────┼────────┼──────────┼────────────────────┼───────────┼─────────┼──────────┼─────────┼─────────────────────┤
│ EWR   201311139.02 26.06 59.37 270     10.357019999999999NA       0.01012    10.02013-01-01 06:00:00 │
│ EWR   201311239.02 26.96 61.63 250     8.05546           NA       0.01012.3  10.02013-01-01 07:00:00 │
│ EWR   201311339.02 28.04 64.43 240     11.5078           NA       0.01012.5  10.02013-01-01 08:00:00 │
│ EWR   201311439.92 28.04 62.21 250     12.658579999999999NA       0.01012.2  10.02013-01-01 09:00:00 │
│ EWR   201311539.02 28.04 64.43 260     12.658579999999999NA       0.01011.9  10.02013-01-01 10:00:00 │
│ EWR   201311637.94 28.04 67.21 240     11.5078           NA       0.01012.4  10.02013-01-01 11:00:00 │
│ EWR   201311739.02 28.04 64.43 240     14.960139999999999NA       0.01012.2  10.02013-01-01 12:00:00 │
│ EWR   201311839.92 28.04 62.21 250     10.357019999999999NA       0.01012.2  10.02013-01-01 13:00:00 │
│ EWR   201311939.92 28.04 62.21 260     14.960139999999999NA       0.01012.7  10.02013-01-01 14:00:00 │
│ EWR   2013111041    28.04 59.65 260     13.809359999999998NA       0.01012.4  10.02013-01-01 15:00:00 │
│                    │
└────────┴───────┴───────┴───────┴───────┴────────┴────────┴────────┴──────────┴────────────────────┴───────────┴─────────┴──────────┴─────────┴─────────────────────┘
flight_data = (
    flights.mutate(
        # Convert the arrival delay to a factor
        arr_delay=ibis.ifelse(flights.arr_delay >= 30, 1, 0),
        # We will use the date (not date-time) in the recipe below
        date=flights.time_hour.date(),
    )
    # Include the weather data
    .inner_join(weather, ["origin", "time_hour"])
    # Only retain the specific columns we will use
    .select(
        "dep_time",
        "flight",
        "origin",
        "dest",
        "air_time",
        "distance",
        "carrier",
        "date",
        "arr_delay",
        "time_hour",
    )
    # Exclude missing data
    .drop_null()
)
flight_data
┏━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ dep_time  flight  origin  dest    air_time  distance  carrier  date        arr_delay  time_hour           ┃
┡━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ timeint64stringstringint64int64stringdateint8timestamp(6)        │
├──────────┼────────┼────────┼────────┼──────────┼──────────┼─────────┼────────────┼───────────┼─────────────────────┤
│ 10:45:0067EWR   ORD   120719UA     2013-02-1402013-02-14 15:00:00 │
│ 10:48:00373LGA   FLL   1791076B6     2013-02-1402013-02-14 15:00:00 │
│ 10:48:00764EWR   IAH   2071400UA     2013-02-1402013-02-14 15:00:00 │
│ 10:51:002044LGA   MIA   1711096DL     2013-02-1402013-02-14 16:00:00 │
│ 10:51:002171LGA   DCA   40214US     2013-02-1402013-02-14 16:00:00 │
│ 10:57:001275JFK   SLC   2861990DL     2013-02-1402013-02-14 16:00:00 │
│ 10:57:00366LGA   STL   135888WN     2013-02-1402013-02-14 16:00:00 │
│ 10:57:001550EWR   SFO   3382565UA     2013-02-1402013-02-14 15:00:00 │
│ 10:58:004694EWR   MKE   113725EV     2013-02-1402013-02-14 15:00:00 │
│ 10:58:001647LGA   ATL   117762DL     2013-02-1402013-02-14 16:00:00 │
│                    │
└──────────┴────────┴────────┴────────┴──────────┴──────────┴─────────┴────────────┴───────────┴─────────────────────┘

We can see that about 16% of the flights in this dataset arrived more than 30 minutes late.

flight_data.arr_delay.value_counts().rename(n="arr_delay_count").mutate(
    prop=ibis._.n / ibis._.n.sum()
)
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ arr_delay  n       prop     ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ int8int64float64  │
├───────────┼────────┼──────────┤
│         02732790.838745 │
│         1525400.161255 │
└───────────┴────────┴──────────┘

Data splitting

To get started, let’s split this single dataset into two: a training set and a testing set. We’ll keep most of the rows in the original dataset (subset chosen randomly) in the training set. The training data will be used to fit the model, and the testing set will be used to measure model performance.

Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. It is permissible for airlines to use the same flight number for different routes, as long as the flights do not operate on the same day. This means that the combination of the flight number and the date of travel is always unique.

import ibis_ml as ml

# Create data frames for the two sets:
train_data, test_data = ml.train_test_split(
    flight_data,
    unique_key=["carrier", "flight", "date"],
    # Put 3/4 of the data into the training set
    test_size=0.25,
    num_buckets=4,
    # Fix the random numbers by setting the seed
    # This enables the analysis to be reproducible when random numbers are used
    random_seed=222,
)

Create features

flights_rec = ml.Recipe(
    ml.ExpandDate("date", components=["dow", "month"]),
    ml.Drop("date"),
    ml.TargetEncode(ml.nominal()),
    ml.DropZeroVariance(ml.everything()),
    ml.MutateAt("dep_time", ibis._.hour() * 60 + ibis._.minute()),
    ml.MutateAt(ml.timestamp(), ibis._.epoch_seconds()),
)

Fit a model with a recipe

Let’s model the flight data. We can use any scikit-learn-compatible estimator.

We will want to use our recipe across several steps as we train and test our model. We will:

  1. Process the recipe using the training set: This involves any estimation or calculations based on the training set. For our recipe, the training set will be used to determine which predictors should be converted to dummy variables and which predictors will have zero-variance in the training set, and should be slated for removal.

  2. Apply the recipe to the training set: We create the final predictor set on the training set.

  3. Apply the recipe to the test set: We create the final predictor set on the test set. Nothing is recomputed and no information from the test set is used here; the dummy variable and zero-variance results from the training set are applied to the test set.

To simplify this process, we can use a scikit-learn Pipeline.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([("flights_rec", flights_rec), ("lr_mod", LogisticRegression())])

Now, there is a single function that can be used to prepare the recipe and train the model from the resulting predictors:

X_train = train_data.drop("arr_delay")
y_train = train_data.arr_delay
pipe.fit(X_train, y_train)
Pipeline(steps=[('flights_rec',
                 Recipe(ExpandDate(cols(('date',)),
                                   components=['dow', 'month']),
                        Drop(cols(('date',))),
                        TargetEncode(nominal(), smooth=0.0),
                        DropZeroVariance(everything(), tolerance=0.0001),
                        MutateAt(cols(('dep_time',)),
                                 ((_.hour() * 60) + _.minute())),
                        MutateAt(timestamp(), _.epoch_seconds()))),
                ('lr_mod', LogisticRegression())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Use a trained workflow to predict

X_test = test_data.drop("arr_delay")
y_test = test_data.arr_delay
pipe.score(X_test, y_test)
0.8390849833968762

Acknowledgments

This tutorial is derived from the tidymodels article of the same name. The transformation logic is very similar, and much of the text is copied verbatim.

Back to top