Skip to content

Join an in-memory DataFrame to a TableExpression

You might have an in-memory DataFrame that you want to join to a TableExpression. For example, you might have a file on your local machine that you don't want to upload to your backend, but you need to join it to a table in that backend.

You can perform joins on local data to TableExpressions from your backend easily with Ibis MemTables.

In this guide, you will learn how to join a pandas DataFrame to a TableExpression.

Data Setup:

In this example, we will create two DataFrames: one containing events and one containing event names. We will save the events to a parquet file and read that as a TableExpression in the DuckDB backend. We will then convert the event names DataFrame to a PandasInMemoryTable (MemTable), which is a pandas DataFrame as a TableExpression and join the two expressions together as we would two TableExpressions in a backend.

    In [1]: import ibis

    In [2]: import pandas as pd
       ...: from datetime import date

    In [3]: # create a pandas DataFrame that we will convert to a
       ...: # PandasInMemoryTable (Ibis MemTable)
       ...: events = pd.DataFrame(
       ...:     {
       ...:         'event_id': range(4),
       ...:         'event_name': [f'e{k}' for k in range(4)],
       ...:     }
       ...: )

    In [4]: # Create a parquet file that we will read in using the DuckDB backend
       ...: # as a TableExpression
       ...: measures = pd.DataFrame({
       ...:     "event_id": [0] * 2 + [1] * 3 + [2] * 5 + [3] * 2
       ...:     ,"measured_on": map(
       ...:         date
       ...:         ,[2021] * 12, [6] * 4 + [5] * 6 + [7] * 2
       ...:         ,range(1, 13)
       ...:     )
       ...:     ,"measurement": None
       ...: })

    In [5]: measures.at[1, "measurement"] = 5.
       ...: measures.at[4, "measurement"] = 42.
       ...: measures.at[5, "measurement"] = 42.
       ...: measures.at[7, "measurement"] = 11.

    In [6]: # Save measures to parquet:
       ...: measures.to_parquet('measures.parquet')

    In [7]: # connect to a DuckDB backend
       ...: conn = ibis.connect('duckdb://:memory:')
       ...: measures = conn.register('measures.parquet', 'measures')

    In [8]: # `measures` is a TableExpression in a DuckDB backend connection:
       ...: measures
    Out[8]:
    AlchemyTable: measures
      event_id    int64
      measured_on date
      measurement float64

Converting a pandas DataFrame to a MemTable is as simple as feeding it to ibis.memtable:

    In [9]: # To join, convert your DataFrame to a memtable
       ...: mem_events = ibis.memtable(events)

    In [10]: mem_events
    Out[10]:
    PandasInMemoryTable
      data:
        PandasDataFrameProxy:
             event_id event_name
          0         0         e0
          1         1         e1
          2         2         e2
          3         3         e3

and joining is the same as joining any two TableExpressions:

    In [11]: # Join as you would two table expressions
        ...: measures.join(
        ...:     mem_events,
        ...:     measures['event_id'] == mem_events['event_id']
        ...: ).to_pandas()
    Out[11]:
        event_id measured_on  measurement  event_name
    0          0  2021-06-01          NaN          e0
    1          0  2021-06-02          5.0          e0
    2          1  2021-06-03          NaN          e1
    3          1  2021-06-04          NaN          e1
    4          1  2021-05-05         42.0          e1
    5          2  2021-05-06         42.0          e2
    6          2  2021-05-07          NaN          e2
    7          2  2021-05-08         11.0          e2
    8          2  2021-05-09          NaN          e2
    9          2  2021-05-10          NaN          e2
    10         3  2021-07-11          NaN          e3
    11         3  2021-07-12          NaN          e3

Note that the return result of the join is a TableExpression and that to_pandas returns a pandas DataFrame.


Last update: June 22, 2023