Install¶
If you don't have ibis
installed, you can install it from:
PyPI¶
python -m pip install 'ibis-framework[duckdb]'
conda-forge¶
conda install -c conda-forge ibis-framework
You can check out the install page for more detailed instructions.
Ibis for pandas users¶
Much of the syntax and many of the operations in Ibis are inspired
by the pandas DataFrame
but the primary domain of Ibis is
SQL so there are some differences in how they operate.
For one thing, SQL (and therefore Ibis) makes no guarantees about row order,
which is a key assumption that numpy/pandas users are used to. So two columns
can't be deterministically "lined up" unless they are actually part of the same Table.
An outcome of this is that you can't index into Columns by position
(column.head(5)
or column[3:5]
). You can only index into Tables
(table.head(5)
or table[3:5]
). So if you want the first 5 rows of a column,
you have to do table.head(5).my_column
, table.my_column.head(5)
will not work.
Another difference between Ibis tables and pandas DataFrame
s are that many
of the pandas DataFrame
operations do in-place operations (they are "mutable"),
whereas Ibis table operations always return a new table expression ("immutable").
Finally, Ibis table expressions are lazy, meaning
that as you build up a table expression, no computation is actually performed
until you call an action method such as to_pandas
. Only then does Ibis
compile the table expression into SQL and send it to the backend.
(Note that we'll be using Ibis' interactive mode to automatically execute queries at
the end of each cell in this notebook. If you are using similar code in a program,
you will have to add .to_pandas()
to each operation that you want to evaluate.)
import ibis
import pandas as pd
ibis.options.interactive = True
We'll be using the DuckDB backend in Ibis in the examples below. First we'll create a simple DataFrame
.
df = pd.DataFrame(
[['a', 1, 2], ['b', 3, 4]],
columns=['one', 'two', 'three'],
index=[5, 6],
)
df
one | two | three | |
---|---|---|---|
5 | a | 1 | 2 |
6 | b | 3 | 4 |
Now we can create an Ibis table from the above DataFrame
.
Note that the index from the pandas DataFrame
is dropped.
Ibis has no notion of an index: If you want to use the index,
you will need to turn it into a column.
t = ibis.memtable(df, name="t")
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
Data types¶
The data types of columns in pandas are accessed using the dtypes
attribute. This returns
a Series
object.
df.dtypes
one object two int64 three int64 dtype: object
In Ibis, you use the schema
method which returns an ibis.Schema
object.
t.schema()
ibis.Schema { one string two int64 three int64 }
It is possible to convert the schema information to pandas data types using the to_pandas
method, if needed.
t.schema().to_pandas()
[('one', dtype('O')), ('two', dtype('int64')), ('three', dtype('int64'))]
Table layout¶
In pandas, the layout of the table is contained in the shape
attribute which contains the number
of rows and number of columns in a tuple. The number of columns in an Ibis table can be gotten
from the length of the schema.
len(t.schema())
3
To get the number of rows of a table, you use the count
method.
t.count()
2
To mimic pandas' behavior, you would use the following code. Note that you need to use the to_pandas
method
after count
to evaluate the expression returned by count
.
(t.count().to_pandas(), len(t.schema()))
(2, 3)
df.shape
(2, 3)
Subsetting columns¶
Selecting columns is very similar to in pandas. In fact, you can use the same syntax.
t[['one', 'two']]
┏━━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ ┡━━━━━━━━╇━━━━━━━┩ │ string │ int64 │ ├────────┼───────┤ │ a │ 1 │ │ b │ 3 │ └────────┴───────┘
However, since row-level indexing is not supported in Ibis, the inner list is not necessary.
t['one', 'two']
┏━━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ ┡━━━━━━━━╇━━━━━━━┩ │ string │ int64 │ ├────────┼───────┤ │ a │ 1 │ │ b │ 3 │ └────────┴───────┘
Selecting columns¶
Selecting columns is done using the same syntax as in pandas DataFrames
. You can use either
the indexing syntax or attribute syntax.
t['one']
┏━━━━━━━━┓ ┃ one ┃ ┡━━━━━━━━┩ │ string │ ├────────┤ │ a │ │ b │ └────────┘
or:
t.one
┏━━━━━━━━┓ ┃ one ┃ ┡━━━━━━━━┩ │ string │ ├────────┤ │ a │ │ b │ └────────┘
Adding, removing, and modifying columns¶
Modifying the columns of an Ibis table is a bit different than doing the same operations in
a pandas DataFrame
. This is primarily due to the fact that in-place operations are not
supported on Ibis tables. Each time you do a column modification to a table, a new table
expression is returned.
Adding columns¶
Adding columns is done through the mutate
method.
mutated = t.mutate(new_col=t.three * 2)
mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┓ ┃ one ┃ two ┃ three ┃ new_col ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━┩ │ string │ int64 │ int64 │ int64 │ ├────────┼───────┼───────┼─────────┤ │ a │ 1 │ 2 │ 4 │ │ b │ 3 │ 4 │ 8 │ └────────┴───────┴───────┴─────────┘
Notice that the original table object remains unchanged. Only the mutated
object that was returned
contains the new column.
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
It is also possible to create a column in isolation. This is similar to a Series
in pandas.
Note that the name of the column by default is a representation of the expression:
unnamed = t.three * 2
unnamed
┏━━━━━━━━━━━━━━━━━━━━┓ ┃ Multiply(three, 2) ┃ ┡━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├────────────────────┤ │ 4 │ │ 8 │ └────────────────────┘
To get a version with a specific name, you can use the name
method:
new_col = unnamed.name("new_col")
new_col
┏━━━━━━━━━┓ ┃ new_col ┃ ┡━━━━━━━━━┩ │ int64 │ ├─────────┤ │ 4 │ │ 8 │ └─────────┘
You can then add this column to the table using a projection.
proj = t['one', 'two', new_col]
proj
┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓ ┃ one ┃ two ┃ new_col ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼─────────┤ │ a │ 1 │ 4 │ │ b │ 3 │ 8 │ └────────┴───────┴─────────┘
Removing columns¶
Removing a column is done using the drop
method.
t.columns
['one', 'two', 'three']
subset = t.drop('one', 'two')
subset.columns
['three']
It is also possible to drop columns by selecting the columns you want to remain.
subset = t['two', 'three']
subset.columns
['two', 'three']
Modifying columns¶
Replacing existing columns is done using the mutate
method just like adding columns. You simply
add a column of the same name to replace it.
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
mutated = t.mutate(two=t.two * 2)
mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 2 │ 2 │ │ b │ 6 │ 4 │ └────────┴───────┴───────┘
Renaming columns¶
In addition to replacing columns, you can simply rename them as well. This is done with the relabel
method
which takes a dictionary containing the name mappings.
relabeled = t.relabel(
dict(
one='a',
two='b',
)
)
relabeled
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ a ┃ b ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
Selecting rows¶
There are several methods that can be used to select rows of data in various ways. These are described
in the sections below. We'll use the Palmer Penguins$^1$ dataset to investigate!
Ibis has several built-in example datasets that you can access using the ibis.examples
module.
$^1$: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
penguins = ibis.examples.penguins.fetch()
penguins
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ nan │ nan │ NULL │ NULL │ NULL │ 2007 │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Head, tail and limit¶
The head
method works the same ways as in pandas. Note that some Ibis backends may not have an
inherent ordering of their rows and using head
may not return deterministic results. In those
cases, you can use sorting before calling head
to ensure a stable result.
penguins.head(5)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ nan │ nan │ NULL │ NULL │ NULL │ 2007 │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
However, the tail method is not implemented since it is not supported in all databases.
It is possible to emulate the tail
method if you use sorting in your table to do a
reverse sort then use the head
method to retrieve the "top" rows.
Another way to limit the number of retrieved rows is using the limit
method. The following will return
the same result as head(5)
. This is often used in conjunction with other filtering techniques that we
will cover later.
penguins.limit(5)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ nan │ nan │ NULL │ NULL │ NULL │ 2007 │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Filtering rows¶
In addition to simply limiting the number of rows that are returned, it is possible to filter the rows using expressions. Expressions are constructed very similarly to the way they are in pandas. Ibis expressions are constructed from operations on columns in a table which return a boolean result. This result is then used to filter the table.
expr = penguins.bill_length_mm > 37.0
expr
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Greater(bill_length_mm, 37.0) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ boolean │ ├───────────────────────────────┤ │ True │ │ True │ │ True │ │ NULL │ │ False │ │ True │ │ True │ │ True │ │ False │ │ True │ │ … │ └───────────────────────────────┘
We can evaluate the value counts to see how many rows we will expect to get back after filtering.
expr.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Greater(bill_length_mm, 37.0) ┃ Greater(bill_length_mm, 37.0)_count ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ boolean │ int64 │ ├───────────────────────────────┼─────────────────────────────────────┤ │ True │ 300 │ │ NULL │ 2 │ │ False │ 42 │ └───────────────────────────────┴─────────────────────────────────────┘
Now we apply the filter to the table. Since there are 6 True values in the expression, we should get 6 rows back.
filtered = penguins[expr]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.1 │ 186 │ 3300 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.3 │ 180 │ 3700 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 41.1 │ 17.6 │ 182 │ 3200 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Of course, the filtering expression can be applied inline as well.
filtered = penguins[penguins.bill_length_mm > 37.0]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.1 │ 186 │ 3300 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.3 │ 180 │ 3700 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 41.1 │ 17.6 │ 182 │ 3200 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Multiple filtering expressions can be combined into a single expression or chained onto existing table expressions.
filtered = penguins[(penguins.bill_length_mm > 37.0) & (penguins.bill_depth_mm > 18.0)]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 38.6 │ 21.2 │ 191 │ 3800 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.7 │ 19.0 │ 195 │ 3450 │ female │ 2007 │ │ Adelie │ Torgersen │ 42.5 │ 20.7 │ 197 │ 4500 │ male │ 2007 │ │ Adelie │ Torgersen │ 46.0 │ 21.5 │ 194 │ 4200 │ male │ 2007 │ │ Adelie │ Biscoe │ 37.8 │ 18.3 │ 174 │ 3400 │ female │ 2007 │ │ Adelie │ Biscoe │ 37.7 │ 18.7 │ 180 │ 3600 │ male │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
The code above will return the same rows as the code below.
filtered = penguins[penguins.bill_length_mm > 37.0][penguins.bill_depth_mm > 18.0]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 38.6 │ 21.2 │ 191 │ 3800 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.7 │ 19.0 │ 195 │ 3450 │ female │ 2007 │ │ Adelie │ Torgersen │ 42.5 │ 20.7 │ 197 │ 4500 │ male │ 2007 │ │ Adelie │ Torgersen │ 46.0 │ 21.5 │ 194 │ 4200 │ male │ 2007 │ │ Adelie │ Biscoe │ 37.8 │ 18.3 │ 174 │ 3400 │ female │ 2007 │ │ Adelie │ Biscoe │ 37.7 │ 18.7 │ 180 │ 3600 │ male │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Aggregation has not been discussed yet, but aggregate values can be used in expressions to return things such as all of the rows in a data set where the value in a column is greater than the mean.
filtered = penguins[penguins.bill_length_mm > penguins.bill_length_mm.mean()]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 46.0 │ 21.5 │ 194 │ 4200 │ male │ 2007 │ │ Adelie │ Dream │ 44.1 │ 19.7 │ 196 │ 4400 │ male │ 2007 │ │ Adelie │ Torgersen │ 45.8 │ 18.9 │ 197 │ 4150 │ male │ 2008 │ │ Adelie │ Biscoe │ 45.6 │ 20.3 │ 191 │ 4600 │ male │ 2009 │ │ Adelie │ Torgersen │ 44.1 │ 18.0 │ 210 │ 4000 │ male │ 2009 │ │ Gentoo │ Biscoe │ 46.1 │ 13.2 │ 211 │ 4500 │ female │ 2007 │ │ Gentoo │ Biscoe │ 50.0 │ 16.3 │ 230 │ 5700 │ male │ 2007 │ │ Gentoo │ Biscoe │ 48.7 │ 14.1 │ 210 │ 4450 │ female │ 2007 │ │ Gentoo │ Biscoe │ 50.0 │ 15.2 │ 218 │ 5700 │ male │ 2007 │ │ Gentoo │ Biscoe │ 47.6 │ 14.5 │ 215 │ 5400 │ male │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Modifying rows¶
Sometimes you want to modify the values in a column based on some condition.
In pandas, you would do something like df.loc[condition] = new_value
.
In Ibis though, remember that all expressions are immutable, so you need to
create a new table expression with the modified values. You do this using the
ifelse
method on boolean columns:
long_billed_penguins = penguins.bill_length_mm > 37.0
species_modified = long_billed_penguins.ifelse('wide', penguins.species)
penguins.mutate(species_modified=species_modified)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ species_modified ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ string │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼──────────────────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ wide │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ wide │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ wide │ │ Adelie │ Torgersen │ nan │ nan │ NULL │ NULL │ NULL │ 2007 │ Adelie │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ Adelie │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ wide │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ wide │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ wide │ │ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ Adelie │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ wide │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴──────────────────┘
Sorting rows¶
Sorting rows in Ibis uses a somewhat different API than in pandas. In pandas, you would use the
sort_values
method to order rows by values in specified columns. Ibis uses a method called
order_by
. To specify ascending or descending orders, pandas uses an ascending=
argument
to sort_values
that indicates the order for each sorting column. Ibis allows you to tag the
column name in the order_by
list as ascending or descending by wrapping it with ibis.asc
or
ibis.desc
.
First, let's ask Ibis for a pandas DataFrame
version of the penguin data:
df = penguins.to_pandas()
Here is an example of sorting a DataFrame
using two sort keys. One key is sorting in ascending
order and the other is in descending order.
df.sort_values(
['bill_length_mm', 'bill_depth_mm'], ascending=[True, False], na_position="first"
).head(5)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | None | 2007 |
271 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | None | 2009 |
142 | Adelie | Dream | 32.1 | 15.5 | 188.0 | 3050.0 | female | 2009 |
98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | female | 2008 |
70 | Adelie | Torgersen | 33.5 | 19.0 | 190.0 | 3600.0 | female | 2008 |
The same operation in Ibis would look like the following. Note that the index values of the
resulting DataFrame
start from zero and count up, whereas in the example above, they retain
their original index value. This is simply due to the fact that rows in tables don't necessarily
have a stable index in database backends, so the index is just generated on the result.
sorted = penguins.order_by(['bill_length_mm', ibis.desc('bill_depth_mm')]).head(5)
sorted
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Dream │ 32.1 │ 15.5 │ 188 │ 3050 │ female │ 2009 │ │ Adelie │ Dream │ 33.1 │ 16.1 │ 178 │ 2900 │ female │ 2008 │ │ Adelie │ Torgersen │ 33.5 │ 19.0 │ 190 │ 3600 │ female │ 2008 │ │ Adelie │ Dream │ 34.0 │ 17.1 │ 185 │ 3400 │ female │ 2008 │ │ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Aggregation¶
Aggregation in pandas is typically done by computing columns based on an aggregate function.
stats = [df.bill_depth_mm.sum(), df.bill_length_mm.mean()]
pd.DataFrame([stats], columns=['total_bill_depth', 'avg.bill_length'])
total_bill_depth | avg.bill_length | |
---|---|---|
0 | 5865.7 | 43.92193 |
In Ibis, you construct aggregate expressions then apply them to the table using the aggregate
method.
stats = [
penguins.bill_depth_mm.sum().name('total_bill_width'),
penguins.bill_length_mm.mean().name('avg_bill_length'),
]
agged = penguins.aggregate(stats)
agged
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ total_bill_width ┃ avg_bill_length ┃ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├──────────────────┼─────────────────┤ │ 5865.7 │ 43.92193 │ └──────────────────┴─────────────────┘
You can also combine both operations into one and pass the aggregate expressions using keyword parameters.
agged = penguins.aggregate(
total_bill_depth=penguins.bill_depth_mm.sum(),
avg_bill_length=penguins.bill_length_mm.mean(),
)
agged
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ total_bill_depth ┃ avg_bill_length ┃ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├──────────────────┼─────────────────┤ │ 5865.7 │ 43.92193 │ └──────────────────┴─────────────────┘
Group by¶
Using a similar API as above, aggregations can also be
done across groupings using the by=
parameter.
agged = penguins.aggregate(
by='species',
total_bill_depth=penguins.bill_depth_mm.sum(),
avg_bill_length=penguins.bill_length_mm.mean(),
)
agged
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ species ┃ total_bill_depth ┃ avg_bill_length ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ ├───────────┼──────────────────┼─────────────────┤ │ Adelie │ 2770.3 │ 38.791391 │ │ Gentoo │ 1842.8 │ 47.504878 │ │ Chinstrap │ 1252.6 │ 48.833824 │ └───────────┴──────────────────┴─────────────────┘
Alternatively, by groups can be computed using a grouped table.
agged = penguins.group_by('species').aggregate(
total_bill_depth=penguins.bill_depth_mm.sum(),
avg_bill_length=penguins.bill_length_mm.mean(),
)
agged
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ species ┃ total_bill_depth ┃ avg_bill_length ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ ├───────────┼──────────────────┼─────────────────┤ │ Adelie │ 2770.3 │ 38.791391 │ │ Gentoo │ 1842.8 │ 47.504878 │ │ Chinstrap │ 1252.6 │ 48.833824 │ └───────────┴──────────────────┴─────────────────┘
You can group over multiple columns too, and rename them if you want.
If you only need to aggregate over a single column, then you don't
need to use the .aggregate()
method.
penguins.group_by(["species", "sex"], location="island").body_mass_g.approx_median()
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ species ┃ sex ┃ location ┃ approx_median(body_mass_g) ┃ ┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ int64 │ ├─────────┼────────┼───────────┼────────────────────────────┤ │ Adelie │ male │ Torgersen │ 4000 │ │ Adelie │ female │ Torgersen │ 3400 │ │ Adelie │ NULL │ Torgersen │ 3588 │ │ Adelie │ female │ Biscoe │ 3375 │ │ Adelie │ male │ Biscoe │ 4000 │ │ Adelie │ female │ Dream │ 3400 │ │ Adelie │ male │ Dream │ 3988 │ │ Adelie │ NULL │ Dream │ 2975 │ │ Gentoo │ female │ Biscoe │ 4700 │ │ Gentoo │ male │ Biscoe │ 5500 │ │ … │ … │ … │ … │ └─────────┴────────┴───────────┴────────────────────────────┘
Instead of aggregating after a group by, you can also transform
the table so that the output table has the same number of rows
as the input table. This is analogous to the groupby().transform()
pattern in pandas. You can pass complex expressions to compute per-group:
# Calculate how much the mass of each penguin deviates from the mean
penguins.group_by(["species", "sex"]).mutate(
# This column isn't needed, but it makes it easier to see what's going on
mass_mean=penguins.body_mass_g.mean(),
mass_deviation=penguins.body_mass_g - penguins.body_mass_g.mean(),
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ mass_mean ┃ mass_deviation ┃ ┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ float64 │ float64 │ ├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼─────────────┼────────────────┤ │ Gentoo │ Biscoe │ 50.0 │ 16.3 │ 230 │ 5700 │ male │ 2007 │ 5484.836066 │ 215.163934 │ │ Gentoo │ Biscoe │ 49.9 │ 16.1 │ 213 │ 5400 │ male │ 2009 │ 5484.836066 │ -84.836066 │ │ Gentoo │ Biscoe │ 47.6 │ 14.5 │ 215 │ 5400 │ male │ 2007 │ 5484.836066 │ -84.836066 │ │ Gentoo │ Biscoe │ 46.7 │ 15.3 │ 219 │ 5200 │ male │ 2007 │ 5484.836066 │ -284.836066 │ │ Gentoo │ Biscoe │ 46.8 │ 15.4 │ 215 │ 5150 │ male │ 2007 │ 5484.836066 │ -334.836066 │ │ Gentoo │ Biscoe │ 49.0 │ 16.1 │ 216 │ 5550 │ male │ 2007 │ 5484.836066 │ 65.163934 │ │ Gentoo │ Biscoe │ 48.4 │ 14.6 │ 213 │ 5850 │ male │ 2007 │ 5484.836066 │ 365.163934 │ │ Gentoo │ Biscoe │ 49.3 │ 15.7 │ 217 │ 5850 │ male │ 2007 │ 5484.836066 │ 365.163934 │ │ Gentoo │ Biscoe │ 49.2 │ 15.2 │ 221 │ 6300 │ male │ 2007 │ 5484.836066 │ 815.163934 │ │ Gentoo │ Biscoe │ 48.7 │ 15.1 │ 222 │ 5350 │ male │ 2007 │ 5484.836066 │ -134.836066 │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴─────────────┴────────────────┘
NULL values¶
Ibis has first-class support for NULL
values. In pandas and numpy, NULL
values are
represented by NaN
. This can be confusing when working with numeric data,
since NaN
is also a valid floating point value (along with +/-inf
).
In Ibis, we try to be more precise: All data types are nullable, and we use
ibis.NA
to represent NULL
values, and all datatypes have a .isnull()
method.
For floating point values, we use different values for NaN
and +/-inf
, and there
are the additional methods .isnan()
and .isinf()
.
Dropping rows with NULL
s¶
Both pandas and Ibis allow you to drop rows from a table based on whether a set of columns
contains a NULL
value. This method is called dropna
in both packages. The common set
of parameters in the two are subset=
and how=
. The subset=
parameter indicates which
columns to inspect for NULL
values. The how=
parameter specifies whether 'any' or 'all'
of the specified columns must be NULL
in order for the row to be dropped.
no_null_peng = penguins.dropna(['bill_depth_mm', 'bill_length_mm'], how='any')
Filling NULL
values¶
Both pandas and Ibis allow you to fill NULL
values in a table. In Ibis, the replacement value can only
be a scalar value of a dictionary of values. If it is a dictionary, the keys of the dictionary specify
the column name for the value to apply to.
no_null_peng = penguins.fillna(dict(bill_depth_mm=0, bill_length_mm=0))
Replacing NULL
s¶
Both pandas and Ibis have fillna
methods which allow you to specify a replacement value
for NULL
values.
bill_length_no_nulls = penguins.bill_length_mm.fillna(0)
Type casts¶
Type casting in pandas is done using the astype
method on columns.
df.bill_depth_mm.astype(str)
0 18.7 1 17.4 2 18.0 3 nan 4 19.3 ... 339 19.8 340 18.1 341 18.2 342 19.0 343 18.7 Name: bill_depth_mm, Length: 344, dtype: object
In Ibis, you cast the column type using the cast
method.
penguins.bill_depth_mm.cast('int')
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Cast(bill_depth_mm, int64) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├────────────────────────────┤ │ 19 │ │ 17 │ │ 18 │ │ NULL │ │ 19 │ │ 21 │ │ 18 │ │ 20 │ │ 18 │ │ 20 │ │ … │ └────────────────────────────┘
Casted columns can be assigned back to the table using the mutate
method described earlier.
casted = penguins.mutate(
bill_depth_mm=penguins.bill_depth_mm.cast('int'),
bill_length_mm=penguins.bill_length_mm.cast('int'),
)
casted.schema()
ibis.Schema { species string island string bill_length_mm int64 bill_depth_mm int64 flipper_length_mm int64 body_mass_g int64 sex string year int64 }
Set membership¶
pandas set membership uses the in
and not in
operators such as 'a' in df.species
. Ibis uses
isin
and notin
methods. In addition to testing membership in a set, these methods allow you to
specify an else case to assign a value when the value isn't in the set.
penguins.species.value_counts()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ species ┃ species_count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ string │ int64 │ ├───────────┼───────────────┤ │ Adelie │ 152 │ │ Gentoo │ 124 │ │ Chinstrap │ 68 │ └───────────┴───────────────┘
refined = penguins.species.isin(['Adelie', 'Chinstrap'])
refined.value_counts()
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Contains(species) ┃ Contains(species)_count ┃ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ boolean │ int64 │ ├───────────────────┼─────────────────────────┤ │ True │ 220 │ │ False │ 124 │ └───────────────────┴─────────────────────────┘
Merging tables¶
While pandas uses the merge
method to combine data from multiple DataFrames
, Ibis uses the
join
method. They both have similar capabilities.
The biggest difference between Ibis' join
method and pandas' merge
method is that pandas only
accepts column names or index levels to join on, whereas Ibis can merge on expressions.
Here are some examples of merging using pandas.
df_left = pd.DataFrame(
[
['a', 1, 2],
['b', 3, 4],
['c', 4, 6],
],
columns=['name', 'x', 'y'],
)
df_right = pd.DataFrame(
[
['a', 100, 200],
['m', 300, 400],
['n', 400, 600],
],
columns=['name', 'x_100', 'y_100'],
)
df_left.merge(df_right, on='name')
name | x | y | x_100 | y_100 | |
---|---|---|---|---|---|
0 | a | 1 | 2 | 100 | 200 |
df_left.merge(df_right, on='name', how='outer')
name | x | y | x_100 | y_100 | |
---|---|---|---|---|---|
0 | a | 1.0 | 2.0 | 100.0 | 200.0 |
1 | b | 3.0 | 4.0 | NaN | NaN |
2 | c | 4.0 | 6.0 | NaN | NaN |
3 | m | NaN | NaN | 300.0 | 400.0 |
4 | n | NaN | NaN | 400.0 | 600.0 |
We can now convert DataFrames
to Ibis tables to do join
s.
t_left = ibis.memtable(df_left, name="t_left")
t_right = ibis.memtable(df_right, name='t_right')
t_left.join(t_right, t_left.name == t_right.name)
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ x_100 ┃ y_100 ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ int64 │ int64 │ ├────────┼───────┼───────┼───────┼───────┤ │ a │ 1 │ 2 │ 100 │ 200 │ └────────┴───────┴───────┴───────┴───────┘
Below is an outer join where missing values are filled with NaN
.
t_left.join(t_right, t_left.name == t_right.name, how='outer')
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ name_right ┃ x_100 ┃ y_100 ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ string │ int64 │ int64 │ ├────────┼───────┼───────┼────────────┼───────┼───────┤ │ a │ 1 │ 2 │ a │ 100 │ 200 │ │ b │ 3 │ 4 │ NULL │ NULL │ NULL │ │ c │ 4 │ 6 │ NULL │ NULL │ NULL │ │ NULL │ NULL │ NULL │ m │ 300 │ 400 │ │ NULL │ NULL │ NULL │ n │ 400 │ 600 │ └────────┴───────┴───────┴────────────┴───────┴───────┘
Concatenating tables¶
Concatenating DataFrame
s in pandas is done with the concat
top-level function. It takes multiple DataFrames
and concatenates the rows of one DataFrame
to the next. If the columns are mis-matched, it extends the
list of columns to include the full set of columns and inserts NaN
s and None
s into the missing values.
Concatenating tables in Ibis can only be done on tables with matching schemas. The concatenation is done
using the top-level union
function or the union
method on a table.
We'll demonstrate a pandas concat
first.
df_1 = pd.DataFrame(
[
['a', 1, 2],
['b', 3, 4],
['c', 4, 6],
],
columns=['name', 'x', 'y'],
)
df_2 = pd.DataFrame(
[
['a', 100, 200],
['m', 300, 400],
['n', 400, 600],
],
columns=['name', 'x', 'y'],
)
pd.concat([df_1, df_2])
name | x | y | |
---|---|---|---|
0 | a | 1 | 2 |
1 | b | 3 | 4 |
2 | c | 4 | 6 |
0 | a | 100 | 200 |
1 | m | 300 | 400 |
2 | n | 400 | 600 |
Now we can convert the DataFrame
s to Ibis tables and combine the tables using a union.
t_1 = ibis.memtable(df_1, name='t_1')
t_2 = ibis.memtable(df_2, name='t_2')
unioned = ibis.union(t_1, t_2)
unioned
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ │ c │ 4 │ 6 │ │ a │ 100 │ 200 │ │ m │ 300 │ 400 │ │ n │ 400 │ 600 │ └────────┴───────┴───────┘