Ibis for pandas Users¶
Much of the syntax and many of the operations in Ibis are inspired by the pandas DataFrame, however, the primary domain of Ibis is SQL so there are some differences in how they operate.
One primary
difference between Ibis tables and pandas DataFrames are that many
of the pandas DataFrame operations do in-place operations, whereas
Ibis table operations always return a new table expression.
import ibis
import pandas as pd
Note that we'll be using Ibis' interactive mode to automatically execute queries at
the end of each cell in this notebook. If you are using similar code in a program,
you will have to add .execute() to each operation that you want to evaluate.
ibis.options.interactive = True
We'll be using the pandas backend in Ibis in the examples below. First we'll create a simple DataFrame.
df = pd.DataFrame([
['a', 1, 2],
['b', 3, 4]
], columns=['one', 'two', 'three'])
df
| one | two | three | |
|---|---|---|---|
| 0 | a | 1 | 2 |
| 1 | b | 3 | 4 |
Now we can create an Ibis table from the above DataFrame.
t = ibis.pandas.connect({'t': df}).table('t')
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
Data types¶
The data types of columns in pandas are accessed using the dtypes attribute. This returns
a Series object.
df.dtypes
one object two int64 three int64 dtype: object
In Ibis, you use the schema method which returns an ibis.Schema object.
t.schema()
ibis.Schema {
one string
two int64
three int64
}
It is possible to convert the schema information to pandas data types using the to_pandas method, if needed.
t.schema().to_pandas()
[('one', dtype('O')), ('two', dtype('int64')), ('three', dtype('int64'))]
Table layout¶
In pandas, the layout of the table is contained in the shape attribute which contains the number
of rows and number of columns in a tuple. The number of columns in an Ibis table can be gotten
from the length of the schema.
len(t.schema())
3
To get the number of rows of a table, you use the count method.
t.count()
2
To mimic pandas' behavior, you would use the following code. Note that you need to use the execute method
after count to evaluate the expression returned by count.
(t.count().execute(), len(t.schema()))
(2, 3)
df.shape
(2, 3)
Subsetting columns¶
Selecting columns is very similar to in pandas. In fact, you can use the same syntax.
t[['one', 'two']]
┏━━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ ┡━━━━━━━━╇━━━━━━━┩ │ string │ int64 │ ├────────┼───────┤ │ a │ 1 │ │ b │ 3 │ └────────┴───────┘
However, since row-level indexing is not supported in Ibis, the inner list is not necessary.
t['one', 'two']
┏━━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ ┡━━━━━━━━╇━━━━━━━┩ │ string │ int64 │ ├────────┼───────┤ │ a │ 1 │ │ b │ 3 │ └────────┴───────┘
Selecting columns¶
Selecting columns is done using the same syntax as in pandas DataFrames. You can use either
the indexing syntax or attribute syntax.
t['one']
┏━━━━━━━━┓ ┃ one ┃ ┡━━━━━━━━┩ │ string │ ├────────┤ │ a │ │ b │ └────────┘
or:
t.one
┏━━━━━━━━┓ ┃ one ┃ ┡━━━━━━━━┩ │ string │ ├────────┤ │ a │ │ b │ └────────┘
Adding, removing, and modifying columns¶
Modifying the columns of an Ibis table is a bit different than doing the same operations in
a pandas DataFrame. This is primarily due to the fact that in-place operations are not
supported on Ibis tables. Each time you do a column modification to a table, a new table
expression is returned.
Adding columns¶
Adding columns is done through the mutate method.
mutated = t.mutate(new_col=t.three * 2)
mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┓ ┃ one ┃ two ┃ three ┃ new_col ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━┩ │ string │ int64 │ int64 │ int64 │ ├────────┼───────┼───────┼─────────┤ │ a │ 1 │ 2 │ 4 │ │ b │ 3 │ 4 │ 8 │ └────────┴───────┴───────┴─────────┘
Notice that the original table object remains unchanged. Only the mutated object that was returned
contains the new column.
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
It is also possible to create a column in isolation. This is similar to a Series in pandas.
In this situation, the name of the column must be added using the name method.
new_col = (t.three * 2).name('new_col')
new_col
┏━━━━━━━━━┓ ┃ new_col ┃ ┡━━━━━━━━━┩ │ int64 │ ├─────────┤ │ 4 │ │ 8 │ └─────────┘
You can then add this column to the table using a projection.
proj = t['one', 'two', new_col]
proj
┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓ ┃ one ┃ two ┃ new_col ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼─────────┤ │ a │ 1 │ 4 │ │ b │ 3 │ 8 │ └────────┴───────┴─────────┘
Removing columns¶
Removing a column is done using the drop method.
t.columns
['one', 'two', 'three']
subset = t.drop('two')
subset.columns
['one', 'three']
Multiple column names can also be given.
subset = t.drop(['one', 'two'])
subset.columns
/home/gil/github.com/ibis-project/ibis/ibis/expr/types/relations.py:761: FutureWarning: Passing a sequence of fields to `drop` is deprecated and will be removed in version 5.0, use `drop(*fields)` instead warnings.warn(
['three']
It is also possible to drop columns by selecting the columns you want to remain.
subset = t['two', 'three']
subset.columns
['two', 'three']
Modifying columns¶
Replacing existing columns is done using the mutate method just like adding columns. You simply
add a column of the same name to replace it.
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
mutated = t.mutate(two=t.two * 2)
mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 2 │ 2 │ │ b │ 6 │ 4 │ └────────┴───────┴───────┘
Renaming columns¶
In addition to replacing columns, you can simply rename them as well. This is done with the relabel method
which takes a dictionary containing the name mappings.
relabeled = t.relabel(dict(
one='a',
two='b',
))
relabeled
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ a ┃ b ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
Selecting rows¶
There are several methods that can be used to select rows of data in various ways. These are described in the sections below. We'll use the ubiquitous iris dataset for these examples.
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df.head()
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Create an Ibis table from the DataFrame above.
t = ibis.pandas.connect({'t': df}).table('t')
t
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │ │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │ │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │ │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │ │ 5.0 │ 3.4 │ 1.5 │ 0.2 │ setosa │ │ 4.4 │ 2.9 │ 1.4 │ 0.2 │ setosa │ │ 4.9 │ 3.1 │ 1.5 │ 0.1 │ setosa │ │ … │ … │ … │ … │ … │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Head, tail and limit¶
The head method works the same ways as in pandas. Note that some Ibis backends may not have an
inherent ordering of their rows and using head may not return deterministic results. In those
cases, you can use sorting before calling head to ensure a stable result.
t.head(5)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │ │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │ │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
However, the tail method is not implemented since it is not supported in all databases.
It is possible to emulate the tail method if you use sorting in your table to do a
reverse sort then use the head method to retrieve the "top" rows.
Another way to limit the number of retrieved rows is using the limit method. The following will return
the same result as head(5). This is often used in conjunction with other filtering techniques that we
will cover later.
t.limit(5)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │ │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │ │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
The starting position of the returned rows can be specified using the offset parameter.
t.limit(5, offset=4)
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │ │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │ │ 5.0 │ 3.4 │ 1.5 │ 0.2 │ setosa │ │ 4.4 │ 2.9 │ 1.4 │ 0.2 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Filtering rows¶
In addition to simply limiting the number of rows that are returned, it is possible to filter the rows using expressions. Expressions are constructed very similarly to the way they are in pandas. Ibis expressions are constructed from operations on colunms in a table which return a boolean result. This result is then used to filter the table.
expr = t.sepal_width > 3.8
expr
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Greater(sepal_width, 3.8) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ boolean │ ├───────────────────────────┤ │ False │ │ False │ │ False │ │ False │ │ False │ │ True │ │ False │ │ False │ │ False │ │ False │ │ … │ └───────────────────────────┘
We can evaluate the value counts to see how many rows we will expect to get back after filtering.
expr.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ ┃ Greater(sepal_width, 3.8) ┃ count ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ │ boolean │ int64 │ ├───────────────────────────┼───────┤ │ False │ 144 │ │ True │ 6 │ └───────────────────────────┴───────┘
Now we apply the filter to the table. Since there are 6 True values in the expression, we should get 6 rows back.
filtered = t[expr]
filtered
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 5.8 │ 4.0 │ 1.2 │ 0.2 │ setosa │ │ 5.7 │ 4.4 │ 1.5 │ 0.4 │ setosa │ │ 5.4 │ 3.9 │ 1.3 │ 0.4 │ setosa │ │ 5.2 │ 4.1 │ 1.5 │ 0.1 │ setosa │ │ 5.5 │ 4.2 │ 1.4 │ 0.2 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Of course, the filtering expression can be applied inline as well.
filtered = t[t.sepal_width > 3.8]
filtered
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 5.8 │ 4.0 │ 1.2 │ 0.2 │ setosa │ │ 5.7 │ 4.4 │ 1.5 │ 0.4 │ setosa │ │ 5.4 │ 3.9 │ 1.3 │ 0.4 │ setosa │ │ 5.2 │ 4.1 │ 1.5 │ 0.1 │ setosa │ │ 5.5 │ 4.2 │ 1.4 │ 0.2 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Multiple filtering expressions can be combined into a single expression or chained onto existing table expressions.
filtered = t[(t.sepal_width > 3.8) & (t.sepal_length > 5.5)]
filtered
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.8 │ 4.0 │ 1.2 │ 0.2 │ setosa │ │ 5.7 │ 4.4 │ 1.5 │ 0.4 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
The code above will return the same rows as the code below.
filtered = t[t.sepal_width > 3.8][t.sepal_length > 5.5]
filtered
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.8 │ 4.0 │ 1.2 │ 0.2 │ setosa │ │ 5.7 │ 4.4 │ 1.5 │ 0.4 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Aggregation has not been discussed yet, but aggregate values can be used in expressions to return things such as all of the rows in a data set where the value in a column is greater than the mean.
filtered = t[t.sepal_width > t.sepal_width.mean()]
filtered
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │ │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │ │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │ │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │ │ 5.0 │ 3.4 │ 1.5 │ 0.2 │ setosa │ │ 4.9 │ 3.1 │ 1.5 │ 0.1 │ setosa │ │ 5.4 │ 3.7 │ 1.5 │ 0.2 │ setosa │ │ 4.8 │ 3.4 │ 1.6 │ 0.2 │ setosa │ │ … │ … │ … │ … │ … │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Sorting rows¶
Sorting rows in Ibis uses a somewhat different API than in pandas. In pandas, you would use the
sort_values method to order rows by values in specified columns. Ibis uses a method called
sort_by. To specify ascending or descending orders, pandas uses an ascending= argument
to sort_values that indicates the order for each sorting column. Ibis allows you to tag the
column name in the sort_by list as ascending or descending by wrapping it with ibis.asc or
ibis.desc.
Here is an example of sorting a DataFrame using two sort keys. One key is sorting in ascending
order and the other is in descending order.
df.sort_values(['sepal_length', 'sepal_width'], ascending=[True, False]).head(5)
| sepal_length | sepal_width | petal_length | petal_width | species | |
|---|---|---|---|---|---|
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
The same operation in Ibis would look like the following. Note that the index values of the
resulting DataFrame start from zero and count up, whereas in the example above, they retain
their original index value. This is simply due to the fact that rows in tables don't necessarily
have a stable index in database backends, so the index is just generated on the result.
sorted = t.sort_by(['sepal_length', ibis.desc('sepal_width')]).head(5)
sorted
/home/gil/github.com/ibis-project/ibis/ibis/expr/types/relations.py:168: FutureWarning: `Table.sort_by` is deprecated and will be removed in 5.0, use `Table.order_by` instead warnings.warn(
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ sepal_length ┃ sepal_width ┃ petal_length ┃ petal_width ┃ species ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩ │ float64 │ float64 │ float64 │ float64 │ string │ ├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤ │ 4.3 │ 3.0 │ 1.1 │ 0.1 │ setosa │ │ 4.4 │ 3.2 │ 1.3 │ 0.2 │ setosa │ │ 4.4 │ 3.0 │ 1.3 │ 0.2 │ setosa │ │ 4.4 │ 2.9 │ 1.4 │ 0.2 │ setosa │ │ 4.5 │ 2.3 │ 1.3 │ 0.3 │ setosa │ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
Aggregation¶
Aggregation in pandas is typically done by computing columns based on an aggregate function.
stats = [df.sepal_width.sum(), df.sepal_length.mean()]
pd.DataFrame([stats], columns=['total_sepal_width', 'avg.sepal_length'])
| total_sepal_width | avg.sepal_length | |
|---|---|---|
| 0 | 458.6 | 5.843333 |
In Ibis, you construct aggregate expressions then apply them to the table using the aggregate method.
stats = [t.sepal_width.sum().name('total_sepal_width'), t.sepal_length.mean().name('avg_sepal_length')]
agged = t.aggregate(stats)
agged
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ total_sepal_width ┃ avg_sepal_length ┃ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├───────────────────┼──────────────────┤ │ 458.6 │ 5.843333 │ └───────────────────┴──────────────────┘
You can also combine both operations into one and pass the aggregate expressions using keyword parameters.
agged = t.aggregate(
total_sepal_width=t.sepal_width.sum(),
avg_sepal_length=t.sepal_length.mean(),
)
agged
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ total_sepal_width ┃ avg_sepal_length ┃ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├───────────────────┼──────────────────┤ │ 458.6 │ 5.843333 │ └───────────────────┴──────────────────┘
Group by¶
Aggregations can also be done across groupings using the by= parameter.
agged = t.aggregate(
by='species',
total_sepal_width=t.sepal_width.sum(),
avg_sepal_length=t.sepal_length.mean(),
)
agged
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ species ┃ total_sepal_width ┃ avg_sepal_length ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ ├────────────┼───────────────────┼──────────────────┤ │ setosa │ 171.4 │ 5.006 │ │ versicolor │ 138.5 │ 5.936 │ │ virginica │ 148.7 │ 6.588 │ └────────────┴───────────────────┴──────────────────┘
Alternatively, by groups can be computed using a grouped table.
agged = t.group_by('species').aggregate(
total_sepal_width=t.sepal_width.sum(),
avg_sepal_length=t.sepal_length.mean(),
)
agged
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ species ┃ total_sepal_width ┃ avg_sepal_length ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ ├────────────┼───────────────────┼──────────────────┤ │ setosa │ 171.4 │ 5.006 │ │ versicolor │ 138.5 │ 5.936 │ │ virginica │ 148.7 │ 6.588 │ └────────────┴───────────────────┴──────────────────┘
Dropping rows with NULLs¶
Both pandas and Ibis allow you to drop rows from a table based on whether a set of columns
contains a NULL value. This method is called dropna in both packages. The common set
of parameters in the two are subset= and how=. The subset= parameter indicates which
columns to inspect for NULL values. The how= parameter specifies whether 'any' or 'all'
of the specified columns must be NULL in order for the row to be dropped.
no_null_t = t.dropna(['sepal_width', 'sepal_length'], how='any')
Filling NULL values¶
Both pandas and Ibis allow you to fill NULL values in a table. In Ibis, the replacement value can only
be a scalar value of a dictionary of values. If it is a dictionary, the keys of the dictionary specify
the column name for the value to apply to.
no_null_t = t.fillna(dict(sepal_width=0, sepal_length=0))
Common column expressions¶
See the full API documentation for all of the available value methods and tools for creating value expressions. We mention a few common ones here as they relate to common SQL queries.
Type casts¶
Type casting in pandas is done using the astype method on columns.
df.sepal_width.astype(str)
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
...
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: object
In Ibis, you cast the column type using the cast method.
t.sepal_width.cast('int')
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Cast(sepal_width, int64) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├──────────────────────────┤ │ 3 │ │ 3 │ │ 3 │ │ 3 │ │ 3 │ │ 3 │ │ 3 │ │ 3 │ │ 2 │ │ 3 │ │ … │ └──────────────────────────┘
Casted columns can be assigned back to the table using the mutate method described earlier.
casted = t.mutate(
sepal_width=t.sepal_width.cast('int'),
sepal_length=t.sepal_length.cast('int'),
)
casted.schema()
ibis.Schema {
sepal_length int64
sepal_width int64
petal_length float64
petal_width float64
species string
}
Replacing NULLs¶
Both pandas and Ibis have fillna methods which allow you to specify a replacement value
for NULL values.
sepal_length_no_nulls = t.sepal_length.fillna(0)
Set membership¶
pandas set membership uses the in and not in operators such as 'a' in df.species. Ibis uses
isin and notin methods. In addition to testing membership in a set, these methods allow you to
specify an else case to assign a value when the value isn't in the set.
t.species.value_counts()
┏━━━━━━━━━━━━┳━━━━━━━┓ ┃ species ┃ count ┃ ┡━━━━━━━━━━━━╇━━━━━━━┩ │ string │ int64 │ ├────────────┼───────┤ │ setosa │ 50 │ │ versicolor │ 50 │ │ virginica │ 50 │ └────────────┴───────┘
refined = t.species.isin(['versicolor', 'virginica'])
refined.value_counts()
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ ┃ Contains(species) ┃ count ┃ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ │ boolean │ int64 │ ├───────────────────┼───────┤ │ False │ 50 │ │ True │ 100 │ └───────────────────┴───────┘
Merging tables¶
While pandas uses the merge method to combine data from multiple DataFrames, Ibis uses the
join method. They both have similar capabilities. The signature for the join method in Ibis
is: join(right, predicates=(), how='inner', *, suffixes=('_x', '_y')). The merge method on
pandas' DataFrame allows many more parameters, but the signature with the corresponding
parameters would be: join(right, on=(), how='inner', *, suffixes=('_x', '_y')). The valid values
of the how= parameter will vary depending on the backend, but common values are 'inner', 'outer',
'left', and 'right'.
The biggest difference between Ibis' join method and pandas' merge method is that pandas only
accepts column names or index levels to join on, whereas Ibis can merge on expressions.
Here are some examples of merging using pandas.
df_left = pd.DataFrame([
['a', 1, 2],
['b', 3, 4],
['c', 4, 6],
], columns=['name', 'x', 'y'])
df_right = pd.DataFrame([
['a', 100, 200],
['m', 300, 400],
['n', 400, 600],
], columns=['name', 'x_100', 'y_100'])
df_left.merge(df_right, on='name')
| name | x | y | x_100 | y_100 | |
|---|---|---|---|---|---|
| 0 | a | 1 | 2 | 100 | 200 |
df_left.merge(df_right, on='name', how='outer')
| name | x | y | x_100 | y_100 | |
|---|---|---|---|---|---|
| 0 | a | 1.0 | 2.0 | 100.0 | 200.0 |
| 1 | b | 3.0 | 4.0 | NaN | NaN |
| 2 | c | 4.0 | 6.0 | NaN | NaN |
| 3 | m | NaN | NaN | 300.0 | 400.0 |
| 4 | n | NaN | NaN | 400.0 | 600.0 |
We can now convert DataFrames to Ibis tables to do joins.
pd_ibis = ibis.pandas.connect({'t_left': df_left, 't_right': df_right})
t_left = pd_ibis.table('t_left')
t_right = pd_ibis.table('t_right')
t_left.join(t_right, t_left.name == t_right.name)
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ x_100 ┃ y_100 ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ int64 │ int64 │ ├────────┼───────┼───────┼───────┼───────┤ │ a │ 1 │ 2 │ 100 │ 200 │ └────────┴───────┴───────┴───────┴───────┘
You may notice that in Ibis joins, even if the predicate is an equality expression and both tables have the same column name, you will still get multiple output columns with suffixes added. This may change in a future version to match the pandas behavior.
Below is an outer join where missing values are filled with NaN.
t_left.join(t_right, t_left.name == t_right.name, how='outer')
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name_x ┃ x ┃ y ┃ name_y ┃ x_100 ┃ y_100 ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ string │ int64 │ int64 │ ├────────┼───────┼───────┼────────┼───────┼───────┤ │ a │ 1 │ 2 │ a │ 100 │ 200 │ │ b │ 3 │ 4 │ b │ ∅ │ ∅ │ │ c │ 4 │ 6 │ c │ ∅ │ ∅ │ │ m │ ∅ │ ∅ │ m │ 300 │ 400 │ │ n │ ∅ │ ∅ │ n │ 400 │ 600 │ └────────┴───────┴───────┴────────┴───────┴───────┘
Concatenating tables¶
Concatenating DataFrames in pandas is done with the concat top-level function. It takes multiple DataFrames
and concatenates the rows of one DataFrame to the next. If the columns are mis-matched, it extends the
list of columns to include the full set of columns and inserts NaNs and Nones into the missing values.
Concatenating tables in Ibis can only be done on tables with matching schemas. The concatenation is done
using the top-level union function or the union method on a table.
We'll demonstrate a pandas concat first.
df_1 = pd.DataFrame([
['a', 1, 2],
['b', 3, 4],
['c', 4, 6],
], columns=['name', 'x', 'y'])
df_2 = pd.DataFrame([
['a', 100, 200],
['m', 300, 400],
['n', 400, 600],
], columns=['name', 'x', 'y'])
pd.concat([df_1, df_2])
| name | x | y | |
|---|---|---|---|
| 0 | a | 1 | 2 |
| 1 | b | 3 | 4 |
| 2 | c | 4 | 6 |
| 0 | a | 100 | 200 |
| 1 | m | 300 | 400 |
| 2 | n | 400 | 600 |
Now we can convert the DataFrames to Ibis tables and combine the tables using a union.
pd_ibis = ibis.pandas.connect({'t_1': df_1, 't_2': df_2})
t_1 = pd_ibis.table('t_1')
t_2 = pd_ibis.table('t_2')
unioned = ibis.union(t_1, t_2)
unioned
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ │ c │ 4 │ 6 │ │ a │ 100 │ 200 │ │ m │ 300 │ 400 │ │ n │ 400 │ 600 │ └────────┴───────┴───────┘