Install¶
If you don't have ibis
installed, you can install it from:
PyPI¶
python -m pip install 'ibis-framework[duckdb]'
conda-forge¶
conda install -c conda-forge ibis-framework
You can check out the install page for more detailed instructions.
Ibis for SQL users¶
Ibis provides a full-featured replacement for SQL
SELECT
queries, but expressed with Python code that is:
- Type-checked and validated as you go. No more debugging cryptic database errors; Ibis catches your mistakes right away.
- Easier to write. Pythonic function calls with tab completion in IPython.
- More composable. Break complex queries down into easier-to-digest pieces
- Easier to reuse. Mix and match Ibis snippets to create expressions tailored for your analysis.
We intend for all SELECT
queries to be fully portable to Ibis.
Coverage of other DDL statements (e.g. CREATE TABLE
or INSERT
) may
vary from engine to engine.
Note: If you find any SQL idioms or use cases in your work that are not represented here, please reach out so we can add more to this guide!
import ibis
ibis.options.sql.default_limit = None
Projections: select/add/remove columns¶
All tables in Ibis are immutable. To select a subset of a table's columns, or to add new columns, you must produce a new table by means of a projection.
t = ibis.table(
[('one', 'string'), ('two', 'float'), ('three', 'int32')], 'my_data'
)
t
UnboundTable: my_data one string two float64 three int32
In SQL, you might write something like:
SELECT two, one
FROM my_data
In Ibis, this is
proj = t['two', 'one']
or
proj = t.select(['two', 'one'])
This generates the expected SQL:
ibis.show_sql(proj)
SELECT t0.two, t0.one FROM my_data AS t0
What about adding new columns? To form a valid projection, all column expressions must be named. Let's look at the SQL:
SELECT two, one, three * 2 AS new_col
FROM my_data
The last expression is written:
new_col = (t.three * 2).name('new_col')
Now, we have:
proj = t['two', 'one', new_col]
ibis.show_sql(proj)
SELECT t0.two, t0.one, t0.three * CAST(2 AS TINYINT) AS new_col FROM my_data AS t0
mutate
: Add or modify columns easily¶
Since adding new columns or modifying existing columns is so common,
there is a convenience method mutate
:
mutated = t.mutate(new_col=t.three * 2)
Notice that using the name
was not necessary here because we're using
Python keywords to provide the name. Indeed:
ibis.show_sql(mutated)
SELECT t0.one, t0.two, t0.three, t0.three * CAST(2 AS TINYINT) AS new_col FROM my_data AS t0
If you modify an existing column with mutate
it will list out all the
other columns:
mutated = t.mutate(two=t.two * 2)
ibis.show_sql(mutated)
SELECT t0.one, t0.two * CAST(2 AS TINYINT) AS two, t0.three FROM my_data AS t0
SELECT *
equivalent¶
Especially in combination with relational joins, it's convenient to be
able to select all columns in a table using the SELECT *
construct. To
do this, use the table expression itself in a projection:
proj = t[t]
ibis.show_sql(proj)
SELECT t0.one, t0.two, t0.three FROM my_data AS t0
This is how mutate
is implemented. The example above
t.mutate(new_col=t.three * 2)
can be written as a normal projection:
proj = t[t, new_col]
ibis.show_sql(proj)
SELECT t0.one, t0.two, t0.three, t0.three * CAST(2 AS TINYINT) AS new_col FROM my_data AS t0
Let's consider a table we might wish to join with t
:
t2 = ibis.table([('key', 'string'), ('value', 'float')], 'dim_table')
Now let's take the SQL:
SELECT t0.*, t0.two - t1.value AS diff
FROM my_data t0
INNER JOIN dim_table t1
ON t0.one = t1.key
To write this with Ibis, it is:
diff = (t.two - t2.value).name('diff')
joined = t.join(t2, t.one == t2.key)[t, diff]
And verify the generated SQL:
ibis.show_sql(joined)
SELECT t0.one, t0.two, t0.three, t0.two - t1.value AS diff FROM my_data AS t0 JOIN dim_table AS t1 ON t0.one = t1.key
Using functions in projections¶
If you pass a function instead of a string or Ibis expression in any projection context, it will be invoked with the "parent" table as its argument. This can help significantly when [composing complex operations. Consider this SQL:
SELECT one, avg(abs(the_sum)) AS mad
FROM (
SELECT one, three, sum(two) AS the_sum
FROM my_data
GROUP BY 1, 2
) t0
GROUP BY 1
This can be written as one chained expression:
expr = (
t.group_by(['one', 'three'])
.aggregate(the_sum=t.two.sum())
.group_by('one')
.aggregate(mad=lambda x: x.the_sum.abs().mean())
)
Indeed:
ibis.show_sql(expr)
SELECT t0.one, AVG(ABS(t0.the_sum)) AS mad FROM ( SELECT t1.one AS one, t1.three AS three, SUM(t1.two) AS the_sum FROM my_data AS t1 GROUP BY 1, 2 ) AS t0 GROUP BY 1
Filtering / WHERE
¶
You can add filter clauses to a table expression either by indexing with
[]
(like pandas) or use the filter
method:
filtered = t[t.two > 0]
ibis.show_sql(filtered)
SELECT t0.one, t0.two, t0.three FROM my_data AS t0 WHERE t0.two > CAST(0 AS TINYINT)
filter
can take a list of expressions, which must all be satisfied for
a row to be included in the result:
filtered = t.filter([t.two > 0, t.one.isin(['A', 'B'])])
ibis.show_sql(filtered)
SELECT t0.one, t0.two, t0.three FROM my_data AS t0 WHERE t0.two > CAST(0 AS TINYINT) AND t0.one IN ('A', 'B')
To compose boolean expressions with AND
or OR
, use the respective
&
and |
operators:
cond = (t.two < 0) | ((t.two > 0) | t.one.isin(['A', 'B']))
filtered = t[cond]
ibis.show_sql(filtered)
SELECT t0.one, t0.two, t0.three FROM my_data AS t0 WHERE t0.two < CAST(0 AS TINYINT) OR t0.two > CAST(0 AS TINYINT) OR t0.one IN ('A', 'B')
Aggregation / GROUP BY
¶
To aggregate a table, you need:
- Zero or more grouping expressions (these can be column names)
- One or more aggregation expressions
Let's look at the aggregate
method on tables:
stats = [t.two.sum().name('total_two'), t.three.mean().name('avg_three')]
agged = t.aggregate(stats)
If you don't use any group expressions, the result will have a single row with your statistics of interest:
agged.schema()
ibis.Schema { total_two float64 avg_three float64 }
ibis.show_sql(agged)
SELECT SUM(t0.two) AS total_two, AVG(t0.three) AS avg_three FROM my_data AS t0
To add groupings, use either the by
argument of aggregate
or use the
group_by
construct:
agged2 = t.aggregate(stats, by='one')
agged3 = t.group_by('one').aggregate(stats)
ibis.show_sql(agged3)
SELECT t0.one, SUM(t0.two) AS total_two, AVG(t0.three) AS avg_three FROM my_data AS t0 GROUP BY 1
Non-trivial grouping keys¶
You can use any expression (or function, like in projections) deriving from the table you are aggregating. The only constraint is that the expressions must be named. Let's look at an example:
events = ibis.table(
[('ts', 'timestamp'), ('event_type', 'int32'), ('session_id', 'int64')],
name='web_events',
)
Suppose we wanted to total up event types by year and month:
keys = [events.ts.year().name('year'), events.ts.month().name('month')]
sessions = events.session_id.nunique()
stats = events.group_by(keys).aggregate(
total=events.count(), sessions=sessions
)
Now we have:
ibis.show_sql(stats)
SELECT CAST(EXTRACT(year FROM t0.ts) AS SMALLINT) AS year, CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) AS month, COUNT(*) AS total, COUNT(DISTINCT t0.session_id) AS sessions FROM web_events AS t0 GROUP BY 1, 2
Aggregates considering table subsets¶
In analytics is it common to compare statistics from different subsets of a table. Let's consider a dataset containing people's name, age, gender, and nationality:
pop = ibis.table(
[
('name', 'string'),
('country', 'string'),
('gender', 'string'),
('age', 'int16'),
],
name='population',
)
Now, suppose you wanted to know for each country:
- Average overall age
- Average male age
- Average female age
- Total number of persons
In SQL, you may write:
SELECT country,
count(*) AS num_persons,
AVG(age) AS avg_age
AVG(CASE WHEN gender = 'M'
THEN age
ELSE NULL
END) AS avg_male,
AVG(CASE WHEN gender = 'F'
THEN age
ELSE NULL
END) AS avg_female,
FROM population
GROUP BY 1
Ibis makes this much simpler by giving you where
option in aggregation
functions:
expr = pop.group_by('country').aggregate(
num_persons=pop.count(),
avg_age=pop.age.mean(),
avg_male=pop.age.mean(where=pop.gender == 'M'),
avg_female=pop.age.mean(where=pop.gender == 'F'),
)
This indeed generates the correct SQL. Note that SQL engines handle
NULL
values differently in aggregation functions, but Ibis will write
the SQL expression that is correct for your query engine.
ibis.show_sql(expr)
SELECT t0.country, COUNT(*) AS num_persons, AVG(t0.age) AS avg_age, AVG(t0.age) FILTER(WHERE t0.gender = 'M') AS avg_male, AVG(t0.age) FILTER(WHERE t0.gender = 'F') AS avg_female FROM population AS t0 GROUP BY 1
count(*)
convenience: size()
¶
Computing group frequencies is so common that, like pandas, we have a
method size
that is a shortcut for the count(*)
idiom:
freqs = events.group_by(keys).size()
ibis.show_sql(freqs)
SELECT CAST(EXTRACT(year FROM t0.ts) AS SMALLINT) AS year, CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) AS month, COUNT(*) AS "CountStar(web_events)" FROM web_events AS t0 GROUP BY 1, 2
Frequency table convenience: value_counts
¶
Consider the SQL idiom:
SELECT some_column_expression, count(*)
FROM table
GROUP BY 1
This is so common that, like pandas, there is a generic array method
value_counts
which does this for us:
expr = events.ts.year().value_counts()
ibis.show_sql(expr)
SELECT t0."ExtractYear(ts)", COUNT(*) AS "ExtractYear(ts)_count" FROM ( SELECT CAST(EXTRACT(year FROM t1.ts) AS SMALLINT) AS "ExtractYear(ts)" FROM web_events AS t1 ) AS t0 GROUP BY 1
HAVING
clause¶
The SQL HAVING
clause enables you to filter the results of an
aggregation based on some group-wise condition holding true. For
example, suppose we wanted to limit our analysis to groups containing at
least 1000 observations:
SELECT one, sum(two) AS total
FROM my_data
GROUP BY 1
HAVING count(*) >= 1000
With Ibis, you can do:
expr = (
t.group_by('one')
.having(t.count() >= 1000)
.aggregate(t.two.sum().name('total'))
)
ibis.show_sql(expr)
SELECT t0.one, SUM(t0.two) AS total FROM my_data AS t0 GROUP BY 1 HAVING COUNT(*) >= CAST(1000 AS SMALLINT)
Sorting / ORDER BY
¶
To sort a table, use the order_by
method along with either column names
or expressions that indicate the sorting keys:
sorted = events.order_by([events.ts.year(), events.ts.month()])
ibis.show_sql(sorted)
SELECT t0.ts, t0.event_type, t0.session_id FROM web_events AS t0 ORDER BY CAST(EXTRACT(year FROM t0.ts) AS SMALLINT), CAST(EXTRACT(month FROM t0.ts) AS SMALLINT)
The default for sorting is in ascending order. To reverse the sort
direction of any key, either wrap it in ibis.desc
or pass a tuple with
False
as the second value:
sorted = events.order_by(
[ibis.desc('event_type'), (events.ts.month(), False)]
).limit(100)
ibis.show_sql(sorted)
SELECT t0.ts, t0.event_type, t0.session_id FROM web_events AS t0 ORDER BY t0.event_type DESC, CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) DESC LIMIT 100
/home/gil/github.com/ibis-project/ibis/ibis/expr/types/relations.py:1268: FutureWarning: `table.order_by((key, True)) and table.order_by((key, False)) syntax` is deprecated as of v6.0, removed in v7.0; Use ibis.desc(key) or ibis.asc(key) instead util.warn_deprecated(
LIMIT
and OFFSET
¶
This one is easy. The table limit
function truncates a table to the
indicates number of rows. So if you only want the first 1000 rows (which
may not be deterministic depending on the SQL engine), you can do:
limited = t.limit(1000)
ibis.show_sql(limited)
SELECT t0.one, t0.two, t0.three FROM my_data AS t0 LIMIT 1000
The offset
option in limit
skips rows. So if you wanted rows 11
through 20, you could do:
limited = t.limit(10, offset=10)
ibis.show_sql(limited)
SELECT t0.one, t0.two, t0.three FROM my_data AS t0 LIMIT 10 OFFSET 10
Common column expressions¶
See the full API documentation <api>
{.interpreted-text role="ref"} for
all of the available value methods and tools for creating value
expressions. We mention a few common ones here as they relate to common
SQL queries.
Type casts¶
Ibis's type system is independent of any SQL system. You cast Ibis expressions from one Ibis type to another. For example:
expr = t.mutate(date=t.one.cast('timestamp'), four=t.three.cast('float32'))
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, CAST(t0.one AS TIMESTAMP) AS date, CAST(t0.three AS REAL) AS four FROM my_data AS t0
CASE
statements¶
SQL dialects typically support one or more kind of CASE
statements.
The first is the simple case that compares against exact values of an
expression.
CASE expr
WHEN value_1 THEN result_1
WHEN value_2 THEN result_2
ELSE default
END
Value expressions in Ibis have a case
method that allows us to emulate
these semantics:
case = (
t.one.cast('timestamp')
.year()
.case()
.when(2015, 'This year')
.when(2014, 'Last year')
.else_('Earlier')
.end()
)
expr = t.mutate(year_group=case)
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, CASE CAST(EXTRACT(year FROM CAST(t0.one AS TIMESTAMP)) AS SMALLINT) WHEN CAST(2015 AS SMALLINT) THEN 'This year' WHEN CAST(2014 AS SMALLINT) THEN 'Last year' ELSE 'Earlier' END AS year_group FROM my_data AS t0
The more general case is that of an arbitrary list of boolean expressions and result values:
CASE
WHEN boolean_expr1 THEN result_1
WHEN boolean_expr2 THEN result_2
WHEN boolean_expr3 THEN result_3
ELSE default
END
To do this, use ibis.case
:
case = (
ibis.case()
.when(t.two < 0, t.three * 2)
.when(t.two > 1, t.three)
.else_(t.two)
.end()
)
expr = t.mutate(cond_value=case)
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, CASE WHEN ( t0.two < CAST(0 AS TINYINT) ) THEN t0.three * CAST(2 AS TINYINT) WHEN ( t0.two > CAST(1 AS TINYINT) ) THEN t0.three ELSE t0.two END AS cond_value FROM my_data AS t0
There are several places where Ibis builds cases for you in a simplified
way. One example is the ifelse
function:
switch = (t.two < 0).ifelse('Negative', 'Non-Negative')
expr = t.mutate(group=switch)
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, CASE WHEN ( t0.two < CAST(0 AS TINYINT) ) THEN 'Negative' ELSE 'Non-Negative' END AS "group" FROM my_data AS t0
Using NULL
in expressions¶
To use NULL
in an expression, either use the special ibis.NA
value
or ibis.null()
:
pos_two = (t.two > 0).ifelse(t.two, ibis.NA)
expr = t.mutate(two_positive=pos_two)
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, CASE WHEN ( t0.two > CAST(0 AS TINYINT) ) THEN t0.two ELSE NULL END AS two_positive FROM my_data AS t0
Set membership: IN
/ NOT IN
¶
Let's look again at the population dataset. Suppose you wanted to combine the United States and Canada data into a "North America" category. Here would be some SQL to do it:
CASE
WHEN upper(country) IN ('UNITED STATES', 'CANADA')
THEN 'North America'
ELSE country
END AS refined_group
The Ibis equivalent of IN
is the isin
method. So we can write:
refined = (
pop.country.upper()
.isin(['UNITED STATES', 'CANADA'])
.ifelse('North America', pop.country)
)
expr = pop.mutate(refined_group=refined)
ibis.show_sql(expr)
SELECT t0.name, t0.country, t0.gender, t0.age, CASE WHEN ( UPPER(t0.country) IN ('UNITED STATES', 'CANADA') ) THEN 'North America' ELSE t0.country END AS refined_group FROM population AS t0
The opposite of isin
is notin
.
Constant and literal expressions¶
Consider a SQL expression like:
'foo' IN (column1, column2)
which is equivalent to
column1 = 'foo' OR column2 = 'foo'
To build expressions off constant values, you must first convert the
value (whether a Python string or number) to an Ibis expression using
ibis.literal
:
t3 = ibis.table(
[('column1', 'string'), ('column2', 'string'), ('column3', 'float')],
'data',
)
value = ibis.literal('foo')
Once you've done this, you can use the literal expression like any other array or scalar expression:
has_foo = value.isin([t3.column1, t3.column2])
expr = t3.mutate(has_foo=has_foo)
ibis.show_sql(expr)
SELECT t0.column1, t0.column2, t0.column3, 'foo' IN (t0.column1, t0.column2) AS has_foo FROM data AS t0
In many other situations, you can use constants without having to use
ibis.literal
. For example, we could add a column containing nothing
but the number 5 like so:
expr = t3.mutate(number5=5)
ibis.show_sql(expr)
SELECT t0.column1, t0.column2, t0.column3, CAST(5 AS TINYINT) AS number5 FROM data AS t0
IS NULL
and IS NOT NULL
¶
These are simple: use the isnull
and notnull
functions respectively,
which yield boolean arrays:
indic = t.two.isnull().ifelse('valid', 'invalid')
expr = t.mutate(is_valid=indic)
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, CASE WHEN ( t0.two IS NULL ) THEN 'valid' ELSE 'invalid' END AS is_valid FROM my_data AS t0
agged = (
expr[expr.one.notnull()]
.group_by('is_valid')
.aggregate(three_count=lambda t: t.three.notnull().sum())
)
ibis.show_sql(agged)
SELECT t0.is_valid, SUM(CAST(NOT t0.three IS NULL AS INT)) AS three_count FROM ( SELECT t1.one AS one, t1.two AS two, t1.three AS three, CASE WHEN ( t1.two IS NULL ) THEN 'valid' ELSE 'invalid' END AS is_valid FROM my_data AS t1 WHERE NOT t1.one IS NULL ) AS t0 GROUP BY 1
BETWEEN
¶
The between
method on arrays and scalars compiles to the SQL BETWEEN
keyword. The result of between
is boolean and can be used with any
other boolean expression:
expr = t[t.two.between(10, 50) & t.one.notnull()]
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three FROM my_data AS t0 WHERE t0.two BETWEEN CAST(10 AS TINYINT) AND CAST(50 AS TINYINT) AND NOT t0.one IS NULL
Joins¶
Ibis supports several kinds of joins between table expressions:
inner_join
: maps to SQLINNER JOIN
cross_join
: a cartesian product join with no keys. Equivalent toinner_join
with no join predicatesleft_join
: maps to SQLLEFT OUTER JOIN
outer_join
: maps to SQLFULL OUTER JOIN
semi_join
: maps to SQLLEFT SEMI JOIN
. May or may not be an explicit join type in your query engine.anti_join
: maps to SQLLEFT ANTI JOIN
. May or may not be an explicit join type in your query engine.
The join
table method is by default the same as inner_join
.
Let's look at a couple example tables to see how joins work in Ibis:
t1 = ibis.table(
[('value1', 'float'), ('key1', 'string'), ('key2', 'string')], 'table1'
)
t2 = ibis.table(
[('value2', 'float'), ('key3', 'string'), ('key4', 'string')], 'table2'
)
Let's join on one key:
joined = t1.left_join(t2, t1.key1 == t2.key3)
The immediate result of a join does not yet have a set schema. That is determined by the next action that you take. There's several ways forward from here that address the spectrum of SQL use cases.
Join + projection¶
Consider the SQL:
SELECT t0.*, t1.value2
FROM table1 t0
LEFT OUTER JOIN table2 t1
ON t0.key1 = t1.key3
After one or more joins, you can reference any of the joined tables in a projection immediately after:
expr = joined[t1, t2.value2]
ibis.show_sql(expr)
SELECT t0.value1, t0.key1, t0.key2, t1.value2 FROM table1 AS t0 LEFT OUTER JOIN table2 AS t1 ON t0.key1 = t1.key3
If you need to compute an expression that involves both tables, you can do that also:
expr = joined[t1.key1, (t1.value1 - t2.value2).name('diff')]
ibis.show_sql(expr)
SELECT t0.key1, t0.value1 - t1.value2 AS diff FROM table1 AS t0 LEFT OUTER JOIN table2 AS t1 ON t0.key1 = t1.key3
Join + aggregation¶
You can directly aggregate a join without need for projection, which also allows you to form statistics that reference any of the joined tables.
Consider this SQL:
SELECT t0.key1, avg(t0.value1 - t1.value2) AS avg_diff
FROM table1 t0
LEFT OUTER JOIN table2 t1
ON t0.key1 = t1.key3
GROUP BY 1
As you would hope, the code is as follows:
avg_diff = (t1.value1 - t2.value2).mean()
expr = (
t1.left_join(t2, t1.key1 == t2.key3)
.group_by(t1.key1)
.aggregate(avg_diff=avg_diff)
)
ibis.show_sql(expr)
SELECT t0.key1, AVG(t0.value1 - t1.value2) AS avg_diff FROM table1 AS t0 LEFT OUTER JOIN table2 AS t1 ON t0.key1 = t1.key3 GROUP BY 1
Join with SELECT *
¶
If you try to compile or execute a join that has not been projected or aggregated, it will be fully materialized:
joined = t1.left_join(t2, t1.key1 == t2.key3)
ibis.show_sql(joined)
SELECT t0.value1, t0.key1, t0.key2, t1.value2, t1.key3, t1.key4 FROM table1 AS t0 LEFT OUTER JOIN table2 AS t1 ON t0.key1 = t1.key3
Multiple joins¶
You can join multiple tables together in succession without needing to address any of the above concerns.
t3 = ibis.table([('value3', 'float'), ('key5', 'string')], 'table3')
total = (t1.value1 + t2.value2 + t3.value3).sum()
expr = (
t1.join(t2, [t1.key1 == t2.key3, t1.key2 == t2.key4])
.join(t3, t1.key1 == t3.key5)
.group_by([t2.key4, t3.key5])
.aggregate(total=total)
)
ibis.show_sql(expr)
SELECT t1.key4, t2.key5, SUM(t0.value1 + t1.value2 + t2.value3) AS total FROM table1 AS t0 JOIN table2 AS t1 ON t0.key1 = t1.key3 AND t0.key2 = t1.key4 JOIN table3 AS t2 ON t0.key1 = t2.key5 GROUP BY 1, 2
Self joins¶
What about when you need to join a table on itself? For example:
SELECT t0.one, avg(t0.two - t1.three) AS metric
FROM my_data t0
INNER JOIN my_data t1
ON t0.one = t1.one
GROUP BY 1
The table view
method enables you to form a self-reference that is
referentially distinct in expressions. Now you can proceed normally:
t_view = t.view()
stat = (t.two - t_view.three).mean()
expr = (
t.join(t_view, t.three.cast('string') == t_view.one)
.group_by(t.one)
.aggregate(metric=stat)
)
ibis.show_sql(expr)
SELECT t0.one, AVG(t1.two - t1.three) AS metric FROM ( SELECT t1.one AS one, t1.two AS two, t1.three AS three, t2.one AS one_right, t2.two AS two_right, t2.three AS three_right FROM my_data AS t1 JOIN my_data AS t2 ON CAST(t1.three AS TEXT) = t2.one ) AS t0, my_data AS t1, my_data AS t1 GROUP BY 1
Overlapping join keys¶
In many cases the columns being joined between two tables or table expressions have the same name. Consider this example:
t4 = ibis.table(
[
('key1', 'string'),
('key2', 'string'),
('key3', 'string'),
('value1', 'float'),
],
'table4',
)
t5 = ibis.table(
[
('key1', 'string'),
('key2', 'string'),
('key3', 'string'),
('value2', 'float'),
],
'table5',
)
In these case, we can specify a list of common join keys:
joined = t4.join(t5, ['key1', 'key2', 'key3'])
expr = joined[t4, t5.value2]
ibis.show_sql(expr)
SELECT t0.key1, t0.key2, t0.key3, t0.value1, t1.value2 FROM table4 AS t0 JOIN table5 AS t1 ON t0.key1 = t1.key1 AND t0.key2 = t1.key2 AND t0.key3 = t1.key3
You can mix the overlapping key names with other expressions:
joined = t4.join(t5, ['key1', 'key2', t4.key3.left(4) == t4.key3.left(4)])
expr = joined[t4, t5.value2]
ibis.show_sql(expr)
SELECT t0.key1, t0.key2, t0.key3, t0.value1, t1.value2 FROM table4 AS t0 JOIN table5 AS t1 ON t0.key1 = t1.key1 AND t0.key2 = t1.key2 AND CASE WHEN ( CAST(0 AS TINYINT) + 1 >= 1 ) THEN SUBSTR(t0.key3, CAST(0 AS TINYINT) + 1, CAST(4 AS TINYINT)) ELSE SUBSTR(t0.key3, CAST(0 AS TINYINT) + 1 + LENGTH(t0.key3), CAST(4 AS TINYINT)) END = CASE WHEN ( CAST(0 AS TINYINT) + 1 >= 1 ) THEN SUBSTR(t0.key3, CAST(0 AS TINYINT) + 1, CAST(4 AS TINYINT)) ELSE SUBSTR(t0.key3, CAST(0 AS TINYINT) + 1 + LENGTH(t0.key3), CAST(4 AS TINYINT)) END
Non-equality join predicates¶
You can join tables with boolean clauses that are not equality. Some
query engines support these efficiently, some inefficiently, or some not
at all. In the latter case, these conditions get moved by Ibis into the
WHERE
part of the SELECT
query.
expr = t1.join(t2, t1.value1 < t2.value2).group_by(t1.key1).size()
ibis.show_sql(expr)
SELECT t0.key1, COUNT(*) AS "CountStar()" FROM table1 AS t0 JOIN table2 AS t1 ON t0.value1 < t1.value2 GROUP BY 1
Other ways to specify join keys¶
You can also pass a list of column names instead of forming boolean expressions:
joined = t1.join(t2, [('key1', 'key3'), ('key2', 'key4')])
Subqueries¶
Ibis creates inline views and nested subqueries automatically. This section concerns more complex subqueries involving foreign references and other advanced relational algebra.
Correlated EXISTS
/ NOT EXISTS
filters¶
The SQL EXISTS
and NOT EXISTS
constructs are typically used for
efficient filtering in large many-to-many relationships.
Let's consider a web dataset involving website session / usage data and purchases:
events = ibis.table(
[
('session_id', 'int64'),
('user_id', 'int64'),
('event_type', 'int32'),
('ts', 'timestamp'),
],
'events',
)
purchases = ibis.table(
[
('item_id', 'int64'),
('user_id', 'int64'),
('price', 'float'),
('ts', 'timestamp'),
],
'purchases',
)
Now, the key user_id
appears with high frequency in both tables. But
let's say you want to limit your analysis of the events
table to only
sessions by users who have made a purchase.
In SQL, you can do this using the somewhat esoteric EXISTS
construct:
SELECT t0.*
FROM events t0
WHERE EXISTS (
SELECT 1
FROM purchases t1
WHERE t0.user_id = t1.user_id
)
To describe this operation in Ibis, you compare the user_id
columns
and use the any
reduction:
cond = (events.user_id == purchases.user_id).any()
This can now be used to filter events
:
expr = events[cond]
ibis.show_sql(expr)
SELECT t0.session_id, t0.user_id, t0.event_type, t0.ts FROM events AS t0 WHERE EXISTS( SELECT CAST(1 AS TINYINT) AS anon_1 FROM purchases AS t1 WHERE t0.user_id = t1.user_id )
If you negate the condition, it will instead give you only event data from user that have not made a purchase:
expr = events[-cond]
ibis.show_sql(expr)
SELECT t0.session_id, t0.user_id, t0.event_type, t0.ts FROM events AS t0 WHERE NOT ( EXISTS( SELECT CAST(1 AS TINYINT) AS anon_1 FROM purchases AS t1 WHERE t0.user_id = t1.user_id ) )
Subqueries with IN
/ NOT IN
¶
Subquery filters with IN
(and NOT IN
) are functionally similar to
EXISTS
subqueries. Let's look at some SQL:
SELECT *
FROM events
WHERE user_id IN (
SELECT user_id
FROM purchases
)
This is almost semantically the same as the EXISTS
example. Indeed,
you can write with Ibis:
cond = events.user_id.isin(purchases.user_id)
expr = events[cond]
ibis.show_sql(expr)
SELECT t0.session_id, t0.user_id, t0.event_type, t0.ts FROM events AS t0 WHERE t0.user_id IN ( SELECT t1.user_id FROM purchases AS t1 )
Depending on the query engine, the query planner/optimizer will often
rewrite IN
or EXISTS
subqueries into the same set of relational
algebra operations.
Comparison with scalar aggregates¶
Sometime you want to compare a value with an unconditional aggregate value from a different table. Take the SQL:
SELECT *
FROM table1
WHERE value1 > (
SELECT max(value2)
FROM table2
)
With Ibis, the code is simpler and more pandas-like:
expr = t1[t1.value1 > t2.value2.max()]
ibis.show_sql(expr)
SELECT t0.value1, t0.key1, t0.key2 FROM table1 AS t0 WHERE t0.value1 > ( SELECT MAX(t1.value2) AS "Max(value2)" FROM table2 AS t1 )
Conditional aggregates¶
Suppose you want to compare a value with the aggregate value for some common group values between two tables. Here's some SQL:
SELECT *
FROM table1 t0
WHERE value1 > (
SELECT avg(value2)
FROM table2 t1
WHERE t0.key1 = t1.key3
)
This query computes the average for each distinct value of key3
and
uses the corresponding average for the comparison, rather than the
whole-table average as above.
With Ibis, the code is similar, but you add the correlated filter to the average statistic:
stat = t2[t1.key1 == t2.key3].value2.mean()
expr = t1[t1.value1 > stat]
ibis.show_sql(expr)
SELECT t0.value1, t0.key1, t0.key2 FROM table1 AS t0 WHERE t0.value1 > ( SELECT AVG(t1.value2) AS "Mean(value2)" FROM table2 AS t1 WHERE t0.key1 = t1.key3 )
DISTINCT
expressions¶
In SQL, the DISTINCT
keyword is used in a couple of ways:
- Deduplicating identical rows in some
SELECT
statement - Aggregating on the distinct values of some column expression
Ibis supports both use cases. So let's have a look. The first case is
the simplest: call distinct
on a table expression. First, here's the
SQL:
SELECT DISTINCT *
FROM table1
And the Ibis Python code:
expr = t1.distinct()
ibis.show_sql(expr)
SELECT DISTINCT t0.value1, t0.key1, t0.key2 FROM table1 AS t0
For distinct aggregates, the most common case is COUNT(DISTINCT ...)
,
which computes the number of unique values in an expression. So if
we're looking at the events
table, let's compute the number of
distinct event_type
values for each user_id
. First, the SQL:
SELECT user_id, COUNT(DISTINCT event_type) AS unique_events
FROM events
GROUP BY 1
In Ibis this is:
metric = events.event_type.nunique()
expr = events.group_by('user_id').aggregate(unique_events=metric)
ibis.show_sql(expr)
SELECT t0.user_id, COUNT(DISTINCT t0.event_type) AS unique_events FROM events AS t0 GROUP BY 1
Window functions¶
Window functions in SQL allow you to write expressions that involve possibly-ordered groups of a table. Each window function involves one of the following:
- An analytic function. Most aggregate functions are valid analytic
functions, and there are additional ones such as
LEAD
,LAG
,NTILE
, and others. - A
PARTITION BY
clause. This may be omitted. - An
ORDER BY
clause. This may be omitted for many functions. - A window frame clause. The default is to use the entire partition.
So you may see SQL like:
AVG(value) OVER (PARTITION BY key1)
Or simply
AVG(value) OVER ()
Ibis will automatically write window clauses when you use aggregate functions in a non-aggregate context. Suppose you wanted to subtract the mean of a column from itself:
expr = t.mutate(two_demean=t.two - t.two.mean())
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, t0.two - AVG(t0.two) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS two_demean FROM my_data AS t0
If you use mutate
in conjunction with group_by
, it will add a
PARTITION BY
to the OVER
specification:
expr = t.group_by('one').mutate(two_demean=t.two - t.two.mean())
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, t0.two - AVG(t0.two) OVER (PARTITION BY t0.one ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS two_demean FROM my_data AS t0
For functions like LAG
that require an ordering, we can add an
order_by
call:
expr = (
t.group_by('one')
.order_by(t.two)
.mutate(two_first_diff=t.two - t.two.lag())
)
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, t0.two - LAG(t0.two, 1) OVER (PARTITION BY t0.one ORDER BY t0.two ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS two_first_diff FROM my_data AS t0
For more precision, you can create a Window
object that also includes
a window frame clause:
w = ibis.window(group_by='one', preceding=5, following=5)
expr = t.mutate(group_demeaned=t.two - t.two.mean().over(w))
ibis.show_sql(expr)
SELECT t0.one, t0.two, t0.three, t0.two - AVG(t0.two) OVER (PARTITION BY t0.one ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING) AS group_demeaned FROM my_data AS t0
Top-K operations¶
A common SQL idiom is the "top-K" or "top-N" operation: subsetting a dimension by aggregate statistics:
SELECT key1, count(*) AS `count`
FROM table1
GROUP BY 1
ORDER BY `count` DESC
LIMIT 10
Ibis has a special analytic expression topk
:
expr = t1.key1.topk(10)
This can be evaluated directly, yielding the above query:
ibis.show_sql(expr)
SELECT t0.key1, t0."Count(key1)" FROM ( SELECT t1.key1 AS key1, COUNT(t1.key1) AS "Count(key1)" FROM table1 AS t1 GROUP BY 1 ) AS t0 ORDER BY t0."Count(key1)" DESC LIMIT 10
Date / time data¶
See Timestamp methods <api.timestamp>
{.interpreted-text role="ref"}
for a table of available date/time methods.
For example, we can do:
expr = events.mutate(year=events.ts.year(), month=events.ts.month())
ibis.show_sql(expr)
SELECT t0.session_id, t0.user_id, t0.event_type, t0.ts, CAST(EXTRACT(year FROM t0.ts) AS SMALLINT) AS year, CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) AS month FROM events AS t0
Casting to date / time types¶
In many cases, you can convert string values to datetime / timestamp
with strings.cast('timestamp')
, but you may have to do some more
reconnaissance into the data if this does not work.
Intervals¶
Ibis has a set of interval APIs that allow you to do date/time arithmetic. For example:
expr = events[events.ts > (ibis.now() - ibis.interval(years=1))]
ibis.show_sql(expr)
SELECT t0.session_id, t0.user_id, t0.event_type, t0.ts FROM events AS t0 WHERE t0.ts > CAST(NOW() AS TIMESTAMP) - INTERVAL '1' year
The implementation of each timedelta offset will depend on the query engine.
Buckets and histograms¶
To appear.
Unions¶
SQL dialects often support two kinds of UNION
operations:
UNION
: the combination of distinct rows from each table.UNION ALL
: the combination of all rows from each table, whether or not they are distinct.
The Ibis union
function by distinct is a UNION ALL
, and you can set
distinct=True
to get the normal UNION
behavior:
expr1 = t1.limit(10)
expr2 = t1.limit(10, offset=10)
expr = expr1.union(expr2)
ibis.show_sql(expr)
WITH anon_1 AS ( SELECT t1.value1 AS value1, t1.key1 AS key1, t1.key2 AS key2 FROM table1 AS t1 LIMIT 10 ), anon_2 AS ( SELECT t1.value1 AS value1, t1.key1 AS key1, t1.key2 AS key2 FROM table1 AS t1 LIMIT 10 OFFSET 10 ) SELECT t0.value1, t0.key1, t0.key2 FROM ( SELECT anon_1.value1 AS value1, anon_1.key1 AS key1, anon_1.key2 AS key2 FROM anon_1 UNION ALL SELECT anon_2.value1 AS value1, anon_2.key1 AS key1, anon_2.key2 AS key2 FROM anon_2 ) AS t0
Esoterica¶
This area will be the spillover for miscellaneous SQL concepts and how queries featuring them can be ported to Ibis.
Common table expressions (CTEs)¶
The simplest SQL CTE is a SQL statement that is used multiple times in a
SELECT
query, which can be "factored" out using the WITH
keyword:
WITH t0 AS (
SELECT region, kind, sum(amount) AS total
FROM purchases
GROUP BY 1, 2
)
SELECT t0.region, t0.total - t1.total
FROM t0
INNER JOIN t0 t1
ON t0.region = t1.region
WHERE t0.kind = 'foo' AND t1.kind = 'bar'
Explicit CTEs are not necessary with Ibis. Let's look at an example involving joining an aggregated table on itself after filtering:
purchases = ibis.table(
[
('region', 'string'),
('kind', 'string'),
('user', 'int64'),
('amount', 'float'),
],
'purchases',
)
metric = purchases.amount.sum().name('total')
agged = purchases.group_by(['region', 'kind']).aggregate(metric)
left = agged[agged.kind == 'foo']
right = agged[agged.kind == 'bar']
result = left.join(right, left.region == right.region)[
left.region, (left.total - right.total).name('diff')
]
Ibis automatically creates a CTE for agged
:
ibis.show_sql(result)
WITH t0 AS ( SELECT t3.region AS region, t3.kind AS kind, SUM(t3.amount) AS total FROM purchases AS t3 GROUP BY 1, 2 ), t2 AS ( SELECT t0.region AS region, t0.kind AS kind, t0.total AS total FROM t0 WHERE t0.kind = 'foo' ), t1 AS ( SELECT t0.region AS region, t0.kind AS kind, t0.total AS total FROM t0 WHERE t0.kind = 'bar' ) SELECT t2.region, t2.total - t1.total AS diff FROM t2 JOIN t1 ON t2.region = t1.region