Tutorial: Ibis for SQL users

Prerequisites

We recommend starting with the default (DuckDB) backend for a performant, fully-featured local experience. You can install Ibis with pip, conda, mamba, or pixi.

pip install 'ibis-framework[duckdb,examples]'
conda install -c conda-forge ibis-duckdb
mamba install -c conda-forge ibis-duckdb
pixi add ibis-duckdb

Warning

Note that the ibis-framework package is not the same as the ibis package on PyPI. These two libraries cannot coexist in the same Python environment, as both are imported under the ibis module name.

Overview

Ibis provides a full-featured replacement for SQL SELECT queries, but expressed with Python code that is:

  • Type-checked and validated as you go. No more debugging cryptic database errors; Ibis catches your mistakes right away.
  • Easier to write. Pythonic function calls with tab completion in IPython.
  • More composable. Break complex queries down into easier-to-digest pieces.
  • Easier to reuse. Mix and match Ibis snippets to create expressions tailored for your analysis.

We intend for all SELECT queries to be fully portable to Ibis. Coverage of other statements, such as DDL (e.g. CREATE TABLE) or DML (e.g. INSERT), may vary from engine to engine.

Note

If you find any SQL idioms or use cases in your work that are not represented here, please reach out so we can add more to this guide!

Projections: select/add/remove columns

All tables in Ibis are immutable. To select a subset of a table’s columns, or to add new columns, you must produce a new table by means of a projection.

import ibis

t = ibis.table(dict(one="string", two="float", three="int32"), name="my_data")
t
UnboundTable: my_data
  one   string
  two   float64
  three int32

In SQL, you might write something like:

SELECT two, one
FROM my_data

In Ibis, this is

proj = t["two", "one"]

or

proj = t.select("two", "one")

This generates the expected SQL:

ibis.to_sql(proj)
SELECT
  t0.two,
  t0.one
FROM my_data AS t0

What about adding new columns?

SELECT two, one, three * 2 AS new_col
FROM my_data

The new_col expression is written as:

new_col = (t.three * 2).name("new_col")

Now, we have:

proj = t["two", "one", new_col]
ibis.to_sql(proj)
SELECT
  t0.two,
  t0.one,
  t0.three * CAST(2 AS TINYINT) AS new_col
FROM my_data AS t0

mutate: Add or modify columns easily

Since adding new columns or modifying existing columns is so common, there is a convenience method mutate:

mutated = t.mutate(new_col=t.three * 2)

Notice that calling .name() was not necessary here, because the Python keyword argument provides the column name:

ibis.to_sql(mutated)
SELECT
  t0.one,
  t0.two,
  t0.three,
  t0.three * CAST(2 AS TINYINT) AS new_col
FROM my_data AS t0

If you modify an existing column with mutate, the other columns are passed through unchanged:

mutated = t.mutate(two=t.two * 2)
ibis.to_sql(mutated)
SELECT
  t0.one,
  t0.two * CAST(2 AS TINYINT) AS two,
  t0.three
FROM my_data AS t0

SELECT * equivalent

Especially in combination with relational joins, it’s convenient to be able to select all columns in a table using the SELECT * construct. To do this, use the table expression itself in a projection:

proj = t[t]
ibis.to_sql(proj)
SELECT
  t0.one,
  t0.two,
  t0.three
FROM my_data AS t0

This is how mutate is implemented. The example above t.mutate(new_col=t.three * 2) can be written as a normal projection:

proj = t[t, new_col]
ibis.to_sql(proj)
SELECT
  t0.one,
  t0.two,
  t0.three,
  t0.three * CAST(2 AS TINYINT) AS new_col
FROM my_data AS t0

Let’s consider a table we might wish to join with t:

t2 = ibis.table(dict(key="string", value="float"), name="dim_table")

Now let’s take the SQL:

SELECT t0.*, t0.two - t1.value AS diff
FROM my_data t0
  INNER JOIN dim_table t1
    ON t0.one = t1.key

In Ibis, this is written as:

diff = (t.two - t2.value).name("diff")
joined = t.join(t2, t.one == t2.key)[t, diff]

And verify the generated SQL:

ibis.to_sql(joined)
SELECT
  t0.one,
  t0.two,
  t0.three,
  t0.two - t1.value AS diff
FROM my_data AS t0
JOIN dim_table AS t1
  ON t0.one = t1.key

Using functions in projections

If you pass a function instead of a string or Ibis expression in any projection context, it will be invoked with the “parent” table as its argument. This can help significantly when composing complex operations. Consider this SQL:

SELECT one, avg(abs(the_sum)) AS mad
FROM (
  SELECT one, three, sum(two) AS the_sum
  FROM my_data
  GROUP BY 1, 2
) t0
GROUP BY 1

This can be written as one chained expression:

expr = (
    t.group_by(["one", "three"])
    .aggregate(the_sum=t.two.sum())
    .group_by("one")
    .aggregate(mad=lambda x: x.the_sum.abs().mean())
)

Here’s the SQL:

ibis.to_sql(expr)
SELECT
  t0.one,
  AVG(ABS(t0.the_sum)) AS mad
FROM (
  SELECT
    t1.one AS one,
    t1.three AS three,
    SUM(t1.two) AS the_sum
  FROM my_data AS t1
  GROUP BY
    1,
    2
) AS t0
GROUP BY
  1
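
The lambda in the second aggregate is one instance of this: it is invoked with the intermediate table produced by the first aggregation, which is why it can refer to the_sum. As a minimal sketch (assuming, per the description above, that callables are accepted in other projection contexts such as mutate), the same pattern looks like:

# hedged sketch: the callable receives the parent table as its argument
doubled = t.mutate(two_doubled=lambda x: x.two * 2)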

Filtering / WHERE

You can add filter clauses to a table expression either by indexing with [] (similar to pandas) or by using the filter method:

filtered = t[t.two > 0]
ibis.to_sql(filtered)
SELECT
  t0.one,
  t0.two,
  t0.three
FROM my_data AS t0
WHERE
  t0.two > CAST(0 AS TINYINT)

filter can take a list of expressions, which must all be satisfied for a row to appear in the result:

filtered = t.filter([t.two > 0, t.one.isin(["A", "B"])])
ibis.to_sql(filtered)
SELECT
  t0.one,
  t0.two,
  t0.three
FROM my_data AS t0
WHERE
  t0.two > CAST(0 AS TINYINT) AND t0.one IN ('A', 'B')

To compose boolean expressions with AND or OR, use the respective & and | operators:

cond = (t.two < 0) | ((t.two > 0) | t.one.isin(["A", "B"]))
filtered = t[cond]
ibis.to_sql(filtered)
SELECT
  t0.one,
  t0.two,
  t0.three
FROM my_data AS t0
WHERE
  t0.two < CAST(0 AS TINYINT) OR t0.two > CAST(0 AS TINYINT) OR t0.one IN ('A', 'B')

Note the parentheses around the second expression. These are necessary due to operator precedence.

Aggregation / GROUP BY

To aggregate a table, you need:

  • Zero or more grouping expressions (these can be column names)
  • One or more aggregation expressions

Let’s look at the aggregate method on tables:

agged = t.aggregate(total_two=t.two.sum(), avg_three=t.three.mean())

If you don’t use any group expressions, the result will have a single row with your statistics of interest:

agged.schema()
ibis.Schema {
  total_two  float64
  avg_three  float64
}
ibis.to_sql(agged)
SELECT
  SUM(t0.two) AS total_two,
  AVG(t0.three) AS avg_three
FROM my_data AS t0

To add groupings, use either the by argument of aggregate or use the group_by construct:

agged2 = t.aggregate(total_two=t.two.sum(), avg_three=t.three.mean(), by="one")
agged3 = t.group_by("one").aggregate(total_two=t.two.sum(), avg_three=t.three.mean())
ibis.to_sql(agged3)
SELECT
  t0.one,
  SUM(t0.two) AS total_two,
  AVG(t0.three) AS avg_three
FROM my_data AS t0
GROUP BY
  1

Non-trivial grouping keys

You can use any expression (or function, as in projections) derived from the table you are aggregating. The only constraint is that the expressions must be named. Let’s look at an example:

events = ibis.table(
    dict(ts="timestamp", event_type="int32", session_id="int64"),
    name="web_events",
)

Suppose we wanted to total up event types by year and month:

stats = (
    events.group_by(year=events.ts.year(), month=events.ts.month())
    .aggregate(total=events.count(), sessions=events.session_id.nunique())
)

Now we have:

ibis.to_sql(stats)
SELECT
  CAST(EXTRACT(year FROM t0.ts) AS SMALLINT) AS year,
  CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) AS month,
  COUNT(*) AS total,
  COUNT(DISTINCT t0.session_id) AS sessions
FROM web_events AS t0
GROUP BY
  1,
  2

Aggregates considering table subsets

In analytics it is common to compare statistics from different subsets of a table. Let’s consider a dataset containing people’s name, age, gender, and nationality:

pop = ibis.table(
    dict(name="string", country="string", gender="string", age="int16"),
    name="population",
)

Now, suppose you wanted to know for each country:

  • Average overall age
  • Average male age
  • Average female age
  • Total number of persons

In SQL, you may write:

SELECT
  country,
  count(*) AS num_persons,
  AVG(age) AS avg_age,
  AVG(
    CASE WHEN gender = 'M' THEN age
    ELSE NULL
    END
  ) AS avg_male,
  AVG(
    CASE WHEN gender = 'F' THEN age
    ELSE NULL
    END
  ) AS avg_female
FROM population
GROUP BY 1

Ibis makes this much simpler by giving you a where option in aggregation functions:

expr = pop.group_by("country").aggregate(
    num_persons=pop.count(),
    avg_age=pop.age.mean(),
    avg_male=pop.age.mean(where=pop.gender == "M"),
    avg_female=pop.age.mean(where=pop.gender == "F"),
)

This indeed generates the correct SQL. Note that SQL engines handle NULL values differently in aggregation functions, but Ibis will write the SQL expression that is correct for your query engine.

ibis.to_sql(expr)
SELECT
  t0.country,
  COUNT(*) AS num_persons,
  AVG(t0.age) AS avg_age,
  AVG(t0.age) FILTER(WHERE
    t0.gender = 'M') AS avg_male,
  AVG(t0.age) FILTER(WHERE
    t0.gender = 'F') AS avg_female
FROM population AS t0
GROUP BY
  1

Counting rows with Table.count()

Computing group frequencies can be done by calling count() after calling group_by():

freqs = events.group_by(year=events.ts.year(), month=events.ts.month()).count()
ibis.to_sql(freqs)
SELECT
  CAST(EXTRACT(year FROM t0.ts) AS SMALLINT) AS year,
  CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) AS month,
  COUNT(*) AS "CountStar(web_events)"
FROM web_events AS t0
GROUP BY
  1,
  2
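
Called without a preceding group_by, count() simply returns the total number of rows as a scalar expression. A minimal sketch:

# total row count of the events table
total_events = events.count()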

Frequency table convenience: value_counts

Consider the SQL idiom:

SELECT some_column_expression, count(*)
FROM table
GROUP BY 1

This is so common that, as in pandas, there is a generic column method value_counts that does this:

expr = events.ts.year().value_counts()
ibis.to_sql(expr)
SELECT
  t0."ExtractYear(ts)",
  COUNT(*) AS "ExtractYear(ts)_count"
FROM (
  SELECT
    CAST(EXTRACT(year FROM t1.ts) AS SMALLINT) AS "ExtractYear(ts)"
  FROM web_events AS t1
) AS t0
GROUP BY
  1

HAVING clause

The SQL HAVING clause enables you to filter the results of an aggregation based on some group-wise condition holding true. For example, suppose we wanted to limit our analysis to groups containing at least 1000 observations:

SELECT one, sum(two) AS total
FROM my_data
GROUP BY 1
HAVING count(*) >= 1000

With Ibis, you can do:

expr = (
    t.group_by("one")
    .having(t.count() >= 1000)
    .aggregate(total=t.two.sum())
)
ibis.to_sql(expr)
SELECT
  t0.one,
  SUM(t0.two) AS total
FROM my_data AS t0
GROUP BY
  1
HAVING
  COUNT(*) >= CAST(1000 AS SMALLINT)

Sorting / ORDER BY

To sort a table, use the order_by method along with either column names or expressions that indicate the sorting keys:

sorted = events.order_by([events.ts.year(), events.ts.month()])

ibis.to_sql(sorted)
SELECT
  t0.ts,
  t0.event_type,
  t0.session_id
FROM web_events AS t0
ORDER BY
  CAST(EXTRACT(year FROM t0.ts) AS SMALLINT) ASC,
  CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) ASC

The default sort order is ascending. To reverse the sort direction of any key, wrap it in ibis.desc:

sorted = (
    events.order_by([ibis.desc("event_type"), ibis.desc(events.ts.month())])
    .limit(100)
)

ibis.to_sql(sorted)
SELECT
  t0.ts,
  t0.event_type,
  t0.session_id
FROM web_events AS t0
ORDER BY
  t0.event_type DESC,
  CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) DESC
LIMIT 100

LIMIT and OFFSET

The table limit function truncates a table to the indicated number of rows. So if you only want the first 1000 rows (which may not be deterministic depending on the SQL engine), you can do:

limited = t.limit(1000)
ibis.to_sql(limited)
SELECT
  t0.one,
  t0.two,
  t0.three
FROM my_data AS t0
LIMIT 1000

The offset option in limit skips rows. So if you wanted rows 11 through 20, you could do:

limited = t.limit(10, offset=10)
ibis.to_sql(limited)
SELECT
  t0.one,
  t0.two,
  t0.three
FROM my_data AS t0
LIMIT 10
OFFSET 10

Common column expressions

See the full API documentation for all of the available value methods and tools for creating value expressions.

We mention a few common ones here as they relate to common SQL queries.

Type casts

Ibis’s type system is independent of any SQL system. You can cast Ibis expressions from one Ibis type to another. For example:

expr = t.mutate(
    date=t.one.cast("timestamp"),
    four=t.three.cast("float32"),
)

ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  CAST(t0.one AS TIMESTAMP) AS date,
  CAST(t0.three AS REAL) AS four
FROM my_data AS t0

CASE statements

SQL dialects typically support one or more kinds of CASE statement. The first is the simple case, which compares against exact values of an expression.

CASE expr
  WHEN value_1 THEN result_1
  WHEN value_2 THEN result_2
  ELSE default
END

Value expressions in Ibis have a case method that allows us to emulate these semantics:

case = (
    t.one.cast("timestamp")
    .year()
    .case()
    .when(2015, "This year")
    .when(2014, "Last year")
    .else_("Earlier")
    .end()
)

expr = t.mutate(year_group=case)
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  CASE CAST(EXTRACT(year FROM CAST(t0.one AS TIMESTAMP)) AS SMALLINT)
    WHEN CAST(2015 AS SMALLINT)
    THEN 'This year'
    WHEN CAST(2014 AS SMALLINT)
    THEN 'Last year'
    ELSE 'Earlier'
  END AS year_group
FROM my_data AS t0

The more general case is that of an arbitrary list of boolean expressions and result values:

CASE
  WHEN boolean_expr1 THEN result_1
  WHEN boolean_expr2 THEN result_2
  WHEN boolean_expr3 THEN result_3
  ELSE default
END

To do this, use ibis.case:

case = (
    ibis.case()
    .when(t.two < 0, t.three * 2)
    .when(t.two > 1, t.three)
    .else_(t.two)
    .end()
)

expr = t.mutate(cond_value=case)
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  CASE
    WHEN (
      t0.two < CAST(0 AS TINYINT)
    )
    THEN t0.three * CAST(2 AS TINYINT)
    WHEN (
      t0.two > CAST(1 AS TINYINT)
    )
    THEN t0.three
    ELSE t0.two
  END AS cond_value
FROM my_data AS t0

There are several places where Ibis builds cases for you in a simplified way. One example is the ifelse function:

switch = (t.two < 0).ifelse("Negative", "Non-Negative")
expr = t.mutate(group=switch)
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  CASE WHEN (
    t0.two < CAST(0 AS TINYINT)
  ) THEN 'Negative' ELSE 'Non-Negative' END AS "group"
FROM my_data AS t0

Using NULL in expressions

To use NULL in an expression, use the special ibis.NA value:

pos_two = (t.two > 0).ifelse(t.two, ibis.NA)
expr = t.mutate(two_positive=pos_two)
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  CASE WHEN (
    t0.two > CAST(0 AS TINYINT)
  ) THEN t0.two ELSE NULL END AS two_positive
FROM my_data AS t0

Set membership: IN / NOT IN

Let’s look again at the population dataset. Suppose you wanted to combine the United States and Canada data into a “North America” category. Here is some SQL to do it:

CASE
  WHEN UPPER(country) IN ('UNITED STATES', 'CANADA') THEN 'North America'
  ELSE country
END AS refined_group

The Ibis equivalent of IN is the isin method. So we can write:

refined = (
    pop.country.upper()
    .isin(["UNITED STATES", "CANADA"])
    .ifelse("North America", pop.country)
)

expr = pop.mutate(refined_group=refined)
ibis.to_sql(expr)
SELECT
  t0.name,
  t0.country,
  t0.gender,
  t0.age,
  CASE
    WHEN (
      UPPER(t0.country) IN ('UNITED STATES', 'CANADA')
    )
    THEN 'North America'
    ELSE t0.country
  END AS refined_group
FROM population AS t0

The opposite of isin is notin.
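
For example, to keep only the remaining countries, a filter might look like this (a minimal sketch reusing the pop table from above):

# rows whose country is not in the North America set
outside_na = pop[pop.country.upper().notin(["UNITED STATES", "CANADA"])]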

Constant and literal expressions

Consider a SQL expression like:

'foo' IN (column1, column2)

which is equivalent to

column1 = 'foo' OR column2 = 'foo'

To build expressions off constant values, you must first convert the value (whether a Python string or number) to an Ibis expression using ibis.literal:

t3 = ibis.table(
    dict(column1="string", column2="string", column3="float"),
    name="data",
)

value = ibis.literal("foo")

Once you’ve done this, you can use the literal expression like any other array or scalar expression:

has_foo = value.isin([t3.column1, t3.column2])

expr = t3.mutate(has_foo=has_foo)
ibis.to_sql(expr)
SELECT
  t0.column1,
  t0.column2,
  t0.column3,
  'foo' IN (t0.column1, t0.column2) AS has_foo
FROM data AS t0

In many other situations, you can use constants without having to use ibis.literal. For example, we could add a column containing nothing but the number 5 like so:

expr = t3.mutate(number5=5)
ibis.to_sql(expr)
SELECT
  t0.column1,
  t0.column2,
  t0.column3,
  CAST(5 AS TINYINT) AS number5
FROM data AS t0

IS NULL and IS NOT NULL

These are simple: use the isnull and notnull functions respectively, which yield boolean arrays:

indic = t.two.isnull().ifelse("valid", "invalid")
expr = t.mutate(is_valid=indic)
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  CASE WHEN (
    t0.two IS NULL
  ) THEN 'valid' ELSE 'invalid' END AS is_valid
FROM my_data AS t0
agged = (
    expr[expr.one.notnull()]
    .group_by("is_valid")
    .aggregate(three_count=lambda t: t.three.notnull().sum())
)

ibis.to_sql(agged)
SELECT
  t0.is_valid,
  SUM(CAST(NOT t0.three IS NULL AS INT)) AS three_count
FROM (
  SELECT
    t1.one AS one,
    t1.two AS two,
    t1.three AS three,
    CASE WHEN (
      t1.two IS NULL
    ) THEN 'valid' ELSE 'invalid' END AS is_valid
  FROM my_data AS t1
  WHERE
    NOT t1.one IS NULL
) AS t0
GROUP BY
  1

BETWEEN

The between method on arrays and scalars compiles to the SQL BETWEEN keyword. The result of between is boolean and can be used with any other boolean expression:

expr = t[t.two.between(10, 50) & t.one.notnull()]
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three
FROM my_data AS t0
WHERE
  t0.two BETWEEN CAST(10 AS TINYINT) AND CAST(50 AS TINYINT) AND NOT t0.one IS NULL

Joins

Ibis supports several kinds of joins between table expressions:

  • inner_join: maps to SQL INNER JOIN
  • cross_join: a cartesian product join with no keys. Equivalent to inner_join with no join predicates
  • left_join: maps to SQL LEFT OUTER JOIN
  • outer_join: maps to SQL FULL OUTER JOIN
  • semi_join: maps to SQL LEFT SEMI JOIN. May or may not be an explicit join type in your query engine.
  • anti_join: maps to SQL LEFT ANTI JOIN. May or may not be an explicit join type in your query engine.

The join table method is by default the same as inner_join.

Let’s look at a couple of example tables to see how joins work in Ibis:

t1 = ibis.table(dict(value1="float", key1="string", key2="string"), name="table1")
t2 = ibis.table(dict(value2="float", key3="string", key4="string"), name="table2")

Let’s join on one key:

joined = t1.left_join(t2, t1.key1 == t2.key3)

The immediate result of a join does not yet have a set schema. That is determined by the next action you take. There are several ways forward from here that address a variety of SQL use cases.

Join + projection

Consider the SQL:

SELECT t0.*, t1.value2
FROM table1 t0
  LEFT OUTER JOIN table2 t1
    ON t0.key1 = t1.key3

After one or more joins, you can reference any of the joined tables in a projection:

expr = joined[t1, t2.value2]
ibis.to_sql(expr)
SELECT
  t0.value1,
  t0.key1,
  t0.key2,
  t1.value2
FROM table1 AS t0
LEFT OUTER JOIN table2 AS t1
  ON t0.key1 = t1.key3

If you need to compute an expression that involves both tables, you can do that also:

expr = joined[t1.key1, (t1.value1 - t2.value2).name("diff")]
ibis.to_sql(expr)
SELECT
  t0.key1,
  t0.value1 - t1.value2 AS diff
FROM table1 AS t0
LEFT OUTER JOIN table2 AS t1
  ON t0.key1 = t1.key3

Join + aggregation

You can aggregate a join directly, without an intervening projection; this also allows you to form statistics that reference any of the joined tables.

Consider this SQL:

SELECT t0.key1, AVG(t0.value1 - t1.value2) AS avg_diff
FROM table1 t0
  LEFT OUTER JOIN table2 t1
    ON t0.key1 = t1.key3
GROUP BY 1

As you would hope, the code is as follows:

avg_diff = (t1.value1 - t2.value2).mean()
expr = (
    t1.left_join(t2, t1.key1 == t2.key3)
    .group_by(t1.key1)
    .aggregate(avg_diff=avg_diff)
)
ibis.to_sql(expr)
SELECT
  t0.key1,
  AVG(t0.value1 - t1.value2) AS avg_diff
FROM table1 AS t0
LEFT OUTER JOIN table2 AS t1
  ON t0.key1 = t1.key3
GROUP BY
  1

Join with SELECT *

If you try to compile or execute a join that has not been projected or aggregated, it will be fully materialized:

joined = t1.left_join(t2, t1.key1 == t2.key3)
ibis.to_sql(joined)
SELECT
  t0.value1,
  t0.key1,
  t0.key2,
  t1.value2,
  t1.key3,
  t1.key4
FROM table1 AS t0
LEFT OUTER JOIN table2 AS t1
  ON t0.key1 = t1.key3

Multiple joins

You can join multiple tables together in succession without needing to address any of the above concerns.

t3 = ibis.table(dict(value3="float", key5="string"), name="table3")

total = (t1.value1 + t2.value2 + t3.value3).sum()
expr = (
    t1.join(t2, [t1.key1 == t2.key3, t1.key2 == t2.key4])
    .join(t3, t1.key1 == t3.key5)
    .group_by([t2.key4, t3.key5])
    .aggregate(total=total)
)
ibis.to_sql(expr)
SELECT
  t1.key4,
  t2.key5,
  SUM(t0.value1 + t1.value2 + t2.value3) AS total
FROM table1 AS t0
JOIN table2 AS t1
  ON t0.key1 = t1.key3 AND t0.key2 = t1.key4
JOIN table3 AS t2
  ON t0.key1 = t2.key5
GROUP BY
  1,
  2

Self joins

What about when you need to join a table on itself? For example:

SELECT t0.one, avg(t0.two - t1.three) AS metric
FROM my_data t0
  INNER JOIN my_data t1
    ON t0.one = t1.one
GROUP BY 1

The table view method enables you to form a self-reference that is referentially distinct in expressions. Now you can proceed normally:

t_view = t.view()

stat = (t.two - t_view.three).mean()
expr = (
    t.join(t_view, t.three.cast("string") == t_view.one)
    .group_by(t.one)
    .aggregate(metric=stat)
)
ibis.to_sql(expr)
SELECT
  t0.one,
  AVG(t0.two - t1.three) AS metric
FROM my_data AS t0
JOIN my_data AS t1
  ON CAST(t0.three AS TEXT) = t1.one
GROUP BY
  1

Overlapping join keys

In many cases the columns being joined between two tables or table expressions have the same name. Consider this example:

t4 = ibis.table(
    dict(key1="string", key2="string", key3="string", value1="float"),
    name="table4",
)

t5 = ibis.table(
    dict(key1="string", key2="string", key3="string", value2="float"),
    name="table5",
)

In this case, we can specify a list of common join keys:

joined = t4.join(t5, ["key1", "key2", "key3"])
expr = joined[t4, t5.value2]
ibis.to_sql(expr)
SELECT
  t0.key1,
  t0.key2,
  t0.key3,
  t0.value1,
  t1.value2
FROM table4 AS t0
JOIN table5 AS t1
  ON t0.key1 = t1.key1 AND t0.key2 = t1.key2 AND t0.key3 = t1.key3

You can mix the overlapping key names with other expressions:

joined = t4.join(t5, ["key1", "key2", t4.key3.left(4) == t5.key3.left(4)])
expr = joined[t4, t5.value2]
ibis.to_sql(expr)
SELECT
  t0.key1,
  t0.key2,
  t0.key3,
  t0.value1,
  t1.value2
FROM table4 AS t0
JOIN table5 AS t1
  ON t0.key1 = t1.key1
  AND t0.key2 = t1.key2
  AND CASE
    WHEN (
      CAST(0 AS TINYINT) + 1 >= 1
    )
    THEN SUBSTR(t0.key3, CAST(0 AS TINYINT) + 1, CAST(4 AS TINYINT))
    ELSE SUBSTR(t0.key3, CAST(0 AS TINYINT) + 1 + LENGTH(t0.key3), CAST(4 AS TINYINT))
  END = CASE
    WHEN (
      CAST(0 AS TINYINT) + 1 >= 1
    )
    THEN SUBSTR(t1.key3, CAST(0 AS TINYINT) + 1, CAST(4 AS TINYINT))
    ELSE SUBSTR(t1.key3, CAST(0 AS TINYINT) + 1 + LENGTH(t1.key3), CAST(4 AS TINYINT))
  END

Non-equality join predicates

You can join tables with boolean clauses that are not equality comparisons. Some query engines support these efficiently, some inefficiently, and some not at all. In the latter case, Ibis moves these conditions into the WHERE part of the SELECT query.

expr = t1.join(t2, t1.value1 < t2.value2).group_by(t1.key1).size()
ibis.to_sql(expr)
SELECT
  t0.key1,
  COUNT(*) AS "CountStar()"
FROM table1 AS t0
JOIN table2 AS t1
  ON t0.value1 < t1.value2
GROUP BY
  1

Other ways to specify join keys

You can also pass a list of column names instead of forming boolean expressions:

joined = t1.join(t2, [("key1", "key3"), ("key2", "key4")])

Subqueries

Ibis creates inline views and nested subqueries automatically. This section concerns more complex subqueries involving foreign references and other advanced relational algebra.

Correlated EXISTS / NOT EXISTS filters

The SQL EXISTS and NOT EXISTS constructs are typically used for efficient filtering in large many-to-many relationships.

Let’s consider a web dataset involving website session / usage data and purchases:

events = ibis.table(
    dict(session_id="int64", user_id="int64", event_type="int32", ts="timestamp"),
    name="events",
)

purchases = ibis.table(
    dict(item_id="int64", user_id="int64", price="float", ts="timestamp"),
    name="purchases",
)

Now, the key user_id appears with high frequency in both tables. But let’s say you want to limit your analysis of the events table to only sessions by users who have made a purchase.

In SQL, you can do this using the somewhat esoteric EXISTS construct:

SELECT t0.*
FROM events t0
WHERE EXISTS (
  SELECT 1
  FROM purchases t1
  WHERE t0.user_id = t1.user_id
)

To describe this operation in Ibis, you compare the user_id columns and use the any reduction:

cond = (events.user_id == purchases.user_id).any()

This can now be used to filter events:

expr = events[cond]
ibis.to_sql(expr)
SELECT
  t0.session_id,
  t0.user_id,
  t0.event_type,
  t0.ts
FROM events AS t0
WHERE
  EXISTS(
    SELECT
      CAST(1 AS TINYINT) AS anon_1
    FROM purchases AS t1
    WHERE
      t0.user_id = t1.user_id
  )

If you negate the condition, it will instead give you only event data from users who have not made a purchase:

expr = events[-cond]
ibis.to_sql(expr)
SELECT
  t0.session_id,
  t0.user_id,
  t0.event_type,
  t0.ts
FROM events AS t0
WHERE
  NOT (
    EXISTS(
      SELECT
        CAST(1 AS TINYINT) AS anon_1
      FROM purchases AS t1
      WHERE
        t0.user_id = t1.user_id
    )
  )
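
The semi_join and anti_join methods listed in the joins section express these same two filters at the table level; the exact SQL generated may differ by backend. A minimal sketch:

# events with at least one matching purchase (like the EXISTS filter)
with_purchase = events.semi_join(purchases, events.user_id == purchases.user_id)

# events with no matching purchase (like the NOT EXISTS filter)
without_purchase = events.anti_join(purchases, events.user_id == purchases.user_id)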

Subqueries with IN / NOT IN

Subquery filters with IN (and NOT IN) are functionally similar to EXISTS subqueries. Let’s look at some SQL:

SELECT *
FROM events
WHERE user_id IN (
  SELECT user_id
  FROM purchases
)

This is almost semantically the same as the EXISTS example. Indeed, with Ibis you can write:

cond = events.user_id.isin(purchases.user_id)
expr = events[cond]
ibis.to_sql(expr)
SELECT
  t0.session_id,
  t0.user_id,
  t0.event_type,
  t0.ts
FROM events AS t0
WHERE
  t0.user_id IN (
    SELECT
      t1.user_id
    FROM purchases AS t1
  )

Depending on the query engine, the query planner/optimizer will often rewrite IN or EXISTS subqueries into the same set of relational algebra operations.

Comparison with scalar aggregates

Sometimes you want to compare a value with an unconditional aggregate value from a different table. Take the SQL:

SELECT *
FROM table1
WHERE value1 > (
  SELECT max(value2)
  FROM table2
)

With Ibis, the code is simpler and more pandas-like:

expr = t1[t1.value1 > t2.value2.max()]
ibis.to_sql(expr)
SELECT
  t0.value1,
  t0.key1,
  t0.key2
FROM table1 AS t0
WHERE
  t0.value1 > (
    SELECT
      MAX(t1.value2) AS "Max(value2)"
    FROM table2 AS t1
  )

Conditional aggregates

Suppose you want to compare a value with the aggregate value for some common group values between two tables. Here’s some SQL:

SELECT *
FROM table1 t0
WHERE value1 > (
  SELECT avg(value2)
  FROM table2 t1
  WHERE t0.key1 = t1.key3
)

This query computes the average for each distinct value of key3 and uses the corresponding average for the comparison, rather than the whole-table average as above.

With Ibis, the code is similar, but you add the correlated filter to the average statistic:

stat = t2[t1.key1 == t2.key3].value2.mean()
expr = t1[t1.value1 > stat]
ibis.to_sql(expr)
SELECT
  t0.value1,
  t0.key1,
  t0.key2
FROM table1 AS t0
WHERE
  t0.value1 > (
    SELECT
      AVG(t1.value2) AS "Mean(value2)"
    FROM table2 AS t1
    WHERE
      t0.key1 = t1.key3
  )

DISTINCT expressions

In SQL, the DISTINCT keyword is used in a couple of ways:

  • Deduplicating identical rows in some SELECT statement
  • Aggregating on the distinct values of some column expression

Ibis supports both use cases. So let’s have a look. The first case is the simplest: call distinct on a table expression. First, here’s the SQL:

SELECT DISTINCT *
FROM table1

And the Ibis Python code:

expr = t1.distinct()
ibis.to_sql(expr)
SELECT DISTINCT
  t0.value1,
  t0.key1,
  t0.key2
FROM table1 AS t0

For distinct aggregates, the most common case is COUNT(DISTINCT ...), which computes the number of unique values in an expression. So if we’re looking at the events table, let’s compute the number of distinct event_type values for each user_id. First, the SQL:

SELECT user_id, COUNT(DISTINCT event_type) AS unique_events
FROM events
GROUP BY 1

In Ibis this is:

metric = events.event_type.nunique()
expr = events.group_by("user_id").aggregate(unique_events=metric)
ibis.to_sql(expr)
SELECT
  t0.user_id,
  COUNT(DISTINCT t0.event_type) AS unique_events
FROM events AS t0
GROUP BY
  1

Window functions

Window functions in SQL allow you to write expressions that involve possibly-ordered groups of a table. Each window function involves the following:

  • An analytic function. Most aggregate functions are valid analytic functions, and there are additional ones such as LEAD, LAG, NTILE, and others.
  • A PARTITION BY clause. This may be omitted.
  • An ORDER BY clause. This may be omitted for many functions.
  • A window frame clause. The default is to use the entire partition.

So you may see SQL like:

AVG(value) OVER (PARTITION BY key1)

Or simply

AVG(value) OVER ()

Ibis will automatically write window clauses when you use aggregate functions in a non-aggregate context. Suppose you wanted to subtract the mean of a column from itself:

expr = t.mutate(two_demean=t.two - t.two.mean())
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  t0.two - AVG(t0.two) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS two_demean
FROM my_data AS t0

If you use mutate in conjunction with group_by, it will add a PARTITION BY to the OVER specification:

expr = t.group_by("one").mutate(two_demean=t.two - t.two.mean())

ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  t0.two - AVG(t0.two) OVER (PARTITION BY t0.one ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS two_demean
FROM my_data AS t0

For functions like LAG that require an ordering, we can add an order_by call:

expr = t.group_by("one").order_by(t.two).mutate(two_first_diff=t.two - t.two.lag())

ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  t0.two - LAG(t0.two, 1) OVER (PARTITION BY t0.one ORDER BY t0.two ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS two_first_diff
FROM my_data AS t0

For more precision, you can create a Window object that also includes a window frame clause:

w = ibis.window(group_by="one", preceding=5, following=5)
expr = t.mutate(group_demeaned=t.two - t.two.mean().over(w))
ibis.to_sql(expr)
SELECT
  t0.one,
  t0.two,
  t0.three,
  t0.two - AVG(t0.two) OVER (PARTITION BY t0.one ROWS BETWEEN CAST(5 AS TINYINT) PRECEDING AND CAST(5 AS TINYINT) FOLLOWING) AS group_demeaned
FROM my_data AS t0

Top-K operations

A common SQL idiom is the “top-K” or “top-N” operation: ranking the values of a column by some aggregate statistic and keeping only the top rows:

SELECT key1, count(*) AS `count`
FROM table1
GROUP BY 1
ORDER BY `count` DESC
LIMIT 10

Ibis has a special analytic expression topk:

expr = t1.key1.topk(10)

This can be evaluated directly, yielding a query equivalent to the one above:

ibis.to_sql(expr)
SELECT
  t0.key1,
  t0."Count(key1)"
FROM (
  SELECT
    t1.key1 AS key1,
    COUNT(t1.key1) AS "Count(key1)"
  FROM table1 AS t1
  GROUP BY
    1
) AS t0
ORDER BY
  t0."Count(key1)" DESC
LIMIT 10
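
topk also accepts a by argument when you want to rank by a statistic other than the row count. For example, to take the 10 keys with the largest total value1 (a minimal sketch):

# rank key1 by summed value1 instead of row count
expr = t1.key1.topk(10, by=t1.value1.sum())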

Date / time data

See the Timestamp API documentation for a table of available date/time methods.

For example, we can do:

expr = events.mutate(year=events.ts.year(), month=events.ts.month())

ibis.to_sql(expr)
SELECT
  t0.session_id,
  t0.user_id,
  t0.event_type,
  t0.ts,
  CAST(EXTRACT(year FROM t0.ts) AS SMALLINT) AS year,
  CAST(EXTRACT(month FROM t0.ts) AS SMALLINT) AS month
FROM events AS t0

Casting to date / time types

In many cases, you can convert string values to datetime / timestamp values by casting the string column, e.g. string_col.cast('timestamp'), but you may need to do some more reconnaissance into the data if this does not work.
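
For example, given a hypothetical raw_events table with an ISO-formatted string column, the cast might look like this (a minimal sketch):

# hypothetical table; created_at holds strings like '2024-01-01 00:00:00'
raw = ibis.table(dict(id="int64", created_at="string"), name="raw_events")
parsed = raw.mutate(created_ts=raw.created_at.cast("timestamp"))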

Intervals

Ibis has a set of interval APIs that allow you to do date/time arithmetic. For example:

expr = events[events.ts > (ibis.now() - ibis.interval(years=1))]
ibis.to_sql(expr)
SELECT
  t0.session_id,
  t0.user_id,
  t0.event_type,
  t0.ts
FROM events AS t0
WHERE
  t0.ts > CAST(NOW() AS TIMESTAMP) - TO_YEARS(1)

The implementation of each timedelta offset will depend on the query engine.
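
Intervals can also be combined with timestamp columns in projections. As a minimal sketch, a derived column 30 days after each event might be written as:

# add a 30-day interval to the ts column
expr = events.mutate(expires_at=events.ts + ibis.interval(days=30))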

Buckets and histograms

To appear.

Unions

SQL dialects often support two kinds of UNION operations:

  • UNION: the combination of distinct rows from each table.
  • UNION ALL: the combination of all rows from each table, whether or not they are distinct.

The Ibis union function by default is a UNION ALL, and you can set distinct=True to get the normal UNION behavior:

expr1 = t1.limit(10)
expr2 = t1.limit(10, offset=10)

expr = expr1.union(expr2)
ibis.to_sql(expr)
WITH anon_1 AS (
  SELECT
    t1.value1 AS value1,
    t1.key1 AS key1,
    t1.key2 AS key2
  FROM table1 AS t1
  LIMIT 10
), anon_2 AS (
  SELECT
    t1.value1 AS value1,
    t1.key1 AS key1,
    t1.key2 AS key2
  FROM table1 AS t1
  LIMIT 10
  OFFSET 10
)
SELECT
  t0.value1,
  t0.key1,
  t0.key2
FROM (
  SELECT
    anon_1.value1 AS value1,
    anon_1.key1 AS key1,
    anon_1.key2 AS key2
  FROM anon_1
  UNION ALL
  SELECT
    anon_2.value1 AS value1,
    anon_2.key1 AS key1,
    anon_2.key2 AS key2
  FROM anon_2
) AS t0
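
To get SQL UNION (distinct) semantics instead, pass distinct=True; the combined rows are then deduplicated (a minimal sketch):

# UNION rather than UNION ALL
distinct_union = expr1.union(expr2, distinct=True)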

Esoterica

This area will be the spillover for miscellaneous SQL concepts and how queries featuring them can be ported to Ibis.

Common table expressions (CTEs)

The simplest SQL CTE is a subquery that is used multiple times in a SELECT query and can be “factored” out using the WITH keyword:

WITH t0 AS (
   SELECT region, kind, sum(amount) AS total
   FROM purchases
   GROUP BY 1, 2
)
SELECT t0.region, t0.total - t1.total
FROM t0
  INNER JOIN t0 t1
    ON t0.region = t1.region
WHERE t0.kind = 'foo' AND t1.kind = 'bar'

Explicit CTEs are not necessary with Ibis. Let’s look at an example involving joining an aggregated table on itself after filtering:

purchases = ibis.table(
    dict(region="string", kind="string", user="int64", amount="float"),
    name="purchases",
)

metric = purchases.amount.sum().name("total")
agged = purchases.group_by(["region", "kind"]).aggregate(metric)

left = agged[agged.kind == "foo"]
right = agged[agged.kind == "bar"]

result = left.join(right, left.region == right.region)[
    left.region, (left.total - right.total).name("diff")
]

Ibis automatically creates a CTE for agged:

ibis.to_sql(result)
WITH t1 AS (
  SELECT
    t2.region AS region,
    t2.kind AS kind,
    SUM(t2.amount) AS total
  FROM purchases AS t2
  WHERE
    t2.kind = 'foo'
  GROUP BY
    1,
    2
), t0 AS (
  SELECT
    t2.region AS region,
    t2.kind AS kind,
    SUM(t2.amount) AS total
  FROM purchases AS t2
  WHERE
    t2.kind = 'bar'
  GROUP BY
    1,
    2
)
SELECT
  t1.region,
  t1.total - t0.total AS diff
FROM t1
JOIN t0
  ON t1.region = t0.region