TL;DR: Ibis supports both Polars and DataFusion. Both backends have about the same runtime performance, and both lag far behind DuckDB on this workload. There’s negligible performance difference between Ibis and the backends’ native APIs.
Motivation 
This is part 2 of a series of posts showing performance across various backends that Ibis supports.
Check out part 1 if you haven’t already!
In this post, I’ll continue with the Polars and DataFusion backends.
I show each tool using both the Ibis API and the tool’s native API. We’ll see that the performance difference between these approaches is negligible.
 
Setup 
I ran all of the code in this blog post on a machine with these specs.
 
CPU: AMD EPYC 7B12 (64 threads)
RAM: 94 GiB
Disk: 1.5 TiB SSD
OS: NixOS (Linux 6.1.66)
 
Library versions 
Here are the versions I used to run this experiment at the time of writing.
 
Python: 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0]
datafusion: 33.0.0
ibis: 2b54b9800
pandas: 2.1.4
polars: 0.19.19
pyarrow: 14.0.1
 
Running the query across backends 
Here are the Ibis expressions for each backend, the same query written with each tool’s native API, and timed executions of each.
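Each version of the query lives in its own script and is timed in a notebook with IPython’s %timeit and %run magics (a single run of a single loop). If you want to reproduce the timings outside of IPython, here’s a rough plain-Python sketch of a similar harness; the script names are the ones timed below, everything else is my own scaffolding:

# Rough equivalent of `%timeit -n1 -r1 -o %run <script>.py`: run each script
# once in a fresh Python process and report wall-clock time.
import subprocess
import sys
import time

scripts = [
    "duckdb_ibis.py",
    "datafusion_ibis.py",
    "datafusion_native.py",
    "polars_ibis.py",
    "polars_native.py",
]

for script in scripts:
    start = time.perf_counter()
    subprocess.run([sys.executable, script], check=True)
    elapsed = time.perf_counter() - start
    print(f"{script}: {elapsed:.1f} s")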
DuckDB 
First, let’s run the Ibis + DuckDB version of the query from the original post:
from __future__ import annotations

import ibis
from ibis import _

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
duckdb_ibis_results = %timeit -n1 -r1 -o %run duckdb_ibis.py
df.head()
33.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
DataFusion and Polars 
Ibis + DataFusion

from __future__ import annotations

import ibis
from ibis import _

ibis.set_backend("datafusion")

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
datafusion_ibis_results = %timeit -n1 -r1 -o %run datafusion_ibis.py
df.head()
8min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
DataFusion SQL
 
SELECT
  month,
  ext,
  COUNT(DISTINCT project_name) AS project_count
FROM (
  SELECT
    project_name,
    DATE_TRUNC('month', uploaded_on) AS month,
    NULLIF(
      REPLACE(
        REPLACE(
          REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
                'cxx|cpp|cc|c|hpp|h',
                'C/C++',
                'g'
              ),
              '^f.*$',
              'Fortran',
              'g'
            ),
            'rs',
            'Rust'
          ),
          'go',
          'Go'
        ),
        'asm',
        'Assembly'
      ),
      ''
    ) AS ext
  FROM pypi
  WHERE COALESCE(
      ARRAY_LENGTH(
        REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
      ) > 0,
      FALSE
    )
    AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
    AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC
 
from __future__ import annotations

import datafusion

with open("./datafusion_native.sql") as f:
    query = f.read()

ctx = datafusion.SessionContext()
ctx.register_parquet(name="pypi", path="/data/pypi-parquet/*.parquet")
expr = ctx.sql(query)

df = expr.to_pandas()
datafusion_native_results = %timeit -n1 -r1 -o %run datafusion_native.py
df.head()
9min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
Ibis + Polars

from __future__ import annotations

import ibis
from ibis import _

ibis.set_backend("polars")

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas(streaming=True)
polars_ibis_results = %timeit -n1 -r1 -o %run polars_ibis.py
df.head()
6min 32s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
Polars native API

from __future__ import annotations

import polars as pl

expr = (
    pl.scan_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            pl.col("path").str.contains(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~pl.col("path").str.contains(r"(^|/)test(|s|ing)"),
            ~pl.col("path").str.contains("/site-packages/", literal=True),
        ]
    )
    .with_columns(
        month=pl.col("uploaded_on").dt.truncate("1mo"),
        ext=pl.col("path")
        .str.extract(pattern=r"\.([a-z0-9]+)$", group_index=1)
        .str.replace_all(pattern=r"cxx|cpp|cc|c|hpp|h", value="C/C++")
        .str.replace_all(pattern="^f.*$", value="Fortran")
        .str.replace("rs", "Rust", literal=True)
        .str.replace("go", "Go", literal=True)
        .str.replace("asm", "Assembly", literal=True)
        .replace({"": None}),
    )
    .group_by(["month", "ext"])
    .agg(project_count=pl.col("project_name").n_unique())
    .drop_nulls(["ext"])
    .sort(["month", "project_count"], descending=True)
)

df = expr.collect(streaming=True).to_pandas()
polars_native_results = %timeit -n1 -r1 -o %run polars_native.py
df.head()
6min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
Takeaways 
Ibis + DuckDB is the only system tested that handles this workload well out of the box.
Both Polars and DataFusion are much slower than DuckDB and Dask on this workload.
Polars memory use fluctuates quite a bit, while DataFusion’s memory profile is similar to DuckDB’s (one way to sample memory use yourself is sketched below).
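The post doesn’t show how memory was measured; a simple way to watch it yourself is to sample the query process’s resident set size while a script runs. Here’s a minimal sketch, assuming psutil is installed (the script name is one of the scripts above; the sampling interval is arbitrary):

# Sample the resident set size (RSS) of a query script every half second and
# report the peak. Assumes psutil is installed.
import subprocess
import sys
import time

import psutil

proc = subprocess.Popen([sys.executable, "polars_native.py"])
ps = psutil.Process(proc.pid)
peak_rss = 0

while proc.poll() is None:  # still running
    try:
        peak_rss = max(peak_rss, ps.memory_info().rss)
    except psutil.NoSuchProcess:
        break
    time.sleep(0.5)

print(f"peak RSS: {peak_rss / 2**30:.1f} GiB")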
 
Let’s recap the results with some numbers:
Numbers 
 
Toolset                  Data size    Duration     Throughput
Ibis + DuckDB            25,825 MiB   34 seconds   763 MiB/s
Ibis + Polars            25,825 MiB   393 seconds  66 MiB/s
Polars native API        25,825 MiB   415 seconds  62 MiB/s
Ibis + DataFusion        25,825 MiB   535 seconds  48 MiB/s
DataFusion native API    25,825 MiB   545 seconds  47 MiB/s
Note: I couldn’t figure out how to get consistent run times, so treat the durations (and the throughput derived from them) as approximate.
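The throughput column is just the data size divided by the duration. A quick check against the rounded numbers in the table (small differences come from rounding the durations):

# Throughput = data size / duration, using the figures from the table above.
data_size_mib = 25_825

durations_s = {
    "Ibis + DuckDB": 34,
    "Ibis + Polars": 393,
    "Polars native API": 415,
    "Ibis + DataFusion": 535,
    "DataFusion native API": 545,
}

for name, seconds in durations_s.items():
    print(f"{name}: {data_size_mib / seconds:.0f} MiB/s")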
 
 
 
 
Conclusion 
If you’re considering Polars for new code, give Ibis a try with the DuckDB backend.
You’ll get better performance than Polars on some workloads, along with a broader cross-backend API that helps you scale from development to production.
If you find that Polars outperforms DuckDB on a particular workload, you can always switch to the Polars backend for that workload; a quick sketch of that swap follows below.
Everyone wins!
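To make that swap concrete, here’s a minimal sketch of selecting a backend per workload; the expression code itself doesn’t change (the parquet path is the one used throughout this post, and the query is a stripped-down version of the one above):

# Pick a backend per workload; the Ibis expression stays the same.
import ibis
from ibis import _

ibis.set_backend("duckdb")  # or "polars", or "datafusion"

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .group_by(month=_.uploaded_on.truncate("M"))
    .aggregate(project_count=_.project_name.nunique())
)
df = expr.to_pandas()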
In the next post in this series we’ll cover the cloud backends: Snowflake, BigQuery, Trino and ClickHouse.
 