TL;DR: Ibis supports both Polars and DataFusion. Both backends have about the same runtime performance, and both lag far behind DuckDB on this workload. There’s negligible performance difference between Ibis and the backends’ native APIs.
Motivation 
This is part 2 of a series of posts showing performance across various backends that Ibis supports.
Check out part 1 if you haven’t already!
In this post, I’ll continue with the Polars and DataFusion backends.
I show each tool using both the Ibis API and the tool’s native API. We’ll see that the performance difference between these approaches is negligible.
 
Setup 
I ran all of the code in this blog post on a machine with these specs.
 
CPU: AMD EPYC 7B12 (64 threads)
RAM: 94 GiB
Disk: 1.5 TiB SSD
OS: NixOS (Linux 6.1.66)
 
Library versions 
Here are the versions I used to run this experiment at the time of writing.
 
Python: 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0]
datafusion: 33.0.0
ibis: 2b54b9800
pandas: 2.1.4
polars: 0.19.19
pyarrow: 14.0.1
 
Running the query across backends 
Here are the Ibis expressions for each backend, the same query written with each tool’s native API, and timed executions of each.
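Each version of the query lives in its own script and is timed in a notebook with IPython’s %timeit and %run magics (a single run of a single loop). If you want to reproduce the timings outside of IPython, here’s a rough plain-Python sketch of a similar harness; the script names are the ones timed below, everything else is my own scaffolding:

# Rough equivalent of `%timeit -n1 -r1 -o %run <script>.py`: run each script
# once in a fresh Python process and report wall-clock time.
import subprocess
import sys
import time

scripts = [
    "duckdb_ibis.py",
    "datafusion_ibis.py",
    "datafusion_native.py",
    "polars_ibis.py",
    "polars_native.py",
]

for script in scripts:
    start = time.perf_counter()
    subprocess.run([sys.executable, script], check=True)
    elapsed = time.perf_counter() - start
    print(f"{script}: {elapsed:.1f} s")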
DuckDB 
First, let’s run the Ibis + DuckDB version of the query from the original post:
from __future__ import annotations

import ibis
from ibis import _

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
duckdb_ibis_results = %timeit -n1 -r1 -o %run duckdb_ibis.py
df.head()
33.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
DataFusion and Polars 
Ibis + DataFusion

from __future__ import annotations

import ibis
from ibis import _

ibis.set_backend("datafusion")

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
datafusion_ibis_results = %timeit -n1 -r1 -o %run datafusion_ibis.py
df.head()
8min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
DataFusion SQL
 
SELECT
  month,
  ext,
  COUNT(DISTINCT project_name) AS project_count
FROM (
  SELECT
    project_name,
    DATE_TRUNC('month', uploaded_on) AS month,
    NULLIF(
      REPLACE(
        REPLACE(
          REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
                'cxx|cpp|cc|c|hpp|h',
                'C/C++',
                'g'
              ),
              '^f.*$',
              'Fortran',
              'g'
            ),
            'rs',
            'Rust'
          ),
          'go',
          'Go'
        ),
        'asm',
        'Assembly'
      ),
      ''
    ) AS ext
  FROM pypi
  WHERE COALESCE(
      ARRAY_LENGTH(
        REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
      ) > 0,
      FALSE
    )
    AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
    AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC
 
from __future__ import annotations

import datafusion

with open("./datafusion_native.sql") as f:
    query = f.read()

ctx = datafusion.SessionContext()
ctx.register_parquet(name="pypi", path="/data/pypi-parquet/*.parquet")
expr = ctx.sql(query)

df = expr.to_pandas()
datafusion_native_results = %timeit -n1 -r1 -o %run datafusion_native.py
df.head()
9min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
Ibis + Polars

from __future__ import annotations

import ibis
from ibis import _

ibis.set_backend("polars")

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas(streaming=True)
polars_ibis_results = %timeit -n1 -r1 -o %run polars_ibis.py
df.head()
6min 32s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
Polars native API

from __future__ import annotations

import polars as pl

expr = (
    pl.scan_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            pl.col("path").str.contains(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~pl.col("path").str.contains(r"(^|/)test(|s|ing)"),
            ~pl.col("path").str.contains("/site-packages/", literal=True),
        ]
    )
    .with_columns(
        month=pl.col("uploaded_on").dt.truncate("1mo"),
        ext=pl.col("path")
        .str.extract(pattern=r"\.([a-z0-9]+)$", group_index=1)
        .str.replace_all(pattern=r"cxx|cpp|cc|c|hpp|h", value="C/C++")
        .str.replace_all(pattern="^f.*$", value="Fortran")
        .str.replace("rs", "Rust", literal=True)
        .str.replace("go", "Go", literal=True)
        .str.replace("asm", "Assembly", literal=True)
        .replace({"": None}),
    )
    .group_by(["month", "ext"])
    .agg(project_count=pl.col("project_name").n_unique())
    .drop_nulls(["ext"])
    .sort(["month", "project_count"], descending=True)
)

df = expr.collect(streaming=True).to_pandas()
polars_native_results = %timeit -n1 -r1 -o %run polars_native.py
df.head()
6min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) 
 
 
   month        ext       project_count
0  2023-11-01   C/C++     836
1  2023-11-01   Rust      190
2  2023-11-01   Fortran   48
3  2023-11-01   Go        33
4  2023-11-01   Assembly  10
 
Takeaways 
Ibis + DuckDB is the only system tested that handles this workload well out of the box.
Both Polars and DataFusion are much slower than DuckDB and Dask on this workload.
Polars memory use fluctuates quite a bit, while DataFusion’s memory profile is similar to DuckDB’s (one way to sample memory use yourself is sketched below).
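The post doesn’t show how memory was measured; a simple way to watch it yourself is to sample the query process’s resident set size while a script runs. Here’s a minimal sketch, assuming psutil is installed (the script name is one of the scripts above; the sampling interval is arbitrary):

# Sample the resident set size (RSS) of a query script every half second and
# report the peak. Assumes psutil is installed.
import subprocess
import sys
import time

import psutil

proc = subprocess.Popen([sys.executable, "polars_native.py"])
ps = psutil.Process(proc.pid)
peak_rss = 0

while proc.poll() is None:  # still running
    try:
        peak_rss = max(peak_rss, ps.memory_info().rss)
    except psutil.NoSuchProcess:
        break
    time.sleep(0.5)

print(f"peak RSS: {peak_rss / 2**30:.1f} GiB")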
 
Let’s recap the results with some numbers:
Numbers 
 
Toolset                  Data size    Duration     Throughput
Ibis + DuckDB            25,825 MiB   34 seconds   763 MiB/s
Ibis + Polars            25,825 MiB   393 seconds  66 MiB/s
Polars native API        25,825 MiB   415 seconds  62 MiB/s
Ibis + DataFusion        25,825 MiB   535 seconds  48 MiB/s
DataFusion native API    25,825 MiB   545 seconds  47 MiB/s
Note: I couldn’t figure out how to get consistent run times, so treat the durations (and the throughput derived from them) as approximate.
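The throughput column is just the data size divided by the duration. A quick check against the rounded numbers in the table (small differences come from rounding the durations):

# Throughput = data size / duration, using the figures from the table above.
data_size_mib = 25_825

durations_s = {
    "Ibis + DuckDB": 34,
    "Ibis + Polars": 393,
    "Polars native API": 415,
    "Ibis + DataFusion": 535,
    "DataFusion native API": 545,
}

for name, seconds in durations_s.items():
    print(f"{name}: {data_size_mib / seconds:.0f} MiB/s")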
 
 
 
 
Conclusion 
If you’re considering Polars for new code, give Ibis a try with the DuckDB backend.
You’ll get better performance than Polars on some workloads, along with a broader cross-backend API that helps you scale from development to production.
If you find that Polars outperforms DuckDB on a particular workload, you can always switch to the Polars backend for that workload; a quick sketch of that swap follows below.
Everyone wins!
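To make that swap concrete, here’s a minimal sketch of selecting a backend per workload; the expression code itself doesn’t change (the parquet path is the one used throughout this post, and the query is a stripped-down version of the one above):

# Pick a backend per workload; the Ibis expression stays the same.
import ibis
from ibis import _

ibis.set_backend("duckdb")  # or "polars", or "datafusion"

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .group_by(month=_.uploaded_on.truncate("M"))
    .aggregate(project_count=_.project_name.nunique())
)
df = expr.to_pandas()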
In the next post in this series we’ll cover the cloud backends: Snowflake, BigQuery, Trino and ClickHouse.
 