
PySpark

ibis.memtable Support

The PySpark backend supports memtables by natively executing queries against the underlying storage (e.g., pyarrow Tables or pandas DataFrames).
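For example (a minimal sketch; it assumes a local SparkSession and uses a small pandas DataFrame purely for illustration):

>>> import ibis
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder.getOrCreate()
>>> con = ibis.pyspark.connect(session=session)
>>> t = ibis.memtable(pd.DataFrame({"a": [1, 2, 3]}))
>>> con.execute(t.a.sum())  # Spark executes the query over the in-memory data
6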

Install

Install ibis and dependencies for the PySpark backend with pip, conda, or mamba:

pip install 'ibis-framework[pyspark]'
conda install -c conda-forge ibis-pyspark
mamba install -c conda-forge ibis-pyspark

Connect

ibis.pyspark.connect

con = ibis.pyspark.connect(session=session)

ibis.pyspark.connect is a thin wrapper around ibis.backends.pyspark.Backend.do_connect.

The PySpark backend does not create SparkSession objects; you must create a SparkSession yourself and pass it to ibis.pyspark.connect.

Connection Parameters

do_connect(session)

Create a PySpark Backend for use with Ibis.

Parameters:

session : SparkSession
    A SparkSession instance. Required.

Examples:

>>> import ibis
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder.getOrCreate()
>>> ibis.pyspark.connect(session)
<ibis.backends.pyspark.Backend at 0x...>
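Once connected, the backend behaves like any other Ibis backend; for example, you can inspect the tables visible to Spark (the result depends on your session):

>>> con.list_tables()  # names of tables in the current Spark database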

File Support

read_csv(source_list, table_name=None, **kwargs)

Register a CSV file as a table in the current database.

Parameters:

source_list : str | list[str] | tuple[str]
    The data source(s). May be a path to a file or directory of CSV
    files, or an iterable of CSV files. Required.
table_name : str | None
    An optional name to use for the created table. Defaults to a
    sequentially generated name. Default: None.
kwargs : Any
    Additional keyword arguments passed to the PySpark CSV reader.
    Default: {}.
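A short usage sketch; the path is hypothetical, and header=True is forwarded to the underlying PySpark CSV reader:

>>> t = con.read_csv("data/penguins.csv", table_name="penguins", header=True)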

read_parquet(source, table_name=None, **kwargs)

Register a parquet file as a table in the current database.

Parameters:

source : str | Path
    The data source. May be a path to a file or directory of Parquet
    files. Required.
table_name : str | None
    An optional name to use for the created table. Defaults to a
    sequentially generated name. Default: None.
kwargs : Any
    Additional keyword arguments passed to the PySpark Parquet reader.
    Default: {}.
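A similar sketch for Parquet (the path is hypothetical):

>>> t = con.read_parquet("data/penguins.parquet", table_name="penguins")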

Last update: August 1, 2023