PySpark

ibis.memtable Support

The PySpark backend supports memtables by natively executing queries against the underlying storage (e.g., pyarrow Tables or pandas DataFrames).
Install

Install ibis and dependencies for the PySpark backend with any one of the following:

pip install 'ibis-framework[pyspark]'
conda install -c conda-forge ibis-pyspark
mamba install -c conda-forge ibis-pyspark
Connect

ibis.pyspark.connect

con = ibis.pyspark.connect(session=session)

ibis.pyspark.connect is a thin wrapper around ibis.backends.pyspark.Backend.do_connect.

The pyspark backend does not create SparkSession objects; you must create a SparkSession yourself and pass it to ibis.pyspark.connect.
Connection Parameters

do_connect(session)

Create a PySpark Backend for use with Ibis.

Parameters:

Name | Type | Description | Default
---|---|---|---
session | SparkSession | A SparkSession instance | required
Examples:
>>> import ibis
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder.getOrCreate()
>>> ibis.pyspark.connect(session)
<ibis.backends.pyspark.Backend at 0x...>
File Support

read_csv(source_list, table_name=None, **kwargs)

Register a CSV file as a table in the current database.

Parameters:

Name | Type | Description | Default
---|---|---|---
source_list | str \| list[str] \| tuple[str] | The data source(s). May be a path to a file or directory of CSV files, or an iterable of CSV files. | required
table_name | str \| None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
kwargs | Any | Additional keyword arguments passed to the PySpark loading function (see https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html). | {}
read_parquet(source, table_name=None, **kwargs)

Register a parquet file as a table in the current database.

Parameters:

Name | Type | Description | Default
---|---|---|---
source | str \| Path | The data source. May be a path to a file or directory of parquet files. | required
table_name | str \| None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
kwargs | Any | Additional keyword arguments passed to PySpark (see https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.parquet.html). | {}