PySpark

ibis.memtable Support

The PySpark backend supports memtables by natively executing queries against the underlying storage (e.g., pyarrow Tables or pandas DataFrames).
Install

Install ibis and dependencies for the PySpark backend with any one of the following:

pip install 'ibis-framework[pyspark]'
conda install -c conda-forge ibis-pyspark
mamba install -c conda-forge ibis-pyspark
Connect

ibis.pyspark.connect

con = ibis.pyspark.connect(session=session)

ibis.pyspark.connect is a thin wrapper around ibis.backends.pyspark.Backend.do_connect.

The pyspark backend does not create SparkSession objects; you must create a SparkSession yourself and pass it to ibis.pyspark.connect.
Connection Parameters

do_connect(session)

Create a PySpark Backend for use with Ibis.

Parameters:

Name | Type | Description | Default
---|---|---|---
session | SparkSession | A SparkSession instance | required
Examples:
>>> import ibis
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder.getOrCreate()
>>> ibis.pyspark.connect(session)
<ibis.backends.pyspark.Backend at 0x...>
File Support

read_csv(source_list, table_name=None, **kwargs)

Register a CSV file as a table in the current database.

Parameters:

Name | Type | Description | Default
---|---|---|---
source_list | str \| list[str] \| tuple[str] | The data source(s). May be a path to a file or directory of CSV files, or an iterable of CSV files. | required
table_name | str \| None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
kwargs | Any | Additional keyword arguments passed to the PySpark loading function (see https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html). | {}
read_parquet(source, table_name=None, **kwargs)

Register a parquet file as a table in the current database.

Parameters:

Name | Type | Description | Default
---|---|---|---
source | str \| Path | The data source. May be a path to a file or directory of parquet files. | required
table_name | str \| None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
kwargs | Any | Additional keyword arguments passed to PySpark (see https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.parquet.html). | {}