pandas¶
Ibis's pandas backend is available in core Ibis.
ibis.memtable Support¶
The pandas backend supports memtables by natively executing queries against the underlying storage (e.g., pyarrow Tables or pandas DataFrames).
Install¶
Install ibis
and dependencies for the pandas backend:
pip install 'ibis-framework'
conda install -c conda-forge ibis-framework
mamba install -c conda-forge ibis-framework
Connect¶
API¶
Create a client by passing a dictionary that maps table names to pandas DataFrames to ibis.pandas.connect.
See ibis.backends.pandas.Backend.do_connect for connection parameter information.
ibis.pandas.connect is a thin wrapper around ibis.backends.pandas.Backend.do_connect.
Connection Parameters¶
do_connect(dictionary=None)¶
Construct a client from a dictionary of pandas DataFrames.
Parameters:
Name | Type | Description | Default
---|---|---|---
dictionary | MutableMapping[str, pd.DataFrame] or None | An optional mapping of string table names to pandas DataFrames. | None
Examples:
>>> import ibis
>>> import pandas as pd
>>> ibis.pandas.connect({"t": pd.DataFrame({"a": [1, 2, 3]})})
<ibis.backends.pandas.Backend at 0x...>
Backend API¶
Backend
¶
Bases: BasePandasBackend
Attributes¶
db_identity: str (cached property)¶
Return the identity of the database.
Multiple connections to the same database will return the same value for db_identity.
The default implementation assumes connection parameters uniquely specify the database.
Returns:
Type | Description
---|---
Hashable | Database identity
tables (cached property)¶
An accessor for tables in the database.
Tables may be accessed by name using either index or attribute access:
Examples:
>>> con = ibis.sqlite.connect("example.db")
>>> people = con.tables['people'] # access via index
>>> people = con.tables.people # access via attribute
Functions¶
add_operation(operation)¶
Add a translation function to the backend for a specific operation.
Operations are defined in ibis.expr.operations. A translation function receives the translator object and an expression as parameters, and returns a value depending on the backend.
connect(*args, **kwargs)¶
Connect to the database.
Parameters:
Name | Type | Description | Default
---|---|---|---
*args | | Mandatory connection parameters, see the docstring of do_connect for information. | ()
**kwargs | | Extra connection parameters, see the docstring of do_connect for information. | {}
Notes¶
This creates a new backend instance with saved args and kwargs, then calls reconnect and finally returns the newly created and connected backend instance.
Returns:
Type | Description
---|---
BaseBackend | An instance of the backend
create_table(name, obj=None, *, schema=None, database=None, temp=None, overwrite=False)¶
Create a table.
database(name=None)¶
Return a Database object for the named database.
Parameters:
Name | Type | Description | Default
---|---|---|---
name | str or None | Name of the database to return the object for. | None
Returns:
Type | Description
---|---
Database | A database object for the specified database.
from_dataframe(df, name='df', client=None)¶
Construct an ibis table from a pandas DataFrame.
Parameters:
Name | Type | Description | Default
---|---|---|---
df | pd.DataFrame | A pandas DataFrame | required
name | str | The name of the pandas DataFrame | 'df'
client | BasePandasBackend or None | Client whose table dictionary will be mutated to include the DataFrame under name; if not provided, a new client is created. | None
Returns:
Type | Description
---|---
Table | A table expression
read_csv(path, table_name=None, **kwargs)¶
Register a CSV file as a table in the current backend.
Parameters:
Name | Type | Description | Default
---|---|---|---
path | str or Path | The data source. A string or Path to the CSV file. | required
table_name | str or None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
**kwargs | Any | Additional keyword arguments passed to the backend loading function. | {}
Returns:
Type | Description
---|---
ir.Table | The just-registered table
read_parquet(path, table_name=None, **kwargs)¶
Register a parquet file as a table in the current backend.
Parameters:
Name | Type | Description | Default
---|---|---|---
path | str or Path | The data source. | required
table_name | str or None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
**kwargs | Any | Additional keyword arguments passed to the backend loading function. | {}
Returns:
Type | Description
---|---
ir.Table | The just-registered table
register_options() (classmethod)¶
Register custom backend options.
to_csv(expr, path, *, params=None, **kwargs)¶
Write the results of executing the given expression to a CSV file.
This method is eager and will execute the associated expression immediately.
Parameters:
Name | Type | Description | Default
---|---|---|---
expr | ir.Table | The ibis expression to execute and persist to CSV. | required
path | str or Path | The data source. A string or Path to the CSV file. | required
params | Mapping[ir.Scalar, Any] or None | Mapping of scalar parameter expressions to value. | None
kwargs | Any | Additional keyword arguments passed to pyarrow.csv.CSVWriter | {}
to_delta(expr, path, *, params=None, **kwargs)¶
Write the results of executing the given expression to a Delta Lake table.
This method is eager and will execute the associated expression immediately.
Parameters:
Name | Type | Description | Default
---|---|---|---
expr | ir.Table | The ibis expression to execute and persist to a Delta Lake table. | required
path | str or Path | The data source. A string or Path to the Delta Lake table. | required
params | Mapping[ir.Scalar, Any] or None | Mapping of scalar parameter expressions to value. | None
kwargs | Any | Additional keyword arguments passed to the deltalake.writer.write_deltalake method | {}
to_parquet(expr, path, *, params=None, **kwargs)¶
Write the results of executing the given expression to a parquet file.
This method is eager and will execute the associated expression immediately.
Parameters:
Name | Type | Description | Default
---|---|---|---
expr | ir.Table | The ibis expression to execute and persist to parquet. | required
path | str or Path | The data source. A string or Path to the parquet file. | required
params | Mapping[ir.Scalar, Any] or None | Mapping of scalar parameter expressions to value. | None
**kwargs | Any | Additional keyword arguments passed to pyarrow.parquet.ParquetWriter | {}
to_torch(expr, *, params=None, limit=None, **kwargs)¶
Execute an expression and return results as a dictionary of torch tensors.
Parameters:
Name | Type | Description | Default
---|---|---|---
expr | ir.Expr | Ibis expression to execute. | required
params | Mapping[ir.Scalar, Any] or None | Parameters to substitute into the expression. | None
limit | int or str or None | An integer to effect a specific row limit. A value of None means no limit. | None
kwargs | Any | Keyword arguments passed into the backend's execution of the expression. | {}
Returns:
Type | Description
---|---
dict[str, torch.Tensor] | A dictionary of torch tensors, keyed by column name.
User Defined functions (UDF)¶
Ibis supports defining three kinds of user-defined functions for operations on expressions targeting the pandas backend: element-wise, reduction, and analytic.
Elementwise Functions¶
An element-wise function is a function that takes N rows as input and produces N rows of output. log, exp, and floor are examples of element-wise functions.
Here's how to define an element-wise function:
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf
@udf.elementwise(input_type=[dt.int64], output_type=dt.double)
def add_one(x):
    return x + 1.0
Reduction Functions¶
A reduction is a function that takes N rows as input and produces 1 row as output. sum, mean, and count are examples of reductions. In the context of a GROUP BY, reductions produce 1 row of output per group.
Here's how to define a reduction function:
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf
@udf.reduction(input_type=[dt.double], output_type=dt.double)
def double_mean(series):
    return 2 * series.mean()
Analytic Functions¶
An analytic function is like an element-wise function in that it takes N rows as input and produces N rows of output. The key difference is that analytic functions can be applied per group using window functions. Z-score is an example of an analytic function.
Here's how to define an analytic function:
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf
@udf.analytic(input_type=[dt.double], output_type=dt.double)
def zscore(series):
    return (series - series.mean()) / series.std()
Details of pandas UDFs¶
- Element-wise functions support applying your UDF to any combination of scalar values and columns.
- Reductions support whole-column aggregations, grouped aggregations, and application of your function over a window.
- Analytic functions work in both grouped and non-grouped settings.
- The objects you receive as input arguments are either pandas.Series or Python/NumPy scalars.
Keyword arguments must be given a default
Any keyword arguments must be given a default value or the function will not work.
A common Python convention is to set the default value to None and handle setting it to something not None in the body of the function.
Using add_one from above as an example, the following call will receive a pandas.Series for the x argument:
import ibis
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
con = ibis.pandas.connect({'df': df})
t = con.table('df')
expr = add_one(t.a)
expr
And this will receive the int 1:
expr = add_one(1)
expr
Since the pandas backend passes around **kwargs, you can accept **kwargs in your function:
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf
@udf.elementwise([dt.int64], dt.double)
def add_two(x, **kwargs):  # do stuff with kwargs
    return x + 2.0
Or you can leave them out as we did in the example above. You can also optionally accept specific keyword arguments.
For example:
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf
@udf.elementwise([dt.int64], dt.double)
def add_two_with_none(x, y=None):
    if y is None:
        y = 2.0
    return x + y