Data splitting

Segregating data into training, testing, and validation sets

train_test_split

train_test_split(table, unique_key, test_size=0.25, num_buckets=100, random_seed=None)

Randomly split Ibis table data into training and testing tables.

This function splits an Ibis table into training and testing tables based on a unique key or combination of keys. It hashes the unique key into an integer and applies a modulo operation to assign each row to one of a fixed number of buckets. Rows from a subset of these buckets form the training table, and rows from the remaining buckets form the test table.
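The bucketing idea can be sketched outside of Ibis. The snippet below is a simplified, hypothetical illustration of the scheme (hash the key together with the seed, take the result modulo num_buckets, and treat bucket indices below test_size * num_buckets as the test split); it is not the library's actual implementation.

>>> import hashlib
>>> def assign_bucket(key, num_buckets=100, seed=0):
...     # Deterministically map the key (plus seed) to an integer bucket
...     digest = hashlib.sha256(f"{seed}-{key}".encode()).hexdigest()
...     return int(digest, 16) % num_buckets
>>> # With test_size=0.25 and num_buckets=100, one possible convention is
>>> # to send buckets 0-24 to the test split and the rest to training.
>>> in_test = assign_bucket("row-42") < 25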

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| table | ir.Table | The input Ibis table to be split. | required |
| unique_key | str \| list[str] | The column name(s) that uniquely identify each row in the table. The unique key is hashed to produce a deterministic split of the dataset. | required |
| test_size | float | The fraction of the dataset to include in the test split; must be between 0 and 1. The ratio is approximate because hashing may not distribute rows uniformly across buckets for small datasets; larger datasets yield bucket assignments closer to the requested ratio. | 0.25 |
| num_buckets | int | The number of buckets the data is divided into during the split. A larger value gives finer-grained bucket assignments, and therefore a test fraction closer to test_size, at some computational cost. | 100 |
| random_seed | int \| None | Seed for the random number generator. If provided, ensures reproducibility of the split. | None |
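Because rows are allocated by whole buckets, the achievable test fraction is quantized to multiples of 1 / num_buckets. The arithmetic below illustrates this with a hypothetical rounding rule; the library's exact rounding may differ.

>>> num_buckets, test_size = 20, 0.33
>>> test_buckets = round(test_size * num_buckets)  # 7 buckets (hypothetical rounding)
>>> test_buckets / num_buckets  # effective test fraction
0.35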

Returns

| Type | Description |
|------|-------------|
| tuple[ir.Table, ir.Table] | A tuple containing two Ibis tables: (train_table, test_table). |

Raises

| Type | Description |
|------|-------------|
| ValueError | If test_size is not a float between 0 and 1. |

Examples

>>> import ibis
>>> import ibis_ml as ml

Split an Ibis table into training and testing tables.

>>> table = ibis.memtable({"key1": range(100)})
>>> train_table, test_table = ml.train_test_split(
...     table,
...     unique_key="key1",
...     test_size=0.2,
...     random_seed=0,
... )
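
The returned objects are ordinary Ibis tables, so the split can be inspected like any other table. Because assignment is hash-based, the row counts are only approximately an 80/20 split for a table this small; the exact numbers depend on how the keys hash.

>>> n_train = train_table.count().execute()
>>> n_test = test_table.count().execute()
>>> assert n_train + n_test == 100  # every row lands in exactly one split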