Data splitting
Segregating data into training, testing, and validation sets
train_test_split
train_test_split(
    table,
    unique_key,
    test_size=0.25,
    num_buckets=100,
    random_seed=None,
)
Randomly split Ibis table data into training and testing tables.
This function splits an Ibis table into training and testing tables based on a unique key or combination of keys. It uses a hashing function to convert the unique key into an integer, then applies a modulo operation to split the data into buckets. The training table consists of data points from a subset of these buckets, while the remaining data points form the test table.
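To make the mechanism concrete, here is a minimal sketch of the bucket idea in plain Python. It is not ibis-ml's actual implementation (the library expresses the hashing as an Ibis expression that runs in the backend); the helper name `assign_split` and the use of `zlib.crc32` are illustrative assumptions only.

```python
import zlib


def assign_split(key, test_size=0.25, num_buckets=100, random_seed=None):
    """Illustrative only: deterministically assign a row to 'test' or 'train'."""
    # Salt the key with the seed so different seeds give different (but still
    # reproducible) splits, then hash to a stable integer.
    hashed = zlib.crc32(f"{key}-{random_seed}".encode())
    bucket = hashed % num_buckets  # map the hash into one of num_buckets buckets
    # The lowest test_size * num_buckets buckets form the test set; the rest train.
    return "test" if bucket < int(test_size * num_buckets) else "train"


splits = [assign_split(k, test_size=0.25, random_seed=0) for k in range(1_000)]
print(splits.count("test") / len(splits))  # close to 0.25, but only approximately
```

Because rows are assigned by hashing the key rather than by exact counting, the realized test fraction only approaches `test_size` as the number of rows grows, which is why the ratio is documented as approximate.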
Parameters
Name | Type | Description | Default |
---|---|---|---|
table | ir.Table | The input Ibis table to be split. | required |
unique_key | str | list[str] | The column name(s) that uniquely identify each row in the table. This unique_key is used to create a deterministic split of the dataset through a hashing process. | required |
test_size | float | The ratio of the dataset to include in the test split, which should be between 0 and 1. This ratio is approximate because the hashing algorithm may not provide a uniform bucket distribution for small datasets. Larger datasets will result in more uniform bucket assignments, making the split ratio closer to the desired value. | 0.25 |
num_buckets | int | The number of buckets into which the data is divided during the splitting process. It controls how finely the data is divided into buckets during the split process. Adjusting num_buckets can affect the granularity and efficiency of the splitting operation, balancing between accuracy and computational efficiency. | 100 |
random_seed | int | None | Seed for the random number generator. If provided, ensures reproducibility of the split. | None |
Returns
Type | Description |
---|---|
tuple[ir.Table, ir.Table] | A tuple containing two Ibis tables: (train_table, test_table). |
Raises
Type | Description |
---|---|
ValueError | If test_size is not a float between 0 and 1. |
Examples
>>> import ibis
>>> import ibis_ml as ml
Split an Ibis table into training and testing tables.
>>> table = ibis.memtable({"key1": range(100)})
>>> train_table, test_table = ml.train_test_split(
...     table,
...     unique_key="key1",
...     test_size=0.2,
...     random_seed=0,
... )
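Since every row lands in exactly one bucket, the two tables always partition the input. As a quick sanity check (a sketch; this executes the counts, assuming the default in-memory backend is available):

>>> total = train_table.count().execute() + test_table.count().execute()
>>> assert total == 100  # train and test together cover all 100 input rows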