Data splitting
Segregating data into training, testing, and validation sets
train_test_split
train_test_split(table, unique_key, test_size=0.25, num_buckets=100, random_seed=None)
Randomly split Ibis table data into training and testing tables.
This function splits an Ibis table into training and testing tables based on a unique key or combination of keys. It uses a hashing function to convert the unique key into an integer, then applies a modulo operation to split the data into buckets. The training table consists of data points from a subset of these buckets, while the remaining data points form the test table.
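The hash, modulo, and bucket steps described above can be sketched in plain Python. This is a minimal illustration only, not the ibis-ml implementation: the helper names `bucket_of` and `split_by_hash`, the choice of SHA-256, and the way the seed is mixed into the key are all assumptions made for the example.

```python
import hashlib


def bucket_of(key: str, num_buckets: int = 100, seed: int = 0) -> int:
    """Map a key to a bucket in [0, num_buckets) using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    # Interpret the first 8 bytes as an integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % num_buckets


def split_by_hash(rows, key_name, test_size=0.25, num_buckets=100, seed=0):
    """Rows in the lower buckets go to train, the rest to test (hypothetical helper)."""
    train_buckets = round(num_buckets * (1 - test_size))
    train, test = [], []
    for row in rows:
        bucket = bucket_of(str(row[key_name]), num_buckets, seed)
        (train if bucket < train_buckets else test).append(row)
    return train, test


rows = [{"key1": i} for i in range(1000)]
train, test = split_by_hash(rows, "key1", test_size=0.25)
```

Because the bucket depends only on the key and the seed, the same row lands on the same side every time the split is repeated, which is the property the unique key is meant to provide.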
Parameters
Name | Type | Description | Default |
---|---|---|---|
table | ir.Table | The input Ibis table to be split. | required |
unique_key | str \| list[str] | The column name(s) that uniquely identify each row in the table. The unique_key is hashed to produce a deterministic split of the dataset. | required |
test_size | float | The fraction of the dataset to include in the test split, between 0 and 1. The ratio is approximate because hashing may not distribute rows uniformly across buckets for small datasets. Larger datasets yield more uniform bucket assignments and a split closer to the requested value. | 0.25 |
num_buckets | int | The number of buckets into which rows are hashed before splitting. It controls the granularity of the split. Adjusting num_buckets can trade split accuracy against computational efficiency. | 100 |
random_seed | int \| None | Seed for the random number generator. If provided, the split is reproducible. | None |
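To illustrate how num_buckets limits the ratios a bucket-based split can reach, the sketch below computes the closest achievable test fraction when whole buckets are assigned to the test side. The helper `achievable_test_ratio` is hypothetical, and the rounding rule is an assumption; the library may round differently.

```python
def achievable_test_ratio(test_size: float, num_buckets: int) -> float:
    """Closest test ratio when only whole buckets can be assigned (hypothetical helper)."""
    # Assumption: the number of test buckets is rounded to the nearest integer.
    test_buckets = round(test_size * num_buckets)
    return test_buckets / num_buckets


# Compare coarse and fine bucket counts for the same requested ratio.
for n in (4, 10, 100):
    print(n, achievable_test_ratio(0.33, n))
```

With few buckets the requested ratio can only be approximated in coarse steps, while more buckets let the expected split track test_size more closely. The row-level ratio still varies with how keys hash, as noted for small datasets.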
Returns
Type | Description |
---|---|
tuple[ir.Table, ir.Table] | A tuple containing two Ibis tables: (train_table, test_table). |
Raises
Type | Description |
---|---|
ValueError | If test_size is not a float between 0 and 1. |
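A check of this kind might look like the sketch below. The helper name `check_test_size` and the error message are hypothetical, and treating the bounds as exclusive is an assumption, since the description above does not say whether 0 and 1 themselves are accepted.

```python
def check_test_size(test_size) -> float:
    """Raise ValueError unless test_size is a float strictly between 0 and 1 (assumed bounds)."""
    if not (isinstance(test_size, float) and 0 < test_size < 1):
        raise ValueError("test_size should be a float between 0 and 1")
    return test_size


try:
    check_test_size(1.5)
except ValueError as exc:
    message = str(exc)
```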
Examples
>>> import ibis
>>> import ibis_ml as ml
Split an Ibis table into training and testing tables.
>>> table = ibis.memtable({"key1": range(100)})
>>> train_table, test_table = ml.train_test_split(
...     table,
...     unique_key="key1",
...     test_size=0.2,
...     random_seed=0,
... )