Data splitting
Segregating data into training, testing, and validation sets
train_test_split
train_test_split(
    table,
    unique_key,
    test_size=0.25,
    num_buckets=100,
    random_seed=None,
)
Randomly split Ibis table data into training and testing tables.
This function splits an Ibis table into training and testing tables based on a unique key or combination of keys. It uses a hashing function to convert the unique key into an integer, then applies a modulo operation to split the data into buckets. The training table consists of data points from a subset of these buckets, while the remaining data points form the test table.
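To make the mechanism concrete, here is a minimal sketch of the bucket idea in plain Python. It is not ibis-ml's actual implementation (the library expresses the hashing as an Ibis expression that runs in the backend); the helper name `assign_split` and the use of `zlib.crc32` are illustrative assumptions only.

```python
import zlib


def assign_split(key, test_size=0.25, num_buckets=100, random_seed=None):
    """Illustrative only: deterministically assign a row to 'test' or 'train'."""
    # Salt the key with the seed so different seeds give different (but still
    # reproducible) splits, then hash to a stable integer.
    hashed = zlib.crc32(f"{key}-{random_seed}".encode())
    bucket = hashed % num_buckets  # map the hash into one of num_buckets buckets
    # The lowest test_size * num_buckets buckets form the test set; the rest train.
    return "test" if bucket < int(test_size * num_buckets) else "train"


splits = [assign_split(k, test_size=0.25, random_seed=0) for k in range(1_000)]
print(splits.count("test") / len(splits))  # close to 0.25, but only approximately
```

Because rows are assigned by hashing the key rather than by exact counting, the realized test fraction only approaches `test_size` as the number of rows grows, which is why the ratio is documented as approximate.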
Parameters
Name | Type | Description | Default |
---|---|---|---|
table | ir.Table | The input Ibis table to be split. | required |
unique_key | str | list[str] | The column name(s) that uniquely identify each row in the table. This unique_key is used to create a deterministic split of the dataset through a hashing process. | required |
test_size | float | The ratio of the dataset to include in the test split, which should be between 0 and 1. This ratio is approximate because the hashing algorithm may not provide a uniform bucket distribution for small datasets. Larger datasets will result in more uniform bucket assignments, making the split ratio closer to the desired value. | 0.25 |
num_buckets | int | The number of buckets into which the data is divided during the splitting process. It controls how finely the data is divided into buckets during the split process. Adjusting num_buckets can affect the granularity and efficiency of the splitting operation, balancing between accuracy and computational efficiency. | 100 |
random_seed | int | None | Seed for the random number generator. If provided, ensures reproducibility of the split. | None |
Returns
Type | Description |
---|---|
tuple[ir.Table, ir.Table] | A tuple containing two Ibis tables: (train_table, test_table). |
Raises
Type | Description |
---|---|
ValueError | If test_size is not a float between 0 and 1. |
Examples
>>> import ibis
>>> import ibis_ml as ml
Split an Ibis table into training and testing tables.
>>> table = ibis.memtable({"key1": range(100)})
>>> train_table, test_table = ml.train_test_split(
...     table,
...     unique_key="key1",
...     test_size=0.2,
...     random_seed=0,
... )
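Since every row lands in exactly one bucket, the two tables always partition the input. As a quick sanity check (a sketch; this executes the counts, assuming the default in-memory backend is available):

>>> total = train_table.count().execute() + test_table.count().execute()
>>> assert total == 100  # train and test together cover all 100 input rows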