Data splitting
Segregating data into training, testing, and validation sets
train_test_split
train_test_split(
    table,
    unique_key,
    test_size=0.25,
    num_buckets=100,
    random_seed=None,
)
Randomly split Ibis table data into training and testing tables.
This function splits an Ibis table into training and testing tables based on a unique key or combination of keys. It uses a hashing function to convert the unique key into an integer, then applies a modulo operation to split the data into buckets. The training table consists of data points from a subset of these buckets, while the remaining data points form the test table.
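The hash-then-modulo idea described above can be sketched in plain Python. This is only an illustration of the technique, not the library's implementation: the helper names `bucket_of` and `split_keys`, the choice of SHA-256, and the way the seed is mixed into the hashed string are all assumptions made for this sketch.

```python
import hashlib


def bucket_of(key, num_buckets=100, random_seed=None):
    """Map a key to a bucket in [0, num_buckets) using a stable hash."""
    # Assumption of this sketch: the seed is appended to the key before hashing.
    payload = f"{key}|{random_seed}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Interpret the first 8 bytes as an integer, then reduce modulo num_buckets.
    return int.from_bytes(digest[:8], "big") % num_buckets


def split_keys(keys, test_size=0.25, num_buckets=100, random_seed=None):
    """Partition keys into (train, test) lists by bucket membership."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a float between 0 and 1")
    # Buckets numbered below this cutoff go to the test set.
    test_buckets = round(num_buckets * test_size)
    train, test = [], []
    for key in keys:
        if bucket_of(key, num_buckets, random_seed) < test_buckets:
            test.append(key)
        else:
            train.append(key)
    return train, test


train, test = split_keys(range(1000), test_size=0.2, random_seed=0)
print(len(train), len(test))
```

Because the bucket depends only on the key and the seed, the same row lands on the same side of the split every time the function runs, which is what makes a hash-based split deterministic.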
Parameters
| Name | Type | Description | Default | 
|---|---|---|---|
| table | ir.Table | The input Ibis table to be split. | required | 
| unique_key | str \| list[str] | The column name(s) that uniquely identify each row in the table. This unique_key is used to create a deterministic split of the dataset through a hashing process. | required | 
| test_size | float | The ratio of the dataset to include in the test split, which should be between 0 and 1. This ratio is approximate because the hashing algorithm may not provide a uniform bucket distribution for small datasets. Larger datasets will result in more uniform bucket assignments, making the split ratio closer to the desired value. | 0.25 | 
| num_buckets | int | The number of buckets into which the hashed keys are divided. It sets the granularity of the split: adjusting num_buckets trades off how closely the split can match test_size against computational efficiency. | 100 | 
| random_seed | int \| None | Seed for the random number generator. If provided, ensures reproducibility of the split. | None | 
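To see why test_size is only approximate for small datasets, the following sketch measures the realized test share for different row counts. It is a stand-alone illustration under stated assumptions: the helper `realized_test_fraction` is hypothetical, it hashes integer keys with MD5, and it does not call the library.

```python
import hashlib


def realized_test_fraction(n_rows, test_size=0.25, num_buckets=100):
    """Return the share of n_rows integer keys that fall into test buckets."""
    cutoff = round(num_buckets * test_size)  # buckets below cutoff form the test set
    in_test = 0
    for key in range(n_rows):
        digest = hashlib.md5(str(key).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        in_test += bucket < cutoff
    return in_test / n_rows


# Compare the realized share against the requested 0.25 at several sizes.
for n in (20, 1_000, 100_000):
    print(n, realized_test_fraction(n))
```

In a scheme like this one, the achievable ratios are multiples of 1 / num_buckets, so a small num_buckets limits how closely any dataset can match the requested test_size.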
Returns
| Name | Type | Description | 
|---|---|---|
|  | tuple[ir.Table, ir.Table] | A tuple containing two Ibis tables: (train_table, test_table). | 
Raises
| Name | Type | Description | 
|---|---|---|
|  | ValueError | If test_size is not a float between 0 and 1. | 
Examples
>>> import ibis
>>> import ibis_ml as ml
Split an Ibis table into training and testing tables.
>>> table = ibis.memtable({"key1": range(100)})
>>> train_table, test_table = ml.train_test_split(
...     table,
...     unique_key="key1",
...     test_size=0.2,
...     random_seed=0,
... )