Encoding
Encoding of categorical and string columns
OneHotEncode
OneHotEncode(self, inputs, *, min_frequency=None, max_categories=None)

A step for one-hot encoding select columns.
The original input column is dropped, and one new column is created per category, with names like {input_column}_{category}. Unknown categories are ignored during transformation: the one-hot encoded columns for that feature will be all zeros.
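The behavior described above can be illustrated with a minimal pure-Python sketch. This is not how ibis-ml implements the step; the helper `one_hot_encode`, the row format, and the sample data are hypothetical and only mirror the documented semantics (original column dropped, `{input_column}_{category}` naming, all zeros for unknown categories).

```python
def one_hot_encode(rows, column, categories):
    """One-hot encode `column` in a list of dict rows (illustrative helper)."""
    encoded = []
    for row in rows:
        # Drop the original input column.
        out = {k: v for k, v in row.items() if k != column}
        # Add one indicator column per known category, named {column}_{category}.
        for cat in categories:
            out[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(out)
    return encoded


# Categories assumed to have been learned from hypothetical training data.
categories = ["blue", "red"]
rows = [{"id": 1, "color": "red"}, {"id": 2, "color": "green"}]
result = one_hot_encode(rows, "color", categories)
```

Here "green" was not seen during training, so the second row gets 0 in both `color_blue` and `color_red`.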
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | SelectionType | A selection of columns to one-hot encode. | required |
| min_frequency | int \| float \| None | A minimum frequency of elements in the training set required to treat a value as a distinct category. May be either an integer, representing a minimum number of samples required, or a float in [0, 1], representing a minimum fraction of samples required. Defaults to None for no minimum frequency. | None |
| max_categories | int \| None | A maximum number of categories to include. If set, only the most frequent max_categories categories are kept. | None |
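To make the two filtering parameters concrete, here is a rough sketch of how a set of categories could be selected from training values. The function `select_categories` is hypothetical, and details such as the inclusive `>=` threshold are assumptions rather than confirmed ibis-ml behavior.

```python
from collections import Counter


def select_categories(values, min_frequency=None, max_categories=None):
    """Pick the categories to keep, most frequent first (illustrative helper)."""
    counts = Counter(values)
    total = len(values)
    if min_frequency is not None:
        # A float is read as a fraction of samples, an int as a sample count.
        if isinstance(min_frequency, float):
            threshold = min_frequency * total
        else:
            threshold = min_frequency
        # Assumption: a category exactly at the threshold is kept.
        counts = Counter({c: n for c, n in counts.items() if n >= threshold})
    # most_common(None) returns every remaining category.
    return [cat for cat, _ in counts.most_common(max_categories)]


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
by_count = select_categories(values, min_frequency=3)
by_fraction = select_categories(values, min_frequency=0.5)
top_two = select_categories(values, max_categories=2)
```

With this data, a minimum count of 3 keeps "a" and "b", a minimum fraction of 0.5 keeps only "a", and a cap of two categories keeps the two most frequent ones.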
Examples
>>> import ibis_ml as ml

One-hot encode all string columns.

>>> step = ml.OneHotEncode(ml.string())

One-hot encode a specific column, only including categories with at least 20 samples.

>>> step = ml.OneHotEncode("x", min_frequency=20)

One-hot encode a specific column, including at most 10 categories.

>>> step = ml.OneHotEncode("x", max_categories=10)

OrdinalEncode
OrdinalEncode(self, inputs, *, min_frequency=None, max_categories=None)

A step for encoding select columns as integer arrays.
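As a rough illustration of ordinal encoding in general, the sketch below maps each category to an integer code. The helpers `fit_ordinal` and `ordinal_encode` are hypothetical; the sorted ordering of codes and the use of `None` for unseen categories are assumptions, since this page does not state how ibis-ml handles either.

```python
def fit_ordinal(values):
    """Learn a category -> integer mapping (illustrative helper)."""
    # Assumption: codes follow the sorted order of the distinct categories.
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


def ordinal_encode(values, mapping):
    """Replace each value with its integer code."""
    # Assumption: categories not seen during fitting become None.
    return [mapping.get(v) for v in values]


mapping = fit_ordinal(["low", "high", "medium", "low"])
encoded = ordinal_encode(["medium", "low", "unknown"], mapping)
```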
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | SelectionType | A selection of columns to ordinal encode. | required |
| min_frequency | int \| float \| None | A minimum frequency of elements in the training set required to treat a value as a distinct category. May be either an integer, representing a minimum number of samples required, or a float in [0, 1], representing a minimum fraction of samples required. Defaults to None for no minimum frequency. | None |
| max_categories | int \| None | A maximum number of categories to include. If set, only the most frequent max_categories categories are kept. | None |
Examples
>>> import ibis_ml as ml

Ordinal encode all string columns.

>>> step = ml.OrdinalEncode(ml.string())

Ordinal encode a specific column, only including categories with at least 20 samples.

>>> step = ml.OrdinalEncode("x", min_frequency=20)

Ordinal encode a specific column, including at most 10 categories.

>>> step = ml.OrdinalEncode("x", max_categories=10)

CountEncode
CountEncode(self, inputs)

A step for count encoding select columns.
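Count encoding, as the term is generally used, replaces each category with the number of times it occurred in the training data. The sketch below shows that idea in plain Python; the helpers `fit_counts` and `count_encode` are hypothetical, and encoding unseen categories as 0 is an assumption that this page does not confirm for ibis-ml.

```python
from collections import Counter


def fit_counts(values):
    """Count how often each category appears in the training values."""
    return Counter(values)


def count_encode(values, counts):
    """Replace each value with its training-set count."""
    # A Counter returns 0 for missing keys, so unseen categories become 0
    # (an assumption about unseen-category handling, not documented behavior).
    return [counts[v] for v in values]


counts = fit_counts(["a", "b", "a", "c", "a"])
encoded = count_encode(["a", "c", "z"], counts)
```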
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | SelectionType | A selection of columns to count encode. | required |
Examples
>>> import ibis_ml as ml

Count encode all string columns.

>>> step = ml.CountEncode(ml.string())

TargetEncode
TargetEncode(self, inputs, smooth=0.0)

A step for target encoding select columns.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | SelectionType | A selection of columns to target encode. | required |
| smooth | float | How strongly the target mean conditioned on the category is mixed with the global target mean. A larger smooth value puts more weight on the global target mean. | 0.0 |
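One common way to express this kind of mixing is the m-estimate formula `(category_sum + smooth * global_mean) / (category_count + smooth)`. The sketch below uses that formula as an assumption; this page does not state the exact formula ibis-ml applies, and the helper `fit_target_encoding` and the sample data are hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smooth=0.0):
    """Compute a smoothed target mean per category (illustrative helper)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Assumed m-estimate: smooth=0 gives the plain per-category mean, and
    # larger values pull every category toward the global mean.
    return {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }


categories = ["a", "a", "b", "b"]
targets = [1.0, 1.0, 0.0, 1.0]
unsmoothed = fit_target_encoding(categories, targets, smooth=0.0)
smoothed = fit_target_encoding(categories, targets, smooth=2.0)
```

With a global mean of 0.75, the unsmoothed encodings are the per-category means (1.0 and 0.5), while `smooth=2.0` moves both toward 0.75, consistent with the description of the parameter.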
Examples
>>> import ibis_ml as ml

Target encode all string columns.

>>> step = ml.TargetEncode(ml.string())