Encoding
Encoding of categorical and string columns
OneHotEncode
OneHotEncode(self, inputs, *, min_frequency=None, max_categories=None)
A step for one-hot encoding select columns.
The original input column is dropped, and one new column is created per category, with names like {input_column}_{category}. Unknown categories are ignored during transformation; the resulting one-hot encoded columns for this feature will be all zeros.
Parameters
Name | Type | Description | Default |
---|---|---|---|
inputs | SelectionType | A selection of columns to one-hot encode. | required |
min_frequency | int \| float \| None | A minimum frequency of elements in the training set required to treat a value as a distinct category. May be either an integer (a minimum number of samples required) or a float in [0, 1] (a minimum fraction of samples required). Defaults to None for no minimum frequency. | None |
max_categories | int \| None | A maximum number of categories to include. If set, only the most frequent max_categories categories are kept. | None |
Examples
>>> import ibis_ml as ml
One-hot encode all string columns.
>>> step = ml.OneHotEncode(ml.string())
One-hot encode a specific column, only including categories with at least 20 samples.
>>> step = ml.OneHotEncode("x", min_frequency=20)
One-hot encode a specific column, including at most 10 categories.
>>> step = ml.OneHotEncode("x", max_categories=10)
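To make the behavior described above concrete, the following is a minimal plain-Python sketch of one-hot encoding with a frequency filter. It is not the library's implementation: the helpers fit_categories and one_hot are hypothetical, and the way min_frequency and max_categories are combined here (filter first, then truncate) is an assumption.

```python
from collections import Counter


def fit_categories(values, min_frequency=None, max_categories=None):
    """Pick the categories to keep, most frequent first (hypothetical helper)."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if isinstance(min_frequency, float):
        # A float is read as a minimum fraction of samples.
        threshold = min_frequency * total
    else:
        # An integer is read as a minimum number of samples; None means no minimum.
        threshold = min_frequency or 0
    kept = [(c, n) for c, n in counts.items() if n >= threshold]
    # Sort by descending count, then by name, for a deterministic order.
    kept.sort(key=lambda item: (-item[1], str(item[0])))
    if max_categories is not None:
        kept = kept[:max_categories]
    return [c for c, _ in kept]


def one_hot(values, categories, column):
    """Return one dict per row with a 0/1 entry for each kept category."""
    return [
        {f"{column}_{c}": int(v == c) for c in categories}
        for v in values
    ]


train = ["a", "b", "a", "c", "a", "b"]
categories = fit_categories(train, min_frequency=2)  # "c" appears only once
rows = one_hot(["a", "z"], categories, "x")          # "z" was never seen
```

In this sketch the unseen value "z" produces a row whose x_a and x_b entries are both 0, which mirrors the all-zeros behavior for unknown categories.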
OrdinalEncode
OrdinalEncode(self, inputs, *, min_frequency=None, max_categories=None)
A step for encoding select columns as integer arrays.
Parameters
Name | Type | Description | Default |
---|---|---|---|
inputs | SelectionType | A selection of columns to ordinal encode. | required |
min_frequency | int \| float \| None | A minimum frequency of elements in the training set required to treat a value as a distinct category. May be either an integer (a minimum number of samples required) or a float in [0, 1] (a minimum fraction of samples required). Defaults to None for no minimum frequency. | None |
max_categories | int \| None | A maximum number of categories to include. If set, only the most frequent max_categories categories are kept. | None |
Examples
>>> import ibis_ml as ml
Ordinal encode all string columns.
>>> step = ml.OrdinalEncode(ml.string())
Ordinal encode a specific column, only including categories with at least 20 samples.
>>> step = ml.OrdinalEncode("x", min_frequency=20)
Ordinal encode a specific column, including at most 10 categories.
>>> step = ml.OrdinalEncode("x", max_categories=10)
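As an illustration of what ordinal encoding does, here is a small plain-Python sketch. It is not the library's implementation: fit_ordinal and ordinal_encode are hypothetical helpers, and both the code order (sorted category names) and the use of None for unseen values are assumptions.

```python
def fit_ordinal(values):
    """Map each distinct non-null value to an integer code (hypothetical helper)."""
    # Assumption: codes follow the sorted order of the category names.
    categories = sorted({v for v in values if v is not None})
    return {c: i for i, c in enumerate(categories)}


def ordinal_encode(values, codes):
    """Replace each value with its integer code."""
    # Values not seen during fitting have no code; None marks them here.
    return [codes.get(v) for v in values]


codes = fit_ordinal(["red", "green", "blue", "green"])
encoded = ordinal_encode(["green", "blue", "purple"], codes)
```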
CountEncode
CountEncode(self, inputs)
A step for count encoding select columns.
Parameters
Name | Type | Description | Default |
---|---|---|---|
inputs | SelectionType | A selection of columns to count encode. | required |
Examples
>>> import ibis_ml as ml
Count encode all string columns.
>>> step = ml.CountEncode(ml.string())
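Count encoding replaces each category with the number of times it occurs in the training data. The sketch below shows the idea in plain Python; fit_counts and count_encode are hypothetical helpers rather than library functions, and mapping unseen values to 0 is an assumption.

```python
from collections import Counter


def fit_counts(values):
    """Count how often each non-null value occurs in the training data."""
    return Counter(v for v in values if v is not None)


def count_encode(values, counts):
    """Replace each value with its training-set count."""
    # A Counter returns 0 for keys it has never seen.
    return [counts[v] for v in values]


counts = fit_counts(["a", "b", "a", "c", "a"])
encoded = count_encode(["a", "b", "z"], counts)
```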
TargetEncode
TargetEncode(self, inputs, smooth=0.0)
A step for target encoding select columns.
Parameters
Name | Type | Description | Default |
---|---|---|---|
inputs | SelectionType | A selection of columns to target encode. | required |
smooth | float | How strongly each category's target mean is blended with the global target mean. A larger smooth value puts more weight on the global target mean. | 0.0 |
Examples
>>> import ibis_ml as ml
Target encode all string columns.
>>> step = ml.TargetEncode(ml.string())
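The effect of smooth can be sketched with a commonly used smoothing formula, (sum of targets in the category + smooth * global mean) / (count in the category + smooth). Whether the library uses exactly this formula is an assumption, and fit_target is a hypothetical helper written only for illustration.

```python
from collections import defaultdict


def fit_target(values, targets, smooth=0.0):
    """Compute a smoothed target mean per category (hypothetical helper)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    global_mean = sum(targets) / len(targets)
    # Assumed formula: blend the category mean with the global mean,
    # weighting the global mean by `smooth` pseudo-samples.
    encoding = {
        c: (sums[c] + smooth * global_mean) / (counts[c] + smooth)
        for c in counts
    }
    return encoding, global_mean


values = ["a", "a", "b", "b"]
targets = [1.0, 1.0, 0.0, 1.0]
plain, global_mean = fit_target(values, targets)          # smooth=0.0
smoothed, _ = fit_target(values, targets, smooth=2.0)
```

With smooth=0.0 each category maps to its own target mean; with smooth=2.0 both categories move toward the global mean, consistent with the parameter description.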