Encoding

Encoding of categorical and string columns

OneHotEncode

OneHotEncode(self, inputs, *, min_frequency=None, max_categories=None)

A step for one-hot encoding select columns.

Each input column is dropped and replaced by one new column per category, with names like {input_column}_{category}. Categories that were not seen during fitting are ignored during transformation: for those rows, every one-hot encoded column for that feature is zero.
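The behavior described above can be illustrated with a small pure-Python sketch. It is not the ibis_ml implementation; the helper names `fit_categories` and `one_hot` are hypothetical, and sorting the categories is an assumption made only to keep the output deterministic.

```python
def fit_categories(values):
    """Collect the distinct categories seen in the training data (sorted here
    only for a deterministic column order; this ordering is an assumption)."""
    return sorted(set(values))


def one_hot(column, values, categories):
    """Build one {column}_{category} indicator list per known category."""
    return {
        f"{column}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


train = ["red", "green", "red", "blue"]
categories = fit_categories(train)

# "purple" was not seen during fitting, so its row is zero in every column.
encoded = one_hot("color", ["green", "purple"], categories)
```

Here `encoded` has the columns `color_blue`, `color_green`, and `color_red`, and the second row (the unknown category) is zero in each of them.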

Parameters

Name Type Description Default
inputs SelectionType A selection of columns to one-hot encode. required
min_frequency int | float | None The minimum frequency a value must have in the training set to be treated as a distinct category. May be either an integer, giving the minimum number of samples required, or a float in [0, 1], giving the minimum fraction of samples required. Defaults to None for no minimum frequency. None
max_categories int | None A maximum number of categories to include. If set, only the most frequent max_categories categories are kept. None
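As a rough illustration of how the two limits could interact, the sketch below filters categories by frequency and then keeps the most frequent ones. It is an assumption-laden sketch, not the library's code: `select_categories` is a hypothetical helper, and applying `min_frequency` before `max_categories` is an assumed order.

```python
from collections import Counter


def select_categories(values, min_frequency=None, max_categories=None):
    """Return the categories that survive both limits, most frequent first."""
    counts = Counter(values)
    if min_frequency is not None:
        # A float is read as a fraction of all samples, an int as a raw count.
        if isinstance(min_frequency, float):
            threshold = min_frequency * len(values)
        else:
            threshold = min_frequency
        counts = Counter({c: n for c, n in counts.items() if n >= threshold})
    # most_common(None) returns every remaining category.
    return [c for c, _ in counts.most_common(max_categories)]


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]

by_count = select_categories(values, min_frequency=2)
by_fraction = select_categories(values, min_frequency=0.25)
top_one = select_categories(values, max_categories=1)
```

With 11 samples, `min_frequency=2` drops only "d", `min_frequency=0.25` requires at least 2.75 samples and so keeps "a" and "b", and `max_categories=1` keeps only "a".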

Examples

>>> import ibis_ml as ml

One-hot encode all string columns.

>>> step = ml.OneHotEncode(ml.string())

One-hot encode a specific column, only including categories with at least 20 samples.

>>> step = ml.OneHotEncode("x", min_frequency=20)

One-hot encode a specific column, including at most 10 categories.

>>> step = ml.OneHotEncode("x", max_categories=10)

OrdinalEncode

OrdinalEncode(self, inputs, *, min_frequency=None, max_categories=None)

A step for encoding select columns as integer arrays.
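The idea of ordinal encoding can be sketched in a few lines of plain Python. This is only an illustration: `fit_ordinal` and `ordinal_encode` are hypothetical helpers, and both the code order (sorted categories) and the handling of unseen values (mapped to `None`) are assumptions rather than documented ibis_ml behavior.

```python
def fit_ordinal(values):
    """Assign each distinct training category an integer code.
    Codes follow sorted order here, which is an assumption."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


def ordinal_encode(values, mapping, unknown=None):
    """Replace each value with its integer code; unseen values get `unknown`."""
    return [mapping.get(v, unknown) for v in values]


mapping = fit_ordinal(["low", "high", "medium", "low"])
codes = ordinal_encode(["low", "high", "extreme"], mapping)
```

Each category becomes a single integer in one output column, in contrast to one-hot encoding, which produces one column per category.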

Parameters

Name Type Description Default
inputs SelectionType A selection of columns to ordinal encode. required
min_frequency int | float | None The minimum frequency a value must have in the training set to be treated as a distinct category. May be either an integer, giving the minimum number of samples required, or a float in [0, 1], giving the minimum fraction of samples required. Defaults to None for no minimum frequency. None
max_categories int | None A maximum number of categories to include. If set, only the most frequent max_categories categories are kept. None

Examples

>>> import ibis_ml as ml

Ordinal encode all string columns.

>>> step = ml.OrdinalEncode(ml.string())

Ordinal encode a specific column, only including categories with at least 20 samples.

>>> step = ml.OrdinalEncode("x", min_frequency=20)

Ordinal encode a specific column, including at most 10 categories.

>>> step = ml.OrdinalEncode("x", max_categories=10)

CountEncode

CountEncode(self, inputs)

A step for count encoding select columns.
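Count encoding replaces each category with the number of times it appears in the training data. The sketch below shows the idea in plain Python; `fit_counts` and `count_encode` are hypothetical helpers, and mapping unseen categories to 0 is an assumption, not documented ibis_ml behavior.

```python
from collections import Counter


def fit_counts(values):
    """Count how often each category occurs in the training data."""
    return Counter(values)


def count_encode(values, counts):
    """Replace each value with its training count (0 if never seen)."""
    return [counts.get(v, 0) for v in values]


counts = fit_counts(["a", "b", "a", "c", "a"])
encoded_counts = count_encode(["a", "c", "z"], counts)
```

Because the counts come from the training data only, the same category always maps to the same number at transform time.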

Parameters

Name Type Description Default
inputs SelectionType A selection of columns to count encode. required

Examples

>>> import ibis_ml as ml

Count encode all string columns.

>>> step = ml.CountEncode(ml.string())

TargetEncode

TargetEncode(self, inputs, smooth=0.0)

A step for target encoding select columns.

Parameters

Name Type Description Default
inputs SelectionType A selection of columns to target encode. required
smooth float The amount of mixing of the target mean conditioned on the value of the category with the global target mean. A larger smooth value will put more weight on the global target mean. 0.0
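One common way to mix a per-category mean with the global mean is the additive (m-estimate) form `(sum_c + smooth * global_mean) / (n_c + smooth)`. The sketch below uses that formula as an assumption; the source does not state which formula ibis_ml applies, and `fit_target_encoding` is a hypothetical helper.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smooth=0.0):
    """Map each category to a smoothed mean of the target.

    Assumed formula: (sum_c + smooth * global_mean) / (n_c + smooth).
    With smooth=0 this is the plain per-category mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }


cats = ["a", "a", "b", "b"]
ys = [1.0, 1.0, 0.0, 1.0]

unsmoothed = fit_target_encoding(cats, ys, smooth=0.0)
smoothed = fit_target_encoding(cats, ys, smooth=2.0)
```

The global mean here is 0.75. Without smoothing, "a" and "b" encode to their own means (1.0 and 0.5); with `smooth=2.0` both values move toward 0.75, consistent with the parameter description that a larger `smooth` puts more weight on the global target mean.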

Examples

>>> import ibis_ml as ml

Target encode all string columns.

>>> step = ml.TargetEncode(ml.string())