Outlier handling

Outlier detection and handling

HandleUnivariateOutliers

HandleUnivariateOutliers(
    self,
    inputs,
    *,
    method='z-score',
    treatment='capping',
    deviation_factor=3,
)

A step for detecting and treating univariate outliers in numeric columns.

Name	Type	Description	Default
inputs	SelectionType	A selection of columns to analyze for outliers. All columns must be numeric.	required
method	str	The method to use for detecting outliers. “z-score” flags values by how many standard deviations they lie from the mean and suits approximately normally distributed data. “IQR” flags values using the interquartile range and suits skewed data.	`'z-score'`
treatment	str	The treatment to apply to the outliers. `capping` replaces outlier values with the upper or lower bound, while `trimming` removes outlier rows from the dataset.	`'capping'`
deviation_factor	int | float	The multiplier that sets how far the upper and lower bounds lie from the center of the data. For “z-score”: `Upper Bound = mean + deviation_factor * standard deviation` and `Lower Bound = mean - deviation_factor * standard deviation`. For normally distributed data, about 68% of values lie within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. For “IQR”: `IQR = Q3 - Q1`, `Upper Bound = Q3 + deviation_factor * IQR`, and `Lower Bound = Q1 - deviation_factor * IQR`.	`3`
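The bound formulas in the table can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names `zscore_bounds` and `iqr_bounds` are hypothetical, and the choice of population standard deviation and of the "inclusive" quartile method are assumptions that may differ from what the step computes internally.

```python
import statistics


def zscore_bounds(values, deviation_factor=3):
    """Return (lower, upper) as mean -/+ deviation_factor * standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation; a sample estimate would differ slightly.
    std = statistics.pstdev(values)
    return mean - deviation_factor * std, mean + deviation_factor * std


def iqr_bounds(values, deviation_factor=3):
    """Return (lower, upper) as Q1 - k * IQR and Q3 + k * IQR, with IQR = Q3 - Q1."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - deviation_factor * iqr, q3 + deviation_factor * iqr


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2

z_lower, z_upper = zscore_bounds(data, deviation_factor=3)
i_lower, i_upper = iqr_bounds(data, deviation_factor=2.0)
```

For this sample the z-score bounds are wider than the IQR bounds, so the value 9 counts as an outlier only under the IQR method with a factor of 2.0.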

>>> import ibis_ml as ml

Cap outliers in all numeric columns using the z-score method.

>>> step = ml.HandleUnivariateOutliers(ml.numeric())

Trim outliers in a specific set of columns using the IQR method.

>>> step = ml.HandleUnivariateOutliers(
...     ["x", "y"],
...     method="IQR",
...     deviation_factor=2.0,
...     treatment="trimming",
... )
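To make the difference between the two treatments concrete, here is a minimal sketch on a plain Python list. The helpers `cap` and `trim` are hypothetical and stand in for what the step does to a table column; treating values exactly equal to a bound as non-outliers is an assumption.

```python
def cap(values, lower, upper):
    """Capping: replace each value outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def trim(values, lower, upper):
    """Trimming: drop every value outside [lower, upper]."""
    # Assumption: values equal to a bound are kept.
    return [v for v in values if lower <= v <= upper]


values = [-5, 2, 4, 7, 12]
capped = cap(values, lower=0, upper=10)    # same length, extremes pulled to the bounds
trimmed = trim(values, lower=0, upper=10)  # shorter, extremes removed
```

Capping preserves the number of rows, which matters when other columns must stay aligned, while trimming shrinks the dataset.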