Preprocessors#
Impute#
tdprepview.Impute#
Impute missing values in numerical columns using different strategies.
This class supports imputing missing numerical values using the mean, median, mode, min, max, or a custom value/percentile.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['mean', 'median', 'mode', 'min', 'max', 'custom'] | The imputation strategy to use. | 'mean' |
value | Union[int, float, str] | The custom value (or percentile string, e.g. "P90") to use if kind="custom". | 0 |
Examples:
Using mean strategy:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="mean")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Using custom value:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="custom", value=100)
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Using custom percentile:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="custom", value="P90")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Notes
- During fitting, the function queries the database to compute the statistic corresponding to the chosen kind for each column, using the TD_UnivariateStatistics in-database function.
- During transform, it uses the COALESCE SQL function to replace NULLs.
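Applying the fitted pipeline afterwards is sketched below, assuming the Pipeline exposes a scikit-learn-style transform method that returns a teradataml DataFrame (the method name and return type are assumptions, not confirmed on this page):
# continue from the fitted pipeline above
DF_imputed = my_pipeline.transform(DF)  # NULLs in "my_column" replaced via COALESCE
print(DF_imputed.head())                # hypothetical inspection of the result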
tdprepview.SimpleImputer#
Impute missing values using a specified strategy.
This class provides simple imputation methods for handling missing data,
such as replacing with mean, median, most frequent value, or a constant.
It is a wrapper around Impute that mimics scikit-learn's SimpleImputer API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
strategy | Literal['mean', 'median', 'most_frequent', 'constant'] | The imputation strategy to use: 'mean', 'median', 'most_frequent', or 'constant'. | 'mean' |
fill_value | Optional[Union[int, float]] | The value to use for missing values when strategy="constant". | None |
Examples:
Using mean strategy:
import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="mean")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
Using custom constant:
import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="constant", fill_value=0)
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
Using most frequent value:
import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="most_frequent")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
Notes
- This class is an alias for Impute, preserving sklearn compatibility.
- During fitting, the underlying Impute object computes the statistic for each column using the TD_UnivariateStatistics in-database function.
- During transform, it uses the COALESCE SQL function to replace NULLs.
tdprepview.IterativeImputer#
Impute missing values using an iterative multivariate approach.
This preprocessor uses a multivariate method to impute missing values. During fitting, a sample of the data is pulled into Python to fit a scikit-learn IterativeImputer. During transform, NULLs are replaced by a weighted combination of the other columns, based on the fitted imputer.
Examples:
Impute at least two columns:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.IterativeImputer()
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2"], my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Fitting pulls a sample of the data into Python to fit scikit-learn's IterativeImputer.
- Transform replaces NULLs using the weighted combination of the other columns determined by the fitted iterative imputer.
- At least two columns must be provided for meaningful imputation.
- This class is based on scikit-learn's IterativeImputer from the impute module.
tdprepview.ImputeText#
Impute missing values in text columns using different strategies.
This class supports imputing missing text values using either the most frequent (mode) value or a custom specified value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['mode', 'custom'] | The type of imputation to perform: the most frequent value ("mode") or a custom value ("custom"). | 'mode' |
value | str | The custom value to use for imputation if kind="custom". | '' |
Examples:
Basic usage with kind "mode":
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.ImputeText(kind="mode")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Basic usage with kind "custom":
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.ImputeText(kind="custom", value="missing")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Notes
- During fitting, the function queries the database for the most frequent non-null value if kind="mode".
- During transform, it uses the COALESCE SQL function to replace NULLs.
Transform#
tdprepview.Scale#
Scale(kind='minmax', numerator_subtr=0, denominator=1, zerofinull=True, feature_range=(0, 1), clip=False)
Scale numerical values using a chosen method and parameters.
This class supports multiple scaling methods including MinMax, Z-Score, Robust, MaxAbs, and custom scaling based on a numerator subtraction and a denominator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['minmax', 'zscore', 'robust', 'custom', 'maxabs'] | The scaling method to use: 'minmax', 'zscore', 'robust', 'maxabs', or 'custom'. | 'minmax' |
numerator_subtr | Union[int, float, str] | Value to subtract from each element before scaling. Can be int, float, or string ("mean", "std", "median", "mode", "max", "min", or a percentile like "P33"). | 0 |
denominator | Union[int, float, str] | Value to divide each element by after subtraction. Can be int, float, or string (a formula composed of "mean", "std", "median", "mode", "max", "min", or percentiles like "P90"). Must not be 0 if numeric. | 1 |
zerofinull | bool | If True, output 0 when the division would return NULL. | True |
feature_range | Tuple[float, float] | Tuple of (min, max) for minmax scaling. | (0, 1) |
clip | bool | If True, clip values to the feature_range for minmax scaling. | False |
Examples:
Using MinMax scaling:
import teradataml as tdml
import tdprepview
scaler = tdprepview.Scale(kind="minmax", feature_range=(0,1))
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using custom scaling:
import teradataml as tdml
import tdprepview
scaler = tdprepview.Scale(kind="custom", numerator_subtr="mean", denominator="P75-P25")
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for the scaling formula.
- Custom scaling allows formulas with percentile strings or standard statistics.
tdprepview.StandardScaler#
Standardize features by removing the mean and scaling to unit variance.
This class is a convenience wrapper around Scale to mimic scikit-learn's StandardScaler API. It selects either "zscore" scaling or "custom" scaling based on the with_mean and with_std flags.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
with_mean | bool | If True, center the data before scaling. | True |
with_std | bool | If True, scale the data to unit variance. | True |
Examples:
Standard scaling with mean and std:
import teradataml as tdml
import tdprepview
scaler = tdprepview.StandardScaler(with_mean=True, with_std=True)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Scaling only with mean:
import teradataml as tdml
import tdprepview
scaler = tdprepview.StandardScaler(with_mean=True, with_std=False)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for the scaling formula.
tdprepview.MaxAbsScaler#
Scale each feature by its maximum absolute value.
This class is a convenience wrapper around Scale to mimic scikit-learn's MaxAbsScaler API. It is designed for data that is already centered at zero or sparse.
Examples:
Scaling with MaxAbsScaler:
import teradataml as tdml
import tdprepview
scaler = tdprepview.MaxAbsScaler()
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for scaling.
tdprepview.MinMaxScaler#
Transform features by scaling each feature to a given range.
This class is a convenience wrapper around Scale to mimic scikit-learn's MinMaxScaler API. It scales features linearly to the specified feature_range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
feature_range | Tuple[float, float] | Tuple (min, max) specifying the target range of the transformed data. | (0, 1) |
clip | bool | If True, clip values outside the feature_range. | False |
Examples:
Basic usage:
import teradataml as tdml
import tdprepview
scaler = tdprepview.MinMaxScaler(feature_range=(0, 1), clip=True)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for scaling.
tdprepview.RobustScaler#
RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)
Scale features using statistics that are robust to outliers.
This class is a convenience wrapper around Scale to mimic scikit-learn's RobustScaler API. It scales data based on the interquartile range (IQR) and optionally centers data by subtracting the median.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
with_centering | bool | Whether to center the data before scaling. | True |
with_scaling | bool | Whether to scale the data to the quantile range. | True |
quantile_range | Tuple[float, float] | Tuple of floats (q_min, q_max) specifying the quantile range. | (25.0, 75.0) |
copy | bool | Ignored. | True |
unit_variance | bool | Ignored. | False |
Examples:
Robust scaling with default centering and scaling:
import teradataml as tdml
import tdprepview
scaler = tdprepview.RobustScaler()
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Robust scaling only by centering:
import teradataml as tdml
import tdprepview
scaler = tdprepview.RobustScaler(with_centering=True, with_scaling=False)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for scaling.
- Centering subtracts the median; scaling divides by the IQR or a custom formula.
tdprepview.CutOff#
Clip numeric values that fall outside a specified range.
This preprocessor limits numeric values to a minimum and/or maximum threshold. Thresholds can be constants or derived from percentiles or summary statistics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cutoff_min | Optional[Union[int, float, str]] | Minimum allowed value. Can be int, float, a percentile string like "P33", or a summary statistic ("mean", "mode", "median", "min"). If None, no lower bound is applied. | None |
cutoff_max | Optional[Union[int, float, str]] | Maximum allowed value. Can be int, float, a percentile string like "P90", or a summary statistic ("mean", "mode", "median", "max"). If None, no upper bound is applied. | None |
Examples:
Clip values with constant bounds:
import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min=0, cutoff_max=100)
pipeline = tdprepview.Pipeline(
steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Clip values using percentiles:
import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min="P5", cutoff_max="P95")
pipeline = tdprepview.Pipeline(
steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Clip values using summary statistics:
import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min="median", cutoff_max="max")
pipeline = tdprepview.Pipeline(
steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, values outside the range are replaced by the closest value within the range using SQL primitives. For example, with cutoff_min=0 and cutoff_max=100, a value of 150 becomes 100 and a value of -5 becomes 0.
tdprepview.CustomTransformer#
Apply a custom SQL expression to a column.
This transformer allows arbitrary SQL transformations on a column using the placeholder string %%COL%%, which is replaced by the actual column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
custom_str | str | A custom SQL expression that contains the string "%%COL%%" where the column name should be inserted (see the example below). | required |
output_column_type | str | Optional. SQL data type of the resulting column. | 'FLOAT()' |
Examples:
Apply a custom SQL expression:
import teradataml as tdml
import tdprepview
transformer = tdprepview.CustomTransformer(custom_str="2 * POWER(%%COL%%, 2) + 3 * %%COL%%")
pipeline = tdprepview.Pipeline(
steps=[("my_column", transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- This transformer is stateless; nothing happens during fitting.
- Use with caution: the SQL expression is executed directly in the database.
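A further sketch using the output_column_type parameter from the table above; the SQL expression is an arbitrary illustration and the "FLOAT()" type string simply repeats the documented default:
import teradataml as tdml
import tdprepview
# square root of the absolute value, returned as a FLOAT column
transformer = tdprepview.CustomTransformer(
    custom_str="SQRT(ABS(%%COL%%))",
    output_column_type="FLOAT()"
)
pipeline = tdprepview.Pipeline(
    steps=[("my_column", transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)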
tdprepview.Normalizer#
Normalize input data row-wise.
This preprocessor scales rows individually to a specified norm, similar to scikit-learn's Normalizer: each value in a row is divided by the row's norm, so the transformed row has unit length under the chosen norm. For example, with the 'l2' norm the row (3, 4) becomes (0.6, 0.8).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
norm | Literal['max', 'l1', 'l2'] | The normalization method to use: 'l1' (sum of absolute values), 'l2' (Euclidean norm), or 'max' (maximum absolute value). | 'l2' |
Examples:
Normalize multiple columns:
import teradataml as tdml
import tdprepview
normalizer = tdprepview.Normalizer(norm="l2")
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2", "col3"], normalizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using max normalization:
import teradataml as tdml
import tdprepview
normalizer = tdprepview.Normalizer(norm="max")
pipeline = tdprepview.Pipeline(
steps=[(["colA", "colB"], normalizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Row-wise normalization is applied during transform; no statistics are computed during fitting.
- The transformation uses SQL expressions to compute the row-wise norm and divide each value accordingly.
- At least two columns must be provided for meaningful normalization.
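A plain-Python sketch of what the generated SQL computes per row (the norm definitions follow the usual scikit-learn conventions; the actual SQL expressions are built by the library and are not shown here):
def normalize_row(values, norm="l2"):
    """Illustrative row-wise normalization; mirrors the idea, not the actual SQL."""
    if norm == "l2":
        denom = sum(v ** 2 for v in values) ** 0.5
    elif norm == "l1":
        denom = sum(abs(v) for v in values)
    else:  # "max"
        denom = max(abs(v) for v in values)
    return [v / denom for v in values]

print(normalize_row([3.0, 4.0], norm="l2"))  # [0.6, 0.8]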
tdprepview.PowerTransformer#
Apply a power transform feature-wise to make data more Gaussian-like.
Power transforms are parametric, monotonic transformations applied to stabilize variance and reduce skewness. This is useful for modeling issues related to heteroscedasticity or other situations where approximate normality is desired.
Currently, the Box-Cox and the Yeo-Johnson transforms are supported:
- Box-Cox requires strictly positive data.
- Yeo-Johnson supports both positive and negative values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | Literal['yeo-johnson', 'box-cox'] | The power transform method to use: 'yeo-johnson' or 'box-cox'. | 'yeo-johnson' |
standardize | bool | Ignored. If standardization is desired, append a StandardScaler after this transformer. | False |
Examples:
Power transform a single column:
import teradataml as tdml
import tdprepview
transformer = tdprepview.PowerTransformer(method="yeo-johnson")
pipeline = tdprepview.Pipeline(
steps=[(["col1"], transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Power transform multiple columns with Box-Cox:
import teradataml as tdml
import tdprepview
transformer = tdprepview.PowerTransformer(method="box-cox")
pipeline = tdprepview.Pipeline(
steps=[(["colA", "colB"], transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, a sample of the data is pulled into Python to fit scikit-learn's PowerTransformer.
- During transform, SQL formulas are applied in Teradata to compute the power-transformed values.
- Box-Cox requires strictly positive values; Yeo-Johnson can handle negative values.
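For reference, a plain-Python sketch of the standard Box-Cox and Yeo-Johnson closed forms that the generated SQL evaluates with the lambda fitted by scikit-learn (the exact SQL emitted by the library is not shown on this page):
import math

def box_cox(x, lmbda):
    # requires x > 0
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda

def yeo_johnson(x, lmbda):
    if x >= 0:
        return math.log1p(x) if lmbda == 0 else ((x + 1) ** lmbda - 1) / lmbda
    return -math.log1p(-x) if lmbda == 2 else -(((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda))

print(yeo_johnson(-1.5, lmbda=1.3))  # handles negative values, unlike Box-Cox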
tdprepview.TargetEncoder#
TargetEncoder(target_var, categories='auto', target_type='binary', smooth='auto', cv=5, shuffle=True, random_state=None)
Encode categorical variables by replacing categories with target-based statistics.
Each category value is encoded as a smoothed estimate of the target variable mean for that category. The encoding blends the global target mean with the category-specific target mean to reduce variance, especially for infrequent categories. This implementation follows the sklearn API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_var | str | The name of the target variable used for encoding. Required. | required |
categories | Union[str, List[str]] | Categories per feature. If "auto", categories are inferred from the data. If a list, it must contain at least two categories. | 'auto' |
target_type | str | Type of the target variable: "auto", "continuous", "binary", or "multiclass". Determines how the encoding is computed. | 'binary' |
smooth | Union[str, float] | Amount of smoothing between the category means and the global mean. If a float, higher values apply stronger smoothing. If "auto", a heuristic is applied. | 'auto' |
cv | int | Number of folds for cross-validation during fitting. | 5 |
shuffle | bool | Whether to shuffle the data before splitting into folds. | True |
random_state | Optional[int] | Controls the randomness of shuffling when shuffle=True. | None |
Examples:
Encode a binary target variable:
import teradataml as tdml
import tdprepview
from tdprepview import TargetEncoder
# Create encoder with binary target
te = TargetEncoder(target_var="churn", target_type="binary")
pipeline = tdprepview.Pipeline(
steps=[(["region"], te)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","customers"))
pipeline.fit(DF)
Encode with smoothing and reproducible folds:
import teradataml as tdml
import tdprepview
from tdprepview import TargetEncoder
te = TargetEncoder(
target_var="label",
smooth=10.0,
cv=3,
shuffle=True,
random_state=42
)
pipeline = tdprepview.Pipeline(
steps=[(["product"], te)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","transactions"))
pipeline.fit(DF)
Notes
- Output column type is FLOAT.
- Requires the target variable to be present in the training DataFrame.
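A minimal sketch of the blending idea described above, assuming the scikit-learn-style formulation for a numeric smooth value (the "auto" heuristic and the cross-validated fitting are not reproduced here):
def encode_category(cat_mean, cat_count, global_mean, smooth):
    """Shrink the per-category target mean toward the global target mean."""
    weight = cat_count / (cat_count + smooth)
    return weight * cat_mean + (1 - weight) * global_mean

# a rare category (5 rows) is pulled strongly toward the global mean
print(encode_category(cat_mean=0.8, cat_count=5, global_mean=0.3, smooth=10.0))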
Discretize#
tdprepview.FixedWidthBinning#
Perform fixed-width binning on a numerical column.
This preprocessor divides numerical data into a fixed number of bins. Each bin has equal width, and values are assigned an integer representing the bin index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_bins | int | Number of bins to divide the data into. Must be greater than 1. | 5 |
lower_bound | Optional[Union[int, float]] | Optional lower bound of the binning range. If None, the column minimum is used. | None |
upper_bound | Optional[Union[int, float]] | Optional upper bound of the binning range. If None, the column maximum is used. | None |
Examples:
Basic fixed-width binning:
import teradataml as tdml
import tdprepview
binning = tdprepview.FixedWidthBinning(n_bins=5)
pipeline = tdprepview.Pipeline(
steps=[(["col1"], binning)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Fixed-width binning with custom bounds:
import teradataml as tdml
import tdprepview
binning = tdprepview.FixedWidthBinning(n_bins=4, lower_bound=0.0, upper_bound=100.0)
pipeline = tdprepview.Pipeline(
steps=[(["colA"], binning)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- The output column type is INTEGER.
- During fitting, necessary statistics (min/max) are collected with TD_UnivariateStatistics if bounds are None.
- Values are assigned to bins indexed from 0 to n_bins-1.
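A plain-Python sketch of the equal-width bin assignment implied above (bins indexed 0 to n_bins-1; how values exactly on the upper bound or outside custom bounds are handled is an assumption, not documented here):
def fixed_width_bin(x, lower, upper, n_bins):
    width = (upper - lower) / n_bins
    idx = int((x - lower) // width)
    return min(max(idx, 0), n_bins - 1)  # clamp to the valid bin range

print(fixed_width_bin(37.0, lower=0.0, upper=100.0, n_bins=4))  # bin 1 (25 <= x < 50)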
tdprepview.VariableWidthBinning#
Bin numerical data into variable-width bins.
This preprocessor supports two binning strategies:
- "quantiles": Divide the data into bins based on percentiles.
- "custom": Divide the data based on user-defined boundaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['quantiles', 'custom'] | Method of binning: "quantiles" or "custom". | 'quantiles' |
no_quantiles | int | Number of bins to use when kind="quantiles". Must be between 2 and 100. | 5 |
boundaries | Optional[List[Union[int, float]]] | List of numeric boundaries for custom binning. Must be sorted in ascending order. Required if kind="custom". | None |
Examples:
Quantile-based binning:
import teradataml as tdml
import tdprepview
binner = tdprepview.VariableWidthBinning(kind="quantiles", no_quantiles=4)
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2"], binner)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Custom boundary binning:
import teradataml as tdml
import tdprepview
binner = tdprepview.VariableWidthBinning(kind="custom", boundaries=[0, 10, 20, 50])
pipeline = tdprepview.Pipeline(
steps=[(["colA"], binner)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- The output column type is INTEGER.
- During fitting, necessary statistics (percentiles) are collected if kind="quantiles".
- Custom boundaries are used directly without computation.
- Bins are indexed starting from 0 up to the number of bins minus one.
tdprepview.QuantileTransformer#
QuantileTransformer(n_quantiles=10, output_distribution='uniform', ignore_implicit_zeros=False, subsample=None, random_state=None, copy=True)
Transform features using quantiles information.
This preprocessor transforms numerical columns by mapping values to their corresponding quantile bins. Output values are integers representing the quantile index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_quantiles | int | Number of quantiles to compute. | 10 |
output_distribution | str | Only 'uniform' is currently supported. | 'uniform' |
ignore_implicit_zeros | bool | Whether to ignore implicit zero values when computing quantiles. | False |
subsample | Optional[int] | If not None, subsample the dataset to this size when computing quantiles. Ignored. | None |
random_state | Optional[int] | Random seed. Ignored. | None |
copy | bool | Whether to copy the input array before transforming. Ignored. | True |
Examples:
Quantile binning:
import teradataml as tdml
import tdprepview
qt = tdprepview.QuantileTransformer(n_quantiles=5)
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2"], qt)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER.
- This transformer uses quantile computation to assign bins.
- Currently, only the "uniform" output distribution is supported.
tdprepview.DecisionTreeBinning#
Bin numerical data into variable-width bins based on a decision tree.
This preprocessor uses a decision tree to find optimal bin boundaries. During fitting, a sample of the data is pulled into Python to train a scikit-learn decision tree. During transform, SQL expressions are used to assign bin indices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_var | str | Column name of the target variable. Required. | required |
model_type | Literal['classification', 'regression'] | Either "classification" or "regression". Determines the type of tree used. | 'classification' |
no_bins | int | Number of bins (tree leaves). Must be between 2 and 100. | 10 |
no_rows | int | Number of rows randomly sampled for fitting the tree. Must be between 100 and 100000. | 10000 |
Examples:
Decision tree binning of a single column:
import teradataml as tdml
import tdprepview
dtb = tdprepview.DecisionTreeBinning(target_var="target", model_type="classification", no_bins=5)
pipeline = tdprepview.Pipeline(
steps=[(["feature1"], dtb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Decision tree binning of multiple columns:
import teradataml as tdml
import tdprepview
dtb = tdprepview.DecisionTreeBinning(target_var="target", model_type="regression", no_bins=8)
pipeline = tdprepview.Pipeline(
steps=[(["featureA", "featureB"], dtb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, a sample of the data is used in Python to fit a scikit-learn decision tree.
- During transform, SQL formulas are applied to assign bin indices.
- Output column type is INTEGER.
- The number of bins corresponds to the number of leaves in the fitted tree.
tdprepview.ThresholdBinarizer#
Binarize numeric data using a threshold value.
This preprocessor assigns 1 if a value exceeds the threshold and 0 otherwise.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | Union[int, float, Literal['mean', 'mode', 'median', 'P[1-100]']] | Threshold value for binarization. Can be a number (int or float) or a string: "mean", "mode", "median", or a percentile string "P[1-100]", e.g. "P33". | 'mean' |
Examples:
Binarize using mean:
import teradataml as tdml
import tdprepview
binarizer = tdprepview.ThresholdBinarizer(threshold="mean")
pipeline = tdprepview.Pipeline(
steps=[(["col1"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Binarize using a numeric threshold:
import teradataml as tdml
import tdprepview
binarizer = tdprepview.ThresholdBinarizer(threshold=50)
pipeline = tdprepview.Pipeline(
steps=[(["colA"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0 or 1).
- Threshold can be a fixed number, a summary statistic, or a percentile of the data.
tdprepview.Binarizer#
Binarize data according to a threshold (sklearn-compatible alias).
This is a convenience alias for ThresholdBinarizer that provides a scikit-learn-like API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | Union[int, float] | Threshold value to use for binarization. Must be numeric. | 0.0 |
copy | bool | Ignored. Whether to copy the input array before binarizing. | True |
Examples:
Binarize a single column:
import teradataml as tdml
import tdprepview
binarizer = tdprepview.Binarizer(threshold=10)
pipeline = tdprepview.Pipeline(
steps=[(["col1"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0 or 1).
- This class is an alias of ThresholdBinarizer for sklearn compatibility.
tdprepview.ListBinarizer#
Binarize text data based on membership in a list or top-K most frequent values.
This preprocessor assigns 1 if a text value is in the provided list of elements, or among the top-K most frequent values. Otherwise, it assigns 0.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
elements1 | Union[str, List[str]] | Either a list of strings to match, or a "TOPK" string indicating the K most frequent values, e.g. "TOP3" or "TOP10". | 'TOP3' |
Examples:
Using the top-K most frequent values:
import teradataml as tdml
import tdprepview
lb = tdprepview.ListBinarizer(elements1="TOP5")
pipeline = tdprepview.Pipeline(
steps=[(["category_col"], lb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom list:
import teradataml as tdml
import tdprepview
lb = tdprepview.ListBinarizer(elements1=["apple", "banana", "orange"])
pipeline = tdprepview.Pipeline(
steps=[(["fruit_col"], lb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0 or 1).
- Top-K values are computed in database with a query template.
tdprepview.LabelEncoder#
Encode a text column into numerical values using label encoding.
This preprocessor assigns integers to text values. Encoding can be based on either a list of custom elements or the K most frequent values in the column, indicated by a TOPK string (e.g., "TOP20").
Parameters:
Name | Type | Description | Default |
---|---|---|---|
elements | Union[str, List[str]] | Either a list of strings to use for encoding, or a TOPK string specifying the number of most frequent values to encode. | 'TOP100' |
Examples:
Using TOPK:
import teradataml as tdml
import tdprepview
le = tdprepview.LabelEncoder(elements="TOP10")
pipeline = tdprepview.Pipeline(
steps=[(["category_col"], le)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom list:
import teradataml as tdml
import tdprepview
le = tdprepview.LabelEncoder(elements=["apple", "banana", "orange"])
pipeline = tdprepview.Pipeline(
steps=[(["fruit_col"], le)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER.
- Top-K values are computed in the database with a query template during fitting.
tdprepview.SimpleHashEncoder#
Encode a text column using a hash function into INTEGER values.
This preprocessor applies a TD-built-in hash function. It is stateless and very performant, but hash collisions can occur. Optional bucketing can be applied on top of the hash value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_buckets | Optional[int] | Optional number of buckets, applied by taking the hash code modulo num_buckets. If None, the raw hash value is used. Must be an integer > 1 or None. | None |
salt | str | Optional string appended to each value before hashing, to redistribute hash values. | '' |
|
Examples:
Using raw hash values:
import teradataml as tdml
import tdprepview
hasher = tdprepview.SimpleHashEncoder()
pipeline = tdprepview.Pipeline(
steps=[(["text_col"], hasher)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using buckets and salt:
import teradataml as tdml
import tdprepview
hasher = tdprepview.SimpleHashEncoder(num_buckets=100, salt="xyz")
pipeline = tdprepview.Pipeline(
steps=[(["text_col"], hasher)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER.
- Stateless: no fitting is performed; values are hashed on transform.
- Collisions can occur if multiple values map to the same hash bucket.
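A plain-Python sketch of the salt-and-bucket logic (Python's built-in hash stands in for the Teradata hash function, which this page does not name; only the control flow is illustrated):
def hash_encode(value, num_buckets=None, salt=""):
    code = hash(value + salt)        # stand-in for the in-database hash function
    if num_buckets is None:
        return code                  # raw hash value
    return abs(code) % num_buckets   # bucketed value in [0, num_buckets)

print(hash_encode("some text", num_buckets=100, salt="xyz"))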
Feature Engineering#
tdprepview.PolynomialFeatures#
Generate polynomial and interaction features from input data.
This preprocessor creates new features by raising input columns to powers up to degree and, optionally, by including only interaction features between distinct columns. No fitting is required; transformations are applied using SQL formulas.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
degree | int | The degree of the polynomial features. Must be an integer >= 1. | 2 |
interaction_only | bool | If True, only interaction features (products of distinct columns) are generated. | False |
Examples:
Polynomial features on multiple columns:
import teradataml as tdml
import tdprepview
poly = tdprepview.PolynomialFeatures(degree=2, interaction_only=False)
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2", "col3"], poly)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is numeric.
- The class is based on scikit-learn's PolynomialFeatures from the preprocessing module.
- Interactions are computed across all specified input columns.
- No fitting is required; transformations are applied via SQL formulas.
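A small sketch of which degree-2 products an expansion over col1, col2, col3 would contain, following scikit-learn's convention (whether the bias term and the original degree-1 columns also appear in tdprepview's output is not stated on this page and is therefore an assumption):
from itertools import combinations, combinations_with_replacement

cols = ["col1", "col2", "col3"]

# degree=2, interaction_only=False: all pairwise products, including squares
print([" * ".join(pair) for pair in combinations_with_replacement(cols, 2)])

# degree=2, interaction_only=True: only products of distinct columns
print([" * ".join(pair) for pair in combinations(cols, 2)])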
tdprepview.OneHotEncoder#
OneHotEncoder(*, categories='auto', handle_unknown='ignore', min_frequency=None, max_categories=None)
One-hot encode categorical features.
This preprocessor converts categorical columns into multiple binary (0/1) columns, one per category. Categories can be inferred from the data or provided as a custom list. No fitting is required for custom categories; otherwise, necessary statistics are collected during fitting to determine categories.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
categories | Union[str, List[str]] | Either "auto" to infer categories from the data, or a list of strings specifying the categories. | 'auto' |
handle_unknown | str | Ignored. | 'ignore' |
min_frequency | Optional[int] | Ignored. | None |
max_categories | Optional[int] | Maximum number of categories to encode. Defaults to 50. | None |
|
Examples:
Automatic category detection:
import teradataml as tdml
import tdprepview
ohe = tdprepview.OneHotEncoder(categories="auto")
pipeline = tdprepview.Pipeline(
steps=[(["cat_col"], ohe)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom category list:
import teradataml as tdml
import tdprepview
ohe = tdprepview.OneHotEncoder(categories=["red", "green", "blue"])
pipeline = tdprepview.Pipeline(
steps=[(["color_col"], ohe)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0/1 per category).
- If categories="auto", the necessary statistics are collected during fitting.
- For custom categories, no fitting is required; transformations use SQL formulas.
- A maximum of 50 categories can be encoded.
tdprepview.MultiLabelBinarizer#
Multi-label binarizer for categorical features. The input column contains delimiter-separated values. The output is one indicator (0/1) variable per unique value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
classes | Optional[Union[str, List[str]]] | Categories to encode. Can be None (categories inferred during fitting), "auto" (infer categories from the training data), or a list of strings (custom categories). | None |
sparse_output | bool | Ignored. | False |
max_categories | Optional[int] | Maximum number of categories to encode. Defaults to 50. | None |
delimiter | str | Delimiter used to separate multiple values in the input string. | ', ' |
Examples:
Automatic category detection:
import teradataml as tdml
import tdprepview
mlb = tdprepview.MultiLabelBinarizer(classes="auto")
pipeline = tdprepview.Pipeline(
steps=[(["tags"], mlb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom list of classes:
import teradataml as tdml
import tdprepview
mlb = tdprepview.MultiLabelBinarizer(classes=["red", "green", "blue"])
pipeline = tdprepview.Pipeline(
steps=[(["colors"], mlb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0/1 per unique value).
- For classes="auto" or None, the necessary statistics are collected during fitting.
- For custom classes, no fitting is required; transformations use SQL formulas.
- A maximum of 50 categories can be encoded.
Dimensionality Reduction & Miscellaneous#
tdprepview.PCA#
Principal Component Analysis (PCA) reduces the dimensionality of a dataset while retaining most of its variance.
During fitting, a sample of the data is pulled into Python to fit a scikit-learn PCA. During transform, the weighted sum of components is calculated directly in SQL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_components | Union[int, str] | Number of principal components to keep. Can be an integer or "mle" (automatic selection via maximum likelihood estimation). | 'mle' |
random_state | int | Seed for the random number generator. | 42 |
Examples:
Fit PCA with a fixed number of components:
import teradataml as tdml
import tdprepview
pca = tdprepview.PCA(n_components=3)
pipeline = tdprepview.Pipeline(
steps=[(["feature1", "feature2", "feature3"], pca)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Fit PCA using MLE for component selection:
import teradataml as tdml
import tdprepview
pca = tdprepview.PCA(n_components="mle")
pipeline = tdprepview.Pipeline(
steps=[(["feature1", "feature2", "feature3"], pca)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output is a weighted combination of input columns (SQL computed).
- The number of components can be automatically inferred with "mle".
- Fitting uses a Python sample for PCA computation.
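As a rough sketch of what the SQL-side transform evaluates for a single principal component (a weighted sum of the input columns; the mean-centering mirrors scikit-learn's PCA and is an assumption about the generated SQL):
def pca_component_score(row, weights, means):
    """One component score: dot product of the centered inputs with the fitted loadings."""
    return sum(w * (x - m) for x, w, m in zip(row, weights, means))

# hypothetical fitted loadings and column means for feature1..feature3
print(pca_component_score(row=[1.0, 2.0, 3.0],
                          weights=[0.5, -0.1, 0.8],
                          means=[0.9, 1.5, 2.2]))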
tdprepview.TryCast#
Attempt to cast a text column to a specified data type using SQL TRYCAST.
This preprocessor is stateless: nothing is done during fitting. During transform, the SQL TRYCAST function is applied to each column value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_type | Literal['BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', 'TIMESTAMP(6)'] | Target data type to cast the column to. Must be one of 'BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', or 'TIMESTAMP(6)'. | 'FLOAT' |
Examples:
Convert a text column to integer:
import teradataml as tdml
import tdprepview
caster = tdprepview.TryCast(new_type="INT")
pipeline = tdprepview.Pipeline(
steps=[("text_col", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Convert a text column to timestamp:
import teradataml as tdml
import tdprepview
caster = tdprepview.TryCast(new_type="TIMESTAMP(6)")
pipeline = tdprepview.Pipeline(
steps=[("text_col", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output type is the same as new_type.
- Uses SQL TRYCAST in transform; any conversion failure results in a NULL (for example, casting the string 'abc' to INT yields NULL).
tdprepview.Cast#
Convert a column to a specified data type using SQL CAST.
This preprocessor is stateless: nothing is done during fitting. During transform, the SQL CAST function is applied to each column value. It is typically used as the last step to ensure numeric features are FLOAT. For text columns, prefer TryCast to avoid errors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_type | Literal['BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', 'TIMESTAMP(6)'] | Target data type to cast the column to. Must be one of 'BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', or 'TIMESTAMP(6)'. | 'FLOAT' |
Examples:
Convert a column to float:
import teradataml as tdml
import tdprepview
caster = tdprepview.Cast(new_type="FLOAT")
pipeline = tdprepview.Pipeline(
steps=[("col1", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Convert a column to integer:
import teradataml as tdml
import tdprepview
caster = tdprepview.Cast(new_type="INT")
pipeline = tdprepview.Pipeline(
steps=[("col1", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output type is the same as new_type.
- Uses SQL CAST in transform; invalid conversions will raise errors.