Preprocessors#

Impute#

tdprepview.Impute #

Impute(kind='mean', value=0)

Impute missing values in numerical columns using different strategies.

This class supports imputing missing numerical values using the mean, median, mode, min, max, or a custom value/percentile.
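
As a rough, hypothetical illustration of what the fitted imputation reduces to (the actual view SQL generated by tdprepview may differ):

fitted_mean = 12.3                                  # statistic computed during fit (illustrative value)
sql_expr = f"COALESCE(my_column, {fitted_mean})"    # NULLs in my_column are replaced with the statistic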

Parameters:

Name Type Description Default
kind Literal['mean', 'median', 'mode', 'min', 'max', 'custom']

The imputation strategy to use. Default is "mean". Valid options:

  • "mean": Replace missing values using the mean of the column.
  • "median": Replace missing values using the median of the column.
  • "mode": Replace missing values using the mode of the column.
  • "min": Replace missing values using the minimum value of the column.
  • "max": Replace missing values using the maximum value of the column.
  • "custom": Replace missing values using the provided value (constant or percentile string).
'mean'
value Union[int, float, str]

The custom value to use if kind="custom". Can be a number (int/float) or a percentile string like "P50". Ignored for other strategies. Default is 0.

0

Examples:

Using mean strategy:

import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="mean")
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)

Using custom value:

import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="custom", value=100)
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)

Using custom percentile:

import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="custom", value="P90")
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Notes
  • During fitting, the database is queried to compute the statistic corresponding to the chosen kind for each column, using the TD_UnivariateStatistics in-database function.
  • During transform, it will use the COALESCE SQL function to replace NULLs.

tdprepview.SimpleImputer #

SimpleImputer(*, strategy='mean', fill_value=None)

Impute missing values using a specified strategy.

This class provides simple imputation methods for handling missing data, such as replacing with mean, median, most frequent value, or a constant. It is a wrapper for Impute to mimic scikit-learn's SimpleImputer API.

Parameters:

Name Type Description Default
strategy Literal['mean', 'median', 'most_frequent', 'constant']

The imputation strategy to use. Options are "mean", "median", "most_frequent", and "constant". Default is "mean".

'mean'
fill_value Optional[Union[int, float]]

The value to use for missing values when strategy="constant". Ignored for other strategies. Default is None.

None

Examples:

Using mean strategy:

import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="mean")
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)

Using custom constant:

import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="constant", fill_value=0)
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)

Using most frequent value:

import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="most_frequent")
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)
Notes
  • This class is only an alias for Impute, preserving sklearn compatibility.
  • During fitting, the underlying Impute object computes the statistic for each column using the TD_UnivariateStatistics in-database function.
  • During transform, it will use the COALESCE SQL function to replace NULLs.

tdprepview.IterativeImputer #

IterativeImputer()

Impute missing values using an iterative multivariate approach.

This preprocessor uses a multivariate method to impute missing values. During fitting, a sample of the data is pulled into Python to fit a scikit-learn IterativeImputer. During transforming, NULLs are replaced by a weighted combination of other existing columns based on the fitted imputer.
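
A minimal sketch of the fitting idea, assuming a teradataml DataFrame DF with numeric columns "col1" and "col2" (this is not tdprepview's internal code, and the sampling details are illustrative):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required before the import below)
from sklearn.impute import IterativeImputer as SkIterativeImputer

sample_pdf = DF.sample(n=1000).to_pandas()                        # pull a sample into Python
sk_imputer = SkIterativeImputer().fit(sample_pdf[["col1", "col2"]])
# The fitted relationships between columns are then expressed as SQL for the transform step.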

Examples:

Impute at least two columns:

import teradataml as tdml
import tdprepview
my_imputer = tdprepview.IterativeImputer()
pipeline = tdprepview.Pipeline(
    steps=[(["col1", "col2"], my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Fitting triggers pulling a sample into Python to fit scikit-learn's IterativeImputer.
  • Transforming replaces NULLs using the weighted combination of other columns as determined by the fitted iterative imputer.
  • At least two columns must be provided for meaningful imputation.
  • This class is based on scikit-learn's IterativeImputer from the impute module.

tdprepview.ImputeText #

ImputeText(kind='mode', value='')

Impute missing values in text columns using different strategies.

This class supports imputing missing text values using either the most frequent (mode) value or a custom specified value.

Parameters:

Name Type Description Default
kind Literal['mode', 'custom']

The type of imputation to perform. Default is "mode". "mode" uses the most frequent non-null value; "custom" uses the provided value.

'mode'
value str

The custom value to use for imputation if kind="custom". Default is an empty string.

''

Examples:

Basic usage with kind "mode":

import teradataml as tdml
import tdprepview
my_imputer = tdprepview.ImputeText(kind="mode")
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)

Basic usage with kind "custom":

import teradataml as tdml
import tdprepview
my_imputer = tdprepview.ImputeText(kind="custom", value = "missing")
my_pipeline = tdprepview.Pipeline(
    steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Notes
  • During fitting, the database is queried for the most frequent non-null value if kind="mode".
  • During transform, it will use the COALESCE SQL function to replace NULLs.

Transform#

tdprepview.Scale #

Scale(kind='minmax', numerator_subtr=0, denominator=1, zerofinull=True, feature_range=(0, 1), clip=False)

Scale numerical values using a chosen method and parameters.

This class supports multiple scaling methods including MinMax, Z-Score, Robust, MaxAbs, and custom scaling based on a numerator subtraction and a denominator.
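
For intuition, a minimal Python sketch of minmax scaling into a feature_range (the actual transformation is generated as SQL and may handle edge cases differently):

def minmax_scale(x, col_min, col_max, feature_range=(0.0, 1.0)):
    lo, hi = feature_range
    return lo + (x - col_min) / (col_max - col_min) * (hi - lo)

minmax_scale(30.0, col_min=0.0, col_max=100.0)   # -> 0.3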

Parameters:

Name Type Description Default
kind Literal['minmax', 'zscore', 'robust', 'custom', 'maxabs']

The scaling method to use. Supported values:

  • "minmax": (X - MIN(X)) / (MAX(X) - MIN(X))
  • "zscore": (X - MEAN(X)) / STD(X)
  • "robust": (X - MEDIAN(X)) / (P75(X) - P25(X))
  • "maxabs": X / MAX(ABS(X))
  • "custom": (X - numerator_subtr) / denominator
'minmax'
numerator_subtr Union[int, float, str]

Value to subtract from each element before scaling. Can be int, float, or string ("mean", "std", "median", "mode", "max", "min", or percentile like "P33").

0
denominator Union[int, float, str]

Value by which each element is divided after the subtraction. Can be int, float, or a string formula composed of "mean", "std", "median", "mode", "max", "min", or percentiles like "P90" (e.g., "P75-P25"). Must not be 0 if numeric.

1
zerofinull bool

If True, output 0 when division would return null. Default is True.

True
feature_range Tuple[float, float]

Tuple of (min, max) for minmax scaling. Default is (0, 1).

(0, 1)
clip bool

If True, clip values to the feature_range for minmax scaling. Default is False.

False

Examples:

Using MinMax scaling:

import teradataml as tdml
import tdprepview
scaler = tdprepview.Scale(kind="minmax", feature_range=(0,1))
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Using custom scaling:

import teradataml as tdml
import tdprepview
scaler = tdprepview.Scale(kind="custom", numerator_subtr="mean", denominator="P75-P25")
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, necessary statistics are collected with the in-database function TD_UnivariateStatistics.
  • During transform, SQL literal primitives are used for the scaling formula.
  • Custom scaling allows formulas with percentile strings or standard statistics.

tdprepview.StandardScaler #

StandardScaler(*, with_mean=True, with_std=True)

Standardize features by removing the mean and scaling to unit variance.

This class is a convenience wrapper around Scale to mimic scikit-learn's StandardScaler API. It selects either "zscore" scaling or "custom" scaling based on the with_mean and with_std flags.

Parameters:

Name Type Description Default
with_mean bool

If True, center the data before scaling. Default is True.

True
with_std bool

If True, scale the data to unit variance. Default is True.

True

Examples:

Standard scaling with mean and std:

import teradataml as tdml
import tdprepview
scaler = tdprepview.StandardScaler(with_mean=True, with_std=True)
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Scaling only with mean:

import teradataml as tdml
import tdprepview
scaler = tdprepview.StandardScaler(with_mean=True, with_std=False)
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, necessary statistics are collected with the in-database function TD_UnivariateStatistics.
  • During transform, SQL literal primitives are used for the scaling formula.

tdprepview.MaxAbsScaler #

MaxAbsScaler()

Scale each feature by its maximum absolute value.

This class is a convenience wrapper around Scale to mimic scikit-learn's MaxAbsScaler API. It is designed for data that is already centered at zero or sparse.

Examples:

Scaling with MaxAbsScaler:

import teradataml as tdml
import tdprepview
scaler = tdprepview.MaxAbsScaler()
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, necessary statistics are collected with the in-database function TD_UnivariateStatistics.
  • During transform, SQL literal primitives are used for scaling.

tdprepview.MinMaxScaler #

MinMaxScaler(feature_range=(0, 1), *, clip=False)

Transform features by scaling each feature to a given range.

This class is a convenience wrapper around Scale to mimic scikit-learn's MinMaxScaler API. It scales features linearly to the specified feature_range.

Parameters:

Name Type Description Default
feature_range Tuple[float, float]

Tuple (min, max) specifying the target range of transformed data. Default is (0, 1).

(0, 1)
clip bool

If True, clip values outside the feature_range. Default is False.

False

Examples:

Basic usage:

import teradataml as tdml
import tdprepview
scaler = tdprepview.MinMaxScaler(feature_range=(0, 1), clip=True)
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, necessary statistics are collected with the in-database function TD_UnivariateStatistics.
  • During transform, SQL literal primitives are used for scaling.

tdprepview.RobustScaler #

RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)

Scale features using statistics that are robust to outliers.

This class is a convenience wrapper around Scale to mimic scikit-learn's RobustScaler API. It scales data based on the interquartile range (IQR) and optionally centers data by subtracting the median.

Parameters:

Name Type Description Default
with_centering bool

Whether to center the data before scaling. Default is True.

True
with_scaling bool

Whether to scale the data to the quantile range. Default is True.

True
quantile_range Tuple[float, float]

Tuple of floats (q_min, q_max) specifying the quantile range. Default is (25.0, 75.0).

(25.0, 75.0)
copy bool

Ignored. Default is True.

True
unit_variance bool

Ignored. Default is False.

False

Examples:

Robust scaling with default centering and scaling:

import teradataml as tdml
import tdprepview
scaler = tdprepview.RobustScaler()
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Robust scaling only by centering:

import teradataml as tdml
import tdprepview
scaler = tdprepview.RobustScaler(with_centering=True, with_scaling=False)
pipeline = tdprepview.Pipeline(
    steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, necessary statistics are collected with the in-database function TD_UnivariateStatistics.
  • During transform, SQL literal primitives are used for scaling.
  • Centering subtracts the median; scaling divides by the IQR or custom formula.

tdprepview.CutOff #

CutOff(cutoff_min=None, cutoff_max=None)

Clip numeric values that fall outside a specified range.

This preprocessor limits numeric values to a minimum and/or maximum threshold. Thresholds can be constants or derived from percentiles or summary statistics.
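
A minimal Python sketch of the clipping logic (the actual transformation is generated as SQL):

def cutoff(x, cutoff_min=None, cutoff_max=None):
    if cutoff_min is not None:
        x = max(x, cutoff_min)   # raise values below the lower bound
    if cutoff_max is not None:
        x = min(x, cutoff_max)   # lower values above the upper bound
    return x

cutoff(123.0, cutoff_min=0, cutoff_max=100)   # -> 100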

Parameters:

Name Type Description Default
cutoff_min Optional[Union[int, float, str]]

Minimum allowed value. Can be int, float, a percentile string like "P33", or a summary statistic ("mean", "mode", "median", "min"). If None, no lower bound is applied.

None
cutoff_max Optional[Union[int, float, str]]

Maximum allowed value. Can be int, float, a percentile string like "P90", or a summary statistic ("mean", "mode", "median", "max"). If None, no upper bound is applied.

None

Examples:

Clip values with constant bounds:

import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min=0, cutoff_max=100)
pipeline = tdprepview.Pipeline(
    steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Clip values using percentiles:

import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min="P5", cutoff_max="P95")
pipeline = tdprepview.Pipeline(
    steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Clip values using summary statistics:

import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min="median", cutoff_max="max")
pipeline = tdprepview.Pipeline(
    steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, necessary statistics are collected with the in-database function TD_UnivariateStatistics.
  • During transform, values outside the range are replaced by the closest value within the range using SQL primitives.

tdprepview.CustomTransformer #

CustomTransformer(custom_str, output_column_type='FLOAT()')

Apply a custom SQL expression to a column.

This transformer allows arbitrary SQL transformations on a column using a placeholder string %%COL%% which is replaced by the actual column name.
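
A minimal sketch of the placeholder substitution, assuming the step is applied to a column named "my_column":

custom_str = "2 * POWER(%%COL%%, 2) + 3 * %%COL%%"
custom_str.replace("%%COL%%", "my_column")   # -> '2 * POWER(my_column, 2) + 3 * my_column'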

Parameters:

Name Type Description Default
custom_str str

A custom SQL expression that contains the string "%%COL%%" where the column name should be inserted. Example: "2 * POWER(%%COL%%, 2) + 3 * %%COL%%".

required
output_column_type str

Optional. SQL data type of the resulting column. Default is "FLOAT()".

'FLOAT()'

Examples:

Apply a custom SQL expression:

import teradataml as tdml
import tdprepview
transformer = tdprepview.CustomTransformer(custom_str="2 * POWER(%%COL%%, 2) + 3 * %%COL%%")
pipeline = tdprepview.Pipeline(
    steps=[("my_column", transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • This transformer is stateless; nothing happens during fitting.
  • Use with caution: the SQL expression is executed directly in the database.

tdprepview.Normalizer #

Normalizer(norm='l2')

Normalize input data row-wise.

This preprocessor scales rows individually to a specified norm, similar to scikit-learn's Normalizer. Each row is transformed so that its vector length matches the chosen norm.
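
A minimal Python sketch of the three row-wise norms for a single row (the actual computation is expressed in SQL):

row = [3.0, -4.0]
l2_scaled  = [v / (sum(x * x for x in row) ** 0.5) for v in row]   # [0.6, -0.8]
l1_scaled  = [v / sum(abs(x) for x in row) for v in row]           # [0.428..., -0.571...]
max_scaled = [v / max(abs(x) for x in row) for v in row]           # [0.75, -1.0]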

Parameters:

Name Type Description Default
norm Literal['max', 'l1', 'l2']

The normalization method to use. Possible values:

  • "max": Scale each row by its maximum absolute value.
  • "l1": Scale each row so that the sum of absolute values is 1.
  • "l2": Scale each row so that the Euclidean (L2) norm is 1. Default is "l2".
'l2'

Examples:

Normalize multiple columns:

import teradataml as tdml
import tdprepview
normalizer = tdprepview.Normalizer(norm="l2")
pipeline = tdprepview.Pipeline(
    steps=[(["col1", "col2", "col3"], normalizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Using max normalization:

import teradataml as tdml
import tdprepview
normalizer = tdprepview.Normalizer(norm="max")
pipeline = tdprepview.Pipeline(
    steps=[(["colA", "colB"], normalizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Row-wise normalization is applied during transform; no statistics are computed during fitting.
  • The transformation uses SQL expressions to compute the row-wise norm and divide each value accordingly.
  • At least two columns must be provided for meaningful normalization.

tdprepview.PowerTransformer #

PowerTransformer(method='yeo-johnson', standardize=False)

Apply a power transform feature-wise to make data more Gaussian-like.

Power transforms are parametric, monotonic transformations applied to stabilize variance and reduce skewness. This is useful for modeling issues related to heteroscedasticity or other situations where approximate normality is desired.

Currently, the Box-Cox transform and the Yeo-Johnson transform are supported:
  • Box-Cox requires strictly positive data.
  • Yeo-Johnson supports both positive and negative values.

Parameters:

Name Type Description Default
method Literal['yeo-johnson', 'box-cox']

The power transform method to use. Options are:

  • "yeo-johnson": Works with positive and negative values. Default.
  • "box-cox": Only works with strictly positive values.
'yeo-johnson'
standardize bool

Ignored. If standardization is desired, append a StandardScaler after this transformer.

False

Examples:

Power transform a single column:

import teradataml as tdml
import tdprepview
transformer = tdprepview.PowerTransformer(method="yeo-johnson")
pipeline = tdprepview.Pipeline(
    steps=[(["col1"], transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Power transform multiple columns with Box-Cox:

import teradataml as tdml
import tdprepview
transformer = tdprepview.PowerTransformer(method="box-cox")
pipeline = tdprepview.Pipeline(
    steps=[(["colA", "colB"], transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, a sample of the data is pulled into Python to fit scikit-learn's PowerTransformer.
  • During transform, SQL formulas are applied in Teradata to compute the power-transformed values.
  • Box-Cox requires strictly positive values; Yeo-Johnson can handle negative values.

tdprepview.TargetEncoder #

TargetEncoder(target_var, categories='auto', target_type='binary', smooth='auto', cv=5, shuffle=True, random_state=None)

Encode categorical variables by replacing categories with target-based statistics.

Each category value is encoded as a smoothed estimate of the target variable mean for that category. The encoding blends the global target mean with the category-specific target mean to reduce variance, especially for infrequent categories. This implementation follows the sklearn API.
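
As a rough illustration of the blending, a sketch of a standard smoothed-mean formula (the exact computation used during fitting may differ):

def smoothed_encoding(cat_mean, cat_count, global_mean, smooth):
    # Larger smooth pulls the encoding towards the global mean; larger categories dominate.
    shrinkage = cat_count / (cat_count + smooth)
    return shrinkage * cat_mean + (1.0 - shrinkage) * global_mean

smoothed_encoding(cat_mean=0.8, cat_count=5, global_mean=0.3, smooth=10.0)   # -> ~0.47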

Parameters:

Name Type Description Default
target_var str

The name of the target variable used for encoding. Required.

required
categories Union[str, List[str]]

Categories per feature. If "auto", categories are inferred from the data. If a list, it must contain at least two categories. Default is "auto".

'auto'
target_type str

{"auto", "continuous", "binary", "multiclass"}, default="binary" Type of the target variable. Determines how encoding is computed.

'binary'
smooth Union[str, float]

{"auto"} or float, default="auto" Amount of smoothing between category means and the global mean. If float, higher values apply stronger smoothing. If "auto", a heuristic is applied.

'auto'
cv int

Number of folds for cross-validation during fitting. Default is 5.

5
shuffle bool

Whether to shuffle the data before splitting into folds. Default is True.

True
random_state Optional[int]

Controls the randomness of shuffling when shuffle=True. Default is None.

None

Examples:

Encode a binary target variable:

import teradataml as tdml
import tdprepview
from tdprepview import TargetEncoder

# Create encoder with binary target
te = TargetEncoder(target_var="churn", target_type="binary")

pipeline = tdprepview.Pipeline(
    steps=[(["region"], te)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","customers"))
pipeline.fit(DF)

Encode with smoothing and reproducible folds:

import teradataml as tdml
import tdprepview
from tdprepview import TargetEncoder

te = TargetEncoder(
    target_var="label",
    smooth=10.0,
    cv=3,
    shuffle=True,
    random_state=42
)

pipeline = tdprepview.Pipeline(
    steps=[(["product"], te)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","transactions"))
pipeline.fit(DF)
Notes
  • Output column type is FLOAT.
  • Requires the target variable to be present in the training DataFrame.

Discretize#

tdprepview.FixedWidthBinning #

FixedWidthBinning(n_bins=5, lower_bound=None, upper_bound=None)

Perform fixed-width binning on a numerical column.

This preprocessor divides numerical data into a fixed number of bins. Each bin has equal width, and values are assigned an integer representing the bin index.
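
A minimal Python sketch of equal-width bin assignment (illustrative only; exact boundary handling in the generated SQL may differ):

import math

def fixed_width_bin(x, lower, upper, n_bins):
    width = (upper - lower) / n_bins
    idx = math.floor((x - lower) / width)
    return min(max(idx, 0), n_bins - 1)   # clamp to the range 0 .. n_bins-1

fixed_width_bin(37.0, lower=0.0, upper=100.0, n_bins=4)   # -> 1 (bins of width 25)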

Parameters:

Name Type Description Default
n_bins int

Number of bins to divide the data into. Must be greater than 1.

5
lower_bound Optional[Union[int, float]]

Optional lower bound of the binning range. If None, the column minimum is used.

None
upper_bound Optional[Union[int, float]]

Optional upper bound of the binning range. If None, the column maximum is used.

None

Examples:

Basic fixed-width binning:

import teradataml as tdml
import tdprepview
binning = tdprepview.FixedWidthBinning(n_bins=5)
pipeline = tdprepview.Pipeline(
    steps=[(["col1"], binning)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Fixed-width binning with custom bounds:

import teradataml as tdml
import tdprepview
binning = tdprepview.FixedWidthBinning(n_bins=4, lower_bound=0.0, upper_bound=100.0)
pipeline = tdprepview.Pipeline(
    steps=[(["colA"], binning)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • The output column type is INTEGER.
  • During fitting, necessary statistics (min/max) are collected with TD_UnivariateStatistics if bounds are None.
  • Values are assigned to bins indexed from 0 to n_bins-1.

tdprepview.VariableWidthBinning #

VariableWidthBinning(kind='quantiles', no_quantiles=5, boundaries=None)

Bin numerical data into variable-width bins.

This preprocessor supports two binning strategies (see the sketch after this list):
  • "quantiles": Divide the data into bins based on percentiles.
  • "custom": Divide the data based on user-defined boundaries.

Parameters:

Name Type Description Default
kind Literal['quantiles', 'custom']

Method of binning. Options are "quantiles" or "custom". Default is "quantiles".

'quantiles'
no_quantiles int

Number of bins to use when kind="quantiles". Must be between 2 and 100. Default is 5.

5
boundaries Optional[List[Union[int, float]]]

List of numeric boundaries for custom binning. Must be sorted in ascending order. Required if kind="custom". Default is None.

None

Examples:

Quantile-based binning:

import teradataml as tdml
import tdprepview
binner = tdprepview.VariableWidthBinning(kind="quantiles", no_quantiles=4)
pipeline = tdprepview.Pipeline(
    steps=[(["col1", "col2"], binner)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Custom boundary binning:

import teradataml as tdml
import tdprepview
binner = tdprepview.VariableWidthBinning(kind="custom", boundaries=[0, 10, 20, 50])
pipeline = tdprepview.Pipeline(
    steps=[(["colA"], binner)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • The output column type is INTEGER.
  • During fitting, necessary statistics (percentiles) are collected if kind="quantiles".
  • Custom boundaries are used directly without computation.
  • Bins are indexed starting from 0 up to the number of bins minus one.

tdprepview.QuantileTransformer #

QuantileTransformer(n_quantiles=10, output_distribution='uniform', ignore_implicit_zeros=False, subsample=None, random_state=None, copy=True)

Transform features using quantiles information.

This preprocessor transforms numerical columns by mapping values to their corresponding quantile bins. Output values are integers representing the quantile index.

Parameters:

Name Type Description Default
n_quantiles int

Number of quantiles to compute. Default is 10.

10
output_distribution str

Only 'uniform' is currently supported. Default is 'uniform'.

'uniform'
ignore_implicit_zeros bool

Whether to ignore implicit zero values when computing quantiles. Default is False.

False
subsample Optional[int]

Ignored; included only for scikit-learn API compatibility. Default is None.

None
random_state Optional[int]

Random seed. Ignored. Default is None.

None
copy bool

Whether to copy the input array before transforming. Ignored. Default is True.

True

Examples:

Quantile binning:

import teradataml as tdml
import tdprepview
qt = tdprepview.QuantileTransformer(n_quantiles=5)
pipeline = tdprepview.Pipeline(
    steps=[(["col1", "col2"], qt)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER.
  • This transformer uses quantile computation to assign bins.
  • Currently, only the "uniform" output distribution is supported.

tdprepview.DecisionTreeBinning #

DecisionTreeBinning(target_var, model_type='classification', no_bins=10, no_rows=10000)

Bin numerical data into variable-width bins based on a decision tree.

This preprocessor uses a decision tree to find optimal bin boundaries. During fitting, a sample of the data is pulled into Python to train a scikit-learn decision tree. During transform, SQL expressions are used to assign bin indices.
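
A conceptual sketch of how the split thresholds of a fitted tree can serve as bin boundaries (not necessarily the exact procedure used here; data and parameters are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(1000, 1)                 # one feature, sampled rows
y = (X[:, 0] > 0.6).astype(int)             # hypothetical binary target
tree = DecisionTreeClassifier(max_leaf_nodes=5).fit(X, y)

# Thresholds of internal nodes (feature index >= 0) become candidate bin boundaries.
thresholds = sorted(t for t, f in zip(tree.tree_.threshold, tree.tree_.feature) if f >= 0)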

Parameters:

Name Type Description Default
target_var str

Column name of the target variable. Required.

required
model_type Literal['classification', 'regression']

Either "classification" or "regression". Determines the type of tree used.

'classification'
no_bins int

Number of bins (tree leaves). Must be between 2 and 100.

10
no_rows int

Number of rows randomly sampled for fitting the tree. Must be between 100 and 100000.

10000

Examples:

Decision tree binning of a single column:

import teradataml as tdml
import tdprepview
dtb = tdprepview.DecisionTreeBinning(target_var="target", model_type="classification", no_bins=5)
pipeline = tdprepview.Pipeline(
    steps=[(["feature1"], dtb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Decision tree binning of multiple columns:

import teradataml as tdml
import tdprepview
dtb = tdprepview.DecisionTreeBinning(target_var="target", model_type="regression", no_bins=8)
pipeline = tdprepview.Pipeline(
    steps=[(["featureA", "featureB"], dtb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • During fitting, a sample of the data is used in Python to fit a scikit-learn decision tree.
  • During transform, SQL formulas are applied to assign bin indices.
  • Output column type is INTEGER.
  • The number of bins corresponds to the number of leaves in the fitted tree.

tdprepview.ThresholdBinarizer #

ThresholdBinarizer(threshold='mean')

Binarize numeric data using a threshold value.

This preprocessor assigns 1 if a value exceeds the threshold and 0 otherwise.

Parameters:

Name Type Description Default
threshold Union[int, float, Literal['mean', 'mode', 'median', 'P[1-100]']]

Threshold value for binarization. Can be a number (int or float), or a string: "mean", "mode", "median", or a percentile string "P[1-100]", e.g., "P33". Default is "mean".

'mean'

Examples:

Binarize using mean:

import teradataml as tdml
import tdprepview
binarizer = tdprepview.ThresholdBinarizer(threshold="mean")
pipeline = tdprepview.Pipeline(
    steps=[(["col1"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Binarize using a numeric threshold:

import teradataml as tdml
import tdprepview
binarizer = tdprepview.ThresholdBinarizer(threshold=50)
pipeline = tdprepview.Pipeline(
    steps=[(["colA"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER (0 or 1).
  • Threshold can be a fixed number, a summary statistic, or a percentile of the data.

tdprepview.Binarizer #

Binarizer(threshold=0.0, copy=True)

Binarize data according to a threshold (sklearn-compatible alias).

This is a convenience alias to ThresholdBinarizer to provide a scikit-learn-like API.

Parameters:

Name Type Description Default
threshold Union[int, float]

Threshold value to use for binarization. Must be numeric. Default is 0.0.

0.0
copy bool

Whether to copy the input array before binarizing. Ignored. Default is True.

True

Examples:

Binarize a single column:

import teradataml as tdml
import tdprepview
binarizer = tdprepview.Binarizer(threshold=10)
pipeline = tdprepview.Pipeline(
    steps=[(["col1"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER (0 or 1).
  • This class is an alias of ThresholdBinarizer for sklearn compatibility.

tdprepview.ListBinarizer #

ListBinarizer(elements1='TOP3')

Binarize text data based on membership in a list or top-K most frequent values.

This preprocessor assigns 1 if a text value is in the provided list of elements, or among the top-K most frequent values. Otherwise, it assigns 0.

Parameters:

Name Type Description Default
elements1 Union[str, List[str]]

Either a list of strings to match, or a string "TOPK" indicating the K most frequent values, e.g., "TOP3" or "TOP10". Default is "TOP3".

'TOP3'

Examples:

Using the top-K most frequent values:

import teradataml as tdml
import tdprepview
lb = tdprepview.ListBinarizer(elements1="TOP5")
pipeline = tdprepview.Pipeline(
    steps=[(["category_col"], lb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Using a custom list:

import teradataml as tdml
import tdprepview
lb = tdprepview.ListBinarizer(elements1=["apple", "banana", "orange"])
pipeline = tdprepview.Pipeline(
    steps=[(["fruit_col"], lb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER (0 or 1).
  • Top-K values are computed in the database with a query template.

tdprepview.LabelEncoder #

LabelEncoder(elements='TOP100')

Encode a text column into numerical values using label encoding.

This preprocessor assigns integers to text values. Encoding can be based on either a list of custom elements, or on the K most frequent values in the column, indicated by a TOPK string (e.g., "TOP20").

Parameters:

Name Type Description Default
elements Union[str, List[str]]

Either a list of strings to use for encoding, or a TOPK string specifying the number of most frequent values to encode. Default is "TOP100".

'TOP100'

Examples:

Using TOPK:

import teradataml as tdml
import tdprepview
le = tdprepview.LabelEncoder(elements="TOP10")
pipeline = tdprepview.Pipeline(
    steps=[(["category_col"], le)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Using a custom list:

import teradataml as tdml
import tdprepview
le = tdprepview.LabelEncoder(elements=["apple", "banana", "orange"])
pipeline = tdprepview.Pipeline(
    steps=[(["fruit_col"], le)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER.
  • Top-K values are computed in the database with a query template during fitting.

tdprepview.SimpleHashEncoder #

SimpleHashEncoder(num_buckets=None, salt='')

Encode a text column using a hash function into INTEGER values.

This preprocessor applies a Teradata built-in hash function. It is stateless and very performant, but hash collisions can occur. Optional bucketing can be applied on top of the hash value.
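
A conceptual Python sketch of salted hash bucketing (the in-database hash function differs, and Python's hash() is only stable within one process):

def hash_bucket(value, num_buckets, salt=""):
    return hash(value + salt) % num_buckets   # append salt, hash, reduce to a bucket index

hash_bucket("some text", num_buckets=100, salt="xyz")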

Parameters:

Name Type Description Default
num_buckets Optional[int]

Optional number of buckets to apply using modulo on the hash code. If None, the raw hash value is used. Must be an integer > 1 or None. Default is None.

None
salt str

Optional string appended to each value before hashing to redistribute hash values. Default is "".

''

Examples:

Using raw hash values:

import teradataml as tdml
import tdprepview
hasher = tdprepview.SimpleHashEncoder()
pipeline = tdprepview.Pipeline(
    steps=[(["text_col"], hasher)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Using buckets and salt:

import teradataml as tdml
import tdprepview
hasher = tdprepview.SimpleHashEncoder(num_buckets=100, salt="xyz")
pipeline = tdprepview.Pipeline(
    steps=[(["text_col"], hasher)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER.
  • Stateless: no fitting is performed; values are hashed on transform.
  • Collisions can occur if multiple values map to the same hash bucket.

Feature Engineering#

tdprepview.PolynomialFeatures #

PolynomialFeatures(degree=2, interaction_only=False)

Generate polynomial and interaction features from input data.

This preprocessor creates new features by raising input columns to powers up to degree and optionally including only interaction features between distinct columns. No fitting is required; transformations are applied using SQL formulas.
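
For intuition, the degree-2 terms generated for two inputs a and b (names and values are illustrative):

a, b = 3.0, 2.0
degree2_terms     = [a * a, a * b, b * b]   # all degree-2 terms
interaction_terms = [a * b]                 # interaction_only=True keeps only products of distinct columns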

Parameters:

Name Type Description Default
degree int

The degree of the polynomial features. Must be an integer >= 1. Default is 2.

2
interaction_only bool

If True, only interaction features (products of distinct columns) are generated. Default is False.

False

Examples:

Polynomial features on multiple columns:

import teradataml as tdml
import tdprepview
poly = tdprepview.PolynomialFeatures(degree=2, interaction_only=False)
pipeline = tdprepview.Pipeline(
    steps=[(["col1", "col2", "col3"], poly)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is numeric.
  • The class is based on scikit-learn's PolynomialFeatures from the preprocessing module.
  • Interactions are computed across all specified input columns.
  • No fitting is required; transformations are applied via SQL formulas.

tdprepview.OneHotEncoder #

OneHotEncoder(*, categories='auto', handle_unknown='ignore', min_frequency=None, max_categories=None)

One-hot encode categorical features.

This preprocessor converts categorical columns into multiple binary (0/1) columns, one per category. Categories can be inferred from the data or provided as a custom list. No fitting is required for custom categories; otherwise, necessary statistics are collected during fitting to determine categories.

Parameters:

Name Type Description Default
categories Union[str, List[str]]

Either "auto" to infer categories from the data, or a list of strings specifying the categories. Default is "auto".

'auto'
handle_unknown str

Ignored. Default is "ignore".

'ignore'
min_frequency Optional[int]

Ignored. Default is None.

None
max_categories Optional[int]

Maximum number of categories to encode. If None, a maximum of 50 categories is used.

None

Examples:

Automatic category detection:

import teradataml as tdml
import tdprepview
ohe = tdprepview.OneHotEncoder(categories="auto")
pipeline = tdprepview.Pipeline(
    steps=[(["cat_col"], ohe)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Using a custom category list:

import teradataml as tdml
import tdprepview
ohe = tdprepview.OneHotEncoder(categories=["red", "green", "blue"])
pipeline = tdprepview.Pipeline(
    steps=[(["color_col"], ohe)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER (0/1 per category).
  • If categories="auto", necessary statistics are collected during fitting.
  • For custom categories, no fitting is required; transformations use SQL formulas.
  • Maximum of 50 categories can be encoded.

tdprepview.MultiLabelBinarizer #

MultiLabelBinarizer(*, classes=None, sparse_output=False, max_categories=None, delimiter=', ')

Multi-label binarizer for categorical features. The input column contains delimiter-separated values. The output is one indicator (0/1) variable per unique value.
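
A minimal Python sketch of the indicator output for one row (column values and class names are illustrative):

classes = ["red", "green", "blue"]
row_value = "red, blue"                                  # delimiter-separated input value
tokens = row_value.split(", ")
indicators = {c: int(c in tokens) for c in classes}      # {'red': 1, 'green': 0, 'blue': 1}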

Parameters:

Name Type Description Default
classes Optional[Union[str, List[str]]]

Categories to encode. Can be:

  • None: categories are inferred during fitting.
  • "auto": infer categories from the training data.
  • a list of strings: custom categories.

Default is None.

None
sparse_output bool

Ignored. Default is False.

False
max_categories Optional[int]

Maximum number of categories to encode. If None, a maximum of 50 categories is used.

None
delimiter str

Delimiter used to separate multiple values in the input string. Default is ", ".

', '

Examples:

Automatic category detection:

import teradataml as tdml
import tdprepview
mlb = tdprepview.MultiLabelBinarizer(classes="auto")
pipeline = tdprepview.Pipeline(
    steps=[(["tags"], mlb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Using a custom list of classes:

import teradataml as tdml
import tdprepview
mlb = tdprepview.MultiLabelBinarizer(classes=["red", "green", "blue"])
pipeline = tdprepview.Pipeline(
    steps=[(["colors"], mlb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output column type is INTEGER (0/1 per unique value).
  • For classes="auto" or None, necessary statistics are collected during fitting.
  • For custom classes, no fitting is required; transformations use SQL formulas.
  • Maximum of 50 categories can be encoded.

Dimensionality Reduction & Miscellaneous#

tdprepview.PCA #

PCA(n_components='mle', *, random_state=42)

Principal Component Analysis (PCA) reduces the dimensionality of a dataset while retaining most of its variance.

During fitting, a sample of the data is pulled into Python to fit a scikit-learn PCA. During transform, the weighted sum of components is calculated directly in SQL.
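
For intuition, a component score is a weighted sum of the (centered) inputs, which is why it can be expressed as a SQL formula. A minimal sketch with illustrative numbers (the centering details may differ):

import numpy as np

x_row       = np.array([1.2, 0.5, 3.1])       # one row of feature1..feature3
mean_       = np.array([1.0, 0.4, 2.9])       # column means estimated from the fitted sample
components_ = np.array([[0.60, 0.30, 0.74]])  # loadings of the first principal component
score = (x_row - mean_) @ components_.T       # -> array([0.298])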

Parameters:

Name Type Description Default
n_components Union[int, str]

Number of principal components to keep. Can be an integer or "mle" (automatic selection via maximum likelihood estimation). Default is "mle".

'mle'
random_state int

Seed for the random number generator. Default is 42.

42

Examples:

Fit PCA with a fixed number of components:

import teradataml as tdml
import tdprepview
pca = tdprepview.PCA(n_components=3)
pipeline = tdprepview.Pipeline(
    steps=[(["feature1", "feature2", "feature3"], pca)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Fit PCA using MLE for component selection:

import teradataml as tdml
import tdprepview
pca = tdprepview.PCA(n_components="mle")
pipeline = tdprepview.Pipeline(
    steps=[(["feature1", "feature2", "feature3"], pca)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output is a weighted combination of input columns (SQL computed).
  • The number of components can be automatically inferred with "mle".
  • Fitting uses a Python sample for PCA computation.

tdprepview.TryCast #

TryCast(new_type='FLOAT')

Attempt to cast a text column to a specified data type using SQL TRYCAST.

This preprocessor is stateless: nothing is done during fitting. During transform, the SQL TRYCAST function is applied to each column value.

Parameters:

Name Type Description Default
new_type Literal['BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', 'TIMESTAMP(6)']

Target data type to cast the column to. Must be one of: "BYTEINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DATE", "TIME", "TIMESTAMP(6)". Default is "FLOAT".

'FLOAT'

Examples:

Convert a text column to integer:

import teradataml as tdml
import tdprepview
caster = tdprepview.TryCast(new_type="INT")
pipeline = tdprepview.Pipeline(
    steps=[("text_col", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Convert a text column to timestamp:

import teradataml as tdml
import tdprepview
caster = tdprepview.TryCast(new_type="TIMESTAMP(6)")
pipeline = tdprepview.Pipeline(
    steps=[("text_col", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output type is the same as new_type.
  • Uses SQL TRYCAST in transform; any conversion failures result in NULLs.

tdprepview.Cast #

Cast(new_type='FLOAT')

Convert a column to a specified data type using SQL CAST.

This preprocessor is stateless: nothing is done during fitting. During transform, the SQL CAST function is applied to each column value. Typically used as the last step to ensure numeric features are FLOAT. For text columns, prefer TryCast to avoid errors.

Parameters:

Name Type Description Default
new_type Literal['BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', 'TIMESTAMP(6)']

Target data type to cast the column to. Must be one of: "BYTEINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DATE", "TIME", "TIMESTAMP(6)". Default is "FLOAT".

'FLOAT'

Examples:

Convert a column to float:

import teradataml as tdml
import tdprepview
caster = tdprepview.Cast(new_type="FLOAT")
pipeline = tdprepview.Pipeline(
    steps=[("col1", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)

Convert a column to integer:

import teradataml as tdml
import tdprepview
caster = tdprepview.Cast(new_type="INT")
pipeline = tdprepview.Pipeline(
    steps=[("col1", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
  • Output type is the same as new_type.
  • Uses SQL CAST in transform; invalid conversions will raise errors.