Preprocessors#
Impute#
tdprepview.Impute#
Impute missing values in numerical columns using different strategies.
This class supports imputing missing numerical values using the mean, median, mode, min, max, or a custom value/percentile.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['mean', 'median', 'mode', 'min', 'max', 'custom'] | The imputation strategy to use. | 'mean' |
value | Union[int, float, str] | The custom value (or percentile string, e.g. "P90") to use if kind="custom". | 0 |
Examples:
Using mean strategy:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="mean")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Using custom value:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="custom", value=100)
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Using custom percentile:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.Impute(kind="custom", value="P90")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Notes
- During fitting, the function queries the database to compute the statistic corresponding to the chosen kind for each column, using the TD_UnivariateStatistics in-database function.
- During transform, it uses the COALESCE SQL function to replace NULLs.
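Applying the fitted pipeline afterwards is sketched below, assuming the Pipeline exposes a scikit-learn-style transform method that returns a teradataml DataFrame (the method name and return type are assumptions, not confirmed on this page):
# continue from the fitted pipeline above
DF_imputed = my_pipeline.transform(DF)  # NULLs in "my_column" replaced via COALESCE
print(DF_imputed.head())                # hypothetical inspection of the result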
tdprepview.SimpleImputer#
Impute missing values using a specified strategy.
This class provides simple imputation methods for handling missing data,
such as replacing with mean, median, most frequent value, or a constant.
It is a wrapper around Impute that mimics scikit-learn's SimpleImputer API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
strategy | Literal['mean', 'median', 'most_frequent', 'constant'] | The imputation strategy to use: 'mean', 'median', 'most_frequent', or 'constant'. | 'mean' |
fill_value | Optional[Union[int, float]] | The value to use for missing values when strategy="constant". | None |
Examples:
Using mean strategy:
import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="mean")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
Using custom constant:
import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="constant", fill_value=0)
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
Using most frequent value:
import tdprepview
my_imputer = tdprepview.SimpleImputer(strategy="most_frequent")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
Notes
- This class is an alias for Impute, preserving sklearn compatibility.
- During fitting, the underlying Impute object computes the statistic for each column using the TD_UnivariateStatistics in-database function.
- During transform, it uses the COALESCE SQL function to replace NULLs.
tdprepview.IterativeImputer#
Impute missing values using an iterative multivariate approach.
This preprocessor uses a multivariate method to impute missing values. During fitting, a sample of the data is pulled into Python to fit a scikit-learn IterativeImputer. During transform, NULLs are replaced by a weighted combination of the other columns, based on the fitted imputer.
Examples:
Impute at least two columns:
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.IterativeImputer()
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2"], my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Fitting pulls a sample of the data into Python to fit scikit-learn's IterativeImputer.
- Transform replaces NULLs using the weighted combination of the other columns determined by the fitted iterative imputer.
- At least two columns must be provided for meaningful imputation.
- This class is based on scikit-learn's IterativeImputer from the impute module.
tdprepview.ImputeText#
Impute missing values in text columns using different strategies.
This class supports imputing missing text values using either the most frequent (mode) value or a custom specified value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['mode', 'custom'] | The type of imputation to perform: the most frequent value ("mode") or a custom value ("custom"). | 'mode' |
value | str | The custom value to use for imputation if kind="custom". | '' |
Examples:
Basic usage with kind "mode":
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.ImputeText(kind="mode")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Basic usage with kind "custom":
import teradataml as tdml
import tdprepview
my_imputer = tdprepview.ImputeText(kind="custom", value="missing")
my_pipeline = tdprepview.Pipeline(
steps=[("my_column", my_imputer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
my_pipeline.fit(DF)
Notes
- During fitting, the function queries the database for the most frequent non-null value if kind="mode".
- During transform, it uses the COALESCE SQL function to replace NULLs.
Transform#
tdprepview.Scale#
Scale(kind='minmax', numerator_subtr=0, denominator=1, zerofinull=True, feature_range=(0, 1), clip=False)
Scale numerical values using a chosen method and parameters.
This class supports multiple scaling methods including MinMax, Z-Score, Robust, MaxAbs, and custom scaling based on a numerator subtraction and a denominator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['minmax', 'zscore', 'robust', 'custom', 'maxabs'] | The scaling method to use: 'minmax', 'zscore', 'robust', 'maxabs', or 'custom'. | 'minmax' |
numerator_subtr | Union[int, float, str] | Value to subtract from each element before scaling. Can be int, float, or string ("mean", "std", "median", "mode", "max", "min", or a percentile like "P33"). | 0 |
denominator | Union[int, float, str] | Value to divide each element by after subtraction. Can be int, float, or string (a formula composed of "mean", "std", "median", "mode", "max", "min", or percentiles like "P90"). Must not be 0 if numeric. | 1 |
zerofinull | bool | If True, output 0 when the division would return NULL. | True |
feature_range | Tuple[float, float] | Tuple of (min, max) for minmax scaling. | (0, 1) |
clip | bool | If True, clip values to the feature_range for minmax scaling. | False |
Examples:
Using MinMax scaling:
import teradataml as tdml
import tdprepview
scaler = tdprepview.Scale(kind="minmax", feature_range=(0,1))
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using custom scaling:
import teradataml as tdml
import tdprepview
scaler = tdprepview.Scale(kind="custom", numerator_subtr="mean", denominator="P75-P25")
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for the scaling formula.
- Custom scaling allows formulas with percentile strings or standard statistics.
tdprepview.StandardScaler#
Standardize features by removing the mean and scaling to unit variance.
This class is a convenience wrapper around Scale to mimic scikit-learn's StandardScaler API. It selects either "zscore" scaling or "custom" scaling based on the with_mean and with_std flags.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
with_mean | bool | If True, center the data before scaling. | True |
with_std | bool | If True, scale the data to unit variance. | True |
Examples:
Standard scaling with mean and std:
import teradataml as tdml
import tdprepview
scaler = tdprepview.StandardScaler(with_mean=True, with_std=True)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Scaling only with mean:
import teradataml as tdml
import tdprepview
scaler = tdprepview.StandardScaler(with_mean=True, with_std=False)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for the scaling formula.
tdprepview.MaxAbsScaler#
Scale each feature by its maximum absolute value.
This class is a convenience wrapper around Scale to mimic scikit-learn's MaxAbsScaler API. It is designed for data that is already centered at zero or sparse.
Examples:
Scaling with MaxAbsScaler:
import teradataml as tdml
import tdprepview
scaler = tdprepview.MaxAbsScaler()
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for scaling.
tdprepview.MinMaxScaler#
Transform features by scaling each feature to a given range.
This class is a convenience wrapper around Scale to mimic scikit-learn's MinMaxScaler API. It scales features linearly to the specified feature_range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
feature_range | Tuple[float, float] | Tuple (min, max) specifying the target range of the transformed data. | (0, 1) |
clip | bool | If True, clip values outside the feature_range. | False |
Examples:
Basic usage:
import teradataml as tdml
import tdprepview
scaler = tdprepview.MinMaxScaler(feature_range=(0, 1), clip=True)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for scaling.
tdprepview.RobustScaler#
RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)
Scale features using statistics that are robust to outliers.
This class is a convenience wrapper around Scale to mimic scikit-learn's RobustScaler API. It scales data based on the interquartile range (IQR) and optionally centers data by subtracting the median.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
with_centering | bool | Whether to center the data before scaling. | True |
with_scaling | bool | Whether to scale the data to the quantile range. | True |
quantile_range | Tuple[float, float] | Tuple of floats (q_min, q_max) specifying the quantile range. | (25.0, 75.0) |
copy | bool | Ignored. | True |
unit_variance | bool | Ignored. | False |
Examples:
Robust scaling with default centering and scaling:
import teradataml as tdml
import tdprepview
scaler = tdprepview.RobustScaler()
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Robust scaling only by centering:
import teradataml as tdml
import tdprepview
scaler = tdprepview.RobustScaler(with_centering=True, with_scaling=False)
pipeline = tdprepview.Pipeline(
steps=[("my_column", scaler)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, SQL literal primitives are used for scaling.
- Centering subtracts the median; scaling divides by the IQR or a custom formula.
tdprepview.CutOff#
Clip numeric values that fall outside a specified range.
This preprocessor limits numeric values to a minimum and/or maximum threshold. Thresholds can be constants or derived from percentiles or summary statistics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cutoff_min | Optional[Union[int, float, str]] | Minimum allowed value. Can be int, float, a percentile string like "P33", or a summary statistic ("mean", "mode", "median", "min"). If None, no lower bound is applied. | None |
cutoff_max | Optional[Union[int, float, str]] | Maximum allowed value. Can be int, float, a percentile string like "P90", or a summary statistic ("mean", "mode", "median", "max"). If None, no upper bound is applied. | None |
Examples:
Clip values with constant bounds:
import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min=0, cutoff_max=100)
pipeline = tdprepview.Pipeline(
steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Clip values using percentiles:
import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min="P5", cutoff_max="P95")
pipeline = tdprepview.Pipeline(
steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Clip values using summary statistics:
import teradataml as tdml
import tdprepview
clipper = tdprepview.CutOff(cutoff_min="median", cutoff_max="max")
pipeline = tdprepview.Pipeline(
steps=[("my_column", clipper)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, the necessary statistics are collected with the in-database function TD_UnivariateStatistics.
- During transform, values outside the range are replaced by the closest value within the range using SQL primitives. For example, with cutoff_min=0 and cutoff_max=100, a value of 150 becomes 100 and a value of -5 becomes 0.
tdprepview.CustomTransformer#
Apply a custom SQL expression to a column.
This transformer allows arbitrary SQL transformations on a column using the placeholder string %%COL%%, which is replaced by the actual column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
custom_str | str | A custom SQL expression that contains the string "%%COL%%" where the column name should be inserted (see the example below). | required |
output_column_type | str | Optional. SQL data type of the resulting column. | 'FLOAT()' |
Examples:
Apply a custom SQL expression:
import teradataml as tdml
import tdprepview
transformer = tdprepview.CustomTransformer(custom_str="2 * POWER(%%COL%%, 2) + 3 * %%COL%%")
pipeline = tdprepview.Pipeline(
steps=[("my_column", transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- This transformer is stateless; nothing happens during fitting.
- Use with caution: the SQL expression is executed directly in the database.
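A further sketch using the output_column_type parameter from the table above; the SQL expression is an arbitrary illustration and the "FLOAT()" type string simply repeats the documented default:
import teradataml as tdml
import tdprepview
# square root of the absolute value, returned as a FLOAT column
transformer = tdprepview.CustomTransformer(
    custom_str="SQRT(ABS(%%COL%%))",
    output_column_type="FLOAT()"
)
pipeline = tdprepview.Pipeline(
    steps=[("my_column", transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)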
tdprepview.Normalizer#
Normalize input data row-wise.
This preprocessor scales rows individually to a specified norm, similar to scikit-learn's Normalizer: each value in a row is divided by the row's norm, so the transformed row has unit length under the chosen norm. For example, with the 'l2' norm the row (3, 4) becomes (0.6, 0.8).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
norm | Literal['max', 'l1', 'l2'] | The normalization method to use: 'l1' (sum of absolute values), 'l2' (Euclidean norm), or 'max' (maximum absolute value). | 'l2' |
Examples:
Normalize multiple columns:
import teradataml as tdml
import tdprepview
normalizer = tdprepview.Normalizer(norm="l2")
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2", "col3"], normalizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using max normalization:
import teradataml as tdml
import tdprepview
normalizer = tdprepview.Normalizer(norm="max")
pipeline = tdprepview.Pipeline(
steps=[(["colA", "colB"], normalizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Row-wise normalization is applied during transform; no statistics are computed during fitting.
- The transformation uses SQL expressions to compute the row-wise norm and divide each value accordingly.
- At least two columns must be provided for meaningful normalization.
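A plain-Python sketch of what the generated SQL computes per row (the norm definitions follow the usual scikit-learn conventions; the actual SQL expressions are built by the library and are not shown here):
def normalize_row(values, norm="l2"):
    """Illustrative row-wise normalization; mirrors the idea, not the actual SQL."""
    if norm == "l2":
        denom = sum(v ** 2 for v in values) ** 0.5
    elif norm == "l1":
        denom = sum(abs(v) for v in values)
    else:  # "max"
        denom = max(abs(v) for v in values)
    return [v / denom for v in values]

print(normalize_row([3.0, 4.0], norm="l2"))  # [0.6, 0.8]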
tdprepview.PowerTransformer#
Apply a power transform feature-wise to make data more Gaussian-like.
Power transforms are parametric, monotonic transformations applied to stabilize variance and reduce skewness. This is useful for modeling issues related to heteroscedasticity or other situations where approximate normality is desired.
Currently, the Box-Cox and the Yeo-Johnson transforms are supported:
- Box-Cox requires strictly positive data.
- Yeo-Johnson supports both positive and negative values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | Literal['yeo-johnson', 'box-cox'] | The power transform method to use: 'yeo-johnson' or 'box-cox'. | 'yeo-johnson' |
standardize | bool | Ignored. If standardization is desired, append a StandardScaler after this transformer. | False |
Examples:
Power transform a single column:
import teradataml as tdml
import tdprepview
transformer = tdprepview.PowerTransformer(method="yeo-johnson")
pipeline = tdprepview.Pipeline(
steps=[(["col1"], transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Power transform multiple columns with Box-Cox:
import teradataml as tdml
import tdprepview
transformer = tdprepview.PowerTransformer(method="box-cox")
pipeline = tdprepview.Pipeline(
steps=[(["colA", "colB"], transformer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, a sample of the data is pulled into Python to fit scikit-learn's PowerTransformer.
- During transform, SQL formulas are applied in Teradata to compute the power-transformed values.
- Box-Cox requires strictly positive values; Yeo-Johnson can handle negative values.
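For reference, a plain-Python sketch of the standard Box-Cox and Yeo-Johnson closed forms that the generated SQL evaluates with the lambda fitted by scikit-learn (the exact SQL emitted by the library is not shown on this page):
import math

def box_cox(x, lmbda):
    # requires x > 0
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda

def yeo_johnson(x, lmbda):
    if x >= 0:
        return math.log1p(x) if lmbda == 0 else ((x + 1) ** lmbda - 1) / lmbda
    return -math.log1p(-x) if lmbda == 2 else -(((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda))

print(yeo_johnson(-1.5, lmbda=1.3))  # handles negative values, unlike Box-Cox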
tdprepview.TargetEncoder#
TargetEncoder(target_var, categories='auto', target_type='binary', smooth='auto', cv=5, shuffle=True, random_state=None)
Encode categorical variables by replacing categories with target-based statistics.
Each category value is encoded as a smoothed estimate of the target variable mean for that category. The encoding blends the global target mean with the category-specific target mean to reduce variance, especially for infrequent categories. This implementation follows the sklearn API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_var | str | The name of the target variable used for encoding. Required. | required |
categories | Union[str, List[str]] | Categories per feature. If "auto", categories are inferred from the data. If a list, it must contain at least two categories. | 'auto' |
target_type | str | Type of the target variable: "auto", "continuous", "binary", or "multiclass". Determines how the encoding is computed. | 'binary' |
smooth | Union[str, float] | Amount of smoothing between the category means and the global mean. If a float, higher values apply stronger smoothing. If "auto", a heuristic is applied. | 'auto' |
cv | int | Number of folds for cross-validation during fitting. | 5 |
shuffle | bool | Whether to shuffle the data before splitting into folds. | True |
random_state | Optional[int] | Controls the randomness of shuffling when shuffle=True. | None |
Examples:
Encode a binary target variable:
import teradataml as tdml
import tdprepview
from tdprepview import TargetEncoder
# Create encoder with binary target
te = TargetEncoder(target_var="churn", target_type="binary")
pipeline = tdprepview.Pipeline(
steps=[(["region"], te)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","customers"))
pipeline.fit(DF)
Encode with smoothing and reproducible folds:
import teradataml as tdml
import tdprepview
from tdprepview import TargetEncoder
te = TargetEncoder(
target_var="label",
smooth=10.0,
cv=3,
shuffle=True,
random_state=42
)
pipeline = tdprepview.Pipeline(
steps=[(["product"], te)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","transactions"))
pipeline.fit(DF)
Notes
- Output column type is FLOAT.
- Requires the target variable to be present in the training DataFrame.
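A minimal sketch of the blending idea described above, assuming the scikit-learn-style formulation for a numeric smooth value (the "auto" heuristic and the cross-validated fitting are not reproduced here):
def encode_category(cat_mean, cat_count, global_mean, smooth):
    """Shrink the per-category target mean toward the global target mean."""
    weight = cat_count / (cat_count + smooth)
    return weight * cat_mean + (1 - weight) * global_mean

# a rare category (5 rows) is pulled strongly toward the global mean
print(encode_category(cat_mean=0.8, cat_count=5, global_mean=0.3, smooth=10.0))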
Discretize#
tdprepview.FixedWidthBinning#
Perform fixed-width binning on a numerical column.
This preprocessor divides numerical data into a fixed number of bins. Each bin has equal width, and values are assigned an integer representing the bin index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_bins | int | Number of bins to divide the data into. Must be greater than 1. | 5 |
lower_bound | Optional[Union[int, float]] | Optional lower bound of the binning range. If None, the column minimum is used. | None |
upper_bound | Optional[Union[int, float]] | Optional upper bound of the binning range. If None, the column maximum is used. | None |
Examples:
Basic fixed-width binning:
import teradataml as tdml
import tdprepview
binning = tdprepview.FixedWidthBinning(n_bins=5)
pipeline = tdprepview.Pipeline(
steps=[(["col1"], binning)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Fixed-width binning with custom bounds:
import teradataml as tdml
import tdprepview
binning = tdprepview.FixedWidthBinning(n_bins=4, lower_bound=0.0, upper_bound=100.0)
pipeline = tdprepview.Pipeline(
steps=[(["colA"], binning)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- The output column type is INTEGER.
- During fitting, necessary statistics (min/max) are collected with TD_UnivariateStatistics if bounds are None.
- Values are assigned to bins indexed from 0 to n_bins-1.
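A plain-Python sketch of the equal-width bin assignment implied above (bins indexed 0 to n_bins-1; how values exactly on the upper bound or outside custom bounds are handled is an assumption, not documented here):
def fixed_width_bin(x, lower, upper, n_bins):
    width = (upper - lower) / n_bins
    idx = int((x - lower) // width)
    return min(max(idx, 0), n_bins - 1)  # clamp to the valid bin range

print(fixed_width_bin(37.0, lower=0.0, upper=100.0, n_bins=4))  # bin 1 (25 <= x < 50)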
tdprepview.VariableWidthBinning#
Bin numerical data into variable-width bins.
This preprocessor supports two binning strategies:
- "quantiles": Divide the data into bins based on percentiles.
- "custom": Divide the data based on user-defined boundaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind | Literal['quantiles', 'custom'] | Method of binning: "quantiles" or "custom". | 'quantiles' |
no_quantiles | int | Number of bins to use when kind="quantiles". Must be between 2 and 100. | 5 |
boundaries | Optional[List[Union[int, float]]] | List of numeric boundaries for custom binning. Must be sorted in ascending order. Required if kind="custom". | None |
Examples:
Quantile-based binning:
import teradataml as tdml
import tdprepview
binner = tdprepview.VariableWidthBinning(kind="quantiles", no_quantiles=4)
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2"], binner)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Custom boundary binning:
import teradataml as tdml
import tdprepview
binner = tdprepview.VariableWidthBinning(kind="custom", boundaries=[0, 10, 20, 50])
pipeline = tdprepview.Pipeline(
steps=[(["colA"], binner)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- The output column type is INTEGER.
- During fitting, necessary statistics (percentiles) are collected if kind="quantiles".
- Custom boundaries are used directly without computation.
- Bins are indexed starting from 0 up to the number of bins minus one.
tdprepview.QuantileTransformer#
QuantileTransformer(n_quantiles=10, output_distribution='uniform', ignore_implicit_zeros=False, subsample=None, random_state=None, copy=True)
Transform features using quantiles information.
This preprocessor transforms numerical columns by mapping values to their corresponding quantile bins. Output values are integers representing the quantile index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_quantiles | int | Number of quantiles to compute. | 10 |
output_distribution | str | Only 'uniform' is currently supported. | 'uniform' |
ignore_implicit_zeros | bool | Whether to ignore implicit zero values when computing quantiles. | False |
subsample | Optional[int] | If not None, subsample the dataset to this size when computing quantiles. Ignored. | None |
random_state | Optional[int] | Random seed. Ignored. | None |
copy | bool | Whether to copy the input array before transforming. Ignored. | True |
Examples:
Quantile binning:
import teradataml as tdml
import tdprepview
qt = tdprepview.QuantileTransformer(n_quantiles=5)
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2"], qt)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER.
- This transformer uses quantile computation to assign bins.
- Currently, only the "uniform" output distribution is supported.
tdprepview.DecisionTreeBinning#
Bin numerical data into variable-width bins based on a decision tree.
This preprocessor uses a decision tree to find optimal bin boundaries. During fitting, a sample of the data is pulled into Python to train a scikit-learn decision tree. During transform, SQL expressions are used to assign bin indices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_var | str | Column name of the target variable. Required. | required |
model_type | Literal['classification', 'regression'] | Either "classification" or "regression". Determines the type of tree used. | 'classification' |
no_bins | int | Number of bins (tree leaves). Must be between 2 and 100. | 10 |
no_rows | int | Number of rows randomly sampled for fitting the tree. Must be between 100 and 100000. | 10000 |
Examples:
Decision tree binning of a single column:
import teradataml as tdml
import tdprepview
dtb = tdprepview.DecisionTreeBinning(target_var="target", model_type="classification", no_bins=5)
pipeline = tdprepview.Pipeline(
steps=[(["feature1"], dtb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Decision tree binning of multiple columns:
import teradataml as tdml
import tdprepview
dtb = tdprepview.DecisionTreeBinning(target_var="target", model_type="regression", no_bins=8)
pipeline = tdprepview.Pipeline(
steps=[(["featureA", "featureB"], dtb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- During fitting, a sample of the data is used in Python to fit a scikit-learn decision tree.
- During transform, SQL formulas are applied to assign bin indices.
- Output column type is INTEGER.
- The number of bins corresponds to the number of leaves in the fitted tree.
tdprepview.ThresholdBinarizer#
Binarize numeric data using a threshold value.
This preprocessor assigns 1 if a value exceeds the threshold and 0 otherwise.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | Union[int, float, Literal['mean', 'mode', 'median', 'P[1-100]']] | Threshold value for binarization. Can be a number (int or float) or a string: "mean", "mode", "median", or a percentile string "P[1-100]", e.g. "P33". | 'mean' |
Examples:
Binarize using mean:
import teradataml as tdml
import tdprepview
binarizer = tdprepview.ThresholdBinarizer(threshold="mean")
pipeline = tdprepview.Pipeline(
steps=[(["col1"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Binarize using a numeric threshold:
import teradataml as tdml
import tdprepview
binarizer = tdprepview.ThresholdBinarizer(threshold=50)
pipeline = tdprepview.Pipeline(
steps=[(["colA"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0 or 1).
- Threshold can be a fixed number, a summary statistic, or a percentile of the data.
tdprepview.Binarizer#
Binarize data according to a threshold (sklearn-compatible alias).
This is a convenience alias for ThresholdBinarizer that provides a scikit-learn-like API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | Union[int, float] | Threshold value to use for binarization. Must be numeric. | 0.0 |
copy | bool | Ignored. Whether to copy the input array before binarizing. | True |
Examples:
Binarize a single column:
import teradataml as tdml
import tdprepview
binarizer = tdprepview.Binarizer(threshold=10)
pipeline = tdprepview.Pipeline(
steps=[(["col1"], binarizer)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0 or 1).
- This class is an alias of ThresholdBinarizer for sklearn compatibility.
tdprepview.ListBinarizer#
Binarize text data based on membership in a list or top-K most frequent values.
This preprocessor assigns 1 if a text value is in the provided list of elements, or among the top-K most frequent values. Otherwise, it assigns 0.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
elements1 | Union[str, List[str]] | Either a list of strings to match, or a "TOPK" string indicating the K most frequent values, e.g. "TOP3" or "TOP10". | 'TOP3' |
Examples:
Using the top-K most frequent values:
import teradataml as tdml
import tdprepview
lb = tdprepview.ListBinarizer(elements1="TOP5")
pipeline = tdprepview.Pipeline(
steps=[(["category_col"], lb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom list:
import teradataml as tdml
import tdprepview
lb = tdprepview.ListBinarizer(elements1=["apple", "banana", "orange"])
pipeline = tdprepview.Pipeline(
steps=[(["fruit_col"], lb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0 or 1).
- Top-K values are computed in database with a query template.
tdprepview.LabelEncoder#
Encode a text column into numerical values using label encoding.
This preprocessor assigns integers to text values. Encoding can be based on either a list of custom elements or the K most frequent values in the column, indicated by a TOPK string (e.g., "TOP20").
Parameters:
Name | Type | Description | Default |
---|---|---|---|
elements | Union[str, List[str]] | Either a list of strings to use for encoding, or a TOPK string specifying the number of most frequent values to encode. | 'TOP100' |
Examples:
Using TOPK:
import teradataml as tdml
import tdprepview
le = tdprepview.LabelEncoder(elements="TOP10")
pipeline = tdprepview.Pipeline(
steps=[(["category_col"], le)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom list:
import teradataml as tdml
import tdprepview
le = tdprepview.LabelEncoder(elements=["apple", "banana", "orange"])
pipeline = tdprepview.Pipeline(
steps=[(["fruit_col"], le)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER.
- Top-K values are computed in the database with a query template during fitting.
tdprepview.SimpleHashEncoder#
Encode a text column using a hash function into INTEGER values.
This preprocessor applies a TD-built-in hash function. It is stateless and very performant, but hash collisions can occur. Optional bucketing can be applied on top of the hash value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_buckets | Optional[int] | Optional number of buckets, applied by taking the hash code modulo num_buckets. If None, the raw hash value is used. Must be an integer > 1 or None. | None |
salt | str | Optional string appended to each value before hashing, to redistribute hash values. | '' |
|
Examples:
Using raw hash values:
import teradataml as tdml
import tdprepview
hasher = tdprepview.SimpleHashEncoder()
pipeline = tdprepview.Pipeline(
steps=[(["text_col"], hasher)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using buckets and salt:
import teradataml as tdml
import tdprepview
hasher = tdprepview.SimpleHashEncoder(num_buckets=100, salt="xyz")
pipeline = tdprepview.Pipeline(
steps=[(["text_col"], hasher)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER.
- Stateless: no fitting is performed; values are hashed on transform.
- Collisions can occur if multiple values map to the same hash bucket.
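A plain-Python sketch of the salt-and-bucket logic (Python's built-in hash stands in for the Teradata hash function, which this page does not name; only the control flow is illustrated):
def hash_encode(value, num_buckets=None, salt=""):
    code = hash(value + salt)        # stand-in for the in-database hash function
    if num_buckets is None:
        return code                  # raw hash value
    return abs(code) % num_buckets   # bucketed value in [0, num_buckets)

print(hash_encode("some text", num_buckets=100, salt="xyz"))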
Feature Engineering#
tdprepview.PolynomialFeatures#
Generate polynomial and interaction features from input data.
This preprocessor creates new features by raising input columns to powers up to degree and, optionally, by including only interaction features between distinct columns. No fitting is required; transformations are applied using SQL formulas.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
degree | int | The degree of the polynomial features. Must be an integer >= 1. | 2 |
interaction_only | bool | If True, only interaction features (products of distinct columns) are generated. | False |
Examples:
Polynomial features on multiple columns:
import teradataml as tdml
import tdprepview
poly = tdprepview.PolynomialFeatures(degree=2, interaction_only=False)
pipeline = tdprepview.Pipeline(
steps=[(["col1", "col2", "col3"], poly)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is numeric.
- The class is based on scikit-learn's PolynomialFeatures from the preprocessing module.
- Interactions are computed across all specified input columns.
- No fitting is required; transformations are applied via SQL formulas.
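A small sketch of which degree-2 products an expansion over col1, col2, col3 would contain, following scikit-learn's convention (whether the bias term and the original degree-1 columns also appear in tdprepview's output is not stated on this page and is therefore an assumption):
from itertools import combinations, combinations_with_replacement

cols = ["col1", "col2", "col3"]

# degree=2, interaction_only=False: all pairwise products, including squares
print([" * ".join(pair) for pair in combinations_with_replacement(cols, 2)])

# degree=2, interaction_only=True: only products of distinct columns
print([" * ".join(pair) for pair in combinations(cols, 2)])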
tdprepview.OneHotEncoder#
OneHotEncoder(*, categories='auto', handle_unknown='ignore', min_frequency=None, max_categories=None)
One-hot encode categorical features.
This preprocessor converts categorical columns into multiple binary (0/1) columns, one per category. Categories can be inferred from the data or provided as a custom list. No fitting is required for custom categories; otherwise, necessary statistics are collected during fitting to determine categories.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
categories | Union[str, List[str]] | Either "auto" to infer categories from the data, or a list of strings specifying the categories. | 'auto' |
handle_unknown | str | Ignored. | 'ignore' |
min_frequency | Optional[int] | Ignored. | None |
max_categories | Optional[int] | Maximum number of categories to encode. Defaults to 50. | None |
|
Examples:
Automatic category detection:
import teradataml as tdml
import tdprepview
ohe = tdprepview.OneHotEncoder(categories="auto")
pipeline = tdprepview.Pipeline(
steps=[(["cat_col"], ohe)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom category list:
import teradataml as tdml
import tdprepview
ohe = tdprepview.OneHotEncoder(categories=["red", "green", "blue"])
pipeline = tdprepview.Pipeline(
steps=[(["color_col"], ohe)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0/1 per category).
- If categories="auto", the necessary statistics are collected during fitting.
- For custom categories, no fitting is required; transformations use SQL formulas.
- A maximum of 50 categories can be encoded.
tdprepview.MultiLabelBinarizer#
Multi-label binarizer for categorical features. The input column contains delimiter-separated values. The output is one indicator (0/1) variable per unique value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
classes | Optional[Union[str, List[str]]] | Categories to encode. Can be None (categories inferred during fitting), "auto" (infer categories from the training data), or a list of strings (custom categories). | None |
sparse_output | bool | Ignored. | False |
max_categories | Optional[int] | Maximum number of categories to encode. Defaults to 50. | None |
delimiter | str | Delimiter used to separate multiple values in the input string. | ', ' |
Examples:
Automatic category detection:
import teradataml as tdml
import tdprepview
mlb = tdprepview.MultiLabelBinarizer(classes="auto")
pipeline = tdprepview.Pipeline(
steps=[(["tags"], mlb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Using a custom list of classes:
import teradataml as tdml
import tdprepview
mlb = tdprepview.MultiLabelBinarizer(classes=["red", "green", "blue"])
pipeline = tdprepview.Pipeline(
steps=[(["colors"], mlb)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output column type is INTEGER (0/1 per unique value).
- For classes="auto" or None, the necessary statistics are collected during fitting.
- For custom classes, no fitting is required; transformations use SQL formulas.
- A maximum of 50 categories can be encoded.
Dimensionality Reduction & Miscellaneous#
tdprepview.PCA#
Principal Component Analysis (PCA) reduces the dimensionality of a dataset while retaining most of its variance.
During fitting, a sample of the data is pulled into Python to fit a scikit-learn PCA. During transform, the weighted sum of components is calculated directly in SQL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_components | Union[int, str] | Number of principal components to keep. Can be an integer or "mle" (automatic selection via maximum likelihood estimation). | 'mle' |
random_state | int | Seed for the random number generator. | 42 |
Examples:
Fit PCA with a fixed number of components:
import teradataml as tdml
import tdprepview
pca = tdprepview.PCA(n_components=3)
pipeline = tdprepview.Pipeline(
steps=[(["feature1", "feature2", "feature3"], pca)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Fit PCA using MLE for component selection:
import teradataml as tdml
import tdprepview
pca = tdprepview.PCA(n_components="mle")
pipeline = tdprepview.Pipeline(
steps=[(["feature1", "feature2", "feature3"], pca)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output is a weighted combination of input columns (SQL computed).
- The number of components can be automatically inferred with "mle".
- Fitting uses a Python sample for PCA computation.
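As a rough sketch of what the SQL-side transform evaluates for a single principal component (a weighted sum of the input columns; the mean-centering mirrors scikit-learn's PCA and is an assumption about the generated SQL):
def pca_component_score(row, weights, means):
    """One component score: dot product of the centered inputs with the fitted loadings."""
    return sum(w * (x - m) for x, w, m in zip(row, weights, means))

# hypothetical fitted loadings and column means for feature1..feature3
print(pca_component_score(row=[1.0, 2.0, 3.0],
                          weights=[0.5, -0.1, 0.8],
                          means=[0.9, 1.5, 2.2]))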
tdprepview.TryCast#
Attempt to cast a text column to a specified data type using SQL TRYCAST.
This preprocessor is stateless: nothing is done during fitting. During transform, the SQL TRYCAST function is applied to each column value.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_type | Literal['BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', 'TIMESTAMP(6)'] | Target data type to cast the column to. Must be one of 'BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', or 'TIMESTAMP(6)'. | 'FLOAT' |
Examples:
Convert a text column to integer:
import teradataml as tdml
import tdprepview
caster = tdprepview.TryCast(new_type="INT")
pipeline = tdprepview.Pipeline(
steps=[("text_col", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Convert a text column to timestamp:
import teradataml as tdml
import tdprepview
caster = tdprepview.TryCast(new_type="TIMESTAMP(6)")
pipeline = tdprepview.Pipeline(
steps=[("text_col", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output type is the same as new_type.
- Uses SQL TRYCAST in transform; any conversion failure results in a NULL (for example, casting the string 'abc' to INT yields NULL).
tdprepview.Cast#
Convert a column to a specified data type using SQL CAST.
This preprocessor is stateless: nothing is done during fitting. During transform, the SQL CAST function is applied to each column value. It is typically used as the last step to ensure numeric features are FLOAT. For text columns, prefer TryCast to avoid errors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_type | Literal['BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', 'TIMESTAMP(6)'] | Target data type to cast the column to. Must be one of 'BYTEINT', 'SMALLINT', 'INT', 'BIGINT', 'FLOAT', 'DATE', 'TIME', or 'TIMESTAMP(6)'. | 'FLOAT' |
Examples:
Convert a column to float:
import teradataml as tdml
import tdprepview
caster = tdprepview.Cast(new_type="FLOAT")
pipeline = tdprepview.Pipeline(
steps=[("col1", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Convert a column to integer:
import teradataml as tdml
import tdprepview
caster = tdprepview.Cast(new_type="INT")
pipeline = tdprepview.Pipeline(
steps=[("col1", caster)]
)
DF = tdml.DataFrame(tdml.in_schema("my_schema","my_table"))
pipeline.fit(DF)
Notes
- Output type is the same as new_type.
- Uses SQL CAST in transform; invalid conversions will raise errors.