# The Pipeline Class
A practical guide to building, fitting, transforming, deploying, and moving preprocessing logic with `tdprepview`.
## Overview
**What is `Pipeline`?**

A declarative, SQL-generating preprocessing pipeline for Teradata Vantage. You describe *what* to do; `Pipeline` builds a DAG and emits a single, reviewable SQL query (or a view).
**How it differs from scikit-learn pipelines**
- In-DB pushdown: work happens in Vantage, not in Python.
- DAG, not just a list: supports fan-in/fan-out nodes.
- SQL output: get a query or create a view.
- No Python at runtime: once deployed as SQL/view/JSON, serving doesn’t need a Python process.
## Why a pipeline
- Reproducible: one object defines selection, transforms, and naming.
- Auditable: SQL is reviewable by DBAs; JSON captures full logic.
- Portable: use as a DataFrame, a SQL string, a database view, or serialized JSON.
- Clear fit vs. transform: statistics are computed during `fit`; `transform` emits pure SQL.
## Core concepts
### Step
A step is a tuple: `(input selector, preprocessor(s), optional rename)`.
```python
from tdprepview import Pipeline
from tdprepview import Impute, Scale, OneHotEncoder, PCA, Cast

pl = Pipeline([
    (["age", "salary"], Impute(kind="median")),                    # (selector, preprocessor)
    ({"dtype_include": ["int"]}, Scale(kind="zscore")),            # dict selector
    ("city", OneHotEncoder(categories="auto"), {"prefix": "oh_"})  # with rename
])
```
### Column selectors
- `str` → one column
- `list[str]` → explicit columns
- `dict` with any of: `prefix`, `suffix`, `pattern`, `dtype_include`, `dtype_exclude`, `columns_exclude`
Selectors compose left-to-right inside the dict (filtering down the current leaves).
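For example, a dict selector can stack several of these keys; a minimal sketch (the column names are hypothetical):

```python
# Hypothetical leaves: txn_amount (FLOAT), txn_count (INT), txn_id (INT).
# Keys filter in sequence: keep "txn_"-prefixed columns, then keep only
# integer dtypes, then drop txn_id, leaving txn_count.
selector = {
    "prefix": "txn_",
    "dtype_include": ["int"],
    "columns_exclude": ["txn_id"],
}
```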
### Preprocessors
- Per-column vs combined-inputs (e.g., PCA). Combined inputs create fan-in nodes.
- Single-output vs multi-output (affects generated column names).
### Renaming options
Add {"prefix": "...", "suffix": "..."}
to control output names. This is especially useful for chaining:
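A minimal sketch, reusing the classes imported above (the `PCA` parameter name is an assumption):

```python
from tdprepview import Pipeline, PCA, Scale

# Give the PCA outputs a stable "pca_" prefix...
pl = Pipeline([
    (["x1", "x2", "x3"], PCA(n_components=2), {"prefix": "pca_"}),
    # ...so a later step can select exactly those outputs again.
    ({"prefix": "pca_"}, Scale(kind="zscore")),
])
```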
### DAG model
Internally, a DAG of `ColumnNode` and `PreProcesserNode` objects is built. A `WITH ... AS` query assembles the intermediate steps; the leaves are the final output columns.
## Building a pipeline (syntax patterns)
**Step expansion rule (when you pass a list of preprocessors + options)**

- The first preprocessor runs with your provided `prefix`/`suffix`.
- Subsequent preprocessors are auto-rewired to take the renamed outputs as inputs, and run without re-applying the rename.
- For dict selectors, `prefix`/`suffix` are composed when chaining (`pattern` is removed to avoid ambiguity).
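A sketch of this expansion, reusing classes from the earlier step example (the rewired output names are illustrative):

```python
from tdprepview import Impute, Scale

# One step with a list of preprocessors plus a rename...
step = (["age", "salary"],
        [Impute(kind="median"), Scale(kind="zscore")],
        {"prefix": "num_"})

# ...behaves roughly like two steps: the rename applies once, and the second
# preprocessor is rewired to consume the renamed outputs.
expanded = [
    (["age", "salary"], Impute(kind="median"), {"prefix": "num_"}),
    (["num_age", "num_salary"], Scale(kind="zscore")),
]
```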
### Input validation (what will assert)
- Steps must be tuples of length 2 or 3.
- Selectors must be `str | list[str] | dict` with only allowed keys.
- Lists must be deduplicated (`["a", "a"]` will assert).
- Preprocessors must be instances of `Preprocessor` (or a list of them).
- Rename options accept only `"prefix"`/`"suffix"` strings.
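For instance, a duplicated selector list fails fast; a hypothetical illustration (assuming validation raises `AssertionError` at construction):

```python
from tdprepview import Pipeline, Impute

# Hypothetical: ["age", "age"] violates the deduplication rule.
try:
    Pipeline([(["age", "age"], Impute(kind="median"))])
except AssertionError as err:
    print("invalid step:", err)
```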
### Naming conventions
Prefer early, explicit renames to avoid collisions and to make downstream selectors stable (e.g., prefix PCA outputs with `pca_…`).
## Deep Dive into Fitting: what actually happens
1. **Requirements**
    - A live `teradataml` context.
    - Provide either `DF` or `schema_name` + `table_name`.
2. **DAG growth**
    - For each step, the selector resolves against the current leaves.
    - Nodes are appended; combined-input preprocessors create fan-in.
3. **Statistics collection (only if needed)**
    - Triggered by `preprocessor.necessary_statistics`.
    - Sources:
        - In-DB: `UnivariateStatistics` (mean, std, median, percentiles), top values, top tokens.
        - sklearn helpers (when applicable): e.g., IterativeImputer parameters, PCA loadings, decision tree bin edges, power transform lambdas.
4. **Stored after fit**
    - The constructed DAG and any fitted statistics per node.
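Putting it together, a minimal fit sketch, assuming the pipeline `pl` from above (connection details and table names are placeholders):

```python
import teradataml as tdml

tdml.create_context(host="...", username="...", password="...")

# Either pass the table coordinates...
pl.fit(schema_name="demo_user", table_name="my_table")

# ...or an existing teradataml DataFrame:
# pl.fit(DF=tdml.DataFrame("my_table"))
```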
## Deep Dive into Transforming: what actually happens
- **Inputs:** `DF` or `schema_name`/`table_name` (same checks as `fit`).
- **Outputs via `return_type`** (see the sketch after this list):
    - `"df"` → `tdml.DataFrame` (default)
    - `"str"` → SQL string
    - `None` → side effects only (e.g., create a view)
- **Column forwarding & resilience**
    - Columns present in the DF but not in the DAG are forwarded unchanged.
    - Columns expected by the DAG but missing from the DF are skipped.
- Deterministic, as long as the schema and dtypes match what you fitted on.
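The three output modes in one sketch, assuming a fitted pipeline `pl`:

```python
# Default: a teradataml DataFrame for further in-DB work.
DF_out = pl.transform(schema_name="s", table_name="t")

# SQL string, e.g. for review or source control.
sql = pl.transform(schema_name="s", table_name="t", return_type="str")

# Side effects only, e.g. creating a view (see "Deploying preprocessing").
pl.transform(schema_name="s", table_name="t", return_type=None,
             create_replace_view=True,
             output_schema_name="prod", output_view_name="features_v")
```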
## Deploying preprocessing
Think of three persistence strategies:
1. **Persist a single run.** Materialize the output or create a view once; downstream jobs just read it.
2. **Persist the query.** Use `create_replace_view=True` as a "deploy button": the logic becomes a database view that always reflects the current base table.
3. **Persist the logic.** Serialize to JSON. Python is needed to reconstruct the query at serve time; you can store the JSON in a CLOB for governance and later reconstruct the pipeline.
```python
# 1) Get the SQL once
sql = pl.transform(schema_name="s", table_name="t", return_type="str")

# 2) Deploy as a view
pl.transform(DF=DF_in, return_type=None, create_replace_view=True,
             output_schema_name="prod", output_view_name="features_v")

# 3) Store the logic as JSON
pl.to_json("pipeline.json")
# ... store the file content in a CLOB column for versioning
restored = Pipeline.from_json("pipeline.json")
```
**Versioning ideas**

- Store each JSON in a table with metadata (hash, author, created_at), e.g. as sketched below.
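A sketch of assembling such a registry row client-side (the column layout is an assumption):

```python
import datetime
import getpass
import hashlib

# Read the serialized pipeline produced by pl.to_json(...).
with open("pipeline.json", encoding="utf-8") as f:
    payload = f.read()

# Metadata for a registry table; the JSON itself is destined for a CLOB column.
row = {
    "hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    "author": getpass.getuser(),
    "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "pipeline_json": payload,
}
```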
## Visualization
Use a Sankey diagram to inspect lineage and fan-in/fan-out.
*(Sankey diagram output)*
## Understand the produced query
Assuming this is the input data:

**Input data**
| customerid | exited | surname | creditscore | geography | gender | age | tenure | balance | hascrcard | isactivemember | estimatedsalary | bank_products |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15768104 | 0 | Wright | 788 | Spain | Male | 37.0 | 8 | 141541.25 | 0 | 0 | 66013.27 | RetirementAccount |
| 15809826 | 1 | Craigie | 728 | France | Female | 46.0 | 2 | 109705.52 | 1 | 0 | 20276.87 | PersonalLoan |
| 15717736 | 0 | Dr. Shen | 639 | Germany | Female | 46.0 | 10 | 110031.09 | 1 | 1 | 133995.59 | CertificateOfDeposit,CheckingAccount |
| 15748589 | 0 | Winter | 736 | France | Female | 30.0 | 9 | 0.0 | 1 | 0 | 34180.33 | CreditCard,HomeEquityLoan |
| 15704053 | 1 | T'ang | 710 | Spain | Male | 62.0 | 3 | 131078.42 | 1 | 0 | 119348.76 | RetirementAccount,PersonalLoan |
| 15806808 | 1 | Hope | 834 | Germany | Female | None | 8 | 112281.60 | 1 | 0 | 140225.14 | CheckingAccount,AutoLoan,RetirementAccount |
| 15694530 | 0 | Porter | 672 | France | Male | 28.0 | 4 | 167268.98 | 1 | 1 | 169469.30 | HomeEquityLoan |
| 15712903 | 0 | Diaz | 499 | France | Female | 21.0 | 3 | 176511.08 | 1 | 1 | 153920.22 | InvestmentFund |
| 15791045 | 0 | Boni | 568 | France | Female | 38.0 | 3 | 132951.92 | 0 | 1 | 124486.28 | RetirementAccount |
| 15803790 | 0 | Allen | 638 | Germany | Male | 37.0 | 2 | 89728.86 | 1 | 1 | 37294.88 | CertificateOfDeposit,PersonalLoan |
and assuming this is the `Pipeline` plan we have produced:

**Pipeline steps**
```python
import tdprepview

steps = [
    (['creditscore', 'age', 'tenure', 'balance', 'estimatedsalary'],
     tdprepview.SimpleImputer(strategy='mean')),
    (['geography', 'gender', 'bank_products'],
     tdprepview.ImputeText(kind='mode')),
    (['hascrcard', 'isactivemember'],
     tdprepview.SimpleImputer(strategy='most_frequent')),
    (['creditscore', 'age', 'tenure', 'balance', 'estimatedsalary'],
     tdprepview.MinMaxScaler()),
    (['age'],
     tdprepview.PowerTransformer(method='yeo-johnson')),
    (['geography'],
     tdprepview.OneHotEncoder(max_categories=20)),
    (['gender'],
     tdprepview.LabelEncoder(elements='TOP1')),
    (['bank_products'],
     tdprepview.MultiLabelBinarizer(delimiter=',', max_categories=20)),
    ({'columns_exclude': ['customerid', 'exited', 'surname']},
     tdprepview.Cast(new_type='FLOAT'))
]

pl = tdprepview.Pipeline(steps)
```
After fitting, we get this query. The numbered annotations in the SQL are explained in the list below.
```sql
WITH preprocessing_steps AS
(
SELECT
customerid AS c_i_0, -- (1)!
exited AS c_i_1,
surname AS c_i_2,
creditscore AS c_i_3,
geography AS c_i_4,
gender AS c_i_5,
age AS c_i_6,
tenure AS c_i_7,
balance AS c_i_8,
hascrcard AS c_i_9,
isactivemember AS c_i_10,
estimatedsalary AS c_i_11,
bank_products AS c_i_12,
COALESCE( c_i_3 , 650.7018 ) AS c_i_13, -- (2)!
COALESCE( c_i_4 , 'France' ) AS c_i_18, -- (3)!
COALESCE( c_i_5 , 'Male' ) AS c_i_19,
COALESCE( c_i_6 , 38.87352 ) AS c_i_14,
COALESCE( c_i_7 , 4.9726 ) AS c_i_15,
COALESCE( c_i_8 , 75582.44 ) AS c_i_16,
COALESCE( c_i_9 , 1.0 ) AS c_i_21, -- (4)!
COALESCE( c_i_10 , 1.0 ) AS c_i_22,
COALESCE( c_i_11 , 100690.6 ) AS c_i_17,
COALESCE( c_i_12 , 'RetirementAccount' ) AS c_i_20,
( ( c_i_13 ) - 350.0 ) / NULLIF( 500.0 , 0) AS c_i_23, -- (5)!
CASE c_i_18 WHEN 'France' THEN 1 ELSE 0 END AS c_i_29, -- (6)!
CASE c_i_18 WHEN 'Germany' THEN 1 ELSE 0 END AS c_i_30,
CASE c_i_18 WHEN 'Spain' THEN 1 ELSE 0 END AS c_i_31,
CASE WHEN (c_i_18) IS NOT IN ('France', 'Germany', 'Spain') THEN 1 ELSE 0 END AS c_i_32,
CASE c_i_19 WHEN 'Male' THEN 1 ELSE 0 END AS c_i_33, -- (7)!
( ( c_i_14 ) - 18.0 ) / NULLIF( 74.0 , 0) AS c_i_24,
( ( c_i_15 ) - 0.0 ) / NULLIF( 10.0 , 0) AS c_i_25,
( ( c_i_16 ) - 0.0 ) / NULLIF( 250898.1 , 0) AS c_i_26,
CAST( (c_i_21) AS FLOAT ) AS c_i_53,
CAST( (c_i_22) AS FLOAT ) AS c_i_54,
( ( c_i_17 ) - 11.58 ) / NULLIF( 199941.8 , 0) AS c_i_27,
CASE WHEN (POSITION(',MortgageLoan,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_34, -- (9)!
CASE WHEN (POSITION(',RetirementAccount,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_35,
CASE WHEN (POSITION(',CreditCard,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_36,
CASE WHEN (POSITION(',InvestmentFund,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_37,
CASE WHEN (POSITION(',HomeEquityLoan,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_38,
CASE WHEN (POSITION(',PersonalLoan,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_39,
CASE WHEN (POSITION(',CheckingAccount,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_40,
CASE WHEN (POSITION(',SavingsAccount,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_41,
CASE WHEN (POSITION(',CertificateOfDeposit,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_42,
CASE WHEN (POSITION(',AutoLoan,' IN ','||c_i_20||','))>0 THEN 1 ELSE 0 END AS c_i_43,
CAST( (c_i_23) AS FLOAT ) AS c_i_44, -- (10)!
CAST( (c_i_29) AS FLOAT ) AS c_i_45,
CAST( (c_i_30) AS FLOAT ) AS c_i_46,
CAST( (c_i_31) AS FLOAT ) AS c_i_47,
CAST( (c_i_32) AS FLOAT ) AS c_i_48,
CAST( (c_i_33) AS FLOAT ) AS c_i_49,
CASE WHEN c_i_24 >= 0.0 THEN (POWER(c_i_24 + 1 ,-1.968491)-1)/(-1.968491) ELSE -(POWER(-c_i_24 + 1,2-(-1.968491))-1)/(2-(-1.968491)) END AS c_i_28, -- (8)!
CAST( (c_i_25) AS FLOAT ) AS c_i_51,
CAST( (c_i_26) AS FLOAT ) AS c_i_52,
CAST( (c_i_27) AS FLOAT ) AS c_i_55,
CAST( (c_i_34) AS FLOAT ) AS c_i_56,
CAST( (c_i_35) AS FLOAT ) AS c_i_57,
CAST( (c_i_36) AS FLOAT ) AS c_i_58,
CAST( (c_i_37) AS FLOAT ) AS c_i_59,
CAST( (c_i_38) AS FLOAT ) AS c_i_60,
CAST( (c_i_39) AS FLOAT ) AS c_i_61,
CAST( (c_i_40) AS FLOAT ) AS c_i_62,
CAST( (c_i_41) AS FLOAT ) AS c_i_63,
CAST( (c_i_42) AS FLOAT ) AS c_i_64,
CAST( (c_i_43) AS FLOAT ) AS c_i_65,
CAST( (c_i_28) AS FLOAT ) AS c_i_50
FROM
demo_user.order_raw_trainig t
)
SELECT
c_i_0 AS customerid, -- (11)!
c_i_1 AS exited,
c_i_2 AS surname,
c_i_44 AS creditscore,
c_i_45 AS geography__OHE_1_France,
c_i_46 AS geography__OHE_2_Germany,
c_i_47 AS geography__OHE_3_Spain,
c_i_48 AS geography__OHE_0_otherwise,
c_i_49 AS gender,
c_i_50 AS age,
c_i_51 AS tenure,
c_i_52 AS balance,
c_i_53 AS hascrcard,
c_i_54 AS isactivemember,
c_i_55 AS estimatedsalary,
c_i_56 AS bank_products__MLB_1_MortgageLoan, -- (12)!
c_i_57 AS bank_products__MLB_2_RetirementAccount,
c_i_58 AS bank_products__MLB_3_CreditCard,
c_i_59 AS bank_products__MLB_4_InvestmentFund,
c_i_60 AS bank_products__MLB_5_HomeEquityLoan,
c_i_61 AS bank_products__MLB_6_PersonalLoan,
c_i_62 AS bank_products__MLB_7_CheckingAccount,
c_i_63 AS bank_products__MLB_8_SavingsAccount,
c_i_64 AS bank_products__MLB_9_CertificateOfDeposit,
c_i_65 AS bank_products__MLB_10_AutoLoan
FROM
preprocessing_steps t
```
1. Aliasing of possibly lengthy input column names to short internal identifiers.
2. `SimpleImputer(strategy='mean')` on numeric columns (`creditscore`, `age`, `tenure`, `balance`, `estimatedsalary`) becomes `COALESCE(value, learned_mean)` → `c_i_13..17`.
3. `ImputeText(kind='mode')` on `geography`, `gender`, `bank_products`: `COALESCE(text_col, learned_mode)` → `c_i_18..20`.
4. `SimpleImputer(strategy='most_frequent')` on `hascrcard`, `isactivemember`: `COALESCE(int_col, mode)` → `c_i_21..22`.
5. `MinMaxScaler()` on the numeric set: `(x - min) / NULLIF(range, 0)` using fitted stats → `c_i_23` (`creditscore` shown), and similarly `c_i_24..27` for `age`, `tenure`, `balance`, `estimatedsalary`.
6. `OneHotEncoder(max_categories=20)` on `geography`: one column per top category plus an "otherwise" bucket → `c_i_29..32`.
7. `LabelEncoder(elements='TOP1')` on `gender`: binary indicator for the top element (here `'Male'` as 1, else 0) → `c_i_33`.
8. `PowerTransformer(method='yeo-johnson')` on scaled `age` (`c_i_24`): applies the learned λ (here ≈ −1.9685) to produce `c_i_28`.
9. `MultiLabelBinarizer(delimiter=',')` on `bank_products`: token presence via `POSITION(',Token,' IN ','||col||',')` → `c_i_34..43`.
10. `Cast(new_type='FLOAT')` on all features except the excluded identifiers: ensures numeric dtypes for ML serving (`c_i_44..65`).
11. Forwarding of untouched inputs (no preprocessors applied): `customerid`, `exited`, `surname` are passed through unchanged.
12. Final naming in the outer `SELECT`: stable, human-readable feature names with suffixes that reflect the preprocessor (`__OHE_*`, `__MLB_*`, raw names for scaled/cast numerics).