
Pipeline

tdprepview.Pipeline #

Pipeline(steps)

A data processing pipeline consisting of a sequence of steps. Each step is a tuple of input columns, preprocessors, and optional renaming options. During fitting, a directed acyclic graph (DAG) is generated to represent the execution order.

Parameters:

steps (List[Step], required)

A list of tuples representing the steps of the pipeline. Each tuple must have 2 or 3 elements.

  1. The first element specifies the input columns:

    • str: a single column name
    • list[str]: a list of column names
    • dict: a dictionary with keys among {"prefix","suffix","pattern","dtype_include","dtype_exclude","columns_exclude"}, whose values are strings or lists of strings. These act as selectors for matching columns.
  2. The second element specifies the preprocessors:

    • A single Preprocessor instance
    • A list of Preprocessor instances. If multiple are provided, they are applied sequentially.
  3. The third element (optional) is a dict of naming options:

    • "prefix": str to prepend to the output column names
    • "suffix": str to append to the output column names

Renaming makes it easier to reference transformed columns in later steps, especially when output names are otherwise controlled by the preprocessing logic.

Examples of valid steps:

  • (input_col, preprocessor)
  • (input_cols_list, preprocessor)
  • (input_col_dict, preprocessor)
  • (input_col, preprocessor, options_dict)
  • (input_cols_list, [preprocessor1, preprocessor2], options_dict)

Examples:

Single-step pipelines:

  1. Single column with one preprocessor:

    pl = Pipeline([("age", Impute(kind="median"))])
    

  2. Multiple columns with one preprocessor:

    pl = Pipeline([(["height","weight"], Scale(kind="zscore"))])
    

  3. Column selection by dtype with multiple preprocessors, plus renaming:

    pl = Pipeline([
        ({"dtype_include": ["int"]}, [Impute(kind="mean"), Scale(kind="minmax")], {"suffix": "_scaled"})
    ])
    
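  4. Column selection by pattern, excluding specific columns (a sketch: the regex and column names are illustrative and assume the "pattern" selector is matched as a regular expression):

    pl = Pipeline([
        ({"pattern": ".*_amt", "columns_exclude": ["refund_amt"]}, Scale(kind="zscore"))
    ])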

Multi-step pipelines:

  1. Different columns, different preprocessors, no renaming:

    pl = Pipeline([
        (["age","salary"], Impute(kind="median")),
        (["height","weight"], Scale(kind="zscore")),
        ("city", OneHotEncoder(categories="auto"))
    ])
    

  2. Prefix-based selection feeding into PCA, with renamed outputs consumed by the next step:

    pl = Pipeline([
        ({"prefix": "num_"}, [Scale(kind="zscore"), PCA(n_components=3)], {"prefix": "pca_"}),
        ({"prefix": "pca_"}, Normalizer()),
    ])
    

  3. Multi-preprocessor step followed by categorical encoding:

    pl = Pipeline([
        (["income"], [Impute(kind="median"), Scale(kind="zscore")]),
        (["city","state"], OneHotEncoder(categories="auto"))
    ])
    

Notes
  • Each step defines what columns are selected, what transformations are applied, and optionally how output columns are renamed.
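
Putting the pieces together, a typical end-to-end flow is construct, fit, transform (a sketch: DF_train and DF_new are assumed teradataml DataFrames, and the top-level imports follow the pattern used in the method examples below):

from tdprepview import Pipeline, Impute, Scale

pl = Pipeline([
    (["height", "weight"], [Impute(kind="median"), Scale(kind="zscore")])
])
pl.fit(DF=DF_train)               # collects statistics and builds the DAG
DF_out = pl.transform(DF=DF_new)  # traverses the DAG and applies the generated SQL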

Methods:

fit

Fit the pipeline to the given data step by step.

transform

Apply the transformations in the pipeline to the input data.

plot_sankey

Plot the DAG (directed acyclic graph) of the pipeline using a Sankey diagram.

to_dict

Serialize a fitted Pipeline to a Python dictionary.

from_dict

Construct a Pipeline object from a serialized dictionary.

to_json

Serialize the fitted Pipeline to a JSON file.

from_json

Construct a Pipeline object from a JSON file.

from_DataFrame

Construct a new Pipeline object from a teradataml DataFrame or database table/view.

fit #

fit(DF=None, schema_name=None, table_name=None)

Fit the pipeline to the given data step by step.

At each step, the pipeline determines whether statistics need to be collected from Vantage. The directed acyclic graph (DAG) is then built incrementally by adding nodes to the previous layer.

Parameters:

DF (Optional[DataFrame], default None)

The input DataFrame. If not provided, the DataFrame is loaded from the specified schema and table.

schema_name (Optional[str], default None)

The schema name where the table is located. Required if DF is not provided.

table_name (Optional[str], default None)

The table name to load. Required if DF is not provided.

Returns:

None

Examples:

Fit the pipeline with a DataFrame:

from tdprepview import Pipeline, Cast

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=my_dataframe)

Fit the pipeline by loading from schema and table:

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(schema_name="my_schema", table_name="my_table")
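
Note that fit returns None: the pipeline is fitted in place, so fitting and transforming cannot be chained in one expression (a short sketch):

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=my_dataframe)                # fits in place and returns None
DF_transformed = pl.transform(DF=my_dataframe)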

transform #

transform(DF=None, schema_name=None, table_name=None, return_type='df', create_replace_view=False, output_schema_name=None, output_view_name=None)

Apply the transformations in the pipeline to the input data.

The directed acyclic graph (DAG), constructed during fitting based on the pipeline steps, is traversed, and the corresponding SQL statements are assembled into a single query.

Parameters:

DF (Optional[DataFrame], default None)

Input data. If not provided, data will be fetched from the specified schema_name and table_name.

schema_name (Optional[str], default None)

Schema name of the input data. Required if DF is not provided.

table_name (Optional[str], default None)

Table name of the input data. Required if DF is not provided.

return_type (Literal['df', 'str', None], default 'df')

Specifies the return type of the transformation:

  • "df": return a tdml.DataFrame
  • "str": return a SQL query string
  • None: return nothing

create_replace_view (bool, default False)

If True, create or replace a view with the transformed data in the database.

output_schema_name (Optional[str], default None)

Schema name of the output view. Required if create_replace_view is True.

output_view_name (Optional[str], default None)

View name of the output. Required if create_replace_view is True.

Returns:

Union[DataFrame, str, None]

Transformed data as a tdml.DataFrame if return_type="df", a SQL query string if return_type="str", or None if return_type is None.

Notes
  • Setting create_replace_view=True effectively deploys the pipeline: it persists the preprocessing logic as a view in the database, so the transformed data can be queried without running Python.

Examples:

Transform data and return a DataFrame (default):

DF_transformed = pl.transform(DF=DF_input)

Transform data and return the SQL query string:

sql_query = pl.transform(DF=DF_input, return_type="str")
print(sql_query)

Transform data by fetching directly from schema and table:

DF_transformed = pl.transform(schema_name="myschema", table_name="mytable")

Deploy transformations as a database view:

pl.transform(
    DF=my_df,
    return_type=None,
    create_replace_view=True,
    output_schema_name="deploy_schema",
    output_view_name="processed_view"
)

DF_transformed = tdml.DataFrame(tdml.in_schema("deploy_schema","processed_view"))

plot_sankey #

plot_sankey()

Plot the DAG (directed acyclic graph) of the pipeline using a Sankey diagram.

Returns:

Figure

plotly.graph_objs._figure.Figure: Sankey plot of the pipeline's DAG.

Examples:

Fit a simple pipeline and plot the DAG as a Sankey diagram:

from tdprepview import Pipeline, Cast

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=DF_input)

fig = pl.plot_sankey()
fig.update_layout(height=1000)
fig.show()

to_dict #

to_dict()

Serialize a fitted Pipeline to a Python dictionary.

This method asserts that the Pipeline instance is already fitted. The resulting dictionary can be used to reconstruct the Pipeline object with the from_dict class method.

Returns:

dict

A dictionary representation of the fitted Pipeline, suitable for serialization.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl_dict = pl.to_dict()
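
Because the dictionary is built for serialization, it can be persisted however you like, for instance with the standard json module (a sketch; the filename is arbitrary):

import json

with open("pipeline_dict.json", "w") as f:
    json.dump(pl_dict, f)   # pl_dict from the example above

This is essentially what to_json and from_json wrap for you.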

from_dict classmethod #

from_dict(pipeline_serialized_dict)

Construct a Pipeline object from a serialized dictionary.

This class method deserializes a dictionary into a new Pipeline instance, initializing its properties and state based on the serialized data.

Parameters:

pipeline_serialized_dict (dict, required)

A dictionary containing serialized Pipeline data.

Returns:

Pipeline

A new (fitted) Pipeline instance initialized with the data from the serialized dictionary.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl_dict = pl.to_dict()
pl_new = Pipeline.from_dict(pl_dict)

to_json #

to_json(filepath)

Serialize the fitted Pipeline to a JSON file.

This method serializes the Pipeline into a dictionary and then saves it as a JSON file at the specified filepath. The Pipeline must be fitted before calling this method.

Parameters:

filepath (str, required)

The path, including the filename, where the serialized Pipeline should be stored.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl.to_json("pipeline.json")

from_json classmethod #

from_json(filepath)

Construct a Pipeline object from a JSON file.

This class method reads a JSON file specified by filepath, deserializes it into a dictionary, and then uses that dictionary to construct a new Pipeline instance via the from_dict method.

Parameters:

filepath (str, required)

Path to the JSON file containing serialized Pipeline data.

Returns:

Pipeline

A new Pipeline instance initialized with the data from the JSON file.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl.to_json("pipeline.json")

pl_new = Pipeline.from_json("pipeline.json")

from_DataFrame classmethod #

from_DataFrame(DF, input_schema='', input_table='', non_feature_cols=[], fit_pipeline=False)

Construct a new Pipeline object from a teradataml DataFrame or database table/view.

This method uses the auto_code heuristics to generate a sequence of preprocessing steps based on the data's characteristics and schema. The resulting Pipeline can optionally be fitted immediately.

Parameters:

DF (DataFrame, required)

The DataFrame used to determine preprocessing steps.

input_schema (str, default "")

Schema name of the input data.

input_table (str, default "")

Table or view name of the input data.

non_feature_cols (list[str], default [])

Column names to exclude from feature processing (e.g., primary keys, target variables).

fit_pipeline (bool, default False)

If True, fit the constructed Pipeline to the DataFrame before returning.

Returns:

Pipeline

A new Pipeline instance, optionally fitted to the DataFrame.

Examples:

from tdprepview import Pipeline

pl = Pipeline.from_DataFrame(
    DF=my_dataframe,
    non_feature_cols=["row_id", "target_category"],
    fit_pipeline=True
)
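
Since fit_pipeline=True returns an already fitted Pipeline, the auto-generated steps can be inspected right away, for example with plot_sankey (a sketch combining the methods documented above):

fig = pl.plot_sankey()
fig.show()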