
Pipeline

tdprepview.Pipeline #

Pipeline(steps)

A data processing pipeline consisting of a sequence of steps. Each step is a tuple of input columns, preprocessors, and optional renaming options. During fitting, a directed acyclic graph (DAG) is generated to represent the execution order.

Parameters:

steps (List[Step], required)

A list of tuples representing the steps of the pipeline. Each tuple must have 2 or 3 elements.

  1. The first element specifies the input columns:

    • str: a single column name
    • list[str]: a list of column names
    • dict: a dictionary with keys among {"prefix","suffix","pattern","dtype_include","dtype_exclude","columns_exclude"}, whose values are strings or lists of strings. These act as selectors for matching columns.
  2. The second element specifies the preprocessors:

    • A single Preprocessor instance
    • A list of Preprocessor instances. If multiple are provided, they are applied sequentially.
  3. The third element (optional) is a dict of naming options:

    • "prefix": str to prepend to the output column names
    • "suffix": str to append to the output column names

Renaming makes it easier to reference transformed columns in later steps, especially when output names are otherwise controlled by the preprocessing logic.

Examples of valid steps:

  • (input_col, preprocessor)
  • (input_cols_list, preprocessor)
  • (input_col_dict, preprocessor)
  • (input_col, preprocessor, options_dict)
  • (input_cols_list, [preprocessor1, preprocessor2], options_dict)

Examples:

Single-step pipelines:

  1. Single column with one preprocessor:

    pl = Pipeline([("age", Impute(kind="median"))])
    

  2. Multiple columns with one preprocessor:

    pl = Pipeline([(["height","weight"], Scale(kind="zscore"))])
    

  3. Column selection by dtype with multiple preprocessors, plus renaming:

    pl = Pipeline([
        ({"dtype_include": ["int"]}, [Impute(kind="mean"), Scale(kind="minmax")], {"suffix": "_scaled"})
    ])
    
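  4. Column selection by pattern, excluding specific columns (a sketch: the regex and column names are illustrative and assume the "pattern" selector is matched as a regular expression):

    pl = Pipeline([
        ({"pattern": ".*_amt", "columns_exclude": ["refund_amt"]}, Scale(kind="zscore"))
    ])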

Multi-step pipelines:

  1. Different columns, different preprocessors, no renaming:

    pl = Pipeline([
        (["age","salary"], Impute(kind="median")),
        (["height","weight"], Scale(kind="zscore")),
        ("city", OneHotEncoder(categories="auto"))
    ])
    

  2. Prefix-based selection feeding into PCA, with renamed outputs consumed by the next step:

    pl = Pipeline([
        ({"prefix": "num_"}, [Scale(kind="zscore"), PCA(n_components=3)], {"prefix": "pca_"}),
        ({"prefix": "pca_"}, Normalizer()),
    ])
    

  3. Multi-preprocessor step followed by categorical encoding:

    pl = Pipeline([
        (["income"], [Impute(kind="median"), Scale(kind="zscore")]),
        (["city","state"], OneHotEncoder(categories="auto"))
    ])
    

Notes
  • Each step defines what columns are selected, what transformations are applied, and optionally how output columns are renamed.
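
Putting the pieces together, a typical end-to-end flow is construct, fit, transform (a sketch: DF_train and DF_new are assumed teradataml DataFrames, and the top-level imports follow the pattern used in the method examples below):

from tdprepview import Pipeline, Impute, Scale

pl = Pipeline([
    (["height", "weight"], [Impute(kind="median"), Scale(kind="zscore")])
])
pl.fit(DF=DF_train)               # collects statistics and builds the DAG
DF_out = pl.transform(DF=DF_new)  # traverses the DAG and applies the generated SQL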

Methods:

fit

Fit the pipeline to the given data step by step.

transform

Apply the transformations in the pipeline to the input data.

plot_sankey

Plot the DAG (directed acyclic graph) of the pipeline using a Sankey diagram.

to_dict

Serialize a fitted Pipeline to a Python dictionary.

from_dict

Construct a Pipeline object from a serialized dictionary.

to_json

Serialize the fitted Pipeline to a JSON file.

from_json

Construct a Pipeline object from a JSON file.

from_DataFrame

Construct a new Pipeline object from a teradataml DataFrame or database table/view.

fit #

fit(DF=None, schema_name=None, table_name=None)

Fit the pipeline to the given data step by step.

At each step, the pipeline determines whether statistics need to be collected from Vantage. The directed acyclic graph (DAG) is then built incrementally by adding nodes to the previous layer.

Parameters:

DF (Optional[DataFrame], default None)

The input DataFrame. If not provided, the DataFrame is loaded from the specified schema and table.

schema_name (Optional[str], default None)

The schema name where the table is located. Required if DF is not provided.

table_name (Optional[str], default None)

The table name to load. Required if DF is not provided.

Returns:

None

Examples:

Fit the pipeline with a DataFrame:

from tdprepview import Pipeline, Cast

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=my_dataframe)

Fit the pipeline by loading from schema and table:

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(schema_name="my_schema", table_name="my_table")
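
Note that fit returns None: the pipeline is fitted in place, so fitting and transforming cannot be chained in one expression (a short sketch):

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=my_dataframe)                # fits in place and returns None
DF_transformed = pl.transform(DF=my_dataframe)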

transform #

transform(DF=None, schema_name=None, table_name=None, return_type='df', create_replace_view=False, output_schema_name=None, output_view_name=None)

Apply the transformations in the pipeline to the input data.

The directed acyclic graph (DAG), constructed during fitting based on the pipeline steps, is traversed, and the corresponding SQL statements are assembled into a single query.

Parameters:

DF (Optional[DataFrame], default None)

Input data. If not provided, data will be fetched from the specified schema_name and table_name.

schema_name (Optional[str], default None)

Schema name of the input data. Required if DF is not provided.

table_name (Optional[str], default None)

Table name of the input data. Required if DF is not provided.

return_type (Literal['df', 'str', None], default 'df')

Specifies the return type of the transformation:

  • "df": return a tdml.DataFrame
  • "str": return a SQL query string
  • None: return nothing

create_replace_view (bool, default False)

If True, create or replace a view with the transformed data in the database.

output_schema_name (Optional[str], default None)

Schema name of the output view. Required if create_replace_view is True.

output_view_name (Optional[str], default None)

View name of the output. Required if create_replace_view is True.

Returns:

Union[DataFrame, str, None]

Transformed data as a tdml.DataFrame if return_type="df", a SQL query string if return_type="str", or None if return_type is None.

Notes
  • Setting create_replace_view=True effectively deploys the pipeline: it persists the preprocessing logic as a view in the database, so the transformed data can be queried without running Python.

Examples:

Transform data and return a DataFrame (default):

DF_transformed = pl.transform(DF=DF_input)

Transform data and return the SQL query string:

sql_query = pl.transform(DF=DF_input, return_type="str")
print(sql_query)

Transform data by fetching directly from schema and table:

DF_transformed = pl.transform(schema_name="myschema", table_name="mytable")

Deploy transformations as a database view:

pl.transform(
    DF=my_df,
    return_type=None,
    create_replace_view=True,
    output_schema_name="deploy_schema",
    output_view_name="processed_view"
)

DF_transformed = tdml.DataFrame(tdml.in_schema("deploy_schema","processed_view"))

plot_sankey #

plot_sankey()

Plot the DAG (directed acyclic graph) of the pipeline using a Sankey diagram.

Returns:

Figure

plotly.graph_objs._figure.Figure: Sankey plot of the pipeline's DAG.

Examples:

Fit a simple pipeline and plot the DAG as a Sankey diagram:

from tdprepview import Pipeline, Cast

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=DF_input)

fig = pl.plot_sankey()
fig.update_layout(height=1000)
fig.show()

to_dict #

to_dict()

Serialize a fitted Pipeline to a Python dictionary.

This method asserts that the Pipeline instance is already fitted. The resulting dictionary can be used to reconstruct the Pipeline object with the from_dict class method.

Returns:

dict

A dictionary representation of the fitted Pipeline, suitable for serialization.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl_dict = pl.to_dict()
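
Because the dictionary is built for serialization, it can be persisted however you like, for instance with the standard json module (a sketch; the filename is arbitrary):

import json

with open("pipeline_dict.json", "w") as f:
    json.dump(pl_dict, f)   # pl_dict from the example above

This is essentially what to_json and from_json wrap for you.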

from_dict classmethod #

from_dict(pipeline_serialized_dict)

Construct a Pipeline object from a serialized dictionary.

This class method deserializes a dictionary into a new Pipeline instance, initializing its properties and state based on the serialized data.

Parameters:

pipeline_serialized_dict (dict, required)

A dictionary containing serialized Pipeline data.

Returns:

Pipeline

A new (fitted) Pipeline instance initialized with the data from the serialized dictionary.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl_dict = pl.to_dict()
pl_new = Pipeline.from_dict(pl_dict)

to_json #

to_json(filepath)

Serialize the fitted Pipeline to a JSON file.

This method serializes the Pipeline into a dictionary and then saves it as a JSON file at the specified filepath. The Pipeline must be fitted before calling this method.

Parameters:

filepath (str, required)

The path, including the filename, where the serialized Pipeline should be stored.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl.to_json("pipeline.json")

from_json classmethod #

from_json(filepath)

Construct a Pipeline object from a JSON file.

This class method reads a JSON file specified by filepath, deserializes it into a dictionary, and then uses that dictionary to construct a new Pipeline instance via the from_dict method.

Parameters:

filepath (str, required)

Path to the JSON file containing serialized Pipeline data.

Returns:

Pipeline

A new Pipeline instance initialized with the data from the JSON file.

Examples:

from tdprepview import Pipeline, SimpleImputer

pl = Pipeline([("col1", SimpleImputer(strategy="mean"))])
pl.fit(DF=my_dataframe)

pl.to_json("pipeline.json")

pl_new = Pipeline.from_json("pipeline.json")

from_DataFrame classmethod #

from_DataFrame(DF, input_schema='', input_table='', non_feature_cols=[], fit_pipeline=False)

Construct a new Pipeline object from a teradataml DataFrame or database table/view.

This method uses the auto_code heuristics to generate a sequence of preprocessing steps based on the data's characteristics and schema. The resulting Pipeline can optionally be fitted immediately.

Parameters:

DF (DataFrame, required)

The DataFrame used to determine preprocessing steps.

input_schema (str, default "")

Schema name of the input data.

input_table (str, default "")

Table or view name of the input data.

non_feature_cols (list[str], default [])

Column names to exclude from feature processing (e.g., primary keys, target variables).

fit_pipeline (bool, default False)

If True, fit the constructed Pipeline to the DataFrame before returning.

Returns:

Pipeline

A new Pipeline instance, optionally fitted to the DataFrame.

Examples:

from tdprepview import Pipeline

pl = Pipeline.from_DataFrame(
    DF=my_dataframe,
    non_feature_cols=["row_id", "target_category"],
    fit_pipeline=True
)
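
Since fit_pipeline=True returns an already fitted Pipeline, the auto-generated steps can be inspected right away, for example with plot_sankey (a sketch combining the methods documented above):

fig = pl.plot_sankey()
fig.show()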