# tdprepview.Pipeline
A data processing pipeline consisting of a sequence of steps. Each step is a tuple of input columns, preprocessors, and optional renaming options. During fitting, a directed acyclic graph (DAG) is generated to represent the execution order.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `steps` | `List[Step]` | A list of tuples representing the steps of the pipeline. Each tuple must have 2 or 3 elements: input columns, preprocessors, and optional renaming options. Renaming makes it easier to reference transformed columns in later steps, especially when output names are otherwise controlled by the preprocessing logic. See the examples below for valid step shapes. | *required* |
Examples:

Single-step pipelines:

- Single column with one preprocessor
- Multiple columns with one preprocessor
- Column selection by dtype with multiple preprocessors, plus renaming

Multi-step pipelines:

- Different columns, different preprocessors, no renaming
- Regex/pattern selection feeding into PCA
- Multi-preprocessor step followed by categorical encoding
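A minimal sketch of some of these step shapes. `Cast` is taken from the `fit` example below; `Impute` and `Scale` (and their import path) are assumed here for illustration, and the exact syntax for dtype/regex-based column selection and for renaming options should be taken from the `Step` documentation:

```python
from tdprepview import Pipeline
from tdprepview.preprocessing import Cast, Impute, Scale  # Impute/Scale assumed

# Single column with one preprocessor:
pl = Pipeline(steps=[("col1", Cast("FLOAT"))])

# Multiple columns with one preprocessor:
pl = Pipeline(steps=[(["col1", "col2"], Cast("FLOAT"))])

# One column with a chain of preprocessors, applied in order:
pl = Pipeline(steps=[("col1", [Impute(kind="mean"), Scale(kind="minmax")])])

# Multi-step pipeline: different columns, different preprocessors,
# no renaming. Later steps see the outputs of earlier steps.
pl = Pipeline(steps=[
    (["col1", "col2"], Impute(kind="mean")),
    ("col3", Cast("FLOAT")),
])
```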
Notes
- Each step defines what columns are selected, what transformations are applied, and optionally how output columns are renamed.
Methods:

| Name | Description |
|---|---|
| `fit` | Fit the pipeline to the given data step by step. |
| `transform` | Apply the transformations in the pipeline to the input data. |
| `plot_sankey` | Plot the DAG (directed acyclic graph) of the pipeline using a Sankey diagram. |
| `to_dict` | Serialize a fitted Pipeline to a Python dictionary. |
| `from_dict` | Construct a Pipeline object from a serialized dictionary. |
| `to_json` | Serialize the fitted Pipeline to a JSON file. |
| `from_json` | Construct a Pipeline object from a JSON file. |
| `from_DataFrame` | Construct a new Pipeline object from a teradataml DataFrame or database table/view. |
## fit
Fit the pipeline to the given data step by step.
At each step, the pipeline decides whether statistics need to be collected from Vantage. The directed acyclic graph (DAG) is then built incrementally by adding nodes to the previous layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `DF` | `Optional[DataFrame]` | The input DataFrame. If not provided, the DataFrame is loaded from the specified schema and table. | `None` |
| `schema_name` | `Optional[str]` | The schema name where the table is located. Required if `DF` is not provided. | `None` |
| `table_name` | `Optional[str]` | The table name to load. Required if `DF` is not provided. | `None` |
Returns:

| Type | Description |
|---|---|
| `None` | The pipeline is fitted in place; nothing is returned. |
Examples:

Fit the pipeline with a DataFrame:

```python
from tdprepview import Pipeline
from tdprepview.preprocessing import Cast

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=my_dataframe)
```
Fit the pipeline by loading from schema and table:
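(A sketch: the schema and table names are hypothetical, and an active teradataml connection is assumed.)

```python
from tdprepview import Pipeline
from tdprepview.preprocessing import Cast

pl = Pipeline([("col1", Cast("FLOAT"))])
# Loads the input from "my_schema"."my_table" instead of a DataFrame:
pl.fit(schema_name="my_schema", table_name="my_table")
```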
## transform

```python
transform(DF=None, schema_name=None, table_name=None, return_type='df',
          create_replace_view=False, output_schema_name=None, output_view_name=None)
```
Apply the transformations in the pipeline to the input data.
The directed acyclic graph (DAG), constructed during fitting based on the pipeline steps, is traversed and the corresponding SQL statements are appended.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `DF` | `Optional[DataFrame]` | Input data. If not provided, data will be fetched from the specified `schema_name` and `table_name`. | `None` |
| `schema_name` | `Optional[str]` | Schema name of the input data. Required if `DF` is not provided. | `None` |
| `table_name` | `Optional[str]` | Table name of the input data. Required if `DF` is not provided. | `None` |
| `return_type` | `Literal['df', 'str', None]` | Specifies the return type of the transformation: `'df'` for a teradataml DataFrame, `'str'` for the generated SQL query string, `None` for no return value. | `'df'` |
| `create_replace_view` | `bool` | If True, create or replace a view with the transformed data in the database. | `False` |
| `output_schema_name` | `Optional[str]` | Schema name of the output view. Required if `create_replace_view` is True. | `None` |
| `output_view_name` | `Optional[str]` | View name of the output view. Required if `create_replace_view` is True. | `None` |
Returns:

| Type | Description |
|---|---|
| `Union[DataFrame, str, None]` | The transformed data as a teradataml DataFrame if `return_type='df'`, the generated SQL query string if `return_type='str'`, or `None` if `return_type=None`. |
Notes

- Using `create_replace_view=True` acts like a "secret deploy button": it crystallizes the preprocessing logic into a database view.
Examples:

- Transform data and return a DataFrame (default)
- Transform data and return the SQL query string
- Transform data by fetching directly from schema and table
- Deploy transformations as a database view
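All four variants are sketched below, assuming a fitted Pipeline `pl` and a teradataml DataFrame `DF_in`; schema, table, and view names are hypothetical:

```python
# Return a teradataml DataFrame (default):
DF_out = pl.transform(DF=DF_in)

# Return the generated SQL query string instead:
sql_query = pl.transform(DF=DF_in, return_type="str")

# Fetch the input directly from a schema and table:
DF_out = pl.transform(schema_name="my_schema", table_name="my_table")

# Deploy the transformations as a database view:
pl.transform(
    DF=DF_in,
    return_type=None,
    create_replace_view=True,
    output_schema_name="my_schema",
    output_view_name="my_output_view",
)
```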
## plot_sankey

Plot the DAG (directed acyclic graph) of the pipeline using a Sankey diagram.

Returns:

| Type | Description |
|---|---|
| `Figure` | `plotly.graph_objs._figure.Figure`: Sankey plot of the pipeline DAG. |
Examples:
Fit a simple pipeline and plot the DAG as a Sankey diagram:
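(A sketch; `my_dataframe` stands in for any teradataml DataFrame.)

```python
from tdprepview import Pipeline
from tdprepview.preprocessing import Cast

pl = Pipeline([("col1", Cast("FLOAT"))])
pl.fit(DF=my_dataframe)

fig = pl.plot_sankey()  # returns a plotly Figure
fig.show()
```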
## to_dict
Serialize a fitted Pipeline to a Python dictionary.
This method asserts that the Pipeline instance is already fitted. The resulting dictionary can be used to reconstruct the Pipeline object with the `from_dict` class method.
Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | A dictionary representation of the fitted Pipeline, suitable for serialization. |
Examples:
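A sketch, assuming a fitted Pipeline `pl`:

```python
pipeline_dict = pl.to_dict()  # plain dict, ready for serialization
```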
## from_dict

*classmethod*
Construct a Pipeline object from a serialized dictionary.
This class method deserializes a dictionary into a new Pipeline instance, initializing its properties and state based on the serialized data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pipeline_serialized_dict` | `dict` | A dictionary containing serialized Pipeline data. | *required* |
Returns:

| Name | Type | Description |
|---|---|---|
| `Pipeline` | `Pipeline` | A new (fitted) Pipeline instance initialized with the data from the serialized dictionary. |
Examples:
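A sketch, assuming `pipeline_dict` was produced by `to_dict` on a fitted Pipeline:

```python
from tdprepview import Pipeline

pl_restored = Pipeline.from_dict(pipeline_dict)
```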
## to_json
Serialize the fitted Pipeline to a JSON file.
This method serializes the Pipeline into a dictionary and then saves it as a JSON file at the specified filepath. The Pipeline must be fitted before calling this method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `filepath` | `str` | The path, including the filename, where the serialized Pipeline should be stored. | *required* |
Examples:
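A sketch, assuming a fitted Pipeline `pl`; the filepath is hypothetical:

```python
pl.to_json("my_pipeline.json")
```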
## from_json

*classmethod*
Construct a Pipeline object from a JSON file.
This class method reads the JSON file specified by `filepath`, deserializes it into a dictionary, and then uses that dictionary to construct a new Pipeline instance via the `from_dict` method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `filepath` | `str` | Path to the JSON file containing serialized Pipeline data. | *required* |
Returns:

| Name | Type | Description |
|---|---|---|
| `Pipeline` | `Pipeline` | A new Pipeline instance initialized with the data from the JSON file. |
Examples:
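A sketch; the filepath is hypothetical:

```python
from tdprepview import Pipeline

pl_restored = Pipeline.from_json("my_pipeline.json")
```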
## from_DataFrame

*classmethod*
Construct a new Pipeline object from a teradataml DataFrame or database table/view.
This method uses the `auto_code` heuristics to generate a sequence of preprocessing steps based on the data's characteristics and schema. The resulting Pipeline can optionally be fitted immediately.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `DF` | `DataFrame` | The DataFrame used to determine preprocessing steps. | *required* |
| `input_schema` | `str` | Schema name of the input data. | `''` |
| `input_table` | `str` | Table or view name of the input data. | `''` |
| `non_feature_cols` | `list[str]` | Column names to exclude from feature processing (e.g., primary keys, target variables). | `[]` |
| `fit_pipeline` | `bool` | If True, fit the constructed Pipeline to the DataFrame before returning. | `False` |
Returns:

| Name | Type | Description |
|---|---|---|
| `Pipeline` | `Pipeline` | A new Pipeline instance, optionally fitted to the DataFrame. |
Examples:
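A sketch, assuming a teradataml DataFrame `my_dataframe`; schema, table, and column names are hypothetical:

```python
from tdprepview import Pipeline

pl = Pipeline.from_DataFrame(
    DF=my_dataframe,
    input_schema="my_schema",
    input_table="my_table",
    non_feature_cols=["row_id", "target"],  # keys/targets excluded from features
    fit_pipeline=True,  # fit immediately on my_dataframe
)
```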