Automatic Data Prep#


Why Auto Prep#

Why automate?

Data prep steps are repeatable across projects (impute → scale → encode → cast). Automating with heuristics gives you a strong baseline quickly, and a semi-automatic mode lets you review and adjust before committing.

  • Generates a suggested preprocessing pipeline from a sample of your table/view.
  • Two ways to use it:

    • Semi-automatic: get Python code as a string (review & edit).
    • Full automatic: get a ready-to-use Pipeline object.

Modes of use#

  • Call auto_code(...) → returns a string containing the list of steps.
  • You can review, tweak, and commit the code.
  • Best for governance and code reviews.
from tdprepview import auto_code, Pipeline
code_str = auto_code(DF=DF, non_feature_cols=["id","target"])
print(code_str)                       # inspect & edit if needed
pl = Pipeline(steps=eval(code_str))   # turn into a Pipeline
  • Call Pipeline.from_DataFrame(...) → returns a Pipeline (optionally fitted).
  • Fastest path to a baseline; fewer knobs.
from tdprepview import Pipeline
pl = Pipeline.from_DataFrame(
            DF=DF, 
            non_feature_cols=["id","target"], 
            fit_pipeline=True)
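For context, a minimal end-to-end sketch, assuming a teradataml connection and a hypothetical table my_table with columns id, target, and a handful of feature columns; the transform call is an assumption based on the fitted pipeline returned above, so verify it against your installed tdprepview version:

from teradataml import create_context, DataFrame
from tdprepview import Pipeline

create_context(host="tdhost", username="user", password="pwd")   # placeholder credentials
DF = DataFrame("my_table")                                        # hypothetical table/view

pl = Pipeline.from_DataFrame(DF=DF,
                             non_feature_cols=["id", "target"],
                             fit_pipeline=True)
DF_prepared = pl.transform(DF)   # assumed API: apply the fitted steps in-database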

When to pick which

  • Use auto_code when you need a reviewable artifact.
  • Use from_DataFrame when you want a quick baseline and will iterate interactively.

How does it work?#

  • Samples up to 50,000 rows client-side and runs per-column heuristics on that sample.
  • non_feature_cols: columns excluded from feature transforms (IDs, target, audit fields).
  • The result is an ordered list of steps (impute → scale → power transform → text encoding → cast); see the illustrative sketch below.
  • Steps are grouped by preprocessor where possible to keep the DAG simple.
  • A final Cast(FLOAT) step (excluding non_feature_cols) standardizes numeric outputs for downstream ML.
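An illustrative sketch of what a generated steps list might look like, assuming hypothetical columns age, income, comment, and segment, that the preprocessor classes named in the table below are importable from tdprepview, and that each step pairs input columns with a preprocessor; the actual string returned by auto_code is the authoritative layout:

from tdprepview import (Pipeline, SimpleImputer, ImputeText, MinMaxScaler,
                        PowerTransformer, OneHotEncoder, Cast)

# hypothetical columns and assumed (columns, preprocessor) step layout
steps = [
    (["age", "income"], SimpleImputer(strategy="mean")),            # numeric imputation
    (["age", "income"], MinMaxScaler()),                            # scale to [0, 1]
    (["income"],        PowerTransformer(method="yeo-johnson")),    # reduce skew
    (["comment"],       ImputeText(kind="mode")),                   # text imputation
    (["segment"],       OneHotEncoder(max_categories=20)),          # capped OHE
    (["age", "income", "comment", "segment"], Cast(new_type="FLOAT")),  # standardize output type
]
pl = Pipeline(steps=steps)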

The following heuristics are applied per column on the sample of up to 50,000 rows; a simplified sketch of one check follows the table:

| Checker (source) | Applies to TD types | Condition on sample | Suggested preprocessor | Parameters | Notes |
|---|---|---|---|---|---|
| _check_imputation | VARCHAR, CHAR | | ImputeText | kind='mode' | Text mode imputation |
| _check_imputation | any | exactly 2 unique non-NA values | SimpleImputer | strategy='most_frequent' | Binary/bool columns |
| _check_imputation | BIGINT, FLOAT | | SimpleImputer | strategy='mean' | Numeric mean imputation |
| _check_imputation | otherwise | | Impute | kind='mode' | Fallback |
| _check_minmaxscaling | BIGINT, FLOAT | len(unique) > 2 | MinMaxScaler | {} | Min-max to [0, 1] |
| _check_powertransformer | BIGINT, FLOAT | len(unique) > 2 and abs(skew) > 0.5 | PowerTransformer | method='yeo-johnson' | Address skewness |
| _check_textencoding | CHAR, VARCHAR | at least 10% of rows contain the delimiter ',' | MultiLabelBinarizer | delimiter=',', max_categories=20 | Multi-label tokens |
| _check_textencoding | CHAR, VARCHAR | len(unique) > 2 | OneHotEncoder | max_categories=20 | OHE with cap |
| _check_textencoding | CHAR, VARCHAR | otherwise | LabelEncoder | elements='TOP1' | Most frequent label to 1 |
| Builder (final) | all except non_feature_cols | | Cast | new_type='FLOAT' | Standard numeric output |
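For orientation, a simplified, standalone sketch of the skewness heuristic from the table (pandas on a client-side sample; this is not the library's internal _check_powertransformer implementation):

import pandas as pd

def needs_power_transform(col: pd.Series) -> bool:
    # mirrors the heuristic: more than 2 distinct values and |skew| > 0.5
    col = col.dropna()
    return col.nunique() > 2 and abs(col.skew()) > 0.5

toy = pd.Series([1, 1, 1, 2, 2, 3, 50, 80, 120])   # right-skewed toy sample
print(needs_power_transform(toy))                   # -> True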

The Steps

  1. Imputation (text, binary, numeric).
  2. Min-max scaling (numeric).
  3. Power transform when skewed (numeric).
  4. Text encoding (MLB / OHE / Label).
  5. Cast to FLOAT for everything except non_feature_cols.

After these steps, every feature is a FLOAT value between 0.0 and 1.0; a quick sanity check is sketched below.
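A quick sanity check, assuming the fitted pipeline pl and DataFrame DF from the examples above, that transform(...) returns a teradataml DataFrame, and that to_pandas() is available to pull the result to the client:

# assumed API: verify method names against your installed tdprepview/teradataml versions
df_out = pl.transform(DF).to_pandas()
feature_cols = [c for c in df_out.columns if c not in ("id", "target")]
print(df_out[feature_cols].min().min(), df_out[feature_cols].max().max())
# expected: both values lie within [0.0, 1.0]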