Automatic Data Prep#


Why Auto Prep#

Why automate?

Data prep steps are repeatable across projects (impute → scale → encode → cast). Automating with heuristics gives you a strong baseline quickly, and a semi-automatic mode lets you review and adjust before committing.

  • Generates a suggested preprocessing pipeline from a sample of your table/view.
  • Two ways to use it:

    • Semi-automatic: get Python code as a string (review & edit).
    • Full automatic: get a ready-to-use Pipeline object.

Modes of use#

  • Call auto_code(...) → returns a string containing the list of steps.
  • You can review, tweak, and commit the code.
  • Best for governance and code reviews.
from tdprepview import auto_code, Pipeline
code_str = auto_code(DF=DF, non_feature_cols=["id","target"])
print(code_str)                       # inspect & edit if needed
pl = Pipeline(steps=eval(code_str))   # turn into a Pipeline
  • Call Pipeline.from_DataFrame(...) → returns a Pipeline (optionally fitted).
  • Fastest path to a baseline; fewer knobs.
from tdprepview import Pipeline
pl = Pipeline.from_DataFrame(
            DF=DF, 
            non_feature_cols=["id","target"], 
            fit_pipeline=True)
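For context, a minimal end-to-end sketch, assuming a teradataml connection and a hypothetical table my_table with columns id, target, and a handful of feature columns; the transform call is an assumption based on the fitted pipeline returned above, so verify it against your installed tdprepview version:

from teradataml import create_context, DataFrame
from tdprepview import Pipeline

create_context(host="tdhost", username="user", password="pwd")   # placeholder credentials
DF = DataFrame("my_table")                                        # hypothetical table/view

pl = Pipeline.from_DataFrame(DF=DF,
                             non_feature_cols=["id", "target"],
                             fit_pipeline=True)
DF_prepared = pl.transform(DF)   # assumed API: apply the fitted steps in-database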

When to pick which

  • Use auto_code when you need a reviewable artifact.
  • Use from_DataFrame when you want a quick baseline and will iterate interactively.

How does it work?#

  • Samples up to 50,000 rows client-side and runs per-column heuristics on that sample.
  • non_feature_cols: columns excluded from feature transforms (IDs, target, audit fields).
  • The result is an ordered list of steps (impute → scale → power transform → text encoding → cast); see the illustrative sketch below.
  • Steps are grouped by preprocessor where possible to keep the DAG simple.
  • A final Cast(FLOAT) step (excluding non_feature_cols) standardizes numeric outputs for downstream ML.
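An illustrative sketch of what a generated steps list might look like, assuming hypothetical columns age, income, comment, and segment, that the preprocessor classes named in the table below are importable from tdprepview, and that each step pairs input columns with a preprocessor; the actual string returned by auto_code is the authoritative layout:

from tdprepview import (Pipeline, SimpleImputer, ImputeText, MinMaxScaler,
                        PowerTransformer, OneHotEncoder, Cast)

# hypothetical columns and assumed (columns, preprocessor) step layout
steps = [
    (["age", "income"], SimpleImputer(strategy="mean")),            # numeric imputation
    (["age", "income"], MinMaxScaler()),                            # scale to [0, 1]
    (["income"],        PowerTransformer(method="yeo-johnson")),    # reduce skew
    (["comment"],       ImputeText(kind="mode")),                   # text imputation
    (["segment"],       OneHotEncoder(max_categories=20)),          # capped OHE
    (["age", "income", "comment", "segment"], Cast(new_type="FLOAT")),  # standardize output type
]
pl = Pipeline(steps=steps)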

The following heuristics are applied per column on the sample of up to 50,000 rows; a simplified sketch of one check follows the table:

| Checker (source) | Applies to TD types | Condition on sample | Suggested preprocessor | Parameters | Notes |
|---|---|---|---|---|---|
| _check_imputation | VARCHAR, CHAR | | ImputeText | kind='mode' | Text mode imputation |
| _check_imputation | any | exactly 2 unique non-NA values | SimpleImputer | strategy='most_frequent' | Binary/bool columns |
| _check_imputation | BIGINT, FLOAT | | SimpleImputer | strategy='mean' | Numeric mean imputation |
| _check_imputation | otherwise | | Impute | kind='mode' | Fallback |
| _check_minmaxscaling | BIGINT, FLOAT | len(unique) > 2 | MinMaxScaler | {} | Min-max to [0, 1] |
| _check_powertransformer | BIGINT, FLOAT | len(unique) > 2 and abs(skew) > 0.5 | PowerTransformer | method='yeo-johnson' | Address skewness |
| _check_textencoding | CHAR, VARCHAR | at least 10% of rows contain the delimiter ',' | MultiLabelBinarizer | delimiter=',', max_categories=20 | Multi-label tokens |
| _check_textencoding | CHAR, VARCHAR | len(unique) > 2 | OneHotEncoder | max_categories=20 | OHE with cap |
| _check_textencoding | CHAR, VARCHAR | otherwise | LabelEncoder | elements='TOP1' | Most frequent label to 1 |
| Builder (final) | all except non_feature_cols | | Cast | new_type='FLOAT' | Standard numeric output |
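For orientation, a simplified, standalone sketch of the skewness heuristic from the table (pandas on a client-side sample; this is not the library's internal _check_powertransformer implementation):

import pandas as pd

def needs_power_transform(col: pd.Series) -> bool:
    # mirrors the heuristic: more than 2 distinct values and |skew| > 0.5
    col = col.dropna()
    return col.nunique() > 2 and abs(col.skew()) > 0.5

toy = pd.Series([1, 1, 1, 2, 2, 3, 50, 80, 120])   # right-skewed toy sample
print(needs_power_transform(toy))                   # -> True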

The Steps

  1. Imputation (text, binary, numeric).
  2. Min-max scaling (numeric).
  3. Power transform when skewed (numeric).
  4. Text encoding (MLB / OHE / Label).
  5. Cast to FLOAT for everything except non_feature_cols.

After these steps, every feature is a FLOAT value between 0.0 and 1.0; a quick sanity check is sketched below.
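A quick sanity check, assuming the fitted pipeline pl and DataFrame DF from the examples above, that transform(...) returns a teradataml DataFrame, and that to_pandas() is available to pull the result to the client:

# assumed API: verify method names against your installed tdprepview/teradataml versions
df_out = pl.transform(DF).to_pandas()
feature_cols = [c for c in df_out.columns if c not in ("id", "target")]
print(df_out[feature_cols].min().min(), df_out[feature_cols].max().max())
# expected: both values lie within [0.0, 1.0]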