# Automatic Data Prep
## Why Auto Prep
Why automate?
Data prep steps are repeatable across projects (impute → scale → encode → cast). Automating with heuristics gives you a strong baseline quickly, and a semi-automatic mode lets you review and adjust before committing.
- Generates a suggested preprocessing pipeline from a sample of your table/view.
- Two ways to use it:
  - Semi-automatic: get Python code as a string (review & edit).
  - Full automatic: get a ready-to-use Pipeline object.
## Modes of use
- Call auto_code(...) → returns a string containing the list of steps.
- You can review, tweak, and commit the code.
- Best for governance and code reviews (a usage sketch follows below).
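A minimal sketch of the semi-automatic flow. The import paths, connection details, and exact auto_code arguments are assumptions and may differ in your package version; treat this as an illustration, not the definitive API.

```python
# Semi-automatic mode: generate the suggested pipeline as reviewable Python code.
# NOTE: import paths, connection details, and argument names are illustrative
# assumptions; adjust them to your environment and package version.
from teradataml import create_context, DataFrame
from tdprepview import auto_code  # assumed import path

create_context(host="<host>", username="<user>", password="<password>")

DF = DataFrame("my_schema.my_table")   # source table/view to profile

pipeline_code = auto_code(DF)          # returns a string with the suggested steps
print(pipeline_code)                   # review, tweak, and commit the code
```

Because the result is plain code, it can be checked into version control and go through the same review process as any other artifact.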
When to pick which
- Use auto_code when you need a reviewable artifact.
- Use from_DataFrame when you want a quick baseline and will iterate interactively (see the sketch below).
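A sketch of the full-automatic flow, assuming the from_DataFrame constructor mentioned above plus scikit-learn-style fit/transform methods; names and arguments are again illustrative.

```python
# Full-automatic mode: get a ready-to-use Pipeline object and apply it directly.
# NOTE: import path, constructor, and method names are illustrative assumptions.
from teradataml import DataFrame
from tdprepview import Pipeline  # assumed import path

DF = DataFrame("my_schema.my_table")

pipeline = Pipeline.from_DataFrame(DF)  # heuristics run on a sample of the data
pipeline.fit(DF)                        # fit the suggested steps
DF_prepared = pipeline.transform(DF)    # transformed DataFrame for ML downstream
```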
## How it works
- Samples up to 50,000 rows client-side to drive the heuristics.
- `non_feature_cols`: columns to exclude from feature transforms (IDs, target, audit fields).
- An ordered list of steps (impute → scale → power transform → text encoding → cast).
- Steps are grouped by preprocessor where possible to keep the DAG simple.
- A final `Cast(FLOAT)` step (excluding `non_feature_cols`) standardizes numeric outputs for ML downstreams.
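To make the "ordered list of steps" concrete, a generated step list might look roughly like the sketch below. The column names are invented, and the tuple-based step format and import path are assumptions; it is meant only to illustrate the ordering and grouping, not to reproduce real output.

```python
# Hand-written illustration of a generated step list (not real output).
# Assumed format: each step is (list_of_columns, preprocessor), applied in order
# and grouped by preprocessor where possible. Import path is assumed.
from tdprepview import (SimpleImputer, MinMaxScaler, PowerTransformer,
                        OneHotEncoder, Cast)

steps = [
    (["age", "income"],            SimpleImputer(strategy="mean")),          # numeric imputation
    (["age", "income"],            MinMaxScaler()),                          # scale to [0, 1]
    (["income"],                   PowerTransformer(method="yeo-johnson")),  # correct skew
    (["segment"],                  OneHotEncoder(max_categories=20)),        # text encoding
    (["age", "income", "segment"], Cast(new_type="FLOAT")),                  # final cast; non_feature_cols excluded
]
```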
The following heuristics are applied per column on the sample of 50k rows:
| Checker (source) | Applies to TD types | Condition on sample | Suggested preprocessor | Parameters | Notes |
|---|---|---|---|---|---|
| `_check_imputation` | `VARCHAR`, `CHAR` | — | `ImputeText` | `kind='mode'` | Text mode imputation |
| | any | exactly 2 unique non-NA values | `SimpleImputer` | `strategy='most_frequent'` | Binary/bool columns |
| | `BIGINT`, `FLOAT` | — | `SimpleImputer` | `strategy='mean'` | Numeric mean imputation |
| | otherwise | — | `Impute` | `kind='mode'` | Fallback |
| `_check_minmaxscaling` | `BIGINT`, `FLOAT` | `len(unique) > 2` | `MinMaxScaler` | `{}` | Min-max to [0, 1] |
| `_check_powertransformer` | `BIGINT`, `FLOAT` | `len(unique) > 2` and `abs(skew) > 0.5` | `PowerTransformer` | `method='yeo-johnson'` | Address skewness |
| `_check_textencoding` | `CHAR`, `VARCHAR` | ≥ 10% of rows contain the delimiter `,` | `MultiLabelBinarizer` | `delimiter=','`, `max_categories=20` | Multi-label tokens |
| | `CHAR`, `VARCHAR` | `len(unique) > 2` | `OneHotEncoder` | `max_categories=20` | OHE with cap |
| | `CHAR`, `VARCHAR` | otherwise | `LabelEncoder` | `elements='TOP1'` | Most frequent label to 1 |
| Builder (final) | all except `non_feature_cols` | — | `Cast` | `new_type='FLOAT'` | Standard numeric output |
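If you want a feel for how these checks behave, here is a rough pandas re-implementation of the per-column decision logic using the thresholds from the table. It is a sketch for intuition only, not the package's internal code, and the helper name suggest_preprocessors is made up.

```python
# Rough pandas re-implementation of the per-column heuristics above.
# Sketch for intuition only; not the package's internal code.
import pandas as pd

def suggest_preprocessors(sample: pd.DataFrame, non_feature_cols: set) -> dict:
    suggestions = {}
    for col in sample.columns:
        if col in non_feature_cols:
            continue                                   # IDs, target, audit fields are skipped
        s = sample[col].dropna()
        is_text = s.dtype == object                    # stands in for CHAR / VARCHAR
        is_numeric = pd.api.types.is_numeric_dtype(s)  # stands in for BIGINT / FLOAT
        n_unique = s.nunique()
        steps = []

        # 1) Imputation (following the row order of the table)
        if is_text:
            steps.append("ImputeText(kind='mode')")
        elif n_unique == 2:
            steps.append("SimpleImputer(strategy='most_frequent')")   # binary/bool
        elif is_numeric:
            steps.append("SimpleImputer(strategy='mean')")
        else:
            steps.append("Impute(kind='mode')")                       # fallback

        # 2) Min-max scaling and 3) skew correction for numeric columns
        if is_numeric and n_unique > 2:
            steps.append("MinMaxScaler()")
            if abs(s.skew()) > 0.5:
                steps.append("PowerTransformer(method='yeo-johnson')")

        # 4) Text encoding
        if is_text:
            if s.astype(str).str.contains(",").mean() >= 0.10:
                steps.append("MultiLabelBinarizer(delimiter=',', max_categories=20)")
            elif n_unique > 2:
                steps.append("OneHotEncoder(max_categories=20)")
            else:
                steps.append("LabelEncoder(elements='TOP1')")

        # 5) Final cast, applied to every feature column
        steps.append("Cast(new_type='FLOAT')")
        suggestions[col] = steps
    return suggestions
```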
The Steps
- Imputation (text, binary, numeric).
- Min-max scaling (numeric).
- Power transform when skewed (numeric).
- Text encoding (MLB / OHE / Label).
- Cast to FLOAT for everything except `non_feature_cols`.
As a result, every feature will be a value between 0.0 and 1.0.
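A quick way to sanity-check that on your own data, assuming the transformed DataFrame can be pulled into pandas (e.g. via teradataml's to_pandas()); DF_prepared refers to the full-automatic sketch above and the non_feature_cols set is just an example.

```python
# Sanity check: every feature column of the transformed data should lie in [0.0, 1.0].
# Assumes DF_prepared comes from the full-automatic sketch above and exposes to_pandas().
non_feature_cols = {"customer_id", "target"}      # example; use your own list

pdf = DF_prepared.to_pandas()
feature_cols = [c for c in pdf.columns if c not in non_feature_cols]

stats = pdf[feature_cols].agg(["min", "max"])
print(stats)

assert (stats.loc["min"] >= 0.0).all()
assert (stats.loc["max"] <= 1.0).all()
```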