Pipelines and imputation

Apr 27

It’s taken me a while, but I finally saw the beauty in Scikit-Learn pipelines, especially when imputing missing data.

In each round of cross-validation during model training, your model shouldn’t “see” any of that round’s held-out data. If you impute on your whole data set, or even just on your larger training set, then information leaks across the train/test (or train/score) splits used for cross-validation. (Jason Brownlee explains this further in his blog posts on imputation here and here.) Luckily, you can run cross-validation on a pipeline in Scikit-Learn, just like you would on a classifier or regressor.
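
To make that concrete, here is a minimal sketch of cross-validating a pipeline that imputes inside each fold. The toy data, the 10% missingness rate, the LogisticRegression classifier and the small max_iter are all my own assumptions for illustration, not part of any real project.

```python
import numpy as np
from sklearn.datasets import make_classification
# This import must come before IterativeImputer, which is still experimental.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (assumption): 200 rows, 5 numeric features, ~10% of values set to NaN.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so cross_val_score refits it on the
# training portion of every fold and only applies it to the held-out portion.
pipe = Pipeline(steps=[
    ('imputer', IterativeImputer(max_iter=10, random_state=0)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, scoring='roc_auc', cv=StratifiedKFold(5))
print(scores.mean())
```

Because the whole pipeline is the estimator, there is no separate "impute first" step that could touch the held-out rows.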

If you use any of the advice from my Hyperopt posts, though, things can get a little messy. Your hyperopt_train_test function needs to build the pipeline depending on whether the “normalize” or “scale” option is selected. That looks like this:

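import copy

# enable_iterative_imputer must be imported before IterativeImputer.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler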

def hyperopt_train_test(clf, params):
    # numeric_variables, X_train, y_train and f_unpack_dict are assumed to be
    # defined elsewhere in your script.
    params = dict(params)  # copy so the caller's dict is not mutated
    cv_method = StratifiedKFold(10)

    # Imputation lives inside the pipeline so it is refit on each CV training fold.
    numeric_transformer = Pipeline(steps=[
        ('imputer', IterativeImputer(max_iter=10000))])

    # 'normalize' and 'scale' are preprocessing switches, not classifier
    # parameters, so remove them before calling set_params.
    if params.pop('normalize', 0) == 1:
        numeric_transformer.steps.append(('normalizer', Normalizer()))
    if params.pop('scale', 0) == 1:
        numeric_transformer.steps.append(('scaler', StandardScaler()))

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_variables)
        ]
    )
    clf_to_test = copy.copy(clf)
    clf_to_test = clf_to_test.set_params(**f_unpack_dict(params))
    clf_for_hyperopt = Pipeline(steps=[('preprocessor', preprocessor),
                                       ('classifier', clf_to_test)])
    return cross_val_score(estimator=clf_for_hyperopt, X=X_train, y=y_train,
                           n_jobs=8, scoring='roc_auc', cv=cv_method).mean()

def report_score_to_minimize(model_spec):
    return -1.0 * hyperopt_train_test(model_spec['clf'], model_spec['parameters'])
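
If you want to try the conditional step logic on its own, without Hyperopt or my data, something like the following sketch should work. The build_pipeline helper, the toy data and the LogisticRegression classifier are hypothetical stand-ins I made up for this example, and I've left out the ColumnTransformer because every toy column is numeric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler


def build_pipeline(clf, params):
    """Hypothetical helper: imputer first, then an optional normalizer/scaler."""
    steps = [('imputer', IterativeImputer(max_iter=10, random_state=0))]
    if params.get('normalize') == 1:
        steps.append(('normalizer', Normalizer()))
    if params.get('scale') == 1:
        steps.append(('scaler', StandardScaler()))
    steps.append(('classifier', clf))
    return Pipeline(steps=steps)


# Toy data (assumption): all-numeric features with ~10% missing values.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

step_names = {}
for label, params in [('plain', {'normalize': 0, 'scale': 0}),
                      ('normalized', {'normalize': 1, 'scale': 0}),
                      ('scaled', {'normalize': 0, 'scale': 1})]:
    pipe = build_pipeline(LogisticRegression(max_iter=1000), params)
    # cv=5 on a classifier uses stratified folds by default.
    score = cross_val_score(pipe, X, y, scoring='roc_auc', cv=5).mean()
    step_names[label] = [name for name, _ in pipe.steps]
    print(label, step_names[label], round(score, 3))
```

Each configuration gets its own pipeline, so the imputer and any scaler are refit inside every fold no matter which options the search picks.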

Ryan Melvin
