Pipelines and imputation

It’s taken me a while, but I finally saw the beauty in Scikit-Learn pipelines, especially when imputing missing data.

For each round of cross-validation you do in model training, you shouldn’t be letting your model “see” any of the test data for that round. If you impute on your whole data set, or even just on your full training set, then you leak data across the train/test or train/score splits used for cross-validation. (Jason Brownlee explains this further in his blog posts on imputation here and here.) Luckily, you can run cross-validation on a pipeline in Scikit-Learn, just like you would on a classifier or regressor.
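To make that concrete, here is a minimal sketch of cross-validating a pipeline that imputes inside each fold. The synthetic data, the SimpleImputer (swapped in for speed instead of an iterative imputer) and the LogisticRegression classifier are all assumptions chosen for illustration, not part of my actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for this sketch) with roughly 10% of the
# values replaced by NaN at random.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so cross_val_score refits it on the
# training portion of every fold and only applies it to the held-out portion.
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('classifier', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, scoring='roc_auc', cv=cv)
print(scores.mean())
```

Because the whole pipeline is cloned and refit per fold, the medians used to fill the held-out rows never depend on those rows.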

If you use any of the advice from my Hyperopt posts, though, things can get a little messy. You need to have your train_test function build your pipeline depending on whether the “normalize” or “scale” option is selected. That looks like this:

import copy

# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# X_train, y_train, numeric_variables and f_unpack_dict are assumed to be
# defined elsewhere.


def hyperopt_train_test(clf, params):
    cv_method = StratifiedKFold(10)

    # Imputation lives inside the pipeline, so it is refit on each CV fold.
    numeric_transformer = Pipeline(steps=[
        ('imputer', IterativeImputer(max_iter=10000))])

    # 'normalize' and 'scale' are pipeline options, not classifier
    # parameters, so remove them before calling set_params below.
    if 'normalize' in params:
        if params['normalize'] == 1:
            numeric_transformer.steps.append(('normalizer', Normalizer()))
        del params['normalize']

    if 'scale' in params:
        if params['scale'] == 1:
            numeric_transformer.steps.append(('scaler', StandardScaler()))
        del params['scale']

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_variables)
        ]
    )

    clf_to_test = copy.copy(clf)
    clf_to_test = clf_to_test.set_params(**f_unpack_dict(params))

    clf_for_hyperopt = Pipeline(steps=[('preprocessor', preprocessor),
                                       ('classifier', clf_to_test)])

    return cross_val_score(estimator=clf_for_hyperopt, X=X_train, y=y_train,
                           n_jobs=8, scoring='roc_auc', cv=cv_method).mean()


def report_score_to_minimize(model_spec):
    # Hyperopt minimizes, so return the negated mean ROC AUC.
    return -1.0 * hyperopt_train_test(model_spec['clf'],
                                      model_spec['parameters'])