Scikit-learn special case: leave-one-out

Today I learned the hard way that sklearn.model_selection.cross_val_score() returns NaNs when you use a probability-based score (like AUC or log-loss) with leave-one-out cross-validation (LOO-CV). Intuitively, it makes sense why LOO-CV is a special case: each round's test set contains a single sample, so scores like AUC are undefined per fold. To overcome this issue, I built a wrapper around the combination of scikit-learn methods you need to make LOO-CV scoring behave like the scoring for other cross-validation methods. You'll find it below.

I tried to adhere to the scikit-learn form as much as possible, but anyone is free to remix this work to make it better.
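For context, here is a minimal sketch that reproduces the failure (the dataset and model are illustrative stand-ins, generated with make_classification and a plain LogisticRegression):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data and model chosen purely for illustration.
X, y = make_classification(n_samples=20, random_state=0)

# Each LOO test fold holds a single sample, so ROC AUC is undefined
# per fold; cross_val_score falls back to error_score (NaN) instead.
scores = cross_val_score(LogisticRegression(), X, y,
                         cv=LeaveOneOut(), scoring='roc_auc',
                         error_score=np.nan)
print(np.isnan(scores).all())  # every per-fold score is NaN
```

The fix below sidesteps this by collecting all out-of-fold predictions first and computing the score once over the full set.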

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def score_loo(X, y, estimator, score_func=roc_auc_score, *, needs_proba=True, n_jobs=-1, **kwargs):
    # Additional keyword arguments are passed through to score_func.
    if needs_proba:
        # Collect the out-of-fold probability of the positive class.
        y_hat = cross_val_predict(estimator, X, y=y, cv=LeaveOneOut(), n_jobs=n_jobs, method='predict_proba')[:, 1]
    else:
        y_hat = cross_val_predict(estimator, X, y=y, cv=LeaveOneOut(), n_jobs=n_jobs, method='predict')

    # Score all out-of-fold predictions at once instead of per fold.
    return score_func(np.ravel(y), y_hat, **kwargs)

# example usage (synthetic data for illustration)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=1)
rf = RandomForestClassifier(random_state=1, n_jobs=1, n_estimators=2)
score_loo(estimator=rf, X=X, y=y, n_jobs=16, score_func=roc_auc_score, needs_proba=True)