Scratch work only.
Inspired by Austin Kleon’s Show Your Work, I am recording notes, snippets, and curiosities from my Data Science work. You won’t find any finished projects here, only scratch work.
Notes
Pickle a dictionary
Recently, I’ve found myself needing to save Python dictionaries. I’m using them as a way of saving the “best” hyperparameters when working with Hyperopt (see my previous posts). The easiest way to do this in Python seems to be the “pickle” library. Writing (and later reading) a dictionary in this way looks like this.
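A minimal sketch of the write/read round trip — the file name and the parameter values are just placeholders:

```python
import pickle

# Hypothetical example: the "best" hyperparameters found by a search.
best_params = {"max_depth": 6, "learning_rate": 0.1, "n_estimators": 200}

# Write the dictionary to disk ("wb" = write bytes).
with open("best_params.pkl", "wb") as f:
    pickle.dump(best_params, f)

# Later, read it back ("rb" = read bytes).
with open("best_params.pkl", "rb") as f:
    loaded = pickle.load(f)
```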
Meta-classifiers
Since I’m working on tuning the hyperparameters of many types of machine learning models at once, why not also combine all of their predictions and see how well that meta-classifier performs? Meta-classifiers are used a lot in Kaggle competitions. The big idea is that if you train, say, 5 different model types, the meta-classifier should be at least as good as your best individual model. So it’s a cheap way to get a boost in performance. I like this simple figure from the mlxtend docs showing the function of a meta-classifier.
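mlxtend provides a StackingClassifier for this, and scikit-learn ships an equivalent in sklearn.ensemble. A minimal sketch using the scikit-learn version — the base model choices and the synthetic data are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Base models whose predictions are fed to the meta-classifier.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# A logistic regression learns how to weight the base models' predictions.
meta = StackingClassifier(estimators=base_models,
                          final_estimator=LogisticRegression())

score = cross_val_score(meta, X, y, cv=3).mean()
```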
Pipelines and imputation
It’s taken me a while, but I finally saw the beauty in Scikit-Learn pipelines, especially when imputing missing data.
For each round of cross-validation you do in model training, you shouldn’t let your model “see” any of the test data. If you impute on your whole data set, or even just your larger training set, then you have data leakage across your train/test or train/score splits. (Jason Brownlee explains this further in his blog posts on imputation here and here.) Luckily, you can run cross-validation on a pipeline in scikit-learn, just like you would on a classifier or regressor.
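For example, a sketch with SimpleImputer inside a Pipeline, so each fold fits the imputer only on its own training split (the data here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[::10, 0] = np.nan  # punch some holes in the data

# Because imputation happens inside the pipeline, each CV fold fits the
# imputer on its training split only -- no leakage into the held-out fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)
```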
If you use any of the advice from my Hyperopt posts, though, things can get a little messy. You need to have your train_test function build your pipeline depending on whether the “normalize” or “scale” option is selected. That looks like this:
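A sketch of the idea, with build_pipeline standing in for the pipeline-building part of a train_test function; the parameter names are placeholders for whatever your Hyperopt search space uses:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def build_pipeline(params):
    """Assemble pipeline steps based on the sampled hyperparameters.

    `params` mimics what Hyperopt might hand to a train_test function;
    the "normalize"/"scale" keys are placeholders for the real search space.
    """
    steps = []
    if params.get("normalize"):
        steps.append(("normalize", MinMaxScaler()))
    if params.get("scale"):
        steps.append(("scale", StandardScaler()))
    steps.append(("model", LogisticRegression()))
    return Pipeline(steps)

pipe = build_pipeline({"scale": True, "normalize": False})
```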
Choosing parameters for regularized regression
Along with some colleagues, I’m working on an automated parameter and feature selection tuner for machine learning. Currently, we’re searching for ranges of parameters that have been “successful.” I recently did a deep dive into LASSO, ridge, and elastic-net regularized regression, where you have to pick a lambda value (or a range of them to search over). There are some recommendations on Data Science blogs (like Jason Brownlee’s) on parameter ranges to search.
However, some of the creators of these methods wrote an R package called “glmnet” that picks the range of lambda values based on the data you’re working with. If you aren’t using R, though, they describe the method in Section 2.5 of this paper. And there’s a great Stack Overflow discussion.
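A rough Python translation of that method, as I understand it from the paper: lambda_max is the smallest penalty that zeroes out every coefficient, and the grid descends from there on a log scale. The function and argument names here are mine, and it assumes the columns of X are standardized (as glmnet does internally):

```python
import numpy as np

def lambda_sequence(X, y, alpha=1.0, n_lambdas=100, eps=1e-3):
    """Sketch of the glmnet-style lambda grid (paper Section 2.5).

    lambda_max is the smallest value at which all coefficients are zero;
    the grid decreases on a log scale down to eps * lambda_max.
    """
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    return np.logspace(np.log10(lam_max), np.log10(eps * lam_max), n_lambdas)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
lams = lambda_sequence(X, y)
```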
LaTeX and Python
Recently, I’ve been working on a project that needs some report generation. I decided to have Python write to a LaTeX file and then let LaTeX handle all of the formatting goodness — like it does so well — of making a PDF.
To do this, I have a template TEX file with all of the necessary preamble. Whenever a script needs to generate a report, it copies the template like so:
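A stripped-down sketch of the copy-and-fill step. The file names and the {{BODY}} placeholder are my own conventions, and the template is written inline here just to keep the example self-contained:

```python
import shutil
from pathlib import Path

# Hypothetical template: the preamble plus a {{BODY}} placeholder
# where the generated content goes.
template = Path("report_template.tex")
report = Path("report.tex")

template.write_text(
    "\\documentclass{article}\n\\begin{document}\n{{BODY}}\n\\end{document}\n"
)

# Copy the template, then substitute the generated section.
shutil.copy(template, report)
body = "Results section generated by the script."  # placeholder content
report.write_text(report.read_text().replace("{{BODY}}", body))
```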
I’m a thought leader?
Today a media team from a Fortune 50 company asked for my “thought leadership” on the future of technology in medicine — which definitely didn’t trigger my impostor syndrome (sarcasm). Nevertheless, I wrote down some thoughts. Maybe some of these thoughts are even leaders. Or am I the leader of the thoughts? Anyway:
Technology will
Parsing user input list
I’m working on a software package that makes tuning hyperparameters for machine learning algorithms a little easier and more friendly. As part of this, I need a way to parse a list provided by a user from the command line. Argparse is the standard for parsing command line arguments in Python, and it has a “list” type of argument. The problem is that argparse (with this option) tries to coerce the exact user input to a list. Instead, argparse needs to assemble each piece of the argument into a list, which you can do with the following code.
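A minimal sketch — the --layers flag is just an example. Setting nargs="+" makes argparse gather one or more space-separated tokens into a list, applying the type conversion to each piece:

```python
import argparse

parser = argparse.ArgumentParser()
# nargs="+" collects one or more space-separated values into a list,
# converting each one with type=int, instead of coercing a single token.
parser.add_argument("--layers", nargs="+", type=int)

# e.g. the user runs: myscript --layers 10 20 30
args = parser.parse_args(["--layers", "10", "20", "30"])
```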
Universal gitignore
Tired of having to make sure every branch (or even all of your repositories) has the same gitignore? You can set up a universal gitignore file and then point to it in your system- or user-level gitconfig file (a hidden file named “.gitconfig”). This file should be in your home directory, or C:/Users/username on Windows. The line to point to your universal gitignore looks like this:
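In your .gitconfig, the setting lives under the [core] section (the path to the shared gitignore file is yours to choose):

```ini
[core]
    excludesfile = ~/.gitignore_global
```

You can also set it from the command line with `git config --global core.excludesFile ~/.gitignore_global`.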
Scikit-learn special case: leave-one-out
Today I learned the hard way that sklearn.model_selection.cross_val_score() returns NaNs when you use a probability-based score (like AUC or log loss) with leave-one-out cross-validation (LOO-CV). Intuitively, it makes sense why LOO-CV would be special, since each round returns a single value instead of an array of values. To overcome this issue, I built a wrapper for the combination of scikit-learn methods you need to make LOO-CV scoring behave like the scoring for other cross-validation methods.
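A minimal version of the idea (not the wrapper itself; the loo_auc name and the synthetic data are mine): use cross_val_predict to collect each round’s single held-out probability, then score them all at once.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loo_auc(model, X, y):
    """Score LOO-CV with AUC by pooling the single held-out predictions.

    Each LOO round yields one prediction, so per-fold AUC is undefined;
    instead, collect every held-out probability and score them once.
    """
    probs = cross_val_predict(model, X, y, cv=LeaveOneOut(),
                              method="predict_proba")[:, 1]
    return roc_auc_score(y, probs)

X, y = make_classification(n_samples=40, random_state=0)
auc = loo_auc(LogisticRegression(), X, y)
```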
tmux is a life saver
Ever had an SSH session disconnect, interrupting an important process you were running? No more with tmux!
Special characters will be the death of me
After spending days trying to figure out why a particular script wouldn’t run on a Linux cluster, it turned out to be end-of-line characters from Windows. I learned today that many git clients take care of this issue when moving between Windows and Linux by default, but mine (Fork) does not.
Fishbone (Ishikawa) Diagrams
Let’s revisit the question of “what do they really want?” I’ve been asked to solve several “data science” questions that really had nothing to do with data. But being stubborn, I usually agree to engage in these problems (as long as they are interesting) anyway. In those cases, it’s sometimes necessary to figure out the root causes of the problem you’ve been tasked with. Fishbone (or Ishikawa) diagrams are a popular, useful technique for doing exactly that.
Text editors: some brief thoughts
Some brief commentary on the three text editors I have loved.
Hyperopt-sklearn: Automatic hyperparameter tuning
Check out my video demonstration of hyperopt-sklearn, which simplifies hyperparameter tuning. Also, hear my opinion on why it’s the right choice for almost no one.
Sometimes it helps to walk away
Today I was looking over some code I wrote about a month ago. I found one line that did exactly the opposite of what was intended. Instead of making sure some parts of a list were included, it was excluding them. This experience was a good reminder that sometimes it helps to walk away from code and come back for a review later.
What do they really want?
There’s a story I once heard about a man who went to buy a hammer from the hardware store. But what does that man really want? It turns out, he wants to hang a picture. But is that what he really wants? Well, it turns out he wants his wife to stop asking him to hang the picture. But is that what he really wants? It turns out, the man wants a better relationship with his wife.
I tried to find the attribution for this story today, and I came up empty. Regardless, it’s an instructive story.
I was recently asked to analyze a system redesign for a very specific goal.
Skopt: automatic hyperparameter tuning for scikit-learn
Check out my video demo of Scikit-Optimize for automatic hyperparameter tuning.
Drained by writing
I get wrapped up in code for hours at a time. But after an hour of writing a manuscript draft for a journal today, I felt completely drained. I’m considering dedicating a block of time every day — or maybe a bigger block once a week — to writing. It seems like there’s always some writing work to be done, after all.
SVMs getting stuck
When trying to tune the hyperparameters of SVMs, I noticed that they’re taking exponentially longer than other model types. I’m still trying to lock down the cause for my particular data sets (typically small, around 400 observations). However, I’ve come across three tips that might help others in a similar situation:
Pareto charts: making and interpreting
Learn how to make and interpret a Pareto chart in Excel. Here I demonstrate the process step by step and discuss how to overcome a major pitfall of Pareto charts: forgetting exactly what question they answer.