Scratch work only.

Inspired by Austin Kleon’s Show Your Work, I am recording notes, snippets, and curiosities from my Data Science work. You won’t find any finished projects here, only scratch work.

Notes

Ryan Melvin Ryan Melvin

Pickle a dictionary

Recently, I’ve found myself needing to save Python dictionaries. I’m using this as a way of saving the “best” hyperparameters when working with Hyperopt (see my previous posts). The easiest way to do this in Python seems to be the “pickle” library. Writing (and later reading) a dictionary in this way looks like this.
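
Here is a minimal sketch of the round trip, not the code from the full post. The helper names (save_dict, load_dict), the file name, and the hyperparameter values are all made up for illustration; only pickle.dump and pickle.load come from the standard library.

```python
import pickle
import tempfile
from pathlib import Path


def save_dict(obj, path):
    """Write a dictionary to disk in pickle's binary format."""
    with open(path, "wb") as handle:  # pickle needs binary mode
        pickle.dump(obj, handle)


def load_dict(path):
    """Read a pickled dictionary back. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical "best" hyperparameters, standing in for a Hyperopt result.
best_params = {"max_depth": 4, "learning_rate": 0.1, "n_estimators": 200}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    params_file = Path(tmp_dir) / "best_params.pkl"
    save_dict(best_params, params_file)
    restored = load_dict(params_file)

print(restored == best_params)  # True
```

One caveat worth remembering: unpickling can run arbitrary code, so this is only appropriate for files you wrote yourself.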

Meta-classifiers

Since I’m working on tuning the hyperparameters of many types of machine learning models at once, why not also combine all of their predictions and see how well that meta-classifier performs? Meta-classifier models are used a lot in Kaggle competitions. The big idea is that if you train, say, 5 different model types, then the meta-classifier should be at least as good as your best model. So it’s a cheap way to get a boost in performance. I like this simple figure from the mlxtend docs showing the function of a meta-classifier.
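
To make the idea concrete, here is a rough sketch that stacks three base models and compares cross-validated accuracy. It uses scikit-learn’s StackingClassifier rather than the mlxtend class mentioned above, and the synthetic data set and the choice of base models are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Arbitrary base models; any mix of classifiers would do.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
]

# The meta-classifier is a logistic regression trained on the base models'
# out-of-fold predictions (cv=5 controls those inner folds).
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Score every base model and the stack with the same outer cross-validation.
scores = {}
for name, model in base_models + [("stack", stack)]:
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing the scores side by side is the useful part: the stack is not guaranteed to beat the best base model on every data set, so it is worth checking rather than assuming.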

Pipelines and imputation

It’s taken me a while, but I finally saw the beauty in Scikit-Learn pipelines, especially when imputing missing data.

For each round of cross-validation you do in model training, you shouldn’t let your model “see” any of the test data. If you impute on your whole data set, or even just your larger training set, then you have some data leakage across your train/test or train/score splits for cross-validation. (Jason Brownlee explains this further in his blog posts on imputation here and here.) Luckily, you can run cross-validation on a pipeline in Scikit-Learn, just like you would on a classifier or regressor.

If you use any of the advice from my Hyperopt posts, though, things can get a little messy. You need to have your train_test function build your pipeline depending on whether the “normalize” or “scale” option is selected. That looks like this:
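
As a rough sketch of that idea (not the train_test function from the full post), the pipeline can be assembled step by step from an options dictionary. The function name build_pipeline, the mapping of “scale” to StandardScaler and “normalize” to MinMaxScaler, the median imputation strategy, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def build_pipeline(params):
    """Assemble imputation, optional scaling, and a model into one pipeline."""
    steps = [("impute", SimpleImputer(strategy="median"))]
    if params.get("scale"):
        steps.append(("scale", StandardScaler()))  # assumed meaning of "scale"
    elif params.get("normalize"):
        steps.append(("normalize", MinMaxScaler()))  # assumed meaning of "normalize"
    steps.append(("model", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)


# Synthetic data with roughly 10% of the values knocked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer and scaler are refit on the training part of every fold,
# so the held-out fold never influences the imputed values.
scores = cross_val_score(build_pipeline({"scale": True}), X, y, cv=5)
print(scores.mean())
```

Because the whole pipeline is passed to cross_val_score, the imputation statistics are learned inside each fold, which is what avoids the leakage described above.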

Choosing parameters for regularized regression

Along with some colleagues, I’m working on an automated parameter and feature selection tuner for machine learning. Currently, we’re searching for ranges of parameters that have been “successful.” I recently did a deep dive into LASSO, ridge, and elastic-net regularized regression, where you have to pick a lambda value (or a range of them to search over). There are some recommendations on Data Science blogs (like Jason Brownlee’s) on parameter ranges to search.

However, some of the creators of these methods wrote an R package called “glmnet” that picks the range of parameters based on the data you’re working with. If you aren’t using R, though, they describe the method in Section 2.5 of this paper. And there’s also a great Stack Overflow discussion.
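
For reference, here is my reading of that method as a small pure-Python sketch, not glmnet’s actual code. It assumes the feature columns are already standardized (mean 0, mean square 1) and y is centered, and it uses the defaults described in the paper (a smallest lambda of 0.001 times the largest, and 100 values). The function name lambda_grid and the toy data are made up.

```python
import math


def lambda_grid(X, y, alpha=1.0, eps=1e-3, n_lambdas=100):
    """Log-spaced lambdas from lambda_max down to eps * lambda_max.

    X is a list of rows with standardized columns, y is centered, and
    alpha is the elastic-net mixing parameter (1.0 is LASSO). alpha must
    be greater than 0, so pure ridge needs a small positive stand-in.
    """
    n_samples = len(y)
    n_features = len(X[0])
    # Largest absolute inner product between a feature column and y.
    max_dot = max(
        abs(sum(X[i][j] * y[i] for i in range(n_samples)))
        for j in range(n_features)
    )
    # Smallest lambda at which every coefficient is exactly zero.
    lam_max = max_dot / (n_samples * alpha)
    lam_min = eps * lam_max
    # Equal steps on the log scale, decreasing from lam_max to lam_min.
    step = (math.log(lam_min) - math.log(lam_max)) / (n_lambdas - 1)
    return [math.exp(math.log(lam_max) + k * step) for k in range(n_lambdas)]


# Toy data: two standardized columns and a centered response.
X = [[1.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]]
y = [1.0, -1.0, 0.5, -0.5]

grid = lambda_grid(X, y, n_lambdas=5)
print(grid[0], grid[-1])  # largest and smallest lambda in the grid
```

A naming trap if you carry this over to scikit-learn: glmnet’s lambda is called alpha there, and glmnet’s alpha (the mixing parameter) is called l1_ratio.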

LaTeX and Python

Recently, I’ve been working on a project that needs some report generation. I decided to have Python write to a LaTeX file and then let LaTeX handle all of the formatting goodness of making a PDF, like it does so well.

To do this, I have a template TEX file with all of the necessary preamble. Whenever a script needs to generate a report, it copies the template like so:
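
A minimal sketch of the copy-then-append step, assuming the template holds only the preamble (nothing from \begin{document} onward). The helper names, file names, and body text are placeholders, not the ones from the actual project.

```python
import shutil
import tempfile
from pathlib import Path


def start_report(template_path, report_path):
    """Copy the preamble-only template to a fresh report file."""
    shutil.copyfile(template_path, report_path)
    return Path(report_path)


def write_body(report_path, body_lines):
    """Append the document body after the copied preamble."""
    with open(report_path, "a", encoding="utf-8") as handle:
        handle.write("\\begin{document}\n")
        for line in body_lines:
            handle.write(line + "\n")
        handle.write("\\end{document}\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in template; a real one would carry the full preamble.
    template = Path(tmp_dir) / "template.tex"
    template.write_text("\\documentclass{article}\n", encoding="utf-8")

    report = start_report(template, Path(tmp_dir) / "report.tex")
    write_body(report, ["\\section{Results}", "Placeholder text."])
    report_text = report.read_text(encoding="utf-8")

print(report_text)
```

Copying first means the template itself is never modified, so every report starts from the same preamble.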

I’m a thought leader?

Today a media team from a Fortune 50 company asked for my “thought leadership” on the future of technology in medicine — which definitely didn’t trigger my impostor syndrome (sarcasm). Nevertheless, I wrote down some thoughts. Maybe some of these thoughts are even leaders. Or am I the leader of the thoughts?…. Anyway:

Technology will

Parsing user input list

I’m working on a software package that makes tuning hyperparameters for machine learning algorithms a little easier and more friendly. As part of this, I need a way to parse a list provided by a user from the command line. Argparse is the standard for parsing command line arguments in Python, and it has a “list” type of argument. The problem is that argparse (with this option) tries to coerce the exact user input to a list. Instead, argparse needs to assemble each piece of the argument into a list, which you can do with the following code.
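
One way to do this, sketched below, is argparse’s nargs="+" option, which gathers every value after the flag into a list. The --models flag and the parse_args helper are hypothetical names, not the ones from the package.

```python
import argparse


def parse_args(argv=None):
    """Parse a space-separated list of values given after --models."""
    parser = argparse.ArgumentParser(description="Hypothetical tuning script")
    parser.add_argument(
        "--models",
        nargs="+",  # collect one or more values into a list
        default=[],
        help="model names to tune, separated by spaces",
    )
    return parser.parse_args(argv)


# Equivalent to running: python tune.py --models svm rf xgb
args = parse_args(["--models", "svm", "rf", "xgb"])
print(args.models)  # ['svm', 'rf', 'xgb']
```

For comparison, type=list calls list() on each raw string, so an input of svm would come back as ['s', 'v', 'm'].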

Universal gitignore

Tired of having to make sure every branch (or even all your repositories) has the same gitignore? You can set up a universal gitignore file and then point to it in your system or user level gitconfig file (hidden file named “.gitconfig”). This file should be in your home directory or C:/Users/username on Windows. The line to point to your universal gitignore looks like this:
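
As a sketch, the relevant setting is core.excludesfile. The file name ~/.gitignore_global below is only an assumed example; any path works as long as the file exists.

```
[core]
    excludesfile = ~/.gitignore_global
```

The same setting can be written from the command line with git config --global core.excludesfile ~/.gitignore_global, which edits the user level .gitconfig for you.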

Scikit-learn special case: leave-one-out

Today I learned the hard way that sklearn.model_selection.cross_val_score() returns NaNs when you use a probability-based score (like AUC or log-loss) with leave-one-out cross-validation (LOO-CV). Intuitively, it makes sense why LOO-CV would be special since it returns a single value each round instead of an array of values. To overcome this issue, I built a wrapper for the combination of scikit-learn methods you need to make LOO-CV scoring behave like the scoring for other cross-validation methods.
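
The wrapper itself is in the full post; what follows is only a sketch of one common workaround, assuming a binary classifier that supports predict_proba. Instead of scoring each one-sample fold (where a metric like AUC is undefined because only one class is present), it collects the held-out probabilities from every round with cross_val_predict and scores them once at the end. The function name and the synthetic data are made up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def loo_probability_scores(model, X, y):
    """Pool leave-one-out probabilities, then score them in one pass."""
    # One row of class probabilities per sample, each predicted by a
    # model that never saw that sample during training.
    proba = cross_val_predict(
        model, X, y, cv=LeaveOneOut(), method="predict_proba"
    )
    return {
        "auc": roc_auc_score(y, proba[:, 1]),  # column 1 is the positive class
        "log_loss": log_loss(y, proba),
    }


# Small synthetic binary problem, since LOO-CV fits one model per sample.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
scores = loo_probability_scores(LogisticRegression(max_iter=1000), X, y)
print(scores)
```

Note that a pooled score like this is a single number for the whole data set, not an average of per-fold scores, so it is not directly comparable to a k-fold mean.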

tmux is a life saver

Ever had an SSH session disconnect, interrupting an important process you were running? No more with tmux!

Special characters will be the death of me

After spending days trying to figure out why a particular script wouldn’t run on a Linux cluster, it turned out to be end-of-line characters from Windows. I learned today that many git clients take care of this issue when moving between Windows and Linux by default, but mine (Fork) does not.
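
If you suspect the same problem, a quick check like the sketch below can confirm it. The helper names and the sample script are hypothetical, and the conversion should only be run on text files, since it rewrites bytes.

```python
import tempfile
from pathlib import Path


def has_crlf(path):
    """Return True if the file contains Windows (CRLF) line endings."""
    return b"\r\n" in Path(path).read_bytes()


def to_lf(path):
    """Rewrite the file in place with Unix (LF) line endings."""
    data = Path(path).read_bytes()
    Path(path).write_bytes(data.replace(b"\r\n", b"\n"))


with tempfile.TemporaryDirectory() as tmp_dir:
    # A stand-in script saved with Windows line endings.
    script = Path(tmp_dir) / "run_job.sh"
    script.write_bytes(b"#!/bin/bash\r\necho hello\r\n")

    before = has_crlf(script)
    to_lf(script)
    after = has_crlf(script)

print(before, after)  # True False
```

On the git side, the core.autocrlf setting controls whether line endings are converted automatically on checkout and commit.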

Fishbone (Ishikawa) Diagrams

Let’s revisit the question of “what do they really want?” I’ve been asked to solve several “data science” questions that really had nothing to do with data. But being stubborn, I usually agree to engage in these problems (as long as they are interesting) anyway. In those cases, it’s sometimes necessary to figure out the root causes of the problem you’ve been tasked with. Fishbone (or Ishikawa) diagrams are a popular, useful technique for figuring out the root causes of a problem.

Sometimes it helps to walk away

Today I was looking over some code I wrote about a month ago. I found one line that did exactly the opposite of what was intended. Instead of making sure some parts of a list were included, it was excluding them. This experience was a good reminder that sometimes it helps to walk away from code and come back for a review later.

What do they really want?

There’s a story I once heard about a man who went to buy a hammer from the hardware store. But what does that man really want? It turns out, he wants to hang a picture. But is that what he really wants? Well, it turns out he wants his wife to stop asking him to hang the picture. But is that what he really wants? It turns out, the man wants a better relationship with his wife.

I tried to find the attribution for this story today, and I came up empty. Regardless, it’s an instructive story.

I was recently asked to analyze a system redesign for a very specific goal.

Drained by writing

I get wrapped up in code for hours at a time. But after an hour of writing a manuscript draft for a journal today, I felt completely drained. I’m considering dedicating a block of time every day (or maybe a bigger block once a week) to writing. It seems like there’s always some writing work to be done after all.

SVMs getting stuck

When trying to tune the hyperparameters of SVMs, I noticed that they’re taking exponentially longer than other model types. I’m still trying to lock down the cause for my particular data sets (typically small, around 400 observations). However, I’ve come across three tips that might help others in a similar situation:

Pareto charts: making and interpreting

Learn how to make and interpret a Pareto Chart in Excel. Here I demonstrate the process for making a Pareto Chart in Excel and describe how to interpret it. I also discuss overcoming a major pitfall of Pareto Charts -- forgetting exactly what question they answer.
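
The post itself works in Excel, but the numbers behind a Pareto chart are easy to sketch in Python too: sort the categories from most to least frequent and track the cumulative percentage. The function name and the complaint counts below are invented for illustration.

```python
def pareto_table(counts):
    """Return (category, count, cumulative percent) rows, largest first."""
    total = sum(counts.values())
    rows = []
    running = 0
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for category, count in ordered:
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


# Made-up complaint counts, purely for illustration.
complaints = {"wrong item": 30, "late delivery": 50, "other": 5, "damaged": 15}

for row in pareto_table(complaints):
    print(row)
# ('late delivery', 50, 50.0)
# ('wrong item', 30, 80.0)
# ('damaged', 15, 95.0)
# ('other', 5, 100.0)
```

The bars of the chart are the counts and the line is the cumulative percent column, so in this toy example the top two categories account for 80% of complaints.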
