Scratch work only.

Inspired by Austin Kleon’s Show Your Work, I am recording notes, snippets, and curiosities from my Data Science work. You won’t find any finished projects here, only scratch work.

Notes

Ryan Melvin

Multiple Comparison Correction

Evaluating p-values from multiple comparison tests is a problem I’ve written about on my professional blog. Today I had occasion to use one of the solutions I pointed out in that article. Specifically, I used the Benjamini–Hochberg procedure in an example that was shockingly similar to the one in this book chapter. (One might even say I had a textbook example.)
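The procedure itself is only a few lines. Here’s a minimal NumPy sketch (made-up p-values) of the BH step-up rule: sort the p-values, find the largest rank k with p(k) ≤ (k/m)·α, and reject everything up to that rank.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of which p-values survive BH false-discovery-rate control."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                          # ascending p-values
    ranked = p[order]
    # largest rank k with p_(k) <= (k/m) * alpha
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest qualifying rank (0-based)
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))  # only the two smallest survive at alpha=0.05
```

Note that BH rejects everything at or below the cutoff rank, even p-values that individually exceed their own threshold.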

Read More

SQL: Fill completely missing observations

I found a different solution for the missing days problem I mentioned yesterday. By combining the advice of two Stack Overflow posts (here and here), I was able to fill in the missing rows between a start and end date for every subject.
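The actual SQL is behind the link, but the idea translates directly: build the full subject-by-day grid, then left-join the observations onto it so completely missing rows appear explicitly. A pandas sketch with made-up column names:

```python
import pandas as pd

# Hypothetical data: one row per subject per day, with some days missing entirely.
df = pd.DataFrame({
    "subject": ["A", "A", "B"],
    "obs_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "value": [10, 30, 5],
})

start, end = pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-03")
full_days = pd.date_range(start, end, freq="D")

# Cross every subject with every day, then reindex (a left join onto the grid),
# so missing subject-day rows show up with NaN values.
grid = pd.MultiIndex.from_product(
    [df["subject"].unique(), full_days], names=["subject", "obs_date"]
)
filled = (
    df.set_index(["subject", "obs_date"])
      .reindex(grid)
      .reset_index()
)
print(filled)  # 6 rows: 2 subjects x 3 days, 3 of them newly created NaN rows
```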

Here’s what the generic version of my solution looks like:

Read More

Differencing with SQL LAG()

I’m working on this project where I have daily values for a bunch of subjects. I need to know the differences between the values for (day5, day1), (day5, day2), and (day5, day3).

I had planned to do this in Python in a loop, but then I remembered that many languages (including SQL) have tools for handling time series data. SQL Server’s LAG() function to the rescue!

Some of the subjects are missing values for some days. This is easy to deal with for days 1-4. The LAG() function takes a “default” argument for setting what the value should be if the row for the target day is missing. The syntax is
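The SQL syntax itself is behind the link; as a rough pandas analogue (hypothetical column names), groupby().shift() plays the role of LAG() and fillna() plays the role of its default argument:

```python
import pandas as pd

# Hypothetical per-subject daily values; some days are missing rows entirely.
df = pd.DataFrame({
    "subject": ["A", "A", "A"],
    "day": [1, 3, 5],
    "value": [100, 120, 150],
})

# SQL: LAG(value, 1, 0) OVER (PARTITION BY subject ORDER BY day)
# pandas analogue: shift within each subject, then fill the "no prior row"
# case with a default, like LAG()'s third argument.
df["prev_value"] = df.groupby("subject")["value"].shift(1).fillna(0)
df["diff"] = df["value"] - df["prev_value"]
print(df)
```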

Read More

Pearson correlation is cool, but…

You know what’s even cooler? Spearman correlation — sometimes, depending on your use case.

Spearman correlation is rank-based. So, it asks how similarly positioned the highest, middle, and lowest values are across two vectors. One handy result of this methodology is that outliers tend not to sway Spearman correlation as much as they might Pearson correlation. Here’s a short example of how to use Spearman correlation in R.
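The R snippet is behind the link; the outlier-robustness point is also easy to demonstrate in Python with SciPy (simulated data, one planted outlier):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.1, size=50)  # strongly correlated with x
y[0] = 100.0                            # one wild outlier

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

One displaced rank barely moves Spearman, while the same point drags Pearson way down.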

Read More

Quick bootstrapping of confidence intervals for linear models in R

In graduate school, I had to code up my bootstrapping methods for linear models myself. But I recently discovered the “boot” library in R, which makes getting bootstrapped confidence intervals quick and easy. Here’s a simple, generic example with 10,000 replicates.
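The R/boot code is behind the link; the underlying scheme is just resample, refit, repeat, which can be sketched in plain NumPy (simulated data, percentile intervals, 10,000 replicates):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: y = 2 + 3x + noise (hypothetical example, true slope = 3)
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(scale=2, size=n)

def slope(x, y):
    # least-squares slope via a degree-1 polynomial fit
    return np.polyfit(x, y, 1)[0]

# Resample (x, y) pairs with replacement, refit, collect the slopes.
reps = 10_000
boot_slopes = np.empty(reps)
for i in range(reps):
    idx = rng.integers(0, n, n)
    boot_slopes[i] = slope(x[idx], y[idx])

lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"95% percentile CI for the slope: ({lo:.3f}, {hi:.3f})")
```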

Read More

Continued: ITS: multiple interventions and control

Continuing my example of an interrupted time series with multiple interventions and a control. Here is the completed example in R — including examining the model residuals for evidence of autoregressive (AR) and moving average (MA) processes.

Read More

Table unions

For some reason, software like Tableau and Power BI makes doing joins very obvious but makes unions (adding the rows of one table to another, i.e., stacking two or more tables on top of each other) less so.
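In code the distinction is a single function call either way; for example, in pandas a union is just a row-wise concat (hypothetical tables):

```python
import pandas as pd

# Two tables with the same columns (hypothetical quarterly data)
q1 = pd.DataFrame({"region": ["East", "West"], "sales": [100, 80]})
q2 = pd.DataFrame({"region": ["East", "West"], "sales": [120, 90]})

# A union stacks rows; a join (pd.merge) would match on columns instead.
combined = pd.concat([q1, q2], ignore_index=True)
print(combined)  # 4 rows
```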

Read More

Imaginary roots of a parabola

A poem wherein the author forgot how roots of an equation work..... Where does parabola, y = ax^2 + bx + c, cross some line y=3? Only imaginary do my roots be! I look at a plot, and what do I see? The parabola doesn’t cross that line (in a real space that's 2-D).

Read More

Interrupted time series: multiple interventions with a control group

I’ve seen examples of Interrupted Time Series (ITS) analysis with multiple interventions, and I’ve seen examples with control groups (in particular, I’ve really enjoyed this EdX course). But I haven’t found an example with both.

I’m currently working on a project that needs ITS with both multiple interventions and a control group. So, I’m making my own example. For my data, I have 3 interventions. Below is an example of how I’m setting up the data for regression. Hopefully I’ll have a full example including regression and analysis soon!
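My actual setup is in R (behind the link), but the shape of the design matrix can be sketched in pandas with hypothetical time points and intervention times: for each intervention, a post-period dummy (level change), a time-since-intervention term (slope change), and a group interaction for the control comparison.

```python
import numpy as np
import pandas as pd

times = np.arange(1, 21)          # 20 hypothetical time points
interventions = [6, 11, 16]       # hypothetical start times of the 3 interventions

rows = []
for group in (0, 1):              # 0 = control, 1 = treated
    for t in times:
        row = {"group": group, "time": t}
        for j, t0 in enumerate(interventions, start=1):
            post = int(t >= t0)
            row[f"post{j}"] = post                   # level-change dummy
            row[f"since{j}"] = max(0, t - t0 + 1)    # slope-change term
            row[f"group_post{j}"] = group * post     # treated-vs-control interaction
        rows.append(row)

its = pd.DataFrame(rows)
print(its.head())
```

The regression then fits the outcome on time, group, and these dummies/interactions.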

Read More

Comparison tests with large sample sizes

A couple days ago, I posted about “Normality Tests in R” and provided a snippet of code I use that checks normality before running a t-test. If the samples aren’t normal, a non-parametric test is more useful. HOWEVER, today I learned that for large sample sizes (say, >5,000), non-parametric tests become less reliable and a parametric t-test becomes more reliable even if the samples aren’t normally distributed.
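A quick illustration with simulated skewed data: both samples below are exponential, clearly non-normal, yet with thousands of observations Welch's t-test on the means behaves sensibly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two large, clearly non-normal (exponential) samples with different means.
a = rng.exponential(scale=1.0, size=6000)
b = rng.exponential(scale=1.1, size=6000)

# Welch's t-test: with n this large, the central limit theorem makes the
# test on means reliable even though the samples are heavily skewed.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```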

Read More

Missing Data Imputation in R

There are many, many ways to impute missing data in R. CRAN has an article detailing them and the accompanying packages: https://cran.r-project.org/web/views/MissingData.html. Today, I wanted to do some rapid prototyping of ideas on a dataset with about 16,000 observations that had multiple instances of missing data. I started trying things on the list from CRAN. It was a good reminder that R packages are written for and by statisticians.
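For contrast, the rapid-prototyping baseline I would reach for in Python is a one-liner: per-column median fill (hypothetical data, and deliberately crude compared to the model-based methods on the CRAN list):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in both columns
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, np.nan],
    "bmi": [22.0, 27.5, np.nan, 24.0, 30.1],
})

# Quick-and-dirty baseline: fill each column's NaNs with that column's median.
imputed = df.fillna(df.median(numeric_only=True))
print(imputed)
```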

Read More

Normality test in R

Today I learned the hard way that the Shapiro–Wilk test for normality in R only works for sample sizes between 3 and 5,000. The fact I’m just learning this today does make me a little sad about the sample sizes I’ve worked with so far. :(
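SciPy's version of the test has the same caveat in a different form: scipy.stats.shapiro runs on larger samples but its documentation notes the p-value may be inaccurate for N > 5,000. One common workaround is to test a random subsample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
big_sample = rng.normal(size=20_000)   # too big for R's shapiro.test()

# R's shapiro.test() errors outside 3 <= n <= 5000; SciPy's shapiro runs
# on larger n but documents that the p-value may be inaccurate there.
# Workaround: test a random subsample of at most 5,000 points.
sub = rng.choice(big_sample, size=5_000, replace=False)
stat, p = stats.shapiro(sub)
print(f"W = {stat:.4f}, p = {p:.3f}")
```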

Read More

JANE API

I’m a fan of using JANE (Journal Author Name Estimator) for suggestions on where to send papers and even who might make good reviewers. I wanted to incorporate some of JANE’s functionality into a web app I’m building to help inexperienced authors with paper submissions.

Read More

Reading about Deep Neural Networks

I’m diving into tuning hyperparameters for multilayer perceptrons (MLPs) and deep neural networks for a zero-code machine learning application I’m working on. Today I came across a helpful post by Jason Brownlee that outlines the use cases for MLPs, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and hybrid neural networks. Since most of the data I anticipate my software package working with is tabular, MLPs seem like a good place to start.

Looking into the space of hyperparameters I might want to tune for an MLP, I found this interesting StackExchange discussion that outlines some rationale for only using one or two layers. Apparently, very few problems benefit from more than two layers. In fact, very few problems benefit from more than one layer…
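Following that advice, a reasonable starting point is a single hidden layer; here is a minimal scikit-learn sketch on toy tabular data (the layer width and iteration count are placeholders, not tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy tabular data standing in for whatever the app ingests
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters a lot for MLP training, so it goes in the pipeline.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,),  # one hidden layer, 32 units
                  max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print(f"test accuracy: {mlp.score(X_test, y_test):.3f}")
```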

Read More

Categorical variables in a list

Do you have a set of categories for each observation, but each category is part of a text list that’s in a column of your data set? Here’s a handy pandas trick I’ve recently been reminded of for handling this situation:
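The trick behind the link may differ in detail, but one common pandas approach is str.get_dummies, which splits each list on a separator and one-hot encodes the categories (hypothetical data):

```python
import pandas as pd

# Hypothetical column of comma-separated category lists
df = pd.DataFrame({"tags": ["red,blue", "blue", "red,green"]})

# Split on the separator and one-hot encode each category in one call.
dummies = df["tags"].str.get_dummies(sep=",")
print(dummies)  # one indicator column per category, sorted alphabetically
```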

Read More

Mlxtend: Making a meta-classifier in a Loop

Today, using the Python package Mlxtend, I stacked the “best” classification tree, bagged classification tree, XGBoost, SVM, logistic regression, linear discriminant analysis, and k-nearest-neighbor classifiers from 10,000 runs (each) of Hyperopt parameter tuning. I then combined the best of each model type into a meta-classifier via logistic regression. Since each of my model types has its own tuned Pipeline, I found out that mlxtend works just fine with sklearn pipelines. The syntax looks like this:
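The mlxtend snippet is behind the link. For a self-contained sketch of the same stacking pattern, scikit-learn's own StackingClassifier also accepts pipeline-wrapped base learners and a logistic-regression meta-classifier (toy data, untuned placeholder models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base learner can be its own (hypothetically pre-tuned) Pipeline.
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # the meta-classifier
)
stack.fit(X_train, y_train)
print(f"stacked accuracy: {stack.score(X_test, y_test):.3f}")
```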

Read More

Python trick for finding strings in a list that match a pattern

As part of a project to automate some complicated report generation, I need to interact with a software package that returns all of its output in a series of strings — even “warnings.” The warnings are not special other than that they begin with “Warning:”. So, I needed a way to take a list of strings and pull out only the ones that start with “Warning:”. After perusing some of the output, I realized it would be sufficient to find strings that contain “Warning.”

I read several suggestions on Stack Overflow. Some of them suggested joining the strings into one massive string before searching. This seemed overly complicated. In the end, a list comprehension turned out to be the most satisfying answer (as they so often are).
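The comprehension looks something like this (hypothetical output lines):

```python
# Hypothetical output lines from the report-generation package
output_lines = [
    "Run started",
    "Warning: missing header row",
    "Processed 120 records",
    "Warning: 3 records skipped",
]

# Keep only the strings that contain "Warning"
warnings = [line for line in output_lines if "Warning" in line]
print(warnings)
```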

Read More