Scratch work only.

Inspired by Austin Kleon’s Show Your Work, I am recording notes, snippets, and curiosities from my Data Science work. You won’t find any finished projects here, only scratch work.

Notes

Ryan Melvin

Multiple Comparison Correction

Evaluating p-values from multiple comparison tests is a problem I’ve written about on my professional blog. Today I had occasion to use one of the solutions I pointed out in that article. Specifically, I used the Benjamini–Hochberg procedure in an example that was shockingly similar to the one in this book chapter. (One might even say I had a textbook example.)
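The procedure itself is only a few lines. Here’s a minimal NumPy sketch (made-up p-values) of the BH step-up rule: sort the p-values, find the largest rank k with p(k) ≤ (k/m)·α, and reject everything up to that rank.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of which p-values survive BH false-discovery-rate control."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                          # ascending p-values
    ranked = p[order]
    # largest rank k with p_(k) <= (k/m) * alpha
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest qualifying rank (0-based)
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))  # only the two smallest survive at alpha=0.05
```

Note that BH rejects everything at or below the cutoff rank, even p-values that individually exceed their own threshold.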

Read More

SQL: Fill completely missing observations

I found a different solution for the missing days problem I mentioned yesterday. By combining the advice of two Stack Overflow posts (here and here), I was able to fill in the missing rows between a start and end date for every subject.
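The actual SQL is behind the link, but the idea translates directly: build the full subject-by-day grid, then left-join the observations onto it so completely missing rows appear explicitly. A pandas sketch with made-up column names:

```python
import pandas as pd

# Hypothetical data: one row per subject per day, with some days missing entirely.
df = pd.DataFrame({
    "subject": ["A", "A", "B"],
    "obs_date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "value": [10, 30, 5],
})

start, end = pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-03")
full_days = pd.date_range(start, end, freq="D")

# Cross every subject with every day, then reindex (a left join onto the grid),
# so missing subject-day rows show up with NaN values.
grid = pd.MultiIndex.from_product(
    [df["subject"].unique(), full_days], names=["subject", "obs_date"]
)
filled = (
    df.set_index(["subject", "obs_date"])
      .reindex(grid)
      .reset_index()
)
print(filled)  # 6 rows: 2 subjects x 3 days, 3 of them newly created NaN rows
```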

Here’s what the generic version of my solution looks like:

Read More

Differencing with SQL LAG()

I’m working on this project where I have daily values for a bunch of subjects. I need to know the differences between the values for (day5, day1), (day5, day2), and (day5, day3).

I had planned to do this in Python in a loop, but then I remembered that many languages (including SQL) have tools for handling time series data. SQL Server’s LAG() function to the rescue!

Some of the subjects are missing values for some days. This is easy to deal with for days 1-4. The LAG() function takes a “default” argument for setting what the value should be if the row for the target day is missing. The syntax is
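The SQL syntax itself is behind the link; as a rough pandas analogue (hypothetical column names), groupby().shift() plays the role of LAG() and fillna() plays the role of its default argument:

```python
import pandas as pd

# Hypothetical per-subject daily values; some days are missing rows entirely.
df = pd.DataFrame({
    "subject": ["A", "A", "A"],
    "day": [1, 3, 5],
    "value": [100, 120, 150],
})

# SQL: LAG(value, 1, 0) OVER (PARTITION BY subject ORDER BY day)
# pandas analogue: shift within each subject, then fill the "no prior row"
# case with a default, like LAG()'s third argument.
df["prev_value"] = df.groupby("subject")["value"].shift(1).fillna(0)
df["diff"] = df["value"] - df["prev_value"]
print(df)
```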

Read More

Pearson correlation is cool, but…

You know what’s even cooler? Spearman correlation — sometimes, depending on your use case.

Spearman correlation is rank-based. So, it asks how similarly positioned the highest, middle, and lowest values are across two vectors. One handy result of this methodology is that outliers tend not to sway Spearman correlation as much as they might Pearson correlation. Here’s a short example of how to use Spearman correlation in R.
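The R snippet is behind the link; the outlier-robustness point is also easy to demonstrate in Python with SciPy (simulated data, one planted outlier):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.1, size=50)  # strongly correlated with x
y[0] = 100.0                            # one wild outlier

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

One displaced rank barely moves Spearman, while the same point drags Pearson way down.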

Read More

Quick bootstrapping of confidence intervals for linear models in R

In graduate school, I had to code up my bootstrapping methods for linear models myself. But I recently discovered the “boot” library in R, which makes getting bootstrapped confidence intervals quick and easy. Here’s a simple, generic example with 10,000 replicates.
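The R/boot code is behind the link; the underlying scheme is just resample, refit, repeat, which can be sketched in plain NumPy (simulated data, percentile intervals, 10,000 replicates):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: y = 2 + 3x + noise (hypothetical example, true slope = 3)
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(scale=2, size=n)

def slope(x, y):
    # least-squares slope via a degree-1 polynomial fit
    return np.polyfit(x, y, 1)[0]

# Resample (x, y) pairs with replacement, refit, collect the slopes.
reps = 10_000
boot_slopes = np.empty(reps)
for i in range(reps):
    idx = rng.integers(0, n, n)
    boot_slopes[i] = slope(x[idx], y[idx])

lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"95% percentile CI for the slope: ({lo:.3f}, {hi:.3f})")
```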

Read More

Continued: ITS: multiple interventions and control

Continuing my example of an interrupted time series with multiple interventions and a control. Here is the completed example in R — including examining the model residuals for evidence of autoregressive (AR) and moving average (MA) processes.

Read More

Table unions

For some reason, software like Tableau and Power BI makes doing joins very obvious but makes unions (adding the rows of one table to another, i.e., stacking two or more tables on top of each other) less so.
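In code the distinction is a single function call either way; for example, in pandas a union is just a row-wise concat (hypothetical tables):

```python
import pandas as pd

# Two tables with the same columns (hypothetical quarterly data)
q1 = pd.DataFrame({"region": ["East", "West"], "sales": [100, 80]})
q2 = pd.DataFrame({"region": ["East", "West"], "sales": [120, 90]})

# A union stacks rows; a join (pd.merge) would match on columns instead.
combined = pd.concat([q1, q2], ignore_index=True)
print(combined)  # 4 rows
```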

Read More

Imaginary roots of a parabola

A poem wherein the author forgot how roots of an equation work..... Where does parabola, y = ax^2 + bx + c, cross some line y=3? Only imaginary do my roots be! I look at a plot, and what do I see? The parabola doesn’t cross that line (in a real space that's 2-D).

Read More

Interrupted time series: multiple interventions with a control group

I’ve seen examples of Interrupted Time Series (ITS) analysis with multiple interventions, and I’ve seen examples with control groups (in particular, I’ve really enjoyed this EdX course). But I haven’t found an example with both.

I’m currently working on a project that needs ITS with both multiple interventions and a control group. So, I’m making my own example. For my data, I have 3 interventions. Below is an example of how I’m setting up the data for regression. Hopefully I’ll have a full example including regression and analysis soon!
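My actual setup is in R (behind the link), but the shape of the design matrix can be sketched in pandas with hypothetical time points and intervention times: for each intervention, a post-period dummy (level change), a time-since-intervention term (slope change), and a group interaction for the control comparison.

```python
import numpy as np
import pandas as pd

times = np.arange(1, 21)          # 20 hypothetical time points
interventions = [6, 11, 16]       # hypothetical start times of the 3 interventions

rows = []
for group in (0, 1):              # 0 = control, 1 = treated
    for t in times:
        row = {"group": group, "time": t}
        for j, t0 in enumerate(interventions, start=1):
            post = int(t >= t0)
            row[f"post{j}"] = post                   # level-change dummy
            row[f"since{j}"] = max(0, t - t0 + 1)    # slope-change term
            row[f"group_post{j}"] = group * post     # treated-vs-control interaction
        rows.append(row)

its = pd.DataFrame(rows)
print(its.head())
```

The regression then fits the outcome on time, group, and these dummies/interactions.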

Read More

Comparison tests with large sample sizes

A couple days ago, I posted about “Normality Tests in R” and provided a snippet of code I use that checks normality before running a t-test. If the samples aren’t normal, a non-parametric test is more useful. HOWEVER, today I learned that for large sample sizes (say, >5,000), non-parametric tests become less reliable and a parametric t-test becomes more reliable even if the samples aren’t normally distributed.
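A quick illustration with simulated skewed data: both samples below are exponential, clearly non-normal, yet with thousands of observations Welch's t-test on the means behaves sensibly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two large, clearly non-normal (exponential) samples with different means.
a = rng.exponential(scale=1.0, size=6000)
b = rng.exponential(scale=1.1, size=6000)

# Welch's t-test: with n this large, the central limit theorem makes the
# test on means reliable even though the samples are heavily skewed.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```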

Read More

Missing Data Imputation in R

There are many, many ways to impute missing data in R. CRAN has an article detailing them and the accompanying packages: https://cran.r-project.org/web/views/MissingData.html. Today, I wanted to do some rapid prototyping of ideas on a dataset with about 16,000 observations that had multiple instances of missing data. I started trying things on the list from CRAN. It was a good reminder that R packages are written for and by statisticians.
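For contrast, the rapid-prototyping baseline I would reach for in Python is a one-liner: per-column median fill (hypothetical data, and deliberately crude compared to the model-based methods on the CRAN list):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in both columns
df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, np.nan],
    "bmi": [22.0, 27.5, np.nan, 24.0, 30.1],
})

# Quick-and-dirty baseline: fill each column's NaNs with that column's median.
imputed = df.fillna(df.median(numeric_only=True))
print(imputed)
```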

Read More

Normality test in R

Today I learned the hard way that the Shapiro–Wilk test for normality in R only works for sample sizes between 3 and 5,000. The fact I’m just learning this today does make me a little sad about the sample sizes I’ve worked with so far. :(
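SciPy's version of the test has the same caveat in a different form: scipy.stats.shapiro runs on larger samples but its documentation notes the p-value may be inaccurate for N > 5,000. One common workaround is to test a random subsample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
big_sample = rng.normal(size=20_000)   # too big for R's shapiro.test()

# R's shapiro.test() errors outside 3 <= n <= 5000; SciPy's shapiro runs
# on larger n but documents that the p-value may be inaccurate there.
# Workaround: test a random subsample of at most 5,000 points.
sub = rng.choice(big_sample, size=5_000, replace=False)
stat, p = stats.shapiro(sub)
print(f"W = {stat:.4f}, p = {p:.3f}")
```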

Read More

JANE API

I’m a fan of using JANE (Journal Author Name Estimator) for suggestions on where to send papers and even who might make good reviewers. I wanted to incorporate some of JANE’s functionality into a web app I’m building to help inexperienced authors with paper submissions.

Read More

Reading about Deep Neural Networks

I’m diving into tuning hyperparameters for multilayer perceptrons (MLPs) and deep neural networks for a zero-code machine learning application I’m working on. Today I came across a helpful post by Jason Brownlee that outlines the use cases for MLPs, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and hybrid neural networks. Since most of the data I anticipate my software package working with is tabular, MLPs seem like a good place to start.

Looking into the space of hyperparameters I might want to tune for an MLP, I found this interesting StackExchange discussion that outlines some rationale for only using one or two layers. Apparently, very few problems benefit from more than two layers. In fact, very few problems benefit from more than one layer…
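Following that advice, a reasonable starting point is a single hidden layer; here is a minimal scikit-learn sketch on toy tabular data (the layer width and iteration count are placeholders, not tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy tabular data standing in for whatever the app ingests
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters a lot for MLP training, so it goes in the pipeline.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,),  # one hidden layer, 32 units
                  max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print(f"test accuracy: {mlp.score(X_test, y_test):.3f}")
```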

Read More

Categorical variables in a list

Do you have a set of categories for each observation, but each category is part of a text list that’s in a column of your data set? Here’s a handy pandas trick I’ve recently been reminded of for handling this situation:
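The trick behind the link may differ in detail, but one common pandas approach is str.get_dummies, which splits each list on a separator and one-hot encodes the categories (hypothetical data):

```python
import pandas as pd

# Hypothetical column of comma-separated category lists
df = pd.DataFrame({"tags": ["red,blue", "blue", "red,green"]})

# Split on the separator and one-hot encode each category in one call.
dummies = df["tags"].str.get_dummies(sep=",")
print(dummies)  # one indicator column per category, sorted alphabetically
```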

Read More

Mlxtend: Making a meta-classifier in a Loop

Today, using the Python package Mlxtend, I stacked the “best” classification tree, bagged classification tree, XGBoost, SVM, logistic regression, linear discriminant analysis, and k-nearest-neighbor classifiers from 10,000 runs (each) of Hyperopt parameter tuning. I then combined the best of each model type into a meta-classifier via logistic regression. Since each of my model types has its own tuned Pipeline, I found out that mlxtend works just fine with sklearn pipelines. The syntax looks like this:
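The mlxtend snippet is behind the link. For a self-contained sketch of the same stacking pattern, scikit-learn's own StackingClassifier also accepts pipeline-wrapped base learners and a logistic-regression meta-classifier (toy data, untuned placeholder models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base learner can be its own (hypothetically pre-tuned) Pipeline.
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # the meta-classifier
)
stack.fit(X_train, y_train)
print(f"stacked accuracy: {stack.score(X_test, y_test):.3f}")
```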

Read More

Python trick for finding strings in a list that match a pattern

As part of a project to automate some complicated report generation, I need to interact with a software package that returns all of its output in a series of strings — even “warnings.” The warnings are not special other than that they begin with “Warning:”. So, I needed a way to take a list of strings and pull out only the ones that start with “Warning:”. After perusing some of the output, I realized it would be sufficient to find strings that contain “Warning.”

I read several suggestions on Stack Overflow. Some of them suggested joining the strings into one massive string before searching. This seemed overly complicated. In the end, a list comprehension turned out to be the most satisfying answer (as they so often are).
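The comprehension looks something like this (hypothetical output lines):

```python
# Hypothetical output lines from the report-generation package
output_lines = [
    "Run started",
    "Warning: missing header row",
    "Processed 120 records",
    "Warning: 3 records skipped",
]

# Keep only the strings that contain "Warning"
warnings = [line for line in output_lines if "Warning" in line]
print(warnings)
```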

Read More