Comparison tests with large sample sizes
A couple of days ago, I posted about “Normality Tests in R” and shared a snippet of code I use to check normality before running a t-test; if the samples aren’t normal, a non-parametric test is the usual fallback. However, today I learned that for large sample sizes (say, >5,000), non-parametric tests become less reliable, while the central limit theorem makes a parametric t-test reliable even when the samples aren’t normally distributed.
The paper I learned this from calls it “a paradox of statistical practice,” and it certainly seems like one. Anyway, here’s an updated version of the function from that post, more appropriate for large sample sizes, where you apparently don’t need to worry about the normality assumption, and can even get yourself into trouble if you try to.
# Compare group means with Welch's t-test (normality check removed)
# Note: ggpubr also exports a compare_means(), which this definition masks,
# and nortest is only needed if you keep the normality checks
library(ggpubr)
library(nortest)

compare_means = function(binary_group, var, data){
  # Welch's t-test (unequal variances) between the two groups
  t = t.test(data[[var]][which(data[[binary_group]] == 0)]
             , data[[var]][which(data[[binary_group]] == 1)]
             , var.equal = FALSE
             )
  output = data.frame(
      variable = var
    , control_mean = t$estimate[1]
    , positive_mean = t$estimate[2]
    , ttest_different = t$p.value < 0.05
    , ttest_p = t$p.value
    , t_difference_estimate_low95 = t$conf.int[1]
    , t_difference_estimate_high95 = t$conf.int[2]
  )
  rownames(output) <- NULL
  return(output)
}
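As a quick sketch of how the function might be called (the data frame and column names here are made up for illustration, using simulated skewed data at a large-sample size where the t-test's robustness applies):

# Hypothetical example: a 0/1 group indicator and a skewed (exponential) outcome
set.seed(1)
d = data.frame(
    treated = rep(c(0, 1), each = 5000)
  , outcome = c(rexp(5000, rate = 1), rexp(5000, rate = 0.9))
)
compare_means("treated", "outcome", d)
# returns a one-row data frame with the two group means, the Welch p-value,
# and the 95% confidence interval for the difference in means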