Comparison tests with large sample sizes
A couple of days ago, I posted about “Normality Tests in R” and shared a snippet of code I use to check normality before running a t-test; if the samples aren’t normal, a non-parametric test is the usual fallback. However, today I learned that for large sample sizes (say, >5,000), non-parametric tests become less reliable, while the central limit theorem makes a parametric t-test reliable even when the samples aren’t normally distributed.
The paper I learned this from calls it “a paradox of statistical practice,” and it certainly seems like one. Anyway, here’s an updated version of the function from that post, more appropriate for large sample sizes, where you apparently don’t need to worry about the normality assumption, and can even get yourself into trouble if you try to.
# Compare group means with Welch's t-test (normality check removed)
# Note: ggpubr also exports a compare_means(), which this definition masks,
# and nortest is only needed if you keep the normality checks
library(ggpubr)
library(nortest)

compare_means = function(binary_group, var, data){
  # Welch's t-test (unequal variances) between the two groups
  t = t.test(data[[var]][which(data[[binary_group]] == 0)]
             , data[[var]][which(data[[binary_group]] == 1)]
             , var.equal = FALSE
             )
  output = data.frame(
      variable = var
    , control_mean = t$estimate[1]
    , positive_mean = t$estimate[2]
    , ttest_different = t$p.value < 0.05
    , ttest_p = t$p.value
    , t_difference_estimate_low95 = t$conf.int[1]
    , t_difference_estimate_high95 = t$conf.int[2]
  )
  rownames(output) <- NULL
  return(output)
}
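As a quick sketch of how the function might be called (the data frame and column names here are made up for illustration, using simulated skewed data at a large-sample size where the t-test's robustness applies):

# Hypothetical example: a 0/1 group indicator and a skewed (exponential) outcome
set.seed(1)
d = data.frame(
    treated = rep(c(0, 1), each = 5000)
  , outcome = c(rexp(5000, rate = 1), rexp(5000, rate = 0.9))
)
compare_means("treated", "outcome", d)
# returns a one-row data frame with the two group means, the Welch p-value,
# and the 95% confidence interval for the difference in means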