Some code to enter in your workspace

Mystery functions?

ts_zpm = function(d) sqrt(length(d))*mean(d)/sd(d)

contam1 = function(x, slip=5) {x[1] = x[1]+slip; x}

Try them out

set.seed(12345)
X = rnorm(50)
ts_zpm(X)

## [1] 1.157889

ts_zpm(contam1(X))

## [1] 1.479476

ts_zpm(contam1(X,100))

## [1] 1.082075

The concept of the critical region

Wolfgang: count events and check whether the count falls into a region prescribed to have low probability under the null hypothesis
The probabilities of regions are prescribed by the binomial distribution
We need probabilities and principles for choosing the critical region
- symmetric about null value?
- represent a specific direction of effect (one-sided?)
- based on theory or simulation?

Assumptions underlying a test procedure

Exact critical regions based on finite sample distribution of a statistic
- \(t\), Wilcoxon
- the parent distribution must satisfy certain conditions
  - data are a random sample from homogeneous population (iid)
  - for \(t\): parent is Gaussian
  - for Wilcoxon: parent is continuous, finite variance (?)
Approximate critical regions based on large sample theory
- how large is large enough?
- data are a random sample from homogeneous population (iid)
Approximate critical regions based on simulation
- the generative distribution accurately represents variation underlying the experiment or observations

Computing and reporting a t test for a simple hypothesis

set.seed(12345)
X = rnorm(50)
tst = t.test(X)
tst

## 
##  One Sample t-test
## 
## data:  X
## t = 1.1579, df = 49, p-value = 0.2525
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1320800  0.4912126
## sample estimates:
## mean of x 
## 0.1795663

Components of the result

str(tst)

## List of 9
##  $ statistic  : Named num 1.16
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 49
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.253
##  $ conf.int   : atomic [1:2] -0.132 0.491
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num 0.18
##   ..- attr(*, "names")= chr "mean of x"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "mean"
##  $ alternative: chr "two.sided"
##  $ method     : chr "One Sample t-test"
##  $ data.name  : chr "X"
##  - attr(*, "class")= chr "htest"

The structure of the t statistic: a ratio of measures of location and dispersion

tst$statistic  # from R

##        t 
## 1.157889

sqrt(50)*mean(X)/sd(X) # by hand

## [1] 1.157889

Simulating the distribution of t under the null hypothesis: \(X \sim N(0,\sigma^2)\)

set.seed(12345)
simdist = replicate(10000,  ts_zpm( rnorm(50) ))
head(simdist)

## [1]  1.15788926  1.92818028 -0.08173422  0.82143100  0.39880144 -1.21706025

Simulated data and theoretical density

hist(simdist, freq=FALSE)
lines(seq(-3,3,.01), dt(seq(-3,3,.01), 49))

The one-sided p-value for \(\hat{t}\) = 1.1579

simulated:

mean(simdist > 1.1579)

## [1] 0.123

theoretical (exact):

integrate(function(x) dt(x,49), 1.1579, Inf)$value

## [1] 0.1262589

What if there is a contaminated observation?

contsim = replicate(10000, 
    ts_zpm( contam1( rnorm(50) ) ) )
critval_1sided = qt(.95, 49)
mean(simdist > critval_1sided) # uncontaminated

## [1] 0.0483

mean(contsim > critval_1sided) # contaminated

## [1] 0.0946

Can we fix this?

library(parody)

## Loading required package: tools

robust_t = function(x) {
 outchk = calout.detect(x, method="GESD")
 if (!is.na(outchk$ind[1])) x = x[-outchk$ind]
 sqrt(length(x))*mean(x)/sd(x)
}
set.seed(12345)
contsim_r = replicate(10000, 
    robust_t( contam1( rnorm(50) ) ) )
mean(contsim_r > critval_1sided) # robust test on contaminated

## [1] 0.0531

Recap

the one-sided one sample t test for \(H_0: \mu = 0\) involves
- \(n\), \(\bar{X}\), \(s\) to form the test statistic, and
- the \(t\) density to form critical values
simulation from the null distribution can be used to obtain an empirical p value
simulation from the contaminated distribution with a single observation shifted by \(5\sigma\) leads to a Type I error rate (using the standard \(t\) critical value for \(\alpha = 0.05\)) of 0.095
robustification of the test statistic using calibrated outlier removal can restore (approximately) the nominal Type I error rate
- can’t conclude too much from this example, of course … there is considerable theoretical literature on sensitivity of tests to contamination
- the t test is described as ‘robust’ in many texts, referring to
  - insensitivity to violation of the assumption of Gaussian population
  - insensitivity to violation of the assumption of equal variances in the two-sample case

Power to reject \(H_0:\mu = 0\) when \(\mu = 0.4\)

power.t.test(n=50, type="one.sample", 
   alt="one.sided", delta=.4)

## 
##      One-sample t test power calculation 
## 
##               n = 50
##           delta = 0.4
##              sd = 1
##       sig.level = 0.05
##           power = 0.8737242
##     alternative = one.sided

# empirical
mean( replicate(10000, 
    ts_zpm(rnorm(50, .4))>qt(.95, 49)) )

## [1] 0.8764

Effect of contamination: a single large contaminant can slash power

mean( replicate(10000, ts_zpm(
     contam1(rnorm(50, .4), slip=25))>qt(.95, 49)) )

## [1] 0.5834

Exercises:

plot power against size of slip
show that calibrated outlier removal can restore power lost with contamination

Sample median is a resistant estimator of location

We’ll use a series of outlier magnitudes (0:10) and summarize distributions of estimators

set.seed(12345)
mns <- sapply(0:10, function(o) 
  median(replicate(5000, mean(contam1(rnorm(50),o))))) 
set.seed(12345)
meds <- sapply(0:10, function(o) 
  median(replicate(5000, median(contam1(rnorm(50),o)))))

Note sensitivity of mean

plot(0:10, mns, xlab="outlier magnitude", ylab="median of stat",
  pch=19)
points(0:10, meds, pch=19, col="blue")
legend(0, .1, pch=19, col=c("black", "blue"), legend=c("mean", "median"))

Sample MAD is a resistant estimator of dispersion

set.seed(12345)
sds <- sapply(0:10, function(o) 
  median(replicate(5000, sd(contam1(rnorm(50),o))))) 
set.seed(12345)
mads <- sapply(0:10, function(o) 
  median(replicate(5000, mad(contam1(rnorm(50),o)))))

Note sensitivity of SD

plot(0:10, sds, xlab="outlier magnitude", ylab="median of stat",
  pch=19)
points(0:10, mads, pch=19, col="blue")
legend(0, 1.2, pch=19, col=c("black", "blue"), legend=c("SD", "MAD"))

Recap

Resistant estimators “peel away” extreme values
These estimators achieve “high breakdown bound”
- if up to 50% of data are corrupted, median continues to estimate population median
- if up to 25% of data are corrupted, MAD continues to estimate a scaled SD

Exercises

Generalize contam1() to help demonstrate the breakdown concept
Assess the robustness of size and power of rank-based tests (such as Wilcoxon’s signed rank test) to contamination by outliers.
Note that dsignrank, qsignrank are available.

qsignrank(.95, 50)

## [1] 808

set.seed(12345)
wilcox.test(rnorm(50, .4))$statistic

##   V 
## 977

Deficiencies of scalar summaries

Anscombe’s data

Show that the (x,y) pairs have identical

- marginal means
- marginal SDs
- correlation coefficients
- linear regressions of y on x

Use MASS::rlm to get a model for y3, x3 that fits the majority of points exactly

Outliers in RNA-seq analysis

DESeq2 has extensive discussion of Cook’s distance for identifying and reducing effects of apparent outliers
Let’s get a feel for what Cook’s distance is: source(“shi.R”); docook()

Simulating an RNA-seq experiment

suppressMessages({set.seed(12345)
library(DESeq2)
S1 = makeExampleDESeqDataSet(betaSD=.75)
D1 = DESeq(S1)
R1 = results(D1)})

summary(R1)

## 
## out of 997 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 99, 9.9% 
## LFC < 0 (down)   : 95, 9.5% 
## outliers [1]     : 2, 0.2% 
## low counts [2]   : 211, 21% 
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Exercises

Find the outliers labeled by DESeq2
Are these really suspect?
For the homogeneous simulation process demonstrated here, estimate the frequency of outlier labeling by the default rules
Define a process for injecting aberrant counts and assess the accuracy of the default rules for identifying them. How do the rules contribute to validity and power of the basic testing procedure?

Conclusion

Robustness is commonly cited as a property of new methodologies
- the term is often used vaguely
In statistics, there is robustness of validity (Type I error rate is maintained despite failure of certain assumptions) and robustness of efficiency (power does not decline in the presence of failure of certain assumptions)
Resistance is a related concept: a fraction of arbitrarily aberrant values may be present, but their effect on estimation or inference is bounded
Outlier labeling is carefully studied for generations; the outliers may be the most interesting points in your data
Learn how to use simulation, with specific attention to realism and flexibility of implementation, so that you can explore a variety of scenarios

Basic robustness concepts of statistics

Some code to enter in your workspace

Try them out

The concept of the critical region

Assumptions underlying a test procedure

Computing and reporting a t test for a simple hypothesis

Components of the result

The structure of the t statistic: a ratio of measures of location and dispersion

Simulating the distribution of t under the null hypothesis: \(X \sim N(0,\sigma^2)\)

Simulated data and theoretical density

The one-sided p-value for \(\hat{t}\) = 1.1579

What if there is a contaminated observation?

Can we fix this?

Recap

Power to reject \(H_0:\mu = 0\) when \(\mu = 0.4\)

Effect of contamination: a single large contaminant can slash power

Sample median is a resistant estimator of location

Note sensitivity of mean

Sample MAD is a resistant estimator of dispersion

Note sensitivity of SD

Recap

Exercises

Deficiencies of scalar summaries

Anscombe’s data

Outliers in RNA-seq analysis

Simulating an RNA-seq experiment

Exercises

Conclusion