Resampling Methods
  
Levi Waldron, CUNY School of Public Health
levi.waldron@sph.cuny.edu
waldronlab.github.io / waldronlab.org
  
  June 15, 2017
 
Outline and introduction
- Objectives: prediction or inference?
 
- Cross-validation
 
- Bootstrap
 
- Permutation Test
 
- Monte Carlo Simulation
 
ISLR Chapter 5: James, G. et al. An Introduction to Statistical Learning: with Applications in R. (Springer, 2013). This book can be downloaded for free at http://www-bcf.usc.edu/~gareth/ISL/getbook.html
 
Why do regression?
Inference
- Questions:
- Which predictors are associated with the response?
 
- How are predictors associated with the response?
 
- Example: do dietary habits influence the gut microbiome?
 
 
- Linear regression and generalized linear models are the workhorses
- We are more interested in interpretability than accuracy
 
- Produce interpretable models for inference on coefficients
 
 
Relevant resampling methods: bootstrap, permutation tests
 
Why do regression? (cont’d)
Prediction
- Questions:
- How can we predict values of \(Y\) based on values of \(X\)?
 
- Examples: Framingham Risk Score, OncotypeDX Risk Score
 
 
- Regression methods are still the workhorses, but less-interpretable machine learning methods are also used
- We are more interested in accuracy than interpretability
 
- e.g. sensitivity/specificity for a binary outcome
 
- e.g. mean-squared prediction error for a continuous outcome
 
 
Cross-validation
 
Cross-validation
Why cross-validation?
Under-fitting, over-fitting, and optimal fitting
 
K-fold cross-validation approach
- Create \(K\) “folds” from the sample of size \(n\), \(K \le n\)
 
- Randomly sample \(1/K\) of the observations (without replacement) as the validation set
 
- Use the remaining observations as the training set
 
- Fit the model on the training set; estimate accuracy on the validation set
 
- Repeat \(K\) times, using a different fold as the validation set each time
 
- Average the validation accuracy across the \(K\) validation sets (see the sketch below)
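A minimal sketch of these steps in R, assuming a continuous outcome (here, 5-fold CV of a linear model on the sleep data used later in these slides):
 
set.seed(1)
K = 5
# randomly assign each observation to one of K folds
folds = sample(rep(1:K, length.out = nrow(sleep)))
cvMSE = sapply(1:K, function(k){
  fit = lm(extra ~ group, data = sleep[folds != k, ])   # training set
  pred = predict(fit, newdata = sleep[folds == k, ])    # validation set
  mean((sleep$extra[folds == k] - pred)^2)              # validation MSE
})
mean(cvMSE)  # average across the K folds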
 
 
Variability in cross-validation
 
Cross-validation summary
- In prediction modeling, we think of data as training or test
- Cross-validation estimates test set error from a training set
 
 
- Training set error always decreases with more complex (flexible) models
 
- Test set error as a function of model flexibility tends to be U-shaped
- The low point of the U represents the optimal bias-variance trade-off, or the most appropriate amount of model flexibility
 
 
- Computationally, \(K\) models must be fitted
- 5- or 10-fold CV are popular choices
 
- can be repeated for smoothing (e.g. see Braga-Neto et al.: Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004, 20:374–380)
 
 
 
Cross-validation caveats
- Be very careful of information “leakage” into test sets, e.g.:
- feature selection using all samples
 
- “human-loop” over-fitting
 
- changing your accuracy measure after seeing the results
 
- trying a different dataset
 
 
http://hunch.net/?p=22
 
Cross-validation caveats (cont’d)
- Tuning plus accuracy estimation requires nested cross-validation (see the sketch at the end of this slide)
 
- Example: high-dimensional training and test sets simulated from identical true model
- Penalized regression models tuned by 5-fold CV
 
- Tuning by cross-validation does not prevent over-fitting
 
 
Waldron et al.: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399–3406.
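A hedged sketch of nested cross-validation, using simulated data and cv.glmnet from the glmnet package as the inner tuning loop (the objects x and y are illustrative assumptions, not from the paper):
 
library(glmnet)
set.seed(1)
x = matrix(rnorm(100 * 50), nrow = 100)  # simulated predictors
y = rnorm(100)                           # simulated response
K = 5
folds = sample(rep(1:K, length.out = length(y)))
outerMSE = sapply(1:K, function(k){
  # inner loop: cv.glmnet tunes lambda using only the outer training set
  inner = cv.glmnet(x[folds != k, ], y[folds != k])
  pred = predict(inner, newx = x[folds == k, ], s = "lambda.min")
  # outer loop: accuracy estimated on data untouched by the tuning
  mean((y[folds == k] - pred)^2)
})
mean(outerMSE)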
 
Cross-validation caveats (cont’d)
- Cross-validation estimates assume that the sample is representative of the population
 
Bernau C et al.: Cross-study validation for the assessment of prediction algorithms. Bioinformatics 2014, 30:i105–12.
 
Permutation test
Permutation test
- Classical hypothesis testing: \(H_0\) of test statistic derived from assumptions about the underlying data distribution
- e.g. \(t\), \(\chi^2\) distribution
 
 
- Permutation testing: the null distribution of the test statistic is determined empirically, using permutations of the data under which \(H_0\) is guaranteed to be true
 

 
Steps of a permutation test:
- Calculate the test statistic \(T\) in the observed sample
 
- Permutation:
- sample the response values (\(Y\)) without replacement, keeping \(X\) fixed
 
- re-compute and store the test statistic \(T\)
 
- Repeat \(R\) times, storing the permuted statistics as a vector \(T_R\)
 
 
- Calculate the empirical p-value: the proportion of permuted statistics \(T_R\) whose absolute value exceeds the observed \(|T|\)
 
 
Calculating a p-value
\[
P = \frac{\sum_{r=1}^{R} I\left( |T_r| > |T| \right) + 1}{R + 1}
\]
where \(T_1, \ldots, T_R\) are the permuted statistics and \(I(\cdot)\) is the indicator function.
- Why add 1? Counting the observed statistic as one of the permutations guarantees that the estimated p-value is never exactly zero.
- Phipson B, Smyth GK: Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 2010, 9:Article39.
 
 
 
Calculating a False Discovery Rate
- count the number of discoveries (e.g. test statistics exceeding a threshold) in the unpermuted data
 
- estimate the number of false discoveries by averaging the discovery count over permutations
 
- the estimated FDR is the ratio of the two (see the sketch below)
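A hedged sketch of this procedure on simulated null data (the matrix expr and factor grp are illustrative assumptions):
 
set.seed(1)
expr = matrix(rnorm(1000 * 20), nrow = 1000)  # 1000 null features, 20 samples
grp = gl(2, 10)                               # two groups of 10
tstats = function(g) apply(expr, 1, function(x) t.test(x ~ g)$statistic)
Tobs = tstats(grp)
cutoff = 3                                    # call |T| > 3 a “discovery”
permFD = replicate(20, sum(abs(tstats(sample(grp))) > cutoff))
mean(permFD) / sum(abs(Tobs) > cutoff)        # estimated FDR
 
Because these simulated data contain no true signal, the estimated FDR should be near 1; real discoveries in excess of the permutation average push it below 1.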
 
 
Permutation test - pros and cons
- Pros:
- does not require distributional assumptions
 
- can be applied to any test statistic
 
- applicable to False Discovery Rate estimation
 
 
- Cons:
- less useful for small sample sizes
 
- in naive implementations, can get p-values of “0”
 
 
 
Example using the sleep data:
- Sleep data show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.
 
##      extra        group        ID   
##  Min.   :-1.600   1:10   1      :2  
##  1st Qu.:-0.025   2:10   2      :2  
##  Median : 0.950          3      :2  
##  Mean   : 1.540          4      :2  
##  3rd Qu.: 3.400          5      :2  
##  Max.   : 5.500          6      :2  
##                          (Other):8
 
t-test for difference in mean sleep
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33
 
Permutation test instead of t-test
Tactual = t.test(extra ~ group, data=sleep)$statistic  # observed statistic
set.seed(1)
permT = function(){
  # permute the group labels by sampling row indices without replacement
  index = sample(1:nrow(sleep), replace=FALSE)
  t.test(extra ~ group[index], data=sleep)$statistic
}
Tr = replicate(999, permT())
# empirical p-value; the +1 ensures it is never exactly zero
(sum(abs(Tr) > abs(Tactual)) + 1) / (length(Tr) + 1)
 
## [1] 0.079
 
Bootstrap
The Bootstrap
ISLR Figure 5.11: Schematic of the bootstrap
 
Uses of the Bootstrap
- The Bootstrap is a very general approach to estimating sampling uncertainty, e.g. standard errors
 
- Can be applied to a very wide range of models and statistics
 
- Robust to outliers and violations of model assumptions
 
 
Example: bootstrap in the sleep dataset
- We used a permutation test to estimate a p-value
 
- We will use the bootstrap to estimate a confidence interval
 
t.test(extra ~ group, data=sleep)
 
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33
 
Example: bootstrap in the sleep dataset
set.seed(2)
bootDiff = function(){
  # resample rows of the data frame with replacement
  boot = sleep[sample(1:nrow(sleep), replace = TRUE), ]
  mean(boot$extra[boot$group==1]) - 
    mean(boot$extra[boot$group==2])
}
bootR = replicate(1000, bootDiff())
# the 25th and 975th ranked values approximate the 2.5% and 97.5% percentiles
bootR[match(c(25, 975), rank(bootR))]
 
## [1] -3.32083333  0.02727273
Note: in practice it is better to use the boot package, e.g.:
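A minimal sketch with boot (the statistic function receives the data and a vector of resampled row indices):
 
library(boot)
meanDiff = function(data, indices){
  d = data[indices, ]  # the bootstrap resample
  mean(d$extra[d$group == 1]) - mean(d$extra[d$group == 2])
}
set.seed(2)
bootOut = boot(sleep, meanDiff, R = 1000)
boot.ci(bootOut, type = "perc")  # percentile confidence interval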
 
Example: oral carcinoma recurrence risk
- Oral carcinoma patients treated with surgery
 
- Surgeon takes “margins” of normal-looking tissue around the tumor to be safe
- number of “margins” varies for each patient
 
 
- Can an oncogenic gene signature in histologically normal margins predict recurrence?
 
 Reis PP, Waldron L, et al.: A gene signature in histologically normal surgical margins is predictive of oral carcinoma recurrence. BMC Cancer 2011, 11:437. 
 
Example: oral carcinoma recurrence risk
- Model was trained and validated using the maximum expression of each of 4 genes from any margin
 
 
Bootstrap estimation of HR for only one margin
 
Example: oral carcinoma recurrence risk
From results:
Simulating the selection of only a single margin from each patient, the 4-gene signature maintained a predictive effect in both the training and validation sets (median HR = 2.2 in the training set and 1.8 in the validation set, with 82% and 99% of bootstrapped hazard ratios greater than the no-effect value of HR = 1)
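The published analysis is not reproduced here, but a hedged sketch of the single-margin bootstrap, assuming hypothetical data frames margins (one row per margin: patient, score) and surv (one row per patient: time, recurrence):
 
library(survival)
oneMarginHR = function(){
  # pick one margin at random per patient, then refit the Cox model
  pick = do.call(rbind, lapply(split(margins, margins$patient),
                               function(d) d[sample(nrow(d), 1), ]))
  d = merge(pick, surv, by = "patient")
  exp(coef(coxph(Surv(time, recurrence) ~ score, data = d)))
}
HRs = replicate(1000, oneMarginHR())
mean(HRs > 1)  # proportion of resampled HRs above the no-effect value of 1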
 
Monte Carlo
What is a Monte Carlo simulation?
- “Resampling” is done from a known theoretical distribution
 
- Simulated data are used to estimate the probability of possible outcomes
- the most useful application for me is power estimation
 
- also used for Bayesian estimation of posterior distributions
 
 
 
How to conduct a Monte Carlo simulation
- Steps of a Monte Carlo simulation:
- Sample randomly from the simple distributions of each input
 
- Estimate the complex function of interest for the sample
 
- Repeat a large number of times and summarize the results
 
 
 
Power Calculation for a follow-up sleep study
- What sample size do we need for a future study to detect the same effect on sleep, with 90% power and \(\alpha = 0.05\)?
 
power.t.test(power=0.9, delta=(2.33-.75), 
        sd=1.9, sig.level=.05,
        type="two.sample", alternative="two.sided")
 
## 
##      Two-sample t test power calculation 
## 
##               n = 31.38141
##           delta = 1.58
##              sd = 1.9
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
 
The same calculation by Monte Carlo simulation
- Use rnorm() to draw samples
 
- Use t.test() to get a p-value
 
- Repeat many times; what % of p-values are less than 0.05?
 
 
R script
set.seed(1)
montePval = function(n){
  # simulate two groups using the means and SD observed in the sleep study
  group1 = rnorm(n, mean=.75, sd=1.9)
  group2 = rnorm(n, mean=2.33, sd=1.9)
  t.test(group1, group2)$p.value
}
# power = proportion of simulated studies with p < 0.05
sum(replicate(1000, montePval(n=32)) < 0.05) / 1000
 
## [1] 0.895
 
Summary: resampling methods
| Method | Description | Applications |
|---|---|---|
| Cross-validation | Data are randomly divided into subsets; results validated across the sub-samples | Model tuning; estimation of prediction accuracy |
| Permutation test | Samples of size N drawn at random without replacement | Hypothesis testing |
 
Summary: resampling methods
| Method | Description | Applications |
|---|---|---|
| Bootstrap | Samples of size N drawn at random with replacement | Confidence intervals, hypothesis testing |
| Monte Carlo | Data are sampled from a known distribution | Power estimation, Bayesian posterior probabilities |
 
Links
- A built html version of this lecture is available.
 
- The source R Markdown is also available from GitHub.