Chapter 19. Resampling

Types of resampling
There are at least four major types of resampling. Although today they are unified under a common theme, it is important to note that these four techniques were developed by different people at different periods of time for different purposes.

1. Permutation test: Also known as the randomization exact test, the permutation test is a type of statistical significance test that bases its inference on an empirical distribution obtained by computing the test statistic over all possible permutations of the observed data. This test was developed by R. A. Fisher (1935/1960), the founder of classical statistical testing. Later, it was promoted by Pitman (1937, 1938), Dwass (1957), and Chung and Fraser (1958), and refined by Freeman and Halton (1951). However, Fisher eventually lost interest in the permutation method because there were no computers in his day to automate such a laborious method. The original goal of developing the randomization exact test was to explore an alternative to theoretical distributions as the foundation for probabilistic inferences. Fisher recognized the usefulness of an empirically generated sampling distribution, but he was forced to rely on theoretical sampling distributions due to the limited computational resources of the time. Hence, the exact test was conceptualized as a forward-looking methodology (Box, 1978).

2. Cross-validation: In cross-validation, a sample is randomly divided into two or more subsets and test results are validated by comparing across sub-samples. Simple cross-validation was proposed by Kurtz (1948) as a remedy for the Rorschach test, a form of personality test that was criticized by psychometricians for its lack of data normality. Based on Kurtz's simple cross-validation, Mosier (1951) developed double cross-validation. Later, Stone (1974) and Geisser (1975) promoted the idea of cross-validation as a tool to verify statistical predictions. Finally, double cross-validation was extended to multicross-validation by Krus and Fuller (1982). The major goal of cross-validation is to avoid overfitting, which is a common problem when modelers try to account for every structure in one data set. As a remedy, cross-validation double-checks whether the alleged fitness is too good to be true (Larose, 2005). In short, the objective of cross-validation is to verify replicability and stability of results.

3. Jackknife: The jackknife is a step beyond cross-validation, in which the same test is performed repeatedly, leaving one subject out each time. In this technique, the deleted subject is added back into the sample and then another one is chosen for removal. This is also known as the Quenouille-Tukey jackknife, because the tool was invented by Maurice Quenouille (1949) and later developed by John W. Tukey (1958). As the father of exploratory data analysis (EDA), Tukey attempted to use the jackknife to explore how a model is influenced by subsets of observations when outliers are present. Mosteller and Tukey (1977) stated that the jackknife is an all-purpose statistical tool, used as a substitute for specialized tools that may not be available, just as the Boy Scout's trusty tool serves so variedly. The jackknife was developed to assess the stability and bias of estimates rather than to perform hypothesis testing, though the variance estimate obtained through this method can easily be used to define confidence intervals or to do standard hypothesis testing (Rodgers, 1999).

4. Bootstrap: “Bootstrap” means that one available sample gives rise to many others by repeated sampling (a concept reminiscent of pulling yourself up by your own bootstraps). This technique was invented by Efron (1979; 1981; 1982) and further developed by Efron and Tibshirani (1993). The bootstrap procedure was originally developed as a means of estimating “statistical accuracy.” However, the objective of “statistical accuracy” is often misunderstood as obtaining precise parameter estimates or the true parameters (how right it is); rather, the goal is more about examining bias and variability (how wrong it could be).

Simon (2001) argued that cross-validation and jackknife do not fit the definition of resampling. According to Simon, resampling, as the name implies, must involve reuse of samples. However, cross-validation is simply a one-time sample splitting and thus no data are reused. Similarly, jackknife reduces the sample size in each re-computation and never uses the data in their totality for any single calculation. Since systematic reuse of the available data is the central theme of resampling, in Simon's view cross-validation and jackknife do not qualify as resampling methods.
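
The leave-one-out recomputation described above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function name jackknife and the six data values are hypothetical, and the bias and standard-error formulas are the textbook jackknife formulas rather than anything taken from this chapter. Note that each recomputation uses n - 1 observations, which is the feature Simon objects to.

```python
import math
from statistics import mean, stdev

def jackknife(data, statistic):
    """Leave-one-out jackknife for a statistic computed on a list of numbers.

    Returns the leave-one-out estimates, the jackknife bias estimate,
    and the jackknife standard error.
    """
    n = len(data)
    full_estimate = statistic(data)
    # Recompute the statistic n times, each time leaving one observation out.
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = mean(leave_one_out)
    bias = (n - 1) * (loo_mean - full_estimate)
    se = math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out))
    return leave_one_out, bias, se

# Hypothetical data, made up purely for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
loo, bias, se = jackknife(data, mean)

print("leave-one-out means:", loo)
print("jackknife bias estimate:", bias)
print("jackknife standard error:", se)
# For the sample mean, the jackknife SE coincides with the usual s / sqrt(n).
print("classical standard error:", stdev(data) / math.sqrt(len(data)))
```

For the sample mean the jackknife bias estimate is zero and the standard error matches the classical formula, which makes the mean a convenient sanity check before applying the same function to a less tractable statistic.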

Nevertheless, this does not imply that cross-validation and jackknife have no merits. As a matter of fact, cross-validation is still a common practice in factor analysis. To be specific, a factor modeler usually divides the data set into two subsets. Exploratory factor analysis is conducted with the first subsample to propose a factor structure, whereas confirmatory factor analysis is employed to verify whether the factor pattern holds in the second subsample (Mulaik, 1987). By the same token, cross-validation is still popular in the context of model building, including time series models, regression models, and discrimination models (Chernick, 1999). Also, it is a common practice for data miners to split the data into a training set, in which a provisional model is proposed, and a validation set, in which the fit is evaluated (Han & Kamber, 2006; Larose, 2005). In some applications, jackknife and bootstrapping are fused together. For example, in the resample library of Splus (Insightful, 2004), there is a function named “Jack after boot.” As the name implies, Jackknife is used first to subset the data, and then bootstrapping is employed to resample from the subset.

In the following sections, permutation tests and bootstrapping will be illustrated with concrete examples. The beauty of resampling comes from its conceptual clarity and procedural simplicity. Henceforth, readers will not encounter equation-dense pages in this chapter.

 

 

Use the following data sets, provided by the chapter author, to reinforce your understanding of the chapter by working through the examples.

Lady Tasting Tea (click here to download data)

Lady tasting tea
In 1920, R. A. Fisher shared a story with his colleagues about how he resolved a statistical question in an innovative way. Once, Fisher met a lady who insisted that her tongue was sensitive enough to detect the subtle difference between a cup of tea in which the milk was poured first and a cup in which the milk was added later. Fisher was skeptical, so he presented eight cups of tea to this lady. Four of these eight cups were “milk-first” and the other four were “tea-first.” All cups were arranged in a random order, yet the lady correctly identified six out of the eight cups (Salsburg, 2001). The test results are summarized in Table 1 (in the book).

Did the woman really have a super-sensitive tongue? This question can be reformulated as a statistical problem with the following two hypotheses:
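
If the lady were merely guessing, every way of picking four of the eight cups as “milk-first” would be equally likely, so the evidence can be weighed by simple enumeration. The Python sketch below does this under two assumptions that are not spelled out on this page: that she knew there were four cups of each kind and therefore labeled exactly four as milk-first, and that six correct out of eight consequently means three of the four milk-first cups were identified. The function name tea_tasting_exact_test is hypothetical.

```python
from itertools import combinations

def tea_tasting_exact_test(n_cups=8, n_milk_first=4, observed_hits=3):
    """Enumerate every possible choice of 'milk-first' cups.

    Counts how many choices identify at least `observed_hits` of the
    truly milk-first cups, assuming all choices are equally likely
    under pure guessing.
    """
    # Without loss of generality, cups 0..3 are the true milk-first cups.
    true_milk_first = set(range(n_milk_first))
    as_extreme = 0
    total = 0
    for guess in combinations(range(n_cups), n_milk_first):
        total += 1
        hits = len(true_milk_first & set(guess))
        if hits >= observed_hits:
            as_extreme += 1
    return as_extreme, total, as_extreme / total

as_extreme, total, p_value = tea_tasting_exact_test()
print(f"{as_extreme} of {total} possible selections do at least as well")
print(f"one-sided exact p-value: {p_value:.3f}")
```

Under these assumptions, 17 of the 70 possible selections do at least as well as the lady did, a probability of about 0.243, so six correct cups would not be unusual for someone guessing.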

 

Law School (click here to download data)

Bootstrapping
In classical procedures, parameter estimation requires certain parametric assumptions, but bootstrapping replaces the unknown population distribution with known empirical distributions, which are also called bootstrap distributions. Bootstrap methods attracted growing attention after Diaconis and Efron (1983) published an essay in Scientific American explaining bootstrapping in layman’s terms. As discussed in this chapter’s introduction, the beauty of resampling is its conceptual clarity. Resampling is highly accessible to many researchers whose primary concern is the content area of psychology or biology rather than mathematics. More importantly, just as many permutation tests are built upon Fisher’s counterfactual reasoning, basic bootstrapping principles pave the way to advanced bootstrapping.


In the following example, let us revisit the simple, yet intellectually powerful example depicted by Diaconis and Efron in Scientific American. Please note that the bootstrap method will be illustrated here with Splus (Insightful, 2004), which was not available at the time of Diaconis and Efron’s writing. Hence, the demonstration below is slightly different from that in the Scientific American essay.


In their article, Diaconis and Efron asked the readers to consider a group of 15 law schools, for each of which the academic achievement of the entering class is measured in terms of the average undergraduate GPA and the average score on the Law School Admission Test (LSAT) (Table 4, in the book). In this small dataset, the correlation between GPA and LSAT score is .776.

Given Table 4, how confidently can the researcher assert that there is a positive correlation between GPA and LSAT in the law student population? Diaconis and Efron proposed the following strategy:

  1. The original sample is duplicated one billion times. As a result, we have 15 billion observations instead of 15. This expanded sample is treated as a virtual population or a proxy population.
  2. Samples are drawn from this virtual population to verify the estimators. Unlike permutation methods in which observations are resampled without replacement, the bootstrap employs resampling with replacement.
  3. Bias is checked by comparing the statistic of the original sample against that of the empirical distribution. The bias estimated by the bootstrap method is the mean of the empirical distribution minus the statistic for the original sample.
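
The three steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Splus procedure: drawing with replacement from the observed pairs stands in for sampling from the virtual population, the function names pearson_r and bootstrap_correlations are hypothetical, and the 15 pairs below are made-up placeholder values, NOT the law school data in Table 4.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation; returns None if either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlations(pairs, n_resamples=1000, seed=0):
    """Resample whole (x, y) pairs with replacement and recompute r each time."""
    rng = random.Random(seed)
    n = len(pairs)
    replicates = []
    while len(replicates) < n_resamples:
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        r = pearson_r([p[0] for p in resample], [p[1] for p in resample])
        if r is not None:  # skip degenerate resamples with no variation
            replicates.append(r)
    return replicates

# Placeholder (test score, GPA) pairs, invented for illustration only.
pairs = [(540, 2.7), (550, 2.9), (560, 2.8), (570, 3.0), (575, 2.9),
         (580, 3.1), (590, 3.0), (600, 3.2), (610, 3.1), (620, 3.3),
         (630, 3.2), (640, 3.4), (650, 3.3), (660, 3.5), (670, 3.4)]

observed = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
boot = bootstrap_correlations(pairs, n_resamples=1000, seed=0)

# Step 3: bias = mean of the empirical distribution minus the original statistic.
bias = sum(boot) / len(boot) - observed
print(f"observed r = {observed:.3f}, bootstrap bias estimate = {bias:+.4f}")
```

Pairs are resampled as units so that each school's two scores stay together; resampling the two columns independently would destroy the very correlation being studied.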

It is highly advisable for readers to walk through the process using the full version or the trial version of Splus, as explained below:
1. Just as RSE is an add-in to Excel, the Resample Library is an external add-in module for Splus. After downloading the Resample Library and opening Splus, select Load Library from File and choose “resample.”
2. Enter the law school data into a dataset.
3. Select Correlation/resample from Statistics/Data Summary.
4. Go to the tab Bootstrap. Check the boxes Perform Bootstrap, Both Distribution and QQ, Percentiles, BCa Confidence Interval. Set Number of Resamples to 1000. Then click OK.
During the bootstrapping process, the computer randomly selects 15 pairs of scores 1,000 times. At the end, these 1,000 resamples generate an empirical distribution as shown in Figure 5. As you would expect, sometimes the resample yields a low correlation coefficient. In some extreme cases the correlation is close to zero; most of the time, however, it returns a high correlation. The mean of these correlation coefficients is depicted as a dotted line, which almost overlaps the original observed correlation coefficient, 0.776, shown as a solid line.
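
Once the empirical distribution is in hand, a percentile interval can be read directly off the sorted resampled statistics. The Python sketch below shows one simple way to do so; the function name percentile_interval and its conservative index rule are assumptions made for illustration, it does not reproduce the BCa interval offered by Splus, and the list of 1,000 values is a stand-in for real bootstrap output.

```python
import math

def percentile_interval(replicates, confidence=0.95):
    """Simple percentile interval from a list of bootstrap replicates.

    Sorts the replicates and cuts off an equal share in each tail,
    rounding outward so the interval is slightly conservative.
    """
    ordered = sorted(replicates)
    n = len(ordered)
    tail = (1 - confidence) / 2
    lower_index = math.floor(tail * (n - 1))
    upper_index = math.ceil((1 - tail) * (n - 1))
    return ordered[lower_index], ordered[upper_index]

# Stand-in replicates: the integers 1..1000, so the cut points are easy to check.
stand_in = list(range(1, 1001))
lower, upper = percentile_interval(stand_in, confidence=0.95)
print(f"95% percentile interval: ({lower}, {upper})")
```

With real bootstrap correlations in place of the stand-in list, an interval that stays well above zero would support the claim of a positive correlation.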