Chapter 20: Playing with Prediction Equations and Cross-Validation

Imagine you have developed a great new intervention to, say, help students with reading difficulties or to help obese people lose weight. You do some research and find out it works well, but it’s extremely expensive, and so only those who are most at risk for adverse health outcomes stemming from their obesity can have access to it. How are you going to decide who should get the intervention? Or imagine you are in charge of admissions for your college or university. You know that certain things predict success in your school, at least partly. How do you decide who gets admitted?


Most research we hear about is explanatory, meaning its goal is to understand a phenomenon. Does a particular variable predict success in your college? What things can help ameliorate the negative effects of obesity? How can we help students having difficulty reading? Almost every published study ends with a section in which the authors tell us why we should care about their particular findings. In essence, these summary statements (e.g., people with height-to-waist ratios of less than 1.5 might benefit most from this intervention, students scoring above the 80th percentile on this particular measure are three times as likely to succeed in our college as students scoring lower) are predictions of efficacy in the future.


But how do we know that these results will generalize to your patients, your students, your applicants? You can try replicating the results in another sample (see also Chapter 7 on the p(rep) statistic), and you can keep replicating them ad nauseam; as long as you keep getting the same results, each replication will give you more confidence.
This chapter seeks to present a process of validation, as well as an example of how best to do this, so that scientists are not left attempting to do prediction in an ad hoc manner. Authors have been writing about this process for decades, yet it is rarely covered in depth in statistics textbooks.

 

Play with prediction equations

Download the data set I used for the chapter (based on NELS '88 data from the National Center for Education Statistics).

In this dataset are several variables:

Step #1: Create a prediction equation based on the entire data set. This is your reference, or "population" prediction equation.

Step #2: Draw subsamples, randomly or purposefully, fit the same regression in each one, and compare the result to the reference ("population") equation. Use both very small samples (N=20-50) and large samples (N=400-1000). Draw several of each size and see how much the regression coefficients vary around the reference equation.

Step #3: Practice cross-validation and double cross-validation, and calculate shrinkage (these procedures are described in the chapter). Which show more shrinkage, the smaller samples or the larger ones?
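
If you would like to try the three steps in code before opening the real data, the following is a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic (three made-up predictors and one outcome, not the NELS '88 variables), the helper names (fit_ols, predict, r_squared, one_way_shrinkage, coef_spread, mean_shrinkage) are hypothetical, and shrinkage is taken here to be the derivation-sample R-squared minus the squared correlation between observed and predicted scores in the other sample. Check that definition against the chapter before relying on it.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for the downloaded data set
# (three invented predictors, one outcome). Replace with the real data.
N = 5000
X = rng.normal(size=(N, 3))
y = 1.0 + X @ np.array([0.5, 0.3, 0.0]) + rng.normal(size=N)

def fit_ols(X, y):
    """Ordinary least squares; returns [intercept, slope_1, slope_2, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply a prediction equation to new predictor values."""
    return coef[0] + X @ coef[1:]

def r_squared(y, y_hat):
    """Squared correlation between observed and predicted scores."""
    return np.corrcoef(y, y_hat)[0, 1] ** 2

def one_way_shrinkage(derive, validate):
    """Derive the equation in one sample, then apply it to the other."""
    coef = fit_ols(X[derive], y[derive])
    r2_derive = r_squared(y[derive], predict(coef, X[derive]))
    r2_cross = r_squared(y[validate], predict(coef, X[validate]))
    return r2_derive - r2_cross

# Step 1: reference ("population") equation from the entire data set.
reference = fit_ols(X, y)

# Step 2: spread of the coefficients across repeated random subsamples.
def coef_spread(n, reps=200):
    coefs = []
    for _ in range(reps):
        idx = rng.choice(N, size=n, replace=False)
        coefs.append(fit_ols(X[idx], y[idx]))
    return np.std(coefs, axis=0)

# Step 3: double cross-validation. Draw two samples of size n, derive an
# equation in each, validate it in the other, and average the shrinkage.
def mean_shrinkage(n, reps=200):
    values = []
    for _ in range(reps):
        idx = rng.choice(N, size=2 * n, replace=False)
        a, b = idx[:n], idx[n:]
        values.append(one_way_shrinkage(a, b))
        values.append(one_way_shrinkage(b, a))
    return float(np.mean(values))

print("reference equation:", np.round(reference, 3))
for n in (30, 500):
    print("N =", n,
          "| SD of coefficients:", np.round(coef_spread(n), 3),
          "| mean shrinkage:", round(mean_shrinkage(n), 3))
```

To use it with the downloaded data, replace X and y with your chosen predictors and outcome and set N to the number of cases. The sample sizes 30 and 500 are simply picked from the ranges suggested in Step #2; try several others and compare both the coefficient spread and the shrinkage.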