Chapter 17. Computing and Interpreting Effect Sizes, Confidence Intervals, and Confidence Intervals for Effect Sizes

The uptake of null hypothesis statistical significance tests (NHSST) as a vehicle to evaluate social science results did not occur until the 1950s (Hubbard & Ryan, 2000), even though many of these tests were first formulated in the early 1900s (Huberty, 1999). However, the publication of criticisms of overreliance on NHSST occurred even as many of the tests first appeared. For example, early in the 20th century Boring (1919) published an article titled, "Mathematical vs. scientific importance," in which he argued that pCALCULATED values should not be the primary focus in scholarship.
     The pCALCULATED values in NHSST are mathematical probability statements about the likelihood of the sample statistics, assuming samples came from populations exactly described by the null hypothesis, and given the sample size (Thompson, 2006a). Because NHSST evaluates the probability of sample results, and not of population values, NHSST results do not evaluate whether the sample results are replicable (Carver, 1978; Cohen, 1994).
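     The dependence of pCALCULATED on sample size can be illustrated with a minimal sketch. The example below is not taken from the chapter: it assumes a one-sample z test with a known population standard deviation, and the function name two_tailed_p and the fixed standardized effect of 0.2 are hypothetical choices made only for illustration. Under those assumptions, the same effect yields very different pCALCULATED values as the sample size changes.

```python
import math


def two_tailed_p(d, n):
    """Two-tailed p for a one-sample z test (illustrative assumption: sigma known).

    d is the standardized mean difference from the null value,
    n is the sample size.
    """
    z = d * math.sqrt(n)  # test statistic grows with the square root of n
    # Two-tailed normal tail area: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# Hold the effect constant (d = 0.2, a hypothetical value) and vary only n.
for n in (25, 100, 400):
    print(f"n = {n:3d}  p = {two_tailed_p(0.2, n):.4f}")
```

     In this sketch the effect never changes, yet the result moves from not statistically significant at n = 25 to statistically significant at n = 100 (using the conventional .05 criterion). This is one way to see why pCALCULATED cannot serve as an index of effect magnitude.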
     NHSST pCALCULATED values also cannot inform judgments about the scientific importance of results, because NHSST invokes a deductive logic, starting with the premise that the null exactly describes the population. A valid deductive argument cannot contain in its conclusions any information not present in its premises. Because NHSST does not invoke premises involving human values, NHSST's mathematical probability statements contain no information about the scientific import of the sample results. As Thompson (1993, p. 365) explained, "If the computer package did not ask you your values prior to its analysis, it could not have considered your value system in calculating p's, and so p's cannot be blithely used to infer the value of research results."
     The limitations of NHSST have been argued with increasing frequency, across both decades and a wide array of disciplines, as illustrated in the graphic offered by Anderson, Burnham and Thompson (2000). Included are disciplines such as biology (e.g., Suter, 1996; Yoccoz, 1991), economics (Ziliak & McCloskey, 2004), education (Carver, 1978; Thompson, 1996), psychology (Cohen, 1994; Schmidt, 1996), and the wildlife sciences (Johnson, 1999).
     The tenor of the commentary can be represented in Schmidt and Hunter's (1997) argument that "Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution" (p. 37). Rozeboom (1997) was equally direct:

Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students... [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism... (p. 335)

The purpose of the present chapter is to explain how to compute and interpret (a) selected effect sizes, (b) confidence intervals, and (c) confidence intervals for effect sizes. The focus here is on practical explanations and applications of these important tools. (Readers interested in actually calculating these CIs for ES should refer to Chapter 34 on R.)

Heuristic Data


     The data presented in Table 1, and available electronically below, are a random sample of real data provided by roughly 500,000 library users at over 700 libraries around the world regarding the perceived quality of academic library services (cf. Thompson, Cook & Kyrillidou, 2005, 2006).
     Included in the data are scores on the LibQUAL+® total scale and on three subscales.

Also included are scores on perceived outcome impacts of library use, and generic user-reported satisfaction with library service quality. Finally, user group (1 = undergraduate student, 2 = graduate student, 3 = faculty) and gender (0 = female, 1 = male) are reported.

The data for Chapter 17 are available for download. These data, from Table 17.1, are provided in Excel format and can be imported into most statistical software packages for the analyses indicated in the chapter.

Resources

Statistical Methods in Psychology Journals: Guidelines and Explanations

The APA Task Force on Statistical Inference (TFSI) Report as a Framework for Teaching and Evaluating Students' Understandings of Study Validity.