Chapter 02

The first author was recently privileged to join a conversation with a prominent politician regarding the legitimacy of a high school exit examination that all students [in that state] must pass to obtain a high school diploma. Indeed, this person was the principal architect of the mandated exit-level test policy. Ironically, when asked if he could pass the exam, he commented that “it really didn’t make a difference, since so many other factors contributed to success in life after high school!”

Cut scores are widely used by educators and psychologists in a variety of contexts. Technically, a cut score is a prescribed value on a continuum of values (given the instrument employed) that serves as a “threshold” demarcating the presence or absence of a particular state or condition (e.g., proficient or non-proficient; depressed or not depressed). Cut scores are usually associated with criterion-referenced tests (CRTs) when used to indicate a minimal level of proficiency or mastery; however, in some instances cut scores are invoked when using norm-referenced tests (NRTs). For example, a psychologist may use a normative threshold on the MMPI scales to infer potential psychopathology, or a school diagnostician may use IQ and achievement measures to determine the presence of a learning disability based on a one-standard-deviation discrepancy (Anastasi, 1989). Similarly, one of the criteria for determining the presence of mental retardation is a full-scale IQ below 70 (AAMR, 2005, March). Universities often employ cut scores on NRTs as a partial determinant of who is accepted into their schools (e.g., SAT; GRE). Undoubtedly, however, the greatest use of cut scores lies in determining mastery or proficiency in a content domain by virtue of CRT performance.


After determining appropriate behavioral learning outcomes in a given content area (content standards), a CRT is constructed of items that link directly back to the content criteria [hence the term criterion-referenced (Crocker & Algina, 1986)]. Performance standards are then set, demarcating levels of performance based on what a proficient examinee should be able to do. As an example, the No Child Left Behind (NCLB) legislation (2002) demands performance expectations based on cut scores: students, schools, and school districts that fail to make the cut are “left behind” and are subject to considerable punitive actions with respect to federal school funding. The cut score on the examination represents the operationalization of performance standards based on the content standards. Sometimes the cut score indicates a dichotomous performance level (e.g., pass–fail), while at other times there may be multiple performance levels (e.g., below proficient–proficient–advanced). In certain professions, such as medicine, cut scores are used to decide who is eligible for licensure in the field. In K-12 settings, student CRT performance is used not only to gauge mastery of learning outcomes, but also to determine whether a student should be retained in grade or denied a high school diploma. What is often overlooked by both educators and public stakeholders is that the validity of decisions based on CRT performance depends on the quality of the method used for establishing performance standards and on the technical adequacy of the tests themselves.
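To make the idea of operationalization concrete, the following minimal Python sketch shows how one or more cut scores partition a score continuum into performance levels. The function name, the cut score values, and the rule that a score exactly equal to a cut score falls in the higher level are all assumptions made for illustration; they are not drawn from any actual testing program.

```python
from bisect import bisect_right


def classify(score, cut_scores, labels):
    """Return the performance level that a score falls into.

    cut_scores: thresholds in ascending order.
    labels: level names from lowest to highest; there must be
            exactly one more label than there are cut scores.
    Assumption: a score equal to a cut score is placed in the
    higher level (i.e., the cut score is the minimum passing score).
    """
    if list(cut_scores) != sorted(cut_scores):
        raise ValueError("cut scores must be in ascending order")
    if len(labels) != len(cut_scores) + 1:
        raise ValueError("need exactly one more label than cut scores")
    # bisect_right counts how many cut scores are <= score,
    # which is the index of the level the score belongs to.
    return labels[bisect_right(cut_scores, score)]


# Hypothetical dichotomous decision with a single cut score of 70.
pass_fail = classify(70, [70], ["fail", "pass"])

# Hypothetical three-level scheme with cut scores of 60 and 85.
levels = ["below proficient", "proficient", "advanced"]
three_level = [classify(s, [60, 85], levels) for s in (59, 60, 84, 85)]
```

Note that in this sketch a one-point difference (59 versus 60) changes the classification, which is precisely why the quality of the standard-setting method and the measurement error of the test matter for the validity of the resulting decisions.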


Our purpose in this chapter is to outline some of the technical and practical concerns with CRTs related to establishing performance standards and cut scores. We start with the premise that a meaningful, legitimate, and well-defined set of learning goals and objectives has been established. Moreover, we assume that a comprehensive and technically sound pool of items has been constructed to measure these outcomes. Having taken this leap of faith (see Glass, 1978), we focus on two central issues: (a) the relative merit of methods for establishing cut scores to define performance levels, and (b) methods of compensating for errors in the standard-setting process and errors in tests. These methods are briefly elaborated, potential disadvantages of each are discussed, and suggestions for proper practice are provided.