Introduction
The concept of interrater reliability permeates many facets of modern society. For example, court cases tried by jury require unanimous agreement from jurors regarding the verdict; life-threatening medical diagnoses often require a second or third opinion from health care professionals; student essays written in the context of high-stakes standardized testing receive points based upon the judgment of multiple readers; and Olympic competitions, such as figure skating, award medals to participants based upon quantitative ratings of performance provided by an international panel of judges (Bond & Fox, 2001).
Any time multiple judges are used to determine important outcomes, certain technical and procedural questions emerge. Some of the more common questions are: How many raters do we need in order to be confident in our results? What is the minimum level of agreement that our raters should achieve? And is it necessary for raters to agree exactly, or is it acceptable for them to differ from each other so long as the difference is systematic and can therefore be corrected?
Key Questions to Ask Before Conducting an Interrater Reliability Study
If you are at the point in your research where you are considering conducting an interrater reliability study, then there are three important questions worth considering:
Example from the chapter:
Computing Common Consensus Estimates of Interrater Reliability
Let us now turn to a practical example of how to calculate each of these coefficients. As an example dataset, we will draw from Stemler, Grigorenko, Jarvin, & Sternberg’s (2006) study, in which they developed augmented versions of the Advanced Placement Psychology Examination. Participants were required to complete a number of essay items, which were subsequently scored by different sets of raters. Essay question 1, part d, asked participants to give advice to a friend who is having trouble sleeping, based on what they know about various theories of sleep. The item was scored using a 5-point scoring rubric. For this particular item, 75 participants received scores from 2 independent raters.
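As a rough illustration of what a consensus estimate looks like in practice, the sketch below computes two widely used ones, simple percent agreement and Cohen's kappa, for a pair of raters. It is a minimal sketch under stated assumptions, not the chapter's own procedure: the function names are hypothetical, the ten scores are invented toy values on an assumed 0 to 4 scale, and none of the numbers come from the Stemler et al. (2006) dataset.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which the two raters gave the identical score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of cases.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)

    # Chance agreement: for each score category, multiply the proportion of
    # cases each rater assigned to it, then sum over categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if p_expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("Kappa is undefined when expected agreement equals 1.")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy scores (NOT the study data): ten essays, two raters,
# scored on an assumed 0-4 rubric.
toy_rater_a = [0, 1, 2, 3, 4, 2, 2, 1, 3, 4]
toy_rater_b = [0, 1, 2, 3, 4, 2, 1, 1, 3, 3]

if __name__ == "__main__":
    print("Percent agreement:", percent_agreement(toy_rater_a, toy_rater_b))
    print("Cohen's kappa:", cohens_kappa(toy_rater_a, toy_rater_b))
```

On the toy scores the raters agree exactly on 8 of the 10 essays. Kappa comes out lower than the raw agreement rate because part of that agreement would be expected by chance given how often each rater used each score category.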
Details on how to calculate various indices of interrater reliability are in the chapter.