Chapter 27: Enhancing Accuracy in Research Using Regression Mixture Analysis

Cody S. Ding

Conventional regression analysis is widely used in social science research. Such an analysis implicitly assumes that a single set of regression parameter estimates captures the population characteristics represented in the sample. In some situations, however, this assumption may not be realistic, and the sample may contain several subpopulations, such as high and low math achievers. In these cases, conventional regression models may provide biased estimates because the parameter estimates are constrained to be the same across subpopulations. This chapter advocates the application of regression mixture models, also known as latent class regression analysis, in educational and social science research. Regression mixture analysis is more flexible than conventional regression analysis in that latent classes in the data can be identified and regression parameter estimates can differ across latent classes, which can enhance prediction accuracy. An illustration of regression mixture analysis is provided using real data. The strengths and limitations of regression mixture models are discussed in the context of educational research.

Ordinary least squares (OLS) regression analyses are common in educational and social science research. Typically, regression analysis is used to investigate the relationship between a dependent variable (continuous, or categorical when logistic regression is used) and a set of independent variables based on a sample from a particular population. Interest often centers on the effect of each independent variable on the dependent variable, and this effect is interpreted as the average effect across all subjects in the sample. For example, if the math achievement scores of 500 students are regressed on a measure of their motivation, the slope (the regression coefficient) quantifies the average change in math achievement across all 500 students for a one-unit change in motivation. The problem is that these 500 students are treated as one homogeneous group with respect to the influence of motivation on math achievement, and the implicit assumption is that they come from the same population with similar characteristics. What if the relationship between these two variables differs across groups of students in some way that is not explicitly modeled? These (often interesting) differences would be masked.
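
To make the masking problem concrete, the following is a minimal simulation sketch in Python. It is not part of the ECLS analysis: the two unobserved groups, their slopes (0.5 and 3.0), the common intercept, the error variance, and all variable names are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 students in two unobserved groups of 250 each.
n_per_group = 250
slope_a, slope_b = 0.5, 3.0  # assumed group-specific effects of motivation

motivation = rng.uniform(1.0, 4.0, size=2 * n_per_group)
group = np.repeat([0, 1], n_per_group)  # latent membership, unknown in practice
slopes = np.where(group == 0, slope_a, slope_b)
achievement = 10.0 + slopes * motivation + rng.normal(0.0, 1.0, size=motivation.size)


def ols_slope(x, y):
    """Return the OLS slope of y regressed on x (intercept included)."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


# One regression for everyone versus one regression per (normally hidden) group.
pooled = ols_slope(motivation, achievement)
slope_group_a = ols_slope(motivation[group == 0], achievement[group == 0])
slope_group_b = ols_slope(motivation[group == 1], achievement[group == 1])

print(f"pooled slope:  {pooled:.2f}")
print(f"group A slope: {slope_group_a:.2f}")
print(f"group B slope: {slope_group_b:.2f}")
```

Under these assumptions the pooled slope should fall between the two group-specific slopes, so it describes neither group well. In real data the group labels are not observed, which is the situation regression mixture analysis is designed to address.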

This chapter describes the use of regression mixture models as a tool for studying the relationship between a dependent variable and a set of independent variables while taking unobserved population heterogeneity into account, which can enhance prediction accuracy.
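
For orientation, a regression mixture model with K latent classes is commonly written as a finite mixture of normal linear regressions. The notation below is a generic formulation assumed here for illustration; the symbols are not taken from the ECLS documentation.

```latex
f(y_i \mid \mathbf{x}_i)
  = \sum_{k=1}^{K} \pi_k \,
    \phi\!\left(y_i ;\ \beta_{0k} + \mathbf{x}_i^{\top}\boldsymbol{\beta}_k,\ \sigma_k^{2}\right),
\qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

Here y_i is the dependent variable for individual i, x_i is the vector of independent variables, pi_k is the proportion of the population in latent class k, phi is the normal density, and beta_0k, beta_k, and sigma_k^2 are the intercept, slopes, and residual variance specific to class k. With K = 1 the expression reduces to the conventional single-population regression model.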

 

Data

The data used in this illustrative analysis came from the Early Childhood Longitudinal Study (ECLS), an ongoing study by the U.S. Department of Education, National Center for Education Statistics, that focuses on children’s early school experiences beginning with kindergarten (Tourangeau, Nord, Lê, Pollack, & Atkins-Burnett, 2006). The study follows a nationally representative sample of children from kindergarten through fifth grade. The sample included children from various racial and language backgrounds. Sampling for the ECLS was based on a dual-frame, multistage sampling design with 100 primary sampling units (PSUs). For simplicity, only the data collected from fifth graders during 2004 were used in this chapter. The analysis sample comprised 1,342 children (650 males and 692 females). Of these children, 797 were White, 126 were Black, 230 were Hispanic, 141 were Asian, and 48 were multiracial.
            Measures. In the present analysis, four measures were used as independent variables. They were:
            Self-Description Questionnaire-Math Self-Concept (Marsh, 1990). This measure assesses how children think and feel about themselves in terms of math competence. The scale includes eight items on math grades, the difficulty of math work, and interest in and enjoyment of math, with item scores ranging from 1 to 4. The analysis used each participant’s average score.
            Academic Rating Scale-Math. This is the teacher’s rating of the child’s academic performance in math. Teachers were asked to rate each child’s proficiency in the following areas: number concepts, measurement, operations, geometry, math strategies, and beginning algebraic thinking, with item scores ranging from 1 to 5. The analysis used each participant’s average score.
            Social Rating Scale-Approach to Learning. This is the teacher’s judgment of the child’s social competence. The approach to learning scale measures behaviors that affect the ease with which children can benefit from the learning environment. It includes six items that rate the child’s attentiveness, task persistence, eagerness to learn, learning independence, flexibility, organization, and following classroom rules, with item scores ranging from 1 to 4. The analysis used each participant’s average score.
            Social Rating Scale-Self-Control. This scale has four items that rate the child’s ability to control behavior by respecting the property rights of others, controlling temper, accepting peer ideas for group activities, and responding appropriately to peer pressure, with item scores ranging from 1 to 4. The analysis used each participant’s average score.
            In all of the above measures, scores were coded positively, with higher scores indicating higher self-concept and higher teacher ratings of academic and social competence. The reported reliabilities for these independent variables ranged from .79 to .92 (Tourangeau et al., 2006).
            The dependent variable was a composite math proficiency probability score, computed as the average across nine math skill levels: count/number, relative size, ordinality/sequence, add/subtract, multiply/divide, place value, rate and measurement, fractions, and area/volume. The probability scores ranged from 0.00 to 1.00, with a larger score indicating higher overall achievement across these math skill levels.
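
As a small worked example of how such a composite could be formed, the sketch below averages nine proficiency probabilities for a single child. The probability values are hypothetical and are not drawn from the ECLS data; only the nine skill-level labels come from the description above.

```python
from statistics import mean

# Hypothetical proficiency probabilities for one child (not ECLS data).
skill_probabilities = {
    "count/number": 1.00,
    "relative size": 1.00,
    "ordinality/sequence": 0.99,
    "add/subtract": 0.97,
    "multiply/divide": 0.90,
    "place value": 0.72,
    "rate and measurement": 0.45,
    "fractions": 0.20,
    "area/volume": 0.05,
}

# Composite score: the simple average across the nine skill levels.
composite = mean(skill_probabilities.values())
print(f"composite math proficiency probability: {composite:.2f}")
```

Because each component lies between 0.00 and 1.00, the composite is bounded by the same range.
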
            In addition, children’s gender and race were included as covariates to increase the accuracy with which individuals were classified into latent classes. In this chapter, children’s race was represented in five categories: White, Black, Hispanic, Asian (which includes Pacific Islanders and American Indians), and multiracial.
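
One common way for covariates such as gender and race to inform class membership is through a multinomial logit model for the class probabilities. The expression below is a generic sketch of that approach, not a statement of the exact specification used in this analysis; the symbols are assumptions introduced for illustration.

```latex
P(c_i = k \mid \mathbf{z}_i)
  = \frac{\exp\!\left(\gamma_{0k} + \mathbf{z}_i^{\top}\boldsymbol{\gamma}_k\right)}
         {\sum_{j=1}^{K} \exp\!\left(\gamma_{0j} + \mathbf{z}_i^{\top}\boldsymbol{\gamma}_j\right)},
\qquad \gamma_{0K} = 0, \quad \boldsymbol{\gamma}_K = \mathbf{0}
```

Here c_i is the latent class of individual i, z_i is the vector of covariates (for example, indicator variables for gender and race categories), K is the number of latent classes, and the last class serves as the reference category.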