Chapter 24. Logistic Regression in the Social Sciences

Jason E. King, Baylor School of Medicine.

Regression procedures aid in understanding and testing complex relationships among variables and in forming predictive equations. Linear modeling techniques such as ordinary least squares (OLS) regression are appropriate when the predictor (independent) variables are continuously or categorically scaled and the criterion (response, dependent) variable is continuously scaled. Discriminant analysis allows prediction of a categorical criterion when all predictors are continuous and strong assumptions are met. However, a more intuitively appealing approach is to directly model the nonlinear relationship using a nonlinear methodology. In fact, discriminant analysis “is in the process of being replaced in most modern practice by logistic regression” (Darlington, 1990, p. 458). Logistic regression allows categorically- and continuously-scaled variables to predict any categorically-scaled criterion. Applications include predicting or explaining pass/fail in education, survival/non-survival in medicine, or presence/absence of a clinical disorder in psychology.
            Though slow to catch on initially (White, Long, & Tansey, 1997), logistic regression has seen tremendous growth in use within the social sciences over the last two decades. Nevertheless, many social scientists remain unfamiliar with its workings. One reason is the complexity of the procedure. Textbooks such as those by Hosmer and Lemeshow (2000) and Kleinbaum and Klein (2002) are valuable resources, but they are written at an intermediate level of difficulty. There is also a general lack of agreement on terminology.
            The aim of this chapter is to describe binary logistic regression at an introductory level, recognizing that some important complexities and nuances will be neglected. Significant attention is given to odds ratios, effect size measures, and variable selection procedures. The chapter by Drs. Anderson and Rutkowski offers a more advanced treatment of logistic regression, including prediction of a polytomous criterion.

Heuristic Dataset

            Illustrations are made using the Employee dataset, which comes bundled with recent versions of the Statistical Package for the Social Sciences (SPSS) and can also be freely downloaded at <http://support.spss.com>. The database includes measures of:

  1. Employee Education Level (in years),
  2. Sex (recoded as 0 = male; 1 = female),
  3. Minority Status (0 = no; 1 = yes),
  4. Current Salary,
  5. Previous Experience (in months),
  6. Case ID, and
  7. Job Category (custodial, clerical, managerial). The custodial and clerical job categories were combined to form a dichotomous criterion and recoded as custodial/clerical = 0 and managerial = 1. Deleting 24 cases with missing values on the Experience variable left 450 usable observations, which were the basis of all analyses.
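
The recoding and case deletion described in item 7 can be sketched in a few lines of Python. This is a minimal illustration only: the chapter's analyses were run in SPSS, and the records, the field names (`jobcat`, `exper`, `manager`), and the helper `prepare` are hypothetical stand-ins rather than rows or variable names taken from the Employee dataset.

```python
# Hypothetical records for illustration; these are NOT rows from the Employee dataset.
records = [
    {"id": 1, "jobcat": "custodial", "exper": 144},
    {"id": 2, "jobcat": "clerical", "exper": None},   # missing Experience
    {"id": 3, "jobcat": "managerial", "exper": 36},
    {"id": 4, "jobcat": "clerical", "exper": 0},      # zero months is a valid value
]

# Custodial and clerical are combined (0); managerial is coded 1.
JOBCAT_CODES = {"custodial": 0, "clerical": 0, "managerial": 1}


def prepare(rows):
    """Drop cases with missing Experience, then add the 0/1 criterion."""
    usable = [row for row in rows if row["exper"] is not None]
    return [{**row, "manager": JOBCAT_CODES[row["jobcat"]]} for row in usable]


cleaned = prepare(records)
for row in cleaned:
    print(row["id"], row["jobcat"], row["manager"])
```

Note that a case with zero months of experience is retained; only cases whose Experience value is actually missing are dropped.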

Appendix A

Illustrative Models

Model    Predictor Variable(s)
-----    ---------------------
A        Sex, Minority, Education, Experience
B        Salary
C        Sex
D        Experience
E        Sex, Minority, Education, Experience, Salary


 

Appendix B

An Overview of Exponents and Logarithms

            The exponent of a number is equal to the constant e raised to that number, where e equals approximately 2.718. The exponent of 3 is equivalently written as exp(3) = e^3 ≈ 2.718^3 = 2.718 × 2.718 × 2.718 ≈ 20.09. A related mathematical function is the natural logarithm, or just natural log, written as ln or log_e. Just as subtraction undoes addition, the natural log undoes an exponent, and vice versa. The natural log of 20.09 is written as ln(20.09) = log_e(20.09) ≈ 3. On most calculators the exponent and natural log functions are on the same button, with one function accessed by first pressing the inverse (“inv”) button. The abbreviations “ln” and “log_e” both refer to the natural log, whereas “log” on a calculator typically refers to a different calculation (the base-10 logarithm).
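
The inverse relationship between the two functions can be checked numerically. The following is a small sketch in Python, chosen here only for illustration; it uses the standard `math` module, in which `math.log` with a single argument is the natural log and `math.log10` is the base-10 log.

```python
import math

x = 3

exp_x = math.exp(x)         # e raised to the power 3
back = math.log(exp_x)      # natural log undoes the exponent
common = math.log10(exp_x)  # base-10 log: a different calculation

print(round(exp_x, 2))   # the exponent of 3, rounded to two decimals
print(round(back, 6))    # recovers the original value of x
print(round(common, 4))  # not equal to x
```

Applying the functions in the opposite order gives the same result: `math.exp(math.log(20.09))` returns approximately 20.09.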

Appendix C

SAS Syntax for Obtaining a Best Subsets Logistic Regression With Mallows’ Cp

* Run logistic regression for full model, saving predicted probabilities (pred);

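* DESCENDING models the probability that jobcat = 1, which the definition of z assumes;

  PROC LOGISTIC DESCENDING;
    MODEL jobcat = educ exper sex minority;
    OUTPUT OUT = output1
           P = pred;
  RUN;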

* Define two new variables: z and u;

  DATA output2;
    SET output1;
    z = log(pred / (1 - pred)) + ((jobcat - pred) / (pred * (1 - pred)));
    u = pred * (1 - pred);
  RUN;

* Run linear regression with case weights;

  PROC REG;
     MODEL z = educ exper sex minority
        / SELECTION = RSQUARE CP;
     WEIGHT u;
  RUN;
  QUIT;
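
To make the two derived variables concrete, the sketch below computes z and u for a single case in Python rather than SAS. It mirrors the two formulas in the syntax above and nothing more; the function name `working_response` and the example values of `y` and `pred` are hypothetical, and the weighted linear regression step itself is not reproduced.

```python
import math


def working_response(y, pred):
    """Return (z, u) for one case.

    y    : observed 0/1 criterion (jobcat in the SAS syntax)
    pred : predicted probability that y = 1 from the full logistic model
    """
    if not 0.0 < pred < 1.0:
        raise ValueError("pred must lie strictly between 0 and 1")
    u = pred * (1.0 - pred)                             # case weight
    z = math.log(pred / (1.0 - pred)) + (y - pred) / u  # logit plus scaled residual
    return z, u


# Hypothetical cases: a manager predicted at .80 and a non-manager predicted at .50.
z1, u1 = working_response(1, 0.8)
z2, u2 = working_response(0, 0.5)
print(round(z1, 4), round(u1, 2))
print(round(z2, 4), round(u2, 2))
```

The weight u is largest when the predicted probability is .50 and shrinks toward zero as the prediction approaches 0 or 1, so cases with extreme predicted probabilities contribute less to the weighted regression.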