9. Conclusion


The appropriate type of categorical dependent variable model (CDVM) is determined largely by the level of measurement of the dependent variable. The level of measurement should, however, be considered in conjunction with your theory and research questions (Long 1997). You must also examine the data generation process (DGP) of the dependent variable to understand its behavior. Careful researchers pay special attention to censoring, truncation, sample selection, and other particular features of the DGP.

If your dependent variable is binary, you may use the binary logit or probit regression model. For ordinal responses, fit the ordered logit or ordered probit model. If you have a nominal response variable, investigate the DGP carefully and then choose among the multinomial logit, conditional logit, and nested logit models. The conditional logit and nested logit models require that you reshape the data set in advance so that each subject has one record per alternative.
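
The reshaping step can be sketched in a package-neutral way. The Python fragment below is a minimal illustration, not the procedure used in this document. It assumes a hypothetical wide layout with one row per subject (the field names chosen, time_air, and so on, and all values, are invented for the example) and expands it into a long layout with one row per subject and alternative plus a 0/1 choice indicator.

```python
# Hypothetical wide-format records: one row per subject.
# Field names and values are invented for illustration only.
wide = [
    {"subject": 1, "chosen": "car", "time_air": 60, "time_train": 30,
     "time_bus": 40, "time_car": 0},
    {"subject": 2, "chosen": "air", "time_air": 50, "time_train": 45,
     "time_bus": 55, "time_car": 0},
]

# Alternatives in the order of the mode codes 1-4 listed in the appendix.
MODES = ["air", "train", "bus", "car"]


def to_long(rows):
    """Expand each subject into one row per alternative."""
    long_rows = []
    for row in rows:
        for code, name in enumerate(MODES, start=1):
            long_rows.append({
                "subject": row["subject"],
                "mode": code,
                # 1 only for the alternative the subject actually chose
                "choice": 1 if row["chosen"] == name else 0,
                # alternative-specific attribute moves into a single column
                "time": row["time_" + name],
            })
    return long_rows


long_data = to_long(wide)
for record in long_data:
    print(record)
```

Each subject contributes four rows, exactly one of which has choice equal to 1. The travel data set in the appendix has the same shape: 840 rows, of which 210 are chosen alternatives.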

You should check the key assumptions of the CDVMs when fitting the models. Examples are the parallel regression assumption in the ordered logit model and the independence of irrelevant alternatives (IIA) assumption in the multinomial logit model. You may use the Brant test for the former and the Hausman test for the latter.

Since CDVMs are nonlinear, they produce estimates that are difficult to interpret intuitively. Consequently, researchers need to spend more time and effort interpreting the results substantively. Reporting parameter estimates and goodness-of-fit statistics is not sufficient. Long (1997) and Long and Freese (2003) provide good examples of meaningful interpretations using predicted probabilities, factor changes in odds, and marginal/discrete changes of predicted probabilities.
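
As a rough illustration of these quantities, the Python sketch below works through a binary logit by hand. The coefficients and covariate values are hypothetical, chosen only for the example; they are not estimates from this document. The formulas are the standard ones for the logit model.

```python
import math

# Hypothetical binary logit coefficients (NOT estimates from this document).
b0, b_income, b_male = -1.0, 2.0, 0.5


def prob(income, male):
    """Predicted probability P(y=1 | x) under the logit model."""
    xb = b0 + b_income * income + b_male * male
    return 1.0 / (1.0 + math.exp(-xb))


# 1. Predicted probabilities at two covariate profiles (income held at 0.6).
p_female = prob(income=0.6, male=0)
p_male = prob(income=0.6, male=1)

# 2. Factor change in odds: a one-unit increase in x multiplies
#    the odds of y=1 by exp(b), whatever the other covariates are.
odds_factor_male = math.exp(b_male)

# 3. Discrete change: change in probability when male goes from 0 to 1.
discrete_change = p_male - p_female

# 4. Marginal change for a continuous regressor: dP/dx = b * p * (1 - p),
#    evaluated here at the female profile.
marginal_income = b_income * p_female * (1.0 - p_female)

print(f"P(y=1 | male=0)      = {p_female:.4f}")
print(f"P(y=1 | male=1)      = {p_male:.4f}")
print(f"odds factor for male = {odds_factor_male:.4f}")
print(f"discrete change      = {discrete_change:.4f}")
print(f"marginal change      = {marginal_income:.4f}")
```

Unlike the factor change in odds, the discrete and marginal changes depend on the values at which the other covariates are held, so the chosen profile should be reported alongside them.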

Regarding statistical software for CDVMs, I recommend the QLIM and MDC procedures of SAS/ETS (see Tables 3 and 4). SAS has other procedures such as LOGISTIC, GENMOD, and PROBIT, but the QLIM procedure seems best for binary and ordinal response models, and the MDC procedure is good for nominal dependent variable models. The Output Delivery System (ODS) is another advantage of using SAS. I also strongly recommend Stata since it provides handy ways to fit various CDVMs and can be extended with SPost, which has various useful commands such as .prchange, .listcoef, and .prtab. I encourage SAS Institute to develop additional statements similar to those SPost commands.

LIMDEP supports various CDVMs addressed in Greene (2003) but does not seem as user-friendly or stable as SAS and Stata. Thus, I recommend LIMDEP for CDVMs that SAS and Stata do not support. SPSS is least recommended, mainly because of its limited support for CDVMs and its messy syntax and output.



APPENDIX: Data Sets


The first data set, students, is a subset of data provided for David H. Good's class in the School of Public and Environmental Affairs (SPEA). The data were modified for the sake of data security.

Download: Students (csv) | SAS | Stata (.dta) | LIMDEP (.lpj)

  • owncar: 1 if a student owns a car
  • parking: illegal parking (0=none, 1=sometimes, 2=often)
  • offcamp: 1 if a student lives off-campus
  • transmode: the mode of transportation (0=walk, 1=bike, 2=bus, 3=car)
  • age: student's age
  • income: monthly income
  • male: 1 for male and 0 for female

. tab male owncar

           |        owncar
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |        76        111 |       187
         1 |        77        173 |       250
-----------+----------------------+----------
     Total |       153        284 |       437

. tab male offcamp

           |        offcamp
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         7        180 |       187
         1 |         5        245 |       250
-----------+----------------------+----------
     Total |        12        425 |       437

. tab male parking

           |             parking
      male |         0          1          2 |     Total
-----------+---------------------------------+----------
         0 |       170         13          4 |       187
         1 |       243          7          0 |       250
-----------+---------------------------------+----------
     Total |       413         20          4 |       437

. tab male transmode

           |                  transmode
      male |         0          1          2          3 |     Total
-----------+--------------------------------------------+----------
         0 |        38         18         20        111 |       187
         1 |        34         21         22        173 |       250
-----------+--------------------------------------------+----------
     Total |        72         39         42        284 |       437

. sum income age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       437    .6168398      .17918         .4      1.227
         age |       437    20.69108    1.610812         18         29

The second data set, travel, covers travel mode choice and is adapted from Greene (2003). The data are available at http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm

Download: Travel Mode (csv) | SAS | Stata (.dta) | LIMDEP (.lpj)

  • subject: identification number
  • mode: 1=Air, 2=Train, 3=Bus, 4=Car
  • choice: 1 if the travel mode is chosen
  • time: terminal waiting time, 0 for car
  • cost: generalized cost measure
  • income: household income
  • air_inc: interaction of air flight and household income, air*income
  • air: 1 for the air flight mode, 0 for others
  • train: 1 for the train mode, 0 for others
  • bus: 1 for the bus mode, 0 for others
  • car: 1 for the car mode, 0 for others
  • failure: failure time variable, computed as 1 - choice
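
The derived variables in this list follow directly from mode, choice, and income. A minimal Python sketch, using two invented rows rather than actual records from the data set, shows one way they could be constructed:

```python
# Two invented long-format rows (not actual records from the travel data).
rows = [
    {"subject": 1, "mode": 1, "choice": 0, "income": 35},
    {"subject": 1, "mode": 4, "choice": 1, "income": 35},
]

# Mode codes as documented in the variable list.
MODE_NAMES = {1: "air", 2: "train", 3: "bus", 4: "car"}

for row in rows:
    # One 0/1 indicator per travel mode.
    for code, name in MODE_NAMES.items():
        row[name] = 1 if row["mode"] == code else 0
    # Interaction term: income for air rows, 0 otherwise.
    row["air_inc"] = row["air"] * row["income"]
    # 1 for alternatives that were not chosen, 0 for the chosen one.
    row["failure"] = 1 - row["choice"]

for row in rows:
    print(row)
```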

. tab choice mode

           |                    mode
    choice |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         0 |       152        147        180        151 |       630
         1 |        58         63         30         59 |       210
-----------+--------------------------------------------+----------
     Total |       210        210        210        210 |       840

. sum time income air_inc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        time |       840    34.58929    24.94861          0         99
      income |       840    34.54762    19.67604          2         72
     air_inc |       840    8.636905    17.91206          0         72



References

  • Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute.
  • Fu, V. Kang. 1998. "Estimating Generalized Ordered Logit Models," Stata Technical Bulletin, STB-44: 27-30.
  • Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
  • Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide, 4th ed. Plainview, NY: Econometric Software.
  • Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent Variables Using Stata, 2nd ed. College Station, TX: Stata Press.
  • Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences. Thousand Oaks, CA: Sage Publications.
  • Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
  • SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
  • SPSS Inc. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
  • Stata Press. 2005. Stata Base Reference Manual, Release 9. College Station, TX: Stata Press.
  • Stokes, Maura E., Charles S. Davis, and Gary G. Koch. 2000. Categorical Data Analysis Using the SAS System, 2nd ed. Cary, NC: SAS Institute.
  • Williams, Richard. 2005. Gologit2: A Program for Generalized Logistic Regression/Partial Proportional Odds Models for Ordinal Dependent Variables. North American Stata Users' Group Meeting 2005.


Acknowledgements


I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions. I also thank J. Scott Long in Sociology and David H. Good in the School of Public and Environmental Affairs, Indiana University, for their insightful lectures and data set.


Revision History

  • 2003. First draft.
  • 2004. Second draft.
  • 2005. Third draft (Added bivariate logit/probit models and the nested logit model with LIMDEP examples).
  • 2008. Fourth draft (Tested on new versions of software packages and added SAS ODS and selected SPSS output).

