9. Conclusion
The appropriate type of categorical dependent variable model (CDVM) is determined largely by the level of measurement of the dependent variable. The level of measurement should, however, be considered in conjunction with your theory and research questions (Long 1997). You must also examine the data generation process (DGP) of the dependent variable to understand its behavior. Careful researchers pay special attention to censoring, truncation, sample selection, and other particular patterns of the DGP.
If your dependent variable is binary, you may use the binary logit or probit regression model. For ordinal
responses, fit the ordered logit or ordered probit regression model. If you have a nominal response
variable, investigate the DGP carefully and then choose among the multinomial logit, conditional logit, and
nested logit models. To use the conditional logit and nested logit models, you need to reshape the data set
in advance.
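The reshaping step can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the procedure used in this document: the wide record layout, the `cost_` column prefix, the sample values, and the function name `to_long` are all hypothetical. The idea is that a wide file holds one row per decision maker, whereas conditional and nested logit routines typically expect one row per decision maker per alternative, with a 0/1 indicator marking the chosen alternative.

```python
# Hypothetical wide records: one row per subject, with the chosen mode and
# one cost column per alternative (names and values are made up).
MODES = ["air", "train", "bus", "car"]

wide = [
    {"subject": 1, "chosen": "car", "cost_air": 70, "cost_train": 71,
     "cost_bus": 70, "cost_car": 30},
    {"subject": 2, "chosen": "air", "cost_air": 68, "cost_train": 84,
     "cost_bus": 85, "cost_car": 50},
]

def to_long(rows, modes=MODES):
    """Return one record per subject-alternative pair."""
    long_rows = []
    for row in rows:
        for code, mode in enumerate(modes, start=1):
            long_rows.append({
                "subject": row["subject"],
                "mode": code,                          # 1=air, 2=train, 3=bus, 4=car
                "choice": int(row["chosen"] == mode),  # 1 only for the chosen alternative
                "cost": row[f"cost_{mode}"],           # alternative-specific attribute
            })
    return long_rows

long_data = to_long(wide)
for rec in long_data:
    print(rec)
```

Each subject contributes as many rows as there are alternatives, and exactly one of those rows has `choice` equal to 1.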
You should check the key assumptions of the CDVMs when fitting the models. Examples are the parallel regression assumption in the ordered logit model and the independence of irrelevant alternatives (IIA) assumption in the multinomial logit model. You may test them with the Brant test and the Hausman test, respectively.
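To make the IIA assumption concrete, the following Python sketch shows the property itself, not the Hausman test. The utilities are made-up numbers and `mnl_probs` is a hypothetical helper. Under the multinomial logit form, the odds between any two alternatives depend only on those two alternatives, so dropping a third alternative leaves the odds unchanged. The Hausman test asks whether estimates from the full and the restricted choice sets are consistent with this property.

```python
import math

def mnl_probs(utilities):
    """Multinomial logit choice probabilities from systematic utilities."""
    denom = sum(math.exp(v) for v in utilities.values())
    return {alt: math.exp(v) / denom for alt, v in utilities.items()}

# Hypothetical systematic utilities for four travel modes (made-up numbers).
full = {"air": 0.5, "train": 0.2, "bus": -0.4, "car": 0.8}
# Restricted choice set: the bus alternative is removed.
reduced = {alt: v for alt, v in full.items() if alt != "bus"}

p_full = mnl_probs(full)
p_reduced = mnl_probs(reduced)

# Odds of air versus car in the full and in the restricted choice set.
odds_full = p_full["air"] / p_full["car"]
odds_reduced = p_reduced["air"] / p_reduced["car"]
print(odds_full, odds_reduced)
```

The two printed odds agree up to floating-point error, although the individual probabilities differ between the two choice sets.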
Since CDVMs are nonlinear, they produce estimates that are difficult to interpret intuitively. Consequently, researchers need to spend more time and effort interpreting the results substantively. Reporting parameter estimates and goodness-of-fit statistics is not sufficient. Long (1997) and Long and Freese (2003) provide good examples of meaningful interpretations using predicted probabilities, factor changes in odds, and marginal/discrete changes of predicted probabilities.
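The three kinds of interpretation named above can be sketched for a binary logit model in Python. The coefficients and the evaluation point are hypothetical and are not estimates from this document; `logit_prob` is an illustrative helper, not part of any package.

```python
import math

def logit_prob(xb):
    """P(y=1) under the binary logit model for a given linear predictor."""
    return 1.0 / (1.0 + math.exp(-xb))

# Hypothetical coefficients, NOT estimates from this document.
b0, b_income, b_male = -1.0, 2.0, 0.5
income = 0.6  # evaluation point for the continuous regressor

# Predicted probabilities for a female and a male at the same income.
p_female = logit_prob(b0 + b_income * income)
p_male = logit_prob(b0 + b_income * income + b_male)

# Factor change in odds: the odds are multiplied by exp(b) when male goes 0 -> 1.
factor_change = math.exp(b_male)

# Discrete change: difference in predicted probability when male goes 0 -> 1.
discrete_change = p_male - p_female

# Marginal change: dP/d(income) = b * p * (1 - p), evaluated for a female.
marginal_income = b_income * p_female * (1 - p_female)

print(round(p_female, 4), round(p_male, 4))
print(round(factor_change, 4), round(discrete_change, 4), round(marginal_income, 4))
```

The factor change in odds is constant across evaluation points, whereas the discrete and marginal changes depend on where the other regressors are held, which is why the evaluation point should always be reported.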
Regarding statistical software for CDVMs, I would recommend the QLIM and MDC procedures of SAS/ETS (see
Tables 3 and 4). SAS has other procedures such as LOGISTIC, GENMOD, and PROBIT, but the QLIM procedure
seems best for binary and ordinal response models, and the MDC procedure is good for nominal dependent
variable models. The Output Delivery System (ODS) is another advantage of using SAS. I also strongly
recommend Stata since it provides handy ways to fit various CDVMs and is complemented by SPost, which has
various useful commands such as .prchange, .listcoef, and .prtab. I encourage SAS Institute to develop
additional statements similar to those SPost commands.
LIMDEP supports various CDVMs addressed in Greene (2003) but does not seem as user-friendly and stable as
SAS and Stata. Thus, I recommend LIMDEP for CDVMs that SAS and Stata do not support. SPSS is least
recommended mainly due to limited support for CDVMs and messy syntax and output.
APPENDIX: Data Sets
The first data set students is a subset of data provided for David H. Good's class in the School of Public
and Environmental Affairs (SPEA). The data were manipulated for the sake of data security.
Download: Students (csv) | SAS | Stata (.dta) | LIMDEP (.lpj)
- owncar: 1 if a student owns a car
- parking: Illegal parking (0=none, 1=sometimes, and 2=often)
- offcamp: 1 if a student lives off-campus
- transmode: the mode of transportation (0=walk, 1=bike, 2=bus, 3=car)
- age: students’ age
- income: monthly income
- male: 1 for male and 0 for female
. tab male owncar
           |        owncar
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |        76        111 |       187
         1 |        77        173 |       250
-----------+----------------------+----------
     Total |       153        284 |       437
. tab male offcamp
           |        offcamp
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         7        180 |       187
         1 |         5        245 |       250
-----------+----------------------+----------
     Total |        12        425 |       437
. tab male parking
           |             parking
      male |         0          1          2 |     Total
-----------+---------------------------------+----------
         0 |       170         13          4 |       187
         1 |       243          7          0 |       250
-----------+---------------------------------+----------
     Total |       413         20          4 |       437
. tab male transmode
           |                  transmode
      male |         0          1          2          3 |     Total
-----------+--------------------------------------------+----------
         0 |        38         18         20        111 |       187
         1 |        34         21         22        173 |       250
-----------+--------------------------------------------+----------
     Total |        72         39         42        284 |       437
. sum income age
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       437    .6168398      .17918         .4      1.227
         age |       437    20.69108    1.610812         18         29
The second data set travel, on travel mode choice, is adapted from Greene (2003). You may get the data from http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm
Download: Travel Mode (csv) | SAS | Stata (.dta) | LIMDEP (.lpj)
- subject: identification number
- mode: 1=Air, 2=Train, 3=Bus, 4=Car
- choice: 1 if the travel mode is chosen
- time: terminal waiting time, 0 for car
- cost: generalized cost measure
- income: household income
- air_inc: interaction of air flight and household income, air*income
- air: 1 for the air flight mode, 0 for others
- train: 1 for the train mode, 0 for others
- bus: 1 for the bus mode, 0 for others
- car: 1 for the car mode, 0 for others
- failure: failure time variable, 1-choice
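The derived variables in this list follow directly from mode, choice, and income. A minimal Python sketch of the definitions is given below; the two sample records and the helper name `add_derived` are hypothetical, and only the formulas (the mode dummies, air*income, and 1-choice) come from the list above.

```python
# Two made-up long-format records for one subject (values are illustrative only).
records = [
    {"subject": 1, "mode": 1, "choice": 0, "income": 35},
    {"subject": 1, "mode": 4, "choice": 1, "income": 35},
]

MODE_NAMES = {1: "air", 2: "train", 3: "bus", 4: "car"}

def add_derived(rec):
    """Return a copy of rec with mode dummies, air_inc, and failure added."""
    out = dict(rec)
    for code, name in MODE_NAMES.items():
        out[name] = int(rec["mode"] == code)     # 1 for this mode, 0 for others
    out["air_inc"] = out["air"] * rec["income"]  # air*income
    out["failure"] = 1 - rec["choice"]           # 1-choice
    return out

derived = [add_derived(r) for r in records]
for rec in derived:
    print(rec)
```

Because air_inc is zero for every non-air row, its maximum equals the maximum of income, which agrees with the summary statistics reported for this data set.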
. tab choice mode
           |                    mode
    choice |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         0 |       152        147        180        151 |       630
         1 |        58         63         30         59 |       210
-----------+--------------------------------------------+----------
     Total |       210        210        210        210 |       840
. sum time income air_inc
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        time |       840    34.58929    24.94861          0         99
      income |       840    34.54762    19.67604          2         72
     air_inc |       840    8.636905    17.91206          0         72
References
- Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute.
- Fu, V. Kang. 1998. "Estimating Generalized Ordered Logit Models," Stata Technical Bulletin, STB-44: 27-30.
- Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
- Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide, 4th ed. Plainview, New York: Econometric Software.
- Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent Variables Using Stata, 2nd ed. College Station, TX: Stata Press.
- Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Advanced Quantitative Techniques in the Social Sciences. Thousand Oaks, CA: Sage Publications.
- Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
- SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
- SPSS Inc. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
- Stata Press. 2005. Stata Base Reference Manual, Release 9. College Station, TX: Stata Press.
- Stokes, Maura E., Charles S. Davis, and Gary G. Koch. 2000. Categorical Data Analysis Using the SAS System, 2nd ed. Cary, NC: SAS Institute.
- Williams, Richard. 2005. Gologit2: A Program for Generalized Logistic Regression/Partial Proportional Odds Models for Ordinal Dependent Variables. North American Stata Users' Group Meeting 2005.
Acknowledgements
I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions. I also thank J. Scott Long in Sociology and David H. Good in the School of Public and Environmental Affairs, Indiana University, for their insightful lectures and data set.
Revision History
- 2003. First draft.
- 2004. Second draft.
- 2005. Third draft (Added bivariate logit/probit models and the nested logit model with LIMDEP examples).
- 2008. Fourth draft (Tested on new versions of software packages and added SAS ODS and selected SPSS output).



