1. Introduction
A categorical variable here refers to a variable that is binary, ordinal, or nominal. Event count data are discrete
(categorical) but are often treated as continuous variables. When a dependent variable is categorical, the ordinary
least squares (OLS) method can no longer produce the best linear unbiased estimator (BLUE); that is, OLS is biased
and inefficient. Consequently, researchers have developed various regression models for categorical dependent
variables. The nonlinearity of categorical dependent variable models makes it difficult to fit the models and
interpret their results.
1.1 Regression Models for Categorical Dependent Variables
In categorical dependent variable models, the left-hand side (LHS) variable or dependent variable is neither
interval nor ratio, but rather categorical. The level of measurement and data generation process (DGP) of a
dependent variable determine the appropriate model for data analysis. Binary responses (0 or 1) are modeled with binary
logit and probit regressions, ordinal responses (1st, 2nd, 3rd, ...) are modeled with (generalized) ordered
logit/probit regressions, and nominal responses are analyzed by the multinomial logit (probit), conditional logit,
or nested logit model depending on specific circumstances. Independent variables on the right-hand side (RHS) are
interval, ratio, and/or binary (dummy).
Categorical dependent variable models adopt the maximum likelihood (ML) estimation method, whereas OLS uses the
moment based method. The ML method requires an assumption about probability distribution functions, such as the
logistic function and the complementary log-log function. Logit models use the standard logistic probability
distribution, while probit models assume the standard normal distribution. This document focuses on logit and
probit models only, excluding regression models for event count data (e.g., negative binomial regression model and
zero-inflated or zero-truncated regression models). Table 1.1 summarizes categorical dependent variable models in
comparison with OLS.
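Table 1.1 Ordinary Least Squares and CDVMs
| | Model | Dependent (LHS) | Estimation | Independent (RHS) |
| OLS | Ordinary least squares | Interval or ratio scale | Moment based method | A linear function of interval/ratio or binary independent variables |
| CDVMs | Binary response | Binary (0 or 1) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| | Ordinal response | Ordinal (1st, 2nd, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| | Nominal response | Nominal (A, B, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |
| | Event count data | Count (0, 1, 2, ...) | Maximum likelihood method | A linear function of interval/ratio or binary independent variables |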
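The maximum likelihood estimation described in this section can be illustrated with a minimal sketch. It is written in Python purely for illustration (Python is not one of the packages covered here), it assumes a binary logit model with a single regressor, and it uses Newton-Raphson iterations on the log likelihood. The data values and the names fit_logit and log_likelihood are hypothetical, not taken from any of the packages discussed.

```python
import math

def logistic_cdf(z):
    """Standard logistic CDF, the link assumed by the logit model."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log likelihood of a binary logit with one regressor."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic_cdf(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logit(xs, ys, iterations=25):
    """Maximize the log likelihood by Newton-Raphson, starting from zero."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0              # gradient (score) of the log likelihood
        h00 = h01 = h11 = 0.0      # negative Hessian, X'WX
        for x, y in zip(xs, ys):
            p = logistic_cdf(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (X'WX)^{-1} times the gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: y tends to be 1 at larger x, without perfect separation
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logit(xs, ys)
print("intercept:", b0, "slope:", b1)
print("log likelihood at the estimates:", log_likelihood(b0, b1, xs, ys))
```

Statistical packages use more careful algorithms (convergence checks, step halving, standard errors), so this sketch only shows the principle: the estimates are the values at which the gradient of the log likelihood is zero.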
1.2 Logit Models versus Probit Models
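How do logit models differ from probit models? The core difference lies in the distribution of errors
(disturbances). In the logit model, errors are assumed to follow the standard logistic distribution,
$\lambda(\varepsilon) = \dfrac{e^{\varepsilon}}{(1 + e^{\varepsilon})^{2}}$, which has mean 0 and variance $\pi^{2}/3$. The errors of the probit model are assumed to follow the standard normal distribution,
$\phi(\varepsilon) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\varepsilon^{2}/2}$, which has mean 0 and variance 1.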
Figure 1.1 The Standard Normal and Standard Logistic Probability Distributions
[Figure panels: PDF of the Standard Normal Distribution; CDF of the Standard Normal Distribution; PDF of the Standard Logistic Distribution; CDF of the Standard Logistic Distribution]
The probability density function (PDF) of the standard normal probability distribution has a higher peak and
thinner tails than the standard logistic probability distribution (Figure 1.1). The standard logistic distribution
looks as if someone has pressed down the peak of the standard normal distribution and stretched its tails. As a result, the cumulative distribution function (CDF) of the standard normal distribution is steeper in the middle than the CDF of the standard logistic distribution and approaches zero on the left and one on the right more quickly.
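The comparison of the two distributions can be checked numerically. The following is a small sketch in Python (used only for illustration, with function names chosen here) that evaluates the standard normal and standard logistic PDFs and CDFs at the center and in the tails.

```python
import math

def normal_pdf(x):
    """PDF of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_pdf(x):
    """PDF of the standard logistic distribution."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2

def logistic_cdf(x):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))

# Peak height at zero (this is also the slope of each CDF at zero)
print("PDF at 0: normal", normal_pdf(0.0), "logistic", logistic_pdf(0.0))

# Tail thickness at three units from the center
print("PDF at 3: normal", normal_pdf(3.0), "logistic", logistic_pdf(3.0))

# How quickly each CDF approaches zero on the left
print("CDF at -3: normal", normal_cdf(-3.0), "logistic", logistic_cdf(-3.0))
```

The normal density is higher at zero and lower at three, and the normal CDF is closer to zero at minus three, which is the pattern described above.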
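The two models, of course, produce different parameter estimates. In binary response models, the estimates of a logit model are roughly $\pi/\sqrt{3}$ (about 1.8) times larger than those of the corresponding probit model, because the standard logistic distribution has a standard deviation of $\pi/\sqrt{3}$ whereas the standard normal distribution has a standard deviation of 1.
These estimates, however, end up with almost the same standardized impacts of independent variables (Long
1997).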
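One way to see why the two models imply similar standardized impacts is to rescale the logistic CDF by the factor pi/sqrt(3) and compare it with the standard normal CDF. The sketch below, in Python for illustration only, measures the largest gap between the two curves on a grid from -4 to 4. The grid and the variable names are choices made here, not part of the original analysis.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))

# Ratio of the standard deviations of the two error distributions
SCALE = math.pi / math.sqrt(3.0)

# Evaluate both curves on a grid of index values from -4 to 4
grid = [i / 100.0 for i in range(-400, 401)]

# Multiplying the index by SCALE mimics logit coefficients being larger
max_gap = max(abs(logistic_cdf(SCALE * x) - normal_cdf(x)) for x in grid)

print("scale factor:", SCALE)
print("largest difference in predicted probability:", max_gap)
```

The largest difference stays within a few percentage points, so once the coefficients are rescaled the two models predict nearly the same probabilities.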
The choice between logit and probit models has more to do with estimation and familiarity than with
theoretical or interpretive considerations. In general, logit models converge without difficulty. Although some
(multinomial) probit models may take a long time to converge, the probit model works well for bivariate
models. As computing power improves and new algorithms are developed, the importance of this issue is diminishing. For
a discussion of the choice between logit and probit models, see Cameron and Trivedi (2009: 471-474).
1.3 Estimation in SAS, Stata, LIMDEP, R, and SPSS
SAS provides several procedures for categorical dependent variable models, such as PROC LOGISTIC, PROBIT, GENMOD,
QLIM, MDC, PHREG, and CATMOD. Since these procedures support various models, a categorical dependent variable model
can be estimated by multiple procedures. For example, you may run a binary logit model using PROC LOGISTIC, QLIM,
GENMOD, or PROBIT. PROC LOGISTIC and PROC PROBIT of SAS/STAT have been commonly used, but PROC QLIM and PROC MDC
of SAS/ETS have advantages over other procedures. PROC LOGISTIC reports factor changes in the odds and tests key
hypotheses of a model.
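The factor change in the odds mentioned above is the exponentiated logit coefficient. As a rough illustration in Python (not SAS output), the sketch below uses hypothetical coefficients b0 and b1 and shows that a one-unit increase in x multiplies the odds by exp(b1) regardless of the starting value of x.

```python
import math

# Hypothetical logit coefficients, chosen only for illustration
b0, b1 = -2.0, 0.5

def prob(x):
    """Predicted probability P(y = 1 | x) under the logit model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    """Odds of y = 1 versus y = 0 at a given x."""
    p = prob(x)
    return p / (1.0 - p)

# Factor change in the odds for a one-unit increase in x
factor_change = math.exp(b1)

# The ratio of odds is the same at every starting value of x
ratios = [odds(x + 1) / odds(x) for x in (0, 2, 5)]

print("exp(b1):", factor_change)
print("odds ratios at x = 0, 2, 5:", ratios)
```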
Table 1.2 Procedures and Commands for CDVMs
| | Model | SAS 9.2 | Stata 11 | LIMDEP 9.0 | SPSS 17 |
| | OLS (Ordinary Least Squares) | REG | .regress | Regress$ | Regression |
| Binary | Binary logit | QLIM, LOGISTIC, GENMOD, PROBIT | .logit, .logistic | Logit$ | Logistic regression |
| | Binary probit | QLIM, LOGISTIC, GENMOD, PROBIT | .probit | Probit$ | Probit |
| | Bivariate probit | QLIM | .biprobit | Bivariateprobit$ | - |
| Ordinal | Ordered logit | QLIM, LOGISTIC, GENMOD, PROBIT | .ologit | Ordered$, Logit$ | Plum |
| | Generalized logit | - | .gologit2* | - | - |
| | Ordered probit | QLIM, LOGISTIC, GENMOD, PROBIT | .oprobit | Ordered$ | Plum |
| Nominal | Multinomial logit | LOGISTIC, CATMOD | .mlogit | Mlogit$, Logit$ | Nomreg |
| | Conditional logit | LOGISTIC, MDC, PHREG | .clogit | Clogit$, Logit$ | Coxreg |
| | Nested logit | MDC | .nlogit | Nlogit$** | - |
| | Multinomial probit | - | .mprobit | - | - |
* A user-written command written by Williams (2005).
** The Nlogit$ command is supported by NLOGIT, a stand-alone package, which is sold separately.
The QLIM (Qualitative and LImited dependent variable Model) procedure analyzes various categorical and limited
dependent variable regression models such as censored, truncated, and sample-selection models. PROC QLIM also
handles Box-Cox regression and the bivariate probit model. The MDC (Multinomial Discrete Choice) procedure can
estimate conditional logit and nested logit models.
Another advantage of using SAS is the Output Delivery System (ODS), which makes it easy to manage SAS output. ODS
enables users to redirect the output to HTML (Hypertext Markup Language) and RTF (Rich Text Format) formats. Once
SAS output is generated in an HTML document, users can easily handle tables and graphics, especially when copying and
pasting them into a word processor document.
Unlike SAS, Stata has a dedicated command for each categorical dependent variable model. For example,
the .logit and .probit commands respectively fit the binary logit and probit models, while .mlogit and .nlogit
estimate the multinomial logit and nested logit models. Stata also makes it easy to perform post-hoc analyses such as
marginal effects and discrete changes.
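To make the two post-hoc quantities concrete, the following Python sketch (an illustration only, not Stata code) computes a marginal effect and a discrete change for a binary logit model with hypothetical coefficients b0 and b1. The marginal effect is the slope of the predicted probability at a point, while the discrete change is the difference in predicted probabilities between two values of x.

```python
import math

# Hypothetical logit coefficients, chosen only for illustration
b0, b1 = -2.0, 0.5

def prob(x):
    """Predicted probability P(y = 1 | x) under the logit model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def marginal_effect(x):
    """Derivative of the predicted probability with respect to x."""
    p = prob(x)
    return b1 * p * (1.0 - p)

def discrete_change(x_from, x_to):
    """Change in predicted probability when x moves from x_from to x_to."""
    return prob(x_to) - prob(x_from)

# Unlike an OLS slope, the marginal effect depends on where it is evaluated
print("marginal effect at x = 0:", marginal_effect(0.0))
print("marginal effect at x = 4:", marginal_effect(4.0))

# A one-unit discrete change is close to, but not equal to, the marginal effect
print("discrete change from x = 4 to x = 5:", discrete_change(4.0, 5.0))
```

Because the model is nonlinear, both quantities vary with the values at which the independent variables are held, which is why packages ask for reference points when reporting them.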
The LIMDEP Logit$ and Probit$ commands support a variety of categorical dependent variable models that are
addressed in Greene's Econometric Analysis (2003). The output format of LIMDEP 9 is slightly different from that of
previous versions, but key statistics remain unchanged. The nested logit model and multinomial probit model in
LIMDEP are estimated by NLOGIT, a separate package. In R, glm() fits binary logit and probit models and returns a
fitted-model object, in keeping with R's object-oriented design. SPSS also supports some categorical dependent
variable models, but its output is often messy and hard to read. Stata and R are case-sensitive, but SAS, LIMDEP,
and SPSS are not. Table 1.2 summarizes the procedures and commands used for categorical dependent variable models.
1.4 Long and Freese's SPost
Stata users may benefit from user-written commands such as J. Scott Long and Jeremy Freese's SPost. This collection
of user-written commands conducts many follow-up analyses of various categorical dependent variable models
including event count data models. See section 2.2 for major SPost commands.
In order to install SPost, execute the following commands consecutively. Visit J. Scott Long's Web site at
http://www.indiana.edu/~jslsoc/ to get further information.
. net from http://www.indiana.edu/~jslsoc/Stata/
. net install spost9_ado, replace
. net get spost9_do, replace
If a Stata command, function, or user-written command does not work in version 11, run the .version command to
switch the interpreter to an older version and execute that command again. For example, normal() was norm() in older
versions.
. version 9
You may also update Stata or reinstall user-written commands to get their latest versions.
You may use Vincent Kang Fu's gologit (1998) and Richard Williams' gologit2 (2005) for the generalized ordered
logit model. .mfx2 is a related module written by Williams to compute marginal effects (discrete changes) in
(generalized) ordered logit and multinomial logit models. Visit http://www.nd.edu/~rwilliam/gologit2/tsfaq.html
for more information.
. net search gologit
. net install gologit, from(http://www.Stata.com/users/jhardin), replace
. ssc install gologit2, replace
. ssc install mfx2, replace