Logistic regression

Regression analysis.

In logistic regression, the dependent variable is categorical (possibly ordinal), and we would like to estimate the probability of each outcome. Instead of modeling the probability directly, we can model its log-odds. The log-odds is a convenient quantity because it ranges over $(-\infty, \infty)$, whereas a probability is confined to $[0, 1]$. The function that maps a probability to its log-odds is called the logit function.

For a probability $p$, the odds is defined as $\frac{p}{1-p}$. The log-odds $\log \left( \frac{p}{1-p} \right)$ approaches $-\infty$ as $p \rightarrow 0$ and approaches $\infty$ as $p \rightarrow 1$.
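As a quick numerical illustration of these limits, the sketch below evaluates the odds and the log-odds at a few probabilities. The function names `odds` and `log_odds` are chosen here for illustration and use only the standard library.

```python
import math

def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Log-odds (logit) for a probability 0 < p < 1."""
    return math.log(odds(p))

# The log-odds is 0 at p = 0.5, symmetric around it,
# and grows without bound in magnitude as p approaches 0 or 1.
for p in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(f"p={p:<6} odds={odds(p):10.4f} log-odds={log_odds(p):8.4f}")
```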

Let $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$

Now we can formulate a model where

$$\log \left( \frac{p}{1-p} \right) \sim y$$

or

$$p \sim \frac{e^{y}}{1 + e^{y}}$$

This is the logistic regression model. We can think of it as applying a linear regression to the logit of the probability, or as applying a logistic transformation to $y$ to obtain a probability bounded between 0 and 1. The logit and the logistic function are inverses of each other, and the logit function is called a “link function” in the generalized linear model framework.
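The inverse relationship can be checked numerically. This is a minimal sketch with made-up coefficients (`beta0`, `beta1`, `beta2` are assumptions for illustration, not estimates from any data); it uses the form $\frac{1}{1 + e^{-y}}$, which is algebraically the same as $\frac{e^{y}}{1 + e^{y}}$.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to its log-odds."""
    return math.log(p / (1.0 - p))

def logistic(y):
    """Map a real number to (0, 1); equal to e^y / (1 + e^y).

    Fine for moderate y; very large negative y would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1, beta2 = -1.0, 0.8, -0.5

def predict_probability(x1, x2):
    """Linear predictor followed by the logistic transformation."""
    y = beta0 + beta1 * x1 + beta2 * x2
    return logistic(y)

p = predict_probability(2.0, 1.0)  # linear predictor is -1 + 1.6 - 0.5 = 0.1
print(p)         # a probability strictly between 0 and 1
print(logit(p))  # recovers the linear predictor up to rounding
```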

Marginal effects and odds ratio

See also Marginal effects

The odds ratio captures the multiplicative change in the odds of the outcome when an independent variable changes; by contrast, the marginal effect captures the additive change in the probability of the outcome.

Because the logit function is the logarithm of the odds, the odds is $e^{y}$. This means that if we calculate the odds ratio between the odds with $x_i$ and that with $x'_i = x_i + 1$, we obtain $\frac{e^{\beta_0 + \dots + \beta_i (x_i+1) + \dots}}{e^{\beta_0 + \dots + \beta_i x_i + \dots}} = e^{\beta_i}$. In other words, if we simply exponentiate a coefficient of the model, it gives us the odds ratio upon a unit change in the corresponding variable. Moreover, the odds ratio is a constant regardless of the value of the independent variables.
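To see that the odds ratio does not depend on where we start, the sketch below computes it at several points for a model with made-up coefficients (the values in `beta` are assumptions for illustration only).

```python
import math

# Hypothetical beta_0, beta_1, beta_2, chosen only for illustration.
beta = [-1.0, 0.8, -0.5]

def odds(x1, x2):
    """Odds e^y for the linear predictor y = beta_0 + beta_1*x1 + beta_2*x2."""
    return math.exp(beta[0] + beta[1] * x1 + beta[2] * x2)

def odds_ratio_x1(x1, x2):
    """Odds ratio for a unit increase in x1, holding x2 fixed."""
    return odds(x1 + 1.0, x2) / odds(x1, x2)

# The ratio is the same at every starting point and equals exp(beta_1).
for x1, x2 in [(0.0, 0.0), (2.0, -1.0), (-3.0, 4.0)]:
    print(x1, x2, odds_ratio_x1(x1, x2), math.exp(beta[1]))
```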

By contrast, the marginal effect is how much the probability (dependent variable) changes when we change an independent variable. Because it’s about the probability, not the odds, it describes an additive change and, unlike the odds ratio, varies depending on the values of the other variables. For a binary variable, the marginal effect is the change in probability when the variable goes from 0 to 1. Specifically, $\hat{p}(x_i = 1) - \hat{p}(x_i = 0)$.

For a continuous variable, it is the instantaneous rate of change of the probability with respect to that variable (“dy/dx”).
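Both kinds of marginal effect can be sketched for a small model with made-up coefficients (`beta0`, `beta1`, `beta2` are assumptions for illustration; `x1` is treated as continuous and `x2` as binary). For the continuous case, differentiating $p = \frac{e^{y}}{1 + e^{y}}$ with respect to $x_1$ gives $\beta_1 p (1 - p)$, which is what the code uses.

```python
import math

# Hypothetical coefficients; x1 is continuous, x2 is a 0/1 dummy.
beta0, beta1, beta2 = -1.0, 0.8, -0.5

def prob(x1, x2):
    """Predicted probability from the logistic model."""
    y = beta0 + beta1 * x1 + beta2 * x2
    return 1.0 / (1.0 + math.exp(-y))

def marginal_effect_binary(x1):
    """Change in probability when x2 goes from 0 to 1, at a given x1."""
    return prob(x1, 1) - prob(x1, 0)

def marginal_effect_continuous(x1, x2):
    """Instantaneous rate of change dp/dx1 = beta1 * p * (1 - p)."""
    p = prob(x1, x2)
    return beta1 * p * (1.0 - p)

# Unlike the odds ratio, both effects depend on where they are evaluated.
for x1 in (-2.0, 0.0, 5.0):
    print(x1, marginal_effect_binary(x1), marginal_effect_continuous(x1, 0))
```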

Tutorials

Tools

Statsmodels

common usage patterns

Using an R-like formula. It takes care of categorical dummy variables, and you can apply transformations (e.g. log) on the fly.

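import numpy as np  # needed so that np.log can be resolved inside the formula
import statsmodels.formula.api as smf
# x4*x5 expands to x4 + x5 + x4:x5 (both main effects plus the interaction)
result = smf.logit('DV ~ x1 + x2 + np.log(x3) + x4*x5', data=df).fit()
result.summary()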

Calculating the odds ratio with 95% CI.

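conf = result.conf_int()  # lower and upper 95% bounds of the coefficients
conf['odds_ratio'] = result.params  # point estimates of the coefficients
conf.columns = ['2.5%', '97.5%', 'odds_ratio']
np.exp(conf)  # exponentiate to move from the log-odds scale to odds ratios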

F-test with human-readable restriction formula

result.f_test('x1 = x2 = 0')  
result.f_test('x1 = x2')

Get average marginal effects (use the at parameter, e.g. at='mean', for other marginal effects).

margins = result.get_margeff()  # marginal effects
margins.summary()
margins.summary_frame()  # get a data frame
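To make concrete what “average” means here, the sketch below fits a logit on synthetic data and compares `get_margeff()` with a manual calculation: the coefficient times the mean of $\hat{p}(1 - \hat{p})$ over all observations. The data, the seed, and the column names `x1`, `x2` and `DV` are made up for illustration, and the manual formula applies to continuous variables under the default settings.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with known coefficients (illustrative values only).
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x1 - 0.7 * x2)))
df = pd.DataFrame({"x1": x1, "x2": x2, "DV": (rng.random(n) < p_true).astype(int)})

result = smf.logit("DV ~ x1 + x2", data=df).fit(disp=0)
margins = result.get_margeff()  # defaults: at='overall', method='dydx'

# Manual average marginal effect: beta_i * mean(p_hat * (1 - p_hat)).
p_hat = np.asarray(result.predict())
manual_ame = result.params[["x1", "x2"]].to_numpy() * np.mean(p_hat * (1.0 - p_hat))

print(margins.margeff)  # effects for x1 and x2 (the intercept is excluded)
print(manual_ame)
```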

Selecting a reference (pivot) dummy

By default, statsmodels picks the reference dummy in alphabetical order. When the formula is handled by patsy, the reference level can be set explicitly with treatment coding, e.g. C(gender, Treatment(reference='m')) in the formula. A cruder workaround is to rename the level that we want as the reference so that it sorts first. For instance, if we have a gender column with m and f values but we want to have m as the pivot, then simply replace m with a.

df['gender'] = df['gender'].replace('m', 'a')  # assign back; inplace on a selected column may not update df

We can use a similar trick for the multinomial logit.

To read