# Logistic regression

In logistic regression, the dependent variable is categorical (possibly ordinal), and we would like to estimate the probability $p$ of an outcome. Instead of modeling the probability directly, which is confined to $[0, 1]$, we can model its log-odds. The log-odds is a convenient quantity because it ranges over $(-\infty, \infty)$. The function that maps a probability to its log-odds is called the **logit** function.

For a probability $p$, the odds is defined as $\frac{p}{1-p}$. The log odds $\log \left( \frac{p}{1-p} \right)$ approaches $-\infty$ as $p \rightarrow 0$ and approaches $\infty$ as $p \rightarrow 1$.

Let $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$

Now we can formulate a model where

$\log \left( \frac{p}{1-p} \right) = y$

or, equivalently,

$p = \frac{e^{y}}{1 + e^{y}}$

This is the logistic regression model. We can think of it as applying a linear regression to the logit of the probability, or as applying a logistic transformation to $y$ to obtain a bounded probability value. The logit and the logistic function are inverses of each other, and the logit function is called the “link function” in the generalized linear model (GLM) framework.
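The inverse relationship between the logit and the logistic function can be checked numerically. The following is a minimal sketch using only the standard library; the coefficients and inputs are hypothetical values chosen for illustration, not estimates from any data.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def logistic(y: float) -> float:
    """Inverse of logit; 1 / (1 + e^-y) is the same as e^y / (1 + e^y)."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical coefficients and inputs (illustration only).
beta = [-1.0, 0.8, -0.5]   # beta_0, beta_1, beta_2
x = [2.0, 1.0]             # x_1, x_2

# Linear predictor y = beta_0 + beta_1 * x_1 + beta_2 * x_2
y = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
p = logistic(y)            # bounded probability in (0, 1)

print("linear predictor:", y)
print("probability:", p)
print("logit(p):", logit(p))  # recovers y up to floating-point error
```

The linear predictor can take any real value, while `logistic` always returns a value strictly between 0 and 1 for moderate inputs.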

## Marginal effects and odds ratio

See also Marginal effects

The odds ratio captures the multiplicative change in the odds of the outcome upon a change in an independent variable; by contrast, the marginal effect captures the additive change in the probability of the outcome.

Because the logit function is the logarithm of the odds, the odds is $e^{y}$. This means that if we calculate the ratio between the odds at $x'_i = x_i + 1$ and the odds at $x_i$, holding all other variables fixed, we obtain $\frac{e^{\beta_0 + \dots + \beta_i (x_i+1) + \dots }}{e^{\beta_0 + \dots + \beta_i x_i + \dots}} = e^{\beta_i}$. In other words, exponentiating a coefficient of the model gives the odds ratio for a unit change in the corresponding variable. Moreover, this odds ratio is constant regardless of the values of the independent variables.
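That the odds ratio does not depend on where it is evaluated can be verified with a few lines of arithmetic. This is a small sketch with hypothetical coefficients (`b0`, `b1`, `b2` are made up for illustration) and a two-variable model.

```python
import math

def odds(b0, b1, b2, x1, x2):
    """Odds p / (1 - p) implied by the model, equal to exp(linear predictor)."""
    return math.exp(b0 + b1 * x1 + b2 * x2)

# Hypothetical coefficients (illustration only).
b0, b1, b2 = -1.0, 0.8, -0.5

ratios = []
for x1, x2 in [(0.0, 0.0), (2.0, 1.0), (-3.0, 5.0)]:
    # Odds after a unit increase in x1, divided by the odds before it.
    ratio = odds(b0, b1, b2, x1 + 1.0, x2) / odds(b0, b1, b2, x1, x2)
    ratios.append(ratio)
    print(f"x1={x1}, x2={x2}: odds ratio = {ratio:.6f}")

print(f"exp(b1) = {math.exp(b1):.6f}")
```

Every evaluation point yields the same ratio, which matches `exp(b1)`.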

By contrast, the marginal effect is how much the probability (dependent variable) changes when we change an independent variable. Because it’s about the probability, not the odds, it describes an additive change and, unlike the odds ratio, varies depending on the values of the other variables. For a binary variable, the marginal effect is the change in the predicted probability when the variable switches from 0 to 1 with the other variables held fixed: $\hat{p}(x_i = 1) - \hat{p}(x_i = 0)$.

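For continuous variables, it is the instantaneous rate of change of the probability with respect to the variable (often labeled “dy/dx” in software output): $\frac{\partial p}{\partial x_i} = \beta_i \, p (1 - p)$.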
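Both kinds of marginal effect can be illustrated by hand. The sketch below assumes a hypothetical model with a continuous variable `x1` and a binary variable `x2` (coefficients made up for illustration). For the logistic model the derivative of the probability works out to $\beta_i p (1 - p)$, which the code compares against a central finite difference.

```python
import math

# Hypothetical coefficients (illustration only).
B0, B1, B2 = -1.0, 0.8, -0.5

def prob(x1: float, x2: float) -> float:
    """Predicted probability for continuous x1 and binary x2."""
    y = B0 + B1 * x1 + B2 * x2
    return 1.0 / (1.0 + math.exp(-y))

# Binary variable: change in probability when x2 goes from 0 to 1,
# evaluated at several values of x1. The effect differs at each point.
discrete_effects = [prob(x1, 1.0) - prob(x1, 0.0) for x1 in (-2.0, 0.0, 2.0)]
print("discrete effects of x2:", discrete_effects)

# Continuous variable: analytic derivative beta_1 * p * (1 - p) at x1 = 1, x2 = 0.
p = prob(1.0, 0.0)
analytic = B1 * p * (1.0 - p)

# Central finite difference as an independent check.
h = 1e-6
numeric = (prob(1.0 + h, 0.0) - prob(1.0 - h, 0.0)) / (2.0 * h)
print("analytic:", analytic, "numeric:", numeric)
```

Unlike the odds ratio, the discrete effects differ across the three evaluation points, which is why software typically reports an average over the sample or an effect at a chosen point.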

## Tutorials

## Tools

### Statsmodels

- Logistic Regression in Python Using Rodeo
- Machine Learning for Hackers Chapter 2, Part 2: Logistic regression with statsmodels

#### Common usage patterns

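Using an R-like formula. It takes care of categorical dummy variables, and you can apply transformations (e.g. log) on the fly.

    import numpy as np  # needed so that np.log resolves inside the formula
    import statsmodels.formula.api as smf

    # df is a pandas DataFrame with columns DV, x1, ..., x5
    result = smf.logit('DV ~ x1 + x2 + np.log(x3) + x4*x5', data=df).fit()
    result.summary()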

Calculating the odds ratio with 95% CI.

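    conf = result.conf_int()            # lower and upper bounds of the coefficients
    conf['odds_ratio'] = result.params  # point estimates
    conf.columns = ['2.5%', '97.5%', 'odds_ratio']
    np.exp(conf)                        # exponentiate log-odds to get odds ratios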
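The arithmetic behind such a table can be sketched by hand. The example below assumes a normal-approximation (Wald) interval and uses a hypothetical coefficient estimate and standard error; it is an illustration of the exponentiation step, not a reproduction of any library's internals.

```python
import math
from statistics import NormalDist

# Hypothetical coefficient estimate and standard error (illustration only).
beta, se = 0.8, 0.25

# Two-sided 95% critical value of the standard normal, about 1.96.
z = NormalDist().inv_cdf(0.975)

# Interval on the log-odds scale.
lower, upper = beta - z * se, beta + z * se

# Exponentiate the estimate and both endpoints to move to the odds-ratio scale.
odds_ratio = math.exp(beta)
or_lower, or_upper = math.exp(lower), math.exp(upper)

print(f"odds ratio = {odds_ratio:.3f}, 95% CI = ({or_lower:.3f}, {or_upper:.3f})")
```

Because the interval is symmetric on the log-odds scale, it is asymmetric around the odds ratio after exponentiation.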

F-test with a human-readable restriction formula.

    result.f_test('x1 = x2 = 0')  # joint test that both coefficients are zero
    result.f_test('x1 = x2')      # test that the two coefficients are equal

Get average marginal effects (use the `at` parameter for other marginal effects).

    margins = result.get_margeff()  # marginal effects
    margins.summary()
    margins.summary_frame()         # get a data frame

#### Selecting a reference (pivot) dummy

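statsmodels has no dedicated argument for choosing the reference category. But because the reference dummy is picked by alphabetical order, we can simply rename the level that we want as the reference so that it sorts first. For instance, if we have a `gender` column with `m` and `f` values but we want to have `m` as the pivot, then simply replace it with `a`. (With the formula interface, patsy's `C(gender, Treatment(reference='m'))` is another way to set the reference level explicitly.)

    # Assign back rather than using inplace=True on a column, which may not modify df.
    df['gender'] = df['gender'].replace('m', 'a')  # 'a' sorts before 'f'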

We can use a similar trick for the multinomial logit.