A common approach to estimating the parameters of generalized linear models involves maximizing the likelihood function.
To find the likelihood function, we first need to assume a probability distribution for the data, i.e., \(P(y | \theta)\), where \(\theta\) are unknown parameters.
This distribution reflects our assumptions about the mechanism that generates the data.
The likelihood function is defined by plugging the observed data into the probability distribution and expressing it as a function of the model parameters, i.e., \(f(\theta; y)\).
For regression models, the data include the response variables \(y\) and the explanatory variables \(x\). Therefore, in general we need to specify \(P(x, y)\).
However, since \(x\) are assumed to be fixed at their observed values, \(P(x)=1\) and the joint distribution reduces to the conditional distribution of \(y | x\): \[\begin{eqnarray*} P(x, y) = P(x)P(y|x) = P(y|x) \end{eqnarray*}\]
Therefore, we only need to specify the conditional distribution of \(y\) given \(x\).
For linear regression models, we assume this \(P(y|x)\) is a normal distribution.
As we mentioned, we model the expectation of this distribution as a linear function of \(x\), i.e., \(E(y|x) = x\beta\), and we assume the variance of this distribution is \(\sigma^{2}\) (which is independent of \(x\) and \(\beta\)).
Therefore, assuming that the observations are independent, we have \[\begin{eqnarray*} P(y | x, \beta, \sigma^{2}) & = & ({2\pi} \sigma^{2} )^{-n/2}\exp\Big(- \frac{\sum_{i=1}^{n} (y_{i} - x_{i} \beta)^{2}}{2 \sigma^{2}}\Big) \end{eqnarray*}\]
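As a quick illustration, here is a minimal sketch (not from the original text) of evaluating this likelihood with NumPy, working on the log scale to avoid numerical underflow; the design matrix, response, and parameter values are made-up examples.

```python
import numpy as np

def normal_loglik(beta, sigma2, X, y):
    """Log of the normal likelihood for the linear regression model."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)

# made-up example data: an intercept plus two explanatory variables
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=50)

print(normal_loglik(beta_true, 1.0, X, y))
```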
As mentioned before, for a binary outcome variable, we use the binomial distribution. \[\begin{eqnarray*} y_i | n_i, \mu_i \sim \textrm{binomial}(n_i, \mu_i) \end{eqnarray*}\] with the Bernoulli distribution as a special case when \(n_i = 1\).
As usual, we define the systematic part of the model as \(x_{i} \beta\).
A common link function for this model is the logit function, defined as \[\begin{eqnarray*} g^{-1}(\mu_i) =\log(\frac{\mu_i}{1 - \mu_i}) = x_i \beta \end{eqnarray*}\] where \(\mu_i\) is the probability of success (i.e., \(y_i = 1\)).
As a result, \[\begin{eqnarray*} \mu_i & = & \frac{\exp(x_i \beta)}{1+ \exp(x_i \beta)} \end{eqnarray*}\]
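As a small sketch (the function names are illustrative, not from the text), the logit link and its inverse can be written as:

```python
import numpy as np

def logit(mu):
    """Link: maps a success probability mu to the linear-predictor scale."""
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link: maps eta = x_i beta back to a probability in (0, 1)."""
    return np.exp(eta) / (1 + np.exp(eta))

eta = np.array([-2.0, 0.0, 2.0])       # hypothetical values of x_i beta
mu = inv_logit(eta)                    # success probabilities
print(mu)                              # approximately [0.12, 0.50, 0.88]
print(np.allclose(logit(mu), eta))     # True: the two functions are inverses
```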
Multinomial logistic regression is a generalization of logistic regression to the case where the outcome can take multiple values (i.e., belong to one of \(K\) classes). \[\begin{eqnarray*} y_{i} | n_i, \mu_{i1}, ..., \mu_{iK} \sim \textrm{multinomial}(n_i, \mu_{i1}, ..., \mu_{iK}) \end{eqnarray*}\] where \(\mu_{ik}\) is the probability of class \(k\) for observation \(i\) such that \(\sum_{k = 1}^{K}{\mu_{ik}} = 1\).
\(y_{i}\) is also a vector of \(K\) elements with \(\sum_{k = 1}^{K}{y_{ik}} = n_i\).
The systematic part is now a vector \(x_{i} \boldsymbol{\beta}\), where \(\boldsymbol{\beta}\) is a matrix of size \((p+1) \times K\).
Each column \(k\) (where \(k = 1, ..., K\)) corresponds to a set of \(p+1\) parameters associated with class \(k\).
This representation is redundant and results in nonidentifiability, since one of the \({\beta}_k\)’s (where \(k = 1, ..., K\)) can be set to zero without changing the set of relationships expressible by the model.
Usually, either the parameters for \(k=1\) (the first column) or for \(k=K\) (the last column) would be set to zero.
For the multinomial logistic model, we use a generalization of the link function we used for the binary logistic regression \[\begin{eqnarray*} \mu_{ik} = \frac{\exp( x_i\boldsymbol{ \beta}_k)}{\sum_{k'=1}^{K} \exp(x_i \boldsymbol{ \beta}_{k'})} \end{eqnarray*}\]
The likelihood in terms of \(\boldsymbol{\beta}\) is as follows: \[\begin{eqnarray*} p(y | \mu ) & \propto & \prod_{i=1}^{n} \prod_{k=1}^{K }\mu_{ik}^{y_{ik}}\\ P(y|x, \boldsymbol{\beta}) & \propto & \prod_{i=1}^{n} \prod_{k=1}^{K } \Big( \frac{\exp( x_i\boldsymbol{ \beta}_k)}{\sum_{k'=1}^{K} \exp(x_i\boldsymbol{ \beta}_{k'})} \Big)^{y_{ik}} \end{eqnarray*}\]
Here \(\boldsymbol{\beta}_k\) is a column vector of \(p+1\) parameters corresponding to class \(k\).
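A minimal sketch of these probabilities and the resulting log-likelihood kernel, with the first column of \(\boldsymbol{\beta}\) fixed at zero for identifiability; the function names and data below are made up for illustration:

```python
import numpy as np

def class_probs(X, B):
    """mu_ik = exp(x_i beta_k) / sum_k' exp(x_i beta_k')."""
    eta = X @ B                                 # n x K matrix of linear predictors
    eta = eta - eta.max(axis=1, keepdims=True)  # shift each row for numerical stability
    e = np.exp(eta)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_loglik(B, X, Y):
    """Kernel of the log-likelihood: sum_i sum_k y_ik log(mu_ik)."""
    return np.sum(Y * np.log(class_probs(X, B)))

# made-up example: n = 4 observations, p + 1 = 2 columns of X, K = 3 classes
X = np.array([[1.0, 0.2], [1.0, -1.0], [1.0, 0.5], [1.0, 1.5]])
B = np.column_stack([np.zeros(2),                  # beta_1 fixed at zero (identifiability)
                     np.array([0.3, 1.0]),
                     np.array([-0.2, 0.5])])
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])  # one-hot rows, n_i = 1
print(multinomial_loglik(B, X, Y))
```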
When the outcome variable, \(y\), represents counts, we use the Poisson model. \[\begin{eqnarray*} y_i | \mu_i \sim \textrm{Poisson}(\mu_i) \end{eqnarray*}\]
The systematic component is defined as before: \(x_i \beta\). The usual link function for this model is the log link: \[\begin{eqnarray*} g^{-1}(\mu_i) = \log(\mu_i) \end{eqnarray*}\]
We therefore have \[\begin{eqnarray*} \mu_i = \exp(x_i \beta) \end{eqnarray*}\]
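A short sketch of the corresponding Poisson log-likelihood under the log link, using made-up data; SciPy's `gammaln` supplies the \(\log(y_i!)\) term, which is constant with respect to \(\beta\):

```python
import numpy as np
from scipy.special import gammaln   # log(y!) term, constant with respect to beta

def poisson_loglik(beta, X, y):
    """Poisson log-likelihood with the log link: mu_i = exp(x_i beta)."""
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

# made-up example data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))
print(poisson_loglik(beta_true, X, y))
```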
To estimate the model parameters, we find the values that maximize the likelihood of the observed data; that is, we maximize the likelihood function with respect to the model parameters.
Of course, it is easier to maximize the log of the likelihood function, i.e., \(L(\beta) = \log(f(\beta))\).
In many problems, the log-likelihood is concave, so this is a convex optimization problem.
To maximize the likelihood function, we can focus on the part of the function that depends on the parameters (i.e., the kernel).
For linear regression models, \[\begin{eqnarray*} L(\beta) & = & -\frac{\sum_{i=1}^{n} (y_{i} - x_{i} \beta)^{2}}{2 \sigma^{2}} - \frac{n}{2}\log({2 \pi \sigma^{2}}) \end{eqnarray*}\]
For simplicity, we can also drop the parts that do not involve \(\beta\) (the additive constant and the positive factor \(1/(2\sigma^{2})\), neither of which changes the maximizer); \[\begin{eqnarray*} L(\beta) & = & -{\sum_{i=1}^{n} (y_{i} - x_{i} \beta)^{2}} \end{eqnarray*}\]
Now we can simply set the first derivative to zero (the likelihood equation) and solve to obtain the maximum likelihood estimate \[\begin{eqnarray*} \frac{\partial L(\beta)}{\partial \beta} & = & 2{\sum_{i=1}^{n} x_{i}' (y_{i} - x_{i} \beta)} = 0 \\ \hat{\beta} & = &(x'x)^{-1}x'y \end{eqnarray*}\]
In this case, the MLE is the same as the least squares estimate.
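A minimal numerical check of this closed-form solution, using made-up data; solving the normal equations gives the same answer as an off-the-shelf least squares routine:

```python
import numpy as np

# made-up example data
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=200)

# beta_hat = (x'x)^{-1} x'y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                   # close to beta_true

# the same answer as an off-the-shelf least squares routine
print(np.linalg.lstsq(X, y, rcond=None)[0])
```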
The likelihood equation for linear regression models takes a simple form that can be solved analytically.
However, the likelihood equations are in general nonlinear in \(\beta\), and as a result, numerical methods are needed to find \(\hat{\beta}\).
A common approach is to use the Newton-Raphson method to maximize the likelihood function, or equivalently, maximize the log-likelihood function, or minimize the negative log-likelihood function (called the energy function in machine learning).
The Newton-Raphson method is a general-purpose iterative algorithm for solving nonlinear equations.
In statistics, we use this method to solve the likelihood equation.
Denote the log-likelihood as \(L(\beta)\). Our objective is to find the value of \(\beta\) for which \(L(\beta)\) is maximized.
We focus on models with a single parameter.
At each iteration \(n\), we use the Taylor series expansion (up to the quadratic term) around the current value \(\beta^{(n)}\): \[\begin{eqnarray*} L(\beta) & \simeq & L(\beta^{(n)}) + L'(\beta^{(n)})(\beta - \beta^{(n)})+ \frac{1}{2}L''(\beta^{(n)})(\beta-\beta^{(n)})^2 \end{eqnarray*}\]
Now take the derivative of \(L(\beta)\), set it to zero and solve for \(\beta\). Regard the answer as your next guess \(\beta^{(n+1)}\): \[\begin{eqnarray*} \beta^{(n+1)} & = & \beta^{(n)} - \frac{L'(\beta^{(n)})}{L''(\beta^{(n)})} \end{eqnarray*}\]
Continue the above process until the algorithm converges.
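A minimal sketch of this iteration for a single-parameter example: an intercept-only logistic model, where \(\beta\) is the log-odds of success and the closed-form answer is the sample log-odds. The function names and data are made up for illustration.

```python
import numpy as np

def newton_raphson(grad, hess, beta0, tol=1e-8, max_iter=50):
    """Newton-Raphson for a single-parameter log-likelihood L(beta)."""
    beta = beta0
    for _ in range(max_iter):
        step = grad(beta) / hess(beta)
        beta = beta - step                 # beta^(n+1) = beta^(n) - L'(beta)/L''(beta)
        if abs(step) < tol:
            break
    return beta

# Example: intercept-only logistic model, beta = log-odds of success.
# L(beta)  = sum_i [y_i beta - log(1 + exp(beta))]
# L'(beta) = sum(y) - n p,   L''(beta) = -n p (1 - p),   p = exp(beta) / (1 + exp(beta))
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, s = len(y), y.sum()
grad = lambda b: s - n * np.exp(b) / (1 + np.exp(b))
hess = lambda b: -n * np.exp(b) / (1 + np.exp(b)) ** 2

beta_hat = newton_raphson(grad, hess, beta0=0.0)
print(beta_hat, np.log((s / n) / (1 - s / n)))   # matches the closed-form log-odds
```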
We can rewrite the equation for our next guess as \[\begin{eqnarray*} \beta^{(n+1)} & = & \beta^{(n)} + \frac{u(\beta^{(n)})}{o(\beta^{(n)})} \end{eqnarray*}\] where \(u(\beta) = L'(\beta)\) is the score function, and \[\begin{eqnarray*} o(\beta^{(n)}) & = & -L''(\beta^{(n)}) \end{eqnarray*}\]
\(o(\beta)\) is called the observed information, and its expectation \(i(\beta) = E[-L''(\beta)]\) is called the Fisher information.
Using Fisher information instead of the observed information leads to the Fisher Scoring Algorithm: \[\begin{eqnarray*} \beta^{(n+1)} & = & \beta^{(n)} + \frac{u(\beta^{(n)})}{i(\beta^{(n)})} \end{eqnarray*}\]
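A sketch of Fisher scoring for logistic regression on made-up data (the function name is illustrative). Note that with the canonical logit link the expected and observed information coincide, so in this case the update is the same as Newton-Raphson.

```python
import numpy as np

def fisher_scoring_logistic(X, y, tol=1e-8, max_iter=50):
    """Fisher scoring for logistic regression (vector-parameter version).

    Score:       u(beta) = X' (y - mu)
    Information: i(beta) = X' W X, with W = diag(mu_i (1 - mu_i))
    Update:      beta <- beta + i(beta)^{-1} u(beta)
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        score = X.T @ (y - mu)
        info = X.T @ (X * (mu * (1 - mu))[:, None])
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# made-up example data
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))
print(fisher_scoring_logistic(X, y))    # close to beta_true
```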
Using the graph of the log-likelihood above (comparing the MLE \(\hat{\beta}\) with a hypothesized value \(\beta_0\)), we can construct three different tests for statistical inference:
Wald: based on the distance between \(\hat{\beta}\) and \(\beta_0\)
Score: based on the difference between the slopes of the log-likelihood at \(\hat{\beta}\) and \(\beta_0\) (the slope at \(\hat{\beta}\) is zero)
Likelihood ratio: based on the distance between \(L_1 = L(\hat{\beta})\) and \(L_0 = L(\beta_0)\)
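As an illustrative example (a sketch with made-up data, testing \(H_0: \beta = 0\)), the three statistics can be computed side by side for the intercept-only logistic model used in the Newton-Raphson sketch above; each is referred to a chi-squared distribution with one degree of freedom.

```python
import numpy as np

# Intercept-only logistic model (as in the Newton-Raphson sketch above);
# beta is the log-odds of success and we test H0: beta = beta_0 = 0.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, s = len(y), y.sum()

loglik = lambda b: s * b - n * np.log(1 + np.exp(b))       # L(beta), up to a constant
score  = lambda b: s - n * np.exp(b) / (1 + np.exp(b))     # u(beta) = L'(beta)
info   = lambda b: n * np.exp(b) / (1 + np.exp(b)) ** 2    # i(beta) = -L''(beta)

beta_hat, beta_0 = np.log((s / n) / (1 - s / n)), 0.0      # MLE and hypothesized value

wald = (beta_hat - beta_0) ** 2 * info(beta_hat)           # distance between beta_hat and beta_0
scor = score(beta_0) ** 2 / info(beta_0)                   # slope of L at beta_0 (it is 0 at beta_hat)
lrt  = 2 * (loglik(beta_hat) - loglik(beta_0))             # distance between L_1 and L_0

print(wald, scor, lrt)   # each compared to a chi-squared(1) critical value
```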