A common approach to estimating the parameters of generalized linear models involves maximizing the likelihood function.
To find the likelihood function, we first need to assume a probability distribution for the data, i.e., \(P(y | \theta)\), where \(\theta\) are unknown parameters.
This distribution reflects our assumptions about the mechanism that generates the data.
The likelihood function is defined by plugging the observed data into the probability distribution and expressing it as a function of the model parameters, i.e., \(f(\theta; y)\).
For regression models, the data include the response variables \(y\) and the explanatory variables \(x\). Therefore, in general we need to specify \(P(x, y)\).
However, since \(x\) are assumed to be fixed at their observed values, \(P(x)=1\) and the joint distribution reduces to the conditional distribution of \(y | x\): \[\begin{eqnarray*} P(x, y) = P(x)P(y|x) = P(y|x) \end{eqnarray*}\]
Therefore, we only need to specify the conditional distribution of \(y\) given \(x\).
For linear regression models, we assume this \(P(y|x)\) is a normal distribution.
As we mentioned, we model the expectation of this distribution as a linear function of \(x\), i.e., \(E(y|x) = x\beta\), and we assume the variance of this distribution is \(\sigma^{2}\) (which is independent of \(x\) and \(\beta\)).
Therefore, assuming that the observations are independent, we have \[\begin{eqnarray*} P(y | x, \beta, \sigma^{2}) & = & ({2\pi} \sigma^{2} )^{-n/2}\exp\Big(- \frac{\sum_{i=1}^{n} (y_{i} - x_{i} \beta)^{2}}{2 \sigma^{2}}\Big) \end{eqnarray*}\]
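As a quick illustration, here is a minimal sketch (not from the original text) of evaluating this likelihood with NumPy, working on the log scale to avoid numerical underflow; the design matrix, response, and parameter values are made-up examples.

```python
import numpy as np

def normal_loglik(beta, sigma2, X, y):
    """Log of the normal likelihood for the linear regression model."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)

# made-up example data: an intercept plus two explanatory variables
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=50)

print(normal_loglik(beta_true, 1.0, X, y))
```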
As mentioned before, for a binary outcome variable, we use the binomial distribution. \[\begin{eqnarray*} y_i | n_i, \mu_i \sim \textrm{binomial}(n_i, \mu_i) \end{eqnarray*}\] with the Bernoulli distribution as a special case when \(n_i = 1\).
As usual, we define the systematic part of the model as \(x_{i} \beta\).
A common link function for this model is the logit function, defined as \[\begin{eqnarray*} g^{-1}(\mu_i) =\log(\frac{\mu_i}{1 - \mu_i}) = x_i \beta \end{eqnarray*}\] where \(\mu_i\) is the probability of success (i.e., \(y_i = 1\)).
As a result, \[\begin{eqnarray*} \mu_i & = & \frac{\exp(x_i \beta)}{1+ \exp(x_i \beta)} \end{eqnarray*}\]
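As a small sketch (the function names are illustrative, not from the text), the logit link and its inverse can be written as:

```python
import numpy as np

def logit(mu):
    """Link: maps a success probability mu to the linear-predictor scale."""
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link: maps eta = x_i beta back to a probability in (0, 1)."""
    return np.exp(eta) / (1 + np.exp(eta))

eta = np.array([-2.0, 0.0, 2.0])       # hypothetical values of x_i beta
mu = inv_logit(eta)                    # success probabilities
print(mu)                              # approximately [0.12, 0.50, 0.88]
print(np.allclose(logit(mu), eta))     # True: the two functions are inverses
```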
Multinomial logistic regression is a generalization of logistic regression to the case where the outcome can take multiple values (i.e., belong to one of \(K\) classes). \[\begin{eqnarray*} y_{i} | n_i, \mu_{i1}, ..., \mu_{iK} \sim \textrm{multinomial}(n_i, \mu_{i1}, ..., \mu_{iK}) \end{eqnarray*}\] where \(\mu_{ik}\) is the probability of class \(k\) for observation \(i\) such that \(\sum_{k = 1}^{K}{\mu_{ik}} = 1\).
\(y_{i}\) is also a vector of \(K\) elements with \(\sum_{k = 1}^{K}{y_{ik}} = n_i\).
The systematic part is now a vector \(x_{i} \boldsymbol{\beta}\), where \(\boldsymbol{\beta}\) is a matrix of size \((p+1) \times K\).
Each column \(k\) (where \(k = 1, ..., K\)) corresponds to a set of \(p+1\) parameters associated with class \(k\).
This representation is redundant and results in nonidentifiability, since one of the \({\beta}_k\)’s (where \(k = 1, ..., K\)) can be set to zero without changing the set of relationships expressible by the model.
Usually, either the parameters for \(k=1\) (the first column) or for \(k=K\) (the last column) would be set to zero.
For the multinomial logistic model, we use a generalization of the link function we used for the binary logistic regression \[\begin{eqnarray*} \mu_{ik} = \frac{\exp( x_i\boldsymbol{ \beta}_k)}{\sum_{k'=1}^{K} \exp(x_i \boldsymbol{ \beta}_{k'})} \end{eqnarray*}\]
The likelihood in terms of \(\boldsymbol{\beta}\) is as follows: \[\begin{eqnarray*} p(y | \mu ) & \propto & \prod_{i=1}^{n} \prod_{k=1}^{K }\mu_{ik}^{y_{ik}}\\ P(y|x, \boldsymbol{\beta}) & \propto & \prod_{i=1}^{n} \prod_{k=1}^{K } \Big( \frac{\exp( x_i\boldsymbol{ \beta}_k)}{\sum_{k'=1}^{K} \exp(x_i\boldsymbol{ \beta}_{k'})} \Big)^{y_{ik}} \end{eqnarray*}\]
Here \(\boldsymbol{\beta}_k\) is a column vector of \(p+1\) parameters corresponding to class \(k\).
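A minimal sketch of these probabilities and the resulting log-likelihood kernel, with the first column of \(\boldsymbol{\beta}\) fixed at zero for identifiability; the function names and data below are made up for illustration:

```python
import numpy as np

def class_probs(X, B):
    """mu_ik = exp(x_i beta_k) / sum_k' exp(x_i beta_k')."""
    eta = X @ B                                 # n x K matrix of linear predictors
    eta = eta - eta.max(axis=1, keepdims=True)  # shift each row for numerical stability
    e = np.exp(eta)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_loglik(B, X, Y):
    """Kernel of the log-likelihood: sum_i sum_k y_ik log(mu_ik)."""
    return np.sum(Y * np.log(class_probs(X, B)))

# made-up example: n = 4 observations, p + 1 = 2 columns of X, K = 3 classes
X = np.array([[1.0, 0.2], [1.0, -1.0], [1.0, 0.5], [1.0, 1.5]])
B = np.column_stack([np.zeros(2),                  # beta_1 fixed at zero (identifiability)
                     np.array([0.3, 1.0]),
                     np.array([-0.2, 0.5])])
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])  # one-hot rows, n_i = 1
print(multinomial_loglik(B, X, Y))
```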
When the outcome variable, \(y\), represents counts, we use the Poisson model. \[\begin{eqnarray*} y_i | \mu_i \sim \textrm{Poisson}(\mu_i) \end{eqnarray*}\]
The systematic component is defined as before: \(x_i \beta\). The usual link function for this model is the log link: \[\begin{eqnarray*} g^{-1}(\mu_i) = \log(\mu_i) \end{eqnarray*}\]
We therefore have \[\begin{eqnarray*} \mu_i = \exp(x_i \beta) \end{eqnarray*}\]
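A short sketch of the corresponding Poisson log-likelihood under the log link, using made-up data; SciPy's `gammaln` supplies the \(\log(y_i!)\) term, which is constant with respect to \(\beta\):

```python
import numpy as np
from scipy.special import gammaln   # log(y!) term, constant with respect to beta

def poisson_loglik(beta, X, y):
    """Poisson log-likelihood with the log link: mu_i = exp(x_i beta)."""
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

# made-up example data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))
print(poisson_loglik(beta_true, X, y))
```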
To estimate the model parameters, we find the values that maximize the likelihood of the observed data; that is, we maximize the likelihood function with respect to the model parameters.
Of course, it is easier to maximize the log of the likelihood function, i.e., \(L(\beta) = \log(f(\beta))\).
In many problems, the log-likelihood is concave, so this is a convex optimization problem.
To maximize the likelihood function, we can focus on the part of the function that depends on the parameters (i.e., the kernel).
For linear regression models, \[\begin{eqnarray*} L(\beta) & = & -\frac{\sum_{i=1}^{n} (y_{i} - x_{i} \beta)^{2}}{2 \sigma^{2}} - \frac{n}{2}\log({2 \pi \sigma^{2}}) \end{eqnarray*}\]
For simplicity, we can also drop the parts that do not involve \(\beta\) (the additive constant and the positive factor \(1/(2\sigma^{2})\), neither of which changes the maximizer); \[\begin{eqnarray*} L(\beta) & = & -{\sum_{i=1}^{n} (y_{i} - x_{i} \beta)^{2}} \end{eqnarray*}\]
Now we can simply set the first derivative to zero (the likelihood equation) and solve to obtain the maximum likelihood estimate \[\begin{eqnarray*} \frac{\partial L(\beta)}{\partial \beta} & = & 2{\sum_{i=1}^{n} x_{i}' (y_{i} - x_{i} \beta)} = 0 \\ \hat{\beta} & = &(x'x)^{-1}x'y \end{eqnarray*}\]
In this case, the MLE is the same as the least squares estimate.
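A minimal numerical check of this closed-form solution, using made-up data; solving the normal equations gives the same answer as an off-the-shelf least squares routine:

```python
import numpy as np

# made-up example data
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=200)

# beta_hat = (x'x)^{-1} x'y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                   # close to beta_true

# the same answer as an off-the-shelf least squares routine
print(np.linalg.lstsq(X, y, rcond=None)[0])
```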
The likelihood equation for linear regression models takes a simple form that can be solved analytically.
However, the likelihood equations are in general nonlinear in \(\beta\), and as a result, numerical methods are needed to find \(\hat{\beta}\).
A common approach is to use the Newton-Raphson method to maximize the likelihood function, or equivalently, maximize the log-likelihood function, or minimize the negative log-likelihood function (called the energy function in machine learning).
The Newton-Raphson method is a general-purpose iterative algorithm for solving nonlinear equations.
In statistics, we use this method to solve the likelihood equation.
Denote the log-likelihood as \(L(\beta)\). Our objective is to find the value of \(\beta\) for which \(L(\beta)\) is maximized.
We focus on models with a single parameter.
At each iteration \(n\), we use the Taylor series expansion (up to the quadratic term) around the current value \(\beta^{(n)}\): \[\begin{eqnarray*} L(\beta) & \simeq & L(\beta^{(n)}) + L'(\beta^{(n)})(\beta - \beta^{(n)})+ \frac{1}{2}L''(\beta^{(n)})(\beta-\beta^{(n)})^2 \end{eqnarray*}\]
Now take the derivative of \(L(\beta)\), set it to zero and solve for \(\beta\). Regard the answer as your next guess \(\beta^{(n+1)}\): \[\begin{eqnarray*} \beta^{(n+1)} & = & \beta^{(n)} - \frac{L'(\beta^{(n)})}{L''(\beta^{(n)})} \end{eqnarray*}\]
Continue the above process until the algorithm converges.
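A minimal sketch of this iteration for a single-parameter example: an intercept-only logistic model, where \(\beta\) is the log-odds of success and the closed-form answer is the sample log-odds. The function names and data are made up for illustration.

```python
import numpy as np

def newton_raphson(grad, hess, beta0, tol=1e-8, max_iter=50):
    """Newton-Raphson for a single-parameter log-likelihood L(beta)."""
    beta = beta0
    for _ in range(max_iter):
        step = grad(beta) / hess(beta)
        beta = beta - step                 # beta^(n+1) = beta^(n) - L'(beta)/L''(beta)
        if abs(step) < tol:
            break
    return beta

# Example: intercept-only logistic model, beta = log-odds of success.
# L(beta)  = sum_i [y_i beta - log(1 + exp(beta))]
# L'(beta) = sum(y) - n p,   L''(beta) = -n p (1 - p),   p = exp(beta) / (1 + exp(beta))
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, s = len(y), y.sum()
grad = lambda b: s - n * np.exp(b) / (1 + np.exp(b))
hess = lambda b: -n * np.exp(b) / (1 + np.exp(b)) ** 2

beta_hat = newton_raphson(grad, hess, beta0=0.0)
print(beta_hat, np.log((s / n) / (1 - s / n)))   # matches the closed-form log-odds
```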
We can rewrite the equation for our next guess as \[\begin{eqnarray*} \beta^{(n+1)} & = & \beta^{(n)} + \frac{u(\beta^{(n)})}{o(\beta^{(n)})} \end{eqnarray*}\] where \(u(\beta) = L'(\beta)\) is the score function, and \[\begin{eqnarray*} o(\beta^{(n)}) & = & -L''(\beta^{(n)}) \end{eqnarray*}\]
\(o(\beta)\) is called the observed information, and its expectation \(i(\beta) = E[-L''(\beta)]\) is called the Fisher information.
Using Fisher information instead of the observed information leads to the Fisher Scoring Algorithm: \[\begin{eqnarray*} \beta^{(n+1)} & = & \beta^{(n)} + \frac{u(\beta^{(n)})}{i(\beta^{(n)})} \end{eqnarray*}\]
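A sketch of Fisher scoring for logistic regression on made-up data (the function name is illustrative). Note that with the canonical logit link the expected and observed information coincide, so in this case the update is the same as Newton-Raphson.

```python
import numpy as np

def fisher_scoring_logistic(X, y, tol=1e-8, max_iter=50):
    """Fisher scoring for logistic regression (vector-parameter version).

    Score:       u(beta) = X' (y - mu)
    Information: i(beta) = X' W X, with W = diag(mu_i (1 - mu_i))
    Update:      beta <- beta + i(beta)^{-1} u(beta)
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        score = X.T @ (y - mu)
        info = X.T @ (X * (mu * (1 - mu))[:, None])
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# made-up example data
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta_true))))
print(fisher_scoring_logistic(X, y))    # close to beta_true
```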
Using the graph of the log-likelihood above (comparing the MLE \(\hat{\beta}\) with a hypothesized value \(\beta_0\)), we can construct three different tests for statistical inference:
Wald: based on the distance between \(\hat{\beta}\) and \(\beta_0\)
Score: based on the difference between the slopes of the log-likelihood at \(\hat{\beta}\) and \(\beta_0\) (the slope at \(\hat{\beta}\) is zero)
Likelihood ratio: based on the distance between \(L_1 = L(\hat{\beta})\) and \(L_0 = L(\beta_0)\)
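As an illustrative example (a sketch with made-up data, testing \(H_0: \beta = 0\)), the three statistics can be computed side by side for the intercept-only logistic model used in the Newton-Raphson sketch above; each is referred to a chi-squared distribution with one degree of freedom.

```python
import numpy as np

# Intercept-only logistic model (as in the Newton-Raphson sketch above);
# beta is the log-odds of success and we test H0: beta = beta_0 = 0.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, s = len(y), y.sum()

loglik = lambda b: s * b - n * np.log(1 + np.exp(b))       # L(beta), up to a constant
score  = lambda b: s - n * np.exp(b) / (1 + np.exp(b))     # u(beta) = L'(beta)
info   = lambda b: n * np.exp(b) / (1 + np.exp(b)) ** 2    # i(beta) = -L''(beta)

beta_hat, beta_0 = np.log((s / n) / (1 - s / n)), 0.0      # MLE and hypothesized value

wald = (beta_hat - beta_0) ** 2 * info(beta_hat)           # distance between beta_hat and beta_0
scor = score(beta_0) ** 2 / info(beta_0)                   # slope of L at beta_0 (it is 0 at beta_hat)
lrt  = 2 * (loglik(beta_hat) - loglik(beta_0))             # distance between L_1 and L_0

print(wald, scor, lrt)   # each compared to a chi-squared(1) critical value
```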