ML-2 Note: Linear Models for Classification and GLMs
1. Why Classification Uses Logistic Regression
In classification problems, the target variable is discrete.
Binary classification example:
$$y \in \left\{ 0,1 \right\}$$
Given input features
$$x \in \mathbb{R}^d$$
we want to model
$$P(y=1|x)$$
Problem with Linear Regression
A linear model predicts
$$f(x) = w^T x$$
but
$$w^T x \in (-\infty, \infty)$$
while probabilities must satisfy
$$P(y=1|x) \in [0,1]$$
Thus we need a function that maps
$$(-\infty,\infty) \rightarrow (0,1)$$
Logistic Function
The logistic (sigmoid) function provides this mapping
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Therefore we define the logistic regression model
$$P(y=1|x) = \sigma(w^T x)$$
This ensures predictions are valid probabilities.
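As a minimal sketch of the mapping above, the logistic function can be evaluated as follows. The two-branch form is an implementation choice made here (not part of the notes) to avoid overflow in `exp` for large negative inputs; the function name `sigmoid` is likewise an assumption.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the algebraically
    # equivalent form exp(z) / (1 + exp(z)).
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Any real input is mapped into the unit interval.
for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, sigmoid(z))
```

In exact arithmetic the output lies strictly inside $(0,1)$; in floating point, extreme inputs saturate to 0.0 or 1.0.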
2. Binary Logistic Regression
2.1 Model
The model assumes
$$P(y=1|x) = \sigma(w^T x)$$
$$P(y=0|x) = 1 - \sigma(w^T x)$$
Equivalently,
$$P(y|x) = \sigma(w^T x)^y (1-\sigma(w^T x))^{1-y}$$
This corresponds to a Bernoulli distribution.
2.2 Negative Log Likelihood (NLL)
Given dataset
$$\{(x_i,y_i)\}_{i=1}^{n}$$
Likelihood:
$$L(w) = \prod_{i=1}^{n} \sigma(w^T x_i)^{y_i} (1-\sigma(w^T x_i))^{1-y_i}$$
Take log likelihood
$$\log L(w) = \sum_{i=1}^{n} \left[ y_i \log \sigma(w^T x_i) + (1-y_i)\log(1-\sigma(w^T x_i)) \right]$$
Training maximizes likelihood or equivalently minimizes Negative Log Likelihood
$$\mathcal{L}(w) = - \sum_{i=1}^{n} \left[ y_i \log \hat y_i + (1-y_i)\log(1-\hat y_i) \right]$$
where
$$\hat y_i = \sigma(w^T x_i)$$
This is the binary cross-entropy loss.
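The loss above can be sketched in plain Python as follows. The function names, the toy data, and the clipping constant `eps` (used only to avoid `log(0)`) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def binary_nll(w, X, y, eps=1e-12):
    """Negative log likelihood (binary cross-entropy), summed over the dataset."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        # Clip the predicted probability away from 0 and 1 to avoid log(0).
        y_hat = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y_i * math.log(y_hat) + (1 - y_i) * math.log(1.0 - y_hat)
    return total

# Hypothetical toy data: the first feature is a constant 1 acting as a bias.
X = [[1.0, -1.0], [1.0, 0.5], [1.0, 2.0], [1.0, -2.0]]
y = [0, 1, 1, 0]

print(binary_nll([0.0, 0.0], X, y))  # all predictions are 0.5 at w = 0
print(binary_nll([0.0, 2.0], X, y))  # weights that fit these labels better
```

With $w = 0$ every prediction is $0.5$, so the loss equals $n \log 2$, which is a convenient sanity check for an implementation.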
2.3 Connection to KL Divergence
Let
- $q(y|x)$ be the true distribution
- $p(y|x;w)$ be the model distribution
The KL divergence is
$$D_{KL}(q||p) = \sum_y q(y|x) \log \frac{q(y|x)}{p(y|x;w)}$$
Expanding:
$$D_{KL}(q||p) = - \sum_y q(y|x)\log p(y|x;w) + \sum_y q(y|x)\log q(y|x)$$
The second term does not depend on $w$.
Therefore minimizing KL divergence is equivalent to minimizing
$$- \sum_y q(y|x)\log p(y|x;w)$$
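which is the cross-entropy between $q$ and $p$. When $q$ places all of its mass on the observed label $y_i$, this reduces to $-\log p(y_i|x_i;w)$, the negative log likelihood of that example.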
Thus
$$\text{Minimize NLL} \quad \Longleftrightarrow \quad \text{Minimize KL divergence}$$
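The decomposition used in this argument can be checked numerically. The sketch below uses hypothetical two-outcome distributions; `cross_entropy`, `entropy`, and `kl_divergence` are names chosen here for illustration.

```python
import math

def cross_entropy(q, p):
    """-sum_y q(y) log p(y); terms with q(y) = 0 contribute nothing."""
    return -sum(q_y * math.log(p_y) for q_y, p_y in zip(q, p) if q_y > 0)

def entropy(q):
    """-sum_y q(y) log q(y), the part of the KL divergence independent of w."""
    return cross_entropy(q, q)

def kl_divergence(q, p):
    """sum_y q(y) log(q(y) / p(y))."""
    return sum(q_y * math.log(q_y / p_y) for q_y, p_y in zip(q, p) if q_y > 0)

q = [0.7, 0.3]  # assumed "true" distribution over two outcomes
for phi in (0.5, 0.6, 0.7, 0.8):
    p = [phi, 1.0 - phi]  # candidate model distribution
    # KL and cross-entropy differ by the constant entropy(q), so both are
    # minimized by the same p (here p = q).
    print(phi, kl_divergence(q, p), cross_entropy(q, p) - entropy(q))
```

For a one-hot `q` the entropy term is zero, so the KL divergence and the per-example negative log likelihood coincide.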
2.4 Training via Gradient
Define
$$z_i = w^T x_i$$
$$\hat y_i = \sigma(z_i)$$
Loss:
$$\mathcal{L}(w) = - \sum_{i=1}^{n} [ y_i\log \hat y_i + (1-y_i)\log(1-\hat y_i) ]$$
Compute the gradient using the chain rule and $\sigma'(z) = \sigma(z)(1-\sigma(z))$:
$$\frac{\partial \mathcal{L}}{\partial w} = - \sum_{i=1}^{n} (y_i-\hat y_i)x_i$$
Thus
$$\nabla_w \mathcal{L} = \sum_{i=1}^{n} (\hat y_i-y_i)x_i$$
This leads to the gradient descent update
$$w \leftarrow w - \eta \nabla_w \mathcal{L}$$
Interpretation:
Training adjusts $w$ so that predicted probabilities match observed labels.
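A minimal sketch of this training loop, using batch gradient descent with the gradient derived above. The learning rate, step count, zero initialization, and toy data are assumptions chosen for illustration, not values from the notes.

```python
import math

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, x):
    """Predicted probability y_hat = sigma(w^T x)."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)))

def nll(w, X, y):
    """Binary cross-entropy summed over the dataset."""
    return -sum(
        y_i * math.log(predict(w, x_i)) + (1 - y_i) * math.log(1.0 - predict(w, x_i))
        for x_i, y_i in zip(X, y)
    )

def gradient(w, X, y):
    """Gradient of the NLL: sum_i (y_hat_i - y_i) * x_i."""
    g = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        err = predict(w, x_i) - y_i
        for j, x_ij in enumerate(x_i):
            g[j] += err * x_ij
    return g

def fit(X, y, lr=0.1, steps=500):
    """Batch gradient descent: w <- w - lr * gradient, starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [w_j - lr * g_j for w_j, g_j in zip(w, g)]
    return w

# Hypothetical, non-separable toy data; the first feature is a constant bias.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

w = fit(X, y)
print("weights:", w)
print("loss at w = 0:", nll([0.0, 0.0], X, y))
print("loss after training:", nll(w, X, y))
```

The toy labels are deliberately not linearly separable: on separable data the unregularized loss has no finite minimizer and the weights keep growing.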
3. Multiclass Classification
For $K$ classes
$$y \in \{1,2,\dots,K\}$$
We assign a linear score to each class
$$z_k = w_k^T x$$
These scores are converted to probabilities using the Softmax function
$$P(y=k|x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
or
$$P(y=k|x) = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}$$
This ensures
$$\sum_{k=1}^{K} P(y=k|x) = 1$$
Training again uses negative log likelihood
$$\mathcal{L} = -\sum_{i=1}^{n} \log P(y_i|x_i)$$
This model is known as Softmax Regression or Multinomial Logistic Regression.
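The softmax probabilities and the multiclass loss can be sketched as follows. Two details are assumptions made here rather than part of the notes: the maximum score is subtracted before exponentiating (this leaves the probabilities unchanged but avoids overflow), and class labels are zero-based (`0..K-1`) to match Python indexing.

```python
import math

def softmax(z):
    """Convert a list of K scores into K probabilities that sum to 1."""
    m = max(z)  # subtracting the max does not change the result
    exps = [math.exp(z_k - m) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

def scores(W, x):
    """Linear score z_k = w_k^T x for each class k; W is a list of K weight vectors."""
    return [sum(w_j * x_j for w_j, x_j in zip(w_k, x)) for w_k in W]

def softmax_nll(W, X, y):
    """-sum_i log P(y_i | x_i), with zero-based labels y_i in 0..K-1."""
    return -sum(math.log(softmax(scores(W, x_i))[y_i]) for x_i, y_i in zip(X, y))

# Hypothetical toy problem with K = 3 classes and d = 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y = [0, 1, 2]
W_zero = [[0.0, 0.0] for _ in range(3)]
W_good = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]

print(softmax([1.0, 2.0, 3.0]))
print(softmax_nll(W_zero, X, y))  # uniform predictions over the 3 classes
print(softmax_nll(W_good, X, y))  # scores aligned with the labels
```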
4. Exponential Family of Distributions
Many probability distributions belong to the exponential family.
General form:
$$p(y|\eta) = h(y) \exp \left( \eta^T T(y) - A(\eta) \right)$$
Components:
1. $T(y)$ — Sufficient Statistic
Captures all relevant information from data about the parameter.
2. $\eta$ — Natural Parameter
Parameterization of the distribution used in the exponential form.
3. $A(\eta)$ — Log Partition Function
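Ensures the distribution is normalized:
$$\int p(y|\eta)dy = 1$$
(the integral becomes a sum for discrete $y$). Its derivatives give the moments of the sufficient statistic: $\nabla A(\eta) = \mathbb{E}[T(y)]$ and $\nabla^2 A(\eta) = \mathrm{Cov}[T(y)]$.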
4. $h(y)$ — Base Measure
Depends only on the data, not the parameters.
4.1 Gaussian Distribution as Exponential Family
Gaussian distribution
$$p(y|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(y-\mu)^2}{2\sigma^2} \right)$$
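Treat $\sigma^2$ as known, so that $\mu$ is the only parameter. Expanding the square gives
$$p(y|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{y^2}{2\sigma^2} \right) \exp \left( \frac{\mu}{\sigma^2} y - \frac{\mu^2}{2\sigma^2} \right)$$
Matching terms with the exponential family form:
Natural parameter
$$\eta = \frac{\mu}{\sigma^2}$$
Sufficient statistic
$$T(y) = y$$
Log partition function
$$A(\eta) = \frac{\mu^2}{2\sigma^2} = \frac{\sigma^2 \eta^2}{2}$$
Base measure
$$h(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{y^2}{2\sigma^2} \right)$$
Thus the Gaussian distribution with known variance belongs to the exponential family.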
4.2 Bernoulli Distribution as Exponential Family
Bernoulli distribution
$$p(y|\phi) = \phi^y (1-\phi)^{1-y}$$
Rewrite:
$$p(y|\eta) = \exp \left( y\eta - \log(1+e^\eta) \right)$$
where
$$\eta = \log\frac{\phi}{1-\phi}$$
Thus
Sufficient statistic
$$T(y) = y$$
Log partition function
$$A(\eta) = \log(1+e^\eta)$$
Inverting the natural parameter gives $\phi = \sigma(\eta)$, so setting $\eta = w^T x$ recovers the logistic regression model.
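The Bernoulli rewrite can be verified numerically. The sketch below checks, for an arbitrarily chosen $\phi = 0.3$, that the two forms agree and that the derivative of $A(\eta)$, estimated by a central finite difference, equals the mean $\phi$. The function names and the step size `h` are assumptions.

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + exp(eta)) for the Bernoulli distribution."""
    return math.log1p(math.exp(eta))

def bernoulli_standard(y, phi):
    """phi^y * (1 - phi)^(1 - y)."""
    return phi ** y * (1.0 - phi) ** (1 - y)

def bernoulli_expfam(y, eta):
    """exp(y * eta - A(eta)), with h(y) = 1 and T(y) = y."""
    return math.exp(y * eta - log_partition(eta))

phi = 0.3                            # arbitrary example value
eta = math.log(phi / (1.0 - phi))    # natural parameter (log-odds)

for y in (0, 1):
    print(y, bernoulli_standard(y, phi), bernoulli_expfam(y, eta))

# Central finite difference for dA/d(eta); it should be close to the mean phi.
h = 1e-6
dA = (log_partition(eta + h) - log_partition(eta - h)) / (2.0 * h)
print(dA, phi)
```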
5. Generalized Linear Models (GLMs)
GLMs extend linear models to distributions in the exponential family.
A GLM consists of three components.
1. Random Component
Response variable follows an exponential family distribution
$$y \sim p(y|\eta)$$
2. Linear Predictor
A linear combination of features
$$z = w^T x$$
3. Link Function
Relates the mean of the distribution to the linear predictor
$$g(\mathbb{E}[y|x]) = w^T x$$
or
$$\mathbb{E}[y|x] = g^{-1}(w^T x)$$
5.1 Examples of GLMs
Linear Regression
Distribution: Gaussian
Link function: identity, $g(\mu) = \mu$, so
$$\mathbb{E}[y|x] = w^T x$$
Logistic Regression
Distribution: Bernoulli
Link function: logit
$$\log\frac{p}{1-p} = w^T x$$
Inverse link
$$p = \sigma(w^T x)$$
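Both examples share the same linear predictor and differ only in the inverse link applied to it. A minimal sketch of that structure follows; the helper name `glm_mean` and the example numbers are hypothetical.

```python
import math

def identity(z):
    """Inverse link for linear regression (Gaussian response)."""
    return z

def sigmoid(z):
    """Inverse link for logistic regression (Bernoulli response)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def glm_mean(w, x, inverse_link):
    """E[y | x] = g^{-1}(w^T x) for a given inverse link g^{-1}."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x))  # linear predictor
    return inverse_link(z)

w = [0.5, -1.0]
x = [2.0, 1.0]  # linear predictor z = 0.5 * 2.0 - 1.0 * 1.0 = 0.0

print(glm_mean(w, x, identity))  # predicted mean under the Gaussian model
print(glm_mean(w, x, sigmoid))   # predicted probability under the Bernoulli model
```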