
Week 7 Prep: Regression

Linear Regression

What is Linear Regression?

Linear regression is a foundational statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The general form of a multiple linear regression model is given by:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon,

where:

  • y is the dependent variable,
  • x_1, x_2, \dots, x_n are the independent variables,
  • \beta_0 is the intercept,
  • \beta_1, \beta_2, \dots, \beta_n are the coefficients,
  • \epsilon is the error term.

This model relies on several key assumptions:

  • Linearity: The relationship between the predictors and the outcome is linear.
  • Independence: Observations are independent of each other.
  • Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.
  • Normally Distributed Errors: The error term \epsilon follows a normal distribution.

Extensions such as polynomial regression allow modeling of non-linear relationships, while regularization techniques help prevent overfitting.

Example Problem

Consider predicting house prices:

  • x_1: square footage,
  • x_2: number of bedrooms,
  • x_3: age of the house.

The model estimates parameters \beta_0, \beta_1, \beta_2, \beta_3 to minimize the difference between predicted prices and actual observed prices.
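
As a quick, hedged illustration, the sketch below fits such a model with scikit-learn on a small made-up dataset; the square footages, bedroom counts, ages, and prices are illustrative values, not real housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example data: [square footage, bedrooms, age] -> price (illustrative values only)
X = np.array([
    [1500, 3, 20],
    [2100, 4, 5],
    [900,  2, 45],
    [1750, 3, 12],
    [2500, 5, 2],
])
y = np.array([250_000, 400_000, 150_000, 310_000, 500_000])

model = LinearRegression()
model.fit(X, y)

print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1..beta_3):", model.coef_)
print("Predicted price for a 1800 sqft, 3-bed, 10-year-old house:",
      model.predict([[1800, 3, 10]])[0])
```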

Additional Concepts in Linear Regression

Finding the Best Fit Line

There are two main approaches:

  1. Closed-form solution (Normal Equation):
    The optimal parameters are computed as: \hat{\beta} = (X^T X)^{-1} X^T y, where X is the feature matrix.
  2. Iterative Optimization (Gradient Descent):
    This method updates the parameters iteratively to minimize the cost function.
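
A minimal NumPy sketch of the normal equation on toy data (the values below are illustrative):

```python
import numpy as np

# Toy data: a single feature x plus an intercept column of ones
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("Estimated [intercept, slope]:", beta_hat)
```

In practice, np.linalg.solve or np.linalg.lstsq is usually preferred over forming the explicit inverse, for numerical stability.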

Sum of Squared Errors (SSE)

The sum of squared errors measures the total prediction error:

SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.

Cost Function

A commonly used cost function is the mean squared error (MSE):

J(\beta) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.

Minimizing J(\beta) leads to optimal parameter estimates.
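
For concreteness, a short snippet that computes SSE and this cost on a handful of illustrative predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.5, 9.4])

sse = np.sum((y_true - y_pred) ** 2)     # sum of squared errors
cost = sse / (2 * len(y_true))           # J(beta) = SSE / (2N)
print(f"SSE = {sse:.3f}, cost J = {cost:.4f}")
```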

Gradient Descent Algorithm

Gradient descent is an iterative method to minimize J(\beta). The update rule is:

\beta_j := \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j},

where \alpha is the learning rate.
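
A minimal sketch of this update rule for linear regression, assuming a single synthetic feature and a learning rate and iteration count chosen purely for illustration:

```python
import numpy as np

# Illustrative gradient descent for linear regression with cost J = (1/2N) * SSE
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])   # [intercept, feature]
y = 4.0 + 2.5 * X[:, 1] + rng.normal(0, 1, 50)               # synthetic target

beta = np.zeros(2)       # initial parameters
alpha = 0.01             # learning rate
for _ in range(5000):
    y_hat = X @ beta
    grad = (X.T @ (y_hat - y)) / len(y)   # gradient of J with respect to beta
    beta -= alpha * grad

print("Learned parameters (should land near [4.0, 2.5]):", beta)
```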


Regularization in Linear Regression

Regularization techniques prevent overfitting by penalizing large coefficient values:

  • Ridge Regression (L2 Regularization):
    Adds a penalty proportional to the square of the coefficients: J(\beta) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{n}\beta_j^2.
  • Lasso Regression (L1 Regularization):
    Uses the absolute values of the coefficients as a penalty: J(\beta) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{n}|\beta_j|.
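
A hedged scikit-learn sketch of both penalties on synthetic data; note that scikit-learn's alpha argument plays the role of the penalty strength \lambda:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.5, 100)

# scikit-learn's `alpha` argument corresponds to lambda in the formulas above
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients (shrunk toward zero):", ridge.coef_)
print("Lasso coefficients (often exactly zero for weak predictors):", lasso.coef_)
```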

Evaluation Metrics for Linear Regression

  • R-squared (R^2): Proportion of variance explained by the model.
  • Adjusted R-squared: Adjusts R^2 for the number of predictors.
  • Mean Absolute Error (MAE): Average absolute difference between predictions and actual values.
  • Root Mean Squared Error (RMSE): The square root of MSE, sensitive to outliers.
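
These metrics can be computed with scikit-learn; the values below are illustrative only:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([250_000, 400_000, 150_000, 310_000])
y_pred = np.array([245_000, 390_000, 165_000, 320_000])

print("R^2 :", r2_score(y_true, y_pred))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
```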

Logistic Regression

What is Logistic Regression?

Logistic regression is designed for binary classification. It predicts the probability p that an input belongs to a specific class using the sigmoid function:

\sigma(z) = \frac{1}{1+e^{-z}},

with

z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n.

This transformation confines the output to (0,1), which can be interpreted as a probability. Parameters are estimated using maximum likelihood estimation (MLE).

Example Problem

A classic application is spam detection. With features such as word frequency and sender reputation, the model computes the probability that an email is spam. If p > 0.5, the email is classified as spam; otherwise, it is not.
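
A minimal scikit-learn sketch of this setup; the two numeric features (a spam-word frequency and a sender reputation score) and their values are hypothetical stand-ins, not a real spam dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [spam-word frequency, sender reputation]; labels: 1 = spam, 0 = not spam
X = np.array([
    [0.90, 0.1],
    [0.80, 0.2],
    [0.10, 0.9],
    [0.05, 0.8],
    [0.70, 0.3],
    [0.15, 0.7],
])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(X, y)

new_email = np.array([[0.60, 0.4]])
p = clf.predict_proba(new_email)[0, 1]   # probability of the "spam" class
print(f"P(spam) = {p:.3f} -> classified as {'spam' if p > 0.5 else 'not spam'}")
```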

Additional Concepts in Logistic Regression

Log Odds and the Linear Relationship

The log odds are expressed as:

\log\left(\frac{p}{1-p}\right) = z,

which is a linear function of the predictors. This linearity simplifies both interpretation and parameter estimation.
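
This relation follows from inverting the sigmoid; a short derivation consistent with the formulas above:

```latex
p = \frac{1}{1+e^{-z}}
\;\Longrightarrow\; 1 - p = \frac{e^{-z}}{1+e^{-z}}
\;\Longrightarrow\; \frac{p}{1-p} = e^{z}
\;\Longrightarrow\; \log\left(\frac{p}{1-p}\right) = z.
```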

Transformation in Logistic Regression

The sigmoid function transforms the linear output into a probability:

p = \sigma(z) = \frac{1}{1+e^{-z}}.

Cost Function for Logistic Regression

The model is trained using the log loss (or cross-entropy loss):

J(\beta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(p_i) + (1-y_i)\log(1-p_i)\right],

where p_i = \sigma(z_i).
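
A short sketch that evaluates this loss both directly from the formula and with scikit-learn's log_loss, using illustrative labels and probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted probabilities p_i

# Direct implementation of the cross-entropy formula above
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print("Log loss (manual)      :", manual)
print("Log loss (scikit-learn):", log_loss(y_true, p_pred))
```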

Gradient Descent for Logistic Regression

Gradient descent is applied similarly:

\beta_j := \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j},

with the gradient given by:

\frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)x_{ij}.
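
A minimal NumPy sketch of this update on synthetic data, with the learning rate and iteration count chosen purely for illustration:

```python
import numpy as np

# Illustrative gradient descent for logistic regression on synthetic data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])     # [intercept, feature]
true_beta = np.array([-1.0, 2.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)

beta = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ beta)))     # sigmoid of the linear output
    grad = (X.T @ (p - y)) / len(y)       # gradient of the log loss
    beta -= alpha * grad

print("Learned parameters (roughly near [-1, 2]):", beta)
```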

Regularization in Logistic Regression

To avoid overfitting:

  • L2 Regularization:
    Adds a penalty term: J(\beta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(p_i) + (1-y_i)\log(1-p_i)\right] + \lambda\sum_{j=1}^{n}\beta_j^2.
  • L1 Regularization:
    Uses the absolute values of the coefficients: J(\beta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(p_i) + (1-y_i)\log(1-p_i)\right] + \lambda\sum_{j=1}^{n}|\beta_j|.
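
In scikit-learn these penalties are selected through the penalty and C parameters, where C is the inverse of the regularization strength (it acts like 1/\lambda); a hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 2] + rng.normal(0, 0.5, 200) > 0).astype(int)

# Smaller C means a stronger penalty (C acts like 1/lambda)
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)

print("L2 coefficients:", l2_model.coef_)
print("L1 coefficients (often sparser):", l1_model.coef_)
```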

Evaluation Metrics for Logistic Regression

  • Accuracy: The fraction of correctly predicted observations.
  • Precision: Proportion of positive identifications that are correct.
  • Recall (Sensitivity): Proportion of actual positives correctly identified.
  • F1-Score: The harmonic mean of precision and recall.
  • ROC Curve & AUC: The ROC curve plots the true positive rate against the false positive rate; AUC quantifies overall performance.
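
A short sketch computing these metrics with scikit-learn on illustrative labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_pred = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.9, 0.1])   # predicted probabilities
y_pred = (p_pred > 0.5).astype(int)                            # default 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, p_pred))             # AUC uses probabilities, not hard labels
```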

Additional Considerations

  • Maximum Likelihood Estimation (MLE): Logistic regression uses MLE to estimate parameters.
  • Multiclass Extensions: Methods such as one-vs-rest and softmax regression extend logistic regression to handle multiple classes.
  • Threshold Tuning: Adjusting the default threshold of 0.5 can help balance precision and recall for specific applications.
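
As a hedged illustration of threshold tuning, the sketch below sweeps a few candidate thresholds over illustrative probabilities to show the precision/recall trade-off:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_pred = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.9, 0.1])

# Sweep the decision threshold to trade precision against recall
for threshold in (0.3, 0.5, 0.7):
    y_pred = (p_pred >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
```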

Classification vs. Regression

Classification (Project 2)

Classification tasks involve predicting discrete labels. For example, identifying spam emails (1 for spam, 0 for not spam) employs models like logistic regression, random forest, or SVM. Evaluation metrics include accuracy, precision, recall, F1-score, and ROC-AUC, focusing on correctly assigning observations to categories.

Regression (Project 3)

Regression tasks involve predicting continuous numerical values. For example, forecasting house prices uses linear regression. Performance is measured using metrics such as SSE, MSE, RMSE, R^2, and Adjusted R^2, emphasizing the accurate prediction of magnitudes.

Linear vs. Logistic Regression

Key Differences

  • Output Nature:

    • Linear Regression outputs continuous values based on a linear equation.
    • Logistic Regression outputs probabilities via a sigmoid transformation, which are then mapped to discrete classes.
  • Error Metrics:

    • Linear Regression is evaluated using metrics like SSE, MSE, RMSE, and R-squared.
    • Logistic Regression is evaluated using log loss (cross-entropy) and classification metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
  • Modeling Approach:

    • Linear Regression assumes a direct linear relationship between the predictors and the response variable.
    • Logistic Regression models the log odds of the outcome as a linear combination of predictors, then applies the sigmoid function to produce probability estimates.

Understanding these concepts in depth—from regularization techniques that mitigate overfitting to the evaluation metrics tailored for continuous or categorical outcomes—provides a robust foundation for selecting and fine-tuning regression models in data science and machine learning tasks.