Week 7 Prep: Regression
Linear Regression
What is Linear Regression?
Linear regression is a foundational statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The general form of a multiple linear regression model is given by:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

where:
- $y$ is the dependent variable,
- $x_1, x_2, \ldots, x_p$ are the independent variables,
- $\beta_0$ is the intercept,
- $\beta_1, \beta_2, \ldots, \beta_p$ are the coefficients,
- $\varepsilon$ is the error term.
This model relies on several key assumptions:
- Linearity: The relationship between the predictors and the outcome is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.
- Normally Distributed Errors: The error term follows a normal distribution.
Extensions such as polynomial regression allow modeling of non-linear relationships, while regularization techniques help prevent overfitting.
Example Problem
Consider predicting house prices:
- $x_1$: square footage,
- $x_2$: number of bedrooms,
- $x_3$: age of the house.
The model estimates parameters to minimize the difference between predicted prices and actual observed prices.
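Below is a minimal sketch of this example, assuming scikit-learn is available; the house features and prices are made-up values purely for illustration.

```python
# Hypothetical house-price data; values invented for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: square footage, number of bedrooms, age of the house
X = np.array([
    [1400, 3, 20],
    [1800, 4, 15],
    [2400, 4, 5],
    [1100, 2, 35],
    [3000, 5, 1],
])
y = np.array([240_000, 310_000, 420_000, 180_000, 540_000])  # observed prices

model = LinearRegression()
model.fit(X, y)  # estimates the intercept and one coefficient per feature

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("Predicted price (2000 sqft, 3 bed, 10 yrs):", model.predict([[2000, 3, 10]])[0])
```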
Additional Concepts in Linear Regression
Finding the Best Fit Line
There are two main approaches:
- Closed-form solution (Normal Equation):
The optimal parameters are computed as $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X$ is the feature matrix (see the NumPy sketch after this list).
- Iterative Optimization (Gradient Descent):
This method updates the parameters iteratively to minimize the cost function.
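A minimal sketch of the normal equation with NumPy, using synthetic data generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # feature matrix: 100 samples, 3 features
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=0.1, size=100)  # intercept 4.0 plus noise

# Prepend a column of ones so the intercept is estimated as well
X_design = np.column_stack([np.ones(len(X)), X])

# beta_hat = (X^T X)^{-1} X^T y; np.linalg.solve is more stable than an explicit inverse
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat)   # approximately [4.0, 2.0, -1.0, 0.5]
```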
Sum of Squared Errors (SSE)
The sum of squared errors measures the total prediction error:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Cost Function
A commonly used cost function is the mean squared error (MSE):

$$J(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Minimizing $J(\beta)$ leads to optimal parameter estimates.
Gradient Descent Algorithm
Gradient descent is an iterative method to minimize $J(\beta)$. The update rule is:

$$\beta_j := \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j}$$

where $\alpha$ is the learning rate.
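The following is one possible sketch of this update rule in plain NumPy, minimizing the MSE cost; the learning rate and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """X: (n, p) feature matrix without intercept column; y: (n,) targets."""
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])   # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residuals = Xb @ beta - y           # predictions minus targets
        grad = (2 / n) * Xb.T @ residuals   # gradient of the MSE with respect to beta
        beta -= alpha * grad                # update rule: beta := beta - alpha * gradient
    return beta

# Example: recover beta ~ [1, 2] from y = 1 + 2x plus noise (synthetic data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.05, size=200)
print(gradient_descent(x, y, alpha=0.5, n_iters=2000))
```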
Regularization in Linear Regression
Regularization techniques prevent overfitting by penalizing large coefficient values:
- Ridge Regression (L2 Regularization):
Adds a penalty proportional to the square of the coefficients: $J(\beta) = \text{MSE} + \lambda \sum_{j=1}^{p} \beta_j^2$
- Lasso Regression (L1 Regularization):
Uses the absolute values of coefficients as a penalty: $J(\beta) = \text{MSE} + \lambda \sum_{j=1}^{p} |\beta_j|$
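A minimal sketch comparing the two penalties, assuming scikit-learn; the dataset is synthetic and the penalty strength `alpha=1.0` is an arbitrary illustrative choice.

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can drive some coefficients exactly to zero

print("Ridge coefficients:", ridge.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))
```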
Evaluation Metrics for Linear Regression
- R-squared ($R^2$): Proportion of variance explained by the model.
- Adjusted R-squared: Adjusts for the number of predictors.
- Mean Absolute Error (MAE): Average absolute difference between predictions and actual values.
- Root Mean Squared Error (RMSE): The square root of MSE, sensitive to outliers.
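These metrics are straightforward to compute by hand or, as sketched below, with scikit-learn; the true and predicted values here are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

print("R^2: ", r2_score(y_true, y_pred))
print("MAE: ", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))  # square root of the MSE
```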
Additional Considerations
- Feature Scaling: Standardization or normalization improves gradient descent convergence (see the scaling sketch after this list).
- Multicollinearity: High correlation among predictors can inflate variance; regularization or dimensionality reduction can help.
- Residual Analysis: Examining residuals helps diagnose model issues like heteroscedasticity or non-linearity.
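A minimal standardization sketch, assuming scikit-learn's `StandardScaler`; the feature values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1400, 3, 20], [1800, 4, 15], [2400, 4, 5]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```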
Logistic Regression
What is Logistic Regression?
Logistic regression is designed for binary classification. It predicts the probability that an input belongs to a specific class using the sigmoid function:

$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

with

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

This transformation confines the output to $(0, 1)$, which can be interpreted as a probability. Parameters are estimated using maximum likelihood estimation (MLE).
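A minimal sketch of the sigmoid transformation in NumPy:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 -> the default decision boundary
print(sigmoid(3.0))    # close to 1
print(sigmoid(-3.0))   # close to 0
```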
Example Problem
A classic application is spam detection. With features such as word frequency and sender reputation, the model computes the probability that an email is spam. If $P(\text{spam} \mid x) \geq 0.5$, the email is classified as spam; otherwise, it is not.
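A minimal sketch of this example with scikit-learn; the two features (word frequency, sender reputation score) and labels are hypothetical values made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [spammy-word frequency, sender reputation score]
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.05, 0.95]])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = spam, 0 = not spam

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[0.6, 0.4]])[0, 1]   # estimated P(spam) for a new email
print("P(spam) =", proba, "->", "spam" if proba >= 0.5 else "not spam")
```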
Additional Concepts in Logistic Regression
Log Odds and the Linear Relationship
The log odds are expressed as:

$$\log\left(\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

which is a linear function of the predictors. This linearity simplifies both interpretation and parameter estimation.
Transformation in Logistic Regression
The sigmoid function transforms the linear output into a probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Cost Function for Logistic Regression
The model is trained using the log loss (or cross-entropy loss):

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

where $\hat{p}_i = \sigma(z_i)$ is the predicted probability for observation $i$.
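A minimal sketch of this loss computed directly and with scikit-learn's `log_loss`; the labels and predicted probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(y = 1)

# Average cross-entropy computed from the formula above
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(manual, log_loss(y_true, p_hat))   # the two values agree
```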
Gradient Descent for Logistic Regression
Gradient descent is applied similarly:

$$\beta_j := \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j}$$

with the gradient given by:

$$\frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) x_{ij}$$
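A possible NumPy sketch of this update loop, with an arbitrary learning rate and iteration count chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """X: (n, p) features without intercept column; y: (n,) labels in {0, 1}."""
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])   # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p_hat = sigmoid(Xb @ beta)          # predicted probabilities
        grad = Xb.T @ (p_hat - y) / n       # gradient of the log loss
        beta -= alpha * grad                # update rule
    return beta
```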
Regularization in Logistic Regression
To avoid overfitting:
- L2 Regularization:
Adds a penalty term: $J(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$
- L1 Regularization:
Uses the absolute values of coefficients: $J(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$
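A minimal sketch with scikit-learn's `LogisticRegression`, which exposes the penalty through its `penalty` and `C` arguments (smaller `C` means stronger regularization); the dataset is synthetic.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print("L2 coefficients:", l2_model.coef_.round(2))
print("L1 coefficients:", l1_model.coef_.round(2))  # some may be exactly zero
```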
Evaluation Metrics for Logistic Regression
- Accuracy: The fraction of correctly predicted observations.
- Precision: Proportion of positive identifications that are correct.
- Recall (Sensitivity): Proportion of actual positives correctly identified.
- F1-Score: The harmonic mean of precision and recall.
- ROC Curve & AUC: The ROC curve plots the true positive rate against the false positive rate; AUC quantifies overall performance.
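A minimal sketch of these metrics with scikit-learn; the labels and predicted probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.3, 0.6, 0.4, 0.2, 0.7, 0.8, 0.1])  # predicted P(y = 1)
y_pred  = (y_proba >= 0.5).astype(int)                          # default 0.5 threshold

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_proba))   # uses the probabilities, not the labels
```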
Additional Considerations
- Maximum Likelihood Estimation (MLE): Logistic regression uses MLE to estimate parameters.
- Multiclass Extensions: Methods such as one-vs-rest and softmax regression extend logistic regression to handle multiple classes.
- Threshold Tuning: Adjusting the default threshold of 0.5 can help balance precision and recall for specific applications.
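To make the threshold-tuning point concrete, here is a small sketch (same made-up probabilities as above) showing how lowering the threshold trades precision for recall:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.3, 0.6, 0.4, 0.2, 0.7, 0.8, 0.1])

for threshold in (0.5, 0.3):
    y_pred = (y_proba >= threshold).astype(int)
    print(threshold, precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```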
Classification vs. Regression
Classification (Project 2)
Classification tasks involve predicting discrete labels. For example, identifying spam emails (1 for spam, 0 for not spam) employs models like logistic regression, random forest, or SVM. Evaluation metrics include accuracy, precision, recall, F1-score, and ROC-AUC, focusing on correctly assigning observations to categories.
Regression (Project 3)
Regression tasks involve predicting continuous numerical values. For example, forecasting house prices uses linear regression. Performance is measured using metrics such as SSE, MSE, RMSE, $R^2$, and Adjusted $R^2$, emphasizing the accurate prediction of magnitudes.
Linear vs. Logistic Regression
Key Differences
- Output Nature:
  - Linear Regression outputs continuous values based on a linear equation.
  - Logistic Regression outputs probabilities via a sigmoid transformation, which are then mapped to discrete classes.
- Error Metrics:
  - Linear Regression is evaluated using metrics like SSE, MSE, RMSE, and R-squared.
  - Logistic Regression is evaluated using log loss (cross-entropy) and classification metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
- Modeling Approach:
  - Linear Regression assumes a direct linear relationship between the predictors and the response variable.
  - Logistic Regression models the log odds of the outcome as a linear combination of predictors, then applies the sigmoid function to produce probability estimates.
Understanding these concepts in depth—from regularization techniques that mitigate overfitting to the evaluation metrics tailored for continuous or categorical outcomes—provides a robust foundation for selecting and fine-tuning regression models in data science and machine learning tasks.