
Week 4 Prep: Classification & Decision Trees

February 2, 2025 · 2 min read

Machine Learning & Classification

Machine learning is a field of computer science focused on enabling computers to learn patterns and make decisions based on data, rather than following specific preprogrammed instructions.
Classification is a task that involves using machine learning algorithms to assign a class label to examples from a problem domain. A common example is marking emails as spam or not-spam based on their content.

Building a classification model typically follows these steps:

Data Collection - involves gathering raw data from sources like sensors, databases, user-generated content, or analytics.
Data Preprocessing - involves cleaning and preparing the raw data, such as handling missing values, normalizing data ranges, and converting data to correct formats.
Data Splitting - involves dividing the dataset into subsets, typically training, validation, and test sets.
Model Selection - involves choosing the most appropriate algorithm, like decision trees, logistic regression, or other classifiers.
Model Training - involves feeding training data into the selected model so it can learn the patterns and relationships.
Model Evaluation - involves testing the model on validation or test sets to assess its performance using different metrics.
Model Tuning - involves adjusting the model’s hyperparameters to improve its performance.
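
To make the preprocessing and splitting steps concrete, here is a minimal sketch in plain Python. The function names (`min_max_normalize`, `train_val_test_split`) and the 70/15/15 split are my own assumptions for illustration, not a standard API; in practice a library would usually handle this.

```python
import random

def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] (one way to normalize data ranges)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows, then cut them into training, validation, and test lists.

    The fractions and the seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

rows = list(range(100))  # stand-in for 100 collected examples
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))  # 70 15 15
print(min_max_normalize([2, 4, 6]))     # [0.0, 0.5, 1.0]
```

Shuffling before cutting matters: if the data is ordered (for example, by date or by class), an unshuffled split can leave the test set unrepresentative of the training set.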

Some common metrics for evaluating classification models include:

Confusion Matrix - summarizes the performance of a classification model as a table of counts of true positives, true negatives, false positives, and false negatives.
F1 Score - the harmonic mean of precision and recall. Precision is the ratio of correctly predicted positive observations to the total predicted positives, while recall is the ratio of correctly predicted positive observations to all actual positives.
Receiver Operating Characteristic Curve (ROC Curve) - plots the true positive rate (sensitivity, which is the same as recall) against the false positive rate as the classification threshold varies.
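
The metrics above can be computed by hand for a binary problem. The sketch below is a plain-Python illustration on made-up labels and scores; the helper names (`confusion_counts`, `precision_recall_f1`, `roc_points`) are hypothetical, and the data is invented only to show the arithmetic.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) counts for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def roc_points(y_true, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    points = []
    for threshold in thresholds:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, fp, tn, fn = confusion_counts(y_true, y_pred)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points

# Made-up example: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]  # model confidence for class 1

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)                       # 3 1 3 1
print(precision_recall_f1(tp, fp, fn))      # (0.75, 0.75, 0.75)
print(roc_points(y_true, scores, [0.0, 0.5, 1.0]))
# [(1.0, 1.0), (0.25, 0.75), (0.0, 0.0)]
```

Each threshold gives one point on the ROC curve: a threshold of 0.0 flags everything as positive (top-right corner), while a threshold above every score flags nothing (bottom-left corner).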

Two common classification algorithms are:

K-Nearest Neighbors (KNN) - a simple, instance-based learning algorithm in which the class of a new data point is determined by the majority class among its ‘k’ closest points in the training data.
Decision Tree - a flowchart-like model that classifies an example by routing it through a series of tests on feature values: each internal node applies a decision rule, each branch is an outcome of that rule, and each leaf assigns a class label.
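
Both ideas fit in a few lines of plain Python. The sketch below is only an illustration on a made-up 2D dataset: `knn_predict` uses Euclidean distance (an assumption, since other distance measures work too), and `best_split` finds a single decision rule by minimizing Gini impurity, which is one common splitting criterion and not the only option. A full decision tree would apply `best_split` recursively to each side of the split.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(points, labels):
    """Find the rule `feature <= threshold` with the lowest weighted Gini impurity.

    Returns (impurity, feature_index, threshold), or None if no split is possible.
    """
    best = None
    n = len(points)
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            left = [l for p, l in zip(points, labels) if p[feature] <= threshold]
            right = [l for p, l in zip(points, labels) if p[feature] > threshold]
            if not left or not right:
                continue  # a split must send examples down both branches
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold)
    return best

# Made-up dataset: two well-separated clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.2, 1.4)))  # a
print(knn_predict(points, labels, (6.4, 6.4)))  # b
print(best_split(points, labels))               # (0.0, 0, 2.0)
```

The split result reads as the rule "feature 0 <= 2.0", which separates the two classes perfectly here (impurity 0.0). Note the contrast: KNN stores the training data and does all its work at prediction time, while a decision tree does its work up front by learning rules.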