Intro to Data Mining: Final Project

May 8, 2025

Clustering wine based on their chemical properties using unsupervised learning techniques and comprehensive cluster analysis.

#Clustering #Unsupervised Learning #Data Mining #Wine Analysis #Final Project

Project Overview

This capstone project demonstrates unsupervised learning through clustering analysis of wine varieties. Without labeled data, the goal is to discover natural groupings based on chemical properties.

Problem Statement

Given a dataset of wines with various chemical measurements (acidity, sugar content, pH, alcohol percentage, etc.), use clustering algorithms to identify distinct wine groups and understand what chemical characteristics define each cluster.

Dataset Features

Chemical Properties

Fixed Acidity: Tartaric acid concentration
Volatile Acidity: Acetic acid concentration
Citric Acid: Adds freshness and flavor
Residual Sugar: Remaining sugar after fermentation
Chlorides: Salt content
Free Sulfur Dioxide: Prevents microbial growth
Total Sulfur Dioxide: Free + bound forms
Density: Related to sugar and alcohol content
pH: Acidity/alkalinity measure
Sulphates: Wine additive
Alcohol: Percentage by volume

Clustering Approach

Algorithms Implemented

✅

Multiple clustering algorithms are compared to find the most meaningful wine groupings.

K-Means Clustering: Partitioning into k clusters
Hierarchical Clustering: Building a cluster dendrogram
DBSCAN: Density-based clustering
Gaussian Mixture Models: Probabilistic clustering

Methodology

Data Preprocessing
- Feature scaling and normalization
- Dimensionality reduction with PCA
- Outlier detection and handling
Optimal Cluster Selection
- Elbow method
- Silhouette analysis
- Dendrogram visualization
Cluster Analysis
- Profile each cluster by feature means
- Visualize clusters in 2D/3D space
- Interpret cluster characteristics
Validation
- Silhouette score
- Davies-Bouldin index
- Calinski-Harabasz score

Results & Insights

The project delivers:

Identification of distinct wine groups
Chemical profile of each cluster
Visual representations of clustering results
Comparison of clustering algorithms
Recommendations based on cluster characteristics

Key Findings

💡

Unsupervised learning reveals patterns that might not be apparent through manual inspection.

Which chemical properties best differentiate wine types
Natural groupings in the wine dataset
Relationships between different chemical features
Quality indicators for different wine clusters

Skills Demonstrated

Unsupervised learning techniques
Cluster evaluation and validation
Dimensionality reduction (PCA, t-SNE)
Data visualization in high dimensions
Interpretation of complex results

Real-World Applications

Clustering is used in:

Customer segmentation for marketing
Image compression and segmentation
Anomaly detection
Recommendation systems
Document categorization
Genetic sequence analysis

Intro to Data Mining: Final Project

Project Overview

Problem Statement

Dataset Features

Chemical Properties

Clustering Approach

Algorithms Implemented

Methodology

Results & Insights

Key Findings

Skills Demonstrated

Real-World Applications

Interactive Notebook

Final Project Notebook