Intro to Data Mining: Final Project
May 8, 2025
Clustering wines by their chemical properties using unsupervised learning techniques and comprehensive cluster analysis.
Project Overview
This capstone project demonstrates unsupervised learning through clustering analysis of wine varieties. Because the data carries no labels, the goal is to discover natural groupings based on chemical properties alone.
Problem Statement
Given a dataset of wines with various chemical measurements (acidity, sugar content, pH, alcohol percentage, etc.), use clustering algorithms to identify distinct wine groups and understand what chemical characteristics define each cluster.
Dataset Features
Chemical Properties
- Fixed Acidity: Tartaric acid concentration
- Volatile Acidity: Acetic acid concentration
- Citric Acid: Adds freshness and flavor
- Residual Sugar: Remaining sugar after fermentation
- Chlorides: Salt content
- Free Sulfur Dioxide: Prevents microbial growth
- Total Sulfur Dioxide: Free + bound forms
- Density: Related to sugar and alcohol content
- pH: Acidity/alkalinity measure
- Sulphates: Wine additive
- Alcohol: Percentage by volume
Clustering Approach
Algorithms Implemented
Multiple clustering algorithms are compared to find the most meaningful wine groupings.
- K-Means Clustering: Partitioning into k clusters
- Hierarchical Clustering: Building a cluster dendrogram
- DBSCAN: Density-based clustering
- Gaussian Mixture Models: Probabilistic clustering
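As a rough illustration of how the four algorithms listed above can be run side by side, here is a minimal sketch using scikit-learn. The project's actual code and data are not shown in this write-up, so everything here is an assumption: the data is a synthetic stand-in generated with make_blobs (300 rows, 11 numeric features), and the helper name fit_all and the values k=3, eps=2.0, and min_samples=5 are hypothetical choices, not tuned settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the wine table (300 rows, 11 features).
X, _ = make_blobs(n_samples=300, n_features=11, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)


def fit_all(X, k=3, eps=2.0, min_samples=5, seed=0):
    """Fit each algorithm and return a dict of name -> cluster labels.

    Hypothetical helper; k, eps, and min_samples are illustrative defaults.
    """
    return {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X),
        "hierarchical": AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X),
        "dbscan": DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X),
        "gmm": GaussianMixture(n_components=k, random_state=seed).fit_predict(X),
    }


labels = fit_all(X_scaled)
for name, lab in labels.items():
    # DBSCAN marks noise points with the label -1; exclude it from the count.
    n_clusters = len(set(lab.tolist())) - (1 if -1 in lab else 0)
    n_noise = int(np.sum(lab == -1))
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

K-Means, hierarchical clustering, and the Gaussian mixture all take the number of clusters as an input, while DBSCAN infers it from eps and min_samples, so its cluster count can differ from the others and is sensitive to feature scaling.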
Methodology
- Data Preprocessing
  - Feature scaling and normalization
  - Dimensionality reduction with PCA
  - Outlier detection and handling
- Optimal Cluster Selection
  - Elbow method
  - Silhouette analysis
  - Dendrogram visualization
- Cluster Analysis
  - Profile each cluster by feature means
  - Visualize clusters in 2D/3D space
  - Interpret cluster characteristics
- Validation
  - Silhouette score
  - Davies-Bouldin index
  - Calinski-Harabasz score
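The preprocessing, cluster selection, and validation steps above could be sketched roughly as follows with scikit-learn. This is not the project's actual pipeline: the data is a synthetic stand-in from make_blobs, the candidate range k = 2..7 is an arbitrary assumption, and K-Means is used as the example algorithm only because it exposes the inertia needed for an elbow plot.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the wine table (300 rows, 11 features).
X, _ = make_blobs(n_samples=300, n_features=11, centers=3, random_state=0)

# Preprocessing: standardize each feature, then project to 2D for plotting.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Cluster selection: record inertia (for an elbow plot) and the mean
# silhouette for each candidate k, then keep the k with the best silhouette.
inertia, silhouette = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia[k] = km.inertia_
    silhouette[k] = silhouette_score(X_scaled, km.labels_)
best_k = max(silhouette, key=silhouette.get)

# Validation: score the final clustering with the three internal metrics.
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)
scores = {
    "silhouette": silhouette_score(X_scaled, final_labels),                 # higher is better
    "davies_bouldin": davies_bouldin_score(X_scaled, final_labels),         # lower is better
    "calinski_harabasz": calinski_harabasz_score(X_scaled, final_labels),   # higher is better
}
print(f"best k by silhouette: {best_k}")
for metric, value in scores.items():
    print(f"{metric}: {value:.3f}")
```

The three metrics do not share a direction: silhouette and Calinski-Harabasz improve as they increase, while a lower Davies-Bouldin index indicates better-separated clusters. Clustering here is done on the scaled features, with the 2D PCA projection reserved for visualization.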
Results & Insights
The project delivers:
- Identification of distinct wine groups
- Chemical profile of each cluster
- Visual representations of clustering results
- Comparison of clustering algorithms
- Recommendations based on cluster characteristics
Key Findings
Unsupervised learning reveals patterns that might not be apparent through manual inspection, including:
- Which chemical properties best differentiate wine types
- Natural groupings in the wine dataset
- Relationships between different chemical features
- Quality indicators for different wine clusters
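One possible way to approach the first question, which properties best differentiate the groups, is to profile each cluster by its mean standardized feature values and rank features by how far apart those means are. The sketch below is only an assumption about how this could be done, not the project's method. The feature names come from the list in this write-up, but the data is synthetic (make_blobs), so the resulting ranking has no chemical meaning; the choice of k=3 and the range-of-means criterion are likewise illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# ASSUMPTION: synthetic values stand in for real measurements, so the
# ranking printed below says nothing about actual wine chemistry.
X, _ = make_blobs(n_samples=300, n_features=len(FEATURES), centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Profile: one row per cluster, one column per feature (mean z-score).
cluster_ids = np.unique(labels)
profile = np.vstack([X_scaled[labels == c].mean(axis=0) for c in cluster_ids])

# Rank features by the gap between the highest and lowest cluster mean.
spread = profile.max(axis=0) - profile.min(axis=0)
ranking = [FEATURES[i] for i in np.argsort(spread)[::-1]]

for name in ranking[:3]:
    print(f"{name}: spread of cluster means = {spread[FEATURES.index(name)]:.2f}")
```

Because the features are standardized first, the spreads are comparable across properties measured in different units.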
Skills Demonstrated
- Unsupervised learning techniques
- Cluster evaluation and validation
- Dimensionality reduction (PCA, t-SNE)
- Visualization of high-dimensional data
- Interpretation of complex results
Real-World Applications
Clustering is used in:
- Customer segmentation for marketing
- Image compression and segmentation
- Anomaly detection
- Recommendation systems
- Document categorization
- Genetic sequence analysis