Project 2: Classification¶

Philip Vishnevsky
2025-02-27

1. Problem Introduction¶

1.1 Objectives¶

In this project, we aim to classify used cars into their respective brands using a comprehensive dataset of used cars from the US. The dataset includes a wide range of features such as physical dimensions (e.g., legroom, length, width, wheelbase), engine details (e.g., engine cylinders, engine displacement, horsepower), fuel attributes (e.g., fuel type, fuel tank volume, highway and city fuel economy), and market indicators like price, mileage, and days on market.

1.2 Questions to Answer¶

  • Brand Prediction: Can we accurately predict the car's brand based on its attributes?
  • Feature Influence: Which features are most important for distinguishing between different car brands?
  • Market Trends: How do key market attributes such as price and mileage vary across brands?
  • Attribute Signals: Do physical and engine specifications provide strong signals for brand classification?

2. Data Introduction¶

We are using the US Used Cars Dataset from Kaggle -- a comprehensive collection of data on used cars in the United States. The original dataset contains a broad set of attributes, with 3,000,000 rows and 66 columns; we have refined it for our brand classification task. After initial cleaning, we focus on a subset of features that best capture the characteristics of each vehicle.

The 27 features in our refined dataset are:

  • make_name: Car brand (our target variable).
  • back_legroom: Rear legroom measurement.
  • body_type: Vehicle body type (e.g., sedan, SUV, truck).
  • city: Originally a high-cardinality variable that we replaced with a frequency-encoded feature (city_freq) representing the count of vehicles from each city.
  • city_fuel_economy: Fuel economy in city driving.
  • daysonmarket: Number of days the vehicle has been listed.
  • engine_cylinders: Number of engine cylinders (e.g., 4, 6, or 8).
  • engine_displacement: Engine displacement.
  • franchise_dealer: Indicator if the vehicle was sold by a franchise dealer.
  • front_legroom: Front legroom measurement.
  • fuel_tank_volume: Volume of the fuel tank.
  • fuel_type: Type of fuel used (e.g., Petrol, Diesel).
  • height: Vehicle height.
  • highway_fuel_economy: Fuel economy on the highway.
  • horsepower: Engine horsepower.
  • is_new: Boolean indicator for whether the car is new.
  • length: Vehicle length.
  • maximum_seating: Seating capacity.
  • mileage: Total mileage.
  • owner_count: Number of previous owners.
  • price: Listed price of the car.
  • savings_amount: Discount or savings offered.
  • transmission: Type of transmission (e.g., Automatic, Manual).
  • wheel_system: Drive system (e.g., FWD, AWD).
  • wheelbase: Distance between the front and rear wheels.
  • width: Vehicle width.
  • year: Model year.
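To make the city_freq idea above concrete, here is a minimal sketch of frequency encoding on a toy series (synthetic city names, for illustration only):

```python
import pandas as pd

# Each city is replaced by the number of listings from that city,
# so one numeric column stands in for thousands of one-hot columns.
cities = pd.Series(["Houston", "Houston", "Austin", "Houston", "Austin", "Dallas"])
city_counts = cities.value_counts()
city_freq = cities.map(city_counts)
print(city_freq.tolist())  # [3, 3, 2, 3, 2, 1]
```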

3. Data Pre-Processing¶

Before modeling, the dataset must undergo several pre-processing steps:

  1. Removing Unneeded Features: We explicitly list columns to keep, ensuring irrelevant features are excluded from modeling. For example, we drop columns such as vin, bed, description, listing_id, latitude, longitude, exterior_color, interior_color, listed_date, listing_color, and many others that are not needed for brand classification.

  2. Removing Rare Brands: We remove car brands with fewer than 5,000 listings to focus on the most common brands. This improves class balance and model stability by reducing the noise introduced by makes with very few rows, and it simplifies the classification task from roughly 100 brands to around 30.

  3. Data Type Conversion: Some numeric values are stored as strings with units (e.g., "38.3 in" or "13.2 gal"). We extract the numeric part and convert these to floats. Boolean columns, such as franchise_dealer and is_new, are converted to integers.

  4. Imputing Missing Values: Missing numerical values (e.g., in dimensions and engine specs) are imputed with the column median, while categorical and cylinder-count values are filled with the mode. This step ensures no null values remain in the dataset.

  5. Categorical Encoding: Features such as body_type, fuel_type, transmission, and wheel_system are one-hot encoded. The city column, originally high in cardinality (over 4600 types), is transformed via frequency encoding to avoid a massive expansion of features.

After these steps, the dataset contains only the cleaned and relevant features, with each row representing a car and its attributes, and the target variable (make_name) indicating the car’s brand, which we aim to predict using the aforementioned features.
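The conversion, imputation, and encoding steps above can be sketched end-to-end on a toy frame (synthetic values; the actual pipeline in Section 3.3 operates on the full dataset):

```python
import numpy as np
import pandas as pd

# Three synthetic rows exercising steps 3-5 on a miniature scale
toy = pd.DataFrame({
    "back_legroom": ["35.1 in", "38.1 in", np.nan],
    "is_new": [True, False, True],
    "fuel_type": ["Gasoline", np.nan, "Diesel"],
})

# Step 3: strip units -- extract the numeric part and cast to float
toy["back_legroom"] = pd.to_numeric(
    toy["back_legroom"].astype(str).str.extract(r"([\d\.]+)", expand=False),
    errors="coerce",
)

# Step 4: impute -- median for numeric, mode for categorical
toy["back_legroom"] = toy["back_legroom"].fillna(toy["back_legroom"].median())
toy["fuel_type"] = toy["fuel_type"].fillna(toy["fuel_type"].mode()[0])

# Step 5: encode -- booleans to 0/1, categoricals one-hot
toy["is_new"] = toy["is_new"].astype(int)
toy = pd.get_dummies(toy, columns=["fuel_type"], drop_first=True)

print(toy)
assert toy.isnull().sum().sum() == 0  # no nulls remain
```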

3.0 Python Imports¶

In [13]:
# Import data libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Import Models
from sklearn.ensemble import RandomForestClassifier
# from sklearn.svm import SVC

# Import Metrics
from sklearn.metrics import classification_report, confusion_matrix

# Suppress Warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)

Note: A support vector model was originally tested, but due to the size of the dataset and the number of features, SVC training times proved infeasible. An ensemble model such as random forest was therefore the more practical choice.
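If a support vector model were still desired, one common workaround (not used in this project) is to fit a linear SVM on a stratified subsample, keeping class proportions intact while holding the SVM's superlinear training cost in check. A minimal sketch on synthetic data (the sizes here are hypothetical stand-ins for the real feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic multi-class data standing in for the car-brand problem
X, y = make_classification(n_samples=20000, n_features=20,
                           n_informative=10, n_classes=5, random_state=42)

# Stratified 10% subsample: class proportions are preserved
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.1,
                                      stratify=y, random_state=42)

svm = LinearSVC(random_state=42).fit(X_sub, y_sub)
print(X_sub.shape)  # (2000, 20)
```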

3.1 Data Preview¶

In [16]:
# Load the dataset (note: uses significant memory)
path = "./data/used_cars_data.csv"
df = pd.read_csv(path, low_memory=False)
In [17]:
# We will define a function to display some detailed information about our dataframe

def show_info(dataframe):
    # Display first few rows
    print("\nFirst Few Rows\n")
    display(dataframe.head())
    
    # Basic stats
    print("\nBasic Numeric Stats\n")
    display(dataframe.describe())
    
    print("\nShape & Cols\n")
    print(dataframe.shape)
    print(dataframe.columns)
    
    # Show column info
    print("\nDetailed Column Info\n")
    dataframe.info(verbose=True, show_counts=True)
In [18]:
show_info(df)
First Few Rows

vin back_legroom bed bed_height bed_length body_type cabin city city_fuel_economy combine_fuel_economy ... transmission transmission_display trimId trim_name vehicle_damage_category wheel_system wheel_system_display wheelbase width year
0 ZACNJABB5KPJ92081 35.1 in NaN NaN NaN SUV / Crossover NaN Bayamon NaN NaN ... A 9-Speed Automatic Overdrive t83804 Latitude FWD NaN FWD Front-Wheel Drive 101.2 in 79.6 in 2019
1 SALCJ2FX1LH858117 38.1 in NaN NaN NaN SUV / Crossover NaN San Juan NaN NaN ... A 9-Speed Automatic Overdrive t86759 S AWD NaN AWD All-Wheel Drive 107.9 in 85.6 in 2020
2 JF1VA2M67G9829723 35.4 in NaN NaN NaN Sedan NaN Guaynabo 17.0 NaN ... M 6-Speed Manual t58994 Base NaN AWD All-Wheel Drive 104.3 in 78.9 in 2016
3 SALRR2RV0L2433391 37.6 in NaN NaN NaN SUV / Crossover NaN San Juan NaN NaN ... A 8-Speed Automatic Overdrive t86074 V6 HSE AWD NaN AWD All-Wheel Drive 115 in 87.4 in 2020
4 SALCJ2FXXLH862327 38.1 in NaN NaN NaN SUV / Crossover NaN San Juan NaN NaN ... A 9-Speed Automatic Overdrive t86759 S AWD NaN AWD All-Wheel Drive 107.9 in 85.6 in 2020

5 rows × 66 columns

Basic Numeric Stats

city_fuel_economy combine_fuel_economy daysonmarket engine_displacement highway_fuel_economy horsepower is_certified latitude listing_id longitude mileage owner_count price savings_amount seller_rating sp_id vehicle_damage_category year
count 2.508755e+06 0.0 3.000040e+06 2.827654e+06 2.508755e+06 2.827654e+06 0.0 3.000040e+06 3.000040e+06 3.000040e+06 2.855653e+06 1.483027e+06 3.000040e+06 3.000040e+06 2.959168e+06 2.999944e+06 0.0 3.000040e+06
mean 2.269325e+01 NaN 7.605973e+01 2.968488e+03 2.947337e+01 2.479957e+02 NaN 3.698414e+01 2.754987e+08 -9.064224e+01 3.114690e+04 1.532644e+00 2.993337e+04 5.509768e+02 4.270413e+00 2.335224e+05 NaN 2.017728e+03
std 8.807024e+00 NaN 1.088839e+02 1.348905e+03 7.769252e+00 9.046639e+01 NaN 4.996819e+00 8.894123e+06 1.390589e+01 7.458675e+04 9.202928e-01 1.956617e+04 1.079448e+03 5.133017e-01 1.323221e+05 NaN 4.178701e+00
min 7.000000e+00 NaN 0.000000e+00 7.000000e+02 1.000000e+01 5.500000e+01 NaN 1.834670e+01 1.994620e+07 -1.579280e+02 0.000000e+00 1.000000e+00 1.650000e+02 0.000000e+00 1.000000e+00 4.159300e+04 NaN 1.915000e+03
25% 1.800000e+01 NaN 1.400000e+01 2.000000e+03 2.500000e+01 1.750000e+02 NaN 3.350920e+01 2.745794e+08 -9.708820e+01 6.000000e+00 1.000000e+00 1.845100e+04 0.000000e+00 4.000000e+00 6.337500e+04 NaN 2.017000e+03
50% 2.100000e+01 NaN 3.500000e+01 2.500000e+03 2.900000e+01 2.440000e+02 NaN 3.784710e+01 2.785453e+08 -8.724950e+01 8.267000e+03 1.000000e+00 2.647700e+04 0.000000e+00 4.341463e+00 2.816270e+05 NaN 2.020000e+03
75% 2.600000e+01 NaN 8.200000e+01 3.600000e+03 3.300000e+01 3.000000e+02 NaN 4.100620e+01 2.804553e+08 -8.045490e+01 4.366200e+04 2.000000e+00 3.822000e+04 7.850000e+02 4.605263e+00 3.366140e+05 NaN 2.020000e+03
max 1.270000e+02 NaN 3.599000e+03 8.400000e+03 1.270000e+02 1.001000e+03 NaN 6.120310e+01 2.820222e+08 -6.607850e+01 9.999999e+07 1.900000e+01 3.299995e+06 1.474140e+05 5.000000e+00 4.409510e+05 NaN 2.021000e+03
Shape & Cols

(3000040, 66)
Index(['vin', 'back_legroom', 'bed', 'bed_height', 'bed_length', 'body_type',
       'cabin', 'city', 'city_fuel_economy', 'combine_fuel_economy',
       'daysonmarket', 'dealer_zip', 'description', 'engine_cylinders',
       'engine_displacement', 'engine_type', 'exterior_color', 'fleet',
       'frame_damaged', 'franchise_dealer', 'franchise_make', 'front_legroom',
       'fuel_tank_volume', 'fuel_type', 'has_accidents', 'height',
       'highway_fuel_economy', 'horsepower', 'interior_color', 'isCab',
       'is_certified', 'is_cpo', 'is_new', 'is_oemcpo', 'latitude', 'length',
       'listed_date', 'listing_color', 'listing_id', 'longitude',
       'main_picture_url', 'major_options', 'make_name', 'maximum_seating',
       'mileage', 'model_name', 'owner_count', 'power', 'price', 'salvage',
       'savings_amount', 'seller_rating', 'sp_id', 'sp_name', 'theft_title',
       'torque', 'transmission', 'transmission_display', 'trimId', 'trim_name',
       'vehicle_damage_category', 'wheel_system', 'wheel_system_display',
       'wheelbase', 'width', 'year'],
      dtype='object')

Detailed Column Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000040 entries, 0 to 3000039
Data columns (total 66 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   vin                      3000040 non-null  object 
 1   back_legroom             2840771 non-null  object 
 2   bed                      19568 non-null    object 
 3   bed_height               429098 non-null   object 
 4   bed_length               429098 non-null   object 
 5   body_type                2986497 non-null  object 
 6   cabin                    63533 non-null    object 
 7   city                     3000040 non-null  object 
 8   city_fuel_economy        2508755 non-null  float64
 9   combine_fuel_economy     0 non-null        float64
 10  daysonmarket             3000040 non-null  int64  
 11  dealer_zip               3000040 non-null  object 
 12  description              2922139 non-null  object 
 13  engine_cylinders         2899459 non-null  object 
 14  engine_displacement      2827654 non-null  float64
 15  engine_type              2899459 non-null  object 
 16  exterior_color           2950085 non-null  object 
 17  fleet                    1573445 non-null  object 
 18  frame_damaged            1573445 non-null  object 
 19  franchise_dealer         3000040 non-null  bool   
 20  franchise_make           2427405 non-null  object 
 21  front_legroom            2840771 non-null  object 
 22  fuel_tank_volume         2840771 non-null  object 
 23  fuel_type                2917316 non-null  object 
 24  has_accidents            1573445 non-null  object 
 25  height                   2840771 non-null  object 
 26  highway_fuel_economy     2508755 non-null  float64
 27  horsepower               2827654 non-null  float64
 28  interior_color           2616054 non-null  object 
 29  isCab                    1573445 non-null  object 
 30  is_certified             0 non-null        float64
 31  is_cpo                   182898 non-null   object 
 32  is_new                   3000040 non-null  bool   
 33  is_oemcpo                135362 non-null   object 
 34  latitude                 3000040 non-null  float64
 35  length                   2840771 non-null  object 
 36  listed_date              3000040 non-null  object 
 37  listing_color            3000040 non-null  object 
 38  listing_id               3000040 non-null  int64  
 39  longitude                3000040 non-null  float64
 40  main_picture_url         2630947 non-null  object 
 41  major_options            2799992 non-null  object 
 42  make_name                3000040 non-null  object 
 43  maximum_seating          2840771 non-null  object 
 44  mileage                  2855653 non-null  float64
 45  model_name               3000040 non-null  object 
 46  owner_count              1483027 non-null  float64
 47  power                    2518614 non-null  object 
 48  price                    3000040 non-null  float64
 49  salvage                  1573445 non-null  object 
 50  savings_amount           3000040 non-null  int64  
 51  seller_rating            2959168 non-null  float64
 52  sp_id                    2999944 non-null  float64
 53  sp_name                  3000040 non-null  object 
 54  theft_title              1573445 non-null  object 
 55  torque                   2482247 non-null  object 
 56  transmission             2935855 non-null  object 
 57  transmission_display     2935855 non-null  object 
 58  trimId                   2884213 non-null  object 
 59  trim_name                2883746 non-null  object 
 60  vehicle_damage_category  0 non-null        float64
 61  wheel_system             2853308 non-null  object 
 62  wheel_system_display     2853308 non-null  object 
 63  wheelbase                2840771 non-null  object 
 64  width                    2840771 non-null  object 
 65  year                     3000040 non-null  int64  
dtypes: bool(2), float64(14), int64(4), object(46)
memory usage: 1.4+ GB

3.2 Trends¶

We observe the following trends from this preview of the dataset:

  • Many columns contain null/missing data
  • Many columns contain data irrelevant to brand classification
  • Certain features are stored as string objects when they should be floats (e.g., "35.1 in" in the legroom columns)

3.3 Pre-Processing Steps¶

We will address the issues observed in 3.2 by preprocessing the dataset: removing null values, enforcing consistent types, and encoding categorical variables. To start, we have carefully selected the most significant and relevant columns from the original dataset; all other features will be removed for the purposes of this project.

3.3.1 Column Removal¶

In [24]:
# We are only interested in keeping the below 27 columns
cols_to_keep = [
    "back_legroom", "body_type", "city", "city_fuel_economy", "daysonmarket",
    "engine_cylinders", "engine_displacement", "franchise_dealer",
    "front_legroom", "fuel_tank_volume", "fuel_type", "height",
    "highway_fuel_economy", "horsepower", "is_new", "length", "maximum_seating",
    "mileage", "owner_count", "price", "savings_amount", "transmission",
    "wheel_system", "wheelbase", "width", "year", "make_name"
]

df_dropped = df[cols_to_keep].copy()
In [25]:
# Call our earlier function
show_info(df_dropped)
First Few Rows

back_legroom body_type city city_fuel_economy daysonmarket engine_cylinders engine_displacement franchise_dealer front_legroom fuel_tank_volume ... mileage owner_count price savings_amount transmission wheel_system wheelbase width year make_name
0 35.1 in SUV / Crossover Bayamon NaN 522 I4 1300.0 True 41.2 in 12.7 gal ... 7.0 NaN 23141.0 0 A FWD 101.2 in 79.6 in 2019 Jeep
1 38.1 in SUV / Crossover San Juan NaN 207 I4 2000.0 True 39.1 in 17.7 gal ... 8.0 NaN 46500.0 0 A AWD 107.9 in 85.6 in 2020 Land Rover
2 35.4 in Sedan Guaynabo 17.0 1233 H4 2500.0 True 43.3 in 15.9 gal ... NaN 3.0 46995.0 0 M AWD 104.3 in 78.9 in 2016 Subaru
3 37.6 in SUV / Crossover San Juan NaN 196 V6 3000.0 True 39 in 23.5 gal ... 11.0 NaN 67430.0 0 A AWD 115 in 87.4 in 2020 Land Rover
4 38.1 in SUV / Crossover San Juan NaN 137 I4 2000.0 True 39.1 in 17.7 gal ... 7.0 NaN 48880.0 0 A AWD 107.9 in 85.6 in 2020 Land Rover

5 rows × 27 columns

Basic Numeric Stats

city_fuel_economy daysonmarket engine_displacement highway_fuel_economy horsepower mileage owner_count price savings_amount year
count 2.508755e+06 3.000040e+06 2.827654e+06 2.508755e+06 2.827654e+06 2.855653e+06 1.483027e+06 3.000040e+06 3.000040e+06 3.000040e+06
mean 2.269325e+01 7.605973e+01 2.968488e+03 2.947337e+01 2.479957e+02 3.114690e+04 1.532644e+00 2.993337e+04 5.509768e+02 2.017728e+03
std 8.807024e+00 1.088839e+02 1.348905e+03 7.769252e+00 9.046639e+01 7.458675e+04 9.202928e-01 1.956617e+04 1.079448e+03 4.178701e+00
min 7.000000e+00 0.000000e+00 7.000000e+02 1.000000e+01 5.500000e+01 0.000000e+00 1.000000e+00 1.650000e+02 0.000000e+00 1.915000e+03
25% 1.800000e+01 1.400000e+01 2.000000e+03 2.500000e+01 1.750000e+02 6.000000e+00 1.000000e+00 1.845100e+04 0.000000e+00 2.017000e+03
50% 2.100000e+01 3.500000e+01 2.500000e+03 2.900000e+01 2.440000e+02 8.267000e+03 1.000000e+00 2.647700e+04 0.000000e+00 2.020000e+03
75% 2.600000e+01 8.200000e+01 3.600000e+03 3.300000e+01 3.000000e+02 4.366200e+04 2.000000e+00 3.822000e+04 7.850000e+02 2.020000e+03
max 1.270000e+02 3.599000e+03 8.400000e+03 1.270000e+02 1.001000e+03 9.999999e+07 1.900000e+01 3.299995e+06 1.474140e+05 2.021000e+03
Shape & Cols

(3000040, 27)
Index(['back_legroom', 'body_type', 'city', 'city_fuel_economy',
       'daysonmarket', 'engine_cylinders', 'engine_displacement',
       'franchise_dealer', 'front_legroom', 'fuel_tank_volume', 'fuel_type',
       'height', 'highway_fuel_economy', 'horsepower', 'is_new', 'length',
       'maximum_seating', 'mileage', 'owner_count', 'price', 'savings_amount',
       'transmission', 'wheel_system', 'wheelbase', 'width', 'year',
       'make_name'],
      dtype='object')

Detailed Column Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000040 entries, 0 to 3000039
Data columns (total 27 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   back_legroom          2840771 non-null  object 
 1   body_type             2986497 non-null  object 
 2   city                  3000040 non-null  object 
 3   city_fuel_economy     2508755 non-null  float64
 4   daysonmarket          3000040 non-null  int64  
 5   engine_cylinders      2899459 non-null  object 
 6   engine_displacement   2827654 non-null  float64
 7   franchise_dealer      3000040 non-null  bool   
 8   front_legroom         2840771 non-null  object 
 9   fuel_tank_volume      2840771 non-null  object 
 10  fuel_type             2917316 non-null  object 
 11  height                2840771 non-null  object 
 12  highway_fuel_economy  2508755 non-null  float64
 13  horsepower            2827654 non-null  float64
 14  is_new                3000040 non-null  bool   
 15  length                2840771 non-null  object 
 16  maximum_seating       2840771 non-null  object 
 17  mileage               2855653 non-null  float64
 18  owner_count           1483027 non-null  float64
 19  price                 3000040 non-null  float64
 20  savings_amount        3000040 non-null  int64  
 21  transmission          2935855 non-null  object 
 22  wheel_system          2853308 non-null  object 
 23  wheelbase             2840771 non-null  object 
 24  width                 2840771 non-null  object 
 25  year                  3000040 non-null  int64  
 26  make_name             3000040 non-null  object 
dtypes: bool(2), float64(7), int64(3), object(15)
memory usage: 577.9+ MB

3.3.2 Rare Brand Removal¶

In [27]:
# Drop brands with fewer than 5,000 rows
make_counts = df_dropped['make_name'].value_counts()
valid_makes = make_counts[make_counts >= 5000].index
df_dropped = df_dropped[df_dropped['make_name'].isin(valid_makes)].copy()

# Display shape and remaining brands
print("Shape After Brand Removal:", df_dropped.shape)
print("Remaining makes:", df_dropped['make_name'].unique())
Shape After Brand Removal: (2968284, 27)
Remaining makes: ['Jeep' 'Land Rover' 'Subaru' 'Mazda' 'BMW' 'Hyundai' 'Chevrolet' 'Lexus'
 'Cadillac' 'Chrysler' 'Dodge' 'Mercedes-Benz' 'Nissan' 'Honda' 'Kia'
 'Ford' 'Lincoln' 'Audi' 'Jaguar' 'Volkswagen' 'RAM' 'Porsche' 'Toyota'
 'INFINITI' 'GMC' 'Acura' 'Volvo' 'Mitsubishi' 'Buick' 'MINI']

3.3.3 Data Type Conversion¶

In [29]:
# engine_cylinders is stored as an engine configuration string (e.g., 'V6')
display(df_dropped['engine_cylinders'].unique())
# We need to extract cylinder counts
df_dropped['engine_cylinders'] = df_dropped['engine_cylinders'] \
    .str.extract(r'(\d+)', expand=False) \
    .astype(float)
# Results
display(df_dropped['engine_cylinders'].unique())
array(['I4', 'H4', 'V6', 'I6', 'V6 Diesel', nan, 'V8', 'V8 Biodiesel',
       'V8 Flex Fuel Vehicle', 'V6 Flex Fuel Vehicle', 'I4 Hybrid',
       'V6 Hybrid', 'I3', 'I5', 'I4 Flex Fuel Vehicle', 'I4 Diesel',
       'I6 Diesel', 'V8 Diesel', 'R2', 'V10', 'V12', 'V6 Biodiesel', 'H6',
       'I2', 'I5 Biodiesel', 'V8 Hybrid', 'H4 Hybrid',
       'V8 Compressed Natural Gas', 'W12', 'I6 Hybrid', 'I5 Diesel', 'W8',
       'I4 Compressed Natural Gas', 'V10 Diesel', 'V8 Propane',
       'I3 Hybrid', 'V6 Compressed Natural Gas'], dtype=object)
array([ 4.,  6., nan,  8.,  3.,  5.,  2., 10., 12.])
In [30]:
# Clean up other numerical features

measurement_cols = [
    'back_legroom', 'front_legroom', 'fuel_tank_volume',
    'height', 'length', 'maximum_seating', 'wheelbase', 'width'
]

# Convert these columns from strings with units to numeric floats
for col in measurement_cols:
    # Extract the first occurrence of a number (including decimals)
    df_dropped[col] = df_dropped[col].astype(str).str.extract(r'([\d\.]+)', expand=False)
    # Convert the extracted string to a float
    df_dropped[col] = pd.to_numeric(df_dropped[col], errors='coerce')
    # Impute missing values with the median of the column
    # (assign back rather than fillna(inplace=True) to avoid chained-assignment issues)
    df_dropped[col] = df_dropped[col].fillna(df_dropped[col].median())

3.3.4 Impute Missing Values¶

In [32]:
# Display only columns with null values, and count how many columns have them
missing_values = df_dropped.isnull().sum()
display(missing_values[missing_values > 0])
missing_value_count = missing_values[missing_values > 0].count()
print(f"Number of columns with missing values: {missing_value_count}")
body_type                 12843
city_fuel_economy        481848
engine_cylinders          96393
engine_displacement      164276
fuel_type                 80986
highway_fuel_economy     481848
horsepower               164276
mileage                  143406
owner_count             1508247
transmission              63484
wheel_system             141177
dtype: int64
Number of columns with missing values: 11
In [33]:
# Impute categorical columns with the mode
categorical_cols = [
    'body_type', 'fuel_type', 'transmission', 'wheel_system'
]

for col in categorical_cols:
    mode_val = df_dropped[col].mode()[0]
    df_dropped[col] = df_dropped[col].fillna(mode_val)
In [34]:
# Impute missing engine_cylinders with the mode
mode_value = df_dropped['engine_cylinders'].mode()[0]
df_dropped['engine_cylinders'] = df_dropped['engine_cylinders'].fillna(mode_value)
In [35]:
# Impute remaining numerical columns
num_cols = [
    'city_fuel_economy',   # float64
    'daysonmarket',        # int64
    'engine_displacement', # float64
    'highway_fuel_economy',# float64
    'horsepower',          # float64
    'mileage',             # float64
    'owner_count',         # float64
    'price',               # float64
    'savings_amount',      # int64
    'year'                 # int64
]

for col in num_cols:
    # For the 'year' column, we use the mode as it may have a dominant production year.
    if col == 'year':
        impute_value = df_dropped[col].mode()[0]
    else:
        impute_value = df_dropped[col].median()
    
    df_dropped[col] = df_dropped[col].fillna(impute_value)
In [36]:
# Display only columns with null values, and count how many columns have them
missing_values = df_dropped.isnull().sum()
display(missing_values[missing_values > 0])
missing_value_count = missing_values[missing_values > 0].count()
print(f"Number of columns with missing values: {missing_value_count}")
Series([], dtype: int64)
Number of columns with missing values: 0

3.3.5 Categorical Encoding¶

In [38]:
# Convert boolean columns to integers (0/1)
bool_cols = ['franchise_dealer', 'is_new']
for col in bool_cols:
    df_dropped[col] = df_dropped[col].astype(int)

# Encode the high-cardinality 'city' column using frequency encoding
city_counts = df_dropped['city'].value_counts()
df_dropped['city_freq'] = df_dropped['city'].map(city_counts)
df_dropped.drop('city', axis=1, inplace=True)

# Specify the categorical columns to encode
categorical_cols = ['body_type', 'fuel_type', 'transmission', 'wheel_system']

# One-hot encode these categorical features
# Use drop_first=True to avoid the dummy variable trap (multicollinearity)
df_encoded = pd.get_dummies(df_dropped, columns=categorical_cols, drop_first=True)

Now that we have completed all our preprocessing steps, let us view the prepared dataframe.

In [40]:
show_info(df_encoded)
First Few Rows

back_legroom city_fuel_economy daysonmarket engine_cylinders engine_displacement franchise_dealer front_legroom fuel_tank_volume height highway_fuel_economy ... fuel_type_Gasoline fuel_type_Hybrid fuel_type_Propane transmission_CVT transmission_Dual Clutch transmission_M wheel_system_4X2 wheel_system_AWD wheel_system_FWD wheel_system_RWD
0 35.1 21.0 522 4.0 1300.0 1 41.2 12.7 66.5 29.0 ... True False False False False False False False True False
1 38.1 21.0 207 4.0 2000.0 1 39.1 17.7 68.0 29.0 ... True False False False False False False True False False
2 35.4 17.0 1233 4.0 2500.0 1 43.3 15.9 58.1 23.0 ... True False False False False True False True False False
3 37.6 21.0 196 6.0 3000.0 1 39.0 23.5 73.0 29.0 ... True False False False False False False True False False
4 38.1 21.0 137 4.0 2000.0 1 39.1 17.7 68.0 29.0 ... True False False False False False False True False False

5 rows × 45 columns

Basic Numeric Stats

back_legroom city_fuel_economy daysonmarket engine_cylinders engine_displacement franchise_dealer front_legroom fuel_tank_volume height highway_fuel_economy ... length maximum_seating mileage owner_count price savings_amount wheelbase width year city_freq
count 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 ... 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06 2.968284e+06
mean 3.800007e+01 2.239953e+01 7.571930e+01 5.086334e+00 2.941816e+03 8.133187e-01 4.219773e+01 1.857000e+01 6.593519e+01 2.937974e+01 ... 1.935921e+02 5.459802e+00 2.981266e+04 1.257736e+00 2.973176e+04 5.428629e+02 1.151673e+02 7.834571e+01 2.017781e+03 4.240457e+03
std 3.054381e+00 7.863222e+00 1.079200e+02 1.447159e+00 1.315140e+03 3.896555e-01 1.465058e+00 5.325202e+00 7.350079e+00 6.921976e+00 ... 1.970834e+01 1.150393e+00 7.306362e+04 6.889402e-01 1.771610e+04 9.973965e+02 1.418191e+01 7.490284e+00 4.052912e+00 6.407033e+03
min 0.000000e+00 8.000000e+00 0.000000e+00 2.000000e+00 7.000000e+02 0.000000e+00 0.000000e+00 1.900000e+00 4.370000e+01 1.000000e+01 ... 1.420000e+02 2.000000e+00 0.000000e+00 1.000000e+00 1.650000e+02 0.000000e+00 8.340000e+01 5.500000e+01 1.915000e+03 1.000000e+00
25% 3.620000e+01 1.800000e+01 1.400000e+01 4.000000e+00 2.000000e+03 1.000000e+00 4.110000e+01 1.480000e+01 5.830000e+01 2.500000e+01 ... 1.821000e+02 5.000000e+00 6.000000e+00 1.000000e+00 1.849500e+04 0.000000e+00 1.063000e+02 7.250000e+01 2.017000e+03 8.360000e+02
50% 3.820000e+01 2.100000e+01 3.500000e+01 4.000000e+00 2.500000e+03 1.000000e+00 4.200000e+01 1.770000e+01 6.610000e+01 2.900000e+01 ... 1.900000e+02 5.000000e+00 7.794000e+03 1.000000e+00 2.644800e+04 0.000000e+00 1.112000e+02 7.730000e+01 2.020000e+03 2.098000e+03
75% 3.970000e+01 2.500000e+01 8.200000e+01 6.000000e+00 3.500000e+03 1.000000e+00 4.300000e+01 2.100000e+01 7.000000e+01 3.200000e+01 ... 2.000000e+02 6.000000e+00 4.128300e+04 1.000000e+00 3.805000e+04 7.820000e+02 1.181000e+02 8.250000e+01 2.020000e+03 5.043000e+03
max 5.980000e+01 1.270000e+02 3.599000e+03 1.200000e+01 8.400000e+03 1.000000e+00 6.700000e+01 6.400000e+01 1.176000e+02 1.220000e+02 ... 2.941000e+02 1.500000e+01 9.999999e+07 1.900000e+01 2.698500e+06 1.300210e+05 2.045000e+02 1.090000e+02 2.021000e+03 4.317500e+04

8 rows × 22 columns

Shape & Cols

(2968284, 45)
Index(['back_legroom', 'city_fuel_economy', 'daysonmarket', 'engine_cylinders',
       'engine_displacement', 'franchise_dealer', 'front_legroom',
       'fuel_tank_volume', 'height', 'highway_fuel_economy', 'horsepower',
       'is_new', 'length', 'maximum_seating', 'mileage', 'owner_count',
       'price', 'savings_amount', 'wheelbase', 'width', 'year', 'make_name',
       'city_freq', 'body_type_Coupe', 'body_type_Hatchback',
       'body_type_Minivan', 'body_type_Pickup Truck',
       'body_type_SUV / Crossover', 'body_type_Sedan', 'body_type_Van',
       'body_type_Wagon', 'fuel_type_Compressed Natural Gas',
       'fuel_type_Diesel', 'fuel_type_Electric', 'fuel_type_Flex Fuel Vehicle',
       'fuel_type_Gasoline', 'fuel_type_Hybrid', 'fuel_type_Propane',
       'transmission_CVT', 'transmission_Dual Clutch', 'transmission_M',
       'wheel_system_4X2', 'wheel_system_AWD', 'wheel_system_FWD',
       'wheel_system_RWD'],
      dtype='object')

Detailed Column Info

<class 'pandas.core.frame.DataFrame'>
Index: 2968284 entries, 0 to 3000039
Data columns (total 45 columns):
 #   Column                            Non-Null Count    Dtype  
---  ------                            --------------    -----  
 0   back_legroom                      2968284 non-null  float64
 1   city_fuel_economy                 2968284 non-null  float64
 2   daysonmarket                      2968284 non-null  int64  
 3   engine_cylinders                  2968284 non-null  float64
 4   engine_displacement               2968284 non-null  float64
 5   franchise_dealer                  2968284 non-null  int32  
 6   front_legroom                     2968284 non-null  float64
 7   fuel_tank_volume                  2968284 non-null  float64
 8   height                            2968284 non-null  float64
 9   highway_fuel_economy              2968284 non-null  float64
 10  horsepower                        2968284 non-null  float64
 11  is_new                            2968284 non-null  int32  
 12  length                            2968284 non-null  float64
 13  maximum_seating                   2968284 non-null  float64
 14  mileage                           2968284 non-null  float64
 15  owner_count                       2968284 non-null  float64
 16  price                             2968284 non-null  float64
 17  savings_amount                    2968284 non-null  int64  
 18  wheelbase                         2968284 non-null  float64
 19  width                             2968284 non-null  float64
 20  year                              2968284 non-null  int64  
 21  make_name                         2968284 non-null  object 
 22  city_freq                         2968284 non-null  int64  
 23  body_type_Coupe                   2968284 non-null  bool   
 24  body_type_Hatchback               2968284 non-null  bool   
 25  body_type_Minivan                 2968284 non-null  bool   
 26  body_type_Pickup Truck            2968284 non-null  bool   
 27  body_type_SUV / Crossover         2968284 non-null  bool   
 28  body_type_Sedan                   2968284 non-null  bool   
 29  body_type_Van                     2968284 non-null  bool   
 30  body_type_Wagon                   2968284 non-null  bool   
 31  fuel_type_Compressed Natural Gas  2968284 non-null  bool   
 32  fuel_type_Diesel                  2968284 non-null  bool   
 33  fuel_type_Electric                2968284 non-null  bool   
 34  fuel_type_Flex Fuel Vehicle       2968284 non-null  bool   
 35  fuel_type_Gasoline                2968284 non-null  bool   
 36  fuel_type_Hybrid                  2968284 non-null  bool   
 37  fuel_type_Propane                 2968284 non-null  bool   
 38  transmission_CVT                  2968284 non-null  bool   
 39  transmission_Dual Clutch          2968284 non-null  bool   
 40  transmission_M                    2968284 non-null  bool   
 41  wheel_system_4X2                  2968284 non-null  bool   
 42  wheel_system_AWD                  2968284 non-null  bool   
 43  wheel_system_FWD                  2968284 non-null  bool   
 44  wheel_system_RWD                  2968284 non-null  bool   
dtypes: bool(22), float64(16), int32(2), int64(4), object(1)
memory usage: 583.1+ MB

We see now that df_encoded has no null values and no string-valued columns other than the target make_name. It also occupies less memory while retaining nearly all of the original 3 million rows!¶
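These properties can be asserted programmatically. Below is a minimal sketch on a hypothetical miniature frame standing in for df_encoded (the column names mirror the schema printed above, but the values are illustrative):

```python
import pandas as pd

# Hypothetical miniature stand-in for df_encoded (the real frame has
# ~3M rows and 45 columns); column names mirror the schema printed above.
df_small = pd.DataFrame({
    "price": [21000.0, 35000.0, 18000.0],
    "is_new": pd.Series([0, 1, 0], dtype="int32"),
    "body_type_Sedan": [True, False, True],
    "make_name": ["Ford", "BMW", "Kia"],
})

# Claim 1: no nulls anywhere.
assert df_small.isna().sum().sum() == 0

# Claim 2: every column except the target is numeric or boolean.
non_numeric = df_small.drop(columns="make_name").select_dtypes(
    exclude=["number", "bool"]).columns
assert len(non_numeric) == 0

# Downcast dtypes (int32, bool) keep the memory footprint small.
print(df_small.memory_usage(deep=True).sum(), "bytes")
```

Running checks like these before modeling catches regressions if the cleaning pipeline is ever re-run with different parameters.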

4. Data Understanding/Visualization¶

Now that we have the data appropriately cleaned, imputed, and encoded, we are able to perform some visualizations.

4.1 Frequency Distribution of Car Brands¶

In [45]:
# Order the car brands by frequency (from most common to least)
plt.figure(figsize=(12, 8))
order = df_dropped['make_name'].value_counts().index
sns.countplot(data=df_dropped, y='make_name', hue='make_name', order=order, palette='viridis', legend=False)
plt.title("Frequency Distribution of Car Brands")
plt.xlabel("Count")
plt.ylabel("Car Brand")
plt.show()

4.2 Correlation Heatmap of Continuous Features¶

In [47]:
# Select a subset of continuous variables
numeric_cols = ['price', 'mileage', 'engine_displacement', 'horsepower',
                'city_fuel_economy', 'highway_fuel_economy', 'daysonmarket']

corr_matrix = df_encoded[numeric_cols].corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Selected Features")
plt.show()

4.3 Distribution Plots of Price and Mileage¶

In [49]:
# Price distribution with log scale
plt.figure(figsize=(12, 6))
sns.histplot(df_encoded['price'], bins=1000)
plt.xscale('log')
plt.title("Distribution of Price (Log Scale)")
plt.xlabel("Price (log scale)")
plt.ylabel("Frequency")
plt.show()

# Mileage distribution with log scale
plt.figure(figsize=(12, 6))
sns.histplot(df_encoded['mileage'], bins=1000)
plt.xscale('log')
plt.title("Distribution of Mileage (Log Scale)")
plt.xlabel("Mileage (log scale)")
plt.ylabel("Frequency")
plt.show()
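One caveat with the log-scaled histograms above: brand-new listings can legitimately report a mileage of 0, and since log(0) is undefined, a log axis silently drops those rows. A minimal sketch of an explicit filter, using hypothetical values:

```python
import pandas as pd

# Hypothetical mileage values: brand-new listings can legitimately report
# 0 miles, and log(0) is undefined, so a log-scaled axis silently drops
# those rows from the histogram.
mileage = pd.Series([0.0, 0.0, 12.0, 5000.0, 42000.0, 98000.0])

positive = mileage[mileage > 0]
print(f"{len(mileage) - len(positive)} zero-mileage rows would vanish on a log axis")
# Pass `positive` (not `mileage`) to sns.histplot before plt.xscale('log')
# to make the exclusion explicit rather than implicit.
```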
4.4 Average Car Price by Brand¶

In [50]:
# Compute the average price by brand from df_encoded
brand_avg_price = df_encoded.groupby('make_name')['price'].mean().reset_index()

# Sort brands by average price in descending order for better visualization
brand_avg_price = brand_avg_price.sort_values(by='price', ascending=False)

plt.figure(figsize=(12,8))
sns.barplot(x='price', y='make_name', hue='make_name', data=brand_avg_price, palette="viridis", legend=False)
plt.title("Average Car Price by Brand")
plt.xlabel("Average Price")
plt.ylabel("Car Brand")
plt.show()

4.5 Analysis of Visualized Trends¶

From the above five visualizations, we can observe the following trends:

  • Ford is the most common brand in the dataset, followed by Chevrolet, Toyota, and Nissan. The distribution among brands is uneven, with a few makes dominating the overall sample.
  • Price is positively correlated with engine displacement and horsepower, and negatively correlated with mileage.
  • Horsepower and engine displacement are strongly correlated with each other.
  • City and highway fuel economies show a strong positive correlation, indicating they tend to move in tandem.
  • The mileage distribution is heavily skewed toward lower values, with the vast majority of vehicles registering well under 100,000 miles.
  • Vehicle prices also exhibit right-skewness, peaking in the lower to mid price range (around tens of thousands of dollars) and tapering off toward the high end.
  • Luxury and performance brands (e.g., Porsche, Land Rover, Mercedes-Benz) command higher average prices. Unsurprisingly, more economy-focused brands (e.g., Kia, Hyundai, Mitsubishi) appear on the lower end of the price scale.
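The skewness observations above can also be quantified numerically rather than judged by eye. A minimal sketch using pandas' built-in sample skewness on synthetic, illustrative prices (the real check would use df_encoded['price']):

```python
import pandas as pd

# Synthetic, right-skewed prices standing in for df_encoded['price'];
# the values here are illustrative only.
prices = pd.Series([15000, 18000, 21000, 24000, 30000, 45000, 90000, 250000],
                   dtype=float)

# A positive sample skewness confirms a long upper tail (right-skew).
print("price skewness:", prices.skew())
assert prices.skew() > 0
```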

5. Modeling¶

In [54]:
# Separate features and target
X = df_encoded.drop('make_name', axis=1)
y = df_encoded['make_name']
In [55]:
# Encode the target labels to integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)
In [56]:
# Split data into training and test sets (using stratification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)
In [57]:
# Train a Random Forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_clf.fit(X_train, y_train)
Out[57]:
RandomForestClassifier(n_jobs=-1, random_state=42)
In [58]:
# Run evaluation
rf_pred = rf_clf.predict(X_test)

# Display classification report
print("Random Forest Classification Report:")
print(classification_report(y_test, rf_pred))
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      6121
           1       1.00      1.00      1.00      7274
           2       0.99      0.99      0.99     12439
           3       1.00      1.00      1.00     12957
           4       0.99      0.99      0.99      9433
           5       0.98      0.99      0.98     75379
           6       0.98      0.98      0.98      7279
           7       0.99      0.98      0.99     18275
           8       0.99      1.00      0.99     95267
           9       0.97      0.91      0.94     19859
          10       1.00      1.00      1.00     42892
          11       0.99      0.99      0.99     27219
          12       1.00      0.99      1.00      5675
          13       1.00      1.00      1.00      1713
          14       0.99      0.99      0.99     33687
          15       0.99      1.00      0.99     22466
          16       0.99      0.99      0.99      2995
          17       0.98      0.99      0.99      8129
          18       1.00      1.00      1.00      6572
          19       1.00      0.99      1.00      1796
          20       0.99      0.99      0.99     10751
          21       0.99      0.99      0.99     13330
          22       0.99      0.99      0.99      4141
          23       1.00      0.99      1.00     43579
          24       1.00      0.99      0.99      2589
          25       0.99      0.99      0.99     20509
          26       1.00      1.00      1.00     13946
          27       0.99      0.99      0.99     47826
          28       0.99      0.99      0.99     15218
          29       0.99      0.99      0.99      4341

    accuracy                           0.99    593657
   macro avg       0.99      0.99      0.99    593657
weighted avg       0.99      0.99      0.99    593657
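The report indexes classes by integer. Because LabelEncoder sorts its labels, row i corresponds to the i-th brand in alphabetical order, recoverable via `le.classes_`. A minimal sketch with a hypothetical brand list:

```python
import numpy as np

# LabelEncoder sorts labels, so row i of the report above is the i-th
# brand in alphabetical order. Hypothetical brand list for illustration:
brands = ["Ford", "BMW", "Kia", "Ford", "Toyota"]
classes = np.unique(brands).tolist()   # same ordering as le.classes_
print(dict(enumerate(classes)))        # {0: 'BMW', 1: 'Ford', 2: 'Kia', 3: 'Toyota'}
```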

6. Evaluation¶

In [60]:
# Compute the confusion matrix
cm = confusion_matrix(y_test, rf_pred)
class_names = le.classes_  # array of class labels

plt.figure(figsize=(20, 16))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Class')
plt.ylabel('Actual Class')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
In [61]:
# Get feature importances from the Random Forest model
importances = rf_clf.feature_importances_
feature_names = X_train.columns

# Sort the features by importance and select the top 20
indices = np.argsort(importances)[::-1]
top_n = 20
top_indices = indices[:top_n]

plt.figure(figsize=(12, 8))
sns.barplot(x=importances[top_indices], y=feature_names[top_indices])
plt.title('Top 20 Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
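One caveat worth noting: impurity-based importances from tree ensembles can overweight continuous or high-cardinality features. Permutation importance is a common cross-check; below is a small self-contained sketch on synthetic data (shapes, hyperparameters, and the informative feature are all illustrative, not those of the notebook's model):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Tiny synthetic dataset: only feature 0 is informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Permuting an informative column should hurt accuracy; permuting the
# noise columns should not.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=42)
print(result.importances_mean)  # feature 0 dominates
```

On the real data, this would be run against the held-out test set rather than the training set to avoid rewarding overfit features.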

We see that our Random Forest classifier delivered an impressive 99% overall accuracy, with consistently high precision, recall, and F1-scores across nearly all brands in the dataset. This strong performance indicates that the chosen features -- encompassing vehicle dimensions, engine characteristics, and market indicators -- are effective at capturing the nuances that distinguish one brand from another. Misclassifications were minimal, suggesting that our preprocessing steps and feature engineering approach successfully reduced noise and preserved brand-specific signals.
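The few misclassifications that remain can be located directly from the confusion matrix by ranking its off-diagonal cells. A minimal sketch with a hypothetical 3-class matrix (in the notebook, `cm` and `class_names = le.classes_` from the cell above would be used instead):

```python
import numpy as np

# Hypothetical 3-class confusion matrix; in the notebook, `cm` and
# `class_names = le.classes_` would be used instead.
cm_demo = np.array([[50, 3, 0],
                    [2, 45, 5],
                    [0, 8, 40]])
names = ["BMW", "Ford", "Kia"]

# Zero the diagonal, then rank (actual, predicted) cells to find the
# brand pair the model confuses most often.
off_diag = cm_demo.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(f"Most confused: actual {names[i]} predicted as {names[j]} "
      f"({off_diag[i, j]} times)")
```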

7. Storytelling¶

We set out with a straightforward but intriguing question: Can a car's brand be accurately predicted from its specifications alone? To find out, we sourced and processed a rich dataset of used vehicles, focusing on features such as price, mileage, engine details, and vehicle dimensions. Our data preparation steps were meticulous: cleaning out unnecessary columns, encoding categorical variables, and ensuring no information would accidentally give away the brand.

What emerged was a model that exceeded expectations. Our Random Forest classifier achieved impressive accuracy, highlighting that certain attributes -- particularly vehicle dimensions like width, front legroom, and length -- carry substantial weight in identifying a car's brand. While many might assume price or horsepower would dominate, the results suggest that each manufacturer's unique design philosophy leaves discernible traces in a car's proportions. This finding demonstrates how well-chosen features, combined with a robust machine learning approach, can uncover the subtle differences that set brands apart.

8. Impact¶

A trustworthy brand classification system can truly transform how automotive marketplaces and dealerships operate. By sorting cars automatically according to their features, businesses can simplify inventory management, help customers locate exactly what they want, and adjust pricing or marketing strategies for different segments. When accuracy is high, it builds confidence: buyers trust that they’re finding exactly what they need, while sellers enjoy consistent, data-driven listings.

At the same time, it’s crucial to be clear about how the model is developed and tested. Even a few misclassifications can have a ripple effect—affecting pricing decisions or steering marketing efforts in the wrong direction. That’s why regular performance reviews and transparent communication about any limitations are essential. This approach means being open to ongoing improvements, whether that involves adding new features or exploring different algorithms.

In short, our results highlight the powerful impact machine learning can have on the automotive industry. By leveraging solid data, rigorous data preparation, and advanced models, we're building a system that not only classifies car brands accurately but also lays the groundwork for even more innovative applications in the future.

9. References¶

Dataset used: https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset/data