Project 1: Problem Definition & Data Understanding

Philip Vishnevsky
2025-01-31

1. Problem Introduction

Traffic accidents are a major concern worldwide, leading to injuries, fatalities, and economic losses. Understanding the factors that contribute to these incidents can help improve road safety and inform better traffic management strategies. This project seeks to analyze key accident conditions, such as weather, road surface conditions, and traffic patterns, to identify trends in when and why accidents occur. By answering questions about accident frequency, severity, and causes, this analysis aims to provide insights that could lead to improved safety measures and more informed policy decisions.

2. Data Introduction

The dataset used for this analysis contains 209,306 records of traffic accidents, capturing details such as crash date, weather and lighting conditions, road defects, accident types, and injury severity. Its 24 columns cover a broad set of contributing factors, which makes it useful for identifying patterns in accident occurrences. By analyzing this data, we can uncover trends related to accident timing, common injury types, and the conditions under which crashes are most frequent, ultimately helping to inform traffic safety efforts.

3. Data Preprocessing

We first import the Python libraries needed to load, summarize, and visualize the dataset.

3.0 Python Imports

In [10]:
# Import data libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

# Suppress FutureWarning messages so they do not clutter the notebook output
warnings.simplefilter(action='ignore', category=FutureWarning)

3.1 Data Preview

We will now load the dataset and display some basic information about it.

In [13]:
# Load dataset
file_path = "./data/traffic_accidents.csv"
df = pd.read_csv(file_path)
In [14]:
# Display first few rows
display(df.head())

# Show dataset info
df.info()
| | crash_date | traffic_control_device | weather_condition | lighting_condition | first_crash_type | trafficway_type | alignment | roadway_surface_cond | road_defect | crash_type | ... | most_severe_injury | injuries_total | injuries_fatal | injuries_incapacitating | injuries_non_incapacitating | injuries_reported_not_evident | injuries_no_indication | crash_hour | crash_day_of_week | crash_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 07/29/2023 01:00:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | TURNING | NOT DIVIDED | STRAIGHT AND LEVEL | UNKNOWN | UNKNOWN | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 13 | 7 | 7 |
| 1 | 08/13/2023 12:11:00 AM | TRAFFIC SIGNAL | CLEAR | DARKNESS, LIGHTED ROAD | TURNING | FOUR WAY | STRAIGHT AND LEVEL | DRY | NO DEFECTS | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 | 1 | 8 |
| 2 | 12/09/2021 10:30:00 AM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | REAR END | T-INTERSECTION | STRAIGHT AND LEVEL | DRY | NO DEFECTS | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 10 | 5 | 12 |
| 3 | 08/09/2023 07:55:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | ANGLE | FOUR WAY | STRAIGHT AND LEVEL | DRY | NO DEFECTS | INJURY AND / OR TOW DUE TO CRASH | ... | NONINCAPACITATING INJURY | 5.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 19 | 4 | 8 |
| 4 | 08/19/2023 02:55:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | REAR END | T-INTERSECTION | STRAIGHT AND LEVEL | UNKNOWN | UNKNOWN | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 14 | 7 | 8 |

5 rows × 24 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209306 entries, 0 to 209305
Data columns (total 24 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   crash_date                     209306 non-null  object 
 1   traffic_control_device         209306 non-null  object 
 2   weather_condition              209306 non-null  object 
 3   lighting_condition             209306 non-null  object 
 4   first_crash_type               209306 non-null  object 
 5   trafficway_type                209306 non-null  object 
 6   alignment                      209306 non-null  object 
 7   roadway_surface_cond           209306 non-null  object 
 8   road_defect                    209306 non-null  object 
 9   crash_type                     209306 non-null  object 
 10  intersection_related_i         209306 non-null  object 
 11  damage                         209306 non-null  object 
 12  prim_contributory_cause        209306 non-null  object 
 13  num_units                      209306 non-null  int64  
 14  most_severe_injury             209306 non-null  object 
 15  injuries_total                 209306 non-null  float64
 16  injuries_fatal                 209306 non-null  float64
 17  injuries_incapacitating        209306 non-null  float64
 18  injuries_non_incapacitating    209306 non-null  float64
 19  injuries_reported_not_evident  209306 non-null  float64
 20  injuries_no_indication         209306 non-null  float64
 21  crash_hour                     209306 non-null  int64  
 22  crash_day_of_week              209306 non-null  int64  
 23  crash_month                    209306 non-null  int64  
dtypes: float64(6), int64(4), object(14)
memory usage: 38.3+ MB
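
The summary above lists crash_date with dtype object, meaning the timestamps are stored as text. If date arithmetic is needed later, one possible approach is to parse the column explicitly. The sketch below is an assumption-laden illustration rather than part of the original notebook: it uses a small stand-in frame copied from the first three preview rows instead of the full file, the names `sample` and `crash_ts` are hypothetical, and the format string is inferred from those rows.

```python
import pandas as pd

# Stand-in frame copied from the first three preview rows (not the full dataset).
sample = pd.DataFrame({
    "crash_date": [
        "07/29/2023 01:00:00 PM",
        "08/13/2023 12:11:00 AM",
        "12/09/2021 10:30:00 AM",
    ],
    "crash_hour": [13, 0, 10],
    "crash_month": [7, 8, 12],
})

# Assumed format: month/day/year, 12-hour clock, AM/PM marker.
sample["crash_ts"] = pd.to_datetime(
    sample["crash_date"], format="%m/%d/%Y %I:%M:%S %p"
)

# Cross-check the parsed timestamps against the precomputed columns.
hours_match = bool((sample["crash_ts"].dt.hour == sample["crash_hour"]).all())
months_match = bool((sample["crash_ts"].dt.month == sample["crash_month"]).all())
print(hours_match, months_match)
```

If both checks hold on the full data as well, the precomputed crash_hour and crash_month columns can be trusted as derived from crash_date.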
In [15]:
# Basic statistics
display(df.describe())
| | num_units | injuries_total | injuries_fatal | injuries_incapacitating | injuries_non_incapacitating | injuries_reported_not_evident | injuries_no_indication | crash_hour | crash_day_of_week | crash_month |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 |
| mean | 2.063300 | 0.382717 | 0.001859 | 0.038102 | 0.221241 | 0.121516 | 2.244002 | 13.373047 | 4.144024 | 6.771822 |
| std | 0.396012 | 0.799720 | 0.047502 | 0.233964 | 0.614960 | 0.450865 | 1.241175 | 5.603830 | 1.966864 | 3.427593 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 9.000000 | 2.000000 | 4.000000 |
| 50% | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 14.000000 | 4.000000 | 7.000000 |
| 75% | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 17.000000 | 6.000000 | 10.000000 |
| max | 11.000000 | 21.000000 | 3.000000 | 7.000000 | 21.000000 | 15.000000 | 49.000000 | 23.000000 | 7.000000 | 12.000000 |
In [16]:
# Check for missing values
missing_values = df.isnull().sum()
display(missing_values[missing_values > 0])  # Display only columns with null values (if any)
Series([], dtype: int64)

We see there are no missing/null values.
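
An absence of nulls does not by itself mean every field is informative: the preview rows show the literal string "UNKNOWN" in roadway_surface_cond and road_defect. A minimal sketch of how such placeholder values could be counted follows. It runs on a small stand-in frame modeled on the five preview rows, the name `sample` is hypothetical, and treating "UNKNOWN" as the only placeholder spelling is an assumption.

```python
import pandas as pd

# Stand-in frame modeled on the five preview rows (not the full dataset).
sample = pd.DataFrame({
    "weather_condition": ["CLEAR"] * 5,
    "roadway_surface_cond": ["UNKNOWN", "DRY", "DRY", "DRY", "UNKNOWN"],
    "road_defect": ["UNKNOWN", "NO DEFECTS", "NO DEFECTS", "NO DEFECTS", "UNKNOWN"],
})

# Count cells equal to the placeholder string in every column.
# Assumption: "UNKNOWN" is the only placeholder spelling used.
unknown_counts = sample.eq("UNKNOWN").sum()

# Show only the columns that contain at least one placeholder.
print(unknown_counts[unknown_counts > 0])
```

On the full dataset, the same check with `df` in place of `sample` would show how much of each column is effectively missing.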

4. Data Understanding

Here, we’ll break down the key factors behind traffic accidents using visualizations. Each section focuses on a different aspect (weather, lighting, road conditions, time of day, or injury severity) to uncover patterns in when and why accidents happen. These insights set the stage for the discussion in the Storytelling section.

4.1 Accidents by Weather Condition

In [21]:
# Set plot styles
sns.set_style("whitegrid")

# Accidents by Weather
plt.figure(figsize=(12, 6))
weather_counts = df["weather_condition"].value_counts().head(10)
sns.barplot(x=weather_counts.index, y=weather_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Weather Condition")
plt.ylabel("Number of Accidents")
plt.title("Top 10 Weather Conditions Leading to Accidents")
plt.show()
[Figure: bar chart, "Top 10 Weather Conditions Leading to Accidents" (x: Weather Condition, y: Number of Accidents)]

4.2 Accidents by Lighting Condition

In [23]:
# Accidents by Lighting Condition
plt.figure(figsize=(12, 6))
lighting_counts = df["lighting_condition"].value_counts()
sns.barplot(x=lighting_counts.index, y=lighting_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Lighting Condition")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Lighting Condition")
plt.show()
[Figure: bar chart, "Accidents by Lighting Condition" (x: Lighting Condition, y: Number of Accidents)]

4.3 Accidents by Time of Day

In [25]:
# Accidents by Time of Day
plt.figure(figsize=(12, 6))
hourly_accidents = df["crash_hour"].value_counts().sort_index()
sns.lineplot(x=hourly_accidents.index, y=hourly_accidents.values, marker="o")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Hour of Day")
plt.show()
[Figure: line chart, "Accidents by Hour of Day" (x: Hour of Day, y: Number of Accidents)]
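
One detail worth keeping in mind with hourly counts: `value_counts()` only returns hours that actually occur in the data, so an hour with zero crashes would silently drop out of the line. With roughly 209,000 rows this is unlikely to matter for the full dataset, but it can matter for small subsets. The sketch below shows one way to guard against it; the toy hours are invented for illustration and the names `hours` and `hourly` are hypothetical.

```python
import pandas as pd

# Toy crash hours, invented for illustration only.
hours = pd.Series([0, 8, 8, 17, 17, 17, 23], name="crash_hour")

# Count crashes per hour, then force all 24 hours onto the index,
# filling hours that never occur with zero.
hourly = hours.value_counts().reindex(range(24), fill_value=0)

print(len(hourly), hourly.loc[17], hourly.loc[3])
```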

4.4 Accidents by Day of Week

In [27]:
# Accidents by Day of Week
plt.figure(figsize=(12, 6))
day_counts = df["crash_day_of_week"].value_counts().sort_index()
sns.barplot(x=day_counts.index, y=day_counts.values)
plt.xlabel("Day of Week")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Day of Week")
plt.show()
[Figure: bar chart, "Accidents by Day of Week" (x: Day of Week, y: Number of Accidents)]
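
The x-axis of this chart shows the integers 1 to 7 rather than day names. The preview rows suggest that 1 corresponds to Sunday and 7 to Saturday: for example, 08/13/2023 is coded 1 and 07/29/2023 is coded 7. The following sketch checks that reading against the five preview rows and maps the codes to names. The mapping `DAY_NAMES` and the frame `sample` are hypothetical, and the coding is inferred from those rows rather than from dataset documentation.

```python
import pandas as pd

# Assumed coding, inferred from the preview rows: 1 = Sunday ... 7 = Saturday.
DAY_NAMES = {
    1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
    5: "Thursday", 6: "Friday", 7: "Saturday",
}

# The five preview rows: crash date and its day-of-week code.
sample = pd.DataFrame({
    "crash_date": [
        "07/29/2023 01:00:00 PM",
        "08/13/2023 12:11:00 AM",
        "12/09/2021 10:30:00 AM",
        "08/09/2023 07:55:00 PM",
        "08/19/2023 02:55:00 PM",
    ],
    "crash_day_of_week": [7, 1, 5, 4, 7],
})

# Map the integer codes to names.
sample["day_name"] = sample["crash_day_of_week"].map(DAY_NAMES)

# Verify the mapping against the weekday derived from the date itself.
parsed = pd.to_datetime(sample["crash_date"], format="%m/%d/%Y %I:%M:%S %p")
agrees = bool((parsed.dt.day_name() == sample["day_name"]).all())
print(agrees)
```

Applying the same `map` to `day_counts.index` before plotting would give the chart readable tick labels.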

4.5 Accidents by Road Condition

In [29]:
# Accidents by Road Condition
plt.figure(figsize=(12, 6))
road_surface_counts = df["roadway_surface_cond"].value_counts()
sns.barplot(x=road_surface_counts.index, y=road_surface_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Road Condition")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Road Condition")
plt.show()
[Figure: bar chart, "Accidents by Road Condition" (x: Road Condition, y: Number of Accidents)]

4.6 Severity of Injuries

In [31]:
# Severity of Injuries in Accidents
plt.figure(figsize=(12, 6))
injury_counts = df["most_severe_injury"].value_counts()
sns.barplot(x=injury_counts.index, y=injury_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Injury Severity")
plt.ylabel("Number of Accidents")
plt.title("Severity of Injuries in Traffic Accidents")
plt.show()
[Figure: bar chart, "Severity of Injuries in Traffic Accidents" (x: Injury Severity, y: Number of Accidents)]

4.7 Severe Injuries by Condition

In [33]:
# Filter data for fatal and incapacitating injuries
severe_injuries = df[df["most_severe_injury"].isin(["FATAL", "INCAPACITATING INJURY"])]

# Group by different conditions and count occurrences
weather_severe = severe_injuries["weather_condition"].value_counts().head(5)
lighting_severe = severe_injuries["lighting_condition"].value_counts().head(5)
road_surface_severe = severe_injuries["roadway_surface_cond"].value_counts().head(5)

# Plot Weather Conditions for Severe Accidents
plt.figure(figsize=(12, 6))
sns.barplot(x=weather_severe.index, y=weather_severe.values, palette="Reds")
plt.xticks(rotation=45)
plt.xlabel("Weather Condition")
plt.ylabel("Severe Injury Count")
plt.title("Top Weather Conditions for Fatal & Incapacitating Injuries")
plt.show()

# Plot Lighting Conditions for Severe Accidents
plt.figure(figsize=(12, 6))
sns.barplot(x=lighting_severe.index, y=lighting_severe.values, palette="Blues")
plt.xticks(rotation=45)
plt.xlabel("Lighting Condition")
plt.ylabel("Severe Injury Count")
plt.title("Top Lighting Conditions for Fatal & Incapacitating Injuries")
plt.show()

# Plot Road Surface Conditions for Severe Accidents
plt.figure(figsize=(12, 6))
sns.barplot(x=road_surface_severe.index, y=road_surface_severe.values, palette="Greens")
plt.xticks(rotation=45)
plt.xlabel("Road Surface Condition")
plt.ylabel("Severe Injury Count")
plt.title("Top Road Surface Conditions for Fatal & Incapacitating Injuries")
plt.show()

# Group data by hour of the day for fatal and incapacitating injuries
hourly_severe = severe_injuries["crash_hour"].value_counts().sort_index()

# Plot Time of Day for Severe Accidents
plt.figure(figsize=(12, 6))
sns.lineplot(x=hourly_severe.index, y=hourly_severe.values, marker="o", color="red")
plt.xlabel("Hour of the Day")
plt.ylabel("Severe Injury Count")
plt.title("Time of Day for Fatal & Incapacitating Injuries")
plt.xticks(range(0, 24))  # Ensuring all hours are represented
plt.show()
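[Figure: bar chart, "Top Weather Conditions for Fatal & Incapacitating Injuries" (x: Weather Condition, y: Severe Injury Count)]
[Figure: bar chart, "Top Lighting Conditions for Fatal & Incapacitating Injuries" (x: Lighting Condition, y: Severe Injury Count)]
[Figure: bar chart, "Top Road Surface Conditions for Fatal & Incapacitating Injuries" (x: Road Surface Condition, y: Severe Injury Count)]
[Figure: line chart, "Time of Day for Fatal & Incapacitating Injuries" (x: Hour of the Day, y: Severe Injury Count)]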

5. Storytelling

The data visualizations have uncovered some interesting trends.

Most traffic accidents happen in clear weather and daylight, simply because that’s when most people are on the road. But accidents don’t disappear in bad conditions: rain, snow, and darkness still account for thousands of crashes. The timing of accidents paints an even clearer picture, with spikes at 7-8 AM and 3-5 PM that match rush hour traffic, when roads are packed with commuters. Accidents are spread fairly evenly across the week, but Fridays see the most crashes and Sundays the fewest, likely reflecting workweek routines.

Road conditions follow the same pattern: most crashes occur on dry roads, but wet, snowy, and icy conditions still contribute thousands of incidents. When it comes to injuries, most people walk away unharmed, but nearly 40,000 cases involve some level of injury, including about 8,000 incapacitating injuries and a smaller number of fatal crashes.

Interestingly, the patterns for severe accidents closely mirror overall trends, suggesting that the most dangerous crashes aren’t necessarily happening under extreme conditions, but rather in the same everyday environments where most accidents occur.
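
Because the charts in Section 4.7 show raw counts, common conditions will dominate them regardless of how dangerous they are. One way to probe this further would be to compare the share of crashes that are severe within each condition. The sketch below illustrates the idea on toy data that is invented purely for demonstration and does not reflect the dataset; the names `toy`, `SEVERE`, and `summary` are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only; these are NOT results from the dataset.
toy = pd.DataFrame({
    "weather_condition": ["CLEAR"] * 8 + ["RAIN"] * 2,
    "most_severe_injury": (
        ["NO INDICATION OF INJURY"] * 6
        + ["INCAPACITATING INJURY", "FATAL"]
        + ["NO INDICATION OF INJURY", "INCAPACITATING INJURY"]
    ),
})

# Same severity definition as the filter used in Section 4.7.
SEVERE = ["FATAL", "INCAPACITATING INJURY"]
toy["is_severe"] = toy["most_severe_injury"].isin(SEVERE)

# Per condition: total crashes, severe crashes, and the severe share.
summary = toy.groupby("weather_condition")["is_severe"].agg(
    crashes="size", severe="sum", severe_share="mean"
)
print(summary)
```

In this toy example CLEAR has more severe crashes in absolute terms, while RAIN has the higher severe share, which is exactly the distinction raw counts cannot show.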

6. Impact

These findings highlight that accidents aren’t just a bad-weather problem; they happen most in normal, everyday conditions. This means road safety improvements should focus on high-traffic times and common conditions, not just extreme weather. Better traffic flow during rush hours, improved intersection design, and awareness campaigns could help reduce crashes. There is a potential downside, however: targeted safety measures might lead to increased surveillance, stricter regulations, or higher insurance rates in "risky" areas. The key is using this data to make roads safer without unfairly burdening drivers.

7. References

Dataset Source