Project 1: Problem Definition & Data Understanding
Philip Vishnevsky
2025-01-31
1. Problem Introduction
Traffic accidents are a major concern worldwide, leading to injuries, fatalities, and economic losses. Understanding the factors that contribute to these incidents can help improve road safety and inform better traffic management strategies. This project seeks to analyze key accident conditions, such as weather, road surface conditions, and traffic patterns, to identify trends in when and why accidents occur. By answering questions about accident frequency, severity, and causes, this analysis aims to provide insights that could lead to improved safety measures and more informed policy decisions.
2. Data Introduction
The dataset used for this analysis contains approximately 209,000 records of traffic accidents, capturing a wide range of details such as crash date, weather and lighting conditions, road defects, accident types, and injury severity. With 24 columns, the dataset provides a comprehensive look at various contributing factors, making it valuable for identifying patterns in accident occurrences. By analyzing this data, we can uncover trends related to accident timing, common injury types, and conditions that pose the highest risks, ultimately helping to enhance traffic safety efforts.
3. Data Preprocessing
We first import the Python libraries used to load, summarize, and visualize the dataset.
3.0 Python Imports
# Import data libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
3.1 Data Preview
We now load the dataset and display some basic information about it.
# Load dataset
file_path = "./data/traffic_accidents.csv"
df = pd.read_csv(file_path)
# Display first few rows
display(df.head())
# Show dataset info
df.info()
| | crash_date | traffic_control_device | weather_condition | lighting_condition | first_crash_type | trafficway_type | alignment | roadway_surface_cond | road_defect | crash_type | ... | most_severe_injury | injuries_total | injuries_fatal | injuries_incapacitating | injuries_non_incapacitating | injuries_reported_not_evident | injuries_no_indication | crash_hour | crash_day_of_week | crash_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 07/29/2023 01:00:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | TURNING | NOT DIVIDED | STRAIGHT AND LEVEL | UNKNOWN | UNKNOWN | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 13 | 7 | 7 |
| 1 | 08/13/2023 12:11:00 AM | TRAFFIC SIGNAL | CLEAR | DARKNESS, LIGHTED ROAD | TURNING | FOUR WAY | STRAIGHT AND LEVEL | DRY | NO DEFECTS | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 | 1 | 8 |
| 2 | 12/09/2021 10:30:00 AM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | REAR END | T-INTERSECTION | STRAIGHT AND LEVEL | DRY | NO DEFECTS | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 10 | 5 | 12 |
| 3 | 08/09/2023 07:55:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | ANGLE | FOUR WAY | STRAIGHT AND LEVEL | DRY | NO DEFECTS | INJURY AND / OR TOW DUE TO CRASH | ... | NONINCAPACITATING INJURY | 5.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 19 | 4 | 8 |
| 4 | 08/19/2023 02:55:00 PM | TRAFFIC SIGNAL | CLEAR | DAYLIGHT | REAR END | T-INTERSECTION | STRAIGHT AND LEVEL | UNKNOWN | UNKNOWN | NO INJURY / DRIVE AWAY | ... | NO INDICATION OF INJURY | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 14 | 7 | 8 |
5 rows × 24 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209306 entries, 0 to 209305
Data columns (total 24 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   crash_date                     209306 non-null  object
 1   traffic_control_device         209306 non-null  object
 2   weather_condition              209306 non-null  object
 3   lighting_condition             209306 non-null  object
 4   first_crash_type               209306 non-null  object
 5   trafficway_type                209306 non-null  object
 6   alignment                      209306 non-null  object
 7   roadway_surface_cond           209306 non-null  object
 8   road_defect                    209306 non-null  object
 9   crash_type                     209306 non-null  object
 10  intersection_related_i         209306 non-null  object
 11  damage                         209306 non-null  object
 12  prim_contributory_cause        209306 non-null  object
 13  num_units                      209306 non-null  int64
 14  most_severe_injury             209306 non-null  object
 15  injuries_total                 209306 non-null  float64
 16  injuries_fatal                 209306 non-null  float64
 17  injuries_incapacitating        209306 non-null  float64
 18  injuries_non_incapacitating    209306 non-null  float64
 19  injuries_reported_not_evident  209306 non-null  float64
 20  injuries_no_indication         209306 non-null  float64
 21  crash_hour                     209306 non-null  int64
 22  crash_day_of_week              209306 non-null  int64
 23  crash_month                    209306 non-null  int64
dtypes: float64(6), int64(4), object(14)
memory usage: 38.3+ MB
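The `df.info()` output lists `crash_date` as `object`, meaning the timestamps are stored as plain strings. A minimal sketch of converting them to datetimes is shown here. The format string is an assumption inferred from the preview rows (month/day/year with a 12-hour clock), and the small `sample` frame and the `crash_ts` column name are illustrative stand-ins for the full `df`.

```python
import pandas as pd

# Values copied from the preview rows; in the notebook this would be df itself
sample = pd.DataFrame({
    "crash_date": [
        "07/29/2023 01:00:00 PM",
        "08/13/2023 12:11:00 AM",
        "12/09/2021 10:30:00 AM",
    ],
    "crash_hour": [13, 0, 10],
})

# Assumed format, inferred from the preview: MM/DD/YYYY, 12-hour clock with AM/PM
sample["crash_ts"] = pd.to_datetime(
    sample["crash_date"], format="%m/%d/%Y %I:%M:%S %p"
)

# Cross-check: the parsed hour should agree with the existing crash_hour column
hours_match = (sample["crash_ts"].dt.hour == sample["crash_hour"]).all()

print(sample.dtypes)
print("Parsed hour matches crash_hour:", hours_match)
```

Passing an explicit `format` avoids ambiguous day/month guessing and tends to be faster on a few hundred thousand rows than letting pandas infer the format.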
# Basic statistics
display(df.describe())
| | num_units | injuries_total | injuries_fatal | injuries_incapacitating | injuries_non_incapacitating | injuries_reported_not_evident | injuries_no_indication | crash_hour | crash_day_of_week | crash_month |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 | 209306.000000 |
| mean | 2.063300 | 0.382717 | 0.001859 | 0.038102 | 0.221241 | 0.121516 | 2.244002 | 13.373047 | 4.144024 | 6.771822 |
| std | 0.396012 | 0.799720 | 0.047502 | 0.233964 | 0.614960 | 0.450865 | 1.241175 | 5.603830 | 1.966864 | 3.427593 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 9.000000 | 2.000000 | 4.000000 |
| 50% | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 14.000000 | 4.000000 | 7.000000 |
| 75% | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 17.000000 | 6.000000 | 10.000000 |
| max | 11.000000 | 21.000000 | 3.000000 | 7.000000 | 21.000000 | 15.000000 | 49.000000 | 23.000000 | 7.000000 | 12.000000 |
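In the summary table, the means of the four injury categories (0.001859 + 0.038102 + 0.221241 + 0.121516) add up to 0.382718, which matches the `injuries_total` mean of 0.382717 up to rounding. This suggests, as an assumption rather than a documented fact, that `injuries_total` is the sum of those four columns and excludes `injuries_no_indication`. A small sketch of a row-level check, using two rows copied from the preview in place of the full `df`:

```python
import pandas as pd

# Rows 0 and 3 from the preview; the notebook would run this on df directly
sample = pd.DataFrame({
    "injuries_total": [0.0, 5.0],
    "injuries_fatal": [0.0, 0.0],
    "injuries_incapacitating": [0.0, 0.0],
    "injuries_non_incapacitating": [0.0, 5.0],
    "injuries_reported_not_evident": [0.0, 0.0],
})

# Assumed components of injuries_total (injuries_no_indication is left out)
parts = [
    "injuries_fatal",
    "injuries_incapacitating",
    "injuries_non_incapacitating",
    "injuries_reported_not_evident",
]

# Rows where the components do not add up to the reported total
component_sum = sample[parts].sum(axis=1)
mismatches = sample[component_sum != sample["injuries_total"]]

print(f"Rows where the components do not add up: {len(mismatches)}")
```

If the check returns zero mismatches on the full dataset, `injuries_total` is redundant with its components, which is worth knowing before using these columns together in a model.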
# Check for missing values
missing_values = df.isnull().sum()
display(missing_values[missing_values > 0]) # Display only columns with null values (if any)
Series([], dtype: int64)
The empty Series confirms that no column contains missing (null) values.
4. Data Understanding
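A null check only catches values that pandas treats as missing. The preview rows show the string `UNKNOWN` in `roadway_surface_cond` and `road_defect`, which behaves like a missing value but is invisible to `isnull()`. A minimal sketch of counting such placeholders follows. The column list and the `sample` frame are illustrative, built from the five preview rows, and other placeholder spellings may exist in the full data.

```python
import pandas as pd

# Values copied from the five preview rows; the notebook would use df instead
sample = pd.DataFrame({
    "weather_condition": ["CLEAR", "CLEAR", "CLEAR", "CLEAR", "CLEAR"],
    "roadway_surface_cond": ["UNKNOWN", "DRY", "DRY", "DRY", "UNKNOWN"],
    "road_defect": ["UNKNOWN", "NO DEFECTS", "NO DEFECTS", "NO DEFECTS", "UNKNOWN"],
})

# Columns to inspect (an illustrative subset of the categorical columns)
cols = ["weather_condition", "roadway_surface_cond", "road_defect"]

# Count rows per column whose value is the literal placeholder "UNKNOWN"
unknown_counts = (sample[cols] == "UNKNOWN").sum()

# Show only the columns that actually contain the placeholder
print(unknown_counts[unknown_counts > 0])
```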
Here, we break down the key factors behind traffic accidents using visualizations. Each subsection focuses on a different aspect (weather, lighting, time of day, day of week, road conditions, and injury severity) to uncover patterns in when and why accidents happen. These insights set the stage for the interpretation in Section 5.
4.1 Accidents by Weather Condition
# Set plot styles
sns.set_style("whitegrid")
# Accidents by Weather
plt.figure(figsize=(12, 6))
weather_counts = df["weather_condition"].value_counts().head(10)
sns.barplot(x=weather_counts.index, y=weather_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Weather Condition")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Weather Condition (Top 10)")
plt.show()
4.2 Accidents by Lighting Condition
# Accidents by Lighting Condition
plt.figure(figsize=(12, 6))
lighting_counts = df["lighting_condition"].value_counts()
sns.barplot(x=lighting_counts.index, y=lighting_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Lighting Condition")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Lighting Condition")
plt.show()
4.3 Accidents by Time of Day
# Accidents by Time of Day
plt.figure(figsize=(12, 6))
hourly_accidents = df["crash_hour"].value_counts().sort_index()
sns.lineplot(x=hourly_accidents.index, y=hourly_accidents.values, marker="o")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Hour of Day")
plt.show()
4.4 Accidents by Day of Week
# Accidents by Day of Week
plt.figure(figsize=(12, 6))
day_counts = df["crash_day_of_week"].value_counts().sort_index()
sns.barplot(x=day_counts.index, y=day_counts.values)
plt.xlabel("Day of Week")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Day of Week")
plt.show()
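The `crash_day_of_week` column holds the numbers 1 to 7, so the bar chart's x-axis shows codes rather than day names. The preview rows suggest the encoding 1 = Sunday through 7 = Saturday (for example, 08/13/2023 was a Sunday and is coded 1). The sketch below maps the codes to names under that assumed encoding and verifies it against the calendar. The `day_names` mapping and the `sample` frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Dates and day codes copied from the preview rows
sample = pd.DataFrame({
    "crash_date": [
        "07/29/2023 01:00:00 PM",
        "08/13/2023 12:11:00 AM",
        "12/09/2021 10:30:00 AM",
        "08/09/2023 07:55:00 PM",
    ],
    "crash_day_of_week": [7, 1, 5, 4],
})

# Assumed encoding, inferred from the preview: 1 = Sunday ... 7 = Saturday
day_names = {
    1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
    5: "Thursday", 6: "Friday", 7: "Saturday",
}
sample["day_name"] = sample["crash_day_of_week"].map(day_names)

# Verify the assumed encoding against the actual calendar day of each date
parsed = pd.to_datetime(sample["crash_date"], format="%m/%d/%Y %I:%M:%S %p")
encoding_ok = (parsed.dt.day_name() == sample["day_name"]).all()

print(sample[["crash_date", "crash_day_of_week", "day_name"]])
print("Encoding consistent with calendar:", encoding_ok)
```

Applied to the full data, the same mapping could relabel the index of `day_counts` before plotting, so the chart reads Sunday through Saturday instead of 1 through 7.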
4.5 Accidents by Road Condition
# Accidents by Road Condition
plt.figure(figsize=(12, 6))
road_surface_counts = df["roadway_surface_cond"].value_counts()
sns.barplot(x=road_surface_counts.index, y=road_surface_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Road Condition")
plt.ylabel("Number of Accidents")
plt.title("Accidents by Road Condition")
plt.show()
4.6 Severity of Injuries
# Severity of Injuries in Accidents
plt.figure(figsize=(12, 6))
injury_counts = df["most_severe_injury"].value_counts()
sns.barplot(x=injury_counts.index, y=injury_counts.values)
plt.xticks(rotation=45)
plt.xlabel("Injury Severity")
plt.ylabel("Number of Accidents")
plt.title("Severity of Injuries in Traffic Accidents")
plt.show()
4.7 Severe Injuries by Condition
# Filter data for fatal and incapacitating injuries
severe_injuries = df[df["most_severe_injury"].isin(["FATAL", "INCAPACITATING INJURY"])]
# Group by different conditions and count occurrences
weather_severe = severe_injuries["weather_condition"].value_counts().head(5)
lighting_severe = severe_injuries["lighting_condition"].value_counts().head(5)
road_surface_severe = severe_injuries["roadway_surface_cond"].value_counts().head(5)
# Plot Weather Conditions for Severe Accidents
plt.figure(figsize=(12, 6))
sns.barplot(x=weather_severe.index, y=weather_severe.values, palette="Reds")
plt.xticks(rotation=45)
plt.xlabel("Weather Condition")
plt.ylabel("Severe Injury Count")
plt.title("Top Weather Conditions for Fatal & Incapacitating Injuries")
plt.show()
# Plot Lighting Conditions for Severe Accidents
plt.figure(figsize=(12, 6))
sns.barplot(x=lighting_severe.index, y=lighting_severe.values, palette="Blues")
plt.xticks(rotation=45)
plt.xlabel("Lighting Condition")
plt.ylabel("Severe Injury Count")
plt.title("Top Lighting Conditions for Fatal & Incapacitating Injuries")
plt.show()
# Plot Road Surface Conditions for Severe Accidents
plt.figure(figsize=(12, 6))
sns.barplot(x=road_surface_severe.index, y=road_surface_severe.values, palette="Greens")
plt.xticks(rotation=45)
plt.xlabel("Road Surface Condition")
plt.ylabel("Severe Injury Count")
plt.title("Top Road Surface Conditions for Fatal & Incapacitating Injuries")
plt.show()
# Group data by hour of the day for fatal and incapacitating injuries
hourly_severe = severe_injuries["crash_hour"].value_counts().sort_index()
# Plot Time of Day for Severe Accidents
plt.figure(figsize=(12, 6))
sns.lineplot(x=hourly_severe.index, y=hourly_severe.values, marker="o", color="red")
plt.xlabel("Hour of the Day")
plt.ylabel("Severe Injury Count")
plt.title("Time of Day for Fatal & Incapacitating Injuries")
plt.xticks(range(0, 24)) # Ensuring all hours are represented
plt.show()
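The plots in this subsection show raw counts of severe crashes, which largely track how often each condition occurs in the first place. One way to separate exposure from risk is to look at the share of crashes under each condition that turn out severe. The following is a minimal sketch on hypothetical toy data. The `toy` frame and its values are invented for illustration only, and on the real data the same grouping would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data: 8 crashes in clear weather, 2 in rain
toy = pd.DataFrame({
    "weather_condition": ["CLEAR"] * 8 + ["RAIN"] * 2,
    "most_severe_injury": (
        ["NO INDICATION OF INJURY"] * 6
        + ["FATAL", "INCAPACITATING INJURY"]
        + ["NO INDICATION OF INJURY", "INCAPACITATING INJURY"]
    ),
})

# Same definition of "severe" as in the filter used for the plots above
severe_labels = ["FATAL", "INCAPACITATING INJURY"]
toy["is_severe"] = toy["most_severe_injury"].isin(severe_labels)

# Per condition: number of crashes, number of severe crashes, and severe share
severe_rate = toy.groupby("weather_condition")["is_severe"].agg(
    total="size", severe="sum", rate="mean"
)

print(severe_rate.sort_values("rate", ascending=False))
```

In this toy example, clear weather has more severe crashes in absolute terms (2 versus 1), while rain has the higher severe share (0.5 versus 0.25). A raw count and a rate can therefore rank conditions differently, so both views are worth checking on the real data.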
5. Storytelling
The data visualizations have uncovered some interesting trends.
Most traffic accidents happen in clear weather and daylight, most likely because that is when most people are on the road. But accidents do not disappear in bad conditions: rain, snow, and darkness still account for thousands of crashes. The timing of accidents paints an even clearer picture: spikes at 7-8 AM and 3-5 PM match rush-hour traffic, when roads are packed with commuters. Accidents are spread fairly evenly throughout the week, but Fridays see the most crashes and Sundays the fewest, likely reflecting workweek routines.
Road conditions also play a role: most crashes occur on dry roads, but wet, snowy, and icy conditions still contribute thousands of incidents. When it comes to injuries, most people walk away unharmed. Still, the summary statistics in Section 3.1 show that at least a quarter of crashes (more than 52,000) record one or more injuries, with roughly 8,000 incapacitating injuries and about 390 fatal injuries in total.
Interestingly, the patterns for severe accidents closely mirror overall trends, suggesting that the most dangerous crashes aren’t necessarily happening under extreme conditions, but rather in the same everyday environments where most accidents occur.
6. Impact
These findings highlight that accidents aren’t just a bad-weather problem—they happen most in normal, everyday conditions. This means road safety improvements should focus on high-traffic times and common conditions, not just extreme weather. Better traffic flow during rush hours, improved intersection design, and awareness campaigns could help reduce crashes. However, there’s a downside—targeted safety measures might lead to increased surveillance, stricter regulations, or higher insurance rates in "risky" areas. The key is using this data to make roads safer without unfairly burdening drivers.