Bike accident dataset

Disclaimer: this course is adapted from the Pandas tutorial by Joris Van den Bossche. R users might also want to read Pandas: Comparison with R / R libraries for a smooth start with Pandas.

We start by importing the necessary libraries:

%matplotlib inline
import os
import numpy as np
import calendar
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from cycler import cycler
import pooch  # download data / avoid re-downloading
from IPython import get_ipython
import lzma  # to process the xz-compressed file
import plotly.express as px

sns.set_palette("colorblind")
palette = sns.color_palette("twilight", n_colors=12)
pd.options.display.max_rows = 8


Data loading and preprocessing

Data loading

# url = "https://koumoul.com/s/data-fair/api/v1/datasets/accidents-velos/raw"
url_db = "https://github.com/josephsalmon/HAX712X/raw/main/Data/accidents-velos_2022.csv.xz"
path_target = "./bicycle_db.csv.xz"
path, fname = os.path.split(path_target)
pooch.retrieve(url_db, path=path, fname=fname, known_hash=None)
with lzma.open(path_target) as f:
    file_content = f.read().decode('utf-8')

# write the decompressed content to a plain CSV file
with open("./bicycle_db.csv", 'w') as f:
    f.write(file_content)
df_bikes = pd.read_csv("bicycle_db.csv", na_values="", low_memory=False,
                       dtype={'date': str, 'heure': str, 'departement': str})

In June 2023, the data provider changed the column names, so we define a dictionary to map the new names back to the legacy ones used below:

new2old = {
    "hrmn": "heure",
    "secuexist": "existence securite",
    "grav": "gravite accident",
    "dep": "departement",
}

df_bikes.rename(columns=new2old, inplace=True)
get_ipython().system('head -5 ./bicycle_db.csv')
pd.options.display.max_columns = 40
df_bikes.head()
(Output of df_bikes.head(): the first five rows, with 39 columns: identifiant accident, date, mois, jour, heure, departement, commune, lat, lon, en agglomeration, type intersection, type collision, luminosite, conditions atmosperiques, type route, circulation, nb voies, profil long route, trace plan route, largeur TPC, largeur route, etat surface, amenagement, situation, categorie usager, gravite accident, sexe, age, motif deplacement, existence securite, usage securite, obstacle fixe heurte, obstacle mobile heurte, localisation choc, manoeuvre avant accident, identifiant vehicule, type autres vehicules, manoeuvre autres vehicules, nombre autres vehicules.)
df_bikes['existence securite'].unique()
array(['Casque', 'Equipement réfléchissant', 'Ceinture', 'Autre', nan,
       'Dispositif enfants'], dtype=object)
df_bikes['gravite accident'].unique()
array(['1 - Blessé léger', '2 - Blessé hospitalisé', '3 - Tué',
       '0 - Indemne'], dtype=object)

Handle missing values

df_bikes['date'].hasnans
False
df_bikes['heure'].hasnans
True

So, somewhat arbitrarily, we fill the missing values with 0 (apparently no accident is reported at hour 0; this should be double-checked against the source).

df_bikes.fillna({'heure':'0'}, inplace=True)
EXERCISE: start/end of the study

Can you find the starting day and the ending day of the study automatically?

Hint: sort the data! You can sort by date, for instance with df_bikes.sort_values(by="date").
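
A possible solution (just a sketch), assuming dates follow the YYYY-MM-DD format shown above:

# Sketch: parse the 'date' column and read off the first and last day of the study
dates = pd.to_datetime(df_bikes["date"], format="%Y-%m-%d", errors="coerce")
print("First day of the study:", dates.min().date())
print("Last day of the study:", dates.max().date())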

Date and time processing

Check the date/time format:

df_bikes['date'] + ' ' + df_bikes['heure']
0        2011-09-22 15
1        2011-09-22 15
2        2011-11-14 17
3        2011-01-27 18
             ...      
35330    2018-03-21 18
35331    2018-03-31 17
35332    2018-03-31 17
35333    2018-07-31 11
Length: 35334, dtype: object
time_improved = pd.to_datetime(
    df_bikes["date"] + " " + df_bikes["heure"],
    format="%Y-%m-%d %H",
    errors="coerce",
)
df_bikes["Time"] = time_improved
# remove rows with NaT
df_bikes.dropna(subset=["Time"], inplace=True)
# set new index
df_bikes.set_index("Time", inplace=True)
# remove useless columns
df_bikes.drop(columns=["heure", "date"], inplace=True)
df_bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35334 entries, 2011-09-22 15:00:00 to 2018-07-31 11:00:00
Data columns (total 37 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   identifiant accident        35334 non-null  int64  
 1   mois                        35334 non-null  object 
 2   jour                        35334 non-null  object 
 3   departement                 35334 non-null  object 
 4   commune                     35334 non-null  object 
 5   lat                         35334 non-null  float64
 6   lon                         35334 non-null  float64
 7   en agglomeration            35334 non-null  object 
 8   type intersection           35333 non-null  object 
 9   type collision              35330 non-null  object 
 10  luminosite                  35334 non-null  object 
 11  conditions atmosperiques    35333 non-null  object 
 12  type route                  35323 non-null  object 
 13  circulation                 132 non-null    object 
 14  nb voies                    31054 non-null  float64
 15  profil long route           32748 non-null  object 
 16  trace plan route            31556 non-null  object 
 17  largeur TPC                 2705 non-null   float64
 18  largeur route               18503 non-null  float64
 19  etat surface                33694 non-null  object 
 20  amenagement                 3880 non-null   object 
 21  situation                   32742 non-null  object 
 22  categorie usager            35334 non-null  object 
 23  gravite accident            35334 non-null  object 
 24  sexe                        35334 non-null  object 
 25  age                         35323 non-null  object 
 26  motif deplacement           27579 non-null  object 
 27  existence securite          34789 non-null  object 
 28  usage securite              33690 non-null  object 
 29  obstacle fixe heurte        803 non-null    object 
 30  obstacle mobile heurte      28615 non-null  object 
 31  localisation choc           30129 non-null  object 
 32  manoeuvre avant accident    31396 non-null  object 
 33  identifiant vehicule        35334 non-null  object 
 34  type autres vehicules       30531 non-null  object 
 35  manoeuvre autres vehicules  28143 non-null  object 
 36  nombre autres vehicules     30531 non-null  float64
dtypes: float64(6), int64(1), object(30)
memory usage: 10.2+ MB
df_bike2 = df_bikes.loc[
    :, ["gravite accident", "existence securite", "age", "sexe"]
]
df_bike2["existence securite"].replace({"Inconnu": np.nan}, inplace=True)
df_bike2.dropna(inplace=True)
EXERCISE: Does a helmet save your life?

Perform an analysis to assess whether wearing a helmet helps save your life (i.e., reduces accident severity).

Beware: some preprocessing is needed before using pd.crosstab or pivot_table to avoid issues.
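
One possible approach (a sketch, not necessarily the intended solution): compute, for each safety equipment, the distribution of accident severity in percent, using pd.crosstab with row-wise normalization on the cleaned df_bike2 defined above. This should yield a table like the one below.

# Sketch: severity distribution (in %) for each safety equipment
pd.crosstab(
    df_bike2["existence securite"],
    df_bike2["gravite accident"],
    normalize="index",
) * 100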

gravite accident          0 - Indemne  1 - Blessé léger  2 - Blessé hospitalisé    3 - Tué
existence securite
Autre                        6.380605         62.294583               29.143843   2.180970
Casque                       7.052803         55.608281               33.454292   3.884624
Ceinture                     3.414634         15.000000               69.268293  12.317073
Dispositif enfants           7.317073         64.634146               24.390244   3.658537
Equipement réfléchissant     4.994055         56.242568               33.848593   4.914784
EXERCISE: Are men and women dying equally on a bike?

Perform an analysis to check differences between men’s and women’s survival.
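
One possible starting point (a sketch): look at the overall share of each sexe among the recorded cyclists, then at the sexe breakdown (in percent) within each severity level; outputs similar to the two shown below can be obtained this way.

# Sketch: overall share of each sexe among recorded cyclists
df_bike2["sexe"].value_counts(normalize=True).sort_index()
# Sketch: sexe breakdown (in %) within each accident severity level
pd.crosstab(
    df_bike2["sexe"],
    df_bike2["gravite accident"],
    normalize="columns",
    margins=True,
) * 100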

sexe
F    0.239911
M    0.760089
dtype: float64
gravite accident  0 - Indemne  1 - Blessé léger  2 - Blessé hospitalisé    3 - Tué        All
sexe
F                    13.40564         27.903906               19.946115  15.884194  23.868538
M                    86.59436         72.096094               80.053885  84.115806  76.131462

Data visualization

Note that the dataset contains no information on the level of cycling practice by gender, so differences in accident counts cannot be interpreted as differences in risk.

EXERCISE: Accidents during the week?

Perform an analysis to check when accidents occur during the week.
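
A simple starting point (just a sketch, using the datetime index built above):

# Sketch: number of accidents per day of the week
weekday_counts = df_bikes.index.day_of_week.value_counts().sort_index()
weekday_counts.index = [calendar.day_abbr[d] for d in weekday_counts.index]
ax = weekday_counts.plot.bar()
ax.set_ylabel("Number of accidents")
plt.show()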

Time series visualization

df_bikes["weekday"] = df_bikes.index.day_of_week  # Monday=0, Sunday=6
df_bikes.groupby(['weekday', df_bikes.index.hour])['sexe'].count()
weekday  Time
0        0        16
         1        12
         2         2
         3         1
                ... 
6        20      122
         21       73
         22       42
         23       32
Name: sexe, Length: 168, dtype: int64
df_bikes.groupby(['weekday', df_bikes.index.hour])['age'].count()
weekday  Time
0        0        16
         1        12
         2         2
         3         1
                ... 
6        20      122
         21       73
         22       42
         23       32
Name: age, Length: 168, dtype: int64

The last two results are the same whether you count the 'age' or the 'sexe' column: count() only reports the number of (non-missing) entries per (weekday, hour) group, so any fully observed column gives the group sizes.

Create a daily profile per day of the week:

df_polar = (
    df_bikes.groupby(["weekday", df_bikes.index.hour])["sexe"]
    .count()
    .reset_index()
)  # any fully observed column would do here; 'sexe' could be replaced by 'age', for instance

df_polar = df_polar.astype({"Time": str}, copy=False)
df_polar["weekday"] = df_polar["weekday"].apply(lambda x: calendar.day_abbr[x])
df_polar.rename(columns={"sexe": "accidents"}, inplace=True)

Display these daily profiles:

n_colors = 8  # 7 days, but sampling 8 colors keeps the weekend days' colors close on the cyclic scale
colors = px.colors.sample_colorscale(
    "mrybm", [n / (n_colors - 1) for n in range(n_colors)]
)

fig = px.line_polar(
    df_polar,
    r="accidents",
    theta="Time",
    color="weekday",
    line_close=True,
    range_r=[0, 600],
    start_angle=0,
    color_discrete_sequence=colors,
    template="seaborn",
    title="Daily accident profile: weekday effect?",
)

fig.show()
Note

Plotly figures are interactive: clicking the legend entries on the right lets you show or hide individual days, which is very convenient for comparing days two at a time.

EXERCISE: Accidents during the year

Perform an analysis to check when accidents occur during the year.
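
A simple first look (a sketch), before building the finer daily profiles below:

# Sketch: number of accidents per month (all years pooled)
month_counts = df_bikes.index.month.value_counts().sort_index()
month_counts.index = [calendar.month_abbr[m] for m in month_counts.index]
ax = month_counts.plot.bar()
ax.set_ylabel("Number of accidents")
plt.show()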

Create a daily profile per month:

df_bikes["month"] = df_bikes.index.month  # Janvier=0, .... Decembre=11
# df_bikes['month'] = df_bikes['month'].apply(lambda x: calendar.month_abbr[x])
df_bikes.head()
df_bikes_month = (
    df_bikes.groupby(["month", df_bikes.index.hour])["age"]
    .count()
    .unstack(level=0)
)
df_polar2 = (
    df_bikes.groupby(["month", df_bikes.index.hour])["sexe"]
    .count()
    .reset_index()
)  # any fully observed column would do here; 'sexe' could be replaced by 'age', for instance

df_polar2 = df_polar2.astype({"Time": str}, copy=False)
df_polar2.rename(columns={"sexe": "accidents"}, inplace=True)
df_polar2["month"] = df_polar2["month"].apply(lambda x: calendar.month_abbr[x])

Display these daily profiles:

# create a cyclical color scale for 12 values:
n_colors = 12
colors = px.colors.sample_colorscale(
    "mrybm", [n / (n_colors - 1) for n in range(n_colors)]
)

fig = px.line_polar(
    df_polar2,
    r="accidents",
    theta="Time",
    color="month",
    line_close=True,
    range_r=[0, 410],
    start_angle=0,
    color_discrete_sequence=colors,
    template="seaborn",
    title="Daily accident profile: weekday effect?",
)
fig.show()
EXERCISE: Accidents by department

Perform an analysis of the number of accidents in each department, relative to population size.

Geographic visualization

In this part, we will use the geopandas library to visualize the data on a map, along with plotly for interactivity.

path_target = "./dpt_population.csv"
url = "https://public.opendatasoft.com/explore/dataset/population-francaise-par-departement-2018/download/?format=csv&timezone=Europe/Berlin&lang=en&use_labels_for_header=true&csv_separator=%3B"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/jsalmon/Documents/Mes_cours/Montpellier/HAX712X/Courses/Pandas/dpt_population.csv'
df_dtp_pop = pd.read_csv("dpt_population.csv", sep=";", low_memory=False)

df_dtp_pop["Code Département"].replace("2A", "20A", inplace=True)
df_dtp_pop["Code Département"].replace("2B", "20B", inplace=True)
df_dtp_pop.sort_values(by=["Code Département"], inplace=True)

df_bikes["departement"].replace("2A", "20A", inplace=True)
df_bikes["departement"].replace("2B", "20B", inplace=True)
df_bikes.sort_values(by=["departement"], inplace=True)
# Keep only departments present in the population table
df_bikes = df_bikes.loc[
    df_bikes["departement"].isin(df_dtp_pop["Code Département"].unique())
]

gd = df_bikes.groupby(["departement"], as_index=True, sort=True).size()

data = {"code": gd.index, "# Accidents per million": gd.values}
df = pd.DataFrame(data)
df["# Accidents per million"] = (
    df["# Accidents per million"].values
    * 1.0e6  # per million inhabitants, to match the column name
    / df_dtp_pop["Population"].values
)

We now need to download the .geojson file containing the geographic information for each department. We will use the pooch library to download the file and store it locally.

path_target = "./departements.geojson"
# url = "https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements-avec-outre-mer.geojson"
url = "https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements-version-simplifiee.geojson"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/jsalmon/Documents/Mes_cours/Montpellier/HAX712X/Courses/Pandas/departements.geojson'

First, you have to handle the Corsican departments, whose codes (2A and 2B) are not in the same format as the others.

import plotly.express as px
import geopandas

departement = geopandas.read_file("departements.geojson")
departement["code"].replace("2A", "20A", inplace=True)
departement["code"].replace("2B", "20B", inplace=True)

departement.sort_values(by=["code"], inplace=True)

a = ["0" + str(i) for i in range(1, 10)]
b = [str(i) for i in range(1, 10)]
dict_replace = dict(zip(a, b))

departement["code"].replace(dict_replace, inplace=True)
df["code"].replace(dict_replace, inplace=True)

departement["code"].replace("20A", "2A", inplace=True)
departement["code"].replace("20B", "2B", inplace=True)
df["code"].replace("20A", "2A", inplace=True)
df["code"].replace("20B", "2B", inplace=True)

departement.set_index("code", inplace=True)
print(departement['nom'].head(22))
code
1                         Ain
2                       Aisne
3                      Allier
4     Alpes-de-Haute-Provence
               ...           
19                    Corrèze
2A               Corse-du-Sud
2B                Haute-Corse
21                  Côte-d'Or
Name: nom, Length: 22, dtype: object

Once this is done, you can plot the data on a map.

fig = px.choropleth_mapbox(
    df,
    geojson=departement,
    locations="code",
    color="# Accidents per million",
    range_color=(0, df["# Accidents per million"].max()),
    color_continuous_scale="rdbu",
    center={"lat": 44, "lon": 2},
    zoom=3.55,
    mapbox_style="white-bg",
)
fig.update_traces(selector=dict(type="choroplethmapbox"))
fig.update_layout(
    title_text="Accidents per million inhabitants by department",
    coloraxis_colorbar=dict(thickness=20, orientation="h", y=0.051, x=0.5),
)
fig.show()
EXERCISE: Accidents by department

Perform an analysis of the number of accidents in each department, relative to the area of the department.
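
A possible sketch, reusing gd, df and departement as prepared above: compute each department's area with geopandas after projecting to Lambert-93 (EPSG:2154, a projected CRS suited to metropolitan France; this choice is an assumption), then normalize the raw accident counts by it.

# Sketch: accidents per km² by department (metropolitan France only)
areas_km2 = departement.to_crs(epsg=2154).geometry.area / 1e6  # areas in km², indexed by code
counts = pd.Series(gd.values, index=df["code"].values)  # raw accident counts, with cleaned codes
accidents_per_km2 = (counts / areas_km2).dropna()  # DOM codes have no geometry here, hence dropna
accidents_per_km2.sort_values(ascending=False).head(10)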

EXERCISE: plot DOMs and metropolitan France (hard?)

Plot the DOMs together with metropolitan France, so that all departments are visible on the same figure (for instance using this geojson file, and making all departments fit into a small figure).
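
A rough sketch of one possible approach with geopandas and matplotlib, using the "departements-avec-outre-mer.geojson" file whose URL is commented out above (the local file name, the "code" column, and the DOM codes below are assumptions to adapt):

path_target = "./departements_dom.geojson"
url = "https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements-avec-outre-mer.geojson"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)

dep_all = geopandas.read_file("departements_dom.geojson")
dom_codes = ["971", "972", "973", "974", "976"]  # Guadeloupe, Martinique, Guyane, La Réunion, Mayotte
fig, axes = plt.subplots(
    1, len(dom_codes) + 1, figsize=(14, 4),
    gridspec_kw={"width_ratios": [5] + [1] * len(dom_codes)},
)
dep_all[~dep_all["code"].isin(dom_codes)].plot(ax=axes[0])  # metropolitan France
axes[0].set_title("Métropole")
axes[0].set_axis_off()
for ax, code in zip(axes[1:], dom_codes):
    dep_all[dep_all["code"] == code].plot(ax=ax)  # one small panel per DOM
    ax.set_title(code)
    ax.set_axis_off()
plt.tight_layout()
plt.show()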

References

  • Other interactive tools for data visualization: Altair, Bokeh. See the comparisons by Aaron Geller: link
  • An interesting tutorial on Altair: Altair introduction