2 APM and Apriori#

Association pattern mining (APM) is a technique used in data mining to identify frequent patterns or associations among a set of items in large datasets.

2.1 The metrics used in APM#

These metrics quantify the strength and significance of the relationships discovered in a dataset.

2.1.1 Support#

The support of an itemset is the proportion of transactions in the dataset in which the itemset appears; it indicates how frequently the itemset occurs in the dataset.

Transactions refer to individual instances or records in a dataset that contain a collection of items.

\(\text{Support}(A \Rightarrow B) = \frac{\text{Freq.} (A \text{ and } B)}{\text{Total number of transactions}}\)

import pandas as pd
import numpy as np

# Define the items and records/transactions

data = {'Bread': [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
        'Milk':  [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
        'Cheese':[0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
        'Eggs':  [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
        'Yogurt':[0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1]
       }

# Define the df records/transactions
df = pd.DataFrame(data)
df
Bread Milk Cheese Eggs Yogurt
0 0 0 0 0 0
1 0 0 0 0 0
2 1 0 1 1 1
3 0 0 1 1 1
4 1 1 1 1 1
5 1 0 0 0 1
6 1 1 1 1 1
7 1 0 0 0 0
8 0 0 1 1 0
9 0 1 1 1 1
10 1 1 0 0 1
11 0 0 0 0 0
12 1 1 0 0 0
13 1 0 0 0 1
14 0 0 0 0 0
15 1 1 0 0 0
16 1 1 1 1 1
17 0 0 1 1 1
18 0 1 1 1 0
19 1 0 1 1 1

Example: \(\text{Support}(Bread \Rightarrow Cheese)\)

# Step 1: Select transactions where Bread and Cheese both appear, i.e., the rule Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]

# Step 2: Count these transactions
both_present_count = len(both_present)

# Step 3: Calculate the support
total_transactions = len(df)
support = both_present_count / total_transactions

print(f"Support of Bread => Cheese: {support:.2f}")
Support of Bread => Cheese: 0.25
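Since the item columns are 0/1 indicators, the same support can also be computed in a single vectorized line; a minimal equivalent sketch:

# Equivalent vectorized computation: mean of the elementwise AND
support_vec = (df['Bread'] & df['Cheese']).mean()
print(f"Support of Bread => Cheese: {support_vec:.2f}")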

2.1.2 Confidence#

Confidence measures the strength of the association between two items in APM. It is the number of transactions that contain both items divided by the number of transactions that contain the antecedent (A).

\(\text{Confidence}(A \Rightarrow B) = \frac{\text{Freq.}(A \text{ and } B)}{ \text{Freq.}(A)}\)

Example: \(\text{Confidence}(Bread \Rightarrow Cheese)\)

# Step 1: Select transactions where Bread and Cheese both appear, i.e., the rule Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]

# Step 2: Count these transactions
both_present_count = len(both_present)

# Step 3: Calculate the confidence
lhs_present_count = len(df[(df['Bread'] == 1)])
confidence = both_present_count / lhs_present_count

print(f"Confidence of Bread => Cheese: {confidence:.2f}")
Confidence of Bread => Cheese: 0.45

2.1.3 Completeness#

Completeness assesses the strength of the association rule (A => B) by quantifying how often A and B appear together relative to how often B (the consequent) appears at all, providing insight into how much of B's occurrence in the dataset the rule captures.

\(\text{Completeness}(A \Rightarrow B) = \frac{\text{Freq.}(A \text{ and } B)}{ \text{Freq.}(B)}\)

Example: \(\text{Completeness}(Bread \Rightarrow Cheese)\)

# Step 1: Select transactions where Bread and Cheese both appear, i.e., the rule Bread -> Cheese
both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]

# Step 2: Count these transactions
both_present_count = len(both_present)

# Step 3: Calculate the completeness
rhs_present_count = len(df[(df['Cheese'] == 1)])
completeness = both_present_count / rhs_present_count

print(f"Completeness of Bread => Cheese: {completeness:.2f}")
Completeness of Bread => Cheese: 0.50

2.1.4 Lift#

Lift quantifies how much more often items A and B occur together than expected if they were statistically independent.

\(\text{Lift}(A \Rightarrow B) = \frac{\text{Support}(A \text{ and } B)}{ \text{Support}(A) \times \text{Support}(B) }\)

  • Lift > 1: A and B are positively associated (potentially useful).

  • Lift = 1: A and B are independent (likely not useful).

  • Lift < 1: A and B are negatively associated (they occur together less often than chance would predict).

Example: \(\text{Lift}(Bread \Rightarrow Cheese)\)

lhs_rhs_support = support # see the support example: support of (bread and cheese)
lhs_support = lhs_present_count / total_transactions  # support of bread
rhs_support = rhs_present_count / total_transactions # support of cheese
# Calculation
lift = lhs_rhs_support / (lhs_support * rhs_support)
print(f"Lift of Bread => Cheese: {lift:.2f}")
Lift of Bread => Cheese: 0.91

2.1.5 Conviction#

Conviction essentially asks: how often does A occur without B, compared with what independence would predict? A high conviction value indicates that the rule A ⇒ B is violated far less often than chance would allow, suggesting a strong association between A and B. Conversely, a conviction value near (or below) 1 implies that A and B occurring together might be due to chance, indicating a weak association.

\(\text{Conviction}(A \Rightarrow B) = \frac{1 - \text{Support}(B)}{1 - \text{Confidence}(A \Rightarrow B)}\)

Example: \(\text{Conviction}(Bread \Rightarrow Cheese)\)

conviction = (1 - rhs_support) / (1 - confidence)
print(f"Conviction of Bread => Cheese: {conviction:.2f}")
Conviction of Bread => Cheese: 0.92
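The five metrics above share the same counting ingredients, so they can be wrapped in a single helper for any rule A => B over this toy df. A minimal sketch (the function name and return format are illustrative, not from a library):

def rule_metrics(df, lhs, rhs):
    """Support, confidence, completeness, lift, and conviction for lhs => rhs."""
    n = len(df)
    freq_a = (df[lhs] == 1).sum()
    freq_b = (df[rhs] == 1).sum()
    freq_ab = ((df[lhs] == 1) & (df[rhs] == 1)).sum()
    support = freq_ab / n
    confidence = freq_ab / freq_a
    completeness = freq_ab / freq_b
    lift = support / ((freq_a / n) * (freq_b / n))
    # Note: conviction is undefined (division by zero) when confidence == 1
    conviction = (1 - freq_b / n) / (1 - confidence)
    return {'support': support, 'confidence': confidence,
            'completeness': completeness, 'lift': lift, 'conviction': conviction}

rule_metrics(df, 'Bread', 'Cheese')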

2.2 Apriori algorithm#

The Apriori algorithm, introduced in ‘Fast Algorithms for Mining Association Rules’ by Rakesh Agrawal and Ramakrishnan Srikant (1994), is a foundational tool in data mining for discovering frequent itemsets and generating association rules.

This algorithm works on transaction databases, such as those recording items purchased by customers in a supermarket, with the primary goal of identifying itemsets that frequently appear together. It exploits the downward-closure (Apriori) property: every subset of a frequent itemset must itself be frequent, so candidates of size k are generated only from frequent itemsets of size k − 1, pruning the search space level by level.
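To make the level-wise idea concrete, here is a minimal, illustrative sketch of Apriori's search over the toy df from section 2.1 (for real work, use the mlxtend implementation shown next):

from itertools import combinations

def apriori_sketch(df, min_support=0.2):
    # Support of an itemset = fraction of rows where all its columns are 1
    itemset_support = lambda items: df[list(items)].all(axis=1).mean()
    # Level 1: frequent single items
    frequent = {frozenset([c]): itemset_support([c]) for c in df.columns
                if itemset_support([c]) >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: build size-k candidates from items seen at level k-1
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Prune step (downward closure): every (k-1)-subset must be frequent
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        frequent = {c: itemset_support(c) for c in candidates
                    if itemset_support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

apriori_sketch(df)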

Using apriori from the mlxtend library

# ! pip install mlxtend  
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# read csv as df
df_cs = pd.read_csv('cakeshop2.csv')
df_cs
TransactionID BREAD MILK JAM EGGS BUTTER SUGAR FLOUR CHOCOLATE YEAST CANDLES
0 1 Bread Milk Jam Eggs Butter Sugar Flour Chocolate Yeast Candles
1 2 NaN Milk NaN Eggs NaN Sugar Flour Chocolate NaN NaN
2 3 NaN Milk NaN Eggs NaN Sugar NaN Chocolate NaN NaN
3 4 NaN Milk NaN Eggs Butter NaN NaN Chocolate NaN Candles
4 5 NaN Milk NaN Eggs Butter NaN Flour Chocolate NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
495 496 Bread Milk NaN NaN Butter Sugar Flour NaN NaN Candles
496 497 Bread Milk NaN NaN NaN NaN NaN NaN NaN Candles
497 498 Bread Milk NaN NaN NaN NaN NaN NaN NaN Candles
498 499 NaN Milk NaN Eggs NaN NaN NaN NaN Yeast Candles
499 500 Bread Milk Jam Eggs Butter Sugar NaN NaN Yeast Candles

500 rows × 11 columns

# Drop the TransactionID column
del df_cs['TransactionID']
df_cs
BREAD MILK JAM EGGS BUTTER SUGAR FLOUR CHOCOLATE YEAST CANDLES
0 Bread Milk Jam Eggs Butter Sugar Flour Chocolate Yeast Candles
1 NaN Milk NaN Eggs NaN Sugar Flour Chocolate NaN NaN
2 NaN Milk NaN Eggs NaN Sugar NaN Chocolate NaN NaN
3 NaN Milk NaN Eggs Butter NaN NaN Chocolate NaN Candles
4 NaN Milk NaN Eggs Butter NaN Flour Chocolate NaN NaN
... ... ... ... ... ... ... ... ... ... ...
495 Bread Milk NaN NaN Butter Sugar Flour NaN NaN Candles
496 Bread Milk NaN NaN NaN NaN NaN NaN NaN Candles
497 Bread Milk NaN NaN NaN NaN NaN NaN NaN Candles
498 NaN Milk NaN Eggs NaN NaN NaN NaN Yeast Candles
499 Bread Milk Jam Eggs Butter Sugar NaN NaN Yeast Candles

500 rows × 10 columns

# Convert categorical variables into dummy/indicator variables (ones and zeros): one-hot encoding
df_cs = pd.get_dummies(df_cs)
df_cs
BREAD_Bread MILK_Milk JAM_Jam EGGS_Eggs BUTTER_Butter SUGAR_Sugar FLOUR_Flour CHOCOLATE_Chocolate YEAST_Yeast CANDLES_Candles
0 True True True True True True True True True True
1 False True False True False True True True False False
2 False True False True False True False True False False
3 False True False True True False False True False True
4 False True False True True False True True False False
... ... ... ... ... ... ... ... ... ... ...
495 True True False False True True True False False True
496 True True False False False False False False False True
497 True True False False False False False False False True
498 False True False True False False False False True True
499 True True True True True True False False True True

500 rows × 10 columns
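Note that TransactionEncoder was imported above but is not needed here, because get_dummies already works on this wide table. It is the usual route when transactions arrive as lists of items; a sketch on a made-up list (the transactions below are hypothetical, not from cakeshop2.csv):

# Hypothetical transactions as lists of items
transactions = [['Bread', 'Milk'], ['Milk', 'Eggs', 'Sugar'], ['Bread', 'Candles']]
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
pd.DataFrame(te_ary, columns=te.columns_)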

Support

df_frequent = apriori(df_cs, min_support=0.2, use_colnames=True)
df_frequent
support itemsets
0 0.712 (BREAD_Bread)
1 0.802 (MILK_Milk)
2 0.246 (JAM_Jam)
3 0.336 (EGGS_Eggs)
4 0.392 (BUTTER_Butter)
5 0.592 (BREAD_Bread, MILK_Milk)
6 0.240 (EGGS_Eggs, BREAD_Bread)
7 0.246 (BUTTER_Butter, BREAD_Bread)
8 0.284 (EGGS_Eggs, MILK_Milk)
9 0.258 (BUTTER_Butter, MILK_Milk)
10 0.208 (EGGS_Eggs, BREAD_Bread, MILK_Milk)
df_frequent['length'] = df_frequent.itemsets.apply(lambda x : len(x))
df_frequent
support itemsets length
0 0.712 (BREAD_Bread) 1
1 0.802 (MILK_Milk) 1
2 0.246 (JAM_Jam) 1
3 0.336 (EGGS_Eggs) 1
4 0.392 (BUTTER_Butter) 1
5 0.592 (BREAD_Bread, MILK_Milk) 2
6 0.240 (EGGS_Eggs, BREAD_Bread) 2
7 0.246 (BUTTER_Butter, BREAD_Bread) 2
8 0.284 (EGGS_Eggs, MILK_Milk) 2
9 0.258 (BUTTER_Butter, MILK_Milk) 2
10 0.208 (EGGS_Eggs, BREAD_Bread, MILK_Milk) 3
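With the length column in place, frequent itemsets can be filtered by size and support with ordinary pandas selection; for example:

# 2-itemsets with support of at least 0.25
df_frequent[(df_frequent['length'] == 2) & (df_frequent['support'] >= 0.25)]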

Metrics for association_rules

The metrics returned as columns:

# return all the rules with confidence above 0.7
association_rules(df_frequent, metric='confidence', min_threshold=0.7)
antecedents consequents antecedent support consequent support support confidence lift leverage conviction zhangs_metric
0 (BREAD_Bread) (MILK_Milk) 0.712 0.802 0.592 0.831461 1.036734 0.020976 1.174800 0.123029
1 (MILK_Milk) (BREAD_Bread) 0.802 0.712 0.592 0.738155 1.036734 0.020976 1.099886 0.178952
2 (EGGS_Eggs) (BREAD_Bread) 0.336 0.712 0.240 0.714286 1.003210 0.000768 1.008000 0.004819
3 (EGGS_Eggs) (MILK_Milk) 0.336 0.802 0.284 0.845238 1.053913 0.014528 1.279385 0.077041
4 (EGGS_Eggs, BREAD_Bread) (MILK_Milk) 0.240 0.802 0.208 0.866667 1.080632 0.015520 1.485000 0.098178
5 (EGGS_Eggs, MILK_Milk) (BREAD_Bread) 0.284 0.712 0.208 0.732394 1.028644 0.005792 1.076211 0.038891
# return all the rules with lift above 1.05
association_rules(df_frequent, metric='lift', min_threshold=1.05) 
antecedents consequents antecedent support consequent support support confidence lift leverage conviction zhangs_metric
0 (EGGS_Eggs) (MILK_Milk) 0.336 0.802 0.284 0.845238 1.053913 0.014528 1.279385 0.077041
1 (MILK_Milk) (EGGS_Eggs) 0.802 0.336 0.284 0.354115 1.053913 0.014528 1.028046 0.258358
2 (EGGS_Eggs, BREAD_Bread) (MILK_Milk) 0.240 0.802 0.208 0.866667 1.080632 0.015520 1.485000 0.098178
3 (MILK_Milk) (EGGS_Eggs, BREAD_Bread) 0.802 0.240 0.208 0.259352 1.080632 0.015520 1.026128 0.376845
# we define a df for rules for further selection
df_rules = association_rules(df_frequent, metric='lift', min_threshold=1.05) 
df_rules['antecedent_len'] = df_rules['antecedents'].apply(lambda x: len(x))
df_rules
antecedents consequents antecedent support consequent support support confidence lift leverage conviction zhangs_metric antecedent_len
0 (EGGS_Eggs) (MILK_Milk) 0.336 0.802 0.284 0.845238 1.053913 0.014528 1.279385 0.077041 1
1 (MILK_Milk) (EGGS_Eggs) 0.802 0.336 0.284 0.354115 1.053913 0.014528 1.028046 0.258358 1
2 (EGGS_Eggs, BREAD_Bread) (MILK_Milk) 0.240 0.802 0.208 0.866667 1.080632 0.015520 1.485000 0.098178 2
3 (MILK_Milk) (EGGS_Eggs, BREAD_Bread) 0.802 0.240 0.208 0.259352 1.080632 0.015520 1.026128 0.376845 1
# Keep rules whose antecedent contains at least two items
df_rules.query('antecedent_len >= 2')
antecedents consequents antecedent support consequent support support confidence lift leverage conviction zhangs_metric antecedent_len
2 (EGGS_Eggs, BREAD_Bread) (MILK_Milk) 0.24 0.802 0.208 0.866667 1.080632 0.01552 1.485 0.098178 2
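Because antecedents and consequents are frozensets, rules involving a particular item can be selected with a membership test; a small sketch:

# Rules whose antecedent contains EGGS_Eggs
df_rules[df_rules['antecedents'].apply(lambda x: 'EGGS_Eggs' in x)]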

2.3 APM in road accident data#

In this section, we will use APM on road accident data (‘accident_data_v1.0.0_2023.db’) as an example.

Schema info can be found in ‘dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2023-1.xlsx’

2.3.1 Database info#

import sqlite3
# Connect to accident database 
conn = sqlite3.connect('accident_data_v1.0.0_2023.db')
# Create a cursor for executing SQL
cur = conn.cursor()
# List all the tables in the database
res_tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'") 
table_names = res_tables.fetchall()
print(table_names)
[('accident',), ('casualty',), ('vehicle',), ('lsoa',)]
table_names[1][0]
'casualty'
# Check the foreign keys to understand the relationships across tables in the database
# it is clear that the accident_index column is used to link the tables
for tn in table_names:
    res_fk = cur.execute(f"PRAGMA foreign_key_list({tn[0]})")
    fk_info = res_fk.fetchall()
    if len(fk_info) < 1:
        print(f'No foreign keys found in Table: {tn[0]}')
    
    else:
        print(f'Table {tn[0]} fkeys info --',  
             'id:', fk_info[0][0], '--',
             'seq:', fk_info[0][1],'--',
             'table:', fk_info[0][2],'--',
             'from:', fk_info[0][3],'--',
             'to:', fk_info[0][4],'--',
             'on_update:', fk_info[0][5],'--',
             'on_delete:', fk_info[0][6],'--',
             'match:', fk_info[0][7]) 
No foreign keys found in Table: accident
Table casualty fkeys info -- id: 0 -- seq: 0 -- table: accident -- from: accident_index -- to: accident_index -- on_update: NO ACTION -- on_delete: NO ACTION -- match: NONE
Table vehicle fkeys info -- id: 0 -- seq: 0 -- table: accident -- from: accident_index -- to: accident_index -- on_update: NO ACTION -- on_delete: NO ACTION -- match: NONE
No foreign keys found in Table: lsoa
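Beyond foreign keys, each table's columns can be listed with PRAGMA table_info; a short sketch:

# Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk)
for tn in table_names:
    cols = cur.execute(f"PRAGMA table_info({tn[0]})").fetchall()
    print(tn[0], '->', [c[1] for c in cols])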

2.3.2 APM in accident table#

# using pandas to read table accident
df_ac = pd.read_sql("SELECT * FROM accident;", conn)
# This is an accident-level dataset: each row represents one accident, identified by 'accident_index'
df_ac
accident_index accident_year accident_reference location_easting_osgr location_northing_osgr longitude latitude police_force accident_severity number_of_vehicles ... pedestrian_crossing_physical_facilities light_conditions weather_conditions road_surface_conditions special_conditions_at_site carriageway_hazards urban_or_rural_area did_police_officer_attend_scene_of_accident trunk_road_flag lsoa_of_accident_location
0 2017010001708 2017 010001708 532920.0 196330.0 -0.080107 51.650061 1 1 2 ... 0 4 1 1 0 0 1 1 2 E01001450
1 2017010009342 2017 010009342 526790.0 181970.0 -0.173845 51.522425 1 3 2 ... 0 4 1 2 0 0 1 1 2 E01004702
2 2017010009344 2017 010009344 535200.0 181260.0 -0.052969 51.514096 1 3 3 ... 0 4 1 1 0 0 1 1 2 E01004298
3 2017010009348 2017 010009348 534340.0 193560.0 -0.060658 51.624832 1 3 2 ... 4 4 2 2 0 0 1 1 2 E01001429
4 2017010009350 2017 010009350 533680.0 187820.0 -0.072372 51.573408 1 2 1 ... 5 4 1 2 0 0 1 1 2 E01001808
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
461347 2020991027064 2020 991027064 343034.0 731654.0 -2.926320 56.473539 99 2 2 ... 0 1 1 1 0 0 1 1 -1 -1
461348 2020991029573 2020 991029573 257963.0 658891.0 -4.267565 55.802353 99 3 1 ... 0 1 1 1 0 0 1 2 -1 -1
461349 2020991030297 2020 991030297 383664.0 810646.0 -2.271903 57.186317 99 2 2 ... 0 1 1 1 0 0 2 1 -1 -1
461350 2020991030900 2020 991030900 277161.0 674852.0 -3.968753 55.950940 99 3 2 ... 0 1 1 1 0 0 1 2 -1 -1
461351 2020991032575 2020 991032575 240402.0 681950.0 -4.561040 56.003843 99 3 1 ... 0 1 1 1 0 2 1 1 -1 -1

461352 rows × 36 columns

You can use pandas .describe() or .info() to output basic information about a DataFrame.

You can also use libraries such as ydata-profiling, sweetviz, and dtale for the initial exploratory data analysis (EDA); a profiling sketch appears after the output below.

df_ac.describe()
accident_year location_easting_osgr location_northing_osgr longitude latitude police_force accident_severity number_of_vehicles number_of_casualties day_of_week ... pedestrian_crossing_human_control pedestrian_crossing_physical_facilities light_conditions weather_conditions road_surface_conditions special_conditions_at_site carriageway_hazards urban_or_rural_area did_police_officer_attend_scene_of_accident trunk_road_flag
count 461352.000000 461236.000000 4.612360e+05 461226.000000 461226.000000 461352.000000 461352.000000 461352.000000 461352.000000 461352.000000 ... 461352.000000 461352.000000 461352.000000 461352.000000 461352.000000 461352.000000 461352.000000 461352.000000 461352.000000 461352.000000
mean 2018.368378 452593.115756 2.801803e+05 -1.246998 52.408849 28.193388 2.784436 1.840441 1.301245 4.108171 ... 0.262112 1.099518 2.036235 1.649918 1.385239 0.217519 0.170518 1.325463 1.334877 1.680147
std 1.091566 94822.718705 1.515501e+05 1.389702 1.365101 24.801609 0.443648 0.709869 0.746398 1.927216 ... 1.460680 2.333113 1.724691 1.819426 0.955180 1.240981 1.121964 0.469057 0.556352 0.851554
min 2017.000000 64084.000000 1.023500e+04 -7.525273 49.912362 1.000000 1.000000 1.000000 1.000000 1.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
25% 2017.000000 388539.000000 1.755300e+05 -2.172668 51.465689 5.000000 3.000000 1.000000 1.000000 2.000000 ... 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 2.000000
50% 2018.000000 459198.500000 2.208650e+05 -1.126264 51.870257 22.000000 3.000000 2.000000 1.000000 4.000000 ... 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 2.000000
75% 2019.000000 529290.000000 3.865510e+05 -0.136389 53.372899 45.000000 3.000000 2.000000 1.000000 6.000000 ... 0.000000 0.000000 4.000000 1.000000 2.000000 0.000000 0.000000 2.000000 2.000000 2.000000
max 2020.000000 655391.000000 1.209512e+06 1.759641 60.763722 99.000000 3.000000 24.000000 59.000000 7.000000 ... 9.000000 9.000000 7.000000 9.000000 9.000000 9.000000 9.000000 3.000000 3.000000 2.000000

8 rows × 29 columns

df_ac.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 461352 entries, 0 to 461351
Data columns (total 36 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   accident_index                               461352 non-null  object 
 1   accident_year                                461352 non-null  int64  
 2   accident_reference                           461352 non-null  object 
 3   location_easting_osgr                        461236 non-null  float64
 4   location_northing_osgr                       461236 non-null  float64
 5   longitude                                    461226 non-null  float64
 6   latitude                                     461226 non-null  float64
 7   police_force                                 461352 non-null  int64  
 8   accident_severity                            461352 non-null  int64  
 9   number_of_vehicles                           461352 non-null  int64  
 10  number_of_casualties                         461352 non-null  int64  
 11  date                                         461352 non-null  object 
 12  day_of_week                                  461352 non-null  int64  
 13  time                                         461352 non-null  object 
 14  local_authority_district                     461352 non-null  int64  
 15  local_authority_ons_district                 461352 non-null  object 
 16  local_authority_highway                      461352 non-null  object 
 17  first_road_class                             461352 non-null  int64  
 18  first_road_number                            461352 non-null  int64  
 19  road_type                                    461352 non-null  int64  
 20  speed_limit                                  461352 non-null  int64  
 21  junction_detail                              461352 non-null  int64  
 22  junction_control                             461352 non-null  int64  
 23  second_road_class                            461352 non-null  int64  
 24  second_road_number                           461352 non-null  int64  
 25  pedestrian_crossing_human_control            461352 non-null  int64  
 26  pedestrian_crossing_physical_facilities      461352 non-null  int64  
 27  light_conditions                             461352 non-null  int64  
 28  weather_conditions                           461352 non-null  int64  
 29  road_surface_conditions                      461352 non-null  int64  
 30  special_conditions_at_site                   461352 non-null  int64  
 31  carriageway_hazards                          461352 non-null  int64  
 32  urban_or_rural_area                          461352 non-null  int64  
 33  did_police_officer_attend_scene_of_accident  461352 non-null  int64  
 34  trunk_road_flag                              461352 non-null  int64  
 35  lsoa_of_accident_location                    461352 non-null  object 
dtypes: float64(4), int64(25), object(7)
memory usage: 126.7+ MB
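As a sketch of the profiling route mentioned above (assuming the ydata-profiling package is installed; minimal=True skips the more expensive computations, which matters on a ~461k-row table):

# ! pip install ydata-profiling
from ydata_profiling import ProfileReport

profile = ProfileReport(df_ac, title='Accident data EDA', minimal=True)
profile.to_file('accident_eda.html')  # writes an interactive HTML report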

Select three columns for analysis

accident_severity, speed_limit, and weather_conditions (candidate rules take the form: if speed_limit and weather_conditions, then accident_severity)

# select three columns
df_ac_s = df_ac[['speed_limit', 'weather_conditions', 'accident_severity']]
df_ac_s
speed_limit weather_conditions accident_severity
0 30 1 1
1 30 1 3
2 30 1 3
3 30 2 3
4 20 1 2
... ... ... ...
461347 30 1 2
461348 30 1 3
461349 60 1 2
461350 30 1 3
461351 30 1 3

461352 rows × 3 columns

# One-hot encode the selected df columns

df_list = []
for c in ['speed_limit', 'weather_conditions', 'accident_severity']:
    df_c = pd.get_dummies(df_ac_s[c], prefix=c)
    df_list.append(df_c)

# axis=1 concatenates the dataframes in df_list horizontally (by columns), so df_oh
# combines their columns side by side with the indices aligned.
df_oh = pd.concat(df_list, axis=1) 
df_oh
speed_limit_-1 speed_limit_20 speed_limit_30 speed_limit_40 speed_limit_50 speed_limit_60 speed_limit_70 weather_conditions_-1 weather_conditions_1 weather_conditions_2 weather_conditions_3 weather_conditions_4 weather_conditions_5 weather_conditions_6 weather_conditions_7 weather_conditions_8 weather_conditions_9 accident_severity_1 accident_severity_2 accident_severity_3
0 False False True False False False False False True False False False False False False False False True False False
1 False False True False False False False False True False False False False False False False False False False True
2 False False True False False False False False True False False False False False False False False False False True
3 False False True False False False False False False True False False False False False False False False False True
4 False True False False False False False False True False False False False False False False False False True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
461347 False False True False False False False False True False False False False False False False False False True False
461348 False False True False False False False False True False False False False False False False False False False True
461349 False False False False False True False False True False False False False False False False False False True False
461350 False False True False False False False False True False False False False False False False False False False True
461351 False False True False False False False False True False False False False False False False False False False True

461352 rows × 20 columns
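For reference, the loop above can be collapsed into a single get_dummies call by passing the columns argument, which prefixes each dummy with its source column name (a sketch, equivalent to df_oh above):

# Equivalent one-liner: encode the three columns directly
df_oh = pd.get_dummies(df_ac_s, columns=['speed_limit', 'weather_conditions', 'accident_severity'])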

# Create frequent itemsets for rule generation
df_fre = apriori(df_oh, min_support=0.2, use_colnames=True)
df_fre
support itemsets
0 0.595235 (speed_limit_30)
1 0.795436 (weather_conditions_1)
2 0.798299 (accident_severity_3)
3 0.476458 (weather_conditions_1, speed_limit_30)
4 0.486100 (accident_severity_3, speed_limit_30)
5 0.631260 (weather_conditions_1, accident_severity_3)
6 0.387476 (weather_conditions_1, accident_severity_3, sp...
# select rules with confidence >0.7
df_rule = association_rules(df_fre, metric='confidence', min_threshold=0.7) 
# select rules with lift >1
df_rule.query('lift > 1')
antecedents consequents antecedent support consequent support support confidence lift leverage conviction zhangs_metric
0 (speed_limit_30) (weather_conditions_1) 0.595235 0.795436 0.476458 0.800454 1.006308 0.002987 1.025146 0.015487
1 (speed_limit_30) (accident_severity_3) 0.595235 0.798299 0.486100 0.816651 1.022989 0.010924 1.100092 0.055518
4 (weather_conditions_1, speed_limit_30) (accident_severity_3) 0.476458 0.798299 0.387476 0.813243 1.018719 0.007120 1.080016 0.035098
5 (accident_severity_3, speed_limit_30) (weather_conditions_1) 0.486100 0.795436 0.387476 0.797113 1.002109 0.000815 1.008267 0.004094

As ‘weather_conditions_1’ represents ‘Fine no high winds’ and ‘accident_severity_3’ represents ‘Slight’, the useful rule reads: ‘the weather is fine with no high winds’ and ‘the road speed limit is 30’ \(\Rightarrow\) ‘the accident severity is slight’.
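For reporting, the frozenset columns can be rendered as readable rule strings; a small illustrative sketch:

# Print each selected rule in a readable form
for _, r in df_rule.query('lift > 1').iterrows():
    lhs = ' AND '.join(sorted(r['antecedents']))
    rhs = ' AND '.join(sorted(r['consequents']))
    print(f"{lhs} => {rhs}  (conf={r['confidence']:.2f}, lift={r['lift']:.2f})")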