{ "cells": [ { "cell_type": "markdown", "id": "f855094f-31d3-4951-971d-cddef0e776b9", "metadata": {}, "source": [ "# 2 APM and Apriori" ] }, { "cell_type": "markdown", "id": "1f1201cc-92f3-4f38-bcdd-7591da4851ef", "metadata": {}, "source": [ "Association pattern mining (APM) is a technique used in data mining to identify frequent patterns or associations among a set of items in large datasets. " ] }, { "attachments": {}, "cell_type": "markdown", "id": "4b20acf7-0b6d-4fca-9eae-8e2f0c9daeed", "metadata": {}, "source": [ "## 2.1 The metrics used in APM\n", "\n", "The metrics help to quantify the strength and significance of these discovered relationships in data sets." ] }, { "cell_type": "markdown", "id": "9b7a5f1a-3d25-4fe3-8dea-24ffe6fb1244", "metadata": {}, "source": [ "### 2.1.1 Support\n", "\n", "__Support__ of an itemset is the proportion of transactions in the dataset in which the itemset appears, it indicates how frequently the itemset appears in the dataset. \n", "\n", "Transactions refer to individual instances or records in a dataset that contain a collection of items. \n", "\n", "\n", "$\\text{Support}(A \\Rightarrow B) = \\frac{\\text{Freq.} (A \\text{ and } B)}{\\text{Total number of transactions}}$\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "ab4512fa-ee4c-476d-9c65-ae6ea5348c9d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Define the items and records/transactions\n", "\n", "data = {'Bread': [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1],\n", " 'Milk': [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0],\n", " 'Cheese':[0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],\n", " 'Eggs': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],\n", " 'Yogurt':[0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1]\n", " }\n", "\n", "# Define the df records/transactions\n", "df = pd.DataFrame(data)\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "72c1cfa1-1d88-4465-b5b5-3f9c5c070840", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BreadMilkCheeseEggsYogurt
000000
100000
210111
300111
411111
510001
611111
710000
800110
901111
1011001
1100000
1211000
1310001
1400000
1511000
1611111
1700111
1801110
1910111
\n", "
" ], "text/plain": [ " Bread Milk Cheese Eggs Yogurt\n", "0 0 0 0 0 0\n", "1 0 0 0 0 0\n", "2 1 0 1 1 1\n", "3 0 0 1 1 1\n", "4 1 1 1 1 1\n", "5 1 0 0 0 1\n", "6 1 1 1 1 1\n", "7 1 0 0 0 0\n", "8 0 0 1 1 0\n", "9 0 1 1 1 1\n", "10 1 1 0 0 1\n", "11 0 0 0 0 0\n", "12 1 1 0 0 0\n", "13 1 0 0 0 1\n", "14 0 0 0 0 0\n", "15 1 1 0 0 0\n", "16 1 1 1 1 1\n", "17 0 0 1 1 1\n", "18 0 1 1 1 0\n", "19 1 0 1 1 1" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "41a94204-ef45-46dc-9000-99eb074677f8", "metadata": {}, "source": [ "__Example:__ $\\text{Support}(Bread \\Rightarrow Cheese)$" ] }, { "cell_type": "code", "execution_count": 3, "id": "b685c137-9a92-4822-82cc-906fd0b5aedd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Support of Bread => Cheese: 0.25\n" ] } ], "source": [ "# Step 1: Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese\n", "both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]\n", "\n", "# Step 2: Count these transactions\n", "both_present_count = len(both_present)\n", "\n", "# Step 3: Calculate\n", "total_transactions = len(df)\n", "support = both_present_count / total_transactions\n", "\n", "print(f\"Support of Bread => Cheese: {support:.2f}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3556614a-e0b1-4a49-b0db-05286cb63030", "metadata": {}, "source": [ "### 2.1.2 Confidence \n", "\n", "__Confidence__ measures the strength of association between two items in APM. It is determined by dividing the number of transactions that contain both items by the number of transactions that contain the first item alone.\n", "\n", "$\\text{Confidence}(A \\Rightarrow B) = \\frac{\\text{Freq.}(A \\text{ and } B)}{ \\text{Freq.}(A)}$" ] }, { "cell_type": "code", "execution_count": null, "id": "6f96bef0-fe1d-4798-a824-e4db4d0acb81", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "ae788299-0629-4741-994d-65d8425e82f4", "metadata": {}, "source": [ "__Example:__ $\\text{Confidence}(Bread \\Rightarrow Cheese)$" ] }, { "cell_type": "code", "execution_count": 4, "id": "6ee3e6ef-9537-4b82-8cf4-6d01ec1cbe73", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confidence of Bread => Cheese: 0.45\n" ] } ], "source": [ "# Step 1: Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese\n", "both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]\n", "\n", "# Step 2: Count these transactions\n", "both_present_count = len(both_present)\n", "\n", "# Step 3: Calculate\n", "lhs_present_count = len(df[(df['Bread'] == 1)])\n", "confidence = both_present_count / lhs_present_count\n", "\n", "print(f\"Confidence of Bread => Cheese: {confidence:.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c179eb91-8fff-4fa5-ab0f-404ecfd5cf0f", "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "32faf6cb-0026-4e4c-a9ff-b3ae69a202ad", "metadata": {}, "source": [ "### 2.1.3 Completeness\n", "\n", "__Completeness__ assess the strength of the association rule (A => B) by quantifying how often B (consequence) appears relative to A (Antecedent), providing insights into the reliability and coverage of the rule in capturing the relationship between A and B in the dataset.\n", "\n", "\n", "$\\text{Completeness}(A \\Rightarrow B) = \\frac{\\text{Freq.}(A \\text{ and } B)}{ \\text{Freq.}(B)}$" ] }, { "cell_type": "code", "execution_count": null, "id": "74d0765d-ece1-4c46-a6c6-6b21d8a095f0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "61eae92f-df6f-4e91-a429-f576c8f1b0ca", "metadata": {}, "source": [ "__Example:__ $\\text{Completeness}(Bread \\Rightarrow Cheese)$" ] }, { "cell_type": "code", "execution_count": 5, "id": "e4323837-dc1c-4405-8b4d-bf4548b57401", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Completeness of Bread => Cheese: 0.50\n" ] } ], "source": [ "# Step 1: Count transactions where Bread and Cheese both appear, i.e., the rule of Bread -> Cheese\n", "both_present = df[(df['Bread'] == 1) & (df['Cheese'] == 1)]\n", "\n", "# Step 2: Count these transactions\n", "both_present_count = len(both_present)\n", "\n", "# Step 3: Calculate\n", "rhs_present_count = len(df[(df['Cheese'] == 1)])\n", "Completeness = both_present_count / rhs_present_count\n", "\n", "print(f\"Completeness of Bread => Cheese: {Completeness:.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b2bd3145-38f9-44f4-872a-8d06ca21c530", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "b88a00bd-659a-47d8-aea0-e2fba5333e29", "metadata": {}, "source": [ "### 2.1.4 Lift\n", "\n", "__Lift__ quantifies how much more often items A and B occur together than expected if they were statistically independent.\n", "\n", "$\\text{Lift}(A \\Rightarrow B) = \\frac{\\text{Support}(A \\text{ and } B)}{ \\text{Support}(A) \\times \\text{Support}(B) }$\n", "\n", "\n", "- Lift > 1: A and B are positively associated (potentially useful).\n", "- Lift = 1: A and B are independent (likely not useful).\n", "- Lift < 1: A and B are negatively associated (potentially anti-useful)." ] }, { "cell_type": "code", "execution_count": null, "id": "62684c5a-1bd2-4011-9e73-b8d970b00132", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "87629b02-61e6-4760-a1db-adeea7119dcf", "metadata": {}, "source": [ "__Example:__ $\\text{Lift}(Bread \\Rightarrow Cheese)$" ] }, { "cell_type": "code", "execution_count": 6, "id": "043caf29-e024-4904-b4d8-8e100670881c", "metadata": {}, "outputs": [], "source": [ "lhs_rhs_support = support # see the support example: support of (bread and cheese)" ] }, { "cell_type": "code", "execution_count": 7, "id": "133bf868-62a6-4da0-a90f-86b5a0c86150", "metadata": {}, "outputs": [], "source": [ "lhs_support = lhs_present_count / total_transactions # support of bread" ] }, { "cell_type": "code", "execution_count": 8, "id": "8121d566-7ddf-444c-98ef-05f966cd66ba", "metadata": {}, "outputs": [], "source": [ "rhs_support = rhs_present_count / total_transactions # support of cheese" ] }, { "cell_type": "code", "execution_count": 9, "id": "37c17282-7e3d-4d65-b79c-b60673ab6977", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lift of Bread => Cheese: 0.91\n" ] } ], "source": [ "# Caculation\n", "lift = lhs_rhs_support / (lhs_support * rhs_support)\n", "print(f\"Lift of Bread => Cheese: {lift:.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d2c91230-20ee-482c-9291-0c7c2ca2b7eb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "64e47f17-fab0-403c-a424-552686b70820", "metadata": {}, "source": [ "### 2.1.5 Conviction \n", "\n", "__Conviction__ essentially asks the question: how often does A occur without B? A high conviction value indicates that the rule A ⇒ B is unlikely to happen by chance, suggesting a strong association between A and B. Conversely, a low conviction value implies that A and B occurring together might be due to chance, indicating a weaker association.\n", "\n", "$\\text{Conviction}(A \\Rightarrow B) = \\frac{1 - \\text{Support}(B)}{1 - \\text{Confidence}(A \\Rightarrow B)}$\n" ] }, { "cell_type": "code", "execution_count": null, "id": "986d4212-97d8-43e1-b850-bc70e00239d6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e2595d16-b31a-4d32-b052-dc85c03c3cb6", "metadata": {}, "source": [ "__Example:__ $\\text{Conviction}(Bread \\Rightarrow Cheese)$" ] }, { "cell_type": "code", "execution_count": 10, "id": "91a35382-3793-4235-bc78-24f9eaf3f8f1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Conviction of Bread => Cheese: 0.92\n" ] } ], "source": [ "conviction = (1 - rhs_support) / (1 - confidence)\n", "print(f\"Conviction of Bread => Cheese: {conviction:.2f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4981ad05-53c9-4123-ad72-5124d60034f3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b1b2326d-70a8-4114-9b29-3f02651d2dbb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8d850747-dd92-4b0f-867e-e774e8debfa3", "metadata": {}, "source": [ "## 2.2 Apriori algorithm\n", "\n", "The Apriori algorithm, introduced in [Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf) by Rakesh Agrawal and Ramakrishnan Srikant in 1994, is a foundational tool in data mining for discovering frequent itemsets and generating association rules. \n", "\n", "This algorithm works on transaction databases, such as those recording items purchased by customers in a supermarket, with the primary goal of identifying itemsets that frequently appear together.\n", "\n", "[apriori in mlxtend lib](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/)" ] }, { "cell_type": "code", "execution_count": 11, "id": "3d1ca790-b61a-45cf-87b6-149f0283dd3f", "metadata": {}, "outputs": [], "source": [ "# ! pip install mlxtend " ] }, { "cell_type": "code", "execution_count": 12, "id": "6f5baa38-6a9b-4f4a-88fa-30a9ca094b01", "metadata": {}, "outputs": [], "source": [ "from mlxtend.preprocessing import TransactionEncoder\n", "from mlxtend.frequent_patterns import apriori, association_rules" ] }, { "cell_type": "code", "execution_count": 13, "id": "379b8079-a8b2-4dd7-b208-4015a83f4aa7", "metadata": {}, "outputs": [], "source": [ "# read csv as df\n", "df_cs = pd.read_csv('cakeshop2.csv')" ] }, { "cell_type": "code", "execution_count": 14, "id": "7d04b306-eea1-4155-97ce-52b6f22f93e3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TransactionIDBREADMILKJAMEGGSBUTTERSUGARFLOURCHOCOLATEYEASTCANDLES
01BreadMilkJamEggsButterSugarFlourChocolateYeastCandles
12NaNMilkNaNEggsNaNSugarFlourChocolateNaNNaN
23NaNMilkNaNEggsNaNSugarNaNChocolateNaNNaN
34NaNMilkNaNEggsButterNaNNaNChocolateNaNCandles
45NaNMilkNaNEggsButterNaNFlourChocolateNaNNaN
....................................
495496BreadMilkNaNNaNButterSugarFlourNaNNaNCandles
496497BreadMilkNaNNaNNaNNaNNaNNaNNaNCandles
497498BreadMilkNaNNaNNaNNaNNaNNaNNaNCandles
498499NaNMilkNaNEggsNaNNaNNaNNaNYeastCandles
499500BreadMilkJamEggsButterSugarNaNNaNYeastCandles
\n", "

500 rows × 11 columns

\n", "
" ], "text/plain": [ " TransactionID BREAD MILK JAM EGGS BUTTER SUGAR FLOUR CHOCOLATE \\\n", "0 1 Bread Milk Jam Eggs Butter Sugar Flour Chocolate \n", "1 2 NaN Milk NaN Eggs NaN Sugar Flour Chocolate \n", "2 3 NaN Milk NaN Eggs NaN Sugar NaN Chocolate \n", "3 4 NaN Milk NaN Eggs Butter NaN NaN Chocolate \n", "4 5 NaN Milk NaN Eggs Butter NaN Flour Chocolate \n", ".. ... ... ... ... ... ... ... ... ... \n", "495 496 Bread Milk NaN NaN Butter Sugar Flour NaN \n", "496 497 Bread Milk NaN NaN NaN NaN NaN NaN \n", "497 498 Bread Milk NaN NaN NaN NaN NaN NaN \n", "498 499 NaN Milk NaN Eggs NaN NaN NaN NaN \n", "499 500 Bread Milk Jam Eggs Butter Sugar NaN NaN \n", "\n", " YEAST CANDLES \n", "0 Yeast Candles \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN Candles \n", "4 NaN NaN \n", ".. ... ... \n", "495 NaN Candles \n", "496 NaN Candles \n", "497 NaN Candles \n", "498 Yeast Candles \n", "499 Yeast Candles \n", "\n", "[500 rows x 11 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_cs" ] }, { "cell_type": "code", "execution_count": 15, "id": "5fc09292-f67c-41f8-962d-d8ba2134a15b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BREADMILKJAMEGGSBUTTERSUGARFLOURCHOCOLATEYEASTCANDLES
0BreadMilkJamEggsButterSugarFlourChocolateYeastCandles
1NaNMilkNaNEggsNaNSugarFlourChocolateNaNNaN
2NaNMilkNaNEggsNaNSugarNaNChocolateNaNNaN
3NaNMilkNaNEggsButterNaNNaNChocolateNaNCandles
4NaNMilkNaNEggsButterNaNFlourChocolateNaNNaN
.................................
495BreadMilkNaNNaNButterSugarFlourNaNNaNCandles
496BreadMilkNaNNaNNaNNaNNaNNaNNaNCandles
497BreadMilkNaNNaNNaNNaNNaNNaNNaNCandles
498NaNMilkNaNEggsNaNNaNNaNNaNYeastCandles
499BreadMilkJamEggsButterSugarNaNNaNYeastCandles
\n", "

500 rows × 10 columns

\n", "
" ], "text/plain": [ " BREAD MILK JAM EGGS BUTTER SUGAR FLOUR CHOCOLATE YEAST CANDLES\n", "0 Bread Milk Jam Eggs Butter Sugar Flour Chocolate Yeast Candles\n", "1 NaN Milk NaN Eggs NaN Sugar Flour Chocolate NaN NaN\n", "2 NaN Milk NaN Eggs NaN Sugar NaN Chocolate NaN NaN\n", "3 NaN Milk NaN Eggs Butter NaN NaN Chocolate NaN Candles\n", "4 NaN Milk NaN Eggs Butter NaN Flour Chocolate NaN NaN\n", ".. ... ... ... ... ... ... ... ... ... ...\n", "495 Bread Milk NaN NaN Butter Sugar Flour NaN NaN Candles\n", "496 Bread Milk NaN NaN NaN NaN NaN NaN NaN Candles\n", "497 Bread Milk NaN NaN NaN NaN NaN NaN NaN Candles\n", "498 NaN Milk NaN Eggs NaN NaN NaN NaN Yeast Candles\n", "499 Bread Milk Jam Eggs Butter Sugar NaN NaN Yeast Candles\n", "\n", "[500 rows x 10 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# delete TransactionID columns\n", "del df_cs['TransactionID']\n", "df_cs" ] }, { "cell_type": "code", "execution_count": 16, "id": "41694218-a61a-4c8b-b58c-8d589df30363", "metadata": {}, "outputs": [], "source": [ "# Convert categorical variable into dummy/indicator variables (Converting to ones and zeros): one-hot or hot-one encoding\n", "df_cs = pd.get_dummies(df_cs)" ] }, { "cell_type": "code", "execution_count": 17, "id": "f15be594-5ba2-413a-be8b-682e17d33835", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BREAD_BreadMILK_MilkJAM_JamEGGS_EggsBUTTER_ButterSUGAR_SugarFLOUR_FlourCHOCOLATE_ChocolateYEAST_YeastCANDLES_Candles
0TrueTrueTrueTrueTrueTrueTrueTrueTrueTrue
1FalseTrueFalseTrueFalseTrueTrueTrueFalseFalse
2FalseTrueFalseTrueFalseTrueFalseTrueFalseFalse
3FalseTrueFalseTrueTrueFalseFalseTrueFalseTrue
4FalseTrueFalseTrueTrueFalseTrueTrueFalseFalse
.................................
495TrueTrueFalseFalseTrueTrueTrueFalseFalseTrue
496TrueTrueFalseFalseFalseFalseFalseFalseFalseTrue
497TrueTrueFalseFalseFalseFalseFalseFalseFalseTrue
498FalseTrueFalseTrueFalseFalseFalseFalseTrueTrue
499TrueTrueTrueTrueTrueTrueFalseFalseTrueTrue
\n", "

500 rows × 10 columns

\n", "
" ], "text/plain": [ " BREAD_Bread MILK_Milk JAM_Jam EGGS_Eggs BUTTER_Butter SUGAR_Sugar \\\n", "0 True True True True True True \n", "1 False True False True False True \n", "2 False True False True False True \n", "3 False True False True True False \n", "4 False True False True True False \n", ".. ... ... ... ... ... ... \n", "495 True True False False True True \n", "496 True True False False False False \n", "497 True True False False False False \n", "498 False True False True False False \n", "499 True True True True True True \n", "\n", " FLOUR_Flour CHOCOLATE_Chocolate YEAST_Yeast CANDLES_Candles \n", "0 True True True True \n", "1 True True False False \n", "2 False True False False \n", "3 False True False True \n", "4 True True False False \n", ".. ... ... ... ... \n", "495 True False False True \n", "496 False False False True \n", "497 False False False True \n", "498 False False True True \n", "499 False False True True \n", "\n", "[500 rows x 10 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_cs" ] }, { "cell_type": "markdown", "id": "f85962a0-0fbf-4d17-a880-de583bc4ab89", "metadata": {}, "source": [ "__Support__" ] }, { "cell_type": "code", "execution_count": 18, "id": "7e6652c5-3305-4d9c-9173-efb346965dc0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
supportitemsets
00.712(BREAD_Bread)
10.802(MILK_Milk)
20.246(JAM_Jam)
30.336(EGGS_Eggs)
40.392(BUTTER_Butter)
50.592(BREAD_Bread, MILK_Milk)
60.240(BREAD_Bread, EGGS_Eggs)
70.246(BREAD_Bread, BUTTER_Butter)
80.284(MILK_Milk, EGGS_Eggs)
90.258(MILK_Milk, BUTTER_Butter)
100.208(BREAD_Bread, MILK_Milk, EGGS_Eggs)
\n", "
" ], "text/plain": [ " support itemsets\n", "0 0.712 (BREAD_Bread)\n", "1 0.802 (MILK_Milk)\n", "2 0.246 (JAM_Jam)\n", "3 0.336 (EGGS_Eggs)\n", "4 0.392 (BUTTER_Butter)\n", "5 0.592 (BREAD_Bread, MILK_Milk)\n", "6 0.240 (BREAD_Bread, EGGS_Eggs)\n", "7 0.246 (BREAD_Bread, BUTTER_Butter)\n", "8 0.284 (MILK_Milk, EGGS_Eggs)\n", "9 0.258 (MILK_Milk, BUTTER_Butter)\n", "10 0.208 (BREAD_Bread, MILK_Milk, EGGS_Eggs)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_frequent = apriori(df_cs, min_support=0.2, use_colnames=True)\n", "df_frequent" ] }, { "cell_type": "code", "execution_count": 19, "id": "a8e6852d-7bc7-4698-8d2a-80434ab35680", "metadata": {}, "outputs": [], "source": [ "df_frequent['length'] = df_frequent.itemsets.apply(lambda x : len(x))" ] }, { "cell_type": "code", "execution_count": 20, "id": "5e340c1c-2229-4459-9119-dd14435031e0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
supportitemsetslength
00.712(BREAD_Bread)1
10.802(MILK_Milk)1
20.246(JAM_Jam)1
30.336(EGGS_Eggs)1
40.392(BUTTER_Butter)1
50.592(BREAD_Bread, MILK_Milk)2
60.240(BREAD_Bread, EGGS_Eggs)2
70.246(BREAD_Bread, BUTTER_Butter)2
80.284(MILK_Milk, EGGS_Eggs)2
90.258(MILK_Milk, BUTTER_Butter)2
100.208(BREAD_Bread, MILK_Milk, EGGS_Eggs)3
\n", "
" ], "text/plain": [ " support itemsets length\n", "0 0.712 (BREAD_Bread) 1\n", "1 0.802 (MILK_Milk) 1\n", "2 0.246 (JAM_Jam) 1\n", "3 0.336 (EGGS_Eggs) 1\n", "4 0.392 (BUTTER_Butter) 1\n", "5 0.592 (BREAD_Bread, MILK_Milk) 2\n", "6 0.240 (BREAD_Bread, EGGS_Eggs) 2\n", "7 0.246 (BREAD_Bread, BUTTER_Butter) 2\n", "8 0.284 (MILK_Milk, EGGS_Eggs) 2\n", "9 0.258 (MILK_Milk, BUTTER_Butter) 2\n", "10 0.208 (BREAD_Bread, MILK_Milk, EGGS_Eggs) 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_frequent" ] }, { "cell_type": "code", "execution_count": null, "id": "f48be902-6f01-44ce-8892-e8aa71f682ec", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "47683603-7a36-4555-956a-8619f3498aa7", "metadata": {}, "source": [ "__Metrics for association_rules__\n", "\n", "[The returned columns metrics](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/#association_rules)" ] }, { "cell_type": "code", "execution_count": 21, "id": "7ad257d8-314d-4c49-adf8-73e5845663e4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
antecedentsconsequentsantecedent supportconsequent supportsupportconfidenceliftleverageconvictionzhangs_metric
0(BREAD_Bread)(MILK_Milk)0.7120.8020.5920.8314611.0367340.0209761.1748000.123029
1(MILK_Milk)(BREAD_Bread)0.8020.7120.5920.7381551.0367340.0209761.0998860.178952
2(EGGS_Eggs)(BREAD_Bread)0.3360.7120.2400.7142861.0032100.0007681.0080000.004819
3(EGGS_Eggs)(MILK_Milk)0.3360.8020.2840.8452381.0539130.0145281.2793850.077041
4(BREAD_Bread, EGGS_Eggs)(MILK_Milk)0.2400.8020.2080.8666671.0806320.0155201.4850000.098178
5(MILK_Milk, EGGS_Eggs)(BREAD_Bread)0.2840.7120.2080.7323941.0286440.0057921.0762110.038891
\n", "
" ], "text/plain": [ " antecedents consequents antecedent support \\\n", "0 (BREAD_Bread) (MILK_Milk) 0.712 \n", "1 (MILK_Milk) (BREAD_Bread) 0.802 \n", "2 (EGGS_Eggs) (BREAD_Bread) 0.336 \n", "3 (EGGS_Eggs) (MILK_Milk) 0.336 \n", "4 (BREAD_Bread, EGGS_Eggs) (MILK_Milk) 0.240 \n", "5 (MILK_Milk, EGGS_Eggs) (BREAD_Bread) 0.284 \n", "\n", " consequent support support confidence lift leverage conviction \\\n", "0 0.802 0.592 0.831461 1.036734 0.020976 1.174800 \n", "1 0.712 0.592 0.738155 1.036734 0.020976 1.099886 \n", "2 0.712 0.240 0.714286 1.003210 0.000768 1.008000 \n", "3 0.802 0.284 0.845238 1.053913 0.014528 1.279385 \n", "4 0.802 0.208 0.866667 1.080632 0.015520 1.485000 \n", "5 0.712 0.208 0.732394 1.028644 0.005792 1.076211 \n", "\n", " zhangs_metric \n", "0 0.123029 \n", "1 0.178952 \n", "2 0.004819 \n", "3 0.077041 \n", "4 0.098178 \n", "5 0.038891 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return all the rules with confidence above 0.7\n", "association_rules(df_frequent, metric='confidence', min_threshold=0.7)" ] }, { "cell_type": "code", "execution_count": 22, "id": "ee5fe64e-d001-4cec-9fa4-d42424910298", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
antecedentsconsequentsantecedent supportconsequent supportsupportconfidenceliftleverageconvictionzhangs_metric
0(MILK_Milk)(EGGS_Eggs)0.8020.3360.2840.3541151.0539130.0145281.0280460.258358
1(EGGS_Eggs)(MILK_Milk)0.3360.8020.2840.8452381.0539130.0145281.2793850.077041
2(BREAD_Bread, EGGS_Eggs)(MILK_Milk)0.2400.8020.2080.8666671.0806320.0155201.4850000.098178
3(MILK_Milk)(BREAD_Bread, EGGS_Eggs)0.8020.2400.2080.2593521.0806320.0155201.0261280.376845
\n", "
" ], "text/plain": [ " antecedents consequents antecedent support \\\n", "0 (MILK_Milk) (EGGS_Eggs) 0.802 \n", "1 (EGGS_Eggs) (MILK_Milk) 0.336 \n", "2 (BREAD_Bread, EGGS_Eggs) (MILK_Milk) 0.240 \n", "3 (MILK_Milk) (BREAD_Bread, EGGS_Eggs) 0.802 \n", "\n", " consequent support support confidence lift leverage conviction \\\n", "0 0.336 0.284 0.354115 1.053913 0.014528 1.028046 \n", "1 0.802 0.284 0.845238 1.053913 0.014528 1.279385 \n", "2 0.802 0.208 0.866667 1.080632 0.015520 1.485000 \n", "3 0.240 0.208 0.259352 1.080632 0.015520 1.026128 \n", "\n", " zhangs_metric \n", "0 0.258358 \n", "1 0.077041 \n", "2 0.098178 \n", "3 0.376845 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# return all the rules with lift above 1.05\n", "association_rules(df_frequent, metric='lift', min_threshold=1.05) " ] }, { "cell_type": "code", "execution_count": null, "id": "2915ac7a-7a38-4155-ad20-a1932768ffea", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 23, "id": "ed9934f3-4de5-455b-a183-a827a5cef4c4", "metadata": {}, "outputs": [], "source": [ "# we define a df for rules for further selection\n", "df_rules = association_rules(df_frequent, metric='lift', min_threshold=1.05) \n", "df_rules['antecedent_len'] = df_rules['antecedents'].apply(lambda x: len(x))" ] }, { "cell_type": "code", "execution_count": 24, "id": "1c4afc83-f5fb-4260-9ccd-ea60adcbfd72", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
antecedentsconsequentsantecedent supportconsequent supportsupportconfidenceliftleverageconvictionzhangs_metricantecedent_len
0(MILK_Milk)(EGGS_Eggs)0.8020.3360.2840.3541151.0539130.0145281.0280460.2583581
1(EGGS_Eggs)(MILK_Milk)0.3360.8020.2840.8452381.0539130.0145281.2793850.0770411
2(BREAD_Bread, EGGS_Eggs)(MILK_Milk)0.2400.8020.2080.8666671.0806320.0155201.4850000.0981782
3(MILK_Milk)(BREAD_Bread, EGGS_Eggs)0.8020.2400.2080.2593521.0806320.0155201.0261280.3768451
\n", "
" ], "text/plain": [ " antecedents consequents antecedent support \\\n", "0 (MILK_Milk) (EGGS_Eggs) 0.802 \n", "1 (EGGS_Eggs) (MILK_Milk) 0.336 \n", "2 (BREAD_Bread, EGGS_Eggs) (MILK_Milk) 0.240 \n", "3 (MILK_Milk) (BREAD_Bread, EGGS_Eggs) 0.802 \n", "\n", " consequent support support confidence lift leverage conviction \\\n", "0 0.336 0.284 0.354115 1.053913 0.014528 1.028046 \n", "1 0.802 0.284 0.845238 1.053913 0.014528 1.279385 \n", "2 0.802 0.208 0.866667 1.080632 0.015520 1.485000 \n", "3 0.240 0.208 0.259352 1.080632 0.015520 1.026128 \n", "\n", " zhangs_metric antecedent_len \n", "0 0.258358 1 \n", "1 0.077041 1 \n", "2 0.098178 2 \n", "3 0.376845 1 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_rules" ] }, { "cell_type": "code", "execution_count": 25, "id": "d92fc331-ba33-4ca6-a888-ba10233f4084", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
antecedentsconsequentsantecedent supportconsequent supportsupportconfidenceliftleverageconvictionzhangs_metricantecedent_len
2(BREAD_Bread, EGGS_Eggs)(MILK_Milk)0.240.8020.2080.8666671.0806320.015521.4850.0981782
\n", "
" ], "text/plain": [ " antecedents consequents antecedent support \\\n", "2 (BREAD_Bread, EGGS_Eggs) (MILK_Milk) 0.24 \n", "\n", " consequent support support confidence lift leverage conviction \\\n", "2 0.802 0.208 0.866667 1.080632 0.01552 1.485 \n", "\n", " zhangs_metric antecedent_len \n", "2 0.098178 2 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# useful length\n", "df_rules.query('antecedent_len >= 2')" ] }, { "cell_type": "code", "execution_count": null, "id": "9535fa3f-ad57-4157-9997-c538244d9a41", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8e4c7e1d-d8f6-4eee-9cd5-821fdbc71caf", "metadata": {}, "source": [ "## 2.3 APM in road accident data\n", "\n", "In this section, we will use APM in road accident ('acciden_data_v1.0.0_2023.db) for example.\n", "\n", "Schema info can be found in 'dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2023-1.xlsx'" ] }, { "cell_type": "markdown", "id": "bdf02564-2af2-4897-86d1-9a808b345248", "metadata": {}, "source": [ "### 2.3.1 Database infos " ] }, { "cell_type": "code", "execution_count": 37, "id": "b7763e57-9f59-4620-9b08-a362aba55c38", "metadata": {}, "outputs": [], "source": [ "import sqlite3" ] }, { "cell_type": "code", "execution_count": 38, "id": "bc6197c3-829d-497a-b3e4-010a58cc2fef", "metadata": {}, "outputs": [], "source": [ "# Connect to accident database \n", "conn = sqlite3.connect('accident_data_v1.0.0_2023.db')" ] }, { "cell_type": "code", "execution_count": 39, "id": "984675b5-332a-4568-a491-35b66969eca4", "metadata": {}, "outputs": [], "source": [ "# create cursor handling SQL \n", "cur = conn.cursor()" ] }, { "cell_type": "code", "execution_count": 40, "id": "1bfb860f-6fcb-48e0-ad1d-8d9b9b318e76", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[]\n" ] } ], "source": [ "# output all the tables in db\n", "res_tables = cur.execute(\"SELECT name FROM sqlite_master WHERE type='table'\") \n", "table_names = res_tables.fetchall()\n", "print(table_names)" ] }, { "cell_type": "code", "execution_count": 41, "id": "8e5065c5-74b0-4347-8647-1aac52524f26", "metadata": {}, "outputs": [ { "ename": "IndexError", "evalue": "list index out of range", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[41], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mtable_names\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m]\u001b[49m[\u001b[38;5;241m0\u001b[39m]\n", "\u001b[0;31mIndexError\u001b[0m: list index out of range" ] } ], "source": [ "table_names[1][0]" ] }, { "cell_type": "code", "execution_count": 31, "id": "298b7072-68e1-4ea7-bcbd-50fe09257893", "metadata": {}, "outputs": [], "source": [ "# Check the foreign keys to understand the relationships across tables in database\n", "# it is clear that accident_index column is used to link tables in db\n", "for tn in table_names:\n", " res_fk = cur.execute(f\"PRAGMA foreign_key_list({tn[0]})\")\n", " fk_info = res_fk.fetchall()\n", " if len(fk_info) < 1:\n", " print(f'No foreign keys found in Table: {tn[0]}')\n", " \n", " else:\n", " print(f'Table {tn[0]} fkeys info --', \n", " 'id:', fk_info[0][0], '--',\n", " 'seq:', fk_info[0][1],'--',\n", " 'table:', fk_info[0][2],'--',\n", " 'from:', fk_info[0][3],'--',\n", " 'to:', fk_info[0][4],'--',\n", " 'on_update:', fk_info[0][5],'--',\n", " 'on_delete:', fk_info[0][6],'--',\n", " 'match:', fk_info[0][7]) " ] }, { "cell_type": "code", "execution_count": null, "id": "e613e52e-d04a-47bc-973f-c5d43363eb9d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3b40525b-c128-4ac8-bbe3-30dafc0ff3eb", "metadata": {}, "source": [ "### 2.3.2 APM in accident table" ] }, { "cell_type": "code", "execution_count": 8, "id": "c733b5ef-6db5-4f15-8ace-43cb3fca8764", "metadata": {}, "outputs": [], "source": [ "# using pandas to read table accident\n", "df_ac = pd.read_sql(\"SELECT * FROM accident;\", conn)" ] }, { "cell_type": "code", "execution_count": 17, "id": "c7b92268-b601-426c-8d84-ec9d824ee5a8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
accident_indexaccident_yearaccident_referencelocation_easting_osgrlocation_northing_osgrlongitudelatitudepolice_forceaccident_severitynumber_of_vehicles...pedestrian_crossing_physical_facilitieslight_conditionsweather_conditionsroad_surface_conditionsspecial_conditions_at_sitecarriageway_hazardsurban_or_rural_areadid_police_officer_attend_scene_of_accidenttrunk_road_flaglsoa_of_accident_location
020170100017082017010001708532920.0196330.0-0.08010751.650061112...041100112E01001450
120170100093422017010009342526790.0181970.0-0.17384551.522425132...041200112E01004702
220170100093442017010009344535200.0181260.0-0.05296951.514096133...041100112E01004298
320170100093482017010009348534340.0193560.0-0.06065851.624832132...442200112E01001429
420170100093502017010009350533680.0187820.0-0.07237251.573408121...541200112E01001808
..................................................................
46134720209910270642020991027064343034.0731654.0-2.92632056.4735399922...01110011-1-1
46134820209910295732020991029573257963.0658891.0-4.26756555.8023539931...01110012-1-1
46134920209910302972020991030297383664.0810646.0-2.27190357.1863179922...01110021-1-1
46135020209910309002020991030900277161.0674852.0-3.96875355.9509409932...01110012-1-1
46135120209910325752020991032575240402.0681950.0-4.56104056.0038439931...01110211-1-1
\n", "

461352 rows × 36 columns

\n", "
" ], "text/plain": [ " accident_index accident_year accident_reference \\\n", "0 2017010001708 2017 010001708 \n", "1 2017010009342 2017 010009342 \n", "2 2017010009344 2017 010009344 \n", "3 2017010009348 2017 010009348 \n", "4 2017010009350 2017 010009350 \n", "... ... ... ... \n", "461347 2020991027064 2020 991027064 \n", "461348 2020991029573 2020 991029573 \n", "461349 2020991030297 2020 991030297 \n", "461350 2020991030900 2020 991030900 \n", "461351 2020991032575 2020 991032575 \n", "\n", " location_easting_osgr location_northing_osgr longitude latitude \\\n", "0 532920.0 196330.0 -0.080107 51.650061 \n", "1 526790.0 181970.0 -0.173845 51.522425 \n", "2 535200.0 181260.0 -0.052969 51.514096 \n", "3 534340.0 193560.0 -0.060658 51.624832 \n", "4 533680.0 187820.0 -0.072372 51.573408 \n", "... ... ... ... ... \n", "461347 343034.0 731654.0 -2.926320 56.473539 \n", "461348 257963.0 658891.0 -4.267565 55.802353 \n", "461349 383664.0 810646.0 -2.271903 57.186317 \n", "461350 277161.0 674852.0 -3.968753 55.950940 \n", "461351 240402.0 681950.0 -4.561040 56.003843 \n", "\n", " police_force accident_severity number_of_vehicles ... \\\n", "0 1 1 2 ... \n", "1 1 3 2 ... \n", "2 1 3 3 ... \n", "3 1 3 2 ... \n", "4 1 2 1 ... \n", "... ... ... ... ... \n", "461347 99 2 2 ... \n", "461348 99 3 1 ... \n", "461349 99 2 2 ... \n", "461350 99 3 2 ... \n", "461351 99 3 1 ... \n", "\n", " pedestrian_crossing_physical_facilities light_conditions \\\n", "0 0 4 \n", "1 0 4 \n", "2 0 4 \n", "3 4 4 \n", "4 5 4 \n", "... ... ... \n", "461347 0 1 \n", "461348 0 1 \n", "461349 0 1 \n", "461350 0 1 \n", "461351 0 1 \n", "\n", " weather_conditions road_surface_conditions \\\n", "0 1 1 \n", "1 1 2 \n", "2 1 1 \n", "3 2 2 \n", "4 1 2 \n", "... ... ... \n", "461347 1 1 \n", "461348 1 1 \n", "461349 1 1 \n", "461350 1 1 \n", "461351 1 1 \n", "\n", " special_conditions_at_site carriageway_hazards urban_or_rural_area \\\n", "0 0 0 1 \n", "1 0 0 1 \n", "2 0 0 1 \n", "3 0 0 1 \n", "4 0 0 1 \n", "... ... ... ... \n", "461347 0 0 1 \n", "461348 0 0 1 \n", "461349 0 0 2 \n", "461350 0 0 1 \n", "461351 0 2 1 \n", "\n", " did_police_officer_attend_scene_of_accident trunk_road_flag \\\n", "0 1 2 \n", "1 1 2 \n", "2 1 2 \n", "3 1 2 \n", "4 1 2 \n", "... ... ... \n", "461347 1 -1 \n", "461348 2 -1 \n", "461349 1 -1 \n", "461350 2 -1 \n", "461351 1 -1 \n", "\n", " lsoa_of_accident_location \n", "0 E01001450 \n", "1 E01004702 \n", "2 E01004298 \n", "3 E01001429 \n", "4 E01001808 \n", "... ... \n", "461347 -1 \n", "461348 -1 \n", "461349 -1 \n", "461350 -1 \n", "461351 -1 \n", "\n", "[461352 rows x 36 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is a accident-level dataset which means each row representing one accident idetified by 'addicent_index'\n", "df_ac" ] }, { "cell_type": "markdown", "id": "12a09e16-0263-491a-a763-0fbbf932ed7d", "metadata": {}, "source": [ "**You can use pandas .describe() or info() to output the basic info in a df**\n", "\n", "**You can also use some libs called [ydata-profiling](https://docs.profiling.ydata.ai/latest/), [sweetviz](https://pypi.org/project/sweetviz/) and [dtale](https://pypi.org/project/dtale/) to implement the initial EDA**" ] }, { "cell_type": "code", "execution_count": 9, "id": "3526b049-6582-4476-80fe-42014c7c71ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
accident_yearlocation_easting_osgrlocation_northing_osgrlongitudelatitudepolice_forceaccident_severitynumber_of_vehiclesnumber_of_casualtiesday_of_week...pedestrian_crossing_human_controlpedestrian_crossing_physical_facilitieslight_conditionsweather_conditionsroad_surface_conditionsspecial_conditions_at_sitecarriageway_hazardsurban_or_rural_areadid_police_officer_attend_scene_of_accidenttrunk_road_flag
count461352.000000461236.0000004.612360e+05461226.000000461226.000000461352.000000461352.000000461352.000000461352.000000461352.000000...461352.000000461352.000000461352.000000461352.000000461352.000000461352.000000461352.000000461352.000000461352.000000461352.000000
mean2018.368378452593.1157562.801803e+05-1.24699852.40884928.1933882.7844361.8404411.3012454.108171...0.2621121.0995182.0362351.6499181.3852390.2175190.1705181.3254631.3348771.680147
std1.09156694822.7187051.515501e+051.3897021.36510124.8016090.4436480.7098690.7463981.927216...1.4606802.3331131.7246911.8194260.9551801.2409811.1219640.4690570.5563520.851554
min2017.00000064084.0000001.023500e+04-7.52527349.9123621.0000001.0000001.0000001.0000001.000000...-1.000000-1.000000-1.000000-1.000000-1.000000-1.000000-1.000000-1.000000-1.000000-1.000000
25%2017.000000388539.0000001.755300e+05-2.17266851.4656895.0000003.0000001.0000001.0000002.000000...0.0000000.0000001.0000001.0000001.0000000.0000000.0000001.0000001.0000002.000000
50%2018.000000459198.5000002.208650e+05-1.12626451.87025722.0000003.0000002.0000001.0000004.000000...0.0000000.0000001.0000001.0000001.0000000.0000000.0000001.0000001.0000002.000000
75%2019.000000529290.0000003.865510e+05-0.13638953.37289945.0000003.0000002.0000001.0000006.000000...0.0000000.0000004.0000001.0000002.0000000.0000000.0000002.0000002.0000002.000000
max2020.000000655391.0000001.209512e+061.75964160.76372299.0000003.00000024.00000059.0000007.000000...9.0000009.0000007.0000009.0000009.0000009.0000009.0000003.0000003.0000002.000000
\n", "

8 rows × 29 columns

\n", "
" ], "text/plain": [ " accident_year location_easting_osgr location_northing_osgr \\\n", "count 461352.000000 461236.000000 4.612360e+05 \n", "mean 2018.368378 452593.115756 2.801803e+05 \n", "std 1.091566 94822.718705 1.515501e+05 \n", "min 2017.000000 64084.000000 1.023500e+04 \n", "25% 2017.000000 388539.000000 1.755300e+05 \n", "50% 2018.000000 459198.500000 2.208650e+05 \n", "75% 2019.000000 529290.000000 3.865510e+05 \n", "max 2020.000000 655391.000000 1.209512e+06 \n", "\n", " longitude latitude police_force accident_severity \\\n", "count 461226.000000 461226.000000 461352.000000 461352.000000 \n", "mean -1.246998 52.408849 28.193388 2.784436 \n", "std 1.389702 1.365101 24.801609 0.443648 \n", "min -7.525273 49.912362 1.000000 1.000000 \n", "25% -2.172668 51.465689 5.000000 3.000000 \n", "50% -1.126264 51.870257 22.000000 3.000000 \n", "75% -0.136389 53.372899 45.000000 3.000000 \n", "max 1.759641 60.763722 99.000000 3.000000 \n", "\n", " number_of_vehicles number_of_casualties day_of_week ... \\\n", "count 461352.000000 461352.000000 461352.000000 ... \n", "mean 1.840441 1.301245 4.108171 ... \n", "std 0.709869 0.746398 1.927216 ... \n", "min 1.000000 1.000000 1.000000 ... \n", "25% 1.000000 1.000000 2.000000 ... \n", "50% 2.000000 1.000000 4.000000 ... \n", "75% 2.000000 1.000000 6.000000 ... \n", "max 24.000000 59.000000 7.000000 ... \n", "\n", " pedestrian_crossing_human_control \\\n", "count 461352.000000 \n", "mean 0.262112 \n", "std 1.460680 \n", "min -1.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 0.000000 \n", "max 9.000000 \n", "\n", " pedestrian_crossing_physical_facilities light_conditions \\\n", "count 461352.000000 461352.000000 \n", "mean 1.099518 2.036235 \n", "std 2.333113 1.724691 \n", "min -1.000000 -1.000000 \n", "25% 0.000000 1.000000 \n", "50% 0.000000 1.000000 \n", "75% 0.000000 4.000000 \n", "max 9.000000 7.000000 \n", "\n", " weather_conditions road_surface_conditions \\\n", "count 461352.000000 461352.000000 \n", "mean 1.649918 1.385239 \n", "std 1.819426 0.955180 \n", "min -1.000000 -1.000000 \n", "25% 1.000000 1.000000 \n", "50% 1.000000 1.000000 \n", "75% 1.000000 2.000000 \n", "max 9.000000 9.000000 \n", "\n", " special_conditions_at_site carriageway_hazards urban_or_rural_area \\\n", "count 461352.000000 461352.000000 461352.000000 \n", "mean 0.217519 0.170518 1.325463 \n", "std 1.240981 1.121964 0.469057 \n", "min -1.000000 -1.000000 -1.000000 \n", "25% 0.000000 0.000000 1.000000 \n", "50% 0.000000 0.000000 1.000000 \n", "75% 0.000000 0.000000 2.000000 \n", "max 9.000000 9.000000 3.000000 \n", "\n", " did_police_officer_attend_scene_of_accident trunk_road_flag \n", "count 461352.000000 461352.000000 \n", "mean 1.334877 1.680147 \n", "std 0.556352 0.851554 \n", "min -1.000000 -1.000000 \n", "25% 1.000000 2.000000 \n", "50% 1.000000 2.000000 \n", "75% 2.000000 2.000000 \n", "max 3.000000 2.000000 \n", "\n", "[8 rows x 29 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ac.describe()" ] }, { "cell_type": "code", "execution_count": 10, "id": "bc0b2993-3204-42bb-aea6-ff102ff75d41", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 461352 entries, 0 to 461351\n", "Data columns (total 36 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 accident_index 461352 non-null object \n", " 1 accident_year 461352 non-null int64 \n", " 2 accident_reference 461352 non-null object \n", " 3 location_easting_osgr 461236 non-null float64\n", " 4 location_northing_osgr 461236 non-null float64\n", " 5 longitude 461226 non-null float64\n", " 6 latitude 461226 non-null float64\n", " 7 police_force 461352 non-null int64 \n", " 8 accident_severity 461352 non-null int64 \n", " 9 number_of_vehicles 461352 non-null int64 \n", " 10 number_of_casualties 461352 non-null int64 \n", " 11 date 461352 non-null object \n", " 12 day_of_week 461352 non-null int64 \n", " 13 time 461352 non-null object \n", " 14 local_authority_district 461352 non-null int64 \n", " 15 local_authority_ons_district 461352 non-null object \n", " 16 local_authority_highway 461352 non-null object \n", " 17 first_road_class 461352 non-null int64 \n", " 18 first_road_number 461352 non-null int64 \n", " 19 road_type 461352 non-null int64 \n", " 20 speed_limit 461352 non-null int64 \n", " 21 junction_detail 461352 non-null int64 \n", " 22 junction_control 461352 non-null int64 \n", " 23 second_road_class 461352 non-null int64 \n", " 24 second_road_number 461352 non-null int64 \n", " 25 pedestrian_crossing_human_control 461352 non-null int64 \n", " 26 pedestrian_crossing_physical_facilities 461352 non-null int64 \n", " 27 light_conditions 461352 non-null int64 \n", " 28 weather_conditions 461352 non-null int64 \n", " 29 road_surface_conditions 461352 non-null int64 \n", " 30 special_conditions_at_site 461352 non-null int64 \n", " 31 carriageway_hazards 461352 non-null int64 \n", " 32 urban_or_rural_area 461352 non-null int64 \n", " 33 did_police_officer_attend_scene_of_accident 461352 non-null int64 \n", " 34 trunk_road_flag 461352 non-null int64 \n", " 35 lsoa_of_accident_location 461352 non-null object \n", "dtypes: float64(4), int64(25), object(7)\n", "memory usage: 126.7+ MB\n" ] } ], "source": [ "df_ac.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "37619a37-ef79-4da5-b652-107200e15967", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7a49fe92-996d-4fc2-89e3-ac2b5596dd3e", "metadata": {}, "source": [ "**Select three columns for analysis**\n", "\n", "accident_severity, speed_limit, weather_conditions (rules can be if speed_limit, weather_conditions then accident_severity)" ] }, { "cell_type": "code", "execution_count": 46, "id": "8709c83e-108e-4014-aa5c-1b66ce878716", "metadata": {}, "outputs": [], "source": [ "# select three columns\n", "df_ac_s = df_ac[['speed_limit', 'weather_conditions', 'accident_severity']]" ] }, { "cell_type": "code", "execution_count": 48, "id": "ae325315-dccc-4f1d-a295-d070cdf35369", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speed_limitweather_conditionsaccident_severity
03011
13013
23013
33023
42012
............
4613473012
4613483013
4613496012
4613503013
4613513013
\n", "

461352 rows × 3 columns

\n", "
" ], "text/plain": [ " speed_limit weather_conditions accident_severity\n", "0 30 1 1\n", "1 30 1 3\n", "2 30 1 3\n", "3 30 2 3\n", "4 20 1 2\n", "... ... ... ...\n", "461347 30 1 2\n", "461348 30 1 3\n", "461349 60 1 2\n", "461350 30 1 3\n", "461351 30 1 3\n", "\n", "[461352 rows x 3 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ac_s" ] }, { "cell_type": "code", "execution_count": null, "id": "5aee63e9-fe7a-43db-8902-8504dd84c9ff", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 41, "id": "e9c840e6-1a4c-4db7-b87f-4a0bf6868031", "metadata": {}, "outputs": [], "source": [ "# One-hot encoding transfer df colomuns \n", "\n", "df_list = []\n", "for c in ['speed_limit', 'weather_conditions', 'accident_severity']:\n", " df_c = pd.get_dummies(df_ac_s[c], prefix=c)\n", " df_list.append(df_c)\n", "\n", "# axis=1 indicates that the dataframes in df_list are being concatenated horizontally, i.e., by columns. \n", "# This means that the resulting dataframe df_oh will have columns from all the dataframes in df_list combined side by side, with the indices aligned.\n", "df_oh = pd.concat(df_list, axis=1) " ] }, { "cell_type": "code", "execution_count": 42, "id": "d5a8f464-7e52-4bca-8338-c78078fd6242", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speed_limit_-1speed_limit_20speed_limit_30speed_limit_40speed_limit_50speed_limit_60speed_limit_70weather_conditions_-1weather_conditions_1weather_conditions_2weather_conditions_3weather_conditions_4weather_conditions_5weather_conditions_6weather_conditions_7weather_conditions_8weather_conditions_9accident_severity_1accident_severity_2accident_severity_3
0FalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalse
1FalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
2FalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
3FalseFalseTrueFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
4FalseTrueFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse
...............................................................
461347FalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse
461348FalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
461349FalseFalseFalseFalseFalseTrueFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalse
461350FalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
461351FalseFalseTrueFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
\n", "

461352 rows × 20 columns

\n", "
" ], "text/plain": [ " speed_limit_-1 speed_limit_20 speed_limit_30 speed_limit_40 \\\n", "0 False False True False \n", "1 False False True False \n", "2 False False True False \n", "3 False False True False \n", "4 False True False False \n", "... ... ... ... ... \n", "461347 False False True False \n", "461348 False False True False \n", "461349 False False False False \n", "461350 False False True False \n", "461351 False False True False \n", "\n", " speed_limit_50 speed_limit_60 speed_limit_70 weather_conditions_-1 \\\n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "... ... ... ... ... \n", "461347 False False False False \n", "461348 False False False False \n", "461349 False True False False \n", "461350 False False False False \n", "461351 False False False False \n", "\n", " weather_conditions_1 weather_conditions_2 weather_conditions_3 \\\n", "0 True False False \n", "1 True False False \n", "2 True False False \n", "3 False True False \n", "4 True False False \n", "... ... ... ... \n", "461347 True False False \n", "461348 True False False \n", "461349 True False False \n", "461350 True False False \n", "461351 True False False \n", "\n", " weather_conditions_4 weather_conditions_5 weather_conditions_6 \\\n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "... ... ... ... \n", "461347 False False False \n", "461348 False False False \n", "461349 False False False \n", "461350 False False False \n", "461351 False False False \n", "\n", " weather_conditions_7 weather_conditions_8 weather_conditions_9 \\\n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "... ... ... ... \n", "461347 False False False \n", "461348 False False False \n", "461349 False False False \n", "461350 False False False \n", "461351 False False False \n", "\n", " accident_severity_1 accident_severity_2 accident_severity_3 \n", "0 True False False \n", "1 False False True \n", "2 False False True \n", "3 False False True \n", "4 False True False \n", "... ... ... ... \n", "461347 False True False \n", "461348 False False True \n", "461349 False True False \n", "461350 False False True \n", "461351 False False True \n", "\n", "[461352 rows x 20 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_oh" ] }, { "cell_type": "code", "execution_count": 54, "id": "8d424564-8dc1-43e7-b349-c2fcee1d746b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
supportitemsets
00.595235(speed_limit_30)
10.795436(weather_conditions_1)
20.798299(accident_severity_3)
30.476458(weather_conditions_1, speed_limit_30)
40.486100(accident_severity_3, speed_limit_30)
50.631260(accident_severity_3, weather_conditions_1)
60.387476(accident_severity_3, weather_conditions_1, sp...
\n", "
" ], "text/plain": [ " support itemsets\n", "0 0.595235 (speed_limit_30)\n", "1 0.795436 (weather_conditions_1)\n", "2 0.798299 (accident_severity_3)\n", "3 0.476458 (weather_conditions_1, speed_limit_30)\n", "4 0.486100 (accident_severity_3, speed_limit_30)\n", "5 0.631260 (accident_severity_3, weather_conditions_1)\n", "6 0.387476 (accident_severity_3, weather_conditions_1, sp..." ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create frequent sets for rules\n", "df_fre = apriori(df_oh, min_support=0.2, use_colnames=True)\n", "df_fre" ] }, { "cell_type": "code", "execution_count": null, "id": "2dfc0168-6c0a-490a-bfec-a03f4dfd4256", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 73, "id": "bcfc3e98-b87d-4972-a8c8-a44a16687a13", "metadata": {}, "outputs": [], "source": [ "# select rules with confidence >0.7\n", "df_rule = association_rules(df_fre, metric='confidence', min_threshold=0.7) " ] }, { "cell_type": "code", "execution_count": 74, "id": "3d69227d-901f-47c9-b6cd-df00a1db1ab7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
antecedentsconsequentsantecedent supportconsequent supportsupportconfidenceliftleverageconvictionzhangs_metric
0(speed_limit_30)(weather_conditions_1)0.5952350.7954360.4764580.8004541.0063080.0029871.0251460.015487
1(speed_limit_30)(accident_severity_3)0.5952350.7982990.4861000.8166511.0229890.0109241.1000920.055518
4(accident_severity_3, speed_limit_30)(weather_conditions_1)0.4861000.7954360.3874760.7971131.0021090.0008151.0082670.004094
5(weather_conditions_1, speed_limit_30)(accident_severity_3)0.4764580.7982990.3874760.8132431.0187190.0071201.0800160.035098
\n", "
" ], "text/plain": [ " antecedents consequents \\\n", "0 (speed_limit_30) (weather_conditions_1) \n", "1 (speed_limit_30) (accident_severity_3) \n", "4 (accident_severity_3, speed_limit_30) (weather_conditions_1) \n", "5 (weather_conditions_1, speed_limit_30) (accident_severity_3) \n", "\n", " antecedent support consequent support support confidence lift \\\n", "0 0.595235 0.795436 0.476458 0.800454 1.006308 \n", "1 0.595235 0.798299 0.486100 0.816651 1.022989 \n", "4 0.486100 0.795436 0.387476 0.797113 1.002109 \n", "5 0.476458 0.798299 0.387476 0.813243 1.018719 \n", "\n", " leverage conviction zhangs_metric \n", "0 0.002987 1.025146 0.015487 \n", "1 0.010924 1.100092 0.055518 \n", "4 0.000815 1.008267 0.004094 \n", "5 0.007120 1.080016 0.035098 " ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select rules with lift >1\n", "df_rule.query('lift > 1')" ] }, { "cell_type": "markdown", "id": "f18728a0-e9e6-4517-a28f-7359a6266b06", "metadata": {}, "source": [ "As 'weather_conditions_1' represents 'Fine no high winds' and 'accident_severity_3' represents 'Slight', so the useful rule can be generated as ' Weather with fine no high winds' and 'the road with speed limit is 30' $\\Rightarrow$ 'Accident severity is slight'" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 5 }