An Improved Algorithm for Association Rule Mining in Relational Databases

Group 24

Jesse

Costa

Khyal

Introduction

Research Motivation (Challenge)

Amazon Reviews 2023 dataset: a real-world, sparse dataset
Apriori as a Baseline
FP-Growth as the preferred algorithm for this kind of data.

The Goal

Replicate the paper and observe the performance improvement.

Literature Review

Traditional Apriori algorithm faces scalability limitations
Fix: recast support counting as a set-intersection problem
Compare performance against FP-Growth which uses a prefix tree
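
The set-intersection idea can be sketched as follows. This is a minimal illustration, not the paper's exact method: it assumes each item is mapped to the set of transaction IDs that contain it (a "TID set"), so that the support of a candidate itemset is the size of an intersection rather than the result of a full database scan. The function names and the absolute `min_support` count are hypothetical.

```python
from itertools import combinations


def build_tidsets(transactions):
    """Map each item to the set of transaction indices that contain it."""
    tidsets = {}
    for tid, basket in enumerate(transactions):
        for item in basket:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def frequent_itemsets(transactions, min_support):
    """Level-wise search; support is the size of a TID-set intersection.

    min_support is an absolute transaction count (an assumption of this sketch).
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    tidsets = build_tidsets(transactions)
    # Level 1: frequent single items.
    current = {
        frozenset([item]): tids
        for item, tids in tidsets.items()
        if len(tids) >= min_support
    }
    result = {}
    while current:
        result.update({itemset: len(tids) for itemset, tids in current.items()})
        next_level = {}
        for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
            candidate = a | b
            # Join only itemsets that differ by exactly one item.
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # Transactions containing the union are exactly those containing both parts.
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                next_level[candidate] = tids
        current = next_level
    return result


if __name__ == "__main__":
    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in frequent_itemsets(baskets, min_support=3).items():
        print(sorted(itemset), support)
```

The database is scanned once to build the TID sets. Every later level works only on those sets, which is where the saving over repeated scans would come from.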

Methodology

Implement each algorithm in Python
Set up benchmark environment
- Use the same configuration parameters
- Run with the same hardware
- Warm up the JIT compiler
- Clear memory between runs
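
A benchmark harness along these lines could cover the checklist above. It is a sketch under assumptions: the `benchmark` and `summarize` helpers and their `warmup` and `repeats` parameters are hypothetical, untimed warm-up calls stand in for the warm-up step, and `gc.collect()` stands in for clearing memory between runs.

```python
import gc
import statistics
import time


def benchmark(func, *args, warmup=1, repeats=5, **kwargs):
    """Time func(*args, **kwargs) and return per-run wall-clock seconds."""
    for _ in range(warmup):
        func(*args, **kwargs)  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        gc.collect()  # release garbage left by the previous run before timing
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to min, median, and max."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Placeholder workload; each algorithm would be passed in with identical arguments.
    print(summarize(benchmark(sorted, list(range(100_000, 0, -1)))))
```

Passing every algorithm through the same function with the same arguments keeps the configuration identical across runs.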

Data Analysis

Dataset

Amazon Review Dataset

Source: Amazon Customer Reviews (2023)
Categories: Appliances, Digital Music, Gift Cards, Health & Personal Care, Office Products
Scale: Over 233 million reviews

Data Preprocessing Steps

Data Cleaning
- Removed duplicates and invalid entries
- Filtered out likely spam/bot reviews (those not marked as verified purchases)
Transaction Formation
- Grouped reviews by customer ID
- Created one product basket per customer, with product variants (color, size, etc.) merged into their parent product
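
The cleaning and transaction-formation steps above might look like the following sketch. It assumes each review is a dict with `user_id`, `parent_asin`, and `verified_purchase` keys, where `parent_asin` identifies the product shared by all of its variants. These field names and the `build_baskets` helper are assumptions for illustration and should be checked against the actual data files.

```python
from collections import defaultdict


def build_baskets(reviews):
    """Group verified reviews into one set of parent products per customer."""
    baskets = defaultdict(set)
    for review in reviews:
        user = review.get("user_id")
        product = review.get("parent_asin")
        if not user or not product:
            continue  # invalid entry: missing customer or product identifier
        if not review.get("verified_purchase", False):
            continue  # treated as possible spam/bot review
        # Using a set drops duplicate reviews and collapses product variants.
        baskets[user].add(product)
    return dict(baskets)


if __name__ == "__main__":
    sample = [
        {"user_id": "u1", "parent_asin": "P1", "verified_purchase": True},
        {"user_id": "u1", "parent_asin": "P1", "verified_purchase": True},
        {"user_id": "u1", "parent_asin": "P2", "verified_purchase": True},
        {"user_id": "u2", "parent_asin": "P1", "verified_purchase": False},
    ]
    print(build_baskets(sample))
```

Each resulting basket is a set, so it can be passed directly to a frequent-itemset miner as one transaction.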

Example Data Visualizations

Purchases Per User

Preprocessing Pipeline

Results & Analysis

Runtime Comparison

Traditional Apriori: Baseline performance, slows down significantly at lower minimum support thresholds
Improved Apriori: ~50x faster than traditional (92s → 1.8s)
FP-Growth: Best performance overall at 0.9s
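
The speedup factors follow directly from the runtimes quoted above. The short calculation below makes the arithmetic explicit, with speedup defined as baseline runtime divided by the algorithm's runtime. The `speedup` helper is hypothetical.

```python
def speedup(baseline_seconds, seconds):
    """Return how many times faster a run is than the baseline."""
    return baseline_seconds / seconds


# Runtimes in seconds, as reported in the runtime comparison.
runtimes = {
    "Traditional Apriori": 92.0,
    "Improved Apriori": 1.8,
    "FP-Growth": 0.9,
}

if __name__ == "__main__":
    baseline = runtimes["Traditional Apriori"]
    for name, seconds in runtimes.items():
        print(f"{name}: {seconds:.1f}s, {speedup(baseline, seconds):.1f}x vs. traditional")
```

By the same ratio, FP-Growth at 0.9s is 2x faster than the improved Apriori at 1.8s.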

Runtime Performance Chart

Execution Time Comparison

Conclusion

Successfully replicated the paper
Enhanced Apriori algorithm ran ~50x faster than traditional Apriori (92s → 1.8s)
Comprehensive Comparison with FP-Growth on real-world datasets
Empirical Validation across multiple dataset categories and scales