An Improved Algorithm for Association Rule In Relational Databases

Group 24

Jesse
Costa
Khyal

Introduction

Research Motivation (Challenge)

  • Amazon Reviews 2023 dataset Real-world sparse dataset
  • Apriori as a Baseline
  • FP-Growth as the preferred algorithm for this kind of data.

The Goal

Replicate the paper and observe the performance improvement.

Literature Review

  • Traditional Apriori algorithm faces scalability limitations
  • Fix: change the problem to set-intersection
  • Compare performance against FP-Growth which uses a prefix tree

Methodology

  1. Implement each algorithm in python
  2. Setup Benchmark environment
    • Use the same configuration parameters
    • Run with the same hardware
    • Warm up the JIT compiler
    • Clear memory between runs

Data Analysis

Dataset

Amazon Review Dataset

  • Source: Amazon Customer Reviews (2023)
  • Categories: Appliances, Digital Music, Gift Cards, Health & Personal Care, Office Products
  • Scale: Over 233 million reviews

Data Preprocessing Steps

  1. Data Cleaning
    • Removed duplicates and invalid entries
    • Filtered spam/bot reviews (Not verified purchases)
  2. Transaction Formation
    • Grouped reviews by customer ID
    • Created product baskets per customer, aggregated into actual products instead of product variants (colors, size, etc)

Example Data Visualizations

Purchases Per UserPreprocessing Pipeline

Results & Analysis

Runtime Comparison

  • Traditional Apriori: Baseline performance, significant slowdown with lower support
  • Improved Apriori: ~50x faster than traditional (92s → 1.8s)
  • FP-Growth: Best performance overall at 0.9s

Runtime Performance Chart

Execution Time Comparison

Conclusion

  • Successfully replicated the paper
  • Enhanced Apriori Algorithm with 40-60% performance improvement
  • Comprehensive Comparison with FP-Growth on real-world datasets
  • Empirical Validation across multiple dataset categories and scales