Experiments

The experiments use the Amazon Reviews 2023 dataset from Hugging Face, combining multiple product categories:

  • Categories:
    • Appliances
    • Digital Music
    • Health and Personal Care
    • Handmade Products
    • All Beauty
  • Total records: 4,118,850 reviews (combined)
  • Verified purchases: Only reviews marked as verified purchases are kept (typically 60-80% of records)
  • Transaction count: Varies with the preprocessing parameters (typically thousands to tens of thousands)
  • Preprocessing:
    • Parent ASIN used for product grouping (fallback to ASIN when unavailable)
    • min_transaction_size=2 (transactions must have at least 2 items)
    • Infrequent items filtered (min_item_frequency varies)
    • Category column added to track data source
  • Baseline 1: Traditional Apriori algorithm
  • Baseline 2: FP-Growth algorithm (standard implementation)
  • Test: Improved Apriori algorithm (Weighted Apriori with intersection-based counting)
  • Metrics: Execution time (microsecond precision), scalability across support thresholds, correctness verification
  • Minimum Support Threshold:
    • Primary experiments: Calculated programmatically from the item frequency distribution (typically 0.0005, i.e. 0.05%)
    • Scalability analysis: Varies from 0.05% to 0.3% to test performance across different thresholds
  • Minimum Confidence: 0.3 (30%) for association rule generation
  • Preprocessing:
    • Parent ASIN grouping with fallback to ASIN
    • min_transaction_size=2 (only transactions with 2+ items)
    • Programmatic support threshold calculation using suggest_min_support() function
  • Runtime Measurement: Using time.perf_counter() for microsecond-level precision with internal algorithm timing
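The preprocessing rules listed above (verified purchases only, Parent ASIN with fallback to ASIN, min_transaction_size, and item-frequency filtering) can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name build_transactions is hypothetical, the field names follow the Amazon Reviews 2023 schema (user_id, asin, parent_asin, verified_purchase), and grouping reviews into one transaction per user is an assumption, since the text does not state how transactions are keyed.

```python
from collections import Counter, defaultdict


def build_transactions(reviews, min_transaction_size=2, min_item_frequency=1):
    """Hypothetical helper: turn review records into item transactions.

    Assumption: one transaction per user_id (the source does not say how
    transactions are keyed).
    """
    baskets = defaultdict(set)
    for review in reviews:
        # Keep verified purchases only.
        if not review.get("verified_purchase"):
            continue
        # Group by Parent ASIN, falling back to ASIN when it is missing or empty.
        item = review.get("parent_asin") or review.get("asin")
        if item:
            baskets[review["user_id"]].add(item)

    # Count in how many baskets each item occurs, then drop infrequent items.
    frequency = Counter(item for items in baskets.values() for item in items)
    transactions = []
    for items in baskets.values():
        kept = sorted(i for i in items if frequency[i] >= min_item_frequency)
        # Transactions must still contain at least min_transaction_size items.
        if len(kept) >= min_transaction_size:
            transactions.append(kept)
    return transactions


# Tiny made-up sample, for illustration only.
reviews = [
    {"user_id": "u1", "asin": "A1", "parent_asin": "P1", "verified_purchase": True},
    {"user_id": "u1", "asin": "A2", "parent_asin": None, "verified_purchase": True},
    {"user_id": "u2", "asin": "A3", "parent_asin": "P1", "verified_purchase": True},
    {"user_id": "u2", "asin": "A4", "parent_asin": "P2", "verified_purchase": False},
    {"user_id": "u3", "asin": "A5", "parent_asin": "P1", "verified_purchase": True},
    {"user_id": "u3", "asin": "A2", "parent_asin": "", "verified_purchase": True},
]
transactions = build_transactions(reviews, min_transaction_size=2)
print(transactions)
```

In this sample, user u2 ends up with a single verified item and is dropped by the min_transaction_size=2 rule, while the reviews with a missing Parent ASIN fall back to their ASIN.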
  1. Data Preparation:

    • Load combined dataset from Hugging Face (5 categories)
    • Filter verified purchases
    • Create transactions using Parent ASIN grouping
    • Apply preprocessing filters (transaction size, item frequency)
    • See Data Preprocessing for the detailed preprocessing pipeline
  2. Data Exploration:

    • Analyze dataset characteristics
    • Determine optimal preprocessing parameters using EDA
    • Visualize transaction patterns and item frequencies
    • Calculate programmatic minimum support suggestions
    • See Exploratory Data Analysis for the detailed EDA methodology
  3. Baseline Execution:

    • Run traditional Apriori algorithm with runtime tracking
    • Run FP-Growth algorithm with runtime tracking
  4. Improved Execution:

    • Run Improved Apriori algorithm with detailed runtime breakdown
    • Verify correctness by comparing results with traditional Apriori
  5. Performance Measurement:

    • Track execution time using internal algorithm timers (fit_time, frequent_itemsets_time, association_rules_time)
    • Track detailed metrics for Improved Apriori (initial_scan_time, candidate_generation_time, support_calculation_time)
    • Measure scalability across different support thresholds
  6. Result Analysis:

    • Compare frequent itemsets between algorithms
    • Verify correctness through set comparison
    • Calculate speedup ratios
    • Generate comprehensive visualizations
  7. Visualization:

    • Generate comprehensive visualizations for each algorithm
    • Create comparison visualizations
    • Generate runtime vs support threshold analysis
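The measurement and verification steps of the procedure (timing with time.perf_counter(), set comparison of frequent itemsets, speedup ratio) could look roughly like the sketch below. The two miners are small stand-ins written for this example, not the project's Traditional or Improved Apriori implementations: mine_by_scanning rescans every transaction for each candidate, while mine_by_tidset_intersection counts support by intersecting transaction-ID sets. All function names and the sample data are hypothetical.

```python
import time
from itertools import combinations


def mine_by_scanning(transactions, min_support):
    """Stand-in baseline: level-wise search, one full scan per candidate."""
    n = len(transactions)
    tx_sets = [frozenset(t) for t in transactions]
    current = [frozenset([i]) for i in sorted({i for t in tx_sets for i in t})]
    frequent = {}
    while current:
        survivors = []
        for candidate in current:
            count = sum(1 for t in tx_sets if candidate <= t)
            if count / n >= min_support:
                frequent[candidate] = count / n
                survivors.append(candidate)
        # Join frequent k-itemsets that differ in one item into (k+1)-candidates.
        current = list({a | b for a, b in combinations(survivors, 2)
                        if len(a | b) == len(a) + 1})
    return frequent


def mine_by_tidset_intersection(transactions, min_support):
    """Stand-in variant: support counted by intersecting transaction-ID sets."""
    n = len(transactions)
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(frozenset([item]), set()).add(tid)
    current = {k: v for k, v in tidsets.items() if len(v) / n >= min_support}
    frequent = {}
    while current:
        frequent.update({k: len(v) / n for k, v in current.items()})
        next_level = {}
        for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
            union = a | b
            if len(union) == len(a) + 1 and union not in next_level:
                tids = tids_a & tids_b  # transactions containing every item
                if len(tids) / n >= min_support:
                    next_level[union] = tids
        current = next_level
    return frequent


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Made-up transactions, for illustration only.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
baseline, t_baseline = timed(mine_by_scanning, transactions, 0.4)
improved, t_improved = timed(mine_by_tidset_intersection, transactions, 0.4)

# Correctness check: both miners must return exactly the same itemsets.
same_itemsets = set(baseline) == set(improved)
speedup = t_baseline / t_improved if t_improved > 0 else float("inf")
print(f"identical itemsets: {same_itemsets}, speedup: {speedup:.2f}x")
```

On data this small the timings are dominated by noise, so the speedup figure is not meaningful here. Repeating the same comparison at several min_support values (for example across the 0.05% to 0.3% range used in the scalability analysis) would give the runtime versus support threshold curve.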

See the Results page for detailed experimental results and comparative analysis.