Experiments
Experimental Setup
Datasets
The experiments use the Amazon Reviews 2023 dataset from Hugging Face, combining multiple product categories:
- Categories:
  - Appliances
  - Digital Music
  - Health and Personal Care
  - Handmade Products
  - All Beauty
- Total records: 4,118,850 reviews (combined)
- Verified purchases: Filtered for verified purchases only (typically 60-80% of records)
- Transactions: Varies based on preprocessing parameters (typically thousands to tens of thousands)
- Preprocessing:
  - Parent ASIN used for product grouping (fallback to ASIN when unavailable)
  - min_transaction_size=2 (transactions must have at least 2 items)
  - Infrequent items filtered (min_item_frequency varies)
  - Category column added to track data source
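The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: the function name build_transactions is hypothetical, the record fields (user_id, parent_asin, asin, verified_purchase) are assumed to match the dataset's review schema, and grouping one transaction per user as well as filtering item frequency before transaction size are assumptions.

```python
from collections import Counter, defaultdict


def build_transactions(reviews, min_transaction_size=2, min_item_frequency=2):
    """Group verified reviews into per-user item sets, then filter (assumed design)."""
    baskets = defaultdict(set)
    for review in reviews:
        if not review.get("verified_purchase"):
            continue  # keep verified purchases only
        # Prefer the parent ASIN; fall back to the ASIN when it is unavailable.
        item = review.get("parent_asin") or review.get("asin")
        if item:
            baskets[review["user_id"]].add(item)

    # Count how many baskets contain each item, then drop infrequent items.
    frequency = Counter(item for items in baskets.values() for item in items)
    transactions = []
    for items in baskets.values():
        kept = frozenset(i for i in items if frequency[i] >= min_item_frequency)
        if len(kept) >= min_transaction_size:
            transactions.append(kept)
    return transactions


# Toy records for illustration only (not taken from the dataset).
reviews = [
    {"user_id": "u1", "parent_asin": "P1", "asin": "A1", "verified_purchase": True},
    {"user_id": "u1", "parent_asin": None, "asin": "A2", "verified_purchase": True},
    {"user_id": "u2", "parent_asin": "P1", "asin": "A9", "verified_purchase": True},
    {"user_id": "u2", "parent_asin": "A2", "asin": "A2", "verified_purchase": True},
    {"user_id": "u3", "parent_asin": "P3", "asin": "A3", "verified_purchase": False},
]
transactions = build_transactions(reviews)
```

In this toy input, the unverified review from u3 is dropped, and the missing parent ASIN for u1's second review falls back to its ASIN, so both remaining users end up with the same two-item transaction.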
Experimental Design
- Baseline 1: Traditional Apriori algorithm
- Baseline 2: FP-Growth algorithm (standard implementation)
- Test: Improved Apriori algorithm (Weighted Apriori with intersection-based counting)
- Metrics: Execution time (microsecond precision), scalability across support thresholds, correctness verification
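To make "intersection-based counting" concrete, the sketch below shows the general technique: store, for each item, the set of transaction IDs that contain it, and obtain an itemset's support as the size of the intersection of those sets. It is a generic illustration under that assumption, not the Improved Apriori implementation evaluated here, and it does not cover the weighting component. The names build_tidsets and support_count are hypothetical.

```python
from collections import defaultdict


def build_tidsets(transactions):
    """Map each item to the set of transaction IDs (tids) that contain it."""
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def support_count(itemset, tidsets):
    """Support of a non-empty itemset = size of the intersection of its items' tid sets."""
    # Intersect the smallest sets first so intermediate results stay small.
    sets = sorted((tidsets.get(item, set()) for item in itemset), key=len)
    common = set(sets[0])
    for other in sets[1:]:
        common &= other
        if not common:
            break  # no transaction contains all items; stop early
    return len(common)


# Toy transactions for illustration only.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
tidsets = build_tidsets(transactions)
support_ab = support_count({"a", "b"}, tidsets)
support_abc = support_count({"a", "b", "c"}, tidsets)
```

Compared with rescanning every transaction for each candidate, this replaces repeated database passes with set intersections, which is the usual motivation for the approach.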
Experimental Parameters
- Minimum Support Threshold:
  - Primary experiments: Calculated programmatically from the item frequency distribution (typically 0.0005, i.e. 0.05%)
  - Scalability analysis: Varied from 0.05% to 0.3% to test performance across different thresholds
- Minimum Confidence: 0.3 (30%) for association rule generation
- Preprocessing:
  - Parent ASIN grouping with fallback to ASIN
  - min_transaction_size=2 (only transactions with 2+ items)
  - Programmatic support threshold calculation using the suggest_min_support() function
- Runtime Measurement: Uses time.perf_counter() for microsecond-level precision with internal algorithm timing
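The relationship between a relative support threshold and an absolute transaction count, and the way a threshold sweep can be timed with time.perf_counter(), can be sketched as follows. The helpers min_support_count, timed, and frequent_items are hypothetical names introduced for this example; frequent_items is only a placeholder for a real mining algorithm, and the toy thresholds are chosen for a four-transaction example rather than taken from the experiments.

```python
import math
import time
from collections import Counter


def min_support_count(min_support, n_transactions):
    """Convert a relative threshold into the smallest qualifying absolute count."""
    # Subtract a tiny epsilon so float noise (e.g. 60.00000000000001) does not round up.
    return math.ceil(min_support * n_transactions - 1e-9)


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) measured with time.perf_counter()."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def frequent_items(transactions, min_count):
    """Placeholder miner: returns frequent single items only."""
    counts = Counter(item for t in transactions for item in t)
    return {item for item, count in counts.items() if count >= min_count}


# Toy data and thresholds for illustration only.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "d"}]
runtimes = {}
for min_support in (0.25, 0.5, 0.75):
    min_count = min_support_count(min_support, len(transactions))
    result, elapsed = timed(frequent_items, transactions, min_count)
    runtimes[min_support] = (result, elapsed)
```

For scale, a threshold of 0.0005 over 20,000 transactions corresponds to an absolute count of 10, so lowering the threshold quickly admits many more candidate itemsets.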
Methodology
- Data Preparation:
  - Load combined dataset from Hugging Face (5 categories)
  - Filter verified purchases
  - Create transactions using Parent ASIN grouping
  - Apply preprocessing filters (transaction size, item frequency)
  - See Data Preprocessing for the detailed preprocessing pipeline
- Data Exploration:
  - Analyze dataset characteristics
  - Determine optimal preprocessing parameters using EDA
  - Visualize transaction patterns and item frequencies
  - Calculate programmatic minimum support suggestions
  - See Exploratory Data Analysis for the detailed EDA methodology
- Baseline Execution:
  - Run traditional Apriori algorithm with runtime tracking
  - Run FP-Growth algorithm with runtime tracking
- Improved Execution:
  - Run Improved Apriori algorithm with detailed runtime breakdown
  - Verify correctness by comparing results with traditional Apriori
- Performance Measurement:
  - Track execution time using internal algorithm timers (fit_time, frequent_itemsets_time, association_rules_time)
  - Track detailed metrics for Improved Apriori (initial_scan_time, candidate_generation_time, support_calculation_time)
  - Measure scalability across different support thresholds
- Result Analysis:
  - Compare frequent itemsets between algorithms
  - Verify correctness through set comparison
  - Calculate speedup ratios
  - Generate comprehensive visualizations
- Visualization:
  - Generate comprehensive visualizations for each algorithm
  - Create comparison visualizations
  - Generate runtime vs support threshold analysis
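The correctness check and speedup calculation from the Result Analysis step can be sketched as below. This assumes each algorithm's output has been normalized to a dict mapping a frozenset of items to its support count; that representation, the function names compare_itemsets and speedup, and all values shown are hypothetical and chosen only to illustrate the comparison, not drawn from the experiments.

```python
def compare_itemsets(baseline, candidate):
    """Report differences between two {frozenset(items): support_count} results."""
    base_keys, cand_keys = set(baseline), set(candidate)
    return {
        "missing": base_keys - cand_keys,  # found by the baseline only
        "extra": cand_keys - base_keys,    # found by the candidate only
        "support_mismatch": {
            k for k in base_keys & cand_keys if baseline[k] != candidate[k]
        },
    }


def speedup(baseline_seconds, improved_seconds):
    """Speedup ratio: values above 1 mean the improved run was faster."""
    return baseline_seconds / improved_seconds


# Made-up outputs and timings, for illustration only.
apriori_result = {frozenset({"a"}): 3, frozenset({"b"}): 3, frozenset({"a", "b"}): 2}
improved_result = {frozenset({"b"}): 3, frozenset({"a", "b"}): 2, frozenset({"a"}): 3}

report = compare_itemsets(apriori_result, improved_result)
results_match = not any(report.values())
ratio = speedup(baseline_seconds=2.0, improved_seconds=0.5)
```

Using frozensets as keys makes the comparison independent of item order within an itemset and of the order in which each algorithm emits its results.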
Results Summary
See the Results page for detailed experimental results and comparative analysis.