Data Preprocessing

Data preprocessing is a crucial step that transforms raw Amazon review data into a format suitable for frequent itemset mining algorithms. The preprocessing pipeline handles missing data, performs feature engineering, and applies filtering to ensure data quality and algorithm efficiency.

The exploration notebook (exploration.ipynb) demonstrates the complete preprocessing pipeline on a combined dataset of 5 Amazon review categories, producing clean transactions ready for algorithm execution.

The preprocessing pipeline consists of several sequential steps:

  1. Data Loading: Load JSONL dataset files
  2. Missing Data Handling: Filter and handle null values
  3. Feature Engineering: Create transactions from user-product relationships
  4. Data Filtering: Remove infrequent items and small transactions
  5. Format Conversion: Convert to algorithm-ready format
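As a minimal sketch of step 1, the JSONL files can be loaded with Polars; the file paths and category names below are hypothetical placeholders:

```python
import polars as pl

# Hypothetical file layout: one JSONL file per review category.
categories = ["Electronics", "Books", "Home_and_Kitchen", "Toys_and_Games", "Beauty"]

# pl.read_ndjson parses newline-delimited JSON into a DataFrame;
# a vertical concat stacks the per-category frames into one dataset.
df = pl.concat(
    [pl.read_ndjson(f"data/{cat}.jsonl") for cat in categories],
    how="vertical",
)
```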

The Amazon Reviews dataset contains several types of missing values that require different handling strategies:

Problem: Many products lack a parent ASIN (the parent_asin field is null)

Solution:

  • Parent ASIN groups product variants together (e.g., different colors/sizes)
  • When parent ASIN is unavailable, individual ASIN is used
  • This ensures no data loss while maximizing product grouping benefits

Impact:

  • Reduces unique item count when parent ASINs are available
  • Creates more meaningful product associations
  • Improves algorithm performance by reducing sparsity
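In Polars, this fallback is a one-line coalesce; a minimal sketch, assuming the columns are named parent_asin and asin:

```python
import polars as pl

# Use parent_asin when present; otherwise fall back to the product's own asin.
df = df.with_columns(
    pl.coalesce([pl.col("parent_asin"), pl.col("asin")]).alias("item")
)
```

The resulting item column is what transactions are built from, so variants of the same product collapse into a single item wherever a parent ASIN exists.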

Problem: Some records may have null or False values for verified_purchase

Solution: Filter to only include verified purchases

  • Verified purchases ensure data quality
  • Unverified reviews may not represent actual purchases
  • Reduces noise in transaction data

Impact:

  • Typically reduces dataset size by 20-40%
  • Improves data quality and result reliability
  • Focuses analysis on actual purchase behavior
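A minimal sketch of this filter, assuming a boolean verified_purchase column where nulls should be treated as unverified:

```python
import polars as pl

# fill_null(False) treats missing values as unverified,
# so the filter keeps only rows with verified_purchase == True.
df = df.filter(pl.col("verified_purchase").fill_null(False))
```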

Problem: Critical fields may be missing, preventing transaction creation

Solution: Filter out records with missing critical fields

  • Both user_id and asin are required for transaction creation
  • Records without these fields cannot be used
  • Better to exclude than impute (no meaningful default values)

Impact:

  • Minimal data loss (these fields are rarely missing)
  • Ensures data integrity for transaction creation
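A minimal sketch of this filter using Polars' drop_nulls:

```python
# Drop any record missing either field required to build a transaction.
df = df.drop_nulls(subset=["user_id", "asin"])
```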
The following table summarizes the missing-data strategies:

| Field | Missing Strategy | Impact |
| --- | --- | --- |
| parent_asin | Fallback to asin | No data loss, maximizes grouping |
| verified_purchase | Filter (keep only True) | 20-40% data reduction, quality improvement |
| user_id | Filter (exclude) | Minimal loss, ensures integrity |
| asin | Filter (exclude) | Minimal loss, ensures integrity |

Feature engineering transforms raw review data into transaction format suitable for association rule mining.

The core feature engineering step is creating transactions from user-product relationships:

Transformation: Choose between ASIN and Parent ASIN

  • When to use Parent ASIN: When available, for better grouping
  • When to use ASIN: Fallback when Parent ASIN is null
  • Benefit: Maximizes product grouping while avoiding data loss

Transformation: Group multiple reviews by same user-product pair

  • Method: Use .unique() to remove duplicates
  • Benefit: Clean transaction data, one item per user-product
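A minimal sketch of both transformations together, assuming the coalesced item column from the missing-data step:

```python
import polars as pl

# One transaction per user: the set of distinct items that user reviewed.
# .unique() inside the aggregation deduplicates repeat user-product pairs.
transactions = df.group_by("user_id").agg(
    pl.col("item").unique().alias("items")
)
```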
These transformations provide three main benefits:

  1. Dimensionality Reduction:
    • Groups product variants together
    • Reduces unique item count
    • Improves algorithm scalability

  2. Meaningful Associations:
    • Captures product family relationships
    • More interpretable association rules
    • Better for business insights

  3. Data Quality:
    • Removes duplicate entries
    • Ensures transaction integrity
    • Focuses on verified purchases
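Steps 4 and 5 of the pipeline (data filtering and format conversion) can be sketched in the same style; the thresholds below are hypothetical, and the final list-of-lists shape is a common input format for itemset-mining implementations:

```python
import polars as pl

MIN_ITEM_COUNT = 5    # hypothetical minimum occurrence count per item
MIN_BASKET_SIZE = 2   # single-item transactions carry no co-occurrence signal

# Count how often each item appears across all transactions.
item_counts = transactions.explode("items").group_by("items").len()
frequent_items = item_counts.filter(pl.col("len") >= MIN_ITEM_COUNT)["items"]

# Keep only frequent items, then drop transactions that became too small.
filtered = (
    transactions.explode("items")
    .filter(pl.col("items").is_in(frequent_items))
    .group_by("user_id")
    .agg(pl.col("items"))
    .filter(pl.col("items").list.len() >= MIN_BASKET_SIZE)
)

# Step 5: convert to the list-of-lists format most miners expect.
baskets = filtered["items"].to_list()
```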

Data preprocessing transforms raw Amazon review data into a clean, structured format suitable for frequent itemset mining. By handling missing data appropriately, engineering meaningful features, and applying strategic filtering, we ensure:

  • Data Quality: Verified purchases, complete transactions
  • Algorithm Efficiency: Reduced dimensionality, filtered noise
  • Meaningful Results: Product grouping, appropriate support thresholds
  • Scalability: Efficient Polars-based operations

The preprocessing pipeline is designed to be flexible, allowing experimentation with different strategies while maintaining data integrity and algorithm compatibility.