Data Preprocessing
Overview
Data preprocessing is a crucial step that transforms raw Amazon review data into a format suitable for frequent itemset mining algorithms. The preprocessing pipeline handles missing data, performs feature engineering, and applies filtering to ensure data quality and algorithm efficiency.
The exploration notebook (exploration.ipynb) demonstrates the complete preprocessing pipeline on a combined dataset of 5 Amazon review categories, producing clean transactions ready for algorithm execution.
Preprocessing Pipeline
The preprocessing pipeline consists of several sequential steps:
- Data Loading: Load JSONL dataset files
- Missing Data Handling: Filter and handle null values
- Feature Engineering: Create transactions from user-product relationships
- Data Filtering: Remove infrequent items and small transactions
- Format Conversion: Convert to algorithm-ready format
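The filtering and format-conversion steps can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: the function names, the two thresholds (`min_item_count`, `min_transaction_size`), and the choice of a list of item lists as the "algorithm-ready format" are all assumptions made for the example.

```python
from collections import Counter


def filter_transactions(transactions, min_item_count=2, min_transaction_size=2):
    """Drop rare items, then drop transactions that became too small.

    Both thresholds are hypothetical defaults chosen for this example.
    """
    # Count the number of transactions in which each item appears.
    item_counts = Counter(
        item for items in transactions.values() for item in set(items)
    )
    frequent = {item for item, n in item_counts.items() if n >= min_item_count}

    filtered = {}
    for user, items in transactions.items():
        kept = sorted(set(items) & frequent)
        if len(kept) >= min_transaction_size:
            filtered[user] = kept
    return filtered


def to_algorithm_format(transactions):
    """Convert {user: items} into a list of item lists (assumed target format)."""
    return [items for _, items in sorted(transactions.items())]


transactions = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["A", "D"],
    "u4": ["E"],
}
filtered = filter_transactions(transactions)
dataset = to_algorithm_format(filtered)
print(dataset)
```

Removing infrequent items first matters here: a transaction can fall below the minimum size only after its rare items are gone, so the size check runs on the already-pruned item list.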
Handling Missing Data
Missing Value Types
The Amazon Reviews dataset contains several types of missing values that require different handling strategies:
1. Missing parent_asin Values
Problem: Many products don’t have a parent ASIN (null values)
Solution: Use the parent ASIN when it is present and fall back to the individual ASIN otherwise
- Parent ASIN groups product variants together (e.g., different colors/sizes)
- When parent ASIN is unavailable, the individual ASIN is used
- This ensures no data loss while maximizing product grouping benefits
Impact:
- Reduces unique item count when parent ASINs are available
- Creates more meaningful product associations
- Improves algorithm performance by reducing sparsity
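The fallback rule above can be illustrated with a small plain-Python sketch. The helper name `product_id` and the sample records are hypothetical, and treating an empty string as missing is an assumption of this example rather than something the dataset description states.

```python
def product_id(record):
    """Return parent_asin when present, otherwise fall back to asin."""
    parent = record.get("parent_asin")
    # None and empty strings are both treated as missing (assumption).
    return parent if parent else record.get("asin")


records = [
    {"asin": "B001", "parent_asin": "P100"},  # variant of P100
    {"asin": "B002", "parent_asin": "P100"},  # another variant of P100
    {"asin": "B003", "parent_asin": None},    # no parent: keep the ASIN
]
ids = [product_id(r) for r in records]
print(ids)
```

Three distinct ASINs collapse to two item identifiers, which is the reduction in unique item count described above. In a Polars pipeline the same idea could presumably be expressed as a coalesce over the two columns.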
2. Missing verified_purchase Values
Problem: Some records may have null or False values for verified_purchase
Solution: Filter to only include verified purchases
- Verified purchases ensure data quality
- Unverified reviews may not represent actual purchases
- Reduces noise in transaction data
Impact:
- Typically reduces dataset size by 20-40%
- Improves data quality and result reliability
- Focuses analysis on actual purchase behavior
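As a rough sketch of this filter, the plain-Python example below keeps only records whose `verified_purchase` is exactly `True`, so that both `False` and missing values are dropped. The helper name `keep_verified` and the sample records are made up for illustration.

```python
def keep_verified(records):
    """Keep only records whose verified_purchase is exactly True."""
    # False, None, and an absent key are all excluded.
    return [r for r in records if r.get("verified_purchase") is True]


records = [
    {"user_id": "u1", "asin": "B001", "verified_purchase": True},
    {"user_id": "u2", "asin": "B002", "verified_purchase": False},
    {"user_id": "u3", "asin": "B003", "verified_purchase": None},
    {"user_id": "u4", "asin": "B004"},  # field absent entirely
]
verified = keep_verified(records)
removed_fraction = 1 - len(verified) / len(records)
print(len(verified), removed_fraction)
```

Tracking `removed_fraction` alongside the filter is a cheap way to check how much data a given category actually loses at this step.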
3. Missing user_id or asin Values
Problem: Critical fields may be missing, preventing transaction creation
Solution: Filter out records with missing critical fields
- Both user_id and asin are required for transaction creation
- Records without these fields cannot be used
- Better to exclude than impute (no meaningful default values)
Impact:
- Minimal data loss (these fields are rarely missing)
- Ensures data integrity for transaction creation
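A minimal sketch of this exclusion step, combined with JSONL parsing, might look like the following. It uses only the standard library; `load_complete_records` and `REQUIRED_FIELDS` are hypothetical names, and the inline sample lines stand in for a real dataset file.

```python
import json

REQUIRED_FIELDS = ("user_id", "asin")


def load_complete_records(lines):
    """Parse JSONL lines and keep records that have every required field."""
    kept, dropped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines without counting them as dropped
        record = json.loads(line)
        # A field counts as missing when it is absent, null, or empty.
        if all(record.get(field) for field in REQUIRED_FIELDS):
            kept.append(record)
        else:
            dropped += 1
    return kept, dropped


sample = [
    '{"user_id": "u1", "asin": "B001"}',
    '{"user_id": null, "asin": "B002"}',
    '{"user_id": "u3"}',
    "",
    '{"user_id": "u4", "asin": "B004"}',
]
records, dropped = load_complete_records(sample)
print(len(records), dropped)
```

Because the function accepts any iterable of strings, an open file handle can be passed in directly in place of the sample list.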
Missing Data Summary
| Field | Missing Strategy | Impact |
|---|---|---|
| parent_asin | Fallback to asin | No data loss, maximizes grouping |
| verified_purchase | Filter (keep only True) | 20-40% data reduction, quality improvement |
| user_id | Filter (exclude) | Minimal loss, ensures integrity |
| asin | Filter (exclude) | Minimal loss, ensures integrity |
Feature Engineering
Feature engineering transforms raw review data into transaction format suitable for association rule mining.
Transaction Creation
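The core feature engineering step is creating transactions from user-product relationships: the products associated with each user are collected into one transaction, so that products sharing a user can be mined as co-occurring items.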
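A minimal plain-Python sketch of this grouping is shown below. It is an illustration under assumptions rather than the notebook's Polars code: `build_transactions` is a hypothetical helper, and the input is assumed to be `(user_id, product_id)` pairs that have already passed the missing-data filters.

```python
from collections import defaultdict


def build_transactions(pairs):
    """Group (user_id, product_id) pairs into one item list per user."""
    baskets = defaultdict(set)
    for user_id, product in pairs:
        # A set ignores repeated reviews of the same product by one user.
        baskets[user_id].add(product)
    # Sort items so the output is deterministic.
    return {user: sorted(items) for user, items in baskets.items()}


pairs = [
    ("u1", "P100"),
    ("u1", "B003"),
    ("u2", "P100"),
    ("u1", "P100"),  # duplicate review, collapsed by the set
]
transactions = build_transactions(pairs)
print(transactions)
```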
Feature Transformations
1. Product ID Selection
Transformation: Choose between ASIN and Parent ASIN
- When to use Parent ASIN: When available, for better grouping
- When to use ASIN: Fallback when Parent ASIN is null
- Benefit: Maximizes product grouping while avoiding data loss
2. User-Product Aggregation
Transformation: Group multiple reviews by same user-product pair
- Method: Use .unique() to remove duplicates
- Benefit: Clean transaction data, one item per user-product
Feature Engineering Benefits
- Dimensionality Reduction:
  - Groups product variants together
  - Reduces unique item count
  - Improves algorithm scalability
- Meaningful Associations:
  - Captures product family relationships
  - More interpretable association rules
  - Better for business insights
- Data Quality:
  - Removes duplicate entries
  - Ensures transaction integrity
  - Focuses on verified purchases
Conclusion
Data preprocessing transforms raw Amazon review data into a clean, structured format suitable for frequent itemset mining. By handling missing data appropriately, engineering meaningful features, and applying strategic filtering, we ensure:
- Data Quality: Verified purchases, complete transactions
- Algorithm Efficiency: Reduced dimensionality, filtered noise
- Meaningful Results: Product grouping, appropriate support thresholds
- Scalability: Efficient Polars-based operations
The preprocessing pipeline is designed to be flexible, allowing experimentation with different strategies while maintaining data integrity and algorithm compatibility.