LeetCode 3230 - Customer Purchasing Behavior Analysis

Difficulty: 🟡 Medium
Topics: Database

Solution

Problem Understanding

The problem asks us to analyze customer purchasing behavior using two relational tables: Transactions and Products. The Transactions table contains every purchase made by customers, including the transaction ID, customer ID, product ID, transaction date, and amount spent. The Products table contains the product catalog with product IDs, their categories, and prices.

We are tasked with producing a summary per customer that includes the total money spent, number of transactions, number of unique categories purchased, average transaction amount, the most frequently purchased category (with tie-breaking by most recent purchase), and a calculated loyalty score. The output must be rounded appropriately and sorted by loyalty score descending, then customer ID ascending.

Important details include the rounding of numeric outputs to two decimal places, tie-breaking for top categories using transaction recency, and that loyalty score depends both on the number of transactions and the total amount spent.

Key edge cases include customers with only one transaction, multiple categories with identical purchase counts, transactions with identical timestamps, and customers who purchased only a single category.

Approaches

A brute-force approach would start by iterating through each customer, computing aggregates like total amount, transaction count, and unique categories by scanning all transactions multiple times. To find the top category, one would iterate through each customer's transaction history and count category occurrences, then identify the most recent category in the event of ties. While conceptually simple, this approach requires multiple scans per customer, leading to poor performance on large datasets.

The optimal approach leverages SQL aggregation and window functions. By joining the Transactions and Products tables, we can calculate sums, counts, and distinct category counts using standard GROUP BY aggregations. To find the most frequent category with recency tie-breaking, we can use ROW_NUMBER() over a partition of customer and category, ordering first by count descending and then by transaction date descending. This approach uses a single scan of the joined table and efficient grouping/window operations.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(N*M)	O(N*M)	Iterates per customer over all transactions, multiple times
Optimal	O(N log N)	O(N)	Uses SQL aggregations and window functions to compute all metrics efficiently

Algorithm Walkthrough

Join the Transactions table with the Products table on product_id to access category information for each transaction.
Compute the total amount spent, transaction count, and unique category count per customer using GROUP BY customer_id.
Compute the average transaction amount as total amount divided by transaction count and round to 2 decimal places.
For each customer-category pair, count the number of transactions and identify the latest transaction date.
Use a window function with ROW_NUMBER() partitioned by customer and ordered by transaction count descending, then by latest transaction date descending to assign a rank to categories per customer.
Select the top-ranked category per customer as the most frequently purchased category with tie-breaking.
Calculate the loyalty score as (transaction_count * 10) + (total_amount / 100) and round to 2 decimal places.
Combine all metrics into a single table and order by loyalty score descending, then by customer ID ascending.

The algorithm works because aggregation correctly summarizes numeric data, and the window function ensures that ties in category frequency are resolved by recency.

Python Solution

import pandas as pd

def customer_behavior_analysis(transactions: pd.DataFrame, products: pd.DataFrame) -> pd.DataFrame:
    # Join transactions with products to get categories
    df = transactions.merge(products, on='product_id', how='left')
    
    # Aggregate per customer
    agg = df.groupby('customer_id').agg(
        total_amount=pd.NamedAgg(column='amount', aggfunc='sum'),
        transaction_count=pd.NamedAgg(column='transaction_id', aggfunc='count'),
        unique_categories=pd.NamedAgg(column='category', aggfunc=lambda x: x.nunique())
    ).reset_index()
    
    # Average transaction amount
    agg['avg_transaction_amount'] = (agg['total_amount'] / agg['transaction_count']).round(2)
    agg['total_amount'] = agg['total_amount'].round(2)
    
    # Loyalty score
    agg['loyalty_score'] = ((agg['transaction_count'] * 10) + (agg['total_amount'] / 100)).round(2)
    
    # Top category calculation
    category_counts = df.groupby(['customer_id', 'category']).agg(
        category_count=('transaction_id', 'count'),
        last_transaction=('transaction_date', 'max')
    ).reset_index()
    
    category_counts['rank'] = category_counts.groupby('customer_id')\
        .apply(lambda x: x.sort_values(['category_count', 'last_transaction'], ascending=[False, False])
               .assign(rank=lambda y: range(1, len(y)+1)))['rank'].values
    
    top_category = category_counts[category_counts['rank'] == 1][['customer_id', 'category']]
    top_category = top_category.rename(columns={'category': 'top_category'})
    
    # Merge top category with aggregated metrics
    result = agg.merge(top_category, on='customer_id', how='left')
    
    # Sort by loyalty score descending, then customer_id ascending
    result = result.sort_values(['loyalty_score', 'customer_id'], ascending=[False, True])
    
    return result

This implementation first merges the transactions with product information to access category data. Then, customer-level aggregates are computed, including total amount, transaction count, unique categories, and average transaction amount. Loyalty score is calculated using the formula given. Top category per customer is determined by counting transactions per category and applying a window function with tie-breaking based on the most recent transaction. Finally, all metrics are combined and sorted as required.

Go Solution

package main

import (
    "database/sql"
    _ "github.com/go-sql-driver/mysql"
)

func CustomerBehaviorAnalysis(db *sql.DB) (*sql.Rows, error) {
    query := `
    WITH category_rank AS (
        SELECT 
            t.customer_id,
            p.category,
            COUNT(*) AS category_count,
            MAX(t.transaction_date) AS last_transaction,
            ROW_NUMBER() OVER(PARTITION BY t.customer_id 
                              ORDER BY COUNT(*) DESC, MAX(t.transaction_date) DESC) AS rn
        FROM Transactions t
        JOIN Products p ON t.product_id = p.product_id
        GROUP BY t.customer_id, p.category
    )
    SELECT 
        a.customer_id,
        ROUND(SUM(t.amount), 2) AS total_amount,
        COUNT(t.transaction_id) AS transaction_count,
        COUNT(DISTINCT p.category) AS unique_categories,
        ROUND(SUM(t.amount)/COUNT(t.transaction_id), 2) AS avg_transaction_amount,
        cr.category AS top_category,
        ROUND(COUNT(t.transaction_id)*10 + SUM(t.amount)/100, 2) AS loyalty_score
    FROM Transactions t
    JOIN Products p ON t.product_id = p.product_id
    JOIN category_rank cr ON cr.customer_id = t.customer_id AND cr.rn = 1
    GROUP BY a.customer_id, cr.category
    ORDER BY loyalty_score DESC, customer_id ASC;
    `
    return db.Query(query)
}

In Go, the approach is similar. SQL is used to perform aggregation and ranking. ROW_NUMBER() is used to determine the top category per customer, and the final selection aggregates transactions per customer while joining the top category. We rely on the SQL engine for rounding and sorting.

Worked Examples

For customer 101:

Step	Calculation
Total amount	100 + 150 + 200 = 450.00
Transaction count	3
Unique categories	A, B, C → 3
Avg transaction	450 / 3 = 150.00
Top category	A, B, C all count 1 → C is most recent (2023-02-10)
Loyalty score	(3*10) + (450/100) = 34.50

For customer 102:

Step	Calculation
Total amount	100 + 200 = 300.00
Transaction count	2
Unique categories	A, C → 2
Avg transaction	300 / 2 = 150.00
Top category	A, C both count 1 → C is most recent (2023-01-22)
Loyalty score	(2*10) + (300/100) = 23.00

Complexity Analysis

Measure	Complexity	Explanation
Time	O(N log N)	Aggregation and ranking over N transactions require sorting for ROW_NUMBER tie-breaking
Space	O(N)	Storing aggregated metrics and category counts per customer

The main cost is sorting when computing the rank of top categories. Aggregation functions are linear over the dataset.

Test Cases

import pandas as pd

# Example test case
transactions = pd.DataFrame([
    [1, 101, 1, '2023-01-01', 100.00],
    [2, 101, 2, '2023-01-15', 150.00],
    [3, 102, 1, '2023-01-01', 100.00],
    [4, 102, 3, '2023-01-22', 200.00],
    [5, 101, 3, '2023-02-10', 200.00]
], columns=['transaction_id', 'customer_id', '