LeetCode 1097 - Game Play Analysis V

This problem requires calculating the day one retention metric for a game based on user activity data. The input is an Activity table, where each row represents a login session of a player on a particular day, including the device used and the number of games played.

LeetCode Problem 1097

Difficulty: 🔴 Hard
Topics: Database

Solution

Problem Understanding

This problem requires calculating the day one retention metric for a game based on user activity data. The input is an Activity table, where each row represents a login session of a player on a particular day, including the device used and the number of games played. The install date for a player is defined as the date of their first login, which may not be unique in the table since multiple players can install on the same day.

The day one retention for a specific install date x is the proportion of players who installed the game on x and logged back in on the following day (x+1). This proportion is rounded to two decimal places.

The output should be a table with three columns: install_dt (the install date), installs (number of players who installed on that day), and Day1_retention (the calculated retention for that day).

Important constraints include that the table may contain multiple entries per player across different days, and the table can be large, so an efficient solution is necessary. Edge cases that could trip up a naive implementation include players who never log in again, players who have multiple devices, or days where no player returns.

Approaches

A brute-force approach would iterate through every player and their activity records to identify the install date, then check if they logged in the next day. While this works correctly, it requires scanning all activity for each player for every install date, leading to high time complexity that may not scale with large datasets.

The optimal approach uses SQL window functions and joins. The key insight is to compute each player's install date once, then check for the existence of a record on the next day using a self-join. This reduces repeated scanning and allows grouping by install date to calculate the retention metric efficiently.

Approach Time Complexity Space Complexity Notes
Brute Force O(N^2) O(N) Iterate each player and each day, inefficient for large tables
Optimal O(N) O(N) Compute install date, join on next day, aggregate by install date

Algorithm Walkthrough

  1. Compute Install Dates: For each player, find the minimum event_date to identify their install date. This ensures every player has a unique install date.
  2. Join Next-Day Activity: Perform a self-join on the table to check whether each player logged in the day after their install date. This identifies returning players for day one retention.
  3. Aggregate by Install Date: Group the data by install_dt and count both the total installs and the number of players who returned on the next day.
  4. Calculate Day One Retention: Divide the number of returning players by total installs for each install date, and round to two decimal places.
  5. Return Result: Select install_dt, installs, and Day1_retention as the final output.

This works because the invariant is that each player’s install date is computed exactly once, and joining on the next day correctly identifies returning users, ensuring accurate retention calculation.

Python Solution

import pandas as pd

def day_one_retention(activity: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Compute install date for each player
    install_dates = activity.groupby('player_id')['event_date'].min().reset_index()
    install_dates.rename(columns={'event_date': 'install_dt'}, inplace=True)
    
    # Step 2: Merge with activity to find next-day logins
    merged = install_dates.merge(
        activity,
        left_on='player_id',
        right_on='player_id',
        how='left'
    )
    
    # Step 3: Determine if login is on the next day
    merged['next_day_login'] = (merged['event_date'] - merged['install_dt']).dt.days == 1
    
    # Step 4: Aggregate by install date
    result = merged.groupby('install_dt').agg(
        installs=('player_id', 'nunique'),
        Day1_retention=('next_day_login', lambda x: round(x.sum() / x.nunique(), 2))
    ).reset_index()
    
    return result

This code first computes each player’s install date using groupby and min. It then merges back with the original table to find activity on the day after install. The next_day_login boolean column flags returning users, and aggregation computes both total installs and day one retention efficiently.

Go Solution

package main

import (
    "database/sql"
    _ "github.com/go-sql-driver/mysql"
    "fmt"
)

func DayOneRetention(db *sql.DB) ([]map[string]interface{}, error) {
    query := `
    WITH install_dates AS (
        SELECT player_id, MIN(event_date) AS install_dt
        FROM Activity
        GROUP BY player_id
    ),
    next_day_logins AS (
        SELECT i.install_dt, i.player_id
        FROM install_dates i
        LEFT JOIN Activity a
        ON i.player_id = a.player_id
        AND DATEDIFF(a.event_date, i.install_dt) = 1
    )
    SELECT 
        install_dt,
        COUNT(DISTINCT player_id) AS installs,
        ROUND(SUM(CASE WHEN a.player_id IS NOT NULL THEN 1 ELSE 0 END) / COUNT(DISTINCT i.player_id), 2) AS Day1_retention
    FROM install_dates i
    LEFT JOIN Activity a
    ON i.player_id = a.player_id AND DATEDIFF(a.event_date, i.install_dt) = 1
    GROUP BY install_dt
    `
    rows, err := db.Query(query)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    results := []map[string]interface{}{}
    for rows.Next() {
        var install_dt string
        var installs int
        var day1_retention float64
        if err := rows.Scan(&install_dt, &installs, &day1_retention); err != nil {
            return nil, err
        }
        results = append(results, map[string]interface{}{
            "install_dt": install_dt,
            "installs": installs,
            "Day1_retention": day1_retention,
        })
    }
    return results, nil
}

In Go, we leverage SQL for efficient computation. The query computes install dates, joins activity for the next day, and aggregates the results. The main difference from Python is the reliance on SQL for the heavy computation and scanning the sql.Rows to construct the Go data structure.

Worked Examples

For the given example:

  1. Player 1 installs on 2016-03-01 and logs in again on 2016-03-02.
  2. Player 3 installs on 2016-03-01 and does not log in on 2016-03-02.
  3. Player 2 installs on 2017-06-25 and does not log in the next day.

Grouping by install date:

install_dt installs next_day_logins Day1_retention
2016-03-01 2 1 0.50
2017-06-25 1 0 0.00

Complexity Analysis

Measure Complexity Explanation
Time O(N) Each player is processed once to compute install date, and joining on next-day activity is linear in the number of records.
Space O(N) Storage for install dates and merged dataset.

The algorithm avoids repeated scanning of activity records per player, making it efficient even for large datasets.

Test Cases

import pandas as pd

# Test case 1: Provided example
activity1 = pd.DataFrame({
    "player_id": [1, 1, 2, 3, 3],
    "device_id": [2, 2, 3, 1, 4],
    "event_date": pd.to_datetime(["2016-03-01", "2016-03-02", "2017-06-25", "2016-03-01", "2016-07-03"]),
    "games_played": [5, 6, 1, 0, 5]
})
result1 = day_one_retention(activity1)
assert result1.loc[result1.install_dt=="2016-03-01", "Day1_retention"].iloc[0] == 0.50
assert result1.loc[result1.install_dt=="2017-06-25", "Day1_retention"].iloc[0] == 0.00

# Test case 2: No player returns
activity2 = pd.DataFrame({
    "player_id": [1, 2],
    "device_id": [1, 2],
    "event_date": pd.to_datetime(["2023-01-01", "2023-01-01"]),
    "games_played": [3, 2]
})
result2 = day_one_retention(activity2)
assert all(result2["Day1_retention"] == 0.00)

# Test case 3: All players return next day
activity3 = pd.DataFrame({
    "player_id": [1, 1, 2, 2],
    "device_id": [1, 1, 2, 2],
    "event_date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-