LeetCode 1097 - Game Play Analysis V
This problem requires calculating the day one retention metric for a game based on user activity data. The input is an Activity table, where each row represents a login session of a player on a particular day, including the device used and the number of games played.
Difficulty: 🔴 Hard
Topics: Database
Solution
Problem Understanding
This problem requires calculating the day one retention metric for a game based on user activity data. The input is an Activity table, where each row represents a login session of a player on a particular day, including the device used and the number of games played. The install date for a player is defined as the date of their first login, which may not be unique in the table since multiple players can install on the same day.
The day one retention for a specific install date x is the proportion of players who installed the game on x and logged back in on the following day (x+1). This proportion is rounded to two decimal places.
The output should be a table with three columns: install_dt (the install date), installs (number of players who installed on that day), and Day1_retention (the calculated retention for that day).
Important constraints include that the table may contain multiple entries per player across different days, and the table can be large, so an efficient solution is necessary. Edge cases that could trip up a naive implementation include players who never log in again, players who have multiple devices, or days where no player returns.
Approaches
A brute-force approach would iterate through every player and their activity records to identify the install date, then check if they logged in the next day. While this works correctly, it requires scanning all activity for each player for every install date, leading to high time complexity that may not scale with large datasets.
The optimal approach uses SQL window functions and joins. The key insight is to compute each player's install date once, then check for the existence of a record on the next day using a self-join. This reduces repeated scanning and allows grouping by install date to calculate the retention metric efficiently.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(N^2) | O(N) | Iterate each player and each day, inefficient for large tables |
| Optimal | O(N) | O(N) | Compute install date, join on next day, aggregate by install date |
Algorithm Walkthrough
- Compute Install Dates: For each player, find the minimum
event_dateto identify their install date. This ensures every player has a unique install date. - Join Next-Day Activity: Perform a self-join on the table to check whether each player logged in the day after their install date. This identifies returning players for day one retention.
- Aggregate by Install Date: Group the data by
install_dtand count both the total installs and the number of players who returned on the next day. - Calculate Day One Retention: Divide the number of returning players by total installs for each install date, and round to two decimal places.
- Return Result: Select
install_dt,installs, andDay1_retentionas the final output.
This works because the invariant is that each player’s install date is computed exactly once, and joining on the next day correctly identifies returning users, ensuring accurate retention calculation.
Python Solution
import pandas as pd
def day_one_retention(activity: pd.DataFrame) -> pd.DataFrame:
# Step 1: Compute install date for each player
install_dates = activity.groupby('player_id')['event_date'].min().reset_index()
install_dates.rename(columns={'event_date': 'install_dt'}, inplace=True)
# Step 2: Merge with activity to find next-day logins
merged = install_dates.merge(
activity,
left_on='player_id',
right_on='player_id',
how='left'
)
# Step 3: Determine if login is on the next day
merged['next_day_login'] = (merged['event_date'] - merged['install_dt']).dt.days == 1
# Step 4: Aggregate by install date
result = merged.groupby('install_dt').agg(
installs=('player_id', 'nunique'),
Day1_retention=('next_day_login', lambda x: round(x.sum() / x.nunique(), 2))
).reset_index()
return result
This code first computes each player’s install date using groupby and min. It then merges back with the original table to find activity on the day after install. The next_day_login boolean column flags returning users, and aggregation computes both total installs and day one retention efficiently.
Go Solution
package main
import (
"database/sql"
_ "github.com/go-sql-driver/mysql"
"fmt"
)
func DayOneRetention(db *sql.DB) ([]map[string]interface{}, error) {
query := `
WITH install_dates AS (
SELECT player_id, MIN(event_date) AS install_dt
FROM Activity
GROUP BY player_id
),
next_day_logins AS (
SELECT i.install_dt, i.player_id
FROM install_dates i
LEFT JOIN Activity a
ON i.player_id = a.player_id
AND DATEDIFF(a.event_date, i.install_dt) = 1
)
SELECT
install_dt,
COUNT(DISTINCT player_id) AS installs,
ROUND(SUM(CASE WHEN a.player_id IS NOT NULL THEN 1 ELSE 0 END) / COUNT(DISTINCT i.player_id), 2) AS Day1_retention
FROM install_dates i
LEFT JOIN Activity a
ON i.player_id = a.player_id AND DATEDIFF(a.event_date, i.install_dt) = 1
GROUP BY install_dt
`
rows, err := db.Query(query)
if err != nil {
return nil, err
}
defer rows.Close()
results := []map[string]interface{}{}
for rows.Next() {
var install_dt string
var installs int
var day1_retention float64
if err := rows.Scan(&install_dt, &installs, &day1_retention); err != nil {
return nil, err
}
results = append(results, map[string]interface{}{
"install_dt": install_dt,
"installs": installs,
"Day1_retention": day1_retention,
})
}
return results, nil
}
In Go, we leverage SQL for efficient computation. The query computes install dates, joins activity for the next day, and aggregates the results. The main difference from Python is the reliance on SQL for the heavy computation and scanning the sql.Rows to construct the Go data structure.
Worked Examples
For the given example:
- Player 1 installs on 2016-03-01 and logs in again on 2016-03-02.
- Player 3 installs on 2016-03-01 and does not log in on 2016-03-02.
- Player 2 installs on 2017-06-25 and does not log in the next day.
Grouping by install date:
| install_dt | installs | next_day_logins | Day1_retention |
|---|---|---|---|
| 2016-03-01 | 2 | 1 | 0.50 |
| 2017-06-25 | 1 | 0 | 0.00 |
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(N) | Each player is processed once to compute install date, and joining on next-day activity is linear in the number of records. |
| Space | O(N) | Storage for install dates and merged dataset. |
The algorithm avoids repeated scanning of activity records per player, making it efficient even for large datasets.
Test Cases
import pandas as pd
# Test case 1: Provided example
activity1 = pd.DataFrame({
"player_id": [1, 1, 2, 3, 3],
"device_id": [2, 2, 3, 1, 4],
"event_date": pd.to_datetime(["2016-03-01", "2016-03-02", "2017-06-25", "2016-03-01", "2016-07-03"]),
"games_played": [5, 6, 1, 0, 5]
})
result1 = day_one_retention(activity1)
assert result1.loc[result1.install_dt=="2016-03-01", "Day1_retention"].iloc[0] == 0.50
assert result1.loc[result1.install_dt=="2017-06-25", "Day1_retention"].iloc[0] == 0.00
# Test case 2: No player returns
activity2 = pd.DataFrame({
"player_id": [1, 2],
"device_id": [1, 2],
"event_date": pd.to_datetime(["2023-01-01", "2023-01-01"]),
"games_played": [3, 2]
})
result2 = day_one_retention(activity2)
assert all(result2["Day1_retention"] == 0.00)
# Test case 3: All players return next day
activity3 = pd.DataFrame({
"player_id": [1, 1, 2, 2],
"device_id": [1, 1, 2, 2],
"event_date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-