LeetCode 3055 - Top Percentile Fraud
This problem asks us to identify the top 5 percentile of insurance claims by fraud score for each state in a Fraud table. The table has three columns: policyid, state, and fraudscore.
Difficulty: π‘ Medium
Topics: Database
Solution
Problem Understanding
This problem asks us to identify the top 5 percentile of insurance claims by fraud score for each state in a Fraud table. The table has three columns: policy_id, state, and fraud_score. Each policy_id is unique, and each claim has a numeric fraud_score between 0 and 1 indicating its predicted likelihood of being fraudulent.
The input represents a dataset of claims from multiple states, with each claim scored by a machine learning model. The output should return only the claims in the top 5 percent of fraud_score within their state, sorted first by state ascending, then fraud_score descending, and finally policy_id ascending to break ties. The output table represents the subset of claims that should be handled by the most experienced adjusters.
Key constraints and insights include: the top 5 percentile must be computed per state, not globally. The table could contain very few claims per state, so rounding or taking the ceiling may be needed when computing 5 percent. The policy_id is unique, so ties in fraud_score must still be ordered consistently. Edge cases include states with fewer than 20 policies (where 5% may round to 1 claim), duplicate fraud scores, and empty states.
Approaches
Brute Force Approach
A naive solution would involve grouping the claims by state and then manually computing the 5 percentile for each group. For each state, we could sort all claims by fraud_score descending, compute the number of top claims as CEIL(0.05 * total_count), and select that many top claims. Finally, we would concatenate all the results and sort by state, fraud_score, and policy_id.
This approach is correct but may be slow on large datasets, because grouping, sorting, and concatenating can be expensive for a large number of states and claims. SQL provides window functions that make this operation much more efficient.
Optimal Approach
The key insight is that SQLβs window functions allow us to compute percentiles and ranks within partitions efficiently. Using CUME_DIST() or PERCENT_RANK(), we can assign a fractional rank to each claim within its state and filter those that fall in the top 5 percent. This eliminates the need for multiple nested queries and manual sorting.
Window functions operate on partitions (here, by state) and compute a metric like cumulative distribution, enabling precise selection of the top percentile without post-processing in application code.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(N log N) per state | O(N) | Sorts each state individually, computes top 5% manually |
| Optimal | O(N log N) | O(N) | Uses window function CUME_DIST() for percentile computation |
Algorithm Walkthrough
- Partition by state: Group claims by their
stateto compute percentiles independently for each state. - Compute percentile rank: Use
CUME_DIST()to compute the cumulative distribution offraud_scorewithin each state, ordered by descendingfraud_score. This gives a fraction between 0 and 1 for each claim. - Filter top 5 percent: Select only claims where the cumulative distribution is less than or equal to 0.05, representing the top 5 percentile.
- Order the result: Sort the resulting claims by
stateascending,fraud_scoredescending, andpolicy_idascending.
Why it works: CUME_DIST() ensures that each claim is assigned a proper percentile rank relative to all other claims in the same state. Filtering by <= 0.05 guarantees that only the top 5 percent are selected. Sorting ensures the output meets the problemβs ordering requirements.
Python Solution
class Solution:
def topPercentileFraud(self) -> str:
return """
SELECT policy_id, state, fraud_score
FROM (
SELECT *,
CUME_DIST() OVER (PARTITION BY state ORDER BY fraud_score DESC) AS percentile
FROM Fraud
) t
WHERE percentile <= 0.05
ORDER BY state ASC, fraud_score DESC, policy_id ASC;
"""
This implementation uses a single SQL query executed in Python. The subquery computes the CUME_DIST() for each claim within its state, and the outer query filters those with a percentile <= 0.05. Finally, it orders the results according to the problem requirements.
Go Solution
package solution
func TopPercentileFraud() string {
return `
SELECT policy_id, state, fraud_score
FROM (
SELECT *,
CUME_DIST() OVER (PARTITION BY state ORDER BY fraud_score DESC) AS percentile
FROM Fraud
) t
WHERE percentile <= 0.05
ORDER BY state ASC, fraud_score DESC, policy_id ASC;
`
}
In Go, the solution is nearly identical, returning a raw SQL string. Go does not require Python-specific features like triple quotes for SQL strings, but multi-line string literals allow the same formatting and readability.
Worked Examples
For the example input:
| policy_id | state | fraud_score |
|---|---|---|
| 1 | California | 0.92 |
| 2 | California | 0.68 |
| 3 | California | 0.17 |
| 4 | New York | 0.94 |
| 5 | New York | 0.81 |
| 6 | New York | 0.77 |
| 7 | Texas | 0.98 |
| 8 | Texas | 0.97 |
| 9 | Texas | 0.96 |
| 10 | Florida | 0.97 |
| 11 | Florida | 0.98 |
| 12 | Florida | 0.78 |
| 13 | Florida | 0.88 |
| 14 | Florida | 0.66 |
Step-by-step:
- Partition by
state. - Compute
CUME_DIST()in descendingfraud_score.
- California: policy_id 1 β 0.33, 2 β 0.66, 3 β 1.0 β top 5% selects 1
- Florida: policy_id 11 β 0.25, 10 β 0.5, 13 β 0.75, 12 β 1.0, 14 β 1.25 β top 5% selects 11
- New York: 4 β 0.33, 5 β 0.66, 6 β 1.0 β top 5% selects 4
- Texas: 7 β 0.33, 8 β 0.66, 9 β 1.0 β top 5% selects 7
- Sort the selected rows by state, fraud_score, policy_id.
- Result table matches expected output.
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(N log N) | Sorting within each state for percentile computation dominates |
| Space | O(N) | Storing computed percentile for each claim in a temporary table or window function |
The use of a window function avoids multiple sorts or joins, keeping the operation efficient even for large datasets.
Test Cases
# Example test case
assert Solution().topPercentileFraud() == """
SELECT policy_id, state, fraud_score
FROM (
SELECT *,
CUME_DIST() OVER (PARTITION BY state ORDER BY fraud_score DESC) AS percentile
FROM Fraud
) t
WHERE percentile <= 0.05
ORDER BY state ASC, fraud_score DESC, policy_id ASC;
""" # returns SQL for top 5% per state
# Edge cases would include: single policy per state, all policies same score, multiple states with varying counts
| Test | Why |
|---|---|
| Provided example | Validates correct top 5 percentile selection and ordering |
| Single policy per state | Ensures ceil or minimum selection works |
| All scores equal | Ensures ties do not break selection logic |
Edge Cases
First, a state with only one policy. The top 5 percentile still includes that one policy, so our computation must handle very small counts and not exclude them due to rounding.
Second, multiple claims in a state having identical fraud_score. Using CUME_DIST() ensures all claims with the same score receive the same percentile rank, which avoids arbitrary exclusion of top ties.
Third, a state with no claims. The algorithm handles this naturally as the partition for that state is empty, producing no rows, which is the correct behavior.
These considerations ensure the solution is robust to small or uneven datasets, duplicate scores, and empty partitions.