LeetCode 571 - Find Median Given Frequency of Numbers
This problem provides a compressed representation of a dataset instead of listing every number individually. The table Numbers contains two columns: - num, the actual number - frequency, how many times that number appears If we expanded the table into a full sorted array, each…
Difficulty: 🔴 Hard
Topics: Database
Solution
LeetCode 571 - Find Median Given Frequency of Numbers
Problem Understanding
This problem provides a compressed representation of a dataset instead of listing every number individually. The table Numbers contains two columns:
num, the actual numberfrequency, how many times that number appears
If we expanded the table into a full sorted array, each num would appear exactly frequency times.
The task is to compute the median of this decompressed dataset.
The median depends on whether the total number of elements is odd or even:
- If the total count is odd, the median is the middle element.
- If the total count is even, the median is the average of the two middle elements.
The final answer must be rounded to one decimal place.
For example, if the decompressed array is:
[0, 0, 0, 1, 2]
the median is 0 because the middle element is the third value.
If the decompressed array is:
[1, 2, 3, 4]
the median is (2 + 3) / 2 = 2.5.
An important detail is that we are never asked to actually decompress the data. The frequencies may be large, so expanding the dataset explicitly could be extremely inefficient. Instead, we should reason about positions inside the sorted order.
The table guarantees that num is unique, which means each value appears in exactly one row. Since median depends only on ordering and counts, this allows us to process the rows in sorted order and determine where the middle positions fall.
Several edge cases are important:
- The total number of elements may be odd or even.
- Multiple rows may contribute to the middle positions.
- Both middle positions may fall inside the same frequency block.
- Frequencies can be very large, making decompression impractical.
- The dataset may contain negative values.
- The table may contain only one distinct number.
A correct solution must identify the middle position or positions directly from cumulative frequencies.
Approaches
Brute Force Approach
The most straightforward solution is to explicitly decompress the dataset.
For every row in the table:
- Repeat
numexactlyfrequencytimes. - Append those values into an array.
- Sort the array if necessary.
- Compute the median using standard array indexing.
This approach is correct because it reconstructs the exact dataset described by the frequency table. Once we have the full sorted array, computing the median becomes trivial.
However, this approach is inefficient when frequencies are large. If a number appears millions of times, the decompressed array becomes enormous. The memory usage grows proportionally to the total number of elements, not the number of rows.
Because of this, the brute force approach is not practical for large datasets.
Optimal Approach
The key observation is that the median depends only on positions in sorted order, not on the explicit decompressed array.
Suppose the total number of elements is N.
The median positions are:
- If
Nis odd, position(N + 1) / 2 - If
Nis even, positionsN / 2andN / 2 + 1
Instead of expanding the array, we can compute cumulative frequencies.
For each row in ascending order of num:
- Track how many elements we have processed so far.
- Determine the range of positions covered by the current number.
- Check whether the target median positions fall inside this range.
This works because each row represents a contiguous block in the sorted decompressed dataset.
For example:
| num | frequency | Covered Positions |
|---|---|---|
| 0 | 7 | 1 to 7 |
| 1 | 1 | 8 to 8 |
| 2 | 3 | 9 to 11 |
If the median position is 6, it clearly belongs to the block for 0.
This avoids decompression entirely and requires only cumulative counting.
Approach Comparison
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(T) | O(T) | Explicitly decompresses all values, where T is total frequency |
| Optimal | O(N log N) | O(1) | Uses cumulative frequencies without decompression |
Here:
Nis the number of distinct rowsTis the total decompressed size
The O(N log N) factor comes from sorting by num. In SQL, this corresponds to ordering the rows.
Algorithm Walkthrough
- Compute the total number of elements by summing all frequencies.
This tells us whether the dataset size is odd or even and determines the target median positions. 2. Determine the median index positions.
If the total count is odd, both target positions are the same middle index.
If the total count is even, we need the two middle indices.
Using 1-based indexing:
left_pos = (total_count + 1) // 2right_pos = (total_count + 2) // 2
This formula works uniformly for both odd and even totals.
3. Process rows in ascending order of num.
Since the decompressed array would be sorted, we must examine values in numeric order. 4. Maintain a cumulative frequency counter.
Suppose the cumulative count before processing the current row is prev.
After adding the current frequency, the current number covers positions:
prev + 1 through cumulative
- Check whether the target positions fall inside the current range.
If either median position lies within this interval, then the current num contributes to the median.
6. Collect the median values.
There can be one or two contributing values depending on whether the total count is odd or even. 7. Compute the final median.
Average the collected values and round to one decimal place.
Why it works
The algorithm works because cumulative frequencies exactly represent contiguous ranges in the sorted decompressed dataset.
Every number occupies a specific interval of positions. The median depends only on the middle position or positions, so once we identify which interval contains those positions, we know the corresponding median values.
Since every decompressed element belongs to exactly one cumulative interval, the algorithm always identifies the correct median without explicitly constructing the full array.
Python Solution
SELECT ROUND(AVG(num), 1) AS median
FROM (
SELECT
num,
SUM(frequency) OVER (ORDER BY num) AS cumulative_frequency,
SUM(frequency) OVER () AS total_frequency
FROM Numbers
) AS ordered_numbers
WHERE
total_frequency / 2.0 BETWEEN
cumulative_frequency - frequency
AND cumulative_frequency;
Implementation Explanation
The solution uses SQL window functions to compute cumulative statistics efficiently.
The inner query processes the rows in ascending order of num.
SUM(frequency) OVER (ORDER BY num)
computes the cumulative frequency. This tells us the ending position range for each number in the decompressed array.
SUM(frequency) OVER ()
computes the total number of decompressed elements.
For every row:
cumulative_frequency - frequency + 1is the starting positioncumulative_frequencyis the ending position
The median positions are centered around total_frequency / 2.0.
The WHERE clause selects all rows whose frequency range overlaps the median position or positions.
Finally:
AVG(num)
handles both cases automatically:
- Odd total count, one matching row
- Even total count, possibly two matching rows
The result is rounded to one decimal place using ROUND.
Go Solution
package main
import "fmt"
type Number struct {
Num int
Frequency int
}
func findMedian(numbers []Number) float64 {
totalFrequency := 0
for _, number := range numbers {
totalFrequency += number.Frequency
}
leftPos := (totalFrequency + 1) / 2
rightPos := (totalFrequency + 2) / 2
cumulative := 0
medianValues := []int{}
for _, number := range numbers {
start := cumulative + 1
cumulative += number.Frequency
end := cumulative
if leftPos >= start && leftPos <= end {
medianValues = append(medianValues, number.Num)
}
if rightPos >= start && rightPos <= end {
if len(medianValues) == 0 || medianValues[len(medianValues)-1] != number.Num {
medianValues = append(medianValues, number.Num)
}
}
}
sum := 0
for _, value := range medianValues {
sum += value
}
return float64(sum) / float64(len(medianValues))
}
func main() {
numbers := []Number{
{Num: 0, Frequency: 7},
{Num: 1, Frequency: 1},
{Num: 2, Frequency: 3},
{Num: 3, Frequency: 1},
}
fmt.Println(findMedian(numbers))
}
Go Implementation Notes
The Go version demonstrates the same cumulative frequency logic procedurally.
Unlike the SQL solution, Go requires explicit iteration and state management.
The implementation assumes the input slice is already sorted by Num. In the actual SQL problem, ordering is handled directly by the database engine.
The median positions are computed using integer arithmetic, which avoids floating point precision issues during index calculation.
The final result is converted to float64 before division to ensure correct decimal output.
Worked Examples
Example 1
Input table:
| num | frequency |
|---|---|
| 0 | 7 |
| 1 | 1 |
| 2 | 3 |
| 3 | 1 |
Step 1: Compute Total Frequency
7 + 1 + 3 + 1 = 12
The total count is even.
Median positions:
left_pos = 6
right_pos = 7
Step 2: Process Cumulative Frequencies
| num | frequency | Position Range | Contains 6? | Contains 7? |
|---|---|---|---|---|
| 0 | 7 | 1 to 7 | Yes | Yes |
| 1 | 1 | 8 to 8 | No | No |
| 2 | 3 | 9 to 11 | No | No |
| 3 | 1 | 12 to 12 | No | No |
Both median positions belong to 0.
Median:
(0 + 0) / 2 = 0.0
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(N log N) | Rows are processed in sorted order |
| Space | O(1) | Only cumulative counters are stored |
The algorithm never constructs the decompressed dataset. Instead, it processes only the compressed representation.
This dramatically reduces memory usage when frequencies are large.
The dominant cost is sorting by num, which databases typically implement efficiently.
Test Cases
def median_from_frequencies(numbers):
total = sum(freq for _, freq in numbers)
left_pos = (total + 1) // 2
right_pos = (total + 2) // 2
cumulative = 0
values = []
for num, freq in sorted(numbers):
start = cumulative + 1
cumulative += freq
end = cumulative
if start <= left_pos <= end:
values.append(num)
if start <= right_pos <= end:
values.append(num)
return sum(values) / 2
# Provided example
assert median_from_frequencies([
(0, 7),
(1, 1),
(2, 3),
(3, 1)
]) == 0.0 # even-sized dataset with identical middle values
# Odd total count
assert median_from_frequencies([
(1, 1),
(2, 1),
(3, 1)
]) == 2.0 # single middle element
# Even total count
assert median_from_frequencies([
(1, 1),
(2, 1),
(3, 1),
(4, 1)
]) == 2.5 # average of two middle elements
# Single distinct number
assert median_from_frequencies([
(5, 100)
]) == 5.0 # all values identical
# Large frequency block
assert median_from_frequencies([
(1, 1000000),
(2, 1)
]) == 1.0 # avoids decompression
# Negative numbers
assert median_from_frequencies([
(-5, 2),
(0, 1),
(10, 2)
]) == 0.0 # median with negative values
# Median spans two blocks
assert median_from_frequencies([
(1, 2),
(2, 2)
]) == 1.5 # middle values come from different ranges
Test Summary
| Test | Why |
|---|---|
| Example input | Validates the provided sample |
| Odd total count | Ensures single middle element logic works |
| Even total count | Ensures averaging logic works |
| Single distinct number | Tests degenerate case |
| Large frequency block | Confirms no decompression is needed |
| Negative numbers | Ensures ordering works correctly |
| Median spans two blocks | Tests middle positions across ranges |
Edge Cases
Even Number of Elements
When the decompressed dataset contains an even number of elements, the median is the average of two middle values instead of a single value.
This is a common source of off-by-one errors because the algorithm must identify two separate positions.
The implementation handles this by computing both:
left_pos = (N + 1) // 2
right_pos = (N + 2) // 2
For odd counts, these positions become identical automatically.
Large Frequencies
A naive implementation may attempt to explicitly expand the dataset according to frequency counts.
If a row has frequency in the millions, this becomes extremely memory-intensive and slow.
The optimal solution avoids this entirely by using cumulative frequencies, which keeps memory usage constant regardless of decompressed size.
Median Inside One Frequency Block
Sometimes both middle positions fall inside the same frequency range.
For example:
num = 5, frequency = 100
may contain both median positions.
An incorrect implementation might accidentally count the same value twice incorrectly or fail to average properly.
The algorithm handles this naturally because averaging identical values still produces the correct median.