Difficulty: 🔴 Hard
Topics: Database

Solution

LeetCode 571 - Find Median Given Frequency of Numbers

Problem Understanding

This problem provides a compressed representation of a dataset instead of listing every number individually. The table Numbers contains two columns:

num, the actual number
frequency, how many times that number appears

If we expanded the table into a full sorted array, each num would appear exactly frequency times.

The task is to compute the median of this decompressed dataset.

The median depends on whether the total number of elements is odd or even:

If the total count is odd, the median is the middle element.
If the total count is even, the median is the average of the two middle elements.

The final answer must be rounded to one decimal place.

For example, if the decompressed array is:

[0, 0, 0, 1, 2]

the median is 0 because the middle element is the third value.

If the decompressed array is:

[1, 2, 3, 4]

the median is (2 + 3) / 2 = 2.5.

An important detail is that we are never asked to actually decompress the data. The frequencies may be large, so expanding the dataset explicitly could be extremely inefficient. Instead, we should reason about positions inside the sorted order.

The table guarantees that num is unique, which means each value appears in exactly one row. Since median depends only on ordering and counts, this allows us to process the rows in sorted order and determine where the middle positions fall.

Several edge cases are important:

The total number of elements may be odd or even.
Multiple rows may contribute to the middle positions.
Both middle positions may fall inside the same frequency block.
Frequencies can be very large, making decompression impractical.
The dataset may contain negative values.
The table may contain only one distinct number.

A correct solution must identify the middle position or positions directly from cumulative frequencies.

Approaches

Brute Force Approach

The most straightforward solution is to explicitly decompress the dataset.

For every row in the table:

Repeat num exactly frequency times.
Append those values into an array.
Sort the array if necessary.
Compute the median using standard array indexing.

This approach is correct because it reconstructs the exact dataset described by the frequency table. Once we have the full sorted array, computing the median becomes trivial.

However, this approach is inefficient when frequencies are large. If a number appears millions of times, the decompressed array becomes enormous. The memory usage grows proportionally to the total number of elements, not the number of rows.

Because of this, the brute force approach is not practical for large datasets.

Optimal Approach

The key observation is that the median depends only on positions in sorted order, not on the explicit decompressed array.

Suppose the total number of elements is N.

The median positions are:

If N is odd, position (N + 1) / 2
If N is even, positions N / 2 and N / 2 + 1

Instead of expanding the array, we can compute cumulative frequencies.

For each row in ascending order of num:

Track how many elements we have processed so far.
Determine the range of positions covered by the current number.
Check whether the target median positions fall inside this range.

This works because each row represents a contiguous block in the sorted decompressed dataset.

For example:

num	frequency	Covered Positions
0	7	1 to 7
1	1	8 to 8
2	3	9 to 11

If the median position is 6, it clearly belongs to the block for 0.

This avoids decompression entirely and requires only cumulative counting.

Approach Comparison

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(T)	O(T)	Explicitly decompresses all values, where T is total frequency
Optimal	O(N log N)	O(1)	Uses cumulative frequencies without decompression

Here:

N is the number of distinct rows
T is the total decompressed size

The O(N log N) factor comes from sorting by num. In SQL, this corresponds to ordering the rows.

Algorithm Walkthrough

Compute the total number of elements by summing all frequencies.

This tells us whether the dataset size is odd or even and determines the target median positions. 2. Determine the median index positions.

If the total count is odd, both target positions are the same middle index.

If the total count is even, we need the two middle indices.

Using 1-based indexing:

left_pos = (total_count + 1) // 2
right_pos = (total_count + 2) // 2

This formula works uniformly for both odd and even totals. 3. Process rows in ascending order of num.

Since the decompressed array would be sorted, we must examine values in numeric order. 4. Maintain a cumulative frequency counter.

Suppose the cumulative count before processing the current row is prev.

After adding the current frequency, the current number covers positions:

prev + 1 through cumulative

Check whether the target positions fall inside the current range.

If either median position lies within this interval, then the current num contributes to the median. 6. Collect the median values.

There can be one or two contributing values depending on whether the total count is odd or even. 7. Compute the final median.

Average the collected values and round to one decimal place.

Why it works

The algorithm works because cumulative frequencies exactly represent contiguous ranges in the sorted decompressed dataset.

Every number occupies a specific interval of positions. The median depends only on the middle position or positions, so once we identify which interval contains those positions, we know the corresponding median values.

Since every decompressed element belongs to exactly one cumulative interval, the algorithm always identifies the correct median without explicitly constructing the full array.

Python Solution

SELECT ROUND(AVG(num), 1) AS median
FROM (
    SELECT
        num,
        SUM(frequency) OVER (ORDER BY num) AS cumulative_frequency,
        SUM(frequency) OVER () AS total_frequency
    FROM Numbers
) AS ordered_numbers
WHERE
    total_frequency / 2.0 BETWEEN
        cumulative_frequency - frequency
        AND cumulative_frequency;

Implementation Explanation

The solution uses SQL window functions to compute cumulative statistics efficiently.

The inner query processes the rows in ascending order of num.

SUM(frequency) OVER (ORDER BY num)

computes the cumulative frequency. This tells us the ending position range for each number in the decompressed array.

SUM(frequency) OVER ()

computes the total number of decompressed elements.

For every row:

cumulative_frequency - frequency + 1 is the starting position
cumulative_frequency is the ending position

The median positions are centered around total_frequency / 2.0.

The WHERE clause selects all rows whose frequency range overlaps the median position or positions.

Finally:

AVG(num)

handles both cases automatically:

Odd total count, one matching row
Even total count, possibly two matching rows

The result is rounded to one decimal place using ROUND.

Go Solution

package main

import "fmt"

type Number struct {
	Num       int
	Frequency int
}

func findMedian(numbers []Number) float64 {
	totalFrequency := 0

	for _, number := range numbers {
		totalFrequency += number.Frequency
	}

	leftPos := (totalFrequency + 1) / 2
	rightPos := (totalFrequency + 2) / 2

	cumulative := 0
	medianValues := []int{}

	for _, number := range numbers {
		start := cumulative + 1
		cumulative += number.Frequency
		end := cumulative

		if leftPos >= start && leftPos <= end {
			medianValues = append(medianValues, number.Num)
		}

		if rightPos >= start && rightPos <= end {
			if len(medianValues) == 0 || medianValues[len(medianValues)-1] != number.Num {
				medianValues = append(medianValues, number.Num)
			}
		}
	}

	sum := 0

	for _, value := range medianValues {
		sum += value
	}

	return float64(sum) / float64(len(medianValues))
}

func main() {
	numbers := []Number{
		{Num: 0, Frequency: 7},
		{Num: 1, Frequency: 1},
		{Num: 2, Frequency: 3},
		{Num: 3, Frequency: 1},
	}

	fmt.Println(findMedian(numbers))
}

Go Implementation Notes

The Go version demonstrates the same cumulative frequency logic procedurally.

Unlike the SQL solution, Go requires explicit iteration and state management.

The implementation assumes the input slice is already sorted by Num. In the actual SQL problem, ordering is handled directly by the database engine.

The median positions are computed using integer arithmetic, which avoids floating point precision issues during index calculation.

The final result is converted to float64 before division to ensure correct decimal output.

Worked Examples

Example 1

Input table:

num	frequency
0	7
1	1
2	3
3	1

Step 1: Compute Total Frequency

7 + 1 + 3 + 1 = 12

The total count is even.

Median positions:

left_pos = 6
right_pos = 7

Step 2: Process Cumulative Frequencies

num	frequency	Position Range	Contains 6?	Contains 7?
0	7	1 to 7	Yes	Yes
1	1	8 to 8	No	No
2	3	9 to 11	No	No
3	1	12 to 12	No	No

Both median positions belong to 0.

Median:

(0 + 0) / 2 = 0.0

Complexity Analysis

Measure	Complexity	Explanation
Time	O(N log N)	Rows are processed in sorted order
Space	O(1)	Only cumulative counters are stored

The algorithm never constructs the decompressed dataset. Instead, it processes only the compressed representation.

This dramatically reduces memory usage when frequencies are large.

The dominant cost is sorting by num, which databases typically implement efficiently.

Test Cases

def median_from_frequencies(numbers):
    total = sum(freq for _, freq in numbers)

    left_pos = (total + 1) // 2
    right_pos = (total + 2) // 2

    cumulative = 0
    values = []

    for num, freq in sorted(numbers):
        start = cumulative + 1
        cumulative += freq
        end = cumulative

        if start <= left_pos <= end:
            values.append(num)

        if start <= right_pos <= end:
            values.append(num)

    return sum(values) / 2

# Provided example
assert median_from_frequencies([
    (0, 7),
    (1, 1),
    (2, 3),
    (3, 1)
]) == 0.0  # even-sized dataset with identical middle values

# Odd total count
assert median_from_frequencies([
    (1, 1),
    (2, 1),
    (3, 1)
]) == 2.0  # single middle element

# Even total count
assert median_from_frequencies([
    (1, 1),
    (2, 1),
    (3, 1),
    (4, 1)
]) == 2.5  # average of two middle elements

# Single distinct number
assert median_from_frequencies([
    (5, 100)
]) == 5.0  # all values identical

# Large frequency block
assert median_from_frequencies([
    (1, 1000000),
    (2, 1)
]) == 1.0  # avoids decompression

# Negative numbers
assert median_from_frequencies([
    (-5, 2),
    (0, 1),
    (10, 2)
]) == 0.0  # median with negative values

# Median spans two blocks
assert median_from_frequencies([
    (1, 2),
    (2, 2)
]) == 1.5  # middle values come from different ranges

Test Summary

Test	Why
Example input	Validates the provided sample
Odd total count	Ensures single middle element logic works
Even total count	Ensures averaging logic works
Single distinct number	Tests degenerate case
Large frequency block	Confirms no decompression is needed
Negative numbers	Ensures ordering works correctly
Median spans two blocks	Tests middle positions across ranges

Edge Cases

Even Number of Elements

When the decompressed dataset contains an even number of elements, the median is the average of two middle values instead of a single value.

This is a common source of off-by-one errors because the algorithm must identify two separate positions.

The implementation handles this by computing both:

left_pos = (N + 1) // 2
right_pos = (N + 2) // 2

For odd counts, these positions become identical automatically.

Large Frequencies

A naive implementation may attempt to explicitly expand the dataset according to frequency counts.

If a row has frequency in the millions, this becomes extremely memory-intensive and slow.

The optimal solution avoids this entirely by using cumulative frequencies, which keeps memory usage constant regardless of decompressed size.

Median Inside One Frequency Block

Sometimes both middle positions fall inside the same frequency range.

For example:

num = 5, frequency = 100

may contain both median positions.

An incorrect implementation might accidentally count the same value twice incorrectly or fail to average properly.

The algorithm handles this naturally because averaging identical values still produces the correct median.