LeetCode 1093 - Statistics from a Large Sample
This problem asks us to compute several descriptive statistics-minimum, maximum, mean, median, and mode-from a very large sample of integers ranging from 0 to 255.
Difficulty: 🟡 Medium
Topics: Array, Math, Probability and Statistics
Solution
Problem Understanding
This problem asks us to compute several descriptive statistics-minimum, maximum, mean, median, and mode-from a very large sample of integers ranging from 0 to 255. The key challenge is that the sample size can be up to 1 billion elements, so it is not feasible to explicitly generate the full array. Instead, we are provided with a count array of length 256, where count[k] represents the number of times the integer k appears in the sample.
The expected output is a list of floating-point numbers representing these five statistics. The problem guarantees that the mode is unique, which simplifies that calculation. The mean and median require careful handling because of the potential size of the data and the need to avoid constructing the entire array explicitly.
Constraints tell us that count has exactly 256 elements, all non-negative, and at least one element in the total sample. This ensures that minimum and maximum are always defined, and we will not encounter division by zero when calculating the mean. Edge cases to watch include: a sample where only one value exists, very skewed distributions, and even vs odd total counts affecting the median.
Approaches
A brute-force approach would be to expand the count array into a full list of elements by repeating each value k exactly count[k] times, and then sort the resulting array. Once sorted, computing minimum, maximum, mean, median, and mode is straightforward. This approach works correctly in principle, but it is completely infeasible for large counts, since the sample could reach 10^9 elements.
The optimal approach leverages the fact that the values are bounded between 0 and 255 and count already represents the frequency of each value. We can compute the minimum, maximum, mean, and mode directly by iterating over count. For the median, we use the prefix sum of counts to locate the middle element(s) without constructing the full array. This approach is linear in the size of count (256) rather than the size of the sample.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(N log N), where N = sum(count) | O(N) | Expands the full array and sorts it; infeasible for large N |
| Optimal | O(256) = O(1) | O(1) | Iterates over count array; computes statistics using frequencies and prefix sums |
Algorithm Walkthrough
- Initialize variables for
minimum,maximum,sum,total_count,mode, andmax_count.minimumandmaximumare initially undefined.sumandtotal_counttrack the running sum and number of elements.max_counttracks the highest frequency for the mode. - Iterate over
countwith indexifrom 0 to 255. Ifcount[i] > 0, update:
minimumif it is not yet setmaximumasi(since we are iterating in increasing order)sumby addingi * count[i]total_countby addingcount[i]modeifcount[i] > max_count
- Calculate
meanassum / total_count. - To compute the
median, first determine the positions of the middle element(s):
- If
total_countis odd, the median is at position(total_count + 1) // 2 - If
total_countis even, the median is the average of the elements at positionstotal_count // 2and(total_count // 2) + 1
- Use a prefix sum approach to locate the median:
- Iterate over the
countarray, keeping a running totalrunning_count. Whenrunning_countcrosses the median positions, record the correspondingivalues.
- Return the statistics
[minimum, maximum, mean, median, mode]as floating-point numbers.
Why it works: By iterating over the frequency array rather than the expanded sample, we ensure correct statistics while handling extremely large samples efficiently. Prefix sums allow us to identify median positions without constructing the full array.
Python Solution
from typing import List
class Solution:
def sampleStats(self, count: List[int]) -> List[float]:
minimum = None
maximum = None
total_sum = 0
total_count = 0
mode = 0
max_count = 0
for i in range(256):
if count[i] > 0:
if minimum is None:
minimum = i
maximum = i
total_sum += i * count[i]
total_count += count[i]
if count[i] > max_count:
max_count = count[i]
mode = i
mean = total_sum / total_count
# Median calculation
mid1 = None
mid2 = None
running_count = 0
if total_count % 2 == 1:
median_pos = total_count // 2 + 1
else:
median_pos = total_count // 2
i = 0
current_count = 0
med1 = med2 = None
for i in range(256):
current_count += count[i]
if med1 is None and current_count >= (total_count + 1) // 2:
med1 = i
if med2 is None and current_count >= (total_count // 2 + 1):
med2 = i
if med1 is not None and med2 is not None:
break
median = (med1 + med2) / 2
return [float(minimum), float(maximum), float(mean), float(median), float(mode)]
The Python code follows the algorithm closely: it iterates once over the count array to calculate minimum, maximum, sum, mode, and total_count. Then it uses a second pass to locate the median via prefix sums. Division ensures the mean and median are floating-point numbers.
Go Solution
func sampleStats(count []int) []float64 {
var minimum, maximum, mode int
totalSum := 0
totalCount := 0
maxCount := 0
minSet := false
for i := 0; i < 256; i++ {
if count[i] > 0 {
if !minSet {
minimum = i
minSet = true
}
maximum = i
totalSum += i * count[i]
totalCount += count[i]
if count[i] > maxCount {
maxCount = count[i]
mode = i
}
}
}
mean := float64(totalSum) / float64(totalCount)
runningCount := 0
med1, med2 := -1, -1
for i := 0; i < 256; i++ {
runningCount += count[i]
if med1 == -1 && runningCount >= (totalCount+1)/2 {
med1 = i
}
if med2 == -1 && runningCount >= (totalCount/2 + 1) {
med2 = i
}
if med1 != -1 && med2 != -1 {
break
}
}
median := float64(med1+med2) / 2.0
return []float64{float64(minimum), float64(maximum), mean, median, float64(mode)}
}
The Go solution mirrors the Python logic, using integers for counts and sum and converting to float64 only when returning the statistics. Go-specific considerations include explicit initialization of variables and careful integer division for median positions.
Worked Examples
Example 1: count = [0,1,3,4,0,...]
Sample = [1,2,2,2,3,3,3,3]
| Step | minimum | maximum | total_sum | total_count | running_count | median |
|---|---|---|---|---|---|---|
| Initial | None | None | 0 | 0 | 0 | - |
| i=1 | 1 | 1 | 1 | 1 | 1 | - |
| i=2 | 1 | 2 | 7 | 4 | 4 | med1=2, med2=2 |
| i=3 | 1 | 3 | 19 | 8 | 8 | median = (2+3)/2=2.5 |
Output: [1.0, 3.0, 2.375, 2.5, 3.0]
Example 2: count = [0,4,3,2,2,...]
Sample = [1,1,1,1,2,2,2,3,3,4,4]
| Step | minimum | maximum | total_sum | total_count | running_count | median |
|---|---|---|---|---|---|---|
| Initial | None | None | 0 | 0 | 0 | - |
| i=1 | 1 | 1 | 4 | 4 | 4 | med1=1, med2=1 |
| i=2 | 1 | 2 | 10 | 7 | 7 | med2=2 |