LeetCode 1093 - Statistics from a Large Sample

This problem asks us to compute several descriptive statistics-minimum, maximum, mean, median, and mode-from a very large sample of integers ranging from 0 to 255.

LeetCode Problem 1093

Difficulty: 🟡 Medium
Topics: Array, Math, Probability and Statistics

Solution

Problem Understanding

This problem asks us to compute several descriptive statistics-minimum, maximum, mean, median, and mode-from a very large sample of integers ranging from 0 to 255. The key challenge is that the sample size can be up to 1 billion elements, so it is not feasible to explicitly generate the full array. Instead, we are provided with a count array of length 256, where count[k] represents the number of times the integer k appears in the sample.

The expected output is a list of floating-point numbers representing these five statistics. The problem guarantees that the mode is unique, which simplifies that calculation. The mean and median require careful handling because of the potential size of the data and the need to avoid constructing the entire array explicitly.

Constraints tell us that count has exactly 256 elements, all non-negative, and at least one element in the total sample. This ensures that minimum and maximum are always defined, and we will not encounter division by zero when calculating the mean. Edge cases to watch include: a sample where only one value exists, very skewed distributions, and even vs odd total counts affecting the median.

Approaches

A brute-force approach would be to expand the count array into a full list of elements by repeating each value k exactly count[k] times, and then sort the resulting array. Once sorted, computing minimum, maximum, mean, median, and mode is straightforward. This approach works correctly in principle, but it is completely infeasible for large counts, since the sample could reach 10^9 elements.

The optimal approach leverages the fact that the values are bounded between 0 and 255 and count already represents the frequency of each value. We can compute the minimum, maximum, mean, and mode directly by iterating over count. For the median, we use the prefix sum of counts to locate the middle element(s) without constructing the full array. This approach is linear in the size of count (256) rather than the size of the sample.

Approach Time Complexity Space Complexity Notes
Brute Force O(N log N), where N = sum(count) O(N) Expands the full array and sorts it; infeasible for large N
Optimal O(256) = O(1) O(1) Iterates over count array; computes statistics using frequencies and prefix sums

Algorithm Walkthrough

  1. Initialize variables for minimum, maximum, sum, total_count, mode, and max_count. minimum and maximum are initially undefined. sum and total_count track the running sum and number of elements. max_count tracks the highest frequency for the mode.
  2. Iterate over count with index i from 0 to 255. If count[i] > 0, update:
  • minimum if it is not yet set
  • maximum as i (since we are iterating in increasing order)
  • sum by adding i * count[i]
  • total_count by adding count[i]
  • mode if count[i] > max_count
  1. Calculate mean as sum / total_count.
  2. To compute the median, first determine the positions of the middle element(s):
  • If total_count is odd, the median is at position (total_count + 1) // 2
  • If total_count is even, the median is the average of the elements at positions total_count // 2 and (total_count // 2) + 1
  1. Use a prefix sum approach to locate the median:
  • Iterate over the count array, keeping a running total running_count. When running_count crosses the median positions, record the corresponding i values.
  1. Return the statistics [minimum, maximum, mean, median, mode] as floating-point numbers.

Why it works: By iterating over the frequency array rather than the expanded sample, we ensure correct statistics while handling extremely large samples efficiently. Prefix sums allow us to identify median positions without constructing the full array.

Python Solution

from typing import List

class Solution:
    def sampleStats(self, count: List[int]) -> List[float]:
        minimum = None
        maximum = None
        total_sum = 0
        total_count = 0
        mode = 0
        max_count = 0

        for i in range(256):
            if count[i] > 0:
                if minimum is None:
                    minimum = i
                maximum = i
                total_sum += i * count[i]
                total_count += count[i]
                if count[i] > max_count:
                    max_count = count[i]
                    mode = i

        mean = total_sum / total_count

        # Median calculation
        mid1 = None
        mid2 = None
        running_count = 0
        if total_count % 2 == 1:
            median_pos = total_count // 2 + 1
        else:
            median_pos = total_count // 2

        i = 0
        current_count = 0
        med1 = med2 = None
        for i in range(256):
            current_count += count[i]
            if med1 is None and current_count >= (total_count + 1) // 2:
                med1 = i
            if med2 is None and current_count >= (total_count // 2 + 1):
                med2 = i
            if med1 is not None and med2 is not None:
                break

        median = (med1 + med2) / 2

        return [float(minimum), float(maximum), float(mean), float(median), float(mode)]

The Python code follows the algorithm closely: it iterates once over the count array to calculate minimum, maximum, sum, mode, and total_count. Then it uses a second pass to locate the median via prefix sums. Division ensures the mean and median are floating-point numbers.

Go Solution

func sampleStats(count []int) []float64 {
    var minimum, maximum, mode int
    totalSum := 0
    totalCount := 0
    maxCount := 0
    minSet := false

    for i := 0; i < 256; i++ {
        if count[i] > 0 {
            if !minSet {
                minimum = i
                minSet = true
            }
            maximum = i
            totalSum += i * count[i]
            totalCount += count[i]
            if count[i] > maxCount {
                maxCount = count[i]
                mode = i
            }
        }
    }

    mean := float64(totalSum) / float64(totalCount)

    runningCount := 0
    med1, med2 := -1, -1
    for i := 0; i < 256; i++ {
        runningCount += count[i]
        if med1 == -1 && runningCount >= (totalCount+1)/2 {
            med1 = i
        }
        if med2 == -1 && runningCount >= (totalCount/2 + 1) {
            med2 = i
        }
        if med1 != -1 && med2 != -1 {
            break
        }
    }

    median := float64(med1+med2) / 2.0

    return []float64{float64(minimum), float64(maximum), mean, median, float64(mode)}
}

The Go solution mirrors the Python logic, using integers for counts and sum and converting to float64 only when returning the statistics. Go-specific considerations include explicit initialization of variables and careful integer division for median positions.

Worked Examples

Example 1: count = [0,1,3,4,0,...]

Sample = [1,2,2,2,3,3,3,3]

Step minimum maximum total_sum total_count running_count median
Initial None None 0 0 0 -
i=1 1 1 1 1 1 -
i=2 1 2 7 4 4 med1=2, med2=2
i=3 1 3 19 8 8 median = (2+3)/2=2.5

Output: [1.0, 3.0, 2.375, 2.5, 3.0]

Example 2: count = [0,4,3,2,2,...]

Sample = [1,1,1,1,2,2,2,3,3,4,4]

Step minimum maximum total_sum total_count running_count median
Initial None None 0 0 0 -
i=1 1 1 4 4 4 med1=1, med2=1
i=2 1 2 10 7 7 med2=2