LeetCode 2738 - Count Occurrences in Text

The problem asks us to analyze a database table named Files that contains two columns: filename and content. Each row corresponds to a unique file and its textual content.

LeetCode Problem 2738

Difficulty: 🟡 Medium
Topics: Database

Solution

Problem Understanding

The problem asks us to analyze a database table named Files that contains two columns: file_name and content. Each row corresponds to a unique file and its textual content. The task is to count the number of files that contain the words "bull" and "bear" as standalone words, meaning the words must be separated by spaces, punctuation, or line boundaries, and should not appear as substrings inside other words (e.g., "bullet" or "bears" do not count).

The expected output is a table with two columns: word and count, where count represents the number of files containing at least one occurrence of the word. The output should include both "bull" and "bear" in any order.

The constraints imply that the content can contain multiple lines, punctuation, and mixed capitalization, although the examples assume lowercase words. Edge cases include empty content, multiple occurrences of the words in the same file, words at the start or end of the content, and words adjacent to punctuation.

Approaches

A brute-force approach would involve fetching all file contents, splitting each file's content into words, and manually checking if "bull" or "bear" exist in the resulting list. While this works logically, it may be inefficient for large datasets since it requires scanning every word of every file in the application layer.

A more optimal approach leverages SQL string matching with regular expressions to check for whole word occurrences directly in the database. By using SQL functions like REGEXP (or RLIKE in some dialects), we can search for \bbull\b and \bbear\b patterns, where \b denotes a word boundary. Aggregating the results using COUNT(DISTINCT file_name) ensures that we count each file only once, regardless of multiple occurrences within the same file. This approach avoids pulling all text into memory and allows the database to optimize the search.

Approach Time Complexity Space Complexity Notes
Brute Force O(n * m) O(n) Scan each file's content in application memory, split into words, count occurrences manually.
Optimal O(n) O(1) SQL query using regex and aggregation, database engine handles text scanning efficiently.

Algorithm Walkthrough

  1. Construct a SQL query to count files where the content matches "bull" as a standalone word using a regular expression with word boundaries. Use COUNT(DISTINCT file_name) to avoid double-counting the same file.
  2. Repeat the query for "bear" using the same logic.
  3. Combine the two counts into a result set with two rows: one for "bull" and one for "bear".
  4. Return the result table.

Why it works: Regular expressions with word boundaries ensure that only standalone occurrences of "bull" and "bear" are counted. Using COUNT(DISTINCT file_name) guarantees that multiple occurrences in the same file contribute only once. Aggregating the results ensures correctness while remaining efficient.

Python Solution

For LeetCode, the Python solution uses pandas for database simulation, but in real scenarios, SQL queries are preferred. Here is a direct approach using Python:

import pandas as pd
import re
from typing import List, Dict

def count_occurrences(files: pd.DataFrame) -> List[Dict[str, int]]:
    result = []
    bull_count = files['content'].apply(lambda x: bool(re.search(r'\bbull\b', x))).sum()
    bear_count = files['content'].apply(lambda x: bool(re.search(r'\bbear\b', x))).sum()
    
    result.append({'word': 'bull', 'count': int(bull_count)})
    result.append({'word': 'bear', 'count': int(bear_count)})
    return result

This implementation uses re.search with \b to match word boundaries. It converts each match to a boolean, then sums over the DataFrame to count files containing at least one occurrence.

Go Solution

package main

import (
    "regexp"
)

type File struct {
    FileName string
    Content  string
}

type WordCount struct {
    Word  string
    Count int
}

func CountOccurrences(files []File) []WordCount {
    bullRegex := regexp.MustCompile(`\bbull\b`)
    bearRegex := regexp.MustCompile(`\bbear\b`)
    bullCount := 0
    bearCount := 0
    
    for _, file := range files {
        if bullRegex.MatchString(file.Content) {
            bullCount++
        }
        if bearRegex.MatchString(file.Content) {
            bearCount++
        }
    }
    
    return []WordCount{
        {"bull", bullCount},
        {"bear", bearCount},
    }
}

In Go, regexp.MustCompile is used for performance and safety. Each file is checked for matches, and counts are incremented only once per file, preserving correctness.

Worked Examples

Using the example:

file_name content
draft1.txt The stock exchange predicts a bull market...
draft2.txt ...bull market... awaiting a bear market.
draft3.txt ...bull market... awaiting a bear market.

Step-by-step:

  1. draft1.txt matches "bull" but not "bear".
  2. draft2.txt matches both "bull" and "bear".
  3. draft3.txt matches both "bull" and "bear".

Counts: "bull": 3, "bear": 2.

Complexity Analysis

Measure Complexity Explanation
Time O(n) Each file is scanned once. Regex matching is linear in content length.
Space O(1) Only counters are stored; no extra memory proportional to input size.

Even though regex scanning is linear in file content length, this is unavoidable for correctness.

Test Cases

import pandas as pd

# provided example
files = pd.DataFrame({
    'file_name': ['draft1.txt','draft2.txt','draft3.txt'],
    'content': [
        "The stock exchange predicts a bull market which would make many investors happy.",
        "The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.",
        "The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices."
    ]
})
assert count_occurrences(files) == [{'word':'bull','count':3},{'word':'bear','count':2}]  # example

# empty content
files_empty = pd.DataFrame({'file_name':['a.txt'], 'content':['']})
assert count_occurrences(files_empty) == [{'word':'bull','count':0},{'word':'bear','count':0}]

# multiple occurrences in same file
files_multi = pd.DataFrame({'file_name':['b.txt'], 'content':['bull bull bear bull bear']})
assert count_occurrences(files_multi) == [{'word':'bull','count':1},{'word':'bear','count':1}]

# word as substring
files_sub = pd.DataFrame({'file_name':['c.txt'], 'content':['bullet bears bullpen']})
assert count_occurrences(files_sub) == [{'word':'bull','count':0},{'word':'bear','count':0}]
Test Why
Example from problem Validates normal operation with multiple files and words.
Empty content Ensures no false positives with empty strings.
Multiple occurrences in one file Confirms files are counted once regardless of frequency.
Words as substrings Confirms correct matching only for standalone words.

Edge Cases

Empty Files: Files with empty content should not count. Our regex handles this correctly as re.search will return None and the boolean conversion results in False.

Punctuation-adjacent Words: Words like "bull." or "bear," are not counted because the regex requires word boundaries on both sides, correctly ignoring these cases.

Multiple Occurrences in One File: A file with "bull bull bull" should only increment the count once. Using bool(re.search(...)) ensures that multiple matches do not overcount.

This design ensures correctness under realistic input scenarios and avoids pitfalls related to substrings, punctuation, and repeated words.