LeetCode 2738 - Count Occurrences in Text
The problem asks us to analyze a database table named Files that contains two columns: filename and content. Each row corresponds to a unique file and its textual content.
Difficulty: 🟡 Medium
Topics: Database
Solution
Problem Understanding
The problem asks us to analyze a database table named Files that contains two columns: file_name and content. Each row corresponds to a unique file and its textual content. The task is to count the number of files that contain the words "bull" and "bear" as standalone words, meaning the words must be separated by spaces, punctuation, or line boundaries, and should not appear as substrings inside other words (e.g., "bullet" or "bears" do not count).
The expected output is a table with two columns: word and count, where count represents the number of files containing at least one occurrence of the word. The output should include both "bull" and "bear" in any order.
The constraints imply that the content can contain multiple lines, punctuation, and mixed capitalization, although the examples assume lowercase words. Edge cases include empty content, multiple occurrences of the words in the same file, words at the start or end of the content, and words adjacent to punctuation.
Approaches
A brute-force approach would involve fetching all file contents, splitting each file's content into words, and manually checking if "bull" or "bear" exist in the resulting list. While this works logically, it may be inefficient for large datasets since it requires scanning every word of every file in the application layer.
A more optimal approach leverages SQL string matching with regular expressions to check for whole word occurrences directly in the database. By using SQL functions like REGEXP (or RLIKE in some dialects), we can search for \bbull\b and \bbear\b patterns, where \b denotes a word boundary. Aggregating the results using COUNT(DISTINCT file_name) ensures that we count each file only once, regardless of multiple occurrences within the same file. This approach avoids pulling all text into memory and allows the database to optimize the search.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n * m) | O(n) | Scan each file's content in application memory, split into words, count occurrences manually. |
| Optimal | O(n) | O(1) | SQL query using regex and aggregation, database engine handles text scanning efficiently. |
Algorithm Walkthrough
- Construct a SQL query to count files where the content matches
"bull"as a standalone word using a regular expression with word boundaries. UseCOUNT(DISTINCT file_name)to avoid double-counting the same file. - Repeat the query for
"bear"using the same logic. - Combine the two counts into a result set with two rows: one for
"bull"and one for"bear". - Return the result table.
Why it works: Regular expressions with word boundaries ensure that only standalone occurrences of "bull" and "bear" are counted. Using COUNT(DISTINCT file_name) guarantees that multiple occurrences in the same file contribute only once. Aggregating the results ensures correctness while remaining efficient.
Python Solution
For LeetCode, the Python solution uses pandas for database simulation, but in real scenarios, SQL queries are preferred. Here is a direct approach using Python:
import pandas as pd
import re
from typing import List, Dict
def count_occurrences(files: pd.DataFrame) -> List[Dict[str, int]]:
result = []
bull_count = files['content'].apply(lambda x: bool(re.search(r'\bbull\b', x))).sum()
bear_count = files['content'].apply(lambda x: bool(re.search(r'\bbear\b', x))).sum()
result.append({'word': 'bull', 'count': int(bull_count)})
result.append({'word': 'bear', 'count': int(bear_count)})
return result
This implementation uses re.search with \b to match word boundaries. It converts each match to a boolean, then sums over the DataFrame to count files containing at least one occurrence.
Go Solution
package main
import (
"regexp"
)
type File struct {
FileName string
Content string
}
type WordCount struct {
Word string
Count int
}
func CountOccurrences(files []File) []WordCount {
bullRegex := regexp.MustCompile(`\bbull\b`)
bearRegex := regexp.MustCompile(`\bbear\b`)
bullCount := 0
bearCount := 0
for _, file := range files {
if bullRegex.MatchString(file.Content) {
bullCount++
}
if bearRegex.MatchString(file.Content) {
bearCount++
}
}
return []WordCount{
{"bull", bullCount},
{"bear", bearCount},
}
}
In Go, regexp.MustCompile is used for performance and safety. Each file is checked for matches, and counts are incremented only once per file, preserving correctness.
Worked Examples
Using the example:
| file_name | content |
|---|---|
| draft1.txt | The stock exchange predicts a bull market... |
| draft2.txt | ...bull market... awaiting a bear market. |
| draft3.txt | ...bull market... awaiting a bear market. |
Step-by-step:
draft1.txtmatches"bull"but not"bear".draft2.txtmatches both"bull"and"bear".draft3.txtmatches both"bull"and"bear".
Counts: "bull": 3, "bear": 2.
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n) | Each file is scanned once. Regex matching is linear in content length. |
| Space | O(1) | Only counters are stored; no extra memory proportional to input size. |
Even though regex scanning is linear in file content length, this is unavoidable for correctness.
Test Cases
import pandas as pd
# provided example
files = pd.DataFrame({
'file_name': ['draft1.txt','draft2.txt','draft3.txt'],
'content': [
"The stock exchange predicts a bull market which would make many investors happy.",
"The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.",
"The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices."
]
})
assert count_occurrences(files) == [{'word':'bull','count':3},{'word':'bear','count':2}] # example
# empty content
files_empty = pd.DataFrame({'file_name':['a.txt'], 'content':['']})
assert count_occurrences(files_empty) == [{'word':'bull','count':0},{'word':'bear','count':0}]
# multiple occurrences in same file
files_multi = pd.DataFrame({'file_name':['b.txt'], 'content':['bull bull bear bull bear']})
assert count_occurrences(files_multi) == [{'word':'bull','count':1},{'word':'bear','count':1}]
# word as substring
files_sub = pd.DataFrame({'file_name':['c.txt'], 'content':['bullet bears bullpen']})
assert count_occurrences(files_sub) == [{'word':'bull','count':0},{'word':'bear','count':0}]
| Test | Why |
|---|---|
| Example from problem | Validates normal operation with multiple files and words. |
| Empty content | Ensures no false positives with empty strings. |
| Multiple occurrences in one file | Confirms files are counted once regardless of frequency. |
| Words as substrings | Confirms correct matching only for standalone words. |
Edge Cases
Empty Files: Files with empty content should not count. Our regex handles this correctly as re.search will return None and the boolean conversion results in False.
Punctuation-adjacent Words: Words like "bull." or "bear," are not counted because the regex requires word boundaries on both sides, correctly ignoring these cases.
Multiple Occurrences in One File: A file with "bull bull bull" should only increment the count once. Using bool(re.search(...)) ensures that multiple matches do not overcount.
This design ensures correctness under realistic input scenarios and avoids pitfalls related to substrings, punctuation, and repeated words.