LeetCode 3475 - DNA Pattern Recognition
This problem provides a database table named Samples that contains DNA sequences and their associated species. Each row represents a single biological sample, identified by a unique sampleid.
Difficulty: 🟡 Medium
Topics: Database
Solution
LeetCode 3475 - DNA Pattern Recognition
Problem Understanding
This problem provides a database table named Samples that contains DNA sequences and their associated species. Each row represents a single biological sample, identified by a unique sample_id.
The task is to examine every DNA sequence and determine whether it matches four different biological patterns. For each sample, we must generate four indicator columns:
has_startshould be1if the sequence starts with"ATG", otherwise0.has_stopshould be1if the sequence ends with one of the stop codons"TAA","TAG", or"TGA", otherwise0.has_atatshould be1if the sequence contains the substring"ATAT"anywhere within it, otherwise0.has_gggshould be1if the sequence contains at least three consecutive'G'characters, otherwise0.
The output must include the original columns (sample_id, dna_sequence, and species) along with these four computed flags. The final result must be sorted by sample_id in ascending order.
Since this is a database problem, the solution is expressed as a SQL query rather than an algorithm implemented in a traditional programming language. The key challenge is recognizing that all four requirements are string pattern matching operations that can be efficiently handled using SQL string functions and regular expressions.
An important detail is that the sequence may contain longer runs of 'G' characters such as "GGGG" or "GGGGG". These should still satisfy the requirement because they contain at least one occurrence of three consecutive 'G' characters. Similarly, the "ATAT" motif may appear at any position in the string, including overlapping occurrences.
Approaches
Brute Force Approach
A brute force solution would examine every DNA sequence character by character.
For each row, we could manually check whether the first three characters are "ATG", whether the final three characters form one of the stop codons, whether every possible substring of length four equals "ATAT", and whether any consecutive run of 'G' characters has length at least three.
This approach is correct because it explicitly verifies every condition according to the problem definition. However, it is unnecessarily verbose in SQL and does not take advantage of the database engine's built in string matching capabilities.
Optimal Approach
The key observation is that all required checks are standard string pattern matching operations.
SQL provides functions such as:
LIKEfor prefix, suffix, and substring matching.REGEXP_LIKEfor regular expression pattern matching.
Using these functions allows the database engine to perform the checks directly and concisely.
The four conditions map naturally to SQL expressions:
- Starts with
"ATG"→dna_sequence LIKE 'ATG%' - Ends with a stop codon → regular expression
(TAA|TAG|TGA)$ - Contains
"ATAT"→dna_sequence LIKE '%ATAT%' - Contains at least three consecutive
'G'characters → regular expressionGGG+
Each expression can be wrapped in a CASE statement to convert the Boolean result into 1 or 0.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(N × L) | O(1) | Manually scans each DNA sequence of length L |
| Optimal | O(N × L) | O(1) | Uses built in SQL pattern matching functions |
Here, N is the number of rows and L is the average DNA sequence length.
Algorithm Walkthrough
- Read every row from the
Samplestable. - Determine whether the DNA sequence begins with
"ATG"by using the patternATG%. If it matches, return1; otherwise return0. - Determine whether the DNA sequence ends with one of the stop codons. A regular expression such as
(TAA|TAG|TGA)$ensures that one of these codons appears exactly at the end of the sequence. Return1if it matches, otherwise0. - Determine whether the sequence contains the motif
"ATAT"anywhere within the string. The pattern%ATAT%checks for this condition. Return1if found, otherwise0. - Determine whether there are at least three consecutive
'G'characters. The regular expressionGGG+matches sequences containing three or more consecutive'G'characters. Return1if a match exists, otherwise0. - Output all original columns together with the four computed indicators.
- Sort the final result by
sample_idin ascending order.
Why it works
Each required biological pattern corresponds directly to a string matching operation. The prefix check verifies the start codon, the suffix regular expression verifies valid stop codons, the substring search detects the "ATAT" motif, and the consecutive character regular expression detects runs of at least three 'G' characters. Because every condition is checked exactly as stated in the problem, the resulting flags are correct for every row.
SQL Solution
SELECT
sample_id,
dna_sequence,
species,
CASE
WHEN dna_sequence LIKE 'ATG%' THEN 1
ELSE 0
END AS has_start,
CASE
WHEN REGEXP_LIKE(dna_sequence, '(TAA|TAG|TGA)$') THEN 1
ELSE 0
END AS has_stop,
CASE
WHEN dna_sequence LIKE '%ATAT%' THEN 1
ELSE 0
END AS has_atat,
CASE
WHEN REGEXP_LIKE(dna_sequence, 'GGG+') THEN 1
ELSE 0
END AS has_ggg
FROM Samples
ORDER BY sample_id;
Implementation Explanation
The query selects all original columns from the Samples table and computes four additional columns using CASE expressions.
The LIKE 'ATG%' pattern verifies that the sequence starts with the start codon. The regular expression (TAA|TAG|TGA)$ checks whether the sequence terminates with one of the three accepted stop codons. The %ATAT% pattern searches for the repeated motif anywhere in the sequence. Finally, the regular expression GGG+ identifies any run containing at least three consecutive 'G' characters.
The ORDER BY sample_id clause ensures the output is returned in ascending sample order as required.
Worked Examples
Sample 1
DNA sequence:
ATGCTAGCTAGCTAA
| Check | Result |
|---|---|
| Starts with ATG | Yes |
| Ends with TAA/TAG/TGA | Yes, ends with TAA |
| Contains ATAT | No |
| Contains GGG+ | No |
Output:
| has_start | has_stop | has_atat | has_ggg |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
Sample 2
DNA sequence:
GGGTCAATCATC
| Check | Result |
|---|---|
| Starts with ATG | No |
| Ends with stop codon | No |
| Contains ATAT | No |
| Contains GGG+ | Yes, contains GGG |
Output:
| has_start | has_stop | has_atat | has_ggg |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
Sample 3
DNA sequence:
ATATATCGTAGCTA
| Check | Result |
|---|---|
| Starts with ATG | No |
| Ends with stop codon | No |
| Contains ATAT | Yes |
| Contains GGG+ | No |
Output:
| has_start | has_stop | has_atat | has_ggg |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
Sample 4
DNA sequence:
ATGGGGTCATCATAA
| Check | Result |
|---|---|
| Starts with ATG | Yes |
| Ends with TAA | Yes |
| Contains ATAT | No |
| Contains GGG+ | Yes, contains GGGG |
Output:
| has_start | has_stop | has_atat | has_ggg |
|---|---|---|---|
| 1 | 1 | 0 | 1 |
Sample 5
DNA sequence:
TCAGTCAGTCAG
| Check | Result |
|---|---|
| Starts with ATG | No |
| Ends with stop codon | No |
| Contains ATAT | No |
| Contains GGG+ | No |
Output:
| has_start | has_stop | has_atat | has_ggg |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
Sample 6
DNA sequence:
ATATCGCGCTAG
| Check | Result |
|---|---|
| Starts with ATG | No |
| Ends with TAG | Yes |
| Contains ATAT | Yes |
| Contains GGG+ | No |
Output:
| has_start | has_stop | has_atat | has_ggg |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
Sample 7
DNA sequence:
CGTATGCGTCGTA
| Check | Result |
|---|---|
| Starts with ATG | No |
| Ends with stop codon | No |
| Contains ATAT | No |
| Contains GGG+ | No |
Output:
| has_start | has_stop | has_atat | has_ggg |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(N × L) | Each sequence may be scanned while evaluating patterns |
| Space | O(1) | Only constant extra memory is required |
The database evaluates a fixed number of pattern matching expressions for each row. Since each pattern may inspect the DNA sequence once, the work is proportional to the sequence length. No auxiliary data structures are created, so the extra memory usage remains constant.
Test Cases
Although LeetCode Database problems are validated through SQL execution rather than unit tests, the following logical test scenarios cover the important cases.
| Test | Why |
|---|---|
| Starts with ATG only | Verifies start codon detection |
| Ends with TAA | Verifies stop codon detection |
| Ends with TAG | Verifies alternate stop codon |
| Ends with TGA | Verifies third stop codon |
| Contains ATAT in middle | Verifies motif detection |
| Contains GGG | Verifies minimum valid G run |
| Contains GGGG | Verifies longer runs are accepted |
| Contains none of the patterns | Verifies all flags remain 0 |
| Contains multiple patterns simultaneously | Verifies flags are independent |
| Sequence shorter than pattern length | Verifies matching safely returns 0 |
Edge Cases
Sequence Contains More Than Three Consecutive G Characters
A common mistake is checking only for the exact substring "GGG". Sequences such as "GGGG" or "GGGGG" must also be considered valid because they contain at least three consecutive 'G' characters. The regular expression GGG+ correctly handles both the minimum and longer runs.
Sequence Ends With a Similar But Invalid Suffix
A sequence might end with characters that resemble a stop codon but are not exactly one of the accepted values. For example, "ATGCTAAG" contains "TAA" internally but does not end with it. The $ anchor in (TAA|TAG|TGA)$ ensures that only true suffix matches are accepted.
Overlapping ATAT Motifs
Strings such as "ATATAT" contain overlapping occurrences of "ATAT". A substring search using %ATAT% still succeeds because only one occurrence is required. The implementation does not depend on counting occurrences, so overlapping matches are handled naturally.
Very Short DNA Sequences
Sequences shorter than three or four characters cannot possibly contain some of the required patterns. SQL pattern matching functions simply return false in these situations, producing 0 for the corresponding indicators without any special handling.
Multiple Patterns Present Simultaneously
A sequence such as "ATGGGGATATTAA" satisfies every condition. Because each flag is computed independently using separate expressions, the query correctly returns 1 for all matching categories.