LeetCode 3475 - DNA Pattern Recognition

This problem provides a database table named Samples that contains DNA sequences and their associated species. Each row represents a single biological sample, identified by a unique sampleid.

LeetCode Problem 3475

Difficulty: 🟡 Medium
Topics: Database

Solution

LeetCode 3475 - DNA Pattern Recognition

Problem Understanding

This problem provides a database table named Samples that contains DNA sequences and their associated species. Each row represents a single biological sample, identified by a unique sample_id.

The task is to examine every DNA sequence and determine whether it matches four different biological patterns. For each sample, we must generate four indicator columns:

  • has_start should be 1 if the sequence starts with "ATG", otherwise 0.
  • has_stop should be 1 if the sequence ends with one of the stop codons "TAA", "TAG", or "TGA", otherwise 0.
  • has_atat should be 1 if the sequence contains the substring "ATAT" anywhere within it, otherwise 0.
  • has_ggg should be 1 if the sequence contains at least three consecutive 'G' characters, otherwise 0.

The output must include the original columns (sample_id, dna_sequence, and species) along with these four computed flags. The final result must be sorted by sample_id in ascending order.

Since this is a database problem, the solution is expressed as a SQL query rather than an algorithm implemented in a traditional programming language. The key challenge is recognizing that all four requirements are string pattern matching operations that can be efficiently handled using SQL string functions and regular expressions.

An important detail is that the sequence may contain longer runs of 'G' characters such as "GGGG" or "GGGGG". These should still satisfy the requirement because they contain at least one occurrence of three consecutive 'G' characters. Similarly, the "ATAT" motif may appear at any position in the string, including overlapping occurrences.

Approaches

Brute Force Approach

A brute force solution would examine every DNA sequence character by character.

For each row, we could manually check whether the first three characters are "ATG", whether the final three characters form one of the stop codons, whether every possible substring of length four equals "ATAT", and whether any consecutive run of 'G' characters has length at least three.

This approach is correct because it explicitly verifies every condition according to the problem definition. However, it is unnecessarily verbose in SQL and does not take advantage of the database engine's built in string matching capabilities.

Optimal Approach

The key observation is that all required checks are standard string pattern matching operations.

SQL provides functions such as:

  • LIKE for prefix, suffix, and substring matching.
  • REGEXP_LIKE for regular expression pattern matching.

Using these functions allows the database engine to perform the checks directly and concisely.

The four conditions map naturally to SQL expressions:

  • Starts with "ATG"dna_sequence LIKE 'ATG%'
  • Ends with a stop codon → regular expression (TAA|TAG|TGA)$
  • Contains "ATAT"dna_sequence LIKE '%ATAT%'
  • Contains at least three consecutive 'G' characters → regular expression GGG+

Each expression can be wrapped in a CASE statement to convert the Boolean result into 1 or 0.

Approach Time Complexity Space Complexity Notes
Brute Force O(N × L) O(1) Manually scans each DNA sequence of length L
Optimal O(N × L) O(1) Uses built in SQL pattern matching functions

Here, N is the number of rows and L is the average DNA sequence length.

Algorithm Walkthrough

  1. Read every row from the Samples table.
  2. Determine whether the DNA sequence begins with "ATG" by using the pattern ATG%. If it matches, return 1; otherwise return 0.
  3. Determine whether the DNA sequence ends with one of the stop codons. A regular expression such as (TAA|TAG|TGA)$ ensures that one of these codons appears exactly at the end of the sequence. Return 1 if it matches, otherwise 0.
  4. Determine whether the sequence contains the motif "ATAT" anywhere within the string. The pattern %ATAT% checks for this condition. Return 1 if found, otherwise 0.
  5. Determine whether there are at least three consecutive 'G' characters. The regular expression GGG+ matches sequences containing three or more consecutive 'G' characters. Return 1 if a match exists, otherwise 0.
  6. Output all original columns together with the four computed indicators.
  7. Sort the final result by sample_id in ascending order.

Why it works

Each required biological pattern corresponds directly to a string matching operation. The prefix check verifies the start codon, the suffix regular expression verifies valid stop codons, the substring search detects the "ATAT" motif, and the consecutive character regular expression detects runs of at least three 'G' characters. Because every condition is checked exactly as stated in the problem, the resulting flags are correct for every row.

SQL Solution

SELECT
    sample_id,
    dna_sequence,
    species,
    CASE
        WHEN dna_sequence LIKE 'ATG%' THEN 1
        ELSE 0
    END AS has_start,
    CASE
        WHEN REGEXP_LIKE(dna_sequence, '(TAA|TAG|TGA)$') THEN 1
        ELSE 0
    END AS has_stop,
    CASE
        WHEN dna_sequence LIKE '%ATAT%' THEN 1
        ELSE 0
    END AS has_atat,
    CASE
        WHEN REGEXP_LIKE(dna_sequence, 'GGG+') THEN 1
        ELSE 0
    END AS has_ggg
FROM Samples
ORDER BY sample_id;

Implementation Explanation

The query selects all original columns from the Samples table and computes four additional columns using CASE expressions.

The LIKE 'ATG%' pattern verifies that the sequence starts with the start codon. The regular expression (TAA|TAG|TGA)$ checks whether the sequence terminates with one of the three accepted stop codons. The %ATAT% pattern searches for the repeated motif anywhere in the sequence. Finally, the regular expression GGG+ identifies any run containing at least three consecutive 'G' characters.

The ORDER BY sample_id clause ensures the output is returned in ascending sample order as required.

Worked Examples

Sample 1

DNA sequence:

ATGCTAGCTAGCTAA
Check Result
Starts with ATG Yes
Ends with TAA/TAG/TGA Yes, ends with TAA
Contains ATAT No
Contains GGG+ No

Output:

has_start has_stop has_atat has_ggg
1 1 0 0

Sample 2

DNA sequence:

GGGTCAATCATC
Check Result
Starts with ATG No
Ends with stop codon No
Contains ATAT No
Contains GGG+ Yes, contains GGG

Output:

has_start has_stop has_atat has_ggg
0 0 0 1

Sample 3

DNA sequence:

ATATATCGTAGCTA
Check Result
Starts with ATG No
Ends with stop codon No
Contains ATAT Yes
Contains GGG+ No

Output:

has_start has_stop has_atat has_ggg
0 0 1 0

Sample 4

DNA sequence:

ATGGGGTCATCATAA
Check Result
Starts with ATG Yes
Ends with TAA Yes
Contains ATAT No
Contains GGG+ Yes, contains GGGG

Output:

has_start has_stop has_atat has_ggg
1 1 0 1

Sample 5

DNA sequence:

TCAGTCAGTCAG
Check Result
Starts with ATG No
Ends with stop codon No
Contains ATAT No
Contains GGG+ No

Output:

has_start has_stop has_atat has_ggg
0 0 0 0

Sample 6

DNA sequence:

ATATCGCGCTAG
Check Result
Starts with ATG No
Ends with TAG Yes
Contains ATAT Yes
Contains GGG+ No

Output:

has_start has_stop has_atat has_ggg
0 1 1 0

Sample 7

DNA sequence:

CGTATGCGTCGTA
Check Result
Starts with ATG No
Ends with stop codon No
Contains ATAT No
Contains GGG+ No

Output:

has_start has_stop has_atat has_ggg
0 0 0 0

Complexity Analysis

Measure Complexity Explanation
Time O(N × L) Each sequence may be scanned while evaluating patterns
Space O(1) Only constant extra memory is required

The database evaluates a fixed number of pattern matching expressions for each row. Since each pattern may inspect the DNA sequence once, the work is proportional to the sequence length. No auxiliary data structures are created, so the extra memory usage remains constant.

Test Cases

Although LeetCode Database problems are validated through SQL execution rather than unit tests, the following logical test scenarios cover the important cases.

Test Why
Starts with ATG only Verifies start codon detection
Ends with TAA Verifies stop codon detection
Ends with TAG Verifies alternate stop codon
Ends with TGA Verifies third stop codon
Contains ATAT in middle Verifies motif detection
Contains GGG Verifies minimum valid G run
Contains GGGG Verifies longer runs are accepted
Contains none of the patterns Verifies all flags remain 0
Contains multiple patterns simultaneously Verifies flags are independent
Sequence shorter than pattern length Verifies matching safely returns 0

Edge Cases

Sequence Contains More Than Three Consecutive G Characters

A common mistake is checking only for the exact substring "GGG". Sequences such as "GGGG" or "GGGGG" must also be considered valid because they contain at least three consecutive 'G' characters. The regular expression GGG+ correctly handles both the minimum and longer runs.

Sequence Ends With a Similar But Invalid Suffix

A sequence might end with characters that resemble a stop codon but are not exactly one of the accepted values. For example, "ATGCTAAG" contains "TAA" internally but does not end with it. The $ anchor in (TAA|TAG|TGA)$ ensures that only true suffix matches are accepted.

Overlapping ATAT Motifs

Strings such as "ATATAT" contain overlapping occurrences of "ATAT". A substring search using %ATAT% still succeeds because only one occurrence is required. The implementation does not depend on counting occurrences, so overlapping matches are handled naturally.

Very Short DNA Sequences

Sequences shorter than three or four characters cannot possibly contain some of the required patterns. SQL pattern matching functions simply return false in these situations, producing 0 for the corresponding indicators without any special handling.

Multiple Patterns Present Simultaneously

A sequence such as "ATGGGGATATTAA" satisfies every condition. Because each flag is computed independently using separate expressions, the query correctly returns 1 for all matching categories.