SPOJ.com - Problem PROG0300

PROG0300 - Open reading frames

Once the complete genome of an organism has been determined, the question arises which regions of the genome code for proteins. Open reading frames (ORFs) are an important concept in predicting these coding regions. To understand how this works, it is important to know that a DNA sequence is translated into a protein by converting each group of three consecutive nucleotides (called a codon) into the next amino acid that is appended to the growing protein chain. Certain codons play a special role in translating DNA into proteins:

start codon: the translation of a DNA sequence into a protein can start here
stop codon: the translation of a DNA sequence into a protein ends here

An ORF is a region within a genome that starts at the next start codon (from the beginning of the sequence or from the previous stop codon) and runs up to and including the next stop codon. Note that according to this definition, there may be several start codons between the first and the last codon of an ORF, but there cannot be any intermediate stop codons. The stop codon of an ORF is always located a multiple of three positions downstream of the first start codon of the ORF. The start and stop codons are then said to be in the same reading frame. Since codons consist of three nucleotides, a DNA sequence thus has three reading frames, each of which may contain zero or more ORFs.

As an example, consider the following DNA sequence, which can be read in three reading frames. The first reading frame starts at the letter g, the second reading frame at the first letter c and the third reading frame at the second letter c.

1 gcc ctt aat ttt att cat tgg att cca ttc att aac gtg ctg atg tcc cat ttg ttt a
  |-------------* |-----------|-----------|---------------------------------*
2 g ccc tta att tta ttc att gga ttc cat tca tta acg tgc tga tgt ccc att tgt tta
            |-----------|---------------------------------*         |
3 gc cct taa ttt tat tca ttg gat tcc att cat taa cgt gct gat gtc cca ttt gtt ta
               *                     |---------------------------------*

We have separated the codons in each reading frame by a single space in the above representation. In this example we consider the codons att, gca and gcc as start codons and the codons tct, tga and ttt as stop codons. Below each reading frame, we have underlined the ORFs with hyphens (-). In addition, we mark the first nucleotide of a start codon with a vertical bar (|) and the last nucleotide of a stop codon with an asterisk (*). This makes it easy to see that there are two ORFs in the first reading frame, and one ORF each in the second and third reading frames.
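
The codon split shown in the representation above can be reproduced with a short helper. This is only a minimal sketch: the function name codons and its 1-based frame parameter are assumptions made for illustration and are not part of the assignment.

```python
def codons(sequence, frame):
    """Return the complete codons of reading frame 1, 2 or 3.

    Hypothetical helper: the name and the 1-based frame numbering are
    illustrative assumptions. Leading and trailing nucleotides that do
    not form a full codon in this frame are dropped.
    """
    offset = frame - 1  # frame 1 starts at position 0, frame 2 at 1, ...
    # i runs over the start position of every codon that fits completely.
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


prefix = 'gcccttaattttatt'  # first 15 nucleotides of the example sequence
for frame in (1, 2, 3):
    print(frame, ' '.join(codons(prefix, frame)))
```

Because the start position of a codon in frame f is always f - 1 plus a multiple of three, a start codon and a stop codon found by the same call are automatically in the same reading frame.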

Assignment

In this exercise we represent DNA sequences (including codons) as strings that only contain the lowercase letters a, c, g and t. Write a function ORFs that takes three arguments: a DNA sequence, a list of start codons and a list of stop codons. The function must return a list that contains all ORFs found in the three reading frames of the DNA sequence. Each ORF is represented as a tuple of the form

(start position, stop position, frame)

The start position of the ORF is the position of the first nucleotide of the start codon, and the stop position is the position of the last (third) nucleotide of the stop codon. Positions are given relative to the start of the DNA sequence, with the first nucleotide at position zero. The frame of an ORF is the number (1, 2 or 3) of the reading frame in which the ORF was found. The list returned by the function must contain the ORFs sorted by increasing start position.

Example

>>> ORFs('gcccttaattttattcattggattccattcattaacgtgctgatgtcccatttgttta', ['att', 'gca', 'gcc'], ['tct', 'tga', 'ttt'])
[(0, 11, 1), (7, 42, 2), (12, 56, 1), (26, 52, 3)]
>>> ORFs('cgtgtgcacaactcaccccgtagacccaaaatgtggataacatg', ['aca', 'gtg'], ['aaa', 'ccc', 'gac'])
[(1, 18, 2), (3, 17, 1)]
>>> ORFs('cgagggctctcactgggacggcagaggctagtcacagtat', ['agt'], ['gac', 'ggc'])
[]
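
One possible way to approach the assignment is to scan each reading frame codon by codon while remembering the start position of the ORF that is currently open. The sketch below is not a reference solution. It assumes that a stop codon seen while no ORF is open is ignored, that a start codon seen inside an open ORF is ignored, and that no codon appears in both lists. The statement does not say what should happen in that last case.

```python
def ORFs(sequence, start_codons, stop_codons):
    """Return all ORFs as (start position, stop position, frame) tuples."""
    starts, stops = set(start_codons), set(stop_codons)
    found = []
    for offset in range(3):          # offsets 0, 1, 2 are frames 1, 2, 3
        orf_start = None             # start position of the open ORF, if any
        for i in range(offset, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if orf_start is None:
                # Assumption: a codon listed as both start and stop
                # opens an ORF here (case not covered by the statement).
                if codon in starts:
                    orf_start = i
            elif codon in stops:
                # i + 2 is the last (third) nucleotide of the stop codon.
                found.append((orf_start, i + 2, offset + 1))
                orf_start = None
        # An ORF still open at the end of the frame has no stop codon
        # and is therefore not reported.
    # Tuples sort on their first element, the start position, first.
    return sorted(found)
```

Since every position of the sequence belongs to exactly one reading frame, no two ORFs share a start position, so sorting the tuples directly yields the required order.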

Added by:	Peter Dawyndt
Date:	2012-12-08
Time limit:	10s
Source limit:	50000B
Memory limit:	1536MB
Cluster:	Cube (Intel G860)
Languages:	PY_NBC
Resource:	None