Thursday, August 28, 2008

Blast! An ORF is not a coding sequence!

Now that The Rough Guide to Evolution is very nearly finished and the summer holidays are nearly over, my thoughts are returning to my "day job" as a bacteriologist with an interest in bioinformatics, genomics and pathogenesis (how bacteria cause disease). In particular, I am struggling with a backlog of papers and grant proposals that I have agreed to peer-review, which I will have to clear before starting work on a grant proposal on bacterial flagellar function (of which more later).

Having a blog of my own gives me a chance to highlight to a wider audience some of the problems that crop up time and again in papers submitted for review. These are technical points, so the non-specialist reader should ignore these posts. But conversely, I would ask the specialist reader to bring these points to the attention of all their colleagues and to as wide an audience as possible!

Problems commonly arise when lab-based scientists include some bioinformatics in their work. Such scientists are typically extremely careful in describing their laboratory work, taking care to ensure that methods are described in enough detail that anyone competent in the field can reproduce them, even to the extent of saying which supplier they used for any given chemical or growth medium.
BUT when it comes to bioinformatics, they are often very imprecise and even cavalier in their use of language and in the assumptions they make! 

Simply stating "we did a BLAST search" is equivalent to saying "we grew some bacteria under undefined conditions, for an undefined length of time in an undefined growth medium"! The outcome of bioinformatics analyses often depends critically on the conditions used, just as in lab-based work, so it is crucial to specify which particular version of a program was used under what settings (e.g. was the filter on or off in BLASTP, which matrix was used, were composition-based statistics employed, what word size was used?). Better still, one should repeat the search under a variety of settings and show that the results are or are not the same whatever the settings.
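To make the point concrete, here is a minimal sketch in Python of recording a search in a form that could be pasted into a methods section. It assumes the NCBI BLAST+ `blastp` command line and its `-matrix`, `-word_size`, `-seg`, `-comp_based_stats` and `-evalue` options. The function names, the file and database names and the values shown are purely illustrative (they are not claimed to be anyone's defaults), and the version string is a placeholder that you would supply from your own installation.

```python
import shlex


def blastp_command(query, db, *, matrix="BLOSUM62", word_size=3,
                   seg="no", comp_based_stats=2, evalue=10.0):
    """Build a blastp argument list with every setting spelled out.

    The values are illustrative; the point is that nothing is left implicit.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-matrix", matrix,                            # scoring matrix
        "-word_size", str(word_size),                 # word size
        "-seg", seg,                                  # low-complexity filter on or off
        "-comp_based_stats", str(comp_based_stats),   # composition-based statistics
        "-evalue", str(evalue),                       # expectation value cut-off
    ]


def methods_line(args, version):
    """Return a one-line record of the program version and exact command."""
    return f"{version}: {shlex.join(args)}"


if __name__ == "__main__":
    # Hypothetical file and database names, and a placeholder version string.
    cmd = blastp_command("query.faa", "my_protein_db", seg="yes")
    print(methods_line(cmd, "BLAST+ (version as reported by your installation)"))
```

Calling `blastp_command` once per combination of settings also gives a simple way of repeating the search under several conditions and keeping a record of each.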

Another common problem, which I encountered again today, is a confusion between "ORFs" and "coding sequences" (or CDSs) in bacterial genomes. An ORF is an open reading frame, i.e. a stretch of nucleotide sequence in a given reading frame that does not contain a stop codon, or in other words, a stretch of sequence within a reading frame bounded by two stop codons. It is NOT the same as a coding sequence (or CDS), which can be defined as a stretch of nucleotide sequence that directly encodes a protein product. CDSs are a feature of protein-coding genes, but simply identifying CDSs does not guarantee that you have found all the genes, as CDSs are not a feature of tRNA, rRNA and small regulatory RNA genes.

Identifying ORFs in a given stretch of bacterial DNA is a computationally trivial problem--it can be done even with a pencil and paper. Any given stretch of sequence typically contains many ORFs but very few if any of them will encode real proteins, i.e. represent CDSs.  
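To show just how trivial, here is a minimal sketch of an ORF finder in Python. It scans all six reading frames and reports every stop-free run of codons above a length threshold. Two assumptions are mine rather than part of the definition above: the stop codons are taken to be TAA, TAG and TGA, and a run that reaches the end of the supplied sequence is reported even though no second stop codon has been seen. The function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """List every stop-free run of at least min_codons codons in six frames.

    Each hit is (strand, frame, start, end, n_codons); start and end are
    0-based, end-exclusive offsets on the strand that was scanned.
    """
    hits = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            # Offset just past the last complete codon in this frame.
            frame_end = frame + 3 * ((len(s) - frame) // 3)
            for i in range(frame, frame_end, 3):
                if s[i:i + 3] in STOP_CODONS:
                    n = (i - run_start) // 3
                    if n >= min_codons:
                        hits.append((strand, frame, run_start, i, n))
                    run_start = i + 3  # next run begins after the stop
            # Assumption: keep a run that is cut off by the end of the sequence.
            n = (frame_end - run_start) // 3
            if n >= min_codons:
                hits.append((strand, frame, run_start, frame_end, n))
    return hits
```

Note that nothing in this code knows anything about proteins: it simply looks for the absence of stop codons, which is exactly why its output cannot be treated as a list of CDSs.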

For example, here is a diagrammatic representation of a stretch of DNA from the E. coli K-12 MG1655 genome, highlighting the recognised annotated CDSs:

And here is the same stretch of sequence, showing the far more numerous ORFs longer than 100 codons.

As you can see, ORFs are not the same as coding sequences!!

A rule of thumb is that the longer an ORF is, the more likely it is to encode a protein, and for many genomes, simply choosing long ORFs is a good way of identifying *some* of the protein-coding genes. However, there are many problems with relying on ORF-finding alone to identify CDSs:

  1. In some cases, particularly in GC-rich genomes, long ORFs are common even though most do not encode proteins. It is not uncommon to see multiple long overlapping ORFs in such genomes, and antisense ORFs, which run in the opposite direction to CDSs, are often seen.
  2. Because short ORFs are so common, relying on ORF-finding alone to identify short CDSs is prone to massive over-prediction.
  3. Even when an ORF contains a CDS, it will often contain additional sequence upstream of the real start codon. Remember--an ORF is simply a stretch of sequence bounded by two stop codons. Typically, there will be dozens of codons in the ORF upstream of the CDS.
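
The third point can be illustrated with a short sketch in Python that lists every in-frame candidate start codon within an ORF. It assumes that ATG, GTG and TTG are the candidate start codons (the ones commonly considered in bacteria), and the function name and example sequence are made up for illustration.

```python
START_CODONS = ("ATG", "GTG", "TTG")


def candidate_starts(orf_seq):
    """Return the 0-based codon indices of in-frame candidate start codons.

    orf_seq is assumed to begin at the first codon of the ORF, i.e. just
    after the upstream stop codon.
    """
    orf_seq = orf_seq.upper()
    return [
        i // 3
        for i in range(0, len(orf_seq) - 2, 3)
        if orf_seq[i:i + 3] in START_CODONS
    ]


if __name__ == "__main__":
    # Toy ORF with codons CCC GTG AAA ATG GCA TTG TTT.
    print(candidate_starts("CCCGTGAAAATGGCATTGTTT"))
```

The toy ORF contains three candidate starts, at codons 1, 3 and 5, and the ORF boundaries alone give no way of telling which, if any, is the real start of a CDS.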

For these reasons, ORF finding alone is never used to identify bacterial protein-coding genes. Instead, two other more sophisticated approaches are used: detection of homology at the protein level and use of Markov models to identify sequences that look like CDSs (the industry standard program for this is Glimmer).

So, please can I never again be sent a paper to review that discusses "the difficulties in identifying ORFs" or talks of "predicted ORFs" or "putative ORFs" or "hypothetical ORFs", when the authors mean "the difficulties in identifying CDSs" or predicted or putative or hypothetical CDSs!!

Some of you out there may wish to protest that I am being too prescriptive and that the (mis)use of ORF to mean CDS is so common that we should treat the two terms as synonymous. In response, I would argue that maintenance of the distinction is essential to the clarity of thought and precise use of language that should be the hallmarks of all scientific discourse!

1 comment:

RPM said...

How about molecular biologists referring to different degrees (or percents) of homology? What they mean is sequence identity or similarity. Saying that two sequences have minimal homology is akin to saying that you (or your wife) are a little bit pregnant. Homology is an all or none game.

Yes, I realize that some fields of biology have a definition of homology that is different from that used by evolutionary biologists. I disagree vehemently with them.