Single exon loci are hard to annotate correctly. Is the transcriptional start site right? The transcriptional termination site? The reading frame? Thankfully, we at GENCODE love nothing more than sinking our teeth into a single exon gene, ensuring that such loci are properly annotated and that the correct annotation is thus perpetuated in the publicly available genesets.
To illustrate, I shall walk through the manual annotation process that GENCODE employed for one such tricky locus, RBM15B:
Figure 1: RBM15B (A) and Rbm15b (B).
The first question we ask when confronted with a potential single exon gene is whether the locus is a genuine gene, and therefore suitable for annotation. This can be a difficult question to answer. In particular, one of the major findings of the ENCODE project is that the majority of the genome is transcribed, although the functional relevance of much of this transcription remains unclear. Furthermore, it is known that transcript datasets can be contaminated with genomic sequences. When looking at multiexon transcripts, the presence of canonical splice junctions provides confidence that the transcript is at least likely to be genuine. With single exon genes, of course, we do not have this luxury, so we need to find other markers for functionality. One dataset currently being integrated into GENCODE is polyAseq. As Figure 1 shows, we observe well supported polyadenylation features corresponding to the 3’ end of the transcript evidence for RBM15B. The identification of polyadenylation features is important, as it indicates that these transcripts are the result of a biological process rather than technical artifacts.
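As a rough illustration of the polyadenylation check described above, here is a minimal Python sketch that asks whether any polyA site lies close to the 3’ end of a transcript on the same strand. It is not GENCODE's pipeline: the `PolyASite` record, the `supporting_sites` function, the 50 bp window, the read threshold and all coordinates are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolyASite:
    """Hypothetical record for one polyA site; field names are assumptions."""
    chrom: str
    position: int  # genomic coordinate of the cleavage site
    strand: str    # "+" or "-"
    reads: int     # number of supporting reads


def supporting_sites(chrom, strand, three_prime_end, sites, window=50, min_reads=5):
    """Return polyA sites on the same strand within `window` bp of the 3' end.

    `window` and `min_reads` are illustrative thresholds, not published values.
    """
    return [
        s for s in sites
        if s.chrom == chrom
        and s.strand == strand
        and abs(s.position - three_prime_end) <= window
        and s.reads >= min_reads
    ]


# Made-up sites on a made-up contig, for illustration only.
sites = [
    PolyASite("chr_demo", 1_006_600, "+", 42),  # close, same strand, well supported
    PolyASite("chr_demo", 1_006_590, "-", 30),  # close, but opposite strand
    PolyASite("chr_demo", 1_009_000, "+", 12),  # same strand, but too far away
    PolyASite("chr_demo", 1_006_620, "+", 2),   # close, but too few reads
]
hits = supporting_sites("chr_demo", "+", 1_006_610, sites)
print(len(hits))
```

In practice an annotator weighs this kind of evidence by eye alongside the transcript alignments; a fixed window is only a convenient stand-in for that judgement.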
Our second question is whether this locus represents a protein coding gene; if not, it could be a lncRNA or potentially a pseudogene. The largest ORF in RBM15B is 891 aa, and we would consider an ORF of this size quite likely to represent a genuine CDS. However, ‘quite likely’ is not good enough for GENCODE. For single exon transcripts, an early port of call for a GENCODE annotator is to look for locus conservation in the mouse genome (Fig. 1B), since conservation is an exceptionally strong proxy for functionality. In the vast majority of cases, such conservation is not to be found. For RBM15B, we observe not only locus conservation between the genomes (with synteny maintained), but also the presence of an equivalent CDS in each. These CDS show 93% sequence identity at the amino acid level, and initiate and terminate at orthologous codons. The PhastCons tracks shown in Figure 1 highlight the conservation of this locus across 44 mammalian genomes.
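The two numbers quoted above, the size of the largest ORF and the amino acid identity between two CDS, can be sketched in a few lines of Python. This is a toy version under stated assumptions: `longest_orf` only scans the forward strand for ATG-initiated ORFs, and `percent_identity` expects two protein sequences that have already been aligned (gaps as `-`) and counts only columns without gaps. Both function names and the example sequences are made up, and other definitions of percent identity use different denominators.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based, half-open, and include the stop codon.
    Returns (0, 0) if no complete ORF is found.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the previous stop opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


def percent_identity(aligned_a, aligned_b):
    """Percent identity over gap-free columns of two pre-aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


toy_transcript = "CCATGAAATTTGGGTAACC"   # invented sequence
start, end = longest_orf(toy_transcript)
print(toy_transcript[start:end])
print(percent_identity("MKT-LV", "MKSALV"))  # invented alignment
```

An ORF length on its own says little, which is why the comparison with mouse matters: an ORF that is long, conserved and bounded by orthologous start and stop codons is much harder to explain by chance.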
In the case of Rbm15b, additional functional support is provided by mass spectrometry (MS) data (Fig. 1B): two short peptides, confirmed in several experiments sampling a number of tissues, provide strong support for the mouse CDS. Our report on the use of MS data to aid annotation efforts can be found here. MS datasets continue to increase in size and quality, making them of great interest to our annotators. In a future post, we will also discuss the use of new ribosome profiling data to identify and validate CDS.
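To make the peptide evidence concrete, here is a small Python sketch of the underlying idea: checking where MS peptides fall within a translated CDS. It is a simplification, not the method used in our MS work. The `locate_peptides` function and the sequences are invented for illustration, and the only MS-specific detail it models is that leucine and isoleucine have the same mass, so they are treated as interchangeable.

```python
def locate_peptides(protein, peptides):
    """Map each peptide to its 1-based start positions in `protein`.

    I and L are treated as equivalent because they are isobaric.
    Peptides that do not match map to an empty list.
    """
    normalised = protein.upper().replace("I", "L")
    hits = {}
    for peptide in peptides:
        query = peptide.upper().replace("I", "L")
        positions = []
        idx = normalised.find(query)
        while idx != -1:
            positions.append(idx + 1)  # report 1-based residue positions
            idx = normalised.find(query, idx + 1)
        hits[peptide] = positions
    return hits


# Invented protein and peptides, for illustration only.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"
peptide_hits = locate_peptides(toy_protein, ["TAYLAK", "QISFVK", "GGGGGG"])
print(peptide_hits)
```

A real analysis would also ask whether each peptide maps uniquely to this locus rather than to a paralogue elsewhere in the genome, which is a large part of what makes peptide evidence convincing.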
In conclusion, single exon genes can be difficult annotation problems. However, as RBM15B shows, integrating a number of different data sources can permit the correct annotation of these troublesome loci. Our manual annotation effort resulted in GENCODE being the first geneset to annotate a CDS for this locus. Thanks to our data sharing policies, the correct representation of this locus is now perpetuated in the other publicly available genesets (see here for more information on our annotation guidelines). We at GENCODE are proud of the expertise that we can offer in ensuring the correct annotation of difficult loci. If you feel that this level of manual investigation would benefit loci of interest to you, then please do contact us. We are more than happy to help.