Shortly after the release of GENCODE v18 last month, our eagle-eyed users noticed that the number of protein-coding genes had fallen by 12 since release v17 (conversation at https://twitter.com/GencodeGenes).
This observation is perhaps counter-intuitive; GENCODE is an expanding geneset, and between releases v17 and v18 the total number of genes swelled by 164 whilst the overall transcript number increased by 713. However, these increases are largely explained by the creation of novel long non-coding RNA and pseudogene models. Why, then, did the number of protein-coding genes fall from release v17 to v18?
The answer lies in the manual annotation processes of the HAVANA group, based at the Wellcome Trust Sanger Institute, the main contributors to the GENCODE geneset. GENCODE annotation is based on an ever-increasing wealth of diverse experimental evidence. Lately, for example, we have started to introduce ribosome profiling data to identify novel translation initiation sites (expect a blog post in the near future), and CAGE and poly-A-seq data to identify transcription initiation and termination sites. Meanwhile, the huge amount of RNAseq data flooding in from numerous projects can flag up transcripts that we have missed. However, our role is not only to identify more human transcripts; we have an additional responsibility to update existing annotation when models can be corrected or improved. These changes happen along two lines: (1) adjustments made to the structure of models, e.g. the addition of novel exons, and (2) a reappraisal of the functionality of models, e.g. whether or not they contain a valid CDS. Indeed, consider that our manual annotation of the human genome began more than a decade ago. Large numbers of GENCODE models were thus created prior to the advent of next-generation transcriptomics.
So what happened to those 12 protein-coding genes? In fact, 41 protein-coding loci have been deleted from v18, while 47 new models were added. Additionally, as discussed in our previous blog post, all GENCODE gene models are ascribed biotypes that categorize their inferred functionality. A total of 18 protein-coding loci have switched from a protein-coding to a non-coding biotype in v18. Taken together (47 added, 41 deleted, 18 switched), this leaves us with a net reduction of 12 protein-coding loci.
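For readers who want to reproduce this kind of tally themselves, the bookkeeping can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure HAVANA uses: the function name, the toy gene IDs (G1 to G5) and the idea of holding each release as a simple mapping from gene ID to biotype are all assumptions made for the example. The final lines simply redo the arithmetic with the figures quoted above, assuming no loci switched in the opposite direction.

```python
def protein_coding_changes(old, new, coding="protein_coding"):
    """Classify protein-coding gene changes between two releases.

    old, new: dicts mapping a gene ID to its biotype string
    (hypothetical in-memory stand-ins for two annotation releases).
    """
    old_coding = {g for g, b in old.items() if b == coding}
    new_coding = {g for g, b in new.items() if b == coding}
    return {
        # coding in the old release, absent from the new one
        "deleted": sorted(g for g in old_coding if g not in new),
        # coding in the new release, absent from the old one
        "added": sorted(g for g in new_coding if g not in old),
        # still present, but no longer carries the coding biotype
        "switched_away": sorted(
            g for g in old_coding if g in new and new[g] != coding
        ),
        # previously present with another biotype, now coding
        "switched_to": sorted(
            g for g in new_coding if g in old and old[g] != coding
        ),
        "net": len(new_coding) - len(old_coding),
    }


# Toy data: gene IDs are invented for illustration only.
old_release = {
    "G1": "protein_coding",
    "G2": "protein_coding",
    "G3": "protein_coding",
    "G4": "lincRNA",
}
new_release = {
    "G1": "protein_coding",
    "G3": "lincRNA",         # biotype switch
    "G4": "lincRNA",
    "G5": "protein_coding",  # new locus; G2 has been deleted
}
toy = protein_coding_changes(old_release, new_release)
print(toy)

# The same arithmetic with the v17 -> v18 figures quoted in the post.
added, deleted, switched_away = 47, 41, 18
net_v18 = added - deleted - switched_away
print(net_v18)
```

The net change is computed directly from the two sets of coding genes, so it can be checked against the sum of the individual categories.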
To further explain, perhaps the best thing to do is to look at some examples of protein-coding loci that were curated, deleted and updated between GENCODE releases v17 and v18:
CDS curated: Transcript RP4-694A7.4-001 is based on RNAseq evidence (not shown) and is a new protein-coding gene in GENCODE v18. Both CAGE and CpG data indicate that the 5’ end of the transcript represents a true transcription initiation site, while poly-A-seq data confirms the 3’ end. These observations indicate that we have defined the true extent of the locus, allowing functional annotation to proceed with confidence. Proteogenomic evidence in the form of a mass-spec tag is able to confirm the protein-coding potential of this transcript, enabling the annotation of a CDS. Without this experimental support for translation, this model would have been annotated as a non-coding RNA.
CDS deleted: While protein coding in GENCODE v17, the CDS from RP4-22D3.1-001 has been deleted from v18. Firstly, we have no data to confirm that this model represents a full-length RNA, i.e. that the true 5’ and 3’ ends have been captured. In fact, since the model is based on a single EST we would assume that this is not the case. This ambiguity confuses the functional annotation. For example, the existence of additional exons at the 3’ end of the model would suggest that, if the predicted CDS does engage with the ribosome, it is likely to induce nonsense-mediated decay. Secondly, the original CDS does not show any level of sequence conservation. The existence of so-called ‘orphan CDS’ in mammalian genomes remains a source of debate. One thing that is certainly true is that we know much more about the extent and functionality of lncRNAs than we did 10 years ago. When annotating such a transcript, it is therefore not simply a case of looking for the longest ORF and converting it into a CDS. Instead, now that ribosome profiling and mass spectrometry data are available, we can look for actual evidence of translation. In this example, the absence of such evidence tips the balance in favor of a non-coding model.
CDS updated: New poly-A-seq data supports the coding potential of a number of transcripts at KIAA1456. The confirmed 3’ ends of three separate transcripts allow the annotation of alternatively spliced CDSs. However, the lack of poly-A-seq data to confirm the 3’ end of the shorter transcript (marked with *) means that a CDS has been deleted from this locus (shown in the ‘Deleted CDS’ row) between GENCODE v17 and v18. Instead, upORF attributes are associated with all relevant transcripts to acknowledge the potential of an additional CDS variant at the locus.
GENCODE is the gold-standard geneset of the ENCODE project and is displayed on the ENSEMBL and VEGA genome browsers. Our desire to continually expand and update our annotation provides an unrivaled benefit to our users, who can be sure that our genes and transcripts are increasingly based on up-to-date experimental evidence. This helps us capture a true reflection of the transcriptional complexity that exists within the human genome.