Why do human and mouse gene counts change between GENCODE releases?

Why is GENCODE continuing to annotate the human and mouse genomes?

Ultimately, because we have not yet finalized our annotation catalogs (‘genesets’) for these species. We want our genesets to be as accurate and comprehensive as possible, and are confident that each new GENCODE release is a marked improvement on the previous version. The major changes that mark a new release are the annotation of novel genes and transcripts on the one hand, and the update or removal of genes and transcripts flagged by ongoing QC analysis on the other.

Is the situation the same for human and mouse?

Our annotation strategies are highly similar for the two species: the core process in the creation of each release is the merging of the manual HAVANA annotation with computationally produced Ensembl annotation. However, the human geneset could be described as more ‘mature’ than that of mouse. This is because the HAVANA group has manually annotated the entire human genome, whereas manual annotation of the mouse genome is about two-thirds complete. GENCODE genesets essentially use Ensembl computational annotation to supplement the manual models, in particular to cover regions of the genome not yet annotated manually. Thus, the proportion of manual annotation in mouse GENCODE continues to increase with each release as HAVANA continue to work in a systematic chromosome-by-chromosome manner.

So human GENCODE is nearly finished, then?

No. Even though all human chromosomes have been manually annotated in great detail, there is clearly still a large amount of work to be done. The HAVANA group continue to work on the human geneset, alongside Ensembl and other GENCODE partners.

Surely we should at least have identified all human protein-coding genes by now?

Our human protein-coding gene count has thus far changed from each GENCODE release to the next. These are always net changes, i.e. with each release certain protein-coding genes have been added while others have been taken away. In fact, we are currently performing a major drive to finalize, as far as possible, our protein-coding gene count.
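
To make the idea of a net change concrete, here is a minimal Python sketch. It assumes the protein-coding gene identifiers have already been extracted from two releases; the function names and the example identifiers are hypothetical and are not part of any GENCODE tool. Version suffixes (the part after the dot) are stripped so that a gene whose model was merely updated is not counted as both removed and added.

```python
def strip_version(gene_id):
    """Drop a trailing '.N' version suffix, e.g. 'GENE0001.4' -> 'GENE0001'."""
    return gene_id.split(".", 1)[0]


def net_change(previous_ids, current_ids):
    """Compare two collections of gene IDs and report additions and removals."""
    previous = {strip_version(g) for g in previous_ids}
    current = {strip_version(g) for g in current_ids}
    added = current - previous      # present only in the newer release
    removed = previous - current    # present only in the older release
    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "net": len(added) - len(removed),
    }


# Made-up identifiers, for illustration only.
older = ["GENE0001.1", "GENE0002.3", "GENE0003.1"]
newer = ["GENE0001.2", "GENE0003.1", "GENE0004.1", "GENE0005.1"]

summary = net_change(older, newer)
print(summary)
```

In this toy example two genes are added and one is removed, so the headline count rises by one even though three loci changed status. The same applies to real releases: a small net change can conceal a larger turnover.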

Why would GENCODE remove an existing protein-coding gene?

Because we no longer believe it is a protein-coding gene. There are a variety of reasons why our opinion may have changed on a given locus, although all changes have ultimately been made via the manual annotation process. For example, a couple of years ago we made the decision to no longer include ‘orphan’ protein-coding genes in our dataset. These are loci featuring translations that do not exhibit strong evolutionary conservation or homology to other known proteins. In practice, this led to the removal of the protein-coding status of about 200 loci, most of which seemed to represent in silico ORF predictions that had persisted since the earliest computational analyses of the genome. To emphasize, most of these genes were not removed per se, rather converted to non-coding loci.

Secondly, a serious complication in the annotation of protein-coding genes is the existence of pseudogenes, which are frequently transcribed and can maintain sizeable ORFs. Computational pipelines thus have a tendency to call pseudogenes as protein-coding. Manual annotation is highly beneficial in the description of pseudogenes, and our ‘kill list’ contains a number of loci that are now categorized in this manner having previously existed as protein-coding genes.
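
The distinction between a locus that is removed outright and one that merely loses its protein-coding status can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of the GENCODE pipeline: each release is represented as a plain dictionary mapping a gene identifier to a biotype label, and the function name, identifiers and labels are all hypothetical.

```python
def classify_lost_coding_genes(previous, current, coding_label="protein_coding"):
    """Split genes that were coding in `previous` into reclassified vs. removed.

    previous, current: dicts mapping gene ID -> biotype label (assumed layout).
    Returns (reclassified, removed), where `reclassified` maps gene ID -> new
    biotype and `removed` is a sorted list of IDs absent from `current`.
    """
    reclassified = {}
    removed = []
    for gene_id, biotype in previous.items():
        if biotype != coding_label:
            continue  # only genes that used to be protein-coding are of interest
        new_biotype = current.get(gene_id)
        if new_biotype is None:
            removed.append(gene_id)              # locus no longer in the geneset
        elif new_biotype != coding_label:
            reclassified[gene_id] = new_biotype  # locus kept, status changed
    return reclassified, sorted(removed)


# Made-up releases, for illustration only.
older = {
    "GENE_A": "protein_coding",
    "GENE_B": "protein_coding",
    "GENE_C": "protein_coding",
    "GENE_D": "protein_coding",
}
newer = {
    "GENE_A": "protein_coding",
    "GENE_B": "lncRNA",
    "GENE_C": "pseudogene",
}

reclassified, removed = classify_lost_coding_genes(older, newer)
print(reclassified, removed)
```

Here GENE_B and GENE_C remain in the geneset under a different biotype, while only GENE_D disappears entirely. All three would lower the protein-coding count.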

… and why does GENCODE continue to add new protein-coding genes?

Firstly, although a number of these genes are indeed truly novel, many actually had a prior existence in the geneset as either pseudogenes or lncRNAs. Updates of both types were generally prompted by the existence of new experimental evidence to support coding annotation. For example, a common reason why an entire locus was previously absent from the geneset is that its expression is highly specific to a particular cell type; i.e. the existence of the gene was not obvious until such transcript evidence became available. Similarly, when novel transcript datasets are used to extend existing models, this new appreciation of the locus structure can sometimes allow for the true CDS to be identified. Proteomics data is also proving to be increasingly valuable, especially the use of peptides obtained by mass spectrometry to confirm and identify translated regions. Finally, we are performing an ongoing project to identify regions of the genome that are not currently annotated as protein-coding or pseudogenic but do show signatures of protein-coding evolution. We are finding that this analysis nicely complements the appraisal of novel experimental datasets, and is particularly useful for identifying CDS that were previously missed because of their small size. Stay tuned for further details.

When will the protein-coding gene count be finalized?

It’s hard to say. We are increasingly confident that we can identify the regions of the genome that have their provenance as protein-coding sequences in an evolutionary sense, and believe that we can now accurately distinguish these from lncRNAs (although this work is ongoing). What is proving harder is to distinguish pseudogenes from protein-coding genes, which is not always possible based on annotation alone.

What about alternatively spliced transcripts within protein-coding genes?

The improved annotation of alternative splicing remains a major focus of ongoing GENCODE efforts, and the total number of alternative translations found in the human and mouse genesets also changes notably from release to release. The approaches described above, i.e. the incorporation of novel transcript libraries, proteomics data and an analysis of conservation, are also proving highly useful for the identification and classification of alternatively spliced transcripts.

And what about the lncRNA catalog?

Our human and mouse lncRNA catalogs are also incomplete, and are another major focus of our annotation efforts in both species. We know that whole genes, transcripts and individual exons remain to be incorporated. Their improved annotation depends largely on the availability of transcript libraries.

… and small RNAs?

Small RNAs (miRNAs, tRNAs, rRNAs, snoRNAs, etc.) are added to the human and mouse genesets by the Ensembl group. These models are in turn taken from public databases dedicated to the description of specific families, such as miRBase, which catalogs miRNAs. Thus, changes within these catalogs will lead to improvements in the small RNA content of GENCODE. Further information is provided by the Ensembl online documentation and most recent publication.