Gencode v15 – a complete first-pass manual annotation of the human genome

GencodeGenes recently attended the Biology of Genomes meeting at CSHL. Our main message to the conference was an analysis of GENCODE v15: Tim Hubbard detailed the expansion in protein-coding, long non-coding RNA and pseudogene loci across GENCODE releases 3c-15.

Of particular note is that v15 is the first GENCODE release to detail first-pass manual annotation for the complete human genome!

GENCODE 15 contains 19,700 protein-coding loci, 13,200 long non-coding RNA loci and 13,100 pseudogenes; the last two classes show the greatest increase across GENCODE releases 3c-15 and the number of lncRNA loci is likely to show substantial future growth:


Figure 1: Growth in the number of GENCODE loci (A) and transcripts (B) across versions 3c-15.
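
For readers who want to tally loci per biotype in a release themselves, here is a minimal sketch rather than the pipeline behind the figures above. It assumes the GENCODE GTF layout, in which the ninth column of each `gene` record carries a `gene_type "..."` attribute. The function name `count_gene_types` and the toy records are made up for illustration. The headline classes in this post (lncRNA, pseudogene) each pool several `gene_type` values, and that grouping is not spelled out here, so the sketch reports raw `gene_type` counts only.

```python
import re
from collections import Counter

# Matches the gene_type attribute in the GTF attribute column.
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def count_gene_types(lines):
    """Tally gene_type values over the 'gene' records of GTF-formatted lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_TYPE_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Toy records (made-up IDs and coordinates), not real GENCODE entries.
sample = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_type "protein_coding";',
    'chr1\tHAVANA\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2"; gene_type "lincRNA";',
    'chr2\tHAVANA\tgene\t50\t400\t.\t+\t.\tgene_id "G3"; gene_type "processed_pseudogene";',
    'chr2\tENSEMBL\tgene\t5000\t7000\t.\t+\t.\tgene_id "G4"; gene_type "protein_coding";',
]

counts = count_gene_types(sample)
for gene_type, n in counts.most_common():
    print(gene_type, n)
```

To run it on a real release, pass an open file handle instead of `sample`, for example `gzip.open(path, "rt")` on a downloaded, gzip-compressed annotation file. Only `gene` records are counted, so loci with many transcripts are not counted more than once.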


In addition, we used this opportunity to highlight to the community our ongoing efforts to extend the geneset and improve functional annotation, e.g. by combining RNAseq and polyAseq data:


Figure 2: Ensembl mapping of Illumina Bodymap2 RNAseq data, used in conjunction with CAGE data produced by ENCODE and polyAseq data from Derti et al. (Genome Res. 2012), gives us confidence to extend previously annotated transcripts where conventional transcriptional evidence (ESTs and mRNAs) is patchy (A), or to build novel transcripts where we previously lacked confidence, e.g. a single exon supported by a single EST (B).


Furthermore, we provided examples of how we are integrating mass spectrometry, CAGE and ribosome profiling data into our geneset. Each of these technologies has featured in previous posts and, with their ever-increasing use, they are more than likely to be the focus of future posts too. Adding functional annotation is a major current focus within the group; more on that soon…


