Today sees the release of GENCODE v20, our first gene set built on the new assembly of the human genome, GRCh38. GENCODE v20 is immediately available via the Ensembl genome browser and gencodegenes.org, and will appear on the UCSC genome browser very soon. The gene set is a merge between a completely new automated Ensembl genebuild (that you can read about here) and updated manual annotation from the HAVANA team. The HAVANA contribution to this latest gene set is the result of a mammoth effort from the whole team, so we thought we’d take this opportunity to tell you a little bit about it.
When GRCh38 was released in December 2013 it provided the most up-to-date version of the human genome available. Most importantly, GRCh38 corrected a large number of sequencing errors identified in GRCh37, and substantially improved the representation of chromosome regions found to contain sequence gaps or mis-assemblies. Furthermore, GRCh38 contains a couple of hundred haplotypic regions in addition to the ‘main’ assembly, for example 35 distinct representations of the Leucocyte Receptor Complex/Killer Immunoglobulin-like Receptor (LRC / KIR) complex on chromosome 19. However, while you guys will undoubtedly benefit from this all new and improved genome, the transfer of the GENCODE gene set from GRCh37 has proved to be no easy task.
After lifting our v19 models over to the new assembly we didn’t simply hope for the best. Instead, we conducted extensive quality control and checking on our remapped gene models. Any reported differences – such as gene loss, changes to the CDS, introduction of non-canonical splice sites, rearranged regions, closing / introduction of gaps – were manually investigated by a member of the HAVANA team, and the annotation altered where required. This wasn’t trivial. In total, thousands of QC checks were made by our annotators over a period of several weeks. It also proved necessary to completely re-evaluate several regions of the genome that have undergone substantial changes to their sequence, in particular on chromosomes 1, 9 and X. Such regions are typically repetitive, containing duplicate copies of genes and pseudogenes, and new loci have appeared on GRCh38 that were not present on GRCh37. New annotation has also been provided for alternative haplotype regions such as the LRC / KIRs mentioned above. The manual investigation of such regions is time-consuming, however, and while certain haplotypes – including all KIR / LRCs – have been subjected to manual annotation, the fruits of this process will not be apparent for all regions until GENCODE v21.
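To give a flavour of the kind of differences described above, here is a minimal sketch in Python of an automated pre-screen that flags gene loss, CDS length changes and non-canonical splice sites so that a person can then investigate them. It is not the HAVANA or Ensembl QC pipeline: the `Model` class, the `qc_report` function and the toy sequence are all hypothetical, only GT-AG introns are treated as canonical, and everything is assumed to sit on the forward strand with 0-based, half-open coordinates.

```python
from dataclasses import dataclass

# Simplifying assumption: only GT-AG introns count as canonical here.
CANONICAL = ("GT", "AG")


@dataclass(frozen=True)
class Model:
    """Hypothetical, heavily simplified gene model (forward strand only)."""
    gene_id: str
    exons: tuple      # sorted (start, end) pairs, 0-based, half-open
    cds_length: int


def intron_motifs(model, sequence):
    """Return the (donor, acceptor) dinucleotides of every intron."""
    motifs = []
    for (_, left_end), (right_start, _) in zip(model.exons, model.exons[1:]):
        donor = sequence[left_end:left_end + 2]
        acceptor = sequence[right_start - 2:right_start]
        motifs.append((donor, acceptor))
    return motifs


def qc_report(old_models, new_models, new_sequence):
    """List (gene_id, issue) pairs that would need manual investigation."""
    issues = []
    for gene_id, old in old_models.items():
        new = new_models.get(gene_id)
        if new is None:
            issues.append((gene_id, "gene lost"))
            continue
        if new.cds_length != old.cds_length:
            issues.append((gene_id, "CDS changed"))
        if any(m != CANONICAL for m in intron_motifs(new, new_sequence)):
            issues.append((gene_id, "non-canonical splice site"))
    return issues


# Toy "new assembly": two 5 bp exons separated by an 8 bp GT...AG intron.
sequence = "ATGCC" + "GTAAAAAG" + "CCTGA"

old_models = {
    "A": Model("A", ((0, 5), (13, 18)), 10),
    "B": Model("B", ((0, 5), (13, 18)), 10),
    "C": Model("C", ((0, 5), (13, 18)), 10),
}
new_models = {
    "A": Model("A", ((0, 5), (13, 18)), 10),  # remapped cleanly
    # "B" failed to remap at all
    "C": Model("C", ((0, 4), (13, 18)), 9),   # first exon shortened by 1 bp
}

issues = qc_report(old_models, new_models, sequence)
for gene_id, issue in issues:
    print(gene_id, issue)
```

In this toy case gene A passes, gene B is reported as lost, and gene C is flagged twice because its shifted exon boundary changes both the CDS length and the donor dinucleotide. A screen like this only produces the list of things to look at; deciding what to do about each one is the manual part.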
Once the manual annotation checking was completed it was handed over to the Ensembl team whose genebuilders merged it with their own annotation to create the final GENCODE 20 gene set. Further analysis of GRCh38 has been provided by Ensembl’s comparative genomics, variation, and regulation teams to provide a full analysis of this new version of the genome.
So to summarise our experience: the projection of annotation from one version of a genome to another can be successful, provided (1) you have a rigorous QC pipeline in place, and (2) you have the resources to deal with the issues that QC raises. That said, it would probably be fair to say that our annotators are not in a hurry to repeat this process any time soon.
One final point: GENCODE v20 is not different from v19 simply due to the GRCh37-38 transfer; v20 would still have been substantially different even if it had been generated on GRCh37. This is because HAVANA annotation on the human gene set is ongoing, while Ensembl continue to make improvements to their genebuilding pipeline. Most obviously, the number of protein-coding genes continues to fall (20,345 in v19 to 19,942 in v20), while the number of lncRNAs continues to rise (13,870 to 14,229). In the former case, the fall is largely explained by a reduction in the number of protein-coding genes that are present in the Ensembl genebuild but not the HAVANA annotation. Indeed, we expect the number of protein-coding genes will fall further still when GENCODE v21 is released towards the end of the year. This is because we have an ongoing drive to remove ‘orphan’ proteins from the gene set, which we define as CDS that lack appreciable conservation. Our policy now is to only keep an orphan CDS where there is specific experimental evidence for the existence of the protein product. Something else to watch out for: while the number of lncRNAs shows a net increase, we have also been removing lncRNAs that – with the benefit of RNA-seq – are now seen to be part of the extended 3’ UTR sequence of protein-coding genes found a short distance upstream.
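If you would like to tally gene counts like these yourself, the sketch below counts `gene` records by their `gene_type` attribute in GTF-formatted lines, which is how GENCODE GTF files label biotypes. The function name and the sample lines are made up for illustration, and how individual `gene_type` values are grouped into broader categories such as "lncRNA" is left to you, so don't expect this alone to reproduce the exact figures quoted above.

```python
import re
from collections import Counter

# Matches the gene_type attribute in the ninth GTF column.
GENE_TYPE = re.compile(r'gene_type "([^"]+)"')


def count_gene_types(gtf_lines):
    """Tally gene records by gene_type; other feature types are ignored."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_TYPE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up example lines in GTF layout (not real annotation).
sample = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2"; gene_type "lincRNA";',
    'chr2\tENSEMBL\tgene\t50\t700\t.\t+\t.\tgene_id "G3"; gene_type "protein_coding";',
]

counts = count_gene_types(sample)
print(dict(counts))
```

To run it on a real file, pass an open file handle instead of the list, for example `count_gene_types(open(path))`, where `path` points at an uncompressed GTF you have downloaded.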
If you have any questions about Ensembl, the new GRCh38 assembly or GENCODE v20 please feel free to email firstname.lastname@example.org