We consider GENCODE to be the ideal geneset for analysing transcriptional diversity. In this post, I aim to justify this statement, and compare and contrast both the protein coding and non-coding transcript content of GENCODE with the four other major human genesets that are publicly available: UCSC, CCDS, RefSeq and AceView…
As graph 1 shows, all five genesets agree that the human genome contains ~20,000 protein coding genes. However, graph 2 shows that there is a huge disparity in the number of protein coding transcripts that each geneset contains:
GENCODE contains 140,066 transcripts at coding loci. This figure is eclipsed only by AceView. However, unlike GENCODE, AceView is not manually curated and has a propensity to attach a CDS to transcripts. As a result, the AceView figure is likely to be incorrect, because ORFs are spuriously called at loci that GENCODE properly annotates as lncRNAs and pseudogenes. Additionally, GENCODE contains the most protein coding transcripts per locus, at approximately 6. It should be remembered, however, that not all GENCODE transcripts are full length: if an annotated transcript is partial, it is tagged with ‘start_not_found’ or ‘end_not_found’ to highlight this to the user:
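For readers who want to separate partial from full-length models in their own analysis, here is a minimal sketch of how such tags could be checked. It assumes the annotation is in a GTF-style file whose attribute column carries entries of the form `tag "..."`, and it uses the tag names exactly as quoted above; release files may spell these tags differently, so the set is a parameter. The function names are hypothetical, not part of any GENCODE tool.

```python
import re

# Tag names as quoted in the text; an assumption, since the spelling
# used in a given release file may differ.
PARTIAL_TAGS = frozenset({"start_not_found", "end_not_found"})

# Matches attribute entries such as: tag "start_not_found";
TAG_RE = re.compile(r'\btag "([^"]+)"')


def partial_tags(attributes, partial=PARTIAL_TAGS):
    """Return the partial-transcript tags found in a GTF attribute string."""
    return set(TAG_RE.findall(attributes)) & set(partial)


def is_full_length(attributes, partial=PARTIAL_TAGS):
    """True if the attribute string carries none of the partial tags."""
    return not partial_tags(attributes, partial)


# Hypothetical attribute strings, for illustration only.
partial_example = 'gene_id "G1"; transcript_id "T1"; tag "start_not_found"; tag "basic";'
complete_example = 'gene_id "G1"; transcript_id "T2"; tag "basic";'
```

Filtering on `is_full_length` before counting transcripts per locus would give a more conservative figure, at the cost of discarding partial models that may still represent real splice variants.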
When considering non-coding loci, there are three major human genesets that are publicly available: GENCODE, UCSC and RefSeq. Within these, the manually annotated GENCODE geneset is the largest with 9,640 loci, compared with 6,056 in UCSC genes and 4,888 in RefSeq. GENCODE also contains the most alternative splicing within lncRNA loci.
Users are likely wondering how many of these transcripts are shared by all datasets and how many are unique to a specific geneset. We combined the protein coding and non-coding transcript datasets of GENCODE, UCSC and RefSeq and looked for matching transcripts. Two transcripts were considered to match if all their exon junction coordinates were identical (for multi-exonic transcripts) or if their transcript coordinates were the same (for single exon transcripts). We found that the geneset with far and away the most unique transcripts is GENCODE, with 91,043:
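The matching rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the analysis: each transcript is represented as a hypothetical `(chrom, strand, exons)` tuple, chromosome and strand are included in the comparison (the text only mentions coordinates), and identical structures within one geneset are counted once.

```python
from collections import defaultdict


def transcript_key(chrom, strand, exons):
    """Build a comparison key from a transcript's structure.

    exons is a list of (start, end) pairs. Multi-exonic transcripts are
    keyed on their exon junctions, so differing 5' or 3' ends still match.
    Single exon transcripts are keyed on their full coordinates.
    """
    exons = sorted(exons)
    if len(exons) == 1:
        return (chrom, strand, "single", exons[0])
    # A junction pairs the end of one exon with the start of the next.
    junctions = tuple(
        (exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)
    )
    return (chrom, strand, "multi", junctions)


def unique_per_geneset(genesets):
    """Count transcript structures that occur in exactly one geneset.

    genesets maps a geneset name to a list of (chrom, strand, exons).
    """
    seen = defaultdict(set)
    for name, transcripts in genesets.items():
        for chrom, strand, exons in transcripts:
            seen[transcript_key(chrom, strand, exons)].add(name)
    counts = {name: 0 for name in genesets}
    for names in seen.values():
        if len(names) == 1:
            counts[next(iter(names))] += 1
    return counts


# Toy data, invented for illustration only.
example = {
    "GENCODE": [
        ("chr1", "+", [(100, 200), (300, 400)]),
        ("chr1", "+", [(100, 200), (350, 400)]),
        ("chr2", "-", [(500, 900)]),
    ],
    "RefSeq": [
        ("chr1", "+", [(90, 200), (300, 420)]),  # same junction as the first GENCODE model
        ("chr2", "-", [(500, 950)]),
    ],
    "UCSC": [],
}
```

Keying multi-exonic transcripts on junctions alone means that two models differing only in their terminal exon boundaries count as the same transcript, whereas single exon transcripts must agree on both ends.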
Transcriptional complexity is the biological reality, and GENCODE is the best geneset in which to view and analyse this complexity.