Comparing different publicly available genesets against GENCODE 7

We consider GENCODE to be the ideal geneset for analysing transcriptional diversity. In this post, I aim to justify this statement, and to compare and contrast both the protein coding and non-coding transcript content of GENCODE with the four other major human genesets that are publicly available: UCSC, CCDS, RefSeq and AceView.

As you can see in graph 1, all five genesets agree that the human genome contains ~20,000 protein coding genes. However, graph 2 shows that there is a huge disparity in the number of protein coding transcripts that each geneset contains:

Graphs 1 & 2: Comparison of protein coding genes (1) and transcripts (2) in publicly available genesets.

 

GENCODE contains 140,066 transcripts at coding loci. This figure is eclipsed only by AceView. However, unlike GENCODE, AceView is not manually curated and has a propensity to attach a CDS to transcripts. As a result, the AceView figure is likely to be an overestimate, because ORFs are spuriously called at loci that GENCODE properly annotates as lncRNAs and pseudogenes. Additionally, GENCODE contains the most protein coding transcripts per locus, at approximately 6. However, note that not all GENCODE transcripts are full length: if an annotated transcript is partial, it is tagged with ‘start_not_found’ or ‘end_not_found’ to highlight this to the user:

Graph 3: Mean number of protein coding transcripts at coding loci across publicly available datasets.
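For readers who want to reproduce a per-locus mean like the one plotted in graph 3 on their own annotation, here is a minimal Python sketch. It assumes you have already extracted a mapping from transcript ID to gene ID for the protein coding transcripts of one geneset. The function name `mean_transcripts_per_locus` and the input layout are hypothetical and are not taken from the analysis behind this post.

```python
from collections import Counter


def mean_transcripts_per_locus(transcript_to_gene):
    """Return the mean number of transcripts per locus (gene).

    transcript_to_gene: dict mapping transcript ID -> gene ID.
    This input layout is an assumption made for illustration.
    """
    # Count how many transcripts are attached to each gene.
    per_locus = Counter(transcript_to_gene.values())
    if not per_locus:
        return 0.0
    # Total transcripts divided by the number of loci that have at least one.
    return sum(per_locus.values()) / len(per_locus)


# Toy input: gene g1 has two transcripts, gene g2 has one.
toy = {"t1": "g1", "t2": "g1", "t3": "g2"}
mean = mean_transcripts_per_locus(toy)
```

Note that loci with no transcripts of the chosen type do not appear in the mapping, so they do not enter the denominator. Whether that matches the figures plotted here depends on how the original counts were made.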

 

When considering non-coding loci, there are three major human genesets that are publicly available: GENCODE, UCSC and RefSeq. Within these, the manually annotated GENCODE geneset is the largest with 9,640 loci, compared with 6,056 in UCSC genes and 4,888 in RefSeq. GENCODE also contains the most alternative splicing within lncRNA loci.

Graphs 4, 5 & 6: Comparison of lncRNA genes (4), transcripts (5) and mean number of transcripts per lncRNA locus (6) across publicly available datasets.

 

Users are likely wondering how many of these transcripts are contained in all datasets and how many are unique to a specific geneset. We combined the protein coding and non-coding transcript datasets of GENCODE, UCSC and RefSeq and looked for matching transcripts. Two multi-exonic transcripts were considered to match if all their exon junction coordinates were identical; two single exon transcripts were considered to match if their transcript coordinates were the same. By far, the geneset with the most unique transcripts is GENCODE, with 91,043:

Figure 1: Venn diagram to compare all transcripts within the GENCODE, RefSeq and UCSC genesets.
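The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used to produce the Venn diagram. The function names (`transcript_key`, `count_overlaps`), the input layout, and the choice to include chromosome and strand in the comparison are all assumptions of mine.

```python
from collections import defaultdict


def transcript_key(chrom, strand, exons):
    """Build a hashable key so that matching transcripts share a key.

    exons: list of (start, end) coordinate pairs for one transcript.
    Including chrom and strand in the key is an assumption; the rule
    in the text only mentions coordinates.
    """
    exons = sorted(exons)
    if len(exons) == 1:
        # Single exon transcript: compare the transcript coordinates.
        return (chrom, strand, "single", exons[0])
    # Multi-exonic transcript: compare only the exon junctions, i.e. the
    # end of each exon paired with the start of the next one. The outer
    # transcript start and end are deliberately ignored.
    junctions = tuple(
        (exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)
    )
    return (chrom, strand, "multi", junctions)


def count_overlaps(genesets):
    """Count distinct transcripts for every combination of genesets.

    genesets: dict mapping geneset name -> list of (chrom, strand, exons).
    Returns a dict mapping frozenset of geneset names -> count, which
    corresponds to the regions of a Venn diagram.
    """
    membership = defaultdict(set)
    for name, transcripts in genesets.items():
        for chrom, strand, exons in transcripts:
            membership[transcript_key(chrom, strand, exons)].add(name)
    counts = defaultdict(int)
    for names in membership.values():
        counts[frozenset(names)] += 1
    return dict(counts)


# Toy data: the two-exon transcripts differ at their outer ends but share
# the same junction, so they match; the single exon transcripts differ.
toy = {
    "X": [("chr1", "+", [(100, 200), (300, 400)]), ("chr1", "+", [(500, 600)])],
    "Y": [("chr1", "+", [(50, 200), (300, 450)]), ("chr1", "+", [(500, 650)])],
}
regions = count_overlaps(toy)
```

Because multi-exonic transcripts are compared on junctions alone, two models that differ only in how far their first or last exon extends count as the same transcript. Single exon transcripts have no junctions, so they are compared on their full coordinates instead.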

 

Transcriptional complexity is the biological reality, and GENCODE is the best geneset in which to view and analyse this complexity.

3 thoughts on “Comparing different publicly available genesets against GENCODE 7”

  1. Jessica says:

    Thanks so much for making this very timely post and saving me a ton of effort (otherwise I would have had to figure out how to generate the same data/figures myself).

  2. David Managadze says:

    GENCODE is a merge of automatic (Ensembl) and manual (HAVANA) annotations.
    RefSeq is only a manual annotation from NCBI. If you want to do correct comparison, you need to add Gnomon models to it. Gnomon models come from NCBI’s automatic Genome Annotation Pipeline (aka Gpipe) and can be downloaded from their FTP (along with RefSeq).
