GENCODE: Why are there partial transcripts?

If you are just getting acquainted with GENCODE, you may be wondering how our genebuild differs from others that are publicly available.

 

Firstly, GENCODE annotation is based on the reference genome sequence and transcript/protein alignments, unlike the RefSeq set, which is cDNA-driven. A major advantage of genomic annotation is that it does not depend on the availability of full-length cDNAs; by using partial transcripts, e.g. ESTs, we capture a significant number of alternatively spliced variants. Ultimately, our goal is to capture the entire human transcriptome. However, the quality of the underlying genome sequence is crucial to our strategy; currently only three vertebrate genomes (human, mouse, and zebrafish) are completed to a standard that warrants manual annotation.

For example, the ST7 (“suppression of tumorigenicity 7”) locus is represented by 19 different protein-coding transcripts in the GENCODE geneset (A), compared with just 2 in RefSeq (indicated by arrows in B). However, 8 of the GENCODE transcripts are partial-length (indicated by *), having been generated from ESTs. It is our policy not to extend transcript models beyond the 5′ and 3′ limits of the evidence on which they are based, because we consider the true structure of the full-length transcripts to be unpredictable given what we know of alternative splicing. For this reason, GENCODE also contains partial coding sequences. In a later post, we will discuss why we feel it is important to capture transcriptional complexity in GENCODE.
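
If you want to see which models in a GENCODE file are partial, the following is a minimal Python sketch rather than an official tool. It assumes a GTF file that follows the tag convention used in GENCODE releases, where incomplete models carry tags such as cds_start_NF, cds_end_NF, mRNA_start_NF and mRNA_end_NF ("NF" meaning "not found"). The function name partial_tags and the two example records are hypothetical and exist only for illustration; the IDs and coordinates are not real annotation.

```python
import re

# Tags that GENCODE GTF files use to flag incomplete models ("NF" = not found).
PARTIAL_TAGS = {"cds_start_NF", "cds_end_NF", "mRNA_start_NF", "mRNA_end_NF"}

TAG_PATTERN = re.compile(r'tag "([^"]+)"')
ID_PATTERN = re.compile(r'transcript_id "([^"]+)"')


def partial_tags(gtf_line):
    """Return (transcript_id, partial tags) for a transcript line, else None.

    Hypothetical helper: comment lines, malformed lines and non-transcript
    features (gene, exon, CDS, ...) all yield None.
    """
    if gtf_line.startswith("#"):
        return None
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "transcript":
        return None
    attributes = fields[8]
    match = ID_PATTERN.search(attributes)
    if match is None:
        return None
    # Keep only the tags that mark a model as incomplete.
    tags = set(TAG_PATTERN.findall(attributes)) & PARTIAL_TAGS
    return match.group(1), tags


# Two made-up records for illustration only; IDs and coordinates are not real.
EXAMPLE_LINES = [
    'chr7\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; tag "basic";',
    'chr7\tHAVANA\ttranscript\t100\t500\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T2"; tag "cds_start_NF"; tag "mRNA_start_NF";',
]

results = [partial_tags(line) for line in EXAMPLE_LINES]
for transcript_id, tags in results:
    status = "partial" if tags else "complete"
    print(transcript_id, status, sorted(tags))
```

To run this over a real file, you would read it line by line, call partial_tags on each line, and skip the None results. Counting the flagged transcripts per gene gives a rough picture of how much of a locus is represented by partial models.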


2 thoughts on “GENCODE: Why are there partial transcripts?”

  1. Even for cDNA-driven databases like RefSeq, isn’t the categorisation of transcripts as ‘full-length’ dubious, given that most of the transcripts were annotated before the 5′ cleave-and-recap mechanism was discovered in 2009?

    1. Hi Dario, You are correct and this is always an important consideration. In this instance, we use ‘full-length’ with regard to the complete CDS, i.e. initiation codon to termination codon. The completeness of 3′ and particularly 5′ UTRs is harder to assess, and indeed variability in their lengths is known to be a source of transcriptional complexity. We are now incorporating CAGE and polyAseq data into our methodology to assist the annotation of true transcript start and end points. We plan to blog about this in the near future.
      If we can be of any more help, or you have specific examples, please contact us via the tab at the top of the page to discuss.
      Thanks,
      GG.
