If you are just getting acquainted with GENCODE, you may be wondering how our genebuild differs from others that are publicly available.
Firstly, GENCODE annotation is based on the reference genome sequence and transcript/protein alignments, unlike the RefSeq set, which is cDNA-driven. A major advantage of genomic annotation is that it does not depend on the availability of full-length cDNAs; we capture a significant number of alternatively spliced variants by using partial transcripts such as ESTs. Ultimately, our goal is to capture the entire human transcriptome. However, the quality of the underlying genome sequence is crucial to our strategy; currently only three vertebrate genomes (human, mouse, and zebrafish) have been completed to a standard that warrants manual annotation.
For example, the ST7 “suppression of tumorigenicity 7” locus is represented by 19 different protein-coding transcripts in the GENCODE geneset (A), compared with just 2 in RefSeq (indicated by arrows in B). However, 8 of the GENCODE transcripts are partial-length (indicated by *), having been generated from ESTs. It is our policy not to extend transcript models beyond the 5’ and 3’ limits of the evidence on which they are based, because we consider the true structure of the full-length transcripts to be unpredictable, given what we know about alternative splicing. For this reason, GENCODE also contains partial coding sequences. In a later post, we will discuss why we feel it is important to capture transcriptional complexity in GENCODE.
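For readers who would like to look at this kind of breakdown in an annotation file, here is a minimal Python sketch that counts protein-coding transcripts per gene and how many of them are flagged as partial. It assumes a GENCODE-style GTF in which transcript records carry `gene_name`, `transcript_type` and `tag` attributes, and in which incomplete models are marked with the tags `cds_start_NF`, `cds_end_NF`, `mRNA_start_NF` or `mRNA_end_NF`. The function name, the helper and the sample records are hypothetical: the IDs and coordinates are placeholders and do not reproduce the real ST7 annotation.

```python
import re
from collections import defaultdict

# Tags assumed to mark models whose start or end could not be confirmed.
PARTIAL_TAGS = {"cds_start_NF", "cds_end_NF", "mRNA_start_NF", "mRNA_end_NF"}

# Matches quoted key/value pairs in the GTF attribute column, e.g. gene_name "ST7"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def summarise_transcripts(gtf_lines):
    """Count protein-coding transcripts per gene and how many are partial."""
    counts = {}
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only transcript-level records are of interest here.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        # A key such as "tag" can occur several times, so collect lists.
        attrs = defaultdict(list)
        for key, value in ATTR_RE.findall(fields[8]):
            attrs[key].append(value)
        if "protein_coding" not in attrs["transcript_type"]:
            continue
        gene = (attrs["gene_name"] or ["unknown"])[0]
        entry = counts.setdefault(gene, {"total": 0, "partial": 0})
        entry["total"] += 1
        if PARTIAL_TAGS.intersection(attrs["tag"]):
            entry["partial"] += 1
    return counts


def _record(feature, attributes):
    """Build one tab-separated GTF line with placeholder coordinates."""
    return "\t".join(
        ["chr7", "HAVANA", feature, "1", "1000", ".", "+", ".", attributes]
    )


# Invented records for illustration only (placeholder IDs, not real data).
SAMPLE = [
    "##description: hypothetical example",
    _record("gene", 'gene_id "G1"; gene_type "protein_coding"; gene_name "ST7";'),
    _record("transcript", 'gene_id "G1"; transcript_id "T1"; gene_name "ST7"; '
                          'transcript_type "protein_coding";'),
    _record("transcript", 'gene_id "G1"; transcript_id "T2"; gene_name "ST7"; '
                          'transcript_type "protein_coding"; '
                          'tag "mRNA_start_NF"; tag "cds_start_NF";'),
    _record("transcript", 'gene_id "G1"; transcript_id "T3"; gene_name "ST7"; '
                          'transcript_type "retained_intron";'),
]

print(summarise_transcripts(SAMPLE))
```

To run the same summary on a downloaded annotation, the lines of the file can be passed in place of `SAMPLE`, for example by iterating over an open file handle. Other annotation sources may use different attribute names, so the keys above would need adjusting.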