Each gene model in the GENCODE dataset is given two biotypes, one at the locus level and one at the transcript level. These biotypes represent our functional annotation of genes and the transcripts they contain; a protein coding gene, for example, may contain a transcript annotated with a putative CDS. It is likely that GENCODE users are familiar with these biotypes, and a summary of each can be found here.
However, nothing is ever that simple.
The ever-expanding wealth of data available from next-generation sequencing projects can undoubtedly be of great benefit to annotation. However, these data also make life more complicated for our annotators, as they significantly increase the information content that we can attach to our models. We are working on ways to relay more information about the decisions made during the manual annotation process to our users. Currently, in addition to the information captured by biotypes and status, controlled vocabulary attributes are attached to transcripts and/or loci. These attributes describe other features relevant to the structure or functional annotation of a transcript and fall into three main categories: those that explain features related to splicing, those related to the translation of the transcript, and those related to the transcriptional evidence used to build the transcript model. A comprehensive list of all attributes used in the GENCODE annotation, along with their definitions, can be found here and here.
Let’s look at the application of attributes in practice:
Figure 1: Readthrough transcript on chromosome 16. The decision to annotate the transcript that spans the SRCAP and PHKG2 loci as a separate locus (A) is based on CAGE support for an independent TSS and confirmation of the termination site via Poly-A-seq. Attributes “CAGE supported TSS”, “readthrough” and “overlapping locus” are manually added in the Zmap transcript dialogue box (B). The “overlapping locus” attribute is also added to SRCAP and PHKG2 to relay this information to the user.
OK, so now we know about the annotation of attributes, but how can you benefit from this information? Well, the GENCODE GTF file contains all attributes given to loci and transcripts. Additionally, GENCODE is the default geneset displayed on the VEGA genome browser where both locus and transcript attributes are visible:
Figure 2: Screenshots of the VEGA genome browser showing the “overlapping locus” attribute in the gene view (A) and the “readthrough” attribute in the transcript view (B).
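If you would rather pull attributes straight out of the GTF file than browse for them, something like the following Python sketch may be a useful starting point. It assumes that attributes are stored in the ninth GTF column as repeated tag "..." entries, and that the readthrough and overlapping locus attributes are spelled readthrough_transcript and overlapping_locus; both are assumptions to check against the README.txt for your release. The function names, identifiers and example lines are invented for illustration and are not taken from a real GENCODE file.

```python
import re
from collections import defaultdict

# Matches `key "value";` or `key value;` pairs in the ninth GTF column.
_ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^;"]+?))\s*;')


def parse_attributes(column9):
    """Return {key: [values]}; a key such as `tag` may occur several times."""
    attrs = defaultdict(list)
    for match in _ATTR_RE.finditer(column9):
        key = match.group(1)
        quoted, bare = match.group(2), match.group(3)
        attrs[key].append(quoted if quoted is not None else bare)
    return dict(attrs)


def features_with_tag(lines, wanted_tag, feature_type="transcript"):
    """Yield parsed attributes for features of one type carrying wanted_tag."""
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature_type:
            continue
        attrs = parse_attributes(fields[8])
        # ASSUMPTION: attributes are recorded under the repeated key `tag`.
        if wanted_tag in attrs.get("tag", []):
            yield attrs


# Invented example lines (coordinates and identifiers are placeholders).
EXAMPLE_LINES = [
    "##description: made-up example",
    "\t".join(["chr16", "HAVANA", "transcript", "100", "900", ".", "+", ".",
               'gene_id "G_EXAMPLE_1"; transcript_id "T_EXAMPLE_1"; level 2; '
               'tag "readthrough_transcript"; tag "overlapping_locus";']),
    "\t".join(["chr16", "HAVANA", "transcript", "100", "400", ".", "+", ".",
               'gene_id "G_EXAMPLE_2"; transcript_id "T_EXAMPLE_2"; level 2; '
               'tag "overlapping_locus";']),
]

if __name__ == "__main__":
    for attrs in features_with_tag(EXAMPLE_LINES, "readthrough_transcript"):
        print(attrs["transcript_id"][0], attrs["tag"])
```

To run this over a real file, pass an open file handle in place of EXAMPLE_LINES. Values are kept as lists because the same key can appear more than once on a line, which is exactly the case for a transcript carrying several attributes.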
The total number of attributes is likely to increase as annotation examples become ever more complex. For example, we will shortly create new attributes to better capture the way RNAseq and polyAseq are used in the construction of models. We shall continue to define all public attributes in our README.txt files, publications and in future blog posts. For now, have a look at how our attributes help explain complex loci like this orphan protein, this selenoprotein and this readthrough (you can read more about this specific readthrough locus here). We believe the addition of these controlled vocabulary attributes makes an important contribution to the in-depth functional annotation provided by our geneset. If you would like additional information then please do contact us; our ears are also always open to suggestions for additional attributes that you think may be of help.