The classification of genes and transcripts in Gencode
(For a definition of Gene/Transcript Biotypes in GENCODE GTF files read here)
The Gencode datasets define a transcript as a single model annotated on a genome sequence; a set of genomic coordinates that correspond to an exonic structure (note that pseudogene models are also classed as transcripts).
The Gencode datasets define a gene as a set of one or more transcripts grouped together within a single locus. A Gencode gene therefore represents a higher level of classification compared with a transcript; while each transcript within a gene shares the same gene ID number, each possesses its own unique transcript ID number. However, the categorisation of a gene directly results from the classification of the transcript(s) contained within that locus.
Transcript classification is based on several factors. First of all, each transcript is given a biotype as listed below; a categorisation that reflects its known or predicted biological significance. There are three broad categories of biotype: transcripts annotated with a coding sequence (CDS), transcripts without a CDS, and transcripts that belong to a pseudogene locus. For example, a transcript annotated with a CDS is classed with the biotype Protein coding, whereas a transcript without a CDS may belong to a particular class of long non-coding RNA molecules (e.g. biotypes Antisense or LincRNA).
Note that a transcript model may not represent the true length of the underlying mRNA; this is typically the case when the model is based on a single EST. It is Gencode policy to restrict the model to the length of the evidence on which it is based, rather than to extend the model in line with other transcripts within the locus.
List of transcript biotypes:
1. CDS categories
Protein coding: a transcript annotated with a full length or partial CDS.
NMD: a transcript annotated with a CDS predicted to induce the nonsense-mediated decay pathway. Specifically, it is required that the STOP codon is found 50 bp or more upstream of a splice junction of the transcript (see PMIDs: 12855447; 12502788; 18380348; 19543372).
NSD: a transcript annotated with a CDS predicted to induce the non-STOP decay pathway; i.e. polyadenylation features are found within the CDS (PMID: 21091502).
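The 50 bp rule for the NMD biotype can be illustrated with a minimal sketch. The function names, the use of spliced-transcript coordinates, and the choice to measure distance from the last base of the STOP codon are assumptions made for illustration; the text above does not specify how the distance is measured.

```python
from itertools import accumulate


def splice_junctions(exon_lengths):
    """Return splice junction positions in spliced-transcript coordinates.

    Each junction is reported as the 1-based position of the last base of
    an exon; the final exon has no downstream junction and is skipped.
    """
    return list(accumulate(exon_lengths))[:-1]


def is_nmd_candidate(exon_lengths, stop_codon_end, min_distance=50):
    """Return True if any junction lies min_distance or more bases
    downstream of the STOP codon.

    stop_codon_end is the 1-based transcript position of the last base of
    the STOP codon (an assumption of this sketch).
    """
    return any(
        junction - stop_codon_end >= min_distance
        for junction in splice_junctions(exon_lengths)
    )


# Hypothetical three-exon transcript: junctions fall at positions 200 and 350.
exons = [200, 150, 300]
print(is_nmd_candidate(exons, stop_codon_end=290))  # junction 60 bases downstream
print(is_nmd_candidate(exons, stop_codon_end=320))  # junction only 30 bases downstream
```

A STOP codon in the final exon can never satisfy the rule in this sketch, since no junction lies downstream of it.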
2. No CDS categories
Two categories are used specifically for non-coding transcripts that are attached to protein coding genes or pseudogene loci. Note that these classifications do not conclusively imply that the model is not protein coding, nor that it is known to be functional as a non-coding transcript.
Processed transcript: a transcript that does not contain a CDS and does not contain retained intronic sequence (see below).
Retained intron: a transcript containing transcribed intronic sequence with respect to the reference isoform of the locus. This classification depends on the absence of accompanying evidence for the presence of a CDS; for example, polyadenylation at the end of a 3′ transcribed intronic region, or the confirmation of a transcriptional start site (TSS) at the beginning of a 5′ transcribed intronic region.
In addition, there are five categories of long non-coding RNA transcripts (lncRNA). These biotypes are never used for transcripts that have been classified as objects within protein coding or pseudogene loci. Note that the TEC biotype, if applicable, would take precedence over each (see below).
lincRNA: Long intergenic non-coding RNA. A transcript that does not overlap within the start or end genomic coordinates of a coding gene or pseudogene on either strand.
Antisense: a non-coding transcript that overlaps a protein coding gene on the opposite strand, across either exonic or intronic sequence.
Sense intronic: a non-coding transcript found within an intron of a coding or non-coding gene, with no overlap of exonic sequence.
Sense overlapping: a non-coding transcript that contains a protein coding gene within its intronic sequence on the same strand, with no overlap of exonic sequence.
3′ overlapping ncRNA: a non-coding transcript found within the 3′ UTR of a larger gene. This classification requires strong support for the presence of a genuine TSS.
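The positional logic behind four of these lncRNA biotypes can be sketched as follows. This is a simplified illustration, not the annotation procedure itself: the Model class and function names are hypothetical, the comparison is made against a single neighbouring protein coding gene, exons are assumed not to overlap one another, and the 3′ overlapping ncRNA and TEC cases are left out because they depend on evidence (TSS support, polyA features) that coordinates alone cannot capture.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) genomic coordinates


@dataclass
class Model:
    """Hypothetical minimal transcript or gene model."""
    strand: str            # "+" or "-"
    exons: List[Interval]

    @property
    def span(self) -> Interval:
        return (min(s for s, _ in self.exons), max(e for _, e in self.exons))

    @property
    def introns(self) -> List[Interval]:
        ordered = sorted(self.exons)
        return [(a_end + 1, b_start - 1)
                for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer: Interval, inner: Interval) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify_lncrna(lnc: Model, gene: Model) -> Optional[str]:
    """Classify a non-coding model relative to one protein coding gene."""
    if not overlaps(lnc.span, gene.span):
        return "lincRNA"
    if lnc.strand != gene.strand:
        return "Antisense"
    # Same strand from here on: both sense categories require no exonic overlap.
    if any(overlaps(a, b) for a in lnc.exons for b in gene.exons):
        return None  # not covered by the lncRNA categories in this sketch
    if any(contains(intron, lnc.span) for intron in gene.introns):
        return "Sense intronic"
    if any(contains(intron, gene.span) for intron in lnc.introns):
        return "Sense overlapping"
    return None


gene = Model("+", [(1000, 1200), (2000, 2200), (3000, 3200)])
print(classify_lncrna(Model("-", [(1100, 1300)]), gene))               # Antisense
print(classify_lncrna(Model("+", [(1300, 1400), (1600, 1700)]), gene))  # Sense intronic
```

In practice a transcript would be compared against every nearby locus, and lincRNA would only apply if no coding gene or pseudogene overlapped on either strand.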
3. Pseudogene categories
Processed pseudogene: a locus created via retrotransposition of an mRNA into the genome sequence. Processed pseudogenes lack introns, although their structures may be disrupted by transposon sequences. If present, the polyadenylation signal of the parent gene will be annotated as a pseudo polyA signal.
Transcribed processed pseudogene: a processed pseudogene overlapped by transcriptional evidence specific to the locus. This evidence may extend beyond the boundaries of the retroinsertion event.
Unprocessed pseudogene: a locus created via a duplication event, where the CDS of the parent gene has been disrupted by truncation or deleterious mutation. The exon / intron structure of the parent gene may be partially or completely preserved.
Transcribed unprocessed pseudogene: an unprocessed pseudogene overlapped by transcriptional evidence specific to the locus (i.e. not from the parent gene).
Unitary pseudogene: a pseudogene which is seen to have a functional ortholog in a reference species (e.g. based on a human / mouse comparison). The pseudogene was not formed by a duplication event, but rather by the degradation of a protein coding gene.
Polymorphic pseudogene: a locus which is a pseudogene in the reference genome, though known to be intact in the genomes of other individuals of the same species. The annotation process has confirmed that the pseudogenisation event is not a genomic sequencing error.
4. Miscellaneous categories
TEC: Stands for To be Experimentally Confirmed. This category was initially created for the ENCODE project to highlight transcribed genomic regions potentially representing novel genes, with the expectation that their status will be confirmed in the laboratory. This category is now used for all genomes. TEC objects are independent single exon loci, being constructed based on either (1) an mRNA / cDNA that lacks polyadenylation features, or (2) a cluster of ESTs that contain polyA features.
IG gene: an immunoglobulin gene.
TR gene: a T cell receptor gene.
Following biotype classification, each transcript is also given a status as Known, Novel or Putative. The precise meaning of this status differs depending on the transcript biotype. For protein coding transcripts, the status reflects the similarity between the annotated CDS and a pre-existing model in EntrezGene or Swissprot / Uniprot (assuming one can be found). Specifically, a Known CDS is 100% identical to a RefSeq NP or Swissprot / Uniprot entry along its length; a Novel CDS shares >60% length with a Known CDS in the same gene and uses the same initiation and termination codons, or has a corresponding known paralog or ortholog; and a Putative CDS shares <60% length with a Known CDS or has an alternative first or last coding exon. When dealing with transcripts predicted to represent a shorter fragment of the entire mRNA (see Transcript biotypes), note that the similarity with a Known CDS is judged after hypothetically extending the transcript in the 5′ and / or 3′ direction along the length of the Known CDS, as appropriate. This means, for example, that a gene may include a Known CDS that contains the full length CDS corresponding to a RefSeq / SwissProt entry, and a second Known CDS that is partial in comparison to the first. Commonly, this will occur when the two models contain the same initiation codon but have distinct 5′ UTRs.
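The status rules for protein coding transcripts can be read as a small decision procedure. The sketch below is one possible reading, not a definitive implementation: the function and parameter names are hypothetical, the shared-length fraction is taken as a precomputed input, and two points the text leaves open are resolved by assumption (a CDS sharing exactly 60% is treated as Putative, and a known paralog or ortholog is taken to give Novel regardless of shared length).

```python
def cds_status(identical_to_reference,
               shared_fraction=0.0,
               same_start_and_stop=False,
               has_known_homolog=False):
    """Return 'Known', 'Novel' or 'Putative' for an annotated CDS.

    identical_to_reference: CDS is 100% identical to a RefSeq NP or
        Swissprot / Uniprot entry along its length.
    shared_fraction: fraction of length shared with a Known CDS in the
        same gene (how this is computed is outside this sketch).
    same_start_and_stop: CDS uses the same initiation and termination
        codons as that Known CDS.
    has_known_homolog: a corresponding known paralog or ortholog exists.
    """
    if identical_to_reference:
        return "Known"
    if (shared_fraction > 0.60 and same_start_and_stop) or has_known_homolog:
        return "Novel"
    # Assumption: everything else, including an alternative first or last
    # coding exon, falls through to Putative.
    return "Putative"


print(cds_status(True))
print(cds_status(False, shared_fraction=0.8, same_start_and_stop=True))
print(cds_status(False, shared_fraction=0.4, same_start_and_stop=True))
```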
For processed transcripts, the status is given as Novel or Putative. Novel is used for transcripts that contain four or more exons and / or are supported by at least one mRNA / cDNA or three ESTs, and Putative for transcripts that contain three or fewer exons and are supported by one or two ESTs. Finally, for NMD, NSD, retained introns, all lncRNA and all pseudogene biotypes, the status is always set to Novel.
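As a minimal sketch of the status rules for transcripts without a Known / Novel / Putative CDS, assuming the evidence has already been counted: the function name and parameters are hypothetical, and a processed transcript that meets none of the Novel criteria is treated as Putative.

```python
# Biotypes whose transcript status is always Novel, per the text above.
ALWAYS_NOVEL = frozenset({
    "NMD", "NSD", "Retained intron",
    # lncRNA biotypes
    "lincRNA", "Antisense", "Sense intronic", "Sense overlapping",
    "3′ overlapping ncRNA",
    # pseudogene biotypes
    "Processed pseudogene", "Transcribed processed pseudogene",
    "Unprocessed pseudogene", "Transcribed unprocessed pseudogene",
    "Unitary pseudogene", "Polymorphic pseudogene",
})


def non_coding_status(biotype, n_exons=0, n_mrna=0, n_est=0):
    """Return the transcript status for a biotype other than Protein coding.

    n_mrna counts supporting mRNA / cDNA sequences, n_est supporting ESTs.
    """
    if biotype == "Processed transcript":
        if n_exons >= 4 or n_mrna >= 1 or n_est >= 3:
            return "Novel"
        return "Putative"
    if biotype in ALWAYS_NOVEL:
        return "Novel"
    raise ValueError(f"status rule not covered by this sketch: {biotype!r}")


print(non_coding_status("Processed transcript", n_exons=5, n_est=1))  # Novel
print(non_coding_status("Processed transcript", n_exons=2, n_est=1))  # Putative
print(non_coding_status("Retained intron"))                            # Novel
```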
As with transcripts, Gencode genes are given a biotype. However, in this case only four biotypes are available: Protein coding, Processed transcript, Pseudogene and Polymorphic pseudogene. The biotype is selected based on the transcripts contained within the gene, and when a gene contains transcripts of different biotypes, it is defined according to the transcript with the highest level of classification. It is important to emphasise that the terms Protein coding, Polymorphic pseudogene and Processed transcript are used for both gene and transcript biotypes. Furthermore, while the biotype Processed transcript is used at the transcript level only for non-coding models attached to coding genes and pseudogenes, its use at the gene level is restricted to loci containing transcripts with lncRNA biotypes.
Gencode genes are also given a status as Known or Novel. Protein coding genes and processed transcripts are classed as Known when an official name and symbol for the gene is found in EntrezGene (ultimately derived from the HGNC); if this is not available the locus is classed as Novel in both cases. In contrast, the gene biotypes Pseudogene and Polymorphic pseudogene are not given an accompanying status.
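The way a gene biotype and status follow from the transcripts in a locus can be sketched as below. The function names are hypothetical, and the ordering of the hierarchy is partly an assumption: the text states only that transcripts with a CDS take precedence over those without, and the relative rank of Polymorphic pseudogene over other pseudogenes is inferred here. IG, TR and TEC loci are not handled.

```python
CDS_BIOTYPES = {"Protein coding", "NMD", "NSD"}
PSEUDOGENE_BIOTYPES = {
    "Processed pseudogene", "Transcribed processed pseudogene",
    "Unprocessed pseudogene", "Transcribed unprocessed pseudogene",
    "Unitary pseudogene",
}


def gene_biotype(transcript_biotypes):
    """Derive one of the four gene biotypes from the transcript biotypes."""
    biotypes = set(transcript_biotypes)
    if biotypes & CDS_BIOTYPES:
        return "Protein coding"
    if "Polymorphic pseudogene" in biotypes:
        return "Polymorphic pseudogene"
    if biotypes & PSEUDOGENE_BIOTYPES:
        return "Pseudogene"
    # Loci made up of lncRNA biotypes are Processed transcript at gene level.
    return "Processed transcript"


def gene_status(biotype, has_official_symbol):
    """Return 'Known', 'Novel', or None for pseudogene gene biotypes."""
    if biotype in ("Pseudogene", "Polymorphic pseudogene"):
        return None
    return "Known" if has_official_symbol else "Novel"


# Transcript biotype sets taken from the examples given in this document.
rbm39 = ["Protein coding", "NMD", "Processed transcript", "Retained intron"]
hotair = ["Antisense"]
serpinb11 = ["Polymorphic pseudogene", "Processed transcript", "Processed transcript"]

for name, transcripts in [("RBM39", rbm39), ("HOTAIR", hotair), ("SERPINB11", serpinb11)]:
    biotype = gene_biotype(transcripts)
    print(name, biotype, gene_status(biotype, has_official_symbol=True))
```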
Examples of classification
1: RBM39 contains eight models with the Protein coding transcript biotype, each of which has the transcript status Known CDS, Novel CDS or Putative CDS as applicable. The gene also contains models with the NMD, Processed transcript or Retained intron transcript biotypes. The Processed transcript models have the transcript status Novel or Putative as appropriate, whereas the NMD and Retained intron transcripts have their status set to Novel by default. The gene level biotype is set as Protein coding since the presence of transcripts with CDS takes hierarchical preference over the presence of transcripts without CDS. The gene status is set to Known since RBM39 is recognised as an official gene symbol.
2: HOTAIR contains several models, each of which has the transcript biotype Antisense (the gene is found on the opposite strand to HOXC11), and the transcript status Novel. Since this locus is not a protein coding gene or a pseudogene the gene biotype is set to Processed transcript. The gene status is set to Known since HOTAIR is recognised as an official gene symbol.
3: SERPINB11 contains a model with the Polymorphic pseudogene transcript biotype. This locus is transcribed, and contains two models with the Processed transcript biotype. The transcript status of the pseudogene model is set to Novel by default, whereas the transcript status of each Processed transcript is set to Novel since both contain more than three exons. The gene biotype is set to Polymorphic pseudogene, and there is no accompanying gene status.