On the annotation of functionality in GENCODE (or: our continuing efforts to understand how a television set works).

Our readers may have seen the ENCODE project discussed once again in the news this past week, and not always in a favourable light. We are, of course, referring to the Genome Biology and Evolution publication “On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE”, by Dan Graur et al. Professor Graur, it seems fair to say, is no fan of ENCODE (or at least the conclusions drawn from the data), and his manuscript has certainly prompted debate in the genomics community. The crux of the argument presented is that, in using a definition of functionality unlinked to evolutionary biology, ENCODE has massively overstated the proportion of the human genome that is genuinely functional. We will leave it to others to discuss Graur’s opinions on aspects of ENCODE that do not relate directly to gene annotation, such as epigenomics and transcription factor binding site mapping. Here, we will stick to home turf, the GENCODE geneset.

If you’re reading this, you probably know that GENCODE was created to provide a reference set of gene annotation for the ENCODE consortium. Graur et al. do not in fact mention GENCODE directly in their publication. In their words “We shall only deal with a single article (The ENCODE Project Consortium 2012) out of more than 30 that have been published since the 6 September 2012 release”. The GENCODE companion paper can be found here. Since Graur et al. discuss human transcription at length, we are thus concerned that people who read their article may get the wrong impression about how the human genome is actually being annotated. So here are some of our thoughts, structured as a Birney-style Q & A.

Q1: Does GENCODE believe that 80% of the genome is functional?

As noted, we will only discuss here the portion of the genome that is transcribed. According to the main ENCODE paper, while 80% of the genome appears to have some biological activity, only “62% of genomic bases are reproducibly represented in sequenced long (>200 nucleotides) RNA molecules or GENCODE exons”. In fact, only 5.5% of this transcription overlaps with GENCODE exons. So we have two things here: existing GENCODE models largely based on mRNA / EST evidence, and novel transcripts inferred from RNAseq data. The suggestion, then, is that there is extensive transcription occurring outside of currently annotated GENCODE exons.

Q2: Does GENCODE believe that the 62% of the genome that is transcribed is functional?

We don’t know. It may be that less than 62% of this transcription is genuinely functional, although this will only be established through further investigation. Our attitude is that capturing the entire human transcriptome is a necessary first step in describing its complete functional content. It seems reasonable to regard this previously unidentified transcription as putatively functional and then to test its functionality with whatever techniques become available. We consider the inferred proposition of Graur et al. – that the bulk of this transcription should simply be rejected on the grounds that it is not conserved – to be illogical. This is in part informed by our knowledge of lineage specific biology, as will be discussed.

At the present time, we can provide informed speculation regarding the nature of the RNAseq transcription that does not overlap with GENCODE exons. For more information, read the ENCODE companion paper by Djebali et al. This transcription covers approximately one third of the intergenic space with respect to GENCODE genes, and it may be that the bulk of it falls into the lncRNA category (see below) or else could represent extensions to existing genes (e.g. UTR sequence). It is also apparent that a proportion of this transcription occurs within enhancer regions, although the functional relevance of this phenomenon is not yet clear. The transcription that falls within the boundaries of GENCODE genes is likely to represent a mixture of antisense transcription, retained intron sequence and novel exons.

Q3: Is GENCODE going to simply convert this missing RNAseq coverage into transcript models?

At the present time we are actively engaged in trying to incorporate RNAseq data into novel transcript loci through manual annotation. We are doing this by combining RNAseq mapping with CAGE and polyAseq mapping. The latter two techniques give us the necessary confidence that we are identifying true transcript start and end points, respectively. We are targeting RNAseq regions that show strong, consistent expression in multiple tissues, and our initial projection is that several thousand new transcript models will be generated. In parallel, we will also seek to generate novel transcript models within existing genes. This will be done using the HAVANA manual annotation guidelines, which means that the functional potential of each model will be assessed.
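The idea of using CAGE and polyAseq to support transcript start and end points can be illustrated with a toy sketch. This is not the HAVANA procedure: the `Candidate` class, the `(chrom, strand, position)` site tuples, the function names and the 50-base window are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate transcript model built from RNAseq coverage (hypothetical)."""
    chrom: str
    start: int   # lowest genomic coordinate
    end: int     # highest genomic coordinate
    strand: str  # "+" or "-"


def has_nearby_site(sites, chrom, strand, position, window):
    """True if any (chrom, strand, pos) site lies within `window` bases of position."""
    return any(
        c == chrom and s == strand and abs(p - position) <= window
        for c, s, p in sites
    )


def end_support(candidate, cage_peaks, polya_sites, window=50):
    """Return (five_prime_supported, three_prime_supported) for a candidate.

    CAGE peaks are checked against the 5' end and polyA sites against the
    3' end; on the minus strand the 5' end is the higher coordinate.
    The window size is an arbitrary illustrative value.
    """
    if candidate.strand == "+":
        five_prime, three_prime = candidate.start, candidate.end
    else:
        five_prime, three_prime = candidate.end, candidate.start
    return (
        has_nearby_site(cage_peaks, candidate.chrom, candidate.strand, five_prime, window),
        has_nearby_site(polya_sites, candidate.chrom, candidate.strand, three_prime, window),
    )
```

A candidate for which both values are true has independent evidence for both of its ends; one with only RNAseq coverage does not, and under this sketch would be a weaker basis for a new model.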

Q4: What is a functional transcript?

Firstly, we note that there is a conceptual difference between functional genome sequence and a functional transcript. We thus do not define a functional transcript as one that has experienced evolutionary constraint (though the corresponding genome sequence may have). Instead, we begin by describing a transcript as functional when it makes a contribution to phenotypic complexity; most obviously a transcript that is translated to protein, or a lncRNA that functions in gene regulation (Graur et al. do not discuss lncRNA; more on that later). However, this basic definition is challenged by our developing understanding of transcriptional complexity. In particular, there is evidence that certain types of transcription may be indirectly functional. For example, there is evidence that an increasing number of genes can switch to a non-productive splicing pathway in order to dampen protein production. This can happen through the generation of transcripts that are targets for the nonsense-mediated decay (NMD) pathway. While the NMD transcript is clearly the result of a functional process, it may not itself be a functional transcript in that its ‘role’ is simply to be degraded. Similar arguments have been put forward for certain classes of lncRNA, for example antisense transcription that is commonly seen to occur at the promoters of protein coding genes. The functionality of this process may be to open out chromatin; the transcripts themselves could perhaps be considered a mere by-product. Note that if such indirect functionality (for want of a better term) is conserved between species, the fact that the transcript molecule is of secondary consequence with regard to the act of its transcription may free it from strong (i.e. detectable) purifying selection. If this is the case, the definition of functionality used by Graur et al. would not be of great use in studying such transcription. 
In short, it is our opinion that Graur et al.’s definition is undeniably powerful in certain contexts, although we find it too restrictive for the modern era.

Q5: Is there such a thing as a non-functional transcript?

Absolutely. Of particular relevance, gene expression is ultimately a stochastic process, and both transcription and splicing are known to be error prone (or ‘noisy’). This can lead to (for example) intron retention, exon skipping and the use of de novo splice sites.

Q6: So why does GENCODE ignore the fact that transcription may be non-functional?

We do not. It is true that the remit of GENCODE is to capture all human transcripts. However, at no point in our 10 years of annotating the human genome have we assumed that all transcription is functional. We actually go to great lengths to try and identify transcripts that are potentially non-functional (and we also characterise transcripts likely formed as technical artifacts of the experimental process). In fact, we believe that the major challenge of gene annotation projects is to identify the functional component of the transcriptome, and by doing so it follows that you are reciprocally identifying transcripts that do not appear to be functional (of course, philosophically speaking one cannot prove that a transcript is non-functional).

Most GENCODE models have been constructed manually by gene annotators in the HAVANA team; the remainder are automated models generated by Ensembl. At the locus level, we classify protein coding genes, lncRNAs and pseudogenes. However, our annotation is transcript-centric, and each transcript within a locus has its own characterisation for potential biological function. This means, for example, protein coding genes commonly contain transcripts that have not been annotated with protein CDS. In particular, we have 25,279 transcripts classed as retained introns. This set of models can be presumed to largely consist of transcripts where retention is due to the failure of the spliceosome to initiate or complete the splicing of that intron (or perhaps due to the contamination of a cytoplasmic RNA preparation with nuclear RNA). In other words, we have in effect annotated these models as putatively non-functional (though we note there may be a subset of functional retained introns). Our users are then free to remove this set of transcripts from their whole-transcriptome analyses. This is merely one example of this process.
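As a minimal sketch of how a user might remove this set of transcripts, the snippet below filters lines of a GTF file on their transcript biotype. It assumes the annotation is in GTF format and that each transcript-level line carries a `transcript_type "retained_intron"` attribute in its ninth column; check the attribute name against the release you are using. The function names are hypothetical.

```python
import re

# Matches quoted GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_type(gtf_line):
    """Return the transcript_type attribute of a GTF line, or None if absent."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    attrs = dict(ATTR_RE.findall(fields[8]))
    return attrs.get("transcript_type")


def drop_transcript_type(lines, unwanted="retained_intron"):
    """Yield GTF lines, skipping features whose transcript_type equals `unwanted`.

    Comment lines and gene-level lines (which have no transcript_type in
    this sketch) are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#") or transcript_type(line) != unwanted:
            yield line
```

Used on an open file handle, `drop_transcript_type(handle)` streams the annotation without loading it into memory. The same function can be pointed at any other transcript category by changing `unwanted`.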

Q7: Is functional annotation in GENCODE complete?

No. This fact is of undeniable significance, and we are continuing to work hard to improve our models. We reiterate that capturing the complete human transcriptome is a necessary first step towards obtaining its full functional description, and that this process will not be completed in the immediate future.

Q8: How is the functional annotation of GENCODE proceeding?

Along several lines, each of which would require its own blog post to do it justice (and we’ll do this in due course…). However, we will summarise here. Firstly, Graur et al. are of course correct that evolutionary conservation is a powerful metric for assessing functionality. That’s why we use it. In fact, we use conservation in a wider context than that discussed by Graur et al., who would appear to consider only the measurement of evolutionary constraint on genome sequence. In fact, conservation can also be explored at the transcript level. Consider a hypothetical three exon gene. By examining the conservation of the genome sequence, we then see that there is a three exon CDS across this locus that is found in all vertebrate genomes available. So we are highly confident this CDS is genuine based on the conservation proxy. Now imagine the gene is alternatively spliced, as 95%+ of human multiexon genes are. Specifically, imagine a second transcript that contains exons 1 and 3 though skips exon 2 without disrupting the reading frame. What can an examination of conservation at the genome sequence level tell us about the functionality of this second transcript? Effectively nothing, as it can’t distinguish between the two transcripts (there may be conserved splicing motifs controlling the exon skip, although we can’t yet identify these with confidence). Instead, the functional validity of the exon skip can be inferred from the observation that the alternative splicing event itself is widely conserved. By assessing splicing conservation at the transcript level (in combination with an examination of exon conservation at the genome level), we recently estimated that around half of human protein coding genes have more than one CDS that is conserved in mouse. We would not have been able to do this solely by examining constraint on genome sequence.
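The hypothetical three exon gene can be written down as a short sketch. It is not the method behind the estimate quoted above: it assumes exons have already been matched between species and given shared labels (in practice the hard part), and the function names are invented for illustration.

```python
def skip_preserves_frame(skipped_cds_length):
    """An internal exon can be skipped in frame only if its coding length
    is a multiple of three. This ignores any stop codon that might be
    created at the new exon junction."""
    return skipped_cds_length % 3 == 0


def has_junction(transcripts, upstream, downstream):
    """True if any transcript splices `upstream` directly to `downstream`.

    Each transcript is an ordered tuple of exon labels.
    """
    return any(
        a == upstream and b == downstream
        for exons in transcripts
        for a, b in zip(exons, exons[1:])
    )


def skip_is_conserved(human, mouse, upstream, downstream):
    """The exon skip counts as conserved at the transcript level only if
    the skipping junction is observed in both species."""
    return has_junction(human, upstream, downstream) and has_junction(
        mouse, upstream, downstream
    )
```

For example, with `human = [("e1", "e2", "e3"), ("e1", "e3")]`, the skip of `e2` is conserved if mouse also has an `("e1", "e3")` transcript and not if mouse only has `("e1", "e2", "e3")`. Genome-level constraint on the three exons would look identical in both cases, which is the point made above.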

Furthermore, conservation is not the only judge of functionality. Indeed, the existing trend is for high throughput experimental techniques to keep improving and provide ever more precise organism/tissue/cell-specific data to describe the existence and function of a given genome and transcriptome. We believe these data will eventually supersede conservation as a test for functionality. Projects like ENCODE are a step on this path and, like the initial human genome project, a driver of the technology. We currently use a wide variety of next generation datasets to test lineage specific transcripts and / or CDS, such as ribosome profiling data, mass spectrometry, CAGE mapping, polyAseq mapping and gene expression profiling via RNAseq.

Q9: Does GENCODE then disagree that conservation should be the ultimate judge of functionality?

As noted, conservation is undoubtedly important. However, it is limited in that it cannot judge lineage specific biology. For example, we have previously estimated that a third of our human CDS are not conserved in mouse. The position of Graur et al. would seem to be that these CDS are therefore not likely to be genuinely functional. It may indeed be that many of them are not functional. Rather than simply discard them, however, we think it makes more sense to take these putative CDS and scrutinise them with modern technologies. A human, after all, is not a mouse, and if we want to know what makes these species distinct then an obvious place to start would be to see how their genomes and transcriptomes differ.

Q10: Logical fallacy alert! You are saying that (a) a human is not a mouse; (b) we’ve found a genome difference between human and mouse; (c) the difference is thus responsible for making a human not a mouse.

No. We are simply saying: let’s try to use modern scientific techniques to characterise lineage specific biology, and see what we find. These data are already proving to be highly useful. For example, ribosome profiling data can identify and validate translated regions, while RNAseq and CAGE can be combined to identify transcripts with restricted expression profiles (this can be indicative of a regulated process, which suggests a functional process). We note that many other groups are doing similar things. This is a tried and tested approach.

Q11: How can GENCODE judge the functionality of lncRNAs?

Firstly, we note that Graur et al. do not discuss lncRNA at all, so we are unaware of their thoughts on this enormous set of transcripts. However, given that the conservation levels of lncRNAs are typically low if detectable at all, we must presume they consider these transcripts as noise. In truth, at the present time, it is not straightforward to judge the functionality of lncRNAs at the whole transcriptome level for a variety of reasons. Above all, the scientific role of these transcripts is only beginning to become apparent, so in a sense annotators are waiting for our understanding of biology to catch up. However, numerous individual loci have now been studied in depth in the laboratory, while a wide range of large-scale projects have been published in the last couple of years. The emerging paradigm for lncRNA functionality based on such high quality work is that large numbers of these transcripts play a central role in development. This is in spite of the fact that lncRNA transcripts exhibit a high rate of evolutionary turnover. In fact, a link has been postulated between lncRNA evolution and the significant changes in protein-coding gene expression levels often observed between closely related species. Yes, the extent to which this is true remains to be seen. However, the fact that we don’t fully understand the nature and scope of functionality within lncRNA sets is not a valid reason to discard them. Instead, and with a risk of labouring the point, we believe that the annotation of putative lncRNA models is an important first step in subsequent attempts to elucidate their functionality.

Q12: Logical fallacy alert! You are using the fact that it is currently hard to judge lncRNA functionality to infer that they must all be functional.

No. We think that enough is known or suspected about lncRNA functionality to make the logical scientific position clear: let’s not discard them on the basis of their lack of conservation, especially as many poorly conserved lncRNAs are known to be functional. Instead, we’ll look at them in more detail using other methodologies and see what we find.

Q13: According to Graur et al., pseudogenes “… have always been looked upon with suspicion and wished away.” Why has GENCODE ignored pseudogenes?

We haven’t (although we do agree that pseudogenes remain unappreciated by many scientists). Happily, we are proud to say that GENCODE has the largest publicly available collection of manually curated human pseudogenes, with 13,447 loci and counting. Indeed, we wrote about this in a dedicated ENCODE companion paper. Unfortunately, as noted above Graur et al. chose not to discuss any of the ENCODE companion papers, and we must therefore presume they are unaware of this fact.

Incidentally, we find Graur et al.’s view that a functional pseudogene is not a pseudogene by definition to be rather unhelpful. When a locus has a clearly identified function it does seem reasonable to cease referring to it as a pseudogene and reclassify it in the light of this functional information; indeed, this has already happened with a number of retrogene loci. However, retaining information on the provenance of the locus is useful – particularly where the potential for function is suspected though not experimentally confirmed – because it may inform the investigation into its functionality. For example, the locus PTENP1 is derived from a retrotransposition event and was annotated by us as a transcribed processed pseudogene of PTEN. However, the locus has subsequently been shown to regulate its parent protein-coding locus. Interestingly, PTENP1 has only been found in the human lineage and so would be considered non-functional by any measure solely relying on conservation. Secondly, Graur et al. are correct that pseudogenes are frequently transcribed, although (definitely labouring the point now) why should this transcription be presumed to be non-functional? Increasing numbers of loci are being reported where a pre-existing pseudogene has gained a novel function in a process termed ‘resurrection’. These are a fascinating set of genes. Read our paper to find out more.

Q14: So in conclusion?

(a) 62% of the genome is transcribed. This fact is a starting point, not a finish line.

(b) Conservation is a highly informative metric for inferring functionality. However, evolution has not finished, and so it is limited in scope.

(c) New technologies can allow us to explore the breadth of functionality that exists in lineage specific transcription.


10 thoughts on “On the annotation of functionality in GENCODE (or: our continuing efforts to understand how a television set works).”

  1. Let us see if I understand it right. You had a press release saying ‘junk DNA is dead and 80% of the genome is functional’. Those are very strong statements.

    Now you are backtracking and claiming that 62% of genome show signs of transcription and are potentially functional, should be studied more, etc. Nobody disagrees about that, but your new statement is much weaker than the previous bold claim. If you were not sure, why did you use strong words to whore up your project to media? It was like someone claiming that Einstein’s theory of relativity is dead, and when challenged, he makes a much weaker claim that he was doing an experiment that could potentially disprove theory of relativity.

    Don’t give me that garbage about lncRNA. LncRNAs cover at most the same genomic space as protein-coding genes, if not less. So, lncRNA+protein coding gene will take 1.5x-2x the genomic space. We were the first to discover lncRNAs in human genome 10 years back along with one of your members, and we never saw close to 62% expression of lncRNAs.

    Once again, I am not saying that much more of human genome may eventually turn out to be functional, and I would not mind if you eventually prove that 120% of human genome is functional (by some twisted math). However, I would feel more comfortable about your science, if you make the press release AFTER having definitive proof of your big claims.

    P. S. Could you please disclose, how much your project gets from ENCODE?

    1. Thank you for your comment.

      The difference between 80% and 62% is due to the fact that we discuss here only genome sequence for which there is evidence of transcription. Both of these values are found in the ENCODE main paper (and companion papers), so this is not backtracking. We don’t discuss e.g. transcription factor binding sites as our group did not do this work. So that isn’t to say we disagree with the functionality of this 18%, rather that we are leaving it to others to discuss.

      Clearly, you were not happy with the press release. I’m not going to try and change your mind on that, though I would suggest you focus on the published science itself (ideally, all of the companion papers). In this way, you’ll see that the message that this 62% needs to be studied in more detail before we can be sure what’s going on is very much present and correct. If you wish to disagree with the published data, then please highlight specific aspects of the methodology you believe to be flawed. We’d be happy to debate this further.

      Regarding lncRNAs, our techniques to study these transcripts are much more sophisticated than they were 10 years ago. Also, lncRNA expression levels are on average lower than for protein coding transcripts, and appear generally more likely to show restricted expression. People are now focusing their efforts on specific cell types, particularly those involved in development. The book on lncRNA has yet to be written. We’re doing what we can to help write it.

      On your final point, GENCODE is an ENCODE project. We are authors on the main ENCODE paper. I don’t see what discussing our precise levels of funding would add to the debate.



      1. Yes, the “press release” again. The press release misleading the press was issued by ENCODE. The press just improved the ENGLISH.

        Why hasn’t even one of the cowards/minions of ENCODE protested the press release by the ENCODE central committee before the critics raised their voices and pointed to the folly of the entire interpretative exercise?

        Also, isn’t it time to distinguish between “activity” and “function”?

        Dan Graur

  2. Christian Dina says:

    I was wondering how it comes that genetic association was not mentioned in this controversy as a possible judge for functionality. In my mind, it is not so different from conservation.
    Obviously, there is a long way before finding the causal variants (and if we get rid of the rare alleles synthetic association theory) in association spots. However, it seems rather obvious that a lot of associations occur in non-genic regions (to make it short).

  4. leew says:

    Transcription vs. function.

    According to RefSeq data, about 42% of the human genome is defined as genic region (exons + introns). Majority (95%) of these transcribed regions are introns. If we follow your logic that transcription indicates function, we would have to conclude almost every single base in introns are functional???

    1. Thank you for your comment. In fact, we do not believe that transcription necessarily indicates function, since non-functional transcription is clearly a real phenomenon. For this reason, we do not claim to know what proportion of the transcriptome is functional. We are currently trying to find the best methods to answer this question.

      Are introns functional? This is an interesting question, although one which is undermined by the fact that ‘functional’ is a rather tricky concept to pin down. Certainly, introns are not functional in the same sense that a coding sequence is functional, but then introns are clearly not entirely devoid of significance either. Firstly, they contain regulatory sequences such as splicing signals. Secondly, the size of the intron is known to have an effect on the efficiency of the splicing reaction. Thirdly, the presence of introns facilitates the occurrence of alternative splicing. On the other hand, it is clear that the bulk of the typical intron does not experience detectable sequence constraint, so from an evolutionary biology perspective these sequences can be readily classed as non-functional. So one could say that, while introns do have a role in generating transcriptional complexity, most of the sequence they contain is not of primary importance in allowing them to do this.

      However, our primary role in GENCODE is to judge the functionality of transcripts, which is not precisely the same thing as judging the functionality of the genome sequence. What is interesting to us is that intronic regions are commonly transcribed. What is the functional relevance of this? We know that the spliceosome does not act with complete accuracy, and it is likely that introns can be retained in transcripts due to what is essentially stochastic error, i.e. the failure of the spliceosome to complete the correct splicing reaction. Retained intron sequence will undoubtedly inflate the proportion of the genome sequence that is observed to be transcribed, making intron retention a significant source of non-functional transcription. On the other hand, an increasing number of retained intron transcripts are being shown to function as host transcripts for small RNAs. Nonetheless, our suspicion is that most retained intron transcripts likely represent ‘noise’.

      We have tried to capture such information in our geneset. GENCODE has a specific transcript category called ‘retained intron’. This means you can filter such transcripts out of our geneset if you so wish. If certain examples of these retained intron transcripts gain evidence-based support for functionality in the future, we will recharacterise them as appropriate. Secondly, we also have a specific attribute that highlights transcripts known to host small RNAs.

