Our readers may have seen the ENCODE project discussed once again in the news this past week, and not always in a favourable light. We are, of course, referring to the Genome Biology and Evolution publication “On the immortality of television sets: ‘function’ in the human genome according to the evolution-free gospel of ENCODE”, by Dan Graur et al. Professor Graur, it seems fair to say, is no fan of ENCODE (or at least the conclusions drawn from the data), and his manuscript has certainly prompted debate in the genomics community. The crux of the argument presented is that, in using a definition of functionality unlinked to evolutionary biology, ENCODE has massively overstated the proportion of the human genome that is genuinely functional. We will leave it to others to discuss Graur’s opinions on aspects of ENCODE that do not relate directly to gene annotation, such as epigenomics and transcription factor binding site mapping. Here, we will stick to home turf, the GENCODE geneset.
If you’re reading this, you probably know that GENCODE was created to provide a reference set of gene annotation for the ENCODE consortium. Graur et al. do not in fact mention GENCODE directly in their publication. In their words, “We shall only deal with a single article (The ENCODE Project Consortium 2012) out of more than 30 that have been published since the 6 September 2012 release”. The GENCODE companion paper can be found here. Since Graur et al. discuss human transcription at length, we are concerned that people who read their article may get the wrong impression about how the human genome is actually being annotated. So here are some of our thoughts, structured as a Birney-style Q & A.
Q1: Does GENCODE believe that 80% of the genome is functional?
As noted, we will only discuss here the portion of the genome that is transcribed. According to the main ENCODE paper, while 80% of the genome appears to have some biological activity, only “62% of genomic bases are reproducibly represented in sequenced long (>200 nucleotides) RNA molecules or GENCODE exons”. In fact, only 5.5% of this transcription overlaps with GENCODE exons. So we have two things here: existing GENCODE models largely based on mRNA / EST evidence, and novel transcripts inferred from RNAseq data. The suggestion, then, is that there is extensive transcription occurring outside of currently annotated GENCODE exons.
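To put those two figures side by side, here is a back-of-the-envelope calculation. It is only a sketch: it takes the 62% and 5.5% values exactly as quoted above and assumes the 5.5% is a share of the transcribed bases, not of the whole genome. The variable names are ours, chosen for illustration.

```python
# Figures quoted above from the main ENCODE paper.
transcribed_fraction = 0.62     # genomic bases covered by long RNAs or GENCODE exons
exonic_share_of_that = 0.055    # share of that coverage overlapping GENCODE exons

# Portion of the whole genome that is transcribed AND inside GENCODE exons.
within_gencode_exons = transcribed_fraction * exonic_share_of_that

# Portion of the whole genome that is transcribed but outside GENCODE exons.
outside_gencode_exons = transcribed_fraction - within_gencode_exons

print(f"Transcribed, within GENCODE exons:  {within_gencode_exons:.1%} of the genome")
print(f"Transcribed, outside GENCODE exons: {outside_gencode_exons:.1%} of the genome")
```

Under that reading, roughly 3.4% of the genome is transcribed within annotated GENCODE exons, and roughly 58.6% is transcribed outside them.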
Q2: Does GENCODE believe that the 62% of the genome that is transcribed is functional?
We don’t know. It may be that not all of this transcribed 62% is genuinely functional, although this will only be established through further investigation. Our attitude is that capturing the entire human transcriptome is a necessary first step in describing the complete functional content of the transcriptome. It seems reasonable to regard this previously unidentified transcription as putatively functional and then to test the functionality of this transcription with whatever techniques become available. We consider the inferred proposition of Graur et al. – that the bulk of this transcription should simply be rejected on the grounds that it is not conserved – to be illogical. This is in part informed by our knowledge of lineage specific biology, as will be discussed.
At the present time, we can provide informed speculation regarding the nature of the RNAseq transcription that does not overlap with GENCODE exons. For more information, read the ENCODE companion paper by Djebali et al. This transcription covers approximately one third of the intergenic space with respect to GENCODE genes, and it may be that the bulk of it falls into the lncRNA category (see below) or else could represent extensions to existing genes (e.g. UTR sequence). It is also apparent that a proportion of this transcription occurs within enhancer regions, although the functional relevance of this phenomenon is not yet clear. The transcription that falls within the boundaries of GENCODE genes is likely to represent a mixture of antisense transcription, retained intron sequence and novel exons.
Q3: Is GENCODE going to simply convert this missing RNAseq coverage into transcript models?
At the present time we are actively engaged in trying to incorporate RNAseq data into novel transcript loci through manual annotation. We are doing this by combining RNAseq mapping with CAGE and polyAseq mapping. The latter two techniques give us the necessary confidence that we are identifying true transcript start and end points, respectively. We are targeting RNAseq regions that show strong, consistent expression in multiple tissues, and our initial projection is that several thousand new transcript models will be generated. In parallel, we will also seek to generate novel transcript models within existing genes. This will be done using the HAVANA manual annotation guidelines, which means that the functional potential of each model will be assessed.
Q4: What is a functional transcript?
Firstly, we note that there is a conceptual difference between functional genome sequence and a functional transcript. We thus do not define a functional transcript as one that has experienced evolutionary constraint (though the corresponding genome sequence may have). Instead, we begin by describing a transcript as functional when it makes a contribution to phenotypic complexity; most obviously a transcript that is translated to protein, or a lncRNA that functions in gene regulation (Graur et al. do not discuss lncRNA; more on that later). However, this basic definition is challenged by our developing understanding of transcriptional complexity. In particular, there is evidence that certain types of transcription may be indirectly functional. For example, there is evidence that an increasing number of genes can switch to a non-productive splicing pathway in order to dampen protein production. This can happen through the generation of transcripts that are targets for the nonsense-mediated decay (NMD) pathway. While the NMD transcript is clearly the result of a functional process, it may not itself be a functional transcript in that its ‘role’ is simply to be degraded. Similar arguments have been put forward for certain classes of lncRNA, for example antisense transcription that is commonly seen to occur at the promoters of protein coding genes. The functionality of this process may be to open out chromatin; the transcripts themselves could perhaps be considered a mere by-product. Note that if such indirect functionality (for want of a better term) is conserved between species, the fact that the transcript molecule is of secondary consequence with regard to the act of its transcription may free it from strong (i.e. detectable) purifying selection. If this is the case, the definition of functionality used by Graur et al. would not be of great use in studying such transcription. 
In short, it is our opinion that Graur et al.’s definition is undeniably powerful in certain contexts, but we find it too restrictive for the modern era.
Q5: Is there such a thing as a non-functional transcript?
Absolutely. Of particular relevance, gene expression is ultimately a stochastic process, and both transcription and splicing are known to be error prone (or ‘noisy’). This can lead to (for example) intron retention, exon skipping and the use of de novo splice sites.
Q6: So why does GENCODE ignore the fact that transcription may be non-functional?
We do not. It is true that the remit of GENCODE is to capture all human transcripts. However, at no point in our 10 years of annotating the human genome have we assumed that all transcription is functional. We actually go to great lengths to try and identify transcripts that are potentially non-functional (and we also characterise transcripts likely formed as technical artifacts of the experimental process). In fact, we believe that the major challenge of gene annotation projects is to identify the functional component of the transcriptome, and by doing so it follows that you are reciprocally identifying transcripts that do not appear to be functional (of course, philosophically speaking one cannot prove that a transcript is non-functional).
Most GENCODE models have been constructed manually by gene annotators in the HAVANA team; the remainder are automated models generated by Ensembl. At the locus level, we classify protein coding genes, lncRNAs and pseudogenes. However, our annotation is transcript-centric, and each transcript within a locus has its own characterisation for potential biological function. This means, for example, protein coding genes commonly contain transcripts that have not been annotated with protein CDS. In particular, we have 25,279 transcripts classed as retained introns. This set of models can be presumed to largely consist of transcripts where retention is due to the failure of the spliceosome to initiate or complete the splicing of that intron (or perhaps due to the contamination of a cytoplasmic RNA preparation with nuclear RNA). In other words, we have in effect annotated these models as putatively non-functional (though we note there may be a subset of functional retained introns). Our users are then free to remove this set of transcripts from their whole-transcriptome analyses. This is merely one example of this process.
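As an illustration of how a user might remove this set, here is a minimal sketch. It assumes a GENCODE-style GTF in which the ninth column carries a `transcript_type` attribute with the value `retained_intron`; the function names and the toy records (with made-up identifiers `G1`, `T1`, `T2`) are hypothetical and not part of any GENCODE tool.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
_ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_type(gtf_line):
    """Return the transcript_type attribute of a GTF line, or None if absent."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    attrs = dict(_ATTR_RE.findall(fields[8]))
    return attrs.get("transcript_type")


def drop_retained_introns(gtf_lines):
    """Yield GTF lines, skipping features that belong to retained_intron transcripts."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        if transcript_type(line) == "retained_intron":
            continue
        yield line


# Toy records with hypothetical identifiers, for illustration only.
example = [
    "##description: toy example\n",
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; transcript_type "protein_coding";\n',
    'chr1\tHAVANA\ttranscript\t100\t700\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T2"; transcript_type "retained_intron";\n',
]
kept = list(drop_retained_introns(example))
```

Gene-level lines carry no `transcript_type` attribute and are therefore kept. Whether to drop the whole set is a decision for the analysis in question, given that some retained introns may be functional.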
Q7: Is functional annotation in GENCODE complete?
No. This fact is of undeniable significance, and we are continuing to work hard to improve our models. We reiterate that capturing the complete human transcriptome is a necessary first step towards obtaining its full functional description, and that this process will not be completed in the immediate future.
Q8: How is the functional annotation of GENCODE proceeding?
Along several lines, each of which would require its own blog post to do it justice (and we’ll do this in due course…). However, we will summarise here. Firstly, Graur et al. are of course correct that evolutionary conservation is a powerful metric for assessing functionality. That’s why we use it. Indeed, we use conservation in a wider context than that discussed by Graur et al., who would appear to consider only the measurement of evolutionary constraint on genome sequence. Conservation can also be explored at the transcript level. Consider a hypothetical three exon gene. By examining the conservation of the genome sequence, we see that there is a three exon CDS across this locus that is found in all vertebrate genomes available. So we are highly confident this CDS is genuine based on the conservation proxy. Now imagine the gene is alternatively spliced, as more than 95% of human multiexon genes are. Specifically, imagine a second transcript that contains exons 1 and 3 but skips exon 2 without disrupting the reading frame. What can an examination of conservation at the genome sequence level tell us about the functionality of this second transcript? Effectively nothing, as it can’t distinguish between the two transcripts (there may be conserved splicing motifs controlling the exon skip, although we can’t yet identify these with confidence). Instead, the functional validity of the exon skip can be inferred from the observation that the alternative splicing event itself is widely conserved. By assessing splicing conservation at the transcript level (in combination with an examination of exon conservation at the genome level), we recently estimated that around half of human protein coding genes have more than one CDS that is conserved in mouse. We would not have been able to do this solely by examining constraint on genome sequence.
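The hypothetical three exon gene can be made concrete with a small sketch. This is a toy illustration, not our annotation pipeline: the coordinates, the gene labels and the function names are invented, and the comparison between species assumes that exons have already been matched to their orthologues, which is the hard part in practice.

```python
def exon_length(exon):
    """Length of a CDS exon given as (start, end), 1-based inclusive."""
    start, end = exon
    return end - start + 1


def skip_preserves_frame(cds_exons, skipped):
    """True if removing the exon at index `skipped` keeps the reading frame.

    Assumes every exon in `cds_exons` is entirely coding, so the frame is
    preserved exactly when the skipped length is a multiple of three.
    """
    return exon_length(cds_exons[skipped]) % 3 == 0


def conserved_skips(human_skips, mouse_skips):
    """Skip events, as (gene, exon_number) pairs, observed in both species."""
    return set(human_skips) & set(mouse_skips)


# Hypothetical three exon CDS: lengths 150, 120 and 150 nucleotides.
cds = [(101, 250), (401, 520), (701, 850)]
in_frame = skip_preserves_frame(cds, skipped=1)

# Hypothetical skip events seen in transcript data from each species.
shared = conserved_skips({("GENE_A", 2), ("GENE_B", 4)}, {("GENE_A", 2)})
```

The frame check uses only the genome-level exon structure and cannot say whether the skip is real or functional. The second function reflects the transcript-level question: is the same splicing event seen in both species?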
Furthermore, conservation is not the only judge of functionality. Indeed, the existing trend is for high throughput experimental techniques to keep improving and provide ever more precise organism/tissue/cell-specific data to describe the existence and function of a given genome and transcriptome. We believe these data will eventually supersede conservation as a test for functionality. Projects like ENCODE are a step on this path and, like the initial human genome project, a driver of the technology. We currently use a wide variety of next generation datasets to test lineage specific transcripts and / or CDS, such as ribosome profiling data, mass spectrometry, CAGE mapping, polyAseq mapping and gene expression profiling via RNAseq.
Q9: Does GENCODE then disagree that conservation should be the ultimate judge of functionality?
As noted, conservation is undoubtedly important. However, it is limited in that it cannot judge lineage specific biology. For example, we have previously estimated that a third of our human CDS are not conserved in mouse. The position of Graur et al. would seem to be that these CDS are therefore not likely to be genuinely functional. It may indeed be that many of them are not functional. Rather than simply discard them, however, we think it makes more sense to take these putative CDS and scrutinise them with modern technologies. A human, after all, is not a mouse, and if we want to know what makes these species distinct then an obvious place to start would be to see how their genomes and transcriptomes differ.
Q10: Logical fallacy alert! You are saying that (a) a human is not a mouse; (b) we’ve found a genome difference between human and mouse; (c) the difference is thus responsible for making a human not a mouse.
No. We are simply saying: let’s use modern scientific techniques to characterise lineage specific biology, and let’s see what we find. These data are already proving to be highly useful. For example, ribosome profiling data can identify and validate translated regions, while RNAseq and CAGE can be combined to identify transcripts with restricted expression profiles (this can be indicative of a regulated process, which suggests a functional process). We note that many other groups are doing similar things. This is a tried and tested approach.
Q11: How can GENCODE judge the functionality of lncRNAs?
Firstly, we note that Graur et al. do not discuss lncRNA at all, so we are unaware of their thoughts on this enormous set of transcripts. However, given that the conservation levels of lncRNAs are typically low if detectable at all, we must presume they consider these transcripts as noise. In truth, at the present time, it is not straightforward to judge the functionality of lncRNAs at the whole transcriptome level for a variety of reasons. Above all, the scientific role of these transcripts is only beginning to become apparent, so in a sense annotators are waiting for our understanding of biology to catch up. However, numerous individual loci have now been studied in depth in the laboratory, while a wide range of large-scale projects have been published in the last couple of years. The emerging paradigm for lncRNA functionality based on such high quality work is that large numbers of these transcripts play a central role in development. This is in spite of the fact that lncRNA transcripts exhibit a high rate of evolutionary turnover. In fact, a link has been postulated between lncRNA evolution and the significant changes in protein-coding gene expression levels often observed between closely related species. Yes, the extent to which this is true remains to be seen. However, the fact that we don’t fully understand the nature and scope of functionality within lncRNA sets is not a valid reason to discard them. Instead, and with a risk of labouring the point, we believe that the annotation of putative lncRNA models is an important first step in subsequent attempts to elucidate their functionality.
Q12: Logical fallacy alert! You are using the fact that it is currently hard to judge lncRNA functionality to infer that they must all be functional.
No. We think that enough is known or suspected about lncRNA functionality to make the logical scientific position clear: let’s not discard them on the basis of their lack of conservation, especially as many poorly conserved lncRNAs are known to be functional. Instead, we’ll look at them in more detail using other methodologies and see what we find.
Q13: According to Graur et al., pseudogenes “… have always been looked upon with suspicion and wished away.” Why has GENCODE ignored pseudogenes?
We haven’t (although we do agree that pseudogenes remain unappreciated by many scientists). We are proud to say that GENCODE has the largest publicly available collection of manually curated human pseudogenes, with 13,447 loci and counting. Indeed, we wrote about this in a dedicated ENCODE companion paper. Unfortunately, as noted above, Graur et al. chose not to discuss any of the ENCODE companion papers, and we must therefore presume they are unaware of this fact.
Incidentally, we find Graur et al.’s view that a functional pseudogene is not a pseudogene by definition to be rather unhelpful. When a locus has a clearly identified function it does seem reasonable to cease referring to it as a pseudogene and reclassify it in the light of this functional information; indeed, this has already happened with a number of retrogene loci. However, retaining information on the provenance of the locus is useful – particularly where the potential for function is suspected though not experimentally confirmed – because it may inform the investigation into its functionality. For example, the locus PTENP1 is derived from a retrotransposition event and was annotated by us as a transcribed processed pseudogene of PTEN. However, the locus has subsequently been shown to regulate its parent protein-coding locus. Interestingly, PTENP1 has only been found in the human lineage and so would be considered non-functional by any measure solely relying on conservation. Separately, Graur et al. are correct that pseudogenes are frequently transcribed, although (definitely labouring the point now) why should this transcription be presumed to be non-functional? Increasing numbers of loci are being reported where a pre-existing pseudogene has gained a novel function in a process termed ‘resurrection’. These are a fascinating set of genes. Read our paper to find out more.
Q14: So in conclusion?
(a) 62% of the genome is transcribed. This fact is a starting point, not a finish line.
(b) Conservation is a highly informative metric for inferring functionality. However, evolution has not finished, and so it is limited in scope.
(c) New technologies can allow us to explore the breadth of functionality that exists in lineage specific transcription.