As you probably know by now, our remit is not only to capture all human transcripts as models, but also to provide an informed judgment on the functional potential of each model. Often we can do this with a high degree of certainty, in particular when a transcript is translated into a protein for which there is strong experimental support. However, the human transcripts that have been directly studied in the laboratory represent only a small fraction of the transcriptome. As such, we believe the true extent of functionality across the transcriptome remains to be ascertained (as discussed at length in our previous post). Non-coding transcription is perhaps the most significant source of ‘dark matter’ in the transcriptome. Functionality clearly exists within this enormous set of transcripts, as demonstrated both by single-locus studies and by genome- and transcriptome-wide analyses. Nonetheless, the pertinent questions are clear: how many non-coding RNAs are truly functional, and what are the various modes by which they function? These questions will no doubt inspire a multitude of blog posts. Here we’ll kick things off by highlighting a particular class of non-coding transcript: those formed by dual promoters.
Most protein-coding genes are controlled by CpG island-based promoters. Such promoters do not appear to set an origin of transcription at a specific and consistent base on the genome. Instead, transcription appears to initiate over a range of bases, such that the transcription start site (TSS) of a protein-coding gene is typically more of a smudge than a single point. This becomes particularly obvious when working with cap analysis of gene expression (CAGE) clusters. However, it has also been known for some time that CpG promoters can induce transcription on the antisense strand, which thus proceeds in the opposite direction to transcription from the protein-coding locus. Again, CAGE is very useful in highlighting such transcription, as are polyAseq and RNAseq. The HAVANA group is in fact currently combining all three technologies, together with existing ‘old-school’ transcript libraries, as part of a drive to capture non-coding transcripts of all categories currently missing in our genebuild. And what do we find? About half of the transcripts we are constructing originate from dual promoters. Let’s look at an example:
Our initial projection is that the number of protein-coding genes undergoing two-way transcription may run into the thousands. So, the big question: is this transcription doing anything that’s actually useful? As far as HMGN3 goes, we don’t know. The CAGE score and polyAseq mapping indicate that the level of antisense transcription at this locus is significant. Does significant transcription indicate functionality? Perhaps. An alternative possibility is that such transcription is simply opportunistic: the chromatin opened up to allow the transcription of HMGN3, and transcription also proceeded in the antisense direction simply because it could. Interestingly, work with the GRO-seq methodology suggests that such transcription may be common to protein-coding genes with CpG island promoters in general. However, in contrast to the other methods discussed, GRO-seq captures the act of transcription rather than mature transcripts that actually exist in the cell. In other words, it may be that, while all protein-coding genes with CpG island promoters undergo two-way transcription, only a subset of loci generate stable antisense transcripts at an appreciable level. We speculate that transcript stability may be linked to the evolution of polyadenylation features.
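To make the idea of a dual promoter a little more concrete, here is a minimal Python sketch of how one might flag candidate antisense partners for a protein-coding TSS. It is an illustration only, not a description of how HAVANA annotates these loci. The `Tss` record, the function names, the 1,000 bp `max_gap` window and all coordinates are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tss:
    """A transcription start site, e.g. the dominant base of a CAGE cluster."""
    name: str
    chrom: str
    pos: int      # genomic coordinate of the start site
    strand: str   # "+" or "-"


def is_divergent_pair(sense: Tss, antisense: Tss, max_gap: int = 1000) -> bool:
    """Return True if the two start sites look like a head-to-head pair.

    The pair must sit on the same chromosome and on opposite strands, and the
    antisense start must lie at or upstream of the sense start (relative to
    the sense strand), so that the two transcripts run away from each other.
    max_gap is an arbitrary window chosen for illustration.
    """
    if sense.chrom != antisense.chrom or sense.strand == antisense.strand:
        return False
    # Offset along the sense strand: negative means upstream of the sense TSS.
    if sense.strand == "+":
        offset = antisense.pos - sense.pos
    else:
        offset = sense.pos - antisense.pos
    return -max_gap <= offset <= 0


def find_divergent_partners(sense: Tss, candidates: List[Tss],
                            max_gap: int = 1000) -> List[Tss]:
    """Filter candidate start sites down to those divergent from `sense`."""
    return [c for c in candidates if is_divergent_pair(sense, c, max_gap)]


if __name__ == "__main__":
    # Hypothetical coordinates, invented purely for the example.
    gene = Tss("GENE_A", "chr1", 10_000, "+")
    clusters = [
        Tss("cluster_1", "chr1", 9_850, "-"),   # upstream, opposite strand
        Tss("cluster_2", "chr1", 10_400, "-"),  # antisense but inside the gene
        Tss("cluster_3", "chr1", 9_900, "+"),   # same strand as the gene
        Tss("cluster_4", "chr1", 5_000, "-"),   # opposite strand, too far away
    ]
    for hit in find_divergent_partners(gene, clusters):
        print(hit.name)
```

Because a CpG island TSS is a smudge rather than a single base, a real analysis would need to work with cluster extents and expression levels rather than one coordinate per start site; the sketch ignores both for brevity.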
Do we know anything about the functionality of antisense transcription in general? It’s certainly a topic that has interested several groups in recent months. One possibility is that this transcription is involved in gene regulation, and there are at least two ways this could work. Firstly, the transcript itself could play a direct role in a regulatory process. Secondly, it may not be the transcript that is functional, but rather the act of transcribing it. In this scenario, the generation of the antisense transcript imparts a level of control over the output of transcription from the sense protein-coding locus. How exactly this might occur is somewhat speculative; it may be that antisense transcription competes with the sense locus for RNA polymerase complexes.
In summary, dual promoter transcription is a widespread phenomenon that remains largely mysterious. Our view is that the annotation of transcripts is a necessary first step in efforts to understand their nature. We don’t know the extent of functionality in dual promoter transcript sets. However, we’ll do whatever we can to help find out.