Recently, we read with interest the Nature publication describing the role of the human basigin protein in facilitating erythrocyte invasion by the malaria parasite. This work, spearheaded at the Wellcome Trust Sanger Institute, indicates that basigin should be considered a target for the development of new anti-malarial therapies. Drug design is beyond the remit of the GencodeGenes team; however, the recurring theme of this blog will be how a detailed appreciation of a gene's transcriptional output can inform our understanding of its functionality. In this brief post we will discuss the BSG gene specifically within this context.
It turns out BSG is rather interesting.
Our GENCODE dataset describes 12 transcripts at the human BSG locus, including 7 distinct protein-coding transcripts that use one of 4 initiation codons (Figure 1). In other words, GENCODE indicates that there are potentially 7 isoforms of the basigin protein for researchers to consider. However, that word ‘potentially’ is important. Most human genes contain alternatively spliced transcripts, although experimental support for alternative functionality is rarely available. Critically, we know that transcription and splicing are error-prone processes, while de novo splice sites arising from mutation can be incorporated into new transcripts. As such, we anticipate the existence of ‘noise’ in the transcriptome; transcripts considered ‘junk’ by the pessimist, or the raw materials for evolution by the optimist. So how can we tell if the 7 annotated BSG protein-coding transcripts represent 7 functional protein isoforms? In the absence of experimental support for translation, we can turn to proxies.
Figure 1: The human BSG locus. The highlighted 348bp exon is very well conserved and is unique to one protein-coding transcript.
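For readers who want to make this kind of tally themselves, here is a minimal sketch of how transcripts, protein-coding transcripts and distinct initiation codons could be counted for one gene from GTF-style annotation lines. It assumes the attribute keys `gene_name`, `transcript_id` and `transcript_type` and the feature types `transcript` and `start_codon`; the function names, the gene `TOY1` and every coordinate below are invented for illustration and are not taken from the BSG locus. A start codon that is split across two exons would be counted twice by this simple version.

```python
import re
from collections import defaultdict


def parse_attributes(field):
    """Turn the GTF attribute column into a dict of key -> quoted value."""
    return dict(re.findall(r'(\w+) "([^"]*)"', field))


def summarise_gene(gtf_lines, gene_name):
    """Count transcripts, protein-coding transcripts and distinct start codons."""
    transcripts = {}                 # transcript_id -> transcript_type
    start_codons = defaultdict(set)  # (start, strand) -> transcript ids using it
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature, start, strand = cols[2], int(cols[3]), cols[6]
        attrs = parse_attributes(cols[8])
        if attrs.get("gene_name") != gene_name:
            continue
        tid = attrs.get("transcript_id")
        if feature == "transcript":
            transcripts[tid] = attrs.get("transcript_type")
        elif feature == "start_codon":
            start_codons[(start, strand)].add(tid)
    coding = [t for t, kind in transcripts.items() if kind == "protein_coding"]
    return {
        "transcripts": len(transcripts),
        "protein_coding": len(coding),
        "distinct_start_codons": len(start_codons),
    }


def toy_line(feature, start, end, tid, ttype):
    """Build one made-up GTF line (hypothetical gene, hypothetical coordinates)."""
    attrs = (f'gene_id "G1"; transcript_id "{tid}"; '
             f'gene_name "TOY1"; transcript_type "{ttype}";')
    return "\t".join(["chrT", "toy", feature, str(start), str(end),
                      ".", "+", ".", attrs])


# Toy annotation: four transcripts, three protein-coding, two start positions.
TOY_GTF = [
    toy_line("transcript", 100, 900, "T1", "protein_coding"),
    toy_line("start_codon", 150, 152, "T1", "protein_coding"),
    toy_line("transcript", 100, 900, "T2", "protein_coding"),
    toy_line("start_codon", 150, 152, "T2", "protein_coding"),
    toy_line("transcript", 100, 700, "T3", "retained_intron"),
    toy_line("transcript", 250, 900, "T4", "protein_coding"),
    toy_line("start_codon", 300, 302, "T4", "protein_coding"),
]

if __name__ == "__main__":
    print(summarise_gene(TOY_GTF, "TOY1"))
```

Counts like these only describe what has been annotated; they say nothing about whether each coding transcript yields a functional protein, which is the question the rest of this post addresses.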
One of the most powerful proxies for indicating functionality is conservation, so let’s compare our annotation of human BSG with HAVANA annotation of the Bsg gene in mouse (Figure 2). In both genes, the majority of the cDNA and EST evidence available supports one of two equivalent transcripts (i.e. each exon can be cross-mapped between the species), labelled 1 and 2 in each diagram. The 348bp exon highlighted is therefore subjected to alternative splicing in both species; thus, the conservation argument suggests that both isoforms are functional. This information may be of interest to research scientists, since a therapy directly targeting the 348bp exon (or the equivalent protein region) would potentially be ineffective against the form that splices out this exon.
Figure 2: The mouse Bsg locus. Highlighted is the 348bp exon that, as at the orthologous human locus, is unique to one protein-coding isoform.
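The point about a therapy aimed at the 348bp exon can be sketched in a few lines: given each transcript as a list of exon intervals, report which transcripts do not overlap the targeted region. This is only an illustration under stated assumptions; the function names, transcript names and coordinates are invented (the middle toy exon is simply chosen to be 348bp long) and do not correspond to the real BSG or Bsg annotation.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based and inclusive


def overlaps(a: Exon, b: Exon) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def transcripts_lacking_region(transcripts: Dict[str, List[Exon]],
                               target: Exon) -> List[str]:
    """Names of transcripts with no exon overlapping the target region."""
    return sorted(
        name
        for name, exons in transcripts.items()
        if not any(overlaps(exon, target) for exon in exons)
    )


# Hypothetical coordinates: 501..848 spans 348 bases.
TARGET: Exon = (501, 848)
TOY_TRANSCRIPTS: Dict[str, List[Exon]] = {
    "with_exon": [(100, 200), (501, 848), (1000, 1100)],
    "without_exon": [(100, 200), (1000, 1100)],
}

if __name__ == "__main__":
    # Transcripts listed here would escape anything aimed at the target exon.
    print(transcripts_lacking_region(TOY_TRANSCRIPTS, TARGET))
```

Any transcript returned by `transcripts_lacking_region` encodes a form that a treatment directed at the target exon would miss, which is why knowing the full set of functional isoforms matters.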
In contrast, the human transcripts labelled 3 and 4 are not supported in mouse, and their initiation codons do not display conservation beyond higher primates. This isn’t surprising, since both codons reside within primate-specific transposable elements identified by RepeatMasker (Figure 1). The poor conservation of transcripts 3 and 4 does not confirm that they are non-functional (a human is not a mouse, after all, and transposable elements are often linked to exon creation); this proxy is simply non-informative in these cases. The functionality of such transcripts must be judged by other methods (and in practice our annotators would search the literature to find what has already been reported; PubMed has over 200 papers on BSG…). In future posts we’ll discuss how the functionality of transcripts can also be assessed using RNA-seq libraries, promoter signatures and high-throughput proteomics. Meanwhile, if you wish to read more about the comparative annotation of the GENCODE gene-build, take a look at our recent MBE article.