Briefings in Functional Genomics and Proteomics Advance Access originally published online on February 20, 2006
Briefings in Functional Genomics and Proteomics 2006 5(1):46-51; doi:10.1093/bfgp/ell011
© Oxford University Press, 2006, All rights reserved. For permissions, please email: journals.permissions@oxfordjournals.org

Special Issues Papers

Computational methods for alternative splicing prediction

Paola Bonizzoni, Raffaella Rizzi and Graziano Pesole

Corresponding author. Graziano Pesole, Dipartimento di Biochimica e Biologia Molecolare, Universita di Bari, Via Orabona 4, 70126 Bari, Italy. Tel: +39 080 5929663; Fax: +39 080 5929690. E-mail: graziano.pesole{at}biologia.uniba.it


    ABSTRACT
The fact that a large majority of mammalian genes are subject to alternative splicing indicates that this phenomenon represents a major mechanism for increasing proteome complexity.

Here, we provide an overview of current methods for the computational prediction of alternative splicing based on the alignment of genome and transcript sequences. Specific features and limitations of different approaches and software are discussed, particularly those affecting prediction accuracy and assembly of alternative transcripts.

Keywords: alternative splicing, gene regulation, algorithm, software, comparative genomics


    INTRODUCTION
Alternative splicing (AS) is emerging as a major mechanism for increasing proteome complexity in vertebrates and other eukaryotes [1]. This finding has prompted an increasing number of both computational studies (generally based on the alignment of cDNAs and genome sequences) and corresponding experimental investigations. In particular, the analysis of gene structure and the characterization of splicing events have been made possible by tools that perform pairwise alignments of genomic and transcript sequences with the insertion of large gap regions (introns). Such tools include BLAT [2], Spidey [3] and SIM4 [4].

The large and continuously increasing amount of EST data, structured in gene-oriented clusters in the UNIGENE database, is the major information source for the computational detection of AS patterns of genes. However, the aforementioned tools suffer from intrinsic shortcomings that need to be addressed by suitable algorithms specifically devised for AS detection.

Methods based on EST and mRNA comparison
Clustering of EST data was the first computational approach used to detect alternative splice events. Indeed, thanks to extensive analysis of EST alignments, a high percentage of human genes has been shown to undergo AS [5, 6]. However, pairwise comparison of EST data may not be the most suitable computational approach to detect splice sites and AS events. EST surveys are subject to genomic contamination (‘EST’ clones derived from genomic DNA rather than mRNA) that may result in large regions of insertions due to the presence of intron sequences. Similarly, paralogue contamination in ESTs (mixing of ESTs derived from different paralogous genes) and fragmented ESTs (ESTs with sequencing errors or from incompletely spliced mRNA) may produce pairwise alignments suggesting erroneous AS patterns. Moreover, predictive methods based on EST and mRNA comparison have limited power, since they do not use information in the intronic part of the genome. The splicing process is controlled by specific sequence motifs in the genomic sequences, such as the well-known GT–AG dinucleotides flanking more than 98% of the known intron sequences [7]. These dinucleotide motifs, surrounded by a longer conserved consensus, provide valuable information for the location and scoring of splice sites through the alignment of ESTs and genomic sequences. For these reasons, algorithms based on EST–genome pairwise comparison have provided more reliable tools for the detection of splice sites.
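As a minimal illustration of why the genomic sequence adds information that EST-to-EST comparison lacks, the following Python sketch reads the donor and acceptor dinucleotides of a candidate intron and checks them against the GT–AG rule. The function names and the toy sequence are hypothetical, and the intron is assumed to be given as a half-open interval on the forward strand.

```python
def intron_signal(genome: str, start: int, end: int) -> tuple[str, str]:
    """Return the (donor, acceptor) dinucleotides of the intron genome[start:end].

    Assumption: coordinates are 0-based, half-open, on the forward strand.
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    return intron[:2], intron[-2:]


def is_gt_ag(genome: str, start: int, end: int) -> bool:
    """True when the candidate intron is flanked by the GT-AG dinucleotides."""
    return intron_signal(genome, start, end) == ("GT", "AG")


# Toy example: exon "ACG", intron "GTAAGTTTTCAG", exon "GA".
toy_genome = "ACGGTAAGTTTTCAGGA"
print(intron_signal(toy_genome, 3, 15))
print(is_gt_ag(toy_genome, 3, 15))
```

A real splice-site model would also score the longer consensus around the dinucleotides; this sketch only checks the two flanking positions.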

Methods based on genome–EST pairwise alignment comparison
Popular programmes to align cDNA and genomic segments include EST_GENOME [8], SIM4 [4], Spidey [3], GeneSeqer [9] and MGAlign [10]. Specific integrated programmes such as BLAT [2] and SQUALL [11] perform both genomic mapping and alignment. BLAT and SIM4 are basic subroutines utilized in several methods to identify splice sites and assemble transcript variants. On the other hand, specialized cDNA–genome alignment algorithms have been proposed and claimed to outperform these programmes in terms of accuracy and computational efficiency. Among these, GeneSeqer [9] and GMAP [12] are the most recent. In particular, GMAP exhibits a high accuracy in single EST alignment as it is able to detect statistically significant microexons. Moreover, it is more computationally efficient than programmes such as BLAT or SQUALL (e.g. a few seconds for the alignment of a single cDNA against a large genome sequence and a very limited RAM requirement, i.e. about 128 MB against 8 GB required by BLAT and 12 GB required by SQUALL). Another programme specialized for the task of aligning cDNAs to genomic regions, ASPIC-aligner, is incorporated in the ASPIC package [13].

EST–genome alignment algorithms can be compared on the basis of some common computational steps they share: (i) the use of short oligomers collected in an indexing table to accomplish genomic mapping of a transcript sequence; (ii) the alignment of entire exons through the extension of oligomer matching by different strategies such as approximate algorithms or exact dynamic programming (DP); (iii) the identification and/or refinement of detected splice sites.
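Step (i) can be sketched in a few lines of Python. The sketch below is not the indexing scheme of any of the programmes named above; it is a generic k-mer table, with hypothetical function names and a toy sequence, showing how exact oligomer hits that share a diagonal (genome position minus EST position) suggest an exon, and how a jump between diagonals suggests an intron.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def find_seeds(index: dict[str, list[int]], est: str, k: int) -> list[tuple[int, int]]:
    """Return (est_pos, genome_pos) for every exact k-mer match."""
    hits = []
    for j in range(len(est) - k + 1):
        for i in index.get(est[j:j + k], ()):
            hits.append((j, i))
    return hits


def group_by_diagonal(hits: list[tuple[int, int]]) -> dict[int, list[tuple[int, int]]]:
    """Group seeds by diagonal; seeds from one ungapped exon share a diagonal."""
    diagonals = defaultdict(list)
    for est_pos, genome_pos in hits:
        diagonals[genome_pos - est_pos].append((est_pos, genome_pos))
    return dict(diagonals)


# Toy data: two exons separated by a 14-base intron.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCCCAG", "TTGACCGA"
genome = exon1 + intron + exon2
est = exon1 + exon2
diagonals = group_by_diagonal(find_seeds(build_kmer_index(genome, 4), est, 4))
print(sorted(diagonals))
```

In this toy case the seeds from the first exon fall on diagonal 0 and those from the second exon on diagonal 14, the intron length. Genome-scale tools use far more compact tables and longer oligomers than this dictionary-based illustration.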

Concerning step (i), for example, SIM4 uses oligomers of fixed length 12, while 11mers and 14mers are used in BLAT and SQUALL, respectively. The use of tables for oligomers on a genomic scale requires a pre-computation step, as well as a large amount of dedicated RAM. GMAP confronts this issue by using files instead of RAM. ASPIC-aligner also has a pre-computation step that builds a hash table of short oligomers to be used as matching words in the alignment process. However, unlike other programmes, ASPIC-aligner computes the length of a component (oligomer matching to the genome) as a function of the length of the input genomic sequence, thus guaranteeing that the alignment process is faster for longer genes. Step (ii) is performed by DP algorithms in EST_GENOME and GeneSeqer, while in SQUALL and SIM4 the predominant strategy for computing the alignment of exonic regions is the assembly of consecutive oligomers (called seeds) that match to the genome. ASPIC-aligner uses quite a different strategy, as seeds are only used to locate an initial region for the alignment of the exon, while the K-band linear time and space DP algorithm [14] is then applied to find the optimal alignment of the entire exon. Because the same seed may match at alternative positions, ASPIC-aligner tries these alternative positions to locate the best exon alignment. Finally, step (iii) (identification and refinement of splice sites) is a highly relevant ingredient in splicing detection methods; the section ‘Criteria to improve splice sites detection’ is devoted to this issue.
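The K-band idea mentioned above can be illustrated with a short Python sketch. This is a generic banded edit-distance computation written for illustration, not the ASPIC-aligner code: it fills only the cells of the DP matrix with |i - j| <= k and keeps two rows in memory, so the work grows with k times the sequence length rather than with the product of the two lengths. Unit costs and the function name are assumptions, and the sketch returns only the score, without a traceback.

```python
from typing import Optional


def banded_edit_distance(a: str, b: str, k: int) -> Optional[int]:
    """Edit distance between a and b restricted to the band |i - j| <= k.

    Returns None when the length difference alone exceeds the band.
    The result equals the true edit distance whenever that distance is <= k.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


print(banded_edit_distance("ACGTACGT", "ACGTTCGT", 2))
print(banded_edit_distance("ACGTACGT", "ACGACGT", 2))
```

A narrow band is reasonable for exon alignment because an EST and the matching exon are expected to differ only by a small number of sequencing errors.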

Among the programmes discussed above, BLAT and SIM4 provide built-in modules—used by other methods for AS and gene-structure detection—that use both transcript (cDNAs, ESTs) and genomic data. Indeed, transcript splice variants detected by such methods start from gene-structure information provided by the aforementioned alignment modules.

ECgene [15] is a recently designed programme using a BLAT-like alignment module. The ECgene (gene modelling by EST clustering) algorithm combines an EST-clustering procedure with a transcript assembly procedure. In ECgene, the alignment of mRNAs/ESTs to the genomic sequence is performed by the combined use of BLAT and SIM4. Then, an algorithm to cluster EST alignments that share at least one splice site is used to generate primary clusters. In a subsequent step, the connectivity of exons in each primary cluster is represented by a directed acyclic graph (DAG), where paths of the DAG represent putative transcript models. Prediction of splice sites and alternative events in ECgene is performed on the basis of the computed transcript models.

Clearly, a shortcoming of this approach is that it does not directly provide an exon–intron gene structure, but rather multiple transcript models, several of which may not be reliable.
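The clustering criterion described above (alignments that share at least one splice site end up in the same primary cluster) can be sketched with a union-find structure. This is an illustrative Python sketch, not ECgene's implementation: splice sites are assumed to be plain genomic coordinates, sharing is assumed to be transitive, and the function name and EST identifiers are hypothetical.

```python
def cluster_by_shared_splice_site(alignments: dict[str, set[int]]) -> list[set[str]]:
    """Group EST alignments so that any two sharing a splice site are together.

    `alignments` maps an EST identifier to the set of splice-site coordinates
    found in its genome alignment (assumed representation).
    """
    parent = {name: name for name in alignments}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner: dict[int, str] = {}
    for name, sites in alignments.items():
        for site in sites:
            if site in first_owner:
                parent[find(name)] = find(first_owner[site])
            else:
                first_owner[site] = name

    clusters: dict[str, set[str]] = {}
    for name in alignments:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


example = {
    "est1": {100, 250},
    "est2": {250, 400},
    "est3": {400},
    "est4": {900, 1200},
}
print(cluster_by_shared_splice_site(example))
```

Here est1 and est3 share no splice site directly but are joined through est2, while est4 forms its own cluster.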

Methods based on EST–genome multiple alignment comparison
Despite the availability of several reasonably fast and accurate cDNA–genome alignment programmes, the problem of an accurate prediction of splice sites and exon–intron structure is still far from a fully satisfactory solution. The main reason for this is that these programmes perform independent pairwise alignment of ESTs against the genome and different ESTs from the same gene may lead to contradictory gene structure alignments. These are due on one hand to insertions, deletions and single nucleotide polymorphisms (SNPs) frequently occurring in EST data, and on the other hand to both the repetitive structure of genomic sequences and the presence of contamination of EST clusters with paralogous sequences. Sequencing errors are quite frequent in ESTs, where error rates are estimated to be 1.5% for high-quality sequences and 3–4% overall. Moreover, sequencing errors near splice junctions make the detection of splice sites a rather complicated task.

In order to overcome these limitations, pointed out in [16] and [13], approaches based on the idea of performing a multiple EST sequence comparison and alignment against the genomic sequence have been proposed. These methods allow the minimization of false splice predictions due to incorrect independent EST–gene alignments not supported by the majority of EST data. A novel algorithm predicting splice sites, based on the comparison of an entire cluster of ESTs (typically a UNIGENE cluster), is implemented in the ASPIC web tool [13]. Similarly, POA-MSA [16] and ASAP [17] apply new algorithms for the multiple alignment of EST data to produce predictions supported by a set of EST data. The method proposed in [16] constructs a multiple sequence alignment of ESTs by means of the partial order alignment (POA) algorithm in order to determine a consensus sequence. Such a sequence is then aligned to the genome by BLAST in order to detect splice sites. However, the method presented in [16] requires a long computation time for processing an entire UNIGENE cluster (e.g. 4 h for processing 5000 ESTs compared to <30 min needed by ASPIC) and, moreover, it does not adopt specific procedures to improve the sensitivity and accuracy in detecting canonical and non-canonical splice sites. This implies that ESTs producing genome alignments without canonical splice sites are simply excluded from the analysis [17]. In contrast, ASPIC adopts an ad hoc dynamic programming procedure for the refinement of intron boundaries and allows any kind of non-canonical splice site if reliably supported by EST alignments.

Criteria to improve splice sites detection
Approaches such as BLAT, which use oligomer matching to align exons, do not generally allow high-precision location of splice sites. Computational approaches proposed in the literature to locate intron boundaries are mainly based on the following ideas. ECgene forces the alignments with non-canonical introns (not flanked by GT–AG or GC–AG) to be further corrected with the SIM4 programme, which requires the splice sites to be canonical if possible. Alternatively, it is possible to use splice site models, such as scoring matrices [7,18]. Other EST–genome alignment algorithms such as GMAP make use of a DP algorithm to improve splice site location. However, DP may not be sufficient for the identification of correct EST splice sites, since an optimal alignment may not be unique or may not be the most biologically plausible solution.

A novel approach to detecting splice sites was proposed recently in ASPIC: the basic idea is to use a dynamic programming algorithm to locate the gap that leaves the minimum number of errors on each splice site while guaranteeing the highest scoring splice sites. Non-canonical fuzzy splice sites can eventually be reduced to the canonical ones using the information around splice sites provided by a multiple EST alignment. Indeed, it frequently happens that a splice site is supported by the genome alignment near the intron boundary of a large number of ESTs. Some of the ESTs covering the same intron boundary, due to the presence of mismatches and/or indels, may generate a misalignment (even without mismatches) supporting a different intron, often flanked by non-canonical splice sites, but located very close to a canonical splice site supported by the higher quality ESTs. In this case, ASPIC reduces the fuzzy splices to the canonical ones by accepting a few mismatches and/or indels in lower quality ESTs, allowing the optimal genome multialignment of all ESTs/transcripts near the intron boundary.

Tests carried out using ASPIC on several genes show that multiple EST comparison notably improves the sensitivity (low false negatives) and selectivity (low false positives) of intron predictions.
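The ambiguity that motivates this refinement can be reproduced on a toy example. The Python sketch below is a simplified stand-in for the idea, not the ASPIC procedure: it assumes a single intron, gapless alignment of both exon parts, known genomic anchors for the first and last EST bases, and a hypothetical `slack` parameter for the number of extra mismatches tolerated in exchange for a canonical GT–AG intron.

```python
def place_intron(est: str, genome: str, g_start: int, g_end: int, slack: int = 1):
    """Choose where to open the intron gap in a two-exon EST alignment.

    est[0] is assumed to align at genome[g_start] and est[-1] at genome[g_end - 1].
    Every cut point in the EST is tried; among cuts within `slack` mismatches of
    the best one, a GT-AG intron is preferred, then fewer mismatches.
    Returns (intron_start, intron_end, mismatches, canonical).
    """
    n = len(est)
    candidates = []
    for cut in range(1, n):
        intron_start = g_start + cut
        intron_end = g_end - (n - cut)
        if intron_end - intron_start < 4:
            continue
        left = genome[g_start:intron_start]
        right = genome[intron_end:g_end]
        mismatches = sum(x != y for x, y in zip(est[:cut], left))
        mismatches += sum(x != y for x, y in zip(est[cut:], right))
        canonical = (genome[intron_start:intron_start + 2] == "GT"
                     and genome[intron_end - 2:intron_end] == "AG")
        candidates.append((mismatches, canonical, intron_start, intron_end))
    if not candidates:
        raise ValueError("no room for an intron between the two anchors")
    fewest = min(c[0] for c in candidates)
    tolerated = [c for c in candidates if c[0] <= fewest + slack]
    mismatches, canonical, start, end = min(
        tolerated, key=lambda c: (not c[1], c[0], c[2]))
    return start, end, mismatches, canonical


# Toy gene: exon "CCATG", intron "GTACCCTTAG", exon "GTTCA".
genome = "CCATG" + "GTACCCTTAG" + "GTTCA"
est = "CCATG" + "GTTCA"
print(place_intron(est, genome, 0, len(genome)))
```

In this toy gene, four different cut points align the EST with zero mismatches, because the bases around the junction repeat; only one of them yields a GT–AG intron, which is the one selected. In a multiple-EST setting the same preference would be backed by the number of ESTs supporting each boundary, which this single-EST sketch does not model.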

Full-length transcript assembly problem
The problem of predicting gene structure by analysing splice sites in the genome involves solving another related problem: reconstructing alternative full-length isoforms by assembling spliced EST sequences. Since AS events are reconstructed from fragmented EST data, such events may be combined combinatorially in full-length transcripts, potentially producing an exponential number of transcripts. Computationally, the reconstruction must rely on rules that allow the minimization of the number of predicted transcripts that do not occur in nature, including, for example, the use of highly reliable intron predictions (i.e. those with well-scoring canonical sites or supported by at least two transcripts) and of polyA/T sites to identify transcript ends. Various graph-based methods for the assembly of ESTs have been reported recently. These include ESTgenes [19], Splicing graphs [6], ECgene [15], ASP [17] and ASPIC [13]. All of these methods rely on the idea of representing basic information on splicing events in the nodes of the graphs, while a whole path of the graphs represents a description of a sequence of such events that occur in a full-length transcript.

These approaches differ in the basic information used in the assembly graphs and in the algorithm used to reconstruct paths: nodes are exons in ECgene and ASP or nucleotide sequences as in ASG [20], while in ASPIC and ESTgenes nodes are single spliced ESTs, thus requiring the construction of significantly smaller graphs. In ASP, the paths of the graph are weighted and a dynamic programming algorithm is used to detect the most promising (or productive) path. In ASPIC, an assembly graph is constructed representing a partial order relation among spliced ESTs (two transcripts are related if they overlap, i.e. share common splice sites). An efficient algorithm reporting plausible paths, without requiring a trimming phase as in [15] or [17] to remove redundant models, is adopted. Reducing the number of false transcript predictions, bounding the size of the graph and efficiently producing distinct paths are critical computational issues in the assembly problem.
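The combinatorial growth of transcript models can be seen on a small exon-level graph. The following Python sketch is a generic path enumeration on a DAG, not the assembly algorithm of any tool cited above; exon names, the edge list and the function name are hypothetical.

```python
def enumerate_transcripts(edges: dict[str, list[str]], start: str, end: str) -> list[list[str]]:
    """List every path from `start` to `end` in an acyclic exon graph.

    Nodes are exons; an edge means that a junction between the two exons
    was observed in at least one spliced EST (assumed representation).
    """
    paths: list[list[str]] = []

    def walk(node: str, path: list[str]) -> None:
        if node == end:
            paths.append(path)
            return
        for nxt in edges.get(node, []):
            walk(nxt, path + [nxt])

    walk(start, [start])
    return paths


# Two independent cassette exons (E2 and E4) between constitutive exons.
exon_graph = {
    "E1": ["E2", "E3"],
    "E2": ["E3"],
    "E3": ["E4", "E5"],
    "E4": ["E5"],
}
for transcript in enumerate_transcripts(exon_graph, "E1", "E5"):
    print("-".join(transcript))
```

With two independent skipping events the graph yields four candidate transcripts, and each additional independent event doubles the count, even though short ESTs may never show which combinations actually co-occur in one molecule. This is why the methods discussed above weight, filter or restrict the paths they report.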


    CONCLUSIONS
The computational problem of predicting AS deserves further attention and is far from being solved completely. As shown above, the availability of several alignment tools specifically designed to produce accurate EST–genome alignments addresses only the first major issue in the AS prediction problem. Other issues are only partially addressed by currently available tools. These include:

  1. detecting true splice variants that differ by only a few bases, even a single codon;
  2. providing an accurate and reliable prediction of exact splice sites, either canonical (including U11/U12, AT-AC) or non-canonical;
  3. producing splice site predictions based on reliable alignment of initial and terminal EST exons or of short internal exons (microexons).

Table 1 reports a comparison of available tools under the aforementioned criteria. The computational solution of issue 1 involves balancing two specific needs: avoiding the prediction of false splice sites whose coordinates are shifted by a few bases, while still distinguishing those sites that are true splice variants. Clearly, when processing large transcript data sets, the number of ESTs with neighbouring splice sites increases; on the other hand, the comparison of multiple ESTs makes it easier to distinguish correct splice sites supported by a large number of high-quality ESTs (over 98% identity) from false splice sites. For example, ECgene assumes that neighbouring splice sites are identical if they are within 16 base pairs; thus, both splice sites differing by a few bases because of low EST quality and true splicing variants differing by a few nucleotides are excluded.
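The trade-off behind a minimum-difference threshold can be made concrete with a small Python sketch. It is an illustration of the general idea only, not ECgene's code: the function name is hypothetical, "within 16 base pairs" is read as a distance of at most 16, and neighbouring sites are merged by chaining consecutive coordinates.

```python
def merge_neighbouring_sites(sites: list[int], window: int) -> list[list[int]]:
    """Group splice-site coordinates whose consecutive distance is <= window."""
    groups: list[list[int]] = []
    for pos in sorted(sites):
        if groups and pos - groups[-1][-1] <= window:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# 1000 and 1003 could be a genuine 3-nt (one codon) variant; 1002 a likely error.
observed = [1000, 1003, 1002, 1500]
print(merge_neighbouring_sites(observed, 16))
print(merge_neighbouring_sites(observed, 0))
```

With a 16-base window, the three nearby coordinates collapse into a single site, so a genuine one-codon variant is lost together with the alignment noise; with no window, every shifted coordinate is reported as a separate site. Deciding between the two cases requires the kind of multiple-EST support discussed above rather than a fixed distance alone.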


Table 1: List of the most popular software tools for AS prediction. For each tool we report the type (tool or database), the web address, whether a multiple sequence alignment (MSA) is used, whether the prediction is based on a multiple set of EST data, whether full-length transcripts (FL) are computed, the minimum difference between splice coordinates (MDIF) and the availability of an upload facility for input data

 
The use of larger transcript datasets, whose source library and production protocol are clearly and unambiguously annotated, is also crucial for a better assessment of the AS pattern and the understanding of its functional relevance.

Finally, it should be noted that to fully understand the functional relevance of AS, the developmental-, tissue-, or pathological-specificity of detected AS isoforms needs to be addressed in a systematic manner. Such investigations will inevitably require multidisciplinary approaches involving both bioinformatics and wet-bench experiments.


Key Points

  • Alternative splicing (AS) is emerging as a major mechanism for increasing proteome complexity in vertebrates and other eukaryotes.
  • Despite the availability of several reasonably fast and accurate cDNA–genome alignment programmes, the problem of an accurate prediction of splice sites and exon–intron structure is still far from a fully satisfactory solution.
  • Programmes that perform independent pairwise alignments of ESTs against the genome may produce contradictory gene structure models, since different ESTs from the same gene can yield conflicting alignments.
  • ASPIC is a novel method based on multiple sequence comparison that adopts specific procedures to improve the sensitivity and accuracy in detecting canonical and non-canonical splice sites compared to other known methods.
  • Reconstructing alternative full-length isoforms by assembling spliced EST sequences must rely on rules that allow the minimization of the number of predicted transcripts that do not occur in nature, including, for example, the use of highly reliable intron predictions and of polyA/T sites to identify transcript ends.

 


    Acknowledgements
This work was supported by FIRB projects ‘Bioinformatica per la Genomica e la Proteomica’ and ‘Laboratorio Italiano di Bioinformatica—L.I.BI.’ (Ministero dell’Istruzione e Ricerca Scientifica, Italy) and Associazione Italiana Ricerca sul Cancro. The authors thank David Horner for his helpful comments on the manuscript.


    FOOTNOTES
 
Paola Bonizzoni is an Associate Professor of Computer Science at the University of Milano-Bicocca in Milan. Her research interests are mainly in the area of Theoretical Computer Science, and include computational complexity, models in biomolecular computation and combinatorial algorithms in Computational Biology.

Raffaella Rizzi currently holds a postdoctoral position in Computer Science at the University of Milano-Bicocca. Her research interests include computational models in molecular biology, gene expression and DNA microarrays.

Graziano Pesole is a Full Professor of Molecular Biology at the University of Bari, leading a research group in Bioinformatics and Comparative Genomics. His research interests include bioinformatics tools for genome annotation and molecular evolution.


    References
  1. Graveley BR. Alternative splicing: increasing diversity in the proteomic world. Trends Genet 2001; 17:100–7.
  2. Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res 2002; 12:656–64.
  3. Wheelan SJ, Church DM, Ostell JM. Spidey: a tool for mRNA-to-genomic alignments. Genome Res 2001; 11:1952–7.
  4. Florea L, Hartzell G, Zhang Z, et al. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 1998; 8:967–74.
  5. Brett D, Hanke J, Lehmann G, et al. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett 2000; 474:83–6.
  6. Heber S, Alekseyev M, Sze SH, et al. Splicing graphs and EST assembly problem. Bioinformatics 2002; 18(Suppl 1):S181–8.
  7. Burset M, Seledtsov IA, Solovyev VV. Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res 2000; 28:4364–75.
  8. Mott R. EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput Appl Biosci 1997; 13:477–8.
  9. Usuka J, Zhu W, Brendel V. Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 2000; 16:203–11.
  10. Lee BT, Tan TW, Ranganathan S. MGAlignIt: a web service for the alignment of mRNA/EST and genomic sequences. Nucleic Acids Res 2003; 31:3533–6.
  11. Ogasawara J, Morishita S. Fast and sensitive algorithm for aligning ESTs to human genome. Proc IEEE Comput Soc Bioinform Conf 2002; 1:43–53.
  12. Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 2005; 21:1859–75.
  13. Bonizzoni P, Rizzi R, Pesole G. ASPIC: a novel method to predict the exon-intron structure of a gene that is optimally compatible to a set of transcript sequences. BMC Bioinformatics 2005; 6:244.
  14. Setubal J, Meidanis J. Introduction to Computational Molecular Biology. Boston MA: PWS Publishing Company 1997.
  15. Kim N, Shin S, Lee S. ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res 2005; 15:566–76.
  16. Grasso C, Modrek B, Xing Y, Lee C. Genome-wide detection of alternative splicing in expressed sequences using partial order multiple sequence alignment graphs. Pac Symp Biocomput 2004; 1:29–41.
  17. Xing Y, Resch A, Lee C. The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res 2004; 14:426–41.
  18. Shapiro MB, Senapathy P. RNA splice junctions of different classes of eukaryotes: sequence statistics and functional implications in gene expression. Nucleic Acids Res 1987; 15:7155–74.
  19. Eyras E, Caccamo M, Curwen V, et al. ESTGenes: alternative splicing from ESTs in Ensembl. Genome Res 2004; 14:976–87.
  20. Leipzig J, Pevzner P, Heber S. The alternative splicing gallery (ASG): bridging the gap between genome and transcriptome. Nucleic Acids Res 2004; 32:3977–83.
