Briefings in Functional Genomics and Proteomics Advance Access originally published online on October 29, 2007
Briefings in Functional Genomics and Proteomics 2007 6(3):202-219; doi:10.1093/bfgp/elm025
Special Issue Papers
Genome browsing with Ensembl: a practical overview
Corresponding author. Ewan Birney, EMBL Outstation – EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. Tel: +44(0)1223 494992; Fax: +44(0)1223 494468; E-mail: birney{at}ebi.ac.uk
| ABSTRACT |
|---|
A wealth of gene information is accruing in public databases. Genome browsers such as Ensembl are needed to organize and depict this information in the context of the genome. Ensembl provides an open source gene set based on experimental evidence for over 30 species, the majority of which are vertebrates. Genes and annotation are accessible through the Ensembl browser (http://www.ensembl.org), and through direct queries of its databases using the Perl API (Application Programming Interface), MySQL or BioMart.
Keywords: Ensembl, genome browser, annotation, data mining, gene prediction, comparative genomics
| OVERVIEW |
|---|
This article presents a general introduction to Ensembl focusing on the basis for the gene set, data upload using DAS, data access and comparative genomics. The introductory sections are followed by a series of modules describing practical aspects of using the browser. The specific modules and what they cover are as follows:
Module 1: How to view information for one gene (including biological basis for the gene prediction and an introduction to viewing external data with DAS)?
Module 2: How to view a region of the chromosome and associated annotation?
Module 3: How to view SNPs and other variations?
Module 4: How to view homologies and alignments?
Module 5: How to query the Ensembl database with BioMart?
| INTRODUCTION TO THE ENSEMBL PROJECT |
|---|
The Ensembl project (Figure 1) [1–4] aims to provide an up-to-date gene set with associated annotation on the most recent assemblies for vertebrate species (including human [5, 6], mouse [7], rat [8] and zebrafish [9]) along with several model organisms commonly used in scientific studies (including yeast [10], Caenorhabditis elegans [11] and fruit fly [12]). Ensembl stands next to the genome browsers of UCSC (University of California, Santa Cruz, http://genome.ucsc.edu/) and NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/mapview/) as one of the three most-used genome browsers in the world today. Ensembl version 43 (http://feb2007.archive.ensembl.org/index.html) included 31 annotated genomes (Table 1); at the time of publication (September 2007), the number had increased to 35. The project is open source and, in keeping with its original aim, provides a free and comprehensive resource for both the research community and scientific industry. Ensembl is used worldwide by a broad range of scientists, from geneticists to molecular biologists to bioinformaticians. Ensembl was established in 1999 as a joint project between the EBI (EMBL) and the Wellcome Trust Sanger Institute, with additional funding from NIH–NIAID, EU, BBSRC and MRC.
All Ensembl gene predictions are based on biological evidence, specifically mRNA and protein evidence from public databases such as UniProt [13] and RefSeq [14]. See the section on the Ensembl gene set in this article. Associated annotation ranges from sequence variation to functional classes on the protein level and includes SNPs (single nucleotide polymorphisms), in-dels (insertion–deletion mutations), clone sets, protein domains and functional classes such as GO (Gene Ontology [15]) terms. Expression data from the GNF (Genomics Institute of the Novartis Research Foundation) [16] project and eVOC (Expressed Sequence Annotation for Humans) ontologies [17] can be accessed from the database for human. These projects involve the determination of the location (tissue type/organ) and timing of gene expression. For the remaining species, expression data can only be inferred from homologies. Ensembl genes are trackable throughout scientific databases as they are mapped to external identifiers in databases such as UniProt and RefSeq (sequence repositories across species), and MGD (Mouse Genome Database: a sequence repository for mouse) [18] or probeset IDs from Affymetrix (GeneChip®) [19] or Illumina® (BeadArrays) [20]. In this way, Ensembl expands beyond its own databases to other databases and literature in the scientific community.
| THE ENSEMBL GENE SET |
|---|
Ensembl strives to provide the most accurate and up-to-date gene set possible. If available, manually curated datasets are imported, such as the SGD (Saccharomyces Genome Database [21]) gene set for Saccharomyces cerevisiae, the WormBase [22] gene set for C. elegans, and the VEGA/Havana [23] set for Homo sapiens. The VEGA (vertebrate genome annotation) consortium [23] provides manual annotation of vertebrate genomes, focusing on regions in human, mouse, zebrafish, pig and dog. For species where manually curated evidence is not available, Ensembl annotates the gene set using a gene prediction pathway (or annotation pipeline). This is termed the genebuild [24], which determines the Ensembl gene set using biological evidence, namely mRNA and protein information in databases such as UniProt/Swiss-Prot and annotated entries in RefSeq. Every resulting gene is based on at least one mRNA or protein, and in most cases, one Ensembl gene has been determined using multiple pieces of evidence from comprehensive biological databases.
The Ensembl annotation pipeline is carefully followed by the genebuild team [25]. A typical genebuild is performed over weeks, resulting in the Ensembl gene set of known and novel genes for a species. Two stages are applied in the full genebuild: the targeted stage and the similarity stage [26]. In the targeted stage, mRNA and proteins from the same species are aligned to the assembly for that species, using the GeneWise [27] algorithm. This results in the known Ensembl genes, which match genes with IDs in public databases for the same species. Novel genes result from the similarity stage, in which mRNA from the same species that were not aligned in the targeted stage and proteins from both the same and closely related species are aligned to the assembly using GeneWise and Exonerate [28]. A separate approach is used for low-coverage assemblies. This genebuild is based on projection of human genes, using a whole gene alignment between the low-coverage genome and the human genome. Specific information about the genebuild and assembly used for each species is found on the species-specific index page, along with species-specific news for the current release (for example, for human: Figure 2). For some organisms, like Drosophila melanogaster, genes are imported and matched to the assembly, in this case from FlyBase [29], a manually curated database of high quality. Yeast genes are imported from SGD to provide a simple eukaryote for comparative genomics studies. The evidence used to build each Ensembl gene can be readily viewed in the supporting evidence panel of ExonView (see Figure 11, Module 1).
The Ensembl genebuild procedure takes into account species-specific characteristics, such as duplicated genes in the teleosts [such as Danio rerio (zebrafish)] arising from a potential ancestral duplication in the genome [30, 31]. Ensembl strives to assess the quality of its gene set predicted from the annotation pipeline using experimental data. For the chicken genome, experimental methods relying on RT–PCR analysis showed Ensembl exon predictions to be 92% correct, with 94% of splice junctions correctly predicted by Ensembl [32]. Ensembl also compares genes predicted in the annotation pipeline to those determined by manual curators (VEGA [23]), which provides a quality check of the predicted gene set. Ensembl is a member of the CCDS consortium (http://www.ensembl.org/Homo_sapiens/ccds.html), in which NCBI (National Center for Biotechnology Information), WTSI (Wellcome Trust Sanger Institute), UCSC (University of California at Santa Cruz) and the EBI (European Bioinformatics Institute) aim to agree upon consensus coding sequences using one assembly, to obtain the most reliable set of genes possible. The CCDS set is available for human and mouse.
Annotation of pseudogenes is available for most species (human, mouse, rat, zebrafish, C. elegans, cow, armadillo, etc.). Ensembl transcripts that appear twice in the genome are tested for the features of processed pseudogenes: specifically a lack of introns and the presence of a poly(A) tail.
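The two features named above can be expressed as a small check. The following Python sketch is an illustration only: the `Transcript` record, its field names and the threshold of 10 consecutive A bases are assumptions introduced here and do not describe the actual Ensembl pipeline.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record for one genomic copy of a transcript."""
    exon_count: int           # number of exons in this copy
    downstream_sequence: str  # genomic sequence immediately 3' of the copy


def has_poly_a_tail(sequence: str, min_run: int = 10) -> bool:
    """True if the sequence starts with at least min_run consecutive A bases.

    The threshold is an assumed value chosen for illustration.
    """
    return sequence.upper().startswith("A" * min_run)


def looks_like_processed_pseudogene(copy: Transcript) -> bool:
    """Apply the two features from the text: no introns and a poly(A) tail."""
    intronless = copy.exon_count == 1
    return intronless and has_poly_a_tail(copy.downstream_sequence)
```

A single-exon copy followed by a run of A bases passes both checks, whereas a multi-exon copy or one without the tail does not.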
EST evidence is used to predict a separate set of Ensembl genes (Ensembl EST genes). These can be seen in ContigView (Module 2). EST information can support an Ensembl protein-coding gene prediction, especially if there is an overlap of an Ensembl EST gene prediction and a gene from the genebuild, as these two gene sets come from two different lines of evidence. ESTs have been used to determine various splice isoforms of a gene [33], and the Ensembl EST gene set can be extracted from the database using the Perl API (information available at: http://www.ensembl.org/info/software/index.html), or by direct query of the otherfeatures database with MySQL [34]. EST evidence may be excluded from the main genebuild for a species, as ESTs are often contaminated with genomic DNA, fragmented or affected by sequencing errors. However, high-quality EST evidence is used for some species, especially where protein/mRNA information in public databases is not extensive for that species.
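The idea that an overlapping EST gene lends support to a genebuild gene reduces to an interval comparison. The Python sketch below assumes a hypothetical `Feature` record with inclusive coordinates and ignores strand; it is not taken from the Ensembl code base.

```python
from typing import Iterable, NamedTuple


class Feature(NamedTuple):
    """Hypothetical gene feature: chromosome name plus inclusive coordinates."""
    chromosome: str
    start: int
    end: int


def overlaps(a: Feature, b: Feature) -> bool:
    """True if both features lie on the same chromosome and share at least one base."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def supported_by_est(gene: Feature, est_genes: Iterable[Feature]) -> bool:
    """True if any EST gene prediction overlaps the genebuild gene."""
    return any(overlaps(gene, est) for est in est_genes)
```

A stricter version might also require matching strand or shared exon boundaries; those refinements are left out to keep the sketch short.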
A genebuild is performed every time a new assembly becomes available, or once a year. However, new gene and mRNA information in public databases such as RefSeq and UniProt, and annotation updates such as oligos and probes can be included in Ensembl with every new release (i.e. every 2 months). These updates can be extracted from the Ensembl database and are shown in the GeneView page or ContigView, where tracks such as Ensembl genes and EMBL mRNAs can be drawn alongside the contig (or chromosome) (Figure 3). Live information in external databases can be visualized in GeneView and ContigView using DAS (the distributed annotation system: see Modules 1 and 2). This information can be more current, depending on the database.
| ORTHOLOGY/PARALOGY PREDICTION |
|---|
Orthologue and paralogue prediction is now carried out through the construction of phylogenetic trees across multiple species. The gene tree constructions use the longest translation for a gene and can be viewed in the GeneTreeView page (Figure 4), accessible by a link from the GeneView page for any gene. NJTREE http://treesoft.sourceforge.net/njtree.shtml is used to generate trees from the longest translation of an Ensembl gene and to compare these against the species tree to determine speciation or duplication events. Nodes represent either duplication (red squares) or speciation events (blue squares), and in this way both recent and ancient paralogues can be detected. In addition to providing evolutionary information, gene predictions can be checked across species as a method of quality control. Alignments can be exported using the Export roll-down menu on this page. Furthermore, alignments may be viewed and exported using the JalView [35] option on this page.
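To make the distinction between duplication and speciation nodes concrete, the Python sketch below labels the internal nodes of a toy gene tree. It uses a simple species-overlap rule (a node is called a duplication when the same species appears under more than one of its children) rather than the comparison against a species tree that NJTREE performs; the `Node` class and the rule itself are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical gene-tree node; leaves carry the species of their gene."""
    species: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    event: Optional[str] = None  # filled in for internal nodes only


def label_events(node: Node) -> Set[str]:
    """Label internal nodes and return the set of species found below this node."""
    if not node.children:
        return {node.species}
    seen: Set[str] = set()
    duplicated = False
    for child in node.children:
        child_species = label_events(child)
        if seen & child_species:
            # The same species occurs in two child subtrees.
            duplicated = True
        seen |= child_species
    node.event = "duplication" if duplicated else "speciation"
    return seen
```

For a tree in which two (human, mouse) subtrees are joined at the root, the root is labelled a duplication and each subtree a speciation, so the two human genes would be read as paralogues and each human and mouse pair within a subtree as orthologues.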
Multiple alignments are determined using Mercator (http://www.biostat.wisc.edu/~cdewey/mercator/) to build a synteny map and then Pecan (http://www.ebi.ac.uk/~bjp/pecan/; B. Paten, manuscript in preparation) to perform the alignments. These alignments can be seen in the browser AlignSliceView, ContigView and CytoView pages. All alignments can be exported using BioMart, and can be found in the ftp site ftp://ftp.ensembl.org/pub/ as emf files (see current_multi_species datafiles). Pairwise alignments are also available, determined with BlastZ-net [36] analysis for closely related species (alignment on the nucleic acid level) and Translated-Blat [37] for more distant species (alignment on the protein level). From these alignments, syntenic blocks are determined with a cutoff of (currently) 100 kb. Syntenic blocks can be viewed in SyntenyView and CytoView.
| DATA ACCESS |
|---|
Ensembl genes and associated annotation can be accessed through three main windows: the Ensembl browser (http://www.ensembl.org) (Figure 5), BioMart (a data-mining tool that extracts information from the Ensembl databases) and the Perl API. All three are updated with every release. Furthermore, the core, variation, EST and comparative genomics databases are accessible through a public MySQL server. For access to the database (ensembldb.ensembl.org), specify the user as anonymous (no password required). The API requires basic knowledge of Perl, whereas BioMart does not require any programming knowledge.
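As a minimal sketch of the MySQL route, the Python helper below assembles an argument list for the standard `mysql` command-line client using the host and user given above. The helper name and the example statement are assumptions; the text gives no port, so the client default is used, and nothing is executed by this code.

```python
import shlex
from typing import List, Optional

ENSEMBL_HOST = "ensembldb.ensembl.org"  # host named in the text
ENSEMBL_USER = "anonymous"              # user named in the text; no password needed


def mysql_command(statement: str, database: Optional[str] = None) -> List[str]:
    """Build an argument list for the mysql command-line client (hypothetical helper)."""
    args = ["mysql", f"--host={ENSEMBL_HOST}", f"--user={ENSEMBL_USER}"]
    if database is not None:
        args.append(f"--database={database}")
    args.append(f"--execute={statement}")
    return args


if __name__ == "__main__":
    # Print a command line that can be pasted into a shell; it is not run here.
    print(shlex.join(mysql_command("SHOW DATABASES")))
```

Listing the databases first is a reasonable starting point, because database names include the species and release and therefore change between releases.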
The Ensembl browser not only provides wetlab researchers access to the gene set and annotation without the need to directly query the Ensembl database, it also displays genes and other features so that they can be directly understood and compared in the context of a chromosomal region. Furthermore, sequences, alignments and genomic features such as clone sets can be directly exported from the browser and BioMart. The BLAST [38] tool allows any sequence to be compared against any genomic assembly in Ensembl. BLAST searches in the Ensembl browser are carried out using WU-BLAST [39]. Alternatively, sequences can be compared against a cDNA or peptide set. BLASTN, BLASTP, BLASTX, TBLASTN and TBLASTX are all supported options in the Ensembl browser. An option for very fast alignment to nucleic acid sequence is included (SSAHA2 [40]). An example of how SSAHA can be used within Ensembl is described in reference [41]. Finally, the browser provides a window into comparative genomics, where one can view homology prediction, gene trees, alignments and protein family predictions.
The main page of Ensembl (Figure 5) lists all the species in the browser for which there is annotation and provides links to BLAST, BioMart and the API. The assembly used in the genebuild for each species is shown by the picture or name, and clicking on a species link brings the user to an index page for more information about that species, including statistics on the genebuild and species-specific news for the current release. News for each release is also shown here, browsable backwards into the archive sites. The column on the left provides navigation of the website. Finally, a smart search is provided at the top, and help pages and contact information can be accessed by clicking the blue button at the top right-hand corner of the page.
The browser is customizable. Optional user logins allow specific pages such as ContigView to be configured, not only as cookies but within stored preferences that can be accessed from any computer. Pages can also be bookmarked and notes can be attached within the browser. Furthermore, local data can be uploaded and displayed on an individual site. These customized pages can be shared using the group function of the logins.
Links to pages in the browser can be found in the left-hand navigation column on every page, and within the page itself. From the GeneView and ContigView pages, the two major views of the browser, virtually all of the other pages can be reached. Use the Gene information link at the left to reach the GeneView page, or click on an Ensembl Gene ID link within a page. To reach the ContigView page, click on Graphical view on the left, or follow links to a chromosomal region. More specific pages will be described in the modules, along with the links to reach these pages.
Links to BioMart are available from every browser page. BioMart (Figure 6) is a data-mining tool developed to quickly and effectively obtain datasets from the Ensembl databases according to the user's query. Tables can be exported in various formats (HTML, text, Microsoft Excel) and sequences can be obtained in FASTA format using this program. Module 5 discusses the uses of BioMart and provides some examples.
| THE BROWSER: OVERVIEW AND PRACTICAL MODULES |
|---|
The browser is extensive in what it attempts to show and make accessible. To keep current with protein and mRNA information in the scientific databases, along with new genomic sequence information, a new Ensembl release occurs every 2 months incorporating new data, such as cDNA mapping updates and SNP information, along with new assemblies and gene sets. Archive sites (Figure 7) are available extending back in time for at least 2 years, depending on the species. A summary of archive sites and assemblies used is found at http://archive.ensembl.org/assembly.html. In contrast to the Archive sites, Ensembl also provides Pre! sites that contain new assembly information that is not yet fully annotated (in version 43 there are 6 species in the Pre! Site) (Figure 1). These sites allow visualization and extraction of the newest sequence assembly information as soon as it is available.
The first module focuses on one gene or transcript (GeneView, TransView) and demonstrates how the supporting evidence behind a gene prediction can be viewed (ExonView). Module 1 also provides an introduction to viewing external sources with DAS (the distributed annotation system) [42]. Module 2 describes how to view a chromosomal region and annotation for a section of the genome (ContigView, CytoView). Module 3 discusses variations (SNPView, GeneSNPView, TranscriptSNPView), Module 4 demonstrates comparative genomics options (GeneTreeView, AlignSliceView, GeneSeqAlignView and SyntenyView) and Module 5 provides an overview of BioMart.
Practical module 1: view information for a gene (The GeneView, TransView and ExonView pages)
To search for a gene, type a name or ID into the search box at the top of the home page, optionally with the species of interest and the word gene (for example, human HFE gene), and click Go (Figure 8). Click on the Ensembl identifier (in this example, ENSG00000010704) in the search results to go to the GeneView page. (A link to ContigView is also provided in the header of the search result.) Note that there is also a Vega gene annotated. Pages in the Ensembl browser are termed views. GeneView (also reachable through the Gene information link at the left of Ensembl pages) provides gene-specific information such as gene structure, number of transcripts, position on the chromosome, homology information, links to DAS (see below), identifiers in other databases (Figure 9), and protein domain predictions.
The distributed annotation system (DAS [42]) provides a means of allowing Ensembl to expand beyond its own databases by including information from external sources not housed in Ensembl. With DAS, information in databases worldwide can be viewed for an Ensembl gene, such as publications in PubMed (select the option HUGO_text in the GeneView page, for human) (Figure 10A), or for a position on the assembly (in ContigView). To display this information, select one or more DAS options in the GeneView page and click Update. As DAS sources are not housed in Ensembl, but remain external, any recent modifications to those databases will be accessed by the Ensembl browser. DAS is a way of reaching out beyond Ensembl's databases to expand the annotation available for a gene or chromosomal region, and to provide the newest information in those databases.
From this page, links are available for the gene tree, ID history (to track Ensembl IDs in previous releases) and chromosomal position among others. To navigate through the pages (views) of Ensembl, use the yellow navigation column at the left of the page. Click on the link Transcript information to go to the TransView page for this gene.
TransView contains much of the same information as the GeneView page, but it is focused on only one transcript. A list of similarity matches is shown: matches to gene IDs in other databases such as UniProt and Entrez Gene [43], phenotype IDs in MIM (Mendelian Inheritance in Man [44]) (Figure 9) and IDs for probes from Agilent, Illumina, etc. The base-pair sequence for the spliced transcript (exons only) is shown here. Protein sequence and SNPs can be drawn along the base-pair sequence.
The ExonView page shows intronic and flanking sequence as well as exons, and includes the supporting evidence for an Ensembl gene prediction. Click on Exon information at the left to reach the ExonView page. This page is colour-coded to differentiate coding sequence (black), UTRs (untranslated regions, purple), intronic sequence (blue) and flanking sequence (green). The display can be configured to show full or partial intronic sequence along with a variable flanking sequence length. At the bottom of the page is the Supporting Evidence panel (Figure 11). This panel shows all mRNA and protein entries in public databases (UniProt/Swiss-Prot, UniProt/TrEMBL and RefSeq) that were used to make an Ensembl transcript prediction.
Module 2: view a region of the chromosome
Gene information can also be viewed for a region of the chromosome, rather than starting with one specific ID. The Ensembl search function allows a chromosomal region to be directly accessed. For example, search for mouse chromosome 2:152700000.152800000 to visualize this base-pair range on ContigView, which displays a specific region of chromosome 2. SNP information can be displayed along the chromosome in this page using the features roll-down menu in the Detailed view panel of this page. One can also display Ensembl genes, genes in other databases, repeats and comparative genomics information, and other annotation using the roll-down menus in this display.
DAS tracks can also be viewed in the ContigView page. In Figure 10B, DECIPHER elements (http://decipher.sanger.ac.uk/) and GIS PETs [45, 46] are shown alongside Ensembl transcripts. (See the figure legend for more information about these DAS sources.) Finally, users can display their own information on Ensembl pages: GeneView, ProtView, ContigView and CytoView. (See the Manage sources link under DAS in these pages).
DAS sources are available across species. For example, for mouse, Fantom CAGE tags (Short Cap-Analysis Gene Expression sequences from the Functional Annotation of the mouse consortium at RIKEN) [47] and MICER clones (a library of targeting vectors for gene silencing and chromosome rearrangements in mouse) [48] can be viewed along the assembly in ContigView. The Detailed view panel is highly customizable and provides a template upon which to display features along a chromosome or contig (Figure 12).
The ContigView page can be obtained by clicking on Graphical view from most Ensembl pages. The Graphical overview link directly under the Graphical view link in the left-hand navigation column leads to CytoView, in which a large region of the chromosome can be viewed (up to 50 Mb, in comparison to a maximum display of 1 Mb for ContigView). However, fewer annotation options are available in the CytoView display. In CytoView, clone sets can not only be viewed along the chromosomal region as in ContigView; they can also be exported by chromosome, by a specified genomic region or for the whole genome. [Clone sets can also be exported with BioMart, under the Genomic Features option in the Attributes (Features) page.]
Module 3: variations
Most variation information in Ensembl is imported from dbSNP, though some is imported from resequencing projects such as the STAR project for rat. The imported variations (SNPs at single base-pair locations and in-dels) include flanking sequence, and are matched against the genome and stored in the database along with the SNP type in the context of an Ensembl transcript (for example, coding, noncoding, intronic), the allele and any ensuing peptide shift. SNPView and GeneSNPView portray variation information in depth, and SNPs can also be drawn on TransView, ProtView and ContigView. To reach SNPView, click on a SNP drawn in the TransView, ProtView or ContigView page. In the first two of these pages, SNPs can be drawn along the sequence (see the customization choices under the sequence). In ContigView (obtained through the Graphical view link), SNPs may be turned on using the Features menu of the Detailed view panel. In addition, BioMart can be used to access this variation information, either using the Ensembl core database or a SNP database (dbSNP and others). Finally, SNPs across strains can be displayed for mouse and rat (across breeds for dogs, and across individuals for human) with TranscriptSNPView to view similarities and differences in SNPs in multiple strains. For example, SNPs in BALB/cByJ can be compared with those in 129X1/SvJ and the reference strain (C57BL/6J). (To find this page, click on compare SNPs for transcript in the left-hand navigation column.) SNPs for different strains can be exported using BioMart (Figure 13). An in-depth discussion of variation resources will be presented elsewhere (Chen et al., manuscript in preparation).
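The notion of a SNP type in the context of a transcript can be illustrated with a coarse positional classification. In the Python sketch below, the function name, the inclusive coordinate convention and the category labels are assumptions made for illustration; here "noncoding" means exonic but outside the coding range, and the real Ensembl annotation distinguishes more cases.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) on the chromosome


def classify_snp(position: int, exons: List[Interval], coding: Interval) -> str:
    """Return a coarse SNP type relative to one transcript (hypothetical labels)."""
    transcript_start = min(start for start, _ in exons)
    transcript_end = max(end for _, end in exons)
    if not transcript_start <= position <= transcript_end:
        return "outside transcript"
    in_exon = any(start <= position <= end for start, end in exons)
    if not in_exon:
        return "intronic"
    if coding[0] <= position <= coding[1]:
        return "coding"
    return "noncoding"
```

Because the same SNP can fall in an exon of one transcript and an intron of another, a classification of this kind has to be computed per transcript rather than per gene.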
Module 4: comparative genomics
Homology predictions and alignments can be viewed and exported using either the browser or BioMart. Homologues (orthologues and paralogues) are predicted from gene trees and visible on the GeneView page. Click on any of the homologues for an alignment to the gene of interest. To view the gene tree for a gene, click on Gene tree info at the left of the page to go to GeneTreeView (Figure 7). Multiple alignments are calculated across eutherian mammals (seven species are used: chimpanzee, cow, dog, human, macaque, mouse and rat) and amniotic vertebrates (10 species are used: cow, chicken, chimpanzee, dog, human, macaque, mouse, opossum, platypus and rat). These alignments are calculated using whole genomes and can be graphically displayed on AlignSliceView (use the View alignment with ... link from the ContigView page). Export the aligned sequences on the nucleotide level from this view, or view and/or export the sequences using GeneSeqAlignView (the Genomic sequence alignment link from the GeneView page). Alignments can also be exported through BioMart. Pairwise alignments are available using these same pages.
Syntenic regions are determined using these alignments. Syntenic blocks for a chromosome with chromosomes of another species are displayed in SyntenyView. Syntenic blocks can also be displayed in CytoView. MultiContigView allows two chromosomes from different species to be compared with each other. Conserved regions can be highlighted.
Module 5: BioMart
BioMart provides a tool to allow fast export of customized tables and sequences in a format useful to scientists [Microsoft Excel, text, HTML or FASTA (for sequences) format]. Annotation in Ensembl such as gene IDs, SNPs, GO terms and protein domains can all be obtained with this program. Sequences (such as: genomic, transcript, cDNA, protein and flanking sequences) can be exported in FASTA format with the option to customize the header.
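For readers unfamiliar with FASTA, the Python sketch below shows what a record with a customized header looks like. The function name, the `|` separator between header fields and the line width of 60 are assumptions chosen for illustration; they are not a description of the exact BioMart output.

```python
from typing import Sequence


def fasta_record(header_fields: Sequence[str], sequence: str, width: int = 60) -> str:
    """Format one FASTA record with a '|'-separated header line (assumed layout)."""
    header = ">" + "|".join(header_fields)
    # Break the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *lines])


# Example with a made-up sequence and the gene ID used elsewhere in this article.
print(fasta_record(["ENSG00000010704", "HFE"], "ACGTACGTAC", width=4))
```

Choosing which fields appear in the header, such as a gene ID followed by a gene name, corresponds to the header customization option mentioned above.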
BioMart extracts data from the Ensembl databases according to the user's specifications, and is upgraded along with Ensembl with every release. There are three main phases of the web interface: (i) the database is selected, for example the Ensembl gene set for a species can be selected in Dataset, (ii) the Attributes are chosen, allowing associated annotation to be attached to this gene set (such as position on the chromosome, associated SNPs, homologous genes in other species, IDs in other databases, etc.) and (iii) Filters can be selected, allowing a subset of the gene set to be chosen (if desired) (Figure 14).
As a practical introduction to the BioMart set-up, the following explanation is given. The three phases of BioMart are marked A, B and C in Figure 15. Attributes and filters selected appear in the left-hand summary sections of BioMart, and the output will reflect these choices unless deselected. The attributes determine the headings for columns in the BioMart output table, and the order in which they were selected determines the order of the columns.
The attributes can be thought of as what one would like to know about the selected gene set (For example, the chromosomal positions of a set of genes, associated GO terms, variations or sequences). The optional third phase [the Filter phase (Figure 15: C)] allows the gene set to be narrowed, if information is not wanted for the entire gene set for a species. These filters are specified by what the scientist already knows about his/her gene set. Chromosomal location, gene IDs and InterPro domains [49] are among the options that can be used to select a smaller gene set.
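The relationship between dataset, attributes and filters can be mimicked on a small in-memory table. The Python sketch below is a toy model, not BioMart itself: `run_query`, the row layout and the attribute names are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def run_query(dataset: List[Row], attributes: List[str], filters: Dict[str, str]) -> List[List[str]]:
    """Return a header row followed by the rows that pass every filter."""
    table = [list(attributes)]  # column headings follow the order of selection
    for row in dataset:
        if all(row.get(name) == value for name, value in filters.items()):
            table.append([row.get(name, "") for name in attributes])
    return table


# Invented rows standing in for a gene set.
genes = [
    {"gene_id": "g1", "chromosome": "X", "domain": "d1"},
    {"gene_id": "g2", "chromosome": "2", "domain": "d1"},
]
print(run_query(genes, ["chromosome", "gene_id"], {"chromosome": "X"}))
```

With an empty filter dictionary every row is returned, which mirrors the point that the filter phase is optional.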
One example of a BioMart query is as follows: enter Ensembl gene IDs (for example, the HFE gene: ENSG00000010704) and obtain a corresponding list of official HGNC IDs determined by the HUGO Gene Nomenclature Committee [50] (for human), MGI [51] (for mouse), Entrez Gene and UniProt. This would be performed by specifying the Ensembl gene ID as a filter (click Filters at the left of the BioMart window, enter ENSG00000010704 under ID list limit in the GENE section in the right-hand side of the window), and selecting HGNC (for human), Entrez Gene and UniProt IDs as attributes. Note that a row for each transcript is given in the output table.
A second example is to find all genes on chromosome X that are associated with a specific InterPro domain, for example the immunoglobulin-like domain (IPR013151). In this case, the InterPro domain and X chromosome are specified under Filters, and gene IDs are selected in the Attributes section.
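The second example can also be written down as a structured query. The Python sketch below builds an XML document modelled on the query format used by BioMart web services; the element layout, the dataset name `hsapiens_gene_ensembl` and the filter and attribute names (`chromosome_name`, `interpro`, `ensembl_gene_id`) are assumptions that should be checked against the BioMart documentation for the release in use.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def biomart_query_xml(dataset: str, filters: Dict[str, str], attributes: List[str]) -> str:
    """Serialize a dataset, its filters and its attributes as a BioMart-style XML query."""
    query = ET.Element(
        "Query",
        {"virtualSchemaName": "default", "formatter": "TSV", "header": "1"},
    )
    dataset_element = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(dataset_element, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_element, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Genes on chromosome X with the immunoglobulin-like domain (names are assumed).
xml_query = biomart_query_xml(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "X", "interpro": "IPR013151"},
    ["ensembl_gene_id"],
)
print(xml_query)
```

The first example follows the same pattern, with the Ensembl gene ID as the only filter and the external identifiers listed as attributes.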
Advanced queries can be carried out using a linked-in secondary dataset. More information about how to use BioMart is available at http://www.biomart.org, and short tutorials are available in video format in the Ensembl Helpdesk section: Workshops Online. BioMart is also available in the archive sites.
| SUMMARY |
|---|
Ensembl endeavours to provide a comprehensive, highly accurate and current representation of the genome for a variety of species (focusing on the vertebrates). The browser organizes and depicts (with graphs, diagrams and tables) a vast multitude of gene and sequence-associated information for the scientific community. Ensembl addresses the genome through a variety of browser pages (or Ensembl Views), and through databases publicly accessible by the Perl API (Application Programming Interface: a series of algorithms that allow extraction of specific data from a database) or through BioMart. In addition, genomic annotation is available for download via the ftp site. New releases are provided to keep current with the most recent entries and updates in scientific databases.
The Perl API is kept current with every release. Documentation is available, along with instruction and a tutorial. The Perl API accesses all Ensembl databases [core, variation, otherfeatures (containing the EST genes) and compara, the comparative genomics database]. Support and discussion are offered in the form of the ensembl-dev list (subscription instructions are here: http://www.ensembl.org/info/about/contact.html), and an ensembl-announce mailing list keeps users up-to-date on coming developments.
Technical support is also offered in the form of a Helpdesk. Scientists and programmers are encouraged to email questions or comments on any level to helpdesk{at}ensembl.org. Furthermore, detailed help pages are provided for Ensembl views. Clicking on the blue Help button in the upper right-hand corner of any page returns page-specific information. Short instructional videos and slide presentations are available on the website, along with a worked example and glossary. Finally, Ensembl provides free workshops to instruct beginners and intermediate users in the website and/or the API.
| FUTURE GOALS |
|---|
|
|
|---|
As sequence assemblies improve and mRNA and protein entries in databases become more comprehensive, the Ensembl gene set is updated to reflect this new information. The annotation pipeline is continuously compared against gene sets developed by manual annotators, and is improved in order to make the most accurate, biologically relevant gene predictions. With regard to the increasing amount of gene annotation in scientific databases, Ensembl aims both to contain this wealth of information and to organize the site so that complexity is minimized. Frequent releases and DAS allow Ensembl to incorporate the newest information about its genes, and Ensembl aims to maintain this currency as the number of assemblies grows, reflecting an increasing number of organisms with sequenced genomes. Finally, Ensembl strives to reach beyond the genome and trace a gene from its sequence out to cellular function, connecting it with the world of proteins within an organism, and making this information readily accessible to the scientific public.
Key Points
|
| FOOTNOTES |
|---|
Giulietta Spudich has a background in biochemical research in the US and UK, and joined the Ensembl Outreach and Training team in July 2006. She now develops educational materials for Ensembl and gives workshops worldwide.
Xosé M. Fernández-Suárez has a background in molecular biology. He is the Project Leader for Ensembl Outreach and Training. He has been involved at multiple stages of the Ensembl development and has given courses and talks on Ensembl worldwide.
Ewan Birney, a Senior Scientist at EMBL, is currently the Head of Nucleotide Data at the EBI and leads the EBI half of the Ensembl and Reactome projects. He has been involved in the analysis of nearly every metazoan genome sequence, both via his leadership of Ensembl and contribution of his own research, and is a leading member of the human genome community, directly involved in both the draft and finished analysis of the human genome. In 2003, Ewan was awarded the inaugural Francis Crick Prize from the Royal Society, presented to an outstanding young molecular biologist.
| References |
|---|
|
|
|---|
- Hubbard T. Biological information: making it accessible and integrated (and trying to make sense of it). Bioinformatics (2002) 18(Suppl 2):S140.
- Birney E, Andrews TD, Bevan P, et al. An overview of ensembl. Genome Res (2004) 14:925–8.
- Birney E, Andrews D, Caccamo M, et al. Ensembl 2006. Nucleic Acids Res (2006) 34(Database issue):D556–61.
- Hubbard TJ, Aken BL, Beal K, et al. Ensembl 2007. Nucleic Acids Res (2007) 35(Database issue):D610–7.
- Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science (2001) 291:1304–51.
- Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
- Waterston RH, Lindblad-Toh K, et al, Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–62.
- Gibbs RA, Weinstock GM, Metzker ML, et al. Genome sequence of the brown Norway rat yields insights into mammalian evolution. Nature (2004) 428:493–521.
- Sprague J, Bayraktaroglu L, Clements D, et al. The zebrafish information network: The zebrafish model organism database. Nucleic Acids Res (2006) 34(Database issue):D581–5.
- Goffeau A, Barrell BG, Bussey H, et al. Life with 6000 genes. Science (1996) 274:546–7, 563.
- C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science (1998) 282:2012–8.
- Myers EW, Sutton GG, Delcher AL, et al. A whole-genome assembly of drosophila. Science (2000) 287:2196–204.
- Bairoch A, Apweiler R, Wu CH, et al. The universal protein resource (UniProt). Nucleic Acids Res (2005) 33(Database issue):D154–9.
- Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res (2007) 35(Database issue):D61–5.
- Harris MA, Clark J, Ireland A, et al. The gene ontology (GO) database and informatics resource. Nucleic Acids Res (2004) 32(Database issue):D258–61.
- Su AI, Cooke MP, Ching KA, et al. Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA (2002) 99:4465–70.
- Kelso J, Visagie J, Theiler G, et al. eVOC: a controlled vocabulary for unifying gene expression data. Genome Res (2003) 13:1222–30.
- Blake JA, Richardson JE, Bult CJ, et al. MGD: the mouse genome database. Nucleic Acids Res (2003) 31:193–5.
- Fodor SP, Rava RP, Huang XC, et al. Multiplexed biochemical assays with biological chips. Nature (1993) 364:555–6.
- Steemers FJ, Gunderson KL. Illumina, inc. Pharmacogenomics (2005) 6:777–82.
- Cherry JM, Adler C, Ball C, et al. SGD: Saccharomyces genome database. Nucleic Acids Res (1998) 26:73–9.
- Chen N, Harris TW, Antoshechkin I, et al. WormBase: a comprehensive data resource for caenorhabditis biology and genomics. Nucleic Acids Res (2005) 33(Database issue):D383–9.
- Ashurst JL, Chen CK, Gilbert JG, et al. The vertebrate genome annotation (vega) database. Nucleic Acids Res (2005) 33(Database issue):D459–65.
- Curwen V, Eyras E, Andrews TD, et al. The ensembl automatic gene annotation system. Genome Res (2004) 14:942–50.
- Potter SC, Clarke L, Curwen V, et al. The ensembl analysis pipeline. Genome Res (2004) 14:934–41.
- Fernández-Suárez XM, Searle S, Birney E. Ensembl's annotation pipeline and its use in eukaryotic genomes. In: Mulder N, Apweiler R, eds. In Silico Genomics and Proteomics: Functional Annotation of Genomes and Proteins. (2006) New York: Nova Science Publishers, Inc. 109–23.
- Birney E, Clamp M, Durbin R. GeneWise and genomewise. Genome Res (2004) 14:988–95.
- Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics (2005) 6:31.
- Drysdale RA, Crosby MA, FlyBase Consortium. FlyBase: genes and gene models. Nucleic Acids Res (2005) 33(Database issue):D390–5.
- Brunet FG, Crollius HR, Paris M, et al. Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Mol Biol Evol (2006) 23:1808–16.
- Jaillon O, Aury JM, Brunet F, et al. Genome duplication in the teleost fish tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature (2004) 431:946–57.
- Eyras E, Reymond A, Castelo R, et al. Gene finding in the chicken genome. BMC Bioinformatics (2005) 6:131.
- Brett D, Hanke J, Lehmann G, et al. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett (2005) 474:83–6.
- Stabenau A, McVicker G, Melsopp C, et al. The ensembl core software libraries. Genome Res (2004) 14:929–33.
- Clamp M, Cuff J, Searle SM, et al. The jalview java alignment editor. Bioinformatics (2004) 20:426–7.
- Schwartz S, Kent WJ, Smit A, et al. Human-mouse alignments with BLASTZ. Genome Res (2003) 13:103–7.
- Kent WJ. BLAT – the BLAST-like alignment tool. Genome Res (2002) 12:656–64.
- Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol (1990) 215:403–10.
- Lopez R, Silventoinen V, Robinson S, et al. WU-Blast2 server at the European bioinformatics institute. Nucleic Acids Res (2003) 31:3795–8.
- Ning Z, Cox AJ, Mullikin JC. SSAHA: a fast search method for large DNA databases. Genome Res (2001) 11:1725–9.
- Fernández-Suárez XM, Schuster MK. Using the ensembl genome server to browse genomic sequence data. Current Protocols in Bioinformatics Supplement (2007) 16:1.15.1–1.15.36.
- Dowell RD, Jokerst RM, Day A, et al. The distributed annotation system. BMC Bioinformatics (2001) 2:7.
- Maglott D, Ostell J, Pruitt KD, et al. Entrez gene: gene-centered information at NCBI. Nucleic Acids Res (2007) 35(Database issue):D26–31.
- McKusick VA. Mendelian inheritance in man and its online version, OMIM. Am J Hum Genet (2007) 80:588–604.
- Chiu KP, Wong CH, Chen Q, et al. PET-tool: a software suite for comprehensive processing and managing of paired-end diTag (PET) sequence data. BMC Bioinformatics (2006) 7:390.
- Ng P, Wei CL, Sung WK, et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Methods (2005) 2:105–11.
- Shiraki T, Kondo S, Katayama S, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci USA (2003) 100:15776–81.
- Adams DJ, Biggs PJ, Cox T, et al. Mutagenic insertion and chromosome engineering resource (MICER). Nat Genet (2004) 36:867–71.
- Mulder NJ, Apweiler R, Attwood TK, et al. InterPro, progress and status in 2005. Nucleic Acids Res (2005) 33(Database issue):D201–5.
- Eyre TA, Ducluzeau F, Sneddon TP, et al. The HUGO gene nomenclature database, 2006 updates. Nucleic Acids Res (2006) 34(Database issue):D319–21.
- Eppig JT, Blake JA, Bult CJ, et al. The mouse genome database (MGD): new features facilitating a model system. Nucleic Acids Res (2007) 35(Database issue):D630–7.
















