Similar articles
 Found 20 similar articles (search time: 22 ms)
1.
The Ensembl automatic gene annotation system   Cited by: 17 (self-citations: 2; citations by others: 15)
As more genomes are sequenced, there is an increasing need for automated first-pass annotation which allows timely access to important genomic information. The Ensembl gene-building system enables fast automated annotation of eukaryotic genomes. It annotates genes based on evidence derived from known protein, cDNA, and EST sequences. The gene-building system rests on top of the core Ensembl (MySQL) database schema and Perl Application Programming Interface (API), and the data generated are accessible through the Ensembl genome browser (http://www.ensembl.org). To date, the Ensembl predicted gene sets are available for the A. gambiae, C. briggsae, zebrafish, mouse, rat, and human genomes and have been heavily relied upon in the publication of the human, mouse, rat, and A. gambiae genome sequence analysis. Here we describe in detail the gene-building system and the algorithms involved. All code and data are freely available from http://www.ensembl.org.

2.
3.
We have tested whether a direct correlation of sequence information and staining properties of chromosomes is possible and whether this combined information can be used to precisely map any position on the chromosome. Despite huge differences in compaction between naked DNA and DNA packed in chromosomes, we found a striking correlation when visualizing the GGCC density on both levels. Software was developed that allows one to superimpose chromosomal fluorescence intensity profiles generated by chromomycin A3 (CMA3) staining with GGCC density extracted from the Ensembl database. Thus, any position along the chromosome can be defined in megabase pairs (Mb) in addition to the cytoband information, enabling direct alignment of chromosomal information with the sequence data. The mapping tool was validated using 13 different BAC clones, resulting in a mean difference from Ensembl data of 2 Mb (ranging from 0.79 to 3.57 Mb). Our results indicate that the sequence density information and information gained with sequence-specific fluorochromes are superimposable. Thus, the visualized GGCC motif density along the chromosome (sequence bands) provides a unique platform for comparing different types of genomic information. Electronic Supplementary Material: The online version of this article (doi:) contains supplementary material, which is available to authorized users.
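The GGCC density profile described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the authors computed the density, so the function name `motif_density`, the fixed-size window, and the rule that an occurrence belongs to the window containing its start position are all assumptions made here for illustration.

```python
def motif_density(sequence, motif="GGCC", window=1_000_000):
    """Count motif occurrences per fixed-size window along a sequence.

    Hypothetical helper: the window size (1 Mb by default) and the
    binning rule are illustrative assumptions, not the published method.
    """
    sequence = sequence.upper()
    # Number of windows needed to cover the sequence (ceiling division).
    n_windows = (len(sequence) + window - 1) // window
    counts = [0] * n_windows
    start = sequence.find(motif)
    while start != -1:
        # Assign the occurrence to the window that contains its first base.
        counts[start // window] += 1
        start = sequence.find(motif, start + 1)
    return counts


# Toy example with a 6-base window instead of 1 Mb.
profile = motif_density("GGCCAAGGCCTT", window=6)
```

With a real chromosome sequence and the default window, each entry of the returned list would correspond to one megabase, which is the unit the abstract uses to define positions.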

4.
5.
6.
Ensembl (http://www.ensembl.org/) is a bioinformatics project to organize biological information around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of individual genomes, and of the synteny and orthology relationships between them. It is also a framework for integration of any biological data that can be mapped onto features derived from the genomic sequence. Ensembl is available as an interactive Web site, a set of flat files, and as a complete, portable open source software system for handling genomes. All data are provided without restriction, and code is freely available. Ensembl's aims are to continue to "widen" this biological integration to include other model organisms relevant to understanding human biology as they become available; to "deepen" this integration to provide an ever more seamless linkage between equivalent components in different species; and to provide further classification of functional elements in the genome that have been previously elusive.

7.
8.
The IMGT/HLA Database is a specialist database for sequences of the human major histocompatibility (MHC) system. It includes all the HLA sequences officially recognised and named by the WHO Nomenclature Committee for Factors of the HLA System. The database provides users with online tools and facilities for the retrieval and analysis of these sequences. These include allele reports, alignment tools and a detailed database of all source cells. The online IMGT/HLA submission tool allows the submission of both new and confirmatory allele sequences directly to the WHO Nomenclature Committee for Factors of the HLA System. The latest version (release 1.4.1, November 1999) contains 1,015 HLA alleles from over 2,270 component sequences derived from the EMBL/GenBank/DDBJ databases. From its release in December 1998 until December 1999 the IMGT/HLA website received approximately 100,000 hits. The database currently focuses on the human major histocompatibility complex but will be used as a model system to provide specialist databases for the MHC sequences of other species.

9.
With the availability of the Internet, interest in the possibilities of telepathology has increased considerably. The foremost need is for the non-expert to obtain the opinions of experts on morphological findings by means of a fast and simple procedure. The new telepathology system iPath meets these needs. The system is built from small modules that work independently wherever possible. This design allows the system to be adapted easily to the user's individual environment (e.g. different cameras, frame-grabbers, microscope steering tables, etc.) and to individual needs. iPath has been in use for 6 months with various working groups. Telepathology distinguishes between "passive" and "active" consultations; in both forms, a non-expert obtains the opinion of an expert. In an active consultation the two are in direct contact with each other (orally or via a chat function); in a passive consultation they are not. An active consultation can include an interactive discussion between the expert and the non-expert of images in an image database, or the direct interpretation by the expert of images from a microscope. Four software modules are available for free and for the fastest possible application: (1) the "Microscope control" module, (2) the "Connector" module (insertion of images directly from the microscope without a motorized microscope), (3) the "Client-application" module, accessed via the web browser, and (4) the "Server" module with a database. The server is placed on the Internet and not behind a firewall. It continuously receives information from the periphery and returns the information to the periphery on request. The only thing the expert, the non-expert and the microscope have to know is how to contact the server.

10.
SSAHA: a fast search method for large DNA databases   Cited by: 17 (self-citations: 2; citations by others: 17)
Ning Z, Cox AJ, Mullikin JC. Genome Research, 2001, 11(10): 1725-1729
We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.
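The hash-and-sort search described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the SSAHA implementation: the function names `build_index` and `search` are hypothetical, database k-tuples are assumed to be taken at non-overlapping offsets (one reading of "consecutive"), and each hit is assumed to be stored with a "shift" (database offset minus query position) so that sorting groups hits lying on the same diagonal.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map each k-tuple to the (sequence id, offset) pairs where it occurs.

    Assumption: database tuples are sampled at offsets 0, k, 2k, ...
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(0, len(seq) - k + 1, k):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def search(index, query, k):
    """Collect hits for every k-tuple of the query, then sort them.

    Each hit is (sequence id, shift, offset), where shift is the database
    offset minus the query position. After sorting, hits that share a
    sequence id and a shift are adjacent, which marks a candidate match.
    """
    hits = []
    for pos in range(len(query) - k + 1):
        # dict.get avoids inserting empty lists for unseen tuples.
        for seq_id, offset in index.get(query[pos:pos + k], ()):
            hits.append((seq_id, offset - pos, offset))
    hits.sort()
    return hits


# Toy database of one sequence, indexed with k = 2.
index = build_index(["ACGTACGGTTCA"], 2)
hits = search(index, "GGTT", 2)
```

In this toy run the two hits with shift 6 come from the exact occurrence of the query at database offset 6, while the single hit with shift 1 is a chance match of one tuple. A larger k would reduce such chance hits at the cost of sensitivity, which is the trade-off the abstract mentions.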

11.
12.
16S rRNA gene sequences of 102 Nocardia isolates were analyzed using the Integrated Database Network System (IDNS) SmartGene centroid database. A total of 76% of the isolates were correctly identified. Discordant identifications were due to inadequate centroid length (3 species), inaccurate or insufficient entries in the public databases (5 species), and heterogeneous sequences among members of a species (1 species).
Nocardia species are significant human pathogens, especially in immunocompromised patients. Accurate species assignment is important to guide appropriate antibiotic selection, as several species possess high levels of antibiotic resistance (9, 10). 16S rRNA gene sequencing has become the most widely used method for the identification of these organisms.
Noncurated public databases contain a wealth of sequence information but also include a significant amount of information that is compromised by identification errors and low-quality sequence data (8). Searches against such databases and analysis of the results have become increasingly time-consuming, due to the rapidly growing number of submitted sequences. A quality-controlled sequence database that eliminates inaccurate and redundant entries and identifies the best representative sequence for a particular species would facilitate rapid and accurate identification.
SmartGene (SmartGene GmbH, Zug, Switzerland) is a web-based sequence search tool that allows comparison of input sequences with those in two curated databases. Using a proprietary filtering algorithm, the SmartGene 16S rRNA eubacteria database is assembled using sequences deemed acceptable from the public databases. A search using the eubacteria database allows the user to examine the inter- and intraspecies variabilities of a particular species within the SmartGene platform.
An additional database, the centroid database, is prepared using a supplementary proprietary algorithm that examines all sequences of a particular species in the eubacteria database and creates a discrete “species group” of sequences for that species. The most representative sequence of that species is designated the “centroid” sequence. Only one sequence per species (the centroid sequence) is included in the centroid database; this sequence may or may not be the type strain of the species, depending on the heterogeneity of that species. Output from a search of the centroid database with an unknown sequence will show the most similar centroid sequences sorted by match score or by the number of base mismatches. The results will also indicate the number of sequences in the centroid group.
A study comparing the use of a previous version of the SmartGene software with the use of conventional identification and with that of another proprietary sequence database found SmartGene to be a time- and labor-saving system that provided a larger percentage of accurate genus or species assignments for a diverse group of bacteria than did the proprietary database (7).
We report here a study comparing the usefulness of the SmartGene centroid and eubacteria databases for accurate species-level identification of a variety of clinically relevant Nocardia isolates. We also include a careful analysis of discrepant results.

13.
With the completion of the human genome sequence and genome sequence available for other vertebrate genomes, the task of manual annotation at the large genome scale has become a priority. Possibly even more important is the requirement to curate and improve this annotation in the light of future data. For this to be possible, there is a need for tools to access and manage the annotation. Ensembl provides an excellent means for storing gene structures, genome features, and sequence, but it does not support the extra textual data necessary for manual annotation. We have extended Ensembl to create the Otter manual annotation system. This comprises a relational database schema for storing the manual annotation data, an application-programming interface (API) to access it, an extensible markup language (XML) format to allow transfer of the data, and a server to allow multiuser/multimachine access to the data. We have also written a data-adaptor plugin for the Apollo Browser/Editor to enable it to utilize an Otter server. The Otter database is currently used by the Vertebrate Genome Annotation (VEGA) site (http://vega.sanger.ac.uk), which provides access to manually curated human chromosomes. Support is also being developed for using the AceDB annotation editor, FMap, via a Perl wrapper called Lace. The Human and Vertebrate Annotation (HAVANA) group annotators at the Sanger center are using this to annotate human chromosomes 1 and 20.

14.
High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly so for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher usable sequence yield and improved accuracy compared to that of existing software.

15.
Comparative genomics techniques are used in bioinformatics analyses to identify the structural and functional properties of DNA sequences. As the amount of available sequence data steadily increases, the ability to perform large-scale comparative analyses has become increasingly relevant. In addition, the growing complexity of genomic feature annotation means that new approaches to genomic visualization need to be explored. We have developed a Java-based application called Sockeye that uses three-dimensional (3D) graphics technology to facilitate the visualization of annotation and conservation across multiple sequences. This software uses the Ensembl database project to import sequence and annotation information from several eukaryotic species. A user can additionally import their own custom sequence and annotation data. Individual annotation objects are displayed in Sockeye by using custom 3D models. Ensembl-derived and imported sequences can be analyzed by using a suite of multiple and pair-wise alignment algorithms. The results of these comparative analyses are also displayed in the 3D environment of Sockeye. By using the Java3D API to visualize genomic data in a 3D environment, we are able to compactly display cross-sequence comparisons. This provides the user with a novel platform for visualizing and comparing genomic feature organization.

16.
As the human genome sequencing project nears completion, there has been a vast increase in the rate at which disease- and nondisease-associated variant sequences are being sought and detected. This has heightened the need for software with which to accumulate allelic variant (mutation) data, and with which to make the data accessible to the scientific community. Many ad hoc solutions have been developed by those interested in specific genes and diseases, and the creation of central databases which hold data for all genes has provided an alternative repository for some of the locus data. Despite this, few specialised software tools exist for researchers to create their own locus-specific allelic variant databases. This article describes methods available to potential curators, including software systems developed with the sole purpose of generating locus-specific mutation databases. In particular, the authors' own software, MuStaR™, is described. MuStaR™ allows curators to maintain a database on a laptop computer if desired, while being able to export the data to an automatically generated Website which will run on any CGI-compliant Web server. Searching the database and the submission of new mutations are made possible through fill-in Web forms. A number of other software tools which may be of use to curators are also described.

17.
A visual stimulus display system controlled by a microcomputer was constructed at low cost. The system consists of an LED stimulus display device, a microcomputer, two interface boards, a pointing device (a "mouse") and two software packages. The first software package is written in BASIC. Its functions are: to construct stimulus patterns using the mouse, to construct letter patterns (alphabet, digits, symbols and Japanese letters: kanji, hiragana, katakana), to modify the patterns, to store the patterns on a floppy disc, and to translate the patterns into integer data which are used to display the patterns in the second software package. The second software package, written in BASIC and machine language, controls display of a sequence of stimulus patterns in predetermined time schedules in visual experiments.

18.
To interpret whole exome/genome sequence data for clinical and research purposes, comprehensive phenotypic information, knowledge of pedigree structure, and results of previous clinical testing are essential. With these requirements in mind and to meet the needs of the Centers for Mendelian Genomics project, we have developed PhenoDB (http://phenodb.net), a secure, Web‐based portal for entry, storage, and analysis of phenotypic and other clinical information. The phenotypic features are organized hierarchically according to the major headings and subheadings of the Online Mendelian Inheritance in Man (OMIM®) clinical synopses, with further subdivisions according to structure and function. Every string allows for a free‐text entry. All of the approximately 2,900 features use the preferred term from Elements of Morphology and are fully searchable and mapped to the Human Phenotype Ontology and Elements of Morphology. The PhenoDB allows for ascertainment of relevant information from a case in a family or cohort, which is then searchable by family, OMIM number, phenotypic feature, mode of inheritance, genes screened, and so on. The database can also be used to format phenotypic data for submission to dbGaP for appropriately consented individuals. PhenoDB was built using Django, an open source Web development tool, and is freely available through the Johns Hopkins McKusick‐Nathans Institute of Genetic Medicine (http://phenodb.net).

19.
The Ensembl Web site (http://www.ensembl.org/) is the principal user interface to the data of the Ensembl project, and currently serves >500,000 pages (approximately 2.5 million hits) per week, providing access to >80 GB (gigabyte) of data to users in more than 80 countries. Built atop an open-source platform comprising Apache/mod_perl and the MySQL relational database management system, it is modular, extensible, and freely available. It is being actively reused and extended in several different projects, and has been downloaded and installed in companies and academic institutions worldwide. Here, we describe some of the technical features of the site, with particular reference to its dynamic configuration that enables it to handle disparate data from multiple species.

20.
Deep sequencing technologies are completely revolutionizing the approach to DNA analysis. Mitochondrial DNA (mtDNA) studies have entered the “postgenomic era”: the burst in sequenced samples observed in nuclear genomics is also expected in mitochondria, a trend that can already be detected by checking the rate at which complete mtDNA sequences are submitted to databases. Tools for the analysis of these data are available, but they fall short in throughput or in ease of use. We present here a new pipeline based on previous algorithms, inherited from the “nuclear genomic toolbox,” combined with a newly developed algorithm capable of efficiently and easily classifying new mtDNA sequences according to PhyloTree nomenclature. Detected mutations are also annotated using data collected from publicly available databases. Thanks to the analysis of all freely available sequences with known haplogroup obtained from GenBank, we were able to produce a PhyloTree‐based weighted tree, taking into account each haplogroup pattern conservation. The combination of a highly efficient aligner, coupled with our algorithm and massive usage of asynchronous parallel processing, allowed us to build a high‐throughput pipeline for the analysis of mtDNA sequences that can be quickly updated to follow the ever‐changing nomenclature. HaploFind is freely accessible at the following Web address: https://haplofind.unibo.it.
