| 1. Identification of thermophilic species by the amino acid compositions deduced from their genomes |
| David P. Kreil, EMBL - EBI / University of Cambridge;
Christos A. Ouzounis, EMBL - EBI; kreil@ebi.ac.uk |
| Short Abstract:
Global amino acid compositions deduced from 47 complete genomic sequences were analyzed by hierarchical clustering and PCA. Although GC content has a dominant effect, thermophiles can be identified by their amino acid compositions alone. While the number of genomes is now high enough to discern even a third factor, more `unusual´ species are still required. |
| One Page Abstract:
The global amino acid compositions as deduced from the complete genomic sequences of seven thermophilic archaea, one mesophilic archaeon, two thermophilic bacteria, 34 mesophilic bacteria, and three eukaryotic species were analyzed by hierarchical clustering and principal components analysis (PCA). This study presents a careful statistical analysis of factors that affect amino acid composition. Both hierarchical clustering and PCA showed an influence of two main factors on amino acid composition. Even though GC content has a dominant effect, thermophilic species can be identified by their global amino acid compositions alone. Differences between the groups of thermophiles and mesophiles were verified with appropriate statistical post-hoc tests. Based on this data analysis we introduce a `compositional tree´ of species that takes into account not only homologous proteins, but also proteins unique to particular species. We expect this simple yet novel approach to be a useful additional tool for the study of phylogeny at the genome level. This analysis extends our previous work [1] to a larger number of species, one of which is a mesophilic archaeon. The new analysis clearly supports the notion that the second strongest determining factor of global amino acid composition is indeed thermophilicity rather than archaeal origin. With the larger number of completely sequenced genomes available, a third major separable factor determining amino acid composition is now emerging, besides GC content and thermophily. However, for the present analysis the genomes of only one mesophilic archaeon and two thermophilic bacteria were available. This points to a general problem for whole-genome studies: the selection of sequenced genomes available is increasingly biased. We show how to deal with this problem by applying thorough statistical methods.
[1] Kreil, D. P. and Ouzounis, C. A. (2001) `Identification of thermophilic species by the amino acid compositions deduced from their genomes´. Nucleic Acids Res. 29, 1608-15. |
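As a rough illustration of the input to such an analysis, the sketch below computes a global amino acid composition vector per genome and a Euclidean distance between two of them, which could then feed a hierarchical clustering or PCA routine. The function names, the choice of Euclidean distance, and the toy sequences are assumptions made for illustration; they are not taken from the abstract.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def global_composition(proteins):
    """Relative frequency of each standard amino acid over all
    protein sequences of one genome (non-standard symbols are ignored)."""
    counts = Counter()
    for seq in proteins:
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return [counts[a] / total for a in AMINO_ACIDS]

def euclidean(u, v):
    """Euclidean distance between two composition vectors (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy "proteomes", not real data.
toy = {
    "species_A": ["MKKLEEV", "MEEKKIE"],
    "species_B": ["MAGGLPA", "MPAGGSA"],
}
compositions = {name: global_composition(p) for name, p in toy.items()}
distance = euclidean(compositions["species_A"], compositions["species_B"])
```

Each genome is reduced to a 20-dimensional vector that sums to one, so genomes of very different size become directly comparable.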
|
|
| 2. Sequence Analysis by Iterative Maps - beyond graphical representation |
| Susana Vinga, ITQB/Universidade Nova Lisboa;
Jonas S. Almeida, Department of Biometry and Epidemiology, Medical University of South Carolina; ITQB/Universidade Nova Lisboa; João A. Carriço, António Maretzek, ITQB/Universidade Nova Lisboa; Peter A. Noble, Madilyn Fletcher, Belle W. Baruch Institute for Marine Biology and Coastal Research, Marine Science Program and Department of Biologica; svinga@itqb.unl.pt |
| Short Abstract:
Iterative mapping of DNA sequences by Chaos Game Representation (CGR) was first proposed for the investigation of patterns. By counting frequencies in arbitrarily sized quadrants, order-free Markov chain probability tables are obtained, highlighting the usefulness of CGR as a sequence-modelling tool. The iterative procedure was further extended to accommodate higher-dimensional alphabets. |
| One Page Abstract:
Iterative mapping of DNA sequences by Chaos Game Representation (CGR) was first proposed in 1990 for the investigation of patterns. This initial attempt to produce scale-independent representations of biological sequences has some important properties: 1) one-to-one correspondence between points in the continuous map and the respective sequences; 2) proximity in the map of sequences with the same suffix [maximum distance 2^-k in each coordinate on the map between two sequences with the same k last units]. We have further explored this representation and found that, by counting frequencies in arbitrarily sized quadrants, an order-free Markov chain probability table is obtained, accommodating both integer and non-integer order resolution. These newly uncovered properties highlight the usefulness of CGR as a sequence-modelling tool rather than just a graphical representation technique. The iterative procedure was further extended to accommodate higher dimensions, defining unit-block iterated continuous domains. Further reading: Almeida, J.S., Carriço, J.A., Maretzek, A., Noble, P.A. and Fletcher, M. (2001) Analysis of genomic sequences by Chaos Game Representation. Bioinformatics, 17, 429-437 |
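The iteration and the quadrant counting described above can be sketched as follows. This is a minimal illustration of the standard CGR construction on the unit square; the corner assignment, the starting point (0.5, 0.5), and the function names are assumptions, and only integer resolutions k are handled here.

```python
# Assumed corner assignment on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return the CGR point reached after each symbol of seq:
    every step moves halfway from the current point to the symbol's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def quadrant_counts(points, k):
    """Count points per cell of a 2^k x 2^k grid. A cell collects exactly the
    positions that share the same last k symbols, so the counts form a
    k-mer frequency table."""
    n = 2 ** k
    counts = {}
    for x, y in points[k - 1:]:  # skip points with fewer than k symbols of history
        cell = (min(int(x * n), n - 1), min(int(y * n), n - 1))
        counts[cell] = counts.get(cell, 0) + 1
    return counts

# Two sequences sharing the suffix "GAT" end within 2^-3 of each other.
p = cgr_points("ACGTACGTGAT")[-1]
q = cgr_points("TTTTGAT")[-1]
```

Changing k changes the order of the table without recomputing the map, which is the sense in which the representation is order-free.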
|
|
| 3. Global Analysis of Protein Activities Using Proteome Chips |
| Ning Lan, Ronald Jansen, Paul Bertone, Heng Zhu, Michael Snyder, Mark
Gerstein, Yale University;
lan@bioinfo.mbb.yale.edu |
| Short Abstract:
A defined collection of 5800 yeast proteins was printed on a proteome
microarray and screened for their ability to interact with proteins, nucleic
acids, and phospholipids. An algorithm was developed to identify positive
signals on the proteome microarray and to cluster the proteins identified into
functional groups.
|
| One Page Abstract:
A daunting task in the post-genome sequencing era is to ascribe functions to every protein encoded by a given genome. Direct analysis of protein function on proteome chips is likely to provide an extremely valuable approach for elucidating gene function on a global scale. A defined collection of 5800 proteins from the budding yeast was prepared using high-throughput techniques and printed onto glass slides to screen for many activities including protein-protein, protein-DNA, protein-RNA, and protein-liposome interactions. Visual inspection identified 39 yeast proteins that bind calmodulin. Sequence analysis revealed that these calmodulin-binding proteins share a motif whose consensus is I/L-Q-X-X-K-K/X-G-B, where X is any residue and B is a basic residue. An algorithm was developed to identify and analyze positive signals in protein-liposome binding experiments. Variations between chips and local variations on the chip cause additional fluctuations of the binding signals quantitated using GenePix software. To correct the variation between chips, the signals were scaled from different experiments into a common range by subtracting the median and dividing by the difference between upper and lower quartile, thus transforming the signal distributions of different experiments to comparable shapes. To correct the local variation on the chip, we performed a "neighborhood subtraction" for each spot. We defined a region of two rows above and below as well as two columns to the left and right of a spot as the neighborhood region. The median signal of this region was then subtracted from the spot signal. The number of highly fluorescent spots in any neighborhood region is generally low enough in these experiments not to disturb the median significantly. Finally, if the variation between two parallel samples was greater than 3 standard deviations of the error distribution of the samples, the data point was flagged and excluded from further analysis. 
After this filtering procedure, we normalized the filtered lipid binding signal G with the GST signal R, yielding the ratio r = G/R, which is a measure of the binding per amount of protein and allows comparison of binding signals between different proteins. The specific binding ratio r is sensitive to errors eG and eR in both the G and R signals. Therefore, we computed 90% and 95% confidence intervals for this ratio with a Monte-Carlo procedure, assuming that r is a good approximation of the actual average of the ratio population: r + er = (G+eG)/(R+eR), where er represents the error of the ratio r. This algorithm identified 150 yeast proteins that bind phosphatidylinositol lipids, 52 of which correspond to uncharacterized proteins, indicating that many previously uncharacterized proteins have potentially important biochemical activities. These proteins were clustered into four groups based on binding strength and specificity. These results provide a wealth of new information about many known and previously uncharacterized proteins, and thus demonstrate that proteome chips provide a valuable opportunity for direct global proteome analysis. |
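The three numerical steps described above (between-chip scaling, neighborhood subtraction, and a Monte-Carlo interval for r = G/R) might be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the spot itself is excluded from its own neighborhood, neighborhoods are clipped at the chip edge, and the errors eG and eR are assumed to be independent and normally distributed with user-supplied standard deviations.

```python
import random
from statistics import median, quantiles

def robust_scale(signals):
    """Between-chip correction: subtract the median, divide by the
    interquartile range (upper minus lower quartile)."""
    q1, _, q3 = quantiles(signals, n=4)
    med = median(signals)
    return [(s - med) / (q3 - q1) for s in signals]

def neighborhood_subtract(grid, radius=2):
    """Local correction: subtract from each spot the median of the spots
    within `radius` rows and columns (the spot itself is excluded here)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neigh = [grid[i][j]
                     for i in range(max(0, r - radius), min(rows, r + radius + 1))
                     for j in range(max(0, c - radius), min(cols, c + radius + 1))
                     if (i, j) != (r, c)]
            out[r][c] = grid[r][c] - median(neigh)
    return out

def ratio_interval(g, r, sd_g, sd_r, level=0.95, draws=10000, seed=0):
    """Monte-Carlo interval for (G + eG) / (R + eR), with eG and eR drawn
    from zero-mean normal distributions (an assumed error model)."""
    rng = random.Random(seed)
    ratios = sorted((g + rng.gauss(0, sd_g)) / (r + rng.gauss(0, sd_r))
                    for _ in range(draws))
    lower = ratios[int((1 - level) / 2 * draws)]
    upper = ratios[int((1 + level) / 2 * draws) - 1]
    return lower, upper

# Toy example: a 5 x 5 chip with one bright spot on a flat background.
chip = [[10.0] * 5 for _ in range(5)]
chip[2][2] = 100.0
corrected = neighborhood_subtract(chip)
interval_95 = ratio_interval(100.0, 50.0, sd_g=5.0, sd_r=2.0, level=0.95)
```

Using the median for the neighborhood makes the local background estimate insensitive to a small number of bright neighbors, in line with the remark in the abstract.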
|
|
| 4. Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure |
| Julian Gough, MRC Laboratory of Molecular Biology;
Kevin Karplus, Richard Hughey, UCSC, USA; Cyrus Chothia, MRC Laboratory of Molecular Biology; jgough@mrc-lmb.cam.ac.uk |
| Short Abstract:
A hidden Markov model library representing all proteins of known structure has been built based on SCOP. This library has been used on all complete genomes to assign structural superfamilies to sequences. The genome assignments, sequence alignments, and a facility to search the library are available at http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. |
| One Page Abstract:
Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pair-wise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this poster describes calculations that (i) improve the performance of HMMs and (ii) determine a good, possibly the best, procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage. The second part of the poster describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure that have sequence identities below 95% are used as seeds to build the models. Using the current data, this gives a library with 4894 models. The third part of the poster describes the use of the SUPERFAMILY model library to annotate the sequences of more than 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 43% of bacterial genomes. Many sequences labeled as being hypothetical are homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library. |
|
|
| 5. Divergence and conservation: comparison of the complete predicted protein sets of four eukaryotes |
| Catherine A. Ball, Kara Dolinski, Shuai Weng, John C. Matese, Gavin
Sherlock, Dianna Fisk, Selina Dwight, Karen Christie, Anand Sethuraman,
J. Michael Cherry, David Botstein, Stanford University School of Medicine;
ball@genome.stanford.edu |
| Short Abstract:
Comparison of the complete predicted protein sets of four eukaryotic organisms (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana) suggests that the common ancestor of plants, multicellular animals and single-celled fungi was the source of a significant fraction of the current complement of proteins for each species. |
| One Page Abstract:
Comparison of the complete predicted protein sets of four eukaryotic organisms (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana) suggests that the common ancestor of plants, multicellular animals and single-celled fungi was the source of a significant fraction of the current complement of proteins for each species. Complete sets of predicted proteins were compared using BLASTP, grouped into families of related proteins and subjected to CLUSTALW analysis to determine closest similarity. At all stringency levels, 30-50% of sequence similarity families have at least one representative sequence from each organism. Unsurprisingly, the closest similarities are found between proteins from D. melanogaster and C. elegans, the multicellular animals. Systematic functional analysis of S. cerevisiae genes and proteins has significant implications for their homologs in other species. S. cerevisiae proteins encoded by essential genes are more likely to have homologs in one of the other species. In addition, S. cerevisiae proteins that interact with many other proteins are also more likely to be conserved. Using gene associations from the Gene Ontology Consortium, similarity
families were associated with biological processes and molecular functions.
The most conserved biological processes, as inferred from shared annotations,
correspond to core cellular processes such as metabolism and protein synthesis.
|
|
|
| 6. Identification of novel small noncoding RNAs in Escherichia coli using a probabilistic comparative method |
| Elena Rivas, Robert J. Klein, Sean R. Eddy, Washington University
in St. Louis;
elena@genetics.wustl.edu |
| Short Abstract:
We apply comparative genomics in a probabilistic computational method to find novel noncoding RNAs. We use this method to screen the E. coli genome using whole-genome comparison to four other gamma proteobacterial genome sequences. A number of the resulting candidates have been shown by Northern blot analysis to produce small, apparently noncoding RNA transcripts. |
| One Page Abstract:
We have developed a comparative genomic method to find novel noncoding RNA genes. This method takes advantage of the pattern of fixed mutations between two conserved sequences in order to infer why the sequence is functional. A protein-gene exon, for instance, may show a telltale abundance of synonymous substitutions, whereas a structural RNA may show a telltale abundance of compensatory mutations consistent with a conserved Watson-Crick base-paired secondary structure. We have formalized this intuitive notion by constructing three probabilistic ``pair-grammars''. Each grammar models a different functional pattern of evolution: coding, which favours synonymous mutations; RNA, which models a pattern of compensatory base changes; and a null hypothesis of position-independent evolution. Here we report the results of applying this computational screen to the E. coli genome, using whole-genome comparisons to four other gamma proteobacterial genome sequences. In this screen we have generated a large number of candidates for RNA genes from E. coli intergenic regions. A number of those candidates have been shown by Northern blot analysis to produce small, apparently noncoding RNA transcripts. |
|
|
| 7. Intron-like sequences in non-coding genomic regions. |
| Elisabetta Pizzi, Emanuele Bultrini, Paolo Del Giudice, Clara Frontali,
Istituto Superiore di Sanità, Rome (Italy);
frontali@iss.infn.it |
| Short Abstract:
A method was developed for the analysis of correlations in oligonucleotide usage in extended genomic portions. Preliminary results for the C. elegans and D. melanogaster genomes reveal auto- and cross-correlations in the dictionary prevailing in introns and in intergenic regions. The method lends itself to automatic partitioning and to cluster analysis. |
| One Page Abstract:
The heterogeneity, or patchiness, which characterises most eukaryotic genomes, suggests that different parts of the same genome may adopt different strategies to encode relevant biological information, as a consequence of different evolutionary pressures. Such considerations should help in partitioning long non-coding sequences into functionally different regions. Starting from the observation that similarity in oligonucleotide usage in introns and intergenic regions, extending over distances of more than 10 Kb, contributes to the correlation structure of the Caenorhabditis elegans genome [1], we examined different statistical methods with the aim of providing a quantitative measure of this effect and of finding practicable ways to score extended genomic portions for consistent oligonucleotide usage in distant regions not necessarily related by significant sequence similarity. Preliminary results were obtained using a simple approach which, after dividing an arbitrarily long sequence into non-overlapping segments typically 100 bp long, computes correlation coefficients between oligonucleotide frequency distributions for all fragment pairs. By repeating the procedure on the randomly shuffled segments it becomes possible to disaggregate the effects of base composition and of biased oligonucleotide usage. Distant genomic regions that adopt a similar oligonucleotide dictionary above and beyond what is expected on the basis of nucleotide frequencies can easily be recognised as off-diagonal patterns in a sort of coarse-grained dot plot representing the resulting matrices through a brightness scale proportional to the correlation coefficient value. Along with this visual representation, it is possible to perform cluster analyses of segments according to oligonucleotide usage. It was possible to demonstrate that C. elegans and Drosophila melanogaster introns auto- and cross-correlate from the point of view of oligonucleotide usage, and that, in both genomes, interspersed elements exhibiting intron-like features are abundant in regions that, according to current annotation, are intergenic. Clusters of these elements might mark as yet unpredicted genes, but we cannot rule out the speculative hypothesis that those non-coding regions that are subject to weak functional constraints might harbour similar elements. Methods for the automatic partitioning of chromosome-long sequences on this basis are under development. [1] C. Frontali, E. Pizzi (1999) Gene 232, 87-95 |
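The segment-by-segment comparison described above can be sketched as follows: the sequence is cut into non-overlapping segments, each segment is summarised by its oligonucleotide counts, and a Pearson correlation coefficient is computed for every pair of segments. The function names, the default word length k = 2, and the toy sequence are assumptions for illustration only; the abstract does not specify the word length or the exact correlation measure.

```python
from itertools import product
from math import sqrt

def kmer_profile(segment, k=2):
    """Counts of all 4^k oligonucleotides of length k in one segment
    (words containing symbols other than A, C, G, T are skipped)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(segment) - k + 1):
        word = segment[i:i + k]
        if word in counts:
            counts[word] += 1
    return [counts[w] for w in words]

def pearson(u, v):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

def correlation_matrix(sequence, seg_len=100, k=2):
    """Correlation of oligonucleotide usage for all pairs of
    non-overlapping segments of length seg_len (upper-case input assumed)."""
    segments = [sequence[i:i + seg_len]
                for i in range(0, len(sequence) - seg_len + 1, seg_len)]
    profiles = [kmer_profile(s, k) for s in segments]
    return [[pearson(p, q) for q in profiles] for p in profiles]

# Toy sequence: segments 1 and 3 share a dictionary, segment 2 does not.
toy = "AT" * 50 + "GC" * 50 + "AT" * 50
matrix = correlation_matrix(toy)
```

Rendering such a matrix with brightness proportional to the coefficient gives the coarse-grained dot plot mentioned in the text; running the same function on shuffled segments would provide the base-composition control.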
|
|
| 8. Repeats in human genomic DNA and sequence heterogeneity |
| Dirk Holste, Humboldt University Berlin;
d.holste@itb.biologie.hu-berlin.de |
| Short Abstract:
We study sequence heterogeneity of human chromosomes by quantifying dinucleotide correlations and frequency distributions of oligonucleotides. Using simple stochastic models, we quantify the presence of fluctuations and the extent to which interspersed repeats, monomeric tandem repeats, and CpG suppression can account for the heterogeneity and the increasing oligonucleotide nonuniformity. |
| One Page Abstract:
The origin and extent of the base compositional variation and its relation to the organization and function of human genomic DNA pose fundamental questions. The observed sequence heterogeneity may require active constraints for generating and maintaining these patterns, and the analysis of the sequence heterogeneity could contribute to an understanding of the nature of compositional constraints. In the past, several attempts have been made to relate those observations to known biological features such as the presence of a 3-bp periodicity, the length distribution of protein-coding regions, the presence and expansion of repeats, or the evolution of DNA. The sequencing of the human genome provides a suitable occasion to test earlier propositions on the base composition of the human genome, such as the role of interspersed repeats, which comprise over 50% of the whole genome. We study statistical patterns in the DNA sequence of two human chromosomes by quantifying short- and long-range dinucleotide correlations and by examining the nonuniformity of the frequency distribution of oligonucleotides. We investigate to what degree known biological features may explain the observed statistical patterns. Using simple stochastic models, we study the role of interspersed repeats as a potential cause of the observed heterogeneity. We study the superposition of interspersed repeats and monomeric tandem repeats, and the suppression of CpG dinucleotides, as possible features that may cause the increasing nonuniformity of the oligonucleotide distribution with increasing oligonucleotide length. |
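One common way to put a number on CpG suppression is the observed/expected ratio of CpG dinucleotides, where the expectation is based on the C and G frequencies alone. The sketch below shows that ratio; it is offered only as an illustration of the concept and is not necessarily the measure used in this study. The function name is hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G), where N is the
    sequence length. Values well below 1 indicate CpG suppression."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    if c == 0 or g == 0:
        return 0.0  # ratio undefined without both C and G; report 0 by convention
    return cg * n / (c * g)

ratio_balanced = cpg_obs_exp("CCGG")  # one CG among four bases
ratio_absent = cpg_obs_exp("ACTG")    # C and G present, but never adjacent as CG
```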
|
|
| 9. PhageBase: A precalculated database of bacteriophage sequences |
| Frank Desiere, Nestle Research Centre;
Günther Kurapkat, Clemens Suter-Crazzolara, Lion Bioscience AG; Harald Brüssow, Nestle Research Centre; frank.desiere@rdls.nestle.com |
| Short Abstract:
We have employed genomeSCOUT to create PhageBase, a multi-functional, pre-computed bacteriophage genome database. Information about protein homology (e.g. protein families, orthologs and clusters of orthologous groups, COGs) and gene order is collected, stored and visualised with the data integration system SRS. PhageBase will allow questions of phage evolution to be addressed. |
| One Page Abstract:
The accumulation of more and more complete bacteriophage genome sequences requires new computational approaches for dealing with these data. Bacteriophage genomes are not especially large; most are smaller than 200 kb. They nevertheless represent a formidable challenge to bioinformatics algorithms since phages seem to be the result of both vertical and horizontal evolution and are thus a good model system for the web-like phylogenies (Brüssow and Desiere 2001) currently discussed for prokaryotic genomes. In addition to their extraordinary power to recombine and exchange functional modules, single genes and gene segments encoding single protein domains, phages with double-stranded DNA genomes show an approximately 10-fold higher mutation rate than their bacterial hosts (Li 1997). Furthermore, there is good reason to believe that tailed phages are as old as their prokaryote hosts. This situation poses several extraordinary challenges for bioinformatics investigations. In particular, sequence search algorithms must be sensitive enough to detect very distantly related sequences and, in the absence of detectable sequence similarity, must account for conserved gene order (synteny). We have employed genomeSCOUT (Suter-Crazzolara and Kurapkat 2000) to create a multi-functional, pre-computed bacteriophage genome database that allows rapid identification and functional characterisation of genes and proteins through genome comparison. With a number of independent algorithms, information about different levels of protein homology (concerning e.g. protein families, orthologs and clusters of orthologous groups, COGs) and gene order is collected and stored. These databases are then used for interactive comparison of genomes and subsequent analysis. The application is based on the well-established data integration system SRS. 
SRS ensures at the same time the fast handling of many genomes, access to several pre-computed databases and linking functions between these databases. Last but not least, SRS offers fully integrated, user-friendly graphical representations of search results. Gene context analysis in bacteriophages shows that the conservation of genome organization is surprisingly high compared to prokaryotic genomes (Wolf and Koonin 2001). The genome map of bacteriophages is conserved across the Siphoviridae family. These phages infect both gram-negative and gram-positive Eubacteria as well as the Euryarchaeota branch of Archaea. In fact, the structural genes from E. coli phage lambda, Streptococcus phage Sfi21 and Archaeavirus psi-M2 can be superimposed. Gene strings that are conserved in taxonomically distant organisms are most likely functionally interacting genes. We assume that their conservation reflects an ancestral gene order. Accordingly, gene strings identified in new genomes can be assumed to be functionally linked, and the information on gene clustering can be used for functional predictions. PhageBase will hopefully allow us to address problems that are currently discussed in bacterial genomics and bacterial phylogeny (Koonin et al., 2000): unity or diversity of origin, vertical versus horizontal gene transfer, nonorthologous gene displacement, tree- versus web-like phylogeny, synteny versus instability of gene order, gene splitting versus domain accretion. |
|
|
| 10. Searching for regions of conservation in the Arabidopsis genome |
| Brad Chapman, John Bowers, Andrew H. Paterson, Plant Genome Mapping
Laboratory, University of Georgia;
chapmanb@arches.uga.edu |
| Short Abstract:
To identify regions of Arabidopsis which might serve as conserved anchor points for comparisons between different plant species, ESTs from crop species were compared to the ordered Arabidopsis genome. Conserved blocks of high sequence similarity were identified in Arabidopsis and categorized within the biological context of the genome. |
| One Page Abstract:
With the recent completion of the Arabidopsis genome, a major challenge is applying the information known about Arabidopsis to help advance research in other plant species. Towards this goal, we were interested in finding regions of Arabidopsis that appeared to be preferentially conserved during evolution, hypothesizing that these regions could serve as anchor points for comparisons between multiple plant genomes. To identify regions of interest in Arabidopsis, we used BLAST searches to locate EST sequences from Sorghum, Cotton and Sugarcane on the Arabidopsis genome. When the levels of sequence similarity were displayed across the genome, blocks of very high or very low sequence similarity appeared. Using a probabilistic Hidden Markov Model approach, we categorized the blocks as being either strongly or weakly conserved. Several regions of the Arabidopsis genome were identified that were similarly categorized using Sorghum, Cotton and Sugarcane comparisons, but not using comparisons with randomly generated sequences. To determine whether these regions were biologically meaningful, we looked at the distribution of Matrix Attachment Regions in the genome and attempted to correlate these structural elements with the conserved regions we had identified. Results of these analyses will be presented, along with in-depth analysis of some potentially conserved regions of the Arabidopsis genome. Some points of discussion will include the potential evolutionary significance of the conserved regions, as well as the applicability of the results to help advance research in less well-studied plant species. |
|
|
| 11. Constructing Comparative Maps with Unresolved Marker Order |
| Debra Goldberg, Center for Applied Mathematics, Cornell University;
Susan McCouch, Department of Plant Breeding, Cornell University; Jon Kleinberg, Department of Computer Science, Cornell University; debra@cam.cornell.edu |
| Short Abstract:
Species maps (genetic and physical) frequently include groups of markers (genes) whose precise relative order cannot be determined. We present efficient algorithms that construct comparative maps from such species maps in a principled manner. Our approach recognizes arrangements of co-located markers that give a most parsimonious comparative map. |
| One Page Abstract:
Comparative maps are a powerful tool for aggregating genetic information about related organisms, for predicting the location of orthologous genes, for understanding chromosome evolution, for inferring phylogenetic relationships and for examining hypotheses about the evolution of gene families and gene function in diverse organisms. The species maps which are the input to the process of constructing comparative maps are often themselves constructed from incomplete or inconsistent data, resulting in markers (or genes) whose precise relative order cannot be determined in the input species maps. This incomplete marker order information is generally handled in one of two ways: each marker may be assigned an interval on a chromosome, where the interval size varies for different markers and marker intervals may overlap, or sets of markers whose relative order cannot be reliably inferred are placed together in a bin which is mapped to a common location (megalocus). Previous automated and manual methods have handled such markers in an ad hoc or arbitrary fashion. We present efficient algorithms for each of the standard representations which systematically use all information provided to produce comparative maps in a principled manner. The algorithms extend our earlier work on the ``chromosome labeling problem,'' which uses a dynamic programming technique to find an optimal balance of accuracy (the data should be explained well by the map) and parsimony (there should be relatively few homeologous segments, so that only syntenic relationships above our confidence threshold are labeled). We handle the overlapped interval representation by a direct extension of this technique. Our main algorithms focus on the megalocus model, in which the input markers are partitioned into sets: relative order between sets is fully known, while relative marker order within a set is completely unknown. 
For this model, we present algorithms which not only use the available information, but also arrange the co-located markers in a most parsimonious way. The chromosome labeling problem with unknown ordering is thus placed on a principled footing via these algorithms in which results are optimized over all possible orderings. This canonical marker order can be viewed as a working hypothesis about the original incomplete data set, and can serve as a basis for further lab work. A preliminary version of ``DeCAL'' (Detecting Common Ancestral Linkage-segments), an open-source product based on these algorithms, is now available. For input, it requires the positions of the markers of one species, as well as the location of homologs to each marker in the second species. Output is given both graphically and in text form. Only a single parameter is required, which carries a simple biological explanation. Our program allows comparative maps to be constructed in a few minutes. Results have been evaluated for diverse pairs of species, and closely approximate prior manual expert analyses. |
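To make the trade-off between accuracy and parsimony concrete, the sketch below labels an ordered list of markers with chromosomes of the second species by dynamic programming. It is a simplified stand-in, not the DeCAL algorithm: the cost function (one unit per marker whose label disagrees with its homolog, plus a penalty s per segment boundary), the function name, and the toy data are all assumptions, and the marker order is taken as fully resolved.

```python
def label_chromosome(homolog_chroms, s=2.0):
    """Choose one label per marker so that
    (number of markers whose label differs from their homolog's chromosome)
    + s * (number of label changes along the chromosome) is minimal.
    The single parameter s is a hypothetical segment penalty."""
    labels = sorted(set(homolog_chroms))
    # cost[l]: best cost of labeling the markers seen so far, ending in label l
    cost = {l: (0.0 if l == homolog_chroms[0] else 1.0) for l in labels}
    back = []
    for chrom in homolog_chroms[1:]:
        best_prev = min(labels, key=lambda l: cost[l])
        new_cost, pointers = {}, {}
        for l in labels:
            stay, switch = cost[l], cost[best_prev] + s
            if stay <= switch:
                prev, base = l, stay
            else:
                prev, base = best_prev, switch
            new_cost[l] = base + (0.0 if l == chrom else 1.0)
            pointers[l] = prev
        back.append(pointers)
        cost = new_cost
    # Trace the optimal labeling back from the cheapest final label.
    path = [min(labels, key=lambda l: cost[l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy input: chromosome of the homolog of each marker, in map order.
homologs = ["1", "1", "3", "1", "1", "2", "2", "2"]
smoothed = label_chromosome(homologs, s=2.0)
```

With a large penalty the isolated homolog on chromosome 3 is absorbed into the surrounding segment, while a small penalty reproduces the raw data; this is the kind of behaviour a single biologically interpretable parameter would control.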
|
|
| 12. Identification of novel small RNA molecules in the Escherichia coli genome: from in silico to in vivo |
| Ruth Hershberg, Liron Argaman, Department of Molecular Genetics
and Biotechnology The Hebrew University -Hadassah Medical School;
Joerg Vogel, Institute of Cell and Molecular Biology, Biomedical Center, Uppsala University; Gill Bejerano, Institute of Computer Science, The Hebrew University; E. Gerhart H. Wagner, Institute of Cell and Molecular Biology, Biomedical Center, Uppsala University; Shoshy Altuvia, Hanah Margalit, Department of Molecular Genetics and Biotechnology The Hebrew University -Hadassah Medical School; rutih@md.huji.ac.il |
| Short Abstract:
Small RNAs (sRNAs) have been difficult to detect both experimentally and computationally. We developed a computational strategy to search the Escherichia coli genome for sRNA-encoding genes. We used transcriptional signals and genomic features to predict 41 sRNAs. 23 were tested experimentally, of which 17 were shown to be real sRNAs. |
| One Page Abstract:
Small, untranslated RNA molecules exist in all kingdoms of life. These RNAs carry out diverse functions and many of them are regulators of gene expression. Genes encoding small RNAs (sRNAs) are difficult to detect experimentally or to predict by traditional sequence analysis approaches. Thus, in spite of the importance of these molecules, many of the sRNAs known to date were discovered fortuitously. We developed a computational strategy to search the Escherichia coli genome for genes encoding small RNAs. Our method was based on the transcription signals and genomic features, such as location and conservation, that characterize the 10 known sRNAs in E. coli. The search was limited to regions of the genome in which no gene existed on either strand. These regions were searched for transcriptional signals (promoter sequences recognized by the major sigma factor of E. coli RNA polymerase (sigma70), and Rho-independent terminators). Sequences for which the distance between the predicted promoter and terminator was 50-400 bases were compared to genome sequences of other bacteria. Sequences with good conservation were predicted as sRNAs. 23 of the predicted genes were tested experimentally, out of which 17 were shown to be expressed in E. coli. The newly discovered sRNAs showed diverse expression patterns and most of them were abundant. |
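The distance filter in this pipeline can be illustrated with a short sketch: given predicted promoter and terminator positions within regions free of annotated genes, keep the pairs separated by 50-400 bases. The function name and the coordinate convention are assumptions; only one strand is handled, and the promoter and terminator predictions themselves, as well as the conservation step, are outside this sketch.

```python
def candidate_srnas(promoters, terminators, min_len=50, max_len=400):
    """Pair each predicted promoter with every downstream terminator whose
    distance lies within [min_len, max_len] bases (forward strand only)."""
    candidates = []
    for p in sorted(promoters):
        for t in sorted(terminators):
            if min_len <= t - p <= max_len:
                candidates.append((p, t))
    return candidates

# Hypothetical coordinates within "empty" regions of the genome.
pairs = candidate_srnas(promoters=[100, 1000], terminators=[130, 300, 1500])
```

Here only the pair (100, 300) survives: the terminator at 130 is too close to the promoter at 100, and the one at 1500 is too far from the promoter at 1000.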
|
|
| 13. Operons: conservation and accurate predictions among prokaryotes (up) |
| Gabriel Moreno-Hagelsieb, Julio Collado-Vides, CIFN-UNAM;
moreno@cifn.unam.mx |
| Short Abstract:
Neighbor genes within known operons are shown to be more conserved at three levels (co-occurrence, adjacency, and fusion) than genes at transcription unit (TU) boundaries. We also show that prediction of operons as designed with information from E. coli works with other prokaryotes. |
| One Page Abstract:
Based on a database of experimentally characterized transcription units (TUs) of Escherichia coli, and its genomic annotations, we show that adjacent genes within operons (polycistronic TUs) are more conserved than adjacent genes found at TU boundaries (the last gene in a TU and the first in the next). Conservation is measured at three levels: (1) Co-occurrence: two genes found in an operon each have an ortholog in another genome more frequently than two genes at TU boundaries; that is, genes at TU boundaries are more frequently left as orphans. (2) Among those gene pairs with both orthologs present, those found in operons in E. coli are conserved more frequently as neighbors in other genomes. (3) Genes within operons can be found as fusions in other genomes. We also show that the TU prediction method we developed with information from E. coli TUs works as well (more than 82% accuracy) against a collection of known operons of Bacillus subtilis, and provide evidence that it is applicable to predicting the transcription unit organization of all prokaryotes. Genes predicted to be in operons show higher conservation of adjacency than genes predicted to be at TU boundaries, and the population of operons of each organism is shown to be easy to calculate from the inter-genic distance distributions of pairs of adjacent genes found in the same strand. |
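One way to use inter-genic distances of same-strand neighbors, as mentioned above, is a binned log-likelihood score. The sketch below is an assumption-laden illustration, not the authors' published method: the bin size, the pseudocount, the gene tuple format and both function names are hypothetical, and the training distances would have to come from a curated set of known operons and TU boundaries.

```python
import math
from collections import Counter


def intergenic_distances(genes):
    """genes: list of (start, end, strand) tuples sorted by start.

    Returns the gap (in bases) between each pair of adjacent genes
    that lie on the same strand; overlapping genes give negative gaps.
    """
    gaps = []
    for (s1, e1, strand1), (s2, e2, strand2) in zip(genes, genes[1:]):
        if strand1 == strand2:
            gaps.append(s2 - e1 - 1)
    return gaps


def log_likelihood_scorer(operon_gaps, boundary_gaps, bin_size=10, pseudo=1.0):
    """Build a scorer from gaps observed inside operons and at TU boundaries.

    A positive score means the gap is more typical of operon pairs.
    """
    op = Counter(g // bin_size for g in operon_gaps)
    bd = Counter(g // bin_size for g in boundary_gaps)
    n_bins = len(set(op) | set(bd)) + 1  # +1 reserves mass for unseen bins
    n_op = len(operon_gaps) + pseudo * n_bins
    n_bd = len(boundary_gaps) + pseudo * n_bins

    def score(gap):
        b = gap // bin_size
        return math.log((op[b] + pseudo) / n_op) - math.log((bd[b] + pseudo) / n_bd)

    return score
```

A pair of adjacent same-strand genes would be predicted to share an operon when `score(gap)` is positive; the cut-off itself is a design choice that should be calibrated on known TUs.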
|
|
| 14. The EBI Proteome Analysis Database (up) |
| Paul Kersey, Apweiler, R., Biswas, M., Fleischmann, W., Kanapin, A.,
Karavidopoulou, Y., Kriventseva, E., Mittard, V., Mulder, N., EMBL-European
Bioinformatics Institute;
Phan, I., Swiss Institute of Bioinformatics; Zdobnov, E., EMBL-European Bioinformatics Institute; pkersey@ebi.ac.uk |
| Short Abstract:
The EBI Proteome Analysis Database has been developed to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Complete non-redundant sets of SWISS-PROT and TrEMBL entries are assembled for each proteome, and are analysed using the InterPro and CluSTr databases, GO, and structural information. |
| One Page Abstract:
The dramatic growth in the number of fully sequenced organisms in recent years has created new challenges and opportunities for biological sequence databases. It is now possible to make statistical and comparative analyses of organisms based on their entire proteome. While the specific function of a newly predicted protein cannot be known for certain from its sequence alone, the use of protein domain databases allows proteins to be assigned to families and therefore a proteome to be described in terms of its composition. For example, it is possible to establish that a given protein family may be found in a restricted portion of the taxonomic range; that two organisms share certain protein families, but not others; or that a particular family is especially highly represented in a certain species. The Proteome Analysis Database has been developed by the SWISS-PROT group at the EBI in order to provide such an analysis. Several features distinguish this database. Firstly, non-redundant, up-to-date, comprehensive data sets are maintained for each complete proteome, so that the statistical analysis is not skewed. These are created by selecting entries from the high-quality protein sequence databases SWISS-PROT and TrEMBL. Protein sequence data is tracked into TrEMBL from genome sequencing projects, and merges with existing entries are accounted for. Special procedures are used to establish eukaryotic proteomes. The facility to perform unbiased sequence similarity searches against these sets is offered. Secondly, a powerful set of tools has been chosen to analyse the sets. The Proteome Analysis Database uses InterPro, an integrated database of protein domains, and CluSTr, a database that groups proteins according to overall sequence similarity. Proteins can also be functionally classified according to the Gene Ontology. Additionally, structural information relevant to each proteome is provided. 
Thirdly, the entire database is updated weekly and kept synchronised with the underlying sequence databases from which it is constructed. Finally, a web-based interface allows users to customise their own comparative analysis using the resources made available by the database, while popular queries are precomputed for rapid response. |
|
|
| 15. Practical transcriptome analysis system in the RIKEN mouse cDNA project (up) |
| Hidemasa Bono, Takeya Kasukawa, Itoshi Nikaido, Yasushi Okazaki, Yoshihide
Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN
Genomic Sciences Center(GSC);
bono@gsc.riken.go.jp |
| Short Abstract:
A practical transcriptome analysis system for a large-scale cDNA project, the RIKEN mouse cDNA project, is presented. cDNA clone data from libraries, full-length sequences, chromosomal mapping information and gene expression information are integrated to analyze the mouse transcriptome in conjunction with sequence information from human and other model organisms. |
| One Page Abstract:
We have pursued the RIKEN mouse encyclopedia project, which attempts to catalogue, as an encyclopedia: 1. full-length cDNA clones; 2. full-length cDNA sequences; 3. mapping information for all cDNA clones; 4. gene expression information for all genes. Last August, we held the FANTOM meeting to add functional annotation to 21,076 RIKEN mouse cDNA clones (Nature, 409, 685-690 (2001)). For that meeting, we built a web-based system called FANTOM+, which includes functional annotation information as well as a graphical sequence analysis report (http://www.gsc.riken.go.jp/e/FANTOM/). These efforts in FANTOM are now being expanded to set up a practical mouse transcriptome analysis system which organizes not only functional annotation but also biological knowledge that may contain inconsistent information. We will report the status of this project. |
|
|
| 16. The automated identification of novel lipases/esterases on a multi-genome scale (up) |
| Sanna Herrgard, Stephen A. Cammer, Jen Montimurro, Jeffrey A. Speir,
Brian Hoffman, Susan M. Baxter, Jacquelyn S. Fetrow, GeneFormatics,
Incorporated;
sannaherrgard@geneformatics.com |
| Short Abstract:
We have applied threading and protein functional descriptors to identify sequences with putative lipase and esterase functions in four gram-positive genomes. These studies yielded 15 sequences previously unreported as having lipase/esterase functions. Our findings are supported by 3D-conservation profiles between the active sites in known lipases/esterases and our assignments. |
| One Page Abstract:
The rapid and accurate functional annotation of the growing number of DNA and protein sequences has become a key challenge of the post-genomic era. Current annotation methods rely heavily upon simple sequence similarity; a protein sequence of undetermined biochemical function is assumed to have the same function as the protein most similar in sequence to it. Since protein sequences are generally less conserved than protein structures, sequence-based annotation methods often fail to detect proteins with low sequence similarity. In order to circumvent the limitations of sequence-based approaches, we have screened structure models obtained by threading with a library of function-specific structural descriptors (Fuzzy Functional Forms). We demonstrate the use of this method in rapidly annotating entire genomes and identifying novel function assignments for ORFs for which conventional sequence-based annotation methods fail. Specifically, we have assigned novel lipase and esterase functions to 15 sequences in the genomes of four gram-positive bacteria: Bacillus subtilis, Ureaplasma urealyticum, Mycoplasma pneumoniae and Mycobacterium tuberculosis. Our findings are supported by the sequence-structure conservation profiles between the active sites in known lipases/esterases and our assignments. These analyses indicate that even though the overall sequence similarity between known lipases/esterases and our assigned ORFs is often low, remarkable local similarities exist in the predicted active sites. |
|
|
| 17. Identification of membrane protein orthologs in worm, fly and human (up) |
| Gang Liu, Christian E. V. Storm, Erik L. L. Sonnhammer, Center for
Genomics and Bioinformatics (CGR), Karolinska Institute;
Gang.Liu@cgr.ki.se |
| Short Abstract:
Genome-wide identification of transmembrane protein orthologs has been carried out. Hidden Markov models (HMMs) which were previously built for membrane protein families of worm were used to search for homologs in other species. Orthologous relationships were assigned by using Orthostrapping, a phylogeny-based method that gives orthology confidence. |
| One Page Abstract:
Following the completion of the genome sequencing projects for the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, as well as the newly finished human genome, we are able to analyze orthologous relationships in protein families between higher eukaryotes. This study will help us to understand gene function and the evolution of these protein families. Previously, C. elegans proteins with at least two transmembrane segments were classified into 189 clusters, and hidden Markov models (HMMs) were created for each protein family (Remm and Sonnhammer, Genome Res., 10:1679, 2000). We used these models to retrieve fly and human homologs. 52% of the clusters contain members from worm, fly and human, while 8% of the clusters are present only in worm and fly. Only 2% of the clusters contain worm and human homologs but not fly. The remaining 37% of the clusters are worm-specific. The clusters were analyzed for orthologs using Orthostrapping, a phylogeny-based method which gives orthology confidence. We present a list of putative membrane protein orthologs in worm, fly, and human. |
|
|
| 18. Discovering Binding Sites from Expression Patterns: A simple Hyper-Geometric Approach (up) |
| Yoseph Barash, Gill Bejerano, Tommy Kaplan, Nir Friedman, School
of Computer Science & Engineering, The Hebrew University;
hoan@cs.huji.ac.il |
| Short Abstract:
We present a fast approach to transcription factor binding site discovery. Using a simple hypergeometric model we rapidly find short conserved patterns within a gene group compared to its genome background. These seeds are iteratively expanded into PSSMs. We analyze recent yeast and human datasets, and compare to MEME. |
| One Page Abstract:
A central issue in molecular biology is understanding the regulatory mechanisms that control gene expression. The recent flood of genomic and post-genomic data opens the way for computational methods elucidating the key components that play a role in these mechanisms. One important consequence is the ability to recognize groups of genes that are co-expressed using whole-genome expression patterns. Our aim is to identify in-silico putative transcription factor binding sites in the promoter regions of these genes that explain the co-regulation, and hint at possible regulators. In this paper we describe a simple, fast, and yet powerful, approach to this task using a hyper-geometric statistical model and a straightforward computational procedure. This results in small conserved sequence seeds that are statistically significant compared to the genome-wide promoter background. We then expand these short seeds into position specific scoring matrices using an EM-like procedure. We demonstrate the utility and speed of our methods by applying them to several recent yeast and human data sets. We also compare our results with those of MEME when run on the same sets. |
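The hyper-geometric test at the core of this approach can be illustrated with a short sketch. The variable names and the example numbers are assumptions made for illustration; the abstract does not give the authors' exact formulation, thresholds or multiple-testing correction. Here N is the number of promoters in the genome, K the number of those containing the pattern, n the size of the co-expressed gene group, and k the number of group promoters containing the pattern.

```python
from math import comb


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    The probability that a random group of n promoters, drawn without
    replacement from N promoters of which K contain the pattern,
    includes at least k promoters with the pattern.
    """
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical numbers: a pattern present in 300 of 6000 promoters
# appears in 12 of the 50 promoters of a co-expressed group.
p_value = hypergeom_tail(6000, 300, 50, 12)
print(p_value)
```

Exact integer binomials are used so that very small tail probabilities do not suffer from floating-point cancellation; for genome-scale scans over many candidate patterns a log-space or library implementation may be faster.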
|
|
| 19. Visualizing whole genome comparisons: Artemis Comparison Tool (ACT) (up) |
| Keith James, Kim Rutherford, The Sanger Centre;
kdj@sanger.ac.uk |
| Short Abstract:
Artemis Comparison Tool (ACT) is a DNA sequence comparison viewer based on Artemis. It presents a graphical representation of pairwise sequence comparison data generated by programs such as BLAST. EMBL, GenBank or GFF format annotation can be loaded in addition and presented in context with the comparison. |
| One Page Abstract:
The amount of data obtained from a pairwise comparison of whole genomes can be overwhelming, even when those genomes are highly similar. Interesting features such as syntenic regions, insertions, deletions, dispersed repeats, large-scale inversions or translocations are often not immediately apparent from the raw alignment output (e.g. Blast output). Often the genomic context of these raw results, with respect to gene predictions and existing annotation is lost. Artemis Comparison Tool (ACT) is a DNA sequence comparison viewer based on Artemis. It presents a graphical representation of pairwise sequence comparison data generated by programs such as BLAST. EMBL, GenBank or GFF format annotation can be loaded in addition and presented in context with the comparison. The user can pan and zoom the view interactively, examine per-CDS database search results and create or edit annotation from within the ACT environment. Example ACT analyses of genomes sequenced at the Pathogen Sequencing Unit are presented. In common with Artemis, ACT is written in Java and runs on UNIX, GNU/Linux, Macintosh and MS Windows systems. ACT is free software and is distributed under the terms of the GNU General Public License. ACT is available from the ACT web site: http://www.sanger.ac.uk/Software/ACT/ |
|
|
| 20. The use of Artemis for the annotation of eukaryotic genomes (up) |
| Valerie Wood, Kim Rutherford, The Sanger Centre;
val@sanger.ac.uk |
| Short Abstract:
Artemis is a DNA sequence viewer and annotation tool written in Java. It can read and write EMBL, GenBank and GFF format feature information and display these in the context of the sequence and its six frame translation. |
| One Page Abstract:
Artemis is a DNA sequence viewer and annotation tool written in Java. It can read EMBL, GenBank and GFF format feature information and display these in the context of the sequence and its six frame translation. The package can also display the results of external analyses, or plot the results of statistical calculations, performed on the sequence or CDS features. Artemis is the main annotation tool used for genome analysis in the Pathogen Sequencing Unit at the Sanger Centre, and is used routinely for annotation of eukaryotic genomes. The use of Artemis for the analysis and annotation of the completed genome of the unicellular fungus Schizosaccharomyces pombe, and the reannotation of Saccharomyces cerevisiae will be presented. Artemis is also used in the annotation of the parasitic worm Brugia malayi, the social amoeba Dictyostelium discoideum and the unicellular eukaryotic parasites Plasmodium falciparum, Trypanosoma brucei, Leishmania major and Toxoplasma gondii. Artemis is available from the Artemis web site: http://www.sanger.ac.uk/Software/Artemis/ The European S. pombe genome sequencing project can be accessed at http://www.sanger.ac.uk/Projects/S_pombe/ |
|
|
| 21. A de novo approach to identifying repetitive elements in genomic sequences (up) |
| Elizabeth Thomas, John Healy, Cold Spring Harbor Laboratory;
Nathan Srebro, Massachusetts Institute of Technology; Jacob Schwartz, New York University; Michael Wigler, Cold Spring Harbor Laboratory; thomase@cshl.org |
| Short Abstract:
We investigate tools for identifying repeats in genomic sequences, using whole genome frequencies of short oligomers. Highly-repetitive elements, even if poorly conserved, will contain many high frequency oligomers. We can identify repetitive elements using this signature, without depending on prior knowledge of repeat sequence or structure. |
| One Page Abstract:
Less than 2% of the human genome codes for proteins (IHGSC, 2001). It has been estimated that 50% of the remaining sequence consists of interspersed repetitive elements, not all of which have been classified or identified. An understanding of the origins of these repetitive elements, and their diversity, is likely to shed light on the evolution of genomes. Tools commonly used for defining and identifying repeats depend on prior knowledge about the structure and sequence of known repeats. Because of this assumption of prior knowledge, these tools are inappropriate for identifying unknown repetitive elements in genomic sequences. Now that whole genome sequences are available, new approaches can be taken, which depend merely on the simple fact that repetitive elements repeat. We investigate tools based on the whole genome frequencies of short oligomers, and simple algorithms that can be applied to these frequencies. Highly-repetitive elements, even if poorly conserved, will contain many high frequency oligomers. We can identify repetitive elements using this signature, without depending on prior knowledge of repeat sequence or structure. |
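The idea of flagging repeats from oligomer frequencies alone can be sketched as follows. This is a simplified illustration and not the authors' tool: the oligomer length `k`, the count threshold `min_count` and the function names are hypothetical, and reverse-complement occurrences are ignored for brevity.

```python
from collections import Counter


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def repeat_mask(genome: str, k: int, min_count: int) -> list:
    """Mark positions covered by any k-mer seen at least min_count times."""
    counts = kmer_counts(genome, k)
    mask = [False] * len(genome)
    for i in range(len(genome) - k + 1):
        if counts[genome[i:i + k]] >= min_count:
            for j in range(i, i + k):
                mask[j] = True
    return mask


# Toy sequence: the trimer ACG occurs three times at the start.
print(repeat_mask("ACGACGACGTTGCA", k=3, min_count=3))
```

For a whole genome the counts would be computed once and reused; runs of marked positions then give candidate repetitive elements without any library of known repeats.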
|
|
| 22. Extendable parallel system for automatic genome analysis (up) |
| Audrius Meskauskas, Frank Lehmann-Horn, Karin Jurkat Rott, Ulm University;
Audrius.Meskauskas@medizin.uni-ulm.de |
| Short Abstract:
We developed an automatic system for analysing all potential genes in a defined region. We created 15 modules for sequence retrieval, E-PCR, gene prediction, similarity search, protein pattern search, etc. The system was used to look for a gene between D3S2370 and D3S1292 on human chromosome 3. |
| One Page Abstract:
Experimental methods of linkage analysis often indicate a large region in which the gene of interest is located. It is possible to narrow the search interval by bioinformatic methods. Such methods are usually developed by specialised bioinformatics groups and are accessible from their specialised Internet pages. The preferred set of programs depends on the type of gene being cloned and on the general strategy of the respective research group. Submitting numerous requests to different Internet servers and analysing the received responses takes a large amount of a qualified researcher's time. Therefore, we developed a Java-based data-mining program to detect and analyse all possible genes in a defined chromosome region. We created modules for the following tasks: 1. Automatic conversion between the WUSM and NCBI naming systems of clones. 2. Getting the sequences and coordinates for a given set of markers. 3. Getting a list of clones for a given NCBI contig. 4. Sequence retrieval. 5. E-PCR. 6. Gene prediction. 7. BLAST similarity search through the EST database. 8. Collecting additional information on the tissues in which similar cDNA was detected. 9. Predicted protein pattern search, revealing the potential protein family. 10. Transmembrane region detection. 11. Protein sorting signal detection. 12. PEST region detection. 13. Translating between gene position inside a clone and gene position inside an NCBI contig. 14. Finding which of the genes predicted by the system are already described. 15. A central kernel for parallel submission of requests. The system was used to predict and analyse all genes lying between markers D3S2370 and D3S1292 on human chromosome 3. We noticed that, with a lower confidence level, GenScan predicts more genes than are given in the NCBI database for this region (we determined the exact dependency curve). 
In some cases these newly predicted genes showed a correlation between the presence of transmembrane helices, peptide sorting signals and specific protein domains. BLAST also revealed their similarity to known cDNAs from different organs. Since they also have promoter and poly-A signals, at least some of these sequences might be functional genes. The information obtained for the predicted genes was analysed in the context of the known hypotheses about the function of the gene being sought. This reduced the search interval from about 6 Mbp to a much smaller set of potential coding sequences, fulfilling the system's task in the gene cloning project. |
|
|
| 23. Whole genome phylogenies using vector representations of protein sequences (up) |
| Gary W. Stuart, Department of Life Sciences, Indiana State University;
Jeffery J. Leader, Department of Mathematics, Rose-Hulman Institute of Technology; G-Stuart@indstate.edu |
| Short Abstract:
Optimized SVD-based vector representations of proteins from whole genomes were used to produce comprehensive gene and species phylogenies. A pilot analysis using 832 mitochondrial proteins from 64 vertebrates produced a robust and accurate tree. A larger analysis using nearly 30,000 proteins from 17 bacterial genomes revealed some non-traditional relationships. |
| One Page Abstract:
Accurate phylogenetic trees have been produced following the singular value decomposition (SVD) of data matrices containing vector representations of all proteins encoded within complete genomes. Both gene trees and species trees have been derived using this method. In a pilot analysis, the complete set of 13 mitochondrial proteins from each of 64 vertebrates was used to produce a matrix representing each protein in terms of its tetrapeptide frequencies. SVD with dimension reduction was then used to provide adjusted vector representations for each protein in multidimensional space. Pairwise cosine (similarity) values were determined and converted to distance measures as required for the generation of phylogenetic trees using the NEIGHBOR program of PHYLIP. The resulting gene trees indicated that this method was clearly capable of recognizing and grouping similar proteins, as most members of the 13 mitochondrial protein families were accurately placed in large monophyletic or nearly monophyletic groups. An optimal dimension reduction was determined that produced the best grouping of genes within families. Species trees were then produced by 1) summing the optimized SVD-based vector representations of the individual mitochondrial proteins from each organism, 2) deriving cosine-based distance values for each pair of summed vectors, and 3) using NEIGHBOR to generate trees from the resulting distance values. Within these trees, cartilaginous fish, bony fish, reptiles, birds, non-eutherian mammals, and eutherian mammals were well grouped and reasonably arranged. Following the successful analysis of complete mitochondrial genomes, we applied this method to the genomes of 17 bacteria, including 4 archaebacterial species. Both selected partial genome datasets (~2300 proteins) and whole genome datasets (~30,000 proteins) were analyzed. Optimal dimension reduction was estimated in some cases by observing how well genes were grouped into COG families. 
The resulting species trees tended to reinforce many traditional bacterial relationships, while challenging others. For instance, Borrelia burgdorferi, the spirochete responsible for Lyme disease, grouped with Rickettsia prowazekii, a proteobacterium, instead of Treponema pallidum, another spirochete. With further refinements and increased computational power, it should be possible to produce exhaustive biomolecular phylogenies from a large number of complete prokaryotic and eukaryotic genomes.
|
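The pipeline in abstract 23 (tetrapeptide counts, SVD with dimension reduction, cosine-based distances) can be sketched with NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the conversion from cosine similarity to distance is assumed to be `1 - cosine` (the abstract does not specify it), and the toy sequences are invented. Tree building with NEIGHBOR is not shown.

```python
import numpy as np


def peptide_count_matrix(proteins, k=4):
    """Rows are proteins, columns are the k-peptides observed in the set."""
    vocab = sorted({p[i:i + k] for p in proteins for i in range(len(p) - k + 1)})
    column = {pep: j for j, pep in enumerate(vocab)}
    counts = np.zeros((len(proteins), len(vocab)))
    for row, p in enumerate(proteins):
        for i in range(len(p) - k + 1):
            counts[row, column[p[i:i + k]]] += 1
    return counts


def reduced_vectors(counts, dims):
    """Project each protein onto the top `dims` singular directions."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :dims] * s[:dims]


def cosine_distances(vectors):
    """Pairwise distance, assumed here to be 1 - cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty vectors
    unit = vectors / norms
    return 1.0 - unit @ unit.T


# Toy sequences (hypothetical): the first two differ in one residue.
proteins = ["ACDEFGHIK", "ACDEFGHIL", "WYWYWYWYW"]
vectors = reduced_vectors(peptide_count_matrix(proteins), dims=2)
print(cosine_distances(vectors))
```

For a species tree, the reduced vectors of all proteins from one organism would be summed (for example with `vectors[rows].sum(axis=0)`) before computing the pairwise distances, following the three steps listed in the abstract.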
|
|
| 24. Correlated Sequence Signature as Markers of Protein-Protein Interaction (up) |
| Einat Sprinzak, Hanah Margalit, Department of Molecular Genetics
and Biotechnology The Hebrew University -Hadassah Medical School;
einats@md.huji.ac.il |
| Short Abstract:
We propose a novel approach for clustering pairs of interacting proteins by combinations of their sequence signatures. The identified correlated sequence signatures can be used as markers for predicting protein-protein interactions in the cell. Such an approach reduces significantly the search in the interaction space, and enables directed experimental screens. |
| One Page Abstract:
As protein-protein interaction is intrinsic to most cellular processes, the ability to predict which proteins in the cell interact can aid significantly in identifying the function of newly discovered proteins, and in understanding the molecular networks they participate in. An appealing approach would be to predict the interacting partners by characteristic sequence motifs that typify the proteins that are involved in the interaction. Valuable insight towards this end can be gained by mining databases of experimentally determined interacting proteins. Conventionally, single protein sequences have been clustered into families by distinct sequence signatures. Here we propose a novel approach for clustering different pairs of interacting proteins by combinations of their sequence signatures. To identify such informative signature combinations, a database of interacting proteins is required, as well as a scheme for characterizing protein sequences by their signatures. In the current study we demonstrate the potential of this approach on a comprehensive database of experimentally determined pairs of interacting proteins in the yeast S. cerevisiae. The proteins are characterized by sequence signatures, as defined by the InterPro classification. A statistical analysis is performed on all possible combinations of two sequence signatures, identifying combinations of sequence signatures that are over-represented in the database of pairs of interacting proteins. It is proposed that such correlated sequence signatures can be used as markers for predicting unknown protein-protein interactions in the cell. Such an approach reduces significantly the search in the interaction space, and enables directed experimental interaction screens. |
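A simple way to look for over-represented signature combinations is to compare the observed number of interactions carrying a signature pair with the number expected if signatures were distributed independently. The sketch below is an assumption-laden illustration, not the authors' statistical analysis: the independence model, the log2 observed/expected score, the function name and the toy identifiers are all hypothetical.

```python
import math
from collections import Counter


def signature_pair_scores(interactions, signatures):
    """Score unordered signature pairs by log2(observed / expected).

    interactions: list of (protein_a, protein_b) tuples.
    signatures:   dict mapping protein -> set of signature identifiers.
    Expected counts assume signatures occur independently on the two
    interaction partners (a simplifying assumption).
    """
    n = len(interactions)
    slot_counts = Counter()   # how often a signature occurs on any partner
    pair_counts = Counter()   # how many interactions show a signature pair
    for a, b in interactions:
        sig_a = signatures.get(a, set())
        sig_b = signatures.get(b, set())
        slot_counts.update(sig_a)
        slot_counts.update(sig_b)
        pair_counts.update({tuple(sorted((x, y))) for x in sig_a for y in sig_b})

    scores = {}
    for (x, y), observed in pair_counts.items():
        fx = slot_counts[x] / (2 * n)
        fy = slot_counts[y] / (2 * n)
        expected = n * fx * fy * (1 if x == y else 2)
        scores[(x, y)] = math.log2(observed / expected)
    return scores
```

Pairs with high positive scores would be candidate correlated signatures; with real data, a significance test and a correction for the large number of combinations tested would also be needed.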
|
|
| 25. Molecular and Functional Plasticity in the E. coli Metabolic Map (up) |
| Sophia Tsoka, Christos Ouzounis, EMBL-EBI;
tsoka@ebi.ac.uk |
| Short Abstract:
Large-scale sequence analysis of the enzymes comprising the entire metabolic complement of Escherichia coli was performed. Reactions and pathways were grouped according to sequence similarity of the corresponding enzymes and enzyme families were mapped to corresponding reactions and pathways. Function convergence/divergence is assessed and modes of pathway evolution are discussed. |
| One Page Abstract:
Genome analysis by sequence similarity provides useful hints for evolutionary and functional relations between proteins. Especially important is the identification of cases of divergent or convergent evolution, whereby similar sequences have different function and vice versa. These represent cases that are often overlooked by functional assignments based on detection of sequence similarity. Furthermore, understanding the intricacies of the sequence-to-function relationship for metabolic enzymes (1) can also enable reconstruction of the evolutionary history of protein function diversification and biochemical pathways (2). Large-scale sequence analysis of the enzymes comprising the entire metabolic complement of Escherichia coli (3) has been undertaken in order to reveal the molecular and functional diversity of metabolic network components. Metabolic enzymes and their functional roles in terms of EC number classification and pathway involvement were identified from the EcoCyc database (4). Automated sequence clustering of the E. coli enzymes was performed (5) to identify enzyme families. Subsequently, reactions and pathways were grouped according to the sequence similarity of the corresponding enzymes. The mapping of similar sequences into different reactions and pathways delineates to a significant extent cases of evolutionary divergence and convergence of protein families (6).
(1) Ouzounis CA and Karp PD, Genome Res. 2000, 10:568-76.
(2) Tsoka S and Ouzounis CA, FEBS Lett. 2000, 480:42-8.
(3) Tsoka S and Ouzounis CA, Nature Genet. 2000, 26:141-2.
(4) Karp PD et al., Nucleic Acids Res. 2000, 28:56-9.
(5) Enright AJ and Ouzounis CA, Bioinformatics 2000, 16:451-7.
(6) Tsoka S and Ouzounis CA, submitted.
|
|
|
| 26. Genome Size Distribution in Prokaryotes, Eukaryotes, and Viruses (up) |
| David Ussery, Heidi Dvinge, Herluf Riddersholm, Nikolaj Blom, Kristoffer
Rapacki, Center for Biological Sequence Analysis, Biocentrum-DTU, Denmark;
dogs@cbs.dtu.dk |
| Short Abstract:
The haploid genome size for more than 5000 organisms is compared. We find a large distribution of sizes, ranging from about 1000 base-pairs (bp) to 670,000,000,000 bp. We compare the genome sizes and repeats for chromosomes from all four eukaryotic kingdoms as well as for prokaryotes and viruses. |
| One Page Abstract:
The haploid genome size for more than 5000 organisms is compared. We find a very large distribution of sizes, ranging from around a 1000 base-pair (bp) viral genome to more than 670,000,000,000 bp for Amoeba dubia. We compare the genome size for all four eukaryotic kingdoms as well as for prokaryotes and viruses. Whilst there is no correlation between the biological complexity of an organism and the size of its genome, there is often a correlation between the size of the nucleus and the genome size. That is, the concentration of the DNA appears to be constant in certain groups of organisms. We compare the genome size to the fraction of various types of DNA repeats for 65 sequenced eukaryotic chromosomes and more than 100 prokaryotic chromosomes, as well as more than 300 viral chromosomes. In general larger eukaryotic chromosomes contain more repetitive DNA than would be expected for a random sequence of the same base-composition, and direct repeats occur more often than inverted repeats. The Database Of Genome Sizes (DOGS) can be found at the following URL: http://www.cbs.dtu.dk/databases/DOGS/index.html |
|
|
| 27. EnsEMBL Genome Annotation Project (up) |
| Ewan Birney, EBI;
Michelle Clamp, Tim Hubbard, The Sanger Centre; Lukasz Huminiecki, Emmanuel Mongin, Arne Stabenau, EBI; birney@ebi.ac.uk |
| Short Abstract:
EnsEMBL, a joint project between the Sanger Centre and the EBI, is an automatic annotation system for eukaryotic genomes. All data and code are freely available and easy to access through CVS and the web. All code is written in object-oriented Perl, using MySQL as the backend relational database. |
| One Page Abstract:
EnsEMBL is an automatic annotation system for eukaryotic genomes. It has been designed as a fully portable and platform independent system which handles both finished and unfinished genomes. It provides annotations at both nucleic acid and amino acid level. Collectively, the features identified on the DNA sequence by EnsEMBL mostly comprise genes, transcripts (alternative splice variants), exons, markers, SNPs, repeats and regions highly similar to other sequences. For each peptide predicted, EnsEMBL provides interpro (pfam, prints, prosite) domain annotation. Currently EnsEMBL distributes data for the human and mouse genomes. This data is contained in a number of relational databases eg. the human data is stored in the core database with sequences and genes, the SNP database, the mouse trace database, the disease database etc. Only the core database is needed to run EnsEMBL software. The EnsEMBL website (www.ensembl.org) provides easy access to this information with a number of visualisation tools such as GeneView, MapView, or ContigView. Additionally, an ftp site (ftp.sanger.ac.uk:/pub/ensembl/current) allows to download large amounts of genomic data. A number of algorithms is utilised in the production of sequence annotations: blast, exonerate, genescan, genewewise etc. These algorithms tend to be computationally intensive and as such require both dedicated software and hardware. The automation in the annotation procedure is achieved by the EnsEMBL-pipeline. The pipeline uses LSF to distribute the computations to a farm of alpha servers. EnsEMBL's code is written in object oriented perl to facilitate better software design and easier porting to java and corba. BioPerl (www.bioperl.org) interfaces are used, wherever this is appropriate. The whole code is seperated into packages according to its use. Packages exist for the operation of the analysis pipeline and the web server as well as for the various non essential datasets (SNP, Maps, Disease, SAGE). 
The underlying MySQL database is accessed through a layer of adaptor objects, which provide persistence for high-level objects. These represent familiar entities such as Genes, Exons, Features, Sequence etc. EnsEMBL provides CVS access to all its code to encourage people to contribute to the project. Code development is openly discussed on a mailing list (ensembl-dev@ebi.ac.uk). EnsEMBL is a joint project between the Sanger Centre and the European Bioinformatics Institute. |
|
|
| 28. A Framework for Identifying Transcriptional cis-Regulatory Elements in the Drosophila Genome (up) |
| Benjamin P. Berman, University of California, Berkeley;
Barret D. Pfeiffer, Susan E. Celniker, Lawrence Berkeley National Laboratory; Michael B. Eisen, Lawrence Berkeley National Laboratory & University of California, Berkeley; Gerald M. Rubin, Howard Hughes Medical Institute & University of California, Berkeley; benb@fruitfly.berkeley.edu |
| Short Abstract:
We are developing techniques that make use of transcription factor binding specificities and evolutionary conservation of binding sites to search for cis-regulatory enhancer elements genome-wide. We have evaluated these techniques using a collection of well-studied enhancer elements from the transcriptional cascade that controls the development of anterior/posterior segmentation in Drosophila. |
| One Page Abstract:
The development and maintenance of the many diverse cell types of complex multi-cellular organisms result in large part from cis-regulatory DNA sequences that control precise mRNA transcriptional programs. In addition to promoter sequences important for recruiting the basal transcriptional machinery, "enhancer" modules up to several hundred base pairs long contain binding sites for various sequence-specific transcription factors. The state of these transcription factors constitutes the input to the transcriptional program, and the enhancer sequence serves as the "logic" which integrates these diverse inputs. This logic can facilitate both inhibitory and cooperative interactions. A single gene may be regulated by many such enhancer modules, which may act independently or together to affect transcription in various spatial and temporal domains during the organism's life. Our aim is to identify novel enhancer modules in the genome of the fruitfly, Drosophila melanogaster. Because enhancer modules can be found within a flanking region of DNA spanning many kilobases, this is not a simple task. We are developing algorithms that take advantage of two important observations: (1) that enhancers are likely to contain many individual binding sites for diverse transcription factors, and (2) that functional binding sites are likely to be conserved through evolution. Previous work indicates that enhancers might be identified with considerable specificity by searching for clusters of conserved binding sites. We have been developing several tools to evaluate this prospect. We have developed an interactive web database to manage transcription factor binding specificities and to plot potential binding sites in the context of genomic annotations. We construct position weight matrix (PWM) models for each of the factors in our database, and these models are used to score potential sites. 
Cutoff scores can be interactively adjusted, as can constraints geared towards finding clusters of sites within a given window size. In order to evaluate our techniques, we have focused on a particular biological process. The cascade that ultimately gives rise to anterior-posterior segmentation in Drosophila involves complex interactions between a host of early embryonic transcription factors, and this process is among the best understood examples of complex transcriptional regulation to be found in any organism. We have collected a set of more than 20 enhancer modules that have been experimentally shown to regulate expression patterns during this process. We will discuss the ability of our techniques to distinguish these known enhancers with specificity appropriate for genome-scale searches. |
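The PWM scoring and site-clustering steps described in abstract 28 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_pwm`, `score_sites`, `find_clusters`), the pseudocount, the uniform background of 0.25 and the toy sites are all assumptions chosen for illustration, and sequences are assumed to contain only A, C, G and T.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned binding sites."""
    length = len(sites[0])
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score_sites(seq, pwm, cutoff):
    """Return start positions whose PWM score is at least `cutoff`."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if score >= cutoff:
            hits.append(start)
    return hits


def find_clusters(hits, window, min_sites):
    """Return (first, last) hit positions for windows holding >= min_sites hits."""
    hits = sorted(hits)
    clusters = []
    for i, first in enumerate(hits):
        inside = [h for h in hits[i:] if h - first < window]
        if len(inside) >= min_sites:
            clusters.append((first, inside[-1]))
    return clusters


# Toy example: four hypothetical aligned sites for one factor.
pwm = build_pwm(["TAATCC", "TAATCC", "TAATCT", "TAATCC"])
hits = score_sites("GGGTAATCCGGGGTAATCCGGG", pwm, cutoff=6.0)
clusters = find_clusters(hits, window=20, min_sites=2)
```

Both the cutoff and the window size are plain parameters, mirroring the adjustable cutoff scores and cluster constraints mentioned in the abstract.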
|
|
| 29. Genome-wide modeling of protein structures (up) |
| Ole Lund, Morten Nielsen, Thomas Nordahl Petersen, Claus Lundegaard,
Garry P. Gippert, Structural Bioinformatics Advanced Technologies A/S
(SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.;
Muthu Prabhakaran, Gopalan Raghunathan, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127. www.strubix.com.; Jakob Bohr, Søren Brunak, SAB, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127.; Kal Ramnarayan, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127. www.strubix.com.; olund@strubix.dk |
| Short Abstract:
We have developed a very sensitive and specific fold recognition method, as well as a method to create accurate alignments for genome wide use. From these alignments protein models for a large part of several complete genomes are generated using the PROTEOMINE (TM) program. |
| One Page Abstract:
We have developed a very sensitive and specific fold recognition method, as well as a method to create accurate alignments for genome-wide use. From these alignments protein models for a large part of several complete genomes are generated using the PROTEOMINE (TM) program. The performance of these methods is compared to the state of the art, both by in-house benchmarks and by participation in CASP4. We apply our methods to the creation of relational databases of the tertiary and/or the secondary structure of all proteins in genomes. These databases can be used for drug discovery and modeling of protein-protein complexes. |
|
|
| 30. Genome wide search of human imprinting genes by data mining of EST in UniGene (up) |
| Maxwell P. Lee, Howard Yang, Ying Hu, Michael Edmonson, Ken Buetow,
Hongtao Fan, Lo Hanson, NIH/NCI/DCEG/LPG;
leemax@mail.nih.gov |
| Short Abstract:
We took a computational approach to search systematically for all human imprinted genes. Analysis of SNP frequency in ESTs in human UniGene has identified 140 candidate imprinted genes. Three of them correspond to known imprinted genes. We are currently validating these findings experimentally. |
| One Page Abstract:
Genomic imprinting is an epigenetic modification of the chromosome that leads to preferential expression of a specific parental allele of the gene. Abnormal imprinting is associated with several human diseases, and loss of imprinting (LOI) is frequently found in human cancers. We have undertaken a computational approach to search systematically for all human imprinted genes. We searched for imprinted genes in a single nucleotide polymorphism (SNP) database containing all SNPs in expressed sequence tags (ESTs). Bayesian statistics were used to estimate the genotype frequency. A significant reduction in heterozygotes suggests that the SNP is located either in an imprinted gene or in a region involved in loss of heterozygosity in tumor cells. From 1.8 million ESTs in UniGene, we have analyzed 20130 genes and have identified 140 candidate imprinted genes. Among the 140 candidates, three correspond to known imprinted genes. We are currently validating these findings experimentally. In conclusion, data mining of ESTs represents an effective way to search genome-wide for imprinted genes. |
|
|
| 31. What we learned from statistics on arabidopsis documented genes (up) |
| Pierre Rouzé, Laboratoire Associé de l'Institut National
de la Recherche Agronomique (France), Universiteit Gent, Belgium;
Catherine Mathé, Flanders Interuniversity Institute for Biotechnology (VIB). Department of Plant Genetics. Universiteit Gent, Belgium; Sébastien Aubourg, Unité de Recherche en Génomique Végétale, Evry, France; Patrice Déhais, Flanders Interuniversity Institute for Biotechnology (VIB). Department of Plant Genetics. Universiteit Gent, Belgium; Pierre.Rouze@gengenp.rug.ac.be |
| Short Abstract:
In order to improve our knowledge of Arabidopsis genes structure, we analyzed a set of almost two thousand genes properly annotated by aligning cognate cDNA to genomic sequences. We present statistics on length and composition of the genes and their elements, codon usage and also new data on some signals. |
| One Page Abstract:
While the Arabidopsis genome is now fully sequenced, its sequence is still not completely deciphered, or at least not in a reliable way. Current annotations were indeed made in an automatic (or semi-automatic) way, using gene prediction software that we know to be far from satisfactory (Pavy et al., 1999). Convinced of the necessity to learn more about gene structures before trying to improve gene prediction strategies, we carried out a statistical analysis of known genes. We only used genes for which biological evidence was available, i.e. for which we had the cognate mRNA (cDNA). Gene structures (intron/exon organisation) were completely reconstructed by aligning the genomic DNA sequence to its cognate cDNA. Thus, most results presented here were obtained from a data set of 1,811 genes, which is not much compared to the expected 26,000 genes of Arabidopsis, but at least refers to true data. Several kinds of analysis were performed. Results globally underline the high diversity of genes, in terms of composition and structure, as well as the non-unique definition of elements (introns, exons) along genes. Indeed, a "standard" gene (transcription unit) should be 2.4 kb long, with 7 coding exons of 192 bp each, and lie about 2 kb from the previous gene. But, besides these means, 10% of the genes have only one exon; the maximal number of exons per gene in our data was 79; an intron can reach 4 kb; and an intergenic sequence can be as small as 300 bp for co-oriented genes or even negative (overlapping genes) otherwise. Moreover, within a gene, the first and last coding exons are larger (270 bp on average) than the internal ones (147 bp), and the first introns are also larger than the following ones (230 bp against 155 bp, on average). We also found some indication of a compositional bias along coding sequences (CDS), with an enrichment of A3% from 5' to 3' (and a decrease of T3), while (G+C)3% is minimal in the middle part of CDS. 
Interestingly, when comparing codon usage between several genes, we come to the conclusion that codon usage is influenced for a major part by translation efficiency-related constraints, leading to 2 gene classes (Mathé et al., 1999). Furthermore, we did some analysis of gene specific signals: splice sites, and among them some non-canonical ones as GC/AG; translation initiation codon; or polyadenylation sites. References Pavy, N., Rombauts, S., Déhais, P., Mathé, C., Ramana, D.V.V., Leroy, P. and Rouzé, P. (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15 (11): 887-899 Mathé, C., Peresetsky, A., Déhais, P., Van Montagu, M. and Rouzé, P. (1999) Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. Journal of Molecular Biology, 285 (5), 1977-1991. |
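The compositional bias along coding sequences reported in abstract 31 (A3%, T3% and (G+C)3% from 5' to 3') can be measured with a short Python sketch. This is only an illustration under assumptions: the function names, the split of each CDS into three equal parts, and the toy sequence are not taken from the abstract.

```python
def third_position_composition(cds, n_segments=3):
    """Return per-segment base frequencies at codon third positions.

    The CDS is read 5' to 3' and its third positions are split into
    `n_segments` consecutive parts (an assumed, arbitrary binning).
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    thirds = cds.upper()[2::3]  # third base of every codon
    n = len(thirds)
    result = []
    for k in range(n_segments):
        segment = thirds[k * n // n_segments:(k + 1) * n // n_segments]
        size = max(len(segment), 1)  # avoid division by zero on tiny inputs
        result.append({base: segment.count(base) / size for base in "ACGT"})
    return result


def gc3(freqs):
    """(G+C) fraction at third positions for one segment."""
    return freqs["G"] + freqs["C"]


# Toy CDS: third positions read A, A, C, C, T, T.
profile = third_position_composition("GCAGCAGCCGCCGCTGCT")
gc3_profile = [gc3(f) for f in profile]
```

Averaging such profiles over all genes in a data set would give the kind of 5'-to-3' trend the abstract describes.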
|
|
| 32. Evaluation of Computer Algorithms to Search Regulatory Protein Binding Sites (up) |
| Esperanza Benítez-Bellón, Gabriel Moreno-Hagelsieb, Julio
Collado-Vides, CIFN, UNAM, México;
ebenitez@cifn.unam.mx |
| Short Abstract:
By the analysis of known regulons we evaluate the efficiency of two methods to extract patterns defining regulator-binding sites and to find other members of the regulon. The points of maximum accuracy for each family are reported. |
| One Page Abstract:
We present the evaluation of two methods for extracting patterns corresponding to the binding sites of several transcriptional regulators reported in RegulonDB. We define a positive and a negative set for each regulon and find the patterns: a matrix when using CONSENSUS (Bioinformatics 15(7/8):563-577, 1999), and a set of dyads when using dyad-detector (Nucl Acids Res 28(8):1808-1818, 2000). We evaluate true positives, true negatives, accuracy, and positive predictive values. We find the best thresholds to find new members of each regulon, and provide some predictions. |
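The evaluation measures named in abstract 32 can be sketched as follows. The code assumes each gene has a single score from the pattern search and that a gene is predicted as a regulon member when its score reaches the threshold; the function names and the toy scores are hypothetical.

```python
def evaluate(predicted, positives, negatives):
    """Count TP/TN/FP/FN and derive accuracy and positive predictive value."""
    predicted = set(predicted)
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    fn = len(positives - predicted)
    tn = len(negatives - predicted)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "ppv": ppv}


def best_threshold(scores, positives, negatives):
    """Try every observed score as a threshold; return the most accurate one.

    Ties are resolved in favour of the highest (strictest) threshold.
    """
    candidates = sorted(set(scores.values()), reverse=True)

    def accuracy_at(threshold):
        predicted = {gene for gene, s in scores.items() if s >= threshold}
        return evaluate(predicted, positives, negatives)["accuracy"]

    return max(candidates, key=accuracy_at)


# Toy data: five genes with hypothetical scores.
scores = {"a": 9, "b": 7, "c": 4, "d": 3, "e": 1}
positives, negatives = {"a", "b", "d"}, {"c", "e"}
threshold = best_threshold(scores, positives, negatives)
```

Choosing the threshold by maximum accuracy corresponds to the "points of maximum accuracy" mentioned in the short abstract; other criteria, such as a minimum positive predictive value, could be substituted in `best_threshold`.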
|
|
| 33. An HMM Approach to Identify Novel Repeats Through Fluctuations in Composition (up) |
| Hui Wang, Affymetrix, Inc; Dept of Computer Science and Computer
Engineering, UCSC, Santa Cruz, CA 95064 USA;
David Haussler, Department of Computer Science and Engineering, University of California at Santa Cruz, Santa Cruz, California 95064 ; John Burke, DoubleTwist Inc. Oakland, California 94612 USA; hui_wang@affymetrix.com |
| Short Abstract:
We present an automatic, efficient HMM method for identifying novel repeats. This method compares local fluctuations in sequence composition with the composition of the database as a whole. It does not rely on similarity comparison with known repetitive element databases and is of great utility for microarray and sequence analysis. |
| One Page Abstract:
Repetitive DNA sequences are widely distributed among eukaryotic and prokaryotic genomes. Reassociation kinetics studies of eukaryotic DNA have established that approximately 30% of the human genome is composed of repetitive DNA sequences [1], while some other genomes, such as X. laevis, contain up to 70% repetitive sequences. These features, often called repetitive elements, have been categorized and archived. If not accounted for, these elements can render many sequence analysis procedures uninformative. Massively parallel gene expression monitoring through microarray technology can potentially be seriously affected by the existence of repetitive elements. The problem is that repetitive sequences can cause sequence similarity among genes unrelated in function. This problem can be even worse in SNP detection. Hence in array design and sequence functional analysis, it is very important to take repetitive elements into consideration. Many procedures for repeat identification and detection have been developed. However, a fundamental limitation of these methods is that they rely mostly upon databases of experimentally known elements and similarity searching algorithms to detect the existence of repetitive elements. There are databases of mRNA fragments (also called expressed sequence tags or ESTs) that sample many of the transcribed regions of the genome. These databases contain multiple representations of non-repetitive elements with frequencies greater than many real repeats; hence positive and negative cases are often indistinguishable on the basis of mere frequency. Clustering and alignment of these databases can remove redundancy, but the presence of repeats reduces the tractability of this procedure as well. Coupled with the availability of whole genome sequences, a thorough, efficient (preferably linear time) computational approach to identifying repeats would be of great utility. 
Hidden Markov Models (HMM) have been successfully used in biological sequence analysis for gene finding, protein structural modeling and phylogenetic analysis. We present a method for repeat detection that collates repeat frequency in the entire databases with local shifts in sequence composition to identify positive cases. A simple HMM is built and used to distinguish repeat states from non-repeat states. We apply this approach to both expressed sequences and whole genome sequences. The method runs in linear time and constant space with respect to database size. |
|
|
| 34. Novel non-coding RNAs identified in the genomes of Methanococcus jannaschii and Pyrococcus furiosus (up) |
| Robert J. Klein, Washington University;
Sean Eddy, Howard Hughes Medical Institute and Washington University; rjklein@genetics.wustl.edu |
| Short Abstract:
We used a bias in G+C content as the basis for a computational screen for novel structural RNAs in sequenced, AT-rich, hyperthermophile genomes. This screen identifies most noncoding RNA loci as well as several novel loci. Expression of small RNAs from some of these loci has been experimentally confirmed. |
| One Page Abstract:
The G+C content of structural RNA genes positively correlates with optimal growth temperature, while the G+C content of an entire genome does not. Although this GC composition difference is undetectably weak in most sequenced genomes, it is a strong bias in AT-rich hyperthermophile genomes (e.g. Methanococcus jannaschii and the various Pyrococcus species). We are using this bias as the basis for a computational screen for novel structural RNAs. Using a two-state hidden Markov model (a formal statistical model of the expected GC bias), we have identified GC-rich regions of these genomes. We have shown that the screen identifies almost all known structural RNAs. We have also identified and done preliminary computational characterization of 14 putative noncoding RNA loci in Methanococcus jannaschii, and 9 putative noncoding RNA loci in Pyrococcus furiosus. Northern blot analysis clearly demonstrates that 3 of these loci in Methanococcus jannaschii and 5 of these loci in Pyrococcus furiosus are expressed as small RNAs. We have cloned and sequenced the full length of several of these RNAs, and sequence analysis argues strongly in favor of a non-coding, rather than small-peptide coding, function for these RNA molecules. |
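A two-state hidden Markov model of the kind described in abstract 34 can be decoded with the Viterbi algorithm. The sketch below is an assumption-laden illustration, not the authors' model: the state names, the emission probabilities, the self-transition probability `stay` and the uniform start distribution are invented for the example, and the input must be a non-empty string of A, C, G and T.

```python
import math

STATES = ("background", "gc_rich")

# Hypothetical emission probabilities for an AT-rich genome.
EMIT = {
    "background": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "gc_rich": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}


def viterbi_gc(seq, emit=EMIT, stay=0.99):
    """Return the most probable state path for `seq` (log-space Viterbi)."""
    log = math.log
    trans = {s: {t: log(stay if s == t else 1.0 - stay) for t in STATES}
             for s in STATES}
    score = {s: log(0.5) + log(emit[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for t in STATES:
            prev = max(STATES, key=lambda s: score[s] + trans[s][t])
            new_score[t] = score[prev] + trans[prev][t] + log(emit[t][base])
            pointers[t] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def gc_rich_regions(path):
    """Convert a state path into half-open (start, end) GC-rich intervals."""
    regions, start = [], None
    for i, state in enumerate(path):
        if state == "gc_rich" and start is None:
            start = i
        elif state != "gc_rich" and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions


# Toy genome: a GC-rich island inside an AT-rich background.
toy = "AT" * 30 + "GC" * 30 + "AT" * 30
regions = gc_rich_regions(viterbi_gc(toy))
```

Regions returned by such a screen would then be candidates for follow-up, in the same spirit as the Northern blot confirmation described in the abstract.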
|
|
| 35. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs (up) |
| Christian M. Zmasek, Washington University School of Medicine;
Sean R. Eddy, Howard Hughes Medical Institute and Washington University School of Medicine; zmasek@genetics.wustl.edu |
| Short Abstract:
A procedure for automated inference of orthologs over bootstrap resampled phylogenetic trees is presented ("RIO", Resampled Inference of Orthologs). This is used for functional prediction via phylogenetic analysis ("phylogenomics"). Results from analyzing the C. elegans proteome are shown. We discuss where phylogenomic analyses might be more reliable than similarity-based analyses. |
| One Page Abstract:
When analyzing protein sequences using sequence similarity searches, orthologous sequences (diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics", [1]) is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Here we present a procedure for automated phylogenomics using explicit phylogenetic inference. At the center stands the inference of gene duplications by comparing the gene tree containing the sequence to be analyzed to a trusted species tree. Various algorithms to accomplish this have been published. We employ one of our own design which has a pathological worst case behavior of O(n²) but which appears to be superior in most practical cases, partially due to its simplicity ([2], and references therein). A major caveat of all phylogenetic analyses is the unreliability of the resulting trees. Therefore, inference of gene duplications is performed over bootstrap resampled phylogenetic trees to estimate the reliability of the orthology assignments (RIO -- Resampled Inference of Orthologs). In addition, unusual differences in maximum likelihood branch length values are used to automatically detect other potential pitfalls for functional annotation caused by unequal rates of evolution. We show results of performing this procedure on the C. elegans proteome. It appears that this procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies. This procedure is being implemented as a suite of Java classes and Perl scripts and will eventually be available in its entirety as (part of) the "FORESTER" framework at http://www.genetics.wustl.edu/eddy/forester/. 
[1] Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research, 8, 163-167. [2] Zmasek CM and Eddy SR (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, in press. |
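The idea in abstract 35 of inferring duplications by comparing a gene tree to a species tree, then repeating this over bootstrap trees, can be sketched in Python. The sketch uses the generic least-common-ancestor mapping (a gene-tree node is a duplication if it maps to the same species-tree node as one of its children); it is not the authors' own algorithm from [2]. Trees are assumed to be rooted, binary, nested tuples with unique leaf names, and all function names are hypothetical.

```python
def index_species_tree(tree):
    """Return parent and depth maps for every node of a nested-tuple tree."""
    parent, depth = {tree: None}, {tree: 0}
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            for child in node:
                parent[child] = node
                depth[child] = depth[node] + 1
                stack.append(child)
    return parent, depth


def lca(a, b, parent, depth):
    """Least common ancestor of two species-tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def infer_duplications(gene_tree, gene_to_species, species_tree):
    """Return the internal gene-tree nodes inferred to be duplications."""
    parent, depth = index_species_tree(species_tree)
    duplications = []

    def mapping(node):
        if not isinstance(node, tuple):
            return gene_to_species[node]
        left, right = (mapping(child) for child in node)
        here = lca(left, right, parent, depth)
        if here == left or here == right:
            duplications.append(node)
        return here

    mapping(gene_tree)
    return duplications


def leaves(node):
    """Set of leaf names below a node."""
    if not isinstance(node, tuple):
        return {node}
    return set().union(*(leaves(child) for child in node))


def are_orthologs(gene_tree, a, b, gene_to_species, species_tree):
    """True if the last common ancestor of genes a and b is a speciation."""
    dups = infer_duplications(gene_tree, gene_to_species, species_tree)
    node = gene_tree
    while True:
        deeper = [c for c in node
                  if isinstance(c, tuple) and {a, b} <= leaves(c)]
        if not deeper:
            return node not in dups
        node = deeper[0]


def ortholog_support(bootstrap_trees, a, b, gene_to_species, species_tree):
    """Fraction of bootstrap gene trees in which a and b are orthologs."""
    hits = sum(are_orthologs(t, a, b, gene_to_species, species_tree)
               for t in bootstrap_trees)
    return hits / len(bootstrap_trees)
```

The value returned by `ortholog_support` plays the role of the reliability estimate for an orthology assignment: a pair that is orthologous in most resampled trees is a better basis for transferring function than one supported by only a few.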
|
|
| 36. Comparative genomics for data mining of eukaryotic and prokaryotic genomes (up) |
| Clemens Suter-Crazzolara, Günther Kurapkat, LION bioscience;
suter@lionbioscience.com |
| Short Abstract:
With the increase in number of completely sequenced genomes, comparative genomics is becoming increasingly important for data mining purposes. We describe a software solution for comparison of multiple eu- and prokaryotic genomes. Key benefits are: high speed analysis and flexible inclusion of any number of genomes, biological databases and applications. |
| One Page Abstract:
With the advent of genome projects, scientists are confronted with a wealth of sequence information. Both the sizes and the number of biological databases grow rapidly. However, the gap between data collection and interpretation is also growing. For this reason, genome databases contain a wealth of information which is accessible, but which may remain obscured. Intelligent systems are needed to bridge this widening gap. To interpret data from newly sequenced eu- and prokaryotic genomes, comparative genomics is rapidly gaining importance. This results from the observation that even genomes of distantly related organisms may still encode proteins with high sequence similarity. Additionally, the order of genes within a genome may also be conserved. As the number of completely annotated genomes grows, comparing new genomes to this knowledge base becomes increasingly important for collecting biological information. We have used these observations to design a computational analysis system which enables novel ways of characterizing gene function through genome comparisons. In an initial step the researcher can, through a simple web interface, add genomes from in-house sequencing projects or public sources to the system. With several algorithms, genome comparison relationships are determined. This results in the collection of data concerning homology and orthology relationships, as well as gene order. This information is stored in five distinct databases. In the second step, the researcher can query these databases for interactive comparisons of genomes. Results are either depicted in graphical views to allow easy interpretation or in tabular form to summarize the data obtained. Initially designed for the analysis of prokaryotic genomes, the application has now been further developed to allow the analysis of eukaryotic genomes such as human, mouse, D. melanogaster, C. elegans, plants or yeasts. 
Expressed Sequence Tag (EST) consensus sequences can also be imported, to allow transcriptome researchers to compare such sequences to completed genomes, allowing the assignment of functions to ESTs. The application is based on the data integration system SRS (Etzold, T. et al., 1996. Methods Enzymol. 266: 114-128) which results in several unique characteristics: 1. High flexibility. The user can add any number of genomes, biological databases or applications to the system. Currently SRS allows the seamless integration of more than 400 biological databases. 2. Reliable, high speed handling of large genomic data sets. The most complex queries give results instantly. 3. Unique, SRS-based linking functions between all databases result in access to a wealth of biological data. 4. User friendly graphical representations allow easy interpretation of search results. These benefits result in highly efficient collection of information on genomes, genes and proteins. The application can be used for projects spanning from the identification of drug targets to the correct annotation of genomes. (http://www.lionbioscience.com/genomeSCOUT) |
|
|
| 37. DNA atlases for the Campylobacter jejuni genome (up) |
| Lise Petersen, CBS, Biocentrum-DTU, and Department of Microbiology,
DVL, Denmark;
Stephen L.W. On, Department of Microbiology, Danish Veterinary Laboratory, DK-1970 Copenhagen.; David W. Ussery, Center for Biological Sequence analysis, Danish Technical University, DK-2800 Lyngby; lpe@svs.dk |
| Short Abstract:
We have analyzed the genome sequence of C. jejuni NCTC11168 for DNA structural motifs. Whilst global repeats are under-represented, local repeats (including palindromic regions) are over-represented in the C. jejuni genome. The three hyper-variable regions of the genome (which all encode surface-exposed products) display unique structural properties. |
| One Page Abstract:
We have analyzed the genome sequence of Campylobacter jejuni NCTC11168 using "DNA Atlases", a method for visualizing DNA properties of an entire chromosome as a circular plot (Genome Atlas). These properties include mechanical or structural parameters (such as intrinsic curvature, base-stacking energy, and DNA flexibility) as well as the occurrence of local and global repeats including palindromic sequences. We find that for the C. jejuni genome, global repeats are under-represented, whilst local repeats are over-represented and the percentage of palindromes significantly exceeds that of other known pathogenic bacteria, including E. coli. This is partly due to the high AT percentage of C. jejuni, but may still lead to increased mutability compared to bacteria with lower values. There are three chromosomal regions containing hypervariable sequences. One region encompasses two sets of global repeats, the fla-genes and one additional set of tandemly arranged genes. The latter contain a conserved domain that shares homology with other annotated genes in the same chromosomal region. The possible function of these genes will be discussed. Furthermore, of the 1700 genes annotated in the genome sequence, we estimate that approximately 12% are random ORFs, and not true genes. We conclude that the genome atlas is a valuable tool, and that the results can be exploited for both fundamental and applied research purposes. Genome atlases for C. jejuni NCTC11168 are available at http://www.cbs.dtu.dk/services/GenomeAtlas/Bacteria/Campylobacter/jejuni/NCTC11168/.
|
|
|
| 38. DNA atlases for the Staphylococcus aureus genome (up) |
| Christian B. Jendresen, Maiken H. Pedersen, Morten S. Thomsen, Torsten
Kolind, David W. Ussery, Center for Biological Sequence Analysis, DTU;
s001638@student.dtu.dk |
| Short Abstract:
Based on DNA atlases of two recently published S. aureus genomes, we have found an obvious symmetry in the circular chromosomes. We also found the pathogenic islands to have structurally extreme properties. Finally, we found several matches between resistance genes and S. aureus plasmids, suggesting horizontal gene transfers. |
| One Page Abstract:
Based on two recently sequenced multi-resistant strains of Staphylococcus aureus we have examined the organism's significant ability to acquire resistance genes and mutate certain pathogenic genes. DNA atlases are used to provide an overview of several structural properties of the genome. Parameters include intrinsic curvature, flexibility, stacking energy, local and global repeats, and quasi- and perfect palindromes. These are used to locate deviating DNA segments able to provide us with information on S. aureus. The S. aureus genome is unusually symmetric, which simplifies origin and terminus determination. We have found DNA atlases to be a valuable tool in studying the correlations between DNA structure and function. Pathogenic factors and superantigens have mutated, thus making the bacteria capable of evading the immune systems of host organisms. Numerous antibiotic resistance genes are located near global repeats and on transposable elements. Close alignments with plasmid vectors suggest the occurrence of horizontal gene transfers. Genome atlases for S. aureus N315 and S. aureus Mu50 are available at http://www.cbs.dtu.dk/services/GenomeAtlas/Bacteria/Staphylococcus/aureus |
|
|
| 39. SNAPping up functionally related genes based on context information: a colinearity free approach (up) |
| Grigory Kolesov, Hans-Werner Mewes, Dmitrij Frishman, MIPS, Institute
for Bioinformatics, GSF National Research Center for Environment and Health;
G.Kolesov@gsf.de |
| Short Abstract:
A new algorithm finds functionally related non-homologous genes in prokaryotic genomes. It does not rely on the presence of conserved gene strings. Instead, it utilizes the graph of neighborhood and similarity relationships to find those paths that have higher probability to include genes which are functionally related. |
| One Page Abstract:
We present a computational approach for finding genes that are functionally related but do not possess any noticeable sequence similarity. Our method, which we call SNAP (Similarity-Neighborhood APproach), reveals the conservation of gene order on bacterial chromosomes based on both cross-genome comparison and context information. The novel feature of this method is that it does not rely on detection of conserved colinear gene strings. Instead, we introduce the notion of a similarity-neighborhood graph (SN-graph) which is constructed from the chains of similarity and neighborhood relationships between orthologous genes in different genomes and adjacent genes in the same genome, respectively. An SN-cycle is defined as a closed path on the SN-graph and is postulated to preferentially join functionally related gene products that participate in the same biochemical or regulatory process. We demonstrate the substantial non-randomness and functional significance of SN-cycles derived from real genome data and estimate the prediction accuracy of SNAP in assigning broad function to uncharacterized proteins. Technically, the SNAP algorithm is implemented as a multithreaded server application and is accessible via the Web. |
|
|
| 40. Sequencing and Comparison of Orthopoxviruses (up) |
| Scott Sammons, Michael Frace, Miriam Laker, Melissa Olsen-Rasmussen,
Roger Morey, Yu Li, Richard Kline, Joseph J. Esposito, Inger Damon, Robert
Wohlhueter, National Center for Infectious Diseases, Centers for Disease
Control & Prevention;
ssammons@cdc.gov |
| Short Abstract:
Six variola virus isolates were sequenced by using long-distance, high-fidelity PCR of overlapping amplicons as templates for fluorescence-based sequencing. PCR-based primer-walking sequencing reactions were separated by capillary electrophoresis. Output sequence trace files were edited and assembled using Phred/Phrap/Consed, and open reading frames of greater than 60 amino acids were analyzed. |
| One Page Abstract:
The genomes of six variola major strains were sequenced by using long-distance, high-fidelity PCR of overlapping amplicons as templates for fluorescence-based sequencing. Each genome is approximately 186 kb of double-stranded DNA with between 200 and 240 predicted open reading frames (ORFs) of greater than 60 amino acids. Each ORF sequence has been compared with the five other locally sequenced strains and with sequences of previously published orthopoxviruses Bangladesh-1975 (L22579), India-1967 (X69198), and vaccinia virus Copenhagen (M35027). The most highly conserved ORFs are located in the center portion of the genome, and the majority have known functions involving transcription, DNA replication and repair, protein processing, virion structure, and nucleotide metabolism. To minimize the amount of poxvirus needed, 15 micrograms of purified genomic DNA was used as template for approximately 1800 primer-walking sequencing reactions. The reactions were set up using robotic assistance and subjected to thermocycling, and the reaction products were separated by capillary electrophoresis (Beckman Coulter CEQ 2000XL). Sequencing of variola strains Congo-1970, Somalia-1977, India-1964, Horn-1948, Nepal-1973, and Afghanistan-1970 has been completed. Output sequence trace files were edited, evaluated for quality, and then assembled by using Phred/Phrap/Consed software until about a 10-fold redundancy of high-quality sequence data was attained. ORFs of greater than 60 amino acids were then compared with each other and with those in the public databases. These ORFs were analyzed for the presence of known early, middle, and late promoter sequences. ORFs with no homologs in the other strains were further analyzed for protein motifs by using several tools and databases. |
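The ORF criterion used in abstract 40 (open reading frames of greater than 60 amino acids) can be illustrated with a simple six-frame scan. This sketch assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, which the abstract does not specify; the function names and the toy sequence are hypothetical, and coordinates for the minus strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, longer_than_aa=60):
    """Return (strand, start, end, n_aa) for ORFs with n_aa > longer_than_aa.

    `start` is the index of the ATG and `end` is one past the stop codon,
    both on the scanned strand; `n_aa` counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n_aa = (i - start) // 3
                    if n_aa > longer_than_aa:
                        orfs.append((strand, start, i + 3, n_aa))
                    start = None
    return orfs


# Toy sequence: one 71-codon ORF flanked by two bases on each side.
toy = "CC" + "ATG" + "GCT" * 70 + "TAA" + "CC"
orfs = find_orfs(toy)
```

The length cut-off is exposed as a parameter so that the same scan can be rerun with a different threshold.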
|
|
| 41. Integrating mouse and human comparative map data with sequence and annotation resources. (up) |
| Ann-Marie Mallon, Joseph Weekes, Paul Denny, Mark Strivens, Steve Brown,
Informatics Group, Mammalian Genetics Unit and UK Mouse Genome Centre,
Medical Research Council;
a.mallon@har.mrc.ac.uk |
| Short Abstract:
Comparative maps have previously been generated between mouse and human. This data has been used to integrate the homologous genomic sequence between the two species and to allow integration of other resources such as sequence annotation/similarity or phenotype data, facilitating the use of data present in only one species. |
| One Page Abstract:
The progress of human and mouse genome sequencing programmes allows systematic cross-species comparison of the two genomes as a tool for gene and regulatory element identification [1]. This data will also be an important tool for exploiting the rapidly growing mouse mutant resource and moving from mutant phenotype to underlying gene. As the opportunities to perform comparative sequence analysis emerge, it is important to develop parameters for such analyses and to examine the outcomes of cross-species comparison for use in developing integrated data sets and software for such analysis. As the sequence data increases in quantity and accuracy it is important to be able to extract the genomic sequence from homologous chromosomal regions. This information would then be beneficial in generating parameters for comparative sequence analysis and also as a foundation for building an integrated data source between the two species. This could then become the basis for querying various linked sources of information including sequence annotation and phenotype data. To date detailed comparative maps have been generated between these two species, utilising homologous genes as markers to highlight homologous chromosomal segments between the two genomes. We have utilised this comparative map data to integrate the homologous genomic sequence between the two organisms. Using this data we aim to refine the comparative map data by sequence alignment and also to integrate other information from databases such as MGD (Mouse Genome Database). This system will be utilised within the UK Mouse sequencing programme (http://mrcseq.har.mrc.ac.uk), to aid in the annotation of mouse genomic sequence. [1] Mallon A.M., et al. Comparative genome sequence analysis of the Bpa/Str region in mouse and Man. Genome Res. 2000 Jun;10(6):758-75.
|
|
|
| 42. De novo Identification of Repeat Families in the Genome (up) |
| Zhirong Bao, Sean Eddy, HHMI, Dept of Genetics, Washington University;
bao@genetics.wustl.edu |
| Short Abstract:
We have developed a de novo approach for the identification and classification of repetitive elements from genomic sequences. We incorporated multiple alignment information to extend the usual approach of single linkage clustering of BLAST hits. The algorithm is now being used to analyze various genomes. |
| One Page Abstract:
Repetitive elements constitute a major part of eukaryotic genomes. We have developed and implemented a de novo approach for the identification and classification of these elements from genomic sequences, based on algorithmic extensions to the usual approach of single linkage clustering of BLAST hits. To overcome the tendency of single linkage clustering to group unrelated sequences, we incorporated multiple alignment information in defining the boundaries of individual copies of the elements and in constructing linkages. The algorithm is now being used to analyze various genomes. |
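The baseline that this abstract extends, single linkage clustering of pairwise hits, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the hit list, the sequence IDs, and the function name `single_linkage_clusters` are hypothetical, and the multiple alignment refinement described above is not modelled.

```python
from collections import defaultdict


def single_linkage_clusters(hits):
    """Group sequence IDs so that any pairwise hit links two IDs (union-find)."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for seq_id in list(parent):
        clusters[find(seq_id)].add(seq_id)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical hits: r1 and r3 never hit each other directly,
# yet they end up in one cluster because both hit r2.
hits = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
clusters = single_linkage_clusters(hits)
print(clusters)
```

The toy input shows the chaining effect the abstract refers to: one bridging sequence is enough to merge elements that share no direct similarity.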
|
|
| 43. Using the Arabidopsis genome to assess gene content in higher plants (up) |
| Keith Allen, Paradigm Genetics;
kallen@paragen.com |
| Short Abstract:
To assess gene content in Arabidopsis, I have exhaustively compared EST unigene sets from eight plant species to the Arabidopsis genome. Comparison between unigene sets identified lineage-specific genes, and gene loss in Arabidopsis. About 15% of dicot genes and about 30% of monocot genes are missing from Arabidopsis. |
| One Page Abstract:
Arabidopsis has been a favorite model organism of plant biologists for more than two decades because of its small size, rapid growth cycle, small genome and tractable genetics. Using an organism as a model system assumes that genes in this organism will have similar functions to the equivalent genes in other species, and more fundamentally, that the gene complement of the model system is substantially similar to other, more economically important species. The completion of the Arabidopsis genome, coupled with a number of large EST projects in other species, provides an unprecedented opportunity to examine this question of gene content, and to clarify the position of Arabidopsis as a model system. Specifically, to what extent does Arabidopsis contain the same genetic complement as other plant species? This question can be addressed by using Arabidopsis as a reference genome, and then comparing unigene sets constructed for each test genome to the reference genome using a Smith-Waterman algorithm that translates both query and target DNA sequences in six frames and does the comparison in protein space. This computationally expensive step was performed on a Paracel Genematcher2 as part of a beta testing program. Unigene sets were constructed from publicly available EST projects for six angiosperms, one conifer, and a green alga. The test species (and number of input ESTs) were tomato (Solanaceae, 94,523 ESTs), soybean (Papilionoideae, 137,952 ESTs), Medicago (Papilionoideae, 115,717 ESTs), rice (Oryzeae, 63,790 ESTs), barley (Triticeae, 105,273 ESTs), maize (Andropogoneae, 85,287 ESTs), Loblolly Pine, a conifer (Pinaceae, 31,99 ESTs), and a green alga, Chlamydomonas (Volvocales, 55,874 ESTs). Unigene sets were constructed using the Paracel Clustering Package. The central conclusion of this work is that "novel" genes (i.e., genes not present in the reference genome) increase in number with increasing evolutionary distance. 
Soybean, the closest relative of Arabidopsis in this set, had about 17% of its EST contigs fail to get a hit in Arabidopsis. In pine, the most distant higher plant species used, this number was over 30%. I will present a closer examination of the "novel" genes in each species and a cross-species comparison to, for example, identify monocot-specific genes. I will also present direct evidence of specific gene loss events in the lineage leading to Arabidopsis. Analysis of contigs corresponding to genes not found in Arabidopsis will also be shown. |
|
|
| 44. Origin of Replication in Circular Bacterial Genomes and Plasmids (up) |
| Peder Worning, Lars J. Jensen, Hans-Henrik Stærfeldt, Dave W.
Ussery, CBS, Biocentrum, DTU;
peder@cbs.dtu.dk |
| Short Abstract:
By comparing the frequencies of all oligonucleotides up to 8 bp in length on the leading and lagging strands, we find the origin of replication in circular genomes. We find the origin in nearly all the sequenced bacterial genomes, several bacterial plasmids, and a number of mitochondrial and chloroplast genomes. |
| One Page Abstract:
We present a method for finding the origin of replication in circular bacterial genomes. The method is based on differences in word frequencies between the leading and lagging strands, and the nucleotide sequence is the only input needed. We have analysed complete genome sequences from more than 50 different species including both Bacteria and Archaea. Our method finds the correct location in all the genomes where the position of the origin is firmly confirmed, and shows that the origin has been misplaced in several of the published genomes. We even find a probable origin position in the genomes where it has not been predicted before. We are also able to find the position of the origin in several bacterial plasmids, plus some mitochondrial and chloroplast genomes. |
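The idea of locating the origin from strand asymmetry can be sketched with the simplest such signal, cumulative GC skew. This is a deliberately reduced stand-in, not the method of the abstract: it uses only the one-letter words G and C rather than all oligonucleotides up to 8 bp, it ignores circularity, and the toy sequence and function names are hypothetical. It relies on the common observation that the leading strand of many bacterial genomes carries more G than C, so the running G minus C count tends to reach its minimum near the origin.

```python
def cumulative_gc_skew(seq):
    """Running count of G minus C along the sequence."""
    skew, total = [], 0
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        skew.append(total)
    return skew


def skew_minimum(seq):
    """0-based index of the first minimum of the cumulative GC skew."""
    skew = cumulative_gc_skew(seq)
    return skew.index(min(skew))


# Toy linearised genome: C-rich up to index 4, G-rich afterwards,
# so the strand bias switches between index 4 and index 5.
toy_genome = "CCACC" + "GGTGGAGG" + "CCTCC"
origin_estimate = skew_minimum(toy_genome)
print(origin_estimate)
```

On real genomes the curve is noisy, which is one reason to pool evidence from many longer words, as the abstract describes.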
|
|
| 45. Combining frequency and positional information to predict transcription factor binding sites (up) |
| Szymon M. Kielbasa, Jan O. Korbel, Dieter Beule, Johannes Schuchhardt,
Hanspeter Herzel, Innovationskolleg Theoretische Biologie;
s.kielbasa@itb.biologie.hu-berlin.de |
| Short Abstract:
Both the frequency and positional information are analysed to predict transcription factor binding sites in upstream regions of coregulated genes. Evaluations for several yeast families as well as new results for a set of genes downregulated via H-Ras activation are presented. |
| One Page Abstract:
Even though a number of genome projects have been finished on the sequence level, still only a small proportion of DNA regulatory elements have been identified. Growing amounts of gene expression data provide the possibility of finding coregulated genes by clustering methods. By analysis of the promoter regions of those genes, rather weak signals of transcription factor binding sites may be detected. We present the algorithm ITB ("Integrated Tool for Box finding"), which combines frequency and positional information to predict transcription factor binding sites in upstream regions of coregulated genes. Motifs of a specified length are exhaustively scored by estimating their frequencies with respect to a statistical background model and investigating their tendency to form clusters. The alphabet used to assemble motifs may contain symbols matching multiple bases. ITB detects consensus sequences of experimentally verified transcription factor binding sites of the yeast Saccharomyces cerevisiae. Moreover, a number of new binding site candidates with significant scores are predicted. Besides applying ITB to the yeast upstream regions, the program is run on human promoter sequences. From this investigation a new candidate motif "CGARCG" has been proposed as a transcription factor binding site for a set of genes downregulated via H-Ras activation. |
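The frequency half of such a score can be illustrated as follows. This is a minimal sketch under stated assumptions, not the ITB scoring function: the background is a zeroth-order (single-base) model estimated from the input itself, the score is a binomial z-score, the positional clustering criterion and the degenerate alphabet are left out, and the upstream sequences and the planted word are made up.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def score_motifs(seqs, k):
    """z-score of each observed k-mer against a single-base background model."""
    base_counts = Counter("".join(seqs))
    total = sum(base_counts.values())
    freq = {b: n / total for b, n in base_counts.items()}
    n_windows = sum(max(len(s) - k + 1, 0) for s in seqs)
    scores = {}
    for motif, observed in kmer_counts(seqs, k).items():
        p = math.prod(freq[b] for b in motif)  # background probability of the word
        expected = n_windows * p
        # Binomial standard deviation; assumes p < 1 (more than one base type present).
        scores[motif] = (observed - expected) / math.sqrt(expected * (1 - p))
    return scores


# Made-up upstream regions, each containing the planted word CGACG.
upstream = ["TTACGACGTT", "ATCGACGATA", "GGCGACGTAT"]
scores = score_motifs(upstream, k=5)
best = max(scores, key=scores.get)
print(best)
```

A word present in every sequence of a small coregulated set stands out even under this crude background; a realistic background would be estimated from all upstream regions of the genome.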
|
|
| 46. Identification of distantly related sequences through the use of structural models of protein evolution (up) |
| Lisa Davies, Nick Goldman, Department of Zoology, University of
Cambridge;
L.Davies@zoo.cam.ac.uk |
| Short Abstract:
A novel database search program is assessed which identifies distantly related homologs through the inclusion of information about the effects of protein structure on sequence evolution. Solvent accessibility information only confers a limited advantage, but protein secondary structure information markedly improves the identification of homologous sequences with low sequence identity. |
| One Page Abstract:
Database search programs in common use (e.g. BLAST, FASTA) identify sequence homology through the use of pairwise alignment techniques. These programs are good at detecting closely related sequences but have problems accurately detecting homologous sequences with low sequence identity. A new approach, described here, tries to improve the detection of distantly related homologs by rejecting the assumption that all sites in a protein behave in an identical manner. This is done without the use of profile techniques, which require the preliminary collection of a set of homologs. Programs such as BLAST and FASTA use general properties of a protein to generate alignment scores, which simplifies calculations but may also result in a decrease in accuracy. In reality, amino acid replacement probabilities and rates, amino acid frequencies and gap probabilities all vary according to where a residue lies in a protein structure. Typical patterns of these structure-specific variations in evolutionary dynamics can be incorporated into a database search program through the use of Hidden Markov Models (HMMs) and hence potentially improve the detection of more distantly related sequences. In this study, the utility of including structure-specific evolutionary information in a database search program has been assessed. Initial work has concentrated on the generation of database search programs that use either solvent accessibility distinctions or protein secondary structure distinctions. The improvement gained by adding the `extra' information has then been evaluated through the use of both simulated sequences, which exactly fit the models, and real sequences from the SCOP database. The success rate of each of these programs has been compared to a simplified model that contains the general properties of proteins but with no structural distinctions. We have discovered that adding accessibility information gives a limited advantage only when sequences are distantly related. 
Using secondary structure distinctions, however, provides a greater improvement over a model containing no structural information for all but the case when the sequences are closely related. There is also an advantage over more traditional database search programs such as BLAST, FASTA and Smith-Waterman. Incorporating structure-specific evolutionary models into database search programs can therefore potentially lead to an improvement in the identification of distant homologs. |
|
|
| 47. The Paracel Filtering Package (PFP): A Novel Approach to Filtering and Masking of DNA and Protein Sequences (up) |
| Cecilie Boysen, Charles P. Smith, Stephanie Pao, Cassi Paul, Joseph
A. Borkowski, Paracel, Inc.;
boysen@paracel.com |
| Short Abstract:
We describe a comprehensive, flexible suite of tools that utilize several algorithms to identify repeats and contaminants in biomolecular sequences. These methods include using repeat profile models and non-destructive XML-based annotation. Together, these tools enable otherwise unavailable masking options. We demonstrate PFP's speed and effectiveness on a variety of datasets. |
| One Page Abstract:
Filtering and masking is a required, but often neglected, first step for many bioinformatics analyses, such as EST clustering and assembly, database mining, and DNA chip design. Properly done, this filtering and masking can produce a dramatic improvement in the quality of the final results. Unfortunately, many current masking techniques are an ad hoc assembly of various single purpose comparison tools. Using such an assembly of tools is often cumbersome, requiring many individual steps including simple conversion and bookkeeping operations. Managing filtering in this way can often lead to incomplete or erroneous results. To address these known shortcomings we have developed the Paracel Filtering Package (PFP), a comprehensive, flexible suite of tools for filtering and masking. PFP takes input in most standard formats and identifies sequences using a variety of user selectable algorithms. These include dust and pseg for identification of low complexity regions in DNA and protein sequences, and Haste (hash accelerated search tool) and full Smith-Waterman for comparison to sets of repeat, vector, or contaminant sequences. PFP also identifies low quality sequences based on either quality values or ambiguous base calls. A variety of actions can then be performed on the identified sequence regions. The available actions are masking, removal of the entire sequence, excision, or annotation. The annotation action is unique to the PFP suite in that it produces an XML based annotation of the identified sequence regions, e.g. low-complexity and genome-wide repeat regions, without replacing the underlying sequence characters. 
This allows for non-destructive masking options when used as part of a multi-step clustering and assembly process where masks are applied during one stage (such as clustering) and removed during subsequent stages (such as final assembly). This allows production of final assemblies and consensus sequences free of unwanted masking characters. Additionally, PFP, if used with Paracel's GeneMatcher, can use repeat profiles for highly sensitive annotation of repeat regions. These profiles are Gribskov-type DNA profiles created from multiple sequence alignments of identified repeat regions. Here we investigate the use of such repeat profile searching and compare the results with the commonly applied algorithms for repeat finding. PFP can be customized for the specific task at hand, that is, different settings can be applied depending on the purpose and the kind of sequence and species. We will describe a general method for optimization of masking parameters that can be used on any dataset. These tools present a comprehensive set of filtering and masking options not available in any other package. Still, executed by a single command, PFP performs this often convoluted process of cleaning up or annotating sequences with great speed and effectiveness. |
|
|
| 48. EST Clustering with Self-Organizing Map (SOM) (up) |
| Ari-Matti Saren, Plant Genomics group, Institute of Biotechnology,
University of Helsinki;
Timo Knuuttila, Mikko Kolehmainen, Visipoint Oy; ari-matti.saren@helsinki.fi |
| Short Abstract:
The existing EST clustering algorithms perform adequately on smaller datasets, but with the rapid increase in EST data, scalability is becoming a problem. We present a method of using short exact sequence matches and Self-Organizing Map (SOM) to rapidly classify the EST data into subsets that can be clustered individually. |
| One Page Abstract:
The existing EST clustering algorithms perform adequately on small and moderate datasets, but with the rapid accumulation of EST sequence data, scalability is increasingly becoming a problem. Since the scalability issue is inherent in this sort of problem where everything has to be compared to everything, we try to approach the problem from the side of keeping the datasets reasonably sized. A Self-Organizing Map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large datasets. We present a method of dividing ESTs into subsets of potentially clustering sequences by using a SOM to classify the sequences according to the number of short exact sequence matches found. These subsets can then be individually clustered and aligned using existing tools. We are using the Visual Data SOM software package from Visipoint Oy (http://www.visipoint.fi/index.html) and custom software components developed at the Barley EST sequencing project at Institute of Biotechnology Plant Genomics group (http://www.biocenter.helsinki.fi/bi/bare-1_html/est.htm). Visual Data uses a tree-structured SOM (TS-SOM) algorithm [1], a variation of the classical Kohonen SOM [2]. [1] Koikkalainen, P. (1994) in: Proceedings of the 11th European Conference on Artificial Intelligence (Cohn A., Ed.), pp. 211-215. Wiley and Sons, New York. [2] Kohonen, T. (1995) Self-organizing maps. Springer, Berlin. |
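The "number of short exact sequence matches" can be made concrete with a small Python sketch that counts shared words between ESTs. It only shows how such match counts could be computed as input features; the SOM step itself is not implemented, and the word length, the toy sequences, and the function names are assumptions rather than details from the abstract.

```python
def kmers(seq, k):
    """Set of distinct words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def match_matrix(ests, k):
    """matrix[i][j] = number of distinct k-mers shared by EST i and EST j."""
    word_sets = [kmers(s, k) for s in ests]
    return [[len(a & b) for b in word_sets] for a in word_sets]


# Toy ESTs: the first two overlap by the 10 bases GGTCAGTTCA, the third is unrelated.
ests = [
    "ACGTACGGTCAGTTCA",
    "GGTCAGTTCAGGCATT",
    "TTTTTTTTTTTTCCCC",
]
matrix = match_matrix(ests, k=6)
for row in matrix:
    print(row)
```

Each row of the matrix could serve as a feature vector for an unsupervised classifier, so that ESTs with many shared words fall into the same subset before the expensive clustering and alignment step.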
|
|
| 49. Analysis of Information Content for Biological Sequences (up) |
| Jian Zhang, EURANDOM, Eindhoven, The Netherlands;
jzhang@euridice.tue.nl |
| Short Abstract:
We present an exploratory approach to parsing and analyzing a set of multiple DNA and protein sequences. It is based on an analysis-of-variance (ANOVA) type decomposition of the information content. Our method is applied to parsing and clustering some protein sequences. |
| One Page Abstract:
Decomposing a biological sequence into modular domains is a basic prerequisite to identify functional units in biological molecules. Several commonly used segmentation procedures consist of two steps: first, collect and align a set of sequences which is homologous to the target sequence; then parse this multiple alignment into several blocks and identify the conserved ones by using a semi-automatic method, which combines manual analysis and expert knowledge. In this paper, we present an exploratory approach to parsing and analyzing the above multiple alignment. It is based on an analysis-of-variance (ANOVA) type decomposition of the information content, a variant on the concept reviewed by Stormo and Fields (1998). Unlike the traditional changepoint method, our approach takes into account not only the composition biases but also the overdispersion effects among the blocks. Our method is tested on the families of ribosomal proteins with a promising performance. Finally we extend our approach to the problem of clustering a set of objects labeled by some probability vectors. As an application, our approach is applied to clustering protein sequences via their pairwise alignment scores. |
|
|
| 50. Clustering proteins by fold patterns (up) |
| David Gilbert, Department of Computing, City University, London,
UK;
Juris Viksna, Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia; Aik Choon Tan, Department of Computing, City University, London, UK; Lorenz Wernisch, Department of Crystallography, Birkbeck College, University of London, London, UK; drg@soi.city.ac.uk |
| Short Abstract:
We describe a technique to automatically divide a set of proteins into clusters each associated with a common topological pattern, using TOPS descriptions as formal models of protein folds. We are applying the technique to generate characteristic patterns associated with EC functional groups. |
| One Page Abstract:
We have developed a technique to automatically divide a set of proteins into clusters each associated with a common topological pattern. This technique can be applied to families of protein domains with structurally diverse members where there is no clear structure-based tree hierarchy. Examples are the families determined by the Enzyme Classification scheme; EC numbers are assigned at the chain level in the PDB and thus all the domains comprising a chain will be assigned the same number. The motivation is that a pattern associated with an undivided diverse group may be very small and of weak descriptive and hence classificatory power. Our ultimate aim is to discover patterns with a high classificatory power for diverse protein families. Our method takes as input TOPS descriptions of protein structures (Gilbert et al, 1999) and uses a development of the pattern discovery method for topological patterns described in (Gilbert et al, 2001) which employs repeated pattern extension and matching. In our new method, we permit patterns to be discovered for less than 100% of the examples in a learning set S. Effectively this means that having discovered a maximal pattern which describes all the members of S, we attempt to continue to extend that pattern to a larger pattern P which only matches a subset T of S. We then remove the matched set T from S to give S' and repeat the procedure over S' until the learning set becomes empty. The result of our method is a cover of the initial set S of structures, comprising a partition of S into subsets each associated with a descriptive pattern. The critical issue is to decide when to stop extending the pattern; we achieve this when the `goodness' of the pattern reaches a certain threshold (PV). Otherwise, we may end up by generating a pattern which is associated with just one domain, and this is likely to be too specific and of no use in classifying new structures. 
This `goodness' is computed as a function over the compression that the pattern achieves over all the members of the matching set T and its coverage, i.e. the size of T compared to S. We have applied our algorithm to a representative subset of the protein data bank, derived from non-identical representatives (N-reps) of release 2.0 of the CATH database (www.biochem.ucl.ac.uk/bsm/cath). We selected those domains to which a function in the EC classification has been assigned, and further restricted our set to those domains with some beta sheet content (since our pattern discovery method does not work well for all-alpha domains). The table of associations between EC numbers and domains was supplied by Roman Laskowski of the BSM group at University College. We performed our pattern discovery and clustering over 141 families of proteins (defined by every domain sharing the same 4 EC numbers) and containing at least 4 members. Some results for this set with graphical illustrations of the patterns can be found at http://www.soi.city.ac.uk/~drg/tops/EC_alphabeta/ References: Gilbert D, Westhead DR, Nagano N, Thornton JM. Motif-based searching in TOPS protein topology databases. Bioinformatics 1999;15:317-326. Gilbert D, Westhead DR, Viksna J, Thornton JM. Topology-based protein structure comparison using a pattern discovery technique. Journal of Computers and Chemistry, 2001, in press. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH - A Hierarchic Classification of Protein Domain Structures. Structure 1997;5(8):1093-1108. |
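The cover procedure described in this abstract (extend a pattern until it is good enough, remove the matched set T from S, repeat) can be sketched in Python with a heavy simplification. Here each structure is a plain set of feature labels instead of a TOPS description, a pattern is a set of features shared by all matched structures, and the `goodness` function (pattern size times coverage) is an assumed placeholder, since the abstract does not give the actual formula. All names and data are hypothetical.

```python
def goodness(pattern, matched, n_total):
    """Assumed placeholder: pattern size weighted by coverage |T| / |S|."""
    return len(pattern) * len(matched) / n_total


def cover(structures, threshold):
    """Partition structures into subsets, each with a shared descriptive pattern."""
    remaining = {name: frozenset(f) for name, f in structures.items()}
    result = []
    while remaining:
        matched = set(remaining)
        # Maximal pattern describing every member of the current learning set.
        pattern = frozenset.intersection(*remaining.values())
        while goodness(pattern, matched, len(remaining)) < threshold:
            candidates = set().union(*(remaining[n] for n in matched)) - pattern
            best = None
            for feature in sorted(candidates):
                subset = {n for n in matched if feature in remaining[n]}
                if best is None or len(subset) > len(best):
                    best = subset  # keep the extension that retains most members
            if best is None:
                break  # nothing left to extend with
            matched = best
            # Grow the pattern to everything the retained members share.
            pattern = frozenset.intersection(*(remaining[n] for n in matched))
        result.append((pattern, sorted(matched)))
        for name in matched:
            del remaining[name]
    return result


domains = {
    "d1": {"a", "b", "c"},
    "d2": {"a", "b", "c"},
    "d3": {"a", "d"},
    "d4": {"a", "d"},
}
partition = cover(domains, threshold=1.5)
for pattern, members in partition:
    print(sorted(pattern), members)
```

With these toy data the pattern shared by all four domains is too small to pass the threshold, so it is extended to a larger pattern matching two domains, and the remaining two are covered in a second round.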
|
|
| 51. Confidence Measures for Fold Recognition (up) |
| Ingolf Sommer, Niklas von Öhsen, Alexander Zien, Ralf Zimmer,
Thomas Lengauer, GMD/SCAI;
ingolf.sommer@gmd.de |
| Short Abstract:
We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. There are two associated problems, first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the identified candidates in order to select the best one. |
| One Page Abstract:
We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. There are two associated problems: first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the identified candidates in order to select the best one. In particular, p-values have been proposed for assigning confidence for local sequence alignment searching procedures, such as BLAST, with great success due to an extensive theoretical backing. We propose empirical approximations to p-values for searching procedures involving global alignments, sequence profiles, and structure based scores (threading). We review different methods for detecting remotely homologous protein folds (sequence alignment and threading, with and without frequency profiles, global and local), analyze their fold-recognition performance and establish confidence measures that tell how much trust to put into the prediction made with a certain method. The analysis is performed on a representative subset of the PDB with at most 40% sequence homology, with proteins classified according to the SCOP classification. For fold-recognition, i.e. the attempt to find a template fold with a known structure for a target protein sequence whose structure we are searching, we find that methods using frequency profiles generally perform better than methods using plain sequences, and that threading methods perform better than sequence alignment methods. Thus the method of choice for detecting remote homologies is threading using frequency profiles. In order to assess the quality of the predictions made with these methods, we establish several confidence measures, including raw scores, z-scores, raw-score gaps, z-score gaps, and different methods of p-value estimation (thus ranging from computationally cheap to more elaborate) and compare them. The confidence-measure methods are compared with several error measures. 
For local alignment methods, where the distribution of scores is theoretically known, we find that p-value methods that make use of this knowledge work best, although computationally cheaper methods such as the score gaps perform competitively. For global methods, where no theory is available on the score distribution, score-gap methods perform best. * S. F. Altschul et al, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Research, 1997 * M. Gribskov et al, "Profile analysis: Detection of distantly related proteins", PNAS, 1987 * N. Alexandrov et al, "Fast Protein Fold Recognition via Sequence to Structure Alignment and Contact Capacity Potentials", Proc. Pacific Symposium on Biocomputing, 1996 * S. F. Altschul et al, "Basic local alignment search tool", JMB, 1990
|
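Two of the cheaper confidence measures named in abstract 51, the z-score and the raw-score gap, can be sketched in Python as follows. The exact definitions used by the authors are not given in the abstract, so the ones below are assumptions: the z-score of the top hit is taken relative to all other scores in the hit list, and the score gap is the difference between the best and the second-best score. The score list is made up.

```python
import statistics


def top_hit_z_score(scores):
    """z-score of the best score relative to the remaining scores (assumed definition)."""
    ranked = sorted(scores, reverse=True)
    top, rest = ranked[0], ranked[1:]
    return (top - statistics.mean(rest)) / statistics.stdev(rest)


def score_gap(scores):
    """Difference between the best and second-best score (assumed definition)."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]


# Made-up alignment scores of one target against five template folds.
scores = [50, 20, 18, 22, 20]
gap = score_gap(scores)
z = top_hit_z_score(scores)
print(gap, round(z, 2))
```

Both measures need only the list of scores already produced by the search, which is why they are cheap compared with p-value estimation; a large gap or z-score suggests that the top-ranked fold stands clearly apart from the alternatives.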
|
|
| 52. Protein Structure Prediction By Threading With Experimental Constraints (up) |
| Mario Albrecht, Institute for Algorithms and Scientific Computing
(SCAI), German National Research Center for Information Technology (GMD);
Ralf Zimmer, Thomas Lengauer, Institute for Algorithms and Scientific Computing (SCAI), German National Research Center for Information Technology; mario.albrecht@gmd.de |
| Short Abstract:
We present an extended and modified version RDP* of the Recursive Dynamic Programming method to predict the structure of proteins. The algorithm is now capable of incorporating additional structural constraints, for instance, atomic distances obtained by mass spectrometry or NMR spectroscopy experiments, into the alignment computation of our threading approach. |
| One Page Abstract:
The threading approach predicts the protein structure by aligning representative protein structures with an amino-acid sequence called the target sequence whose three-dimensional backbone structure is unknown. The sequence-structure alignments obtained are then ranked by score. The best-scoring alignment should identify the template structure that is most compatible with the target sequence and thus afford a meaningful structural model. However, the problem of developing an accurate scoring function is still unsolved particularly for distantly related target and template folds. Especially, making the scoring scheme reflect diverse biological constraints seems to be a difficult task. Thus threading methods based solely on sequence information often fail. To remedy the inherent shortcomings of the scoring function, it becomes necessary to incorporate more biological knowledge on the target protein, which may be obtained from experimental data by mass spectrometry or NMR spectroscopy. These additional constraints such as atomic distances guide the threading process in order to improve the accuracy of fold recognition. Experimental results that taken alone would give insufficient data for the complete structure determination may already yield enough constraints to support the threading procedure considerably. Our recursive dynamic programming method searches for structurally correct target-template pairs in suitable sets of alternative near-optimal solutions to the alignment problem, which transcends the usual exact optimization of one biologically incomplete scoring function employed in other threading approaches. The method can incorporate biological constraints directly into the alignment computation by means of different filter algorithms. This is more efficient than weeding out wrong models from the list of already generated complete solutions. 
In this way, the method can produce biologically more meaningful models that adhere to the structural constraints that are known about the target. This approach improves the fold recognition rate as well as the alignment quality. Keywords: protein structure determination, fold recognition, protein threading, experimental constraints, mass spectrometry, NMR, NOE. Selected References: P. M. Bowers, C. E. M. Strauss, D. Baker: De novo protein structure determination using sparse NMR data. Journal of Biomolecular NMR, 18:311-318, 2000. R. Thiele, R. Zimmer, T. Lengauer: Protein Threading by Recursive Dynamic Programming. Journal of Molecular Biology, 290(3):757-779, 1999. Y. Xu, D. Xu, O. H. Crawford, J. R. Einstein: A computational method for NMR-constrained protein threading. Journal of Computational Biology, 7(3/4):449-467, 2000. M. M. Young, N. Tang, J. C. Hempel, C. M. Oshiro, E. W. Taylor, I. D. Kuntz, B. W. Gibson, G. Dollinger: High throughput protein fold identification by using experimental constraints derived from intramolecular cross-links and mass spectrometry. Proceedings of the National Academy of Sciences USA, 97(11):5802-5806, 2000.
|
|
|
| 53. Exonerate - a tool for rapid large scale comparison of cDNA and genomic sequences. (up) |
| Guy St.C. Slater, Ewan Birney, Ensembl / EMBL-EBI;
guy@ebi.ac.uk |
| Short Abstract:
Exonerate is a tool for rapid large scale comparison of cDNA with genomic DNA. The algorithm and its implementation are described in the context of related methods. Examples of application of the algorithm for large scale analyses within the Ensembl group are given. For more information, see: http://www.ebi.ac.uk/~guy/exonerate/ |
| One Page Abstract:
Exonerate is a tool for rapid large scale comparison of cDNA with genomic DNA. HSPs (high-scoring segment pairs) are seeded using an FSM (finite state machine) built from the word neighbourhoods of multiple ESTs. A bounded sparse dynamic programming algorithm is used to join HSPs to form gapped alignments. Alignments are generated between candidate HSP pairs by dynamic programming algorithms which integrate an affine gap model and PSSM based splice site prediction for intron modelling. This approach provides a good approximation of the underlying model, while remaining very fast by restricting the dynamic programming to regions likely to contain significant alignments. Examples of application of the algorithm for large scale comparison of EST data sets with the entire human genome are given. Exonerate is implemented in C, using the glib library, and is available under the terms of the LGPL. For more information, see: http://www.ebi.ac.uk/~guy/exonerate/ |
|
|
| 54. Prospector: Very Large Searches with Distributed Blast and Smith-Waterman (up) |
| Douglas Blair, Dustin Lucien, Hadon Nash, Dale Newfield, John Grefenstette,
Parabon Labs, Parabon Computation;
doug@parabon.com |
| Short Abstract:
We have created a novel implementation of BLAST and Smith-Waterman for the Parabon Frontier distributed computing platform. We present design specifics, implementation details, performance results, and sensitivity comparison for very large database searches using the Prospector versions of BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX, and the analogous versions of Smith-Waterman. |
| One Page Abstract:
Advances in sequencing technology have created an avalanche of sequence data in the public databases that continues at an astonishing rate. In the year 2000, for example, the number of nucleotides in GenBank more than doubled from 4.6 billion bases to 11.1 billion bases. Even with the completion of the human genome, data from other sequencing projects and ESTs being generated from various organisms and cell types ensure that the floodgates are not going to close anytime soon. Given that large amounts of new sequence data will continue to become available for the foreseeable future, the need to compare new data to existing data will continue to make high-performance solutions for large-scale sequence analysis a critical requirement for many projects. High-performance methods for performing large searches typically involve using specialized hardware or clusters of general-purpose machines. While providing a high level of performance, such systems are costly in terms of both hardware and support infrastructure. Distributed computing utilizing the idle cycles of existing workstations represents a software alternative to traditional high-performance computing platforms. In the Parabon Frontier distributed computing model, large jobs are decomposed into hundreds or thousands of tasks and distributed to idle machines, either over the Internet or within an organization's intranet. We have created an application called Prospector that implements both BLAST and Smith-Waterman algorithms for performing large-scale sequence comparison on the Frontier distributed computing platform. As required by the Frontier model, Prospector is written entirely in Java, providing a maximum of safety for the machines providing idle cycles. In this poster, we present design specifics, implementation details, and performance results for very large database-to-database searches using Prospector. 
Results showing the effective scalability of Prospector to searches with thousands of machines are given. Experiments are described using Prospector versions of BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX, as well as the analogous versions of Smith-Waterman. |
|
|
| 55. Accuracy Comparisons of Parallel Implementations of Smith-Waterman, BLAST and HMM Methods (up) |
| James Candlin, Stephanie Pao, Paracel, Inc.;
candlin@paracel.com |
| Short Abstract:
Using known evolutionary relationship data from SCOP, we compare the accuracy of Paracel's GeneMatcher[tm] and BlastMachine[tm] implementations of sequence search algorithms to the originals. We demonstrate that the Paracel implementations are biologically equivalent, but much faster, allowing routine use of even the most rigorous algorithms at a genomic scale. |
| One Page Abstract:
Sequence database searching is a workhorse of bioinformatic analysis, and is embodied in a variety of useful methods such as Smith-Waterman and related dynamic programming methods, BLAST, and Hidden Markov Models. Unfortunately, many of these algorithms are slow, especially at a genomic scale. With Paracel's GeneMatcher[tm] and BlastMachine[tm] systems, we have addressed this challenge by developing new hardware and software implementations of these methods that are parallelized and can be used routinely, even on genomic scale projects. It is crucial that our new implementations retain the general behavior of the original implementations of the algorithms, and that their biological accuracy is maintained. One approach to assessing this, at least for detecting protein family relationships, is to search an annotated database with a broad set of protein sequences and measure the ability of each method to find only the evolutionarily related sequences. Effective resources for this are the SCOP (Structural Classification of Proteins) database, which reliably identifies superfamily membership based on structural analysis, and the associated PDBD40 database of domains that are no more than 40% identical to one another, which may be used as in Brenner et al. (1) to test the ability of algorithms to find related sequences. Using the same general approach, we compare the Paracel implementations of BLAST, Smith-Waterman and HMM to the original software methods. The testing set consisted of all sequences in the SCOP PDBD40 database. The test was an all-against-all comparison of all described domain sequences against a database of the same sequences using all the pairwise and multiple-sequence-derived algorithms. 
We assess the results of the comparisons at various false positive levels and describe the number of detected homologies with each similarity searching method. Our criteria for homology are based on the superfamily classifications within the SCOP database. We demonstrate that our implementations are biologically equivalent to the original algorithms, but convey the advantages of far greater speed and throughput required for genomic analysis. The comparison also demonstrates the greater accuracy of the slower but more rigorous methods, and that even these are fast enough in our implementation for routine use. (1) Brenner, S.E., Chothia, C. and Hubbard, T., 'Assessment of sequence comparison methods with reliable structurally identified distant evolutionary relationships', Proc. Natl. Acad. Sci. USA 95, 6073-6078 (1998). |
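The evaluation described in this abstract (counting detected homologies at various false positive levels) can be sketched roughly as follows. This is only an illustrative sketch: the function name `hits_at_error_levels`, the `(score, is_homolog)` input format and the toy scores are assumptions made for the example, not part of the Paracel test harness. Here `is_homolog` would be derived from shared SCOP superfamily membership.

```python
def hits_at_error_levels(scored_pairs, error_levels):
    """Count true homologs found before more than `level` false positives occur.

    scored_pairs: iterable of (score, is_homolog); a higher score means a
    more significant hit (hypothetical input format).
    error_levels: iterable of allowed false positive counts.
    Returns a dict mapping each level to the number of homologs detected.
    """
    # Rank all query/target pairs from most to least significant.
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    results = {}
    for level in error_levels:
        true_hits = 0
        false_hits = 0
        for _score, is_homolog in ranked:
            if is_homolog:
                true_hits += 1
            else:
                false_hits += 1
                if false_hits > level:
                    break  # allowed number of errors exceeded
        results[level] = true_hits
    return results


if __name__ == "__main__":
    toy = [(9.0, True), (8.0, True), (7.0, False),
           (6.0, True), (5.0, False), (4.0, True)]
    print(hits_at_error_levels(toy, [0, 1, 2]))
```

Running the same function on the ranked output of two implementations of one algorithm gives a simple way to check whether they detect the same number of homologs at each error level.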
|
|
| 56. Target-BLAST: an algorithm to find genes unique to pathogens causing similar human diseases (up) |
| Joanna Fueyo, Department of Pharmacology, University of Pennsylvania
School of Medicine;
Jonathan Crabtree, Computational Biology and Informatics Laboratory, University of Pennsylvania; Jeffrey N. Weiser, Department of Microbiology and Pediatrics, University of Pennsylvania School of Medicine; fueyo@mail.med.upenn.edu |
| Short Abstract:
There is an imminent need for novel data mining algorithms for DNA and protein sequences, especially for discovering which genes may be important in the establishment of infectious disease in humans. Here we present a novel algorithm that is used to find such genes using a fully computational approach. |
| One Page Abstract:
Motivation: This paper describes a novel algorithm, Target-BLAST,
designed for the discovery of genes unique to pathogens that share common
features. Bacteria that share common features, such as the ability to cause
a similar set of human infections, may share genes in common that facilitate
colonization of the host or establishment of infection. The organisms that
reside in the human respiratory tract are the most common cause of antibiotic
usage for infectious disease. In a search for genes unique to respiratory
pathogens, we designed an algorithm to find genes that are unique to the
organisms that colonize the respiratory tract, since these bacteria may
have sequences in common that may be useful as novel, alternative drug
targets for respiratory infections. The algorithm we designed for this
purpose, Target-BLAST, uses Position-Specific-Iterated BLAST (PSI-BLAST)
and its predecessor Basic Local Alignment Search Tool (BLAST) to discover
homologs in microorganisms that share common features, such as the ability
to reside in or cause disease in the same host environment. Results: We
illustrate the use of Target-BLAST to search sequences obtained from the
public databases to find genes unique to a group of pathogens that cause
pneumonia, and discovered a set of 12 genes unique to the bacteria that
reside in the respiratory tract. The sensitivity of Target-BLAST to distant
relationships is gained through the iterative use of BLAST followed by
PSI-BLAST. This approach facilitates the comparison and extraction of information
from partial, unannotated DNA sequences as well as from complete, annotated
genomes. Target-BLAST is considerably faster than existing genome analysis
tools, and it permits one to find genes conserved in both whole and partial,
unannotated genomic data.
|
|
|
| 57. Software to predict microArray hybridisation: ACROSS (up) |
| Antoine Janssen, Keygene N.V.;
Jurgen Pletinckx, Algonomics N.V.; Jan van Oeveren, Martin Reijans, René Hogers, Keygene N.V.; Philippe Stas, Algonomics N.V.; Michiel van Eijk, Keygene N.V.; Ignace Lasters, Algonomics N.V.; René van Schaik, Keygene N.V.; aj@keygene.com |
| Short Abstract:
High-quality micro-array data can only be obtained when non-specific or cross hybridization is excluded or at least minimized. We developed a new software tool ACROSS to predict hybridization based on sequence alignments and hence to assist in the optimal design of microarray probes. |
| One Page Abstract:
Software to Predict MicroArray Hybridization: ACROSS Jurgen Pletinckx(2), Antoine Janssen(1), Jan van Oeveren(1), Martin Reijans(1), René Hogers(1), Philippe Stas(2), Michiel van Eijk(1), Ignace Lasters(2), René van Schaik(1) (1)Keygene N.V., PO Box 216, 6700 AE, Wageningen, The Netherlands, info@keygene.com (2) Algonomics N.V., Technologiepark 4, B 9052 Ghent, Belgium, info@algonomics.com Keywords: expression arrays, sequence alignment, cross hybridization, hybridization prediction Abstract Microarrays can be used to detect AFLP® fragments for genotyping or expression analysis by hybridization with labeled AFLP reactions or cDNA, respectively. High-quality data can only be obtained from microarrays when non-specific or cross hybridization is excluded or at least minimized. We developed a new software tool ACROSS to predict hybridization and hence to assist in the optimal design of microarray probes. The ACROSS software takes as input the sequence information of DNA-fragments or oligos in either one or two sets. Homology analysis within one set or between the two sets of sequences is performed to predict fragment hybridization. These predictions are based on quantitative analysis of hybridization data from model experiments with known sequences obtained under realistic experimental conditions. A test set of oligonucleotides, derived from 2 DNA fragments A and B, is used to create series of complementary sequences with increasing sequence homology and hybridized with the corresponding full length fragments as targets. Hybridization results are presented and the characteristics of ACROSS are discussed. AFLP® is a registered trademark of Keygene N.V. |
|
|
| 58. Discovering Dyad Signals (up) |
| Eleazar Eskin, Columbia University;
Pavel Pevzner, UCSD; eeskin@cs.columbia.edu |
| Short Abstract:
Dyad signals in DNA sequences consist of two signals that occur a fixed distance apart. Because each component signal may not be statistically significant on its own, we perform an exhaustive search using a pattern driven approach to discover these signals. We present two extensions to pattern driven approaches to improve efficiency. |
| One Page Abstract:
Signal finding is a fundamental and well studied problem in computational biology. The problem consists of discovering patterns in unaligned sequences. In the context of DNA sequences, the patterns can correspond to regulatory sites which can be used for drug discovery. Current approaches to discovering signals focus on monad signals. These signals are typically short contiguous sequences of a certain length which occur in a statistically significant way in the unaligned sequences with a certain number of mismatches. This statistical significance is obtained using a scoring function of the candidate signal. However, several actual regulatory sites are in fact dyad signals, i.e. two signals that occur a fixed distance apart. In some cases the signals are palindromes of each other. A difficulty in discovering dyad signals is that each component monad signal in the dyad may not be statistically significant, making the dyad signal difficult to find using traditional methods. There have been many approaches presented to discover monad signals. Among the best performing are MEME, CONSENSUS, Gibbs-sampler, random projections and combinatorial based approaches (see references at poster). All of these approaches aim to discover the highest scoring signals. When applied to discovering dyad signals, the methods may have problems in the case where each of the pair of monad signals is not statistically significant on its own. In this project we present an algorithm for discovering dyad signals. The algorithm first performs an exhaustive search over potential monad signals in the data and then examines several thousand of the highest scoring monads and looks for dyad signals by checking each pair of signals and determining whether they occur a fixed distance apart. Clearly the vast majority of the monad signals examined are not statistically significant on their own. The computational bottleneck for this approach is the exhaustive search of candidate patterns. 
The simplest way to perform this search is to use a pattern driven approach. A pattern driven approach simply checks all possible patterns against the data. Unfortunately, for long patterns this can be computationally expensive. In this project, we present two efficient extensions to the pattern driven approach: the emulated pattern driven approach and sparse suffix trees. The first extension examines only the sequences relevant to the data and stores the candidate signals in a hash table. Although the algorithm is significantly faster in some practical cases, it requires a lot of memory. The second extension presents efficient data structures for storing the candidate signals which reduce the amount of memory used. The dyad signals are discovered by examining several thousand of the highest scoring monad signals. For each pair of these signals, we examine the sequence positions where the patterns occurred. For patterns that occur on the same sequence we compute the distance between the patterns. If the pair of patterns consistently occurs at a certain fixed distance, then the pair of patterns is a dyad signal. We present several sets of experiments over biological and synthetic data. We first present a set of experiments evaluating the monad signal finding methods. Although we present the signal finding methods in order to discover dyad signals, they in fact perform well at finding monad signals. We also present a set of experiments over dyad samples. We perform experiments over synthetic data. This data consists of an i.i.d. sequence where a dyad signal was inserted. We also perform experiments over the biological sample and automatically recover the dyad signal discovered in the sample. |
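The pairing step described in this abstract (take the top-scoring monads, look at where they occur, and keep pairs that sit a fixed distance apart) might look like the sketch below. It is a simplified illustration under stated assumptions: matching is exact (the abstract allows mismatches), distance is measured start to start, "consistently" is approximated by a hypothetical `min_support` count of sequences, and the function names are invented for the example.

```python
from collections import defaultdict
from itertools import permutations


def find_occurrences(sequences, patterns):
    """Return {pattern: [(sequence index, start position), ...]} for exact matches."""
    occurrences = {}
    for pattern in patterns:
        hits = []
        for seq_id, seq in enumerate(sequences):
            start = seq.find(pattern)
            while start != -1:
                hits.append((seq_id, start))
                start = seq.find(pattern, start + 1)
        occurrences[pattern] = hits
    return occurrences


def find_dyads(occurrences, min_support):
    """Report (left, right, distance, support) for pairs at a fixed distance.

    support is the number of distinct sequences in which `right` starts
    exactly `distance` positions after `left` starts.
    """
    dyads = []
    for left, right in permutations(occurrences, 2):
        right_by_seq = defaultdict(list)
        for seq_id, pos in occurrences[right]:
            right_by_seq[seq_id].append(pos)
        support = defaultdict(set)  # distance -> sequences showing it
        for seq_id, left_pos in occurrences[left]:
            for right_pos in right_by_seq.get(seq_id, ()):
                distance = right_pos - left_pos
                if distance > 0:
                    support[distance].add(seq_id)
        for distance, seq_ids in support.items():
            if len(seq_ids) >= min_support:
                dyads.append((left, right, distance, len(seq_ids)))
    return dyads


if __name__ == "__main__":
    seqs = ["AAACGTTTTTGGCAAA", "CCACGTTTTTGGCCC", "ACGTAAAAAGGCTT"]
    occ = find_occurrences(seqs, ["ACGT", "GGC"])
    print(find_dyads(occ, min_support=2))
```

In the toy data the two monads are 8 positions apart in two sequences and 9 apart in the third, so only the distance-8 pairing reaches the support threshold.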
|
|
| 59. Efficient all-against-all computation of melting temperatures for dna chip design (up) |
| Lars Kaderali, Alexander Schliep, University of Cologne;
kaderali@zpr.uni-koeln.de |
| Short Abstract:
Determining target specific probes for DNA chips requires the computation of melting temperatures for all pairs of probes and targets. We present a fast algorithm based on an extended nearest neighbor model. The algorithm combines suffix trees and alignment computations. Also, a framework is presented to suggest actual probes. |
| One Page Abstract:
The problem of determining target specific probes for DNA chips requires the computation of melting temperatures for the duplexes formed between all target sequences and all probe candidates. The problem is further complicated by mismatches and possibly unpaired bases within the duplex. For complexity reasons, traditional computer programs restrict themselves to mere string comparisons to ensure the specificity of chip probes. We present an efficient algorithm to solve this problem. The thermodynamic calculations are based on an extended nearest neighbor model, allowing for both mismatches and unpaired bases within a duplex. We introduce a new thermodynamic alignment algorithm to efficiently calculate melting temperatures. This algorithm is combined with modified generalized suffix trees to speed up the computation. The algorithm is the core of a software framework to suggest actual probes to be used for DNA chip experiments, given the target sequences as input. |
|
|
| 60. Wavelet techniques for detecting large scale similarities between long DNA sequences (up) |
| Frederic Guyon, Serge Hazout, INSERM U436, Universite Paris 7;
guyon@urbb.jussieu.fr |
| Short Abstract:
We have developed an efficient and fast algorithm to detect large scale similarities between long sequences based on Fast Wavelet Transforms. The FWT algorithm allows one to compute dotplots at different scale levels and to zoom in and out on the regions of interest within a dotplot. |
| One Page Abstract:
As complete genomic sequences become available, new methods to tackle very large DNA sequences are required. Some very large scale sequence duplications inside or between genomes have been identified. These large scale genomic duplications yield precious information for genome annotation and genome evolution. Yet standard algorithms devised for the comparison of very long sequences are time and space consuming, and aligning whole genomes or chromosomes is at best a very difficult computational task [5], or is limited to closely related sequences [1]. To tackle this problem, we have applied the discrete wavelet transform to the computation and visualization of dotplots. The dotplot is a well established technique for comparing two sequences [3, 4]: it is an image in which dots correspond to matches between the nucleotides or amino acids of the two sequences. Two difficulties arise when computing dotplots: the computer time and the memory space needed are both proportional to the product of the lengths of the two sequences. Our method reduces the size of the dotplot by computing coarse dotplots. For this purpose, DNA sequences are transformed into indicator signals. These signals are decomposed using a fast wavelet decomposition technique [2]. Finally, coarse dotplots are computed using the coarse level representation of the indicator signals. The low dimensionality of the coarse level signals makes fast computation of the coarse dot matrix possible. Moreover, the fast multiresolution analysis [2] provides an efficient algorithm for zooming in and out of the dotplot image and makes it possible to navigate quickly inside the dotplot from coarse to fine levels. The detection sensitivity and specificity depend on the scale level. At coarse levels, a region of similarity is detectable only if it is sufficiently large. The low scale dotplot can then reveal regions with a very low level of similarity. 
At fine levels, regions of similarity are more accurately localized with higher specificity. References [1] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Res, 1999. [2] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. on Patt. Anal. and Mach. Intell., 11:674-693, 1989. [3] J. Pustell and F. Kafatos. A high speed, high capacity homology matrix: Zooming through SV40 and polyoma. Nucleic Acids Research, 10(15):4765-4782, 1982. [4] E.L. Sonnhammer and R. Durbin. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene, 1996. [5] M. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall, New York, 1995. |
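The pipeline in this abstract (indicator signals, wavelet decomposition, coarse dotplot) can be illustrated with a minimal sketch. It assumes the Haar wavelet with an averaging normalisation, which the abstract does not specify, and all function names are invented for the example. With that choice, each coarse cell equals the mean of the full-resolution dotplot over the corresponding block of positions.

```python
def indicator_signals(seq, alphabet="ACGT"):
    """One 0/1 signal per letter: 1.0 where the sequence carries that letter."""
    return {b: [1.0 if c == b else 0.0 for c in seq] for b in alphabet}


def haar_approximation(signal, levels):
    """Coarse (low-pass) Haar coefficients, halving the length at each level.

    Uses pairwise averaging (an assumed normalisation). Odd-length signals
    are zero padded, which slightly lowers the last coefficient.
    """
    current = list(signal)
    for _ in range(levels):
        if len(current) % 2:
            current.append(0.0)
        current = [(current[i] + current[i + 1]) / 2.0
                   for i in range(0, len(current), 2)]
    return current


def coarse_dotplot(seq_x, seq_y, levels, alphabet="ACGT"):
    """Coarse dot matrix: sum over letters of products of coarse signals."""
    sig_x = {b: haar_approximation(s, levels)
             for b, s in indicator_signals(seq_x, alphabet).items()}
    sig_y = {b: haar_approximation(s, levels)
             for b, s in indicator_signals(seq_y, alphabet).items()}
    rows = len(sig_x[alphabet[0]])
    cols = len(sig_y[alphabet[0]])
    return [[sum(sig_x[b][i] * sig_y[b][j] for b in alphabet)
             for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    for row in coarse_dotplot("AAAACCCC", "AAAAGGGG", levels=2):
        print(row)
```

The coarse matrix has (n / 2^levels) x (m / 2^levels) cells, so its cost shrinks by a factor of 4 per level compared with the full n x m dotplot, which is the saving the abstract relies on.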
|
|
| 61. TRAP: Tandem Repeat Assembly Program, capable of correctly assembling nearly identical repeats (up) |
| Martti T. Tammi, Erik Arner, Dept. of Genetics and Pathology, Uppsala
University;
Tom Britton, Dept. of Mathematics, Uppsala University; Daniel Nilsson, Björn Andersson, Dept. of Genetics and Pathology, Uppsala University; martti.tammi@genpat.uu.se |
| Short Abstract:
We present a method to separate almost identical repeats, and a rigorous study of combinatorial and statistical properties of sequencing errors and real differences between repeats in a fragment assembly framework. We also show that it is possible to make assemblies with a pre-defined probability of error using this method. |
| One Page Abstract:
The software commonly used for assembly of shotgun sequence data has several limitations. This is especially true when repetitive sequences are encountered. Shotgun assembly is a difficult task, even for non-repetitive regions, but the use of quality assessment of the data and efficient matching algorithms have made it possible to assemble most sequences efficiently. In the case of highly repetitive sequences, however, these algorithms fail to distinguish between sequencing errors and single base differences in regions containing nearly identical repeats. We present the TRAP method to detect subtle differences between repeat copies, and show results from a rigorous study of combinatorial and statistical properties of sequencing errors and real differences between repeats in a fragment assembly framework. We also show that it is possible to make assemblies with a pre-defined probability of error using this method. The key step in the TRAP method is the construction of multi-alignments consisting of all defined overlaps for a region, followed by detection of pairs of columns containing coinciding deviations from column consensus. For each pair, the probability of observing such a pair by chance is computed. If the probability exceeds a threshold, the pair is rejected. The remaining errors are trapped in a clustering process that follows. In a data set containing repeat regions with 1% randomly distributed differences, and a sequencing error up to 11%, demanding a coverage of at least three sequence reads, we could detect 65% of the differences with an error of 0.5%. This is not the final error, since most of these errors can be trapped in the clustering stage. The data set consisted of 108 simulated assemblies with varying repeat length and copy numbers. The TRAP software package is implemented in four separate program modules, each performing a distinct task: 1. The initial screening for contamination and determination of read quality. 2. 
The computation of overlaps. 3. Repeat elements analysis and fragment layout generation. 4. The generation of the consensus sequence and a status report. TRAP has been shown to work by assembling both real and simulated shotgun data. We have simulated a shotgun project containing eight 1789 base pairs long repeats in tandem. The difference between repeat units was 1.0% and the simulated sequencing error 2.71%. TRAP was capable of correctly assembling all 367 sequence reads, while PHRAP could not resolve any of the repeat elements correctly and assembled the sequence reads belonging to the different repeats almost randomly, which made it impossible to extract the correct consensus sequence from any of the repeat elements. A simulation of a BAC project of size 140 kb containing eight 1789 base pairs long tandem repeats gave similar results, with only 1.6% of the reads wrongly placed, the difference between repeat copies being 1.0% and the average sequencing error 7.2%. The erroneously placed sequence reads did not affect the consensus sequence. The main features of TRAP are the ability to separate long repeating regions from each other by distinguishing single base substitutions as well as insertions/deletions from errors, and the ability to pinpoint regions, where additional sequencing is needed to efficiently close the remaining gaps. Since repeats are common in most sequencing projects, this software should be of use for the sequencing community. |
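The key step named in this abstract, finding pairs of alignment columns with coinciding deviations from the column consensus, can be sketched as below. This is not the TRAP implementation: the abstract's probability test is replaced by a simple hypothetical `min_shared` count of reads, the multi-alignment is assumed to be a list of equal-length strings with `.` marking positions a read does not cover, and the function names are invented.

```python
from collections import Counter
from itertools import combinations


def column_deviations(alignment, uncovered="."):
    """For each column, return the set of read indices that differ from consensus."""
    deviations = []
    for col in range(len(alignment[0])):
        bases = [row[col] for row in alignment if row[col] != uncovered]
        if not bases:
            deviations.append(set())
            continue
        # Column consensus taken as the most frequent base among covering reads.
        consensus = Counter(bases).most_common(1)[0][0]
        deviations.append({r for r, row in enumerate(alignment)
                           if row[col] != uncovered and row[col] != consensus})
    return deviations


def coinciding_pairs(alignment, min_shared=2):
    """Column pairs in which at least `min_shared` reads deviate in both columns."""
    dev = column_deviations(alignment)
    pairs = []
    for i, j in combinations(range(len(dev)), 2):
        shared = dev[i] & dev[j]
        if len(shared) >= min_shared:
            pairs.append((i, j, sorted(shared)))
    return pairs


if __name__ == "__main__":
    reads = [
        "ACGTACGT",
        "ACGTACGT",
        "ACGTACGT",
        "ATGTACCT",  # differs at columns 1 and 6
        "ATGTACCT",  # same two differences: likely another repeat copy
        "ACGAACGT",  # isolated difference at column 3: likely a sequencing error
    ]
    print(coinciding_pairs(reads))
```

The intuition is that a sequencing error rarely recurs in the same reads at two different columns, whereas a real difference between repeat copies does, which is why coinciding deviations are informative.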
|
|
| 62. LASSAP: A powerful and flexible tool for (large-scale) sequence comparisons (up) |
| Heus, H.C., Glémet, E., Raffinot, M., Metayer, K., Chambre,
P., Codani, J-J., Gene-IT;
Raffinot, M., Laboratoire Génome et Informatique, CNRS, France.; heus@gene-it.com |
| Short Abstract:
Existing sequence comparison tools are not designed to handle large-scale comparisons efficiently. LASSAP software is designed to compare entire databases of sequences, integrating an adequate set of algorithms that run on multiple processors. It allows extensive result management: databases and results can be queried during all steps of a workflow. |
| One Page Abstract:
There is a large collection of good sequence comparison tools available. However, most of these tools have not been designed to handle results of large-scale comparisons efficiently. To answer complex questions one has to depend on tricks and scripts tailored to each tool to parse and interpret the output. This makes it difficult to redo an experiment with different parameters or another algorithm. Often, bioinformaticians spend a substantial amount of their `quality time' trying to solve trivial problems over and over again. Therefore, Gene-IT has developed Lassap, a large-scale sequence comparison tool that is powerful and extremely flexible. Lassap has been used to solve everyday bioinformatics problems that occur in the academic world and in industry. Some examples of its use are the annotation of sequences on a micro-array, cross-species genomic comparisons (GENOSCOPE), making non-redundant databases (Swissprot / TrEMBL) and implementing gene family workflows (INTERPRO). All this can be done with a minimal investment of time and resources. The key specifications of Lassap are: ·The package is built around the comparison of entire databases of sequences. ·A complete set of algorithms is available (optimised for speed and minimal use of system resources): BLAST, Smith/Waterman, Needleman/Wunsch, string matching, pattern matching, enhanced versions of known algorithms and smart combinations of algorithms. ·After sequence comparison, similar sequences can be clustered. ·There are extensive database and result management options. Sequence databases, comparison results and clusters can be sorted, queried and filtered during all steps of the workflow. The sequence and its annotation are handled together. Furthermore, all data structures can be virtually concatenated. In this way it is easy to handle updates of sequence databases, comparison results and clusters. 
·The Unix command line interface makes it very easy to incorporate Lassap into an existing bioinformatics infrastructure and to automate complex workflows. ·Lassap is very flexible, e.g. changing an algorithm is as easy as changing a single command line option. ·Lassap is a professional tool. It is written in the C programming language and is available for almost all Unix systems (single and multi-threaded) and Linux clusters as well. ·For more information visit our web site: http://www.gene-it.com |
|
|
| 63. Mini-greedy algorithm for multiple RNA structural alignments (up) |
| J. Gorodkin, Bioinformatics Research Center, University of Aarhus,
Denmark;
R. B. Lyngsoe, Department of Computer science, University of California at Santa Cruz, USA; G. D. Stormo, Department of Genetics, Washington University Medical School, USA; gorodkin@bioinf.au.dk |
| Short Abstract:
A mini-greedy algorithm performs the greedy step on only a core set of sequences; the remaining sequences are then aligned with the core in turn, based on a ranking of the sequences. This algorithm results in a significant speed-up of the FOLDALIGN approach for multiple structural alignment of RNA sequences. |
| One Page Abstract:
The problem with greedy algorithms is that a large fraction of the sequences in a data set is subject to many redundant computations, thus making greedy algorithms expensive. Here we present an approach, which we call mini-greedy, that first isolates a small core of suitable sequences on which to apply the greedy algorithm. Then the remaining sequences are aligned onto the resulting core alignment in an order determined by a ranking scheme. The ranking scheme is based on the pairwise scores among all sequences. In contrast to algorithms such as CLUSTAL, where the multiple alignment is built from individual clusters of sequences, our algorithm seeks to find a set of core sequences that have as much in common with as many sequences as possible in the data set. We apply this algorithm to multiple structural alignments of RNA sequences, through FOLDALIGN, and show that the mini-greedy algorithm can reduce the computational time significantly. Different schemes are compared. The mini-greedy algorithm has been implemented as a part of the Stem-Loop Align SearcH server at http://www.bioinf.au.dk/slash/. |
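One possible reading of the core selection and ranking step is sketched below. The abstract says only that the ranking is based on pairwise scores and that different schemes are compared, so the particular scheme here (rank each sequence by the sum of its pairwise scores, take the top `core_size` as the core) is an assumption made for illustration, as are the function names and the toy scores.

```python
def rank_sequences(pair_scores, names):
    """Order sequence names by their summed pairwise score, best first.

    pair_scores: {(name_a, name_b): score} with one entry per unordered pair
    (hypothetical input format; scores would come from pairwise alignment).
    """
    totals = {name: 0.0 for name in names}
    for (a, b), score in pair_scores.items():
        totals[a] += score
        totals[b] += score
    return sorted(names, key=lambda name: totals[name], reverse=True)


def split_core(pair_scores, names, core_size):
    """Return (core, remainder): the core gets the expensive greedy step,
    the remainder is added to the core alignment in ranked order."""
    ranked = rank_sequences(pair_scores, names)
    return ranked[:core_size], ranked[core_size:]


if __name__ == "__main__":
    names = ["s1", "s2", "s3", "s4"]
    scores = {("s1", "s2"): 10, ("s1", "s3"): 8, ("s1", "s4"): 4,
              ("s2", "s3"): 7, ("s2", "s4"): 1, ("s3", "s4"): 2}
    core, rest = split_core(scores, names, core_size=2)
    print("core:", core, "then add:", rest)
```

The saving comes from restricting the greedy step to `core_size` sequences and adding each remaining sequence with a single alignment against the core.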
|
|
| 64. Neural network and genetic algorithm identification of coupling specificity and functional residues in G protein-coupled receptors (up) |
| Anthony Bucci, Jason M. Johnson, Pfizer Discovery Technology Center;
abucci@brandeis.edu |
| Short Abstract:
We demonstrate that artificial neural networks are effective in automated prediction of G-coupling function, and when combined with a genetic algorithm, can identify residues in GPCR sequences that impact G protein coupling specificity. The methods we describe are general and can be applied to other classification problems in molecular biology. |
| One Page Abstract:
G protein-coupled receptors (GPCRs) form a large family of human cell membrane signaling proteins with seven transmembrane segments, transmitting signals across the cell membrane in response to a wide variety of endogenous ligands. GPCRs are attractive drug targets because they are amenable to small molecule intervention and play critical roles in many human diseases. The molecular and cellular functions of many of these proteins are currently unknown. Here, we demonstrate the utility of artificial neural networks (ANNs) to the problem of predicting the G protein-coupling specificity of GPCRs, the key determinant of downstream signaling function. Using a set of ~100 GPCRs with known G-protein specificity, we conducted a cross-validation study comparing performance of ANNs and homology-based classifiers on the G-coupling prediction task. Our results show that ANNs, given access to only a 20-residue window of the GPCR primary sequence, perform as well as BLAST or a nearest-neighbor classifier given access to the full-length sequence. Building on this result, we used a genetic algorithm (GA) to discover a set of GPCR sequence positions that allowed ANNs to outperform both BLAST and nearest-neighbor classifiers in a leave-one-out cross-validation test. These residue positions reveal regions of GPCR structure likely to be involved in G-protein coupling and discrimination among G-protein subtypes. We conclude that artificial neural networks are effective in automated prediction of G-coupling function, and when combined with a GA, can identify specific residues in GPCR sequences that are important for G-protein coupling. The ANN and GA methods we describe are general and can be applied to other classification and function-prediction problems in molecular biology. |
|
|
| 65. Determination of Classificatory Motifs for the Identification of Organism Using DNA Chips (up) |
| Uta Bohnebeck, Tom Wetjen, University of Bremen, Center for Computing
Technologies (TZI);
Denja Drutschmann, University of Bremen, Center for Environmental Research and Technology (UFT); bohnebec@tzi.de |
| Short Abstract:
A procedure for constructing highly sensitive and specific oligo-nucleotides for the identification of organisms is presented, using sequences of hepatitis C virus as an example. We show that common motifs detected in randomly sampled subsets generalize, with high probability, to the whole population. |
| One Page Abstract:
A procedure for constructing highly sensitive and highly specific oligo-nucleotides for the identification of organisms, e.g. in environmental samples, will be presented. The problem of achieving a high sensitivity first may be considered as a conservation learning task [2], i.e. common motifs (representing conserved regions) detected in randomly sampled subsets can be generalized to be also present in the population with high probability. The algorithm for the first step of the procedure calculates all non-redundant common motifs which meet constraints, i.e. motif length, permitted mismatches, and sample coverage. The common motifs are determined with the aid of a generalized suffix tree using a pattern-driven approach [5]. Each resulting motif corresponds to a set of associated sequences - the potential oligo-nucleotides of the DNA chip - which are concrete substrings of the given sequence set and which represent the variability of the population. For the second step of the procedure, a Blast search in the public nucleotide sequence databases is carried out in order to prove high specificity. Further optimization steps according to hybridization conditions have to be carried out [1]. In the experiments carried out, 171 complete genome sequences of the hepatitis C virus (HCV) were used. A relative arrangement of these sequences based on the idea of maximal unique matches [3] was performed. The result confirmed the observation of [4], i.e. that only the 5'UTR is sufficiently conserved in order to determine common motifs which can be used to identify HCV in general. Therefore, the motif determination was only executed on the 5'UTR of these 171 sequences by a 10-fold-cross-validation using randomly sampled subsets of 137 (80%) sequences. Allowing one mismatch, on the average each subset produced a result set containing five common motifs with a length between 39 and 68 bp. 
These sets of motifs were not 100% identical, since a motif of one subset was enlarged in another subset, or motifs were concatenated to one large motif. However, by merging the result sets a non-redundant set of maximal shared motifs could be extracted. Using the associated sets of oligo-nucleotides a 100% coverage of the population was obtained. Typically, one of the oligo-nucleotides belonging to one motif showed approximately 95% coverage while the others only represented single variations containing one mismatch. [1] U. Bohnebeck, M. Nölte, T. Schäfer, M. Sirava, T. Waschulzik, G. Volkmann. An Approach to the Determination of Optimized Oligonucleotide Sets for DNA Chips, In: Proceedings of ISMB'99, Poster and Demonstrations, Heidelberg, 1999. [2] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to Automatic Discovery of Patterns in Biosequences, Journal of Computational Biology, 5:277-303, 1998. [3] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes, Nucleic Acids Research, 27(11):2369-2376, 1999. [4] J.H. Han, V. Shyamala, K.H. Richman, M.J. Brauer, B. Irvine, M.S. Urdea, R. Tekamp-Olson, G. Kuo, Q.-L. Choo, and M. Houghton. Characterization of the terminal regions of hepatitis C viral RNA: Identification of conserved sequences in the 5' untranslated region and poly(A) tails at the 3' end, Proc. Natl. Acad. Sci. USA, 88(5):1711-1715, 1991. [5] M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree, In: Latin'98, volume 1380 of LNCS, pages |
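The three constraints named in the abstract (motif length, permitted mismatches, sample coverage) can be illustrated with a brute-force sketch. It is not the suffix-tree algorithm of [5]: for brevity it only tests candidate motifs that occur literally in at least one input sequence and scans every sequence for each candidate. The function names and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, seq, max_mismatches):
    """True if the motif matches some window of seq within the mismatch limit."""
    k = len(motif)
    return any(
        hamming(motif, seq[i:i + k]) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def common_motifs(seqs, length, max_mismatches=1, min_coverage=1.0):
    """Return {motif: coverage} for motifs meeting the coverage constraint.

    Candidates are all substrings of the given length found in the sample.
    Coverage is the fraction of sequences containing the motif with at most
    max_mismatches mismatches.
    """
    candidates = {
        s[i:i + length] for s in seqs for i in range(len(s) - length + 1)
    }
    result = {}
    for motif in sorted(candidates):
        coverage = sum(occurs(motif, s, max_mismatches) for s in seqs) / len(seqs)
        if coverage >= min_coverage:
            result[motif] = coverage
    return result


# Toy sample (invented): only GATTACA is within one mismatch of all three.
print(common_motifs(["GATTACA", "GATTTCA", "CATTACA"], length=7))
```

Running the same function on several random subsets and merging the results would mimic the cross-validation setup described above, at a much higher cost than a suffix-tree implementation.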
|
|
| 66. An algorithm for detecting similar reaction patterns between metabolic pathways (up) |
| Yukako Tohsato, Ryutaro Saijo, Takao Amihama, Hideo Matsuda, Akihiro
Hashimoto, Department of Informatics and Mathematical Science, Osaka
University;
yukako@ics.es.osaka-u.ac.jp |
| Short Abstract:
We have developed a method for detecting similar reaction patterns between metabolic pathways. Given two trees representing pathways, our method extracts similar subtrees between them using a subtree-matching algorithm. We have found a similar reaction pattern between the glycol metabolism and degradation pathway and the fucose and rhamnose catabolism pathway. |
| One Page Abstract:
Metabolic pathways are recognized as one of the most important classes of biological networks. Comparative analyses of pathways give important information on their evolution and on pharmacological targets. In this poster, we present a method for comparing pathways based on the similarity of reactions catalyzed by enzymes. In our approach, reaction similarity is formulated as a scoring scheme based on the information content of the occurrence probabilities of EC numbers. For example, a match between a pair of similar EC numbers, say 1.1.1.1 and 1.1.1.2, occurs with a very low probability P among randomly chosen pairs, so we assign the match a high score, - log P. Using this scoring scheme, one can align two pathways if they can be represented as strings of EC numbers. However, metabolic pathways often include branching structures, and such pathways cannot be compared directly with string-matching algorithms. To cope with this issue, we have developed a method for comparing pathways that include branching structures. In our method, pathways are represented as trees and compared by a subtree-matching algorithm. The subtree-matching problem is in general NP-hard, so we have developed a greedy algorithm for the comparison. The effectiveness of our method is demonstrated by applying it to metabolic pathways in Escherichia coli, which are reconstructed from the metabolic maps of the EcoCyc database. As a result, we have found a similar reaction pattern ([2.7.1.31] - [2.7.1.51], [4.1.1.47] - [4.1.2.17], [1.2.1.21] - [1.2.1.22] and [1.1.1.37] - [1.1.1.77]) between the glycol metabolism and degradation pathway and the fucose and rhamnose catabolism pathway. |
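One way to read the - log P scoring is sketched below. It assumes, as an interpretation rather than the authors' exact definition, that P for a given depth is the fraction of randomly chosen EC-number pairs that agree on at least that many leading fields. The function names and the small EC list are illustrative only.

```python
import math
from itertools import combinations


def shared_levels(ec_a, ec_b):
    """Count how many leading EC fields two EC numbers have in common."""
    n = 0
    for x, y in zip(ec_a.split("."), ec_b.split(".")):
        if x != y:
            break
        n += 1
    return n


def match_scores(ec_numbers):
    """Return {level: -log P} for matches on the first `level` EC fields.

    P is estimated as the fraction of all unordered pairs in ec_numbers that
    share at least `level` leading fields. Levels never observed are omitted.
    """
    pairs = list(combinations(ec_numbers, 2))
    scores = {}
    for level in range(1, 5):
        hits = sum(shared_levels(a, b) >= level for a, b in pairs)
        if hits:
            scores[level] = -math.log(hits / len(pairs))
    return scores


# Toy background set (invented selection of EC numbers).
background = ["1.1.1.1", "1.1.1.2", "1.2.1.21", "2.7.1.31", "4.1.2.17"]
print(match_scores(background))
```

Rarer, deeper agreements receive higher scores, which is the property the alignment of EC-number strings or trees relies on.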
|
|
| 67. SPP-1 : A Program for the Construction of a Promoter Model From a Set of Homologous Sequences (up) |
| Ulrike Goebel, Thomas Wiehe, Thomas Mitchell-Olds, Max-Planck Institute
of Chemical Ecology;
goebel@stargate.ice.mpg.de |
| Short Abstract:
We present a new algorithm which constructs a PROMOTER MODEL from a set of unaligned homologous, coregulated polII promoters. It employs a comparative approach, which in addition to sequence similarity can also take into account DNA structural similarity. In a second phase, higher order modules of conserved motifs are found. |
| One Page Abstract:
We present a new algorithm which constructs a PROMOTER MODEL from a set of unaligned homologous, coregulated polII promoters. It rests on the following assumptions: DNA contact points of individual members of the transcription initiation complex are constrained in their ability to tolerate mutations and thus stand out as short (6-10 bp) conserved motifs. The arrangement of the proteins in the initiation complex is reflected by a hierarchical arrangement of the binding sites on the DNA, and it is this pattern which really identifies the promoter. It, too, should be at least in part conserved in members of a family of promoters which are known to confer the same expression pattern. Another aspect which has been shown to be conserved at least in parts of polII promoters is DNA structure, especially bendability and stiffness. Most probably the sequence conservation seen at transcription factor binding sites is just an extreme case of structural conservation (identical sequences have identical structures). It can well be that there are sites which have drifted apart on the sequence level in different members of a promoter family, while still being conserved with respect to some relevant structural property. Our algorithm first constructs gap-free blocks of sequence segments from the input sequences. A block can contain zero or multiple segments from a given input sequence. It is maximal with respect to the number of segments, such that all pairs of segments in a block are SIMILAR. In contrast to other existing algorithms, SIMILARITY is a relation which can be freely defined, and in particular can refer to similarity with respect to DNA structural parameters. In a second phase, the algorithm looks for a pattern of these motifs which is common (with variations) to all input sequences. 
Motifs which are part of such a pattern are not only more likely to be truly biologically relevant, but the pattern also constitutes a testable hypothesis (a PROMOTER MODEL) about the input family of promoter sequences. |
|
|
| 68. A new score function for distantly related protein sequence comparison. (up) |
| Maricel Kann, Richard Goldstein, The University of Michigan;
mkann@umich.edu |
| Short Abstract:
A new method to derive a score function to detect remote relationships between protein sequences has been developed using an optimization procedure. We find that the new score function obtained in such a manner performs better than standard score functions for the identification of distant homologies. |
| One Page Abstract:
Aligning the primary structure of a probe sequence with others in the database is one of the most significant and widely used techniques for understanding the evolutionary history of new sequences. Sequences with a high similarity score usually share a common structure and might have similar functions or mechanisms. All alignment methods rely on some score function to measure sequence similarity, and the choice of score function is especially critical for distant relationships. A new method to derive a score function to detect remote relationships between protein sequences has been developed. The new score function was obtained after maximization of a function of merit representing a measure of success in recognizing homologs of the newly sequenced protein among thousands of non-homolog sequences in the databases. We find that the new score function obtained in such a manner performs better than standard score functions for the identification of distant homologies. |
|
|
| 69. Expression profiling on DNA-microarrays: In silico clone selection for DNA chips (up) |
| Rainer König, DKFZ, Division of Functional Genome Analysis,
Im Neuenheimer Feld 506, 69120 Heidelberg, Germany;
Johannes Beckers, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Ge; Marcus Frohme, Tamara Korica, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany; Stefan Haas, MPI for Molecular Genetics, Computational Molecular Biology, Ihnestraße 73, 14195 Berlin, Germany; Matthias Seltmann, Christine Machka, Yali Chen, Alexei Drobychev, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Ge; Sabine Tornow, Michael Mader, GSF - National Research Center, Institute for Bioinformatics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany; Martin Hrabé de Angelis, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Ge; Werner Mewes, GSF - National Research Center, Institute for Bioinformatics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany; Jörg Hoheisel, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany; Martin Vingron, MPI for Molecular Genetics, Computational Molecular Biology, Ihnestraße 73, 14195 Berlin, Germany; r.koenig@dkfz.de |
| Short Abstract:
As part of the German Human Genome Project, DNA microarray technology is being established to systematically analyse gene function in mouse mutant lines. Software was developed to select gene-specific IMAGE clones from subsets of known genes. It is designed to optimise for good hybridisation to the target and low cross-hybridisation with other known genes. |
| One Page Abstract:
As part of the German Human Genome Project (DHGP), DNA microarray technology is used for a systematic analysis of gene function in ENU-induced mutant mice. To design our chips for chosen subsets of genes, a software system was developed that selects gene-specific EST clones from the IMAGE clone set. To obtain specific expression signals for each gene, the algorithm is designed to satisfy two requirements for the immobilised antisense probe: (1) good hybridisation to the target and (2) no cross-hybridisation with other known genes. A method that addresses both requirements is presented. As an exemplary application, the first chip design contains genes with known functions, for example during embryonic development, or genes that are relevant for the pathogenesis of related human diseases. Additionally, a set of constitutively expressed genes was selected to facilitate normalisation. Public access is offered to select clones for known mouse genes (www.dkfz-heidelberg.de/tbi/services/koenig/services/clones2chip_front.pl). |
|
|
| 70. Data Mining: Efficiency of Using Sequence Databases for Polymorphism Discovery (up) |
| David G. Cox, University of Turin / International Agency for Research
on Cancer;
Federico Canzian, Catherine Boillot, International Agency for Research on Cancer; cox@iarc.fr |
| Short Abstract:
We selected thirteen genes, and determined the complete collection of polymorphisms existing in these genes in our laboratory using DHPLC, or in other laboratories using comparable methods. Then we compared these results to polymorphisms found by aligning sequences of the genes in the GenBank Database, calling single base differences between sequences polymorphisms. |
| One Page Abstract:
An open question in research on Single Nucleotide Polymorphisms (SNPs) is what percentage of true SNPs is found by in silico pre-screening. To address this question, we selected thirteen genes and determined the complete collection of "true" polymorphisms, i.e. polymorphisms detected experimentally, existing in these genes in our laboratory using Denaturing High Performance Liquid Chromatography (DHPLC) and fluorescent sequencing, or in other laboratories using comparable methods (Single Strand Conformation Polymorphism, Denaturing Gradient Gel Electrophoresis). The genes studied by our group were PTGS2, IGFBP1, IGFBP3, and CYP19. GenBank sequence information was then aligned using two methods, and sequence differences were termed "candidate" polymorphisms. We then compared the series of SNPs obtained experimentally and in silico and found that in silico methods are relatively specific (up to 55% of candidate SNPs found by SNPFinder have been discovered by experimental procedure) but have low sensitivity (not more than 27% of true SNPs are found by in silico methods). |
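The two percentages quoted above are set overlaps, which a short sketch makes explicit. It assumes each polymorphism is identified by a (gene, position) pair. The positions below are invented for illustration and are not data from the study.

```python
def compare_snps(true_snps, candidate_snps):
    """Compare experimentally confirmed SNPs with in silico candidates.

    Returns the fraction of candidates that are confirmed (the quantity the
    abstract calls specificity) and the fraction of true SNPs that were
    recovered in silico (sensitivity).
    """
    true_snps, candidate_snps = set(true_snps), set(candidate_snps)
    confirmed = true_snps & candidate_snps
    return {
        "confirmed_fraction_of_candidates": (
            len(confirmed) / len(candidate_snps) if candidate_snps else 0.0
        ),
        "sensitivity": len(confirmed) / len(true_snps) if true_snps else 0.0,
    }


# Hypothetical example: four experimentally detected SNPs, two candidates.
experimental = {("PTGS2", 10), ("PTGS2", 50), ("CYP19", 7), ("CYP19", 99)}
in_silico = {("PTGS2", 10), ("CYP19", 8)}
print(compare_snps(experimental, in_silico))
```

Keeping the two ratios separate matters here because the abstract reports them as diverging: candidates are often real, yet most real SNPs are missed.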
|
|
| 71. Automated modeling of protein structures (up) |
| Morten Nielsen, Ole Lund, Claus Lundegaard, Thomas N. Petersen, Structural
Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970
Hoersholm, Denmark.;
Jakob Bohr, Soeren Brunak, Structural Bioinformatics, Inc., SAB, San Diego, California; Garry P. Gippert, Structural Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.; mnielsen@strubix.dk |
| Short Abstract:
SBI-AT has developed a novel and highly reliable method for fold recognition based on a ranking of alignment Z-scores computed from sequence profiles and predicted secondary structure. The method was used for template identification in the CASP4 protein structure prediction experiment. |
| One Page Abstract:
SUMMARY Automated modeling of protein structure from sequence guides the assignment of function and the selection of targets for experimental structure studies, and provides starting points for expert modeling of protein structure. Fold recognition is an important benchmark in automated modeling approaches. SBI-AT has developed a novel and highly reliable method for fold recognition based on a proprietary ranking of alignment Z-scores computed from sequence profiles and predicted secondary structure (Petersen et al., 2000). In this presentation we compare results obtained using methods developed at SBI-AT with the well-known PDB-BLAST and FFAS (Rychlewski et al., 2000) methods. Our results in the comparative modeling sections of the CASP4 protein structure prediction experiment are also summarized. CONCLUSIONS The SBI-AT fold recognition method performs well compared to FFAS and PDB-BLAST, particularly in the high-reliability regime. A large performance increase for SCOP family relationships (Murzin et al., 1995) impacts strongly on comparative modeling of protein structures. In CASP4 comparative modeling categories, SBI-AT's automatic modeling method shows a performance comparable to that of the best expert and automatic methods developed elsewhere. REFERENCES Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res 1997 25:3389-402. CASP4, http://predictioncenter.llnl.gov/casp4/. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995). J. Mol. Biol. 247:536-540. Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O (2000). Proteins 41:17-20. Rychlewski L, Jaroszewski L, Li W, Godzik A (2000). Protein Sci. 9:232-241. ABOUT SBI-AT SBI-AT, a subsidiary of Structural Bioinformatics Inc., San Diego, develops novel computer algorithms for the prediction and exploration of structural and dynamical features of proteins, specifically targeted for use in rational drug design and biotechnology applications. 
www.strubix.dk. |
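Ranking templates by alignment Z-score can be sketched generically. The SBI-AT ranking itself is described as proprietary, so the code below is not that method: it assumes the Z-score of a template is its raw alignment score standardized against the scores of all templates for the same query. The template names and scores are invented.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Standardize raw alignment scores: z = (score - mean) / std."""
    values = list(raw_scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All templates score the same, so none stands out.
        return {template: 0.0 for template in raw_scores}
    return {template: (s - mu) / sigma for template, s in raw_scores.items()}


def rank_templates(raw_scores):
    """Return (template, z) pairs sorted from best to worst Z-score."""
    return sorted(z_scores(raw_scores).items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical raw scores of one query against three templates.
print(rank_templates({"tmplA": 10.0, "tmplB": 20.0, "tmplC": 30.0}))
```

A reliability threshold on the top Z-score would then separate confident fold assignments from uncertain ones.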
|
|
| 72. Algorithmic improvements to in silico transcript reconstruction implemented in the Paracel Clustering Package (PCP) (up) |
| Joseph A. Borkowski, Jun Qian, Cassi Paul, Charles P. Smith, KyungNa
Oh, Glen Herrmannsfeldt, Cecilie Boysen, Paracel, Inc.;
borkowski@paracel.com |
| Short Abstract:
We describe novel algorithms which improve EST-based transcript
reconstruction quality and which allow more accurate splice form detection.
These algorithms make use of pairwise overlap and redundancy measurements
to identify and remove artifactual chimeric sequences and low-quality 'non-N'
sequence segments. We demonstrate their effect on overall transcript quality.
|
| One Page Abstract:
Transcript reconstruction from EST data, whether public or private, can often be problematic. Ideally, it should be possible to reconstruct a single consensus sequence present in a cell using measured pairwise overlap between individual ESTs. When attempting to reconstruct transcripts, available programs typically overestimate the number of alternative transcripts from a given organism and also tend to produce a number of large false clusters. These problems can have serious negative consequences when using the output of transcript reconstruction for either gene identification or oligo chip design. They are often more acute when a high quality genomic sequence segment covering any particular transcript is not available or not used to guide correct reconstruction. Overestimation of alternative transcript forms can be proven with thorough comparison to any available high quality genomic sequence data. Incorrectly predicted alternative transcripts will not have a corresponding genomic sequence segment. In most cases this overestimation is due to low quality sequence segments present in the input sequence data. When a particular assembly program is unable to align these sequences with the true transcript along its entire length, they are falsely reported as alternative transcripts of the true transcript. To reduce false alternative transcript reporting we have developed an algorithm that can identify and remove low quality sequence segments present in ESTs, even when quality values are not available. This algorithm is based on the rate of score drop-off in a pairwise alignment. It has been observed that sequence quality tends to drop off slowly as it approaches the end of an EST. 
Therefore its score in comparison with higher quality sequences tends to drop-off slowly after reaching an inflection point at the end of a high scoring match segment. In contrast, the score for a true alternative transcript tends to drop off at a much faster rate. When an end of a sequence has been identified as being low quality, that end is not used in construction of a consensus sequence. High quality sequence segments that are derived from alternative transcripts are used to create alternative consensus sequences. In our experience, large false clusters often are due to the presence of contaminants, repeats, or chimeric sequences in the input EST dataset. We have developed an algorithm that detects and reports these chimeric EST sequences. This algorithm uses all of the computed pairwise overlaps that exist in a set of sequences and uses this information to determine abrupt break points in the overall contig structure where a single sequence joins two coherent subclusters. When these break points are detected, and supported by a sufficient number of independent sequences on either side of the break point, a chimeric sequence is reported. These chimeric sequences are not used for the creation of a cluster. The chimeric and bad end detection algorithms have been incorporated into the Paracel Clustering Package (PCP). We report on both the accuracy of these algorithms and their effect on overall transcript quality. |
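The break-point idea behind chimera detection can be sketched with intervals alone. This is a simplified illustration, not the PCP implementation: it assumes the pairwise overlaps of other ESTs with one candidate read are given as half-open (start, end) intervals in the read's coordinates, and it ignores whether the left and right groups overlap each other. The function name and the default support threshold are hypothetical.

```python
def find_chimeric_breakpoint(overlaps, min_support=2):
    """Return a break position on the read, or None if none is supported.

    A position p is reported when no overlapping EST spans it and at least
    min_support independent overlaps lie entirely on each side of it.
    Only overlap end coordinates are tried as candidate break points.
    """
    for p in sorted({end for _, end in overlaps}):
        left = [o for o in overlaps if o[1] <= p]
        right = [o for o in overlaps if o[0] >= p]
        spanning = [o for o in overlaps if o[0] < p < o[1]]
        if not spanning and len(left) >= min_support and len(right) >= min_support:
            return p
    return None


# Two ESTs cover only the left part and two only the right part of the read.
print(find_chimeric_breakpoint([(0, 100), (10, 100), (100, 220), (105, 230)]))
```

If any single EST bridges the junction, no break point is reported, which matches the requirement that the read be the only link between two coherent subclusters.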
|
|
| 73. Computational Structural Genomics (up) |
| Steven E. Brenner, University of California, Berkeley;
brenner@compbio.berkeley.edu |
| Short Abstract:
Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. At the Berkeley Structural Genomics Center, we focus on the organisms Mycoplasma pneumoniae and M. genitalium. Computational components include selection of protein targets, managing experimental data, and analyzing solved structures. |
| One Page Abstract:
Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. Underlying this goal is the immense value of protein structure, especially in permitting recognition of distant evolutionary relationships for proteins whose sequence analysis has failed to find any significant homolog. A considerable fraction of the genes in all sequenced genomes have no known function, and structure determination provides a direct means of revealing homology that may be used to infer their putative molecular function. The solved structures will be similarly useful for elucidating the biochemical or biophysical role of proteins that have been previously ascribed only phenotypic functions. More generally, knowledge of an increasingly complete repertoire of protein structures will aid structure prediction methods, improve understanding of protein structure, and ultimately lend insight into molecular interactions and pathways. We use computational methods to select families whose structures cannot be predicted and which are likely to be amenable to experimental characterization. The methods employed include modern sequence analysis and clustering algorithms. Also consulted is the PRESAGE database for structural genomics, which records the community's experimental work underway and computational predictions. The protein families are ranked according to several criteria including taxonomic diversity and known functional information. Individual proteins, often homologs from hyperthermophiles, are selected from these families as targets for structure determination. The solved structures are examined for structural similarity to other proteins of known structure. Homologous proteins in sequence databases are computationally modeled, to provide a resource of protein structure models complementing the experimentally solved protein structures. References Brenner SE, Levitt M. 2000. Expectations from structural genomics. 
Protein Sci. 9:197-200. Brenner SE. 1999. Errors in genome annotation. Trends Genet 15:132-133. Brenner SE, Barken D, Levitt M. 1999. The PRESAGE database for structural genomics. Nucleic Acids Res 27:251-253. Brenner SE, Chothia C, Hubbard TJP. 1998. Assessing sequence comparison methods with reliable structurally-identified distant evolutionary relationships. Proc Natl Acad Sci USA 95:6073-6078. Brenner SE, Chothia C, Hubbard TJP. 1997. Population statistics of protein structures. Curr Opin Struct Biol 7:369-376. Brenner SE, Chothia C, Hubbard TJP, Murzin AG. 1996. Understanding protein structure: Using SCOP for fold interpretation. Meth Enzymol 266:635-643. Brenner SE, Hubbard T, Murzin A, Chothia C. 1995. Gene duplications
in the H. influenzae genome. Nature 378:140. Brenner SE. 1995. World wide
web and molecular biology. Science 268:622-623.
|
|
|
| 74. Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function (up) |
| Iddo Friedberg, Hanah Margalit, The Hebrew University, Jerusalem;
idoerg@cc.huji.ac.il |
| Short Abstract:
This study addresses the problem of proteins that have the same fold, but no sequence similarity. Using a database of such protein pairs we analyze those positions which are mutually, persistently conserved among close and distant family members. In many cases those positions show a role in function and/or fold. |
| One Page Abstract:
Many protein pairs that share the same fold do not have any detectable sequence similarity, providing a valuable source of information for studying sequence-structure relationship. In this study we use a stringent data set of structurally-similar, sequence-dissimilar protein pairs to characterize residues which may play a role in the determination of protein structure and/or function. For each protein in the database we identify amino-acid positions that show residue conservation within both close and distant family members. These positions are termed persistently conserved. We then proceed to determine the mutually persistently conserved positions, those structurally aligned positions in a protein pair that are persistently conserved in both pair-mates. Due to their intra- and inter-family conservation, these positions are good candidates for determining protein fold and function. We find that about 50% of the persistently conserved positions are mutually conserved. A significant fraction of them are located in critical positions for secondary structure determination, they are mostly buried, and many of them form spatial clusters within their protein structures. A substitution matrix based on the subset of persistently mutually conserved positions shows two distinct characteristics: (i) it is different from other available matrices, even those that are derived from structural alignments. (ii) it contains a significant amount of mutual information, emphasizing the special residue restrictions imposed on these positions. Such a substitution matrix should be valuable for protein design experiments. |
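The two-stage definition (persistently conserved within a family, then mutually conserved across a structural pair) can be sketched as set operations. The sketch assumes that both family alignments of a protein are indexed by the same columns, that conservation means one residue type reaching a frequency threshold, and that the structural alignment is a list of (position in A, position in B) pairs. The threshold value and all names are hypothetical.

```python
from collections import Counter


def conserved_columns(alignment, threshold=0.8):
    """Columns where the most common residue reaches the frequency threshold.

    Gaps ('-') never count as the conserved residue but do count in the
    denominator, so gappy columns are penalized.
    """
    columns = set()
    for i in range(len(alignment[0])):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(alignment) >= threshold:
            columns.add(i)
    return columns


def persistently_conserved(close_aln, distant_aln, threshold=0.8):
    """Positions conserved among both close and distant family members."""
    return conserved_columns(close_aln, threshold) & conserved_columns(distant_aln, threshold)


def mutually_persistent(pc_a, pc_b, structural_alignment):
    """Structurally aligned position pairs persistently conserved in both proteins."""
    return [(a, b) for a, b in structural_alignment if a in pc_a and b in pc_b]


# Toy alignments (invented): only column 0 is conserved in both sets.
close = ["ACDE", "ACDE", "ACDF"]
distant = ["ACDE", "AGDE", "ACHE"]
print(persistently_conserved(close, distant))
```

The fraction `len(mutually_persistent(...)) / len(pc_a)` would correspond to the roughly 50% figure reported above, under these simplifying assumptions.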
|
|
| 75. Using Surface Envelopes in 3D Structure Modeling (up) |
| Jonathan M. Dugan, Glenn A. Williams, Russ B. Altman, Stanford Medical
Informatics;
dugan@smi.stanford.edu |
| Short Abstract:
Our group has built unified data structures and algorithms that are highly flexible and applicable to a variety of different data types for modeling macromolecular structures. This poster outlines the development, implementation, and results of algorithms capable of integrating surface shape data into the 3D structure modeling process. |
| One Page Abstract:
Modeling the 3D structure of biological macromolecules assists in the understanding of biological function, and can assist in the discovery of novel pharmaceuticals. Current crystallographic methods for structure determination have been very successful, but are not applicable in all cases. Fortunately, other experimental methods can provide useful data regarding biomolecular structure, although typically these data are noisy and sparse. The sources of these data include those that provide distances (such as NMR, binding, affinity, and crosslinking measurements) as well as those that produce other types of structure information, such as solvent accessibility and overall geometric features--such as volume or the shape of enclosing surface envelopes (SE). Our group has focused on building unified data structures and algorithms that are highly flexible and applicable to a variety of different data types -- with the goal of combining these heterogeneous data to maximize their utility in modeling macromolecular structures. This poster outlines the development and implementation of algorithms capable of integrating SE data into the 3D structure modeling process. I present the results of modeling several proteins and test structures with distance data and SE data derived from solved structures. |
|
|
| 76. Molecular modelling in studies of SDR and MDR proteins (up) |
| Erik Nordling, Bengt Persson, MBB, SBC, Karolinska Institutet;
erik.nordling@mbb.ki.se |
| Short Abstract:
The presentation covers the use of molecular modelling methods in studies of the medium-chain dehydrogenases/reductases (MDR) and short-chain dehydrogenases/reductases (SDR) protein families. In particular, a subclassification of the MDR family is described, and the substrate specificity of the endoplasmic reticulum associated amyloid beta binding protein (ERAB) is investigated. |
| One Page Abstract:
The wealth of structural information available through the Protein Databank (PDB) may be extended to structural neighbours using homology modelling. The technique may be used routinely down to 40% sequence identity to yield accurate models if there are no large insertions or deletions in the alignment. Proteins with lower sequence identities can be modelled to reasonable accuracy, but require considerably more care in the modelling process. We have employed these techniques on members of the protein families SDR (Short-chain Dehydrogenases/Reductases) and MDR (Medium-chain Dehydrogenases/Reductases). In the first case we modelled ERAB (Endoplasmic Reticulum associated Amyloid β-peptide Binding Protein) on 7alpha-Hydroxysteroid Dehydrogenase (27% sequence identity), yielding a structure that is compatible with known enzymatic data. X-ray crystallography later verified the core parts of the modelled structure. Recently, we have used homology modelling to further clarify the evolutionary relationships within subgroups of the MDR family. We have also used various docking methods to investigate substrate specificity and binding mechanisms. This has been applied to ADH class I beta and gamma isozymes and ERAB, giving results compatible with kinetic data. |
|
|
| 77. Consensus Predictions of Membrane Protein Topology (up) |
| Johan Nilsson, Bengt Persson, Gunnar von Heijne, Stockholm Bioinformatics
Centre;
johan.nilsson@mbb.ki.se |
| Short Abstract:
Consensus predictions of membrane protein topology might provide a means to estimate the reliability of predicted topologies. Using five topology prediction methods according to a "majority-vote" principle, we found that the topology of nearly half of all E.coli inner membrane proteins can be predicted with high reliability (>90% correct predictions). |
| One Page Abstract:
Computational methods for identification and characterisation of integral membrane proteins will become increasingly important as the number of completely sequenced genomes increases. At present, several methods are available for prediction of integral membrane protein topology; the approaches employed include neural networks, hidden Markov models, multiple sequence alignments and dynamic programming. Considering the large number of transmembrane proteins in a typical genome (20-25%), even a slight improvement in the ability to predict membrane protein topology will have major effects on e.g. automatic sequence annotation. In this study we have explored the possibility that consensus predictions of membrane protein topology might provide a means to estimate the reliability of a given predicted topology. Our intention was to improve topology predictions by combining the results obtained from a number of methods according to a "majority-vote" principle. We used five popular topology prediction methods: TMHMM, HMMTOP, MEMSAT, TOPPRED and PHD. Our results show that the fraction of correctly predicted topologies over a test set of 60 Escherichia coli inner membrane proteins with experimentally determined topologies increases with the number of methods that agree. The topology of nearly half of the sequences can be predicted with high reliability (>90% correct predictions) using our approach. |
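The "majority-vote" principle described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the encoding of a topology as (number of transmembrane helices, N-terminus location) and the example predictions are assumptions made up for the sketch.

```python
from collections import Counter


def consensus_topology(predictions):
    """Return (topology, n_agree) for the most frequently predicted topology.

    `predictions` maps a method name to a topology. The topology encoding
    (number of TM helices, N-terminus location) is an assumption of this sketch.
    On a tie, the topology that was seen first wins.
    """
    counts = Counter(predictions.values())
    topology, n_agree = counts.most_common(1)[0]
    return topology, n_agree


# Hypothetical predictions for one protein (invented values, not real output).
example = {
    "TMHMM": (12, "in"),
    "HMMTOP": (12, "in"),
    "MEMSAT": (12, "in"),
    "TOPPRED": (11, "out"),
    "PHD": (12, "in"),
}

topology, n_agree = consensus_topology(example)
# A strict majority means more than half of the methods agree.
has_majority = n_agree > len(example) / 2
```

The number of agreeing methods (`n_agree`) is the quantity that the abstract relates to reliability: the more methods agree, the larger the fraction of correct topologies.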
|
|
| 78. Development of New statistical measures of protein model quality and its application for consensus prediction of protein structure (up) |
| Fang Huisheng and Arne Elofsson, Stockholm Bioinformatics Center,
Stockholm University;
fang@sbc.su.se |
| Short Abstract:
In this study we have used better statistical measures of the similarity between a protein-model and the correct structure. These new measures have been used to improve the performance of Pcons, a consensus based fold recognition method. We show that using these new measures we obtain better predictions. |
| One Page Abstract:
Development of New statistical measures of protein model quality and its application for consensus prediction of protein structure. Fang Huisheng and Arne Elofsson, Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden. More and more methods for predicting protein structure have been developed, based on different algorithms and information. It has recently been confirmed that different methods produce the best predictions for different targets, and that the final prediction accuracy could be improved if the available methods were combined in a perfect manner (Lundström et al, 2001). Recent studies show that a statistical distribution, i.e. a P-value, assessing the similarity between a model and the structure can be developed (Levitt & Gerstein, 1998) and has proved to be a good measure of protein model quality. In this study a score (the LGscore) was used; however, it has been shown that if the number of matched residues is less than 120, the distribution does not follow the curves used to calculate the P-value. A P-value really should represent the statistics correctly. In the present work, we have first recalculated the P-values depending on the number of aligned residues. We use two functions, one describing the average score and another describing the standard deviation. These functions can be used to describe the behavior of the score from 10 aligned residues to more than 300. Based on them, we calculate a new P-value using an extreme value distribution, as done by Levitt & Gerstein. The new P-values do not show the same dependency on fragment size as the old ones. In CAFASP2 it was observed that very good models for short targets did not obtain a significant LGscore. On the basis of this observation we have introduced the "Q-value". The reason is that even a perfect structural similarity for a short protein is not very significant.
To overcome these problems when scoring models, we have created a new score (the Q-value) that depends on the length of the target. It was calculated from the P-values for models with 30-50% sequence identity. Using the new and old LGscore and the Q-value, Pcons consensus predictors combining seven servers have been developed. The procedure is as follows: we first compare the two kinds of similarity (i.e. new LGscore), between models and between model and target structure, for about 199 targets from LiveBench2. Furthermore, we build two models, with multiple linear regression and neural networks respectively, to describe the relationship between the model-model similarity and the model-target similarity for the new LGscore, old LGscore and Q-value. The performance trial shows that the model based on the new LGscore is better than the one based on the old LGscore. References 1. Lundström et al, Pcons: A neural network based consensus predictor that improves fold recognition. 2001 2. Siew et al, MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, Vol. 16, 2000, p776-785 3. Zemla et al, Processing and Analysis of CASP3 protein structure predictions. Proteins: Structure, Function, and Genetics, 1999, Suppl 3:22-29 4. Levitt et al, A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. USA 1998, 95:5913-592 |
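As a rough illustration of a length-dependent P-value, the sketch below evaluates the upper tail of an extreme value (Gumbel) distribution whose mean and standard deviation depend on the number of aligned residues. The functional forms and coefficients of `mean_score` and `sd_score` are placeholders invented for this sketch; the abstract does not give the fitted functions, and the authors' exact parameterisation may differ.

```python
import math

EULER_GAMMA = 0.5772156649015329


def mean_score(n_aligned):
    # Placeholder form and coefficient (assumption), not a fitted function.
    return 2.0 * math.log(n_aligned)


def sd_score(n_aligned):
    # Placeholder form and coefficients (assumption), not a fitted function.
    return 0.5 + 0.1 * math.log(n_aligned)


def gumbel_p_value(score, n_aligned):
    """P(S >= score) for a Gumbel distribution matched to mean and sd."""
    # Convert mean/sd into the Gumbel scale and location parameters.
    scale = sd_score(n_aligned) * math.sqrt(6.0) / math.pi
    loc = mean_score(n_aligned) - EULER_GAMMA * scale
    z = (score - loc) / scale
    # 1 - exp(-exp(-z)), written with expm1 for accuracy at small P-values.
    return -math.expm1(-math.exp(-z))


p_high = gumbel_p_value(10.0, 50)
p_low = gumbel_p_value(8.0, 50)
```

With these placeholder functions, a higher score at a fixed alignment length gives a smaller P-value, and the same raw score becomes less significant as the number of aligned residues grows.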
|
|
| 79. Signal filtration methods to extract structural information from evolutionary data applied to G protein-coupled receptor (GPCR) transmembrane domains (up) |
| L. Marsh, Dept. of Biology, Long Island University;
lmarsh@liu.edu |
| Short Abstract:
The conservation score of GPCR amino acid residues was treated as a noisy signal containing information about solvent accessibility of the residues. This conservation signal was subjected to a Fourier-based filtration/reconstruction method to extract structural information. Solvent accessibility of transmembrane residues was correctly predicted for a series of diverse opsins. |
| One Page Abstract:
The relationship between the structure of a protein and the rate of evolution of specific residues is intriguing and complex. Solvent-exposed residues often evolve more rapidly than residues involved in structural contacts. We consider residue conservation during evolution as a signal, albeit an extremely noisy signal, reflecting solvent protection and other structural information. Using a signal filtration/reconstruction approach with a novel wavelet-based Fourier approximation, solvent exposure of amino acid residues could be predicted from residue conservation data. The degree of conservation at each position of the TM was calculated for random clusters of TM sequences drawn from a pool of 158 opsins. Aligned sequences were compared using a modified BLOSUM substitution matrix to generate a series representing the degree of conservation at each amino acid position (`conservation signal'). The unfiltered conservation signals exhibited a weak, but positive, Pearson correlation coefficient with solvent inaccessibility of residues in the rhodopsin structure, supporting a relationship between substitution rate and accessibility in this system. Solvent accessibility of the alpha-helical TMs has a periodicity of about 3.6 in rhodopsin. Fourier analysis confirmed that the conservation signal contained structural information, but simple Fourier methods did not yield robust predictions. A filter was designed that permitted enhancement of alpha-helical patterns, accommodation of helix breaks, and a waveform correction for the fact that most residues in the structure are not solvent exposed. This filter was implemented as a wavelet-based Fourier-filtration approximation and produced prediction success rates of >95% for the tested (relatively uniform) TM1 and TM7. The method is now being applied to other systems. |
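The "simple Fourier" step mentioned above can be illustrated by measuring how much power a conservation signal carries at the helical period of about 3.6 residues. This sketch is not the wavelet-based filter described in the abstract, whose details are not given; the synthetic signal is an invented stand-in for a real conservation profile.

```python
import math


def periodicity_power(signal, period):
    """Power of the mean-centred signal at the given period (in residues)."""
    mean = sum(signal) / len(signal)
    omega = 2.0 * math.pi / period  # angular step per residue
    re = sum((s - mean) * math.cos(omega * k) for k, s in enumerate(signal))
    im = sum((s - mean) * math.sin(omega * k) for k, s in enumerate(signal))
    return (re * re + im * im) / len(signal)


# Synthetic 25-residue "conservation signal" with a perfect 3.6-residue period.
synthetic = [math.cos(2.0 * math.pi * k / 3.6) for k in range(25)]

helix_power = periodicity_power(synthetic, 3.6)  # alpha-helical period
off_power = periodicity_power(synthetic, 2.0)    # unrelated period for contrast
```

For a real, noisy conservation signal the peak at 3.6 residues would be far less clear, which is consistent with the abstract's remark that simple Fourier methods alone did not yield robust predictions.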
|
|
| 80. Using clusters to derive structural commonality for ATP binding sites (up) |
| Yosef Yehuda Kuttner, Mariana Babor, Marvin Edelman, Vladimir Sobolev,
Weizmann Institute of Science;
joshef.kuttner@weizmann.ac.il |
| Short Abstract:
We developed a method for structural multiple alignment of binding
sites with a given ligand, in order to search for similarities in spatial
arrangement of binding pocket atoms. An algorithm for identifying clusters
of atoms was developed. Our strategy seeks commonalities in arrangement
of contacting atoms around rigid ligand components.
|
| One Page Abstract:
Keywords: atomic contacts, molecular recognition, cluster, adenine We are developing a method for structural multiple alignment of binding sites with a given ligand, in order to search for similarities in the spatial arrangement of binding pocket atoms. An algorithm for identifying clusters of atoms was developed for this task. For a given flexible ligand, the binding pocket shape in different target proteins might vary considerably due to different ligand conformations. Our strategy was, therefore, to seek commonalities in the arrangement of contacting atoms around rigid, or almost rigid, ligand components. The rigid (or almost rigid) chemical moiety from different files was superimposed. LPC software [1] was then used to determine the protein atoms in contact with the ligand and to classify the atom contacts according to their physico-chemical properties. A search for atomic clusters was then conducted. Atoms were defined as belonging to a cluster if they were within a given distance of each other and came from different PDB entries. Additionally, members of a cluster had a requirement to form attractive contacts with the ligand. The ATP molecule was chosen for this study. We selected the rigid adenine ring moiety as a test object. A non-redundant dataset of 14 PDB entries of ATP-protein complexes (resolution of 2.2Å or better) was analyzed. The adenine rings of the 14 files were superimposed with concerted movement of the protein atoms in contact with the rings. 
Several groups have recently sought structural commonalities in nucleotide base recognition by proteins: Kobayashi and Go [2] found remarkable similarities despite considerable differences in primary sequence; Shi and Berg [3] used consensus sequences to construct novel proteins with increased DNA affinity in zinc finger proteins of the Cys-His2 type; Denessiouk & Johnson [4] found similarities in the relative positions of different nucleotide-base binding motifs along polypeptide chains from related proteins, although not in their three dimensional space; while Moodie et al. [5] found no specific recognition motif for adenylate in terms of particular residue/ligand interactions, although they found commonalities in shape and polarity properties at ligand/protein interfaces. In our work, hydrophobic clusters were found above and beneath the plane (as previously indicated by Moodie et al. [5]) of the adenine ring, which included some hydrophilic atoms acting as proton donors hydrogen-bonded to the conjugated system. We also found two clusters containing atoms that form hydrogen bonds. The network of atomic clusters so determined was taken as the consensus binding-site structure for the adenine ring of ATP. We note that the hydrogen bond acceptor and donor clusters, (in contact with N-6 and N-1, respectively) are in similar geometric juxtaposition as the hydrogen bonds between the adenine and thymine base pairs in DNA. Cluster positions for the adenine ring were derived. Their relative arrangement served as a fingerprint to search for putative binding sites. When the searching procedure located 6 or more cluster positions, the correct binding site was found for all proteins tested, but usually there were multiple solutions (up to 25 putative pockets). We are now attempting to derive cluster positions for the ribose ring to reduce the number of incorrect solutions. References [1] Sobolev V., Sorokine A., Prilusky J., Abola E.E., Edelman M. (1999). 
Automated analysis of interatomic contacts in proteins. Bioinformatics, 15: 327-332. [2] Kobayashi N., Go N. (1997). A method to search for similar protein local structures at ligand-binding sites and its application to adenine recognition. Eur. Biophys. J., 26: 135-144. [3] Shi Y., Berg J.M. (1995). A direct comparison of the properties of natural and designed zinc finger proteins. Chem. Biol. 2: 83-89. [4] Denessiouk K.A., Johnson M.S. (2000). When fold is not important: A common structural framework for adenine and AMP binding in 12 unrelated families. Proteins, 38: 310-326. [5] Moodie S.L., Mitchell J.B.O., Thornton J.M. (1996). Protein recognition of adenylate: An example of a fuzzy recognition template. J. Mol. Biol., 263: 486-500. |
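The cluster definition given above (atoms within a given distance of each other that come from different PDB entries) can be sketched as a single-linkage grouping. The abstract does not specify the clustering algorithm, so the union-find approach, the cutoff value, the entry labels and the coordinates below are all assumptions made for illustration. The additional requirement that cluster members form attractive contacts with the ligand is not modelled here.

```python
import math
from itertools import combinations


def find_clusters(atoms, cutoff):
    """Group atoms linked by pairs that are close AND from different entries.

    `atoms` is a list of (entry_id, (x, y, z)) tuples, assumed to be already
    superimposed on the rigid ligand moiety. Returns a list of index sets.
    """
    parent = list(range(len(atoms)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(atoms)), 2):
        (entry_i, xyz_i), (entry_j, xyz_j) = atoms[i], atoms[j]
        if entry_i != entry_j and math.dist(xyz_i, xyz_j) <= cutoff:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(atoms)):
        groups.setdefault(find(i), set()).add(i)
    # Singletons are not clusters.
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical contact atoms from three entries (invented coordinates).
atoms = [
    ("entry1", (0.0, 0.0, 0.0)),
    ("entry2", (0.5, 0.0, 0.0)),
    ("entry3", (0.0, 0.6, 0.0)),
    ("entry1", (10.0, 0.0, 0.0)),
    ("entry1", (10.3, 0.0, 0.0)),  # close to the previous atom, same entry
]
clusters = find_clusters(atoms, cutoff=1.0)
```

The last two atoms are close together but come from the same entry, so they do not form a cluster; only the first three atoms are grouped.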
|
|
| 81. Aromaticity of Domains in Photosynthetic Reaction Centers; A Clue to the Protein's Control of Energy Dissipation during Enzymatic Reactions (up) |
| Ilan Samish, Avigdor Scherz, Plant Sciences Department, Weizmann
Institute of Science, Rehovot, Israel;
Haim J Wolfson, School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel; Ilan.Samish@weizmann.ac.il |
| Short Abstract:
Photosynthetic reaction centers serve as model membrane proteins for studying structure-function relationship. Multiple structural alignment of reaction centers, combinatorial mutagenesis of conserved sites and analysis of the protein microenvironment along the electron-transfer pathway suggest that protein aromaticity is involved in controlling energy-dissipation and reactant-geometry during electron-transfer in an entropy/enthalpy mechanism. |
| One Page Abstract:
Photosynthetic reaction centers (RCs), which conduct light-induced electron transfer (ET), may serve as model membrane proteins for studying functions of conserved 3D elements. First, based on the fact that structure is more conserved than sequence, multiple structural alignment (MUSTA algorithm) was conducted on RCs from oxygenic and non-oxygenic organisms. The algorithm was applied to the full RC and at the subunit level, resulting in a 'tree' of common cores in the different subgroups. A common core located around the 4-helix bundle center of the complex was found in all RCs compared, in which amino acids (AAs) of a particular attribute form clusters. These clusters suggested conservation of aromatic and of high-packing AAs. Second, two conserved AAs in the D1 subunit of the photosystem II RC underwent combinatorial mutagenesis, yielding 11-12 photoautotrophic mutants at each site. Neither positively charged nor aromatic AAs were included. Third, the content of virtual tubes (radii of 2-5 Å) between the ET cofactors in the bacterial RC was examined. Findings included: 1. Tubes of up to 3.5 Å radius do not include backbone atoms. 2. All tubes have a uniform atom density. 3. A larger percentage of non-aromatic AAs is found in the slower ET rate domains. 4. The active branch contains a larger fraction of aromatic AAs than the inactive one. We propose that non-aromatic AAs enable entropic changes required for energy dissipation in the slow ET milieu, while rigid domains optimize reactant geometry required in the fast ET domains. These findings are proposed to shed light on the protein management of two contradictory prerequisites: a need to position reactants in a precise configuration during the electronic density migration, and an opposing need to rapidly dissipate the evolved energy in order to avoid the backward reaction. |
|
|
| 82. LIGPROT: A database for the analysis and visualization of ligand binding. (up) |
| Rafael Najmanovich, Eran Eyal, Vladimir Sobolev, Marvin Edelman, Weizmann
Institute of Science;
rafael.najmanovich@weizmann.ac.il |
| Short Abstract:
LigProt is a structural database of paired Apo and Holo protein forms (derived from the PDB) useful to studies of ligand binding. The database is automatically updated and offers a web based interface that allows browsing and searching as well as visualization of the superimposed Holo and Apo forms. |
| One Page Abstract:
A database of paired protein structures in complexed (holo-protein) and uncomplexed (apo-protein) forms from the PDB macromolecular structural database can provide a myriad of information to be used as raw data in bioinformatics studies as well as in the planning of experiments by molecular biologists. Such a database was used in our recent study of side chain flexibility (Najmanovich et al., PROTEINS, 39: 261-268 (2000)). In the present work we: 1. Automate our database creation procedure so that the database can be updated regularly to cope with the growth of the PDB, and 2. Create a web-based visualization tool similar to MutaProt (http://bioinfo.ac.il/MutaProt) (Eyal et al., Bioinformatics, 17(4): 381-382 (2001)) for searching the database according to several criteria and visualizing the results using Chime. The database is automatically built in three stages: 1. A list of all ligands present in the PDB is created. 2. A list of all possible candidate apo-protein entries for each entry in list 1 is built, and 3. Each candidate holo-apo pair is tested to ensure that the binding site contains no ligand other than the one under consideration in both entries. PDB entries with resolution lower than 2.5 Å or containing DNA or RNA are excluded from the database. The search and visualization interface allows browsing of the database and searching according to protein and ligand PDB code. We are currently implementing search by protein and ligand name as well as binding site composition and structural characteristics. Once an entry is selected, a list of the intermolecular contacts present in the holo protein is generated using LPC software (http://www.weizmann.ac.il/sgedg/lpc) (Sobolev et al., Bioinformatics, 15(4): 327-332 (1999)). The visualization allows for the inspection of the superimposed structure of the binding site in both entries. |
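The exclusion rules and the pair test from the third build stage might look roughly like the sketch below. The `Entry` record, its field names and the example entries are hypothetical. Two readings are assumed here and may differ from the actual procedure: "resolution lower than 2.5 Å" is taken to mean a numerically larger (worse) value, and the pair test is taken to mean that the holo binding site holds only the ligand of interest while the apo binding site holds none.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical summary of one structure entry (not a real PDB record)."""
    code: str
    resolution: float            # in angstroms; larger values are worse
    has_nucleic_acid: bool
    site_ligands: set = field(default_factory=set)  # ligands in the binding site


def passes_filters(entry, max_resolution=2.5):
    """Apply the exclusion rules: poor resolution or DNA/RNA present."""
    return entry.resolution <= max_resolution and not entry.has_nucleic_acid


def valid_pair(holo, apo, ligand):
    """Assumed reading of stage 3: only `ligand` in the holo site, none in apo."""
    return (
        passes_filters(holo)
        and passes_filters(apo)
        and holo.site_ligands == {ligand}
        and not apo.site_ligands
    )


# Invented entries for illustration only.
holo = Entry("holo_a", 1.9, False, {"ATP"})
apo = Entry("apo_a", 2.1, False)
apo_low_res = Entry("apo_b", 3.0, False)

pair_ok = valid_pair(holo, apo, "ATP")
pair_rejected = valid_pair(holo, apo_low_res, "ATP")
```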
|
|
| 83. ThreadMAP: Protein Secondary Structure Determination (up) |
| Lydia E. Tapia, Thomas R. Ioerger, Department of Computer Science,
Texas A&M University;
James C. Sacchettini, The Center for Structural Biology, Texas A&M University; ltapia@tamu.edu |
| Short Abstract:
We have developed a new algorithm based on pattern recognition that recognizes secondary structure fragments in an electron density map prior to any structure solution, as part of the Textal automated model building system. Our approach consists of tracing the density map, extracting geometry based features, and performing classification. |
| One Page Abstract:
ThreadMAP: Protein Secondary Structure Determination Lydia E. Tapia(1), Thomas R. Ioerger(1), and James C. Sacchettini(2) (1)Department of Computer Science (2)The Center for Structural Biology Texas A&M University College Station, TX ltapia@tamu.edu, ioerger@cs.tamu.edu, sacchett@bioch.tamu.edu Upon the initial construction of a three-dimensional electron-density map of a protein, many protein crystallographers are often faced with low quality and low-resolution data. Because of this noisy data, automated methods for determining the structure of a protein are often hindered. Secondary structure information can help automated methods to refine a map. Also, obtaining quick secondary structure information directly from an electron density map can lead to large-scale protein database searching. For example, secondary structure of proteins from the PDB can be matched against that of a new electron density map. Homologous structures can then be used to solve the sequence of the new, unsolved protein. We have developed a new algorithm based on pattern recognition that recognizes secondary structure fragments in an electron density map prior to any structure solution, as part of the Textal automated model building system [1]. Our approach consists of tracing the density map, extracting features based on the geometry of the trace, and performing classification. An easy way to visualize the structure of an electron density map is to reduce the map to a series of lines representing the core of the density, a trace. An algorithm similar to the one used in Bones [2] is used. Once the map is reduced, simple heuristics such as three-way branching and distance metrics can be used to separate the backbone of the protein from the side chains. A group of features have been developed that characterize this backbone trace. 
For example, two-dimensional projections of the trace are made that capture the circular nature of a spiraling helix and the directness (movement in only one dimension) of a strand. Also, three-dimensional features are used to capture information about the Euclidean distance a trace travels. All the features are extracted for all overlapping windows of twenty trace points (~10 angstroms). We currently train on a database of feature vectors and their corresponding DSSP [3] predictions. When we receive a query feature vector, we use a nearest neighbor approach to find its closest matches from within the database. The classifications from the closest match can be used to classify the query vector. Smoothing techniques are used to take advantage of the sequential nature of secondary structure. This gives us more confidence in regions of consistent prediction and removes some ambiguity about structure transition regions. The result of this program is an automatic characterization of secondary structure fragments from a density map alone. [1] T. Holton, T. Ioerger, J. Christopher, and J. Sacchettini. (2000). Determining protein structure from electron-density maps using pattern matching. Acta Cryst. D56, 722-734. [2] T. Jones, J. Zou, S. Cowan, and M. Kjeldgaard (1991). Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst. A47, 110-119. [3] W. Kabsch and C. Sander (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637. |
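The nearest-neighbour classification and smoothing steps described above can be sketched as follows. The two-component feature vectors, the tiny training database and the window size are invented for illustration; the actual ThreadMAP features and smoothing technique are not specified in the abstract. The labels "H" (helix) and "E" (strand) follow DSSP conventions.

```python
import math
from collections import Counter


def nearest_label(query, database):
    """Return the label of the database vector closest to `query` (1-NN)."""
    return min(database, key=lambda item: math.dist(query, item[0]))[1]


def smooth(labels, half_width=1):
    """Replace each label by the majority label in a sliding window."""
    smoothed = []
    for i in range(len(labels)):
        window = labels[max(0, i - half_width): i + half_width + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


# Hypothetical training database: (feature vector, DSSP-style label).
database = [
    ((1.0, 0.1), "H"),
    ((0.1, 1.0), "E"),
]

# Hypothetical feature vectors for five consecutive overlapping windows.
queries = [(0.9, 0.2), (0.8, 0.1), (0.3, 0.7), (0.95, 0.15), (0.9, 0.3)]

raw = [nearest_label(q, database) for q in queries]
smoothed = smooth(raw)
```

In this toy example the third window is classified as a strand in isolation, but the smoothing step reassigns it to helix because both of its neighbours are helical, which mirrors the use of the sequential nature of secondary structure.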
|
|
| 84. FAUST, an algorithm for functional annotations of protein structures using structural templates. (up) |
| Krzysztof Olszewski, Mariusz Milik, Sándor Szalma, Xiangshan
Ni, Molecular Simulation Inc.;
kato@msi.com |
| Short Abstract:
FAUST is an automated procedure involving: extraction of functionally significant templates from protein structures and using such templates to annotate novel structures. FAUST templates are used for active and binding site searches in protein structures. Preliminary results of protein structure database annotations with derived Structural Templates are presented. |
| One Page Abstract:
FAUST (Functional Annotations Using Structural Templates) is an automated procedure involving the extraction of functionally significant templates from protein structures and the use of such templates to annotate novel structures. Both whole proteins and structural templates can be represented as colored undirected graphs with atoms as vertices and inter-atom distances as edge weights. Vertex colors are based on the chemical identities of the atoms. In this representation, a structural template is defined as a common sub-graph of graphs corresponding to functionally related proteins. Edge labels are considered equivalent if the inter-atomic distances for corresponding vertices (atoms) differ by less than a threshold value. Hence, in the extraction procedure, pairs of functionally related protein structures are searched for sets of chemically equivalent atoms whose inter-atomic distances are conserved in both searched structures. Structural Templates resulting from such pairwise searches are combined to maximize classification performance on a training set of chosen protein structures. The FAUST extraction algorithm does not use any external expert input, and it works best for sets of dissimilar structures from non-homologous proteins. The resulting Structural Template provides a new description of the protein function, which includes the natural plasticity of the protein active site. In the FAUST approach, Structural Templates are used for active and binding site searches in protein structures. Structural Templates are also applicable to the evaluation and refinement of protein models. Here we demonstrate FAUST extraction results for the highly divergent family of serine proteases, which exhibit a conserved Structural Template. We compare FAUST Structural Templates to the standard description of serine protease active site conservation and demonstrate the depth of information captured in such a description.
Also, we present preliminary results of protein structure database annotations with derived Structural Templates. |
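The equivalence test on graph edges described above (same vertex colors, inter-atomic distances differing by less than a threshold) can be sketched for two atom sets whose correspondence is already known. This is only the final check, not the common sub-graph search that FAUST performs to find the correspondence; the atom types, coordinates and tolerance are invented for illustration.

```python
import math
from itertools import combinations


def matches_template(template, candidate, tolerance):
    """Check colors and pairwise distances of two ordered atom lists.

    Each atom is (color, (x, y, z)); atom i of `template` is assumed to
    correspond to atom i of `candidate`.
    """
    if len(template) != len(candidate):
        return False
    # Vertex colors (chemical identities) must agree.
    if any(t[0] != c[0] for t, c in zip(template, candidate)):
        return False
    # Every inter-atomic distance must be conserved within the tolerance.
    for i, j in combinations(range(len(template)), 2):
        d_template = math.dist(template[i][1], template[j][1])
        d_candidate = math.dist(candidate[i][1], candidate[j][1])
        if abs(d_template - d_candidate) >= tolerance:
            return False
    return True


# Hypothetical three-atom template and two candidate atom sets.
template = [("O", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (3.0, 4.0, 0.0))]
shifted = [("O", (1.0, 1.0, 1.0)), ("N", (4.1, 1.0, 1.0)), ("O", (4.0, 5.0, 1.0))]
distorted = [("O", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (3.0, 6.0, 0.0))]

shifted_ok = matches_template(template, shifted, tolerance=0.5)
distorted_ok = matches_template(template, distorted, tolerance=0.5)
```

Because only distances are compared, the test does not depend on how the candidate is translated or rotated, and the tolerance is what lets a template absorb some plasticity of the site.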
|
|
| 85. Predicting structural features in protein segments (up) |
| Fredrik Pettersson, Anders Berglund, Research Group for Chemometrics,
Department of Organic Chemistry, University of Umeå;
fredrik.pettersson@chem.umu.se |
| Short Abstract:
Using multivariate techniques such as PLS we are building a predictive model that will be able to determine protein structure for small segments solely using sequence as input. The model is based on a library consisting of a diverse set of 1496 good proteins. Calculations are done using a super-computer. |
| One Page Abstract:
One approach to protein structure prediction is to build up the structure of a whole protein from smaller sequences. These building blocks can either be different secondary structure elements or sequence elements with a specific window size. After determining the structure of each of the segments, the overall structure of the assembled protein can be determined by putting together the pieces in an optimal way. In this project we are focusing on the first step, which is to find the structure of the constituents based solely on sequence. Our goal is to make a predictive model that, with sequence as input, will be able to identify the most structurally similar match in a sequence library. A multivariate projection method, PLS, is used for relating sequence similarity to structural similarity. PLS calculates latent variables that give a good approximation of the input data (X) and correlate well with the response (Y). In our case the resulting model describes the correlation between sequence features and structure. When using multivariate techniques, sequence information has to be represented in a numerical way. This is achieved using z-scales. These scales are based on a physico-chemical characterization of the amino acids. Based on a library consisting of a diverse set of 1496 good proteins, sub-libraries with the data divided into smaller segments (5-10 aa) have been constructed. For each protein segment, sequence and structure are characterized. Protein sequence is characterized using z-scales and a couple of sequence similarity matrices. These values are then compared to those of the other members in the sub-library. Structural similarity is represented as a CRMS value and a value representing similar secondary structure. Based on these data a PLS model is constructed that will be used for ab initio structure prediction. The program for doing sequence and structure characterization with subsequent comparisons has been implemented in Perl and Fortran 77.
The outer Perl layer invokes the Fortran 77 program, which performs the computationally heavy processes. Calculations will be performed on a parallel computer at HPC2N at the University of Umeå. Because structure is highly dependent on the overall properties of the whole protein, we cannot expect to obtain a perfect prediction model. We will be satisfied if, in this initial stage, we are able to score the true match in the library within the top 10 library matches. Preliminary results indicate that this may well be possible, but more work needs to be done. |
|
|
| 86. GA Generates New Amino Acid Indices through Comparison between Native and Random Sequences (up) |
| Satoru Kanai, PharmaDesign, Inc.;
Hiroyuki Toh, Department of Bioinformatics, Biomolecular Engineering Research Institute; skanai@pharmadesign.co.jp |
| Short Abstract:
If the folding information of a protein is encoded by the arrangement of the amino acid residues along the primary structure, the information would degrade under random shuffling of the residues. We developed a new method to extract folding information by comparing native sequences with random sequences. |
| One Page Abstract:
The amino acid sequence of a protein carries its folding information. If the information is encoded by the arrangement of the amino acid residues along the primary structure, the random shuffling of the residues would degrade the information. We developed a new method to compare the native sequence with random sequences generated from the native sequence, in order to extract such information. First, amino acid indices were randomly generated. That is, the initial indices have no significance on the feature of residues. Next, using the indices, the averaged distance between a native sequence and the random sequences was calculated, based on the autoregressive (AR) analysis and the linear predictive coding (LPC) cepstrum analysis. The indices were subjected to the genetic algorithms (GA) using the distance as the fitness, so that the distance between the native sequence and the random sequences becomes larger. We found that the indices converged to hydrophobicity indices by the GA operation. The AR analysis with the converged indices revealed that the autocorrelation in the native sequence is related to the secondary structure. |
|
|
| 87. STING Millennium: Web based suite of programs for comprehensive and simultaneous analysis of structure and sequence (up) |
| Goran Neshich, EMBRAPA/CNPTIA -Campinas, SP - Brazil;
Roberto C. Togawa, EMBRAPA/CENARGEN - Brasília, DF - Brazil; Wellington Vilella Torres, Tharsis Fonseca e Campos, Leonardo Lima Ferreira, Adilton Guedes Oliveira, Ronald Tetsuo Miura, Marcus Kiyoshi Inoue, Luiz Gustavo Horita, Georgios Pappas Jr., EMBRAPA/CNPTIA - Campinas, SP - Brazil; Barry Honig, Columbia University, New York - USA; neshich@cnptia.embrapa.br |
| Short Abstract:
STING Millennium is a web based suite of programs for visualization of molecular structure and comprehensive structure analysis: sequence and structure positions for residues, pattern search, 3D neighbors, H-bonds, structure quality, nature of atomic contacts of intra/inter chain type and residue conservation. Available: http://honiglab.cpmc.columbia.edu/SMS http://asparagin.cenargen.embrapa.br/SMS http://leonina.cnptia.embrapa.br http://morphy.sdsc.edu:8080/SMS/ |
| One Page Abstract:
STING Millennium is a web based suite of programs that starts with visualizing molecular structure and then leads a user through a series of operations resulting in a comprehensive structure analysis: amino acid sequence and structure positions, pattern search, 3D neighbor identification, H-bonds, and angles and distances between atoms are easy to obtain thanks to the intuitive graphic and menu interface. In addition, a user can obtain: sequence to structure relationships, analysis of the quality of the structure, nature and volume of atomic contacts of intra and inter chain type, and analysis of relative amino acid position conservation and its relationship with intra-chain contacts, effectively establishing Folding Essential Residue (FER) indicators, etc. The main aspect of STING Millennium is the ability to combine data delivery through the web with structural analysis tools in order to provide a self-contained instrument for macromolecular studies. More than a simple front-end to the Chime plugin, STING offers analytical services which we will only briefly describe here, expecting that users will refer to the extensive on-line help for further details. STING Millennium is composed of two main windows: a sequence window, which displays the sequence and contains the general menus with the commands, and a structure window, which displays the rendered three-dimensional macromolecular structure. In general terms STING Millennium provides the following services:
* Ability to easily select residues in the sequence and select elements of secondary structure, as well as a wide variety of methods for rendering and coloring a molecule (mostly available through the ACTION menu).
* Definition of 3D neighbors of an arbitrarily selected residue.
* Definition and display of amino acids participating in interfacial regions between polypeptide chains (through the WINDOWS/Interface chain menu selection).
* Building surfaces of the whole molecule or just the IFR part of it.
* Interactive Ramachandran plots, permitting rapid identification of residues in the disallowed regions and display of selected residues in the structure window.
* Calculation of residue frequency within a selected chain or on an interface, as well as the frequency of those residues filtered through chosen contact parameters.
* Hydrogen bond net calculation, with special attention given to the participation of water molecules.
* Contact definition and calculation for the whole molecule and/or interfaces.
* Convenient 2D graphical presentation of parameters extracted from the 3D structure.
* Display of sequence neighbors and calculation of relative sequence conservation for the family of homologous proteins.
In the links entry in the main menu, several external services that deal with PDB files are listed. These consist of links to web sites containing programs that accept a PDB code as input to perform useful tasks, which makes STING Millennium highly integrated with other important data resources. STING Millennium is both a didactic tool and a research tool. It is easy to use and requires virtually no training time. STING Millennium is available at: http://honiglab.cpmc.columbia.edu/SMS http://asparagin.cenargen.embrapa.br/SMS http://leonina.cnptia.embrapa.br and http://morphy.sdsc.edu:8080/SMS/ |
|
|
| 88. Side chain-positioning as an integer programming problem (up) |
| Olivia Eriksson, Stockholm Bioinformatics Center, Stockholm University;
Yishao Zhou, Department of Mathematics, Stockholm University; Arne Elofsson, Stockholm Bioinformatics Center, Stockholm University; olivia@sbc.su.se |
| Short Abstract:
We present a novel integer-programming-based algorithm that finds the optimal set of side-chain rotamers on a fixed backbone in polynomial time. The complexity of this algorithm is similar to that of the commonly used pruning algorithms. Further, it is guaranteed to find the optimal solution. |
| One Page Abstract:
Positioning side chains on a fixed backbone is one of the fundamental steps in homology modeling and protein design algorithms. Most homology modeling methods use some algorithm to place side chains onto a fixed backbone. Homology modeling methods are the necessary complement to the large-scale structural genomics projects that are being planned. Recently it has also been shown that for automatic design of protein sequences it is of the utmost importance to find the global solution to the side chain positioning problem. If a suboptimal solution is found, the difference in free energy between different sequences will be smaller than the errors in the side chain positioning. Many different algorithms have been developed to solve this problem. The most successful methods have used a fixed rotamer library rather than continuous rotamers. This makes it possible to detect a single global minimum energy conformation. The most promising method for solving this problem in polynomial time today is the dead-end elimination theorem. Here we introduce another method. We formulate the problem as a linear integer program, relax the integer constraints and solve the resulting linear program. We show that the solution to the relaxed problem will always be integer and is therefore the solution to the original problem. With this problem formulation the global minimum energy conformation is found in polynomial time. |
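As a rough illustration of the objective that such an integer program minimizes, the sketch below evaluates a rotamer assignment as the sum of self energies and pairwise energies, and finds the global minimum by exhaustive enumeration. It is not the authors' algorithm: the exhaustive search is exponential and only serves as a reference against which a solver could be checked on tiny cases. The function names and all energy values are made up for the example.

```python
from itertools import product


def total_energy(assignment, self_energy, pair_energy):
    """Energy of one rotamer assignment (one rotamer index per residue)."""
    energy = sum(self_energy[i][r] for i, r in enumerate(assignment))
    for (i, j), table in pair_energy.items():
        energy += table[assignment[i]][assignment[j]]
    return energy


def global_minimum(self_energy, pair_energy):
    """Exhaustively search all assignments; feasible only for tiny cases."""
    choices = [range(len(rotamers)) for rotamers in self_energy]
    return min(
        (total_energy(a, self_energy, pair_energy), a) for a in product(*choices)
    )


# Toy data (hypothetical): 3 residues with 2, 3 and 2 rotamers.
SELF = [[0.0, 1.0], [0.5, 0.0, 2.0], [1.0, 0.25]]
PAIR = {
    (0, 1): [[0.0, 3.0, 0.0], [0.0, 0.0, 0.0]],  # rotamer 0 of res 0 clashes with rotamer 1 of res 1
    (1, 2): [[0.0, 0.0], [0.0, 4.0], [0.0, 0.0]],  # rotamer 1 of res 1 clashes with rotamer 1 of res 2
}
best_energy, best_assignment = global_minimum(SELF, PAIR)
```

In an integer programming formulation each choice of rotamer becomes a binary variable, with the constraint that exactly one rotamer is selected per residue; the enumeration above visits exactly the feasible points of that program.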
|
|
| 89. Prediction of the quality of protein models using neural networks (up) |
| Björn Wallner, Arne Elofsson, Stockholm Bioinformatics Center;
bjorn@sbc.su.se |
| Short Abstract:
Neural networks are trained to predict quality of protein models, based on accessibility surfaces and contacts between residues and 13 different atom types. A correlation coefficient of 0.81 is obtained for an independent test set. This method might be useful to increase the specificity of fold-recognition methods. |
| One Page Abstract:
Models of proteins are made to help our understanding of how a particular protein functions. However, no good measure of the quality of a model exists. To address this problem, neural networks are trained to predict the quality of protein models. Besides making it possible to measure the quality of a model, this might also be useful for increasing the specificity of fold-recognition methods. Here we generate a large set of models using alignment methods and the homology modelling program Modeller (Sali & Blundell, 1993). The quality of these models was measured using a modified version of the LGscore (Cristobal et al., 2001). The training was based on accessibility surfaces, the contacts between residues and the contacts between 12 different atom types. The training was performed for different cutoffs. For the atom type contacts, networks were trained on eight cutoffs ranging from 3.0 Å to 4.75 Å in 0.25 Å intervals; contacts with atoms in the same residue were omitted. For the residue contacts, six cutoffs in the range 4 Å to 12 Å were used; only contacts between residues more than five residues apart in the sequence were counted, to avoid an accumulation of contacts between residues lying close together in the sequence. The accessibility surfaces were represented as the fractions of low (<25%), medium (25%-75%) and high (>75%) relative accessibility for each residue. A neural network was trained for every single combination of parameter type, and the correlation coefficient on an independent test set was calculated as a measure of how well each network performed. For the atom contacts alone the best correlation, 0.70, was obtained with a 4.5 Å cutoff; for the residue contacts a cutoff of 6 Å gave the best correlation, 0.63. For the accessibility surfaces, high and low relative accessibility gave the best correlations, with 0.70 for low and 0.52 for high.
The different parameter types probably contain overlapping information; nevertheless, if a network is trained on a combination of the best atom and residue contacts together with the accessibility surfaces, a correlation coefficient of 0.81 is obtained. References: Sali, A. & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815. Cristobal, S. et al. (2001). How can the accuracy of a protein model be measured? Manuscript in preparation. |
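The input features described above can be sketched as follows. This covers only the feature extraction and the correlation measure, not the neural network itself. It assumes one representative coordinate per residue and one relative accessibility value (in percent) per residue, which is one possible reading of the abstract. All function names and the toy data are hypothetical.

```python
import math


def residue_contacts(coords, cutoff, min_separation=6):
    """Count residue pairs closer than `cutoff` (in Angstrom) that are more
    than five residues apart in the sequence (separation >= 6)."""
    count = 0
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                count += 1
    return count


def accessibility_fractions(relative_accessibility):
    """Fractions of residues with low (<25%), medium and high (>75%) exposure."""
    n = len(relative_accessibility)
    low = sum(1 for a in relative_accessibility if a < 25) / n
    high = sum(1 for a in relative_accessibility if a > 75) / n
    return low, 1.0 - low - high, high


def pearson(xs, ys):
    """Pearson correlation coefficient between predicted and measured quality."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy chain (hypothetical): eight residues on a straight line, 1 Angstrom apart.
chain = [(float(i), 0.0, 0.0) for i in range(8)]
n_contacts = residue_contacts(chain, cutoff=6.5)
fractions = accessibility_fractions([10, 50, 80, 90])
```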
|
|
| 90. Targeting proteins with novel folds for structural genomics (up) |
| Liam J. McGuffin, David T. Jones, Bioinformatics Group, Brunel University;
liam.mcguffin@brunel.ac.uk |
| Short Abstract:
Finding novel folds is an important aim of structural genomics. We have evaluated a number of methods that discriminate between proteins with novel and known folds. We propose that simple secondary structure alignments could identify novel folds more selectively than both sequence alignments and a simple fold recognition method, GenTHREADER. |
| One Page Abstract:
Beyond the era of genome sequencing the focus has turned to proteomics, and in particular to the high-throughput determination of protein structure, or structural genomics. The ultimate objective of structural genomics is to determine the structure of every protein coded by every single gene within a genome, the premise being that, once solved, protein structures may be used to decode the functions of the genes identified within that genome. Determining each protein structure experimentally using current techniques is not feasible due to cost and time limitations. Models for proteins with >30% sequence identity to a protein of known structure can be built fairly easily by homology modeling (Sali, 1998; Brenner, 2000; Portugaly et al., 2000). Beyond this, threading or fold recognition methods are able to assign folds to more distantly related proteins; however, this is time consuming and is limited by the current library of templates. Fast fold recognition or genomic threading techniques such as 3D-PSSM (Kelley et al., 2000), SAM-T98 (Karplus et al., 1999) and GenTHREADER (Jones, 1999) have been developed to overcome the time issue. However, these techniques rely upon finding some homology to solved structures and may perform poorly when sequences show no apparent evolutionary relationship to any known protein family (Jones, 1999). The problem remains that a fold can, of course, only be recognized if a template fold for the protein exists. The identification of nominally distinct folds is important to structural genomics. Solving structures of new folds experimentally will increase the range of folds that can be used as models or templates for computational structure determination (Sali, 1998). Therefore, methods must be developed that discriminate between folds which have been seen before (known folds) and those which are novel. 
Methods capable of identifying novel folds would also greatly benefit the protein structure prediction field, as one of the first questions that must be addressed when predicting the structure of a new protein sequence is whether or not it has a known fold. Sequence-based clustering methods such as PROTOMAP (Portugaly et al., 2000) have been developed in an attempt to estimate the probability of a protein having a "new" fold. As homologous proteins must by definition have a common fold, generally speaking, sets of sequences with less than, say, 30% identity have a higher chance of having a novel fold than sets of proteins without sequence clustering. However, two similar folds may have very low sequence similarity (even by the standards of sensitive sequence profile comparison), and thus a potential novel fold determined by simple sequence searching could easily turn out to have a known structure. In such cases methods that are based solely on sequence information are unreliable. Alignments of secondary structure elements have been shown to provide a rapid estimate of fold for sequences with no detectable homology to any known structure. Although this kind of method cannot be relied upon for accurate fold recognition, it has been found to offer an improvement over sequence alignment in its ability to assign folds to evolutionarily distant proteins (McGuffin et al., 2001). It has also been suggested that the class or folding type of distantly related proteins can be discerned simply by measuring differences in amino acid composition (Eisenhaber et al., 1996; Wang et al., 2000), and so composition-based filtering has also been proposed as a possible way of increasing the likelihood of finding new folds. We have compared the ability of a simple fold recognition method (GenTHREADER) and a variety of simple sequence analysis methods to discriminate between domains with novel folds and those with known folds. 
We have also evaluated methods based on simple pairwise alignments of secondary structure elements. We propose that simple alignments of secondary structure elements could potentially be a more selective method than both GenTHREADER and standard sequence alignment at finding novel folds when sequences show no detectable homology to proteins with known structures.
Brenner, SE. Target selection for structural genomics. Nature Struct Biol Suppl 2000;967-969.
Eisenhaber, F, Frömel, C, Argos, P. Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins 1996;25:169-179.
Jones, DT. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999;287:797-815.
Karplus, K, Barrett, C, Cline, M, Diekhans, M, Grate, L, Hughey, R. Predicting protein structure using only sequence information. Proteins Suppl 3 1999:121-125.
Kelley, LA, MacCallum, RM, Sternberg, MJE. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000;299:499-520.
McGuffin, LJ, Bryson, K, Jones, DT. What are the baselines for protein fold recognition? Bioinformatics 2001;17:63-72.
Portugaly, E, Linial, M. Estimating the probability for a protein to have a new fold: A statistical computational model. Proc Natl Acad Sci USA 2000;97:5161-5166.
Sali, A. 100,000 protein structures for the biologist. Nature Struct Biol 1998;5:1029-1032.
Wang, Z, Zheng, Y. How good is prediction of protein structural class by the component coupled method? Proteins 2000;38:165-175. |
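A pairwise alignment of secondary structure elements can be sketched with a standard global (Needleman-Wunsch) alignment over strings in which each character stands for one element (H for helix, E for strand). The scoring values, the function names and the toy library are assumptions made for this example; the abstract does not specify the scoring scheme that was actually evaluated.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of two secondary structure element strings."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_library_hit(query, library):
    """Return (score, name) of the best-scoring known fold in the library."""
    return max((align_score(query, elements), name) for name, elements in library.items())


# Hypothetical library of known folds, one letter per secondary structure element.
LIBRARY = {"foldA": "HEEH", "foldB": "EEEE"}
hit = best_library_hit("HEEH", LIBRARY)
```

A query whose best score against every known fold stays below some threshold would then be flagged as a candidate novel fold; the threshold itself would have to be calibrated on domains of known status.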
|
|
| 91. Protein Structural Domain Parsing by Consensus Reasoning (up) |
| HwaSeob Joseph Yun, Casimir A. Kulikowski, Ilya Muchnik, Gaetano T.
Montelione, Rutgers University;
seabee@cs.rutgers.edu |
| Short Abstract:
Combined domain parsing methods based on HMM and BLAST can provide concrete and definite predictions with consensus reasoning. Classifiers trained on DDD were tested on SCOP domains from the same families as the DDD seeds, giving 75.5% accuracy. Tests on baker's yeast produced 64.6% correct functional predictions by EC classification. |
| One Page Abstract:
Domain parsing, or the detection of signals of protein structural domains from sequence data, is a complex and difficult problem. If carried out reliably it would be a powerful interpretive and predictive tool for genomic and proteomic studies. We report on a novel approach to domain parsing using consensus techniques based on Hidden Markov Models (HMMs) and BLAST searches built from a training set of 1471 continuous structural domains from the Dali Domain Dictionary (DDD). Because of their different underlying mechanisms, the various domain parsing tools each have their own advantages over the others, and our method begins by running the individual programs to their full extent so that these distinctive characteristics are preserved undisturbed. After acquiring possible signals of domains, the 1471 results from each tool are ranked and screened by an objective threshold appropriate to each program we used. These selections of best hits are then paired only when the targeted domains are from the same seed in DDD. The matched pairs differ in their domain boundary predictions on both the N- and C-terminal sides, and by plotting these differences on an N-C terminal difference plane, a Pareto set can be extracted to find the point with minimal differences and maximal overlapping length among the detected signals. The unknown sequence is then classified by referencing the SCOP definition of the strongest signal collected with this consensus reasoning method. We have tested the approach in two ways. In the first, validation of domain parsing on an independent test sample of 347 family-matched structural domain sequences from the SCOP database yields a consensus prediction performance rate of 75.5%, well above the 58% obtained by simple logical agreement of methods. A second independent test was to check the potential of combining methods for functional annotation. 
Using 339 biochemically well-characterized baker's yeast sequences which had matching EC codes to our model sequences, we compared results at different levels of the EC codes between HMM, BLAST, and disjunctive predictions against the query domains. This showed that there is a slightly higher likelihood of including the right prediction by using the disjunctive prediction than by using either method alone. There are 64.6% correct exact functional predictions in the top 10 BLAST or HMM results, while comparable matches at the highest EC level yield 93.5% as the upper bound for prediction. |
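The Pareto-set step can be illustrated as follows. Each paired HMM/BLAST hit is reduced to a triple of N-terminal boundary difference, C-terminal boundary difference and overlap length; a hit belongs to the Pareto set if no other hit is at least as good on all three criteria and strictly better on one. The data and the final tie-breaking rule in `pick_consensus` are assumptions for the example, not the authors' procedure.

```python
def dominates(p, q):
    """True if p is no worse than q on every criterion and better on one.
    Points are (n_term_diff, c_term_diff, overlap): small diffs, large overlap."""
    no_worse = p[0] <= q[0] and p[1] <= q[1] and p[2] >= q[2]
    better = p[0] < q[0] or p[1] < q[1] or p[2] > q[2]
    return no_worse and better


def pareto_set(points):
    """Keep the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_consensus(points):
    """Hypothetical selection rule: smallest summed boundary difference,
    ties broken by the largest overlap."""
    return min(pareto_set(points), key=lambda p: (p[0] + p[1], -p[2]))


# Toy paired predictions (hypothetical values, in residues).
PAIRS = [(2, 3, 100), (5, 1, 90), (6, 4, 80), (2, 3, 95)]
front = pareto_set(PAIRS)
consensus = pick_consensus(PAIRS)
```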
|
|
| 92. Attempt to optimise template selection in protein homology modelling using logical feature descriptors (up) |
| Alexander Diemand, H. Scheib, T. Schwede, N. Guex, GlaxoSmithKline
R&D, Geneva;
azd93529@gsk.com |
| Short Abstract:
We address the template selection step in protein homology modelling. The structures of putative templates can vary even if their sequences are highly similar. Clustering them and deriving discriminatory explanations by induction help expert modellers in decision making. We assessed our method for a number of protein families. |
| One Page Abstract:
Even though no structural information is available for the majority of proteins, the structures of several families of homologous proteins have been extensively studied. To close the gap between the huge number of protein sequences available and the still very limited number of protein structures resolved, we apply homology modelling, based on the observation that high sequence similarity implies structural similarity. In this work, we focused on optimising the critical template selection step, in particular in cases where numerous potential template structures are available. The structures of these putative templates can vary even if their sequences are highly similar, owing to the presence or absence of substrate or regulatory compounds, domain movements, the experimental method, or the source organism. Identifying by hand the key features which distinguish these structures and determine their suitability as homology modelling templates is time consuming and error-prone. This process has been automated and its usability assessed for a number of protein families from the publicly accessible Protein Data Bank (PDB). The method first clusters the proteins of a particular family based on structure comparison, with each protein being described by annotations from SwissProt, Prosite, and the PDB file itself. Then, an algorithm generates by logical induction the most general hypothesis which distinguishes the clusters from each other in feature space. The result is an explanatory feature description, which can be used to guide template selection either by asking an expert modeller to verify or manually alter the proposed selection, or in an automated mode to make the most prominent decision. This method will be integrated into the SwissModel/DeepView protein modelling suite and thus will be made available to other researchers. |
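A much simplified stand-in for the induction step is sketched below: for each structural cluster it reports the single feature/value pairs that all members of the cluster share and that no member of any other cluster has. Real logical induction would also consider conjunctions of features and tolerate noise; the feature names and the cluster contents here are invented for the example.

```python
def discriminating_features(clusters):
    """clusters maps a cluster name to a list of feature dictionaries.
    Returns, per cluster, the sorted (feature, value) pairs shared by all of
    its members and absent from every member of the other clusters."""
    result = {}
    for name, members in clusters.items():
        shared = set(members[0].items())
        for member in members[1:]:
            shared &= set(member.items())
        others = [
            set(member.items())
            for other, other_members in clusters.items()
            if other != name
            for member in other_members
        ]
        result[name] = sorted(
            pair for pair in shared if all(pair not in o for o in others)
        )
    return result


# Hypothetical annotations for two structural clusters of one protein family.
CLUSTERS = {
    "open": [
        {"ligand": "none", "method": "X-ray", "organism": "human"},
        {"ligand": "none", "method": "NMR", "organism": "human"},
    ],
    "closed": [
        {"ligand": "ATP", "method": "X-ray", "organism": "human"},
        {"ligand": "ATP", "method": "X-ray", "organism": "mouse"},
    ],
}
explanation = discriminating_features(CLUSTERS)
```

In this toy case the only discriminating feature is the ligand state, which is the kind of explanation an expert modeller could verify before choosing a template.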
|
|
| 93. Prediction of amyloid fibril-forming proteins (up) |
| Yvonne Kallberg, Magnus Gustafsson, Johan Thyberg, Bengt Persson, Jan
Johansson, Karolinska Institutet;
yvonne.kallberg@mbb.ki.se |
| Short Abstract:
Amyloid fibrils are formed from different proteins and are very similar in spite of differences in the native structures. The fibrils are based on beta-strands, which means that proteins containing mainly helices must undergo structural changes. By comparing secondary structures, we are able to predict fibril formation among such proteins. |
| One Page Abstract:
Amyloid fibrils can be formed from different proteins, and are associated with severe diseases such as the neurodegenerative Alzheimer's disease and bovine spongiform encephalopathy. In spite of differences in their native structures, these proteins form very similar amyloid fibrils, with beta-strands perpendicular and beta-sheets parallel to the fibre axis. Thus amyloid-forming proteins that contain mainly alpha-helical structures must undergo alpha-helix to beta-strand conversions before or during fibril formation. In order to investigate this, we searched for experimentally determined alpha-helices with predicted beta-strands in 1324 proteins, and found 37 proteins that contained alpha/beta discordant segments. The set includes three known amyloidogenic proteins: the prion protein, amyloid beta peptide (Abeta) and lung surfactant protein C (SP-C). Three other proteins (transpeptidase, triacylglycerol lipase and coagulation factor XIII) were also found to form amyloid fibrils. It is known that replacement of valine residues in the discordant segment of SP-C with leucine yields a peptide with a helical conformation. It is also known that Abeta variants that lack the discordant stretch, or that carry key substitutions which revert the discordance, do not form fibrils. Our data strongly suggest that long stretches of alpha-helix/beta-strand discordance predict amyloid fibril formation. |
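The search for discordant segments can be sketched as a scan over two aligned secondary structure strings, one experimentally determined and one predicted, reporting stretches where the observed state is helix (H) and the predicted state is strand (E). The minimum length of 7 used as a default is an assumption; the abstract speaks only of "long stretches". The function name and the example strings are hypothetical.

```python
def discordant_segments(observed, predicted, min_length=7):
    """Return (start, end) pairs, end exclusive, of stretches at least
    `min_length` long where observed is 'H' and predicted is 'E'."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    segments = []
    start = None
    for i, (obs, pred) in enumerate(zip(observed, predicted)):
        if obs == "H" and pred == "E":
            if start is None:
                start = i  # a discordant stretch begins here
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and len(observed) - start >= min_length:
        segments.append((start, len(observed)))
    return segments


# Toy example: a nine-residue observed helix, seven residues predicted as strand.
OBSERVED = "CC" + "H" * 9 + "CC"
PREDICTED = "CC" + "E" * 7 + "HH" + "CC"
found = discordant_segments(OBSERVED, PREDICTED)
```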
|
|
| 94. Evaluation of structure prediction models using the ProML specification language (up) |
| Daniel Hanisch, Ralf Zimmer, Thomas Lengauer, GMD - National Research
Center, St. Augustin, Germany;
hanisch@cartan.gmd.de |
| Short Abstract:
We propose the ProML specification language for proteins and protein families based on the open XML standard. ProML allows for efficient specification and visualization of heterogeneous protein data. As an application, we discuss the representation of features of protein clusters and the use of experimental constraints for validation of structural models. |
| One Page Abstract:
We propose a specification language, ProML, for protein sequences, structures, and families based on the open XML standard. The language allows for portable, system-independent, machine-parsable and human-readable representation of essential features of proteins. In contrast to existing XML applications in this field, our emphasis is not on the molecular structure of one protein or molecule (as in CML), nor on annotation of one gene or one protein for use with a proprietary browser (as in BioML), but on efficient representation of heterogeneous data associated with one or several proteins. As we developed ProML in the context of structure prediction, we focused on properties useful in threading and clustering algorithms. Extensions for other applications, however, are straightforward to realize within ProML. To achieve this goal, one ProML document is able to describe several proteins and their properties in a structured manner. ProML defines low-level elements as building blocks for more complex properties. Predefined elements include primary and secondary sequence information, three-dimensional coordinates, CATH structural classification and Prosite patterns. A Property tree relates properties to proteins in a hierarchical manner. We define an optimality criterion for this tree, which allows for efficient use of the represented information in algorithms. ProML is of immediate use for several bioinformatics applications: we discuss clustering of proteins into families and the representation of the specific shared features of the respective clusters. ProML's Property tree defines a hierarchical view on these features, thereby making within-cluster similarities and differences among potential subclusters easily visible to humans and accessible to algorithms. 
In a second application, we use experimentally derived constraints, represented in ProML, in a protein structure prediction approach for the validation of proposed theoretical models and the improvement of the fold recognition rate on a representative benchmark protein set. To this end, we computed conserved cores for structural clusters of our benchmark library and produced ProML documents for the clusters containing the structural cores. By exploiting randomly generated as well as simulated cross-link distance constraints measurable by mass spectrometry, we were able to improve fold recognition on our test set. For this, we applied a post-filtering approach to the results produced by our threading algorithm 123D.
References:
[1] T. Bray, J. Paoli, and C.M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. February 1998. http://www.w3.org/TR/1998/REC-xml-19980210.html.
[2] P. Murray-Rust. CML - the Chemical Markup Language. http://www.xml-cml.org
[3] Proteomics Inc. BioML - Biological Markup Language. http://www.bioml.com/bioml/.
[4] D. Hoffmann, and R. Zimmer. Fluorescence Energy for Elucidating the 3D-Structure of Biological Macromolecules. German Patent Office, PCT/EP99/01008, 10. Feb 1999
[5] N. Alexandrov, R. Nussinov, and R. Zimmer. Fast Protein Fold Recognition via Sequence to Structure Alignment and Contact Capacity Potentials. In Pacific Symposium on Biocomputing'96, 53-72, 1996. |
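Post-filtering of threading results with distance constraints might look like the sketch below: each candidate model is checked against a list of cross-link constraints, and models that satisfy too few of them are dropped while the original ranking order is kept. The constraint format, the acceptance threshold and the function names are assumptions; the abstract does not describe how 123D results were actually filtered.

```python
import math


def satisfied_fraction(ca_coords, constraints):
    """Fraction of constraints (i, j, max_distance) met by one model,
    using one C-alpha coordinate per residue."""
    met = sum(
        1 for i, j, d_max in constraints
        if math.dist(ca_coords[i], ca_coords[j]) <= d_max
    )
    return met / len(constraints)


def filter_models(ranked_models, constraints, min_fraction=0.8):
    """Keep the names of models that satisfy enough constraints,
    preserving the ranking produced by the threading step."""
    return [
        name for name, coords in ranked_models
        if satisfied_fraction(coords, constraints) >= min_fraction
    ]


# Toy models (hypothetical coordinates in Angstrom), best-ranked first.
MODELS = [
    ("extended", [(0, 0, 0), (10, 0, 0), (20, 0, 0), (30, 0, 0)]),
    ("compact", [(0, 0, 0), (3, 0, 0), (3, 3, 0), (0, 3, 0)]),
]
CROSS_LINKS = [(0, 2, 8.0), (0, 3, 8.0)]  # residue pairs linked within 8 Angstrom
kept = filter_models(MODELS, CROSS_LINKS)
```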
|
|
| 95. Incremental Volume Minimization of Proteins (represented by Collagen Type I (local minimization)) (up) |
| Meir Israelowitz, P. Campbell, L. Ernst, J. M. Ernsthausen, W. Galbraith,
S. W. Hussain, Carnegie Mellon University;
I. Verdinelli, University of Rome; Troy Wymore, Pittsburgh Supercomputer Center; D. L Farkas, Carnegie Mellon University; meir@andrew.cmu.edu |
| Short Abstract:
Type I collagen, a major extra-cellular matrix protein, is a long-sequence fibrous biopolymer. The importance of considering mathematical models for collagen structure lies in the fact that about 90% of the structures of most inner organs are made of some type of collagen. |
| One Page Abstract:
Type I collagen, a major extra-cellular matrix protein, is a long-sequence fibrous biopolymer. The importance of considering mathematical models for collagen structure lies in the fact that about 90% of the structures of most inner organs are made of some type of collagen. Hence the relevance of simulating collagen structures for tissue engineering, as collagen fibers are the meta-structured basis of tissue. A number of methods can be used to determine the optimal conformations of polypeptides under various conditions, and several techniques have been used to estimate the native conformations of large globular proteins. Our approach to modeling protein structure consists in approximating the process of protein folding. This can be expanded to large, multi-molecular network structures. While almost all existing models consider random packing of rigid spheres, we managed to reduce the molecular volume by using concepts from low-dimensional topology (braids) and differential geometry. A braid group has the property of maintaining the continuity of a sequence while the minimization is performed (the topology guarantees the continuity during the minimization). Our model creates segments from the braid (the segments are of hydrogen-bond length). These segments are amino-acid peptides (or beads), and we consider the distance geometry functional rather than the simple minimization of the distance between atoms' centers. We have applied this approach to PDB files 1BBF, 1CGD and 1AQ5, and to Collagen Type I. |
|
|
| 96. Automatic Inference of Protein Quaternary Structure from Crystallographic Data. (up) |
| Hannes Ponstingl, European Bioinformatics Institute, EMBL-EBI, Hinxton,
UK.;
Thomas Kabir, Biomolecular Structure and Modelling Unit, Biochemistry and Molecular Biology Department, University College London, ; Janet M. Thornton, European Bioinformatics Institute, EMBL-EBI, Hinxton, and Biomolecular Structure and Modelling Unit, Biochemistry and; hpo@ebi.ac.uk |
| Short Abstract:
A procedure was developed that generates the likely quaternary assembly of a protein from its atom coordinates and crystal symmetry deposited in the Protein Data Bank (PDB). It applies a graph-theoretic algorithm to interface scores derived from crystal structures of globular proteins of distinct oligomeric state in solution. |
| One Page Abstract:
The atomic coordinates of protein crystal structures deposited in the Protein Data Bank (PDB) describe the asymmetric unit of the crystal - not the physiologically relevant assembly of the polypeptide chains. Moreover, the PDB annotation of the functional assembly is still sparse and unreliable. This work is an attempt to provide the possible macromolecular assemblies likely to be prevalent in solution. The assemblies are ranked according to statistics obtained from a representative set of crystal structures of globular proteins whose oligomeric state in solution is distinct and experimentally established. All intermolecular contacts present in the crystal are re-generated by applying crystallographic symmetry operations to the deposited coordinates. From this set, those contacts are discarded that are most likely to be artifacts of the crystal environment. For this task, we derived a scoring function for protein-protein interfaces from the trusted set of water-soluble oligomers. The scoring function, a so-called statistical potential, is based on pairs of atom types and distance information. Hypothetical assemblies are generated by successively applying a graph-theoretic minimum-cut algorithm to the scored crystal contacts. Thresholds for assembly classification are obtained from statistics on the scores of these minimum-cuts. The performance and generalisation behaviour of the procedure in identifying the functional assembly is assessed using cross-validation methods on the data set of trusted oligomers. A comparison is made to scoring interfaces by using a traditional measure of contact size. The derived interface-scoring function is also expected to prove useful in screening predicted complexes in protein-protein docking protocols. |
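The role of the minimum cut can be illustrated on a small graph whose nodes are chains in the crystal and whose edge weights are interface scores: the cheapest cut separates the chains along the weakest set of contacts, which are the ones most likely to be crystal-packing artifacts. The sketch below finds a single global minimum cut by exhaustive enumeration of bipartitions, which is practical only for a handful of nodes and is not the algorithm used by the authors. The scores and function name are made up.

```python
from itertools import combinations


def minimum_cut(nodes, weights):
    """Exhaustive global minimum cut of a small weighted undirected graph.
    weights maps (u, v) to a score. Returns (cut_weight, side_a, side_b)."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Fix `first` on side A; side A takes between 0 and len(rest) - 1 further
    # nodes, so that side B is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side_a = {first, *extra}
            cut = sum(
                w for (u, v), w in weights.items()
                if (u in side_a) != (v in side_a)
            )
            if best is None or cut < best[0]:
                best = (cut, side_a, set(nodes) - side_a)
    return best


# Toy contact graph (hypothetical scores): two tight dimers, weakly packed.
CONTACTS = {("A", "B"): 9.0, ("C", "D"): 8.0, ("B", "C"): 1.0, ("A", "D"): 0.5}
cut_weight, side_a, side_b = minimum_cut("ABCD", CONTACTS)
```

Applying such a cut repeatedly to the resulting parts, and stopping when the cut score exceeds a threshold, would yield a hierarchy of candidate assemblies in the spirit of the procedure described above.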
|
|
| 97. Modelling Class II MHC Molecules using Constraint Logic Programming (up) |
| Martin T. Swain, Anthony J. Brooks, Graham J.L. Kemp, University
of Aberdeen;
mswain@csd.abdn.ac.uk |
| Short Abstract:
The MHC-Thread program uses a heuristic scoring function to predict peptides that are likely to bind to a class II MHC allele, based on the allele's known or modelled 3D structure. To increase its utility, we have developed an automatic technique for modelling peptide binding grooves using constraint logic programming. |
| One Page Abstract:
The identification of peptides which bind to MHC molecules is useful when hunting for regions of a protein which may be responsible for causing an unwanted immune response. The MHC-Thread program analyses three-dimensional models of candidate peptides in the peptide binding grooves of class II MHC molecules (Brooks, 1999). Heuristic functions are used to score the complex based upon chemical and spatial complementarity, and thus predict peptides likely to bind to specific alleles. The utility of this program is increased through having an automated method to build models of class II MHC alleles. Sequence comparisons suggest that the overall structure of class II MHC alleles is well conserved and that the main differences between alleles are due to mutations in the vicinity of the peptide binding groove. Thus, side-chain placement is central in constructing models of MHC alleles. We have developed a novel approach to the side-chain placement problem that uses constraint logic programming (CLP). Our method generates a constraint-based description of atomic packing that is used iteratively to create CLP programs: each program representing successively tighter packing constraints. In these programs rotamer conformations are represented as values for finite domain variables, and bad steric contacts involving rotamers are represented as constraints. The CLP side-chain placement method has been validated by predicting side-chain conformations of X-ray structures with an accuracy comparable to that of other methods (Swain and Kemp, in press). Preliminary results obtained from the MHC-Thread program with homology models created using the CLP side-chain modelling system are encouraging, and show good agreement with experimentally derived binding data. References Brooks, A.J. (1999) Computational Prediction of HLA-DR Binding Peptides. PhD Thesis, University of Aberdeen. Swain, M.T. and Kemp, G.J.L. 
(in press) Modelling protein side-chain conformations using constraint logic programming. Computers & Chemistry. |
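The constraint view of side-chain placement can be sketched in plain Python: each residue has a finite domain of rotamer identifiers, each bad steric contact is a forbidden pair of (residue, rotamer) choices, and a backtracking search looks for an assignment that violates no constraint. A real CLP system would add constraint propagation and the iterative tightening of packing constraints described above; the search, the function name and the data here are only a hypothetical illustration.

```python
def place_side_chains(domains, clashes):
    """domains: one list of rotamer ids per residue (finite domains).
    clashes: set of frozensets {(res_i, rot_a), (res_j, rot_b)} that are
    forbidden together. Returns a clash-free assignment or None."""
    assignment = []

    def consistent(res, rot):
        # The candidate must not clash with any already placed rotamer.
        return all(
            frozenset({(prev, assignment[prev]), (res, rot)}) not in clashes
            for prev in range(res)
        )

    def search(res):
        if res == len(domains):
            return True
        for rot in domains[res]:
            if consistent(res, rot):
                assignment.append(rot)
                if search(res + 1):
                    return True
                assignment.pop()  # undo and try the next rotamer
        return False

    return list(assignment) if search(0) else None


# Toy problem (hypothetical): three residues, three bad steric contacts.
DOMAINS = [[0, 1], [0, 1], [0]]
CLASHES = {
    frozenset({(0, 0), (1, 0)}),
    frozenset({(0, 0), (1, 1)}),
    frozenset({(1, 0), (2, 0)}),
}
solution = place_side_chains(DOMAINS, CLASHES)
```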
|
|
| 98. DoME: Rapid Molecular Docking with Adaptive Mesh Solutions to the Poisson-Boltzmann Equation (up) |
| Julie C. Mitchell, Lynn F. Ten Eyck, San Diego Supercomputer Center;
mitchell@sdsc.edu |
| Short Abstract:
The Docking Mesh Evaluator (DoME) uses adaptive mesh solutions to the Poisson-Boltzmann Equation to evaluate docking energies, interpolating potentials against a mesh that is dense in high gradient regions. DoME achieves a high level of precision in approximating electrostatic potentials and performs energy calculations far more rapidly than traditional methods. |
| One Page Abstract:
With the continued increase in the number of known protein structures comes a wealth of opportunity to predict molecular interactions and docked protein structures via computational means. Many molecular docking methods are able to achieve biologically accurate solutions to protein docking problems. However, it is difficult to obtain both speed and precision in a single algorithm. The most accurate methods are computationally expensive, while faster methods introduce non-trivial computational errors or ignore electrostatic information in favor of more tractable geometric algorithms. We will present a method for molecular docking that is both highly efficient and uses a detailed implicit solvent model for approximating electrostatic energies. The Docking Mesh Evaluator (DoME) uses adaptive, finite element solutions to the Poisson-Boltzmann equation generated by the Adaptive Poisson-Boltzmann Solver (APBS) to model electrostatics. A simplex lookup scheme is employed to allow rapid interpolation of solutions defined on an irregular mesh. This mesh is also used as a basis for interpolating Lennard-Jones potentials. The underlying scheme for interpolating potential functions has been expanded into a collection of tools for use in molecular docking. In particular, DoME is able to interpolate electrostatic potentials over a grid or surface, comprehensively scan the docking configuration space and compute local minima to docking potential energies. The initial results for biological problems appear quite promising, and the computations are remarkably fast. For one protein-protein docking problem, computing local minima using an all atom AMBER potential consumed 30 minutes while DoME was able to perform the same computation in just a few seconds. |
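Interpolating a potential on an irregular tetrahedral mesh, as described above, can be sketched with barycentric coordinates. The code below is an illustration only: the linear scan over tetrahedra stands in for DoME's simplex lookup scheme, whose details the abstract does not give, and all function names are hypothetical.

```python
def sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def det3(a, b, c):
    """Determinant of the 3x3 matrix with columns a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def barycentric(p, tet):
    """Barycentric coordinates of point p in tetrahedron tet (Cramer's rule)."""
    v0, v1, v2, v3 = tet
    e1, e2, e3, d = sub(v1, v0), sub(v2, v0), sub(v3, v0), sub(p, v0)
    vol = det3(e1, e2, e3)
    if vol == 0:
        raise ValueError("degenerate tetrahedron")
    l1 = det3(d, e2, e3) / vol
    l2 = det3(e1, d, e3) / vol
    l3 = det3(e1, e2, d) / vol
    return (1.0 - l1 - l2 - l3, l1, l2, l3)


def interpolate(p, mesh, eps=1e-12):
    """Linearly interpolate vertex values at p.

    mesh: list of (tetrahedron_vertices, vertex_values) pairs.
    Returns None if p lies outside every tetrahedron.
    """
    for tet, values in mesh:
        weights = barycentric(p, tet)
        if min(weights) >= -eps:  # all weights non-negative: p is inside
            return sum(w * v for w, v in zip(weights, values))
    return None


# One unit tetrahedron carrying the linear field f = 2x + 3y - z + 1.
tet = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
mesh = [(tet, (1.0, 3.0, 4.0, 0.0))]
inside = interpolate((0.25, 0.25, 0.25), mesh)
outside = interpolate((1.0, 1.0, 1.0), mesh)
```

Linear interpolation reproduces a linear field exactly, which makes this a convenient sanity check for any lookup scheme.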
|
|
| 99. Electrostatic potential surface and molecular dynamics of HIV-1 protease Brazilian mutants |
| Elza Helena Andrade Barbosa, Alan Wilter da Silva, Laurent Emanuel
Dardenne, Paulo Mascarello Bisch, Pedro Geraldo Pascutti, Federal University
of Rio de Janeiro;
ehab@biof.ufrj.br |
| Short Abstract:
We ran 1-nanosecond molecular dynamics simulations for eleven HIV-1 protease Brazilian mutants and obtained structural images, Ramachandran plots, RMSD calculations, hydrogen bonds and the electrostatic potential on the surface. We observed conformational changes near the active site of the mutants and, for some, no electrostatic complementarity with the inhibitors, which supports the observed drug resistance. |
| One Page Abstract:
ELECTROSTATIC POTENTIAL SURFACE AND MOLECULAR DYNAMICS OF HIV-1 PROTEASE BRAZILIAN MUTANTS Barbosa, E. H. A. (1), da Silva, A. W. S. (1), Dardenne, L. E. (2), Bisch, P. M. (1), Pascutti, P. G. (1). (1) Instituto de Biofísica Carlos Chagas Filho - UFRJ; (2) Laboratório Nacional de Computação Científica - CNPq. Drug resistance in HIV-1 protease has emerged in many countries. Using molecular modelling and dynamics tools, we investigated eleven HIV-1 protease Brazilian mutants that are resistant to the usual inhibitors. Theoretical models were built by homology, using as a template the NMR structure of the HIV-1 mutant protease C95A found in the Protein Data Bank (code 1BVE). A 1-nanosecond dynamics simulation was performed for all systems (including 1BVE) using the THOR program, a software package developed in our laboratory that uses the GROMOS force field. Structural images and Ramachandran plots were obtained for each mutant, and root mean square deviations were calculated. The hydrogen bonds and van der Waals contacts between the HIV-1 protease mutants and the inhibitors were monitored in the active site; they were relatively stable during the dynamics. Most of the models fluctuate around their respective minimized structures during the simulation; however, conformational changes induced by the mutations were observed. The main conformational changes near the active site were found at positions 26-29, 35, 46, 47 and 53. The electrostatic potential was calculated on the solvent-accessible surface of all mutants and of the usual antiretroviral drugs to identify charge and hydrophobic complementarities between them. For some mutants there is no complementarity, which would explain the loss of drug activity that leads to resistance. These results support the view that drug resistance in HIV-1 protease could be induced by conformational changes and by the loss of electrostatic and hydrophobic complementarity in the mutants. |
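The root mean square deviation mentioned in this abstract can be computed per trajectory frame as in the following sketch. It assumes that each frame has already been superposed on the reference structure (no fitting step is performed here) and that atoms are listed in the same order; the function names are hypothetical and unrelated to the THOR package.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def rmsd_trajectory(reference, frames):
    """RMSD of every frame against the reference (e.g. the minimized structure)."""
    return [rmsd(reference, frame) for frame in frames]


# Two-atom toy system: the second frame is shifted by 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
profile = rmsd_trajectory(reference, [reference, shifted])
```

A profile that stays flat over the simulation corresponds to the abstract's observation that most models fluctuate around their minimized structures.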
|
|
| 100. Automated functional annotation of protein structures |
| Mike Hsin-Ping Liang, Russ B. Altman, Stanford University;
mliang@smi.stanford.edu |
| Short Abstract:
Current methods for constructing 3D models of protein function and for annotating protein structures are manual and time-intensive. We propose an automated method for constructing 3D models of protein functional sites that can be used for high-throughput annotation of protein structures. |
| One Page Abstract:
In the past, protein structures were determined because of specific biological interest and background. Recently, various structural genomics initiatives have begun determining protein structures rapidly, without knowledge of their function. There is a growing need to annotate these proteins efficiently to keep up with the rapid increase in the number of structures. Existing protein sequence motif databases provide putative functions for protein sequences. However, it is well known that structure is more conserved than sequence, and it is the properties associated with particular residues and their relative positions in the structure that convey function. Thus, creating a 3D motif analogous to the 1D sequence motif should improve the annotation of protein function. We propose an automated method for constructing 3D models of protein functional sites by augmenting 1D sequence motifs. The model provides a 3D statistical description of the biochemical and physical properties surrounding a functional site. The model can be used to quickly scan a protein structure for potential sites. It can also be used to gain insight into which properties are involved in a particular function. This method has been applied to the EF-hand calcium binding motif. |
|
|
| 101. Structural annotation of the human genome |
| Arne Mueller, Lawrence A. Kelley, Michael J.E. Sternberg, Imperial
Cancer Research Fund;
a.mueller@icrf.icnet.uk |
| Short Abstract:
The proteins of the human genome draft (Ensembl-0.8) have been assigned to homologous proteins of known structure. More than one third of the proteome is covered. We have compared the fold and domain composition of different organisms. A special focus has been put on the proteins encoded by human disease genes. |
| One Page Abstract:
In February 2001 the draft sequence of the human genome was published. In this work we have annotated the proteins of the public draft (1), based on the Ensembl version 0.8.0 data-set (http://www.ensembl.org), with protein structure by assigning homologous sequences of the SCOP (2) and PDB databases to human proteins via Blast/PSI-BLAST (3) and fold recognition using 3D-PSSM (4). The fold composition of proteins encoded by human disease genes is analysed. Results are compared with those of other organisms. The draft human genome sequence from the Ensembl data-set contains 28913 different protein sequences, of which Blast/PSI-BLAST can assign 44% to at least one protein of known structure (35% of the amino acid residues of the proteome). An additional 41% of the human sequences can be assigned to functionally annotated sequences of the public databases, and a further 16% have homology to sequences of unknown function or hypothetical proteins. Only 8% are without any detectable homology to any other sequence in the public databases, including 3% (of the total) that are in non-globular regions. With 3D-PSSM we can confidently assign 5% of the residues in the human proteome to a protein of known structure (7% of the sequences) that cannot be assigned by PSI-BLAST: 3% are in the fraction that was classified as functionally (but not structurally) annotated by PSI-BLAST, and 2% are located in the fraction of `homology but unknown function'. We are currently working on an optimised version of 3D-PSSM that is better adapted to long protein sequences, to improve our results and to extend the fraction of `unknown function' to which we can assign a protein of known structure, because structure often comes with functional annotation. Compared to the proteomes of D. melanogaster, C. elegans and S. cerevisiae, for which a fraction of 18% to 20% is completely uncharacterised, the draft human protein set is well annotated (in terms of structure and function). 
These results may be related to the difficulties of identifying novel genes in the human genome (i.e. gene finding). The human proteome is structurally better annotated than the other three eukaryotic genomes (27% to 28% of the proteome) but less well than most bacterial genomes (lowest is 40% for M. tuberculosis, highest is 45% for E. coli). The most popular structural superfamily (as defined by SCOP release 1.53) in the human proteome is the Immunoglobulin superfamily (which is often found as a repetitive unit), and the top ranking superfamilies are similar to those in D. melanogaster but differ (even in total number) from those in C. elegans. We present a detailed analysis of a SCOP based domain comparison between different proteomes. There are 109 superfamilies unique to the four multicellular eukaryotes above, six are unique to yeast (S. cerevisiae and S. pombe), six are unique to the three archaea we have processed and 68 are unique to the seven processed bacteria. Of the 6656 human proteins in the Ensembl database that are linked to a disease in the OMIM database (5), 3278 different proteins have at least one homologue of known structure. More than 5000 SCOP domains can be identified within these proteins. The most popular structural superfamilies resemble those of the proteome in general (e.g. Immunoglobulins, Protein kinase domains, Fibronectin). The data from our analysis is stored in a relational database managed by MySQL, allowing for complex queries and the incorporation of new resources and genomes when available (other genomes are currently in the process pipeline). The data will be made publicly available via the world wide web.
References: 1. International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921. 2. Hubbard, T.J.P., Ailey, B., Brenner, S.E., Murzin, A.G. & Chothia, C. (1999). SCOP: A structural classification of proteins database. Nucleic Acids Res. 27:254-256. 3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402. 4. Kelley, L.A., MacCallum, R.M. & Sternberg, M.J.E. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299:499-520. 5. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 2000. Online Mendelian Inheritance in Man, OMIM (TM). World Wide Web URL: http://www.ncbi.nlm.nih.gov/omim/ |
|
|
| 102. Estimation of p-values for global alignments of protein sequences. |
| Caleb Webber, Geoffrey J. Barton, EMBL-European Bioinformatics Institute;
caleb@ebi.ac.uk |
| Short Abstract:
The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The method presented here allows the probability that two protein sequences share the same fold to be estimated from the global sequence alignment Z-score. |
| One Page Abstract:
Classification and analysis of full-length protein sequences often involves the global alignment of sequence pairs. The calculation of a Z-score for the comparison gives a length and composition corrected measure of the similarity between the sequences. However, the Z-score alone does not indicate the likely biological significance of the similarity. A new background distribution to estimate the significance of pair-wise sequence alignment scores was developed by comparison of 250 proteins in different fold-families from the SCOP database. All 31,125 unique pairs of sequences were aligned with a range of matrices and gap penalties. The distributions of Z-scores from these alignments were fitted with a peak distribution, from which the probability of obtaining a given Z-score from a global alignment between 2 structurally-unrelated protein sequences was calculated. This analysis was also applied to global alignment of best locally-aligned subsequences, generated by the Smith-Waterman algorithm. The relationships between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, a positive shift was observed for Z-scores derived by global alignment of locally-aligned subsequences, compared to global alignment of the entire sequence. This shift was shown to be the result of pre-selection by local alignment rather than any structural similarity in the sequences. Benchmarking the search ability of both methods using the SCOP superfamily classification showed that global alignment Z-scores are as effective as SSEARCH at low error rates and more effective at higher error rates. Global alignment of best locally-aligned subsequence was significantly less effective in this capacity. The estimation of statistical significance was shown to give similar results to the estimations of SSEARCH and BLAST, providing confidence in the method. 
This work provides a database-independent method of assessing the significance of pair-wise sequence global alignment scores. Software to apply the statistics to any alignment is available from http://barton.ebi.ac.uk. |
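The abstract does not state how its Z-scores are computed. One common approach, assumed here purely for illustration, is to compare the observed global alignment score with the scores obtained after shuffling one of the sequences. The sketch below uses a toy identity scoring scheme with a linear gap penalty rather than the substitution matrices and gap penalties examined in the study.

```python
import random
import statistics


def unique_pairs(n):
    """Number of unordered pairs among n items, n * (n - 1) / 2."""
    return n * (n - 1) // 2


def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def z_score(a, b, n_shuffles=100, seed=0):
    """(observed - mean) / sd, with the background from shuffled copies of b.

    Shuffling keeps length and composition fixed, which is why the Z-score
    is a length- and composition-corrected measure of similarity.
    """
    rng = random.Random(seed)
    observed = global_score(a, b)
    letters = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        background.append(global_score(a, "".join(letters)))
    sd = statistics.pstdev(background)
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (observed - statistics.mean(background)) / sd


seq = "MKVLAAGIVGLLLAQPSW"  # hypothetical sequence
z_self = z_score(seq, seq)
```

Converting such a Z-score into a probability would additionally require the fitted background distribution that the abstract describes, which is not reproduced here. `unique_pairs(250)` gives the 31,125 pairs quoted in the text.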
|
|
| 103. Sequence and Structure Conservation Patterns of the Gelsolin Fold |
| Benyaminy Hadar, Graduate student;
Wolfson Haim, Nussinov Ruth, Professor; hadar1@tau.ac.il |
| Short Abstract:
The gelsolin family proteins are involved in actin cytoskeleton remodeling and can also form amyloids. We analyzed sequence and structural conservation patterns of the protein. We describe a subset of conserved residues, largely of beta structure. These are likely to be responsible for stability (and function) of the gelsolin fold. |
| One Page Abstract:
The gelsolin family consists of actin binding proteins that are involved in remodeling the actin cytoskeleton and can also form amyloids. The family shares a repeated motif of 125-150 residues that is found in a wide range of phyla as either three or six repeats. The repeats share low sequence homology but are similarly folded. Using a novel multiple structure comparison algorithm, the coordinate files that represent the structural diversity of the different domains of gelsolin were subjected to sequence-order-independent multiple structure alignment. A common structural core of 38 amino acids was found, capturing the topologically conserved positions of the fold. The sequences of the aligned structures were used to initiate iterative PSI-BLAST searches of the nrdb (nonredundant database compilation). After clustering and filtering out short sequences, a final large and diverse (average pairwise percent identity of 20%) database of 270 sequences was constructed. Structural and sequence patterns were combined. The highest conservation values were found for a group of hydrophobic (some of which are aromatic) residues populating a common central beta hairpin (strands C and D). These conserved residues are likely to be responsible for stability (and function) of the gelsolin fold. |
|
|
| 104. Finding all protein kinases in the human genome |
| Gerard Manning, Glen Charydczak, David Whyte, Sean Caenepeel, Ricardo
Martinez, Sucha Sudarsanam, SUGEN, Inc.;
gerard-manning@sugen.com |
| Short Abstract:
We used profile HMMs, ab initio gene finding, homology and ESTs to predict all protein kinases in the human genome and used domain homology to classify and organise genes into families. We compare our predictions with those of Celera and Ensembl, and with the kinases of fly, worm and yeast. |
| One Page Abstract:
We have used a combination of automated gene finding methods and manual analysis to predict full length sequences for all human protein kinases. Profile HMMs were used to efficiently predict kinase catalytic domains in initial single-read genomic sequences and ESTs, with a very low error rate. To predict longer sequences in genomic assemblies, we used a mixture of ab initio gene prediction (Genscan), protein homology (Genewise and Blast) and mapping of ESTs to the genomic region. In most cases, some amount of manual curation was needed for optimal predictions, due to specific weaknesses of each prediction technique and the imperfect nature of genomic assemblies. We found that ~20% of kinase sequences appear to be pseudogenes, with single exons, and multiple stops and frameshifts in the sequence. We compare our predictions with those of Celera and Ensembl. We have mapped all kinase genes to chromosomal bands, and searched for genes linked to particular disease loci or cancer amplicons. Comparison of all human kinase domains with those of the yeast, worm and fly genomes reveals the presence of several new conserved groupings of kinases, and a putative orthology mapping between many novel human kinases and their model organism counterparts. Genomic comparison also shows specific expansions of sub-groups in the different lineages. |
|
|
| 105. Remote Homology Detection Using Significant Sequence Patterns |
| Florian Sohler, Alexander Zien, GMD - National Research Center,
Skt. Augustin, Germany;
florian.sohler@gmd.de |
| Short Abstract:
We present a new method to detect remote homologs for proteins. To score a candidate protein, the frequency of short but significant patterns in the sequence is used. With this scoring scheme and support vector machines we can classify proteins into their SCOP superfamilies better than BLAST. |
| One Page Abstract:
We present a new method to detect remote homologs for proteins using short but significant sequence patterns. The goal is to build models for protein classes that enable us to correctly classify new query proteins. Recently it has been proposed to use probabilistic suffix trees to model protein families. Probabilistic suffix trees can be viewed as variable order Markov models, and thus they are able to model short conserved patterns. In contrast to other widely used models like profile hidden Markov models (HMMs) or sequence profiles, no alignment information is used to create the model and no alignment is performed to score candidate sequences. The main advantage of probabilistic suffix trees is their speed. Training can be performed in time linear in the size of the input sequences, and scoring is linear in the sequence length as well. Unfortunately, in our experience, probabilistic suffix trees only work well for closely related proteins. There are at least two possible reasons for this. The first is that they do not explicitly model amino acid substitutions, insertions or deletions. The other possibility is that distant homologs cannot be found without using alignment information. To allow for some amino acid substitutions we cluster the amino acids into groups like 'hydrophobic', 'polar' etc. and use patterns over this reduced alphabet instead of the amino acid alphabet. We use suffix trees to find patterns that appear significantly frequently in a given class of proteins, but then apply more involved machine learning tools to build a model from these patterns. To score the significance of a given motif we count the number of appearances of that motif in the protein class and compare that number to the expected number of occurrences given a simple probabilistic model. If this significance score is above a certain threshold we accept the corresponding pattern into the list of significant patterns. 
Since a very long (and thus specific) pattern would be scored as significant even if it appeared only once, we also require each pattern to occur more often than a given minimum number. Therefore we have two parameters to tune the sensitivity and specificity of the patterns chosen by our algorithm. If a pattern appears in 90% of all training sequences, it is expected to appear in most of the unknown sequences belonging to that class as well. On the other hand, if the pattern is so specific that it appears almost only in the training set, unclassified sequences that have that pattern will probably belong to that class. The length of the patterns found this way is typically between five and ten. The number of patterns can vary between 50 and several hundred depending on the training set and the given parameters. To build a classifier from our list of significant patterns we use support vector machines with the 'Radial Basis Functions' kernel. The features for a sequence are simply the frequencies of each of our significant patterns, normalized with respect to a simple null model. We evaluate our method by trying to predict SCOP superfamilies in a simple cross-validation protocol. As a training set for a superfamily classifier, we take one family away from the superfamily; this family will later be our test set. The remaining families, and optionally additional Blast hits, are used for training. Sequences of all other superfamilies are also divided into training and test sets. Results show that our method does surprisingly well, which shows that remote homologs can be detected without computing alignments. The algorithm clearly outperforms Blast and is almost competitive with HMMs. On families that are hard for HMMs to classify its performance is comparable, while on easier families it produces more false positives. This suggests that a combination of alignment based methods and our new method can improve the prediction performance significantly. References: T. Jaakkola, M. 
Diekhans, D. Haussler: A Discriminative Framework for Detecting Remote Protein Homologies JCB, 2000, Vol. 7, no. 1/2, pp. 95-114 G. Bejerano, G. Yona: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families Bioinformatics, 2001, Vol. 17, no. 1, pp. 23-43 A. Apostolico, M. E. Bock, S. Lonardi, X. Xu: Efficient detection of unusual words JCB, 2000, Vol. 7, no. 1/2, pp. 71-94 C. J. C. Burges: A tutorial on support vector machines for pattern recognition
Data mining and knowledge discovery, 1998, Vol. 2, pp. 121-167
|
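The pattern selection step described in this abstract (reduced alphabet, observed versus expected counts, a significance threshold and a minimum occurrence count) can be sketched as follows. The amino acid grouping, the independent-symbol null model, the log-ratio score and all names are assumptions made for illustration; the abstract does not specify them, and a plain counter replaces the suffix tree.

```python
import math
from collections import Counter

# Hypothetical grouping of the 20 amino acids into a reduced alphabet.
GROUPS = {"h": "AVLIMFWC", "p": "STNQYGPH", "+": "KR", "-": "DE"}
REDUCED = {aa: symbol for symbol, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group symbol ('x' for anything unknown)."""
    return "".join(REDUCED.get(aa, "x") for aa in seq)


def significant_patterns(sequences, k=5, min_score=2.0, min_count=2):
    """Return {pattern: score} for k-mers over-represented in the class.

    score = log2(observed / expected), where the expected count assumes that
    symbols occur independently with their class-wide frequencies.
    """
    reduced = [reduce_sequence(s) for s in sequences]
    symbol_counts = Counter("".join(reduced))
    total = sum(symbol_counts.values())
    freq = {s: c / total for s, c in symbol_counts.items()}

    observed = Counter(r[i:i + k] for r in reduced
                       for i in range(len(r) - k + 1))
    n_windows = sum(observed.values())

    selected = {}
    for pattern, count in observed.items():
        expected = n_windows * math.prod(freq[c] for c in pattern)
        score = math.log2(count / expected)
        # Two tuning parameters: significance threshold and minimum count.
        if score >= min_score and count >= min_count:
            selected[pattern] = score
    return selected


# Toy class: both sequences share the charged motif KRDEK.
toy_class = ["KRDEK" + "A" * 10, "KRDEK" + "S" * 10]
patterns = significant_patterns(toy_class)
```

The frequencies of the selected patterns in a query sequence could then serve as the feature vector for a support vector machine, as the abstract describes.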
|
|
| 106. TRIBE-MCL: A Novel Algorithm for accurate detection of protein families |
| Anton Enright, European Bioinformatics Institute;
Stijn Van Dongen, CWI, Amsterdam, Netherlands; Christos A. Ouzounis, European Bioinformatics Institute; enright@ebi.ac.uk |
| Short Abstract:
We present a novel method for clustering proteins into families based on sequence similarity information. This method uses 'Markov' clustering to classify proteins into families with extremely high accuracy. The method is not led astray by the conventional problems of this type of analysis, such as promiscuous domains and protein fragments. |
| One Page Abstract:
Detection of protein families in complete genomes is a valuable method in functional genomics. Members of a protein family should possess equivalent functional roles in the cell. If one knows the function of one member of a family, it should be possible to transfer this function to other members of the family whose functions may not be known. Generally, protein families are detected by clustering proteins together based on their sequence similarity. Many methods exist for this type of analysis; however, most of these methods are not fully automatic and rely on manual intervention for the correct detection of valid protein families. Other automatic methods fail to correctly detect protein families in complex eukaryotic datasets due to the presence of multi-domain proteins and proteins which contain a promiscuous domain. Previously we developed a method called GeneRAGE for protein family analysis in bacteria. This method could not realistically be extended to higher eukaryotes, such as the human genome, due to the computation time required to break down the complex modular domain structure of many eukaryotic proteins. To this end we have developed a novel method for protein sequence clustering based on the Markov Clustering (MCL) algorithm. This is a purely probabilistic approach which can automatically and accurately cluster proteins into families based on sequence similarity alone, without explicit knowledge of protein domains. We first represent biological sequence similarities in terms of a graph, with nodes representing proteins and edges representing weighted sequence similarity scores which connect proteins. The MCL algorithm calculates random walks through this graph, and uses two mathematical operators to model flow within the graph. Because members of a protein family are generally more highly similar to each other than to members of other (related or unrelated) families, flow within a family is higher than flow between families (i.e. 
through a common or promiscuous domain). The algorithm models tidal forces through the graph until equilibrium is reached, and then calculates a clustering based on these observed patterns of flow. We have tested the algorithm extensively using the INTERPRO, SCOP and SWISSPROT databases, and have observed very high accuracy for the assignment of proteins into valid protein families. Recently we used this algorithm to produce protein family information for the draft human genome in the Ensembl 080 release. This analysis involved the clustering of over 100,000 proteins into 13,000 families, and took a little over six hours to complete on a small workstation. Validation using databases such as SCOP and INTERPRO has indicated that the method performs with an accuracy of >90%. We believe that this method will be extremely useful for protein family analysis and functional genomics. |
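The two operators of Markov clustering are usually called expansion (squaring the column-stochastic matrix, which spreads flow along random walks) and inflation (raising entries to a power and renormalizing, which strengthens strong flows). The following pure-Python sketch illustrates them on a toy similarity graph. It is not the TRIBE-MCL implementation; the inflation value, the added self-loops, the convergence test and the cluster read-out threshold are assumptions.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: matrix squaring, i.e. two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: entry-wise power followed by column renormalization."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster nodes of a weighted similarity graph; returns sorted clusters."""
    n = len(weights)
    # Self-loops are added before normalization (an assumed convention).
    m = normalize_columns([[weights[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    # Rows that keep flow are attractors; their non-zero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two tightly linked triples joined by one weak similarity (nodes 2 and 3).
W = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = w
families = mcl(W)
```

The weak edge between nodes 2 and 3 plays the role of a shared promiscuous domain: flow across it dies out under repeated inflation, so the two triples end up in separate families.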
|
|
| 107. In Silico Analysis of Bacterial Virulence Factors: Redefining Pathogenesis |
| Kelly Paine, Edward Jenner Institute for Vaccine Research;
kelly.paine@jenner.ac.uk |
| Short Abstract:
A virulence factor is any agent produced by a pathogen that is essential for causing disease in a host. Bacterial protein virulence factors have attracted great interest as targets for antimicrobial research. We have been utilising protein fingerprinting methods to characterise such moieties and redefine the meaning of pathogenesis. |
| One Page Abstract:
There has been a recent surge in the number of completed bacterial genome sequences, and with this explosion of data comes the need to discover novel targets for antimicrobial research. A synergistic interaction between the well-established science of bacteriology, and the emergent discipline of bioinformatics should provide tools for such a task. Bacterial resistance to conventional antibiotics is on the increase, and combined with other factors such as the prevalence of HIV, is proving costly in terms of both money and human lives. Even the strongest drugs are now useless against some species like Staphylococcus aureus. Pathogenic mechanisms can spread quickly through a bacterial population via lateral transfer, and, as most virulent bacteria rely on the presence of these "virulence factors" to infect a human host, they must be considered essential for pathogenicity. It is these genes that will provide the novel targets required for future research into new drugs and vaccines. Bioinformatics can aid in this process; the key advantage of computer-based screening techniques is the speed at which the identification and selection of these targets can be done. Making a reality of the predictions on how a protein may act a certain way in vivo, or what sort of immune response will be elicited from a virulence factor carefully selected from database mining and gene expression profiling, has, in the past, fallen mainly to the more conventionally trained biologist. A recognised and powerful method of classifying new protein families is to use conserved regions between multiple alignments of proteins. Each homologous region is a "motif", and sets of motifs provide a signature or fingerprint for unique identification. We have been using this method to characterise novel virulence factor protein families, in collaboration with the PRINTS group at the University of Manchester, UK. 
The families already analysed include components of the Gram-negative enteropathogenic type three secretion system, the Bacillus anthrax toxin, and Escherichia coli haemolysin. |
|
|
| 108. Relationships between structural conservation and dynamical properties in the PAS family proteins. |
| Laura Bonati, Alessandro Pandini, Demetrio Pitea, Dipartimento di
Scienze dell'Ambiente e del Territorio, Università degli Studi di
Milano-Bicocca;
laura.bonati@unimib.it |
| Short Abstract:
To obtain information about Ah receptor binding, we applied sequence analysis tools to the multiple alignment of different AhR and propose a new protocol to correlate the dynamical properties of the 3D structures of reference PAS proteins, obtained by MD simulations, to the information on structural conservation within the family. |
| One Page Abstract:
By applying structure prediction and homology modelling methodologies, we previously developed [1,2] a three-dimensional model of the ligand binding domain of the mouse Aryl hydrocarbon Receptor (mAhR); this is a member of the PAS (Per-ARNT-Sim) family of transcriptional regulatory proteins. The crystal structures of the three PAS domains used as templates (the bacterial photoactive yellow protein PYP, the human potassium channel HERG and the bacterial oxygen sensing FixL protein) reveal a highly conserved structural framework. Despite the low level of sequence identity between mAhR and the templates, this high structural conservation allowed us to develop a suitable model, based on the combination of sequence and secondary structure information. On this basis we are studying [3] the binding of PolyChlorinated Dibenzo-p-Dioxins (PCDDs), a class of ligands of environmental interest, to mAhR, using Molecular Dynamics simulations to refine the mAhR model, molecular docking techniques to identify the residues directly interacting with PCDDs, and hybrid QM/MM methodologies to obtain relative binding energies for a series of PCDDs. However, the modelling procedure has also suggested the possibility of obtaining more information about binding from the sequences of other Ah receptors included in the multiple alignment, as well as the need for a more accurate analysis of the molecular basis of the high structural conservation in the PAS family. Here we present an application of sequence analysis tools to the multiple alignment of Ah receptors from different species. The differences in the response of these proteins to PCDDs and the conservation of some residues in their ligand binding domain with respect to mAhR highlight key amino acids important for dioxin binding. 
Moreover, we propose some tools to analyse the dynamical properties of the three-dimensional structures of the reference PAS proteins, obtained by MD simulations, and to correlate them to the information on the structural conservation within the family. Based on the idea that physical information derived from molecular modelling of protein conformations may give a key contribution in understanding the evolutionary conservation, these tools may constitute a new general protocol to correlate evolutionary information and structural dynamical behaviour in a family of proteins. 1) M. Procopio, A. Lahm, A. Tramontano, L. Bonati, D. Pitea, "Homology modeling of the AhR ligand binding domain", Organohalogen Compounds (1999) 42, 405-408. 2) M. Procopio, A. Lahm, A. Tramontano, L. Bonati, D. Pitea, "A Model for Recognition of PCDDs by the Aryl Hydrocarbon Receptor", Proteins, submitted. 3) L. Bonati, A. Pandini, D. Pitea, L. De Gioia, P. Fantucci, "Computational investigation of the PolyChlorinatedDibenzo-p-Dioxins - Ah receptor interaction: structure prediction of the ligand binding domain and molecular docking", Italian Journal of Biochemistry (2000) 49, 65. |
|
|
| 109. Classifying G-protein coupled receptors with support vector machines (up) |
| Rachel Karchin, Dr. Kevin Karplus, Dr. David Haussler, University
of California, Santa Cruz;
rachelk@cse.ucsc.edu |
| Short Abstract:
We discuss the relative merits of various automated methods for recognizing GPCRs: BLAST, hidden Markov models and support vector machines (SVMs). Our experiments show that, for those interested in annotation-quality classification, SVMs are worth the effort. We have set up a web server for SVM GPCR subfamily classification at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. |
| One Page Abstract:
The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-protein coupled receptors (GPCRs), a superfamily of cell membrane signalling proteins. GPCRs are the focus of a significant amount of current pharmaceutical research because they play an important role in many diseases. However, their structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile hidden Markov model, and methods, including support vector machines, that transform protein sequences into fixed-length feature vectors. The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the minimum error point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN. We have set up a web server for GPCR subfamily classification based on hierarchical multi-class SVMs at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. By scanning predicted peptides found in the human genome with the SVMtree server, we have identified a large number of genes that encode GPCRs. 
Although most of these were previously annotated, one appears to be novel as of our scan date in May 2001: an olfactory receptor on chromosome 1. A list of our predictions for human GPCRs is available at http://www.soe.ucsc.edu/research/compbio/gpcr_hg/class_results. We also provide suggested classification for 16 sequences previously identified as GPCRs but unclassified in GPCRDB. |
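The two figures of merit quoted in this abstract can be illustrated with a short sketch. It assumes one common reading of the terms: the minimum error point is the score threshold at which false positives plus false negatives are fewest, and the second measure is the fraction of all true positives ranked above the first false positive. The function names and the toy scores are hypothetical, and tied scores are not treated specially.

```python
from typing import List, Tuple

Scored = List[Tuple[float, bool]]  # (classifier score, is a true family member)


def tp_before_first_fp(scored: Scored) -> float:
    """Fraction of all true positives ranked above the first false positive."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    if total_pos == 0:
        return 0.0
    found = 0
    for _, is_pos in ranked:
        if not is_pos:
            break
        found += 1
    return found / total_pos


def minimum_error_point(scored: Scored) -> Tuple[int, float]:
    """Return (fewest FP + FN over all thresholds, lowest accepted score)."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    best_errors = total_pos          # accept nothing: every positive is missed
    best_threshold = float("inf")
    tp = fp = 0
    for score, is_pos in ranked:     # lower the threshold one hit at a time
        if is_pos:
            tp += 1
        else:
            fp += 1
        errors = fp + (total_pos - tp)
        if errors < best_errors:
            best_errors, best_threshold = errors, score
    return best_errors, best_threshold


# Toy ranking, invented for illustration only.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
fraction = tp_before_first_fp(toy)
errors, threshold = minimum_error_point(toy)
```

Dividing the error count at the minimum error point by the number of test sequences would give an "errors per sequence" figure of the kind reported above.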
|
|
| 110. Domain-finding with CluSTr: Re-occurring motifs determined with a database of mutual sequence similarity (up) |
| Evgenia V. Kriventseva, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK;
Steffen Möller, Rolf Apweiler, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge; {zhenya,moeller}@ebi.ac.uk |
| Short Abstract:
This work makes use of the pairwise comparison data stored for the CluSTr project, which allows users to analyse sequence matches. Known InterPro protein signatures are used to separate the well-known domains from the uncharacterised. This facilitates a bootstrap approach to discover new protein domains. |
| One Page Abstract:
The CluSTr project (http://www.ebi.ac.uk/clustr/) provides an automatic classification of proteins. The classification is determined according to the pairwise Smith-Waterman similarity scores, normalised by randomisation to derive a Z-score. Information on the protein clusters and the underlying similarity matrix is stored in a relational database. This work uses the pairwise comparison data, which also underlie the definition of the clusters. Here we present a method to display those regions of sequences that are most often found to be similar to other proteins. This is shown as a function of both the location on the protein sequence and the Z-score. Additional context is offered by the visualisation of matches to the InterPro (http://www.ebi.ac.uk/interpro/) member databases and position-dependent sequence annotation from the SWISS-PROT FT lines. Regions of special interest in sequences can be specified to be automatically retrieved for further analysis, which facilitates a bootstrap approach to determine new protein domains. A further advantage of this approach is that it helps to overcome an inherent limitation of algorithms that determine local sequence similarity. These either focus on an area of maximum similarity and thereby ignore remaining similarities, or are no longer specific. As a consequence, in multi-domain proteins some shared domains may be omitted as regions of pairwise sequence similarity. With the assumption that the omitted domains may also occur independently of the ones found, the respective regions will be highlighted since the database of sequence similarities contains similarities between any two proteins. The local sequence similarity together with the clustering of protein sequences should be a very useful aid in the hunt for new protein domains, especially within the context of the most important information from SWISS-PROT/TrEMBL and InterPro. 
Protein clusters for sequences of completely sequenced eukaryotes for which no InterPro domains were found can be accessed from the Proteome Analysis pages (http://www.ebi.ac.uk/proteome/). 1. Kriventseva E. V., Fleischmann W., Zdobnov E., Apweiler R.: CluSTr: a database of Clusters of SWISS-PROT+TrEMBL proteins. Nucl. Acids Res. 2001, 29(1):33-36. 2. Apweiler R., Attwood T. K., Bairoch A., Bateman A., Birney E., Biswas M., Bucher P., Cerutti L., Corpet F., Croning M. D. R., Durbin R., Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N., Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R., Marx B., Mulder N. J., Oinn T. M., Pagni M., Servant F., Sigrist C. J. A., Zdobnov E. M.: InterPro - An integrated documentation resource for protein families, domains and functional sites. Nucl. Acids Res. 2001, 29(1):37-40. 3. Apweiler R., Biswas M., Fleischmann W., Kanapin A., Karavidopoulou Y., Kersey P., Kriventseva E. V., Mittard V., Mulder N., Phan I., Zdobnov E.: Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucl. Acids Res. 2001, 29(1):44-48. 4. Fleischmann W., Möller S., Gateau A., Apweiler R.: A novel method
for automatic functional annotation of proteins. Bioinformatics 1999 Mar;15(3):228-33.
|
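The Z-score normalisation mentioned in this abstract can be sketched as follows. This is a minimal illustration and not the CluSTr implementation: it assumes a toy match/mismatch scoring scheme with a linear gap penalty in place of a real substitution matrix, and it shuffles one of the two sequences to build the random background. All function names and parameter values are hypothetical.

```python
import random
import statistics


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> int:
    """Best local alignment score (toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def z_score(a: str, b: str, n_shuffles: int = 100, seed: int = 0) -> float:
    """Observed score expressed in standard deviations above shuffled scores."""
    rng = random.Random(seed)
    observed = smith_waterman(a, b)
    letters = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        background.append(smith_waterman(a, "".join(letters)))
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0
    return (observed - statistics.mean(background)) / spread
```

Shuffling preserves the amino acid composition of the sequence, so a high Z-score reflects the order of residues rather than composition alone.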
|
|
| 111. PFAM domain distributions in the yeast proteome and interactome (up) |
| Christian Ahrens, Christoph Michetschläger, Andrei Grigoriev,
GPC Biotech AG;
christian.ahrens@gpc-biotech.com |
| Short Abstract:
When comparing the distributions of PFAM domains in the yeast proteome and interactome, we found cell signalling and protein-protein interaction domains occurring with a higher frequency. The analysis of their co-occurrences within one protein or interaction pairs reveals certain preferred domain combinations. Possible functional implications will be discussed. |
| One Page Abstract:
The proteome of budding yeast (Saccharomyces cerevisiae) as defined by the Saccharomyces Genome Database (6311 proteins) and the interactome (defined by several large-scale protein-protein interaction datasets) were analysed for the presence of PFAM-A domains, using the HMMER 2.0 hidden Markov model software and the PFAM-A library of HMMs (v6.0). Several PFAM domain families occur with a higher frequency in the interactome, including domains involved in cell signalling and protein-protein interaction. In addition, the frequencies of co-occurrences of PFAM domains within one protein and within interaction pairs were determined, and preferred domain combinations could be identified for either dataset. The results of these analyses and possible functional implications will be discussed. |
|
|
| 112. Identifying Protein Domain Boundaries using Sequence Similarity for Structural Genomics Target Selection (up) |
| Gulriz Aytekin-Kurban, Terry Gaasterland, Rockefeller University;
gulriz@frida.rockefeller.edu |
| Short Abstract:
The method, CLUE, predicts putative domain boundaries on a protein using pairwise alignments of the protein sequence with all available proteins computed with psi-blast. It can identify domains on sequences not classified into existing structural and functional domain families. We evaluated the method comparing resulting boundaries to structural domain boundaries. |
| One Page Abstract:
The Structural Genomics Initiative seeks to solve three-dimensional (3D) protein structures for as many distinct new folds as possible. These structures will in turn increase the number of computationally modeled 3D protein structures. Achieving this goal requires that candidate structure targets be selected from all available proteins such that the likelihood that a new structure will reveal a new fold is maximized. A prerequisite for target selection is the reliable identification of structural domain boundaries in proteins with no known structure. Once domains are established, the corresponding sequences can then be clustered into domain families. The domain families can be prioritized according to likely efficacy of high-throughput structure determination, and whether or not they have member proteins of known structure. This paper introduces and evaluates a new method for predicting protein domain boundaries in proteins across complete genomes in large scale. The method was applied to proteins from 12 genomes and to proteins in PDB. For the proteins already in PDB, the resulting domain boundaries are compared with structural domain boundaries from the SCOP and CATH databases. The method introduced here, called CLUE, uses pairwise sequence alignments of a query protein with all available proteins computed with the alignment tool psi-blast. A sliding window scoring function is applied across the query sequence to identify regions with a coalescence of internal alignment boundaries, especially boundaries that include an N-terminal or C-terminal end of the aligned (target) sequence. The output of the scoring function is evaluated automatically to identify the best candidate domain boundaries in the query protein. The output of the procedure is a list of best predicted domain boundary positions and subsequences of the query protein. 
The main strength of CLUE is that it can identify putative domain boundaries on sequences that have local sequence similarity to a set of proteins, yet cannot be classified into existing structural and/or functional domain families. Although it may sometimes perform worse than the existing classifier methods for well-known protein families, it is a valuable method for the subset of proteins where new domain families have yet to be discovered. We use CLUE as the first step to divide sequences into domains before building domain families. However, CLUE works for any arbitrary query sequence; it can be integrated into a sequence annotation system such as MAGPIE without a need for building families. CLUE was evaluated by comparing predicted domain boundaries on every PDB sequence to the structural domain boundaries computed by the SCOP and CATH methods. For each structural domain family, we counted the number of instances where the boundary of the domain on a sequence had a predicted domain boundary within a distance of less than 30 amino acids. We excluded the cases where the domain boundary occurred at the N- or C-terminal end; the remaining cases were internal domain boundaries. Either the begin or the end position of a domain can be inside a sequence while the other is at a terminal end of the sequence. The number of cases among PDB sequences where both boundaries of a domain were internal to a sequence was very small. Therefore, two different counts for internal domain begin and end positions were computed. For each domain family, we calculated the percentage of internal instances with a predicted domain boundary. The average of the percentages across all families was taken to show the overall performance. CLUE predicts on average 66% of the internal begin positions of the instances of a SCOP domain family, and 65% of the instances of internal end positions. For CATH domain families, the averages are 52% and 56%, respectively. 
The method presented here is efficient and scalable. It can be applied to any protein and does not require the construction of domain families for accurate structural domain boundary predictions. CLUE has been implemented with a web interface that serves predicted domain boundaries for proteins across whole genomes. The CLUE web site serves boundaries for proteins from the initial 12-genomes dataset at genomes.rockefeller.edu/CLUE. |
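The sliding-window idea and the 30-residue evaluation criterion described in this abstract can be sketched as below. This is not the CLUE scoring function, which the abstract does not specify: the sketch simply counts alignment end points falling inside a window around each query position and keeps well-separated peaks. The window size, score cutoff, separation and all function names are assumptions made for illustration.

```python
from typing import List


def window_scores(boundaries: List[int], query_len: int,
                  window: int = 15) -> List[int]:
    """Number of alignment end points inside a window centred on each position.

    Positions are 1-based; index 0 of the returned list is unused.
    """
    counts = [0] * (query_len + 1)
    for pos in boundaries:
        if 1 <= pos <= query_len:
            counts[pos] += 1
    half = window // 2
    scores = [0] * (query_len + 1)
    for centre in range(1, query_len + 1):
        lo = max(1, centre - half)
        hi = min(query_len, centre + half)
        scores[centre] = sum(counts[lo:hi + 1])
    return scores


def pick_boundaries(scores: List[int], min_score: int = 3,
                    min_separation: int = 30) -> List[int]:
    """Greedily keep the highest-scoring positions that are far enough apart."""
    candidates = sorted(
        (pos for pos in range(1, len(scores)) if scores[pos] >= min_score),
        key=lambda pos: (-scores[pos], pos),
    )
    chosen: List[int] = []
    for pos in candidates:
        if all(abs(pos - kept) >= min_separation for kept in chosen):
            chosen.append(pos)
    return sorted(chosen)


def boundary_recovered(true_pos: int, predicted: List[int],
                       tolerance: int = 30) -> bool:
    """True if some prediction lies less than `tolerance` residues away."""
    return any(abs(true_pos - pos) < tolerance for pos in predicted)


# Invented alignment end points on a 200-residue query.
ends = [98, 100, 100, 101, 103, 150, 40]
predicted = pick_boundaries(window_scores(ends, query_len=200))
```

Averaging `boundary_recovered` over the internal boundaries of each domain family, and then over families, would correspond to the percentages reported in the abstract.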
|
|
| 113. Comparative study of in vitro and in vivo protein evolution. (up) |
| Vadim P. Valuev, Dmitry A. Afonnikov, Dmitry A. Grigorovich, Nikolay
A. Kolchanov, Institute of Cytology and Genetics SB RAS;
valuev@bionet.nsc.ru |
| Short Abstract:
In amino acid composition, the in vitro evolved proteins deviate from native ones and follow the codon degeneracy more strongly; amino acid interchanges generally resemble those in native proteins, best matching families with restricted function. The study of pairwise correlations allowed some insight into the processes determining the structural and functional integrity of proteins. http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/ |
| One Page Abstract:
Experiments applying techniques of in vitro evolution of proteins began to appear in the early 1990s. This process involves sieving large (up to 10^9 individual members) pools of molecules through several consecutive rounds of selection and amplification to finally retrieve the molecules that most strongly show the desired property (Roberts and Ja, 1999). This technique, with various improvements, has been applied to pursue a number of goals, including selection for thermodynamically more stable proteins (Gu et al., 1995; Kim,D.E. et al., 1998; Braisted,A.C. and Wells,J.A., 1996), engineering of proteins with new or improved enzymatic activities (Baca,M. et al., 1997; Fujii,I. et al., 1998; Widersten,M. and Mannervik,B., 1995), mapping epitopes and binding sites (Zozulya et al., 1999; Castano,A.R. et al., 1995), finding substrates for enzymes (Matthews,D. and Wells,J.A., 1993; Matthews,D. et al., 1994), selecting antibodies (Clackson,T. et al., 1991; Vaughan,T.J. et al., 1998), finding small peptide mimetics for large protein molecules (Wrighton,N. and Gearing,D., 1999) etc. We have compiled a database, ASPD (Artificial Selected Proteins/Peptides Database), storing the published results of phage display experiments. The first release, ASPD 1.0, contains information on 120 experiments. A database entry corresponds to a set of peptides or proteins selected against one target. Generally they contain some common motif and can be aligned. Each entry contains the description of the scaffold and target for selection, the links to the databases SwissProt, PDB, Prosite and Enzyme, and the aligned set of sequences retrieved through phage display. The ASPD is SRS-formatted and can be accessed from http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/. 
The amino acid composition of the in vitro evolved proteins (only those amino acids which were retrieved via evolution were taken into account; positions which had not been randomized were ignored) compared to that of SwissProt shows the following trends: the most overrepresented amino acids in ASPD (compared to SwissProt) are tryptophan (whose percentage in ASPD is more than three times that in SwissProt), tyrosine and arginine; the most underrepresented are lysine, glutamic acid and valine. After thorough examination of the distribution it becomes clear that the overall amino acid composition of ASPD follows the number of codons for each amino acid much more closely than it follows the composition of SwissProt. The other effect, superimposed on the codon frequency, is a preference for hydrophilic amino acids. This preference may be due to the bias intrinsic to phage display experiments, where selection is often made for amino acids forming part of active sites. The observation about the codon frequencies, though very simple, suggests an important point about native protein evolution: that the sequences of native proteins are determined to a great extent by their evolutionary history and not by functional requirements. This is illustrated by the fact that small mimetics for large protein molecules (such as erythropoietin) that have no sequence similarity with them were retrieved by means of phage display (Wrighton et al., 1996; Wrighton and Gearing, 1999). We have also calculated the amino acid similarity matrix for our database. It shows the greatest correlation values, of about 80%, with the matrices of the BLOSUM family. Its application in homology searches suggests that it is best suited to cases where protein evolution is restricted by strong functional restraints to yield exactly isofunctional proteins. 
Each entry in the ASPD database was analyzed for the presence of pairwise correlations in terms of 4 amino acid properties: volume (Chothia, 1984), hydrophobicity (Eisenberg D et al., 1984), isoelectric point value (White et al., 1978) and polarity (Ponnuswamy et al., 1980). We have revealed a number of clusters of correlating positions, which correspond to the structurally important regions of proteins. Such clusters were found on the turns, where negative correlations in volume and both positive and negative ones in isoelectric point were observed, and within the core, where correlations were mostly in hydrophobicity, but in isoelectric point and polarity as well. These clusters were not found in the families of native proteins. Additional information is available at http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/. The work was supported by RFBR grant No. 00-04-49229 and its supplement No. 01-04-06240. VV is also an INTAS PhD fellow (YS-00-177). |
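The comparison between an observed amino acid composition and the number of codons per amino acid, as described in this abstract, can be sketched in a few lines. The codon counts are those of the standard genetic code (61 sense codons); the function names are hypothetical, and the sketch does not model the restricted codon sets that some randomized libraries use.

```python
import math
from typing import Dict, Iterable

# Number of sense codons per amino acid in the standard genetic code.
CODON_COUNT = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "H": 2, "K": 2, "F": 2, "Y": 2,
    "M": 1, "W": 1,
}


def codon_expected_frequencies() -> Dict[str, float]:
    """Composition expected if every sense codon were equally likely."""
    total = sum(CODON_COUNT.values())
    return {aa: n / total for aa, n in CODON_COUNT.items()}


def composition(sequences: Iterable[str]) -> Dict[str, float]:
    """Observed amino acid frequencies; non-standard symbols are skipped."""
    counts = {aa: 0 for aa in CODON_COUNT}
    for seq in sequences:
        for residue in seq:
            if residue in counts:
                counts[residue] += 1
    total = sum(counts.values())
    return {aa: (c / total if total else 0.0) for aa, c in counts.items()}


def pearson(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Pearson correlation between two compositions over the same keys."""
    keys = sorted(x)
    xs = [x[k] for k in keys]
    ys = [y[k] for k in keys]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)
```

Calling `pearson(composition(selected_peptides), codon_expected_frequencies())` on a set of selected peptides, and again with a reference database composition as the second argument, would show which of the two the selected set tracks more closely.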
|
|
| 114. DART: Finding Proteins with Similar Domain Architecture (up) |
| LY Geer, M Domrachev, DJ Lipman, SH Bryant, NCBI/NLM/NIH;
lewisg@ncbi.nlm.nih.gov |
| Short Abstract:
The Domain Architecture Retrieval Tool (DART) identifies proteins with similar domain composition. Domains in a query sequence are identified by a sensitive profile search. Proteins with similar domain architectures are retrieved and listed in ranked order. DART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps |
| One Page Abstract:
The Domain Architecture Retrieval Tool (DART), hosted by NCBI, performs similarity searching of proteins based on their domain architecture. The goal is to find protein similarities using consistent and sensitive protein domain profiles, rather than solely by sequence similarity. DART has been designed to be fast and informative. The underlying algorithm is based on domain annotation of a significant subset of all publicly known protein sequences through the use of Reverse-PSI-BLAST (RPS-BLAST) [1] and protein domain databases, including SMART [2] and Pfam [3]. Given a protein sequence, DART runs RPS-BLAST and displays the protein using a "beads on a string" style. DART then displays a ranked, graphical list of proteins with similar sets of domains. Ranking is done by the number of unique hits to domains that are the same or redundant to the domains in the query sequence. The query can be refined taxonomically or by selecting domains of interest. DART is linked to CD-Search (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi), which also uses RPS-BLAST and can display the domain profile alignment in greater detail. To create the databases underlying DART, all the sequences in the NCBI non-redundant database (nr) [4] are aligned to Pfam, SMART, and other domain databases using RPS-BLAST. These alignments are sorted by sequence and by domain. Redundancy between protein domains is used in ranking and querying the sequences because domain databases contain related domains. Redundancy between two domains is defined as a significant number of overlap hits by both domains to nr. Redundant pairs are clustered transitively to create a final list of redundant domains. DART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps [1] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res. 1997 Sep 1; 25(17): 3389-3402. [2] Schultz J, Copley RR, Doerks T, Ponting CP, Bork P. Nucleic Acids Res. 
2000 Jan 1; 28(1): 231-234. [3] Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL. Nucleic Acids Res. 2000 Jan 1; 28(1): 263-266. [4] Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA. Nucleic Acids Res. 2000 Jan 1; 28(1): 10-14. |
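The ranking rule described in this abstract (count unique hits to domains that are the same as, or redundant with, the query domains, after transitive clustering of redundant pairs) can be sketched with a small union-find structure. This is an illustration under assumptions and not the DART code: the function names, the protein identifiers and the redundancy pairs in the example are invented, and tie-breaking by name is an arbitrary choice.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def cluster_redundant(pairs: Iterable[Tuple[str, str]]) -> Callable[[str], str]:
    """Cluster redundant domain pairs transitively; return a representative lookup."""
    parent: Dict[str, str] = {}

    def find(domain: str) -> str:
        parent.setdefault(domain, domain)
        while parent[domain] != domain:
            parent[domain] = parent[parent[domain]]  # path halving
            domain = parent[domain]
        return domain

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return find


def rank_by_architecture(query_domains: List[str],
                         proteins: Dict[str, List[str]],
                         redundant_pairs: Iterable[Tuple[str, str]]
                         ) -> List[Tuple[str, int]]:
    """Rank proteins by the number of distinct query domain clusters they hit."""
    find = cluster_redundant(redundant_pairs)
    query_clusters = {find(d) for d in query_domains}
    scored = []
    for name, domains in proteins.items():
        shared = query_clusters & {find(d) for d in domains}
        if shared:
            scored.append((len(shared), name))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, hits) for hits, name in scored]


# Invented example: two kinase models are declared redundant with a third.
redundant = [("Pkinase", "S_TKc"), ("S_TKc", "TyrKc")]
database = {
    "P1": ["SH3", "SH2", "TyrKc"],
    "P2": ["SH3"],
    "P3": ["PDZ"],
    "P4": ["S_TKc", "SH2"],
}
ranking = rank_by_architecture(["SH2", "SH3", "Pkinase"], database, redundant)
```

Because redundancy is resolved through cluster representatives, a protein annotated with `TyrKc` still counts as a hit for a query containing `Pkinase`, even though the two were never paired directly.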
|
|
| 115. Statistical approaches for the analysis of immunoglobulin V-REGION IMGT data (up) |
| Christelle Pommié, Manuel Ruiz, Nathalie Syz, Véronique
Giudicelli, LIGM Institut de Génétique Humaine;
Robert Sabatier, Laboratoire de physique Moléculaire; Marie-Paule Lefranc, LIGM Institut de Génétique Humaine; cpommie@ligm.igh.cnrs.fr |
| Short Abstract:
IMGT, the international ImMunoGeneTics database (http://imgt.cines.fr), is an integrated information system specializing in Immunoglobulins, TcR and MHC molecules of all vertebrate species. Our aim was to define the most appropriate statistical methods to analyze the IMGT sequences and structural data, useful to establish amino acid correlations in 3D structures. |
| One Page Abstract:
Owing to their fundamental role in the immune system, the Immunoglobulin (Ig) and T cell Receptor (TcR) variable domains (corresponding to the V-J-REGION and V-D-J-REGION labels in IMGT, the international ImMunoGeneTics database, http://imgt.cines.fr) have been extensively studied. Moreover, owing to the sequencing efforts of recent years, all the human Ig and TcR genes are now characterized. Analysis of the correlation between sequences, structures and specificities of the variable domains has important implications for medical research (repertoire in autoimmune diseases, AIDS, leukemias, lymphomas, myelomas), therapeutic approaches (antibody engineering), and the study of genome diversity and genome evolution. The Ig and TcR V-REGIONs represent a privileged situation owing to the conservation of their structure despite divergent sequences and to the considerable amount of genomic, structural and functional data. The unique IMGT numbering for Ig and TcR V-REGION sequences of all vertebrate species has been established to facilitate sequence comparison and cross-referencing between experiments from different laboratories whatever the antigen receptor (Ig or TcR), the chain type (heavy or light chains for Ig; alpha, beta, gamma or delta chains for TcR) or the species. In the IMGT unique numbering, conserved amino acids from FR always have the same number whatever the Ig or TcR variable sequence, and whatever the species they come from. The IMGT unique numbering has made it possible to redefine the limits of the FR and CDR regions. The FR-IMGT and CDR-IMGT lengths become in themselves crucial information, which characterizes variable regions belonging to a group, a subgroup and/or a gene. FR amino acids located at the same position in different sequences can be compared without requiring sequence alignments. This also holds for amino acids belonging to CDR-IMGT of the same length. 
The IMGT unique numbering permits rapid correlation between protein sequences and three dimensional (3D) structure of Ig and TcR V-REGIONs. Standardized multi-sequence alignments obtained with the IMGT unique numbering make it possible to set up statistical analyses of the amino acid physico-chemical properties, position by position. These analyses are not only useful to study mutations and allele polymorphisms, but are also needed to establish correlations between amino acids in the protein 3D structures and to extract new knowledge in the IMGT/PROTEIN-DB database, currently in development. As an example of our approach, we describe below the statistical analysis of the hydropathy property of the amino acids found at standardized positions of the three frameworks of the V-REGIONs of two types of chains, the human immunoglobulin light chain kappa and lambda. A total of 1114 human rearranged productive Ig V-REGIONs was obtained, 585 belonging to the kappa chains and 529 to the lambda chains. The V-REGION nucleotide sequences were translated into amino acid sequences. Gaps and delimitations of the FR-IMGT and CDR-IMGT were created according to the IMGT unique numbering. For the V-REGIONs of each chain type, three sets were created which correspond to FR1-IMGT (amino acid positions 1 to 26), FR2-IMGT (amino acid positions 39 to 55) and FR3-IMGT (amino acid positions 66 to 104), respectively. The six amino acid sequence sets were analyzed to obtain contingency tables which contain the number of each amino acid at each position. The statistical analysis was carried out with two different but complementary multivariate descriptive statistical analysis (MDSA) methods: correspondence (or factor) analysis and hierarchical classification (Ward's method), using the ADE-4 software. The amino acid positions of the kappa and lambda FR1-IMGT, FR2-IMGT and FR3-IMGT sets were compared, two by two, for the amino acid "hydropathy" variable class. A total of six analyses was performed. 
A correspondence analysis (COA in ADE-4) was applied to each set of kappa amino acid positions (from FR1-IMGT, FR2-IMGT, FR3-IMGT, respectively) and the corresponding set of lambda amino acid positions (from FR1-IMGT, FR2-IMGT, FR3-IMGT, respectively) was projected for the hydropathy variable. One hundred and fifty-seven (79 kappa and 78 lambda) amino acid positions from 1114 Ig V-REGION sequences were analysed by the correspondence and the classification analysis methods. These methods, appropriate for the analysis of large data matrices, are particularly interesting in view of the large amount of data to be studied in IMGT. Moreover, used together, they provide different but complementary results and allow a reciprocal analysis of the data. Such an approach was feasible owing to the standardization of the amino acid positions in IMGT sequences. The statistical differences of the hydropathy variable at given amino acid positions have made it possible to define the characteristic hydropathy property of the kappa and lambda amino acids, respectively. On the other hand, the statistical resemblances between kappa and lambda positions have made it possible to identify positions where the amino acid hydropathy property may be important for the conserved structure of the Ig fold. Similar analyses with other variables (amino acid solvent accessibility, hydrogen and van der Waals bonding) and on other sets of sequences will be particularly useful to establish correlations between amino acid positions of the Ig fold. |
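The contingency tables that feed the multivariate analyses in this abstract can be sketched as a per-position count of hydropathy classes. The three-class grouping used here (hydrophobic, neutral, hydrophilic) is an assumption made for illustration; the abstract does not list the classes. The function name and the toy sequences are hypothetical, and the correspondence analysis itself (done with ADE-4 in the abstract) is not reproduced.

```python
from collections import Counter
from typing import Dict, List

# Assumed three-class hydropathy grouping of the 20 amino acids.
HYDROPATHY_CLASS: Dict[str, str] = {}
for residues, label in (("ACILMFWV", "hydrophobic"),
                        ("GHPSTY", "neutral"),
                        ("RNDQEK", "hydrophilic")):
    for residue in residues:
        HYDROPATHY_CLASS[residue] = label


def contingency_table(sequences: List[str],
                      first_position: int) -> Dict[int, Counter]:
    """Count hydropathy classes at each numbered position.

    Sequences are assumed to be already gapped to a common numbering, so that
    character k of every sequence is position first_position + k. Gap
    characters and unknown symbols are skipped.
    """
    table: Dict[int, Counter] = {}
    for seq in sequences:
        for offset, residue in enumerate(seq):
            label = HYDROPATHY_CLASS.get(residue)
            if label is None:
                continue
            table.setdefault(first_position + offset, Counter())[label] += 1
    return table


# Toy fragment numbered from position 1 ('.' marks a gap).
toy_table = contingency_table(["DIQ", "EIV", "D.V"], first_position=1)
```

Because every sequence shares the same numbering, rows of such tables from two chain types can be compared position by position without an additional alignment step.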
|
|
| 116. Markovian Domain Signatures: Statistical Segmentation of Protein Sequences (up) |
| Gill Bejerano, Yevgeny Seldin, Naftali Tishby, School of Computer
Science & Engineering, The Hebrew University;
jill@cs.huji.ac.il |
| Short Abstract:
We present a novel method for protein sequence domain detection and classification. Our method is fully automated, does not require multiple alignments, and handles heterogeneous unordered multi-domain groups. It constructs unique domain signatures through clustering regions of conserved statistics. Examples detect protein fusion events, and outperform HMM classification. |
| One Page Abstract:
Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional multiple sequence alignment-based methods, such as hidden Markov modeling (HMM), run into difficulties when faced with heterogeneous groups of proteins. Yet even many families of proteins sharing a common domain contain instances of several other domains, without any common linear ordering. Ignoring this modularity may lead to poor or even false classification and annotation. An automated method that can decompose a group of proteins into the sequence domains it contains is therefore highly desirable. We apply a novel method to this problem. The method takes as input an
unaligned group of protein sequences. It segments them and clusters the
segments into groups sharing the same underlying statistics. A variable
memory Markov model (VMM) is built using a prediction suffix tree (PST)
data structure for each group of segments. Refinement is achieved by letting
the PSTs compete over the segments. A deterministic annealing framework
infers the number of underlying PST models while avoiding many inferior
solutions. We show that regions of conserved statistics correlate well
with protein sequence domains, by matching a unique signature to each domain.
This is done in a fully automated manner, and does not require or attempt
a multiple alignment. Several representative cases are presented. We identify
a protein fusion event, refine an HMM superfamily classification into the
underlying families the HMM cannot separate, and detect all 12 instances
of a short domain in a group of 396 sequences.
|
|
|
| 117. Identification Of Novel Conserved Sequence Motifs In Human Transmembrane Proteins (up) |
| Eike Staub, Artemis Hatzigeorgiou, Bernd Hinzmann, Christian Pilarsky,
Thomas Specht, Andre Rosenthal, metaGen Pharmaceuticals GmbH;
eike.staub@metagen.de |
| Short Abstract:
Here we present an approach to identify novel conserved protein motifs in human transmembrane (TM) proteins. Thousands of predicted TM protein sequences were scanned for known protein domains, transmembrane segments, low-complexity and coiled-coil regions. Using a PSIBLAST-based procedure we identified interesting new motifs in the remaining, previously uncharacterized sequence fragments. |
| One Page Abstract:
Here we present an approach to identify novel conserved protein motifs in human transmembrane (TM) proteins. Thousands of predicted TM protein sequences were scanned for known protein domains, transmembrane segments, low-complexity and coiled-coil regions. Using a PSIBLAST-based procedure we identified interesting new motifs in the remaining, previously uncharacterized sequence fragments. |
|
|
| 118. Using Profile Scores to Determine a Tree Representation of Protein Relationships. (up) |
| K. Diemer, T. Hatton, P. Thomas, Celera Genomics;
diemerkl@fc.celera.com |
| Short Abstract:
An algorithm is introduced that uses profile scores to generate a tree of orthologs/paralogs and to split it into functional subgroups automatically. The similarity measure is based on the score of one sequence cluster to the profile of another. The algorithm is compared with other methods and expert curation.
|
| One Page Abstract:
Many algorithms have been proposed for reconstructing the evolution of protein families from DNA or protein sequence information. The primary goal has been to model the most likely historical sequence of events that gave rise to the protein sequences observed today in the form of orthologs and paralogs. Another use of phylogenetic trees has emerged: prediction of "attributes" of proteins, primarily function, from sequence information. Emerging first from genetic "rescue" experiments, in which a defective protein in one organism can be functionally replaced by a related protein from another organism, it has been repeatedly observed that proteins more closely related in sequence tend to be more closely related in function. It is also well known that the functional specificity of a given protein is generally conferred by only a subset of its constituent amino acids. Some of this specificity can be inferred from analysis of a protein family: positions that vary among the family members are not required for whatever function(s) the family members have in common, while positions that are strictly conserved may be important for those functions. Statistical profiles that describe the conservation patterns at different positions in a set of related proteins have been used to aid in phylogenetic reconstruction. Whether or not using profiles leads to more accurate phylogenetic reconstruction, they may lead to a greater correlation with function, which is the primary focus of the work presented here. The algorithm introduced here uses agglomerative clustering, where the
similarity measure used to join clusters is based on an approximation to
the weighted score of the sequences in one cluster to the profile of the
other cluster. Sequence fragments, which are not infrequent in current
sequence databases, can be accommodated easily. A heuristic score-based
measure is used to split the tree into functional subgroups. When assessing
the performance of our algorithm, we examine the correlation between the
resulting tree and the functions of the constituent proteins. We have evaluated
the algorithm on alignments and corresponding expert functional annotations
from publicly accessible websites, as well as several internally constructed
test cases.
|
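The clustering step described in abstract 118 can be illustrated with a minimal sketch. This is not the authors' implementation: the log-odds scoring with a uniform background, the pseudocount, the symmetrised mean score, and all function names are assumptions chosen for illustration. Gap characters are simply skipped, which is one plausible way to tolerate sequence fragments. The sequences are assumed to be pre-aligned and of equal length.

```python
import math
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def profile(seqs, pseudo=1.0):
    """Per-column residue frequencies with pseudocounts; gaps ('-') are ignored."""
    cols = []
    for i in range(len(seqs[0])):
        counts = {a: pseudo for a in ALPHABET}
        for s in seqs:
            if s[i] in counts:
                counts[s[i]] += 1
        total = sum(counts.values())
        cols.append({a: c / total for a, c in counts.items()})
    return cols

def score(seq, prof):
    """Log-odds score of one sequence against a profile (uniform background)."""
    bg = 1.0 / len(ALPHABET)
    return sum(math.log(prof[i][r] / bg) for i, r in enumerate(seq) if r in prof[i])

def similarity(a, b):
    """Symmetrised mean score of each cluster's sequences to the other's profile."""
    pa, pb = profile(a), profile(b)
    ab = sum(score(s, pb) for s in a) / len(a)
    ba = sum(score(s, pa) for s in b) / len(b)
    return (ab + ba) / 2

def agglomerate(seqs):
    """Repeatedly join the two most similar clusters; return the merge history."""
    clusters = [[s] for s in seqs]
    history = []
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history
```

The merge history defines a tree; a score threshold on `similarity` could then be used to cut it into subgroups, in the spirit of the heuristic split mentioned in the abstract.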
|
|
| 119. Apoptosis Signalling Pathway Database - Combining Complementary Structural, Profile based, and Pair-wise Homologies. (up) |
| Kutbuddin S. Doctor, John C. Reed, Adam Godzik, The Burnham Institute;
Philip E. Bourne, San Diego Super Computer Center & University of California, San Diego; ksdoctor@burnham-inst.org |
| Short Abstract:
This relational database system and web interface (http://apoptosis-db.org/) collects and organizes protein families involved in apoptosis. The database is organized around domains (profile based) that are partially indicative of apoptotic function. Sub-family classification based on combined pair-wise homology over domains provides reliable functional classification. Structural homology links profile based domains. |
| One Page Abstract:
This relational database system and web interface (http://apoptosis-db.org) collects and organizes protein families involved in apoptosis. The database is organized around domains (profile based) that are partially indicative of apoptotic function. Sub-family classification based on combined pair-wise homology over domains provides reliable functional classification. Structural homology links profile based domains which share more generalized functions. |
|
|
| 120. From clustering to expression data to motif finding: a multistep online procedure. (up) |
| Gert Thijs, Kathleen Marchal, Frank De Smet, Janick Mathys, Magali
Lescot, K.U.Leuven - ESAT/SISTA;
Stephane Rombauts, PlantGenetics, VIB, U.Gent; Bart De Moor, K.U.Leuven - ESAT/SISTA; Pierre Rouze, PlantGenetics, VIB, U.Gent; Yves Moreau, K.U.Leuven - ESAT/SISTA; gert.thijs@esat.kuleuven.ac.be |
| Short Abstract:
We present an integrated web-based tool for automatic multistep analysis of microarray data. The gene expression data are clustered to find groups of co-expressed genes. The upstream regions are selected based on accession number and gene name. Finally, the sequences are sent to the Motif Sampler to find over-represented motifs. |
| One Page Abstract:
Microarray experiments make it possible to gain a global insight into the transcriptional behaviour of an organism. Deciphering the regulatory mechanisms from the transcript profiles is one of the major challenges of bioinformatics. Genes that have a similar expression profile are hypothesized to have a higher probability of being coregulated. Clustering techniques group together genes with similar expression profiles. Finding specific cis-acting motifs in the upstream regions of sets of co-expressed genes can to some extent validate the clusters. Here we present an interactive web-based user interface that integrates the cluster analysis and motif finding tools for the analysis of microarray data. We propose a multistep online procedure. Starting from the expression data together with the corresponding identification tags of the genes (accession number and gene name), the adaptive quality-based clustering algorithm defines groups of tightly co-expressed genes. Each gene in a cluster is identified by its accession number and gene name. Based on these tags the upstream region is retrieved. First the sequences are downloaded from GenBank and all the genes are located and indexed. In the next step the corresponding upstream region is identified. If this region is too short for further analysis, the gene is searched with BLAST to locate the upstream region in genomic sequences. This sequence selection relies on an automated procedure, but at each step an intermediary report is shown where the user can intervene in the process. Once the upstream regions are identified the user can send the sequences to the Motif Sampler to find the over-represented motifs. The web interface can be accessed through the following URL: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html |
|
|
| 121. Probe Based Scaling of Microarray Expression Data (up) |
| Christopher Workman, Lars Juhl Jensen, Steen Knudsen, Søren
Brunak, Center for Biological Sequence Analysis;
workman@cbs.dtu.dk |
| Short Abstract:
There are several analysis steps after hybridization and scanning that lay the foundation for all further analysis. For this reason, special care should be taken in image analysis, data scaling and normalization prior to calculating mRNA levels. This poster presents a new probe based scaling method for microarray expression data. |
| One Page Abstract:
Between scanning a chip and drawing conclusions about mRNA levels there are several important steps that affect the results and lay the foundation for all further analysis. For this reason, special care should be taken in image analysis, data scaling, normalization, and outlier detection prior to calculating mRNA levels. After this is done, converting sets of probe pair intensities to mRNA levels (what I will call the feature extraction problem) is still not as straightforward as one might think. There is very little precedent in feature extraction for probe pair data of this type, but some examples are starting to appear in the literature (Li and Wong PNAS, v.98, 2001). What confounds the development of these methods is not knowing what the correct results should be. Using replicate experiments from the same and different RNA isolations from a single tissue, we can measure the effects of scaling and normalization on reproducibility. In this poster I will present new scaling and feature extraction methods and compare them to the existing methods with respect to their effects on reproducibility. |
|
|
| 122. Revealing the Fine-Structures: Wavelet-based Fuzzy Clustering of Gene Expression Data (up) |
| Matthias E. Futschik, Nikola K. Kasabov, University of Otago;
mfutschik@infoscience.otago.ac.nz |
| Short Abstract:
We studied yeast cell cycle expression data using fuzzy clustering and wavelet analysis. Both methods allow a more general approach for discovering the underlying structures and patterns in gene expression data than previous methods and can be valuable tools for revealing the complexity and the fine structure of cellular regulatory networks. |
| One Page Abstract:
The invention of microarray technologies has opened the door for the study of global mechanisms in the cell. By measuring thousands of genes simultaneously it has become possible to get snapshots of the states of a cell. While monitoring the mRNA levels reveals only a part of the whole picture and protein arrays are still in their infancy, the DNA microarray technique has quickly become an established method and will lead the way in the analysis of the global behavior of cellular networks. A major challenge is, however, the extraction of valuable knowledge from the mass of data produced in microarray experiments. Clustering has frequently been used to obtain a first insight into the structure of the data. It assigns genes to defined groups according to the similarities of their expression profiles. Since it is assumed that co-regulated genes show a similar expression pattern, clustering can discover functionally related genes. So far various clustering algorithms have been introduced, such as hierarchical clustering, k-means and SOM. A common property of these methods is the assignment of a gene to a single distinct cluster. However, this procedure might be too restrictive considering the complexity of cellular regulatory networks. Single genes are frequently involved in several different physiological pathways. An adequate clustering algorithm should reflect this. In this work, we present fuzzy clustering[1] as an alternative to the traditional methods. Fuzzy clustering may group genes more naturally by allowing a single gene to belong to different clusters. This opens the way to a more complex partitioning of genes. We found that a significant number of genes seem to belong to different clusters, showing that fuzzy clustering might be an appropriate approach to use. Furthermore, fuzzy clustering leads to the definition of the core of a cluster in a straightforward way. 
Using this feature, we can examine the correlations between the expression signal and the information in regulatory DNA sequences in detail. To illustrate this novel approach we apply fuzzy clustering to a yeast cell cycle gene expression data set[2]. To account for the temporal character of these data, we apply wavelet analysis[3] to represent the expression profiles. Wavelet analysis offers the possibility to study the genetic network on different time scales while preserving the temporal order of the expression signals. An interesting possibility is the use of wavelet decomposition to distinguish the true biological signals from noise. Finally we address the important issue of cluster validation by comparing different cluster validity criteria and discuss the problem of model parameter selection. We show that both fuzzy clustering and wavelet analysis allow a wider approach for discovering underlying structures and patterns in gene expression data than previous methods and can be valuable tools for revealing the complexity and the fine structure of cellular regulatory networks. References: [1] James C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Advanced Applications in Pattern Recognition, Plenum Press, 1983 [2] Paul T. Spellman et al., Molecular Biology of the Cell, Vol. 9, 3273-3297, 1998 [3] Ingrid Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, 1992 |
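The idea of graded cluster membership in abstract 122 can be made concrete with a short sketch of the standard fuzzy c-means iteration. It is offered as an illustration only: the abstract does not state which variant, fuzzifier, or initialisation was used, so the fuzzifier `m = 2`, the fixed iteration count, the caller-supplied starting centres, and the function names are all assumptions.

```python
import math

def memberships(points, centers, m=2.0):
    """Membership of every point in every cluster; each row sums to 1."""
    rows = []
    for x in points:
        # Guard against a zero distance when a point coincides with a centre.
        d = [max(math.dist(x, c), 1e-12) for c in centers]
        rows.append([1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in d)
                     for di in d])
    return rows

def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    centers = [list(c) for c in centers]
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        for i in range(len(centers)):
            w = [row[i] ** m for row in u]          # fuzzified weights
            total = sum(w)
            centers[i] = [sum(wk * x[j] for wk, x in zip(w, points)) / total
                          for j in range(dim)]
    return centers, memberships(points, centers, m)
```

A profile lying midway between two groups receives a membership near 0.5 in each, whereas a hard clustering method would have to assign it to exactly one. Thresholding the memberships (for example, keeping genes above a chosen cutoff) is one way to obtain a cluster core.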
|
|
| 123. Identification of clinically relevant genes in lung tumor expression data. (up) |
| Olga Troyanskaya, Stanford Medical Informatics and Department of
Genetics, Stanford University School of Medicine;
Mitchell Garber, Department of Genetics, Stanford University School of Medicine; Russ B. Altman, Stanford Medical Informatics, Stanford University School of Medicine; David Botstein, Department of Genetics, Stanford University School of Medicine; olgat@smi.stanford.edu |
| Short Abstract:
We developed methods for detecting clinically relevant genes in the context of lung cancer gene expression data. We present a correlation-based method for identifying survival-associated genes for lung adenocarcinomas. We also show a method based on a nonparametric t-test for identifying gene expression patterns associated with specific tumor types. |
| One Page Abstract:
A major biomedical question in microarray studies is selecting genes associated with specific clinical parameters, for example patient survival. Identification of such markers, or groups of genes, may lead to clinical outcomes prediction and treatment guidance. Additionally, analysis of gene expression data associated with clinical data may allow molecular-level tumor classification. These tumor subtypes, which may appear histologically similar, are molecularly distinct and lead to differences in clinical outcomes such as patient survival, drug response, and metastatic status. Methods for automated analysis of gene expression data associated with clinical data are therefore needed. Our work is focused on developing and evaluating methods for detecting clinically relevant genes in the context of lung cancer gene expression data. We use a non-parametric t-test based method for identification of genes associated with specific tumor types. This method was applied to lung tumor data to distinguish between subtypes of lung adenocarcinomas which are not histologically distinct. We also describe a correlation-based method for identification of genes correlated with patient survival. The method identifies genes whose expression can be best used to classify tumors in terms of `good' and `bad' survival outcomes for patients with lung adenocarcinomas. |
|
|
| 124. Machine Learning Techniques for the Analysis of Microarray Gene Expression Data: A Critical Appraisal (up) |
| Mahesan Niranjan, The University of Sheffield;
m.niranjan@sheffield.ac.uk |
| Short Abstract:
This poster takes a critical look at some of the high performance machine learning techniques as applied to microarray gene expression data. It uses the yeast gene expression and leukemia datasets available in the public domain to illustrate that reasonably simple techniques can achieve performances comparable to highly nonlinear techniques. |
| One Page Abstract:
In recent literature we see that a wide range of powerful machine learning algorithms have been proposed for the analysis of gene expression data from microarrays. New clustering methods such as Gene Shaving have been invented in this context. Support Vector Machines, Bayes Nets, Gaussian Processes and Latent Variable Methods have all been recommended as the right tools with which inference problems in such data should be approached. Recent literature takes the form that each machine learning expert with interest in the subject of microarray data advances his/her favourite method as the way forward for the biologists generating the data. In this poster I report on taking a critical look at this collection of techniques applied to this problem. In particular I report on the Yeast Gene and Leukemia datasets, available in the public domain. It turns out that the underlying classification problems arising in these datasets are sufficiently simple that pattern processing techniques available in textbooks are as good as any sophisticated methodology. The key result from this observation is that many of the high dimensional problems could be reduced to problems in much lower dimensionality by reasonably simple techniques, resulting in the possibility of effective interpretation of such data. |
|
|
| 125. Comparison of Methods for The Classification of Tumors Using Gene Expression Data (up) |
| Grace S. Shieh, Chi-Chih Chen, Ing-Cheng Jiang, Insti. of Statistical
Science, Academia sinica, TAIWAN;
Yu-Shan Shih, Dept of Math., National Chung Cheng Univ.; gshieh@stat.sinica.edu.tw |
| Short Abstract:
The performance of Support Vector Machines and QUEST in classifying tumors based on gene expression data from cDNA microarrays is compared to that of the methods in Dudoit et al. (2000). We assess error rates on 150 data sets generated, by a statistical method, from the NCI 60 cell lines and Lymphoma data, respectively. |
| One Page Abstract:
The performance of Support Vector Machines and QUEST in classifying tumors based on gene expression data from cDNA microarrays is compared to that of the four major methods in Dudoit et al. (2000). We generate 150 data sets, by a statistical sampling method, from the original NCI 60 cell lines (Ross et al., 2000) and Lymphoma data (Alizadeh et al., 2000), respectively. In each set, about two thirds of the generated data are used as training data and the rest as test data. Some variables (gene expressions), out of many (for instance, 1,416 in the NCI 60 cell lines), have been selected by a statistical criterion to implement those methods and to assess their prediction error rates. |
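The evaluation protocol of abstract 125 (many resampled data sets, each split roughly two thirds for training and one third for testing) can be sketched as follows. Several things here are assumptions and not taken from the abstract: a plain random shuffle stands in for the unspecified statistical sampling method, a nearest-centroid rule stands in for the actual classifiers (SVM, QUEST and the methods of Dudoit et al.), and all names are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(model, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(x, model[label])))

def resampled_error(X, y, n_sets=150, train_fraction=2 / 3, seed=0):
    """Mean test error over n_sets random train/test splits."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    idx = list(range(len(X)))
    n_train = round(train_fraction * len(X))
    errors = []
    for _ in range(n_sets):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)
```

Swapping `fit_centroids` and `predict` for another classifier leaves the resampling loop unchanged, which is what makes error rates comparable across methods.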
|
|
| 126. Using expression data for testing hypotheses on genetic networks - minimal requirements for the experimental design (up) |
| Dirk Repsilber, Institute of Molecular Evolution, Evolutionary Biology
Centre of the University of Uppsala, Norbyvägen 18 C, SE-75236 Uppsala,
Sweden;
Siv Andersson, Hans Liljenström, Institute of Molecular Evolution, Evolutionary Biology Centre of the University of Uppsala, Norbyvägen 18 C, SE-; dirk.repsilber@ebc.uu.se |
| Short Abstract:
We systematically tested the requirements on the experimental design for ranking false hypotheses about a genetic network's structure, given expression data. This is an important functional genomics task, because the parameter space of reasonable models is too large to explore without prior biological knowledge. |
| One Page Abstract:
A variety of "reverse-engineering" algorithms have been proposed that use expression data to reconstruct interactions in small networks. This may help to understand genetic regulation, the core task of present-day functional genomics. Only a few authors point to the necessity of measuring "independent" samples in order to reverse-engineer even the smallest genetic networks with reasonable confidence. Here, we systematically tested the requirements on the experimental design that are necessary not only to reverse-engineer the "right" genetic network, but also to rank false hypotheses about its structure. Presumably the latter is the task that will most frequently need to be solved in the near future of functional genomics, because the parameter space of reasonable models is too large to search without using prior biological knowledge. However, this knowledge has mainly been inferred from sequence data, and several equally plausible hypotheses need to be weighed against each other. Thus, algorithmic solutions that can be computationally automated to perform this task are indispensable. Following the work of Wahde and Hertz (2000) we use a genetic algorithm to explore the parameter space of a multistage discrete genetic network model (fixed connectivity and number of states per node). |
|
|
| 127. In silico search for cis-acting regulatory sequences in co-expressed gene clusters (up) |
| Stephane Rombauts, Department of Plant Genetics, VIB, University
of Gent, Ledeganckstraat 35, 9000 Gent Belgium;
Magali Lescot, Gert Thijs, Kathleen Marchal, SISTA/COSIC-ESAT, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee Belgium; Cedric Simillion, Department of Plant Genetics, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium; Bart De Moor, Yves Moreau, SISTA/COSIC-ESAT, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee Belgium; Pierre Rouzé, INRA associated laboratory, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium; strom@gengenp.rug.ac.be |
| Short Abstract:
With large-scale transcriptome expression analyses, such as microarrays, one can tackle the problem of gene regulation. Motif finding algorithms aim at detecting motifs in the upstream regions of co-regulated genes. For that purpose we improved the original Gibbs Sampler from Lawrence. PlantCARE has been improved and new features were added. |
| One Page Abstract:
Among the fully sequenced genomes, that of the dicotyledonous plant model Arabidopsis thaliana has been available since December 2000, and great efforts are being made to extract knowledge from the sequences. With large-scale transcriptome expression analyses, such as microarrays, producing large clusters of co-expressed genes, one can tackle the problem of gene regulation. It is commonly accepted that at least a subset of the sequences of a given cluster should share regulatory elements. In general, data on plant cis-acting regulatory elements are lacking, although these elements determine the processes in which genes are involved and are of major importance for plant biotechnology. Motif finding algorithms aim at detecting such motifs in the upstream regions of co-regulated genes by looking for over-represented oligonucleotides. For that purpose we developed the Motif Sampler, an improved implementation of the original Gibbs Sampler from Lawrence[1]. To test the Motif Sampler[2] on experimental data sets, the microarray data of plant response to mechanical wounding from Reymond[3] was used as well as the data from Schaffer[5] on the circadian clock. To assign a functional interpretation to the motifs found, the consensus of each motif was compared with the entries in PlantCARE[6]. Several interesting motifs were found for the wounding experiments (methyl jasmonate responsive elements, elicitor-responsive elements and the abscisic acid response element) as well as for the circadian clock experiments. The PlantCARE database and web site have been improved and new features were added to deal with the predicted data. Among the updates, an interactive graphical display of promoter boxes mapped on the query sequence, together with information regarding the sites, has been added. 
Additionally we aim at describing promoters as functional entities composed of several elements based on extensive analyses of pools of co-regulated genes clustered from microarray experiments. At present, we have collected over 400 different cis-acting regulatory elements from the literature describing more than 159 individual promoters from higher plant genes. (http://sphinx.rug.ac.be:8080/PlantCARE/) References [1] Lawrence, C. E. et al. (1993). "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment." Science 262(5131): 208-14. [2] Thijs, G. et al. (2000). "A higher-order background model improves the detection by Gibbs sampling of potential promoter regulatory elements in DNA sequences." Submitted. http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html [3] Reymond, P. et al. (2000). "Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis." Plant Cell 12(5): 707-20. [4] De Smet, F. et al. (2000). http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html [5] Schaffer, R. et al. (2001). "Microarray analysis of diurnal and circadian-regulated genes in Arabidopsis." Plant Cell 13: 113-123. [6] Rombauts, S. et al. (1999). "PlantCARE, a plant cis-acting regulatory element database." Nucleic Acids Res 27(1): 295-6. |
|
|
| 128. A decision tree method for classification of promoters based on TF binding sites (up) |
| Alexander Kel, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig,
Germany; Institute of Cytology and Genetics, Pr. Lavrentyeva 10, 360090,
Novosibirsk, Russia;
Tatyana Ivanova, Institute of Cytology and Genetics, Pr. Lavrentyeva 10, 360090, Novosibirsk,; Olga Kel-Margoulis, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany; Institute of Cytology and Genetics, Pr. Lavrentyeva 10; Michael Zhang, Watson School of Biological Sciences, Cold Spring Harbor Laboratory, 1 Bungtown Road, P.O.Box 100,Cold Spring Harbor,; Edgar Wingender, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany; ake@biobase.de |
| Short Abstract:
We have developed a new method for revealing class-specific composite modules (combinations of transcription factor binding sites) in promoters of eukaryotic genes that are functionally related or coexpressed. A decision tree system is constructed to classify promoters in genomes and computationally predict their function. |
| One Page Abstract:
We have developed a new method for revealing class-specific composite modules in promoters of functionally related or coexpressed genes. On the basis of the revealed composite modules a decision tree method is developed to classify promoters of several functionally related gene groups. Seven sets of promoters were obtained from different sources: promoters for cell-cycle related genes (43 promoters) and brain enriched genes (45 promoters) (collected in this work on the basis of a literature search), muscle-specific (25 promoters) and immune cell specific genes (24 promoters) (Kel et al., (1999) JMB 288,353-376), erythroid specific genes (10 promoters) (http://www.bionet.nsc.ru), liver enriched genes (39 promoters) and housekeeping genes (26 promoters) (EPD rel.62). Promoter sequences of length 600 bp (from -500 to +99 relative to the start of transcription) were extracted from the EMBL database. To search for binding sites a library of about 400 matrices for various transcription factors was applied (TRANSFAC rel 4.4; Wingender, E. et al., (2000) NAR 28, 316-319) with a new searching tool - "Match". To classify promoters we build a decision tree. The internal nodes of the tree represent selected composite modules. On the basis of the composite module, at every node we calculate a decision function F(X) for each sequence X as it is passed to the tree. The decision tree was built by a variant of a genetic algorithm that optimises the structure of the decision tree, selects the specific combinations of cis-elements for every node of the tree and defines cut-off values of the corresponding functions F. The bottom nodes of the tree (leaves) contain 7 different promoter classes. The percentage of correct classifications achieved by the tree varies across promoter classes, from 35% for promoters of brain enriched genes to more than 70% for cell cycle related promoters. 
The following set of TF binding sites appeared to be the most effective for classification of the mentioned sets of promoters: E2F, OCT-1, NF-AT, MyoD, SRF and NF-kB. The classification tree and the program for promoter classification can be found at: http://www.gene-regulation.com/. The decision tree method makes it possible to identify new promoters and computationally predict their function. It provides a means to analyse gene expression data by constructing promoter models for coexpressed genes. |
|
|
| 129. Biostatistical Methods to Analyse Gene Expression Profiles (up) |
| Jobst Landgrebe, MPI of Psychiatry, Munich;
Gerhard Welzl, GSF-Research Center, Munich; Wolfgang Wurst, MPI of Psychiatry, Munich; landgreb@mpipsykl.mpg.de |
| Short Abstract:
We analysed gene expression data of mouse mutants with principal component analysis. We selected genes with extreme values in the reduced system, supervised by the variance within observation groups. This enabled us to explore differences between the samples and to extract fundamental gene expression patterns related to these differences. |
| One Page Abstract:
Biostatistical Methods to Analyse Gene Expression Profiles Jobst Landgrebe(1), Gerhard Welzl(2) and Wolfgang Wurst(1/2) 1 GSF-National Research Centre for Environment and Health, Ingolstädter Landstraße 1, D-85764 Neuherberg 2 Max-Planck-Institute of Psychiatry, Molecular Neurogenetics, Kraepelinstr.10, D-80804 München Abstract: DNA microarray gene expression data are characterised by an increasing number of probes for cDNAs. Bioinformatical and biostatistical methods are applied to study the variance in gene expression across collections of related arrays and to detect fundamental patterns underlying these gene expression profiles. Many mathematical techniques have been developed to detect patterns in complex data. Quite a few of these methods are essentially different ways of clustering points in multidimensional space, e.g. hierarchical clustering, or self-organising maps. Holter et al. successfully applied the singular value decomposition method to sets of DNA microarray gene expression data (Holter et al. 2000). Another method named "Gene Shaving" is based on computing a leading principal component iteratively (Hastie et al. 2000). We analysed gene expression data of genetic and pharmacological mouse models with principal component analysis (PCA) by regarding the experimental conditions as variables (columns) and the genes as objects (rows). The additional information about the related arrays (groups of mice) requires some modification of the PCA (Krzanowski). We selected genes with extreme values in the reduced system (high variance between groups), supervised by the variance within groups. Using this method we were able to explore differences between the samples and to extract fundamental gene expression patterns related to these differences. To complete our analysis we ran the data visualisation system XGobi and compared the results with the outcome of other multivariate methods. 
References: HASTIE, T., TIBSHIRANI, R., EISEN, M.B., ALIZADEH, A., LEVY, R., STAUDT, L., CHAN, W.C., BOTSTEIN, D. and BROWN, P. (2000): Gene 'shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1(2), 1-20. HOLTER, N.S., MITRA, M., MARITAN, A., CIEPLAK, M., BANAVAR, J.R. and FEDOROFF, N.V. (2000): Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. Natl. Acad. Sci. USA 97, 8409-8414. KRZANOWSKI, W.J. (2000): Principles of Multivariate Analysis. Oxford University Press, New York |
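The data layout used in abstract 129 (genes as rows, experimental conditions as columns) and the selection of genes with extreme component scores can be sketched as below. This is a simplified illustration, not the authors' procedure: it computes only the leading principal component by power iteration, and it omits the group-aware modification of PCA and the supervision by within-group variance that the abstract describes. Function names and the starting vector are assumptions.

```python
import math

def leading_component(rows, iters=100):
    """Return (direction, scores) of the first principal component.

    rows: one list of expression values per gene (columns = conditions).
    Assumes the data are not constant, so the covariance matrix is non-zero.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix between conditions (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration from a fixed, non-symmetric starting vector.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

def extreme_genes(scores, k):
    """Indices of the k genes with the largest absolute component score."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]),
                  reverse=True)[:k]
```

In practice a full decomposition from a numerical library would replace the power iteration; the point of the sketch is only the orientation of the data matrix and the ranking of genes by their scores in the reduced system.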
|
|
| 130. Syntactic structures for understanding gene regulatory networks (up) |
| Peter Lee, Mike Hallett, Tom Hudson, McGill University;
pdlee@genome.mcgill.ca |
| Short Abstract:
We present a novel system for representing complex gene relations derived from functional annotation databases. We propose a method for application of this functional representation to the analysis and interpretation of microarray gene expression data. |
| One Page Abstract:
The analysis and interpretation of large-scale gene expression datasets requires methods for integrating information about gene function. The majority of knowledge about biomolecular systems exists in the form of qualitative descriptions that are intuitive but cover a diverse spectrum of mechanistic information and experimental conditions. Pathway databases (e.g., KEGG) and other functional classification systems (such as GO, MeSH) compress information contained in the literature database to varying degrees. However, existing paradigms for representing this information (such as path maps, hierarchical trees and circuit diagrams) lack scalability and do not adequately capture the diversity and subtlety of the interactions between genes and their products. We describe a novel system for the representation of functional information contained in various functional databases. By preserving syntactic structures from the knowledge base, we propose a general interface that enables construction of comparisons between gene expression analyses and current intuitive understandings of gene regulation. We are in the process of developing this interface to access data via a microarray gene expression database. |
|
|
| 131. Adaptive quality-based clustering of gene expression profiles (up) |
| Frank De Smet, Kathleen Marchal, Janick Mathys, Gert
Thijs, Bart De Moor, Yves Moreau, ESAT-SISTA/COSIC/DocArch;
frank.desmet@esat.kuleuven.ac.be |
| Short Abstract:
A two-step algorithm to cluster significantly (with a certain confidence) coexpressed genes is presented. First, we try to find a sphere in the data where the 'density' of expression profiles is locally maximal. In a second step, we derive the optimal radius (or quality) of this sphere using an EM-algorithm. |
| One Page Abstract:
Clustering genes based on their expression behaviour/profiles (e.g., measured by microarrays) is an important step preceding further analysis of the interaction between these genes. The hypothesis that a cluster contains either coregulated or functionally related genes only holds if the cluster algorithm that is used, groups genes with a significant degree of coexpression. Genes not tightly coexpressed have to be excluded from further analysis. With these remarks in mind we designed an iterative two-step algorithm: First, we try to find a sphere in the data where the 'density' of expression profiles is locally maximal (based on a preliminary estimate of the radius R of the cluster - quality based approach(1)). In a second step, we derive the optimal radius (or quality) of the cluster/sphere so that only significantly coexpressed genes (represented by a significance level S (e.g., S=95%)) are included in the cluster. This is achieved by fitting a model to the data using an EM-algorithm. The model used assumes that the data is normalised (the expression vectors have mean zero and variance one and are therefore located on the intersection of a hyperplane and a hypersphere). By inferring the radius or quality from the data itself, the biologist is released from estimating this parameter manually (this parameter was sometimes hard to predict - setting the quality too strict will exclude a considerable number of coregulated genes, setting it too wide will include too many genes that are not coregulated). The most important properties of this approach are: a. Few user-defined parameters (e.g., no pre-definition of the number of clusters) with an intuitive meaning. b. Not all genes are assigned to a cluster. c. The computational complexity of this method is approximately linear in the number of gene expression profiles in the data set. Finally, we tested this algorithm successfully on real and artificial data. References 1. Heyer, L. J., Kruglyak, S., and Yooseph, S. 
(1999) Exploring expression data: Identification and analysis of coexpressed genes. Genome Res., 9, 1106-1115. Acknowledgements Frank De Smet is a research assistant with the K.U.Leuven. Yves Moreau
is a post-doctoral researcher of the FWO. Prof. Bart De Moor is a full
professor with the K.U.Leuven. This work is supported by the Flemish Government
(Research Council KUL (GOA Mefisto-666, IDO), FWO (G.0256.97,G.0240.99,G.0115.01,
Research communities ICCoS, ANMMM, PhD and postdoc grants), Bil.Int. Research
Program, IWT (Eureka-1562 (Synopsis), Eureka-2063 (Impact), Eureka-2419
(FLiTE), STWW-Genprom, IWT project Soft4s, PhD grants)), Federal State
(IUAP IV-02, IUAP IV-24, Durable development MD/01/024), EU (TMR-Alapades,
TMR-Ernsi, TMR-Niconet), Industrial contract research (ISMC, Data4s, Electrabel,
Verhaert, Laborelec).
|
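The normalisation assumed by the model in abstract 131 can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `standardise` and the example values are made up, and the population variance (division by n) is an assumed convention. It shows why a standardised profile lies on both a hyperplane (components sum to zero) and a hypersphere (squared components sum to n).

```python
import math

def standardise(profile):
    """Rescale one expression profile to mean zero and variance one."""
    n = len(profile)
    mean = sum(profile) / n
    # Population variance (divide by n); dividing by n - 1 is an equally
    # valid convention and only changes the radius of the hypersphere.
    var = sum((x - mean) ** 2 for x in profile) / n
    if var == 0:
        raise ValueError("a constant profile cannot be standardised")
    sd = math.sqrt(var)
    return [(x - mean) / sd for x in profile]

profile = [2.0, 4.0, 4.0, 6.0]      # made-up expression values
z = standardise(profile)
hyperplane = sum(z)                  # close to 0: z lies on the hyperplane sum(z) = 0
sphere = sum(v * v for v in z)       # close to n: z lies on a hypersphere of radius sqrt(n)
```

Under this convention every profile with n measurements ends up at distance sqrt(n) from the origin, so only the direction of a profile, not its scale, carries information about coexpression.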
|
|
| 132. Incorporating Biological Knowledge Into Analyses of Microarray Data (up) |
| Jessica Ross, Division of Biomedical Informatics, Department of
Medicine, Stanford University School of Medicine, Stanford, California;
Jeff Shrager, Carnegie Institute of Washington, Stanford, California; Glenn Rosen, Division of Pulmonary and Critical Care, Department of Medicine, Stanford University School of Medicine, Stanford, California; Pat Langley, Institute for the Study of Learning and Expertise, Palo Alto, California; ccross@leland.stanford.edu |
| Short Abstract:
We have developed a system that lets biologists view human gene expression data as it relates to molecules that occur in specific biological pathways, letting them search for all pathways that contain molecules of interest, view expression levels of those molecules graphically, and calculate correlations between those expression levels. |
| One Page Abstract:
High-throughput technologies have generated large amounts of data in the biological sciences. Using clustering algorithms to find patterns in these data is the primary method of analysis found in the literature, and is used predominantly as an exploratory tool, rather than as a test to evaluate a scientific hypothesis. These analyses have been very successful in letting scientists better classify tissue based on gene expression data. However, the results of clustering are often difficult to interpret in terms of the classical pathway models, which biologists often express as diagrams. The ability to reconcile microarray data with these models would greatly assist biologists in communicating knowledge gained from these high-throughput experiments. Furthermore, the ability to explain microarray results in relation to familiar biological pathways and molecular processes will directly support the formation and testing of hypotheses about these processes that regularly occur in the physical or wet lab. We have developed a system that lets biologists view human gene expression data as it relates to molecules that occur in specific biological pathways. We use a database with over 5000 biological reactions inferred from the literature on humans, most of which pertain to signaling pathways within the cell and therefore are directly relevant to current theories of the causes for many human diseases. The software lets a scientist search for all pathways that contain molecules of interest, view expression levels of those molecules graphically, and calculate correlations between those expression levels. In addition, the user may suggest a new pathway for comparison to the data. Using this program, we have been able to reconcile data from microarray experiments on human fibroblasts with accepted pathways for cell cycle and signaling. 
Our results show correlations between molecules that occur in these pathways, even though cluster analysis either did not group them together or did not assign them to any group. We believe this system will serve as a valuable tool that will let biologists incorporate microarray data into the process of hypothesizing and testing their models. |
|
|
| 133. Semantic Link: A Knowledge Discovery Tool for Gene Expression Profiling (up) |
| Ingrid M. Keseler, Nikolai N. Kalnine, BD/CLONTECH;
imkeseler@clontech.com |
| Short Abstract:
"Semantic Link" is an internet-based knowledge discovery tool designed for the interpretation of gene expression data. It contains a map of biological semantic concepts (genes, diseases, etc.) that are related to each other by their representation in the prevailing scientific literature. |
| One Page Abstract:
Microarray-based gene expression profiling generates thousands of data points that represent the relative abundance of individual mRNA molecules in experimental and control samples. If we consider the cell as a network of interacting molecules with a mechanism of feedback control of their expression and degradation, a gene expression profile reflects the induction or suppression of certain regulatory pathways in response to a "treatment". Interpreting gene expression data in terms of metabolic pathways is a challenging task for several reasons: 1) Incomplete knowledge of the functional role of genes in the cell. 2) The complex nature of the cellular pathway network. 3) The limited sensitivity and selectivity of microarray data. In addition, most of the relevant information is scattered over a variety of Internet databases and scientific publications, which are not designed for high-throughput processing. Semantic Link is a text processing program that extracts all available information on gene functions and related disorders from Medline titles and abstracts and organizes it in a database. A dictionary of 350,000 selected words and phrases representing gene names and biological processes was built from Medline '96 to '01. The building process consisted of automatic extraction of terms from the text followed by supervised filtering. The dictionary was then supplied to the text processor for identification of terms in the text. Finally, articles were clustered by counting the co-occurrence of terms in the same abstract, paragraph or sentence. Basic elements of linguistic analysis (protein = proteins, gene expression = expression of gene, etc.) and substitution of synonyms were applied at that stage. The resulting database of semantic terms and links can be viewed by an internet client in the form of either a graphical network or a taxonomy of dictionary items. 
A trial version of Semantic Link built on the collection of terms of the Gene Ontology Consortium is available at http://atlasinfo.clontech.com/. |
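The co-occurrence counting described in abstract 133 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cooccurrence_links`, the three example sentences and the three-term dictionary are invented, and terms are matched by naive lowercase substring search rather than by the linguistic analysis and synonym substitution the abstract mentions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_links(documents, dictionary):
    """Count how often two dictionary terms appear in the same text unit."""
    links = Counter()
    for text in documents:
        lowered = text.lower()
        # Naive substring matching; a real system would tokenise and
        # normalise word forms before looking terms up.
        present = sorted(term for term in dictionary if term in lowered)
        for pair in combinations(present, 2):
            links[pair] += 1
    return links

# Made-up text units (they could be abstracts, paragraphs or sentences).
abstracts = [
    "p53 regulates apoptosis in tumor cells",
    "Loss of p53 is linked to cancer",
    "Apoptosis is suppressed in cancer",
]
terms = {"p53", "apoptosis", "cancer"}
links = cooccurrence_links(abstracts, terms)
```

Each key of `links` is an alphabetically ordered pair of terms and each value is the number of text units in which both occur, which is the kind of weighted link a graphical network view could display.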
|
|
| 134. Integration of transcript reconstruction and gene expression profiles to enhance disease gene discovery. (up) |
| Peter van Heusden, Electric Genetics;
Alan Christoffels, Soraya Bardien, South African National Bioinformatics Institute; Gary Greyling, Electric Genetics; Ari Ziskind, University of Stellenbosch; Johann Visagie, Antoine van Gelder, Electric Genetics; Janet Kelso, South African National Bioinformatics Institute; Liza Groenewald, Tania Hide, Electric Genetics; Win Hide, South African National Bioinformatics Institute; pvh@egenetics.com |
| Short Abstract:
We developed a tool to automate identification of positional candidates for genetic disorders based on expression state, physical mapping and genome mapping information. A controlled vocabulary was integrated into the stackPACK EST clustering system to generate expression profiles, and the resulting transcripts were mapped to the genome (http://genome.ucsc.edu/) and graphically visualised. |
| One Page Abstract:
There is an urgent need among human geneticists for bioinformatic tools to exploit the sequence data and other information generated by the Human Genome Project. We have developed a tool to automate identification of positional candidates for genetic disorders based on (1) expression state, (2) physical mapping and (3) genome mapping information. For expression state we will extract information from various gene expression repositories for standardised functional annotation of positional candidates, thereby enabling the effective prioritisation of these genes. Unprocessed expression data in the form of expressed sequence tags (ESTs), serial analysis of gene expression (SAGE) tags, and array-based experiments are stored in numerous disparate databases. However, in the absence of a standardised nomenclature, accessing and manipulating this information is problematic. These difficulties are compounded in the context of high-throughput systematic analysis and emphasise the need for consistent description of the same terms and objects across databases. We have constructed a controlled vocabulary for standardised description of gene expression state. This vocabulary has been integrated into the stackPACK EST clustering system in order to generate cluster expression profiles. The credibility of these genes as positional candidates is enhanced by mapping them onto the Santa Cruz-assembled human genome sequence (http://genome.ucsc.edu/) using BLAST and SIM4. Genemap 99 radiation hybrid markers were also mapped to the genome sequence using ePCR to provide reference points. The resulting expression profiles and mapping information are exported in a standardised EMBL format for visualisation purposes, e.g. using Artemis. The tool has been tested using two known disease loci, retinitis pigmentosa on 8q (RP1 gene) and the type 2 diabetes locus on 2q (CAPN10 gene). |
|
|
| 135. Gene Expression Database (GXD): integrated access to gene expression information from the laboratory mouse (up) |
| Martin Ringwald, Dale A. Begley, Ingeborg J. McCright, Terry F. Hayamizu,
David P. Hill, Constance M. Smith, Judith A. Blake, Janan T. Eppig, Jim
A. Kadin, Joel E. Richardson, The Jackson Laboratory;
ringwald@informatics.jax.org |
| Short Abstract:
GXD is a community resource. Its objective is to capture and integrate different types of gene expression data from the laboratory mouse and to place these data in the larger biological and analytical context. GXD is accessible at http://www.informatics.jax.org/. New data are made available on a daily basis. |
| One Page Abstract:
The Gene Expression Database (GXD) is a community resource of gene expression information from the laboratory mouse. The database is designed as an open-ended system that can integrate different types of expression data, such as RNA in situ hybridization and immunohistochemistry data, Northern and Western blot data, RT-PCR data, cDNA data, and microarray data. Thus, as data accumulate, GXD provides increasingly complete information about what transcripts and proteins are produced by what genes; where, when and in what amounts these gene products are expressed; and how their expression varies in different mouse strains and mutants. Expression patterns are described using an extensive dictionary of anatomical terms for the mouse that has been established in collaboration with our colleagues in Edinburgh, UK*. The anatomical dictionary names the tissues and structures for each developmental stage, and organizes the terms hierarchically from body region or system to tissue to tissue substructure. This model enables an integrated description of expression patterns for various assays with differing spatial resolution, computational analysis of expression patterns at different levels of detail, and continuous extensions of the anatomical dictionary itself. Expression records are linked to digitized images of original expression data. GXD is available at http://www.informatics.jax.org/. It is integrated with the Mouse Genome Database to enable a combined analysis of genotype, expression, and phenotype data. In conjunction with the Gene Ontology project we build shared controlled vocabularies for biological processes, molecular functions and cellular components and assign those terms to mouse genes and their products. These classification schemes provide important new search parameters for expression data. Extensive interconnections with sequence databases and with databases from other species further extend GXD's utility for analysis of gene expression information. 
*Edinburgh collaborators: J. Bard, R. Baldock, D. Davidson, M. Kaufman. GXD is supported by NIH grant HD33745. The Gene Ontology project is supported by NIH grant HG02273. |
|
|
| 136. Analysis of gene expression profiles between interacting protein pairs in M. musculus (up) |
| Rintaro Saito, Harukazu Suzuki, Ikuko Kagawa, Rika Miki, Hidemasa Bono,
Hideaki Konno, Yasushi Okazaki, Yoshihide Hayashizaki, Laboratory for
Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC),
RIKEN Yokohama Institute;
rsaito@gsc.riken.go.jp |
| Short Abstract:
Toward an integrative analysis of gene expression data and protein-protein interaction data, we have calculated correlation coefficients of gene expression profiles between interacting protein pairs in M. musculus. We will present current results and discuss the general rules of expression patterns between the interacting pairs. |
| One Page Abstract:
Proteins play pivotal roles in all biological phenomena, and physiological interactions among many proteins underlie biological pathways such as metabolic pathways and signal transduction pathways. The analysis of biological pathways is one of the most important issues not only for molecular biology but also for medicine. The recent development of DNA microarray technologies has enabled us to examine the expression patterns of many genes at a time. In addition, the yeast two-hybrid method is widely used to screen physiological protein-protein interactions in a high-throughput manner. Several computational methods to infer pathways using either expression data or protein-protein interaction data are under development. However, integrated approaches for analyzing both expression data and protein-protein interactions in higher organisms have not yet been established. The genome encyclopedia project in the RIKEN genome exploration research group has already collected a large number of mouse full-length enriched cDNAs (Nature 409:p685, 2001). We have also analyzed the expression profiles of those cDNAs in 49 different tissues using DNA microarrays (Proc.Natl.Acad.Sci. USA 98:p2199, 2001). In addition, we are screening protein-protein interactions using those cDNAs and have identified approximately 150 interactions (paper in submission). We have analyzed the correlation coefficients of gene expression profiles between interacting protein pairs. The results show that the degree of correlation seems to depend on both the set of data selected for the calculation and the protein functions. We will present the current results and discuss the general rules of expression patterns between the interacting pairs. Further, a computational method to infer novel pathways using expression and protein-protein interaction data will be discussed. |
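The central calculation in abstract 136, a correlation coefficient between the expression profiles of two interacting proteins, might look like the following Python sketch. The abstract does not say which coefficient was used; Pearson's is assumed here, and the gene names, profiles and interaction list are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)

# Made-up expression profiles over four tissues.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
# Made-up interacting pairs, e.g. as a two-hybrid screen might report them.
interactions = [("geneA", "geneB"), ("geneA", "geneC")]
correlations = {
    pair: pearson(expression[pair[0]], expression[pair[1]])
    for pair in interactions
}
```

Restricting `expression` to a subset of tissues before calling `pearson` is one way to explore the dependence on the selected data that the abstract reports.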
|
|
| 137. Learning genomic nature of complex diseases from the gene expression data (up) |
| Andrew B. Goryachev, GeneData AG;
Pascale F. Macgregor, Clinical Genomics Center, University Health Network, Toronto, Canada; Katryn Furuya, Hospital for Sick Children, Toronto, Canada; Aled M. Edwards, C.H. Best Institute, University of Toronto, Canada; Andrew.Goryachev@genedata.com |
| Short Abstract:
The major genomics challenge is how to apply various data-mining tools to extract biologically important information from expression data. We present a complete study of a complex liver disease. A variety of statistical analyses provided by the Expressionist software was applied to reveal intricate co-expression patterns characterising the disease. |
| One Page Abstract:
Significant emphasis is currently placed on understanding the molecular nature of complex human diseases, e.g., cancers. It has become evident that maladies caused by the malfunction of a single gene are rare. Instead, complex genome-scale aberrations are found to be responsible in an ever-growing number of cases. Expression data provide ample evidence for the existence of complex relationships between genes in a given disorder. However, identifying such connections from experimental data is a challenging task that requires a variety of data mining methods applied in various combinations. In practical applications, in which several diseases represented by many samples are compared to heterogeneous normal groups, the complexity of the analysis quickly explodes. This overwhelming complexity demands sophisticated software tools offering a comprehensive set of analyses as well as advanced data management. We present a complete study in which complex expression data were analysed with the Expressionist software from GeneData AG. A human liver disorder of poorly understood origin was compared to another liver disease and to a normal group of samples in a large-scale expression profiling experiment. A variety of filtering, clustering and correlation analysis methods was applied to the data to reveal intricate patterns of gene co-expression hinting at possible co-regulation characteristic of the particular disease. We also present a novel clustering approach which provides a flexible definition of the cluster size and number. |
|
|
| 138. Comparative Assessment of Normalization Methods for cDNA microarray data (up) |
| Ilana Saarikko, Timo Viljanen, Turku Centre for Biotechnology and
Turku Centre for Computer Science, University of Turku, Finland;
Riitta Lahesmaa, Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Finland; Tapio Salakoski, Turku Centre for Biotechnology and Turku Centre for Computer Science, University of Turku, Finland; Esa Uusipaikka, Department of Statistics, University of Turku, Finland; ilana.saarikko@btk.utu.fi |
| Short Abstract:
There are many sources of variation in data obtained by cDNA microarray experiments. Using replicated experiments, we have studied how existing normalization methods affect data and subsequent analysis. Based on this study we point out key issues in normalization as well as propose guidelines for choosing methods for selected situations. |
| One Page Abstract:
Microarrays are one of the latest breakthroughs in the field of biotechnology and allow the expression levels of thousands of genes to be monitored simultaneously. Microarray technology is already widely in use, and its applications range from comparison of expression profiles to prediction of regulatory networks. One of the major problems of this new technology is the uneven quality of the data. Due to the nature of microarray experiments, there are many sources of variation in the data obtained. To make the data reliable enough to enable comparison across experiments, such variation needs to be removed. The process of removing this variation is called normalization. Commonly used normalization methods force the distribution of the log-ratio of expression levels to have a median or mean of zero. We have studied existing normalization methods and demonstrated how different methods affect the data and subsequent analysis. One of our goals is to define general criteria for necessary and sufficient normalization. In our studies we have focused on replicated microarray experiments. We used data from four replicated slides, each having three replicated arrays of 1536 genes. Slides were hybridized with mRNA from two different samples, which were labeled with Cy3 and with Cy5. We also used data from additional staining as a measure of the amount of cDNA on each probe. These data were used for normalization purposes. Based on the replicate data, we validate the normalization methods in two ways. First, we examine how normalization affects the correlation between replicate arrays. Second, we study how the set of differentially expressed genes, defined by various criteria, varies with different normalization methods. On the basis of the results, we suggest guidelines for choosing good normalization methods for different situations. keywords: cDNA microarray, gene expression, normalization |
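The commonly used normalization mentioned in abstract 138, forcing the log-ratios of an array to have a median of zero, can be sketched as follows. This is a minimal illustration, not one of the methods the authors compared in detail: the function name `median_normalise` and the intensity values are made up, and the ratio is taken as Cy5 over Cy3 by assumption.

```python
import math
from statistics import median

def median_normalise(cy5, cy3):
    """Return log2(Cy5/Cy3) ratios shifted so that their median is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    centre = median(log_ratios)
    return [m - centre for m in log_ratios]

# Made-up spot intensities for five genes on one array.
cy5 = [200.0, 400.0, 800.0, 1600.0, 3200.0]
cy3 = [100.0, 100.0, 100.0, 100.0, 100.0]
normalised = median_normalise(cy5, cy3)
```

Replacing `median` with `statistics.mean` gives the mean-centred variant. Both variants apply one global shift per array, so they cannot remove intensity-dependent or spatial effects.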
|
|
| 139. Identifying different types of human lymphoma by SVM and ensembles of learning machines using DNA microarray data. (up) |
| Giorgio Valentini, D.I.S.I., Dipartimento di Informatica e Scienze
dell' Informazione, Universita' di Genova;
valenti@disi.unige.it |
| Short Abstract:
We propose supervised methods for identifying different types of human lymphoma using DNA microarray gene expression data. Support Vector Machines and ensembles of neural networks can correctly classify different types of lymphoma, offering also insights into the role of coordinately expressed groups of genes in carcinogenic processes of lymphoid cells. |
| One Page Abstract:
DNA hybridization microarrays supply information about gene expression through measurements of the mRNA levels of large numbers of genes in a cell. Information obtained by DNA microarray technology gives a snapshot of the overall functional status of a cell, offering new insights into potentially different types of lymphoma, discriminated on a molecular and functional basis. Gene expression data produced by DNA microarray technology can be processed through unsupervised machine learning methods, using clustering algorithms to group together similar expression patterns corresponding to different tissues in order to separate cancerous from normal samples. However, unsupervised methods cannot always correctly separate classes. Supervised methods can overcome this problem by exploiting "a priori" biological and medical knowledge of the problem domain. In this work we use supervised learning methods for recognizing cancerous and normal lymphoid tissues, classifying different types of human lymphoma and identifying groups of genes related to a specific type of lymphoma. We use data from a specialized DNA microarray, named "Lymphochip", developed at Stanford University School of Medicine and specifically designed to study genes related to lymphoid cells and carcinogenesis. In our first task we distinguish cancerous from normal tissues using all the information available. This dichotomic problem is tackled using Support Vector Machines (SVM) and Multi-Layer Perceptrons (MLP). In our second task we try to directly classify different types of lymphoma (a multiclass problem) using MLPs and Parallel Non-linear Dichotomizers (PND), i.e. ensembles of learning machines based on output coding decomposition of a multiclass problem. These methods consist of decomposing a multiclass problem into a set of two-class problems according to some decomposition scheme, training the dichotomizers independently and combining their outputs to give the class label. 
In the third task we show how to use "a priori" biological and medical knowledge to separate two functional subclasses of diffuse large B-cell lymphoma (DLBCL) that are not detectable with traditional morphological classification schemes, identifying a set of coordinately expressed genes related to the separation of the two DLBCL subgroups. The results show that SVM, MLP and PND can be successfully applied to the analysis of DNA microarray gene expression data and to the identification of sets of coordinately expressed genes related to specific types of lymphoma. |
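The output coding decomposition described in abstract 139 can be illustrated by its last step, combining the outputs of independently trained dichotomizers into one class label. The Python sketch below is an assumption-laden example rather than the author's PND implementation: it uses the simplest one-per-class coding, hypothetical class names, made-up dichotomizer outputs in [0, 1], and Hamming-distance decoding.

```python
def one_per_class_code(classes):
    """Codebook for the simplest decomposition: one dichotomizer per class."""
    return {
        label: [1 if i == j else 0 for j in range(len(classes))]
        for i, label in enumerate(classes)
    }

def decode(outputs, codebook):
    """Pick the class whose codeword is nearest to the thresholded outputs."""
    bits = [1 if o >= 0.5 else 0 for o in outputs]

    def hamming(codeword):
        return sum(b != c for b, c in zip(bits, codeword))

    # On ties, min() returns the first class in codebook order.
    return min(codebook, key=lambda label: hamming(codebook[label]))

classes = ["type_A", "type_B", "type_C"]   # hypothetical lymphoma classes
codebook = one_per_class_code(classes)
outputs = [0.1, 0.8, 0.3]                  # made-up outputs of three dichotomizers
predicted = decode(outputs, codebook)
```

Other decomposition schemes only change the codebook: longer, error-correcting codewords let the decoder recover the right class even when some dichotomizers are wrong.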
|
|
| 140. On the Influence of the Transcription Factor on the Information Content of Binding Sites (up) |
| Jan T. Kim, Thomas Martinetz, Daniel Polani, Institut für Neuro-
und Bioinformatik, Universität zu Lübeck;
kim@inb.mu-luebeck.de |
| Short Abstract:
We develop a probabilistic model for the coevolution of a transcription factor and its binding sites. Maximum entropy analysis reveals connections between binding site information content and the binding behaviour of the transcription factor, and gives insight into the bioinformatic basis of Rsequence = Rfrequency. This may be useful for improving binding site recognition. |
| One Page Abstract:
Transcription factors and their binding sites are a centerpiece of genetic information processing. Transcription factor binding sites are short sequence words. The location of these binding sites on the genome provides important information about the structure of the regulatory networks the transcription factor is involved in, as well as about the location of genes and other coding regions on the genome. However, finding these binding sites has turned out to be a difficult task which can only be solved with prior knowledge about the principal binding behaviour of transcription factors. A model for the basic probability distributions underlying the coevolution of the transcription factor and its binding sites within the genome is presented. State spaces for the transcription factor and for the genome are jointly represented, which is an extension of previous models in which only genome space is considered. The model is formally analyzed with a maximum entropy approach. Empirical analyses using computer-based enumerations of the joint state spaces are performed to show that the approximations made during formal analysis are justified. The results give new insights into the connection between the information content of these binding sites and the binding behaviour of the transcription factor, with particularly interesting implications for the relation between binding site information content (Rsequence) and binding site abundance on the genome (determining Rfrequency). The intriguing empirical observation that Rsequence approximately equals Rfrequency in a couple of instances still awaits a complete bioinformatic explanation. Our analysis reveals that this (approximate) equality cannot be generically deduced from information theoretic principles. Regarding basic bioinformatics, this finding leads to a renewal of interest in empirical studies of binding site information content and distribution across genomes. 
Since binding site information content is not determined by fundamental informatic principles, one must assume that the relation between Rsequence and Rfrequency is determined by biological principles that are not yet known and that should therefore be investigated using empirical studies combined with theoretical efforts. On the applied side, advances in understanding the bioinformatic principles underlying binding site evolution are likely to provide additional sources of prior knowledge that are useful for developing improved binding site recognition schemes. |
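The two quantities compared in abstract 140 can be made concrete with a short Python sketch. The abstract does not define them; the sketch assumes the usual information-theoretic definitions, namely Rsequence as the sum over site positions of 2 bits minus the per-position entropy (equiprobable bases, no small-sample correction) and Rfrequency as log2 of the number of possible positions divided by the number of sites. The aligned sites and the genome size are made up.

```python
import math
from collections import Counter

def r_sequence(sites):
    """Information content of aligned DNA sites, in bits."""
    n = len(sites)
    total = 0.0
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        # 2 bits is the maximum entropy for four equiprobable bases.
        total += 2.0 - entropy
    return total

def r_frequency(genome_positions, n_sites):
    """Bits needed to single out n_sites among genome_positions positions."""
    return math.log2(genome_positions / n_sites)

sites = ["ACGT", "ACGA", "ACGC", "ACGG"]   # made-up aligned binding sites
r_seq = r_sequence(sites)                   # three conserved positions, one random
r_freq = r_frequency(4096, len(sites))      # made-up number of genome positions
```

In this toy example the two values differ (6 bits versus 10 bits), which is consistent with the abstract's point that their equality does not follow from the definitions alone.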
|
|
| 141. A Mouse Developmental Gene Index (up) |
| Janet Kelso, South African National Bioinformatics Institute, University
of the Western Cape;
George J. Kargul, Yong Qian, Dawood B. Dudekula, Minoru S.H. Ko, Developmental Genomics and Aging Section, Laboratory of Genetics, National Institute on Aging, National Institutes of; Winston A. Hide, South African National Bioinformatics Institute, University of the Western Cape; janet@sanbi.ac.za |
| Short Abstract:
We produced and annotated a mouse developmental gene index using cDNAs generated from mouse developmental libraries. This index has been compared to the RIKEN mouse cDNA collection to determine redundancy of the datasets. Selection and annotation of clones for rearraying, and subsequent production of a mouse cDNA microarray is presented. |
| One Page Abstract:
While providing large amounts of genomic information, genomic sequencing efforts do not address the pressing need for comprehensive gene expression information. Despite their generally low sequence quality and short length, expressed sequence tags (ESTs) remain a rich source of gene expression information, providing data on expression location, expression level and the presence of alternative transcript isoforms. Attempts to elucidate the entire expressed gene complement of an organism have been hampered by the scarcity of full-length cDNAs representing all expressed gene transcripts. The absence of full-length transcript data and the relative abundance of ESTs have led a number of groups to produce reconstructed transcript gene indices. These gene indices seek to reduce the redundancy and error present in the EST databases by clustering and assembling ESTs based on sequence identity and clone annotation. Clustered EST data have proven invaluable for the discovery of genes and alternative splice forms, for genome annotation and for the study of gene regulation. In this study we have produced and annotated a mouse developmental gene index from high quality cDNA sequences generated from early mouse developmental libraries, in collaboration with Minoru Ko's group in the Gerontology Research Center at the National Institute on Aging. This gene index has been compared to the recently published RIKEN mouse cDNA collection to determine the redundancy of the datasets. Progress in the selection and annotation of clones for rearraying and subsequent production of a mouse developmental cDNA microarray is presented. |
|
|
| 142. Using Gene Expression and Artificial Neural Networks for Classification and Diagnostic Prediction of Cancers (up) |
| Markus Ringner, National Human Genome Research Institute/NIH;
Javed Khan, National Cancer Institute/NIH; Jun S Wei, Lao H Saal, National Human Genome Research Institute/NIH; Carsten Peterson, Complex Systems Division, Lund University; Paul S. Meltzer, National Human Genome Research Institute/NIH; mringner@thep.lu.se |
| Short Abstract:
A method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs) is presented. We trained the ANNs using small round blue cell tumors, belonging to four distinct diagnostic categories. The ANNs correctly classified all samples and identified the genes most relevant to the classification. |
| One Page Abstract:
Small round blue cell tumors (SRBCT) of childhood, including neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt's lymphoma (BL) and Ewing's sarcoma (EWS), are difficult to distinguish by routine immunohistochemistry. Currently there is no single test that can precisely distinguish these cancers, and several techniques are used to diagnose them, including cytogenetics, interphase fluorescence in situ hybridization, reverse transcription PCR and immunohistochemistry. In addition, poorly differentiated cancers can still pose a diagnostic dilemma. Gene expression profiling with cDNA microarray techniques permits the simultaneous analysis of multiple markers and hence offers considerable promise for categorizing cancers into subgroups. We use gene expression data from cDNA microarrays containing 6567 genes from 63 SRBCT to calibrate artificial neural network (ANN) models to recognize cancers belonging to each of the four categories. The training samples included both tumor biopsy material (13 EWS and 10 RMS) and cell lines (10 EWS, 10 RMS, 12 NB and 8 BL). Given the small available data set, we preprocess the gene expression levels using Principal Component Analysis (PCA), retaining 10 dominant directions and thereby reducing the input space significantly. We classify the samples into the four categories using a 3-fold cross validation procedure: The 63 known (labeled) samples are randomly shuffled and split into 3 equally sized groups. Linear perceptron models are then calibrated with 10 input variables using two of the groups, and the third group is reserved for testing predictions (validation). This procedure is repeated 3 times, each time with a different group used for validation. The random shuffling is redone 1250 times and for each shuffling we analyze 3 ANN models. Thus, in total each sample belongs to a validation set 1250 times and 3750 ANN models have been calibrated. The committee of models classifies all validation samples correctly. 
Due to the limited amount of training data and the high performance already achieved, we limit ourselves to linear models with no hidden units. Confidence measures in terms of distances to ideal classifications are developed for the data. The sensitivity to each gene is determined by the absolute value of the partial derivative of the output with respect to the gene's expression, averaged over samples and ANN models. Using the resulting ranked list of the inputs (genes), we redo the training procedure for different numbers of inputs and establish the minimal number of genes that optimizes the classification of the four cancer types. In this way 96 genes are identified which correctly classify the 63 samples. We then further test the validity of the models by classifying an additional set of 25 ("blind test") samples containing [A] 20 SRBCT, comprising tumor samples (5 EWS, 5 RMS, and 4 NB) and cell lines (1 EWS, 2 NB, 3 BL), and [B] 5 non-SRBCT samples (including 2 normal muscle samples). We are able to [A] correctly classify all 20 of the SRBCT and [B] reject the non-SRBCT samples based on confidence-related criteria. In addition, on evaluation of the list of the top 96 ranked genes we identify several genes that are uniquely expressed in a specific cancer, that have potential biological and therapeutic implications, and that have not been previously associated with these cancers. We feel that this method of ANN analysis of gene expression data provides a powerful tool for classification, diagnosis and gene discovery. That only 96 genes are required for this application opens the potential for cost-effective fabrication of SRBCT subarrays for diagnostic use. |
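The bookkeeping of the repeated 3-fold cross validation in abstract 142 can be checked with a small Python sketch. It only generates the splits and counts them; no ANN is trained, and the function name `three_fold_splits` and the fixed random seed are illustrative assumptions.

```python
import random

def three_fold_splits(n_samples, rng):
    """Shuffle sample indices and yield three (training, validation) splits."""
    if n_samples % 3 != 0:
        raise ValueError("this sketch assumes three equally sized groups")
    indices = list(range(n_samples))
    rng.shuffle(indices)
    size = n_samples // 3
    folds = [indices[i * size:(i + 1) * size] for i in range(3)]
    for k in range(3):
        validation = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation

rng = random.Random(0)            # fixed seed only for reproducibility
n_samples, n_shuffles = 63, 1250
models = 0
validation_counts = [0] * n_samples
for _ in range(n_shuffles):
    for training, validation in three_fold_splits(n_samples, rng):
        models += 1               # one model would be calibrated on `training` here
        for i in validation:
            validation_counts[i] += 1
```

Running the loop reproduces the figures quoted in the abstract: 1250 shufflings times 3 splits give 3750 calibrated models, and every sample falls into a validation set exactly once per shuffling, i.e. 1250 times.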
|
|
| 143. Classification of malignant states in multistep carcinogenesis using gene expression matrix (up) |
| Koji Kadota, Department of Biotechnology, The University of Tokyo,
and Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences
Center (GSC), RIKEN Yokohama Institute;
Yasushi Okazaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute; Shugo Nakamura, Department of Biotechnology, The University of Tokyo; Yoshihide Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute; Kentaro Shimizu, Department of Biotechnology, The University of Tokyo; kadota@bi.a.u-tokyo.ac.jp |
| Short Abstract:
cDNA microarray technology has the potential to be used in the clinic to distinguish malignant samples from benign ones. We have developed an efficient method to extract genes that contribute to separating malignant samples from benign ones with minimal false-negative diagnoses. |
| One Page Abstract:
Certain types of cancer are reported to develop through multistep carcinogenesis, and several tumor types show a clinical course ranging from benign to malignant. Recently, microarray technology has revolutionized the observation of global gene expression across many tissues and conditions. This technique has been successfully applied to clinical samples to distinguish malignant from benign samples. To date, several supervised and unsupervised methods have been developed to classify two distinct states, such as tumor vs. normal clinical samples. However, the accuracy of classification using these methods varies depending on the dataset. For diagnosis, it is more important that predictor genes identify malignant samples as malignant with 100% accuracy than that they identify benign samples as benign. In this work, we have developed a novel method to select genes that characterize the malignant state as distinct from the benign state using a gene expression matrix. In brief, genes that contribute to the characterization of the malignant phenotype were selected by removing each gene in turn from the original gene set to see whether it contributes positively to characterizing that phenotype. We applied this algorithm to clinical samples to evaluate whether the presence of metastasis can be accurately predicted. |
|
|
| 144. Bioinformatics Tools in the Screening of Gene Delivery Systems (up) |
| Karin Regnström, Eva Ragnarsson, Per Artursson, Dep of Pharmacy,
University of Uppsala, Sweden;
Karin.Regnstrom@galenik.uu.se |
| Short Abstract:
We use array technology and bioinformatics tools to evaluate the gene expression profiles of suitable gene delivery systems. Our studies show that each delivery system tested results in a unique profile, a "fingerprint". Together with other experimental data, these profiles are used for screening and further design of delivery systems. |
| One Page Abstract:
Purpose. To use bioinformatics tools in the comparison and evaluation of gene expression profiles originating from treatment with newly developed gene delivery systems. Introduction. In our laboratory we use array technology to evaluate immunogenic properties as well as possible toxic reactions of suitable gene delivery candidates. Methods. Gene delivery systems formulated with a reporter plasmid were administered mucosally to mice. Cells from the animals were harvested and total RNA was extracted. 32P-labeled cDNA copies of the RNA samples were produced, and the probes were hybridized to a cDNA expression array and scanned with a phosphorimager. The images were analyzed and normalized to the data of control samples to enable comparison of the different formulations. Pairwise comparisons of the overall gene expression changes between different delivery systems were made using the Spotfire program (1). For comparisons of up to five delivery systems the GeneCluster program (2) was used. The gene expression data were filtered to obtain genes with a significant change in expression and clustered using self-organizing maps (SOMs). Further visualization was obtained with the Treeview program (3). Results. The genes that passed the significance filter were sorted in SOMs, and distinct clusters were obtained for the different delivery systems. The clusters showed gene groups that were selectively affected after treatment with different delivery systems. Some samples also showed high expression of known toxicity markers. It was possible to discern a gene expression "fingerprint" for each of the gene delivery systems tested. Conclusions. This study identified important changes in gene expression profiles induced by the gene delivery systems studied. We conclude that bioinformatics in combination with array technology has great potential for the evaluation of pharmaceutical formulations during screening procedures. In progress. 
We want to create a database containing gene expression data from all our formulations tested as well as other experimental data and molecular properties of the delivery systems. Our goal is to develop a tool which screens this database for suitable gene delivery systems by multiple comparisons and evaluations, which results in improved design of gene delivery systems. 2. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. & Golub, T. R. (1999) Proc Natl Acad Sci U S A 96, 2907-12. 3. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc Natl Acad Sci U S A 95, 14863-8. |
|
|
| 145. Cross talking in cellular networks: tRNA-synthetase and amino acid synthetic enzymes in Escherichia coli (up) |
| Emmeli Taberman, Måns Ehrenberg, Uppsala University;
emmeli.taberman@icm.uu.se |
| Short Abstract:
By constructing global mathematical models of growing bacteria we studied the control of production of an amino acid and its aminoacyl tRNA synthetase. The goal was to investigate how the cell can avoid interference between these two control loops, and discriminate between signals for charging deficiency and amino acid deficiency. |
| One Page Abstract:
It has been known for a long time that microorganisms exert different types of control for the expression of different operons. Control can be exerted at the transcriptional (e.g. by ribosome dependent attenuation mechanisms or repressors), the translational (e.g. by autogenous feed-back) or at the posttranslational (e.g. protein modifications) level. To assess how mechanisms for the control of gene expression behave in vivo we have constructed a global mathematical model for growing bacteria. This has been used to study, first, control of expression of an operon for enzymes that synthesise the amino acid threonine and, second, control of synthesis of the aminoacyl-tRNA synthetase (ThrRS) that couples Thr to tRNAThr. The threonine biosynthetic pathway is regulated by an attenuation mechanism, involving a leader peptide with multiple Thr and Ile codons. The expression from the gene for ThrRS is regulated by an autogenous mechanism, where the leader of the mRNA that encodes ThrRS mimics tRNAThr. When ThrRS is in excess, it binds strongly to the leader of its mRNA and thereby inhibits initiation of translation. An interesting problem is now how the cell can discriminate between a maladjusted rate of synthesis of an amino acid, on one hand, and a too high or a too low level of the corresponding aminoacyl-tRNA synthetase, on the other. For instance, if the aminoacyl-tRNA synthetase concentration is too low, this will not only signal for increased production of the synthetase but also for increased (attenuation control with ribosome step time as signal) or decreased (repressor control with amino acid pool as signal) production of the amino acid synthesising pathway. Our analysis shows that there is considerable "cross-talk" between control systems for amino acid synthesis and production of tRNA synthetases. We discuss how the cell can minimize the negative effects of signal misinterpretations due to such crosstalk. 
We also describe suitable experiments to test predictions based on our mathematical models. |
|
|
| 146. Assessing Clusters and Motifs from Gene Expression Data (up) |
| D. K. Smith, L. M. Jakt, Biochemistry Dept., Univ. of Hong Kong;
L. Cao, Dept. Microbiology, UHK; K. S. E. Cheah, Biochemistry Dept., Univ. of Hong Kong; dsmith@hkusua.hku.hk |
| Short Abstract:
A method has been developed to assess gene clusters derived from microarray experiments. The probability of finding motif matches associated with the genes in the cluster by chance is determined. Issues of biological relevance, over or under-clustering, activity in several clusters or the refinement of motifs can be addressed. |
| One Page Abstract:
ASSESSING CLUSTERS AND MOTIFS FROM GENE EXPRESSION DATA Jakt, L.M.1, Cao, L.2, Cheah, K.S.E.1 and Smith, D.K.1 Departments of Biochemistry1 and Microbiology2, University of Hong Kong, Pok Fu Lam, Hong Kong. When analysing gene expression data from microarray based studies, it is common to compare the expression profiles of the genes and perform some clustering of the profiles. Genes with similar expression profiles are grouped by the clustering algorithm and these genes are more likely to have similar functions or to be regulated in a common manner. Searches for conserved DNA motifs, which may potentially be cis-regulatory elements, can be undertaken in the non-coding regions of the genes in the cluster. For a computational study of gene expression there is a wide range of algorithms available to cluster expression profiles, to find new motifs in unaligned DNA sequences and to match known motifs to DNA sequences. Experimental errors from the microarray studies can also propagate through the computational analysis and so compound the effects of any limitations in the algorithms used. A method to evaluate these analyses is desirable. We have developed a method to assess the potential functional significance of clusters and motifs which is based on the probability of finding a certain number of matches to a motif in all of the gene clusters. As a starting point, we take a set of genes that have been clustered, based on their expression profiles, by some algorithm and a series of sequence motifs that may describe cis-regulatory elements. Issues of what threshold score to use for the differing motif matching algorithms are avoided by taking the best matches to a motif across the gene set, in groups of 50 to 600 matches. By counting the number of matches that are associated with each gene cluster, we can calculate the probability of observing, by chance, that number of matches to a motif in the non-coding regions of the genes in a cluster. 
The likely functional relevance of the clusters and motifs can be assessed based on these probabilities. This technique allows strongly and weakly matching motifs to be detected and refined, and significant matches to motifs across cluster boundaries can be observed. Application of this method to the yeast genome and a series of regulatory motifs led to the prediction that the previously unidentified factor known as Swi Five Factor was one of the yeast fork head proteins. Subsequently, this was confirmed by others. |
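The chance probability of seeing a given number of motif matches inside one cluster can be illustrated with a hypergeometric tail. The abstract does not state which distribution is used, so this is an assumption made for illustration: the best matches are treated as hitting distinct genes drawn without replacement from the clustered gene set, and the function and parameter names are hypothetical.

```python
from math import comb


def prob_at_least(k, n_genes, cluster_size, n_matches):
    """P(X >= k) when n_matches matched genes are drawn without replacement
    from n_genes genes, of which cluster_size belong to the cluster.

    Assumption (not from the abstract): each match falls on a distinct gene,
    so X follows a hypergeometric distribution.
    """
    total = comb(n_genes, n_matches)
    upper = min(cluster_size, n_matches)   # X cannot exceed either count
    favourable = sum(
        comb(cluster_size, i) * comb(n_genes - cluster_size, n_matches - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical numbers: 6000 genes, a cluster of 40 genes, the best 200
    # matches to a motif, 8 of which fall in the cluster.
    print(prob_at_least(8, 6000, 40, 200))
```

A small probability suggests that the motif is enriched in the cluster beyond what chance placement of the best matches would give; repeating the calculation for group sizes between 50 and 600 matches, as the abstract describes, avoids committing to a single score threshold.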
|
|
| 147. Statistical Analysis of Gene Expression Profile Changes among Experimental Groups (up) |
| Taesung Park, Sung-Gon Lee, Seungmook Lee, Department of Statistics,
Seoul National University;
Dong-Hyun Yoo, Mi-Yoon Chang, Yong-Sung Lee, Department of Biochemistry, Hanyang University College of Medicine; tspark@stats.snu.ac.kr |
| Short Abstract:
cDNA microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. We propose a test procedure for testing gene expression profile differences among experimental groups. The test procedure is illustrated using cDNA microarrays of 3,800 gene expression profiles from neuronal differentiation of cortical stem cells. |
| One Page Abstract:
cDNA microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. Cluster analysis is commonly used to group together genes with similar patterns of expression. Genes in different clusters tend to be regarded as having different expression profiles. When we are interested in testing gene expression profiles over time for different experimental groups, however, the usual clustering methods do not help much. We consider a simple summary measure to distinguish genes that have high variability from those that do not. Using this measure, we propose a procedure to test for differences in gene expression profiles among experimental groups. The test procedure is illustrated using cDNA microarrays of 3,800 genes obtained in an experiment to search for changes in gene expression profiles during neuronal differentiation of cortical stem cells. |
|
|
| 148. Multivariate method for selection of sets of differently expressed genes (up) |
| Ashot Chilingarian, N. Gevorgyan, Cosmic Ray Division, Yerevan Physics
Institute, Armenia;
A. Szabo, Department of Oncological Sciences and Huntsman Cancer Institute, University of Utah; A. Vardanyan, Cosmic Ray Division, Yerevan Physics Institute, Armenia; chili@yerphi.am |
| Short Abstract:
Genes differentially expressed in two tissues are found by an evolutionary algorithm maximizing the Mahalanobis distance between gene expression vectors. "Evolutionary bootstrap" resolves the instability of sample covariance matrices. We show the superiority of this multidimensional method compared to commonly used one-dimensional tests using a microarray data simulation model. |
| One Page Abstract:
An important problem addressed using cDNA microarray data is the detection of genes differentially expressed in two tissues of interest. Currently used approaches consider each gene separately and evaluate their differential expression independently, ignoring the multidimensional structure of the data. However, it is well known that correlation among covariates can enhance the ability to detect less pronounced differences. We propose a novel approach that uses the gene correlation information to find the differentially expressed genes. The Mahalanobis distance between vectors of gene expressions is the criterion for simultaneously comparing a set of genes, and an evolutionary algorithm is developed for maximizing it. However, the extreme imbalance between the number of genes and the number of experiments causes an instability of the sample covariance matrices, so a direct application of the Mahalanobis distance is not feasible. To overcome this problem we develop a new method of combining data from small-scale random search experiments that we term "evolutionary bootstrap". We validate the proposed method in two ways. First we simulate cDNA microarray data where the extent of differential expression of each gene is known. We apply the multidimensional method and several commonly used one-dimensional statistical tests and compare their ability to correctly identify differentially expressed genes and to rank them according to differential expression. By utilizing the correlation structure, the multivariate method finds, in addition to the genes found by the one-dimensional criteria, genes whose differential expression is not detectable marginally. As a different test, we apply the proposed method to data on two colon cancer cell lines and evaluate its ability to find genes that allow the classification of the samples according to their origin.
|
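Why correlation can reveal differences that one-gene-at-a-time tests miss can be seen from the squared Mahalanobis distance D^2 = d^T S^{-1} d, where d is the difference of the two group mean vectors and S a covariance matrix. The sketch below is a minimal pure-Python illustration with made-up numbers; the helper names are hypothetical and it does not reproduce the evolutionary algorithm or the "evolutionary bootstrap" of the abstract.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is (numerically) singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        s = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - s) / a[r][r]
    return x


def mahalanobis_sq(mean_a, mean_b, cov):
    """Squared Mahalanobis distance between two mean expression vectors."""
    diff = [p - q for p, q in zip(mean_a, mean_b)]
    weighted = solve(cov, diff)                           # S^{-1} d
    return sum(d * w for d, w in zip(diff, weighted))


if __name__ == "__main__":
    # Two genes whose means shift in opposite directions (illustrative data).
    uncorrelated = [[1.0, 0.0], [0.0, 1.0]]
    correlated = [[1.0, 0.9], [0.9, 1.0]]
    print(mahalanobis_sq([1.0, 0.0], [0.0, 1.0], uncorrelated))
    print(mahalanobis_sq([1.0, 0.0], [0.0, 1.0], correlated))
```

With the same mean shift, the strongly correlated pair gives a much larger distance than the uncorrelated pair. The singularity check also hints at the instability the abstract mentions: when a gene set has more genes than there are experiments, the sample covariance matrix cannot be inverted.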
|
|
| 149. Understanding Non Small Cell Lung Cancer by Analysis of Expression Profiles (up) |
| Nir Friedman, Yoseph Barash, Hebrew University;
Amir Ben-Dor, Zohar Yakhini, Agilent Laboratories; Naftali Kaminski, Sheba Medical Center, Israel; nir@cs.huji.ac.il |
| Short Abstract:
To understand the molecular mechanisms that underlie lung cancer, we analyze gene expression patterns in tumor and normal lung samples. We present computational methods that we developed to extract biological meaning from this data. We discuss the significance of the information we retrieve and its potential impact on cancer research. |
| One Page Abstract:
Lung cancer is a common malignancy and a major determinant of overall cancer mortality in developed and developing countries. Despite intensive research, little has changed in the understanding and management of the disease. In order to determine the transcriptional programs that are active in non small cell lung cancer (NSCLC), gene expression patterns of ~12,000 genes were collected from 24 NSCLC tumor samples, 11 normal histology samples from lung resections for cancer, and pooled normal lung RNA (5 individual lungs) obtained commercially. In this poster, we present analysis of these gene expression profiles. We show that gene expression patterns were highly distinct in tumor and normal tissues. We use the Total-Number-of-Misclassifications (TNoM), Information-content (Info) and Gaussian-Error scores to detect genes that significantly differ between NSCLC tumors and normal lung samples. One evident observation was that informative genes were significantly overabundant in our dataset, thus supporting the significance of the results. To better understand the transcriptional program we analyzed the genomic location of genes that differ between NSCLC tumors and normal lung tissues, and compared these to cytogenetic abnormalities observed in the tumor samples. Finally, we developed and used class discovery tools to characterize putative tumor sub-types. The wealth of statistically significant and biologically meaningful information in our dataset supports our contention that transcriptional profiling will lead to new insights into the pathogenesis of lung cancer, thus leading to development of new tools for early detection and treatment of this devastating disease. |
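The abstract names the TNoM score without defining it. The sketch below uses the usual threshold-based reading, which is an assumption here: for one gene, TNoM is the smallest number of samples misclassified by any single expression threshold, allowing either "high means tumor" or "low means tumor". The function name and the toy data are hypothetical.

```python
def tnom(values, is_tumor):
    """Smallest number of misclassified samples over all single-threshold rules.

    values   -- expression levels of one gene, one per sample
    is_tumor -- matching booleans (True for tumor, False for normal)
    """
    pairs = sorted(zip(values, is_tumor))
    n_tumor = sum(1 for _, label in pairs if label)
    n_normal = len(pairs) - n_tumor
    # Threshold below every sample: all samples get one label, so the
    # smaller class is misclassified in the better of the two directions.
    best = min(n_tumor, n_normal)
    tumor_below = normal_below = 0
    for i, (value, label) in enumerate(pairs):
        if label:
            tumor_below += 1
        else:
            normal_below += 1
        # A threshold can only be placed between two distinct values.
        if i + 1 < len(pairs) and pairs[i + 1][0] == value:
            continue
        high_is_tumor = tumor_below + (n_normal - normal_below)
        low_is_tumor = normal_below + (n_tumor - tumor_below)
        best = min(best, high_is_tumor, low_is_tumor)
    return best


if __name__ == "__main__":
    # Perfectly separating gene versus a gene with interleaved classes.
    print(tnom([1.0, 2.0, 3.0, 4.0], [False, False, True, True]))
    print(tnom([1.0, 2.0, 3.0, 4.0], [False, True, False, True]))
```

Genes with a low score separate tumor from normal samples with few errors; comparing the number of such genes with what random labelings would produce is one way to read the "overabundance" of informative genes mentioned above.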
|
|
| 150. Applications of high-throughput identification of tissue expression profiles and specificity (up) |
| Fabien Campagne, Lucy Skrabanek, Harel Weinstein, Institute for
Computational Biomedicine, Department of Physiology and Biophysics; Mount
Sinai School of Medicine;
Fabien.Campagne@physbio.mssm.edu |
| Short Abstract:
We recently developed TissueInfo: an automated, high-throughput method to identify the tissue expression profile and the specificity of a query sequence. We will briefly introduce applications of this new method to custom microarray production, gene discovery, genome analyses, signaling pathway modeling and tissue information ab initio prediction. |
| One Page Abstract:
Organisms such as mammals do not express every single gene encoded by their genome in each of their cells. Rather, the various cell types of the organism express particular subsets of the genes in the genome. Cell types are further organized into tissues, and tissues constitute the organs that carry out various physiological functions. The detailed mechanisms of gene products underlying the functioning of this complex organization are today largely unknown. Several methods, including SAGE [1] and microarray technology [2], can be applied to the study of differential gene expression in the various cell types and in different tissues. We recently developed TissueInfo, a high-throughput method to identify the tissue expression profile of the genes in an organism's genome, as well as the tissue specificity of a query sequence [3]. The method carefully organizes the data publicly available in dbEST [4] and is purely computational. With 80% coverage of the benchmark considered, TissueInfo achieves an accuracy of 76% when the tissue specificity of a gene is predicted and 89% when its expression in a given tissue is predicted. These results make possible the application of TissueInfo to the complete sequences available in the public draft of the human genome. Our poster will present some novel features of the tissue information obtained when profiling about 10,000 human genes for their expression in, and specificity to, 104 human tissues. This will illustrate the application of TissueInfo to genome-wide statistical analysis of gene expression in tissues. In addition, we will describe other potential applications of TissueInfo, such as in the production of tissue-specific microarrays, where TissueInfo can greatly speed up and simplify the selection of clones expressed in a given tissue. 
Another important area of application of TissueInfo relates to gene discovery pipelines, where this method can be integrated to provide the ability to calculate tissue expression profiles and specificity for candidate genes. As shown in our recent identification of the Sac sensory receptor gene candidate [5], prediction of restricted tissue expression, or other specific expression profiles, can be pivotal in the identification of a gene candidate. A third illustrative application consists of the assembly of training sets of genes grouped according to their expression profile for the ab initio prediction of tissue information. More information about the method will be available from our web site: http://icb.mssm.edu. 1. Velculescu, V.E., et al., Serial analysis of gene expression. Science, 1995. 270(5235): p. 484-7. 2. Shoemaker, D.D., et al., Experimental annotation of the human genome using microarray technology. Nature, 2001. 409(6822): p. 922-7. 3. Skrabanek, L. and F. Campagne, TissueInfo: high-throughput identification of tissue expression profiles and specificity. submitted, 2001. 4. Boguski, M.S., T.M. Lowe, and C.M. Tolstoshev, dbEST--database for "expressed sequence tags". Nat Genet, 1993. 4(4): p. 332-3. 5. Max, M., et al., Tas1r3, encoding a new candidate taste receptor, is allelic to the sweet responsiveness locus Sac. Nat Genet, 2001. 28: p. 58-63. |
|
|
| 151. Identifying regulatory networks by combinatorial analysis of promoter elements (up) |
| Yitzhak Pilpel, Priya Sudarsanam, George M. Church, Department of
Genetics, Harvard Medical School;
tpilpel@genetics.med.harvard.edu |
| Short Abstract:
We have developed a new computational method that analyzes microarray data to discover synergistic regulatory motif combinations and applied it to the analysis of promoters of S. cerevisiae. Such interactions are organized into highly connected graphs, suggesting that a small number of regulators may be responsible for multiple expression patterns. |
| One Page Abstract:
The recent availability of microarray data has led to the development of several computational approaches for studying genome-wide transcriptional regulation. These approaches have been very successful in deriving known and new regulatory motifs from the promoters of co-expressed genes. However, few studies have so far addressed the combinatorial nature of transcription, a well-established phenomenon in eukaryotes. We have developed a new computational method that analyzes microarray data to discover synergistic regulatory motif combinations and applied it to the analysis of promoters of S. cerevisiae. Our method suggests causal relationships between each motif in a combination and the observed expression patterns. In addition to identifying novel motif combinations that affect expression patterns during the cell cycle, sporulation, and various stress response conditions, we have also discovered regulatory cross-talk between several of these processes. We have developed novel visualization tools that allow the analysis of the causal relationships between regulatory motif combinations and expression profiles. In addition, we have generated global motif synergy maps that provide a view of the transcription networks in the cell. The maps are highly connected suggesting that a small number of transcription factors are responsible for a complex set of expression patterns in diverse conditions. This approach should be important for modeling transcriptional regulatory networks in more complex eukaryotes. |
|
|
| 152. The use of discretization in the analysis of cDNA microarray expression profiles for the identification of tissue-specific genes (up) |
| Janick Mathys, Kathleen Marchal, Patrick Glenisson, Geert Fannes, Peter
Antal, Yves Moreau, Bart De Moor, department of electrical engineering,
K. U. Leuven;
Paul Van Hummelen, VIB MicroArray Facility, MAF; jmathys@esat.kuleuven.ac.be |
| Short Abstract:
A simple procedure was developed to analyze gene expression profiles from cDNA microarrays for the identification of tissue-specific genes. This procedure consists of the discretization of both background-corrected red intensities and ratios followed by Euclidean distance-based clustering. |
| One Page Abstract:
To assess tissue-specific gene expression a standard concept in data mining was used: discretization. Discretization means that thresholds are determined or chosen. Based on these thresholds, decisions are made about the expression of a gene (ON or OFF, over- or under-expression). To obtain gene expression profiles from various mouse tissues, cDNA was prepared from brain, kidney, heart, liver, lung, skeletal muscle, spleen and testis and hybridized on mouse cDNA microarrays. The microarrays contained 9216 spots coming from 4600 randomly chosen mouse genes printed in duplicate. Twelve slides were hybridized with each of the tissues labeled in red against spleen (reference) labeled in green. Following image analysis, genes were labeled ON or OFF according to a predetermined intensity threshold. The threshold was set at the local background intensity of the spot plus two standard deviations of the mean spot intensity. If the intensity of a gene was below this threshold, the gene was considered OFF and got the label 0. Otherwise, the gene was considered ON and got the label 1 if one of the duplicate spots was above threshold and label 2 if both duplicate spots were above threshold. It was found that the threshold settings were not optimal because the sensitivities of the green and the red channel were different. Methods to adjust the thresholds for each dye specifically are now being developed and compared. The sum of the ON/OFF labels over the various tissues was used to divide the genes into the following groups: Group A: Constitutively expressed genes (602 genes), Group B: Tissue specific genes and Group C: Non expressed genes. Group A: The group of genes that were ON in each tissue (each gene had label 2 in all 12 experiments and thus: sum=24) could be further separated into potential housekeeping genes (A1) and tissue-specific genes (A2). 
For this purpose, the ratios were discretized: if the ratio of a gene was > 2 (2-fold overexpression) or < 0.5 (2-fold underexpression) the gene received label 1 or 2 respectively, and for ratios between 0.5 and 2 the gene was given label 0. As in the previous discretization, the sum of the labels over the various tissues was calculated and used to separate the genes. The group of genes with sum=0 could be considered as potential housekeeping genes (84 genes). The remaining genes were clustered by calculating the Euclidean distance for each gene pair. These clusters consist of genes that were differentially expressed in one or more tissues (A2). Group B: This group was further divided by the same clustering method as was described for group A2. Except for heart, for each tissue a set of genes was found that was uniquely expressed in that specific tissue. Group C: For a large set of genes (629 genes) no fluorescent signals above threshold were found in any of the tissues. These genes were not expressed in any of these tissues or were below the detection limit of the assay. In a final analysis, the tissues themselves were subjected to a hierarchical clustering algorithm based on the ratios of the genes that were ON in each tissue. The results of this clustering matched remarkably well with results obtained for the tissue-specific genes; for instance, heart and skeletal muscle seem to share the most specific genes of group A2. Our results are being further confirmed by information on function and tissue-specificity obtained from Unigene, GO and Pubmed. |
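The two discretization steps described in this abstract can be sketched in a few lines. This is an illustrative reading of the text rather than the authors' implementation: the function names are hypothetical, and an intensity exactly equal to the threshold is treated as ON because the abstract only calls intensities below the threshold OFF.

```python
def spot_threshold(local_background, spot_sd):
    """Threshold = local background + two standard deviations (as in the text)."""
    return local_background + 2.0 * spot_sd


def on_off_label(intensity_1, intensity_2, threshold_1, threshold_2):
    """Label 0, 1 or 2: number of duplicate spots at or above their threshold."""
    return int(intensity_1 >= threshold_1) + int(intensity_2 >= threshold_2)


def expression_group(labels):
    """Group a gene by the sum of its ON/OFF labels over all slides."""
    total = sum(labels)
    if total == 2 * len(labels):   # label 2 on every slide (sum=24 for 12 slides)
        return "A"                 # constitutively expressed
    if total == 0:
        return "C"                 # no signal above threshold on any slide
    return "B"                     # expressed in some tissues only


def ratio_label(ratio):
    """Label 1 for > 2-fold over-, 2 for > 2-fold under-expression, else 0."""
    if ratio > 2.0:
        return 1
    if ratio < 0.5:
        return 2
    return 0


def is_potential_housekeeping(ratios):
    """Group A gene whose ratio labels sum to 0 over all slides (A1)."""
    return sum(ratio_label(r) for r in ratios) == 0


if __name__ == "__main__":
    labels = [2] * 12                       # ON in both spots on all 12 slides
    ratios = [1.1, 0.9, 1.4, 0.7] * 3       # made-up red/green ratios
    print(expression_group(labels), is_potential_housekeeping(ratios))
```

Genes in group A that fail the housekeeping test, and all genes in group B, would then go on to the Euclidean distance-based clustering described in the abstract.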
|
|
| 153. Quantitative analysis of a bacterial gene expression by using the gusA reporter system in a non-steady state continuous culture (up) |
| Kathleen Marchal, Centre of Microbial and Plant Genetics, K.U. Leuven/
SISTA Department of Electrical Engineering, K. U. Leuven;
Jun Sun, Centre of Microbial and Plant Genetics, K.U. Leuven; Ilse Smets, Kristel Bernaerts, Jan Van Impe, BioTeC-Bioprocess technology and Control, K.U. Leuven; Bart de Moor, SISTA Department of Electrical Engineering, K.U. Leuven; Jos Vanderleyden, Centre of Microbial and Plant Genetics, K.U. Leuven; kathleen.marchal@esat.kuleuven.ac.be |
| Short Abstract:
A general dynamic model (forward model) was used to study the "mere influence" of O2 on the expression of a bacterial fusion protein (A. brasilense cytN-gusA fusion). The experimental set up used consisted of a non-steady state continuous culture of which the O2 concentration was regularly perturbed. |
| One Page Abstract:
In this study a dynamic model was developed to describe the "mere influence" of O2 on the expression of the cytN gene of the bacterium A. brasilense, which encodes an important respiratory enzyme, a cyt cbb3 terminal oxidase (Marchal et al., 1998). The experimental set up consisted of the combined use of a non-steady state continuous culture and a translational gene fusion (cytN-gusA). The use of a continuous culture allows accurate monitoring of slight changes in input parameters (O2, input C-source,...). Moreover, the input parameters (in this case O2) can systematically be perturbed to study the effect on the output parameters (fusion protein synthesis measured as β-glucuronidase activity, cell density, output C-source concentration). The combined use of structural dynamic modeling and the appropriate experimental set up (training and validation experiments) allowed us to construct a structural forward model (based on differential equations describing cell growth, substrate consumption and fusion protein synthesis) that describes the dynamic behavior of the system upon varying input signals. Simulation results showed that under the conditions tested the cytN gene expression was not subject to catabolite repression. The hybrid fusion protein seemingly behaves as a very stable protein in A. brasilense and, consistent with previous results, O2 is the major signal regulating the cytN promoter. In principle this approach can be generalized to assess the effect of any controllable external signal on bacterial gene expression in a non-steady state continuous culture. Use of the method outlined here has several advantages over the commonly used steady state measurements