Whole Genome Analysis

1.Identification of thermophilic species by the amino acid compositions deduced from their genomes
2.Sequence Analysis by Iterative Maps - beyond graphical representation
3.Global Analysis of Protein Activities Using Proteome Chips
4.Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure
5.Divergence and conservation: comparison of the complete predicted protein sets of four eukaryotes
6.Identification of novel small noncoding RNAs in Escherichia coli using a probabilistic comparative method
7.Intron-like sequences in non-coding genomic regions.
8.Repeats in human genomic DNA and sequence heterogeneity
9.PhageBase: A precalculated database of bacteriophage sequences
10.Searching for regions of conservation in the Arabidopsis genome
11.Constructing Comparative Maps with Unresolved Marker Order
12.Identification of novel small RNA molecules in the Escherichia coli genome: from in silico to in vivo
13.Operons: conservation and accurate predictions among prokaryotes
14.The EBI Proteome Analysis Database
15.Practical transcriptome analysis system in the RIKEN mouse cDNA project
16.The automated identification of novel lipases/esterases on a multi-genome scale
17.Identification of membrane protein orthologs in worm, fly and human
18.Discovering Binding Sites from Expression Patterns: A simple Hyper-Geometric Approach
19.Visualizing whole genome comparisons: Artemis Comparison Tool (ACT)
20.The use of Artemis for the annotation of eukaryotic genomes
21.A de novo approach to identifying repetitive elements in genomic sequences
22.Extendable parallel system for automatic genome analysis
23.Whole genome phylogenies using vector representations of protein sequences
24.Correlated Sequence Signature as Markers of Protein-Protein Interaction
25.Molecular and Functional Plasticity in the E. coli Metabolic Map
26.Genome Size Distribution in Prokaryotes, Eukaryotes, and Viruses
27.EnsEMBL Genome Annotation Project
28.A Framework for Identifying Transcriptional cis-Regulatory Elements in the Drosophila Genome
29.Genome-wide modeling of protein structures
30.Genome wide search of human imprinting genes by data mining of EST in UniGene
31.What we learned from statistics on arabidopsis documented genes
32.Evaluation of Computer Algorithms to Search Regulatory Protein Binding Sites
33.An HMM Approach to Identify Novel Repeats Through Fluctuations in Composition
34.Novel non-coding RNAs identified in the genomes of Methanococcus jannaschii and Pyrococcus furiosus
35.RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs
36.Comparative genomics for data mining of eukaryotic and prokaryotic genomes
37.DNA atlases for the Campylobacter jejuni genome
38.DNA atlases for the Staphylococcus aureus genome
39.SNAPping up functionally related genes based on context information: a colinearity free approach
40.Sequencing and Comparison of Orthopoxviruses
41.Integrating mouse and human comparative map data with sequence and annotation resources.
42.De novo Identification of Repeat Families in the Genome
43.Using the Arabidopsis genome to assess gene content in higher plants
44.Origin of Replication in Circular Bacterial Genomes and Plasmids



1. Identification of thermophilic species by the amino acid compositions deduced from their genomes
David P. Kreil, EMBL - EBI / University of Cambridge;
Christos A. Ouzounis, EMBL - EBI;
kreil@ebi.ac.uk
Short Abstract:

Global amino acid compositions deduced from 47 complete genomic sequences were analyzed by hierarchical clustering and PCA. Although GC-content had a dominant effect, thermophiles can be identified by their amino acid compositions alone. While the number of genomes is now high enough to discern even a third factor, more 'unusual' species are still required.

One Page Abstract:

The global amino acid compositions as deduced from the complete genomic sequences of seven thermophilic archaea, one mesophilic archaeon, two thermophilic bacteria, 34 mesophilic bacteria, and three eukaryotic species were analyzed by hierarchical clustering and principal components analysis (PCA).

This study presents a careful statistical analysis of factors that affect amino acid composition. Both hierarchical clustering and PCA showed an influence of two main factors on amino acid composition. Even though GC-content has a dominant effect, thermophilic species can be identified by their global amino acid compositions alone. Differences between the groups of thermophiles and mesophiles were verified with appropriate statistical post-hoc tests.

Based on this data analysis we introduce a 'compositional tree' of species that takes into account not only homologous proteins, but also proteins unique to particular species. We expect this simple yet novel approach to be a useful additional tool for the study of phylogeny at the genome level.

This analysis extends our previous work [1] to a larger number of species, one of which is a mesophilic archaeon. The new analysis clearly supports the notion that the second strongest determining factor of global amino acid composition is indeed thermophilicity, and not archaeal origin. With the larger number of completely sequenced genomes available, a third major separable factor that determines amino acid composition is now emerging, besides GC-content and thermophily. However, for the present analysis the genomes of only one mesophilic archaeon and two thermophilic bacteria were available. This points to a general problem for whole genome studies, as the selection of sequenced genomes available is increasingly biased. We show how to deal with this problem by application of thorough statistical methods.

[1] Kreil, D. P. and Ouzounis, C. A. (2001) 'Identification of thermophilic species by the amino acid compositions deduced from their genomes'. Nucleic Acids Res. 29, 1608-15.


2. Sequence Analysis by Iterative Maps - beyond graphical representation
Susana Vinga, ITQB/Universidade Nova Lisboa;
Jonas S. Almeida, Department of Biometry and Epidemiology, Medical University of South Carolina; ITQB/Universidade Nova Lisboa;
João A. Carriço, António Maretzek, ITQB/Universidade Nova Lisboa;
Peter A. Noble, Madilyn Fletcher, Belle W. Baruch Institute for Marine Biology and Coastal Research, Marine Science Program and Department of Biologica;
svinga@itqb.unl.pt
Short Abstract:

Iterative mapping of DNA sequences by Chaos Game Representation (CGR) was first proposed for the investigation of patterns. By counting frequencies in arbitrarily sized quadrants, order-free Markov chain probability tables are obtained, highlighting the usefulness of CGR as a sequence-modelling tool. The iterative procedure was further extended to accommodate higher dimension alphabets.

One Page Abstract:

Iterative mapping of DNA sequences by Chaos Game Representation (CGR) was first proposed in 1990 for the investigation of patterns. This initial attempt to produce scale-independent representations of biological sequences has some important properties: 1) one-to-one correspondence between points in the continuous map and the respective sequences; 2) proximity in the map of sequences with the same suffix [maximum distance 2^-k in each coordinate on the map between two sequences with the same k last units]. We have further explored this representation and found that, by counting frequencies in arbitrarily sized quadrants, an order-free Markov chain probability table is obtained, accommodating both integer and non-integer order resolution. These newly uncovered properties highlight the usefulness of CGR as a sequence-modelling tool rather than just a graphical representation technique. The iterative procedure was further extended to accommodate higher dimensions, defining unit-block iterated continuous domains.
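
The two properties listed above can be illustrated with a minimal sketch. The corner assignment, the starting point (0.5, 0.5) and the function names cgr_points and quadrant_counts are assumptions made for illustration; they are not taken from the abstract or from the authors' software.

```python
# Assumed corner convention for the four nucleotides (one common choice).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after each symbol of seq.

    Each step moves the current point halfway towards the corner
    assigned to the next symbol.
    """
    x, y = start
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def quadrant_counts(points, k):
    """Count points falling in each cell of a 2^k x 2^k grid.

    A point preceded by at least k symbols lands in the cell that
    encodes its last k symbols, so the counts form a k-mer table.
    """
    n = 2 ** k
    counts = {}
    for x, y in points:
        cell = (min(int(x * n), n - 1), min(int(y * n), n - 1))
        counts[cell] = counts.get(cell, 0) + 1
    return counts
```

Two sequences that share their last k symbols end within 2^-k of each other in each coordinate, because every shared step halves the remaining difference. Non-integer resolutions would correspond to grids whose side is not a power of two, which this sketch does not cover.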

Further reading: Almeida, J.S., Carriço, J.A., Maretzek, A., Noble, P.A. and Fletcher, M. (2001) Analysis of genomic sequences by Chaos Game Representation. Bioinformatics, 17, 429-437.


3. Global Analysis of Protein Activities Using Proteome Chips
Ning Lan, Ronald Jansen, Paul Bertone, Heng Zhu, Michael Snyder, Mark Gerstein, Yale University;
lan@bioinfo.mbb.yale.edu
Short Abstract:

A defined collection of 5800 yeast proteins was printed on a proteome microarray and screened for their ability to interact with proteins, nucleic acids, and phospholipids. An algorithm was developed to identify positive signals on the proteome microarray and to cluster the proteins identified into functional groups.

One Page Abstract:

A daunting task in the post-genome sequencing era is to ascribe functions to every protein encoded by a given genome. Direct analysis of protein function on proteome chips is likely to provide an extremely valuable approach for elucidating gene function on a global scale. A defined collection of 5800 proteins from the budding yeast was prepared using high-throughput techniques and printed onto glass slides to screen for many activities including protein-protein, protein-DNA, protein-RNA, and protein-liposome interactions.

Visual inspection identified 39 yeast proteins that bind calmodulin. Sequence analysis revealed that these calmodulin-binding proteins share a motif whose consensus is I/L-Q-X-X-K-K/X-G-B, where X is any residue and B is a basic residue.
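
As a rough illustration, the consensus above can be written as a regular expression. This sketch assumes that "K/X" places no constraint on that position and that the basic residues B are K, R and H; the pattern, the function name find_motif and the example peptide are illustrative and not part of the original analysis.

```python
import re

# [IL] Q X X K (K or any) G B, with B assumed to be K, R or H.
MOTIF = re.compile(r"[IL]Q..K.G[KRH]")


def find_motif(protein):
    """Return (start, matched_text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]


# Made-up peptide used only to exercise the pattern.
hits = find_motif("MAAIQRSKKGRLL")
```

For this made-up peptide the single hit starts at index 3 and spans IQRSKKGR.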

An algorithm was developed to identify and analyze positive signals in protein-liposome binding experiments. Variations between chips and local variations on the chip cause additional fluctuations of the binding signals quantitated using GenePix software. To correct for the variation between chips, the signals from different experiments were scaled into a common range by subtracting the median and dividing by the difference between the upper and lower quartiles, thus transforming the signal distributions of different experiments to comparable shapes. To correct for the local variation on the chip, we performed a "neighborhood subtraction" for each spot. We defined a region of two rows above and below as well as two columns to the left and right of a spot as the neighborhood region. The median signal of this region was then subtracted from the spot signal. The number of highly fluorescent spots in any neighborhood region is generally low enough in these experiments not to disturb the median significantly. Finally, if the variation between two parallel samples was greater than 3 standard deviations of the error distribution of the samples, the data point was flagged and excluded from further analysis. After this filtering procedure, we normalized the filtered lipid binding signal G with the GST signal R, yielding the ratio r = G/R, which is a measure of the binding per amount of protein and allows comparison of binding signals between different proteins. The specific binding ratio r is sensitive to errors eG and eR in both the G and R signals. Therefore, we computed 90% and 95% confidence intervals for this ratio with a Monte-Carlo procedure, assuming that r is a good approximation of the actual average of the ratio population: r + er = (G+eG)/(R+eR), where er represents the error of the ratio r.
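
The normalization steps described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the spot itself is excluded from its neighborhood, neighborhoods are clipped at the chip edge, and the errors eG and eR are drawn from Gaussians with user-supplied standard deviations. The abstract specifies none of these details, and all function names are hypothetical.

```python
import random
import statistics


def robust_scale(signals):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = statistics.quantiles(signals, n=4)
    med = statistics.median(signals)
    return [(s - med) / (q3 - q1) for s in signals]


def neighborhood_subtract(grid, radius=2):
    """Subtract the median of the surrounding spots from each spot.

    The neighborhood spans `radius` rows and columns in every direction,
    is clipped at the chip edge and excludes the spot itself (assumptions).
    """
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            neighbors = [
                grid[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
                if (i, j) != (r, c)
            ]
            out_row.append(grid[r][c] - statistics.median(neighbors))
        result.append(out_row)
    return result


def ratio_interval(g, r, sd_g, sd_r, level=0.95, draws=10000, seed=0):
    """Monte Carlo interval for (G + eG) / (R + eR).

    eG and eR are assumed Gaussian with standard deviations sd_g and sd_r;
    draws with a non-positive denominator are discarded.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(draws):
        denom = r + rng.gauss(0.0, sd_r)
        numer = g + rng.gauss(0.0, sd_g)
        if denom > 0:
            ratios.append(numer / denom)
    ratios.sort()
    lower = ratios[int((1 - level) / 2 * len(ratios))]
    upper = ratios[int((1 + level) / 2 * len(ratios)) - 1]
    return lower, upper
```

Calling ratio_interval twice with level=0.90 and level=0.95 gives the two intervals mentioned in the text. The median-based steps are used because a few very bright spots shift a median far less than a mean.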

This algorithm identified 150 yeast proteins that bind phosphatidylinositol lipids, 52 of which correspond to uncharacterized proteins, indicating that many previously uncharacterized proteins have potentially important biochemical activities. These proteins were clustered into four groups based on binding strength and specificity.

These results provide a wealth of new information about many known and previously uncharacterized proteins, and thus demonstrate that proteome chips provide a valuable opportunity for direct global proteome analysis.


4. Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure
Julian Gough, MRC Laboratory of Molecular Biology;
Kevin Karplus, Richard Hughey, UCSC, USA;
Cyrus Chothia, MRC Laboratory of Molecular Biology;
jgough@mrc-lmb.cam.ac.uk
Short Abstract:

A hidden Markov model library representing all proteins of known structure has been built based on SCOP. This library has been used on all complete genomes to assign structural superfamilies to sequences. The genome assignments, sequence alignments, and a facility to search the library are available at http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY.

One Page Abstract:

Of the sequence comparison methods, profile based methods perform with greater selectivity than those that use pair-wise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this poster describes calculations that (i) improve the performance of HMMs and (ii) determine a good, possibly the best, procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage.

The second part of the poster describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure that have identities of less than 95% are used as seeds to build the models. Using the current data, this gives a library with 4894 models.

The third part of the poster describes the use of the SUPERFAMILY model library to annotate the sequences of more than 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 43% of bacterial genomes. Many sequences labeled as being hypothetical are homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library.


5. Divergence and conservation: comparison of the complete predicted protein sets of four eukaryotes
Catherine A. Ball, Kara Dolinski, Shuai Weng, John C. Matese, Gavin Sherlock, Dianna Fisk, Selina Dwight, Karen Christie, Anand Sethuraman, J. Michael Cherry, David Botstein, Stanford University School of Medicine;
ball@genome.stanford.edu
Short Abstract:

Comparison of the complete predicted protein sets of four eukaryotic organisms (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana) suggests that the common ancestor of plants, multicellular animals and single-celled fungi was the source of a significant fraction of the current complement of proteins for each species.

One Page Abstract:

Comparison of the complete predicted protein sets of four eukaryotic organisms (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana) suggests that the common ancestor of plants, multicellular animals and single-celled fungi was the source of a significant fraction of the current complement of proteins for each species. Complete sets of predicted proteins were compared using BLASTP, grouped into families of related proteins and subjected to CLUSTALW analysis to determine closest similarity. At all stringency levels, 30-50% of sequence similarity families have at least one representative sequence from each organism. Unsurprisingly, the closest similarities are found between proteins from D. melanogaster and C. elegans, the multicellular animals.

Systematic functional analysis of S. cerevisiae genes and proteins has significant implications for their homologs in other species. S. cerevisiae proteins encoded by essential genes are more likely to have homologs in one of the other species. In addition, S. cerevisiae proteins that interact with many other proteins are also more likely to be conserved.

Using gene associations from the Gene Ontology Consortium, similarity families were associated with biological processes and molecular functions. The most conserved biological processes, as inferred from shared annotations, correspond to core cellular processes such as metabolism and protein synthesis.


6. Identification of novel small noncoding RNAs in Escherichia coli using a probabilistic comparative method
Elena Rivas, Robert J. Klein, Sean R. Eddy, Washington University in St. Louis;
elena@genetics.wustl.edu
Short Abstract:

We apply comparative genomics in a probabilistic computational method to find novel noncoding RNAs. We use this computational method to screen the E. coli genome using whole genome comparison to four other gamma proteobacterial genome sequences. A number of those candidates have been shown by Northern blot analysis to produce small, apparently noncoding RNA transcripts.

One Page Abstract:

We have developed a comparative genomic method to find novel noncoding RNA genes. This method takes advantage of the pattern of fixed mutations between two conserved sequences in order to infer why the sequence is functional. A protein-gene exon, for instance, may show a telltale abundance of synonymous substitutions, whereas a structural RNA may show a telltale abundance of compensatory mutations consistent with a conserved Watson-Crick base-paired secondary structure. We have formalized this intuitive notion by constructing three probabilistic "pair-grammars". Each grammar models a different functional pattern of evolution: coding, which favours synonymous mutations; RNA, which models a pattern of compensatory base changes; and a null hypothesis of position-independent evolution. Here we report the results of applying this computational screen to the E. coli genome, using whole genome comparisons to four other gamma proteobacterial genome sequences. In this screen we have generated a large number of candidates for RNA genes from E. coli intergenic regions. A number of those candidates have been shown by Northern blot analysis to produce small, apparently noncoding RNA transcripts.


7. Intron-like sequences in non-coding genomic regions.
Elisabetta Pizzi, Emanuele Bultrini, Paolo Del Giudice, Clara Frontali, Istituto Superiore di Sanità, Rome (Italy);
frontali@iss.infn.it
Short Abstract:

A method was developed for the analysis of correlations in oligonucleotide usage in extended genomic portions. Preliminary results for C. elegans and D. melanogaster genomes reveal auto- and cross-correlations in the dictionary prevailing in introns and in intergenic regions. The method lends itself to automatic partitioning and to cluster analysis.

One Page Abstract:

The heterogeneity, or patchiness, which characterises most eukaryotic genomes, suggests that different parts of the same genome may adopt different strategies to encode relevant biological information, as a consequence of different evolutionary pressures. Such considerations should help in partitioning long non-coding sequences into functionally different regions.

Starting from the observation that similarity in oligonucleotide usage in introns and intergenic regions, extending over distances of more than 10 kb, contributes to the correlation structure of the Caenorhabditis elegans genome [1], we examined different statistical methods with the aim of providing a quantitative measure of this effect and of finding practicable ways to score extended genomic portions for consistent oligonucleotide usage in distant regions not necessarily related by significant sequence similarity.

Preliminary results were obtained using a simple approach which, after dividing an arbitrarily long sequence into non-overlapping segments typically 100 bp long, computes correlation coefficients between oligonucleotide frequency distributions for all fragment pairs. By repeating the procedure on the randomly shuffled segments it becomes possible to disaggregate the effects of base composition and of biased oligonucleotide usage. Distant genomic regions that adopt a similar oligonucleotide dictionary above and beyond what is expected on the basis of nucleotide frequencies can easily be recognised as off-diagonal patterns in a sort of coarse-grained dot plot representing the resulting matrices through a brightness scale proportional to the correlation coefficient value. Along with this visual representation, it is possible to perform cluster analyses of segments according to oligonucleotide usage.
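
The procedure above can be sketched in a few lines. The sketch assumes dinucleotides (k = 2) as the oligonucleotide dictionary and the Pearson coefficient as the correlation measure, neither of which is specified in the abstract; the function names are hypothetical.

```python
import random
from itertools import product


def kmer_profile(segment, k=2):
    """Count every k-mer over ACGT in a segment, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(segment) - k + 1):
        word = segment[i:i + k]
        if word in counts:  # skips words containing N or other symbols
            counts[word] += 1
    return [counts[w] for w in kmers]


def pearson(u, v):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su and sv else 0.0


def correlation_matrix(seq, seg_len=100, k=2, shuffle=False, seed=0):
    """Correlate k-mer profiles of all pairs of non-overlapping segments.

    With shuffle=True each segment is permuted first, which keeps its
    base composition but destroys its oligonucleotide usage.
    """
    rng = random.Random(seed)
    segments = [seq[i:i + seg_len]
                for i in range(0, len(seq) - seg_len + 1, seg_len)]
    if shuffle:
        segments = ["".join(rng.sample(s, len(s))) for s in segments]
    profiles = [kmer_profile(s, k) for s in segments]
    return [[pearson(p, q) for q in profiles] for p in profiles]
```

Comparing the matrix from the original segments with the one from shuffled segments separates the part of each correlation that base composition alone would produce. Either matrix can be displayed as the coarse-grained dot plot described in the text.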

It was possible to demonstrate that C. elegans and Drosophila melanogaster introns auto- and cross-correlate from the point of view of oligonucleotide usage, and that, in both genomes, interspersed elements exhibiting intron-like features are abundant in regions that, according to current annotation, are intergenic. Clusters of these elements might mark as yet unpredicted genes, but we cannot rule out the speculative hypothesis that those non-coding regions that are subject to weak functional constraints might harbour similar elements. Methods for the automatic partitioning of chromosome-long sequences on this basis are under development.

[1] C. Frontali, E. Pizzi (1999) Gene 232, 87-95


8. Repeats in human genomic DNA and sequence heterogeneity
Dirk Holste, Humboldt University Berlin;
d.holste@itb.biologie.hu-berlin.de
Short Abstract:

We study sequence heterogeneity of human chromosomes by quantifying dinucleotide correlations and frequency distributions of oligonucleotides. Using simple stochastic models, we quantify the presence of fluctuations and the extent to which interspersed repeats, monomeric tandem repeats, and CpG suppression can account for the heterogeneity and the increasing oligonucleotide nonuniformity.

One Page Abstract:

The origin and extent of the base compositional variation and its relation to the organization and function of human genomic DNA pose fundamental questions. The observed sequence heterogeneity may require active constraints for generating and maintaining these patterns, and the analysis of the sequence heterogeneity could contribute to an understanding of the nature of compositional constraints. In the past, several attempts have been made to relate those observations to known biological features such as the presence of a 3-bp periodicity, the length distribution of protein-coding regions, the presence and expansion of repeats, or the evolution of DNA.

The sequencing of the human genome provides a suitable occasion to test earlier propositions on the base composition of the human genome, such as the role of interspersed repeats, which comprise over 50% of the whole genome. We study statistical patterns in the DNA sequence of two human chromosomes by quantifying short- and long-range dinucleotide correlations and by examining the nonuniformity of the frequency distribution of oligonucleotides.

We investigate to what degree known biological features may explain the observed statistical patterns. Using simple stochastic models, we study the role of interspersed repeats as a potential cause of the observed heterogeneity. We study the superposition of interspersed repeats and monomeric tandem repeats, and the suppression of CpG dinucleotides, as possible features that may cause the increasing nonuniformity of the oligonucleotide distribution with increasing oligonucleotide length.
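
One common way to quantify CpG suppression is the observed/expected ratio, sketched below. The abstract does not state which measure the author uses, so this is an illustrative choice and the function name cpg_obs_exp is hypothetical.

```python
def cpg_obs_exp(seq):
    """Ratio of the observed CG dinucleotide frequency to the frequency
    expected from the C and G base frequencies alone.

    Values well below 1 indicate CpG suppression.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n < 2 or c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number.
    observed = seq.count("CG") / (n - 1)
    expected = (c / n) * (g / n)
    return observed / expected
```

Applied in sliding windows along a chromosome, the same ratio gives a simple profile of how strongly CpG is depleted in different regions.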


9. PhageBase: A precalculated database of bacteriophage sequences
Frank Desiere, Nestle Research Centre;
Günther Kurapkat, Clemens Suter-Crazzolara, Lion Bioscience AG;
Harald Brüssow, Nestle Research Centre;
frank.desiere@rdls.nestle.com
Short Abstract:

We have employed genomeSCOUT to create PhageBase, a multi-functional, pre-computed bacteriophage genome database. Information about protein homology (e.g. protein families, orthologs and clusters of orthologous groups, COGs) and gene order is collected, stored and visualised with the data integration system SRS. PhageBase will make it possible to address questions of phage evolution.

One Page Abstract:

The accumulation of more and more complete bacteriophage genome sequences requires new computational approaches for dealing with these data. Bacteriophage genomes are not especially large; most are smaller than 200 kb. They nevertheless represent a formidable challenge to bioinformatics algorithms, since phages seem to be the result of both vertical and horizontal evolution and are thus a good model system for the web-like phylogenies (Brüssow and Desiere 2001) currently discussed for prokaryotic genomes. In addition to their extraordinary power to recombine and exchange functional modules, single genes and gene segments encoding single protein domains, phages mutate rapidly: phages with double-stranded DNA genomes show an approximately 10-fold higher mutation rate than their bacterial hosts (Li 1997). Furthermore, there is good reason to believe that tailed phages are as old as their prokaryote hosts. This situation poses several extraordinary challenges for bioinformatics investigations. In particular, sequence search algorithms must be sensitive enough to handle very distantly related sequences and, in the absence of detectable sequence similarity, must account for conserved gene order (synteny).

We have employed genomeSCOUT (Suter-Crazzolara and Kurapkat 2000) to create a multi-functional, pre-computed bacteriophage genome database that allows rapid identification and functional characterisation of genes and proteins through genome comparison. With a number of independent algorithms, information about different levels of protein homology (concerning e.g. protein families, orthologs and clusters of orthologous groups, COGs) and gene order is collected and stored. These databases are then used for interactive comparison of genomes and subsequent analysis. The application is based on the well-established data integration system SRS. SRS ensures at the same time the fast handling of many genomes, access to several pre-computed databases and linking functions between these databases. Last but not least, SRS offers fully integrated, user-friendly graphical representations of search results.

Gene context analysis in bacteriophages shows that the conservation of genome organization is surprisingly high compared to prokaryotic genomes (Wolf and Koonin 2001). The genome map is conserved across bacteriophages of the Siphoviridae family. These phages infect both gram-negative and gram-positive Eubacteria as well as the Euryarchaeota branch of Archaea. In fact, the structural genes from E. coli phage lambda, Streptococcus phage Sfi21 and Archaeavirus psi-M2 can be superimposed. Gene strings that are conserved in taxonomically distant organisms are most likely functionally interacting genes. We assume that their conservation reflects an ancestral gene order. Accordingly, gene strings identified in new genomes can be assumed to be functionally linked, and the information on gene clustering can be used for functional predictions. PhageBase will hopefully make it possible to address problems that are currently discussed in bacterial genomics and bacterial phylogeny (Koonin et al., 2000): unity or diversity of origin, vertical versus horizontal gene transfer, nonorthologous gene displacement, tree- versus web-like phylogeny, synteny versus instability of gene order, gene splitting versus domain accretion.


10. Searching for regions of conservation in the Arabidopsis genome
Brad Chapman, John Bowers, Andrew H. Paterson, Plant Genome Mapping Laboratory, University of Georgia;
chapmanb@arches.uga.edu
Short Abstract:

To identify regions of Arabidopsis which might serve as conserved anchor points for comparisons between different plant species, ESTs from crop species were compared to the ordered Arabidopsis genome. Conserved blocks of high sequence similarity were identified in Arabidopsis and categorized within the biological context of the genome.

One Page Abstract:

With the recent completion of the Arabidopsis genome, a major challenge is applying the information known about Arabidopsis to help advance research in other plant species. Towards this goal, we were interested in finding regions of Arabidopsis that appeared to be preferentially conserved during evolution, hypothesizing that these regions could serve as anchor points for comparisons between multiple plant genomes. To identify regions of interest in Arabidopsis, we employed sequence comparison using BLAST searches to locate EST sequences from Sorghum, Cotton and Sugarcane on the Arabidopsis genome. By displaying the levels of sequence similarity across the genome, blocks appeared with groups of very high or low levels of sequence similarity. Using a probabilistic Hidden Markov Model approach, we categorized the blocks as being either strongly or weakly conserved. Several regions of the Arabidopsis genome were identified that were similarly categorized using Sorghum, Cotton and Sugarcane comparisons, but not using comparisons with randomly generated sequences. To try and determine if these regions were biologically meaningful, we looked at the distribution of Matrix Attachment Regions in the genome and attempted to correlate these structural elements with the conserved regions we had identified. Results of these analyses will be presented, along with in-depth analysis of some potentially conserved regions of the Arabidopsis genome. Some points of discussion will include the potential evolutionary significance of the conserved regions, as well as the applicability of the results to help advance research in less well-studied plant species.
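
The categorization step can be illustrated with a two-state hidden Markov model decoded by the Viterbi algorithm. This is a minimal sketch: the state names, the two-letter observation alphabet (H for high and L for low similarity in a genome window) and all probabilities are hypothetical values chosen for illustration, not parameters from the study.

```python
import math

STATES = ("strong", "weak")
# All probabilities below are illustrative assumptions.
START = {"strong": 0.5, "weak": 0.5}
TRANS = {"strong": {"strong": 0.9, "weak": 0.1},
         "weak": {"strong": 0.1, "weak": 0.9}}
EMIT = {"strong": {"H": 0.8, "L": 0.2},
        "weak": {"H": 0.2, "L": 0.8}}


def viterbi(obs):
    """Return the most probable state path for a non-empty string of H/L."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            # Best predecessor state for being in state s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][symbol]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because switching states is costly, an isolated low-similarity window inside a well-conserved stretch stays labeled as strongly conserved, which is what lets the model delineate blocks rather than classify each window independently.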


11. Constructing Comparative Maps with Unresolved Marker Order
Debra Goldberg, Center for Applied Mathematics, Cornell University;
Susan McCouch, Department of Plant Breeding, Cornell University;
Jon Kleinberg, Department of Computer Science, Cornell University;
debra@cam.cornell.edu
Short Abstract:

Species maps (genetic and physical) frequently include groups of markers (genes) whose precise relative order cannot be determined. We present efficient algorithms that construct comparative maps from such species maps in a principled manner. Our approach recognizes arrangements of co-located markers that give a most parsimonious comparative map.

One Page Abstract:

Comparative maps are a powerful tool for aggregating genetic information about related organisms, for predicting the location of orthologous genes, for understanding chromosome evolution, for inferring phylogenetic relationships and for examining hypotheses about the evolution of gene families and gene function in diverse organisms. The species maps which are the input to the process of constructing comparative maps are often themselves constructed from incomplete or inconsistent data, resulting in markers (or genes) whose precise relative order cannot be determined in the input species maps. This incomplete marker order information is generally handled in one of two ways: each marker may be assigned an interval on a chromosome, where the interval size varies for different markers and marker intervals may overlap, or sets of markers whose relative order cannot be reliably inferred are placed together in a bin which is mapped to a common location (megalocus). Previous automated and manual methods have handled such markers in an ad hoc or arbitrary fashion.

We present efficient algorithms for each of the standard representations which systematically use all information provided to produce comparative maps in a principled manner. The algorithms extend our earlier work on the "chromosome labeling problem," which uses a dynamic programming technique to find an optimal balance of accuracy (the data should be explained well by the map) and parsimony (there should be relatively few homeologous segments, so that only syntenic relationships above our confidence threshold are labeled). We handle the overlapped interval representation by a direct extension of this technique.

Our main algorithms focus on the megalocus model, in which the input markers are partitioned into sets: relative order between sets is fully known, while relative marker order within a set is completely unknown. For this model, we present algorithms which not only use the available information, but also arrange the co-located markers in a most parsimonious way. The chromosome labeling problem with unknown ordering is thus placed on a principled footing via these algorithms in which results are optimized over all possible orderings. This canonical marker order can be viewed as a working hypothesis about the original incomplete data set, and can serve as a basis for further lab work.

A preliminary version of ``DeCAL'' (Detecting Common Ancestral Linkage-segments), an open-source product based on these algorithms, is now available. For input, it requires the positions of the markers of one species, as well as the location of homologs to each marker in the second species. Output is given both graphically and in text form. Only a single parameter is required, which carries a simple biological explanation. Our program allows comparative maps to be constructed in a few minutes. Results have been evaluated for diverse pairs of species, and closely approximate prior manual expert analyses.


12. Identification of novel small RNA molecules in the Escherichia coli genome: from in silico to in vivo
Ruth Hershberg, Liron Argaman, Department of Molecular Genetics and Biotechnology, The Hebrew University - Hadassah Medical School;
Joerg Vogel, Institute of Cell and Molecular Biology, Biomedical Center, Uppsala University;
Gill Bejerano, Institute of Computer Science, The Hebrew University;
E. Gerhart H. Wagner, Institute of Cell and Molecular Biology, Biomedical Center, Uppsala University;
Shoshy Altuvia, Hanah Margalit, Department of Molecular Genetics and Biotechnology, The Hebrew University - Hadassah Medical School;
rutih@md.huji.ac.il
Short Abstract:

Small RNAs (sRNAs) have been difficult to detect both experimentally and computationally. We developed a computational strategy to search the Escherichia coli genome for sRNA-encoding genes. We used transcriptional signals and genomic features to predict 41 sRNAs. 23 were tested experimentally, of which 17 were shown to be real sRNAs.

One Page Abstract:

Small, untranslated RNA molecules exist in all kingdoms of life. These RNAs carry out diverse functions and many of them are regulators of gene expression. Genes encoding small RNAs (sRNAs) are difficult to detect experimentally or to predict by traditional sequence analysis approaches. Thus, in spite of the importance of these molecules, many of the sRNAs known to date were discovered fortuitously. We developed a computational strategy to search the Escherichia coli genome for genes encoding small RNAs. Our method was based on the transcription signals and genomic features, such as location and conservation, that characterize the 10 known sRNAs in E. coli. The search was limited to regions of the genome in which no gene existed on either strand. These regions were searched for transcriptional signals (promoter sequences recognized by the major sigma factor of E. coli RNA polymerase (sigma70), and Rho-independent terminators). Sequences for which the distance between the predicted promoter and terminator was 50-400 bases were compared to genome sequences of other bacteria. Sequences with good conservation were predicted as sRNAs. 23 of the predicted genes were tested experimentally, out of which 17 were shown to be expressed in E. coli. The newly discovered sRNAs showed diverse expression patterns and most of them were abundant.
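
The promoter-to-terminator filtering step can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: promoter and terminator positions are assumed to be already predicted on one strand, genes are given as inclusive `(start, end)` intervals, and the conservation filter against other bacterial genomes is omitted. The function and parameter names are hypothetical; only the 50-400 base window comes from the abstract.

```python
def candidate_srnas(promoters, terminators, genes, min_len=50, max_len=400):
    """Return (promoter, terminator) spans of suitable length in gene-free regions.

    promoters, terminators: predicted positions on one strand.
    genes: (start, end) intervals of annotated genes on either strand.
    """
    def overlaps_gene(start, end):
        return any(start <= g_end and g_start <= end for g_start, g_end in genes)

    candidates = []
    for p in sorted(promoters):
        for t in sorted(terminators):
            length = t - p  # terminator must lie downstream of the promoter
            if min_len <= length <= max_len and not overlaps_gene(p, t):
                candidates.append((p, t))
    return candidates
```

For example, `candidate_srnas([100, 1000], [250, 1600], [(300, 900)])` keeps only the span `(100, 250)`; the other pairings are either too long or run upstream.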


13. Operons: conservation and accurate predictions among prokaryotes
Gabriel Moreno-Hagelsieb, Julio Collado-Vides, CIFN-UNAM;
moreno@cifn.unam.mx
Short Abstract:

Neighboring genes within known operons are shown to be more conserved at three levels (co-occurrence, adjacency, and fusion) than genes at transcription unit (TU) boundaries. We also show that our operon prediction method, designed with information from E. coli, works for other prokaryotes.

One Page Abstract:

Based on a database of experimentally characterized transcription units (TUs) of Escherichia coli and its genomic annotations, we show that adjacent genes within operons (polycistronic TUs) are more conserved than adjacent genes found at TU boundaries (the last gene in a TU and the first in the next). The conservation is measured at three levels: (1) Co-occurrence: two genes found in an operon each have an ortholog in another genome more frequently than two genes at TU boundaries; that is, genes at TU boundaries are more frequently left as orphans. (2) Among those gene pairs for which both orthologs are present, those found in operons in E. coli are more frequently conserved as neighbors in other genomes. (3) Genes within operons can be found as fusions in other genomes. We also show that the TU prediction method we developed with information on E. coli TUs works as well (more than 82% accuracy) against a collection of known operons of Bacillus subtilis, and provide evidence that it works for predicting the transcription unit organization of all prokaryotes. Genes predicted to be in operons show higher conservation of adjacency than genes predicted to be at TU boundaries, and the population of operons of each organism is shown to be easy to calculate from the inter-genic distance distributions of pairs of adjacent genes found in the same strand.
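
The role of inter-genic distance between adjacent same-strand genes can be sketched with a deliberately simplified rule. This is not the authors' predictor, which is built from distance distributions of known TUs; here a single hard cutoff, the hypothetical parameter `max_gap`, stands in for that statistical model, and the tuple layout of `genes` is likewise an assumption.

```python
def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes separated by at most max_gap bases.

    genes: list of (name, start, end, strand), sorted by start coordinate.
    Returns a list of predicted operons, each a list of gene names.
    """
    operons = []
    for gene in genes:
        name, start, end, strand = gene
        if operons:
            prev = operons[-1][-1]
            gap = start - prev[2] - 1  # bases between the two genes
            if prev[3] == strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]
```

A gene on the opposite strand, or one that lies further than `max_gap` from its upstream neighbor, starts a new predicted transcription unit.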


14. The EBI Proteome Analysis Database
Paul Kersey, Apweiler, R., Biswas, M., Fleischmann, W., Kanapin, A., Karavidopoulou, Y., Kriventseva, E., Mittard, V., Mulder, N., EMBL-European Bioinformatics Institute;
Phan, I., Swiss Institute of Bioinformatics;
Zdobnov, E., EMBL-European Bioinformatics Institute;
pkersey@ebi.ac.uk
Short Abstract:

The EBI Proteome Analysis Database has been developed to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Complete non-redundant sets of SWISS-PROT and TrEMBL entries are assembled for each proteome, and are analysed using the InterPro and CluSTr databases, GO, and structural information.

One Page Abstract:

The dramatic growth in the number of fully sequenced organisms in recent years has created new challenges and opportunities for biological sequence databases. It is now possible to make statistical and comparative analyses of organisms based on their entire proteome. While the specific function of a newly predicted protein cannot be known for certain from its sequence alone, the use of protein domain databases allows proteins to be assigned to families, and therefore a proteome to be described in terms of its composition. For example, it is possible to establish that a given protein family may be found in a restricted portion of the taxonomic range; that two organisms share certain protein families, but not others; or that a particular family is especially highly represented in a certain species.

The Proteome Analysis Database has been developed by the SWISS-PROT group at the EBI in order to provide such an analysis. Several features distinguish this database. Firstly, non-redundant, up-to-date, comprehensive data sets are maintained for each complete proteome, so that the statistical analysis is not skewed. These are created by selecting entries from the high-quality protein sequence databases SWISS-PROT and TrEMBL. Protein sequence data is tracked into TrEMBL from genome sequencing projects, and merges with existing entries are accounted for. Special procedures are used to establish eukaryotic proteomes. The facility to perform unbiased sequence similarity searches against these sets is offered.

Secondly, a powerful set of tools has been chosen to analyse the sets. The Proteome Analysis Database uses InterPro, an integrated database of protein domains, and CluSTr, a database that groups proteins according to overall sequence similarity. Proteins can also be functionally classified according to the Gene Ontology. Additionally, structural information relevant to each proteome is provided.

Thirdly, the entire database is updated weekly and kept synchronised with the underlying sequence databases from which it is constructed. Finally, a web-based interface allows users to customise their own comparative analysis using the resources made available by the database, while popular queries are precomputed for rapid response.


15. Practical transcriptome analysis system in the RIKEN mouse cDNA project
Hidemasa Bono, Takeya Kasukawa, Itoshi Nikaido, Yasushi Okazaki, Yoshihide Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center(GSC);
bono@gsc.riken.go.jp
Short Abstract:

A practical transcriptome analysis system for a large-scale cDNA project, the RIKEN mouse cDNA project, is presented. cDNA clone data from libraries, full-length sequences, chromosomal mapping information and gene expression information are integrated to analyze the mouse transcriptome in conjunction with sequence information from human and other model organisms.

One Page Abstract:

We have pursued the RIKEN mouse encyclopedia project, which attempts to catalog, as an encyclopedia: 1. full-length cDNA clones, 2. full-length cDNA sequences, 3. mapping information for all cDNA clones, and 4. gene expression information for all genes. Last August, we held the FANTOM meeting to annotate 21,076 RIKEN mouse cDNA clones with functional information (Nature, 409, 685-690 (2001)). For that meeting, we built a web-based system called FANTOM+, which includes functional annotation information as well as graphical sequence analysis reports (http://www.gsc.riken.go.jp/e/FANTOM/). These FANTOM efforts are now being expanded to set up a practical mouse transcriptome analysis system that organizes not only functional annotation but also biological knowledge that may contain inconsistent information. We will report the status of this project.


16. The automated identification of novel lipases/esterases on a multi-genome scale
Sanna Herrgard, Stephen A. Cammer, Jen Montimurro, Jeffrey A. Speir, Brian Hoffman, Susan M. Baxter, Jacquelyn S. Fetrow, GeneFormatics, Incorporated;
sannaherrgard@geneformatics.com
Short Abstract:

We have applied threading and protein functional descriptors to identify sequences with putative lipase and esterase functions in four gram-positive genomes. These studies yielded 15 sequences previously unreported as having lipase/esterase functions. Our findings are supported by 3D-conservation profiles between the active sites in known lipases/esterases and our assignments.

One Page Abstract:

The rapid and accurate functional annotation of the growing number of DNA and protein sequences has become a key challenge of the post-genomic era. Current annotation methods rely heavily upon simple sequence similarity; a protein sequence of undetermined biochemical function is assumed to have the same function as the protein most similar in sequence to it. Since protein sequences are generally less conserved than protein structures, sequence-based annotation methods often fail to detect proteins with low sequence similarity. In order to circumvent the limitations of sequence-based approaches, we have screened structure models obtained by threading with a library of function-specific structural descriptors (Fuzzy Functional Forms). We demonstrate the use of this method in rapidly annotating entire genomes and identifying novel function assignments for ORFs for which conventional sequence-based annotation methods fail. Specifically, we have assigned novel lipase and esterase functions to 15 sequences in the genomes of four gram-positive bacteria: Bacillus subtilis, Ureaplasma urealyticum, Mycoplasma pneumoniae and Mycobacterium tuberculosis. Our findings are supported by the sequence-structure conservation profiles between the active sites in known lipases/esterases and our assignments. These analyses indicate that even though the overall sequence similarity between known lipases/esterases and our assigned ORFs is often low, remarkable local similarities exist in the predicted active sites.


17. Identification of membrane protein orthologs in worm, fly and human
Gang Liu, Christian E. V. Storm, Erik L. L. Sonnhammer, Center for Genomics and Bioinformatics (CGR), Karolinska Institute;
Gang.Liu@cgr.ki.se
Short Abstract:

Genome-wide identification of transmembrane protein orthologs has been carried out. Hidden Markov models (HMMs) which were previously built for membrane protein families of worm were used to search for homologs in other species. Orthologous relationships were assigned by using Orthostrapping, a phylogeny-based method that gives orthology confidence.

One Page Abstract:

With the completion of the genome sequencing projects for the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, as well as the newly finished human genome, we are able to analyze orthologous relationships in protein families between higher eukaryotes. This study will help us to understand gene function and the evolution of these protein families. Previously, C. elegans proteins with at least two transmembrane segments were classified into 189 clusters, and hidden Markov models (HMMs) were created for each protein family (Remm and Sonnhammer, Genome Res., 10:1679, 2000). We used these models to retrieve fly and human homologs. 52% of the clusters contain members from worm, fly and human, while 8% of the clusters are present only in worm and fly. Only 2% of the clusters contain worm and human homologs but not fly. The remaining 37% of the clusters are worm-specific. The clusters were analyzed for orthologs using Orthostrapping, a phylogeny-based method that gives orthology confidence. We present a list of putative membrane protein orthologs in worm, fly, and human.


18. Discovering Binding Sites from Expression Patterns: A simple Hyper-Geometric Approach
Yoseph Barash, Gill Bejerano, Tommy Kaplan, Nir Friedman, School of Computer Science & Engineering, The Hebrew University;
hoan@cs.huji.ac.il
Short Abstract:

We present a fast approach to transcription factor binding site discovery. Using a simple hypergeometric model we rapidly find short conserved patterns within a gene group compared to its genome background. These seeds are iteratively expanded into PSSMs. We analyze recent yeast and human datasets, and compare to MEME.

One Page Abstract:

A central issue in molecular biology is understanding the regulatory mechanisms that control gene expression. The recent flood of genomic and post-genomic data opens the way for computational methods that elucidate the key components of these mechanisms. One important consequence is the ability to recognize groups of genes that are co-expressed, using whole-genome expression patterns. Our aim is to identify, in silico, putative transcription factor binding sites in the promoter regions of these genes that explain the co-regulation and hint at possible regulators.

In this paper we describe a simple, fast, and yet powerful, approach to this task using a hyper-geometric statistical model and a straightforward computational procedure. This results in small conserved sequence seeds that are statistically significant compared to the genome-wide promoter background. We then expand these short seeds into position specific scoring matrices using an EM-like procedure. We demonstrate the utility and speed of our methods by applying them to several recent yeast and human data sets. We also compare our results with those of MEME when run on the same sets.
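
The hypergeometric test behind such a seed search can be written down directly. The sketch below assumes a setting of `N` promoters genome-wide, `K` of which contain a given short pattern, and a co-expressed group of `n` promoters of which `k` contain it; the function name and argument names are hypothetical, and the authors' exact scoring and multiple-testing treatment are not reproduced.

```python
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: promoters in the genome-wide background
    K: background promoters containing the pattern
    n: promoters in the co-expressed group
    k: group promoters containing the pattern
    """
    total = comb(N, n)
    upper = min(K, n)
    # comb() returns 0 when the lower argument exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total
```

A small tail probability indicates that the pattern occurs in the gene group more often than the genome-wide background would suggest, which is what qualifies it as a seed.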


19. Visualizing whole genome comparisons: Artemis Comparison Tool (ACT)
Keith James, Kim Rutherford, The Sanger Centre;
kdj@sanger.ac.uk
Short Abstract:

Artemis Comparison Tool (ACT) is a DNA sequence comparison viewer based on Artemis. It presents a graphical representation of pairwise sequence comparison data generated by programs such as BLAST. EMBL, GenBank or GFF format annotation can be loaded in addition and presented in context with the comparison.

One Page Abstract:

The amount of data obtained from a pairwise comparison of whole genomes can be overwhelming, even when those genomes are highly similar. Interesting features such as syntenic regions, insertions, deletions, dispersed repeats, large-scale inversions or translocations are often not immediately apparent from the raw alignment output (e.g. BLAST output). Often the genomic context of these raw results, with respect to gene predictions and existing annotation, is lost.

Artemis Comparison Tool (ACT) is a DNA sequence comparison viewer based on Artemis. It presents a graphical representation of pairwise sequence comparison data generated by programs such as BLAST. EMBL, GenBank or GFF format annotation can be loaded in addition and presented in context with the comparison. The user can pan and zoom the view interactively, examine per-CDS database search results and create or edit annotation from within the ACT environment. Example ACT analyses of genomes sequenced at the Pathogen Sequencing Unit are presented.

In common with Artemis, ACT is written in Java and runs on UNIX, GNU/Linux, Macintosh and MS Windows systems. ACT is free software and is distributed under the terms of the GNU General Public License.

ACT is available from the ACT web site: http://www.sanger.ac.uk/Software/ACT/


20. The use of Artemis for the annotation of eukaryotic genomes
Valerie Wood, Kim Rutherford, The Sanger Centre;
val@sanger.ac.uk
Short Abstract:

Artemis is a DNA sequence viewer and annotation tool written in Java. It can read and write EMBL, GenBank and GFF format feature information and display these in the context of the sequence and its six frame translation.

One Page Abstract:

Artemis is a DNA sequence viewer and annotation tool written in Java. It can read EMBL, GenBank and GFF format feature information and display these in the context of the sequence and its six frame translation. The package can also display the results of external analyses, or plot the results of statistical calculations, performed on the sequence or CDS features.

Artemis is the main annotation tool used for genome analysis in the Pathogen Sequencing Unit at the Sanger Centre, and is used routinely for annotation of eukaryotic genomes.

The use of Artemis for the analysis and annotation of the completed genome of the unicellular fungus Schizosaccharomyces pombe, and the reannotation of Saccharomyces cerevisiae will be presented. Artemis is also used in the annotation of the parasitic worm Brugia malayi, the social amoeba Dictyostelium discoideum and the unicellular eukaryotic parasites Plasmodium falciparum, Trypanosoma brucei, Leishmania major and Toxoplasma gondii.

Artemis is available from the Artemis web site: http://www.sanger.ac.uk/Software/Artemis/

The European S. pombe genome sequencing project can be accessed at http://www.sanger.ac.uk/Projects/S_pombe/


21. A de novo approach to identifying repetitive elements in genomic sequences
Elizabeth Thomas, John Healy, Cold Spring Harbor Laboratory;
Nathan Srebro, Massachusetts Institute of Technology;
Jacob Schwartz, New York University;
Michael Wigler, Cold Spring Harbor Laboratory;
thomase@cshl.org
Short Abstract:

We investigate tools for identifying repeats in genomic sequences, using whole genome frequencies of short oligomers. Highly-repetitive elements, even if poorly conserved, will contain many high frequency oligomers. We can identify repetitive elements using this signature, without depending on prior knowledge of repeat sequence or structure.

One Page Abstract:

Less than 2% of the human genome codes for proteins (IHGSC, 2001). It has been estimated that 50% of the remaining sequence consists of interspersed repetitive elements, not all of which have been classified or identified. An understanding of the origins of these repetitive elements, and their diversity, is likely to shed light on the evolution of genomes. Tools commonly used for defining and identifying repeats depend on prior knowledge about the structure and sequence of known repeats. Because of this assumption of prior knowledge, these tools are inappropriate for identifying unknown repetitive elements in genomic sequences. Now that whole genome sequences are available, new approaches can be taken, which depend merely on the simple fact that repetitive elements repeat. We investigate tools based on the whole genome frequencies of short oligomers, and simple algorithms that can be applied to these frequencies. Highly-repetitive elements, even if poorly conserved, will contain many high frequency oligomers. We can identify repetitive elements using this signature, without depending on prior knowledge of repeat sequence or structure.
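
The oligomer-frequency signature can be illustrated with a short sketch. It assumes a single in-memory sequence, counts k-mers on one strand only (reverse complements are ignored), and flags any base covered by a k-mer seen at least `min_count` times; the function names, `k` and `min_count` are hypothetical choices, not the authors' settings.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def repetitive_positions(sequence, k=4, min_count=3):
    """Return a per-base list of booleans marking high-frequency k-mer coverage."""
    counts = kmer_counts(sequence, k)
    flagged = [False] * len(sequence)
    for i in range(len(sequence) - k + 1):
        if counts[sequence[i:i + k]] >= min_count:
            for j in range(i, i + k):
                flagged[j] = True
    return flagged
```

On a whole genome the counts would be computed once over all chromosomes and then applied to any region of interest, so that runs of flagged bases mark candidate repeats without reference to a repeat library.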


22. Extendable parallel system for automatic genome analysis
Audrius Meskauskas, Frank Lehmann-Horn, Karin Jurkat Rott, Ulm university;
Audrius.Meskauskas@medizin.uni-ulm.de
Short Abstract:

We developed an automatic system for analysing all potential genes in a defined region. We created 15 modules for sequence retrieval, E-PCR, gene prediction, similarity search, protein pattern search, etc. The system was used to look for a gene between D3S2370 and D3S1292 on human chromosome 3.

One Page Abstract:

Experimental methods of linkage analysis often indicate only a large region in which the gene of interest is located. The search interval can be narrowed by bioinformatics methods. Such methods are usually developed by specialised bioinformatics groups and are accessible from their own Internet pages. The preferred set of programs depends on the type of gene being cloned and on the general strategy of the respective research group. Submitting numerous requests to different Internet servers and analysing the responses takes a large amount of a qualified researcher's time. Therefore, we developed a Java-based data-mining program to detect and analyse all possible genes in a defined chromosome region. We created modules for the following tasks: 1. Automatic conversion between the WUSM and NCBI clone naming systems. 2. Getting the sequences and coordinates for a given set of markers. 3. Getting a list of clones for a given NCBI contig. 4. Sequence retrieval. 5. E-PCR. 6. Gene prediction. 7. BLAST similarity search against the EST database. 8. Collecting additional information on the tissues in which similar cDNA was detected. 9. Pattern search in the predicted protein, revealing the potential protein family. 10. Detection of transmembrane regions. 11. Detection of protein sorting signals. 12. Detection of PEST regions. 13. Translating between gene position inside a clone and gene position inside an NCBI contig. 14. Finding which of the genes predicted by the system are already described. 15. A central kernel for parallel submission of requests. The system was used to predict and analyse all genes lying between markers D3S2370 and D3S1292 on human chromosome 3. We noticed that, at a lower confidence level, GenScan predicts more genes than are given in the NCBI database for this region (we determined the exact dependency curve).
In some cases these newly predicted genes showed a correlation between the presence of transmembrane helices, peptide sorting signals and specific protein domains. BLAST also revealed their similarity to known cDNAs from different organs. Since they also have promoter and poly-A signals, at least some of these sequences might be functional genes. The information obtained for the predicted genes was analysed in the context of the known hypotheses about the function of the gene being sought. This reduced the search interval from about 6 Mbp to a much smaller set of potential coding sequences, fulfilling the system's task in the gene cloning project.
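
The idea of a central kernel that dispatches independent analysis modules in parallel can be sketched as below. The original system is Java-based; this Python sketch only illustrates the pattern, and `run_modules`, the module callables and the `region` tuple are all hypothetical stand-ins for the remote requests described above.

```python
from concurrent.futures import ThreadPoolExecutor

def run_modules(modules, region, max_workers=4):
    """Run every analysis module on the same region concurrently.

    modules: dict mapping a module name to a callable taking the region.
    Returns a dict mapping each module name to its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(func, region) for name, func in modules.items()}
        # result() blocks until the corresponding module has finished.
        return {name: future.result() for name, future in futures.items()}
```

Threads suit this workload because most of the time is spent waiting for responses from remote servers rather than computing locally.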


23. Whole genome phylogenies using vector representations of protein sequences
Gary W. Stuart, Department of Life Sciences, Indiana State University;
Jeffery J. Leader, Department of Mathematics, Rose-Hulman Institute of Technology;
G-Stuart@indstate.edu
Short Abstract:

Optimized SVD-based vector representations of proteins from whole genomes were used to produce comprehensive gene and species phylogenies. A pilot analysis using 832 mitochondrial proteins from 64 vertebrates produced a robust and accurate tree. A larger analysis using nearly 30,000 proteins from 17 bacterial genomes revealed some non-traditional relationships.

One Page Abstract:

Accurate phylogenetic trees have been produced following the singular value decomposition (SVD) of data matrices containing vector representations of all proteins encoded within complete genomes. Both gene trees and species trees have been derived using this method. In a pilot analysis, the complete set of 13 mitochondrial proteins from each of 64 vertebrates was used to produce a matrix representing each protein in terms of its tetrapeptide frequencies. SVD with dimension reduction was then used to provide adjusted vector representations for each protein in multidimensional space. Pairwise cosine (similarity) values were determined and converted to distance measures as required for the generation of phylogenetic trees using the NEIGHBOR program of PHYLIP. The resulting gene trees indicated that this method was clearly capable of recognizing and grouping similar proteins, as most members of the 13 mitochondrial protein families were accurately placed in large monophyletic or nearly monophyletic groups. An optimal dimension reduction was determined that produced the best grouping of genes within families.

Species trees were then produced by 1) summing the optimized SVD-based vector representations of the individual mitochondrial proteins from each organism, 2) deriving cosine-based distance values for each pair of summed vectors, and 3) using NEIGHBOR to generate trees from the resulting distance values. Within these trees, cartilaginous fish, bony fish, reptiles, birds, non-eutherian mammals, and eutherian mammals were well grouped and reasonably arranged.
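
The vector-summing and cosine-distance steps can be sketched in a few lines. Two simplifications are assumed here and should be read as such: raw tetrapeptide counts are used directly, so the SVD dimension reduction is omitted, and the cosine is converted to a distance as `1 - cosine`, which is one common choice rather than the conversion the authors necessarily used. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def peptide_vector(protein, k=4):
    """Sparse vector of overlapping k-peptide counts (tetrapeptides by default)."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))

def species_vector(proteins, k=4):
    """Sum the peptide vectors of all proteins from one organism."""
    total = Counter()
    for protein in proteins:
        total.update(peptide_vector(protein, k))
    return total

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

A matrix of such pairwise distances is the kind of input a distance-based tree builder such as NEIGHBOR expects.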

Following the successful analysis of complete mitochondrial genomes, we applied this method to the genomes of 17 bacteria, including 4 archaebacterial species. Both selected partial genome datasets (~2300 proteins) and whole genome datasets (~30,000 proteins) were analyzed. Optimal dimension reduction was estimated in some cases by observing how well genes were grouped into COG families. The resulting species trees tended to reinforce many traditional bacterial relationships, while challenging others. For instance, Borrelia burgdorferi, the spirochete responsible for Lyme disease, grouped with Rickettsia prowazekii, a proteobacterium, instead of Treponema pallidum, another spirochete.

With further refinements and increased computational power, it should be possible to produce exhaustive biomolecular phylogenies from a large number of complete prokaryotic and eukaryotic genomes.


24. Correlated Sequence Signature as Markers of Protein-Protein Interaction
Einat Sprinzak, Hanah Margalit, Department of Molecular Genetics and Biotechnology, The Hebrew University - Hadassah Medical School;
einats@md.huji.ac.il
Short Abstract:

We propose a novel approach for clustering pairs of interacting proteins by combinations of their sequence signatures. The identified correlated sequence signatures can be used as markers for predicting protein-protein interactions in the cell. Such an approach reduces significantly the search in the interaction space, and enables directed experimental screens.

One Page Abstract:

As protein-protein interaction is intrinsic to most cellular processes, the ability to predict which proteins in the cell interact can aid significantly in identifying the function of newly discovered proteins, and in understanding the molecular networks they participate in. An appealing approach would be to predict the interacting partners by characteristic sequence motifs that typify the proteins that are involved in the interaction. Valuable insight towards this end can be gained by mining databases of experimentally determined interacting proteins. Conventionally, single protein sequences have been clustered into families by distinct sequence signatures. Here we propose a novel approach for clustering different pairs of interacting proteins by combinations of their sequence signatures. To identify such informative signature combinations, a database of interacting proteins is required, as well as a scheme for characterizing protein sequences by their signatures. In the current study we demonstrate the potential of this approach on a comprehensive database of experimentally determined pairs of interacting proteins in the yeast S. cerevisiae. The proteins are characterized by sequence signatures, as defined by the InterPro classification. A statistical analysis is performed on all possible combinations of two sequence signatures, identifying combinations of sequence signatures that are over-represented in the database of pairs of interacting proteins. It is proposed that such correlated sequence signatures can be used as markers for predicting unknown protein-protein interactions in the cell. Such an approach reduces significantly the search in the interaction space, and enables directed experimental interaction screens.
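
Over-representation of signature combinations can be illustrated with a simple log-odds score. The statistic below is an assumption chosen for illustration, not the authors' test: each signature's background frequency is estimated from all protein slots in the interaction list, and the observed count of a signature pair is compared with the count expected if the two sides of an interaction were independent. The function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

def signature_pair_scores(interactions, signatures):
    """Log2 ratio of observed to expected counts for each signature pair.

    interactions: list of (protein_a, protein_b) pairs.
    signatures: dict mapping a protein to its set of signature identifiers.
    """
    n = len(interactions)
    pair_counts, side_counts = Counter(), Counter()
    for a, b in interactions:
        sigs_a = signatures.get(a, set())
        sigs_b = signatures.get(b, set())
        side_counts.update(sigs_a)
        side_counts.update(sigs_b)
        # Unordered pairs, counted once per interaction.
        pairs = {tuple(sorted((s, t))) for s, t in product(sigs_a, sigs_b)}
        pair_counts.update(pairs)
    scores = {}
    for (s, t), observed in pair_counts.items():
        p_s = side_counts[s] / (2 * n)
        p_t = side_counts[t] / (2 * n)
        expected = n * (p_s * p_t if s == t else 2 * p_s * p_t)
        scores[(s, t)] = log2(observed / expected)
    return scores
```

Pairs with high positive scores are the candidates for correlated sequence signatures; on real data a significance test and a correction for the number of combinations examined would be needed before using them as predictive markers.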


25. Molecular and Functional Plasticity in the E. coli Metabolic Map
Sophia Tsoka, Christos Ouzounis, EMBL-EBI;
tsoka@ebi.ac.uk
Short Abstract:

Large-scale sequence analysis of the enzymes comprising the entire metabolic complement of Escherichia coli was performed. Reactions and pathways were grouped according to sequence similarity of the corresponding enzymes, and enzyme families were mapped to corresponding reactions and pathways. Functional convergence/divergence is assessed and modes of pathway evolution are discussed.

One Page Abstract:

Genome analysis by sequence similarity provides useful hints for evolutionary and functional relations between proteins. Especially important is the identification of cases of divergent or convergent evolution, whereby similar sequences have different function and vice versa. These represent cases that are often overlooked by functional assignments based on detection of sequence similarity. Furthermore, understanding the intricacies of the sequence-to-function relationship for metabolic enzymes (1) can also enable reconstruction of the evolutionary history of protein function diversification and biochemical pathways (2).

Large-scale sequence analysis of the enzymes comprising the entire metabolic complement of Escherichia coli (3) has been undertaken in order to reveal the molecular and functional diversity of metabolic network components. Metabolic enzymes and their functional roles in terms of EC number classification and pathway involvement were identified from the EcoCyc database (4). Automated sequence clustering of the E. coli enzymes was performed (5) to identify enzyme families. Subsequently, reactions and pathways were grouped according to the sequence similarity of the corresponding enzymes. The mapping of similar sequences into different reactions and pathways delineates to a significant extent cases of evolutionary divergence and convergence of protein families (6).
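
The mapping between enzyme families and reactions can be sketched as a pair of inverted indexes. The sketch assumes each enzyme has already been assigned to one sequence family and one reaction; families spanning several reactions are reported as candidate divergence, and reactions served by several families as candidate convergence. The function name and input dictionaries are hypothetical.

```python
from collections import defaultdict

def divergence_and_convergence(enzyme_family, enzyme_reaction):
    """Cross-map sequence families and reactions.

    enzyme_family: dict mapping an enzyme to its family identifier.
    enzyme_reaction: dict mapping an enzyme to its reaction identifier.
    Returns (divergent, convergent): families with more than one reaction,
    and reactions catalysed by more than one family.
    """
    reactions_by_family = defaultdict(set)
    families_by_reaction = defaultdict(set)
    for enzyme, family in enzyme_family.items():
        reaction = enzyme_reaction.get(enzyme)
        if reaction is None:
            continue  # enzyme without an assigned reaction
        reactions_by_family[family].add(reaction)
        families_by_reaction[reaction].add(family)
    divergent = {f: r for f, r in reactions_by_family.items() if len(r) > 1}
    convergent = {r: f for r, f in families_by_reaction.items() if len(f) > 1}
    return divergent, convergent
```

The same cross-mapping applies unchanged if pathway identifiers are used in place of reaction identifiers.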

1 Ouzounis CA and Karp PD, Genome Res. 2000, 10:568-76.
2 Tsoka S and Ouzounis CA, FEBS Lett. 2000, 480:42-8.
3 Tsoka S and Ouzounis CA, Nature Genet. 2000, 26:141-2.
4 Karp PD et al., Nucleic Acids Res. 2000, 28:56-9.
5 Enright AJ and Ouzounis CA, Bioinformatics 2000, 16:451-7.
6 Tsoka S and Ouzounis CA, submitted.


26. Genome Size Distribution in Prokaryotes, Eukaryotes, and Viruses
David Ussery, Heidi Dvinge, Herluf Riddersholm, Nikolaj Blom, Kristoffer Rapacki, Center for Biological Sequence Analysis, Biocentrum-DTU, Denmark;
dogs@cbs.dtu.dk
Short Abstract:

The haploid genome size for more than 5000 organisms is compared. We find a large distribution of sizes, ranging from about 1000 base-pairs (bp) to 670,000,000,000 bp. We compare the genome sizes and repeats for chromosomes from all four eukaryotic kingdoms as well as for prokaryotes and viruses.

One Page Abstract:

The haploid genome size for more than 5000 organisms is compared. We find a very wide distribution of sizes, ranging from a viral genome of around 1000 base pairs (bp) to more than 670,000,000,000 bp for Amoeba dubia. We compare the genome size for all four eukaryotic kingdoms as well as for prokaryotes and viruses. Whilst there is no correlation between the biological complexity of an organism and the size of its genome, there is often a correlation between the size of the nucleus and the genome size. That is, the concentration of the DNA appears to be constant in certain groups of organisms. We compare the genome size to the fraction of various types of DNA repeats for 65 sequenced eukaryotic chromosomes and more than 100 prokaryotic chromosomes, as well as more than 300 viral chromosomes. In general, larger eukaryotic chromosomes contain more repetitive DNA than would be expected for a random sequence of the same base composition, and direct repeats occur more often than inverted repeats.

The Database Of Genome Sizes (DOGS) can be found at the following URL: http://www.cbs.dtu.dk/databases/DOGS/index.html


27. EnsEMBL Genome Annotation Project
Ewan Birney, EBI;
Michelle Clamp, Tim Hubbard, The Sanger Centre;
Lukasz Huminiecki, Emmanuel Mongin, Arne Stabenau, EBI;
birney@ebi.ac.uk
Short Abstract:

EnsEMBL, a joint project between the Sanger Centre and the EBI, is an automatic annotation system for eukaryotic genomes. All data and code are freely available and easy to access through CVS and the web. All code is written in object-oriented Perl, using MySQL as the backend relational database.

One Page Abstract:

EnsEMBL is an automatic annotation system for eukaryotic genomes. It has been designed as a fully portable, platform-independent system which handles both finished and unfinished genomes. It provides annotations at both the nucleic acid and the amino acid level. The features identified on the DNA sequence by EnsEMBL mostly comprise genes, transcripts (alternative splice variants), exons, markers, SNPs, repeats and regions highly similar to other sequences. For each predicted peptide, EnsEMBL provides InterPro (Pfam, PRINTS, PROSITE) domain annotation.

Currently EnsEMBL distributes data for the human and mouse genomes. These data are held in a number of relational databases; e.g. the human data are stored in the core database (sequences and genes), the SNP database, the mouse trace database, the disease database, etc. Only the core database is needed to run the EnsEMBL software.

The EnsEMBL website (www.ensembl.org) provides easy access to this information through a number of visualisation tools such as GeneView, MapView and ContigView. Additionally, an FTP site (ftp.sanger.ac.uk:/pub/ensembl/current) allows large amounts of genomic data to be downloaded.

A number of algorithms are used in the production of sequence annotations: BLAST, exonerate, Genscan, GeneWise, etc. These algorithms tend to be computationally intensive and as such require both dedicated software and hardware. Automation of the annotation procedure is achieved by the EnsEMBL pipeline, which uses LSF to distribute the computations to a farm of Alpha servers.

EnsEMBL's code is written in object-oriented Perl to facilitate better software design and easier porting to Java and CORBA. BioPerl (www.bioperl.org) interfaces are used wherever appropriate. The code is separated into packages according to its use. Packages exist for the operation of the analysis pipeline and the web server, as well as for the various non-essential datasets (SNP, Maps, Disease, SAGE). The underlying MySQL database is accessed through a layer of adaptor objects, which provide persistence for high-level objects. These represent familiar entities such as Genes, Exons, Features and Sequences.
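The adaptor layer mentioned above can be pictured with a small sketch. This is not EnsEMBL code: it is written in Python rather than Perl, uses the standard-library sqlite3 module as a stand-in for MySQL, and the class names, table name and column names are hypothetical. It only illustrates the general idea that a high-level object (here an exon) never issues SQL itself, while an adaptor object stores and retrieves it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Exon:
    """High-level object; it knows nothing about the database."""
    exon_id: int
    start: int
    end: int


class ExonAdaptor:
    """Hypothetical adaptor: the only place where SQL for exons lives."""

    def __init__(self, connection):
        self.connection = connection
        # Illustrative schema, not the EnsEMBL core schema.
        connection.execute(
            "CREATE TABLE IF NOT EXISTS exon ("
            "exon_id INTEGER PRIMARY KEY, seq_start INTEGER, seq_end INTEGER)"
        )

    def store(self, exon):
        self.connection.execute(
            "INSERT OR REPLACE INTO exon VALUES (?, ?, ?)",
            (exon.exon_id, exon.start, exon.end),
        )

    def fetch_by_id(self, exon_id):
        row = self.connection.execute(
            "SELECT exon_id, seq_start, seq_end FROM exon WHERE exon_id = ?",
            (exon_id,),
        ).fetchone()
        return Exon(*row) if row else None


connection = sqlite3.connect(":memory:")  # in-memory stand-in for a server
adaptor = ExonAdaptor(connection)
adaptor.store(Exon(exon_id=1, start=100, end=291))
print(adaptor.fetch_by_id(1))
```

Keeping the SQL inside the adaptor means the backend could be exchanged without touching the code that works with the objects.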

EnsEMBL provides CVS access to all its code to encourage people to contribute to the project. The code development is openly discussed on a mailing list (ensembl-dev@ebi.ac.uk).

EnsEMBL is a joint project between the Sanger Centre and the European Bioinformatics Institute.


28. A Framework for Identifying Transcriptional cis-Regulatory Elements in the Drosophila Genome
Benjamin P. Berman, University of California, Berkeley;
Barret D. Pfeiffer, Susan E. Celniker, Lawrence Berkeley National Laboratory;
Michael B. Eisen, Lawrence Berkeley National Laboratory & University of California, Berkeley;
Gerald M. Rubin, Howard Hughes Medical Institute & University of California, Berkeley;
benb@fruitfly.berkeley.edu
Short Abstract:

We are developing techniques that make use of transcription factor binding specificities and evolutionary conservation of binding sites to search for cis-regulatory enhancer elements genome-wide. We have evaluated these techniques using a collection of well-studied enhancer elements from the transcriptional cascade that controls the development of anterior/posterior segmentation in Drosophila.

One Page Abstract:

The development and maintenance of the many diverse cell types of complex multi-cellular organisms results in large part from cis-regulatory DNA sequences that control precise mRNA transcriptional programs. In addition to promoter sequences important for recruiting the basal transcriptional machinery, “enhancer” modules up to several hundred base pairs long contain binding sites for various sequence-specific transcription factors. The state of these transcription factors constitutes the input to the transcriptional program, and the enhancer sequence serves as the “logic” which integrates these diverse inputs. This logic can facilitate both inhibitory and cooperative interactions. A single gene may be regulated by many such enhancer modules, which may act independently or together to affect transcription in various spatial and temporal domains during the organism’s life.

Our aim is to identify novel enhancer modules in the genome of the fruitfly, Drosophila melanogaster. Because enhancer modules can be found within a flanking region of DNA spanning many kilobases, this is not a simple task. We are developing algorithms that take advantage of two important observations: (1) that enhancers are likely to contain many individual binding sites for diverse transcription factors, and (2) that functional binding sites are likely to be conserved through evolution. Previous work indicates that enhancers might be identified with considerable specificity by searching for clusters of conserved binding sites. We have been developing several tools to evaluate this prospect.

We have developed an interactive web database to manage transcription factor binding specificities and to plot potential binding sites in the context of genomic annotations. We construct position weight matrix (PWM) models for each of the factors in our database, and these models are used to score potential sites. Cutoff scores can be interactively adjusted, as can constraints geared towards finding clusters of sites within a given window size.
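The scoring scheme described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the PWM is a log-odds matrix with a uniform background and a pseudocount of 1, only the forward strand is scanned, the sequence is assumed to contain only A, C, G and T, and the example sites and sequence are invented. The function names, cutoff and window parameters are hypothetical.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def score_sites(sequence, pwm, cutoff):
    """Return (start, score) for every window scoring at or above cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= cutoff:
            hits.append((start, score))
    return hits


def find_clusters(hit_starts, window_size, min_sites):
    """Groups of hits with at least min_sites starts inside window_size bp."""
    starts = sorted(hit_starts)
    clusters = []
    for i, first in enumerate(starts):
        group = [s for s in starts[i:] if s - first < window_size]
        if len(group) >= min_sites:
            clusters.append(group)
    return clusters


# Invented example data.
known_sites = ["TTAATCC", "TTAATCC", "TTAATCG", "CTAATCC"]
sequence = "GGGTTAATCCGGGGGTTAATCCGGG"

pwm = build_pwm(known_sites)
hits = score_sites(sequence, pwm, cutoff=8.0)
clusters = find_clusters([start for start, _ in hits], window_size=20, min_sites=2)
print(hits)
print(clusters)
```

Raising the cutoff or `min_sites`, or shrinking `window_size`, trades sensitivity for specificity, which is the kind of adjustment the interactive interface is described as exposing.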

In order to evaluate our techniques, we have focused on a particular biological process. The cascade that ultimately gives rise to anterior-posterior segmentation in Drosophila involves complex interactions between a host of early embryonic transcription factors, and this process is among the best understood examples of complex transcriptional regulation to be found in any organism. We have collected a set of more than 20 enhancer modules that have been experimentally shown to regulate expression patterns during this process. We will discuss the ability of our techniques to distinguish these known enhancers with specificity appropriate for genome-scale searches.


29. Genome-wide modeling of protein structures
Ole Lund, Morten Nielsen, Thomas Nordahl Petersen, Claus Lundegaard, Garry P. Gippert, Structural Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.;
Muthu Prabhakaran, Gopalan Raghunathan, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127. www.strubix.com.;
Jakob Bohr, Søren Brunak, SAB, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127.;
Kal Ramnarayan, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127. www.strubix.com.;
olund@strubix.dk
Short Abstract:

We have developed a very sensitive and specific fold recognition method, as well as a method to create accurate alignments for genome wide use. From these alignments protein models for a large part of several complete genomes are generated using the PROTEOMINE (TM) program.

One Page Abstract:

We have developed a very sensitive and specific fold recognition method, as well as a method to create accurate alignments for genome-wide use. From these alignments, protein models for a large part of several complete genomes are generated using the PROTEOMINE (TM) program. The performance of these methods is compared to the state of the art, both by in-house benchmarks and by participation in CASP4. We apply our methods to the creation of relational databases of the tertiary and/or the secondary structure of all proteins in genomes. These databases can be used for drug discovery and modeling of protein-protein complexes.


30. Genome wide search of human imprinting genes by data mining of EST in UniGene
Maxwell P. Lee, Howard Yang, Ying Hu, Michael Edmonson, Ken Buetow, Hongtao Fan, Lo Hanson, NIH/NCI/DCEG/LPG;
leemax@mail.nih.gov
Short Abstract:

We took a computational approach to search systematically for all human imprinted genes. Analysis of SNP frequency in ESTs in human UniGene has identified 140 candidate imprinted genes. Three of them correspond to known imprinted genes. We are currently validating these findings experimentally.

One Page Abstract:

Genomic imprinting is an epigenetic modification of the chromosome that leads to preferential expression of a specific parental allele of the gene. Abnormal imprinting is associated with several human diseases, and loss of imprinting (LOI) is frequently found in human cancers. We have undertaken a computational approach to search systematically for all human imprinted genes. We searched for imprinted genes in a single nucleotide polymorphism (SNP) database containing all SNPs in expressed sequence tags (ESTs). Bayesian statistics were used to estimate the genotype frequency. A significant reduction in heterozygotes suggests that the SNP is located either in an imprinted gene or in a region involved in loss of heterozygosity in tumor cells. From 1.8 million ESTs in UniGene, we have analyzed 20130 genes and have identified 140 candidate imprinted genes, three of which correspond to known imprinted genes. We are currently validating these findings experimentally. In conclusion, data mining of ESTs represents an effective way to search genome-wide for imprinted genes.


31. What we learned from statistics on arabidopsis documented genes
Pierre Rouzé, Laboratoire Associé de l'Institut National de la Recherche Agronomique (France), Universiteit Gent, Belgium;
Catherine Mathé, Flanders Interuniversity Institute for Biotechnology (VIB). Department of Plant Genetics. Universiteit Gent, Belgium;
Sébastien Aubourg, Unité de Recherche en Génomique Végétale, Evry, France;
Patrice Déhais, Flanders Interuniversity Institute for Biotechnology (VIB). Department of Plant Genetics. Universiteit Gent, Belgium;
Pierre.Rouze@gengenp.rug.ac.be
Short Abstract:

In order to improve our knowledge of Arabidopsis gene structure, we analyzed a set of almost two thousand genes properly annotated by aligning cognate cDNA to genomic sequences.

We present statistics on length and composition of the genes and their elements, codon usage and also new data on some signals.

One Page Abstract:

While the Arabidopsis genome is now fully sequenced, its sequence is still not completely decrypted, or at least not in a reliable way. Current annotations were made in an automatic (or semi-automatic) way, using gene prediction software that we know to be far from satisfactory (Pavy et al., 1999). Convinced of the necessity to learn more about gene structures before trying to improve gene prediction strategies, we carried out a statistical analysis of known genes. We only used genes for which biological evidence was available, i.e. for which we had the cognate mRNA (cDNA). Gene structures (intron/exon organisation) were completely reconstructed by aligning the genomic DNA sequence to its cognate cDNA. Thus, most results presented here were obtained from a data set of 1,811 genes, which is not much compared to the expected 26,000 genes of Arabidopsis, but at least refers to true data. Several kinds of analysis were performed. Overall, the results underline the high diversity of genes, in terms of composition and structure, as well as the non-unique definition of elements (introns, exons) along genes. Indeed, a "standard" gene (transcription unit) should be 2.4 kb long, with 7 coding exons of 192 bp each, and lie about 2 kb from the previous gene. But, besides these means, 10% of the genes have only one exon; the maximal number of exons per gene in our data was 79; an intron can reach 4 kb; and an intergenic sequence can be as small as 300 bp for co-oriented genes or even negative (overlapping genes) otherwise. Moreover, within a gene, the first and last coding exons are larger (270 bp on average) than the internal ones (147 bp), and the first introns are also larger than the subsequent ones (230 bp against 155 bp, on average). We also found some indication of a compositional bias along coding sequences (CDS), with an enrichment of A3% from 5' to 3' (and a decrease of T3), while (G+C)3% is minimal in the middle part of CDS.
Interestingly, when comparing codon usage between several genes, we come to the conclusion that codon usage is influenced to a large extent by translation efficiency-related constraints, leading to 2 gene classes (Mathé et al., 1999). Furthermore, we analyzed gene-specific signals: splice sites, including some non-canonical ones such as GC/AG; translation initiation codons; and polyadenylation sites.
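Statistics of the kind reported above can be derived directly from exon coordinates once the cDNA-to-genomic alignment is available. The following Python sketch assumes each gene is given as a sorted list of 1-based, inclusive (start, end) coding-exon coordinates; the function names and the two toy genes are hypothetical and are not taken from the data set.

```python
from statistics import mean


def intron_lengths(exons):
    """Intron lengths from sorted, 1-based inclusive exon (start, end) pairs."""
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


def exon_length_summary(genes):
    """Mean length of terminal (first/last) and internal coding exons.

    Single-exon genes are left out, since their only exon is both
    first and last.
    """
    terminal, internal = [], []
    for exons in genes:
        lengths = [end - start + 1 for start, end in exons]
        if len(lengths) < 2:
            continue
        terminal.extend([lengths[0], lengths[-1]])
        internal.extend(lengths[1:-1])
    return {
        "terminal_mean": mean(terminal) if terminal else None,
        "internal_mean": mean(internal) if internal else None,
    }


# Invented toy genes: one three-exon gene and one single-exon gene.
genes = [
    [(1, 270), (501, 647), (801, 1070)],
    [(1, 100)],
]
summary = exon_length_summary(genes)
introns = intron_lengths(genes[0])
single_exon_fraction = sum(len(g) == 1 for g in genes) / len(genes)
print(summary, introns, single_exon_fraction)
```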

References

Pavy, N., Rombauts, S., Déhais, P., Mathé, C., Ramana, D.V.V., Leroy, P. and Rouzé, P. (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15 (11): 887-899

Mathé, C., Peresetsky, A., Déhais, P., Van Montagu, M. and Rouzé, P. (1999) Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. Journal of Molecular Biology, 285 (5), 1977-1991.


32. Evaluation of Computer Algorithms to Search Regulatory Protein Binding Sites
Esperanza Benítez-Bellón, Gabriel Moreno-Hagelsieb, Julio Collado-Vides, CIFN, UNAM, México;
ebenitez@cifn.unam.mx
Short Abstract:

By the analysis of known regulons we evaluate the efficiency of two methods to extract patterns defining regulator-binding sites and to find other members of the regulon. The points of maximum accuracy for each family are reported.

One Page Abstract:

We present an evaluation of two methods for extracting patterns corresponding to the binding sites of several transcriptional regulators reported in RegulonDB. We define a positive and a negative set for each regulon and then find the patterns: a matrix when using CONSENSUS (Bioinformatics 15(7/8):563-577, 1999), and a set of dyads when using dyad-detector (Nucl Acids Res 28(8):1808-1818, 2000). We evaluate true positives, true negatives, accuracy, and positive predictive values. We find the best thresholds for detecting new members of each regulon, and provide some predictions.
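The evaluation measures named above follow from a confusion table, and the threshold search can be pictured as a scan over candidate cutoffs. The sketch below is a generic illustration in Python: the gene names, scores and thresholds are invented, and choosing the threshold with the highest accuracy is an assumption about what "best" means here.

```python
def evaluate(predicted, positives, negatives):
    """Confusion counts plus accuracy and positive predictive value (PPV)."""
    predicted = set(predicted)
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    fn = len(positives - predicted)
    tn = len(negatives - predicted)
    total = tp + fp + fn + tn
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "PPV": tp / (tp + fp) if (tp + fp) else 0.0,
    }


def best_threshold(scores, positives, negatives, thresholds):
    """Return (threshold, metrics) with the highest accuracy.

    scores maps each gene to its best pattern score; a gene is predicted
    to belong to the regulon when its score reaches the threshold.
    """
    results = []
    for threshold in thresholds:
        predicted = {gene for gene, score in scores.items() if score >= threshold}
        results.append((threshold, evaluate(predicted, positives, negatives)))
    return max(results, key=lambda item: item[1]["accuracy"])


# Invented example: three known regulon members, four non-members.
positives = {"g1", "g2", "g3"}
negatives = {"g4", "g5", "g6", "g7"}
scores = {"g1": 9, "g2": 7, "g3": 4, "g4": 8, "g5": 2, "g6": 1, "g7": 3}

threshold, metrics = best_threshold(scores, positives, negatives, [2, 4, 6, 8])
print(threshold, metrics)
```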


33. An HMM Approach to Identify Novel Repeats Through Fluctuations in Composition
Hui Wang, Affymetrix, Inc; Dept of Computer Science and Computer Engineering, UCSC, Santa Cruz, CA 95064 USA;
David Haussler, Department of Computer Science and Engineering, University of California at Santa Cruz, Santa Cruz, California 95064 ;
John Burke, DoubleTwist Inc. Oakland, California 94612 USA;
hui_wang@affymetrix.com
Short Abstract:

We present an automatic, efficient HMM method for identifying novel repeats. This method compares local fluctuations in sequence composition with the composition of the database as a whole. It does not rely on similarity comparison with known repetitive element databases and is of great utility for microarray and sequence analysis.

One Page Abstract:

Repetitive DNA sequences are widely distributed among eukaryotic and prokaryotic genomes. Reassociation kinetics studies of eukaryotic DNA have established that approximately 30% of the human genome is composed of repetitive DNA sequences, while some other genomes, such as that of X. laevis, contain up to 70% repetitive sequences. These features, often called repetitive elements, have been categorized and archived.

If not accounted for, these elements can render many sequence analysis procedures uninformative. Massively parallel gene expression monitoring through microarray technology can potentially be seriously affected by the existence of repetitive elements. The problem is that repetitive sequences can cause sequence similarity among genes unrelated in function. This problem can be even worse in SNP detection. Hence in array design and sequence functional analysis, it is very important to take repetitive elements into consideration.

Many procedures for repeat identification and detection have been developed. However a fundamental limitation of these methods is that they rely mostly upon databases of experimentally known elements and similarity searching algorithms to detect the existence of repetitive elements.

There are databases of mRNA fragments (also called expressed sequence tags or ESTs) that sample many of the transcribed regions of the genome. These databases contain multiple representations of non-repetitive elements with frequencies greater than many real repeats; hence positive and negative cases are often indistinguishable on the basis of mere frequency. Clustering and alignment of these databases can remove redundancy, but the presence of repeats reduces the tractability of this procedure as well. Coupled with the availability of whole genome sequences, a thorough, efficient (preferably linear time) computational approach to identifying repeats would be of great utility.

Hidden Markov Models (HMM) have been successfully used in biological sequence analysis for gene finding, protein structural modeling and phylogenetic analysis.

We present a method for repeat detection that collates repeat frequency in the entire database with local shifts in sequence composition to identify positive cases. A simple HMM is built and used to distinguish repeat states from non-repeat states. We apply this approach to both expressed sequences and whole genome sequences. The method runs in linear time and constant space with respect to database size.


34. Novel non-coding RNAs identified in the genomes of Methanococcus jannaschii and Pyrococcus furiosus
Robert J. Klein, Washington University;
Sean Eddy, Howard Hughes Medical Institute and Washington University;
rjklein@genetics.wustl.edu
Short Abstract:

We used a bias in G+C content as the basis for a computational screen for novel structural RNAs in sequenced, AT-rich, hyperthermophile genomes. This screen identifies most noncoding RNA loci as well as several novel loci. Expression of small RNAs from some of these loci has been experimentally confirmed.

One Page Abstract:

The G+C content of structural RNA genes positively correlates with optimal growth temperature, while the G+C content of an entire genome does not. Although this GC composition difference is undetectably weak in most sequenced genomes, it is a strong bias in AT-rich hyperthermophile genomes (e.g. Methanococcus jannaschii and the various Pyrococcus species). We are using this bias as the basis for a computational screen for novel structural RNAs. Using a two-state hidden Markov model (a formal statistical model of the expected GC bias), we have identified GC-rich regions of these genomes. We have shown that the screen identifies almost all known structural RNAs. We have also identified and done preliminary computational characterization of 14 putative noncoding RNA loci in Methanococcus jannaschii, and 9 putative noncoding RNA loci in Pyrococcus furiosus. Northern blot analysis clearly demonstrates that 3 of these loci in Methanococcus jannaschii and 5 of these loci in Pyrococcus furiosus are expressed as small RNAs. We have cloned and sequenced the full length of several of these RNAs, and sequence analysis argues strongly in favor of a non-coding, rather than small-peptide coding, function for these RNA molecules.
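A two-state HMM of the kind mentioned above can be sketched in a few lines of Python. The sketch below decodes the most probable state path with the Viterbi algorithm in log space. All parameter values (G+C fraction of each state, switching probability) are illustrative assumptions, not the values used in the screen, and any character other than G or C is treated as A/T.

```python
import math

STATES = ("background", "gc_rich")


def viterbi_gc(sequence, p_gc_background=0.3, p_gc_rich=0.6, p_switch=0.01):
    """Most probable state path under a two-state HMM (log-space Viterbi)."""
    def emit(state, base):
        p_gc = p_gc_rich if state == "gc_rich" else p_gc_background
        return math.log(p_gc / 2 if base in "GC" else (1 - p_gc) / 2)

    stay, switch = math.log(1 - p_switch), math.log(p_switch)
    score = {s: math.log(0.5) + emit(s, sequence[0]) for s in STATES}
    back = []
    for base in sequence[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + (stay if r == s else switch))
            pointers[s] = prev
            new_score[s] = score[prev] + (stay if prev == s else switch) + emit(s, base)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def gc_rich_regions(path):
    """Collapse a state path into half-open (start, end) gc_rich intervals."""
    regions, start = [], None
    for i, state in enumerate(path):
        if state == "gc_rich" and start is None:
            start = i
        elif state != "gc_rich" and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions


# Invented test sequence: a GC-rich island inside AT-rich flanks.
genome = "AT" * 30 + "GC" * 30 + "AT" * 30
regions = gc_rich_regions(viterbi_gc(genome))
print(regions)
```

In a real screen the reported regions would then be compared with annotated structural RNA genes, leaving unannotated GC-rich regions as candidate loci.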


35. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs
Christian M. Zmasek, Washington University School of Medicine;
Sean R. Eddy, Howard Hughes Medical Institute and Washington University School of Medicine;
zmasek@genetics.wustl.edu
Short Abstract:

A procedure for automated inference of orthologs over bootstrap resampled phylogenetic trees is presented ("RIO", Resampled Inference of Orthologs). This is used for functional prediction via phylogenetic analysis ("phylogenomics"). Results from analyzing the C. elegans proteome are shown. We discuss where phylogenomic analyses might be more reliable than similarity-based analyses.

One Page Abstract:

When analyzing protein sequences using sequence similarity searches, orthologous sequences (diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics", [1]) is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Here we present a procedure for automated phylogenomics using explicit phylogenetic inference.

At the center stands the inference of gene duplications by comparing the gene tree containing the sequence to be analyzed to a trusted species tree. Various algorithms to accomplish this have been published. We employ one of our own design, which has a pathological worst-case behavior of O(n²) but which appears to be superior in most practical cases, partially due to its simplicity ([2], and references therein).

A major caveat of all phylogenetic analyses is the unreliability of the resulting trees. Therefore, inference of gene duplications is performed over bootstrap resampled phylogenetic trees to estimate the reliability of the orthology assignments (RIO -- Resampled Inference of Orthologs). In addition, unusual differences in maximum likelihood branch length values are used to automatically detect other potential pitfalls for functional annotation caused by unequal rates of evolution.
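The idea of resampled orthology inference can be sketched as follows. This is not the algorithm of [2]: duplications are inferred here with a simple clade-based mapping (a gene-tree node is called a duplication when it maps to the same species-tree clade as one of its children), trees are plain nested tuples, and the gene names, species names and four "bootstrap" trees are invented. The support value for a gene is the fraction of resampled trees in which it is separated from the query only by speciation at their last common ancestor.

```python
from collections import Counter


def leaf_set(tree):
    """All leaves under a nested-tuple tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def all_clades(tree):
    """Leaf sets of every node in the tree, including single leaves."""
    if isinstance(tree, str):
        return [leaf_set(tree)]
    result = [leaf_set(tree)]
    for child in tree:
        result.extend(all_clades(child))
    return result


def lca_clade(species, species_clades):
    """Smallest species-tree clade containing every species in the set."""
    return min((c for c in species_clades if species <= c), key=len)


def orthologs_of(query, gene_tree, species_of, species_clades):
    """Genes whose last common ancestor with query is a speciation node."""
    if query not in leaf_set(gene_tree):
        raise ValueError("query gene is not in the gene tree")

    def mapped(genes):
        return lca_clade(frozenset(species_of[g] for g in genes), species_clades)

    found, node = set(), gene_tree
    while not isinstance(node, str):  # walk from the root down to the query
        child_leaves = [leaf_set(child) for child in node]
        is_duplication = mapped(leaf_set(node)) in [mapped(l) for l in child_leaves]
        for child, leaves in zip(node, child_leaves):
            if query in leaves:
                next_node = child
            elif not is_duplication:
                found.update(leaves)
        node = next_node
    return found


def rio_support(query, bootstrap_trees, species_of, species_tree):
    """Fraction of resampled trees in which each gene is an ortholog of query."""
    species_clades = all_clades(species_tree)
    counts = Counter()
    for tree in bootstrap_trees:
        counts.update(orthologs_of(query, tree, species_of, species_clades))
    return {gene: n / len(bootstrap_trees) for gene, n in counts.items()}


# Invented example: one query gene, three other genes, four resampled trees.
species_tree = (("human", "mouse"), "worm")
species_of = {"hA": "human", "hB": "human", "mA": "mouse", "wA": "worm"}
t1 = ((("hA", "mA"), "hB"), "wA")
t2 = ((("hA", "hB"), "mA"), "wA")
t3 = (("hA", ("mA", "hB")), "wA")
support = rio_support("hA", [t1, t2, t3, t1], species_of, species_tree)
print(support)
```

In this toy case the worm gene is an ortholog of the query in every tree, the mouse gene in three of four, and the human paralog in none.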

We show results of performing this procedure on the C. elegans proteome. It appears that this procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies.

This procedure is being implemented as a suite of Java classes and Perl scripts and will eventually be available in its entirety as (part of) the "FORESTER" framework at http://www.genetics.wustl.edu/eddy/forester/.

[1] Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research, 8, 163-167.

[2] Zmasek CM and Eddy SR (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, in press.


36. Comparative genomics for data mining of eukaryotic and prokaryotic genomes
Clemens Suter-Crazzolara, Günther Kurapkat, LION bioscience;
suter@lionbioscience.com
Short Abstract:

With the increase in number of completely sequenced genomes, comparative genomics is becoming increasingly important for data mining purposes. We describe a software solution for comparison of multiple eu- and prokaryotic genomes. Key benefits are: high speed analysis and flexible inclusion of any number of genomes, biological databases and applications.

One Page Abstract:

With the advent of genome projects, scientists are confronted with a wealth of sequence information. Both the sizes and the number of biological databases grow rapidly. However, the gap between data collection and interpretation is also growing. For this reason, genome databases contain a wealth of information which is accessible, but which may remain obscured. Intelligent systems are needed to bridge the widening gap between data collection and interpretation. To interpret data from newly sequenced eu- and prokaryotic genomes, comparative genomics is rapidly gaining importance. This results from the observation that even genomes of distantly related organisms may still encode proteins with high sequence similarity. Additionally, the order of genes within a genome may also be conserved. As the number of completely annotated genomes grows, comparing new genomes to this knowledge base becomes increasingly important for collecting biological information. We have used these observations to design a computational analysis system that offers novel ways of characterizing gene function through genome comparisons. In an initial step the researcher can, through a simple web interface, add genomes from in-house sequencing projects or public sources to the system. Genome comparison relationships are then determined with several algorithms. This results in the collection of data concerning homology and orthology relationships, as well as gene order. This information is stored in five distinct databases. In the second step, the researcher can query these databases for interactive comparisons of genomes. Results are either depicted in graphical views to allow easy interpretation or in tabular form to summarize the data obtained. Initially designed for the analysis of prokaryotic genomes, the application has now been further developed to allow the analysis of eukaryotic genomes such as human, mouse, D. melanogaster, C. elegans, plants or yeasts.
Expressed Sequence Tag (EST) consensus sequences can also be imported, to allow transcriptome researchers to compare such sequences to completed genomes, allowing the assignment of functions to ESTs. The application is based on the data integration system SRS (Etzold, T. et al., 1996. Methods Enzymol. 266: 114-128) which results in several unique characteristics: 1. High flexibility. The user can add any number of genomes, biological databases or applications to the system. Currently SRS allows the seamless integration of more than 400 biological databases. 2. Reliable, high speed handling of large genomic data sets. The most complex queries give results instantly. 3. Unique, SRS-based linking functions between all databases result in access to a wealth of biological data. 4. User friendly graphical representations allow easy interpretation of search results. These benefits result in highly efficient collection of information on genomes, genes and proteins. The application can be used for projects spanning from the identification of drug targets to the correct annotation of genomes. (http://www.lionbioscience.com/genomeSCOUT)


37. DNA atlases for the Campylobacter jejuni genome
Lise Petersen, CBS, Biocentrum-DTU, and Department of Microbiology, DVL, Denmark;
Stephen L.W. On, Department of Microbiology, Danish Veterinary Laboratory, DK-1970 Copenhagen.;
David W. Ussery, Center for Biological Sequence Analysis, Danish Technical University, DK-2800 Lyngby;
lpe@svs.dk
Short Abstract:

We have analyzed the genome sequence of C. jejuni NCTC11168 for DNA structural motifs. Whilst global repeats are under-represented, local repeats (including palindromic regions) are over-represented in the C. jejuni genome. The three hyper-variable regions of the genome (which all encode surface-exposed products) display unique structural properties.

One Page Abstract:

We have analyzed the genome sequence of Campylobacter jejuni NCTC11168 using "DNA Atlases", a method for visualizing the DNA properties of an entire chromosome as a circular plot (Genome Atlas). These properties include mechanical or structural parameters (such as intrinsic curvature, base-stacking energy, and DNA flexibility) as well as the occurrence of local and global repeats, including palindromic sequences. We find that for the C. jejuni genome, global repeats are under-represented, whilst local repeats are over-represented, and the percentage of palindromes significantly exceeds that of other known pathogenic bacteria, including E. coli. This is partly due to the high AT percentage of C. jejuni, but may still lead to increased mutability compared to bacteria with lower values. There are three chromosomal regions containing hypervariable sequences. One region encompasses two sets of global repeats, the fla genes and one additional set of tandemly arranged genes. The latter contain a conserved domain that shares homology with other annotated genes in the same chromosomal region. The possible function of these genes will be discussed. Furthermore, of the 1700 genes annotated in the genome sequence, we estimate that approximately 12% are random ORFs, and not true genes. We conclude that the genome atlas is a valuable tool, and that the results can be exploited for both fundamental and applied research purposes. Genome atlases for C. jejuni NCTC11168 are available at

http://www.cbs.dtu.dk/services/GenomeAtlas/Bacteria/Campylobacter/jejuni/NCTC11168/.
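One of the properties discussed in this abstract, the occurrence of palindromic sequences, is easy to illustrate. The Python sketch below finds perfect DNA palindromes, i.e. windows equal to their own reverse complement, and reports the fraction of window positions that are palindromic. It is a simplified stand-in, not the Genome Atlas calculation; the window length and the example sequence are arbitrary, and quasi-palindromes are not handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def palindrome_positions(sequence, length=6):
    """Start positions of windows that equal their own reverse complement."""
    return [
        i
        for i in range(len(sequence) - length + 1)
        if sequence[i:i + length] == reverse_complement(sequence[i:i + length])
    ]


def palindrome_fraction(sequence, length=6):
    """Fraction of window start positions that are perfect palindromes."""
    windows = len(sequence) - length + 1
    if windows <= 0:
        return 0.0
    return len(palindrome_positions(sequence, length)) / windows


# Invented example containing the palindrome GAATTC.
example = "CCGAATTCGG"
positions = palindrome_positions(example, length=6)
print(positions, palindrome_fraction(example, length=6))
```

Because AT-rich sequences contain palindromes more often by chance, a comparison between genomes would normally be made against the expectation for a random sequence of the same base composition.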


38. DNA atlases for the Staphylococcus aureus genome
Christian B. Jendresen, Maiken H. Pedersen, Morten S. Thomsen, Torsten Kolind, David W. Ussery, Center for Biological Sequence Analysis, DTU;
s001638@student.dtu.dk
Short Abstract:

Based on DNA atlases of two recently published S. aureus genomes, we have found an obvious symmetry in the circular chromosomes. We also found the pathogenic islands to have structurally extreme properties. Finally, we found several matches between resistance genes and S. aureus plasmids, suggesting horizontal gene transfers.

One Page Abstract:

Based on two recently sequenced multi-resistant strains of Staphylococcus aureus we have examined the organism's significant ability to acquire resistance genes and mutate certain pathogenic genes. DNA atlases are used to provide an overview of several structural properties of the genome. Parameters include intrinsic curvature, flexibility, stacking energy, local and global repeats, quasi- and perfect palindromes. These are used to locate deviating DNA segments able to provide us with information on S. aureus. The S. aureus genome is unusually symmetric, which simplifies origin and terminus determination. We have found DNA atlases to be a valuable tool in studying the correlations between DNA structure and function. Pathogenic factors and superantigens have mutated, thus making the bacteria capable of evading the immune systems of host organisms. Numerous antibiotic resistance genes are located near global repeats and on transposable elements. Close alignments with plasmid vectors suggest occurrence of horizontal gene transfers.

Genome atlases for S. aureus N315 and S. aureus Mu50 are available at http://www.cbs.dtu.dk/services/GenomeAtlas/Bacteria/Staphylococcus/aureus


39. SNAPping up functionally related genes based on context information: a colinearity free approach
Grigory Kolesov, Hans-Werner Mewes, Dmitrij Frishman, MIPS, Institute for Bioinformatics, GSF National Research Center for Environment and Health;
G.Kolesov@gsf.de
Short Abstract:

A new algorithm finds functionally related non-homologous genes in prokaryotic genomes. It does not rely on the presence of conserved gene strings. Instead, it utilizes the graph of neighborhood and similarity relationships to find those paths that are more likely to include functionally related genes.

One Page Abstract:

We present a computational approach for finding genes that are functionally related but do not possess any noticeable sequence similarity. Our method, which we call SNAP (Similarity-Neighborhood APproach), reveals the conservation of gene order on bacterial chromosomes based on both cross-genome comparison and context information. The novel feature of this method is that it does not rely on detection of conserved colinear gene strings. Instead, we introduce the notion of a similarity-neighborhood graph (SN-graph), which is constructed from the chains of similarity and neighborhood relationships between orthologous genes in different genomes and adjacent genes in the same genome, respectively. An SN-cycle is defined as a closed path on the SN-graph and is postulated to preferentially join functionally related gene products that participate in the same biochemical or regulatory process. We demonstrate the substantial non-randomness and functional significance of SN-cycles derived from real genome data and estimate the prediction accuracy of SNAP in assigning broad function to uncharacterized proteins. The SNAP algorithm is implemented as a multithreaded server application and is accessible via the Web.
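
The SN-graph idea can be illustrated with a small sketch. The abstract does not say how SN-cycles are enumerated, so everything below is an assumption made for illustration: the function names `build_sn_graph` and `find_sn_cycles`, the `max_len` cutoff, the requirement that a cycle uses both edge types, and the requirement that gene identifiers are unique across genomes. It is not the SNAP implementation.

```python
from collections import defaultdict

def build_sn_graph(genomes, orthologs):
    """genomes: dict name -> ordered list of gene ids (unique across genomes).
    orthologs: iterable of (gene, gene) pairs from different genomes."""
    graph = defaultdict(set)
    kind = {}  # frozenset({a, b}) -> "N" (neighborhood) or "S" (similarity)

    def add_edge(a, b, label):
        graph[a].add(b)
        graph[b].add(a)
        kind[frozenset((a, b))] = label

    for genes in genomes.values():
        for left, right in zip(genes, genes[1:]):
            add_edge(left, right, "N")  # adjacent genes in the same genome
    for a, b in orthologs:
        add_edge(a, b, "S")  # orthologous genes in different genomes
    return graph, kind

def find_sn_cycles(graph, kind, max_len=6):
    """Enumerate simple cycles of 3..max_len genes that mix N and S edges."""
    cycles = []

    def extend(path):
        start, last = path[0], path[-1]
        for nxt in sorted(graph[last]):
            if nxt == start:
                # Close the cycle; keep only one of the two traversal directions.
                if len(path) >= 3 and path[1] < path[-1]:
                    cycles.append(tuple(path))
            elif nxt > start and nxt not in path and len(path) < max_len:
                extend(path + [nxt])

    for node in sorted(graph):
        extend([node])  # each cycle is reported from its smallest gene id

    def labels(cycle):
        return {kind[frozenset(p)] for p in zip(cycle, cycle[1:] + cycle[:1])}

    return [c for c in cycles if labels(c) == {"N", "S"}]

# Three genomes in which no gene pair is adjacent in more than one genome.
genomes = {"A": ["xA", "yA"], "B": ["yB", "zB"], "C": ["zC", "xC"]}
orthologs = [("xA", "xC"), ("yA", "yB"), ("zB", "zC")]
graph, kind = build_sn_graph(genomes, orthologs)
cycles = find_sn_cycles(graph, kind)
```

In this toy input the genes x, y and z end up on one closed path even though no colinear gene string is conserved between any two genomes, which is the situation the abstract describes.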


40. Sequencing and Comparison of Orthopoxviruses
Scott Sammons, Michael Frace, Miriam Laker, Melissa Olsen-Rasmussen, Roger Morey, Yu Li, Richard Kline, Joseph J. Esposito, Inger Damon, Robert Wohlhueter, National Center for Infectious Diseases, Centers for Disease Control & Prevention;
ssammons@cdc.gov
Short Abstract:

Six variola virus isolates were sequenced by using long-distance, high-fidelity PCR of overlapping amplicons as templates for fluorescence-based sequencing. PCR-based primer-walking sequencing reactions were separated by capillary electrophoresis. Output sequence trace files were edited and assembled using Phred/Phrap/Consed, and open reading frames of greater than 60 amino acids were analyzed.

One Page Abstract:

The genomes of six variola major strains were sequenced by using long-distance, high-fidelity PCR of overlapping amplicons as templates for fluorescence-based sequencing. Each genome is approximately 186 kb of double-stranded DNA with between 200 and 240 predicted open reading frames (ORFs) of greater than 60 amino acids. Each ORF sequence has been compared with the five other locally sequenced strains and with sequences of previously published orthopoxviruses Bangladesh-1975 (L22579), India-1967 (X69198), and vaccinia virus Copenhagen (M35027). The most highly conserved ORFs are located in the center portion of the genome, and the majority have known functions involving transcription, DNA replication and repair, protein processing, virion structure, and nucleotide metabolism.

To minimize the amount of poxvirus needed, 15 micrograms of purified genomic DNA was used as template for approximately 1800 primer-walking sequencing reactions. The reactions were set up using robotic assistance and subjected to thermocycling, and the reaction products were separated by capillary electrophoresis (Beckman Coulter CEQ 2000XL). Sequencing of variola strains Congo-1970, Somalia-1977, India-1964, Horn-1948, Nepal-1973, and Afghanistan-1970 has been completed.

Output sequence trace files were edited, evaluated for quality, and then assembled by using Phred/Phrap/Consed software until about a 10-fold redundancy of high-quality sequence data was attained. ORFs of greater than 60 amino acids were then compared with each other and those in the public databases. These ORFs were analyzed for the presence of known early, middle, and late promoter sequences. ORFs with no homologs in the other strains were further analyzed for protein motifs by using several tools and databases.
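
The ORF selection step can be sketched as a six-frame scan. This is a minimal illustration, assuming that an ORF runs from an ATG to the next in-frame stop codon and that length is counted in codons before the stop; the abstract does not state how ORFs were called, and the function name `find_orfs` and its coordinate convention are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_aa=60):
    """Return (strand, frame, start, end) for ATG..stop ORFs that encode
    more than min_aa residues. Coordinates are 0-based on the scanned
    strand, and end is the position of the stop codon (exclusive end)."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon since the last stop
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 > min_aa:
                        orfs.append((strand, frame, start, i))
                    start = None
    return orfs

# A 61-codon ORF (ATG plus 60 alanine codons) in frame 2 of the forward strand.
example = "CC" + "ATG" + "GCT" * 60 + "TAA" + "G"
print(find_orfs(example))
```

Each reported ORF could then be translated and compared against the other strains and the public databases, as described above.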


41. Integrating mouse and human comparative map data with sequence and annotation resources.
Ann-Marie Mallon, Joseph Weekes, Paul Denny, Mark Strivens, Steve Brown, Informatics Group, Mammalian Genetics Unit and UK Mouse Genome Centre, Medical Research Council;
a.mallon@har.mrc.ac.uk
Short Abstract:

Comparative maps have previously been generated between mouse and human. This data has been used to integrate the homologous genomic sequence between the two species and to allow integration of other resources such as sequence annotation/similarity or phenotype data, facilitating the use of data present in only one species.

One Page Abstract:

The progress of human and mouse genome sequencing programmes allows systematic cross-species comparison of the two genomes as a tool for gene and regulatory element identification [1]. This data will also be an important tool for exploiting the rapidly growing mouse mutant resource and moving from mutant phenotype to underlying gene. As the opportunities to perform comparative sequence analysis emerge, it is important to develop parameters for such analyses and to examine the outcomes of cross-species comparison for use in developing integrated data sets and software for such analysis.

As the sequence data increases in quantity and accuracy it is important to be able to extract the genomic sequence from homologous chromosomal regions. This information would then be beneficial in generating parameters for comparative sequence analysis and also as a foundation for building an integrated data source between the two species. This could then become the basis for querying various linked sources of information including sequence annotation and phenotype data.

To date, detailed comparative maps have been generated between these two species, utilising homologous genes as markers to highlight homologous chromosomal segments between the two genomes. We have utilised this comparative map data to integrate the homologous genomic sequence between the two organisms. Using this data we aim to refine the comparative map data by sequence alignment and also to integrate other information from databases such as MGD (Mouse Genome Database). This system will be utilised within the UK Mouse sequencing programme (http://mrcseq.har.mrc.ac.uk) to aid in the annotation of mouse genomic sequence.

1: Mallon A.M., et al. Comparative genome sequence analysis of the Bpa/Str region in mouse and Man. Genome Res. 2000 Jun;10(6):758-75.


42. De novo Identification of Repeat Families in the Genome
Zhirong Bao, Sean Eddy, HHMI, Dept of Genetics, Washington University;
bao@genetics.wustl.edu
Short Abstract:

We have developed a de novo approach for the identification and classification of repetitive elements from genomic sequences. We incorporated multiple alignment information to extend the usual approach of single linkage clustering of BLAST hits. The algorithm is now being used to analyze various genomes.

One Page Abstract:

Repetitive elements constitute a major part of eukaryotic genomes. We have developed and implemented a de novo approach for the identification and classification of these elements from genomic sequences, based on algorithmic extensions to the usual approach of single linkage clustering of BLAST hits. To overcome the tendency of single linkage clustering to group unrelated sequences, we incorporated multiple alignment information in defining the boundaries of individual copies of the elements and in constructing linkages. The algorithm is now being used to analyze various genomes.
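
For orientation, the baseline that the abstract extends, single linkage clustering of pairwise hits, can be sketched with a union-find structure. This shows only the baseline, not the authors' multiple-alignment extension; the function name `single_linkage` and the input format (pairs of sequence identifiers that share a BLAST hit) are assumptions.

```python
def single_linkage(hits):
    """Cluster sequence ids so that any two ids joined by a chain of
    pairwise hits fall into the same cluster (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(members) for members in clusters.values())

# r1 and r3 share no direct hit but are chained together through r2.
hits = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
print(single_linkage(hits))
```

The example shows the chaining behaviour mentioned above: r1 and r3 are grouped only because each hits r2, which is how unrelated sequences can end up in one family when a single bridging hit is spurious.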


43. Using the Arabidopsis genome to assess gene content in higher plants
Keith Allen, Paradigm Genetics;
kallen@paragen.com
Short Abstract:

To assess gene content in Arabidopsis, I have exhaustively compared EST unigene sets from eight plant species to the Arabidopsis genome. Comparison between unigene sets identified lineage-specific genes, and gene loss in Arabidopsis. About 15% of dicot genes and about 30% of monocot genes are missing from Arabidopsis.

One Page Abstract:

Arabidopsis has been a favorite model organism of plant biologists for more than two decades because of its small size, rapid growth cycle, small genome and tractable genetics. Using an organism as a model system assumes that genes in this organism will have similar functions to the equivalent genes in other species and, more fundamentally, that the gene complement of the model system is substantially similar to that of other, more economically important species. The completion of the Arabidopsis genome, coupled with a number of large EST projects in other species, provides an unprecedented opportunity to examine this question of gene content, and to clarify the position of Arabidopsis as a model system. Specifically, to what extent does Arabidopsis contain the same genetic complement as other plant species? This question can be addressed by using Arabidopsis as a reference genome, and then comparing unigene sets constructed for each test genome to the reference genome using a Smith-Waterman algorithm that translates both query and target DNA sequences in six frames and does the comparison in protein space. This computationally expensive step was performed on a Paracel Genematcher2 as part of a beta testing program. Unigene sets were constructed from publicly available EST projects for six angiosperms, one conifer, and a green alga. The test species (and number of input ESTs) were tomato (Solanaceae, 94,523 ESTs), soybean (Papilionoideae, 137,952 ESTs), Medicago (Papilionoideae, 115,717 ESTs), rice (Oryzeae, 63,790 ESTs), barley (Triticeae, 105,273 ESTs), maize (Andropogoneae, 85,287 ESTs), loblolly pine, a conifer (Pinaceae, 31,99 ESTs), and a green alga, Chlamydomonas (Volvocales, 55,874 ESTs). Unigene sets were constructed using the Paracel Clustering Package. The central conclusion of this work is that "novel" genes (i.e., genes not present in the reference genome) increase in number with increasing evolutionary distance.
Soybean, the closest relative of Arabidopsis in this set, had about 17% of its EST contigs fail to get a hit in Arabidopsis. In pine, the most distant higher plant species used, this number was over 30%. I will present a closer examination of the "novel" genes in each species and cross-species comparisons to, for example, identify monocot-specific genes. I will also present direct evidence of specific gene loss events in the lineage leading to Arabidopsis. Analysis of contigs corresponding to genes not found in Arabidopsis will also be shown.


44. Origin of Replication in Circular Bacterial Genomes and Plasmids
Peder Worning, Lars J. Jensen, Hans-Henrik Stærfeldt, Dave W. Ussery, CBS, Biocentrum, DTU;
peder@cbs.dtu.dk
Short Abstract:

By comparing the frequencies of all oligonucleotides up to a length of 8 bp on the leading and lagging strands, we find the origin of replication in circular genomes. We find the origin in nearly all the sequenced bacterial genomes, several bacterial plasmids, and a number of mitochondrial and chloroplast genomes.

One Page Abstract:

We present a method for finding the origin of replication in circular bacterial genomes. The method is based on differences in word frequencies between the leading and lagging strands, and the nucleotide sequence is the only input needed. We have analysed complete genome sequences from more than 50 different species, including both Bacteria and Archaea. Our method finds the correct location in all the genomes where the position of the origin is firmly established, and shows that the origin has been misplaced in several of the published genomes. We even find a probable origin position in genomes where it has not been predicted before. We are also able to find the position of the origin in several bacterial plasmids, plus some mitochondrial and chloroplast genomes.
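
The principle of scoring a candidate origin by strand-specific word frequencies can be sketched as follows. This is a simplified illustration and not the authors' method: it uses a single word length `k` rather than all lengths up to 8, assumes the terminus lies diametrically opposite the origin, and uses a plain sum of absolute count differences as the score. The names `strand_bias` and `find_replication_axis` are hypothetical. Because this score is symmetric, it locates the origin/terminus axis but cannot tell which end is the origin.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def strand_bias(genome, origin, k):
    """Word-frequency difference between leading and lagging strands,
    assuming replication starts at `origin` and ends half a genome away."""
    n = len(genome)
    rotated = genome[origin:] + genome[:origin]
    first, second = rotated[:n // 2], rotated[n // 2:]
    # Leading strand: forward strand of the first half plus the
    # reverse strand of the second half.
    lead = kmer_counts(first, k) + kmer_counts(revcomp(second), k)
    # The lagging strand is the reverse complement of the leading strand.
    lag = Counter()
    for word, count in lead.items():
        lag[revcomp(word)] += count
    total = sum(lead.values())
    return sum(abs(lead[w] - lag[w]) for w in set(lead) | set(lag)) / total

def find_replication_axis(genome, k=4, step=1000):
    """Scan candidate positions and return the one with the largest bias."""
    genome = genome.upper()
    return max(range(0, len(genome), step),
               key=lambda pos: strand_bias(genome, pos, k))

# Synthetic circular genome with a G-rich leading strand; the true origin
# sits at index 300 and the terminus at index 1300.
half = "GGTGAGGCAG" * 100
linear = half + revcomp(half)
genome = linear[-300:] + linear[:-300]
axis = find_replication_axis(genome, k=2, step=100)
```

On real genomes the scan would need a finer step and some additional criterion to separate origin from terminus; both are left out of this sketch.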