Alignment Techniques

45. Combining frequency and positional information to predict transcription factor binding sites
46. Identification of distantly related sequences through the use of structural models of protein evolution
47. The Paracel Filtering Package (PFP): A Novel Approach to Filtering and Masking of DNA and Protein Sequences
48. EST Clustering with Self-Organizing Map (SOM)
49. Analysis of Information Content for Biological Sequences
50. Clustering proteins by fold patterns
51. Confidence Measures for Fold Recognition
52. Protein Structure Prediction By Threading With Experimental Constraints
53. Exonerate - a tool for rapid large scale comparison of cDNA and genomic sequences.
54. Prospector: Very Large Searches with Distributed Blast and Smith-Waterman
55. Accuracy Comparisons of Parallel Implementations of Smith-Waterman, BLAST and HMM Methods
56. Target-BLAST: an algorithm to find genes unique to pathogens causing similar human diseases
57. Software to predict microArray hybridisation: ACROSS
58. Discovering Dyad Signals
59. Efficient all-against-all computation of melting temperatures for DNA chip design
60. Wavelet techniques for detecting large scale similarities between long DNA sequences
61. TRAP: Tandem Repeat Assembly Program, capable of correctly assembling nearly identical repeats
62. LASSAP: A powerful and flexible tool for (large-scale) sequence comparisons
63. Mini-greedy algorithm for multiple RNA structural alignments
64. Neural network and genetic algorithm identification of coupling specificity and functional residues in G protein-coupled receptors
65. Determination of Classificatory Motifs for the Identification of Organisms Using DNA Chips
66. An algorithm for detecting similar reaction patterns between metabolic pathways
67. SPP-1: A Program for the Construction of a Promoter Model From a Set of Homologous Sequences
68. A new score function for distantly related protein sequence comparison.
69. Expression profiling on DNA-microarrays: In silico clone selection for DNA chips
70. Data Mining: Efficiency of Using Sequence Databases for Polymorphism Discovery
71. Automated modeling of protein structures
72. Algorithmic improvements to in silico transcript reconstruction implemented in the Paracel Clustering Package (PCP)



45. Combining frequency and positional information to predict transcription factor binding sites
Szymon M. Kielbasa, Jan O. Korbel, Dieter Beule, Johannes Schuchhardt, Hanspeter Herzel, Innovationskolleg Theoretische Biologie;
s.kielbasa@itb.biologie.hu-berlin.de
Short Abstract:

Both frequency and positional information are analysed to predict transcription factor binding sites in upstream regions of coregulated genes. Evaluations for several yeast families, as well as new results for a set of genes downregulated via H-Ras activation, are presented.

One Page Abstract:

Even though a number of genome projects have been completed at the sequence level, still only a small proportion of DNA regulatory elements have been identified. Growing amounts of gene expression data make it possible to find coregulated genes by clustering methods. By analysing the promoter regions of those genes, rather weak signals of transcription factor binding sites may be detected.

We present the algorithm ITB ("Integrated Tool for Box finding"), which combines frequency and positional information to predict transcription factor binding sites in upstream regions of coregulated genes. Motifs of a specified length are exhaustively scored by estimating their frequencies with respect to a statistical background model and by investigating their tendency to form clusters. The alphabet used to assemble motifs may contain symbols matching multiple bases.
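The frequency half of such a score can be sketched as a log-odds count against an i.i.d. background; this is a hypothetical simplification for illustration, not ITB itself, and it omits the positional clustering term and the degenerate alphabet:

```python
from collections import Counter
from math import log

def score_motifs(upstream_seqs, k=6, background=None):
    """Toy frequency score for k-mers against an i.i.d. background
    (hypothetical; ITB's positional term is not modelled here)."""
    background = background or {b: 0.25 for b in "ACGT"}
    counts = Counter()
    for seq in upstream_seqs:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    scores = {}
    for word, n in counts.items():
        # log-odds of the observed frequency vs. the background model,
        # weighted by the number of occurrences
        p_bg = 1.0
        for b in word:
            p_bg *= background[b]
        scores[word] = n * log((n / total) / p_bg)
    return scores
```

A motif that recurs across the upstream set scores higher than one seen once, since both the count and the log-odds term grow with over-representation.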

ITB detects consensus sequences of experimentally verified transcription factor binding sites of the yeast Saccharomyces cerevisiae. Moreover, a number of new binding site candidates with significant scores are predicted. Besides being applied to the yeast upstream regions, the program has been run on human promoter sequences. From this investigation a new candidate motif, "CGARCG", has been proposed as a transcription factor binding site for a set of genes downregulated via H-Ras activation.


46. Identification of distantly related sequences through the use of structural models of protein evolution
Lisa Davies, Nick Goldman, Department of Zoology, University of Cambridge;
L.Davies@zoo.cam.ac.uk
Short Abstract:

A novel database search program is assessed which identifies distantly related homologs through the inclusion of information about the effects of protein structure on sequence evolution. Solvent accessibility information only confers a limited advantage, but protein secondary structure information markedly improves the identification of homologous sequences with low sequence identity.

One Page Abstract:

Database search programs in common use (e.g. BLAST, FASTA) identify sequence homology through the use of pairwise alignment techniques. These programs are good at detecting closely related sequences but have problems accurately detecting homologous sequences with low sequence identity. A new approach, described here, tries to improve the detection of distantly related homologs by rejecting the assumption that all sites in a protein behave in an identical manner. This is done without the use of profile techniques, which require the preliminary collection of a set of homologs. Programs such as BLAST and FASTA use general properties of a protein to generate alignment scores, which simplifies calculations but may also result in a decrease in accuracy. In reality, amino acid replacement probabilities and rates, amino acid frequencies and gap probabilities all vary according to where a residue lies in a protein structure. Typical patterns of these structure-specific variations in evolutionary dynamics can be incorporated into a database search program through the use of Hidden Markov Models (HMMs), potentially improving the detection of more distantly related sequences.
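As a toy illustration of letting evolutionary dynamics vary along the structure, one can score aligned residue pairs with a state-specific substitution matrix wherever a structural state is known; the function and matrices below are hypothetical stand-ins, not the authors' HMM:

```python
def score_alignment(pairs, states, matrices, general):
    """Score aligned residue pairs with a structure-state-specific
    substitution matrix when one exists (e.g. 'H' for helix), falling
    back to a general matrix otherwise. A toy version of letting
    replacement dynamics depend on structural context."""
    total = 0.0
    for (a, b), state in zip(pairs, states):
        m = matrices.get(state, general)
        # matrices are symmetric; look up either orientation
        total += m.get((a, b), m.get((b, a), 0.0))
    return total
```

With distinct matrices per state, the same residue pair can contribute different scores in a helix than in a loop, which is exactly the distinction a single general-purpose matrix cannot make.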

In this study, the utility of including structure-specific evolutionary information in a database search program has been assessed. Initial work has concentrated on the generation of database search programs that use either solvent accessibility distinctions or protein secondary structure distinctions. The improvement gained by adding the ‘extra’ information has then been evaluated through the use of both simulated sequences, which exactly fit the models, and real sequences from the SCOP database. The success rate of each of these programs has been compared to a simplified model that contains the general properties of proteins but no structural distinctions.

We have discovered that adding accessibility information gives a limited advantage only when sequences are distantly related. Using secondary structure distinctions, however, provides a greater improvement over a model containing no structural information for all but the case when the sequences are closely related. There is also an advantage over more traditional database search programs such as BLAST, FASTA and Smith-Waterman. Incorporating structure-specific evolutionary models into database search programs can therefore potentially lead to an improvement in the identification of distant homologs.


47. The Paracel Filtering Package (PFP): A Novel Approach to Filtering and Masking of DNA and Protein Sequences
Cecilie Boysen, Charles P. Smith, Stephanie Pao, Cassi Paul, Joseph A. Borkowski, Paracel, Inc.;
boysen@paracel.com
Short Abstract:

We describe a comprehensive, flexible suite of tools that utilize several algorithms to identify repeats and contaminants in biomolecular sequences. These methods include using repeat profile models and non-destructive XML-based annotation. Together, these tools enable otherwise unavailable masking options. We demonstrate PFP's speed and effectiveness on a variety of datasets.

One Page Abstract:

Filtering and masking is a required, but often neglected, first step for many bioinformatics analyses, such as EST clustering and assembly, database mining, and DNA chip design. Properly done, this filtering and masking can produce a dramatic improvement in the quality of the final results. Unfortunately, many current masking techniques are an ad hoc assembly of various single-purpose comparison tools. Using such an assembly of tools is often cumbersome, requiring many individual steps, including simple conversion and bookkeeping operations. Managing filtering in this way can often lead to incomplete or erroneous results.

To address these known shortcomings we have developed the Paracel Filtering Package (PFP), a comprehensive, flexible suite of tools for filtering and masking. PFP takes input in most standard formats and identifies sequence regions using a variety of user-selectable algorithms. These include dust and pseg for identification of low-complexity regions in DNA and protein sequences, and Haste (hash accelerated search tool) and full Smith-Waterman for comparison to sets of repeat, vector, or contaminant sequences. PFP also identifies low-quality sequences based on either quality values or ambiguous base calls. A variety of actions can then be performed on the identified sequence regions: masking, removal of the entire sequence, excision, or annotation. The annotation action is unique to the PFP suite in that it produces an XML-based annotation of the identified sequence regions, e.g. low-complexity and genome-wide repeat regions, without replacing the underlying sequence characters. This allows for non-destructive masking when PFP is used as part of a multi-step clustering and assembly process where masks are applied during one stage (such as clustering) and removed during subsequent stages (such as final assembly). This allows production of final assemblies and consensus sequences free of unwanted masking characters.
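Non-destructive annotation of masked regions can be sketched as follows; the XML schema and function names here are hypothetical illustrations, not PFP's actual format:

```python
from xml.etree.ElementTree import Element, SubElement, fromstring, tostring

def annotate_regions(name, seq, regions):
    """Record masked intervals as XML annotations while leaving the
    sequence characters untouched (hypothetical schema, not PFP's)."""
    root = Element("sequence", name=name)
    SubElement(root, "residues").text = seq
    for start, end, kind in regions:
        SubElement(root, "mask", start=str(start), end=str(end), type=kind)
    return tostring(root, encoding="unicode")

def apply_masks(xml_text, mask_char="N"):
    """Render a hard-masked copy on demand; the annotated original,
    with its unmodified residues, is kept for later stages."""
    root = fromstring(xml_text)
    seq = list(root.find("residues").text)
    for m in root.findall("mask"):
        for i in range(int(m.get("start")), int(m.get("end"))):
            seq[i] = mask_char
    return "".join(seq)
```

The point of the design is that the mask can be applied for clustering and simply ignored at final assembly, so consensus sequences never contain masking characters.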

Additionally, PFP, if used with Paracel's GeneMatcher, can use repeat profiles for highly sensitive annotation of repeat regions. These profiles are Gribskov-type DNA profiles created from multiple sequence alignments of identified repeat regions. Here we investigate the use of such repeat profile searching and compare the results with the commonly applied algorithms for repeat finding.

PFP can be customized for the specific task at hand, that is, different settings can be applied depending on purpose and kind of sequence and species. We will describe a general method for optimization of masking parameters that can be used on any dataset.

These tools present a comprehensive set of filtering and masking options not available in any other package. Still, executed by a single command, PFP performs this often convoluted process of cleaning up or annotating sequences with great speed and effectiveness.


48. EST Clustering with Self-Organizing Map (SOM)
Ari-Matti Saren, Plant Genomics group, Institute of Biotechnology, University of Helsinki;
Timo Knuuttila, Mikko Kolehmainen, Visipoint Oy;
ari-matti.saren@helsinki.fi
Short Abstract:

The existing EST clustering algorithms perform adequately on smaller datasets, but with the rapid increase in EST data, scalability is becoming a problem. We present a method of using short exact sequence matches and Self-Organizing Map (SOM) to rapidly classify the EST data into subsets that can be clustered individually.

One Page Abstract:

The existing EST clustering algorithms perform adequately on small and moderate datasets, but with the rapid accumulation of EST sequence data, scalability is increasingly becoming a problem. Since the scalability issue is inherent in this sort of problem where everything has to be compared to everything, we try to approach the problem from the side of keeping the datasets reasonably sized.

A Self-Organizing Map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large datasets. We present a method of dividing ESTs into subsets of potentially clustering sequences by using a SOM to classify the sequences according to the number of short exact sequence matches found. These subsets can then be individually clustered and aligned using existing tools.
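The idea can be sketched in two steps: turn each EST into a vector of short-word counts, then train a SOM on those vectors. The sketch below uses a minimal 1-D Kohonen update as a stand-in for the TS-SOM used by Visual Data, and the feature encoding is a generic k-mer count, not the project's exact match statistic:

```python
import random
from itertools import product

def kmer_vector(seq, k=3):
    """Count vector over all 4**k DNA words; a generic stand-in for
    the short-exact-match features described in the abstract."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    idx = {w: i for i, w in enumerate(words)}
    v = [0.0] * len(words)
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in idx:
            v[idx[w]] += 1.0
    return v

def train_som(vectors, n_units=4, epochs=20, lr=0.5, seed=0):
    """Minimal 1-D Kohonen SOM: move the best-matching unit and its
    immediate neighbours toward each input; returns a unit index per
    input vector, i.e. the subset assignment."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    def bmu(v):
        return min(range(n_units),
                   key=lambda u: sum((a - b) ** 2 for a, b in zip(units[u], v)))
    for _ in range(epochs):
        for v in vectors:
            b = bmu(v)
            for u in range(n_units):
                h = 1.0 if u == b else (0.5 if abs(u - b) == 1 else 0.0)
                units[u] = [w + lr * h * (x - w) for w, x in zip(units[u], v)]
    return [bmu(v) for v in vectors]
```

ESTs mapped to the same unit form a subset small enough to hand to a conventional clustering and alignment tool.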

We are using the Visual Data SOM software package from Visipoint Oy (http://www.visipoint.fi/index.html) and custom software components developed in the Barley EST sequencing project at the Institute of Biotechnology Plant Genomics group (http://www.biocenter.helsinki.fi/bi/bare-1_html/est.htm). Visual Data uses a tree-structured SOM (TS-SOM) algorithm [1], a variation of the classical Kohonen SOM [2].

[1] Koikkalainen, P. (1994) in: Proceedings of the 11th European Conference on Artificial Intelligence (Cohn, A., Ed.), pp. 211-215, Wiley and Sons, New York.
[2] Kohonen, T. (1995) Self-Organizing Maps. Springer, Berlin.


49. Analysis of Information Content for Biological Sequences
Jian Zhang, EURANDOM, Eindhoven, The Netherlands;
jzhang@euridice.tue.nl
Short Abstract:

We present an exploratory approach to parsing and analyzing a set of multiple DNA and protein sequences. It is based on an analysis-of-variance (ANOVA) type decomposition of the information content. Our method is applied to parsing and clustering some protein sequences.

One Page Abstract:

Decomposing a biological sequence into modular domains is a basic prerequisite for identifying functional units in biological molecules. Several commonly used segmentation procedures consist of two steps: first, collect and align a set of sequences homologous to the target sequence; then parse this multiple alignment into several blocks and identify the conserved ones using a semi-automatic method, which combines manual analysis and expert knowledge. In this paper, we present an exploratory approach to parsing and analyzing such multiple alignments. It is based on an analysis-of-variance (ANOVA) type decomposition of the information content, a variant on the concept reviewed by Stormo and Fields (1998). Unlike the traditional changepoint method, our approach takes into account not only the composition biases but also the overdispersion effects among the blocks. Our method is tested on the families of ribosomal proteins with promising performance. Finally, we extend our approach to the problem of clustering a set of objects labeled by probability vectors. As an application, our approach is applied to clustering protein sequences via their pairwise alignment scores.
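The per-column information content underlying such a decomposition can be sketched as relative entropy against a background distribution; the thresholding rule below is a crude stand-in for illustration, not the ANOVA-type method itself:

```python
from collections import Counter
from math import log2

def column_information(column, background=None):
    """Shannon information content of one alignment column relative to
    a background distribution (the quantity reviewed by Stormo and
    Fields); here a uniform amino-acid background by default."""
    background = background or {a: 1 / 20 for a in "ACDEFGHIKLMNPQRSTVWY"}
    counts = Counter(column)
    n = sum(counts.values())
    info = 0.0
    for a, c in counts.items():
        p = c / n
        info += p * log2(p / background[a])  # relative entropy term
    return info

def parse_alignment(rows, threshold=4.0):
    """Label each column 'conserved' or 'variable' by its information
    content; a toy proxy for parsing an alignment into blocks."""
    return ["conserved" if column_information(col) >= threshold else "variable"
            for col in zip(*rows)]
```

A fully conserved column of 20-letter alphabet sequences carries log2(20), about 4.3 bits, so a threshold near 4 separates invariant columns from moderately variable ones in this toy setting.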


50. Clustering proteins by fold patterns
David Gilbert, Department of Computing, City University, London, UK;
Juris Viksna, Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia;
Aik Choon Tan, Department of Computing, City University, London, UK;
Lorenz Wernisch, Department of Crystallography, Birkbeck College, University of London, London, UK;
drg@soi.city.ac.uk
Short Abstract:

We describe a technique to automatically divide a set of proteins into clusters each associated with a common topological pattern, using TOPS descriptions as formal models of protein folds. We are applying the technique to generate characteristic patterns associated with EC functional groups.

One Page Abstract:

We have developed a technique to automatically divide a set of proteins into clusters each associated with a common topological pattern. This technique can be applied to families of protein domains with structurally diverse members where there is no clear structure-based tree hierarchy. Examples are the families determined by the Enzyme Classification scheme; EC numbers are assigned at the chain level in the PDB and thus all the domains comprising a chain will be assigned the same number. The motivation is that a pattern associated with an undivided diverse group may be very small and of weak descriptive and hence classificatory power. Our ultimate aim is to discover patterns with a high classificatory power for diverse protein families.

Our method takes as input TOPS descriptions of protein structures (Gilbert et al, 1999) and uses a development of the pattern discovery method for topological patterns described in (Gilbert et al, 2001) which employs repeated pattern extension and matching. In our new method, we permit patterns to be discovered for less than 100% of the examples in a learning set S. Effectively this means that having discovered a maximal pattern which describes all the members of S, we attempt to continue to extend that pattern to a larger pattern P which only matches a subset T of S. We then remove the matched set T from S to give S' and repeat the procedure over S' until the learning set becomes empty. The result of our method is a cover of the initial set S of structures, comprising a partition of S into subsets each associated with a descriptive pattern.

The critical issue is deciding when to stop extending the pattern; we stop when the `goodness' of the pattern reaches a certain threshold (PV). Otherwise, we may end up generating a pattern which is associated with just one domain, which is likely to be too specific and of no use in classifying new structures. This `goodness' is computed as a function of the compression that the pattern achieves over all the members of the matching set T and its coverage, i.e. the size of T compared to S.
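The cover procedure can be sketched generically. The toy below uses shared string prefixes in place of TOPS topological patterns, with a goodness score combining pattern length (compression) and coverage; the names and the PV scale are hypothetical:

```python
def goodness(pattern, matched, total):
    """Toy `goodness': compression (pattern length) times coverage
    (fraction of the remaining set matched); stands in for PV."""
    return len(pattern) * len(matched) / total

def cluster_by_prefix(items, pv=1.0):
    """Cover loop: grow a pattern on a shrinking match set until the
    goodness threshold is reached, emit (pattern, matched subset),
    then repeat on what remains until the set is empty."""
    remaining = list(items)
    clusters = []
    while remaining:
        pattern, matched = "", list(remaining)
        while matched and goodness(pattern, matched, len(remaining)) < pv:
            nxt = {s[len(pattern)] for s in matched if len(s) > len(pattern)}
            if not nxt:
                break
            # extend by the symbol that keeps the largest match set
            best = max(sorted(nxt),
                       key=lambda c: sum(s.startswith(pattern + c)
                                         for s in matched))
            pattern += best
            matched = [s for s in matched if s.startswith(pattern)]
        clusters.append((pattern, matched))
        remaining = [s for s in remaining if s not in matched]
    return clusters
```

Stopping at the threshold rather than extending maximally is what prevents the procedure from degenerating into one singleton cluster per structure.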

We have applied our algorithm to a representative subset of the Protein Data Bank, derived from non-identical representatives (N-reps) of release 2.0 of the CATH database (www.biochem.ucl.ac.uk/bsm/cath). We selected those domains to which a function in the EC classification has been assigned, and further restricted our set to those domains with some beta sheet content (since our pattern discovery method does not work well for all-alpha domains). The table of associations between EC numbers and domains was supplied by Roman Laskowski of the BSM group at University College. We performed our pattern discovery and clustering over 141 families of proteins (each defined by every domain sharing the same 4 EC numbers) and containing at least 4 members. Some results for this set, with graphical illustrations of the patterns, can be found at http://www.soi.city.ac.uk/~drg/tops/EC_alphabeta/

References

Gilbert D, Westhead DR, Nagano N, Thornton JM. Motif-based searching in tops protein topology databases. Bioinformatics 1999;15:317-326.

Gilbert D, Westhead DR, Viksna, J and Thornton J M, Topology-based protein structure comparison using a pattern discovery technique, Journal of Computers and Chemistry, 2001, in press.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) CATH - A Hierarchic Classification of Protein Domain Structures. Structure, Vol 5, No 8, p.1093-1108.


51. Confidence Measures for Fold Recognition
Ingolf Sommer, Niklas von Öhsen, Alexander Zien, Ralf Zimmer, Thomas Lengauer, GMD/SCAI;
ingolf.sommer@gmd.de
Short Abstract:

We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. There are two associated problems, first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the identified candidates in order to select the best one.

One Page Abstract:

We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. There are two associated problems, first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the identified candidates in order to select the best one.

In particular, p-values have been proposed for assigning confidence to local sequence alignment searching procedures, such as BLAST, with great success due to extensive theoretical backing. We propose empirical approximations to p-values for searching procedures involving global alignments, sequence profiles, and structure-based scores (threading).

We review different methods for detecting remotely homologous protein folds (sequence alignment and threading, with and without frequency profiles, global and local), analyze their fold-recognition performance and establish confidence measures that tell how much trust to put into the prediction made with a certain method.

The analysis is performed on a representative subset of the PDB with at most 40% sequence homology, with proteins classified according to the SCOP classification.

For fold recognition, i.e. the attempt to find a template fold with a known structure for a target protein sequence whose structure we are seeking, we find that methods using frequency profiles generally perform better than methods using plain sequences, and that threading methods perform better than sequence alignment methods. Thus the method of choice for detecting remote homologies is threading using frequency profiles.

In order to assess the quality of the predictions made with these methods, we establish several confidence measures, including raw scores, z-scores, raw-score gaps, z-score gaps, and different methods of p-value estimation (ranging from computationally cheap to more elaborate), and compare them.

The confidence-measure methods are compared using several error measures. For local alignment methods, where the distribution of scores is theoretically known, we find that p-value methods that make use of this knowledge work best, although computationally cheaper methods such as score gaps perform competitively. For global methods, where no theory is available on the score distribution, score-gap methods perform best.
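Two of the cheaper confidence measures are easy to state concretely; the following is a minimal sketch of the idea, not the evaluated implementation:

```python
from statistics import mean, stdev

def z_score(scores, best=None):
    """Z-score of the top hit against the empirical distribution of all
    hit scores for this query."""
    best = max(scores) if best is None else best
    return (best - mean(scores)) / stdev(scores)

def score_gap(scores):
    """Raw-score gap: difference between the best and second-best hit.
    Computationally cheap; per the abstract, competitive for local
    methods and the best choice for global ones."""
    top = sorted(scores, reverse=True)
    return top[0] - top[1]
```

Both measures ask the same question in different ways: does the top-ranked template stand out from the bulk of the candidate list, or is it merely the largest of many similar scores?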

* S. F. Altschul et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Research, 1997

* M. Gribskov et al, "Profile analysis: Detection of distantly related proteins", PNAS, 1987

* N. Alexandrov et al, "Fast Protein Fold Recognition via Sequence to Structure Alignment and Contact Capacity Potentials", Proc. Pacific Symposium on Biocomputing, 1996

* S. F. Altschul et al. "Basic local alignment search tool", JMB, 1990


52. Protein Structure Prediction By Threading With Experimental Constraints
Mario Albrecht, Institute for Algorithms and Scientific Computing (SCAI), German National Research Center for Information Technology (GMD);
Ralf Zimmer, Thomas Lengauer, Institute for Algorithms and Scientific Computing (SCAI), German National Research Center for Information Technology;
mario.albrecht@gmd.de
Short Abstract:

We present an extended and modified version RDP* of the Recursive Dynamic Programming method to predict the structure of proteins. The algorithm is now capable of incorporating additional structural constraints, for instance, atomic distances obtained by mass spectrometry or NMR spectroscopy experiments, into the alignment computation of our threading approach.

One Page Abstract:

The threading approach predicts the protein structure by aligning representative protein structures with an amino-acid sequence called the target sequence whose three-dimensional backbone structure is unknown. The sequence-structure alignments obtained are then ranked by score. The best-scoring alignment should identify the template structure that is most compatible with the target sequence and thus afford a meaningful structural model.

However, the problem of developing an accurate scoring function is still unsolved, particularly for distantly related target and template folds. In particular, making the scoring scheme reflect diverse biological constraints seems to be a difficult task. Thus, threading methods based solely on sequence information often fail.

To remedy the inherent shortcomings of the scoring function, it becomes necessary to incorporate more biological knowledge on the target protein, which may be obtained from experimental data by mass spectrometry or NMR spectroscopy. These additional constraints such as atomic distances guide the threading process in order to improve the accuracy of fold recognition. Experimental results that taken alone would give insufficient data for the complete structure determination may already yield enough constraints to support the threading procedure considerably.

Our recursive dynamic programming method searches for structurally correct target-template pairs in suitable sets of alternative near-optimal solutions to the alignment problem, which transcends the usual exact optimization of one biologically incomplete scoring function employed in other threading approaches. The method can incorporate biological constraints directly into the alignment computation by means of different filter algorithms. This is more efficient than weeding out wrong models from the list of already generated complete solutions. In this way, the method can produce biologically more meaningful models that adhere to the structural constraints that are known about the target. This approach improves the fold recognition rate as well as the alignment quality.
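A constraint filter of this kind can be sketched as follows, assuming each candidate alignment implies a set of inter-residue distances in the resulting model; the data layout, field names, and tolerance are hypothetical, not RDP*'s internals:

```python
def satisfies_constraints(model_distances, constraints, tol=2.0):
    """Check a candidate model against experimental distance upper
    bounds (e.g. cross-link or NOE-derived); pairs absent from the
    model are treated as violations."""
    return all(model_distances.get(pair, float("inf")) <= limit + tol
               for pair, limit in constraints.items())

def filter_alignments(candidates, constraints):
    """Keep near-optimal alignments whose implied models satisfy the
    experimental constraints, then rank the survivors by score."""
    kept = [c for c in candidates
            if satisfies_constraints(c["distances"], constraints)]
    return sorted(kept, key=lambda c: c["score"], reverse=True)
```

Filtering during the search, as the abstract describes, applies this kind of test to partial solutions rather than to a finished candidate list, which is where the efficiency gain comes from.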

Keywords: protein structure determination, fold recognition, protein threading, experimental constraints, mass spectrometry, NMR, NOE

Selected References:

P. M. Bowers, C. E. M. Strauss, D. Baker: De novo protein structure determination using sparse NMR data. Journal of Biomolecular NMR, 18:311-318, 2000.

R. Thiele, R. Zimmer, T. Lengauer: Protein Threading by Recursive Dynamic Programming, Journal of Molecular Biology, 290(3):757-779, 1999.

Y. Xu, D. Xu, O. H. Crawford, J. R. Einstein: A computational method for NMR-constrained protein threading, Journal of Computational Biology, 7(3/4):449-467, 2000.

M. M. Young, N. Tang, J. C. Hempel, C. M. Oshiro, E. W. Taylor, I. D. Kuntz, B. W. Gibson, G. Dollinger: High throughput protein fold identification by using experimental constraints derived from intramolecular cross-links and mass spectrometry. Proceedings of the National Academy of Sciences USA, 97(11):5802-5806, 2000.


53. Exonerate - a tool for rapid large scale comparison of cDNA and genomic sequences.
Guy St.C. Slater, Ewan Birney, Ensembl / EMBL-EBI;
guy@ebi.ac.uk
Short Abstract:

Exonerate is a tool for rapid large scale comparison of cDNA with genomic DNA. The algorithm and its implementation are described in the context of related methods. Examples of application of the algorithm for large scale analyses within the Ensembl group are given.

For more information, see: http://www.ebi.ac.uk/~guy/exonerate/

One Page Abstract:

Exonerate is a tool for rapid large scale comparison of cDNA with genomic DNA.

HSPs are seeded using an FSM built from the word neighbourhoods of multiple ESTs. A bounded sparse dynamic programming algorithm is used to join HSPs to form gapped alignments. Alignments are generated between candidate HSP pairs by dynamic programming algorithms which integrate an affine gap model and PSSM-based splice site prediction for intron modelling.
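Joining HSPs by dynamic programming can be sketched as chaining compatible HSPs for maximum total score; the quadratic version below is a stand-in that ignores the bounding and sparsity tricks that make Exonerate fast:

```python
def chain_hsps(hsps):
    """Join HSPs into the highest-scoring colinear chain by DP over
    HSPs sorted by position. Each HSP is a tuple
    (query_start, target_start, length, score). A dense O(n^2)
    sketch of the idea, not Exonerate's bounded sparse DP."""
    hsps = sorted(hsps)
    best = [h[3] for h in hsps]          # best chain score ending at each HSP
    prev = [-1] * len(hsps)              # back-pointers for traceback
    for j, (qj, tj, lj, sj) in enumerate(hsps):
        for i, (qi, ti, li, _si) in enumerate(hsps[:j]):
            # i may precede j only if it ends before j starts on both axes
            if qi + li <= qj and ti + li <= tj and best[i] + sj > best[j]:
                best[j] = best[i] + sj
                prev[j] = i
    # trace back the highest-scoring chain
    j = max(range(len(hsps)), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(hsps[j])
        j = prev[j]
    return chain[::-1], max(best)
```

The chained HSPs then delimit the regions handed to the full affine-gap, splice-aware alignment, which is what keeps the expensive DP confined to promising areas.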

This approach provides a good approximation of the underlying model, while remaining very fast by restricting the dynamic programming to regions likely to contain significant alignments. Examples of application of the algorithm for large scale comparison of EST data sets with the entire human genome are given.

Exonerate is implemented in C, using the glib library, and is available under the terms of the LGPL.

For more information, see: http://www.ebi.ac.uk/~guy/exonerate/



54. Prospector: Very Large Searches with Distributed Blast and Smith-Waterman
Douglas Blair, Dustin Lucien, Hadon Nash, Dale Newfield, John Grefenstette, Parabon Labs, Parabon Computation;
doug@parabon.com
Short Abstract:

We have created a novel implementation of BLAST and Smith-Waterman for the Parabon Frontier distributed computing platform. We present design specifics, implementation details, performance results, and sensitivity comparison for very large database searches using the Prospector versions of BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX, and the analogous versions of Smith-Waterman.

One Page Abstract:

Advances in sequencing technology have created an avalanche of sequence data in the public databases that continues at an astonishing rate. In the year 2000, for example, the number of nucleotides in GenBank more than doubled from 4.6 billion bases to 11.1 billion bases. Even with the completion of the human genome, data from other sequencing projects and ESTs being generated from various organisms and cell types ensure that the floodgates are not going to close anytime soon. Given that large amounts of new sequence data will continue to become available for the foreseeable future, the need to compare new data to existing data will continue to make high-performance solutions for large-scale sequence analysis a critical requirement for many projects. High-performance methods for performing large searches typically involve using specialized hardware or clusters of general-purpose machines. While providing a high level of performance, such systems are costly in terms of both hardware and support infrastructure.

Distributed computing utilizing the idle cycles of existing workstations represents a software alternative to traditional high-performance computing platforms. In the Parabon Frontier distributed computing model, large jobs are decomposed into hundreds or thousands of tasks and distributed to idle machines, either over the Internet or within an organization's intranet. We have created an application called Prospector that implements both BLAST and Smith-Waterman algorithms for performing large-scale sequence comparison on the Frontier distributed computing platform. As required by the Frontier model, Prospector is written entirely in Java, providing a maximum of safety for the machines providing idle cycles.
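The decomposition into independent work units can be sketched as a simple grid of query batches against database slices; the function and chunking scheme are hypothetical, and the real Frontier task model is more involved:

```python
def make_tasks(queries, database, chunk=1000):
    """Decompose a large all-against-all search into independent tasks
    (query batch x database slice) that can each run on an idle
    machine; results are merged afterwards."""
    tasks = []
    for qi in range(0, len(queries), chunk):
        for di in range(0, len(database), chunk):
            tasks.append((queries[qi:qi + chunk], database[di:di + chunk]))
    return tasks
```

Because each (batch, slice) pair is scored independently, the search scales with the number of available machines, and a lost task can simply be reissued.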

In this poster, we present design specifics, implementation details, and performance results for very large database-to-database searches using Prospector. Results showing the effective scalability of Prospector to searches with thousands of machines are given. Experiments are described using Prospector versions of BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX, as well as the analogous versions of Smith-Waterman.


55. Accuracy Comparisons of Parallel Implementations of Smith-Waterman, BLAST and HMM Methods
James Candlin, Stephanie Pao, Paracel, Inc.;
candlin@paracel.com
Short Abstract:

Using known evolutionary relationship data from SCOP, we compare the accuracy of Paracel's GeneMatcher™ and BlastMachine™ implementations of sequence search algorithms to the originals. We demonstrate that the Paracel implementations are biologically equivalent, but much faster, allowing routine use of even the most rigorous algorithms at a genomic scale.

One Page Abstract:

Sequence database searching is a workhorse of bioinformatic analysis, and is embodied in a variety of useful methods such as Smith-Waterman and related dynamic programming methods, BLAST, and Hidden Markov Models. Unfortunately, many of these algorithms are slow, especially at a genomic scale. With Paracel's GeneMatcher™ and BlastMachine™ systems, we have addressed this challenge by developing new hardware and software implementations of these methods that are parallelized and can be used routinely, even on genomic scale projects.

It is crucial that our new implementations retain the general behavior of the original implementations of the algorithms and that their biological accuracy is maintained. An approach to assessing this, at least for detecting protein family relationships, is to search an annotated database with a broad set of protein sequences and measure the ability of such methods to find only the evolutionarily related sequences. Effective resources for this are the SCOP (Structural Classification of Proteins) database, which reliably identifies superfamily membership based on structural analysis, and the associated PDBD40 database of domains that are no more than 40% identical to one another, which may be used, as in Brenner et al. (1), to test the ability of algorithms to find related sequences.

Using the same general approach, we compare the Paracel implementations of BLAST, Smith-Waterman and HMM to the original software methods.

The testing set consisted of all sequences in the SCOP PDBD40 database. The comparison test was an all-against-all comparison of all described domain sequences against a database of the same sequences using all the pairwise and multiple-sequence-derived algorithms. We assess the results of the comparisons at various false-positive levels and report the number of detected homologies for each similarity-searching method. Our criteria for homology are based on the superfamily classifications within the SCOP database.
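The coverage-versus-error evaluation described above can be sketched as follows; the scores, labels, and function name are our illustration, not Paracel's evaluation code. In the real benchmark the homology labels come from SCOP superfamily membership.

```python
def coverage_at_fp(hits, n_false_allowed):
    """Rank all pairwise hits by score (best first) and count how many
    true homologies are detected before exceeding the allowed number of
    false positives. `hits` is a list of (score, is_homolog) pairs."""
    detected, false_positives = 0, 0
    for score, is_homolog in sorted(hits, reverse=True):
        if is_homolog:
            detected += 1
        else:
            false_positives += 1
            if false_positives > n_false_allowed:
                break
    return detected

# Toy ranking from an all-against-all search with known labels.
hits = [(310, True), (280, True), (120, False), (95, True), (40, False), (35, True)]
print(coverage_at_fp(hits, 1))  # 3 homologies detected before a 2nd false positive
```

Sweeping `n_false_allowed` traces out the coverage-versus-error curve on which the methods are compared.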

We demonstrate that our implementations are biologically equivalent to the original algorithms, but convey the advantages of far greater speed and throughput required for genomic analysis. The comparison also demonstrates the greater accuracy of the slower but more rigorous methods, and that even these are fast enough in our implementation for routine use.

______________________ (1) Brenner, S.E., Chothia, C. and Hubbard, T., 'Assessment of sequence comparison methods with reliable structurally identified distant evolutionary relationships', Proc. Natl. Acad. Sci. USA 95, 6073-6078 (1998).


56. Target-BLAST: an algorithm to find genes unique to pathogens causing similar human diseases (up)
Joanna Fueyo, Department of Pharmacology, University of Pennsylvania School of Medicine;
Jonathan Crabtree, Computational Biology and Informatics Laboratory, University of Pennsylvania;
Jeffrey N. Weiser, Department of Microbiology and Pediatrics, University of Pennsylvania School of Medicine;
fueyo@mail.med.upenn.edu
Short Abstract:

There is an urgent need for novel data-mining algorithms for DNA and protein sequences, especially for discovering which genes may be important in the establishment of infectious disease in humans. Here we present a novel algorithm for finding such genes using a fully computational approach.

One Page Abstract:

Target-BLAST: an algorithm to find genes unique to pathogens causing similar human diseases

Motivation: This paper describes a novel algorithm, Target-BLAST, designed for the discovery of genes unique to pathogens that share common features. Bacteria with common features, such as the ability to cause a similar set of human infections, may share genes that facilitate colonization of the host or establishment of infection. The organisms that reside in the human respiratory tract are the most common cause of antibiotic usage for infectious disease. In a search for genes unique to respiratory pathogens, we designed an algorithm to find genes restricted to the organisms that colonize the respiratory tract, since these bacteria may share sequences that could serve as novel, alternative drug targets for respiratory infections. Target-BLAST uses Position-Specific Iterated BLAST (PSI-BLAST) and its predecessor, the Basic Local Alignment Search Tool (BLAST), to discover homologs in microorganisms that share common features, such as the ability to reside in or cause disease in the same host environment.

Results: We illustrate the use of Target-BLAST on sequences obtained from the public databases to find genes unique to a group of pathogens that cause pneumonia, and discovered a set of 12 genes unique to the bacteria that reside in the respiratory tract. The sensitivity of Target-BLAST to distant relationships is gained through the iterative use of BLAST followed by PSI-BLAST. This approach facilitates the comparison and extraction of information from partial, unannotated DNA sequences as well as complete, annotated genomes. Target-BLAST is considerably faster than existing genome-analysis tools and finds genes conserved in both whole and partial, unannotated genomic data.
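The core filtering step can be pictured as set algebra over homolog hit lists. The sketch below is our illustration, not the Target-BLAST code: the gene names, organism lists, and the `unique_to_group` helper are hypothetical, and the real method builds the hit sets with BLAST followed by PSI-BLAST for sensitivity.

```python
def unique_to_group(hits_by_gene, target_organisms, background_organisms):
    """hits_by_gene maps each query gene to the set of organisms in which a
    homolog was found. A gene is a candidate if it has a homolog in every
    target pathogen and in no background organism."""
    targets, background = set(target_organisms), set(background_organisms)
    return [gene for gene, orgs in hits_by_gene.items()
            if targets <= orgs and not (background & orgs)]

# Hypothetical hit sets from iterated BLAST / PSI-BLAST searches.
hits = {
    "geneA": {"S.pneumoniae", "H.influenzae", "N.meningitidis"},
    "geneB": {"S.pneumoniae", "H.influenzae", "E.coli"},
    "geneC": {"S.pneumoniae"},
}
print(unique_to_group(hits, ["S.pneumoniae", "H.influenzae"], ["E.coli"]))
# ['geneA']
```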


57. Software to predict microArray hybridisation: ACROSS (up)
Antoine Janssen, Keygene N.V.;
Jurgen Pletinckx, Algonomics N.V.;
Jan van Oeveren, Martin Reijans, René Hogers, Keygene N.V.;
Philippe Stas, Algonomics N.V.;
Michiel van Eijk, Keygene N.V.;
Ignace Lasters, Algonomics N.V.;
René van Schaik, Keygene N.V.;
aj@keygene.com
Short Abstract:

High-quality microarray data can only be obtained when non-specific or cross-hybridization is excluded or at least minimized. We developed a new software tool, ACROSS, to predict hybridization based on sequence alignments and hence to assist in the optimal design of microarray probes.

One Page Abstract:

Software to Predict MicroArray Hybridization: ACROSS

Jurgen Pletinckx(2), Antoine Janssen(1), Jan van Oeveren(1), Martin Reijans(1), René Hogers(1), Philippe Stas(2), Michiel van Eijk(1), Ignace Lasters(2), René van Schaik(1)

(1)Keygene N.V., PO Box 216, 6700 AE, Wageningen, The Netherlands, info@keygene.com (2) Algonomics N.V., Technologiepark 4, B 9052 Ghent, Belgium, info@algonomics.com

Keywords: expression arrays, sequence alignment, cross hybridization, hybridization prediction

Abstract

Microarrays can be used to detect AFLP® fragments for genotyping or expression analysis by hybridization with labeled AFLP reactions or cDNA, respectively. High-quality data can only be obtained from microarrays when non-specific or cross-hybridization is excluded or at least minimized. We developed a new software tool, ACROSS, to predict hybridization and hence to assist in the optimal design of microarray probes. The ACROSS software takes as input the sequence information of DNA fragments or oligos in either one or two sets. Homology analysis within one set or between the two sets of sequences is performed to predict fragment hybridization. These predictions are based on quantitative analysis of hybridization data from model experiments with known sequences obtained under realistic experimental conditions. A test set of oligonucleotides, derived from two DNA fragments A and B, is used to create a series of complementary sequences with increasing sequence homology, which are hybridized with the corresponding full-length fragments as targets. Hybridization results are presented and the characteristics of ACROSS are discussed.

AFLP® is a registered trademark of Keygene N.V.


58. Discovering Dyad Signals (up)
Eleazar Eskin, Columbia University;
Pavel Pevzner, UCSD;
eeskin@cs.columbia.edu
Short Abstract:

Dyad signals in DNA sequences consist of two signals that occur a fixed distance apart. Because each component signal may not be statistically significant on its own, we perform an exhaustive search using a pattern-driven approach to discover these signals. We present two extensions to pattern-driven approaches that improve efficiency.

One Page Abstract:

Signal finding is a fundamental and well-studied problem in computational biology. The problem consists of discovering patterns in unaligned sequences. In the context of DNA sequences, the patterns can correspond to regulatory sites, which can be used for drug discovery. Current approaches to discovering signals focus on monad signals: typically short contiguous sequences of a certain length that occur, with a certain number of mismatches, in a statistically significant way in the unaligned sequences. This statistical significance is measured with a scoring function over candidate signals. However, several of the actual regulatory sites are dyad signals; in some cases the two component signals are palindromes of each other. A difficulty in discovering dyad signals is that each component monad may not be statistically significant on its own, making the dyad signal difficult to find using traditional methods.

Many approaches have been presented for discovering monad signals. Among the best performing are MEME, CONSENSUS, the Gibbs sampler, random projections, and combinatorial approaches (see references at poster). All of these approaches focus on discovering the highest-scoring signals. When applied to discovering dyad signals, they may fail in the case where neither of the pair of monad signals is statistically significant on its own.

In this project we present an algorithm for discovering dyad signals. The algorithm first performs an exhaustive search over potential monad signals in the data, then examines several thousand of the highest-scoring monads and looks for dyad signals by checking each pair of signals and determining whether they occur a fixed distance apart. The vast majority of the monad signals examined are not statistically significant on their own.

The computational bottleneck of this approach is the exhaustive search over candidate patterns. The simplest way to perform this search is a pattern-driven approach: checking every possible pattern against the data. Unfortunately, for long patterns this is computationally expensive. In this project we present two efficient extensions to the pattern-driven approach: an emulated pattern-driven approach and sparse suffix trees. The first extension examines only the patterns that actually occur in the data and stores the candidate signals in a hash table. Although this is significantly faster in some practical cases, it requires a lot of memory. The second extension provides efficient data structures for storing the candidate signals that reduce memory use.

The dyad signals are discovered by examining the several thousand highest-scoring monad signals. For each pair of these signals, we examine the sequence positions where the patterns occur. For patterns that occur in the same sequence we compute the distance between them; if a pair of patterns consistently occurs at a certain fixed distance, the pair is a dyad signal.
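The fixed-distance test on monad pairs can be sketched as follows; the toy occurrence data and the `min_support` cutoff are our illustrative choices, not the authors' exact statistics.

```python
from collections import Counter

def dyad_distance(occ_a, occ_b, min_support=3):
    """Given per-sequence occurrence positions of two monad signals
    (sequence id -> list of start positions), return the fixed spacing
    if one distance dominates with enough support, else None."""
    distances = Counter()
    for seq in occ_a.keys() & occ_b.keys():
        for pa in occ_a[seq]:
            for pb in occ_b[seq]:
                if pb > pa:
                    distances[pb - pa] += 1
    if not distances:
        return None
    d, count = distances.most_common(1)[0]
    return d if count >= min_support else None

# Monad A and monad B occur 15 apart in three of four sequences.
occ_a = {1: [10], 2: [5], 3: [40], 4: [7]}
occ_b = {1: [25], 2: [20], 3: [55], 4: [30]}
print(dyad_distance(occ_a, occ_b))  # 15
```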

We present several sets of experiments on biological and synthetic data. We first present a set of experiments evaluating the monad signal finding methods. Although we present these methods as a step toward discovering dyad signals, they in fact perform well at finding monad signals.

We also present a set of experiments on dyad samples. The synthetic data consist of i.i.d. sequences into which a dyad signal was inserted. We also run experiments on the biological sample and automatically recover the dyad signal known to be present in it.


59. Efficient all-against-all computation of melting temperatures for dna chip design (up)
Lars Kaderali, Alexander Schliep, University of Cologne;
kaderali@zpr.uni-koeln.de
Short Abstract:

Determining target specific probes for DNA chips requires the computation of melting temperatures for all pairs of probes and targets. We present a fast algorithm based on an extended nearest neighbor model. The algorithm combines suffix trees and alignment computations. Also, a framework is presented to suggest actual probes.

One Page Abstract:

The problem of determining target-specific probes for DNA chips requires the computation of melting temperatures for the duplexes formed between all target sequences and all probe candidates. The problem is further complicated by mismatches and possibly unpaired bases within the duplex. For complexity reasons, traditional computer programs restrict themselves to mere string comparisons to ensure the specificity of chip probes.

We present an efficient algorithm to solve this problem. The thermodynamic calculations are based on an extended nearest neighbor model, allowing for both mismatches and unpaired bases within a duplex. We introduce a new thermodynamic alignment algorithm to efficiently calculate melting temperatures. This algorithm is combined with modified generalized suffix trees to speed up the computation.

The algorithm is the core of a software framework to suggest actual probes to be used for DNA chip experiments, given the target sequences as input.
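A minimal sketch of the basic nearest-neighbor melting-temperature calculation, for a perfectly matched duplex only; the stack parameters below are a single illustrative value, not the published thermodynamic tables, and the extended model described above additionally handles mismatches and unpaired bases.

```python
import math

# Illustrative (not real) uniform stack parameters, cal/mol and cal/(mol*K).
DELTA_H_STACK = -8000.0
DELTA_S_STACK = -22.0
R = 1.987  # gas constant, cal/(mol*K)

def melting_temp_celsius(duplex_len, strand_conc=1e-6):
    """Basic nearest-neighbor Tm for a perfectly matched duplex:
    Tm = dH / (dS + R * ln(C/4)), summed over the (length - 1) stacks."""
    n_stacks = duplex_len - 1
    dh = n_stacks * DELTA_H_STACK
    ds = n_stacks * DELTA_S_STACK
    tm_kelvin = dh / (ds + R * math.log(strand_conc / 4.0))
    return tm_kelvin - 273.15

print(round(melting_temp_celsius(10), 1))
```

With real parameter tables the ΔH and ΔS sums depend on the actual dinucleotide stacks, which is what the thermodynamic alignment computes for mismatched and bulged duplexes.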


60. Wavelet techniques for detecting large scale similarities between long DNA sequences (up)
Frederic Guyon, Serge Hazout, INSERM U436, Universite Paris 7;
guyon@urbb.jussieu.fr
Short Abstract:

We have developed an efficient and fast algorithm to detect large-scale similarities between long sequences, based on fast wavelet transforms (FWT). The FWT algorithm allows one to compute dotplots at different scale levels and to zoom in and out on regions of interest within a dotplot.

One Page Abstract:

As complete genomic sequences become available, new methods to tackle very large DNA sequences are required. Some very large-scale sequence duplications within and between genomes have been identified. These large-scale genomic duplications yield precious information for genome annotation and genome evolution. Yet standard algorithms devised for the comparison of very long sequences are time- and space-consuming; aligning whole genomes or chromosomes remains at best a very difficult computational task [5], or is limited to closely related sequences [1].

To tackle this problem, we have applied the discrete wavelet transform to the computation and visualization of dotplots. The dotplot is a well-established technique for comparing two sequences [3, 4]: an image in which dots correspond to nucleotide or amino acid matches. Two difficulties arise when computing dotplots: the computer time and the memory space required are both proportional to the product of the lengths of the two sequences. Our method reduces the size of the dotplot by computing coarse dotplots. For this purpose, DNA sequences are transformed into indicator signals. These signals are decomposed using a fast wavelet decomposition technique [2]. Finally, coarse dotplots are computed using the coarse-level representation of the indicator signals.

The low dimensionality of the coarse-level signals makes fast computation of the coarse dot matrix possible. Moreover, the fast multiresolution analysis [2] provides an efficient algorithm for zooming in and out of the dotplot image and makes it possible to quickly navigate within the dotplot from coarse to fine levels. The detection sensitivity and specificity depend on the scale level. At coarse levels, a region of similarity is detectable only if it is sufficiently large; the coarse-scale dotplot can thus reveal regions with a very low level of similarity. At fine levels, regions of similarity are localized more accurately and with higher specificity.
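The coarse-dotplot idea can be sketched with Haar averaging (the simplest wavelet) over per-nucleotide indicator signals; this is our illustration, with an arbitrary similarity threshold, not the authors' implementation.

```python
def indicator_signals(seq):
    """One 0/1 signal per nucleotide channel."""
    return {b: [1.0 if c == b else 0.0 for c in seq] for b in "ACGT"}

def haar_coarsen(signal, levels):
    """Haar approximation coefficients: repeatedly average adjacent pairs."""
    for _ in range(levels):
        signal = [(signal[i] + signal[i + 1]) / 2.0
                  for i in range(0, len(signal) - 1, 2)]
    return signal

def coarse_dotplot(seq1, seq2, levels=2, threshold=0.5):
    """Dot (i, j) is set when the coarse indicator vectors at the two
    positions are similar (inner product over channels above threshold)."""
    s1 = {b: haar_coarsen(sig, levels) for b, sig in indicator_signals(seq1).items()}
    s2 = {b: haar_coarsen(sig, levels) for b, sig in indicator_signals(seq2).items()}
    n1 = len(next(iter(s1.values())))
    n2 = len(next(iter(s2.values())))
    return [[1 if sum(s1[b][i] * s2[b][j] for b in "ACGT") > threshold else 0
             for j in range(n2)] for i in range(n1)]

# The shared A-rich halves light up at the coarse level; the rest stays dark.
print(coarse_dotplot("AAAAAAAACCCCCCCC", "AAAAAAAAGGGGGGGG"))
```

Each coarsening level halves both dotplot dimensions, which is where the quadratic time and memory savings come from.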

References [1] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Research, 27(11):2369-2376, 1999.

[2] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. on Patt. Anal. and Mach. Intell., 11:674-693, 1989.

[3] J. Pustell and F. Kafatos. A high speed, high capacity homology matrix : Zooming through sv40 and polyoma. Nucleic Acids Research, 10(15):4765-4782, 1982.

[4] E.L. Sonnhammer and R. Durbin. A dot-matrix program with dynamic threshold control suited for genomic dna and protein sequence analysis. Gene, 1996.

[5] M. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall, New York, 1995.


61. TRAP: Tandem Repeat Assembly Program, capable of correctly assembling nearly identical repeats (up)
Martti T. Tammi, Erik Arner, Dept. of Genetics and Pathology, Uppsala University;
Tom Britton, Dept. of Mathematics, Uppsala University;
Daniel Nilsson, Björn Andersson, Dept. of Genetics and Pathology, Uppsala University;
martti.tammi@genpat.uu.se
Short Abstract:

We present a method to separate almost identical repeats, and a rigorous study of combinatorial and statistical properties of sequencing errors and real differences between repeats in a fragment assembly framework. We also show that it is possible to make assemblies with a pre-defined probability of error using this method.

One Page Abstract:

The software commonly used for assembly of shotgun sequence data has several limitations. This is especially true when repetitive sequences are encountered. Shotgun assembly is a difficult task, even for non-repetitive regions, but the use of quality assessment of the data and efficient matching algorithms have made it possible to assemble most sequences efficiently. In the case of highly repetitive sequences, however, these algorithms fail to distinguish between sequencing errors and single base differences in regions containing nearly identical repeats. We present the TRAP method to detect subtle differences between repeat copies, and show results from a rigorous study of combinatorial and statistical properties of sequencing errors and real differences between repeats in a fragment assembly framework. We also show that it is possible to make assemblies with a pre-defined probability of error using this method.

The key step in the TRAP method is the construction of multi-alignments consisting of all defined overlaps for a region, followed by detection of pairs of columns containing coinciding deviations from the column consensus. For each pair, the probability of observing such a pair by chance is computed; if the probability exceeds a threshold, the pair is rejected. The remaining errors are trapped in a clustering process that follows. In a data set containing repeat regions with 1% randomly distributed differences and a sequencing error of up to 11%, requiring a coverage of at least three sequence reads, we could detect 65% of the differences with an error of 0.5%. This is not the final error, since most of these errors can be trapped in the clustering stage. The data set consisted of 108 simulated assemblies with varying repeat lengths and copy numbers.
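The column-pair test can be sketched with a simplified binomial model; the per-read error rate, the independence assumption, and the function below are our illustration, not TRAP's exact statistics.

```python
import math

def p_coincide_by_chance(coverage, error_rate, k):
    """Probability that at least k of `coverage` reads deviate from the
    consensus in BOTH of two columns purely by independent sequencing
    error (per-column error probability `error_rate`)."""
    p_both = error_rate ** 2  # a single read errs in both columns
    return sum(math.comb(coverage, j) * p_both**j * (1 - p_both)**(coverage - j)
               for j in range(k, coverage + 1))

# At 10x coverage and 5% error, even two co-deviating reads are unlikely by
# chance, so coinciding deviations point to a real difference between repeats.
p = p_coincide_by_chance(10, 0.05, 2)
print(p < 0.01)  # prints True
```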

The TRAP software package is implemented in four separate program modules, each performing a distinct task: 1. the initial screening for contamination and determination of read quality; 2. the computation of overlaps; 3. repeat element analysis and fragment layout generation; 4. the generation of the consensus sequence and a status report. TRAP has been shown to work by assembling both real and simulated shotgun data. We simulated a shotgun project containing eight 1789-base-pair repeats in tandem, with 1.0% difference between repeat units and a simulated sequencing error of 2.71%. TRAP correctly assembled all 367 sequence reads, while PHRAP could not resolve any of the repeat elements correctly and assembled the reads belonging to the different repeats almost randomly, which made it impossible to extract the correct consensus sequence from any of the repeat elements. A simulation of a 140 kb BAC project containing eight 1789-base-pair tandem repeats gave similar results, with only 1.6% of the reads wrongly placed, the difference between repeat copies being 1.0% and the average sequencing error 7.2%. The erroneously placed sequence reads did not affect the consensus sequence. The main features of TRAP are the ability to separate long repeat regions from each other by distinguishing single-base substitutions as well as insertions/deletions from errors, and the ability to pinpoint regions where additional sequencing is needed to efficiently close the remaining gaps. Since repeats are common in most sequencing projects, this software should be of use to the sequencing community.


62. LASSAP: A powerful and flexible tool for (large-scale) sequence comparisons (up)
Heus, H.C., Glémet, E., Raffinot, M., Metayer, K., Chambre, P., Codani, J-J., Gene-IT;
Raffinot, M., Laboratoire Génome et Informatique, CNRS, France.;
heus@gene-it.com
Short Abstract:

Existing sequence comparison tools are not designed to handle large-scale comparisons efficiently. LASSAP software is designed to compare entire databases of sequences, integrating an adequate set of algorithms that run on multiple processors. It allows extensive result management: databases and results can be queried during all steps of a workflow.

One Page Abstract:

There is a large collection of good sequence comparison tools available. However, most of these tools were not designed to handle the results of large-scale comparisons efficiently. To answer complex questions, one has to depend on tricks and scripts tailored to each tool to parse and interpret its output. This makes it difficult to redo an experiment with different parameters or another algorithm. Often, bioinformaticians spend a substantial amount of their 'quality time' solving trivial problems over and over again.

Therefore, Gene-IT has developed Lassap, a large-scale sequence comparison tool that is powerful and extremely flexible. Lassap has been used to solve everyday bioinformatics problems that occur in academia and industry. Some examples of its use are the annotation of sequences on a microarray, cross-species genomic comparisons (GENOSCOPE), the construction of non-redundant databases (Swissprot / TrEMBL), and the implementation of gene family workflows (INTERPRO). All this can be done with a minimal investment of time and resources.

The key specifications of Lassap are:
· The package is built around the comparison of entire databases of sequences.
· A complete set of algorithms is available (optimised for speed and minimal use of system resources): BLAST, Smith/Waterman, Needleman/Wunsch, string matching, pattern matching, enhanced versions of known algorithms and smart combinations of algorithms.
· After sequence comparison, similar sequences can be clustered.
· There are extensive database and result management options. Sequence databases, comparison results and clusters can be sorted, queried and filtered during all steps of the workflow. In this way, a sequence and its annotation are handled simultaneously. Furthermore, all data structures can be virtually concatenated, which makes it easy to handle updates of sequence databases, comparison results and clusters.
· The Unix command line interface makes it very easy to incorporate Lassap into an existing bioinformatics infrastructure and to automate complex workflows.
· Lassap is very flexible: changing an algorithm is as easy as changing a single command line option.
· Lassap is a professional tool. It is written in the C programming language and is available for almost all Unix systems (single and multi-threaded) as well as Linux clusters.
· For more information visit our web site: http://www.gene-it.com


63. Mini-greedy algorithm for multiple RNA structural alignments (up)
J. Gorodkin, Bioinformatics Research Center, University of Aarhus, Denmark;
R. B. Lyngsoe, Department of Computer science, University of California at Santa Cruz, USA;
G. D. Stormo, Department of Genetics, Washington University Medical School, USA;
gorodkin@bioinf.au.dk
Short Abstract:

A mini-greedy algorithm performs the greedy step on only a core set of sequences; the remaining sequences are aligned with the core in turn, based on a ranking of the sequences. This algorithm results in a significant speed-up of the FOLDALIGN approach for multiple structural alignment of RNA sequences.

One Page Abstract:

The problem with greedy algorithms is that a large fraction of the sequences in a data set is subject to many redundant computations, making greedy algorithms expensive. Here we present an approach, which we call mini-greedy, that first isolates a small core of suitable sequences on which to apply the greedy algorithm. The remaining sequences are then aligned onto the resulting core alignment in an order determined by a ranking scheme based on the pairwise scores among all sequences. In contrast to algorithms such as CLUSTAL, where the multiple alignment is built from individual clusters of sequences, our algorithm seeks a set of core sequences that have as much as possible in common with as many sequences as possible in the data set. We apply this algorithm to multiple structural alignment of RNA sequences, through FOLDALIGN, and show that the mini-greedy algorithm can reduce the computational time significantly. Different schemes are compared. The mini-greedy algorithm has been implemented as part of the Stem-Loop Align SearcH server at http://www.bioinf.au.dk/slash/.
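Core selection and ranking can be sketched directly from a pairwise score matrix; the greedy objective below (total similarity to the whole set, then best-score-to-core ranking) is our reading of the description, not the published algorithm.

```python
def pick_core_and_ranking(scores, core_size):
    """scores[i][j] is the pairwise (e.g. FOLDALIGN) score between
    sequences i and j. The core is the `core_size` sequences with the
    highest total similarity to the whole data set; the rest are ranked
    by their best score against any core member, best first."""
    n = len(scores)
    totals = sorted(range(n), key=lambda i: sum(scores[i]), reverse=True)
    core = totals[:core_size]
    rest = [i for i in range(n) if i not in core]
    rest.sort(key=lambda i: max(scores[i][c] for c in core), reverse=True)
    return core, rest

# Toy symmetric score matrix for four sequences.
scores = [
    [0, 9, 8, 2],
    [9, 0, 7, 3],
    [8, 7, 0, 1],
    [2, 3, 1, 0],
]
print(pick_core_and_ranking(scores, 2))  # ([0, 1], [2, 3])
```

The greedy step then runs only on the core; each remaining sequence is aligned to the growing alignment in ranked order, avoiding the redundant all-pairs recomputation.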


64. Neural network and genetic algorithm identification of coupling specificity and functional residues in G protein-coupled receptors (up)
Anthony Bucci, Jason M. Johnson, Pfizer Discovery Technology Center;
abucci@brandeis.edu
Short Abstract:

We demonstrate that artificial neural networks are effective in automated prediction of G-protein coupling function and, when combined with a genetic algorithm, can identify residues in GPCR sequences that affect G-protein coupling specificity. The methods we describe are general and can be applied to other classification problems in molecular biology.

One Page Abstract:

G protein-coupled receptors (GPCRs) form a large family of human cell membrane signaling proteins with seven transmembrane segments, transmitting signals across the cell membrane in response to a wide variety of endogenous ligands. GPCRs are attractive drug targets because they are amenable to small-molecule intervention and play critical roles in many human diseases. The molecular and cellular functions of many of these proteins are currently unknown. Here, we demonstrate the utility of artificial neural networks (ANNs) for the problem of predicting the G-protein coupling specificity of GPCRs, the key determinant of downstream signaling function. Using a set of ~100 GPCRs with known G-protein specificity, we conducted a cross-validation study comparing the performance of ANNs and homology-based classifiers on the G-coupling prediction task. Our results show that ANNs, given access to only a 20-residue window of the GPCR primary sequence, perform as well as BLAST or a nearest-neighbor classifier given access to the full-length sequence. Building on this result, we used a genetic algorithm (GA) to discover a set of GPCR sequence positions that allowed ANNs to outperform both BLAST and nearest-neighbor classifiers in a leave-one-out cross-validation test. These residue positions reveal regions of GPCR structure likely to be involved in G-protein coupling and discrimination among G-protein subtypes. We conclude that artificial neural networks are effective in automated prediction of G-coupling function and, when combined with a GA, can identify specific residues in GPCR sequences that are important for G-protein coupling. The ANN and GA methods we describe are general and can be applied to other classification and function-prediction problems in molecular biology.
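The GA-over-positions idea can be sketched on toy data. Everything below is our illustration, not the authors' implementation: the sequences and labels are invented, a nearest-neighbor classifier stands in for the ANN, and the population is seeded with all single-position masks to keep the tiny example well-behaved.

```python
import random

def loo_nn_accuracy(seqs, labels, positions):
    """Leave-one-out nearest-neighbor accuracy using only the chosen
    residue positions (identity count as the similarity measure)."""
    if not positions:
        return 0.0
    correct = 0
    for i in range(len(seqs)):
        best_j = max((j for j in range(len(seqs)) if j != i),
                     key=lambda j: sum(seqs[i][p] == seqs[j][p] for p in positions))
        correct += labels[best_j] == labels[i]
    return correct / len(seqs)

def evolve_positions(seqs, labels, pop_size=20, gens=25, seed=1):
    """Tiny elitist GA over position bitmasks: seed with all single-position
    masks plus random ones, keep the best half, refill with mutated copies."""
    rng = random.Random(seed)
    L = len(seqs[0])
    pop = [[p == q for q in range(L)] for p in range(L)]
    while len(pop) < pop_size:
        pop.append([rng.random() < 0.3 for _ in range(L)])
    def fitness(mask):
        return loo_nn_accuracy(seqs, labels, [p for p in range(L) if mask[p]])
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        pop = pop[:pop_size // 2] + [
            [b != (rng.random() < 0.1) for b in rng.choice(pop[:pop_size // 2])]
            for _ in range(pop_size - pop_size // 2)]
    pop.sort(key=fitness, reverse=True)
    return [p for p in range(L) if pop[0][p]]

# Toy "GPCR" data: position 2 alone separates the two coupling classes.
seqs = ["AAGAA", "ACGCA", "AATAA", "GCTAC"]
labels = ["Gs", "Gs", "Gi", "Gi"]
print(evolve_positions(seqs, labels))
```

The evolved mask recovers the informative position; in the real study the fitness is the ANN's cross-validated prediction accuracy over the masked residues.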


65. Determination of Classificatory Motifs for the Identification of Organism Using DNA Chips (up)
Uta Bohnebeck, Tom Wetjen, University of Bremen, Center for Computing Technologies (TZI);
Denja Drutschmann, University of Bremen, Center for Environmental Research and Technology (UFT);
bohnebec@tzi.de
Short Abstract:

A procedure for constructing highly sensitive and specific oligonucleotides for the identification of organisms is presented, exemplified with sequences of hepatitis C virus. We show that the common motifs detected in randomly sampled subsets can be generalized to be present, with high probability, in the population as well.

One Page Abstract:

A procedure for constructing highly sensitive and highly specific oligonucleotides for the identification of organisms, e.g. in environmental samples, will be presented. The problem of achieving high sensitivity may first be considered a conservation learning task [2]: common motifs (representing conserved regions) detected in randomly sampled subsets can be generalized to be present, with high probability, in the population as well. The algorithm for the first step of the procedure calculates all non-redundant common motifs that meet constraints on motif length, permitted mismatches, and sample coverage. The common motifs are determined with the aid of a generalized suffix tree using a pattern-driven approach [5]. Each resulting motif corresponds to a set of associated sequences - the potential oligonucleotides of the DNA chip - which are concrete substrings of the given sequence set and which represent the variability of the population. For the second step of the procedure, a BLAST search of the public nucleotide sequence databases is carried out in order to establish high specificity. Further optimization steps according to hybridization conditions have to be carried out [1].

In the experiments carried out, 171 complete genome sequences of the hepatitis C virus (HCV) were used. A relative arrangement of these sequences based on the idea of maximal unique matches [3] was performed. The result confirmed the observation of [4], i.e. that only the 5'UTR is sufficiently conserved to determine common motifs that can be used to identify HCV in general. The motif determination was therefore executed only on the 5'UTR of these 171 sequences, in a 10-fold cross-validation using randomly sampled subsets of 137 (80%) sequences. Allowing one mismatch, each subset produced on average a result set containing five common motifs with lengths between 39 and 68 bp. These sets of motifs were not 100% identical, since a motif from one subset was enlarged in another subset, or motifs were concatenated into one large motif. However, by merging the result sets a non-redundant set of maximal shared motifs could be extracted. Using the associated sets of oligonucleotides, 100% coverage of the population was obtained. Typically, one of the oligonucleotides belonging to a motif showed approximately 95% coverage, while the others represented only single variations containing one mismatch.
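The sample-coverage computation for a candidate motif can be sketched as a sliding Hamming test; the sequences below are toy data, not HCV 5'UTRs, and the one-mismatch allowance mirrors the setting used in the experiments.

```python
def covers(motif, seq, max_mismatches=1):
    """True if `seq` contains `motif` with at most `max_mismatches`
    substitutions (the per-window Hamming test behind sample coverage)."""
    m = len(motif)
    return any(sum(a != b for a, b in zip(motif, seq[i:i + m])) <= max_mismatches
               for i in range(len(seq) - m + 1))

def coverage(motif, seqs, max_mismatches=1):
    """Fraction of sequences in the sample that the motif covers."""
    return sum(covers(motif, s, max_mismatches) for s in seqs) / len(seqs)

seqs = ["GGACTAGGC", "TTACTAGTT", "CCACAAGCC", "AAAAAAAAA"]
print(coverage("ACTAG", seqs))  # 0.75: the third sequence matches with one mismatch
```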

[1] U. Bohnebeck, M. Nölte, T. Schäfer, M. Sirava, T. Waschulzik, G. Volkmann. An Approach to the Determination of Optimized Oligonucleotide Sets for DNA Chips, In: Proceedings of ISMB'99, Poster and Demonstrations, Heidelberg, 1999.

[2] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to Automatic Discovery of Patterns in Biosequences, Journal of Computational Biology, 5:277-303,1998.

[3] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes, Nucleic Acids Research, 27(11):2369-2376, 1999.

[4] J.H. Han, V. Shyamala, K.H. Richman, M.J. Brauer, B. Irvine, M.S. Urdea, R. Tekamp-Olson, G. Kuo, Q.-L. Choo, and M. Houghton. Characterization of the terminal regions of hepatitis C viral RNA: Identification of conserved sequences in the 5' untranslated region and poly(A) tails at the 3' end, Proc. Natl. Acad. Sci. USA, 88(5):1711-1715, 1991

[5] M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree, In: Latin'98, volume 1380 of LNCS, pages


66. An algorithm for detecting similar reaction patterns between metabolic pathways (up)
Yukako Tohsato, Ryutaro Saijo, Takao Amihama, Hideo Matsuda, Akihiro Hashimoto, Department of Informatics and Mathematical Science, Osaka University;
yukako@ics.es.osaka-u.ac.jp
Short Abstract:

We have developed a method for detecting similar reaction patterns between metabolic pathways. Given two trees representing pathways, our method extracts similar subtrees between them using a subtree-matching algorithm. We have found a similar reaction pattern between the glycol metabolism and degradation pathway and the fucose and rhamnose catabolism pathway.

One Page Abstract:

Metabolic pathways are recognized as among the most important biological networks. Comparative analyses of pathways give important information on their evolution and on pharmacological targets. In this poster, we present a method for comparing pathways based on the similarity of reactions catalyzed by enzymes. In our approach, reaction similarity is formulated as a scoring scheme based on the information content of the occurrence probabilities of EC numbers. For example, the occurrence probability P of a match between a pair of similar EC numbers, say 1.1.1.1 and 1.1.1.2, is very low compared to a randomly chosen combination of such pairs, so we give a higher score, -log P, to that match. Using this scoring scheme, one can align two pathways if they can be represented as strings of EC numbers. However, metabolic pathways often include branching structures, and the comparison of such pathways cannot be performed directly with string-matching algorithms. To cope with this issue, we have developed a method for comparing pathways that include branching structure: pathways are represented as trees and compared by a subtree-matching algorithm. Because the subtree-matching problem is in general NP-hard, we have developed a greedy algorithm for the comparison. The effectiveness of our method is demonstrated by applying it to metabolic pathways in Escherichia coli, re-constructed from the metabolic maps of the EcoCyc database. As a result, we have found a similar reaction pattern ([2.7.1.31] - [2.7.1.51], [4.1.1.47] - [4.1.2.17], [1.2.1.21] - [1.2.1.22] and [1.1.1.37] - [1.1.1.77]) between the glycol metabolism and degradation pathway and the fucose and rhamnose catabolism pathway.
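The -log P scoring of an EC-number match can be sketched as follows; the probability table is hypothetical and would in practice be estimated from EC-number frequencies in the pathway database.

```python
import math

def ec_match_score(ec1, ec2, prefix_probs):
    """Score an enzyme pair by the information content, -log2 P, of their
    deepest shared EC-number prefix. prefix_probs maps a prefix such as
    '1.1.1' to the probability that a random EC pair shares it."""
    f1, f2 = ec1.split("."), ec2.split(".")
    depth = 0
    while depth < 4 and f1[depth] == f2[depth]:
        depth += 1
    if depth == 0:
        return 0.0  # no shared class: no reward for the match
    prefix = ".".join(f1[:depth])
    return -math.log2(prefix_probs[prefix])

# Hypothetical occurrence probabilities (deeper shared prefixes are rarer).
probs = {"1": 0.25, "1.1": 0.05, "1.1.1": 0.01, "1.1.1.1": 0.001}
print(ec_match_score("1.1.1.1", "1.1.1.2", probs))  # about 6.64 bits
```

Rare, specific matches such as 1.1.1.1 vs 1.1.1.2 thus score much higher than matches that only share a top-level class.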


67. SPP-1 : A Program for the Construction of a Promoter Model From a Set of Homologous Sequences
Ulrike Goebel, Thomas Wiehe, Thomas Mitchell-Olds, Max-Planck Institute of Chemical Ecology;
goebel@stargate.ice.mpg.de
Short Abstract:

We present a new algorithm which constructs a PROMOTER MODEL from a set of unaligned homologous, coregulated polII promoters. It employs a comparative approach, which in addition to sequence similarity can also take into account DNA structural similarity. In a second phase, higher order modules of conserved motifs are found.

One Page Abstract:

We present a new algorithm which constructs a PROMOTER MODEL from a set of unaligned homologous, coregulated polII promoters.

It rests on the following assumptions: DNA contact points of individual members of the transcription initiation complex are constrained in their ability to tolerate mutations and thus stand out as short (6-10 bp) conserved motifs. The arrangement of the proteins in the initiation complex is reflected by a hierarchical arrangement of the binding sites on the DNA, and it is this pattern which really identifies the promoter. It, too, should be at least partly conserved in members of a family of promoters which are known to confer the same expression pattern. Another aspect which has been shown to be conserved, at least in parts of polII promoters, is DNA structure, especially bendability and stiffness. Most probably the sequence conservation seen at transcription factor binding sites is just an extreme case of structural conservation (identical sequences have identical structures). It may well be that there are sites which have drifted apart at the sequence level in different members of a promoter family while still being conserved with respect to some relevant structural property.

Our algorithm first constructs gap-free blocks of sequence segments from the input sequences. A block can contain zero or multiple segments from a given input sequence. It is maximal with respect to the number of segments such that all pairs of segments in a block are SIMILAR. In contrast to other existing algorithms, SIMILARITY is a relation that can be freely defined and, in particular, can refer to similarity with respect to DNA structural parameters. In a second phase, the algorithm looks for a pattern of these motifs which is common (with variations) to all input sequences. Motifs which are part of such a pattern can not only be more readily trusted to be truly biologically relevant; the pattern also constitutes a testable hypothesis (a PROMOTER MODEL) about the input family of promoter sequences.
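The block-construction phase with a pluggable SIMILARITY relation might be sketched as below. This is a minimal greedy sketch, assuming segments have already been extracted; the maximality check is simplified, and the Hamming-distance relation is only a stand-in for the freely definable relation (which, per the abstract, may instead compare DNA structural parameters).

```python
def build_block(segments, similar, seed_index=0):
    """Greedily grow a gap-free block: start from a seed segment and add
    each remaining segment that is SIMILAR to every segment already in
    the block, so that all pairs in the block satisfy the relation."""
    block = [segments[seed_index]]
    for i, seg in enumerate(segments):
        if i == seed_index:
            continue
        if all(similar(seg, member) for member in block):
            block.append(seg)
    return block

# Toy SIMILARITY relation: at most one mismatch between equal-length segments.
def hamming_similar(a, b, max_mismatch=1):
    return sum(x != y for x, y in zip(a, b)) <= max_mismatch

segs = ["TATAAT", "TATGAT", "GGGCCC", "TATAAA"]
block = build_block(segs, hamming_similar)
```

Substituting a structure-based predicate for `hamming_similar` changes the notion of a block without changing the construction loop, which is the point of keeping SIMILARITY pluggable.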


68. A new score function for distantly related protein sequence comparison.
Maricel Kann, Richard Goldstein, The University of Michigan;
mkann@umich.edu
Short Abstract:

A new method to derive a score function to detect remote relationships between protein sequences has been developed using an optimization procedure. We find that the new score function obtained in such a manner performs better than standard score functions for the identification of distant homologies.

One Page Abstract:

In order to understand the evolutionary history of a newly determined sequence, aligning its primary structure with the sequences in a database is one of the most significant and widely used techniques. Sequences with a high similarity score usually share a common structure and may have similar functions or mechanisms. All of these methods rely on some score function to measure sequence similarity, and the choice of score function is especially critical for distant relationships. A new method to derive a score function to detect remote relationships between protein sequences has been developed. The new score function was obtained by maximizing a figure of merit representing a measure of success in recognizing homologs of the newly sequenced protein among thousands of non-homologous sequences in the databases. We find that the score function obtained in this manner performs better than standard score functions for the identification of distant homologies.
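The optimization idea, maximizing a figure of merit over the parameters of a score function, can be sketched generically as hill climbing. This is an illustrative sketch only; the abstract does not specify the optimization procedure, and the toy merit function below stands in for the real one (success in recognizing homologs among thousands of non-homologous database sequences).

```python
import random

def optimize_score_params(merit_fn, params, steps=200, step_size=0.1, seed=0):
    """Hill climbing over score-function parameters: randomly perturb the
    current parameter vector and keep a perturbation whenever it improves
    the figure of merit."""
    rng = random.Random(seed)
    best = list(params)
    best_merit = merit_fn(best)
    for _ in range(steps):
        trial = [p + rng.uniform(-step_size, step_size) for p in best]
        m = merit_fn(trial)
        if m > best_merit:
            best, best_merit = trial, m
    return best, best_merit

# Toy merit function with a known optimum at p = 1.0.
merit = lambda p: -(p[0] - 1.0) ** 2
best, best_merit = optimize_score_params(merit, [0.0])
```

In the real setting, `params` would be the entries of the substitution score function and `merit_fn` would score homolog recognition over a benchmark database.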


69. Expression profiling on DNA-microarrays: In silico clone selection for DNA chips
Rainer König, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany;
Johannes Beckers, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Marcus Frohme, Tamara Korica, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany;
Stefan Haas, MPI for Molecular Genetics, Computational Molecular Biology, Ihnestraße 73, 14195 Berlin, Germany;
Matthias Seltmann, Christine Machka, Yali Chen, Alexei Drobychev, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Sabine Tornow, Michael Mader, GSF - National Research Center, Institute for Bioinformatics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Martin Hrabé de Angelis, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Werner Mewes, GSF - National Research Center, Institute for Bioinformatics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Jörg Hoheisel, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany;
Martin Vingron, MPI for Molecular Genetics, Computational Molecular Biology, Ihnestraße 73, 14195 Berlin, Germany;
r.koenig@dkfz.de
Short Abstract:

As part of the German Human Genome Project, DNA microarray technology is being established to systematically analyse gene function in mouse mutant lines. Software was developed to select gene-specific IMAGE clones for subsets of known genes. It is designed to optimise for good hybridisation to the target and low cross-hybridisation with other known genes.

One Page Abstract:

As part of the German Human Genome Project (DHGP), DNA microarray technology is used for a systematic analysis of gene function in ENU-induced mutant mice. To design our chips for chosen subsets of genes, a software system was developed that selects gene-specific EST clones from the IMAGE clone set. To obtain specific expression signals for each gene, the algorithm is designed to optimise two requirements for the immobilised antisense probe: (1) good hybridisation to the target and (2) no cross-hybridisation with other known genes. A method addressing these requirements is presented. In an exemplary application, the design of the first chip contains genes with known functions, for example in embryonic development, or genes that are relevant to the pathogenesis of related human diseases. Additionally, a set of constitutively expressed genes was selected to facilitate normalisation. Public access is offered to select clones for known mouse genes (www.dkfz-heidelberg.de/tbi/services/koenig/services/clones2chip_front.pl).
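The two selection requirements can be illustrated with a simplified filter. This is a sketch under strong assumptions: exact longest-common-substring length stands in for the hybridisation prediction actually used by the system, and the thresholds and sequences are invented for illustration.

```python
def select_specific_clones(clones, target_gene, other_genes,
                           min_match=50, max_cross=20):
    """Keep clones with a long exact match to the target gene (proxy for
    good hybridisation) and no long exact match to any other known gene
    (proxy for low cross-hybridisation)."""
    def longest_shared(a, b):
        # Longest common substring length via dynamic programming.
        best = 0
        prev = [0] * (len(b) + 1)
        for x in a:
            cur = [0] * (len(b) + 1)
            for j, y in enumerate(b, 1):
                if x == y:
                    cur[j] = prev[j - 1] + 1
                    best = max(best, cur[j])
            prev = cur
        return best

    selected = []
    for clone in clones:
        if longest_shared(clone, target_gene) < min_match:
            continue  # fails requirement (1): weak hybridisation to target
        if any(longest_shared(clone, g) >= max_cross for g in other_genes):
            continue  # fails requirement (2): cross-hybridisation risk
        selected.append(clone)
    return selected

# Toy example with short sequences and correspondingly small thresholds.
selected = select_specific_clones(
    ["ACGTACG", "GGGGGG"],
    target_gene="ACGTACGTACGT",
    other_genes=["TTTTGGGGTTTT"],
    min_match=5, max_cross=4)
```

A production system would replace `longest_shared` with a thermodynamic or alignment-based hybridisation model.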


70. Data Mining: Efficiency of Using Sequence Databases for Polymorphism Discovery
David G. Cox, University of Turin / International Agency for Research on Cancer;
Federico Canzian, Catherine Boillot, International Agency for Research on Cancer;
cox@iarc.fr
Short Abstract:

We selected thirteen genes and determined the complete collection of polymorphisms existing in these genes, in our laboratory using DHPLC or in other laboratories using comparable methods. We then compared these results to polymorphisms found by aligning sequences of the genes in the GenBank database, treating single-base differences between sequences as candidate polymorphisms.

One Page Abstract:

An open question in research on Single Nucleotide Polymorphisms (SNPs) is: what percentage of true SNPs is found by in silico pre-screening? To this end, we selected thirteen genes and determined the complete collection of "true" polymorphisms, i.e. polymorphisms experimentally detected, existing in these genes, in our laboratory using Denaturing High Performance Liquid Chromatography (DHPLC) and fluorescent sequencing, or in other laboratories using comparable methods (Single-Strand Conformation Polymorphism, Denaturing Gradient Gel Electrophoresis). The genes studied by our group were PTGS2, IGFBP1, IGFBP3, and CYP19. GenBank sequence information was then aligned using two methods, and sequence differences were termed "candidate" polymorphisms. We then compared the series of SNPs obtained experimentally and in silico and found that in silico methods are relatively specific (up to 55% of candidate SNPs found by SNPFinder have been discovered by experimental procedures) but have low sensitivity (not more than 27% of true SNPs are found by in silico methods).
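The sensitivity and specificity figures quoted above amount to a simple set comparison between experimental and in silico SNP calls, which can be sketched as follows (the SNP identifiers are placeholders):

```python
def compare_snp_sets(true_snps, candidate_snps):
    """Sensitivity and specificity in the sense used above:
    sensitivity  = fraction of experimentally confirmed ("true") SNPs
                   recovered among the in silico candidates;
    specificity  = fraction of in silico candidates that are
                   experimentally confirmed."""
    true_set, cand_set = set(true_snps), set(candidate_snps)
    confirmed = true_set & cand_set
    sensitivity = len(confirmed) / len(true_set) if true_set else 0.0
    specificity = len(confirmed) / len(cand_set) if cand_set else 0.0
    return sensitivity, specificity

# Placeholder SNP ids: 4 true SNPs, 3 candidates, 2 in common.
sens, spec = compare_snp_sets({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s5"})
```

Here "specificity" follows the abstract's usage (a positive predictive value) rather than the 1 − false-positive-rate definition.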


71. Automated modeling of protein structures
Morten Nielsen, Ole Lund, Claus Lundegaard, Thomas N. Petersen, Structural Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.;
Jakob Bohr, Soeren Brunak, Structural Bioinformatics, Inc., SAB, San Diego, California;
Garry P. Gippert, Structural Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.;
mnielsen@strubix.dk
Short Abstract:

SBI-AT has developed a novel and highly reliable method for fold recognition based on a ranking of alignment Z-scores computed from sequence profiles and predicted secondary structure. The method was used for template identification in the CASP4 protein structure prediction experiment.

One Page Abstract:

SUMMARY

Automated modeling of protein structure from sequence guides the assignment of function and the selection of targets for experimental structure studies, and provides starting points for expert modeling of protein structure.

Fold recognition is an important benchmark in automated modeling approaches. SBI-AT has developed a novel and highly reliable method for fold recognition based on a proprietary ranking of alignment Z-scores computed from sequence profiles and predicted secondary structure (Petersen et al., 2000).
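A generic Z-score ranking of fold-recognition templates, not the proprietary SBI-AT scheme, might look like the sketch below; template names and background scores are invented for illustration.

```python
import statistics

def alignment_z_score(query_score, random_scores):
    """Z-score of an alignment score against a background distribution of
    scores obtained from shuffled (random) sequences."""
    mu = statistics.mean(random_scores)
    sigma = statistics.stdev(random_scores)
    return (query_score - mu) / sigma

def rank_templates(scores, backgrounds):
    """Rank candidate fold templates by decreasing alignment Z-score.
    `scores` maps template -> alignment score; `backgrounds` maps
    template -> list of scores against shuffled sequences."""
    return sorted(
        scores,
        key=lambda t: alignment_z_score(scores[t], backgrounds[t]),
        reverse=True)

# Toy background with mean 12 and sample standard deviation 2.
bg = [10, 12, 14, 10, 14]
ranked = rank_templates({"template_a": 20, "template_b": 14},
                        {"template_a": bg, "template_b": bg})
```

In practice the alignment scores would come from profile-profile alignments incorporating predicted secondary structure, as the abstract describes.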

In this presentation we compare results obtained using methods developed at SBI-AT with the well-known PDB-BLAST and FFAS (Rychlewski et al., 2000) methods. Our results in the comparative modeling sections of the CASP4 protein structure prediction experiment are also summarized.

CONCLUSIONS

The SBI-AT fold recognition method performs well compared to FFAS and PDB-BLAST, particularly in the high-reliability regime. A large performance increase for SCOP family relationships (Murzin et al., 1995) impacts strongly on comparative modeling of protein structures.

In CASP4 comparative modeling categories, SBI-AT's automatic modeling method shows a performance comparable to that of the best expert and automatic methods developed elsewhere.

REFERENCES

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res 1997 25:3389-402.

CASP4, http://predictioncenter.llnl.gov/casp4/.

Murzin AG, Brenner SE, Hubbard T, Chothia C (1995). J. Mol. Biol. 247:536-540.

Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O (2000). Proteins 41:17-20.

Rychlewski L, Jaroszewski L, Li W, Godzik A (2000). Protein Sci. 9:232-241.

ABOUT SBI-AT

SBI-AT, a subsidiary of Structural Bioinformatics Inc., San Diego, develops novel computer algorithms for the prediction and exploration of structural and dynamical features of proteins, specifically targeted for use in rational drug design and biotechnology applications. www.strubix.dk.


72. Algorithmic improvements to in silico transcript reconstruction implemented in the Paracel Clustering Package (PCP)
Joseph A. Borkowski, Jun Qian, Cassi Paul, Charles P. Smith, KyungNa Oh, Glen Herrmannsfeldt, Cecilie Boysen, Paracel, Inc.;
borkowski@paracel.com
Short Abstract:

We describe novel algorithms which improve EST-based transcript reconstruction quality and which allow more accurate splice form detection. These algorithms make use of pairwise overlap and redundancy measurements to identify and remove artifactual chimeric sequences and low-quality 'non-N' sequence segments. We demonstrate their effect on overall transcript quality.

One Page Abstract:


Transcript reconstruction from EST data, whether public or private, can often be problematic. Ideally, it should be possible to reconstruct a single consensus sequence present in a cell using measured pairwise overlap between individual ESTs. When attempting to reconstruct transcripts, available programs typically overestimate the number of alternative transcripts from a given organism and also tend to produce a number of large false clusters. These problems can have serious negative consequences when using the output of transcript reconstruction for either gene identification or oligo chip design. They are often more acute when a high quality genomic sequence segment covering any particular transcript is not available or not used to guide correct reconstruction.

Overestimation of alternative transcript forms can be demonstrated by thorough comparison with any available high-quality genomic sequence data. Incorrectly predicted alternative transcripts will not have a corresponding genomic sequence segment. In most cases this overestimation is due to low-quality sequence segments present in the input sequence data. When a particular assembly program is unable to align these sequences with the true transcript along their entire length, they are falsely reported as alternative transcripts of the true transcript.

To reduce false reporting of alternative transcripts, we have developed an algorithm that can identify and remove low-quality sequence segments present in ESTs, even when quality values are not available. This algorithm is based on the rate of score drop-off in a pairwise alignment. It has been observed that sequence quality tends to decline slowly towards the end of an EST. Therefore its score in comparison with higher-quality sequences tends to drop off slowly after reaching an inflection point at the end of a high-scoring match segment. In contrast, the score for a true alternative transcript tends to drop off at a much faster rate. When the end of a sequence has been identified as low quality, that end is not used in the construction of a consensus sequence. High-quality sequence segments that are derived from alternative transcripts are used to create alternative consensus sequences.
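The drop-off-rate criterion can be sketched as below. This is an illustrative sketch, not PCP's implementation: the running-score representation, the inflection point being given, and the threshold value are all assumptions.

```python
def drop_off_rate(scores, inflection):
    """Average per-position decline of the running alignment score after
    the inflection point at the end of the high-scoring match segment."""
    tail = scores[inflection:]
    if len(tail) < 2:
        return 0.0
    return (tail[0] - tail[-1]) / (len(tail) - 1)

def classify_end(scores, inflection, slow_threshold=1.0):
    """A slow decline suggests a low-quality EST end (to be trimmed);
    a fast decline suggests a true alternative transcript."""
    if drop_off_rate(scores, inflection) < slow_threshold:
        return "low-quality end"
    return "alternative transcript"

# Slow decline (0.5 per position) vs. fast decline (10 per position).
slow = classify_end([100, 99.5, 99, 98.5, 98], 0)
fast = classify_end([100, 90, 80, 70], 0)
```

In a full implementation the inflection point itself would be located from the alignment, and the threshold calibrated against sequences with known quality values.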

In our experience, large false clusters are often due to the presence of contaminants, repeats, or chimeric sequences in the input EST dataset. We have developed an algorithm that detects and reports these chimeric EST sequences. It uses all of the computed pairwise overlaps in a set of sequences to determine abrupt break points in the overall contig structure where a single sequence joins two coherent subclusters. When such a break point is detected and is supported by a sufficient number of independent sequences on either side, a chimeric sequence is reported. Chimeric sequences are not used in the creation of a cluster.
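A graph view of the break-point test can be sketched as follows: a sequence is flagged when removing it from the overlap graph splits the remainder into two sufficiently supported subclusters. This is a simplified sketch, assuming overlaps are given as an adjacency map; connectivity here stands in for PCP's full pairwise-overlap analysis, and `min_support` is an illustrative parameter.

```python
def _components(adj, removed):
    """Connected components of the overlap graph with one node removed."""
    seen, comps = set(), []
    for start in adj:
        if start == removed or start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            seen.add(n)
            stack.extend(m for m in adj[n] if m != removed and m not in comp)
        comps.append(comp)
    return comps

def find_chimeras(adj, min_support=2):
    """Flag sequences whose removal splits the overlap graph into at
    least two subclusters, each supported by `min_support` independent
    sequences; such a sequence joins two otherwise coherent clusters."""
    chimeras = []
    for node in adj:
        big = [c for c in _components(adj, node) if len(c) >= min_support]
        if len(big) >= 2:
            chimeras.append(node)
    return chimeras

# Two coherent triangles {a,b,c} and {d,e,f} joined only through "x".
adj = {
    "a": {"b", "c", "x"}, "b": {"a", "c", "x"}, "c": {"a", "b"},
    "d": {"e", "f", "x"}, "e": {"d", "f", "x"}, "f": {"d", "e"},
    "x": {"a", "b", "d", "e"},
}
chimeras = find_chimeras(adj)
```

Requiring support on both sides of the break point prevents a sequence at the natural edge of a single cluster from being misreported as chimeric.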

The chimeric and bad end detection algorithms have been incorporated into the Paracel Clustering Package (PCP). We report on both the accuracy of these algorithms and their effect on overall transcript quality.