321.Conformational studies on O-specific polysaccharides of Shigella dysenteriae
322.The risk of failure to detect disease genes due to epistatic interactions
323.A stoichiometric model for the central carbon metabolism of Aspergillus oryzae
324.Computational Antisense Prediction
325.Determination of the active structure of Chemotactic peptides
326.A web-based graphical interface with an efficient algorithm for identifying DNA and protein patterns
327.SAMIE: towards a probabilistic code for protein-DNA interactions
328.Prediction of N-linked glycosylation sites in proteins.
329.Restriction Enzymes Dramatically Enhance SBH
331.PaTre: a method for paralogy tree construction
332.Bioinformatics Services at the Human Genome Mapping Project- Resource Centre.
333.A Bayesian approach to learning Hidden Markov Models with applications to biological sequence analysis
334.Genome-wide Operon prediction in C. elegans
335.The detection of attenuation regulation in bacterial genomes
336.Predicting Protein-Protein Interactions from Sequences
337.A Computational Screen for Novel Targets of SnoRNAs
338.Tackling Biocomputing Tasks using a Meta-Data Framework
339.Automated learning of unknown length motifs in unaligned DNA sequences with genetic algorithms
340.Conservation of CD23 extracellular domain through vertebrate species suggests a functional role in B-lymphocyte differentiation
341.The ERATO Systems Biology Workbench An Integrated Environment for Systems Biology Software
342.A Symmetrizing Transformation for Microarray Data
343."Genquire" - Interactive analysis and annotation of genome databases using multiple levels of data visualization
344.Combinatorial genomics: validation of high throughput protein interaction data with clustered expression profiles

321. Conformational studies on O-specific polysaccharides of Shigella dysenteriae
Rosen, J., Nyholm, P.G., Rabobi, A., Göteborg University;
Mulard, L.A., Pasteur Institute, Paris;
Short Abstract:

Conformational analyses of the O-antigenic polysaccharides of Shigella dysenteriae types 1, 2 and 4 have been performed using modified HSEA and MM3(96). For type 1 the results show two conformers, differing with respect to the α-D-Gal-(1-3)-α-D-GlcNAc linkage. The type 2 and 4 antigens are highly constrained according to the calculations.

One Page Abstract:

Conformational analyses of fragments of the O-antigenic polysaccharides of Shigella dysenteriae type 1, 2 and 4 have been performed using modified HSEA and MM3(96). The sequences of the repeating units of these O-antigens are shown below:

S. dys. type 1: -3)-L-Rha-(1-2)-α-D-Gal-(1-3)-α-D-GlcNAc-(1-3)-α-L-Rha-(1-

S. dys. type 2: -3)-α-D-GalNAc-(1-4)-[α-D-GlcNAc-(1-3)]-α-D-GalNAc-(1-4)-α-D-Glc-(1-4)-α-D-Gal-(1-

S. dys. type 4: -3)-α-D-GlcNAc-(1-3)-[α-L-Fuc-(1-4)]-α-D-GlcpNAc-(1-4)-α-D-GlcpA-(1-3)-α-L-Fucp-(1-

For the type 1 O-antigen the results of the calculations indicate that shorter fragments like the trisaccharide α-L-Rha-(1-2)-α-D-Gal-(1-3)-α-D-GlcNAc and the tetrasaccharide α-L-Rha-(1-2)-α-D-Gal-(1-3)-α-D-GlcNAc-(1-3)-α-L-Rha exist as two different conformers, I and II, differing with respect to the conformation of the α-D-Gal-(1-3)-α-D-GlcNAc linkage (phi/psi = -40°/-40° for I and 10°/30° for II). For the pentasaccharide α-L-Rha-(1-3)-α-L-Rha-(1-2)-α-D-Gal-(1-3)-α-D-GlcNAc-(1-3)-α-L-Rha and longer fragments the calculations indicate a preference for conformation II. Such a conformational change for the α-D-Gal-(1-3)-α-D-GlcNAc linkage is in agreement with previously obtained NMR data. For longer fragments of the polysaccharide the "back-folded" conformation leads to a compact helical conformation with the galactose residues protruding radially from the core of the helix, consistent with the role of L-Rha-(1-2)-α-D-Gal as the major epitope of this O-specific polysaccharide.

The type 2 antigen is fairly constrained due to the branch, which involves three N-acetyl hexosamines. The substitution at the 2-, 3-, and 4-positions of the central GalNAc residue (see formula above) appears to give rise to a single well-defined conformation with restricted flexibility. In the case of the type 4 O-antigen there are very significant steric constraints around the GlcNAc residue at the branching point (see formula above). Despite these constraints there are at least 3 different favoured conformations of the trisaccharide α-D-GlcNAc-(1-3)-[α-L-Fuc-(1-4)]-α-D-GlcNAc. The complexity of the potential energy surfaces suggests that this system is of interest for further studies to validate the force fields used for conformational analyses of saccharides. Furthermore, the results obtained on these systems are of interest for the design of carbohydrate-based vaccines against shigellosis caused by Shigella dysenteriae. It is our intention to compile these structures in a database of favoured conformations of oligosaccharides.

322. The risk of failure to detect disease genes due to epistatic interactions
Gavin A. Huttley, Simon Easteal, Sue Wilson, Centre for Bioinformation Science, ANU;
Short Abstract:

The prevalent paradigm for the analysis of common human diseases assumes a single gene is largely responsible for affecting individual disease risk. We present a theoretical result demonstrating that failure to factor in epistatic interactions, a common component of complex diseases, leads to elevated rates of false positives and negatives.

One Page Abstract:

Empirical evidence from model organisms indicates that the genetic background can strongly influence the phenotype exhibited by a specific genotype due to epistatic (gene-gene) interactions. The prevalent paradigm for the analysis of common human diseases assumes, however, that a single gene is largely responsible for affecting individual disease risk. The consequence of examining each gene as though it were solely responsible for conferring disease risk, when in fact that risk is contingent upon interactions with another disease locus, has not previously been determined. We examined the effect of two (or more) major epistatic disease genes when data are analysed assuming a single disease gene. We produce a general genetic model to analyse so-called triad data (two parents and an affected child) for two marker loci. Based on this model we show that results can vary markedly depending on the parameters associated with the “unidentified” disease gene. The results indicate that if parameters associated with the second gene were to vary between studies, then the conclusions from those studies may also vary. This is a theoretically broad result with important implications for interpreting different results from individual studies and comparing results between studies. It demonstrates that failure to factor in such interactions can lead to an elevated type II error rate, that is, more false negatives. This is particularly troubling for genomic scan type study designs.
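As a rough illustration of why single-gene analyses can disagree between studies, the sketch below computes the disease risk that would be seen at one locus when a second, interacting locus is ignored. It is not the triad model of the abstract: the penetrance values, the two-genotype simplification and the assumption that the two loci are independent in the population are all invented for this example.

```python
def marginal_penetrance(penetrance, freq_b):
    """Disease risk per genotype at locus A when locus B is ignored.

    penetrance[a][b] is P(disease | genotype a at A, genotype b at B);
    freq_b[b] is the population frequency of genotype b at locus B.
    Loci A and B are assumed to be independent in the population.
    """
    return [sum(p * f for p, f in zip(row, freq_b)) for row in penetrance]


# Hypothetical, purely epistatic model: risk is raised only when both
# loci carry the risk genotype (index 1).
PENETRANCE = [
    [0.01, 0.01],  # non-risk genotype at A: baseline risk whatever B is
    [0.01, 0.20],  # risk genotype at A: elevated only with risk genotype at B
]

for risk_freq_b in (0.05, 0.50):
    low, high = marginal_penetrance(PENETRANCE, [1 - risk_freq_b, risk_freq_b])
    print(f"B risk-genotype frequency {risk_freq_b:.2f}: "
          f"apparent relative risk at A = {high / low:.2f}")
```

Under these made-up numbers the apparent effect of locus A changes severalfold with the frequency of the risk genotype at locus B, so a study population in which that genotype is rare could easily miss locus A.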

323. A stoichiometric model for the central carbon metabolism of Aspergillus oryzae
Helga David, Mats Åkesson, Jens Nielsen, Center for Process Biotechnology, BioCentrum-DTU, Technical University of Denmark;
Short Abstract:

A stoichiometric model for the metabolism of Aspergillus oryzae was developed based on the literature. Intracellular compartmentalization was considered, and biochemical reactions and transport reactions were included in the model. Flux balance analysis was combined with linear programming to study the phenotypic behaviour of the microorganism during aerobic glucose-limited continuous cultures.

One Page Abstract:

H. M. David, M. Åkesson, J. Nielsen Center for Process Biotechnology, Biocentrum – DTU, Building 223, Technical University of Denmark, DK – 2800 Lyngby, Denmark (helga.david@biocentrum.dtu.dk)

Genomic sequencing efforts are beginning to produce complete organism-specific genetic information at an extraordinary rate. The functional assignment of individual genes derived from sequence data, which can be viewed as the first level of functional genomics, should soon give rise to a more challenging stage, one in which the focus is on the interrelatedness of gene products and their role in multigenetic cellular functions. Genome-scale flux balance models can be used in the elucidation of the genotype-phenotype relationship and hence represent a potentially valuable approach in metabolic engineering for the design and optimisation of bioprocesses [1].

Filamentous fungi are attractive microorganisms for the industrial production of a range of products, including organic acids, antibiotics and proteins. They have a number of properties that give them competitive advantages over other eukaryotic and prokaryotic organisms, namely the ability to produce and secrete large amounts of protein and the capability to process eukaryotic mRNA, glycosylate proteins and make post-translational modifications, which also make them promising hosts for the production of recombinant proteins [2,3]. Aspergilli combine many of the useful features of bacteria with those of higher organisms [4]. However, little work has been done on mathematical modeling of their metabolism. The complete genome sequence is not yet available for any Aspergillus species, which hampers the development of a genome-scale model for functional analysis in these microorganisms.

As a starting point, a stoichiometric model for the central carbon metabolism of the filamentous fungus Aspergillus oryzae is constructed from literature information [5]. Intracellular compartmentalization is considered, and biochemical reactions as well as transport reactions between the cytosol and mitochondria are included in the model. Flux balance analysis in combination with linear optimisation techniques is used to study the phenotypic behaviour of the microorganism during aerobic glucose-limited continuous cultures. Specifically, the capability of the fungus to maximally produce a metabolite of interest is investigated, and the redirection of flux distributions in the primary metabolism for overproduction of that metabolite is studied.
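For readers unfamiliar with flux balance analysis, the optimisation problem it poses can be written in the generic textbook form below. The symbols are our own notation rather than the authors': S is the stoichiometric matrix (one row per metabolite, with cytosolic and mitochondrial pools of the same compound as separate rows; one column per reaction or transport step), v is the vector of fluxes, c picks out the flux to be maximised (for example the production rate of the metabolite of interest), and the bounds encode irreversibility and the measured glucose uptake rate.

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\min} \le v \le v^{\max}
\end{aligned}
```

The equality constraint expresses the steady-state assumption that no intracellular metabolite accumulates, which is a reasonable approximation for continuous cultures.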

[1] Edwards, J.S., Ramakrishna, R., Schilling, C.H., and Palsson, B.O. (1999) Metabolic Flux Balance Analysis. In: Metabolic Engineering. S.Y. Lee and E.T. Papoutsakis, eds. Marcel Dekker, Inc., New York, pp. 13-57.

[2] Carlsen, M. (1994) α-Amylase production by Aspergillus oryzae. Ph.D. thesis, Department of Biotechnology, Technical University of Denmark.

[3] Pedersen, H. (1999) Protein production by industrial Aspergilli. Ph.D. thesis, Department of Biotechnology, Technical University of Denmark.

[4] Martinelli, S.D. (1994) Aspergillus nidulans as an experimental organism. In: Aspergillus: 50 years on. S.D. Martinelli and J.R. Kinghorn, eds. Elsevier, pp. 33-58.

[5] Pedersen, H., Carlsen, M., and Nielsen, J. (1999) Identification of enzymes and quantification of metabolic fluxes in the wild type and in a recombinant Aspergillus oryzae strain. Appl. Environ. Microbiol. 65, 11-19.

324. Computational Antisense Prediction
Alistair Chalk, Erik Sonnhammer, Center for Genomics Research,Karolinska Institute;
Short Abstract:

We present an in silico model for prediction of antisense oligonucleotide (AO) efficacy. Collecting data from AO experiments in the literature generated a 490 AO dataset. We trained neural networks, an ensemble giving an overall correlation coefficient of 0.38, predicting effective AOs with a success rate of 85%.

One Page Abstract:

Experimental testing of antisense oligonucleotides is time consuming and costly. Here we present an in silico model for prediction of antisense oligonucleotide (AO) efficacy. Collecting sequence and efficacy data from AO scanning experiments in the literature generated a database of 490 AO molecules. Using a set of derived parameters based on AO sequence properties we trained a neural network model. An ensemble of 10 networks gave an overall correlation coefficient of 0.38. This model can predict effective AOs (>50% inhibition of gene expression) with a success rate of 85%. At this threshold the model predicts approximately 2 AOs per 1000 base pairs, making it a stringent yet practical model for AO prediction.
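The two figures quoted above (an ensemble correlation coefficient and a success rate at the 50% inhibition threshold) can be computed along the following lines. This is only a sketch: the abstract does not say how the ensemble was combined or exactly how the success rate was defined, so simple averaging and "fraction of predicted-effective AOs that were measured effective" are assumptions, and the numbers are invented.

```python
from math import sqrt


def ensemble_mean(predictions_per_network):
    """Average the predictions of several networks, one AO at a time."""
    return [sum(values) / len(values) for values in zip(*predictions_per_network)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def success_rate(predicted, measured, cutoff=0.5):
    """Of the AOs predicted above the cutoff, the fraction measured above it."""
    chosen = [m for p, m in zip(predicted, measured) if p > cutoff]
    return sum(m > cutoff for m in chosen) / len(chosen) if chosen else float("nan")


# Made-up inhibition fractions for four AOs from three hypothetical networks.
networks = [
    [0.20, 0.80, 0.60, 0.30],
    [0.40, 0.60, 0.70, 0.10],
    [0.30, 0.70, 0.80, 0.20],
]
measured = [0.10, 0.90, 0.40, 0.30]

combined = ensemble_mean(networks)
print("ensemble prediction:", [round(p, 2) for p in combined])
print("correlation:", round(pearson(combined, measured), 2))
print("success rate:", success_rate(combined, measured))
```

A modest overall correlation and a high success rate are compatible, because the success rate only looks at the small set of AOs scored above the threshold.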

325. Determination of the active structure of Chemotactic peptides
Youssef Wazady, Laboratoire de Recherche, Ecole Supérieur de Technologie, BP 8012 Oasis, Route d'El Jadida, Km 7, Casablanca, Maroc.;
C. H. Ameziane, Département de Chimie, Faculté des Sciences et Techniques Fes Sais, Université Sidi Mohamed Ben Abdell;
Short Abstract:

To identify the biologically active peptide backbone conformation, the chemotactic peptides were studied with the theoretical method PEPSEA. This study shows that the active structure is a beta-turn structure.

One Page Abstract:


The tripeptides formyl-Met-X-Phe-OMe, where X is one of the α,α-disubstituted amino acids Aib (α-aminoisobutyric acid), Acc5 (1-aminocyclopentanecarboxylic acid) or Acc6 (1-aminocyclohexanecarboxylic acid), are active analogs of the chemotactic peptide formyl-Met-Leu-Phe-OMe (fMLPOMe), which is known for its ability to induce the release of lysosomal enzymes. The Acc6 analog is considerably more active than the parent peptide formyl-Met-Leu-Phe-OMe, whereas the Aib and Acc5 analogs are less active. A comparative conformational study of these peptides using PEPSEA, a method for exploring conformational hypersurfaces, shows a flexible structure for the parent peptide, and a tendency to the a turn structure for the three other conformationally constrained peptides.

Key words: Flexible structure, chemotactic peptide, conformation, intramolecular hydrogen bond.

326. A web-based graphical interface with an efficient algorithm for identifying DNA and protein patterns
Jay H. Choi, Gary D. Stormo, Washington University;
jhc1@genetics.wustl.edu, stormo@genetics.wustl.edu
Short Abstract:

We developed a web-based graphical interface for an existing pattern recognition program, Consensus, to help identify, display, and explore DNA and protein patterns. We also developed a new efficient heuristic algorithm to identify consensus patterns, which is especially well suited to a web-based interface.

One Page Abstract:

1. Introduction

In recent years, the world-wide web and the Internet have become invaluable research tools in the biomedical sciences. The easy access, the convenience of the web browser, and the dynamic features of web graphics make a web-based application a good candidate for a program interface. It can also provide features that are not possible in command-line applications.

We developed a web-based graphical interface for the pattern recognition program consensus-v6c [1] to help identify, display, and explore DNA and protein patterns. We also developed a new efficient heuristic algorithm to identify consensus patterns, which is especially well suited to a web-based interface.

2. Consensus package / consensus-vio (iteration-optimization version)

The consensus package is a set of web-based tools, which we developed based on an existing pattern recognition program, consensus-v6c [1]. These tools are used for identifying consensus patterns from unaligned DNA or protein sequences and displaying them in graphical formats, such as a sequence graph and site-highlighted sequence. It also provides the representation of pattern matrix in sequence logo format [2]. The consensus package also includes several auxiliary tools for automating existing command line options as well as several new features such as converting sequence formats and iterative computing with exclusion/inclusion of selected sequence pattern.

Since the set of input sequences in biological applications is often quite large, time complexity becomes a critical factor in the development of sequence analysis software. The time complexity of the algorithm used in consensus-v6c is O(n^2), where n is the number of sequences, which becomes impractical for web-based interfaces as the number of input sequences increases. Therefore, we improved the performance of identifying consensus patterns with our new heuristic method, called consensus-vio. This method approximates the optimal solution by iteration. It uses the order-dependent linear comparison option [3] of consensus-v6c to obtain an initial pattern matrix. Then, it scans through each sequence in the set to find the sites corresponding to the initial matrix and builds a new, enhanced pattern matrix based on the results. This step is repeated until it converges to an optimal pattern matrix. The significance of the pattern matrix is measured by the log-likelihood scoring scheme (information content), P value, and expected frequency [1].
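The scan-and-rebuild loop described above can be sketched as follows. This is not the consensus-vio code: the seeding step (which the authors obtain from the linear comparison option of consensus-v6c) is replaced by a list of initial sites passed in by the caller, the matrix is a plain log-odds matrix against a uniform background with a pseudocount, and all function names are hypothetical. Sequences are assumed to be upper-case DNA at least as long as the pattern.

```python
from math import log

ALPHABET = "ACGT"


def build_matrix(sites, pseudocount=1.0):
    """Log-odds weight matrix (uniform background) from equal-length sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        matrix.append({
            base: log(((column.count(base) + pseudocount) / total) / 0.25)
            for base in ALPHABET
        })
    return matrix


def best_site(sequence, matrix):
    """Highest-scoring window of the matrix width in one sequence."""
    width = len(matrix)
    windows = (sequence[i:i + width] for i in range(len(sequence) - width + 1))
    return max(windows, key=lambda w: sum(col[b] for col, b in zip(matrix, w)))


def refine(sequences, initial_sites, max_rounds=50):
    """Rebuild the matrix from the best site per sequence until nothing changes."""
    sites = list(initial_sites)
    for _ in range(max_rounds):
        matrix = build_matrix(sites)
        new_sites = [best_site(seq, matrix) for seq in sequences]
        if new_sites == sites:
            break
        sites = new_sites
    return sites, build_matrix(sites)


sequences = ["TTTGACGTAAA", "CCGACGTCCCC", "AGACGTTTTTT"]
sites, matrix = refine(sequences, initial_sites=["GACGT"])
print(sites)
```

Each round costs one pass over the sequences, so the work grows linearly with the number of sequences per iteration; the quality of the result depends on the seed, which matches the trade-off discussed in the Results section.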

3. Implementation

The consensus package consists of several core programs, auxiliary programs, and web programs. The core programs, including consensus-v6c, patser-v3c, and consensus-vio, were written in C, whereas all auxiliary and web programs were written in Perl. They were all developed on Sun workstations running Solaris; however, the code is portable to a Linux environment.

4. Results

The performance of consensus-vio (speed and result) is mostly determined by the initial matrix. If the initial matrix has a relatively low expected frequency, the number of iterations needed to reach an optimal matrix is significantly decreased. Thus, consensus-vio obtains the initial matrix using iterative runs of the linear comparison method [3] in consensus-v6c, shuffling the order of the sequences before each cycle. The higher the number of iterations, the lower the probability of getting a “bad” initial matrix (statistically insignificant, mostly caused by a local minimum); however, since the iterations for obtaining an initial matrix consume most of the total execution time of consensus-vio, their number is a trade-off between the speed of the computation and the precision of the results. Consensus-vio was tested on 18 CRP-regulated genes (CRP: E. coli cyclic AMP receptor protein) with 4 randomly generated sequences incorporated, each 105 base pairs (bp) long. It was set to 10 iterative runs of the linear comparison method of consensus-v6c to obtain the initial matrix. The results indicate that consensus-vio was able to achieve the same level of precision as the original algorithm, but with a shorter computation time.

References

[1] Hertz, G. Z. and Stormo, G. D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15: 563-577.

[2] Schneider, T. D. and Stephens, R. M. (1990) Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Research, 18: 6097-6100.

[3] Hertz, G. Z. Hartzell, G. W. and Stormo, G. D. (1990) Identification of consensus patterns in unaligned DNA sequences known to be functionally related. CABIOS, 6: 81-92.

327. SAMIE: towards a probabilistic code for protein-DNA interactions
Panayiotis V. Benos, Washington University;
Alan Lapedes, Los Alamos National Laboratories;
Gary D. Stormo, Washington University;
Short Abstract:

We are investigating the rules behind the recognition of specific DNA targets by particular proteins. We recently presented a data-driven approach to incorporate base-amino acid preferences into a well defined probabilistic model. In this poster, we present the prediction capabilities of our algorithm (SAMIE), trained on phage display data.

One Page Abstract:

1. The aim. Understanding the rules that govern the "recognition" of particular DNA target sites by the corresponding transcription factors.

2. Some history. Currently existing models include exploitation of DNA-protein co-crystal structural data [1,2] as well as non-quantitative models [3,4]. However, the structural information is not always available, and the non-quantitative models cannot make quantitative predictions. We recently presented a data-driven approach to incorporate base-amino acid preferences into a well-defined probabilistic model [5]. In summary, our model consists of a weight matrix that assigns "interaction values" to every base-amino acid pair in every position. This weight matrix corresponds to the effective energy of particular interactions and can be used to explore different models of interactions ("one-to-one", "many-to-many", etc.).
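To make the idea of a position-specific base-amino acid weight matrix concrete, the sketch below scores a DNA site against a set of contacting residues under a "one-to-one" model and picks the best-scoring target. It is an illustration only, not SAMIE itself: the weights are invented, the function names are hypothetical, and the convention that a larger value means a more favourable contact is an assumption.

```python
def interaction_score(weights, bases, residues):
    """Sum the interaction values of the aligned base/amino-acid contacts.

    weights[i] maps (base, amino_acid) pairs to the value for contact
    position i; pairs that are not listed count as 0.0.
    """
    if not (len(weights) == len(bases) == len(residues)):
        raise ValueError("weights, bases and residues must have equal length")
    return sum(w.get((b, a), 0.0) for w, b, a in zip(weights, bases, residues))


def best_target(weights, residues, alphabet="ACGT"):
    """Best-scoring DNA site for the given contacting residues.

    In a one-to-one model each position contributes independently, so the
    best site is simply the best base at every position.
    """
    return "".join(
        max(alphabet, key=lambda b: w.get((b, a), 0.0))
        for w, a in zip(weights, residues)
    )


# Invented interaction values for three contact positions.
WEIGHTS = [
    {("G", "R"): 2.0, ("A", "R"): 0.5},
    {("A", "N"): 1.5},
    {("C", "D"): 1.0, ("T", "D"): 0.2},
]

site = best_target(WEIGHTS, "RND")
print(site, interaction_score(WEIGHTS, site, "RND"))
```

A "many-to-many" model would replace the per-position dictionaries with values indexed by pairs of positions, at the cost of many more parameters to estimate.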

3. Case study: EGR family. The family of the Early Growth Response factors (EGR) is the best studied to date with respect to protein-DNA interactions. EGR genes encode Zn-finger proteins that were first identified in mammals and later found in a variety of other species. The structure of the three Zn-fingers of the mouse protein bound to its consensus DNA sequence was initially solved crystallographically at 2.1 Å [6] and subsequently refined to 1.6 Å [7]. The target site is now believed to be 10 bp long, and each finger contacts 4 of these bases (with a one-base overlap between the target sites of adjacent fingers) [7]. The topology of the molecules in the solved crystal structure showed that each of four "critical" amino acids in every finger could contact one base on the target site. It was also found that the three fingers bind the DNA in a modular fashion, independently of each other.

A number of studies have followed, resulting in the development of methods to select randomized DNA sites that bind particular (fixed) proteins (SELEX) or vice versa (phage display) [3,8,9]. In our previous study we tested our algorithm on SELEX data of the EGR family, collected from the literature. SAMIE was able to predict the DNA target sequences adequately, both for its own training set and for data that were not included in it (given the corresponding protein sequences).

4. Results: training on phage display data. In order to evaluate further the potential of our method, we trained SAMIE on phage display data. Although the algorithm is essentially the same as before, special care was given to the amino acid reference ("background") probabilities. Different amino acid randomisation schemes, imposed by the particular phage display experiments, were taken into consideration during training. Moreover, the problem of predicting the protein sequence for a particular DNA target is much more difficult than its reverse (predicting the DNA target given the protein sequence). The phage display data set that we used for training was also obtained from the literature, and the training was done according to the "one-to-one" model of interactions (i.e. one amino acid contacts one base). We tested the prediction capabilities of the trained model on its own training set and on independent data. In all cases, the predictions using SAMIE are better than those of methods existing in the literature. In addition, our methodology offers the potential to capitalise on quantitative detail, as well as to be used to explore more general models of interactions, given the availability of data.


[1] Y. Mandel-Gutfreund and H. Margalit (1998) Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites. Nuc. Acids Res. 26:2306-2312.

[2] M. Suzuki and N. Yagi, (1994), DNA recognition code of transcription factors in the helix-turn-helix, probe helix, hormone receptor, and zinc families. Proc. Natl. Acad. Sci., U.S.A. 91: 12357-12361.

[3] Y. Choo and Klug, A. (1995), Designing DNA-binding proteins on the surface of filamentous phage. Current Biology 6:431-436.

[4] S.A. Wolfe, H.A. Greisman, E.I. Ramm and C.O. Pabo (1999), Analysis of Zinc Fingers Optimized via Phage Display: Evaluating the Utility of a Recognition Code. J. Mol. Biol. 285: 1917-1934.

[5] P.V. Benos, A.S. Lapedes, D. Fields and G.D. Stormo (2001), SAMIE: Statistical Algorithm for Modeling Binding Energies. Proc. Pac. Symp. Bioc. 6: 115-126.

[6] N.P. Pavletich and C.O. Pabo (1991), Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1 Å. Science, 252: 809-817.

[7] M. Elrod-Erickson, M.A. Rould, L. Nekludova and C.O. Pabo (1996), Zif268 protein-DNA complex refined at 1.6 Å: a model system for understanding zinc finger-DNA interactions. Current Biology, 4: 1171-1180.

[8] S.A. Wolfe, H.A. Greisman, E.I. Ramm and C.O. Pabo (1999), Analysis of Zinc fingers optimised via phage display: evaluating the utility of a recognition code. J. Mol. Biol., 285: 1917-1934.

[9] M. Isalan and Y. Choo (2000), Engineered Zinc finger proteins that respond to DNA modification by HaeIII and HhaI methyltransferase enzymes. J. Mol. Biol., 295: 471-477.

328. Prediction of N-linked glycosylation sites in proteins.
Ramneek Gupta, CBS, Technical University of Denmark;
Eva Jung, Swiss Institute of Bioinformatics;
Mario Soumpasis, Soren Brunak, CBS, Technical University of Denmark;
Short Abstract:

Artificial neural networks were used to predict N-linked glycosylation sites from adjoining sequence context. In a cross-validated performance, the networks carried an overall accuracy of 76% in distinguishing glycosylated and non-glycosylated sequons (Asn-Xaa-Ser/Thr). The method can be optimised for specificity or sensitivity. A webserver shall be available at http://www.cbs.dtu.dk/services/NetNGlyc/

One Page Abstract:

Contrary to widespread belief, acceptor sites for N-linked glycosylation on protein sequences are not well characterised. The consensus sequence, Asn-Xaa-Ser/Thr (where Xaa is not Pro), is known to be a prerequisite for the modification. However, not all of these sequons are modified, and the consensus therefore does not discriminate between glycosylated and non-glycosylated asparagines. We train artificial neural networks on the surrounding sequence context in an attempt to discriminate between acceptor and non-acceptor sequons. In a cross-validated performance, the networks could identify 86% of the glycosylated and 61% of the non-glycosylated sequons, with an overall accuracy of 76%. The method can be optimised for high specificity or high sensitivity. Also presented is a structural study on the local conformation of the acceptor site, using topological kappa-tau measures.
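Locating the candidate sequons is the easy first step; a short sketch is given below. It only applies the Asn-Xaa-Ser/Thr rule (Xaa not Pro) stated above and says nothing about which candidates are actually glycosylated, which is the job of the neural networks. The function name and the example peptide are made up.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr. The lookahead lets
# overlapping sequons (e.g. "NNSS") all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein):
    """Return 1-based positions of the Asn in every N-X-S/T sequon (X != P)."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


print(find_sequons("MNGTANPSNNSS"))
```

In this toy peptide the Asn in "NPS" is skipped because of the proline, while the two overlapping sequons near the C-terminus are both reported.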

Glycosylation is an important post-translational modification, and is known to influence protein folding, localisation and trafficking, protein solubility, antigenicity, biological activity and half-life, as well as cell-cell interactions. We investigate the spread of known N-glycosylation sites across functional categories of the human proteome.

An N-glycosylation site predictor for human proteins shall be made available at http://www.cbs.dtu.dk/services/NetNGlyc/

329. Restriction Enzymes Dramatically Enhance SBH
Esti Yeger-Lotem, Benny Chor, Sagi Snir, Computer Science dept., Technion, Haifa 32000, Israel.;
Zohar Yakhini, Computer Science dept., Technion, Haifa 32000, Israel, and Life Science Technologies Laboratory, Agilent Technologies;
Short Abstract:

The number of n-long DNA sequences, consistent with a given SBH spectrum, grows exponentially with n. By incorporating data from a small number of restriction enzyme digestion assays we dramatically reduce the number of consistent sequences. In other words, the combined information uniquely identifies much longer DNA sequences.

One Page Abstract:

Sequencing by hybridization, or SBH [1], is a DNA sequencing technique based on DNA micro-array hybridization assays. A single stranded target DNA is hybridized to a DNA micro-array containing all possible k-mers (words of length k over the alphabet {A,C,G,T}). This hybridization experiment produces the spectrum of the target DNA, namely a list of all k-mers found at least once in the sequence. The spectrum does not reveal the location of any k-mer in the sequence, nor does it count the number of its appearances. Given the spectrum and the sequence length, the next step is to identify all sequences of this length that are consistent with the spectrum. Sequencing is successful if a single sequence is identified. Unfortunately, the number of sequences consistent with a given spectrum increases exponentially with the sequence length. For example, the 8-mer spectra of 200 base long sequences are, with high probability, consistent with a single sequence, but for length 900 the average number of consistent sequences is over 35,000. In this work we tackle this combinatorial explosion problem by combining the SBH spectrum with information from enzymatic digestion assays.
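The ambiguity described above is easy to reproduce on a small scale. The sketch below computes the k-mer spectrum of a sequence as a set (so positions and multiplicities are lost, as in the hybridization experiment) and shows two different 10-base sequences that share the same 3-mer spectrum. The function name is our own.

```python
def spectrum(sequence, k):
    """Set of all k-mers occurring at least once in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


first = "ATGCGTGGCA"
second = "ATGGCGTGCA"

print(sorted(spectrum(first, 3)))
print(spectrum(first, 3) == spectrum(second, 3))
```

Because both sequences have the same length and the same spectrum, the hybridization data alone cannot tell them apart.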

As opposed to previous enhancements to SBH that drew upon prior sequence information [3], upon iterated hybridization assays [4], or upon the utilization of novel nucleic acid chemistry [5,6], our approach is based on additional information coming from a very standard technique in molecular biology. In general, restriction enzymes recognize a specific short motif in a DNA sequence, and cut the DNA at all sites of this motif. In a complete digestion experiment, the output is a list of DNA fragment lengths, such that no fragment contains the restriction motif. The procedure we propose is the following: In addition to the hybridization assay, a small number (1-4) of digestion assays are conducted, each with a different restriction enzyme. In the computational phase of identifying consistent sequences, we consider both the hybridization and digestion information. This dramatically reduces the number of consistent sequences. In other words, the combination of SBH and a few complete digest assays increases, very significantly, the length of sequences that can be uniquely determined.
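A complete digest can be simulated in a few lines. The sketch below is an idealised model, not a description of any particular enzyme: the motif, the cut position inside the motif (cut_offset, counted from the motif start) and the function name are assumptions chosen for illustration. With a cut strictly inside the motif, no resulting fragment contains the motif, as the text requires.

```python
def digest_lengths(sequence, motif, cut_offset=1):
    """Sorted fragment lengths from a complete digest of a linear sequence.

    Every occurrence of the motif (overlapping ones included) is cut
    cut_offset bases after the position where the motif starts.
    """
    cuts = []
    start = sequence.find(motif)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(motif, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]) if b > a)


# Two sequences with identical 3-mer spectra, digested at a hypothetical
# two-base motif "GC" that is cut between the G and the C.
print(digest_lengths("ATGCGTGGCA", "GC"))
print(digest_lengths("ATGGCGTGCA", "GC"))
```

The two sequences give different fragment length lists, so even one digest is enough to separate this pair, although the spectrum alone is not.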

Algorithmically, the SBH spectrum is used to build a graph which is a sub-graph of the de-Bruijn graph of rank k-1. In this graph vertices represent (k-1)-mers and edges represent k-mers in the spectrum. A consistent sequence corresponds to a directed path in the graph that covers every edge in the graph at least once [2]. Since we do not know beforehand which edges are to be covered more than once, a tree like computation proceeds by trying all possibilities. The typical size of this tree is exponential in the number of edges (observed k-mers). What the digestion data does for us is to extensively prune this tree. This has the effect of substantially reducing the number of consistent sequences, as well as the overall computation time.
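The search and its pruning can be sketched as a depth-first walk over the spectrum graph. This is a simplified stand-in for the authors' algorithm, under assumptions of our own: one base is added at a time, a single digest is given as a hypothetical (motif, cut_offset, observed_lengths) triple, and a branch is abandoned as soon as a completed fragment has a length that the digest did not report.

```python
from collections import Counter


def spectrum(sequence, k):
    """Set of all k-mers occurring at least once in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def cut_positions(sequence, motif, cut_offset):
    """Cut positions of a complete digest (overlapping motifs included)."""
    cuts, start = [], sequence.find(motif)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(motif, start + 1)
    return cuts


def consistent_sequences(kmers, length, digest=None):
    """All sequences of the given length whose spectrum equals kmers.

    digest, if given, is (motif, cut_offset, observed_lengths) and is used
    both to prune partial sequences and to check complete ones.
    """
    k = len(next(iter(kmers)))
    results = []

    def closed_fragments(prefix):
        # Fragments between cuts already present cannot change when the
        # prefix is extended; the piece after the last cut is still open.
        bounds = [0] + cut_positions(prefix, digest[0], digest[1])
        return [b - a for a, b in zip(bounds, bounds[1:])]

    def extend(prefix, used):
        if digest is not None:
            observed = Counter(digest[2])
            if Counter(closed_fragments(prefix)) - observed:
                return  # a finished fragment has an unobserved length
        if len(prefix) == length:
            if used != kmers:
                return
            if digest is not None:
                fragments = closed_fragments(prefix)
                fragments.append(length - sum(fragments))
                if sorted(fragments) != sorted(digest[2]):
                    return
            results.append(prefix)
            return
        for base in "ACGT":
            kmer = prefix[len(prefix) - k + 1:] + base
            if kmer in kmers:
                extend(prefix + base, used | {kmer})

    for start in sorted(kmers):
        extend(start, {start})
    return sorted(results)


kmers = spectrum("ATGCGTGGCA", 3)
print(consistent_sequences(kmers, 10))
print(consistent_sequences(kmers, 10, digest=("GC", 1, [2, 3, 5])))
```

On this toy input the spectrum alone leaves two candidates, and the digest data removes one of them partway through the search rather than after it has been fully built, which is the source of the saving in computation time.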

We used simulations to test this approach. Example results are as follows. For 900-base-long DNA sequences and 8-mer SBH, two restriction enzymes reduce the average number of consistent sequences from about 35,000 to close to 1. For 30,000-base-long DNA sequences and 11-mer spectra, four restriction enzymes reduce the average number of consistent sequences from the order of a billion to close to 1. These results are far better than those of any sequencing technique currently in use for a single multi-component assay.

References:

[1] R. Drmanac, I. Labat, I. Bruckner, and R. Crkevenjakov, Sequencing of Megabases Plus by Hybridization, 1989, Genomics, 4, pp 114-128.

[2] P.A. Pevzner, 1-Tuple DNA Sequencing: Computer Analysis, 1989, J. Biomol. Struct. Dyn., 7: 63-73.

[3] I. Pe’er and R. Shamir, Spectrum Alignment: Efficient Resequencing by Hybridization, Proc. ISMB ’00, pp. 260-268.

[4] S. S. Skiena and G. Sundaram, Reconstructing Strings from Substrings, 1995, J. Computational Biology 2:333-353.

[5] P. A. Pevzner, Y. P. Lysov, K. R. Khrapko, A. V. Belyavsky, V. L. Florentiev, and A. D. Mirzabekov, 1991, Improved Chips for Sequencing by Hybridization, J. Biomol. Struct. Dyn. 9:399-410.

[6] F. P. Preparata, A. Frieze, and E. Upfal, Optimal Reconstruction of a Sequence from its Probes, 1999, J. Computational Biology 6(3-4):361-368.

331. PaTre: a method for paralogy tree construction
Roberto Marangoni, Antonio Savona, Paolo Ferragina, Antonio Frangioni, Nadia Pisanti, Dept. of Informatics, University of Pisa, Italy;
Romano Felicioli, CISSC - university of Pisa;
Short Abstract:

PaTre is a method that constructs a paralogy tree starting from a family of paralogs. In such a tree, the nodes represent the paralog genes, while the oriented arcs represent the relationship "matrix->copy", since we suppose that a mechanism of duplication-with-modification underlies the generation of paralogs.

One Page Abstract:

Genes belonging to the same organism are called paralogs when their sequences show significant similarity, even if they have different biological functions. An emerging biological paradigm holds that the families of paralogs in a genome derive from a mechanism of gene duplication-with-modification, repeated many times in the history of the organism. This paradigm could underlie the increase in the complexity of organisms observed during evolution. To understand how such a process could have taken place, it is necessary to place the paralogs belonging to the same family in a tree that describes the history of their appearance in the genome: a paralogy tree. Here we present a method, called PaTre, which generates paralogy trees from a family of genes given as input. The reliability of the inferential process has been tested by means of a simulator that implements different hypotheses on the duplication-with-modification paradigm. The simulator takes a sequence as input and generates copies of it, which are modified according to probability distributions derived from statistical genomics. These sequences are then used to test the robustness of PaTre. The experimental results show that PaTre constructs a set of paralogy trees which always contains the correct one. The size of this set is related to the completeness of the input set of sequences; in particular, when the input set is complete, PaTre constructs very few paralogy trees. A user could exploit this property to measure the incompleteness of an input set of sequences. The robustness and biological applications of PaTre will be discussed.
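
To make the simulator idea concrete, here is a minimal sketch of a duplication-with-modification process that records the true "matrix->copy" arcs. It is an illustration under stated assumptions, not the simulator used in this work: the function name is hypothetical, and uniform point substitution at a fixed rate stands in for the probability distributions derived from statistical genomics.

```python
import random

def simulate_family(ancestor, n_copies, mutation_rate=0.05, seed=0):
    """Grow a paralog family by repeated duplication-with-modification.

    Returns (genes, arcs), where arcs holds (matrix_index, copy_index)
    pairs, i.e. the true paralogy tree against which an inferred tree
    could be compared. The mutation model is a placeholder assumption.
    """
    rng = random.Random(seed)
    genes = [ancestor]
    arcs = []
    for _ in range(n_copies):
        matrix = rng.randrange(len(genes))  # any existing gene may be duplicated
        copy = "".join(
            # A drawn base may equal the original, so the effective
            # substitution rate is somewhat below mutation_rate.
            rng.choice("ACGT") if rng.random() < mutation_rate else base
            for base in genes[matrix]
        )
        genes.append(copy)
        arcs.append((matrix, len(genes) - 1))
    return genes, arcs

genes, arcs = simulate_family("ACGT" * 10, n_copies=5)
```

Dropping some entries of `genes` before inference would mimic an incomplete input set, the situation in which the text reports a larger set of candidate trees.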

332. Bioinformatics Services at the Human Genome Mapping Project Resource Centre.
T. J. Carver, G. Williams, M. Bishop, UK MRC HGMP-RC;
Short Abstract:

The UK MRC established the Human Genome Mapping Project Resource Centre (HGMP-RC) in 1990. It provides biological materials, access to databases, information on the tools to analyse them, as well as training and a helpdesk. The service is primarily for scientists engaged in functional and comparative genomics and proteomics.

One Page Abstract:

The UK Medical Research Council (MRC) established the Human Genome Mapping Project Resource Centre (HGMP-RC) in 1990. Its role is to provide biological materials, access to databases, information on the tools to analyse them, as well as training and help in the use of its facilities. The service is primarily for scientists engaged in functional and comparative genomics and proteomics. The HGMP staff operate bioinformatics and biology helpdesks to answer any user queries. There are a number of bioinformatics training courses that aim to show how to use the resources effectively.

The HGMP-RC provides support for a wide range of Bioinformatics applications. Presented in this poster are some of the Bioinformatics resources and applications available at the HGMP-RC. For example, tools for assisting in nucleotide analysis (NIX) and protein identification and analysis (PIX) are presented. The aim of these web based tools is to provide a web page of results to obtain a consensus of information from a number of programs. Also, the Bioinformatics applications group are involved in the development of the free EMBOSS package (the European Molecular Biology Open Software Suite) which is being specially developed for the needs of the molecular biology user community.



333. A Bayesian approach to learning Hidden Markov Models with applications to biological sequence analysis
Alexander Schliep, ZAIK, University of Cologne;
Short Abstract:

Hidden Markov Models (HMMs), in the form of profile HMMs, are among the most widely used tools in bioinformatics. We have developed a Bayesian approach for learning general (non-profile) HMMs from sequence data, in which the model size and specificity are user-controllable via a single specified prior distribution.

One Page Abstract:

Hidden-Markov-Models (HMMs) are a widely and successfully used tool in statistical modeling and statistical pattern recognition with gene finding being one of the prime examples in bioinformatics. One fundamental problem in the application of Hidden Markov Models is finding the HMM's underlying architecture or topology especially when there is no strong evidence towards a specific topology from the application domain (e.g., when doing black box modeling). This is evidenced by the predominance of profile HMMs in biological applications.

Topology is important with regard to good parameter estimates and with regard to performance: a model with "too many" states, and hence too many parameters, requires too much training data, while a model with "not enough" states prevents the HMM from capturing subtle statistical patterns. To determine the "optimal" topology, either knowledge from the application domain is used or a trial-and-error procedure using ad hoc methods (i.e., model surgery) is employed; systematic procedures have rarely been considered (e.g., Bayesian model merging, Stolcke and Omohundro).
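
The link between the number of states and the number of parameters can be made explicit with a short calculation. The sketch below counts the free parameters of a fully connected (ergodic) discrete HMM; the function name is ours, and the full connectivity is an assumption chosen as the worst case, since a sparser topology would have fewer transition parameters.

```python
def hmm_free_parameters(n_states, alphabet_size):
    """Free parameters of a fully connected discrete HMM.

    Each probability row sums to one, so a row of length m
    contributes m - 1 free parameters.
    """
    transitions = n_states * (n_states - 1)
    emissions = n_states * (alphabet_size - 1)
    initial = n_states - 1
    return transitions + emissions + initial

# DNA alphabet (4 symbols): growth is quadratic in the number of states.
small = hmm_free_parameters(2, 4)
large = hmm_free_parameters(10, 4)
```

For a DNA alphabet this gives 9 free parameters with 2 states and 129 with 10 states, which is why an oversized topology quickly outruns the available training data.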

We have developed a novel algorithm that will infer an HMM representation of the (ergodic) process generating a sequence, without prespecifying the topology of the model. That is, we infer the number of hidden states, the structure of transitions between states and the corresponding transition and emission probabilities. We use a Bayesian approach where one suitable prior forces generalization, allowing control over the model size produced.

We will present the algorithm, some of our theoretical results, and results from numerical experiments on biological sequence data.

334. Genome-wide Operon prediction in C. elegans
Wei Lu, John Hensley, Michael Man, Pfizer Global Research & Development - Ann Arbor Laboratories;
Thomas Blumenthal, University of Colorado Health Sciences Center;
Short Abstract:

We have developed a machine learning approach for predicting operons in C. elegans. Genomic and experimental features were analyzed for building our model. Every gene in the WormBase has been assigned a probability score. Approximately 23% of C. elegans genes have positive scores and are likely to be operon genes.

One Page Abstract:

The C. elegans genome contains operons, clusters of closely spaced genes transcribed together from a common 5' regulatory region, whose mRNAs are created by further processing of the polycistronic precursor RNA. The existence of operons in C. elegans provides opportunities for studying an important mechanism of coordinated gene expression in this experimental organism. A computational approach for predicting operons in the C. elegans genome has been developed. We take a probabilistic machine learning approach to generate predictive models, by analyzing genomic as well as experimental data (features) that have distinct probability distributions between operon genes and non-operon genes. Based on the analyses of a training set consisting of 884 downstream operon genes, 187 first genes in operons, and 360 non-operon genes, models have been established for predicting either downstream operon genes or first genes in operons. Each of the 19,627 annotated genes in WormBase, including confirmed genes as well as predicted genes from the sequenced genome, has been assigned a log-likelihood ratio score indicating its probability of being an operon gene. The method has been validated using known examples of operon genes. Using a log-likelihood ratio score greater than zero, the sensitivity of prediction is approximately 90%, and the false positive rate is around 10%. For comparison, the same training sets were analyzed using Discriminant Analysis, a standard classification procedure based on Bayes' theorem and the assumption of a multivariate normal distribution. Although the classification of downstream operon genes had a low error rate of 6%, the classification of control genes had an error rate of 64%, and the classification of first genes in operons had an error rate of 73%. Using a log-likelihood ratio value of greater than zero, approximately 23% of C. elegans genes are predicted to be operon genes.
Further examination and validation of these predicted operon genes will enrich our understanding about the extent and organization of operons in the C. elegans genome, and provide us with more insight into the relationships between genes contained in the same operons. The machine learning method presented here does not employ arbitrary selection criteria, and can be updated to incorporate data from more features, such as expression profiling data or functional annotation data when they become available.
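
A log-likelihood ratio score of the kind described above can be sketched as follows. This is a toy illustration rather than the model of this work: the two discretised features ("gap" and "strand"), the training counts, the add-one smoothing and the assumption that features contribute independently are all ours.

```python
import math
from collections import Counter

def fit_distribution(observed, values):
    """Add-one smoothed probability of each feature value."""
    counts = Counter(observed)
    total = len(observed) + len(values)
    return {v: (counts[v] + 1) / total for v in values}

def llr_score(gene, operon_model, background_model):
    """Sum over features of log P(value | operon) / P(value | non-operon).

    Summing assumes independent features (a simplifying assumption).
    A score above zero favours the operon hypothesis.
    """
    return sum(
        math.log(operon_model[f][v] / background_model[f][v])
        for f, v in gene.items()
    )

# Hypothetical training data for two discretised features.
values = {"gap": ["short", "long"], "strand": ["same", "opposite"]}
operon_train = {"gap": ["short"] * 9 + ["long"], "strand": ["same"] * 10}
other_train = {"gap": ["short"] * 2 + ["long"] * 8,
               "strand": ["same"] * 5 + ["opposite"] * 5}

operon_model = {f: fit_distribution(operon_train[f], values[f]) for f in values}
background_model = {f: fit_distribution(other_train[f], values[f]) for f in values}

close_gene = {"gap": "short", "strand": "same"}
distant_gene = {"gap": "long", "strand": "opposite"}
close_score = llr_score(close_gene, operon_model, background_model)
distant_score = llr_score(distant_gene, operon_model, background_model)
```

Because the score is a sum of per-feature terms, a further feature such as expression profiling data adds one more term without changing the rest of the model, which matches the extensibility noted in the text.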

335. The detection of attenuation regulation in bacterial genomes
Warren C. Lathe, Ph.D., Mikita Suyama, Peer Bork, European Molecular Biology Laboratory, Heidelberg, Germany;
Short Abstract:

We report the prediction of over 260 upstream regions of genes with possible attenuation regulation in Bacillus subtilis and Escherichia coli. These predictions are based on a computational method devised from characteristics of known terminator fold candidates and benchmark regions of entire genomes.

One Page Abstract:

Many operons of biochemical pathways in bacterial genomes are regulated by a process called attenuation. Though the specific mechanisms can differ considerably, attenuation in these operons has in common the termination of transcription by an RNA 'terminator' fold upstream of the first gene in the operon. In the past, detecting regulation by attenuation has been a long process of experimental trial and error, on a case-by-case basis. We report here the prediction of over 260 upstream regions of genes with attenuation regulation structures in Bacillus subtilis and Escherichia coli. These predictions are based on a reliable computational method devised from characteristics of known terminator fold candidates and benchmark regions of entire genomes.

336. Predicting Protein-Protein Interactions from Sequences
Rajarshi Bandyopadhyay, Xin-Xing Tan, Devika Subramanian, Kathleen S. Matthews, Rice University, Houston, USA;
Short Abstract:

We study underlying distributions of residue properties (charge, hydrophobicity, size) in fixed-length sequence windows to form probabilistic models of this 'reduced' space. We learn the distribution parameters of 'interesting' windows from SH3-PXXP interactions in the DIP, and apply Bayesian inference to score interactions among general SH3-PXXP protein pairs.

One Page Abstract:

We address the problem of predicting protein-protein interactions using sequence information. In particular, we focus on well-known motif-motif interactions, such as SH3-PXXP. One of our approaches is to study the underlying (discrete) probability distributions in the 'reduced' space of chemical/physical residue properties, i.e. charge, hydrophobicity and size. In this reduced space, we compare fixed-length windows of residues (which include the PXXP motif) taken from multiple PXXP proteins which interact with the same SH3 protein, according to the Database of Interacting Proteins (DIP). Our conjecture is that PXXP proteins interacting with the same SH3 protein should show high homology at binding sites in this reduced space. Using a robust statistical criterion such as information-theoretic entropy to determine homology, we select a small set of windows which we consider 'interesting' as putative binding sites. We find that the distributions of charge, hydrophobicity and size properties in our interesting set differ significantly from those obtained by a large random sampling of proteins in SwissProt, and also from those of PXXP windows in general. Based on the above observations, we apply a Bayesian inference-based approach. Starting from known interactions in the DIP, we observe the distribution of properties in the reduced space. We learn the underlying parameters of these distributions in a set of 'interesting' windows and form simple probabilistic models of the interactions. We rank hypotheses regarding interactions between a pair of proteins in the general population of proteins and score these putative interactions using Bayesian inference on our learned probabilistic models. We believe that this and other computational and learning techniques may provide useful insights into classifying and predicting protein-protein interactions.
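
The projection into a reduced property space and the entropy criterion can be sketched as below. This is an illustrative assumption-laden example, not the pipeline of this work: only the charge property is shown (K and R positive, D and E negative, everything else neutral), the window flank of three residues is arbitrary, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Charge classes only; hydrophobicity and size would need their own maps.
CHARGE = {**{aa: "+" for aa in "KR"}, **{aa: "-" for aa in "DE"}}

def pxxp_windows(protein, flank=3):
    """Fixed-length windows centred on each (possibly overlapping) PXXP motif."""
    windows = []
    for match in re.finditer(r"(?=P..P)", protein):
        start, end = match.start() - flank, match.start() + 4 + flank
        if start >= 0 and end <= len(protein):
            windows.append(protein[start:end])
    return windows

def reduce_window(window):
    """Project residues onto the reduced charge alphabet {+, -, 0}."""
    return "".join(CHARGE.get(aa, "0") for aa in window)

def column_entropy(windows):
    """Shannon entropy (bits) of each column across equal-length windows."""
    entropies = []
    for column in zip(*windows):
        n = len(column)
        counts = Counter(column)
        entropies.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return entropies

window = pxxp_windows("AKRPLLPDEA")[0]
reduced = reduce_window(window)
```

Columns with entropy near zero across windows from partners of the same SH3 protein would be the conserved positions in this reduced space; windows with many such columns correspond to the 'interesting' candidates of the text.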

337. A Computational Screen for Novel Targets of SnoRNAs
Steven Johnson, Department of Genetics, Washington University, St. Louis, MO;
Sean Eddy, HHMI & Washington University;
Short Abstract:

C/D box snoRNAs contain antisense, or guide, sequences that direct the modification of rRNA. "Orphan" snoRNAs have been identified that possess no convincing complementarity to rRNA. We hypothesize that snoRNAs are modifying a variety of cellular RNAs. We are using comparative genomics to screen for candidate novel snoRNA targets.

One Page Abstract:

While it has been known for some time that snoRNAs are involved in eukaryotic rRNA modification, there is mounting evidence that they may play a much larger cellular role. C/D box snoRNAs contain a 9-21 nucleotide guide, or antisense, sequence that is complementary to their target RNA. This guide sequence dictates the nucleotide that is modified. In recent years, "orphan" snoRNAs have been discovered that do not possess convincing complementarity to rRNA. SnoRNAs have now been identified in human and mouse that target RNAs in the spliceosome, and some have suggested that snoRNAs are targeting mRNA.(1,2) Additionally, it has been shown that snoRNAs are also present in the kingdom Archaea.(3) We hypothesize that snoRNAs may target many RNAs in the cell and we are carrying out a systematic search for novel snoRNA targets. We will be performing a computational screen for snoRNA targets in P. abyssi, P. furiosus, and P. horikoshii. The Pyrococci are unique in that orthologous snoRNAs have been identified in all three of the fully sequenced species. We are in the process of identifying candidate novel targets that are conserved in all three species. The candidates that are identified from this screen will then be experimentally verified.
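
The core of such a screen is a search for sites complementary to a guide sequence. The following minimal sketch shows that step only; it is not the screen described here. It assumes Watson-Crick pairing without G-U wobble, a simple mismatch threshold, and hypothetical toy sequences; conservation across the three species would be a further filter on the hits.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Reverse complement of an RNA string over the alphabet A, C, G, U."""
    return rna.translate(COMPLEMENT)[::-1]

def find_target_sites(guide, target, max_mismatches=0):
    """Positions in target that pair with the guide.

    Returns (start, mismatches) tuples. G-U wobble pairs are counted as
    mismatches here, which is a simplifying assumption.
    """
    site = reverse_complement(guide)
    hits = []
    for start in range(len(target) - len(site) + 1):
        window = target[start:start + len(site)]
        mismatches = sum(a != b for a, b in zip(site, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

guide = "AUGGCUAAC"            # hypothetical 9-nt guide
target = "CCGUUAGCCAUGG"       # hypothetical target RNA
hits = find_target_sites(guide, target)
```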

References: 1) Tycowski, et al. Molecular Cell 1998. 2:629-638. 2) Cavaille, et al. Proc. Natl. Acad. Sci. USA 2000. 97(26):14311-14316. 3) Omer, et al. Science 2000. 288:517-522.

338. Tackling Biocomputing Tasks using a Meta-Data Framework
Peter Ernst, Rüdiger Bräuning, Mechthilde Falkenhahn, Karl-Heinz Glatting, Agnes Hotz-Wagenblatt, Mark van der Linden, Barbara Pardon, Coral del Val, Deutsches Krebsforschungszentrum (DKFZ, German Cancer Research Center), Department of Molecular Biophysics;
Short Abstract:

We describe the development of a common framework for tasks in bioinformatics. This framework enables the interweaving of individual applications into a grid of work- and data-flows by specifying meta-data. Using a common framework it is possible to implement new tasks simply by writing configuration files.

One Page Abstract:

Many real-world questions in bioinformatics can only be answered by calling on several individual, well-established applications, each of which uses different computing methods and databases.

We describe the development of a common framework for tasks in bioinformatics. This framework enables the interweaving of individual applications into a grid of work- and data-flows by specifying meta-data.

The rules in the meta-data controlling the work- and data-flow describe the interoperation of the individual applications and the merging of the individual outputs into a common output report. The object-oriented wrapping of applications is supported by custom output parsers.

Using a common framework it is possible to implement new tasks simply by writing configuration files.

The following tasks have already been implemented:

ProtSweep: analysis and possible identification of protein sequences by combining different types of database searches

DNASweep: analysis of an unknown DNA sequence by combining database searches, exon/intron recognition and promoter prediction

DomainSweep: analysis of an unknown protein sequence by combining searches against different protein domain and family databases

PATH: estimation and realization of phylogenies using all major phylogenetic analysis methods

We have already implemented a similar meta-data approach in W2H, a WWW interface to GCG/EMBOSS/HUSAR (www.w2h.dkfz-heidelberg.de).

References: M. Senger, P. Ernst, K.-H. Glatting (2000) Distributed application management in bioinformatics, in: "Genomics and Proteomics: Functional and Computational Aspects", S. Suhai, Ed., Kluwer Academic/Plenum Publishers, New York, 215-229.

M. Senger, T. Flores, K.-H. Glatting, P. Ernst, A. Hotz-Wagenblatt, S. Suhai (1998) W2H: WWW interface to the GCG sequence analysis package, Bioinformatics, 14, 452-457.

339. Automated learning of unknown length motifs in unaligned DNA sequences with genetic algorithms
David Hernandez, Robin Gras, Ron Appel, Swiss Institute of Bioinformatics, PI Group;
Short Abstract:

We describe a new approach to automated learning of complex motifs from a set of unaligned DNA sequences using a genetic algorithm. The process converges on a structured motif which best describes the training set of sequences by optimizing a specific fitness function.

One Page Abstract:

During evolution, some biological structures are conserved because their efficiency gives them a better chance of survival than others that are less useful. Likewise, in biological sequences, some motifs are well conserved because they are involved in biological processes that confer better fitness. We apply a similar principle, the survival of well-adapted structures, to find these conserved motifs in biological sequences.

We have created a new model based on genetic algorithms which is able to learn common motifs of a training set of DNA sequences. Our learning model is based on structured motifs composed of several ungapped words, separated by specified distances within an interval. These words are defined using the extended DNA alphabet (IUPAC-IUB) with the intention of describing a consensus. The words and distances are coded in structures called chromosomes. There is no need for a priori knowledge: the algorithm is able to learn the length of the words, their composition and the distances between them. It converges on the best motif according to a fitness function. Therefore, it is able to isolate signals which are involved in biological processes, such as protein binding sites found in gene promoters.
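
A structured motif of this kind (IUPAC words separated by distance intervals) can be matched against sequences as sketched below. The regular-expression encoding, the function names and the coverage measure are our assumptions for illustration; in particular, coverage is only a stand-in for the fitness function of this work, which also weighs composition, length and stringency.

```python
import re

# IUPAC-IUB nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(words, gaps):
    """Compile a structured motif: words[i] and words[i+1] are separated
    by between gaps[i][0] and gaps[i][1] arbitrary bases."""
    parts = ["".join(IUPAC[c] for c in words[0])]
    for (low, high), word in zip(gaps, words[1:]):
        parts.append(".{%d,%d}" % (low, high))
        parts.append("".join(IUPAC[c] for c in word))
    return re.compile("".join(parts))

def coverage(words, gaps, sequences):
    """Fraction of sequences containing the motif (placeholder fitness)."""
    pattern = motif_to_regex(words, gaps)
    return sum(bool(pattern.search(s)) for s in sequences) / len(sequences)

# Hypothetical chromosome: a TATA-like word, then "CA" 2 to 4 bases downstream.
words, gaps = ["TATAWA", "CA"], [(2, 4)]
training = ["GGTATAAAGGCATT", "GGTATATACCCCCCA"]
score = coverage(words, gaps, training)
```

In the second toy sequence the two words are five bases apart, outside the allowed interval, so only half of the training set is covered. A genetic operator that widened the interval to (2, 5) would raise the coverage at the cost of stringency.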

A specific function computes the fitness of each chromosome. This function is designed to optimize the composition, the length and the stringency of the motifs. We also designed specific genetic operators, such as a slide operator and a distinctive crossover. The slide operator shifts the chromosome information to the left or to the right in order to walk along the sequences; it allows a quick jump to a related solution that would be difficult to reach with classical operators. Because homologous information is not necessarily located at the same positions on different chromosomes, a distinctive crossover, which attempts to match related information before exchange, has proved more efficient than classical ones.

To improve the global efficiency of the genetic algorithm, we are developing a multiple-population system in order to separate the problem into its basic components. Some populations will optimize small pieces of the solution (words or sub-words) and the others will combine them into a more structured form. The communication network will be organized hierarchically. This organization leads to a parallelization of the problem in which each population runs on a separate processor and sends its best potential solutions to a higher population.

We tested our algorithm with two training sets. The first is a set of artificial sequences into which three words with non-conserved positions were inserted. The second consists of 132 plant promoters from the EPD database, from which the algorithm is able to efficiently extract the TATA-box. New tests on protein sequences are in progress.

340. Conservation of CD23 extracellular domain through vertebrate species suggests a functional role in B-lymphocyte differentiation
Bellido M, Dpt. of Hematology, Hospital de la Santa Creu i Sant Pau, Barcelona, Spain;
Valverde JR, Bioinformatics Service, EMBnet-Spain, CNB, CSIC, Madrid, Spain;
Bordes R, Dpt. of Pathology, Hospital de la Santa Creu i Sant Pau, Barcelona, Spain;
Rubiol E, Ubeda J, Martino R, Aventin A, Sierra J, Nomdedeu J, Dpt. of Hematology, Hospital de la Santa Creu i Sant Pau, Barcelona, Spain;
Short Abstract:

Atypical chronic lymphocytic leukemias show loss of the CD23 surface expression whereas mantle cell lymphomas show its absence. An evolutionary conservation of the sequences in CD23 extracellular domain would justify the study of these sequences as specific targets for possible molecular lesions underlying the differential CD23 expression between these disorders.

One Page Abstract:

CD23 is a monoclonal antibody that recognizes the lymphocyte low-affinity receptor for IgE (Fc epsilon RII), an antigen involved in B-lymphocyte differentiation. It is highly expressed in activated mature B-lymphocytes and absent in immature bone marrow B-cells. Malignant lymphoproliferative disorders show differential expression of CD23, which has been useful for the immunophenotypic classification of chronic lymphocytic leukemias (CLLs) and non-Hodgkin lymphomas (NHLs). The loss of CD23 surface expression has been reported in atypical CLLs, whereas the absence of CD23 surface expression is a characteristic feature of mantle cell lymphomas (MCLs). Due to the similar immunophenotypic profile of atypical CLLs and MCLs, it would be useful for diagnosis to determine the underlying molecular lesions responsible for the loss and absence of CD23 expression in these disorders. CD23 encodes a protein of 321 amino acids that contains a 21-amino acid cytoplasmic domain, a 26-amino acid transmembrane domain and a 274-amino acid extracellular domain. The last of these includes a region of 123 amino acids similar to C-type animal lectins. We have analyzed the evolution of the extracellular domain of the CD23 molecule through divergent vertebrate species to determine conserved sequences which would be functionally relevant. Using public gene databases from Europe (EBI-SRS6) and the USA (NCBI-GenBank), we have analyzed cDNAs and amino acid sequences of the CD23 antigen from Homo sapiens, Mus musculus (alternatively spliced isoforms A, B and C), Equus caballus, Bos taurus, Rattus norvegicus and Ancylostoma ceylanicum species. CLUSTALW-Jalview (from EMBnet, Spain) was used for multiple sequence alignment and the construction of the phylogenetic tree. The cDNA Bos taurus sequence (em:AF1443722) was the most similar to the human CD23 extracellular domain (em:M14766).
Alignment comparisons of amino acid sequences revealed the evolutionary conservation of one region within the extracellular CD23 domain homologous with the carbohydrate-binding domain in animal lectins. The evolutionarily conserved sequences in the CD23 extracellular domain constitute specific targets for the study of possible molecular lesions (deletions, mutations) that could explain the differential surface CD23 expression between atypical CLLs and MCLs. Supported by AEHH grant (B. M).

341. The ERATO Systems Biology Workbench: An Integrated Environment for Systems Biology Software
A. M. Finney, Control and Dynamical Systems, California Institute of Technology;
M. Hucka, H. M. Sauro, Control and Dynamical Systems, California Institute of Technology, US;
H. Bolouri, Science and Technology Research Centre, University of Hertfordshire, UK;
J. Doyle, Control and Dynamical Systems, California Institute of Technology, US;
H. Kitano, Sony Computer Science Labs, Tokyo, Japan;
Short Abstract:

The goal of the ERATO Systems Biology Workbench is to create an integrated, easy-to-use software environment that enables sharing of resources between simulation and analysis tools for systems biology. Our initial focus is on interoperability between simulation tools, including deterministic and stochastic simulators as well as optimization.

One Page Abstract:

The goal of the ERATO Systems Biology Workbench (SBW) project is to create an integrated, easy-to-use software environment that enables sharing of models and resources between simulation and analysis tools for systems biology. Our initial focus is on achieving interoperability between simulation tools, including deterministic and stochastic simulators, as well as a variety of analysis and optimization tools. Our long-term goal is to develop a flexible and adaptable environment that provides (1) the ability to interact seamlessly with a variety of software packages that implement different approaches to modeling, parameter analysis, and other related tasks; (2) the ability to construct complex hierarchical multicellular models from arrays of model components; and (3) the ability to interact with biologically-oriented databases containing data, models and other relevant information. We hope that by providing an open, common framework for software interoperability, developers can spend less time recreating facilities that already exist in similar forms in other packages, and instead concentrate on developing new algorithms and models.

SBW uses a straightforward broker-based, message-passing architecture that supports Linux and Windows, and is being implemented using versatile and transportable technologies including Java, XML and sockets. Software components that implement different facilities (such as GUIs, model simulation methods, analysis methods, etc.) can be connected to each other through SBW using a straightforward application programming interface (API).

The software products of this project will be open source, portable to Windows and Linux, and use current and emerging standards such as the Systems Biology Markup Language (SBML). Both SBW and SBML are being developed in close collaboration with the groups developing the simulation packages BioSpice, DBSolve, E-Cell, Gepasi, Jarnac, VCell and StochSim. We hope to make the Systems Biology Workbench a vehicle for collaboration between developers of bioinformatics technology, and we are actively seeking other collaborators to extend the workbench.

342. A Symmetrizing Transformation for Microarray Data
Blythe Durbin, David M. Rocke, University of California at Davis;
Short Abstract:

Most classical statistical methodologies rely on the assumptions that the data are normally distributed with constant variance. Microarray data fail to satisfy these assumptions. We introduce a symmetrizing transformation for microarray data which transforms the data so that they are more nearly normally distributed.

One Page Abstract:

Most classical statistical methodologies, such as regression analysis, rely on the assumptions that the data are normally distributed with constant variance. When faced with data not conforming to these assumptions, of which microarray data are a prime example, the statistician has the option either of developing new techniques to deal with rogue data, or of transforming the data. Transforming such data to normality and constant variance is generally easier than developing new techniques. We introduce a model for microarray data which models the measured expression as a function of the true expression and two error terms. This two-component model is y = \alpha + \mu \exp(\eta) + \epsilon, where y is the measured expression, \alpha is the mean background noise, \mu is the true expression, and \eta and \epsilon are normally distributed error terms. Plots of microarray data with replicated observations for each gene confirm the validity of this model. Note that the variance of y depends on the true expression and that y is not normally distributed except when \mu = 0. Therefore, transformation of the data is necessary before classical statistical techniques can be applied. Using this model as a starting point, we look for a transformation parameter \lambda such that y^\lambda is distributed symmetrically about its mean (and hence is more nearly normal). This is done by expanding y^\lambda as a Taylor series polynomial in \eta and \epsilon, calculating the third moment (a measure of asymmetry) and solving for \lambda such that the third moment of the expansion is equal to 0. Preliminary results with simulated data show that this procedure successfully symmetrizes data distributed according to the two-component model, which bodes well for its ability to symmetrize microarray data.
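
The effect of a power transformation on data from the two-component model can be explored numerically. The sketch below simulates y = \alpha + \mu \exp(\eta) + \epsilon and picks \lambda from a grid by minimising the absolute sample skewness of y^\lambda. This grid search is a numerical stand-in that we substitute for the Taylor-series solution described above, and all parameter values are hypothetical.

```python
import math
import random

def skewness(values):
    """Standardised third central moment of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulate(n, alpha, mu, sd_eta, sd_eps, seed=0):
    """Draw n observations from y = alpha + mu * exp(eta) + eps."""
    rng = random.Random(seed)
    return [
        alpha + mu * math.exp(rng.gauss(0.0, sd_eta)) + rng.gauss(0.0, sd_eps)
        for _ in range(n)
    ]

def best_lambda(y, grid):
    """Grid value of lambda whose transform y**lambda is least skewed."""
    return min(grid, key=lambda lam: abs(skewness([v ** lam for v in y])))

# Hypothetical parameters, chosen so that every simulated y is positive.
y = simulate(5000, alpha=50.0, mu=1000.0, sd_eta=0.3, sd_eps=20.0)
grid = [i / 10 for i in range(1, 11)]
lam = best_lambda(y, grid)
raw_skew = skewness(y)
transformed_skew = skewness([v ** lam for v in y])
```

With a high true expression the multiplicative error dominates and the raw data are clearly right-skewed, so a \lambda below 1 is selected. The sketch requires positive y; observations near background level, where y can be negative, would need separate handling.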

343. "Genquire" - Interactive analysis and annotation of genome databases using multiple levels of data visualization
Mark Wilkinson, David Block, Matthew Links, Jacek Nowak, William Crosby, Plant Biotechnology Institute, National Research Council of Canada;
Short Abstract:

Genome database information is commonly distributed through CGI-generated HTML. "Genquire" is a standalone interface to genome databases for bioinformaticians and bench scientists. Data is displayed at the whole-genome, contig, and nucleotide level, providing a contextual overview of the data.

One Page Abstract:

To date, the primary method for visualization of genome database sequence and annotation information has been through the relatively static displays generated via CGI scripts. These are limited in both their interactivity and speed of delivery of information. We present an interactive database visualization tool, Genquire, which provides bench-scientists, bioinformatics researchers, annotators and curators with a 'live' graphical and manipulable display of genome data. Genquire offers 3 levels of visualization - whole genome, contig, and nucleotide - which interact to provide the user with a high degree of contextual information. The Contig-level view, conceptually based on the Genotator software(1), displays sequence features 'stacked' by their source origin or feature type on a zoomable-scrollable window. Single or multiple features may be mouse selected and sent to external analysis programs such as Blast or Sim4. Alternately, the selected features may be sent to a nucleotide-level display. The nucleotide level display allows single-base-pair modification of feature boundaries and creation of de-novo sequence features by simple highlighting of the desired region. Annotation tools are built in, providing rapid user-controlled-vocabulary annotation of individual or multiple sequence features via mouse clicks. In addition, a browser for the GO ontology database enables clickable selection and assignment of GO annotation to existing features or user-defined regions of DNA. Query results against an annotated dataset are displayed at all levels of visualization, allowing rapid surveying of the genome as well as more detailed analysis of individual sequence features which meet the query criterion. 
Genquire also acts as a graphical environment for the Blast and SIM4 alignment programs, and performs post-processing on the output from these analyses; Blast 'hits' are mapped back onto the Genome and Contig level displays, while SIM4 EST/cDNA alignments are displayed at the nucleotide level aligned against a marked-up and editable parent genomic sequence. Together these tools make Genquire a 'sandbox' for the exploration and annotation of genome sequence data. It is coded in Perl, using BioPerl as the foundation, and is thus cross-platform compatible (Linux/Unix, MS Windows, and Mac OS X), making it easily accessible to both bioinformatics professionals and bench scientists. As the interfaces into the various public genome databases become more defined, Genquire will act as a fast and efficient interactive viewer of individual genome data, and will be further developed to allow cross-genome comparisons and analyses.

(1) Harris, N.L. (1997), Genome Research 7(7):754-762

344. Combinatorial genomics: validation of high throughput protein interaction data with clustered expression profiles
Patrick Kemmeren, Genomics Lab, UMC Utrecht;
Jaak Vilo, European Bioinformatics Institute;
Frank Holstege, Genomics Lab, UMC Utrecht;
Short Abstract:

We have verified protein interaction datasets using mRNA expression data to determine which of the reported interactions show coexpression of the respective mRNAs. Besides verifying over 500 putative interactions, this has also allowed us to compare different clustering algorithms, distance measurements and the reliability of the interaction datasets.

One Page Abstract:

Functional genomic analyses create enormous amounts of data that lead to many hypotheses about the function of individual genes. Traditional approaches to testing these hypotheses are slower than the rate at which new datasets are generated. A major challenge is to develop new ways of processing data and testing the ensuing hypotheses. One approach is to determine which hypotheses are most plausible by quick validation tests on completely different data types. This will increase the value of high-throughput data and allow follow-up experimentation to be prioritized. Here we test high-throughput protein interaction data generated by two-hybrid screening of S. cerevisiae proteins. This is done by determining which of the putative interactions are reflected by coexpression of the respective mRNAs in collections of DNA microarray data. Of the 5500 two-hybrid interactions analyzed, at least 500 are found to coincide with mRNA coregulation, increasing the likelihood that these interactions are bona fide. This is confirmed by “wet lab” experiments on several of the interactions. The results of comparing different types of mRNA clustering analyses and distance metrics are presented, and the methods used will be accessible through a web interface. Besides assigning function to previously uncharacterized genes, this study shows the feasibility of combining different types of functional genomic data, improving their general utility and the rate at which annotation of whole genome sequences can be achieved.
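
The core idea of checking putative interactions against coexpression can be illustrated with a minimal sketch. It is not the authors' method: the abstract describes clustering analyses with several distance metrics, whereas this simplified stand-in scores each pair directly by the Pearson correlation of the two expression profiles and keeps pairs above a cutoff. The gene names, expression values, function names and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no coexpression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(interactions, expression, threshold=0.8):
    """Return (gene_a, gene_b, r) for interaction pairs whose mRNA
    profiles correlate at or above the (assumed) threshold."""
    supported = []
    for gene_a, gene_b in interactions:
        # Skip pairs for which one partner has no expression data.
        if gene_a not in expression or gene_b not in expression:
            continue
        r = pearson(expression[gene_a], expression[gene_b])
        if r >= threshold:
            supported.append((gene_a, gene_b, r))
    return supported


# Toy data: hypothetical genes measured under four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
# Hypothetical two-hybrid pairs; geneD has no expression profile.
interactions = [("geneA", "geneB"), ("geneA", "geneC"), ("geneA", "geneD")]

supported = coexpressed_pairs(interactions, expression)
for gene_a, gene_b, r in supported:
    print(f"{gene_a} - {gene_b}: r = {r:.3f}")
```

In this toy example only the geneA/geneB pair passes the cutoff; geneA/geneC is anticorrelated and geneA/geneD is skipped for lack of data. A cluster-based variant would instead ask whether both partners fall into the same expression cluster, which is closer to what the abstract describes.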