ISMB 2001 Poster Abstracts

Whole Genome Analysis

1. Identification of thermophilic species by the amino acid compositions deduced from their genomes
2. Sequence Analysis by Iterative Maps - beyond graphical representation
3. Global Analysis of Protein Activities Using Proteome Chips
4. Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure
5. Divergence and conservation: comparison of the complete predicted protein sets of four eukaryotes
6. Identification of novel small noncoding RNAs in Escherichia coli using a probabilistic comparative method
7. Intron-like sequences in non-coding genomic regions.
8. Repeats in human genomic DNA and sequence heterogeneity
9. PhageBase: A precalculated database of bacteriophage sequences
10. Searching for regions of conservation in the Arabidopsis genome
11. Constructing Comparative Maps with Unresolved Marker Order
12. Identification of novel small RNA molecules in the Escherichia coli genome: from in silico to in vivo
13. Operons: conservation and accurate predictions among prokaryotes
14. The EBI Proteome Analysis Database
15. Practical transcriptome analysis system in the RIKEN mouse cDNA project
16. The automated identification of novel lipases/esterases on a multi-genome scale
17. Identification of membrane protein orthologs in worm, fly and human
18. Discovering Binding Sites from Expression Patterns: A simple Hyper-Geometric Approach
19. Visualizing whole genome comparisons: Artemis Comparison Tool (ACT)
20. The use of Artemis for the annotation of eukaryotic genomes
21. A de novo approach to identifying repetitive elements in genomic sequences
22. Extendable parallel system for automatic genome analysis
23. Whole genome phylogenies using vector representations of protein sequences
24. Correlated Sequence Signature as Markers of Protein-Protein Interaction
25. Molecular and Functional Plasticity in the E. coli Metabolic Map
26. Genome Size Distribution in Prokaryotes, Eukaryotes, and Viruses
27. EnsEMBL Genome Annotation Project
28. A Framework for Identifying Transcriptional cis-Regulatory Elements in the Drosophila Genome
29. Genome-wide modeling of protein structures
30. Genome wide search of human imprinting genes by data mining of EST in UniGene
31. What we learned from statistics on arabidopsis documented genes
32. Evaluation of Computer Algorithms to Search Regulatory Protein Binding Sites
33. An HMM Approach to Identify Novel Repeats Through Fluctuations in Composition
34. Novel non-coding RNAs identified in the genomes of Methanococcus jannaschii and Pyrococcus furiosus
35. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs
36. Comparative genomics for data mining of eukaryotic and prokaryotic genomes
37. DNA atlases for the Campylobacter jejuni genome
38. DNA atlases for the Staphylococcus aureus genome
39. SNAPping up functionally related genes based on context information: a colinearity free approach
40. Sequencing and Comparison of Orthopoxviruses
41. Integrating mouse and human comparative map data with sequence and annotation resources.
42. De novo Identification of Repeat Families in the Genome
43. Using the Arabidopsis genome to assess gene content in higher plants
44. Origin of Replication in Circular Bacterial Genomes and Plasmids

Alignment Techniques

45. Combining frequency and positional information to predict transcription factor binding sites
46. Identification of distantly related sequences through the use of structural models of protein evolution
47. The Paracel Filtering Package (PFP): A Novel Approach to Filtering and Masking of DNA and Protein Sequences
48. EST Clustering with Self-Organizing Map (SOM)
49. Analysis of Information Content for Biological Sequences
50. Clustering proteins by fold patterns
51. Confidence Measures for Fold Recognition
52. Protein Structure Prediction By Threading With Experimental Constraints
53. Exonerate - a tool for rapid large scale comparison of cDNA and genomic sequences.
54. Prospector: Very Large Searches with Distributed Blast and Smith-Waterman
55. Accuracy Comparisons of Parallel Implementations of Smith-Waterman, BLAST and HMM Methods
56. Target-BLAST: an algorithm to find genes unique to pathogens causing similar human diseases
57. Software to predict microArray hybridisation: ACROSS
58. Discovering Dyad Signals
59. Efficient all-against-all computation of melting temperatures for DNA chip design
60. Wavelet techniques for detecting large scale similarities between long DNA sequences
61. TRAP: Tandem Repeat Assembly Program, capable of correctly assembling nearly identical repeats
62. LASSAP: A powerful and flexible tool for (large-scale) sequence comparisons
63. Mini-greedy algorithm for multiple RNA structural alignments
64. Neural network and genetic algorithm identification of coupling specificity and functional residues in G protein-coupled receptors
65. Determination of Classificatory Motifs for the Identification of Organism Using DNA Chips
66. An algorithm for detecting similar reaction patterns between metabolic pathways
67. SPP-1 : A Program for the Construction of a Promoter Model From a Set of Homologous Sequences
68. A new score function for distantly related protein sequence comparison.
69. Expression profiling on DNA-microarrays: In silico clone selection for DNA chips
70. Data Mining: Efficiency of Using Sequence Databases for Polymorphism Discovery
71. Automated modeling of protein structures
72. Algorithmic improvements to in silico transcript reconstruction implemented in the Paracel Clustering Package (PCP)

Protein Structure

73. Computational Structural Genomics
74. Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function
75. Using Surface Envelopes in 3D Structure Modeling
76. Molecular modelling in studies of SDR and MDR proteins
77. Consensus Predictions of Membrane Protein Topology
78. Development of new statistical measures of protein model quality and their application for consensus prediction of protein structure
79. Signal filtration methods to extract structural information from evolutionary data applied to G protein-coupled receptor (GPCR) transmembrane domains
80. Using clusters to derive structural commonality for ATP binding sites
81. Aromaticity of Domains in Photosynthetic Reaction Centers: A Clue to the Protein's Control of Energy Dissipation during Enzymatic Reactions
82. LIGPROT: A database for the analysis and visualization of ligand binding.
83. ThreadMAP: Protein Secondary Structure Determination
84. FAUST, an algorithm for functional annotations of protein structures using structural templates.
85. Predicting structural features in protein segments
86. GA Generates New Amino Acid Indices through Comparison between Native and Random Sequences
87. STING Millennium: Web based suite of programs for comprehensive and simultaneous analysis of structure and sequence
88. Side chain-positioning as an integer programming problem
89. Prediction of the quality of protein models using neural networks
90. Targeting proteins with novel folds for structural genomics
91. Protein Structural Domain Parsing by Consensus Reasoning
92. Attempt to optimise template selection in protein homology modelling using logical feature descriptors
93. Prediction of amyloid fibril-forming proteins
94. Evaluation of structure prediction models using the ProML specification language
95. Incremental Volume Minimization of Proteins (represented by Collagen Type I (local minimization))
96. Automatic Inference of Protein Quaternary Structure from Crystallographic Data.
97. Modelling Class II MHC Molecules using Constraint Logic Programming
98. DoME: Rapid Molecular Docking with Adaptive Mesh Solutions to the Poisson-Boltzmann Equation
99. Electrostatic potential surface and molecular dynamics of HIV-1 protease brazilian mutants
100. Automated functional annotation of protein structures
101. Structural annotation of the human genome
102. Estimation of p-values for global alignments of protein sequences.
103. Sequence and Structure Conservation Patterns of the Gelsolin Fold

Protein Families

104. Finding all protein kinases in the human genome
105. Remote Homology Detection Using Significant Sequence Patterns
106. TRIBE-MCL: A Novel Algorithm for accurate detection of protein families
107. In Silico Analysis of Bacterial Virulence Factors: Redefining Pathogenesis
108. Relationships between structural conservation and dynamical properties in the PAS family proteins.
109. Classifying G-protein coupled receptors with support vector machines
110. Domain-finding with CluSTr: Re-occurring motifs determined with a database of mutual sequence similarity
111. PFAM domain distributions in the yeast proteome and interactome
112. Identifying Protein Domain Boundaries using Sequence Similarity for Structural Genomics Target Selection
113. Comparative study of in vitro and in vivo protein evolution.
114. DART: Finding Proteins with Similar Domain Architecture
115. Statistical approaches for the analysis of immunoglobulin V-REGION IMGT data
116. Markovian Domain Signatures: Statistical Segmentation of Protein Sequences
117. Identification Of Novel Conserved Sequence Motifs In Human Transmembrane Proteins
118. Using Profile Scores to Determine a Tree Representation of Protein Relationships.
119. Apoptosis Signalling Pathway Database - Combining Complementary Structural, Profile based, and Pair-wise Homologies.

Gene Expression

120. From clustering of expression data to motif finding: a multistep online procedure.
121. Probe Based Scaling of Microarray Expression Data
122. Revealing the Fine-Structures: Wavelet-based Fuzzy Clustering of Gene Expression Data
123. Identification of clinically relevant genes in lung tumor expression data.
124. Machine Learning Techniques for the Analysis of Microarray Gene Expression Data: A Critical Appraisal
125. Comparison of Methods for The Classification of Tumors Using Gene Expression Data
126. Using expression data for testing hypotheses on genetic networks - minimal requirements for the experimental design
127. In silico search for cis-acting regulatory sequences in co-expressed gene clusters
128. A decision tree method for classification of promoters based on TF binding sites
129. Biostatistical Methods to Analyse Gene Expression Profiles
130. Syntactic structures for understanding gene regulatory networks
131. Adaptive quality-based clustering of gene expression profiles
132. Incorporating Biological Knowledge Into Analyses of Microarray Data
133. Semantic Link: A Knowledge Discovery Tool for Gene Expression Profiling
134. Integration of transcript reconstruction and gene expression profiles to enhance disease gene discovery.
135. Gene Expression Database (GXD): integrated access to gene expression information from the laboratory mouse
136. Analysis of gene expression profiles between interacting protein pairs in M. musculus
137. Learning genomic nature of complex diseases from the gene expression data
138. Comparative Assessment of Normalization Methods for cDNA microarray data
139. Identifying different types of human lymphoma by SVM and ensembles of learning machines using DNA microarray data.
140. On the Influence of the Transcription Factor on the Information Content of Binding Sites
141. A Mouse Developmental Gene Index
142. Using Gene Expression and Artificial Neural Networks for Classification and Diagnostic Prediction of Cancers
143. Classification of malignant states in multistep carcinogenesis using gene expression matrix
144. Bioinformatics Tools in the Screening of Gene Delivery Systems
145. Cross talking in cellular networks: tRNA-synthetase and amino acid synthetic enzymes in Escherichia coli
146. Assessing Clusters and Motifs from Gene Expression Data
147. Statistical Analysis of Gene Expression Profile Changes among Experimental Groups
148. Multivariate method for selection of sets of differentially expressed genes
149. Understanding Non Small Cell Lung Cancer by Analysis of Expression Profiles
150. Applications of high-throughput identification of tissue expression profiles and specificity
151. Identifying regulatory networks by combinatorial analysis of promoter elements
152. The use of discretization in the analysis of cDNA microarray expression profiles for the identification of tissue-specific genes
153. Quantitative analysis of bacterial gene expression by using the gusA reporter system in a non-steady state continuous culture
154. Analysis of 5035 high-quality EST clones from mature potato tuber
155. Using highly redundant oligonucleotide arrays to validate ESTs: Development and use of a human Affymetrix MuscleChip.
156. Expression Profiler
157. Analysis of the transcriptional apparatus in the holoparasitic flowering plant genus Cuscuta
158. Inferring Regulatory Pathways in E. coli using Dynamic Bayesian Networks
159. ConSite: Identification of transcription factor binding sites conserved between orthologous gene sequences
160. FSCAN - An open source program for analysis of two-color fluorescence-labeled cDNA microarrays
161. A method for designing PCR primers for amplifying cDNA array clones
162. Statistical modelling of variation in microarray replicates
163. A new exploration of diffuse large B-cell lymphoma
164. Including protein-protein interaction networks into supervised classification of genes based on gene expression data
165. Comparative Splicing Pattern Analysis between Mouse and Human Exon-skipped Transcripts
166. Non-parametric statistics of gene expression data
167. Transcriptome and proteome analysis of Escherichia coli during high cell density cultivation
168. Molecular signatures of commonly fatal carcinomas: predicting the anatomic site of tumor origin
169. Tuning Sub-networks Inference by Prior Knowledge on Gene Regulation
170. Detection of alternative expression by analysis of inconsistency in microarray probe performance
171. Which clustering algorithms best use expression data to group genes by function?
172. Visualization and Analysis Tool for Gene Expression Data Mining
173. Prediction of co-regulated genes in Bacillus subtilis based on the conserved upstream elements across three closely related species
174. Comparison of performances of hierarchical and non-hierarchical neural networks for the analysis of DNA array data
175. Linking micro-array based expression data with molecular interactions, pathways and cheminformatics
176. Classification of Acute Leukemia Gene Expression Data Using Weight Function and Principal Component Analysis
177. Application of Fuzzy Robust Competitive Clustering Algorithm on Microarray Gene Expression Profiling Analysis
178. A Robust Algorithm for Expression Analysis
179. Analysis of Gene Expression by Short Tag Sequencing - Theoretical Considerations
180. Computational analysis of RNA splicing by intron definition in five organisms
181. Image Analysis and Feature Extraction Methods for High-Density Oligonucleotide Arrays
182. Transcriptional control mechanisms in the global context of the cell
183. Measurement and Prediction of Gene Expression in Whole Genomes
184. Analysis of orthologous gene expression in microarray data

Complex Biological Networks

185. A rapid algorithm for generating minimal pathway distances
186. System analysis of complex molecular networks by mathematical simulation and control theory
187. Parameter Estimation of Signal Transduction Pathways using Real-Coded Genetic Algorithms
188. A two-phase partition method simulates the dynamic behavior of the heat shock response with high accuracy at a remarkably high speed.
189. Integration of Computational Techniques for the Modelling of Signal Transduction
190. Validating Metabolic Prediction
191. (withdrawn)
192. Scale-free Behaviour in Protein Domain Networks
193. Inferring gene dependency networks using expression profiles from yeast deletion mutants
194. On Network Genomics
195. Semantic Modeling of Signal Transduction Pathways and the Linkage to Biological Datasources
196. Pathway Analysis of Metabolic Networks: New version of METATOOL and convenient classification of metabolites
197. (withdrawn)
198. PIMRider: an integrated exploration platform for large protein interaction networks
199. From "pathways" to functional network: a new technology for reconstruction of human metabolism.
200. Protein Pathway Mapping in Human Cells
201. MAP-Kinase-Cascade: Switch, Amplifier or Feedback Controller?
202. Genomic Object Net: Basic Architecture and Visualization for Biopathway Simulation
203. KEGG human cell cycle pathway and its comparison to the viral genomes.

Protein Function and Localization

204. (withdrawn)
205. Eukaryotic Protein Processing: Predicting the Cleavage Sites of Proprotein Convertases
206. Understanding multi-organelle predicted subcellular localization
207. Visualization and Interpretation of the Molecular Scanner Data
208. SFINX: A generic system for integrated graphical analysis of predicted protein sequence features
209. Protein Pathway Profiling
210. Ab initio prediction of human orphan protein function
211. A consensus based approach to genome data mining of beta barrel outer membrane proteins.
212. Predicting tyrosine sulfation sites in protein sequences
213. Characterization of aspartylglucosaminuria mutations
214. Elucidating a "theoretical" proteome of the Arabidopsis thaliana thylakoid
215. Machine Learning Algorithms in the Detection of Functional Relatedness of Proteins
216. Predicting protein functions based on InterPro and GO
217. iPSORT: Simple rules for predicting N-terminal protein sorting signals.

DNA and RNA Structure

218. Molecular dynamics of protein-RNA interactions: the recognition of an RNA stem-loop by a Staufen double-stranded RNA-binding domain
219. Numerical analysis of the RNA structurisation process.
220. Estimation of the Amount of A-DNA and Z-DNA in Sequenced Chromosomes
221. Feature selection to discriminate five dominating bacteria typically observed in Effluent Treatment Plant

Evolution

222. Identifying orthologs and paralogs in speciation-duplication trees
223. Reconstructing the duplication history of tandemly repeated genes
224. New approaches for the analysis of gene family evolution
225. Calculating orthology support levels in large scale data analyses
226. Search Treespace and Evaluate Regions of an Alignment with LumberJack
227. Analysis of Large-Scale Duplications Involving Human Olfactory Receptors
228. Whole Genome Comparison by Metabolic Pathway Profiling

Databases, Information and Knowledge Management

228a. Use of Runs Statistics for Pattern Recognition in Genomic DNA Sequences
229. Biomine - multiple database sequence analysis tool
230. Efficient virtual screening tool for early drug discovery
231. WebGen-Net: a workbench system for support of genetic network construction
232. ArrayExpress - a Public Repository for Gene Expression Data at the European Bioinformatics Institute
233. Formulation of the estimation methods for the kinetic parameters in cellular dynamics using object-oriented database
234. Towards a Transcript Centric Annotation System Using in silico Transcript Generation and XML Databases
235. Trawler: Fishing in the Biomedical Literaturome
236. GeneLynx: A Comprehensive and Extensible Portal to the Human Genome
237. Representation and integration of metabolic and genomic data: the Panoramix project
238. GDB's Draft Sequence Browser (GDSB) - A bridge between the Human Genome Database (GDB) and Human Genome Draft Sequence.
239. Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT.
240. A tissue classification system
241. Proteome's Protein Databases: Integration of the Gene Ontology Vocabulary
242. Pattern Recognition for Micro-array Data using Context based Divergence and its Validation
243. GIMS - A Data Warehouse for Storage and Analysis of Genomic and Functional Data.
244. Mutable gene model
245. Discovery Support System for Human Genetics
246. Search, a sensitive and exhaustive homology search programme
247. Gene Ontology Annotation Campaign and Spotfire Gene Ontology Plug-In
248. Predicting the transcription factor binding site-candidate, which is responsible for alteration in DNA/protein binding pattern associated with disease susceptibility/resistance
249. Open Exchange of Source Code and Data from the Stanford Microarray Database
250. Portal XML Server: Toward Accessing and Managing Bioinformatics XML Data on the Web
251. PFAM-Pro: A New Prokaryotic Protein Family Database
252. Production System for Mining in Large Biological Dataset
253. Querying Multiple Biological Databanks
254. Automated Analysis of Biomedical Literature to Annotate Genes with Gene Ontology Codes: Application of a Maximum Entropy Method to Text Classification
255. Rat Liver Aging Proteome Database: A web-based workbench for proteome analysis
256. UCspots Before Your Eyes
257. PASS2: Semi-Automated database of Protein Alignments organised as Structural Superfamilies
258. The HAMAP project: High quality Automated Microbial Annotation of Proteomes
259. BioWIC: Biologically what's in common?
260. Structure information in the Pfam database
261. Sequence Viewers for Distributed Genomic Annotations
262. The Immunodeficiency Resource: Knowledge Base for Immunodeficiencies
263. XGI: A versatile high throughput automated sequence analysis and annotation pipeline.
264. Mouse Genome Database: an integrated informatics database resource
265. Methods for the automated identification of substance names in journal articles
266. Development and Implementation of a Conceptual Model for Spatial Genomics
267. Modelling Genomic Annotation Data using Objects and Associations: the GenoAnnot project
268. An Overview of the HIV Databases at Los Alamos
269. Variability of the immunoglobulin superfamily V-set fold using Shannon entropy analysis
270. SNPSnapper - application for genotyping and database storage of SNP genotypes produced by microarray technology
271. An XML document management system using XLink for an integration of biological data
272. Immunodeficiency Mutation Databases in the Internet
273. Martel: parsing flat-file formats as XML
274. Disulphide database (DSDBASE): A handy tool for engineering disulphides and modeling disulphide-rich systems.
275. A Combinatorial Approach to Ontology Development
276. StressDB: The PennState Arabidopsis Stress Database
277. The GeM system for comparative mapping of mammalian genomes
278. TrEMBL protein sequence database: production, interconnectivity and future developments
279. PIR Web-Based Tools and Databases for Genomic and Proteomic Research
280. Classification of Protein Structures by Global Geometric Measures
281. EST Analysis and Gene expression in Populus leaves
282. Perl-based simple retrieval system behind the InterProScan package.
283. Intelligent validation of oligonucleotides for high-throughput synthesis
284. Iobion MicroPoint Curator, a complete database system for the analysis and visualization of microarray data
285. Rapid IT prototype to support macroarray experiments
286. EXProt - a database for EXPerimentally verified Protein functions.
287. HeSSPer: a program to analyze sequence alignments and protein structures derived from HSSP database
288. Protein Structure extensions to EMBOSS
289. GENIA corpus: A Semantically Annotated Corpus in the Molecular Biology Domain
290. Talisman - Rapid Development of Web Based Tools for Bioinformatics
291. The Eukaryotic Linear Motif Database ELM
292. Information search, retrieval and organization relevant to molecular biology
293. Los Alamos National Laboratory Pathogen Database Systems
294. Facilitating knowledge acquisition and discovery through automatic web data mining
295. Functional analysis of novel human full-length cDNAs from the HUNT database at Helix Research Institute
296. (withdrawn)
297. iSPOT and MINT: a method and a database dedicated to molecular interactions
298. An Integrated Sequence Data Management and Annotation System For Microbial Genome Projects
299. A home-made implementation for bibliographic data management: application to a specific protein.
300. Statistical structurization of 30,000 technical terms in medical textbooks: A non-ontological approach for computation of human gene functions.

Gene Finding

301. Searching for RNA Genes Using Base-Composition Statistics
302. dna2hmm - a homology based genefinder
303. DIGIT: a novel gene finding program by combining genefinders
304. GAZE: A generic, flexible tool for gene-prediction
305. An estimate of the total number of genes in microbial genomes based on length distributions 
306. Potential binding sites for PPARg in promoters and upstream sequences
307. On the Species of Origin: Diagnosing the Source of Symbiotic Transcripts
308. PAGAN: Predict and Annotate Genes in genomic sequence based on ANalysis of EST Clusters
309. Nested Genes in the Human Genome
310. Integrating Protein Homology into the Twinscan System for Gene Structure Prediction
311. Improving exon detection using human-rodent genomic sequence comparison
312. In-silico to in-vivo analysis of whole proteome
313. Incorporating Additional Information to Hidden-Markov Models for Gene Prediction
314. Gene prediction in the post-genomic era
315. Improved Splice Site Prediction by Considering Local GC Content
316. Relaxed profile matching as a method for identifying putative novel proteins from the genome
317. Analyzing Alternatively Spliced Transcripts
318. EST curation with improved gene feature models
319. Annotation of Human Genomic Regions to Identify Candidate Disease Genes
320. Annotation of the E. coli genome revisited

Other

321. Conformational studies on O-specific polysaccharides of Shigella dysenteriae
322. The risk of failure to detect disease genes due to epistatic interactions
323. A stoichiometric model for the central carbon metabolism of Aspergillus oryzae
324. Computational Antisense Prediction
325. Determination of the active structure of Chemotactic peptides
326. A web-based graphical interface with an efficient algorithm for identifying DNA and protein patterns
327. SAMIE: towards a probabilistic code for protein-DNA interactions
328. Prediction of N-linked glycosylation sites in proteins.
329. Restriction Enzymes Dramatically Enhance SBH
330. (withdrawn)
331. PaTre: a method for paralogy tree construction
332. Bioinformatics Services at the Human Genome Mapping Project - Resource Centre.
333. A Bayesian approach to learning Hidden Markov Models with applications to biological sequence analysis
334. Genome-wide Operon prediction in C. elegans
335. The detection of attenuation regulation in bacterial genomes
336. Predicting Protein-Protein Interactions from Sequences
337. A Computational Screen for Novel Targets of SnoRNAs
338. Tackling Biocomputing Tasks using a Meta-Data Framework
339. Automated learning of unknown length motifs in unaligned DNA sequences with genetic algorithms
340. Conservation of CD23 extracellular domain through vertebrate species suggests a functional role in B-lymphocyte differentiation
341. The ERATO Systems Biology Workbench: An Integrated Environment for Systems Biology Software
342. A Symmetrizing Transformation for Microarray Data
343. "Genquire" - Interactive analysis and annotation of genome databases using multiple levels of data visualization
344. Combinatorial genomics: validation of high throughput protein interaction data with clustered expression profiles
345. Better Blast performance using Setter


1. Identification of thermophilic species by the amino acid compositions deduced from their genomes
David P. Kreil, EMBL - EBI / University of Cambridge;
Christos A. Ouzounis, EMBL - EBI;
kreil@ebi.ac.uk
Short Abstract:

 Global amino acid compositions as deduced from 47 complete genomic sequences were analyzed by hierarchical clustering and PCA. Although GC-content had a dominant effect, thermophiles can be identified by their amino acid compositions alone. While the number of genomes is now high enough to discern even a third factor, more 'unusual' species are still required.

One Page Abstract:

 The global amino acid compositions as deduced from the complete genomic sequences of seven thermophilic archaea, one mesophilic archaeon, two thermophilic bacteria, 34 mesophilic bacteria, and three eukaryotic species were analyzed by hierarchical clustering and principal components analysis (PCA).

This study presents a careful statistical analysis of factors that affect amino acid composition. Both hierarchical clustering and PCA showed an influence of two main factors on amino acid composition. Even though GC-content has a dominant effect, thermophilic species can be identified by their global amino acid compositions alone. Differences between the groups of thermophiles and mesophiles were verified with appropriate statistical post-hoc tests.
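
Purely as an illustration of the kind of analysis described above (a sketch under assumed inputs, not the authors' code), the following Python fragment runs PCA and average-linkage hierarchical clustering on a genomes-by-residues frequency matrix; the random placeholder data stand in for the real compositions.

    # Minimal sketch, assuming `comps` is a genomes x 20 matrix of
    # amino acid frequencies (random placeholder data used here).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    comps = rng.dirichlet(np.ones(20), size=47)      # 47 genomes, 20 residues

    pca = PCA(n_components=3)                        # principal components analysis
    scores = pca.fit_transform(comps)
    print("variance explained:", pca.explained_variance_ratio_)

    tree = linkage(comps, method="average")          # hierarchical clustering
    clusters = fcluster(tree, t=2, criterion="maxclust")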

Based on this data analysis we introduce a 'compositional tree' of species that takes into account not only homologous proteins, but also proteins unique to particular species. We expect this simple yet novel approach to be a useful additional tool for the study of phylogeny at the genome level.

This analysis extends our previous work [1] to a larger number of species, one of which is a mesophilic archaeon. The new analysis clearly supports the notion that the second strongest determining factor of global amino acid composition is indeed thermophilicity, and not, perhaps, archaeal origin. With the larger number of completely sequenced genomes available, besides GC-content and thermophily, a third major separable factor determining amino acid composition is now emerging. However, for the present analysis the genomes of only one mesophilic archaeon and two thermophilic bacteria were available. This points to a general problem for whole genome studies, as the selection of sequenced genomes available is increasingly biased. We show how to deal with this problem by application of thorough statistical methods.

[1] Kreil, D. P. and Ouzounis, C. A. (2001) 'Identification of thermophilic species by the amino acid compositions deduced from their genomes'. Nucleic Acids Res. 29, 1608-1615.


2. Sequence Analysis by Iterative Maps - beyond graphical representation
Susana Vinga, ITQB/Universidade Nova Lisboa;
Jonas S. Almeida, Department of Biometry and Epidemiology, Medical University of South Carolina; ITQB/Universidade Nova Lisboa;
João A. Carriço, António Maretzek, ITQB/Universidade Nova Lisboa;
Peter A. Noble, Madilyn Fletcher, Belle W. Baruch Institute for Marine Biology and Coastal Research, Marine Science Program and Department of Biologica;
svinga@itqb.unl.pt
Short Abstract:

 Iterative mapping of DNA sequences by Chaos Game Representation (CGR) was first proposed for the investigation of patterns. By counting quadrant frequencies at arbitrary resolution, order-free Markov chain probability tables are obtained, highlighting the usefulness of CGR as a sequence-modelling tool. The iterative procedure was further extended to accommodate higher-dimension alphabets.

One Page Abstract:

 Iterative mapping of DNA sequences by Chaos Game Representation (CGR) was first proposed in 1990 for the investigation of patterns. This initial attempt to produce scale-independent representations of biological sequences has some important properties: 1) one-to-one correspondence between points in the continuous map and the respective sequences; 2) proximity in the map of sequences with the same suffix [maximum distance 2^-k in each coordinate on the map between two sequences with the same k last units]. We have further explored this representation and found that, by counting arbitrarily sized quadrant frequencies, an order-free Markov chain probability table is obtained, accommodating both integer and non-integer order resolution. These newly uncovered properties highlight the usefulness of CGR as a sequence-modelling tool rather than just a graphical representation technique. The iterative procedure was further extended to accommodate higher dimensions, defining unit-block iterated continuous domains.
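
A minimal sketch of the CGR construction just described (our illustration, not the authors' code; the corner assignment is one common convention): each base pulls the current point halfway toward its corner of the unit square, and counting points in a 2^k x 2^k grid recovers the k-mer frequency table.

    # A toy Chaos Game Representation (a sketch, not the authors' code).
    import numpy as np

    CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

    def cgr(seq):
        """Map a DNA sequence to CGR points: each base moves the current
        point halfway toward that base's corner of the unit square."""
        x, y = 0.5, 0.5
        points = []
        for base in seq:
            cx, cy = CORNERS[base]
            x, y = (x + cx) / 2.0, (y + cy) / 2.0
            points.append((x, y))
        return np.array(points)

    def quadrant_counts(points, k):
        """Count points in each cell of a 2^k x 2^k grid; the normalised
        counts recover the k-mer frequency table described above."""
        cells = np.clip(np.floor(points * 2**k).astype(int), 0, 2**k - 1)
        counts = np.zeros((2**k, 2**k), dtype=int)
        for cx, cy in cells:
            counts[cx, cy] += 1
        return counts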

Further reading: Almeida, J.A., Carriço, J.A., Maretzek, A., Noble, P.A. and Fletcher, M. (2001) Analysis of genomic sequences by Chaos Game Representation. Bioinformatics, 17, 429-437 


3. Global Analysis of Protein Activities Using Proteome Chips
Ning Lan, Ronald Jansen, Paul Bertone, Heng Zhu, Michael Snyder, Mark Gerstein, Yale University;
lan@bioinfo.mbb.yale.edu
Short Abstract:

 A defined collection of 5800 yeast proteins was printed on a proteome microarray and screened for the ability to interact with proteins, nucleic acids, and phospholipids. An algorithm was developed to identify positive signals on the proteome microarray and to cluster the proteins identified into functional groups.
 
 

One Page Abstract:

 A daunting task in the post-genome sequencing era is to ascribe functions to every protein encoded by a given genome. Direct analysis of protein function on proteome chips is likely to provide an extremely valuable approach for elucidating gene function on a global scale. A defined collection of 5800 proteins from the budding yeast was prepared using high-throughput techniques and printed onto glass slides to screen for many activities including protein-protein, protein-DNA, protein-RNA, and protein-liposome interactions. 

Visual inspection identified 39 yeast proteins that bind calmodulin. Sequence analysis revealed that these calmodulin-binding proteins share a motif whose consensus is I/L-Q-X-X-K-K/X-G-B, where X is any residue and B is a basic residue. 

An algorithm was developed to identify and analyze positive signals in protein-liposome binding experiments. Variations between chips and local variations on the chip cause additional fluctuations of the binding signals quantitated using GenePix software. To correct the variation between chips, the signals from different experiments were scaled into a common range by subtracting the median and dividing by the difference between upper and lower quartile, thus transforming the signal distributions of different experiments to comparable shapes. To correct the local variation on the chip, we performed a "neighborhood subtraction" for each spot. We defined a region of two rows above and below as well as two columns to the left and right of a spot as the neighborhood region. The median signal of this region was then subtracted from the spot signal. The number of highly fluorescent spots in any neighborhood region is generally low enough in these experiments not to disturb the median significantly. Finally, if the variation between two parallel samples was greater than 3 standard deviations of the error distribution of the samples, the data point was flagged and excluded from further analysis. After this filtering procedure, we normalized the filtered lipid binding signal G with the GST signal R, yielding the ratio r = G/R, which is a measure of the binding per amount of protein and allows comparison of binding signals between different proteins. The specific binding ratio r is sensitive to errors eG and eR in the G and R signals. Therefore, we computed 90% and 95% confidence intervals for this ratio with a Monte-Carlo procedure, assuming that r is a good approximation of the actual average of the ratio population: r + er = (G + eG)/(R + eR), where er represents the error of the ratio r.
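
A condensed sketch of these correction steps (our reading of the procedure, not the authors' code; the neighborhood is interpreted here as the surrounding block of spots, and the normal error model in the Monte-Carlo step is an assumption):

    # Sketch of the corrections described above, assuming `G` and `R`
    # are 2-D arrays of spot signals indexed by row and column.
    import numpy as np

    def scale_chip(signal):
        """Between-chip scaling: subtract the median, divide by the IQR."""
        q1, med, q3 = np.percentile(signal, [25, 50, 75])
        return (signal - med) / (q3 - q1)

    def neighborhood_subtract(signal, radius=2):
        """Local correction: subtract the median of the spots within
        `radius` rows/columns of each spot (block interpretation)."""
        out = np.empty_like(signal, dtype=float)
        n_rows, n_cols = signal.shape
        for i in range(n_rows):
            for j in range(n_cols):
                block = signal[max(0, i - radius):i + radius + 1,
                               max(0, j - radius):j + radius + 1]
                out[i, j] = signal[i, j] - np.median(block)
        return out

    def ratio_ci(g, r, eg, er, n=10000, seed=0):
        """Monte-Carlo 95% confidence interval for r = G/R, assuming
        normal errors eG and eR on the two signals (toy version; the
        poster reports both 90% and 95% intervals)."""
        rng = np.random.default_rng(seed)
        samples = (g + rng.normal(0.0, eg, n)) / (r + rng.normal(0.0, er, n))
        return np.percentile(samples, [2.5, 97.5])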

This algorithm identified 150 yeast proteins that bind phosphatidylinositol lipids, 52 of which correspond to uncharacterized proteins, indicating that many previously uncharacterized proteins have potentially important biochemical activities. These proteins were clustered into four groups based on binding strength and specificity.

These results provide a wealth of new information about many known and previously uncharacterized proteins, demonstrating that proteome chips offer a valuable opportunity for direct global proteome analysis.


4. Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure
Julian Gough, MRC Laboratory of Molecular Biology;
Kevin Karplus, Richard Hughey, UCSC, USA;
Cyrus Chothia, MRC Laboratory of Molecular Biology;
jgough@mrc-lmb.cam.ac.uk
Short Abstract:

 A hidden Markov model library representing all proteins of known structure has been built based on SCOP. This library has been used on all complete genomes to assign structural superfamilies to sequences. The genome assignments, sequence alignments, and a facility to search the library are available at http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY.

One Page Abstract:

 Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pair-wise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this poster describes calculations that (i) improve the performance of HMMs and (ii) determine a good, possibly the best, procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage. The second part of the poster describes the construction of a library of HMMs, called SUPERFAMILY, that represents essentially all proteins of known structure. The sequences of the domains in proteins of known structure that have identities less than 95% are used as seeds to build the models. Using the current data, this gives a library with 4894 models. The third part of the poster describes the use of the SUPERFAMILY model library to annotate the sequences of more than 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 43% of bacterial genomes. Many sequences labeled as hypothetical are homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library.
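
The assignment step can be pictured with a small sketch (an illustration under assumed inputs, not the SUPERFAMILY code): given E-values from scoring one sequence against every model in the library, keep the best model if it is significant. The model names and cutoff below are hypothetical.

    def assign_superfamily(scores, evalue_cutoff=1e-3):
        """scores: dict mapping model id -> E-value for one sequence.
        Returns the best-scoring model id, or None if none is significant."""
        model, best = min(scores.items(), key=lambda kv: kv[1])
        return model if best <= evalue_cutoff else None

    hits = {"model_globin": 2e-8, "model_tim_barrel": 0.7}   # hypothetical scores
    print(assign_superfamily(hits))                          # -> model_globin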


5. Divergence and conservation: comparison of the complete predicted protein sets of four eukaryotes
Catherine A. Ball, Kara Dolinski, Shuai Weng, John C. Matese, Gavin Sherlock, Dianna Fisk, Selina Dwight, Karen Christie, Anand Sethuraman, J. Michael Cherry, David Botstein, Stanford University School of Medicine;
ball@genome.stanford.edu
Short Abstract:

 Comparison of the complete predicted protein sets of four eukaryotic organisms (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana) suggests that the common ancestor of plants, multicellular animals and single-celled fungi was the source of a significant fraction of the current complement of proteins for each species.

One Page Abstract:

 Comparison of the complete predicted protein sets of four eukaryotic organisms (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana) suggests that the common ancestor of plants, multicellular animals and single-celled fungi was the source of a significant fraction of the current complement of proteins for each species. Complete sets of predicted proteins were compared using BLASTP, grouped into families of related proteins and subjected to CLUSTALW analysis to determine closest similarity. At all stringency levels, 30-50% of sequence similarity families have at least one representative sequence from each organism. Unsurprisingly, the closest similarities are found between proteins from D. melanogaster and C. elegans, the multicellular animals.
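
As a toy illustration of this family census (our sketch; the identifiers and the species-prefix convention are assumptions), the fragment below counts similarity families with at least one member from every organism:

    # Families are sets of sequence ids prefixed by a species tag.
    SPECIES = {"SC", "CE", "DM", "AT"}   # cerevisiae, elegans, melanogaster, thaliana

    def species_of(seq_id):
        return seq_id.split("|")[0]

    def fully_shared(families):
        """Keep families with a representative from every species."""
        return [f for f in families
                if {species_of(s) for s in f} >= SPECIES]

    fams = [{"SC|YAL001C", "CE|F53G12.1", "DM|CG1234", "AT|At1g01010"},
            {"SC|YBR002W", "DM|CG5678"}]
    print(len(fully_shared(fams)))   # -> 1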

Systematic functional analysis of S. cerevisiae genes and proteins has significant implications for their homologs in other species. S. cerevisiae proteins encoded by essential genes are more likely to have homologs in one of the other species. In addition, S. cerevisiae proteins that interact with many other proteins are also more likely to be conserved.

Using gene associations from the Gene Ontology Consortium, similarity families were associated with biological processes and molecular functions. The most conserved biological processes, as inferred from shared annotations, correspond to core cellular processes such as metabolism and protein synthesis.
 
 


6. Identification of novel small noncoding RNAs in Escherichia coli using a probabilistic comparative method
Elena Rivas, Robert J. Klein, Sean R. Eddy, Washington University in St. Louis;
elena@genetics.wustl.edu
Short Abstract:

 We apply comparative genomics in a probabilistic computational method to find novel noncoding RNAs. We use this computational method to screen the E. coli genome using whole-genome comparison to four other gamma proteobacterial genome sequences. A number of the resulting candidates have been shown by Northern blot analysis to produce small, apparently noncoding RNA transcripts.

One Page Abstract:

 We have developed a comparative genomic method to find novel noncoding RNA genes. This method takes advantage of the pattern of fixed mutations between two conserved sequences in order to infer why the sequence is functional. A protein-coding exon, for instance, may show a telltale abundance of synonymous substitutions, whereas a structural RNA may show a telltale abundance of compensatory mutations consistent with a conserved Watson-Crick base-paired secondary structure. We have formalized this intuitive notion by constructing three probabilistic "pair-grammars". Each grammar models a different functional pattern of evolution: coding, which favours synonymous mutations; RNA, which models a pattern of compensatory base changes; and a null hypothesis of position-independent evolution. Here we report the results of applying this computational screen to the E. coli genome, using whole-genome comparisons to four other gamma proteobacterial genome sequences. In this screen we have generated a large number of candidates for RNA genes from E. coli intergenic regions. A number of those candidates have been shown by Northern blot analysis to produce small, apparently noncoding RNA transcripts.
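
The decision rule behind such a screen can be caricatured as likelihood comparison among competing models: score an aligned pair under each evolutionary model and pick the best. The toy below replaces the actual pair-grammars with i.i.d. match/mismatch column models (all probabilities are invented for illustration):

    # Toy model comparison over an ungapped pairwise alignment; the real
    # method uses structured pair-grammars, not per-column models.
    import math

    MODELS = {
        "coding": 0.80,   # per-column match probability under each model
        "rna":    0.70,
        "null":   0.55,
    }

    def loglik(seq1, seq2, p_match):
        ll = 0.0
        for a, b in zip(seq1, seq2):
            ll += math.log(p_match if a == b else (1 - p_match) / 3)
        return ll

    def classify(seq1, seq2):
        """Return the model with the highest alignment log-likelihood."""
        return max(MODELS, key=lambda m: loglik(seq1, seq2, MODELS[m]))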


7. Intron-like sequences in non-coding genomic regions.
Elisabetta Pizzi, Emanuele Bultrini, Paolo Del Giudice, Clara Frontali, Istituto Superiore di Sanità, Rome (Italy);
frontali@iss.infn.it
Short Abstract:

 A method was developed for the analysis of correlations in oligonucleotide usage in extended genomic portions. Preliminary results for the C. elegans and D. melanogaster genomes reveal auto- and cross-correlations in the dictionary prevailing in introns and in intergenic regions. The method lends itself to automatic partitioning and to cluster analysis.

One Page Abstract:


The heterogeneity, or patchiness, which characterises most eukaryotic genomes, suggests that different parts of the same genome may adopt different strategies to encode relevant biological information, as a consequence of different evolutionary pressures. Such considerations should help in partitioning long non-coding sequences into functionally different regions.

Starting from the observation that similarity in oligonucleotide usage in introns and intergenic regions, extending over distances of more than 10 kb, contributes to the correlation structure of the Caenorhabditis elegans genome [1], we examined different statistical methods with the aim of providing a quantitative measure of this effect and of finding practicable ways to score extended genomic portions for consistent oligonucleotide usage in distant regions not necessarily related by significant sequence similarity.

Preliminary results were obtained using a simple approach which, after dividing an arbitrarily long sequence into non-overlapping segments typically 100 bp long, computes correlation coefficients between oligonucleotide frequency distributions for all fragment pairs. By repeating the procedure on randomly shuffled segments it becomes possible to disaggregate the effects of base composition and of biased oligonucleotide usage. Distant genomic regions that adopt a similar oligonucleotide dictionary above and beyond what is expected on the basis of nucleotide frequencies can easily be recognised as off-diagonal patterns in a sort of coarse-grained dot plot, representing the resulting matrices through a brightness scale proportional to the correlation coefficient value. Along with this visual representation, it is possible to perform cluster analyses of segments according to oligonucleotide usage.
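
In outline the approach might look like the following sketch (our illustration, not the authors' code; the 100-bp window matches the text, while k = 4 and the Pearson correlation are assumptions):

    # Coarse-grained correlation matrix of k-mer usage between segments.
    import itertools
    import random
    import numpy as np

    def kmer_vector(seq, k=4):
        """Normalised k-mer frequency vector of one segment."""
        counts = {}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
        keys = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
        v = np.array([counts.get(key, 0) for key in keys], dtype=float)
        return v / max(v.sum(), 1)

    def correlation_matrix(seq, window=100, k=4, shuffle=False):
        """Pairwise correlations of k-mer usage between all segments;
        shuffle=True gives the control that keeps base composition only."""
        segs = [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]
        if shuffle:
            segs = ["".join(random.sample(s, len(s))) for s in segs]
        vecs = np.array([kmer_vector(s, k) for s in segs])
        return np.corrcoef(vecs)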

It was possible to demonstrate that C. elegans and Drosophila melanogaster introns auto- and cross-correlate from the point of view of oligonucleotide usage, and that, in both genomes, interspersed elements exhibiting intron-like features are abundant in regions that, according to current annotation, are intergenic. Clusters of these elements might mark as yet unpredicted genes, but we cannot rule out the speculative hypothesis that those non-coding regions that are subject to weak functional constraints might harbour similar elements. Methods for the automatic partitioning of chromosome-long sequences on this basis are under development.

1) C. Frontali, E. Pizzi (1999) Gene 232, 87-95.


8. Repeats in human genomic DNA and sequence heterogeneity
Dirk Holste, Humboldt University Berlin;
d.holste@itb.biologie.hu-berlin.de
Short Abstract:

 We study sequence heterogeneity of human chromosomes by quantifying dinucleotide correlations and frequency distributions of oligonucleotides. Using simple stochastic models, we quantify the presence of fluctuations and the extent to which interspersed repeats, monomeric tandem repeats, and CpG suppression can account for the heterogeneity and the increasing oligonucleotide nonuniformity.

One Page Abstract:

 The origin and extent of the base compositional variation and its relation to organization and function in human genomic DNA pose fundamental questions. The observed sequence heterogeneity may require active constraints for generating and maintaining these patterns, and the analysis of the sequence heterogeneity could contribute to an understanding of the nature of compositional constraints. In the past, several attempts have been made to relate those observations to known biological features such as the presence of period-3 bp, the length distribution of protein-coding regions, the presence and expansion of repeats, or the evolution of DNA.

The sequencing of the human genome provides a suitable occasion to test earlier propositions on the base composition of the human genome, such as the role of interspersed repeats, which comprise over 50% of the whole genome. We study statistical patterns in the DNA sequence of two human chromosomes, by quantifying short- and long-range dinucleotide correlations and by examining the nonuniformity of the frequency distribution of oligonucleotides.
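
One simple statistic of this kind (a hedged sketch, not necessarily the exact measure used in the study) is the Pearson correlation between base occurrences at distance d:

    # Distance-dependent dinucleotide correlation in a DNA sequence.
    import numpy as np

    def base_indicator(seq, base):
        """1.0 where the sequence carries `base`, else 0.0."""
        return np.fromiter((1.0 if c == base else 0.0 for c in seq), float)

    def correlation(seq, base1, base2, d):
        """Pearson correlation between base1 at position i and base2
        at position i + d, over all valid i (requires d >= 1)."""
        x = base_indicator(seq, base1)[:-d]
        y = base_indicator(seq, base2)[d:]
        if x.std() == 0 or y.std() == 0:
            return 0.0
        return float(np.corrcoef(x, y)[0, 1])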

We investigate to which degree known biological features may explain the observed statistical patterns. Using simple stochastic models, we study the role of interspersed repeats as a potential cause of the observed heterogeneity. We study the superposition of interspersed repeats and monomeric tandem repeats, and the suppression of CpG dinucleotides as possible features that may cause the increasing nonuniformity of the oligonucleotide distribution with increasing oligonucleotide length. 


9. PhageBase: A precalculated database of bacteriophage sequences
Frank Desiere, Nestle Research Centre;
Günther Kurapkat, Clemens Suter-Crazzolara, Lion Bioscience AG;
Harald Brüssow, Nestle Research Centre;
frank.desiere@rdls.nestle.com
Short Abstract:

 We have employed genomeSCOUT to create PhageBase, a multi-functional, pre-computed bacteriophage genome database. Information about protein homology (e.g. protein families, orthologs and clusters of orthologous groups, COGs) and gene order is collected, stored and visualised with the data integration system SRS. PhageBase will make it possible to address questions of phage evolution.

One Page Abstract:

 The accumulation of more and more complete bacteriophage genome sequences requires new computational approaches for dealing with these data. Bacteriophage genomes are not especially large; most are smaller than 200 kb. They nevertheless represent a formidable challenge to bioinformatics algorithms, since phages seem to be the result of both vertical and horizontal evolution and are thus a good model system for the web-like phylogenies (Brüssow and Desiere 2001) currently discussed for prokaryotic genomes. In addition to their extraordinary power to recombine and exchange functional modules, single genes and gene segments encoding single protein domains, phages with double-stranded DNA genomes show an about 10-fold higher mutation rate than their bacterial hosts (Li 1997). Furthermore, there is good reason to believe that tailed phages are as old as their prokaryote hosts. This situation holds several extraordinary challenges for bioinformatics investigations. The sensitivity of sequence search algorithms must be adaptable to very distantly related sequences and, in the absence of detectable sequence similarity, has to account for conserved gene order (synteny).

We have employed genomeSCOUT (Suter-Crazzolara and Kurapkat 2000) to create a multi-functional, pre-computed bacteriophage genome database that allows rapid identification and functional characterisation of genes and proteins through genome comparison. With a number of independent algorithms, information about different levels of protein homology (concerning e.g. protein families, orthologs and clusters of orthologous groups, COGs) and gene order is collected and stored. These databases are then used for interactive comparison of genomes and subsequent analysis. The application is based on the well-established data integration system SRS. SRS ensures at the same time the fast handling of many genomes, access to several pre-computed databases and linking functions between these databases. Last but not least, SRS offers fully integrated, user-friendly graphical representations of search results.

Gene context analysis in bacteriophages shows that the conservation of genome organization is surprisingly high compared to prokaryotic genomes (Wolf and Koonin 2001). The genome map is conserved across the Siphoviridae family. These phages infect both gram-negative and gram-positive Eubacteria as well as the Euryarchaeota branch of Archaea. In fact, the structural genes from E. coli phage lambda, Streptococcus phage Sfi21 and Archaeavirus psi-M2 can be superimposed. Gene strings which are conserved in taxonomically distant organisms are most likely functionally interacting genes. We assume that their conservation reflects an ancestral gene order. Accordingly, gene strings identified in new genomes can be assumed to be functionally linked, and the information on gene clustering can be used for functional predictions. PhageBase will hopefully make it possible to address problems which are currently discussed in bacterial genomics and bacterial phylogeny (Koonin et al., 2000): unity or diversity of origin, vertical versus horizontal gene transfer, nonorthologous gene displacement, tree- versus web-like phylogeny, synteny versus instability of gene order, gene splitting versus domain accretion.


10. Searching for regions of conservation in the Arabidopsis genome
Brad Chapman, John Bowers, Andrew H. Paterson, Plant Genome Mapping Laboratory, University of Georgia;
chapmanb@arches.uga.edu
Short Abstract:

 To identify regions of Arabidopsis which might serve as conserved anchor points for comparisons between different plant species, ESTs from crop species were compared to the ordered Arabidopsis genome. Conserved blocks of high sequence similarity were identified in Arabidopsis and categorized within the biological context of the genome.

One Page Abstract:

 With the recent completion of the Arabidopsis genome, a major challenge is applying the information known about Arabidopsis to help advance research in other plant species. Towards this goal, we were interested in finding regions of Arabidopsis that appeared to be preferentially conserved during evolution, hypothesizing that these regions could serve as anchor points for comparisons between multiple plant genomes. To identify regions of interest in Arabidopsis, we employed sequence comparison using BLAST searches to locate EST sequences from Sorghum, Cotton and Sugarcane on the Arabidopsis genome. By displaying the levels of sequence similarity across the genome, blocks appeared with groups of very high or low levels of sequence similarity. Using a probabilistic hidden Markov model approach, we categorized the blocks as either strongly or weakly conserved. Several regions of the Arabidopsis genome were identified that were similarly categorized using Sorghum, Cotton and Sugarcane comparisons, but not using comparisons with randomly generated sequences. To determine whether these regions were biologically meaningful, we looked at the distribution of Matrix Attachment Regions in the genome and attempted to correlate these structural elements with the conserved regions we had identified. Results of these analyses will be presented, along with in-depth analysis of some potentially conserved regions of the Arabidopsis genome. Points of discussion will include the potential evolutionary significance of the conserved regions, as well as the applicability of the results to research in less well-studied plant species.
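
A two-state segmentation of this kind can be sketched with a tiny Viterbi pass (our illustration; the emission and transition probabilities below are invented, not the fitted values of the study):

    # Label genome windows 'conserved'/'background' from a binary
    # similarity signal (0 = strong hit in the window, 1 = no hit).
    import numpy as np

    STATES = ("conserved", "background")
    LOG_TRANS = np.log([[0.95, 0.05],    # row: from-state, col: to-state
                        [0.05, 0.95]])
    LOG_EMIT = np.log([[0.8, 0.2],       # P(hit), P(no hit) in 'conserved'
                       [0.1, 0.9]])      # ... and in 'background'

    def viterbi(obs):
        """Most probable state path for a list of 0/1 observations."""
        v = np.log([0.5, 0.5]) + LOG_EMIT[:, obs[0]]
        back = []
        for o in obs[1:]:
            step = v[:, None] + LOG_TRANS        # 2x2: previous x next state
            back.append(step.argmax(axis=0))
            v = step.max(axis=0) + LOG_EMIT[:, o]
        path = [int(v.argmax())]
        for ptr in reversed(back):
            path.append(int(ptr[path[-1]]))
        return [STATES[s] for s in reversed(path)]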


11. Constructing Comparative Maps with Unresolved Marker Order
Debra Goldberg, Center for Applied Mathematics, Cornell University;
Susan McCouch, Department of Plant Breeding, Cornell University;
Jon Kleinberg, Department of Computer Science, Cornell University;
debra@cam.cornell.edu
Short Abstract:

 Species maps (genetic and physical) frequently include groups of markers (genes) whose precise relative order cannot be determined. We present efficient algorithms that construct comparative maps from such species maps in a principled manner. Our approach recognizes arrangements of co-located markers that give a most parsimonious comparative map. 

One Page Abstract:

 Comparative maps are a powerful tool for aggregating genetic information about related organisms, for predicting the location of orthologous genes, for understanding chromosome evolution, for inferring phylogenetic relationships and for examining hypotheses about the evolution of gene families and gene function in diverse organisms. The species maps which are the input to the process of constructing comparative maps are often themselves constructed from incomplete or inconsistent data, resulting in markers (or genes) whose precise relative order cannot be determined in the input species maps. This incomplete marker order information is generally handled in one of two ways: each marker may be assigned an interval on a chromosome, where the interval size varies for different markers and marker intervals may overlap, or sets of markers whose relative order cannot be reliably inferred are placed together in a bin which is mapped to a common location (megalocus). Previous automated and manual methods have handled such markers in an ad hoc or arbitrary fashion. 

We present efficient algorithms for each of the standard representations which systematically use all information provided to produce comparative maps in a principled manner. The algorithms extend our earlier work on the "chromosome labeling problem", which uses a dynamic programming technique to find an optimal balance of accuracy (the data should be explained well by the map) and parsimony (there should be relatively few homeologous segments, so that only syntenic relationships above our confidence threshold are labeled). We handle the overlapped interval representation by a direct extension of this technique.
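
The accuracy/parsimony trade-off can be sketched as a small dynamic program (our simplified reconstruction from the description above, not DeCAL itself; the unit mismatch cost and the single break penalty are assumptions):

    def label_markers(homolog_chrom, chromosomes, break_penalty=2.0):
        """homolog_chrom[i]: chromosome of marker i's homolog in the
        other species. Returns a labeling minimising the number of
        mismatches plus break_penalty per label change (segment break)."""
        cost = {c: (0.0 if c == homolog_chrom[0] else 1.0) for c in chromosomes}
        back = []
        for h in homolog_chrom[1:]:
            best_c = min(cost, key=cost.get)     # cheapest state to switch from
            new_cost, ptrs = {}, {}
            for c in chromosomes:
                stay = cost[c]
                switch = cost[best_c] + break_penalty
                if stay <= switch:
                    new_cost[c], ptrs[c] = stay, c
                else:
                    new_cost[c], ptrs[c] = switch, best_c
                new_cost[c] += 0.0 if c == h else 1.0
            back.append(ptrs)
            cost = new_cost
        labels = [min(cost, key=cost.get)]
        for ptrs in reversed(back):
            labels.append(ptrs[labels[-1]])
        return list(reversed(labels))

    # An isolated discordant homolog is smoothed over rather than
    # opening two segment breaks:
    print(label_markers(list("11211"), {"1", "2"}))   # -> all '1'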

Our main algorithms focus on the megalocus model, in which the input markers are partitioned into sets: relative order between sets is fully known, while relative marker order within a set is completely unknown. For this model, we present algorithms which not only use the available information, but also arrange the co-located markers in a most parsimonious way. The chromosome labeling problem with unknown ordering is thus placed on a principled footing via these algorithms in which results are optimized over all possible orderings. This canonical marker order can be viewed as a working hypothesis about the original incomplete data set, and can serve as a basis for further lab work.
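
A minimal dynamic-programming sketch of the accuracy/parsimony trade-off described above, restricted to the base chromosome labeling problem with fully ordered markers; the penalty weight and the toy input are illustrative, and the interval and megalocus extensions are omitted.

    def label_chromosome(homologs, change_penalty=2.0):
        """Label each marker with a segment, trading fit against parsimony."""
        labels = sorted(set(homologs))
        cost = {l: 0.0 for l in labels}   # best cost of a labeling ending in l
        back = []
        for h in homologs:
            new_cost, choice = {}, {}
            for l in labels:
                def trans(p):             # cost of arriving at label l from p
                    return cost[p] + (0.0 if p == l else change_penalty)
                best_prev = min(labels, key=trans)
                mismatch = 0.0 if l == h else 1.0   # accuracy term
                new_cost[l] = trans(best_prev) + mismatch
                choice[l] = best_prev
            cost = new_cost
            back.append(choice)
        label = min(cost, key=cost.get)   # trace back from the cheapest label
        path = [label]
        for choice in reversed(back[1:]):
            label = choice[label]
            path.append(label)
        return list(reversed(path))

    # One likely spurious homolog inside an otherwise conserved segment.
    print(label_chromosome([1, 1, 3, 1, 2, 2, 2]))   # -> [1, 1, 1, 1, 2, 2, 2]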

 A preliminary version of "DeCAL" (Detecting Common Ancestral Linkage-segments), an open-source program based on these algorithms, is now available. For input, it requires the positions of the markers of one species, as well as the locations of homologs to each marker in the second species. Output is given both graphically and in text form. Only a single parameter is required, and it has a simple biological interpretation. The program constructs comparative maps in a few minutes. Results have been evaluated for diverse pairs of species, and closely approximate prior manual expert analyses.


12. Identification of novel small RNA molecules in the Escherichia coli genome: from in silico to in vivo (up)
Ruth Hershberg, Liron Argaman, Department of Molecular Genetics and Biotechnology The Hebrew University -Hadassah Medical School;
Joerg Vogel, Institute of Cell and Molecular Biology, Biomedical Center, Uppsala University;
Gill Bejerano, Institute of Computer Science, The Hebrew University;
E. Gerhart H. Wagner, Institute of Cell and Molecular Biology, Biomedical Center, Uppsala University;
Shoshy Altuvia, Hanah Margalit, Department of Molecular Genetics and Biotechnology The Hebrew University -Hadassah Medical School;
rutih@md.huji.ac.il
Short Abstract:

 Small RNAs (sRNAs) have been difficult to detect both experimentally and computationally. We developed a computational strategy to search the Escherichia coli genome for sRNA-encoding genes, using transcriptional signals and genomic features to predict 41 sRNAs. Of the 23 candidates tested experimentally, 17 were shown to be genuine sRNAs.

One Page Abstract:

 Small, untranslated RNA molecules exist in all kingdoms of life. These RNAs carry out diverse functions, and many of them are regulators of gene expression. Genes encoding small RNAs (sRNAs) are difficult to detect experimentally or to predict by traditional sequence analysis approaches. Thus, in spite of the importance of these molecules, many of the sRNAs known to date were discovered fortuitously. We developed a computational strategy to search the Escherichia coli genome for genes encoding small RNAs. Our method was based on the transcription signals and genomic features, such as location and conservation, that characterize the 10 known sRNAs in E. coli. The search was limited to regions of the genome in which no gene exists on either strand. These regions were searched for transcriptional signals: promoter sequences recognized by the major sigma factor of E. coli RNA polymerase (sigma70), and Rho-independent terminators. Sequences for which the distance between the predicted promoter and terminator was 50-400 bases were compared to the genome sequences of other bacteria, and those showing good conservation were predicted to be sRNAs. Twenty-three of the predicted genes were tested experimentally, of which 17 were shown to be expressed in E. coli. The newly discovered sRNAs showed diverse expression patterns, and most of them were abundant.
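
A minimal sketch of the promoter/terminator pairing step, assuming signal positions have already been produced by separate (here hypothetical) sigma70 promoter and Rho-independent terminator predictors on the same strand; the coordinates are toy values.

    def candidate_srnas(promoters, terminators, min_len=50, max_len=400):
        """Pair each promoter with downstream terminators 50-400 nt away."""
        candidates = []
        for p in promoters:
            for t in sorted(terminators):
                if min_len <= t - p <= max_len:
                    candidates.append((p, t))    # putative sRNA gene span
                elif t - p > max_len:
                    break                        # terminators are sorted
        return candidates

    # Toy coordinates within one intergenic region.
    print(candidate_srnas([100, 900], [60, 250, 1600]))   # -> [(100, 250)]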


13. Operons: conservation and accurate predictions among prokaryotes (up)
Gabriel Moreno-Hagelsieb, Julio Collado-Vides, CIFN-UNAM;
moreno@cifn.unam.mx
Short Abstract:

 Neighboring genes within known operons are shown to be more conserved at three levels (co-occurrence, adjacency, and gene fusion) than genes at transcription unit (TU) boundaries. We also show that an operon prediction method designed with information from E. coli works with other prokaryotes.

One Page Abstract:

 Based on a database of experimentally characterized transcription units (TUs) of Escherichia coli and its genomic annotations, we show that adjacent genes within operons (polycistronic TUs) are more conserved than adjacent genes found at TU boundaries (the last gene in one TU and the first in the next). The conservation is measured at three levels: (1) co-occurrence: two genes found in an operon each have an ortholog in another genome more frequently than two genes at TU boundaries; that is, genes at TU boundaries are more frequently left as orphans. (2) Adjacency: among gene pairs where both orthologs are present, those found in operons in E. coli are more frequently conserved as neighbors in other genomes. (3) Fusion: genes within operons can be found as fusions in other genomes. We also show that the TU prediction method we developed with information from E. coli TUs performs well (more than 82% accuracy) against a collection of known operons of Bacillus subtilis, and we provide evidence of its utility for predicting the transcription unit organization of all prokaryotes. Genes predicted to be in operons show higher conservation of adjacency than genes predicted to be at TU boundaries, and the population of operons of each organism can readily be estimated from the intergenic distance distributions of pairs of adjacent genes on the same strand.
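
A minimal sketch of distance-based operon grouping in the spirit of the intergenic-distance observation above; the gap threshold and gene coordinates are illustrative, not the trained values from the study.

    def predict_operons(genes, max_gap=55):
        """Group adjacent same-strand genes; genes = (start, end, strand)."""
        operons, current = [], [genes[0]]
        for prev, nxt in zip(genes, genes[1:]):
            gap = nxt[0] - prev[1]               # intergenic distance
            if nxt[2] == prev[2] and gap <= max_gap:
                current.append(nxt)              # same transcription unit
            else:
                operons.append(current)          # TU boundary
                current = [nxt]
        operons.append(current)
        return operons

    genes = [(0, 900, "+"), (920, 1800, "+"), (2600, 3500, "+"), (3600, 4200, "-")]
    print([len(op) for op in predict_operons(genes)])   # -> [2, 1, 1]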


14. The EBI Proteome Analysis Database (up)
Paul Kersey, Apweiler, R., Biswas, M., Fleischmann, W., Kanapin, A., Karavidopoulou, Y., Kriventseva, E., Mittard, V., Mulder, N., EMBL-European Bioinformatics Institute;
Phan, I., Swiss Institute of Bioinformatics;
Zdobnov, E., EMBL-European Bioinformatics Institute;
pkersey@ebi.ac.uk
Short Abstract:

 The EBI Proteome Analysis Database has been developed to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. Complete non-redundant sets of SWISS-PROT and TrEMBL entries are assembled for each proteome and are analysed using the InterPro and CluSTr databases, GO, and structural information.

One Page Abstract:

 The dramatic growth in the number of fully sequenced organisms in recent years has created new challenges and opportunities for biological sequence databases. It is now possible to make statistical and comparative analyses of organisms based on their entire proteome. While the specific function of a newly predicted protein cannot be known for certain from its sequence alone, the use of protein domain databases allows the assignment of proteins to families and therefore allows a proteome to be described in terms of its composition. For example, it is possible to establish that a given protein family may be found in a restricted portion of the taxonomic range; that two organisms share certain protein families, but not others; or that a particular family is especially highly represented in a certain species.

The Proteome Analysis Database has been developed by the SWISS-PROT group at the EBI in order to provide such an analysis. Several features distinguish this database. Firstly, non-redundant, up-to-date, comprehensive data sets are maintained for each complete proteome, so that the statistical analysis is not skewed. These are created by selecting entries from the high-quality protein sequence databases SWISS-PROT and TrEMBL. Protein sequence data is tracked into TrEMBL from genome sequencing projects, and merges with existing entries are accounted for. Special procedures are used to establish eukaryotic proteomes. The facility to perform unbiased sequence similarity searches against these sets is offered.

Secondly, a powerful set of tools has been chosen to analyse the sets. The Proteome Analysis Database uses InterPro, an integrated database of protein domains, and CluSTr, a database that groups proteins according to overall sequence similarity. Proteins can also be functionally classified according to the Gene Ontology. Additionally, structural information relevant to each proteome is provided.

Thirdly, the entire database is updated weekly and kept synchronised with the underlying sequence databases from which it is constructed. Finally, a web-based interface allows users to customise their own comparative analysis using the resources made available by the database, while popular queries are precomputed for rapid response. 


15. Practical transcriptome analysis system in the RIKEN mouse cDNA project (up)
Hidemasa Bono, Takeya Kasukawa, Itoshi Nikaido, Yasushi Okazaki, Yoshihide Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC);
bono@gsc.riken.go.jp
Short Abstract:

 A practical transcriptome analysis system for a large-scale cDNA project, the RIKEN mouse cDNA project, is presented. cDNA library data, full-length sequences, chromosomal mapping information and gene expression information are integrated to analyze the mouse transcriptome in conjunction with sequence information from human and other model organisms.

One Page Abstract:

 We are pursuing the RIKEN mouse encyclopedia project, which attempts to catalogue, as an encyclopedia: (1) full-length cDNA clones; (2) full-length cDNA sequences; (3) mapping information for all cDNA clones; and (4) gene expression information for all genes. Last August, we held the FANTOM meeting to annotate 21,076 RIKEN mouse cDNA clones with functional information (Nature, 409, 685-690, (2001)). For that meeting, we built a web-based system called FANTOM+ that includes the functional annotation information as well as graphical sequence analysis reports (http://www.gsc.riken.go.jp/e/FANTOM/). These FANTOM efforts are now being expanded into a practical mouse transcriptome analysis system that organizes not only functional annotation but also biological knowledge that may contain inconsistent information. We will report the status of this project.


16. The automated identification of novel lipases/esterases on a multi-genome scale (up)
Sanna Herrgard, Stephen A. Cammer, Jen Montimurro, Jeffrey A. Speir, Brian Hoffman, Susan M. Baxter, Jacquelyn S. Fetrow, GeneFormatics, Incorporated;
sannaherrgard@geneformatics.com
Short Abstract:

 We have applied threading and protein functional descriptors to identify sequences with putative lipase and esterase functions in four gram-positive genomes. These studies yielded 15 sequences previously unreported as having lipase/esterase functions. Our findings are supported by 3D-conservation profiles between the active sites in known lipases/esterases and our assignments.

One Page Abstract:

 The rapid and accurate functional annotation of the growing number of DNA and protein sequences has become a key challenge of the post-genomic era. Current annotation methods rely heavily upon simple sequence similarity; a protein sequence of undetermined biochemical function is assumed to have the same function as the protein most similar in sequence to it. Since protein sequences are generally less conserved than protein structures, sequence-based annotation methods often fail to detect proteins with low sequence similarity. In order to circumvent the limitations of sequence-based approaches, we have screened structure models obtained by threading with a library of function-specific structural descriptors (Fuzzy Functional Forms). We demonstrate the use of this method in rapidly annotating entire genomes and identifying novel function assignments for ORFs for which conventional sequence-based annotation methods fail. Specifically, we have assigned novel lipase and esterase functions to 15 sequences in the genomes of four gram-positive bacteria: Bacillus subtilis, Ureaplasma urealyticum, Mycoplasma pneumoniae and Mycobacterium tuberculosis. Our findings are supported by the sequence-structure conservation profiles between the active sites in known lipases/esterases and our assignments. These analyses indicate that even though the overall sequence similarity between known lipases/esterases and our assigned ORFs is often low, remarkable local similarities exist in the predicted active sites.


17. Identification of membrane protein orthologs in worm, fly and human (up)
Gang Liu, Christian E. V. Storm, Erik L. L. Sonnhammer, Center for Genomics and Bioinformatics (CGR), Karolinska Institute;
Gang.Liu@cgr.ki.se
Short Abstract:

 Genome-wide identification of transmembrane protein orthologs has been carried out. Hidden Markov models (HMMs) which were previously built for membrane protein families of worm were used to search for homologs in other species. Orthologous relationships were assigned by using Orthostrapping, a phylogeny-based method that gives orthology confidence.

One Page Abstract:

 Based on the completed genome sequencing projects of the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, as well as the newly finished human genome, we are able to analyze orthologous relationships in protein families between higher eukaryotes. This study will help us to understand gene function and the evolution of these protein families. Previously, C. elegans proteins with at least two transmembrane segments were classified into 189 clusters, and hidden Markov models (HMMs) were created for each protein family (Remm and Sonnhammer, Genome Res., 10:1679, 2000). We used these models to retrieve fly and human homologs. 52% of the clusters contain members from worm, fly and human, while 8% of the clusters are present only in worm and fly. Only 2% of the clusters contain worm and human homologs but no fly homolog. The remaining 37% of the clusters are worm-specific. The clusters were analyzed for orthologs using Orthostrapping, a phylogeny-based method that gives orthology confidence values. We present a list of putative membrane protein orthologs in worm, fly, and human.


18. Discovering Binding Sites from Expression Patterns: A simple Hyper-Geometric Approach (up)
Yoseph Barash, Gill Bejerano, Tommy Kaplan, Nir Friedman, School of Computer Science & Engineering, The Hebrew University;
hoan@cs.huji.ac.il
Short Abstract:

 We present a fast approach to transcription factor binding site discovery. Using a simple hypergeometric model we rapidly find short conserved patterns within a gene group compared to its genome background. These seeds are iteratively expanded into PSSMs. We analyze recent yeast and human datasets, and compare to MEME. 

One Page Abstract:

 A central issue in molecular biology is understanding the regulatory mechanisms that control gene expression. The recent flood of genomic and post-genomic data opens the way for computational methods elucidating the key components that play a role in these mechanisms. One important consequence is the ability to recognize groups of genes that are co-expressed using whole-genome expression patterns. Our aim is to identify in silico putative transcription factor binding sites in the promoter regions of these genes that explain the co-regulation and hint at possible regulators.

In this paper we describe a simple, fast, yet powerful approach to this task, using a hyper-geometric statistical model and a straightforward computational procedure. This yields small conserved sequence seeds that are statistically significant compared to the genome-wide promoter background. We then expand these short seeds into position-specific scoring matrices using an EM-like procedure. We demonstrate the utility and speed of our methods by applying them to several recent yeast and human data sets. We also compare our results with those of MEME when run on the same sets.
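
A minimal sketch of the hyper-geometric tail probability for one candidate seed, assuming the pattern occurrences have already been counted; N promoters genome-wide, K containing the seed, n co-expressed genes, k of which contain it. The counts below are toy numbers.

    from math import comb

    def hypergeom_pvalue(N, K, n, k):
        """P(X >= k) for X ~ Hypergeometric(N, K, n)."""
        return sum(comb(K, i) * comb(N - K, n - i)
                   for i in range(k, min(K, n) + 1)) / comb(N, n)

    print(hypergeom_pvalue(6000, 300, 50, 12))   # small value => enriched seed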


19. Visualizing whole genome comparisons: Artemis Comparison Tool (ACT) (up)
Keith James, Kim Rutherford, The Sanger Centre;
kdj@sanger.ac.uk
Short Abstract:

 Artemis Comparison Tool (ACT) is a DNA sequence comparison viewer based on Artemis. It presents a graphical representation of pairwise sequence comparison data generated by programs such as BLAST. EMBL, GenBank or GFF format annotation can be loaded in addition and presented in context with the comparison.

One Page Abstract:

 The amount of data obtained from a pairwise comparison of whole genomes can be overwhelming, even when those genomes are highly similar. Interesting features such as syntenic regions, insertions, deletions, dispersed repeats, and large-scale inversions or translocations are often not immediately apparent from the raw alignment output (e.g. BLAST output). Often the genomic context of these raw results, with respect to gene predictions and existing annotation, is lost.

Artemis Comparison Tool (ACT) is a DNA sequence comparison viewer based on Artemis. It presents a graphical representation of pairwise sequence comparison data generated by programs such as BLAST. EMBL, GenBank or GFF format annotation can be loaded in addition and presented in context with the comparison. The user can pan and zoom the view interactively, examine per-CDS database search results and create or edit annotation from within the ACT environment. Example ACT analyses of genomes sequenced at the Pathogen Sequencing Unit are presented.

In common with Artemis, ACT is written in Java and runs on UNIX, GNU/Linux, Macintosh and MS Windows systems. ACT is free software and is distributed under the terms of the GNU General Public License.

ACT is available from the ACT web site: http://www.sanger.ac.uk/Software/ACT/


20. The use of Artemis for the annotation of eukaryotic genomes (up)
Valerie Wood, Kim Rutherford, The Sanger Centre;
val@sanger.ac.uk
Short Abstract:

 Artemis is a DNA sequence viewer and annotation tool written in Java. It can read and write EMBL, GenBank and GFF format feature information and display these in the context of the sequence and its six frame translation.

One Page Abstract:

 Artemis is a DNA sequence viewer and annotation tool written in Java. It can read EMBL, GenBank and GFF format feature information and display these in the context of the sequence and its six frame translation. The package can also display the results of external analyses, or plot the results of statistical calculations, performed on the sequence or CDS features.

Artemis is the main annotation tool used for genome analysis in the Pathogen Sequencing Unit at the Sanger Centre, and is used routinely for annotation of eukaryotic genomes.

The use of Artemis for the analysis and annotation of the completed genome of the unicellular fungus Schizosaccharomyces pombe, and the reannotation of Saccharomyces cerevisiae will be presented. Artemis is also used in the annotation of the parasitic worm Brugia malayi, the social amoeba Dictyostelium discoideum and the unicellular eukaryotic parasites Plasmodium falciparum, Trypanosoma brucei, Leishmania major and Toxoplasma gondii.

 Artemis is available from the Artemis web site: http://www.sanger.ac.uk/Software/Artemis/

The European S. pombe genome sequencing project can be accessed at http://www.sanger.ac.uk/Projects/S_pombe/


21. A de novo approach to identifying repetitive elements in genomic sequences (up)
Elizabeth Thomas, John Healy, Cold Spring Harbor Laboratory;
Nathan Srebro, Massachusetts Institute of Technology;
Jacob Schwartz, New York University;
Michael Wigler, Cold Spring Harbor Laboratory;
thomase@cshl.org
Short Abstract:

 We investigate tools for identifying repeats in genomic sequences, using whole genome frequencies of short oligomers. Highly-repetitive elements, even if poorly conserved, will contain many high frequency oligomers. We can identify repetitive elements using this signature, without depending on prior knowledge of repeat sequence or structure.

One Page Abstract:

 Less than 2% of the human genome codes for proteins (IHGSC, 2001). It has been estimated that 50% of the remaining sequence consists of interspersed repetitive elements, not all of which have been classified or identified. An understanding of the origins of these repetitive elements, and their diversity, is likely to shed light on the evolution of genomes. Tools commonly used for defining and identifying repeats depend on prior knowledge about the structure and sequence of known repeats. Because of this assumption of prior knowledge, these tools are inappropriate for identifying unknown repetitive elements in genomic sequences. Now that whole genome sequences are available, new approaches can be taken, which depend merely on the simple fact that repetitive elements repeat. We investigate tools based on the whole genome frequencies of short oligomers, and simple algorithms that can be applied to these frequencies. Highly repetitive elements, even if poorly conserved, will contain many high-frequency oligomers. We can identify repetitive elements using this signature, without depending on prior knowledge of repeat sequence or structure.
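
A minimal sketch of the oligomer-frequency signature, with illustrative oligomer length, window size and count threshold; a real genome would call for a more memory-conscious counter.

    from collections import Counter

    def high_frequency_windows(genome, k=12, window=200, min_count=10):
        """Flag windows dominated by k-mers that are frequent genome-wide."""
        counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
        flagged = []
        for w in range(0, len(genome) - window + 1, window):
            chunk = genome[w:w + window]
            freqs = [counts[chunk[i:i + k]] for i in range(len(chunk) - k + 1)]
            if sum(f >= min_count for f in freqs) > len(freqs) // 2:
                flagged.append((w, w + window))   # putative repetitive element
        return flagged

    # Toy genome: a 40 bp unit repeated 12 times between two short flanks.
    genome = "ACGTTGCA" * 5 + "GATTACAGATTACAGGCCTTAGGCCTTAACGTACGTACGT" * 12 + "TTGACCAT" * 5
    print(high_frequency_windows(genome))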


22. Extendable parallel system for automatic genome analysis (up)
Audrius Meskauskas, Frank Lehmann-Horn, Karin Jurkat Rott, Ulm University;
Audrius.Meskauskas@medizin.uni-ulm.de
Short Abstract:

 We developed an automatic system for analysing all potential genes in a defined region. We created 15 modules for sequence retrieval, E-PCR, gene prediction, similarity search, protein pattern search, etc. The system was used to search for a gene between markers D3S2370 and D3S1292 on human chromosome 3.

One Page Abstract:

 Experimental linkage analysis often indicates only a large region in which the gene of interest is located. It is possible to narrow the search interval by bioinformatic methods. Such methods are usually developed by specialised bioinformatics groups and are accessible from their Internet pages. The preferred set of programs depends on the type of gene being cloned and on the general strategy of the respective research group. Submitting numerous requests to different Internet servers and analysing the received responses takes a large amount of a qualified researcher's time. Therefore, we developed a Java-based data-mining program to detect and analyse all possible genes in a defined chromosome region. We created modules for the following tasks: 1. automatic conversion between the WUSM and NCBI clone naming systems; 2. retrieving the sequences and coordinates for a given set of markers; 3. retrieving the list of clones for a given NCBI contig; 4. sequence retrieval; 5. E-PCR; 6. gene prediction; 7. BLAST similarity search of the EST database; 8. collecting additional information on the tissues in which similar cDNAs were detected; 9. predicted protein pattern search, revealing the potential protein family; 10. transmembrane region detection; 11. protein sorting signal detection; 12. PEST region detection; 13. translating between gene position within a clone and gene position within an NCBI contig; 14. finding which of the genes predicted by the system are already described; 15. a central kernel for parallel submission of requests. The system was used to predict and analyse all genes lying between markers D3S2370 and D3S1292 on human chromosome 3. We noticed that with a lower confidence level, GenScan predicts more genes than are given in the NCBI database for this region (we determined the exact dependency curve). In some cases these newly predicted genes showed correlations between the presence of transmembrane helices, peptide sorting signals and specific protein domains. BLAST also revealed their similarity to known cDNAs from different organs. Since they also have promoter and poly(A) signals, at least some of these sequences may be functional genes. The information obtained for the predicted genes was analysed in the context of existing hypotheses about the function of the gene being sought. This reduced the search interval from about 6 Mbp to a much smaller set of potential coding sequences, fulfilling its task in the gene cloning project.
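
A minimal sketch of the parallel-submission kernel (module 15); the worker functions and the marker-interval argument are illustrative stand-ins for the actual Java modules.

    from concurrent.futures import ThreadPoolExecutor

    def e_pcr(region):           return ("e-pcr", region)       # placeholder analyses
    def gene_prediction(region): return ("genes", region)
    def blast_est(region):       return ("blast-est", region)

    def analyse_region(region, modules=(e_pcr, gene_prediction, blast_est)):
        """Submit independent analyses concurrently and collect the results."""
        with ThreadPoolExecutor(max_workers=len(modules)) as pool:
            futures = [pool.submit(m, region) for m in modules]
            return [f.result() for f in futures]

    print(analyse_region("D3S2370-D3S1292"))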


23. Whole genome phylogenies using vector representations of protein sequences (up)
Gary W. Stuart, Department of Life Sciences, Indiana State University;
Jeffery J. Leader, Department of Mathematics, Rose-Hulman Institute of Technology;
G-Stuart@indstate.edu
Short Abstract:

 Optimized SVD-based vector representations of proteins from whole genomes were used to produce comprehensive gene and species phylogenies. A pilot analysis using 832 mitochondrial proteins from 64 vertebrates produced a robust and accurate tree. A larger analysis using nearly 30,000 proteins from 17 bacterial genomes revealed some non-traditional relationships.

One Page Abstract:

 Accurate phylogenetic trees have been produced following the singular value decomposition (SVD) of data matrices containing vector representations of all proteins encoded within complete genomes. Both gene trees and species trees have been derived using this method. In a pilot analysis, the complete set of 13 mitochondrial proteins from each of 64 vertebrates was used to produce a matrix representing each protein in terms of its tetrapeptide frequencies. SVD with dimension reduction was then used to provide adjusted vector representations for each protein in multidimensional space. Pairwise cosine (similarity) values were determined and converted to distance measures as required for the generation of phylogenetic trees using the NEIGHBOR program of PHYLIP. The resulting gene trees indicated that this method was clearly capable of recognizing and grouping similar proteins, as most members of the 13 mitochondrial protein families were accurately placed in large monophyletic or nearly monophyletic groups. An optimal dimension reduction was determined that produced the best grouping of genes within families.
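
A minimal sketch of the vector representation and SVD steps, using a toy dipeptide alphabet in place of the 20^4 tetrapeptide space and made-up sequences; requires NumPy.

    import numpy as np
    from itertools import product

    PEPS = ["".join(p) for p in product("ACDE", repeat=2)]   # toy peptide space

    def freq_vector(seq):
        v = np.zeros(len(PEPS))
        for i in range(len(seq) - 1):
            v[PEPS.index(seq[i:i + 2])] += 1
        return v

    proteins = ["ACDEACDE", "ACDEACDA", "EDCAEDCA", "EEEEDDDD"]
    M = np.array([freq_vector(s) for s in proteins])

    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    reduced = U[:, :2] * S[:2]       # adjusted vectors after dimension reduction

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similar proteins keep a high cosine after reduction.
    print(cosine(reduced[0], reduced[1]), cosine(reduced[0], reduced[3]))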

Species trees were then produced by 1) summing the optimized SVD-based vector representations of the individual mitochondrial proteins from each organism, 2) deriving cosine-based distance values for each pair of summed vectors, and 3) using NEIGHBOR to generate trees from the resulting distance values. Within these trees, cartilaginous fish, bony fish, reptiles, birds, non-eutherian mammals, and eutherian mammals were well grouped and reasonably arranged.

Following the successful analysis of complete mitochondrial genomes, we applied this method to the genomes of 17 bacteria, including 4 archaebacterial species. Both selected partial genome datasets (~2300 proteins) and whole genome datasets (~30,000 proteins) were analyzed. Optimal dimension reduction was estimated in some cases by observing how well genes were grouped into COG families. The resulting species trees tended to reinforce many traditional bacterial relationships, while challenging others. For instance, Borrelia burgdorferi, the spirochete responsible for Lyme disease, grouped with Rickettsia prowazekii, a proteobacterium, instead of Treponema pallidum, another spirochete.

With further refinements and increased computational power, it should be possible to produce exhaustive biomolecular phylogenies from a large number of complete prokaryotic and eukaryotic genomes.
 
 


24. Correlated Sequence Signature as Markers of Protein-Protein Interaction (up)
Einat Sprinzak, Hanah Margalit, Department of Molecular Genetics and Biotechnology The Hebrew University -Hadassah Medical School;
einats@md.huji.ac.il
Short Abstract:

 We propose a novel approach for clustering pairs of interacting proteins by combinations of their sequence signatures. The identified correlated sequence signatures can be used as markers for predicting protein-protein interactions in the cell. Such an approach significantly reduces the search space of possible interactions and enables directed experimental screens.

One Page Abstract:

 As protein-protein interaction is intrinsic to most cellular processes, the ability to predict which proteins in the cell interact can aid significantly in identifying the function of newly discovered proteins, and in understanding the molecular networks they participate in. An appealing approach would be to predict the interacting partners by characteristic sequence motifs that typify the proteins involved in the interaction. Valuable insight towards this end can be gained by mining databases of experimentally determined interacting proteins. Conventionally, single protein sequences have been clustered into families by distinct sequence signatures. Here we propose a novel approach for clustering different pairs of interacting proteins by combinations of their sequence signatures. To identify such informative signature combinations, a database of interacting proteins is required, as well as a scheme for characterizing protein sequences by their signatures. In the current study we demonstrate the potential of this approach on a comprehensive database of experimentally determined pairs of interacting proteins in the yeast S. cerevisiae. The proteins are characterized by sequence signatures, as defined by the InterPro classification. A statistical analysis is performed on all possible combinations of two sequence signatures, identifying combinations of sequence signatures that are over-represented in the database of pairs of interacting proteins. It is proposed that such correlated sequence signatures can be used as markers for predicting unknown protein-protein interactions in the cell. Such an approach significantly reduces the search space of possible interactions and enables directed experimental interaction screens.
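
A minimal sketch of scoring signature pairs for over-representation, assuming interaction pairs and per-protein signature sets as input; the independence-based expectation below is a crude stand-in for the statistical analysis used in the study, and the data are toy values.

    from collections import Counter
    from itertools import product

    def correlated_signatures(interactions, signatures):
        pair_counts, single = Counter(), Counter()
        for a, b in interactions:
            for s, t in product(signatures[a], signatures[b]):
                pair_counts[tuple(sorted((s, t)))] += 1
            single.update(signatures[a])
            single.update(signatures[b])
        n = len(interactions)
        return {pair: obs / (single[pair[0]] * single[pair[1]] / (2.0 * n))
                for pair, obs in pair_counts.items()}   # observed / expected

    interactions = [("p1", "p2"), ("p3", "p4"), ("p5", "p2")]
    signatures = {"p1": {"SH3"}, "p2": {"PRO_RICH"}, "p3": {"SH3"},
                  "p4": {"PRO_RICH"}, "p5": {"KINASE"}}
    print(correlated_signatures(interactions, signatures))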


25. Molecular and Functional Plasticity in the E. coli Metabolic Map (up)
Sophia Tsoka, Christos Ouzounis, EMBL-EBI;
tsoka@ebi.ac.uk
Short Abstract:

 Large-scale sequence analysis of the enzymes comprising the entire metabolic complement of Escherichia coli was performed. Reactions and pathways were grouped according to sequence similarity of the corresponding enzymes, and enzyme families were mapped to corresponding reactions and pathways. Functional convergence/divergence is assessed and modes of pathway evolution are discussed.

One Page Abstract:

 Genome analysis by sequence similarity provides useful hints for evolutionary and functional relations between proteins. Especially important is the identification of cases of divergent or convergent evolution, whereby similar sequences have different function and vice versa. These represent cases that are often overlooked by functional assignments based on detection of sequence similarity. Furthermore, understanding the intricacies of the sequence-to-function relationship for metabolic enzymes (1) can also enable reconstruction of the evolutionary history of protein function diversification and biochemical pathways (2).

Large-scale sequence analysis of the enzymes comprising the entire metabolic complement of Escherichia coli (3) has been undertaken in order to reveal the molecular and functional diversity of metabolic network components. Metabolic enzymes and their functional roles in terms of EC number classification and pathway involvement were identified from the Ecocyc database (4). Automated sequence clustering of the E. coli enzymes was performed (5) to identify enzyme families. Subsequently, reactions and pathways were grouped according to the sequence similarity of the corresponding enzymes. The mapping of similar sequences into different reactions and pathways delineates to a significant extent cases of evolutionary divergence and convergence of protein families (6).

1. Ouzounis CA and Karp PD, Genome Res. 2000, 10:568-76.
2. Tsoka S and Ouzounis CA, FEBS Lett. 2000, 480:42-8.
3. Tsoka S and Ouzounis CA, Nature Genet. 2000, 26:141-2.
4. Karp PD et al., Nucleic Acids Res. 2000, 28:56-9.
5. Enright AJ and Ouzounis CA, Bioinformatics 2000, 16:451-7.
6. Tsoka S and Ouzounis CA, submitted.
 
 


26. Genome Size Distribution in Prokaryotes, Eukaryotes, and Viruses (up)
David Ussery, Heidi Dvinge, Herluf Riddersholm, Nikolaj Blom, Kristoffer Rapacki, Center for Biological Sequence Analysis, Biocentrum-DTU, Denmark;
dogs@cbs.dtu.dk
Short Abstract:

 The haploid genome size for more than 5000 organisms is compared. We find a large distribution of sizes, ranging from about 1000 base-pairs (bp) to 670,000,000,000 bp. We compare the genome sizes and repeats for chromosomes from all four eukaryotic kingdoms as well as for prokaryotes and viruses. 

One Page Abstract:

 The haploid genome size for more than 5000 organisms is compared. We find a very large distribution of sizes, ranging from around a 1000 base-pair (bp) viral genome to more than 670,000,000,000 bp for Amoeba dubia. We compare the genome size for all four eukaryotic kingdoms as well as for prokaryotes and viruses. Whilst there is no correlation between the biological complexity of an organism and the size of its genome, there is often a correlation between the size of the nucleus and the genome size. That is, the concentration of the DNA appears to be constant in certain groups of organisms. We compare the genome size to the fraction of various types of DNA repeats for 65 sequenced eukaryotic chromosomes and more than 100 prokaryotic chromosomes, as well as more than 300 viral chromosomes. In general, larger eukaryotic chromosomes contain more repetitive DNA than would be expected for a random sequence of the same base-composition, and direct repeats occur more often than inverted repeats.

The Database Of Genome Sizes (DOGS) can be found at the following URL: http://www.cbs.dtu.dk/databases/DOGS/index.html


27. EnsEMBL Genome Annotation Project (up)
Ewan Birney, EBI;
Michelle Clamp, Tim Hubbard, The Sanger Center;
Lukasz Huminiecki, Emmanuel Mongin, Arne Stabenau, EBI;
birney@ebi.ac.uk
Short Abstract:

 EnsEMBL, a joint project between the Sanger Centre and the EBI, is an automatic annotation system for eukaryotic genomes. All data and code are freely available and easy to access through CVS and the web. All code is written in object-oriented Perl using MySQL as the backend relational database.

One Page Abstract:

 EnsEMBL is an automatic annotation system for eukaryotic genomes. It has been designed as a fully portable and platform-independent system which handles both finished and unfinished genomes. It provides annotations at both the nucleic acid and amino acid level. Collectively, the features identified on the DNA sequence by EnsEMBL mostly comprise genes, transcripts (alternative splice variants), exons, markers, SNPs, repeats and regions highly similar to other sequences. For each predicted peptide, EnsEMBL provides InterPro (Pfam, PRINTS, PROSITE) domain annotation.

Currently EnsEMBL distributes data for the human and mouse genomes. These data are contained in a number of relational databases, e.g. the human data is stored in the core database with sequences and genes, the SNP database, the mouse trace database, the disease database, etc. Only the core database is needed to run the EnsEMBL software.

The EnsEMBL website (www.ensembl.org) provides easy access to this information with a number of visualisation tools such as GeneView, MapView, or ContigView. Additionally, an ftp site (ftp.sanger.ac.uk:/pub/ensembl/current) allows large amounts of genomic data to be downloaded.

A number of algorithms are utilised in the production of sequence annotations: blast, exonerate, genscan, genewise, etc. These algorithms tend to be computationally intensive and as such require both dedicated software and hardware. The automation of the annotation procedure is achieved by the EnsEMBL pipeline. The pipeline uses LSF to distribute the computations to a farm of Alpha servers.

EnsEMBL's code is written in object-oriented Perl to facilitate better software design and easier porting to Java and CORBA. BioPerl (www.bioperl.org) interfaces are used wherever this is appropriate. The code is separated into packages according to its use. Packages exist for the operation of the analysis pipeline and the web server, as well as for the various non-essential datasets (SNP, Maps, Disease, SAGE). The underlying MySQL database is accessed through a layer of adaptor objects, which provide persistence for high-level objects. These represent well-known entities such as Genes, Exons, Features, Sequence, etc.

EnsEMBL provides CVS access to all its code to encourage people to contribute to the project. The code development is openly discussed on a mailing list (ensembl-dev@ebi.ac.uk).

EnsEMBL is a joint project between the Sanger Center and the European Bioinformatics Institute. 


28. A Framework for Identifying Transcriptional cis-Regulatory Elements in the Drosophila Genome (up)
Benjamin P. Berman, University of California, Berkeley;
Barret D. Pfeiffer, Susan E. Celniker, Lawrence Berkeley National Laboratory;
Michael B. Eisen, Lawrence Berkeley National Laboratory & University of California, Berkeley;
Gerald M. Rubin, Howard Hughes Medical Institute & University of California, Berkeley;
benb@fruitfly.berkeley.edu
Short Abstract:

 We are developing techniques that make use of transcription factor binding specificities and evolutionary conservation of binding sites to search for cis-regulatory enhancer elements genome-wide. We have evaluated these techniques using a collection of well-studied enhancer elements from the transcriptional cascade that controls the development of anterior/posterior segmentation in Drosophila.

One Page Abstract:

 The development and maintenance of the many diverse cell types of complex multi-cellular organisms result in large part from cis-regulatory DNA sequences that control precise mRNA transcriptional programs. In addition to promoter sequences important for recruiting the basal transcriptional machinery, "enhancer" modules up to several hundred base pairs long contain binding sites for various sequence-specific transcription factors. The state of these transcription factors constitutes the input to the transcriptional program, and the enhancer sequence serves as the "logic" which integrates these diverse inputs. This logic can facilitate both inhibitory and cooperative interactions. A single gene may be regulated by many such enhancer modules, which may act independently or together to affect transcription in various spatial and temporal domains during the organism's life.

Our aim is to identify novel enhancer modules in the genome of the fruitfly, Drosophila melanogaster. Because enhancer modules can be found within a flanking region of DNA spanning many kilobases, this is not a simple task. We are developing algorithms that take advantage of two important observations: (1) that enhancers are likely to contain many individual binding sites for diverse transcription factors, and (2) that functional binding sites are likely to be conserved through evolution. Previous work indicates that enhancers might be identified with considerable specificity by searching for clusters of conserved binding sites. We have been developing several tools to evaluate this prospect.

We have developed an interactive web database to manage transcription factor binding specificities and to plot potential binding sites in the context of genomic annotations. We construct position weight matrix (PWM) models for each of the factors in our database, and these models are used to score potential sites. Cutoff scores can be interactively adjusted, as can constraints geared towards finding clusters of sites within a given window size. 
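
A minimal sketch of PWM scanning with an adjustable cutoff; the count matrix, background frequencies and pseudocount are toy values, not matrices from the database described above.

    import math

    BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    COUNTS = [{"A": 8, "C": 1, "G": 1, "T": 0},     # one column per position
              {"A": 0, "C": 9, "G": 1, "T": 0},
              {"A": 0, "C": 0, "G": 10, "T": 0},
              {"A": 7, "C": 0, "G": 0, "T": 3}]

    # log-odds weights with a pseudocount of 1 per base
    PWM = [{b: math.log((col[b] + 1) / (sum(col.values()) + 4) / BACKGROUND[b])
            for b in "ACGT"} for col in COUNTS]

    def scan(seq, cutoff):
        hits = []
        for i in range(len(seq) - len(PWM) + 1):
            score = sum(PWM[j][seq[i + j]] for j in range(len(PWM)))
            if score >= cutoff:
                hits.append((i, round(score, 2)))
        return hits

    print(scan("TTACGATTTTACGTCCC", cutoff=3.0))    # sites at offsets 2 and 10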

In order to evaluate our techniques, we have focused on a particular biological process. The cascade that ultimately gives rise to anterior-posterior segmentation in Drosophila involves complex interactions between a host of early embryonic transcription factors, and this process is among the best understood examples of complex transcriptional regulation to be found in any organism. We have collected a set of more than 20 enhancer modules that have been experimentally shown to regulate expression patterns during this process. We will discuss the ability of our techniques to distinguish these known enhancers with specificity appropriate for genome-scale searches. 


29. Genome-wide modeling of protein structures (up)
Ole Lund, Morten Nielsen, Thomas Nordahl Petersen, Claus Lundegaard, Garry P. Gippert, Structural Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.;
Muthu Prabhakaran, Gopalan Raghunathan, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127. www.strubix.com.;
Jakob Bohr, Søren Brunak, SAB, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127.;
Kal Ramnarayan, Structural Bioinformatics, Inc., 10929 Technology Place, San Diego, CA 92127. www.strubix.com.;
olund@strubix.dk
Short Abstract:

 We have developed a very sensitive and specific fold recognition method, as well as a method to create accurate alignments for genome wide use. From these alignments protein models for a large part of several complete genomes are generated using the PROTEOMINE (TM) program. 

One Page Abstract:

 We have developed a very sensitive and specific fold recognition method, as well as a method to create accurate alignments for genome-wide use. From these alignments, protein models for a large part of several complete genomes are generated using the PROTEOMINE (TM) program. The performance of these methods is compared to the state of the art, both by in-house benchmarks and by participation in CASP4. We apply our methods to the creation of relational databases of the tertiary and/or secondary structure of all proteins in genomes. These databases can be used for drug discovery and modeling of protein-protein complexes.


30. Genome wide search of human imprinting genes by data mining of EST in UniGene (up)
Maxwell P. Lee, Howard Yang, Ying Hu, Michael Edmonson, Ken Buetow, Hongtao Fan, Lo Hanson, NIH/NCI/DCEG/LPG;
leemax@mail.nih.gov
Short Abstract:

 We took a computational approach to systematically search for all human imprinted genes. Analysis of SNP frequencies in ESTs in human UniGene has identified 140 candidate imprinted genes. Three of them correspond to known imprinted genes. We are currently validating these findings experimentally.

One Page Abstract:

 Genomic imprinting is an epigenetic modification of the chromosome that leads to preferential expression of a specific parental allele of a gene. Abnormal imprinting is associated with several human diseases, and loss of imprinting (LOI) is frequently found in human cancers. We have undertaken a computational approach to systematically search for all human imprinted genes. We decided to search for imprinted genes in a single nucleotide polymorphism (SNP) database containing all SNPs found in expressed sequence tags (ESTs). Bayesian statistics were used to estimate genotype frequencies. A significant reduction in heterozygotes suggests that the SNP is located either in an imprinted gene or in a region involved in loss of heterozygosity in tumor cells. From 1.8 million ESTs in UniGene, we analyzed 20,130 genes and identified 140 candidate imprinted genes. Among the 140 candidates, three correspond to known imprinted genes. We are currently validating these findings experimentally. In conclusion, data mining of ESTs represents an effective way to search genome-wide for imprinted genes.
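
A minimal sketch of the heterozygote-deficit idea, substituting a simple binomial tail for the Bayesian genotype-frequency estimate used in the study; the per-library allele sets are toy data.

    from math import comb

    def het_deficit_pvalue(libraries, ref="A"):
        """libraries: one set of alleles seen at the SNP per cDNA library."""
        alleles = [a for lib in libraries for a in lib]
        p = alleles.count(ref) / len(alleles)         # allele frequency
        expected_het = 2 * p * (1 - p)                # Hardy-Weinberg
        n = len(libraries)
        k = sum(len(lib) > 1 for lib in libraries)    # apparent heterozygotes
        # one-sided tail: probability of observing <= k heterozygotes
        return sum(comb(n, i) * expected_het**i * (1 - expected_het)**(n - i)
                   for i in range(k + 1))

    # Both alleles segregate, yet no library looks heterozygous, as
    # expected for monoallelic (possibly imprinted) expression.
    libs = [{"A"}] * 5 + [{"G"}] * 5
    print(het_deficit_pvalue(libs))    # ~0.001 => heterozygote deficit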


31. What we learned from statistics on arabidopsis documented genes (up)
Pierre Rouzé, Laboratoire Associé de l'Institut National de la Recherche Agronomique (France), Universiteit Gent, Belgium;
Catherine Mathé, Flanders Interuniversity Institute for Biotechnology (VIB). Department of Plant Genetics. Universiteit Gent, Belgium;
Sébastien Aubourg, Unité de Recherche en Génomique Végétale, Evry, France;
Patrice Déhais, Flanders Interuniversity Institute for Biotechnology (VIB). Department of Plant Genetics. Universiteit Gent, Belgium;
Pierre.Rouze@gengenp.rug.ac.be
Short Abstract:

 In order to improve our knowledge of Arabidopsis gene structure, we analyzed a set of almost two thousand genes properly annotated by aligning cognate cDNAs to genomic sequences.

We present statistics on length and composition of the genes and their elements, codon usage and also new data on some signals. 

One Page Abstract:

 While the Arabidopsis genome is now fully sequenced, its sequence is still not completely decrypted, or at least not in a reliable way. Current annotations were indeed made in an automatic (or semi-automatic) way, using gene prediction software which we know to be far from satisfactory (Pavy et al., 1999). Convinced of the necessity to learn more about gene structures before trying to improve gene prediction strategies, we performed statistical analysis of known genes. We only used genes for which biological evidence was available, i.e. for which we had the cognate mRNA (cDNA). Gene structures (intron/exon organisation) were completely reconstructed by aligning the genomic DNA sequence to its cognate cDNA. Thus, most results presented here were obtained on a data set of 1,811 genes, which is not much compared to the expected 26,000 genes of Arabidopsis, but at least refers to true data. Several kinds of analysis were performed. Results globally underline the high diversity of genes, in terms of composition and structure, as well as the non-unique definition of elements (introns, exons) along genes. Indeed, a "standard" gene (transcription unit) should be 2.4 kb long, with 7 coding exons of 192 bp each, and at about 2 kb from the previous gene. But, besides these means, 10% of the genes have only one exon; the maximal number of exons per gene in our data was 79; an intron can reach 4 kb; and an intergenic sequence can be as small as 300 bp for co-oriented genes or even negative (overlapping genes) otherwise. Moreover, within a gene, the first and last coding exons are larger (270 bp on average) than the internal ones (147 bp), and the first introns are also larger than the next ones (230 bp versus 155 bp, on average). We also found some indication of a compositional bias along coding sequences (CDS), with an enrichment of A3% from 5' to 3' (and a decrease of T3), while (G+C)3% is minimal in the middle part of CDS. Interestingly, when comparing codon usage between several genes, we come to the conclusion that codon usage is influenced in large part by translation efficiency-related constraints, leading to 2 gene classes (Mathé et al., 1999). Furthermore, we analyzed gene-specific signals: splice sites, among them some non-canonical ones such as GC/AG; the translation initiation codon; and polyadenylation sites.

 References

Pavy, N., Rombauts, S., Déhais, P., Mathé, C., Ramana, D.V.V., Leroy, P. and Rouzé, P. (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15 (11): 887-899 

Mathé, C., Peresetsky, A., Déhais, P., Van Montagu, M. and Rouzé, P. (1999) Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. Journal of Molecular Biology, 285 (5), 1977-1991. 


32. Evaluation of Computer Algorithms to Search Regulatory Protein Binding Sites (up)
Esperanza Benítez-Bellón, Gabriel Moreno-Hagelsieb, Julio Collado-Vides, CIFN, UNAM, México;
ebenitez@cifn.unam.mx
Short Abstract:

 By analyzing known regulons, we evaluate the efficiency of two methods for extracting patterns that define regulator binding sites and for finding other members of the regulon. The points of maximum accuracy for each family are reported.

One Page Abstract:

 We present the evaluation of two methods for extracting patterns corresponding to the binding sites of several transcriptional regulators reported in RegulonDB. We define a positive and a negative set for each regulon and find the patterns: a matrix using CONSENSUS (Bioinformatics 15(7/8):563-577, 1999) and a set of dyads using dyad-detector (Nucl Acids Res 28(8):1808-1818, 2000). We evaluate true positives, true negatives, accuracy, and positive predictive values. We find the best thresholds for finding new members of each regulon, and provide some predictions.
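
A minimal sketch of the threshold evaluation, assuming per-sequence scores for the positive and negative sets; the scores and threshold grid below are illustrative.

    def evaluate(pos, neg, threshold):
        tp = sum(s >= threshold for s in pos)
        fp = sum(s >= threshold for s in neg)
        tn, fn = len(neg) - fp, len(pos) - tp
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        ppv = tp / (tp + fp) if tp + fp else 0.0
        return accuracy, ppv

    pos = [8.1, 7.4, 6.9, 3.2]               # scores of regulon members
    neg = [5.5, 2.0, 1.1, 0.4, 4.8]          # scores of non-members
    best = max((t / 10 for t in range(100)), key=lambda t: evaluate(pos, neg, t)[0])
    print(best, evaluate(pos, neg, best))    # threshold of maximum accuracy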


33. An HMM Approach to Identify Novel Repeats Through Fluctuations in Composition (up)
Hui Wang, Affymetrix, Inc; Dept of Computer Science and Computer Engineering, UCSC, Santa Cruz, CA 95064 USA;
David Haussler, Department of Computer Science and Engineering, University of California at Santa Cruz, Santa Cruz, California 95064 ;
John Burke, DoubleTwist Inc. Oakland, California 94612 USA;
hui_wang@affymetrix.com
Short Abstract:

 We present an automatic, efficient HMM method for identifying novel repeats. This method compares local fluctuations in sequence composition with the composition of the database as a whole. It does not rely on similarity comparison with known repetitive element databases and is of great utility for microarray and sequence analysis.

One Page Abstract:

 Repetitive DNA sequences are widely distributed among eukaryotic and prokaryotic genomes. Reassociation kinetics studies of eukaryotic DNA have established that approximately 30% of the human genome is composed of repetitive DNA sequences, while some other genomes, such as X. laevis, contain up to 70% repetitive sequences. These features, often called repetitive elements, have been categorized and archived.

If not accounted for, these elements can render many sequence analysis procedures uninformative. Massively parallel gene expression monitoring through microarray technology can potentially be seriously affected by the existence of repetitive elements. The problem is that repetitive sequences can cause sequence similarity among genes unrelated in function. This problem can be even worse in SNP detection. Hence in array design and sequence functional analysis, it is very important to take repetitive elements into consideration. 

Many procedures for repeat identification and detection have been developed. However a fundamental limitation of these methods is that they rely mostly upon databases of experimentally known elements and similarity searching algorithms to detect the existence of repetitive elements.

There are databases of mRNA fragments (also called expressed sequence tags, or ESTs) that sample many of the transcribed regions of the genome. These databases contain multiple representations of non-repetitive elements with frequencies greater than many real repeats; hence positive and negative cases are often indistinguishable on the basis of mere frequency. Clustering and alignment of these databases can remove redundancy, but the presence of repeats reduces the tractability of this procedure as well. Coupled with the availability of whole genome sequences, a thorough, efficient (preferably linear-time) computational approach to identifying repeats would be of great utility.

Hidden Markov Models (HMM) have been successfully used in biological sequence analysis for gene finding, protein structural modeling and phylogenetic analysis.

We present a method for repeat detection that collates repeat frequency in the entire database with local shifts in sequence composition to identify positive cases. A simple HMM is built and used to distinguish repeat states from non-repeat states. We apply this approach to both expressed sequences and whole genome sequences. The method runs in linear time and constant space with respect to database size.


34. Novel non-coding RNAs identified in the genomes of Methanococcus jannaschii and Pyrococcus furiosus (up)
Robert J. Klein, Washington University;
Sean Eddy, Howard Hughes Medical Institute and Washington University;
rjklein@genetics.wustl.edu
Short Abstract:

 We used a bias in G+C content as the basis for a computational screen for novel structural RNAs in sequenced, AT-rich, hyperthermophile genomes. This screen identifies most noncoding RNA loci as well as several novel loci. Expression of small RNAs from some of these loci has been experimentally confirmed.

One Page Abstract:

 The G+C content of structural RNA genes positively correlates with optimal growth temperature, while the G+C content of an entire genome does not. Although this GC composition difference is undetectably weak in most sequenced genomes, it is a strong bias in AT-rich hyperthermophile genomes (e.g. Methanococcus jannaschii and the various Pyrococcus species). We are using this bias as the basis for a computational screen for novel structural RNAs. Using a two-state hidden Markov model (a formal statistical model of the expected GC bias), we have identified GC-rich regions of these genomes. We have shown that the screen identifies almost all known structural RNAs. We have also identified and done preliminary computational characterization of 14 putative noncoding RNA loci in Methanococcus jannaschii, and 9 putative noncoding RNA loci in Pyrococcus furiosus. Northern blot analysis clearly demonstrates that 3 of these loci in Methanococcus jannaschii and 5 of these loci in Pyrococcus furiosus are expressed as small RNAs. We have cloned and sequenced the full length of several of these RNAs, and sequence analysis argues strongly in favor of a non-coding, rather than small-peptide coding, function for these RNA molecules.
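
A minimal sketch of a two-state GC-bias HMM decoded with Viterbi; the emission and transition probabilities are fixed by hand here for illustration, whereas the published screen models the expected bias formally and trains on the genome.

    import math

    STATES = ("background", "gc_rich")
    EMIT = {"background": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
            "gc_rich":    {"A": 0.18, "C": 0.32, "G": 0.32, "T": 0.18}}
    STAY = 0.999                     # both states persist over long stretches

    def viterbi_gc(seq):
        """Most probable state path; GC-rich stretches flag candidate RNAs."""
        lp = {s: math.log(0.5) + math.log(EMIT[s][seq[0]]) for s in STATES}
        back = []
        for base in seq[1:]:
            nxt, ptr = {}, {}
            for s in STATES:
                def arrive(p):       # log-probability of reaching s from p
                    return lp[p] + math.log(STAY if p == s else 1 - STAY)
                prev = max(STATES, key=arrive)
                ptr[s] = prev
                nxt[s] = arrive(prev) + math.log(EMIT[s][base])
            lp = nxt
            back.append(ptr)
        state = max(STATES, key=lp.get)
        path = [state]
        for ptr in reversed(back):
            state = ptr[state]
            path.append(state)
        return list(reversed(path))

    # AT-rich flanks around a GC-rich insert; report the flagged span.
    seq = "ATTATAATTT" * 30 + "GCGGCCGCAGGCGCCGTAGC" * 15 + "TTAATATAAT" * 30
    path = viterbi_gc(seq)
    print(path.index("gc_rich"), len(path) - path[::-1].index("gc_rich"))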


35. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs (up)
Christian M. Zmasek, Washington University School of Medicine;
Sean R. Eddy, Howard Hughes Medical Institute and Washington University School of Medicine;
zmasek@genetics.wustl.edu
Short Abstract:

 A procedure for automated inference of orthologs over bootstrap-resampled phylogenetic trees is presented ("RIO", Resampled Inference of Orthologs). This is used for functional prediction via phylogenetic analysis ("phylogenomics"). Results from analyzing the C. elegans proteome are shown. We discuss where phylogenomic analyses might be more reliable than similarity-based analyses.

One Page Abstract:

 When analyzing protein sequences using sequence similarity searches, orthologous sequences (diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics", [1]) is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Here we present a procedure for automated phylogenomics using explicit phylogenetic inference.

 At the center stands the inference of gene duplications by comparing the gene tree containing the sequence to be analyzed to a trusted species tree. Various algorithms to accomplish this have been published. We employ one of our own design, which has a pathological worst-case behavior of O(n²) but appears to be superior in most practical cases, partially due to its simplicity ([2], and references therein).

 A major caveat of all phylogenetic analyses is the unreliability of the resulting trees. Therefore, inference of gene duplications is performed over bootstrap resampled phylogenetic trees to estimate the reliability of the orthology assignments (RIO -- Resampled Inference of Orthologs). In addition, unusual differences in maximum likelihood branch length values are used to automatically detect other potential pitfalls for functional annotation caused by unequal rates of evolution.

 We show results of performing this procedure on the C. elegans proteome. It appears that this procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies.

 This procedure is being implemented as a suite of Java classes and Perl scripts and will eventually be available in its entirety as (part of) the "FORESTER" framework at http://www.genetics.wustl.edu/eddy/forester/.

[1] Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research, 8, 163-167.

[2] Zmasek CM and Eddy SR (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, in press. 


36. Comparative genomics for data mining of eukaryotic and prokaryotic genomes (up)
Clemens Suter-Crazzolara, Günther Kurapkat, LION bioscience;
suter@lionbioscience.com
Short Abstract:

 With the increase in the number of completely sequenced genomes, comparative genomics is becoming increasingly important for data mining purposes. We describe a software solution for the comparison of multiple eu- and prokaryotic genomes. Key benefits are high-speed analysis and flexible inclusion of any number of genomes, biological databases and applications.

One Page Abstract:

 With the advent of genome projects, scientists are confronted with a wealth of sequence information. Both the size and the number of biological databases grow rapidly; however, the gap between data collection and interpretation is also growing. For this reason, genome databases contain a wealth of information which is accessible but may remain obscured, and intelligent systems are needed to bridge this widening gap. To interpret data from newly sequenced eu- and prokaryotic genomes, comparative genomics is rapidly gaining importance. This reflects the observation that even genomes of distantly related organisms may still encode proteins with high sequence similarity; additionally, the order of genes within a genome may also be conserved. As the number of completely annotated genomes grows, comparing new genomes to this knowledge base becomes increasingly important for collecting biological information.

We have employed these observations to design a computational analysis system which enables novel ways of characterizing gene function through genome comparisons. In an initial step the researcher can, through a simple web interface, add genomes from in-house sequencing projects or public sources to the system. Genome comparison relationships are determined with several algorithms, yielding data on homology and orthology relationships as well as gene order; this information is stored in five distinct databases. In the second step, the researcher can query these databases for interactive comparisons of genomes. Results are depicted either in graphical views, to allow easy interpretation, or in tabular form, to summarize the data obtained. Initially designed for the analysis of prokaryotic genomes, the application has been further developed to allow the analysis of eukaryotic genomes such as human, mouse, D. melanogaster, C. elegans, plants and yeasts. Expressed Sequence Tag (EST) consensus sequences can also be imported, so that transcriptome researchers can compare such sequences to completed genomes and assign functions to ESTs.

The application is based on the data integration system SRS (Etzold, T. et al., 1996. Methods Enzymol. 266: 114-128), which results in several unique characteristics: 1. High flexibility: the user can add any number of genomes, biological databases or applications to the system; currently SRS allows the seamless integration of more than 400 biological databases. 2. Reliable, high-speed handling of large genomic data sets: even complex queries give results instantly. 3. Unique, SRS-based linking functions between all databases, giving access to a wealth of biological data. 4. User-friendly graphical representations that allow easy interpretation of search results. These benefits result in highly efficient collection of information on genomes, genes and proteins. The application can be used for projects spanning from the identification of drug targets to the correct annotation of genomes. (http://www.lionbioscience.com/genomeSCOUT)


37. DNA atlases for the Campylobacter jejuni genome
Lise Petersen, CBS, Biocentrum-DTU, and Department of Microbiology, DVL, Denmark;
Stephen L.W. On, Department of Microbiology, Danish Veterinary Laboratory, DK-1970 Copenhagen.;
David W. Ussery, Center for Biological Sequence analysis, Danish Technical University, DK-2800 Lyngby;
lpe@svs.dk
Short Abstract:

 We have analyzed the genome sequence of C. jejuni NCTC11168 for DNA structural motifs. Whilst global repeats are under-represented, local repeats (including palindromic regions) are over-represented in the C. jejuni genome. The three hyper-variable regions of the genome, which all encode surface-exposed products, display unique structural properties.

One Page Abstract:

 We have analyzed the genome sequence of Campylobacter jejuni NCTC11168 using "DNA Atlases", a method for visualizing DNA properties of an entire chromosome as a circular plot (Genome Atlas). These properties include mechanical or structural parameters (such as intrinsic curvature, base-stacking energy, and DNA flexibility) as well as the occurrence of local and global repeats, including palindromic sequences. We find that for the C. jejuni genome, global repeats are under-represented, whilst local repeats are over-represented and the percentage of palindromes significantly exceeds that of other known pathogenic bacteria, including E. coli. This is partly due to the high AT percentage of C. jejuni, but may still lead to increased mutability compared to bacteria with lower values. There are three chromosomal regions containing hypervariable sequences. One region encompasses two sets of global repeats, the fla-genes and one additional set of tandemly arranged genes. The latter contain a conserved domain that shares homology with other annotated genes in the same chromosomal region. Possible functions of these genes will be discussed. Furthermore, of the 1700 genes annotated in the genome sequence, we estimate that approximately 12% are random ORFs and not true genes. We conclude that the genome atlas is a valuable tool, and that the results can be exploited for both fundamental and applied research purposes. Genome atlases for C. jejuni NCTC11168 are available at

http://www.cbs.dtu.dk/services/GenomeAtlas/Bacteria/Campylobacter/jejuni/NCTC11168/.


38. DNA atlases for the Staphylococcus aureus genome
Christian B. Jendresen, Maiken H. Pedersen, Morten S. Thomsen, Torsten Kolind, David W. Ussery, Center for Biological Sequence Analysis, DTU;
s001638@student.dtu.dk
Short Abstract:

 Based on DNA atlases of two recently published S. aureus genomes, we have found an obvious symmetry in the circular chromosomes. We also found the pathogenic islands to have structurally extreme properties. Finally, we found several matches between resistance genes and S. aureus plasmids, suggesting horizontal gene transfers. 

One Page Abstract:

 Based on two recently sequenced multi-resistant strains of Staphylococcus aureus, we have examined the organism's significant ability to acquire resistance genes and to mutate certain pathogenic genes. DNA atlases are used to provide an overview of several structural properties of the genome. Parameters include intrinsic curvature, flexibility, stacking energy, local and global repeats, and quasi- and perfect palindromes. These are used to locate atypical DNA segments that can provide information about S. aureus. The S. aureus genome is unusually symmetric, which simplifies determination of the origin and terminus of replication. We have found DNA atlases to be a valuable tool in studying the correlations between DNA structure and function. Pathogenic factors and superantigens have mutated, making the bacteria capable of evading the immune systems of host organisms. Numerous antibiotic resistance genes are located near global repeats and on transposable elements. Close alignments with plasmid vectors suggest the occurrence of horizontal gene transfer.

Genome atlases for S.aureus N315 and S.aureus Mu50 are available at http://www.cbs.dtu.dk/services/GenomeAtlas/Bacteria/Staphylococcus/aureus


39. SNAPping up functionally related genes based on context information: a colinearity free approach
Grigory Kolesov, Hans-Werner Mewes, Dmitrij Frishman, MIPS, Institute for Bioinformatics, GSF National Research Center for Environment and Health;
G.Kolesov@gsf.de
Short Abstract:

 A new algorithm finds functionally related non-homologous genes in prokaryotic genomes. It does not rely on the presence of conserved gene strings. Instead, it utilizes the graph of neighborhood and similarity relationships to find paths that have a higher probability of including functionally related genes.

One Page Abstract:

 We present a computational approach for finding genes that are functionally related but do not possess any noticeable sequence similarity. Our method, which we call SNAP (Similarity-Neighborhood APproach), reveals the conservation of gene order on bacterial chromosomes based on both cross-genome comparison and context information. The novel feature of this method is that it does not rely on detection of conserved colinear gene strings. Instead, we introduce the notion of a similarity-neighborhood graph (SN-graph), which is constructed from the chains of similarity and neighborhood relationships between orthologous genes in different genomes and adjacent genes in the same genome, respectively. An SN-cycle is defined as a closed path on the SN-graph and is postulated to preferentially join functionally related gene products that participate in the same biochemical or regulatory process. We demonstrate the substantial non-randomness and functional significance of SN-cycles derived from real genome data and estimate the prediction accuracy of SNAP in assigning broad function to uncharacterized proteins. Technically, the SNAP algorithm is implemented as a multithreaded server application and is accessible via the Web.
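
The construction is easy to prototype. The toy sketch below (Python with networkx; the input tables are hypothetical) builds a graph from neighborhood and similarity edges and enumerates short closed paths; note that true SN-cycles additionally alternate the two edge types, which this sketch does not enforce.

    import networkx as nx

    def sn_graph(gene_order, ortholog_pairs):
        """gene_order: {genome: [gene_id, ...]} in chromosomal order (circular).
        ortholog_pairs: iterable of (gene_id_a, gene_id_b) across genomes."""
        g = nx.Graph()
        for genome, genes in gene_order.items():
            for a, b in zip(genes, genes[1:] + genes[:1]):   # neighborhood edges
                g.add_edge(a, b, kind="neighbor")
        for a, b in ortholog_pairs:                          # similarity edges
            g.add_edge(a, b, kind="similar")
        return g

    def sn_cycles(g, max_len=8):
        """Closed paths on the SN-graph; short cycles are the informative ones."""
        return [c for c in nx.cycle_basis(g) if len(c) <= max_len]

Gene identifiers are assumed to be unique across genomes so that each graph node is unambiguous.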


40. Sequencing and Comparison of Orthopoxviruses
Scott Sammons, Michael Frace, Miriam Laker, Melissa Olsen-Rasmussen, Roger Morey, Yu Li, Richard Kline, Joseph J. Esposito, Inger Damon, Robert Wohlhueter, National Center for Infectious Diseases, Centers for Disease Control & Prevention;
ssammons@cdc.gov
Short Abstract:

 Six variola virus isolates were sequenced by using long-distance, high-fidelity PCR of overlapping amplicons as templates for fluorescence-based sequencing. PCR-based primer-walking sequencing reactions were separated by capillary electrophoresis. Output sequence trace files were edited and assembled using Phred/Phrap/Consed, and open reading frames of greater than 60 amino acids were analyzed.

One Page Abstract:

 The genomes of six variola major strains were sequenced by using long-distance, high-fidelity PCR of overlapping amplicons as templates for fluorescence-based sequencing. Each genome is approximately 186 kb of double-stranded DNA with between 200 and 240 predicted open reading frames (ORFs) of greater than 60 amino acids. Each ORF sequence has been compared with the five other locally sequenced strains and with sequences of previously published orthopoxviruses Bangladesh-1975 (L22579), India-1967 (X69198), and vaccinia virus Copenhagen (M35027). The most highly conserved ORFs are located in the center portion of the genome, and the majority have known functions involving transcription, DNA replication and repair, protein processing, virion structure, and nucleotide metabolism. 

To minimize the amount of poxvirus needed, 15 micrograms of purified genomic DNA was used as template for approximately 1800 primer-walking sequencing reactions. The reactions were set up using robotic assistance and subjected to thermocycling, and the reaction products were separated by capillary electrophoresis (Beckman Coulter CEQ 2000XL). Sequencing of variola strains Congo-1970, Somalia-1977, India-1964, Horn-1948, Nepal-1973, and Afghanistan-1970 has been completed.

Output sequence trace files were edited, evaluated for quality, and then assembled by using Phred/Phrap/Consed software until about a 10-fold redundancy of high-quality sequence data was attained. ORFs of greater than 60 amino acids were then compared with each other and those in the public databases. These ORFs were analyzed for the presence of known early, middle, and late promoter sequences. ORFs with no homologs in the other strains were further analyzed for protein motifs by using several tools and databases. 
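
The ORF extraction step itself is straightforward to reproduce; a minimal sketch (standard genetic code, start-to-stop ORFs on both strands, coordinates reported on the scanned strand) might look like this:

    STOPS = {"TAA", "TAG", "TGA"}
    COMP = str.maketrans("ACGT", "TGCA")

    def orfs(seq, min_aa=60):
        """Start-to-stop ORFs encoding more than min_aa amino acids; minus-strand
        coordinates refer to the reverse complement of seq."""
        hits = []
        for strand, s in (("+", seq), ("-", seq.translate(COMP)[::-1])):
            for frame in range(3):
                start = None
                for i in range(frame, len(s) - 2, 3):
                    codon = s[i:i + 3]
                    if start is None and codon == "ATG":
                        start = i
                    elif start is not None and codon in STOPS:
                        if (i - start) // 3 > min_aa:
                            hits.append((strand, frame, start, i + 3))
                        start = None
        return hits

The subsequent promoter-motif and protein-motif analyses described above operate on the ORF set produced by a scan of this kind.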


41. Integrating mouse and human comparative map data with sequence and annotation resources.
Ann-Marie Mallon, Joseph Weekes, Paul Denny, Mark Strivens, Steve Brown, Informatics Group, Mammalian Genetics Unit and UK Mouse Genome Centre, Medical Research Council;
a.mallon@har.mrc.ac.uk
Short Abstract:

 Comparative maps have previously been generated between mouse and human. This data has been used to integrate the homologous genomic sequence between the two species and to allow integration of other resources such as sequence annotation/similarity or phenotype data, facilitating the use of data present in only one species.

One Page Abstract:

 The progress of human and mouse genome sequencing programmes allows systematic cross-species comparison of the two genomes as a tool for gene and regulatory element identification [1]. This data will also be an important tool for exploiting the rapidly growing mouse mutant resource and moving from mutant phenotype to underlying gene. As the opportunities to perform comparative sequence analysis emerge, it is important to develop parameters for such analyses and to examine the outcomes of cross-species comparison for use in developing integrated data sets and software for such analysis.

As the sequence data increases in quantity and accuracy it is important to be able to extract the genomic sequence from homologous chromosomal regions. This information would then be beneficial in generating parameters for comparative sequence analysis and also as a foundation for building an integrated data source between the two species. This could then become the basis for querying various linked sources of information including sequence annotation and phenotype data. 

To date, detailed comparative maps have been generated between these two species, utilising homologous genes as markers to highlight homologous chromosomal segments between the two genomes. We have utilised this comparative map data to integrate the homologous genomic sequence between the two organisms. Using this data we aim to refine the comparative map data by sequence alignment and also to integrate other information from databases such as MGD (Mouse Genome Database). This system will be utilised within the UK Mouse sequencing programme (http://mrcseq.har.mrc.ac.uk) to aid in the annotation of mouse genomic sequence.

 [1] Mallon A.M., et al. Comparative genome sequence analysis of the Bpa/Str region in mouse and man. Genome Res. 2000 Jun;10(6):758-75.


42. De novo Identification of Repeat Families in the Genome
Zhirong Bao, Sean Eddy, HHMI, Dept of Genetics, Washington University;
bao@genetics.wustl.edu
Short Abstract:

 We have developed a de novo approach for the identification and classification of repetitive elements from genomic sequences. We incorporated multiple alignment information to extend the usual approach of single linkage clustering of BLAST hits. The algorithm is now being used to analyze various genomes. 

One Page Abstract:

 Repetitive elements constitute a major part of eukaryotic genomes. We have developed and implemented a de novo approach for the identification and classification of these elements from genomic sequences, based on algorithmic extensions to the usual approach of single linkage clustering of BLAST hits. To overcome the tendency of single linkage clustering to group unrelated sequences, we incorporated multiple alignment information in defining the boundaries of individual copies of the elements and in constructing linkages. The algorithm is now being used to analyze various genomes.
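
The baseline being extended here is simple to state in code: treat each significant BLAST hit as an edge and take connected components. A minimal union-find sketch follows (the hit-list format is hypothetical); the contribution described above is the multiple-alignment information layered on top of this step.

    def single_linkage(pairs):
        """pairs: iterable of (idA, idB) for significant BLAST hits."""
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]        # path halving
                x = parent[x]
            return x
        for a, b in pairs:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb                      # merge the two clusters
        clusters = {}
        for x in list(parent):
            clusters.setdefault(find(x), []).append(x)
        return list(clusters.values())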


43. Using the Arabidopsis genome to assess gene content in higher plants
Keith Allen, Paradigm Genetics;
kallen@paragen.com
Short Abstract:

 To assess gene content in Arabidopsis, I have exhaustively compared EST unigene sets from eight plant species to the Arabidopsis genome. Comparison between unigene sets identified lineage-specific genes, and gene loss in Arabidopsis. About 15% of dicot genes and about 30% of monocot genes are missing from Arabidopsis.

One Page Abstract:

 Arabidopsis has been a favorite model organism of plant biologists for more than two decades because of its small size, rapid growth cycle, small genome and tractable genetics. Using an organism as a model system assumes that genes in this organism will have similar functions to the equivalent genes in other species and, more fundamentally, that the gene complement of the model system is substantially similar to that of other, more economically important species. The completion of the Arabidopsis genome, coupled with a number of large EST projects in other species, provides an unprecedented opportunity to examine this question of gene content and to clarify the position of Arabidopsis as a model system. Specifically, to what extent does Arabidopsis contain the same genetic complement as other plant species?

This question can be addressed by using Arabidopsis as a reference genome and then comparing unigene sets constructed for each test genome to the reference genome using a Smith-Waterman algorithm that translates both query and target DNA sequences in six frames and does the comparison in protein space. This computationally expensive step was performed on a Paracel GeneMatcher2 as part of a beta testing program. Unigene sets were constructed from publicly available EST projects for six angiosperms, one conifer, and a green alga. The test species (and number of input ESTs) were tomato (Solanaceae, 94,523 ESTs), soybean (Papilionoideae, 137,952 ESTs), Medicago (Papilionoideae, 115,717 ESTs), rice (Oryzeae, 63,790 ESTs), barley (Triticeae, 105,273 ESTs), maize (Andropogoneae, 85,287 ESTs), loblolly pine, a conifer (Pinaceae, 31,99 ESTs), and a green alga, Chlamydomonas (Volvocales, 55,874 ESTs). Unigene sets were constructed using the Paracel Clustering Package.

The central conclusion of this work is that "novel" genes (i.e., genes not present in the reference genome) increase in number with increasing evolutionary distance. Soybean, the closest relative of Arabidopsis in this set, had about 17% of its EST contigs fail to get a hit in Arabidopsis. In pine, the most distant higher plant species used, this number was over 30%. I will present a closer examination of the "novel" genes in each species, and cross-species comparisons to, for example, identify monocot-specific genes. I will also present direct evidence of specific gene loss events in the lineage leading to Arabidopsis. Analysis of contigs corresponding to genes not found in Arabidopsis will also be shown.


44. Origin of Replication in Circular Bacterial Genomes and Plasmids
Peder Worning, Lars J. Jensen, Hans-Henrik Stærfeldt, Dave W. Ussery, CBS, Biocentrum, DTU;
peder@cbs.dtu.dk
Short Abstract:

 By comparing the frequencies in the leading and lagging strand of all oligonucleotides up to a length of 8 bp, we find the origin of replication in circular genomes. We find the origin in nearly all the sequenced Bacterial genomes and several Bacterial plasmids, plus a number of mitochondrial and chloroplast genomes.

One Page Abstract:

 We present a method for finding the origin of replication in circular Bacterial genomes. The method is based on differences in word frequencies between the leading and lagging strands, and the nucleotide sequence is the only input needed. We have analysed complete genome sequences from more than 50 different species, including both Bacteria and Archaea. Our method finds the correct location in all the genomes where the position of the origin is firmly confirmed, and shows that the origin has been misplaced in several of the published genomes. We even find a probable origin position in genomes where it has not been predicted before. We are also able to find the position of the origin in several bacterial plasmids, plus some mitochondrial and chloroplast genomes.
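
The simplest special case of this strand asymmetry is the cumulative GC skew, whose minimum typically marks the origin; the sketch below shows that 1-mer case only, whereas the method described here generalizes to oligonucleotides up to 8 bp.

    def origin_by_gc_skew(seq):
        """Cumulative (G - C) along the chromosome; the minimum is a candidate
        origin, the maximum a candidate terminus."""
        skew, curve, best, best_i = 0, [], 0, 0
        for i, b in enumerate(seq):
            if b == "G":
                skew += 1
            elif b == "C":
                skew -= 1
            curve.append(skew)
            if skew < best:
                best, best_i = skew, i
        return best_i, curve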


45. Combining frequency and positional information to predict transcription factor binding sites
Szymon M. Kielbasa, Jan O. Korbel, Dieter Beule, Johannes Schuchhardt, Hanspeter Herzel, Innovationskolleg Theoretische Biologie;
s.kielbasa@itb.biologie.hu-berlin.de
Short Abstract:

 Both the frequency and positional information are analysed to predict transcription factor binding sites in upstream regions of coregulated genes. Evaluations for several yeast families as well as new results for a set of genes downregulated via H-Ras activation are presented.

One Page Abstract:

 Even though a number of genome projects have been finished on the sequence level, still only a small proportion of DNA regulatory elements have been identified. Growing amounts of gene expression data provide the possibility of finding coregulated genes by clustering methods. By analysis of the promoter regions of those genes, rather weak signals of transcription factor binding sites may be detected. 

We present the algorithm ITB ("Integrated Tool for Box finding"), which combines frequency and positional information to predict transcription factor binding sites in upstream regions of coregulated genes. Motifs of a specified length are exhaustively scored by estimating their frequencies with respect to a statistical background model and investigating their tendency to form positional clusters. The alphabet used to assemble motifs may contain symbols matching multiple bases.
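
A stripped-down version of the frequency component might look as follows (zero-order background, plain ACGT alphabet, crude log-ratio score; the positional-clustering term and the degenerate alphabet are omitted):

    from math import log

    def overrepresented(seqs, k=6, top=20):
        """Rank all k-mers in the upstream set against a 0th-order background."""
        bases = "ACGT"
        bg_counts = {b: sum(s.count(b) for s in seqs) for b in bases}
        total = sum(bg_counts.values())
        bg = {b: bg_counts[b] / total for b in bases}
        counts, windows = {}, 0
        for s in seqs:
            windows += max(0, len(s) - k + 1)
            for i in range(len(s) - k + 1):
                w = s[i:i + k]
                counts[w] = counts.get(w, 0) + 1
        scored = []
        for w, n in counts.items():
            p = 1.0
            for b in w:
                p *= bg[b]                           # expected k-mer probability
            expected = p * windows
            if expected > 0:
                scored.append((n * log(n / expected), w, n))
        return sorted(scored, reverse=True)[:top]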

ITB detects consensus sequences of experimentally verified transcription factor binding sites of the yeast Saccharomyces cerevisiae. Moreover, a number of new binding site candidates with significant scores are predicted. Besides being applied to the yeast upstream regions, the program has been run on human promoter sequences. From this investigation a new candidate motif, "CGARCG", has been proposed as a transcription factor binding site for a set of genes downregulated via H-Ras activation.


46. Identification of distantly related sequences through the use of structural models of protein evolution
Lisa Davies, Nick Goldman, Department of Zoology, University of Cambridge;
L.Davies@zoo.cam.ac.uk
Short Abstract:

 A novel database search program is assessed which identifies distantly related homologs through the inclusion of information about the effects of protein structure on sequence evolution. Solvent accessibility information only confers a limited advantage, but protein secondary structure information markedly improves the identification of homologous sequences with low sequence identity.

One Page Abstract:

 Database search programs in common use (e.g. BLAST, FASTA) identify sequence homology through the use of pairwise alignment techniques. These programs are good at detecting closely related sequences but have problems accurately detecting homologous sequences with low sequence identity. A new approach, described here, tries to improve the detection of distantly related homologs by rejecting the assumption that all sites in a protein behave in an identical manner. This is done without the use of profile techniques, which require the preliminary collection of a set of homologs. Programs such as BLAST and FASTA use general properties of a protein to generate alignment scores, which simplifies calculations but may also result in a decrease in accuracy. In reality, amino acid replacement probabilities and rates, amino acid frequencies and gap probabilities all vary according to where a residue lies in a protein structure. Typical patterns of these structure-specific variations in evolutionary dynamics can be incorporated into a database search program through the use of Hidden Markov Models (HMMs), potentially improving the detection of more distantly related sequences.

In this study, the utility of including structure-specific evolutionary information in a database search program has been assessed. Initial work has concentrated on the generation of database search programs that use either solvent accessibility distinctions or protein secondary structure distinctions. The improvement from adding the 'extra' information has then been evaluated through the use of both simulated sequences, which exactly fit the models, and real sequences from the SCOP database. The success rate of each of these programs has been compared to a simplified model that contains the general properties of proteins but no structural distinctions.

We have discovered that adding accessibility information gives a limited advantage only when sequences are distantly related. Using secondary structure distinctions, however, provides a greater improvement over a model containing no structural information for all but the case when the sequences are closely related. There is also an advantage over more traditional database search programs such as BLAST, FASTA and Smith-Waterman. Incorporating structure-specific evolutionary models into database search programs can therefore potentially lead to an improvement in the identification of distant homologs. 


47. The Paracel Filtering Package (PFP): A Novel Approach to Filtering and Masking of DNA and Protein Sequences
Cecilie Boysen, Charles P. Smith, Stephanie Pao, Cassi Paul, Joseph A. Borkowski, Paracel, Inc.;
boysen@paracel.com
Short Abstract:

 We describe a comprehensive, flexible suite of tools that utilize several algorithms to identify repeats and contaminants in biomolecular sequences. These methods include using repeat profile models and non-destructive XML-based annotation. Together, these tools enable otherwise unavailable masking options. We demonstrate PFP's speed and effectiveness on a variety of datasets. 

One Page Abstract:

 Filtering and masking is a required, but often neglected, first step for many bioinformatics analyses, such as EST clustering and assembly, database mining, and DNA chip design. Properly done, this filtering and masking can produce a dramatic improvement in the quality of the final results. Unfortunately, many current masking techniques are an ad hoc assembly of various single-purpose comparison tools. Using such an assembly of tools is often cumbersome, requiring many individual steps including simple conversion and bookkeeping operations. Managing filtering in this way can often lead to incomplete or erroneous results.

To address these known shortcomings we have developed the Paracel Filtering Package (PFP), a comprehensive, flexible suite of tools for filtering and masking. PFP takes input in most standard formats and identifies sequence regions using a variety of user-selectable algorithms. These include dust and pseg for identification of low-complexity regions in DNA and protein sequences, and Haste (hash accelerated search tool) and full Smith-Waterman for comparison to sets of repeat, vector, or contaminant sequences. PFP also identifies low-quality sequences based on either quality values or ambiguous base calls. A variety of actions can then be performed on the identified sequence regions: masking, removal of the entire sequence, excision, or annotation. The annotation action is unique to the PFP suite in that it produces an XML-based annotation of the identified sequence regions, e.g. low-complexity and genome-wide repeat regions, without replacing the underlying sequence characters. This allows for non-destructive masking when PFP is used as part of a multi-step clustering and assembly process, where masks are applied during one stage (such as clustering) and removed during subsequent stages (such as final assembly), producing final assemblies and consensus sequences free of unwanted masking characters.

Additionally, PFP, if used with Paracel's GeneMatcher, can use repeat profiles for highly sensitive annotation of repeat regions. These profiles are Gribskov-type DNA profiles created from multiple sequence alignments of identified repeat regions. Here we investigate the use of such repeat profile searching and compare the results with the commonly applied algorithms for repeat finding.

PFP can be customized for the specific task at hand; that is, different settings can be applied depending on the purpose and the kind of sequence and species. We will describe a general method for optimization of masking parameters that can be used on any dataset.

These tools present a comprehensive set of filtering and masking options not available in any other package. Still, executed by a single command, PFP performs this often convoluted process of cleaning up or annotating sequences with great speed and effectiveness. 


48. EST Clustering with Self-Organizing Map (SOM)
Ari-Matti Saren, Plant Genomics group, Institute of Biotechnology, University of Helsinki;
Timo Knuuttila, Mikko Kolehmainen, Visipoint Oy;
ari-matti.saren@helsinki.fi
Short Abstract:

 The existing EST clustering algorithms perform adequately on smaller datasets, but with the rapid increase in EST data, scalability is becoming a problem. We present a method of using short exact sequence matches and Self-Organizing Map (SOM) to rapidly classify the EST data into subsets that can be clustered individually.

One Page Abstract:

 The existing EST clustering algorithms perform adequately on small and moderate datasets, but with the rapid accumulation of EST sequence data, scalability is increasingly becoming a problem. Since the scalability issue is inherent in this sort of all-against-all problem, where everything has to be compared to everything, we approach it by keeping the datasets reasonably sized.

A Self-Organizing Map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large datasets. We present a method of dividing ESTs into subsets of potentially clustering sequences by using a SOM to classify the sequences according to the number of short exact sequence matches found. These subsets can then be individually clustered and aligned using existing tools.
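
To make the idea concrete, here is a toy version of the classification step: ESTs are mapped onto a small grid by a classical Kohonen update applied to vectors of short-word counts. The grid size, word length and learning schedule are arbitrary choices, and the TS-SOM used by Visual Data differs in detail.

    import numpy as np
    from itertools import product

    WORDS = ["".join(w) for w in product("ACGT", repeat=4)]   # 256 short words

    def kmer_vector(seq):
        v = np.array([seq.count(w) for w in WORDS], dtype=float)
        n = np.linalg.norm(v)
        return v / n if n else v

    def train_som(vectors, grid=(6, 6), epochs=20, lr=0.5, sigma=2.0):
        rng = np.random.default_rng(0)
        w = rng.random((grid[0], grid[1], len(WORDS)))
        coords = np.indices(grid).transpose(1, 2, 0)          # (rows, cols, 2)
        for t in range(epochs):
            a = lr * (1.0 - t / epochs)                       # decaying rate
            for v in vectors:
                d = ((w - v) ** 2).sum(axis=2)
                win = np.unravel_index(d.argmin(), d.shape)   # best-matching unit
                h = np.exp(-((coords - win) ** 2).sum(axis=2) / (2 * sigma ** 2))
                w += a * h[..., None] * (v - w)
        return w

ESTs mapped to the same winning unit then form one subset that can be clustered and aligned with existing tools.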

We are using the Visual Data SOM software package from Visipoint Oy (http://www.visipoint.fi/index.html) and custom software components developed in the Barley EST sequencing project at the Institute of Biotechnology Plant Genomics group (http://www.biocenter.helsinki.fi/bi/bare-1_html/est.htm). Visual Data uses a tree-structured SOM (TS-SOM) algorithm [1], a variation of the classical Kohonen SOM [2].

[1] Koikkalainen, P. (1994) in: Proceedings of the 11th European Conference on Artificial Intelligence (Cohn A., Ed.), pp. 211-215, Wiley and Sons, New York.
[2] Kohonen, T. (1995) Self-organizing maps. Springer, Berlin.


49. Analysis of Information Content for Biological Sequences
Jian Zhang, EURANDOM, Eindhoven, The Netherlands;
jzhang@euridice.tue.nl
Short Abstract:

 We present an exploratory approach to parsing and analyzing a set of multiple DNA and protein sequences. It is based on an analysis-of-variance (ANOVA) type decomposition of the information content. Our method is applied to parsing and clustering some protein sequences.

One Page Abstract:

 Decomposing a biological sequence into modular domains is a basic prerequisite for identifying functional units in biological molecules. Several commonly used segmentation procedures consist of two steps: first, collect and align a set of sequences homologous to the target sequence; then, parse this multiple alignment into several blocks and identify the conserved ones using a semi-automatic method that combines manual analysis and expert knowledge. In this paper, we present an exploratory approach to parsing and analyzing such a multiple alignment. It is based on an analysis-of-variance (ANOVA) type decomposition of the information content, a variant on the concept reviewed by Stormo and Fields (1998). Unlike the traditional changepoint method, our approach takes into account not only the composition biases but also the overdispersion effects among the blocks. Our method is tested on the families of ribosomal proteins with promising performance. Finally, we extend our approach to the problem of clustering a set of objects labeled by probability vectors. As an application, our approach is applied to clustering protein sequences via their pairwise alignment scores.
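
The underlying quantity is easy to compute. Below is a minimal sketch of per-column information content for a protein alignment (Shannon information against a uniform background; the ANOVA-style block decomposition itself is not reproduced here).

    from math import log2

    def column_information(column, alphabet="ACDEFGHIKLMNPQRSTVWY"):
        """Information content of one alignment column, in bits (gaps ignored)."""
        counts = [column.count(a) for a in alphabet]
        n = sum(counts)
        if n == 0:
            return 0.0
        info = log2(len(alphabet))                   # maximum-entropy baseline
        for c in counts:
            if c:
                p = c / n
                info += p * log2(p)                  # info = log2|A| - H(column)
        return info

    def profile_information(alignment):
        """alignment: list of equal-length sequences (rows)."""
        return [column_information("".join(col)) for col in zip(*alignment)]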


50. Clustering proteins by fold patterns
David Gilbert, Department of Computing, City University, London, UK;
Juris Viksna, Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia;
Aik Choon Tan, Department of Computing, City University, London, UK;
Lorenz Wernisch, Department of Crystallography, Birkbeck College, University of London, London, UK;
drg@soi.city.ac.uk
Short Abstract:

 We describe a technique to automatically divide a set of proteins into clusters each associated with a common topological pattern, using TOPS descriptions as formal models of protein folds. We are applying the technique to generate characteristic patterns associated with EC functional groups. 

One Page Abstract:

 We have developed a technique to automatically divide a set of proteins into clusters each associated with a common topological pattern. This technique can be applied to families of protein domains with structurally diverse members where there is no clear structure-based tree hierarchy. Examples are the families determined by the Enzyme Classification scheme; EC numbers are assigned at the chain level in the PDB and thus all the domains comprising a chain will be assigned the same number. The motivation is that a pattern associated with an undivided diverse group may be very small and of weak descriptive and hence classificatory power. Our ultimate aim is to discover patterns with a high classificatory power for diverse protein families.

Our method takes as input TOPS descriptions of protein structures (Gilbert et al, 1999) and uses a development of the pattern discovery method for topological patterns described in (Gilbert et al, 2001) which employs repeated pattern extension and matching. In our new method, we permit patterns to be discovered for less than 100% of the examples in a learning set S. Effectively this means that having discovered a maximal pattern which describes all the members of S, we attempt to continue to extend that pattern to a larger pattern P which only matches a subset T of S. We then remove the matched set T from S to give S' and repeat the procedure over S' until the learning set becomes empty. The result of our method is a cover of the initial set S of structures, comprising a partition of S into subsets each associated with a descriptive pattern.

The critical issue is to decide when to stop extending the pattern; we stop when the 'goodness' of the pattern reaches a certain threshold (PV). Otherwise, we may end up generating a pattern which is associated with just one domain, and this is likely to be too specific and of no use in classifying new structures. This 'goodness' is computed as a function of the compression that the pattern achieves over all the members of the matching set T and its coverage, i.e. the size of T compared to S.

We have applied our algorithm to a representative subset of the Protein Data Bank, derived from non-identical representatives (N-reps) of release 2.0 of the CATH database (www.biochem.ucl.ac.uk/bsm/cath). We selected those domains to which a function in the EC classification has been assigned, and further restricted our set to those domains with some beta sheet content (since our pattern discovery method does not work well for all-alpha domains). The table of associations between EC numbers and domains was supplied by Roman Laskowski of the BSM group at University College. We performed our pattern discovery and clustering over 141 families of proteins (each defined by every domain sharing the same four EC numbers) containing at least four members. Some results for this set, with graphical illustrations of the patterns, can be found at http://www.soi.city.ac.uk/~drg/tops/EC_alphabeta/

References

Gilbert D, Westhead DR, Nagano N, Thornton JM. Motif-based searching in tops protein topology databases. Bioinformatics 1999;15:317-326.

Gilbert D, Westhead DR, Viksna J, Thornton JM. Topology-based protein structure comparison using a pattern discovery technique. Journal of Computers and Chemistry, 2001, in press.

Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) CATH- A Hierarchic Classification of Protein Domain Structures. Structure. Vol 5. No 8. p.1093-1108. 


51. Confidence Measures for Fold Recognition
Ingolf Sommer, Niklas von Öhsen, Alexander Zien, Ralf Zimmer, Thomas Lengauer, GMD/SCAI;
ingolf.sommer@gmd.de
Short Abstract:

 We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. There are two associated problems, first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the identified candidates in order to select the best one.

One Page Abstract:

 We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. There are two associated problems, first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the identified candidates in order to select the best one.

In particular, p-values have been proposed for assigning confidence to local sequence alignment searching procedures, such as BLAST, with great success due to an extensive theoretical backing. We propose empirical approximations to p-values for searching procedures involving global alignments, sequence profiles, and structure-based scores (threading).

We review different methods for detecting remotely homologous protein folds (sequence alignment and threading, with and without frequency profiles, global and local), analyze their fold-recognition performance and establish confidence measures that tell how much trust to put into the prediction made with a certain method.

The analysis is performed on a representative subset of the PDB with at most 40% sequence homology, with proteins classified according to the SCOP classification.

For fold-recognition, i.e. the attempt to find a template fold with a known structure for a target protein sequence whose structure we are searching, we find that methods using frequency profiles generally perform better than methods using plain sequences, and that threading methods perform better than sequence alignment methods. Thus the method of choice for detecting remote homologies is threading using frequency profiles.

In order to assess the quality of the predictions made with these methods, we establish several confidence measures, including raw scores, z-scores, raw-score gaps, z-score gaps, and different methods of p-value estimation (ranging from computationally cheap to more elaborate), and compare them.

The confidence-measure methods are compared with several error measures. For local alignment methods, where the distribution of scores is theoretically known, we find that p-value methods that make use of this knowledge work best, although computationally cheaper methods such as the score gaps perform competitively. For global methods, where no theory is available on the score distribution, score-gap methods perform best.
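
One computationally cheap way to obtain such empirical p-values is to fit an extreme-value distribution to the scores a query obtains against unrelated or shuffled sequences; a minimal, illustrative sketch using SciPy:

    from scipy.stats import gumbel_r

    def empirical_pvalue(background_scores, score):
        """Fit a Gumbel EVD to background scores; return P(random >= score)."""
        loc, scale = gumbel_r.fit(background_scores)
        return gumbel_r.sf(score, loc, scale)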

 * S. F. Altschul et al, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Research, 1997

* M. Gribskov et al, "Profile analysis: Detection of distantly related proteins", PNAS, 1987

* N. Alexandrov et al, "Fast Protein Fold Recognition via Sequence to Structure Alignment and Contact Capacity Potentials", Proc. Pacific Symposium on Biocomputing, 1996

* S. F. Altschul et al. "Basic local alignment search tool", JMB, 1990


52. Protein Structure Prediction By Threading With Experimental Constraints
Mario Albrecht, Institute for Algorithms and Scientific Computing (SCAI), German National Research Center for Information Technology (GMD);
Ralf Zimmer, Thomas Lengauer, Institute for Algorithms and Scientific Computing (SCAI), German National Research Center for Information Technology ;
mario.albrecht@gmd.de
Short Abstract:

 We present an extended and modified version RDP* of the Recursive Dynamic Programming method to predict the structure of proteins. The algorithm is now capable of incorporating additional structural constraints, for instance, atomic distances obtained by mass spectrometry or NMR spectroscopy experiments, into the alignment computation of our threading approach. 

One Page Abstract:

 The threading approach predicts the protein structure by aligning representative protein structures with an amino-acid sequence called the target sequence whose three-dimensional backbone structure is unknown. The sequence-structure alignments obtained are then ranked by score. The best-scoring alignment should identify the template structure that is most compatible with the target sequence and thus afford a meaningful structural model.

However, the problem of developing an accurate scoring function is still unsolved particularly for distantly related target and template folds. Especially, making the scoring scheme reflect diverse biological constraints seems to be a difficult task. Thus threading methods based solely on sequence information often fail. 

To remedy the inherent shortcomings of the scoring function, it becomes necessary to incorporate more biological knowledge on the target protein, which may be obtained from experimental data by mass spectrometry or NMR spectroscopy. These additional constraints such as atomic distances guide the threading process in order to improve the accuracy of fold recognition. Experimental results that taken alone would give insufficient data for the complete structure determination may already yield enough constraints to support the threading procedure considerably. 

Our recursive dynamic programming method searches for structurally correct target-template pairs in suitable sets of alternative near-optimal solutions to the alignment problem, which transcends the usual exact optimization of one biologically incomplete scoring function employed in other threading approaches. The method can incorporate biological constraints directly into the alignment computation by means of different filter algorithms. This is more efficient than weeding out wrong models from the list of already generated complete solutions. In this way, the method can produce biologically more meaningful models that adhere to the structural constraints that are known about the target. This approach improves the fold recognition rate as well as the alignment quality.

 Keywords: protein structure determination, fold recognition, protein threading, experimental constraints, mass spectrometry, NMR, NOE

Selected References:

P. M. Bowers, C. E. M. Strauss, D. Baker: De novo protein structure determination using sparse NMR data. Journal of Biomolecular NMR, 18:311-318, 2000.

R. Thiele, R. Zimmer, T. Lengauer: Protein Threading by Recursive Dynamic Programming, Journal of Molecular Biology, 290(3):757-779, 1999.

Y. Xu, D. Xu, O. H. Crawford, J. R. Einstein: A computational method for NMR-constrained protein threading, Journal of Computational Biology, 7(3/4):449-467, 2000.

M. M. Young, N. Tang, J. C. Hempel, C. M. Oshiro, E. W. Taylor, I. D. Kuntz, B. W. Gibson, G. Dollinger: High throughput protein fold identification by using experimental constraints derived from intramolecular cross-links and mass spectrometry. Proceedings of the National Academy of Sciences USA, 97(11):5802-5806, 2000.


53. Exonerate - a tool for rapid large scale comparison of cDNA and genomic sequences.
Guy St.C. Slater, Ewan Birney, Ensembl / EMBL-EBI;
guy@ebi.ac.uk
Short Abstract:

 Exonerate is a tool for rapid large scale comparison of cDNA with genomic DNA. The algorithm and its implementation are described in the context of related methods. Examples of application of the algorithm for large scale analyses within the Ensembl group are given.

For more information, see: http://www.ebi.ac.uk/~guy/exonerate/

One Page Abstract:

 Exonerate is a tool for rapid large scale comparison of cDNA with genomic DNA.

HSPs are seeded using an FSM built from the word neighbourhoods of multiple ESTs. A bounded sparse dynamic programming algorithm is used to join HSPs to form gapped alignments. Alignments are generated between candidate HSP pairs by dynamic programming algorithms which integrate an affine gap model and PSSM-based splice site prediction for intron modelling.

This approach provides a good approximation of the underlying model, while remaining very fast by restricting the dynamic programming to regions likely to contain significant alignments. Examples of application of the algorithm for large scale comparison of EST data sets with the entire human genome are given.
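
The HSP-joining step can be pictured as a standard chaining dynamic program; the sketch below chains collinear, non-overlapping HSPs under a simple linear gap cost. Exonerate's bounded sparse algorithm is considerably more refined, and the tuple format here is hypothetical.

    def chain_hsps(hsps, gap_cost=0.01):
        """hsps: list of (q_start, t_start, length, score); returns best chain."""
        if not hsps:
            return [], 0.0
        hsps = sorted(hsps)                          # by query, then target start
        best = [h[3] for h in hsps]                  # best chain score ending at j
        prev = [-1] * len(hsps)
        for j, (qj, tj, lj, sj) in enumerate(hsps):
            for i, (qi, ti, li, si) in enumerate(hsps[:j]):
                if qi + li <= qj and ti + li <= tj:  # collinear, non-overlapping
                    gap = gap_cost * ((qj - qi - li) + (tj - ti - li))
                    if best[i] + sj - gap > best[j]:
                        best[j], prev[j] = best[i] + sj - gap, i
        j = max(range(len(hsps)), key=best.__getitem__)
        chain, score = [], best[j]
        while j != -1:
            chain.append(hsps[j])
            j = prev[j]
        return chain[::-1], score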

Exonerate is implemented in C, using the glib library, and is available under the terms of the LGPL.

For more information, see: http://www.ebi.ac.uk/~guy/exonerate/



54. Prospector: Very Large Searches with Distributed Blast and Smith-Waterman
Douglas Blair, Dustin Lucien, Hadon Nash, Dale Newfield, John Grefenstette, Parabon Labs, Parabon Computation;
doug@parabon.com
Short Abstract:

 We have created a novel implementation of BLAST and Smith-Waterman for the Parabon Frontier distributed computing platform. We present design specifics, implementation details, performance results, and sensitivity comparison for very large database searches using the Prospector versions of BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX, and the analogous versions of Smith-Waterman.

One Page Abstract:

 Advances in sequencing technology have created an avalanche of sequence data in the public databases that continues at an astonishing rate. In the year 2000, for example, the number of nucleotides in GenBank more than doubled from 4.6 billion bases to 11.1 billion bases. Even with the completion of the human genome, data from other sequencing projects and ESTs being generated from various organisms and cell types ensure that the floodgates are not going to close anytime soon. Given that large amounts of new sequence data will continue to become available for the foreseeable future, the need to compare new data to existing data will continue to make high-performance solutions for large-scale sequence analysis a critical requirement for many projects. High-performance methods for performing large searches typically involve using specialized hardware or clusters of general-purpose machines. While providing a high level of performance, such systems are costly in terms of both hardware and support infrastructure.

 Distributed computing utilizing the idle cycles of existing workstations represents a software alternative to traditional high-performance computing platforms. In the Parabon Frontier distributed computing model, large jobs are decomposed into hundreds or thousands of tasks and distributed to idle machines, either over the Internet or within an organization's intranet. We have created an application called Prospector that implements both BLAST and Smith-Waterman algorithms for performing large-scale sequence comparison on the Frontier distributed computing platform. As required by the Frontier model, Prospector is written entirely in Java, providing a maximum of safety for the machines providing idle cycles.

 In this poster, we present design specifics, implementation details, and performance results for very large database-to-database searches using Prospector. Results showing the effective scalability of Prospector to searches with thousands of machines are given. Experiments are described using Prospector versions of BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX, as well as the analogous versions of Smith-Waterman. 


55. Accuracy Comparisons of Parallel Implementations of Smith-Waterman, BLAST and HMM Methods
James Candlin, Stephanie Pao, Paracel, Inc.;
candlin@paracel.com
Short Abstract:

 Using known evolutionary relationship data from SCOP, we compare the accuracy of Paracel's GeneMatcher[tm] and BlastMachine[tm] implementations of sequence search algorithms to the originals. We demonstrate that the Paracel implementations are biologically equivalent, but much faster, allowing routine use of even the most rigorous algorithms at a genomic scale. 

One Page Abstract:

Sequence database searching is a workhorse of bioinformatic analysis, and is embodied in a variety of useful methods such as Smith-Waterman and related dynamic programming methods, BLAST, and Hidden Markov Models. Unfortunately, many of these algorithms are slow, especially at a genomic scale. With Paracel's GeneMatcher[tm] and BlastMachine[tm] systems, we have addressed this challenge by developing new hardware and software implementations of these methods that are parallelized and can be used routinely, even on genomic scale projects.

It is crucial that our new implementations retain the general behavior of the original manifestations of the algorithms, so that their biological accuracy is maintained. An approach to assessing this, at least for detecting protein family relationships, is to search an annotated database with a broad set of protein sequences and measure the ability of such methods to find only the evolutionarily related sequences. Effective resources for this are the SCOP (Structural Classification of Proteins) database, which reliably identifies superfamily membership based on structural analysis, and the associated PDBD40 database of domains that are no more than 40% identical to one another, which may be used as in Brenner et al. (1) to test the ability of algorithms to find related sequences.

Using the same general approach, we compare the Paracel implementations of BLAST, Smith-Waterman and HMM to the original software methods. 

The testing set consisted of all sequences in the SCOP PDBD40 database. The comparison test was an all-against-all comparison of all described domain sequences against a database of the same sequences using all the pairwise and multiple-sequence-derived algorithms. We assess the results of the comparisons at various false positive levels and describe the number of detected homologies with each similarity searching method. Our criteria for homology are based on the superfamily classifications within the SCOP database.
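
The evaluation logic can be summarized in a few lines: rank all pairwise hits by score and count true (same-superfamily) pairs recovered before a chosen number of false positives has accumulated. A sketch, with hypothetical input structures:

    def coverage_at_fp(hits, superfamily, max_fp=50):
        """hits: (score, idA, idB) triples; superfamily: id -> SCOP superfamily.
        Returns true-positive pairs found before exceeding max_fp errors."""
        tp = fp = 0
        for score, a, b in sorted(hits, reverse=True):
            if superfamily[a] == superfamily[b]:
                tp += 1
            else:
                fp += 1
                if fp > max_fp:
                    break
        return tp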

We demonstrate that our implementations are biologically equivalent to the original algorithms, but convey the advantages of far greater speed and throughput required for genomic analysis. The comparison also demonstrates the greater accuracy of the slower but more rigorous methods, and that even these are fast enough in our implementation for routine use. 

(1) Brenner, S.E., Chothia, C. and Hubbard, T., 'Assessment of sequence comparison methods with reliable structurally identified distant evolutionary relationships', Proc. Natl. Acad. Sci. USA 95, 6073-6078 (1998).


56. Target-BLAST: an algorithm to find genes unique to pathogens causing similar human diseases
Joanna Fueyo, Department of Pharmacology, University of Pennsylvania School of Medicine;
Jonathan Crabtree, Computational Biology and Informatics Laboratory, University of Pennsylvania;
Jeffrey N. Weiser, Department of Microbiology and Pediatrics, University of Pennsylvania School of Medicine;
fueyo@mail.med.upenn.edu
Short Abstract:

 There is a pressing need for novel data mining algorithms for DNA and protein sequences, especially for discovering which genes may be important in the establishment of infectious disease in humans. Here we present a novel algorithm that finds such genes using a fully computational approach.

One Page Abstract:

 Motivation: This paper describes a novel algorithm, Target-BLAST, designed for the discovery of genes unique to pathogens that share common features. Bacteria that share common features, such as the ability to cause a similar set of human infections, may share genes in common that facilitate colonization of the host or establishment of infection. The organisms that reside in the human respiratory tract are the most common cause of antibiotic usage for infectious disease. In a search for genes unique to respiratory pathogens, we designed an algorithm to find genes that are unique to the organisms that colonize the respiratory tract, since these bacteria may have sequences in common that may be useful as novel, alternative drug targets for respiratory infections. Target-BLAST uses Position-Specific Iterated BLAST (PSI-BLAST) and its predecessor, the Basic Local Alignment Search Tool (BLAST), to discover homologs in microorganisms that share common features, such as the ability to reside in or cause disease in the same host environment.

Results: We illustrate the use of Target-BLAST to search sequences obtained from the public databases to find genes unique to a group of pathogens that cause pneumonia, and discovered a set of 12 genes unique to the bacteria that reside in the respiratory tract. The sensitivity of Target-BLAST to distant relationships is gained through the iterative use of BLAST followed by PSI-BLAST. This approach facilitates the comparison and extraction of information from both partial, unannotated DNA sequences and complete, annotated genomes. Target-BLAST is considerably faster than existing genome analysis tools, and it permits one to find genes conserved in both whole and partial, unannotated genomic data.
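
At its core such a screen reduces to set operations over per-organism homolog tables; the miniature sketch below (hypothetical names; the hit sets would come from the BLAST/PSI-BLAST runs described above) keeps genes with a homolog in every target pathogen and in none of the excluded organisms.

    def unique_to_group(query_genes, hits_in, targets, excluded):
        """hits_in[organism]: set of query genes with a significant homolog."""
        keep = set(query_genes)
        for org in targets:
            keep &= hits_in[org]       # required in every respiratory pathogen
        for org in excluded:
            keep -= hits_in[org]       # absent from all other organisms
        return keep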


57. Software to predict microArray hybridisation: ACROSS
Antoine Janssen, Keygene N.V.;
Jurgen Pletinckx, Algonomics N.V.;
Jan van Oeveren, Martin Reijans, René Hogers, Keygene N.V.;
Philippe Stas, Algonomics N.V.;
Michiel van Eijk, Keygene N.V.;
Ignace Lasters, Algonomics N.V.;
René van Schaik, Keygene N.V.;
aj@keygene.com
Short Abstract:

 High-quality micro-array data can only be obtained when non-specific or cross hybridization is excluded or at least minimized. We developed a new software tool ACROSS to predict hybridization based on sequence alignments and hence to assist in the optimal design of microarray probes.

One Page Abstract:

 Software to Predict MicroArray Hybridization: ACROSS

Jurgen Pletinckx(2), Antoine Janssen(1), Jan van Oeveren(1), Martin Reijans(1), René Hogers(1), Philippe Stas(2), Michiel van Eijk(1), Ignace Lasters(2), René van Schaik(1) 

(1)Keygene N.V., PO Box 216, 6700 AE, Wageningen, The Netherlands, info@keygene.com (2) Algonomics N.V., Technologiepark 4, B 9052 Ghent, Belgium, info@algonomics.com

 Keywords: expression arrays, sequence alignment, cross hybridization, hybridization prediction

Abstract 

Microarrays can be used to detect AFLP® fragments for genotyping or expression analysis by hybridization with labeled AFLP reactions or cDNA, respectively. High-quality data can only be obtained from microarrays when non-specific or cross hybridization is excluded or at least minimized. We developed a new software tool ACROSS to predict hybridization and hence to assist in the optimal design of microarray probes. The ACROSS software takes as input the sequence information of DNA-fragments or oligos in either one or two sets. Homology analysis within one set or between the two sets of sequences is performed to predict fragment hybridization. These predictions are based on quantitative analysis of hybridization data from model experiments with known sequences obtained under realistic experimental conditions. A test set of oligonucleotides, derived from 2 DNA fragments A and B, is used to create series of complementary sequences with increasing sequence homology and hybridized with the corresponding full length fragments as targets. Hybridization results are presented and the characteristics of ACROSS are discussed.

AFLP® is a registered trademark of Keygene N.V. 


58. Discovering Dyad Signals
Eleazar Eskin, Columbia University;
Pavel Pevzner, UCSD;
eeskin@cs.columbia.edu
Short Abstract:

 Dyad signals in DNA sequences consist of two signals that occur a fixed distance apart. Because each component signal may not be statistically significant on its own, we perform an exhaustive search using a pattern driven approach to discover these signals. We present two extensions to pattern driven approaches to improve efficiency. 

One Page Abstract:

 Signal finding is a fundamental and well-studied problem in computational biology. The problem consists of discovering patterns in unaligned sequences. In the context of DNA sequences, the patterns can correspond to regulatory sites which can be used for drug discovery. Current approaches to discovering signals focus on monad signals. These signals are typically short contiguous sequences of a certain length which occur in a statistically significant way in the unaligned sequences, allowing a certain number of mismatches. This statistical significance is assessed using a scoring function of the candidate signal. However, several known regulatory sites are actually dyad signals, and in some cases the two parts are palindromes of each other. A difficulty in discovering dyad signals is that each component monad signal in the dyad may not be statistically significant, making the dyad signal difficult to find using traditional methods.

There have been many approaches presented to discover monad signals. Among the best performing are MEME, CONSENSUS, the Gibbs sampler, random projections, and combinatorics-based approaches (see references at poster). All of these approaches focus on discovering the highest-scoring signals. When applied to discovering dyad signals, they may fail in the case where neither of the pair of monad signals is statistically significant on its own.

In this project we present an algorithm for discovering dyad signals. The algorithm first performs an exhaustive search over potential monad signals in the data and then examines several thousand of the highest scoring monads and looks for dyad signals by checking each pair of signals and determining if they occur a fixed distance apart. Clearly the vast majority of the monad signals examined are not statistically significant on their own.

The computational bottleneck for this approach is the exhaustive search of candidate patterns. The simplest way to perform this search is a pattern driven approach, i.e. checking all possible patterns against the data; unfortunately, for long patterns this is computationally expensive. In this project, we present two efficient extensions to the pattern driven approach: an emulated pattern driven approach and sparse suffix trees. The first extension enumerates only the patterns that actually occur in the data and stores the candidate signals in a hash table; although significantly faster in many practical cases, it requires a large amount of memory. The second extension provides efficient data structures for storing the candidate signals, which reduce the amount of memory used.

The dyad signals are discovered by examining the several thousand highest scoring monad signals, as sketched below. For each pair of these signals, we examine the sequence positions where the patterns occur. For patterns that occur on the same sequence we compute the distance between them; if a pair of patterns consistently occurs at a certain fixed distance, the pair is reported as a dyad signal.
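
As a concrete illustration of this pairing step, the following Python sketch reports a dyad when two monads co-occur at one dominant spacing; the occurrence table, support threshold and consistency fraction are hypothetical stand-ins for illustration, not the parameters used in this work.

    from collections import Counter
    from itertools import combinations

    def find_dyads(occurrences, min_support=5, min_fraction=0.8):
        # occurrences: dict mapping each monad pattern to a list of
        # (sequence_id, position) pairs where the pattern occurs
        dyads = []
        for p1, p2 in combinations(occurrences, 2):
            by_seq = {}
            for seq, pos in occurrences[p2]:
                by_seq.setdefault(seq, []).append(pos)
            # tally the spacings between co-occurring monads
            spacings = Counter()
            for seq, pos1 in occurrences[p1]:
                for pos2 in by_seq.get(seq, []):
                    spacings[pos2 - pos1] += 1
            if not spacings:
                continue
            gap, count = spacings.most_common(1)[0]
            # report a dyad only when one spacing clearly dominates
            if count >= min_support and count >= min_fraction * sum(spacings.values()):
                dyads.append((p1, p2, gap, count))
        return dyads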

 We present several sets of experiments over biological and synthetic data. We first present a set of experiments evaluating the monad signal finding methods. Although we present these methods in order to discover dyad signals, they also perform well at finding monad signals.

We also present a set of experiments over dyad samples. The synthetic data consist of i.i.d. sequences into which a dyad signal was inserted. We also perform experiments over a biological sample, in which we automatically recover the known dyad signal.


59. Efficient all-against-all computation of melting temperatures for dna chip design (up)
Lars Kaderali, Alexander Schliep, University of Cologne;
kaderali@zpr.uni-koeln.de
Short Abstract:

 Determining target specific probes for DNA chips requires the computation of melting temperatures for all pairs of probes and targets. We present a fast algorithm based on an extended nearest neighbor model. The algorithm combines suffix trees and alignment computations. Also, a framework is presented to suggest actual probes. 

One Page Abstract:

 The problem of determining target specific probes for DNA chips requires the computation of melting temperatures for the duplexes formed between all target sequences and all probe candidates. The problem is further complicated by mismatches and possibly unpaired bases within the duplex. For complexity reasons, traditional computer programs restrict themselves to mere string comparisons to ensure the specificity of chip probes.

We present an efficient algorithm to solve this problem. The thermodynamic calculations are based on an extended nearest neighbor model, allowing for both mismatches and unpaired bases within a duplex. We introduce a new thermodynamic alignment algorithm to efficiently calculate melting temperatures. This algorithm is combined with modified generalized suffix trees to speed up the computation.

The algorithm is the core of a software framework to suggest actual probes to be used for DNA chip experiments, given the target sequences as input. 
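
For reference, a minimal sketch of the standard two-state nearest-neighbor melting temperature calculation for a perfectly matched, non-self-complementary duplex is given below, using the unified parameters of SantaLucia (1998). The extended model described above, which also handles mismatches and unpaired bases within the duplex, is not reproduced here.

    import math

    # SantaLucia (1998) unified nearest-neighbor parameters for
    # Watson-Crick pairs: dH (kcal/mol), dS (cal/mol/K)
    NN = {
        "AA": (-7.9, -22.2), "AT": (-7.2, -20.4), "TA": (-7.2, -21.3),
        "CA": (-8.5, -22.7), "GT": (-8.4, -22.4), "CT": (-7.8, -21.0),
        "GA": (-8.2, -22.2), "CG": (-10.6, -27.2), "GC": (-9.8, -24.4),
        "GG": (-8.0, -19.9),
    }
    COMP = str.maketrans("ACGT", "TGCA")
    # duplex initiation terms for terminal G*C and A*T base pairs
    INIT = {"G": (0.1, -2.8), "C": (0.1, -2.8), "A": (2.3, 4.1), "T": (2.3, 4.1)}

    def melting_temperature(seq, total_strand_conc=1e-6):
        """Tm (deg C) of a perfect-match duplex at total strand concentration C_T."""
        dH, dS = 0.0, 0.0
        for end in (seq[0], seq[-1]):
            h, s = INIT[end]
            dH += h; dS += s
        for i in range(len(seq) - 1):
            step = seq[i:i + 2]
            if step not in NN:  # look up the reverse complement instead
                step = step.translate(COMP)[::-1]
            h, s = NN[step]
            dH += h; dS += s
        R = 1.987  # gas constant in cal/(mol*K)
        tm = dH * 1000.0 / (dS + R * math.log(total_strand_conc / 4.0))
        return tm - 273.15

    print(round(melting_temperature("CGTTGA" * 4), 1))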


60. Wavelet techniques for detecting large scale similarities between long DNA sequences (up)
Frederic Guyon, Serge Hazout, INSERM U436, Universite Paris 7;
guyon@urbb.jussieu.fr
Short Abstract:

 We have developed an efficient and fast algorithm to detect large scale similarities between long sequences based on Fast Wavelet Transforms. The FWT algorithm allows one to compute dotplots at different scale levels and to zoom in and out on the regions of interest within a dotplot.

One Page Abstract:

 As complete genomic sequences become available, new methods to tackle very large DNA sequences are required. Some very large scale sequence duplications within and between genomes have been identified; these large scale genomic duplications yield precious information for genome annotation and genome evolution. Yet standard algorithms devised for the comparison of very long sequences are time and space consuming: aligning whole genomes or chromosomes remains a very difficult computational task [5], or is limited to closely related sequences [1].

To tackle this problem, we have applied the discrete wavelet transform to the computation and visualization of dotplots. The dotplot is a well-established technique for comparing two sequences [3, 4]: an image in which dots correspond to matching nucleotides or amino acids. Two difficulties arise when computing dotplots: both the computation time and the memory space are proportional to the product of the lengths of the two sequences. Our method consists of reducing the size of the dotplot by computing coarse dotplots. For this purpose, DNA sequences are transformed into indicator signals. These signals are decomposed using a fast wavelet decomposition technique [2]. Finally, coarse dotplots are computed using the coarse-level representation of the indicator signals.

The low dimensionality of the coarse-level signals makes fast computation of a coarse dot matrix possible. Moreover, the fast multiresolution analysis [2] provides an efficient algorithm for zooming in and out of the dotplot image, making it possible to quickly navigate inside the dotplot from coarse to fine levels. The detection sensitivity and specificity depend on the scale level. At coarse levels, a region of similarity is detectable only if it is sufficiently large; coarse-scale dotplots can thus reveal large regions with a very low level of similarity. At fine levels, regions of similarity are localized more accurately and with higher specificity.
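
The core of the approach can be sketched in a few lines of Python with NumPy: sequences become base-indicator signals, repeated Haar averaging produces the coarse-level representation, and cell (i, j) of the coarse dotplot then approximates the match density between the corresponding windows. The level count and scoring below are illustrative choices, not the authors' implementation.

    import numpy as np

    def indicator_signals(seq):
        """One binary indicator signal per nucleotide."""
        s = np.frombuffer(seq.encode(), dtype=np.uint8)
        return np.stack([(s == ord(a)).astype(float) for a in "ACGT"])

    def haar_coarse(x, levels):
        """Repeated Haar smoothing: average adjacent pairs `levels` times."""
        for _ in range(levels):
            if x.shape[-1] % 2:  # pad to even length
                x = np.concatenate([x, x[..., -1:]], axis=-1)
            x = 0.5 * (x[..., ::2] + x[..., 1::2])
        return x

    def coarse_dotplot(seq1, seq2, levels=4):
        """Cell (i, j) approximates the local match density between
        windows of length 2**levels in the two sequences."""
        a = haar_coarse(indicator_signals(seq1), levels)
        b = haar_coarse(indicator_signals(seq2), levels)
        return np.einsum("ai,aj->ij", a, b)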

References [1] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Res, 27(11):2369-2376, 1999.

[2] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. on Patt. Anal. and Mach. Intell., 11:674-693, 1989.

[3] J. Pustell and F. Kafatos. A high speed, high capacity homology matrix : Zooming through sv40 and polyoma. Nucleic Acids Research, 10(15):4765-4782, 1982.

[4] E.L. Sonnhammer and R. Durbin. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene, 1996.

[5] M. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall, New York, 1995. 


61. TRAP: Tandem Repeat Assembly Program, capable of correctly assembling nearly identical repeats (up)
Martti T. Tammi, Erik Arner, Dept. of Genetics and Pathology, Uppsala University;
Tom Britton, Dept. of Mathematics, Uppsala University;
Daniel Nilsson, Björn Andersson, Dept. of Genetics and Pathology, Uppsala University;
martti.tammi@genpat.uu.se
Short Abstract:

 We present a method to separate almost identical repeats, and a rigorous study of combinatorial and statistical properties of sequencing errors and real differences between repeats in a fragment assembly framework. We also show that it is possible to make assemblies with a pre-defined probability of error using this method.

One Page Abstract:

 The software commonly used for assembly of shotgun sequence data has several limitations. This is especially true when repetitive sequences are encountered. Shotgun assembly is a difficult task, even for non-repetitive regions, but the use of quality assessment of the data and efficient matching algorithms has made it possible to assemble most sequences efficiently. In the case of highly repetitive sequences, however, these algorithms fail to distinguish between sequencing errors and single-base differences in regions containing nearly identical repeats. We present the TRAP method to detect subtle differences between repeat copies, and show results from a rigorous study of combinatorial and statistical properties of sequencing errors and real differences between repeats in a fragment assembly framework. We also show that it is possible to make assemblies with a pre-defined probability of error using this method.

The key step in the TRAP method is the construction of multi-alignments consisting of all defined overlaps for a region, followed by detection of pairs of columns containing coinciding deviations from the column consensus. For each pair, the probability of observing such a pair by chance is computed; if the probability exceeds a threshold, the pair is rejected. The remaining errors are trapped in the clustering process that follows. In a data set containing repeat regions with 1% randomly distributed differences and a sequencing error of up to 11%, demanding a coverage of at least three sequence reads, we could detect 65% of the differences with an error of 0.5%. This is not the final error, since most of these errors can be trapped in the clustering stage. The data set consisted of 108 simulated assemblies with varying repeat lengths and copy numbers.
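
A much simplified version of the column-pair test can be phrased as follows (Python). The independence assumption and the binomial tail below are illustrative; the poster's combinatorial and statistical analysis is more careful.

    from math import comb

    def coincidence_pvalue(col_a, col_b, consensus_a, consensus_b):
        """P(at least the observed number of coinciding deviations by
        chance), assuming sequencing errors hit columns independently.
        col_a, col_b: bases, one per read covering both columns."""
        n = len(col_a)
        dev_a = [x != consensus_a for x in col_a]
        dev_b = [x != consensus_b for x in col_b]
        k = sum(a and b for a, b in zip(dev_a, dev_b))
        # chance probability that one read deviates in both columns
        p = (sum(dev_a) / n) * (sum(dev_b) / n)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

Column pairs with a large p-value are likely chance coincidences and are rejected; small p-values point to reads that share true repeat-copy differences.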

The TRAP software package is implemented in four separate program modules, each performing a distinct task: 1. the initial screening for contamination and determination of read quality; 2. the computation of overlaps; 3. repeat element analysis and fragment layout generation; 4. the generation of the consensus sequence and a status report. TRAP has been shown to work by assembling both real and simulated shotgun data. We have simulated a shotgun project containing eight 1789-base-pair repeats in tandem. The difference between repeat units was 1.0% and the simulated sequencing error 2.71%. TRAP was capable of correctly assembling all 367 sequence reads, while PHRAP could not resolve any of the repeat elements correctly and assembled the sequence reads belonging to the different repeats almost randomly, making it impossible to extract the correct consensus sequence from any of the repeat elements. A simulation of a 140 kb BAC project containing eight 1789-base-pair tandem repeats gave similar results, with only 1.6% of the reads wrongly placed, the difference between repeat copies being 1.0% and the average sequencing error 7.2%. The erroneously placed sequence reads did not affect the consensus sequence. The main features of TRAP are the ability to separate long repeating regions from each other by distinguishing single-base substitutions as well as insertions/deletions from errors, and the ability to pinpoint regions where additional sequencing is needed to efficiently close the remaining gaps. Since repeats are common in most sequencing projects, this software should be of use to the sequencing community.


62. LASSAP: A powerful and flexible tool for (large-scale) sequence comparisons (up)
Heus, H.C., Glémet, E., Raffinot, M., Metayer, K., Chambre, P., Codani, J-J., Gene-IT;
Raffinot, M., Laboratoire Génome et Informatique, CNRS, France.;
heus@gene-it.com
Short Abstract:

 Existing sequence comparison tools are not designed to handle large-scale comparisons efficiently. LASSAP software is designed to compare entire databases of sequences, integrating an adequate set of algorithms that run on multiple processors. It allows extensive result management: databases and results can be queried during all steps of a workflow.

One Page Abstract:

 There is a large collection of good sequence comparison tools available. However, most of these tools have not been designed to handle the results of large-scale comparisons efficiently. To answer complex questions one has to depend on tricks and scripts tailored to each tool to parse and interpret the output. This makes it difficult to redo an experiment with different parameters or another algorithm. Often, bioinformatics people spend a substantial amount of their `quality time' trying to solve trivial problems over and over again.

Therefore, Gene-IT has developed Lassap, a large-scale sequence comparison tool that is powerful and extremely flexible. Lassap has been used to solve everyday bioinformatics problems in academia and industry. Some examples of its use are the annotation of sequences on a micro-array, cross-species genomic comparisons (GENOSCOPE), making non-redundant databases (Swissprot / TrEMBL) and implementing gene family workflows (INTERPRO). All this can be done with a minimal investment of time and resources.

The key specifications of Lassap are:
· The package is built around the comparison of entire databases of sequences.
· A complete set of algorithms is available (optimised for speed and minimal use of system resources): BLAST, Smith/Waterman, Needleman/Wunsch, string matching, pattern matching, enhanced versions of known algorithms and smart combinations of algorithms.
· After sequence comparison, similar sequences can be clustered.
· There are extensive database and result management options. Sequence databases, comparison results and clusters can be sorted, queried and filtered during all steps of the workflow. Hereby, the sequence and its annotation are handled simultaneously. Furthermore, all data structures can be virtually concatenated. In this way it is easy to handle updates of sequence databases, comparison results and clusters.
· The Unix command line interface makes it very easy to incorporate Lassap into an existing bioinformatics infrastructure and to automate complex workflows.
· Lassap is very flexible, e.g. changing an algorithm is as easy as changing a single command line option.
· Lassap is a professional tool. It is written in the C programming language and is available for almost all Unix systems (single and multi-threaded) as well as Linux clusters.
· For more information visit our web site: http://www.gene-it.com


63. Mini-greedy algorithm for multiple RNA structural alignments (up)
J. Gorodkin, Bioinformatics Research Center, University of Aarhus, Denmark;
R. B. Lyngsoe, Department of Computer science, University of California at Santa Cruz, USA;
G. D. Stormo, Department of Genetics, Washington University Medical School, USA;
gorodkin@bioinf.au.dk
Short Abstract:

 A mini-greedy algorithm performs the greedy step on only a core set of sequences; the remaining sequences are aligned with the core in turn, based on a ranking of the sequences. This algorithm results in a significant speed-up of the FOLDALIGN approach for multiple structural alignment of RNA sequences.

One Page Abstract:

 The problem with greedy algorithms is that a large fraction of the sequences in a data set is subject to many redundant computations, making greedy algorithms expensive. Here we present an approach that we call mini-greedy, which first isolates a small core of suitable sequences on which to apply the greedy algorithm. The remaining sequences are then aligned onto the resulting core alignment in an order determined by a ranking scheme, based on the pairwise scores among all sequences. In contrast to algorithms such as CLUSTAL, where the multiple alignment is built from individual clusters of sequences, our algorithm seeks a set of core sequences that have as much as possible in common with as many sequences as possible in the data set. We apply this algorithm to multiple structural alignments of RNA sequences, through FOLDALIGN, and show that the mini-greedy algorithm can reduce the computational time significantly. Different schemes are compared. The mini-greedy algorithm has been implemented as part of the Stem-Loop Align SearcH server at http://www.bioinf.au.dk/slash/.
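
A minimal sketch of the core-selection and ranking step might look like this in Python; the input format and core size are hypothetical, and the actual FOLDALIGN criterion for choosing suitable core sequences is more involved.

    def select_core(pairwise_scores, core_size=4):
        """pairwise_scores: dict mapping frozenset({i, j}) -> alignment score."""
        ids = {i for pair in pairwise_scores for i in pair}
        # total similarity of each sequence to the rest of the data set
        totals = {i: sum(s for pair, s in pairwise_scores.items() if i in pair)
                  for i in ids}
        ranked = sorted(ids, key=totals.get, reverse=True)
        # the greedy step runs on the core; the rest join in rank order
        return ranked[:core_size], ranked[core_size:]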


64. Neural network and genetic algorithm identification of coupling specificity and functional residues in G protein-coupled receptors (up)
Anthony Bucci, Jason M. Johnson, Pfizer Discovery Technology Center;
abucci@brandeis.edu
Short Abstract:

 We demonstrate that artificial neural networks are effective in automated prediction of G-coupling function, and when combined with a genetic algorithm, can identify residues in GPCR sequences that impact G protein coupling specificity. The methods we describe are general and can be applied to other classification problems in molecular biology. 

One Page Abstract:

 G protein-coupled receptors (GPCRs) form a large family of human cell membrane signaling proteins with seven transmembrane segments, transmitting signals across the cell membrane in response to a wide variety of endogenous ligands. GPCRs are attractive drug targets because they are amenable to small molecule intervention and play critical roles in many human diseases. The molecular and cellular functions of many of these proteins are currently unknown. Here, we demonstrate the utility of artificial neural networks (ANNs) to the problem of predicting the G protein-coupling specificity of GPCRs, the key determinant of downstream signaling function. Using a set of ~100 GPCRs with known G-protein specificity, we conducted a cross-validation study comparing performance of ANNs and homology-based classifiers on the G-coupling prediction task. Our results show that ANNs, given access to only a 20-residue window of the GPCR primary sequence, perform as well as BLAST or a nearest-neighbor classifier given access to the full-length sequence. Building on this result, we used a genetic algorithm (GA) to discover a set of GPCR sequence positions that allowed ANNs to outperform both BLAST and nearest-neighbor classifiers in a leave-one-out cross-validation test. These residue positions reveal regions of GPCR structure likely to be involved in G-protein coupling and discrimination among G-protein subtypes. We conclude that artificial neural networks are effective in automated prediction of G-coupling function, and when combined with a GA, can identify specific residues in GPCR sequences that are important for G-protein coupling. The ANN and GA methods we describe are general and can be applied to other classification and function-prediction problems in molecular biology. 


65. Determination of Classificatory Motifs for the Identification of Organism Using DNA Chips (up)
Uta Bohnebeck, Tom Wetjen, University of Bremen, Center for Computing Technologies (TZI);
Denja Drutschmann, University of Bremen, Center for Environmental Research and Technology (UFT);
bohnebec@tzi.de
Short Abstract:

 A procedure for constructing highly sensitive and specific oligo-nucleotides for the identification of organisms will be presented, exemplified with sequences of hepatitis C virus. It can be shown that the common motifs detected in randomly sampled subsets can be generalized, with high probability, to be also present in the population.

One Page Abstract:

 A procedure for constructing highly sensitive and highly specific oligo-nucleotides for the identification of organisms, e.g. in environmental samples, will be presented. The problem of achieving high sensitivity may first be considered as a conservation learning task [2]: common motifs (representing conserved regions) detected in randomly sampled subsets can be generalized, with high probability, to the population. The algorithm for the first step of the procedure calculates all non-redundant common motifs which meet given constraints, i.e. motif length, permitted mismatches, and sample coverage. The common motifs are determined with the aid of a generalized suffix tree using a pattern-driven approach [5]. Each resulting motif corresponds to a set of associated sequences - the potential oligo-nucleotides of the DNA chip - which are concrete substrings of the given sequence set and which represent the variability of the population. For the second step of the procedure, a Blast search in the public nucleotide sequence databases is carried out in order to prove high specificity. Further optimization steps according to hybridization conditions have to be carried out [1].

In the experiments carried out, 171 complete genome sequences of the hepatitis C virus (HCV) were used. A relative arrangement of these sequences based on the idea of maximal unique matches [3] was performed. The result confirmed the observation of [4], i.e. that only the 5'UTR is sufficiently conserved to determine common motifs which can be used to identify HCV in general. Therefore, the motif determination was executed only on the 5'UTR of these 171 sequences, by 10-fold cross-validation using randomly sampled subsets of 137 (80%) sequences. Allowing one mismatch, each subset produced on average a result set containing five common motifs with lengths between 39 and 68 bp. These sets of motifs were not 100% identical, since a motif from one subset could appear enlarged in another subset, or motifs could be concatenated into one larger motif. However, by merging the result sets, a non-redundant set of maximal shared motifs could be extracted. Using the associated sets of oligo-nucleotides, a 100% coverage of the population was obtained. Typically, one of the oligo-nucleotides belonging to a motif showed approximately 95% coverage, while the others represented single variations containing one mismatch.
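
To make the first step concrete, here is a deliberately naive Python sketch of a pattern-driven common-motif search with mismatches and a coverage constraint. Unlike the generalized-suffix-tree algorithm used in this work, it only considers candidate motifs occurring exactly in the first sequence, so it is illustrative rather than equivalent.

    def covers(motif, seq, max_mismatch):
        """True if seq contains a window within max_mismatch of motif."""
        m = len(motif)
        return any(sum(a != b for a, b in zip(motif, seq[i:i + m])) <= max_mismatch
                   for i in range(len(seq) - m + 1))

    def common_motifs(seqs, length, max_mismatch=1, min_coverage=1.0):
        seen, hits = set(), []
        for i in range(len(seqs[0]) - length + 1):
            motif = seqs[0][i:i + length]
            if motif in seen:
                continue
            seen.add(motif)
            frac = sum(covers(motif, s, max_mismatch) for s in seqs) / len(seqs)
            if frac >= min_coverage:
                hits.append((motif, frac))
        return hits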

[1] U. Bohnebeck, M. Nölte, T. Schäfer, M. Sirava, T. Waschulzik, G. Volkmann. An Approach to the Determination of Optimized Oligonucleotide Sets for DNA Chips, In: Proceedings of ISMB'99, Poster and Demonstrations, Heidelberg, 1999.

[2] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to Automatic Discovery of Patterns in Biosequences, Journal of Computational Biology, 5:277-303,1998.

[3] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes, Nucleic Acids Research, 27(11):2369-2376, 1999.

[4] J.H. Han, V. Shyamala, K.H. Richman, M.J. Brauer, B. Irvine, M.S. Urdea, R. Tekamp-Olson, G. Kuo, Q.-L. Choo, and M. Houghton. Characterization of the terminal regions of hepatitis C viral RNA: Identification of conserved sequences in the 5' untranslated region and poly(A) tails at the 3' end, Proc. Natl. Acad. Sci. USA, 88(5):1711-1715, 1991

[5] M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree. In: LATIN'98, volume 1380 of LNCS, 1998.


66. An algorithm for detecting similar reaction patterns between metabolic pathways (up)
Yukako Tohsato, Ryutaro Saijo, Takao Amihama, Hideo Matsuda, Akihiro Hashimoto, Department of Informatics and Mathematical Science, Osaka University;
yukako@ics.es.osaka-u.ac.jp
Short Abstract:

 We have developed a method for detecting similar reaction patterns between metabolic pathways. Given two trees representing pathways, our method extracts similar subtrees between them using a subtree-matching algorithm. We have found a similar reaction pattern between the glycol metabolism and degradation pathway and the fucose and rhamnose catabolism pathway. 

One Page Abstract:

 The metabolic pathway is recognized as one of the most important biological networks. Comparative analyses of pathways give important information on their evolution and on pharmacological targets. In this poster, we present a method for comparing pathways based on the similarity of reactions catalyzed by enzymes. In our approach, the reaction similarity is formulated as a scoring scheme based on the information content of the occurrence probabilities of EC numbers. For example, the occurrence probability P of a match between a pair of similar EC numbers, say 1.1.1.1 and 1.1.1.2, is very rare compared to a randomly chosen combination of such pairs, so we give a higher score, -log P, for the match. Using this scoring scheme, one can perform alignment between two pathways if they can be represented as strings of EC numbers. However, metabolic pathways often include branching structures, and the comparison of such pathways cannot be performed directly by string-matching algorithms. To cope with this issue, we have developed a method for comparing pathways that include branching structures. In our method, pathways are represented as trees and compared by a subtree-matching algorithm. The subtree-matching problem is in general NP-hard; we have therefore developed a greedy algorithm for the comparison. The effectiveness of our method is demonstrated by applying it to metabolic pathways in Escherichia coli, re-constructed from the metabolic maps of the EcoCyc database. As a result, we have found a similar reaction pattern ([2.7.1.31] - [2.7.1.51], [4.1.1.47] - [4.1.2.17], [1.2.1.21] - [1.2.1.22] and [1.1.1.37] - [1.1.1.77]) between the glycol metabolism and degradation pathway and the fucose and rhamnose catabolism pathway.
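
The scoring idea can be illustrated with a small Python sketch in which the occurrence probability P of a match is estimated from how often a randomly chosen EC pair shares at least as many leading fields; this is a simplified reading of the scheme, not the authors' exact estimator.

    import math
    from itertools import combinations

    def shared_depth(ec1, ec2):
        """Number of leading EC-number fields the two enzymes share."""
        d = 0
        for a, b in zip(ec1.split("."), ec2.split(".")):
            if a != b:
                break
            d += 1
        return d

    def reaction_similarity(ec1, ec2, all_ecs):
        """Score = -log P, with P the background probability that a
        random EC pair matches at least this deeply."""
        d = shared_depth(ec1, ec2)
        if d == 0:
            return 0.0
        pairs = list(combinations(all_ecs, 2))
        p = sum(shared_depth(a, b) >= d for a, b in pairs) / len(pairs)
        return -math.log(p) if p > 0 else float("inf")

    ecs = ["1.1.1.1", "1.1.1.2", "2.7.1.31", "2.7.1.51", "4.1.1.47"]
    print(reaction_similarity("1.1.1.1", "1.1.1.2", ecs))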


67. SPP-1 : A Program for the Construction of a Promoter Model From a Set of Homologous Sequences (up)
Ulrike Goebel, Thomas Wiehe, Thomas Mitchell-Olds, Max-Planck Institute of Chemical Ecology;
goebel@stargate.ice.mpg.de
Short Abstract:

 We present a new algorithm which constructs a PROMOTER MODEL from a set of unaligned homologous, coregulated polII promoters. It employs a comparative approach, which in addition to sequence similarity can also take into account DNA structural similarity. In a second phase, higher order modules of conserved motifs are found. 

One Page Abstract:

 We present a new algorithm which constructs a PROMOTER MODEL from a set of unaligned homologous, coregulated polII promoters.

It rests on the following assumptions: DNA contact points of individual members of the transcription initiation complex are constrained in their ability to tolerate mutations and thus stand out as short (6-10 bp) conserved motifs. The arrangement of the proteins in the initiation complex is reflected by a hierarchical arrangement of the binding sites on the DNA, and it is this pattern which really identifies the promoter. It, too, should be at least in part conserved in members of a family of promoters which are known to confer the same expression pattern. Another aspect which has been shown to be conserved, at least in parts of polII promoters, is DNA structure, especially bendability and stiffness. Most probably the sequence conservation seen at transcription factor binding sites is just an extreme case of structural conservation (identical sequences have identical structures). It may well be that there are sites which have drifted apart at the sequence level in different members of a promoter family, while still being conserved with respect to some relevant structural property.

Our algorithm first constructs gap-free blocks of sequence segments from the input sequences. A block can contain zero or multiple segments from a given input sequence. It is maximal with respect to the number of segments, such that all pairs of segments in a block are SIMILAR. In contrast to other existing algorithms, SIMILARITY is a relation which can be freely defined, and in particular can refer to similarity with respect to DNA structural parameters. In a second phase, the algorithm looks for a pattern of these motifs which is common (with variations) to all input sequences. Motifs which are part of such a pattern can not only be trusted more to be truly biologically relevant; the pattern also constitutes a testable hypothesis (a PROMOTER MODEL) about the input family of promoter sequences.


68. A new score function for distantly related protein sequence comparison. (up)
Maricel Kann, Richard Goldstein, The University of Michigan;
mkann@umich.edu
Short Abstract:

 A new method to derive a score function to detect remote relationships between protein sequences has been developed using an optimization procedure. We find that the new score function obtained in such a manner performs better than standard score functions for the identification of distant homologies.

One Page Abstract:

 Aligning the primary structure of a probe sequence with others in the database is one of the most significant and widely used techniques for understanding the evolutionary history of new sequences. Sequences with a high similarity score usually share a common structure and might have similar functions or mechanisms. All of these methods rely on a score function to measure sequence similarity, and the choice of score function is especially critical for distant relationships. We have developed a new method to derive a score function for detecting remote relationships between protein sequences. The new score function was obtained by maximizing a figure of merit representing a measure of success in recognizing homologs of a newly sequenced protein among thousands of non-homologous sequences in the databases. We find that the score function obtained in this manner performs better than standard score functions for the identification of distant homologies.


69. Expression profiling on DNA-microarrays: In silico clone selection for DNA chips (up)
Rainer König, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany;
Johannes Beckers, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Marcus Frohme, Tamara Korica, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany;
Stefan Haas, MPI for Molecular Genetics, Computational Molecular Biology, Ihnestraße 73, 14195 Berlin, Germany;
Matthias Seltmann, Christine Machka, Yali Chen, Alexei Drobychev, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Sabine Tornow, Michael Mader, GSF - National Research Center, Institute for Bioinformatics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Martin Hrabé de Angelis, GSF - National Research Center, Institute of Experimental Genetics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Werner Mewes, GSF - National Research Center, Institute for Bioinformatics, Ingolstädter Landstr.1, 85764 Neuherberg, Germany;
Jörg Hoheisel, DKFZ, Division of Functional Genome Analysis, Im Neuenheimer Feld 506, 69120 Heidelberg, Germany;
Martin Vingron, MPI for Molecular Genetics, Computational Molecular Biology, Ihnestraße 73, 14195 Berlin, Germany;
r.koenig@dkfz.de
Short Abstract:

 As part of the German Human Genome Project, DNA microarray technology is being established to systematically analyse gene function in mouse mutant lines. A software system was developed to select gene-specific IMAGE clones for subsets of known genes. It is designed to optimise for good hybridisation to the target and low cross-hybridisation with other known genes.

One Page Abstract:

 As part of the German Human Genome Project (DHGP), DNA microarray technology is used for a systematic analysis of gene function in ENU-induced mutant mice. To design our chips for chosen subsets of genes, a software system was developed that selects gene-specific EST clones from the IMAGE clone set. To obtain specific expression signals for each gene, the algorithm is designed to optimise two demands on the immobilised antisense probe: (1) good hybridisation to the target and (2) no cross-hybridisation with other known genes. A method addressing these tasks is presented. In an exemplary application, the first chip contains genes with known functions, for example during embryonic development, or genes that are relevant for the pathogenesis of related human diseases. Additionally, a set of constitutively expressed genes was selected to facilitate normalisation. Public access is offered to select clones for known mouse genes (www.dkfz-heidelberg.de/tbi/services/koenig/services/clones2chip_front.pl).


70. Data Mining: Efficiency of Using Sequence Databases for Polymorphism Discovery (up)
David G. Cox, University of Turin / International Agency for Research on Cancer;
Federico Canzian, Catherine Boillot, International Agency for Research on Cancer;
cox@iarc.fr
Short Abstract:

 We selected thirteen genes and determined the complete collection of polymorphisms existing in these genes, in our laboratory using DHPLC or in other laboratories using comparable methods. We then compared these results to polymorphisms found by aligning sequences of the genes in the GenBank database, treating single-base differences between sequences as candidate polymorphisms.

One Page Abstract:

 An open question in research on Single Nucleotide Polymorphisms (SNPs) is: what percentage of the candidate SNPs found by in silico pre-screening are true SNPs? To this end, we selected thirteen genes and determined the complete collection of "true" (experimentally detected) polymorphisms existing in these genes, in our laboratory using Denaturing High Performance Liquid Chromatography (DHPLC) and fluorescent sequencing, or in other laboratories using comparable methods (Single Strand Conformation Polymorphism, Denaturing Gradient Gel Electrophoresis). The genes studied by our group were PTGS2, IGFBP1, IGFBP3, and CYP19. GenBank sequence information was then aligned using two methods, and sequence differences were termed "candidate" polymorphisms. Comparing the series of SNPs obtained experimentally and in silico, we found that in silico methods are relatively specific (up to 55% of candidate SNPs found by SNPFinder were confirmed experimentally) but have low sensitivity (no more than 27% of true SNPs are found by in silico methods).
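
In set terms the comparison reduces to the following sketch (Python); the positions are hypothetical, and "specificity" follows the usage of this abstract, i.e. the experimentally confirmed fraction of in silico candidates.

    def evaluate_candidates(candidate_snps, true_snps):
        confirmed = candidate_snps & true_snps
        return {
            "specificity": len(confirmed) / len(candidate_snps),
            "sensitivity": len(confirmed) / len(true_snps),
        }

    # hypothetical SNPs keyed by (gene, coordinate)
    cands = {("PTGS2", 1204), ("IGFBP3", 88), ("CYP19", 530)}
    truth = {("PTGS2", 1204), ("IGFBP1", 17), ("CYP19", 911), ("CYP19", 530)}
    print(evaluate_candidates(cands, truth))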


71. Automated modeling of protein structures (up)
Morten Nielsen, Ole Lund, Claus Lundegaard, Thomas N. Petersen, Structural Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.;
Jakob Bohr, Soeren Brunak, Structural Bioinformatics, Inc., SAB, San Diego, California;
Garry P. Gippert, Structural Bioinformatics Advanced Technologies A/S (SBI-AT), Agern Alle 3, DK-2970 Hoersholm, Denmark.;
mnielsen@strubix.dk
Short Abstract:

 SBI-AT has developed a novel and highly reliable method for fold recognition based on a ranking of alignment Z-scores computed from sequence profiles and predicted secondary structure. The method was used for template identification in the CASP4 protein structure prediction experiment. 

One Page Abstract:

 SUMMARY

Automated modeling of protein structure from sequence guides the assignment of function and the selection of targets for experimental structure studies, and provides starting points for expert modeling of protein structure.

Fold recognition is an important benchmark in automated modeling approaches. SBI-AT has developed a novel and highly reliable method for fold recognition based on a proprietary ranking of alignment Z-scores computed from sequence profiles and predicted secondary structure (Petersen et al., 2000).

In this presentation we compare results obtained using methods developed at SBI-AT with the well-known PDB-BLAST and FFAS (Rychlewski et al., 2000) methods. Our results in the comparative modeling sections of the CASP4 protein structure prediction experiment are also summarized.

CONCLUSIONS

The SBI-AT fold recognition method performs well compared to FFAS and PDB-BLAST, particularly in the high-reliability regime. A large performance increase for SCOP family relationships (Murzin et al., 1995) impacts strongly on comparative modeling of protein structures.

In CASP4 comparative modeling categories, SBI-AT's automatic modeling method shows a performance comparable to that of the best expert and automatic methods developed elsewhere.

REFERENCES

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res 1997 25:3389-402.

CASP4, http://predictioncenter.llnl.gov/casp4/.

Murzin AG, Brenner SE, Hubbard T, Chothia C (1995). J. Mol. Biol. 247:536-540.

Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O (2000). Proteins 41:17-20.

Rychlewski L, Jaroszewski L, Li W, Godzik A (2000). Protein Sci. 9:232-241.

ABOUT SBI-AT

SBI-AT, a subsidiary of Structural Bioinformatics Inc., San Diego, develops novel computer algorithms for the prediction and exploration of structural and dynamical features of proteins, specifically targeted for use in rational drug design and biotechnology applications. www.strubix.dk.


72. Algorithmic improvements to in silico transcript reconstruction implemented in the Paracel Clustering Package (PCP) (up)
Joseph A. Borkowski, Jun Qian, Cassi Paul, Charles P. Smith, KyungNa Oh, Glen Herrmannsfeldt, Cecilie Boysen, Paracel, Inc.;
borkowski@paracel.com
Short Abstract:

 We describe novel algorithms which improve EST-based transcript reconstruction quality and which allow more accurate splice form detection. These algorithms make use of pairwise overlap and redundancy measurements to identify and remove artifactual chimeric sequences and low-quality 'non-N' sequence segments. We demonstrate their effect on overall transcript quality.
 
 

One Page Abstract:

Transcript reconstruction from EST data, whether public or private, can often be problematic. Ideally, it should be possible to reconstruct a single consensus sequence present in a cell using measured pairwise overlap between individual ESTs. When attempting to reconstruct transcripts, available programs typically overestimate the number of alternative transcripts from a given organism and also tend to produce a number of large false clusters. These problems can have serious negative consequences when using the output of transcript reconstruction for either gene identification or oligo chip design. They are often more acute when a high quality genomic sequence segment covering any particular transcript is not available or not used to guide correct reconstruction.

Overestimation of alternative transcript forms can be proven with thorough comparison to any available high quality genomic sequence data. Incorrectly predicted alternative transcripts will not have a corresponding genomic sequence segment. In most cases this overestimation is due to low quality sequence segments present in the input sequence data. When a particular assembly program is unable to align these sequences with the true transcript along its entire length, they are falsely reported as alternative transcripts of the true transcript.

To reduce false alternative transcript reporting we have developed an algorithm that can identify and remove low quality sequence segments present in ESTs, even when quality values are not available. This algorithm is based on the rate of score drop-off in a pairwise alignment. It has been observed that sequence quality tends to degrade slowly toward the end of an EST. Therefore its score in comparison with higher quality sequences tends to drop off slowly after reaching an inflection point at the end of a high scoring match segment. In contrast, the score for a true alternative transcript tends to drop off at a much faster rate. When an end of a sequence has been identified as being low quality, that end is not used in construction of a consensus sequence. High quality sequence segments that are derived from alternative transcripts are used to create alternative consensus sequences.

In our experience, large false clusters are often due to the presence of contaminants, repeats, or chimeric sequences in the input EST dataset. We have developed an algorithm that detects and reports these chimeric EST sequences. This algorithm exploits all of the computed pairwise overlaps in a set of sequences to determine abrupt break points in the overall contig structure, where a single sequence joins two otherwise coherent subclusters. When such a break point is detected, and supported by a sufficient number of independent sequences on either side, a chimeric sequence is reported. These chimeric sequences are not used for the creation of a cluster.
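
A stripped-down version of such a break-point test can be sketched in Python: a read is flagged when its overlap neighbourhood splits into two well-supported groups that are connected only through the read itself. The traversal below follows only direct overlaps between neighbours, and the support threshold is arbitrary, so this is a sketch of the idea rather than the PCP algorithm.

    def is_chimeric(read, overlaps, min_support=3):
        """overlaps: dict mapping every read to the set of reads it
        overlaps; must contain an entry for each neighbour as well."""
        neighbours = overlaps[read] - {read}
        if len(neighbours) < 2 * min_support:
            return False
        # connected components of the neighbourhood, ignoring `read`
        unvisited, groups = set(neighbours), []
        while unvisited:
            stack, comp = [unvisited.pop()], set()
            while stack:
                r = stack.pop()
                comp.add(r)
                for nb in (overlaps[r] - {read}) & unvisited:
                    unvisited.discard(nb)
                    stack.append(nb)
            groups.append(comp)
        sizes = sorted(len(g) for g in groups)
        # chimera: at least two groups, both well supported
        return len(groups) >= 2 and sizes[-2] >= min_support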

The chimeric and bad end detection algorithms have been incorporated into the Paracel Clustering Package (PCP). We report on both the accuracy of these algorithms and their effect on overall transcript quality. 


73. Computational Structural Genomics (up)
Steven E. Brenner, University of California, Berkeley;
brenner@compbio.berkeley.edu
Short Abstract:

 Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. At the Berkeley Structural Genomics Center, we focus on the organisms Mycoplasma pneumoniae and M. genitalium. Computational components include selection of protein targets, managing experimental data, and analyzing solved structures.

One Page Abstract:

 Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. Underlying this goal is the immense value of protein structure, especially in permitting recognition of distant evolutionary relationships for proteins whose sequence analysis has failed to find any significant homolog. A considerable fraction of the genes in all sequenced genomes have no known function, and structure determination provides a direct means of revealing homology that may be used to infer their putative molecular function. The solved structures will be similarly useful for elucidating the biochemical or biophysical role of proteins that have been previously ascribed only phenotypic functions. More generally, knowledge of an increasingly complete repertoire of protein structures will aid structure prediction methods, improve understanding of protein structure, and ultimately lend insight into molecular interactions and pathways.

We use computational methods to select families whose structures cannot be predicted and which are likely to be amenable to experimental characterization. Methods employed include modern sequence analysis and clustering algorithms. Also consulted is the PRESAGE database for structural genomics, which records the community's experimental work underway and computational predictions. The protein families are ranked according to several criteria, including taxonomic diversity and known functional information. Individual proteins, often homologs from hyperthermophiles, are selected from these families as targets for structure determination. The solved structures are examined for structural similarity to other proteins of known structure. Homologous proteins in sequence databases are computationally modeled, to provide a resource of protein structure models complementing the experimentally solved protein structures.

References

Brenner SE, Levitt M. 2000. Expectations from structural genomics. Protein Sci. 9:197-200.

Brenner SE. 1999. Errors in genome annotation. Trends Genet 15:132-133. 

Brenner SE, Barken D, Levitt M. 1999. The PRESAGE database for structural genomics. Nucleic Acids Res 27:251-253.

Brenner SE, Chothia C, Hubbard TJP. 1998. Assessing sequence comparison methods with reliable structurally-identified distant evolutionary relationships. Proc Natl Acad Sci USA 95:6073-6078.

Brenner SE, Chothia C, Hubbard TJP. 1997. Population statistics of protein structures. Curr Opin Struct Biol 7:369-376.

Brenner SE, Chothia C, Hubbard TJP, Murzin AG. 1996. Understanding protein structure: Using SCOP for fold interpretation. Meth Enzymol 266:635-643.

Brenner SE, Hubbard T, Murzin A, Chothia C. 1995. Gene duplications in the H. influenzae genome. Nature 378:140.

Brenner SE. 1995. World wide web and molecular biology. Science 268:622-623.
 
 


74. Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function (up)
Iddo Friedberg, Hanah Margalit, The Hebrew University, Jerusalem;
idoerg@cc.huji.ac.il
Short Abstract:

 This study addresses the problem of proteins that have the same fold, but no sequence similarity. Using a database of such protein pairs we analyze those positions which are mutually, persistently conserved among close and distant family members. In many cases those positions show a role in function and/or fold.

One Page Abstract:

 Many protein pairs that share the same fold do not have any detectable sequence similarity, providing a valuable source of information for studying sequence-structure relationship. In this study we use a stringent data set of structurally-similar, sequence-dissimilar protein pairs to characterize residues which may play a role in the determination of protein structure and/or function. For each protein in the database we identify amino-acid positions that show residue conservation within both close and distant family members. These positions are termed persistently conserved. We then proceed to determine the mutually persistently conserved positions, those structurally aligned positions in a protein pair that are persistently conserved in both pair-mates. Due to their intra- and inter-family conservation, these positions are good candidates for determining protein fold and function. We find that about 50% of the persistently conserved positions are mutually conserved. A significant fraction of them are located in critical positions for secondary structure determination, they are mostly buried, and many of them form spatial clusters within their protein structures. A substitution matrix based on the subset of persistently mutually conserved positions shows two distinct characteristics: (i) it is different from other available matrices, even those that are derived from structural alignments. (ii) it contains a significant amount of mutual information, emphasizing the special residue restrictions imposed on these positions. Such a substitution matrix should be valuable for protein design experiments. 


75. Using Surface Envelopes in 3D Structure Modeling (up)
Jonathan M. Dugan, Glenn A. Williams, Russ B. Altman, Stanford Medical Informatics;
dugan@smi.stanford.edu
Short Abstract:

 Our group has built unified data structures and algorithms that are highly flexible and applicable to a variety of different data types for modeling macromolecular structures. This poster outlines the development, implementation, and results of algorithms capable of integrating surface shape data into the 3D structure modeling process. 

One Page Abstract:

 Modeling the 3D structure of biological macromolecules assists in the understanding of biological function, and can assist in the discovery of novel pharmaceuticals. Current crystallographic methods for structure determination have been very successful, but are not applicable in all cases. Fortunately, other experimental methods can provide useful data regarding biomolecular structure, although typically these data are noisy and sparse. The sources of these data include those that provide distances (such as NMR, binding, affinity, and crosslinking measurements) as well as those that produce other types of structure information, such as solvent accessibility and overall geometric features, for example volume or the shape of enclosing surface envelopes (SE). Our group has focused on building unified data structures and algorithms that are highly flexible and applicable to a variety of different data types, with the goal of combining these heterogeneous data to maximize their utility in modeling macromolecular structures. This poster outlines the development and implementation of algorithms capable of integrating SE data into the 3D structure modeling process. We present the results of modeling several proteins and test structures with distance data and SE data derived from solved structures.


76. Molecular modelling in studies of SDR and MDR proteins (up)
Erik Nordling, Bengt Persson, MBB, SBC, Karolinska Institutet;
erik.nordling@mbb.ki.se
Short Abstract:

 The presentation covers the use of molecular modelling methods in studies of the medium-chain dehydrogenase/reductase (MDR) and short-chain dehydrogenase/reductase (SDR) protein families. In particular, a subclassification of the MDR family is described, and the substrate specificity of the endoplasmic reticulum-associated amyloid beta-peptide binding protein (ERAB) is investigated.

One Page Abstract:

 The wealth of structural information available through the Protein Data Bank (PDB) may be extended to structural neighbours using homology modelling. The technique may be used routinely down to 40% sequence identity, yielding accurate models if there are no large insertions or deletions in the alignment. Proteins with lower sequence identities can be modelled to reasonable accuracy, but require considerably more care in the modelling process.

We have employed these techniques on members of the SDR (Short-chain Dehydrogenases/Reductases) and MDR (Medium-chain Dehydrogenases/Reductases) protein families. In the first case we modelled ERAB (Endoplasmic Reticulum associated Amyloid b-peptide Binding Protein) from 7alpha-Hydroxysteroid Dehydrogenase (27% sequence identity), yielding a structure that is compatible with known enzymatic data; X-ray crystallography later verified the core parts of the modelled structure. Recently, we have used homology modelling to further clarify the evolutionary relationships within subgroups of the MDR family.

We have also used various docking methods to investigate substrate specificity and binding mechanisms. This has been applied to ADH class I beta and gamma isozymes and ERAB, giving results compatible with kinetic data. 


77. Consensus Predictions of Membrane Protein Topology (up)
Johan Nilsson, Bengt Persson, Gunnar von Heijne, Stockholm Bioinformatics Centre;
johan.nilsson@mbb.ki.se
Short Abstract:

 Consensus predictions of membrane protein topology might provide a means to estimate the reliability of predicted topologies. Using five topology prediction methods according to a "majority-vote" principle, we found that the topology of nearly half of all E.coli inner membrane proteins can be predicted with high reliability (>90% correct predictions).

One Page Abstract:

 Computational methods for identification and characterisation of integral membrane proteins will become increasingly important as the number of completely sequenced genomes increases. At present, several methods are available for prediction of integral membrane protein topology and approaches employed include neural networks, hidden Markov models, multiple sequence alignments and dynamic programming. Considering the large amount of transmembrane proteins in a typical genome (20-25%), even a slight improvement in the ability to predict membrane protein topology will have major effects on e.g. automatic sequence annotation. In this study we have explored the possibility that consensus predictions of membrane protein topology might provide a means to estimate the reliability of a certain predicted topology. Our intention was to improve topology predictions by combining the results obtained from a number of methods according to a "majority-vote" principle. We used five popular topology prediction methods: TMHMM, HMMTOP, MEMSAT, TOPPRED and PHD. Our results show that the fraction of correctly predicted topologies over a test set of 60 Escherichia coli inner membrane proteins with experimentally determined topologies increases with the number of methods that agree. The topology of nearly half of the sequences can be predicted with high reliability (>90% correct predictions) using our approach. 
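
The "majority-vote" principle itself is simple enough to sketch in a few lines of Python. Here a topology is represented naively as a per-residue state string, and two predictions agree only when the strings are identical; the real comparison tolerates small shifts in the predicted TM boundaries, so this is an illustration only.

    from collections import Counter

    def majority_topology(predictions, min_agree=3):
        """Accept a topology only when at least min_agree methods
        propose exactly the same one."""
        topology, votes = Counter(predictions).most_common(1)[0]
        return (topology, votes) if votes >= min_agree else (None, votes)

    # hypothetical per-residue predictions (i = inside, M = membrane, o = outside)
    preds = {"TMHMM": "iiiMMMMMooo", "HMMTOP": "iiiMMMMMooo",
             "MEMSAT": "iiiMMMMMooo", "TOPPRED": "iiMMMMMMooo",
             "PHD": "iiiMMMMMooo"}
    print(majority_topology(list(preds.values())))  # ('iiiMMMMMooo', 4)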


78. Development of New statistical measures of protein model quality and its application for consensus prediction of protein structure (up)
Fang Huisheng and Arne Elofsson, Stockholm Bioinformatics Center, Stockholm University;
fang@sbc.su.se
Short Abstract:

 In this study we have used better statistical measures of the similarity between a protein-model and the correct structure. These new measures have been used to improve the performance of Pcons, a consensus based fold recognition method. We show that using these new measures we obtain better predictions.

One Page Abstract:

 Development of New statistical measures of protein model quality and its application for consensus prediction of protein structure

Fang Huisheng and Arne Elofsson

Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden

 More and more methods for predicting protein structure have been developed, based on different algorithms and sources of information. It has recently been shown that for different targets, different methods produce the best predictions, and that the final prediction accuracy could be improved if the available methods were combined in an ideal manner (Lundström et al., 2001). Earlier studies have shown that a statistical measure, i.e. a P-value, assessing the similarity between a model and the correct structure can be developed (Levitt & Gerstein, 1998) and that it provides a good measure of protein model quality. In this study such a score (the LGscore) was used. However, if the number of matched residues is less than 120, the score distribution does not follow the curves used to calculate the P-value, so the P-value no longer represents the statistics correctly.

In the present work, we have first recalculated the P-values as a function of the number of aligned residues. We use two functions, one describing the average score and another the standard deviation; together they describe the behavior of the score from 10 aligned residues to more than 300. Based on them, we calculate a new P-value using an extreme value distribution, as done by Levitt & Gerstein. The new P-values do not show the same dependency on fragment size as the old ones.
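
For concreteness, a hedged sketch of the length-dependent P-value computation follows; the extreme value tail is standard, whereas the two length-dependence functions are toy stand-ins for the fitted average-score and standard-deviation curves described above.

    import math

    def evd_pvalue(score, n_aligned, mu_fit, beta_fit):
        """P(S >= score) under an extreme value distribution whose
        location and scale depend on the number of aligned residues."""
        mu, beta = mu_fit(n_aligned), beta_fit(n_aligned)
        z = (score - mu) / beta
        # 1 - exp(-exp(-z)), computed stably
        return -math.expm1(-math.exp(-z))

    # toy fits: average score and spread grow with the log of the length
    mu = lambda n: 2.0 * math.log(n)
    beta = lambda n: 0.5 + 0.1 * math.log(n)
    print(evd_pvalue(15.0, 120, mu, beta))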

 In CAFASP2 it was observed that very good models for short targets did not obtain a significant LGscore: even perfect structural similarity for a short protein is not very significant. To overcome this problem when scoring models, we have introduced a new score (the Q-value) that depends on the length of the target; it was calibrated from the P-values of models with 30-50% sequence identity. Using the new and old LGscores and the Q-value, Pcons consensus predictors combining seven servers have been developed. The procedure is as follows: we first compare the similarities between pairs of models, and between each model and the target structure, for 199 targets from LiveBench-2. We then build two models, using multiple linear regression and neural networks respectively, to describe the relationship between the model-model similarities and the model-target similarity in terms of the new LGscore, the old LGscore and the Q-value. Performance trials show that the predictor based on the new LGscore is better than the one based on the old LGscore.

Reference

1. Lundström et al. Pcons: A neural network based consensus predictor that improves fold recognition. 2001.

2. Siew et al. MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16:776-785, 2000.

3. Zemla et al. Processing and analysis of CASP3 protein structure predictions. Proteins: Structure, Function, and Genetics, 1999, Suppl 3:22-29.

4. Levitt M, Gerstein M. A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. USA, 1998, 95:5913-5920.


79. Signal filtration methods to extract structural information from evolutionary data applied to G protein-coupled receptor (GPCR) transmembrane domains (up)
L. Marsh, Dept. of Biology, Long Island University;
lmarsh@liu.edu
Short Abstract:

 The conservation score of GPCR amino acid residues was treated as a noisy signal containing information about solvent accessibility of the residues. This conservation signal was subjected to a Fourier-based filtration/reconstruction method to extract structural information. Solvent accessibility of transmembrane residues was correctly predicted for a series of diverse opsins.

One Page Abstract:

 The relationship between the structure of a protein and the rate of evolution of specific residues is intriguing and complex. Solvent-exposed residues often evolve more rapidly than residues involved in structural contacts. We consider residue conservation during evolution as a signal, albeit an extremely noisy one, reflecting solvent protection and other structural information. Using a signal filtration/reconstruction approach with a novel wavelet-based Fourier approximation, solvent exposure of amino acid residues could be predicted from residue conservation data. The degree of conservation at each position of the TM was calculated for random clusters of TM sequences drawn from a pool of 158 opsins. Aligned sequences were compared using a modified BLOSUM substitution matrix to generate a series representing the degree of conservation at each amino acid position (the 'conservation signal'). The unfiltered conservation signals exhibited a weak, but positive, Pearson correlation with solvent inaccessibility of residues in the rhodopsin structure, supporting a relationship between substitution rate and accessibility in this system. Solvent accessibility of the alpha-helical TMs has a periodicity of about 3.6 residues in rhodopsin. Fourier analysis confirmed that the conservation signal contained structural information, but simple Fourier methods did not yield robust predictions. A filter was designed that permitted enhancement of alpha-helical patterns, accommodation of helix breaks, and a waveform correction for the fact that most residues in the structure are not solvent exposed. This filter was implemented as a wavelet-based Fourier-filtration approximation and produced prediction success rates of >95% for the tested (relatively uniform) TM1 and TM7. The method is now being applied to other systems.
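
To make the periodicity argument concrete, here is a minimal numpy sketch (illustrative only; the authors' method uses a custom wavelet-based filter, not this simple band measure) that estimates how much of a conservation signal's spectral power lies near the 3.6-residue alpha-helical frequency.

    import numpy as np

    def helical_power(conservation, period=3.6, halfwidth=0.03):
        """Fraction of spectral power near the alpha-helical frequency.
        conservation: one conservation value per TM residue; a peak near
        1/3.6 cycles/residue indicates one conserved (buried) helix face."""
        x = np.asarray(conservation, float)
        x = x - x.mean()                              # remove DC component
        spec = np.abs(np.fft.rfft(x)) ** 2
        freqs = np.fft.rfftfreq(len(x), d=1.0)        # cycles per residue
        band = np.abs(freqs - 1.0 / period) < halfwidth
        return spec[band].sum() / spec.sum()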


80. Using clusters to derive structural commonality for ATP binding sites (up)
Yosef Yehuda Kuttner, Mariana Babor, Marvin Edelman, Vladimir Sobolev, Weizmann Institute of Science;
joshef.kuttner@weizmann.ac.il
Short Abstract:

 We developed a method for structural multiple alignment of binding sites with a given ligand, in order to search for similarities in spatial arrangement of binding pocket atoms. An algorithm for identifying clusters of atoms was developed. Our strategy seeks commonalities in arrangement of contacting atoms around rigid ligand components.
 
 

One Page Abstract:

 Keywords: atomic contacts, molecular recognition, cluster, adenine 

We are developing a method for the structural multiple alignment of binding sites for a given ligand, in order to search for similarities in the spatial arrangement of binding pocket atoms. An algorithm for identifying clusters of atoms was developed for this task. For a given flexible ligand, the binding pocket shape in different target proteins might vary considerably due to different ligand conformations. Our strategy was, therefore, to seek commonalities in the arrangement of contacting atoms around rigid, or almost rigid, ligand components. The rigid (or almost rigid) chemical moiety from the different PDB entries was superimposed. LPC software [1] was then used to determine the protein atoms in contact with the ligand and to classify the atomic contacts according to their physico-chemical properties. A search for atomic clusters was then conducted. Atoms were defined as belonging to a cluster if they were within a given distance of each other and came from different PDB entries; additionally, members of a cluster were required to form attractive contacts with the ligand. The ATP molecule was chosen for this study, and we selected the rigid adenine ring moiety as a test object. A non-redundant dataset of 14 PDB entries of ATP-protein complexes (resolution of 2.2 Å or better) was analyzed. The adenine rings of the 14 entries were superimposed with concerted movement of the protein atoms in contact with the rings. Several groups have recently sought structural commonalities in nucleotide base recognition by proteins: Kobayashi and Go [2] found remarkable similarities despite considerable differences in primary sequence; Shi and Berg [3] used consensus sequences to construct novel zinc finger proteins of the Cys-His2 type with increased DNA affinity; Denessiouk and Johnson [4] found similarities in the relative positions of different nucleotide-base binding motifs along the polypeptide chains of related proteins, although not in their three-dimensional arrangement; while Moodie et al. [5] found no specific recognition motif for adenylate in terms of particular residue/ligand interactions, although they found commonalities in shape and polarity properties at ligand/protein interfaces. In our work, hydrophobic clusters were found above and beneath the plane of the adenine ring (as previously indicated by Moodie et al. [5]), which included some hydrophilic atoms acting as proton donors hydrogen-bonded to the conjugated system. We also found two clusters containing atoms that form hydrogen bonds. The network of atomic clusters so determined was taken as the consensus binding-site structure for the adenine ring of ATP. We note that the hydrogen bond acceptor and donor clusters (in contact with N-6 and N-1, respectively) are in a geometric juxtaposition similar to that of the hydrogen bonds between adenine and thymine base pairs in DNA. Cluster positions for the adenine ring were derived, and their relative arrangement served as a fingerprint to search for putative binding sites. When the searching procedure located 6 or more cluster positions, the correct binding site was found for all proteins tested, but usually there were multiple solutions (up to 25 putative pockets). We are now attempting to derive cluster positions for the ribose ring to reduce the number of incorrect solutions.
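
The cluster definition above lends itself to a simple union-find sketch; the fragment below is an illustrative simplification (the cutoff and the cluster-support rule are assumptions, and the attractive-contact requirement is omitted).

    import numpy as np

    def atom_clusters(coords, entry_ids, cutoff=1.5):
        """Single-linkage clustering of contacting atoms after the adenine
        rings have been superimposed.  coords: (n x 3) coordinates in the
        common frame; entry_ids: the PDB entry each atom comes from."""
        coords = np.asarray(coords, float)
        n = len(coords)
        parent = list(range(n))

        def find(i):                                  # union-find with path halving
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(n):
            for j in range(i + 1, n):
                if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                    parent[find(i)] = find(j)         # merge the two clusters

        clusters = {}
        for i in range(n):
            clusters.setdefault(find(i), []).append(i)
        # keep only clusters supported by more than one PDB entry
        return [c for c in clusters.values()
                if len({entry_ids[i] for i in c}) > 1]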

References

[1] Sobolev V., Sorokine A., Prilusky J., Abola E.E., Edelman M. (1999). Automated analysis of interatomic contacts in proteins. Bioinformatics, 15: 327-332.

[2] Kobayashi N., Go N. (1997). A method to search for similar protein local structures at ligand-binding sites and its application to adenine recognition. Eur. Biophys. J., 26: 135-144.

[3] Shi Y., Berg J.M. (1995). A direct comparison of the properties of natural and designed zinc finger proteins. Chem. Biol. 2: 83-89.

[4] Denessiouk K.A., Johnson M.S. (2000). When fold is not important: A common structural framework for adenine and AMP binding in 12 unrelated families. Proteins, 38: 310-326.

[5] Moodie S.L., Mitchell J.B.O., Thornton J.M. (1996). Protein recognition of adenylate: An example of a fuzzy recognition template. J. Mol. Biol., 263: 486-500. 


81. Aromaticity of Domains in Photosynthetic Reaction Centers; A Clue to the Protein's Control of Energy Dissipation during Enzymatic Reactions (up)
Ilan Samish, Avigdor Scherz, Plant Sciences Department, Weizmann Institute of Science, Rehovot, Israel;
Haim J Wolfson, School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel;
Ilan.Samish@weizmann.ac.il
Short Abstract:

 Photosynthetic reaction centers serve as model membrane proteins for studying structure-function relationship. Multiple structural alignment of reaction centers, combinatorial mutagenesis of conserved sites and analysis of the protein microenvironment along the electron-transfer pathway suggest that protein aromaticity is involved in controlling energy-dissipation and reactant-geometry during electron-transfer in an entropy/enthalpy mechanism.

One Page Abstract:

 Photosynthetic reaction centers (RCs), which conduct light-induced electron transfer (ET), may serve as model membrane proteins for studying the functions of conserved 3D elements. First, based on the fact that structure is more conserved than sequence, multiple structural alignment (the MUSTA algorithm) was conducted on RCs from oxygenic and non-oxygenic organisms. The alignment was conducted at the full-RC and at the subunit level, resulting in a 'tree' of common cores for the different subgroups. A common core located around the 4-helix bundle at the center of the complex was found in all RCs compared, in which amino acids (AAs) of particular attributes form clusters. These clusters suggested conservation of aromatic and of highly packed AAs. Second, two conserved AAs in the D1 subunit of the photosystem II RC underwent combinatorial mutagenesis, yielding 11-12 photoautotrophic mutants at each site; neither positively charged nor aromatic AAs were among them. Third, the content of virtual tubes (radii of 2-5 Å) between the ET cofactors in the bacterial RC was examined. Findings included: 1. tubes of up to 3.5 Å radius do not include backbone atoms; 2. all tubes have a uniform atom density; 3. a larger percentage of non-aromatic AAs is found in the domains with slower ET rates; 4. the active branch contains a larger fraction of aromatic AAs than the inactive one. We propose that non-aromatic AAs enable the entropic changes required for energy dissipation in the slow-ET milieu, while rigid domains optimize the reactant geometry required in the fast-ET domains. These findings are proposed to shed light on how the protein manages two contradictory prerequisites: the need to position reactants in a precise configuration during the electronic density migration, and the opposing need to rapidly dissipate the evolved energy in order to avoid the backward reaction.


82. LIGPROT: A database for the analysis and visualization of ligand binding. (up)
Rafael Najmanovich, Eran Eyal, Vladimir Sobolev, Marvin Edelman, Weizmann Institute of Sciences;
rafael.najmanovich@weizmann.ac.il
Short Abstract:

 LigProt is a structural database of paired apo and holo protein forms (derived from the PDB) useful for studies of ligand binding. The database is automatically updated and offers a web-based interface that allows browsing and searching as well as visualization of the superimposed holo and apo forms.

One Page Abstract:

 A database of paired protein structures in complexed (holo-protein) and uncomplexed (apo-protein) forms from the PDB macromolecular structural database can provide a wealth of information to be used as raw data in bioinformatics studies, as well as in the planning of experiments by molecular biologists. Such a database was used in our recent study of side chain flexibility (Najmanovich et al., PROTEINS, 39: 261-268 (2000)). In the present work we: 1. automate our database creation procedure so that the database can be updated regularly to cope with the growth of the PDB, and 2. create a web-based visualization tool similar to MutaProt (http://bioinfo.ac.il/MutaProt) (Eyal et al., Bioinformatics, 17(4): 381-382 (2001)) for searching the database according to several criteria and visualizing the results using Chime.

The database is automatically built in three stages: 1. a list of all ligands present in the PDB is created; 2. all possible candidate apo-protein entries for each entry in list 1 are identified; and 3. each candidate holo-apo pair is tested to ensure that, in both entries, the binding site contains no ligand other than the one under consideration. PDB entries with resolution worse than 2.5 Å or containing DNA or RNA are excluded from the database.

The search and visualization interface allows browsing of the database and searching according to protein and ligand PDB code. We are currently implementing search by protein and ligand name as well as binding site composition and structural characteristics. Once an entry is selected, a list of the intermolecular contacts present in the holo protein is generated using LPC software (http://www.weizmann.ac.il/sgedg/lpc) (Sobolev et al., Bioinformatics, 15(4): 327-332 (1999)). The visualization allows for the inspection of the superimposed structure of the binding site in both entries.


83. ThreadMAP: Protein Secondary Structure Determination (up)
Lydia E. Tapia, Thomas R. Ioerger, Department of Computer Science, Texas A&M University;
James C. Sacchettini, The Center for Structural Biology, Texas A&M University;
ltapia@tamu.edu
Short Abstract:

 We have developed a new algorithm based on pattern recognition that recognizes secondary structure fragments in an electron density map prior to any structure solution, as part of the Textal automated model building system. Our approach consists of tracing the density map, extracting geometry based features, and performing classification.

One Page Abstract:

 Upon the initial construction of a three-dimensional electron-density map of a protein, crystallographers are often faced with low-quality, low-resolution data. Such noisy data often hinder automated methods for determining the structure of a protein. Secondary structure information can help automated methods to refine a map. In addition, obtaining quick secondary structure information directly from an electron density map can enable large-scale protein database searching: for example, the secondary structure of proteins from the PDB can be matched against that of a new electron density map, and homologous structures can then be used to help solve the sequence of the new, unsolved protein.

 We have developed a new algorithm based on pattern recognition that recognizes secondary structure fragments in an electron density map prior to any structure solution, as part of the Textal automated model building system [1]. Our approach consists of tracing the density map, extracting features based on the geometry of the trace, and performing classification. 

An easy way to visualize the structure of an electron density map is to reduce the map to a series of lines representing the core of the density, a trace. An algorithm similar to the one used in Bones [2] is used. Once the map is reduced, simple heuristics such as three-way branching and distance metrics can be used to separate the backbone of the protein from the side chains.

 A set of features has been developed that characterizes this backbone trace. For example, two-dimensional projections of the trace are made that capture the circular nature of a spiraling helix and the directness (movement in only one dimension) of a strand. Three-dimensional features are also used to capture information about the Euclidean distance a trace travels. All the features are extracted for all overlapping windows of twenty trace points (~10 angstroms).

 We currently train on a database of feature vectors and their corresponding DSSP [3] classifications. When we receive a query feature vector, we use a nearest-neighbor approach to find its closest matches within the database. The classifications of the closest matches are used to classify the query vector. Smoothing techniques are used to take advantage of the sequential nature of secondary structure; this gives us more confidence in regions of consistent prediction and removes some ambiguity around structure transition regions. The result of this program is an automatic characterization of secondary structure fragments from a density map alone.
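
A minimal Python sketch of this classification step (feature extraction omitted; the window and smoothing widths are illustrative assumptions):

    import numpy as np

    def classify_trace(windows, train_X, train_y, smooth=5):
        """Nearest-neighbor secondary-structure calls with majority smoothing.
        windows: (n x d) feature vectors, one per overlapping trace window;
        train_X, train_y: database feature vectors and their DSSP labels."""
        raw = []
        for w in windows:
            d = np.linalg.norm(train_X - w, axis=1)
            raw.append(train_y[int(np.argmin(d))])    # closest match's label
        smoothed, half = [], smooth // 2
        for i in range(len(raw)):                     # majority vote exploits the
            seg = raw[max(0, i - half): i + half + 1] # sequential nature of SSEs
            smoothed.append(max(set(seg), key=seg.count))
        return smoothed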

[1] T. Holton, T. Ioerger, J. Christopher, and J. Sacchettini. (2000). Determining protein structure from electron-density maps using pattern matching. Acta Cryst. D56, 722-734.

[2] T. Jones, J. Zou, S. Cowan, and M. Kjeldgaard (1991). Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst. A47, 110-119. 

[3] W. Kabsch and C. Sander (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.


84. FAUST, an algorithm for functional annotations of protein structures using structural templates. (up)
Krzysztof Olszewski, Mariusz Milik, Sándor Szalma, Xiangshan Ni, Molecular Simulation Inc.;
kato@msi.com
Short Abstract:

 FAUST is an automated procedure involving: extraction of functionally significant templates from protein structures and using such templates to annotate novel structures. FAUST templates are used for active and binding site searches in protein structures. Preliminary results of protein structure database annotations with derived Structural Templates are presented.

One Page Abstract:

 FAUST (Functional Annotations Using Structural Templates) is an automated procedure involving the extraction of functionally significant templates from protein structures and the use of such templates to annotate novel structures. Both whole proteins and structural templates can be represented as colored undirected graphs with atoms as vertices and inter-atom distances as edge weights. Vertex colors are based on the chemical identities of the atoms. In this representation, a structural template is defined as a common subgraph of the graphs corresponding to functionally related proteins. Edges are considered equivalent if the inter-atomic distances for corresponding vertices (atoms) differ by less than a threshold value. Hence, in the extraction procedure, pairs of functionally related protein structures are searched for sets of chemically equivalent atoms whose inter-atomic distances are conserved in both structures. Structural Templates resulting from such pairwise searches are combined to maximize classification performance on a training set of chosen protein structures. The FAUST extraction algorithm does not use any external expert input, and it works best for sets of dissimilar structures from non-homologous proteins. The resulting Structural Template provides a new description of the protein function, one which includes the natural plasticity of the protein active site. In the FAUST approach, Structural Templates are used for active and binding site searches in protein structures. Structural Templates are also applicable to the evaluation and refinement of protein models.
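
For illustration, a naive search for a small distance template in a query structure might look as follows; FAUST's actual common-subgraph extraction is far more sophisticated, and the tolerance here is an invented parameter.

    import itertools
    import numpy as np

    def template_hits(template_types, template_dists, atom_types, coords, tol=1.0):
        """Yield atom tuples in the query whose chemical types match the
        template and whose pairwise distances agree within tol angstroms.
        Exhaustive, so only practical for very small templates."""
        coords = np.asarray(coords, float)
        k = len(template_types)
        candidates = [[i for i, t in enumerate(atom_types) if t == tt]
                      for tt in template_types]
        for combo in itertools.product(*candidates):
            if len(set(combo)) < k:
                continue                              # atoms must be distinct
            if all(abs(np.linalg.norm(coords[combo[a]] - coords[combo[b]])
                       - template_dists[a][b]) <= tol
                   for a in range(k) for b in range(a + 1, k)):
                yield combo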

Here we demonstrate FAUST extraction results for the highly divergent family of serine proteases, which exhibits a conserved Structural Template. We compare FAUST Structural Templates to the standard description of serine protease active site conservation and demonstrate the depth of information captured in such a description. We also present preliminary results of protein structure database annotations with the derived Structural Templates.


85. Predicting structural features in protein segments (up)
Fredrik Pettersson, Anders Berglund, Research Group for Chemometrics, Department of Organic Chemistry, University of Umeå;
fredrik.pettersson@chem.umu.se
Short Abstract:

 Using multivariate techniques such as PLS, we are building a predictive model that will be able to determine protein structure for small segments using sequence alone as input. The model is based on a library consisting of a diverse set of 1496 high-quality proteins. Calculations are done using a supercomputer.

One Page Abstract:

 One approach to protein structure prediction is to build up the structure of a whole protein from smaller sequences. These building blocks can either be different secondary structure elements or sequence elements with a specific window size. After determining the structure of each of the segments, the overall structure of the assembled protein can be determined by putting the pieces together in an optimal way. In this project we focus on the first step, that is, finding the structure of the constituents based on sequence alone. Our goal is to make a predictive model that, with sequence as input, is able to identify the most structurally similar match in a sequence library.

A multivariate projection method, PLS, is used for relating sequence similarity to structural similarity. PLS calculates latent variables that give a good approximation of the input data (X) and correlate well with the response (Y). In our case the resulting model describes the correlation between sequence features and structure. When using multivariate techniques, sequence information has to be represented numerically. This is achieved using z-scales, which are based on a physico-chemical characterization of the amino acids.
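
A minimal sketch of the encoding-plus-regression step (the z-scale values shown are placeholders for the published table, and the toy data are invented):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    Z_SCALES = {'A': (0.07, -1.73, 0.09),   # illustrative placeholder values;
                'G': (2.23, -5.36, 0.30)}   # extend with the full 20-residue table

    def encode(segment):
        """Flatten a peptide segment into one numeric descriptor vector."""
        return np.array([v for aa in segment for v in Z_SCALES[aa]])

    X = np.array([encode(s) for s in ['AGAGA', 'GAAGG', 'AAAAA', 'GGGGG']])
    y = np.array([1.2, 3.4, 0.8, 4.1])                # toy structural responses
    pls = PLSRegression(n_components=2).fit(X, y)     # latent-variable regression
    pred = pls.predict(X)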

Based on a library consisting of a diverse set of 1496 high-quality proteins, sublibraries in which the data are divided into smaller segments (5-10 aa) have been constructed. For each protein segment, sequence and structure are characterized. The protein sequence is characterized using z-scales and a number of sequence similarity matrices, and these values are subsequently compared to those of the other members of the sublibrary. Structural similarity is represented as a CRMS value and a value representing similar secondary structure. Based on these data, a PLS model is constructed that will be used for ab initio structure prediction.

The program for sequence and structure characterization with subsequent comparisons has been implemented in Perl and Fortran 77. The outer Perl layer invokes the Fortran 77 program, which performs the computationally heavy processes. Calculations will be performed on a parallel computer at HPC2N at the University of Umeå.

Because structure is highly dependent on the overall properties of the whole protein, we cannot expect to obtain a perfect prediction model. We will be satisfied if, at this initial stage, we are able to score the true match in the library within the top 10 library matches. Preliminary results indicate that this may well be possible, but more work remains to be done.


86. GA Generates New Amino Acid Indices through Comparison between Native and Random Sequences (up)
Satoru Kanai, PharmaDesign, Inc.;
Hiroyuki Toh, Department of Bioinformatics, Biomolecular Engineering Research Institute;
skanai@pharmadesign.co.jp
Short Abstract:

 If the folding information of a protein is encoded by the arrangement of the amino acid residues along the primary structure, the information would be degraded by random shuffling of the residues. We have developed a new method to extract folding information by comparing native sequences with random sequences.

One Page Abstract:

 The amino acid sequence of a protein carries its folding information. If the information is encoded by the arrangement of the amino acid residues along the primary structure, random shuffling of the residues would degrade the information. We developed a new method to compare the native sequence with random sequences generated from it, in order to extract such information. First, amino acid indices were randomly generated; that is, the initial indices carry no significance regarding the features of residues. Next, using the indices, the averaged distance between a native sequence and the random sequences was calculated, based on autoregressive (AR) analysis and linear predictive coding (LPC) cepstrum analysis. The indices were then subjected to a genetic algorithm (GA) using the distance as the fitness, so that the distance between the native sequence and the random sequences becomes larger. We found that the indices converged to hydrophobicity indices under the GA operation. AR analysis with the converged indices revealed that the autocorrelation in the native sequence is related to the secondary structure.
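
The fitness evaluation can be sketched as follows (an illustration assuming a simple least-squares AR fit; the LPC cepstrum component is omitted):

    import numpy as np

    def ar_coeffs(signal, order=4):
        """Least-squares autoregressive coefficients of a numeric signal."""
        x = np.asarray(signal, float)
        A = np.array([x[i:i + order] for i in range(len(x) - order)])
        coef, *_ = np.linalg.lstsq(A, x[order:], rcond=None)
        return coef

    def fitness(index, sequence, n_shuffles=50, seed=0):
        """GA fitness of a candidate index (a map amino acid -> number):
        distance between the native sequence's AR model and the AR models
        of residue-shuffled versions; the GA maximizes this."""
        rng = np.random.default_rng(seed)
        native = ar_coeffs([index[a] for a in sequence])
        seq, dists = list(sequence), []
        for _ in range(n_shuffles):
            rng.shuffle(seq)
            dists.append(np.linalg.norm(native - ar_coeffs([index[a] for a in seq])))
        return float(np.mean(dists))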


87. STING Millennium: Web based suite of programs for comprehensive and simultaneous analysis of structure and sequence (up)
Goran Neshich, EMBRAPA/CNPTIA -Campinas, SP - Brazil;
Roberto C. Togawa, EMBRAPA/CENARGEN - Basilia, DF -Brazil;
Wellington Vilella Torres, Tharsis Fonseca e Campos, Leonardo Lima Ferreira, Adilton Guedes Oliveira, Ronald Tetsuo Miura, Marcus Kiyoshi Inoue, Luiz Gustavo Horita, Georgios Pappas Jr., EMBRAPA/CNPTIA -Campinas, SP - Brazil;
Barry Honig, Columbia University, New York - USA;
neshich@cnptia.embrapa.br
Short Abstract:

 STING Millennium is a web based suite of programs for visualization of molecular structure and comprehensive structure analysis: sequence and structure positions for residues, pattern search, 3D neighbors, H-bonds, structure quality, nature of atomic contacts of intra/inter chain type and residue conservation. Available: http://honiglab.cpmc.columbia.edu/SMS http://asparagin.cenargen.embrapa.br/SMS http://leonina.cnptia.embrapa.br http://morphy.sdsc.edu:8080/SMS/

One Page Abstract:

 STING Millennium is a web-based suite of programs that starts with visualizing molecular structure and then leads the user through a series of operations resulting in a comprehensive structure analysis: amino acid sequence and structure positions, pattern searches, 3D neighbor identification, H-bonds, and angles and distances between atoms are easy to obtain thanks to the intuitive graphic and menu interface. In addition, a user can obtain sequence-to-structure relationships, an analysis of the quality of the structure, the nature and volume of atomic contacts of intra- and inter-chain type, and an analysis of relative amino acid position conservation and its relationship with intra-chain contacts, effectively establishing Folding Essential Residue (FER) indicators. The main aspect of STING Millennium is its ability to combine data delivery through the web with structural analysis tools in order to provide a self-contained instrument for macromolecular studies. More than a simple front-end to the Chime plugin, STING offers analytical services which we only briefly describe here, counting on users referring to the extensive on-line help for further details. STING Millennium is composed of two main windows: the sequence window, which displays the sequence and contains the general menus with the commands, and the structure window, which displays the rendered three-dimensional macromolecular structure. In general terms STING Millennium provides the following services:

* the ability to easily select residues in the sequence, to select elements of secondary structure, and to render and color a molecule in a wide variety of ways (mostly available through the ACTION menu);
* definition of the 3D neighbors of an arbitrarily selected residue;
* definition and display of amino acids participating in interfacial regions between polypeptide chains (through the WINDOWS/Interface chain menu selection);
* building of surfaces for the whole molecule or just its IFR part;
* interactive Ramachandran plots, permitting rapid identification of residues in the disallowed regions and display of the selected residues in the structure window;
* calculation of residue frequency within a selected chain or on an interface, as well as the frequency of those residues filtered through chosen contact parameters;
* hydrogen-bond network calculation, with special attention given to the participation of water molecules;
* contact definition and calculation for the whole molecule and/or for interfaces;
* convenient 2D graphical presentation of parameters extracted from the 3D structure;
* display of sequence neighbors and calculation of relative sequence conservation for a family of homologous proteins.

In the links entry of the main menu, several external services that deal with PDB files are listed. These consist of links to web sites containing programs that accept a PDB code as input to perform useful tasks, which makes STING Millennium highly integrated with other important data resources. STING Millennium is both a didactic and a research tool. It is easy to use and requires virtually no training time. STING Millennium is available at: http://honiglab.cpmc.columbia.edu/SMS http://asparagin.cenargen.embrapa.br/SMS http://leonina.cnptia.embrapa.br and http://morphy.sdsc.edu:8080/SMS/


88. Side chain-positioning as an integer programming problem (up)
Olivia Eriksson, Stockholm Bioinformatics Center, Stockholm University;
Yishao Zhou, Department of Mathematics, Stockholm University;
Arne Elofsson, Stockholm Bioinformatics Center, Stockholm University;
olivia@sbc.su.se
Short Abstract:

 We present a novel integer-programming-based algorithm that finds the optimal set of side-chain rotamers on a fixed backbone in polynomial time. The complexity of this algorithm is similar to that of the commonly used pruning algorithms. Further, it is guaranteed to find the optimal solution.

One Page Abstract:

 The problem of positioning side chains on a fixed backbone is one of the fundamental parts of homology modeling and protein design algorithms. Most homology modeling methods use some algorithm to place side chains onto a fixed backbone, and homology modeling methods are the necessary complement to the large-scale structural genomics projects that are being planned. Recently it has also been shown that for the automatic design of protein sequences it is of the utmost importance to find the global solution to the side-chain positioning problem: if a suboptimal solution is found, the difference in free energy between different sequences will be smaller than the errors of the side-chain positioning itself. Many different algorithms have been developed to solve this problem. The most successful methods have used a fixed rotamer library rather than continuous rotamers, which makes it possible to identify a single global minimum energy conformation. The most promising method to solve this problem in polynomial time today is the dead-end elimination theorem. Here we introduce another method: we formulate the problem as a linear integer program, relax the integer constraints, and solve the resulting linear program. We show that the solution to the relaxed problem will always be integral and is therefore the solution to the original problem. With this problem formulation, the global minimum energy conformation will be found in polynomial time.
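
A toy instance of the relaxation, for two residues with two rotamers each (all energies invented; the abstract gives the formulation only in outline):

    import numpy as np
    from scipy.optimize import linprog

    # Variable order: x = [x1_0, x1_1, x2_0, x2_1, y_00, y_01, y_10, y_11],
    # where xi_r selects rotamer r at residue i and y_rs couples the pair.
    c = np.array([1.0, 0.5, 0.2, 0.8,                 # rotamer self-energies
                  0.0, 2.0, 1.0, 0.1])                # pair energies (clashes high)
    A_eq = np.array([
        [ 1,  1,  0,  0, 0, 0, 0, 0],                 # residue 1 picks one rotamer
        [ 0,  0,  1,  1, 0, 0, 0, 0],                 # residue 2 picks one rotamer
        [-1,  0,  0,  0, 1, 1, 0, 0],                 # y_0* consistent with x1_0
        [ 0, -1,  0,  0, 0, 0, 1, 1],                 # y_1* consistent with x1_1
        [ 0,  0, -1,  0, 1, 0, 1, 0],                 # y_*0 consistent with x2_0
        [ 0,  0,  0, -1, 0, 1, 0, 1],                 # y_*1 consistent with x2_1
    ])
    b_eq = np.array([1, 1, 0, 0, 0, 0])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    print(res.x)   # integral at the optimum for this instance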


89. Prediction of the quality of protein models using neural networks (up)
Björn Wallner, Arne Elofsson, Stockholm Bioinformatics Center;
bjorn@sbc.su.se
Short Abstract:

 Neural networks are trained to predict the quality of protein models, based on accessibility surfaces, contacts between residues, and contacts between 13 different atom types. A correlation coefficient of 0.81 is obtained on an independent test set. This method might be useful for increasing the specificity of fold-recognition methods.

One Page Abstract:

 Models of proteins are made to help our understanding of how a particular protein functions. However, no good measure of the quality of a model exists. To address this problem, neural networks are trained to predict the quality of protein models. Besides providing a measure of model quality, this might also be useful for increasing the specificity of fold-recognition methods.

Here we generate a large set of models, using alignment methods and the homology modeling program Modeller (Sali & Blundell, 1993). The quality of these models was measured using a modified version of the LGscore (Cristobal et al., 2001).

The training was based on accessibility surfaces, contacts between residues, and contacts between 12 different atom types. Training was performed for different cutoffs. For the atom-type contacts, networks were trained on eight cutoffs ranging from 3.0 Å to 4.75 Å in 0.25 Å intervals; contacts with atoms in the same residue were omitted. For the residue contacts, six cutoffs in the range 4 Å to 12 Å were used, and only contacts between residues more than five residues apart in the sequence were counted, to avoid an accumulation of contacts between residues lying close in the sequence. The accessibility surfaces were represented as the fractions of low (<25%), medium (25%-75%) and high (>75%) relative accessibility for each residue.

A neural network was trained for every single combination of parameter type, and the correlation coefficient on an independent test set was calculated as a measure of how well each network performed. For the atom contacts alone the best correlation, 0.70, was obtained with a 4.5 Å cutoff; for the residue contacts, a cutoff of 6 Å gave the best correlation, 0.63. For the accessibility surfaces, low and high relative accessibility gave the best correlations, 0.70 for low and 0.52 for high.

The different parameter types probably contain overlapping information; nevertheless, if a network is trained on a combination of the best atom and residue contacts together with the accessibility surfaces, a correlation coefficient of 0.81 is obtained.
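
A minimal sketch of such a network on synthetic stand-in data (the real features and LGscore targets are described above; everything below is invented for illustration):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 30))      # per-model features: contact counts at the
                                   # best cutoffs plus accessibility fractions
    y = rng.random(200)            # LGscore-style quality targets

    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    net.fit(X[:150], y[:150])
    r = np.corrcoef(net.predict(X[150:]), y[150:])[0, 1]   # test-set correlation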

References

Sali, A. & Blundell, T.L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.

Cristobal, S. et al. (2001). How can the accuracy of a protein model be measured? Manuscript in preparation.


90. Targeting proteins with novel folds for structural genomics (up)
Liam J. McGuffin, David T. Jones, Bioinformatics Group, Brunel University;
liam.mcguffin@brunel.ac.uk
Short Abstract:

 Finding novel folds is an important aim of structural genomics. We have evaluated a number of methods that discriminate between proteins with novel and known folds. We propose that simple secondary structure alignments could identify novel folds more selectively than both sequence alignments and a simple fold recognition method, GenTHREADER. 

One Page Abstract:

 Beyond the era of genome sequencing, the focus has turned to proteomics and in particular to the high-throughput determination of protein structure, or structural genomics. The ultimate objective of structural genomics is to determine the structure of every protein coded by every single gene within a genome, the premise being that once solved, protein structures may be used to decode the functions of the genes identified within a genome. Determining each protein structure experimentally using current techniques is not feasible due to cost and time limitations. Models for proteins with >30% sequence identity to a protein of known structure can be built fairly easily by homology modeling (Sali, 1998; Brenner, 2000; Portugaly et al., 2000). Beyond this, threading or fold recognition methods are able to assign folds to more distantly related proteins; however, this is time consuming and is limited by the current library of templates. Fast fold recognition or genomic threading techniques such as 3D-PSSM (Kelley et al., 2000), SAM-T98 (Karplus et al., 1999) and GenTHREADER (Jones, 1999) have been developed to overcome the time issue. However, these techniques rely on finding some homology to solved structures and may perform poorly when sequences show no apparent evolutionary relationship to any known protein family (Jones, 1999). The problem remains that a fold can, of course, only be recognized if a template fold for the protein exists. The identification of genuinely distinct folds is important to structural genomics: solving the structures of new folds experimentally will increase the range of folds that can be used as models or templates for computational structure determination (Sali, 1998). Therefore, methods must be developed that aim to discriminate between folds which have been seen before (known folds) and those which are novel. Methods capable of identifying novel folds would also greatly benefit the protein structure prediction field, as one of the first questions that must be addressed when predicting the structure of a new protein sequence is whether or not it has a known fold. Sequence-based clustering methods such as PROTOMAP (Portugaly et al., 2000) have been developed in an attempt to estimate the probability of a protein having a "new" fold. As homologous proteins must by definition have a common fold, sets of sequences with less than, say, 30% identity generally have a higher chance of containing a novel fold than sets of proteins without sequence clustering. However, two similar folds may have very low sequence similarity (even by the standards of sensitive sequence profile comparison), and thus a potential novel fold identified by simple sequence searching could easily turn out to have a known structure. In this case, methods based solely on sequence information are unreliable. Alignments of secondary structure elements have been shown to provide a rapid estimate of fold for sequences with no detectable homology to any known structure. Although this kind of method cannot be relied upon for accurate fold recognition, it has been found to offer an improvement over sequence alignment in its ability to assign folds to evolutionarily distant proteins (McGuffin et al., 2001).
It has also been suggested that the class or folding type of distantly related proteins can be discerned simply by measuring differences in amino acid composition (Eisenhaber et al., 1996; Wang et al., 2000), and so composition-based filtering has also been proposed as a possible way of increasing the likelihood of finding new folds. We have compared the ability of a simple fold recognition method (GenTHREADER) and a variety of simple sequence analysis methods to discriminate between domains with novel folds and those with known folds. We have also evaluated methods based on simple pairwise alignments of secondary structure elements. We propose that simple alignments of secondary structure elements could be a more selective method than both GenTHREADER and standard sequence alignment for finding novel folds when sequences show no detectable homology to proteins of known structure.
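
As an illustration of the idea (not the authors' implementation), strings of secondary structure elements can be compared with a standard Needleman-Wunsch recurrence; the scoring parameters below are arbitrary.

    def sse_align(a, b, match=2, mismatch=-1, gap=-1):
        """Global alignment score of two secondary-structure element strings,
        e.g. 'HEEHHE' vs 'HEHHE' (H = helix, E = strand)."""
        n, m = len(a), len(b)
        F = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            F[i][0] = i * gap
        for j in range(1, m + 1):
            F[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                F[i][j] = max(F[i - 1][j - 1] + s,
                              F[i - 1][j] + gap,
                              F[i][j - 1] + gap)
        return F[n][m]

    # A query whose best score against every known fold stays low is a
    # candidate novel fold.
    print(sse_align('HEEHHE', 'HEHHE'))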

Brenner, SE. Target selection for structural genomics. Nature Struct Biol Suppl 2000;967-969.

Eisenhaber, F, Frömmel, C, Argos, P. Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins 1996;25:169-179.

Jones, D. T. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999; 287:797-815.

Karplus, K, Barrett, C, Cline, M, Diekhans, M, Grate, L, Hughey, R. Predicting protein structure using only sequence information. Proteins Suppl 3 1999:121-125.

Kelley, LA, MacCallum RM & Sternberg, MJE. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000; 299:499-520.

McGuffin, LJ, Bryson, K, Jones DT. What are the baselines for protein fold recognition? Bioinformatics 2001;17:63-72.

Portugaly, E, Linial, M. Estimating the probability for a protein to have a new fold: A statistical computational model. Proc Natl Acad Sci USA 2000;97:5161-5166.

Sali, A. 100,000 protein structures for the biologist. Nature Struct Biol 1998;5:1029-1032.

Wang, Z, Zheng, Y. How good is prediction of protein structural class by the component coupled method? Proteins 2000;38:165-175.


91. Protein Structural Domain Parsing by Consensus Reasoning (up)
HwaSeob Joseph Yun, Casimir A. Kulikowski, Ilya Muchnik, Gaetano T. Montelione, Rutgers University;
seabee@cs.rutgers.edu
Short Abstract:

 Combined domain parsing methods based on HMMs and BLAST can provide concrete and definite predictions through consensus reasoning. Classifiers trained on the DDD were tested on SCOP domains from the same families as the DDD seeds, which produced 75.5% accuracy. Tests on Baker's Yeast produced 64.6% correct functional predictions by EC classification.

One Page Abstract:

 Domain parsing, or the detection of signals of protein structural domains from sequence data, is a complex and difficult problem. If carried out reliably it would be a powerful interpretive and predictive tool for genomic and proteomic studies. We report on a novel approach to domain parsing using consensus techniques based on Hidden Markov Models (HMMs) and BLAST searches built from a training set of 1471 continuous structural domains from the Dali Domain Dictionary (DDD). 

Owing to their different underlying mechanisms, the various domain parsing tools each have their own unique advantages, and our method begins by running the individual programs to their full extent so that these distinctive characteristics are preserved. After acquiring possible domain signals, the 1471 results from each tool are ranked and screened by an objective threshold comparable across the programs used. These selections of best hits are then paired only when the targeted domains derive from the same seed in the DDD. The matched pairs differ in their domain boundary predictions on both the N- and C-terminal sides; by plotting these differences on an N-C terminal difference plane, a Pareto set can be extracted to obtain the points with minimal differences and maximal overlapping length among the detected signals. A proper classification of the unknown sequence is assigned by referencing the SCOP definition of the strongest signal collected with this consensus reasoning method.
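
The Pareto step can be sketched in a few lines (illustrative; the actual boundary-difference measures are as described above):

    def pareto_set(points):
        """Non-dominated points when minimizing both coordinates.
        points: (dN, dC) boundary disagreements between two tools'
        predictions for the same DDD seed."""
        return [p for p in points
                if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                           for q in points)]

    pairs = [(3, 10), (5, 4), (12, 2), (6, 6)]
    print(pareto_set(pairs))        # -> [(3, 10), (5, 4), (12, 2)]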

We have tested the approach in two ways. First, validation of domain parsing on an independent test sample of 347 family-matched structural domain sequences from the SCOP database yields a consensus prediction performance rate of 75.5%, well above the 58% obtained by simple logical agreement of the methods. A second independent test checked the potential of combining methods for functional annotation. Using 339 biochemically well-characterized Baker's Yeast sequences with EC codes matching our model sequences, we compared results at different levels of the EC codes between HMM, BLAST, and disjunctive predictions against the query domains. This showed that there is a slightly higher likelihood of including the right prediction when using the disjunctive prediction than with either method alone: the top 10 BLAST or HMM results contain 64.6% correct exact functional predictions, while comparable matches at the highest EC level yield 93.5% as the upper bound for prediction.


92. Attempt to optimise template selection in protein homology modelling using logical feature descriptors (up)
Alexander Diemand, H. Scheib, T. Schwede, N. Guex, GlaxoSmithKline R&D, Geneva;
azd93529@gsk.com
Short Abstract:

 We address the template selection step in protein homology modelling. The structures of putative templates can vary even if their sequences are highly similar. Clustering them and deriving discriminatory explanations by induction helps expert modellers in decision making. We have assessed our method on a number of protein families.

One Page Abstract:

 Even though for the majority of proteins there is no structural information available, the structures of several families of homologous proteins have been extensively studied. To close the gap between the huge amount of protein sequences available and the still very limited number of protein structures resolved, we apply homology modelling, based on the observation that high sequence similarity implies structural similarity. In this work, we focused on optimising the critical template selection step, i.e. in cases where numerous potential template structures are available.

The structures of these putative templates can vary even if their sequences are highly similar, owing to the presence or absence of substrates or regulatory compounds, domain movements, the experimental method, or the source organism. It is time consuming and potentially error-prone to identify by hand the key features which distinguish these structures and determine their suitability as homology modelling templates. This process has been automated and its usability assessed for a number of protein families from the publicly accessible Protein Data Bank (PDB).

This method first clusters the proteins of a particular family based on structure comparison, with each protein being described by annotations from SwissProt, Prosite, and the PDB file itself. Then, an algorithm generates, by logical induction, the most general hypothesis that distinguishes the clusters from each other in feature space. As a result, an explanatory feature description is obtained, which can be used to guide template selection, either by asking an expert modeller to verify or manually alter the proposed selection or, in an automated mode, by making the most prominent choice.

This method will be integrated into the SwissModel/DeepView protein modelling suite and thus will be made available to other researchers.


93. Prediction of amyloid fibril-forming proteins (up)
Yvonne Kallberg, Magnus Gustafsson, Johan Thyberg, Bengt Persson, Jan Johansson, Karolinska Institutet;
yvonne.kallberg@mbb.ki.se
Short Abstract:

 Amyloid fibrils are formed from different proteins, yet are very similar in spite of differences in the native structures. The fibrils are based on beta-strands, which means that proteins containing mainly helices must undergo structural changes. By comparing secondary structures, we are able to predict fibril formation among such proteins.

One Page Abstract:

 Amyloid fibrils can be formed from different proteins and are associated with severe diseases such as the neurodegenerative Alzheimer's disease and bovine spongiform encephalopathy. In spite of the differences in their native structures, these proteins form very similar amyloid fibrils, with beta-strands perpendicular and beta-sheets parallel to the fibre axis. Thus, amyloid-forming proteins that contain mainly alpha-helical structure must undergo alpha-helix to beta-strand conversions before or during fibril formation. To investigate this, we searched for experimentally determined alpha-helices with predicted beta-strands in 1324 proteins, and found 37 proteins that contained alpha/beta discordant segments. The set includes three known amyloidogenic proteins: the prion protein, the amyloid beta peptide (Abeta) and lung surfactant protein C (SP-C). Three other proteins (transpeptidase, triacylglycerol lipase and coagulation factor XIII) were also found to form amyloid fibrils. It is known that replacement of the valine residues in the discordant segment of SP-C with leucine yields a peptide with a helical conformation, and that Abeta variants that lack the discordant stretch, or that carry key substitutions reverting the discordance, do not form fibrils. Our data strongly suggest that long stretches of alpha-helix/beta-strand discordance predict amyloid fibril formation.
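
A minimal sketch of the discordance scan (the labels and minimum segment length are illustrative assumptions):

    def discordant_segments(observed, predicted, min_len=5):
        """Stretches assigned helix ('H') in the experimental structure but
        predicted strand ('E') from sequence -- the alpha/beta discordance
        linked above to amyloid fibril formation.  The inputs are two
        equal-length secondary-structure strings."""
        segs, start = [], None
        for i, (o, p) in enumerate(zip(observed, predicted)):
            if o == 'H' and p == 'E':
                if start is None:
                    start = i
            else:
                if start is not None and i - start >= min_len:
                    segs.append((start, i))
                start = None
        if start is not None and len(observed) - start >= min_len:
            segs.append((start, len(observed)))
        return segs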


94. Evaluation of structure prediction models using the ProML specification language (up)
Daniel Hanisch, Ralf Zimmer, Thomas Lengauer, GMD - National Research Center, St. Augustin, Germany;
hanisch@cartan.gmd.de
Short Abstract:

 We propose the ProML specification language for proteins and protein families based on the open XML standard. ProML allows for efficient specification and visualization of heterogeneous protein data. As an application, we discuss the representation of features of protein clusters and the use of experimental constraints for validation of structural models. 

One Page Abstract:

 We propose a specification language, ProML, for protein sequences, structures, and families, based on the open XML standard. The language allows for a portable, system-independent, machine-parsable and human-readable representation of essential features of proteins. In contrast to existing XML applications in this field, our emphasis is not on the molecular structure of one protein or molecule (as in CML), nor on the annotation of one gene or one protein for use with a proprietary browser (as in BioML), but on the efficient representation of heterogeneous data associated with one or several proteins. As we developed ProML in the context of structure prediction, we focused on properties useful in threading and clustering algorithms. Extensions for other applications, however, are straightforward to realize within ProML.

To achieve this goal, one ProML document is able to describe several proteins and their properties in a structured manner. ProML defines low-level elements as building blocks for more complex properties. Predefined elements include primary and secondary sequence information, three-dimensional coordinates, CATH structural classification and Prosite patterns. A Property tree relates properties to proteins in a hierarchical manner. We define an optimality criterion for this tree, which allows for efficient use of the represented information in algorithms.
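
Purely as an illustration, such a document might be assembled as below; the element and attribute names are hypothetical, since the abstract does not reproduce the actual ProML schema.

    import xml.etree.ElementTree as ET

    # Hypothetical element names -- the real ProML DTD/Schema defines its own.
    doc = ET.Element('proml')
    prot = ET.SubElement(doc, 'protein', id='example')
    ET.SubElement(prot, 'sequence').text = 'MKTAYIAKQR'
    props = ET.SubElement(prot, 'propertyTree')
    ET.SubElement(props, 'property', type='secondaryStructure').text = 'CHHHHHHHHC'
    ET.SubElement(props, 'property', type='prositePattern').text = 'PS00028'

    print(ET.tostring(doc, encoding='unicode'))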

ProML is of immediate use for several bioinformatics applications: we discuss the clustering of proteins into families and the representation of the specific shared features of the respective clusters. ProML's Property tree defines a hierarchical view of these features, thereby making within-cluster similarities and differences among potential subclusters easily visible to humans and accessible to algorithms.

In a second application, we use experimentally derived constraints, represented in ProML, in a protein structure prediction approach for the validation of proposed theoretical models and the improvement of the fold recognition rate on a representative benchmark protein set. To this end, we computed conserved cores for the structural clusters of our benchmark library and produced ProML documents for the clusters containing the structural cores. By exploiting randomly generated as well as simulated cross-link distance constraints measurable by mass spectrometry, we were able to improve fold recognition on our test set. For this, we applied a post-filtering approach to the results produced by our threading algorithm 123D.

References: 

[1] T. Bray, J. Paoli, and C.M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. February 1998. http://www.w3.org/TR/1998/REC-xml-19980210.html.

[2] P. Murray-Rust. CML - the Chemical Markup Language. http://www.xml-cml.org

[3] Proteomics Inc. BioML - Biological Markup Language. http://www.bioml.com/bioml/.

[4] D. Hoffmann, and R. Zimmer. Fluorescence Energy for Elucidating the 3D-Structure of Biological Macromolecules. German Patent Office, PCT/EP99/01008, 10. Feb 1999

[5] N. Alexandrov, R. Nussinov, and R. Zimmer. Fast Protein Fold Recognition via Sequence to Structure Alignment and Contact Capacity Potentials. In Pacific Symposium on Biocomputing'96, 53-72, 1996. 


95. Incremental Volume Minimization of Proteins (represented by Collagen Type I (local minimization)) (up)
Meir Israelowitz, P. Campbell, L. Ernst, J. M. Ernsthausen, W. Galbraith, S. W. Hussain, Carnegie Mellon University;
I. Verdinelli, University of Rome;
Troy Wymore, Pittsburgh Supercomputer Center;
D. L Farkas, Carnegie Mellon University;
meir@andrew.cmu.edu
Short Abstract:

 Type I collagen, a major extra-cellular matrix protein, is a long-sequence fibrous biopolymer. The importance of considering mathematical models for collagen structure is in the fact that about 90% of the structures of most inner organs are made of some type of collagen.

One Page Abstract:

 Type I collagen, a major extra-cellular matrix protein, is a long-sequence fibrous biopolymer. The importance of considering mathematical models for collagen structure lies in the fact that about 90% of the structure of most inner organs is made of some type of collagen; hence the relevance of simulating collagen structures for tissue engineering, as collagen fibers form the meta-structured basis of tissue. A number of methods can be used to determine the optimal conformations of polypeptides under various conditions, and several techniques have been used to estimate the native conformations of large globular proteins. Our approach to modeling protein structure consists of approximating the process of protein folding, and it can be extended to large, multi-molecular network structures. While almost all existing models consider the random packing of rigid spheres, we reduce the molecular volume by using concepts from low-dimensional topology (braids) and differential geometry. A braid group has the property of maintaining the continuity of a sequence while the minimization is performed (the topology guarantees continuity during the minimization). Our model creates segments from the braid (the segments are of hydrogen-bond length). These segments are amino-acid peptides (or beads), and we minimize a distance-geometry functional rather than simply minimizing the distances between atom centers. We have applied this approach to the PDB files 1BBF, 1CGD, 1AQ5 and to Collagen Type I.


96. Automatic Inference of Protein Quaternary Structure from Crystallographic Data. (up)
Hannes Ponstingl, European Bioinformatics Institute, EMBL-EBI, Hinxton, UK.;
Thomas Kabir, Biomolecular Structure and Modelling Unit, Biochemistry and Molecular Biology Department, University College London, ;
Janet M. Thornton, European Bioinformatics Institute, EMBL-EBI, Hinxton, and Biomolecular Structure and Modelling Unit, Biochemistry and;
hpo@ebi.ac.uk
Short Abstract:

 A procedure was developed that generates the likely quaternary assembly of a protein from its atom coordinates and crystal symmetry deposited in the Protein Data Bank (PDB). It applies a graph-theoretic algorithm to interface scores derived from crystal structures of globular proteins of distinct oligomeric state in solution.

One Page Abstract:

 The atomic coordinates of protein crystal structures deposited in the Protein Data Bank (PDB) describe the asymmetric unit of the crystal - not the physiologically relevant assembly of the polypeptide chains. Moreover, the PDB annotation of the functional assembly is still sparse and unreliable.

This work is an attempt to provide the macromolecular assemblies likely to be prevalent in solution. The assemblies are ranked according to statistics obtained from a representative set of crystal structures of globular proteins whose oligomeric state in solution is distinct and experimentally established.

All intermolecular contacts present in the crystal are re-generated by applying crystallographic symmetry operations to the deposited coordinates. From this set, those contacts are discarded that are most likely to be artifacts of the crystal environment.

For this task, we derived a scoring function for protein-protein interfaces from the trusted set of water-soluble oligomers. The scoring function, a so-called statistical potential, is based on pairs of atom types and distance information.

Hypothetical assemblies are generated by successively applying a graph-theoretic minimum-cut algorithm to the scored crystal contacts. Thresholds for assembly classification are obtained from statistics on the scores of these minimum-cuts.
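
The recursive splitting can be sketched with a standard global minimum-cut routine (an illustration: the weights and threshold are invented, and the statistically derived thresholds are as described above):

    import networkx as nx

    def split_assembly(G, threshold):
        """Recursively split a chain-contact graph at cheap minimum cuts.
        Nodes are polypeptide chains; edge weights are interface scores."""
        if G.number_of_nodes() == 1:
            return [set(G.nodes)]
        cut_value, (part_a, part_b) = nx.stoer_wagner(G, weight='weight')
        if cut_value >= threshold:       # interfaces too strong: one assembly
            return [set(G.nodes)]
        return (split_assembly(G.subgraph(part_a).copy(), threshold)
                + split_assembly(G.subgraph(part_b).copy(), threshold))

    G = nx.Graph()
    G.add_edge('A', 'B', weight=5.0)     # strong biological interface
    G.add_edge('B', 'C', weight=0.3)     # weak crystal contact
    print(split_assembly(G, threshold=1.0))   # -> [{'A', 'B'}, {'C'}]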

The performance and generalisation behaviour of the procedure in identifying the functional assembly is assessed using cross-validation methods on the data set of trusted oligomers. A comparison is made to scoring interfaces by using a traditional measure of contact size.

The derived interface-scoring function is also expected to prove useful in screening predicted complexes in protein-protein docking protocols. 


97. Modelling Class II MHC Molecules using Constraint Logic Programming (up)
Martin T. Swain, Anthony J. Brooks, Graham J.L. Kemp, University of Aberdeen;
mswain@csd.abdn.ac.uk
Short Abstract:

 The MHC-Thread program uses a heuristic scoring function to predict peptides that are likely to bind to a class II MHC allele, based on the allele's known or modelled 3D structure. To increase its utility, we have developed an automatic technique for modelling peptide binding grooves using constraint logic programming. 

One Page Abstract:

 The identification of peptides which bind to MHC molecules is useful when hunting for regions of a protein which may be responsible for causing an unwanted immune response. The MHC-Thread program analyses three-dimensional models of candidate peptides in the peptide binding grooves of class II MHC molecules (Brooks, 1999). Heuristic functions are used to score the complex based upon chemical and spatial complementarity, and thus predict peptides likely to bind to specific alleles. The utility of this program is increased through having an automated method to build models of class II MHC alleles. Sequence comparisons suggest that the overall structure of class II MHC alleles is well conserved and that the main differences between alleles are due to mutations in the vicinity of the peptide binding groove. Thus, side-chain placement is central in constructing models of MHC alleles.

We have developed a novel approach to the side-chain placement problem that uses constraint logic programming (CLP). Our method generates a constraint-based description of atomic packing that is used iteratively to create CLP programs: each program representing successively tighter packing constraints. In these programs rotamer conformations are represented as values for finite domain variables, and bad steric contacts involving rotamers are represented as constraints. The CLP side-chain placement method has been validated by predicting side-chain conformations of X-ray structures with an accuracy comparable to that of other methods (Swain and Kemp, in press).
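
For illustration, the finite-domain formulation can be mimicked with a small backtracking search (plain Python rather than CLP; the domains and clash set are invented):

    def place_side_chains(domains, clashes):
        """Assign one rotamer per residue subject to steric constraints.
        domains: {residue: [rotamer, ...]};
        clashes: set of ((res_i, rot_r), (res_j, rot_s)) pairs that collide."""
        residues = list(domains)

        def consistent(assign, res, rot):
            return all(((res, rot), (r2, v2)) not in clashes and
                       ((r2, v2), (res, rot)) not in clashes
                       for r2, v2 in assign.items())

        def search(assign):
            if len(assign) == len(residues):
                return assign
            res = residues[len(assign)]
            for rot in domains[res]:
                if consistent(assign, res, rot):
                    result = search({**assign, res: rot})
                    if result is not None:
                        return result
            return None

        return search({})

    doms = {'GLU12': [0, 1], 'LYS15': [0, 1]}
    bad = {(('GLU12', 0), ('LYS15', 0))}
    print(place_side_chains(doms, bad))   # -> {'GLU12': 0, 'LYS15': 1}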

Preliminary results obtained from the MHC-Thread program with homology models created using the CLP side-chain modelling system are encouraging, and show good agreement with experimentally derived binding data.

References

Brooks, A.J. (1999) Computational Prediction of HLA-DR Binding Peptides. PhD Thesis, University of Aberdeen.

Swain, M.T. and Kemp, G.J.L. (in press) Modelling protein side-chain conformations using constraint logic programming. Computers & Chemistry. 


98. DoME: Rapid Molecular Docking with Adaptive Mesh Solutions to the Poisson-Boltzmann Equation (up)
Julie C. Mitchell, Lynn F. Ten Eyck, San Diego Supercomputer Center;
mitchell@sdsc.edu
Short Abstract:

 The Docking Mesh Evaluator (DoME) uses adaptive mesh solutions to the Poisson-Boltzmann Equation to evaluate docking energies, interpolating potentials against a mesh that is dense in high gradient regions. DoME achieves a high level of precision in approximating electrostatic potentials and performs energy calculations far more rapidly than traditional methods. 

One Page Abstract:

 With the continued increase in the number of known protein structures comes a wealth of opportunity to predict molecular interactions and docked protein structures via computational means. Many molecular docking methods are able to achieve biologically accurate solutions to protein docking problems. However, it is difficult to obtain both speed and precision in a single algorithm. The most accurate methods are computationally expensive, while faster methods introduce non-trivial computational errors or ignore electrostatic information in favor of more tractable geometric algorithms.

We will present a method for molecular docking that is both highly efficient and uses a detailed implicit solvent model for approximating electrostatic energies. The Docking Mesh Evaluator (DoME) uses adaptive, finite element solutions to the Poisson-Boltzmann equation generated by the Adaptive Poisson-Boltzmann Solver (APBS) to model electrostatics. A simplex lookup scheme is employed to allow rapid interpolation of solutions defined on an irregular mesh. This mesh is also used as a basis for interpolating Lennard-Jones potentials. 
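
The following sketch shows the core interpolation idea for a single tetrahedral element: barycentric coordinates of the query point are obtained from a small linear system and used to blend the vertex potentials. All coordinates and potential values here are made up; the actual DoME/APBS data structures are not reproduced.

# Sketch of interpolating a potential inside one tetrahedral mesh element.
import numpy as np

def interp_in_tet(verts, phi, point):
    """verts: (4,3) tetrahedron vertices; phi: (4,) potentials; point: (3,)."""
    # Barycentric coords b solve: sum_i b_i * v_i = point, sum_i b_i = 1.
    a = np.vstack([verts.T, np.ones(4)])          # 4x4 linear system
    b = np.linalg.solve(a, np.append(point, 1.0))
    if np.any(b < -1e-9):
        raise ValueError('point lies outside this simplex')
    return float(b @ phi)

verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1.]])
phi = np.array([-2.0, 1.0, 0.5, 3.0])             # potentials at vertices
print(interp_in_tet(verts, phi, np.array([0.25, 0.25, 0.25])))  # 0.625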

The underlying scheme for interpolating potential functions has been expanded into a collection of tools for use in molecular docking. In particular, DoME is able to interpolate electrostatic potentials over a grid or surface, comprehensively scan the docking configuration space, and compute local minima of docking potential energies. The initial results for biological problems appear quite promising, and the computations are remarkably fast. For one protein-protein docking problem, computing local minima using an all-atom AMBER potential consumed 30 minutes, while DoME was able to perform the same computation in just a few seconds.


99. Electrostatic potential surface and molecular dynamics of HIV-1 protease brazilian mutants (up)
Elza Helena Andrade Barbosa, Alan Wilter da Silva, Laurent Emanuel Dardenne, Paulo Mascarello Bisch, Pedro Geraldo Pascutti, Federal University of Rio de Janeiro;
ehab@biof.ufrj.br
Short Abstract:

 We performed 1-nanosecond molecular dynamics simulations for eleven HIV-1 protease Brazilian mutants and obtained structural images, Ramachandran plots, and calculations of RMSD, hydrogen bonds and electrostatic potential on the surface. We observed conformational changes near the active site of the mutants and, for some, a lack of electrostatic complementarity, supporting the drug resistance.

One Page Abstract:

 ELECTROSTATIC POTENTIAL SURFACE AND MOLECULAR DYNAMICS OF HIV-1 PROTEASE BRAZILIAN MUTANTS. Barbosa, E.H.A. (1), da Silva, A.W.S. (1), Dardenne, L.E. (2), Bisch, P.M. (1), Pascutti, P.G. (1). (1) Instituto de Biofísica Carlos Chagas Filho - UFRJ; (2) Laboratório Nacional de Computação Científica - CNPq

Drug resistance in HIV-1 protease has emerged in many countries. Using molecular modelling and dynamics tools, we investigated eleven HIV-1 protease Brazilian mutants that are resistant to the usual inhibitors. Theoretical models were built by homology, using as template the NMR 3D structure of the HIV-1 mutant protease C95A found in the Protein Data Bank (code 1BVE). A 1-nanosecond dynamics simulation was performed for all systems (including 1BVE) using the THOR program, a software package developed in our laboratory that uses the GROMOS force field. As a result, structural images and Ramachandran plots were obtained for each mutant, and root mean square deviations were calculated. The hydrogen bonds and van der Waals contacts between the HIV-1 protease mutants and inhibitors were monitored in the active site; they showed relative stability during the dynamics. Most of the models fluctuate around their respective minimized structures during the dynamics simulation; however, conformational changes induced by the mutations were observed. The main conformational changes near the active site were found at positions 26-29, 35, 46, 47 and 53. Electrostatic potentials were calculated on the solvent-accessible surface for all mutants and the usual anti-retroviral drugs, to identify charge and hydrophobic complementarities between them. For some mutants no such complementarities were observed, which would explain the loss of drug activity leading to resistance. These results support the view that resistance to HIV-1 protease drugs can arise from conformational changes and the loss of electrostatic and hydrophobic complementarities in the mutants.
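
For illustration, a minimal version of one analysis step mentioned above (RMSD against a reference structure after optimal superposition, via the Kabsch algorithm) might look as follows; the coordinates are random stand-ins, not trajectory data from this study.

# Hedged sketch: RMSD of a frame vs. a reference after Kabsch superposition.
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between two (n,3) coordinate sets after optimal rotation."""
    p = p - p.mean(axis=0)                 # remove translation
    q = q - q.mean(axis=0)
    u, s, vt = np.linalg.svd(p.T @ q)
    d = np.sign(np.linalg.det(vt.T @ u.T)) # avoid improper rotation
    rot = vt.T @ np.diag([1, 1, d]) @ u.T
    diff = p @ rot.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))

ref = np.random.rand(99, 3) * 10.0         # stand-in CA coordinates
frame = ref + np.random.normal(scale=0.5, size=ref.shape)
print(f'RMSD: {kabsch_rmsd(frame, ref):.2f} A')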


100. Automated functional annotation of protein structures (up)
Mike Hsin-Ping Liang, Russ B. Altman, Stanford University;
mliang@smi.stanford.edu
Short Abstract:

 Current methods for constructing 3D models of protein function and for annotating protein structures are manual and time-intensive. We propose an automated method for constructing 3D models of protein functional sites that can be used for high-throughput annotation of protein structures.

One Page Abstract:

 In the past, protein structures were determined because of specific biological interest in, and background knowledge of, the protein. Recently, various structural genomics initiatives have begun rapidly determining protein structures without prior understanding of their function. There is a growing need to annotate these proteins efficiently to keep up with the rapid increase in structures.

Existing protein sequence motif databases provide putative functions for protein sequences. However, it is well known that structure is more conserved than sequence, and it is the properties associated with particular residues and their relative positions in the structure that convey function. Thus, creating a 3D motif analogous to the 1D sequence motif will increase performance in the annotation of protein function. We propose an automated method for constructing 3D models of protein functional sites by augmenting 1D sequence motifs. The model provides a 3D statistical description of the biochemical and physical properties surrounding a functional site. It can be used to quickly scan a protein structure for potential sites, and also to gain insight into which properties are involved in the particular function. This method has been applied to the EF-hand calcium binding motif.
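
A toy sketch of the kind of 3D statistical description proposed here might tally a property in concentric shells around a candidate site, as below; the atom coordinates, the charge property and the shell edges are all invented.

# Mean property value per radial shell around a putative site centre.
import numpy as np

def shell_profile(coords, values, centre, edges):
    """Mean property value per radial shell around `centre`."""
    r = np.linalg.norm(coords - centre, axis=1)
    idx = np.digitize(r, edges)            # which shell each atom falls in
    return [float(values[idx == i].mean()) if np.any(idx == i) else 0.0
            for i in range(1, len(edges))]

rng = np.random.default_rng(0)
coords = rng.uniform(-8, 8, size=(200, 3))     # stand-in atom positions
charge = rng.choice([-1.0, 0.0, 1.0], size=200)
print(shell_profile(coords, charge, centre=np.zeros(3),
                    edges=np.array([0, 2, 4, 6, 8])))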


101. Structural annotation of the human genome (up)
Arne Mueller, Lawrence A. Kelley, Michael J.E. Sternberg, Imperial Cancer Research Fund;
a.mueller@icrf.icnet.uk
Short Abstract:

 The proteins of the human genome draft (Ensembl 0.8) have been assigned to homologous proteins of known structure. More than one third of the proteome is covered. We have compared the fold and domain composition of different organisms, with a special focus on the proteins encoded by human disease genes.

One Page Abstract:

 In February 2001 the draft sequence of the human genome was published (1). In this work we have annotated the proteins of the public draft, based on the Ensembl version 0.8.0 data-set (http://www.ensembl.org), with protein structure by assigning homologous sequences from the SCOP (2) and PDB databases to human proteins via BLAST/PSI-BLAST (3) and fold recognition using 3D-PSSM (4). The fold composition of proteins encoded by human disease genes is analysed, and the results are compared with those of other organisms.

The draft human genome sequence from the Ensembl data-set contains 28913 different protein sequences of which Blast/PSI-BLAST can assign 44% to at least one protein of known structure (35% of the amino acid residues of the proteome). An additional 41% of the human sequences can be assigned to functionally annotated sequences of the public databases, and a further 16% have homology to sequences of unknown function or hypothetical proteins. Only 8% are without any detectable homology to any other sequence in the public databases including 3% (of the total) that are in non-globular regions.

With 3D-PSSM we can confidently assign 5% of the residues in the human proteome to a protein of known structure (7% of the sequences) that cannot be assigned by PSI-BLAST: 3% are in the fraction that was classified as functionally (but not structurally) annotated by PSI-BLAST, and 2% are located in the fraction of `homology but unknown function'. We are currently working on an optimised version of 3D-PSSM that is better adapted to long protein sequences, to improve our results and to extend the fraction of `unknown function' to which we can assign a protein of known structure, because structure often comes with functional annotation.

Compared to the proteomes of D. melanogaster, C. elegans and S. cerevisiae, for which a fraction of 18% to 20% is completely uncharacterised, the draft human protein set is well annotated (in terms of structure and function). These results may be related to the difficulties of identifying novel genes in the human genome (i.e. gene finding). The human proteome is structurally better annotated than the other three eukaryotic genomes (27% to 28% of the proteome) but less well than most bacterial genomes (lowest 40% for M. tuberculosis, highest 45% for E. coli).

The most popular structural superfamily (as defined by SCOP release 1.53) in the human proteome is the Immunoglobulin superfamily (which often is found as a repetitive unit), and the top ranking superfamilies are similar to those in D. melanogaster but differ (even in total number) from those in C. elegans. We present a detailed analysis of a SCOP based domain comparison between different proteomes. There are 109 superfamilies unique to the four multicellular eukaryotes above, six are unique to yeast (S. cerevisiae and S. pombe), also six superfamilies are unique to the three archaea we have processed and 68 superfamilies are unique to the seven processed bacteria.

Of the 6656 human proteins in the Ensembl database that are linked to a disease in the OMIM database (5), 3278 different proteins have at least one homologue of known structure. More than 5000 SCOP domains can be identified within these proteins. The most popular structural superfamilies resemble those of the proteome in general (e.g. Immunoglobulins, Protein kinase domains, Fibronectin).

The data from our analysis are stored in a relational database managed by MySQL, allowing complex queries and the incorporation of new resources and genomes as they become available (other genomes are currently in the processing pipeline). The data will be made publicly available via the world wide web.

References:

1. International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921

2. Hubbard, T.J.P., Ailey, B., Brenner, S.E., Murzin, A.G. & Chothia, C. (1999). SCOP: A structural classification of proteins database. Nuc. Acids Res. 27:254-256.

3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein data base search programs. Nucleic Acids Res. 25:3389-3402.

4. Kelley, L.A., MacCallum, R.M. & Sternberg, M.J.E. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299:499-520.

5. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 2000. Online Mendelian Inheritance in Man, OMIM (TM). World Wide Web URL: http://www.ncbi.nlm.nih.gov/omim/


102. Estimation of p-values for global alignments of protein sequences. (up)
Caleb Webber, Geoffrey J. Barton, EMBL-European Bioinformatics Institute;
caleb@ebi.ac.uk
Short Abstract:

 The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The method presented here allows the probability that two protein sequences share the same fold to be estimated from the global sequence alignment Z-score. 

One Page Abstract:

 Classification and analysis of full-length protein sequences often involves the global alignment of sequence pairs. The calculation of a Z-score for the comparison gives a length and composition corrected measure of the similarity between the sequences. However, the Z-score alone does not indicate the likely biological significance of the similarity. A new background distribution to estimate the significance of pair-wise sequence alignment scores was developed by comparison of 250 proteins in different fold-families from the SCOP database. All 31,125 unique pairs of sequences were aligned with a range of matrices and gap penalties. The distributions of Z-scores from these alignments were fitted with a peak distribution, from which the probability of obtaining a given Z-score from a global alignment between 2 structurally-unrelated protein sequences was calculated. This analysis was also applied to global alignment of best locally-aligned subsequences, generated by the Smith-Waterman algorithm. The relationships between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, a positive shift was observed for Z-scores derived by global alignment of locally-aligned subsequences, compared to global alignment of the entire sequence. This shift was shown to be the result of pre-selection by local alignment rather than any structural similarity in the sequences. Benchmarking the search ability of both methods using the SCOP superfamily classification showed that global alignment Z-scores are as effective as SSEARCH at low error rates and more effective at higher error rates. Global alignment of best locally-aligned subsequence was significantly less effective in this capacity. The estimation of statistical significance was shown to give similar results to the estimations of SSEARCH and BLAST, providing confidence in the method. This work provides a database-independent method of assessing the significance of pair-wise sequence global alignment scores. Software to apply the statistics to any alignment is available from http://barton.ebi.ac.uk.
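
A minimal sketch of the underlying Z-score construction (not the authors' code) is shown below: the score of the real pair is standardized against scores of shuffled copies of one sequence, which preserve length and composition. The toy scoring function is a stand-in for a real global alignment.

# Z-score of an alignment score against a shuffled-sequence background.
import random

def toy_global_score(a, b):
    """Placeholder alignment score: identities over ungapped overlap."""
    return sum(x == y for x, y in zip(a, b))

def z_score(seq1, seq2, shuffles=100, seed=1):
    rng = random.Random(seed)
    real = toy_global_score(seq1, seq2)
    bg = []
    for _ in range(shuffles):             # shuffling preserves composition
        s = list(seq2)
        rng.shuffle(s)
        bg.append(toy_global_score(seq1, ''.join(s)))
    mean = sum(bg) / len(bg)
    sd = (sum((x - mean) ** 2 for x in bg) / (len(bg) - 1)) ** 0.5
    return (real - mean) / sd if sd else 0.0

print(z_score('MKVLAAGITGAA', 'MKVLSAGITGSA'))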


103. Sequence and Structure Conservation Patterns of the Gelsolin Fold (up)
Benyaminy Hadar, Graduate student;
Wolfson Haim, Nussinov Ruth, Professor;
hadar1@tau.ac.il
Short Abstract:

 The gelsolin family proteins are involved in actin cytoskeleton remodeling and can also form amyloids. We analyzed sequence and structural conservation patterns of the protein. We describe a subset of conserved residues, largely of beta structure. These are likely to be responsible for stability (and function) of the gelsolin fold. 

One Page Abstract:

 The gelsolin family consists of actin binding proteins that are involved in remodeling the actin cytoskeleton and can also form amyloids. The family shares a repeated motif of 125-150 residues that is found in a wide range of phyla as either three or six repeats. The repeats share low sequence homology but are similarly folded. Using a novel multiple structure comparison algorithm, the coordinate files that represent the structural diversity of the different domains of gelsolin were subjected to sequence order independent multiple structure alignment. A common structural core of 38 amino acids was found, capturing the common topologically conserved positions of the fold. The sequences of the aligned structures were used to initiate iterative PSI-blast searches of the nrdb (nonredundant database compilation). After clustering and filtering short sequences, a final large and diverse (average pairwise percent identity of 20%) database of 270 sequences was constructed. Structural and sequential patterns were combined. The highest conservation values were found for a group of hydrophobic (some of which are aromatic) residues populating a common central beta hairpin (strands C and D). These conserved residues are likely to be responsible for stability (and function) of the gelsolin fold. 
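
As an illustration of how per-position conservation values of this kind can be computed from an alignment, the sketch below uses normalized Shannon entropy on a toy alignment; it is not the analysis pipeline used in this work.

# Per-column conservation from an alignment via normalized entropy.
import math
from collections import Counter

def conservation(column):
    """1 = fully conserved, 0 = maximally variable (gaps ignored)."""
    col = [c for c in column if c != '-']
    if not col:
        return 0.0
    counts = Counter(col)
    h = -sum((n / len(col)) * math.log2(n / len(col)) for n in counts.values())
    return 1.0 - h / math.log2(20)         # normalize by 20 amino acids

alignment = ['WETLFG', 'WDTIFG', 'WESLYG', 'WDALFG']
for i, col in enumerate(zip(*alignment)):
    print(i, ''.join(col), round(conservation(col), 2))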


104. Finding all protein kinases in the human genome (up)
Gerard Manning, Glen Charydczak, David Whyte, Sean Caenepeel, Ricardo Martinez, Sucha Sudarsanam, SUGEN, Inc.;
gerard-manning@sugen.com
Short Abstract:

 We used profile HMMs, ab initio gene finding, homology and ESTs to predict all protein kinases in the human genome, and used domain homology to classify and organise the genes into families. We compare our predictions with those of Celera and Ensembl, and with the kinases of fly, worm and yeast.

One Page Abstract:

 We have used a combination of automated gene finding methods and manual analysis to predict full-length sequences for all human protein kinases. Profile HMMs were used to efficiently predict kinase catalytic domains in initial single-read genomic sequences and ESTs, with a very low error rate. To predict longer sequences in genomic assemblies, we used a mixture of ab initio gene prediction (Genscan), protein homology (Genewise and Blast) and mapping of ESTs to the genomic region. In most cases, some amount of manual curation was needed for optimal predictions, due to specific weaknesses of each prediction technique and the imperfect nature of genomic assemblies.

We found that ~20% of kinase sequences appear to be pseudogenes, with single exons and multiple stops and frameshifts in the sequence. We compare our predictions with those of Celera and Ensembl.

We have mapped all kinase genes to chromosomal bands, and searched for genes linked to particular disease loci or cancer amplicons. 

Comparison of all human kinase domains and those of yeast, worm and fly genomes reveals the presence of several new conserved groupings of kinases, and a putative orthology mapping between many novel human kinases and their model organism counterparts. Genomic comparison also shows specific expansions of sub-groups in the different lineages. 


105. Remote Homology Detection Using Significant Sequence Patterns (up)
Florian Sohler, Alexander Zien, GMD - National Research Center, Skt. Augustin, Germany;
florian.sohler@gmd.de
Short Abstract:

 We present a new method to detect remote homologs for proteins. To score a candidate protein, the frequency of short but significant patterns in the sequence is used. With this scoring scheme and support vector machines we can classify proteins into their SCOP superfamilies better than BLAST.

One Page Abstract:

 We present a new method to detect remote homologs for proteins using short but significant sequence patterns. The goal is to build models for protein classes that enable us to correctly classify new query proteins.

 Recently it has been proposed to use probabilistic suffix trees to model protein families. Probabilistic suffix trees can be viewed as variable-order Markov models, and thus they are able to model short conserved patterns. In contrast to other widely used models such as profile hidden Markov models (HMMs) or sequence profiles, no alignment information is used to create the model, and no alignment is performed to score candidate sequences. The main advantage of probabilistic suffix trees is their speed: training can be performed in time linear in the size of the input sequences, and scoring is linear in the sequence length as well. Unfortunately, in our experience, probabilistic suffix trees only work well for closely related proteins. There are at least two possible reasons for this. The first is that they do not explicitly model amino acid substitutions, insertions or deletions. The other possibility is that distant homologs cannot be found without using alignment information.

To allow for some amino acid substitutions we cluster the amino acids into groups like 'hydrophobic', 'polar' etc. and use patterns of this reduced alphabet instead of the amino acid alphabet.

We use suffix trees to find patterns that appear with significant frequency in a given class of proteins, but then apply more involved machine learning tools to build a model from these patterns. To score the significance of a given motif, we count the number of appearances of that motif in the protein class and compare that number to the expected number of occurrences under a simple probabilistic model. If this significance score is above a certain threshold, we accept the corresponding pattern into the list of significant patterns. Since a very long (and thus specific) pattern will appear significant even if it occurs only once, we also require each pattern to occur more often than a given minimum number of times. We therefore have two parameters to tune the sensitivity and specificity of the patterns chosen by our algorithm. If a pattern appears in 90% of all training sequences, it is expected to appear in most of the unknown sequences belonging to that class as well. On the other hand, if a pattern is so specific that it appears almost exclusively in the training set, unclassified sequences containing that pattern will probably belong to that class. The length of the patterns found this way is typically between five and ten. The number of patterns can vary between 50 and several hundred, depending on the training set and the given parameters.

To build a classifier from our list of significant patterns we use support vector machines with the 'Radial Basis Functions' kernel. The features for a sequence are simply the frequencies of each of our significant patterns normalized with respect to a simple null model.
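
A compact sketch of this final stage, with invented patterns and sequences, could look as follows (scikit-learn's RBF-kernel SVM is used here purely for illustration):

# Feature vectors of normalized pattern frequencies fed to an RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC

patterns = ['GKS', 'HRD', 'DFG', 'WEV']        # toy "significant patterns"

def features(seq, null_freq=1e-3):
    """Pattern counts per residue, normalized by a simple null expectation."""
    return np.array([seq.count(p) / (len(seq) * null_freq) for p in patterns])

train_seqs = ['MGKSHRDLDFGKV', 'AGKSHRDMDFGTR', 'WEVPLKWEVQSTM', 'MWEVAGHKLWEVD']
labels = [1, 1, 0, 0]                          # 1 = member of the class
x = np.vstack([features(s) for s in train_seqs])

clf = SVC(kernel='rbf', gamma='scale').fit(x, labels)
print(clf.predict(features('QGKSPHRDRDFGL').reshape(1, -1)))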

We evaluate our method by trying to predict SCOP superfamilies in a simple cross-validation protocol. As a training set for a superfamily classifier we take one family away from the superfamily which will be our test set later. The remaining families, and optionally additional Blast hits, we use for training. Sequences of all other superfamilies are also divided up into training and test set.

Results show that our method does surprisingly well, demonstrating that remote homologs can be detected without computing alignments. The algorithm clearly outperforms Blast and is almost competitive with HMMs. On families that are hard for HMMs to classify, its performance is comparable, while on easier families more false positives are produced. This suggests that a combination of alignment-based methods and our new method can improve prediction performance significantly.

 References:

T. Jaakkola, M. Diekhans, D. Haussler: A Discriminative Framework for Detecting Remote Protein Homologies. JCB, 2000, Vol. 7, no. 1/2, pp. 95-114.

G. Bejerano, G. Yona: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 2001, Vol. 17, no. 1, pp. 23-43.

A. Apostolico, M. E. Bock, S. Lonardi, X. Xu: Efficient detection of unusual words. JCB, 2000, Vol. 7, no. 1/2, pp. 71-94.

C. J. C. Burges: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998, Vol. 2, pp. 121-167.


106. TRIBE-MCL: A Novel Algorithm for accurate detection of protein families (up)
Anton Enright, European Bioinformatics Institute;
Stijn Van Dongen, CWI, Amsterdam, Netherlands.;
Christos A. Ouzounis, European Bioinformatics Institute;
enright@ebi.ac.uk
Short Abstract:

 We present a novel method for clustering proteins into families based on sequence similarity information. This method uses Markov clustering to classify proteins into families with extremely high accuracy. The method is not led astray by the usual pitfalls of this type of analysis, such as promiscuous domains and protein fragments.

One Page Abstract:

 Detection of protein families in complete genomes is a valuable method in functional genomics. Members of a protein family should possess equivalent functional roles in the cell: if one knows the function of one member of a family, it should be possible to transfer this function to other members whose functions may not be known. Generally, protein families are detected by clustering proteins together based on their sequence similarity. Many methods exist for this type of analysis; however, most of them are not fully automatic and rely on manual intervention for the correct detection of valid protein families. Other automatic methods fail to correctly detect protein families in complex eukaryotic datasets due to the presence of multi-domain proteins and proteins which contain a promiscuous domain. Previously we developed a method called GeneRAGE for protein family analysis in bacteria. This method could not realistically be extended to higher eukaryotes, such as the human genome, due to the computation time required to break down the complex modular domain structure of many eukaryotic proteins. To this end we have developed a novel method for protein sequence clustering based on the Markov Clustering (MCL) algorithm. This is a purely probabilistic approach which can automatically and accurately cluster proteins into families based on sequence similarity alone, without explicit knowledge of protein domains. We first represent biological sequence similarities as a graph, with nodes representing proteins and weighted edges representing the sequence similarity scores which connect them. The MCL algorithm calculates random walks through this graph, and uses two mathematical operators to model flow within it. Because members of a protein family are generally more similar to each other than to members of other (related or unrelated) families, flow within a family is higher than flow between families (i.e. through a common or promiscuous domain). The algorithm models these tidal forces of flow through the graph until equilibrium is reached, and then calculates a clustering based on the observed patterns of flow. We have tested the algorithm extensively using the INTERPRO, SCOP and SWISSPROT databases, and have observed very high accuracy in the assignment of proteins to valid protein families. Recently we used this algorithm to produce the protein family information for the draft human genome in the Ensembl 080 release. This analysis involved the clustering of over 100,000 proteins into 13,000 families, and took a little over six hours to complete on a small workstation. Validation using databases such as SCOP and INTERPRO indicates that the method performs with an accuracy of >90%. We believe that this method will be extremely useful for protein family analysis and functional genomics.
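
For readers unfamiliar with MCL, the two operators can be sketched in a few lines of numpy; the similarity matrix below is a toy, not real BLAST output, and TRIBE-MCL itself adds further pre- and post-processing.

# A compact sketch of the MCL expansion and inflation operators.
import numpy as np

def mcl(m, inflation=2.0, iters=50):
    m = m / m.sum(axis=0)                       # make columns stochastic
    for _ in range(iters):
        m = m @ m                               # expansion: flow spreads
        m = m ** inflation                      # inflation: strong flow wins
        m = m / m.sum(axis=0)
    return m

# Two obvious groups: proteins 0-2 and 3-4 (self-loops keep flow local).
sim = np.array([[1, 9, 8, 0, 0],
                [9, 1, 7, 0, 0],
                [8, 7, 1, 1, 0],
                [0, 0, 1, 1, 9],
                [0, 0, 0, 9, 1.]])
result = mcl(sim)
print(np.round(result, 2))   # rows with non-zero mass mark cluster attractors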


107. In Silico Analysis of Bacterial Virulence Factors: Redefining Pathogenesis (up)
Kelly Paine, Edward Jenner Institute for Vaccine Research;
kelly.paine@jenner.ac.uk
Short Abstract:

 A virulence factor is any agent produced by a pathogen that is essential for causing disease in a host. Bacterial protein virulence factors have attracted great interest as targets for antimicrobial research. We have been utilising protein fingerprinting methods to characterise such moieties and redefine the meaning of pathogenesis.

One Page Abstract:

 There has been a recent surge in the number of completed bacterial genome sequences, and with this explosion of data comes the need to discover novel targets for antimicrobial research. A synergistic interaction between the well-established science of bacteriology and the emergent discipline of bioinformatics should provide tools for such a task. Bacterial resistance to conventional antibiotics is on the increase and, combined with other factors such as the prevalence of HIV, is proving costly in terms of both money and human lives. Even the strongest drugs are now useless against some strains of species such as Staphylococcus aureus.

 Pathogenic mechanisms can spread quickly through a bacterial population via lateral transfer, and, as most virulent bacteria rely on the presence of these "virulence factors" to infect a human host, they must be considered essential for pathogenicity. It is these genes that will provide the novel targets required for future research into new drugs and vaccines. Bioinformatics can aid in this process; the key advantage of computer-based screening techniques is the speed at which the identification and selection of targets can be done. Testing predictions of how a protein may act in vivo, or of what sort of immune response will be elicited by a virulence factor carefully selected through database mining and gene expression profiling, has in the past fallen mainly to the more conventionally trained biologist.

A recognised and powerful method of characterising new protein families is to use conserved regions between multiple alignments of proteins. Each homologous region is a "motif", and sets of motifs provide a signature or fingerprint for unique identification. We have been using this method to characterise novel virulence factor protein families, in collaboration with the PRINTS group at the University of Manchester, UK. Families already analysed include components of the Gram-negative enteropathogenic type three secretion system, the Bacillus anthracis anthrax toxin, and Escherichia coli haemolysin.
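
Purely as an illustration of fingerprint-style matching (not the PRINTS software), the sketch below requires several toy motifs to occur in order along a sequence:

# Fingerprint = ordered set of motifs; a sequence matching enough of them
# in order is flagged as a family member. Motifs and sequence are invented.
import re

fingerprint = [r'C..C', r'G[DE]SG', r'H...H']   # toy motif regexes, in order

def fingerprint_hits(seq, motifs, min_hits=2):
    pos = 0
    hits = []
    for motif in motifs:                        # motifs must occur in order
        m = re.search(motif, seq[pos:])
        if m:
            hits.append((motif, pos + m.start()))
            pos += m.end()
    return hits if len(hits) >= min_hits else None

seq = 'MKCAVCLLGESGAKTWHILMHQ'
print(fingerprint_hits(seq, fingerprint))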


108. Relationships between structural conservation and dynamical properties in the PAS family proteins. (up)
Laura Bonati, Alessandro Pandini, Demetrio Pitea, Dipartimento di Scienze dell'Ambiente e del Territorio, Università degli Studi di Milano-Bicocca;
laura.bonati@unimib.it
Short Abstract:

 To obtain information about Ah receptor binding, we applied sequence analysis tools to the multiple alignment of AhRs from different species, and propose a new protocol to correlate the dynamical properties of the 3D structures of reference PAS proteins, obtained by MD simulations, with the information on structural conservation within the family.

One Page Abstract:

 By applying structure prediction and homology modelling methodologies, we previously developed [1,2] a three-dimensional model of the ligand binding domain of the mouse Aryl hydrocarbon Receptor (mAhR); this is a member of the PAS (Per-ARNT-Sim) family of transcriptional regulatory proteins. The crystal structures of the three PAS domains used as templates (the bacterial photoactive yellow protein PYP, the human potassium channel HERG and the bacterial oxygen-sensing FixL protein) reveal a highly conserved structural framework. Despite the low sequence identity of mAhR with the templates, this high structural conservation allowed us to develop a suitable model, based on the combination of sequence and secondary structure information. On this basis we are studying [3] the binding process of mAhR with PolyChlorinated Dibenzo-p-Dioxins (PCDD), a class of ligands of environmental interest, by using Molecular Dynamics simulations to refine the mAhR model, molecular docking techniques to identify the residues directly interacting with PCDDs, and hybrid QM/MM methodologies to obtain relative binding energies for a series of PCDDs. However, the modelling procedure has also suggested the possibility of obtaining more information about binding from the sequences of the other Ah receptors included in the multiple alignment, as well as the need for a more accurate analysis of the molecular basis of the high structural conservation in the PAS family. Here we present an application of sequence analysis tools to the multiple alignment of Ah receptors from different species. The differences in the response of these proteins to PCDDs, and the conservation of some residues in their ligand binding domains with respect to mAhR, highlight key amino acids important for dioxin binding. Moreover, we propose some tools to analyse the dynamical properties of the three-dimensional structures of the reference PAS proteins, obtained by MD simulations, and to correlate them with the information on structural conservation within the family. Based on the idea that physical information derived from molecular modelling of protein conformations can contribute substantially to understanding evolutionary conservation, these tools may constitute a new general protocol for correlating evolutionary information with structural dynamical behaviour in a family of proteins.
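
One simple way to realise the proposed correlation, sketched with synthetic data below, is to compare per-residue RMSF from a trajectory with per-position conservation scores; the trajectory, mobilities and scores are all invented.

# Per-residue RMSF from a trajectory, correlated with conservation scores.
import numpy as np

def rmsf(traj):
    """traj: (frames, residues, 3) coordinates -> per-residue fluctuation."""
    mean_pos = traj.mean(axis=0)
    return np.sqrt(((traj - mean_pos) ** 2).sum(axis=2).mean(axis=0))

rng = np.random.default_rng(42)
n_res = 120
mobility = rng.uniform(0.2, 1.5, n_res)               # hidden ground truth
traj = rng.normal(scale=mobility[None, :, None], size=(500, n_res, 3))
conservation = 1.0 / (1.0 + mobility) + rng.normal(scale=0.05, size=n_res)

r = np.corrcoef(rmsf(traj), conservation)[0, 1]
print(f'RMSF vs conservation correlation: {r:.2f}')   # expect negative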

 1) M. Procopio, A. Lahm, A. Tramontano, L. Bonati, D. Pitea, "Homology modeling of the AhR ligand binding domain", Organohalogen Compounds (1999) 42, 405-408.

2) M. Procopio, A. Lahm, A. Tramontano, L. Bonati, D. Pitea, "A Model for Recognition of PCDDs by the Aryl Hydrocarbon Receptor", Proteins, submitted.

3) L. Bonati, A. Pandini, D. Pitea, L. De Gioia, P. Fantucci, "Computational investigation of the PolyChlorinatedDibenzo-p-Dioxins - Ah receptor interaction: structure prediction of the ligand binding domain and molecular docking", Italian Journal of Biochemistry (2000) 49, 65. 


109. Classifying G-protein coupled receptors with support vector machines (up)
Rachel Karchin, Dr. Kevin Karplus, Dr. David Haussler, University of California, Santa Cruz;
rachelk@cse.ucsc.edu
Short Abstract:

 We discuss the relative merits of various automated methods for recognizing GPCRs: BLAST, hidden Markov models and support vector machines (SVMs). Our experiments show that, for those interested in annotation-quality classification, SVMs are worth the effort. We have set up a web server for SVM GPCR subfamily classification at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass.

One Page Abstract:

 The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-protein coupled receptors (GPCRs), a superfamily of cell membrane signalling proteins. GPCRs are the focus of a significant amount of current pharmaceutical research because they play an important role in many diseases. However, their structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile hidden Markov model, and methods, including support vector machines, that transform protein sequences into fixed-length feature vectors. The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the minimum error point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vectors (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN. We have set up a web server for GPCR subfamily classification based on hierarchical multi-class SVMs at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. By scanning predicted peptides found in the human genome with the SVMtree server, we have identified a large number of genes that encode GPCRs. Although most of these were previously annotated, one appears to be novel as of our scan date in May 2001: an olfactory receptor on chromosome 1. A list of our predictions for human GPCRs is available at http://www.soe.ucsc.edu/research/compbio/gpcr_hg/class_results. We also provide suggested classifications for 16 sequences previously identified as GPCRs but unclassified in GPCRDB.


110. Domain-finding with CluSTr: Re-occurring motifs determined with a database of mutual sequence similarity (up)
Evgenia V. Kriventseva, Steffen Möller, Rolf Apweiler, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK;
{zhenya,moeller}@ebi.ac.uk
Short Abstract:

 This work makes use of the pairwise comparison data stored for the CluSTr project, which allows users to analyse sequence matches. Known InterPro protein signatures are used to separate the well-known domains from the uncharacterised ones. This facilitates a bootstrap approach to discovering new protein domains.

One Page Abstract:

 The CluSTr project (http://www.ebi.ac.uk/clustr/) provides an automatic classification of proteins. The classification is determined according to the pairwise Smith-Waterman similarity scores, normalised by randomisation to derive a Z-score.

The CluSTr database information on the protein clusters and the underlying similarity matrix are stored in a relational database. This work uses the pairwise comparison data, which also underlie the definition of the clusters. Here we present a method to display those regions of sequences that are most often found to be similar to other proteins, as a function of both the location on the protein sequence and the Z-score.
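
A minimal sketch of such a per-position display, with an invented hit list, is given below: each residue is scored by the number of pairwise matches above a Z-score cutoff that cover it.

# Per-residue coverage by pairwise similarity hits above a Z-score cutoff.
import numpy as np

def coverage_profile(length, hits, z_cutoff=10.0):
    profile = np.zeros(length, dtype=int)
    for start, end, z in hits:                 # hit: 1-based inclusive region
        if z >= z_cutoff:
            profile[start - 1:end] += 1
    return profile

hits = [(1, 60, 25.0), (5, 58, 14.2), (80, 150, 30.1), (70, 150, 8.0)]
profile = coverage_profile(160, hits)
print('residues in frequently matched regions:', int((profile >= 2).sum()))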

Additional context is offered by the visualisation of matches to the InterPro (http://www.ebi.ac.uk/interpro/) member databases and position-dependent sequence annotation from the SWISS-PROT FT lines. Regions in sequences of special interest can be specified to be automatically retrieved for further analysis, which facilitates a bootstrap approach to determine new protein domains.

Another useful feature of this approach is that it helps to overcome an inherent limitation of algorithms for determining local sequence similarity: these either focus on an area of maximum similarity, and thereby ignore remaining similarities, or lose specificity. As a consequence, in multi-domain proteins some shared domains may be missed as regions of pairwise sequence similarity. Since an omitted domain may also occur independently of the ones found, the respective regions will still be highlighted, because the database of sequence similarities contains similarities between any two proteins.

The local sequence similarity together with the clustering of protein sequences should be a very interesting aid in the hunt for new protein domains, especially within the context of the most important information from SWISS-PROT/TrEMBL and InterPro. Protein clusters for sequences of completely sequenced eukaryotes for which no InterPro domains were found can be accessed from the Proteome Analysis pages (http://www.ebi.ac.uk/proteome/).

1. Kriventseva E. V., Fleischmann W., Zdobnov E., Apweiler R.: CluSTr: a database of Clusters of SWISS-PROT+TrEMBL proteins. Nucl. Acids Res. 2001, 29(1):33-36.

2. Apweiler R., Attwood T. K., Bairoch A., Bateman A., Birney E., Biswas M., Bucher P., Cerutti L., Corpet F., Croning M. D. R., Durbin R., Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N., Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R., Marx B., Mulder N. J., Oinn T. M., Pagni M., Servant F., Sigrist C. J. A., Zdobnov E. M.: InterPro - An integrated documentation resource for protein families, domains and functional sites. Nucl. Acids Res. 2001, 29(1):37-40.

3. Apweiler R., Biswas M., Fleischmann W., Kanapin A., Karavidopoulou Y., Kersey P., Kriventseva E. V., Mittard V., Mulder N., Phan I., Zdobnov E.: Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucl. Acids Res. 2001, 29(1):44-48.

4. Fleischmann W., Möller S., Gateau A., Apweiler R.: A novel method for automatic functional annotation of proteins. Bioinformatics 1999 Mar;15(3):228-33.


111. PFAM domain distributions in the yeast proteome and interactome (up)
Christian Ahrens, Christoph Michetschläger, Andrei Grigoriev, GPC Biotech AG;
christian.ahrens@gpc-biotech.com
Short Abstract:

 When comparing the distributions of PFAM domains in the yeast proteome and interactome, we found cell signalling and protein-protein interaction domains occurring with a higher frequency in the interactome. The analysis of their co-occurrences within one protein or within interaction pairs reveals certain preferred domain combinations. Possible functional implications will be discussed.

One Page Abstract:

 The proteome of budding yeast (Saccharomyces cerevisiae) as defined by the Saccharomyces Genome Database (6311 proteins) and the interactome (defined by several large-scale protein-protein interaction datasets) were analysed for the presence of PFAM-A domains, using the HMMER 2.0 hidden Markov model software and the PFAM-A library of HMMs (v6.0). Several PFAM domain families occur with a higher frequency in the interactome, including domains involved in cell signalling and protein-protein interaction. In addition, the frequencies of co-occurrence of PFAM domains within one protein and within interaction pairs were determined, and preferred domain combinations could be identified in either dataset. The results of these analyses and possible functional implications will be discussed.
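
The co-occurrence counting is straightforward to sketch; the protein-to-domain assignments and interaction pairs below are invented, not SGD data.

# Count unordered domain-pair co-occurrences across interaction pairs.
from collections import Counter
from itertools import product

domains = {'YAL001C': ['SH3_1', 'Pkinase'],
           'YBR112W': ['WD40'],
           'YDR002A': ['SH3_1'],
           'YGL115C': ['Pkinase', 'WD40']}
interactions = [('YAL001C', 'YBR112W'), ('YDR002A', 'YGL115C'),
                ('YAL001C', 'YGL115C')]

pair_counts = Counter()
for p1, p2 in interactions:
    for d1, d2 in product(domains[p1], domains[p2]):
        pair_counts[tuple(sorted((d1, d2)))] += 1   # unordered domain pair

for pair, n in pair_counts.most_common():
    print(pair, n)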


112. Identifying Protein Domain Boundaries using Sequence Similarity for Structural Genomics Target Selection (up)
Gulriz Aytekin-Kurban, Terry Gaasterland, Rockefeller University;
gulriz@frida.rockefeller.edu
Short Abstract:

 The method, CLUE, predicts putative domain boundaries on a protein using pairwise alignments of the protein sequence with all available proteins, computed with PSI-BLAST. It can identify domains on sequences that cannot be classified into existing structural and functional domain families. We evaluated the method by comparing the resulting boundaries to structural domain boundaries.

One Page Abstract:

 The Structural Genomics Initiative seeks to solve three-dimensional (3D) protein structures for as many distinct new folds as possible. These structures will in turn increase the number of computationally modeled 3D protein structures. Achieving this goal requires that candidate structure targets be selected from all available proteins such that the likelihood that a new structure will reveal a new fold is maximized. A prerequisite for target selection is the reliable identification of structural domain boundaries in proteins with no known structure. Once domains are established, the corresponding sequences can be clustered into domain families. The domain families can be prioritized according to the likely efficacy of high-throughput structure determination, and according to whether or not they have member proteins of known structure. This paper introduces and evaluates a new method for predicting protein domain boundaries at large scale across complete genomes. The method was applied to proteins from 12 genomes and to proteins in the PDB. For the proteins already in the PDB, the resulting domain boundaries are compared with structural domain boundaries from the SCOP and CATH databases.

The method introduced here, called CLUE, uses pairwise sequence alignments of a query protein with all available proteins computed with the alignment tool psi-blast. A sliding window scoring function is applied across the query sequence to identify regions with a coalescence of internal alignment boundaries, especially boundaries that include an N-terminal or C-terminal end of the aligned (target) sequence. The output of the scoring function is evaluated automatically to identify the best candidate domain boundaries in the query protein. The output of the procedure is a list of best predicted domain boundary positions and subsequences of the query protein.
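
A toy version of the sliding-window scoring might look like the following, where alignment endpoints on the query (invented here) are accumulated with extra weight when they coincide with a target's N- or C-terminus; the window size and weights are illustrative only.

# Score query positions by nearby alignment endpoints, weighting terminal ones.
import numpy as np

def boundary_score(length, endpoints, window=15, terminal_weight=2.0):
    score = np.zeros(length)
    for pos, is_terminal in endpoints:         # alignment start/stop on query
        w = terminal_weight if is_terminal else 1.0
        lo, hi = max(0, pos - window), min(length, pos + window + 1)
        score[lo:hi] += w
    return score

endpoints = [(148, True), (150, True), (152, False), (310, False)]
score = boundary_score(400, endpoints)
print('best candidate boundary near residue', int(score.argmax()))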

The main strength of CLUE is that it can identify putative domain boundaries on sequences that have local sequence similarity to a set of proteins, yet cannot be classified into existing structural and/or functional domain families. Although it may sometimes perform worse than the existing classifier methods for well-known protein families, it is a valuable method for the subset of proteins where new domain families have yet to be discovered. We use CLUE as the first step to divide sequences into domains before building domain families. However, CLUE works for any arbitrary query sequence; it can be integrated into a sequence annotation system such as MAGPIE without a need for building families. 

CLUE was evaluated by comparing predicted domain boundaries on every PDB sequence to the structural domain boundaries computed by the SCOP and CATH methods. For each structural domain family, we counted the number of instances where the boundary of the domain on a sequence had a predicted domain boundary within a distance of less than 30 amino acids. We excluded the cases where the domain boundary occurred at the N- or C-terminal end; the remaining cases were internal domain boundaries. Either the begin or the end position of a domain can be inside a sequence while the other is at a terminal end of the sequence; the number of cases among PDB sequences where both boundaries of a domain were internal to a sequence was very small. Therefore, two different counts, for internal domain begin and end positions, were computed. For each domain family, we calculated the percentage of internal instances with a predicted domain boundary, and averaged these percentages across all families to measure overall performance. CLUE predicts on average 66% of the begin positions of the instances of a SCOP domain family internal to a sequence, and 65% of the instances of internal end positions. For CATH domain families, the averages are 52% and 56%, respectively.

The method presented here is efficient and scalable. It can be applied to any protein and does not require the construction of domain families for accurate structural domain boundary predictions. CLUE has been implemented with a web interface that serves predicted domain boundaries for proteins across whole genomes; boundaries for proteins from the initial 12-genome dataset are available at genomes.rockefeller.edu/CLUE.


113. Comparative study of in vitro and in vivo protein evolution. (up)
Vadim P. Valuev, Dmitry A. Afonnikov, Dmitry A. Grigorovich, Nikolay A. Kolchanov, Institute of Cytology and Genetics SB RAS;
valuev@bionet.nsc.ru
Short Abstract:

 In amino acid composition, the in vitro evolved proteins deviate from native ones and more closely follow the codon degeneracy; amino acid interchanges generally resemble those in native proteins, best matching families under strong functional constraints. The study of pairwise correlations allowed some insight into the processes determining the structure-functional integrity of proteins. http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/

One Page Abstract:

 In the early 1990s a stream of experiments began applying techniques of in vitro protein evolution. This process involves sieving large pools of molecules (up to 10^9 individual members) through several consecutive rounds of selection and amplification to finally retrieve the molecules that most strongly show the desired property (Roberts and Ja, 1999). This technique, with various improvements, has been applied to a number of goals, including selection for thermodynamically more stable proteins (Gu et al., 1995; Kim, D.E. et al., 1998; Braisted, A.C. and Wells, J.A., 1996), engineering of proteins with new or improved enzymatic activities (Baca, M. et al., 1997; Fujii, I. et al., 1998; Widersten, M. and Mannervik, B., 1995), mapping epitopes and binding sites (Zozulya et al., 1999; Castano, A.R. et al., 1995), finding substrates for enzymes (Matthews, D. and Wells, J.A., 1993; Matthews, D. et al., 1994), selecting antibodies (Clackson, T. et al., 1991; Vaughan, T.J. et al., 1998), and finding small peptide mimetics for large protein molecules (Wrighton, N. and Gearing, D., 1999). We have compiled a database, ASPD (Artificially Selected Proteins/Peptides Database), storing the published results of phage display experiments. The first release, ASPD 1.0, contains information on 120 experiments. A database entry corresponds to a set of peptides or proteins selected against one target; generally they contain some common motif and can be aligned. Each entry contains a description of the scaffold and target for selection, links to the SwissProt, PDB, Prosite and Enzyme databases, and the aligned set of sequences retrieved through phage display. ASPD is SRS-formatted and can be accessed at http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/. The amino acid composition of the in vitro evolved proteins (only amino acids retrieved via evolution were taken into account; positions which had not been randomized were ignored), compared to that of SwissProt, shows the following trends: the most overrepresented amino acids in ASPD are tryptophan (whose percentage in ASPD exceeds that in SwissProt more than threefold), tyrosine and arginine; the most underrepresented are lysine, glutamic acid and valine. Thorough examination of the distribution makes it clear that the overall amino acid composition of ASPD follows the number of codons for each amino acid much more closely than it follows the composition of SwissProt. A second effect, superimposed on the codon frequencies, is a preference for hydrophilic amino acids. This preference may be due to a bias intrinsic to phage display experiments, where selection is often made for amino acids that form part of active sites. The observation about the codon frequencies, though very simple, suggests something important about native protein evolution: that the sequences of native proteins are determined largely by their evolutionary history and not by functional requirements alone. This is illustrated by the fact that phage display has retrieved small mimetics for large protein molecules (such as erythropoietin) that have no sequence similarity with them (Wrighton et al., 1996; Wrighton and Gearing, 1999). We have also calculated the amino acid similarity matrix for our database; it shows the greatest correlation, of about 80%, with matrices of the BLOSUM family.
Its application in homology searches suggests that it is best suited to cases where protein evolution is restricted by strong functional constraints to yield exactly isofunctional proteins. Each entry in the ASPD database was analyzed for the presence of pairwise correlations in terms of four amino acid properties: volume (Chothia, 1984), hydrophobicity (Eisenberg et al., 1984), isoelectric point (White et al., 1978) and polarity (Ponnuswamy et al., 1980). We have revealed a number of clusters of correlated positions, which correspond to structurally important regions of proteins. Such clusters were found on turns, where negative correlations in volume and both positive and negative correlations in isoelectric point were observed, and within the core, where correlations were mostly in hydrophobicity, but in isoelectric point and polarity as well. These clusters were not found in families of native proteins. Additional information is available at http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/. This work was supported by RFBR grant No. 00-04-49229 and its supplement No. 01-04-06240. VV is also an INTAS PhD fellow (YS-00-177).
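
The composition-versus-codon-degeneracy comparison is easy to reproduce in outline; in the sketch below the selected sequences are invented, while the codon counts per amino acid are those of the standard genetic code.

# Compare amino acid frequencies in a selected set to codon degeneracy.
from collections import Counter

codons = {'A': 4, 'R': 6, 'N': 2, 'D': 2, 'C': 2, 'Q': 2, 'E': 2, 'G': 4,
          'H': 2, 'I': 3, 'L': 6, 'K': 2, 'M': 1, 'F': 2, 'P': 4, 'S': 6,
          'T': 4, 'W': 1, 'Y': 2, 'V': 4}

selected = ['WRYHGSWK', 'YWRNGTSW', 'RWYSHQGW']     # toy phage-display picks
counts = Counter(''.join(selected))
total = sum(counts.values())

for aa in sorted(codons, key=codons.get, reverse=True):
    observed = counts.get(aa, 0) / total
    expected = codons[aa] / 61.0                    # 61 sense codons
    print(f'{aa}: observed {observed:.3f} expected-by-code {expected:.3f}')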


114. DART: Finding Proteins with Similar Domain Architecture (up)
LY Geer, M Domrachev, DJ Lipman, SH Bryant, NCBI/NLM/NIH;
lewisg@ncbi.nlm.nih.gov
Short Abstract:

 The Domain Architecture Retrieval Tool (DART) identifies proteins with similar domain composition. Domains in a query sequence are identified by a sensitive profile search. Proteins with similar domain architectures are retrieved and listed in ranked order. DART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps

One Page Abstract:

 The Domain Architecture Retrieval Tool (DART), hosted by NCBI, performs similarity searching of proteins based on their domain architecture. The goal is to find protein similarities using consistent and sensitive protein domain profiles, rather than solely by sequence similarity. DART has been designed to be fast and informative. The underlying algorithm is based on domain annotation of a significant subset of all publicly known protein sequences through the use of Reverse-PSI-BLAST (RPS-BLAST) [1] and protein domain databases, including SMART [2] and Pfam [3]. 

Given a protein sequence, DART runs RPS-BLAST and displays the protein using a "beads on a string" style. DART then displays a ranked, graphical list of proteins with similar sets of domains. Ranking is done by the number of unique hits to domains that are the same or redundant to the domains in the query sequence. The query can be refined taxonomically or by selecting domains of interest. DART is linked to CD-Search (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi), which also uses RPS-BLAST and can display the domain profile alignment in greater detail. 

To create the databases underlying DART, all the sequences in the NCBI non-redundant database (nr) [4] are aligned to Pfam, SMART, and other domain databases using RPS-BLAST. These alignments are sorted by sequence and by domain.

Redundancy between protein domains is used in ranking and querying the sequences because domain databases contain related domains. Redundancy between two domains is defined as a significant number of overlapping hits by both domains to nr. Redundant pairs are clustered transitively to create a final list of redundant domains.
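
The ranking rule can be sketched as follows, with invented domain names, cluster assignments and database entries standing in for the real RPS-BLAST annotations:

# Rank database proteins by shared (redundancy-clustered) query domains.
redundant_cluster = {'Pkinase': 'kinase', 'Pkinase_Tyr': 'kinase',
                     'SH2': 'SH2', 'SH3_1': 'SH3'}

db = {'protA': ['Pkinase_Tyr', 'SH2', 'SH3_1'],
      'protB': ['Pkinase'],
      'protC': ['SH2', 'SH3_1'],
      'protD': ['WD40']}

def rank(query_domains, db):
    query = {redundant_cluster.get(d, d) for d in query_domains}
    scored = []
    for name, doms in db.items():
        clusters = {redundant_cluster.get(d, d) for d in doms}
        scored.append((len(query & clusters), name))
    return sorted(scored, reverse=True)

for score, name in rank(['Pkinase', 'SH2', 'SH3_1'], db):
    print(name, score)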

DART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps

[1] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res. 1997 Sep 1; 25(17): 3389-3402.

[2] Schultz J, Copley RR, Doerks T, Ponting CP, Bork P. Nucleic Acids Res. 2000 Jan 1; 28(1): 231-234.

[3] Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL. Nucleic Acids Res. 2000 Jan 1; 28(1): 263-266.

[4] Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA. Nucleic Acids Res. 2000 Jan 1; 28(1): 10-14. 


115. Statistical approaches for the analysis of immunoglobulin V-REGION IMGT data (up)
Christelle Pommié, Manuel Ruiz, Nathalie Syz, Véronique Giudicelli, LIGM Institut de Génétique Humaine;
Robert Sabatier, Laboratoire de physique Moléculaire;
Marie-Paule Lefranc, LIGM Institut de Génétique Humaine;
cpommie@ligm.igh.cnrs.fr
Short Abstract:

 IMGT, the international ImMunoGeneTics database (http://imgt.cines.fr), is an integrated information system specializing in Immunoglobulins, TcR and MHC molecules of all vertebrate species. Our aim was to define the most appropriate statistical methods for analysing IMGT sequence and structural data, useful for establishing amino acid correlations in 3D structures.

One Page Abstract:

 Owing to their fundamental role in the immune system, the Immunoglobulin (Ig) and T cell Receptor (TcR) variable domains (corresponding to the V-J-REGION and V-D-J-REGION labels in IMGT, the international ImMunoGeneTics database, http://imgt.cines.fr) have been extensively studied. Moreover, owing to the sequencing efforts of recent years, all the human Ig and TcR genes are now characterized. Analysis of the correlation between sequences, structures and specificities of the variable domains has important implications for medical research (repertoires in autoimmune diseases, AIDS, leukemias, lymphomas, myelomas), therapeutic approaches (antibody engineering), and studies of genome diversity and genome evolution. The Ig and TcR V-REGIONs represent a privileged situation owing to the conservation of their structure despite divergent sequences, and to the considerable amount of genomic, structural and functional data available. The unique IMGT numbering for Ig and TcR V-REGION sequences of all vertebrate species has been established to facilitate sequence comparison and cross-referencing between experiments from different laboratories, whatever the antigen receptor (Ig or TcR), the chain type (heavy or light chains for Ig; alpha, beta, gamma or delta chains for TcR) or the species. In the IMGT unique numbering, conserved amino acids from the FR always have the same number, whatever the Ig or TcR variable sequence and whatever the species they come from. The IMGT unique numbering has allowed the limits of the FR and CDR regions to be redefined. The FR-IMGT and CDR-IMGT lengths become in themselves crucial information, characterizing variable regions belonging to a group, a subgroup and/or a gene. FR amino acids located at the same position in different sequences can be compared without requiring sequence alignments; the same holds for amino acids belonging to CDR-IMGT of the same length. The IMGT unique numbering permits rapid correlation between protein sequences and the three-dimensional (3D) structures of Ig and TcR V-REGIONs. Standardized multi-sequence alignments obtained with the IMGT unique numbering make it possible to set up statistical analyses of the amino acid physico-chemical properties, position by position. These analyses are not only useful for studying mutations and allele polymorphisms, but are also needed to establish correlations between amino acids in the protein 3D structures and to extract new knowledge for the IMGT/PROTEIN-DB database, currently in development. As an example of our approach, we describe below the statistical analysis of the hydropathy property of the amino acids found at standardized positions of the three frameworks of the V-REGIONs of two chain types, the human immunoglobulin light chains kappa and lambda. A total of 1114 human rearranged productive Ig V-REGIONs was obtained, 585 belonging to kappa chains and 529 to lambda chains. The V-REGION nucleotide sequences were translated into amino acid sequences, and gaps and delimitations of the FR-IMGT and CDR-IMGT were created according to the IMGT unique numbering. For the V-REGIONs of each chain type, three sets were created, corresponding to FR1-IMGT (amino acid positions 1 to 26), FR2-IMGT (amino acid positions 39 to 55) and FR3-IMGT (amino acid positions 66 to 104), respectively. The six amino acid sequence sets were analyzed to obtain contingency tables containing the number of each amino acid at each position.
The statistical analysis was carried out with two different but complementary multivariate descriptive statistical analysis (MDSA) methods: correspondence (or factor) analysis and hierarchical classification (Ward's method), using the ADE-4 software. The amino acid positions of the kappa and lambda FR1-IMGT, FR2-IMGT and FR3-IMGT sets were compared, two by two, for the amino acid "hydropathy" variable class; a total of six analyses was performed. A correspondence analysis (COA in ADE-4) was applied to each set of kappa amino acid positions (from FR1-IMGT, FR2-IMGT and FR3-IMGT, respectively), and the corresponding set of lambda amino acid positions was projected onto it for the hydropathy variable. One hundred fifty-seven (79 kappa and 78 lambda) amino acid positions from 1114 Ig V-REGION sequences were analysed by the correspondence and classification methods. These methods, appropriate for the analysis of large data matrices, are particularly interesting in view of the large amount of data to be studied in IMGT. Moreover, used together, they provide different but complementary results and allow a reciprocal analysis of the data. Such an approach was feasible owing to the standardization of the amino acid positions in IMGT sequences. The statistical differences in the hydropathy variable at given amino acid positions have made it possible to define the characteristic hydropathy properties of the kappa and lambda amino acids, respectively. Conversely, the statistical resemblances between kappa and lambda positions have made it possible to identify positions where the amino acid hydropathy property may be important for the conserved structure of the Ig fold. Similar analyses with other variables (amino acid solvent accessibility, hydrogen and van der Waals bonding) and on other sets of sequences will be particularly useful for establishing correlations between amino acid positions of the Ig fold.
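To make the correspondence analysis step concrete, the following minimal Python sketch (our illustration; the authors used the COA procedure of the ADE-4 package, not this code, and the toy counts below are invented) computes the principal row coordinates of a positions-by-hydropathy-class contingency table via the standard SVD formulation:

    import numpy as np

    def correspondence_analysis(table):
        """Correspondence analysis of a positions-by-categories table.
        Rows: standardized amino acid positions; columns: hydropathy
        classes. Returns row coordinates on the principal axes and the
        inertia (variance explained) per axis."""
        P = table / table.sum()                   # correspondence matrix
        r = P.sum(axis=1)                         # row masses
        c = P.sum(axis=0)                         # column masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
        U, sig, Vt = np.linalg.svd(S, full_matrices=False)
        row_coords = (U * sig) / np.sqrt(r)[:, None]
        return row_coords, sig ** 2

    # toy example: 4 standardized positions x 3 hydropathy classes
    counts = np.array([[50., 20., 10.], [5., 30., 45.],
                       [48., 22., 10.], [6., 28., 46.]])
    coords, inertia = correspondence_analysis(counts)

Positions that separate along the leading axes are those whose hydropathy distribution differs most between the chain types.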


116. Markovian Domain Signatures: Statistical Segmentation of Protein Sequences (up)
Gill Bejerano, Yevgeny Seldin, Naftali Tishby, School of Computer Science & Engineering, The Hebrew University;
jill@cs.huji.ac.il
Short Abstract:

 We present a novel method for protein sequence domain detection and classification. Our method is fully automated, does not require multiple alignments, and handles heterogeneous unordered multi-domain groups. It constructs unique domain signatures through clustering regions of conserved statistics. Example applications detect a protein fusion event and outperform HMM classification. 

One Page Abstract:

 Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional methods based on multiple sequence alignment, such as hidden Markov modeling (HMM), run into difficulties when faced with heterogeneous groups of proteins. Yet many families of proteins sharing a common domain contain instances of several other domains, without any common linear ordering. Ignoring this modularity may lead to poor or even false classification and annotation. An automated method that can decompose a group of proteins into the sequence domains it contains is therefore highly desirable.

We apply a novel method to this problem. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A variable memory Markov model (VMM) is built using a prediction suffix tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments. A deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. We show that regions of conserved statistics correlate well with protein sequence domains, by matching a unique signature to each domain. This is done in a fully automated manner, and does not require or attempt a multiple alignment. Several representative cases are presented. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences.
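As a hedged illustration of the modelling ingredient only (not the authors' code, and omitting the segmentation, model competition and deterministic annealing machinery), the following Python sketch implements a minimal variable-memory Markov predictor that backs off to the longest context seen in training, in the spirit of a pruned prediction suffix tree:

    import collections

    class VMM:
        """Minimal variable-memory Markov model: a table of suffix
        contexts up to max_depth with next-symbol counts, queried with
        the longest context seen in training."""

        def __init__(self, max_depth=3, alphabet="ACDEFGHIKLMNPQRSTVWY"):
            self.max_depth = max_depth
            self.alphabet = alphabet
            self.counts = collections.defaultdict(collections.Counter)

        def train(self, seq):
            for i in range(len(seq)):
                for d in range(self.max_depth + 1):
                    if i - d >= 0:
                        self.counts[seq[i - d:i]][seq[i]] += 1

        def prob(self, context, symbol):
            # back off to shorter suffixes until a trained context is found
            for d in range(min(self.max_depth, len(context)), -1, -1):
                ctx = context[len(context) - d:]
                if ctx in self.counts:
                    cnt = self.counts[ctx]
                    total = sum(cnt.values())
                    # Laplace smoothing over the amino acid alphabet
                    return (cnt[symbol] + 1) / (total + len(self.alphabet))
            return 1 / len(self.alphabet)

In the full method, several such models compete over segments: each segment is reassigned to the model under which its log-likelihood, accumulated from probabilities like prob() above, is highest.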
 
 


117. Identification Of Novel Conserved Sequence Motifs In Human Transmembrane Proteins (up)
Eike Staub, Artemis Hatzigeorgiou, Bernd Hinzmann, Christian Pilarsky, Thomas Specht, Andre Rosenthal, metaGen Pharmaceuticals GmbH;
eike.staub@metagen.de
Short Abstract:

 Here we present an approach to identify novel conserved protein motifs in human transmembrane (TM) proteins. Thousands of predicted TM protein sequences were scanned for known protein domains, transmembrane segments, low-complexity and coiled-coil regions. Using a PSI-BLAST-based procedure we identified interesting new motifs in the remaining, previously uncharacterized sequence fragments. 

One Page Abstract:

 Here we present an approach to identify novel conserved protein motifs in human transmembrane (TM) proteins. Thousands of predicted TM protein sequences were scanned for known protein domains, transmembrane segments, low-complexity and coiled-coil regions. Using a PSI-BLAST-based procedure we identified interesting new motifs in the remaining, previously uncharacterized sequence fragments. 


118. Using Profile Scores to Determine a Tree Representation of Protein Relationships. (up)
K. Diemer, T. Hatton, P. Thomas, Celera Genomics;
diemerkl@fc.celera.com
Short Abstract:

 An algorithm is introduced that uses profile scores to generate a tree of orthologs/paralogs and to split it into functional subgroups automatically. The similarity measure is based on the score of one sequence cluster to the profile of another. The algorithm is compared with other methods and expert curation.

 

One Page Abstract:

 Many algorithms have been proposed for reconstructing the evolution of protein families from DNA or protein sequence information. The primary goal has been to model the most likely historical sequence of events that gave rise to the protein sequences observed today in the form of orthologs and paralogs. 

Another use of phylogenetic trees has emerged: prediction of "attributes" of proteins, primarily function, from sequence information. Beginning with genetic "rescue" experiments, in which a defective protein in one organism is functionally replaced by a related protein from another organism, it has been repeatedly observed that proteins more closely related in sequence tend to be more closely related in function. It is also well known that the functional specificity of a given protein is generally conferred by only a subset of its constituent amino acids. Some of this specificity can be inferred from analysis of a protein family: positions that vary among the family members are not required for whatever function(s) the family members have in common, while positions that are strictly conserved may be important for those functions. Statistical profiles that describe the conservation patterns at different positions in a set of related proteins have been used to aid in phylogenetic reconstruction. Whether or not using profiles leads to more accurate phylogenetic reconstruction, it may lead to a greater correlation with function, which is the primary focus of the work presented here.

The algorithm introduced here uses agglomerative clustering, where the similarity measure used to join clusters is based on an approximation to the weighted score of the sequences in one cluster to the profile of the other cluster. Sequence fragments, which are not infrequent in current sequence databases, can be accommodated easily. A heuristic score-based measure is used to split the tree into functional subgroups. When assessing the performance of our algorithm, we examine the correlation between the resulting tree and the functions of the constituent proteins. We have evaluated the algorithm on alignments and corresponding expert functional annotations from publicly accessible websites, as well as several internally constructed test cases.
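A much-simplified rendering of the clustering loop is sketched below. This is our stand-in, not the Celera implementation: it assumes the sequences are already aligned to a common length, uses unweighted log-probability scores instead of the authors' weighted approximation, and omits the fragment handling and the heuristic tree-cutting step.

    import numpy as np

    AA = "ACDEFGHIKLMNPQRSTVWY"
    IDX = {a: i for i, a in enumerate(AA)}

    def profile(seqs):
        """Column-wise residue frequencies (with pseudocounts) of
        equal-length, pre-aligned sequences."""
        L = len(seqs[0])
        p = np.ones((L, len(AA)))          # pseudocount of 1 per residue
        for s in seqs:
            for j, a in enumerate(s):
                p[j, IDX[a]] += 1
        return p / p.sum(axis=1, keepdims=True)

    def score(seqs, prof):
        """Mean log-probability of sequences under a profile."""
        return np.mean([sum(np.log(prof[j, IDX[a]]) for j, a in enumerate(s))
                        for s in seqs])

    def agglomerate(seqs, n_clusters):
        """Greedy agglomerative clustering; the similarity of two clusters
        is the symmetrized score of one cluster's sequences against the
        other cluster's profile."""
        clusters = [[s] for s in seqs]
        while len(clusters) > n_clusters:
            best, pair = -np.inf, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = 0.5 * (score(clusters[i], profile(clusters[j])) +
                                 score(clusters[j], profile(clusters[i])))
                    if sim > best:
                        best, pair = sim, (i, j)
            i, j = pair
            clusters[i] += clusters.pop(j)
        return clusters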
 
 


119. Apoptosis Signalling Pathway Database - Combining Complementary Structural, Profile-based, and Pair-wise Homologies. (up)
Kutbuddin S. Doctor, John C. Reed, Adam Godzik, The Burnham Institute;
Philip E. Bourne, San Diego Super Computer Center & University of California, San Diego;
ksdoctor@burnham-inst.org
Short Abstract:

 This relational database system and web interface (http://apoptosis-db.org/) collects and organizes protein families involved in apoptosis. The database is organized around domains (profile-based) that are partially indicative of apoptotic function. Sub-family classification based on combined pair-wise homology over domains provides reliable functional classification. Structural homology links profile-based domains.

One Page Abstract:

 This relational database system and web interface (http://apoptosis-db.org) collects and organizes protein families involved in apoptosis. The database is organized around domains (profile-based) that are partially indicative of apoptotic function. Sub-family classification based on combined pair-wise homology over domains provides reliable functional classification. Structural homology links profile-based domains that share more generalized functions.


120. From clustering of expression data to motif finding: a multistep online procedure. (up)
Gert Thijs, Kathleen Marchal, Frank De Smet, Janick Mathys, Magali Lescot, K.U.Leuven - ESAT/SISTA;
Stephane Rombauts, PlantGenetics, VIB, U.Gent;
Bart De Moor, K.U.Leuven - ESAT/SISTA;
Pierre Rouze, PlantGenetics, VIB, U.Gent;
Yves Moreau, K.U.Leuven - ESAT/SISTA;
gert.thijs@esat.kuleuven.ac.be
Short Abstract:

 We present an integrated web-based tool for automatic multistep analysis of microarray data. The gene expression data are clustered to find groups of co-expressed genes. The upstream regions are selected based on accession number and gene name. Finally, the sequences are sent to the Motif Sampler to find over-represented motifs. 

One Page Abstract:

 Microarray experiments provide a global insight into the transcriptional behaviour of an organism. Deciphering the regulatory mechanisms underlying the transcript profiles is one of the major challenges of bioinformatics. Genes that have a similar expression profile are hypothesized to have a higher probability of being coregulated, and clustering techniques group together genes with similar expression profiles. Finding specific cis-acting motifs in the upstream regions of a set of co-expressed genes can, to some extent, validate the clusters. Here we present an interactive web-based user interface that integrates cluster analysis and motif finding tools for the analysis of microarray data, and we propose a multistep online procedure. Starting from the expression data together with the corresponding identification tags of the genes (accession number and gene name), the adaptive quality-based clustering algorithm defines groups of tightly co-expressed genes. Each gene in a cluster is identified by its accession number and gene name, and based on these tags its upstream region is retrieved. First the sequences are downloaded from GenBank and all the genes are located and indexed; in the next step the corresponding upstream region is identified. If this region is too short for further analysis, the gene is searched with BLAST against genomic sequences to locate the upstream region. This sequence selection relies on an automated procedure, but at each step an intermediate report is shown in which the user can intervene in the process. Once the upstream regions are identified, the user can send the sequences to the Motif Sampler to find over-represented motifs. The web interface can be accessed at the following URL: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html


121. Probe Based Scaling of Microarray Expression Data (up)
Christopher Workman, Lars Juhl Jensen, Steen Knudsen, Søren Brunak, Center for Biological Sequence Analysis;
workman@cbs.dtu.dk
Short Abstract:

 There are several analysis steps after hybridization and scanning that lay the foundation for all further analysis. For this reason, special care should be taken in image analysis, data scaling and normalization prior to calculating mRNA levels. This poster presents a new probe based scaling method for microarray expression data.

One Page Abstract:

 Between scanning a chip and drawing conclusions about mRNA levels there are several important steps that affect the results and lay the foundation for all further analysis. For this reason, special care should be taken in image analysis, data scaling, normalization, and outlier detection prior to calculating mRNA levels. Even after this is done, converting sets of probe pair intensities to mRNA levels (what I will call the feature extraction problem) is not as straightforward as one might think. There is very little precedent for feature extraction from probe pair data of this type, but some examples are starting to appear in the literature (Li and Wong, PNAS, v.98, 2001). What confounds the development of these methods is not knowing what the correct results should be. Using replicate experiments from the same and different RNA isolations from a single tissue, we can measure the effects of scaling and normalization on reproducibility. In this poster I will present new scaling and feature extraction methods and compare them to existing methods with respect to their effects on reproducibility.
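The poster's new method itself is not reproduced here; as a point of reference, the sketch below shows the kind of baseline it is compared against: global median scaling of replicate arrays followed by a robust trimmed-mean summary of PM-MM probe-pair differences (the function names and the 10% trim are our assumptions).

    import numpy as np

    def median_scale(arrays):
        """Scale each replicate array so its median intensity matches
        the overall median, a common baseline before probe-level
        summarization."""
        arrays = [np.asarray(a, dtype=float) for a in arrays]
        target = np.median(np.concatenate(arrays))
        return [a * (target / np.median(a)) for a in arrays]

    def probe_set_level(pm, mm):
        """Robust mRNA-level estimate for one probe set: the trimmed
        mean of PM - MM differences across probe pairs (cf. Li and Wong
        2001 for a model-based alternative)."""
        diff = np.sort(np.asarray(pm, float) - np.asarray(mm, float))
        k = max(1, len(diff) // 10)        # trim 10% from each tail
        return diff[k:-k].mean() if len(diff) > 2 * k else diff.mean()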


122. Revealing the Fine-Structures: Wavelet-based Fuzzy Clustering of Gene Expression Data (up)
Matthias E. Futschik, Nikola K. Kasabov, University of Otago;
mfutschik@infoscience.otago.ac.nz
Short Abstract:

 We studied yeast cell cycle expression data using fuzzy clustering and wavelet analysis. Both methods allow a more general approach for discovering the underlying structures and patterns in gene expression data than previous methods and can be valuable tools for revealing the complexity and the fine structure of cellular regulatory networks.

One Page Abstract:

 The invention of microarray technologies has opened the door for the study of global mechanisms in the cell. By measuring thousands of genes simultaneously it has become possible to get snapshots of the states of a cell. While monitoring the mRNA levels reveals only a part of the whole picture and protein arrays are still in their infancy, the DNA microarray technique has quickly become an established method and will lead the way in the analysis of the global behavior of cellular networks.

A major challenge is, however, the extraction of valuable knowledge from the mass of data produced in microarray experiments. Clustering has frequently been used to obtain a first insight into the structure of the data. It assigns genes to defined groups according to the similarities of their expression profiles. Since it is assumed that co-regulated genes show similar expression patterns, clustering can discover functionally related genes. Various clustering algorithms have been introduced so far, such as hierarchical clustering, k-means and self-organising maps (SOMs). A common property of these methods is the assignment of each gene to a single distinct cluster. However, this procedure might be too restrictive considering the complexity of cellular regulatory networks: single genes are frequently involved in several different physiological pathways, and an adequate clustering algorithm should reflect this.

In this work, we present fuzzy clustering[1] as an alternative to the traditional methods. Fuzzy clustering may group genes more naturally by allowing a single gene to belong to different clusters. This opens the way to a more complex partitioning of genes. We found that a significant number of genes seem to belong to different clusters, showing that fuzzy clustering might be an appropriate approach to use. Furthermore, fuzzy clustering leads to a definition of the core of a cluster in a straightforward way. Using this feature, we can examine in detail the correlations between the expression signal and the information in regulatory DNA sequences.
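Fuzzy clustering in the sense of Bezdek's fuzzy c-means [1] can be sketched in a few lines (a generic textbook version, not the implementation or parameter choices used in this work):

    import numpy as np

    def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
        """Fuzzy c-means: each expression profile (row of X) receives a
        membership degree in every cluster instead of a hard assignment.
        m > 1 is the fuzzifier; returns memberships U and cluster centers."""
        rng = np.random.default_rng(seed)
        U = rng.dirichlet(np.ones(c), size=len(X))       # n x c memberships
        for _ in range(n_iter):
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U = 1.0 / (d ** (2 / (m - 1)))               # standard update
            U /= U.sum(axis=1, keepdims=True)
        return U, centers

The membership matrix U directly supports the cluster-core idea mentioned above: the core of cluster k can be taken as the genes whose membership U[:, k] exceeds a chosen threshold.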

To illustrate this novel approach we apply fuzzy clustering to a yeast cell cycle gene expression data set[2]. To meet the temporal character of these data, we apply wavelet analysis[3] to represent the expression profiles. Wavelet analysis offers the possibility of studying the genetic network on different time scales while preserving the temporal order of the expression signals. An interesting possibility is the use of wavelet decomposition to distinguish the true biological signals from noise.

Finally we address the important issue of cluster validation by comparing different cluster validity criteria and discuss the problem of model parameter selection.

We show that both fuzzy clustering and wavelet analysis allow a more general approach for discovering underlying structures and patterns in gene expression data than previous methods and can be valuable tools for revealing the complexity and the fine structure of cellular regulatory networks.

References: [1] James C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Advanced Applications in Pattern Recognition, Plenum Press, 1983

[2] Paul T. Spellman et al., Molecular Biology of the Cell, Vol. 9, 3273-3297, 1998

[3] Ingrid Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, 1992


123. Identification of clinically relevant genes in lung tumor expression data. (up)
Olga Troyanskaya, Stanford Medical Informatics and Department of Genetics, Stanford University School of Medicine;
Mitchell Garber, Department of Genetics, Stanford University School of Medicine;
Russ B. Altman, Stanford Medical Informatics, Stanford University School of Medicine;
David Botstein, Department of Genetics, Stanford University School of Medicine;
olgat@smi.stanford.edu
Short Abstract:

 We developed methods for detecting clinically relevant genes in the context of lung cancer gene expression data. We present a correlation-based method for identifying survival-associated genes for lung adenocarcinomas. We also show a method based on a nonparametric t-test for identifying gene expression patterns associated with specific tumor types.

One Page Abstract:

 A major biomedical question in microarray studies is selecting genes associated with specific clinical parameters, for example patient survival. Identification of such markers, or groups of genes, may lead to clinical outcomes prediction and treatment guidance. Additionally, analysis of gene expression data associated with clinical data may allow molecular-level tumor classification. These tumor subtypes, which may appear histologically similar, are molecularly distinct and lead to differences in clinical outcomes such as patient survival, drug response, and metastatic status. Methods for automated analysis of gene expression data associated with clinical data are therefore needed.

 Our work is focused on developing and evaluating methods for detecting clinically relevant genes in the context of lung cancer gene expression data. We use a non-parametric t-test based method for identification of genes associated with specific tumor types. This method was applied to lung tumor data to distinguish between subtypes of lung adenocarcinomas which are not histologically distinct. We also describe a correlation-based method for identification of genes correlated with patient survival. The method identifies genes whose expression can be best used to classify tumors in terms of `good' and `bad' survival outcomes for patients with lung adenocarcinomas. 
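For illustration, a nonparametric t-test of the kind described can be realized by permuting the tumor-type labels; the sketch below (our rendering, not the authors' code) uses a Welch-type statistic and reports a permutation p-value:

    import numpy as np

    def permutation_t_test(x, y, n_perm=10000, seed=0):
        """Nonparametric two-sample test for one gene: the null
        distribution of the t statistic is estimated by permuting the
        tumor-type labels of the expression values."""
        rng = np.random.default_rng(seed)
        x, y = np.asarray(x, float), np.asarray(y, float)

        def tstat(a, b):
            va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
            return (a.mean() - b.mean()) / np.sqrt(va + vb)

        obs = tstat(x, y)
        pooled = np.concatenate([x, y])
        count = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            if abs(tstat(pooled[:len(x)], pooled[len(x):])) >= abs(obs):
                count += 1
        return obs, (count + 1) / (n_perm + 1)   # statistic, permutation p-value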


124. Machine Learning Techniques for the Analysis of Microarray Gene Expression Data: A Critical Appraisal (up)
Mahesan Niranjan, The University of Sheffield;
m.niranjan@sheffield.ac.uk
Short Abstract:

 This poster takes a critical look at some of the high performance machine learning techniques as applied to microarray gene expression data. It uses the yeast gene expression and leukemia datasets available in the public domain to illustrate that reasonably simple techniques can achieve performances comparable to highly nonlinear techniques.

One Page Abstract:

 In the recent literature we see that a wide range of powerful machine learning algorithms have been proposed for the analysis of gene expression data from microarrays. New clustering methods such as Gene Shaving have been invented in this context. Support Vector Machines, Bayes nets, Gaussian processes and latent variable methods have all been recommended as the right tools with which inference problems in such data should be approached. The recent literature thus takes the form of each machine learning expert with an interest in microarray data advancing his or her favourite method as the way forward for the biologists generating the data.

In this poster I report on taking a critical look at this collection of techniques as applied to this problem. In particular I report on the yeast gene expression and leukemia datasets, available in the public domain. It turns out that the underlying classification problems arising in these datasets are sufficiently simple that pattern processing techniques available in textbooks are as good as any sophisticated methodology. The key result from this observation is that many of the high dimensional problems can be reduced to problems of much lower dimensionality by reasonably simple techniques, opening the possibility of effective interpretation of such data.
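To make the point about textbook techniques concrete, here is the sort of simple pipeline meant (our reading, not the poster's code): signal-to-noise gene ranking to reduce dimensionality, followed by a nearest-centroid classifier.

    import numpy as np

    def select_genes(X, y, k=50):
        """Rank genes by a simple signal-to-noise score and keep the top
        k; X is samples x genes, y holds 0/1 class labels."""
        m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
        s0, s1 = X[y == 0].std(0) + 1e-9, X[y == 1].std(0) + 1e-9
        score = np.abs(m0 - m1) / (s0 + s1)
        return np.argsort(score)[::-1][:k]

    def nearest_centroid(Xtr, ytr, Xte):
        """Classify test samples by the closer class centroid."""
        c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
        d0 = np.linalg.norm(Xte - c0, axis=1)
        d1 = np.linalg.norm(Xte - c1, axis=1)
        return (d1 < d0).astype(int)

    # usage: g = select_genes(Xtr, ytr); pred = nearest_centroid(Xtr[:, g], ytr, Xte[:, g])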


125. Comparison of Methods for The Classification of Tumors Using Gene Expression Data (up)
Grace S. Shieh, Chi-Chih Chen, Ing-Cheng Jiang, Insti. of Statistical Science, Academia sinica, TAIWAN;
Yu-Shan Shih, Dept of Math., National Chung Cheng Univ.;
gshieh@stat.sinica.edu.tw
Short Abstract:

 The performance of Support Vector Machines and QUEST in classifying tumors based on gene expression data from cDNA microarrays is compared to that of the methods in Dudoit et al. (2000). We assess error rates on 150 data sets generated, by a statistical method, from the NCI 60 cell lines and lymphoma data, respectively.

One Page Abstract:

 The performance of Support Vector Machines and QUEST in classifying tumors based on gene expression data from cDNA microarrays is compared to that of the four major methods in Dudoit et al. (2000). We generate 150 data sets, by a statistical sampling method, from the original NCI 60 cell lines (Ross et al., 2000) and lymphoma data (Alizadeh et al., 2000), respectively. 

In each set, about two thirds of the generated data are used as training data and the rest as test data. Some variables (gene expression levels), out of many (for instance, 1,416 in the NCI 60 cell lines), are selected by a statistical criterion to implement the methods and to assess their prediction error rates. 
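The resampling design can be written generically; in the sketch below, fit and predict are placeholders of our own standing for any of the compared classifiers (SVM, QUEST, etc.):

    import numpy as np

    def repeated_holdout(X, y, fit, predict, n_sets=150, frac=2/3, seed=0):
        """Estimate a classifier's error rate over many random 2/3
        train, 1/3 test partitions of the data."""
        rng = np.random.default_rng(seed)
        n, errs = len(y), []
        for _ in range(n_sets):
            idx = rng.permutation(n)
            tr, te = idx[:int(frac * n)], idx[int(frac * n):]
            model = fit(X[tr], y[tr])
            errs.append(np.mean(predict(model, X[te]) != y[te]))
        return np.mean(errs), np.std(errs)   # mean error and its spread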


126. Using expression data for testing hypotheses on genetic networks - minimal requirements for the experimental design (up)
Dirk Repsilber, Institute of Molecular Evolution, Evolutionary Biology Centre of the University of Uppsala, Norbyvägen 18 C, SE-75236 Uppsala, Sweden;
Siv Andersson, Hans Liljenström, Institute of Molecular Evolution, Evolutionary Biology Centre of the University of Uppsala, Norbyvägen 18 C, SE-;
dirk.repsilber@ebc.uu.se
Short Abstract:

 We systematically tested the requirements on the experimental design for ranking false hypotheses about a genetic network's structure, given expression data. This is an important functional genomics task, because the parameter space of reasonable models is too big to explore without prior biological knowledge.

One Page Abstract:

 A variety of ``reverse engineering'' algorithms have been proposed for using expression data to reconstruct interactions in small networks. This may help us understand genetic regulation, the core task of today's functional genomics. Only a few of these studies point to the necessity of measuring ``independent'' samples in order to reengineer even the smallest genetic networks with sensible confidence. Here, we systematically tested the requirements on the experimental design that are necessary not only to reengineer the ``right'' genetic network, but also to rank false hypotheses about its structure. Presumably the latter is the task most frequently to be solved in the near future of functional genomics, because the parameter space of reasonable models is too big to sort out without using prior biological knowledge. However, this knowledge has mainly been inferred from sequence data, and several equally possible hypotheses need to be weighted against each other. Thus, algorithmic solutions that can be computationally automated to perform this task are indispensable. Following the work of Wahde and Hertz (2000), we use a genetic algorithm to explore the parameter space of a multistage discrete genetic network model (fixed connectivity and number of states per node).
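As a deliberately reduced stand-in for this search (Wahde and Hertz optimize continuous network parameters, whereas the model below is a simple threshold network and the evolution is mutation-only), the genetic algorithm can be sketched as:

    import numpy as np

    def simulate(W, state, steps):
        """Synchronous update of a discrete threshold-network model."""
        traj = [state]
        for _ in range(steps):
            state = (W @ state > 0).astype(int)
            traj.append(state)
        return np.array(traj)

    def fitness(W, data):
        """Fraction of observed state transitions (rows of the binary
        time x genes matrix `data`) that the candidate network reproduces."""
        pred = simulate(W, data[0], len(data) - 1)
        return (pred[1:] == data[1:]).mean()

    def genetic_search(data, pop=60, gens=200, seed=0):
        rng = np.random.default_rng(seed)
        n = data.shape[1]
        population = rng.choice([-1, 0, 1], size=(pop, n, n))
        for _ in range(gens):
            scores = np.array([fitness(W, data) for W in population])
            keep = population[np.argsort(scores)[::-1][:pop // 2]]   # selection
            children = keep.copy()
            mask = rng.random(children.shape) < 0.02                 # mutation
            children[mask] = rng.choice([-1, 0, 1], size=mask.sum())
            population = np.concatenate([keep, children])
        scores = np.array([fitness(W, data) for W in population])
        return population[scores.argmax()], scores.max()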


127. In silico search for cis-acting regulatory sequences in co-expressed gene clusters (up)
Stephane Rombauts, Department of Plant Genetics, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium;
Magali Lescot, Gert Thijs, Kathleen Marchal, SISTA/COSIC-ESAT, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee Belgium;
Cedric Simillion, Department of Plant Genetics, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium;
Bart De Moor, Yves Moreau, SISTA/COSIC-ESAT, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee Belgium;
Pierre Rouzé, INRA associated laboratory, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium;
strom@gengenp.rug.ac.be
Short Abstract:

 With large-scale transcriptome expression analyses, such as microarrays, one can tackle the problem of gene regulation. Motif finding algorithms aim at detecting motifs in the upstream regions of co-regulated genes. For that purpose we improved the original Gibbs sampler of Lawrence et al. PlantCARE has been improved and new features have been added.

One Page Abstract:

 Among the fully sequenced genomes, that of the dicotyledonous model plant Arabidopsis thaliana has been available since December 2000, and big efforts are being made to extract knowledge from the sequences. With large-scale transcriptome analyses such as microarrays producing large clusters of co-expressed genes, one can tackle the problem of gene regulation. It is commonly accepted that at least a subset of the sequences in a given cluster should share regulatory elements. In general, data on plant cis-acting regulatory elements are lacking, although these elements determine the processes in which genes are involved and are of major importance for plant biotechnology. Motif finding algorithms aim at detecting such motifs in the upstream regions of co-regulated genes by looking for over-represented oligonucleotides. For that purpose we developed the Motif Sampler, an improved implementation of the original Gibbs sampler of Lawrence et al.[1]. To test the Motif Sampler[2] on experimental data sets, we used the microarray data of Reymond et al.[3] on the plant response to mechanical wounding, as well as the data of Schaffer et al.[5] on the circadian clock. To assign a functional interpretation to the motifs found, their consensus sequences were compared with the entries in PlantCARE[6]. Several interesting motifs were found: for the wounding experiments, methyl jasmonate-responsive elements, elicitor-responsive elements and the abscisic acid response element; and, likewise, elements for the circadian clock experiments. The PlantCARE database and web site have been improved, and new features were added to deal with the predicted data. Among the updates, an interactive graphical display of promoter boxes mapped onto the query sequence, together with information about the sites, has been put up. Additionally, we aim at describing promoters as functional entities composed of several elements, based on extensive analyses of pools of co-regulated genes clustered from microarray experiments. At present, we have collected over 400 different cis-acting regulatory elements from the literature, describing more than 159 individual promoters from higher plant genes. (http://sphinx.rug.ac.be:8080/PlantCARE/)
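The core of a Gibbs sampling motif finder can be sketched compactly. The version below is a textbook site sampler in the spirit of Lawrence et al.[1], with one occurrence per sequence and a 0th-order background; the actual Motif Sampler improves on this with a higher-order background model[2].

    import numpy as np

    BASES = "ACGT"

    def gibbs_motif_sampler(seqs, w, n_iter=2000, seed=0):
        """Resample one motif start position per upstream sequence from
        the profile built on the remaining sequences; returns the final
        start positions of the width-w motif occurrences."""
        rng = np.random.default_rng(seed)
        pos = [rng.integers(0, len(s) - w + 1) for s in seqs]
        bg = np.array([sum(s.count(b) for s in seqs) for b in BASES], float)
        bg /= bg.sum()                       # 0th-order background model
        for _ in range(n_iter):
            i = rng.integers(len(seqs))
            others = [s[p:p + w] for k, (s, p) in enumerate(zip(seqs, pos)) if k != i]
            prof = np.ones((w, 4))           # pseudocounts
            for site in others:
                for j, b in enumerate(site):
                    prof[j, BASES.index(b)] += 1
            prof /= prof.sum(axis=1, keepdims=True)
            s = seqs[i]
            weights = np.array([
                np.prod([prof[j, BASES.index(s[p + j])] / bg[BASES.index(s[p + j])]
                         for j in range(w)])
                for p in range(len(s) - w + 1)])
            pos[i] = rng.choice(len(weights), p=weights / weights.sum())
        return pos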

References: [1] Lawrence, C.E. et al. (1993) "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment." Science 262(5131): 208-214. [2] Thijs, G. et al. (2000) "A higher-order background model improves the detection by Gibbs sampling of potential promoter regulatory elements in DNA sequences." Submitted. http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html [3] Reymond, P. et al. (2000) "Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis." Plant Cell 12(5): 707-720. [4] De Smet, F. et al. (2000) http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html [5] Schaffer, R. et al. (2001) "Microarray analysis of diurnal and circadian-regulated genes in Arabidopsis." Plant Cell 13: 113-123. [6] Rombauts, S. et al. (1999) "PlantCARE, a plant cis-acting regulatory element database." Nucleic Acids Res 27(1): 295-296.


128. A decision tree method for classification of promoters based on TF binding sites (up)
Alexander Kel, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany; Institute of Cytology and Genetics, Pr. Lavrentyeva 10, 360090, Novosibirsk, Russia;
Tatyana Ivanova, Institute of Cytology and Genetics, Pr. Lavrentyeva 10, 360090, Novosibirsk,;
Olga Kel-Margoulis, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany; Institute of Cytology and Genetics, Pr. Lavrentyeva 10;
Michael Zhang, Watson School of Biological Sciences, Cold Spring Harbor Laboratory, 1 Bungtown Road, P.O.Box 100,Cold Spring Harbor,;
Edgar Wingender, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany;
ake@biobase.de
Short Abstract:

 We have developed a new method for revealing class-specific composite modules (combinations of transcription factor binding sites) in promoters of eukaryotic genes that are functionally related or coexpressed. A decision tree system is constructed to classify promoters in genomes and computationally predict their function.

One Page Abstract:

 We have developed a new method for revealing class-specific composite modules in promoters of functionally related or coexpressed genes. On the basis of the revealed composite modules, a decision tree method was developed to classify promoters of several functionally related gene groups. Seven sets of promoters were obtained from different sources: promoters of cell-cycle-related genes (43 promoters) and brain-enriched genes (45 promoters) (collected in this work on the basis of a literature search); muscle-specific (25 promoters) and immune-cell-specific genes (24 promoters) (Kel et al. (1999) JMB 288, 353-376); erythroid-specific genes (10 promoters) (http://www.bionet.nsc.ru); and liver-enriched genes (39 promoters) and housekeeping genes (26 promoters) (EPD rel. 62). Promoter sequences of length 600 bp (from -500 to +99 relative to the start of transcription) were extracted from the EMBL database. To search for binding sites, a library of about 400 matrices for various transcription factors was applied (TRANSFAC rel. 4.4; Wingender, E. et al. (2000) NAR 28, 316-319) with a new search tool, "Match". To classify promoters we build a decision tree whose internal nodes represent selected composite modules. On the basis of the composite module, at every node we calculate a decision function F(X) for each sequence X as it is passed down the tree. The decision tree was built by a variant of a genetic algorithm that optimises the structure of the tree, selects the specific combinations of cis-elements for every node, and defines cut-off values for the corresponding functions F. The bottom nodes of the tree (leaves) contain the 7 promoter classes. The percentage of correct classifications achieved by the tree varies across promoter classes, from 35% for promoters of brain-enriched genes to more than 70% for cell-cycle-related promoters. The following set of TF binding sites appeared to be the most effective for classification of the mentioned promoter sets: E2F, OCT-1, NF-AT, MyoD, SRF and NF-kB. The classification tree and the program for promoter classification can be found at http://www.gene-regulation.com/. The decision tree method makes it possible to identify new promoters and computationally predict their function, and it provides a means to analyse gene expression data by constructing promoter models for coexpressed genes. 
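The classification step can be illustrated schematically. In the sketch below everything is hypothetical: the exact form of F(X), the module weights and the cut-offs in the real system are learned by the genetic algorithm described above, and the site names are used only as examples.

    def module_score(hits, module):
        """Decision function F(X) of one composite module: a weighted
        sum of the best match scores of the module's transcription
        factor binding sites in a promoter. `hits` maps a site name to
        the list of Match-style scores found in the sequence."""
        return sum(max(hits.get(site, [0.0])) * weight for site, weight in module)

    def classify(hits, tree):
        """Walk a decision tree whose internal nodes hold a composite
        module and a cut-off; leaves hold promoter class labels."""
        node = tree
        while isinstance(node, dict):
            branch = "yes" if module_score(hits, node["module"]) >= node["cutoff"] else "no"
            node = node[branch]
        return node

    # purely illustrative two-node tree (weights and cut-offs invented)
    tree = {"module": [("E2F", 1.0), ("NF-kB", 0.7)], "cutoff": 1.5,
            "yes": "cell cycle",
            "no": {"module": [("MyoD", 1.0), ("SRF", 1.0)], "cutoff": 1.2,
                   "yes": "muscle-specific", "no": "other"}}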


129. Biostatistical Methods to Analyse Gene Expression Profiles (up)
Jobst Landgrebe, MPI of Psychiatry, Munich;
Gerhard Welzl, GSF-Research Center, Munich;
Wolfgang Wurst, MPI of Psychiatry, Munich;
landgreb@mpipsykl.mpg.de
Short Abstract:

 We analysed gene expression data of mouse mutants with principal component analysis. We selected genes with extreme values in the reduced system, supervised by the variance within observation groups. This enabled us to explore differences between the samples and to extract fundamental gene expression patterns related to these differences.

One Page Abstract:

 Biostatistical Methods to Analyse Gene Expression Profiles

Jobst Landgrebe(1), Gerhard Welzl(2) and Wolfgang Wurst(1/2)

1 GSF-National Research Centre for Environment and Health, Ingolstädter Landstraße 1, D-85764 Neuherberg 2 Max-Planck-Institute of Psychiatry, Molecular Neurogenetics, Kraepelinstr.10, D-80804 München

 Abstract: DNA microarray gene expression data are characterised by an ever-increasing number of cDNA probes. Bioinformatic and biostatistical methods are applied to study the variance in gene expression across collections of related arrays and to detect fundamental patterns underlying these gene expression profiles. Many mathematical techniques have been developed to detect patterns in complex data. Quite a few of these methods are essentially different ways of clustering points in multidimensional space, e.g. hierarchical clustering or self-organising maps. Holter et al. successfully applied the singular value decomposition method to sets of DNA microarray gene expression data (Holter et al. 2000). Another method, named "Gene Shaving", is based on computing a leading principal component iteratively (Hastie et al. 2000). We analysed gene expression data of genetic and pharmacological mouse models with principal component analysis (PCA), regarding the experimental conditions as variables (columns) and the genes as objects (rows). The additional information about the related arrays (groups of mice) requires some modification of the PCA (Krzanowski 2000). We selected genes with extreme values in the reduced system (high variance between groups), supervised by the variance within groups. Using this method we were able to explore differences between the samples and to extract fundamental gene expression patterns related to these differences. To complete our analysis we ran the data visualisation system XGobi and compared the results with the outcome of other multivariate methods.

References: HASTIE, T., TIBSHIRANI, R., EISEN, M.B., ALIZADEH, A., LEVY, R., STAUDT, L., CHAN, W.C., BOTSTEIN, D. and BROWN, P. (2000): Gene 'shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1(2), 1-20. HOLTER, N.S., MITRA, M., MARITAN, A., CIEPLAK, M., BANAVAR, J.R. and FEDOROFF, N.V. (2000): Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. Natl. Acad. Sci. USA 97, 8409-8414. KRZANOWSKI, W.J. (2000): Principles of Multivariate Analysis. Oxford University Press, New York.
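The selection step can be rendered generically as follows (our simplification: genes as rows/objects and conditions as columns/variables, with the group-supervised modification reduced to an explicit between-group to within-group variance ratio):

    import numpy as np

    def pca_scores(X, n_comp=2):
        """Principal component coordinates of a genes x conditions
        matrix, treating genes as objects and conditions as variables."""
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_comp].T

    def f_ratio(X, groups):
        """Between-group over within-group variance of each gene;
        `groups` is a list of column-index arrays, e.g.
        [np.array([0, 1, 2]), np.array([3, 4, 5])]."""
        overall = X.mean(axis=1, keepdims=True)
        between = sum(len(g) * (X[:, g].mean(1, keepdims=True) - overall) ** 2
                      for g in groups).ravel()
        within = sum(((X[:, g] - X[:, g].mean(1, keepdims=True)) ** 2).sum(1)
                     for g in groups)
        return between / (within + 1e-12)

Genes scoring high on both criteria, extreme coordinates from pca_scores and a large f_ratio, are the candidates retained.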


130. Syntactic structures for understanding gene regulatory networks (up)
Peter Lee, Mike Hallett, Tom Hudson, McGill University;
pdlee@genome.mcgill.ca
Short Abstract:

 We present a novel system for representing complex gene relations derived from functional annotation databases. We propose a method for application of this functional representation to the analysis and interpretation of microarray gene expression data.

One Page Abstract:

 The analysis and interpretation of large-scale gene expression datasets requires methods for integrating information about gene function. The majority of knowledge about biomolecular systems exists in the form of qualitative descriptions that are intuitive but cover a diverse spectrum of mechanistic information and experimental conditions. Pathway databases (e.g. KEGG) and other functional classification systems (such as GO and MeSH) compress the information contained in the literature database to varying degrees. However, existing paradigms for representing this information (such as path maps, hierarchical trees and circuit diagrams) lack scalability and do not adequately capture the diversity and subtlety of the interactions between genes and their products. We describe a novel system for the representation of functional information contained in various functional databases. By preserving syntactic structures from the knowledge base, we propose a general interface that enables the construction of comparisons between gene expression analyses and current intuitive understandings of gene regulation. We are in the process of developing this interface to access data via a microarray gene expression database.


131. Adaptive quality-based clustering of gene expression profiles (up)
Frank De Smet, Frank De Smet, Kathleen Marchal, Janick Mathys, Gert Thijs, Bart De Moor, Yves Moreau, ESAT-SISTA/COSIC/DocArch;
frank.desmet@esat.kuleuven.ac.be
Short Abstract:

 A two-step algorithm to cluster significantly (with a certain confidence) coexpressed genes is presented. First, we try to find a sphere in the data where the 'density' of expression profiles is locally maximal. In a second step, we derive the optimal radius (or quality) of this sphere using an EM-algorithm.

One Page Abstract:

 Clustering genes based on their expression behaviour/profiles (e.g., measured by microarrays) is an important step preceding further analysis of the interaction between these genes. The hypothesis that a cluster contains either coregulated or functionally related genes only holds if the clustering algorithm used groups genes with a significant degree of coexpression. Genes that are not tightly coexpressed have to be excluded from further analysis. 

With these remarks in mind we designed an iterative two-step algorithm. First, we try to find a sphere in the data where the 'density' of expression profiles is locally maximal (based on a preliminary estimate of the radius R of the cluster - the quality-based approach(1)). In a second step, we derive the optimal radius (or quality) of the cluster/sphere so that only significantly coexpressed genes (represented by a significance level S, e.g. S=95%) are included in the cluster. This is achieved by fitting a model to the data using an EM algorithm. The model assumes that the data are normalised (the expression vectors have mean zero and variance one and are therefore located on the intersection of a hyperplane and a hypersphere). By inferring the radius or quality from the data itself, the biologist is relieved of estimating this parameter manually (a parameter that was sometimes hard to predict - setting the quality too strict will exclude a considerable number of coregulated genes, setting it too wide will include too many genes that are not coregulated).
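A one-dimensional caricature of the second step is sketched below. It is a simplification of our own: the published model is defined on the intersection of a hyperplane and a hypersphere, whereas this sketch fits a two-component mixture (half-normal cluster plus uniform background) to the distances from the cluster centre and reads off the radius at significance level S.

    import numpy as np
    from scipy.stats import norm

    def optimal_radius(dist, significance=0.95, n_iter=200):
        """Fit coexpressed-vs-background components to distances by EM,
        then return the radius containing the requested fraction of the
        coexpressed (half-normal) component."""
        dmax = dist.max()
        pi, sigma = 0.5, dist.std() / 2 + 1e-9
        for _ in range(n_iter):
            # E-step: responsibility of the 'cluster' component
            f_c = 2 * np.exp(-0.5 * (dist / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
            f_b = np.full_like(dist, 1.0 / dmax)       # uniform background
            r = pi * f_c / (pi * f_c + (1 - pi) * f_b)
            # M-step
            pi = r.mean()
            sigma = np.sqrt((r * dist ** 2).sum() / (r.sum() + 1e-12)) + 1e-9
        # half-normal quantile covering `significance` of the cluster
        return sigma * norm.ppf(0.5 + significance / 2.0)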

The most important properties of this approach are:

a. Few user-defined parameters (e.g., no pre-definition of the number of clusters) with an intuitive meaning.

b. Not all genes are assigned to a cluster.

c. The computational complexity of this method is approximately linear in the number of gene expression profiles in the data set.

Finally, we tested this algorithm successfully on real and artificial data.

References

1. Heyer, L. J., Kruglyak, S., and Yooseph, S. (1999) Exploring expression data: Identification and analysis of coexpressed genes. Genome Res., 9, 1106-1115.

Acknowledgements

Frank De Smet is a research assistant with the K.U.Leuven. Yves Moreau is a post-doctoral researcher of the FWO. Prof. Bart De Moor is a full professor with the K.U.Leuven. This work is supported by the Flemish Government (Research Council KUL (GOA Mefisto-666, IDO), FWO (G.0256.97,G.0240.99,G.0115.01, Research communities ICCoS, ANMMM, PhD and postdoc grants), Bil.Int. Research Program, IWT (Eureka-1562 (Synopsis), Eureka-2063 (Impact), Eureka-2419 (FLiTE), STWW-Genprom, IWT project Soft4s, PhD grants)), Federal State (IUAP IV-02, IUAP IV-24, Durable development MD/01/024), EU (TMR-Alapades, TMR-Ernsi, TMR-Niconet), Industrial contract research (ISMC, Data4s, Electrabel, Verhaert, Laborelec).
 
 


132. Incorporating Biological Knowledge Into Analyses of Microarray Data (up)
Jessica Ross, Division of Biomedical Informatics, Department of Medicine, Stanford University School of Medicine, Stanford, California;
Jeff Shrager, Carnegie Institute of Washington, Stanford, California;
Glenn Rosen, Division of Pulmonary and Critical Care, Department of Medicine, Stanford University School of Medicine, Stanford, Ca;
Pat Langley, Institute for the Study of Learning and Expertise, Palo Alto, California;
ccross@leland.stanford.edu
Short Abstract:

 We have developed a system that lets biologists view human gene expression data as it relates to molecules that occur in specific biological pathways, letting them search for all pathways that contain molecules, view expression levels of those molecules graphically, and calculate correlations between those expression levels.

One Page Abstract:

 High-throughput technologies have generated large amounts of data in the biological sciences. Using clustering algorithms to find patterns in these data is the primary method of analysis found in the literature, and is used predominantly as an exploratory tool, rather than as a test to evaluate a scientific hypothesis. These analyses have been very successful in letting scientists better classify tissue based on gene expression data. However, the results of clustering are often difficult to interpret in terms of the classical pathway models, which biologists often express as diagrams. The ability to reconcile microarray data with these models would greatly assist biologists in communicating knowledge gained from these high-throughput experiments. Furthermore, the ability to explain microarray results in relation to familiar biological pathways and molecular processes will directly support the formation and testing of hypotheses about these processes that regularly occur in the physical or wet lab. We have developed a system that lets biologists view human gene expression data as it relates to molecules that occur in specific biological pathways. We use a database with over 5000 biological reactions inferred from the literature on humans, most of which pertain to signaling pathways within the cell and therefore are directly relevant to current theories of the causes for many human diseases. The software lets a scientist search for all pathways that contain molecules of interest, view expression levels of those molecules graphically, and calculate correlations between those expression levels. In addition, the user may suggest a new pathway for comparison to the data. Using this program, we have been able to reconcile data from microarray experiments on human fibroblasts with accepted pathways for cell cycle and signaling. Our results show correlations between molecules that occur in these pathways, even though cluster analysis did not group them and/or include them in any group. We believe this system will serve as a valuable tool that will let biologists incorporate microarray data into the process of hypothesizing and testing their models. 


133. Semantic Link: A Knowledge Discovery Tool for Gene Expression Profiling (up)
Ingrid M. Keseler, Nikolai N. Kalnine, BD/CLONTECH;
imkeseler@clontech.com
Short Abstract:

 "Semantic Link" is an internet-based knowledge discovery tool designed for the interpretation of gene expression data. It contains a map of biological semantic concepts (genes, diseases, etc.) that are related to each other by their representation in the prevailing scientific literature. 

One Page Abstract:

 Microarray-based gene expression profiling generates thousands of data points which represent relative abundance of individual mRNA molecules in experimental and control samples. If we consider the cell as a network of interacting molecules with a mechanism of feedback control of their expression and degradation, a gene expression profile reflects an induction or suppression of certain regulatory pathways in response to a "treatment". Interpreting gene expression data in terms of metabolic pathways is a challenging task for several reasons: 1) Incomplete knowledge of the functional role of genes in the cell. 2) Complex nature of cellular pathway network. 3) Limited sensitivity and selectivity of the microarray data. In addition, most of the relevant information is scattered over a variety of Internet databases and scientific publications, which are not designed for high-throughput processing. Semantic Link is a text processing program that extracts all available information on gene functions and related disorders from Medline titles and abstracts and organizes it in a database. A dictionary of 350,000 selected words and phrases representing gene names and biological processes was built from Medline '96 to '01. The building process consisted of automatic extraction of terms from the text followed by supervised filtering. The dictionary was then supplied to the text processor for identification of terms in the text. Finally, articles were clustered by counting the co-occurrence of terms in the same abstract, paragraph or sentence. Basic elements of linguistic analysis (protein = proteins, gene expression = expression of gene, etc.) and substitution of synonyms were applied at that stage. The resulting database of semantic terms and links can be viewed by an internet client in the form of either a graphical network or a taxonomy of dictionary items. A trial version of Semantic Link built on the collection of terms of the Gene Ontology Consortium is available at http://atlasinfo.clontech.com/.
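The co-occurrence counting at the heart of the clustering step reduces to a few lines (a simplification of our own: the real system also counts co-occurrence within paragraphs and sentences and applies linguistic normalization and synonym substitution):

    import collections, itertools, re

    def cooccurrence(abstracts, dictionary):
        """Count how often pairs of dictionary terms (assumed lowercase)
        co-occur in the same abstract; the counts define the edges of
        the semantic network."""
        counts = collections.Counter()
        for text in abstracts:
            low = text.lower()
            found = {t for t in dictionary
                     if re.search(r"\b%s\b" % re.escape(t), low)}
            for a, b in itertools.combinations(sorted(found), 2):
                counts[(a, b)] += 1
        return counts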


134. Integration of transcript reconstruction and gene expression profiles to enhance disease gene discovery. (up)
Peter van Heusden, Electric Genetics;
Alan Christoffels, Soraya Bardien, South African National Bioinformatics Institute;
Gary Greyling, Electric Genetics;
Ari Ziskind, University of Stellenbosch;
Johann Visagie, Antoine van Gelder, Electric Genetics;
Janet Kelso, South African National Bioinformatics Institute;
Liza Groenewald, Tania Hide, Electric Genetics;
Win Hide, South African National Bioinformatics Institute;
pvh@egenetics.com
Short Abstract:

 We developed a tool to automate the identification of positional candidates for genetic disorders based on expression state, physical mapping and genome mapping information. A controlled vocabulary was integrated into the stackPACK EST clustering system to generate expression profiles, and the resulting transcripts were mapped to the genome (http://genome.ucsc.edu/) and graphically visualised.

One Page Abstract:

 There is an urgent need among human geneticists for bioinformatic tools to exploit the sequence data and other information generated by the Human Genome Project. We have developed a tool to automate the identification of positional candidates for genetic disorders based on (1) expression state, (2) physical mapping and (3) genome mapping information. For expression state, we extract information from various gene expression repositories for standardised functional annotation of positional candidates, thereby enabling the effective prioritisation of these genes. Unprocessed expression data in the form of expressed sequence tags (ESTs), serial analysis of gene expression (SAGE) tags, and array-based experiments are stored in numerous disparate databases. However, in the absence of a standardised nomenclature, there are problems with accessing and manipulating this information. These difficulties are compounded in the context of high-throughput systematic analysis and emphasise the need for consistent across-database description of the same terms and objects. We have constructed a controlled vocabulary for the standardised description of gene expression state. This vocabulary has been integrated into the stackPACK EST clustering system in order to generate cluster expression profiles. The credibility of these genes as positional candidates is enhanced by mapping them onto the Santa Cruz-assembled human genome sequence (http://genome.ucsc.edu/) using BLAST and SIM4. GeneMap'99 radiation hybrid markers were also mapped to the genome sequence using ePCR to provide reference points. The resulting expression profiles and mapping information are exported in a standardised EMBL format for visualisation purposes, e.g. using Artemis. The tool has been tested using two known disease loci: retinitis pigmentosa on 8q (the RP1 gene) and a type 2 diabetes locus on 2q (the CAPN10 gene). 


135. Gene Expression Database (GXD): integrated access to gene expression information from the laboratory mouse (up)
Martin Ringwald, Dale A. Begley, Ingeborg J. McCright, Terry F. Hayamizu, David P. Hill, Constance M. Smith, Judith A. Blake, Janan T. Eppig, Jim A. Kadin, Joel E. Richardson, The Jackson Laboratory;
ringwald@informatics.jax.org
Short Abstract:

 GXD is a community resource. Its objective is to capture and integrate different types of gene expression data from the laboratory mouse and to place these data in the larger biological and analytical context. GXD is accessible at http://www.informatics.jax.org/. New data are made available on a daily basis.

One Page Abstract:

 The Gene Expression Database (GXD) is a community resource of gene expression information from the laboratory mouse. The database is designed as an open-ended system that can integrate different types of expression data, such as RNA in situ hybridization and immunohistochemistry data, Northern and Western blot data, RT-PCR data, cDNA data, and microarray data. Thus, as data accumulate, GXD provides increasingly complete information about what transcripts and proteins are produced by what genes; where, when and in what amounts these gene products are expressed; and how their expression varies in different mouse strains and mutants. Expression patterns are described using an extensive dictionary of anatomical terms for the mouse that has been established in collaboration with our colleagues in Edinburgh, UK*. The anatomical dictionary names the tissues and structures for each developmental stage, and organizes the terms hierarchically from body region or system to tissue to tissue substructure. This model enables an integrated description of expression patterns for various assays with differing spatial resolution, computational analysis of expression patterns at different levels of detail, and continuous extensions of the anatomical dictionary itself. Expression records are linked to digitized images of original expression data. GXD is available at http://www.informatics.jax.org/. It is integrated with the Mouse Genome Database to enable a combined analysis of genotype, expression, and phenotype data. In conjunction with the Gene Ontology project we build shared controlled vocabularies for biological processes, molecular functions and cellular components and assign those terms to mouse genes and their products. These classification schemes provide important new search parameters for expression data. Extensive interconnections with sequence databases and with databases from other species further extend GXD's utility for analysis of gene expression information. *Edinburgh collaborators: J. Bard, R. Baldock, D. Davidson, M. Kaufman. GXD is supported by NIH grant HD33745. The Gene Ontology project is supported by NIH grant HG02273. 


136. Analysis of gene expression profiles between interacting protein pairs in M. musculus (up)
Rintaro Saito, Harukazu Suzuki, Ikuko Kagawa, Rika Miki, Hidemasa Bono, Hideaki Konno, Yasushi Okazaki, Yoshihide Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
rsaito@gsc.riken.go.jp
Short Abstract:

 Toward an integrative analysis of gene expression data and protein-protein interaction data, we have calculated correlation coefficients of gene expression profiles between interacting protein pairs in M. musculus. We will present current results and discuss the general rules governing expression patterns between the interacting pairs.

One Page Abstract:

 Proteins play pivotal roles in all biological phenomena, and physiological interactions of many proteins underlie the construction of biological pathways such as metabolic and signal transduction pathways. Analysis of these pathways is one of the most important issues, not only for molecular biology but also for medicine. The recent development of DNA microarray technologies has enabled us to examine the expression patterns of many genes at a time. In addition, the yeast two-hybrid method is widely used to screen physiological protein-protein interactions in a high-throughput manner. The development of several computational methods to infer pathways using either expression data or protein-protein interaction data is in progress; however, an integrated approach to analyzing both expression data and protein-protein interactions in higher organisms has not been established yet. The genome encyclopedia project of the RIKEN genome exploration research group has already collected a large number of mouse full-length enriched cDNAs (Nature 409: 685, 2001). We have also analyzed the expression profiles of those cDNAs in 49 different tissues using DNA microarrays (Proc. Natl. Acad. Sci. USA 98: 2199, 2001). In addition, we are screening protein-protein interactions using those cDNAs and have identified approximately 150 interactions (paper in submission). We have analyzed the correlation coefficients of gene expression profiles between interacting protein pairs. The results show that the degree of correlation seems to depend both on the set of selected data used for the calculation and on the protein functions. We will present the current results and discuss the general rules governing expression patterns between interacting pairs. Furthermore, a computational method to infer novel pathways using expression and protein-protein interaction data will be discussed. 
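The central computation is compact; the sketch below (our illustration, with a random-pair baseline added for comparison) assumes expr maps a gene identifier to its 49-tissue expression vector and pairs lists the interacting pairs from the two-hybrid screen:

    import numpy as np

    def interaction_correlations(expr, pairs):
        """Pearson correlation of the expression profiles of each
        interacting protein pair."""
        return {(a, b): np.corrcoef(expr[a], expr[b])[0, 1] for a, b in pairs}

    def random_pair_baseline(expr, n_pairs, seed=0):
        """Null distribution: correlations of randomly drawn gene pairs,
        against which the interacting pairs can be compared."""
        rng = np.random.default_rng(seed)
        genes = list(expr)
        picks = [rng.choice(len(genes), size=2, replace=False)
                 for _ in range(n_pairs)]
        return [np.corrcoef(expr[genes[i]], expr[genes[j]])[0, 1]
                for i, j in picks]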


137. Learning genomic nature of complex diseases from the gene expression data (up)
Andrew B. Goryachev, GeneData AG;
Pascale F. Macgregor, Clinical Genomics Center, University Health Network, Toronto, Canada;
Katryn Furuya, Hospital for Sick Children, Toronto, Canada;
Aled M. Edwards, C.H. Best Institute, University of Toronto, Canada;
Andrew.Goryachev@genedata.com
Short Abstract:

 The major genomics challenge is how to apply various data-mining tools to extract biologically important information from expression data. We present a complete study of a complex liver disease. A variety of statistical analyses provided by the Expressionist software was applied to reveal intricate co-expression patterns characterising the disease.

One Page Abstract:

 Significant emphasis is currently placed on understanding the molecular nature of complex human diseases, e.g. cancers. It has become evident that maladies caused by the malfunction of a single gene are rare. Instead, complex genome-scale aberrations are found responsible in an ever-growing number of cases. Expression data provide ample evidence for the existence of complex relationships between the genes involved in a given disorder. However, identification of such connections from experimental data is a challenging task that requires a variety of data mining methods applied in various combinations. In practical applications, in which several diseases represented by many samples are compared to heterogeneous normal groups, the complexity of the analysis quickly explodes. This overwhelming complexity demands sophisticated software tools offering a comprehensive set of analyses as well as advanced data management. We present a complete study in which complex expression data were analysed with the Expressionist software from GeneData AG. A human liver disorder of poorly understood origin was compared to another liver disease and to a normal group of samples in a large-scale expression profiling experiment. A variety of filtering, clustering and correlation analysis methods was applied to the data to reveal intricate patterns of gene co-expression, hinting at possible co-regulation characteristic of the particular disease. We also present a novel clustering approach which provides a flexible definition of cluster size and number. 


138. Comparative Assessment of Normalization Methods for cDNA microarray data (up)
Ilana Saarikko, Timo Viljanen, Turku Centre for Biotechnology and Turku Centre for Computer Science, University of Turku, Finland;
Riitta Lahesmaa, Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Finland;
Tapio Salakoski, Turku Centre for Biotechnology and Turku Centre for Computer Science, University of Turku, Finland;
Esa Uusipaikka, Department of Statistics, University of Turku, Finland;
ilana.saarikko@btk.utu.fi
Short Abstract:

 There are many sources of variation in data obtained by cDNA microarray experiments. Using replicated experiments, we have studied how existing normalization methods affect data and subsequent analysis. Based on this study we point out key issues in normalization as well as propose guidelines for choosing methods for selected situations. 

One Page Abstract:

 Microarrays are one of the latest breakthroughs in biotechnology, allowing the monitoring of expression levels for thousands of genes simultaneously. Microarray technology is already widely in use, and its applications range from the comparison of expression profiles to the prediction of regulatory networks. One of the major problems of this new technology is the uneven quality of data. Due to the nature of microarray experiments, there are many sources of variation in the obtained data. To make the data reliable enough to enable comparison across experiments, such variation needs to be removed. The process of removing this variation is called normalization. Commonly used normalization methods force the distribution of the log-ratios of expression levels to have a median or mean of zero.
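
A minimal sketch of such a median-centering step, assuming two background-corrected intensity vectors per slide (variable names and toy data are illustrative only, not the methods compared in this study):

# Minimal sketch of median normalization: shift each slide's log2-ratio
# distribution so that its median is zero.
import numpy as np

def median_normalize(cy5, cy3):
    """Return log2-ratios normalized to median zero for one slide."""
    log_ratio = np.log2(cy5) - np.log2(cy3)
    return log_ratio - np.median(log_ratio)

# Toy example: a multiplicative dye bias of 1.5 is removed by the shift.
rng = np.random.default_rng(1)
true = rng.lognormal(mean=8, sigma=1, size=1536)
cy3 = true * rng.lognormal(sigma=0.1, size=1536)
cy5 = 1.5 * true * rng.lognormal(sigma=0.1, size=1536)
normalized = median_normalize(cy5, cy3)
print(round(float(np.median(normalized)), 6))  # ~0 after normalization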

We have studied existing normalization methods and demonstrated how different methods affect data and subsequent analysis. One of our goals is to define general criteria for necessary and sufficient normalization. In our studies we have focused on replicated microarray experiments. We used data from four replicated slides, each having three replicated arrays of 1536 genes. Slides were hybridized with mRNA from two different samples labeled with Cy3 and Cy5, respectively. We also used data from an additional staining as a measure of the amount of cDNA on each probe; these data were used for normalization purposes.

Based on the replicate data, we validate the normalization methods in two ways. First, we examine how normalization affects the correlation between replicate arrays. Second, we study how the set of differentially expressed genes, defined by various criteria, varies with different normalization methods. On the basis of the results, we suggest guidelines for choosing good normalization methods for different situations. 

keywords: cDNA microarray, gene expression, normalization


139. Identifying different types of human lymphoma by SVM and ensembles of learning machines using DNA microarray data. (up)
Giorgio Valentini, D.I.S.I., Dipartimento di Informatica e Scienze dell' Informazione, Universita' di Genova;
valenti@disi.unige.it
Short Abstract:

 We propose supervised methods for identifying different types of human lymphoma using DNA microarray gene expression data. Support Vector Machines and ensembles of neural networks can correctly classify different types of lymphoma, offering also insights into the role of coordinately expressed groups of genes in carcinogenic processes of lymphoid cells.

One Page Abstract:

 DNA hybridization microarrays supply information about gene expression through measurements of the mRNA levels of large numbers of genes in a cell. Information obtained by DNA microarray technology gives a snapshot of the overall functional status of a cell, offering new insights into potentially different types of lymphomas, discriminated on a molecular and functional basis. Gene expression data produced by DNA microarray technology can be processed through unsupervised machine learning methods, using clustering algorithms to group together similar expression patterns corresponding to different tissues in order to separate cancerous from normal samples. However, unsupervised methods cannot always correctly separate classes. Supervised methods can overcome this problem by exploiting "a priori" biological and medical knowledge of the problem domain. In this work we use supervised learning methods for recognizing cancerous and normal lymphoid tissues, classifying different types of human lymphomas and also identifying groups of genes related to a specific type of lymphoma. We use data from a specialized DNA microarray, named "Lymphochip", developed at Stanford University School of Medicine and specifically designed to study lymphoid and carcinogenesis-related genes. In our first task we distinguish cancerous from normal tissues using the overall information available. This dichotomic problem is tackled using Support Vector Machines (SVM) and Multi-Layer Perceptrons (MLP). In our second task we directly classify different types of lymphoma (a multiclass problem) using MLPs and Parallel Non-linear Dichotomizers (PND), i.e. ensembles of learning machines based on output coding decomposition of a multiclass problem. These methods consist of decomposing a multiclass problem into a set of two-class problems according to some decomposition scheme, training the dichotomizers independently and combining the outputs to give the class label. In the third task we show how to use "a priori" biological and medical knowledge to separate two functional subclasses of diffuse large B-cell lymphoma (DLBCL) not detectable with traditional morphological classification schemes, identifying a set of coordinately expressed genes related to the separation of the two DLBCL subgroups. The results show that SVM, MLP and PND can be successfully applied to the analysis of DNA microarray gene expression data and to the identification of sets of coordinately expressed genes related to specific types of lymphoma.
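
The output-coding idea can be illustrated with a short sketch (a generic one-vs-all coding with simple least-squares dichotomizers on toy data; this is not the authors' PND implementation):

# Minimal sketch of output-coding decomposition: split a multiclass problem
# into binary problems via a coding matrix, train independent dichotomizers,
# and decode predictions by Hamming distance to the class codewords.
import numpy as np

# One-vs-all coding matrix for 4 classes: rows = classes, cols = dichotomizers.
CODE = np.eye(4, dtype=int)

def train_dichotomizers(X, y):
    """Fit one least-squares linear dichotomizer per column of CODE."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # add bias term
    weights = []
    for j in range(CODE.shape[1]):
        target = CODE[y, j] * 2.0 - 1.0              # map {0,1} -> {-1,+1}
        w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
        weights.append(w)
    return np.array(weights)

def predict(X, weights):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    bits = (Xb @ weights.T > 0).astype(int)          # binary outputs
    # Decode: class whose codeword is closest in Hamming distance.
    dists = np.abs(bits[:, None, :] - CODE[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

# Toy data: 4 Gaussian classes in 10 dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(20, 10)) for c in range(4)])
y = np.repeat(np.arange(4), 20)
w = train_dichotomizers(X, y)
print((predict(X, w) == y).mean())  # training accuracy on toy data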


140. On the Influence of the Transcription Factor on the Information Content of Binding Sites (up)
Jan T. Kim, Thomas Martinetz, Daniel Polani, Institut für Neuro- und Bioinformatik, Universität zu Lübeck;
kim@inb.mu-luebeck.de
Short Abstract:

 We develop a probabilistic model for the coevolution of a transcription factor and its binding sites. Maximum entropy analysis reveals connections between binding site information content and the binding behaviour of the transcription factor, and offers insight into the bioinformatic basis of Rsequence = Rfrequency. This may be useful for improving binding site recognition.

One Page Abstract:

 Transcription factors and their binding sites are a centerpiece of genetic information processing. Transcription factor binding sites are short sequence words. The location of these binding sites on the genome provides important information about the structure of the regulatory networks the transcription factor is involved in, as well as about the location of genes and other coding regions on the genome. However, finding these binding sites has turned out to be a difficult task which can only be solved with prior knowledge about the principal binding behaviour of transcription factors. A model for the basic probability distributions underlying the coevolution of the transcription factor and its binding sites within the genome is presented. State spaces for the transcription factor and for the genome are jointly represented, which is an extension of previous models in which only the genome space is considered. The model is formally analyzed with a maximum entropy approach. Empirical analyses using computer-based enumerations of the joint state spaces are performed to show that the approximations made during the formal analysis are justified.

The results give new insights into the connection between the information content of these binding sites and the binding behaviour of the transcription factor with particularly interesting implications for the relation between binding site information content (Rsequence) and binding site abundance on the genome (determining Rfrequency). The intriguing empirical observation that Rsequence approximately equals Rfrequency in a couple of instances still awaits a complete bioinformatic explanation. Our analysis reveals that this (approximate) equality cannot be generically deduced from information theoretic principles.
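
For reference, the quantities at issue are usually defined as follows (a sketch in LaTeX notation, assuming the standard conventions of molecular information theory; here G is the number of potential binding positions in the genome and \gamma the number of actual binding sites):

H(l) = -\sum_{b \in \{A,C,G,T\}} f(b,l)\,\log_2 f(b,l), \qquad
R_{\mathrm{sequence}} = \sum_{l} \bigl(2 - H(l)\bigr), \qquad
R_{\mathrm{frequency}} = \log_2 \frac{G}{\gamma}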

Regarding basic bioinformatics, this finding leads to a renewal of interest in empirical studies of binding site information content and distribution across genomes. Since binding site information content is not determined by fundamental informatic principles, one must assume that the relation of Rsequence and Rfrequency is determined by biological principles that are not yet known, and that it should therefore be investigated using empirical studies combined with theoretical efforts. On the applied side, advances in understanding the bioinformatic principles underlying binding site evolution are likely to provide additional sources of prior knowledge useful for developing improved binding site recognition schemes.


141. A Mouse Developmental Gene Index (up)
Janet Kelso, South African National Bioinformatics Institute, University of the Western Cape;
George J. Kargul, Yong Qian, Dawood B. Dudekula, Minoru S.H. Ko, Developmental Genomics and Aging Section, Laboratory of Genetics, National Institute on Aging, National Institutes of Health;
Winston A. Hide, South African National Bioinformatics Institute, University of the Western Cape;
janet@sanbi.ac.za
Short Abstract:

 We produced and annotated a mouse developmental gene index using cDNAs generated from mouse developmental libraries. This index has been compared to the RIKEN mouse cDNA collection to determine redundancy of the datasets. Selection and annotation of clones for rearraying, and subsequent production of a mouse cDNA microarray is presented.

One Page Abstract:

 While providing large amounts of genomic information, genomic sequencing efforts do not address the pressing need for comprehensive gene expression information. Despite their generally low sequence quality and short length, expressed sequence tags (ESTs) remain a rich source of gene expression information, providing data on expression location, expression level and the presence of alternative transcript isoforms. Attempts to elucidate the entire expressed gene complement of an organism have been hampered by the scarcity of full-length cDNAs representing all expressed gene transcripts. The absence of full-length transcript data and the relative abundance of ESTs have led a number of groups to produce reconstructed transcript gene indices. These gene indices seek to reduce the redundancy and error present in the EST databases by clustering and assembling ESTs based on sequence identity and clone annotation. Clustered EST data have proven invaluable in gene and alternative splice-form discovery, genome annotation and the study of gene regulation. In this study we have produced and annotated a mouse developmental gene index from high-quality cDNA sequences generated from early mouse developmental libraries in collaboration with Minoru Ko's group in the Gerontology Research Center at the National Institute on Aging. This gene index has been compared to the recently published RIKEN mouse cDNA collection to determine the redundancy of the datasets. Progress in the selection and annotation of clones for rearraying and the subsequent production of a mouse developmental cDNA microarray is presented.


142. Using Gene Expression and Artificial Neural Networks for Classification and Diagnostic Prediction of Cancers (up)
Markus Ringner, National Human Genome Research Institute/NIH;
Javed Khan, National Cancer Institute/NIH;
Jun S Wei, Lao H Saal, National Human Genome Research Institute/NIH;
Carsten Peterson, Complex Systems Division, Lund University;
Paul S. Meltzer, National Human Genome Research Institute/NIH;
mringner@thep.lu.se
Short Abstract:

 A method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs) is presented. We trained the ANNs using small round blue cell tumors, belonging to four distinct diagnostic categories. The ANNs correctly classified all samples and identified the genes most relevant to the classification.

One Page Abstract:

 Small blue round cell tumors (SRBCT) of childhood, including neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt's lymphoma (BL) and Ewing's sarcoma (EWS), are difficult to distinguish by routine immunohistochemistry. Currently there is no single test that can precisely distinguish these cancers, and several techniques are utilized to diagnose them, including cytogenetics, interphase fluorescence in situ hybridization, reverse transcription PCR and immunohistochemistry. In addition, poorly differentiated cancers can still pose a diagnostic dilemma. Gene expression profiling with cDNA microarray techniques permits the simultaneous analysis of multiple markers and hence offers considerable promise for categorizing cancers into subgroups.

We use gene expression data from cDNA microarrays containing 6567 genes from 63 SRBCT samples to calibrate artificial neural network (ANN) models to recognize cancers belonging to each of the four categories. The training samples included both tumor biopsy material (13 EWS and 10 RMS) and cell lines (10 EWS, 10 RMS, 12 NB and 8 BL). Given the small available data set, we preprocess the gene expression levels using Principal Component Analysis (PCA), retaining the 10 dominant directions and thereby reducing the input space significantly.

We classify the samples into the four categories using a 3-fold cross-validation procedure: the 63 known (labeled) samples are randomly shuffled and split into 3 equally sized groups. Linear perceptron models are then calibrated with 10 input variables using two of the groups, and the third group is reserved for testing predictions (validation). This procedure is repeated 3 times, each time with a different group used for validation. The random shuffling is redone 1250 times, and for each shuffling we analyze 3 ANN models. Thus, in total, each sample belongs to a validation set 1250 times and 3750 ANN models have been calibrated. The committee of models classifies all validation samples correctly. Due to the limited amount of training data and the high performance already achieved, we limit ourselves to linear models with no hidden units. Confidence measures in terms of distances to ideal classifications are developed for the data.
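
A minimal sketch of this calibration scheme on toy data (random labels and far fewer repetitions than the 1250 described; the PCA, linear models and fold structure follow the text, but this is not the authors' code):

# Minimal sketch: PCA to 10 components, linear models, repeated 3-fold CV.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(63, 6567))          # expression matrix (samples x genes)
y = rng.integers(0, 4, size=63)          # 4 diagnostic categories (toy labels)

# PCA by SVD on the mean-centred matrix; keep the 10 dominant components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:10].T                       # 63 samples x 10 inputs

def fit_linear(Ztr, ytr):
    """Least-squares linear model with one output unit per class."""
    T = np.eye(4)[ytr]                   # one-hot targets
    Zb = np.hstack([Ztr, np.ones((len(Ztr), 1))])
    W, *_ = np.linalg.lstsq(Zb, T, rcond=None)
    return W

def predict(Zte, W):
    Zb = np.hstack([Zte, np.ones((len(Zte), 1))])
    return (Zb @ W).argmax(axis=1)

# Repeated random 3-fold cross-validation (50 repeats here for brevity).
scores = []
for rep in range(50):
    idx = rng.permutation(63)
    for fold in np.array_split(idx, 3):
        train = np.setdiff1d(idx, fold)
        W = fit_linear(Z[train], y[train])
        scores.append((predict(Z[fold], W) == y[fold]).mean())
print(np.mean(scores))                   # chance-level on random toy labels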

The sensitivity to the different genes is determined by the absolute value of the partial derivative of the output with respect to the gene expression, averaged over samples and ANN models. Using the resulting ranking list of the inputs (genes), we redo the training procedure for different numbers of inputs and establish the minimal number of genes that optimizes the classification of the four cancer types. In this way 96 genes are identified, which correctly classify the 63 samples.

We then further test the validity of the models by classifying an additional set of 25 ("blind test") samples containing both [A] SRBCT tumor samples (5 EWS, 5 RMS, and 4 NB) and cell lines (1 EWS, 2 NB, 3 BL) and [B] 5 non-SRBCT samples (including 2 normal muscle samples). We are able to correctly classify [A] all 20 of the SRBCT and [B], based on confidence-related criteria, reject the non-SRBCT samples. In addition, on evaluation of the top 96 ranked genes we identify several genes that are uniquely expressed in a specific cancer, that have potential biological and therapeutic implications, and that have not previously been associated with these cancers.

We feel that this method of ANN analysis of gene expression data provides a powerful tool for classification, diagnosis and gene discovery. That only 96 genes are required for this application opens up the potential for cost-effective fabrication of SRBCT subarrays for diagnostic use.


143. Classification of malignant states in multistep carcinogenesis using gene expression matrix (up)
Koji Kadota, Department of Biotechnology, The University of Tokyo, and Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
Yasushi Okazaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
Shugo Nakamura, Department of Biotechnology, The University of Tokyo;
Yoshihide Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
Kentaro Shimizu, Department of Biotechnology, The University of Tokyo;
kadota@bi.a.u-tokyo.ac.jp
Short Abstract:

 cDNA microarray technology has the potential to be used to distinguish malignant samples from benign ones in the clinical field. We have developed an efficient method to extract genes that contribute to classifying malignant samples from benign ones with minimal false-negative diagnoses.

One Page Abstract:

 Certain types of cancer are reported to grow through multistep carcinogenesis. There are several types of tumors whose clinical course ranges from benign to malignant. Recently, microarray technology has made it possible to observe global gene expression across many tissues or conditions. This technique has been successfully applied to clinical samples to distinguish malignant from benign samples. Several supervised and unsupervised methods have been developed to classify two distinct states, such as tumor vs. normal clinical samples. However, the accuracy of classification using these methods varies depending on the dataset. It is essential to use predictor genes that achieve 100% accuracy in diagnosing malignant samples as malignant, which is more critical than diagnosing benign samples as benign. In this work, we have developed a novel method to select genes characterizing the malignant state versus the benign state using a gene expression matrix. In brief, genes contributing to the characterization of the malignant phenotype were selected by removing each gene in turn from the original gene set and testing whether it positively contributes to characterizing the malignant phenotype. We applied this algorithm to practical clinical samples to evaluate whether the presence of metastasis can be accurately predicted.
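
A minimal sketch of this leave-one-gene-out selection idea, using a simple nearest-centroid score in place of the authors' actual criterion (gene indices, data and the scoring function are all illustrative assumptions):

# Minimal sketch: score a gene set by a simple classifier criterion, remove
# each gene in turn, and keep those whose removal degrades the score.
import numpy as np

def centroid_score(X, y, genes):
    """Nearest-centroid accuracy using only the given gene columns."""
    Xs = X[:, genes]
    centroids = np.array([Xs[y == c].mean(axis=0) for c in (0, 1)])
    d = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return (d.argmin(axis=1) == y).mean()

def contributing_genes(X, y):
    all_genes = list(range(X.shape[1]))
    base = centroid_score(X, y, all_genes)
    keep = []
    for g in all_genes:
        reduced = [h for h in all_genes if h != g]
        if centroid_score(X, y, reduced) < base:   # removal hurts -> keep g
            keep.append(g)
    return keep

# Toy data: 20 samples (benign=0 / malignant=1), 30 genes, 3 informative.
rng = np.random.default_rng(4)
y = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 30))
X[y == 1, :3] += 2.0                               # informative genes 0-2
print(contributing_genes(X, y))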


144. Bioinformatics Tools in the Screening of Gene Delivery Systems (up)
Karin Regnström, Eva Ragnarsson, Per Artursson, Dep of Pharmacy, University of Uppsala, Sweden;
Karin.Regnstrom@galenik.uu.se
Short Abstract:

 We use array technology and bioinformatics tools to evaluate the gene expression profiles of suitable gene delivery systems. Our studies show that each delivery system tested results in a unique profile, a "fingerprint". Together with other experimental data, these are used for screening and the further design of delivery systems.

One Page Abstract:

 Purpose. To use bioinformatics tools in the comparison and evaluation of gene expression profiles originating from treatment with newly developed gene delivery systems.

Introduction. In our laboratory we use array technology to evaluate immunogenic properties as well as possible toxic reactions of suitable gene delivery candidates.

Methods. Gene delivery systems formulated with a reporter plasmid were administered mucosally to mice. Cells from the animals were harvested and total RNA was extracted. 32P-labeled cDNA copies of the RNA samples were produced and the probes were hybridized to a cDNA expression array and scanned with a phosphorimager. The images were analyzed and normalized to the data of control samples to enable comparison of the different formulations. Pairwise comparisons of the overall gene expression changes between different delivery systems were made using the Spotfire program (1). For comparisons of up to five delivery systems the GeneCluster program (2) was used. The gene expression data were filtered to obtain genes with a significant change in expression and clustered using self-organizing maps (SOMs). Further visualization was obtained with the Treeview program (3).

Results. The genes that passed the significance filter were sorted in SOMs and distinct clusters were obtained for the different delivery systems. The clusters revealed gene groups that were selectively affected after treatment with the different delivery systems. Some samples also showed high expression of known toxicity markers. It was possible to discern a gene expression "fingerprint" for each of the gene delivery systems tested.

Conclusions. This study identified important changes in gene expression profiles induced by the gene delivery systems studied. We conclude that bioinformatics in combination with array technology has great potential for the evaluation of pharmaceutical formulations during screening procedures.

In progress. We want to create a database containing gene expression data from all our formulations tested as well as other experimental data and molecular properties of the delivery systems. Our goal is to develop a tool which screens this database for suitable gene delivery systems by multiple comparisons and evaluations, which results in improved design of gene delivery systems.

 1. http://www.spotfire.com/

2. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. & Golub, T. R. (1999) Proc Natl Acad Sci U S A 96, 2907-12.

3. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc Natl Acad Sci U S A 95, 14863-8. 


145. Cross talking in cellular networks: tRNA-synthetase and amino acid synthetic enzymes in Escherichia coli (up)
Emmeli Taberman, Måns Ehrenberg, Uppsala University;
emmeli.taberman@icm.uu.se
Short Abstract:

 By constructing global mathematical models of growing bacteria we studied the control of production of an amino acid and its aminoacyl tRNA synthetase. The goal was to investigate how the cell can avoid interference between these two control loops, and discriminate between signals for charging deficiency and amino acid deficiency. 

One Page Abstract:

 It has been known for a long time that microorganisms exert different types of control over the expression of different operons. Control can be exerted at the transcriptional (e.g. by ribosome-dependent attenuation mechanisms or repressors), the translational (e.g. by autogenous feedback) or the posttranslational (e.g. protein modifications) level. To assess how mechanisms for the control of gene expression behave in vivo we have constructed a global mathematical model for growing bacteria. This has been used to study, first, control of expression of an operon for enzymes that synthesise the amino acid threonine and, second, control of synthesis of the aminoacyl-tRNA synthetase (ThrRS) that couples Thr to tRNAThr. The threonine biosynthetic pathway is regulated by an attenuation mechanism involving a leader peptide with multiple Thr and Ile codons. Expression of the gene for ThrRS is regulated by an autogenous mechanism, in which the leader of the mRNA that encodes ThrRS mimics tRNAThr. When ThrRS is in excess, it binds strongly to the leader of its mRNA and thereby inhibits initiation of translation. An interesting problem is how the cell can discriminate between a maladjusted rate of synthesis of an amino acid, on the one hand, and a too high or too low level of the corresponding aminoacyl-tRNA synthetase, on the other. For instance, if the aminoacyl-tRNA synthetase concentration is too low, this will not only signal for increased production of the synthetase but also for increased (attenuation control with ribosome step time as signal) or decreased (repressor control with the amino acid pool as signal) production of the amino acid synthesising pathway. Our analysis shows that there is considerable "cross-talk" between control systems for amino acid synthesis and production of tRNA synthetases. We discuss how the cell can minimize the negative effects of signal misinterpretations due to such crosstalk. We also describe suitable experiments to test predictions based on our mathematical models.


146. Assessing Clusters and Motifs from Gene Expression Data (up)
D. K. Smith, L. M. Jakt, Biochemistry Dept., Univ. of Hong Kong;
L. Cao, Dept. Microbiology, UHK;
K. S. E. Cheah, Biochemistry Dept., Univ. of Hong Kong;
dsmith@hkusua.hku.hk
Short Abstract:

 A method has been developed to assess gene clusters derived from microarray experiments. The probability of finding motif matches associated with the genes in the cluster by chance is determined. Issues of biological relevance, over or under-clustering, activity in several clusters or the refinement of motifs can be addressed.

One Page Abstract:

 When analysing gene expression data from microarray based studies, it is common to compare the expression profiles of the genes and perform some clustering of the profiles. Genes with similar expression profiles are grouped by the clustering algorithm and these genes are more likely to have similar functions or to be regulated in a common manner. Searches for conserved DNA motifs, which may potentially be cis-regulatory elements, can be undertaken in the non-coding regions of the genes in the cluster. For a computational study of gene expression there is a wide range of algorithms available to cluster expression profiles, to find new motifs in unaligned DNA sequences and to match known motifs to DNA sequences. Experimental errors from the microarray studies can also propagate through the computational analysis and so compound the effects of any limitations in the algorithms used. A method to evaluate these analyses is desirable.

 We have developed a method to assess the potential functional significance of clusters and motifs, based on the probability of finding a certain number of matches to a motif in all of the gene clusters. As a starting point, we take a set of genes that have been clustered, based on their expression profiles, by some algorithm, and a series of sequence motifs that may describe cis-regulatory elements. Issues of what threshold score to use for the differing motif matching algorithms are avoided by taking the best matches to a motif across the gene set, in groups of 50 to 600 matches. By counting the number of matches that are associated with each gene cluster, we can calculate the probability of observing, by chance, that number of matches to a motif in the non-coding regions of the genes in a cluster. The likely functional relevance of the clusters and motifs can be assessed based on these probabilities. This technique allows strongly and weakly matching motifs to be detected and refined, and significant matches to motifs across cluster boundaries can be observed. Application of this method to the yeast genome and a series of regulatory motifs led to the prediction that the previously unidentified factor known as Swi Five Factor was one of the yeast fork head proteins. Subsequently, this was confirmed by others.
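
A minimal sketch of such a by-chance calculation, assuming a hypergeometric null model (the numbers are illustrative only, not results from this study):

# Minimal sketch: given N genes in total, K of which carry a motif match,
# the chance of seeing at least k matches among the n genes of one cluster
# follows the hypergeometric distribution.
from scipy.stats import hypergeom

def cluster_motif_pvalue(N, K, n, k):
    """P(at least k motif matches in a cluster of n genes by chance)."""
    return hypergeom.sf(k - 1, N, K, n)

# Example: 6000 genes, 300 motif matches genome-wide, a cluster of 50 genes
# containing 12 matches (illustrative values).
print(cluster_motif_pvalue(N=6000, K=300, n=50, k=12))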


147. Statistical Analysis of Gene Expression Profile Changes among Experimental Groups (up)
Taesung Park, Sung-Gon Lee, Seungmook Lee, Department of Statistics, Seoul National University;
Dong-Hyun Yoo, Mi-Yoon Chang, Yong-Sung Lee, Department of Biochemistry, Hanyang University College of Medicine;
tspark@stats.snu.ac.kr
Short Abstract:

 cDNA microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. We propose a test procedure for testing gene expression profile differences among experimental groups. The test procedure is illustrated using cDNA microarrays of 3,800 gene expression profiles from neuronal differentiation of cortical stem cells.

One Page Abstract:

 cDNA microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. Cluster analysis is commonly used to group together genes with similar patterns of expression. Genes in different clusters tend to be regarded as having different expression profiles. When we are interested in testing gene expression profiles over time for different experimental groups, however, the usual clustering methods do not help much. We consider a simple summary measure to differentiate genes that have high variability from those that do not. Using this measure, we propose a test procedure to test for differences in gene expression profiles among experimental groups. The test procedure is illustrated using cDNA microarrays of 3,800 genes obtained in an experiment to search for changes in gene expression profiles during neuronal differentiation of cortical stem cells.


148. Multivariate method for selection of sets of differently expressed genes (up)
Ashot Chilingarian, N. Gevorgyan, Cosmic Ray Division, Yerevan Physics Institute, Armenia;
A. Szabo, Department of Oncological Sciences and Huntsman Cancer Institute, University of Utah;
A. Vardanyan, Cosmic Ray Division, Yerevan Physics Institute, Armenia;
chili@yerphi.am
Short Abstract:

 Genes differentially expressed in two tissues are found by an evolutionary algorithm maximizing the Mahalanobis distance between gene expression vectors. An "evolutionary bootstrap" resolves the instability of sample covariance matrices. We show the superiority of this multidimensional method over commonly used one-dimensional tests using a microarray data simulation model.

One Page Abstract:

 An important problem addressed using cDNA microarray data is the detection of genes differentially expressed in two tissues of interest. Currently used approaches consider each gene separately and evaluate its differential expression independently, ignoring the multidimensional structure of the data. However, it is well known that correlation among covariates can enhance the ability to detect less pronounced differences. We propose a novel approach utilizing gene correlation information to find differentially expressed genes. The Mahalanobis distance between vectors of gene expressions is the criterion for simultaneously comparing a set of genes, and an evolutionary algorithm is developed for maximizing it. However, the extreme imbalance between the number of genes and the number of experiments causes an instability of the sample covariance matrices, so a direct application of the Mahalanobis distance is not feasible. To overcome this problem we develop a new method of combining data from small-scale random search experiments that we term the "evolutionary bootstrap". We validate the proposed method in two ways. First, we simulate cDNA microarray data where the extent of differential expression of each gene is known. We apply the multidimensional method and several commonly used one-dimensional statistical tests and compare their ability to correctly identify differentially expressed genes and to rank them according to differential expression. By utilizing the correlation structure, the multivariate method finds, in addition to the genes found by the one-dimensional criteria, genes whose differential expression is not detectable marginally. As a different test, we apply the proposed method to data on two colon cancer cell lines and evaluate its ability to find genes that allow the classification of the samples according to their origin.
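
A minimal sketch of the core criterion, assuming a pooled within-group covariance estimate (the evolutionary search and the bootstrap wrapped around it are not shown):

# Minimal sketch: Mahalanobis distance between the mean expression vectors
# of two tissue groups for a candidate gene subset.
import numpy as np

def mahalanobis_between_groups(X1, X2, genes):
    """Mahalanobis distance between group means over the chosen genes."""
    A, B = X1[:, genes], X2[:, genes]
    diff = A.mean(axis=0) - B.mean(axis=0)
    # Pooled within-group covariance; unstable when genes >> samples,
    # which is the instability the abstract's "evolutionary bootstrap"
    # is designed to overcome.
    pooled = (np.cov(A, rowvar=False) * (len(A) - 1) +
              np.cov(B, rowvar=False) * (len(B) - 1)) / (len(A) + len(B) - 2)
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

# Toy data: two tissues, 12 samples each, scoring a 5-gene subset.
rng = np.random.default_rng(5)
X1 = rng.normal(size=(12, 100))
X2 = rng.normal(size=(12, 100))
X2[:, :2] += 1.0                     # genes 0 and 1 differ between tissues
print(mahalanobis_between_groups(X1, X2, genes=[0, 1, 2, 3, 4]))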

 


149. Understanding Non Small Cell Lung Cancer by Analysis of Expression Profiles (up)
Nir Friedman, Yoseph Barash, Hebrew University;
Amir Ben-Dor, Zohar Yakhini, Agilent Laboratories;
Naftali Kaminski, Sheba Medical Center, Israel;
nir@cs.huji.ac.il
Short Abstract:

 To understand the molecular mechanisms that underlie lung cancer, we analyze gene expression patterns in tumor and normal lung samples. We present computational methods that we developed to extract biological meaning from these data. We discuss the significance of the information we retrieve and its potential impact on cancer research.

One Page Abstract:

 Lung cancer is a common malignancy and a major determinant of overall cancer mortality in developed and developing countries. Despite intensive research, little has changed in the understanding and management of the disease. In order to determine the transcriptional programs that are active in non-small cell lung cancer (NSCLC), gene expression patterns of ~12,000 genes were collected from 24 NSCLC tumor samples, 11 normal histology samples from lung resections for cancer, and pooled normal lung RNA (5 individual lungs) obtained commercially.

 In this poster, we present analysis of these gene expression profiles. We show that gene expression patterns were highly distinct in tumor and normal tissues. We use the Total-Number-of-Misclassifications (TNoM), Information-content (Info) and Gaussian-Error scores to detect genes that significantly differ between NSCLC tumors and normal lung samples. One evident observation was that informative genes were significantly overabundant in our dataset, thus supporting the significance of the results. 
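
For illustration, a minimal sketch of a TNoM-style score for one gene, assuming the usual definition (the smallest number of misclassifications achievable by any expression threshold, in either orientation):

# Minimal sketch: threshold-number-of-misclassifications for a single gene.
import numpy as np

def tnom(expr, labels):
    """expr: 1-D expression values; labels: 0/1 class labels."""
    order = np.argsort(expr)
    y = np.asarray(labels)[order]
    n1 = y.sum()
    best = min(n1, len(y) - n1)               # trivial rule: one class only
    left_ones = 0
    for i, yi in enumerate(y, start=1):       # threshold after position i
        left_ones += yi
        # errors if left side is called class 0 (right side class 1)
        errs0 = left_ones + (len(y) - i - (n1 - left_ones))
        # the opposite orientation makes the complementary errors
        best = min(best, errs0, len(y) - errs0)
    return best

# Toy example: a gene separating tumor (1) from normal (0) almost cleanly.
expr = np.array([0.1, 0.2, 0.3, 0.9, 1.1, 1.3, 0.25])
labels = np.array([0, 0, 0, 1, 1, 1, 1])
print(tnom(expr, labels))  # -> 1 (one sample on the wrong side)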

To better understand the transcriptional program we analyzed the genomic location of genes that differ between NSCLC tumors and normal lung tissues, and compared these to cytogenetic abnormalities observed in the tumor samples. Finally, we developed and used class discovery tools to characterize putative tumor sub-types.

 The wealth of statistically significant and biologically meaningful information in our dataset supports our contention that transcriptional profiling will lead to new insights into the pathogenesis of lung cancer, and thus to the development of new tools for early detection and treatment of this devastating disease.


150. Applications of high-throughput identification of tissue expression profiles and specificity (up)
Fabien Campagne, Lucy Skrabanek, Harel Weinstein, Institute for Computational Biomedicine, Department of Physiology and Biophysics; Mount Sinai School of Medicine;
Fabien.Campagne@physbio.mssm.edu
Short Abstract:

 We recently developed TissueInfo: an automated, high-throughput method to identify the tissue expression profile and the specificity of a query sequence. We will briefly introduce applications of this new method to custom microarray production, gene discovery, genome analyses, signaling pathway modeling and tissue information ab initio prediction.

One Page Abstract:

 Organisms such as mammals do not express every single gene encoded by their genome in each of their cells. Rather, the various cell types of the organism express particular subsets of the genes in the genome. Cell types are further organized into tissues, and tissues constitute the organs that carry out various physiological functions. The detailed mechanisms by which gene products underlie the functioning of this complex organization are today largely unknown. Several methods, including SAGE [1] and microarray technology [2], can be applied to the study of differential gene expression in the various cell types and in different tissues. We recently developed TissueInfo, a high-throughput method to identify the tissue expression profile of the genes in an organism's genome, as well as the tissue specificity of a query sequence [3]. The method carefully organizes the data publicly available in dbEST [4] and is purely computational. With 80% coverage of the benchmark considered, TissueInfo achieves an accuracy of 76% when the tissue specificity of a gene is predicted and 89% when its expression in a given tissue is predicted. These results make possible the application of TissueInfo to the complete sequences available in the public draft of the human genome. Our poster will present some novel features of the tissue information obtained when profiling about 10,000 human genes for their expression in, and specificity to, 104 human tissues. This will illustrate the application of TissueInfo to genome-wide statistical analysis of gene expression in tissues. In addition, we will describe other potential applications of TissueInfo, such as the production of tissue-specific microarrays, where TissueInfo can greatly speed up and simplify the selection of clones expressed in a given tissue. Another important area of application of TissueInfo relates to gene discovery pipelines, where the method can be integrated to calculate tissue expression profiles and specificity for candidate genes. As shown in our recent identification of the Sac sensory receptor gene candidate [5], prediction of restricted tissue expression, or of other specific expression profiles, can be pivotal in the identification of a gene candidate. A third illustrative application consists of the assembly of training sets of genes grouped according to their expression profiles for the ab initio prediction of tissue information. More information about the method will be available from our web site: http://icb.mssm.edu.

1. Velculescu, V.E., et al., Serial analysis of gene expression. Science, 1995. 270(5235): p. 484-7.

2. Shoemaker, D.D., et al., Experimental annotation of the human genome using microarray technology. Nature, 2001. 409(6822): p. 922-7.

3. Skrabanek, L. and F. Campagne, TissueInfo: high-throughput identification of tissue expression profiles and specificity. submitted, 2001.

4. Boguski, M.S., T.M. Lowe, and C.M. Tolstoshev, dbEST--database for "expressed sequence tags". Nat Genet, 1993. 4(4): p. 332-3.

5. Max, M., et al., Tas1r3, encoding a new candidate taste receptor, is allelic to the sweet responsiveness locus Sac. Nat Genet, 2001. 28: p. 58-63. 


151. Identifying regulatory networks by combinatorial analysis of promoter elements (up)
Yitzhak Pilpel, Priya Sudarsanam, George M. Church, Department of Genetics, Harvard Medical School;
tpilpel@genetics.med.harvard.edu
Short Abstract:

 We have developed a new computational method that analyzes microarray data to discover synergistic regulatory motif combinations and applied it to the analysis of promoters of S. cerevisiae. Such interactions are organized into highly connected graphs, suggesting that a small number of regulators may be responsible for multiple expression patterns.

One Page Abstract:

 The recent availability of microarray data has led to the development of several computational approaches for studying genome-wide transcriptional regulation. These approaches have been very successful in deriving known and new regulatory motifs from the promoters of co-expressed genes. However, few studies have so far addressed the combinatorial nature of transcription, a well-established phenomenon in eukaryotes. We have developed a new computational method that analyzes microarray data to discover synergistic regulatory motif combinations and applied it to the analysis of promoters of S. cerevisiae. Our method suggests causal relationships between each motif in a combination and the observed expression patterns. In addition to identifying novel motif combinations that affect expression patterns during the cell cycle, sporulation, and various stress response conditions, we have also discovered regulatory cross-talk between several of these processes. We have developed novel visualization tools that allow the analysis of the causal relationships between regulatory motif combinations and expression profiles. In addition, we have generated global motif synergy maps that provide a view of the transcription networks in the cell. The maps are highly connected, suggesting that a small number of transcription factors are responsible for a complex set of expression patterns in diverse conditions. This approach should be important for modeling transcriptional regulatory networks in more complex eukaryotes.
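
One plausible way to score such synergy, sketched below on toy data, is to compare the expression coherence (fraction of close gene pairs) of genes carrying a motif combination with that of genes carrying the single motifs; the authors' exact scoring may differ:

# Minimal sketch: expression coherence of a gene group, defined here as the
# fraction of gene pairs whose profiles lie within a distance threshold.
import numpy as np
from itertools import combinations

def expression_coherence(profiles, radius):
    """Fraction of gene pairs whose Euclidean distance is below radius."""
    pairs = list(combinations(range(len(profiles)), 2))
    if not pairs:
        return 0.0
    close = sum(np.linalg.norm(profiles[i] - profiles[j]) < radius
                for i, j in pairs)
    return close / len(pairs)

# Toy data: genes with both motifs cluster tightly; single-motif genes don't.
rng = np.random.default_rng(6)
both = rng.normal(0, 0.3, size=(15, 12))     # 15 genes, 12 conditions
single = rng.normal(0, 1.0, size=(15, 12))
print(expression_coherence(both, radius=2.0),
      expression_coherence(single, radius=2.0))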


152. The use of discretization in the analysis of cDNA microarray expression profiles for the identification of tissue-specific genes (up)
Janick Mathys, Kathleen Marchal, Patrick Glenisson, Geert Fannes, Peter Antal, Yves Moreau, Bart De Moor, department of electrical engineering, K. U. Leuven;
Paul Van Hummelen, VIB MicroArray Facility, MAF;
jmathys@esat.kuleuven.ac.be
Short Abstract:

 A simple procedure was developed to analyze gene expression profiles from cDNA microarrays for the identification of tissue-specific genes. The procedure consists of the discretization of both background-corrected red intensities and ratios, followed by Euclidean distance-based clustering.

One Page Abstract:

 To assess tissue-specific gene expression, a standard concept in data mining was used: discretization. Discretization means that thresholds are determined or chosen; based on these thresholds, decisions are made about the expression of a gene (ON or OFF, over- or under-expression). To obtain gene expression profiles from various mouse tissues, cDNA was prepared from brain, kidney, heart, liver, lung, skeletal muscle, spleen and testis and hybridized on mouse cDNA microarrays. The microarrays contained 9216 spots derived from 4600 randomly chosen mouse genes printed in duplicate. Twelve slides were hybridized with each of the tissues labeled in red against spleen (reference) labeled in green. Following image analysis, genes were labeled ON or OFF according to a predetermined intensity threshold, set at the local background intensity of the spot plus two standard deviations of the mean spot intensity. If the intensity of a gene was below this threshold, the gene was considered OFF and got the label 0. Otherwise, the gene was considered ON and got the label 1 if one of the duplicate spots was above threshold and the label 2 if both duplicate spots were above background. The threshold settings were found not to be optimal because the sensitivities of the green and red channels differed; methods to adjust the thresholds for each dye specifically are now being developed and compared. The sum of the ON/OFF labels over the various tissues was used to divide the genes into the following groups: Group A, constitutively expressed genes (602 genes); Group B, tissue-specific genes; and Group C, non-expressed genes.

Group A: The group of genes that were ON in each tissue (each gene had the label 2 in all 12 experiments and thus sum=24) could be further separated into potential housekeeping genes (A1) and tissue-specific genes (A2). For this purpose, the ratios were discretized: if the ratio of a gene was > 2 (2-fold overexpression) or < 0.5 (2-fold underexpression), the gene received the label 1 or 2, respectively, and for ratios between 0.5 and 2 the gene was given the label 0. As in the previous discretization, the labels were summed over the various tissues and used to separate the genes. The group of genes with sum=0 can be considered potential housekeeping genes (84 genes). The remaining genes were clustered by calculating the Euclidean distance for each gene pair. These clusters consist of genes that were differentially expressed in one or more tissues (A2).

Group B: This group was further divided by the same clustering method as described for group A2. Except for heart, for each tissue a set of genes was found that was uniquely expressed in that specific tissue.

Group C: For a large set of genes (629 genes) no fluorescent signals above threshold were found in any of the tissues. These genes were not expressed in any of these tissues or were below the detection limit of the assay.

In a final analysis, the tissues themselves were subjected to a hierarchical clustering algorithm based on the ratios of the genes that were ON in each tissue. The results of this clustering matched remarkably well with the results obtained for the tissue-specific genes; for instance, heart and skeletal muscle seem to share the most specific genes of group A2. Our results are being further confirmed by information on function and tissue-specificity obtained from UniGene, GO and PubMed.
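
A minimal sketch of this labeling scheme (function and variable names are illustrative, not the authors' code):

# Minimal sketch: a spot is ON when its intensity exceeds local background
# plus two standard deviations; a gene printed in duplicate gets label
# 0, 1 or 2; summing labels over the 12 hybridizations assigns the group.
def spot_on(intensity, background, background_sd):
    return intensity > background + 2.0 * background_sd

def gene_label(spot_a, spot_b):
    """0 = OFF, 1 = one duplicate above threshold, 2 = both above."""
    return int(spot_a) + int(spot_b)

def assign_group(labels_per_tissue):
    total = sum(labels_per_tissue)
    if total == 2 * len(labels_per_tissue):  # sum == 24 over 12 slides
        return "A"                           # expressed everywhere
    if total == 0:
        return "C"                           # never detected
    return "B"                               # candidate tissue-specific

print(assign_group([gene_label(True, True)] * 12))   # -> A
print(assign_group([gene_label(False, False)] * 12)) # -> C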


153. Quantitative analysis of a bacterial gene expression by using the gusA reporter system in a non-steady state continuous culture (up)
Kathleen Marchal, Centre of Microbial and Plant Genetics, K.U. Leuven/ SISTA Department of Electrical Engineering, K. U. Leuven;
Jun Sun, Centre of Microbial and Plant Genetics, K.U. Leuven;
Ilse Smets, Kristel Bernaerts, Jan Van Impe, BioTeC-Bioprocess technology and Control, K.U. Leuven;
Bart de Moor, SISTA Department of Electrical Engineering, K.U. Leuven;
Jos Vanderleyden, Centre of Microbial and Plant Genetics, K.U. Leuven;
kathleen.marchal@esat.kuleuven.ac.be
Short Abstract:

 A general dynamic model (forward model) was used to study the "mere influence" of O2 on the expression of a bacterial fusion protein (an A. brasilense cytN-gusA fusion). The experimental set-up consisted of a non-steady state continuous culture whose O2 concentration was regularly perturbed.

One Page Abstract:

 In this study a dynamic model was developed to describe the "mere influence" of O2 on the expression of the A. brasilense cytN gene, which encodes an important respiratory enzyme, a cyt cbb3 terminal oxidase (Marchal et al., 1998). The experimental set-up consisted of the combined use of a non-steady state continuous culture and a translational gene fusion (cytN-gusA). The use of a continuous culture allows accurate monitoring of slight changes in input parameters (O2, input C-source, ...). Moreover, the input parameters (in this case O2) can be systematically perturbed to study the effect on the output parameters (fusion protein synthesis measured as β-glucuronidase activity, cell density, output C-source concentration). The combined use of structural dynamic modeling and an appropriate experimental set-up (training and validation experiments) allowed us to construct a structural forward model (based on differential equations describing cell growth, substrate consumption and fusion protein synthesis) that describes the dynamic behavior of the system under varying input signals. Simulation results showed that under the conditions tested cytN gene expression was not subject to catabolic repression. The hybrid fusion protein seemingly behaves as a very stable protein in A. brasilense and, consistent with previous results, O2 is the major signal regulating the cytN promoter. In principle this approach can be generalized to assess the effect of any controllable external signal on bacterial gene expression in a non-steady state continuous culture. The method outlined here has several advantages over the commonly used steady state measurements (fewer plasmid-stability problems, less time-consuming, more quantitative, etc.). Similarly constructed forward models can be used to predict the response of a recombinant promoter-gene construct to the alteration of an external signal (applications in metabolic engineering and process control in fermentation technology). Moreover, this study clearly highlights the complexity of using differential equations for forward modeling of genetic networks: in the work presented here, constructing a general model of the expression of only one gene regulated by two external parameters required the introduction of 14 parameters (after sensitivity analysis the model could be reduced to a specific model containing 6 major parameters), three experimental datasets and extensive computational analysis.

Marchal et al. 1998. J. Bacteriol. 180:5689-5696.
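
A minimal sketch of the kind of forward model described above, under simplifying assumptions of our own (Monod growth on the carbon source, O2-repressed synthesis of a stable fusion protein, constant dilution rate; all parameter values are invented, and the authors' full model is considerably larger):

# Minimal sketch of a chemostat forward model with an O2-repressed promoter.
from scipy.integrate import solve_ivp

D = 0.1        # dilution rate of the continuous culture (1/h, assumed)
S_IN = 5.0     # input C-source concentration (g/L, assumed)
MU_MAX, KS, YIELD = 0.6, 0.2, 0.5
K_O2 = 1.0     # repression constant: expression falls as O2 rises (assumed)

def oxygen(t):
    """Imposed O2 input signal: a step perturbation at t = 20 h."""
    return 0.5 if t < 20 else 5.0

def model(t, state):
    x, s, p = state                         # biomass, substrate, protein
    mu = MU_MAX * s / (KS + s)              # Monod growth rate
    dx = (mu - D) * x
    ds = D * (S_IN - s) - mu * x / YIELD
    # cytN-type promoter: synthesis repressed by O2; stable protein is
    # only diluted by growth.
    dp = x * K_O2 / (K_O2 + oxygen(t)) - mu * p
    return [dx, ds, dp]

sol = solve_ivp(model, (0, 40), [0.1, 5.0, 0.0], dense_output=True)
for t in (10, 19, 25, 40):                  # protein drops after the O2 step
    print(t, sol.sol(t)[2].round(3))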


154. Analysis of 5035 high-quality EST clones from mature potato tuber (up)
Jeppe Emmersen, Meg Crookshanks, Kåre Lehmann, Karen G. Welinder, Aalborg University;
je@bio.auc.dk
Short Abstract:

 The biosynthetic potential of the mature tuber of potato (var. Kuras) was elucidated. 5035 ESTs were sequenced, with an average length of 592 bp. The tuber ESTs displayed a significantly higher expression level of genes involved in Protein Destination and Protein Synthesis functions than potato EST libraries from leaf, stolon and shoot.

One Page Abstract:

 The biosynthetic potential of the economically important potato plant was elucidated. 5035 EST sequences of high quality from mature tuber were generated and analyzed. The average trimmed read length of the library was 592 bp, which is considerably higher than for other EST libraries.

The DNATools analysis software package, developed by S. W. Rasmussen, Carlsberg Research Center, was used to store sequences, analyze BLAST results, build EST submission files for dbEST, analyze redundancy, and edit sequences. DNATools was also used to build searchable flatfile databases containing sequences and BLAST results. This software was chosen for its high functionality and low price. DNATools, however, only permits searching database hits using E-values and simple keywords. To enhance the search capabilities of DNATools, a number of Perl scripts were written to enable more advanced search criteria, such as listing all sequences with more than 95% identity in a BLAST hit.
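
A sketch of such a post-processing script (shown in Python rather than Perl, reading the standard tabular BLAST output format; the file name is hypothetical):

# Minimal sketch: list every query with a BLAST hit above 95% identity,
# reading tabular BLAST output (query, subject, %identity, ... per line).
def high_identity_queries(blast_tabular_path, min_identity=95.0):
    hits = set()
    with open(blast_tabular_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            query, subject, identity = fields[0], fields[1], float(fields[2])
            if identity > min_identity:
                hits.add((query, subject, identity))
    return sorted(hits)

# Usage (hypothetical file name):
# for query, subject, ident in high_identity_queries("tuber_vs_tomato.m8"):
#     print(query, subject, ident)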

The expression level of potato mature tuber was compared to EST libraries from potato stolon, leaf and shoots (R.S. van der Hoeven et al., GenBank), giving a total of 35000 potato EST sequences. The sequences were divided into the different function categories suggested by MIPS by searching the annotated Arabidopsis genes with sequences from each EST library using BLASTX. This analysis showed that tuber has a significantly higher expression of genes involved in Protein Destination and in Protein Synthesis compared with the other potato tissues. The limitation of using Arabidopsis thaliana as a model was evident, as 25% to 34% of sequences in the potato libraries had no Arabidopsis match at an E-value of 1E-5 or lower.

Potato EST sequences were also compared to tomato EST sequences, as both plants belong to the nightshade family. EST libraries from tomato seed, flower, root, shoot, cotyledon, leaf and fruit (GenBank) were assembled into one BLAST database, with a total of 82355 tomato EST sequences. Each potato library was then compared to the tomato sequences by BLASTN. As expected, sequence identities of orthologous genes of tomato and potato are very high (>90%).


155. Using highly redundant oligonucleotide arrays to validate ESTs: Development and use of a human Affymetrix MuscleChip. (up)
Rehannah H. A. Borup, Yi-Wen Chen, Marina Bakay, Children's National Medical Center, Washington DC, USA;
Stefano Toppo, Giorgio Valle, Gerolamo Lanfranchi, University of Padova;
Eric P Hoffman, Children's National Medical Center, Washington DC, USA;
RBorup@childrens-research.org
Short Abstract:

 We present the design, production, and use of a highly redundant oligonucleotide array built on the Affymetrix microarray platform (32 oligonucleotides per gene studied). We compare transcript abundance measured by absolute intensity analyses of human muscle biopsy RNA (expression profiling) and by EST cluster member number from cDNA library sequencing.

One Page Abstract:

 The confidence that any particular EST cluster represents a true gene depends primarily on the recurrent identification of the sequence from multiple cDNA sources. However, a significant proportion of ESTs remain "singletons", with only one sequence representing that cluster in dbEST. Such singletons and low-redundancy clusters must be verified to impart confidence in their existence. One promising method centers on expression profiling: identification of the EST as an expressed sequence in microarray profiling may add considerable confidence to the existence of the singleton EST. To test this hypothesis, we have designed a highly redundant custom Affymetrix MuscleChip based largely upon an EST sequencing project of a non-normalized human muscle cDNA library at the University of Padova. Our MuscleChip contains 4,601 probe sets (average probe set = 16 perfect match and 16 mismatch oligonucleotides), with a total of ~150,000 25mer oligonucleotides. These probe sets represented 2,075 ESTs from the Padova database and 1,100 genes downloaded from previous Affymetrix stock GeneChips. 571 of our sequences were represented by two or more distinct probe sets, leading to a very high degree of redundancy within the MuscleChip. Redundancy is particularly important in muscle, where there are many very closely related genes with specific functions (e.g. >10 myosin heavy chain isoform genes differing by only a few bases).

We present data using this MuscleChip on a series of human muscle biopsies, including normal muscle and Duchenne muscular dystrophy muscle. We present a comparison of transcript abundance measured by EST sequence hits (cluster member number) and by normalized absolute intensity analyses from expression profiling. All sequences on the MuscleChip were blasted against the most recent versions of the sequence databases to update all gene assignments. Approximately 57% of ESTs not defined as of 1998 could now be assigned a gene name/function. This left approximately 325 sequences represented as true undefined ESTs on our MuscleChip. Of these, 55% are called present on the MuscleChip. We found considerable variation in EST cluster member number when compared to absolute intensities from expression profiling of muscle biopsies, although there was an overall trend of larger EST clusters correlating with greater absolute intensities. We hypothesize that those genes showing concordant EST cluster number and absolute intensity have a high level of confidence regarding RNA level in the tissue. Finally, we present data on the EST sequences, showing that a substantial subset is indeed verified by expression profiling. The differentially regulated ESTs we have found in muscular dystrophy and other conditions are now prioritized for full-length sequence determination and protein characterization.


156. Expression Profiler (up)
Jaak Vilo, Alvis Brazma, European Bioinformatics Institute;
vilo@ebi.ac.uk
Short Abstract:

 Expression Profiler (ep.ebi.ac.uk) is a WWW-based environment for analysis of microarray and other genomics data. Different components allow users to explore, cluster, analyze, and visualize gene expression, protein-protein interaction, regulatory sequence, sequence motif, and functional annotation data, as well as link the analysis results to other WWW-based tools and databases. 

One Page Abstract:

 Expression Profiler (ep.ebi.ac.uk) is a set of WWW-based software tools for analysis and mining of microarray and other genomics data. Different components of Expression Profiler allow users to explore, cluster and visualize the gene expression data; link the results of clustering to other web-based tools; compare experimental protein-protein interaction lists with gene expression data; browse Gene Ontology annotations, extract genes in each category and explore their expression profiles; perform the extraction of putative promoter sequences for the genes in the clusters; perform pattern discovery on the sets of extracted sequences; and visualize the patterns (motifs) on these sequences. The main components of Expression Profiler are:

* EPCLUST, "Expression Profile data CLUSTering and analysis", is a collection of clustering and visualization methods for the analysis of expression data. EPCLUST contains implementations of standard hierarchical and K-means clustering algorithms for many distance (similarity) measures and data transformation and normalization methods. EPCLUST also implements various similarity searches based on expression data for individual genes or gene clusters.

* URLMAP is a general, configurable tool for mapping HTML form contents (e.g. cluster contents from EPCLUST) to other on-line analysis tools and databases (HTML forms). URLMAP makes it possible, for example, to link clusters of genes to various databases (SwissProt, SGD, YPD, etc.) and tools (the KEGG metabolic pathway database query tool, RSA-tools, EPCLUST, GENOMES, etc.).

* GENOMES is a tool for retrieval of information about sets of genes, linking genes to other databases, and extraction of genomic sequences relative to the gene start and end positions.

* SPEXS, "Sequence Pattern Exhaustive Search", is a pattern discovery tool based on rapid exhaustive enumeration of all patterns occurring in sets of sequences, and reporting the most frequent, or most significant ones. SPEXS can be used for example for de novo prediction of potential transcription factor binding motifs on DNA, or motif discovery from protein sequences.

* PATMATCH is a tool for visualization and exploration of patterns and motifs on DNA and protein sequences. PATMATCH is integrated with EPCLUST and GENOMES, allowing, for instance, visualization of transcription factor binding motifs on the regulatory sequences of genes combined with the respective expression profile clustering.

* EP:PPI (Protein-Protein Interaction analysis tool) integrates experimental or predicted protein-protein interaction data with gene expression analysis in EPCLUST.

* EP:GO is a browser for the controlled vocabularies produced by the Gene Ontology annotation project (www.geneontology.org). It allows the extraction of genes associated with each Gene Ontology category and the subsequent analysis of gene expression, regulatory sequence, and protein-protein interaction data for these genes.
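
The sketch below is a minimal illustration of the kind of analysis EPCLUST offers, not the EPCLUST implementation itself: hierarchical clustering of a toy expression matrix under a correlation-based distance, written in Python with SciPy; the data, linkage method and cluster count are arbitrary choices for the example.

    # Toy example: hierarchical clustering of gene expression profiles
    # with a correlation-based distance (illustration only).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 8))        # 50 genes x 8 conditions (toy data)

    d = pdist(X, metric="correlation")  # distance = 1 - Pearson correlation
    tree = linkage(d, method="average") # average-linkage hierarchical tree
    labels = fcluster(tree, t=5, criterion="maxclust")  # cut into 5 clusters
    print(labels[:10])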

The WWW-based architecture makes it possible to perform all the described analyses independently of the user's hardware platform or operating system, without the need to install numerous software tools on each computer. If needed, the system can be run behind firewalls, within company intranets, thus also providing the needed data security. We are currently integrating the ArrayExpress database and the Expression Profiler analysis tools into a single system. This will open new opportunities for integrating many different public and private data sources for analysis and mining. 


157. Analysis of the transcriptional apparatus in the holoparasitic flowering plant genus Cuscuta
Sabine Berg, Tom A.W. van der Kooij, Kirsten Krause, Karin Krupinska, Botanical Institute, Christian-Albrechts-University, Kiel, Germany;
sberg@bot.uni-kiel.de
Short Abstract:

 The holoparasitic flowering plant genus Cuscuta includes species with different stages of plastome reduction. One central question in plastome analysis concerns the transcription machinery. In some species, loss of one of the three plastid RNA polymerases dramatically changed gene expression and promoter structures.

One Page Abstract:

 The holoparasitic flowering plant genus Cuscuta consists of fully photosynthetically active species with functional chloroplasts, intermediate forms with restricted photosynthetic capacity, and achlorophyllous species with extremely reduced plastomes and plastid functions (van der Kooij et al. 2000). It might therefore be an ideal model to study the loss of plastid genes and their functions in the evolutionary context of parasitic development. One of the central questions in further plastome analysis of several Cuscuta species is the analysis of the transcription machinery. Transcription in plastids is, in general, shared between the plastid-encoded RNA polymerase (PEP), which resembles the E. coli enzyme and consists of four subunits (rpoA, B, C1 and C2), and a nuclear-encoded RNA polymerase (NEP), which is imported into the plastid. Loss of the plastid-encoded enzyme may leave transcription relying only on the nuclear-encoded enzyme. We therefore tested for the presence of rpo genes in several Cuscuta species and analysed promoter usage in two different plastid genes. Southern blot analysis and PCR amplification reveal that the rpoA and rpoB genes, coding for subunits of the plastid-encoded polymerase (PEP), are present only in the photosynthetically active species Cuscuta reflexa and C. europea; C. gronovii, C. plathyloba, C. subinclusa and C. odorata lack these subunits (Krause et al.). Both the house-keeping gene rrn16, involved in translation, and the photosynthesis gene rbcL, coding for the large subunit of RUBISCO, are present in all species investigated. The coding regions of the genes are highly conserved among the species, whereas the amplified promoter fragments differ greatly in size. Sequence alignments of the promoter regions of the rrn16 and rbcL genes, respectively, show large deletions in four of the species investigated. The PEP-specific promoter sequences of rrn16 are present in C. reflexa but not in C. gronovii, C. odorata and C. subinclusa, yet Northern blot analysis reveals transcription of the gene in all species (Krause et al.). PEP-specific promoter sequences of rbcL are found in C. reflexa and C. europea but not in C. gronovii, C. subinclusa and C. plathyloba; again, Northern analysis reveals transcription in all species. Primer extension analysis was used to check whether a different promoter is being used, since both PEP and its promoter-specific structures are missing in four species. The PEP promoter is conserved and similar in sequence and size in the chlorophyll-containing species C. reflexa (rrn16, rbcL) and C. europea (rbcL). However, in C. gronovii, C. odorata and C. subinclusa, rrn16 transcripts are initiated from a different promoter which strongly resembles a NEP promoter in sequence. Transcription in these plastids has to rely on the imported NEP, owing to the loss of the rpo genes of the plastid-encoded enzyme in these Cuscuta species. Primer extension analysis of the rbcL gene reveals a conserved PEP promoter sequence in C. reflexa and C. europea, whereas transcription in C. gronovii, C. subinclusa and C. plathyloba initiates at NEP promoter sequences. 

Krause et al., submitted; van der Kooij et al. (2000) Planta 210:701-707.


158. Inferring Regulatory Pathways in E. coli using Dynamic Bayesian Networks
Irene M. Ong, David Page, University of Wisconsin-Madison;
ong@cs.wisc.edu
Short Abstract:

 This work presents the first application (to our knowledge) of Dynamic Bayesian Networks (DBNs) to time-series gene expression microarray data. We introduce an approach to determining transcriptional regulatory pathways by encoding background knowledge about gene expression for the particular organism being modeled into the initial, core structure of the DBN.

One Page Abstract:

 In order to fully understand how genomes operate, we need an understanding of how genes ``communicate'' as a network to organize the construction and daily functions of cells within an organism [DOE 2000]. We are interested in uncovering this genome-wide circuitry that underlies the regulation of cells. In this poster, we introduce an approach to determining transcriptional regulatory pathways by applying Dynamic Bayesian Networks (DBNs) to time-series gene expression data. The data are obtained from DNA microarray hybridization experiments in E. coli.

There has been much work in the area of analyzing gene expression data. The most closely related work, [Friedman, Linial, Nachman and Pe'er 2000], addressed the task of determining properties of the transcriptional program of an organism (Baker's yeast) by using Bayesian Networks (BNs) to analyze gene expression data. However, this method can only represent the correlations between genes at a given time. It does not show how genes regulate each other over time in the complex workings of regulatory pathways. To the best of our knowledge, we are the first group to apply DBNs to time-series microarray data.

DBNs are essentially BNs with a few additional assumptions that allow them to tractably model temporal information. We graphically model transcription in a DBN by building an initial DBN structure that exploits background knowledge from an operon map, a mapping of known and predicted operons to their associated genes. The operon map was obtained from [Craven, Page, Shavlik, Bockhorst and Glasner 2000]. Specifically, a time-slice in our initial DBN model consists of all the operons and genes, where each operon node has arcs that connect it to the sequence of gene nodes that are transcribed together for that particular operon. The gene nodes, our evidence variables, are discretized gene expression levels that indicate an increase, decrease, or no change in expression level from one time-slice to another.
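
The discretization step above can be illustrated with a minimal sketch (a toy Python example, not the authors' code; the 0.5 threshold is a hypothetical choice): expression changes between consecutive time slices are mapped to increase (+1), decrease (-1), or no change (0).

    import numpy as np

    def discretize(series, threshold=0.5):
        """Map differences between adjacent time points to {-1, 0, +1}."""
        diff = np.diff(series)
        states = np.zeros_like(diff, dtype=int)
        states[diff > threshold] = 1      # increase
        states[diff < -threshold] = -1    # decrease
        return states                     # 0 elsewhere: no change

    expr = np.array([0.1, 0.9, 1.0, 0.2, 0.3])  # toy log-ratios for one gene
    print(discretize(expr))                     # -> [ 1  0 -1  0]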

Using this initial DBN structure, our goal is to learn the arcs from operons in one time-slice to those in another. If operon 1 at time t_i has an arc to operon 2 at time t_(i+1), this implies that operons 1 and 2 are in the same regulatory pathway. These arcs, as well as all conditional probabilities, are inferred from time-series microarray data for E. coli using the structural EM algorithm [Friedman 1998].

The results of our experiments were mixed; however, the experiments did provide evidence that DBN learning is capable of identifying operons in E. coli that are in a common regulatory pathway.


159. ConSite: Identification of transcription factor binding sites conserved between orthologous gene sequences
Albin Sandelin, Boris Lenhard, Luis Mendoza, Wyeth Wasserman, CGR, Karolinska Institutet;
albin.sandelin@cgr.ki.se
Short Abstract:

 Understanding gene regulation is a post-genome research challenge. Methods for transcription factor binding site detection are insufficiently specific.

Based upon the hypothesis that regulatory sequences in non-coding DNA are preferentially conserved, we have constructed ConSite, a tool to align two orthologous genomic sequences and identify conserved binding sites.

One Page Abstract:

 Understanding the mechanisms by which gene expression is regulated is one of the principal challenges in post-sequence human genome research. Current motif-based methods for the computational detection of individual transcription factor binding sites are as yet insufficiently specific to warrant experimental investigation. Regulatory elements are short, and transcription factors generally tolerate considerable variation among experimentally defined binding sites. As a consequence, similar elements will be found purely by chance at high frequency in human genomic sequence. In short, the rate of false-positive predictions is prohibitively high for most purposes.

In order to accurately predict functional transcription factor binding sites, we must develop approaches beyond isolated profiles representing clusters of known sites. This may be achieved by several approaches, including: (i) considering combinatorial site-clusters, (ii) addressing the poorly understood subject of chromatin superstructure, or (iii) using computational approaches unrelated to biological mechanisms of gene regulation. 

With the increasing availability of genomic sequences from diverse species, it is possible to extract regulatory information via genomic sequence comparisons, a process termed "phylogenetic footprinting". Based upon the hypothesis that regulatory sequences in non-coding DNA are more likely to be conserved than sequences without sequence-specific function, we have constructed ConSite, a tool to align two orthologous genomic sequences and identify the binding sites which are conserved between the pair.
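
The core idea can be sketched as follows (a toy Python example under stated assumptions, not the ConSite implementation: the position weight matrix values, identity cutoff and score cutoff are all illustrative): scan one sequence of a pairwise alignment with a weight matrix and keep only hits in well-conserved windows.

    # Toy phylogenetic-footprinting filter for binding-site hits.
    pwm = {"A": [0.9, 0.1, 0.1, 0.8],   # toy 4-bp weights (illustrative)
           "C": [0.0, 0.8, 0.1, 0.0],
           "G": [0.1, 0.1, 0.8, 0.1],
           "T": [0.0, 0.0, 0.0, 0.1]}
    w = 4

    human = "TTAACGATTTACGA"            # pre-aligned, gap-free toy sequences
    mouse = "TTAACGTTTTACGA"

    for i in range(len(human) - w + 1):
        win_h, win_m = human[i:i+w], mouse[i:i+w]
        identity = sum(a == b for a, b in zip(win_h, win_m)) / w
        score = sum(pwm[base][j] for j, base in enumerate(win_h))
        if identity >= 0.75 and score > 2.0:  # conservation + profile cutoffs
            print(i, win_h, round(score, 2), identity)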

Constructed primarily as a tool for experimental researchers, ConSite gives the user the option to view exon-intron structure alongside the detected binding sites. To further narrow the scope of the analysis, users can scan with subsets of the transcription factor profiles based on species or protein structural class. Three output formats are provided, including graphical and text sequence alignments, as well as a tabular report.


160. FSCAN - An open source program for analysis of two-color fluorescence-labeled cDNA microarrays
Peter J. Munson, Ph.D., L. Young, Vinay V. Prabhu, Mathematical and Statistical Computing Lab, CIT, NIH;
munson@nih.gov
Short Abstract:

 FSCAN is a free, open-source program for the analysis of images generated from cDNA microarrays, available at http://abs.cit.nih.gov/fscan. Developed under the MATLAB system, the program runs on Windows, MacOS and Unix platforms. It provides interactive statistical and graphical analysis features, links to external databases and exportable text file output.

One Page Abstract:

 FSCAN is a free, open-source program for the analysis of images generated from cDNA microarrays, available at http://abs.cit.nih.gov/fscan. Developed under the MATLAB system, the program runs on Windows, MacOS and Unix platforms. It provides interactive statistical and graphical analysis features, links to external databases and exportable text file output. The program correctly reads Axon, Molecular Dynamics, .gel and .tiff images. It provides image segmentation, grid-overlay, spot detection and spot quantification algorithms. Because the source is open, users may modify the provided algorithms if desired. Several statistics are measured for each spot, including signal, background level, standard deviation and spot size for each of the two channels. After analyzing an image, the user is presented with a dynamic analysis workbench for selecting spots for closer examination and observing the presence of image artifacts, while viewing the same spot represented in a scatterplot view. Clicking on any spot in the "array-view" automatically selects it in the "scatterplot-view" and simultaneously identifies the associated clone and gene information. Links to external databases are provided to facilitate browsing the web for additional information. The program has been used extensively to analyze gene chips produced by the NCI containing up to 6500 genes, and can easily be adapted to virtually any commercial chip configuration. A sister program (PSCAN) is available for analysis of P33-labeled, nylon-based arrays.
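
For readers unfamiliar with spot quantification, the sketch below shows the kind of per-spot statistic involved: background-subtracted signal in each channel and a log-ratio, computed from toy pixel values. It is a Python illustration only; FSCAN itself is a MATLAB program, and its actual algorithms are not reproduced here.

    import numpy as np

    def spot_stats(fg_pixels, bg_pixels):
        """Median foreground minus median background for one spot."""
        signal = np.median(fg_pixels) - np.median(bg_pixels)
        return max(signal, 1.0)          # clamp to avoid log of <= 0

    rng = np.random.default_rng(1)
    cy3 = spot_stats(rng.normal(900, 50, 100), rng.normal(100, 20, 200))
    cy5 = spot_stats(rng.normal(450, 40, 100), rng.normal(110, 20, 200))
    print("log2 ratio:", np.log2(cy5 / cy3))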


161. A method for designing PCR primers for amplifying cDNA array clones
Henrik Bjørn Nielsen, Steen Knudsen, Center for Biological Sequence Analysis (CBS), Technical University of Denmark;
hbjorn@cbs.dtu.dk
Short Abstract:

 PROBEWIZ designs PCR primers for cDNA arrays with minimal homology to other expressed sequences from a given organism. The primers can be constrained on Tm, product length and primer size. Primer selection is based on user-defined penalties for homology, primer quality, and positioning toward the 3' end. Find PROBEWIZ at www.cbs.dtu.dk/services/DNAarray/probewiz.html

One Page Abstract:

 When designing targets for cDNA arrays, it is important that the target anneals only to the desired probe sequence, especially in expression studies where the abundance of probes varies greatly. Manual design of targets, looking for regions with homology, is tedious and time-consuming. We present a fast solution to the problem of designing large numbers of PCR primers that amplify sequences with minimal homology to other sequences in a database of ESTs. The program PROBEWIZ designs probes/targets for cDNA arrays, Northern blots or Southern blots by first searching for regions with homology higher than 50% to other sequences in a database. Using this information, PROBEWIZ then finds a number of potential PCR primer pairs that all meet a set of user-defined criteria (Tm of primer, length of product, primer size). These are then evaluated and sorted according to three parameters, each assigned an importance weight by the user: 1) the homology to other genes in the database, 2) the 3' proximity of the probe, and 3) the primer quality. PROBEWIZ is accessible from www.cbs.dtu.dk/services/DNAarray/probewiz.html
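
The ranking step can be illustrated with a small sketch (hypothetical scores and weights, not the PROBEWIZ scoring code): candidate primer pairs are ordered by a user-weighted sum of the three penalties named above.

    # Rank candidate primer pairs by a weighted penalty (toy values).
    candidates = [
        # (name, homology_penalty, distance_from_3prime, primer_quality)
        ("pair1", 0.10, 0.20, 0.90),
        ("pair2", 0.05, 0.60, 0.80),
        ("pair3", 0.30, 0.10, 0.95),
    ]
    w_hom, w_3p, w_qual = 2.0, 1.0, 1.0   # user-defined importance weights

    def penalty(pair):
        _, hom, dist3, qual = pair
        return w_hom * hom + w_3p * dist3 + w_qual * (1.0 - qual)

    for pair in sorted(candidates, key=penalty):   # best (lowest) first
        print(pair[0], round(penalty(pair), 3))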


162. Statistical modelling of variation in microarray replicates
S. Soneji, Birkbeck College, London;
S. Kendall, London School of Hygiene and Tropical Medicine, London;
J. Mangan, K. Lang, J. Hinds, P. Butcher, St Georges Hospital Medical School, London;
N. Stoker, London School of Hygiene and Tropical Medicine, London;
L. Wernisch, Birkbeck College, London;
s.soneji@mail.cryst.bbk.ac.uk
Short Abstract:

 We have used microarray analysis to identify differentially expressed genes in a Mycobacterium tuberculosis mutant. Several replicates have been produced and analysed by analysis of variance models in order to obtain stronger signals and to estimate the amount of variation due to different sources of experimental error.

One Page Abstract:

 Microarray analysis is a powerful technique for the identification of genes differentially expressed in mutant organisms as compared to the wild type. In the Mycobacterium tuberculosis H37Rv mutant Tame12, the tcrS gene, which codes for the sensor of a two-component regulatory system, has been knocked out. The purpose of the experiment was to identify genes under direct or indirect control of this regulatory system. To support our analyses, several replicates of the same hybridization experiment were carried out. The purpose of the replicates was twofold. Firstly, with replicates, signals of over- or underexpression can be extracted more reliably from noisy data. Secondly, an analysis of variance can be applied to reveal the amount of variation due to the various sources of experimental error. Identification of these sources might help to reduce noise in future experiments. We prepared three different RNA samples of both mutant and wild type cultures. Each sample was hybridized to two glass-slide microarrays containing PCR products from all 3,924 genes of M. tuberculosis; thus, all in all, we produced 6 hybridization replicates. We also repeated the scanning and spot quantification processes. We fitted linear models taking various combinations of these factors into account and compared their explanatory power. A common feature of all fitted models was that the gene-bacterial strain interaction was one of the weakest factors compared to the variation among other factors. This result lends weight to the importance of replicates for this type of experiment. Bootstrap resampling of the residuals of the best models was used to obtain p-values for the significance of expression levels of differentially expressed genes. When searching for conspicuous expression levels among thousands of genes, a proper adjustment of p-values is mandatory. We found about 10 genes with significantly different expression levels. Reverse transcriptase PCR is being used to confirm these results. 
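
The bootstrap step can be sketched as follows, under simplifying assumptions (a per-gene two-group comparison in Python rather than the full analysis-of-variance model of the poster): residuals are resampled to build a null distribution for the strain effect.

    import numpy as np

    rng = np.random.default_rng(2)
    mutant = np.array([1.2, 1.4, 1.1, 1.3, 1.5, 1.2])  # toy log-ratios
    wild = np.array([0.2, 0.1, 0.3, 0.0, 0.2, 0.1])

    observed = mutant.mean() - wild.mean()
    residuals = np.concatenate([mutant - mutant.mean(), wild - wild.mean()])

    null = []
    for _ in range(10000):   # resample residuals under the null model
        r = rng.choice(residuals, size=residuals.size, replace=True)
        null.append(r[:6].mean() - r[6:].mean())
    p = np.mean(np.abs(null) >= abs(observed))
    print("bootstrap p-value:", p)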


163. A new exploration of diffuse large B-cell lymphoma
Junbai Wang, Tumor Biology Dep. Den Norsk Radium Hospital;
Jan Delabie, Lymphoma Research Group, Den Norsk Radium Hospital;
Ola Myklebost, Tumor Biology Department, Den Norsk Radium Hospital;
junbaiw@radium.uio.no
Short Abstract:

 A new approach is applied to study diffuse large B-cell lymphoma (DLBCL). The strategy is a combination of clustering, self-organizing maps and principal component analysis. With this approach, we easily distinguished the DLBCL, CLL and FL types of lymphoid malignancy. Three possible subgroups of DLBCL are proposed. 

One Page Abstract:

 Current array technologies make it possible to print tens of thousands of genes on a single slide, allowing simultaneous analysis of tens of thousands of different genes in a single experiment. The challenge now is to interpret such massive data sets. We propose a two-level approach to simplify the exploration of complex DNA microarray data: the first step is to extract the fundamental patterns of gene expression inherent in the data; the second is a more detailed investigation of particularly interesting groups of genes or samples through resourceful visualization. The strategy is a combination of clustering, self-organizing maps and principal component analysis. Most of the analytical calculations and graphical features are provided by MATLAB. To demonstrate the value of such analysis, the approach is applied to diffuse large B-cell lymphoma (DLBCL), with expression patterns of 3906 unique genes for 96 normal and malignant lymphocyte samples. With this new approach, we not only easily distinguished the DLBCL, CLL and FL types of lymphoid malignancy and confirmed earlier suggestions that there are two subtypes of DLBCL (GCB and ACT type), but also discovered two possible subgroups within the ACT type of DLBCL. These new findings demonstrate the value of this approach, which can be applied to the study of other massive DNA microarray data sets. 
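
The PCA component of such a strategy can be sketched in a few lines (a toy Python example; real use would start from the 3906-gene matrix): samples are projected onto the first two principal components via the singular value decomposition.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(96, 200))      # 96 samples x 200 genes (toy data)
    Xc = X - X.mean(axis=0)             # center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T              # sample coordinates on PC1 and PC2
    print(scores.shape)                 # (96, 2), ready for a 2-D plot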


164. Including protein-protein interaction networks into supervised classification of genes based on gene expression data
Joachim Theilhaber, Christoph Brockel, Michael Heuer, Steven Bushnell, Aventis Pharmaceuticals, Cambridge Genomics Center;
joachim.theilhaber@aventis.com
Short Abstract:

 For selecting genes in specific pathways, we have extended our supervised classifier GENNC (Gene Expression Nearest-Neighbor Classifier, previously successful in finding genes in the mouse osteogenic and myoblastic pathways) by including protein-protein interaction information in its distance metric. We report results for both yeast and mammalian systems. 

One Page Abstract:

 For selecting genes involved in specific regulatory or metabolic pathways on the basis of microarray expression data, we previously developed GENNC (Gene Expression Nearest-Neighbor Classifier)*. GENNC is a supervised classification scheme using the k-nearest-neighbor method that classifies genes based on their co-regulation with members of a biological training set. GENNC has been successfully applied to finding genes in the mouse osteogenic and myoblastic pathways. We extend the classifier by including protein-protein interaction information in the distance metric, with a P-value measure of the statistical significance of the metric. The P-value is essential for filtering out noisy data so as to maintain reasonable classifier sensitivity in the presence of very large data sets. Benchmark results based on yeast microarray data will be presented, as well as more tentative work involving mammalian data. In all cases we emphasize using cross-validation error rates for evaluating and optimizing classifier performance, an issue of critical importance when selecting potential drug targets for further labor-intensive experimental biological validation.
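
The nearest-neighbor core of such a classifier can be illustrated as follows (a hedged Python sketch, not GENNC itself; the combined expression/interaction metric and its P-value filter are not reproduced): an unlabeled gene receives the majority label among its k most similar training genes under a correlation distance.

    import numpy as np
    from collections import Counter

    def knn_label(query, train_profiles, train_labels, k=3):
        # correlation distance between the query and each training gene
        d = [1 - np.corrcoef(query, t)[0, 1] for t in train_profiles]
        nearest = np.argsort(d)[:k]
        return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

    rng = np.random.default_rng(4)
    train = rng.normal(size=(20, 10))          # 20 genes x 10 conditions
    labels = ["pathway"] * 10 + ["other"] * 10
    query = train[0] + rng.normal(0, 0.1, 10)  # perturbed copy of gene 0
    print(knn_label(query, train, labels))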

* ``Finding Genes in the C2C12 Osteogenic Pathway by k-Nearest-Neighbor Classification of Expression Data'', Joachim Theilhaber, Timothy Connolly, Steven Bushnell and Aventis Osteoporosis Team, Pacific Symposium on Biocomputing 2001, Mauna Lani, Hawaii, Jan 3-7 2001.


165. Comparative Splicing Pattern Analysis between Mouse and Human Exon-skipped Transcripts
Tzu-Ming Chern, Winston Hide, South African National Bioinformatics Institute, University of Western Cape;
tzuming@sanbi.ac.za
Short Abstract:

 We have developed a system to unequivocally capture putative exon-skipped transcripts. In pilot studies, we observed a small sample demonstrating differential splicing patterns between mouse and human exon-skipped transcripts. Current work on differential splicing patterns between mouse and human will be presented.

One Page Abstract:

 We have developed a system to unequivocally capture putative exon-skipped transcripts by mapping these transcripts back to their respective genomic sequences. A pilot study of 138 mouse genes and 30 human genes has been used to assess the occurrence of exon-skipping in these organisms. Preliminary analyses suggest that the rate of exon-skipping in human is higher than in mouse. We observed that all of the exon-skipped genes in human have high EST abundance (>50 ESTs), whereas only 70% of the mouse skipped genes do. Our tissue-level analyses suggest a significant correlation between high EST abundance in tissues and high exon-skipping frequency in both mouse and human exon-skipped genes. We found that tumorous tissues in both mouse and human have the highest numbers of exon-skipped ESTs. Our protein analyses suggest that some of the skipped exons in mouse and human encode domains and families that are important for enzymatic and DNA-binding functions. We have also observed a differential splicing pattern in a small sample of mouse and human exon-skipped genes. Current investigations into the differential splicing patterns of mouse and human exon-skipped transcripts will be presented.


166. Non-parametric statistics of gene expression data
Yuzhen Ye, Shanghai Institute of Biochemistry and Cell, Chinese Academy of Sciences;
Haixu Tang, Department of Mathematics, University of Southern California;
yeyz@sunm.shcnc.ac.cn
Short Abstract:

 Though applications of both classical and recently developed classification algorithms to gene expression data mining have been reported, it is still a statistical challenge to extract useful information from high dimensional gene expression data. Here we introduce non-parametric statistical methods to attack this problem, and their advantages are also discussed.

One Page Abstract:

 Non-parametric statistics of gene expression data

Yuzhen Ye, Shanghai Institute of Biochemistry and Cell, Chinese Academy of Sciences, Shanghai 200031, China; e-mail: yeyz@sunm.shcnc.ac.cn

Haixu Tang, Department of Mathematics, University of Southern California, Los Angeles, CA 90089, USA; e-mail: tanghx@hto.usc.edu

Gene expression data involving the differentiation between tumor and healthy tissue samples have recently become available (Alon99, Golub00). Analyses of these data sets focus on clustering different genes into subsets that are co-expressed across different conditions. Clustering has turned out to be a successful method for identifying functionally related gene families. Similar methods can also be used to classify different tissue types based on their gene expression profiles (Alon99). 

Clustering is a so-called "unsupervised" classifier, which doesn't use the tissue type annotation directly; this information is only used for evaluating the results. In contrast, "supervised" methods try to predict the classification of new tissues based on knowledge gained by training on examples of tissues that have been previously classified. Basically, this problem may be illustrated in the following way. Suppose we have m tumor tissue samples and n healthy tissue samples, often referred to as the "training set". We measure the expression level of N genes in each of these samples. Now how can we identify a subset of these genes, referred to as feature genes, so that we can correctly predict the tissue types of other type-unknown samples, referred to as the "testing set", based on their expression levels? This looks like a typical classification problem. In fact, some classical and recently developed classification algorithms, such as nearest-neighbour classifiers, linear discriminant analysis, classification trees, bagging, boosting and support vector machines, have been applied to this problem, and comparison results have been reported (Ben Dor00, Duroit00).

However, two aspects of this type of expression data are worth emphasizing. First, the experiments are rarely replicated, and hence there are many experimental errors in the data. Second, the annotation of the tissue types may not coincide with their real properties, because there are sampling mistakes in the tissue preparation. Both make the precise values of the expression data less reliable. Nevertheless, there is still information structure hidden inside the data, and we believe non-parametric statistical methods are more suitable for extracting it from such high-dimensional data than the exact statistics mentioned above.

The poster will discuss the application of non-parametric statistical methods to the tissue classification problem, specifically the following three topics: (a) detecting outlier tissues in the training set; (b) identifying genes that are differentially expressed in tumor and healthy tissues; and (c) predicting the tissue types in the testing set.

Note that, from a statistical point of view, the first and third problems are quite similar: tumor tissues can be considered outliers within the healthy tissue group, and vice versa. This similarity will be addressed in detail.
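
As one concrete example of the class of methods discussed, a rank-based test uses only the ordering of the expression values, so it is insensitive to outliers and to the exact measured intensities (a minimal Python sketch with toy data; the specific tests used in the poster are not stated here):

    import numpy as np
    from scipy.stats import ranksums

    rng = np.random.default_rng(5)
    tumor = rng.normal(2.0, 1.0, size=25)    # toy expression of one gene
    healthy = rng.normal(0.5, 1.0, size=15)

    stat, p = ranksums(tumor, healthy)       # Wilcoxon rank-sum test
    print(f"rank-sum statistic={stat:.2f}, p={p:.2e}")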


167. Transcriptome and proteome analysis of Escherichia coli during high cell density cultivation
Sang Yup Lee, Sung Ho Yoon, Mee-Jung Han, KAIST;
Jong Shin Yoo, Korea Basic Science Institute;
Geunbae Lim, Samsung Advanced Institute of Technology;
leesy@mail.kaist.ac.kr
Short Abstract:

 High cell density cultivation of E. coli was carried out under a constant specific growth rate, and the transcriptome and proteome profiles were analyzed using DNA microarray and 2D-gel electrophoresis. The detailed results on the variation of transcriptome and proteome profiles will be presented along with the possible physiological explanations.

One Page Abstract:

 The recent completion of Escherichia coli genome sequencing signals the necessity of developing new strategies for answering basic questions concerning cellular function. High cell density cultivation (HCDC) is an essential biochemical engineering practice for achieving high-level production of various bioproducts. True process optimization by fed-batch culture is often hampered by limited knowledge of physiology and metabolism at high cell density. We have manufactured a DNA microarray containing 2,850 genes, including all functionally known and putative ones. An exponential feeding strategy was adopted for high cell density cultures of E. coli in order to reduce pH variation, by-product formation, and other inhibitory culture conditions. DNA microarrays can be used to compare global changes in gene expression that occur in response to an environmental stimulus, or to compare the effects of genetic changes on gene expression. This analysis can provide important information about cell physiology and has the potential to identify connections between regulatory or metabolic pathways that were not previously known. Proteome analysis using two-dimensional gel electrophoresis (2D-gel) in conjunction with MALDI-TOF can also provide valuable information for elucidating the integrated cellular responses of bacterial cells growing under various environments. Two-dimensional gel electrophoresis is a powerful tool for identifying proteins with different expression profiles under qualitatively or quantitatively different culture states. Therefore, combined analysis of transcriptome and proteome profiles can supply a large amount of reliable data for studies aimed at understanding microorganisms under various culture conditions. In this study, we report a combined analysis of the transcriptome and proteome of E. coli cells during high cell density cultivation. Fed-batch fermentation of E. coli was carried out until the maximum cell density reached 74 g dry cell weight/L (OD600 of ca. 230), and the transcriptome and proteome were then analyzed using DNA microarrays and 2D-gel electrophoresis. We discuss the remarkable and interesting changes in gene expression during HCDC and also suggest possible strategies for efficient fermentation derived from the transcriptome and proteome analysis. [This work was supported by the Korean Ministry of Commerce, Industry and Energy and by the Korean Ministry of Science and Technology under the NRL program.] 


168. Molecular signatures of commonly fatal carcinomas: predicting the anatomic site of tumor origin
Andrew I. Su, The Scripps Research Institute;
John B. Welsh, Lisa M. Sapinoso, Suzanne G. Kern, Petre Dimitrov, Hilmar Lapp, The Genomics Institute of the Novartis Research Foundation;
Peter G. Schultz, The Scripps Research Institute, The Genomics Institute of the Novartis Research Foundation;
Steven M. Powell, Christopher A. Moskaluk, Henry F. Frierson, Jr., University of Virginia Health System;
Garret M. Hampton, The Genomics Institute of the Novartis Research Foundation;
asu@scripps.edu
Short Abstract:

 We have constructed a molecular classification scheme based on mRNA profiling for ten groups of commonly fatal carcinomas. We identified sets of genes that are uniquely characteristic of each tumor type, and used these genes to correctly predict the anatomic site of origin for 90% of 176 carcinomas.

One Page Abstract:

 Histopathological classification of human tumors is fundamental for the optimal treatment of patients with cancer. Here, we used mRNA profiling and supervised machine learning algorithms to construct a molecular classification scheme for the ten most commonly fatal carcinomas in the United States. We identified gene subsets whose expression is uniquely characteristic of each tumor type, and show that these genes can be used to accurately predict the anatomic site of origin for 90% of 176 carcinomas, including metastatic lesions and cancers whose microscopic features of tissue origin were not readily identifiable. A number of the genes that distinguish one tumor type from another are potential diagnostic and pharmacologic targets. This study demonstrates the existence of gene subsets whose expression is unique to specific carcinomas, and illustrates the feasibility of predicting the tissue origin of a cancer in the context of multiple cancer classes.


169. Tuning Sub-networks Inference by Prior Knowledge on Gene Regulation
Barak Shenhav, Department of molecular genetics, Weizmann institute of science, Rehovot, 76100, Israel;
Dana Teltsh, Dana Pe'er, School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel;
Aviv Regev, Department of Cell Research and Immunology, Life Sciences Faculty, Tel Aviv University, Tel Aviv, 69978, Israel and D;
Gal Elidan, Nir Friedman, School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel;
barak.shenhav@weizmann.ac.il
Short Abstract:

 Bayesian networks are used to reconstruct statistically significant gene interactions and to infer gene sub-networks from expression profiles. Here we show that by constraining the learning procedure with additional information on gene regulation, more refined and accurate networks may be inferred, reflecting a wider scope of biological knowledge.

One Page Abstract:

 Genome-wide expression profiles obtained using microarrays provide insight into molecular pathways and genetic networks. However, due to the complexity of these systems, the task of reconstructing genetic networks from the currently limited expression data remains a challenge.

Friedman et al. [1] suggested modeling genetic interactions using Bayesian networks. According to this model, each gene's expression level is represented by a random variable, and interactions between genes (e.g. induction or repression) are treated as probabilistic dependencies. This provides a framework both for reconstructing individual gene interactions (features) and for inferring entire significant sub-networks [2]. Due to the limited amount of data, many alternative networks may be inferred, resulting in multiple putative features with varying levels of confidence, as estimated by non-parametric bootstrap.

While expression profiling data is scarce, gene regulation has already been extensively studied by other experimental approaches. These studies have culminated in a large body of knowledge, primarily focused on inducers and repressors.

Here, we incorporate this additional information on gene regulation into the Bayesian network learning framework. We constrain our inference to networks which are consistent with prior knowledge of regulation. These constraints can be relaxed and applied in a probabilistic manner, based on our confidence in this information. This allows us to include both proven and predicted interactions as part of our biological knowledge base.

We applied our approach to the Saccharomyces cerevisiae expression profiles in the Rosetta Compendium [3]. We focused on a selected subset of genes and extracted the relevant regulation information from the YPD database [4]. The data were pre-processed and treated similarly to Pe'er et al. [2]. We compare our results with those based solely on expression data, show the improvement in the quality and structure of some of the subnetworks, and discuss their importance for revealing more accurate interactions and better structured networks.

[1] N. Friedman, M. Linial, I. Nachman and D. Pe'er. Using Bayesian Networks to Analyze Expression Data. Journal of Computational Biology, 7:601-620, 2000.

[2] D. Pe'er, A. Regev, G. Elidan and N. Friedman. Inferring Subnetworks from Perturbed Expression Profiles. In 9th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2001.

[3] Hughes, T. R., M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend (2000). Functional discovery via a compendium of expression profiles. Cell 102(1):109-26.

[4] Proteome Yeast Protein Database. http://www.proteome.com/databases/YPD/


170. Detection of alternative expression by analysis of inconsistency in microarray probe performance
Andrey Ptitsyn, Genomics Institute of the Novartis Research Foundation;
ptitsyn@gnf.org
Short Abstract:

 A study of probe behavior on the Affymetrix U95A and HS1 (developed by GNF) chips suggests that inconsistency in the hybridization performance of some probes relative to the rest of their set can be explained by alternatively expressed gene variants. We suggest a statistical metric for mining alternatively expressed genes in microarray databases. 

One Page Abstract:

 Biochips of the Affymetrix type are constructed so that each gene is represented by a number of oligonucleotide probes. Ideally, all probes in a probe set are supposed to produce similar intensities in each hybridization experiment, with some degree of variance reflecting noise from scanning, image analysis and other sources. Yet in some cases individual probes exhibit behavior significantly different from the other probes in the same set. We have studied the consistency of probe behavior in more than 50 experiments conducted on two biochips: Affymetrix U95A and HS1, developed by GNF. On both chips, probe sets with inconsistently behaving probes were selected. Evidence has been collected indicating that inconsistent probes may belong to alternatively expressed gene variants (effects of alternative splicing and/or alternative polyadenylation). We also suggest a statistical metric for effective data mining of alternatively expressed genes in microarray databases.
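
One simple way to quantify such inconsistency (an illustrative Python sketch, not the authors' metric; the 0.3 cutoff is a hypothetical choice) is to flag probes whose intensity profile across experiments correlates poorly with the median profile of their probe set:

    import numpy as np

    rng = np.random.default_rng(6)
    # 16 probes x 50 chips sharing a common trend, plus noise
    probes = rng.normal(size=(16, 50)) + np.linspace(0, 3, 50)
    probes[4] = rng.normal(size=50)     # one probe decoupled from the set

    median_profile = np.median(probes, axis=0)
    for i, p in enumerate(probes):
        r = np.corrcoef(p, median_profile)[0, 1]
        if r < 0.3:                     # inconsistent with the probe set
            print(f"probe {i}: correlation with set median = {r:.2f}")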


171. Which clustering algorithms best use expression data to group genes by function?
Frank D Gibbons, Frederick P Roth, Dept of Biological Chemistry and Molecular Pharmacology;
fgibbons@hms.harvard.edu
Short Abstract:

 For inferring gene function, we assert that the best expression-based gene clustering algorithm is one which best groups genes by function. We scored commonly used clustering algorithms using a figure-of-merit based on total mutual information between clusters and a large set of S. cerevisiae gene attributes.

One Page Abstract:

 Clustering genes based on their expression patterns has proven useful as an exploratory data analysis tool. In particular, it has been observed that clustering by expression has a tendency to group genes of similar function together. This fact has led to the idea of 'guilt-by-association', or inference of gene function based on expression. Many clustering methods are in use, but little guidance is available on which are most suitable for this purpose. Data-driven figures of merit for clustering algorithms have been applied, but do not directly address this question. We assert that the best algorithm for inference of gene function based on expression is the one which best clusters genes according to their function. We developed a figure of merit for expression-based clustering algorithms based on the total mutual information between clusters and a large set of gene attributes. Using a collection of Saccharomyces cerevisiae gene expression data and gene annotation from the Saccharomyces Genome Database and Gene Ontology Consortium, we applied this figure-of-merit to evaluate commonly used clustering algorithms, data transformations, expression-based distance measures between genes, and the most appropriate number of clusters.
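
The figure of merit can be sketched as follows (a minimal Python example with toy labels rather than real annotation; the full score sums this quantity over a large attribute set):

    import numpy as np

    def mutual_information(clusters, attribute):
        """MI (in bits) between integer cluster labels and a 0/1 attribute."""
        mi = 0.0
        for c in np.unique(clusters):
            for a in (0, 1):
                p_ca = np.mean((clusters == c) & (attribute == a))
                p_c, p_a = np.mean(clusters == c), np.mean(attribute == a)
                if p_ca > 0:
                    mi += p_ca * np.log2(p_ca / (p_c * p_a))
        return mi

    rng = np.random.default_rng(7)
    clusters = rng.integers(0, 4, size=200)  # 4 clusters over 200 genes
    attr = (clusters == 0).astype(int)       # attribute aligned with cluster 0
    print(f"MI = {mutual_information(clusters, attr):.3f} bits")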


172. Visualization and Analysis Tool for Gene Expression Data Mining
Alexander Sturn, Institute of Biomedical Engineering, Graz University of Technology, Krenngasse 37, 8010 Graz, Austria;
John Quackenbush, The Institute for Genomic Research, Rockville, MD 20850, USA;
Hackl Hubert, Zlatko Trajanoski, Institute of Biomedical Engineering, Graz University of Technology, Krenngasse 37, 8010 Graz, Austria;
alexander.sturn@tugraz.at
Short Abstract:

 We have developed a platform-independent Java suite which integrates various tools for microarray gene expression data mining, including filters, normalization and visualization tools, as well as hierarchical and non-hierarchical clustering algorithms incorporating multiple similarity/distance measures. Additionally, it is possible to map gene expression data onto chromosomal sequences.

One Page Abstract:

 High throughput gene expression analysis is becoming increasingly important in many areas of basic and applied biomedical research. Oligonucleotide and cDNA microarray technology is a very promising approach for high throughput transcriptome analysis and provides the opportunity to study gene expression patterns on a genomic scale. Thousands or even tens of thousands of genes can be spotted on a single microscope slide, and the relative expression level of each gene can be determined by measuring the fluorescence intensity of labeled mRNA hybridized to the arrays. Beyond the simple discrimination of differentially expressed genes, functional annotation (guilt-by-association), diagnostic classification, and the investigation of transcriptional control mechanisms (coregulation from coexpression) require the clustering of genes from multiple experiments into groups with similar expression patterns. Several clustering techniques have recently been developed and applied to microarray data. However, to the best of our knowledge, there is no single tool that integrates the common clustering and visualization methods and allows easy comparison of results from different clustering approaches. We have developed a versatile, platform-independent, and easy-to-use Java suite for the simultaneous visualization and analysis of a whole set of gene expression experiments. After reading the data from flat files through a flexible plug-in interface, several graphical representations of all intensity values can be generated, showing a matrix of experiments and genes in which multiple experiments and genes can easily be compared with each other. Each gene can be linked to additional information at the NCBI. Several filters and normalization procedures are provided to obtain the best possible representation of the data for further statistical analysis. Eleven different similarity/distance measures have been implemented, ranging from simple Pearson correlation to more sophisticated approaches such as mutual information and rank correlation coefficients. The most commonly used hierarchical and non-hierarchical clustering and classification algorithms have been implemented to identify similarly expressed genes and extract expression patterns inherent in the data, including: (1) hierarchical clustering, (2) k-means, (3) self-organizing maps, (4) principal component analysis, and (5) support vector machines. An important and valuable feature of this software is the ability to compare clustering results from different clustering techniques and parameter settings, which can provide the researcher with additional information compared to a single-method approach. Additionally, it is possible to map gene expression data onto chromosomal sequences to enhance the investigation of regulatory mechanisms; genes at consecutive chromosomal locations are often co-expressed and can easily be identified by this method. Finally, extensive work has been undertaken to accomplish visualization of the gene expression data and clustering results in a user-friendly and intuitive way. The flexibility, the variety of analysis and data visualization tools, as well as the transparency and portability, give this software suite the potential to become a valuable tool in functional genomic studies. 


173. Prediction of co-regulated genes in Bacillus subtilis based on the conserved upstream elements across three closely related species
Goro Terai, INTEC Web and Genome Informatics Corp.;
Toshihisa Takagi, Kenta Nakai, Human Genome Center, Institute of Medical Science, University of Tokyo;
terai@ims.u-tokyo.ac.jp
Short Abstract:

 The conservation information of three closely related species, Bacillus subtilis, Bacillus halodurans, and Bacillus stearothermophilus, was used to predict co-regulated genes of B. subtilis. We will report the results of extensive comparison between our prediction (cis-elements and regulons) and known examples using our database on B. subtilis transcription, DBTBS (http://elmo.ims.u-tokyo.ac.jp/dbtbs/). 

One Page Abstract:

 Identification of co-regulated genes is essential for elucidating transcriptional regulatory networks and the function of uncharacterized genes. Although co-regulated genes should share at least one common sequence element, it is generally difficult to identify such genes from the presence of this element alone, because the signal is easily obscured by noise. To overcome this problem, we used conservation information from three closely related species: Bacillus subtilis, Bacillus halodurans, and Bacillus stearothermophilus. Although even such closely related species share only a limited number of clearly orthologous genes, we obtained 3,178 phylogenetically conserved elements from the upstream intergenic regions of 1,568 B. subtilis genes. Similarity between these elements was used to cluster the genes; no other a priori knowledge about genes and elements was used. Another merit of predicting B. subtilis genes is that this species has a rich accumulation of experimental studies. We confirmed that general elements such as the -35/-10 boxes and the Shine-Dalgarno sequence are not major obstacles. Moreover, we could identify some genes known or suggested to be regulated by a common transcription factor, as well as genes regulated by a common attenuation effector. We also identified some plausible additional members of known co-regulated gene sets. Thus, our approach is promising for exploring potentially co-regulated genes.


174. Comparison of performances of hierarchical and non-hierarchical neural networks for the analysis of DNA array data
Joaquin Dopazo, Javier Herrero, Bioinformatics, CNIO, Ctra. Majadahonda-Pozuelo, Km 2, Majadahonda, 28220 Madrid, Spain;
jdopazo@cnio.es
Short Abstract:

 Unsupervised neural networks are extensively used for the analysis of DNA array data due to properties like robustness and nearly linear runtimes. Here we present a comparison of the performances of two unsupervised neural networks and an aggregative hierarchical method in terms of runtime and accuracy in the classification obtained.

One Page Abstract:

Comparison of performances of hierarchical and non-hierarchical neural networks for the analysis of DNA array data

Javier Herrero and Joaquin Dopazo

Bioinformatics, CNIO, Ctra. Majadahonda-Pozuelo, Km 2, Majadahonda, 28220 Madrid, Spain

DNA microarray technology opens up the possibility of measuring the expression level of thousands of genes in a single experiment (Brown and Botstein, 1999). Serial experiments measuring gene expression at different conditions or times, or distinct experiments with diverse tissues, patients, etc., allow gene expression profiles to be obtained under the different experimental conditions studied. Initial experiments suggest that genes having similar expression profiles tend to play similar roles in the cell. Aggregative hierarchical clustering has been extensively used for finding clusters of co-expressing genes (Eisen et al., 1998; Wen et al., 1998). Nevertheless, several authors (Tamayo et al., 1999) have noted that aggregative hierarchical clustering suffers from a lack of robustness. In addition, aggregative hierarchical clustering methods have runtimes that are at least quadratic (Hartigan, 1975), which makes them very slow when thousands of items are to be analysed. These arguments have led to the use of neural networks as an alternative to aggregative hierarchical clustering methods (Tamayo et al., 1999; Törönen et al., 1999; Herrero et al., 2001). Unsupervised neural networks, like Self-Organising Maps (SOM) (Kohonen, 1997) or the Self-Organising Tree Algorithm (SOTA) (Dopazo and Carazo, 1997), provide a more robust and appropriate framework for the clustering of large amounts of noisy data. Neural networks have a series of properties that make them suitable for the analysis of gene expression patterns: they can deal with real-world data sets containing noisy, ill-defined items with irrelevant variables and outliers, whose statistical distributions need not be parametric; they are reasonably fast; and they can easily be scaled to large data sets. Here we present a comparison of the performances of SOM and SOTA, both in terms of runtime and of accuracy of the classification obtained. The results are compared to a classical aggregative hierarchical method.

References

Brown, P.O. and Botstein, D. (1999). Nature Biotechnol. 14:1675-1680.

Dopazo, J. & Carazo, J.M. (1997) J. Mol. Evol 44:226-233.

Eisen M., Spellman P. L., Brown P. O., Botstein D. (1998). Proc. Natl. Acad. Sci. USA. 95: 14863-14868

Hartigan, J.A. (1975) Clustering algorithms. New York, Wiley

Herrero, J., Valencia, A. and Dopazo, J. (2001) Bioinformatics 17:126-136.

Kohonen, T. (1997) Self-organizing maps, Berlin. Springer-Verlag.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. & Golub, T.R. (1999) Proc. Natl. Acad. Sci. USA 96:2907-2912.

Törönen, P., Kolehmainen, M., Wong, G. & Castrén, E. (1999) FEBS letters 451:142-146.

Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L. & Somogyi, R. (1998) Proc. Natl. Acad. Sci. USA 95:334-339 


175. Linking micro-array based expression data with molecular interactions, pathways and cheminformatics
Robin Munro, Iris Ansorge, Ewald Aydt, Kay Böttcher, Eric Minch, Claus Kremoser, Thomas Meyer, Jeroen van de Peppel, Tobias Schlegl, Stefan Weiss, Martin Hofmann, LION Bioscience AG;
robin.munro@lionbioscience.com
Short Abstract:

 As the information obtained from expression data increases, it is very important to associate it with other types of information. We have bridged the gap between expression data, molecular interactions, pathways and cheminformatics with tools that can cross-communicate. We are using these to study estrogen receptors in our laboratories.

One Page Abstract:

 As the information obtained from expression data increases, it is of great importance to be able to associate it with other types of information, so as to deduce more about the way in which genes respond to changes in expression. We have bridged the gap between expression data, molecular interactions and pathways with three visualisation tools which can cross-communicate via an IAC (inter-application communication) server based on SRS (Sequence Retrieval System) technology. The SRS platform is used as the universal tool to interconnect different domains such as protein analysis and pathway analysis tools.

By combining our informatics and laboratory expertise, data can be passed seamlessly from experimental recording to expression data analysis, protein interaction analysis and pathway reconstruction. In this example we focus on estrogen receptors, a class of nuclear receptors; these are well-established therapeutic targets that are tightly connected to disease areas.

As proof of concept, we have used protein interactions from the literature and yeast two-hybrid (Y2H) experiments to identify genes which are expressed similarly across different tissues. Genes can likewise be passed to and from a pathway reconstruction tool, where relevant biological pathways can be generated and analyzed. We also demonstrate how the link between this expression analysis system and cheminformatics software allows us to correlate structural descriptors for low-molecular-weight compounds with cellular responses at the expression level.

The confrontation of cells with low-molecular-weight substances can induce drastic changes in gene expression patterns. These changes are the consequence of molecular interactions of low-molecular-weight substances with cellular targets (proteins or DNA), inducing signalling pathways and ultimately leading to changes in gene expression. In the world of pharmacology, these changes are divided into wanted therapeutic effects and unwanted side effects.

In an initial proof-of-principle experiment, we have used this system for the correlation of structure-activity relationships (SAR) of selective estrogen receptor modulators (SERMs) with the cellular response patterns induced in osteoblasts. To overcome the problem of the limited availability of bone-specific cDNA clone collections, we have generated in our laboratories a non-redundant set of cDNA fragments isolated from specialised cDNA libraries. The ultimate aim of our approach is to use this system for the prediction of biological effects based on structure/expression relationships. 


176. Classification of Acute Leukemia Gene Expression Data Using Weight Function and Principal Component Analysis
Jeongah Yoon, Jee-Hyub Kim, Biological Research Information Center, Pohang University of Science and Technology;
Hong Gil Nam, Department of Life Science, Pohang University of Science and Technology;
yja@bric.postech.ac.kr
Short Abstract:

 We present a new method for the classification of acute leukemia gene expression data based on a weight function and principal component analysis (PCA). The proposed method can handle the imprecision and high dimensionality of gene expression data. Classification of the training and test samples shows 100% and 94.12% accuracy, respectively. 

One Page Abstract:

 Recently, classification of tumor samples has become an important aspect of cancer diagnosis and treatment using DNA microarray technology. We present a new method for the classification of acute leukemia gene expression data based on a weight function and principal component analysis. The goal is to establish the best predictive classifier of the two cancer subtypes (ALL/AML). Gene expression data are characterized by very high dimensionality and contain irrelevant features; comprehensible interpretation is therefore difficult, and the large complexity of the original data incurs high computational cost. To address these problems, we describe the use of a weight function to minimize unknown imprecise information in microarray data, and of principal component analysis (PCA) to reduce the high dimensionality, before constructing the classifier, a basic mathematical model for predicting new samples. The weight assigned by the cubic weight function to each training sample is based on the Euclidean distance between that sample and the sample mean. Samples are proportionally weighted with values between 0 and 1 according to their proximity to the sample mean of each class. The main advantage of the weight-function method is that more weight is given to the samples that are closest to the mean of each class. The weighted data then go through PCA, which linearly transforms the high-dimensional data into new score values of reduced dimension without loss of important information. Next, the maximal number of PCs is determined by the eigenvalue fraction, which indicates the relative significance of the i-th principal component. Among the PCs, we automatically select the optimal two PCs that satisfy both minimum within-class variance and maximum between-class variance. The scores on the optimal two PC dimensions are applied to Fisher's discriminant function to separate the training data into the two tumor types. Finally, this classifier is used to classify unknown or new samples. The samples are divided into two sets by the provider: 38 (ALL/AML = 27/11) as an initial or training set and 34 (20/14) as an independent or test set, with each sample giving the expression levels of 7129 genes. Our results show that the discriminant function from the 38-case training set classifies the samples perfectly into the two classes (100% accuracy). For the prediction of the 34-case test set, only two cases were misclassified (2/34, 94.12% accuracy): two AMLs were labeled as ALLs. In conclusion, we propose a novel classification method using a weight function and PCA, which can solve the problems of imprecise and high-dimensional gene expression data. The algorithm can be extended to other cases, with high potential in structural and functional genomics. 
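
The weighting step can be sketched as follows; since the exact cubic form is not specified above, the tricube-style variant below is an assumption made for illustration (Python, toy data):

    import numpy as np

    def cubic_weights(X):
        """Weight in (0, 1]; samples closer to the class mean get ~1."""
        d = np.linalg.norm(X - X.mean(axis=0), axis=1)  # Euclidean distances
        u = d / (d.max() + 1e-12)       # scale distances into [0, 1]
        return (1.0 - u ** 3) ** 3      # tricube-style cubic weight (assumed)

    rng = np.random.default_rng(8)
    ALL = rng.normal(size=(27, 100))    # 27 ALL training samples (toy)
    w = cubic_weights(ALL)
    print(w.min(), w.max())             # outliers receive the smallest weights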


177. Application of Fuzzy Robust Competitive Clustering Algorithm on Microarray Gene Expression Profiling Analysis
Xudong Dai, Rutgers University;
Wei Xiong, WaveGenix, LLC;
Hichem Frigui, University of Memphis;
Tong Fang, SIMONS;
xudong@hotmail.com
Short Abstract:

 Clustering algorithms for gene expression analysis assume well-defined boundaries between clusters. This assumption may not be valid for biological processes. We propose applying Robust Competitive Agglomeration (RCA), which uses fuzzy cluster membership, on microarray data analysis. Our result suggests that the RCA is useful for gene expression profiling analysis.

One Page Abstract:

 Global gene expression profiles revealed by microarray technology can improve our understanding of complicated biological processes. Clustering algorithms have proven to be important tools for extracting meaningful patterns embedded in those transcriptional profiles. Standard hierarchical clustering and simple partitional clustering procedures have been widely used by biologists for extracting genes associated with certain cellular events or disorders. Unfortunately, these simple clustering algorithms have several inherent limitations that can affect the quality of the detected profiles. Moreover, the above clustering algorithms assume well-defined boundaries between different clusters. This assumption may not be valid for most biological processes, where complicated and extensive molecular interactions occur and can result in overlapping clusters. To overcome these limitations, we propose investigating a more robust clustering algorithm, called Robust Competitive Agglomeration (RCA). The RCA algorithm uses a fuzzy membership to handle overlapping clusters, and a robust membership to handle noise points and outliers. Moreover, RCA can efficiently find the optimal number of clusters. Thus, it is an ideal tool for gene expression profiling analysis. The performance of RCA is illustrated with real microarray data. We show that the RCA algorithm can recognize patterns embedded across different clusters and can therefore model the complicated cellular events in which extensive cross-talk and interactions along pathways may exist.
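
As a point of reference, the fuzzy-membership idea at the core of RCA can be illustrated with plain fuzzy c-means. This sketch deliberately omits RCA's competitive agglomeration and robust loss, so it is not the authors' algorithm, only the membership mechanics:

    import numpy as np

    def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
        """Each row of U holds a gene's graded memberships in the c clusters."""
        rng = np.random.default_rng(seed)
        U = rng.dirichlet(np.ones(c), size=len(X))       # rows sum to 1
        for _ in range(iters):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U = 1.0 / d ** (2.0 / (m - 1.0))             # closer centers get
            U /= U.sum(axis=1, keepdims=True)            # proportionally more weight
        return centers, U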


178. A Robust Algorithm for Expression Analysis
Earl Hubbell, Wei-Min Liu, Teresa Webster, Fred Christians, Gang Lu, Joy Fang, Rui Mei, Affymetrix;
Earl_Hubbell@affymetrix.com
Short Abstract:

 We consider the problem of estimating gene expression using oligonucleotide arrays. Such estimates should be approximately linear in concentration, non-negative, and statistically robust, and should include estimates of variation. A robust algorithm meeting these goals has been validated against spike experiments, and shows performance comparable to existing standards.

One Page Abstract:

 Gene expression analysis is one of the most important applications of oligonucleotide array technology. Samples containing a variety of transcripts are hybridized to complementary probes on a surface, and intensity data is obtained for each probe-transcript interaction. Such data is often noisy and subject to confounding cross-hybridization effects, necessitating careful attention for analysis.

Our approach to analysis of intensity data from arrays is derived from simple models linking intensity data with the underlying concentration of targets. We do not assume strong distributions for errors, nor do we assume that probes have identical properties. The two simple models are:

(1) Intensity (I) = non-transcript-related effects (NTRE) + transcript-related effects (TRE)

(2) log(TRE) = log(concentration (c)) + log(affinity (A)) + residual (R)

Intensity is the observed intensity of a given probe sequence in an experiment, and NTRE and TRE are the hidden division of these intensities into intensity related to the transcript of interest and the remainder. The concentration is the concentration of the transcript, and the affinity is the probe affinity for this transcript. Note that intensities, non-transcript-effects, and transcript-related-effects are all nonnegative, as are concentrations and affinities. For stability, we will assume that zero values are actually small positive values. 

Because probe affinities can vary widely, and are typically stable across experiments, we note the derived model, describing the transcript-related effect for the same probe sequence in two experiments x and y:

(2a) log(TRE(x)) - log(TRE(y)) = log(c(x)) - log(c(y)) + residual

This derived relationship will be used in comparative analysis across experiments. These models capture broad empirically observed properties of oligonucleotide behavior in experiments.

Given these two models, we build an algorithm in layers starting from the intensity values. In the first layer, we estimate the non-transcript-related-effect for each intensity, and remove it to obtain an estimate (probe value (PV)) of the transcript-related-effects. We take care to ensure that this value is positive, as the transcript-related-effects and concentrations are positive. Within this procedure we can also estimate the significance of TRE vs NTRE, which corresponds to making a call of present or absent in the Affymetrix standard software. 

In the second layer, we estimate the effect due to the target concentration by combining the individual probe values using robust statistics on the log-scale. We observe that intensities have experimental variation that increases with intensity, which suggests strongly that a log-transformation of the data will stabilize the variance. We use a robust statistic, the one-step Tukey biweight, to obtain location and scale estimates for the data. The biweight is known to have excellent behavior in the face of outliers, and using a single step avoids issues of convergence. 
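
A minimal sketch of the one-step Tukey biweight location estimate on log-scale probe values follows. The tuning constant c = 5 and the MAD-based scale are conventional textbook choices, not values taken from the abstract:

    import numpy as np

    def one_step_biweight(log_pv, c=5.0):
        """One-step Tukey biweight: a robust mean of log probe values."""
        t = np.median(log_pv)                        # initial location
        s = np.median(np.abs(log_pv - t)) + 1e-12    # robust scale (MAD)
        u = (log_pv - t) / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)  # outliers get weight 0
        return np.sum(w * log_pv) / np.sum(w)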

In a variant of the second layer, we estimate the log-ratio of transcript concentrations in two experiments. Because of the great disparity in affinities of probes for transcripts, the appropriate statistical tool is a "paired-sample" test. First we obtain a probe log ratio (PLR) by differencing the log(PV) in each experiment. Then we apply our biweight statistic to obtain an estimate of the log-ratio of concentrations. Under reasonable assumptions about the nature of the residuals, the resulting estimate of scale can be used to test the significance of this value. 

 An algorithm following these procedures achieves the design goals - non-negative results, approximate linearity (given approximately linear probe behavior), and robustness against outliers. The performance of this algorithm was checked against an extensive panel of spike-in experiments with complex backgrounds, and was found to be comparable with the existing standard Affymetrix algorithm.


179. Analysis of Gene Expression by Short Tag Sequencing - Theoretical Considerations
Per Unneberg, Magnus Larsson, Dept of Biotechnology, Royal Institute of Technology (KTH), Stockholm;
Anders Wennborg, Dept of Biosciences, Karolinska Institute, Stockholm;
peru@biochem.kth.se
Short Abstract:

 We have focused on certain aspects that are essential for a reliable analysis of short sequence data in relation to the existing transcript index databases. These aspects include the influence of tag length, tag uniqueness and restriction enzyme recognition site frequencies.

One Page Abstract:

 Gene expression analysis has lately received much attention due to the advent of hybridisation array technologies. Alternative methods, based on cDNA-sequencing techniques, suffer the disadvantage of having a lower throughput. Still, these methods provide important information about gene expression. Firstly, previously unknown transcripts can be detected by showing the actual sequence contents of the sample without the need for pre-selection of probes. Secondly, with sufficiently large samples, quantitative information can be obtained about genes expressed at very low levels, falling below the detection limit of hybridisation array methods. The low throughput has been addressed by devising methods based on isolation of short sequence tags from each sampled mRNA. Examples of such methods are Serial Analysis of Gene Expression (SAGE), Tandem Arrayed Ligation of Expressed Sequence Tags (TALEST), and pyrosequencing. In general, tags of 10-20 bp length downstream of a given restriction enzyme cleavage site are used to identify the original mRNA.

We have focused on certain aspects that are essential for a reliable analysis of such short sequence data in relation to the existing transcript index databases. Firstly, two human transcript databases, RefSeq and UniGene, were analysed to investigate the reliability of short tag identification. Short tags were generated from transcript sequences based on a range of possible restriction enzyme recognition sites. For the enzyme NlaIII, which is commonly used in SAGE, approximately 5% of the transcripts were not identifiable by short tags, either because they lacked restriction enzyme recognition sites or because the generated 3'-tags were shorter than 10 bp. However, more than 90% of 10 bp tags were found to uniquely identify a transcript. Secondly, the specificity in identifying transcripts by the sequence similarity search algorithm BLAST was investigated with different sequence tag lengths. We found a tag-length in the interval 17-20 bp (including the restriction enzyme recognition site) to be sufficient for transcript identification by BLAST, while longer tag lengths did not appreciably improve the results. 
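
To make the tag-generation step concrete, here is a small sketch of extracting a SAGE-style tag downstream of the 3'-most NlaIII site (CATG) and tallying tag uniqueness over a transcript set. Taking the tag to exclude the recognition site itself is one convention, chosen here for brevity:

    from collections import Counter

    def sage_tag(transcript, tag_len=10):
        """Tag immediately downstream of the 3'-most CATG, or None."""
        pos = transcript.rfind("CATG")
        if pos == -1:
            return None                        # transcript lacks the site
        tag = transcript[pos + 4 : pos + 4 + tag_len]
        return tag if len(tag) == tag_len else None  # too close to the 3' end

    def tag_uniqueness(transcripts, tag_len=10):
        """Return (number of tags seen once, number of distinct tags)."""
        counts = Counter(t for t in (sage_tag(s, tag_len) for s in transcripts) if t)
        return sum(1 for v in counts.values() if v == 1), len(counts)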


180. Computational analysis of RNA splicing by intron definition in five organisms
Lee Lim, Phillip Sharp, Chris Burge, MIT;
leelim@mit.edu
Short Abstract:

 Splicing of short introns from five eukaryotes was simulated using five features: the splice sites, the branch signal, intron length, and intron composition. The contribution of each of these features to splicing accuracy was analyzed, and the amount of information required for highly accurate splicing was estimated.

One Page Abstract:

 A goal of research on pre-mRNA splicing is to write down (or implement in a computer program) a set of rules which describes how the splicing machinery identifies the precise locations of exons and introns in a transcript. Although this goal has not yet been realized, concepts such as intron definition have been developed to explain how the spliceosome recognizes introns. Short introns are likely spliced by the intron definition mechanism, where the 5' and 3' splice signals are initially recognized and paired in an intron-spanning interaction.

Taking advantage of the recent availability of genomic sequences from five eukaryotes, we used a computational approach to: 1) analyze how well the intron definition model could splice short introns in these organisms and 2) understand the contribution of different transcript features to this process. Using datasets of reliably annotated transcripts from each organism, we identified populations of short introns in each dataset. Five features known or hypothesized to be involved in intron definition were analyzed: the 5' splice signal, the 3' splice signal, the branch signal, intron length preference, and intron composition. Using the concept of relative entropy from information theory, the information content of each of the five features was measured, giving a quantitative estimate of how much each feature could contribute to splicing specificity. In addition, a Monte Carlo method was used to estimate the amount of information necessary for accurate splicing of short introns: approximately 30-35 bits, depending on the organism.

A program, IntronScan, was developed which uses the five features to identify the locations of short introns in transcripts. High accuracies of splicing (94-95%) could be attained in Drosophila and C. elegans, with the bulk of information deriving from the 5' and 3' splice signal motifs. S. cerevisiae was unique in deriving a large percentage of its information from the branch signal. However, the 3' splice site signal was not precisely identified in 15% of S. cerevisiae introns, implying that our knowledge of 3' splice site selection in this organism is incomplete. In Arabidopsis, the 5' and 3' splice signals are relatively weak, and are not sufficient to reliably identify introns. However, use of the intron composition feature resulted in dramatic improvements in accuracy (from 68% to 92%).

In Arabidopsis, Drosophila, and human, closer analysis of the intron composition feature showed that a large percentage of the improvement in accuracy obtained with this feature could be attributed to small sets of sequence motifs; some of these potential intronic enhancers have already been experimentally verified while others have not yet been experimentally tested. Even with the use of the intron composition feature, the highest accuracy obtained in human was 85%, suggesting that other features not considered in our analysis must provide substantial amounts of information for splicing in vertebrates.
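
The information-content calculation referred to above can be sketched in a few lines: the relative entropy of a signal's position-specific base frequencies against a background composition, summed over positions. The uniform background in the usage comment is an illustrative assumption:

    import numpy as np

    def signal_information(freqs, background):
        """Relative entropy (bits) of a signal: sum over i,b of f[i,b]*log2(f[i,b]/p[b])."""
        f = np.asarray(freqs, float)               # (positions x 4) frequencies
        p = np.asarray(background, float)          # background base composition
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(f > 0, f * np.log2(f / p), 0.0)
        return float(terms.sum())

    # e.g. a toy two-position signal against a uniform background:
    # signal_information([[0.05, 0.05, 0.85, 0.05],
    #                     [0.85, 0.05, 0.05, 0.05]], [0.25] * 4)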


181. Image Analysis and Feature Extraction Methods for High-Density Oligonucleotide Arrays
Hilmar Lapp, Yingyao Zhou, Peter Dimitrov, Genomics Institute of the Novartis Foundation (GNF);
Ruben Abagyan, The Scripps Research Institute, La Jolla, CA, USA;
lapp@gnf.org
Short Abstract:

 Our goal was to devise an image analysis algorithm for high-density oligonucleotide array images that is both accurate and insensitive to artifacts on the feature level, and that can deal with slightly distorted grids. We will demonstrate the relative performance of different model-based and ad-hoc algorithms by consistency across replicated experiments.

One Page Abstract:

 Hilmar Lapp [1,2], Yingyao Zhou [1], Peter Dimitrov [1], Ruben Abagyan [1,3]

 [1] Genomics Institute of the Novartis Research Foundation, San Diego, USA; [2] Novartis Research Institute, IFD/CBC, Vienna, Austria; [3] The Scripps Research Institute, San Diego, USA; lapp@gnf.org

 Motivation: High-density oligonucleotide arrays have become a favorite technology for large-scale gene expression profiling [1] and genomic hybridization-based research projects [2]. Many applications, like genome-wide tissue profiles, treatment-response studies, etc., rely on quantitative expression signals obtained from the hybridization images of the array. The method of choice for image analysis of Affymetrix oligonucleotide array images is usually the GeneChip software provided by Affymetrix, which by default employs a quantile-based quantification method using all but the border pixels of a cell [3]. Apart from being empirical, this method is sensitive to certain artifacts in the chip image, like broad inter-cell stripes and small bright stars. Our goal was to devise an Affymetrix array image analysis algorithm that is both accurate and insensitive to artifacts on the feature level, and that can deal with slightly distorted grids. We defined accuracy in terms of consistency of expression signals across replicate chips.

Results: We divided the task of image analysis and feature extraction into two steps, namely locating the grid of cells on the image, and quantifying each cell. Our method for locating the cells uses the corner coordinates provided when the array was scanned, and approximates a distortion of the grid from the expected rectangle as a continuous effect. The quantification of each cell was also split into two tasks, the first being pixel selection and the second being quantification of the previously selected pixels. We implemented several model-based as well as ad-hoc algorithms for both pixel selection and quantification. We will demonstrate the relative performance of the different possible algorithms as assessed by consistency of expression signals across replicated experiments, and we will show how the algorithms can deal with cell-level artifacts. It turned out that the performance strongly depends on the actual shape of a cell's pixel intensity profile, which is biased towards different intensity ranges.

[1] Lockhart D.J. et al. (1996), Nature Biotechnology 14:1675-1680; Wodicka L. et al. (1997), Nature Biotechnology 15:1359-1366
[2] Lipshutz et al. (1999), Nature Genetics Suppl. 21:21-24; Lockhart D.J. and Winzeler E. (2000), Nature 405:827-836
[3] Affymetrix GeneChip Manual; see e.g. Winzeler E. et al. (1999), Science 285:901-906


182. Transcriptional control mechanisms in the global context of the cell
Johan Elf, Måns Ehrenberg, Uppsala University, Department of Cell and Molecular Biology;
johan.andersson@icm.uu.se
Short Abstract:

 The quality of molecular control mechanisms can be evaluated from their contribution to the fitness of the organism. In this study we have compared the attenuation and repressor mechanisms for regulation of amino acid biosynthetic operons. The conclusion is that a repressor system can sustain a higher growth rate.

One Page Abstract:

 A hierarchy of feedback loops that employ a great variety of molecular mechanisms achieves control of gene expression in bacteria. Understanding the dynamics of these control systems and evaluating their quality require global mathematical models of growing cells under different external conditions. In such a theoretical framework the quality of a control loop can be evaluated in terms of its impact on the growth rate of the population of cells, and quantified by a "fitness parameter". 

We have compared repressor systems and attenuation mechanisms in control of expression of amino acid biosynthetic operons in Escherichia coli. The control systems' ability to optimally balance the investment in the biosynthetic enzymes to the demand raised by protein synthesis will determine the growth rate and thereby the fitness of these control systems. The analysis is based on a mathematical model for whole cells taking into account the synthesis of twenty amino acids, aminoacylation of tRNAs and the consumption of amino acids by ribosomes in the making of new proteins. 

 The signals for the repressor and attenuation mechanisms are the concentrations of amino acids and aminoacyl-tRNAs, respectively. The flows through the amino acid and aminoacyl-tRNA pools are large and insensitive to pool size. The signals are therefore very sensitive to how synthesis and consumption of amino acids are balanced. In the language of automatic control theory, these ubiquitous intracellular feedback mechanisms display "bang-bang" control. Their general design principle almost automatically brings the intracellular enzyme concentrations to their optimal values.
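
A toy numerical sketch of the repressor case may help fix ideas: enzyme expression is repressed by the amino acid pool, and because flux through the pool is large relative to its size, the signal responds steeply to any imbalance between synthesis and consumption. All functional forms and rate constants here are illustrative stand-ins, not parameters of the authors' whole-cell model:

    def simulate_repressor(demand=1.0, K=1.0, n=4, dt=1e-3, T=50.0):
        """Euler integration of a one-enzyme, one-pool repressor feedback loop."""
        E, aa = 1.0, 1.0                      # enzyme level, amino acid pool
        for _ in range(int(T / dt)):
            synth = 10.0 * E                  # amino acid synthesis by the enzyme
            use = 10.0 * demand * aa / (0.1 + aa)   # consumption by ribosomes
            expr = 1.0 / (1.0 + (aa / K) ** n)      # steep, "bang-bang"-like repression
            E += dt * (expr - 0.1 * E)        # enzyme synthesis minus dilution
            aa = max(aa + dt * (synth - use), 0.0)
        return E, aa                          # enzyme level settles to track demand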

Our results suggest that attenuation mechanisms, in contrast to repressor control, cannot keep the cell at a high growth rate and with a low frequency of amino acid substitution errors in protein synthesis. The reason is that attenuation mechanisms in spite of their high sensitivity only respond when ribosomes are slowed down by amino acid deficiency, and this reduced protein elongation rate directly impairs both growth rate and accuracy of mRNA translation. 

Top-down approaches to cell modeling as described here are motivated by their ability to reveal universal principles behind gene regulation. They will also serve as a necessary conceptual framework for the design of a new generation of experiments directed towards control of gene expression in bacteria and other organisms. 


183. Measurement and Prediction of Gene Expression in Whole Genomes
Carsten Friis, Peder Worning, Center for Biological Sequence Analysis (CBS);
Birgitte Regenberg, BioCentrum, DTU;
Steen Knudsen, David Ussery, Center for Biological Sequence Analysis (CBS);
carsten@cbs.dtu.dk
Short Abstract:

 We have analysed expression of genes in the Escherichia coli and Saccharomyces cerevisiae genomes. The data are displayed graphically, such that expression levels throughout the whole genome can be visualised at once. We compare the results with DNA structural features and predicted mRNA expression levels, based on several different methods. 

One Page Abstract:

We have analysed the expression of genes in the Escherichia coli and Saccharomyces cerevisiae genomes, using Affymetrix DNA chip technology. The data are displayed graphically, using "DNA Atlases", such that expression levels throughout the whole genome can be visualised at once and compared to DNA sequence parameters (such as AT-content and DNA flexibility). 

We find that genes with similar expression levels are not uniformly distributed throughout the genome, but tend to cluster. To explain this phenomenon, the expression data are correlated with three different models for prediction of expression levels.

The first examines solely the structural parameters of the DNA, assuming that DNA must unwrap before transcription can occur, so that a correlation between helix flexibility and gene expression levels might be found. The second approach is based on the Codon Adaptation Index (CAI), a previously published weight-matrix measure designed to distinguish between highly and lowly expressed genes based on codon usage (a minimal sketch of its computation is given below). Finally, a neural network trained to recognise highly expressed E. coli genes is applied to both genomes.
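
For reference, the CAI of a gene is the geometric mean, over its codons, of each codon's relative adaptiveness w (its frequency divided by the frequency of the most-used synonymous codon in a reference set of highly expressed genes). A minimal sketch, assuming the w table has already been built from such a reference set:

    import math

    def cai(codons, w):
        """codons: list of codon strings; w: dict codon -> relative adaptiveness."""
        logs = [math.log(w[c]) for c in codons if c in w and w[c] > 0]
        return math.exp(sum(logs) / len(logs))   # geometric mean of w values

    # e.g. cai(["AAA", "GAA", "AAA"], {"AAA": 1.0, "GAA": 0.3})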

The results from all three models are compared statistically and correlation coefficients are presented.


184. Analysis of orthologous gene expression in microarray data
J.L. Jimenez, J. Sgouros, Imperial Cancer Research Fund;
jimenez@icrf.icnet.uk
Short Abstract:

 Sequence-based clustering of orthologs from several genomes has revealed conserved groups of sequences involved in different cellular processes. Analysis of the expression of orthologs within and between organisms can provide information on relationships between gene groups involved in core biochemical functions and assist with the functional classification of orphan genes.

One Page Abstract:

 Sequence comparison using databases of characterised genes can provide valuable hints about the molecular function of newly sequenced genes. At the genome level, these comparisons have enabled the functional classification of genes within organisms and provided important information about how these "libraries" of functions are maintained and modified between different phylogenetic groups during evolution. However, the most complete study to date, the Clusters of Orthologous Groups (COGs), has also shown the difficulty of distinguishing between actual orthologs (functionally equivalent proteins) and paralogs (proteins with similar sequence but whose function may have diverged from that of the original ancestor), as well as a considerable number of "orphan" groups that remain uncharacterised due to lack of biochemical information for any of their representative genes.

Microarray studies of the expression behaviour of a genome can help to redefine and improve the classification of the functional libraries by grouping genes regulated at the same time in different cellular processes. This grouping is usually done by means of unsupervised classification algorithms that do not impose any a priori constraint on the analysed data. Information about uncharacterised genes can be deduced by looking at their accompanying partners in the resulting groups, although usually each group contains genes from more than one of the known functional classes and thus only a rough assignment of the cellular process in which the orphan gene is involved can be obtained. 

 In the study presented here, we have used the COG information for the budding yeast proteins as a starting point for the supervised grouping of several microarray experiments. Although the grouping defined by the COGs is in general preserved during gene expression, it is possible to divide broad groups into subclasses that reflect the oligomerisation state of the proteins and/or more specific functionality. Analysis of the relationships between these groups can clarify the boundaries of the cellular processes in which they are involved. Further comparison of the regulation of mitochondrial and non-mitochondrial proteins also shows the relevance of subcellular compartmentalisation in eukaryotes. Along with assisting protein annotation, the aim of this preliminary study is the comparison of gene expression between different organisms to understand how gene regulation could play a role in speciation.


185. A rapid algorithm for generating minimal pathway distances
S.C.G. Rison, J.M. Thornton, Department of Biochemistry and Molecular Biology, University College London;
E. Simeonidis, I.D.L. Bogle, L.G. Papageorgiou, Department of Chemical Engineering, University College London;
rison@biochem.ucl.ac.uk
Short Abstract:

 We present a rapid algorithm based on mathematical programming for calculating minimal pathway distances applied to metabolic networks. The algorithm presented is capable of finding the minimal distances from the source enzyme to all enzymes in the pathway in a single pass. This step is then repeated for each enzyme. 

One Page Abstract:

 We present a rapid algorithm based on mathematical programming for calculating minimal pathway distances applied to metabolic networks. Minimal pathway distances are identified as the smallest number of metabolic steps separating two enzymes in a pathway. The algorithm presented deals effectively with circularity and reaction directionality. It is capable of finding the minimal distances from the source enzyme to all enzymes in the pathway in a single pass. This step is then repeated for each enzyme.
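
For comparison, the single-source step can also be pictured with plain breadth-first search over the enzyme graph, a simpler stand-in for the authors' mathematical-programming formulation; reaction directionality is encoded by listing only the successors a reaction direction permits:

    from collections import deque

    def distances_from(source, successors):
        """Minimal number of metabolic steps from source to every reachable enzyme."""
        dist = {source: 0}
        queue = deque([source])
        while queue:
            e = queue.popleft()
            for nxt in successors.get(e, ()):
                if nxt not in dist:           # first visit = minimal distance
                    dist[nxt] = dist[e] + 1
                    queue.append(nxt)
        return dist

    def all_pairs(successors):
        nodes = set(successors) | {n for vs in successors.values() for n in vs}
        return {e: distances_from(e, successors) for e in nodes}  # repeat per enzyme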

We illustrate the use of minimal pathway distances by calculating them for Escherichia coli small molecule metabolism pathways and considering their correlations with genomic distance (distance separating two genes on a chromosome) and enzyme function (as characterised by EC number).

Although we consider only metabolic networks, the algorithm is generalised and applicable to many other biological networks such as signalling pathways.


186. System analysis of complex molecular networks by mathematical simulation and control theory
Hiroyuki Kurata, Kyushu Institute of Technology;
Hisao Ohtake, Hiroshima University;
H. El-Samad, Iowa State University;
T.-M. Yi, California Institute of Technology;
M. Khammash, Iowa State University;
John Doyle, California Institute of Technology;
kurata@bse.kyutech.ac.jp
Short Abstract:

 To extract the design principles of the complex molecular networks of a biological system, we developed technology combining bioinformatics and systems engineering. Biosimulators for synthesizing a molecular system, and control theory for analyzing it, are strongly required in the post-genomic era.

One Page Abstract:

 In biological systems, control is carried out through molecular interaction processes, whereas in artificial systems it is performed according to calculation based on physical and chemical laws. A molecular interaction network may be a kind of system for calculation, but its calculation method is completely different from that of artificial systems such as a computer. Is it possible to elucidate such biological systems using control theory developed for artificial systems? Molecular interaction processes can be converted into a mathematical model, but it is hard to solve such models analytically due to their nonlinearity. What we can do is not to analyze a biological system analytically, but to numerically simulate a molecular interaction process, making clear the differences and similarities between biological and artificial systems. Such comparison leads us to extract design principles underlying a molecular architecture.

In this work, the comparison with artificial processes enabled us to analyze a biological system with control theory, and to predict how the interactions among subsystems are generated in a biological system. To extract design principles from complicated networks, we study several systems such as the heat shock response, circadian clocks, and the ammonia assimilation system. Here, we report on the heat shock response. We present a mathematical model that reproduces the main features of the heat shock response and analyze it with control theory, whose key words are complexity, robustness, and control. In the heat shock response, the activity and amount of σ32 are controlled by three mechanisms: DnaK-mediated σ32 activity control (feedback control), heat-induced translation (feedforward control), and degradation of σ32 by the σ32-expressed protease FtsH (local servo feedback control). Feedback control plays a major role in the heat shock response, because feedback functions well without the feedforward and local servo controls. The addition of feedforward and local servo feedback controls increases the insensitivity to fluctuations from other subsystems. Briefly, the complexity of σ32 regulation generates robustness in the heat shock response, making its parameters insensitive to perturbations from other subsystems.

The complexity of biological systems has introduced conceptual and practical difficulties. Among the most important has been the difficulty of isolating a smaller subsystem that can be analyzed separately. Complexity seems to impede isolating smaller subsystems out of the whole system. However, this study demonstrates that complexity can generate insensitivity to perturbations among subsystems, thereby making it possible to extract a smaller subsystem out of the whole system and analyze it separately. If robustness is a common feature of the key properties of interconnected subsystems, a biological system is a collective body of mosaic-like subsystems rather than a melting pot of subsystems.


187. Parameter Estimation of Signal Transduction Pathways using Real-Coded Genetic Algorithms
Shuhei Kimura, Takashi Naka, Mariko Hatakeyama, Akihiko Konagaya, RIKEN Genomic Sciences Center;
skimura@gsc.riken.go.jp
Short Abstract:

 For understanding quantitative dynamics of signal transduction, a computational simulation is one of the most effective methods. However, a computational simulation often requires several kinetic parameters which are unmeasurable by the existing experimental techniques. We use a real-coded genetic algorithm for estimating these unknown parameters for EGF signal transduction. 

One Page Abstract:

 For predicting biological phenomena in cells, it is effective to create mathematical models. We choose a set of ordinary differential equations as a mathematical model of reactions in cells. For simulation, the set of differential equations requires kinetic parameters obtained from biological experiments. However, several kinetic parameters cannot be obtained by existing biological techniques.

Thus, we must estimate these parameters from the experimentally measurable data. The problem of estimating the unknown kinetic parameters can be formulated as a function optimization problem if we treat the unknown parameters as decision variables, and the difference between the measurable data and the simulation results as the objective value.

In this optimization problem, it is impossible to calculate differential values analytically. Moreover, if there are only few measurable data, the problem may have multiple optima. A real-coded genetic algorithm (GA) is suitable for a problem with these properties. The GA is an optimization method inspired by Darwinian evolution, and the real-coded GA in particular is well suited to function optimization problems. We apply the real-coded GA to the kinetic parameter estimation problem. As an example, we use the receptor tyrosine kinase signal transduction pathway activated by EGF binding in mammalian cells.
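
A minimal sketch of a real-coded GA for this kind of parameter fitting follows. The objective function, parameter bounds, BLX-alpha blend crossover, and Gaussian mutation are illustrative choices; the abstract does not specify which real-coded operators the authors use:

    import numpy as np

    def fit_parameters(objective, bounds, pop=60, gens=200, seed=0):
        """Minimize objective(p) over real-valued parameter vectors p."""
        rng = np.random.default_rng(seed)
        lo, hi = np.asarray(bounds, float).T
        P = rng.uniform(lo, hi, size=(pop, len(lo)))
        f = np.array([objective(p) for p in P])
        for _ in range(gens):
            i, j = rng.integers(pop, size=2)
            a, b = P[i], P[j]
            span = np.abs(a - b)                           # BLX-alpha crossover
            child = rng.uniform(np.minimum(a, b) - 0.5 * span,
                                np.maximum(a, b) + 0.5 * span)
            child += rng.normal(0, 0.01 * (hi - lo))       # Gaussian mutation
            child = np.clip(child, lo, hi)
            fc = objective(child)
            worst = np.argmax(f)
            if fc < f[worst]:                              # replace the worst
                P[worst], f[worst] = child, fc
        return P[np.argmin(f)], f.min()

    # objective(p) would simulate the ODE model with parameters p and return the
    # squared difference between simulated and measured time courses.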


188. A two-phase partition method simulates the dynamic behavior of the heat shock response with high accuracy at a remarkably high speed.
Hiroyuki Kurata, Kyushu Institute of Technology;
kurata@bse.kyutech.ac.jp
Short Abstract:

 In order to accurately simulate the dynamic behavior of a molecular network of a biological system at an extremely high speed, the two-phase partition method was developed that automatically divided all the chemical reaction equations into two phases: the binding phase and the reaction phase.

One Page Abstract:

 Metabolic Control Analysis (MCA) and Biochemical Systems Theory (BST) have been demonstrated to be useful for simulating various metabolic circuits. However, there have been few reports of successful simulation of molecular networks consisting of proteins and DNAs, such as stress responses, cell division, chemotaxis, and circadian clocks, because such pathways cannot be described by Michaelis-Menten rate equations. Generally, conventional mass action equations, or methods that simplify complicated networks into rate equations, have been employed to simulate protein and DNA networks. The problem is that differential equations whose rate parameters differ widely in reaction time-scale are so stiff that calculation times become very large; and the simplified rate equation method depends on the structure of the network and on the values of the system parameters, because it neglects some reactions to simplify a complicated network. To overcome these problems, a two-phase partition method was developed that automatically divides all the chemical reaction equations into two phases: the binding phase and the reaction phase. This method simulates all the reactions involved in protein and DNA signal transduction, and calculates them at an extremely high speed. The two-phase partition method accurately simulated the dynamic behavior of the heat shock response, which contains huge differences in the time-scales of its reactions. The calculation speed was 4 x 10^4-fold higher than that of the conventional mass action method. The heat shock response is an excellent model, showing dynamic behavior with a quick and sharp transient response of a regulatory protein.


189. Integration of Computational Techniques for the Modelling of Signal Transduction
Pedro Pablo González, Maura Cárdenas, Carlos Gershenson, Jaime Lagúnez-Otero, Chemistry Institute, National University of Mexico (UNAM);
ppgp@servidor.unam.mx
Short Abstract:

 We present an intracellular signalling model obtained by integrating several computational techniques into an agent-based paradigm. Cellulat, the model, takes into account two essential aspects of intracellular signalling networks: cognitive capacities and a spatial organization. The characteristics of an intracellular signalling virtual laboratory based on our model are discussed.

One Page Abstract:

 Each cell in a multicellular organism receives specific combinations of chemical signals generated by other cells or by its internal milieu. The final effect of the signals received by a cell can be translated into regulation of the cell's metabolism, into cellular division, or into its death. Once extracellular signals bind to the receptors, different signalling processes are activated, generating complex information transmission networks. The more experimental data about cellular function we obtain, the more important computational models become: models allow for visualization of the network components and permit prediction of the effects of perturbations on components or sections of a signalling pathway.

Within computer science, artificial intelligence is one of the main areas used to model biological systems, owing to the great variety of models, techniques and methods that support this research area, many of which are inherited from disciplines such as the cognitive sciences and neuroscience. Among the main artificial intelligence and computer science techniques commonly used to model cellular signalling networks are artificial neural networks, Boolean networks, Petri nets, rule-based systems, cellular automata, and multi-agent systems. The high complexity of intracellular communication networks makes them difficult to model with any single technique. However, by integrating the most relevant features of these techniques in a single computational system, it should be possible to obtain a more robust model of signal transduction, permitting better visualization and understanding of the processes and components that make up the networks.

The theory of behaviour-based systems constitutes a useful approach for modelling intracellular signalling networks. The model takes into account communication between agents via a shared data structure, in which other cellular compartments and elements of the signalling pathways can be explicitly represented; in this sense, the blackboard architecture is appropriate. In this work, we propose an effective and robust model of intracellular signalling, obtained by joining the main structural and functional characteristics of behaviour-based systems with the blackboard architecture. That is, a cell can be seen as a society of autonomous agents, where each agent communicates with the others through the creation or modification of signals on a shared data structure, named the "blackboard". The autonomous agents model particular functional components of intracellular signalling pathways, such as signalling proteins and other mechanisms. The blackboard levels represent different cellular compartments related to the signalling pathways, whereas the different objects created on the blackboard represent signal molecules, activation or inhibition signals, or other elements belonging to the intracellular medium.

One of the reasons for our interest in the analysis and understanding of signalling pathways is the possibility of regulating them. In principle, it is possible to observe this process in a virtual laboratory based on our paradigm. In particular, we would like to see the effects of perturbations on the system, such as adding elements or taking them out as knock-outs. The expected effects would be directly on the cognitive capacity of the network and ultimately on the decisions taken by the cell in order to differentiate, proliferate or become senescent. Pathologies and natural processes can be followed in the computation of the interactions made by the components of the modelled network. The paradigm presented here is the backbone of the virtual laboratory, with which we hope that etiologies and the expected results of putative therapeutic strategies can be visualized.


190. Validating Metabolic Prediction
Lynda B.M. Ellis, Jiangbi Liu, John Carlis, Marielle Vigouroux, C. Douglas Hershberger, Lawrence P. Wackett, University of Minnesota;
lynda@tc.umn.edu
Short Abstract:

 A Pathway Engine is being developed to use UM-BBD (http://umbbd.ahc.umn.edu) knowledge to predict catabolic pathways for compounds the UM-BBD does not contain. The Engine was validated using 100 compounds with known pathways. Pathways were predicted for 69 of these compounds; 91% of them were reasonably similar to known pathways.

One Page Abstract:

 The University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD, http://umbbd.ahc.umn.edu/, 1) provides curated information on microbial catabolic enzymes and their organization into metabolic pathways. The UM-BBD's 100+ pathways represent the major microbial routes for biotransformation of many of the organic functional groups found in the environment. However, the UM-BBD will never contain information on more than a fraction of all biodegradation that may occur. Our goal is to use the data in the UM-BBD to predict possible biodegradation pathways for compounds it does not contain. Towards this goal, we are developing a Pathway Engine.

Three challenges in Pathway Engine development are similarity, similarity and similarity. Which UM-BBD compounds are similar to a query compound? How similar are the reactions these UM-BBD compounds undergo? And how similar is a predicted pathway to a known pathway? We measure compound similarity using the Tversky similarity metric (2); reaction similarity using a method developed during this project based on enzyme EC (3) codes; and pathway similarity using a method developed during this project based on dynamic, pairwise, global alignment of EC code chains. 
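
The compound-similarity step can be sketched directly from the Tversky definition: for fingerprint feature sets A and B, similarity is |A&B| / (|A&B| + alpha*|A-B| + beta*|B-A|). The alpha = beta = 1 default below (which reduces Tversky to the Tanimoto coefficient) is illustrative; the abstract does not state the weights the Pathway Engine uses:

    def tversky(a, b, alpha=1.0, beta=1.0):
        """Tversky similarity of two feature (fingerprint) sets."""
        a, b = set(a), set(b)
        common = len(a & b)
        return common / (common + alpha * len(a - b) + beta * len(b - a))

    # e.g. tversky({"C=O", "c1ccccc1", "Cl"}, {"C=O", "c1ccccc1"})  # = 2/3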

The Pathway Engine was cross-validated using the 100 compounds that begin catabolic pathways of two or more reactions in the UM-BBD. The Engine predicted one or more degradation pathways for 69 of them. For 31 of these compounds (45%), the Engine's most similar predicted pathway was very similar to a known pathway (pathway similarity score > 0.7). An additional 32 of these compounds (46%) had a pathway that was reasonably similar to a known pathway (0.7 > pathway similarity score > 0.5). The Pathway Engine will be described, and challenges to be overcome in its further development will be discussed. _________

1. Ellis, L.B.M., Hershberger, C.D., Bryan, E.M., and Wackett, L.P. (2001) "The University of Minnesota Biocatalysis/Biodegradation Database: Emphasizing Enzymes" Nucleic Acids Research 29: 340-343.

2. Bradshaw, J. (1997) "Introduction to Tversky similarity measure" Proceedings of MUG '97, the 11th Annual Daylight User Group Meeting, URL = http://www.daylight.com/meetings/mug97/Bradshaw/MUG97/tv_tversky.html

3. Moss, G.P. (2001) "Enzyme Nomenclature" Nomenclature Commission of the International Union of Biochemistry and Molecular Biology (IUBMB). URL = http://www.chem.qmw.ac.uk/iubmb/enzyme/


192. Scale-free Behaviour in Protein Domain Networks
Stefan Wuchty, European Media Laboratory;
stefan.wuchty@eml.org
Short Abstract:

 Several technical, social and biological networks were recently found to demonstrate scale-free and small-world behaviour. The topology of protein domain networks generated with data from popular domain databases exhibits features of small-world and scale-free nets. The extent of connectivity among domains reflects the evolutionary complexity of the organisms considered.

One Page Abstract:

 Diverse disordered systems are best described as networks with complex topologies. Often the connection topology is assumed to be either completely regular or completely random. Small-world graphs were originally generated by randomly rewiring edges in a regular network. Measures that distinguish these three types of networks show that several sociological and technological networks are of the small-world type. A small-world graph is formally defined as a sparse graph which is much more highly clustered than an equally sparse random graph. Scale-free networks display a connectivity distribution which decays as a power law. This feature was found to be a direct consequence of two generic mechanisms: (1) networks expand continuously by the addition of new vertices, and (2) these newly added nodes attach preferentially to sites that are already well connected. We are thus dealing with relatively few highly connected vertices and many sparsely connected ones.

Recently, metabolic networks were discovered to be small-world and scale-free nets. This result encouraged us to investigate protein domains in a similar fashion. Defining domains as nodes, with an edge connecting two domains whenever they occur together in one protein, the resulting graph was found to exhibit small-world and scale-free behaviour. Domain data were retrieved from popular protein domain databases.
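
A minimal sketch of the graph construction and of the degree histogram whose power-law decay signals scale-free behaviour; the input format (one list of domain identifiers per protein) is an assumption for illustration:

    from collections import Counter
    from itertools import combinations

    def domain_graph(proteins):
        """proteins: iterable of domain lists, one list per protein."""
        edges = set()
        for domains in proteins:
            for d1, d2 in combinations(sorted(set(domains)), 2):
                edges.add((d1, d2))             # co-occurrence edge, deduplicated
        degree = Counter()
        for d1, d2 in edges:
            degree[d1] += 1
            degree[d2] += 1
        return edges, Counter(degree.values())  # edge set, P(k) histogram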

Interestingly, the degree of connectivity differs between species: domains show higher connectivity the higher the evolutionary level of the organism. Apparently, an evolutionary trend towards higher connectivity of domains, as well as growing complexity in domain architecture, can be detected. Thus, complex domain arrangements provide protein sets sufficient to preserve cellular processes without dramatically expanding the absolute size of the proteome.


193. Inferring gene dependency networks using expression profiles from yeast deletion mutants
Johan Rung, Thomas Schlitt, Alvis Brazma, Ugis Sarkans, Jaak Vilo, EMBL-EBI;
johan@ebi.ac.uk
Short Abstract:

 We propose a method for inferring a graph structure describing gene functional dependencies, given microarray data collected from mutation experiments. The algorithm is applied to yeast microarray data, and we present suggested functional networks from different parts of the yeast regulatory system together with a performance analysis of the algorithm.

One Page Abstract:

 We propose a method for inferring a graph structure describing gene dependencies, given gene expression data collected from a set of mutation experiments. By combining the information about the observed changes in expression levels after a mutation with a statistical analysis, we gather lists of dependencies between gene pairs, together with information about the direction of changes (positive or negative influence on the mRNA level). These connections are combined into a network structure with nodes containing single genes or groups of genes. The network represents the knowledge about pairwise influences regardless of the nature of the influence. It is constructed in a way that it can be used for stand-alone analysis with predictions about genetic functions and grouping of genes, and also as a good initial probabilistic network for structure learning algorithms. We have applied the algorithm to microarray data for Saccharomyces cerevisiae by Hughes et al. (Cell, July 7, 2000), which included 274 single-gene mutations. The gene-specific error model from that publication has been used in our experiments. The full network as given by the algorithm has been analyzed for consistency with known pathways, and we present examples from different parts of the regulatory system in yeast. Also, the performance of the algorithm has been analyzed with respect to consistency of edges and grouping of genes dependent on parameter settings. A comparison with gene grouping found by regular clustering algorithms is also presented together with functional predictions for a number of ORFs with unknown function. 
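
The edge-collection step might look like the following sketch: for each deletion experiment, call a signed dependency edge from the deleted gene to every gene whose expression change is significant. The simple z-score threshold stands in for the gene-specific error model of Hughes et al. used by the authors:

    import numpy as np

    def dependency_edges(mutants, ratios, sigma, z_cut=3.0):
        """mutants: deleted gene per experiment; ratios: (experiments x genes)
        log expression ratios; sigma: per-gene error estimates."""
        edges = []
        z = ratios / sigma
        for i, src in enumerate(mutants):
            for j in np.flatnonzero(np.abs(z[i]) > z_cut):
                # deletion lowers a target the gene activates ("+"),
                # raises a target the gene represses ("-")
                edges.append((src, j, "+" if z[i, j] < 0 else "-"))
        return edges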


194. On Network Genomics
Christian V. Forst, Bioscience Division, Los Alamos National Laboratory;
chris@lanl.gov
Short Abstract:

 "Network Genomics", a new research field that combines genomic information with connectivity information of cellular networks is presented. By analyzing gene-expression of metabolic networks chemical switches of metabolic flux have been identified. A comparative network genomics approach has revealed a relationship between gene-context/operon structure and networks. 

One Page Abstract:

 "Network Genomics", a new research field that combines genomic information with connectivity information of cellular networks is presented. The information provided by completely sequenced genomes can yield insights into the multi-level organization of organisms and their evolution. By gene-expression networks, genes coding for individual polypeptides are expressed. Individual enzyme complexes are formed, often through assembly of multiple polypeptides. At another level, sets of enzymes group into metabolic networks. 

By analyzing gene expression in metabolic networks, chemical switches of metabolic flux have been identified. Special reference is made to relationships between gene context/operon structure and networks. For this purpose, a method is presented that extends conventional sequence comparison and phylogenetic analysis of individual enzymes to metabolic networks. The method will find application in comparative network analysis of microbial organisms.


195. Semantic Modeling of Signal Transduction Pathways and the Linkage to Biological Datasources
M. Weismueller, R. Eils, Div. "Intelligent Bioinformatics Systems", German Cancer Research Center (dkfz), 69120 Heidelberg, Germany;
m.weismueller@dkfz.de
Short Abstract:

 Our aim is to model signaling networks computationally in a semi-qualitative way and to use the model for simulation of protein-protein interactions describing information flow through the cell. For the modeling we use a synchronous process algebra, the pi-calculus, and import signaling data from the signal transduction database TRANSPATH.

One Page Abstract:

 Signal transduction (ST) is the mechanism by which a cell reacts to a stimulus coming from outside the cell. ST alters gene transcription in the nucleus, thereby changing protein synthesis and the behavior of the cell. ST can be described as an information flow from the outside into the cell, mediated by biochemical reactions of signal molecules. ST pathways play a major role in the field of cancer research: several pathways have been identified as responsible for cancer development through over- or underexpression of genes, or through functional modification of signal proteins caused by alterations of their sequence.

Quantitative data and measurements of signal molecules are not yet available for a comprehensive number of pathways. Several model systems have been studied in detail, including concentration and activity measurements of signal molecules, but these attempts are not sufficient to allow a study of the whole signal transduction networks of cells. Therefore, the idea is to reduce the view of signal information flow in the cell to a state-based one: a protein is active (mediating information) or inactive (not mediating information). Additional information is incorporated when available from biological experiments: protein X binds to protein Y, protein X phosphorylates protein Y, etc.

One way to describe this non-quantitative view of signal transduction is the qualitative modeling of signal information flow through the cell. The information flow is an abstract view of the biochemical reactions of signal molecules. These molecular interactions are modeled semantically, describing the biochemical interaction not with numerical equations such as differential equations, but with abstracted biochemical reaction descriptions.

The aim is to use the information of a ST database - TRANSPATH (http://transpath.gbf.de/) - to build up a comprehensive model of ST pathways in the computer. One should be able to answer biological questions about the interaction or alteration of pathways under certain conditions and to formulate hypotheses, which might be tested in experiments.

To model ST pathways the pi-calculus (http://www.lfcs.informatics.ed.ac.uk/reports/89/ECS-LFCS-89-85/) is used to represent parallel interactions of proteins. This notion was adapted from the BioPSI project (http://www.wisdom.weizmann.ac.il/~aviv/). The pi-calculus is a kind of programming language. Protein interactions of ST pathways can be programmed and simulated using the information of TRANSPATH. These simulations result in an output, which has to be interpreted under the posed conditions.

The first step of the work is to model an important ST pathway: the ERK-MAPK pathway. In this model a protein is interpreted as a computational unit having an input layer, a computational layer and an output layer. This pathway model implementation exemplifies how ST pathways can be built up in a computer.


196. Pathway Analysis of Metabolic Networks: New version of METATOOL and convenient classification of metabolites
Stefan Schuster, Ferdinand Moldenhauer, Ionela Oancea, Max Delbruck Center for Molecular Medicine, Dept. of Bioinformatics, D-13092 Berlin-Buch, Germany;
Thomas Pfeiffer, ETH Zurich, Experimental Ecology / Theoretical Biology, CH-8092 Zurich, Switzerland;
stschust@mdc-berlin.de
Short Abstract:

 The concept of elementary modes formalizes the term "biochemical pathway". We present the newest version of METATOOL, a program for determining elementary modes and other topological features of metabolism. We outline strategies for finding a convenient classification of source and sink metabolites and intermediates. This is illustrated by biochemical examples.

One Page Abstract:

 The topological analysis of metabolic networks has become an important integrative part of bioinformatics. This analysis includes methods for the computer-aided synthesis of biochemical pathways, which is instrumental in functional genomics and biotechnology (Schuster et al., 2000). One of these methods is based on the concept of elementary flux modes. It allows one to test whether sets of enzymes form a coherent pathway allowing mass balancing for each intermediate and complying with the directionality of reactions (irreversibility). Importantly, pathway analysis can be performed without the knowledge of kinetic parameters. An algorithm for computing all elementary modes in biochemical reaction networks of any complexity has been implemented by us earlier, in the program METATOOL (Pfeiffer et al., 1999). Here, we present the newest version of METATOOL (version 3.5), which includes several new features such as the detection of the connectivity distribution (connectivity is defined as the number of reactions in which a given metabolite participates). Moreover, the branch point metabolites of a network and the conservation relations are explicitly given. Dead-end metabolites and sets of inconsistent irreversible reactions are indicated, which helps the user check model consistency. Moreover, the elementary modes are compared with the modes in the convex basis.

An important technical question in the computation of elementary modes is the convenient classification of external metabolites (sources and sinks) and internal metabolites (intermediates). A reasonable criterion for this classification is to minimize the number of elementary modes. This criterion is related to Kolmogorov complexity and was chosen in order to reduce combinatorial explosion in complex networks. We present two strategies (implemented as C programs) to find the convenient classification. These tools are illustrated by biochemical networks taken from nucleotide metabolism and monosaccharide metabolism. 
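
As a small illustration of one of the features mentioned above, conservation relations are left null-space vectors g of the stoichiometric matrix N (g N = 0). A toy three-metabolite, two-reaction example, not taken from the poster:

    import sympy as sp

    # rows = metabolites (A, B, C), columns = reactions (A->B, B->C)
    N = sp.Matrix([[-1,  0],
                   [ 1, -1],
                   [ 0,  1]])
    for g in N.T.nullspace():          # left null space of N
        print(g.T)                     # [1 1 1]: the total A + B + C is conserved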

References:
T. Pfeiffer, I. Sanchez-Valdenebro, J.C. Nuno, F. Montero, S. Schuster: METATOOL: For Studying Metabolic Networks. Bioinformatics 15 (1999) 251-257.
S. Schuster, D. Fell, T. Dandekar: A General Definition of Metabolic Pathways Useful for Systematic Organization and Analysis of Complex Metabolic Networks. Nature Biotechnol. 18 (2000) 326-332.


198. PIMRider: an integrated exploration platform for large protein interaction networks
Jérôme WOJCIK, Fabien PETEL, Alain MEIL, Vincent SCHACHTER, Yvan CHEMAMA, Hybrigenics S.A., Paris, France;
jwojcik@hybrigenics.fr
Short Abstract:

 The PIMRider is a web-based software platform developed to visualize and explore protein interaction networks. Experimental protein-protein interactions derived from yeast two-hybrid assays are integrated with external database annotations in a rich data structure. Modular viewers allow the biologist to focus on specific pathways and formulate new interpretations.

One Page Abstract:

 Proteome-wide technologies now being used massively in the study of protein function require highly sophisticated bioinformatics tools to store and exploit the large amount of experimental data produced. We have developed a software platform to integrate in-house yeast two-hybrid assay data with several public and partner databases. Once stored, the "rough" experimental data are post-processed with specific algorithms that improve the reliability and comprehensibility of the information. Finally, refined results are made available on-line to the scientific community and partners via several dedicated viewers:

- the ProteinViewer[tm] displays the annotations, bibliographies, database entries, genomic information and interactions found for each protein of the proteome studied;

- the InteractionViewer[tm] graphically displays the details of each interaction found by the yeast two-hybrid assay, with the interacting fragments and the computed Selected Interacting Domain (SID®) positioned relative to the coding sequences of the two proteins;

- the MultiSIDViewer[tm] graphically displays all the computed SIDs® positioned relative to the coding sequence of the protein studied;

- the PIMViewer[tm] displays the cell-wide protein interaction map as a graph, and allows the biologist to filter interactions by reliability and to focus on a particular pathway;

- the PIMRider® Annotator module allows partner scientists to edit protein and interaction annotations on-line.

The platform is based on a multi-tier architecture including an Oracle relational database server accessible through a SQL layer, Java Server Pages[tm] that generate the HTML pages (the viewers are included in the HTML pages as Java applets), and an Apache server to handle HTTP queries. The PIMRider® platform makes it possible to visualize both the experimental protein interaction map of Helicobacter pylori and the interaction network of Escherichia coli, which were respectively used as the source for, and predicted by, the 'Interacting Domain Profile Pair' inference method (Wojcik and Schächter, ISMB 2001 communication). A free PIMRider® demonstration is available at http://pim.hybrigenics.com.


199. From "pathways" to functional network. A new technology for reconstruction of human metabolism. (up)
Tatiana Nikolskaya, Ph.D., Andrej Bugrim, Ph.D., GeneGo LLC;
tnikolsk@genego.com
Short Abstract:

 We present a new technology, called functional reconstruction, that allows identification of a set of major metabolic, regulatory and developmental "functional units" in the context of a biochemical network. Our method integrates known human-specific pathways with diverse kinds of data for computational reconstruction and analysis of the relevant metabolic network.

One Page Abstract:

 "Traditional" view of biochemistry was formulated in terms of "pathways", where a chain of biochemical transformations, a pathway presumably serves its specific function in the cell. On the other hand, many recent studies show that metabolic and cell signaling processes are actually highly interconnected and networks of immense complexity can potentially result from combining a limited number of reactions and interactions in many different combinations and sequences. How to reconcile these views? Here we present a new technology, called functional reconstructions. Our goal is to identify a set of major metabolic, regulatory and developmental "functional units" in the context of a biochemical network as a whole. We start with identification, careful annotation and elucidation of human-specific pathways for which biochemical evidence exists. Such "biochemical reconstructions" serve as informational "skeleton" for efficient functional integrating of different medical, biological and genomic kinds of data, thus providing a missing link between clinical data and human genome sequence. At the second stage we extend our collection of the pathways by computational reconstruction of relevant metabolic network. Finally, by integration of expression data into resulting metabolic map we can generate a "snap shot" for any specific cell, tissue, disease or condition. Comparison of such snap shots made for the same tissue at different developmental stages or different conditions provides a framework for identification of regulatory pathways and other potential "functional units" in the biochemical network, eventually leading to construction of a "functional view" of such network - Functional Reconstruction. 

Our technology makes it possible to:

1. Restore complicated cellular networks using abundant gene expression data (EST and microarray) as well as genome sequence.
2. Precisely identify relationships between different human genes, pathways and parts of metabolism.
3. Identify all over- and under-expressed genes specific for a given tissue/condition.
4. Generate interactive, integrative metabolic functional outlines for all parts of human metabolism.
5. Produce user-friendly expression circuits for visualization and clustering of microarray data.
6. Functionally localize SNPs and other genetic markers on the generated metabolic and regulatory maps.
7. Propose human-specific developmental pathways by applying data from other organisms to human-specific functional reconstructions.
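
A toy sketch of the "snapshot" idea described above: overlay expression values on pathway gene sets and compare two conditions. The pathway definitions, gene names and expression values are all invented for illustration.

# Mean expression per pathway in one condition gives a crude "snapshot";
# the difference of two snapshots hints at condition-specific regulation.
pathway = {"glycolysis": ["HK1", "PFKM", "PKM"],
           "urea_cycle": ["OTC", "ASS1", "ARG1"]}

def snapshot(expression):
    """Mean expression of the genes of each pathway in one condition."""
    return {pw: sum(expression.get(g, 0.0) for g in genes) / len(genes)
            for pw, genes in pathway.items()}

healthy = {"HK1": 5.0, "PFKM": 4.5, "PKM": 6.0, "OTC": 2.0, "ASS1": 1.8, "ARG1": 2.2}
disease = {"HK1": 9.0, "PFKM": 8.5, "PKM": 9.5, "OTC": 0.4, "ASS1": 0.5, "ARG1": 0.3}

s1, s2 = snapshot(healthy), snapshot(disease)
for pw in pathway:
    print(pw, round(s2[pw] - s1[pw], 2))  # positive: up in disease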


200. Protein Pathway Mapping in Human Cells
Kunbin Qu, Nan Lin, Xiang Xu, Donald Payan, Rigel Pharmaceuticals, Inc.;
kqu@rigel.com
Short Abstract:

 We present a large-scale (several hundred baits) pathway map in human cells obtained by yeast two-hybrid screening. Binding is represented as a matrix, and gene classification is achieved by recursively joining the matrix elements. Inferences are made for novel genes, providing a useful tool for functional annotation and pathway mapping.

One Page Abstract:

 We present a large-scale pathway map of protein-protein interactions in human cells obtained by yeast two-hybrid methodology and downstream data analysis. In the yeast two-hybrid system, the protein of interest (the bait) is fused to a known DNA-binding domain such as GAL4, and the potential hit (a member of the cDNA library being screened) is fused to a cognate transcriptional activation domain. Co-expression of the two chimeric proteins results in transcriptional activation of the reporter downstream of the DNA-binding domain if the chimeric proteins associate. Baits utilized in this study include several hundred members of the following pathways: T and B cell signaling, cell cycle regulation, the TNF pathway, exocytosis and others. Each bait generated five hits on average, in accordance with the figure published by Tucker et al. (2001). The entire interaction network can be viewed as a matrix in which baits are represented by rows and the non-redundant hits are listed in columns (cDNA library fragments are grouped by sequence similarity). The binding relationship of bait and hit is represented by a binary vector within the matrix, the number 1 indicating a specific interaction and 0 no interaction. The binary matrix is then converted to a probabilistic matrix in the following fashion: a pseudo-count is added to each element, and the values are normalized by the total number of bindings in the whole network. Gene classification for both baits and hits is achieved by recursively joining the two elements with the highest Pearson correlation coefficient calculated from the probabilistic matrix, until the matrix dimension is reduced to one in the direction in which the joining is performed. This process leads to clustering of genes with similar binding vector profiles. Such genes have a higher probability of sharing biological function and pathway membership, on the assumption that they interact with similar proteins in the cell. Therefore, inferences are made for both baits and hits based on clustered members with documentation, providing a useful tool for functional annotation and for detecting new members of a known pathway. For example, cRaf, Traf2, Traf5, I-flice, RIP, Flame-1, Clarp and Mch5 are clustered together by this analysis. Although these proteins have distinct roles in differing signaling pathways, they all participate in the biological process of programmed cell death. Another group contains SAP, Fyn, Lyn and STAT1. These genes encode proteins with varying functional domains, including adaptor proteins, kinases and transcription factors; what they have in common is an SH2 domain and a functional role in a variety of immune responses through association with specific receptors. A graphical representation of the complexity-reduced network based on the clustered nodes provides better visualization of the resultant pathway network. The graph is implemented through a modified multilevel force-directed graph drawing algorithm (Walshaw C., 2000). This multilevel algorithm accelerates force-directed layout of large graphs (up to 100,000 nodes) while improving regional and global layout quality. Detailed annotation and pathway mapping of the novel genes present in each cluster is under way.
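
A compact sketch of the clustering step described above: add pseudo-counts, normalize by the total number of bindings, then recursively join the pair of row profiles with the highest Pearson correlation. The join rule for merged profiles (averaging) is one reasonable choice, not stated in the abstract, and the toy matrix is invented.

import numpy as np

def cluster_rows(binary, pseudo=1.0):
    """Recursively join the two most correlated row profiles of a
    bait x hit binary interaction matrix."""
    # pseudo-counts, then normalization by the total number of bindings
    prob = (binary + pseudo) / (binary + pseudo).sum()
    rows = [[i] for i in range(prob.shape[0])]
    vecs = [prob[i].astype(float) for i in range(prob.shape[0])]
    merges = []
    while len(vecs) > 1:
        best, pair = -2.0, None
        for i in range(len(vecs)):          # pair with highest Pearson r
            for j in range(i + 1, len(vecs)):
                r = np.corrcoef(vecs[i], vecs[j])[0, 1]
                if r > best:
                    best, pair = r, (i, j)
        i, j = pair
        merges.append((rows[i], rows[j], best))
        rows[i] = rows[i] + rows[j]          # the joined group
        vecs[i] = (vecs[i] + vecs[j]) / 2.0  # its profile (averaged)
        del rows[j], vecs[j]
    return merges

interactions = np.array([[1, 0, 1, 0],
                         [1, 0, 1, 1],
                         [0, 1, 0, 1]])
for a, b, r in cluster_rows(interactions):
    print(a, "+", b, "r = %.2f" % r)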

Tucker CL, Gera JF, Uetz P (2001). Towards an understanding of complex protein networks. Trends in Cell Biology 11(3): 102-106.

Walshaw C. A Multilevel Algorithm for Force-Directed Graph Drawing. Tech. Rep. 00/IM/60, Univ. Greenwich, London SE10 9LS, UK, April 2000. 


201. MAP-Kinase-Cascade: Switch, Amplifier or Feedback Controller?
Nils Blüthgen, Hanspeter Herzel, Innovationskolleg Theoretische Biologie;
nils.bluethgen@itb.biologie.hu-berlin.de
Short Abstract:

 The MAP kinase cascade is a highly conserved module in eukaryotic signaling pathways. By modeling reaction kinetics, we investigate the steady states and dynamics of the system. We show how the switch-like behavior is realized and how amplification is generated. We also show adaptation upon introduction of a negative feedback loop.

One Page Abstract:

 The three-step MAP kinase cascade is a highly conserved module in signaling pathways. It is present in all eukaryotes and has a wide range of functions in signal transduction, e.g. stress response, cell-cycle control, cell-wall construction, osmosensing, growth and differentiation. By modeling the reaction kinetics of the cascade we investigate the properties of the steady-state solutions and the dynamical behavior under the action of a negative feedback loop.

The system shows different behavior depending on the fraction of activated MAPKK: in the low activation range a rather switch-like response is observed, while in the intermediate activation range amplification increases. This corresponds to a shift of the working point depending on the activated MAPKK concentration and other reaction parameters. In order to characterize the steady-state behavior of the whole MAP kinase cascade we fit the signal response with a Hill curve, which defines a Hill coefficient for the entire system. Within this framework the models of Bhalla/Iyengar [1] and Huang/Ferrell [2] are compared, and the remarkably different characteristics of the two models can be understood. The robustness of the switch-like response is investigated by calculating Hill coefficients for varying reaction parameters.
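
As an illustration of the fitting step, a minimal sketch: fit a Hill curve to a stimulus-response relation and read off an effective Hill coefficient for the whole cascade. The data points below are invented; in the study they would come from the steady states of the kinetic model.

import numpy as np
from scipy.optimize import curve_fit

def hill(x, n, k):
    """Hill curve: fraction of maximal response at stimulus x."""
    return x**n / (k**n + x**n)

# Illustrative stimulus vs. normalized active MAPK data points.
stimulus = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0])
response = np.array([0.01, 0.05, 0.25, 0.55, 0.85, 0.97, 0.99])

(n, k), _ = curve_fit(hill, stimulus, response, p0=(1.0, 1.0))
print("effective Hill coefficient n = %.2f, half-activation K = %.2f" % (n, k))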

When an indirect negative feedback loop is introduced, the MAPK cascade shows damped oscillations, which can be interpreted as adaptation to different upstream signals. Asthagiri and Lauffenburger suggest that the integral of activated MAPK over time is a reasonable metric for encapsulating information for transcription [3]. We analyse this integral for increasing stimulus.

 References:
[1] Science 283: 381-387 (1999).
[2] PNAS 93: 10078-10083 (1996).
[3] Annu. Rev. Biomed. Eng. 2: 31-53 (2000).


202. Genomic Object Net: Basic Architecture and Visualization for Biopathway Simulation
Hiroshi Matsuno, Atsushi Doi, Yamaguchi University;
Rainer Drath, ABB AG in Heidelberg, Germany;
Satoru Miyano, Human Genome Center, Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-;
matsuno@sci.yamaguchi-u.ac.jp
Short Abstract:

 Genomic Object Net is a software tool for describing and simulating structurally complex dynamic causal interactions and processes such as metabolic pathways, signal transduction cascades and gene regulation. The notion of the hybrid object net is employed as its basic architecture, and a visualization technique has been developed for intuitive understanding of biopathway simulation.

One Page Abstract:

 With the completion of many genome sequencing projects, a new research interest is emerging: elucidating how living systems function in terms of biological information at all levels, and developing information technology for applying such systemic information to medicine and biology. Among the many issues related to this matter, a vital necessity is information technology with which we can easily represent and simulate the structurally complex dynamic causal interactions and processes of various biological objects such as genomic DNA, mRNA, proteins and functional proteins, and molecular transactions and processes such as metabolic pathways, signal transduction cascades, genetic networks, etc.

 In order for software tools to be accepted by users in biology/medicine for biopathway representation and simulation, at least the following two requirements should be met: (1) remove issues that are irrelevant to the biology; (2) allow users to represent biopathways intuitively and to easily understand and manage the details of the representation and simulation mechanism. We have developed a software tool, Genomic Object Net (http://www.GenomicObject.Net/), for representing and simulating biopathways based on [1,2], together with a visualization strategy that satisfies (1) and (2). It employs the notion of the hybrid object net [2] as its basic simulation architecture. Usually, biopathway information is conceptually described as a figure together with an explanation of the relations between the biological objects of concern and the measured/observed data establishing their qualitative/quantitative relations. Such information can be easily described and simulated with Genomic Object Net. We show representation and simulation examples of typical biopathways related to gene regulation, metabolic pathways and signal transduction, which cover the basic aspects of biopathways.
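
To make the net-based simulation idea concrete, below is a minimal sketch of the discrete part of such a net (places, tokens, firing rule). Genomic Object Net's hybrid object nets also include continuous places and firing speeds, which this toy example omits; the two-transition gene-expression net is invented.

# Places hold token counts; a transition fires when all its inputs are marked.
places = {"gene_on": 1, "mRNA": 0, "protein": 0}
transitions = [
    {"in": {"gene_on": 1}, "out": {"gene_on": 1, "mRNA": 1}},   # transcription
    {"in": {"mRNA": 1}, "out": {"mRNA": 1, "protein": 1}},      # translation
]

def enabled(t):
    return all(places[p] >= n for p, n in t["in"].items())

def fire(t):
    for p, n in t["in"].items():
        places[p] -= n
    for p, n in t["out"].items():
        places[p] += n

for step in range(3):
    for t in transitions:
        if enabled(t):
            fire(t)
print(places)  # token counts after a few firings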

[1] Matsuno, H., Doi, A., Nagasaki, M., and Miyano, S. 2000. Hybrid Petri net representation of gene regulatory network. Proc. Pacific Symposium on Biocomputing 2000, pp. 338-349.

[2] Drath, R. 1998. Hybrid Object Nets: An object oriented concept for modeling complex hybrid systems. Proc. Hybrid Dynamical Systems. 3rd International Conference on Automation of Mixed Processes, ADPM'98, pp. 437-442.


203. KEGG human cell cycle pathway and its comparison to viral genomes
Toshiaki Katayama, Yoshinori Okuji, Minoru Kanehisa, Bioinformatics Center, Kyoto University, Japan;
k@bioruby.org
Short Abstract:

 We have constructed human and yeast cell cycle pathway databases under the KEGG project and determined conserved and divergent pathways between the two species. Mapping homologous viral genes onto these pathway maps strongly suggested strategies of viral proliferation.

One Page Abstract:

 Molecular mechanisms of eukaryotic cell cycle regulation have been studied intensively in the past decade. We have assembled this knowledge from the literature and constructed yeast and human cell cycle regulatory pathway diagrams under the KEGG project. Compared to other work, our presentation provides an overall picture of the control flows involving the various molecular interactions in eukaryotic cell division. These pathway diagrams can be used for gene function assignment in other organisms, comparative network analysis of complex biological pathways, and visualization and correlation analysis of gene expression data from microarrays, among other applications. In this study, we have mapped homologous viral genes onto the pathway diagrams by sequence similarity, as a practical example of utilizing our pathway data. In parallel, we have constructed databases of viral genes from a set of complete viral genomes, called v-GENES and v-GENOME. As a result of a homology search of v-GENES entries against the pathway components, many viruses were revealed to have counterparts of the cell cycle regulatory genes. For example, viruses have G1 cyclin/CDK and its regulators, or G1 transcription initiators, but no subunits of large protein complexes. Such a tendency suggests that viruses carry only those genes that can critically affect the initiation of the host cell's proliferative activities. We will show these results, with taxonomical classifications of the viruses having genes homologous to those of their hosts, at the poster session. The KEGG/PATHWAY, v-GENES and v-GENOME databases are available from the KEGG website at http://www.genome.jp/kegg/kegg2.html.


205. Eukaryotic Protein Processing: Predicting the Cleavage Sites of Proprotein Convertases
Peter Duckert, Søren Brunak, Nikolaj Blom, Center for Biological Sequence Analysis, Biocentrum-DTU, Technical University of Denmark, DK-2800, Denmark;
peterd@cbs.dtu.dk
Short Abstract:

 Many biologically active proteins and peptides are generated by limited endoproteolysis of inactive precursors. Cleavage often occurs at sites containing basic amino acids. We examined sequence patterns characteristic of experimentally verified sites and describe a neural network based method for predicting whether a given site is a potential cleavage site. 

One Page Abstract:

 Many biologically active proteins and peptides are generated by limited endoproteolysis of inactive precursors. This is an important and evolutionarily ancient mechanism which determines the level and duration of specific biological activities and, in addition, ensures that the biologically active molecules are formed in the appropriate cellular compartments. After removal of the signal peptide, precursor cleavage often occurs at sites composed of single or paired basic amino acids (arginine [R] or lysine [K]) (Seidah & Chrétien, 1999).

The enzymes responsible for this cleavage are relatively few in number and have general functions. They have been molecularly and functionally characterized and shown to belong to a family of evolutionarily conserved serine proteases related to the subtilisin and kexin enzymes.

Seven mammalian members of a dibasic- and monobasic-specific subfamily of proteases related to the yeast subtilase kexin ("proprotein convertases" or PCs) are presently known: PC1, PC2, furin, PC4, PC5, PACE4 and PC7. Since not all mono- and dibasic sites are potential cleavage sites of PCs, we examined the sequence patterns characteristic of experimentally verified sites and describe a neural network based method for predicting whether a given site is a potential cleavage site for the PC enzymes. We present here preliminary work on the characterization and prediction of PC cleavage sites.
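
A sketch of the data preparation such a network-based predictor implies: locate candidate basic sites and encode a fixed sequence window around each as input features. The window size, padding and example sequence are illustrative choices, not those of the authors.

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def candidate_sites(seq):
    """Positions of single or paired basic residues (K/R)."""
    return [i for i, aa in enumerate(seq) if aa in "KR"]

def encode_window(seq, pos, flank=4):
    """One-hot encode the window around a candidate site; 'X' pads the ends."""
    window = "".join(seq[i] if 0 <= i < len(seq) else "X"
                     for i in range(pos - flank, pos + flank + 1))
    return [1.0 if aa == w else 0.0 for w in window for aa in AMINO]

seq = "MKWVTFISLLRRKRSAHE"
for pos in candidate_sites(seq):
    vec = encode_window(seq, pos)
    # vec would be the input to a feed-forward network predicting cleavage
    print(pos, seq[pos], len(vec))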


206. Understanding multi-organelle predicted subcellular localization
Joel Zupicich, Steven E. Brenner, William C. Skarnes, University of California, Berkeley;
joelz@socrates.berkeley.edu
Short Abstract:

 We have approached the problem of protein localization using tools to identify domains that are unlikely to appear in a single polypeptide. Using stringent criteria for existing tools, we have identified a large class of proteins in the SwissProt-TrEMBL database that exhibit characteristics of multiple organelles.

One Page Abstract:

 We have approached the problem of protein localization using tools to identify domains that are unlikely to appear in a single polypeptide. Using stringent criteria for existing computational tools, we have identified a large class of proteins in the SwissProt/TrEMBL database that exhibit characteristics of multiple organelles. Our results show that domains largely thought to be incompatible can exist in a single protein. Our data may lead to the discovery of new protein functions that are unlikely to be uncovered using classical biochemistry. In addition, these proteins are represented in taxonomically diverse species and are especially prevalent in C. elegans. Using subcellular localization methods in cell culture, we confirm our computational predictions for three mammalian proteins. In light of these results, the regulation of known proteins we identified may need to be reevaluated. 


207. Visualization and Interpretation of Molecular Scanner Data
Markus Müller, Robin Gras, Ron Appel, Swiss Institute of Bioinformatics;
Denis Hochstrasser, LCCC, Geneva University Hospital, Geneva, Switzerland;
markus.mueller@isb-sib.ch
Short Abstract:

 The molecular scanner is a highly automated method that combines 2D-gel electrophoresis with peptide mass fingerprinting (PMF) techniques in order to identify the proteins in a 2D-gel. Based on visualization methods, we derive a coupled map lattice algorithm that improves the signal-to-noise ratio of the PMF identification.

One Page Abstract:

 The molecular scanner is a highly automated tool that combines 2D-gel electrophoresis with peptide mass fingerprinting (PMF) techniques. Proteins separated in a 2D-gel are digested 'in parallel' and transferred onto a collecting membrane. Since diffusion in this process is not a problem, the location of the peptides on the membrane corresponds to the location of their parent proteins in the 2D-gel. A matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometer then scans the membrane, yielding a list of peptide masses for each scanned point. We visualize all the obtained masses, which provides important information on the presence of chemical noise. Since chemical noise is shown to be a potential source of false matches in the PMF identification procedure, removing this noise improves the results. We then present an algorithm based on coupled map lattices that makes use of the intensity distributions of the detected masses. It calculates the centers of these distributions and groups together nearby centers of different masses. The masses belonging to the same group are then submitted to our in-house PMF identification program SmartIdent. Since these masses are purged of chemical noise and of overlapping masses from other centers, they provide an unambiguous identification, which would not be the case if untreated mass lists were submitted. These identifications are then used to create a two-dimensional protein map of the 2D-gel.
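
The center-grouping step lends itself to a compact sketch: compute an intensity-weighted centroid for each mass's spatial distribution, then group centroids that lie within a distance threshold. The single-linkage grouping, the threshold and the toy scan data are illustrative; the actual algorithm is formulated as a coupled map lattice.

from math import hypot

def center(points):
    """Intensity-weighted centroid of (x, y, intensity) observations."""
    w = sum(p[2] for p in points)
    return (sum(p[0] * p[2] for p in points) / w,
            sum(p[1] * p[2] for p in points) / w)

def group_masses(mass_points, radius=1.5):
    """Group per-mass centers lying within 'radius' of each other."""
    centers = {m: center(pts) for m, pts in mass_points.items()}
    groups = []  # each group: list of (mass, (x, y))
    for m, (x, y) in centers.items():
        for g in groups:
            if any(hypot(x - gx, y - gy) <= radius for _, (gx, gy) in g):
                g.append((m, (x, y)))
                break
        else:
            groups.append([(m, (x, y))])
    return [[m for m, _ in g] for g in groups]

scans = {1042.5: [(10, 12, 300), (11, 12, 500)],
         1250.6: [(10, 13, 200)],
         900.4:  [(30, 5, 800)]}
print(group_masses(scans))  # masses grouped per putative protein spot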


208. SFINX: A generic system for integrated graphical analysis of predicted protein sequence features
Erik Sonnhammer, Center for Genomics and Bioinformatics, Karolinska Institutet;
Erik.Sonnhammer@cgr.ki.se
Short Abstract:

 A package for integrated graphical analysis of predicted protein sequence features is presented. The outputs from a large set of prediction programs for coiled coils, secondary structure, signal peptides and transmembrane segments are presented graphically in the Blixem and Dotter viewers. The system is available at www.cgr.ki.se/SFINX.

One Page Abstract:

 Correct prediction of functionally important sequence features is becoming increasingly important as we enter the era of functional genomics, in which wet experiments are guided by bioinformatics. One important type of sequence feature is signals for subcellular localization. Such information greatly influences hypotheses about biological function and guides the choice of experiments to test these hypotheses.

However, different programs often produce disagreeing predictions, even for transmembrane topology, which is often considered trivial. It is therefore important to detect areas of disagreement between programs or algorithmic variants and to use all available information to produce a model of the most reliable structural and functional sequence features in a given protein.

With these purposes in mind, we present a generic software design which allows many different sets of segmental or continuous-curve sequence features to be viewed in combination. Any such data, generated by individual external programs, may be judged alongside a self dot-plot or a multiple alignment of database matches. The implementation is based on extensions to the graphical viewers Dotter and Blixem, and on scripts that convert the output of external programs to a simple generic data definition format called SFS (Sequence Feature Series). The entire package of scripts and graphical viewers is called SFINX.
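
A toy sketch of the conversion idea: wrap the segments reported by an external predictor into a generic feature-series record that a viewer can consume. The field layout shown here is hypothetical; the real SFS format is defined by the SFINX package.

def to_feature_series(name, seq_id, segments):
    """segments: list of (start, end, score) in sequence coordinates.
    Returns a simple tab-separated record (layout is illustrative only)."""
    lines = ["# type=%s seq=%s" % (name, seq_id)]
    for start, end, score in segments:
        lines.append("%s\t%d\t%d\t%.2f" % (name, start, end, score))
    return "\n".join(lines)

# e.g. transmembrane segments reported by some external program
tm = [(12, 34, 0.98), (56, 78, 0.87)]
print(to_feature_series("TM", "my_protein", tm))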

A web server that can run these analyses and launch Blixem or Dotter on Windows and Unix machines is available at www.cgr.ki.se/SFINX. The output is passed to the viewer in the generic SFS data format and could thus in principle be displayed by other viewers as well. It is also possible to get the output of the predictions in XML.

The poster describes applications for analysis of compositional and repetitive features in protein sequences, such as predicted coiled coils, secondary structure, signal peptides, transmembrane segments, as well as general low-complexity or periodic subsequences. Dot-plots and flanking database matches provide valuable contextual information for these assignments. It further shows that disagreement between prediction programs is very common, and that simultaneous inspection of underlying propensities, predictions, and homology information can lead to significantly improved prediction of structural and functional features, and localization. 


209. Protein Pathway Profiling
Peter Rieger, Head of Method Development;
prieger@kelman.de
Short Abstract:

 The in silico protein-protein interaction mapping of the complete yeast genome with Kelman's SPRAB technology is demonstrated. We constructed a GeneNetwork representing all putative interactions between the proteins encoded by the yeast genome. The software tool GeneViator provides visualization of the data and navigation within this network.

One Page Abstract:

 Despite the complete sequencing of more and more genomes, including the human genome, scientists are still only at the threshold of understanding the functions of numerous individual genes, especially with respect to their complex interplay. To move ahead in this field, researchers need advanced biocomputing solutions. Kelman offers such a high-end solution for bioinformatics and functional genome research, which ensures new levels of data consistency and exploitation. By means of the SPRAB (Selective Protein Recognition And Binding) technology we can predict genetically determined functional relationships between proteins. From this computational protein-protein interaction mapping we are able to construct local gene networks, which provide an in-depth understanding of gene interplay, involving gene products in all their molecular versions.

Here we present the in silico protein-protein interaction mapping of the complete yeast genome, demonstrating the systematic application of the SPRAB technology. The resulting Yeast-GeneNetwork represents the complete set of putative interactions between the proteins encoded by the yeast genome. Visualization of the data, as well as navigation within this network of physical protein-protein interactions, is realized by means of Kelman's software tool GeneViator. Moreover, GeneViator enables the user to include information from other sources in the network, such as experimental interaction data, expression data and so on. Thus a new dimension of uncovering and profiling protein pathways can be presented. The approach of considering and combining different aspects of gene interactions allows a raster search for relevant pathways, providing an in-depth understanding of gene function. The advantages of Kelman's approach are exemplified here with selected findings from the Yeast-GeneNetwork.


210. Ab initio prediction of human orphan protein function
Lars Juhl Jensen, Center for Biological Sequence Analysis, The Technical University of Denmark;
R. Gupta, C.A.F. Andersen;
D. Devos, Protein Design Group, CNB-CSIC, Spain;
A. Krogh;
J. Tamames, Protein Design Group, CNB-CSIC, Spain;
A. Valencia;
H. Nielsen, S. Brunak;
N. Blom, C. Kesmir, C. Workman, H.H. Staerfeldt, K. Rapacki, S. Knudsen, Center for Biological Sequence Analysis, The Technical University of Denmark;
ljj@cbs.dtu.dk
Short Abstract:

 We present a novel method for ab initio prediction of protein function from sequence data alone. The cellular role is predicted by neural networks that integrate predictions of post-translational modifications and other protein features known to play an important role in determining subcellular location and regulation of proteins.

One Page Abstract:

 Of the 30,000 to 40,000 genes believed to be present in the human genome, no more than half can be assigned a functional role based on homology to known proteins. Traditionally, protein function has been viewed as something directly related to the conformation of the polypeptide chain. However, as the 3D structure is currently quite hard to calculate from the sequence, a computational strategy for the elucidation of orphan protein function may also benefit from the prediction of functional attributes that are more directly related to the linear sequence of amino acids.

Our approach to function prediction is based on the fact that a protein is not alone when performing its biological task. As it has to operate using the same cellular machinery for modification and sorting as all other proteins, one can expect some conservation of essential types of post-translational modifications (PTMs). Because reasonably precise methods for predicting PTMs from sequence exist today, our prediction method, which integrates such relevant features to assign orphan proteins to functional classes, can be applied to all proteins whose sequence is known.

For any function prediction method, the ability to correctly assign the relationship depends strongly on the function classification scheme used. We predict a scheme of 12 cellular functions closely related to the 14-class scheme originally proposed by Riley for the E. coli genome. All human sequences in SWISS-PROT were automatically assigned to classes by a system based on an additive scoring scheme over SWISS-PROT keywords. These scores were then compared to two thresholds to obtain a positive and a negative set in which the most uncertain functional annotations are excluded. To minimize the problem of having similar sequences in training and testing, we used a heuristic algorithm to split our data into two sets such that the similarity between the two sets was minimal.

To find the optimal combination of parameters for each category we used a bootstrap strategy. First, for every category a simple feedforward neural network with one hidden layer was trained on each separate feature to judge which features were potentially useful for predicting at least one category. For each category a network was then trained on every pair of these features, and subsequently on combinations of more features, to find the best feature combinations. An ensemble for each functional class was made from the five best networks for that class.

The outputs of these networks were subsequently transformed into probabilistic scores based on Gaussian kernel density estimates of the score distributions of positive and negative examples, respectively. To calculate the combined prediction of an ensemble of networks we simply take the average of their probabilistic predictions.
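
A small sketch of that conversion, under the assumption of equal class priors (the abstract does not state the priors used): estimate densities of positive and negative scores with Gaussian kernels, convert each raw output to a probability, and average over the ensemble. The score values below are invented.

import numpy as np
from scipy.stats import gaussian_kde

pos_scores = np.array([0.7, 0.8, 0.85, 0.9, 0.95, 0.6, 0.75])
neg_scores = np.array([0.1, 0.2, 0.3, 0.25, 0.4, 0.15, 0.35])

kde_pos = gaussian_kde(pos_scores)
kde_neg = gaussian_kde(neg_scores)

def probabilistic(score):
    """P(positive | score), assuming equal priors for the two classes."""
    p, n = kde_pos(score)[0], kde_neg(score)[0]
    return p / (p + n)

ensemble_outputs = [0.55, 0.82, 0.3]
# the ensemble prediction is the average of the probabilistic scores
print(sum(probabilistic(s) for s in ensemble_outputs) / len(ensemble_outputs))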

Interestingly, the combinations of attributes selected for a given category (among the 20+ initially considered) also implicitly characterize that functional class in an entirely new way. It appears that the use of post-translational modifications is essential for the prediction of several functional classes. In addition to attributes related to subcellular location, the most important features for predicting whether a protein is, say, regulatory or not are all PTMs. Similarly, PTMs are very important for the correct assignment of proteins related to the cell envelope, replication and transcription.

The fact that (predicted) PTMs correlate strongly with the functional categories fits well with biological knowledge. For proteins with "regulatory function", two of the most important features were S/T phosphorylation and Y phosphorylation, respectively. It is very satisfying that the neural networks found this correlation, considering that reversible phosphorylation is a well-known and widely used regulatory mechanism. The choice of encoding also makes biological sense, as serines and threonines are known to be phosphorylated by the same kinases, while tyrosines are phosphorylated by different kinases.

The selection of category-relevant attributes is based on a quantitative assessment of the ability to predict (assign) categories for orphan sequences not similar to the sequences used to train the method. When the sensitivity is below 40%, the level of false positive predictions is very low. The confidence in the predictions can be used directly to separate assignments with a low probability of being wrong from those where the probability is higher.

We have used our method to estimate the breakdown of the human genome into functional categories. Using for every protein the predicted probability of each category, the number of proteins in each category was estimated by summing the probability of the category in question over all proteins.


211. A consensus-based approach to genome data mining of beta-barrel outer membrane proteins
Rajiv V. Basaiawmoit, Manjunath K.R., Krishnaswamy S., Bioinformatics centre, SBT, M.K.University;
rajivvaid@usa.net
Short Abstract:

 We have developed a consensus-based approach using compositional analysis, hydropathy profiles, sequence comparison techniques, secondary structure prediction and structure-based sequence profiles to delineate beta-barrel outer membrane proteins in proteome databases. The results of the analysis on the available proteomes will be presented.

One Page Abstract:

 Beta-barrel outer membrane proteins are associated with a variety of important functions, from passive trimeric porins like OmpF to monomeric active transporters like FhuA. The beta-barrel structures range from 8-stranded to 22-stranded barrels, with possible functional differentiation based on barrel size. They can also be responsible for the pathogenicity of an organism and for defense against attack proteins, and some porins are involved in apoptosis. Their functional diversity therefore makes them an interesting class of proteins. With large-scale genome sequencing underway, it would be advantageous to identify porins across proteomes. The successful identification of the putative function of a protein often depends on the first step of a search (BLAST or FASTA) against a database. Using a single method for the analysis may lead to misleading results, which calls for a consensus-based approach to the identification of beta-barrel outer membrane proteins. We have developed a consensus-based approach using compositional analysis, hydropathy profiles, sequence comparison techniques (BLAST and FASTA), secondary structure prediction and structure-based sequence profiles to delineate beta-barrel outer membrane proteins in proteome databases. We applied this to fully sequenced genomes, with a more detailed analysis of the spirochete genome sequences of Borrelia burgdorferi and Treponema pallidum. Large discrepancies were found in the annotations; for example, in one case the annotation terms a protein a porin, whereas our analysis shows that it is unlikely to be a beta-barrel protein. The work can be extended to the creation of a non-redundant, annotated database of porins, and can also aid structural genomics initiatives for porins with scripts for automation of the process.
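
A minimal sketch of the consensus idea: each evidence source votes, and a candidate is accepted when enough sources agree. The two predicate functions below are crude stand-ins for the real analyses (composition, hydropathy, BLAST/FASTA, secondary structure, profiles), and the thresholds and test sequence are invented.

def consensus_beta_barrel(seq, predictors, min_votes=3):
    """Accept a sequence as a putative beta-barrel OMP if at least
    min_votes of the independent predictors agree."""
    votes = sum(1 for p in predictors if p(seq))
    return votes >= min_votes, votes

def high_tyr_gly(seq):
    # compositional hint: outer membrane barrels are often Y/G rich
    return (seq.count("Y") + seq.count("G")) / len(seq) > 0.15

def no_long_hydrophobic(seq, window=19):
    # hydropathy hint: no long alpha-helical transmembrane stretch
    hydrophobic = set("AILMFWV")
    return not any(all(aa in hydrophobic for aa in seq[i:i + window])
                   for i in range(len(seq) - window + 1))

predictors = [high_tyr_gly, no_long_hydrophobic]  # plus BLAST, profiles, ...
print(consensus_beta_barrel("MGYKLGAYTVGGYGAFGG" * 5, predictors, min_votes=2))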


212. Predicting tyrosine sulfation sites in protein sequences
Flavio Monigatti, Eva Jung, Amos Bairoch, Swiss Institute of Bioinformatics;
Eva.Jung@isb-sib.ch
Short Abstract:

 Tyrosine sulfation is an important post-translational modification of secreted and/or membrane-bound proteins. To predict tyrosine sulfation sites in protein sequences, a novel program (the Sulfinator) will be presented that combines two different, serially switched Hidden Markov Models (HMMs). The Sulfinator will be made available at http://www.expasy.org/tools.

One Page Abstract:

 Tyrosine sulfation is a ubiquitous post-translational modification of proteins that pass through the secretory pathway within living cells. The biological role of sulfation is largely unknown; however, there is strong evidence that protein sulfation is required for optimal biological activity of proteins. No clear-cut acceptor motif utilized by protein tyrosyltransferases has been described that could be used for the prediction of tyrosine sulfation sites in proteins. Here we present a novel method to predict tyrosine sulfation sites in protein sequences using two different, serially switched Hidden Markov Models (HMMs). The first HMM consists of a fifteen amino acid long linear chain with the target tyrosine at position six. While the first HMM is responsible for the recognition of possible sulfation sites, the second HMM, an eleven amino acid long linear chain, ensures the correct alignment of the previously matched sequences. On a test set of validated non-sulfated and sulfated sequences extracted from the SWISS-PROT database, the Sulfinator correctly predicts ~99% of the tyrosine sulfation sites (true positives) and 98% of the non-sulfated tyrosines (true negatives). The results from scanning the available proteomes in the SWISS-PROT database suggest that tyrosine sulfation may be more abundant than previously anticipated. The Sulfinator will be made available at http://www.expasy.org/tools.


213. Characterization of aspartylglucosaminuria mutations
Jani Saarela, Minna Laine, National Public Health Institute, Finland;
Carita Oinonen, University of Joensuu, Finland;
Carina von Schantz, Anu Jalanko, National Public Health Institute, Finland;
Juha Rouvinen, University of Joensuu, Finland;
Leena Peltonen, University of California Los Angeles;
Jani.Saarela@ktl.fi
Short Abstract:

 Aspartylglucosaminuria is a recessively inherited human disease. Altogether 26 different aspartylglucosaminuria mutations have been identified. Many of these interfere with the complex intracellular maturation and processing of the aspartylglucosaminidase polypeptide. We used the three-dimensional structure of functional aspartylglucosaminidase enzyme to predict structural consequences of aspartylglucosaminuria mutations.

One Page Abstract:

 A deficiency of functional aspartylglucosaminidase (AGA) causes a lysosomal storage disease, aspartylglucosaminuria (AGU). The recessively inherited disease is enriched in the Finnish population, where 98% of AGU alleles contain one founder mutation, AGUFin. Elsewhere in the world, we and others have described 18 different sporadic mutations in AGU patients. Many of these mutations are predicted to interfere with the complex intracellular maturation and processing of the AGA polypeptide. Proper initial folding of AGA in the endoplasmic reticulum depends on intramolecular disulfide bridge formation and dimerization of two precursor polypeptides. The subsequent activation of AGA occurs autocatalytically in the endoplasmic reticulum, and the protein is transported via the Golgi to the lysosomal compartment using the mannose 6-phosphate receptor pathway. We used the three-dimensional structure of AGA to predict the structural consequences of AGU mutations, including six novel mutations, and have made an effort to characterize every known disease mutation by dissecting the effects of the mutations on the intracellular stability, maturation, transport and activity of AGA. Most mutations are substitutions replacing the original amino acid with a bulkier residue. Mutations at the dimer interface prevent dimerization in the ER, while active site mutations not only destroy the activity but also affect maturation of the precursor. Depending on their effects on the AGA polypeptide, the mutations can be categorized as mild, average or severe. These data contribute to the expanding body of knowledge pertaining to the molecular pathogenesis of AGU.


214. Elucidating a "theoretical" proteome of the Arabidopsis thaliana thylakoid
Olof Emanuelsson, Stockholm Bioinformatics Center, Stockholm University, Sweden;
Jean-Benoît Peltier, Dept. of Plant Biology, Cornell University, USA;
Gunnar von Heijne, Stockholm Bioinformatics Center, Stockholm University, Sweden;
Klaas J. van Wijk, Dept. of Plant Biology, Cornell University, USA;
olof@sbc.su.se
Short Abstract:

 Scanning the entire Arabidopsis genome using subcellular localization predictors (TargetP, SignalP-HMM) and a transmembrane predictor (TMHMM), we predict the total proteome size of the lumen of the chloroplast sub-compartment thylakoid to be somewhere between 200 and 400 different proteins, of which a substantial part lacks any functional annotation.

One Page Abstract:

 The Arabidopsis thaliana genome offers excellent opportunities to develop and test whole-genome approaches to theoretical proteomics. Using subcellular localization predictions (TargetP followed by SignalP-HMM) and subsequent transmembrane predictions (TMHMM 2.0), we have predicted the total proteome size of the lumen of the thylakoid, a chloroplast sub-compartment, to be somewhere between 200 and 400 different proteins, of which a substantial part lacks any functional annotation and approximately 50% contain a TAT-pathway signal. We have also evaluated the combined predictor approach in several ways, specifically addressing the SignalP performance on the signal peptide-like thylakoid transfer domain adjacent to the chloroplast transit peptide, and it has become clear that a thylakoid-dedicated signal peptide predictor could potentially be useful. The outcome of the predictions will be used by biologists to guide experimental verification of the thylakoidal localization of newly discovered Arabidopsis proteins.
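
A sketch of how such a predictor chain can be expressed, assuming the outputs of TargetP, SignalP-HMM and TMHMM 2.0 have already been run and parsed into a dictionary; the key names and example entries are invented.

def candidate_lumen_proteins(predictions):
    """predictions: id -> dict with keys 'targetp', 'signalp', 'tm_helices'
    holding pre-parsed outputs of TargetP, SignalP-HMM and TMHMM 2.0."""
    hits = []
    for pid, p in predictions.items():
        if p["targetp"] != "chloroplast":
            continue  # must first be predicted chloroplast-targeted
        if not p["signalp"]:
            continue  # transit peptide must be followed by a signal-like domain
        if p["tm_helices"] > 0:
            continue  # soluble lumenal proteins should lack TM helices
        hits.append(pid)
    return hits

preds = {"At1g01000": {"targetp": "chloroplast", "signalp": True, "tm_helices": 0},
         "At1g02000": {"targetp": "secretory", "signalp": True, "tm_helices": 1}}
print(candidate_lumen_proteins(preds))  # -> ['At1g01000']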


215. Machine Learning Algorithms in the Detection of Functional Relatedness of Proteins
Mahesan Niranjan, Renata Camargo, The University of Sheffield;
r.camargo@dcs.shef.ac.uk
Short Abstract:

 This poster reports on the application of machine learning algorithms to detect functional similarity between pairs of proteins. Decision making is based upon a set of features derived from structural, sequence and keyword descriptions in the literature. The work is based on protein pairs labelled by Holm and Sander (ISMB97).

One Page Abstract:

 Detecting functional similarity between pairs of proteins may be cast as a machine learning problem in cases where many diverse pieces of information are available for the proteins. Such a problem was formulated by Holm and Sander (1997), who make available a dataset of 940 protein pairs hand-labelled as evolutionarily related or not. The features describing each pair are: (a) structural similarity as measured by the z-score of the alignment; (b) sequence overlap above or below a threshold; (c) Enzyme Classification number characterizing the biochemical reactions; (d) experimental information on functional sites; (e) predicted functional sites; and (f) overlap in keywords of the literature describing the pair. Holm et al. report coverage/selectivity for each of the features of the database and suggest that more powerful machine learning algorithms may be applied to this problem.

In this poster we report on work that pursues this line by reconstructing the dataset of the same protein pairs and applying a range of machine learning algorithms, including logistic regression, Fisher linear discriminant analysis and support vector machines, to separate functionally related pairs from those that are not related. A particular result is the selection of features appropriate for the classification.
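
A minimal sketch of this comparison using scikit-learn, with random placeholder feature vectors standing in for the six Holm/Sander features (the real experiments use the 940 hand-labelled pairs):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((940, 6))                   # one row per pair, features (a)-(f)
y = (X[:, 0] + X[:, 5] > 1.0).astype(int)  # placeholder labels, not real data

for clf in (LogisticRegression(max_iter=1000),
            LinearDiscriminantAnalysis(),
            SVC(kernel="rbf")):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, "accuracy %.2f" % scores.mean())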

Reference:

Liisa Holm and Chris Sander (1997): Decision Support System for the Evolutionary Classification of Protein Structures, Proc. ISMB, 1997.


216. Predicting protein functions based on InterPro and GO
Wolfgang Fleischmann, Nicola Mulder, Alexander Kanapin, Evgueni Zdobnov, Rolf Apweiler, European Bioinformatics Institute;
fleischmann@ebi.ac.uk
Short Abstract:

 We present a mapping between InterPro (a database of protein families, domains and functional sites) and GO (a controlled vocabulary of gene product functions, processes and components). 

These data allow protein sequences to be classified with high coverage (47%) in a reliable and robust way.

One Page Abstract:

 We observed that many domains and families in InterPro (www.ebi.ac.uk/interpro) are conserved enough to infer the function of proteins matching these InterPro signatures.

As the controlled vocabulary for protein functions, we used the GO terms provided by the Gene Ontology Consortium (www.geneontology.org), as they are gaining support in the user community.

To obtain reliable results, we manually inspected all SWISS-PROT proteins known to match a given InterPro entry and assigned the relevant GO terms for function, process and component. We avoided domain-specific terms and used only those terms that apply to the whole protein.

We assigned GO terms to 2567 of the 3914 InterPro entries. Broken down by GO ontologies, 2308 InterPro entries can predict the molecular function, 1943 entries the biological process, and 1090 entries the cellular component.

Using the InterPro2GO mapping, it is possible to infer GO terms for 47.0% of all SWISS-PROT + TrEMBL proteins. This coverage is expected to rise in the future, as we add new signatures to InterPro and continue to assign GO terms to the remaining InterPro entries.
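
The inference itself is a straightforward lookup: collect the GO terms mapped to each InterPro signature a protein matches. A toy sketch follows; the mapping entries and protein accessions below are invented for illustration, not taken from InterPro2GO.

# hypothetical excerpt of an InterPro -> GO mapping
interpro2go = {
    "IPR000719": ["GO:0004672", "GO:0006468"],
    "IPR001245": ["GO:0004713"],
}

protein_matches = {"P12345": ["IPR000719"],
                   "Q99999": ["IPR001245", "IPR000719"],
                   "O00000": []}

def infer_go(matches):
    """Union of GO terms over all matched InterPro signatures."""
    terms = set()
    for ipr in matches:
        terms.update(interpro2go.get(ipr, []))
    return sorted(terms)

for prot, matches in protein_matches.items():
    print(prot, infer_go(matches) or "no assignment")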

The mapping is incorporated into the InterPro database and is accessible at www.ebi.ac.uk/interpro.

Furthermore, we regularly construct overview reports of the molecular functions of the completely sequenced genomes, available at www.ebi.ac.uk/proteome.

To enable users to download the assignments and predict the function of proprietary protein sequences, we added a module to the InterProScan software. 


217. iPSORT: Simple rules for predicting N-terminal protein sorting signals.
Hideo Bannai, Human Genome Center, Institute of Medical Science, University of Tokyo;
Yoshinori Tamada, Department of Mathematical Sciences;
Osamu Maruyama, Faculty of Mathematics, Kyushu University;
Kenta Nakai, Satoru Miyano, Human Genome Center, Institute of Medical Science, University of Tokyo;
bannai@ims.u-tokyo.ac.jp
Short Abstract:

 Using a discovery-oriented approach to hypothesis generation, we search for simple, understandable rules with high accuracy for predicting N-terminal sorting signals of proteins. The prediction accuracy comes close to that of the state-of-the-art neural network based predictor, TargetP. An experimental web service is provided at http://www.hypothesiscreator.net/iPSORT/.

One Page Abstract:

 The prediction of the localization sites of proteins is an important and challenging problem in molecular biology. Currently, a neural network based system, TargetP, is the best predictor of N-terminal sorting signals in the literature. One drawback of neural networks, however, is that it is generally difficult to understand and interpret how and why they make their predictions. In this work, we aim to generate simple, interpretable rules as predictors while still achieving practical prediction accuracy. We adopted a discovery-oriented approach consisting of an extensive search over simple rules and various attributes. The rules we search for include pattern matching over an alphabet of amino acid classifications, as well as rules built from the amino acid attributes contained in the AAindex database. A rule is created for each signal type and combined into a decision list that forms the final predictor. We have succeeded in finding rules for plant proteins that are almost as good as TargetP in terms of prediction accuracy, while still retaining a very simple and interpretable form. We further apply the acquired knowledge to non-plant proteins.

The rules we obtained are consistent with widely believed characteristics of the N-terminal sorting signals, and it is somewhat surprising that such accuracy could be obtained with such attributes. 
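
For illustration, here is what a decision list over simple N-terminal attributes looks like in code. The attributes, thresholds and labels below are invented (the actual iPSORT rules were learned from data); the example sequence resembles a secretory signal peptide.

def net_charge(seq):
    return sum(1 for aa in seq if aa in "KR") - sum(1 for aa in seq if aa in "DE")

def hydrophobicity(seq):  # crude stand-in for an AAindex attribute
    return sum(aa in "AILMFWV" for aa in seq) / len(seq)

RULES = [
    (lambda n: "S" in n[:5] and net_charge(n[:20]) > 2, "mitochondrial"),
    (lambda n: hydrophobicity(n[:15]) > 0.5, "signal peptide"),
    (lambda n: net_charge(n[:30]) > 4, "chloroplast"),
]

def predict(seq):
    n_term = seq[:30]
    for condition, label in RULES:  # first matching rule wins
        if condition(n_term):
            return label
    return "other"

print(predict("MKWVTFISLLFLFSSAYSRG"))  # -> signal peptide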


218. Molecular dynamics of protein-RNA interactions: the recognition of an RNA stem-loop by a Staufen double-stranded RNA-binding domain
Tiziana Castrignano`, Giovanni Chillemi, CASPUR (Italian Interuniversities Consortium for Supercomputing Applications);
Gabriele Varani, MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK;
Alessandro Desideri, University of Rome "Tor Vergata", Via della Ricerca Scientifica, 00133 Rome, Italy;
Tiziana.Castrignano@caspur.it
Short Abstract:

 In this work we report 2-ns molecular dynamics simulations of three molecular systems: 1. the complex between a double-stranded RNA-binding domain (dsRBD) from Drosophila and an RNA stem-loop; 2. the free dsRBD protein; 3. the free RNA stem-loop. Analysis of the trajectories highlights the regions involved in the recognition process.

One Page Abstract:

 RNA-protein interactions play a central role in a wide range of biological processes. One of the most common RNA-binding motifs is the double-stranded RNA-binding domain (dsRBD), found in many eukaryotic and prokaryotic proteins involved in RNA processing, maturation and localization. In this work we report 2-ns molecular dynamics simulations in aqueous solution of three molecular systems: 1. the complex between the third dsRBD (dsRBD3) from Drosophila Staufen and an RNA stem-loop; 2. the free protein dsRBD3; 3. the free RNA stem-loop. All systems were simulated using the AMBER force field and the Ewald summation method to treat the electrostatic interactions. Analysis of the trajectories has allowed us to highlight the regions involved in the recognition process and to compare the difference in flexibility between the free and the bound macromolecules. These data are also compared to the experimental NMR structures and the experimental mutational studies to provide a description of the residues crucial for binding affinity and specificity.


219. Numerical analysis of the RNA structurisation process.
Ekaterina Kozyreva, State Research Institute of Genetics and Selection of Industrial Microorganisms;
T. M. Eneev, N. N. Kozlov, E. I. Kugushev, Keldysh Institute of Applied Mathematics RAS;
D. I. Sabitov, Moscow State University;
Katya.Kozyreva@moscow.att.com
Short Abstract:

 A problem of RNA structurisation process during its transcription is considered. Both a mathematical model of RNA secondary structure formation and its principally new algorithm developed on the basis of oriented, dynamically changing mathematical graph with corresponding computer implementation are described.

One Page Abstract:

 The problem of RNA structurisation during the transcription process is considered. We describe a mathematical model of RNA secondary structure formation and a principally new algorithm with a corresponding computer implementation. The growth of the RNA chain is associated with a set of discrete internal structural gaps, from the unstructured state to the final configuration in which the molecule folds into a locally stable structure. The ordered set of these gaps simulates the elongation of the RNA and its structure formation at the current step of transcription. The application of this so-called consecutive approach significantly improves the accuracy of RNA secondary structure prediction and supports the assumption that the RNA transcription process is discontinuous. The set of inter-structural gaps is represented by an oriented, dynamically changing mathematical graph. Each vertex of the graph corresponds to the set of possible secondary structures of the RNA transcript at the current step of RNA chain growth. The edges are assigned dG, the free-energy increment associated with the permissible structural transitions from the unstructured to the structured state at each step of the simulation process. A path in this graph characterizes the step-by-step formation of the RNA secondary structure. A new computer implementation, based on consecutive accumulation and storage of transition paths in the structural graph, reduces the cost of RNA secondary structure calculation by up to two orders of magnitude in comparison with other traditional approaches, and averages about 12 hours per molecule on the MVS-1000 complex. The development of the approach described above makes it possible to carry out a comprehensive set of computational experiments for RNA molecules of more than 150 nucleotides. Fifty molecules of RNase P RNA with known secondary structure were tested with the proposed approach. It was revealed that the accuracy of secondary structure prediction grows rapidly as the period of transcription increases from 1 to 20 nucleotides per step. Three remarkable peaks in the number of RNA molecules with more than 50% of correctly predicted base pairs should be noted in the range of RNA chain growth speeds from 20 to 60 nucleotides per step; it can be assumed that the period of transcription lies within these limits. All three peaks in the range of 20-60 nucleotides correspond to values of T that are multiples of the length of one helical turn of RNA in the A-form. For values of T above 60 nucleotides, the secondary structure prediction accuracy curve shows a declining trend.


220. Estimation of the Amount of A-DNA and Z-DNA in Sequenced Chromosomes
David Ussery, Dikeos Mario Soumpasis, Hans Henrik Stærfeldt, Peder Worning, Anders Krogh, CBS, DTU;
dave@cbs.dtu.dk
Short Abstract:

 We have examined sequenced chromosomes for stretches of purines (R) or pyrimidines (Y) capable of forming A-DNA and alternating YR stretches which could form Z-DNA. Out of more than 500 sequenced chromosomes from eukaryotes, prokaryotes, and viruses, the majority have more A-DNA and Z-DNA than expected for a random sequence.

One Page Abstract:

 We have examined sequenced chromosomes for stretches of purines (R) or pyrimidines (Y) capable of forming A-DNA, and for alternating YR stretches which could form left-handed Z-DNA. Since A-DNA helices can readily form within stretches of 5 purines in a row, we measure the fraction of each genome contained in purine (or pyrimidine) tracts of length 5 bp or longer as a measure of the A-DNA content. Using this criterion, a random sequence would be expected to contain about 18.75% A-DNA. On average, the more than 500 sequenced chromosomes examined contained 25% A-DNA, with a low of 10% and a high of 40%. In the majority of cases (84% of the chromosomes, which contain 98% of the total DNA), there is more A-DNA than would be expected from a random sequence. The percentage of the chromosome capable of forming Z-DNA is estimated by looking for alternating pyrimidine-purine stretches (YR)n at least 10 bp long. Under this assumption, the expected value for a random sequence is about 0.6% of the genome. Although the average over all genomes was higher than expected (1.04% across all chromosomes), in many prokaryotic genomes such tracts are found less often than expected, whilst in eukaryotic chromosomes alternating YR tracts are more common than anticipated. Overall, about a third of the chromosomes (197, mainly viral, of 589 in total) had fewer alternating purine/pyrimidine stretches than expected, whilst the remaining two thirds (including all eukaryotic chromosomes) had more, with some protozoan chromosomes containing YR stretches of 10 bp or longer over more than 6% of their length (more than 10x the expected value). Localisation of A-DNA and Z-DNA regions within chromosomes and possible biological roles for these alternative DNA conformations are discussed.
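
Both quantities reduce to counting positions covered by simple sequence patterns, so a small sketch is easy to give. The regular expressions below implement the two criteria (R/Y runs of 5 bp or more, and (YR)n stretches of 10 bp or more); scanning only one of the two alternation frames, and the toy sequence, are simplifications.

import re

def a_dna_fraction(seq):
    """Fraction of positions inside purine or pyrimidine runs >= 5 bp
    (the abstract's expectation for a random sequence is about 18.75%)."""
    purine = re.compile(r"[AG]{5,}")
    pyrimidine = re.compile(r"[CT]{5,}")
    covered = sum(m.end() - m.start()
                  for pat in (purine, pyrimidine)
                  for m in pat.finditer(seq))
    return covered / len(seq)

def z_dna_fraction(seq):
    """Fraction of positions inside alternating YR stretches >= 10 bp
    (expected ~0.6% for a random sequence). A fuller implementation
    would also scan the RY frame."""
    alternating = re.compile(r"(?:[CT][AG]){5,}")
    covered = sum(m.end() - m.start() for m in alternating.finditer(seq))
    return covered / len(seq)

seq = ("GGGGGAAA" + "TTTTTC" + "TGTGTGTGTGTG" + "ACGT") * 10
print("A-DNA: %.1f%%  Z-DNA: %.1f%%" % (100 * a_dna_fraction(seq),
                                        100 * z_dna_fraction(seq)))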


221. Feature selection to discriminate five dominant bacteria typically observed in effluent treatment plants
Hemant J. Purohit, D.V.Raje, R.N.Singh, National Environmental Engineering Research Institute;
hemantdrd@hotmail.com
Short Abstract:

 Six dinucleotide features were selected from 16S rRNA sequences using a stepwise algorithm to discriminate five dominant bacterial groups from effluent treatment plants. Two linear composites were obtained that discriminate the training set of sequences with 91% accuracy, and they were validated on a test set with almost the same predictive accuracy.

One Page Abstract:

 Defining a microbial community and identifying bacteria, at least at the genus level, is a first step in predicting the behavior of any biological treatment system. In effluent treatment plants, the most dominant and typically observed bacterial groups are Pseudomonas, Moraxella, Acinetobacter, Burkholderia and Alcaligenes. Even though genetically close, these bacteria may be distinguished from each other based on their nucleotide compositions. Our interest lies in selecting features from 16S rDNA sequences that could be used to develop a tracking tool. Twenty sequences from each of the above groups were retrieved from GenBank. A feature space comprising likelihood estimates of dinucleotides was defined on the sequence data. A stepwise feature selection method was used, which resulted in six of the sixteen features showing significant variability across the sequences. Multiple-group discriminant analysis was carried out to test the efficacy of the selected features in segregating the sequences into their respective groups. Two linear composites, as functions of these features, could discriminate the training set of sequences with 91% accuracy and were validated on a test set with almost the same predictive accuracy. This ascertained the relevance of the selected features for the classification. These features, independently or in combination, might generate genus-specific patterns that could be used to develop PCR protocols and thereby a tracking tool. The program for determining the likelihood estimates is available from the corresponding author. The rest of the analysis was carried out using the commercially available SPSS software.
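
For illustration, a sketch of the feature extraction step: one vector of dinucleotide frequencies per 16S sequence. Plain relative frequencies are used here as a stand-in for the likelihood estimates computed by the authors' program, and the example sequence is invented.

from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def dinucleotide_features(seq):
    """Relative frequencies of the 16 dinucleotides in a sequence."""
    counts = {d: 0 for d in DINUCS}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = max(sum(counts.values()), 1)
    return [counts[d] / total for d in DINUCS]

# one feature vector per 16S rDNA sequence; these would feed the stepwise
# selection and the multiple-group discriminant analysis
print(dinucleotide_features("AGCTTGCATGCAAAGGCGTTAACCGGTT"))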


222. Identifying orthologs and paralogs in speciation-duplication trees
Lars Arvestad, Center for Genome Research, Karolinska Institutet;
lars.arvestad@cgr.ki.se
Short Abstract:

 A speciation-duplication tree (SD-tree) is a tree in which each inner node is either a bifurcating S-node or a D-node without degree restriction. We consider the problem of computing an SD-tree given a species tree and pairwise evolutionary distances, with the aim of identifying orthologs and paralogs.

One Page Abstract:

 The study of proteins in model organisms is an important tool for advancing our understanding of human proteins. Assuming that homology often implies similar function, we can transfer knowledge from one organism to another. However, gene duplication can make it difficult to draw conclusions based on protein homology. Here, the identification of orthologous and paralogous protein relationships can be an important aid for the researcher.

Orthologs are homologs for which the most recent common ancestor (MRCA) corresponds to a speciation event. For paralogs, the MRCA derives from a gene duplication event. 

We consider the problem of computing a speciation-duplication tree (SD-tree) for homologous sequences from a small set of species and present an algorithm for computing an SD-tree given a species tree and pairwise evolutionary distances.
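
For readers unfamiliar with the test itself, the fragment below sketches the common species-overlap criterion for labelling an inner node as speciation or duplication; it illustrates the ortholog/paralog definitions above and is not the authors' distance-based SD-tree algorithm.

def label_inner_node(child_species_sets):
    """S if the children's species sets are pairwise disjoint, D otherwise."""
    seen = set()
    for species in child_species_sets:
        if seen & species:      # a species appears on two sides: duplication
            return "D"
        seen |= species
    return "S"

print(label_inner_node([{"human", "yeast"}, {"human"}]))  # 'D' -> paralogs
print(label_inner_node([{"human"}, {"yeast"}]))           # 'S' -> orthologs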
 
 


223. Reconstructing the duplication history of tandemly repeated genes (up)
Olivier ELEMENTO, IMGT (The International IMmunoGeneTics Database) & LIRMM Montpellier;
Olivier GASCUEL, LIRMM;
Marie-Paule LEFRANC, IMGT;
olivier@ligm.igh.cnrs.fr
Short Abstract:

 We describe here a novel approach to the reconstruction of the duplication history of tandemly repeated genes, based on a model of duplication by unequal recombination. We present this model and its related mathematical objects, the reconstruction algorithms, and their application to two data sets of immunogenetic sequences. 

One Page Abstract:

 Classical phylogenetic analysis studies the relationships between species based on the comparison of a single gene. Its main goal is to reconstruct a tree which represents the history of speciations. The problem we describe here is different: we aim at reconstructing the duplication history of a single gene within a single genome, and we consider only long (several kilobases), tandemly arranged sequences, each containing a single gene. Assuming our sequences were not affected by gene conversion events, and our loci did not undergo any deletions, we introduce a simple model of duplication based solely on unequal recombination (unequal recombination is commonly acknowledged as the primary mechanism responsible for tandem duplications). Our model of duplication allows simple duplications (a gene is duplicated and inserted adjacent to the initial gene) and block duplications (a block of 2 or more sequences is duplicated and inserted near the initial block). Although identical just after duplication, these sequences diverge over time as they accumulate their own mutations. We then define three types of mathematical objects to describe the evolution of these clusters of tandemly repeated genes. First, we define what we call a time-valued duplication history, i.e. a description of the real duplication history. Since we cannot generally rely on the molecular clock hypothesis, inferring a time-valued duplication history from nucleotide sequences is not possible. In particular, the position of the root and the order in which duplications occurred cannot be recovered from DNA sequences. Consequently, we can only reconstruct what we call a duplication tree, i.e. an unrooted phylogeny whose topology is compatible with at least one duplication history. According to the model of duplication, the root of a duplication tree can only be situated somewhere (but not everywhere) in the tree between the most distant repeats on the locus. When rooted at one of its allowed branches, a duplication tree can be transformed into what we call an ordinal duplication history, i.e. a history in which duplication events are partially ordered. Although a duplication tree is a phylogeny, it is easy to show that not all phylogenies can be duplication trees. We use an algorithm we call PDT (for PossibleDuplicationTree) to determine whether a given phylogeny with ordered leaves can be a duplication tree. This algorithm provides us with a mathematical characterisation of duplication histories and duplication trees. We also use the PDT algorithm to show that, for a given number of tandemly repeated sequences, the number of duplication trees is much smaller than the number of distinct phylogenies. Given this model of duplication, we use an exhaustive search procedure to reconstruct duplication trees: given a set of nucleotide sequences, we compute the parsimony value of every possible duplication tree and select those which minimize this value. To speed up the reconstruction (especially when dealing with large numbers of repeated genes), we use a faster search procedure (not guaranteed to find the optimal tree) based on a greedy heuristic: starting with a tree made from the first three repeats, our procedure iteratively inserts new repeats into the growing tree, such that each resulting tree minimizes the parsimony value. The procedure stops when all repeats are inserted. 
We applied this model and these search procedures to two human loci containing tandemly repeated immunoglobulin and T-cell receptor genes: the IGLC and TRGV loci. We showed for both loci that the duplication tree found by our exhaustive search procedure corresponds to the most parsimonious phylogeny. Since the probability of a phylogeny being a duplication tree is small (0.04 in the TRGV case), this constitutes a strong validation of our initial hypothesis concerning the duplication mechanisms. Moreover, the heuristic search reconstructs the same duplication tree as the exhaustive search, but much faster. These results remain stable under bootstrap analysis, indicating that the identity between the most parsimonious duplication tree and the most parsimonious phylogeny is not fortuitous. Compatibility of our reconstructed trees with known polymorphisms in the TRGV locus (two genes are missing in some individuals) provides further evidence that our reconstruction can give good insights into the duplication histories of tandemly repeated genes.
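
The greedy procedure can be summarized in a few lines of Python (our paraphrase of the text above; initial_tree, placements, is_duplication_tree and parsimony are hypothetical helpers standing in for the authors' routines, including the PDT validity check):

def greedy_duplication_tree(repeats):
    tree = initial_tree(repeats[:3])               # tree on the first 3 repeats
    for repeat in repeats[3:]:
        candidates = [t for t in placements(tree, repeat)
                      if is_duplication_tree(t)]   # PDT-style validity check
        tree = min(candidates, key=parsimony)      # keep the cheapest insertion
    return tree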


224. New approaches for the analysis of gene family evolution (up)
Jessica Siltberg, Jens Lagergren, Bengt Sennblad, David A. Liberles, Stockholm Bioinformatics Center, Stockholm University, 10691 Stockholm, Sweden;
jessica@sbc.su.se
Short Abstract:

 Because different positions within proteins evolve at different rates, Ka/Ks is frequently underestimated when averaged over entire gene sequences, hampering the detection of genes undergoing adaptive evolution. We present a simple covarion-based method for estimating Ka/Ks ratios calculated from variant residues in subclades of a phylogenetic tree operating with stationary substitution trends.

One Page Abstract:

 Analyzing the ratio of nonsynonymous to synonymous nucleotide substitution rates (Ka/Ks) has emerged as a powerful technique in the detection of proteins undergoing adaptive evolution or changes of function. Because different positions within proteins evolve at different rates, Ka/Ks is frequently underestimated when averaging over entire gene sequences. To correct for this, it is possible to examine discrete windows of primary sequence and compute Ka/Ks within these sliding windows. However, this approach ignores the largest underlying reason for site-specific variation in rates: selective pressures dictated by protein three-dimensional structure. As an alternative, we present a simple covarion-based method for estimating Ka/Ks ratios calculated from variant residues in subclades of a phylogenetic tree operating with stationary substitution trends.
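
As a toy illustration of the quantities involved (not the authors' method), the sketch below counts synonymous versus nonsynonymous codon differences between two in-frame aligned coding sequences in a window; proper Ka/Ks estimation additionally normalizes by the numbers of synonymous/nonsynonymous sites and corrects for multiple hits.

from Bio.Seq import translate  # Biopython

def window_substitution_counts(cds1, cds2, start, end):
    """Nonsynonymous/synonymous codon-difference counts in [start, end)."""
    nonsyn = syn = 0
    for i in range(start, end - 2, 3):
        c1, c2 = cds1[i:i + 3], cds2[i:i + 3]
        if c1 == c2:
            continue
        if translate(c1) == translate(c2):
            syn += 1      # different codons, same amino acid
        else:
            nonsyn += 1   # amino acid replacement
    return nonsyn, syn

print(window_substitution_counts("ATGAAACGT", "ATGAAGCGG", 0, 9))  # (0, 2)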


225. Calculating orthology support levels in large scale data analyses (up)
Christian Storm, Erik Sonnhammer, Center for Genomics Research, Karolinska Institutet;
christian.storm@cgr.ki.se
Short Abstract:

 Orthologous proteins in different species are likely to have similar biochemical function and biological role. Here we present a method that calculates orthology support levels by analyzing a set of bootstrap trees instead of the optimal tree.

One Page Abstract:

 Orthologous proteins in different species are likely to have similar biochemical function and biological role. When annotating a newly sequenced genome by sequence homology, the most precise and reliable functional information can thus be derived from orthologs in other species. A standard method of finding orthologs is to compare the sequence tree with the species tree. However, since the topology of phylogenetic trees is not always reliable, one might get incorrect assignments. Here we present a method that resolves this problem by analyzing a set of bootstrap trees instead of the optimal tree. The frequency of orthology assignments in the bootstrap trees can be interpreted as a support value for possible orthology of the sequences. This approach is efficient enough to analyze large datasets on the scale of whole genomes. It is implemented in C and Java and calculates orthology support levels for all pairwise combinations of sequences from two groups of species. The method was tested on simulated datasets and on real data of homologous proteins. 
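
The support computation itself reduces to a frequency, as in this sketch (our reading of the abstract; is_ortholog stands for any tree-based orthology test, e.g. by comparison with the species tree, and is a hypothetical helper here):

def orthology_support(pair, bootstrap_trees, is_ortholog):
    """Fraction of bootstrap trees in which the pair is called orthologous."""
    calls = sum(1 for tree in bootstrap_trees if is_ortholog(pair, tree))
    return calls / len(bootstrap_trees)

# support near 1.0: orthology is robust to topological uncertainty;
# support near 0.0: the pair is (almost) never placed as orthologs.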


226. Search Treespace and Evaluate Regions of an Alignment with LumberJack (up)
Carolyn J. Lawrence, R. Kelly Dawe, Russell L. Malmberg, University of Georgia;
carolyn@dogwood.botany.uga.edu
Short Abstract:

 The ML heuristic search algorithms currently available are computationally impractical for large datasets (especially those consisting of protein sequences). We are developing an ML heuristic search tool that we call LumberJack, which progressively jackknifes an alignment to generate multiple NJ trees, then compares them based upon likelihood scores.

One Page Abstract:

 Phylogenomics is a method of sequence-based function prediction by phylogenetic analysis (Eisen 1998). The phylogenomic method often yields more accurate functional hypotheses than techniques based solely upon sequence similarity (such as BLAST). It is implemented by constructing a reasonable phylogenetic tree for a given dataset, then mapping the functions of experimentally characterized proteins onto the tree. Kuhner and Felsenstein (1994) found that the optimality criterion most successful at inferring accurate phylogenies overall is ML. However, the ML heuristic search algorithms currently available are computationally impractical for large datasets (especially those consisting of protein sequences). We are developing an ML heuristic search tool that we call LumberJack. LumberJack progressively jackknifes an alignment to generate multiple NJ trees, then compares those trees statistically on the basis of their relative likelihood scores. This sampling procedure finds phylogenetic trees that are similar to those built by ML star decomposition (Adachi and Hasegawa 1996), but carries out only a fraction of the computations.
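
Our reading of the sampling loop, as a rough Python sketch (nj_tree and log_likelihood are hypothetical stand-ins for any standard neighbor-joining and likelihood implementations):

import random

def lumberjack_search(alignment, n_samples=100, drop_frac=0.1):
    """Jackknife alignment columns, build NJ trees, rank by likelihood."""
    n_cols = len(alignment[0])
    best_score, best_tree = float("-inf"), None
    for _ in range(n_samples):
        keep = [i for i in range(n_cols) if random.random() > drop_frac]
        sample = ["".join(row[i] for i in keep) for row in alignment]
        tree = nj_tree(sample)                    # hypothetical helper
        score = log_likelihood(tree, alignment)   # hypothetical helper
        if score > best_score:
            best_score, best_tree = score, tree
    return best_tree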


227. Analysis of Large-Scale Duplications Involving Human Olfactory Receptors (up)
Tera Newman, Barbara Trask, U. of Washington dept. of Molecular Biotechnology and Fred Hutchinson Cancer Research Center;
Janet Young, Fred Hutchinson Cancer Research Center;
newmant@u.washington.edu
Short Abstract:

 Olfactory receptor genes (ORs) have a complicated history of duplication, diversification, and pseudogenization. OR gene diversity has evolved in order to bind multitudes of odorants. Using sequence homology, phylogenetic analysis and genomic repeat structure we have characterized the duplications of large blocks of a recently active ~100-member subfamily of ORs.

One Page Abstract:

 Olfactory receptor genes (ORs) constitute a family of G-protein coupled receptors and are the largest protein family in the mammalian genome. The human OR family has approximately 1000 members that have expanded to more than 40 regions of the genome. This family has experienced sequence diversification and rampant pseudogenization. The diversity of the ORs likely arose from intense selective pressure to recognize the broad spectrum of volatile odorants encountered in different environmental niches. Here we provide an analysis of an ~100-member subfamily of ORs, labeled 7E, that has undergone substantial recent genomic expansion. The duplication events included more than 30 kb of genomic sequence surrounding the coding regions of the genes. We combined phylogenetic analysis of all 7E gene coding sequences with the overall sequence homology and large-scale repeat structure of the genomic regions surrounding each 7E member. Using the output of PAUP, Miropeats, and RepeatMasker, we have constructed highly informative graphics that depict the nature and extent of similarity of the genomic sequence around each 7E gene. Almost all members are pseudogenes, suggesting that the original ancestral copies were non-functional before the expansion. Phylogenetically clustered genes are not physically close in the genome, implying extensive inter-chromosomal, rather than local, expansion. The 7E subfamily separates well into two groups based on phylogenetic analysis of the coding sequence, the positions of two stop-codon polymorphisms, and the repeat-motif structure surrounding the genes. These surrounding regions may reveal common elements of the mechanisms of gene duplication. Further study of the 7E subfamily will increase understanding of the complex history of these genes. Additionally, 7E duplications may mediate large-scale chromosomal rearrangements similar to those that are involved in disease phenotypes.


228. Whole Genome Comparison by Metabolic Pathway Profiling (up)
Li Liao, Sun Kim, Jean-Francois Tomb, DuPont Central Research & Development;
li.liao@usa.dupont.com
Short Abstract:

 We developed a method to compare organisms based on whole-genome metabolic pathway profiles. The method includes scoring schemes and algorithms for evaluating profiles that are based on a hierarchy of attributes. Phylogenetic trees of 31 completed genomes were constructed and compared to conventional phylogenetic trees based on 16S rRNA.

One Page Abstract:

 Traditionally, reconstruction of evolutionary relationships among organisms is based on comparisons of 16S rRNA sequences. The significance of phylogenies based on these sequences has recently been questioned, with growing evidence for extensive lateral transfer of genetic material. Phylogenetic trees based on (protein) sequence analysis are not all congruent with traditional trees. As more genomes are sequenced and their metabolic pathways reconstructed, it becomes possible to perform genome comparisons from a biochemical/physiological perspective. We believe that such comparisons may yield novel insights into the evolution of metabolic pathways and bear relevance to metabolic engineering of industrial microbes. 

 We developed a computational method to compare organisms based on whole metabolic pathway analysis. The presence and absence of metabolic pathways in organisms are profiled, and the profiles are utilized for genome comparison. Scoring schemes and an algorithm were developed for evaluating generic profiles based on attributes that bear hierarchical relationships. Based on this methodology, phylogenetic trees of completed genomes were constructed. The results provide a perspective on the relationships among organisms that differs from conventional phylogenetic trees based on 16S rRNA.
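
A minimal sketch of the profile comparison idea (assumed details; the authors' scoring schemes additionally weight pathways by their hierarchical relationships): organisms as sets of pathways, with a Jaccard-style distance that any distance-based tree builder can consume.

def profile_distance(pathways_a, pathways_b):
    """1 - |shared| / |union| over two sets of pathway identifiers."""
    union = pathways_a | pathways_b
    if not union:
        return 0.0
    return 1.0 - len(pathways_a & pathways_b) / len(union)

profiles = {
    "org1": {"glycolysis", "TCA cycle", "pentose phosphate"},
    "org2": {"glycolysis", "pentose phosphate"},
}
print(profile_distance(profiles["org1"], profiles["org2"]))  # 1 - 2/3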


228a. Use of Runs Statistics for Pattern Recognition in Genomic DNA Sequences (up)
Leo Wang-Kit Cheung, University of Manitoba;
wcheung@cc.umanitoba.ca
Short Abstract:

 Based on the finite-Markov-chain-imbedding technique, a recursion is derived for the calculation of the exact distribution of a double runs statistic. Having studied this distribution under different probabilistic frameworks, we can use it not only for detecting DNA signal clustering, but also for revealing homogeneous regions of DNA.

One Page Abstract:

 In the field of computational biology, DNA sequences have been studied through mathematical and statistical analyses of patterns. Research on counting problems and the distribution theory of runs and patterns has been heavily influenced by combinatorial methods (Waterman, 1995). Owing to the complexity of these methods, exact distributions of many runs statistics still remain unknown. Recently, a quite different finite Markov chain imbedding (FMCI) method has been introduced for studying distributions of runs and patterns (Fu and Koutras, 1994). Essentially, the runs statistic is imbedded into a Markov chain (MC) so that its distribution can be expressed in terms of the transition probabilities of the imbedded MC. Hence, the runs distribution takes a simpler form.

 Based on the finite Markov chain imbedding (FMCI) technique, a recursive algorithm is derived for the calculation of the exact distribution of a double runs statistic. With this distribution in hand, we can construct critical regions of a statistical test for randomness against clustering in DNA sequence data. The distribution of this statistic has also been investigated under a hidden Markov model (HMM) framework. This leads to the creation of probabilistic profiles for trapping HMM parameters. Applications of these profiles in conjunction with HMMs for pattern recognition in DNA sequences are illustrated via case studies using real human DNA data provided by Dr. Anders Pedersen at the Center for Biological Sequence Analysis (CBSA). 
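
The FMCI idea is easiest to see on a simpler statistic than the double runs statistic: below, a sketch (ours) computes P(longest run of one symbol >= k in n i.i.d. trials) by imbedding the current run length as the state of a Markov chain with one absorbing state.

import numpy as np

def p_run_at_least(n, k, p=0.5):
    """P(some success run reaches length k in n Bernoulli(p) trials)."""
    T = np.zeros((k + 1, k + 1))
    for run in range(k):
        T[run, run + 1] = p        # success extends the current run
        T[run, 0] = 1 - p          # failure resets the run length
    T[k, k] = 1.0                  # absorbing: a k-run has been seen
    start = np.zeros(k + 1)
    start[0] = 1.0
    return (start @ np.linalg.matrix_power(T, n))[k]

print(p_run_at_least(100, 5))      # e.g. a purine 5-run in 100 random bp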


229. Biomine - multiple database sequence analysis tool (up)
Simon Greenaway, Joseph Weekes, Andrew Blake, Helen Kirkbride, Jerome Jones, Katia Pilicheva, Michelle Simon, Sarah Webb, Ann-Marie Mallon, Mark Strivens, Informatics Group, Mammalian Genetics Unit and UK Mouse Genome Centre, Medical Research Council;
s.greenaway@har.mrc.ac.uk
Short Abstract:

 Biomine is a sequence analysis and management tool allowing the parallel searching of multiple sequence databases from a single user interface without the need for specialist bio-informatics skills.

One Page Abstract:

 Biomine is a sequence analysis and management tool allowing the parallel searching of multiple sequence databases from a single user interface without the need for specialist bio-informatics skills. 

Users log in to the system and are then presented with the options of launching a new search or viewing completed results. New searches can be set up by pasting a FASTA format sequence into a web form, by entering an accession number, or by uploading a file of FASTA sequences. The user also selects which types of searches are needed, e.g. nucleotide, genomic, EST, protein, or one of their custom databases. 

The database searches are managed by a Java based server which controls the parallel launching of database searches. This server is fully configurable to allow different queuing strategies, server loading controls and full search job controls by means of an administration interface.

The results page shows the progress of the user's database searches, which are organised into project folders. Once a search has completed, the user can view a simplified representation or the full output of the results by browsing through their project folders. Project folders may be kept private or shared between members of a research team. Results can be tagged to repeat searches whenever one of the databases is updated. The results link to relevant publicly available databases in an intuitive and easy to use interface. 


230. Efficient virtual screening tool for early drug discovery (up)
Prof. Dr. Paul Wrede, CallistoGen AG;
paul.wrede@callistogen.de
Short Abstract:

 Today the search for new pharmacological molecules is mainly a matter of trial and error. CallistoGen developed an efficient and robust virtual screening tool called PHACIR® for rapid guided search of new bioactive molecules. The algorithm of PHACIR® is a similarity search based on the pharmacophore concept.

One Page Abstract:

 Random search has proved to be an inefficient method for lead discovery. Alternatively, virtual screening algorithms enable a guided search of the high-dimensional chemical space. 2D and 3D pharmacophore models, neural network concepts, and new bioinformatic approaches lead to fast and efficient virtual screening tools. PHACIR® (PHArmaCophore Identification Routine), a 3D pharmacophore model based algorithm, generates highly enriched focused compound libraries, as demonstrated in several retrospective screenings. The database screening speed exceeds 7000 compounds/sec on average workstations. Even a single topological query is sufficient for PHACIR screening, i.e. the 2D structure of only one active compound can be used for scanning large compound libraries. Prospective screening results (a 10% hit rate of biologically active new compounds was found repeatedly) confirm the concept of the PHACIR algorithm. ClassyFire® - CallistoGen's artificial neural network diversity analyser - produces high-quality focused compound libraries to identify potential lead candidates with different scaffolds. For de novo design of biologically active peptides, evolutionary algorithms have proved very useful. PepHarvester® and Darwinizer® allow a guided search through the high-dimensional sequence space. Compared to known isofunctional sequences, the peptides found are highly diverse.

Correspondence: info@callistogen.de


231. WebGen-Net: a workbench system for support of genetic network construction (up)
Mikio Yoshida, Hideaki Shimano, Yukari Shibagaki, Hiroshi Fukagawa, Takeshi Mizuno, Intec Web and Genome Informatics Corp., 1-3-3 Shinsuna, Koto-ku, Tokyo, 136-0075, Japan;
yoshida@gic.intec.co.jp
Short Abstract:

 We have developed a workbench system for support of genetic network construction that builds a genetic network among focused genes by connecting binary relations extracted in advance from various genome databases. This system helps users interpret their hypotheses or experimental results by reference to prior biological knowledge.

One Page Abstract:

 Genetic network analysis plays an important role in determining a protein's function. However, constructing a genetic network requires a huge amount of biological data, for instance on gene transcription regulation, protein-protein interaction, sequence similarity, and the like. Since these data are continuously increasing, it is impractical for a biologist to deal with them unaided.

To overcome this problem, we propose an interactive system that can support genetic network construction. The system consists of a rearranged database and a graphical user interface module (GUI module). The rearranged database stores, in advance, binary relations related to genetic networks collected from public databases and several experimental results. The system constructs a genetic network among focused genes by connecting these relations based on a predefined model. The GUI module displays the constructed genetic network graphically and enables users to edit it. Therefore, users can grasp their experimental results and can identify differences between their hypotheses or their own experimental results and the prior biological knowledge.

To evaluate the efficiency of this system, we have used it to help interpret the results of a comprehensive protein-protein interaction screen of budding yeast (4,549 interactions among 3,278 proteins). In this evaluation, two effective facilities were developed: Connected Components Extraction (CCE) and Alternative Paths Derivation (APD). CCE is used to estimate a protein's function, and APD to validate the results. The experiment was based on the Yeast Two-Hybrid System, in which one or two intervening proteins between prey and bait may cause a false positive result. APD finds all such possibilities by extracting A-X-B (or A-X-Y-B) paths from previously reported protein-protein interactions. Furthermore, an interaction having more than two alternative paths suggests that the proteins constituting the paths form a protein complex. Using these features, 195 connected components and 182 (299) alternative paths with one (two) intervening protein(s) were extracted.
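
A small sketch (assumed details) of the APD step: for an observed bait-prey pair (A, B), enumerate the A-X-B and A-X-Y-B paths supported by previously reported interactions.

from collections import defaultdict

def alternative_paths(a, b, known_interactions):
    nbr = defaultdict(set)
    for u, v in known_interactions:        # undirected interaction graph
        nbr[u].add(v)
        nbr[v].add(u)
    one_hop = [(a, x, b) for x in nbr[a] & nbr[b] if x not in (a, b)]
    two_hop = [(a, x, y, b) for x in nbr[a] for y in nbr[b]
               if y in nbr[x] and len({a, x, y, b}) == 4]
    return one_hop, two_hop

# many alternative paths for (A, B) hint that A, X(, Y), B form a complex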

Selected Reference: Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, and Sakaki Y., A comprehensive two-hybrid analysis to explore the yeast protein interactome, Proc Natl Acad Sci U S A 2001 Apr 10;98(8):4569-74 


232. ArrayExpress - a Public Repository for Gene Expression Data at the European Bioinformatics Institute (up)
Alvis Brazma, Ugis Sarkans, Helen Parkinson, Alan Robinson, Mohammadreza Shojatalab, Jaak Vilo, EMBL-EBI;
brazma@ebi.ac.uk
Short Abstract:

 ArrayExpress (www.ebi.ac.uk/microarray) is a public repository for microarray based gene expression data, which covers the requirements of the Minimum Information About a Microarray Experiment (MIAME) standard developed by the MGED consortium. It supports data import in MAML format, an XML-based data exchange format (see www.mged.org).

One Page Abstract:

 By allowing "snapshots" of gene expression for tens of thousands of genes in a single experiment, microarrays have already profoundly affected life science research and are producing massive amounts of functional genomics data. Well organized public repositories for such data are needed if these data are to be freely accessed and explored by the life science community and not lost in the future. There are a number of clear reasons, in addition to the size of the datasets, why databases of microarray data are not simple. First, gene expression data make sense only in the context of a detailed description of the experimental conditions, which have to be described in a systematic way to permit queries and data mining. Second, datasets obtained in different experiments may not be directly comparable - no standard reliable units for measuring gene expression exist. Third, microarray technology is still developing rapidly, so microarray data management systems must be extremely flexible. ArrayExpress is a public repository for microarray based gene expression data being established by the EMBL-EBI (www.ebi.ac.uk/microarray/). We have developed an object model for representing microarray based gene expression data, which can be used for building custom-made gene expression databases. The model is freely available from www.ebi.ac.uk/arrayexpress/. The model covers the requirements of the Minimum Information About a Microarray Experiment (MIAME) standard developed by the Microarray Gene Expression Database (MGED) group (for more information about MGED and MIAME see www.mged.org). The key feature of the object model is the notion of generic, technology-independent expression data matrices, facilitating expression data warehouse development and data mining. A structured sample description framework is provided, encouraging data providers to describe their laboratory protocols in a formal way. ArrayExpress is based on the described object model and is implemented in Oracle. It can store both raw and processed data, and it is independent of experimental platforms, image analysis and data normalization methods. The repository supports data import in MAML format (MicroArray Markup Language, an XML-based data exchange format developed by the MGED consortium). Currently we are developing data submission and annotation tools to facilitate the data deposition process, and a Web based data query interface. ArrayExpress will be linked to Expression Profiler, an Internet based microarray data analysis tool (ep.ebi.ac.uk). In the future ArrayExpress will support interfaces currently under development within the Object Management Group (OMG), in which the EBI is actively participating. References: One-stop shop for microarray data. Brazma, A., A. Robinson, G. Cameron, and M. Ashburner. Nature 403:699-700 (2000). 


233. Formulation of the estimation methods for the kinetic parameters in cellular dynamics using object-oriented database (up)
Takashi Naka, Saitama Junior College, Kazo-shi, Saitama 347-8503, Japan;
Mariko Hatakeyama, Mio Ichikawa, Computational Genomics Team, Bioinformatics Group, RIKEN Genomic Sciences Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama;
Naoto Sakamoto, Institute of Information Sciences and Electronics, University of Tsukuba, Tsukuba-shi, Ibaraki 305-8573, Japan;
Akihiko Konagaya, Computational Genomics Team, Bioinformatics Group, RIKEN Genomic Sciences Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama;
naka@sjc.ac.jp
Short Abstract:

 The simulation analysis with mathematical models is effective for elucidation of the signal transduction mechanisms in cells. Estimation and/or adjustment of the kinetic parameters is necessary to perform the simulation of cellular dynamics. The methods for estimating/adjusting the kinetic parameters are formulated in an object-oriented database.

One Page Abstract:

 Intensive studies of the signal transduction processes in cells have revealed their function as elaborate information processing systems. Simulation analysis with mathematical models is effective for elucidating the mechanisms of this cellular information processing. Construction of a mathematical model employs the kinetic reaction schemes reported in experimental studies, and a collection of kinetic parameters such as rate constants, diffusion coefficients and the concentration distribution of every chemical reactant is needed to perform simulations with the constructed model. As some kinetic parameters required for simulation are not always described in the literature, estimation of these missing data is essential. It is further necessary to adjust the available parameter values when constructing a mathematical model for a specific reaction system, because the experimental conditions of the measurements may differ between studies. In this study, the procedures to estimate or adjust the kinetic parameters are collected from the literature and formulated as relationships between the measured values of the kinetic parameters and the experimental conditions under which the measurements were made. These relationships are integrated into the bio-molecular database. An object-oriented database, in which each datum is a class object, is adopted as the framework of the database management system. The formulated relationships for estimation or adjustment are represented as constructor methods, and the estimation or adjustment of the data is carried out at instance generation, which corresponds to referencing the data. Furthermore, an extension of the database is attempted so as to perform simulations by regarding a set of generated instance objects as the elements of a particle simulation framework. 
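
The "adjustment at instance generation" idea can be pictured with a toy Python analogue (our illustration, not the authors' formulation): a rate constant measured at one temperature is rescaled to the simulation temperature in the constructor, here with a simple Q10 rule.

class RateConstant:
    """Adjusts a measured rate constant when the instance is created."""
    def __init__(self, measured, measured_temp_c, sim_temp_c, q10=2.0):
        # Q10 temperature correction applied at instance generation,
        # i.e. at the moment the datum is referenced for a simulation.
        self.value = measured * q10 ** ((sim_temp_c - measured_temp_c) / 10.0)

k_cat = RateConstant(measured=0.5, measured_temp_c=25.0, sim_temp_c=37.0)
print(k_cat.value)   # 0.5 * 2**1.2, roughly 1.15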


234. Towards a Transcript Centric Annotation System Using in silico Transcript Generation and XML Databases (up)
Anatoly Ulyanov, Michael Heuer, Steven Bushnell, Aventis Pharmaceuticals;
Anatoly.Ulyanov@Aventis.com
Short Abstract:

 The system presented here is a resource for annotating genomic sequences and, ultimately, for identifying pharmaceutically relevant targets. It allows for the assignment of annotation using homology based methods and improved EST clustering through the use of high quality seed sequences, and through these results reveals potential alternative splicing of genes.

One Page Abstract:

 A significant effort in bioinformatics is devoted to finding potential therapeutic targets for drug candidates. The success of this effort depends mainly on identifying and annotating all transcripts from the human and other important mammalian genomes. In the current situation, where roughly one third of human transcripts are present in publicly available finished genome sequences, we took the approach of consolidating sequence information from public and proprietary sources of DNA sequences to advance internal annotation efforts. In this work we developed a system called TransDB, which computes DNA transcripts in silico using DNA sequences from GenBank. Central to the value of this system is a "rule base" we created, revealing GenBank entries that have a high probability of being expressed. The rule base automatically rejects ESTs, predicted genes, and artificial or synthetic sequences. The system then returns the subset of GenBank records that have experimental evidence of transcription. TransDB is compiled in three phases. The first phase filters GenBank entries using information in FEATURE tables and generates a stack of operations like "region", "join" and "complement" to compute transcripts. The second phase is a compiler, which reads the operation stack and produces TransDB entries. Every entry in TransDB then inherits an accession number and a product description from the source GenBank entry. The final phase is another filter, which removes redundant entries. The transcripts are computed from descriptions located in feature tables, including mRNA, CDS, exon, 5'UTR, and 3'UTR (http://www.aventisandthegenome.com/services.htm).
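
The location operators the compiler evaluates behave as in this minimal sketch (ours; 1-based inclusive coordinates, as in GenBank/EMBL feature tables):

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def region(genomic, start, end):
    return genomic[start - 1:end]                 # 1-based, inclusive

def complement(segment):
    return segment.translate(COMPLEMENT)[::-1]    # reverse complement

def join(*segments):
    return "".join(segments)

genomic = "AAGGTTCCAAGGTACGT"
# e.g. a CDS annotated as join(2..5, complement(10..14)):
print(join(region(genomic, 2, 5), complement(region(genomic, 10, 14))))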

Assembled ESTs and the in silico transcripts generated here provide an initial set of consensus DNA sequences. If treated as persistent objects, these consensus sequences can be flexibly annotated using XML databases. Here, we use XML technology to "glue" together heterogeneous sources of data, and use applications to compute and view results. Every source of information, including sequence homology results, classification and ontology assignments, expression profiles, marker homologies, etc., is presented in the system as an XML database. Adding a new piece of information about transcripts means adding a new XML database. Integrating this new database into the system requires only adding its name to the list of registered databases and writing a piece of code which reads and presents the data in a browser. We successfully used this technology to build a system that includes, in addition to the sources of annotation mentioned above, such components as personal annotation with a history of changes, classifications, cross-species bridges, graphical presentations, and a query system.


235. Trawler: Fishing in the Biomedical Literaturome (up)
Lars Arvestad, Wyeth Wasserman, Center for Genome Research, Karolinska Institutet;
lars.arvestad@cgr.ki.se
Short Abstract:

 We present Trawler, a set-based interface between sequences and the biomedical literature. Given a set of sequence or abstract identifiers, Trawler captures the associated articles. Using cycles of feedback, Trawler expands and refines the literature collection. A unique feature is the application of "attractors", which allow categorization of the papers. 

One Page Abstract:

 The information contained in the ocean of biomedical literature far exceeds that which can be obtained by sequence analysis. In order to facilitate access to this underutilized resource from sequence-based starting positions, we have developed Trawler, a semi-automated text analysis system for PubMed abstracts.

Trawler aims to ease access to the biomedical literature by providing a mode of operation based on a working set of articles. This working set can be seeded by an initial set of sequence identifiers or articles. In addition to basic operations, such as inspection and trimming, Trawler offers automatic classification and set-based extension of the working set.

Automatic Classification

We explore two classification methods: scoring and attractors. The scoring method is based on scores assigned to discriminating words (for instance Marcotte et al., Bioinformatics 2001), while the attractor method is based on PubMed's neighbor definition procedure. Both methods utilize a set of "standard" articles for major fields of research, e.g., gene expression analysis, SNP analysis, or sequencing technology. For the scoring method, discriminative words are extracted for each research field and associated word scores are computed. For an article to be classified to a class C, its associated score S_C must exceed a threshold. The attractor method identifies the neighboring articles for each set of field-typical articles, and identifies the intersection of these attractor sets with the article sets.
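
The scoring rule amounts to a thresholded sum, as in this sketch (assumed details):

def classify_abstract(words, field_word_scores, threshold):
    """field_word_scores: {field: {word: score}}; returns matching fields."""
    matches = []
    for field, scores in field_word_scores.items():
        s_c = sum(scores.get(w, 0.0) for w in words)  # score for class C
        if s_c > threshold:
            matches.append(field)
    return matches

print(classify_abstract(
    ["microarray", "expression", "cluster"],
    {"gene expression analysis": {"microarray": 2.0, "expression": 1.5},
     "SNP analysis": {"polymorphism": 2.0}},
    threshold=3.0))   # -> ['gene expression analysis']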

Set-based extension

Two methods of extending the working set are considered. The first and simplest utilizes PubMed's related-article feature. However, our approach is based on the entire article set, rather than individual members. As Trawler maintains scores for the relevance of each article, the contribution of each paper to the expanded set is weighted, providing an ordered list of potentially associated articles.

The second method is based on common sequences. From an initial article identifier or set, Trawler identifies the associated sequences, and incorporates other articles that address sequences for the same gene in a gene index, for instance the human sequence index for the GeneLynx system (www.genelynx.org).

Trawler can be accessed via a link from our departmental resources page: http://www.cgr.ki.se/cgr/services
 
 


236. GeneLynx: A Comprehensive and Extensible Portal to the Human Genome (up)
Boris Lenhard, Wyeth W. Wasserman, Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden;
Boris.Lenhard@cgr.ki.se
Short Abstract:

 GeneLynx is a meta-database of information on human genes present in public databases. The GeneLynx project is based on the goal that for every human gene researchers should be able to access a single page containing links to all the information pertinent to that gene. GeneLynx is available at http://www.genelynx.org.

One Page Abstract:

 GeneLynx is a meta-database of human genes associated with an extensive collection of hyperlinks to gene-specific information in diverse databases publicly available on the internet. The GeneLynx project is based on the notion that for every human gene, given any gene-specific identifier (accession number, approved gene name, text or sequence), researchers should be able to access a single web page that provides a set of links to all the publicly available information pertinent to that gene. 

GeneLynx is implemented as an extensible relational database with an intuitive and user-friendly web interface, containing a number of unique features to increase its efficiency in collating data about genes of interest. The data is automatically extracted from more than thirty external resources, using appropriate approaches to maximize the coverage. The system includes a set of software tools for database building and curation. An indexing service facilitates linking from external sites. 

Among the unique features of GeneLynx are a communal curation system for user-aided improvement of data quality and completeness, and a standardized protocol for adding new resources to the database. 

GeneLynx can be accessed freely at http://www.genelynx.org.


237. Representation and integration of metabolic and genomic data: the Panoramix project (up)
Morgat A., Boyer F., INRIA Rhône-Alpes, Helix project;
Rivière-Rolland H., Genome Express, France;
Ziebelin D., Université Joseph Fourier, France;
Rechenmann F., Viari A., INRIA Rhône-Alpes, Helix project;
anne.morgat@inrialpes.fr
Short Abstract:

 We present three knowledge bases (Genomix, Proteix and Metabolix) dedicated to three aspects of bacterial genome analysis (Genes, Enzymatic Assemblies and Metabolism). All of them are based on an object-oriented representation using classes and associations. Their use will be exemplified by the problem of reconstructing metabolic pathways. 

One Page Abstract:

 We have developed three knowledge bases (Genomix, Proteix and Metabolix) dedicated to bacterial genome analysis. Each of these bases deals with a particular aspect of genomic and post-genomic data: 

"Genomix" concerns organisms and their genes with a special emphasis to completely sequenced bacteria. It contains data about the genes, their phylogenetic relationship (paralogy, orthology) and their organisation along the chromosome (bacterial synteny). 

"Proteix" deals with the product of the genes (proteins) with a special emphasis to enzymes and molecular assemblies constituting molecular enzymes. 

"Metabolix" is dedicated to intermediary metabolism and models biochemical reactions, chemical compounds and catalytic activities. 

All these knowledge bases (KBs) have been developed using an object-based data model and implemented with the AROM representation system (http://www.inrialpes.fr/romans/arom) developed in the Romans project at INRIA Rhône-Alpes. AROM provides a powerful (UML-like) framework based on classes and associations. The explicit representation of n-ary associations (i.e. connecting more than two classes) turned out to be very useful in modelling complex relationships between objects (such as alternative substrates in Metabolix or molecular enzymes in Proteix). 

These three bases are, of course, strongly interconnected, for instance through the correspondence association between genes and proteins for the Genomix and Proteix KBs, or the relationship between molecular enzymes and catalytic activities for the Proteix and Metabolix KBs. This makes it possible to answer questions like "Given two bacterial species, is there any conservation in the chromosomal arrangement of their genes coding for enzymes acting in a given metabolic pathway?". At present this kind of question can be answered by combining queries (expressed in a built-in query language (AML)) with the AROM Java API. 

Besides querying, another application of these three knowledge bases is ab initio pathway reconstruction. Here we state this problem as finding a minimal-cost path in a graph connecting compounds through biochemical reactions. The cost function may depend on the number of reactions involved, the total energetic balance (NADH, ATP, etc.) or the chromosomal distance between the genes coding for the enzymes that catalyse the biochemical reactions. 
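
In sketch form (our encoding assumptions, not the AROM implementation), the reconstruction is a shortest-path search over compounds with a pluggable reaction cost:

import heapq

def cheapest_pathway(graph, source, target, cost):
    """graph: {compound: [(reaction, product), ...]}; Dijkstra search."""
    frontier = [(0.0, source, [])]
    best = {source: 0.0}
    while frontier:
        c, compound, path = heapq.heappop(frontier)
        if compound == target:
            return c, path                 # cheapest reaction sequence
        for reaction, product in graph.get(compound, []):
            nc = c + cost(reaction)        # e.g. 1, ATP balance, gene distance
            if nc < best.get(product, float("inf")):
                best[product] = nc
                heapq.heappush(frontier, (nc, product, path + [reaction]))
    return None

# cost=lambda r: 1.0 minimizes the number of reactions in the pathway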

On the poster we shall present the conceptual models behind these three bases together with some preliminary results on the pathway reconstruction problem. 


238. GDB's Draft Sequence Browser (GDSB) - A bridge between the Human Genome Database (GDB) and Human Genome Draft Sequence. (up)
Weimin Zhu, Christopher J. Porter, Bioinformatics Supercomputing Center, Hospital for Sick Children;
C. Conover Talbot, Johns Hopkins University, Baltimore;
Kenny Li, Sadananda Murthy, A. Jamie Cuticchia, Bioinformatics Supercomputing Center, Hospital for Sick Children;
wzhu@sickkids.on.ca
Short Abstract:

 GDB's Draft Sequence Browser (GDSB) extends the GDB schema by annotating GDB objects on the Golden Path human draft sequence assembly. GDSB data can be browsed by chromosome, contig, and physical locations, searched by clone or GDB object ID/name, and accessed through links from GDB objects. 

One Page Abstract:

 The Genome Database (GDB, http://www.gdb.org) is a public repository of human genomic data on genes, STSs, clones and variation. Mapping data from large genome centers and from smaller mapping efforts, where available, are all represented in GDB. These data are integrated to construct and regularly update a calculated comprehensive map that reflects our most current understanding of the human genome. As human genome research shifts in emphasis from mapping to sequence and function analysis, the scope of the GDB schema has to be extended to meet the needs of the scientific community (Cuticchia, 2000). The GDB Draft Sequence Browser (GDSB) is one approach we are using to accommodate these needs. The GDSB places objects (genes, clones, STSs) from GDB in the context of the Golden Path draft sequence assembly. The draft sequence data can be browsed by chromosome and by contig, and can be searched for contigs containing a particular sequence, clone or other GDB object, or by physical location. Objects are cross-referenced between the draft sequence and GDB, allowing easy transition from a GDB object to its sequence position, or from an object on the sequence to its full GDB record. In the first release of GDSB, results are displayed in tables of textual data. The next release will introduce graphic search, browsing, and display tools. Future releases will also include other assemblies and additional sequence annotations, including SNPs, gene structure, protein, and functional information. In addition to automated and in-house curation, we will support community annotation of the sequence. GDSB is but one example of our program to extend GDB's traditional schema and maintain GDB as a unique source of human genome information, as evidenced by the integration of GDB, GDSB, and the GDB e-PCR database (Porter, 2001). 1. Cuticchia AJ (2000). Future vision of the GDB human genome database. Human Mutation 15:62-67. 2. Porter CJ (2001). Reverse electronic PCR (e-PCR) using the GDB e-PCR database. HGM2001, #307.


239. Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. (up)
Ernst Kretschmann, Wolfgang Fleischmann, Rolf Apweiler, European Bioinformatics Institute;
ketsch@ebi.ac.uk
Short Abstract:

 Protein annotation in SWISS-PROT follows certain patterns. It largely depends on the organism in which the protein was found and on signature matches of its sequence. The C4.5 data mining algorithm was used to detect annotation rules, which were evaluated and applied to as yet unannotated proteins in TrEMBL.

One Page Abstract:

 The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, cannot keep up with the ever-increasing quantity of data being submitted. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations.

The standard data mining algorithm C4.5 was successfully applied to gain knowledge about the keyword annotation in SWISS-PROT. 11,306 rules were generated; they are provided in a database, can be applied to as yet unannotated protein sequences, and can be viewed using a web browser. The rules rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. Statistical evaluation of the generated rules by cross-validation suggests that by applying them to arbitrary proteins, 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
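
A toy stand-in for the rule induction step (scikit-learn's CART in place of C4.5): predict one keyword from taxonomy and signature-match features, then read candidate rules off the tree paths.

from sklearn.tree import DecisionTreeClassifier, export_text

# One row per protein: [is_eukaryote, has_kinase_signature] (toy features).
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [0, 1, 1, 0]                    # keyword "Kinase" absent/present
tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["is_eukaryote", "has_kinase_signature"]))
# Each root-to-leaf path is a candidate annotation rule, to be screened
# by cross-validation before being applied to unannotated entries.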

The results of the automatic data mining process can be browsed on http://golgi.ebi.ac.uk:8080/Spearmint

The source code is available upon request.


240. A tissue classification system (up)
Shiri Freilich, Ruth Barshir, Tali Vishne, Shachar Zehavi, Jeanne Bernstein, Compugen Ltd.;
shirif@compugen.co.il
Short Abstract:

 GenBank library annotations were used to classify ESTs into a hierarchical tissue system sensitive to different pathological conditions. The system utilizes Compugen's clustering algorithm to provide a gene expression profile, and it is used for the identification of tissue-specific genes, potentially useful as diagnostics for different pathological conditions.

One Page Abstract:

 The accumulating data about expressed sequences provide a first insight into the molecular mechanisms of development and differentiation. We present here a tissue classification system that provides a gene-tissue expression profile using these data.

We first analyzed the relevant GenBank data to construct a tree-like classification system. For each EST the following fields were scanned for defined keywords: tissue, tissue library, library, organ. The classification system enables recursive summation at each node of the tree. For example, if the ileum is classified under bowel, the bowel's classified ESTs will automatically include the ileum's ESTs. The keywords also solve the GenBank synonym problem (e.g. heart and cardiac have the same id entry). The system provides different entries for different pathological conditions. For example, normal pancreas and tumoral pancreas have different accessions beneath pancreas, and these accessions can be united or distinguished in a flexible manner, according to user demands.

Our next step was to move from the EST level to the gene level. For this purpose we used Compugen's clustering algorithms to obtain a tissue expression profile for each gene. Tissue-specific genes are defined as genes composed of at least 5 ESTs, in which at least 80% of the ESTs are classified to the desired tissue (according to user definitions). Using Compugen's database we have identified about 4000 potential tissue-specific genes.
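
The rule is straightforward to state in code; a sketch (ours) that also honours the recursive summation over the tissue tree described above:

def is_tissue_specific(est_tissues, tissue, descendants,
                       min_ests=5, min_frac=0.8):
    """est_tissues: tissue label of each EST in the gene's cluster."""
    members = {tissue} | descendants.get(tissue, set())
    hits = sum(1 for t in est_tissues if t in members)
    return len(est_tissues) >= min_ests and hits >= min_frac * len(est_tissues)

descendants = {"bowel": {"ileum", "colon"}}
print(is_tissue_specific(["ileum", "ileum", "colon", "colon", "bowel"],
                         "bowel", descendants))   # True: 5 ESTs, 100% bowel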

Tissue-specific genes differentially expressed in normal versus pathological conditions have the potential to serve as diagnostic and/or prognostic markers. A group of 400 genes that are both tumor-specific and tissue-specific was identified. Twenty-two colorectal-tumor-specific genes are currently being validated in our lab. 

Our future goals include extending the gene-tissue classification system into a splice variant-tissue classification system. Splice variants are considered to be one of the major elements in vertebrate differentiation. Compugen's expertise in the identification of splice variants will be used to build a splice variant-tissue expression profile that may be interesting to explore.
 
 


241. Proteome's Protein Databases: Integration of the Gene Ontology Vocabulary (up)
Elizabeth Evans, Laura Selfors, Burk Braun, Matt Crawford, Kevin Roberg-Perez, Proteome, a division of Incyte Genomics;
ee@proteome.com
Short Abstract:

 Proteome's protein databases integrate knowledge from the research literature with sequence and software tools to produce a unique resource for biologists. We have now integrated the Gene Ontology (http://www.geneontology.org/) vocabulary into our hand-curated databases, providing proteins with controlled vocabulary designations that are highly detailed and accurate.

One Page Abstract:

 Proteome's mammalian databases integrate the accumulated knowledge from the research literature with genomic information and software tools to produce a powerful resource for bioinformatic scientists and biologists of all disciplines. Our protein report pages are interconnected to allow searching across multiple proteins and species for common protein and gene characteristics.

Recently, we have integrated the powerful descriptive terminology of the Gene Ontology consortium (http://www.geneontology.org/) into the curation techniques and tools used to create our databases. We will describe various aspects of this project, including an outline of how the integration was accomplished and how it is kept current, a statistical portrait of term usage, and what the impact has been on our 'property-based' descriptions of mammalian proteins. We will give examples of ways that GO terms curated by our staff from the primary research literature can be used to create a detailed picture of protein function, based solely on the GO molecular function, biological process and cellular component properties. In addition, the information in our databases enables us to transfer knowledge about known proteins to unknown proteins, based on protein sequence similarity. The GO ontologies can also be used to describe these predicted properties of uncharacterized proteins.
 
 


242. Pattern Recognition for Micro-array Data using Context based Divergence and its Validation (up)
W. D. Tembe, Anca L. Ralescu, Future Intelligence Technology Laboratory, University of Cincinnati;
wtembe@ececs.uc.edu
Short Abstract:

 A deterministic unsupervised pattern recognition algorithm, driven by user-adjustable parameters, to identify genes with similar expression levels is described. A context-dependent similarity metric and a general-purpose validation methodology are proposed. Clusters for Saccharomyces cerevisiae and validation results for protein localization in yeast data are presented. 

One Page Abstract:

 We present a new approach to pattern recognition, characterization and validation for micro-array data that is effective, transparent and flexible. It allows for parameterized experimentation, bridging the gap between the computational aspects and the domain-dependent issues of the problem.

In many respects this approach departs from the bulk of those adopted so far. More precisely, like previous approaches it presents an algorithm to gather/group data points (high dimensional vectors); however, unlike general clustering algorithms, where data points are gathered by optimizing a performance measure (which maximizes discrepancy between clusters while minimizing discrepancy within a cluster), in this approach explicit criteria are used to achieve and control the grouping. Groups are formed iteratively based on criteria over various parameters. In addition to algorithmic issues we consider the added requirement that the results be amenable to an interpretation which is meaningful to the application domain from which the data come. The following are the main points of the approach:

* Data points are gathered in "structured clusters" in which members are differentiated among themselves by associating with each an individual "weight" which conveys its importance/contribution within the cluster.

* A global cluster measure called "cluster integrity" captures cluster tightness and is used to constrain the formation and expansion of clusters.

* Data points are grouped into clusters based on their "similarity". Underlying the similarity is the new concept of "context-dependent divergence", a non-symmetric, distance-like measure of contrast. Unlike other well-known definitions of divergence (e.g. for probability distributions), such as the Kullback-Leibler divergence, ours includes, in addition to the local discrepancy between two data points, a "context component". The similarity measure is defined based on the (normalized) divergence measure. This asymmetric measure is meant to capture some of the similarity features discussed in the literature. It is first defined for individual data points and then, by aggregation (e.g. weighted linear combination), for clusters. 

* A "fuzzy reasoning" algorithm is used to implement a selected merging strategy which is defined in terms of conditions on the mutual similarities with respect to a threshold on the similarity, and conditions on cluster integrity, through a threshold on this measure. The fuzzy reasoning module is user-driven, in the sense that the user can input the value of the threshold controlling the grouping but the system is also able to suggest (based on a nalysis of the data) suitable values for this threshold. Varying the parameters of the fuzzy sets used in this system can further affect the merging strategy.

* Three complementary ways of cluster visualization are described, therefore providing the user with the opportunity to explore hypothetical cluster formation strategies. 

* There is no assumption on the data set beyond the given data points along with a normalization, such that for a generic data point x, |x| <= M for some M > 0. 

* The approach is entirely un-supervised and hence the question of validating results arises. We propose a general validation strategy which can be applied to any un-supervised algorithm. This calls for computing two measures of overlap between two groups/clusters, in terms of a "necessary support" based on the actual relative overlap between the two groups, and a "possible support" based on expected relative overlap under all possible cluster formation. 

Results of the approach applied to pattern extraction for the Saccharomyces cerevisiae (yeast) micro-array data, and of the validation for the protein localization data set, are presented, graphically illustrated and discussed. They strongly support the claims made about the merits of the adopted approach.


243. GIMS - A Data Warehouse for Storage and Analysis of Genomic and Functional Data. (up)
Mike Cornell, Norman W. Paton, Shengli Wu, Paul Kirby, University Of Manchester;
Karen Eilbeck, Celera;
Andy Brass, Crispin J. Miller, Carole A. Goble, Stephen G. Oliver, University Of Manchester;
mcornell@cs.man.ac.uk
Short Abstract:

 GIMS is an object database that models a eukaryotic genome and integrates it with functional data on the transcriptome and on protein-protein interactions. We used GIMS to store the yeast genome and demonstrate how storage of diverse genomic data can be beneficial for analysing transcriptome data using context-rich queries.

One Page Abstract:

 Effective analysis of genome sequences and associated functional data requires access to many different kinds of biological information. For example, when analysing transcriptome data it may be useful to have access to the sequences upstream of the genes, or to the cellular location of their protein products. This information may currently be stored in different formats at different sites that do not use techniques allowing ready analysis in conjunction with other information. The Genome Information Management System (GIMS) is an object database that integrates genome sequence data with functional data on the transcriptome and on protein-protein interactions in a single data warehouse. We have used GIMS to store the yeast genome and to demonstrate how the integrated storage of diverse kinds of genomic data can be beneficial for analysing data using context-rich queries. GIMS demonstrates the benefits of an object based approach to data storage and analysis for genomic data. It allows data to be stored in a way that reflects the underlying mechanisms in the organism, and permits complex questions to be asked of the data. This poster provides an overview of the GIMS system and describes some analyses that illustrate its use for mining transcriptome data. We show how data can be analysed in terms of gene location, attributes of the genes' protein products (such as cellular location, function or protein:protein interactions), and regulatory regions present in upstream sequences. 


244. Mutable gene model (up)
Giuseppe Insana, Heikki Lehväslaiho, EMBL-EBI;
insana@ebi.ac.uk
Short Abstract:

 We have created Perl classes and applications to analyze and validate sequence variations in expressed genes. A gene is represented as a doubly linked list with unique labels and DNA nucleotides as values. Exons, transcripts and translation products are virtual objects pointing to the DNA structure. See http://www.ebi.ac.uk/mutations/toolkit/.

One Page Abstract:

 To effectively manage the complexities of sequence changes, new tools are needed: to analyse mutations and their propagation from the DNA to transcripts and translation products, to store variation information in an exchangeable and extensible format, and to automatically validate existing mutation databases.

For these purposes, we have implemented a mutable gene model in two sets of Perl modules: Bio::LiveSeq and Bio::Variation (http://www.ebi.ac.uk/mutations/toolkit/).

The Bio::LiveSeq modules read EMBL-formatted sequences and create a doubly linked list data structure for the DNA sequence. Instead of storing exons, transcripts and translation products as separate strings, they are computed dynamically from the DNA. The use of pointers to individual nucleotides on the DNA sequence makes the structure completely independent of any positional information and robust to changes such as insertions or deletions. Multiple mutations with interdependent effects, e.g. a frameshift mutation followed by another, frame-restoring mutation, are easily handled.
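
The core idea translates readily into other languages. Below is a minimal Python sketch (the toolkit itself is written in Perl, and this is not its code) of features held as pointers into a doubly linked list of nucleotides, so that an insertion does not invalidate them:

```python
# Minimal Python sketch of the idea (Bio::LiveSeq itself is written in Perl):
# nucleotides live in a doubly linked list, and features hold pointers to
# nodes rather than integer positions, so insertions do not invalidate them.

class Nucleotide:
    def __init__(self, base):
        self.base, self.prev, self.next = base, None, None

def make_chain(seq):
    """Build the doubly linked list for a DNA sequence."""
    nodes = [Nucleotide(b) for b in seq]
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes

def insert_after(node, base):
    """Insert a nucleotide; pointers held by features remain valid."""
    new = Nucleotide(base)
    new.prev, new.next = node, node.next
    if node.next:
        node.next.prev = new
    node.next = new

def span(start, end):
    """Spell out the sequence between two pointed-to nucleotides."""
    out, n = [], start
    while n is not end.next:
        out.append(n.base)
        n = n.next
    return "".join(out)

chain = make_chain("ATGGCC")
exon_start, exon_end = chain[0], chain[5]   # a feature stored as two pointers
insert_after(chain[2], "T")                 # an insertion inside the feature
print(span(exon_start, exon_end))           # ATGTGCC -- no coordinates to fix up
```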

The use of pointers also facilitates easy conversion between different coordinate systems (based on the entry, the coding sequence or the whole gene). Additional protein-level information can be read from SWISS-PROT into the system, allowing comparative analysis of nucleotide and polypeptide features.

The Bio::Variation modules collect variation information as differences between the reference and the variant sequences and calculate descriptive attributes (labels like "missense" and restriction site changes). Permanent storage is possible in EMBL-like flatfile and XML formats.

An online application, "Mutation Checker", is available to researchers wishing to see the effect of a mutation on any chosen EMBL entry (http://www.ebi.ac.uk/cgi-bin/mutations/check.cgi). Another use of the Mutation Toolkit has been the validation of entries in the OMIM (Online Mendelian Inheritance in Man) database.

The modules are available as part of the open source BioPerl (http://bioperl.org) project. 


245. Discovery Support System for Human Genetics (up)
Dimitar Hristovski, Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia;
Borut Peterlin, Department of Human Genetics, Clinical Center Ljubljana, Slovenia;
Saso Dzeroski, Institute Jozef Stefan, Ljubljana, Slovenia;
dimitar.hristovski@mf.uni-lj.si
Short Abstract:

 We describe an interactive discovery support system for human genetics. The goal of the system is to discover new, potentially meaningful relations between a given starting concept of interest and other concepts, relations that have not yet been published in the medical literature (e.g. a gene candidate for a disease).

One Page Abstract:

 The positional cloning approach has proved very successful in cloning genes for Mendelian human genetic diseases. However, gene identification rarely implies understanding of the pathophysiology of a genetic disease, and consequently of the rationale for therapeutic strategies. Moreover, knowing the entire human genome sequence requires novel methodological approaches in the analysis of genetic disease. In this paper we describe an interactive discovery support system (DSS) for the field of medicine in general and human genetics in particular. The intended users of the system are researchers in biomedicine. The goal of the system is to discover new, potentially meaningful relations between a given starting concept of interest and other concepts, relations that have not yet been published in the medical literature. The main idea is to first find all the concepts Y related to the starting concept X (e.g. if X is a disease, then Y can be pathological functions, symptoms, etc.). Then all the concepts Z related to Y are found (e.g. if Y is a pathological function, Z can be a molecule structurally or functionally related to the pathophysiology). As the last step, we check whether X and Z appear together in the medical literature. If they do not appear together, we have discovered a potentially new relation between X and Z. This relation should be confirmed or rejected using human judgment, laboratory methods or clinical investigations, depending on the nature of X and Z. The known relations between the medical concepts come from the Medline bibliographic database. The concepts are drawn from the MeSH (Medical Subject Headings) controlled dictionary and thesaurus, which is used for indexing in the Medline database. We use a data mining technique called `association rules' for discovering relationships between medical concepts. Our discovery support system is interactive, i.e., the user of the system can interactively guide the discovery process by selecting concepts and relations of interest. The system provides the possibility to show the Medline documents relevant to the concepts of interest, and also to show the related proteins and nucleotides. We used the DSS to analyze Incontinentia pigmenti (IP), a monogenic genodermatosis whose gene has recently been identified via the positional cloning approach (Nature 2000;405:466-71). We were interested in whether the gene could have been predicted as a gene candidate by the DSS. We succeeded in identifying the NEMO gene as the gene candidate and in retrieving its cDNA sequence (available since 1998). Moreover, the DSS provided some potentially useful data for understanding the pathogenesis of the disease. It has to be stressed that efficient use of the DSS is largely driven by the scientist. We conclude that the DSS is a useful tool complementary to the already existing bioinformatic tools in the field of human genetics. 
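
A minimal sketch of the X -> Y -> Z step, assuming a toy co-occurrence table in place of the real Medline/MeSH association rules (this is an illustration, not the authors' implementation):

```python
# Illustrative sketch (not the authors' implementation) of the X -> Y -> Z
# discovery step: concepts Z reached through intermediate concepts Y that
# never co-occur with the starting concept X are candidate new relations.
# The toy co-occurrence table stands in for Medline/MeSH association rules.

cooccur = {   # concept -> set of concepts it appears with in the literature
    "disease X": {"pathological function Y"},
    "pathological function Y": {"disease X", "molecule Z"},
    "molecule Z": {"pathological function Y"},
}

def discover(x):
    """Return concepts related to x through some intermediate concept,
    but never published together with x itself."""
    ys = cooccur[x]
    zs = set().union(*(cooccur[y] for y in ys)) - ys - {x}
    return [z for z in zs if x not in cooccur[z]]

print(discover("disease X"))   # ['molecule Z'] -- a potentially new relation
```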


246. Search, a sensitive and exhaustive homology search programme (up)
Mitsuo Murata, Tohoku Bunka Gakuen College;
Norio Murata, National Institute for Basic Biology;
mmurata@tech.tbgu.ac.jp
Short Abstract:

 A sensitive and exhaustive homology search programme based on the Needleman-Wunsch algorithm has been created. The programme, Search, is written in C and assembly language; with a query sequence of 200 amino acids, it takes about 3 hours to search the entire SWISS-PROT and TrEMBL databases.

One Page Abstract:

 A sensitive and exhaustive homology search programme based on the Needleman-Wunsch algorithm has been created. In the original Needleman-Wunsch algorithm, the maximum match score is calculated on a two-dimensional array, MAT(m,n), where m and n are the lengths of the two sequences, and the similarity or homology between the two sequences is statistically assessed using the maximum scores from a large number of scrambled versions of the original sequences. Depending on the size of the proteins, this algorithm demands a large amount of computer memory and CPU time, and implementing it in a homology search program has been considered impractical. However, calculating the maximum score alone does not require a two-dimensional array, as it is not necessary to trace back the maximum match pathway; in a modified algorithm, only a single one-dimensional array is used for MAT. The CPU time can be shortened by using a faster programming language. Thus, with the many powerful personal computers presently available, it is possible to write a programme based on the Needleman-Wunsch algorithm that can be used to search for homologous sequences in large protein sequence databases. The programme so created was named Search, and its most CPU-intensive parts were written in assembly language. Search was used to look for sequences homologous to the slr1854 gene product of the cyanobacterium Synechocystis, which contains 198 amino acids but whose function has not been identified. The databases screened were SWISS-PROT and TrEMBL, which together currently contain 531,048 protein sequences. With a sample size of 200 for the statistical analysis, it took 3 hours 9 min on a Celeron 556 MHz computer to search the two databases. One of the proteins found to be homologous to the query sequence was the general stress protein 18 (GSP18) of Bacillus subtilis.
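
For illustration, here is a minimal Python sketch of the single-array score computation described above; the linear match/mismatch/gap scoring is an assumption for illustration, not necessarily the scheme used by Search:

```python
# Minimal sketch of the single-array score computation; the scoring values
# are illustrative assumptions, not necessarily those used by Search.

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch maximum match score in O(len(b)) space.

    Only one row of the DP matrix is kept, so no traceback is possible --
    exactly the trade-off exploited by the modified algorithm."""
    row = [j * gap for j in range(len(b) + 1)]    # row 0 of the matrix
    for i in range(1, len(a) + 1):
        diag = row[0]                             # value of cell (i-1, j-1)
        row[0] = i * gap                          # first column of row i
        for j in range(1, len(b) + 1):
            up = row[j]                           # cell (i-1, j), still unmodified
            row[j] = max(diag + (match if a[i-1] == b[j-1] else mismatch),
                         up + gap,                # a[i-1] aligned to a gap
                         row[j-1] + gap)          # b[j-1] aligned to a gap
            diag = up
    return row[-1]

print(nw_score("HEAGAWGHEE", "PAWHEAE"))
```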


247. Gene Ontology Annotation Campaign and Spotfire Gene Ontology Plug-In (up)
Bo Servenius, Stefan Pierrou, Robert Virtala, Jacob Sjöberg, Dan Gustafsson, AstraZeneca R&D Lund, Sweden;
Caroline Oakley, AstraZeneca R&D Charnwood, UK;
Tobias Fändriks, Spotfire Inc, Göteborg, Sweden;
bo.servenius@astrazeneca.com
Short Abstract:

 Gene Ontology (GO) is increasingly used for annotation of the gene product aspects: molecular function, biological process and cellular component. We have developed an annotation system, including a GO plug-in for the Spotfire program, and initiated a large-scale annotation project aiming to annotate human genes obtained from Affymetrix experiments.
 
 

One Page Abstract:

 Expression analysis experiments are becoming more and more common in both academic and industrial molecular biology research. The results of such experiments - e.g. those generated with Affymetrix technology - often consist of lists of more or less incomprehensible gene names. To enrich the listings, good annotations describing different aspects of the genes are needed. Annotations should be made with controlled vocabularies that preferably have a hierarchical or otherwise structured arrangement.

Gene Ontology (GO) is such a vocabulary (http://www.geneontology.org) and has, during the last couple of years, become more and more widely used for annotating the gene aspects: molecular function, biological process and cellular component. GO was originally compiled by the model organism projects FlyBase (Drosophila), the Saccharomyces Genome Database and the Mouse Genome Database. Since the start, several more model organism databases have joined. However, almost no annotations have been made for human genes, and thus there are very few GO annotations that can be used directly for the genes studied in the expression analysis experiments we are generating with the Affymetrix system.

At AstraZeneca R&D Lund we have initiated a project - the Gene Ontology Annotation Campaign (GOAC) - aiming to annotate a large set of genes obtained in a specific set of Affymetrix experiments. To facilitate these annotations we have built an annotation system encompassing a database storing the GO annotations with references and comments, and data aggregations supplying the annotators with supporting information. We are using the GO Browser (John Richter) for looking up GO terms.

In collaboration with Spotfire Inc (http://www.spotfire.com) we have developed a plug-in for their data visualization program Spotfire. This plug-in - the Spotfire Ontology Explorer (SOE) - makes it possible to visualize how a set of genes obtained in a Spotfire data visualization is positioned in the GO structure. It is also possible, starting from a specific position in the GO structure, to see how the genes at that position are represented in the visualization. In this way we get a much better overview of the output of an expression analysis result listing. 

Furthermore, we are exploring the possibilities of exploiting the GO-annotated genes within our AstraZeneca-wide SRS-based bioinformatics system, e-lab. The implementation of GO annotations in more and more public domain and proprietary bioinformatics databases will make it possible to use the information integration capabilities of SRS utilizing the GO annotations. 


248. Predicting the transcription factor binding site-candidate, which is responsible for alteration in DNA/protein binding pattern associated with disease susceptibility/resistance (up)
Julia Ponomarenko, Galina Orlova, Tatyana Merkulova, Elena Gorshkova, Oleg Fokin, Sergey Lavryushev, Mikhail Ponomarenko, Institute of Cytology and Genetics, Novosibirsk, Russia;
jpon@bionet.nsc.ru
Short Abstract:

 A system of databases and tools, rSNP_Guide, predicting TF-site presence/absence and explaining the correlation between SNP allele and disease was developed and applied to the analysis of several disease-related alleles: TNFa (malaria), GpIbb (Bernard-Soulier syndrome), K-ras (lung tumor), and TDO2 (mental disorders). rSNP_Guide is supplied with Help options describing how to use the system for SNP analysis: \url{http://wwwmgs.bionet.nsc.ru/mgs/systems/rsnp/}.

One Page Abstract:

 Single nucleotide polymorphism (SNP) analysis bridges the gap between genome sequence and human health care. We have developed a system of databases and tools, rSNP_Guide, aimed at recognizing the transcription factor (TF) binding site whose presence/absence could explain SNP-related disease susceptibility/resistance. Two heuristic observations informed the system's design. First, we anticipated that databases on site-directed mutagenesis could be used for disease studies too. Second, we have unexpectedly shown that accounting for SNP-caused alterations in gel mobility shift assays, in addition to common DNA analysis, may increase TF-site recognition accuracy. We have therefore integrated natural and artificial mutation data and developed a JavaScript applet that predicts the TF-site candidate responsible for disease susceptibility/resistance in regulatory DNA regions. Initial data on naturally occurring mutations were taken from the HGMD, dbSNP, HGBASE, ALFRED and OMIM databases, whereas the site-directed mutagenesis data are from the TRANSFAC, TRRD, COMPEL and ACTIVITY databases. From the original papers (rSNP_BIB), we document the experimental design (SYSTEM) and the additional data on DNA/protein-complex alterations (rSNP_DB). These SNP/mutagenesis-related databases were complemented by three natural TF-site-related resources: (i) a database, SAMPLES, of experimentally detected TF sites; (ii) a database, MATRIX, of TF-site weight matrices; (iii) a JavaScript applet, rSNP_Tools, applying the weight matrices for site recognition within regulatory DNA regions. By systematizing the rSNP_DB entries, characteristic examples were selected for illustrative treatment by the rSNP_Tools applet. The tools use the SAMPLES and MATRIX resources to recognize the TF-site candidate responsible for disease susceptibility or drug resistance/sensitivity. Finally, the results are stored (rSNP_Reports), illustrating how to use rSNP_Tools in practice. To use rSNP_Guide in practice, a user must first have a target sequence and mutated variants for the (+) and (-) DNA strands. For example, for analysis of the site-directed inhibitory mutations within the promoter of the rat angiotensin II type 1A receptor gene, rAT(1A)R-C, four DNA sequences should be prepared: (i) the allele "WT" (+)-strand, 5'-tttttatTtttaAataaat-3'; (ii) the (-)-strand, 5'-atttatTtaaaAataaaaa-3'; (iii) the mutant "MT", 5'-tttttatGtttaCataaat-3'; and (iv) 5'-atttatGtaaaCataaaaa-3' (altered nucleotides are capitalized). A TF is selected in the upper section after loading the rSNP_Tools interface, URL=http://wwwmgs.bionet.nsc.ru/mgs/programs/rsnp/. In the TF-site recognition window, the user inputs each sequence variant and observes the TF-site recognition score profile. With a single dominant peak, the corresponding score should be fixed. When all TFs of interest have been examined and their peaks fixed, the additional evidence is input, i.e., the relative degree of DNA/protein binding efficiency on a scale between +1 and -1. In our example, the transcription activity of the rAT(1A)R-C gene is "normal" in the "WT" allele ("+1") and "inhibited" in "MT" ("0"). A single TF may then be predicted to bind (in our example, the MEF-2 site is the TF-site candidate responsible for regulating rAT(1A)R-C gene transcription activity, as was confirmed experimentally). In the case that several or no TFs are predicted, the significance threshold (p=0.025 by default) can be varied from 0.05 to 0.0001. 
The rSNP_Guide system was tested on several human genes with SNP alleles and disorders: TNFa (severe malaria), pC (type I protein C deficiency), GpIbb (Bernard-Soulier syndrome), factor VII (severe bleeding disorder), and Gg-globin (hereditary persistence of fetal hemoglobin). Since these tests were successful, we applied rSNP_Guide to the SNP-related alleles of intron 2 of the K-ras gene (lung tumor) and intron 6 of the TDO2 gene (mental disorders). For the K-ras gene, rSNP_Guide predicted a GATA-like site-candidate, present in the "CA" allele and absent in the "CC" and "GC" alleles. This explains the different tumor susceptibility/resistance of these alleles in lung, which was confirmed by an antibody test. For the TDO2 gene, the YY1 site-candidate was first predicted by rSNP_Guide and then confirmed experimentally. We thus hope that rSNP_Guide will be useful for SNP-related studies.
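
As an illustration of the weight-matrix site recognition underlying rSNP_Tools, here is a minimal Python sketch; the toy matrix and its weights are assumptions, not taken from the MATRIX resource:

```python
# Illustrative sketch of weight-matrix scanning of allele variants; the toy
# matrix below is an assumption, not taken from the MATRIX resource.

def scan(seq, pwm):
    """Score every window of seq against a position weight matrix.

    pwm: one dict per matrix position, mapping nucleotide -> weight.
    Returns the site recognition score profile along the sequence."""
    w = len(pwm)
    return [sum(pwm[k].get(seq[i + k], 0.0) for k in range(w))
            for i in range(len(seq) - w + 1)]

# A toy 4-column matrix favouring the word ATTT.
pwm = [{"A": 1.0, "G": -1.0}, {"T": 1.0}, {"T": 1.0, "G": -1.0}, {"T": 1.0}]

wt = scan("TTTTTATTTTTAAATAAAT", pwm)   # "WT" allele, (+) strand
mt = scan("TTTTTATGTTTACATAAAT", pwm)   # "MT" allele
print(max(wt), max(mt))   # the dominant peak in WT (4.0) drops in MT (2.0)
```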


249. Open Exchange of Source Code and Data from the Stanford Microarray Database (up)
Gail Binkley, Catherine Ball, John Matese, Tina Hernandez-Boussard, Heng Jin, Pat Brown, David Botstein, Gavin Sherlock, Stanford University;
gail@genome.stanford.edu
Short Abstract:

 The Stanford Microarray Database (SMD) is committed to providing open source code and database design, and the data associated with published microarray experiments.

One Page Abstract:

 The Stanford Microarray Database (SMD) stores raw and processed data from cDNA microarray experiments, and provides web-based tools to retrieve, visualize and analyze these data. The primary function of SMD is to support the ongoing research efforts at Stanford University, but we are also committed to providing open source code and database design, and the data associated with published microarray experiments. Toward this end, the first release of the SMD source code and database schema occurred in March 2001. A second release, scheduled for this summer, will include further abstraction of data retrieval through the use of Perl objects, to simplify and streamline the code. There are now over 1,100 published arrays available to the public in SMD. A current goal of SMD is to provide these data for downloading in a MIAME-compliant format. The MIAME (Minimum Information About a Microarray Experiment) specification is an international effort to define a standard method of describing a microarray experiment. The status of the effort to map SMD to the MIAME specification will be presented. SMD can be accessed at: http://genome-www.stanford.edu/microarray/


250. Portal XML Server: Toward Accessing and Managing Bioinformatics XML Data on the Web (up)
Noboru Matoba, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST);
Masatoshi Yoshikawa, Junko Tanoue, Shunsuke Uemura, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST);
noboru-m@is.aist-nara.ac.jp
Short Abstract:

 We propose a Portal XML Server system that provides functionality for accessing bioinformatics XML data. Users can access the data they need simply by visiting the Web site, without knowledge of XML; we therefore expect it to become a useful site for molecular biologists.

One Page Abstract:

 XML is emerging as a standard format on the Web. In the field of bioinformatics, data are beginning to be distributed in XML; examples of such data include GAME and MaXML. We foresee that, in the near future, large amounts of data produced from existing databases will be interchanged in XML on the Web. However, almost none of the users of biological databases are XML experts.

Hence, we propose a system for searching and utilizing bioinformatics data more effectively, without users having to be conscious of XML. The goal of the system is to provide a portal site for accessing bioinformatics XML data. Users can access the data they need simply by visiting the Web site, without knowledge of XML. The system stores users' personal information and makes use of their operation history and preferences, thus achieving individualization and optimization for every user. Moreover, taking advantage of XML facilities such as XLink, an intuitive and effective user interface is offered for XML data that carry a lot of link information. Thereby, the system aims to provide a more effective approach to the many complex kinds of bioinformatics data.

Currently, an XML Viewer is under development as part of the above-mentioned functionality. This is a system for presenting to the user XML files of varied and complicated structure (including annotations, IDs, alignments, etc.) intelligibly and effectively. We are therefore considering a zoom function, functions for extracting and displaying applicable data, and a function for exporting to other files. By adding further functionality (a link viewer, etc.) to the system, we expect it to help users discover new facts.

As future work, we will continue to implement further functions, and we intend to analyse and propose the XML schemas (DTDs) that will be desired from now on.

 [References]

XEWA Workshop http://www-casc.llnl.gov/xewa/

Genome Annotation Markup Elements (GAME) http://www.bioxml.org/Projects/game/

Mouse annotation XML (MaXML) http://genome.gsc.riken.go.jp/homology/ftp.html

XML Linking Language (XLink) Version 1.0 http://www.w3.org/TR/xlink/


251. PFAM-Pro: A New Prokaryotic Protein Family Database (up)
Martin Gollery, Chip Erickson, David Rector, Jim Lindelien, TimeLogic;
martyg@timelogic.com
Short Abstract:

 PFAM-Pro is a new database of HMMs that have been trained exclusively on prokaryotes. This gives PFAM-Pro some advantages for the annotation of new microbial genomes. Experiments show that using PFAM-Pro in conjunction with PFAM yields better results than either database alone.

One Page Abstract:

 Hidden Markov models have become popular for similarity searches. This is due to the manner in which they represent the match, insertion and deletion states, and the transitions between them, for each position in the model. The scores in the model represent the probabilities of each of these states in the data used to build the model. These probabilities are more specific to the sequence family in question than a standard scoring matrix (such as PAM or BLOSUM). Such scoring matrices are based on the substitution probabilities of amino acids across an entire database, and may be less representative of a smaller subset of sequences. As a result, HMMs can recognize that a new protein belongs to an existing protein family even when the similarity from BLAST or Smith-Waterman alignments seems weak. While analysis of large data sets using conventional CPUs can be slow, hardware implementations such as that provided by DeCypher are extremely fast. The PFAM database is curated by Washington University and the Sanger Centre. PFAM consists of a database of HMMs covering many common protein domains. Version 6.2 of PFAM contains alignments and models for over 2700 protein families, based on the Swissprot and SP-TrEMBL protein sequence databases. 

Just as a scoring matrix reflects the data from which it was derived, so does a hidden Markov model reflect its training data. If all of the data are derived from a single organism or type of organism, then the model will be less effective at finding matches in a dissimilar organism. The models in PFAM are therefore based on a wide range of organisms for the broadest possible application.

 The concept behind PFAM-Pro is to turn this idea around completely. While PFAM is widely useful on all organisms, PFAM-Pro is designed for use with prokaryotes only. Where PFAM is generic, PFAM-Pro is specific, and should therefore yield improved results for microorganisms.

 While PFAM-Pro yields some dramatic advantages, we do not advise using it exclusively. Experiments show that using PFAM in conjunction with PFAM-Pro yields better results than either database alone.

 PFAM-Pro is copyrighted, and is freely available for use.


252. Production System for Mining in Large Biological Dataset (up)
Yoshihiro Ohta, Tetsuo Nishikawa, HITACHI Central Research Laboratory;
Shigeo Ihara, HITACHI Ltd.;
yoh@crl.hitachi.co.jp
Short Abstract:

 We constructed a production system, a middleware layer for integrated analysis including data mining and information retrieval from large biological databases. While it maintains libraries of biological images, it also generates dynamic animations automatically and can show mining and retrieval results visually and clearly. 

One Page Abstract:

 We designed a production system that can maintain the compatibility of biological data sets, change their level of abstraction and, moreover, be easily reused and extended. The production system defined here is middleware for integrated analysis, including data mining and information retrieval. While it maintains libraries of biological images, it also generates dynamic animations automatically to suit each circumstance, and shows results visually and clearly in response to retrieval and mining requests from biological researchers.

To this end, we have designed and developed mining and retrieval engines with a dynamic animation interface. This middleware is not only a constructor for new databases, but also an important tool providing high-performance functional analysis and automatic generation of dynamic animations. Using the production system constructed as above, we can manage, in a unified way and through a graphical user interface, information about genes, proteins, pathways and the relations between biological objects. The components of the system are as follows.

(1)"Definition of annotation and data structure of the production system" In order to give annotations to biological data and increase efficiency of analysis, we developed techniques to give various annotation sets to the biological data sets, for example, genome sequences and Medline abstracts. (2)"Construction of retrieval and mining engines for production system" For the purpose of retrieving and mining useful and unknown information from huge biological data set, we devised an algorithm of engines for information retrieval and data mining, including SVM, Decision Tree, and so on. (3)"Visual interface for representing the results of analysis" Intending to show mining and retrieval results visually and clearly, we developed visualized libraries of production system on the Web browser.

We evaluated the usability of the production system using Medline abstracts, expression profiles, clinical data sets, and so on, and confirmed that the analysis results can be interpreted and re-examined visually and clearly, with good performance. 


253. Querying Multiple Biological Databanks (up)
Patrick Lambrix, Linköpings universitet;
patla@ida.liu.se
Short Abstract:

 Users of multiple biological databanks face several problems including lack of transparency, the need to know implementation details of the databanks and lack of common representation. To alleviate these problems, we present a query language that contains basic operators for query languages for biological databanks and a supporting architecture. 

One Page Abstract:

 Nowadays, biologists use a number of large biological databanks to find relevant information for their research. Users of these databanks face a number of problems. This is partly due to the inherent complexity of building such databanks, but also due to the fact that some of the databanks are built without much consideration for the current practice and lessons learned from building complex databases in related areas such as distributed databases and information retrieval.

A first problem users face is that they are required to have good knowledge of which databanks contain which information, as well as of the query languages and the internal structure of the databanks. Often there is a form-based user interface that helps the user in part, although different systems may use different ways of filling in the data fields in the forms. With respect to querying, most databanks allow for full-text search and for search with predefined keywords. The internal structure of the databanks may differ (e.g. flat files, relational databases, object-oriented databases). Biologists, however, should not need to have good knowledge of the specific implementations of the databanks; instead, a transparent solution should be provided.

Another problem is that representations of the same data by different biologists will likely differ. This requires methods for representing and reasoning about biological data, in order to understand and integrate the different conceptual models of different researchers and databanks.

When users have a complex query in which information from different sources needs to be combined, they often have to divide it into a number of simple queries, submit the simple queries to one databank at a time, and combine the results themselves. This is a non-trivial task involving steps such as deciding which databanks to use and in which order, how terms in one databank map to terms in other databanks, and how to combine the results. A mistake in any of these steps may lead to inefficient processing or even to no result being obtained. A requirement for biological databanks is therefore that they allow for complex queries in a highly distributed and heterogeneous environment.

To alleviate these problems we propose a base query language containing basic operators that should be present in any query language for biological databanks. The choice is based on a study of current practice as well as on a (currently small) number of interviews. In this work we restrict the scope of the query language to text, and observe that the proposed query language is not necessarily used as an end-user query language. For the common end user the language is likely to be hidden behind a user interface. The main features of the language include an object model, queries about types and values, paths and path variables, as well as the use of specialized functions that allow for hooks to, for instance, alignment and text search programs. Further, we propose an architecture for a system that supports this query language and deals with the problems concerning representational issues and the lack of transparency. The fact that many users daily use a large number of legacy systems is also taken into account. The proposed architecture contains a central system consisting of a user interface, a query interpreter and expander, a retrieval engine, and an answer filter and assembler. Further, the architecture assumes the existence of an ontology base, a databank knowledge base with information about the contents and capabilities of the source databanks, as well as the use of wrappers that encapsulate the source databanks.
 
 


254. Automated Analysis of Biomedical Literature to Annotate Genes with Gene Ontology Codes: Application of A Maximum Entropy Method to Text Classification (up)
Soumya Raychaudhuri, Jeffrey T. Chang, Patrick D. Sutphin, Russ B. Altman, Stanford University;
tumpa@stanford.edu
Short Abstract:

 Intensive efforts to functionally annotate genes with controlled vocabularies, such as Gene Ontology (GO), are underway to summarize the function of characterized genes. We propose using computational approaches to interpret literature to provide annotation. These methods are applied to annotating yeast genes with GO biological process codes.

One Page Abstract:

 Genomic experimental approaches to biology are forcing practitioners to consider biological systems from a more global perspective. Investigators examining a broad biological phenomenon may be unfamiliar with many of the individual genetic components that are directly or tangentially involved. To facilitate such interpretation, intensive efforts to annotate genes with controlled vocabularies, such as Gene Ontology (GO), are underway to summarize the function of characterized genes. When codes from a controlled vocabulary are assigned to genes, experts must carefully examine the literature to obtain the relevant codes. Expansion of biological understanding and the specialized needs of sub-disciplines will force re-annotation over time. We propose using computational approaches to interpret the literature automatically to provide annotation. The methods we describe here are applied to annotating all of the genes within the yeast genome with GO biological process codes. We use a maximum entropy text classification algorithm to identify the subjects discussed in the texts associated with the gene in question and thereby annotate the gene. Here we annotate genes based on two sets of abstracts: a curated collection of articles maintained by the Saccharomyces Genome Database, and articles obtained by rapid sequence database queries. 
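
A minimal sketch of the classification step, standing in multinomial logistic regression for the maximum entropy classifier (the two are equivalent in this setting); the toy abstracts and GO labels are illustrative, not the authors' training data:

```python
# Sketch only: logistic regression stands in for the maximum entropy
# classifier; the toy abstracts and GO labels are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

abstracts = ["mitotic spindle checkpoint arrest",
             "glycolysis glucose catabolism flux"]
go_codes = ["GO cell cycle", "GO glycolysis"]

vec = CountVectorizer()
X = vec.fit_transform(abstracts)
clf = LogisticRegression(max_iter=1000).fit(X, go_codes)

# A gene inherits the GO codes predicted for the abstracts associated with it.
print(clf.predict(vec.transform(["spindle assembly during mitosis"])))
```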


255. Rat Liver Aging Proteome Database: A web-based workbench for proteome analysis (up)
Jee-Hyub Kim, Yong-Wook Kim, Hyung-Yong Kim, Jae-Eun Chung, Jin-Hee Kim, Sujin Chae, Eun-Mi Jung, Tea-Jin Eom, Yong-Ho In, R & D Institute, Bioinfomatix Inc.;
Hong Gil Nam, Department of Life Science, Pohang University of Science and Technology, Pohang, Korea (South);
kjh726@bioinfomatix.com
Short Abstract:

 We developed a rat liver aging-related proteome database. The database contains annotated protein function information based on GO (Gene Ontology) and provides an easy GUI (graphical user interface) for data input, searching, visualization, and protein expression profile analysis. This allows it to serve as a workbench for proteome analysis.

One Page Abstract:

 Since the announcement of the first human genome sequence draft, researchers have been moving to analyze the human proteome expressed by the genome. These days, proteome analysis is most commonly accomplished by the combination of two-dimensional gel electrophoresis (2DE) and mass spectrometry (MS), from which a large amount of 2DE images and MS data are generated. In order to store, retrieve and analyze these kinds of data in biological research, it is necessary to develop a system that is easy to use and provides good visualization.

In this poster, we present a database example containing two groups of aging proteomes extracted from rat liver cells over a period of two years. One group is from rats on a controlled diet, and the other from diet-free rats. It has been shown that cells from diet-restricted rats live longer than other cells, and thus we can search for major factors influencing the aging of cells. In this database, we stored 8 gel images and information on 78 spots. Among the 78 spots, 26 have annotated protein function information, including peptide mass information. For standardization, we used a controlled vocabulary based on GO (Gene Ontology) in parts of the database.

We built the database not only for researchers to store such data, but also to provide a workbench for proteome analysis. To this end we developed a proteome expression profile analysis program for predicting unknown protein function. We also added modules for searching other databases with researchers' own experimental data, and provided an easy GUI (graphical user interface) for processing image data and analyzing proteome data.

In the field of proteomics, new technologies are constantly emerging, so we designed the system for extensibility. In the future, we will add modules for image comparison and tandem mass spectrometry. We will also link information in this database to the rat genome sequence. 


256. UCspots Before Your Eyes (up)
Ellen Graves, UCSF Ernest Gallo Clinic & Research Center;
ellen@egcrc.net
Short Abstract:

 UCspots is a complete microarray LIMS for managing data from plate preparation to image analysis results, including agarose gel images, arrayer settings, experimental parameters, single channel and composite microarray images, and image analysis software results. UCspots is MIAME compliant and available in Oracle 8 and IBM DB2.

One Page Abstract:

 UCspots is a complete microarray LIMS for managing data from plate preparation to image analysis results, including agarose gel images, arrayer settings, experimental parameters, single channel and composite microarray images, and image analysis software results.

The plate preparation process is completely captured, including transfers to and from different sized plates, PCR protocol data, and individual quality control values for each element as it is processed. Array fabrication data includes robot configuration, slide usage, and grid locations for individual elements from the arrayed plates.

Full experiment data collection is guaranteed through a flexible, directed user interface that guides an investigator to add all data to the LIMS. Data collected include hybridization, wash and probe parameters, as well as images and analysis software results. The database holds all results from the currently available packages GenePix, ScanAlyze, and Spot, and can be expanded to collect data from newly developed software. Experiment results - single, multiple, or partial experiments - can be exported to data analysis packages such as GeneSpring, Cluster, and TreeView.

We have taken an object-oriented approach to the schema design for UCspots in the areas of element types and image analysis software results. Instead of trying to collect all element or analysis results in a generic table, we provide a flexible schema where tables for additional element types or analysis packages can be added easily.

Sites using UCspots can configure the LIMS to collect elements and analysis software unique to their microarray implementation. Through this flexible schema, it is possible to query across element types or analysis results, allowing improved quality control and/or evaluation of different technologies.

UCspots' schema is MIAME compliant, and the data collected can be exported to public microarray databases as they become available. The database is available in Oracle 8 and IBM DB2. The UCspots data entry application, written in Java, ensures data integrity beyond that provided by the relational DBMS. The web query pages allow secure access from the desktop, enabling individual and collaborative data analysis. 

The advantages of UCspots over other microarray LIMSs include full coverage of design, fabrication, and experimentation with microarrays. UCspots can easily be integrated into an analysis workflow that includes any analysis package that allows data import. We have emphasized quality control, especially of elements, since arrays are only as good as the elements placed on them. We have also designed the schema to maintain an association between each element on a plate and a spot on each microarray, ensuring that experimental results can be correlated across experiments as well as within an experiment. 


257. PASS2: Semi-Automated database of Protein Alignments organised as Structural Superfamilies (up)
V.Mallika, Dr.R.Sowdhamini, National Centre for Biological Sciences - Tata Institute for Fundamental Research;
mallika@ncbs.res.in
Short Abstract:

 PASS2 is a nearly-automated version of CAMPASS with interesting features and links. Superfamily members are extracted from SCOP using a 25% sequence-identity and good-resolution cut-off. Alignments are based on the conservation of structural features like solvent accessibility, hydrogen bonding and secondary structure. These structure-based sequence alignments are created by COMPARER and JOY. Sample SMS_CAMPASS URL: \url{http://www.ncbs.res.in/~faculty/mini/campass/sms_campass.html}

One Page Abstract:

 We have generated an updated, nearly-automated version of the original superfamily alignment database, CAMPASS (Sowdhamini et al., 1998 ["CAMPASS: A database of structurally aligned protein superfamilies" Structure 6(9):1087-94]). This new version, PASS2, contains alignments of protein structures at the superfamily level in direct correspondence with the SCOP database (Murzin et al., 1995 [SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540]). The superfamilies are chosen from SCOP, and all representative members are extracted using a cut-off of 25% sequence identity and good resolution. For all MMS (multi-member superfamilies), the alignment of family members is based on the conservation of structural features like solvent accessibility, hydrogen bonding and the presence of secondary structures. We have employed COMPARER (Sali and Blundell, 1990 [The definition of topological equivalence in homologous and analogous structures: a procedure involving a comparison of local properties and relationships. J. Mol. Biol., 212, p.403]; Sali et al., 1992 [A variable gap penalty function and feature weights for protein 3-D structure comparison. Prot. Engng., 5, p.43]) to obtain the multiple sequence alignment of distantly related proteins. The final alignments of MMS and SMS (single-member superfamilies) are presented in JOY format (Mizuguchi et al., 1998 [JOY: protein sequence-structure representation and analysis. Bioinformatics 14:617-623]) to include structural features. This version also introduces features like keyword search, an alignment option using MALIGN and JOY for aligning a user's query sequence with all superfamily members, and PSI-BLAST and PHI-BLAST homology searches against PASS2. For every entry in PASS2, links to other databases such as RCSB, SCOP, EBI, HOMSTRAD, CATH, FSSP, PALI, PRESAGE, MODBASE, LPFC, DDBASE and DSDBASE are available. RasMol is available for visualizing the superposed 3D structures and the single 3D structures of superfamily members. The association of sequences from genome databases with the existing structural superfamilies is under way. The sample page for SMS-CAMPASS (structure-based sequence annotation for single-member superfamilies), which contains 421 entries, is now available at the following URL: http://www.ncbs.res.in/~faculty/mini/campass/sms_campass.html


258. The HAMAP project: High quality Automated Microbial Annotation of Proteomes (up)
Anne-Lise Veuthey, Corinne Lachaize, Alexandre Gattiker, Karine Michoud, Catherine Rivoire, Andrea Auchincloss, Elisabeth Coudert, Elisabeth Gasteiger, Swiss Institute of Bioinformatics;
Paul Kersey, European Bioinformatics Institute (EBI);
Marco Pagni, Amos Bairoch, Swiss Institute of Bioinformatics;
Anne-Lise.Veuthey@isb-sib.ch
Short Abstract:

 In the framework of the SWISS-PROT database, we have initiated a project to automatically annotate bacterial and archaeal proteins belonging to two categories: proteins having no similarity to other proteins, and proteins belonging to well-defined families. Annotation will be inferred only after several consistency controls, in order to maintain the quality of SWISS-PROT.

One Page Abstract:

 More than 50 complete genomes are available in public databases. Collectively they encode more than 80,000 different protein sequences. Such a large number of sequences makes classical manual annotation an intractable task. We have therefore initiated, in the framework of the SWISS-PROT database [1], a project that aims to annotate automatically a significant percentage of proteins originating from microbial genome sequencing projects. It is being developed to deal specifically with two subsets of bacterial and archaeal proteins: 1) proteins that have no recognizable similarity to any other microbial or non-microbial proteins (generally called "ORFans"); this task mainly implies automatic recognition and annotation of features such as signal sequences, transmembrane domains, coiled-coil regions, inteins, ATP/GTP-binding sites, etc.; 2) proteins that are part of well-defined families or subfamilies where it is possible, using software tools, to build automatically a SWISS-PROT entry of a quality identical to that produced manually by an expert annotator. To this end, we are building, for each well-defined (sub)family, a rule system that describes the level and extent of annotation that can be assigned by similarity with a prototype manually annotated entry. Such a rule system also includes a carefully edited multiple alignment of the (sub)family. In both cases described above, the idea is to annotate proteins to the highest level of quality. The programs in development are specifically designed to track down "eccentric" proteins. Family assignment will be performed by several identification methods based on profiles, HMMs and similarity searches, in order to avoid false positives. Moreover, the programs will detect peculiarities such as size discrepancies, absence or divergence of regions involved in activity or binding (to metals, nucleotides, etc.), presence of paralogs, inconsistencies with the biological context, etc. Such "problematic" proteins will not be annotated automatically and will be flagged for further analysis by SWISS-PROT expert annotators. Finally, the consistency of annotations will be controlled at the genome level by checking the completeness of metabolic pathways.

[1] Bairoch A., Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res. 28:45-48(2000).
 
 


259. BioWIC: Biologically what's in common? (up)
Benjamin J. Stapley, Imperial Cancer Research Fund, London;
Michael JE Sternberg, Imperial College of Science and Technology, London;
b.stapley@icrf.icnet.uk
Short Abstract:

 BioWIC determines and visualizes the semantics of a collection of proteins using common terms obtained from their Swiss-Prot annotations and Medline. The system presents the user with the most significant terms and a graph that reveals the underlying relationships between the proteins. The work is illustrated with examples.

One Page Abstract:

 Many new experimental techniques, such as mass spectrometry and expression array technologies, yield data on the behaviour of a large number of genes or proteins simultaneously. BioWIC is a simple, generic bioinformatics tool that can aid in the interpretation of such experiments. BioWIC takes a set of proteins and tries to determine common terms that describe the sequences. This is achieved by finding Swiss-Prot homologues (or identities) for each protein, extracting relevant text, and determining the most significant terms. 

For each protein, homologous Swiss-Prot sequences are extracted; from these, citing Medline documents or keywords are retrieved. The terms are weighted by inverse document frequency (IDF); in associating terms with sequences, we apply IDF iteratively and generate term vectors describing each protein.

In order to visualize the relations between the proteins, we determine how similar each protein is to every other protein using the cosine of their term vectors. Pairs of proteins with a cosine similarity above some threshold are linked, and an undirected graph is generated. In addition, keywords are included in the graph by linking them to their most relevant proteins.
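
A minimal Python sketch of the two steps just described, IDF weighting of term vectors followed by cosine-similarity linking; the toy corpus and threshold are illustrative, not BioWIC's actual data or parameters:

```python
# Illustrative sketch; the corpus and threshold are toy values, not
# BioWIC's actual data or parameters.
import math
from collections import Counter

def idf_vectors(docs):
    """docs: {protein: list of terms from its Swiss-Prot/Medline text}.
    Returns an IDF-weighted term vector per protein."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    return {p: {t: tf * math.log(n / df[t]) for t, tf in Counter(terms).items()}
            for p, terms in docs.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) \
         * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

vecs = idf_vectors({"P1": ["kinase", "cell", "cycle"],
                    "P2": ["kinase", "signal"],
                    "P3": ["ribosome", "translation"]})
# Pairs above the similarity threshold become edges of the undirected graph.
edges = [(a, b) for a in vecs for b in vecs
         if a < b and cosine(vecs[a], vecs[b]) > 0.05]
print(edges)   # [('P1', 'P2')]
```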

The impetus here is to present the user with labelled clusters which aid comprehension of the underlying semantic structure of the data and can help in the formulation of new hypotheses. 

The work is illustrated with examples from S. cerevisiae expression data.


260. Structure information in the Pfam database (up)
Sam Griffiths-Jones, Mhairi Marshall, Kevin Howe, Alex Bateman, The Sanger Centre;
sgj@sanger.ac.uk
Short Abstract:

 Pfam is a database of protein domain families. The latest release of Pfam (6.3) contains 2847 families and matches over 68% of protein sequences and 48% of residues in sequence databases. Recently, extensive use of available structural information has led to significant improvements in Pfam family quality and annotation. 

One Page Abstract:

 Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. The database is available on the WWW in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgr.ki.se/Pfam/ and in the US at http://pfam.wustl.edu/. The latest version of Pfam (6.3) contains 2847 families matching over 68% of all protein sequences and 48% of all residues in SWISS-PROT 39 and TrEMBL 14. 

Recently, development of the Pfam database has focussed on the extensive use of available structural information to improve the quality of Pfam families and to add structural and functional annotation. Domains are the structural and functional building blocks of proteins, and so, where the data are available, structural information has been used to ensure that Pfam families correspond to single structural domains. This matching of families and domains enables an enhanced understanding of the function of multi-domain proteins and facilitates cross-linking and integration with structure classification databases such as SCOP and CATH. Chopping a single family into two or more structural domains in many cases also reveals an increased incidence of the particular domain, often in novel protein contexts. 

Pfam sequence alignments now include structural markup derived from the DSSP database, and active site residues as described in SWISS-PROT feature tables. The improved web site graphical view also shows a number of predicted non-domain regions of proteins including transmembrane, low complexity, coiled coil and signal peptide regions. 


261. Sequence Viewers for Distributed Genomic Annotations (up)
Robin Dowell, Howard Hughes Medical Institute and Washington University in St. Louis;
Allen Day, Cold Spring Harbor Laboratory;
Rodney M. Jokerst, Howard Hughes Medical Institute and Washington University in St. Louis;
Guanming Wu, Cold Spring Harbor Laboratory;
Lincoln Stein, Cold Spring Harbor Laboratory;
robin@genetics.wustl.edu
Short Abstract:

 The Distributed Sequence Annotation System (DAS) is a lightweight XML-based protocol for exchanging information about genomic annotations. We discuss here the design and implementation of two viewers for DAS annotations. One is a standalone Java application and the other is a server-side Perl script.
 
 

One Page Abstract:

 The Distributed Sequence Annotation System (DAS) is a lightweight XML-based protocol for exchanging information about genomic annotations. The system can be used to publish the positions of predicted genes, nucleotide variations, repetitive elements, CpG islands, genetic markers, or indeed any predicted or experimental features that can be assigned to genomic coordinates. The system is currently used by the WormBase database to publish C. elegans genomic annotations, by the GadFly database to publish annotations on the D. melanogaster genome, and by the Ensembl project to publish annotations on H. sapiens and M. musculus.

We discuss here the design and implementation of two viewers for DAS annotations. One, called Geodesic, is a standalone Java application. It connects to one or more DAS servers, retrieves annotations, and displays them on an integrated map. The other, called DasView, is a Perl application that runs as a server-side script. It connects to one or more DAS servers, constructs an integrated image, and serves the image to a web browser as a set of clickable imagemaps. Both viewers provide the user with one-click linking to the primary data sources where they can learn more about a selected annotation, and are sufficiently flexible to accept a wide range of annotation types and visualization styles. The standalone Java viewer is appropriate for extensive, long-term use. The Perl implementation is suitable for casual use because it does not require the user to preinstall the software.
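
For orientation, here is a minimal Python sketch of the retrieval step a DAS client performs; the server URL and segment are illustrative (this historical server may no longer respond), and the element names follow the DAS/1 features response:

```python
# Sketch of a DAS/1 client fetch; the server URL and segment are illustrative
# and a live DAS server is needed to actually run this.
import urllib.request
import xml.etree.ElementTree as ET

url = "http://www.wormbase.org/db/das/elegans/features?segment=I:1,100000"
xml_doc = urllib.request.urlopen(url).read()

# Each FEATURE element carries an id, a type and coordinates, which a viewer
# such as Geodesic or DasView places on its integrated map.
for feat in ET.fromstring(xml_doc).iter("FEATURE"):
    print(feat.get("id"), feat.findtext("TYPE"),
          feat.findtext("START"), feat.findtext("END"))
```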

Both viewers are freely available under Open Source licensing terms. They can be downloaded from http://biodas.org/


262. The Immunodeficiency Resource: Knowledge Base for Immunodeficiencies (up)
Marianne Pusa, Jouni Väliaho, Jukka Lehtiniemi, Tuomo Ylinen, Institute of Medical Technology, University of Tampere, Finland;
Mauno Vihinen, Institute of Medical Technology, University of Tampere, Finland, Tampere University Hospital, Tampere Finland;
marianne.pusa@uta.fi
Short Abstract:

 For a long time it has been difficult to find a source providing integrated information on rare diseases. The Immunodeficiency Resource (IDR) is a comprehensive knowledge base of information on immunodeficiencies, including computational data and analyses and clinical, biochemical, genetic and structural information. The IDR is freely available at http://www.uta.fi/imt/bioinfo/idr.

One Page Abstract:

 For a long time it has been difficult to find a source providing extensive, integrated information on rare diseases. The Immunodeficiency Resource (IDR) is a comprehensive knowledge base of information on immunodeficiencies, including computational data and analyses and clinical, biochemical, genetic and structural information. The IDR is freely available at http://www.uta.fi/imt/bioinfo/idr.

The IDR is maintained in order to collect and distribute all essential information and links related to immunodeficiencies in an easily accessible format. All information in the IDR is gradually being converted to XML format, and XML's properties will be utilized to offer data services for different platforms. The IDR information system is based on disease- and gene-specific fact sheets, which are provided for all immunodeficiencies. They act as a starting point for further disease-related information. All information on the IDR server is validated by specialists in the IDR group. 

We have compiled all major immunodeficiency data into the IDR, with thousands of useful and interesting links in addition to our own pages. The IDR also includes articles, instructional resources, analysis and visualization tools, as well as advanced search tools. Any text-string search across all information is possible. The data are distributed, error-corrected and validated before release. 

The IDR integrates various web-based services, e.g. sequence databases (EMBL, GenBank, SwissProt), genome information (GDB, UniGene, GeneCard, GenAtlas), the protein structure database (PDB), diseases (OMIM), references (Medline), patient information (ESID registry), symptoms and diagnosis (ESID/PAGID recommendations), laboratories performing diagnosis (IDdiagnostics), mutation data (IDbases), animal models (MGD, Flybase, SacchDB) and information produced by the IDR team. 

We offer up-to-date information on immunodeficiencies and immunology to people with different backgrounds. New features are continuously added to provide a comprehensive navigation point for anyone interested in these disorders, whether a physician, nurse, research scientist, patient, parent of a patient, or member of the general public. 


263. XGI: A versatile high throughput automated sequence analysis and annotation pipeline. (up)
Mark E. Waugh, William T. Anderson, The National Center for Genome Resources;
Mark W. Fiers, Plant Research International;
Jeff T. Inman, Faye D. Schilkey, John P. Sullivan, Callum J. Bell, The National Center for Genome Resources;
mew@ncgr.org
Short Abstract:

 XGI is a portable and flexible multi-threaded system for automated analysis and annotation of both genomic and expressed sequence data. XGI is component-based in that analysis operations are handled by independent modules that interact through XML with the central "pipeline", which handles the data flow and provides database interactivity.

One Page Abstract:

 XGI (Genome Initiative for species "X") has been developed as a portable, flexible system for automated analysis and annotation of both genomic and expressed sequence data. As the name implies, we have developed XGI to reflect the common nature of sequence data independent of organism. XGI has a component-based architecture in which analysis operations are handled by independent modules that interact through XML with the central "pipeline," which controls the operations, handles the data flow and provides database interactivity. The data model itself has been designed to accommodate rapid changes to analysis components, including the addition of new algorithms. The pipeline is multi-threaded and has been written to take full advantage of SMP and DMP systems. Currently, the system handles sequence quality control, including vector and artifact screening and low-quality read trimming. The EST-oriented pipeline then clusters and assembles ESTs into consensus sequences and performs analysis on the results, including similarity and motif searching. The genomic-oriented pipeline handles sequence assembly followed by gene prediction and downstream subsequence analysis, including similarity searching on the predicted genes and ORFs. Both pipelines use a novel method of assigning Gene Ontology (geneontology.org) annotations to predicted features to assist in putative identification. Access to the data is through the web, using a standard web browser connecting to a secure server. The GUI has been developed using AxKit, which converts the results of database queries into XML that is interpreted and displayed by the Perl-embedded stylesheets specified in the header tags. This enables rapid changes to the look, feel and functionality of the GUI with minimal effort, and allows multiple different GUIs to coexist on the same server, reducing administration effort. XGI security has been modeled on the UNIX paradigm and provides USER, GROUP and GLOBAL levels of access for each row in the database; a sketch of this model follows.
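
A minimal sketch of the UNIX-style row-level access model just described; the field names and octal permission encoding are assumptions for illustration, not the actual XGI schema:

```python
# Sketch of UNIX-style row-level access; the field names and the octal
# permission encoding are illustrative assumptions, not the XGI schema.
from dataclasses import dataclass

READ = 4   # permission bit, as in UNIX

@dataclass
class Row:
    owner: str
    group: str
    mode: int   # e.g. 0o640: owner read/write, group read, others none

def can_read(row, user, groups):
    """Check the USER, then GROUP, then GLOBAL permission bits for a row."""
    if user == row.owner:
        return bool((row.mode >> 6) & READ)
    if row.group in groups:
        return bool((row.mode >> 3) & READ)
    return bool(row.mode & READ)

print(can_read(Row("alice", "xgi", 0o640), "bob", {"xgi"}))   # True, via GROUP
```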
 
 


264. Mouse Genome Database: an integrated informatics database resource
Tom C. Wiegers, Janan T. Eppig, Judith A. Blake, Joel E. Richardson, Carol J. Bult, Jim Kadin, Mouse Genome Informatics, The Jackson Laboratory;
tcw@informatics.jax.org
Short Abstract:

 The Mouse Genome Database (MGD) provides a fully integrated information resource about the laboratory mouse, from genotype (genome) to phenotype, including literature curation and experimental datasets. MGD's extensive integration, data representation and robust query capabilities cover sequences, maps, gene reports, alleles and phenotypes, with links to expression and other resources.

One Page Abstract:

 The laboratory mouse is the premier model system for understanding human biology and disease. Much contemporary research focuses on the comparative analysis of mouse and human sequence data combined with the exploration of mouse mutant phenotypes. The Mouse Genome Database (MGD) provides an integration nexus for comprehensive information on the mouse. MGD makes available a wide range of genetic and genomic information, including: unique representation of mouse genes, sequences and sequence maps, comparative maps and data for mammals (especially mouse, human and rat), allelic polymorphism data, and descriptive phenotypic information for genes, mutations and mouse strains. Experimental data and citations are provided for all data associations. Literature curation (over 70,000 articles) is an important source of experimental data. MGD maintains curated interconnections with other online resources such as SWISS-PROT, LocusLink, PubMed, OMIM, and databases for other species. A new feature is the use of the Gene Ontology controlled vocabularies of molecular function, biological process and cellular component for the description of gene products. Phenotype and disease model ontologies are being developed. A second recent addition to MGD is the comprehensive representation of phenotypic alleles, developed in part to support the explosion in new mutant allele discovery from mutagenesis projects and gene targeting efforts. Allele records now include information on the origin and the molecular mutation involved, and are being annotated with precise phenotypic descriptions and human disease information. All allele data are fully integrated with sequence, ortholog, gene expression, and strain polymorphism data.

MGD is supported by NIH grant HG00330.

Blake J.A., et al., (2001) The Mouse Genome Database (MGD): integration nexus for the laboratory mouse. Nucleic Acids Research 29: 91-94. 


265. Methods for the automated identification of substance names in journal articles
Michael Krauthammer, Department of Medical Informatics, Columbia University;
Andrey Rzhetsky, Department of Medical Informatics, Columbia Genome Center;
Carol Friedman, Department of Medical Informatics, Department of Computer Science, Queens;
mk651@columbia.edu
Short Abstract:

 Identification and tagging of substance names in scientific articles is an important first step in automated knowledge extraction systems. We report a novel approach to this problem based on a syntactic analysis of the articles, a sequence comparison tool (BLAST) and a dictionary of substance names.

One Page Abstract:

 INTRODUCTION Our group is building a system (GeneWays) for automated knowledge extraction from online scientific journals. The goal is the reconstruction of molecular networks consisting of interactions between molecular entities as reported in the literature. Identification and tagging of gene and protein names, as well as other substances, is an important first step for successful knowledge extraction from articles. In recent years, different authors have proposed alternative approaches to this task, such as using morphological rules, syntactic analysis, dictionaries or even hidden Markov models. The main challenges in identifying substance names are spelling variations and errors, multi-token names, and newly introduced words that are not listed in any dictionary. We have previously demonstrated that it is possible to tackle the problems of spelling variations and errors by using a sequence comparison tool such as BLAST. Here we show that by combining this technique with a syntactic analysis of the article, it is also possible to handle the problem of multi-token names.

METHOD In summary, the system first identifies noun phrases by performing a syntactic analysis of the article. The system then selects those noun phrases that most likely contain substance names by applying morphological rules and a broad-coverage dictionary. After a part-of-speech analysis of the selected noun phrases, the article is marked up; most potential substance names are specifically tagged, including multi-token names. The next step is the exact identification of the marked-up substance names, i.e. matching the marked-up names with an official entry in a reference database such as LocusLink, taking into account spelling variations and errors. This task is accomplished by using BLAST, a popular sequence comparison tool. All substance names from a reference database are converted into a string of nucleotides by substituting each character in the name with a predetermined unique nucleotide combination. The encoded names are then imported into the BLAST database using the FASTA format. The marked-up substance names from the first step are translated, using the same nucleotide combinations, into a string of nucleotides and matched against the nucleotide representation of substance names in the BLAST database. Significant alignments are listed in the BLAST output file, which is subsequently processed using Perl scripts. At this final stage, the system has determined the closest match between each marked-up name and the reference database according to the individual alignment scores.
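The encoding step can be sketched as follows; the actual character-to-nucleotide table used by the authors is not given in the abstract, so the three-base code below is purely illustrative:

    import string

    ALPHABET = string.ascii_lowercase + string.digits + "- "
    BASES = "ACGT"

    # Assign each character a unique 3-base code (4^3 = 64 codes suffice).
    CODE = {ch: BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]
            for i, ch in enumerate(ALPHABET)}

    def encode_name(name: str) -> str:
        """Translate a substance name into a nucleotide string so that
        approximate name matching can be delegated to BLAST."""
        return "".join(CODE[ch] for ch in name.lower() if ch in CODE)

    # Spelling variants yield nearly identical nucleotide strings, so BLAST
    # alignment scores can rank candidate dictionary entries.
    print(encode_name("interleukin-2"))
    print(encode_name("interleukin 2"))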

RESULTS AND CONCLUSION Our results indicate that a combination of methods for the identification of substance names yields the necessary precision for the subsequent automated knowledge extraction from scientific articles.

REFERENCE Krauthammer M, Rzhetsky A, Morozov P, Friedman C. Using BLAST for identifying gene and protein names in journal articles. Gene. 2000 Dec 23;259(1-2):245-52.


266. Development and Implementation of a Conceptual Model for Spatial Genomics
Mary E. Dolan, Carol J. Bult, The Jackson Laboratory, Bar Harbor, ME, USA;
Kate Beard, Constance Holden, University of Maine, Orono, ME, USA;
mary_dolan@umit.maine.edu
Short Abstract:

 We have developed a data model for genome data interpretation that reflects the natural and intuitive thought patterns of the biologist, incorporates spatial information in a way that is natural and intuitive to a spatial data analyst, and is expressive enough to capture the complex interactions among genome features.

One Page Abstract:

 Genomics researchers continue to face the problem that sequencing methods and other experimental advances have produced and continue to produce massive amounts of complex data, which must be stored, organized and, most importantly, interpreted and integrated to be of use to biologists. Each month the bioinformatics journals present novel approaches to this problem. One promising strategy is to take an interdisciplinary approach, in which one attempts to bring the concepts, techniques and tools of a mature field to bear on a body of new data types. Although genetics researchers have long recognized the biological significance of the spatial organization of the genome and made different kinds of "maps" to visualize genetic and genomic features, attempts to fully exploit the concepts and methods developed in the area of spatial data analysis and geographic information systems have been limited. Our work is an attempt to move beyond this and use the tools of this field to analyze, query, and visualize aspects of the spatial organization of genomic features and the comparison of genomes.

We present here, as part of a project to develop a proof of principle Genome Spatial Information System (GenoSIS, http://www.spatial.maine.edu/~cbult/project.html), a conceptual model for genome data interpretation that: reflects the natural and intuitive thought patterns of the biologist; incorporates spatial information in a way that is natural and intuitive to a spatial data analyst; includes a high degree of expressivity of the complex interactions among genome features that recent experimental evidence indicates is essential to understanding genome structure and the regulation of genome function (Kim, J., Bergmann, A., Stubbs, L. (2000) Exon Sharing of a Novel Human Zinc-Finger Gene, ZIM2, and Paternally Expressed Gene 3 (PEG3). Genomics, 64, 114-118; Labrador, M., Mongelard, F., Plata-Rengifo, P., Baxter, E.M., Corces, V.G., Gerasimova, T.I. (2001) Protein encoding by both DNA strands. Nature, 409, 1000).

We also present an implementation of this conceptual model that: integrates data from different data sources using an object-based database schema; allows us to easily take advantage of existing dynamic spatial analysis, classification, querying, and visualization methods and tools according to specifications outlined by the Open GIS Consortium (http://www.opengis.org/techno/specs.htm); stores data in a manner that can be easily updated to include the most recent discoveries of complex feature interactions and dependencies.
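To make the spatial view concrete, genome features can be treated as intervals supporting range queries, in the spirit of a GIS. The minimal sketch below uses the ZIM2/PEG3 exon-sharing example cited above; the coordinates are invented for illustration:

    from dataclasses import dataclass

    @dataclass
    class Feature:
        name: str
        chrom: str
        start: int   # inclusive
        end: int     # inclusive
        strand: str  # '+' or '-'

    def overlapping(features, chrom, start, end):
        """Spatial range query: all features intersecting [start, end] on chrom."""
        return [f for f in features
                if f.chrom == chrom and f.start <= end and start <= f.end]

    # Coordinates below are invented for illustration only.
    genes = [Feature("Peg3", "7", 6700000, 6730000, "-"),
             Feature("Zim2", "7", 6710000, 6760000, "+")]

    # Two overlapping genes on opposite strands, as in the exon-sharing example.
    print(overlapping(genes, "7", 6715000, 6720000))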
 
 


267. Modelling Genomic Annotation Data using Objects and Associations: the GenoAnnot project
Helene Riviere-Rolland, Gilles Faucherand, Genome-Express;
Christophe Bruley, INRIA Rhone-Alpes 655 avenue de l'Europe - 38330 - Montbonnot - France;
Anne Morgat, INRIA Rhone-Alpes;
Magali Roux-Rouquie, Institut Pasteur 25 rue du Dr. Roux - 75015 - Paris - France;
Claudine Medigue, Institut Pasteur;
François Rechenmann, Alain Viari, INRIA Rhone-Alpes;
Yves Vandenbrouck, Genome-Express;
h.riviere-rolland@genomex.com
Short Abstract:

 The Geno* project aims at constructing a modular environment dedicated to complete genome analysis and exploration. In this framework, the GenoAnnot module focuses on the annotation of prokaryotic and eukaryotic genomes. It is based on an object-oriented model, using classes and associations. We present here the GenoAnnot ontology and architecture.

One Page Abstract:

 GenoAnnot is part of Geno*, a modular environment dedicated to complete genome analysis and exploration. Geno* is an ongoing project between INRIA, Institut Pasteur, Hybrigenics and Genome Express. In this context, GenoAnnot focuses on the annotation process of prokaryotic and eukaryotic genomes. It provides a framework to identify and visualise chromosomal regions of interest (such as coding regions, regulatory signals) in order to assist the biologist in the course of raw genome annotation. GenoAnnot extends our former project (Imagene) both in terms of biology (eukaryotic genomes) and technology.

As a first step in the design of GenoAnnot, it is of major importance to formalise and explicitly represent the biological concepts that come into play. For this purpose we chose an object-based (UML-like) approach to design our data model. The model was then implemented using the AROM representation system (http://www.inrialpes.fr/romans/arom) developed in the Romans project at INRIA Rhône-Alpes. AROM provides an original knowledge representation model based on classes and associations.

Our model for GenoAnnot is composed of four main root classes. The first holds genetic and phylogenetic information about the species under study. The second represents biological entities involved in the constitution and expression of a genome; instances of this class are not necessarily linked to a known sequence (e.g. the existence of a gene may be known even if the corresponding part of the chromosome has never been sequenced). The third represents regions of interest (defined as intervals) on sequences; instances of this class compose most of the annotation information and can be imported from databanks or produced by tasks within the GenoAnnot environment. Finally, the fourth holds information related to the genomic sequences (chromosomes, contigs) that have to be annotated. At present our hierarchy is composed of about 90 subclasses of these four main classes (most of them under the latter two). Moreover, these classes are connected by associations. In AROM, associations can be n-ary (i.e. may connect more than two classes) and, like classes, can be organised into hierarchies and can carry attributes. This latter feature turned out to be particularly useful: for instance, the coordinate of a particular feature on a chromosome is held by an association, which allows the same feature to be linked to different versions of the chromosome, possibly at different locations on each of them.
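A minimal sketch of an association that carries its own attributes follows. It is written in Python purely for illustration (the actual system uses the AROM representation system), and all class names are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Feature:
        name: str

    @dataclass
    class ChromosomeVersion:
        accession: str
        version: int

    @dataclass
    class IsLocatedOn:
        """The association itself holds the coordinates, so one feature can
        be linked to several versions of a chromosome, each with its own
        location."""
        feature: Feature
        chromosome: ChromosomeVersion
        start: int
        end: int

    gene = Feature("yfgA")   # hypothetical gene name
    locations = [IsLocatedOn(gene, ChromosomeVersion("AL123456", 2), 1200, 2600),
                 IsLocatedOn(gene, ChromosomeVersion("AL123456", 3), 1450, 2850)]
    for loc in locations:
        print(loc.chromosome.version, loc.start, loc.end)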

Another important functionality of GenoAnnot, inherited from Imagene, is its ability to produce these objects (namely instances of classes and associations) by using tasks. Tasks represent the methodological knowledge. They implement sequence analysis methods (such as gene or signal finding) and are run under the control of a generic task-engine provided by Geno*. Finally, all the objects in GenoAnnot can be managed and visualised through graphical user interfaces. 

A first version of GenoAnnot will be made available at the end of 2001, both as a standalone application and as a Java API.


268. An Overview of the HIV Databases at Los Alamos
Brian K. Gaschen, Charles E. Calef, Rama Thakallapally, Brian T. Foley, Carla L. Kuiken, Bette T. Korber, Theoretical Biology and Biophysics, Los Alamos National Laboratory;
bkg@lanl.gov
Short Abstract:

 The HIV Genetic Sequence, Immunology, and Drug Resistance Databases at http://hiv-web.lanl.gov collect, compile, annotate and analyze HIV and other primate immunodeficiency virus gene and protein sequences, T-cell epitopes and antibody reactivity data. Tools to aid researchers in the analysis of the sequences and in vaccine design are provided at the site. 

One Page Abstract:

 The HIV Genetic Sequence, Immunology, and Drug Resistance Databases at http://hiv-web.lanl.gov collect, compile, annotate and analyze primate immunodeficiency virus gene and protein sequences, T-cell epitopes, and antibody reactivity data. The databases bring together sequence data from individual studies to provide a global frame of reference. The drug resistance database contains a collection of mutations in HIV genes that confer resistance to anti-HIV drugs, and can be searched through a variety of fields including gene, drug class, and mutation. Alignments of T-cell epitopes and linear antibody binding sites are available from the immunology database including variation summarized by HIV-1 subtype and country of origin. Alignments of gene and protein sequences, phylogenetic trees, analyses of site-specific variability, and models of immunodeficiency virus evolution are available from the sequence database. Sequence data can be retrieved via a large variety of selection criteria, including country of isolation, subtype of virus, patient health, date, coreceptor usage, and region of genome. The database provides tools for analyses, both online via web browsers, and through downloadable software. The database staff also conducts independent research using the resources of the databases and the unique computational facilities available at Los Alamos. The results of these projects are used to enrich the resources available on our web site. Some of the recent projects at the database include the development of a parallel version of fastDNAml which has been used in studies to estimate the most likely date of the most recent common ancestor of the current HIV pandemic, combining the immunology and sequence database information to assess CTL epitope variability by subtype across different regions of the genome, studies of HIV subtype distribution and variation in New York City immigrant populations, and the development of tools to aid in more efficient vaccine design.


269. Variability of the immunoglobulin superfamily V-set fold using Shannon entropy analysis
Manuel Ruiz, Marie-Paule Lefranc, IMGT, the international ImMunoGeneTics database;
manu@ligm.igh.cnrs.fr
Short Abstract:

 Taking into account the expertise provided by IMGT, the international ImMunoGeneTics database (http://imgt.cines.fr), a systematic sequence variability analysis of the Immunoglobulin and T cell Receptor variable domains was carried out. This study allows us to highlight the sequence variations and constraints within the functional and structural requirements of the Ig and TcR V-REGIONs.

One Page Abstract:

 The Immunoglobulins (Ig) and T cell Receptors (TcR) are extensively studied, and a considerable amount of genomic, structural and functional data concerning these molecules is available. IMGT, the international ImMunoGeneTics database (http://imgt.cines.fr), manages the large flow of new Ig and TcR sequences and currently contains more than 44,000 sequences. The variability of the Ig and TcR V-REGIONs has previously been analysed by different approaches. The Kabat-Wu (KW) method is the most popular one. A modified version of the KW method, the Jores, Alzari and Meo (JAM) method, was established in order to enhance the resolving power of the variability index. However, both methods are usually used without a critical assessment of the results and without any standardization of the amino acid positions between chain types and species. A third approach, Shannon information analysis, has been proposed as more appropriate for analysing sequence variability. Since IMGT sequence annotations provide exhaustive and high-quality Ig and TcR V-REGION data, based on the IMGT Scientific chart rules and on the standardized IMGT unique numbering, this standardization can now be exploited to set up a variability analysis of the V-set fold, amino acid position by amino acid position. In this study, we carried out a systematic variability analysis of the annotated V-set fold sequences from IMGT, using Shannon information analysis. This variability analysis makes it possible to describe the susceptibility of an amino acid position to evolutionary replacements and to highlight the sequence variations and constraints within the functional and structural requirements of the immunoglobulin superfamily V-set fold. This approach is particularly important for antibody engineering, humanization of antibody fragments, model building, and structural evolution studies.
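The per-position measure itself is the standard Shannon entropy, H = -sum_a p(a) log2 p(a), taken over the amino acids observed in one alignment column; the IMGT unique numbering is what guarantees that a column means the same position across chain types and species. A minimal sketch with toy data:

    import math
    from collections import Counter

    def column_entropy(column):
        """H = -sum_a p(a) * log2 p(a) over the amino acids in the column."""
        counts = Counter(aa for aa in column if aa != "-")  # ignore gaps
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    alignment = ["QVQLVQSG",   # toy V-REGION fragments, already aligned
                 "QVQLQESG",
                 "EVQLVESG"]
    for i, col in enumerate(zip(*alignment), start=1):
        print(f"position {i}: H = {column_entropy(col):.2f} bits")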


270. SNPSnapper - an application for genotyping and database storage of SNP genotypes produced by microarray technology
Juha Saharinen, Pekka Ellonen, Janna Saarela, National Public Health Institute;
Leena Peltonen, Department of Human Genetics, UCLA School of Medicine, Los Angeles, USA.;
Juha.Saharinen@KTL.Fi
Short Abstract:

 SNPs are important genotype markers. We have developed an application for SNP allele calling and genotyping from microarray data. The microarray technology, together with the SNPSnapper software, allows semi-automatised, high-throughput SNP genotyping followed by data storage and management in a relational database.

One Page Abstract:

 Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation, with an estimated frequency of about 1/1000 in the human genome. This, in addition to the low mutation frequency of SNPs, makes them very useful genetic markers for a number of different genetic studies. We have developed a software package for SNP allele calling and genotyping of samples from microarray data. In the experimental set-up, allele-specific oligonucleotides are immobilised on a microscope slide. Corresponding genomic regions of the analysed samples are amplified by multiplex PCR, transcribed in vitro to RNA, and finally hybridised to the immobilised oligonucleotides. The genotypes are determined by allele-specific primer extension using fluorescently labelled nucleotides. This system currently delivers thousands of genotypes per glass slide. Data from the microarray reader are imported into the SNPSnapper program, which sorts the microarray data points according to SNP and provides a GUI for visualisation of allelic ratios and related absolute signal intensities. In the resulting scatter plot, clusters representing different genotypes are typically seen, whereas in a fraction plot the different genotypes are distinguished by their location on the intensity fraction axis. SNPSnapper assigns a genotype to each sample by comparing the signal intensities representing the different alleles. The limiting signal intensity values as well as the fraction value boundaries between called genotypes can be set either manually or automatically. The genotypes can be further validated by hand, to discard e.g. genotypes derived from low signal intensity data points or from areas of the glass slide with high background intensity. The genotyping data are then stored in a relational database, where each genotype is linked to the experiment, project and SNP. Likewise, each experiment is stored in the database with information describing the instrument used, SNPs, samples, experimental conditions and the operator. These features allow high-throughput genotyping with semi-automatised information processing and database storage. The application is developed with the Borland Delphi development environment, and the database connection is made with Microsoft ADO components, allowing a database-server-independent implementation.
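The calling step can be sketched as below. The signal threshold and fraction boundaries are invented values; in SNPSnapper they can be set manually or automatically:

    def call_genotype(intensity_a, intensity_b, min_signal=500.0, hom_cutoff=0.2):
        """Call AA/AB/BB from the two allele-specific signal intensities."""
        total = intensity_a + intensity_b
        if total < min_signal:
            return "no call"                 # too little signal to trust
        fraction = intensity_a / total       # position on the fraction axis
        if fraction >= 1.0 - hom_cutoff:
            return "AA"
        if fraction <= hom_cutoff:
            return "BB"
        return "AB"

    for a, b in [(3200, 150), (1500, 1400), (90, 2900), (120, 180)]:
        print(a, b, "->", call_genotype(a, b))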


271. An XML document management system using XLink for the integration of biological data
Junko Tanoue, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST);
Masatoshi Yoshikawa, Noboru Matoba, Shunsuke Uemura, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Tech;
junko-ta@is.aist-nara.ac.jp
Short Abstract:

 We propose an XML document management system adopting XLink technology. The system can help researchers define relationships among XML resources collected from biological data providers. Such link information, as expert knowledge, is expected to be especially useful for annotating resources or writing reports.

One Page Abstract:

 Biological data resources have begun to provide their data in XML formats. With the use of XML, interoperation of databases and provision of broker services will become much easier. Such improvement is surely a benefit for researchers who wish to discover unknown biological rules or facts.

Here we propose another type of XML application, for biologists who manage data collected from public databases. This application employs XLink (XML Linking Language) to specify relationships among data resources.

XLink has the following features:

1. Assert linking relationships among more than two resources

2. Associate metadata with a link

3. Express links that reside in a location separate from the linked resources

 When XLink is used for linking XML documents, it enables:

1. To make a link to a specific part of a document without an anchor tag

2. To create a virtual document embedding linked parts from outside documents.

This virtual document can serve as a sort of scrapbook that a researcher may want to use to organize his/her ideas. This link information can be shared with other researchers, because such links are useful expertise. An application adopting XLink technology can help researchers manage XML documents with link information, and it can be especially useful for annotating resources or writing reports; a small illustration follows below.
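As a small illustration, an out-of-line extended link connecting two resources can be built with standard tools. The xlink attribute names follow the XLink 1.0 specification, while the element names and URLs below are invented:

    import xml.etree.ElementTree as ET

    XLINK = "http://www.w3.org/1999/xlink"
    ET.register_namespace("xlink", XLINK)

    def q(attr):
        """Qualified xlink attribute name."""
        return f"{{{XLINK}}}{attr}"

    link = ET.Element("annotationLink", {q("type"): "extended",
                                         q("title"): "shared motif"})
    ET.SubElement(link, "resource", {q("type"): "locator", q("label"): "seqA",
        q("href"): "http://example.org/embl/X12345.xml#gene1"})
    ET.SubElement(link, "resource", {q("type"): "locator", q("label"): "seqB",
        q("href"): "http://example.org/embl/Y67890.xml#gene7"})
    ET.SubElement(link, "related", {q("type"): "arc",
                                    q("from"): "seqA", q("to"): "seqB"})
    print(ET.tostring(link, encoding="unicode"))

Because the link resides outside the two documents it connects, it can be stored, annotated and shared independently of them, which is exactly the scrapbook usage described above.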

[References]

Achard F, Vaysseix G, Barillot E. XML, bioinformatics and data integration. Bioinformatics 2001 Feb;17(2):115-25

XEWA Workshop http://www-casc.llnl.gov/xewa/

XML Linking Language (XLink) Version 1.0 http://www.w3.org/TR/xlink/

XML Pointer Language (XPointer) Version 1.0 http://www.w3.org/TR/xptr


272. Immunodeficiency Mutation Databases on the Internet
Marianne Pusa, Jouni Väliaho, Institute of Medical Technology, University of Tampere, Finland;
Pentti Riikonen, Institute of Medical Technology, University of Tampere, Finland, Turku Centre for Computer Science, University of Tur;
Tuomo Ylinen, Jukka Lehtiniemi, Institute of Medical Technology, University of Tampere, Finland;
Mauno Vihinen, Institute of Medical Technology, University of Tampere, Finland, Tampere University Hospital, Tampere, Finland;
marianne.pusa@uta.fi
Short Abstract:

 Altogether 15 mutation databases (IDbases) on primary immunodeficiencies are currently maintained at IMT Bioinformatics. We have developed a program suite, MUTbase, providing interactive and quality-controlled submission of information to mutation databases. Our mutation databases offer updated, integrated information on immunodeficiencies for all interested parties.

One Page Abstract:

 Altogether 15 mutation databases (IDbases) on primary immunodeficiencies are currently maintained at IMT Bioinformatics. We have developed a program suite, MUTbase, providing user-friendly, interactive and quality-controlled submission of information to mutation databases. Our mutation databases offer updated information on immunodeficiencies for all interested parties in an integrated, comprehensive format.

The amount of knowledge of genetic variation is increasing rapidly. There are many disease-causing mutations, and we have created a system for collecting, maintaining and analysing data on the growing number of these mutations. Our system contains patient-based mutation databases that function as a resource for genomics and genetics in molecular biology, as well as for diagnostics and therapeutics in medicine.

Over 80 immunodeficiencies are currently known and, because they are rare, information on them has been difficult to obtain. Altogether 15 mutation databases are maintained at IMT Bioinformatics, containing some 1,630 entries, and the number is increasing. We also maintain a registry, KinMutBase, of mutations in human protein kinases related to disorders. All databases and further information are available at http://www.uta.fi/imt/bioinfo.

The program suite, MUTbase, provides user-friendly, interactive and quality-controlled submission of information to mutation databases. The interactive data submission forms include several quality controls and essential links to related information. Database maintenance can also be carried out using Internet tools. A variety of tools is also provided for further study of the databases on the World Wide Web. The program package writes and updates a large number of Web pages, e.g. distributions and statistics of disease-causing mutations and changes in restriction patterns. MUTbase is available free on the Internet at http://www.uta.fi/imt/bioinfo/MUTbase.

IDbases can be accessed via BioWAP, a mobile Internet service also provided by IMT Bioinformatics. BioWAP provides a channel to all major bioinformatics databases and analysis programs.


273. Martel: parsing flat-file formats as XML
Andrew Dalke, Dalke Scientific Software, LLC;
dalke@dalkescientific.com
Short Abstract:

 Martel is a parser generator for bioinformatics which allows existing flat-file formats to be used as if they were in XML. It simplifies the process of data extraction and conversion, HTML markup and even format validation. Martel is part of the biopython.org collaboration.

One Page Abstract:

 Bioinformaticians deal with hundreds of formats. Some are well defined and others are ad hoc. Some come from databases and others from program outputs. All need to be parsed.

There are many different requirements for a parser. A survey of 19 different SWISS-PROT parsers found these use cases: 1) count the number of records in a file, 2) find the name and location of each record, 3) convert to FASTA, 4) convert to a generic data model (ignoring some fields), 5) convert to a full data model (no fields ignored), 6) display a record in HTML with hyperlinks and 7) validate the format is correct. An eighth common case is to recognize a format or version automatically. Excepting Martel, no existing system handles all of these cases and those that come close require a considerable amount of additional programming.

The traditional way to write a parser is with a parser generator like yacc or Lion's SRS. Yacc proves to be an inappropriate solution for bioinformatics formats because it assumes strong separability between lexing and parsing. Bioinformatics formats are strongly position-dependent, which calls for explicit state control in yacc. SRS solves that by making the lexer state implicit in the parser, but it ties the parser's actions tightly to the parsing, so the parser cannot easily be changed to handle the different use cases mentioned in the previous paragraph.

Martel combines two different approaches to simplify parser development. The first is the recognition that most bioinformatics formats -- and nearly all the formats that are otherwise hard to parse -- can be described with a regular grammar instead of a context-free one. This simplifies the format definition by using a single grammar instead of the hybrid lexer/parser used in other systems. The grammar is a slightly modified subset of Perl5 regular expressions and is easily understood by most bioinformatics software developers.

The other approach is the use of an event-based parser. Once a record has been converted into a parse tree, the tree is traversed and information about the nodes is sent to the caller via SAX events, as in XML processing. Tags for the startElement and endElement events come from the groups defined in the format definition, named using the (?P<named>group) syntax made popular by Python. Generating events allows separation between the parser and the actions. Generating SAX events allows existing XML tools, like DOM and XSLT, to be used with no additional effort.
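A toy illustration of the idea (not Martel's actual grammar or API): a regular expression with named groups drives SAX-style events, so any event handler can be plugged in:

    import re

    record = re.compile(r"ID   (?P<id>\w+)\nDE   (?P<description>[^\n]+)\n")

    def parse(text, handler):
        for m in record.finditer(text):
            handler.startElement("record")
            for name, value in m.groupdict().items():
                handler.startElement(name)
                handler.characters(value)
                handler.endElement(name)
            handler.endElement("record")

    class PrintHandler:                      # any SAX-style handler works
        def startElement(self, tag): print(f"<{tag}>", end="")
        def characters(self, text): print(text, end="")
        def endElement(self, tag): print(f"</{tag}>")

    parse("ID   100K_RAT\nDE   100 kDa protein.\n", PrintHandler())

Swapping the handler changes the action (counting records, FASTA conversion, HTML markup, validation) without touching the format definition, which is the separation the paragraph above describes.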

Martel has been in use for several months as part of the biopython.org collaboration. It has proved capable of handling all of the use cases listed earlier. In performance it is faster than the other parsers available for Perl and Python, and is no more than a factor of four slower than equivalent but less flexible code in Java and C. When used for validation it has identified many problems in existing popular database formats, either because the format documentation is incomplete or because the distributed data is indeed in the wrong format.

Martel is freely available. More information is available at http://www.dalkescientific.com/Martel/


274. Disulphide database (DSDBASE): a handy tool for engineering disulphides and modelling disulphide-rich systems
Vinayagam A., R. Sowdhamini, National Centre for Biological Sciences;
vinayagam@ncbs.res.in
Short Abstract:

 DSDBASE, a database on disulphide bonds, includes native disulphides and those that are stereochemically possible between pairs of residues in a protein. One potential application is to design site-directed mutants in order to enhance the thermal stability of a protein. Another use is proposing 3D models of disulphide-rich peptides.

One Page Abstract:

 The occurrence and geometry of all known disulphide bonds in protein structures have been recorded by examining the stereochemistry of such covalent cross-links in a non-redundant database of proteins. In addition, other modelled disulphides which are stereochemically possible between pairs of residues in a protein are also considered. The modelling of disulphides has been achieved using the program MODIP (Sowdhamini et al. (1989) Prot. Engng., 3, 95-103). The inclusion of native and modelled disulphides increases the dimensions of the derived database (DSDBASE) enormously. Disulphide bonds of specific loop sizes, strained native disulphides and scrambled disulphide models will be discussed. One potential use of DSDBASE is to design site-directed mutants in order to enhance the thermal stability of proteins. Another application, which will be illustrated, is to employ this database for proposing three-dimensional models of disulphide-rich polypeptides, such as toxins and small proteins of known disulphide bond connectivity. This method proved highly efficient when applied to known structures (such as endothelin). The database is accessible over the web (http://www.ncbs.res.in/~faculty/mini/dsdbase/dsdbase.html).
 
 


275. A Combinatorial Approach to Ontology Development
Judith A. Blake, David P. Hill, Joel E. Richardson, Martin Ringwald, Mouse Genome Informatics, Jackson Laboratory;
jblake@informatics.jax.org
Short Abstract:

 We present an experiment in ontology development that entails combining separate ontologies to create more complex and specific ontologies. Two test sets, one of developmental processes, the other of heart anatomy, are computationally combined to generate a novel third ontology with interesting implications for future data representations.

One Page Abstract:

 The Gene Ontology (GO) project encompasses the development and use of the biological ontologies of molecular function, biological process and cellular component. Complete representation of developmental processes necessitates incorporation of concepts from an additional ontology, anatomy. Incorporation of the anatomy component is problematic for a project like GO because anatomical concepts can be species-specific. We present an experimental test of cross-product ontology implementation where the biological process of heart development for the mouse is explored. This approach recognizes that the developmental portion of the biological process ontology can be constructed from the two independent ontologies of developmental processes and mouse developmental anatomy whose cross product provides all possible combinations. The cross-product approach is inherently complete as long as the two vocabularies used to construct it are complete and it is logical as long as the two vocabularies are orthogonal. The generation of complete cross products can be automated and provides an alternative approach to ontological development from that of the incremental addition of terms. Definitions of the cross-product terms can be automatically derived from the combination of the original definitions, and synonyms provide customary terms for the user. This approach is being explored for implementation in the GO project as well as for other ontology development in the Mouse Genome Database (MGD) and the Gene Expression Database (GXD).
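A minimal sketch of the generation step; the example terms are invented and far simpler than the real GO and anatomy vocabularies:

    from itertools import product

    processes = ["development", "morphogenesis", "cell differentiation"]
    heart_parts = ["heart", "atrium", "ventricle", "cardiac valve"]

    # 4 anatomy terms x 3 process terms -> 12 candidate cross-product terms,
    # with definitions derivable by combining the parents' definitions.
    cross_terms = [f"{part} {proc}" for part, proc in product(heart_parts, processes)]
    for term in cross_terms:
        print(term)

As long as the two input vocabularies are complete and orthogonal, the enumeration is exhaustive by construction, which is the completeness argument made above.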

 The GO consortium is supported by NIH grant HG-02273. MGD is supported by NIH grant HG-00330. GXD is supported by NICHD grant HD-33745. 


276. StressDB: The PennState Arabidopsis Stress Database
Smita Mitra, Nigam Shah, Pennsylvania State University, PA, USA.;
mxm66@psu.edu
Short Abstract:

 We are in the final stages of development of StressDB, a Windows-based microarray database for the management of microarray data. We are using an Oracle database management system on a Windows NT/2000 server. The scripts for the back end, application layer and front end will be packaged for easy installation by any non-profit institution.

One Page Abstract:

 StressDB: The PennState Arabidopsis Stress Database

 With the advent of DNA array-based technologies, the scientific community has felt the need for more sophisticated tools for the management, comprehension and exploitation of data. Microarray technology yields copious volumes of data, which necessitates the use of sophisticated database management systems for efficient data management. The current public databases include ArrayExpress, Gene Expression Omnibus (GEO) and Microarray Project ArrayDB. While these provide central repositories for publicly available microarray data, local databases are needed for researchers to store and manage their data until it is ready for publication. Local microarray databases include the Stanford Microarray Database (SMD), Yale Microarray Database (YMD), Gene Expression Database (GXD) and RNA Abundance Database (RAD). Most other local databases are at an early stage of development and continue to evolve.

With the goal of better managing our data, we decided to build a web-based relational database for the storage, retrieval and analysis of microarray data. We are in the final stages of the development of StressDB and will release the database to the public by August 2001. The scripts for the database will be made freely available online to non-profit institutions soon afterwards, packaged for easy installation by other labs (with appropriate licenses). We believe that the scientific community will benefit from a small, comprehensive, yet fully functional microarray database designed for use by a single lab or a few closely-knit labs, and that none of the current databases meets this need. The public databases are not designed for implementation by a single lab. The local databases that currently exist are either too large in back-end architecture and functionality, and too high-maintenance, for a small group of labs (being meant for a large group of collaborators covering various organisms), or are at a very early stage of development. We are confident that StressDB will be a valuable contribution to the scientific community.

 Goals of StressDB:

1. We are building a relational database for the storage, web-based retrieval and analysis of microarray data.

2. We are using an Oracle database management system on a Windows NT server.

3. We are committed to conforming our standards for data storage and retrieval to the minimal requirements imposed by MIAME.

4. The scripts for the back end, application layer and front end will be packaged for easy installation by any non-profit institution (after receipt of the appropriate licenses from Oracle and Microsoft). Anybody with knowledge of the Windows OS and systems administration should be able to follow our `readme' files and install the database on a Windows NT or Windows 2000 server.


277. The GeM system for comparative mapping of mammalian genomes
Gisèle Bronner, Bruno Spataro, Christian Gautier, Université Claude Bernard - Lyon 1;
François Rechenmann, INRIA Rhône-Alpes;
bronner@biomserv.univ-lyon1.fr
Short Abstract:

 We present a model for comparative mapping and its implementation as a knowledge base in the GeM system. GeM consists of the coupling of this knowledge base (GeMCore) with specific graphical interfaces, such as GeMME for molecular evolution. GeM was used to characterize evolutionary changes of genome structure between human and mouse.

One Page Abstract:

 An integrative view of genomes is made possible through comparative genomics, which takes into account both the diversity and the heterogeneity of the genomic data available for many organisms. Moreover, the structure and the dynamics of genetic information often depend on gene locations. Comparisons at the genome scale rather than the gene scale, which are possible because chromosomal segments are conserved between species, are therefore of major interest.

Comparative mapping appears to be an essential way to extrapolate genomic knowledge among organisms, especially from model organisms to economically important ones. Mapping information makes it possible to combine genetic and sequence data associated with homologous genes between organisms according to their similar locations, as well as to analyze genome structure, evolution and function. However, such studies need modeling and integration of genomic and mapping data, which currently do not exist.

We propose a model for comparative mapping and its implementation as a knowledge base in the GeM system for comparative genomic mapping in mammals. GeM consists of a core knowledge base, GeMCore, dedicated to the management and control of genomic and mapping information for many organisms, coupled to domain-specific graphical user interfaces dedicated to specific problems (medicine, agronomy, molecular evolution...). Among these interfaces is GeMME, dedicated to molecular evolution.

GeMCore integrates data gathered from the HUGO, MGD, LocusLink and Hovergen databases, after they have been examined to evaluate their consistency. The GeMCore model is UML-like, associations being as richly expressed as the entities of the domain. It is thus possible to handle data associated with marker types, spatial relations produced by comparative mapping, and evolutionary relations between markers and species. GeMCore is implemented with the AROM knowledge representation language and possesses an API, which makes it easy to link it to the domain-specific graphical interfaces.

The GeMME interface for studying evolutionary mechanisms at the genome level is an example of a fully implemented one. It consists of a higher-level model for evolution, with its own concepts, data and analytical tools, particularly at the statistical and graphical levels. Thus, conserved segments between species, Marey maps or the spatial organization of genomic information can be graphically represented, and specific statistics can be computed, such as the Moran or Geary autocorrelation indices; a sketch of the Moran index follows below.
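In its standard form, the Moran index for a quantity x measured at n markers is I = (n/W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2, where m is the mean and W the sum of the weights w_ij. A minimal sketch, with the simplifying assumption that only adjacent markers along the chromosome are neighbours (w_ij = 1 for |i - j| = 1):

    def morans_i(values):
        """Moran autocorrelation index with adjacency weights."""
        n = len(values)
        mean = sum(values) / n
        dev = [v - mean for v in values]
        # w_ij = 1 for adjacent markers (counted symmetrically), 0 otherwise
        num = 2 * sum(dev[i] * dev[i + 1] for i in range(n - 1))
        w_sum = 2 * (n - 1)
        den = sum(d * d for d in dev)
        return (n / w_sum) * num / den

    # e.g. GC content of successive markers; I > 0 indicates spatial clustering
    print(morans_i([0.42, 0.44, 0.47, 0.46, 0.39, 0.38, 0.37]))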

The combination of a generic model for comparative genomic mapping with domain-specific interfaces allows easy addition of novel data as well as development of new methods. Our system was used to characterize evolutionary changes of genome structures between human and mouse at the genome level. Some of these results are presented here.


278. TrEMBL protein sequence database: production, interconnectivity and future developments
Maria-Jesus Martin, Claire O'Donovan, Allyson Williams, Rolf Apweiler, EBI-EMBL;
martin@ebi.ac.uk
Short Abstract:

 The TrEMBL database has focused on making the protein sequence data available as quickly as possible and enhancing the data by computer-generated annotation methods. Due to the diverse sources of information present in TrEMBL, we introduce evidence tags to allow users to see how the data items have been generated.

One Page Abstract:

 TrEMBL (Translation of EMBL nucleotide sequence database), is a computer-annotated protein sequence database derived from the translation of all coding sequences (CDS) in the nucleotide sequence databases EMBL/DDBJ/GenBank, except for those already included in SWISS-PROT. SWISS-PROT, a curated protein sequence data bank, contains not only sequence data but also annotation relevant to a particular sequence. TrEMBL was created in 1996, as a supplement to SWISS-PROT, to cope with the tremendous increase of sequence data that is submitted to the public nucleotide sequence databases. Unlike SWISS-PROT entries, those in TrEMBL are awaiting manual annotation. SWISS-PROT and TrEMBL releases, SWISS-PROT and TrEMBL updates and TrEMBLnew (new entries to be integrated into TrEMBL) are published weekly in the non-redundant database SPTR.

In the era of the genome projects, TrEMBL has two important priorities, which are addressed on a regular basis and reflected in the weekly update of SPTR:

- To provide the protein sequences, via TrEMBLnew, as soon as the nucleotide sequences are available in the nucleotide sequence databases.

- To add as much information as possible to the predicted proteins by automatic annotation methods.

To achieve the first, TrEMBL puts special emphasis on sequences from the complete genome projects. Shortly after a genome becomes available in the nucleotide sequence databases, TrEMBLnew entries are created for each predicted coding sequence. These entries are then prioritized for promotion into TrEMBL after post-processing, which includes redundancy checks and enhancements to the annotation. With the availability of such large amounts of sequence data, the new challenge for TrEMBL is to attach biological functional information to these predicted sequences. InterPro is an integrated documentation resource for protein families, domains and functional sites, which is used to link TrEMBL entries to pattern databases such as PROSITE, PRINTS and Pfam and to the cluster sequence database ClusTr. In addition, automatic annotation is carried out by a rule-based system that improves approximately 20% of the sequences in TrEMBL. The process of linking TrEMBL entries to other databases such as MGD, HSSP and FlyBase is also applied regularly. TrEMBL entries have diverse sources of information, which we wish to flag to allow users to see where the data items come from and what level of confidence can be attached to them, and to enable SWISS-PROT staff to automatically update data if the underlying evidence changes. To achieve this, we are introducing evidence tags to TrEMBL entries. This is currently ongoing internally and we hope to provide a public version by the end of 2001. Documentation on the process so far is provided at: ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html
 
 


279. PIR Web-Based Tools and Databases for Genomic and Proteomic Research
Huang, H, Barker, W.C, Chen, Y, Hu, Z, Lewis, K, Orcutt, B.C, Yeh, L.L, Wu, C, National Biomedical Research Foundation;
huang@nbrf.georgetown.edu
Short Abstract:

 We have recently expanded the Web site of the Protein Information Resource (PIR) with new web-based tools to facilitate data retrieval, sequence analysis and protein annotation. A user-friendly navigation system with graphical interface connects PIR-PSD, iProClass, and other useful databases. These will better support genomic/proteomic research and scientific discovery.

One Page Abstract:

 We have recently expanded the Web site of the Protein Information Resource (PIR) with new web-based tools to facilitate data retrieval, sequence analysis and protein annotation. The PIR-International Protein Sequence Database (PIR-PSD) is a non-redundant, expertly annotated, fully classified, and extensively cross-referenced protein sequence database. iProClass is an integrated resource that provides comprehensive family relationships and structural/functional classifications and features of proteins. The PIR Web site connects these useful databases and tools with a user-friendly navigation system and graphical interface. In addition to standard data retrieval and analysis tools, the following new tools have been implemented: 1) the HMM (hidden Markov model) Domain/Motif Search Tool searches a query sequence against HMM profiles for PIR or Pfam domains or iProClass motifs; it also allows users to build an HMM profile and search it against the PIR-PSD. 2) The Bibliography Submission Page provides a mechanism for the user community to submit literature information for building better protein databases with validated information. 3) The BLAST Similarity Search Tool searches a sequence against PIR-NR, a complete non-redundant protein sequence database currently containing more than 630,000 sequences. The search results, including links to all source databases, are displayed in a user-friendly graphical format. The newly enhanced PIR web-based tools and databases will better support genomic/proteomic research and scientific discovery. The PIR website is accessible at http://pir.georgetown.edu.

This work is supported by NLM grant LM05798 and NSF grant DBI-9974855. 


280. Classification of Protein Structures by Global Geometric Measures
Peter Roegen, Dept. of Math., Tech. Univ. of Denmark;
Peter.Roegen@mat.dtu.dk
Short Abstract:

 The geometry of a protein is characterized by 30 numbers, reducing the comparison of protein structures to the comparison of 30 numbers. The fact that 93% of all connected CATH1.7 domains can be automatically classified correctly based on these 30 numbers alone shows the power of this new protein structure description.

One Page Abstract:

 Classification of Protein Structures by Global Geometric Measures

 The idea is to give an absolute description of each protein structure by a set of characteristic numbers. In contrast, protein similarity measures such as RMSD [1], FSSP Z-scores [2] and AF distance [3] are relative and compare pairs of protein structures only.

 Demands on such a set of numbers are: there have to be enough numbers to distinguish between different protein structures; to ensure that similar protein structures have similar numbers, each number has to depend continuously on deformations of the protein; and finally, the size of the set has to be reasonably small.

 A set of numbers that fulfils these demands is given by a family of global geometric measures stemming from the perturbative expansion of Witten's Chern-Simons path integral associated with a knot in 3-space. Two of these numbers are the writhe and the average crossing number; the remaining numbers are generalizations of these two. Each protein structure is thus associated with a 30-dimensional score-vector.

 A (pseudo) metric on the space of protein structures, denoted the Gauss metric, is given by the Euclidean metric on the score-vector space. The Gauss metric is shown to be correlated with the RMSD for pairs of homologous CATH1.7 [4] domains, and the Gauss metric is zero only in the case of identical domains. Furthermore, proteins of different size are directly comparable without the use of alignment or gap penalties.

 A natural way to represent a homology class of CATH1.7 is by the average of the score-vectors of all proteins in that class, denoted the H-center. 93% of all connected CATH1.7 domains are closest to an H-center of their own topology class; a sketch of this nearest-H-center classification follows below. By averaging, each H-center should depend only slightly on the addition of new protein structures. This automatic protein structure classification system is thus expected to depend only slightly on the set of protein structures used to define it.
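The classification step amounts to a nearest-centroid rule in the Gauss metric. A minimal sketch, with score-vectors shortened to three dimensions and hypothetical class labels:

    import math

    def h_centers(classified):
        """Mean score-vector (H-center) per topology class."""
        return {label: [sum(col) / len(vectors) for col in zip(*vectors)]
                for label, vectors in classified.items()}

    def classify(vector, centers):
        """Assign to the class whose H-center is nearest in the Euclidean metric."""
        return min(centers, key=lambda c: math.dist(vector, centers[c]))

    training = {"topology-A": [[1.2, 0.4, 3.1], [1.0, 0.5, 2.9]],
                "topology-B": [[-0.8, 2.2, 0.3], [-1.1, 2.0, 0.1]]}
    centers = h_centers(training)
    print(classify([0.9, 0.6, 2.7], centers))   # -> 'topology-A'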

References: 

[1] B. W. Matthews & M. S. Rossmann,Methods Enzymol. 115:397-420.

[2] L. Holm & C. Sander, Nucleic Acids Res. (1997) 25:231-234.

[3] A. Falicov & F. E. Cohen, J. Mol. Biol. (1996) 258:871-892.

[4] L. Lo Conte et al., Nucleic Acids Res. (2000) 28:257-259.


281. EST Analysis and Gene expression in Populus leaves
Prof. Petter Gustafsson, Umea University;
Rupali Bhalerao, KTH Stockholm;
Stefan Jansson, Harry Björkbacka, Umea University;
Rikard Erlandsson, Joakim Lundeberg, KTH Stockholm;
petter.gustafsson@plantphys.umu.se
Short Abstract:

 A set of 4,921 sequences from a cDNA library made from young leaves was subjected to clustering using PHRAP and compared against other databases (Swissprot and MENDEL) using WU-BLAST. We focus on gene discovery, transcript profiling and studies of gene function using transgenic trees.

One Page Abstract:

 We are running a large-scale tree EST program using Populus tremula x tremuloides as the model organism. The program is run in collaboration with scientists at the Dept. of Biotechnology, The Royal Institute of Technology (KTH), Stockholm. The overall goal of this project is to establish the Populus Genome Project as the international authority in tree genome research. We focus on three themes: gene discovery, transcript profiling and studies of gene function using transgenic trees. We have produced more than 45,000 ESTs. A set of 4,921 sequences from a cDNA library made from young leaves was subjected to clustering using PHRAP and compared against other databases (Swissprot and MENDEL) using WU-BLAST. Clones with a high similarity to sequences in these databases were automatically annotated, while those with lower BLAST scores were manually annotated. All sequences with high similarity to a gene given a gene family number (GFN) in the MENDEL database were annotated according to that number; others were annotated as PGFNs (Populus GFNs). The genes were also assigned to a functional class based on the functional classification scheme for Arabidopsis at MIPS. A flow diagram of the process of annotation and functional class assignment will be presented in the poster. 14% of the clones encoded the small subunit of Rubisco (rbcS) and 4.5% the major light-harvesting chlorophyll a/b-binding protein Lhcb1. Other photosynthetic proteins were also well represented. Germin-like protein corresponded to 1%, and metallothionein and one protein without significant homology to any protein in public databases each corresponded to almost 0.5% of the clones. The hypothesis that clone frequency could serve as an approximation for protein content was tested by comparing clone frequencies for photosynthetic proteins known to be present in equimolar amounts. In general, the clone frequency was within a factor of 2 of what could be predicted from protein stoichiometries, but for two out of 20 genes the EST clone frequency gave a misleading figure. Chloroplast protein synthesis was estimated to be approximately 20% of the total protein synthesis of Populus leaves, about 45% of the leaf proteins were estimated to be directly involved in photosynthesis, and about 50% of all leaf proteins seem to be localised to the chloroplast.


282. Perl-based simple retrieval system behind the InterProScan package.
Zdobnov E.M., Apweiler R., EMBL-EBI;
zdevg@ebi.ac.uk
Short Abstract:

 We present a Perl-based data retrieval system with a modular structure: each data description module defines the data schema and the associated text parsing routines. The system features recursive-descent parsing rules, efficient lazy parsing and fast data retrieval using B-tree indexing.

One Page Abstract:

 InterProScan [1] is a tool that scans given protein sequences against the InterPro member databases of protein signatures. The InterPro [2] database (v3.0, March 2001) integrates the PROSITE [3], PRINTS [4], Pfam [5], ProDom [6] and SMART [7] databases, and the addition of others is scheduled. The number of signature databases and associated scanning tools, as well as the use of further refinement procedures, accounts for the complexity of the problem: InterProScan must perform considerable data look-up from several databases and program outputs. In the InterProScan package, a Perl-based simple data retrieval system was introduced to provide the required look-up efficiency and easy extensibility. The system has a modular structure and is designed in an SRS-like [8] fashion. Each data description module defines the data schema of the source text data and the parsing rules, and the corresponding Perl module provides an object-oriented interface to the underlying entry attributes. Parsing of the source data into memory objects happens only once and is done upon request, implementing so-called lazy parsing. Hierarchical parsing rules are implemented using the recursive-descent approach (the Parse::RecDescent package). Fast data retrieval is implemented using Perl's native B-tree indexing (DB_File.pm, based on Berkeley DB). The simple 'one Perl module per data source' organisation makes it possible to reuse the modules in other stand-alone ad hoc solutions. It is freely available as part of the InterProScan package from the EBI ftp server (ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/).
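The retrieval idea can be sketched as follows, in Python for brevity (the real system is Perl, with the index held in a DB_File B-tree rather than an in-memory dict); the flat-file layout and field names are simplified:

    def build_index(flatfile):
        """One pass over the flat file: entry ID -> byte offset."""
        index, offset = {}, 0
        with open(flatfile) as fh:
            for line in fh:
                if line.startswith("ID"):
                    index[line.split()[1]] = offset
                offset += len(line)          # assumes a single-byte encoding
        return index

    class LazyEntry:
        """Entry whose fields are parsed from the flat file on first access."""
        def __init__(self, flatfile, offset):
            self._flatfile, self._offset = flatfile, offset
            self._fields = None

        def __getattr__(self, name):
            if self._fields is None:         # parse once, on demand
                self._fields = self._parse()
            try:
                return self._fields[name]
            except KeyError:
                raise AttributeError(name)

        def _parse(self):
            fields = {}
            with open(self._flatfile) as fh:
                fh.seek(self._offset)
                for line in fh:
                    if line.startswith("//"):   # end of entry
                        break
                    tag, _, value = line.partition("   ")
                    fields.setdefault(tag.strip(), value.strip())
            return fields

    # index = build_index("sprot.dat")          # hypothetical flat file
    # entry = LazyEntry("sprot.dat", index["100K_RAT"])
    # print(entry.DE)                           # fields parsed only here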

1. Zdobnov E.M., Hermjakob H., and Apweiler R. "InterProScan - an integration tool for the signature-recognition methods in InterPro" 
Currents in computational molecular biology 2001, ed. El-Mabrouk N., Lengauer T., and Sankoff D. 2001, Montreal: CRM. p. 41-2. 

2. Apweiler R., Attwood T.K., Bairoch A., Bateman A., Birney E., Biswas M., Bucher P., Cerutti L., Corpet F., Croning M.D., Durbin R., Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N., Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R., Marx B., Mulder N.J., Oinn T.M., Pagni M., Servant F., Sigrist C.J., and Zdobnov E.M. "The InterPro database, an integrated documentation resource for protein families, domains and functional sites" 
Nucleic Acids Res, 2001. 29(1): p. 37-40. 

3. Hofmann K., Bucher P., Falquet L., and Bairoch A. "The PROSITE database, its status in 1999" 
Nucleic Acids Res, 1999. 27(1): p. 215-9. 

4. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N., and Wright W. "PRINTS-S: the database formerly known as PRINTS" 
Nucleic Acids Res, 2000. 28(1): p. 225-7. 

5. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L. "The Pfam protein families database" 
Nucleic Acids Res, 2000. 28(1): p. 263-6. 

6. Corpet F., Gouzy J., and Kahn D. "Recent improvements of the ProDom database of protein domain families" 
Nucleic Acids Res, 1999. 27(1): p. 263-7. 

7. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P. "SMART: a web-based tool for the study of genetically mobile domains" 
Nucleic Acids Res, 2000. 28(1): p. 231-4. 

8. Etzold T., Ulyanov A., and Argos P. "SRS: information retrieval system for molecular biology data banks" 
Methods Enzymol, 1996. 266: p. 114-28.


283. Intelligent validation of oligonucleotides for high-throughput synthesis (up)
Bastien Chevreux, Thomas Pfisterer, Christoph Göthe, Sebastian Liepe, Klaus Charissé, Bernd Drescher, MWG BIOTECH AG;
bach@mwgdna.com
Short Abstract:

 Physical restrictions that cannot be expressed by simple, rule-based systems prevent the synthesis of certain oligonucleotides. MWG BIOTECH AG has designed validation methods that use intelligent agent technology to check oligos described in the XML-based OML language, which permits filtering out critical oligos before they go into production. 

One Page Abstract:

 Over the last few years, MWG BIOTECH AG has acquired and maintained leadership in the production of salt-free oligonucleotides. The products range from short, simple oligos (with no modifications) to long oligonucleotides having 5'-, 3'- and up to four different internal modifications.

Unfortunately, certain combinations of oligos and modifications are impossible to produce due to physical restrictions. The main problem in recognising these combinations - aside from simple folding problems - is that no simple rules can be formulated that express the restrictions correctly. Most of the difficult cases are therefore stored in internal company knowledge bases to which researchers outside the company normally do not have access. It can be quite a frustrating experience for both customer and manufacturer when the production of oligos is deferred one or more times because of this.

The bioinformatics department of MWG BIOTECH AG has developed two complementary strategies to resolve this problem. First, the XML-based Oligodefinition Meta Language (OML) has been designed to describe oligonucleotides and their modifications. This allows standard XML tools to be used to read and write files containing oligo descriptions, and ensures platform independence while keeping the oligo descriptions unambiguous. The second part consists of a consistency and manufacturability checking system that validates oligos. This system is based on intelligent agent technology incorporated into clients that can update their checking rules and routines on demand, via the Internet, from a company knowledge server.
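
As a rough illustration of the approach (the actual OML schema and checking rules are internal to the company, so every tag and rule below is an invented stand-in), an oligo description plus a pluggable rule check might look like:

    # Illustrative sketch only: the real OML schema and rules are
    # proprietary, so every tag and rule here is a made-up stand-in.
    import xml.etree.ElementTree as ET

    OML_EXAMPLE = """
    <oligo name="probe-17">
      <sequence>ACGTACGTACGTACGTACGT</sequence>
      <modification position="5prime" type="FLUO"/>
      <modification position="internal" offset="7" type="X"/>
    </oligo>
    """

    def rule_max_internal_mods(oligo, limit=4):
        n = sum(1 for m in oligo.findall("modification")
                if m.get("position") == "internal")
        return n <= limit, f"{n} internal modifications (limit {limit})"

    def rule_length(oligo, max_len=120):
        seq = oligo.findtext("sequence", "").strip()
        return len(seq) <= max_len, f"length {len(seq)} (limit {max_len})"

    # In the described system, such rules would be fetched on demand
    # from a knowledge server; here they are simply a local list.
    RULES = [rule_max_internal_mods, rule_length]

    def validate(oml_text):
        oligo = ET.fromstring(oml_text)
        failures = []
        for rule in RULES:
            ok, msg = rule(oligo)
            if not ok:
                failures.append(msg)
        return failures

    print(validate(OML_EXAMPLE) or "manufacturable")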
 
 


284. Iobion MicroPoint Curator, a complete database system for the analysis and visualization of microarray data (up)
Jason Goncalves, Iobion Informatics;
Terry Gaasterland, Rockefeller University;
Joe Sorge, Iobion Informatics;
jgonca@iobion.com
Short Abstract:

 MicroPoint Curator is an enterprise-level solution for microarray expression analysis that enables researchers to store, analyze, and visualize gene expression data. It stores a complete annotation of experiments based on MIAME and offers multiple normalization and analysis methods, including hierarchical clustering, k-means clustering, PCA and multi-dimensional scaling. 

One Page Abstract:

 The Iobion MicroPoint Curator is an affordable enterprise-level solution for microarray expression analysis modeled after the TANGO system developed in the laboratory of Terry Gaasterland. MicroPoint Curator enables researchers to store, analyze, and visualize large quantities of gene expression data obtained from DNA microarrays. MicroPoint Curator stores all raw microarray data, including 16-bit TIFF microarray images, in a fully relational database. The database holds a complete annotation of all imported microarray experiments based on the MIAME standards and is compliant with the microarray data XML exchange standard (MAML). 

MicroPoint Curator addresses data analysis issues specific to spotted microarrays, including filtering of dust spots or regions with printing imperfections. Users may drill down to spot images at any point in the analysis, flagging artifacts and outliers throughout the process. Several normalization methods are supported, including intensity-based normalization, log-ratio-based normalization, normalization with housekeeping or with exogenous control genes, linear-regression-based normalization, and grid-by-grid (print-tip) normalization. Processed microarray data can be clustered and visualized using MicroPoint Curator; currently supported methods include hierarchical clustering, k-means clustering, principal component analysis, multi-dimensional scaling and CAST clustering. MicroPoint Curator also provides embedded links to public gene annotation sources and Gene Ontology (GO) annotations, so that functional gene annotation is readily available. 

MicroPoint Curator is one of several applications on the Iobion bioinformatics server appliance, the Iobion Sapphire. The Iobion Sapphire is an Intel-based hardware system that comes with Linux, the Apache Web server, a full-featured relational database, the R statistical language and a comprehensive suite of bioinformatics tools and databases preinstalled. The system serves these scientific applications to Web clients on a local intranet or over the Internet. 


285. Rapid IT prototype to support macroarray experiments (up)
Grombacher, Thomas, Karch, Oliver, Wilbert, Oliver Maria, Toldo, Luca, Merck KGaA Germany, Bio- and Chemoinformatics;
thomas.grombacher@merck.de
Short Abstract:

 We have used standard methods to design a rapid prototype for handling data from macroarray experiments. Our strategy involves a web-based solution for data uploading, data storage in an RDBMS, and querying and representing data in SRS6. Data representation is also supported by graphics applets (eSuite 2.0) and 2D plots.

One Page Abstract:

 Rapid IT prototyping is needed to support biologists in handling large sets of data. We have built a solution for handling macroarray expression data by using, and efficiently combining, standard methods for the management of biological data. Specifically, we combined intranet technology for interaction with the users, data storage in a relational data warehouse, and data representation in SRS6. 

Image processing was performed by the biologists with the AIDA software (RAYTEST GmbH). Raw data and all data pertaining to the experiment were exported as a tab-delimited file. The intranet is used for uploading the data and supplying experimental details via a CGI fill-in form. After storage in the RDBMS, standard statistical procedures were used for cleaning up and evaluating the data: background correction, normalization to the mean filter value, and calculation of log2 variations. We also clustered similarly expressed genes with the k-medoids method implemented in CLARA (Clustering LARge Applications). CLARA uses the maximal average silhouette width as the criterion to choose the number of clusters into which the data are partitioned. All results were stored back into the database.
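
A minimal sketch of those clean-up steps, with invented numbers (the production pipeline runs against the RDBMS rather than in memory):

    # Minimal sketch of the steps described above (background
    # correction, normalization to the filter mean, log2 ratios);
    # all values are invented for illustration.
    import numpy as np

    raw = np.array([1200.0, 850.0, 5400.0, 300.0])       # spot intensities
    background = np.array([150.0, 140.0, 160.0, 145.0])  # local background

    corrected = np.clip(raw - background, 1.0, None)  # avoid log of <= 0
    normalized = corrected / corrected.mean()         # mean filter value = 1

    tissue_a = normalized
    tissue_b = np.array([0.8, 1.1, 2.5, 0.4])   # second filter, same genes
    log2_ratio = np.log2(tissue_a / tissue_b)   # per-gene log2 variation
    print(np.round(log2_ratio, 2))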

As the solution for querying and representing the processed data we chose the SRS6 system (LION Biosciences AG), an industry standard for this task. The SRS internal representation of the data was extended with graphics from external applets and GIFs. We used eSuite 2.0 (Lotus) applets for the graphical representation of expression data, and visualized the original filter spots as GIFs, which are stored in the database as well. The data representation in SRS is supplemented by 2D plots generated from the expression values of genes in two different tissues. This plot is built via a direct call to the database, also using a CGI web interface. 


286. EXProt - a database for EXPerimentally verified Protein functions. (up)
Frank H.J. van Enckevort, Björn M. Ursing, Jack A.M. Leunissen, Centre for Molecular and Biomolecular Informatics (CMBI), Nijmegen, The Netherlands.;
Roland J. Siezen, NIZO food research, Ede, The Netherlands.;
frankve@cmbi.kun.nl
Short Abstract:

 EXProt (database for EXPerimentally verified Protein functions) is a new database containing protein sequences for which the function has been experimentally verified. EXProt Release 1.1 is a selection of 4351 entries from the Pseudomonas Community Annotation Project and the prokaryotic section of the EMBL nucleotide sequence database. (http://www.cmbi.kun.nl/EXProt). 

One Page Abstract:

 EXProt (a database for EXPerimentally verified Protein functions) is a new non-redundant database containing 4351 protein sequences for which the function has been experimentally verified. It is a selection of 375 entries from the Pseudomonas Community Annotation Project (PseudoCAP, http://www.pseudomonas.com) and 3976 entries from the prokaryotic section of the EMBL nucleotide sequence database, Release 66 (http://www.ebi.ac.uk/embl/). The entries in EXProt all have a unique ID number and provide information about organism, protein sequence, functional annotation, a link to the entry in the original database, and, if known, the gene name and links to references in PubMed. The EXProt database will be extended to include more genome databases and topic-specific databases. Next to be included are proteins from GenProtEC (http://genprotec.mbl.edu). The EXProt web page (http://www.cmbi.nl/EXProt/) provides a further description of the database and search tools (blastp & blastx). The EXProt entries are indexed in SRS6 at CMBI, Nijmegen, The Netherlands, and can be searched by keyword (http://www.cmbi.kun.nl/EXProt/srs/). The authors can be contacted by email (EXProt@cmbi.kun.nl). 


287. HeSSPer: a program to analyze sequence alignments and protein structures derived from the HSSP database (up)
Georgios Pappas Jr., Universidade Católica de Brasília - Brazil;
gpappas@pos.ucb.br
Short Abstract:

 HeSSPer is a Java-based software package that parses and analyzes HSSP files. It provides a visual environment that assists in integrating amino acid conservation data with three-dimensional structure information, helping to understand the network of atomic contacts important for the maintenance of particular folding patterns.

One Page Abstract:

 HeSSPer is a Java-based software package that parses and analyzes files from HSSP (Homology-derived Secondary Structure of Proteins), a database of multiple sequence alignments for each of the proteins with known structure in the Protein Data Bank (PDB). HeSSPer provides a rich visual environment aimed at in-depth studies of the correlations between patterns of sequence conservation and protein structure. Among its capabilities, the program offers a series of visualization tools for gathering detailed information about each individual protein in the multiple alignment, plots of sequence conservation measures, sequence logos, and direct pointers to other relevant databases. It also integrates and controls helper programs to display and manipulate sequence alignments (Jalview) as well as three-dimensional structures (RasMol). Atomic contacts between residues receive special attention and are displayed with two new representations, graphical contacts and contact trees, which permit rapid identification of critical contacts based on the values of sequence conservation. In summary, HeSSPer is a tool that assists in integrating the amino acid conservation of a particular protein family with the available three-dimensional structure information, helping to understand the network of atomic contacts important for function or structural maintenance. 


288. Protein Structure extensions to EMBOSS (up)
Mr. Ranjeeva D. Ranasinghe, Dr. Jon C. Ison, Dr. Alan J. Bleasby, UK MRC HGMP Resource Centre;
rranasin@hgmp.mrc.ac.uk
Short Abstract:

 EMBOSS is a free Open Source software package developed for biological sequence analysis. Our recent incorporations provide new software and databases for three-dimensional structures of proteins. We address the need for consistent and highly parsable sources of coordinate and domain data. These databases and related software are publicly available. 

One Page Abstract:

 The European Molecular Biology Open Software Suite (EMBOSS) is a free Open Source software package developed for biological sequence analysis. The software automatically handles a variety of data formats and even allows transparent retrieval of sequence data from the web. Also, as extensive libraries are provided with the package, other scientists can develop and release software under the GNU open source software license. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.

Until recently, all EMBOSS programs were for the analysis of nucleic acid and protein sequences. Our recent additions provide new software and databases for three-dimensional structures of proteins. For example, we address the need for consistent and highly parsable sources of coordinate and domain data by providing the following: a database of "cleaned-up" protein coordinate data in EMBL-like format using a residue numbering scheme; two databases of clean coordinate data for individual SCOP domains, in EMBL-like format and PDB format respectively; and the SCOP classification in EMBL-like format. These databases and the related software are publicly available. 

Other software provided includes programs for calculating residue-residue contact site data, algorithms for removing redundancy in the SCOP sequence database, and wrappers for existing programs and packages. For instance, a STAMP wrapper can generate structural alignments for each SCOP family, and a wrapper for PSI-BLAST can use these alignments for searches of a protein sequence database. Software to interrogate SWISS-PROT by keyword and integrate the search results with those of the PSI-BLAST searches is also provided.
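
As a sketch of what a residue-residue contact calculation involves - using one common definition (any heavy-atom pair within 4.5 Angstroms), which is not necessarily the criterion used by the EMBOSS programs:

    # Sketch of a common contact definition; the actual EMBOSS contact
    # program may use different criteria. Adjacent residues are skipped
    # as a simplification (valid only for consecutive IDs in one chain).
    import math

    def parse_atoms(pdb_text):
        """Collect heavy-atom coordinates per residue from ATOM records."""
        residues = {}
        for line in pdb_text.splitlines():
            if line.startswith("ATOM") and line[76:78].strip() != "H":
                res_id = (line[21], int(line[22:26]))   # chain, residue no.
                xyz = (float(line[30:38]), float(line[38:46]),
                       float(line[46:54]))
                residues.setdefault(res_id, []).append(xyz)
        return residues

    def in_contact(res_a, res_b, cutoff=4.5):
        for a in res_a:
            for b in res_b:
                if math.dist(a, b) <= cutoff:
                    return True
        return False

    def contact_pairs(residues):
        ids = sorted(residues)
        return [(a, b) for i, a in enumerate(ids) for b in ids[i + 2:]
                if in_contact(residues[a], residues[b])]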
 
 


289. GENIA corpus: A Semantically Annotated Corpus in the Molecular Biology Domain (up)
Tomoko OHTA, Department of Information Science, Graduate School of Science, University of Tokyo;
Yuka TATEISI, CREST, Japan Science and Technology Corporation;
Sang Zoo LEE, Korea University;
Jun-ichi TSUJII, Department of Information Science, Graduate School of Science, University of Tokyo;
okap@is.s.u-tokyo.ac.jp
Short Abstract:

 We have built a corpus of annotated abstracts obtained from the MEDLINE database. We have already annotated 1,000 abstracts with 36 different semantic classes. In this poster, we report on this new corpus, its ontological basis, our annotation scheme, and statistics on its annotated objects.

One Page Abstract:

 Introduction:

Corpus annotation is now a key topic for all areas of natural language processing (NLP) and information extraction (IE) that employ supervised learning. With the explosion of results in molecular biology there is an increased need for IE to extract knowledge to support database building and to search intelligently for information in online journal collections. To support this, we have built a corpus of annotated abstracts obtained from the MEDLINE database [2,3]. We have already annotated 1,000 abstracts with 36 different semantic classes and are working to increase the number of abstracts to 3,000. In this poster, we outline the features of this new corpus, its ontological basis, our annotation scheme, and statistics on its annotated objects.

Ontological basis and annotation scheme:

The task of annotation can be regarded as identifying and classifying the names that appear in the texts according to a pre-defined classification. For a reliable classification, the classification must be well-defined and easy to understand by the domain experts who annotate the texts. To fulfill this requirement, we built a conceptual model (ontology) of substances and sources (substance locations). In this ontology, we classify substances according to their chemical characteristics rather than their biological role. We have marked up the names of PROTEINs, DNAs, RNAs, SOURCEs, and OTHERs that appear in the abstracts in GPML/XML format [4]. These names are considered to be relevant to the description of biological processes, and recognition of such names is necessary for understanding higher-level 'event' knowledge. 
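
For illustration, counting class annotations in a toy sentence might look like the following; the element and attribute names are assumed here and may differ from the actual GPML markup:

    # Toy example of class-annotated text; the element names below are
    # assumed for illustration and may differ from the real GPML tags.
    import xml.etree.ElementTree as ET
    from collections import Counter

    sentence = ("<sentence>Activation of <cons sem='PROTEIN'>NF-kappa B"
                "</cons> requires the <cons sem='DNA'>kappa B site</cons>"
                " in <cons sem='SOURCE'>T cells</cons>.</sentence>")

    root = ET.fromstring(sentence)
    counts = Counter(c.get("sem") for c in root.iter("cons"))
    print(counts)   # Counter({'PROTEIN': 1, 'DNA': 1, 'SOURCE': 1})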

Statistics:

We have annotated 1,000 abstracts related to transcription factors in human blood cells. We have marked up around 32,000 names with 36 different semantic classes. Around 9,500 proteins, 3,500 DNAs, 400 RNAs, 7,000 sources, and 11,600 others are marked up.

Conclusion:

We have built a semantically annotated corpus. The GENIA corpus is useful as a training set for programs that recognize biological names and terms. The corpus can also be used to learn how the tagged names are related to each other and to other names, in order to give feedback to the annotators, enhance the ontology, and enable us to annotate richer information such as biological roles.

References:

[1] Y. Ohta, et al., Automatic construction of knowledge base from biological papers, Proc. of ISMB-97, pp. 218-225, 1997.

[2] T. Ohta, et al., A Semantically Annotated Corpus from MEDLINE Abstracts, In Genome Informatics, Universal Academy Press, Inc., pp. 294-295, 1999.

[3] Y. Tateisi, et al., Building an Annotated Corpus in the Molecular-Biology Domain, In Proc. COLING 2000 Workshop on Semantic Annotation and Intelligent Content, pp. 28-34, 2000.

[4] GENIA project: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/


290. Talisman - Rapid Development of Web Based Tools for Bioinformatics (up)
Tom Oinn, EMBL-EBI;
tmo@ebi.ac.uk
Short Abstract:

 Talisman is an XML-based Rapid Application Development tool intended to enable non-programmers to create sophisticated web-based tools. Platform-independent and arbitrarily extensible, it acts as a common layer between users and diverse systems such as databases, SRS installations and sequence analysis tools.

One Page Abstract:

 A common problem that we have had to deal with at the EBI is how to provide users who have little or no computing background with access to the data (commonly held in relational databases) that they wish to work with. Such data storage systems generally do not provide a non-programmer-friendly interface to their data, and therefore require the construction of additional tools for end users. Until recently, each of these tools was written on a case-by-case basis.

Talisman is an attempt to create a system that allows us to develop these tools in a fraction of the time it would take to create them from scratch. The system defines a language used to create pages; this language, containing as it does definitions of the content and behaviour of the system, is semantically equivalent to the minimum specification that a programmer would have required to create the system from nothing. A Talisman page definition can therefore be regarded as a formalised project specification.

Talisman has already been deployed for various applications related to the InterPro database at the EBI (www.ebi.ac.uk/interpro). The typical time to create a system using Talisman is approximately twenty times less than to create the system from scratch, and this has allowed us to provide our biologists with a range of sophisticated and easy to use tools.

Talisman is entirely written in Java, and will compile and run on any platform that has a compliant JVM and Servlet engine. This means that installing Talisman can be as simple as copying a single file into your servlet engine and pressing the restart button. The extensible nature of Talisman prevents it from being locked into any given field, and efforts are currently underway to integrate it into various other framework systems such as AppLab and BioJava DAS servers.

Talisman is released under the LGPL license, and may be found at http://golgi.ebi.ac.uk/talisman.


291. The Eukaryotic Linear Motif Database ELM (up)
Rune Linding, Christine Gemuend, Sophie Chabanis, Toby Gibson, EMBL - Biocomputing Unit - Gibson Team;
linding@EMBL-Heidelberg.DE
Short Abstract:

 About any protein, the question is "What functional sites are in it?" Currently this question cannot be answered completely and reliably by bioinformatics. The ELM consortium will implement a novel resource for the prediction of functional sites. Effective prediction of short motifs requires implementation of new context-dependent filtering software.

One Page Abstract:

 About any protein sequence of interest, the question will be asked "What functional sites are in my protein?" Since eukaryotic proteins are highly modular and may have 10s or even 100s of domains, the question is often asked many times about the same protein. Although there may be sites/domains of hitherto unknown function, which will have to be revealed by experimentation, it is desirable to identify all the known types of functional module. There are many bioinformatics tools devoted to this end and, in favourable cases, it may be possible to get a good initial description of protein function. However, it is not possible to reliably answer the question using current resources.

Protein modules come in two generic categories: (1) larger folded globular domains and (2) small linear motifs that are often unstructured. The globular domains can be quite well detected by sensitive probabilistic methods such as HMMs (Hidden Markov Models) or profiles. However, statistically robust methods cannot usually be applied to small motifs, while pattern-based methods over-predict enormously: the few true motifs are lost amongst massive numbers of false positives. There are a number of existing web-accessible resources specialising in protein modules or functional motifs: The PFAM, SMART and PROSITE servers are excellent tools to find globular domains in a protein. PROSITE is also the most important resource for patterns corresponding to both dispersed and linear motifs, whilst PSORT focuses on a restricted set of motifs that are diagnostic for cell compartmentalisation. 

Since linear motifs are both statistically insignificant and prone to massive over-prediction, simply detecting matches in sequences has almost no predictive power. Probably for this reason (PROSITE excepted) there has been much less activity in assembling databases of linear motifs, as compared to the plethora of globular domain databases. Simply collecting the data is insufficient to provide a useful facility. One approach to detecting linear motifs may be to employ context-based discriminatory rules to filter out many of the false positives, so that a small number of plausible candidates remain to be followed up. For example, a candidate motif can be excluded from further consideration if the motif is buried in the core of a globular domain or if the protein resides in the wrong cellular compartment. It follows therefore that the software tools are just as important as the data collection when it comes to detecting linear motifs in sequences. Indeed we would argue that it is only worth assembling the data if the tools to use it are in place. Linear motifs have been neglected in state-of-the-art bioinformatics.

The purpose of the ELM application is to redress this neglect by putting in place the tools needed to detect linear motifs in proteins and to apply these tools in the prediction of functional sites in proteins. Linear motifs come in variable lengths and conservation patterns. Therefore, no single pattern description method is optimal for describing the set of linear motifs. ELM will use the most appropriate pattern descriptor for each motif, choosing from among: (1) Exact match; (2) Regular expression; (3) Weight matrix; and, if needed, (4) HMM for the most complex motifs. One problem faced by ELM is that motifs may have diverged in different eukaryotic species. For example the KDEL motif is highly conserved in the animals but shows variability in the fungi, being HDEL in baker's yeast, DDEL in the quite closely related Kluyveromyces lactis and ADEL in fission yeast. The ELM resource will ensure that this type of variation can be efficiently presented to the user, who will most often be interested in only a subset of eukaryotes.
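
As a minimal illustration of a regular-expression descriptor combined with one simple context rule - here, requiring the match at the C-terminus, the context in which ER-retention signals of the KDEL family act - covering the species variants named above:

    # Minimal sketch of a regular-expression motif descriptor plus a
    # context filter; the pattern covers the variants named above
    # (KDEL, HDEL, DDEL, ADEL) and accepts a match only at the
    # C-terminus of the sequence.
    import re

    ER_RETENTION = re.compile(r"[KHDA]DEL$")

    def find_motif(seq):
        m = ER_RETENTION.search(seq)
        return m.group() if m else None

    print(find_motif("MKTAYIAKQRHDEL"))  # 'HDEL' at the C-terminus
    print(find_motif("MKDELTAYIAKQRQ"))  # None: internal KDEL filtered out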


292. Information search, retrieval and organization relevant to molecular biology (up)
R. Kincaid, A. Vailaya, S. Handley, P. Chundi, N. Shechtman, B. Nardi, D. Moh, A. Kuchinsky, K. Graham, M. Creech, A. Adler, Life Science Informatics, Systems Solutions Lab, Agilent Technologies;
robert_kincaid@agilent.com
Short Abstract:

 Data relevant to molecular biology can now be found in many different places and different forms. This poster describes recent work directed at novel approaches for acquiring, organizing and integrating such disparate data, in order to maximize its usefulness to molecular biologists working in disease or drug discovery.

One Page Abstract:

 The explosive, simultaneous growth of both information technology and high-throughput genomics has led to a situation where enormous quantities of data relevant to the life sciences can be found in a variety of locations and in a variety of forms. This information may be contained in a relational database, a flat file, an abstract, a publication, an HTML document, etc. Such information may be on a public web server, in a proprietary database, or even on a researcher's personal web page. With this plethora of data, two key problems arise: finding the information at all when it is potentially scattered across many different sources, and navigating through the information found to select what is relevant to the particular question at hand. This poster describes recent work on novel approaches to both acquiring such disparate data and organizing it to facilitate its use by molecular biologists working in disease or drug discovery. Applying extensible meta-search technology allows data to be found in a variety of different sources. Following this data acquisition with well-known data-mining algorithms permits the results of meta-searches to be organized and more easily navigated and understood. Finally, we integrate the retrieved information with further organization tools to assist in manually curating the information into a complete overview of biological relevance and function.


293. Los Alamos National Laboratory Pathogen Database Systems (up)
Christian V. Forst, Staff, Bioscience Division, Los Alamos National Laboratory;
chris@lanl.gov
Short Abstract:

 Second-generation genome databases at Los Alamos contain information about biological threat and STD pathogens. These completely annotated genomes include information on repeats, ABC transporters, pathways (and pathway comparison), proteome comparison and phylogenies. The databases integrate standard bioinformatics tools with in-house developed software. Public access is available at http://www.cbnp.lanl.gov and http://www.stdgen.lanl.gov.

One Page Abstract:

 Among the several specialized sequence databases at Los Alamos are second-generation, curated databases that contain molecular sequence information about biological threat pathogens and pathogens related to sexually transmitted diseases. The relational database schema contains annotations for the complete genomes of bacterial and viral pathogens, and selected close relatives. These completely annotated genomes include a rich collection of references to experimental evidence and information on repeats, ABC transporters, pathways and pathway comparison, proteome comparison, and phylogenies. The databases integrate standard bioinformatics tools such as BLAST, Blocks, COG, ProDom, Pfam, KEGG and WIT with in-house developed software such as BugSpray for whole-genome comparison, and DisplayNet and PredPath for metabolic network prediction and comparative analysis. In 2001, genome sequences of oral pathogens will be included. Public access to the databases includes a wide range of search capabilities and user-friendly tables and graphics, available at http://www.cbnp.lanl.gov and http://www.stdgen.lanl.gov.


294. Facilitating knowledge acquisition and discovery through automatic web data mining (up)
Florence HORN, Hinrich SCHÜTZE, Dpt of Cellular and Molecular Pharmacology, UCSF, San Francisco;
Emmanuel BETTLER, Gerrit VRIEND, Center of Molecular and Biomolecular Informatics, University of Nijmegen, The Netherlands;
Fred E. COHEN, Dpt of Cellular and Molecular Pharmacology, UCSF, San Francisco;
horn@cmpharm.ucsf.edu
Short Abstract:

 The goals of the GPCRDB and NucleaRDB databases are to collect, provide and harvest heterogeneous information on GPCRs and nuclear receptors. The bottleneck in database maintenance is data acquisition. We present our work on automated data collection. In particular, we describe our methodology to extract mutation data from Medline abstracts. 

One Page Abstract:

 The amount of genomic and proteomic data that is entered each day into databases and the experimental literature is outstripping the ability of experimental scientists to keep pace. Consequently, there is a need for specialized databases that collect and organize data around one topic or one class of molecules. These systems are extremely useful for experimental scientists because they can access a large fraction of the available data from one single source.

 We are involved in the development of two such databases with substantial pharmacological relevance. These are the GPCRDB and NucleaRDB information systems that collect and disseminate data related to G protein-coupled receptors and intra-nuclear hormone receptors, respectively. The GPCRDB was a pilot project aimed at building a generic molecular class-specific database capable of dealing with highly heterogeneous data. The NucleaRDB was started last year as an application of the concept for the generalization of this technology. The GPCRDB is available via the WWW at http://www.gpcr.org/7tm/ and the NucleaRDB at http://www.receptors.org/NR/.

 The major bottleneck in database maintenance is data acquisition. Databases can only provide the user with information that has been entered and indexed into a computer file. Unfortunately, data deposition is only obligatory for sequences and three-dimensional atomic coordinates. All other experimental data has to be manually extracted from the literature and entered into databases by data managers and curators.

 Consequently, our project is to develop a methodology to automatically extract experimental data such as mutation data, ligand binding information, expression data, etc., by having computer software electronically read articles. This could potentially lead to discoveries that were not possible because the knowledge was spread out and buried in the literature. We chose to focus on mutation data for nuclear receptors. This poster shows how we automatically capture heterogeneous information from different databases, and in particular, the methodology we use to select Medline abstracts, analyze their contents and extract mutation information.
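
As a first-pass illustration of mutation extraction (far simpler than the methodology described here), a regular expression over one- and three-letter mutation mentions might look like:

    # Simple pattern-based sketch for spotting point-mutation mentions
    # in abstract text; the poster's actual methodology is more
    # elaborate, and this regex is only a first-pass illustration.
    import re

    AA1 = "ACDEFGHIKLMNPQRSTVWY"
    AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|"
           "Pro|Ser|Thr|Trp|Tyr|Val")

    MUTATION = re.compile(
        rf"\b(?:[{AA1}]\d+[{AA1}]|(?:{AA3})\d+(?:{AA3}))\b")

    text = ("The C619R substitution abolished ligand binding, "
            "whereas Gly400Ser had no effect.")
    print(MUTATION.findall(text))   # ['C619R', 'Gly400Ser']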


295. Functional analysis of novel human full-length cDNAs from the HUNT database at Helix Research Institute (up)
Henrik T. Yudate, Ph.D, Helix Research Institute;
Makiko Suwa, Electrotechnical Laboratory (ETL), Tsukuba, Japan;
Ryotaro Irie, Helix Research Institute;
Hiroshi Matsui, RIKEN Genomic Sciences Center, Japan;
Tetsuo Nishikawa, Yoshitaka Nakamura, Helix Research Institute;
Daisuke Yamaguchi, Hitachi Software Engineering Co., Ltd.;
Zhang Zhipeng, Tomoyuki Yamamoto, Keiichi Nagai, et al., Helix Research Institute;
yudate@hri.co.jp
Short Abstract:

 Helix Research Institute, a joint research project principally funded through The Japan Key Technology Center, has developed a high-throughput system for cloning of human full-length cDNAs. Here we describe latest developments from the Bioinformatics Department for in silico analysis of HUman Novel Transcripts and release via our HUNT database: http://www.hri.co.jp/HUNT/

One Page Abstract:

 Helix Research Institute, Inc. (HRI), a joint research project principally funded through The Japan Key Technology Center, has developed a high-throughput system for cloning and sequencing of human full-length cDNAs and for identifying gene function. The clones have been sequenced in the NEDO Human cDNA Sequencing Project and released via the DNA Data Bank of Japan. Here we describe the latest developments from the HRI Bioinformatics Department for in silico analysis and functional annotation of these HUman Novel Transcripts.

Currently, we have carried out in silico analysis of 4996 full-length cDNA sequences and released this information in the publicly available HUNT database (http://www.hri.co.jp/HUNT; Yudate HT, et al., 2001). Protein sequences have been predicted for these full-length cDNA sequences using our in-house ATGpr software. A blastp sequence similarity search against the non-redundant protein sequence database nr and the Swiss-Prot database reveals that a large fraction of these novel proteins have little similarity to any proteins of known function. We find that the HUNT database contains more than 2500 truly novel and uncharacterized proteins, because hits from the similarity search are in large part to hypothetical proteins.

However, it is still possible that tertiary structures of possibly related family proteins have been determined despite low sequence similarity, and to find these we thread the sequences onto a set of library protein structures with the THREADER fold recognition software. These calculations are CPU-time consuming, but at HRI we have dedicated several CPUs to continuous computation of novel sequences and have analyzed more than 1300 in this way. As a result, tertiary structure candidates are listed in the HUNT database for a considerable number of entries, and these serve in many cases as the only important source of information as to the function of the corresponding proteins. The hope is that the functional annotation of candidate structures can be transferred to the query sequences, and one way to assess the predictions is to obtain complementary information from well-established sequence analysis programs. Here we provide estimated localization by the PSORT program, and we also include secondary structure predictions from PREDATOR and the CHAPERON sequence analysis software, and results from a search for Pfam, PRINTS-S, and PROSITE sequence patterns, again to support the fold recognition results where possible.

Validation of the individual structure assignments from THREADER can also be obtained from the independent GENIUS results, which is a sophisticated intermediate sequence-search system also for structure assignment. The reliability of any prediction increases when the results from these two fundamentally different approaches coincide. The sequences which are judged to have the most reliable structure assignments are clustered according to the functional annotation of the assigned structures, and representative examples will be given. All sequence data and analysis results will be available from the HUNT home page at http://www.hri.co.jp/HUNT

Reference: Yudate HT, Suwa M, Irie R, et al., Nucleic Acids Research 2001 Vol. 29 No. 1 pp. 185-188 


297. iSPOT and MINT: a method and a database dedicated to molecular interactions (up)
M. Helmer-Citterich, Brannetti, B., Zanzoni, A., Via, A., Ferre', F., Montecchi-Palazzi, L., Cesareni, G., Helmer-Citterich, M., Dept. Biology - University of Rome Tor Vergata;
citterich@uniroma2.it
Short Abstract:

 We present the SPOT method for the inference of protein domain specificity and a new database of Molecular INTeractions (MINT). SPOT was developed using the SH3 domain as a model system and has now been applied to PDZ domains and to MHC class I molecules.

One Page Abstract:

 iSPOT (iSpecificity Prediction Of Target) is a web tool developed to infer the protein-protein interaction mediated by families of peptide recognition modules. The SPOT procedure (Brannetti et al., 2000) utilizes information extracted, for each protein domain family, from position-specific contacts derived from all the available domain/peptide complexes of known structure. The framework of domain/peptide contacts defined on the structure of the complexes is used to build a residue/residue interaction database derived from ligands obtained by panning peptide libraries displayed on filamentous phage. The method is being optimised with a genetic algorithm and will soon be available on the web. It is available now for SH3 and PDZ domains and for MHC class I molecules. iSPOT will offer the possibility to answer the following questions: which protein (or peptide) is a possible ligand for a given SH3 (or PDZ or MHC class I molecule)? Which is the best possible SH3 (or PDZ, or MHC class I) interacting domain for a given protein/peptide sequence? What residues should one mutate in a domain to lower/increase its affinity for a given peptide ligand? 
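
A toy sketch of the position-specific contact scoring idea follows; the contact framework and pair potentials below are invented placeholders, not SPOT's actual data:

    # Hedged sketch of scoring a domain/peptide pair from position-
    # specific contacts, in the spirit of the SPOT procedure; the
    # contact map and pair potentials are invented placeholders.
    CONTACTS = [(0, 2), (3, 0), (5, 1)]  # (domain position, peptide position)

    # Log-odds-style pair potential for (domain residue, peptide residue);
    # unlisted pairs fall back to 0 (neutral).
    PAIR_SCORE = {("Y", "P"): 1.2, ("W", "P"): 0.9, ("D", "R"): 1.5,
                  ("E", "K"): 1.1, ("Y", "L"): -0.4}

    def spot_score(domain_seq, peptide):
        total = 0.0
        for dpos, ppos in CONTACTS:
            total += PAIR_SCORE.get((domain_seq[dpos], peptide[ppos]), 0.0)
        return total

    # Rank candidate peptides for one domain by the summed contact score.
    candidates = ["RPPLP", "APALP", "GGGGG"]
    print(sorted(candidates, key=lambda p: -spot_score("YQKDAW", p)))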

MINT (Molecular INTeractions) is a relational database built to collect and integrate protein interaction data in a single database accessible via a user-friendly web interface. MINT currently contains experimentally determined protein-protein interaction data. In the near future, MINT will be enriched with protein-DNA and protein-RNA interactions and will also allow the collection of peptide lists selected from molecular repertoires, such as those resulting from phage display experiments. Moreover, we plan to add information about interactions inferred using computational predictive methods. Curators manually submit the interactions. MINT is an SQL database, and the web interface is written in an HTML-embedded language, PHP (Hypertext Preprocessor).


298. An Integrated Sequence Data Management and Annotation System For Microbial Genome Projects (up)
Ki-Bong Kim, Department of Computer Engineering, Chungnam National University, Daejeon, Korea (ROK);
Hwajung Seo, Hyeweon Nam, Hongsuk Tae, Pan-Gyu Kim, Dae-Sang Lee, Information and Technology Institute, SmallSoft Co., Ltd., Daejeon, Korea;
Haeyoung Jeong, GenoTech Corp., Daejeon, korea;
Kiejung Park, Information and Technology Institute, SmallSoft Co., Ltd., Daejeon, Korea;
kbkim@comeng.chungnam.ac.kr
Short Abstract:

 We have developed an efficient data management and annotation system customized for microbial genome projects. The system consists of six main components: local databases, infra-databases, a contig assembly program, essential analysis programs, various utilities, and window-based graphical user interfaces. The components are tightly coupled with one another.

One Page Abstract:

 Microbial genome sequencing projects are producing a deluge of sequence data and related information, which requires a systematic and automatic processing system for efficient data management and annotation. In this context, we have developed an efficient data management and annotation system customized for microbial genome projects; the system provides systematic and automatic approaches to data collection, retrieval, processing, analysis and annotation. The system consists of six main components: local databases, infra-databases, a contig assembly program, essential analysis programs, various utilities, and window-based graphical user interfaces. The local databases are a repository for all data, from raw data to annotated data. Feedback control in the local databases makes it possible to update all related data like dominoes. In addition to the local databases, the infra-databases include essential public databases such as GenBank, PIR, and SwissProt. MySQL is used as the DBMS. The contig assembly program includes two main modules of our own implementation: an assembly module and a trace viewer that verifies assembly results and base calling. Public analysis programs are incorporated into the analysis component, which also contains analysis programs of our own making. The analysis component can be categorized into four main items: database search, homology search, ORF search, and signal pattern search. The components are tightly coupled with one another by means of various utilities behind the window-based GUI. The system has a client-server architecture in which window-based client programs on a PC call many programs running on the server side through a network connection. This system will be very helpful to genome researchers who need to efficiently manage and analyze their own bulky sequence data, from low-level to high-level.


299. A home-made implementation for bibliographic data management: application to a specific protein. (up)
Alessandro Pandini, Laura Bonati, Demetrio Pitea, Dipartimento di Scienze dell'Ambiente e del Territorio, Università degli Studi di Milano-Bicocca;
alessandro.pandini@unimib.it
Short Abstract:

 The development of a bibliographic-reference database on the Aryl hydrocarbon Receptor is presented. The basic goals were: implementing a PubMed-based resource, assuring an automatic updating service, adding user profiles, developing a web-based interface connected to the DBMS, generating user access, and web-publishing the project to attract contributions from other groups.

One Page Abstract:

 Scientific research on a specific topic can generate a remarkable quantity of bibliographic material that requires administration. A research group often has to deal with a huge number of papers, assure full accessibility to group members, exploit the richness of its background collection, and be able to quickly correlate new hypotheses with previously published work: a DBMS (DataBase Management System) can solve many of these problems. Additionally, an online database resource can fruitfully become a "virtual meeting point" for researchers on the topic. In order to satisfy a series of internal group requirements and to offer an online resource for colleagues, we designed, implemented and web-published a bibliographic reference database on the Aryl hydrocarbon Receptor. During the design phase we identified some basic goals: implementing a PubMed-based resource, assuring an automatic updating service, and adding user profiles and online access. We chose an x86 architecture with the Linux operating system for data and access management. Consistent with these choices, we developed the project on an Apache web server with the MySQL DBMS. Perl was chosen as the "glue system" to connect the project parts and generate dynamic HTML code for the web-based interface. During the implementation phase we collected bibliographic references directly from PubMed and automatically inserted them into the database, setting up the management system. Moreover, we developed a web-based interface connected to the DBMS. Consistent with the design phase, the final implementation features: an automatic retrieval system that queries the PubMed engine and gives notice of updates, both by e-mail and on login; a user-friendly graphical interface accessible via different web browsers; a powerful query system; and an output generator for text, HTML and BibTeX format files. Finally, we have set up user access with different levels of service and web-published the full project. We plan to extend the project to a thematic web site on the Aryl hydrocarbon Receptor, joining contributions from other groups, with the long-term goal of creating an online forum about this protein. 
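
For illustration, the automatic-update step could be sketched as follows against today's NCBI E-utilities interface (which postdates the system described here); SQLite stands in for the MySQL store:

    # Sketch of the automatic-update step using NCBI's public
    # E-utilities (a present-day interface; the poster's own retrieval
    # code predates it). SQLite stands in for the MySQL store.
    import sqlite3
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    def new_pmids(query, mindate, maxdate):
        params = urllib.parse.urlencode({
            "db": "pubmed", "term": query, "retmax": "200",
            "datetype": "edat", "mindate": mindate, "maxdate": maxdate})
        with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
            tree = ET.parse(resp)
        return [e.text for e in tree.findall(".//IdList/Id")]

    db = sqlite3.connect("ahr_refs.db")
    db.execute("CREATE TABLE IF NOT EXISTS refs (pmid TEXT PRIMARY KEY)")
    for pmid in new_pmids('"aryl hydrocarbon receptor"', "2001/01/01",
                          "2001/06/30"):
        db.execute("INSERT OR IGNORE INTO refs VALUES (?)", (pmid,))
    db.commit()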


300. Statistical structurization of 30,000 technical terms in medical textbooks: A non-ontological approach for computation of human gene functions. (up)
Tsutomu MATSUNAGA, Kenji FUKAISHI, Yasuhiro TANAKA, NTT DATA CORPORATION;
Takuro TAMURA, Iwao YAMASHITA, Yamaguchi, Hitachi Software Engineering Co., Ltd.;
Teruyoshi HISHIKI, JBIRC, National institute for Advanced Industrial Science;
Kousaku OKUBO, Institute for Molecular and Cellular Biology, Osaka University;
matunaga@rd.nttdata.co.jp
Short Abstract:

 We statistically structured 30,000 biomedical technical terms by their distribution patterns across medical textbooks for knowledge representation, using the subspace method of pattern recognition. The resultant structure was graphically represented for evaluation, and known human genes were located in the structure according to their annotations. Any cluster of genes can then be automatically annotated.

One Page Abstract:

 With the advent of techniques and machines for DNA and protein analyses, feature values for each gene/protein unit are being massively generated. Those values are being used to organize the gene/protein world in various ways, and establishing methods to make biomedical sense of the resultant structures is an urgent issue in predicting the functions of 'unknown genes'. Functional descriptions connected to 'known genes' in databases are the primary source of knowledge to be employed in making sense of such structures, but these descriptions are fully appreciated only by those who carry thousands of technical terms in mind in an organized form. To make machines use such knowledge for 'known genes', declarative approaches are straightforward, such as KEGG and Gene Ontology, where descriptions are manually translated into expressions with limited terminologies whose structure is provided by experts. The drawbacks of this approach are labor, possible bias, and the requirement for constant updates. To complement these drawbacks of declarative approaches, we have statistically structured 30,000 biomedical objects, represented by 55,000 biomedical terms, by their distribution patterns across more than 20,000 pages of 21 textbooks for medical education. We first constructed a graphical representation of the co-occurrence patterns of objects to evaluate the power of statistical approaches in expressing medical knowledge. Then we employed a subspace method of pattern recognition for sensitive calculation of relations across objects. Using publicly available gene annotations, almost all known human genes were mapped onto this space of primitive words. Any cluster of genes derived from observed feature values can then be automatically annotated by the ordering of technical terms in the neighborhood of its 'known gene' members. 
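
A toy illustration of the underlying idea - terms as vectors of page counts, related terms having similar distributions - follows; the counts are invented, and the actual subspace method is more sophisticated than the plain cosine similarity used here:

    # Toy illustration: each term becomes a vector of page counts, and
    # related terms end up with high cosine similarity. Counts invented.
    import numpy as np

    terms = ["insulin", "glucagon", "hemoglobin"]
    page_counts = np.array([   # rows: terms; columns: pages 1..6
        [4, 3, 0, 0, 1, 0],
        [3, 2, 0, 0, 0, 0],
        [0, 0, 5, 4, 0, 1],
    ], dtype=float)

    unit = page_counts / np.linalg.norm(page_counts, axis=1, keepdims=True)
    similarity = unit @ unit.T   # cosine similarity between all term pairs

    for i, t in enumerate(terms):
        nearest = max((j for j in range(len(terms)) if j != i),
                      key=lambda j: similarity[i, j])
        print(t, "->", terms[nearest],
              round(float(similarity[i, nearest]), 2))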


301. Searching for RNA Genes Using Base-Composition Statistics (up)
Peter Schattner, GencodeDecode Research;
schattner@alum.mit.edu
Short Abstract:

 The feasibility of using local single-base (GC%, G-C, A-T) and dinucleotide frequency variations for non-protein-coding RNA (ncRNA) gene finding was investigated. Significant frequency differences were found between ncRNAs and genomes and among ncRNAs. A search program based on GC% was developed and tested on M. jannaschii and C. elegans sequences.

One Page Abstract:

 BACKGROUND: RNA-gene-finding programs for non-protein-coding RNAs (ncRNAs) with well-characterized sequences have achieved considerable success. However, identifying less-well-characterized ncRNAs has been significantly less successful. Indeed, a recent paper (Bioinformatics 16:573-585, 2000) suggests that it may be impossible to use sequence and secondary structure alone to detect ncRNAs within newly sequenced genomes. In the same paper, it is noted that, for some genomes, it may be possible to use the local percentage of GC bases (GC%) as a filter to screen for ncRNA-rich regions. More generally, one might look for multiple variations of single- and/or dinucleotide base statistics as a signature of ncRNA-rich regions. The present work investigates the feasibility of this approach. Single-base and dinucleotide statistics for a variety of ncRNAs are compiled, compared to the genomic background, and applied to the task of screening for ncRNA-rich regions.

METHODS: ncRNA sequence data for tRNAs, rRNAs, snRNAs, srpRNAs, RNA pseudoknots and other small RNAs were obtained from public databases. For each ncRNA type, several base-composition statistics were computed: GC%; per-base G-C and A-T differences, e.g. (n(G) - n(C))/(n(G) + n(C)); and normalized dinucleotide frequencies, f(AB)/(f(A)*f(B)). Statistics were computed for complete chromosomes and 100 kb "isochores" taken from the M. jannaschii and C. elegans genomes. A simple ncRNA screening program based solely on local GC% values was also implemented. The program partitions a genomic region into two components: one (hopefully small) component with a high probability of containing ncRNAs, and another with a low probability of containing ncRNAs. The program was run against sequences of the M. jannaschii and C. elegans genomes, with results compared to RNA annotations in the GenBank ".gbk" and WormBase ".gff" database files.

RESULTS: For M. jannaschii, mean RNA GC% is 65.9% (+/-3.2) while overall genomic GC% is 31.4%. For C. elegans, mean RNA GC% is 52.5% (+/-9.5) while chromosomal GC% ranges from 34.7% (chromosome IV) to 36.3% (chromosome II). Significant GC% variations are observed both among ncRNA classes and between differing isochores on an individual chromosome. For example, in C. elegans, tRNA GC% is 58.8% (+/-3.9) while snRNA GC% is 40.2% (+/-4.8). On chromosome I, the mean GC% of 100 kb isochores is 36.0%, with a range from 33.3% to 40.3%. Average per-base G-C differences for M. jannaschii and C. elegans RNAs are 0.046 (+/-0.058) and 0.038 (+/-0.058) respectively, compared to the genomic values of 0.061 (+/-0.360) and 0.021 (+/-0.242). RNA per-base A-T differences are -0.054 (+/-0.12) and -0.135 (+/-0.12), while the genomic values are 0.023 (+/-0.186) and 0.014 (+/-0.195) for M. jannaschii and C. elegans chromosome I, respectively. Again, variations exist among RNAs; for example, for four known C. elegans snRNP RNAs, per-base G-C differences are 0.18 (+/-0.02). The most significant dinucleotide frequency variation was observed in M. jannaschii for CG base pairs, where the normalized RNA CG frequency is 0.75 (+/-0.25) while the genomic value is 0.34 (+/-0.46). The GC%-based filtering program identified a component containing less than 1% of the M. jannaschii genome that held all 43 ncRNAs annotated in the GenBank genomic .gbk file. In experiments over 1.5 Mb of C. elegans chromosome X, a component containing 5.3% of the sequence with 44 of 51 annotated ncRNAs was identified. 
DISCUSSION / CONCLUSION: The present work suggests that, at least for genomes with very high AT content such as M. jannaschii, one can identify small genomic regions rich in ncRNAs using GC% filtering alone. In most cases, however, GC% screening will probably need to be supplemented by tests for additional statistical signatures and/or conventional primary and secondary sequence structure motifs in order to successfully isolate potential ncRNAs. The feasibility of such multi-stage ncRNA detectors is currently being investigated.
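
A minimal sketch of the GC%-filtering step, with illustrative window size and threshold:

    # Sketch of the GC%-based screen described above: slide a window
    # along the genome and keep windows whose GC content stands out
    # from the genomic background. Thresholds are illustrative only.
    def gc_rich_windows(seq, window=100, step=50, threshold=0.50):
        """Yield (start, gc_fraction) for windows above the threshold."""
        seq = seq.upper()
        for start in range(0, max(1, len(seq) - window + 1), step):
            chunk = seq[start:start + window]
            gc = (chunk.count("G") + chunk.count("C")) / len(chunk)
            if gc >= threshold:
                yield start, round(gc, 3)

    # In an AT-rich genome such as M. jannaschii (31.4% GC overall),
    # a 50% threshold isolates a small, ncRNA-enriched component.
    genome = "AT" * 200 + "GCGGCCGCGG" * 20 + "TA" * 200
    print(list(gc_rich_windows(genome)))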
 
 


302. dna2hmm - a homology-based genefinder (up)
Betty Lazareva, Paul Thomas, Celera Genomics;
betty.lazareva@fc.celera.com
Short Abstract:

 Dna2hmm is an implementation of an algorithm that aligns genomic DNA to a protein HMM with sequencing error correction. The algorithm is designed to work with SAM HMMs, in contrast to GeneWise, which utilizes the HMMER framework. Tests show dna2hmm to be more accurate than GeneWise, though slower.
 
 

One Page Abstract:

 Dna2hmm is a software implementation of an algorithm that aligns genomic DNA to a protein HMM with sequencing error correction. The algorithm is specifically designed to work with SAM HMMs, in contrast to GeneWise, which works in the HMMER framework. Our approach involves full dynamic programming with an optimization function that is the sum of three components: the Viterbi alignment score of the predicted translation (which we refer to as the "SAM-like score"), a splicing score and, finally, a penalty for correcting errors in the DNA sequence. 

We tested our algorithm by scoring a UCSC test set of well-annotated genes against a set of SAM HMMs generated at varying levels of sequence similarity to the genes in the test set. For the comparison with GeneWise, we converted the SAM HMMs to HMMER format, preserving the SAM NULL model, which improved GeneWise performance. Our experiments show a high correlation between Viterbi protein scores and SAM-like scores of the predicted translation, which makes for a straightforward statistical interpretation of the alignment score. For GeneWise, we found the agreement between protein scores and predicted alignment scores to be much weaker. The accuracy of prediction at the base-pair level as well as at the exon level is also significantly higher for the dna2hmm algorithm than for GeneWise. However, in the current implementation, dna2hmm is about 3 times slower than the latest version of GeneWise. Detailed comparison and testing results are discussed in our presentation.

Because the SAM-like dna2hmm score correlates well with the score of a protein translation, dna2hmm score results can be directly compared to protein sequence comparison scores and derived distributions for estimating statistical significance.

 


303. DIGIT: a novel gene finding program that combines genefinders (up)
Tetsushi Yada, Human Genome Center, Institute of Medical Science, University of Tokyo;
Yasushi Totoki, Genomic Sciences Center, RIKEN;
Yoshio Takaeda, Mitsubishi Research Institute, Inc.;
Yoshiyuki Sakaki, Toshihisa Takagi, Human Genome Center, Institute of Medical Science, University of Tokyo;
yada@ims.u-tokyo.ac.jp
Short Abstract:

 We present here a general scheme for combining multiple genefinders. Using this scheme, we have developed a novel gene finding program named DIGIT, which combines FGENESH, GENSCAN and HMMgene. We present the results of benchmark tests and of the analysis of 2.7 billion bases of human genome sequence.

One Page Abstract:

 We have developed a novel gene finding program named DIGIT, which finds genes by combining existing genefinders. It is well known that the reliability of gene annotation is increased by combining multiple genefinders. However, two problems arise when combining genefinders: (1) how to ensure frame consistency between exons within a gene, and (2) how to take into account the exon scores given by the genefinders. We have addressed these problems by applying a hidden Markov model and a Bayesian procedure, and have implemented the scheme in DIGIT. Since our scheme provides a general framework for combining genefinders, DIGIT can combine most genefinders in a systematic manner. As well as presenting the detailed algorithm of DIGIT, we report here its prediction accuracy. DIGIT has been designed to combine FGENESH, GENSCAN and HMMgene, and its prediction accuracy has been assessed using three different data sets. For all data sets, DIGIT successfully discarded many false-positive exons predicted by the genefinders and yielded remarkable improvements in sensitivity and specificity at the gene level compared with the best gene-level accuracies achieved by any single genefinder. 
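
As a highly simplified illustration of score combination (DIGIT's actual scheme embeds this in an HMM that also enforces frame consistency, which is omitted here):

    # Highly simplified sketch of combining per-exon scores from
    # several genefinders via a naive-Bayes-style posterior; DIGIT's
    # real scheme is an HMM with frame-consistency constraints.
    import math

    def combined_posterior(probs, prior=0.5):
        """Naive-Bayes combination of independent per-exon probabilities."""
        log_odds = math.log(prior / (1 - prior))
        for p in probs:
            p = min(max(p, 1e-6), 1 - 1e-6)  # guard against 0 and 1
            log_odds += math.log(p / (1 - p))
        return 1 / (1 + math.exp(-log_odds))

    # Hypothetical exon candidate scored by three genefinders:
    print(round(combined_posterior([0.9, 0.8, 0.3]), 3))  # agreement helps
    print(round(combined_posterior([0.4, 0.5, 0.3]), 3))  # weak support drops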


304. GAZE: A generic, flexible tool for gene-prediction (up)
Kevin Howe, Richard Durbin, The Sanger Centre;
klh@sanger.ac.uk
Short Abstract:

 We have developed a gene-finding system called GAZE which allows gene-prediction data from multiple arbitrary sources to be integrated into complete gene-structures in a flexible and user-configurable manner. The system gains further flexibility by performing the integration over an XML model of gene-structure also supplied by the user. 

One Page Abstract:

 The ACEDB (http://www.acedb.org) package contains, as a sub-component, a gene-prediction tool which, like several other available gene-prediction programs, has as its final stage something akin to "exon assembly". During this phase, signal and content sensors are combined using dynamic programming to produce a "best guess" prediction of the genes within the region being considered. The model of a gene, in terms of its components and how they relate to each other, is hard-coded into the system. Originally, the parameters for the use of signal and content measures were also hard-coded, but this functionality has gradually been factored out as user-configurable. It remains the case, though, that there is no way to refine the model of gene structure over which assembly takes place without changing the code. Other gene-prediction programs that we know of suffer from the same sort of rigidity. 

We have developed a gene-prediction tool, GAZE, with the aim of addressing these inflexibilities. GAZE is a stand-alone gene-finding software system based upon an abstraction of the dynamic programming engine underlying the ACEDB gene-finding tool. It produces predictions of gene-structure by integrating arbitrary prediction information from multiple sources supplied by the user. This integration takes place over a model of gene-structure that is also completely defined by the user. 

GAZE is controlled by a user-specified "structure file" (in XML format), which contains both a state-based model of gene structure and information controlling precisely how the program should integrate the given gene-prediction data over that model. Both the gene-prediction data supplied to GAZE and its predictions of gene structure are in GFF format (www.sanger.ac.uk/Software/GFF).
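
The abstract does not reproduce the schema of a GAZE structure file; the fragment below is a made-up miniature in the same spirit (user-defined states plus allowed, signal-gated transitions), parsed with Python's standard library to show how such a user-supplied model could drive a generic assembler:

    # A made-up miniature "structure file"; GAZE's real XML schema differs.
    import xml.etree.ElementTree as ET

    MODEL = """
    <gene_model>
      <state name="intergenic"/>
      <state name="exon"/>
      <state name="intron"/>
      <transition from="intergenic" to="exon" signal="start_or_acceptor"/>
      <transition from="exon" to="intron" signal="donor"/>
      <transition from="intron" to="exon" signal="acceptor"/>
      <transition from="exon" to="intergenic" signal="stop"/>
    </gene_model>
    """

    root = ET.fromstring(MODEL)
    states = [s.get("name") for s in root.findall("state")]
    allowed = {(t.get("from"), t.get("to")): t.get("signal")
               for t in root.findall("transition")}
    print(states)
    print(allowed)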


305. An estimate of the total number of genes in microbial genomes based on length distributions
Marie Skovgaard, L. J. Jensen, S. Brunak, D. Ussery, A. Krogh, Center for Biological Sequence Analysis;
marie@cbs.dtu.dk
Short Abstract:

 In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes but rather ORFs that occur by chance. By comparing the length distributions of annotated genes and known proteins, we estimate the number of true protein-coding genes.
 
 

One Page Abstract:

 In sequenced microbial genomes, some of the annotated genes, usually marked in the public databases as hypothetical, are actually not protein-coding genes but rather open reading frames (ORFs) that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes.

We have plotted the length distribution of the annotated genes and compared it to the length distributions of those matching known proteins and those with no match to known proteins. These plots show that the length distribution of proteins with no known matches differs from that of proteins with matches to known proteins. Since the majority of the proteins with no known matches are short, this leads to the conclusion that too many short genes are annotated in many genomes.

Therefore we estimate the true number of protein coding genes for sequenced genomes by two different methods. 

The first estimate is based on the assumption that the fraction of long genes matching the SWISS-PROT database equals the fraction of all genes that match SWISS-PROT. To obtain an estimate of the true number of proteins in each organism, we used the proteins in the SWISS-PROT database as a reference. Since ORFs longer than 200 amino acids are unlikely to occur by chance in most organisms, the fraction of those matching SWISS-PROT was used as an estimate of the fraction of the total number of true proteins that match SWISS-PROT. The estimated number of genes is then easily obtained by dividing the total number of matching proteins by this fraction.

The second estimate is completely independent of database matches. The maximal number of non-overlapping open reading frames longer than 100 triplets was found, and the estimate of the true number of genes was obtained by reducing this number by the number of ORFs expected to occur at random. ORFs shorter than 100 triplets were excluded since relatively few genes that short are expected, and the estimate becomes ill-behaved because of the huge number of short ORFs. The approximation of the corrections is quite crude, but it serves as a control for the estimate based on alignment and SWISS-PROT.
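
In outline, and with invented counts, the two estimators reduce to a few lines of arithmetic:

    # Toy numbers, not values from the study.

    def estimate_by_database_fraction(n_matching_all, n_long, n_long_matching):
        """Estimate 1: long ORFs (>200 aa) are assumed to be almost all real,
        so the fraction of long ORFs hitting SWISS-PROT approximates the
        fraction of all real genes that hit SWISS-PROT."""
        fraction = n_long_matching / n_long
        return n_matching_all / fraction

    def estimate_by_orf_counting(n_orfs_over_100, n_expected_random):
        """Estimate 2: maximal non-overlapping ORFs longer than 100 triplets,
        reduced by the number expected to occur at random."""
        return n_orfs_over_100 - n_expected_random

    print(estimate_by_database_fraction(2500, 1800, 1500))  # -> 3000.0
    print(estimate_by_orf_counting(3400, 400))              # -> 3000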

Our estimates of the number of real protein-coding genes reduce the number of true proteins by 10-30% for the majority of microbial organisms. The two extremes are represented by M. genitalium, where the estimates are 1-5% lower, and A. pernix, where they are close to 50% lower.


306. Potential binding sites for PPARg in promoters and upstream sequences
Hubert Hackl, The Institute for Genomic Research, Rockville, MD, and Institute for Biomedical Engineering, Graz University of Technology, Graz, Austria;
Alexander Sturn, Institute for Biomedical Engineering, Graz University of Technology, Graz, Austria;
Vasiliki Michopoulos, National Institutes of Health, Bethesda, MD;
John Quackenbush, The Institute for Genomic Research, Rockville, MD;
Zlatko Trajanoski, Institute for Biomedical Engineering, Graz University of Technology, Graz, Austria;
hhackl@tigr.org
Short Abstract:

 We present the strategy and results of a search for binding sites in promoters to identify target genes for PPARg, a key player in adipogenesis. A novel position weight matrix was derived from experimentally verified PPAR binding sites, and an algorithm based on information content was applied.

One Page Abstract:

 Peroxisome proliferator-activated receptor gamma (PPARg) plays a key role in the differentiation of adipose tissue and is important in the adipose-specific expression of a number of genes. Given the centrality of this nuclear receptor in adipocyte differentiation and the development of obesity, the identification of potential target genes could improve the rational design of new classes of drugs to control obesity. PPARs bind neither as homodimers nor as monomers but strictly depend on the retinoid X receptor (RXR) as a DNA-binding partner. The consensus sequence for the binding of PPAR:RXR is given by a 5' flanking region and two half sites with an adenine (A) in between (5'-AWCT AGGNCA A AGGTCA-3') [1].

In order to identify potential binding sites for PPARg, searches in promoters and upstream sequences of human, mouse and rat genes, as well as in GenBank entries for genes of other vertebrates, were performed with 1) the consensus sequence according to the IUPAC string convention, 2) the position weight matrix from the TRANSFAC database [3], and 3) a novel position weight matrix (PWM) derived from experimentally verified PPARg binding sites. For the searches with the consensus sequence and the TRANSFAC PWM, the FindPatterns program of the Wisconsin package [4] and the online service of MatInspector [6] were used, respectively. To search with and evaluate the new context-specific PWM, as well as to determine the optimal threshold level, an algorithm based on information content, similar to the MatInspector program [5], was implemented.
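
The abstract gives the flavour of the scoring but not the code; the sketch below (with a toy four-column matrix, not the PPARg PWM) shows the MatInspector-style idea of weighting each matrix column by its information content and normalizing the score to the best attainable match. The 0.85 threshold mirrors the threshold level quoted below.

    import math

    # Toy 4-column PWM of base frequencies; NOT the PPARg matrix.
    PWM = {"A": [0.7, 0.1, 0.8, 0.1],
           "C": [0.1, 0.1, 0.1, 0.1],
           "G": [0.1, 0.7, 0.05, 0.7],
           "T": [0.1, 0.1, 0.05, 0.1]}
    WIDTH = 4

    def column_info(i):
        """Information content of column i (the per-column Ci weight)."""
        return sum(PWM[b][i] * math.log(4 * PWM[b][i], 2)
                   for b in "ACGT" if PWM[b][i] > 0)

    CI = [column_info(i) for i in range(WIDTH)]
    BEST = sum(CI[i] * max(PWM[b][i] for b in "ACGT") for i in range(WIDTH))

    def matrix_similarity(site):
        """Score in [0, 1]; 1.0 means the optimal base at every position."""
        raw = sum(CI[i] * PWM[site[i]][i] for i in range(WIDTH))
        return raw / BEST

    def scan(sequence, threshold=0.85):
        for i in range(len(sequence) - WIDTH + 1):
            score = matrix_similarity(sequence[i:i + WIDTH])
            if score >= threshold:
                yield i, round(score, 3)

    print(list(scan("TTAGAGAGGTCATT")))   # perfect "AGAG" hits score 1.0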

Searching with the IUPAC method yielded a noticeable number of matches, which could be refined by using the TRANSFAC PWM for PPARg, since the PWM obviously captures more information than the consensus sequence. The search with the newly constructed PPAR matrix resulted in a reasonable number of potential sites and so far unidentified putative target genes (2% of the studied promoter sequences at a threshold level of 0.85).

[1] Desvergne B, Wahli W: Peroxisome Proliferator-Activated Receptors: Nuclear Control of Metabolism. Endocrine Reviews 20:649-688, 1999

[2] Fickett JW: Quantitative Discrimination of MEF2 Sites. Mol Cell Biol 16:437-441, 1996

[3] Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Prüß, M., Reuter, I. and Schacherer, F.: TRANSFAC: an integrated system for gene expression regulation. NAR 28: 316-319, 2000 (http://transfac.gbf.de/TRANSFAC).

[4] Wisconsin Package, Genetics Computer Group, WI

[5] Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast versatile tools for detection of consensus matches in nucleotide sequence data. NAR 23:4878-4884, 1995

[6] http://genomatix.gsf.de/


307. On the Species of Origin: Diagnosing the Source of Symbiotic Transcripts
Peter T. Hraber, Santa Fe Institute;
Jennifer W. Weller, Virginia Technical University;
pth@santafe.edu
Short Abstract:

 Sequencing expressed tags from mixed cultures of interacting symbionts (pathogenic or mutualistic) can help identify genes that regulate symbiosis, but presents an analytic challenge: to determine from which organism a transcript originated. Previous solutions used nucleotide composition or similarity searches, but a comparative analysis of hexamer counts is more powerful.

One Page Abstract:

 Most organisms have developed ways to recognize and interact with other species. Symbiotic interactions range from pathogenic to mutualistic. Some molecular mechanisms of interspecific interaction are well understood, but many remain to be discovered. Expressed sequence tags (ESTs) from cultures of interacting symbionts can help identify genes that regulate symbiosis, but present a unique challenge for functional analysis. Given a sequence expressed in an interaction between two symbionts, the challenge is to determine from which organism the transcript originated. For high-throughput sequencing from interaction cultures, a reliable computational approach is needed. Previous investigations into GC nucleotide composition and comparative similarity searching provide provisional solutions, but a comparative lexical analysis, which uses a likelihood-ratio test of hexamer counts, is more powerful. Tests against genes whose origin and function are known yielded 94% accuracy. Microbial transcripts comprised about 75% of a Phytophthora sojae-infected soybean (Glycine max cv Harasoy) library, contrasted with 15% or less in root tissue libraries of Medicago truncatula from axenic, Phytophthora medicaginis-infected, mycorrhizal, and rhizobacterial treatments. Many of the symbiotic transcripts were of unknown function, suggesting candidates for further functional investigation.
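
As a rough sketch of the lexical idea (training sequences and pseudocount choice invented here), hexamer tables built from known transcripts of each partner can classify a new EST by a summed log-likelihood ratio:

    import math
    from collections import Counter

    def hexamer_counts(seqs):
        counts = Counter()
        for s in seqs:
            for i in range(len(s) - 5):
                counts[s[i:i + 6]] += 1
        return counts

    def log_likelihood_ratio(seq, counts_a, counts_b, pseudo=1.0):
        """Sum of log(P(hexamer|A) / P(hexamer|B)); positive favours A."""
        tot_a = sum(counts_a.values())
        tot_b = sum(counts_b.values())
        llr = 0.0
        for i in range(len(seq) - 5):
            h = seq[i:i + 6]
            pa = (counts_a[h] + pseudo) / (tot_a + pseudo * 4 ** 6)
            pb = (counts_b[h] + pseudo) / (tot_b + pseudo * 4 ** 6)
            llr += math.log(pa / pb)
        return llr

    plant = hexamer_counts(["ATGGCTGCTTCTTCA", "ATGGCAGCTTCAACC"])
    microbe = hexamer_counts(["ATGCGTCGCGGCCGT", "ATGCGCCGTGGTCGC"])
    est = "ATGGCTGCTTCAACC"
    print("plant" if log_likelihood_ratio(est, plant, microbe) > 0 else "microbe")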


308. PAGAN: Predict and Annotate Genes in genomic sequence based on ANalysis of EST Clusters
Sasivimol Kittivoravitkul, Marek Sergot, Department of Computing, Imperial College of Science, Technology and Medicine;
sk297@doc.ic.ac.uk
Short Abstract:

 Unlike other gene index projects, which cluster whole ESTs, PAGAN clusters alignments of ESTs taken from similarity search results. This eliminates undesirable features of ESTs, enabling more precise prediction of exons. Gene structures are revealed by a further assembly and refinement process. The results are shown graphically at the annotation and nucleotide levels.

One Page Abstract:

 Expressed Sequence Tags (ESTs) have been an essential resource in gene discovery. A similarity search against the EST database, dbEST, can reveal homologous genes in a genomic sequence and provide additional information about those genes. Due to the high redundancy and low quality of ESTs, analyzing the similarity search results in order to find genes and their structures is not a trivial task.

We have developed a tool, called PAGAN, which annotates genes in genomic sequences by extracting gene information from the results of a similarity search of dbEST. PAGAN filters the search results using the degree of identity and the length of the homologous EST as criteria, as in Bailey et al. (1998), and then clusters them into groups likely to represent the same exons, using d2_cluster (Burke et al., 1999). Unlike other gene index projects, which cluster whole ESTs, PAGAN clusters only the parts of each EST that align with the genomic sequence. When whole ESTs are clustered, ESTs may be put in the same cluster because of chimeric clones, contamination and other artifacts, whereas clustering the aligned parts of the ESTs reduces these false joins and discards irrelevant information in the ESTs. To obtain the actual underlying exon of each cluster, a consensus sequence for each cluster is derived using PHRAP and CAP3.
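
PAGAN's clustering is sequence-based (d2_cluster); as a simplified stand-in, the sketch below groups EST alignment segments by their genomic overlap, which conveys why clustering alignment parts rather than whole ESTs localizes the evidence to individual exons:

    def cluster_alignments(segments, min_overlap=30):
        """Single-linkage grouping of (start, end) alignment segments on the
        genomic sequence; segments overlapping by >= min_overlap bp are taken
        to support the same putative exon."""
        segments = sorted(segments)
        clusters, current = [], [segments[0]]
        for seg in segments[1:]:
            cluster_end = max(end for _, end in current)
            if seg[0] <= cluster_end - min_overlap:
                current.append(seg)
            else:
                clusters.append(current)
                current = [seg]
        clusters.append(current)
        return clusters

    hits = [(120, 340), (150, 360), (900, 1100), (310, 480), (950, 1180)]
    for cluster in cluster_alignments(hits):
        print(min(s for s, _ in cluster), max(e for _, e in cluster),
              len(cluster), "ESTs")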

We further refine the results by using source information about the ESTs, such as clone ID and polarity. The masking of repeats and low-complexity DNA before the similarity search, which causes gaps in the search results, is also taken into account. The results of PAGAN are displayed graphically at the annotation level, which allows the results to be compared with those of different kinds of analysis programs, and at the nucleotide level, which allows the user to inspect the variation among the EST alignments in a cluster.

Preliminary experiments with benchmark data have shown that PAGAN can detect about 12% more exons than the similarity search of STACK (Christoffels et al., 2001).

Reference :

Bailey, L.C., Searls, D., Overton, G.C. (1998) Analysis of EST-Driven Gene Annotation in Human Genomic Sequence. Genome Research, 8, 362-376

Burke,J., Davison,D., Hide,W., (1999) d2_cluster : A Validated Method for Clustering EST and Full-length cDNA Sequence. Genome Research, 9, 1135-1142

Christoffels,A., Gelder,A.V., Greyling,G., Miller,R., Hide,T., and Hide,W. (2001) STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res., 29, 234-238 


309. Nested Genes in the Human Genome
Zipora Y. Fligelman, Einat Hazkani-Covo, Sarah Pollock, Hanne Volpin, Nili Guttmann-Beck, Compugen Ltd.;
zipo@compugen.co.il
Short Abstract:

 Using the LEADS platform, novel examples of the rare phenomenon of nested genes were found in the human genome. Known examples from the literature were also detected. Most gene-finding algorithms model one gene per locus and probably miss this phenomenon.

One Page Abstract:

 Nested Genes in the Human Genome

Zipora Y. Fligelman*, Einat Hazkani-Covo*, Sarah Pollock, Hanne Volpin, and Nili Guttmann-Beck.

Compugen Ltd, Pinchas Rosen 72 Tel-Aviv 69512 ISRAEL 

{zipo,einat}@compugen.co.il

 The phenomenon of nested or interleaved genes, genes that are located within the introns of another gene, is believed to be rare in the human genome, since only a few examples are known. Most gene-finding programs predict one gene per locus; thus, even with knowledge of most of the genome sequence, it is difficult to estimate the frequency of this biological phenomenon. Nested genes may therefore be more abundant in the human genome than previously thought.

The LEADS platform (Shoshan et al. 2001), which clusters and assembles ESTs and mRNA sequences onto the genome, determines the exon/intron structure of the resulting genes, including alternative splicing. We used this platform to identify nested genes, concentrating on examples where both genes were supported by known RNA sequences.

We verified the existence of five examples of nested genes known from the literature; in three of them the nested genes are encoded on opposite strands (Pohar et al. 1999; Dunham et al. 1999), while two show nested genes encoded on the same strand (Tycowski et al. 1993; Cervini et al. 1995). We have found at least 12 new examples of nested genes, encoded both on opposite strands and on the same strand. Identifying nested genes on the same strand is more complex, as it is difficult to distinguish the phenomenon from alternative splicing when insufficient EST samples are available. The results of our study have more than doubled the number of known nested genes.

References:

· Cervini, R., Houhou, L., Pradat, P. F., Bejanin, S., Mallet, J., and Berrard, S. 1995. Specific vesicular acetylcholine transporter promoters lie within the first intron of the rat choline acetyltransferase gene. J Biol Chem. 270(42):24654-24657.

· Dunham, I., Shimizu, N., Roe, B. A., Chissoe, S., Hunt, A. R., Collins, J. E., Bruskiewich, R., Beare, D. M., Clamp, M., Smink, L. J., Ainscough, R., Almeida J. P.,Babbage A, Bagguley C, Bailey J, Barlow K, Bates KN, Beasley O, Bird. C. P., Blakey, S., Bridgeman, A. M., Buck, D., Burgess, J., Burrill, W. D., and O'Brien, K.P., et al. 1999. The DNA sequence of human chromosome 22. Nature 402(6761):489-495. 

· Pohar, N., Godenschwege, T. A. and Buchner, E. 1999. Invertebrate tissue inhibitor of metalloproteinase: structure and nested gene organization within the synapsin locus is conserved from Drosophila to human. Genomics. 57(2):293-296.

· Shoshan, A., Grebinskiy, V., Magen, A., Scolnicov, A., Fink, E., Lehavi, D., and Wasserman, A. 2001. Designing oligo libraries taking alternative splicing into account. In M. L. Bittner et al., editors, Microarrays: Optical Technologies and Informatics, Proc. SPIE 4266, May 2001.

· Tycowski, K. T., Shu, M. D. and Steitz, J. A.1993. A small nucleolar RNA is processed from an intron of the human gene encoding ribosomal protein S3. Genes Dev. 7(7A):1176-1190. 

* These authors contributed equally to this work.


310. Integrating Protein Homology into the Twinscan System for Gene Structure Prediction
Paul Flicek, Ian Korf, Michael R. Brent, Washington University;
pflicek@cs.wustl.edu
Short Abstract:

 Twinscan is a new gene-structure prediction system that directly extends the probability model of Genscan, allowing it to exploit the patterns of conservation observed in local alignments between a target sequence and its homologs. We present an addition to the Twinscan system that incorporates protein homology into the probability model.

One Page Abstract:

 Twinscan [1] is a new gene-structure prediction system that directly extends the probability model of Genscan, allowing it to exploit the patterns of conservation observed in local alignments between a target sequence and its homologs. Twinscan is specifically designed for the analysis of high-throughput genomic sequences. It can handle multiple, incomplete or no genes on the target sequence and allows for inversions, duplications and changes in intron-exon structure between the target sequence and its homologs. 

We present an addition to the Twinscan system that incorporates protein homology into the probability model. This modification addresses one of the current limitations of the Twinscan system: the requirement that a significant portion of an informant genome at an appropriate evolutionary distance be available. Our preprocessing step includes a simple BLASTX [2] search to identify proteins that are potentially homologous to the target sequence. Additionally, since we use the highest-scoring matches, we allow the use of proteins that are evolutionarily more or less distant than the ideal informant genome.

By combining this with the current Twinscan system, patterns of conservation at both the protein and nucleotide levels can be created, which helps enable the identification of conserved non-coding regions. These complementary patterns of conservation may be important for future work on the automated prediction of regulatory regions.

Our experiments have shown that this extension of Twinscan, using protein homology information alone to construct the conservation sequence, performs nearly as well as when the top homologs (i.e., one or more sequences from the informant genome that match a given target sequence best) are used. While the present work is limited in its ability to find genes that have no protein matches in the databases, we feel that for many organisms it may be an important additional source of information until more appropriate informant genomes have been sequenced. Finally, this represents a first step toward using both genomic and protein homology in an integrated gene-structure prediction system.
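
The conservation sequence itself is described in [1]; the toy function below (a two-symbol track, whereas the real alphabet also distinguishes mismatches) merely illustrates how BLASTX high-scoring pairs could be collapsed onto the target sequence:

    def conservation_track(target_len, hsps):
        """Mark target positions covered by any BLASTX HSP: '|' aligned,
        '.' unaligned (toy alphabet; intervals are 0-based, end-exclusive)."""
        track = ["."] * target_len
        for start, end in hsps:
            for i in range(start, min(end, target_len)):
                track[i] = "|"
        return "".join(track)

    print(conservation_track(40, [(5, 15), (22, 30)]))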

 [1] Korf, I., Flicek, P., Duan, D., Brent, M. R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics (in press)

[2] Gish, W. and States, D.J. 1993. Identification of protein coding regions by database similarity search. Nature Genetics 3(3): 266-72 


311. Improving exon detection using human-rodent genomic sequence comparison
Luis Mendoza, Wyeth W. Wasserman, Center for Genomics and Bioinformatics, Karolinska Institute;
luis.mendoza@cgr.ki.se
Short Abstract:

 We describe a method to identify coding exons in orthologous genomic sequence pairs. Based on a new alignment tool (DPB) and the GenScan algorithm, MaskScan outperforms a variety of methods in our comparative study. Results of a screen of human chromosome 22 and orthologous mouse sequences will be described.

One Page Abstract:

 Functional sequences are, in general, preferentially conserved between species over the course of evolution. Therefore, comparative genomic sequence analysis can be used as a tool for the identification of functional elements within long nucleotide sequences.

 We developed a method to identify coding exons in human DNA based on the identification of conserved regions between orthologous human and rodent genomic fragments. The recognition of conserved segments is enabled by a new global alignment program (DPB). Subsequent to alignment, regions of low conservation are masked and the resulting sequence is analyzed for the presence of genes or exons. The performance of the approaches was measured in terms of sensitivity, specificity and accuracy at both the exon and nucleotide levels. For comparison, we also measured the performance of common gene-finding programs (GeneID and GenScan), a conservation-based gene-finding tool (SGP-1), and variations of our algorithm based on different gene identification and exon finding tools. MaskScan delivered the best performance of all the methods on our datasets of orthologous gene sequences.
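
The exact masking convention is not given in the abstract; the sketch below shows one plausible soft-masking step (conserved intervals kept upper-case, everything else lower-cased) of the kind a downstream gene finder could exploit:

    def mask_low_conservation(seq, conserved_intervals):
        """Soft-mask bases outside the conserved intervals (0-based,
        end-exclusive) so low-conservation regions can be down-weighted."""
        masked = list(seq.lower())
        for start, end in conserved_intervals:
            for i in range(max(0, start), min(len(seq), end)):
                masked[i] = seq[i].upper()
        return "".join(masked)

    print(mask_low_conservation("ACGTACGTACGTACGT", [(4, 8), (12, 16)]))
    # -> acgtACGTacgtACGT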

 Results of a screen of human chromosome 22 for potential coding exons will be presented.

 We have implemented web servers for both the alignment program in isolation (http://kisac.cgr.ki.se/cgi-bin/dpb_page.cgi), and the MaskScan method (http://kisac.cgr.ki.se/cgi-bin/maskscan.cgi).


312. In-silico to in-vivo analysis of whole proteome
Rajani Kanth Vangala, Janki Rangatia, Gerhard Behre, Ludwig Maximillain University;
vrk234@yahoo.com
Short Abstract:

 A computational method for inferring protein-protein interactions and target genes is proposed. The method is based on already available, biologically proven data, and five independent tests against genome sequence are performed. The AML1/ETO protein found in AML M2 was analysed, and in-vivo interactions and target genes could also be shown.

One Page Abstract:

 Large-scale efforts to measure, detect and analyse the whole proteome for new protein-protein interactions and target genes using various experimental methods are underway. However, these approaches are labour- and time-intensive and need to be developed further for accuracy. Here we propose a computational method for inferring protein-protein interactions and target genes for a protein of interest. This Matrix Method (MM) is based on already available, biologically proven data, and five independent tests against genome sequence are carried out for each predicted protein-protein interaction. To identify the target genes of a protein under test, a known DNA sequence to which it has been shown to bind is taken into consideration and a genome-wide analysis is performed. As a model we analysed AML1/ETO, a fusion gene product found in many cases of Acute Myeloid Leukemia (AML) subtype M2. The functional role of this fusion protein has not yet been worked out in detail for drug-design purposes. The proteins/genes identified using this method could readily be shown to interact biologically or to differ in expression levels, demonstrating that new protein interactions and target genes can be identified in this way. This kind of analysis could further our understanding of the whole proteome and help identify new drug targets in many diseases.


313. Incorporating Additional Information to Hidden-Markov Models for Gene Prediction
Tomas Vinar, Brona Brejova, University of Waterloo;
Ming Li, University of California Santa Barbara;
Ying Xu, Oak Ridge National Laboratory;
tvinar@math.uwaterloo.ca
Short Abstract:

 HMMs are frequently used in gene finding. Since it is generally hard to expand HMMs to incorporate different sources of information, we use external programs whose outputs are combined using machine learning techniques. In this way we achieve the high accuracy of HMM-based gene finders and the flexibility to incorporate many data sources.

One Page Abstract:

 Hidden Markov models (HMMs) are frequently used in gene finding. Some features can be expressed conveniently in terms of HMMs, such as the basic structural constraints (i.e., that a coding sequence consists of several exons separated by introns), dicodon bias in exons, and various other signals (such as splice-site signals). By using generalized forms of HMMs it is also possible to include more information in the model, e.g., the distribution of exon lengths. However, adding more information greatly increases the number of states of the model, its conceptual complexity, and its running time and memory requirements. Some types of information (such as homologs in protein databases) are even hard to model using traditional HMMs.

Therefore we propose to build a relatively simple HMM and incorporate additional sources of information in the form of "advisors". Each advisor is a prediction algorithm that takes into account one kind of additional information that can be inferred from the sequence or genomic databases.

Different advisors may produce contradictory predictions. Their results are therefore combined using machine learning approaches. The combination of advisors gives, for each position of the sequence, a probability distribution over all possible structural elements that may occupy that position. The probability of an entire gene structure is then the product, over all positions, of the individual structural-element probabilities. This probability is then combined with the probability of the same structure given by the HMM.
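
In outline (the merge rule and weights below are invented; the poster's learned combination is not reproduced here), the scoring of a candidate structure might look like this:

    import math

    def combined_log_score(hmm_logprob, advisor_dists, structure, weights):
        """Add to the HMM's log-probability, for every position, the log of
        the advisors' weighted merged probability for the element that the
        candidate structure places there."""
        total = hmm_logprob
        for pos, element in enumerate(structure):
            merged = sum(w * dist[pos].get(element, 1e-6)
                         for w, dist in zip(weights, advisor_dists))
            total += math.log(merged)
        return total

    # Two advisors over a 3-base toy sequence; elements "E" exon, "I" intron.
    a1 = [{"E": 0.8, "I": 0.2}, {"E": 0.6, "I": 0.4}, {"E": 0.3, "I": 0.7}]
    a2 = [{"E": 0.7, "I": 0.3}, {"E": 0.7, "I": 0.3}, {"E": 0.4, "I": 0.6}]
    print(combined_log_score(-4.2, [a1, a2], "EEI", [0.5, 0.5]))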

Currently our advisors include: (a) signal-detecting algorithms for finding splice sites, promoters, branch sites, etc.; (b) results of homology searches against EST and protein databases; and (c) optional suggestions from a human expert, which allow a user to influence the gene-finding process.

The main advantage of this approach is its modularity. To add a new source of data, we just need to add an advisor and automatically adjust the weights in the combination phase. We can therefore expect to achieve a combination of the high accuracy typical of HMM-based gene finders and the flexibility of supporting many data sources.

Acknowledgements: This research is conducted in collaboration with Bioinformatics Solutions, Inc. 


314. Gene prediction in the post-genomic era
Enrique Blanco, Institut Municipal d'Investigacions Mediques (IMIM) / Facultat d'Informatica de Barcelona - (Universitat Politecnica de Catalunya);
Genis Parra, Sergi Castellano, Josep F. Abril, Moises Burset, Institut Municipal d'Investigacions Mediques (IMIM);
Xavier Messeguer, Facultat d'Informatica de Barcelona - (Universitat Politecnica de Catalunya);
Roderic Guigo, Institut Municipal d'Investigacions Mediques (IMIM);
eblanco@imim.es
Short Abstract:

 geneid is a program to predict genes in anonymous genomic sequences, designed with a hierarchical but simple structure. geneid's accuracy is comparable to that of other existing tools, and it is very efficient in terms of time and memory consumption (a parallel version is also available). Recent applications of geneid to the reannotation of eukaryotic genomes will be described.

One Page Abstract:

 In eukaryotes, genes are DNA stretches that are difficult to detect accurately due to the existence of long intergenic regions and the fragmentation of genes into exons and introns. A number of genome sequencing projects have now been finished or are in their final stages, providing gigabases of raw information which need to be processed to gain biologically relevant knowledge. The first step in this process is locating the protein-coding genes. So far, several genome drafts have been published, and more accurate descriptions are expected in the near future, ushering in the age of reannotation.

geneid was one of the first programs to predict full exonic structures of vertebrate genes in anonymous DNA sequences (Guigo et al. 1992). It was designed following a simple hierarchical structure (signal to exon to gene), although the scoring scheme used to assess the reliability of the predictions was rather heuristic. Here we present the new geneid version, developed during the last two years. This new version maintains the hierarchical structure of the original geneid but simplifies the scoring scheme and gives the scores a probabilistic meaning: they are now computed as log-likelihood ratios of different Markov models. The optimal gene is thus computed as the exon assembly that maximizes the sum of the scores of the assembled exons, using an efficient dynamic programming algorithm to search the space of all potential gene structures.
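
As a minimal sketch of the scoring idea (toy first-order models below; geneid's actual Markov models are of higher order and trained on curated data), an exon's score as a log-likelihood ratio of two Markov chains is:

    import math

    CODING = {"A": {"A": 0.3, "C": 0.3, "G": 0.3, "T": 0.1},
              "C": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
              "G": {"A": 0.3, "C": 0.3, "G": 0.2, "T": 0.2},
              "T": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
    BACKGROUND = {x: {y: 0.25 for y in "ACGT"} for x in "ACGT"}

    def exon_log_likelihood_ratio(seq):
        """log[P(seq | coding) / P(seq | background)] under first-order
        Markov chains; positive values favour the coding model."""
        llr = 0.0
        for prev, cur in zip(seq, seq[1:]):
            llr += math.log(CODING[prev][cur] / BACKGROUND[prev][cur])
        return llr

    print(round(exon_log_likelihood_ratio("ACGGCGACG"), 3))

These per-exon scores are what the dynamic programming assembly then maximizes.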

The accuracy of geneid predictions on the human and fly genomes is comparable to that of other programs. At the same time, the simple design of the new version of geneid results in greater performance in terms of speed and memory usage. Moreover, starting from the current modular structure, we have developed a parallel version implemented with Pthreads (POSIX) that is able to process the same input sequence while dividing the execution time by a factor linear in the number of processors.

Due to the simplicity of its design, geneid results may be employed in projects other than pure gene finding. We will describe two recent applications of geneid: first, searching for selenoprotein genes, genes in which the TGA codon encodes the amino acid selenocysteine in addition to being a stop signal; and second, predicting novel genes in the human genome using shotgun mouse genome sequences.

geneid is freely available at www1.imim.es/software/geneid.html, and an on-line web server is provided at www1.imim.es/geneid.html

References:

- Guigo et al. 1992. Prediction of gene structure. J. Mol. Biol. 226: 141-157.
- Parra et al. 2000. GeneID in Drosophila. Genome Research 10: 511-515.


315. Improved Splice Site Prediction by Considering Local GC Content
Aaron Levine, Richard Durbin, The Sanger Centre;
adl@sanger.ac.uk
Short Abstract:

 We have produced a stand-alone splice site predictor that considers local GC content during the prediction process. Our predictor shows significant improvements over standard models at identifying both donor and acceptor sites, particularly in gene-rich high GC regions, and is designed to integrate easily into probabilistic gene prediction systems.

One Page Abstract:

 Despite the recent completion of the draft sequence of the human genome, accurate ab initio prediction of complex mammalian genes remains a largely unsolved problem. As many mammalian genes consist of a large number of small exons separated by much longer introns, reliable identification of intron splicing junctions is a key problem on which successful gene prediction depends. However, mammalian splice site consensus sequences are notoriously degenerate and uninformative, leading to high false positive rates for even the best ab initio splice site predictors. 

We have explored the utility of a variety of additional information sources and found that taking local GC content into account during splice site prediction leads to a significant improvement in accuracy over standard first-order weight matrix models. Stratification by local GC content has its most profound effect on predictions in gene-rich, high-GC regions and aids in the prediction of both donor and acceptor splice sites.
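
Schematically (two-column toy matrices and an invented GC cut-off below; the actual predictor uses full first-order models per stratum), stratified scoring simply selects the matrix for the local stratum before scoring:

    import math

    def gc_fraction(seq):
        return (seq.count("G") + seq.count("C")) / len(seq)

    # One toy donor matrix per GC stratum; real matrices would be trained
    # on splice sites drawn from each stratum.
    MATRICES = {"low":  [{"G": 0.8, "T": 0.1, "A": 0.05, "C": 0.05},
                         {"T": 0.8, "G": 0.1, "A": 0.05, "C": 0.05}],
                "high": [{"G": 0.6, "T": 0.2, "A": 0.1, "C": 0.1},
                         {"T": 0.6, "G": 0.2, "A": 0.1, "C": 0.1}]}

    def score_donor(site, context):
        """Pick the weight matrix for the local GC stratum, then score."""
        stratum = "high" if gc_fraction(context) > 0.5 else "low"
        return sum(math.log(col[base] / 0.25)
                   for col, base in zip(MATRICES[stratum], site))

    print(round(score_donor("GT", "GCGCGCATAT"), 3))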

 We have produced a stand-alone splice site predictor (available from our website), which predicts both canonical donor and acceptor sites, as well as rarer GC donor sites. Our predictor generates both log-odds scores and posterior probability values for each potential splice site, and it is designed to be easily integrated into probabilistic gene prediction systems. Preliminary results indicate that our splice predictor performs comparably to or better than other top splice predictors under most conditions.


316. Relaxed profile matching as a method for identifying putative novel proteins from the genome
Mark Ibberson, Achim Frauenschuh, Massimo De Francesco, Serono Pharmaceutical Research Institute;
Mark.Ibberson@serono.com
Short Abstract:

 We have used a relaxed profile strategy to identify novel putative coding regions from the human genome. A relaxed profile for a family of secreted proteins identified 7974 unique open reading frames, of which 253 were selected for further analysis based on signal peptide and secondary structure predictions.

One Page Abstract:

 The function of about 40% of the genes in the human genome cannot be assigned based on sequence homology to currently characterized protein families (1,2). For these and hitherto unidentified proteins, alternative methods need to be employed in order to identify proteins sharing similar biological characteristics but lacking sequence homology to known protein families. We have attempted to identify novel proteins that share predicted structural properties with a known protein family but show no obvious sequence homology to it. To do this, we created a regular-expression-based profile loosely describing a family of secreted proteins and used it to search the draft human genome. The resulting matches were subsequently filtered using signal peptide and secondary structure prediction algorithms to identify candidates for further analysis. The profile, for the chemokine family of secreted proteins, identified a total of 7974 unique open reading frames in the draft human genome. Of these, approximately 30% (2441) were identical or homologous to known proteins present in the SwissProt/TrEMBL or Derwent patent databases. The remaining 70% (5533) were analyzed using signal and structure prediction algorithms, and 253 (~5%) were selected based on signal peptide and secondary structure predictions. A subset of these sequences is currently being validated experimentally. Our initial results suggest that this type of strategy could be useful for identifying distant members of protein families or evolutionarily unrelated proteins that have evolved similar biological functions.
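
Neither the profile nor the pipeline code is given above; the fragment below shows the general mechanics with a made-up "relaxed" pattern (four cysteines with loosely constrained spacing, vaguely chemokine-like) applied to hypothetical translated ORFs:

    import re

    # Made-up relaxed pattern; NOT the profile used in the study.
    RELAXED = re.compile(r"C.{1,3}C.{10,40}C.{10,30}C")

    orfs = {
        "cand1": "MKLSAACWCAAPLLAAALLLLSGCPQRSTVWYAAAAAAAAAAAACAAAA",
        "cand2": "MKLSAAWAAPLLAAALLLLSGPQRSTAAAA",
    }
    for name, protein in orfs.items():
        match = RELAXED.search(protein)
        if match:
            print(name, "matches at", match.start(), "-", match.end())

Matches would then be passed to signal peptide and secondary structure predictors, as described above.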

1) International Human Genome Sequencing Consortium, Nature 409, 860-921.

2) Venter, J.C et al., Science 291, 1304-1351. 


317. Analyzing Alternatively Spliced Transcripts
Ann Loraine, Guoying Liu, Alan Williams, Ray Wheeler, Michael A. Siani-Rose, David Kulp, Affymetrix, Inc.;
Ann_Loraine@affymetrix.com
Short Abstract:

 Two collections of alternatively spliced human loci were prepared from public human genome data. The first collection was built from mRNA-to-genomic alignments; the second was made using the gene-finding program AltGenie. Comparing these, we identified high-quality AltGenie-predicted loci and used these to test a protein-homology-based scheme for assessing transcript quality.

One Page Abstract:

 Current estimates project that one third to one half of all human genes undergo alternative splicing and therefore give rise to multiple protein forms, many of which exhibit different, even antagonistic, activities [Mironov; Lander; Taylor]. To advance understanding of the role of alternative splicing in generating protein diversity, we have built two collections of alternative splice predictions using the public Human Genome Project data released Oct. 7, 2000, the same version analyzed in Lander, et al. The first set (A) includes 96,832 transcripts from 18,948 loci and was built using AltGenie, a gene-finding program that first uses EST/mRNA-to-genomic alignments to detect internal exons and then applies statistical methods to detect protein-coding exons in flanking genomic sequence [Kulp and Wheeler, unpublished data]. The second collection (RS/C) contains gene predictions made by aligning previously reported mRNAs to genomic sequence. Although the RS/C collection is based on experimental evidence (mRNA sequence records), it is still predictive since it is not always possible to reconstruct the reference transcript and/or protein from genomic sequence. 

Due to the limitations of current computational gene-finding methods, it is widely agreed that before a computed prediction can be accepted, it must be vetted by an expert curator or confirmed by experimental data. Manual and experimental inspection of predicted genes is expensive and time-consuming; thus, we are exploring ways to automate gene prediction analysis with the goal of building a reliable collection of alternatively spliced human genes. Curators typically use protein homology data to evaluate whether a novel predicted transcript is likely to be correct; that is, curators give more weight to a predicted transcript when it encodes a protein that is homologous to a previously characterized protein family. Following this same reasoning, we are testing methods for gene prediction analysis that use protein homology data to assess the quality of predicted transcripts.

References

Lander, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409(6822):860-921.

Mironov, A.A., Fickett, J.W., and Gelfand, M.S. 1999. Frequent alternative splicing of human genes. Genome Res. 9(12):1288-93.

Taylor, J.F., Zhang, Q.Q., Wyatt, J.R., and Dean, N.M. 1999. Induction of endogenous Bcl-xS through the control of Bcl-x pre-mRNA splicing by antisense oligonucleotides. Nature Biotech. 17: 1097-1100. 


318. EST curation with improved gene feature models
Debopriya Das, Eldar Giladi, Incyte Genomics;
ddas@incyte.com
Short Abstract:

 We present an effective EST curation tool which combines EST hits on the genome with statistical models for gene features on the genomic sequence. For each EST, the tool generates a ranked set of segments which potentially represent exons or portions thereof.

One Page Abstract:

 The availability of the human genomic sequence and of large repositories of ESTs provides the opportunity to generate an improved view of a gene, or a portion thereof, by developing algorithms that combine both data sets. In this poster we present a simple tool which combines EST hits on the genome with statistical models for gene features on the genomic sequence and generates, for each EST, a ranked set of segments that potentially represent exons or portions thereof.

Our work is motivated by several applications. First, an important route to the discovery of rare genes, that is, genes with a low expression level, involves seeking singletons and extending them into full-length transcripts by generating primers from those singletons. By a singleton we mean an EST which overlaps with no other EST from a given library or collection of libraries. The difficulty with this approach is that some mRNAs are contaminated with unspliced introns, and ESTs sequenced from them could be intronic singletons. Hence, it is important to have a tool which helps discriminate between singletons that are intronic and those that originate in an exon. By leveraging additional information from the genomic sequence, our tool is able to improve this discrimination substantially. Another application of the tool is to aid curators in generating full-length transcripts by providing them with a ranked set of candidate exons associated with each EST.

To compute the candidate exons associated with an EST, the EST is first mapped to the genome. Then the genomic segment extending 500-1000 nt beyond the EST on each side is analyzed. In this segment, potential donor and acceptor splice sites are identified and scored with a Markov model, and for each pair of sites the coding potential of the region bounded by them is computed. The resulting segments are ranked on these three scores using a simple ranking scheme. For the first exon, the method replaces the coding-potential score with a combination of a coding score for the segment downstream of the methionine and a score from a UTR model for the region between the acceptor and the methionine. An analogous variant exists for the last exon.
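
In outline (toy scores and an additive ranking invented for the example), enumerating and ranking candidate exons around an EST hit could look like:

    def candidate_exons(acceptors, donors, coding_score, min_len=50):
        """Rank acceptor/donor pairs by splice-site scores plus the coding
        potential of the enclosed segment (toy additive scheme)."""
        ranked = []
        for a_pos, a_score in acceptors:
            for d_pos, d_score in donors:
                if d_pos - a_pos >= min_len:
                    total = a_score + d_score + coding_score(a_pos, d_pos)
                    ranked.append((total, a_pos, d_pos))
        return sorted(ranked, reverse=True)

    toy_coding = lambda a, d: 0.01 * (d - a)        # stand-in coding model
    acceptors = [(120, 3.1), (180, 1.2)]
    donors = [(260, 0.4), (300, 2.7)]
    for total, a, d in candidate_exons(acceptors, donors, toy_coding)[:3]:
        print(round(total, 2), a, d)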

The statistical models which we use for each of the components mentioned above are Markov models analogous to those used by Genscan. In view of the large number of curated genes mapped to the genomic sequence available in the Incyte database, we were able to use higher-order models in some cases, and in general we obtained substantially improved performance with models trained on this large data collection.

The exon-curation tool presented here differs from available gene-finding tools in that the latter try to find the global gene structure of a given genomic piece, whereas we perform a local analysis and curation of a single exon. We have nevertheless identified a few examples in which our tool detects errors in the predictions of gene-finding tools. Moreover, we find that the tool can identify an intron-free segment and the correct extent of the exon in about 85-90% of cases, including the first and last coding exons. We are currently assessing the coding content of ESTs in public databases using this tool and developing additional improved gene feature models.

 


319. Annotation of Human Genomic Regions to Identify Candidate Disease Genes
Damian Smedley, Derek Huntley, Sasivimol Kittivoravitkul, Holger Hummerich, Imperial College, London, UK;
Soraya Bardien-Kruger, SANBI, South Africa;
Peter Little, University of New South Wales, Australia;
Win Hide, SANBI, South Africa;
Marek Sergot, Mark McCarthy, Imperial College, London, UK;
d.smedley@ic.ac.uk
Short Abstract:

 GANESH is an annotation tool for genomic regions identified in human disease gene studies. Gene predictions, updated daily, are based on similarity to known genes/ESTs, Genscan predictions, and synergy with the mouse genome. To select candidate genes, we are developing automated methods to retrieve gene expression and function data.

One Page Abstract:

 We have developed a set of tools, GANESH (Genome Annotation Network for Expressed Sequence Hits: http://zebrafish.doc.ic.ac.uk/Ganesh/), to annotate small (10-20 Mb) regions of the human genome identified as containing potential disease genes. Genomic sequence, in the form of finished and unfinished BAC/PAC clones from the public genome effort, is retrieved and oriented according to the Golden Path (http://genome.ucsc.edu/). Annotation is then carried out on a clone-by-clone basis using a series of automated scripts which (i) BLAST the genomic sequence against EMBL, dbEST, STACKdb, SwissProt, TrEMBL, IPI, dbSNP and dbSTS, (ii) identify Pfam domains, and (iii) run Genscan exon predictions. Both the genomic sequence and the databases searched are automatically checked every night for updates, and reprocessing occurs whenever required. Predicted genes are constructed from parallel lines of evidence, including (i) similarity to known genes and ESTs, (ii) Genscan predictions, and (iii) synergy with the emerging mouse genome. All predicted genes are stored, including the unlikely ones, since in a positional cloning approach every possible gene in the region should be considered a potential candidate. However, we categorize the predictions in terms of the level of supporting evidence and hence the likelihood of being a real gene; a sketch of the categories follows below. For instance, category 1.1 genes have an exact match to a known gene, category 1.2 genes have strong homology to a known gene, category 2 genes have exons predicted by Genscan or mouse genome comparisons with some supporting EST evidence, and category 3 genes have EST evidence only. The results of the annotation are stored in a relational MySQL database and can be viewed remotely or locally using a Java GUI. To drive the selection of positional candidates for further study, we are developing automated retrieval methods to collect attributes for each predicted gene. These attributes will include qualitative (tissue expression profiles) and quantitative (microarray) expression data, functional data (from the Gene Ontology and involvement in KEGG metabolic pathways), and the gene's position in the region. Again, the results will be stored in a MySQL database, with a Java GUI allowing the biologist to recover all the candidate genes according to their particular criteria. Testing of candidates is likely to involve SNP association studies, so identified SNPs and their positions in the gene structure will also be stored in the attribute database. GANESH annotation of a region of chromosome 1 (1q21-24) will be presented. This region has been identified by several groups, including our own, as harbouring a gene involved in type 2 diabetes. In this region, 133 category 1.1 genes, 145 category 1.2 genes and 393 genes from the other categories were identified. Examples of the gene attribute retrieval will also be demonstrated.
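
The evidence categories above amount to a small decision rule; restated as a sketch (the field names are invented for the example):

    def categorize(prediction):
        """Toy restatement of the GANESH evidence categories."""
        if prediction.get("exact_known_gene"):
            return "1.1"
        if prediction.get("strong_homology"):
            return "1.2"
        if prediction.get("genscan_or_mouse_exons") and prediction.get("est_evidence"):
            return "2"
        if prediction.get("est_evidence"):
            return "3"
        return "unclassified"

    print(categorize({"strong_homology": True}))     # -> 1.2
    print(categorize({"est_evidence": True}))        # -> 3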


320. Annotation of the E. coli genome revisited
Vera van Noort, David Ussery, Thomas Schou Larsen, Marie Skovgaard, Centre for Biological Sequence Analysis, DTU;
V.vanNoort@students.bio.uu.nl
Short Abstract:

 Our aim is to identify the real protein-coding genes in E. coli, using the sequenced genomes of four different E. coli strains and four other enteric bacteria. We put all annotated genes into six different categories, ranging from confident to wrong, using de novo prediction, homology searches and contextual information.

One Page Abstract:

 E. coli is one of the most studied model organisms in biology. The genome of this enteric bacterium was sequenced in 1997 and, as in other sequenced genomes, about 30 percent of the genes were of unknown function. The genomes of four different E. coli strains and four other enteric bacteria are now available. The genomes of the four sequenced E. coli strains range in size from 4,636,552 to 5,529,376 base pairs. The number of annotated genes also differs considerably between the strains, from 4405 to 5502. A previous study has shown that the number of protein-coding genes in the E. coli K12_MG1655 genome is overestimated by 15 to 20%, which means that between 625 and 950 annotated genes are not protein-coding regions but rather Open Reading Frames that occur by chance. This was estimated from the number of randomly occurring stop codons, based on AT content, as well as from matches with non-hypothetical proteins from SwissProt. As more and more people use public databases and assume that all annotated genes in these databases are real, it is necessary to establish which annotated genes correspond to true genes and record this in the databases.

Our aim is to find the genes that are real protein-coding genes. To do this we put all annotated genes into six different categories, ranging from confident to wrong. First, genes that match known proteins are considered "confident". Genes that have close orthologs, i.e. known proteins with a common ancestor in another organism, are considered "conserved hypothetical", which is less confident. For E. coli we can use, for example, Salmonella to find close orthologs. If an ORF is conserved across distant organisms, we also consider it "conserved hypothetical". In addition to de novo prediction and homology searches, we use contextual information to find the correct direction of the genes. This is necessary because conservation on one strand implies conservation on the other strand, for obvious reasons. Moreover, the gene finding may yield additional hypothetical genes, for which we can look for evidence just as for the already annotated genes; such evidence can make them confident. Additional evidence can be found in DNA expression data from microarray experiments. For this, however, it is necessary to know the positions of the primers, because primers of wrongly annotated genes can be located in mRNAs containing real expressed genes. Roughly 3000 genes are found to be expressed; many of the remaining genes are short ORFs that are unlikely to be real, that is, they are overpredicted. If the gene finding gives a low score and no evidence for a gene is found, we consider the gene wrongly annotated. "More confident" and "less confident" are used relative to the old annotation, depending on how confident we are that the genes are real.

We found evidence for 1,112 genes that are conserved at the DNA level between the genomes of 8 different enteric bacteria. Conserved, in this case, means more than 50% identity.


321. Conformational studies on O-specific polysaccharides of Shigella dysenteriae
Rosen, J., Nyholm, P.G., Rabobi, A., Göteborg University;
Mulard, L.A., Pasteur Institute, Paris;
f94jiro@dd.chalmers.se
Short Abstract:

 Conformational analyses of the O-antigenic polysaccharides of Shigella dysenteriae types 1, 2 and 4 have been performed using modified HSEA and MM3(96). For type 1, the results show two conformers differing with respect to the a-D-Gal-(1-3)-a-D-GlcNAc linkage. The type 2 and 4 antigens are highly constrained according to the calculations.

One Page Abstract:

 Conformational analyses of fragments of the O-antigenic polysaccharides of Shigella dysenteriae types 1, 2 and 4 have been performed using modified HSEA and MM3(96). The sequences of the repeating units of these O-antigens are shown below:

S. dys. type 1: -3)-L-Rha-(1-2)-a-D-Gal-(1-3)-a-D-GlcNAc-(1-3)-a-L-Rha-1

S. dys. type 2: -3)-a-D-GalNAc-(1-4)-[a-D-GlcNAc-(1-3)]-a-D-GalNAc-(1-4)- a-D-Glc-(1-4)-a-D-Gal-(1- 

S. dys type 4: -3)-a-D-GlcNAc-(1-3)-[a-L-Fuc-(1-4)]a-D-GlcpNAc-(1-4)-a-D- GlcpA-(1-3)-a-L-Fucp-(1-

For the type 1 O-antigen, the results of the calculations indicate that shorter fragments, such as the trisaccharide a-L-Rha-(1-2)-a-D-Gal-(1-3)-a-D-GlcNAc and the tetrasaccharide a-L-Rha-(1-2)-a-D-Gal-(1-3)-a-D-GlcNAc-(1-3)-a-L-Rha, exist as two different conformers, I and II, differing with respect to the conformation of the a-D-Gal-(1-3)-a-D-GlcNAc linkage (phi/psi = -40/-40 (I) and 10/30 (II), respectively). For the pentasaccharide a-L-Rha-(1-3)-a-L-Rha-(1-2)-a-D-Gal-(1-3)-a-D-GlcNAc-(1-3)-a-L-Rha and longer fragments, the calculations indicate a preference for conformation II. Such a conformational change of the a-D-Gal-(1-3)-a-D-GlcNAc linkage is in agreement with previously obtained NMR data. For longer fragments of the polysaccharide, the "back-folded" conformation leads to a compact helical conformation with the galactose residues protruding radially from the core of the helix, consistent with the role of L-Rha-(1-2)-a-D-Gal as the major epitope of this O-specific polysaccharide.

The type 2 antigen is fairly constrained due to the branch, which involves three N-acetyl hexosamines. The substitution at the 2-, 3-, and 4-positions of the central GalNAc residue (see formula above) appears to give rise to a single well-defined conformation with restricted flexibility. In the case of the type 4 O-antigen, there are very significant steric constraints around the GlcNAc residue at the branching point (see formula above). Despite these constraints, there are at least three different favoured conformations of the trisaccharide a-D-GlcNAc-(1-3)-[a-L-Fuc-(1-4)]-a-D-GlcNAc. The complexity of the potential energy surfaces of this system suggests that it is of interest for further studies to validate the force fields used in conformational analyses of saccharides. Furthermore, the results obtained on these systems are of interest for the design of carbohydrate-based vaccines against shigellosis caused by Shigella dysenteriae. It is our intention to compile these structures in a database of favoured conformations of oligosaccharides.


322. The risk of failure to detect disease genes due to epistatic interactions
Gavin A. Huttley, Simon Easteal, Sue Wilson, Centre for Bioinformation Science, ANU;
gavin.huttley@anu.edu.au
Short Abstract:

 The prevalent paradigm for the analysis of common human diseases assumes a single gene is largely responsible for affecting individual disease risk. We present a theoretical result demonstrating that failure to factor in epistatic interactions, a common component of complex diseases, leads to elevated rates of false positives and negatives.

One Page Abstract:

 Empirical evidence from model organisms indicates that the genetic background can strongly influence the phenotype exhibited by a specific genotype, due to epistatic (gene-gene) interactions. The prevalent paradigm for the analysis of common human diseases assumes, however, that a single gene is largely responsible for affecting individual disease risk. The consequence of examining each gene as though it were solely responsible for conferring disease risk, when in fact that risk is contingent upon interactions with another disease locus, has not previously been determined. We examined the effect of two (or more) major epistatic disease genes when data are analysed assuming a single disease gene. We constructed a general genetic model to analyse so-called triad data (two parents and an affected child) for two marker loci. Based on this model, we show that results can vary markedly depending on the parameters associated with the "unidentified" disease gene. The results indicate that if the parameters associated with the second gene vary between studies, then the conclusions of those studies may also vary. This is a theoretically broad result with important implications for interpreting different results from individual studies and for comparing results between studies. It demonstrates that failure to factor in such interactions can lead to an elevated type II error rate, i.e. false negatives. This is particularly troubling for genomic-scan study designs.


323. A stoichiometric model for the central carbon metabolism of Aspergillus oryzae
Helga David, Mats Åkesson, Jens Nielsen, Center for Process Biotechnology, BioCentrum-DTU, Technical University of Denmark;
helga.david@biocentrum.dtu.dk
Short Abstract:

 A stoichiometric model for the metabolism of Aspergillus oryzae was developed based on the literature. Intracellular compartmentalization was considered, and biochemical reactions and transport reactions were included in the model. Flux balance analysis was combined with linear programming to study the phenotypic behaviour of the microorganism during aerobic glucose-limited continuous cultures.

One Page Abstract:

 A stoichiometric model for the central carbon metabolism of Aspergillus oryzae

H. M. David, M. Åkesson, J. Nielsen Center for Process Biotechnology, Biocentrum - DTU, Building 223, Technical University of Denmark, DK - 2800 Lyngby, Denmark (helga.david@biocentrum.dtu.dk)

Genomic sequencing efforts are beginning to produce complete organism-specific genetic information at an extraordinary rate. The functional assignment of individual genes derived from sequence data, which can be viewed as the first level of functional genomics, should soon give rise to a more challenging stage, one in which the focus is on the interrelatedness of gene products and their role in multigenetic cellular functions. Genome-scale flux balance models can be used in the elucidation of the genotype-phenotype relationship and hence represent a potentially valuable approach in metabolic engineering for the design and optimisation of bioprocesses [1].

Filamentous fungi are attractive microorganisms for the industrial production of various products, including organic acids, antibiotics and proteins. They exhibit a number of properties that give them competitive advantages over other eukaryotic and prokaryotic organisms, namely the ability to produce and secrete large amounts of protein and the capabilities to process eukaryotic mRNA, glycosylate proteins and make other post-translational modifications, which also make them promising hosts for the production of recombinant proteins [2,3]. Aspergilli combine many of the useful features of bacteria with those of higher organisms [4]. However, little work has been done on the mathematical modelling of their metabolism. The complete genome sequence is not yet available for any species of Aspergillus, which hampers the development of a genome-scale model for functional analysis in these microorganisms.

As a starting point, a stoichiometric model for the central carbon metabolism of the filamentous fungus Aspergillus oryzae was constructed based on literature information [5]. Intracellular compartmentalization is considered, and biochemical reactions as well as transport reactions between the cytosol and mitochondria are included in the model. Flux balance analysis in combination with linear optimisation techniques is used to study the phenotypic behaviour of the microorganism during aerobic glucose-limited continuous cultures. In particular, the capability of the fungus to maximally produce a metabolite of interest is investigated, and the redirection of flux distributions in the primary metabolism for the overproduction of the metabolite is studied.
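
A minimal sketch of flux balance analysis as a linear program (a toy two-metabolite network, not the A. oryzae model) can be written with standard scientific Python:

    import numpy as np
    from scipy.optimize import linprog

    # Toy network: R1: -> A, R2: A -> B, R3: B -> biomass, R4: A -> secreted.
    S = np.array([[1, -1,  0, -1],   # steady-state mass balance for A
                  [0,  1, -1,  0]])  # steady-state mass balance for B
    c = [0, 0, -1, 0]                # maximise v3 (linprog minimises, so -v3)
    bounds = [(0, 10)] * 4           # flux capacity constraints

    res = linprog(c, A_eq=S, b_eq=[0, 0], bounds=bounds)
    print("optimal biomass flux:", res.x[2])
    print("flux distribution:", res.x)

Maximizing a different objective row (e.g. secretion of a metabolite of interest) and inspecting the resulting flux vector is, on a toy scale, the redirection analysis described above.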

 [1] Edwards, J.S., Ramakrishna, R., Schilling, C.H., and Palsson, B.O. (1999) Metabolic Flux Balance Analysis. In: Metabolic Engineering. S.Y. Lee and E.T. Papoutsakis, eds. Marcel Dekker, Inc., New York, pp. 13-57.

[2] Carlsen, M. (1994) a-Amylase production by Aspergillus oryzae. Ph.D. thesis, Department of Biotechnology, Technical University of Denmark.

[3] Pedersen, H. (1999) Protein production by industrial Aspergilli. Ph.D. thesis, Department of Biotechnology, Technical University of Denmark.

[4] Martinelli, S.D. (1994) Aspergillus nidulans as an experimental organism. In: Aspergillus: 50 years on. S.D. Martinelli and J.R. Kinghorn, eds. Elsevier, pp. 33-58.

[5] Pedersen, H., Carlsen, M., and Nielsen, J. (1999) Identification of enzymes and quantification of metabolic fluxes in the wild type and in a recombinant Aspergillus oryzae strain. Appl. Environ. Microbiol. 65, 11-19.


324. Computational Antisense Prediction
Alistair Chalk, Erik Sonnhammer, Center for Genomics Research,Karolinska Institute;
alistair.chalk@cgr.ki.se
Short Abstract:

 We present an in silico model for predicting antisense oligonucleotide (AO) efficacy. Collecting data from AO experiments in the literature generated a dataset of 490 AOs. We trained neural networks; an ensemble gave an overall correlation coefficient of 0.38 and predicts effective AOs with a success rate of 85%.

One Page Abstract:

 Experimental testing of antisense oligonucleotides is time-consuming and costly. Here we present an in silico model for predicting antisense oligonucleotide (AO) efficacy. Collecting sequence and efficacy data from AO scanning experiments in the literature generated a database of 490 AO molecules. Using a set of derived parameters based on AO sequence properties, we trained a neural network model. An ensemble of 10 networks gave an overall correlation coefficient of 0.38. This model can predict effective AOs (>50% inhibition of gene expression) with a success rate of 85%. At this threshold the model predicts approximately 2 AOs per 1000 base pairs, making it a stringent yet practical model for AO selection.
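
Schematically (stand-in models and an invented decision threshold below; the real networks are trained on the 490-AO dataset), ensemble prediction is an average over the trained networks followed by thresholding:

    import random

    def ensemble_predict(features, models, threshold=0.8):
        """Average the ensemble's outputs; call the AO 'effective' if the
        mean clears a precision-oriented threshold."""
        mean = sum(m(features) for m in models) / len(models)
        return mean, mean >= threshold

    # Ten stand-ins for trained networks (real models would be loaded weights).
    random.seed(0)
    models = [lambda f, b=random.uniform(-0.05, 0.05):
                  min(1.0, max(0.0, sum(f) / len(f) + b))
              for _ in range(10)]

    print(ensemble_predict([0.9, 0.85, 0.8], models))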


325. Determination of the active structure of chemotactic peptides
Youssef Wazady, Laboratoire de Recherche, Ecole Supérieur de Technologie, BP 8012 Oasis, Route d'El Jadida, Km 7, Casablanca, Maroc.;
C. H. Ameziane, Département de Chimie, Faculté des Sciences et Techniques Fes Sais, Université Sidi Mohamed Ben Abdell;
wazady@hotmail.com
Short Abstract:

 In order to determine the peptide backbone conformation that is biologically active, the chemotactic peptides were studied with the theoretical method PEPSEA. This study shows that the active structure is a beta-turn.

One Page Abstract:

The tripeptides formyl-Met-X-Phe-OMe (where X is, respectively, one of the a,a-disubstituted amino acids Aib (a-aminoisobutyric acid), Acc5 (1-aminocyclopentanecarboxylic acid) or Acc6 (1-aminocyclohexanecarboxylic acid)) are active analogs of the chemotactic peptide formyl-Met-Leu-Phe-OMe (fMLPOMe), known for its ability to induce release of ly