Databases, Information and Knowledge Management

228a.Use of Runs Statistics for Pattern Recognition in Genomic DNA Sequences
229.Biomine – multiple database sequence analysis tool
230.Efficient virtual screening tool for early drug discovery
231.WebGen-Net: a workbench system for support of genetic network construction
232.ArrayExpress – a Public Repository for Gene Expression Data at the European Bioinformatics Institute
233.Formulation of the estimation methods for the kinetic parameters in cellular dynamics using object-oriented database
234.Towards a Transcript Centric Annotation System Using in silico Transcript Generation and XML Databases
235.Trawler: Fishing in the Biomedical Literaturome
236.GeneLynx: A Comprehensive and Extensible Portal to the Human Genome
237.Representation and integration of metabolic and genomic data: the Panoramix project
238.GDB's Draft Sequence Browser (GDSB) - A bridge between the Human Genome Database (GDB) and Human Genome Draft Sequence.
239.Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT.
240.A tissue classification system
241.Proteome's Protein Databases: Integration of the Gene Ontology Vocabulary
242.Pattern Recognition for Micro-array Data using Context based Divergence and its Validation
243.GIMS - A Data Warehouse for Storage and Analysis of Genomic and Functional Data.
244.Mutable gene model
245.Discovery Support System for Human Genetics
246.Search, a sensitive and exhaustive homology search programme
247.Gene Ontology Annotation Campaign and Spotfire Gene Ontology Plug-In
248.Predicting the transcription factor binding site-candidate, which is responsible for alteration in DNA/protein binding pattern associated with disease susceptibility/resistance
249.Open Exchange of Source Code and Data from the Stanford Microarray Database
250.Portal XML Server: Toward Accessing and Managing Bioinformatics XML Data on the Web
251.PFAM-Pro: A New Prokaryotic Protein Family Database
252.Production System for Mining in Large Biological Dataset
253.Querying Multiple Biological Databanks
254.Automated Analysis of Biomedical Literature to Annotate Genes with Gene Ontology Codes: Application of A Maximum Entropy Method to Text Classification
255.Rat Liver Aging Proteome Database: A web-based workbench for proteome analysis
256.UCspots Before Your Eyes
257.PASS2: Semi-Automated database of Protein Alignments organised as Structural Superfamilies
258.The HAMAP project: High quality Automated Microbial Annotation of Proteomes
259.BioWIC: Biologically what's in common?
260.Structure information in the Pfam database
261.Sequence Viewers for Distributed Genomic Annotations
262.The Immunodeficiency Resource: Knowledge Base for Immunodeficiencies
263.XGI: A versatile high throughput automated sequence analysis and annotation pipeline.
264.Mouse Genome Database: an integrated informatics database resource
265.Methods for the automated identification of substance names in journal articles
266.Development and Implementation of a Conceptual Model for Spatial Genomics
267.Modelling Genomic Annotation Data using Objects and Associations: the GenoAnnot project
268.An Overview of the HIV Databases at Los Alamos
269.Variability of the immunoglobulin superfamily V-set fold using Shannon entropy analysis
270.SNPSnapper – application for genotyping and database storage of SNP genotypes produced by microarray technology
271.An XML document management system using XLink for an integration of biological data
272.Immunodeficiency Mutation Databases in the Internet
273.Martel: parsing flat-file formats as XML
274.Disulphide database (DSDBASE): A handy tool for engineering disulphides and modeling disulphide-rich systems.
275.A Combinatorial Approach to Ontology Development
276.StressDB: The PennState Arabidopsis Stress Database
277.The GeM system for comparative mapping of mammalian genomes
278.TrEMBL protein sequence database: production, interconnectivity and future developments
279.PIR Web-Based Tools and Databases for Genomic and Proteomic Research
280.Classification of Protein Structures by Global Geometric Measures
281.EST Analysis and Gene expression in Populus leaves
282.Perl-based simple retrieval system behind the InterProScan package.
283.Intelligent validation of oligonucleotides for high-throughput synthesis
284.Iobion MicroPoint Curator, a complete database system for the analysis and visualization of microarray data
285.Rapid IT prototype to support macroarray experiments
286.EXProt - a database for EXPerimentally verified Protein functions.
287.HeSSPer: a program to analyze sequence alignments and protein structures derived from HSSP database
288.Protein Structure extensions to EMBOSS
289.GENIA corpus: A Semantically Annotated Corpus in the Molecular Biology Domain
290.Talisman - Rapid Development of Web Based Tools for Bioinformatics
291.The Eukaryotic Linear Motif Database ELM
292.Information search, retrieval and organization relevant to molecular biology
293.Los Alamos National Laboratory Pathogen Database Systems
294.Facilitating knowledge acquisition and discovery through automatic web data mining
295.Functional analysis of novel human full-length cDNAs from the HUNT database at Helix Research Institute
296.(withdrawn)
297.iSPOT and MINT: a method and a database dedicated to molecular interactions
298.An Integrated Sequence Data Management and Annotation System For Microbial Genome Projects
299.A home-made implementation for bibliographic data management: application to a specific protein.
300.Statistical structurization of 30,000 technical terms in medical textbooks: A non-ontological approach for computation of human gene functions.



228a. Use of Runs Statistics for Pattern Recognition in Genomic DNA Sequences
Leo Wang-Kit Cheung, University of Manitoba;
wcheung@cc.umanitoba.ca
Short Abstract:

Based on the finite-Markov-chain-imbedding technique, a recursion is derived for the calculation of the exact distribution of a double runs statistic. Having studied this distribution under different probabilistic frameworks, we can use it not only for detecting DNA signal clustering, but also for revealing homogeneous regions of DNA.

One Page Abstract:

In the field of computational biology, DNA sequences have been studied through mathematical and statistical analyses of patterns. Research on counting problems and the distribution theory of runs and patterns has been heavily influenced by combinatorial methods (Waterman, 95). Owing to the complexity of these methods, the exact distributions of many runs statistics still remain unknown. Recently, a completely different finite Markov chain imbedding (FMCI) method has been introduced for studying distributions of runs and patterns (Fu and Koutras, 94). Essentially, the runs statistic is imbedded into a Markov chain (MC) so that its distribution can be expressed in terms of the transition probabilities of the imbedded MC. Hence, the runs distribution takes a simpler form.

Based on the finite Markov chain imbedding (FMCI) technique, a recursive algorithm is derived for the calculation of the exact distribution of a double runs statistic. With this distribution in hand, we can construct critical regions of a statistical test of randomness against clustering in DNA sequence data. The distribution of this statistic has also been investigated under a hidden Markov model (HMM) framework. This leads to the creation of probabilistic profiles for trapping HMM parameters. Applications of these profiles in conjunction with HMMs for pattern recognition in DNA sequences are illustrated via case studies using real human DNA data provided by Dr. Anders Pedersen at the Center for Biological Sequence Analysis (CBSA).
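The double runs statistic itself is not defined above, so purely as a generic illustration of the finite Markov chain imbedding idea, the following Python sketch computes the exact distribution of the longest success run in n Bernoulli(p) trials by imbedding the current run length in a Markov chain with an absorbing state and iterating its transition matrix (numpy is assumed to be available).

import numpy as np

def prob_no_run(n, k, p):
    """P(no success run of length >= k in n Bernoulli(p) trials), obtained by
    imbedding the statistic in a Markov chain whose state is the current run
    length, with state k absorbing."""
    T = np.zeros((k + 1, k + 1))
    for s in range(k):
        T[s, 0] = 1 - p      # a failure resets the run
        T[s, s + 1] = p      # a success extends the run
    T[k, k] = 1.0            # absorbing: a run of length k has occurred
    dist = np.zeros(k + 1)
    dist[0] = 1.0            # start with an empty run
    for _ in range(n):
        dist = dist @ T
    return dist[:k].sum()    # probability mass never absorbed

def longest_run_distribution(n, p):
    """Exact probability mass function of the longest success run length."""
    cdf = [prob_no_run(n, k + 1, p) for k in range(n + 1)]  # P(L <= k)
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, n + 1)]

print([round(x, 4) for x in longest_run_distribution(n=20, p=0.3)[:6]])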


229. Biomine – multiple database sequence analysis tool
Simon Greenaway, Joseph Weekes, Andrew Blake, Helen Kirkbride, Jerome Jones, Katia Pilicheva, Michelle Simon, Sarah Webb, Ann-Marie Mallon, Mark Strivens, Informatics Group, Mammalian Genetics Unit and UK Mouse Genome Centre, Medical Research Council;
s.greenaway@har.mrc.ac.uk
Short Abstract:

Biomine is a sequence analysis and management tool allowing the parallel searching of multiple sequence databases from a single user interface without the need for specialist bio-informatics skills.

One Page Abstract:

Biomine is a sequence analysis and management tool allowing the parallel searching of multiple sequence databases from a single user interface without the need for specialist bio-informatics skills.

Users log in to the system and are then presented with the options of launching a new search or viewing completed results. New searches can be set up by pasting a FASTA-format sequence into a web form, by entering an accession number, or by uploading a file of FASTA sequences. The user also selects what types of searches are needed, e.g. nucleotide, genomic, EST, protein or one of their custom databases.

The database searches are managed by a Java-based server which controls the parallel launching of database searches. This server is fully configurable, allowing different queuing strategies, server loading controls and full search job controls by means of an administration interface.
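The server internals are not described here; purely as a rough sketch of the idea of launching the same query against several databases in parallel (in Python rather than the Java actually used, and with a placeholder blastall command line that is not Biomine's):

from concurrent.futures import ThreadPoolExecutor
import subprocess

# Hypothetical database names; the real system offers nucleotide, genomic,
# EST, protein and user-defined custom databases.
DATABASES = ["nucleotide", "genomic", "est", "protein"]

def run_search(query_file, database):
    """Launch one external search (placeholder blastall call) and return its
    raw output together with the database name."""
    cmd = ["blastall", "-p", "blastn", "-d", database, "-i", query_file]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return database, result.stdout

def parallel_search(query_file, databases=DATABASES, max_workers=4):
    """Run the same query against several databases concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_search, query_file, db) for db in databases]
        return dict(f.result() for f in futures)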

The results page shows the progress of the user's database searches, and the searches are organised into project folders. Once a search is complete, the user can view a simplified representation or the full output of the results by browsing through their project folders. Project folders may be kept private or shared between members of a research team. Results can be tagged to repeat searches whenever one of the databases is updated. The results link to relevant publicly available databases in an intuitive and easy to use interface.


230. Efficient virtual screening tool for early drug discovery
Prof. Dr. Paul Wrede, CallistoGen AG;
paul.wrede@callistogen.de
Short Abstract:

Today the search for new pharmacological molecules is mainly a matter of trial and error. CallistoGen developed an efficient and robust virtual screening tool called PHACIR® for rapid guided search of new bioactive molecules. The algorithm of PHACIR® is a similarity search based on the pharmacophore concept.

One Page Abstract:

Random search has proved to be an inefficient method for lead discovery. Alternatively, virtual screening algorithms enable a guided search in the high-dimensional chemical space. 2D and 3D pharmacophore models, neural network concepts, and new bioinformatic approaches lead to fast and efficient virtual screening tools. PHACIR® (PHArmaCophore Identification Routine) - a 3D pharmacophore model based algorithm - generates highly enriched focused compound libraries, as demonstrated in several retrospective screenings. The database screening speed exceeds 7000 cmpds/sec on average workstations. Even single topological query information is sufficient for PHACIR screening, i.e. the 2D structure input of only one active compound can be used for scanning large compound libraries. Prospective screening results - a 10% hit rate of biologically active new compounds was found repeatedly - confirm the concept of the PHACIR algorithm. ClassyFire® - CallistoGen's artificial neural network diversity analyser - produces high quality focused compound libraries to identify potential lead candidates with different scaffolds. For the de novo design of biologically active peptides, evolutionary algorithms proved to be very useful. PepHarvester® and Darwinizer® allow a guided search through the high-dimensional sequence space. Compared to known isofunctional sequences, the peptides found are highly diverse.
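PHACIR's pharmacophore representation and scoring are proprietary and not given above; the sketch below only illustrates the general similarity-screening idea, ranking a library against one active compound by Tanimoto similarity of binary feature fingerprints (the fingerprints and the 0.5 cut-off are invented for the example).

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as sets of
    'on' feature indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def screen(query_fp, library, threshold):
    """Rank library compounds by similarity to one active query compound and
    keep those above the threshold (a focused library)."""
    hits = [(cid, tanimoto(query_fp, fp)) for cid, fp in library.items()]
    return sorted([h for h in hits if h[1] >= threshold],
                  key=lambda h: h[1], reverse=True)

query = {1, 4, 7, 9}                                   # features of the active
library = {"cmpd_a": {1, 4, 7, 8},
           "cmpd_b": {2, 3, 5},
           "cmpd_c": {1, 4, 7, 9, 12}}
print(screen(query, library, threshold=0.5))           # cmpd_c, then cmpd_a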

Correspondence: info@callistogen.de


231. WebGen-Net: a workbench system for support of genetic network construction
Mikio Yoshida, Hideaki Shimano, Yukari Shibagaki, Hiroshi Fukagawa, Takeshi Mizuno, Intec Web and Genome Informatics Corp., 1-3-3 Shinsuna, Koto-ku, Tokyo, 136-0075, Japan;
yoshida@gic.intec.co.jp
Short Abstract:

We have developed a workbench system for support of genetic network construction that constructs a genetic network among focused genes by connecting binary relations extracted in advance from various genome databases. This system helps users interpret their hypotheses or experimental results by referring to prior biological knowledge.

One Page Abstract:

Genetic network analysis plays an important role in the determination of a protein's function. However, in order to construct a genetic network, a huge amount of biological data is required, for instance gene transcription regulation, protein-protein interactions, sequence similarity, and the like. Since these data are continuously increasing, it is impractical for a biologist to deal with them alone.

To overcome this problem, the authors propose an interactive system that supports genetic network construction. The system consists of a rearranged database and a graphical user interface module (GUI module). The rearranged database stores binary relations related to genetic networks, collected in advance from public databases and several experimental results. The system constructs a genetic network among focused genes by connecting these relations based on a predefined model. The GUI module can display the constructed genetic network graphically and enables users to edit it. Therefore, users can grasp their experimental results and confirm differences between their hypotheses or their own experimental results and the prior biological knowledge.

To evaluate the efficiency of this system, we have utilized it to help interpret the experimental results of a comprehensive protein-protein interaction screening of budding yeast (consisting of 4,549 interactions among 3,278 proteins). In this evaluation, two effective facilities have been developed. One is Connected Components Extraction (CCE) and the other is Alternative Paths Derivation (APD). CCE is utilized to estimate a protein's function, and APD to validate the results. The experiment was based on the yeast two-hybrid system, in which one or two intervening proteins between prey and bait may cause a false positive result. APD finds all such possibilities by extracting A-X-B (or A-X-Y-B) paths from previously reported protein-protein interactions. Furthermore, an interaction having more than two alternative paths suggests that the proteins constituting the paths compose a protein complex. Using these features, 195 connected components and 182 (299) alternative paths with one (two) intervening protein(s) were extracted.
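A minimal sketch of the two facilities (not the authors' implementation): connected components are extracted from a set of pairwise interactions, and A-X-B alternative paths are derived for a given bait/prey pair.

from collections import defaultdict

def build_graph(interactions):
    """Adjacency map from a list of (protein_a, protein_b) interaction pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def connected_components(graph):
    """CCE: group proteins into connected components by depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components

def alternative_paths(graph, a, b):
    """APD (A-X-B case): proteins X reported to interact with both A and B."""
    return sorted((graph[a] & graph[b]) - {a, b})

interactions = [("A", "B"), ("A", "X"), ("X", "B"), ("A", "Y"), ("Y", "B"), ("C", "D")]
g = build_graph(interactions)
print(connected_components(g))         # two components
print(alternative_paths(g, "A", "B"))  # ['X', 'Y'] - suggests a possible complex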

Selected Reference: Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, and Sakaki Y., A comprehensive two-hybrid analysis to explore the yeast protein interactome, Proc Natl Acad Sci U S A 2001 Apr 10;98(8):4569-74


232. ArrayExpress – a Public Repository for Gene Expression Data at the European Bioinformatics Institute
Alvis Brazma, Ugis Sarkans, Helen Parkinson, Alan Robinson, Mohammadreza Shojatalab, Jaak Vilo, EMBL-EBI;
brazma@ebi.ac.uk
Short Abstract:

ArrayExpress (www.ebi.ac.uk/microarray) is a public repository for microarray based gene expression data, which covers the requirements of the Minimum Information About a Microarray Experiment (MIAME), developed by the MGED consortium. It supports data import in MAML format, which is an XML-based data exchange format (see www.mged.org).

One Page Abstract:

By allowing “snapshots” of gene expression for tens of thousands of genes in a single experiment, microarrays have already profoundly affected life science research and are producing massive amounts of functional genomics data. Well organized public repositories for such data are needed if these data are to be freely accessed and explored by the life science community and not lost in the future. There are a number of clear reasons, in addition to the size of the datasets, why databases of microarray data are not simple. First, gene expression data make sense only in the context of a detailed description of the experimental conditions, which have to be described in a systematic way to permit queries and data mining. Second, datasets obtained in different experiments may not be directly comparable – no standard reliable units for measuring gene expression exist. Third, microarray technology is still developing rapidly, therefore microarray data management systems should be extremely flexible. ArrayExpress is a public repository for microarray based gene expression data, being established by the EMBL-EBI (www.ebi.ac.uk/microarray/). We have developed an object model for representing microarray based gene expression data, which can be used for building custom-made gene expression databases. The model is freely available from www.ebi.ac.uk/arrayexpress/. The model covers the requirements of the Minimum Information About a Microarray Experiment (MIAME) developed by the Microarray Gene Expression Database (MGED) group (for more information about MGED and MIAME see www.mged.org). The key feature of the object model is the notion of generic, technology-independent expression data matrices, facilitating expression data warehouse development and data mining. A structured sample description framework is provided, encouraging data providers to describe their laboratory protocols in a formal way. ArrayExpress is based on the described object model and is implemented in Oracle. It can store both raw and processed data, and it is independent of experimental platforms, image analysis and data normalization methods. The repository supports data import in MAML format (MicroArray Markup Language, an XML-based data exchange format developed by the MGED consortium). Currently we are developing data submission and annotation tools to facilitate the data deposition process, and a Web based data query interface. ArrayExpress will be linked to Expression Profiler, which is an Internet based microarray data analysis tool (ep.ebi.ac.uk). In the future ArrayExpress will support interfaces currently under development within the Object Management Group (OMG), in which the EBI is actively participating. References: One-stop shop for microarray data. Brazma, A., A. Robinson, G. Cameron, and M. Ashburner. Nature 403:699-700 (2000).


233. Formulation of the estimation methods for the kinetic parameters in cellular dynamics using object-oriented database
Takashi Naka, Saitama Junior College, Kazo-shi, Saitama 347-8503, Japan;
Mariko Hatakeyama, Mio Ichikawa, Computational Genomics Team, Bioinformatics Group, RIKEN Genomic Sciences Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yok;
Naoto Sakamoto, Institute of Information Sciences and Electronics, University of Tsukuba, Tsukuba-shi, Ibaraki 305-8573, Japan;
Akihiko Konagaya, Computational Genomics Team, Bioinformatics Group, RIKEN Genomic Sciences Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yok;
naka@sjc.ac.jp
Short Abstract:

The simulation analysis with mathematical models is effective for elucidation of the signal transduction mechanisms in cells. Estimation and/or adjustment of the kinetic parameters are necessary to perform the simulation of cellular dynamics. The methods for estimation/adjustment of the kinetic parameters are formulated in an object-oriented database.

One Page Abstract:

Intensive studies of the signal transduction processes in cells have revealed their function as elaborate information processing systems. Simulation analysis with mathematical models is effective for elucidating the mechanisms of this cellular information processing. Construction of mathematical models employs the kinetic reaction schemes reported in experimental studies, and a collection of kinetic parameters such as rate constants, diffusion coefficients and the concentration distribution of every chemical reactant is needed to perform simulations with the constructed models. As some kinetic parameters required for simulation are not always described in the literature, estimation of these missing data is essential. It is further necessary to adjust the available parameter values when constructing the mathematical model for a specific reaction system, because the experimental conditions of the measurements may differ between publications. In this study, the procedures to estimate or adjust the kinetic parameters are collected from the literature and formulated as relationships between the measured values of the kinetic parameters and the experimental conditions under which the measurements were made. These relationships are integrated into the bio-molecular database. An object-oriented database, which makes each data item a class object, is adopted as the framework of the database management system. The formulated relationships for estimation or adjustment are represented as constructor methods, and the estimation or adjustment of the data is carried out at the time of instance generation, which corresponds to a reference to the data. Furthermore, an extension of the database is attempted so as to perform simulations by regarding a set of the generated instance objects as elements of a particle simulation framework.
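The actual adjustment formulas are not given above; as a hypothetical illustration of encoding an adjustment rule in a constructor so that it runs at instance generation (i.e. when the datum is referenced), the sketch below rescales a rate constant from the temperature at which it was measured to the model temperature using an assumed Arrhenius relationship.

import math

R = 8.314  # gas constant, J mol^-1 K^-1

class RateConstant:
    """Kinetic parameter whose value is adjusted to the model's experimental
    conditions at instantiation time (the constructor-method idea)."""

    def __init__(self, measured_value, measured_temp_K, model_temp_K,
                 activation_energy_J_mol):
        # Hypothetical adjustment rule: Arrhenius rescaling between temperatures.
        factor = math.exp(-activation_energy_J_mol / R *
                          (1.0 / model_temp_K - 1.0 / measured_temp_K))
        self.value = measured_value * factor
        self.conditions = {"temperature_K": model_temp_K}

# A rate constant measured at 25 C, referenced by a model running at 37 C.
k = RateConstant(measured_value=1.2e3, measured_temp_K=298.15,
                 model_temp_K=310.15, activation_energy_J_mol=50e3)
print(round(k.value, 1))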


234. Towards a Transcript Centric Annotation System Using in silico Transcript Generation and XML Databases
Anatoly Ulyanov, Michael Heuer, Steven Bushnell, Aventis Pharmaceuticals;
Anatoly.Ulyanov@Aventis.com
Short Abstract:

The presented system is a resource for annotating genomic sequences and, ultimately, for identifying pharmaceutically relevant targets. It allows for the assignment of annotation using homology-based methods, improved EST clustering through the use of high-quality seed sequences, and, through these results, the detection of potential alternative splicing of genes.

One Page Abstract:

A significant effort in Bioinformatics is made to find potential therapeutic targets for drug candidates. The success of this effort depends mainly on identifying and annotating all transcripts from the human and other important mammalian genomes. In the current situation, where roughly one third of human transcripts are present in publicly available finished genome sequences, we took the approach of consolidating sequence information from public and proprietary sources of DNA sequences to advance internal annotation efforts. In this work we developed a system called TransDB, which computes DNA transcripts in silico using DNA sequences from GenBank. Central to the value of this system is a "rule base" that reveals GenBank entries with a high probability of being expressed. The rule base automatically rejects ESTs, predicted genes, and artificial or synthetic sequences. The system then returns the subset of GenBank records that have experimental evidence of transcription. TransDB is compiled in three phases. The first phase includes filtering GenBank entries using information in FEATURE tables, and generating a stack of operations like "region", "join" and "complement" to compute transcripts. The second phase is a compiler, which reads the operation stack and produces TransDB entries. Every entry in TransDB then inherits an accession number and a product description from the source GenBank entry. The final phase includes another filter, which removes redundant entries. The transcripts are computed from descriptions located in feature tables including mRNA, CDS, exon, 5’UTR, and 3’UTR (http://www.aventisandthegenome.com/services.htm).

Assembled ESTs and the in silico transcripts generated here provide an initial set of consensus DNA sequences. If treated as persistent objects, these consensus sequences can be flexibly annotated using XML databases. Here, we use XML technology to “glue” together heterogeneous sources of data, and use applications to compute and view results. Every source of information, including sequence homology results, classification and ontology assignments, expression profiles, marker homologies, etc., is presented in the system as an XML database. Adding a new piece of information about transcripts means adding a new XML database. Integrating this new database into the system requires only adding the name of the new database to the list of registered databases and writing a piece of code which reads and presents this data in a browser. We successfully used this technology to build a system that includes, in addition to the annotation sources mentioned above, such components as personal annotation with a history of changes, classifications, cross-species bridges, graphical presentations, and a query system.
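As a minimal sketch of the kind of operations the compiler phase evaluates (not the authors' code), the snippet below computes a transcript from a toy genomic sequence given region/join/complement instructions of the sort found in GenBank FEATURE locations.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def region(seq, start, end):
    """Extract a 1-based, inclusive region, as in GenBank feature locations."""
    return seq[start - 1:end]

def join(pieces):
    """Concatenate exon pieces into one transcript."""
    return "".join(pieces)

def complement(seq):
    """Reverse-complement, for features annotated on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

# Toy genomic sequence and a feature equivalent to join(2..4,8..10).
genomic = "AACGTTTGCAAT"
transcript = join([region(genomic, 2, 4), region(genomic, 8, 10)])
print(transcript)               # ACGGCA
print(complement(transcript))   # its reverse complement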


235. Trawler: Fishing in the Biomedical Literaturome
Lars Arvestad, Wyeth Wasserman, Center for Genome Research, Karolinska Institutet;
lars.arvestad@cgr.ki.se
Short Abstract:

We present Trawler, a set-based interface between sequences and the biomedical literature. Given a set of sequence or abstract identifiers, Trawler captures the associated articles. Using cycles of feedback, Trawler expands and refines the literature collection. A unique feature is the application of "attractors", which allow categorization of the papers.

One Page Abstract:

The information contained in the ocean of biomedical literature far exceeds that which can be obtained by sequence analysis. In order to facilitate access to this underutilized resource from sequence-based starting positions, we have developed Trawler, a semi-automated text analysis system for PubMed abstracts.

Trawler aims to ease access to the biomedical literature by providing a mode of operation based on a working set of articles. This working set can be seeded by an initial set of sequence identifiers or articles. In addition to basic operations, such as inspection and trimming, Trawler offers automatic classification and set-based extension of the working set.

Automatic Classification

We explore two classification methods: scoring and attractors. The scoring method is based on scores assigned to discriminating words (for instance Marcotte et al, Bioinformatics 2001), while the attractor method is based on PubMed's neighbor definition procedure. Both methods utilize a set of "standard" articles for major fields of research, e.g., gene expression analysis, SNP analysis, or sequencing technology. For the scoring method, discriminative words are extracted for each research field and associated word scores are computed. For an article to be classified to a class C, its associated score S_C must exceed a threshold. The attractor method identifies the neighboring articles for each set of field-typical articles, and identifies the intersection of these attractor sets with the article sets.
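A toy sketch of the scoring method only (the classes, word scores and thresholds below are invented, not Trawler's): an abstract is assigned to every class C whose summed word score S_C reaches that class's threshold.

# Invented discriminative word scores for two example research fields.
WORD_SCORES = {
    "expression": {"microarray": 2.0, "expression": 1.5, "hybridization": 1.0},
    "snp":        {"snp": 2.5, "polymorphism": 2.0, "genotype": 1.0},
}
THRESHOLDS = {"expression": 2.5, "snp": 2.5}

def classify(abstract_text, word_scores=WORD_SCORES, thresholds=THRESHOLDS):
    """Return every class whose score S_C over the abstract meets its threshold."""
    words = abstract_text.lower().split()
    labels = []
    for cls, scores in word_scores.items():
        s_c = sum(scores.get(w, 0.0) for w in words)
        if s_c >= thresholds[cls]:
            labels.append(cls)
    return labels

print(classify("Microarray expression profiling of liver samples"))  # ['expression']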

Set-based extension

Two methods of extending the working set are considered. The first and simplest utilizes PubMed's related-article feature. However, our approach is based on the entire article set, rather than individual members. As Trawler maintains scores for the relevance of each article, the contribution of each paper to the expanded set is weighted, providing an ordered list of potentially associated articles.

The second method is based on common sequences. From an initial article identifier or set, Trawler identifies the associated sequences, and incorporates other articles that address sequences for the same gene in a gene index, for instance the human sequence index for the GeneLynx system (www.genelynx.org).

Trawler can be accessed via a link from our departmental resources page: http://www.cgr.ki.se/cgr/services


236. GeneLynx: A Comprehensive and Extensible Portal to the Human Genome
Boris Lenhard, Wyeth W. Wasserman, Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden;
Boris.Lenhard@cgr.ki.se
Short Abstract:

GeneLynx is a meta-database of information on human genes present in public databases. The GeneLynx project is based on the goal that for every human gene researchers should be able to access a single page containing links to all the information pertinent to that gene. GeneLynx is available at http://www.genelynx.org.

One Page Abstract:

GeneLynx is a meta-database of human genes associated with an extensive collection of hyperlinks to gene-specific information in diverse databases publicly available on the internet. The GeneLynx project is based on the notion that for every human gene, given any gene-specific identifier (accession number, approved gene name, text or sequence), researchers should be able to access a single web page that provides a set of links to all the publicly available information pertinent to that gene.

GeneLynx is implemented as an extensible relational database with an intuitive and user-friendly web interface, containing a number of unique features to increase its efficiency in collating data about genes of interest. The data is automatically extracted from more than thirty external resources, using appropriate approaches to maximize the coverage. The system includes a set of software tools for database building and curation. An indexing service facilitates linking from external sites.

Among the unique features of GeneLynx are a communal curation system for user-aided improvement of data quality and completeness, and a standardized protocol for adding new resources to the database.

GeneLynx can be accessed freely at http://www.genelynx.org.


237. Representation and integration of metabolic and genomic data: the Panoramix project
Morgat A., Boyer F., INRIA Rhône-Alpes, Helix project;
Rivière-Rolland H., Genome Express, France;
Ziebelin D., Université Joseph Fourier, France;
Rechenmann F., Viari A., INRIA Rhône-Alpes, Helix project;
anne.morgat@inrialpes.fr
Short Abstract:

We present three knowledge bases (Genomix, Proteix and Metabolix) dedicated to three aspects of bacterial genome analysis (Genes, Enzymatic Assemblies and Metabolism). All of them are based on an object-oriented representation using classes and associations. Their use will be exemplified by the problem of reconstructing metabolic pathways.

One Page Abstract:

We have developed three knowledge bases (Genomix, Proteix and Metabolix) dedicated to bacterial genome analysis. Each of these bases deals with a particular aspect of genomic and post-genomic data :

"Genomix" concerns organisms and their genes with a special emphasis to completely sequenced bacteria. It contains data about the genes, their phylogenetic relationship (paralogy, orthology) and their organisation along the chromosome (bacterial synteny).

"Proteix" deals with the product of the genes (proteins) with a special emphasis to enzymes and molecular assemblies constituting molecular enzymes.

"Metabolix" is dedicated to intermediary metabolism and models biochemical reactions, chemical compounds and catalytic activities.

All these knowledge bases (KBs) have been developed using an object-based data model and implemented with the AROM representation system (http://www.inrialpes.fr/romans/arom), developed in the Romans project at INRIA Rhône-Alpes. AROM provides a powerful (UML-like) framework based on classes and associations. The explicit representation of n-ary associations (i.e. connecting more than two classes) turned out to be very useful in modelling complex relationships between objects (such as alternative substrates in Metabolix or molecular enzymes in Proteix).

These three bases are, of course, strongly interconnected, for instance through the correspondence association between genes and proteins (Genomix and Proteix KBs), or the relationship between molecular enzymes and catalytic activities (Proteix and Metabolix KBs). This allows questions such as "Given two bacterial species, is there any conservation in the chromosomal arrangement of their genes coding for enzymes acting in a given metabolic pathway?" to be answered. At present, this kind of question can be answered by combining queries (expressed in a built-in query language, AML) with the AROM Java API.

Besides querying, another application of these three knowledge bases is ab initio pathway reconstruction. Here we state this problem as finding a minimal cost path in a graph connecting compounds through biochemical reactions. The cost function may depend on the number of reactions involved, the total energetic balance (NADH, ATP, etc.) or the chromosomal distance between the genes coding for the enzymes catalysing the biochemical reactions.
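A minimal sketch of the stated formulation (a shortest-path search over a graph whose nodes are compounds and whose edges are reactions), with an invented cost per reaction; the real cost function may also weigh the energetic balance or chromosomal gene distance:

import heapq

def shortest_pathway(reactions, source, target):
    """Minimum-cost chain of reactions linking two compounds.
    `reactions` maps a reaction id to (substrate, product, cost)."""
    graph = {}
    for rid, (sub, prod, cost) in reactions.items():
        graph.setdefault(sub, []).append((prod, rid, cost))
    queue, best = [(0.0, source, [])], {}
    while queue:
        cost, compound, path = heapq.heappop(queue)
        if compound == target:
            return cost, path
        if compound in best and best[compound] <= cost:
            continue
        best[compound] = cost
        for nxt, rid, c in graph.get(compound, []):
            heapq.heappush(queue, (cost + c, nxt, path + [rid]))
    return None

toy_reactions = {
    "R1": ("glucose", "g6p", 1.0),
    "R2": ("g6p", "f6p", 1.0),
    "R3": ("glucose", "f6p", 3.0),   # a costlier alternative route
}
print(shortest_pathway(toy_reactions, "glucose", "f6p"))  # (2.0, ['R1', 'R2'])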

On the poster we shall present the conceptual models behind these three bases together with some preliminary results on the pathway reconstruction problem.


238. GDB's Draft Sequence Browser (GDSB) - A bridge between the Human Genome Database (GDB) and Human Genome Draft Sequence.
Weimin Zhu, Christopher J. Porter, Bioinformatics Supercomputing Center, Hospital for Sick Children;
C. Conover Talbot, Johns Hopkins University, Baltimore;
Kenny Li, Sadananda Murthy, A. Jamie Cuticchia, Bioinformatics Supercomputing Center, Hospital for Sick Children;
wzhu@sickkids.on.ca
Short Abstract:

GDB's Draft Sequence Browser (GDSB) extends the GDB schema by annotating GDB objects on the Golden Path human draft sequence assembly. GDSB data can be browsed by chromosome, contig, and physical locations, searched by clone or GDB object ID/name, and accessed through links from GDB objects.

One Page Abstract:

The Genome Database (GDB, http://www.gdb.org) is a public repository of human genomic data on genes, STSs, clones and variation. Mapping data from large genome centers and smaller mapping efforts, where available, are all represented in GDB. These data are integrated to construct and regularly update a calculated comprehensive map that reflects our most current understanding of the human genome. As human genome research shifts in emphasis from mapping to sequence and function analysis, the scope of the GDB schema has to be extended to meet the needs of the scientific community (Cuticchia, 2000). The GDB Draft Sequence Browser (GDSB) is one approach we are using to accommodate these needs. The GDSB places objects (Genes, Clones, STSs) from GDB in the context of the Golden Path draft sequence assembly. The draft sequence data can be browsed by chromosome and by contig, and can be searched for contigs containing a particular sequence, clone or other GDB object, or by physical location. Objects are cross-referenced between the draft sequence and GDB, allowing easy transition from a GDB object to its sequence position, or from an object on the sequence to its full GDB record. In the first release of GDSB, results are displayed in tables of textual data. The next release will introduce graphic search, browsing, and display tools. Future releases will also include other assemblies, and additional sequence annotations, including SNPs, gene structure, protein, and functional information. In addition to automated and in-house curation, we will support community annotation of the sequence. GDSB is but one example of our program to extend GDB's traditional schema to maintain GDB as a unique source of human genome information, as evidenced by the integration of GDB, GDSB, and the GDB e-PCR database (Porter, 2001). 1. Cuticchia AJ (2000). Future vision of the GDB human genome database. Human Mutation 15:62-67. 2. Porter CJ (2001). Reverse electronic PCR (e-PCR) using the GDB e-PCR database. HGM2001, #307.


239. Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT.
Ernst Kretschmann, Wolfgang Fleischmann, Rolf Apweiler, European Bioinformatics Institute;
ketsch@ebi.ac.uk
Short Abstract:

Protein annotation in SWISS-PROT follows certain patterns. It largely depends on the organism in which the protein was found and on signature matches of its sequence. The C4.5 data mining algorithm was used to detect annotation rules, which were evaluated and applied to as yet unannotated proteins in TrEMBL.

One Page Abstract:

The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools without the use of automated annotation systems is not able to keep up with the ever increasing quantity of data that is submitted. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations.

The standard data mining algorithm C4.5 was successfully applied to gain knowledge about the keyword annotation in SWISS-PROT. 11306 rules were generated, which are provided in a database and can be applied to as yet unannotated protein sequences and viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that by applying them to arbitrary proteins 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
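C4.5 itself is not commonly packaged as a Python library; purely as a stand-in, the sketch below trains a CART decision tree (scikit-learn, entropy criterion) on invented boolean features of the kind described, taxonomic group and signature matches, and prints the induced rules predicting a keyword.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: one row per protein, boolean features.
feature_names = ["taxon_is_bacteria", "has_signature_PF00069", "has_signature_PF00005"]
X = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
y = [1, 0, 1, 0, 1, 0]   # invented labels: does the protein carry the keyword?

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=1)
tree.fit(X, y)
print(export_text(tree, feature_names=feature_names))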

The results of the automatic data mining process can be browsed on http://golgi.ebi.ac.uk:8080/Spearmint

The source code is available upon request.


240. A tissue classification system
Shiri Freilich, Ruth Barshir, Tali Vishne, Shachar Zehavi, Jeanne Bernstein, Compugen Ltd.;
shirif@compugen.co.il
Short Abstract:

GenBank library annotations were used to classify ESTs into a hierarchical tissue system sensitive to different pathological conditions. The system utilizes Compugen's clustering algorithm to provide a gene expression profile, and it is used for the identification of tissue-specific genes, potentially useful as diagnostics for different pathological conditions.

One Page Abstract:

The accumulating data about expressed sequences provide a first insight into the molecular mechanisms of development and differentiation. We present here a tissue classification system that provides a gene-tissue expression profile using that data.

We first analyzed the relevant GenBank data for the construction of a tree-like classification system. For each EST the following fields were scanned while searching for defined keywords: tissue, tissue library, library, organ. The classification system enables recursive summation at each node in the tree. For example, if the ileum is classified under bowel, the bowel's classified ESTs will automatically include the ileum's ESTs. The keywords also solve the GenBank synonym problem (e.g. heart and cardiac have the same id entry). The system provides different entries for different pathological conditions. For example, normal pancreas and tumoric pancreas would have different accessions beneath pancreas, and these accessions can be united or distinguished in a flexible manner, according to user demands.

Our next step was to move forward from the EST level to the gene level. For this purpose we used Compugen's clustering algorithms in order to obtain a tissue expression profile for each gene. Tissue specific genes are defined as genes composed of at least 5 ESTs, in which at least 80% of the ESTs are classified to the desired tissue (according to user definitions). Using Compugen's database we have identified about 4000 potential tissue-specific genes.
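A minimal sketch of the two ideas above (recursive EST summation over the tissue tree, and the at-least-5-ESTs / at-least-80% specificity test), with an invented tissue hierarchy and invented counts:

# Invented tissue hierarchy: child -> parent.
PARENT = {"ileum": "bowel", "colon": "bowel", "bowel": "digestive", "pancreas": "digestive"}

def total_ests(tissue, direct_counts, hierarchy=PARENT):
    """Recursive summation: a node's EST count includes all of its descendants."""
    total = direct_counts.get(tissue, 0)
    for child, parent in hierarchy.items():
        if parent == tissue:
            total += total_ests(child, direct_counts, hierarchy)
    return total

def is_tissue_specific(gene_est_tissues, tissue, min_ests=5, min_fraction=0.8):
    """A gene is tissue specific if it has at least min_ests ESTs and at least
    min_fraction of them come from the tissue or its sub-tissues."""
    counts = {}
    for t in gene_est_tissues:
        counts[t] = counts.get(t, 0) + 1
    n_total = len(gene_est_tissues)
    n_tissue = total_ests(tissue, counts)
    return n_total >= min_ests and n_tissue / n_total >= min_fraction

ests = ["ileum", "ileum", "colon", "bowel", "ileum", "pancreas"]
print(is_tissue_specific(ests, "bowel"))   # 5 of 6 ESTs -> True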

Tissue specific genes differentially expressed in normal versus pathological conditions have the potential to serve as diagnostic and/or prognostic markers. A group of 400 genes that are both tumor specific and tissue specific was identified. Twenty-two colorectal-tumor specific genes are currently being validated in our lab.

Our future goals include the extension of the gene-tissue classification system into a splice variant-tissue classification system. Splice variants are considered to be one of the major elements in vertebrate differentiation. Compugen's expertise in the identification of splice variants will be used to build a splice variant-tissue expression profile that may be interesting to explore.


241. Proteome's Protein Databases: Integration of the Gene Ontology Vocabulary
Elizabeth Evans, Laura Selfors, Burk Braun, Matt Crawford, Kevin Roberg-Perez, Proteome, a division of Incyte Genomics;
ee@proteome.com
Short Abstract:

Proteome's protein databases integrate knowledge from the research literature with sequence and software tools to produce a unique resource for biologists. We have now integrated the Gene Ontology (http://www.geneontology.org/) vocabulary into our hand-curated databases, providing proteins with controlled vocabulary designations that are highly detailed and accurate.

One Page Abstract:

Proteome's mammalian databases integrate the accumulated knowledge from the research literature with genomic information and software tools to produce a powerful resource for bioinformatic scientists and biologists of all disciplines. Our protein report pages are interconnected to allow searching across multiple proteins and species for common protein and gene characteristics.

Recently, we have integrated the powerful descriptive terminology of the Gene Ontology consortium (http://www.geneontology.org/) into the curation techniques and tools used to create our databases. We will describe various aspects of this project, including an outline of how the integration was accomplished and how it is kept current, a statistical portrait of term usage, and what the impact has been on our 'property-based' descriptions of mammalian proteins. We will give examples of ways that GO terms curated by our staff from the primary research literature can be used to create a detailed picture of protein function, based solely on the GO molecular function, biological process and cellular component properties. In addition, the information in our databases enables us to transfer knowledge about known proteins to unknown proteins, based on protein sequence similarity. The GO ontologies can also be used to describe these predicted properties of uncharacterized proteins.


242. Pattern Recognition for Micro-array Data using Context based Divergence and its Validation
W. D. Tembe, Anca L. Ralescu, Future Intelligence Technology Laboratory, University of Cincinnati;
wtembe@ececs.uc.edu
Short Abstract:

A deterministic unsupervised pattern recognition algorithm driven by user-adjustable parameters to identify genes with similar expression levels is described. A context-dependent similarity metric and a general-purpose validation methodology are proposed. Clusters for Saccharomyces cerevisiae and validation results for protein localization in yeast data are presented.

One Page Abstract:

We present a new approach to pattern recognition, characterization and validation for the micro-array data that is effective, transparent and flexible. It allows for parameterized experimentation, bridging the gap between the computational aspects and the domain dependent issues of the problem.

In many respects this approach departs from the bulk of those adopted so far. More precisely, like other previous approaches it presents an algorithm to gather/group data points (high dimensional vectors); however, unlike general clustering algorithms where data points are gathered by optimizing a performance measure (which maximizes discrepancy between clusters and at the same time minimizes discrepancy within a cluster), in this approach explicit criteria are used to achieve and control such grouping. Groups are formed iteratively based on criteria on various parameters. In addition to algorithmic issues we consider the added requirement that the results be amenable to an interpretation which is meaningful to the application domain from which the data come. The following are the main points of the approach:

* Data points are gathered in "structured clusters" in which members are differentiated among themselves by associating with each an individual "weight" which conveys its importance/contribution within the cluster.

* A global cluster measure called "cluster integrity" captures cluster tightness and is used to constrain the formation and expansion of clusters.

* Data points are grouped into clusters based on their "similarity". Underlying the similarity is the new concept of "context-dependent divergence", a non-symmetric, distance-like measure of contrast. Unlike other well-known definitions for divergence (e.g. for probability distributions) such as the Kullback-Leibler divergence, ours includes, in addition to the local discrepancy between two data points, a "context component". The similarity measure is defined based on the (normalized) divergence measure. This asymmetric measure is meant to capture some of the similarity features discussed in the literature. It is first defined for individual data points and then, by aggregation (e.g. weighted linear combination), for clusters.

* A "fuzzy reasoning" algorithm is used to implement a selected merging strategy which is defined in terms of conditions on the mutual similarities with respect to a threshold on the similarity, and conditions on cluster integrity, through a threshold on this measure. The fuzzy reasoning module is user-driven, in the sense that the user can input the value of the threshold controlling the grouping but the system is also able to suggest (based on a nalysis of the data) suitable values for this threshold. Varying the parameters of the fuzzy sets used in this system can further affect the merging strategy.

* Three complementary ways of cluster visualization are described, therefore providing the user with the opportunity to explore hypothetical cluster formation strategies.

* There is no assumption on the data set beyond the given data points along with a normalization, such that for a generic data point x, |x| <= M for some M > 0.

* The approach is entirely un-supervised and hence the question of validating results arises. We propose a general validation strategy which can be applied to any un-supervised algorithm. This calls for computing two measures of overlap between two groups/clusters, in terms of a "necessary support" based on the actual relative overlap between the two groups, and a "possible support" based on expected relative overlap under all possible cluster formation.

Results of the approach to pattern extraction for the Saccharomyces cerevisiae (yeast) micro-array data and of the validation for the protein localization data set are presented, graphically illustrated and commented on. They strongly support the claims made about the merits of the adopted approach.


243. GIMS - A Data Warehouse for Storage and Analysis of Genomic and Functional Data.
Mike Cornell, Norman W. Paton, Shengli Wu, Paul Kirby, University Of Manchester;
Karen Eilbeck, Celera;
Andy Brass, Crispin J. Miller, Carole A. Goble, Stephen G. Oliver, University Of Manchester;
mcornell@cs.man.ac.uk
Short Abstract:

GIMS is an object database that models the eukaryote genome and integrates it with functional data on the transcriptome and on protein-protein interactions. We used GIMS to store the yeast genome and demonstrate how storage of diverse genomic data can be beneficial for analysing transcriptome data using context-rich queries.

One Page Abstract:

Effective analysis of genome sequences and associated functional data requires access to many different kinds of biological information. For example, when analysing transcriptome data, it may be useful to have access to the sequences upstream of the genes, or to the cellular location of their protein products. The information may currently be stored in different formats at different sites that do not use techniques that readily allow analysis in conjunction with other information. The Genome Information Management System (GIMS) is an object database that integrates genome sequence data with functional data on the transcriptome and on protein-protein interactions in a single data warehouse. We have used GIMS to store the yeast genome and to demonstrate how the integrated storage of diverse kinds of genomic data can be beneficial for analysing data using context-rich queries. GIMS demonstrates the benefits of an object based approach to data storage and analysis for genomic data. It allows data to be stored in a way that reflects the underlying mechanisms in the organism, and permits complex questions to be asked of the data. This poster provides an overview of the GIMS system and describes some analyses that illustrate its use for mining transcriptome data. We show how data can be analysed in terms of gene location, attributes of the genes' protein products (such as cellular location, function or protein:protein interactions), and regulatory regions present in upstream sequences.


244. Mutable gene model
Giuseppe Insana, Heikki Lehväslaiho, EMBL-EBI;
insana@ebi.ac.uk
Short Abstract:

We have created Perl classes and applications to analyze and validate sequence variations in expressed genes. A gene is represented as a double linked list with unique labels and DNA nucleotides as values. Exons, transcripts and translation products are virtual objects pointing to the DNA structure. See http://www.ebi.ac.uk/mutations/toolkit/.

One Page Abstract:

To effectively manage the complexities of sequence changes, new tools are needed: to analyse mutations and their propagation from the DNA to transcripts and translation products, to store variation information in an exchangeable and extensible format, and to automatically validate existing mutation databases.

For these purposes, we have implemented a mutable gene model in two sets of Perl modules: Bio::LiveSeq and Bio::Variation (http://www.ebi.ac.uk/mutations/toolkit/).

The Bio::LiveSeq modules read EMBL formatted sequences and create a double-linked list data structure for the DNA sequence. Instead of storing exons, transcripts and translation products as separate strings, they are computed dynamically from DNA. The use of pointers to individual nucleotides on the DNA sequence makes the structure completely independent of any positional information and robust to changes such as insertions or deletions. Multiple mutations with interdependent effects, e.g. frame shift mutation followed by another frame restoring mutation, are easily handled.
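Bio::LiveSeq itself is written in Perl; the Python sketch below only illustrates the core idea: nucleotides live in a doubly linked structure keyed by stable labels, so a feature such as an exon references labels rather than numeric coordinates and survives an upstream insertion.

import itertools

class LiveSeq:
    """DNA as a doubly linked list of labelled nucleotides; positions are
    stable labels, so features need no coordinate updates after indels."""

    def __init__(self, seq):
        self._ids = itertools.count(1)
        self.nodes = {}            # label -> [base, prev_label, next_label]
        self.head = None
        prev = None
        for base in seq:
            label = next(self._ids)
            self.nodes[label] = [base, prev, None]
            if prev is None:
                self.head = label
            else:
                self.nodes[prev][2] = label
            prev = label

    def insert_after(self, label, seq):
        """Insert nucleotides after a label; downstream labels keep their
        identity, so features pointing at them are unaffected."""
        for base in seq:
            new = next(self._ids)
            nxt = self.nodes[label][2]
            self.nodes[new] = [base, label, nxt]
            self.nodes[label][2] = new
            if nxt is not None:
                self.nodes[nxt][1] = new
            label = new

    def subseq(self, start_label, end_label):
        """Walk the list between two labels (e.g. the ends of an exon)."""
        out, label = [], start_label
        while label is not None:
            out.append(self.nodes[label][0])
            if label == end_label:
                break
            label = self.nodes[label][2]
        return "".join(out)

gene = LiveSeq("ATGGTGCAC")
exon = (1, 9)                 # labels, not numeric coordinates
gene.insert_after(3, "AAA")   # insertion upstream of the exon's end
print(gene.subseq(*exon))     # ATGAAAGTGCAC - the exon's labels still apply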

The use of pointers also facilitates easy conversion between different coordinate systems (based on entry, coding sequence or whole gene). Additional protein level information can be read from SWISS-PROT into the system allowing comparative analysis of nucleotide and polypeptide features.

Bio::Variation modules collect variation information as differences between the reference and the variant sequences and calculate descriptive attributes (labels like "missense" and restriction site changes). Permanent storage is possible in EMBL-like flatfile and XML formats.

An online application, "Mutation Checker" is available to researchers wishing to see the effect of a mutation on any chosen EMBL entry. (http://www.ebi.ac.uk/cgi-bin/mutations/check.cgi). Another use of the Mutation Toolkit has been the validation of the OMIM (Online Mendelian Inheritance in Man) database entries.

The modules are available as part of the open source BioPerl (http://bioperl.org) project.


245. Discovery Support System for Human Genetics
Dimitar Hristovski, Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia;
Borut Peterlin, Department of Human Genetics, Clinical Center Ljubljana, Slovenia;
Saso Dzeroski, Institute Jozef Stefan, Ljubljana, Slovenia;
dimitar.hristovski@mf.uni-lj.si
Short Abstract:

We describe an interactive discovery support system for human genetics. The goal of the system is to discover new, potentially meaningful relations between a given starting concept of interest and other concepts that have not been published in the medical literature yet (e.g. a gene candidate for a disease).

One Page Abstract:

The positional cloning approach has proved very successful in cloning genes for Mendelian human genetic diseases. However, gene identification rarely implies understanding of the pathophysiology of a genetic disease and consequently the rationale for therapeutic strategies. Moreover, knowing the entire human genome sequence requires novel methodological approaches in the analysis of genetic disease. In this paper we describe an interactive discovery support system (DSS) for the field of medicine in general and human genetics in particular. The intended users of the system are researchers in biomedicine. The goal of the system is to discover new, potentially meaningful relations between a given starting concept of interest and other concepts that have not yet been published in the medical literature. The main idea is to first find all the concepts Y related to the starting concept X (e.g. if X is a disease then Y can be pathological functions, symptoms, etc.). Then all the concepts Z related to Y are found (e.g. if Y is a pathological function, Z can be a molecule, structurally or functionally related to the pathophysiology). As the last step we check whether X and Z appear together in the medical literature. If they do not appear together, we have discovered a potentially new relation between X and Z. This relation should be confirmed or rejected using human judgment, laboratory methods or clinical investigations, depending on the nature of X and Z. The known relations between the medical concepts come from the Medline bibliographic database. The concepts are drawn from the MeSH (Medical Subject Headings) controlled dictionary and thesaurus, which is used for indexing in the Medline database. We use a data mining technique called ‘association rules’ for discovering relationships between medical concepts. Our discovery support system is interactive, i.e. the user of the system can interactively guide the discovery process by selecting concepts and relations of interest. The system provides the possibility to show the Medline documents relevant to the concepts of interest and also to show the related proteins and nucleotides. We used the DSS to analyze Incontinentia pigmenti (IP), a monogenic genodermatosis, the gene of which has recently been identified via the positional cloning approach (Nature 2000;405:466-71). We were interested in whether the gene could have been predicted as a gene candidate by the DSS. We succeeded in identifying the NEMO gene as the gene candidate and in retrieving its cDNA sequence (available since 1998). Moreover, the DSS provided some potentially useful data for understanding the pathogenesis of the disease. It has to be stressed that efficient use of the DSS is largely driven by the scientist. We conclude that the DSS is a useful tool complementary to the already existing bioinformatic tools in the field of human genetics.
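A toy sketch of the open-discovery step described above (the co-occurrence data are invented): given a starting concept X, collect intermediate concepts Y from direct co-occurrence, then candidate concepts Z that co-occur with some Y but never directly with X.

# Invented co-occurrence relations: concept -> concepts it appears with.
CO_OCCURS = {
    "disease_X": {"pathology_P", "symptom_S", "gene_G3"},
    "pathology_P": {"disease_X", "gene_G1", "gene_G2"},
    "symptom_S": {"disease_X", "gene_G2", "gene_G3"},
    "gene_G1": {"pathology_P"},
    "gene_G2": {"pathology_P", "symptom_S"},
    "gene_G3": {"symptom_S", "disease_X"},
}

def discover(start, co_occurs=CO_OCCURS):
    """Return candidate concepts Z linked to `start` only indirectly (X-Y-Z)."""
    direct = co_occurs.get(start, set())
    candidates = set()
    for y in direct:
        candidates |= co_occurs.get(y, set())
    # keep only concepts that never co-occur with the starting concept itself
    return sorted(candidates - direct - {start})

print(discover("disease_X"))   # ['gene_G1', 'gene_G2'] - potential new relations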


246. Search, a sensitive and exhaustive homology search programme
Mitsuo Murata, Tohoku Bunka Gakuen College;
Norio Murata, National Institute for Basic Biology;
mmurata@tech.tbgu.ac.jp
Short Abstract:

A sensitive and exhaustive homology search programme based on the Needleman-Wunsch algorithm has been created. The programme, Search, is written in C and assembly language, and with a query sequence of 200 amino acids it takes about 3 hours to search the entire SWISS-PROT and TrEMBL databases.

One Page Abstract:

A sensitive and exhaustive homology search programme based on the Needleman-Wunsch algorithm has been created. In the original Needleman-Wunsch algorithm, the maximum match score is calculated on a two-dimensional array, MAT(m,n), where m and n are the lengths of the two sequences, and the similarity or homology between the two sequences is statistically assessed using the maximum scores from a large number of scrambled versions of the original sequences. Depending on the size of the proteins, this algorithm demands a large amount of computer memory and CPU time, and implementing it in a homology search programme has been considered impractical. However, calculation of the maximum score alone does not require the use of a two-dimensional array, as it is not necessary to trace back the maximum match pathway; in a modified algorithm only a single one-dimensional array is used for MAT. The CPU time can be shortened using a faster programming language. Thus, with many powerful personal computers presently available, it is possible to write a programme based on the Needleman-Wunsch algorithm which can be used for searching for homologous sequences in large protein sequence databases. The programme so created was named Search, and its most CPU-intensive parts were written in assembly language. Search was used to search for sequences homologous to the slr1854 gene product of Synechocystis, a cyanobacterium, which contains 198 amino acids but whose function has not been identified. The databases screened were SWISS-PROT and TrEMBL, which together currently contain 531,048 protein sequences. With a sample size of 200 for statistical analysis, it took 3 hours 9 min on a Celeron 556 MHz computer to search the two databases. One of the proteins found to be homologous to the query sequence was the general stress protein 18 (GSP18) of Bacillus subtilis.
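The key point above is that the maximum score alone needs only a single row of the Needleman-Wunsch matrix; a sketch in Python, with an assumed simple scoring scheme (match/mismatch and a linear gap penalty, which the abstract does not specify):

def nw_max_score(a, b, match=1, mismatch=-1, gap=-2):
    """Maximum Needleman-Wunsch global alignment score kept in a single
    one-dimensional array (no traceback, so memory is O(len(b)))."""
    row = [j * gap for j in range(len(b) + 1)]   # scores of the previous row
    for i in range(1, len(a) + 1):
        diag = row[0]                            # MAT[i-1][j-1]
        row[0] = i * gap
        for j in range(1, len(b) + 1):
            up = row[j]                          # MAT[i-1][j] before overwrite
            row[j] = max(diag + (match if a[i - 1] == b[j - 1] else mismatch),
                         up + gap,
                         row[j - 1] + gap)
            diag = up
    return row[-1]

print(nw_max_score("GATTACA", "GCATGCU"))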


247. Gene Ontology Annotation Campaign and Spotfire Gene Ontology Plug-In (up)
Bo Servenius, Stefan Pierrou, Robert Virtala, Jacob Sjöberg, Dan Gustafsson, AstraZeneca R&D Lund, Sweden;
Caroline Oakley, AstraZeneca R&D Charnwood, UK;
Tobias Fändriks, Spotfire Inc, Göteborg, Sweden;
bo.servenius@astrazeneca.com
Short Abstract:

Gene Ontology (GO) is increasingly used for annotations of the gene product aspects: molecular function, biological process and cellular component. We have developed an annotation system, including a GO plug-in for the Spotfire program, and initiated a large scale annotation project aiming to annotate human genes obtained from Affymetrix experiments.

One Page Abstract:

Expression analysis experiments are becoming more and more common in both academic and industrial molecular biology research. The results of such experiments - e.g. those generated with Affymetrix technology - often consist of lists of more or less incomprehensible gene names. To enrich the listings, good annotations describing different aspects of the genes are needed. Annotations should be made with controlled vocabularies that preferably have a hierarchical or otherwise structured arrangement.

Gene Ontology (GO) is such a vocabulary (http://www.geneontology.org) and has over the last couple of years become increasingly used for annotating the gene aspects: molecular function, biological process and cellular component. GO was originally compiled by the model organism projects Flybase (Drosophila), the Saccharomyces Genome Database and the Mouse Genome Database. Since the start, several more model organism databases have joined in. However, almost no annotations have been made for human genes, and thus there are very few GO annotations that can be used directly for the genes appearing in the expression analysis experiments we generate with the Affymetrix system.

At AstraZeneca R&D Lund we have initiated a project - the Gene Ontology Annotation Campaign (GOAC) - aiming to annotate a large set of genes obtained in a specific set of Affymetrix experiments. To facilitate these annotations we have built an annotation system encompassing a database that stores the GO annotations with references and comments, and data aggregations that supply the annotators with supporting information. We are using the GO Browser (John Richter) for looking up GO terms.

In collaboration with Spotfire Inc (http://www.spotfire.com) we have developed a plug-in for their data visualization program Spotfire. This plug-in - the Spotfire Ontology Explorer (SOE) - makes it possible to visualize how a set of genes obtained in a Spotfire data visualization is positioned in the GO structure. It is also possible, starting from a specific position in the GO structure, to see how those genes are represented in the visualization. In this way we get a much better overview of an expression analysis results listing.

Further on, we are exploring possibilities to exploit the GO-annotated genes within our AstraZeneca-wide SRS-based bioinformatics system, e-lab. As GO annotations are implemented in more and more public-domain and proprietary bioinformatics databases, it will become possible to use the information integration capabilities of SRS together with the GO annotations.
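
The kind of gene-list-to-GO summary this system supports can be sketched in a few lines of Python; the tab-separated annotation file and its 'gene'/'go_term' columns are assumptions made for illustration and do not reflect the actual GOAC database schema.

from collections import defaultdict
import csv

def genes_by_go_term(annotation_file, gene_list):
    """Group an expression-analysis gene list by GO term.

    annotation_file is assumed to be a tab-separated table with 'gene'
    and 'go_term' columns."""
    hits = defaultdict(set)
    wanted = set(gene_list)
    with open(annotation_file) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["gene"] in wanted:
                hits[row["go_term"]].add(row["gene"])
    # most heavily represented GO terms first
    return sorted(hits.items(), key=lambda kv: -len(kv[1]))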


248. Predicting the transcription factor binding site-candidate, which is responsible for alteration in DNA/protein binding pattern associated with disease susceptibility/resistance (up)
Julia Ponomarenko, Galina Orlova, Tatyana Merkulova, Elena Gorshkova, Oleg Fokin, Sergey Lavryushev, Mikhail Ponomarenko, Institute of Cytology and Genetics, Novosibirsk, Russia;
jpon@bionet.nsc.ru
Short Abstract:

A databases/tools system, rSNP_Guide, predicting TF-site presence/absence and explaining the correlation between SNP allele and disease was developed and applied to the analysis of several disease-related alleles: NTFa (malaria), GpIbb (Bernard-Soulier syndrome), K-ras (lung tumor), TDO2 (mental disorders). rSNP_Guide is supplied with Help options recommending how to use the system for SNP analysis: http://wwwmgs.bionet.nsc.ru/mgs/systems/rsnp/.

One Page Abstract:

Single nucleotide polymorphism (SNP) analysis bridges the gap between genome sequence and human health care. We have developed a system of databases and tools, rSNP_Guide, addressed to the recognition of the transcription factor (TF) binding site whose presence/absence could explain SNP-related disease susceptibility/resistance. Two heuristic observations guided the system's design. First, we anticipate that databases on site-directed mutagenesis could be used for disease study too. Second, we have unexpectedly shown that accounting for SNP-caused alterations in gel mobility shift assays, in addition to common DNA analysis, may increase TF-site recognition accuracy. So, we have integrated natural and artificial mutation data and developed a JavaScript applet, which predicts the TF-site candidate responsible for disease susceptibility/resistance in regulatory DNA regions. Initial data on naturally occurring mutations were taken from the HGMD, dbSNP, HGBASE, ALFRED and OMIM databases, whereas the site-directed mutagenesis data come from the TRANSFAC, TRRD, COMPEL and ACTIVITY databases. From the original papers (rSNP_BIB), we document the experimental design (SYSTEM) and the additional data on DNA/protein-complex alterations (rSNP_DB). These SNP/mutagenesis-related databases were supplemented by three TF-site-related resources: (i) a database, SAMPLES, of experimentally detected TF sites; (ii) a database, MATRIX, of TF-site weight matrices; (iii) a JavaScript applet, rSNP_Tools, applying the weight matrices for site recognition within regulatory DNA regions. By systemizing the rSNP_DB entries, their characteristic examples were selected for an illustrative treatment by the rSNP_Tools applet. The tools use the SAMPLES and MATRIX resources to recognize the TF-site candidate responsible for disease susceptibility or drug resistance/sensitivity. Finally, the results are stored (rSNP_Reports) to illustrate how to use rSNP_Tools in practice. To use rSNP_Guide in practice, a user first must have a target sequence and mutated variants for the (+) and (-) DNA strands. For example, for analysis of the site-directed inhibitory mutations within the promoter of the rat angiotensin II type 1A receptor gene, rAT(1A)R-C, four DNA sequences should be prepared, including (i) the allele "WT" (+)-chain, 5'-tttttatTtttaAataaat-3'; (ii) the (-)-chain, 5'-atttatTtaaaAataaaaa-3'; (iii) the mutant "MT", 5'-tttttatGtttaCataaat-3'; and (iv) 5'-atttatGtaaaCataaaaa-3' (altered nucleotides are capitalized). A TF is selected in the upper section after loading the rSNP_Tools interface (http://wwwmgs.bionet.nsc.ru/mgs/programs/rsnp/). In the TF site recognition window, the user inputs each sequence variant and observes the TF site recognition score profile. With a single dominant peak, the corresponding score should be fixed. When all TFs of interest have been examined and their peaks fixed, the additional evidence is input, i.e., the relative degree of DNA/protein binding efficiency scaled between +1 and -1. In our example, the transcription activity of the rAT(1A)R-C gene is "normal" in the "WT" allele ("+1") and "inhibited" in "MT" ("0"). Then a single TF may be predicted to bind (in our example, the MEF-2 site is the TF-site candidate responsible for regulating rAT(1A)R-C gene transcription activity, as was confirmed experimentally). In the case that either several or no TFs are predicted, the significance threshold (p=0.025 by default) can be varied from 0.05 to 0.0001.
The rSNP_Guide system was tested on several human genes with SNP alleles and disorders: NTFa (severe malaria), pC (type I protein C deficiency), GpIbb (Bernard-Soulier syndrome), factor VII (severe bleeding disorder), and Gg-globin (hereditary persistence of fetal hemoglobin). Since these tests were successful, we applied rSNP_Guide to the SNP-related alleles of intron 2 of the K-ras gene (lung tumor) and intron 6 of the TDO2 gene (mental disorders). For the K-ras gene, rSNP_Guide predicted a GATA-like site candidate, present in the "CA" allele and absent in the "CC" and "GC" alleles. This explains the different lung tumor susceptibility/resistance of the alleles, which was confirmed by an antibody test. For the TDO2 gene, the YY1 site candidate was first predicted by rSNP_Guide and then confirmed experimentally. Thus, we hope that rSNP_Guide will be useful for SNP-related studies.
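
The weight-matrix scan underlying the score profiles described above can be sketched as follows in Python; the matrix values and the penalty for ambiguous bases are illustrative placeholders and are not taken from the rSNP_Guide MATRIX resource.

def score_profile(seq, pwm):
    """Slide a position weight matrix along a sequence and return the
    per-position scores (higher means a better match).

    pwm is a list of dicts, one per site position, mapping each base to a
    weight; the values used below are purely illustrative."""
    width = len(pwm)
    scores = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width].upper()
        scores.append(sum(col.get(base, -2.0) for col, base in zip(pwm, window)))
    return scores

# toy comparison of a wild-type and a mutant allele from the example above
toy_pwm = [{"A": 1.0, "T": 1.0, "G": -1.5, "C": -1.5}] * 6
wt = "tttttatTtttaAataaat"
mt = "tttttatGtttaCataaat"
print(max(score_profile(wt, toy_pwm)), max(score_profile(mt, toy_pwm)))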


249. Open Exchange of Source Code and Data from the Stanford Microarray Database (up)
Gail Binkley, Catherine Ball, John Matese, Tina Hernandez-Boussard, Heng Jin, Pat Brown, David Botstein, Gavin Sherlock, Stanford University;
gail@genome.stanford.edu
Short Abstract:

The Stanford Microarray Database (SMD) is committed to providing open source code and database design, and the data associated with published microarray experiments.

One Page Abstract:

The Stanford Microarray Database (SMD) stores raw and processed data from cDNA microarray experiments, and provides web-based tools to retrieve, visualize and analyze these data. The primary function of SMD is to support the ongoing research efforts at Stanford University, but we are also committed to providing open source code and database design, and the data associated with published microarray experiments. Toward this end, the first release of the SMD source code and database schema occurred in March 2001. A second release, scheduled for this summer, will include further abstraction of data retrieval through the use of Perl objects, to simplify and streamline the code. There are now over 1,100 published arrays available to the public in SMD. A current goal of SMD is to provide these data for downloading in a MIAME-compliant format. The MIAME (Minimum Information About a Microarray Experiment) specification is an international effort to define a standard method to describe a microarray experiment. The status of the effort to map SMD to the MIAME specification will be presented. SMD can be accessed at: http://genome-www.stanford.edu/microarray/


250. Portal XML Server: Toward Accessing and Managing Bioinformatics XML Data on the Web (up)
Noboru Matoba, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST);
Masatoshi Yoshikawa, Junko Tanoue, Shunsuke Uemura, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST);
noboru-m@is.aist-nara.ac.jp
Short Abstract:

We propose a Portal XML Server system that provides functionality for accessing bioinformatics XML data. Users can access the data they need simply by visiting the Web site, without any knowledge of XML; thus it is expected to become a useful site for molecular biologists.

One Page Abstract:

XML is emerging as a standard format on the Web. In the field of bioinformatics, data are beginning to be distributed in XML. Examples of such data formatted in XML include GAME and MaXML. We foresee that, in the near future, large amounts of XML data produced from existing databases will be interchanged on the Web. However, most users of biological databases are not XML experts.

Hence, we propose a system that allows users to search and utilize bioinformatics data more effectively without being conscious of XML. The goal of the system is to provide a portal site for accessing bioinformatics XML data. Users can access the data they need simply by visiting the Web site, without any knowledge of XML. The system stores users' personal information and utilizes their operation history and preferences, thus achieving individualization and optimization for every user. Moreover, taking advantage of XML features such as XLink, an intuitive and effective user interface is offered for XML data that contain a lot of link information. Thereby, the system aims to provide a more effective approach to the many complicated kinds of bioinformatics data.

Currently, an XML Viewer is under development as part of the above-mentioned functionality. This is a system for presenting to the user, intelligibly and effectively, XML files with varied and complicated structure (including various annotations, IDs, alignments, etc.). It therefore provides a zoom function, extraction and display of relevant data, and export to other files. By attaching other functionality (a link viewer, etc.) to the system, we expect it to help users discover new facts.

As future work, we will continue to implement further functions and intend to analyze and propose the XML schemas (DTDs) that will be needed.
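
As a small illustration of the kind of link-aware processing such a viewer needs, the Python sketch below collects xlink:href attributes from an XML file using the standard library; the file layout is left unspecified, and nothing here reflects the actual Portal XML Server implementation.

import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

def list_links(xml_path):
    """Collect every xlink:href found in a bioinformatics XML file,
    together with the tag of the element that carries it."""
    tree = ET.parse(xml_path)
    links = []
    for elem in tree.iter():
        href = elem.get("{%s}href" % XLINK)
        if href:
            links.append((elem.tag, href))
    return links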

[References]

XEWA Workshop http://www-casc.llnl.gov/xewa/

Genome Annotation Markup Elements (GAME) http://www.bioxml.org/Projects/game/

Mouse annotation XML (MaXML) http://genome.gsc.riken.go.jp/homology/ftp.html

XML Linking Language (XLink) Version 1.0 http://www.w3.org/TR/xlink/


251. PFAM-Pro: A New Prokaryotic Protein Family Database (up)
Martin Gollery, Chip Erickson, David Rector, Jim Lindelien, TimeLogic;
martyg@timelogic.com
Short Abstract:

PFAM-Pro is a new database of HMMs that have been trained exclusively on prokaryotes. This gives PFAM-Pro some advantages for the annotation of new microbial genomes. Experiments show that using PFAM-Pro in conjunction with PFAM will yield better results than either database alone.

One Page Abstract:

Hidden Markov models have become popular for similarity searches. This is due to the manner in which they represent the match, insertion, deletion and transition states for each position in the model. The scores in the model represent the probabilities for each of these states in the data used to build the model. These probabilities are more specific to the sequence family in question than a standard scoring matrix (such as PAM or BLOSUM). Those scoring matrices are based on the substitution probabilities of amino acids in an entire database, and may be less representative for a smaller subset of sequences. As a result, HMMs can recognize that a new protein belongs to an existing protein family, even if the similarity from BLAST or Smith-Waterman alignments seems weak. While analysis of large data sets using conventional CPUs can be slow, hardware implementations such as that provided by DeCypher are extremely fast. The PFAM database is curated by Washington University and the Sanger Center. PFAM consists of a database of HMMs covering many common protein domains. Version 6.2 of PFAM contains alignments and models for over 2700 protein families, based on the Swissprot and SP-TrEMBL protein sequence databases.

Just as a scoring matrix reflects the data from which it was derived, so does a hidden Markov model reflect its training data. If all of the data are derived from a single organism or type of organism, then the model will be less effective at finding matches in a dissimilar organism. Therefore, the models in PFAM are based on a wide range of organisms for the broadest possible application.

The concept behind PFAM-Pro is to turn this idea around completely. While PFAM is widely useful for all organisms, PFAM-Pro is designed for use with prokaryotes only. While PFAM is generic, PFAM-Pro is specific, and should therefore yield improved results for microorganisms.

While PFAM-Pro yields some dramatic advantages, we do not advise using it exclusively. Experiments show that using PFAM in conjunction with PFAM-Pro will yield better results than either database alone.

PFAM-Pro is copyrighted, and is freely available for use.
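
Using the two databases in conjunction, as recommended above, essentially means merging the hit lists and keeping the best score per query and family. A minimal sketch follows, assuming simple (query, family, E-value) tuples rather than the actual DeCypher output format.

def merge_hits(pfam_hits, pfam_pro_hits):
    """Combine search results from PFAM and PFAM-Pro, keeping the best
    (lowest) E-value for each query/family pair and remembering which
    database it came from."""
    best = {}
    for source, hits in (("PFAM", pfam_hits), ("PFAM-Pro", pfam_pro_hits)):
        for query, family, evalue in hits:
            key = (query, family)
            if key not in best or evalue < best[key][0]:
                best[key] = (evalue, source)
    return best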


252. Production System for Mining in Large Biological Dataset (up)
Yoshihiro Ohta, Tetsuo Nishikawa, HITACHI Central Research Laboratory;
Shigeo Ihara, HITACHI Ltd.;
yoh@crl.hitachi.co.jp
Short Abstract:

We constructed a production system, which is middleware for integrated analysis including data mining and information retrieval from large biological databases. While it maintains libraries of biological images, it also generates dynamic animations automatically and can show the mining and retrieval results visually and clearly.

One Page Abstract:

We designed a production system which can keep biological data sets compatible, change their level of abstraction and, moreover, be recycled and extended easily. The production system defined here is middleware for integrated analysis including data mining and information retrieval. While it maintains libraries of biological images, it also automatically generates dynamic animations suited to each circumstance, and shows the results visually and clearly in response to retrieval and mining requests from biological researchers.

To that end, we have designed and developed engines for mining and retrieval with a dynamic animation interface. This middleware is not only a constructor for new databases, but also an important tool for high-performance functional analysis and automatic dynamic animation generation. Using the production system constructed as above, we can manage, through a graphical user interface, information about genes, proteins, pathways and the relations between biological objects. The components of this system are as follows.

(1)"Definition of annotation and data structure of the production system" In order to give annotations to biological data and increase efficiency of analysis, we developed techniques to give various annotation sets to the biological data sets, for example, genome sequences and Medline abstracts. (2)"Construction of retrieval and mining engines for production system" For the purpose of retrieving and mining useful and unknown information from huge biological data set, we devised an algorithm of engines for information retrieval and data mining, including SVM, Decision Tree, and so on. (3)"Visual interface for representing the results of analysis" Intending to show mining and retrieval results visually and clearly, we developed visualized libraries of production system on the Web browser.

We evaluated the usability of the production system using Medline abstracts, expression profiles, clinical data sets, and so on, and confirmed that the analyzed results can be interpreted and reviewed visually and clearly, with good performance.


253. Querying Multiple Biological Databanks (up)
Patrick Lambrix, Linköpings universitet;
patla@ida.liu.se
Short Abstract:

Users of multiple biological databanks face several problems including lack of transparency, the need to know implementation details of the databanks and lack of common representation. To alleviate these problems, we present a query language that contains basic operators for query languages for biological databanks and a supporting architecture.

One Page Abstract:

Nowadays, biologists use a number of large biological databanks to find relevant information for their research. Users of these databanks face a number of problems. This is partly due to the inherent complexity of building such databanks, but also due to the fact that some of the databanks are built without much consideration for the current practice and lessons learned from building complex databases in related areas such as distributed databases and information retrieval.

A first problem that users face is that they are required to have good knowledge about which databanks contain which information as well as about the query languages and the internal structure of the databanks. Often there is a form-based user interface that helps the user partly, although different systems may use different ways of filling in the data fields in the forms. With respect to querying most databanks allow for full-text search and for search with predefined keywords. The internal structure of the databanks may be different (e.g. flat files, relational databases, object-oriented databases). Biologists, however, should not need to have good knowledge about the specific implementations of the databanks but instead a transparent solution should be provided.

Another problem is that representations of the same data by different biologists will likely be different. This requires methods for representation of and reasoning about biological data for the understanding and integration of the different conceptual models of different researchers and databanks.

When users have a complex query where information from different sources needs to be combined, they often have to divide the query into a number of simple queries, submit the simple queries to one databank at a time and combine the results themselves. This is a non-trivial task containing steps such as deciding which databanks to use and in which order, how terms in one databank map to terms in other databanks, and how to combine the results. A mistake in any of these steps may lead to inefficient processing or even to not obtaining a result at all. A requirement for biological databanks is therefore that they allow for complex queries in a highly distributed and heterogeneous environment.

To alleviate these problems we propose a base query language that contains basic operators that should be present in any query language for biological databanks. The choice is based on a study of current practice as well as on a (currently small) number of interviews. In this work we restrict the scope of the query language to text, and observe that the proposed query language is not necessarily used as an end-user query language. For the common end user the language is likely to be hidden behind a user interface. The main features of the language include an object model, queries about types and values, paths and path variables, as well as the use of specialized functions that allow for hooks to, for instance, alignment and text search programs. Further, we propose an architecture for a system that supports this query language and deals with the problems concerning representational issues and the lack of transparency. The fact that many users daily use a large number of legacy systems is also taken into account. The proposed architecture contains a central system consisting of a user interface, a query interpreter and expander, a retrieval engine and an answer filter and assembler. Further, the architecture assumes the existence of an ontology base, a databank knowledge base with information about the contents and capabilities of the source databanks, as well as the use of wrappers that encapsulate the source databanks.


254. Automated Analysis of Biomedical Literature to Annotate Genes with Gene Ontology Codes: Application of a Maximum Entropy Method to Text Classification (up)
Soumya Raychaudhuri, Jeffrey T. Chang, Patrick D. Sutphin, Russ B. Altman, Stanford University;
tumpa@stanford.edu
Short Abstract:

Intensive efforts to functionally annotate genes with controlled vocabularies, such as Gene Ontology (GO), are underway to summarize the function of characterized genes. We propose using computational approaches to interpret literature to provide annotation. These methods are applied to annotating yeast genes with GO biological process codes.

One Page Abstract:

Genomic experimental approaches to biology are forcing practitioners to consider biological systems from a more global perspective. Investigators examining broad biological phenomena may be unfamiliar with many of the individual genetic components that are directly or tangentially involved. To facilitate such interpretation, intensive efforts to annotate genes with controlled vocabularies, such as Gene Ontology (GO), are underway to summarize the function of characterized genes. When codes from a controlled vocabulary are assigned to genes, experts must carefully examine the literature to obtain the relevant codes. Expansion of biological understanding and the specialized needs of sub-disciplines will force re-annotation over time. We propose using computational approaches to automatically interpret the literature to provide annotation. The methods we describe here are applied to annotating all of the genes within the yeast genome with GO biological process codes. We use a maximum entropy text classification algorithm to identify the subjects discussed in the texts associated with the gene in question and thereby annotate the gene. Here we annotate genes based on two sets of abstracts: a curated collection of articles maintained by the Saccharomyces Genome Database and articles obtained by rapid sequence database queries.
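
For readers unfamiliar with the technique, a maximum entropy text classifier over word-count features is equivalent to multinomial logistic regression. The toy Python sketch below uses scikit-learn for illustration only; it is not the authors' implementation, and the abstracts and GO labels are invented.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# invented training data: abstracts labelled with a GO biological process term
abstracts = [
    "cells arrest in g1 and fail to complete the cell cycle",
    "this kinase phosphorylates its substrates during mitosis",
    "the mutant shows defects in amino acid biosynthesis",
    "expression of the biosynthetic pathway genes is induced",
]
labels = ["cell cycle", "cell cycle", "biosynthesis", "biosynthesis"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(abstracts)

# multinomial logistic regression acts as the maximum entropy classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

new = vectorizer.transform(["the gene is required for passage through mitosis"])
print(clf.predict(new), clf.predict_proba(new))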


255. Rat Liver Aging Proteome Database: A web-based workbench for proteome analysis (up)
Jee-Hyub Kim, Yong-Wook Kim, Hyung-Yong Kim, Jae-Eun Chung, Jin-Hee Kim, Sujin Chae, Eun-Mi Jung, Tea-Jin Eom, Yong-Ho In, R & D Institute, Bioinfomatix Inc.;
Hong Gil Nam, Department of Life Science, Pohang University of Science and Technology, Pohang, Korea (South);
kjh726@bioinfomatix.com
Short Abstract:

We developed a rat liver aging-related proteome database. The database has annotated protein function information based on GO (Gene Ontology) and provides an easy GUI (Graphic User Interface) for inputting data, searching, visualization, and protein expression profile analysis. This allows it to serve as a workbench for proteome analysis.

One Page Abstract:

Since the announcement of the first human genome sequence draft, researchers have been moving to analyze the human proteome expressed by the genome. These days, proteome analysis is most commonly accomplished by the combination of two-dimensional gel electrophoresis (2DE) and mass spectrometry (MS), from which a large amount of 2DE images and MS data is generated. In order to store, retrieve and analyze those kinds of data in biological research, it is necessary to develop a system that is easy to use and provides good visualization.

In this poster, we present a database example showing two groups of aging proteomes which were extracted from rat liver cells over a period of two years. One group is from rats on a controlled diet, and the other from diet-free rats. It has been shown that cells from diet-restricted rats live longer than other cells, and thus we can find major factors influencing the aging of cells. In this database, we stored 8 gel images and information on 78 spots. Among the 78 spots, 26 have annotated protein function information, including peptide mass information. For standardization, we used a controlled vocabulary based on GO (Gene Ontology) in parts of the database.

We built the database not only to store those data, but also to provide researchers with a workbench for proteome analysis. To this end we developed a proteome expression profile analysis program for predicting unknown protein function. We also added many modules for searching other databases with the researchers' own experimental data, and provided an easy GUI (Graphic User Interface) for processing image data and analyzing proteome data.

In the field of proteomics, new technologies are constantly emerging, so we designed the system for extensibility. In the future, we will add modules for image comparison and tandem mass spectrometry. We will also link information in this database to the rat genome sequence.


256. UCspots Before Your Eyes (up)
Ellen Graves, UCSF Ernest Gallo Clinic & Research Center;
ellen@egcrc.net
Short Abstract:

UCspots is a complete microarray LIMS for managing data from plate preparation to image analysis results, including agarose gel images, arrayer settings, experimental parameters, single channel and composite microarray images, and image analysis software results. UCspots is MIAME compliant and available in Oracle 8 and IBM DB2.

One Page Abstract:

UCspots is a complete microarray LIMS for managing data from plate preparation to image analysis results, including agarose gel images, arrayer settings, experimental parameters, single channel and composite microarray images, and image analysis software results.

The plate preparation process is completely captured, including transfers to and from different sized plates, PCR protocol data, and individual quality control values for each element as it is processed. Array fabrication data includes robot configuration, slide usage, and grid locations for individual elements from the arrayed plates.

Full experiment data collection is guaranteed through a flexible, directed user interface that guides an investigator to add all data to the LIMS. Data collected includes hybridization, wash and probe parameters, as well as images and analysis software results. The database holds all results from currently available packages GenePix, ScanAlyze, and Spot, and can be expanded to collect data from newly developed software. Experiment results, either single, multiple, or partial experiments, can be exported to data analysis packages such as GeneSpring, Cluster, and TreeView.

We have taken an object-oriented approach to the schema design for UCspots in the areas of element types and image analysis software results. Instead of trying to collect all element or analysis results in a generic table, we provide a flexible schema where tables for additional element types or analysis packages can be added easily.

Sites using UCspots can configure the LIMS to collect elements and analysis software unique to their microarray implementation. Through this flexible schema, it is possible to query across element types or analysis results, allowing improved quality control and/or evaluation of different technologies.

UCspots’ schema is MIAME compliant, and data collected can be exported to public microarray databases as they become available. The database is available in Oracle 8 and IBM DB2. The UCspots data entry application, written in Java, ensures data integrity beyond that provided by the relational DBMS. The web query pages allow secure access from the desktop, allowing individual and collaborative data analysis.

The advantages of UCspots over other microarray LIMS include full coverage of design, fabrication, and experiments with microarrays. UCspots can be easily integrated into an analysis workflow that includes any analysis package that allows data import. We have emphasized quality control, especially with elements, since arrays are only as good as the elements placed on them. We have also designed the schema to have an association between each element on a plate and a spot on each microarray, ensuring that experimental results can be correlated across experiments as well as within an experiment.


257. PASS2: Semi-Automated database of Protein Alignments organised as Structural Superfamilies (up)
V.Mallika, Dr.R.Sowdhamini, National Centre for Biological Sciences - Tata Institute for Fundamental Research;
mallika@ncbs.res.in
Short Abstract:

PASS2 is a nearly-automated version of CAMPASS with interesting features and links. Superfamily members are extracted from SCOP using a 25% sequence-identity and good-resolution cut-off. Alignment is based on the conservation of structural features like solvent accessibility, hydrogen bonding and secondary structure. These structure-based sequence alignments are created by COMPARER and JOY. Sample SMS_CAMPASS URL: http://www.ncbs.res.in/~faculty/mini/campass/sms_campass.html

One Page Abstract:

We have generated an updated, nearly-automated version of the original superfamily alignment database, CAMPASS (Sowdhamini et al., 1998 ["CAMPASS: A database of structurally aligned protein superfamilies" Structure 6(9):1087-94]). This new version, PASS2, contains all possible alignments of protein structures at the superfamily level in direct correspondence with the SCOP database (Murzin et al., 1995 [SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.]). The superfamilies are chosen from SCOP and all the representative members are extracted using a cut-off of 25% sequence identity and good resolution. For all MMS (multi-member superfamilies), the alignment of family members is based on the conservation of structural features like solvent accessibility, hydrogen bonding and the presence of secondary structures. We have employed COMPARER (Sali and Blundell, 1990 [The definition of topological equivalence in homologous and analogous structures: A procedure involving a comparison of local properties and relationships. J. Mol. Biol., 212, p.403.]; Sali et al., 1992 [A variable gap penalty function and feature weights for protein 3-D structure comparison. Prot. Engng., 5, p.43.]) to obtain the multiple sequence alignment of distantly related proteins. The final alignments of MMS and SMS (single-member superfamilies) are presented in JOY format (Mizuguchi et al., 1998 [JOY: protein sequence-structure representation and analysis. Bioinformatics 14:617-623.]) to include structural features. This version also introduces features such as keyword search, an option to align a user's query sequence with all superfamily members using MALIGN and JOY, and PSI-BLAST and PHI-BLAST homology sequence searches against PASS2. For every entry in PASS2, links to other databases such as RCSB, SCOP, EBI, HOMSTRAD, CATH, FSSP, PALI, PRESAGE, MODBASE, LPFC, DDBASE and DSDBASE are available. RasMol is available for visualizing the superposed 3D structures and the single 3D structures of the superfamily members. The association of sequences from genome databases with the existing structural superfamilies is under way. The sample page for SMS-CAMPASS (Structure-Based Sequence Annotation for Single Member Superfamilies), which contains 421 entries, is available now at the following URL: http://www.ncbs.res.in/~faculty/mini/campass/sms_campass.html


258. The HAMAP project: High quality Automated Microbial Annotation of Proteomes (up)
Anne-Lise Veuthey, Corinne Lachaize, Alexandre Gattiker, Karine Michoud, Catherine Rivoire, Andrea Auchincloss, Elisabeth Coudert, Elisabeth Gasteiger, Swiss Institute of Bioinformatics;
Paul Kersey, European Bioinformatics Institute (EBI);
Marco Pagni, Amos Bairoch, Swiss Institute of Bioinformatics;
Anne-Lise.Veuthey@isb-sib.ch
Short Abstract:

In the framework of the SWISS-PROT database, we have initiated a project to automatically annotate bacterial and archaeal proteins belonging to two categories: proteins having no similarity to other proteins, and proteins belonging to well-defined families. Annotation will be inferred after several consistency controls in order to maintain the quality of SWISS-PROT.

One Page Abstract:

More than 50 complete genomes are available in public databases. Collectively they encode more than 80'000 different protein sequences. Such a large amount of sequences makes classical manual annotation an intractable task. We therefore initiated, in the framework of the SWISS-PROT database [1], a project that aims to annotate automatically a significant percentage of proteins originating from microbial genome sequencing projects. It is being developed to deal specifically with two subsets of bacterial and archaeal proteins: 1) Proteins that have no recognizable similarity to any other microbial or non-microbial proteins (generally called "ORFans"). This task mainly implies automatic recognition and annotation of features such as signal sequences, transmembrane domains, coiled-coil regions, inteins, ATP/GTP-binding sites, etc. 2) Proteins that are part of well-defined families or subfamilies where it is possible, using software tools, to build automatically a SWISS-PROT entry of a quality identical to that produced manually by an expert annotator. In order to do this we are building, for each well-defined (sub)family, a rule system that describes the level and extent of annotations that can be assigned by similarity with a prototype manually-annotated entry. Such a rule system also includes a carefully edited multiple alignment of the (sub)family. In both cases described above, the idea is to annotate proteins with the highest level of quality. The programs in development are specifically designed to track down "eccentric" proteins. Family assignment will be performed by several identification methods based on profiles, HMM and similarity search in order to avoid false positives. Moreover, the programs will detect peculiarities like size discrepancy, absence or divergence of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, inconsistencies with the biological context etc. Such "problematic" proteins will not be annotated automatically and will be flagged for further analysis by SWISS-PROT expert annotators. Finally, the consistency of annotations will be controlled at genome level by checking the completeness of metabolic pathways.
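
The family-rule idea described above can be caricatured in a few lines of Python; the rule fields (HMM score cut-off, expected length range, annotation block) are hypothetical and do not correspond to the actual HAMAP rule syntax.

def apply_family_rule(candidate, rule):
    """Propagate annotation from a family rule to a candidate protein, or
    flag the protein for manual review by an expert annotator.

    'candidate' and 'rule' are plain dicts with invented keys used only
    for illustration."""
    problems = []
    if candidate["hmm_score"] < rule["min_score"]:
        problems.append("score below family cut-off")
    low, high = rule["length_range"]
    if not low <= candidate["length"] <= high:
        problems.append("size discrepancy")
    if problems:
        return {"status": "flag_for_expert", "reasons": problems}
    return {"status": "annotate", "annotation": dict(rule["annotation"])}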

[1] Bairoch A., Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res. 28:45-48(2000).


259. BioWIC: Biologically what's in common? (up)
Benjamin J. Stapley, Imperial Cancer Research Fund, London;
Micheal JE Sternberg, Imperial College of Science and Technology, London;
b.stapley@icrf.icnet.uk
Short Abstract:

BioWIC determines and visualizes the semantics of a collection of proteins using common terms obtained from their Swiss-Prot annotations and Medline. The system presents the user with the terms that are most significant and a graph that reveals the underlying relationships between the proteins. The work is illustrated with examples.

One Page Abstract:

Many new experimental techniques, such as mass spectrometry and expression array technologies, yield data on the behaviour of a large number of genes or proteins simultaneously. BioWIC is a simple, generic bioinformatics tool that can aid in the interpretation of such experiments. BioWIC takes a set of proteins and tries to determine common terms that describe the sequences. This is achieved by finding Swiss-Prot homologues (or identities) for each protein, extracting relevant text, and determining the most significant terms.

For each protein, homologous Swiss-Prot sequences are extracted; from these, citing Medline documents or keywords are retrieved. The terms are weighted by inverse document frequency (IDF); in associating terms with sequences, we apply IDF iteratively and generate term vectors describing each protein.

In order to visualise the relations between the proteins, we determine how similar each protein is to every other protein using the cosine of the term vectors. Pairs of proteins that have a cosine similarity above some threshold are linked and an undirected graph is generated. In addition, keywords are included in the graph by linking them to their most relevant proteins.
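
A bare-bones Python version of this term-vector pipeline might look as follows; the single-pass IDF weighting and the 0.3 threshold are simplifications of the iterative weighting described above, and the input data structures are invented for illustration.

import math
from collections import Counter

def idf_vectors(term_counts):
    """Build IDF-weighted term vectors.

    term_counts maps a protein id to a Counter of term frequencies taken
    from its Swiss-Prot/Medline text."""
    n = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {prot: {t: tf * math.log(n / doc_freq[t]) for t, tf in counts.items()}
            for prot, counts in term_counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def similarity_graph(vectors, threshold=0.3):
    """Link pairs of proteins whose term vectors are sufficiently similar."""
    prots = list(vectors)
    return [(a, b) for i, a in enumerate(prots) for b in prots[i + 1:]
            if cosine(vectors[a], vectors[b]) >= threshold]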

The impetus here is to present the user with labelled clusters which help in comprehension of the underlying semantic structure of the data and can aid in the formulation of new hypotheses.

The work is illustrated with examples from S. cerevisiae expression data.


260. Structure information in the Pfam database (up)
Sam Griffiths-Jones, Mhairi Marshall, Kevin Howe, Alex Bateman, The Sanger Centre;
sgj@sanger.ac.uk
Short Abstract:

Pfam is a database of protein domain families. The latest release of Pfam (6.3) contains 2847 families and matches over 68% of protein sequences and 48% of residues in sequence databases. Recently, extensive use of available structural information has led to significant improvements in Pfam family quality and annotation.

One Page Abstract:

Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. The database is available on the WWW in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgr.ki.se/Pfam/ and in the US at http://pfam.wustl.edu/. The latest version of Pfam (6.3) contains 2847 families matching over 68% of all protein sequences and 48% of all residues in SWISS-PROT 39 and TrEMBL 14.

Recently, development of the Pfam database has focussed on the extensive use of available structural information to improve the quality of Pfam families, and add structural and functional annotation. Domains are the structural and functional building blocks of proteins, and so, where the data are available, structural information has been used to ensure that Pfam families correspond with single structural domains. This matching of families and domains enables enhanced understanding of the function of multi-domain proteins and facilitates cross-linking and integration with structure classification databases such as SCOP and CATH. The action of chopping a single family into two or more structural domains in many cases also enables the elucidation of an increased incidence of the particular domain, often in novel protein contexts.

Pfam sequence alignments now include structural markup derived from the DSSP database, and active site residues as described in SWISS-PROT feature tables. The improved web site graphical view also shows a number of predicted non-domain regions of proteins including transmembrane, low complexity, coiled coil and signal peptide regions.


261. Sequence Viewers for Distributed Genomic Annotations (up)
Robin Dowell, Howard Hughes Medical Institute and Washington University in St. Louis;
Allen Day, Cold Spring Harbor Laboratory;
Rodney M. Jokerst, Howard Hughes Medical Institute and Washington University in St. Louis;
Guanming Wu, Cold Spring Harbor Laboratory;
Lincoln Stein, Cold Spring Harbor Laboratory;
robin@genetics.wustl.edu
Short Abstract:

The Distributed Sequence Annotation System (DAS) is a lightweight XML-based protocol for exchanging information about genomic annotations. We discuss here the design and implementation of two viewers for DAS annotations. One is a standalone Java application and the other is a server-side Perl script.

One Page Abstract:

The Distributed Sequence Annotation System (DAS) is a lightweight XML-based protocol for exchanging information about genomic annotations. The system can be used to publish the positions of predicted genes, nucleotide variations, repetitive elements, CpG islands, genetic markers, or indeed any predicted or experimental features that can be assigned to genomic coordinates. The system is currently used by the WormBase database to publish C. elegans genomic annotations, by the D. melanogaster genomic database GadFly to publish annotations on D. melanogaster, and by the Ensembl project to publish annotations on H. sapiens and M. musculus.

We discuss here the design and implementation of two viewers for DAS annotations. One, called Geodesic, is a standalone Java application. It connects to one or more DAS servers, retrieves annotations, and displays them on an integrated map. The other, called DasView, is a Perl application that runs as a server-side script. It connects to one or more DAS servers, constructs an integrated image, and serves the image to a web browser as a set of clickable imagemaps. Both viewers provide the user with one-click linking to the primary data sources where they can learn more about a selected annotation, and are sufficiently flexible to accept a wide range of annotation types and visualization styles. The standalone Java viewer is appropriate for extensive, long-term use. The Perl implementation is suitable for casual use because it does not require the user to preinstall the software.

Both viewers are freely available under Open Source licensing terms. They can be downloaded from http://biodas.org/


262. The Immunodeficiency Resource: Knowledge Base for Immunodeficiencies (up)
Marianne Pusa, Jouni Väliaho, Jukka Lehtiniemi, Tuomo Ylinen, Institute of Medical Technology, University of Tampere, Finland;
Mauno Vihinen, Institute of Medical Technology, University of Tampere, Finland, Tampere University Hospital, Tampere Finland;
marianne.pusa@uta.fi
Short Abstract:

For a long time it has been difficult to find a source providing integrated information on rare diseases. The Immunodeficiency Resource (IDR) is a comprehensive knowledge base for all information on immunodeficiencies, including computational data and analyses as well as clinical, biochemical, genetic and structural information. The IDR is freely available at http://www.uta.fi/imt/bioinfo/idr.

One Page Abstract:

For a long time it has been difficult to find a source providing extensive, integrated information on rare diseases. The Immunodeficiency Resource is a comprehensive knowledge base for all information on immunodeficiencies, including computational data and analyses as well as clinical, biochemical, genetic and structural information. The IDR is freely available at http://www.uta.fi/imt/bioinfo/idr.

The IDR is maintained in order to collect and distribute all essential information and links related to immunodeficiencies in an easily accessible format. All information in the IDR is gradually being collected in XML format, and the properties of XML will be utilized to offer data services for different platforms. The IDR information system is based on disease- and gene-specific fact sheets, which are provided for all immunodeficiencies. They act as a starting point for further disease-related information. All information on the IDR server is validated by specialists of the IDR group.

We have compiled all major immunodeficiency data onto the IDR with thousands of useful and interesting links in addition to our own pages. The IDR also includes articles, instructional resources, analysis and visualization tools as well as advanced search tools. Any text string search across all information is possible. The data is powerfully distributed, error corrected and validated before release.

The IDR integrates various web-based services e.g. sequence databases (EMBL, GenBank, SwissProt), genome information (GDB, UniGene, GeneCard, GenAtlas), protein structure database (PDB), diseases (OMIM), references (Medline), patient information (ESID registry), symptoms and diagnosis (ESID/PAGID recommendations), laboratories performing diagnosis (IDdiagnostics), mutation data (IDbases), animal models (MGD, Flybase, SacchDB) and information produced by the IDR team.

We offer up-to-date information on immunodeficiencies and immunology to people with different backgrounds. New features are continuously added to provide a comprehensive navigation point for anyone interested in these disorders whether a physician, nurse, research scientist, patient, parent of a patient or the general public.


263. XGI: A versatile high throughput automated sequence analysis and annotation pipeline. (up)
Mark E. Waugh, William T. Anderson, The National Center for Genome Resources;
Mark W. Fiers, Plant Research International;
Jeff T. Inman, Faye D. Schilkey, John P. Sullivan, Callum J. Bell, The National Center for Genome Resources;
mew@ncgr.org
Short Abstract:

XGI is a portable and flexible multi-threaded system for automated analysis and annotation of both genomic and expressed sequence data. XGI is component-based in that analysis operations are handled by independent modules that interact through XML with the central "pipeline", which handles the data flow and provides database interactivity.

One Page Abstract:

XGI (Genome Initiative for species "X") has been developed as a portable, flexible system for automated analysis and annotation of both genomic and expressed sequence data. As the name implies, we have developed XGI to reflect the common nature of sequence data independent of organism. XGI has a component-based architecture in which analysis operations are handled by independent modules that interact through XML with the central "pipeline," which controls the operations, handles the data flow and provides database interactivity. The data model itself has been designed to handle rapid changes to analysis components, including the addition of new algorithms. The pipeline is multi-threaded and has been written to take full advantage of SMP and DMP systems. Currently, the system handles sequence quality control including vector and artifact screening and low-quality read trimming. The EST-oriented pipeline then clusters and assembles ESTs into consensus sequences and performs analysis on the results, including similarity and motif searching. The genomic-oriented pipeline handles sequence assembly followed by gene prediction and downstream subsequence analysis including similarity searching on the predicted genes and ORFs. Both pipelines use a novel method of assigning Gene Ontology (geneontology.org) annotations to predicted features to assist in putative identification. Access to the data is through the web using a standard web browser connecting to a secure server. The GUI has been developed using AxKit, which converts the results of database queries into XML that is interpreted and displayed by the perl embedded stylesheets specified in the header tags. This enables rapid changes to the look, feel and functionality of the GUI with minimal effort and allows multiple different GUIs to coexist on the same server, reducing administration effort. XGI security has been modeled on the UNIX paradigm and provides USER, GROUP and GLOBAL levels of access for each row in the database.
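
The component/pipeline separation described above boils down to modules exchanging small XML messages. The Python sketch below uses invented element names purely for illustration and does not reproduce the actual XML vocabulary of XGI.

import xml.etree.ElementTree as ET

def make_request(module, sequence_id, sequence):
    """Wrap one analysis request in XML for dispatch to a pipeline module."""
    request = ET.Element("analysis-request", module=module)
    seq = ET.SubElement(request, "sequence", id=sequence_id)
    seq.text = sequence
    return ET.tostring(request, encoding="unicode")

def parse_result(xml_text):
    """Pull (feature type, start, end) triples out of a module's XML reply."""
    root = ET.fromstring(xml_text)
    return [(feat.get("type"), int(feat.get("start")), int(feat.get("end")))
            for feat in root.iter("feature")]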


264. Mouse Genome Database: an integrated informatics database resource (up)
Tom C. Wiegers, Janan T. Eppig, Judith A. Blake, Joel E. Richardson, Carol J. Bult, Jim Kadin, Mouse Genome Informatics, The Jackson Laboratory;
tcw@informatics.jax.org
Short Abstract:

The Mouse Genome Database (MGD) provides a fully integrated information resource about the laboratory mouse, from genotype (genome) to phenotype, including literature curation and experimental datasets. MGD's extensive integration, data representation and robust query capabilities cover sequences, maps, gene reports, alleles and phenotypes, with links to expression and other resources.

One Page Abstract:

The laboratory mouse is the premier model system for understanding human biology and disease. Much contemporary research focuses on the comparative analysis of mouse and human sequence data combined with the exploration of mouse mutant phenotypes. The Mouse Genome Database (MGD) provides an integration nexus for comprehensive information on the mouse. MGD makes available a wide range of genetic and genomic information including: unique representation of mouse genes, sequences and sequence maps, comparative maps and data for mammals (especially mouse, human and rat), allelic polymorphism data, and descriptive phenotypic information for genes, mutations, and mouse strains. Experimental data and citations are provided for all data associations. Literature curation (over 70,000 articles) is an important source of experimental data. MGD maintains curated interconnections with other online resources such as SWISS-PROT, LocusLink, PubMed, OMIM, and databases for other species. A new feature is the use of the Gene Ontology controlled vocabularies of molecular function, biological process and cellular component for the description of gene products. Phenotype and disease model ontologies are being developed. A second recent feature of MGD is the comprehensive representation of phenotypic alleles. This was developed, in part, to support the explosion in new mutant allele discovery from mutagenesis projects and gene targeting efforts. Allele records now include information on the origin and the molecular mutation involved, and are being annotated with precise phenotypic descriptions and human disease information. All allele data are fully integrated with sequence, ortholog, gene expression, and strain polymorphism data.

MGD is supported by NIH grant HG00330.

Blake J.A., et al., (2001) The Mouse Genome Database (MGD): integration nexus for the laboratory mouse. Nucleic Acids Research 29: 91-94.


265. Methods for the automated identification of substance names in journal articles (up)
Michael Krauthammer, Department of Medical Informatics, Columbia University;
Andrey Rzhetsky, Department of Medical Informatics, Columbia Genome Center;
Carol Friedman, Department of Medical Informatics, Department of Computer Science, Queens;
mk651@columbia.edu
Short Abstract:

Identification and tagging of substance names in scientific articles is an important first step in automated knowledge extraction systems. We report on a novel approach to this problem based on a syntactic analysis of the articles, a sequence comparison tool (BLAST) and a dictionary of substance names.

One Page Abstract:

INTRODUCTION Our group is building a system (GeneWays) for automated knowledge extraction from online scientific journals. The goal is the reconstruction of molecular networks consisting of interactions between molecular entities as reported in the literature. Identification and tagging of gene and protein names, as well as other substances, is an important first step for successful knowledge extraction from articles. In recent years, different authors have proposed alternative approaches to this task, such as using morphological rules, syntax analysis, dictionaries or even hidden Markov models. The main challenges in identifying substance names are spelling variations and errors, multi-token names, and newly introduced words that are not listed in any dictionary. We have previously demonstrated that it is possible to tackle the problems of spelling variations and errors by using a sequence comparison tool such as BLAST. Here we show that by combining this technique with a syntactic analysis of the article, it is also possible to handle the problem of multi-token names.

METHOD In summary, the system first identifies noun phrases by performing a syntactic analysis of the article. The system then selects those noun phrases, which most likely contain substance names, by applying morphological rules and a broad coverage dictionary. After a part of speech analysis of the selected noun phrases, the article is marked up; most potential substance names are specifically tagged, including multi-token words. The next step is the exact identification of the marked up substance names, i.e. matching the marked up names with an official entry in a reference database such as LocusLink, taking into account spelling variations and errors. This task is accomplished by using BLAST, a popular sequence comparison tool. All substance names from a reference database are converted into a string of nucleotides, by substituting each character in the name with a predetermined unique nucleotide combination. The encoded names are then imported into the BLAST database using the FASTA format. The marked up substance names from the first step are translated, using the same nucleotide combinations, into a string of nucleotides and matched against the nucleotide representation of substance names in the BLAST database. Significant alignments are listed in the BLAST output file, which is subsequently processed using Perl-scripts. At this final stage, the system has determined the closest match between each marked up name and the reference database according to the individual alignment scores.
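
The character-to-nucleotide encoding at the heart of this step can be sketched as follows in Python; the particular three-base codes are invented here for illustration and are not the codes used by GeneWays.

import string

# assign each text character a unique three-base code (64 codes are available,
# enough for the alphabet below); the mapping is illustrative only
ALPHABET = string.ascii_lowercase + string.digits + "-_ "
BASES = "ACGT"
CODES = {ch: BASES[i // 16 % 4] + BASES[i // 4 % 4] + BASES[i % 4]
         for i, ch in enumerate(ALPHABET)}

def encode(name):
    """Translate a substance name into a nucleotide string so that fuzzy
    name matching can be delegated to a sequence comparison tool."""
    return "".join(CODES.get(ch, "TTT") for ch in name.lower())

def to_fasta(names):
    """Write dictionary entries in FASTA format for loading into a BLAST database."""
    return "\n".join(">%s\n%s" % (name, encode(name)) for name in names)

print(to_fasta(["p53", "nf-kappa b"]))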

RESULTS AND CONCLUSION Our results indicate that a combination of methods for the identification of substance names yields the necessary precision for the subsequent automated knowledge extraction from scientific articles.

REFERENCE Krauthammer M, Rzhetsky A, Morozov P, Friedman C.Using BLAST for identifying gene and protein names in journal articles. Gene. 2000 Dec 23;259(1-2):245-52.


266. Development and Implementation of a Conceptual Model for Spatial Genomics (up)
Mary E. Dolan, Carol J. Bult, The Jackson Laboratory, Bar Harbor, ME, USA;
Kate Beard, Constance Holden, University of Maine, Orono, ME, USA;
mary_dolan@umit.maine.edu
Short Abstract:

We have developed a data model for genome data interpretation that reflects the natural and intuitive thought patterns of the biologist; incorporates spatial information in a way that is natural and intuitive to a spatial data analyst; includes a high degree of expressivity of the complex interactions among genome features.

One Page Abstract:

Genomics researchers continue to face the problem that sequencing methods and other experimental advances have produced and continue to produce massive amounts of complex data, which must be stored, organized and, most importantly, interpreted and integrated to be of use to biologists. Each month the bioinformatics journals present novel approaches to this problem. One promising strategy is to take an interdisciplinary approach, in which one attempts to bring the concepts, techniques and tools of a mature field to bear on a body of new data types. Although genetics researchers have long recognized the biological significance of the spatial organization of the genome and made different kinds of "maps" to visualize genetic and genomic features, attempts to fully exploit the concepts and methods developed in the area of spatial data analysis and geographic information systems have been limited. Our work is an attempt to move beyond this and use the tools of this field to analyze, query, and visualize aspects of the spatial organization of genomic features and the comparison of genomes.

We present here, as part of a project to develop a proof of principle Genome Spatial Information System (GenoSIS, http://www.spatial.maine.edu/~cbult/project.html), a conceptual model for genome data interpretation that: reflects the natural and intuitive thought patterns of the biologist; incorporates spatial information in a way that is natural and intuitive to a spatial data analyst; includes a high degree of expressivity of the complex interactions among genome features that recent experimental evidence indicates is essential to understanding genome structure and the regulation of genome function (Kim, J., Bergmann, A., Stubbs, L. (2000) Exon Sharing of a Novel Human Zinc-Finger Gene, ZIM2, and Paternally Expressed Gene 3 (PEG3). Genomics, 64, 114-118; Labrador, M., Mongelard, F., Plata-Rengifo, P., Baxter, E.M., Corces, V.G., Gerasimova, T.I. (2001) Protein encoding by both DNA strands. Nature, 409, 1000).

We also present an implementation of this conceptual model that: integrates data from different data sources using an object-based database schema; allows us to easily take advantage of existing dynamic spatial analysis, classification, querying, and visualization methods and tools according to specifications outlined by the Open GIS Consortium (http://www.opengis.org/techno/specs.htm); stores data in a manner that can be easily updated to include the most recent discoveries of complex feature interactions and dependencies.


267. Modelling Genomic Annotation Data using Objects and Associations: the GenoAnnot project
Helene Riviere-Rolland, Gilles Faucherand, Genome-Express;
Christophe Bruley, INRIA Rhone-Alpes 655 avenue de l'Europe – 38330 - Montbonnot - France;
Anne Morgat, INRIA Rhone-Alpes;
Magali Roux-Rouquie, Institut Pasteur 25 rue du Dr. Roux – 75015 – Paris - France;
Claudine Medigue, Institut Pasteur;
François Rechenmann, Alain Viari, INRIA Rhone-Alpes;
Yves Vandenbrouck, Genome-Express;
h.riviere-rolland@genomex.com
Short Abstract:

The Geno* project aims at constructing a modular environment dedicated to complete genome analysis and exploration. In this framework, the GenoAnnot module focuses on the annotation of prokaryotic and eukaryotic genomes. It is based on an object-oriented model, using classes and associations. We present here the GenoAnnot ontology and architecture.

One Page Abstract:

GenoAnnot is part of Geno*, a modular environment dedicated to complete genome analysis and exploration. Geno* is an ongoing project between INRIA, Institut Pasteur, Hybrigenics and Genome Express. In this context, GenoAnnot focuses on the annotation process of prokaryotic and eukaryotic genomes. It provides a framework to identify and visualise chromosomal regions of interest (such as coding regions, regulatory signals) in order to assist the biologist in the course of raw genome annotation. GenoAnnot extends our former project (Imagene) both in terms of biology (eukaryotic genomes) and technology.

As a first step in the design of GenoAnnot, it is of major importance to formalise and explicitly represent the biological concepts that come into play. For this purpose we chose an object-based (UML-like) formalism to design our data model. This model was then implemented using the AROM representation system (http://www.inrialpes.fr/romans/arom) developed in the Romans project at INRIA Rhône-Alpes. AROM provides an original knowledge representation model based on classes and associations.

Our model for GenoAnnot is composed of four main root classes. The first holds genetic and phylogenetic information about the species under study. The second represents biological entities involved in the constitution and expression of a genome; instances of this class are not necessarily linked to a known sequence (e.g. the existence of a gene may be known even if the corresponding part of the chromosome has never been sequenced). The third represents regions of interest (defined as intervals) on sequences; instances of this class make up most of the annotation information and can be imported from databanks or produced by tasks within the GenoAnnot environment. Finally, the fourth holds information related to the genomic sequences (chromosomes, contigs) that have to be annotated. At present our hierarchy is composed of about 90 subclasses of these four main classes, most of them falling within two of these hierarchies. Moreover, these classes are connected by associations. In AROM, associations can be n-ary (i.e. they may connect more than two classes) and, as with classes, can be organised into hierarchies and can have attributes. This latter feature turned out to be particularly useful. For instance, the coordinates of a particular feature on a chromosome are held by an association, which makes it possible to link the same feature to different versions of the chromosome, possibly at different locations on each of them.
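
As an illustration of this design choice, the sketch below (in Python, with hypothetical class names rather than the actual GenoAnnot/AROM classes) shows how an association object can carry the coordinate attributes, so that the same feature can be localised on several versions of a chromosome at different positions.

    from dataclasses import dataclass

    @dataclass
    class Feature:                 # a biological entity, e.g. a gene or CDS
        name: str

    @dataclass
    class SequenceVersion:         # one version of an annotated sequence
        chromosome: str
        version: int

    @dataclass
    class Localisation:            # an association carrying its own attributes
        feature: Feature
        sequence: SequenceVersion
        start: int
        end: int

    cds = Feature("hypothetical CDS")
    placements = [
        Localisation(cds, SequenceVersion("chr1", 1), 12000, 13500),
        Localisation(cds, SequenceVersion("chr1", 2), 12110, 13610),  # same feature, new coordinates
    ]
    for loc in placements:
        print(loc.sequence.version, loc.start, loc.end)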

Another important functionality of GenoAnnot, inherited from Imagene, is its ability to produce these objects (namely instances of classes and associations) by using tasks. Tasks represent the methodological knowledge. They implement sequence analysis methods (such as gene or signal finding) and are run under the control of a generic task-engine provided by Geno*. Finally, all the objects in GenoAnnot can be managed and visualised through graphical user interfaces.

A first version of GenoAnnot will be made available at the end of 2001, both as a standalone application and as a Java API.


268. An Overview of the HIV Databases at Los Alamos
Brian K. Gaschen, Charles E. Calef, Rama Thakallapally, Brian T. Foley, Carla L. Kuiken, Bette T. Korber, Theoretical Biology and Biophysics, Los Alamos National Laboratory;
bkg@lanl.gov
Short Abstract:

The HIV Genetic Sequence, Immunology, and Drug Resistance Databases at http://hiv-web.lanl.gov collect, compile, annotate and analyze HIV and other primate immunodeficiency virus gene and protein sequences, T-cell epitopes and antibody reactivity data. Tools to aid researchers in the analysis of the sequences and in vaccine design are provided at the site.

One Page Abstract:

The HIV Genetic Sequence, Immunology, and Drug Resistance Databases at http://hiv-web.lanl.gov collect, compile, annotate and analyze primate immunodeficiency virus gene and protein sequences, T-cell epitopes, and antibody reactivity data. The databases bring together sequence data from individual studies to provide a global frame of reference. The drug resistance database contains a collection of mutations in HIV genes that confer resistance to anti-HIV drugs, and can be searched through a variety of fields including gene, drug class, and mutation. Alignments of T-cell epitopes and linear antibody binding sites are available from the immunology database including variation summarized by HIV-1 subtype and country of origin. Alignments of gene and protein sequences, phylogenetic trees, analyses of site-specific variability, and models of immunodeficiency virus evolution are available from the sequence database. Sequence data can be retrieved via a large variety of selection criteria, including country of isolation, subtype of virus, patient health, date, coreceptor usage, and region of genome. The database provides tools for analyses, both online via web browsers, and through downloadable software. The database staff also conducts independent research using the resources of the databases and the unique computational facilities available at Los Alamos. The results of these projects are used to enrich the resources available on our web site. Some of the recent projects at the database include the development of a parallel version of fastDNAml which has been used in studies to estimate the most likely date of the most recent common ancestor of the current HIV pandemic, combining the immunology and sequence database information to assess CTL epitope variability by subtype across different regions of the genome, studies of HIV subtype distribution and variation in New York City immigrant populations, and the development of tools to aid in more efficient vaccine design.


269. Variability of the immunoglobulin superfamily V-set fold using Shannon entropy analysis
Manuel Ruiz, Marie-Paule Lefranc, IMGT, the international ImMunoGeneTics database;
manu@ligm.igh.cnrs.fr
Short Abstract:

Taking into account the expertise provided by IMGT, the international ImMunoGeneTics database (http://imgt.cines.fr), a systematic sequence variability analysis of the Immunoglobulin and T cell Receptor variable domains was carried out. This study highlights the sequence variations and constraints within the functional and structural requirements of the Ig and TcR V-REGIONs.

One Page Abstract:

The Immunoglobulins (Ig) and T cell Receptors (TcR) are extensively studied, and a considerable amount of genomic, structural and functional data concerning these molecules is available. IMGT, the international ImMunoGeneTics database (http://imgt.cines.fr), manages the large flow of new Ig and TcR sequences and currently contains more than 44 000 sequences. The variability of the Immunoglobulin and T cell Receptor V-REGIONs has previously been studied by different approaches. The Kabat-Wu (KW) method is the most popular one. A modified version of the KW method, the Jores, Alzari, and Meo (JAM) method, has been established in order to enhance the resolving power of the variability index. However, both methods are usually used without a critical assessment of the results and without any standardization of the amino acid positions between chain types and species. A third approach, Shannon information analysis, has been proposed as being more appropriate for analyzing sequence variability. Since IMGT sequence annotations provide exhaustive and high quality Ig and TcR V-REGION data, based on the IMGT Scientific chart rules and on the standardized IMGT unique numbering, that standardization can now be exploited to set up a variability analysis of the V-set fold, amino acid position by amino acid position. In this study, we carried out a systematic variability analysis of the annotated V-set fold sequences from IMGT, using Shannon information analysis. This variability analysis describes the susceptibility of each amino acid position to evolutionary replacements and highlights the sequence variations and constraints within the functional and structural requirements of the immunoglobulin superfamily V-set fold. This approach is particularly important in the cases of antibody engineering, humanization of antibody fragments, model building, and structural evolution studies.
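
The per-position calculation can be sketched as follows (a minimal Python illustration, not the IMGT implementation): for each standardized alignment column, the Shannon entropy of the observed residues gives the variability index.

    import math
    from collections import Counter

    def column_entropy(column):
        """Shannon entropy (bits) of one alignment column; gaps are ignored."""
        residues = [aa for aa in column if aa != "-"]
        if not residues:
            return 0.0
        n = len(residues)
        counts = Counter(residues)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def variability_profile(alignment):
        """One entropy value per standardized position of an equal-length alignment."""
        return [column_entropy(col) for col in zip(*alignment)]

    # Toy aligned fragments; real input would be IMGT-numbered V-REGION sequences.
    print(variability_profile(["QVQLVQSG", "QVQLQESG", "EVQLVESG"]))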


270. SNPSnapper – application for genotyping and database storaging of SNP genotypes produced by microarray technology
Juha Saharinen, Pekka Ellonen, Janna Saarela, National Public Health Institute;
Leena Peltonen, Department of Human Genetics, UCLA School of Medicine, Los Angeles, USA.;
Juha.Saharinen@KTL.Fi
Short Abstract:

SNPs are important genotype markers. We have developed an application for SNP allele calling and genotyping from microarray data. The microarray technology together with SNPSnapper software allows semi-automatised, high throughput SNP genotyping followed by data storage and management in a relational database.

One Page Abstract:

Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation, with an estimated frequency of about 1/1000 in the human genome. This, in addition to the low mutation frequency of SNPs, makes them very useful genetic markers for a number of different genetic studies. We have developed a software package for SNP allele calling and genotyping of samples from microarray data. In the experimental setup, the allele-specific oligonucleotides are immobilised on a microscope slide. The corresponding genomic regions of the analysed samples are amplified by multiplex PCR, transcribed in vitro to RNA, and finally hybridised to the immobilised oligonucleotides. The genotypes are determined by allele-specific primer extension using fluorescent-labelled nucleotides. This system currently delivers thousands of genotypes per glass slide. The data from the microarray reader are imported into the SNPSnapper program, which sorts the microarray data points according to the SNP and provides a GUI for visualisation of allelic ratios and the related absolute signal intensities. In the resulting scatter plot, clusters representing different genotypes are typically seen, whereas in a fraction plot the different genotypes are distinguished by their location on the intensity-fraction axis. SNPSnapper assigns genotypes to each sample by comparing the signal intensities representing the different alleles. The limiting signal intensity values as well as the fraction value boundaries between called genotypes can be set either manually or automatically. The genotypes can be further validated by hand, to discard e.g. genotypes derived from low signal intensity data points or from areas of the glass plate with high background intensity. The genotyping data are then stored in a relational database, where each genotype is linked to the experiment, project and SNP. Likewise, each experiment is stored in the database with information describing the instrument used, the SNPs, samples, experimental conditions and the operator. These settings allow high-throughput genotyping with semi-automatised information processing and database storage. The application is implemented with the Borland Delphi development environment, and the database connection is made with Microsoft ADO components, allowing a database-server-independent implementation.
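
The fraction-based calling step can be illustrated with the following minimal sketch (Python); the intensity and fraction thresholds are made-up values, not SNPSnapper's defaults.

    def call_genotype(signal_a, signal_b, min_intensity=500.0,
                      hom_a_max=0.2, hom_b_min=0.8):
        """Assign AA/AB/BB from two allele-specific signal intensities."""
        total = signal_a + signal_b
        if total < min_intensity:
            return "NoCall"            # discard low-intensity data points
        fraction_b = signal_b / total  # position on the intensity-fraction axis
        if fraction_b <= hom_a_max:
            return "AA"
        if fraction_b >= hom_b_min:
            return "BB"
        return "AB"

    for a, b in [(2000, 100), (900, 1100), (80, 1900), (120, 90)]:
        print(a, b, call_genotype(a, b))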


271. An XML document management system using XLink for an integration of biological data
Junko Tanoue, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST);
Masatoshi Yoshikawa, Noboru Matoba, Shunsuke Uemura, Laboratory for Database and Media Systems, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST);
junko-ta@is.aist-nara.ac.jp
Short Abstract:

We propose an XML document management system adopting XLink technology. The system can help researchers define relationships among XML resources collected from biological data providers. Such link information, as expert knowledge, is expected to be especially useful for annotating resources or writing reports.

One Page Abstract:

Biological data resources have begun to provide their data in XML formats. With the use of XML, interoperation of databases and provision of broker services would become much easier. Such an improvement is surely a benefit for researchers who wish to discover unknown biological rules or facts.

Here we would like to propose another type of XML application for biologists who manage data collected from public databases. This application employs XLink (XML Linking Language) to specify relationships among data resources.

XLink has the following features:

1. It can assert linking relationships among more than two resources.

2. It can associate metadata with a link.

3. It can express links that reside in a location separate from the linked resources.

When XLink is used for linking XML documents, it makes it possible:

1. To make a link to a specific part of a document without an anchor tag.

2. To create a virtual document embedding linked parts from outside documents.

This virtual document can serve as a sort of scrapbook which a researcher may want to use to organize his/her ideas. One may share this link information with other researchers because such links are useful expertise. An application adopting XLink technology can help researchers manage XML documents with link information, and it can be especially useful for annotating resources or writing reports.
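
The kind of out-of-line extended link involved can be sketched as follows (a minimal Python/ElementTree example; the element names, labels and URLs are hypothetical, only the xlink attributes follow the W3C namespace).

    import xml.etree.ElementTree as ET

    XLINK = "http://www.w3.org/1999/xlink"
    ET.register_namespace("xlink", XLINK)

    def q(name):
        """Qualify an attribute name with the XLink namespace."""
        return "{%s}%s" % (XLINK, name)

    link = ET.Element("annotationLink", {q("type"): "extended"})
    ET.SubElement(link, "resource", {
        q("type"): "locator",
        q("href"): "http://example.org/embl/entry1.xml",
        q("label"): "gene",
    })
    ET.SubElement(link, "resource", {
        q("type"): "locator",
        q("href"): "http://example.org/swissprot/entry2.xml",
        q("label"): "protein",
    })
    ET.SubElement(link, "relation", {
        q("type"): "arc",
        q("from"): "gene",
        q("to"): "protein",
        q("title"): "encodes",   # metadata associated with the link
    })
    print(ET.tostring(link, encoding="unicode"))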

[References]

Achard F, Vaysseix G, Barillot E. XML, bioinformatics and data integration. Bioinformatics 2001 Feb;17(2):115-25

XEWA Workshop http://www-casc.llnl.gov/xewa/

XML Linking Language (XLink) Version 1.0 http://www.w3.org/TR/xlink/

XML Pointer Language (XPointer) Version 1.0 http://www.w3.org/TR/xptr


272. Immunodeficiency Mutation Databases in the Internet
Marianne Pusa, Jouni Väliaho, Institute of Medical Technology, University of Tampere, Finland;
Pentti Riikonen, Institute of Medical Technology, University of Tampere, Finland, Turku Centre for Computer Science, University of Turku;
Tuomo Ylinen, Jukka Lehtiniemi, Institute of Medical Technology, University of Tampere, Finland;
Mauno Vihinen, Institute of Medical Technology, University of Tampere, Finland, Tampere University Hospital, Tampere, Finland;
marianne.pusa@uta.fi
Short Abstract:

Altogether 15 mutation databases (IDbases) on primary immunodeficiencies are currently maintained at the IMT bioinformatics. We have developed a program suite, MUTbase, providing interactive and quality controlled submissions of information to mutation databases. Our mutation databases offer updated, integrated information on immunodeficiencies for all interested parties.

One Page Abstract:

Altogether 15 mutation databases (IDbases) on primary immunodeficiencies are currently maintained at the IMT bioinformatics. We have developed a program suite, MUTbase, providing user-friendly, interactive and quality controlled submissions of information to mutation databases. Our mutation databases offer updated information on immunodeficiencies for all interested parties in an integrated, comprehensive format.

The amount of knowledge of genetic variation is increasing rapidly. There are many disease-causing mutations, and we have created a system for collecting, maintaining and analysing data on the growing number of these mutations. Our system contains patient-based mutation databases that function as a resource for genomics and genetics in molecular biology as well as for diagnostics and therapeutics in medicine.

Over 80 immunodeficiencies are currently known and, because they are rare, information on them has been difficult to obtain. Altogether 15 mutation databases are maintained at the IMT bioinformatics, containing some 1630 entries, and the number is increasing. We also maintain a registry, KinMutBase, of mutations in human protein kinases related to disorders. All databases and further information are available at http://www.uta.fi/imt/bioinfo.

The program suite, MUTbase, provides user-friendly, interactive and quality controlled submission of information to mutation databases. The interactive data submission forms include several quality controls and essential links to related information. Database maintenance can also be carried out using the Internet tools. There is also a variety of tools provided for further study of the database on the World Wide Web. The program package writes and updates a large number of Web pages e.g. distribution and statistics of disease-causing mutations and changes in restriction patterns. The MUTbase is available free on the Internet at http://www.uta.fi/imt/bioinfo/MUTbase.

IDbases can be accessed via the BioWAP service, which is a mobile internet service also provided by the IMT bioinformatics. BioWAP provides a channel to all major bioinformatics databases and analysis programs.


273. Martel: parsing flat-file formats as XML
Andrew Dalke, Dalke Scientific Software, LLC;
dalke@dalkescientific.com
Short Abstract:

Martel is a parser generator for bioinformatics which allows existing flat-file formats to be used as if they were in XML. It simplifies the process of data extraction and conversion, HTML markup and even format validation. Martel is part of the biopython.org collaboration.

One Page Abstract:

Bioinformaticians deal with hundreds of formats. Some are well defined and others are ad hoc. Some come from databases and others from program outputs. All need to be parsed.

There are many different requirements for a parser. A survey of 19 different SWISS-PROT parsers found these use cases: 1) count the number of records in a file, 2) find the name and location of each record, 3) convert to FASTA, 4) convert to a generic data model (ignoring some fields), 5) convert to a full data model (no fields ignored), 6) display a record in HTML with hyperlinks and 7) validate the format is correct. An eighth common case is to recognize a format or version automatically. Excepting Martel, no existing system handles all of these cases and those that come close require a considerable amount of additional programming.

The traditional way to write a parser is with a parser generator like yacc or Lion's SRS. Yacc proves to be an inappropriate solution for bioinformatics formats because it assumes strong separability between lexing and parsing. Bioinformatics formats are strongly position dependent which calls for explicit state control in yacc. SRS solves that by making the lexer state implicit in the parser but it ties the parser's actions tightly to the parsing so the parser cannot easily be changed to handle the different use cases mentioned in the previous paragraph.

Martel combines two different approaches to simplify parser development. The first is the recognition that most bioinformatics formats -- and nearly all the formats that are otherwise hard to parse -- can be described with a regular grammar instead of a context-free one. This simplifies the format definition by using a single grammar instead of the hybrid lexer/parser used in other systems. The grammar is a slightly modified subset of Perl5 regular expressions and is easily understood by most bioinformatics software developers.

The other approach is the use of an event-based parser. Once a record has been converted into a parse tree, the tree is traversed and information about the nodes is sent to the caller via SAX events, as in XML processing. Tags for the startElement and endElement events come from the groups defined in the format definition and are named using the (?P<named>group) syntax made popular by Python. Generating events allows separation between the parser and the actions. Generating SAX events allows existing XML tools, like DOM and XSLT, to be used with no additional effort.
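
The idea can be sketched in a few lines of Python (not Martel's own code; the line format and handler are toy examples): a named-group regular expression describes the format, and each matched group is reported through SAX-style events.

    import re

    # Toy grammar for a SWISS-PROT-like ID line, e.g. "ID   CYC_HUMAN   104 AA."
    ID_LINE = re.compile(r"ID   (?P<entry_name>\S+)\s+(?P<length>\d+) AA\.")

    class PrintHandler:
        """Stand-in for a SAX ContentHandler."""
        def startElement(self, name):
            print("<%s>" % name)
        def characters(self, text):
            print("  %s" % text)
        def endElement(self, name):
            print("</%s>" % name)

    def parse_line(line, handler):
        match = ID_LINE.match(line)
        if match is None:
            raise ValueError("line does not match the format definition")
        for name, value in match.groupdict().items():
            handler.startElement(name)
            handler.characters(value)
            handler.endElement(name)

    parse_line("ID   CYC_HUMAN   104 AA.", PrintHandler())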

Martel has been in use for several months as part of the biopython.org collaboration. It has proved capable of handling all of the use cases listed earlier. In performance it is faster than the other parsers available for Perl and Python, and is at most a factor of four slower than equivalent but less flexible code in Java and C. When used for validation it has identified many problems in existing popular database formats, either because the format documentation is incomplete or because the distributed data is indeed in the wrong format.

Martel is freely available. More information is available at http://www.dalkescientific.com/Martel/ .


274. Disulphide database (DSDBASE): A handy tool for engineering disulphides and modeling disulphide-rich systems.
Vinayagam A., Dr. R. Sowdhamini, National Centre for Biological Sciences;
vinayagam@ncbs.res.in
Short Abstract:

DSDBASE - A database on disulphide bonds - includes native disulphides and those that are stereochemically possible between pairs of residues in a protein. One potential application is to design site-directed mutants in order to enhance the thermal stability of the protein. Another use is proposing 3D models of disulphide-rich peptides.

One Page Abstract:

The occurrence and geometry of all known disulphide bonds in protein structures have been recorded by examining the stereochemistry of such covalent cross-links in a non-redundant database of proteins. In addition, other modelled disulphides which are stereochemically possible between pairs of residues in a protein are also considered. The modelling of disulphides has been achieved using the program MODIP (Sowdhamini et al. (1989) Prot. Engng., 3, 95-103). The inclusion of native and modelled disulphides increases the dimensions of the derived database (DSDBASE) enormously. Disulphide bonds of specific loop sizes, strained native disulphides and scrambled disulphide models will be discussed. One of the potential uses of DSDBASE is to design site-directed mutants in order to enhance the thermal stability of proteins. Another application, which will be illustrated, is to employ this database for proposing three-dimensional models of disulphide-rich polypeptides like toxins and small proteins of known disulphide bond connectivity. This method seemed to be highly efficient when applied to known models themselves (like Endothelin). The database is accessible over the web (http://www.ncbs.res.in/~faculty/mini/dsdbase/dsdbase.html).


275. A Combinatorial Approach to Ontology Development
Judith A. Blake, David P. Hill, Joel E. Richardson, Martin Ringwald, Mouse Genome Informatics, Jackson Laboratory;
jblake@informatics.jax.org
Short Abstract:

We present an experiment in ontology development that entails combining separate ontologies to create more complex and specific ontologies. Two test sets, one of developmental processes, the other of heart anatomy, are computationally combined to generate a novel third ontology with interesting implications for future data representations.

One Page Abstract:

The Gene Ontology (GO) project encompasses the development and use of the biological ontologies of molecular function, biological process and cellular component. Complete representation of developmental processes necessitates incorporation of concepts from an additional ontology, anatomy. Incorporation of the anatomy component is problematic for a project like GO because anatomical concepts can be species-specific. We present an experimental test of cross-product ontology implementation where the biological process of heart development for the mouse is explored. This approach recognizes that the developmental portion of the biological process ontology can be constructed from the two independent ontologies of developmental processes and mouse developmental anatomy whose cross product provides all possible combinations. The cross-product approach is inherently complete as long as the two vocabularies used to construct it are complete and it is logical as long as the two vocabularies are orthogonal. The generation of complete cross products can be automated and provides an alternative approach to ontological development from that of the incremental addition of terms. Definitions of the cross-product terms can be automatically derived from the combination of the original definitions, and synonyms provide customary terms for the user. This approach is being explored for implementation in the GO project as well as for other ontology development in the Mouse Genome Database (MGD) and the Gene Expression Database (GXD).
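
A minimal sketch of the cross-product generation (Python; the term lists and definition template are illustrative, not the actual GO or mouse anatomy vocabularies):

    from itertools import product

    processes = ["morphogenesis", "cell differentiation", "growth"]
    anatomy = ["heart", "cardiac ventricle", "atrium"]

    def cross_product_terms(procs, parts):
        """Yield a (term, definition) pair for every structure x process pair."""
        for part, proc in product(parts, procs):
            yield "%s %s" % (part, proc), "%s occurring in the %s." % (proc, part)

    for term, definition in cross_product_terms(processes, anatomy):
        print("%s: %s" % (term, definition))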

The GO consortium is supported by NIH grant HG-02273. MGD is supported by NIH grant HG-00330. GXD is supported by NICHD grant HD-33745.


276. StressDB: The PennState Arabidopsis Stress Database
Smita Mitra, Nigam Shah, Pennsylvania State University, PA, USA.;
mxm66@psu.edu
Short Abstract:

We are in the final stages of development of StressDB, a Windows-based microarray database for the management of microarray data. We are using an Oracle database management system on a Windows NT/2000 server. The scripts for the backend, application layer and front-end will be packaged for easy installation by any non-profit institution.

One Page Abstract:

StressDB: The PennState Arabidopsis Stress Database

With the advent of DNA array-based technologies, the scientific community has felt the need for more sophisticated tools for the management, comprehension and exploitation of data. The microarray technology yields copious volumes of data, which necessitates the use of sophisticated database management systems for efficient data management. The current public databases include ArrayExpress, Gene Expression Omnibus (GEO) and Microarray Project-ArrayDB. While these will provide central repositories for publicly available microarray data, local databases are needed for researchers to store and manage their data until it is ready for publication. Some of the local microarray databases include the Stanford Microarray Database (SMD), Yale Microarray Database (YMD), Gene Expression Database (GXD) and RNA Abundance Database (RAD). Most of the other local databases are in the early stages of development and continue to evolve.

With the goal of better managing our data, we decided to build a web-based relational database for the storage, retrieval and analysis of microarray data. We are in the final stages of the development of StressDB and will release the database to the public by August 2001. The scripts for the database will be made freely available online to non-profit institutions soon afterwards. We will package our database scripts for easy installation by other labs (with appropriate licenses). We believe that the scientific community will benefit from a small and comprehensive yet fully functional microarray database designed for use by a single lab or a few closely-knit labs, and that none of the current databases meet this requirement. The public databases are not designed for implementation by a single lab and hence do not qualify. The local databases that currently exist are either too large in backend architecture and functionality, and too high-maintenance, for a small group of labs (being meant for a large group of collaborators covering various organisms), or are at a very early stage of development. We are confident that StressDB will be a valuable contribution to the scientific community.

Goals of StressDB:

1. We are building a relational database for the storage, web-based retrieval and analysis of microarray data.

2. We are using an Oracle database management system on a Windows NT server.

3. We have made a commitment to conforming our standards for data storage and retrieval to the minimal requirements imposed by MIAME.

4. The scripts for the backend, application layer and front-end will be packaged for easy installation by any non-profit institution (after the receipt of appropriate licenses from Oracle and Microsoft). Anybody with knowledge of the Windows OS and systems administration should effortlessly be able to follow our ‘readme’ files and install the database on a Windows NT or Windows 2000 server.


277. The GeM system for comparative mapping of mammalian genomes
Gisèle Bronner, Bruno Spataro, Christian Gautier, Université Claude Bernard - Lyon 1;
François Rechenmann, INRIA Rhône-Alpes;
bronner@biomserv.univ-lyon1.fr
Short Abstract:

We present a model for comparative mapping and its implementation as a knowledge base in the GeM system. GeM consists of the coupling of this knowledge base (GeMCore) with specific graphical interfaces, such as GeMME for molecular evolution. GeM was used to characterize evolutionary changes of the genome between human and mouse.

One Page Abstract:

An integrative view of genomes is made possible through comparative genomics, which takes into account both the diversity and the heterogeneity of genomic data available for many organisms. Moreover, the structure and the dynamics of genetic information often depend on gene locations. Comparisons at the genome scale rather than the gene scale, which are possible because chromosomal segments are conserved between species, are therefore of major interest.

Comparative mapping appears to be an essential way to extrapolate genomic knowledge among organisms, especially from model organisms to economically important ones. Mapping information makes it possible to combine genetic and sequence data associated with homologous genes between organisms according to their similar location, as well as to analyze genome structure, evolution and function. However, such studies need modelling and integration of genomic and mapping data, which currently does not exist.

We propose a model for comparative mapping and its implementation as a knowledge base in the GeM system for comparative genomic mapping in mammals. GeM consists of a core knowledge base, GeMCore, dedicated to the management and control of genomic and mapping information for many organisms, coupled to domain-specific graphical user interfaces dedicated to specific problems (medicine, agronomy, molecular evolution…). Among these interfaces is GeMME, dedicated to molecular evolution.

GeMCore integrates data gathered from the HUGO, MGD, LocusLink and Hovergen databases, after the data have been examined to evaluate their consistency. The GeMCore model is UML-like, associations being as richly expressed as the entities of the domain. It is thus possible to handle data associated with marker types, spatial relations produced by comparative mapping, and evolutionary relations between markers and species. GeMCore is implemented with the AROM knowledge representation language and possesses an API which makes it possible to easily link it to the domain-specific graphical interfaces.

The GeMME interface for studying evolutionary mechanisms at the genome level is an example of a fully implemented one. It consists of a higher-level model for evolution, with its own concepts, data, and analytical tools, particularly at the statistical and graphical levels. Thus, conserved segments between species, Marey's maps or the spatial organization of genomic information can be graphically represented, and specific statistics can be computed, such as Moran's or Geary's autocorrelation indices.
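
As an illustration of the kind of statistic involved, the following minimal sketch (Python, not the GeMME implementation) computes Moran's I for a series of marker values along a chromosome, assuming a simple nearest-neighbour weight.

    def morans_i(values, weight):
        """Moran's I for a list of marker values and a weight function w(i, j)."""
        n = len(values)
        mean = sum(values) / n
        dev = [v - mean for v in values]
        num = 0.0
        w_sum = 0.0
        for i in range(n):
            for j in range(n):
                if i != j:
                    w = weight(i, j)
                    w_sum += w
                    num += w * dev[i] * dev[j]
        denom = sum(d * d for d in dev)
        return (n / w_sum) * (num / denom)

    # Example: GC content of consecutive markers; adjacent markers weighted 1.
    gc = [0.42, 0.44, 0.47, 0.52, 0.50, 0.39, 0.38]
    print(morans_i(gc, lambda i, j: 1.0 if abs(i - j) == 1 else 0.0))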

The combination of a generic model for comparative genomic mapping with domain-specific interfaces allows easy addition of novel data as well as the development of new methods. Our system was used to characterize evolutionary changes of genome structures between human and mouse at the genome level. Some of these results are presented here.


278. TrEMBL protein sequence database: production, interconnectivity and future developments
Maria-Jesus Martin, Claire O'Donovan, Allyson Williams, Rolf Apweiler, EBI-EMBL;
martin@ebi.ac.uk
Short Abstract:

The TrEMBL database has focused on making the protein sequence data available as quickly as possible and enhancing the data by computer-generated annotation methods. Due to the diverse sources of information present in TrEMBL, we introduce evidence tags to allow users to see how the data items have been generated.

One Page Abstract:

TrEMBL (Translation of EMBL nucleotide sequence database), is a computer-annotated protein sequence database derived from the translation of all coding sequences (CDS) in the nucleotide sequence databases EMBL/DDBJ/GenBank, except for those already included in SWISS-PROT. SWISS-PROT, a curated protein sequence data bank, contains not only sequence data but also annotation relevant to a particular sequence. TrEMBL was created in 1996, as a supplement to SWISS-PROT, to cope with the tremendous increase of sequence data that is submitted to the public nucleotide sequence databases. Unlike SWISS-PROT entries, those in TrEMBL are awaiting manual annotation. SWISS-PROT and TrEMBL releases, SWISS-PROT and TrEMBL updates and TrEMBLnew (new entries to be integrated into TrEMBL) are published weekly in the non-redundant database SPTR.

In the era of the genome projects, TrEMBL has two important priorities, which are performed on a regular basis and published in the weekly update of SPTR:

- To provide the protein sequence as soon as the nucleotide sequences are available in the nucleotide sequence databases, via TrEMBLnew.

- To add as much information as possible to the predicted protein by automatic annotation methods.

To achieve the first, TrEMBL puts special emphasis on sequences from the complete genome projects. Shortly after a genome is available in the nucleotide sequence databases, TrEMBLnew entries are created for each predicted coding sequence. These entries are then prioritized for promotion into TrEMBL after post-processing, which includes redundancy checks and enhancements in the annotation. With the availability of such large amounts of sequence data, the new challenge for TrEMBL is to attach biological functional information to these predicted sequences. InterPro is an integrated documentation resource for protein families, domains and functional sites which is used to link TrEMBL entries to different pattern databases such as PROSITE, PRINTS and Pfam and to the cluster sequence database, ClusTr. In addition, automatic annotation is carried out by a rule-based system that improves approximately 20% of the sequences in TrEMBL. The process of linking TrEMBL entries to other databases such as MGD, HSSP and FlyBase is also applied regularly. TrEMBL entries have diverse sources of information, which we wish to flag to allow users to see where the data items come from and what level of confidence can be attached to them, and to enable SWISS-PROT staff to automatically update data if the underlying evidence changes. To achieve this, we are introducing evidence tags to TrEMBL entries. This is currently ongoing internally and we hope to provide a public version by the end of 2001. Documentation on the process so far is provided at: ftp://ftp.ebi.ac.uk/pub/databases/trembl/evidenceDocumentation.html


279. PIR Web-Based Tools and Databases for Genomic and Proteomic Research
Huang, H, Barker, W.C, Chen, Y, Hu, Z, Lewis, K, Orcutt, B.C, Yeh, L.L, Wu, C, National Biomedical Research Foundation;
huang@nbrf.georgetown.edu
Short Abstract:

We have recently expanded the Web site of the Protein Information Resource (PIR) with new web-based tools to facilitate data retrieval, sequence analysis and protein annotation. A user-friendly navigation system with graphical interface connects PIR-PSD, iProClass, and other useful databases. These will better support genomic/proteomic research and scientific discovery.

One Page Abstract:

We have recently expanded the Web site of the Protein Information Resource (PIR) with new web-based tools to facilitate data retrieval, sequence analysis and protein annotation. The PIR-International Protein Sequence Database (PIR-PSD) is a non-redundant, expertly annotated, fully classified, and extensively cross-referenced protein sequence database. The iProClass is an integrated resource that provides comprehensive family relationships and structural/functional classifications and features of proteins. The PIR Web site connects these useful databases and tools with a user-friendly navigation system and graphical interface. In addition to standard data retrieval and analysis tools, the following new tools have been implemented: 1) HMM (Hidden Markov Model) Domain/Motif Search Tool searches a query sequence against HMM profiles for PIR or Pfam domains or iProclass motifs. This search also allows users to build an HMM profile and search the profile against the PIR-PSD. 2) The Bibliography Submission Page provides a mechanism for the user community to submit literature information for building better protein databases with validated information. 3) The BLAST Similarity Search Tool searches a sequence against PIR-NR, a complete non-redundant protein sequence database currently containing more than 630,000 sequences. The search results, including links to all source databases, are displayed in a user-friendly graphical format. The newly enhanced PIR web-based tools and databases will better support genomic/proteomic research and scientific discovery. The PIR website is accessible at http://pir.georgetown.edu.

This work is supported by NLM grant LM05798 and NSF grant DBI-9974855.


280. Classification of Protein Structures by Global Geometric Measures
Peter Roegen, Dept. of Math., Tech. Univ. of Denmark;
Peter.Roegen@mat.dtu.dk
Short Abstract:

The geometry of a protein is characterized by 30 numbers, whereby comparison of protein structures is reduced to comparing 30 numbers. The fact that 93% of all connected CATH1.7 domains can be automatically classified correctly based on these 30 numbers alone shows the power of this new protein structure description.

One Page Abstract:

Classification of Protein Structures by Global Geometric Measures

The idea is to give an absolute description of each protein structure by a set of characteristic numbers. In contrast, protein similarity measures such as RMSD [1], FSSP Z scores [2], and AF Distance [3] are relative and compare pairs of protein structures only.

The demands on such a set of numbers are the following: there have to be enough numbers to distinguish between different protein structures; to ensure that similar protein structures have similar numbers, each of these numbers has to depend continuously on deformations of the protein; finally, the size of such a set of numbers has to be reasonably small.

A set of numbers that fulfils the above demands is given by a family of global geometric measures that stems from the perturbative expansion of Witten's Chern-Simons path integral associated with a knot in 3-space. Two of these numbers are the Writhe and the Average Crossing Number, and the remaining numbers are generalizations of these two. Each protein structure is thus associated with a 30-dimensional score-vector.

A (pseudo) metric on the space of protein structures, denoted the Gauss Metric, is given by the Euclidean metric on the score-vector space. The Gauss Metric is shown to be correlated with the RMSD for pairs of homologous CATH1.7 [4] domains, and it is zero only for identical domains. Furthermore, proteins of different size are directly comparable without the use of alignment or gap penalties.

A natural way to represent a homology class of CATH1.7 is by the average of the score-vectors of all proteins in this class, denoted the H-center. 93% of all connected CATH1.7 domains are closest to an H-center of their own topology class. By averaging, each H-center should depend only slightly on the addition of new protein structures. This automatic protein structure classification system is thus expected to be only slightly dependent on the set of protein structures used to define it.
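
The classification step amounts to a nearest-centroid rule in the Gauss Metric, as in this minimal sketch (Python; the class labels and score vectors are random stand-ins, not real CATH data).

    import math
    import random

    def gauss_metric(u, v):
        """Euclidean distance between two 30-dimensional score vectors."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def h_center(vectors):
        """Component-wise mean of the score vectors of one class."""
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def classify(query, centers):
        """Return the label whose H-center is closest to the query."""
        return min(centers, key=lambda label: gauss_metric(query, centers[label]))

    random.seed(0)
    classes = {
        "class_A": [[random.gauss(0, 1) for _ in range(30)] for _ in range(5)],
        "class_B": [[random.gauss(3, 1) for _ in range(30)] for _ in range(5)],
    }
    centers = {label: h_center(vecs) for label, vecs in classes.items()}
    query = [random.gauss(0, 1) for _ in range(30)]
    print(classify(query, centers))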

References:

[1] B. W. Matthews & M. S. Rossmann,Methods Enzymol. 115:397-420.

[2] L. Holm & C. Sander, Nucleic Acids Res. (1997) 25:231-234.

[3] A. Falicov & F. E. Cohen, J. Mol. Biol. (1996) 258:871-892.

[4] L. Lo Conte et al., Nucleic Acids Res. (2000) 28:257-259.


281. EST Analysis and Gene expression in Populus leaves
Prof. Petter Gustafsson, Umea University;
Rupali Bhalerao, KTH Stockholm;
Stefan Jansson, Harry Björkbacka, Umea University;
Rikard Erlandsson, Joakim Lundeberg, KTH Stockholm;
petter.gustafsson@plantphys.umu.se
Short Abstract:

A set of 4921 sequences from a cDNA library made from young leaves was subjected to clustering using PHRAP and searched with WU-BLAST against other databases (SWISS-PROT and MENDEL). We focus on gene discovery, transcript profiling and studies of gene function using transgenic trees.

One Page Abstract:

We are running a large-scale tree EST program using Populus tremula x tremuloides as the model organism. The program is run in collaboration with scientists at the Dept. of Biotechnology, The Royal Institute of Technology (KTH), Stockholm. The overall goal of this project is to establish The Populus Genome Project as the international authority in tree genome research. We focus on three themes: gene discovery, transcript profiling and studies of gene function using transgenic trees. We have produced more than 45,000 ESTs. A set of 4921 sequences from a cDNA library made from young leaves was subjected to clustering using PHRAP and searched with WU-BLAST against other databases (SWISS-PROT and MENDEL). Clones with a high similarity to sequences in these databases were automatically annotated, while those with lower BLAST scores were manually annotated. All sequences with high similarity to a gene given a gene family number (GFN) in the MENDEL database were annotated according to that number; others were annotated as PGFNs (Populus GFNs). The genes were also assigned to a functional class based on the Functional Classification scheme for Arabidopsis at MIPS. A flow diagram of the annotation and functional class assignment process will be presented in the poster. 14 % of the clones encoded the small sub-unit of Rubisco (rbcS) and 4.5 % the major light-harvesting chlorophyll a/b-binding protein Lhcb1. Other photosynthetic proteins were also well represented. Germin-like protein corresponded to 1 %, and metallothionein and one protein without significant homology to any protein in public databases each corresponded to almost 0.5 % of the clones. The hypothesis that clone frequency could serve as an approximation for protein content was tested by comparing clone frequencies for photosynthetic proteins known to be present in equimolar amounts. In general, the clone frequency was within a factor of 2 of what could be predicted from protein stoichiometries, but for two out of 20 genes the EST clone frequency gave a misleading figure. Chloroplast protein synthesis was estimated to be approximately 20 % of the total protein synthesis of Populus leaves, about 45 % of the leaf proteins were estimated to be directly involved in photosynthesis, and about 50 % of all leaf proteins seem to be localised to the chloroplast.


282. Perl-based simple retrieval system behind the InterProScan package.
Zdobnov E.M., Apweiler R., EMBL-EBI;
zdevg@ebi.ac.uk
Short Abstract:

We present a Perl-based data retrieval system with a modular structure. Each of the data description modules defines the data schema and the associated text parsing routines. The system features recursive descent parsing rules, efficient lazy parsing and fast data retrieval using B-tree indexing.

One Page Abstract:

InterProScan [1] is a tool that scans given protein sequences against the InterPro member databases of protein signatures. The InterPro [2] database (v3.0, March 2001) integrates the PROSITE [3], PRINTS [4], Pfam [5], ProDom [6] and SMART [7] databases, and the addition of others is scheduled. The number of signature databases and the number of associated scanning tools, as well as the use of further refinement procedures, make the problem complex and require InterProScan to do a considerable amount of data look-up from several databases and program outputs. In the InterProScan package, a simple Perl-based data retrieval system was introduced to provide the required data look-up efficiency and easy extensibility. The system has a modular structure and is designed in an SRS-like [8] fashion. Each of the data description modules defines the data schema of the source text data and the parsing rules. The corresponding Perl module provides an object-oriented interface to the underlying entry attributes. The parsing of the source data into in-memory objects happens only once and is done upon request, implementing so-called lazy parsing. Hierarchical parsing rules are implemented using the recursive-descent approach (the Parse::RecDescent package). Fast data retrieval is implemented using Perl's native B-tree indexing (DB_File.pm, based on Berkeley DB). The simple 'one Perl module per data source' organisation makes it possible to reuse the modules in other stand-alone ad-hoc solutions. It is freely available as a part of the InterProScan package from the EBI ftp server (ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/).
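
For readers unfamiliar with the approach, the indexing and lazy-retrieval idea can be illustrated with the following minimal analogue (written in Python rather than Perl, with a toy flat-file layout; it is not the InterProScan code): entry identifiers are mapped to byte offsets in a persistent index, and an entry's text is read, and parsed, only when it is requested.

    import dbm

    def build_index(flatfile, index_path, id_prefix=b"ID   "):
        """Record the byte offset of every entry, keyed by its identifier."""
        with dbm.open(index_path, "n") as index, open(flatfile, "rb") as handle:
            offset = handle.tell()
            for line in iter(handle.readline, b""):
                if line.startswith(id_prefix):
                    index[line.split()[1]] = str(offset)
                offset = handle.tell()

    def fetch_entry(flatfile, index_path, entry_id, terminator=b"//"):
        """Jump to the stored offset and read one entry's raw text."""
        with dbm.open(index_path, "r") as index, open(flatfile, "rb") as handle:
            handle.seek(int(index[entry_id]))
            lines = []
            for line in iter(handle.readline, b""):
                lines.append(line.decode())
                if line.startswith(terminator):
                    break
            return "".join(lines)   # parsed into attributes only when needed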

1. Zdobnov E.M., Hermjakob H., and Apweiler R. "InterProScan - an integration tool for the signature-recognition methods in InterPro"
Currents in computational molecular biology 2001, ed. El-Mabrouk N., Lengauer T., and Sankoff D. 2001, Montreal: CRM. p. 41-2.

2. Apweiler R., Attwood T.K., Bairoch A., Bateman A., Birney E., Biswas M., Bucher P., Cerutti L., Corpet F., Croning M.D., Durbin R., Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N., Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R., Marx B., Mulder N.J., Oinn T.M., Pagni M., Servant F., Sigrist C.J., and Zdobnov E.M. "The InterPro database, an integrated documentation resource for protein families, domains and functional sites"
Nucleic Acids Res, 2001. 29(1): p. 37-40.

3. Hofmann K., Bucher P., Falquet L., and Bairoch A. "The PROSITE database, its status in 1999"
Nucleic Acids Res, 1999. 27(1): p. 215-9.

4. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N., and Wright W. "PRINTS-S: the database formerly known as PRINTS"
Nucleic Acids Res, 2000. 28(1): p. 225-7.

5. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L. "The Pfam protein families database"
Nucleic Acids Res, 2000. 28(1): p. 263-6.

6. Corpet F., Gouzy J., and Kahn D. "Recent improvements of the ProDom database of protein domain families"
Nucleic Acids Res, 1999. 27(1): p. 263-7.

7. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P. "SMART: a web-based tool for the study of genetically mobile domains"
Nucleic Acids Res, 2000. 28(1): p. 231-4.

8. Etzold T., Ulyanov A., and Argos P. "SRS: information retrieval system for molecular biology data banks"
Methods Enzymol, 1996. 266: p. 114-28.


283. Intelligent validation of oligonucleotides for high-throughput synthesis
Bastien Chevreux, Thomas Pfisterer, Christoph Göthe, Sebastian Liepe, Klaus Charissé, Bernd Drescher, MWG BIOTECH AG;
bach@mwgdna.com
Short Abstract:

Physical restrictions, which cannot be expressed by simple rule-based systems, prevent the synthesis of certain oligonucleotides. MWG BIOTECH AG has designed validating methods that use intelligent agent technology to check oligos described in the XML-based OML language, which makes it possible to filter critical oligos before they go into production.

One Page Abstract:

Over the last few years, MWG BIOTECH AG has acquired and maintained the leadership in the production of salt free oligonucleotides. The products range from short, simple oligos (with no modification) to long oligonucleotides having 5'-, 3'- and up to four different internal modifications.

Unfortunately, certain combinations of oligos and modifications are impossible to produce due to physical restrictions. The main problem in recognising these combinations - aside from simple folding problems - consists in the fact that no simple rules can be formulated that express the restrictions correctly. Most of the difficult cases are therefore stored in internal company knowledge bases to which researchers outside a company normally do not have access. It can be quite a frustrating experience both for a customer and a manufacturer when the production of oligos is deferred one or multiple times because of this.

MWG BIOTECH AG's bioinformatics department has developed two complementary strategies to resolve this problem. First, the XML-based Oligodefinition Meta Language (OML) has been designed to describe oligonucleotides and their modifications. This allows standard XML tools to be used to read and write files containing oligo descriptions, ensures cross-platform independence, and guarantees that the oligo descriptions are unambiguous. The second part consists of a consistency and manufacturability checking system that validates oligos. This system is based on intelligent agent technology incorporated into clients that can update their checking rules and routines on demand, via the Internet, from a company knowledge server.


284. Iobion MicroPoint Curator, a complete database system for the analysis and visualization of microarray data
Jason Goncalves, Iobion Informatics;
Terry Gaasterland, Rockefeller University;
Joe Sorge, Iobion Informatics;
jgonca@iobion.com
Short Abstract:

MicroPoint Curator is an enterprise-level solution for microarray expression analysis that enables researchers to store, analyze, and visualize gene expression data. It stores a complete annotation of experiments based on MIAME and includes multiple normalization and analysis methods including hierarchical clustering, k-means clustering, PCA and multi-dimensional scaling.

One Page Abstract:

The Iobion MicroPoint Curator is an affordable enterprise-level solution for microarray expression analysis modeled after the TANGO system developed in the laboratory of Terry Gaasterland. MicroPoint Curator enables researchers to store, analyze, and visualize large quantities of gene expression data obtained from DNA microarrays. MicroPoint Curator stores all raw microarray data, including 16-bit TIFF microarray images, in a fully relational database. The database holds a complete annotation of all imported microarray experiments based on the MIAME standards and is compliant with the microarray data XML exchange standard (MAML).

MicroPoint Curator addresses data analysis issues specific to spotted microarrays, including filtering of dust spots or regions with printing imperfections. Users may drill down to spot images at any point in the analysis, flagging artifacts and outliers throughout the process. Several normalization methods are supported, including intensity-based normalization, log-ratio based normalization, normalization with housekeeping or with exogenous control genes, linear-regression based normalization, and grid-by-grid (print-tip) normalization. Processed microarray data can be clustered and visualized using MicroPoint Curator; currently supported methods include hierarchical clustering, k-means clustering, principal component analysis, multi-dimensional scaling and CAST clustering. MicroPoint Curator also provides embedded links to public gene annotation sources and Gene Ontology (GO) annotations, so functional gene annotation is easily available.
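
As an example of one of these schemes, a global log-ratio normalization can be sketched as follows (a minimal Python illustration with made-up intensities, not MicroPoint Curator's implementation): the median log2(Cy5/Cy3) ratio over retained spots is subtracted from every spot, centring the ratio distribution on zero.

    import math
    from statistics import median

    def log_ratio_normalize(spots, min_signal=1.0):
        """spots: list of (cy3, cy5) intensities; returns median-centred log2 ratios."""
        ratios = [math.log2(cy5 / cy3) if min(cy3, cy5) >= min_signal else None
                  for cy3, cy5 in spots]                      # None = flagged spot
        centre = median(r for r in ratios if r is not None)
        return [None if r is None else r - centre for r in ratios]

    print(log_ratio_normalize([(150, 300), (400, 380), (90, 95), (0.5, 4000)]))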

MicroPoint Curator is one of several applications on the Iobion bioinformatics server appliance the Iobion Sapphire. Iobion Sapphire is an Intel based hardware system that comes with Linux, Apache Web server, a full-featured relational database, the R statistical language and a comprehensive suite of bioinformatic tools and databases preinstalled. The system serves these scientific applications to Web clients on a local intranet or over the Internet.


285. Rapid IT prototype to support macroarray experiments
Grombacher, Thomas, Karch, Oliver, Wilbert, Oliver Maria, Toldo, Luca, Merck KGaA Germany, Bio- and Chemoinformatics;
thomas.grombacher@merck.de
Short Abstract:

We have used standard methods to design a rapid prototype for handling data from macroarray experiments. Our strategy involves a web based solution for data uploading, data storage in RDBMS, and querying and representing data in SRS6. Data representation is also supported by graphic applets (eSuite2.0) and 2D plots.

One Page Abstract:

Rapid IT prototyping is needed to support biologists in handling large sets of data. We have generated a solution for the handling of macroarray expression data by using and efficiently combining standard methods for the management of biological data. Namely, we combined intranet technology for the interaction with the users, data storage in a relational data warehouse, and data representation in SRS6.

Image processing was performed by the biologists with the AIDA software (RAYTEST GmbH). Raw data and all data pertaining to the experiment were exported as a tab-delimited file. The intranet is used for uploading data and supplying experimental details via a CGI fill-in form. After storage in the RDBMS, standard statistical procedures were used for cleaning up and evaluating the data by background correction, normalization to the mean filter value, and calculation of log2 variations. We also used clustering of similarly expressed genes with the k-medoids method implemented in CLARA (Clustering Large Applications). CLARA uses the minimal average silhouette width as the criterion to choose the number of clusters into which to partition the data. All results were stored back into the database.
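
The preprocessing chain can be sketched as follows (a minimal Python illustration with made-up intensities; the real pipeline runs against the RDBMS with the procedures described above): background subtraction, scaling by the mean filter value, and log2 variation between two filters.

    import math

    def preprocess(signals, backgrounds):
        """Background-correct spot signals and scale them by the filter mean."""
        corrected = [max(s - b, 1.0) for s, b in zip(signals, backgrounds)]
        mean = sum(corrected) / len(corrected)
        return [c / mean for c in corrected]

    def log2_variation(treated, control):
        """Per-gene log2 ratios between two normalized filters."""
        return [math.log2(t / c) for t, c in zip(treated, control)]

    filter_a = preprocess([520, 80, 1500, 300], [40, 35, 60, 50])
    filter_b = preprocess([260, 150, 1450, 600], [45, 30, 55, 40])
    print(log2_variation(filter_a, filter_b))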

As a solution for querying and representing the processed data we chose the SRS6 system (LION Biosciences AG), which is an industry standard for this task. The SRS internal representation of the data was extended with graphics from external applets and GIFs. We used eSuite2.0 (Lotus) applets for the graphical representation of expression data and visualized the original filter spots as GIFs, which are stored in the database as well. The data representation in SRS is supplemented by 2D plots generated from the expression values of genes in two different tissues. These plots are built via a direct call to the database, also using a CGI web interface.


286. EXProt - a database for EXPerimentally verified Protein functions.
Frank H.J. van Enckevort, Björn M. Ursing, Jack A.M. Leunissen, Centre for Molecular and Biomolecular Informatics (CMBI), Nijmegen, The Netherlands.;
Roland J. Siezen, NIZO food research, Ede, The Netherlands.;
frankve@cmbi.kun.nl
Short Abstract:

EXProt (database for EXPerimentally verified Protein functions) is a new database containing protein sequences for which the function has been experimentally verified. EXProt Release 1.1 is a selection of 4351 entries from the Pseudomonas Community Annotation Project and the prokaryotic section of the EMBL nucleotide sequence database (http://www.cmbi.kun.nl/EXProt).

One Page Abstract:

EXProt (a database for EXPerimentally verified Protein functions) is a new non-redundant database containing 4351 protein sequences for which the function has been experimentally verified. It is a selection of 375 entries from the Pseudomonas Community Annotation Project (PseudoCAP, http://www.pseudomonas.com) and 3976 entries from the prokaryotic section of the EMBL nucleotide sequence database, Release 66 (http://www.ebi.ac.uk/embl/). The entries in EXProt all have a unique ID number and provide information about the organism, the protein sequence, the functional annotation, a link to the entry in the original database and, if known, the gene name and links to references in PubMed. The EXProt database will be extended to include more genome databases and topic-specific databases; next to be included are proteins from GenProtEC (http://genprotec.mbl.edu). The EXProt web page (http://www.cmbi.nl/EXProt/) provides a further description of the database and search tools (blastp & blastx). The EXProt entries are indexed in SRS6 at the CMBI, Nijmegen, The Netherlands, and can be searched by keyword (http://www.cmbi.kun.nl/EXProt/srs/). The authors can be contacted by email (EXProt@cmbi.kun.nl).


287. HeSSPer: a program to analyze sequence alignments and protein structures derived from HSSP database (up)
Georgios Pappas Jr., Universidade Católica de Brasília - Brazil;
gpappas@pos.ucb.br
Short Abstract:

HeSSPer is a Java-based software package that parses and analyzes HSSP files. It provides a visual environment that assists the integration of amino acid conservation data with three-dimensional structure information, helping to understand the network of atomic contacts important for the maintenance of particular folding patterns.

One Page Abstract:

HeSSPer is a Java-based software package that parses and analyzes HSSP (Homology-derived Secondary Structure of Proteins) files, a database of multiple sequence alignments for each of the proteins with known structure in the Protein DataBank (PDB). HeSSPer provides a rich visual environment aimed at in-depth study of the correlations between patterns of sequence conservation and protein structure. Among its capabilities, the program offers a series of visualization tools that let the user gather detailed information about each individual protein in the multiple alignment, plots of sequence conservation measures, sequence logos and direct pointers to other relevant databases. It also integrates and controls helper programs to display and manipulate sequence alignments (Jalview) as well as three-dimensional structures (Rasmol). Atomic contacts between residues receive special attention and are displayed with two new representations, graphical contacts and contact trees, which permit rapid identification of critical contacts based on the values of sequence conservation. In summary, HeSSPer is a tool that assists the integration of amino acid conservation within a particular protein family with the available three-dimensional structure information, helping to understand the network of atomic contacts important for function or structural maintenance.
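As a minimal illustration of the kind of sequence conservation measure HeSSPer plots, the Python sketch below computes per-column Shannon entropy from a toy alignment; it is not HeSSPer code and the names are invented for the example.

import math
from collections import Counter

def column_conservation(alignment):
    """Per-column Shannon entropy of a multiple alignment
    (low entropy = high conservation); gap characters are ignored."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != '-']
        if not residues:
            scores.append(0.0)
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(entropy)
    return scores

# toy alignment: the first and last columns are fully conserved
print(column_conservation(["MKV-LT", "MRV-LT", "MKIGLT"]))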


288. Protein Structure extensions to EMBOSS (up)
Mr. Ranjeeva D. Ranasinghe, Dr. Jon C. Ison, Dr. Alan J. Bleasby, UK MRC HGMP Resource Centre;
rranasin@hgmp.mrc.ac.uk
Short Abstract:

EMBOSS is a free Open Source software package developed for biological sequence analysis. Our recent additions provide new software and databases for three-dimensional protein structures. We address the need for consistent and highly parsable sources of coordinate and domain data. These databases and related software are publicly available.

One Page Abstract:

The European Molecular Biology Open Software Suite (EMBOSS) is a free Open Source software package developed for biological sequence analysis. The software can automatically handle a variety of data formats and even allows transparent retrieval of sequence data from the web. Also, as extensive libraries are provided with the package, other scientists can develop and release software under the GNU open source software license. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.

Until recently all EMBOSS programs were for the analysis of nucleic acid and protein sequences. Our recent additions provide new software and databases for three-dimensional protein structures. For example, we address the need for consistent and highly parsable sources of coordinate and domain data by providing the following: a database of "cleaned-up" protein coordinate data in EMBL-like format using a residue numbering scheme; two databases of clean coordinate data for individual SCOP domains, in EMBL-like format and PDB format respectively; and the SCOP classification in EMBL-like format. These databases and related software are publicly available.

Other software provided includes programs for calculating residue-residue contact site data, algorithms for removing redundancy in the SCOP sequence database, and wrappers for existing programs and packages. For instance, a STAMP wrapper can generate structural alignments for each SCOP family, and a wrapper for PSI-BLAST can use these alignments for searches of a protein sequence database. Software to interrogate SWISS-PROT by keyword and integrate the search results with those of the PSI-BLAST searches is also provided.
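The basic idea behind residue-residue contact calculation can be sketched as a simple C-alpha distance cutoff, as in the Python fragment below; this is only an illustration and does not reproduce the EMBOSS programs or their parameters.

import numpy as np

def residue_contacts(ca_coords, cutoff=8.0, min_separation=3):
    """Residue pairs whose C-alpha atoms lie within `cutoff` angstroms,
    skipping pairs closer than `min_separation` in sequence."""
    contacts = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if np.linalg.norm(ca_coords[i] - ca_coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts

# toy coordinates for six residues in which the chain folds back on itself
coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0],
                   [7.6, 3.8, 0.0], [3.8, 3.8, 0.0], [0.0, 3.8, 0.0]])
print(residue_contacts(coords))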


289. GENIA corpus: A Semantically Annotated Corpus in the Molecular Biology Domain (up)
Tomoko OHTA, Department of Information Science, Graduate School of Science, University of Tokyo;
Yuka TATEISI, CREST, Japan Science and Technology Corporation;
Sang Zoo LEE, Korea University;
Jun-ichi TSUJII, Department of Information Science, Graduate School of Science, University of Tokyo;
okap@is.s.u-tokyo.ac.jp
Short Abstract:

We have built a corpus of annotated abstracts obtained from the MEDLINE database. We have already annotated 1,000 abstracts with 36 different semantic classes. In this poster, we report on this new corpus, its ontological basis, our annotation scheme, and statistics on its annotated objects.

One Page Abstract:

Introduction:

Corpus annotation is now a key topic for all areas of natural language processing (NLP) and information extraction (IE) that employ supervised learning. With the explosion of results in molecular biology there is an increased need for IE to extract knowledge to support database building and to search intelligently for information in online journal collections. To support this, we have built a corpus of annotated abstracts obtained from the MEDLINE database [2,3]. We have already annotated 1,000 abstracts with 36 different semantic classes and are working to increase the number of abstracts to 3,000. In this poster, we outline the features of this new corpus, its ontological basis, our annotation scheme, and statistics on its annotated objects.

Ontological basis and annotation scheme:

The task of annotation can be regarded as identifying and classifying the names that appear in the texts according to a pre-defined classification. For the classification to be reliable, it must be well defined and easy for the domain experts who annotate the texts to understand. To fulfill this requirement, we built a conceptual model (ontology) of substances and sources (substance locations). In this ontology, we classify substances according to their chemical characteristics rather than their biological role. We have marked up the names of PROTEINs, DNAs, RNAs, SOURCEs and OTHERs that appear in the abstracts in GPML/XML format [4]. These names are considered relevant to the description of biological processes, and recognition of such names is necessary for understanding higher-level 'event' knowledge.
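To illustrate the kind of inline mark-up involved, the snippet below parses a toy sentence annotated in a GPML-flavoured XML style; the tag and attribute names are invented for this example and do not reflect the actual GENIA/GPML schema.

import xml.etree.ElementTree as ET

# A toy sentence annotated inline; tag and attribute names are illustrative only.
sentence = (
    '<sentence>Activation of <term class="protein">NF-kappa B</term> '
    'requires binding to the <term class="DNA">kappa B enhancer</term> '
    'in <term class="source">T lymphocytes</term>.</sentence>'
)

root = ET.fromstring(sentence)
for term in root.iter('term'):
    print(term.get('class'), '->', term.text)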

Statistics:

We have annotated 1,000 abstracts related to transcription factors in human blood cells. We have marked up around 32,000 names with 36 different semantic classes: around 9,500 protein names, 3,500 DNA names, 400 RNA names, 7,000 source names and 11,600 others.

Conclusion:

We have built a semantically annotated corpus. The GENIA corpus is useful as a training set for programs that recognize biological names and terms. The corpus can also be used to learn how the tagged names relate to each other and to other names, in order to give feedback to the annotators, enhance the ontology, and enable the annotation of richer information such as biological roles.

References:

[1] Y. Ohta, et al., Automatic construction of knowledge base from biological papers, Proc. of ISMB-97, pp. 218-225, 1997.

[2] T. Ohta, et al., A Semantically Annotated Corpus from MEDLINE Abstracts, in Genome Informatics, Universal Academy Press, Inc., pp. 294-295, 1999.

[3] Y. Tateisi, et al., Building an Annotated Corpus in the Molecular-Biology Domain, in Proc. COLING 2000 Workshop on Semantic Annotation and Intelligent Content, pp. 28-34, 2000.

[4] GENIA project: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/


290. Talisman - Rapid Development of Web Based Tools for Bioinformatics (up)
Tom Oinn, EMBL-EBI;
tmo@ebi.ac.uk
Short Abstract:

Talisman is an XML-based Rapid Application Development tool intended to enable non-programmers to create sophisticated web-based tools. Platform independent and arbitrarily extensible, it acts as a common layer between users and diverse systems such as databases, SRS installations and sequence analysis tools.

One Page Abstract:

A common problem that we have had to deal with at the EBI is how to provide users with little or no computing background with access to the data (commonly held in relational databases) that they wish to work with. Such data storage systems generally do not provide a non-programmer-friendly interface to their data, and therefore require the construction of additional tools for end users. Until recently, each of these tools was written on a case-by-case basis.

Talisman is an attempt to create a system that allows us to develop these tools in a fraction of the time it would take to create them from scratch. The system defines a language used to create pages; this language, containing as it does definitions of the content and behaviour of the system, is semantically equivalent to the minimum specification that a programmer would have required to create the system from nothing. A Talisman page definition can therefore be regarded as a formalised project specification.

Talisman has already been deployed for various applications related to the InterPro database at the EBI (www.ebi.ac.uk/interpro). Creating a system with Talisman typically takes roughly one twentieth of the time needed to build it from scratch, and this has allowed us to provide our biologists with a range of sophisticated and easy-to-use tools.

Talisman is entirely written in Java and will compile and run on any platform that has a compliant JVM and Servlet engine. This means that installing Talisman can be as simple as copying a single file into your servlet engine and pressing the restart button. The extensible nature of Talisman prevents it from being locked into any given field, and efforts are currently underway to integrate it with various other framework-type systems such as AppLab and BioJava DAS servers.

Talisman is released under the LGPL license, and may be found at http://golgi.ebi.ac.uk/talisman.


291. The Eukaryotic Linear Motif Database ELM (up)
Rune Linding, Christine Gemuend, Sophie Chabanis, Toby Gibson, EMBL - Biocomputing Unit - Gibson Team;
linding@EMBL-Heidelberg.DE
Short Abstract:

About any protein, the question is "What functional sites are in it?" Currently this question cannot be answered completely and reliably by bioinformatics. The ELM consortium will implement a novel resource for the prediction of functional sites. Effective prediction of short motifs requires implementation of new context-dependent filtering software.

One Page Abstract:

About any protein sequence of interest, the question will be asked: "What functional sites are in my protein?" Since eukaryotic proteins are highly modular and may have tens or even hundreds of domains, the question is often asked many times about the same protein. Although there may be sites or domains of hitherto unknown function, which will have to be revealed by experimentation, it is desirable to identify all the known types of functional module. There are many bioinformatics tools devoted to this end and, in favourable cases, it may be possible to get a good initial description of protein function. However, it is not possible to answer the question reliably using current resources.

Protein modules come in two generic categories: (1) larger folded globular domains and (2) small linear motifs that are often unstructured. The globular domains can be quite well detected by sensitive probabilistic methods such as HMMs (Hidden Markov Models) or profiles. However, statistically robust methods cannot usually be applied to small motifs, while pattern-based methods over-predict enormously: the few true motifs are lost amongst massive numbers of false positives. There are a number of existing web-accessible resources specialising in protein modules or functional motifs: The PFAM, SMART and PROSITE servers are excellent tools to find globular domains in a protein. PROSITE is also the most important resource for patterns corresponding to both dispersed and linear motifs, whilst PSORT focuses on a restricted set of motifs that are diagnostic for cell compartmentalisation.

Since linear motifs are both statistically insignificant and prone to massive over-prediction, simply detecting matches in sequences has almost no predictive power. Probably for this reason (PROSITE excepted) there has been much less activity in assembling databases of linear motifs, as compared to the plethora of globular domain databases. Simply collecting the data is insufficient to provide a useful facility. One approach to detecting linear motifs may be to employ context-based discriminatory rules to filter out many of the false positives, so that a small number of plausible candidates remain to be followed up. For example, a candidate motif can be excluded from further consideration if the motif is buried in the core of a globular domain or if the protein resides in the wrong cellular compartment. It follows therefore that the software tools are just as important as the data collection when it comes to detecting linear motifs in sequences. Indeed we would argue that it is only worth assembling the data if the tools to use it are in place. Linear motifs have been neglected in state-of-the-art bioinformatics.

The purpose of the ELM application is to redress this neglect by putting in place the tools needed to detect linear motifs in proteins and to apply these tools in the prediction of functional sites in proteins. Linear motifs come in variable lengths and conservation patterns. Therefore, no single pattern description method is optimal for describing the set of linear motifs. ELM will use the most appropriate pattern descriptor for each motif, choosing from among: (1) Exact match; (2) Regular expression; (3) Weight matrix; and, if needed, (4) HMM for the most complex motifs. One problem faced by ELM is that motifs may have diverged in different eukaryotic species. For example the KDEL motif is highly conserved in the animals but shows variability in the fungi, being HDEL in baker's yeast, DDEL in the quite closely related Kluyveromyces lactis and ADEL in fission yeast. The ELM resource will ensure that this type of variation can be efficiently presented to the user, who will most often be interested in only a subset of eukaryotes.
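As a toy illustration of motif matching combined with context filtering, the Python sketch below scans a sequence with a regular expression covering the KDEL-like variants mentioned above and discards matches that fall inside annotated globular-domain regions; it is not the ELM software, and the filtering rule is deliberately simplistic.

import re

def find_motifs(sequence, pattern, domain_regions):
    """Scan a protein sequence for a linear-motif regular expression and
    discard matches that fall inside annotated globular-domain regions,
    a deliberately crude stand-in for context filtering."""
    hits = []
    for m in re.finditer(pattern, sequence):
        buried = any(start <= m.start() < end for start, end in domain_regions)
        if not buried:
            hits.append((m.start(), m.group()))
    return hits

# toy sequence ending in a KDEL-like motif; the regular expression covers the
# KDEL/HDEL/DDEL/ADEL variants mentioned above
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKDEL"
domains = [(5, 60)]  # residues annotated as part of a globular domain
print(find_motifs(seq, r"[KHDA]DEL$", domains))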


292. Information search, retrieval and organization relevant to molecular biology (up)
R. Kincaid, A. Vailaya, S. Handley, P. Chundi, N. Shechtman, B. Nardi, D. Moh, A. Kuchinsky, K. Graham, M. Creech, A. Adler, Life Science Informatics, Systems Solutions Lab, Agilent Technologies;
robert_kincaid@agilent.com
Short Abstract:

Data relevant to molecular biology can now be found in many different places and different forms. This poster describes recent work directed at novel approaches for acquiring, organizing and integrating such disparate data, in order to maximize its usefulness to molecular biologists working in disease or drug discovery.

One Page Abstract:

The explosive, simultaneous growth in both information technology and high-throughput genomics has led to a situation where enormous quantities of data relevant to the life sciences can be found in a variety of locations as well as in different forms. This information may be contained in a relational database, a flat file, an abstract, a publication, an HTML document, etc. It may be on a public web server, in a proprietary database or even on a researcher's personal web page. With this plethora of data two key problems arise: finding the information at all when it is potentially scattered across many different sources, and navigating through the information found to select what is relevant to the particular question at hand. This poster describes recent work directed at novel approaches both to acquiring such disparate data and to organizing it to facilitate its use by molecular biologists working in disease or drug discovery. Extensible meta-search technology allows data to be found across a variety of different sources. Following this data acquisition with well-known data-mining algorithms permits the results of meta-searches to be organized and more easily navigated and understood. Finally, we integrate the retrieved information with further organization tools to assist in manually curating the information into a complete overview of biological relevance and function.


293. Los Alamos National Laboratory Pathogen Database Systems (up)
Christian V. Forst, Staff, Bioscience Division, Los Alamos National Laboratory;
chris@lanl.gov
Short Abstract:

Second generation genome databases at Los Alamos contain information about biological threat and STD pathogens. These completely annotated genomes include information on repeats, ABC transporters, pathways (and pathway comparison), proteome comparison and phylogenies. The databases integrate standard bioinformatics tools with in-house-developed software. Public access is available at http://www.cbnp.lanl.gov and http://www.stdgen.lanl.gov.

One Page Abstract:

Among the several specialized sequence databases at Los Alamos are second generation, curated databases that contain molecular sequence information about biological threat pathogens and pathogens related to sexually transmitted diseases. The relational database schema contains annotations for the complete genomes of bacterial and viral pathogens, and selected close relatives. These completely annotated genomes include a rich collection of references to experimental evidence and information on repeats, ABC transporters, pathways and pathway comparison, proteome comparison as well as phylogenies. The databases integrate standard bioinformatics tools such as Blast, Blocks, COG, ProDom, PFam, KEGG and WIT with in-house-developed software such as BugSpray for whole-genome comparison and DisplayNet and PredPath for metabolic network prediction and comparative analysis. In 2001, genome sequences of oral pathogens will be included. Public access to the databases includes a wide range of search capabilities and user-friendly tables and graphics, available at http://www.cbnp.lanl.gov and http://www.stdgen.lanl.gov.


294. Facilitating knowledge acquisition and discovery through automatic web data mining (up)
Florence HORN, Hinrich SCHÜTZE, Dpt of Cellular and Molecular Pharmacology, UCSF, San Francisco;
Emmanuel BETTLER, Gerrit VRIEND, Center of Molecular and Biomolecular Informatics, University of Nijmegen, The Netherlands;
Fred E. COHEN, Dpt of Cellular and Molecular Pharmacology, UCSF, San Francisco;
horn@cmpharm.ucsf.edu
Short Abstract:

The goals of the GPCRDB and NucleaRDB databases are to collect, provide and harvest heterogeneous information on GPCRs and nuclear receptors. The bottleneck in database maintenance is data acquisition. We present our work on automated data collection. In particular, we describe our methodology to extract mutation data from Medline abstracts.

One Page Abstract:

The amount of genomic and proteomic data that is entered each day into databases and the experimental literature is outstripping the ability of experimental scientists to keep pace. Consequently, there is a need for specialized databases that collect and organize data around one topic or one class of molecules. These systems are extremely useful for experimental scientists because they can access a large fraction of the available data from one single source.

We are involved in the development of two such databases with substantial pharmacological relevance. These are the GPCRDB and NucleaRDB information systems that collect and disseminate data related to G protein-coupled receptors and intra-nuclear hormone receptors, respectively. The GPCRDB was a pilot project aimed at building a generic molecular class-specific database capable of dealing with highly heterogeneous data. The NucleaRDB was started last year as an application of the concept for the generalization of this technology. The GPCRDB is available via the WWW at http://www.gpcr.org/7tm/ and the NucleaRDB at http://www.receptors.org/NR/.

The major bottleneck in database maintenance is data acquisition. Databases can only provide the user with information that has been entered and indexed into a computer file. Unfortunately, data deposition is only obligatory for sequences and three-dimensional atomic coordinates. All other experimental data has to be manually extracted from the literature and entered into databases by data managers and curators.

Consequently, our project is to develop a methodology to automatically extract experimental data such as mutation data, ligand binding information, expression data, etc., by having computer software electronically read articles. This could potentially lead to discoveries that were not possible because the knowledge was spread out and buried in the literature. We chose to focus on mutation data for nuclear receptors. This poster shows how we automatically capture heterogeneous information from different databases, and in particular, the methodology we use to select Medline abstracts, analyze their contents and extract mutation information.
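As a much simplified illustration of the kind of pattern involved in mutation extraction, the snippet below uses a regular expression to pull point-mutation mentions (e.g. A123T or Ala123Thr) out of abstract text; the methodology described above is considerably more sophisticated, and the example text is invented.

import re

# One-letter (A123T) and three-letter (Ala123Thr) point-mutation mentions.
AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
MUTATION = re.compile(
    rf"\b(?:[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]|(?:{AA3})\d+(?:{AA3}))\b"
)

abstract = ("The C587S substitution in the ligand-binding domain, like Cys440Arg, "
            "abolished hormone binding.")
print(MUTATION.findall(abstract))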


295. Functional analysis of novel human full-length cDNAs from the HUNT database at Helix Research Institute (up)
Henrik T. Yudate, Ph.D, Helix Research Institute;
Makiko Suwa, Electrotechnical Laboratory (ETL), Tsukuba, Japan;
Ryotaro Irie, Helix Research Institute;
Hiroshi Matsui, RIKEN Genomic Sciences Center, Japan;
Tetsuo Nishikawa, Yoshitaka Nakamura, Helix Research Institute;
Daisuke Yamaguchi, Hitachi Software Engineering Co., Ltd.;
Zhang Zhipeng, Tomoyuki Yamamoto, Keiichi Nagai, et al., Helix Research Institute;
yudate@hri.co.jp
Short Abstract:

Helix Research Institute, a joint research project principally funded through The Japan Key Technology Center, has developed a high-throughput system for cloning of human full-length cDNAs. Here we describe the latest developments from the Bioinformatics Department for in silico analysis of HUman Novel Transcripts and their release via our HUNT database: http://www.hri.co.jp/HUNT/

One Page Abstract:

Helix Research Institute, Inc. (HRI), a joint research project principally funded through The Japan Key Technology Center, has developed a high-throughput system for cloning and sequencing of human full-length cDNAs and for identifying gene function. The clones have been sequenced in the NEDO Human cDNA Sequencing Project and released via the DNA Data Bank of Japan. Here we describe the latest developments from the HRI Bioinformatics Department for in silico analysis and functional annotation of these HUman Novel Transcripts.

Currently, we have carried out in silico analysis of 4996 full-length cDNA sequences and released this information in the publicly available HUNT database (http://www.hri.co.jp/HUNT; Yudate HT, et al., 2001). Protein sequences have been predicted for these full-length cDNA sequences using our in-house ATGpr software. A blastp sequence similarity search against the non-redundant protein sequence database nr and the Swiss-Prot database reveals that a large fraction of these novel proteins have little similarity to any proteins of known function. We find that the HUNT database contains more than 2500 truly novel and uncharacterized proteins, because the hits from the similarity search are in large part to hypothetical proteins.
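A toy decision rule of the kind implied above is sketched below: a predicted protein is flagged as uncharacterized when all of its similarity-search hits carry uninformative descriptions. The keyword list and the examples are invented for the illustration and are not the criteria actually used for the HUNT database.

UNINFORMATIVE = ("hypothetical protein", "unknown", "unnamed protein product", "predicted protein")

def is_uncharacterized(hit_descriptions):
    """Flag a predicted protein as uncharacterized when every similarity-search
    hit description is uninformative (or when there are no hits at all)."""
    informative = [d for d in hit_descriptions
                   if not any(word in d.lower() for word in UNINFORMATIVE)]
    return len(informative) == 0

print(is_uncharacterized(["hypothetical protein FLJ00000", "unnamed protein product"]))  # True
print(is_uncharacterized(["hypothetical protein", "serine/threonine kinase"]))           # False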

However, it is still possible that tertiary structures of related, possibly family, proteins have been determined despite low sequence similarity, and to find these we thread the sequences onto a set of library protein structures with the THREADER fold recognition software. These calculations are CPU-intensive, but at HRI we have dedicated several CPUs to continuous computation on novel sequences and have analyzed more than 1300 in this way. As a result, tertiary structure candidates are listed in the HUNT database for a considerable number of entries, and in many cases these serve as the only important source of information about the function of the corresponding proteins. The hope is that the functional annotation of candidate structures can be transferred to the query sequences, and one way to assess the predictions is to obtain complementary information from well-established sequence analysis programs. Here we provide localization estimates from the PSORT program, and we also include secondary structure predictions from PREDATOR and the CHAPERON sequence analysis software, and results from a search for PFAM, PRINTS-S and PROSITE sequence patterns, again to support the fold recognition results where possible.

Validation of the individual structure assignments from THREADER can also be obtained from the independent GENIUS results; GENIUS is a sophisticated intermediate-sequence-search system that also performs structure assignment. The reliability of any prediction increases when the results from these two fundamentally different approaches coincide. The sequences judged to have the most reliable structure assignments are clustered according to the functional annotation of the assigned structures, and representative examples will be given. All sequence data and analysis results are available from the HUNT home page at http://www.hri.co.jp/HUNT

Reference: Yudate HT, Suwa M, Irie R, et al., Nucleic Acids Research 2001, Vol. 29, No. 1, pp. 185-188


297. iSPOT and MINT: a method and a database dedicated to molecular interactions (up)
M. Helmer-Citterich, Brannetti, B., Zanzoni, A., Via, A., Ferre', F., Montecchi-Palazzi, L., Cesareni, G., Helmer-Citterich, M., Dept. Biology - University of Rome Tor Vergata;
citterich@uniroma2.it
Short Abstract:

We present the SPOT method for the inference of protein domain specificity and a new database of Molecular INTeractions (MINT). SPOT was developed using the SH3 domain as a model system and has now been applied to PDZ domains and to MHC class I molecules.

One Page Abstract:

iSPOT (iSpecificity Prediction Of Target) is a web tool developed to infer protein-protein interactions mediated by families of peptide recognition modules. The SPOT procedure (Brannetti et al., 2000) utilizes information extracted, for each protein domain family, from position-specific contacts derived from all the available domain/peptide complexes of known structure. The framework of domain/peptide contacts defined on the structure of the complexes is used to build a residue/residue interaction database derived from ligands obtained by panning peptide libraries displayed on filamentous phage. The method is being optimised with a genetic algorithm and will soon be available on the web; it is available now for SH3 and PDZ domains and for MHC class I molecules. iSPOT makes it possible to answer the following questions: which protein (or peptide) is a possible ligand for a given SH3 (or PDZ or MHC class I) molecule? Which is the best possible SH3 (or PDZ, or MHC class I) interacting domain for a given protein/peptide sequence? Which residues should one mutate in a domain to lower or increase its affinity for a given peptide ligand?
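As a rough illustration of how panning-derived ligands can drive specificity prediction, the sketch below turns a handful of toy phage-display peptides into per-position log-odds propensities and scores candidate peptides against them. This is not the SPOT scoring scheme, which is built on structure-derived position-specific contacts; all names and data here are invented.

import math
from collections import Counter

def position_propensities(ligands, pseudocount=1.0):
    """Per-position log-odds propensities (vs. a uniform background) from a set
    of equal-length peptide ligands selected by panning."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    matrix = []
    for pos in range(len(ligands[0])):
        counts = Counter(p[pos] for p in ligands)
        total = len(ligands) + pseudocount * len(alphabet)
        matrix.append({aa: math.log(((counts[aa] + pseudocount) / total) * len(alphabet))
                       for aa in alphabet})
    return matrix

def score_peptide(peptide, matrix):
    """Higher scores mean the peptide better fits the panning-derived profile."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

ligands = ["PPLPPR", "PPVPPR", "APLPPR", "PPLPSR"]   # toy proline-rich (SH3-like) ligands
matrix = position_propensities(ligands)
print(score_peptide("PPLPPR", matrix), score_peptide("GAGAGA", matrix))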

MINT (Molecular INTeractions) is a relational database built to collect and integrate protein interaction data in a single database accessible via a user-friendly web interface. MINT currently contains experimentally determined protein-protein interaction data. In the near future, MINT will be enriched with protein-DNA and protein-RNA interactions and will also allow the collection of peptide lists selected from molecular repertoires such as those resulting from phage display experiments. Moreover, we plan to add information about interactions inferred using computational predictive methods. Curators manually submit the interactions. MINT is an SQL database, and the web interface is written in an HTML-embedded language, PHP (hypertext preprocessor).


298. An Integrated Sequence Data Management and Annotation System For Microbial Genome Projects (up)
Ki-Bong Kim, Department of Computer Engineering, Chungnam National University, Daejeon, Korea (ROK);
Hwajung Seo, Hyeweon Nam, Hongsuk Tae, Pan-Gyu Kim, Dae-Sang Lee, Information and Technology Institute, SmallSoft Co., Ltd., Daejeon, Korea;
Haeyoung Jeong, GenoTech Corp., Daejeon, korea;
Kiejung Park, Information and Technology Institute, SmallSoft Co., Ltd., Daejeon, Korea;
kbkim@comeng.chungnam.ac.kr
Short Abstract:

We developed an efficient data management and annotation system customized for microbial genome projects. The system consists of six main components: local databases, infra-databases, a contig assembly program, essential analysis programs, various utilities, and window-based graphical user interfaces. The components are tightly coupled with one another.

One Page Abstract:

Microbial genome sequencing projects produce a deluge of sequence data and abundant related information, which requires a systematic and automatic data processing system for efficient data management and annotation. In this context, we developed an efficient data management and annotation system customized for microbial genome projects; the system paves the way for systematic and automatic approaches to data collection, retrieval, processing, analysis and annotation. The system consists of six main components: local databases, infra-databases, a contig assembly program, essential analysis programs, various utilities, and window-based graphical user interfaces. The local databases are a repository for all the data, from raw data to annotated data; feedback control in the local databases makes it possible to update all related data like dominoes. In addition to the local databases, the infra-databases include essential public databases such as GenBank, PIR and SwissProt. MySQL is adopted as the DBMS. The contig assembly program includes two main modules of our own implementation: an assembly module and a trace viewer that verifies the assembly results and base calling. Public analysis programs are utilized and incorporated into the analysis component, which also contains analysis programs of our own making. The analysis component can be categorized into four main areas: database search, homology search, ORF search, and signal pattern search. The components are tightly coupled with one another by means of various utilities behind the window-based GUI. The system has a client-server architecture in which window-based client programs on a PC call programs running on the server side over a network connection. This system will be very helpful to genome researchers who need to efficiently manage and analyze their own bulky sequence data, from low level to high.
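As a minimal illustration of one of these analysis categories, the snippet below performs a naive forward-strand ORF search (ATG to the first in-frame stop codon); it stands in for, and is much simpler than, the ORF search component of the system described above.

import re

def find_orfs(dna, min_codons=30):
    """Naive forward-strand ORF scan: each ORF runs from an ATG to the first
    in-frame stop codon; only ORFs with at least `min_codons` codons are kept."""
    orfs = []
    for m in re.finditer(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)", dna.upper()):
        if (m.end() - m.start()) // 3 >= min_codons:
            orfs.append((m.start(), m.end()))
    return orfs

# toy sequence containing one short ORF (the threshold is lowered so it is reported)
print(find_orfs("CCATGGCTGCTAAAGGTGCTGCTTAAGG", min_codons=5))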


299. A home-made implementation for bibliographic data management: application to a specific protein. (up)
Alessandro Pandini, Laura Bonati, Demetrio Pitea, Dipartimento di Scienze dell'Ambiente e del Territorio, Università degli Studi di Milano-Bicocca;
alessandro.pandini@unimib.it
Short Abstract:

We present the development of a bibliographic-reference database on the Aryl hydrocarbon Receptor. The basic goals are: implementing a PubMed-based resource, providing an automatic updating service, adding user profiles, developing a web-based interface connected to the DBMS, generating user access, and web-publishing the project to attract contributions from other groups.

One Page Abstract:

Scientific research on a specific topic can generate a remarkable body of bibliographic material that requires administration. A research group often has to deal with a huge number of papers, assure full accessibility to group members, exploit the richness of its background, and be able to quickly relate new hypotheses to previously published work: a DBMS (DataBase Management System) can solve many of these problems. Additionally, an on-line database resource can fruitfully become a "virtual meeting point" for researchers working on the topic. In order to satisfy a series of internal group requests and to offer an on-line resource for colleagues, we designed, implemented and web-published a bibliographic reference database on the Aryl hydrocarbon Receptor. During the design phase we identified some basic goals: implementing a PubMed-based resource, providing an automatic updating service, and adding user profiles and on-line access. We chose an x86 architecture with the Linux operating system for data and access management. Consistent with these choices, we decided to develop the project on an Apache web server with the MySQL DBMS. Perl was chosen as the "glue" to connect the project parts and to generate dynamic HTML code for the web-based interface. During the implementation phase we collected bibliographic references directly from PubMed and automatically inserted them into the database, setting up the management system. Moreover, we developed a web-based interface connected to the DBMS. Consistent with the design phase, the final implementation features are: an automatic retrieval system that queries the PubMed engine and notifies users of updates, both by e-mail and on login; a user-friendly graphical interface accessible via different web browsers; a powerful query system; and an output generator for text, HTML and BibTeX files. Finally, we have set up user access with different levels of service and web-published the full project. We plan to extend the project into a thematic web site on the Aryl hydrocarbon Receptor that welcomes contributions from other groups, with the long-term goal of creating an on-line forum about this protein.
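The BibTeX output generator can be pictured with the small sketch below, which renders one stored reference record as a BibTeX entry; the field names and the placeholder record are invented for the illustration and do not reflect the actual database schema.

def to_bibtex(ref):
    """Render one stored reference record as a BibTeX @article entry.
    The field names are illustrative, not the actual database schema."""
    return (
        "@article{{{key},\n"
        "  author  = {{{authors}}},\n"
        "  title   = {{{title}}},\n"
        "  journal = {{{journal}}},\n"
        "  year    = {{{year}}},\n"
        "  note    = {{PMID: {pmid}}}\n"
        "}}"
    ).format(**ref)

ref = {"key": "Doe2001", "authors": "Doe, J. and Roe, R.",
       "title": "A placeholder title", "journal": "A placeholder journal",
       "year": "2001", "pmid": "00000000"}
print(to_bibtex(ref))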


300. Statistical structurization of 30,000 technical terms in medical textbooks: A non-ontological approach for computation of human gene functions. (up)
Tsutomu MATSUNAGA, Kenji FUKAISHI, Yasuhiro TANAKA, NTT DATA CORPORATION;
Takuro TAMURA, Iwao YAMASHITA, Yamaguchi, Hitachi Software Engineering Co., Ltd.;
Teruyoshi HISHIKI, JBIRC, National institute for Advanced Industrial Science;
Kousaku OKUBO, Institute for Molecular and Cellular Biology, Osaka University;
matunaga@rd.nttdata.co.jp
Short Abstract:

We statistically structured 30,000 biomedical technical terms by their distribution patterns across medical textbooks for knowledge representation, using a subspace method of pattern recognition. The resulting structure was graphically represented for evaluation, and known human genes were located in the structure according to their annotations. Any cluster of genes can then be automatically annotated.

One Page Abstract:

With the advent of techniques and machines for DNA and protein analysis, feature values for each gene/protein unit are being generated massively. These values are being used to organize the gene/protein world in various ways, and establishing methods to make biomedical sense of the resulting structures is an urgent issue for predicting the functions of 'unknown genes'. Functional descriptions attached to 'known genes' in databases are the primary source of knowledge to be employed in making such sense, but the functional descriptions are fully appreciated only by those who hold thousands of technical terms in mind in an organized form. To enable machines to use such knowledge for 'known genes', declarative approaches are straightforward, such as KEGG and the Gene Ontology, where descriptions are manually translated into expressions with a limited terminology whose structure is provided by experts. The drawbacks of this approach are labor, possible bias, and the requirement of constant updating. To complement these drawbacks of declarative approaches, we have statistically structured 30,000 biomedical objects, represented by 55,000 biomedical terms, by their distribution patterns across more than 20,000 pages of 21 textbooks for medical education. We first constructed a graphical representation of the co-occurrence patterns of the objects to evaluate the power of statistical approaches in expressing medical knowledge. We then employed a subspace method of pattern recognition for sensitive calculation of relations between objects. Using publicly available gene annotations, almost all of the known human genes are mapped onto this space of primitive words. Any cluster of genes built from observed feature values can be automatically annotated by ranking the technical terms in the neighborhood of the 'known gene' members it contains.
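The distribution-pattern idea can be illustrated with the sketch below, which builds a toy term-by-page occurrence matrix, projects the terms into a low-dimensional subspace with an SVD, and compares terms by cosine similarity in that subspace. This is only an analogy to the subspace method actually used; the terms and counts are invented.

import numpy as np

# toy term-by-page occurrence matrix: rows are terms, columns are textbook pages
terms = ["insulin", "glucose", "hemoglobin", "anemia"]
occurrence = np.array([
    [1, 1, 0, 0, 1],   # insulin
    [1, 1, 1, 0, 0],   # glucose
    [0, 0, 1, 1, 1],   # hemoglobin
    [0, 0, 0, 1, 1],   # anemia
], dtype=float)

# project the terms into a low-dimensional subspace of their page-distribution patterns
u, s, vt = np.linalg.svd(occurrence, full_matrices=False)
coords = u[:, :2] * s[:2]

# cosine similarity between terms in the reduced space
unit = coords / np.linalg.norm(coords, axis=1, keepdims=True)
similarity = unit @ unit.T
print(dict(zip(terms, np.round(similarity[0], 2))))   # similarity of each term to "insulin"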