Gene Expression

120.From clustering to expression data to motif finding: a multistep online procedure.
121.Probe Based Scaling of Microarray Expression Data
122.Revealing the Fine-Structures: Wavelet-based Fuzzy Clustering of Gene Expression Data
123.Identification of clinically relevant genes in lung tumor expression data.
124.Machine Learning Techniques for the Analysis of Microarray Gene Expression Data: A Critical Appraisal
125.Comparison of Methods for The Classification of Tumors Using Gene Expression Data
126.Using expression data for testing hypotheses on genetic networks - minimal requirements for the experimental design
127.In silico search for cis-acting regulatory sequences in co-expressed gene clusters
128.A decision tree method for classification of promoters based on TF binding sites
129.Biostatistical Methods to Analyse Gene Expression Profiles
130.Syntactic structures for understanding gene regulatory networks
131.Adaptive quality-based clustering of gene expression profiles
132.Incorporating Biological Knowledge Into Analyses of Microarray Data
133.Semantic Link: A Knowledge Discovery Tool for Gene Expression Profiling
134.Integration of transcript reconstruction and gene expression profiles to enhance disease gene discovery.
135.Gene Expression Database (GXD): integrated access to gene expression information from the laboratory mouse
136.Analysis of gene expression profiles between interaction protein pairs in M. musculus
137.Learning genomic nature of complex diseases from the gene expression data
138.Comparative Assessment of Normalization Methods for cDNA microarray data
139.Identifying different types of human lymphoma by SVM and ensembles of learning machines using DNA microarray data.
140.On the Influence of the Transcription Factor on the Information Content of Binding Sites
141.A Mouse Developmental Gene Index
142.Using Gene Expression and Artificial Neural Networks for Classification and Diagnostic Prediction of Cancers
143.Classification of malignant states in multistep carcinogenesis using gene expression matrix
144.Bioinformatics Tools in the Screening of Gene Delivery Systems
145.Cross talking in cellular networks: tRNA-synthetase and amino acid synthetic enzymes in Escherichia coli
146.Assessing Clusters and Motifs from Gene Expression Data
147.Statistical Analysis of Gene Expression Profile Changes among Experimental Groups
148.Multivariate method for selection of sets of differently expressed genes
149.Understanding Non Small Cell Lung Cancer by Analysis of Expression Profiles
150.Applications of high-throughput identification of tissue expression profiles and specificity
151.Identifying regulatory networks by combinatorial analysis of promoter elements
152.The use of discretization in the analysis of cDNA microarray expression profiles for the identification of tissue-specific genes
153.Quantitative analysis of a bacterial gene expression by using the gusA reporter system in a non-steady state continuous culture
154.Analysis of 5035 high-quality EST clones from mature potato tuber
155.Using highly redundant oligonucleotide arrays to validate ESTs: Development and use of a human Affymetrix MuscleChip.
156.Expression Profiler
157.Analysis of the transcriptional apparatus in the holoparasitic flowering plant genus Cuscuta
158.Inferring Regulatory Pathways in E.Coli using Dynamic Bayesian Networks
159.ConSite: Identification of transcription factor binding sites conserved between orthologous gene sequences
160.FSCAN - An open source program for analysis of two-color fluorescence-labeled cDNA microarrays
161.A method for designing PCR primers for amplifying cDNA array clones
162.Statistical modelling of variation in microarray replicates
163.The new explore of diffuse large B-cell lymphoma
164.Including protein-protein interaction networks into supervised classification of genes based on gene expression data
165.Comparative Splicing Pattern Analysis between Mouse and Human Exon-skipped Transcripts
166.Non-parametric statistics of gene expression data
167.Transcriptome and proteome analysis of Escherichia coli during high cell density cultivation
168.Molecular signatures of commonly fatal carcinomas: predicting the anatomic site of tumor origin
169.Tuning Sub-networks Inference by Prior Knowledge on Gene Regulation
170.Detection of alternative expression by analysis of inconsistency in microarray probe performance
171.Which clustering algorithms best use expression data to group genes by function?
172.Visualization and Analysis Tool for Gene Expression Data Mining
173.Prediction of co-regulated genes in Bacillus subtilis based on the conserved upstream elements across three closely related species
174.Comparison of performances of hierarchical and non hierarchical neural networks for the analysis of DNA array data
175.Linking micro-array based expression data with molecular interactions, pathways and cheminformatics
176.Classification of Acute Leukemia Gene Expression Data Using Weight Function and Principal Component Analysis
177.Application of Fuzzy Robust Competitive Clustering Algorithm on Microarray Gene Expression Profiling Analysis
178.A Robust Algorithm for Expression Analysis
179.Analysis of Gene Expression by Short Tag Sequencing - Theoretical Considerations
180.Computational analysis of RNA splicing by intron definition in five organisms
181.Image Analysis and Feature Extraction Methods for High-Density Oligonucleotide Arrays
182.Transcriptional control mechanisms in the global context of the cell
183.Measurement and Prediction of Gene Expression in Whole Genomes
184.Analysis of orthologous gene expression in microarray data



120. From clustering to expression data to motif finding: a multistep online procedure.
Gert Thijs, Kathleen Marchal, Frank De Smet, Janick Mathys, Magali Lescot, K.U.Leuven - ESAT/SISTA;
Stephane Rombauts, PlantGenetics, VIB, U.Gent;
Bart De Moor, K.U.Leuven - ESAT/SISTA;
Pierre Rouze, PlantGenetics, VIB, U.Gent;
Yves Moreau, K.U.Leuven - ESAT/SISTA;
gert.thijs@esat.kuleuven.ac.be
Short Abstract:

We present an integrated web-based tool for automatic multistep analysis of microarray data. The gene expression data are clustered to find groups of co-expressed genes. The upstream regions are selected based on accession number and gene name. Finally the sequences are sent to the Motif Sampler to find over-represented motifs.

One Page Abstract:

Microarray experiments give a global insight into the transcriptional behaviour of an organism. Deciphering the regulatory mechanisms from the transcript profiles is one of the major challenges of bioinformatics. Genes that have similar expression profiles are hypothesized to have a higher probability of being coregulated, and clustering techniques group such genes together. Finding specific cis-acting motifs in the upstream regions of sets of co-expressed genes can to some extent validate the clusters. Here we present an interactive web-based user interface that integrates cluster analysis and motif finding tools for the analysis of microarray data. We propose a multistep online procedure. Starting from the expression data together with the corresponding identification tags of the genes (accession number and gene name), the adaptive quality-based clustering algorithm defines groups of tightly co-expressed genes. Each gene in a cluster is identified by its accession number and gene name. Based on these tags the upstream region is retrieved. First the sequences are downloaded from GenBank and all the genes are located and indexed. In the next step the corresponding upstream region is identified. If this region is too short for further analysis, the gene is searched with BLAST against genomic sequences to locate the upstream region. This sequence selection relies on an automated procedure, but at each step an intermediary report is shown where the user can intervene in the process. Once the upstream regions are identified, the user can send the sequences to the Motif Sampler to find the over-represented motifs. The web interface can be accessed through the following URL: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html


121. Probe Based Scaling of Microarray Expression Data
Christopher Workman, Lars Juhl Jensen, Steen Knudsen, Søren Brunak, Center for Biological Sequence Analysis;
workman@cbs.dtu.dk
Short Abstract:

There are several analysis steps after hybridization and scanning that lay the foundation for all further analysis. For this reason, special care should be taken in image analysis, data scaling and normalization prior to calculating mRNA levels. This poster presents a new probe based scaling method for microarray expression data.

One Page Abstract:

Between scanning a chip and drawing conclusions about mRNA levels there are several important steps that affect the results and lay the foundation for all further analysis. For this reason, special care should be taken in image analysis, data scaling, normalization, and outlier detection prior to calculating mRNA levels. After this is done, converting sets of probe pair intensities to mRNA levels (what I will call the feature extraction problem) is still not as straightforward as one might think. There is very little precedent in feature extraction for probe pair data of this type, but some examples are starting to show up in the literature (Li and Wong PNAS, v.98, 2001). What confounds the development of these methods is not knowing what the correct results should be. Using replicate experiments from the same and different RNA isolations from a single tissue, we can measure the effects of scaling and normalization on reproducibility. In this poster I will present new scaling and feature extraction methods and compare them to the existing methods with respect to their effects on reproducibility.
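As a rough illustration of how the effect of scaling on replicate reproducibility might be measured, the sketch below applies a simple global median scaling to two replicate arrays and compares their agreement before and after. This is not the probe based method of the poster: the function names, the target median of 100 and the toy intensities are all assumptions made for the example.

```python
import math
from statistics import median

def scale_to_median(intensities, target=100.0):
    """Global scaling: multiply all values so the array median equals `target`.
    The target value is an arbitrary, hypothetical choice."""
    factor = target / median(intensities)
    return [x * factor for x in intensities]

def mean_abs_log_ratio(a, b):
    """Mean |log2(a_i / b_i)| over probes; 0 means the replicates agree perfectly."""
    return sum(abs(math.log2(x / y)) for x, y in zip(a, b)) / len(a)

# Toy replicates: the second array was scanned twice as bright as the first.
rep1 = [120.0, 450.0, 80.0, 1000.0, 300.0]
rep2 = [2.0 * x for x in rep1]

before = mean_abs_log_ratio(rep1, rep2)
after = mean_abs_log_ratio(scale_to_median(rep1), scale_to_median(rep2))
print(f"disagreement before scaling: {before:.3f}, after: {after:.3f}")
```

A lower value after scaling indicates better agreement between replicates; real data would also contain probe-level noise that a single global factor cannot remove.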


122. Revealing the Fine-Structures: Wavelet-based Fuzzy Clustering of Gene Expression Data
Matthias E. Futschik, Nikola K. Kasabov, University of Otago;
mfutschik@infoscience.otago.ac.nz
Short Abstract:

We studied yeast cell cycle expression data using fuzzy clustering and wavelet analysis. Both methods allow a more general approach for discovering the underlying structures and patterns in gene expression data than previous methods and can be valuable tools for revealing the complexity and the fine structure of cellular regulatory networks.

One Page Abstract:

The invention of microarray technologies has opened the door for the study of global mechanisms in the cell. By measuring thousands of genes simultaneously it has become possible to get snapshots of the states of a cell. While monitoring the mRNA levels reveals only a part of the whole picture and protein arrays are still in their infancy, the DNA microarray technique has quickly become an established method and will lead the way in the analysis of the global behavior of cellular networks.

A major challenge is, however, the extraction of valuable knowledge from the mass of data produced in microarray experiments. Clustering has frequently been used to obtain a first insight into the structure of the data. It assigns genes to defined groups according to the similarities of their expression profiles. Since it is assumed that co-regulated genes show a similar expression pattern, clustering can discover functionally related genes. So far, various clustering algorithms have been introduced, such as hierarchical clustering, k-means and SOM. A common property of these methods is the assignment of a gene to a single distinct cluster. However, this procedure might be too restrictive considering the complexity of cellular regulatory networks. Single genes are frequently involved in several different physiological pathways. An adequate clustering algorithm should reflect this.

In this work, we present fuzzy clustering[1] as an alternative to the traditional methods. Fuzzy clustering may group genes more naturally by allowing a single gene to belong to different clusters. This opens the way to a more complex partitioning of genes. We found that a significant number of genes seem to belong to different clusters, showing that fuzzy clustering might be an appropriate approach to use. Furthermore, fuzzy clustering leads to the definition of the core of a cluster in a straightforward way. Using this feature, we can examine the correlations between the expression signal and the information in regulatory DNA sequences in detail.
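The idea of partial membership can be sketched with the standard fuzzy c-means membership formula, in which a gene's membership in cluster i depends on its distance to every cluster centre. The snippet is only a minimal illustration under assumptions: the function name, the fuzzifier m = 2 and the toy centres are hypothetical and are not taken from the poster.

```python
import math

def fcm_memberships(profile, centers, m=2.0):
    """Fuzzy c-means memberships of one expression profile.
    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)); memberships sum to 1."""
    dists = [math.dist(profile, c) for c in centers]
    if 0.0 in dists:
        # Profile coincides with one or more centres: share membership among them.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

# A profile three times closer to the first centre than to the second.
u = fcm_memberships([0.5, 0.0], [[0.0, 0.0], [2.0, 0.0]])
print(u)
```

Genes whose largest membership is close to 1 could be regarded as the core of a cluster, while genes with several sizeable memberships are candidates for involvement in more than one group; any cut-off used for that decision would be a modelling choice.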

To illustrate this novel approach we apply fuzzy clustering to a yeast cell cycle gene expression data set[2]. To account for the temporal character of these data, we apply wavelet analysis[3] to represent the expression profiles. Wavelet analysis offers the possibility to study the genetic network on different time scales while preserving the temporal order of the expression signals. An interesting possibility is the use of wavelet decomposition to distinguish the true biological signals from noise.
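How a wavelet decomposition separates a time course into different time scales can be shown with the simplest case, an unnormalised Haar transform. The choice of the Haar wavelet, the function names and the toy profile are assumptions for illustration; the poster does not state which wavelet family was used.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse trend) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal):
    """Full decomposition of a profile whose length is a power of two.
    Returns the overall mean and the detail coefficients ordered coarse to fine."""
    details = []
    approx = list(signal)
    while len(approx) > 1:
        approx, d = haar_step(approx)
        details.insert(0, d)
    return approx[0], details

# Toy expression profile measured at four consecutive time points.
mean_level, details = haar_decompose([4.0, 2.0, 6.0, 8.0])
print(mean_level, details)
```

Because each coefficient refers to a fixed pair or block of consecutive time points, the temporal order of the signal is preserved; setting small fine-scale coefficients to zero is one simple way to suppress noise.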

Finally we address the important issue of cluster validation by comparing different cluster validity criteria and discuss the problem of model parameter selection.

We show that both fuzzy clustering and wavelet analysis allow a wider approach for discovering underlying structures and patterns in gene expression data than previous methods and can be valuable tools for revealing the complexity and the fine structure of cellular regulatory networks.

References: [1] James C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Advanced Applications in Pattern Recognition, Plenum Press, 1983

[2] Paul T. Spellman et al., Molecular Biology of the Cell, Vol. 9, 3273-3297, 1998

[3] Ingrid Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, 1992


123. Identification of clinically relevant genes in lung tumor expression data.
Olga Troyanskaya, Stanford Medical Informatics and Department of Genetics, Stanford University School of Medicine;
Mitchell Garber, Department of Genetics, Stanford University School of Medicine;
Russ B. Altman, Stanford Medical Informatics, Stanford University School of Medicine;
David Botstein, Department of Genetics, Stanford University School of Medicine;
olgat@smi.stanford.edu
Short Abstract:

We developed methods for detecting clinically relevant genes in the context of lung cancer gene expression data. We present a correlation-based method for identifying survival-associated genes for lung adenocarcinomas. We also show a method based on a nonparametric t-test for identifying gene expression patterns associated with specific tumor types.

One Page Abstract:

A major biomedical question in microarray studies is selecting genes associated with specific clinical parameters, for example patient survival. Identification of such markers, or groups of genes, may lead to clinical outcomes prediction and treatment guidance. Additionally, analysis of gene expression data associated with clinical data may allow molecular-level tumor classification. These tumor subtypes, which may appear histologically similar, are molecularly distinct and lead to differences in clinical outcomes such as patient survival, drug response, and metastatic status. Methods for automated analysis of gene expression data associated with clinical data are therefore needed.

Our work is focused on developing and evaluating methods for detecting clinically relevant genes in the context of lung cancer gene expression data. We use a non-parametric t-test based method for identification of genes associated with specific tumor types. This method was applied to lung tumor data to distinguish between subtypes of lung adenocarcinomas which are not histologically distinct. We also describe a correlation-based method for identification of genes correlated with patient survival. The method identifies genes whose expression can be best used to classify tumors in terms of ‘good’ and ‘bad’ survival outcomes for patients with lung adenocarcinomas.
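One common way to make a t-test nonparametric is to obtain its null distribution by permuting the class labels. The sketch below assumes that reading; the poster does not give details of its test, and the function names, the Welch form of the statistic, the number of permutations and the toy expression values are all hypothetical.

```python
import random
from statistics import mean, variance

def t_stat(a, b):
    """Welch t statistic for two groups of expression values."""
    return (mean(a) - mean(b)) / ((variance(a) / len(a) + variance(b) / len(b)) ** 0.5)

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided p-value: fraction of label shufflings with |t| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(t_stat(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_stat(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy gene: clearly higher expression in tumor subtype 1 than in subtype 2.
subtype1 = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
subtype2 = [2.0, 2.3, 1.9, 2.2, 2.1, 1.8]
p = permutation_p_value(subtype1, subtype2)
print(f"permutation p-value: {p:.4f}")
```

In a genome-wide screen this test would be repeated for every gene, so the resulting p-values would still need a correction for multiple testing.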


124. Machine Learning Techniques for the Analysis of Microarray Gene Expression Data: A Critical Appraisal
Mahesan Niranjan, The University of Sheffield;
m.niranjan@sheffield.ac.uk
Short Abstract:

This poster takes a critical look at some of the high performance machine learning techniques as applied to microarray gene expression data. It uses the yeast gene expression and leukemia datasets available in the public domain to illustrate that reasonably simple techniques can achieve performances comparable to highly nonlinear techniques.

One Page Abstract:

In recent literature we see that a wide range of powerful machine learning algorithms have been proposed for the analysis of gene expression data from microarrays. New clustering methods such as Gene Shaving have been invented in this context. Support Vector Machines, Bayes Nets, Gaussian Processes and Latent Variable Methods have all been recommended as the right tools with which inference problems in such data should be approached. Recent literature takes the form that each machine learning expert with interest in the subject of microarray data advances his/her favourite method as the way forward for the biologists generating the data.

In this poster I report on taking a critical look at this collection of techniques applied to this problem. In particular I report on the Yeast Gene and Leukemia datasets, available in the public domain. It turns out that the underlying classification problems arising in these datasets are sufficiently simple that pattern processing techniques available in textbooks are as good as any sophisticated methodology. The key result from this observation is that many of the high dimensional problems could be reduced to problems in much lower dimensionality by reasonably simple techniques, resulting in the possibility of effective interpretation of such data.


125. Comparison of Methods for The Classification of Tumors Using Gene Expression Data
Grace S. Shieh, Chi-Chih Chen, Ing-Cheng Jiang, Insti. of Statistical Science, Academia sinica, TAIWAN;
Yu-Shan Shih, Dept of Math., National Chung Cheng Univ.;
gshieh@stat.sinica.edu.tw
Short Abstract:

The performance of Support Vector Machines and QUEST in classifying tumors based on gene expression data from cDNA microarrays is compared to that of the methods in Dudoit et al. (2000). We assess error rates from 150 data sets generated, by a statistical method, from NCI 60 cell lines and Lymphoma data, respectively.

One Page Abstract:

The performance of Support Vector Machines and QUEST in classifying tumors based on gene expression data from cDNA microarrays is compared to that of the four major methods in Dudoit et al. (2000). We generate 150 data sets, by a statistical sampling method, from the original NCI 60 cell lines (Ross et al., 2000) and Lymphoma data (Alizadeh et al., 2000), respectively.

In each generated data set, about two thirds of the samples are used as training data and the rest as test data. Some variables (gene expressions), out of many (for instance, 1,416 in NCI 60 cell lines), have been selected by a statistical criterion to implement those methods and to assess their prediction error rates.
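The evaluation scheme (repeated two-thirds/one-third splits followed by an error rate on the held-out third) can be sketched as follows. A nearest-centroid classifier is swapped in purely to keep the example short; it is not SVM or QUEST, and the random splitting, function names and toy samples are assumptions rather than the sampling method of the poster.

```python
import random

def split_two_thirds(samples, rng):
    """Shuffle a copy of the samples and use about two thirds for training, the rest for testing."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = round(2 * len(shuffled) / 3)
    return shuffled[:cut], shuffled[cut:]

def nearest_centroid_error(train, test):
    """Test error rate of a nearest-centroid classifier (a stand-in for the real methods)."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    centroids = {lab: [sum(col) / len(rows) for col in zip(*rows)]
                 for lab, rows in by_label.items()}

    def predict(x):
        return min(centroids,
                   key=lambda lab: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[lab])))

    return sum(predict(x) != y for x, y in test) / len(test)

# Toy data: two well-separated tumor classes described by two selected genes.
samples = [([0.1 * i, 0.2], "A") for i in range(6)] + \
          [([10.0 + 0.1 * i, 9.8], "B") for i in range(6)]

rng = random.Random(0)
errors = [nearest_centroid_error(*split_two_thirds(samples, rng)) for _ in range(150)]
mean_error = sum(errors) / len(errors)
print(f"mean test error over 150 splits: {mean_error:.3f}")
```

Averaging over many splits gives a more stable error estimate than a single split, which matters when only a few dozen samples are available.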


126. Using expression data for testing hypotheses on genetic networks - minimal requirements for the experimental design
Dirk Repsilber, Institute of Molecular Evolution, Evolutionary Biology Centre of the University of Uppsala, Norbyvägen 18 C, SE-75236 Uppsala, Sweden;
Siv Andersson, Hans Liljenström, Institute of Molecular Evolution, Evolutionary Biology Centre of the University of Uppsala, Norbyvägen 18 C, SE-;
dirk.repsilber@ebc.uu.se
Short Abstract:

We systematically tested the requirements for the experimental design for ranking false hypotheses about a genetic network's structure, given expression data. This is an important functional genomics task, because the parameter space of reasonable models is too big to be explored without previous biological knowledge.

One Page Abstract:

A variety of "reverse-engineering" algorithms have been proposed for using expression data to reconstruct interactions in small networks. This may help to understand genetic regulation, the core task of present-day functional genomics. Only a few authors point to the necessity of measuring "independent" samples in order to reengineer even the smallest genetic networks with reasonable confidence. Here, we systematically tested the requirements for the experimental design that is necessary not only to reengineer the "right" genetic network, but also to rank false hypotheses about its structure. Presumably the latter is the task that will most frequently have to be solved in the near future of functional genomics, because the parameter space of reasonable models is too big to be sorted out without using previous biological knowledge. However, this knowledge has mainly been inferred from sequence data, and several equally possible hypotheses need to be weighed against each other. Thus, algorithmic solutions that can be computationally automated to perform this task are indispensable. Following the work of Wahde and Hertz (2000) we use a genetic algorithm to explore the parameter space of a multistage discrete genetic network model (fixed connectivity and number of states per node).


127. In silico search for cis-acting regulatory sequences in co-expressed gene clusters
Stephane Rombauts, Department of Plant Genetics, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium;
Magali Lescot, Gert Thijs, Kathleen Marchal, SISTA/COSIC-ESAT, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee Belgium;
Cedric Simillion, Department of Plant Genetics, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium;
Bart De Moor, Yves Moreau, SISTA/COSIC-ESAT, KULeuven, Kasteelpark Arenberg 10, 3001 Leuven-Heverlee Belgium;
Pierre Rouzé, INRA associated laboratory, VIB, University of Gent, Ledeganckstraat 35, 9000 Gent Belgium;
strom@gengenp.rug.ac.be
Short Abstract:

With large-scale transcriptome expression analyses, such as microarrays, one can tackle the problem of gene regulation. Motif finding algorithms aim at detecting motifs in the upstream regions of co-regulated genes. For that purpose we improved the original Gibbs Sampler from Lawrence. PlantCARE has been improved and new features were added.

One Page Abstract:

Among the fully sequenced genomes, that of the dicotyledonous plant model Arabidopsis thaliana has been available since December 2000, and great efforts are being made to extract knowledge from the sequences. With large-scale transcriptome expression analyses, such as microarrays, producing large clusters of co-expressed genes, one can tackle the problem of gene regulation. It is commonly accepted that at least a subset of the sequences of a given cluster should share regulatory elements. In general, data on plant cis-acting regulatory elements are lacking, although these elements determine the processes in which genes are involved and are of major importance for plant biotechnology. Motif finding algorithms aim at detecting such motifs in the upstream regions of co-regulated genes by looking for over-represented oligonucleotides. For that purpose we developed the Motif Sampler, an improved implementation of the original Gibbs Sampler from Lawrence[1]. To test the Motif Sampler[2] on experimental data sets, the microarray data on plant response to mechanical wounding from Reymond[3] were used, as well as the data from Schaffer[5] on the circadian clock. To assign a functional interpretation to the motifs found, the consensus of each motif was compared with the entries in PlantCARE[6]. Several interesting motifs were found: methyl jasmonate responsive elements, elicitor-responsive elements and the abscisic acid response element for the wounding experiments, as well as elements for the circadian clock experiments. The PlantCARE database and web site have been improved and new features were added to deal with the predicted data. Among the updates is an interactive graphical display of promoter boxes mapped on the query sequence, together with information regarding the sites.
Additionally we aim at describing promoters as functional entities composed of several elements, based on extensive analyses of pools of co-regulated genes clustered from microarray experiments. At present, we have collected over 400 different cis-acting regulatory elements from the literature, describing more than 159 individual promoters from higher plant genes. (http://sphinx.rug.ac.be:8080/PlantCARE/)
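The notion of an over-represented oligonucleotide can be illustrated with a plain word-counting sketch that compares how often each k-mer occurs in the upstream regions of a cluster and in a background set. This is deliberately much simpler than Gibbs sampling and is not the Motif Sampler; the function names, the pseudocount and the toy sequences are assumptions made for the example.

```python
from collections import Counter

def kmer_sequence_counts(sequences, k):
    """For each k-mer, count the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enrichment(cluster, background, k):
    """Rank k-mers by (fraction of cluster sequences) / (fraction of background sequences).
    A pseudocount of one avoids division by zero for k-mers absent from the background."""
    fg = kmer_sequence_counts(cluster, k)
    bg = kmer_sequence_counts(background, k)
    scores = {}
    for kmer, n in fg.items():
        fg_frac = n / len(cluster)
        bg_frac = (bg[kmer] + 1) / (len(background) + 1)
        scores[kmer] = fg_frac / bg_frac
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy upstream regions: every cluster sequence carries the word ACGTG.
cluster = ["AACGTGTT", "CCACGTGA", "TACGTGGC"]
background = ["AAAATTTT", "CCCCGGGG", "ATATATAT", "GCGCGCGC"]
ranked = enrichment(cluster, background, k=5)
print(ranked[0])
```

Exact word counting misses degenerate motifs, which is one reason probabilistic approaches such as Gibbs sampling with a background model are used instead.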

References: [1] Lawrence, C. E. et al. (1993). "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment." Science 262(5131): 208-14.

[2] Thijs, G. et al. (2000). "A higher-order background model improves the detection by Gibbs sampling of potential promoter regulatory elements in DNA sequences." Submitted. http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html

[3] Reymond, P. et al. (2000). "Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis." Plant Cell 12(5): 707-20.

[4] De Smet, F. et al. (2000). http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html

[5] Schaffer, R. et al. (2001). "Microarray analysis of diurnal and circadian-regulated genes in Arabidopsis." Plant Cell 13: 113-123.

[6] Rombauts, S. et al. (1999). "PlantCARE, a plant cis-acting regulatory element database." Nucleic Acids Res 27(1): 295-6.


128. A decision tree method for classification of promoters based on TF binding sites
Alexander Kel, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany; Institute of Cytology and Genetics, Pr. Lavrentyeva 10, 360090, Novosibirsk, Russia;
Tatyana Ivanova, Institute of Cytology and Genetics, Pr. Lavrentyeva 10, 360090, Novosibirsk,;
Olga Kel-Margoulis, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany; Institute of Cytology and Genetics, Pr. Lavrentyeva 10;
Michael Zhang, Watson School of Biological Sciences, Cold Spring Harbor Laboratory, 1 Bungtown Road, P.O.Box 100,Cold Spring Harbor,;
Edgar Wingender, BIOBASE GmbH, Mascheroder Weg 1B, 38124 Braunschweig, Germany;
ake@biobase.de
Short Abstract:

We have developed a new method for revealing class-specific composite modules (combinations of transcription factor binding sites) in promoters of eukaryotic genes that are functionally related or coexpressed. A decision tree system is constructed to classify promoters in genomes and computationally predict their function.

One Page Abstract:

We have developed a new method for revealing class-specific composite modules in promoters of functionally related or coexpressed genes. On the basis of the revealed composite modules, a decision tree method was developed to classify promoters of several functionally related gene groups. Seven sets of promoters were obtained from different sources: promoters of cell-cycle related genes (43 promoters) and brain enriched genes (45 promoters) (collected in this work on the basis of a literature search), muscle-specific genes (25 promoters) and immune cell specific genes (24 promoters) (Kel et al. (1999) JMB 288, 353-376), erythroid specific genes (10 promoters) (http://www.bionet.nsc.ru), liver enriched genes (39 promoters) and housekeeping genes (26 promoters) (EPD rel. 62). Promoter sequences of length 600 bp (from -500 to +99 relative to the start of transcription) were extracted from the EMBL database. To search for binding sites, a library of about 400 matrices for various transcription factors (TRANSFAC rel. 4.4; Wingender, E. et al. (2000) NAR 28, 316-319) was applied with a new search tool, "Match". To classify promoters we build a decision tree. The internal nodes of the tree represent selected composite modules. On the basis of the composite module, at every node we calculate a decision function F(X) for each sequence X as it is passed through the tree. The decision tree was built by a variant of a genetic algorithm that optimises the structure of the tree, selects the specific combinations of cis-elements for every node and defines the cut-off values of the corresponding functions F. The bottom nodes of the tree (leaves) contain the 7 different promoter classes. The percentage of correct classifications achieved by the tree varies between promoter classes, from 35% for promoters of brain enriched genes to more than 70% for cell cycle related promoters.
The following set of TF binding sites appeared to be the most effective for classification of the mentioned sets of promoters: E2F, OCT-1, NF-AT, MyoD, SRF and NF-kB. The classification tree and the program for promoter classification can be found at: http://www.gene-regulation.com/. The decision tree method makes it possible to identify new promoters and computationally predict their function. It provides a means to analyse gene expression data by constructing promoter models for coexpressed genes.


129. Biostatistical Methods to Analyse Gene Expression Profiles
Jobst Landgrebe, MPI of Psychiatry, Munich;
Gerhard Welzl, GSF-Research Center, Munich;
Wolfgang Wurst, MPI of Psychiatry, Munich;
landgreb@mpipsykl.mpg.de
Short Abstract:

We analysed gene expression data of mouse mutants with principal component analysis. We selected genes with extreme values in the reduced system, supervised by the variance within observation groups. This enabled us to explore differences between the samples and to extract fundamental gene expression patterns related to these differences.

One Page Abstract:

Jobst Landgrebe(1), Gerhard Welzl(2) and Wolfgang Wurst(1/2)

1 GSF-National Research Centre for Environment and Health, Ingolstädter Landstraße 1, D-85764 Neuherberg 2 Max-Planck-Institute of Psychiatry, Molecular Neurogenetics, Kraepelinstr.10, D-80804 München

Abstract: DNA microarray gene expression data are characterised by an increasing number of probes for cDNAs. Bioinformatical and biostatistical methods are applied to study the variance in gene expression across collections of related arrays and to detect fundamental patterns underlying these gene expression profiles. Many mathematical techniques have been developed to detect patterns in complex data. Quite a few of these methods are essentially different ways of clustering points in multidimensional space, e.g. hierarchical clustering or self-organising maps. Holter et al. successfully applied the singular value decomposition method to sets of DNA microarray gene expression data (Holter et al. 2000). Another method, named "Gene Shaving", is based on computing a leading principal component iteratively (Hastie et al. 2000). We analysed gene expression data of genetic and pharmacological mouse models with principal component analysis (PCA) by regarding the experimental conditions as variables (columns) and the genes as objects (rows). The additional information about the related arrays (groups of mice) requires some modification of the PCA (Krzanowski 2000). We selected genes with extreme values in the reduced system (high variance between groups), supervised by the variance within groups. Using this method we were able to explore differences between the samples and to extract fundamental gene expression patterns related to these differences. To complete our analysis we ran the data visualisation system XGobi and compared the results with the outcome of other multivariate methods.

References: HASTIE, T., TIBSHIRANI, R., EISEN, M.B., ALIZADEH, A., LEVY, R., STAUDT, L., CHAN, W.C., BOTSTEIN, D. and BROWN, P. (2000): Gene 'shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1(2), 1-20.

HOLTER, N.S., MITRA, M., MARITAN, A., CIEPLAK, M., BANAVAR, J.R. and FEDOROFF, N.V. (2000): Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. Natl. Acad. Sci. USA 97, 8409-8414.

KRZANOWSKI, W.J. (2000): Principles of Multivariate Analysis. Oxford University Press, New York
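The selection of genes with extreme values in the reduced system can be sketched with an ordinary PCA in which genes are rows and conditions are columns. The example computes only the first principal component by power iteration and omits the group-wise modification and the within-group supervision described in the abstract; the function name and the toy matrix are assumptions.

```python
def leading_component_scores(matrix, n_iter=200):
    """Rows = genes, columns = conditions. Returns each gene's score on the
    first principal component, found by power iteration on the covariance matrix."""
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    vec = [1.0] * p  # starting vector; assumed not orthogonal to the first component
    for _ in range(n_iter):
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return [sum(r[j] * vec[j] for j in range(p)) for r in centered]

# Toy data: two wild-type arrays followed by two mutant arrays; gene 2 is induced in mutants.
expression = [
    [1.0, 1.1, 1.0, 0.9],
    [0.9, 1.0, 1.1, 1.0],
    [1.0, 0.9, 5.0, 5.2],
    [1.1, 1.0, 0.9, 1.0],
    [1.0, 1.0, 1.0, 1.1],
]
scores = leading_component_scores(expression)
most_extreme = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(most_extreme, [round(s, 2) for s in scores])
```

The sign of a principal component is arbitrary, so the absolute score is used when ranking genes.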


130. Syntactic structures for understanding gene regulatory networks
Peter Lee, Mike Hallett, Tom Hudson, McGill University;
pdlee@genome.mcgill.ca
Short Abstract:

We present a novel system for representing complex gene relations derived from functional annotation databases. We propose a method for application of this functional representation to the analysis and interpretation of microarray gene expression data.

One Page Abstract:

The analysis and interpretation of large-scale gene expression datasets requires methods for integrating information about gene function. The majority of knowledge about biomolecular systems exists in the form of qualitative descriptions that are intuitive but cover a diverse spectrum of mechanistic information and experimental conditions. Pathway databases (e.g., KEGG) and other functional classification systems (such as GO and MeSH) compress information contained in the literature database to varying degrees. However, existing paradigms for representing this information (such as path maps, hierarchical trees and circuit diagrams) lack scalability and do not adequately capture the diversity and subtlety of the interactions between genes and their products. We describe a novel system for the representation of functional information contained in various functional databases. By preserving syntactic structures from the knowledge base, we propose a general interface that enables comparisons between gene expression analyses and current intuitive understandings of gene regulation. We are in the process of developing this interface to access data via a microarray gene expression database.


131. Adaptive quality-based clustering of gene expression profiles
Frank De Smet, Kathleen Marchal, Janick Mathys, Gert Thijs, Bart De Moor, Yves Moreau, ESAT-SISTA/COSIC/DocArch;
frank.desmet@esat.kuleuven.ac.be
Short Abstract:

A two-step algorithm is presented that clusters genes that are significantly coexpressed (at a given confidence level). First, we try to find a sphere in the data where the 'density' of expression profiles is locally maximal. In a second step, we derive the optimal radius (or quality) of this sphere using an EM-algorithm.

One Page Abstract:

Clustering genes based on their expression behaviour/profiles (e.g., measured by microarrays) is an important step preceding further analysis of the interaction between these genes. The hypothesis that a cluster contains either coregulated or functionally related genes only holds if the clustering algorithm used groups genes with a significant degree of coexpression. Genes not tightly coexpressed have to be excluded from further analysis.

With these remarks in mind we designed an iterative two-step algorithm. First, we try to find a sphere in the data where the 'density' of expression profiles is locally maximal (based on a preliminary estimate of the radius R of the cluster; quality-based approach (1)). In a second step, we derive the optimal radius (or quality) of the cluster/sphere so that only significantly coexpressed genes (for a given significance level S, e.g., S=95%) are included in the cluster. This is achieved by fitting a model to the data using an EM-algorithm. The model assumes that the data are normalised (the expression vectors have mean zero and variance one and are therefore located on the intersection of a hyperplane and a hypersphere). By inferring the radius or quality from the data itself, the biologist is released from estimating this parameter manually. This parameter can be hard to predict: setting the quality too strict will exclude a considerable number of coregulated genes, while setting it too wide will include too many genes that are not coregulated.
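
The normalisation and the first step (locating a sphere of locally maximal density for a preliminary radius R) might be sketched as follows. This is one plausible reading only: the recentring rule used here (move the centre to the mean of the profiles inside the sphere until membership is stable) is an assumption, and the EM-based estimation of the optimal radius in the second step is not shown. All function names and the toy profiles are hypothetical.

```python
import math

def normalise(profile):
    """Rescale one (non-constant) expression profile to mean zero, variance one."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    return [(x - mean) / sd for x in profile]

def distance(a, b):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def locate_dense_sphere(profiles, seed, radius, max_iter=50):
    """Move the sphere centre to the mean of its members until membership is stable."""
    centre = list(seed)
    members = []
    for _ in range(max_iter):
        inside = [i for i, p in enumerate(profiles) if distance(p, centre) <= radius]
        if not inside or inside == members:
            break
        members = inside
        centre = [sum(profiles[i][k] for i in members) / len(members)
                  for k in range(len(centre))]
    return centre, members

# Hypothetical toy data: three rising, two falling and one oscillating profile.
raw = [[1, 2, 3, 4], [2, 4, 6, 8.5], [1, 2.1, 2.9, 4.2],
       [4, 3, 2, 1], [8, 6, 4.2, 2], [1, 3, 1, 3]]
profiles = [normalise(p) for p in raw]
centre, members = locate_dense_sphere(profiles, seed=profiles[0], radius=0.5)
```

Because every normalised profile has mean zero and variance one, profiles that differ only in scale or offset end up close together, which is why the three rising profiles fall into one sphere here while the remaining profiles stay unassigned.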

The most important properties of this approach are:

a. Few user-defined parameters (e.g., no pre-definition of the number of clusters) with an intuitive meaning.

b. Not all genes are assigned to a cluster.

c. The computational complexity of this method is approximately linear in the number of gene expression profiles in the data set.

Finally, we tested this algorithm successfully on real and artificial data.

References

1. Heyer, L. J., Kruglyak, S., and Yooseph, S. (1999) Exploring expression data: Identification and analysis of coexpressed genes. Genome Res., 9, 1106-1115.

Acknowledgements

Frank De Smet is a research assistant with the K.U.Leuven. Yves Moreau is a post-doctoral researcher of the FWO. Prof. Bart De Moor is a full professor with the K.U.Leuven. This work is supported by the Flemish Government (Research Council KUL (GOA Mefisto-666, IDO), FWO (G.0256.97,G.0240.99,G.0115.01, Research communities ICCoS, ANMMM, PhD and postdoc grants), Bil.Int. Research Program, IWT (Eureka-1562 (Synopsis), Eureka-2063 (Impact), Eureka-2419 (FLiTE), STWW-Genprom, IWT project Soft4s, PhD grants)), Federal State (IUAP IV-02, IUAP IV-24, Durable development MD/01/024), EU (TMR-Alapades, TMR-Ernsi, TMR-Niconet), Industrial contract research (ISMC, Data4s, Electrabel, Verhaert, Laborelec).


132. Incorporating Biological Knowledge Into Analyses of Microarray Data
Jessica Ross, Division of Biomedical Informatics, Department of Medicine, Stanford University School of Medicine, Stanford, California;
Jeff Shrager, Carnegie Institution of Washington, Stanford, California;
Glenn Rosen, Division of Pulmonary and Critical Care, Department of Medicine, Stanford University School of Medicine, Stanford, California;
Pat Langley, Institute for the Study of Learning and Expertise, Palo Alto, California;
ccross@leland.stanford.edu
Short Abstract:

We have developed a system that lets biologists view human gene expression data as it relates to molecules that occur in specific biological pathways, letting them search for all pathways that contain molecules of interest, view expression levels of those molecules graphically, and calculate correlations between those expression levels.

One Page Abstract:

High-throughput technologies have generated large amounts of data in the biological sciences. Using clustering algorithms to find patterns in these data is the primary method of analysis found in the literature, and is used predominantly as an exploratory tool, rather than as a test to evaluate a scientific hypothesis. These analyses have been very successful in letting scientists better classify tissue based on gene expression data. However, the results of clustering are often difficult to interpret in terms of the classical pathway models, which biologists often express as diagrams. The ability to reconcile microarray data with these models would greatly assist biologists in communicating knowledge gained from these high-throughput experiments. Furthermore, the ability to explain microarray results in relation to familiar biological pathways and molecular processes will directly support the formation and testing of hypotheses about these processes that regularly occur in the physical or wet lab. We have developed a system that lets biologists view human gene expression data as it relates to molecules that occur in specific biological pathways. We use a database with over 5000 biological reactions inferred from the literature on humans, most of which pertain to signaling pathways within the cell and therefore are directly relevant to current theories of the causes for many human diseases. The software lets a scientist search for all pathways that contain molecules of interest, view expression levels of those molecules graphically, and calculate correlations between those expression levels. In addition, the user may suggest a new pathway for comparison to the data. Using this program, we have been able to reconcile data from microarray experiments on human fibroblasts with accepted pathways for cell cycle and signaling. 
Our results show correlations between molecules that occur in these pathways, even though cluster analysis either did not group them together or did not assign them to any group. We believe this system will serve as a valuable tool that will let biologists incorporate microarray data into the process of hypothesizing and testing their models.
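
The correlation step might look like the sketch below, which computes Pearson correlations between the expression levels of all measured members of one pathway. The gene names, expression values and function names are hypothetical; the abstract does not state which correlation measure or data structures the system uses.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pathway_correlations(pathway, expression):
    """Correlation for every pair of pathway members that has expression data."""
    measured = [m for m in pathway if m in expression]
    return {(a, b): pearson(expression[a], expression[b])
            for a, b in combinations(measured, 2)}

# Hypothetical expression levels over four arrays.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 7.8],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pathway = ["geneA", "geneB", "geneC", "geneD"]  # geneD has no measurements
correlations = pathway_correlations(pathway, expression)
```

Strongly negative values (geneA and geneC here) are as informative as positive ones, which is one reason pathway members can be related without falling into the same cluster.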


133. Semantic Link: A Knowledge Discovery Tool for Gene Expression Profiling
Ingrid M. Keseler, Nikolai N. Kalnine, BD/CLONTECH;
imkeseler@clontech.com
Short Abstract:

"Semantic Link" is an internet-based knowledge discovery tool designed for the interpretation of gene expression data. It contains a map of biological semantic concepts (genes, diseases, etc.) that are related to each other by their representation in the prevailing scientific literature.

One Page Abstract:

Microarray-based gene expression profiling generates thousands of data points which represent relative abundance of individual mRNA molecules in experimental and control samples. If we consider the cell as a network of interacting molecules with a mechanism of feedback control of their expression and degradation, a gene expression profile reflects an induction or suppression of certain regulatory pathways in response to a “treatment”. Interpreting gene expression data in terms of metabolic pathways is a challenging task for several reasons: 1) Incomplete knowledge of the functional role of genes in the cell. 2) Complex nature of cellular pathway network. 3) Limited sensitivity and selectivity of the microarray data. In addition, most of the relevant information is scattered over a variety of Internet databases and scientific publications, which are not designed for high-throughput processing. Semantic Link is a text processing program that extracts all available information on gene functions and related disorders from Medline titles and abstracts and organizes it in a database. A dictionary of 350,000 selected words and phrases representing gene names and biological processes was built from Medline ’96 to ’01. The building process consisted of automatic extraction of terms from the text followed by supervised filtering. The dictionary was then supplied to the text processor for identification of terms in the text. Finally, articles were clustered by counting the co-occurrence of terms in the same abstract, paragraph or sentence. Basic elements of linguistic analysis (protein = proteins, gene expression = expression of gene, etc.) and substitution of synonyms were applied at that stage. The resulting database of semantic terms and links can be viewed by an internet client in the form of either a graphical network or a taxonomy of dictionary items. 
A trial version of Semantic Link built on the collection of terms of the Gene Ontology Consortium is available at http://atlasinfo.clontech.com/.
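
The term identification and co-occurrence counting described above can be illustrated with a small sketch. It counts co-occurrence at the level of whole abstracts only, uses simple case-insensitive substring matching, and omits the linguistic normalisation and synonym substitution; the dictionary, the example texts and the function names are invented for illustration and are not taken from Semantic Link.

```python
from collections import Counter
from itertools import combinations

def find_terms(text, dictionary):
    """Dictionary terms occurring in a text (case-insensitive substring match)."""
    lowered = text.lower()
    return sorted(term for term in dictionary if term in lowered)

def cooccurrence_counts(abstracts, dictionary):
    """For every pair of terms, count the abstracts that mention both."""
    counts = Counter()
    for text in abstracts:
        for pair in combinations(find_terms(text, dictionary), 2):
            counts[pair] += 1
    return counts

# Invented toy dictionary (lower case) and toy texts.
dictionary = {"p53", "apoptosis", "bcl-2", "lymphoma"}
abstracts = [
    "p53 induces apoptosis in tumor cells.",
    "Apoptosis is regulated by bcl-2 and p53.",
    "bcl-2 expression in lymphoma.",
]
counts = cooccurrence_counts(abstracts, dictionary)
```

Pairs are stored in alphabetical order, so each link between two terms has a single key regardless of the order in which the terms appear in the text.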


134. Integration of transcript reconstruction and gene expression profiles to enhance disease gene discovery.
Peter van Heusden, Electric Genetics;
Alan Christoffels, Soraya Bardien, South African National Bioinformatics Institute;
Gary Greyling, Electric Genetics;
Ari Ziskind, University of Stellenbosch;
Johann Visagie, Antoine van Gelder, Electric Genetics;
Janet Kelso, South African National Bioinformatics Institute;
Liza Groenewald, Tania Hide, Electric Genetics;
Win Hide, South African National Bioinformatics Institute;
pvh@egenetics.com
Short Abstract:

We developed a tool to automate identification of positional candidates for genetic disorders based on expression state, physical mapping and genome mapping information. A controlled vocabulary was integrated into the stackPACK EST clustering system to generate expression profiles, and the resulting transcripts were mapped to the genome (http://genome.ucsc.edu/) and graphically visualised.

One Page Abstract:

There is an urgent need among human geneticists for bioinformatic tools to exploit the sequence data and other information generated by the Human Genome Project. We have developed a tool to automate identification of positional candidates for genetic disorders based on (1) expression state, (2) physical mapping and (3) genome mapping information. For expression state we will extract information from various gene expression repositories for standardised functional annotation of positional candidates, thereby enabling the effective prioritisation of these genes. Unprocessed expression data in the form of expressed sequence tags (ESTs), serial analysis of gene expression (SAGE) tags, and array-based experiments are stored in numerous disparate databases. However, in the absence of a standardised nomenclature, there are problems with accessing and manipulating this information. These difficulties are compounded in the context of high-throughput systematic analysis and emphasise the need for consistent across-database description of the same terms and objects. We have constructed a controlled vocabulary for standardised description of gene expression state. This vocabulary has been integrated into the stackPACK EST clustering system in order to generate cluster expression profiles. The credibility of these genes as positional candidates is enhanced by their mapping onto the Santa Cruz-assembled human genome sequence (http://genome.ucsc.edu/) using BLAST and SIM4. Genemap 99 radiation hybrid markers were also mapped to the genome sequence using ePCR to provide reference points. The resulting expression profiles and mapping information are exported in a standardised EMBL format for visualisation purposes, e.g. using Artemis. The tool has been tested using two known disease loci, retinitis pigmentosa on 8q (RP1 gene) and the type 2 diabetes locus on 2q (CAPN10 gene).


135. Gene Expression Database (GXD): integrated access to gene expression information from the laboratory mouse
Martin Ringwald, Dale A. Begley, Ingeborg J. McCright, Terry F. Hayamizu, David P. Hill, Constance M. Smith, Judith A. Blake, Janan T. Eppig, Jim A. Kadin, Joel E. Richardson, The Jackson Laboratory;
ringwald@informatics.jax.org
Short Abstract:

GXD is a community resource. Its objective is to capture and integrate different types of gene expression data from the laboratory mouse and to place these data in the larger biological and analytical context. GXD is accessible at http://www.informatics.jax.org/. New data are made available on a daily basis.

One Page Abstract:

The Gene Expression Database (GXD) is a community resource of gene expression information from the laboratory mouse. The database is designed as an open-ended system that can integrate different types of expression data, such as RNA in situ hybridization and immunohistochemistry data, Northern and Western blot data, RT-PCR data, cDNA data, and microarray data. Thus, as data accumulate, GXD provides increasingly complete information about what transcripts and proteins are produced by what genes; where, when and in what amounts these gene products are expressed; and how their expression varies in different mouse strains and mutants. Expression patterns are described using an extensive dictionary of anatomical terms for the mouse that has been established in collaboration with our colleagues in Edinburgh, UK*. The anatomical dictionary names the tissues and structures for each developmental stage, and organizes the terms hierarchically from body region or system to tissue to tissue substructure. This model enables an integrated description of expression patterns for various assays with differing spatial resolution, computational analysis of expression patterns at different levels of detail, and continuous extensions of the anatomical dictionary itself. Expression records are linked to digitized images of original expression data. GXD is available at http://www.informatics.jax.org/. It is integrated with the Mouse Genome Database to enable a combined analysis of genotype, expression, and phenotype data. In conjunction with the Gene Ontology project we build shared controlled vocabularies for biological processes, molecular functions and cellular components and assign those terms to mouse genes and their products. These classification schemes provide important new search parameters for expression data. Extensive interconnections with sequence databases and with databases from other species further extend GXD’s utility for analysis of gene expression information. 
*Edinburgh collaborators: J. Bard, R. Baldock, D. Davidson, M. Kaufman. GXD is supported by NIH grant HD33745. The Gene Ontology project is supported by NIH grant HG02273.


136. Analysis of gene expression profiles between interaction protein pairs in M. musculus
Rintaro Saito, Harukazu Suzuki, Ikuko Kagawa, Rika Miki, Hidemasa Bono, Hideaki Konno, Yasushi Okazaki, Yoshihide Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
rsaito@gsc.riken.go.jp
Short Abstract:

Toward an integrative analysis of gene expression data and protein-protein interaction data, we have calculated correlation coefficients of gene expression profiles between interacting protein pairs in M. musculus. We will present current results and discuss the general rules of expression patterns between the interacting pairs.

One Page Abstract:

Proteins play pivotal roles in all biological phenomena, where physiological interactions among many proteins make up biological pathways such as metabolic pathways and signal transduction pathways. Analysis of these biological pathways is one of the most important issues not only for molecular biology but also for medicine. Recent development of DNA microarray technologies has enabled us to examine the expression patterns of many genes at a time. In addition, the yeast two-hybrid method is widely used to screen physiological protein-protein interactions in a high-throughput manner. Development of several computational methods to infer pathways using either expression data or protein-protein interaction data is in progress. However, integrated approaches for analyzing both expression data and protein-protein interactions in higher organisms have not been established yet. The genome encyclopedia project in the RIKEN genome exploration research group has already collected a large number of mouse full-length enriched cDNAs (Nature 409:p685, 2001). We have also analyzed expression profiles of those cDNAs in 49 different tissues using DNA microarrays (Proc.Natl.Acad.Sci. USA 98:p2199, 2001). In addition, we are screening protein-protein interactions using those cDNAs and have identified approximately 150 interactions (paper in submission). We have analyzed the correlation coefficients of gene expression profiles between interacting protein pairs. The results show that the degree of correlation seems to depend on both the set of data selected for the calculation and the protein functions. We will present the current results and discuss the general rules of expression patterns between the interacting pairs. Further, a computational method to infer novel pathways using expression and protein-protein interaction data will be discussed.
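
One simple way to frame the comparison is to contrast the mean expression correlation of interacting pairs with that of all remaining pairs. The sketch below does this with Pearson correlation; the protein names, tissue profiles and the choice of background are assumptions for illustration, not the procedure used in the study.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_correlation(pairs, profiles):
    """Average Pearson correlation over a list of (name, name) pairs."""
    values = [pearson(profiles[a], profiles[b]) for a, b in pairs]
    return sum(values) / len(values)

def compare_with_background(interactions, profiles):
    """Mean correlation of interacting pairs and of all other pairs."""
    interacting = {frozenset(pair) for pair in interactions}
    background = [pair for pair in combinations(sorted(profiles), 2)
                  if frozenset(pair) not in interacting]
    return (mean_correlation(interactions, profiles),
            mean_correlation(background, profiles))

# Hypothetical expression profiles across five tissues.
profiles = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 4, 3, 2, 1],
    "D": [1, 3, 2, 3, 1],
}
interactions = [("A", "B")]  # hypothetical two-hybrid hit
interacting_mean, background_mean = compare_with_background(interactions, profiles)
```

Repeating the comparison on different subsets of tissues or functional classes would show the kind of dependence on data selection and protein function that the abstract reports.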


137. Learning genomic nature of complex diseases from the gene expression data
Andrew B. Goryachev, GeneData AG;
Pascale F. Macgregor, Clinical Genomics Center, University Health Network, Toronto, Canada;
Katryn Furuya, Hospital for Sick Children, Toronto, Canada;
Aled M. Edwards, C.H. Best Institute, University of Toronto, Canada;
Andrew.Goryachev@genedata.com
Short Abstract:

The major genomics challenge is how to apply various data-mining tools to extract biologically important information from expression data. We present a complete study of a complex liver disease. A variety of statistical analyses provided by the Expressionist software was applied to reveal intricate co-expression patterns characterising the disease.

One Page Abstract:

Significant emphasis is currently placed on understanding the molecular nature of complex human diseases, e.g., cancers. It has become evident that maladies caused by the malfunction of a single gene are rare. Instead, complex genome-scale aberrations are found responsible in an ever-growing number of cases. Expression data provide ample evidence for the existence of complex relationships between genes found in a given disorder. However, identification of such connections from the experimental data is a challenging task that requires a variety of data mining methods applied in various combinations. In practical applications, in which several diseases represented by many samples are compared to heterogeneous normal groups, the complexity of analysis quickly explodes. This overwhelming complexity demands sophisticated software tools offering a comprehensive set of analyses as well as advanced data management. We present a complete study in which complex expression data were analysed with the Expressionist software from GeneData AG. A human liver disorder of poorly understood origin was compared to another liver disease and a normal group of samples in a large-scale expression profiling experiment. A variety of filtering, clustering and correlation analysis methods was applied to the data to reveal intricate patterns of gene co-expression hinting at possible co-regulation characteristic of the particular disease. We also present a novel clustering approach which provides flexible definition of the cluster size and number.


138. Comparative Assessment of Normalization Methods for cDNA microarray data
Ilana Saarikko, Timo Viljanen, Turku Centre for Biotechnology and Turku Centre for Computer Science, University of Turku, Finland;
Riitta Lahesmaa, Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Finland;
Tapio Salakoski, Turku Centre for Biotechnology and Turku Centre for Computer Science, University of Turku, Finland;
Esa Uusipaikka, Department of Statistics, University of Turku, Finland;
ilana.saarikko@btk.utu.fi
Short Abstract:

There are many sources of variation in data obtained by cDNA microarray experiments. Using replicated experiments, we have studied how existing normalization methods affect data and subsequent analysis. Based on this study we point out key issues in normalization as well as propose guidelines for choosing methods for selected situations.

One Page Abstract:

Microarrays are one of the latest breakthroughs in the field of biotechnology, allowing the expression levels of thousands of genes to be monitored simultaneously. Microarray technology is already in wide use, and applications range from comparison of expression profiles to prediction of regulatory networks. One of the major problems of this new technology is the uneven quality of data. Due to the nature of microarray experiments there are many sources of variation in the data obtained. To make data reliable enough to enable comparison across experiments, such variation needs to be removed. The process of removing this variation is called normalization. Commonly used normalization methods force the distribution of the log-ratio of expression levels to have a median or mean of zero.
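
As a concrete instance of the commonly used approach just mentioned, the following minimal sketch centres the log2 ratios of one array on a median of zero. The intensity values and function names are made up for illustration; the methods compared in the study may differ in detail.

```python
import math
from statistics import median

def log_ratios(cy5, cy3):
    """Per-spot log2 ratio of the two channel intensities."""
    return [math.log2(r / g) for r, g in zip(cy5, cy3)]

def median_centre(values):
    """Shift all values so that their median becomes zero."""
    m = median(values)
    return [v - m for v in values]

# Hypothetical background-corrected intensities for five spots.
cy5 = [200.0, 400.0, 800.0, 100.0, 1600.0]
cy3 = [100.0, 100.0, 100.0, 100.0, 100.0]
ratios = log_ratios(cy5, cy3)
normalised = median_centre(ratios)
```

Replacing `median` with `statistics.mean` gives the mean-centred variant; the median is less sensitive to a few strongly regulated genes.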

We have studied existing normalization methods and demonstrated how different methods affect data and subsequent analysis. One of our goals is to define general criteria for necessary and sufficient normalization. In our studies we have focused on replicated microarray experiments. We used data from four replicated slides, each having three replicated arrays of 1536 genes. Slides were hybridized with mRNA from two different samples, which were labeled with Cy3 and with Cy5. We also used data from additional staining as a measure of the amount of cDNA on each probe. These data were used for normalization purposes.

Based on the replicate data, we validate the normalization methods in two ways. First, we examine how normalization affects the correlation between replicate arrays. Second, we study how the set of differentially expressed genes, defined by various criteria, varies with different normalization methods. On the basis of the results, we suggest guidelines for choosing good normalization methods for different situations.

keywords: cDNA microarray, gene expression, normalization


139. Identifying different types of human lymphoma by SVM and ensembles of learning machines using DNA microarray data.
Giorgio Valentini, D.I.S.I., Dipartimento di Informatica e Scienze dell' Informazione, Universita' di Genova;
valenti@disi.unige.it
Short Abstract:

We propose supervised methods for identifying different types of human lymphoma using DNA microarray gene expression data. Support Vector Machines and ensembles of neural networks can correctly classify different types of lymphoma, offering also insights into the role of coordinately expressed groups of genes in carcinogenic processes of lymphoid cells.

One Page Abstract:

DNA hybridization microarrays supply information about gene expression through measurements of the mRNA levels of large numbers of genes in a cell. Information obtained by DNA microarray technology gives a snapshot of the overall functional status of a cell, offering new insights into potentially different types of lymphoma, discriminated on a molecular and functional basis. Gene expression data produced by DNA microarray technology can be processed through unsupervised machine learning methods, using clustering algorithms to group together similar expression patterns corresponding to different tissues in order to separate cancerous from normal samples. However, unsupervised methods cannot always correctly separate classes. Supervised methods can overcome this problem by exploiting "a priori" biological and medical knowledge of the problem domain. In this work we use supervised machine learning methods for recognizing cancerous and normal lymphoid tissues, classifying different types of human lymphoma and identifying groups of genes related to a specific type of lymphoma. We use data from a specialized DNA microarray, named "Lymphochip", developed at Stanford University School of Medicine and specifically designed to study lymphoid- and carcinogenesis-related genes. In our first task we distinguish cancerous from normal tissues using the overall information available. This dichotomic problem is tackled using Support Vector Machines (SVM) and Multi-Layer Perceptrons (MLP). In our second task we try to directly classify different types of lymphoma (a multiclass problem) using MLPs and Parallel Non linear Dichotomizers (PND), i.e. ensembles of learning machines based on output coding decomposition of a multiclass problem. These methods consist in decomposing a multiclass problem into a set of two-class problems according to some decomposition scheme, training the dichotomizers independently and combining their outputs to give the class label.
In the third task we show how to use "a priori" biological and medical knowledge for separating two functional subclasses of diffuse large B-cell lymphoma (DLBCL) not detectable with traditional morphological classification schemes, identifying a set of coordinately expressed genes related to the separation of the two DLBCL subgroups. The results show that SVM, MLP and PND can be successfully applied to the analysis of DNA microarray gene expression data and to the identification of sets of coordinately expressed genes related to specific types of lymphoma.
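
The output coding idea (decompose, train the dichotomizers independently, combine the outputs) can be sketched as below. A nearest-centroid rule stands in for the MLP or SVM dichotomizers used in the work, a one-per-class code is used as the decomposition scheme, and decoding picks the class whose codeword is closest in Hamming distance; these choices, the class labels "A", "B", "C" and the two-dimensional toy samples are all assumptions for illustration.

```python
def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_dichotomizer(samples, bits):
    """Nearest-centroid two-class learner (stand-in for an MLP or SVM)."""
    pos = centroid([s for s, b in zip(samples, bits) if b == 1])
    neg = centroid([s for s, b in zip(samples, bits) if b == 0])
    def predict(x):
        return 1 if sq_dist(x, pos) <= sq_dist(x, neg) else 0
    return predict

def train_ensemble(samples, labels, codewords):
    """Train one dichotomizer per codeword bit, each independently."""
    n_bits = len(next(iter(codewords.values())))
    return [train_dichotomizer(samples, [codewords[y][j] for y in labels])
            for j in range(n_bits)]

def classify(x, ensemble, codewords):
    """Return the class whose codeword is nearest to the ensemble output."""
    output = [predict(x) for predict in ensemble]
    def hamming(cls):
        return sum(o != c for o, c in zip(output, codewords[cls]))
    return min(sorted(codewords), key=hamming)

# One-per-class decomposition for three hypothetical classes.
codewords = {"A": (1, 0, 0), "B": (0, 1, 0), "C": (0, 0, 1)}
samples = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 0), (5, 1), (0, 5), (1, 5), (0, 6)]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
ensemble = train_ensemble(samples, labels, codewords)
predicted = classify((5.5, 0.5), ensemble, codewords)
```

Other decomposition schemes only change the `codewords` table; longer codewords with larger mutual Hamming distance allow some dichotomizer errors to be corrected at decoding time.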


140. On the Influence of the Transcription Factor on the Information Content of Binding Sites
Jan T. Kim, Thomas Martinetz, Daniel Polani, Institut für Neuro- und Bioinformatik, Universität zu Lübeck;
kim@inb.mu-luebeck.de
Short Abstract:

We develop a probabilistic model for coevolution of a transcription factor and its binding sites. Maximum entropy analysis reveals connections between binding site information content and binding behaviour of the transcription factor, and gives insight into the bioinformatic basis of Rsequence = Rfrequency. This may be useful for improving binding site recognition.

One Page Abstract:

Transcription factors and their binding sites are a centerpiece of genetic information processing. Transcription factor binding sites are short sequence words. The location of these binding sites on the genome provides important information about the structure of the regulatory networks the transcription factor is involved in, as well as about the location of genes and other coding regimes on the genome. However, finding these binding sites has turned out to be a difficult task which can only be solved with prior knowledge about the principal binding behaviour of transcription factors. A model for the basic probability distributions underlying the coevolution of the transcription factor and its binding sites within the genome is presented. State spaces for the transcription factor and for the genome are jointly represented, which is an extension of previous models in which only genome space is considered. The model is formally analyzed with a maximum entropy approach. Empirical analyses using computer-based enumerations of the joint state spaces are performed to show that approximations made during formal analysis are justified.

The results give new insights into the connection between the information content of these binding sites and the binding behaviour of the transcription factor with particularly interesting implications for the relation between binding site information content (Rsequence) and binding site abundance on the genome (determining Rfrequency). The intriguing empirical observation that Rsequence approximately equals Rfrequency in a couple of instances still awaits a complete bioinformatic explanation. Our analysis reveals that this (approximate) equality cannot be generically deduced from information theoretic principles.
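
For readers unfamiliar with the two quantities, the sketch below computes them under the usual textbook definitions, which are assumed here rather than taken from the abstract: Rsequence as the sum over alignment positions of 2 bits minus the Shannon entropy of the base frequencies (equiprobable background, no small-sample correction), and Rfrequency as log2 of the number of possible positions divided by the number of sites. The aligned sites and the genome size are invented so that the two values coincide.

```python
import math
from collections import Counter

def r_sequence(sites):
    """Information content in bits of aligned DNA sites (equiprobable
    background bases assumed, no small-sample correction)."""
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        entropy = -sum((c / len(column)) * math.log2(c / len(column))
                       for c in counts.values())
        total += 2.0 - entropy
    return total

def r_frequency(genome_positions, n_sites):
    """Bits needed to single out n_sites among genome_positions positions."""
    return math.log2(genome_positions / n_sites)

# Invented example: four aligned sites, three conserved positions and one
# position split evenly between two bases.
sites = ["ACGT", "ACGA", "ACGT", "ACGA"]
bits_sequence = r_sequence(sites)
bits_frequency = r_frequency(genome_positions=1024, n_sites=8)
```

In this constructed case both quantities are 7 bits; the point made in the abstract is that such agreement is an empirical observation and does not follow from information theory alone.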

Regarding basic bioinformatics, this finding leads to a renewal of interest in empirical studies of binding site information content and distribution across genomes. Since binding site information content is not determined by fundamental informatic principles, one must assume that the relation of Rsequence and Rfrequency is determined by biological principles that are not yet known, and that should therefore be investigated using empirical studies combined with theoretical efforts. On the applied side, advances in understanding the bioinformatic principles underlying binding site evolution are likely to provide additional sources of prior knowledge that is useful for developing improved binding site recognition schemes.


141. A Mouse Developmental Gene Index
Janet Kelso, South African National Bioinformatics Institute, University of the Western Cape;
George J. Kargul, Yong Qian, Dawood B. Dudekula, Minoru S.H. Ko, Developmental Genomics and Aging Section, Laboratory of Genetics, National Institute on Aging, National Institutes of;
Winston A. Hide, South African National Bioinformatics Institute, University of the Western Cape;
janet@sanbi.ac.za
Short Abstract:

We produced and annotated a mouse developmental gene index using cDNAs generated from mouse developmental libraries. This index has been compared to the RIKEN mouse cDNA collection to determine redundancy of the datasets. Selection and annotation of clones for rearraying, and subsequent production of a mouse cDNA microarray is presented.

One Page Abstract:

While providing large amounts of genomic information, genomic sequencing efforts do not address the pressing need for comprehensive gene expression information. Despite their generally low sequence quality and short length, expressed sequence tags (ESTs) remain a rich source of gene expression information, providing data on expression location, expression level and the presence of alternative transcript isoforms. Attempts to elucidate the entire expressed gene complement of an organism have been hampered by the scarcity of full-length cDNAs representing all expressed gene transcripts. The absence of full-length transcript data and the relative abundance of ESTs have resulted in a number of groups producing reconstructed transcript gene indices. These gene indices seek to reduce the redundancy and error present in the EST databases by clustering and assembling ESTs based on sequence identity and clone annotation. Clustered EST data have proven invaluable in advancing gene and alternative splice form discovery, genome annotation and the understanding of gene regulation. In this study we have produced and annotated a mouse developmental gene index from high quality cDNA sequences generated from early mouse developmental libraries in collaboration with Minoru Ko’s group in the Gerontology Research Center at the National Institute on Aging. This gene index has been compared to the recently published RIKEN mouse cDNA collection to determine redundancy of the datasets. Progress in the selection and annotation of clones for rearraying and subsequent production of a mouse developmental cDNA microarray is presented.


142. Using Gene Expression and Artificial Neural Networks for Classification and Diagnostic Prediction of Cancers
Markus Ringner, National Human Genome Research Institute/NIH;
Javed Khan, National Cancer Institute/NIH;
Jun S Wei, Lao H Saal, National Human Genome Research Institute/NIH;
Carsten Peterson, Complex Systems Division, Lund University;
Paul S. Meltzer, National Human Genome Research Institute/NIH;
mringner@thep.lu.se
Short Abstract:

A method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs) is presented. We trained the ANNs using small round blue cell tumors, belonging to four distinct diagnostic categories. The ANNs correctly classified all samples and identified the genes most relevant to the classification.

One Page Abstract:

Small round blue cell tumors (SRBCTs) of childhood, including neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt's lymphoma (BL) and Ewing's sarcoma (EWS), are difficult to distinguish by routine immunohistochemistry. Currently there is no single test that can precisely distinguish these cancers, and several techniques are utilized to diagnose them, including cytogenetics, interphase fluorescence in situ hybridization, reverse transcription PCR and immunohistochemistry. In addition, poorly differentiated cancers can still pose a diagnostic dilemma. Gene expression profiling with cDNA microarray techniques permits the simultaneous analysis of multiple markers and hence offers considerable promise for categorizing cancers into subgroups.

We use gene expression data from cDNA microarrays containing 6567 genes from 63 SRBCT to calibrate artificial neural network (ANN) models to recognize cancers belonging to each of the four categories. The training samples included both tumor biopsy (13 EWS and 10 RMS) material and cell lines (10 EWS, 10 RMS, 12 NB and 8 BL). Given the small available data set, we preprocess the gene expression levels using Principal Component Analysis (PCA) retaining 10 dominant directions and thereby reducing the input space significantly.

We classify the samples into the four categories using a 3-fold cross-validation procedure: the 63 known (labeled) samples are randomly shuffled and split into 3 equally sized groups. Linear perceptron models are then calibrated with 10 input variables using two of the groups, and the third group is reserved for testing predictions (validation). This procedure is repeated 3 times, each time with a different group used for validation. The random shuffling is redone 1250 times and for each shuffling we analyze 3 ANN models. Thus, in total each sample belongs to a validation set 1250 times and 3750 ANN models have been calibrated. The committee of models classifies all validation samples correctly. Due to the limited amount of training data and the high performance already achieved, we limit ourselves to linear models with no hidden units. Confidence measures in terms of distances to ideal classifications are developed for the data.
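
The repeated 3-fold cross-validation with a committee vote can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid rule (which is also a linear classifier) stands in for the calibrated linear perceptrons, the inputs are assumed to be already reduced to a few PCA components, and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Fit one centroid per class (assumed stand-in for a calibrated linear model)."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def committee_cross_validation(X, y, n_shuffles=1250, n_folds=3, seed=0):
    """Repeated k-fold cross-validation with a majority vote per sample."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    n_models = 0
    for _ in range(n_shuffles):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[k::n_folds] for k in range(n_folds)]
        for k, held_out in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            model = fit_centroids([X[i] for i in train], [y[i] for i in train])
            n_models += 1
            # Each held-out sample receives exactly one validation vote per shuffle.
            for i in held_out:
                votes[i][predict(model, X[i])] += 1
    committee = [v.most_common(1)[0][0] for v in votes]
    return committee, votes, n_models
```

With the default settings, every sample collects 1250 validation votes and 3 x 1250 = 3750 models are fitted, which matches the counts given in the abstract.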

The sensitivity to the different genes is determined by the absolute value of the partial derivative of the output with respect to the gene expressions, averaged over samples and ANN models. By using the resulting ranking list of the inputs (genes) we redo the training procedure for different numbers of inputs and establish the minimal number of genes that optimizes the classification of the four cancer types. In this way 96 genes are identified, which correctly classify the 63 samples.
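
The sensitivity measure can be illustrated with a short sketch. It assumes each model is a single logistic output unit acting directly on the gene expression values (the PCA projection and the four output nodes of the actual models are left out), so the derivative of the output o with respect to gene k is o(1 - o) times the weight of gene k. The function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gene_sensitivities(models, samples):
    """Average |d output / d x_k| over all samples and models.

    models:  list of (weights, bias) pairs for logistic units (assumed form).
    samples: list of expression vectors, one value per gene.
    """
    n_genes = len(samples[0])
    total = [0.0] * n_genes
    for weights, bias in models:
        for x in samples:
            o = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            slope = o * (1.0 - o)  # derivative of the logistic function
            for k, w in enumerate(weights):
                total[k] += abs(slope * w)
    denom = len(models) * len(samples)
    return [t / denom for t in total]


def rank_genes(sensitivities):
    """Gene indices ordered from most to least sensitive."""
    return sorted(range(len(sensitivities)),
                  key=lambda k: sensitivities[k], reverse=True)
```

Retraining with only the top-ranked genes for several list lengths then gives the kind of curve from which a minimal gene set could be read off.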

We then further test the validity of the models by classifying an additional set of 25 ("blind test") samples containing [A] 20 SRBCT samples, both tumor samples (5 EWS, 5 RMS, and 4 NB) and cell lines (1 EWS, 2 NB, 3 BL), and [B] 5 non-SRBCT samples (including 2 normal muscle samples). We are able to [A] correctly classify all 20 of the SRBCT samples and [B] reject the non-SRBCT samples based on confidence-related criteria. In addition, on evaluation of the top 96 ranked gene list we identify several genes that are uniquely expressed in a specific cancer, that have potential biological and therapeutic implications, and which have not been previously associated with these cancers.

We feel that this method of ANN analysis of gene expression data provides a powerful tool for classification, diagnosis and gene discovery. That only 96 genes are required for this application opens the potential for cost-effective fabrication of SRBCT subarrays for diagnostic use.


143. Classification of malignant states in multistep carcinogenesis using gene expression matrix
Koji Kadota, Department of Biotechnology, The University of Tokyo, and Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
Yasushi Okazaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
Shugo Nakamura, Department of Biotechnology, The University of Tokyo;
Yoshihide Hayashizaki, Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute;
Kentaro Shimizu, Department of Biotechnology, The University of Tokyo;
kadota@bi.a.u-tokyo.ac.jp
Short Abstract:

cDNA microarray technology has the potential to be used in the clinic to distinguish malignant samples from benign ones. We have developed an efficient method to extract genes that contribute to separating malignant samples from benign ones with minimal false-negative diagnoses.

One Page Abstract:

Certain types of cancer are reported to develop through multistep carcinogenesis, and several types of tumors show a clinical course ranging from benign to malignant. Recently, microarray technology has revolutionized the way global gene expression is observed across many tissues or conditions. This technique has been successfully applied to clinical samples to distinguish malignant from benign samples. To date, several supervised and unsupervised methods have been developed to classify two distinct states, such as tumor vs. normal clinical samples. However, the accuracy of classification using these methods varies depending on the dataset. For diagnosis, it is more important that malignant samples are diagnosed as malignant than that benign samples are diagnosed as benign, so it is essential to use predictor genes that achieve 100% accuracy on the malignant samples. In this work, we have developed a novel method to select genes that characterize the malignant state relative to the benign state using a gene expression matrix. In brief, genes that contribute to the characterization of the malignant phenotype were selected by removing each gene in turn from the original gene set to see whether that gene contributes positively to characterizing the malignant phenotype. We applied this algorithm to practical clinical samples to evaluate whether the presence of metastasis can be accurately predicted.
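
The leave-one-gene-out idea can be sketched as follows. The abstract does not specify how a gene set is scored, so the scoring rule here is purely an assumption made for illustration: a sample is called malignant when its mean expression over the gene set reaches a threshold, the threshold is set so that every malignant sample is called malignant (no false negatives), and the score is the number of benign samples still called benign. A gene is kept if removing it lowers that score. All names are hypothetical.

```python
def specificity_at_full_sensitivity(genes, malignant, benign):
    """Benign samples called benign when no malignant sample is missed (assumed score)."""
    def score(sample):
        return sum(sample[g] for g in genes) / len(genes)

    # Lowest malignant score: every malignant sample passes this threshold.
    threshold = min(score(s) for s in malignant)
    return sum(1 for s in benign if score(s) < threshold)


def contributing_genes(all_genes, malignant, benign):
    """Genes whose removal from the full set worsens the assumed score."""
    baseline = specificity_at_full_sensitivity(all_genes, malignant, benign)
    selected = []
    for g in all_genes:
        rest = [h for h in all_genes if h != g]
        if specificity_at_full_sensitivity(rest, malignant, benign) < baseline:
            selected.append(g)
    return selected


# Toy expression matrix: rows are samples, columns are genes 0..2.
malignant_samples = [[4, 4, 0], [4, 4, 0]]
benign_samples = [[0, 0, 6], [3, 0, 0]]
kept = contributing_genes([0, 1, 2], malignant_samples, benign_samples)
```

In this toy matrix, genes 0 and 1 are needed to keep both benign samples below the threshold, while gene 2 is not.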


144. Bioinformatics Tools in the Screening of Gene Delivery Systems
Karin Regnström, Eva Ragnarsson, Per Artursson, Dep of Pharmacy, University of Uppsala, Sweden;
Karin.Regnstrom@galenik.uu.se
Short Abstract:

We use array technology and bioinformatics tools to evaluate the gene expression profiles of suitable gene delivery systems. Our studies show that each delivery system tested results in a unique profile, a "fingerprint". Together with other experimental data, these profiles are used for screening and further design of delivery systems.

One Page Abstract:

Purpose. To use bioinformatics tools in the comparison and evaluation of gene expression profiles originating from treatment with newly developed gene delivery systems.

Introduction. In our laboratory we use array technology to evaluate immunogenic properties as well as possible toxic reactions of suitable gene delivery candidates.

Methods. Gene delivery systems formulated with a reporter plasmid were administered mucosally to mice. Cells from the animals were harvested and total RNA was extracted. 32P-labeled cDNA copies of the RNA samples were produced, and the probes were hybridized to a cDNA expression array and scanned with a phosphorimager. The images were analyzed and normalized to the data of control samples to enable comparison of the different formulations. Pairwise comparisons of the overall gene expression changes between different delivery systems were made using the Spotfire program (1). For comparisons of up to five delivery systems the GeneCluster program (2) was used. The gene expression data were filtered to obtain genes with a significant change in expression and clustered using self-organizing maps (SOMs). Further visualization was obtained with the Treeview program (3).

Results. The genes that passed the significance filter were sorted into SOMs, and distinct clusters were obtained for the different delivery systems. The clusters showed gene groups which were selectively affected after treatment with different delivery systems. Some samples also showed high expression of known toxicity markers. It was possible to discern a gene expression "fingerprint" for each of the gene delivery systems tested.

Conclusions. This study identified important changes in gene expression profiles induced by the gene delivery systems studied. We conclude that bioinformatics in combination with the array technology has a great potential for the evaluation of pharmaceutical formulations during screening procedures.

In progress. We want to create a database containing gene expression data from all our formulations tested as well as other experimental data and molecular properties of the delivery systems. Our goal is to develop a tool which screens this database for suitable gene delivery systems by multiple comparisons and evaluations, which results in improved design of gene delivery systems.

1. http://www.spotfire.com/

2. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. & Golub, T. R. (1999) Proc Natl Acad Sci U S A 96, 2907-12.

3. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc Natl Acad Sci U S A 95, 14863-8.


145. Cross talking in cellular networks: tRNA-synthetase and amino acid synthetic enzymes in Escherichia coli
Emmeli Taberman, Måns Ehrenberg, Uppsala University;
emmeli.taberman@icm.uu.se
Short Abstract:

By constructing global mathematical models of growing bacteria we studied the control of production of an amino acid and its aminoacyl tRNA synthetase. The goal was to investigate how the cell can avoid interference between these two control loops, and discriminate between signals for charging deficiency and amino acid deficiency.

One Page Abstract:

It has been known for a long time that microorganisms exert different types of control for the expression of different operons. Control can be exerted at the transcriptional (e.g. by ribosome dependent attenuation mechanisms or repressors), the translational (e.g. by autogenous feed-back) or at the posttranslational (e.g. protein modifications) level. To assess how mechanisms for the control of gene expression behave in vivo we have constructed a global mathematical model for growing bacteria. This has been used to study, first, control of expression of an operon for enzymes that synthesise the amino acid threonine and, second, control of synthesis of the aminoacyl-tRNA synthetase (ThrRS) that couples Thr to tRNAThr. The threonine biosynthetic pathway is regulated by an attenuation mechanism, involving a leader peptide with multiple Thr and Ile codons. The expression from the gene for ThrRS is regulated by an autogenous mechanism, where the leader of the mRNA that encodes ThrRS mimics tRNAThr. When ThrRS is in excess, it binds strongly to the leader of its mRNA and thereby inhibits initiation of translation. An interesting problem is now how the cell can discriminate between a maladjusted rate of synthesis of an amino acid, on one hand, and a too high or a too low level of the corresponding aminoacyl-tRNA synthetase, on the other. For instance, if the aminoacyl-tRNA synthetase concentration is too low, this will not only signal for increased production of the synthetase but also for increased (attenuation control with ribosome step time as signal) or decreased (repressor control with amino acid pool as signal) production of the amino acid synthesising pathway. Our analysis shows that there is considerable “cross-talk” between control systems for amino acid synthesis and production of tRNA synthetases. We discuss how the cell can minimize the negative effects of signal misinterpretations due to such cross-talk.
We also describe suitable experiments to test predictions based on our mathematical models.


146. Assessing Clusters and Motifs from Gene Expression Data
D. K. Smith, L. M. Jakt, Biochemistry Dept., Univ. of Hong Kong;
L. Cao, Dept. Microbiology, UHK;
K. S. E. Cheah, Biochemistry Dept., Univ. of Hong Kong;
dsmith@hkusua.hku.hk
Short Abstract:

A method has been developed to assess gene clusters derived from microarray experiments. The probability of finding motif matches associated with the genes in the cluster by chance is determined. Issues of biological relevance, over or under-clustering, activity in several clusters or the refinement of motifs can be addressed.

One Page Abstract:

ASSESSING CLUSTERS AND MOTIFS FROM GENE EXPRESSION DATA

Jakt, L.M. (1), Cao, L. (2), Cheah, K.S.E. (1) and Smith, D.K. (1)

Departments of Biochemistry (1) and Microbiology (2), University of Hong Kong, Pok Fu Lam, Hong Kong.

When analysing gene expression data from microarray based studies, it is common to compare the expression profiles of the genes and perform some clustering of the profiles. Genes with similar expression profiles are grouped by the clustering algorithm and these genes are more likely to have similar functions or to be regulated in a common manner. Searches for conserved DNA motifs, which may potentially be cis-regulatory elements, can be undertaken in the non-coding regions of the genes in the cluster. For a computational study of gene expression there is a wide range of algorithms available to cluster expression profiles, to find new motifs in unaligned DNA sequences and to match known motifs to DNA sequences. Experimental errors from the microarray studies can also propagate through the computational analysis and so compound the effects of any limitations in the algorithms used. A method to evaluate these analyses is desirable.

We have developed a method to assess the potential functional significance of clusters and motifs which is based on the probability of finding a certain number of matches to a motif in all of the gene clusters. As a starting point, we take a set of genes that have been clustered, based on their expression profiles, by some algorithm and a series of sequence motifs that may describe cis-regulatory elements. Issues of what threshold score to use for the differing motif matching algorithms are avoided by taking the best matches to a motif across the gene set, in groups of 50 to 600 matches. By counting the number of matches that are associated with each gene cluster, we can calculate the probability of observing, by chance, that number of matches to a motif in the non-coding regions of the genes in a cluster. The likely functional relevance of the clusters and motifs can be assessed based on these probabilities. This technique allows strong and weakly matching motifs to be detected and refined and significant matches to motifs across cluster boundaries can be observed. Application of this method to the yeast genome and a series of regulatory motifs led to the prediction that the previously unidentified factor known as Swi Five Factor was one of the yeast fork head proteins. Subsequently, this was confirmed by others.
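
The chance probability of seeing a given number of motif matches in one cluster can be sketched as below. The abstract does not state the null model, so a binomial model is assumed here purely for illustration: each of the N best matches falls into the cluster independently with a probability equal to the cluster's share of the searched non-coding sequence. The function name and the example numbers are hypothetical.

```python
from math import comb


def prob_at_least(k, n_matches, cluster_fraction):
    """P(X >= k) for X ~ Binomial(n_matches, cluster_fraction) (assumed null model)."""
    p = cluster_fraction
    return sum(comb(n_matches, i) * p ** i * (1 - p) ** (n_matches - i)
               for i in range(k, n_matches + 1))


# Example: the 100 best matches to a motif across the whole gene set,
# 15 of which fall in a cluster holding 5% of the non-coding sequence.
p_value = prob_at_least(15, 100, 0.05)
```

Repeating the calculation for several values of N in the 50 to 600 range mentioned above shows whether the enrichment depends on how many top matches are taken.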


147. Statistical Analysis of Gene Expression Profile Changes among Experimental Groups
Taesung Park, Sung-Gon Lee, Seungmook Lee, Department of Statistics, Seoul National University;
Dong-Hyun Yoo, Mi-Yoon Chang, Yong-Sung Lee, Department of Biochemistry, Hanyang University College of Medicine;
tspark@stats.snu.ac.kr
Short Abstract:

cDNA microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. We propose a test procedure for testing gene expression profile differences among experimental groups. The test procedure is illustrated using cDNA microarrays of 3,800 gene expression profiles from neuronal differentiation of cortical stem cells.

One Page Abstract:

cDNA microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. Cluster analysis is commonly used to group together genes with similar patterns of expression. Genes in different clusters tend to be regarded as having different expression profiles. When we are interested in testing gene expression profiles over time for different experimental groups, however, the usual clustering methods do not help much. We consider a simple summary measure to differentiate genes that have high variability from ones that do not. Using this measure, we propose a test procedure to test the differences in gene expression profiles among experimental groups. The test procedure is illustrated using cDNA microarrays of 3,800 genes obtained in an experiment to search for changes in gene expression profiles during neuronal differentiation of cortical stem cells.


148. Multivariate method for selection of sets of differently expressed genes
Ashot Chilingarian, N. Gevorgyan, Cosmic Ray Division, Yerevan Physics Institute, Armenia;
A. Szabo, Department of Oncological Sciences and Huntsman Cancer Institute, University of Utah;
A. Vardanyan, Cosmic Ray Division, Yerevan Physics Institute, Armenia;
chili@yerphi.am
Short Abstract:

Genes differentially expressed in two tissues are found by an evolutionary algorithm maximizing the Mahalanobis distance between gene expression vectors. “Evolutionary bootstrap” resolves the instability of sample covariance matrices. We show the superiority of this multidimensional method compared to commonly used one-dimensional tests using a microarray data simulation model.

One Page Abstract:

An important problem addressed using cDNA microarray data is the detection of genes differentially expressed in two tissues of interest. Currently used approaches consider each gene separately and evaluate their differential expression independently, ignoring the multidimensional structure of the data. However, it is well known that correlation among covariates can enhance the ability to detect less pronounced differences. We propose a novel approach utilizing the gene correlation information for finding the differentially expressed genes. The Mahalanobis distance between vectors of gene expressions is the criterion for simultaneously comparing a set of genes, and an evolutionary algorithm is developed for maximizing it. However, the extreme imbalance between the number of genes and the number of experiments causes an instability of the sample covariance matrices, so a direct application of the Mahalanobis distance is not feasible. To overcome this problem we develop a new method of combining data from small-scale random search experiments that we term “evolutionary bootstrap”. We validate the proposed method in two ways. First, we simulate cDNA microarray data where the extent of differential expression of each gene is known. We apply the multidimensional method and several commonly used one-dimensional statistical tests and compare their ability to correctly identify differentially expressed genes and to rank them according to differential expression. By utilizing the correlation structure, the multivariate method, in addition to the genes found by the one-dimensional criteria, finds genes whose differential expression is not detectable marginally. As a different test, we apply the proposed method to data on two colon cancer cell lines and evaluate its ability to find genes that allow the classification of the samples according to their origin.
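
The criterion itself, the squared Mahalanobis distance between the two group means for a chosen gene subset, can be sketched as below. This is only an illustration of the distance with a pooled sample covariance; it does not reproduce the evolutionary search or the “evolutionary bootstrap”, and the function names and toy data are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(group_a, group_b):
    """Pooled within-group sample covariance matrix."""
    d = len(group_a[0])
    cov = [[0.0] * d for _ in range(d)]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            for i in range(d):
                for j in range(d):
                    cov[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    dof = len(group_a) + len(group_b) - 2
    return [[c / dof for c in row] for row in cov]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular for this gene subset")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis_sq(group_a, group_b, genes):
    """Squared Mahalanobis distance between group means on the given genes."""
    a = [[row[g] for g in genes] for row in group_a]
    b = [[row[g] for g in genes] for row in group_b]
    diff = [x - y for x, y in zip(mean_vector(a), mean_vector(b))]
    z = solve(pooled_covariance(a, b), diff)
    return sum(d * v for d, v in zip(diff, z))


# Two strongly correlated genes; the groups differ only slightly per gene.
tissue_a = [[0.0, 0.2], [1.0, 0.8], [2.0, 2.2], [3.0, 2.8]]
tissue_b = [[0.5, -0.3], [1.5, 0.3], [2.5, 1.7], [3.5, 2.3]]
joint = mahalanobis_sq(tissue_a, tissue_b, [0, 1])
single = [mahalanobis_sq(tissue_a, tissue_b, [g]) for g in (0, 1)]
```

In this toy data the joint distance is far larger than either single-gene distance, which is the situation described above where differential expression is not detectable marginally. The singular-matrix check hints at why a direct application fails when genes greatly outnumber experiments.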


149. Understanding Non Small Cell Lung Cancer by Analysis of Expression Profiles
Nir Friedman, Yoseph Barash, Hebrew University;
Amir Ben-Dor, Zohar Yakhini, Agilent Laboratories;
Naftali Kaminski, Sheba Medical Center, Israel;
nir@cs.huji.ac.il
Short Abstract:

To understand the molecular mechanisms that underlie lung cancer, we analyze gene expression patterns in tumor and normal lung samples. We present computational methods that we developed to extract biological meaning from this data. We discuss the significance of the information we retrieve and its potential impact on cancer research.

One Page Abstract:

Lung cancer is a common malignancy and a major determinant of overall cancer mortality in developed and developing countries. Despite intensive research, little has changed in the understanding and management of the disease. In order to determine the transcriptional programs that are active in non-small cell lung cancer (NSCLC), gene expression patterns of ~12,000 genes were collected from 24 NSCLC tumor samples, 11 normal histology samples from lung resections for cancer and pooled normal lung RNA (5 individual lungs) obtained commercially.

In this poster, we present analysis of these gene expression profiles. We show that gene expression patterns were highly distinct in tumor and normal tissues. We use the Total-Number-of-Misclassifications (TNoM), Information-content (Info) and Gaussian-Error scores to detect genes that significantly differ between NSCLC tumors and normal lung samples. One evident observation was that informative genes were significantly overabundant in our dataset, thus supporting the significance of the results.
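
As an illustration of one of these scores, the sketch below computes a TNoM-style value for a single gene, taken here to mean the smallest number of samples misclassified by any rule that thresholds that gene's expression level (in either direction). The abstract does not define the score, so this reading and the function name are assumptions.

```python
def tnom(values, labels):
    """Minimum number of misclassifications over all single-threshold rules.

    values: expression levels of one gene across samples.
    labels: True for tumor samples, False for normal samples.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(1 for _, lab in pairs if lab)
    # A threshold outside the data range calls every sample the same class.
    best = min(total_pos, n - total_pos)
    pos_below = 0
    for i, (value, lab) in enumerate(pairs):
        pos_below += 1 if lab else 0
        if i + 1 < n and pairs[i + 1][0] == value:
            continue  # tied values cannot be split by a threshold
        neg_below = (i + 1) - pos_below
        # Rule "above threshold means tumor": tumors below plus normals above.
        err_up = pos_below + ((n - total_pos) - neg_below)
        # The opposite rule misclassifies exactly the remaining samples.
        err_down = n - err_up
        best = min(best, err_up, err_down)
    return best
```

A gene with a score of 0 separates the two classes perfectly; genes with low scores are the informative ones referred to above.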

To better understand the transcriptional program we analyzed the genomic location of genes that differ between NSCLC tumors and normal lung tissues, and compared these to cytogenetic abnormalities observed in the tumor samples. Finally, we developed and used class discovery tools to characterize putative tumor sub-types.

The wealth of statistically significant and biologically meaningful information in our dataset supports our contention that transcriptional profiling will lead to new insights into the pathogenesis of lung cancer, thus leading to development of new tools for early detection and treatment of this devastating disease.


150. Applications of high-throughput identification of tissue expression profiles and specificity
Fabien Campagne, Lucy Skrabanek, Harel Weinstein, Institute for Computational Biomedicine, Department of Physiology and Biophysics; Mount Sinai School of Medicine;
Fabien.Campagne@physbio.mssm.edu
Short Abstract:

We recently developed TissueInfo: an automated, high-throughput method to identify the tissue expression profile and the specificity of a query sequence. We will briefly introduce applications of this new method to custom microarray production, gene discovery, genome analyses, signaling pathway modeling and tissue information ab initio prediction.

One Page Abstract:

Organisms such as mammals do not express every single gene encoded by their genome in each of their cells. Rather, the various cell types of the organism express particular subsets of the genes in the genome. Cell types are further organized into tissues, and tissues constitute the organs that carry out various physiological functions. The detailed mechanisms of gene products underlying the functioning of this complex organization are today largely unknown. Several methods, including SAGE [1] and microarray technology [2], can be applied to the study of differential gene expression in the various cell types and in different tissues. We recently developed TissueInfo, a high-throughput method to identify the tissue expression profile of the genes in an organism's genome, as well as the tissue specificity of a query sequence [3]. The method carefully organizes the data publicly available in dbEST [4] and is purely computational. With 80% coverage of the benchmark considered, TissueInfo achieves an accuracy of 76% when the tissue specificity of a gene is predicted and 89% when its expression in a given tissue is predicted. These results make possible the application of TissueInfo to the complete sequences available in the public draft of the human genome. Our poster will present some novel features of the tissue information obtained when profiling about 10,000 human genes for their expression in, and specificity to, 104 human tissues. This will illustrate the application of TissueInfo to genome-wide statistical analysis of gene expression in tissues. In addition, we will describe other potential applications of TissueInfo, such as in the production of tissue-specific microarrays, where TissueInfo can greatly speed up and simplify the selection of clones expressed in a given tissue.
Another important area of application of TissueInfo relates to gene discovery pipelines, where this method can be integrated to provide the ability to calculate tissue expression profiles and specificity for candidate genes. As shown in our recent identification of the Sac sensory receptor gene candidate [5], prediction of restricted tissue expression, or other specific expression profiles, can be pivotal in the identification of a gene candidate. A third illustrative application consists of the assembly of training sets of genes grouped according to their expression profile for the ab initio prediction of tissue information. More information about the method will be available from our web site: http://icb.mssm.edu.

1. Velculescu, V.E., et al., Serial analysis of gene expression. Science, 1995. 270(5235): p. 484-7.

2. Shoemaker, D.D., et al., Experimental annotation of the human genome using microarray technology. Nature, 2001. 409(6822): p. 922-7.

3. Skrabanek, L. and F. Campagne, TissueInfo: high-throughput identification of tissue expression profiles and specificity. submitted, 2001.

4. Boguski, M.S., T.M. Lowe, and C.M. Tolstoshev, dbEST--database for "expressed sequence tags". Nat Genet, 1993. 4(4): p. 332-3.

5. Max, M., et al., Tas1r3, encoding a new candidate taste receptor, is allelic to the sweet responsiveness locus Sac. Nat Genet, 2001. 28: p. 58-63.


151. Identifying regulatory networks by combinatorial analysis of promoter elements
Yitzhak Pilpel, Priya Sudarsanam, George M. Church, Department of Genetics, Harvard Medical School;
tpilpel@genetics.med.harvard.edu
Short Abstract:

We have developed a new computational method that analyzes microarray data to discover synergistic regulatory motif combinations and applied it to the analysis of promoters of S. cerevisiae. Such interactions are organized into highly connected graphs, suggesting that a small number of regulators may be responsible for multiple expression patterns.

One Page Abstract:

The recent availability of microarray data has led to the development of several computational approaches for studying genome-wide transcriptional regulation. These approaches have been very successful in deriving known and new regulatory motifs from the promoters of co-expressed genes. However, few studies have so far addressed the combinatorial nature of transcription, a well-established phenomenon in eukaryotes. We have developed a new computational method that analyzes microarray data to discover synergistic regulatory motif combinations and applied it to the analysis of promoters of S. cerevisiae. Our method suggests causal relationships between each motif in a combination and the observed expression patterns. In addition to identifying novel motif combinations that affect expression patterns during the cell cycle, sporulation, and various stress response conditions, we have also discovered regulatory cross-talk between several of these processes. We have developed novel visualization tools that allow the analysis of the causal relationships between regulatory motif combinations and expression profiles. In addition, we have generated global motif synergy maps that provide a view of the transcription networks in the cell. The maps are highly connected suggesting that a small number of transcription factors are responsible for a complex set of expression patterns in diverse conditions. This approach should be important for modeling transcriptional regulatory networks in more complex eukaryotes.


152. The use of discretization in the analysis of cDNA microarray expression profiles for the identification of tissue-specific genes
Janick Mathys, Kathleen Marchal, Patrick Glenisson, Geert Fannes, Peter Antal, Yves Moreau, Bart De Moor, Department of Electrical Engineering, K. U. Leuven;
Paul Van Hummelen, VIB MicroArray Facility, MAF;
jmathys@esat.kuleuven.ac.be
Short Abstract:

A simple procedure was developed to analyze gene expression profiles from cDNA microarrays for the identification of tissue-specific genes. This procedure consists of the discretization of both background-corrected red intensities and ratios followed by Euclidean distance-based clustering.

One Page Abstract:

To assess tissue-specific gene expression, a standard concept in data mining was used: discretization. Discretization means that thresholds are determined or chosen. Based on these thresholds, decisions are made about the expression of a gene (ON or OFF, over- or under-expression). To obtain gene expression profiles from various mouse tissues, cDNA was prepared from brain, kidney, heart, liver, lung, skeletal muscle, spleen and testis and hybridized on mouse cDNA microarrays. The microarrays contained 9216 spots coming from 4600 randomly chosen mouse genes printed in duplicate. Twelve slides were hybridized with each of the tissues labeled in red against spleen (reference) labeled in green. Following image analysis, genes were labeled ON or OFF according to a predetermined intensity threshold. The threshold was set at the local background intensity of the spot plus two standard deviations of the mean spot intensity. If the intensity of a gene was below this threshold, the gene was considered OFF and got the label 0. Otherwise, the gene was considered ON and got the label 1 if one of the duplicate spots was above threshold and the label 2 if both duplicate spots were above threshold. It was found that the threshold settings were not optimal because the sensitivities of the green and the red channel were different. Methods to adjust the thresholds for each dye specifically are now being developed and compared. The sum of the ON/OFF labels over the various tissues was used to divide the genes into the following groups: Group A: constitutively expressed genes (602 genes), Group B: tissue-specific genes and Group C: non-expressed genes.

Group A: The group of genes that were ON in each tissue (each gene had label 2 in all 12 experiments and thus sum=24) could be further separated into potential housekeeping genes (A1) and tissue-specific genes (A2). For this purpose, the ratios were discretized: if the ratio of a gene was > 2 (2-fold overexpression) or < 0.5 (2-fold underexpression) the gene received label 1 or 2 respectively, and for ratios between 0.5 and 2 the gene was given label 0. As in the previous discretization, the sum of the labels over the various tissues was calculated and used to separate the genes. The group of genes with sum=0 could be considered as potential housekeeping genes (84 genes). The remaining genes were clustered by calculating the Euclidean distance for each gene pair. These clusters consist of genes that were differentially expressed in one or more tissues (A2).

Group B: This group was further divided by the same clustering method as was described for group A2. Except for heart, for each tissue a set of genes was found that were uniquely expressed in that specific tissue.

Group C: For a large set of genes (629 genes) no fluorescent signals above threshold were found in any of the tissues. These genes were not expressed in any of these tissues or were below the detection limit of the assay.

In a final analysis, the tissues themselves were subjected to a hierarchical clustering algorithm based on the ratios of the genes that were ON in each tissue. The results of this clustering matched remarkably well with the results obtained for the tissue-specific genes; for instance, heart and skeletal muscle seem to share the most specific genes of group A2. Our results are being further confirmed by information on function and tissue-specificity obtained from Unigene, GO and Pubmed.
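
The two discretization steps and the grouping rule described in this abstract can be sketched as follows. The thresholds and label values follow the text; the function names and the input layout (one value per duplicate spot, one label or ratio per slide) are assumptions made for illustration.

```python
def spot_threshold(local_background, spot_sd):
    """ON/OFF threshold: local background plus two standard deviations."""
    return local_background + 2.0 * spot_sd


def intensity_label(duplicate_intensities, thresholds):
    """0 = OFF, 1 = one duplicate spot at or above threshold, 2 = both."""
    return sum(1 for value, t in zip(duplicate_intensities, thresholds)
               if value >= t)


def ratio_label(ratio):
    """1 = at least 2-fold over-expressed, 2 = 2-fold under-expressed, 0 = neither."""
    if ratio > 2.0:
        return 1
    if ratio < 0.5:
        return 2
    return 0


def expression_group(labels_per_slide):
    """Group A: ON everywhere (e.g. sum 24 over 12 slides); C: never ON; B: the rest."""
    total = sum(labels_per_slide)
    if total == 2 * len(labels_per_slide):
        return "A"
    if total == 0:
        return "C"
    return "B"


def is_potential_housekeeping(ratios):
    """A group A gene whose ratio labels sum to 0 across all slides."""
    return sum(ratio_label(r) for r in ratios) == 0
```

Genes in group A that fail the housekeeping test, and genes in group B, are the ones that would be passed on to the distance-based clustering.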


153. Quantitative analysis of a bacterial gene expression by using the gusA reporter system in a non-steady state continuous culture
Kathleen Marchal, Centre of Microbial and Plant Genetics, K.U. Leuven/ SISTA Department of Electrical Engineering, K. U. Leuven;
Jun Sun, Centre of Microbial and Plant Genetics, K.U. Leuven;
Ilse Smets, Kristel Bernaerts, Jan Van Impe, BioTeC-Bioprocess technology and Control, K.U. Leuven;
Bart de Moor, SISTA Department of Electrical Engineering, K.U. Leuven;
Jos Vanderleyden, Centre of Microbial and Plant Genetics, K.U. Leuven;
kathleen.marchal@esat.kuleuven.ac.be
Short Abstract:

A general dynamic model (forward model) was used to study the “mere influence” of O2 on the expression of a bacterial fusion protein (A. brasilense cytN-gusA fusion). The experimental setup consisted of a non-steady state continuous culture in which the O2 concentration was regularly perturbed.

One Page Abstract:

In this study a dynamic model was developed to describe the “mere influence” of O2 on the expression of the A. brasilense cytN gene, which encodes an important respiratory enzyme of this bacterium, a cytcbb3 terminal oxidase (Marchal et al., 1998). The experimental setup consisted of the combined use of a non-steady state continuous culture and a translational gene fusion (cytN-gusA). The use of a continuous culture allows accurate monitoring of slight changes in input parameters (O2, input C-source, ...). Moreover, the input parameters (in this case O2) can be systematically perturbed to study the effect on the output parameters (fusion protein synthesis measured as β-glucuronidase activity, cell density, output C-source concentration). The combined use of structural dynamic modeling and an appropriate experimental setup (training and validation experiments) allowed us to construct a structural forward model (based on differential equations describing cell growth, substrate consumption and fusion protein synthesis) that describes the dynamic behavior of the system under varying input signals. Simulation results showed that, under the conditions tested, cytN gene expression was not subject to catabolic repression. The hybrid fusion protein seemingly behaves as a very stable protein in A. brasilense and, consistent with previous results, O2 is the major signal regulating the cytN promoter. In principle this approach can be generalized to assess the effect of any controllable external signal on bacterial gene expression in a non-steady state continuous culture. The method outlined here has several advantages over the commonly used steady state measurements (fewer plasmid-stability problems, less time-consuming, more quantitative, etc.). Similarly constructed forward models can be used to predict the response of a recombinant promoter-gene construct to the alteration of an external signal (applications in metabolic engineering and process control in fermentation technology).
Moreover, this study clearly highlights the complexity of using differential equations for forward modeling of genetic networks. In the work presented here, constructing a general model of the expression of only one gene regulated by 2 external parameters required the introduction of 14 parameters (after sensitivity analysis, the model could be reduced to a specific model containing 6 major parameters), the collection of 3 experimental datasets and extensive computational analysis.
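
To make the idea of a forward model based on differential equations for cell growth, substrate consumption and fusion protein synthesis more concrete, a toy version can be sketched as below. It is not the 14-parameter model of this study: Monod growth kinetics, the parameter values, the function names and the user-supplied `synthesis_rate(o2)` function are all assumptions made purely for illustration, and the fusion protein is treated as fully stable so that it is lost only through wash-out.

```python
def simulate_chemostat(o2_of_t, synthesis_rate, t_end, dt=0.01,
                       dilution=0.1, mu_max=0.5, k_s=0.2, yield_xs=0.5,
                       s_in=5.0, x0=0.1, s0=5.0, p0=0.0):
    """Integrate a toy continuous-culture model with the explicit Euler method.

    o2_of_t:        function t -> dissolved O2 level (the perturbed input)
    synthesis_rate: function O2 -> fusion protein made per unit biomass per hour
    Returns the final (biomass X, substrate S, fusion protein P).
    """
    x, s, p = x0, s0, p0
    for step in range(round(t_end / dt)):
        t = step * dt
        mu = mu_max * s / (k_s + s)                    # assumed Monod growth rate
        dx = (mu - dilution) * x                       # growth minus wash-out
        ds = dilution * (s_in - s) - mu * x / yield_xs  # feed minus consumption
        dp = synthesis_rate(o2_of_t(t)) * x - dilution * p  # synthesis minus wash-out
        x, s, p = x + dt * dx, s + dt * ds, p + dt * dp
    return x, s, p
```

Passing a step or pulse function as `o2_of_t` mimics the regular O2 perturbations of a non-steady state culture; with a constant input the sketch simply relaxes to the steady state of the toy equations.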

Marchal et al. 1998. J. Bacteriol. 180: 5689-5696


154. Analysis of 5035 high-quality EST clones from mature potato tuber
Jeppe Emmersen, Meg Crookshanks, Kåre Lehmann, Karen G. Welinder, Aalborg University;
je@bio.auc.dk
Short Abstract:

The biosynthetic potential of mature tuber from potato (var. Kuras) was elucidated. 5035 ESTs were sequenced with an average length of 592 bp. The tuber ESTs displayed a significantly higher expression level of genes involved in Protein Destination and Protein Synthesis functions than potato EST libraries of leaf, stolon and shoot.

One Page Abstract:

The biosynthetic potential of the economically important potato plant was elucidated. 5035 EST sequences of high quality from mature tuber were generated and analyzed. The average trimmed read length of the library was 592 bp, which is considerably higher than for other EST libraries.

The DNATools analysis software package, developed by S. W. Rasmussen, Carlsberg Research Center, was used to store sequences, analyze BLAST results, build EST submission files for dbEST, analyze redundancy, and edit sequences. DNATools was also used to build searchable flatfile databases containing sequences and BLAST results. This software was chosen because of its high functionality and low price. DNATools, however, only permits searching database hits using E-values and simple keywords. To enhance the search capabilities of DNATools, a number of Perl scripts were written to enable more advanced search parameters, such as listing all sequences with more than 95% identity in a BLAST hit.
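
The kind of identity-based search mentioned above can be illustrated with a short sketch. It is written in Python rather than Perl and is not one of the scripts used in this work. It assumes the BLAST hits are available as tab-separated tabular output (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score); how DNATools itself stores hits is not covered here, and the function name is hypothetical.

```python
def filter_hits(lines, min_identity=95.0, max_evalue=None):
    """Return (query, subject, identity) for hits above an identity cut-off.

    lines: iterable of tab-separated BLAST tabular records (assumed format).
    max_evalue: optional additional E-value cut-off.
    """
    selected = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])   # column 3: percent identity
        evalue = float(fields[10])    # column 11: E-value
        if identity <= min_identity:
            continue
        if max_evalue is not None and evalue > max_evalue:
            continue
        selected.append((query, subject, identity))
    return selected
```

Calling `filter_hits(open("hits.tsv"))` on a hypothetical file `hits.tsv` would list every hit with more than 95% identity.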

Expression in mature potato tuber was compared with EST libraries from potato stolon, leaf and shoots (R.S. van der Hoeven et al., GenBank), giving a total of 35000 potato EST sequences. The sequences were divided into the functional categories suggested by MIPS by searching the annotated Arabidopsis genes with sequences from each EST library using BLASTX. This analysis showed that tuber has a significantly higher expression of genes involved in Protein Destination and in Protein Synthesis compared with the other potato tissues. The limitation of using Arabidopsis thaliana as a model was evident, as 25% to 34% of sequences in the potato libraries had no match to Arabidopsis (E-value 1E-5 or lower).

Potato EST sequences were also compared to tomato EST sequences, as both plants belong to the nightshade family. EST libraries from tomato seed, flower, root, shoot, cotyledon, leaf and fruit (GenBank) were assembled into one BLAST database, with a total of 82355 tomato EST sequences. Each potato library was then compared to the tomato sequences by BLASTN. As expected, sequence identities of orthologous genes of tomato and potato are very high (>90%).


155. Using highly redundant oligonucleotide arrays to validate ESTs: Development and use of a human Affymetrix MuscleChip.
Rehannah H. A. Borup, Yi-Wen Chen, Marina Bakay, Children's National Medical Center, Washington DC, USA;
Stefano Toppo, Giorgio Valle, Gerolamo Lanfranchi, University of Padova;
Eric P Hoffman, Children's National Medical Center, Washington DC, USA;
RBorup@childrens-research.org
Short Abstract:

We present the design, production, and use of a highly redundant oligonucleotide array on the Affymetrix microarray platform (32 oligonucleotides per gene studied). We compare transcript abundance estimated from absolute intensity analyses of human muscle biopsy RNA (expression profiling) with that estimated from EST cluster member number in cDNA library sequencing.

One Page Abstract:

The confidence that any particular EST cluster represents a true gene depends primarily on the recurrent identification of the sequence from multiple cDNA sources. However, a significant proportion of ESTs remain as “singletons”, with only one sequence representing that cluster in dbEST. Such singletons and low-redundancy clusters must be verified to give confidence in their existence. One promising method centers on expression profiling: the identification of the EST as an expressed sequence in microarray profiling may add considerable confidence to the existence of the singleton EST. To test this hypothesis, we have designed a highly redundant custom Affymetrix MuscleChip based largely upon an EST sequencing project of a non-normalized human muscle cDNA library by the University of Padova. Our MuscleChip contains 4,601 probe sets (average probe set = 16 perfect match and 16 mismatch oligonucleotides), with a total of ~150,000 25mer oligonucleotides. These probe sets represent 2,075 ESTs from the Padova database and 1,100 genes downloaded from previous Affymetrix stock GeneChips. 571 of our sequences are represented by two or more distinct probe sets, leading to a very high degree of redundancy within the MuscleChip. Redundancy is particularly important in muscle, where there are many very closely related genes with specific functions (e.g. >10 myosin heavy chain isoform genes differing by only a few bases).

We present data obtained with this MuscleChip on a series of human muscle biopsies, including normal muscle and Duchenne muscular dystrophy muscle. We compare transcript abundance by EST sequence hits (cluster member number) with normalized absolute intensities from expression profiling. All sequences on the MuscleChip were compared by BLAST against the most recent versions of the sequence databases to update all gene assignments. Approximately 57% of the ESTs not defined as of 1998 could now be assigned a gene name/function. This left approximately 325 sequences represented as true undefined ESTs on our MuscleChip. Of these, 55% are called present on the MuscleChip. We found considerable variation in EST cluster member number when compared with absolute intensities from expression profiling of muscle biopsies, although there was an overall trend of larger EST clusters correlating with greater absolute intensities. We hypothesize that genes showing concordant EST cluster number and absolute intensity can be assigned an RNA level in the tissue with high confidence. Finally, we present data on the EST sequences, showing that a substantial subset is indeed verified by expression profiling. The differentially regulated ESTs we have found in muscular dystrophy and other conditions are now prioritized for full-length sequence determination and protein characterization.


156. Expression Profiler
Jaak Vilo, Alvis Brazma, European Bioinformatics Institute;
vilo@ebi.ac.uk
Short Abstract:

Expression Profiler (ep.ebi.ac.uk) is a WWW-based environment for analysis of microarray and other genomics data. Different components allow users to explore, cluster, analyze, and visualize gene expression, protein-protein interaction, regulatory sequence, sequence motif, and functional annotation data, as well as link the analysis results to other WWW-based tools and databases.

One Page Abstract:

Expression Profiler (ep.ebi.ac.uk) is a set of WWW-based software tools for analysis and mining of microarray and other genomics data. Different components of Expression Profiler allow users to explore, cluster and visualize the gene expression data; link the results of clustering to other web-based tools; compare experimental protein-protein interaction lists with gene expression data; browse Gene Ontology annotations, extract genes in each category and explore their expression profiles; perform the extraction of putative promoter sequences for the genes in the clusters; perform pattern discovery on the sets of extracted sequences; and visualize the patterns (motifs) on these sequences. The main components of Expression Profiler are:

* EPCLUST, "Expression Profile data CLUSTering and analysis", is a collection of clustering and visualization methods for analysis of expression data. EPCLUST contains implementations of standard hierarchical and K-means clustering algorithms for many distance (similarity) measures, as well as data transformation and normalization methods. EPCLUST also implements various similarity searches based on expression data for individual genes or gene clusters.

* URLMAP is a general, configurable tool for mapping HTML form contents (e.g. cluster contents from EPCLUST) to other on-line analysis tools and databases (HTML forms). URLMAP makes it possible, for example, to link clusters of genes to various databases (SwissProt, SGD, YPD, etc.) and tools (KEGG metabolic pathway database query tool, RSA-tools, EPCLUST, GENOMES, etc.).

* GENOMES is a tool for retrieval of information about sets of genes, linking genes to other databases, and extraction of genomic sequences relative to the gene start and end positions.

* SPEXS, "Sequence Pattern Exhaustive Search", is a pattern discovery tool based on rapid exhaustive enumeration of all patterns occurring in sets of sequences, and reporting the most frequent, or most significant ones. SPEXS can be used for example for de novo prediction of potential transcription factor binding motifs on DNA, or motif discovery from protein sequences.

* PATMATCH is a tool for visualization and exploration of patterns and motifs on DNA and protein sequences. PATMATCH is integrated with EPCLUST and GENOMES, allowing, for instance, visualization of transcription factor binding motifs on the regulatory sequences of genes combined with the respective expression profile clustering.

* EP:PPI (Protein-Protein Interaction analysis tool) integrates experimental or predicted protein-protein interaction data with gene expression analysis in EPCLUST.

* EP:GO is a browser for the controlled vocabularies produced by the Gene Ontology annotation project (www.geneontology.org). It allows the extraction of genes associated with each Gene Ontology category and the subsequent analysis of gene expression, regulatory sequence, and protein-protein interaction data for these genes.
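
The idea behind exhaustive pattern enumeration, as used for motif discovery in the component list above, can be illustrated with a deliberately simplified sketch. It is not the SPEXS algorithm: it only enumerates exact substrings of one fixed length and counts how many sequences contain each of them, and the function name and parameters are hypothetical.

```python
from collections import Counter


def frequent_patterns(sequences, k, min_support):
    """List all length-k substrings present in at least min_support sequences.

    Returns (pattern, number_of_sequences) pairs, most frequent first,
    with ties broken alphabetically.
    """
    support = Counter()
    for seq in sequences:
        # A set ensures each pattern is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    hits = [(pattern, n) for pattern, n in support.items() if n >= min_support]
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

Applied to the upstream sequences of the genes in one expression cluster, such a count gives a first list of candidate motifs; assessing their significance against a background set is a separate step not shown here.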

The WWW-based architecture makes it possible to perform all the described analyses independently of the user's hardware platform or operating system, without the need to install numerous software tools on each computer. If needed, the system can be run behind a firewall within a company intranet, which also provides the required data security. We are currently integrating the ArrayExpress database and the Expression Profiler analysis tools into a single system. This will open new opportunities for integrating many different public and private data sources for analysis and mining.


157. Analysis of the transcriptional apparatus in the holoparasitic flowering plant genus Cuscuta
Sabine Berg, Tom A.W. van der Kooij, Kirsten Krause, Karin Krupinska, Botanical Institute, Christian-Albrechts-University, Kiel, Germany;
sberg@bot.uni-kiel.de
Short Abstract:

The holoparasitic flowering plant genus Cuscuta includes species with different stages of plastome reduction. One central question in plastome analysis is to understand the transcription machinery. In some species, loss of one of the three plastid RNA polymerases dramatically changed gene expression and promoter structures.

One Page Abstract:

The holoparasitic flowering plant genus Cuscuta comprises fully photosynthetically active species with functional chloroplasts, intermediate forms with restricted photosynthetic capacity, and achlorophyllous species with extremely reduced plastomes and plastid functions (van der Kooij et al. 2000). It might therefore be an ideal model to study the loss of plastid genes and their functions in the evolutionary context of parasitic development. One of the central questions in further plastome analysis of several Cuscuta species concerns the transcription machinery. Transcription in plastids is, in general, shared between the plastid-encoded RNA polymerase (PEP), which resembles the E. coli enzyme and consists of four subunits (rpoA, B, C1 and C2), and a nuclear-encoded RNA polymerase (NEP), which is imported into the plastid. Loss of the plastid-encoded enzyme may result in transcription relying only on the nuclear-encoded enzyme. Therefore, we tested for the presence of rpo genes in several Cuscuta species and analysed promoter usage in two different plastid genes. Southern blot analysis and PCR amplification reveal that the rpoA and rpoB genes, coding for subunits of PEP, are present only in the photosynthetically active species Cuscuta reflexa and C. europea. C. gronovii, C. plathyloba, C. subinclusa and C. odorata lack these subunits (Krause et al.). Both the housekeeping gene rrn16, involved in translation, and the photosynthesis gene rbcL, coding for the large subunit of RUBISCO, are present in the species investigated. The coding regions of the genes are highly conserved among the species, whereas the amplified promoter fragments differ in size because of large deletions. Sequence alignments of the promoter regions of the rrn16 and rbcL genes show large deletions in four of the species investigated. The PEP-specific promoter sequences of rrn16 are present in C. reflexa but not in C. gronovii, C. odorata and C. subinclusa. Northern blot analysis reveals transcription of the gene in all species (Krause et al.). PEP-specific promoter sequences of rbcL are found in C. reflexa and C. europea but not in C. gronovii, C. subinclusa and C. plathyloba; northern analysis reveals transcription in all species. Primer extension analysis was used to check whether a different promoter is being used, since PEP as well as promoter-specific structures are missing in four species. The PEP promoter is conserved and similar in sequence and size in the chlorophyll-containing species C. reflexa (rrn16, rbcL) and C. europea (rbcL). However, in C. gronovii, C. odorata and C. subinclusa, rrn16 transcripts are initiated from a different promoter, which strongly resembles a NEP promoter in sequence. Transcription in these plastids has to rely on the imported NEP because the rpo genes of the plastid-encoded enzyme have been lost in some Cuscuta species. Primer extension analysis of the rbcL gene reveals a conserved PEP promoter sequence in C. reflexa and C. europea, whereas transcription in C. gronovii, C. subinclusa and C. plathyloba initiates at NEP promoter sequences.

Krause et al., submitted; van der Kooij et al. 2000, Planta 210: 701-707


158. Inferring Regulatory Pathways in E. coli using Dynamic Bayesian Networks
Irene M. Ong, David Page, University of Wisconsin-Madison;
ong@cs.wisc.edu
Short Abstract:

This work presents the first application (to our knowledge) of Dynamic Bayesian Networks (DBNs) to time-series gene expression microarray data. We introduce an approach to determining transcriptional regulatory pathways by encoding background knowledge about gene expression for the particular organism being modeled into the initial, core structure of the DBN.

One Page Abstract:

In order to fully understand how genomes operate, we need an understanding of how genes "communicate" as a network to organize the construction and daily functions of cells within an organism [DOE 2000]. We are interested in uncovering this genome-wide circuitry that underlies the regulation of cells. In this poster, we introduce an approach to determining transcriptional regulatory pathways by applying Dynamic Bayesian Networks (DBNs) to time-series gene expression data. The data are obtained from DNA microarray hybridization experiments in E. coli.

There has been much work in the area of analyzing gene expression data. The most closely related work, [Friedman, Linial, Nachman and Pe'er 2000], addressed the task of determining properties of the transcriptional program of an organism (Baker's yeast) by using Bayesian Networks (BNs) to analyze gene expression data. However, this method can only represent the correlations between genes at a given time. It does not show how genes regulate each other over time in the complex workings of regulatory pathways. To the best of our knowledge, we are the first group to apply DBNs to time-series microarray data.

DBNs are essentially BNs with a few additional assumptions that allow them to tractably model temporal information. We graphically model transcription in a DBN by building an initial DBN structure that exploits background knowledge from an operon map, a mapping of known and predicted operons to their associated genes. The operon map was obtained from [Craven, Page, Shavlik, Bockhorst and Glasner 2000]. Specifically, a time-slice in our initial DBN model consists of all the operons and genes, where each operon node has arcs that connect it to the sequence of gene nodes that are transcribed together for that particular operon. The gene nodes, our evidence variables, are discretized gene expression levels that indicate an increase, decrease, or no change in expression level from one time-slice to another.
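
The two ingredients of the initial DBN structure described above, operon-to-gene arcs within a time-slice and discretized gene nodes, can be sketched as follows. The function names, the use of a plain dictionary as the operon map and the `tolerance` cut-off that separates "no change" from a real change are assumptions for illustration only; the abstract does not state how the discretization thresholds were chosen.

```python
def initial_arcs(operon_map):
    """Within one time-slice, connect each operon node to its gene nodes.

    operon_map: dict mapping an operon name to the ordered list of its genes.
    Returns a list of (operon, gene) arcs.
    """
    return [(operon, gene)
            for operon, genes in operon_map.items()
            for gene in genes]


def discretize_changes(levels, tolerance=0.5):
    """Turn a time series of expression levels into up/down/same states.

    Each state describes the change from one time-slice to the next.
    """
    states = []
    for previous, current in zip(levels, levels[1:]):
        delta = current - previous
        if delta > tolerance:
            states.append("up")
        elif delta < -tolerance:
            states.append("down")
        else:
            states.append("same")
    return states
```

A series of n measurements yields n-1 evidence values per gene, one for each pair of consecutive time-slices.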

Using this initial DBN structure, our goal is to learn the arcs from operons in one time-slice to those in another. If operon 1 at time t_i has an arc to operon 2 at time t_(i+1), this implies that operons 1 and 2 are in the same regulatory pathway. These arcs, as well as all conditional probabilities, are inferred from time-series microarray data for E. coli using the structural EM algorithm [Friedman 1998].

The results of our experiments were mixed; however, the experiments did provide evidence that DBN learning is capable of identifying operons in E. coli that are in a common regulatory pathway.


159. ConSite: Identification of transcription factor binding sites conserved between orthologous gene sequences
Albin Sandelin, Boris Lenhard, Luis Mendoza, Wyeth Wasserman, CGR, Karolinska Institutet;
albin.sandelin@cgr.ki.se
Short Abstract:

Understanding gene regulation is a post-genome research challenge. Methods for transcription factor binding site detection are insufficiently specific.

Based upon the hypothesis that regulatory sequences in non-coding DNA are preferentially conserved, we have constructed ConSite, a tool to align two orthologous genomic sequences and identify conserved binding sites.

One Page Abstract:

Understanding the mechanisms by which gene expression is regulated is one of the principal challenges in post-sequence human genome research. Current motif-based methods for computational detection of individual transcription factor binding sites are as yet insufficiently specific to warrant experimental investigation. Regulatory elements are short, and transcription factors generally tolerate considerable variation between the experimentally defined binding sites. As a consequence, similar elements will be found purely by chance at a high frequency in human genomic sequence. In short, the rate of false-positive predictions is prohibitively high for most purposes.

In order to accurately predict functional transcription factor binding sites, we must develop approaches beyond isolated profiles representing clusters of known sites. This may be achieved by a subset of approaches, including: (i) considering combinatorial site-clusters, (ii) addressing the poorly understood subject of chromatin superstructure, or (iii) by using computational approaches unrelated to biological mechanisms of gene regulation.

With the increasing availability of genomic sequences from diverse species, it is possible to extract regulatory information via genomic sequence comparisons, a process termed "phylogenetic footprinting". Based upon the hypothesis that regulatory sequences in non-coding DNA are more likely to be conserved than sequences without sequence-specific function, we have constructed ConSite, a tool to align two orthologous genomic sequences and identify the binding sites which are conserved between the pair.

Constructed primarily as a tool for experimental researchers, ConSite gives the user the option to consider exon-intron relationships in concordance with the binding sites detected. To further narrow the scope of the analysis, users have the option to scan with subsets of the transcription factor profiles based on species or protein structure classes. Three output formats are provided, including graphical and text sequence alignments, as well as a tabular report.
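
The phylogenetic footprinting idea described in this abstract, keeping only predicted sites that occur at corresponding positions in both orthologous sequences, can be sketched as below. This is a minimal illustration and not ConSite's implementation: the pairwise alignment is assumed to be already computed, the scoring matrix is a toy one built from a consensus string, windows containing gaps are simply skipped, and all function names are hypothetical.

```python
def consensus_pwm(consensus, match=1.0, mismatch=-1.0):
    """Toy position weight matrix: one score dictionary per motif position."""
    return [{base: (match if base == c else mismatch) for base in "ACGT"}
            for c in consensus]


def score(site, pwm):
    """Sum the per-position scores of a candidate site (A, C, G, T only)."""
    return sum(column[base] for base, column in zip(site, pwm))


def conserved_sites(aligned_a, aligned_b, pwm, threshold):
    """Report windows where both aligned sequences score above threshold.

    aligned_a, aligned_b: equal-length rows of a pairwise alignment ('-' = gap).
    Returns (alignment_column, site_in_a, site_in_b) tuples.
    """
    width = len(pwm)
    hits = []
    for start in range(len(aligned_a) - width + 1):
        window_a = aligned_a[start:start + width]
        window_b = aligned_b[start:start + width]
        if "-" in window_a or "-" in window_b:
            continue  # simplification: ignore windows interrupted by gaps
        if score(window_a, pwm) >= threshold and score(window_b, pwm) >= threshold:
            hits.append((start, window_a, window_b))
    return hits
```

A site that scores well in only one of the two sequences is discarded, which is how the comparison is expected to reduce false-positive predictions.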


160. FSCAN - An open source program for analysis of two-color fluorescence-labeled cDNA microarrays
Peter J. Munson, Ph.D., L. Young, Vinay V. Prabhu, Mathematical and Statistical Computing Lab, CIT, NIH;
munson@nih.gov
Short Abstract:

FSCAN is a free, open-source program for analysis of images generated from cDNA microarrays, available at http://abs.cit.nih.gov/fscan. Developed under the MATLAB system, the program runs on Windows, MacOS and Unix platforms. It provides interactive statistical and graphical analysis features, links to external databases and exportable text file output.

One Page Abstract:

FSCAN is a free, open-source program for analysis of images generated from cDNA microarrays, available at http://abs.cit.nih.gov/fscan. Developed under the MATLAB system, the program runs on Windows, MacOS and Unix platforms. It provides interactive statistical and graphical analysis features, links to external databases and exportable text file output. The program reads Axon, Molecular Dynamics, .gel and .tiff images. It provides image segmentation, grid-overlay, spot detection and spot quantification algorithms. Because the program is open source, the user may modify the provided algorithms if desired. Several statistics are measured for each spot, including signal and background levels, standard deviations and spot size, for each of the two channels. After analyzing an image, the user is presented with a dynamic analysis workbench in which spots can be selected for closer examination and image artifacts can be observed, while the same spot is shown in a scatterplot view. Clicking on any spot in the "array-view" automatically selects it in the "scatterplot-view" and simultaneously identifies the associated clone and gene information. Links to external databases are provided to facilitate browsing the web for additional information. The program has been used extensively to analyze gene chips produced by the NCI containing up to 6500 genes, and can easily be adapted to virtually any commercial chip configuration. A sister program (PSCAN) is available for analysis of P33 labeled, nylon-based arrays.


161. A method for designing PCR primers for amplifying cDNA array clones
Henrik Bjørn Nielsen, Steen Knudsen, Center for Biological Sequence Analysis (CBS), Technical University of Denmark;
hbjorn@cbs.dtu.dk
Short Abstract:

PROBEWIZ designs PCR primers for cDNA arrays with minimal homology to other expressed sequences from a given organism. The primers can be restricted on Tm, product length and primer size. The primer selection is based on user-defined penalties for homology, primer quality, and positioning toward the 3' end. Find PROBEWIZ at www.cbs.dtu.dk/services/DNAarray/probewiz.html

One Page Abstract:

When designing targets for cDNA arrays it is important that the target will anneal only to the desired probe sequence, especially in expression studies where the abundance of probes varies greatly. Manual design of targets, looking for regions with homology, is tedious and time-consuming. A fast solution is presented to the problem of designing large numbers of PCR primers that amplify sequences with minimal homology to other sequences in a database of ESTs. The program PROBEWIZ designs probes/targets for cDNA arrays, Northern blots or Southern blots by first searching for areas with homology higher than 50% to other sequences in a database. Using this information, PROBEWIZ then finds a number of potential PCR primer pairs that all meet a set of user-defined criteria (Tm of primer, length of product, primer size). These are then evaluated and sorted according to three parameters, each assigned an importance weight by the user: 1) the homology to other genes in the database, 2) the 3' proximity of the probe, and 3) the primer quality. PROBEWIZ is accessible from www.cbs.dtu.dk/services/DNAarray/probewiz.html
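
The weighted ranking of candidate primer pairs described above can be sketched as a simple penalty sum. This is a hypothetical illustration and not PROBEWIZ's actual scoring function: the `Candidate` fields, the normalization of the 3' distance by sequence length and the default weights are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    homology: float        # fraction of the product homologous to other ESTs (0..1)
    distance_3prime: int   # bases between the product and the 3' end
    primer_penalty: float  # lower means better primer quality


def total_penalty(cand, seq_length, w_homology, w_position, w_quality):
    """Weighted sum of the three penalties; lower is better."""
    return (w_homology * cand.homology
            + w_position * cand.distance_3prime / seq_length
            + w_quality * cand.primer_penalty)


def rank_candidates(candidates, seq_length,
                    w_homology=1.0, w_position=1.0, w_quality=1.0):
    """Sort candidate primer pairs from lowest to highest total penalty."""
    return sorted(candidates,
                  key=lambda c: total_penalty(c, seq_length,
                                              w_homology, w_position, w_quality))
```

Changing the weights changes the ranking: with a high `w_position`, a product close to the 3' end can outrank one with lower homology to other sequences.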


162. Statistical modelling of variation in microarray replicates
S. Soneji, Birkbeck College, London;
S. Kendall, London School of Hygiene and Tropical Medicine, London;
J. Mangan, K. Lang, J. Hinds, P. Butcher, St Georges Hospital Medical School, London;
N. Stoker, London School of Hygiene and Tropical Medicine, London;
L. Wernisch, Birkbeck College, London;
s.soneji@mail.cryst.bbk.ac.uk
Short Abstract:

We have used microarray analysis to identify differentially expressed genes in a Mycobacterium tuberculosis mutant. Several replicates have been produced and analysed by analysis of variance models in order to obtain stronger signals and to estimate the amount of variation due to different sources of experimental error.

One Page Abstract:

Microarray analysis is a powerful technique for identifying genes that are differentially expressed in mutant organisms compared with the wild type. In the Mycobacterium tuberculosis H37Rv mutant Tame12, the tcrS gene, which codes for the sensor of a two-component regulatory system, has been knocked out. The purpose of the experiment was to identify genes under direct or indirect control of this regulatory system. To support our analyses, several replicates of the same hybridization experiment were carried out. The purpose of the replicates was twofold. Firstly, with replicates, signals of over- or underexpression can be extracted more reliably from noisy data. Secondly, an analysis of variance can be applied to reveal the amount of variation due to the various sources of experimental error. Identification of these sources might help to reduce noise in future experiments. We prepared three different RNA samples of both mutant and wild type cultures. Each sample was hybridized to two glass-slide microarrays containing PCR products from all 3924 genes of M. tuberculosis. Thus, in total we produced 6 hybridization replicates. We also repeated the scanning and spot quantification processes. We fitted linear models taking various combinations of these factors into account and compared their explanatory power. A common feature of all fitted models was that the gene-by-strain interaction was one of the weakest effects compared with the variation due to the other factors. This result underlines the importance of replicates for this type of experiment. Bootstrap resampling of the residuals of the best models was used to obtain p-values for the significance of expression levels of differentially expressed genes. When searching for conspicuous expression levels among thousands of genes, a proper adjustment of p-values is mandatory. We found about 10 genes with significantly different expression levels. Reverse transcriptase-PCR is being used to confirm these results.
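
The abstract does not say which p-value adjustment was applied. As one common possibility, the Holm step-down procedure is sketched below; the choice of method and the function name are assumptions for illustration, not a description of the analysis actually performed.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment, returned in the original gene order.

    The smallest p-value is multiplied by m, the next by m-1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted
```

With several thousand genes tested at once, only genes whose adjusted p-value stays below the chosen significance level would be reported as differentially expressed.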


163. The new explore of diffuse large B-cell lymphoma
Junbai Wang, Tumor Biology Dep. Den Norsk Radium Hospital;
Jan Delabie, Lymphoma Research Group, Den Norsk Radium Hospital;
Ola Myklebost, Tumor Biology Department, Den Norsk Radium Hospital;
junbaiw@radium.uio.no
Short Abstract:

A new approach is applied to study diffuse large B-cell lymphoma (DLBCL). The new strategy is a combination of clustering, self-organizing maps and principal component analysis. With this approach, we easily distinguished the DLBCL, CLL and FL types of lymphoid malignancies. Three possible subgroups of DLBCL are proposed.

One Page Abstract:

Current array technologies have made it possible to print tens of thousands of genes on a single slide. This allows the simultaneous analysis of tens of thousands of different genes in a single experiment. The challenge now is to interpret such massive data sets. We propose a two-level approach to simplify the exploration of complex DNA microarray data: the first step is to extract the fundamental patterns of gene expression inherent in the data; the second is a more detailed investigation of particularly interesting groups of genes or samples using a resourceful visualization. This new strategy is a combination of clustering, self-organizing maps and principal component analysis. Most of the analytical calculations and graphical features are provided by MATLAB. To demonstrate the value of such analysis, the approach is applied to diffuse large B-cell lymphoma (DLBCL), with expression patterns of 3906 unique genes for 96 normal and malignant lymphocyte samples. With this approach, we not only easily distinguished the DLBCL, CLL and FL types of lymphoid malignancies and confirmed earlier suggestions that there are two subtypes of DLBCL (GCB and ACT type), but also discovered two possible subgroups within the ACT type of DLBCL. The new finding demonstrates the value of this approach, which can be applied to study other massive DNA microarray data sets.


164. Including protein-protein interaction networks into supervised classification of genes based on gene expression data
Joachim Theilhaber, Christoph Brockel, Michael Heuer, Steven Bushnell, Aventis Pharmaceuticals, Cambridge Genomics Center;
joachim.theilhaber@aventis.com
Short Abstract:

For selecting genes in specific pathways, we have extended our supervised classifier GENNC (Gene Expression Nearest-Neighbor Classifier, previously successful in finding genes in the mouse osteogenic and myoblastic pathways) by including protein-protein interaction information in its distance metric. We report results for both yeast and mammalian systems.

One Page Abstract:

To select genes involved in specific regulatory or metabolic pathways on the basis of microarray expression data, we previously developed GENNC (Gene Expression Nearest-Neighbor Classifier)*. GENNC is a supervised classification scheme based on the k-nearest-neighbor method that classifies genes according to their co-regulation with members of a biological training set. GENNC has been successfully applied to finding genes in the mouse osteogenic and myoblastic pathways. We extend the classifier by including protein-protein interaction information in the distance metric, together with a P-value measure of the statistical significance of the metric. The P-value is essential for filtering out noisy data so as to maintain reasonable classifier sensitivity in the presence of very large data sets. Benchmark results based on yeast microarray data will be presented, as well as more tentative work involving mammalian data. In all cases we emphasize the use of cross-validation error rates for evaluating and optimizing classifier performance, an issue of critical importance when selecting potential drug targets for further, labor-intensive experimental biological validation.

* "Finding Genes in the C2C12 Osteogenic Pathway by k-Nearest-Neighbor Classification of Expression Data", Joachim Theilhaber, Timothy Connolly, Steven Bushnell and Aventis Osteoporosis Team, Pacific Symposium on Biocomputing 2001, Mauna Lani, Hawaii, Jan 3-7, 2001.
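The general idea of k-nearest-neighbor classification of genes, evaluated by a cross-validation error rate, can be sketched as follows. This is not GENNC itself: the correlation distance, the majority vote, the leave-one-out scheme and all function names are assumptions chosen for illustration, and the protein-protein interaction term and P-value filtering described in the abstract are omitted.

```python
import numpy as np

def correlation_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

def knn_predict(profile, train_profiles, train_labels, k=3):
    """Majority vote among the k training genes closest to `profile`."""
    d = np.array([correlation_distance(profile, t) for t in train_profiles])
    nearest = np.argsort(d)[:k]
    votes = train_labels[nearest]
    return int(votes.sum() * 2 > k)      # 1 if more than half vote "in pathway"

def loo_error_rate(profiles, labels, k=3):
    """Leave-one-out cross-validation error rate of the classifier."""
    n = len(labels)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i         # hold out gene i
        pred = knn_predict(profiles[i], profiles[mask], labels[mask], k)
        errors += int(pred != labels[i])
    return errors / n

# Toy data (hypothetical): 5 "pathway" genes with rising profiles (label 1)
# and 5 other genes with falling profiles (label 0) over 8 conditions.
t = np.linspace(0.0, 1.0, 8)
rng = np.random.default_rng(1)
rising = np.array([t + rng.normal(0, 0.05, t.size) for _ in range(5)])
falling = np.array([-t + rng.normal(0, 0.05, t.size) for _ in range(5)])
profiles = np.vstack([rising, falling])
labels = np.array([1] * 5 + [0] * 5)

error = loo_error_rate(profiles, labels, k=3)
print(error)
```

Additional evidence, such as interaction data, would enter through the distance function; only that function would need to change in this sketch.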


165. Comparative Splicing Pattern Analysis between Mouse and Human Exon-skipped Transcripts
Tzu-Ming Chern, Winston Hide, South African National Bioinformatics Institute, University of Western Cape;
tzuming@sanbi.ac.za
Short Abstract:

We have developed a system to unequivocally capture putative exon-skipped transcripts. In our pilot studies, we have observed a small sample that shows a differential splicing pattern between mouse and human exon-skipped transcripts. Current work on differential splicing patterns between mouse and human will be presented.

One Page Abstract:

We have developed a system to unequivocally capture putative exon-skipped transcripts by mapping these transcripts back to their respective genomic sequences. A pilot study of 138 mouse genes and 30 human genes has been used to assess the occurrence of exon-skipping in these organisms. Preliminary analyses suggest that the rate of exon-skipping is higher in human than in mouse. All the exon-skipped genes in human have high EST abundance (>50 ESTs), whereas only 70% of the mouse exon-skipped genes do. Our tissue-level analyses suggest a significant correlation between high EST abundance in tissues and high exon-skipping frequency in both mouse and human exon-skipped genes. We have found tumorous tissues in both mouse and human to have the highest number of exon-skipped ESTs. Our protein analyses suggest that some of the skipped exons in mouse and human encode domains and families that are important for enzymatic and DNA-binding functions. We have also observed a differential splicing pattern in a small sample of mouse and human exon-skipped genes. Current investigations into differential splicing patterns of mouse and human exon-skipped transcripts will be presented.


166. Non-parametric statistics of gene expression data
Yuzhen Ye, Shanghai Institute of Biochemistry and Cell, Chinese Academy of Sciences;
Haixu Tang, Department of Mathematics, University of Southern California;
yeyz@sunm.shcnc.ac.cn
Short Abstract:

Though applications of both classical and recently developed classification algorithms to gene expression data mining have been reported, it is still a statistical challenge to extract useful information from high dimensional gene expression data. Here we introduce non-parametric statistical methods to attack this problem, and their advantages are also discussed.

One Page Abstract:

Non-parametric statistics of gene expression data

Yuzhen Ye, Shanghai Institute of Biochemistry and Cell, Chinese Academy of Sciences, Shanghai 200031, China; e-mail: yeyz@sunm.shcnc.ac.cn

Haixu Tang, Department of Mathematics, University of Southern California, Los Angeles, CA 90089, USA; e-mail: tanghx@hto.usc.edu

Gene expression data sets that differentiate between tumor and healthy tissue samples have recently become available (Alon99, Golub00). Analyses of these data sets focus on clustering genes into subsets that are co-expressed across different conditions. Clustering has turned out to be a successful method for identifying functionally related gene families. A similar method can also be used to classify different tissue types based on their gene expression profiles (Alon99).

Clustering is a so-called "unsupervised" classifier, which does not use the tissue type annotation directly; this information is used only for evaluating the results. In contrast, "supervised" methods try to predict the classification of new tissues, based on knowledge gained from training on examples of tissues that have previously been classified. The problem may be illustrated as follows. Suppose we have m tumor tissue samples and n healthy tissue samples, often referred to as the "training set". We measure the expression level of N genes in each of these samples. How can we identify a subset of these genes, referred to as feature genes, so that we can correctly predict the tissue types of other samples of unknown type, referred to as the "testing set", based on their expression levels? This is a typical classification problem. In fact, several classical and recently developed classification algorithms, such as the nearest neighbour classifier, linear discriminant analysis, classification trees, bagging and boosting, and support vector machines, have been applied to this problem, and comparison results have been reported (Ben Dor00, Dudoit00).

However, two aspects of this type of expression data are worth emphasizing. First, the experiments are rarely replicated, and hence the data contain many experimental errors. Second, the annotation of the tissue types may not coincide with their real properties, because of sampling mistakes in the tissue preparation. Both make the precise values of the expression data less reliable. Nevertheless, there is still information structure hidden inside the data, and we believe non-parametric statistical methods are better suited to extracting it from high dimensional data than the exact statistical methods mentioned above.

The poster will discuss the application of non-parametric statistical methods to the tissue classification problem, specifically the following three topics: (a) detecting outlier tissues in the training set; (b) identifying the genes that are differentially expressed in tumor and healthy tissues; (c) predicting the tissue types in the testing set.

Note that, from a statistical point of view, the first and third problems are relatively similar: the tumor tissues can be considered outliers in the healthy tissue group, and vice versa. This similarity will be addressed in detail.
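The abstract does not name the specific non-parametric methods used. As one common example of a rank-based approach to topic (b), the sketch below scores each gene with the Wilcoxon rank-sum statistic (normal approximation, without tie correction). The function names and toy matrices are assumptions for illustration and are not taken from the poster.

```python
import numpy as np

def rank_sum_z(tumor, healthy):
    """Wilcoxon rank-sum z-score for one gene (assumes no tied values)."""
    m, n = len(tumor), len(healthy)
    pooled = np.concatenate([tumor, healthy])
    ranks = np.empty(m + n)
    ranks[np.argsort(pooled)] = np.arange(1, m + n + 1)   # ranks 1..m+n
    w = ranks[:m].sum()                    # rank sum of the tumor samples
    mean_w = m * (m + n + 1) / 2.0         # expectation under the null
    sd_w = np.sqrt(m * n * (m + n + 1) / 12.0)
    return (w - mean_w) / sd_w

def rank_genes(tumor_matrix, healthy_matrix):
    """Return gene indices ordered by |z| (largest first) and the z-scores."""
    n_genes = tumor_matrix.shape[1]
    z = np.array([rank_sum_z(tumor_matrix[:, g], healthy_matrix[:, g])
                  for g in range(n_genes)])
    return np.argsort(-np.abs(z)), z

# Toy data (hypothetical): rows are samples, columns are genes.
# Gene 0 is perfectly separated between the groups; genes 1 and 2 are not.
tumor_matrix = np.array([[5., 1., 1.], [6., 4., 3.], [7., 5., 6.], [8., 8., 8.]])
healthy_matrix = np.array([[1., 2., 2.], [2., 3., 4.], [3., 6., 5.], [4., 7., 7.]])

order, z = rank_genes(tumor_matrix, healthy_matrix)
print(order[0], z)
```

Because only the ranks of the expression values enter the statistic, a single extreme measurement cannot dominate the score, which is the property that motivates rank-based methods for noisy, unreplicated data.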


167. Transcriptome and proteome analysis of Escherichia coli during high cell density cultivation
Sang Yup Lee, Sung Ho Yoon, Mee-Jung Han, KAIST;
Jong Shin Yoo, Korea Basic Science Institute;
Geunbae Lim, Samsung Advanced Institute of Technology;
leesy@mail.kaist.ac.kr
Short Abstract:

High cell density cultivation of E. coli was carried out under a constant specific growth rate, and the transcriptome and proteome profiles were analyzed using DNA microarray and 2D-gel electrophoresis. The detailed results on the variation of transcriptome and proteome profiles will be presented along with the possible physiological explanations.

One Page Abstract:

The recent completion of Escherichia coli genome sequencing signals the necessity of developing new strategies for answering the basic questions concerning cellular function. High cell density cultivation (HCDC) is an essential biochemical engineering practice for achieving high-level production of various bioproducts. True process optimization by fed-batch culture is often hampered by limited knowledge of physiology and metabolism at high cell density. We have manufactured a DNA microarray containing 2,850 genes, including all functionally known and putative ones. An exponential feeding strategy was adopted for high cell density cultures of E. coli in order to reduce pH variation, by-product formation, and other inhibitory culture conditions. DNA microarrays can be used to compare global changes in gene expression that occur in response to an environmental stimulus or to compare the effects of genetic changes on gene expression. This analysis can provide important information about cell physiology and has the potential to identify connections between regulatory or metabolic pathways that were not previously known. Proteome analysis using two-dimensional gel electrophoresis (2D-gel) in conjunction with MALDI-TOF can also provide valuable information for elucidating the integrated cellular responses of bacterial cells growing in various environments. Two-dimensional gel electrophoresis is a powerful tool for identifying proteins with different expression profiles under qualitatively or quantitatively different culture states. Combined analysis of transcriptome and proteome profiles can therefore supply a large amount of reliable data for understanding microorganisms under various culture conditions. In this study, we report a combined analysis of the transcriptome and proteome of E. coli cells during high cell density cultivation. Fed-batch fermentation of E. coli was carried out until the maximum cell density reached 74 g dry cell weight/L (OD600 of ca. 230), and the transcriptome and proteome were then analyzed using DNA microarray and 2D-gel electrophoresis. We discuss the notable changes in gene expression during HCDC and suggest possible strategies for efficient fermentation based on the transcriptome and proteome analysis. [This work was supported by the Korean Ministry of Commerce, Industry and Energy and by the Korean Ministry of Science and Technology under the NRL program.]


168. Molecular signatures of commonly fatal carcinomas: predicting the anatomic site of tumor origin
Andrew I. Su, The Scripps Research Institute;
John B. Welsh, Lisa M. Sapinoso, Suzanne G. Kern, Petre Dimitrov, Hilmar Lapp, The Genomics Institute of the Novartis Research Foundation;
Peter G. Schultz, The Scripps Research Institute, The Genomics Institute of the Novartis Research Foundation;
Steven M. Powell, Christopher A. Moskaluk, Henry F. Frierson, Jr., University of Virginia Health System;
Garret M. Hampton, The Genomics Institute of the Novartis Research Foundation;
asu@scripps.edu
Short Abstract:

We have constructed a molecular classification scheme based on mRNA profiling for ten groups of commonly fatal carcinomas. We identified sets of genes that are uniquely characteristic of each tumor type, and used these genes to correctly predict the anatomic site of origin for 90% of 176 carcinomas.

One Page Abstract:

Histopathological classification of human tumors is fundamental for the optimal treatment of patients with cancer. Here, we used mRNA profiling and supervised machine learning algorithms to construct a molecular classification scheme for the ten most commonly fatal carcinomas in the United States. We identified gene subsets whose expression is uniquely characteristic of each tumor type, and show that these genes can be used to accurately predict the anatomic site of origin for 90% of 176 carcinomas, including metastatic lesions and cancers whose microscopic features of tissue origin were not readily identifiable. A number of the genes that distinguish one tumor type from another are potential diagnostic and pharmacologic targets. This study demonstrates the existence of gene subsets whose expression is unique to specific carcinomas, and illustrates the feasibility of predicting the tissue origin of a cancer in the context of multiple cancer classes.


169. Tuning Sub-networks Inference by Prior Knowledge on Gene Regulation
Barak Shenhav, Department of molecular genetics, Weizmann institute of science, Rehovot, 76100, Israel;
Dana Teltsh, Dana Pe'er, School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel;
Aviv Regev, Department of Cell Research and Immunology, Life Sciences Faculty, Tel Aviv University, Tel Aviv, 69978, Israel and D;
Gal Elidan, Nir Friedman, School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel;
barak.shenhav@weizmann.ac.il
Short Abstract:

Bayesian networks are used to reconstruct statistically significant gene interactions and to infer gene sub-networks from expression profiles. Here we show that by constraining the learning procedure with additional information on gene regulation, more refined and accurate networks may be inferred, reflecting a wider scope of biological knowledge.

One Page Abstract:

Genome-wide expression profiles obtained using microarrays provide insight into molecular pathways and genetic networks. However, due to the complexity of these systems, the task of reconstructing genetic networks from the currently limited expression data remains a challenge.

Friedman et al [1] suggested modeling genetic interactions using Bayesian networks. According to this model, each gene expression level is represented by a random variable, and interactions between genes (e.g. induction or repression) are treated as probabilistic dependencies. This provides a framework for reconstructing both individual gene interactions (features) as well as inferring entire significant sub-networks [2]. Due to the limited amount of data, many alternative networks may be inferred, resulting in multiple putative features with varying levels of confidence, as estimated by non-parametric bootstrap.

While expression profiling data is scarce, gene regulation has already been extensively studied by other experimental approaches. These culminated in a large body of knowledge, primarily focused on inducers and repressors.

Here, we incorporate this additional information on gene regulation into the Bayesian network learning framework. We constrain our inference to networks which are consistent with prior knowledge of regulation. These constraints can be relaxed and applied in a probabilistic manner, based on our confidence in this information. This allows us to include both proven and predicted interactions as part of our biological knowledge base.

We applied our approach to the Saccharomyces cerevisiae expression profiles in the Rosetta Compendium [3]. We focused on a selected subset of genes, and extracted relevant regulation information from the YPD database [4]. The data was pre-processed and treated similarly to Pe'er et al [2]. We compare our results with those based solely on expression data, show the improvement in the quality and structure of some of the subnetworks, and discuss their importance for revealing more accurate interactions and better structured networks.

[1] N. Friedman, M. Linial, I. Nachman and D. Pe'er. Using Bayesian Networks to Analyze Expression Data. Journal of Computational Biology, 7:601-620, 2000.

[2] D. Pe'er, A. Regev, G. Elidan and N. Friedman. Inferring Subnetworks from Perturbed Expression Profiles. In 9th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2001.

[3] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard and S. H. Friend. Functional discovery via a compendium of expression profiles. Cell, 102(1):109-126, 2000.

[4] Proteome Yeast Protein Database. http://www.proteome.com/databases/YPD/
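One generic way to apply prior knowledge of regulation "in a probabilistic manner" is to add a log prior over network structures to the data score. The sketch below assumes independent per-edge beliefs; the abstract does not give the authors' exact formulation, and the function names, gene names and probabilities are hypothetical.

```python
import math

def log_structure_prior(edges, known_regulation):
    """
    Log prior of a candidate network under independent per-edge beliefs.

    edges            : set of (regulator, target) pairs in the candidate network
    known_regulation : dict mapping (regulator, target) to the prior
                       probability (strictly between 0 and 1) that the edge exists
    Pairs without prior knowledge contribute nothing (uniform prior).
    """
    log_p = 0.0
    for pair, p_edge in known_regulation.items():
        log_p += math.log(p_edge if pair in edges else 1.0 - p_edge)
    return log_p

def posterior_score(log_likelihood, edges, known_regulation):
    """Data score plus structure prior (both on the log scale)."""
    return log_likelihood + log_structure_prior(edges, known_regulation)

# Hypothetical example: a well-established interaction (confidence 0.9)
# and a predicted one (confidence 0.6).
known = {("geneA", "geneB"): 0.9, ("geneC", "geneB"): 0.6}
consistent = {("geneA", "geneB"), ("geneC", "geneB")}
inconsistent = {("geneC", "geneB")}

# With equal data likelihood, the network that agrees with the
# well-established interaction receives the higher score.
print(posterior_score(-100.0, consistent, known))
print(posterior_score(-100.0, inconsistent, known))
```

Setting a probability close to 1 approximates a hard constraint, while values near 0.5 leave the decision almost entirely to the expression data.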


170. Detection of alternative expression by analysis of inconsistency in microarray probe performance
Andrey Ptitsyn, Genomics Institute of the Novartis Research Foundation;
ptitsyn@gnf.org
Short Abstract:

A study of probe behavior on the Affymetrix U95A and HS1 (developed by GNF) chips suggests that inconsistency between the hybridization performance of some probes and that of the rest of their probe set can be explained by alternatively expressed gene variants. We suggest a statistical metric for mining alternatively expressed genes in microarray databases.

One Page Abstract:

Biochips of the Affymetrix type are constructed so that each gene is represented by a number of oligonucleotide probes. Ideally, all probes in a probe set are supposed to produce similar intensities in each hybridization experiment, with some degree of variance reflecting noise from scanning, image analysis and other sources. Yet in some cases individual probes behave significantly differently from the other probes in the same set. We have studied the consistency of probe behavior in more than 50 experiments conducted on two biochips: Affymetrix U95A and HS1, developed by GNF. On both chips, we selected probe sets containing inconsistently behaving probes. We collected evidence indicating that inconsistent probes may belong to alternatively expressed gene variants (effects of alternative splicing and/or alternative polyadenylation). We also suggest a statistical metric for effective data mining of alternatively expressed genes in microarray databases.
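The abstract does not define its metric. As a simple illustration of what a probe consistency score could look like, the sketch below correlates each probe's intensity profile across experiments with the median profile of the other probes in the same set; a low or negative value flags a probe that behaves differently. The function name and toy data are assumptions, not the author's method.

```python
import numpy as np

def probe_consistency(intensities):
    """
    intensities: array of shape (n_probes, n_experiments) for one probe set.
    Returns, for each probe, the Pearson correlation between its profile
    across experiments and the median profile of the remaining probes.
    """
    n_probes = intensities.shape[0]
    scores = np.empty(n_probes)
    for i in range(n_probes):
        others = np.delete(intensities, i, axis=0)   # leave probe i out
        reference = np.median(others, axis=0)
        scores[i] = np.corrcoef(intensities[i], reference)[0, 1]
    return scores

# Toy probe set (hypothetical): four probes follow the same trend with
# different affinities; the fifth probe shows the opposite trend.
base = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
probe_set = np.vstack([
    base,
    1.5 * base,
    0.8 * base + 1.0,
    1.2 * base,
    base[::-1],
])

scores = probe_consistency(probe_set)
print(scores)   # the last probe stands out with a negative correlation
```

Using the median of the other probes as the reference keeps a single aberrant probe from distorting the profile against which the rest are judged.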


171. Which clustering algorithms best use expression data to group genes by function?
Frank D Gibbons, Frederick P Roth, Dept of Biological Chemistry and Molecular Pharmacology;
fgibbons@hms.harvard.edu
Short Abstract:

For inferring gene function, we assert that the best expression-based gene clustering algorithm is one which best groups genes by function. We scored commonly used clustering algorithms using a figure-of-merit based on total mutual information between clusters and a large set of S. cerevisiae gene attributes.

One Page Abstract:

Clustering genes based on their expression patterns has proven useful as an exploratory data analysis tool. In particular, it has been observed that clustering by expression has a tendency to group genes of similar function together. This fact has led to the idea of 'guilt-by-association', or inference of gene function based on expression. Many clustering methods are in use, but little guidance is available on which are most suitable for this purpose. Data-driven figures of merit for clustering algorithms have been applied, but do not directly address this question. We assert that the best algorithm for inference of gene function based on expression is the one which best clusters genes according to their function. We developed a figure of merit for expression-based clustering algorithms based on the total mutual information between clusters and a large set of gene attributes. Using a collection of Saccharomyces cerevisiae gene expression data and gene annotation from the Saccharomyces Genome Database and Gene Ontology Consortium, we applied this figure-of-merit to evaluate commonly used clustering algorithms, data transformations, expression-based distance measures between genes, and the most appropriate number of clusters.
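A figure of merit of this kind can be sketched by summing the mutual information between the cluster labels and each gene attribute. The plain unweighted sum used below, and the function names, are assumptions for illustration; the abstract does not specify how the authors combine or normalise the individual terms.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two discrete labelings."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi

def figure_of_merit(clusters, attributes):
    """Sum of the mutual information between clusters and each attribute."""
    return sum(mutual_information(clusters, column) for column in attributes)

# Toy example (hypothetical): four genes in two clusters, two binary
# attributes. The first attribute follows the clusters exactly (1 bit);
# the second is independent of them (0 bits).
clusters = [0, 0, 1, 1]
attributes = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]

fom = figure_of_merit(clusters, attributes)
print(fom)
```

Different clustering algorithms, distance measures or numbers of clusters can then be compared by computing this score for each resulting partition of the same genes.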


172. Visualization and Analysis Tool for Gene Expression Data Mining
Alexander Sturn, Institute of Biomedical Engineering, Graz University of Technology, Krenngasse 37, 8010 Graz, Austria;
John Quackenbush, The Institute for Genomic Research, Rockville, MD 20850, USA;
Hackl Hubert, Zlatko Trajanoski, Institute of Biomedical Engineering, Graz University of Technology, Krenngasse 37, 8010 Graz, Austria;
alexander.sturn@tugraz.at
Short Abstract:

We have developed a platform independent Java suite, which integrates various tools for microarray gene expression data mining including filters, normalization and visualization tools, as well as hierarchical and non hierarchical clustering algorithms incorporating multiple similarity distance measurements. Additionally it is possible to map gene expression data onto chromosomal sequences.

One Page Abstract:

High throughput gene expression analysis is becoming increasingly important in many areas of basic and applied biomedical research. Oligonucleotide and cDNA microarray technology is one very promising approach for high throughput transcriptome analysis and provides the opportunity to study gene expression patterns on a genomic scale. Thousands or even tens of thousands of genes can be spotted on a single microscope slide, and the relative expression level of each gene can be determined by measuring the fluorescence intensity of labeled mRNA hybridized to the arrays. Beyond the simple discrimination of differentially expressed genes, functional annotation (guilt-by-association), diagnostic classification, and the investigation of transcriptional control mechanisms (coregulation from coexpression) require the clustering of genes from multiple experiments into groups with similar expression patterns. Several clustering techniques have recently been developed and applied to the analysis of microarray data. However, to the best of our knowledge, there is no single tool that integrates the common clustering and visualization methods and allows easy comparison of results from different clustering approaches. We have developed a versatile, platform independent, and easy to use Java suite for the simultaneous visualization and analysis of a whole set of gene expression experiments. After the data are read from flat files through a flexible plug-in interface, several graphical representations of all intensity values can be generated, showing a matrix of experiments and genes in which multiple experiments and genes can easily be compared with each other. Each gene can be linked to additional information at the NCBI. Several filters and normalization procedures are provided to obtain the best possible representation of the data for further statistical analysis. 
Eleven different similarity and distance measures have been implemented, ranging from simple Pearson correlation to more sophisticated approaches such as mutual information and rank correlation coefficients. The most commonly used hierarchical and non hierarchical clustering and classification algorithms have been implemented to identify similarly expressed genes and extract expression patterns inherent in the data, including: (1) hierarchical clustering, (2) k-means, (3) self organizing maps, (4) principal component analysis, and (5) support vector machines. An important and valuable feature of this software is the ability to compare clustering results from different clustering techniques and parameter settings, which can provide the researcher with more information than a single-method approach. Additionally, it is possible to map gene expression data onto chromosomal sequences to enhance the investigation of regulatory mechanisms. Genes at consecutive chromosomal locations are often co-expressed and can be easily identified by this method. Finally, extensive work has been undertaken to visualize the gene expression data and clustering results in a user friendly and intuitive way. The flexibility, the variety of analysis and data visualization tools, and the transparency and portability give this software suite the potential to become a valuable tool in functional genomic studies.
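To illustrate why the choice among such measures matters, the sketch below compares a Pearson-based distance with a rank (Spearman) correlation distance on two profiles that are monotonically but not linearly related. It is written in Python rather than the suite's Java, and the function names are assumptions; they do not reflect the software's actual interface.

```python
import numpy as np

def pearson_distance(x, y):
    """1 - Pearson correlation between two expression profiles."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

def ranks(x):
    """Rank of each value, 0 for the smallest (assumes no tied values)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def spearman_distance(x, y):
    """1 - Spearman rank correlation (Pearson correlation of the ranks)."""
    return pearson_distance(ranks(x), ranks(y))

# Two hypothetical profiles with the same ordering across experiments
# but a non-linear relationship between their values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

print(pearson_distance(x, y))    # greater than zero
print(spearman_distance(x, y))   # zero up to rounding: the ranks agree exactly
```

A clustering run with the rank-based measure would treat these two genes as identical in behaviour, whereas the Pearson-based measure would keep them slightly apart.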


173. Prediction of co-regulated genes in Bacillus subtilis based on the conserved upstream elements across three closely related species
Goro Terai, INTEC Web and Genome Informatics Corp.;
Toshihisa Takagi, Kenta Nakai, Human Genome Center, Institute of Medical Science, University of Tokyo;
terai@ims.u-tokyo.ac.jp
Short Abstract:

The conservation information of three closely related species, Bacillus subtilis, Bacillus halodurans, and Bacillus stearothermophilus, was used to predict co-regulated genes of B. subtilis. We will report the results of extensive comparison between our prediction (cis-elements and regulons) and known examples using our database on B. subtilis transcription, DBTBS (http://elmo.ims.u-tokyo.ac.jp/dbtbs/).

One Page Abstract:

Identification of co-regulated genes is essential for elucidating transcriptional regulatory networks and the function of uncharacterized genes. Although co-regulated genes should have at least one common sequence element, it is generally difficult to identify these genes from the presence of this element because it is very easily obscured by noise. To overcome this problem, we used the conservation information of three closely related species: Bacillus subtilis, Bacillus halodurans, and Bacillus stearothermophilus. Although even these species share only a limited number of clearly orthologous genes, we obtained 3,178 phylogenetically conserved elements from the upstream intergenic regions of 1,568 B. subtilis genes. Similarity between these elements was used to cluster the genes. No other a priori knowledge of genes and elements was used. Another merit of working with B. subtilis genes is that a wealth of experimental studies is available for this species. We confirmed that general elements such as -35/-10 boxes and the Shine-Dalgarno sequence are not major obstacles. Moreover, we identified some genes known or suggested to be regulated by a common transcription factor, as well as genes regulated by a common attenuation effector. We also identified some plausible additional members of known groups of co-regulated genes. Thus, our approach is promising for exploring potentially co-regulated genes.


174. Comparison of performances of hierarchical and non hierarchical neural networks for the analysis of DNA array data
Joaquin Dopazo, Javier Herrero, Bioinformatics, CNIO, Ctra. Majadahonda-Pozuelo, Km 2, Majadahonda, 28220 Madrid, Spain;
jdopazo@cnio.es
Short Abstract:

Unsupervised neural networks are extensively used for the analysis of DNA array data due to properties like robustness and nearly linear runtimes. Here we present a comparison of the performances of two unsupervised neural networks and an aggregative hierarchical method in terms of runtime and accuracy in the classification obtained.

One Page Abstract:

DNA microarray technology opens up the possibility of measuring the expression level of thousands of genes in a single experiment (Brown and Botstein, 1999). Serial experiments measuring gene expression under different conditions or at different times, or distinct experiments with diverse tissues, patients, etc., yield gene expression profiles under the different experimental conditions studied. Initial experiments suggest that genes with similar expression profiles tend to play similar roles in the cell. Aggregative hierarchical clustering has been extensively used for finding clusters of co-expressed genes (Eisen et al., 1998; Wen et al., 1998). Nevertheless, several authors (Tamayo et al., 1999) have noted that aggregative hierarchical clustering suffers from a lack of robustness. In addition, aggregative hierarchical clustering methods have runtimes that are at least quadratic (Hartigan, 1975), which makes them very slow when thousands of items are to be analysed. These arguments led to the use of neural networks as an alternative to aggregative hierarchical clustering methods (Tamayo et al., 1999; Törönen et al., 1999; Herrero et al., 2001). Unsupervised neural networks, such as Self Organising Maps (SOM) (Kohonen, 1997) or the Self Organising Tree Algorithm (SOTA) (Dopazo and Carazo, 1997), provide a more robust and appropriate framework for clustering large amounts of noisy data. Neural networks have a series of properties that make them suitable for the analysis of gene expression patterns. They can deal with real-world data sets containing noisy, ill-defined items with irrelevant variables and outliers, whose statistical distributions need not be parametric; they are also reasonably fast and can easily be scaled to large data sets. Here we present a comparison of the performance of SOM and SOTA in terms of both runtime and accuracy of the classification obtained. The results are compared to a classical aggregative hierarchical method.

References

Brown, P.O. and Botstein, D. (1999). Nature Biotechnol. 14:1675-1680.

Dopazo, J. & Carazo, J.M. (1997) J. Mol. Evol 44:226-233.

Eisen M., Spellman P. L., Brown P. O., Botstein D. (1998). Proc. Natl. Acad. Sci. USA. 95: 14863-14868

Hartigan, J.A. (1975) Clustering algorithms. New York, Wiley

Herrero, J., Valencia, A. and Dopazo, J. (2001) Bioinformatics 17:126-136.

Kohonen, T. (1997) Self-organizing maps, Berlin. Springer-Verlag.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. & Golub, T.R. (1999) Proc. Natl. Acad. Sci. USA 96:2907-2912.

Törönen, P., Kolehmainen, M., Wong, G. & Castrén, E. (1999) FEBS letters 451:142-146.

Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L. & Somogyi, R. (1998) Proc. Natl. Acad. Sci. USA 95:334-339
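The runtime argument made in this abstract can be seen in a minimal self-organising map: each training pass visits every item once and compares it only with the map nodes, so the cost per pass grows linearly with the number of items, whereas aggregative hierarchical clustering starts from all pairwise distances. The sketch below is a generic one-dimensional SOM with assumed parameter names and linearly decaying learning rate and neighbourhood width; it is not the SOM or SOTA implementation evaluated by the authors.

```python
import numpy as np

def train_som(data, n_nodes=2, n_epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a one-dimensional SOM; returns the node weight vectors."""
    rng = np.random.default_rng(seed)
    # Initialise the nodes from randomly chosen data points.
    start = rng.choice(len(data), size=n_nodes, replace=False)
    weights = data[start].astype(float)
    positions = np.arange(n_nodes)            # node coordinates on the map
    for epoch in range(n_epochs):
        frac = epoch / n_epochs
        lr = lr0 * (1.0 - frac)               # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3  # shrinking neighbourhood width
        for x in data[rng.permutation(len(data))]:
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighbourhood around the winning node.
            h = np.exp(-((positions - winner) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights

def assign(data, weights):
    """Index of the closest node for every item."""
    return np.array([np.argmin(np.linalg.norm(weights - x, axis=1))
                     for x in data])

# Toy data (hypothetical): two well-separated groups of three profiles.
data = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                 [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]])
weights = train_som(data)
labels = assign(data, weights)
print(labels)   # the two groups are mapped to different nodes
```

With many more items than nodes, the inner loop dominates the cost, which is the basis for the near-linear runtimes mentioned in the short abstract.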


175. Linking micro-array based expression data with molecular interactions, pathways and cheminformatics
Robin Munro, Iris Ansorge, Ewald Aydt, Kay Böttcher, Eric Minch, Claus Kremoser, Thomas Meyer, Jeroen van de Peppel, Tobias Schlegl, Stefan Weiss, Martin Hofmann, LION Bioscience AG;
robin.munro@lionbioscience.com
Short Abstract:

As the information obtained from expression data increases, it becomes very important to associate it with other types of information. We have bridged the gap between expression data, molecular interactions, pathways and chem-informatics with tools that can cross-communicate. We are using these to study estrogen receptors in our laboratories.

One Page Abstract:

As the information obtained from expression data increases, it is of great importance to be able to associate it with other types of information in order to deduce more about the way in which genes are responding to changes in expression. We have bridged the gap between expression data, molecular interactions and pathways with three visualisation tools that can cross-communicate through an IAC (inter-application communication) server based on SRS (sequence retrieval system) technology. The SRS platform is used as the universal tool to interconnect different domains such as protein analysis and pathway analysis tools.

By combining our informatics and laboratory expertise the data can be seamlessly passed from experimental recording to expression data analysis, to protein interaction and to pathway reconstruction. In this example we focus on Estrogen Receptors which are a class of Nuclear Receptors. These are well-established therapeutic targets that are tightly connected to disease areas.

In proof of concept we have used protein interactions from the literature and Yeast 2 Hybrid (Y2H) to identify genes which are expressed similarly across different tissues. Similarly genes can be passed to and from a pathway reconstruction tool where relevant biological pathways can be generated and analyzed. We also demonstrate how the link between this expression analysis system and chem-informatics software allows us to correlate structural descriptors for low molecular weight compounds with cellular responses at the expression level.

Exposing cells to low molecular weight substances can induce drastic changes in gene expression patterns. These changes are the consequence of molecular interactions of the substances with cellular targets (proteins or DNA), which trigger signalling pathways and ultimately lead to changes in gene expression. In pharmacology these changes are divided into wanted therapeutic effects and unwanted side effects.

In an initial proof of principle experiment we have used this system for the correlation of structure activity relationships (SAR) of selective estrogen receptor modulators (SERMs) with the cellular response patterns induced in osteoblasts. To overcome the problem with the limited availability of bone-specific cDNA clone collections we have generated in our laboratories a non-redundant set of cDNA fragments isolated from specialised cDNA libraries. The ultimate aim of our approach is to use this system for the prediction of biological effects based on structure / expression relationships.


176. Classification of Acute Leukemia Gene Expression Data Using Weight Function and Principal Component Analysis
Jeongah Yoon, Jee-Hyub Kim, Biological Research Information Center, Pohang University of Science and Technology;
Hong Gil Nam, Department of Life Science, Pohang University of Science and Technology;
yja@bric.postech.ac.kr
Short Abstract:

We present a new method for classification of acute leukemia gene expression data based on weight function and principal component analysis (PCA). The proposed method can treat imprecise and high dimensional problems of gene expression data. The classification results of training and test samples show 100% and 94.12% accuracy, respectively.

One Page Abstract:

Classification of tumor samples using DNA microarray technology has recently become an important aspect of cancer diagnosis and treatment. We present a new method for classification of acute leukemia gene expression data based on a weight function and principal component analysis (PCA). The goal is to establish the best predictive classifier of the two cancer subtypes (ALL/AML). Gene expression data are characterized by very high dimensionality and contain irrelevant features. As a result, the data are difficult to interpret, and the complexity of the original data makes computation expensive. To address these problems, we first use a weight function to reduce the influence of imprecise information in the microarray data, and PCA to reduce the dimensionality, before building the classifier, a basic mathematical model used to predict new samples. The weights, given by a cubic weight function applied to each training sample, are based on the Euclidean distance between the sample mean and each training sample. The samples are weighted by values between 0 and 1 according to their proximity to the sample mean of each class, so that the samples closest to the class mean receive the most weight. The weighted data then go through PCA, which linearly transforms the high-dimensional data into score values of reduced dimension without loss of important information. Next, the maximal number of PCs is determined by the fraction of eigenvalues, which indicates the relative significance of the ith PC. Among these PCs, we automatically select the optimal two, which satisfy both minimum within-class variance and maximum between-class variance. The scores on these two PC dimensions are applied to Fisher's discriminant function to separate the training data into the two tumor types.
Finally, this classifier is used to classify unknown or new samples. The samples were divided into two sets by the data provider: 38 samples (ALL/AML = 27/11) as an initial or training set and 34 samples (ALL/AML = 20/14) as an independent or test set, with each sample giving the expression levels of 7129 genes. Our results show that the discriminant function from the 38-case training set classifies the training samples perfectly into the two classes (100% accuracy). In the prediction of the 34-case test set, only two cases were misclassified by the algorithm (32/34 correct, 94.12% accuracy): two AMLs were labeled as ALLs. We thus propose a novel classification method using a weight function and PCA, which can address the problems of imprecise and high-dimensional gene expression data. This algorithm can be extended to other cases, with high potential in structural and functional genomics.
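The pipeline described above (per-class weighting, PCA, selection of two PCs, Fisher's discriminant) can be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: the "cubic weight function" is taken to be a tricube function of the scaled distance to the class mean, the weights are applied by shrinking each sample's deviation from its class mean, the two PCs are chosen by a per-component ratio of between-class to within-class variance, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np

def cubic_weights(X):
    """Weights in [0, 1], largest for samples closest to the class mean.
    Assumed tricube form; the abstract only says 'cubic weight function'."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    u = d / (d.max() + 1e-12)
    return (1.0 - u**3) ** 3

def weight_class(X):
    """Assumed way of applying the weights: shrink deviations from the class mean."""
    mean = X.mean(axis=0)
    return mean + cubic_weights(X)[:, None] * (X - mean)

def fit(X, y, max_pcs=5):
    # Weight each class separately, then run PCA (via SVD) on the weighted data.
    Xw = X.copy()
    for c in (0, 1):
        Xw[y == c] = weight_class(X[y == c])
    center = Xw.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xw - center, full_matrices=False)
    Z = (Xw - center) @ Vt[:max_pcs].T
    Z0, Z1 = Z[y == 0], Z[y == 1]
    # Keep the two PCs with the largest between-class / within-class variance ratio.
    ratio = (Z1.mean(0) - Z0.mean(0)) ** 2 / (Z0.var(0) + Z1.var(0) + 1e-12)
    keep = np.argsort(ratio)[-2:]
    P = Vt[:max_pcs][keep]
    Z0, Z1 = Z0[:, keep], Z1[:, keep]
    # Fisher's linear discriminant in the two-dimensional score space.
    Sw = (np.cov(Z0, rowvar=False) * (len(Z0) - 1)
          + np.cov(Z1, rowvar=False) * (len(Z1) - 1))
    w = np.linalg.solve(Sw, Z1.mean(0) - Z0.mean(0))
    threshold = w @ (Z0.mean(0) + Z1.mean(0)) / 2.0
    return center, P, w, threshold

def predict(model, X):
    center, P, w, threshold = model
    return (((X - center) @ P.T) @ w > threshold).astype(int)

# Synthetic stand-in for the training set: 27 samples of class 0, 11 of class 1.
rng = np.random.default_rng(0)
y = np.array([0] * 27 + [1] * 11)
X = rng.normal(size=(38, 50))
X[y == 1, :10] += 4.0            # class 1 differs in the first 10 "genes"
model = fit(X, y)
train_accuracy = (predict(model, X) == y).mean()
print(train_accuracy)
```

The synthetic classes are deliberately well separated, so the sketch says nothing about the accuracy reported for the leukemia data.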


177. Application of Fuzzy Robust Competitive Clustering Algorithm on Microarray Gene Expression Profiling Analysis
Xudong Dai, Rutgers University;
Wei Xiong, WaveGenix, LLC;
Hichem Frigui, University of Memphis;
Tong Fang, SIMONS;
xudong@hotmail.com
Short Abstract:

Clustering algorithms for gene expression analysis assume well-defined boundaries between clusters. This assumption may not be valid for biological processes. We propose applying Robust Competitive Agglomeration (RCA), which uses fuzzy cluster membership, on microarray data analysis. Our result suggests that the RCA is useful for gene expression profiling analysis.

One Page Abstract:

Global gene expression profiles revealed by microarray technology can improve our understanding of complicated biological processes. Clustering algorithms have proven to be important tools for extracting meaningful patterns embedded in those transcriptional profiles. Standard hierarchical clustering and simple partitional clustering procedures have been widely used by biologists for extracting genes associated with certain cellular events or disorders. Unfortunately, these simple clustering algorithms have several inherent limitations that can affect the quality of the detected profiles. In particular, they assume well-defined boundaries between different clusters. This assumption may not be valid for most biological processes, where complicated and extensive molecular interactions occur and can result in overlapping clusters. To overcome these limitations, we propose investigating a more robust clustering algorithm, called Robust Competitive Agglomeration (RCA). The RCA algorithm uses a fuzzy membership to handle overlapping clusters, and a robust membership to handle noise points and outliers. Moreover, the RCA can efficiently find the optimal number of clusters. Thus, it is an ideal tool for gene expression profiling analysis. The performance of the RCA is illustrated with real microarray data. We show that the RCA algorithm can recognize patterns embedded across different clusters and can therefore model the complicated cellular events in which extensive cross talk and interactions along pathways may exist.
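To illustrate what a fuzzy membership means for overlapping clusters, the sketch below computes standard fuzzy c-means memberships for fixed cluster centres. This is not the RCA algorithm itself, which additionally uses robust memberships and competitive agglomeration to choose the number of clusters. The fuzzifier m = 2 and the toy profiles are assumptions made for the example.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fuzzy c-means memberships: each row sums to 1 across clusters."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                 # avoid division by zero at a centre
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Toy expression profiles: two genes at the cluster centres, one halfway between.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
profiles = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
U = fuzzy_memberships(profiles, centers)
print(U)
```

A gene lying between two clusters receives a split membership instead of being forced into one of them, which is the property the abstract relies on for overlapping biological processes.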


178. A Robust Algorithm for Expression Analysis
Earl Hubbell, Wei-Min Liu, Teresa Webster, Fred Christians, Gang Lu, Joy Fang, Rui Mei, Affymetrix;
Earl_Hubbell@affymetrix.com
Short Abstract:

We consider the problem of estimating gene expression using oligonucleotide arrays. Such estimates should have approximate linearity in concentration, non-negative results, statistical robustness, and include estimates of variation. A robust algorithm meeting these goals has been validated against spike experiments, and shows comparable performance to existing standards.

One Page Abstract:

Gene expression analysis is one of the most important applications of oligonucleotide array technology. Samples containing a variety of transcripts are hybridized to complementary probes on a surface, and intensity data are obtained for each probe-transcript interaction. Such data are often noisy and subject to confounding cross-hybridization effects, so they must be analyzed with care.

Our approach to analysis of intensity data from arrays is derived from simple models linking intensity data with the underlying concentration of targets. We do not assume strong distributions for errors, nor do we assume that probes have identical properties. The two simple models are:

(1) Intensity(I) = Non-transcript-related-effects(NTRE) + transcript-related-effects(TRE)

(2) log(TRE) = log(concentration(c)) + log(affinity(A)) + residual(R)

Intensity is the observed intensity of a given probe sequence in an experiment, and NTRE and TRE are the hidden division of these intensities into intensity related to the transcript of interest and the remainder. The concentration is the concentration of the transcript, and the affinity is the probe affinity for this transcript. Note that intensities, non-transcript-effects, and transcript-related-effects are all nonnegative, as are concentrations and affinities. For stability, we will assume that zero values are actually small positive values.

Because probe affinities can vary widely, and are typically stable across experiments, we note the derived model, describing the transcript-related effect for the same probe sequence in two experiments x and y:

(2a) log(TRE(x))-log(TRE(y)) = log(concentration(c(x)))-log(concentration(c(y)))+residual.

This derived relationship will be used in comparative analysis across experiments. These models capture broad empirically observed properties of oligonucleotide behavior in experiments.

Given these two models, we build an algorithm in layers starting from the intensity values. In the first layer, we estimate the non-transcript-related-effect for each intensity, and remove it to obtain an estimate (probe value (PV)) of the transcript-related-effects. We take care to ensure that this value is positive, as the transcript-related-effects and concentrations are positive. Within this procedure we can also estimate the significance of TRE vs NTRE, which corresponds to making a call of present or absent in the Affymetrix standard software.

In the second layer, we estimate the effect due to the target concentration by combining the individual probe values using robust statistics on the log-scale. We observe that intensities have experimental variation that increases with intensity, which suggests strongly that a log-transformation of the data will stabilize the variance. We use a robust statistic, the one-step Tukey biweight, to obtain location and scale estimates for the data. The biweight is known to have excellent behavior in the face of outliers, and using a single step avoids issues of convergence.
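A one-step Tukey biweight location estimate can be sketched as below. The tuning constant c = 5 and the small epsilon follow a commonly published description of this estimator and are assumptions here, as are the example probe values. The scale estimate mentioned in the text is omitted.

```python
import numpy as np

def one_step_tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate, started from the median."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    mad = np.median(np.abs(x - m))           # median absolute deviation
    u = (x - m) / (c * mad + eps)            # scaled distance from the median
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)  # zero weight far out
    return np.sum(w * x) / np.sum(w)

# Hypothetical log2 probe values for one probe set, with one gross outlier.
log_pv = np.array([8.0, 8.1, 7.9, 8.2, 7.8, 12.0])
estimate = one_step_tukey_biweight(log_pv)
print(estimate, np.mean(log_pv))
```

Because the outlying probe lies beyond the cutoff it receives zero weight, so the estimate stays near the bulk of the probes while the plain mean is pulled upward.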

In a variant of the second layer, we estimate the log-ratio of transcript concentrations in two experiments. Because of the great disparity in affinities of probes for transcripts, the appropriate statistical tool is a "paired-sample" test. First we obtain a probe log ratio (PLR) by differencing the log(PV) in each experiment. Then we apply our biweight statistic to obtain an estimate of the log-ratio of concentrations. Under reasonable assumptions about the nature of the residuals, the resulting estimate of scale can be used to test the significance of this value.
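The reason a paired analysis is appropriate can be seen in a small simulation of model (2a): differencing log probe values probe by probe cancels the widely varying affinities. All numbers below are made up for illustration, and the median is used as a simple robust stand-in for the biweight statistic described in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_probes = 16
log_affinity = rng.normal(0.0, 2.0, n_probes)   # affinities vary widely across probes
log_conc_x, log_conc_y = 6.0, 5.0               # hypothetical true log concentrations

# Model (2): log(TRE) = log(concentration) + log(affinity) + residual
log_pv_x = log_conc_x + log_affinity + rng.normal(0.0, 0.1, n_probes)
log_pv_y = log_conc_y + log_affinity + rng.normal(0.0, 0.1, n_probes)

# Probe log ratio (PLR): the affinity term cancels within each probe pair.
plr = log_pv_x - log_pv_y
paired_estimate = np.median(plr)                # stand-in for the biweight estimate
unpaired_spread = np.std(log_pv_x)              # dominated by affinity differences
paired_spread = np.std(plr)                     # only the residuals remain
print(paired_estimate, unpaired_spread, paired_spread)
```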

An algorithm following these procedures achieves the design goals - non-negative results, approximate linearity (given approximately linear probe behavior), and robustness against outliers. The performance of this algorithm was checked against a panel of extensive spikes against complex backgrounds. Performance was found to be comparable with the existing standard Affymetrix algorithm.


179. Analysis of Gene Expression by Short Tag Sequencing - Theoretical Considerations
Per Unneberg, Magnus Larsson, Dept of Biotechnology, Royal Institute of Technology (KTH), Stockholm;
Anders Wennborg, Dept of Biosciences, Karolinska Institute, Stockholm;
peru@biochem.kth.se
Short Abstract:

We have focused on certain aspects that are essential for a reliable analysis of short sequence data in relation to the existing transcript index databases. These aspects include the influence of tag length, tag uniqueness and restriction enzyme recognition site frequencies.

One Page Abstract:

Gene expression analysis has lately received much attention due to the advent of hybridisation array technologies. Alternative methods, based on cDNA-sequencing techniques, suffer the disadvantage of having a lower throughput. Still, these methods provide important information about gene expression. Firstly, previously unknown transcripts can be detected by showing the actual sequence contents of the sample without the need for pre-selection of probes. Secondly, with sufficiently large samples, quantitative information can be obtained about genes expressed at very low levels, falling below the detection limit of hybridisation array methods. The low throughput has been addressed by devising methods based on isolation of short sequence tags from each sampled mRNA. Examples of such methods are Serial Analysis of Gene Expression (SAGE), Tandem Arrayed Ligation of Expressed Sequence Tags (TALEST), and pyrosequencing. In general, tags of 10-20 bp length downstream of a given restriction enzyme cleavage site are used to identify the original mRNA.

We have focused on certain aspects that are essential for a reliable analysis of such short sequence data in relation to the existing transcript index databases. Firstly, two human transcript databases, RefSeq and UniGene, were analysed to investigate the reliability of short tag identification. Short tags were generated from transcript sequences based on a range of possible restriction enzyme recognition sites. For the enzyme NlaIII, which is commonly used in SAGE, approximately 5% of the transcripts were not identifiable by short tags, either because they lacked restriction enzyme recognition sites or because the generated 3'-tags were shorter than 10 bp. However, more than 90% of 10 bp tags were found to uniquely identify a transcript. Secondly, the specificity in identifying transcripts by the sequence similarity search algorithm BLAST was investigated with different sequence tag lengths. We found a tag-length in the interval 17-20 bp (including the restriction enzyme recognition site) to be sufficient for transcript identification by BLAST, while longer tag lengths did not appreciably improve the results.
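The tag generation and uniqueness check described above might look like the following sketch. It assumes that the tag is taken immediately downstream of the 3'-most NlaIII site (CATG) and excludes the recognition site itself. The transcript sequences are invented toy data, not RefSeq or UniGene entries.

```python
from collections import Counter

SITE = "CATG"  # NlaIII recognition sequence

def three_prime_tag(transcript, tag_len=10, site=SITE):
    """Return the tag following the 3'-most site, or None if not identifiable."""
    seq = transcript.upper()
    pos = seq.rfind(site)
    if pos < 0:
        return None                      # transcript lacks the recognition site
    start = pos + len(site)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None   # tag too short to use

def tag_statistics(transcripts, tag_len=10):
    """Count transcripts that yield a tag, and those whose tag is unique."""
    tags = {name: three_prime_tag(seq, tag_len) for name, seq in transcripts.items()}
    counts = Counter(t for t in tags.values() if t is not None)
    identifiable = [n for n, t in tags.items() if t is not None]
    unique = [n for n in identifiable if counts[tags[n]] == 1]
    return len(identifiable), len(unique)

toy = {
    "t1": "AAACATGGGGTTTTAAACCCGG",   # shares its tag with t2
    "t2": "CCCATGGGGTTTTAAATT",
    "t3": "TTTTCATGACGTACGTACGT",     # unique tag
    "t4": "GGGGGGGGGG",               # no NlaIII site
    "t5": "AAAACATGACG",              # tag shorter than 10 bp
}
n_identifiable, n_unique = tag_statistics(toy)
print(n_identifiable, n_unique)
```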


180. Computational analysis of RNA splicing by intron definition in five organisms
Lee Lim, Phillip Sharp, Chris Burge, MIT;
leelim@mit.edu
Short Abstract:

Splicing of short introns from five eukaryotes was simulated using five features: the 5' and 3' splice sites, the branch signal, intron length, and intron composition. The contribution of each of these features to splicing accuracy was analyzed, and the amount of information required for highly accurate splicing was estimated.

One Page Abstract:

A goal of research on pre-mRNA splicing is to write down (or implement in a computer program) a set of rules which describes how the splicing machinery identifies the precise locations of exons and introns in a transcript. Although this goal has not yet been realized, concepts such as intron definition have been developed to explain how the spliceosome recognizes introns. Short introns are likely spliced by the intron definition mechanism, where the 5’ and 3’ splice signals are initially recognized and paired in an intron-spanning interaction. Taking advantage of the recent availability of genomic sequences from five eukaryotes, we used a computational approach to: 1) analyze how well the intron definition model could splice short introns in these organisms and 2) understand the contribution of different transcript features to this process. Using datasets of reliably annotated transcripts from each organism, we identified populations of short introns in each dataset. Five features known or hypothesized to be involved in intron definition were analyzed: the 5’ splice signal, the 3’ splice signal, the branch signal, intron length preference, and intron composition. Using the concept of relative entropy from information theory, the information content of each of the five features was measured, giving a quantitative estimate of how much each feature could contribute to splicing specificity. In addition, a Monte Carlo method was used to estimate the amount of information necessary for accurate splicing of short introns: approximately 30-35 bits, depending on the organism. A program, IntronScan, was developed which uses the five features to identify the locations of short introns in transcripts. High accuracies of splicing (94-95%) could be attained in Drosophila and C. elegans, with the bulk of information deriving from the 5’ and 3’ splice signal motifs. S. cerevisiae was unique in deriving a large percentage of its information from the branch signal. 
However, the 3’ splice site signal was not precisely identified in 15% of S. cerevisiae introns, implying that our knowledge of 3’ splice site selection in this organism is incomplete. In Arabidopsis, the 5’ and 3’ splice signals are relatively weak, and are not sufficient to reliably identify introns. However, use of the intron composition feature resulted in dramatic improvements in accuracy (from 68% to 92%). In Arabidopsis, Drosophila, and human, closer analysis of the intron composition feature showed that a large percentage of the improvement in accuracy obtained with this feature could be attributed to small sets of sequence motifs; some of these potential intronic enhancers have already been experimentally verified while others have not yet been experimentally tested. Even with the use of the intron composition feature, the highest accuracy obtained in human was 85%, suggesting that other features not considered in our analysis must provide substantial amounts of information for splicing in vertebrates.
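The information content of a sequence signal, measured as relative entropy against a background composition, can be sketched as follows. The sketch assumes that motif positions are independent, so per-position values add up, and uses a uniform background and an invented three-position motif. It does not reproduce IntronScan or the Monte Carlo estimate.

```python
import math

def relative_entropy_bits(p, q):
    """Relative entropy D(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def motif_information(position_freqs, background):
    """Sum per-position relative entropies (assumes independent positions)."""
    return sum(relative_entropy_bits(col, background) for col in position_freqs)

background = [0.25, 0.25, 0.25, 0.25]      # A, C, G, T
motif = [
    [1.0, 0.0, 0.0, 0.0],                  # invariant position: 2 bits
    [0.5, 0.5, 0.0, 0.0],                  # two equally likely bases: 1 bit
    [0.25, 0.25, 0.25, 0.25],              # matches background: 0 bits
]
total_bits = motif_information(motif, background)
print(total_bits)
```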


181. Image Analysis and Feature Extraction Methods for High-Density Oligonucleotide Arrays
Hilmar Lapp, Yingyao Zhou, Peter Dimitrov, Genomics Institute of the Novartis Foundation (GNF);
Ruben Abagyan, The Scripps Research Institute, La Jolla, CA, USA;
lapp@gnf.org
Short Abstract:

Our goal was to devise an image analysis algorithm for high-density oligonucleotide array images that is both accurate and insensitive to artifacts on the feature level, and that can deal with slightly distorted grids. We will demonstrate the relative performance of different model-based and ad-hoc algorithms by consistency across replicated experiments.

One Page Abstract:

Hilmar Lapp (1,2), Yingyao Zhou (1), Peter Dimitrov (1), Ruben Abagyan (1,3)

(1) Genomics Institute of the Novartis Research Foundation, San Diego, USA; (2) Novartis Research Institute, IFD/CBC, Vienna, Austria; (3) The Scripps Research Institute, San Diego, USA; lapp@gnf.org

Motivation: High-density oligonucleotide arrays have become a favorite technology for large-scale gene expression profiling [1] and genomic hybridization-based research projects [2]. Many applications, like genome-wide tissue profiles, treatment response studies, etc., rely on quantitative expression signals being obtained from the hybridization images of the array. The method of choice for image analysis of Affymetrix oligonucleotide array images is usually the GeneChip software provided by Affymetrix, which by default employs a quantile-based quantification method using all but the border pixels of a cell [3]. Apart from being empirical, this method is sensitive to certain artifacts in the chip image, like broad inter-cell stripes and small bright stars. Our goal was to devise an Affymetrix array image analysis algorithm that is both accurate and insensitive to artifacts on the feature level, and that can deal with slightly distorted grids. We defined accuracy in terms of consistency of expression signals across replicate chips.

Results: We divided the task of image analysis and feature extraction into two steps, namely locating the grid of cells on the image, and quantifying each cell. Our method for locating the cells uses the corner coordinates provided when the array was scanned, and approximates a distortion of the grid from the expected rectangle as a continuous effect. The quantification of each cell was also split into two tasks, the first being pixel selection and the second being quantification of the previously selected pixels. We implemented several model-based as well as ad-hoc algorithms for both pixel selection and quantification. We will demonstrate the relative performance of the different possible algorithms as assessed by consistency of expression signals across replicated experiments, and we will show how the algorithms can deal with cell-level artifacts. It turned out that the performance strongly depends on the actual shape of a cell's pixel intensity profile, which is biased towards different intensity ranges.
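As a point of reference for the default method mentioned in the motivation, a quantile-based quantification over the non-border pixels of a cell can be sketched as below. The choice of the 75th percentile, the one-pixel border and the synthetic 8x8 cell are assumptions for illustration. The sketch does not implement the authors' own pixel selection or quantification algorithms.

```python
import numpy as np

def quantify_cell(cell, q=75.0, border=1):
    """Quantile of the cell's pixel intensities, ignoring the border pixels."""
    inner = cell[border:-border, border:-border]
    return float(np.percentile(inner, q))

# Synthetic 8x8 cell: dim border pixels, uniform interior.
cell = np.full((8, 8), 20.0)
cell[1:-1, 1:-1] = 100.0
clean = quantify_cell(cell)

# A small bright artifact (4 of 36 interior pixels) does not move the quantile.
small_star = cell.copy()
small_star[2:4, 2:4] = 1000.0
small = quantify_cell(small_star)

# An artifact covering more than a quarter of the interior does.
large_artifact = cell.copy()
large_artifact[2:6, 2:6] = 1000.0
large = quantify_cell(large_artifact)
print(clean, small, large)
```

The quantile tolerates artifacts only up to a fixed fraction of the pixels, which is one way to read the sensitivity to broad stripes noted in the motivation.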

[1] Lockhart D.J. et al. (1996), Nature Biotechnology 14:1675-1680; Wodicka L. et al. (1997), Nature Biotechnology 15:1359-1366
[2] Lipshutz et al. (1999), Nature Genetics Suppl. 21:21-24; Lockhart D.J. and Winzeler E. (2000), Nature 405:827-836
[3] Affymetrix GeneChip Manual; see e.g. Winzeler E. et al. (1999), Science 285:901-906


182. Transcriptional control mechanisms in the global context of the cell
Johan Elf, Måns Ehrenberg, Uppsala University, Department of Cell and Molecular Biology;
johan.andersson@icm.uu.se
Short Abstract:

The quality of molecular control mechanisms can be evaluated from their contribution to the fitness of the organism. In this study we have compared the attenuation and repressor mechanisms for regulation of amino acid biosynthetic operons. The conclusion is that a repressor system can sustain a higher growth rate.

One Page Abstract:

A hierarchy of feedback loops that employ a great variety of molecular mechanisms achieves control of gene expression in bacteria. Understanding the dynamics of these control systems and evaluating their quality require global mathematical models of growing cells under different external conditions. In such a theoretical framework the quality of a control loop can be evaluated in terms of its impact on the growth rate of the population of cells, and quantified by a “fitness parameter”.

We have compared repressor systems and attenuation mechanisms in control of expression of amino acid biosynthetic operons in Escherichia coli. The control systems’ ability to optimally balance the investment in the biosynthetic enzymes to the demand raised by protein synthesis will determine the growth rate and thereby the fitness of these control systems. The analysis is based on a mathematical model for whole cells taking into account the synthesis of twenty amino acids, aminoacylation of tRNAs and the consumption of amino acids by ribosomes in the making of new proteins.

The signals for the repressor and attenuation mechanisms are the concentrations of amino acids and aminoacyl-tRNAs, respectively. The flows through the amino acid and aminoacyl-tRNA pools are large and insensitive to pool size. The signals are therefore very sensitive to how synthesis and consumption of amino acids are balanced. In the language of automatic control theory, these ubiquitous intracellular feedback mechanisms display “bang-bang” control. Their general design principle almost automatically brings the intracellular enzyme concentrations to their optimal values.

Our results suggest that attenuation mechanisms, in contrast to repressor control, cannot keep the cell at a high growth rate and with a low frequency of amino acid substitution errors in protein synthesis. The reason is that attenuation mechanisms in spite of their high sensitivity only respond when ribosomes are slowed down by amino acid deficiency, and this reduced protein elongation rate directly impairs both growth rate and accuracy of mRNA translation.

Top-down approaches to cell modeling as described here are motivated by their ability to reveal universal principles behind gene regulation. They will also serve as a necessary conceptual framework for the design of a new generation of experiments directed towards control of gene expression in bacteria and other organisms.


183. Measurement and Prediction of Gene Expression in Whole Genomes
Carsten Friis, Peder Worning, Center for Biological Sequence Analysis (CBS);
Birgitte Regenberg, BioCentrum, DTU;
Steen Knudsen, David Ussery, Center for Biological Sequence Analysis (CBS);
carsten@cbs.dtu.dk
Short Abstract:

We have analysed expression of genes in the Escherichia coli and Saccharomyces cerevisiae genomes. The data are displayed graphically, such that expression levels throughout the whole genome can be visualised at once. We compare the results with DNA structural features and predicted mRNA expression levels, based on several different methods.

One Page Abstract:

We have analysed the expression of genes in the Escherichia coli and Saccharomyces cerevisiae genomes, using Affymetrix DNA chip technology. The data are displayed graphically, using "DNA Atlases", such that expression levels throughout the whole genome can be visualised at once and compared to DNA sequence parameters (such as AT-content and DNA flexibility).

We find that genes with similar expression levels are not uniformly distributed throughout the genome, but tend to cluster. To explain this phenomenon, the expression data are correlated with three different models for prediction of expression levels.

The first examines solely the structural parameters of the DNA, assuming that DNA must unwrap before transcription can occur, so that a correlation between helix flexibility and gene expression levels might be found. The second approach is based on the Codon Adaptation Index (CAI), a previously published weight matrix designed to distinguish between highly and lowly expressed genes based on codon usage. Finally, a neural network trained to recognise genes with high expression from E. coli is applied to both genomes.
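For the second model, a Codon Adaptation Index can be computed as the geometric mean of per-codon relative adaptiveness values derived from a reference set of highly expressed genes. The sketch below follows that general definition. The reference "genes" are toy strings, the 0.5 pseudocount for unseen codons is an assumption, and Met, Trp and stop codons are skipped because they have no synonymous alternatives to compare.

```python
import math
from collections import Counter

# Standard genetic code in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def codons(seq):
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_genes):
    """w(codon) = count / max count among synonymous codons in the reference set."""
    counts = Counter(c for g in reference_genes for c in codons(g) if c in CODE)
    w = {}
    for codon, aa in CODE.items():
        if aa in "*MW":
            continue                                   # no synonymous choice
        family_max = max(counts[c] for c, a in CODE.items() if a == aa)
        if family_max > 0:
            w[codon] = max(counts[codon], 0.5) / family_max   # assumed pseudocount
    return w

def cai(gene, w):
    """Geometric mean of w over the gene's codons that have a defined weight."""
    logs = [math.log(w[c]) for c in codons(gene) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

# Toy reference set: lysine is encoded by AAA three times and AAG once.
w = relative_adaptiveness(["AAAAAAAAAAAG"])
print(cai("AAAAAA", w), cai("AAAAAG", w), cai("AAGAAG", w))
```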

The results from all three models are compared statistically and correlation coefficients are presented.


184. Analysis of orthologous gene expression in microarray data
J.L. Jimenez, J. Sgouros, Imperial Cancer Research Fund;
jimenez@icrf.icnet.uk
Short Abstract:

Sequence-based clustering of orthologs from several genomes has revealed conserved groups of sequences involved in different cellular processes. Analysis of the expression of orthologs within and between organisms can provide information on relationships between gene groups involved in core biochemical functions and assist with the functional classification of orphan genes.

One Page Abstract:

Sequence comparison using databases of characterised genes can provide valuable hints about the molecular function of newly sequenced genes. At the genome level, these comparisons have enabled the functional classification of genes within organisms and provided important information about how these "libraries" of functions are maintained and modified between different phylogenetic groups during evolution. However, the most complete study to date, the Clusters of Orthologous Groups (COGs), has also shown the difficulty in distinguishing between actual orthologs (functionally equivalent proteins) and paralogs (proteins with similar sequence but whose function may have diverged from that of the original ancestor), and has left a considerable number of "orphan" groups that remain uncharacterised due to lack of biochemical information for any of their representative genes.

Microarray studies of the expression behaviour of a genome can help to redefine and improve the classification of the functional libraries by grouping genes regulated at the same time in different cellular processes. This grouping is usually done by means of unsupervised classification algorithms that do not impose any a priori constraint on the analysed data. Information about uncharacterised genes can be deduced by looking at their accompanying partners in the resulting groups, although usually each group contains genes from more than one of the known functional classes and thus only a rough assignment of the cellular process in which the orphan gene is involved can be obtained.

In the study presented here, we have used the COG information for the budding yeast proteins as starting point for the supervised grouping of several microarray experiments. Although the grouping defined by the COGs is in general preserved during gene expression, it is possible to divide broad groups into subclasses that reflect the oligomerisation state of the proteins and/or more specific functionality. Analysis of the relationships between these groups can clarify the boundaries of the cellular processes in which they are involved. Further comparison of the regulation of mitochondrial and non-mitochondrial proteins also shows the relevance of subcellular compartmentalisation in eukaryotes. Along with assisting protein annotation, the aim of this preliminary study is the comparison of gene expression between different organisms to understand how gene regulation could play a role in speciation.