Networks and Modeling

Monday July 23 8:30 - 9:15

Protein Interactions

David Eisenberg, Ioannis Xenarios, Joyce Duan, Lukasz Salwinski, Edward Marcotte, Matteo Pellegrini, Michael J. Thompson, Todd Yeates, University of California, Los Angeles


Networks of protein interactions control the lives of cells, yet we are only beginning to appreciate the nature and complexity of these networks. We have taken two approaches to the study of protein networks. The first is to infer functional interactions between pairs of proteins, by combining four methods: Rosetta Stone (fused domains), Phylogenetic Profiles (correlated occurrence of pairs of proteins in genomes), Gene Neighbor (separation of pairs of protein-encoding genes on chromosomes), and analysis of DNA microarray signals. This combination produces networks of protein functional interactions.
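As an illustration of the Phylogenetic Profiles idea only, the sketch below links proteins whose presence/absence patterns across genomes nearly coincide. The profile encoding, the Hamming-distance criterion, the `max_mismatch` parameter, and the toy data are all assumptions made for this example; the abstract does not state how the authors score profile similarity.

```python
from itertools import combinations

def hamming(p, q):
    """Number of genomes in which exactly one of the two proteins occurs."""
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_mismatch=0):
    """Return protein pairs whose presence/absence profiles nearly match.

    profiles: dict mapping protein name -> tuple of 0/1, one entry per genome.
    max_mismatch: hypothetical tolerance; 0 demands identical profiles.
    """
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        if hamming(profiles[a], profiles[b]) <= max_mismatch:
            pairs.append((a, b))
    return pairs

# Toy data (invented): presence of four proteins in five genomes.
profiles = {
    "P1": (1, 0, 1, 1, 0),
    "P2": (1, 0, 1, 1, 0),
    "P3": (0, 1, 1, 0, 1),
    "P4": (1, 0, 1, 0, 0),
}
print(linked_pairs(profiles))                  # identical profiles only
print(linked_pairs(profiles, max_mismatch=1))  # allow one differing genome
```

Proteins with matching profiles are candidates for a functional link, not confirmed interactors; in the approach described above, this evidence is combined with the other three methods.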

The second approach is to summarize studies from the scientific literature of interacting proteins in a database, the Database of Interacting Proteins (DIP). This database has now grown to thousands of interactions, and provides a second type of network of cellular protein interactions. The network from DIP is of physically interacting proteins, whereas the network of functional interactions is broader, including information on metabolic interactions.

The insights these networks offer about cellular function will be summarized. One of these is an estimate of the probable number of protein interactions in cells.

Monday July 23 9:15 - 9:40  

Protein-protein interaction map inference using interacting domain profile pairs

Jérôme Wojcik, Vincent Schächter, Hybrigenics S.A.


A number of methods have been designed to predict protein interactions from sequence or expression data. On the experimental front, however, high-throughput proteomics technologies are starting to yield large volumes of protein-protein interaction data.

High-quality experimental protein interaction maps constitute the natural dataset upon which to build interaction predictions. This motivates the development of the first interaction-based protein interaction map prediction algorithm.

A technique to predict protein-protein interaction maps across organisms is introduced, the `interaction-domain pair profile' method. The method uses a high-quality protein interaction map with interaction domain information as input to predict an interaction map in another organism. It combines sequence similarity searches with clustering based on interaction patterns and interaction domain information. We apply this approach to the prediction of an interaction map of Escherichia coli from the recently published interaction map of the human gastric pathogen Helicobacter pylori. Results are compared with predictions of a second inference method based only on full-length protein sequence similarity -- the ``naive'' method. The domain-based method is shown to i) eliminate a significant number of the false positives of the naive method that are consequences of multi-domain proteins; ii) increase the sensitivity compared to the naive method by identifying new potential interactions.
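The ``naive'' baseline mentioned above can be sketched as a simple transfer of interactions through a homolog table. This is a minimal illustration, not the authors' implementation: the function name, the protein identifiers, and the `homologs` table (assumed to come from a full-length sequence similarity search run elsewhere) are hypothetical.

```python
def transfer_interactions(source_interactions, homologs):
    """Predict target-organism interactions from a source interaction map.

    source_interactions: iterable of (protein_a, protein_b) pairs observed
        in the source organism.
    homologs: dict mapping a source protein -> list of target proteins judged
        similar over their full length (assumed to be precomputed).
    """
    predicted = set()
    for a, b in source_interactions:
        for ta in homologs.get(a, []):
            for tb in homologs.get(b, []):
                # Store each pair in sorted order so (x, y) and (y, x) coincide.
                predicted.add(tuple(sorted((ta, tb))))
    return predicted

# Toy data (invented identifiers).
source = [("HP1", "HP2"), ("HP2", "HP3")]
homologs = {"HP1": ["ecA"], "HP2": ["ecB", "ecC"]}  # HP3 has no homolog
print(sorted(transfer_interactions(source, homologs)))
```

Because the whole protein is treated as one unit, a multi-domain protein passes every interaction on to every homolog, whichever domain actually mediates it. This is the source of false positives that the domain-based method is reported to reduce.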

Monday July 23 9:40 - 10:05  

Inferring subnetworks from perturbed expression profiles

Dana Pe'er, Hebrew University; Aviv Regev, Tel Aviv University and Weizmann Institute of Science; Gal Elidan, Nir Friedman, Hebrew University


Genome-wide expression profiles of genetic mutants provide a wide variety of measurements of cellular responses to perturbations. Typical analysis of such data identifies genes affected by perturbation and uses clustering to group genes of similar function. In this paper we discover a finer structure of interactions between genes, such as causality, mediation, activation, and inhibition, by using a Bayesian network framework. We extend this framework to correctly handle perturbations, and to identify significant subnetworks of interacting genes. We apply this method to expression data of S. cerevisiae mutants and uncover a variety of structured metabolic, signaling and regulatory pathways.

Monday July 23 10:40 - 11:05  

GEST: a gene expression search tool based on a novel Bayesian similarity metric

Lawrence Hunter, Ronald C. Taylor, Sonia M. Leach, University of Colorado Health Sciences Center; Richard Simon, National Cancer Institute


Gene expression array technology has made possible the assay of expression levels of tens of thousands of genes at a time; large databases of such measurements are currently under construction. One important use of such databases is the ability to search for experiments that have gene expression levels similar to those of a query, potentially identifying previously unsuspected relationships among cellular states. Such searches depend crucially on the metric used to assess the similarity between pairs of experiments. The complex joint distribution of gene expression levels, particularly their correlational structure and non-normality, makes simple similarity metrics such as Euclidean distance or correlational similarity scores suboptimal for use in this application. We present a similarity metric for gene expression array experiments that takes into account the complex joint distribution of expression values. We provide a computationally tractable approximation to this measure, and have implemented a database search tool based on it. We discuss implementation issues and efficiency, and we compare our new metric to other standard metrics.
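For orientation, the sketch below shows the kind of search the abstract describes, using only the two simple baseline metrics it names (Euclidean distance and Pearson correlation). It does not implement the Bayesian metric, whose form the abstract does not give. The function names, the dictionary layout of the database, and the toy values are assumptions.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression vectors (smaller = closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation (larger = closer); undefined for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def search(query, database, metric="pearson"):
    """Rank stored experiments from most to least similar to the query."""
    if metric == "pearson":
        key = lambda name: -pearson(query, database[name])
    else:
        key = lambda name: euclidean(query, database[name])
    return sorted(database, key=key)

# Toy database (invented): three experiments measured on four genes.
database = {
    "expA": [2.0, 4.0, 6.0, 8.0],
    "expB": [1.1, 2.1, 2.9, 4.2],
    "expC": [4.0, 3.0, 2.0, 1.0],
}
query = [1.0, 2.0, 3.0, 4.0]
print(search(query, database, metric="pearson"))
print(search(query, database, metric="euclidean"))
```

On this toy data the two baselines rank the experiments differently, which illustrates why the choice of metric matters for such a search.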

Monday July 23 11:05 - 11:30  

Rich probabilistic models for gene expression

Eran Segal, Ben Taskar, Stanford University; Audrey Gasch, Lawrence Berkeley National Labs; Nir Friedman, Hebrew University; Daphne Koller, Stanford University


Clustering is commonly used for analyzing gene expression data. Despite their successes, clustering methods suffer from a number of limitations. First, these methods reveal similarities that exist over all of the measurements, while obscuring relationships that exist over only a subset of the data. Second, clustering methods cannot readily incorporate additional types of information, such as clinical data or known attributes of genes. To circumvent these shortcomings, we propose the use of a single coherent probabilistic model that encompasses much of the rich structure in the genomic expression data, while incorporating additional information such as experiment type, putative binding sites, or functional information. We show how this model can be learned from the data, allowing us to discover patterns in the data and dependencies between the gene expression patterns and additional attributes. The learned model reveals context-specific relationships that exist only over a subset of the experiments in the dataset. We demonstrate the power of our approach on synthetic data and on two real-world gene expression data sets for yeast. For example, we demonstrate a novel functionality that falls naturally out of our framework: predicting the ``cluster'' of the array resulting from a gene mutation based only on the gene's expression pattern in the context of other mutations.

Monday July 23 11:30 - 11:55  

Molecular classification of multiple tumor types

Chen-Hsiang Yeang, Sridhar Ramaswamy, Pablo Tamayo, Sayan Mukherjee, Ryan M. Rifkin, Michael Angelo, Michael Reich, Eric Lander, Jill Mesirov, Todd Golub, MIT Whitehead Institute


Using gene expression data to classify tumor types is a very promising tool in cancer diagnosis. Previous work shows that several pairs of tumor types can be successfully distinguished by their gene expression patterns. However, the simultaneous classification across a heterogeneous set of tumor types has not yet been well studied. We obtained 190 samples from 14 tumor classes and generated a combined expression dataset containing 16063 genes for each of those samples. We performed multi-class classification by combining the outputs of binary classifiers. Three binary classifiers (k-nearest neighbors, weighted voting, and support vector machines) were applied in conjunction with three combination scenarios (one-vs-all, all-pairs, hierarchical partitioning). We achieved the best cross validation error rate of 18.75% and the best test error rate of 21.74% by using the one-vs-all support vector machine algorithm. The results demonstrate the feasibility of performing clinically useful classification from samples of multiple tumor types.
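The one-vs-all and all-pairs combination scenarios can be sketched independently of the underlying binary classifier. The code below shows only the combination step, under stated assumptions: the binary classifiers are stand-in functions, and the class labels, scoring rules, and sample are invented. The hierarchical partitioning scenario is not shown.

```python
from collections import Counter
from itertools import combinations

def one_vs_all_predict(sample, scorers):
    """Pick the class whose class-vs-rest classifier is most confident.

    scorers: dict mapping class label -> function(sample) -> real-valued
        confidence that the sample belongs to that class rather than the rest.
    """
    return max(scorers, key=lambda label: scorers[label](sample))

def all_pairs_predict(sample, pair_classifiers):
    """Pick the class that wins the most pairwise contests.

    pair_classifiers: dict mapping (label_a, label_b) -> function(sample)
        returning whichever of the two labels it prefers.
    """
    votes = Counter(clf(sample) for clf in pair_classifiers.values())
    return votes.most_common(1)[0][0]

# Stand-in binary scorers (hypothetical; real ones would be trained models).
scorers = {
    "class_A": lambda s: s[0] - s[1],
    "class_B": lambda s: s[1] - s[0],
    "class_C": lambda s: 0.5 - abs(s[0]) - abs(s[1]),
}
# Build pairwise classifiers from the same scores, for illustration only.
pair_classifiers = {
    (a, b): (lambda s, a=a, b=b: a if scorers[a](s) >= scorers[b](s) else b)
    for a, b in combinations(sorted(scorers), 2)
}
sample = (2.0, 0.5)
print(one_vs_all_predict(sample, scorers))
print(all_pairs_predict(sample, pair_classifiers))
```

With k classes, one-vs-all needs k binary classifiers and all-pairs needs k(k-1)/2, so for the 14 tumor classes above the counts are 14 and 91.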

Monday July 23 11:55 - 12:20  

Centralization: a new method for the normalization of gene expression data

Alexander Zien, Thomas Aigner, Ralf Zimmer, Thomas Lengauer, GMD - German National Research Center for Information Technology and University of Erlangen-Nürnberg


Microarrays measure values that are approximately proportional to the numbers of copies of different mRNA molecules in samples. Due to technical difficulties, the constant of proportionality between the measured intensities and the numbers of mRNA copies per cell is unknown and may vary for different arrays. Usually, the data are normalized (i.e., array-wise multiplied by appropriate factors) in order to compensate for this effect and to enable informative comparisons between different experiments. Centralization is a new two-step method for the computation of such normalization factors that is both biologically better motivated and more robust than standard approaches. First, for each pair of arrays the quotient of the constants of proportionality is estimated. Second, from the resulting matrix of pairwise quotients an optimally consistent scaling of the samples is computed.
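The two steps can be sketched as follows. Both estimators are assumptions chosen for simplicity: step 1 uses the median of gene-wise log ratios, and step 2 uses a least-squares fit in log space. The abstract specifies neither the pairwise estimator nor the consistency criterion, so this is not the authors' procedure. All intensities are assumed to be positive.

```python
import math
from statistics import median

def pairwise_log_quotients(arrays):
    """Step 1: estimate L[i][j], the log of (scale of array i / scale of array j).

    Assumption: the median of gene-wise log ratios is used as the estimator.
    """
    n = len(arrays)
    return [[median(math.log(a / b) for a, b in zip(arrays[i], arrays[j]))
             for j in range(n)] for i in range(n)]

def consistent_scaling(log_q):
    """Step 2: find scales s with log(s[i] / s[j]) close to L[i][j].

    For a complete antisymmetric matrix, the least-squares solution with
    geometric mean 1 is the exponential of each row mean.
    """
    n = len(log_q)
    return [math.exp(sum(row) / n) for row in log_q]

def normalize(arrays):
    """Divide each array by its estimated scale."""
    scales = consistent_scaling(pairwise_log_quotients(arrays))
    return [[x / s for x in row] for row, s in zip(arrays, scales)]

# Toy data (invented): the same four genes measured at three different scales.
base = [1.0, 2.0, 4.0, 8.0]
arrays = [base, [2 * x for x in base], [0.5 * x for x in base]]
print(consistent_scaling(pairwise_log_quotients(arrays)))
print(normalize(arrays))
```

Working with log ratios keeps the matrix exactly antisymmetric (the estimate for i versus j is the negative of the estimate for j versus i), which is what makes the row-mean solution valid.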

Monday July 23 13:50 - 14:35

The phenomenon of the web

Bernardo Huberman, HP Sand Hill Labs


The World Wide Web has swiftly become an ecology of knowledge in which highly diverse information is interlinked in an extremely complex and arbitrary fashion. Because of the way it is structured and used, the Web also offers a unique laboratory for studying its properties and the way users forage for information.

There are two important aspects to the phenomenon of the Web that go beyond its global reach. First, the Web exhibits regularities in the way it grows and how people use it and interact with each other. Equally important, there are generative theories that explain these regularities. Second, the very nature of the Web enables the creation of novel mechanisms and institutions to help users and providers interact in more efficient and trusting ways. I will describe a particular one that impacts the search, retrieval and exchange of information within the bioinformatics domain.

Monday July 23 14:35 - 15:00  

Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles

Derek Y. Chiang, University of California, Berkeley; Patrick O. Brown, Stanford University School of Medicine; Michael B. Eisen, University of California, Berkeley and Ernest Orlando Lawrence Berkeley National Lab


The combination of genome-wide expression patterns and full genome sequences offers a great opportunity to further our understanding of the mechanisms and logic of transcriptional regulation. Many methods have been described that identify sequence motifs enriched in transcription control regions of genes that share similar gene expression patterns. Here we present an alternative approach that evaluates the transcriptional information contained by specific sequence motifs by computing for each motif the mean expression profile of all genes that contain the motif in their transcription control regions. These genome-mean expression profiles (GMEP's) are valuable for visualizing the relationship between genome sequences and gene expression data, and for characterizing the transcriptional importance of specific sequence motifs.
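The computation described above can be sketched in a few lines. The function name, the dictionary-based data layout, and the toy sequences and expression values are assumptions for illustration. The sketch searches only the strand given and does no significance testing.

```python
def gmep(motif, upstream, expression):
    """Mean expression profile of all genes whose control region contains motif.

    upstream: dict gene -> transcription control region sequence.
    expression: dict gene -> list of expression values, one per experiment.
    Returns None when no gene contains the motif.
    """
    genes = [g for g, seq in upstream.items() if motif in seq and g in expression]
    if not genes:
        return None
    n_conditions = len(expression[genes[0]])
    return [sum(expression[g][k] for g in genes) / len(genes)
            for k in range(n_conditions)]

# Toy data (invented): three genes, three experiments.
upstream = {
    "g1": "TTACGCGTAA",
    "g2": "GGACGCGTCC",
    "g3": "TTTTTTTTTT",
}
expression = {
    "g1": [1.0, -1.0, 0.5],
    "g2": [3.0, 1.0, 0.5],
    "g3": [0.0, 0.0, 0.0],
}
print(gmep("ACGCGT", upstream, expression))  # averages g1 and g2
print(gmep("GGGGGG", upstream, expression))  # no gene contains this motif
```

Applying `gmep` to every motif of a fixed length yields one profile per motif; these profiles can then be compared or clustered like ordinary expression profiles.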

Analysis of GMEP's calculated from a dataset of 519 whole-genome microarray experiments in Saccharomyces cerevisiae shows a significant correlation between GMEP's of motifs that are reverse complements, a result that supports the relationship between GMEP's and transcriptional regulation. Hierarchical clustering of GMEP's identifies clusters of motifs that correspond to binding sites of well-characterized transcription factors. The GMEP's of these clustered motifs have patterns of variation across conditions that reflect the known activities of these transcription factors.

Software that computes GMEP's from sequence and gene expression data is available under the terms of the GNU General Public License from

Monday July 23 15:35 - 16:00  

Identifying splits with clear separation: a new class discovery method for gene expression data

Anja von Heydebreck, Max-Planck-Institute for Molecular Genetics; Wolfgang Huber, Annemarie Poustka, German Cancer Research Center; Martin Vingron, Max-Planck-Institute for Molecular Genetics


We present a new class discovery method for microarray gene expression data. Based on a collection of gene expression profiles from different tissue samples, the method searches for binary class distinctions in the set of samples that show clear separation in the expression levels of specific subsets of genes. Several mutually independent class distinctions may be found, which is difficult to obtain from most commonly used clustering algorithms. Each class distinction can be biologically interpreted in terms of its supporting genes. The mathematical characterization of the favored class distinctions is based on statistical concepts. By analyzing three data sets from cancer gene expression studies, we demonstrate that our method is able to detect biologically relevant structures, for example cancer subtypes, in an unsupervised fashion.

Monday July 23 16:00 - 16:25  

CLIFF: clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts

Eric P. Xing, Richard M. Karp, University of California, Berkeley


We present CLIFF, an algorithm for clustering biological samples using gene expression microarray data. This clustering problem is difficult for several reasons, in particular the sparsity of the data, the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant. Our algorithm iterates between two computational processes, feature filtering and clustering. Given a reference partition that approximates the correct clustering of the samples, our feature filtering procedure ranks the features according to their intrinsic discriminability, relevance to the reference partition, and irredundancy to other relevant features, and uses this ranking to select the features to be used in the following round of clustering. Our clustering algorithm, which is based on the concept of a normalized cut, clusters the samples into a new reference partition on the basis of the selected features. On a well-studied problem involving 72 leukemia samples and 7130 genes, we demonstrate that CLIFF outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set.
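The normalized cut criterion underlying the clustering step can be illustrated on a small similarity graph. The sketch below evaluates Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V) and finds the best bipartition by exhaustive search. The exhaustive search, the function names, and the toy weights are assumptions; practical normalized-cut clustering relies on approximations instead, and feature filtering is not shown.

```python
from itertools import combinations

def normalized_cut(weights, part_a):
    """Normalized cut value of the bipartition (part_a, all other nodes).

    weights: symmetric matrix of non-negative similarities between samples.
    """
    n = len(weights)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b

def best_bipartition(weights):
    """Exhaustive search over bipartitions; feasible only for a few samples."""
    n = len(weights)
    best = None
    for size in range(1, n):
        # Node 0 always goes in part A, so each bipartition is visited once.
        for rest in combinations(range(1, n), size - 1):
            part_a = (0,) + rest
            score = normalized_cut(weights, part_a)
            if best is None or score < best[0]:
                best = (score, part_a)
    return best

# Toy similarity matrix (invented): samples 0-1 and 2-3 form two tight groups.
weights = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]
score, part_a = best_bipartition(weights)
print(part_a, round(score, 4))
```

Dividing the cut by each side's total association penalizes splitting off a single weakly connected sample, which a plain minimum cut tends to favor.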

Monday July 23 16:25 - 17:55

Bioethics and fiction science

P. Baldi, University of California, Irvine


At the beginning of the third millennium, mankind for the first time is staring at its genome. Within the same generation, the human brain will also stare at machines that surpass its raw computing power and an interconnected world of information processing devices approaching science fiction. Rapid progress in biological and computer technologies raises profound questions about the nature and boundaries of life, intelligence, and who we really are.

It may be of comfort to some to realize that modern biotechnology is in many ways just a powerful extension of human agricultural and breeding practices. Such practices have been ongoing for thousands of years. Throughout our history, we have manipulated and selected the genomes of plants and animals, and even our own. For most of our past, such control could be exerted only at the macroscopic level of entire organisms. It was crude, slow, and cumbersome. Today we can manipulate genomes directly at the microscopic level, the level of single genes and their constituents. We are beginning to be able to edit genomes as we do computer programs, with a scope, speed, and precision that far exceed those of evolution, rendering sexuality cumbersome and obsolete.

This workshop has two distinct themes. One is ``fiction science'', extrapolating current trends and imagining some of the possible scenarios for the future, regardless of their desirability. The second is bioethics in the context of contemporary issues such as stem cell research, genetic therapies, genetically modified food, human cloning, and gene patents.

Although the issues to be addressed are non-technical, they are no less important or timely. In 1999, stem cells were voted ``breakthrough of the year'' in Science. Today, the Human Genome Project is essentially completed and the race for patenting human genes, as well as engineered organisms, is on. Human cloning is within reach and the carbon-silicon boundary has begun to evaporate.

The workshop will provide a brief overview of these issues and a forum for free discussion and speculation. As scientists, we have a duty to explore these questions and share information with the community. If we don't, who will?