Networks of protein interactions control the lives of cells, yet we are only beginning to appreciate the nature and complexity of these networks. We have taken two approaches to the study of protein networks. The first is to infer functional interactions between pairs of proteins by combining four methods: Rosetta Stone (fused domains), Phylogenetic Profiles (correlated occurrence of pairs of proteins in genomes), Gene Neighbor (separation of pairs of protein-encoding genes on chromosomes), and analysis of DNA microarray signals. This combination produces networks of protein functional interactions.
The second approach is to summarize studies from the scientific literature of interacting proteins in a database, the Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu/). This database, DIP, has now grown to thousands of interactions, and provides a second type of network of cellular protein interactions. The network from DIP is of physically interacting proteins, whereas the network of functional interactions is broader, including information on metabolic interactions.
The insights these networks offer about cellular function will be summarized. One of these is an estimate of the probable number of protein interactions in cells.
A number of methods have been designed to predict protein interactions from sequence or expression data. On the experimental front, however, high-throughput proteomics technologies are starting to yield large volumes of protein-protein interaction data.
High-quality experimental protein interaction maps constitute the natural dataset upon which to build interaction predictions, hence our motivation to develop the first interaction-based protein interaction map prediction algorithm.
A technique to predict protein-protein interaction maps across organisms is introduced: the ``interaction-domain pair profile'' method. The method uses a high-quality protein interaction map with interaction domain information as input to predict an interaction map in another organism. It combines sequence similarity searches with clustering based on interaction patterns and interaction domain information. We apply this approach to the prediction of an interaction map of Escherichia coli from the recently published interaction map of the human gastric pathogen Helicobacter pylori. Results are compared with predictions of a second inference method based only on full-length protein sequence similarity -- the ``naive'' method. The domain-based method is shown to i) eliminate a significant number of the naive method's false positives that arise from multi-domain proteins, and ii) increase sensitivity relative to the naive method by identifying new potential interactions.
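The core transfer step of such a domain-based approach can be sketched as follows. This is a hypothetical illustration, not the authors' code: the source map is reduced to a set of interacting domain pairs, and the homology mapping (which in the actual method comes from sequence similarity searches) is stubbed here by a lookup table with invented domain names.

```python
from itertools import combinations

def predict_interactions(interacting_domain_pairs, target_proteins, homologs):
    """interacting_domain_pairs: set of frozensets of source-organism domains.
    target_proteins: dict protein -> set of its domains (target organism).
    homologs: dict target domain -> set of homologous source domains."""
    predictions = set()
    for p1, p2 in combinations(sorted(target_proteins), 2):
        for d1 in target_proteins[p1]:
            for d2 in target_proteins[p2]:
                # Predict an interaction if any pair of source homologs of
                # (d1, d2) was observed interacting in the source map.
                pairs = {frozenset((s1, s2))
                         for s1 in homologs.get(d1, ())
                         for s2 in homologs.get(d2, ())}
                if pairs & interacting_domain_pairs:
                    predictions.add(frozenset((p1, p2)))
    return predictions

# Toy example: one interacting domain pair in the source organism, and two
# target proteins carrying homologs of those domains.
known = {frozenset(("hpA", "hpB"))}
targets = {"ecX": {"dA"}, "ecY": {"dB"}, "ecZ": {"dC"}}
hom = {"dA": {"hpA"}, "dB": {"hpB"}}
print(predict_interactions(known, targets, hom))
```

Because prediction is keyed on domains rather than full-length proteins, an unrelated domain shared by two multi-domain proteins does not by itself trigger a prediction, which is the source of the false-positive reduction described above.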
Genome-wide expression profiles of genetic mutants provide a wide variety of measurements of cellular responses to perturbations. Typical analysis of such data identifies genes affected by perturbation and uses clustering to group genes of similar function. In this paper we discover a finer structure of interactions between genes, such as causality, mediation, activation, and inhibition, by using a Bayesian network framework. We extend this framework to correctly handle perturbations and to identify significant subnetworks of interacting genes. We apply this method to expression data of S. cerevisiae mutants and uncover a variety of structured metabolic, signaling and regulatory pathways.
Gene expression array technology has made possible the assay of expression levels of tens of thousands of genes at a time; large databases of such measurements are currently under construction. One important use of such databases is the ability to search for experiments whose gene expression levels are similar to those of a query, potentially identifying previously unsuspected relationships among cellular states. Such searches depend crucially on the metric used to assess the similarity between pairs of experiments. The complex joint distribution of gene expression levels, particularly its correlational structure and non-normality, makes simple similarity metrics such as Euclidean distance or correlational similarity scores suboptimal for this application. We present a similarity metric for gene expression array experiments that takes into account the complex joint distribution of expression values. We provide a computationally tractable approximation to this measure, and have implemented a database search tool based on it. We discuss implementation issues and efficiency, and we compare our new metric to other standard metrics.
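The abstract does not spell out the metric itself, but the shortcoming of Euclidean distance it describes can be illustrated with a standard covariance-aware distance (Mahalanobis). This is shown purely as an illustration of why modeling the joint distribution changes similarity judgments, not as the paper's measure: a difference aligned with the genes' shared variation counts for less than an equally large difference that cuts against it.

```python
import numpy as np

# Two strongly co-expressed "genes": their measurements vary together.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
prec = np.linalg.inv(cov)  # precision matrix used by the distance

def mahalanobis(x, y):
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ prec @ d))

# Both differences have the same Euclidean length (sqrt(2)), but only the
# second contradicts the correlation structure.
d_along = mahalanobis([1, 1], [0, 0])    # difference along the correlation
d_across = mahalanobis([1, -1], [0, 0])  # difference against the correlation
print(d_along, d_across)                 # the second is much larger
```

Under a plain Euclidean metric these two experiment pairs would be judged equally similar; a distribution-aware metric separates them.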
Clustering is commonly used for analyzing gene expression data. Despite their successes, clustering methods suffer from a number of limitations. First, these methods reveal similarities that exist over all of the measurements, while obscuring relationships that exist over only a subset of the data. Second, clustering methods cannot readily incorporate additional types of information, such as clinical data or known attributes of genes. To circumvent these shortcomings, we propose the use of a single coherent probabilistic model that encompasses much of the rich structure in the genomic expression data while incorporating additional information such as experiment type, putative binding sites, or functional information. We show how this model can be learned from the data, allowing us to discover patterns in the data and dependencies between the gene expression patterns and additional attributes. The learned model reveals context-specific relationships that exist only over a subset of the experiments in the dataset. We demonstrate the power of our approach on synthetic data and on two real-world gene expression data sets for yeast. For example, we demonstrate a novel functionality that falls naturally out of our framework: predicting the ``cluster'' of the array resulting from a gene mutation based only on the gene's expression pattern in the context of other mutations.
Using gene expression data to classify tumor types is a very promising tool in cancer diagnosis. Previous work has shown that several pairs of tumor types can be successfully distinguished by their gene expression patterns. However, simultaneous classification across a heterogeneous set of tumor types has not yet been well studied. We obtained 190 samples from 14 tumor classes and generated a combined expression dataset containing 16063 genes for each of those samples. We performed multi-class classification by combining the outputs of binary classifiers. Three binary classifiers (k-nearest neighbors, weighted voting, and support vector machines) were applied in conjunction with three combination scenarios (one-vs-all, all-pairs, hierarchical partitioning). We achieved the best cross-validation error rate of 18.75% and the best test error rate of 21.74% by using the one-vs-all support vector machine algorithm. The results demonstrate the feasibility of performing clinically useful classification from samples of multiple tumor types.
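The one-vs-all combination scheme itself is simple to sketch. The study used kNN, weighted voting, and SVMs as the binary classifiers; the toy below substitutes a centroid-margin score for the binary classifier but keeps the same combination logic: score each class against the rest, then take the argmax. All data and class names are invented.

```python
import numpy as np

def train_ova(X, y):
    """One model per class: centroid of the class and centroid of the rest."""
    y = np.array(y)
    return {k: (X[y == k].mean(axis=0), X[y != k].mean(axis=0))
            for k in sorted(set(y))}

def predict_ova(model, x):
    # Margin score for class k: how much closer x is to the class centroid
    # than to the rest-centroid. The predicted class is the argmax, exactly
    # as in the one-vs-all combination of binary classifier outputs.
    return max(model, key=lambda k: np.linalg.norm(x - model[k][1])
                                    - np.linalg.norm(x - model[k][0]))

# Toy 3-class problem in two dimensions.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.], [10., 1.]])
y = ["A", "A", "B", "B", "C", "C"]
model = train_ova(X, y)
print(predict_ova(model, np.array([5.2, 5.5])))  # → B
```

Swapping in real SVM decision values for the centroid margin changes only the scoring function, not the combination step.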
Microarrays measure values that are approximately proportional to the numbers of copies of different mRNA molecules in samples. Due to technical difficulties, the constant of proportionality between the measured intensities and the numbers of mRNA copies per cell is unknown and may vary for different arrays. Usually, the data are normalized (i.e., array-wise multiplied by appropriate factors) in order to compensate for this effect and to enable informative comparisons between different experiments. Centralization is a new two-step method for the computation of such normalization factors that is both biologically better motivated and more robust than standard approaches. First, for each pair of arrays the quotient of the constants of proportionality is estimated. Second, from the resulting matrix of pairwise quotients an optimally consistent scaling of the samples is computed.
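The two steps above can be sketched concretely. This is a hypothetical illustration of the scheme, not the authors' code: step 1 estimates each pairwise quotient of proportionality constants robustly as the median of per-gene intensity ratios; step 2 finds per-array scale factors whose pairwise ratios best fit that quotient matrix in a least-squares sense, which in log space has the closed-form solution of a row mean of the log-quotient matrix.

```python
import numpy as np

def centralization_factors(X):
    """X: genes x arrays matrix of positive intensities. Returns per-array
    scale estimates; dividing each array by its estimate centralizes them."""
    n_arrays = X.shape[1]
    logQ = np.zeros((n_arrays, n_arrays))
    for i in range(n_arrays):
        for j in range(n_arrays):
            # Robust estimate of log(c_i / c_j) from per-gene ratios.
            logQ[i, j] = np.median(np.log(X[:, i] / X[:, j]))
    # log s_i = mean_j logQ[i, j] minimizes
    # sum_ij (logQ[i, j] - (log s_i - log s_j))^2.
    s = np.exp(logQ.mean(axis=1))
    return s / s.mean()  # overall scale is arbitrary; fix the mean to 1

# Toy check: the middle array is the same sample measured at twice the gain.
rng = np.random.default_rng(1)
base = rng.lognormal(size=100)
X = np.column_stack([base, 2.0 * base, base])
print(centralization_factors(X))  # ratios recover the 1 : 2 : 1 gains
```

The median in step 1 is what makes the pairwise quotient estimates robust to genes that are genuinely differentially expressed between the two arrays.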
The World Wide Web has swiftly become an ecology of knowledge in which highly diverse information is interlinked in an extremely complex and arbitrary fashion. Because of the way it is structured and used, the Web also offers a unique laboratory for studying its properties and the way users forage for information.
There are two important aspects to the phenomenon of the Web that go beyond its global reach. First, the Web exhibits regularities in the way it grows and how people use it and interact with each other. Equally important, there are generative theories that explain these regularities. Second, the very nature of the Web enables the creation of novel mechanisms and institutions to help users and providers interact in more efficient and trusting ways. I will describe a particular one that impacts the search, retrieval and exchange of information within the bioinformatics domain.
The combination of genome-wide expression patterns and full genome sequences offers a great opportunity to further our understanding of the mechanisms and logic of transcriptional regulation. Many methods have been described that identify sequence motifs enriched in transcription control regions of genes that share similar gene expression patterns. Here we present an alternative approach that evaluates the transcriptional information contained by specific sequence motifs by computing for each motif the mean expression profile of all genes that contain the motif in their transcription control regions. These genome-mean expression profiles (GMEPs) are valuable for visualizing the relationship between genome sequences and gene expression data, and for characterizing the transcriptional importance of specific sequence motifs.
Analysis of GMEPs calculated from a dataset of 519 whole-genome microarray experiments in Saccharomyces cerevisiae shows a significant correlation between GMEPs of motifs that are reverse complements, a result that supports the relationship between GMEPs and transcriptional regulation. Hierarchical clustering of GMEPs identifies clusters of motifs that correspond to binding sites of well-characterized transcription factors. The GMEPs of these clustered motifs have patterns of variation across conditions that reflect the known activities of these transcription factors.
Software that computes GMEPs from sequence and gene expression data is available under the terms of the GNU General Public License from http://rana.lbl.gov/.
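The GMEP computation itself is a one-liner over each motif's gene set. The following minimal illustration (invented gene names, motifs, and expression values, not the released software) averages the expression profiles of all genes whose transcription control region contains a motif; a fuller version would also scan the reverse complement of each motif, which is what makes the reverse-complement correlation above a meaningful check.

```python
import numpy as np

def gmep(motif, upstream, expression):
    """upstream: gene -> control-region sequence; expression: gene -> profile
    (np.array of per-condition values). Returns the mean profile of all genes
    whose control region contains the motif, or None if no gene does."""
    hits = [g for g, seq in upstream.items() if motif in seq]
    if not hits:
        return None
    return np.mean([expression[g] for g in hits], axis=0)

# Toy data: two conditions, three genes, motif "ACGT" present in g1 and g2.
upstream = {"g1": "ACGTGA", "g2": "TTACGT", "g3": "GGGGGG"}
expression = {"g1": np.array([1.0, 2.0]),
              "g2": np.array([3.0, 4.0]),
              "g3": np.array([10.0, 10.0])}
print(gmep("ACGT", upstream, expression))  # mean of g1 and g2 profiles
```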
We present a new class discovery method for microarray gene expression data. Based on a collection of gene expression profiles from different tissue samples, the method searches for binary class distinctions in the set of samples that show clear separation in the expression levels of specific subsets of genes. Several mutually independent class distinctions may be found, which is difficult to achieve with most commonly used clustering algorithms. Each class distinction can be biologically interpreted in terms of its supporting genes. The mathematical characterization of the favored class distinctions is based on statistical concepts. By analyzing three data sets from cancer gene expression studies, we demonstrate that our method is able to detect biologically relevant structures, for example, cancer subtypes, in an unsupervised fashion.
We present CLIFF, an algorithm for clustering biological samples using gene expression microarray data. This clustering problem is difficult for several reasons, in particular the sparsity of the data, the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant. Our algorithm iterates between two computational processes, feature filtering and clustering. Given a reference partition that approximates the correct clustering of the samples, our feature filtering procedure ranks the features according to their intrinsic discriminability, relevance to the reference partition, and irredundancy to other relevant features, and uses this ranking to select the features to be used in the following round of clustering. Our clustering algorithm, which is based on the concept of a normalized cut, clusters the samples into a new reference partition on the basis of the selected features. On a well-studied problem involving 72 leukemia samples and 7130 genes, we demonstrate that CLIFF outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set.
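The alternation between feature filtering and clustering can be sketched as follows. This is a hedged toy version, not the CLIFF code: CLIFF ranks features by discriminability, relevance, and non-redundancy and clusters via a normalized cut, whereas this sketch substitutes a simple per-feature separation score and a deterministic 2-means, keeping only the filter-then-recluster loop.

```python
import numpy as np

def two_means(X):
    """Deterministic 2-means: seed with the two mutually farthest samples."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(D.argmax(), D.shape)
    c0, c1 = X[i], X[j]
    for _ in range(20):
        labels = (np.linalg.norm(X - c1, axis=1)
                  < np.linalg.norm(X - c0, axis=1)).astype(int)
        if 0 < labels.sum() < len(X):  # keep both clusters non-empty
            c0, c1 = X[labels == 0].mean(0), X[labels == 1].mean(0)
    return labels

def cliff_like(X, n_features=2, n_iter=5):
    labels = two_means(X)  # initial reference partition on all features
    for _ in range(n_iter):
        a, b = X[labels == 0], X[labels == 1]
        # Per-feature separation score: a crude stand-in for CLIFF's ranking
        # by discriminability, relevance, and non-redundancy.
        score = np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-9)
        keep = np.argsort(score)[-n_features:]  # most discriminative features
        labels = two_means(X[:, keep])          # recluster on them
    return labels

# Six samples: features 0 and 1 carry the class signal, 2 and 3 are noise.
X = np.array([[5, 5, .1, 2], [5, 6, .2, 0], [6, 5, .3, 1],
              [0, 0, .1, 0], [0, 1, .2, 2], [1, 0, .3, 1]], float)
print(cliff_like(X))  # groups samples 0-2 apart from samples 3-5
```

The point of the loop is the same as in CLIFF: a rough partition is good enough to identify the relevant features, and clustering on those features in turn yields a better partition.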
At the beginning of the third millennium, mankind for the first time is staring
at its genome. Within the same generation, the human brain will also stare at
machines that surpass its raw computing power and an interconnected world of
information processing devices approaching science fiction. Rapid progress in
biological and computer technologies raises profound questions about the nature
and boundaries of life, intelligence, and who we really are.
It may be of comfort to some to realize that modern biotechnology is in many
ways just a powerful extension of human agricultural and breeding practices.
Such practices have been ongoing for thousands of years. Throughout our
history, we have manipulated and selected the genomes of plants and animals,
and even our own. For most of our past, such control could be exerted only at
the macroscopic level of entire organisms. It was crude, slow, and cumbersome.
Today we can manipulate genomes directly at the microscopic level, the level of
single genes and their constituents. We are beginning to be able to edit genomes as
we edit computer programs, with a scope, speed, and precision that far exceed
those of evolution, rendering sexuality cumbersome and obsolete.
This workshop has two distinct themes. One is ``fiction science'',
extrapolating current trends and imagining some of the possible scenarios for
the future, regardless of their desirability. The second is bioethics in the
context of contemporary issues such as stem cell research, genetic therapies,
genetically modified food, human cloning, and gene patents.
Although the issues to be addressed are non-technical, they are no less
important or timely. In 1999, stem cells were voted ``Breakthrough of the
Year'' by Science. Today, the Human Genome Project is essentially completed and
the race to patent human genes, as well as engineered organisms, is on.
Human cloning is within reach, and the carbon-silicon boundary has begun to blur.
The workshop will provide a brief overview of these issues and a forum for free discussion and speculation. As scientists, we have a duty to explore these questions and share information with the community. If we don't, who will?