Protein folding is perhaps the most fundamental process associated with the
generation of functional structures in biology. There has been considerable
progress in the last few years in understanding the underlying principles that
govern this highly complex process. Central to much of this progress has
been the development of ideas as to the nature of the energy surface or
landscape for a folding reaction. These ideas have arisen from a
combination of theoretical analysis and experimental investigation (Dinner
et al., TIBS 25, 331-339, 2000). Of particular importance in the latter has
been the concerted application of a wide range of experimental techniques
each able to describe aspects of the structural changes taking place during
the folding process. NMR spectroscopy and protein engineering have both
been key methods in this approach because of their ability to provide
structural and dynamical information at the level of individual residues.
Recently, new approaches have been devised that combine experimental
data directly with simulation techniques to define the structures of key
species on the folding surface (Vendruscolo et al., Nature 409, 641-645,
2001).
Recently, much research has also focussed on the realisation that proteins can misfold in vivo and that this phenomenon is linked with a wide range of diseases, particularly those associated with modern highly developed societies. We have been investigating in particular the nature of the amyloidogenic diseases (which include Alzheimer's disease and the spongiform encephalopathies, e.g. BSE and CJD), in which protein misfolding leads to the aggregation of proteins, often into fibrillar or thread-like structures. One system of particular interest to us has been c-type lysozyme. This protein has for some time been one of our model systems for studying fundamental aspects of folding. The discovery that clinical cases of amyloidosis are connected with single point mutations in the lysozyme gene has therefore enabled us to explore the molecular basis of this disease in a well-defined model system (Booth et al., Nature 385, 787-793, 1997). This work has recently been extended by the discovery that many proteins not associated with clinical manifestations of disease can form amyloid structures in the laboratory under appropriately chosen conditions (Chiti et al., PNAS 96, 3590-3594, 1999; Fandrich et al., Nature 410, 165-166, 2001). Such findings have led us to put forward ideas as to the fundamental origin of the various diseases associated with the formation of amyloid structures, many of which are particularly associated with new practices or old age. We have also speculated more generally that the avoidance of aggregation could be a major driving force in the evolution of protein sequences and structures (Dobson, Phil. Trans. R. Soc. Lond. B356, 133-145, 2001).
Domains are the building blocks of all globular proteins, and are units of compact three-dimensional structure as well as evolutionary units. There is a limited repertoire of domain families, and these families are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products. The processes that produce new genes are duplication and recombination as well as gene fusion and fission. We attempt to gain an overview of these processes by studying the structural domains in the proteins of seven genomes from the three kingdoms of life: Eubacteria, Archaea and Eukaryota. We use the domain and superfamily definitions in the Structural Classification of Proteins (SCOP) database to map pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 624 out of the 764 superfamilies in SCOP in these genomes, and the 624 families occur in 585 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour. This type of pattern can be described by a scale-free network. Finally, we study domain repeats, compare the set of domain combinations in the genomes to those in the PDB, and discuss the implications for structural genomics.
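The mapping of adjacent domain pairs onto superfamily combinations described above can be illustrated with a minimal sketch. It assumes each protein is already given as an N-to-C ordered list of superfamily labels; the function names and the toy labels are hypothetical and are not taken from SCOP or from the study itself. Combinations are treated as unordered here, and repeats of the same superfamily are collected separately, which is a simplification.

```python
from collections import defaultdict


def adjacent_combinations(proteins):
    """Collect unordered pairs of different superfamilies found in adjacent
    domains, plus the superfamilies that occur as adjacent repeats."""
    pairs, repeats = set(), set()
    for domains in proteins:
        # zip pairs each domain with its immediate C-terminal neighbour
        for left, right in zip(domains, domains[1:]):
            if left == right:
                repeats.add(left)
            else:
                pairs.add(tuple(sorted((left, right))))
    return pairs, repeats


def partner_counts(pairs):
    """Number of distinct combination partners per superfamily."""
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    return {sf: len(others) for sf, others in partners.items()}


# Toy input: hypothetical superfamily labels, one list per protein.
toy_proteins = [["A", "B"], ["B", "A", "C"], ["A", "D"], ["A", "A"], ["E", "A"]]
toy_pairs, toy_repeats = adjacent_combinations(toy_proteins)
print(sorted(toy_pairs))
print(partner_counts(toy_pairs))
```

In the toy input one label combines with four partners while the others combine with one each, which is the kind of skewed partner distribution the abstract describes as scale-free.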
G protein coupled receptors (GPCRs) are found in great
numbers in most eukaryotic genomes. They are
responsible for sensing a staggering variety of
structurally diverse ligands, with their activation resulting
in the initiation of a variety of cellular signalling
cascades. The physiological response that is observed
following receptor activation is governed by the guanine
nucleotide-binding proteins (G proteins) to which a
particular receptor chooses to couple. Previous
investigations have demonstrated that the specificity of
the receptor-G protein interaction is governed by the
intracellular domains of the receptor. Despite many
studies it has proven very difficult to predict de novo,
from the receptor sequence alone, the G proteins to which
a GPCR is most likely to couple. We have used a data-mining approach,
combining pattern discovery with
membrane topology prediction, to find patterns of amino
acid residues in the intracellular domains of GPCR
sequences that are specific for coupling to a particular
functional class of G proteins. A prediction system was
then built on these discovered patterns. This approach
successfully predicted the G protein coupling specificity
of unknown sequences.
Such predictions should be of great use in providing in
silico characterisation of newly cloned receptor
sequences and for improving the annotation of GPCRs
stored in protein sequence databases.
Available at: http://www.ebi.ac.uk/ croning/coupling.html
Knowing the number of residue contacts in a protein is crucial for deriving constraints useful in modeling protein folding, protein structure, and/or scoring remote homology searches. Here we use an ensemble of bi-directional recurrent neural network architectures and evolutionary information to improve the state-of-the-art in contact prediction using a large corpus of curated data. The ensemble is used to discriminate between two different states of residue contacts, characterized by a contact number higher or lower than the average value of the residue distribution. The ensemble achieves performances ranging from 70.1% to 73.1% depending on the radius adopted to discriminate contacts (6Å to 12Å). These performances represent gains of 15% to 20% over baseline statistical predictors that always assign an amino acid to the most numerous state, and are 3% to 7% better than any previous method. Combining predictors for different radii further improves the performance. Server: http://promoter.ics.uci.edu/BRNN-PRED/
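The two-state target and the majority-state baseline mentioned above can be made concrete with a short sketch. It makes several assumptions: one representative coordinate per residue (which atom is used is not stated in the abstract), a threshold equal to the mean contact number over the residues supplied, and hypothetical function names. It does not reproduce the neural network ensemble.

```python
import math


def contact_numbers(coords, radius):
    """For each residue, count the other residues within `radius` angstroms."""
    return [
        sum(1 for j, b in enumerate(coords) if j != i and math.dist(a, b) <= radius)
        for i, a in enumerate(coords)
    ]


def two_state_labels(counts):
    """Label 1 if the contact number is above the mean, else 0 (assumed threshold)."""
    mean = sum(counts) / len(counts)
    return [1 if c > mean else 0 for c in counts]


def majority_baseline_accuracy(labels):
    """Accuracy of a predictor that always outputs the most numerous state."""
    ones = sum(labels)
    return max(ones, len(labels) - ones) / len(labels)


# Toy coordinates in angstroms (invented for illustration only).
toy_coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0), (30, 0, 0)]
toy_counts = contact_numbers(toy_coords, radius=6.0)
toy_labels = two_state_labels(toy_counts)
print(toy_counts, toy_labels, majority_baseline_accuracy(toy_labels))
```

Changing `radius` changes both the contact numbers and the class balance, which is why the reported accuracy and the baseline both depend on the radius chosen.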
Given a transmembrane protein, we wish to find related ones by a database search. Due to the strongly hydrophobic amino acid composition of transmembrane domains, suboptimal results are obtained when general-purpose scoring matrices such as BLOSUM are used. Recently, a transmembrane-specific score matrix called PHAT was shown to perform much better than BLOSUM. In this article, we derive a transmembrane score matrix family, called SLIM, which has several distinguishing features. In contrast to currently used matrices, SLIM is non-symmetric. The asymmetry arises because different background compositions are assumed for the transmembrane query and the unknown database sequences. We describe the mathematical model behind SLIM in detail and show that SLIM outperforms PHAT both on simulated data and in a realistic setting. Since non-symmetric score matrices are a new concept in database search methods, we discuss some important theoretical and practical issues.
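How two different background compositions lead to a non-symmetric matrix can be seen in a small sketch. This is not the SLIM derivation, which the abstract does not spell out; it assumes the generic log-odds form score(a, b) = log2(p(a, b) / (q_query(a) * q_db(b))), a two-letter alphabet, and invented probabilities chosen only for illustration.

```python
import math


def log_odds_matrix(joint, q_query, q_db):
    """Log-odds scores with separate backgrounds for the query residue `a`
    and the database residue `b` (assumed generic form, not SLIM itself)."""
    return {
        (a, b): math.log2(p / (q_query[a] * q_db[b]))
        for (a, b), p in joint.items()
    }


# Symmetric toy joint distribution over aligned pairs (invented values).
toy_joint = {("L", "L"): 0.4, ("L", "K"): 0.1, ("K", "L"): 0.1, ("K", "K"): 0.4}
tm_background = {"L": 0.7, "K": 0.3}   # hydrophobic-rich query (assumed)
db_background = {"L": 0.5, "K": 0.5}   # generic database (assumed)

asymmetric = log_odds_matrix(toy_joint, tm_background, db_background)
symmetric = log_odds_matrix(toy_joint, db_background, db_background)
print(asymmetric[("L", "K")], asymmetric[("K", "L")])
print(symmetric[("L", "K")], symmetric[("K", "L")])
```

With identical backgrounds the two off-diagonal scores coincide, as in conventional matrices; with different backgrounds they differ, so the score depends on which sequence is the query.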
Motivation: We present a framework to generate comprehensive overviews of protein-protein interactions. In the post-genomic view of cellular function, each biological entity is seen in the context of a complex network of interactions. Accordingly, we model functional space by representing protein-protein-interaction data as undirected graphs. We suggest a general approach to generate interaction maps of cellular networks in the presence of huge amounts of fragmented and incomplete data, and to derive representations of large networks which hide clutter while keeping the essential architecture of the interaction space. This is achieved by contracting the graphs according to domain-specific hierarchical classifications. The key concept here is the notion of induced interaction, which allows the integration, comparison and analysis of interaction data from different sources and different organisms at a given level of abstraction.
Results: We apply this approach to compute the overlap between the DIP compendium of interaction data and a dataset of yeast two-hybrid experiments. The architecture of this network is scale-free, as frequently seen in biological networks, and this property persists through many levels of abstraction. Connections in the network can be projected from higher levels of abstraction down to the level of individual proteins. As an example, we describe an algorithm for fold assignment by network context. This method currently predicts protein folds with 30% accuracy without requiring any detectable sequence similarity between the query protein and a protein of known structure. We used this algorithm to compile a list of structural assignments for previously unassigned genes from yeast. Finally, we discuss ways of using interaction networks to predict novel protein-protein interactions.
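The graph contraction and the notion of induced interaction described in the Motivation can be sketched as follows. The sketch assumes the interaction data are an undirected edge list and that the hierarchical classification has already been cut at one level, giving a flat protein-to-class mapping; the function name, protein identifiers and class labels are hypothetical.

```python
def contract(edges, classification):
    """Contract a protein interaction graph by a classification.

    Two classes share an induced interaction if at least one pair of their
    member proteins interacts. Proteins without a class are skipped."""
    induced = set()
    for u, v in edges:
        cu, cv = classification.get(u), classification.get(v)
        if cu is None or cv is None:
            continue  # incomplete annotation: no induced edge can be derived
        # sort so that (x, y) and (y, x) denote the same undirected edge
        induced.add(tuple(sorted((cu, cv))))
    return induced


# Toy data (invented identifiers and classes, for illustration only).
toy_edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p5")]
toy_classes = {"p1": "kinase", "p2": "kinase", "p3": "phosphatase", "p4": "transporter"}
print(sorted(contract(toy_edges, toy_classes)))
```

Because the same mapping can be applied to edge lists from different sources or organisms, the contracted graphs can be compared directly by set operations such as intersection, which is one simple way to compute an overlap at a chosen level of abstraction.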