Protein Structure and Modeling

Sunday July 22 9:00 - 9:45

Protein folding, molecular evolution, and human disease

Christopher M. Dobson, University of Oxford


Protein folding is perhaps the most fundamental process associated with the generation of functional structures in biology. There has been considerable progress in the last few years in understanding the underlying principles that govern this highly complex process. Central to much of this progress has been the development of ideas as to the nature of the energy surface or landscape for a folding reaction. These ideas have arisen from a combination of theoretical analysis and experimental investigation (Dinner et al., TIBS 25, 331-339, 2000). Of particular importance in the latter has been the concerted application of a wide range of experimental techniques each able to describe aspects of the structural changes taking place during the folding process. NMR spectroscopy and protein engineering have both been key methods in this approach because of their ability to provide structural and dynamical information at the level of individual residues. Recently, new approaches have been devised that combine experimental data directly with simulation techniques to define the structures of key species on the folding surface (Vendruscolo et al., Nature 409, 641-645, 2001).
Recently, much research has also focussed on the realisation that proteins can misfold in vivo and that this phenomenon is linked with a wide range of diseases, particularly those associated with modern highly developed societies. We have been investigating in particular the nature of the amyloidogenic diseases (that include Alzheimer's disease and the spongiform encephalopathies, e.g. BSE and CJD) in which protein misfolding leads to the aggregation of proteins, often into fibrillar or thread-like structures. One system of particular interest to us has been c-type lysozyme. This protein has been for some time one of our model systems for studying fundamental aspects of folding. The discovery that clinical cases of amyloidosis are connected with single point mutations in the lysozyme gene has therefore enabled us to explore the molecular basis of this disease in a well-defined model system (Booth et al., Nature 385, 787-793, 1997). This work has recently been extended by the discovery that many proteins not associated with clinical manifestations of disease can form amyloid structures in the laboratory under appropriately chosen conditions (Chiti et al., PNAS 96, 3590-3594, 1999; Fandrich et al., Nature 410, 165-166, 2001). Such findings have led us to put forward ideas as to the fundamental origin of the various diseases associated with the formation of amyloid structures, many of which are particularly associated with new practices or old age. We have also speculated more generally that the avoidance of aggregation could be a major driving force in the evolution of protein sequences and structures (Dobson, Phil. Trans. R. Soc. Lond. B356, 133-145, 2001).

Sunday July 22 9:45 - 10:10  

An insight into domain combinations

Gordana Apic, Julian Gough, MRC Laboratory of Molecular Biology; Sarah A. Teichmann, University College London


Domains are the building blocks of all globular proteins, and are units of compact three-dimensional structure as well as evolutionary units. There is a limited repertoire of domain families, and these families are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products. The processes that produce new genes are duplication and recombination as well as gene fusion and fission. We attempt to gain an overview of these processes by studying the structural domains in the proteins of seven genomes from the three kingdoms of life: Eubacteria, Archaea and Eukaryota. We use the domain and superfamily definitions in the Structural Classification of Proteins (SCOP) database to map pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 624 of the 764 superfamilies in SCOP in these genomes, and the 624 families occur in 585 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour. This type of pattern can be described by a scale-free network. Finally, we study domain repeats, compare the set of domain combinations in the genomes to those in PDB, and discuss the implications for structural genomics.
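The pairwise counting described above can be illustrated with a minimal sketch. It assumes each protein is already annotated as an ordered list of superfamily labels; the labels (sfA, sfB, ...) and function names are hypothetical, and the choices to keep pairs in N- to C-terminal order and to leave repeats of the same superfamily out of the partner counts are assumptions of this sketch, not details taken from the abstract.

```python
from collections import Counter, defaultdict

def pairwise_combinations(proteins):
    """Count adjacent superfamily pairs, reading each protein N- to C-terminus."""
    pair_counts = Counter()
    for domains in proteins:
        # zip(domains, domains[1:]) yields each pair of sequence neighbours
        for left, right in zip(domains, domains[1:]):
            pair_counts[(left, right)] += 1
    return pair_counts

def partner_counts(pair_counts):
    """Number of distinct partner superfamilies seen next to each superfamily."""
    partners = defaultdict(set)
    for left, right in pair_counts:
        if left != right:  # assumption: repeats are treated separately
            partners[left].add(right)
            partners[right].add(left)
    return {sf: len(p) for sf, p in partners.items()}

# Hypothetical toy annotation: one list of superfamily labels per protein
proteins = [
    ["sfA", "sfB"],
    ["sfA", "sfB", "sfC"],
    ["sfD"],
    ["sfA", "sfC"],
]
pairs = pairwise_combinations(proteins)
degrees = partner_counts(pairs)
print(pairs)
print(degrees)
```

The distribution of the values in degrees is the quantity one would inspect for the scale-free pattern mentioned above: many superfamilies with one or two partners and a few with many.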

Sunday July 22 10:50 - 11:15  

Prediction of the coupling specificity of G protein coupled receptors to their G proteins

Steffen Möller, Jaak Vilo, European Bioinformatics Institute; Michael D.R. Croning, European Bioinformatics Institute and University of Manchester


G protein coupled receptors (GPCRs) are found in great numbers in most eukaryotic genomes. They are responsible for sensing a staggering variety of structurally diverse ligands, with their activation resulting in the initiation of a variety of cellular signalling cascades. The physiological response that is observed following receptor activation is governed by the guanine nucleotide-binding proteins (G proteins) to which a particular receptor chooses to couple. Previous investigations have demonstrated that the specificity of the receptor-G protein interaction is governed by the intracellular domains of the receptor. Despite many studies it has proven very difficult to predict de novo, from the receptor sequence alone, the G proteins to which a GPCR is most likely to couple. We have used a data-mining approach, combining pattern discovery with membrane topology prediction, to find patterns of amino acid residues in the intracellular domains of GPCR sequences that are specific for coupling to a particular functional class of G proteins. A prediction system was then built upon these discovered patterns. We report that this approach was successful in predicting the G protein coupling specificity of unknown sequences. Such predictions should be of great use in providing in silico characterisation of newly cloned receptor sequences and for improving the annotation of GPCRs stored in protein sequence databases. Available at: croning/coupling.html

Sunday July 22 11:15 - 11:40  

Improved prediction of the number of residue contacts in proteins by recurrent neural networks

Gianluca Pollastri, Pierre Baldi, University of California, Irvine; Pietro Fariselli, Rita Casadio, University of Bologna


Knowing the number of residue contacts in a protein is crucial for deriving constraints useful in modeling protein folding, protein structure, and/or scoring remote homology searches. Here we use an ensemble of bi-directional recurrent neural network architectures and evolutionary information to improve the state of the art in contact prediction using a large corpus of curated data. The ensemble is used to discriminate between two different states of residue contacts, characterized by a contact number higher or lower than the average value of the residue distribution. The ensemble achieves performances ranging from 70.1% to 73.1% depending on the radius adopted to discriminate contacts (6Å to 12Å). These performances represent gains of 15% to 20% over baseline statistical predictors that always assign an amino acid to the most numerous state, and are 3% to 7% better than any previous method. Combining predictors for different radii further improves the performance. Server:
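The two-state target and the majority-state baseline described above can be made concrete with a small sketch. It does not reproduce the recurrent neural network ensemble. It assumes one coordinate per residue (for example a C-alpha position), takes the average over all residues of a single toy chain, and does not exclude sequence neighbours; these choices, the coordinates and the function names are all assumptions made for illustration.

```python
import math

def contact_numbers(coords, radius):
    """For each residue, count the other residues within `radius` angstroms."""
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= radius:
                counts[i] += 1
                counts[j] += 1
    return counts

def two_state_labels(counts):
    """Label 1 if the contact number is above the mean, otherwise 0."""
    mean = sum(counts) / len(counts)
    return [1 if c > mean else 0 for c in counts]

def majority_baseline_accuracy(labels):
    """Accuracy of a predictor that always outputs the most numerous state."""
    ones = sum(labels)
    return max(ones, len(labels) - ones) / len(labels)

# Hypothetical toy chain: five residues spaced 4 angstroms apart on a line
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0),
          (12.0, 0.0, 0.0), (16.0, 0.0, 0.0)]
counts = contact_numbers(coords, radius=6.0)
labels = two_state_labels(counts)
baseline = majority_baseline_accuracy(labels)
print(counts, labels, baseline)
```

A predictor is then judged by how far its per-residue accuracy exceeds the value returned by majority_baseline_accuracy at the same radius.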

Sunday July 22 11:40 - 12:05  

Non-symmetric score matrices and the detection of homologous transmembrane proteins

Tobias Müller, Sven Rahmann, Deutsches Krebsforschungszentrum and MPI für Molekulare Genetik; Marc Rehmsmeier, Deutsches Krebsforschungszentrum


Given a transmembrane protein, we wish to find related ones by a database search. Due to the strongly hydrophobic amino acid composition of transmembrane domains, suboptimal results are obtained when general-purpose scoring matrices such as BLOSUM are used. Recently, a transmembrane-specific score matrix called PHAT was shown to perform much better than BLOSUM. In this article, we derive a transmembrane score matrix family, called SLIM, which has several distinguishing features. In contrast to currently used matrices, SLIM is non-symmetric. The asymmetry arises because different background compositions are assumed for the transmembrane query and the unknown database sequences. We describe the mathematical model behind SLIM in detail and show that SLIM outperforms PHAT both on simulated data and in a realistic setting. Since non-symmetric score matrices are a new concept in database search methods, we discuss some important theoretical and practical issues.
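How different background compositions lead to a non-symmetric matrix can be seen in a generic log-odds sketch. This is not the SLIM derivation, which the abstract does not give. It assumes a score of the form log2(p(a,b) / (q_query(a) * q_db(b))), a two-letter toy alphabet, and made-up, uncalibrated probabilities; every name and number below is hypothetical.

```python
import math

def log_odds_matrix(joint, query_bg, db_bg):
    """Score s(a, b) = log2( joint[a, b] / (query_bg[a] * db_bg[b]) ).

    `a` indexes the query residue and `b` the database residue, so using
    different backgrounds for the two axes makes the matrix non-symmetric.
    """
    return {
        (a, b): math.log2(joint[(a, b)] / (query_bg[a] * db_bg[b]))
        for a in query_bg
        for b in db_bg
    }

# Hypothetical aligned-pair probabilities over a toy alphabet {L, K}
joint = {("L", "L"): 0.5, ("L", "K"): 0.1,
         ("K", "L"): 0.1, ("K", "K"): 0.3}
query_bg = {"L": 0.6, "K": 0.4}  # assumed hydrophobic-rich query background
db_bg = {"L": 0.3, "K": 0.7}     # assumed general database background

scores = log_odds_matrix(joint, query_bg, db_bg)
symmetric = log_odds_matrix(joint, query_bg, query_bg)
print(scores[("L", "K")], scores[("K", "L")])
print(symmetric[("L", "K")], symmetric[("K", "L")])
```

With a single shared background the two off-diagonal scores coincide; with separate query and database backgrounds they differ, so the score depends on which sequence is the query.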

Sunday July 22 12:05 - 12:30  

Generating protein interaction maps from incomplete data: application to fold assignment

Michael Lappe, Jong Park, European Bioinformatics Institute; Oliver Niggemann, University of Paderborn; Liisa Holm, European Bioinformatics Institute


Motivation: We present a framework to generate comprehensive overviews of protein-protein interactions. In the post-genomic view of cellular function, each biological entity is seen in the context of a complex network of interactions. Accordingly, we model functional space by representing protein-protein-interaction data as undirected graphs. We suggest a general approach to generate interaction maps of cellular networks in the presence of huge amounts of fragmented and incomplete data, and to derive representations of large networks which hide clutter while keeping the essential architecture of the interaction space. This is achieved by contracting the graphs according to domain-specific hierarchical classifications. The key concept here is the notion of induced interaction, which allows the integration, comparison and analysis of interaction data from different sources and different organisms at a given level of abstraction.
Results: We apply this approach to compute the overlap between the DIP compendium of interaction data and a dataset of yeast two-hybrid experiments. The architecture of this network is scale-free, as frequently seen in biological networks, and this property persists through many levels of abstraction. Connections in the network can be projected from higher levels of abstraction down to the level of individual proteins. As an example, we describe an algorithm for fold assignment by network context. This method currently predicts protein folds with 30% accuracy without any requirement of detectable sequence similarity of the query protein to a protein of known structure. We used this algorithm to compile a list of structural assignments for previously unassigned genes from yeast. Finally, we discuss ways forward to use interaction networks for the prediction of novel protein-protein interactions.
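The idea of contracting an interaction graph along a classification, and comparing two datasets at that level, might be sketched as follows. The abstract does not give the authors' algorithm, so this is only one plausible reading of "induced interaction": two classes interact if any pair of their member proteins does. The protein identifiers, class labels and function name are hypothetical, and proteins without a class are simply skipped.

```python
def induced_interactions(edges, classification):
    """Contract a protein-level undirected graph to the level of classes.

    Each protein-protein edge (u, v) induces an undirected edge between
    the classes of u and v; an edge within one class becomes a
    one-element frozenset.
    """
    induced = set()
    for u, v in edges:
        cu, cv = classification.get(u), classification.get(v)
        if cu is None or cv is None:
            continue  # assumption: unclassified proteins are ignored
        induced.add(frozenset((cu, cv)))
    return induced

# Hypothetical classification of proteins into classes at one level
classification = {"P1": "F1", "P2": "F2", "P3": "F1",
                  "P4": "F3", "P5": "F1", "P6": "F2"}
edges_a = [("P1", "P2"), ("P3", "P4")]  # toy dataset A
edges_b = [("P5", "P6")]                # toy dataset B

induced_a = induced_interactions(edges_a, classification)
induced_b = induced_interactions(edges_b, classification)
overlap = induced_a & induced_b
print(overlap)
```

In this toy case the two datasets share no protein-level edge, yet they overlap at the class level, which is the kind of comparison across sources that the contraction is meant to allow.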