Protein Structure

73.Computational Structural Genomics
74.Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function
75.Using Surface Envelopes in 3D Structure Modeling
76.Molecular modelling in studies of SDR and MDR proteins
77.Consensus Predictions of Membrane Protein Topology
78.Development of New statistical measures of protein model quality and its application for consensus prediction of protein structure
79.Signal filtration methods to extract structural information from evolutionary data applied to G protein-coupled receptor (GPCR) transmembrane domains
80.Using clusters to derive structural commonality for ATP binding sites
81.Aromaticity of Domains in Photosynthetic Reaction Centers; A Clue to the Protein's Control of Energy Dissipation during Enzymatic Reactions
82.LIGPROT: A database for the analysis and visualization of ligand binding.
83.ThreadMAP: Protein Secondary Structure Determination
84.FAUST, an algorithm for functional annotations of protein structures using structural templates.
85.Predicting structural features in protein segments
86.GA Generates New Amino Acid Indices through Comparison between Native and Random Sequences
87.STING Millennium: Web based suite of programs for comprehensive and simultaneous analysis of structure and sequence
88.Side chain-positioning as an integer programming problem
89.Prediction of the quality of protein models using neural networks
90.Targeting proteins with novel folds for structural genomics
91.Protein Structural Domain Parsing by Consensus Reasoning
92.Attempt to optimise template selection in protein homology modelling using logical feature descriptors
93.Prediction of amyloid fibril-forming proteins
94.Evaluation of structure prediction models using the ProML specification language
95.Incremental Volume Minimization of Proteins (represented by Collagen Type I (local minimization))
96.Automatic Inference of Protein Quaternary Structure from Crystallographic Data.
97.Modelling Class II MHC Molecules using Constraint Logic Programming
98.DoME: Rapid Molecular Docking with Adaptive Mesh Solutions to the Poisson-Boltzmann Equation
99.Electrostatic potential surface and molecular dynamics of HIV-1 protease Brazilian mutants
100.Automated functional annotation of protein structures
101.Structural annotation of the human genome
102.Estimation of p-values for global alignments of protein sequences.
103.Sequence and Structure Conservation Patterns of the Gelsolin Fold

73. Computational Structural Genomics
Steven E. Brenner, University of California, Berkeley;
Short Abstract:

Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. At the Berkeley Structural Genomics Center, we focus on the organisms Mycoplasma pneumoniae and M. genitalium. Computational components include selecting protein targets, managing experimental data, and analyzing solved structures.

One Page Abstract:

Structural genomics aims to provide a good experimental structure or computational model of every tractable protein in a complete genome. Underlying this goal is the immense value of protein structure, especially in permitting recognition of distant evolutionary relationships for proteins whose sequence analysis has failed to find any significant homolog. A considerable fraction of the genes in all sequenced genomes have no known function, and structure determination provides a direct means of revealing homology that may be used to infer their putative molecular function. The solved structures will be similarly useful for elucidating the biochemical or biophysical role of proteins that have been previously ascribed only phenotypic functions. More generally, knowledge of an increasingly complete repertoire of protein structures will aid structure prediction methods, improve understanding of protein structure, and ultimately lend insight into molecular interactions and pathways.

We use computational methods to select families whose structures cannot be predicted and which are likely to be amenable to experimental characterization. The methods employed include modern sequence analysis and clustering algorithms. We also consult the PRESAGE database for structural genomics, which records the community’s experimental work underway and computational predictions. The protein families are ranked according to several criteria, including taxonomic diversity and known functional information. Individual proteins, often homologs from hyperthermophiles, are selected from these families as targets for structure determination. The solved structures are examined for structural similarity to other proteins of known structure. Homologous proteins in sequence databases are computationally modeled to provide a resource of protein structure models complementing the experimentally solved protein structures.


Brenner SE, Levitt M. 2000. Expectations from structural genomics. Protein Sci. 9:197-200.

Brenner SE. 1999. Errors in genome annotation. Trends Genet 15:132-133.

Brenner SE, Barken D, Levitt M. 1999. The PRESAGE database for structural genomics. Nucleic Acids Res 27:251-253.

Brenner SE, Chothia C, Hubbard TJP. 1998. Assessing sequence comparison methods with reliable structurally-identified distant evolutionary relationships. Proc Natl Acad Sci USA 95:6073-6078.

Brenner SE, Chothia C, Hubbard TJP. 1997. Population statistics of protein structures. Curr Opin Struct Biol 7:369-376.

Brenner SE, Chothia C, Hubbard TJP, Murzin AG. 1996. Understanding protein structure: Using SCOP for fold interpretation. Meth Enzymol 266:635-643.

Brenner SE, Hubbard T, Murzin A, Chothia C. 1995. Gene duplications in the H. influenzae genome. Nature 378:140.

Brenner SE. 1995. World wide web and molecular biology. Science 268:622-623.

74. Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function
Iddo Friedberg, Hanah Margalit, The Hebrew University, Jerusalem;
Short Abstract:

This study addresses the problem of proteins that have the same fold but no sequence similarity. Using a database of such protein pairs, we analyze those positions which are mutually, persistently conserved among close and distant family members. In many cases those positions play a role in function and/or fold.

One Page Abstract:

Many protein pairs that share the same fold do not have any detectable sequence similarity, providing a valuable source of information for studying the sequence-structure relationship. In this study we use a stringent data set of structurally-similar, sequence-dissimilar protein pairs to characterize residues which may play a role in the determination of protein structure and/or function. For each protein in the database we identify amino-acid positions that show residue conservation within both close and distant family members. These positions are termed persistently conserved. We then determine the mutually persistently conserved positions: those structurally aligned positions in a protein pair that are persistently conserved in both pair-mates. Due to their intra- and inter-family conservation, these positions are good candidates for determining protein fold and function. We find that about 50% of the persistently conserved positions are mutually conserved. A significant fraction of them are located in critical positions for secondary structure determination, they are mostly buried, and many of them form spatial clusters within their protein structures. A substitution matrix based on the subset of persistently mutually conserved positions shows two distinct characteristics: (i) it is different from other available matrices, even those that are derived from structural alignments; (ii) it contains a significant amount of mutual information, emphasizing the special residue restrictions imposed on these positions. Such a substitution matrix should be valuable for protein design experiments.
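The two definitions above can be illustrated with a minimal sketch. It assumes that conserved positions are already available as sets of residue indices and that the structural alignment is a list of index pairs; the function names and the toy numbers are hypothetical and are not taken from the study.

```python
def persistently_conserved(close_conserved, distant_conserved):
    """Positions conserved among both close and distant family members."""
    return close_conserved & distant_conserved


def mutually_conserved(conserved_a, conserved_b, alignment):
    """Aligned position pairs that are persistently conserved in both pair-mates.

    alignment: list of (index_in_a, index_in_b) pairs from a structural alignment.
    """
    return [(i, j) for i, j in alignment if i in conserved_a and j in conserved_b]


# Toy data (hypothetical): residue indices flagged as conserved in protein A
close_a = {3, 7, 12, 20}
distant_a = {7, 12, 20, 31}
conserved_a = persistently_conserved(close_a, distant_a)

# Persistently conserved positions of the structurally similar pair-mate B
conserved_b = {5, 14, 22}

# Structural alignment between A and B (hypothetical)
alignment = [(7, 5), (12, 13), (20, 22), (31, 30)]

mutual = mutually_conserved(conserved_a, conserved_b, alignment)
fraction_a = len(mutual) / len(conserved_a)
```

With real data, the mutual pairs would be the positions used to build the substitution matrix described above.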

75. Using Surface Envelopes in 3D Structure Modeling
Jonathan M. Dugan, Glenn A. Williams, Russ B. Altman, Stanford Medical Informatics;
Short Abstract:

Our group has built unified data structures and algorithms that are highly flexible and applicable to a variety of different data types for modeling macromolecular structures. This poster outlines the development, implementation, and results of algorithms capable of integrating surface shape data into the 3D structure modeling process.

One Page Abstract:

Modeling the 3D structure of biological macromolecules assists in the understanding of biological function, and can assist in the discovery of novel pharmaceuticals. Current crystallographic methods for structure determination have been very successful, but are not applicable in all cases. Fortunately, other experimental methods can provide useful data regarding biomolecular structure, although typically these data are noisy and sparse. The sources of these data include those that provide distances (such as NMR, binding, affinity, and crosslinking measurements) as well as those that produce other types of structure information, such as solvent accessibility and overall geometric features (for example, volume or the shape of enclosing surface envelopes (SE)). Our group has focused on building unified data structures and algorithms that are highly flexible and applicable to a variety of different data types, with the goal of combining these heterogeneous data to maximize their utility in modeling macromolecular structures. This poster outlines the development and implementation of algorithms capable of integrating SE data into the 3D structure modeling process. I present the results of modeling several proteins and test structures with distance data and SE data derived from solved structures.

76. Molecular modelling in studies of SDR and MDR proteins
Erik Nordling, Bengt Persson, MBB, SBC, Karolinska Institutet;
Short Abstract:

The presentation covers the use of molecular modelling methods in studies of the medium-chain dehydrogenases/reductases (MDR) and short-chain dehydrogenases/reductases (SDR) protein families. In particular, a subclassification of the MDR family is described, and the substrate specificity of the endoplasmic reticulum-associated amyloid beta-binding protein (ERAB) is investigated.

One Page Abstract:

The wealth of structural information available through the Protein Data Bank (PDB) may be extended to structural neighbours using homology modelling. The technique may be used routinely down to 40% sequence identity to yield accurate models if there are no large insertions or deletions in the alignment. Proteins with lower sequence identities can be modelled to reasonable accuracy, but require considerably more care in the modelling process.

We have employed these techniques on members of the protein families SDR (Short-chain Dehydrogenases/Reductases) and MDR (Medium-chain Dehydrogenases/Reductases). In the first case we modelled ERAB (Endoplasmic Reticulum associated Amyloid β-peptide Binding Protein) from 7alpha-Hydroxysteroid Dehydrogenase (27% sequence identity) and obtained a structure that is compatible with known enzymatic data. X-ray crystallography later verified the core parts of the modelled structure. Recently, we have tried to use homology modelling to further clarify the evolutionary relationships within subgroups of the MDR family.

We have also used various docking methods to investigate substrate specificity and binding mechanisms. This has been applied to ADH class I beta and gamma isozymes and ERAB, giving results compatible with kinetic data.

77. Consensus Predictions of Membrane Protein Topology
Johan Nilsson, Bengt Persson, Gunnar von Heijne, Stockholm Bioinformatics Centre;
Short Abstract:

Consensus predictions of membrane protein topology might provide a means to estimate the reliability of predicted topologies. Using five topology prediction methods according to a “majority-vote” principle, we found that the topology of nearly half of all E. coli inner membrane proteins can be predicted with high reliability (>90% correct predictions).

One Page Abstract:

Computational methods for identification and characterisation of integral membrane proteins will become increasingly important as the number of completely sequenced genomes increases. At present, several methods are available for prediction of integral membrane protein topology, and the approaches employed include neural networks, hidden Markov models, multiple sequence alignments and dynamic programming. Considering the large proportion of transmembrane proteins in a typical genome (20-25%), even a slight improvement in the ability to predict membrane protein topology will have major effects on e.g. automatic sequence annotation. In this study we have explored the possibility that consensus predictions of membrane protein topology might provide a means to estimate the reliability of a certain predicted topology. Our intention was to improve topology predictions by combining the results obtained from a number of methods according to a “majority-vote” principle. We used five popular topology prediction methods: TMHMM, HMMTOP, MEMSAT, TOPPRED and PHD. Our results show that the fraction of correctly predicted topologies over a test set of 60 Escherichia coli inner membrane proteins with experimentally determined topologies increases with the number of methods that agree. The topology of nearly half of the sequences can be predicted with high reliability (>90% correct predictions) using our approach.
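The “majority-vote” principle can be sketched as follows. This is only an illustration: the representation of a topology as a (number of helices, N-terminus location) tuple and the example predictions are assumptions, not the output format of the five programs.

```python
from collections import Counter


def consensus_topology(predictions):
    """Return the most frequent topology and the number of methods that agree.

    predictions: dict mapping a method name to its predicted topology.
    Ties are resolved in favour of the topology seen first.
    """
    counts = Counter(predictions.values())
    topology, votes = counts.most_common(1)[0]
    return topology, votes


# Hypothetical predictions for one protein: (number of TM helices, N-terminus side)
predictions = {
    "TMHMM": (12, "in"),
    "HMMTOP": (12, "in"),
    "MEMSAT": (12, "in"),
    "TOPPRED": (11, "out"),
    "PHD": (12, "in"),
}

topology, votes = consensus_topology(predictions)
```

The number of agreeing methods (here `votes`) is the quantity whose relation to prediction accuracy is examined in the abstract.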

78. Development of New statistical measures of protein model quality and its application for consensus prediction of protein structure
Fang Huisheng and Arne Elofsson, Stockholm Bioinformatics Center, Stockholm University;
Short Abstract:

In this study we have used better statistical measures of the similarity between a protein-model and the correct structure. These new measures have been used to improve the performance of Pcons, a consensus based fold recognition method. We show that using these new measures we obtain better predictions.

One Page Abstract:

Development of New statistical measures of protein model quality and its application for consensus prediction of protein structure

Fang Huisheng and Arne Elofsson

Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden

More and more methods for predicting protein structure have been developed based on different algorithms and information. It has recently been confirmed that different methods produce the best predictions for different targets, and that the final prediction accuracy could be improved if the available methods were combined in a perfect manner (Lundström et al, 2001). Recent studies show that a statistical measure, i.e. a P-value, assessing the similarity between a model and the structure can be developed (Levitt & Gerstein, 1998), and that it is a good measure of protein model quality. A score (the LGscore) was used for this purpose; however, it has been shown that if the number of matched residues is less than 120, the distribution does not follow the curves used to calculate the P-value. This means that the P-value should be recalculated so that it represents the statistics correctly.

In the present work, we have first recalculated the P-values as a function of the number of aligned residues. We use two functions, one describing the average score and another the standard deviation. These functions describe the behavior of the score from 10 aligned residues to more than 300. Based on them, we calculate a new P-value using an extreme value distribution, as done by Levitt & Gerstein. The new P-values do not show the same dependence on fragment size as the old ones.
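A minimal sketch of such a length-dependent P-value is given below. The two functions `mean_score` and `sd_score` are hypothetical placeholders, since the fitted functions are not given in the abstract; only the conversion from a mean and standard deviation to a Gumbel (extreme value) tail probability is standard.

```python
import math

EULER_GAMMA = 0.5772156649015329


def mean_score(n_aligned):
    # HYPOTHETICAL placeholder for the fitted average score of random matches
    return 2.0 * math.log(n_aligned)


def sd_score(n_aligned):
    # HYPOTHETICAL placeholder for the fitted standard deviation
    return 0.5 + 0.1 * math.log(n_aligned)


def evd_p_value(score, n_aligned):
    """Probability of a random score >= `score` under a Gumbel distribution
    whose mean and standard deviation depend on the number of aligned residues."""
    beta = sd_score(n_aligned) * math.sqrt(6.0) / math.pi   # Gumbel scale
    loc = mean_score(n_aligned) - EULER_GAMMA * beta        # Gumbel location
    z = (score - loc) / beta
    # P = 1 - exp(-exp(-z)); the exponent is clamped to avoid overflow
    return -math.expm1(-math.exp(min(-z, 700.0)))


p_short = evd_p_value(12.0, 40)    # same raw score, short alignment
p_long = evd_p_value(12.0, 300)    # same raw score, long alignment
```

Because the mean and standard deviation are functions of the alignment length, the same raw score maps to different P-values for short and long fragments, which is the kind of length correction described above.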

In CAFASP2 it was observed that very good models for short targets did not obtain a significant LGscore. The reason is that even a perfect structural similarity for a short protein is not very significant. To overcome this problem when scoring models, we have created a new score (the Q-value) that depends on the length of the target. It was calculated from the P-values for models with 30-50% sequence identity. Using the new and old LGscore and the Q-value, Pcons consensus predictors combining seven servers have been developed. The procedure is as follows: we first compare the two kinds of similarity (e.g. by the new LGscore), model to model and model to target structure, for about 199 targets from LiveBench2. We then build two models, using multiple linear regression and neural networks respectively, to describe the relationship between model-model similarity and model-target similarity for the new LGscore, the old LGscore and the Q-value. The performance trial shows that the model based on the new LGscore is better than the one based on the old LGscore.


1. Lundström et al. Pcons: A neural network based consensus predictor that improves fold recognition. 2001.

2. Siew et al. MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 2000, 16:776-785.

3. Zemla et al. Processing and analysis of CASP3 protein structure predictions. Proteins: Structure, Function, and Genetics 1999, Suppl 3:22-29.

4. Levitt et al. A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. USA 1998, 95:5913-592

79. Signal filtration methods to extract structural information from evolutionary data applied to G protein-coupled receptor (GPCR) transmembrane domains
L. Marsh, Dept. of Biology, Long Island University;
Short Abstract:

The conservation score of GPCR amino acid residues was treated as a noisy signal containing information about solvent accessibility of the residues. This conservation signal was subjected to a Fourier-based filtration/reconstruction method to extract structural information. Solvent accessibility of transmembrane residues was correctly predicted for a series of diverse opsins.

One Page Abstract:

The relationship between the structure of a protein and the rate of evolution of specific residues is intriguing and complex. Solvent-exposed residues often evolve more rapidly than residues involved in structural contacts. We consider residue conservation during evolution as a signal, albeit an extremely noisy signal, reflecting solvent protection and other structural information. Using a signal filtration/reconstruction approach with a novel wavelet-based Fourier approximation, solvent exposure of amino acid residues could be predicted from residue conservation data. The degree of conservation at each position of the TM was calculated for random clusters of TM sequences drawn from a pool of 158 opsins. Aligned sequences were compared using a modified BLOSUM substitution matrix to generate a series representing the degree of conservation at each amino acid position (‘conservation signal’). The unfiltered conservation signals exhibited a weak, but positive, Pearson correlation coefficient with solvent inaccessibility of residues in the rhodopsin structure, supporting a relationship between substitution rate and accessibility in this system. Solvent accessibility of the alpha-helical TMs has a periodicity of about 3.6 in rhodopsin. Fourier analysis confirmed that the conservation signal contained structural information, but simple Fourier methods did not yield robust predictions. A filter was designed that permitted enhancement of alpha-helical patterns, accommodation of helix breaks, and a waveform correction for the fact that most residues in the structure are not solvent exposed. This filter was implemented as a wavelet-based Fourier-filtration approximation and produced prediction success rates of >95% for the tested (relatively uniform) TM1 and TM7. The method is now being applied to other systems.
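The simple Fourier check mentioned above (not the wavelet-based filter itself, whose details are not given) can be sketched as the spectral power of a conservation signal at a chosen periodicity. The function name and the synthetic signal are assumptions for illustration only.

```python
import cmath
import math


def periodicity_power(signal, period):
    """Spectral power of a mean-centred signal at the given period (in residues)."""
    mean = sum(signal) / len(signal)
    omega = 2.0 * math.pi / period
    total = sum((x - mean) * cmath.exp(-1j * omega * k) for k, x in enumerate(signal))
    return abs(total) ** 2 / len(signal)


# Synthetic conservation signal with an ideal alpha-helical periodicity of 3.6
signal = [math.cos(2.0 * math.pi * k / 3.6) for k in range(36)]

power_helix = periodicity_power(signal, 3.6)
power_strand = periodicity_power(signal, 2.0)
```

For a real, noisy conservation signal, a peak near a period of 3.6 residues would indicate that the signal carries helical structural information.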

80. Using clusters to derive structural commonality for ATP binding sites
Yosef Yehuda Kuttner, Mariana Babor, Marvin Edelman, Vladimir Sobolev, Weizmann Institute of Science;
Short Abstract:

We developed a method for structural multiple alignment of binding sites with a given ligand, in order to search for similarities in spatial arrangement of binding pocket atoms. An algorithm for identifying clusters of atoms was developed. Our strategy seeks commonalities in arrangement of contacting atoms around rigid ligand components.

One Page Abstract:

Keywords: atomic contacts, molecular recognition, cluster, adenine

We are developing a method for structural multiple alignment of binding sites with a given ligand, in order to search for similarities in the spatial arrangement of binding pocket atoms. An algorithm for identifying clusters of atoms was developed for this task. For a given flexible ligand, the binding pocket shape in different target proteins might vary considerably due to different ligand conformations. Our strategy was, therefore, to seek commonalities in the arrangement of contacting atoms around rigid, or almost rigid, ligand components. The rigid (or almost rigid) chemical moiety from different files was superimposed. LPC software [1] was then used to determine the protein atoms in contact with the ligand and to classify the atom contacts according to their physico-chemical properties. A search for atomic clusters was then conducted. Atoms were defined as belonging to a cluster if they were within a given distance of each other and came from different PDB entries. Additionally, members of a cluster were required to form attractive contacts with the ligand. The ATP molecule was chosen for this study. We selected the rigid adenine ring moiety as a test object. A non-redundant dataset of 14 PDB entries of ATP-protein complexes (resolution of 2.2Å or better) was analyzed. The adenine rings of the 14 files were superimposed with concerted movement of the protein atoms in contact with the rings.
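The cluster definition above (atoms within a given distance of each other and coming from different PDB entries) can be sketched as follows. The abstract does not specify the linkage rule, so single-linkage grouping is an assumption here, and the entry identifiers and coordinates are hypothetical. The atoms are assumed to be already superimposed and pre-filtered to those forming attractive contacts with the ligand.

```python
import math
from itertools import combinations


def find_clusters(atoms, cutoff):
    """Group superimposed contact atoms into clusters.

    atoms: list of (pdb_id, (x, y, z)) tuples.
    Two atoms are linked if they come from different entries and lie within
    `cutoff` of each other; clusters are the connected groups (size >= 2).
    """
    parent = list(range(len(atoms)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(atoms)), 2):
        (id_i, xyz_i), (id_j, xyz_j) = atoms[i], atoms[j]
        if id_i != id_j and math.dist(xyz_i, xyz_j) <= cutoff:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(atoms)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


# Hypothetical superimposed atoms (entry identifiers are placeholders)
atoms = [
    ("1aaa", (0.0, 0.0, 0.0)),
    ("2bbb", (0.5, 0.0, 0.0)),
    ("1aaa", (0.4, 0.0, 0.0)),
    ("3ccc", (10.0, 0.0, 0.0)),
]
clusters = find_clusters(atoms, cutoff=1.0)
```

Note that with single linkage two atoms from the same entry can end up in one cluster when a third atom from another entry bridges them; a stricter rule would need an extra check.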
Several groups have recently sought structural commonalities in nucleotide base recognition by proteins: Kobayashi and Go [2] found remarkable similarities despite considerable differences in primary sequence; Shi and Berg [3] used consensus sequences to construct novel proteins with increased DNA affinity in zinc finger proteins of the Cys-His2 type; Denessiouk & Johnson [4] found similarities in the relative positions of different nucleotide-base binding motifs along polypeptide chains from related proteins, although not in their three dimensional space; while Moodie et al. [5] found no specific recognition motif for adenylate in terms of particular residue/ligand interactions, although they found commonalities in shape and polarity properties at ligand/protein interfaces. In our work, hydrophobic clusters were found above and beneath the plane of the adenine ring (as previously indicated by Moodie et al. [5]); these included some hydrophilic atoms acting as proton donors hydrogen-bonded to the conjugated system. We also found two clusters containing atoms that form hydrogen bonds. The network of atomic clusters so determined was taken as the consensus binding-site structure for the adenine ring of ATP. We note that the hydrogen bond acceptor and donor clusters (in contact with N-6 and N-1, respectively) are in a geometric juxtaposition similar to that of the hydrogen bonds between the adenine and thymine base pairs in DNA. Cluster positions for the adenine ring were derived. Their relative arrangement served as a fingerprint to search for putative binding sites. When the searching procedure located 6 or more cluster positions, the correct binding site was found for all proteins tested, but usually there were multiple solutions (up to 25 putative pockets). We are now attempting to derive cluster positions for the ribose ring to reduce the number of incorrect solutions.

References

[1] Sobolev V., Sorokine A., Prilusky J., Abola E.E., Edelman M. (1999). Automated analysis of interatomic contacts in proteins. Bioinformatics, 15: 327-332.

[2] Kobayashi N., Go N. (1997). A method to search for similar protein local structures at ligand-binding sites and its application to adenine recognition. Eur. Biophys. J., 26: 135-144.

[3] Shi Y., Berg J.M. (1995). A direct comparison of the properties of natural and designed zinc finger proteins. Chem. Biol. 2: 83-89.

[4] Denessiouk K.A., Johnson M.S. (2000). When fold is not important: A common structural framework for adenine and AMP binding in 12 unrelated families. Proteins, 38: 310-326.

[5] Moodie S.L., Mitchell J.B.O., Thornton J.M. (1996). Protein recognition of adenylate: An example of a fuzzy recognition template. J. Mol. Biol., 263: 486-500.

81. Aromaticity of Domains in Photosynthetic Reaction Centers; A Clue to the Protein's Control of Energy Dissipation during Enzymatic Reactions
Ilan Samish, Avigdor Scherz, Plant Sciences Department, Weizmann Institute of Science, Rehovot, Israel;
Haim J Wolfson, School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel;
Short Abstract:

Photosynthetic reaction centers serve as model membrane proteins for studying structure-function relationship. Multiple structural alignment of reaction centers, combinatorial mutagenesis of conserved sites and analysis of the protein microenvironment along the electron-transfer pathway suggest that protein aromaticity is involved in controlling energy-dissipation and reactant-geometry during electron-transfer in an entropy/enthalpy mechanism.

One Page Abstract:

Photosynthetic reaction centers (RCs), which conduct light induced electron transfer (ET), may serve as model membrane proteins for studying functions of conserved 3D elements. First, based on the fact that structure is more conserved than sequence, multiple structural alignment (MUSTA algorithm) was conducted on RCs from oxygenic and non-oxygenic organisms. The algorithm was applied to the full RC and at the subunit level, resulting in a 'tree' of common cores in the different subgroups. A common core located around the 4-helix bundle center of the complex was found in all RCs compared, in which amino acids (AAs) of a particular attribute form clusters. These clusters suggested conservation of aromatic and of high packing AAs. Second, two conserved AAs in the D1 subunit of the photosystem II RC underwent combinatorial mutagenesis, yielding 11-12 photoautotrophic mutants at each site. Neither positively charged nor aromatic AAs were included. Third, the content of virtual tubes (radii of 2-5 Å) between the ET cofactors in the bacterial RC was examined. Findings included: 1. Tubes of up to 3.5 Å radius do not include backbone atoms. 2. All tubes have a uniform atom density. 3. A larger percentage of non-aromatic AAs is found in the slower ET rate domains. 4. The active branch contains a larger fraction of aromatic AAs than the inactive one. We propose that non-aromatic AAs enable entropic changes required for energy dissipation in the slow ET milieu, while rigid domains optimize reactant geometry required in the fast ET domains. These findings are proposed to shed light on the protein management of two contradictory prerequisites: a need to position reactants in a precise configuration during the electronic density migration, and an opposing need to rapidly dissipate the evolved energy in order to avoid the backward reaction.
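The virtual-tube analysis can be illustrated with a small geometric sketch: collect the atoms lying within a given radius of the straight segment joining two cofactor positions, then compute the fraction belonging to aromatic residues. The atom records, the coordinates and the choice of PHE, TYR and TRP as the aromatic set are assumptions for illustration, not the definitions used in the study.

```python
import math

AROMATIC = {"PHE", "TYR", "TRP"}  # assumed aromatic residue set


def atoms_in_tube(atoms, start, end, radius):
    """Atoms whose centres lie within `radius` of the segment from start to end."""
    axis = [e - s for s, e in zip(start, end)]
    length_sq = sum(a * a for a in axis)
    inside = []
    for atom in atoms:
        rel = [p - s for p, s in zip(atom["xyz"], start)]
        # Fractional position of the atom's projection along the tube axis
        t = sum(r * a for r, a in zip(rel, axis)) / length_sq
        if 0.0 <= t <= 1.0:
            closest = [s + t * a for s, a in zip(start, axis)]
            if math.dist(atom["xyz"], closest) <= radius:
                inside.append(atom)
    return inside


def aromatic_fraction(atoms):
    """Fraction of atoms that belong to aromatic residues (0.0 for an empty list)."""
    if not atoms:
        return 0.0
    return sum(a["res"] in AROMATIC for a in atoms) / len(atoms)


# Hypothetical atoms between two cofactors placed at (0,0,0) and (10,0,0)
atoms = [
    {"res": "PHE", "xyz": (5.0, 1.0, 0.0)},
    {"res": "LEU", "xyz": (5.0, 0.0, 2.0)},
    {"res": "TRP", "xyz": (5.0, 6.0, 0.0)},
    {"res": "ALA", "xyz": (-1.0, 0.0, 0.0)},
]
tube = atoms_in_tube(atoms, (0.0, 0.0, 0.0), (10.0, 0.0, 0.0), radius=3.0)
fraction = aromatic_fraction(tube)
```

Repeating the calculation for radii between 2 and 5 Å and for each pair of cofactors would give the kind of per-domain comparison listed in the findings.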

82. LIGPROT: A database for the analysis and visualization of ligand binding.
Rafael Najmanovich, Eran Eyal, Vladimir Sobolev, Marvin Edelman, Weizmann Institute of Science;
Short Abstract:

LigProt is a structural database of paired Apo and Holo protein forms (derived from the PDB) useful for studies of ligand binding. The database is automatically updated and offers a web-based interface that allows browsing and searching as well as visualization of the superimposed Holo and Apo forms.

One Page Abstract:

A database of paired protein structures in complexed (holo-protein) and uncomplexed (apo-protein) forms from the PDB macromolecular structural database can provide a myriad of information to be used as raw data in bioinformatics studies as well as in the planning of experiments by molecular biologists. Such a database was used in our recent study of side chain flexibility (Najmanovich et al., PROTEINS, 39: 261-268 (2000)). In the present work we: 1. Automate our database creation procedure so that the database can be upgraded regularly to cope with the growth of the PDB and, 2. Create a web-based visualization tool similar to MutaProt (Eyal et al., Bioinformatics, 17(4): 381-382 (2001)) for searching the database according to several criteria and visualizing the results using Chime.

The database is automatically built in three stages: 1. A list of all ligands present in the PDB is created. 2. A list of all possible candidate apo-protein entries for each entry in list 1 is built, and, 3. Each candidate holo-apo pair is tested to ensure that the binding site contains no ligand other than the one under consideration in both entries. PDB entries with resolution poorer than 2.5 Å or containing DNA or RNA are excluded from the database.
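The filtering in stage 3 and the exclusion criteria can be sketched as below. The dictionary layout of an entry is a hypothetical data model, and the pair test encodes one reading of the rule: the holo binding site holds only the ligand under consideration and the apo binding site holds no ligand.

```python
def eligible(entry, max_resolution=2.5):
    """Exclude entries with resolution poorer than the limit or with DNA/RNA."""
    return entry["resolution"] <= max_resolution and not entry["has_nucleic_acid"]


def valid_pair(holo, apo, ligand):
    """Accept a holo-apo pair only if no other ligand occupies the binding site."""
    return (
        eligible(holo)
        and eligible(apo)
        and holo["site_ligands"] == {ligand}
        and not apo["site_ligands"]
    )


# Hypothetical entries (field names are assumptions, not a PDB schema)
holo = {"resolution": 1.8, "has_nucleic_acid": False, "site_ligands": {"ATP"}}
apo = {"resolution": 2.1, "has_nucleic_acid": False, "site_ligands": set()}
apo_low_res = {"resolution": 3.0, "has_nucleic_acid": False, "site_ligands": set()}

accepted = valid_pair(holo, apo, "ATP")
rejected = valid_pair(holo, apo_low_res, "ATP")
```

Running such a test over every candidate pair from stage 2 is what allows the database to be rebuilt automatically as the PDB grows.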

The search and visualization interface allows browsing of the database and searching according to protein and ligand PDB code. We are currently implementing search by protein and ligand name as well as binding site composition and structural characteristics. Once an entry is selected, a list of the intermolecular contacts present in the holo protein is generated using LPC software (Sobolev et al., Bioinformatics, 15(4): 327-332 (1999)). The visualization allows for the inspection of the superimposed structure of the binding site in both entries.

83. ThreadMAP: Protein Secondary Structure Determination
Lydia E. Tapia, Thomas R. Ioerger, Department of Computer Science, Texas A&M University;
James C. Sacchettini, The Center for Structural Biology, Texas A&M University;
Short Abstract:

We have developed a new algorithm based on pattern recognition that recognizes secondary structure fragments in an electron density map prior to any structure solution, as part of the Textal automated model building system. Our approach consists of tracing the density map, extracting geometry based features, and performing classification.

One Page Abstract:

ThreadMAP: Protein Secondary Structure Determination

Lydia E. Tapia(1), Thomas R. Ioerger(1), and James C. Sacchettini(2)

(1)Department of Computer Science

(2)The Center for Structural Biology

Texas A&M University

College Station, TX

Upon the initial construction of a three-dimensional electron-density map of a protein, many protein crystallographers are often faced with low quality and low-resolution data. Because of this noisy data, automated methods for determining the structure of a protein are often hindered. Secondary structure information can help automated methods to refine a map. Also, obtaining quick secondary structure information directly from an electron density map can lead to large-scale protein database searching. For example, secondary structure of proteins from the PDB can be matched against that of a new electron density map. Homologous structures can then be used to solve the sequence of the new, unsolved protein.

We have developed a new algorithm based on pattern recognition that recognizes secondary structure fragments in an electron density map prior to any structure solution, as part of the Textal automated model building system [1]. Our approach consists of tracing the density map, extracting features based on the geometry of the trace, and performing classification.

An easy way to visualize the structure of an electron density map is to reduce the map to a series of lines representing the core of the density, a trace. An algorithm similar to the one used in Bones [2] is used. Once the map is reduced, simple heuristics such as three-way branching and distance metrics can be used to separate the backbone of the protein from the side chains.

A group of features has been developed that characterizes this backbone trace. For example, two-dimensional projections of the trace are made that capture the circular nature of a spiraling helix and the directness (movement in only one dimension) of a strand. Also, three-dimensional features are used to capture information about the Euclidean distance a trace travels. All the features are extracted for all overlapping windows of twenty trace points (~10 angstroms).
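The windowing step can be illustrated with a minimal sketch. It assumes the backbone trace is a list of (x, y, z) points and computes two illustrative features per window: the end-to-end Euclidean distance and a straightness ratio. The function name and both features are assumptions chosen for illustration, not the feature set actually used in ThreadMAP.

```python
import math

def window_features(trace, window=20):
    """Compute illustrative features for every overlapping window of trace points.

    trace  : list of (x, y, z) tuples along the backbone trace
    window : number of consecutive trace points per window (twenty in the abstract)
    Returns one (end_to_end_distance, straightness) tuple per window.
    """
    feats = []
    for start in range(len(trace) - window + 1):
        pts = trace[start:start + window]
        # Length of the path actually travelled along the trace.
        path = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
        # Straight-line (Euclidean) distance between the window ends.
        end_to_end = math.dist(pts[0], pts[-1])
        # Close to 1 for an extended strand, much smaller for a coiling helix.
        straightness = end_to_end / path if path > 0 else 0.0
        feats.append((end_to_end, straightness))
    return feats
```

With points spaced 0.5 angstroms apart, a window of twenty points spans roughly 10 angstroms, which matches the window size quoted above.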

We currently train on a database of feature vectors and their corresponding DSSP [3] predictions. When we receive a query feature vector, we use a nearest neighbor approach to find its closest matches from within the database. The classifications from the closest match can be used to classify the query vector. Smoothing techniques are used to take advantage of the sequential nature of secondary structure. This gives us more confidence in regions of consistent prediction and removes some ambiguity about structure transition regions. The result of this program is an automatic characterization of secondary structure fragments from a density map alone.
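The classification and smoothing steps might look like the following sketch. It assumes a database of (feature vector, DSSP label) pairs, uses a single nearest neighbor under Euclidean distance, and smooths with a sliding majority vote. The distance measure, the vote, and the `half_width` parameter are assumptions; the abstract does not specify which smoothing technique is used.

```python
import math
from collections import Counter

def nearest_label(query, database):
    """Return the label of the database entry closest to the query vector.

    database: list of (feature_vector, label) pairs, e.g. label in 'H', 'E', 'C'.
    """
    return min(database, key=lambda item: math.dist(query, item[0]))[1]

def smooth(labels, half_width=2):
    """Replace each label by the majority label in a sliding window.

    This exploits the sequential nature of secondary structure: an isolated
    label inside a consistent run is overruled by its neighbors.
    """
    out = []
    for i in range(len(labels)):
        window = labels[max(0, i - half_width): i + half_width + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return out
```

For example, `smooth(list("HHHEHHHCCCC"))` overrules the isolated strand label inside the helical run while leaving the helix-to-coil transition in place.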

[1] T. Holton, T. Ioerger, J. Christopher, and J. Sacchettini. (2000). Determining protein structure from electron-density maps using pattern matching. Acta Cryst. D56, 722-734.

[2] T. Jones, J. Zou, S. Cowan, and M. Kjeldgaard (1991). Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst. A47, 110-119.

[3] W. Kabsch and C. Sander (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.

84. FAUST, an algorithm for functional annotations of protein structures using structural templates.
Krzysztof Olszewski, Mariusz Milik, Sándor Szalma, Xiangshan Ni, Molecular Simulation Inc.;
Short Abstract:

FAUST is an automated procedure involving: extraction of functionally significant templates from protein structures and using such templates to annotate novel structures. FAUST templates are used for active and binding site searches in protein structures. Preliminary results of protein structure database annotations with derived Structural Templates are presented.

One Page Abstract:

FAUST (Functional Annotations Using Structural Templates) is an automated procedure that extracts functionally significant templates from protein structures and uses such templates to annotate novel structures. Both whole proteins and structural templates can be represented as colored undirected graphs with atoms as vertices and inter-atom distances as edge weights. Vertex colors are based on the chemical identities of the atoms. In this representation, a structural template is defined as a common sub-graph of graphs corresponding to functionally related proteins. Edge labels are considered equivalent if the inter-atomic distances for corresponding vertices (atoms) differ by less than a threshold value. Hence, in the extraction procedure, pairs of functionally related protein structures are searched for sets of chemically equivalent atoms whose inter-atomic distances are conserved in both structures. Structural Templates resulting from such pairwise searches are combined to maximize classification performance on a training set of chosen protein structures. The FAUST extraction algorithm does not use any external expert input, and it works best for sets of dissimilar structures from non-homologous proteins. The resulting Structural Template provides a new description of the protein function, which includes the natural plasticity of the protein active site. In the FAUST approach, Structural Templates are used for active and binding site searches in protein structures. Structural Templates are also applicable to the evaluation and refinement of protein models.
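The edge-equivalence rule can be made concrete with a small sketch. It assumes the correspondence between atoms is already given (atom k of one set maps to atom k of the other) and only checks that vertex colors agree and that every pairwise distance differs by less than a tolerance. The function name, the data layout, and the default tolerance are hypothetical; finding the correspondence itself (the common sub-graph search) is the hard part and is not shown.

```python
import math
from itertools import combinations

def matches_template(atoms_a, atoms_b, tol=1.0):
    """Check whether two ordered atom sets are equivalent in the template sense.

    atoms_a, atoms_b : lists of (chemical_identity, (x, y, z)) in corresponding order
    tol              : assumed distance threshold in angstroms
    """
    if len(atoms_a) != len(atoms_b):
        return False
    # Vertex colors (chemical identities) must agree atom by atom.
    if any(a[0] != b[0] for a, b in zip(atoms_a, atoms_b)):
        return False
    # Every inter-atomic distance must be conserved within the threshold.
    for i, j in combinations(range(len(atoms_a)), 2):
        d_a = math.dist(atoms_a[i][1], atoms_a[j][1])
        d_b = math.dist(atoms_b[i][1], atoms_b[j][1])
        if abs(d_a - d_b) >= tol:
            return False
    return True
```

Because only distances are compared, the check is independent of how the two structures are oriented in space.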

We demonstrate here FAUST extraction results for the highly divergent family of serine proteases, which exhibits a conserved Structural Template. We compare FAUST Structural Templates to the standard description of serine protease active site conservation and demonstrate the depth of information captured in such a description. We also present preliminary results of protein structure database annotations with derived Structural Templates.

85. Predicting structural features in protein segments
Fredrik Pettersson, Anders Berglund, Research Group for Chemometrics, Department of Organic Chemistry, University of Umeå;
Short Abstract:

Using multivariate techniques such as PLS we are building a predictive model that will be able to determine protein structure for small segments solely using sequence as input. The model is based on a library consisting of a diverse set of 1496 good proteins. Calculations are done using a super-computer.

One Page Abstract:

One approach for protein structure prediction is to build up the structure of a whole protein from smaller sequences. These building blocks can either be different secondary structure elements or sequence elements with a specific window size. After determining the structure for each of the segments the overall structure of the assembled protein can be determined by putting together the pieces in an optimal way. In this project we are focusing on the first step that is to find the structure of the constituents based on sequence solely. Our goal is to make a predictive model that will with sequence as input be able to identify the most structurally similar match in a sequence library.

A multivariate projection method, PLS, is used for relating sequence similarity to structural similarity. PLS calculates latent variables that give a good approximation of the input data (X) and correlate well with the response (Y). In our case the resulting model describes the correlation between sequence features and structure. When using multivariate techniques, sequence information has to be represented numerically. This is achieved using z-scales, which are based on a physico-chemical characterization of the amino acids.

Based on a library consisting of a diverse set of 1496 good proteins, sub-libraries with the data divided into smaller segments (5-10 aa) have been constructed. For each protein segment, sequence and structure are characterized. Protein sequence is characterized using z-scales and a couple of sequence similarity matrices. These values are then compared to those of the other members in the sub-library. Structural similarity is represented as a CRMS value and a value representing similar secondary structure. Based on this data a PLS model is constructed that will be used for ab initio structure prediction.

The program for doing sequence and structure characterization with subsequent comparisons has been implemented in Perl and Fortran 77. The outer Perl layer invokes the Fortran 77 program, which performs the computationally heavy processes. Calculations will be performed on a parallel-computer at HPC2N at the University of Umeå.

Because structure is highly dependent on the overall properties of the whole protein, we cannot expect to obtain a perfect prediction model. At this initial stage we will be satisfied if we are able to score the true match in the library within the top 10 library matches. Preliminary results indicate that this may well be possible, but more work needs to be done.
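The top-10 success criterion can be expressed as a short evaluation sketch. It assumes that, for each query segment, the model produces a predicted structural similarity for every library segment (higher meaning more similar) and that the identity of the true best match is known. The function names and the dictionary layout are hypothetical.

```python
def rank_of_true_match(scores, true_id):
    """Return the 1-based rank of the true match among the library segments.

    scores : dict mapping library segment id -> predicted similarity (higher is better)
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_id) + 1

def top_k_hit_rate(queries, k=10):
    """Fraction of queries whose true match is ranked within the top k.

    queries : list of (scores, true_id) pairs, one per query segment
    """
    hits = sum(1 for scores, true_id in queries
               if rank_of_true_match(scores, true_id) <= k)
    return hits / len(queries)
```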

86. GA Generates New Amino Acid Indices through Comparison between Native and Random Sequences
Satoru Kanai, PharmaDesign, Inc.;
Hiroyuki Toh, Department of Bioinformatics, Biomolecular Engineering Research Institute;
Short Abstract:

If the folding information of a protein is encoded by the arrangement of the amino acid residues along the primary structure, the information would be degraded by random shuffling of the residues. We developed a new method to extract folding information by comparing native sequences with random sequences.

One Page Abstract:

The amino acid sequence of a protein carries its folding information. If the information is encoded by the arrangement of the amino acid residues along the primary structure, the random shuffling of the residues would degrade the information. We developed a new method to compare the native sequence with random sequences generated from the native sequence, in order to extract such information. First, amino acid indices were randomly generated. That is, the initial indices have no significance on the feature of residues. Next, using the indices, the averaged distance between a native sequence and the random sequences was calculated, based on the autoregressive (AR) analysis and the linear predictive coding (LPC) cepstrum analysis. The indices were subjected to the genetic algorithms (GA) using the distance as the fitness, so that the distance between the native sequence and the random sequences becomes larger. We found that the indices converged to hydrophobicity indices by the GA operation. The AR analysis with the converged indices revealed that the autocorrelation in the native sequence is related to the secondary structure.
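The native-versus-shuffled comparison that serves as the GA fitness can be sketched as follows. Note the substitution: instead of the AR and LPC cepstrum distance described above, this sketch uses a plain lag-k autocorrelation of the index profile, which is much simpler but shares the key property that it depends on residue order and not only on composition. The function names, the lag, the number of shuffles, and any index values passed in are assumptions.

```python
import random

def lag_autocorrelation(values, lag=1):
    """Normalised autocorrelation of a numeric profile at the given lag."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0
    cov = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return cov / var

def native_vs_shuffled(seq, index, n_shuffles=100, lag=1, seed=0):
    """Average distance between the native sequence and its shuffled copies.

    seq   : amino acid sequence as a string
    index : dict mapping residue letter -> numeric index value (candidate indices)
    A GA would try to maximise this value over candidate index sets.
    """
    rng = random.Random(seed)
    native = lag_autocorrelation([index[a] for a in seq], lag)
    diffs = []
    for _ in range(n_shuffles):
        shuffled = list(seq)
        rng.shuffle(shuffled)  # same composition, residue order destroyed
        diffs.append(abs(native - lag_autocorrelation([index[a] for a in shuffled], lag)))
    return sum(diffs) / len(diffs)
```

Shuffling preserves composition, so any difference between the native and shuffled profiles reflects information carried by the arrangement of residues.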

87. STING Millennium: Web based suite of programs for comprehensive and simultaneous analysis of structure and sequence
Goran Neshich, EMBRAPA/CNPTIA -Campinas, SP - Brazil;
Roberto C. Togawa, EMBRAPA/CENARGEN - Brasília, DF - Brazil;
Wellington Vilella Torres, Tharsis Fonseca e Campos, Leonardo Lima Ferreira, Adilton Guedes Oliveira, Ronald Tetsuo Miura, Marcus Kiyoshi Inoue, Luiz Gustavo Horita, Georgios Pappas Jr., EMBRAPA/CNPTIA -Campinas, SP - Brazil;
Barry Honig, Columbia University, New York - USA;
Short Abstract:

STING Millennium is a web based suite of programs for visualization of molecular structure and comprehensive structure analysis: sequence and structure positions for residues, pattern search, 3D neighbors, H-bonds, structure quality, nature of atomic contacts of intra/inter chain type and residue conservation. Available:

One Page Abstract:

STING Millennium is a web based suite of programs that starts with visualizing molecular structure and then leads a user through a series of operations resulting in a comprehensive structure analysis: amino acid sequence and structure positions, pattern search, 3D neighbors identification, H-bonds, angles and distances between atoms are easy to obtain thanks to the intuitive graphic and menu interface. In addition, a user can obtain: sequence to structure relationships, analysis of the quality of the structure, nature and volume of atomic contacts of intra and inter chain type, analysis of relative amino acid position conservation and relationship with intra-chain contacts, effectively establishing Folding Essential Residue (FER) indicators, etc. The main aspect of STING Millennium is the ability to combine data delivery through the web with structural analysis tools in order to provide a self-contained instrument for macromolecular studies. More than a simple front-end to the Chime plugin, STING offers analytical services which we will only briefly describe here, counting on users to refer to the extensive on-line help for further details.

STING Millennium is composed of two main windows: a sequence window that displays the sequence and contains the general menus with the commands, and a structure window that displays the rendered three-dimensional macromolecular structure. In general terms STING Millennium provides the following services:

* Ability to easily select residues in the sequence, select elements of secondary structure, as well as offer a wide variety of methods for rendering and coloring a molecule (mostly available through ACTION menu).
* Defining 3D neighbors of an arbitrarily selected residue
* Definition and display of amino acids participating in interfacial regions between polypeptide chains (through WINDOWS/Interface chain menu selection)
* Building surfaces of the whole molecule or just the IFR part of it
* Interactive Ramachandran plots, permitting rapid identification of residues in the disallowed regions and display of selected residues in the structure window
* Calculation of residue frequency within a selected chain or on an interface, as well as frequency of those residues filtered through chosen contact parameters.
* Hydrogen bond net calculation with special attention given to participation of water molecules.
* Contacts definition and calculation for the whole molecule and/or interfaces
* Convenient 2D graphical presentation of parameters extracted from 3D structure
* Display of sequence neighbors and calculation of relative sequence conservation for the family of homologous proteins

In the links entry in the main menu, several external services that deal with PDB files are listed. These consist of links to web sites containing programs that accept a PDB code as input to perform useful tasks, which makes STING Millennium highly integrated with other important data resources. STING Millennium is both a didactic tool and a research tool. It is easy to use and requires virtually no training time. STING Millennium is available at: and

88. Side chain-positioning as an integer programming problem
Olivia Eriksson, Stockholm Bioinformatics Center, Stockholm University;
Yishao Zhou, Department of Mathematics, Stockholm University;
Arne Elofsson, Stockholm Bioinformatics Center, Stockholm University;
Short Abstract:

We here present a novel integer programming based algorithm that finds the optimal set of sidechain rotamers on a fixed backbone in polynomial time. The complexity of this algorithm is similar to the commonly used pruning algorithms. Further, it is guaranteed to find the optimal solution.

One Page Abstract:

The problem of positioning side chains on a fixed backbone is one of the fundamental parts of homology modeling and protein design algorithms. Most homology modeling methods use some algorithm to place side chains onto a fixed backbone. Homology modeling methods are the necessary complement to the large scale structural genomics projects that are being planned. Recently it has also been shown that for automatic design of protein sequences it is of the utmost importance to find the global solution to the side chain positioning problem. If a suboptimal solution is found, the difference in free energy between different sequences will be smaller than the errors for the side chain positioning problem. Many different algorithms have been developed to solve this problem. The most successful methods have used a fixed rotamer library and not continuous rotamers. This makes it possible to detect a single global minimum energy conformation. The most promising method to solve this problem in polynomial time today is the dead end elimination theorem. Here we introduce another method. We formulate the problem as a linear integer program, relax the integer constraints and solve the resulting linear program. We show that the solution to the relaxed problem will always be integer and is therefore also the solution to the original problem. By using this problem formulation the global minimum energy conformation will be found in polynomial time.
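For orientation, one common way to write rotamer selection as a linear integer program is sketched below. This is an assumed textbook-style formulation, not necessarily the one used by the authors: $R_i$ is the rotamer set of residue $i$, $E_i(r)$ is the energy of rotamer $r$ against the fixed backbone, $E_{ij}(r,s)$ is the pairwise energy between rotamers, $x_{i,r}$ selects rotamer $r$ at residue $i$, and $y_{i,r,j,s}$ indicates that rotamers $r$ and $s$ are selected together.

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_{i}\sum_{r \in R_i} E_i(r)\, x_{i,r}
  \;+\; \sum_{i<j}\sum_{r \in R_i}\sum_{s \in R_j} E_{ij}(r,s)\, y_{i,r,j,s} \\
\text{s.t.}\quad & \sum_{r \in R_i} x_{i,r} = 1 && \forall i \\
& \sum_{s \in R_j} y_{i,r,j,s} = x_{i,r} && \forall i<j,\ r \in R_i \\
& \sum_{r \in R_i} y_{i,r,j,s} = x_{j,s} && \forall i<j,\ s \in R_j \\
& x_{i,r},\ y_{i,r,j,s} \in \{0,1\}
\end{aligned}
```

The relaxation mentioned in the abstract corresponds to replacing the last line by $0 \le x_{i,r} \le 1$ and $0 \le y_{i,r,j,s} \le 1$, which turns the problem into an ordinary linear program.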

89. Prediction of the quality of protein models using neural networks
Björn Wallner, Arne Elofsson, Stockholm Bioinformatics Center;
Short Abstract:

Neural networks are trained to predict quality of protein models, based on accessibility surfaces and contacts between residues and 13 different atom types. A correlation coefficient of 0.81 is obtained for an independent test set. This method might be useful to increase the specificity of fold-recognition methods.

One Page Abstract:

Models of proteins are made to help our understanding of how a particular protein functions. However, no good measure of the quality of a model exists. To address this problem, neural networks are trained to predict the quality of protein models. Besides making it possible to measure the quality of a model, this might also be useful for increasing the specificity of fold-recognition methods.

Here we generate a large set of models, using alignment methods and the homology modelling program Modeller (Sali & Blundell, 1993). The quality of these models was measured using a modified version of the LGscore (Cristobal et al., 2001).

The training was based on accessibility surfaces, the contacts between residues and contacts between 12 different atom types. The training was performed for different cutoffs. For the atom type contacts, networks were trained on eight cutoffs ranging from 3.0 Å to 4.75 Å in 0.25 Å intervals; contacts with atoms in the same residue were omitted. For the residue contacts, six cutoffs in the range 4 Å to 12 Å were used; only contacts between residues more than five residues apart in the sequence were counted, to avoid accumulation of contacts between residues lying close in the sequence. The accessibility surfaces were represented as the fraction of low (<25%), medium (25%-75%) and high (>75%) relative accessibility for each residue, respectively.
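The residue contact counting described above might be implemented along these lines. The sketch assumes one representative coordinate per residue (the abstract does not say which atoms define a residue contact) and counts only pairs more than five positions apart in the sequence. The function name and the default cutoff of 6 Å are illustrative.

```python
import math

def count_residue_contacts(coords, cutoff=6.0, min_separation=6):
    """Count contacts per residue, skipping pairs that are close in sequence.

    coords         : one (x, y, z) point per residue, in sequence order
    cutoff         : distance threshold in angstroms
    min_separation : smallest allowed |i - j|; 6 means "more than five apart"
    """
    counts = [0] * len(coords)
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts
```

Running the same function with each of the cutoffs under study yields one candidate input set per cutoff.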

A neural network was trained for every single combination of parameter type, and a correlation coefficient for an independent test set was calculated as a measure of how well each network performed. For the atom contacts alone the best correlation, 0.70, was obtained with a 4.5 Å cutoff; for the residue contacts a cutoff of 6 Å gave the best correlation, 0.63. For the accessibility surfaces, high and low relative accessibility gave the best correlations, with 0.70 for low and 0.52 for high.

The different parameter types probably contain overlapping information; nevertheless, if a network is trained on a combination of the best atom and residue contacts together with the accessibility surfaces, a correlation coefficient of 0.81 is obtained.

References

Sali, A. & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.

Cristobal, S. et al. (2001). How can the accuracy of a protein model be measured? Manuscript in preparation.

90. Targeting proteins with novel folds for structural genomics
Liam J. McGuffin, David T. Jones, Bioinformatics Group, Brunel University;
Short Abstract:

Finding novel folds is an important aim of structural genomics. We have evaluated a number of methods that discriminate between proteins with novel and known folds. We propose that simple secondary structure alignments could identify novel folds more selectively than both sequence alignments and a simple fold recognition method, GenTHREADER.

One Page Abstract:

Beyond the era of genome sequencing the focus has turned to proteomics, and in particular the high-throughput determination of protein structure or structural genomics. The ultimate objective of structural genomics is to determine the structure of every protein coded by every single gene within a genome. The premise is that once solved, protein structures may then be used to decode the functions of those genes identified within a genome. Determining each protein structure experimentally using current techniques is not feasible due to cost and time limitations. Models for proteins with >30% sequence identity to a protein with a known structure can be built fairly easily by homology modeling (Sali, 1998; Brenner, 2000; Portugaly et al., 2000). Beyond this, threading or fold recognition methods are able to assign folds to more distantly related proteins; however, this is both time consuming and limited by the current library of templates. Fast fold recognition or genomic threading techniques such as 3D-PSSM (Kelley et al., 2000), SAM-T98 (Karplus et al., 1999) and GenTHREADER (Jones, 1999) have been developed which overcome the time issue. However, these techniques rely upon finding some homology to solved structures and may perform poorly when sequences show no apparent evolutionary relationship to any known protein family (Jones, 1999). The problem remains that a fold can, of course, only be recognized if a template fold for the protein exists. The identification of nominally distinct folds is important to structural genomics. Solving structures of new folds experimentally will increase the range of folds that can be used as models or templates for computational structure determination (Sali, 1998). Therefore, methods must be developed that aim to discriminate between folds which have been seen before (known folds) and those which are novel.
Methods which are capable of identifying novel folds would also greatly benefit the protein structure prediction field, as one of the first questions that must be addressed when predicting the structure of a new protein sequence is whether or not it has a known fold. Sequence based clustering methods such as PROTOMAP (Portugaly et al., 2000) have been developed in an attempt to estimate the probability of a protein having a "new" fold. As homologous proteins must by definition have a common fold, generally speaking, sets of sequences with less than, say, 30% identity have a higher chance of having a novel fold than sets of proteins without sequence clustering. However, two similar folds may have very low sequence similarity (even by the standards of sensitive sequence profile comparison), and thus a potential novel fold determined by simple sequence searching could easily turn out to have a known structure. In this case methods that are based solely on sequence information are unreliable. Alignments of secondary structure elements have been shown to provide a rapid estimate of fold for sequences with no detectable homology to any known structure. Although this kind of method cannot be relied upon for accurate fold recognition, it has been found that it does offer an improvement over sequence alignment in its ability to assign folds to evolutionarily distant proteins (McGuffin et al., 2001). It has also been suggested that class or folding type of distantly related proteins can be discerned simply by measuring differences in amino acid composition (Eisenhaber et al., 1996; Wang et al., 2000), and so composition based filtering has also been proposed as a possible way of increasing the likelihood of finding new folds. We have compared the ability of a simple fold recognition method (GenTHREADER) and a variety of simple sequence analysis methods to discriminate between domains with novel folds and those with known folds.
We also have evaluated methods based on simple pairwise alignments of secondary structure elements. We propose that simple alignments of secondary structure elements could potentially be a more selective method than both GenTHREADER and standard sequence alignment at finding novel folds when sequences show no detectable homology to proteins with known structures.
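A simple pairwise alignment of secondary structure can be sketched with a standard global (Needleman-Wunsch) dynamic program over per-residue strings of H, E and C. The scoring values and function names are assumptions, and the abstract does not specify whether alignment is done per residue or per element, so this is only one possible reading.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of two secondary structure strings (e.g. 'HHHEEC')."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def best_library_score(query, library):
    """Best alignment score of a predicted query string against known-fold strings."""
    return max(align_score(query, known) for known in library)
```

A query whose best score against every known fold stays low would be a candidate novel fold; where to place that threshold is not stated in the abstract and would have to be calibrated.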

Brenner, SE. Target selection for structural genomics. Nature Struct Biol Suppl 2000;967-969.

Eisenhaber, F, Frömel, C, Argos, P. Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins 1996;25:169-179.

Jones, D. T. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999; 287:797-815.

Karplus, K, Barrett, C, Cline, M, Diekhans, M, Grate, L, Hughey, R. Predicting protein structure using only sequence information. Proteins Suppl 3 1999:121-125.

Kelley, LA, MacCallum RM & Sternberg, MJE. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000; 299:499-520.

McGuffin, LJ, Bryson, K, Jones DT. What are the baselines for protein fold recognition? Bioinformatics 2001;17:63-72.

Portugaly, E, Linial, M. Estimating the probability for a protein to have a new fold: A statistical computational model. Proc Natl Acad Sci USA 2000; 97:5161-5166.

Sali, A. 100,000 protein structures for the biologist. Nature Struct Biol 1998;5:1029-1032.

Wang, Z, Zheng, Y. How good is prediction of protein structural class by the component coupled method? Proteins 2000;38:165-175.

91. Protein Structural Domain Parsing by Consensus Reasoning
HwaSeob Joseph Yun, Casimir A. Kulikowski, Ilya Muchnik, Gaetano T. Montelione, Rutgers University;
Short Abstract:

Combined domain parsing methods based on HMM and BLAST can provide concrete and definite predictions with consensus reasoning. Classifiers trained on DDD were tested on SCOP domains from the same families as the DDD seeds, which produced 75.5% accuracy. Tests on baker's yeast produced 64.6% correct functional predictions by EC classification.

One Page Abstract:

Domain parsing, or the detection of signals of protein structural domains from sequence data, is a complex and difficult problem. If carried out reliably it would be a powerful interpretive and predictive tool for genomic and proteomic studies. We report on a novel approach to domain parsing using consensus techniques based on Hidden Markov Models (HMMs) and BLAST searches built from a training set of 1471 continuous structural domains from the Dali Domain Dictionary (DDD).

Because of their different underlying mechanisms, the various domain parsing tools each have their own advantages, and our method begins by running the individual programs to their full extent so that these distinctive characteristics are preserved. After acquiring possible domain signals, the 1471 results from each tool are ranked and screened by an objective threshold that is comparable across the programs we used. The selected best hits are then paired only when the targeted domains are from the same seed in DDD. These matched pairs differ in their domain boundary predictions on both the N- and C-terminal sides, and by plotting these differences on an N-C terminal difference plane, a Pareto set can be extracted to find the point with minimal differences and maximal overlapping length among the detected signals. The unknown sequence is then classified by referencing the SCOP definition of the strongest signal collected with this consensus reasoning method.
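Extracting a Pareto set from the matched pairs can be sketched as below. Each candidate is assumed to be a tuple of (N-terminal boundary difference, C-terminal boundary difference, overlap length); the two differences are minimised and the overlap is maximised. The tuple layout and function name are hypothetical.

```python
def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate.

    candidates : list of (n_diff, c_diff, overlap) tuples
    A candidate p dominates q if p is no worse on all three criteria
    and strictly better on at least one.
    """
    def dominates(p, q):
        no_worse = p[0] <= q[0] and p[1] <= q[1] and p[2] >= q[2]
        better = p[0] < q[0] or p[1] < q[1] or p[2] > q[2]
        return no_worse and better

    return [q for q in candidates
            if not any(dominates(p, q) for p in candidates)]
```

Choosing a single point from the resulting set still requires a rule for trading boundary agreement against overlap, which the abstract does not spell out.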

We have tested the approach in two ways. In the first, validation of domain parsing on an independent test sample of 347 family-matched structural domain sequences from the SCOP database yields a consensus prediction performance rate of 75.5%, well above the 58% obtained by simple logical agreement of methods. A second independent test was to check the potential of combining methods for functional annotation. Using 339 biochemically well-characterized baker's yeast sequences which had matching EC codes to our model sequences, we compared results at different levels of the EC codes between HMM, BLAST, and disjunctive predictions against the query domains. This showed that there is a slightly higher likelihood of including the right prediction by using the disjunctive prediction than by using either method alone. There are 64.6% correct exact functional predictions in the top 10 BLAST or HMM results, while comparable matches at the highest EC level yield 93.5% as the upper bound for prediction.

92. Attempt to optimise template selection in protein homology modelling using logical feature descriptors
Alexander Diemand, H. Scheib, T. Schwede, N. Guex, GlaxoSmithKline R&D, Geneva;
Short Abstract:

We address the template selection step in protein homology modelling. The structures of putative templates can vary even if their sequences are highly similar. Clustering them and the derivation of discriminatory explanations by induction helps expert modellers in decision making. We assessed our method for a number of protein families.

One Page Abstract:

Even though for the majority of proteins there is no structural information available, the structures of several families of homologous proteins have been extensively studied. To close the gap between the huge number of protein sequences available and the still very limited number of protein structures resolved, we apply homology modelling, based on the observation that high sequence similarity implies structural similarity. In this work, we focused on optimising the critical template selection step, particularly in cases where numerous potential template structures are available.

The structures of these putative templates can vary even if their sequences are highly similar, owing to the presence or absence of substrate or regulatory compounds, domain movements, the experimental method, or different source organisms. It is time consuming and error-prone to identify by hand the key features which distinguish these structures and determine their suitability as homology modelling templates. This process has been automated and its usability assessed for a number of protein families from the publicly accessible Protein Data Bank (PDB).

This method first clusters proteins of a particular family based on structure comparison, with each protein being described in annotations from SwissProt, Prosite, and the PDB file itself. Then, an algorithm generates the most general hypothesis by logical induction which in feature space distinguishes the clusters from each other. As a result, an explanatory feature description is obtained, which can be used to guide the template selection by either asking an expert modeller to verify or manually alter the proposed selection or in an automated mode to make the most prominent decision.
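As a much simplified illustration of the idea of a discriminatory explanation, the sketch below lists the single annotation features whose values never overlap between two structure clusters. This is not the authors' induction algorithm, which generates more general hypotheses; the function name, the dictionary layout, and the example feature names are hypothetical.

```python
def discriminating_features(cluster_a, cluster_b):
    """Find annotation features whose values separate two clusters.

    cluster_a, cluster_b : lists of dicts mapping feature name -> value,
                           one dict per protein structure
    Returns {feature: (values_in_a, values_in_b)} for features whose value
    sets in the two clusters are disjoint.
    """
    result = {}
    features = set().union(*(p.keys() for p in cluster_a + cluster_b))
    for f in sorted(features):
        vals_a = {p.get(f) for p in cluster_a}
        vals_b = {p.get(f) for p in cluster_b}
        if vals_a.isdisjoint(vals_b):
            result[f] = (vals_a, vals_b)
    return result
```

For instance, if every structure in one cluster is annotated as ligand-bound and every structure in the other as apo, the ligand state would be reported as an explanation an expert modeller can check.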

This method will be integrated into the SwissModel/DeepView protein modelling suite and thus will be made available to other researchers.

93. Prediction of amyloid fibril-forming proteins
Yvonne Kallberg, Magnus Gustafsson, Johan Thyberg, Bengt Persson, Jan Johansson, Karolinska Institutet;
Short Abstract:

Amyloid fibrils are formed from different proteins, and are very similar in spite of differences in the native structures. The fibrils are based on beta-strands, which means that proteins containing mainly helices must undergo structural changes. By comparing secondary structures, we are able to predict fibril formation among such proteins.

One Page Abstract:

Amyloid fibrils can be formed from different proteins, and are associated with severe diseases like the neurodegenerative Alzheimer's disease and bovine spongiform encephalopathy. In spite of differences in their native structures, these proteins form very similar amyloid fibrils with beta-strands perpendicular and beta-sheets parallel to the fibre axis. Thus amyloid-forming proteins that contain mainly alpha-helical structures must undergo alpha-helix to beta-strand conversions before or during fibril formation. In order to investigate this, we searched for experimentally determined alpha-helices with predicted beta-strands in 1324 proteins, and found 37 proteins that contained alpha/beta discordant segments. The set includes three known amyloidogenic proteins: the prion protein, amyloid beta peptide (Abeta) and lung surfactant protein C (SP-C). Three other proteins (transpeptidase, triacylglycerol lipase and coagulation factor XIII) were also found to form amyloid fibrils. It is known that replacement of valine residues in the discordant segment of SP-C with leucine yields a peptide with a helical conformation. It is also known that Abeta variants that lack the discordant stretch, or that carry key substitutions which reverse the discordance, do not form fibrils. Our data strongly suggest that long stretches of alpha-helix/beta-strand discordance predict amyloid fibril formation.
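The search for discordant segments can be sketched as a scan over two aligned per-residue strings: the experimentally determined secondary structure and the predicted one. The function name and the minimum length `min_len` are assumptions, since the abstract speaks only of "long stretches" without giving a number.

```python
def discordant_segments(observed, predicted, min_len=7):
    """Find stretches that are helical in the structure but predicted as strand.

    observed  : per-residue secondary structure from the experiment ('H' = helix)
    predicted : per-residue predicted secondary structure ('E' = strand)
    Returns a list of (start, end) index pairs, end exclusive.
    """
    segments, start = [], None
    for i, (o, p) in enumerate(zip(observed, predicted)):
        if o == "H" and p == "E":
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    # Handle a discordant stretch that runs to the end of the sequence.
    n = min(len(observed), len(predicted))
    if start is not None and n - start >= min_len:
        segments.append((start, n))
    return segments
```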

94. Evaluation of structure prediction models using the ProML specification language
Daniel Hanisch, Ralf Zimmer, Thomas Lengauer, GMD - National Research Center, St. Augustin, Germany;
Short Abstract:

We propose the ProML specification language for proteins and protein families based on the open XML standard. ProML allows for efficient specification and visualization of heterogeneous protein data. As an application, we discuss the representation of features of protein clusters and the use of experimental constraints for validation of structural models.

One Page Abstract:

Title: Evaluation of structure prediction models using the ProML specification language

Authors: Daniel Hanisch, Ralf Zimmer, Thomas Lengauer

We propose a specification language, ProML, for protein sequences, structures, and families based on the open XML standard. The language allows for portable, system-independent, machine-parsable and human-readable representation of essential features of proteins. In contrast to existing XML applications in this field, our emphasis is not on the molecular structure of one protein or molecule (as in CML), nor on annotation of one gene or one protein for use with a proprietary browser (as in BioML), but on efficient representation of heterogeneous data associated with one or several proteins. As we developed ProML in the context of structure prediction, we focused on properties useful in threading and clustering algorithms. Extensions for other applications, however, are straightforward to realize within ProML.

To achieve this goal, one ProML document is able to describe several proteins and their properties in a structured manner. ProML defines low-level elements as building blocks for more complex properties. Predefined elements include primary and secondary sequence information, three-dimensional coordinates, CATH structural classification and Prosite patterns. A Property tree relates properties to proteins in a hierarchical manner. We define an optimality criterion for this tree, which allows for efficient use of represented information in algorithms.

ProML is of immediate use for several bioinformatics applications: we discuss clustering of proteins into families and the representation of the specific shared features of the respective clusters. ProML's Property tree defines a hierarchical view on these features, thereby making within-cluster similarities and differences among potential subclusters easily visible to humans and accessible to algorithms.

In a second application, we use experimentally derived constraints, represented in ProML, in a protein structure prediction approach for the validation of proposed theoretical models and improvement of the fold recognition rate on a representative benchmark protein set. To this end, we computed conserved cores for structural clusters of our benchmark library and produced ProML documents for the clusters containing the structural cores. By exploiting randomly generated as well as simulated cross-link distance constraints measurable by mass spectrometry, we were able to improve fold recognition on our test set. For this, we applied a post-filtering approach to results produced by our threading algorithm 123D.
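
The post-filtering idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tuple layout of a model, the rule "keep a model if at least half of the constraints are satisfied", and the assumption that a higher threading score is better are all hypothetical choices made for illustration.

```python
import math

def satisfied_fraction(coords, constraints):
    """Fraction of distance constraints met by one model.

    coords: dict mapping residue index -> (x, y, z)
    constraints: list of (res_i, res_j, max_distance), e.g. from cross-links
    """
    if not constraints:
        return 1.0
    ok = 0
    for i, j, max_dist in constraints:
        # A constraint counts only if both residues are present in the model.
        if i in coords and j in coords and math.dist(coords[i], coords[j]) <= max_dist:
            ok += 1
    return ok / len(constraints)

def post_filter(models, constraints, min_fraction=0.5):
    """Drop models that violate too many constraints, keep threading order.

    models: list of (name, threading_score, coords); higher score is assumed better.
    min_fraction: hypothetical acceptance threshold.
    """
    scored = [(name, score, satisfied_fraction(coords, constraints))
              for name, score, coords in models]
    kept = [m for m in scored if m[2] >= min_fraction]
    return sorted(kept, key=lambda m: m[1], reverse=True)

# Toy example: the top-scoring model violates the only cross-link constraint.
models = [
    ("fold_A", 12.0, {1: (0.0, 0.0, 0.0), 10: (30.0, 0.0, 0.0)}),
    ("fold_B", 10.5, {1: (0.0, 0.0, 0.0), 10: (8.0, 0.0, 0.0)}),
]
constraints = [(1, 10, 15.0)]
print(post_filter(models, constraints))
```

In this toy case the filter removes fold_A, so the lower-scoring but constraint-compatible fold_B becomes the top hit.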


[1] T. Bray, J. Paoli, and C.M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. February 1998.

[2] P. Murray-Rust. CML - the Chemical Markup Language.

[3] Proteomics Inc. BioML - Biological Markup Language.

[4] D. Hoffmann, and R. Zimmer. Fluorescence Energy for Elucidating the 3D-Structure of Biological Macromolecules. German Patent Office, PCT/EP99/01008, 10. Feb 1999

[5] N. Alexandrov, R. Nussinov, and R. Zimmer. Fast Protein Fold Recognition via Sequence to Structure Alignment and Contact Capacity Potentials. In Pacific Symposium on Biocomputing'96, 53-72, 1996.

95. Incremental Volume Minimization of Proteins (represented by Collagen Type I (local minimization))
Meir Israelowitz, P. Campbell, L. Ernst, J. M. Ernsthausen, W. Galbraith, S. W. Hussain, Carnegie Mellon University;
I. Verdinelli, University of Rome;
Troy Wymore, Pittsburgh Supercomputer Center;
D. L Farkas, Carnegie Mellon University;
Short Abstract:

Type I collagen, a major extra-cellular matrix protein, is a long-sequence fibrous biopolymer. Mathematical models of collagen structure are important because about 90% of the structures of most inner organs are made of some type of collagen.

One Page Abstract:

Type I collagen, a major extra-cellular matrix protein, is a long-sequence fibrous biopolymer. Mathematical models of collagen structure are important because about 90% of the structures of most inner organs are made of some type of collagen. Simulating collagen structures is therefore relevant to tissue engineering, as collagen fibers are the meta-structured basis of tissue. A number of methods can be used to determine the optimal conformations of polypeptides under various conditions, and several techniques have been used to estimate the native conformations of large globular proteins. Our approach to modeling protein structure consists of approximating the process of protein folding. This can be expanded to large, multi-molecular network structures. While almost all existing models consider random packing of rigid spheres, we managed to reduce the molecular volume by using concepts from low-dimensional topology (braids) and differential geometry. A braid group has the property of maintaining the continuity of a sequence while the minimization is performed (the topology guarantees continuity during the minimization). Our model creates segments from the braid (the segments are of hydrogen-bond length). These segments are amino-acid peptides (or beads), and we consider a distance-geometry functional rather than the simple minimization of the distances between atom centers. We have applied this approach to the PDB files 1BBF, 1CGD and 1AQ5, and to Collagen Type I.

96. Automatic Inference of Protein Quaternary Structure from Crystallographic Data.
Hannes Ponstingl, European Bioinformatics Institute, EMBL-EBI, Hinxton, UK.;
Thomas Kabir, Biomolecular Structure and Modelling Unit, Biochemistry and Molecular Biology Department, University College London;
Janet M. Thornton, European Bioinformatics Institute, EMBL-EBI, Hinxton, and Biomolecular Structure and Modelling Unit, Biochemistry and;
Short Abstract:

A procedure was developed that generates the likely quaternary assembly of a protein from its atom coordinates and crystal symmetry deposited in the Protein Data Bank (PDB). It applies a graph-theoretic algorithm to interface scores derived from crystal structures of globular proteins of distinct oligomeric state in solution.

One Page Abstract:

The atomic coordinates of protein crystal structures deposited in the Protein Data Bank (PDB) describe the asymmetric unit of the crystal - not the physiologically relevant assembly of the polypeptide chains. Moreover, the PDB annotation of the functional assembly is still sparse and unreliable.

This work is an attempt to provide the possible macromolecular assemblies likely to be prevalent in solution. The assemblies are ranked according to statistics obtained from a representative set of crystal structures of globular proteins whose oligomeric state in solution is distinct and experimentally established.

All intermolecular contacts present in the crystal are re-generated by applying crystallographic symmetry operations to the deposited coordinates. From this set, those contacts are discarded that are most likely to be artifacts of the crystal environment.

For this task, we derived a scoring function for protein-protein interfaces from the trusted set of water-soluble oligomers. The scoring function, a so-called statistical potential, is based on pairs of atom types and distance information.
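
As an illustration of what a pair-type and distance-based statistical potential can look like, the following Python sketch uses one common inverse-Boltzmann form, the negative log of observed over reference frequency per distance bin. The binning, the pseudocount, the reference state and all function names are assumptions; the abstract does not specify the authors' exact formulation.

```python
import math
from collections import Counter

def bin_index(distance, width=1.0):
    """Map a distance to a bin number (bin width is an assumption)."""
    return int(distance // width)

def derive_potential(contacts, width=1.0, pseudo=1.0):
    """Derive scores from contacts observed in trusted interfaces.

    contacts: iterable of (atom_type_a, atom_type_b, distance)
    Returns a dict {(pair, bin): score}; lower scores are more favourable.
    """
    pair_bin, pair_total, bin_total = Counter(), Counter(), Counter()
    total = 0
    for a, b, d in contacts:
        pair = tuple(sorted((a, b)))  # treat (C, O) and (O, C) alike
        k = bin_index(d, width)
        pair_bin[(pair, k)] += 1
        pair_total[pair] += 1
        bin_total[k] += 1
        total += 1
    n_bins = len(bin_total)
    potential = {}
    for pair in pair_total:
        for k in bin_total:
            # Observed frequency for this pair type, smoothed by a pseudocount.
            f_obs = (pair_bin[(pair, k)] + pseudo) / (pair_total[pair] + pseudo * n_bins)
            # Reference frequency: all pair types pooled together.
            f_ref = bin_total[k] / total
            potential[(pair, k)] = -math.log(f_obs / f_ref)
    return potential

def score_interface(potential, contacts, width=1.0):
    """Sum the potential over the contacts of one candidate interface."""
    return sum(
        potential.get((tuple(sorted((a, b))), bin_index(d, width)), 0.0)
        for a, b, d in contacts
    )

training = [("C", "C", 3.5)] * 8 + [("C", "O", 3.5)] * 2 + [("C", "O", 5.5)] * 8
potential = derive_potential(training)
print(score_interface(potential, [("C", "C", 3.6), ("C", "O", 5.2)]))
```

Contacts that never appeared in the training set contribute zero here, which is a design shortcut of the sketch rather than a recommendation.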

Hypothetical assemblies are generated by successively applying a graph-theoretic minimum-cut algorithm to the scored crystal contacts. Thresholds for assembly classification are obtained from statistics on the scores of these minimum-cuts.
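
The successive minimum-cut step can be sketched in Python as follows. This is a hedged illustration only: it finds the minimum cut by brute-force enumeration of bipartitions (workable for the handful of chains in a typical crystal contact graph, and a stand-in for a proper minimum-cut algorithm), it assumes non-negative contact weights where a larger weight means a stronger interface, and the single `threshold` parameter is hypothetical.

```python
from itertools import combinations

def minimum_cut(nodes, weights):
    """Return (cut_weight, side_a, side_b) for the weakest bipartition.

    nodes: list of at least two chain identifiers
    weights: dict mapping frozenset({a, b}) -> contact strength (>= 0)
    """
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Fix the first node on side A so each bipartition is visited once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            side_a = {first, *extra}
            side_b = set(nodes) - side_a
            # An edge is cut when exactly one of its ends lies in side A.
            cut = sum(w for pair, w in weights.items() if len(pair & side_a) == 1)
            if best is None or cut < best[0]:
                best = (cut, side_a, side_b)
    return best

def split_assemblies(nodes, weights, threshold):
    """Recursively split the contact graph while the weakest cut is below threshold."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return [set(nodes)]
    cut, side_a, side_b = minimum_cut(nodes, weights)
    if cut >= threshold:
        return [set(nodes)]  # held together strongly enough: one assembly

    def inner(side):
        # Keep only the contacts that lie entirely inside one side.
        return {pair: w for pair, w in weights.items() if pair <= side}

    return (split_assemblies(sorted(side_a), inner(side_a), threshold)
            + split_assemblies(sorted(side_b), inner(side_b), threshold))

contacts = {
    frozenset({"A", "B"}): 10.0,
    frozenset({"C", "D"}): 9.0,
    frozenset({"B", "C"}): 1.0,
    frozenset({"A", "D"}): 0.5,
}
print(split_assemblies(["A", "B", "C", "D"], contacts, threshold=5.0))
```

With these toy weights the weakest cut (1.5) separates A-B from C-D, and each pair is then kept as a dimer because its only remaining contact exceeds the threshold.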

The performance and generalisation behaviour of the procedure in identifying the functional assembly is assessed using cross-validation methods on the data set of trusted oligomers. A comparison is made to scoring interfaces by using a traditional measure of contact size.

The derived interface-scoring function is also expected to prove useful in screening predicted complexes in protein-protein docking protocols.

97. Modelling Class II MHC Molecules using Constraint Logic Programming
Martin T. Swain, Anthony J. Brooks, Graham J.L. Kemp, University of Aberdeen;
Short Abstract:

The MHC-Thread program uses a heuristic scoring function to predict peptides that are likely to bind to a class II MHC allele, based on the allele's known or modelled 3D structure. To increase its utility, we have developed an automatic technique for modelling peptide binding grooves using constraint logic programming.

One Page Abstract:

The identification of peptides which bind to MHC molecules is useful when hunting for regions of a protein which may be responsible for causing an unwanted immune response. The MHC-Thread program analyses three-dimensional models of candidate peptides in the peptide binding grooves of class II MHC molecules (Brooks, 1999). Heuristic functions are used to score the complex based upon chemical and spatial complementarity, and thus predict peptides likely to bind to specific alleles. The utility of this program is increased through having an automated method to build models of class II MHC alleles. Sequence comparisons suggest that the overall structure of class II MHC alleles is well conserved and that the main differences between alleles are due to mutations in the vicinity of the peptide binding groove. Thus, side-chain placement is central in constructing models of MHC alleles.

We have developed a novel approach to the side-chain placement problem that uses constraint logic programming (CLP). Our method generates a constraint-based description of atomic packing that is used iteratively to create CLP programs: each program representing successively tighter packing constraints. In these programs rotamer conformations are represented as values for finite domain variables, and bad steric contacts involving rotamers are represented as constraints. The CLP side-chain placement method has been validated by predicting side-chain conformations of X-ray structures with an accuracy comparable to that of other methods (Swain and Kemp, in press).
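
The finite-domain formulation described above can be sketched in Python. This is only an illustration of the modelling idea: plain backtracking search stands in for a constraint logic programming solver, the iterative tightening of packing constraints is not shown, and the residue names, rotamer labels and clash list are made up.

```python
def place_side_chains(domains, clashes):
    """Choose one rotamer per residue so that no forbidden pair is selected.

    domains: dict residue -> list of candidate rotamer ids (the finite domain)
    clashes: set of frozenset({(res_i, rot_i), (res_j, rot_j)}) marking bad
             steric contacts between two rotamers
    Returns a dict residue -> rotamer, or None if no assignment exists.
    """
    # Try the most constrained residues (smallest domains) first.
    residues = sorted(domains, key=lambda r: len(domains[r]))
    assignment = {}

    def consistent(res, rot):
        # The candidate must not clash with any rotamer already chosen.
        return all(
            frozenset({(res, rot), (other, chosen)}) not in clashes
            for other, chosen in assignment.items()
        )

    def backtrack(k):
        if k == len(residues):
            return True
        res = residues[k]
        for rot in domains[res]:
            if consistent(res, rot):
                assignment[res] = rot
                if backtrack(k + 1):
                    return True
                del assignment[res]  # undo and try the next rotamer
        return False

    return dict(assignment) if backtrack(0) else None

# Hypothetical groove residues with two or one candidate rotamers each.
domains = {"ARG50": ["r1", "r2"], "LEU53": ["r1", "r2"], "TYR60": ["r1"]}
clashes = {
    frozenset({("ARG50", "r1"), ("TYR60", "r1")}),
    frozenset({("ARG50", "r2"), ("LEU53", "r1")}),
}
print(place_side_chains(domains, clashes))
```

Tighter packing constraints would correspond to adding more forbidden pairs to `clashes` and solving again.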

Preliminary results obtained from the MHC-Thread program with homology models created using the CLP side-chain modelling system are encouraging, and show good agreement with experimentally derived binding data.


Brooks, A.J. (1999) Computational Prediction of HLA-DR Binding Peptides. PhD Thesis, University of Aberdeen.

Swain, M.T. and Kemp, G.J.L. (in press) Modelling protein side-chain conformations using constraint logic programming. Computers & Chemistry.

98. DoME: Rapid Molecular Docking with Adaptive Mesh Solutions to the Poisson-Boltzmann Equation
Julie C. Mitchell, Lynn F. Ten Eyck, San Diego Supercomputer Center;
Short Abstract:

The Docking Mesh Evaluator (DoME) uses adaptive mesh solutions to the Poisson-Boltzmann Equation to evaluate docking energies, interpolating potentials against a mesh that is dense in high gradient regions. DoME achieves a high level of precision in approximating electrostatic potentials and performs energy calculations far more rapidly than traditional methods.

One Page Abstract:

With the continued increase in the number of known protein structures comes a wealth of opportunity to predict molecular interactions and docked protein structures via computational means. Many molecular docking methods are able to achieve biologically accurate solutions to protein docking problems. However, it is difficult to obtain both speed and precision in a single algorithm. The most accurate methods are computationally expensive, while faster methods introduce non-trivial computational errors or ignore electrostatic information in favor of more tractable geometric algorithms.

We will present a method for molecular docking that is both highly efficient and uses a detailed implicit solvent model for approximating electrostatic energies. The Docking Mesh Evaluator (DoME) uses adaptive, finite element solutions to the Poisson-Boltzmann equation generated by the Adaptive Poisson-Boltzmann Solver (APBS) to model electrostatics. A simplex lookup scheme is employed to allow rapid interpolation of solutions defined on an irregular mesh. This mesh is also used as a basis for interpolating Lennard-Jones potentials.
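
Interpolating a potential stored at the vertices of an irregular mesh can be illustrated with barycentric coordinates. The Python sketch below is a simplified stand-in, not DoME's code: it works on 2D triangles rather than the 3D simplices a finite element solution would provide, and it locates the containing triangle by a linear scan instead of a fast lookup structure. All names are hypothetical.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return l1, l2, 1.0 - l1 - l2

def interpolate(p, vertices, triangles, values, eps=1e-12):
    """Linearly interpolate vertex values at point p.

    vertices: list of (x, y) mesh points
    triangles: list of (i, j, k) vertex index triples
    values: potential at each vertex
    Returns None when p lies outside every triangle.
    """
    for i, j, k in triangles:
        weights = barycentric(p, vertices[i], vertices[j], vertices[k])
        # p is inside (or on the edge of) the triangle if no weight is negative.
        if all(w >= -eps for w in weights):
            return weights[0] * values[i] + weights[1] * values[j] + weights[2] * values[k]
    return None

# Unit square split into two triangles; vertex values follow f(x, y) = x + 2y.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
triangles = [(0, 1, 2), (1, 3, 2)]
values = [0.0, 1.0, 2.0, 3.0]
print(interpolate((0.25, 0.25), vertices, triangles, values))
```

Because the interpolation is linear within each simplex, a linear field such as the one in the example is reproduced exactly.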

The underlying scheme for interpolating potential functions has been expanded into a collection of tools for use in molecular docking. In particular, DoME is able to interpolate electrostatic potentials over a grid or surface, comprehensively scan the docking configuration space and compute local minima to docking potential energies. The initial results for biological problems appear quite promising, and the computations are remarkably fast. For one protein-protein docking problem, computing local minima using an all atom AMBER potential consumed 30 minutes while DoME was able to perform the same computation in just a few seconds.

99. Electrostatic potential surface and molecular dynamics of HIV-1 protease Brazilian mutants
Elza Helena Andrade Barbosa, Alan Wilter da Silva, Laurent Emanuel Dardenne, Paulo Mascarello Bisch, Pedro Geraldo Pascutti, Federal University of Rio de Janeiro;
Short Abstract:

We ran 1-nanosecond molecular dynamics simulations for eleven HIV-1 protease Brazilian mutants and obtained structural images, Ramachandran plots, RMSD calculations, hydrogen bonds and the electrostatic potential on the surface. We saw conformational changes near the active site of the mutants and, for some, no electrostatic complementarity, which supports the drug resistance.

One Page Abstract:

ELECTROSTATIC POTENTIAL SURFACE AND MOLECULAR DYNAMICS OF HIV-1 PROTEASE BRAZILIAN MUTANTS

Barbosa, E. H. A. (1), da Silva, A. W. S. (1), Dardenne, L. E. (2), Bisch, P. M. (1), Pascutti, P. G. (1)

1 - Instituto de Biofísica Carlos Chagas Filho - UFRJ
2 - Laboratório Nacional de Computação Científica - CNPq

Drug resistance in HIV-1 protease has emerged in many countries. Using molecular modeling and dynamics tools, we investigated eleven HIV-1 protease Brazilian mutants that are resistant to the usual inhibitors. Theoretical models were built by homology, using as template the 3D NMR structure of the HIV-1 protease mutant C95A found in the Protein Data Bank (code 1BVE). A 1-nanosecond dynamics simulation was performed for every system (including 1BVE) using the THOR program, a software package developed in our laboratory that uses the GROMOS force field. As a result, structural images and Ramachandran plots were obtained for each mutant, and root mean square deviations were calculated. The hydrogen bonds and van der Waals contacts between the HIV-1 protease mutants and inhibitors were monitored in the active site; they remained relatively stable during the dynamics. Most of the models fluctuate around their respective minimized structures during the simulation; however, conformational changes induced by the mutations were observed. The main conformational changes near the active site were found at positions 26-29, 35, 46, 47 and 53. The electrostatic potential was calculated on the solvent-accessible surface for all mutants and the usual anti-retroviral drugs to identify charge and hydrophobic complementarities between them. For some mutants no complementarities were observed, which would explain the loss of drug activity, leading to resistance. These results support the view that drug resistance in HIV-1 protease could be induced by conformational changes and by the loss of electrostatic and hydrophobic complementarities in the mutants.
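
For readers unfamiliar with the root mean square deviation used to follow such trajectories, a minimal Python sketch is given below. It assumes the two coordinate sets list the same atoms in the same order and have already been superimposed; the function name is hypothetical and this is not taken from the THOR package.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long coordinate lists.

    coords_a, coords_b: lists of (x, y, z) tuples for matching atoms.
    No superposition is performed; the frames are assumed to be pre-fitted.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Two atoms, each displaced by one unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, frame))
```

Computing this value for every frame against the minimized structure gives the kind of fluctuation profile described above.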

100. Automated functional annotation of protein structures
Mike Hsin-Ping Liang, Russ B. Altman, Stanford University;
Short Abstract:

Current methods for constructing 3D models of protein function and for annotating protein structures are manual and time-intensive. We propose an automated method for constructing 3D models of protein functional sites that can be used for high-throughput annotation of protein structures.

One Page Abstract:

In the past, protein structures have been determined because of specific biological interest and background. More recently, various structural genomics initiatives have been rapidly determining protein structures without knowledge of their function. There is a growing need to annotate these proteins in an efficient manner to keep up with the rapid increase in structures.

Existing protein sequence motif databases provide putative function for protein sequences. However, it is well known that structure is more conserved than sequence, and it is the properties associated with particular residues and their relative position in the structure that convey function. Thus, creating a 3D motif analogous to the 1D sequence motif will increase performance in annotation of protein function. We propose an automated method for constructing 3D models of protein functional sites, by augmenting 1D sequence motifs. The model provides a 3D statistical description of the biochemical and physical properties surrounding a functional site. The model can be used to quickly scan a protein structure for potential sites. It can also be used to gain insight into which properties are involved in the particular function. This method has been applied to the EF-hand calcium binding motif.

101. Structural annotation of the human genome
Arne Mueller, Lawrence A. Kelley, Michael J.E. Sternberg, Imperial Cancer Research Fund;
Short Abstract:

The proteins of the human genome draft (Ensembl-0.8) have been assigned to homologous proteins of known structure. More than one third of the proteome is covered. We have compared the fold and domain composition of different organisms. A special focus has been put on the proteins encoded by human disease genes.

One Page Abstract:

In February 2001 the draft sequence of the human genome was published. In this work we have annotated the proteins of the public draft (1) based on the Ensembl version 0.8.0 data-set ( with protein structure by assigning homologous sequences of the SCOP (3) and PDB databases to human proteins via Blast/PSI-BLAST (4) and fold recognition using 3D-PSSM (5). The fold composition of proteins encoded by human disease genes is analysed. Results are compared with those of other organisms.

The draft human genome sequence from the Ensembl data-set contains 28913 different protein sequences of which Blast/PSI-BLAST can assign 44% to at least one protein of known structure (35% of the amino acid residues of the proteome). An additional 41% of the human sequences can be assigned to functionally annotated sequences of the public databases, and a further 16% have homology to sequences of unknown function or hypothetical proteins. Only 8% are without any detectable homology to any other sequence in the public databases including 3% (of the total) that are in non-globular regions.

With 3D-PSSM we can confidently assign 5% of the residues in the human proteome to a protein of known structure (7% of the sequences) that cannot be assigned by PSI-BLAST: 3% are in the fraction that was classified as functionally (but not structurally) annotated by PSI-BLAST, and 2% are located in the fraction of `homology but unknown function'. We are currently working on an optimised version of 3D-PSSM that is better adapted to long protein sequences to improve our results and to extend the fraction of `unknown function' to which we can assign a protein of known structure, because often structure comes with functional annotation.

Compared to the proteomes of D. melanogaster, C. elegans and S. cerevisiae, for which a fraction of 18% to 20% is completely uncharacterised, the draft human protein set is well annotated (in terms of structure and function). These results may be related to the difficulties of identifying novel genes in the human genome (i.e. gene finding). The human proteome is structurally better annotated than the other three eukaryotic genomes (27% to 28% of the proteome) but less than most bacterial genomes (lowest is 40% for M. tuberculosis, highest is 45% for E. coli).

The most popular structural superfamily (as defined by SCOP release 1.53) in the human proteome is the Immunoglobulin superfamily (which often is found as a repetitive unit), and the top ranking superfamilies are similar to those in D. melanogaster but differ (even in total number) from those in C. elegans. We present a detailed analysis of a SCOP based domain comparison between different proteomes. There are 109 superfamilies unique to the four multicellular eukaryotes above, six are unique to yeast (S. cerevisiae and S. pombe), also six superfamilies are unique to the three archaea we have processed and 68 superfamilies are unique to the seven processed bacteria.

Of the 6656 human proteins in the Ensembl database that are linked to a disease in the OMIM database (6), 3278 different proteins have at least one homologue of known structure. More than 5000 SCOP domains can be identified within these proteins. The most popular structural superfamilies resemble those of the proteome in general (e.g. Immunoglobulins, Protein kinase domains, Fibronectin).

The data from our analysis is stored in a relational database managed by MySQL, allowing for complex queries and the incorporation of new resources and genomes when available (other genomes are currently in the process pipeline). The data will be made publicly available via the world wide web.


1. International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921

2. Hubbard, T.J.P., Ailey, B., Brenner, S.E., Murzin, A.G. & Chothia, C. (1999). SCOP: A structural classification of proteins database. Nuc. Acids Res. 27:254-256.

3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein data base search programs. Nucleic Acids Res. 25:3389-3402.

4. Kelley, L.A., MacCallum, R.M. & Sternberg, M.J.E. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299:499-520.

5. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 2000. Online Mendelian Inheritance in Man, OMIM (TM). World Wide Web URL:

102. Estimation of p-values for global alignments of protein sequences.
Caleb Webber, Geoffrey J. Barton, EMBL-European Bioinformatics Institute;
Short Abstract:

The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The method presented here allows the probability that two protein sequences share the same fold to be estimated from the global sequence alignment Z-score.

One Page Abstract:

Classification and analysis of full-length protein sequences often involves the global alignment of sequence pairs. The calculation of a Z-score for the comparison gives a length- and composition-corrected measure of the similarity between the sequences. However, the Z-score alone does not indicate the likely biological significance of the similarity. A new background distribution to estimate the significance of pair-wise sequence alignment scores was developed by comparison of 250 proteins in different fold-families from the SCOP database. All 31,125 unique pairs of sequences were aligned with a range of matrices and gap penalties. The distributions of Z-scores from these alignments were fitted with a peak distribution, from which the probability of obtaining a given Z-score from a global alignment between two structurally unrelated protein sequences was calculated. This analysis was also applied to global alignment of best locally-aligned subsequences, generated by the Smith-Waterman algorithm. The relationships between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, a positive shift was observed for Z-scores derived by global alignment of locally-aligned subsequences, compared to global alignment of the entire sequence. This shift was shown to be the result of pre-selection by local alignment rather than any structural similarity in the sequences. Benchmarking the search ability of both methods using the SCOP superfamily classification showed that global alignment Z-scores are as effective as SSEARCH at low error rates and more effective at higher error rates. Global alignment of best locally-aligned subsequences was significantly less effective in this capacity. The estimation of statistical significance was shown to give similar results to the estimations of SSEARCH and BLAST, providing confidence in the method.
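
A global alignment Z-score of the kind discussed above can be sketched in Python by comparing the real alignment score with scores obtained after shuffling one sequence. This is a minimal illustration under stated assumptions, not the authors' software: it uses a simple match/mismatch score with a linear gap penalty instead of a substitution matrix, and the function names, the number of shuffles and the fixed random seed are hypothetical. The step from Z-score to probability is not shown because the fitted distribution is not specified here.

```python
import random
import statistics

def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align s[i-1] with t[j-1]
                           prev[j] + gap,       # gap in t
                           cur[j - 1] + gap))   # gap in s
        prev = cur
    return prev[-1]

def z_score(s, t, shuffles=100, seed=0):
    """Z-score of the real score against scores for shuffled versions of t.

    Shuffling keeps the length and composition of t, so the Z-score is
    corrected for both.
    """
    rng = random.Random(seed)
    real = global_score(s, t)
    letters = list(t)
    background = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        background.append(global_score(s, "".join(letters)))
    sd = statistics.pstdev(background)
    if sd == 0:
        return 0.0  # degenerate case: every shuffle scored the same
    return (real - statistics.mean(background)) / sd

print(global_score("ACD", "AD"))
print(z_score("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWY"))
```

A related pair of sequences should give a Z-score well above those typical of unrelated pairs, which is what the background distribution described above quantifies.
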
This work provides a database-independent method of assessing the significance of pair-wise sequence global alignment scores. Software to apply the statistics to any alignment is available from

103. Sequence and Structure Conservation Patterns of the Gelsolin Fold
Benyaminy Hadar, Graduate student;
Wolfson Haim, Nussinov Ruth, Professor;
Short Abstract:

The gelsolin family proteins are involved in actin cytoskeleton remodeling and can also form amyloids. We analyzed sequence and structural conservation patterns of the protein. We describe a subset of conserved residues, largely of beta structure. These are likely to be responsible for stability (and function) of the gelsolin fold.

One Page Abstract:

The gelsolin family consists of actin binding proteins that are involved in remodeling the actin cytoskeleton and can also form amyloids. The family shares a repeated motif of 125-150 residues that is found in a wide range of phyla as either three or six repeats. The repeats share low sequence homology but are similarly folded. Using a novel multiple structure comparison algorithm, the coordinate files that represent the structural diversity of the different domains of gelsolin were subjected to sequence-order-independent multiple structure alignment. A common structural core of 38 amino acids was found, capturing the common topologically conserved positions of the fold. The sequences of the aligned structures were used to initiate iterative PSI-BLAST searches of the nrdb (nonredundant database compilation). After clustering and filtering out short sequences, a final large and diverse (average pairwise percent identity of 20%) database of 270 sequences was constructed. Structural and sequence patterns were combined. The highest conservation values were found for a group of hydrophobic residues (some of which are aromatic) populating a common central beta hairpin (strands C and D). These conserved residues are likely to be responsible for stability (and function) of the gelsolin fold.