Protein Families

104.Finding all protein kinases in the human genome
105.Remote Homology Detection Using Significant Sequence Patterns
106.TRIBE-MCL: A Novel Algorithm for accurate detection of protein families
107.In Silico Analysis of Bacterial Virulence Factors: Redefining Pathogenesis
108.Relationships between structural conservation and dynamical properties in the PAS family proteins.
109.Classifying G-protein coupled receptors with support vector machines
110.Domain-finding with CluSTr: Re-occuring motifs determined with a database of mutual sequence similarity
111.PFAM domain distributions in the yeast proteome and interactome
112.Identifying Protein Domain Boundaries using Sequence Similarity for Structural Genomics Target Selection
113.Comparative study of in vitro and in vivo protein evolution.
114.DART: Finding Proteins with Similar Domain Architecture
115.Statistical approaches for the analysis of immunoglobulin V-REGION IMGT data
116.Markovian Domain Signatures: Statistical Segmentation of Protein Sequences
117.Identification Of Novel Conserved Sequence Motifs In Human Transmembrane Proteins
118.Using Profile Scores to Determine a Tree Representation of Protein Relationships.
119.Apoptosis Signalling Pathway Database - Combining Complimentary Structural, Profile based, and Pair-wise Homologies.



104. Finding all protein kinases in the human genome
Gerard Manning, Glen Charydczak, David Whyte, Sean Caenepeel, Ricardo Martinez, Sucha Sudarsanam, SUGEN, Inc.;
gerard-manning@sugen.com
Short Abstract:

We used profile HMMs, ab initio gene finding, homology and ESTs to predict all protein kinases in the human genome and used domain homology to classify and organise genes into families. We compare our predictions with those of Celera and Ensembl, and with the kinases of fly, worm and yeast.

One Page Abstract:

We have used a combination of automated gene finding methods and manual analysis to predict full length sequences for all human protein kinases. Profile HMMs were used to efficiently predict kinase catalytic domains in initial single-read genomic sequences and ESTs, with a very low error rate. To predict longer sequences in genomic assemblies, we used a mixture of ab initio gene prediction (Genscan), protein homology (Genewise and Blast) and mapping of ESTs to the genomic region. In most cases, some amount of manual curation was needed for optimal predictions, due to specific weaknesses of each prediction technique and the imperfect nature of genomic assemblies.

We found that ~20% of kinase sequences appear to be pseudogenes, with single exons, and multiple stops and frameshifts in the sequence. We compare our predictions with those of Celera and Ensembl.

We have mapped all kinase genes to chromosomal bands, and searched for genes linked to particular disease loci or cancer amplicons.

Comparison of all human kinase domains and those of yeast, worm and fly genomes reveals the presence of several new conserved groupings of kinases, and a putative orthology mapping between many novel human kinases and their model organism counterparts. Genomic comparison also shows specific expansions of sub-groups in the different lineages.


105. Remote Homology Detection Using Significant Sequence Patterns
Florian Sohler, Alexander Zien, GMD - National Research Center, Skt. Augustin, Germany;
florian.sohler@gmd.de
Short Abstract:

We present a new method to detect remote homologs for proteins. To score a candidate protein, the frequency of short but significant patterns in the sequence is used. With this scoring scheme and support vector machines we can classify proteins into their SCOP superfamilies better than BLAST.

One Page Abstract:

We present a new method to detect remote homologs for proteins using short but significant sequence patterns. The goal is to build models for protein classes that enable us to correctly classify new query proteins.

Recently it has been proposed to use probabilistic suffix trees to model protein families. Probabilistic suffix trees can be viewed as variable-order Markov models, and thus they are able to model short conserved patterns. In contrast to other widely used models like profile hidden Markov models (HMMs) or sequence profiles, no alignment information is used to create the model and no alignment is performed to score candidate sequences. The main advantage of probabilistic suffix trees is their speed. Training can be performed in time linear in the size of the input sequences, and scoring is linear in the sequence length as well. Unfortunately, in our experience, probabilistic suffix trees only work well for closely related proteins. There are at least two possible reasons for this. The first is that they do not explicitly model amino acid substitutions, insertions or deletions. The other possibility is that distant homologs cannot be found without using alignment information.

To allow for some amino acid substitutions we cluster the amino acids into groups like 'hydrophobic', 'polar' etc. and use patterns of this reduced alphabet instead of the amino acid alphabet.
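The reduced-alphabet step can be sketched as follows. This is a minimal illustration only: the abstract does not list the actual amino acid clusters, so the four groups in `GROUPS` and the function name `reduce_sequence` are assumptions.

```python
# Hypothetical grouping of the 20 amino acids. The abstract does not give the
# real clusters, so these four classes are an assumption for illustration.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}

# Invert the table: one reduced-alphabet symbol per amino acid.
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq, unknown="x"):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(AA_TO_GROUP.get(aa, unknown) for aa in seq.upper())


print(reduce_sequence("MKDE"))
```

Under this grouping, sequences such as "LIV" and "VAL" map to the same reduced string, so a pattern found in one also matches the other.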

We use suffix trees to find patterns that appear significantly often in a given class of proteins, but then apply more involved machine learning tools to build a model from these patterns. To score the significance of a given motif we count the number of appearances of that motif in the protein class and compare that number to the expected number of occurrences given a simple probabilistic model. If this significance score is above a certain threshold we accept the corresponding pattern into the list of significant patterns. Since very long (and thus specific) patterns would be scored as significant even if they appear only once, we also require each pattern to occur more often than a given minimum number. We therefore have two parameters to tune the sensitivity and specificity of the patterns chosen by our algorithm. If a pattern appears in 90% of all training sequences, it is expected to appear in most of the unknown sequences belonging to that class as well. On the other hand, if the pattern is so specific that it appears almost only in the training set, unclassified sequences that have that pattern will probably belong to that class. The length of the patterns found this way is typically between five and ten. The number of patterns can vary between 50 and several hundred, depending on the training set and the given parameters.
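The two-parameter selection described above can be illustrated with a small sketch. It makes several assumptions not stated in the abstract: patterns have a fixed length `k` (a plain k-mer count stands in for the suffix tree), the "simple probabilistic model" is taken to be independent letters with the class's own letter frequencies, and the significance score is a z-like score `(observed - expected) / sqrt(expected)`.

```python
from collections import Counter
from math import prod, sqrt


def significant_patterns(seqs, k, min_score, min_count):
    """Return {pattern: score} for k-mers that pass both thresholds."""
    # Null model (assumption): independent letters with class-wide frequencies.
    letters = Counter("".join(seqs))
    total = sum(letters.values())
    freq = {c: n / total for c, n in letters.items()}

    # Observed occurrences of every k-mer in the class.
    observed = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    positions = sum(max(len(s) - k + 1, 0) for s in seqs)

    selected = {}
    for pattern, count in observed.items():
        expected = positions * prod(freq[c] for c in pattern)
        score = (count - expected) / sqrt(expected)
        # Two tuning parameters: significance threshold and minimum count.
        if score >= min_score and count >= min_count:
            selected[pattern] = score
    return selected


family = ["ACDEFACD", "GGACDGG", "KACDK"]
print(significant_patterns(family, k=3, min_score=3.0, min_count=3))
```

In this toy class only "ACD" occurs often enough to pass the minimum-count filter; rare k-mers are dropped even when their score is high, which is the role of the second parameter.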

To build a classifier from our list of significant patterns we use support vector machines with the 'Radial Basis Functions' kernel. The features for a sequence are simply the frequencies of each of our significant patterns normalized with respect to a simple null model.
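As a rough illustration of the feature construction and kernel, the sketch below computes one normalised frequency per pattern and evaluates a radial basis function kernel between two feature vectors. SVM training itself is omitted. The normalisation used here (hits per window divided by a caller-supplied null-model probability `null_prob`) is an assumption, as are all function names.

```python
from math import exp


def pattern_features(seq, patterns, null_prob):
    """One feature per pattern: observed frequency / null-model probability."""
    features = []
    for p in patterns:
        windows = max(len(seq) - len(p) + 1, 1)
        # Count (possibly overlapping) occurrences of the pattern.
        hits = sum(seq.startswith(p, i) for i in range(len(seq)))
        features.append((hits / windows) / null_prob[p])
    return features


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


patterns = ["ACD", "KK"]
null = {"ACD": 0.5, "KK": 0.25}  # hypothetical null-model probabilities
x = pattern_features("ACDACD", patterns, null)
y = pattern_features("ACDKKK", patterns, null)
print(x, y, rbf_kernel(x, y, gamma=0.1))
```

The kernel value is 1 for identical feature vectors and decays towards 0 as the pattern profiles diverge.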

We evaluate our method by trying to predict SCOP superfamilies in a simple cross-validation protocol. For each superfamily classifier, we hold out one family of the superfamily as the test set. The remaining families, optionally supplemented with additional Blast hits, are used for training. Sequences of all other superfamilies are likewise divided into training and test sets.

Results show that our method does surprisingly well, indicating that remote homologs can be detected without computing alignments. The algorithm clearly outperforms Blast and is almost competitive with HMMs. On families that are hard for HMMs to classify its performance is comparable, while on easier families it produces more false positives. This suggests that a combination of alignment-based methods and our new method can improve prediction performance significantly.

References:

T. Jaakkola, M. Diekhans, D. Haussler: A Discriminative Framework for Detecting Remote Protein Homologies JCB, 2000, Vol. 7, no. 1/2, pp. 95-114

G. Bejerano, G. Yona: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families Bioinformatics, 2001, Vol. 17, no. 1, pp. 23-43

A. Apostolico, M. E. Bock, S. Lonardi, X. Xu: Efficient detection of unusual words JCB, 2000, Vol. 7, no. 1/2, pp. 71-94

C. J. C. Burges: A tutorial on support vector machines for pattern recognition Data mining and knowledge discovery, 1998, Vol. 2, pp. 121-167


106. TRIBE-MCL: A Novel Algorithm for accurate detection of protein families
Anton Enright, European Bioinformatics Institute;
Stijn Van Dongen, CWI, Amsterdam, Netherlands.;
Christos A.Ouzounis, European Bioinformatics Institute;
enright@ebi.ac.uk
Short Abstract:

We present a novel method for clustering of proteins into families based on sequence similarity information. This method uses 'Markov' clustering to successfully classify proteins into families with extremely high accuracy. The method is not led astray by conventional problems of this type of analysis, such as promiscuous domains and protein fragments.

One Page Abstract:

Detection of protein families in complete genomes is a valuable method in functional genomics. Members of a protein family should possess equivalent functional roles in the cell. If one knows the function of one of the members of a family, it should be possible to transfer this function to other members of the family whose functions may not be known. Generally, protein families are detected by clustering proteins together based on their sequence similarity. Many methods exist for this type of analysis; however, most of these methods are not fully automatic and rely on manual intervention for the correct detection of valid protein families. Other automatic methods fail to correctly detect protein families in complex eukaryotic datasets due to the presence of multi-domain proteins and proteins which contain a promiscuous domain. Previously we developed a method called GeneRAGE for protein family analysis in bacteria. This method could not realistically be extended to higher eukaryotes, such as the human genome, due to the computation time required to break down the complex modular domain structure of many eukaryotic proteins.

To this end we have developed a novel method for protein sequence clustering based on the Markov Clustering (MCL) algorithm. This is a purely probabilistic approach which can automatically and accurately cluster proteins into families based on sequence similarity alone, without explicit knowledge of protein domains. We first represent biological sequence similarities in terms of a graph, with nodes representing proteins and edges representing weighted sequence similarity scores which connect proteins. The MCL algorithm calculates random walks through this graph, and uses two mathematical operators to model flow within the graph. Because members of a protein family are generally more highly similar to each other than to members of other (related or unrelated) families, flow within a family is higher than flow between families (i.e. through a common or promiscuous domain). The algorithm models tidal forces through the graph until equilibrium is reached, and then calculates a clustering based on these observed patterns of flow.

We have tested the algorithm extensively using the INTERPRO, SCOP and SWISSPROT databases, and have observed very high accuracy for the assignment of proteins into valid protein families. Recently we used this algorithm to produce protein family information for the draft human genome in the Ensembl 080 release. This analysis involved the clustering of over 100,000 proteins into 13,000 families, and took a little over six hours to complete on a small workstation. Validation using databases such as SCOP and INTERPRO has indicated that the method is performing with an accuracy of >90%. We believe that this method will be extremely useful for protein family analysis and functional genomics.
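The flow simulation on the similarity graph can be sketched with a small, dense Markov clustering loop. The two operators of the MCL algorithm are expansion (squaring the column-stochastic matrix, i.e. taking longer random walks) and inflation (raising entries to a power and renormalising, which strengthens strong flows). This is a toy sketch, not the TRIBE-MCL implementation: the unit self-loops, the inflation value of 2, the fixed iteration count and the function names are all assumptions.

```python
def normalise(m):
    """Scale every column of a square matrix to sum to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(weights, inflation=2.0, iterations=100, eps=1e-6):
    """Cluster a weighted similarity graph given as a symmetric matrix."""
    n = len(weights)
    # Add self-loops (assumed weight 1.0) and build the transition matrix.
    m = normalise([[weights[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        # Expansion: flow spreads along two-step random walks.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strong flow is boosted, weak flow decays.
        m = normalise([[v ** inflation for v in row] for row in m])
    # Each surviving row is an attractor; its non-zero columns form a cluster.
    clusters = {frozenset(j for j in range(n) if row[j] > eps) for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two tight "families" joined by one weak similarity (e.g. a shared domain).
w = [[0.0] * 6 for _ in range(6)]
for family in ([0, 1, 2], [3, 4, 5]):
    for i in family:
        for j in family:
            if i != j:
                w[i][j] = 1.0
w[2][3] = w[3][2] = 0.1
print(mcl(w))
```

In this toy graph the weak edge between nodes 2 and 3 plays the role of a promiscuous domain: flow across it decays under inflation, so the two groups separate.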


107. In Silico Analysis of Bacterial Virulence Factors: Redefining Pathogenesis
Kelly Paine, Edward Jenner Institute for Vaccine Research;
kelly.paine@jenner.ac.uk
Short Abstract:

A virulence factor is any agent produced by a pathogen that is essential for causing disease in a host. Bacterial protein virulence factors have attracted great interest as targets for antimicrobial research. We have been utilising protein fingerprinting methods to characterise such moieties and redefine the meaning of pathogenesis.

One Page Abstract:

There has been a recent surge in the number of completed bacterial genome sequences, and with this explosion of data comes the need to discover novel targets for antimicrobial research. A synergistic interaction between the well-established science of bacteriology, and the emergent discipline of bioinformatics should provide tools for such a task. Bacterial resistance to conventional antibiotics is on the increase, and combined with other factors such as the prevalence of HIV, is proving costly in terms of both money and human lives. Even the strongest drugs are now useless against some species like Staphylococcus aureus.

Pathogenic mechanisms can spread quickly through a bacterial population via lateral transfer, and, as most virulent bacteria rely on the presence of these “virulence factors” to infect a human host, they must be considered essential for pathogenicity. It is these genes that will provide the novel targets required for future research into new drugs and vaccines. Bioinformatics can aid in this process; the key advantage of computer-based screening techniques is the speed at which the identification and selection of these targets can be done. Making a reality of the predictions on how a protein may act a certain way in vivo, or what sort of immune response will be elicited from a virulence factor carefully selected from database mining and gene expression profiling, has, in the past, fallen mainly to the more conventionally trained biologist.

A recognised and powerful method of classifying new protein families is to use conserved regions between multiple alignments of proteins. Each homologous region is a "motif", and sets of motifs provide a signature or fingerprint for unique identification. We have been using this method to characterise novel virulence factor protein families, in collaboration with the PRINTS group at the University of Manchester, UK. Families already analysed include components of the Gram-negative enteropathogenic type three secretion system, the Bacillus anthrax toxin, and Escherichia coli haemolysin.


108. Relationships between structural conservation and dynamical properties in the PAS family proteins.
Laura Bonati, Alessandro Pandini, Demetrio Pitea, Dipartimento di Scienze dell'Ambiente e del Territorio, Università degli Studi di Milano-Bicocca;
laura.bonati@unimib.it
Short Abstract:

To obtain information about Ah receptor binding, we applied sequence analysis tools to the multiple alignment of different AhR and propose a new protocol to correlate the dynamical properties of the 3D structures of reference PAS proteins, obtained by MD simulations, to the information on structural conservation within the family.

One Page Abstract:

By applying structure prediction and homology modelling methodologies, we previously developed [1,2] a three-dimensional model of the ligand binding domain of the mouse Aryl hydrocarbon Receptor (mAhR); this is a member of the PAS (Per-ARNT-Sim) family of transcriptional regulatory proteins. The crystal structures of the three PAS domains used as templates (the bacterial photoactive yellow protein PYP, the human potassium channel HERG and the bacterial oxygen sensing FixL protein) reveal a highly conserved structural framework. Despite the low level of sequence identity of mAhR with the templates, this high structural conservation allowed us to develop a suitable model, based on the combination of sequence and secondary structure information. On these bases we are studying [3] the binding process of mAhR with PolyChlorinated Dibenzo-p-Dioxins (PCDD), a class of ligands of environmental interest, by using Molecular Dynamics simulations to refine the mAhR model, molecular docking techniques to identify the residues directly interacting with PCDDs and hybrid QM/MM methodologies to obtain relative binding energies for a series of PCDDs. However, the modelling procedure has also suggested the possibility of obtaining more information about binding from the sequences of other Ah receptors included in the multiple alignment, as well as the need for a more accurate analysis of the molecular basis of the high structural conservation in the PAS family. Here we present an application of sequence analysis tools to the multiple alignment of Ah receptors from different species. The differences in the response of these proteins to PCDDs and the conservation of some residues in their ligand binding domain with respect to mAhR highlight key amino acids important for dioxin binding.
Moreover, we propose some tools to analyse the dynamical properties of the three-dimensional structures of the reference PAS proteins, obtained by MD simulations, and to correlate them to the information on the structural conservation within the family. Based on the idea that physical information derived from molecular modelling of protein conformations may give a key contribution in understanding the evolutionary conservation, these tools may constitute a new general protocol to correlate evolutionary information and structural dynamical behaviour in a family of proteins.

1) M. Procopio, A. Lahm, A. Tramontano, L. Bonati, D. Pitea, "Homology modeling of the AhR ligand binding domain", Organohalogen Compounds (1999) 42, 405-408.

2) M. Procopio, A. Lahm, A. Tramontano, L. Bonati, D. Pitea, "A Model for Recognition of PCDDs by the Aryl Hydrocarbon Receptor", Proteins, submitted.

3) L. Bonati, A. Pandini, D. Pitea, L. De Gioia, P. Fantucci, "Computational investigation of the PolyChlorinatedDibenzo-p-Dioxins - Ah receptor interaction: structure prediction of the ligand binding domain and molecular docking", Italian Journal of Biochemistry (2000) 49, 65.


109. Classifying G-protein coupled receptors with support vector machines
Rachel Karchin, Dr. Kevin Karplus, Dr. David Haussler, University of California, Santa Cruz;
rachelk@cse.ucsc.edu
Short Abstract:

We discuss the relative merits of various automated methods for recognizing GPCRs: BLAST, hidden Markov models and support vector machines (SVMs). Our experiments show that, for those interested in annotation-quality classification, SVMs are worth the effort. We have set up a web server for SVM GPCR subfamily classification at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass.

One Page Abstract:

The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-protein coupled receptors (GPCRs), a superfamily of cell membrane signalling proteins. GPCRs are the focus of a significant amount of current pharmaceutical research because they play an important role in many diseases. However, their structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile hidden Markov model, and methods, including support vector machines, that transform protein sequences into fixed-length feature vectors. The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the minimum error point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN. We have set up a web server for GPCR subfamily classification based on hierarchical multi-class SVMs at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. By scanning predicted peptides found in the human genome with the SVMtree server, we have identified a large number of genes that encode GPCRs. 
Although most of these were previously annotated, one appears to be novel as of our scan date in May 2001: an olfactory receptor on chromosome 1. A list of our predictions for human GPCRs is available at http://www.soe.ucsc.edu/research/compbio/gpcr_hg/class_results. We also provide suggested classification for 16 sequences previously identified as GPCRs but unclassified in GPCRDB.
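The two evaluation measures quoted above can be computed from a ranked list of classifier scores. The sketch below is an illustration under stated assumptions: the minimum error point is taken to be the score threshold at which false positives plus false negatives are fewest, "errors per sequence" is that error count divided by the number of test sequences, and score ties are ignored. The function names are hypothetical.

```python
def true_positives_before_first_fp(scored):
    """Fraction of all true members ranked above the first false positive.

    scored: iterable of (score, is_member) pairs, higher score = more confident.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total = sum(1 for _, is_member in ranked if is_member)
    found = 0
    for _, is_member in ranked:
        if not is_member:
            break
        found += 1
    return found / total if total else 0.0


def errors_at_minimum_error_point(scored):
    """Smallest (FP + FN) over all thresholds, divided by the sequence count."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, is_member in ranked if is_member)
    best = positives  # threshold above every score: all members are missed
    tp = fp = 0
    for _, is_member in ranked:
        if is_member:
            tp += 1
        else:
            fp += 1
        best = min(best, fp + (positives - tp))
    return best / len(ranked)


hits = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.1, False)]
print(true_positives_before_first_fp(hits), errors_at_minimum_error_point(hits))
```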


110. Domain-finding with CluSTr: Re-occuring motifs determined with a database of mutual sequence similarity
Evgenia V. Kriventseva, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambridgeshire, CB10 1SD, UK;
Steffen Möller, Rolf Apweiler, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge;
{zhenya,moeller}@ebi.ac.uk
Short Abstract:

This work makes use of the pairwise comparison data stored for the CluSTr project, which allows users to analyse sequence matches. Known InterPro protein signatures are used to separate the well-known domains from the uncharacterised. This facilitates a bootstrap approach to discover new protein domains.

One Page Abstract:

The CluSTr project (http://www.ebi.ac.uk/clustr/) provides an automatic classification of proteins. The classification is determined according to the pairwise Smith-Waterman similarity scores, normalised by randomisation to derive a Z-score.
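A Z-score obtained by randomisation can be sketched as follows: the Smith-Waterman score of the real pair is compared with the scores obtained after shuffling one of the sequences. This is a simplified illustration, not the CluSTr procedure: the match/mismatch scores and linear gap penalty stand in for a substitution matrix, and the number of shuffles and the function names are assumptions.

```python
import random
from statistics import mean, pstdev


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (toy scoring scheme, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            score = max(0,
                        prev[j - 1] + (match if x == y else mismatch),
                        prev[j] + gap,
                        cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def z_score(a, b, shuffles=100, seed=0):
    """(real score - mean of shuffled scores) / std of shuffled scores."""
    rng = random.Random(seed)
    real = smith_waterman(a, b)
    letters = list(b)
    background = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # same composition, randomised order
        background.append(smith_waterman(a, "".join(letters)))
    spread = pstdev(background)
    return (real - mean(background)) / spread if spread > 0 else 0.0


print(smith_waterman("GGACDEKK", "TTACDEPP"))
print(z_score("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWY"))
```

Shuffling preserves amino acid composition, so the Z-score measures how much of the raw score is due to the order of residues rather than to composition or sequence length.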

The CluSTr data on protein clusters and the underlying similarity matrix are stored in a relational database. This work uses the pairwise comparison data that also underlie the definition of the clusters. Here we present a method to display those regions of a sequence that are most often found to be similar to other proteins. The display shows this as a function of both the location on the protein sequence and the Z-score.

Additional context is offered by the visualisation of matches to the InterPro (http://www.ebi.ac.uk/interpro/) member databases and position-dependent sequence annotation from the SWISS-PROT FT lines. Regions in sequences of special interest can be specified to be automatically retrieved for further analysis, which facilitates a bootstrap approach to determine new protein domains.

A further advantage of this approach is that it helps to overcome an inherent limitation of local sequence similarity algorithms. These either focus on an area of maximum similarity, thereby ignoring remaining similarities, or are no longer specific. As a consequence, in multi-domain proteins some shared domains may be omitted as regions of pairwise sequence similarity. Assuming that the omitted domains may also occur independently of the ones found, the respective regions will still be highlighted, since the database of sequence similarities contains similarities between any two proteins.

The local sequence similarity together with the clustering of protein sequences should be a very interesting aid in the hunt for new protein domains, especially within the context of the most important information from SWISS-PROT/TrEMBL and InterPro. Protein clusters for sequences of completely sequenced eukaryotes for which no InterPro domains were found can be accessed from the Proteome Analysis pages (http://www.ebi.ac.uk/proteome/).

1. Kriventseva E. V., Fleischmann W., Zdobnov E., Apweiler R.: CluSTr: a database of Clusters of SWISS-PROT+TrEMBL proteins. Nucl. Acids Res. 2001, 29(1):33-36.

2. Apweiler R., Attwood T. K., Bairoch A., Bateman A., Birney E., Biswas M., Bucher P., Cerutti L., Corpet F., Croning M. D. R., Durbin R., Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N., Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R., Marx B., Mulder N. J., Oinn T. M., Pagni M., Servant F., Sigrist C. J. A., Zdobnov E. M.: InterPro - An integrated documentation resource for protein families, domains and functional sites. Nucl. Acids Res. 2001, 29(1):37-40.

3. Apweiler R., Biswas M., Fleischmann W., Kanapin A., Karavidopoulou Y., Kersey P., Kriventseva E. V., Mittard V., Mulder N., Phan I., Zdobnov E.: Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucl. Acids Res. 2001, 29(1):44-48.

4. Fleischmann W., Möller S., Gateau A., Apweiler R.: A novel method for automatic functional annotation of proteins. Bioinformatics 1999 Mar;15(3):228-33.


111. PFAM domain distributions in the yeast proteome and interactome
Christian Ahrens, Christoph Michetschläger, Andrei Grigoriev, GPC Biotech AG;
christian.ahrens@gpc-biotech.com
Short Abstract:

When comparing the distributions of PFAM domains in the yeast proteome and interactome, we found cell signalling and protein-protein interaction domains occurring with a higher frequency. The analysis of their co-occurrences within one protein or interaction pairs reveals certain preferred domain combinations. Possible functional implications will be discussed.

One Page Abstract:

The proteome of budding yeast (Saccharomyces cerevisiae) as defined by the Saccharomyces Genome Database (6311 proteins) and the interactome (defined by several large-scale protein-protein interaction datasets) were analysed for the presence of PFAM-A domains, using the HMMER 2.0 hidden Markov model software and the PFAM-A library of HMMs (v6.0). Several PFAM domain families occur with a higher frequency in the interactome, including domains involved in cell signalling and protein-protein interaction. In addition, the frequencies of co-occurrences of PFAM domains within one protein and within interaction pairs were determined, and preferred domain combinations could be identified for either dataset. The results of these analyses and possible functional implications will be discussed.
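The two co-occurrence counts can be sketched as follows, assuming the domain assignments have already been parsed into a mapping from protein to domain names. The protein identifiers, domain lists and function names below are hypothetical and serve only to show the counting.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_within_proteins(domains_by_protein):
    """Count unordered domain pairs that occur together in one protein."""
    counts = Counter()
    for domains in domains_by_protein.values():
        for pair in combinations(sorted(set(domains)), 2):
            counts[pair] += 1
    return counts


def cooccurrence_across_interactions(domains_by_protein, interactions):
    """Count domain pairs with one domain on each side of an interaction."""
    counts = Counter()
    for a, b in interactions:
        pairs = {tuple(sorted((da, db)))
                 for da in domains_by_protein.get(a, ())
                 for db in domains_by_protein.get(b, ())}
        counts.update(pairs)  # each pair counted once per interaction
    return counts


# Hypothetical toy data, not taken from the yeast datasets.
domains = {
    "P1": ["SH3", "Pkinase"],
    "P2": ["SH3", "SH2", "Pkinase"],
    "P3": ["WD40"],
}
interactions = [("P1", "P3"), ("P2", "P3")]
print(cooccurrence_within_proteins(domains).most_common(1))
print(cooccurrence_across_interactions(domains, interactions).most_common(2))
```

Comparing these counts with the counts expected from the individual domain frequencies would then indicate which combinations are preferred.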


112. Identifying Protein Domain Boundaries using Sequence Similarity for Structural Genomics Target Selection
Gulriz Aytekin-Kurban, Terry Gaasterland, Rockefeller University;
gulriz@frida.rockefeller.edu
Short Abstract:

The method, CLUE, predicts putative domain boundaries on a protein using pairwise alignments of the protein sequence with all available proteins computed with psi-blast. It can identify domains on sequences not classified into existing structural and functional domain families. We evaluated the method comparing resulting boundaries to structural domain boundaries.

One Page Abstract:

The Structural Genomics Initiative seeks to solve three-dimensional (3D) protein structures for as many distinct new folds as possible. These structures will in turn increase the number of computationally modeled 3D protein structures. Achieving this goal requires that candidate structure targets be selected from all available proteins such that the likelihood that a new structure will reveal a new fold is maximized. A prerequisite for target selection is the reliable identification of structural domain boundaries in proteins with no known structure. Once domains are established, the corresponding sequences can then be clustered into domain families. The domain families can be prioritized according to likely efficacy of high-throughput structure determination, and whether or not they have member proteins of known structure. This paper introduces and evaluates a new method for predicting protein domain boundaries in proteins across complete genomes in large scale. The method was applied to proteins from 12 genomes and to proteins in PDB. For the proteins already in PDB, the resulting domain boundaries are compared with structural domain boundaries from the SCOP and CATH databases.

The method introduced here, called CLUE, uses pairwise sequence alignments of a query protein with all available proteins computed with the alignment tool psi-blast. A sliding window scoring function is applied across the query sequence to identify regions with a coalescence of internal alignment boundaries, especially boundaries that include an N-terminal or C-terminal end of the aligned (target) sequence. The output of the scoring function is evaluated automatically to identify the best candidate domain boundaries in the query protein. The output of the procedure is a list of best predicted domain boundary positions and subsequences of the query protein.
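The idea of a sliding window over alignment boundaries can be sketched as below. The actual CLUE scoring function is not given in the abstract, so everything here is an assumption: each alignment start or end on the query contributes 1, a boundary that coincides with a terminal end of the target contributes an extra `terminal_bonus`, and the window simply sums contributions around each position.

```python
def boundary_scores(query_len, alignments, window=10, terminal_bonus=1.0):
    """Windowed count of alignment boundaries along the query.

    alignments: (q_start, q_end, start_is_target_end, end_is_target_end),
    with 0-based inclusive query coordinates.
    """
    raw = [0.0] * query_len
    for q_start, q_end, start_term, end_term in alignments:
        raw[q_start] += 1.0 + (terminal_bonus if start_term else 0.0)
        raw[q_end] += 1.0 + (terminal_bonus if end_term else 0.0)
    half = window // 2
    return [sum(raw[max(0, i - half): i + half + 1]) for i in range(query_len)]


def best_internal_boundary(scores, margin=20):
    """Highest-scoring position, ignoring `margin` residues at each end.

    Assumes the query is longer than 2 * margin residues.
    """
    inner = range(margin, len(scores) - margin)
    return max(inner, key=scores.__getitem__)


# Hypothetical alignments on a 200-residue query with a boundary near 100.
alignments = [
    (0, 98, True, True),
    (2, 100, False, True),
    (101, 199, True, True),
    (5, 102, False, False),
]
scores = boundary_scores(200, alignments)
print(best_internal_boundary(scores))
```

Boundaries that pile up within a few residues of each other reinforce one another inside the window, which is one simple way to capture the "coalescence" of alignment ends.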

The main strength of CLUE is that it can identify putative domain boundaries on sequences that have local sequence similarity to a set of proteins, yet cannot be classified into existing structural and/or functional domain families. Although it may sometimes perform worse than the existing classifier methods for well-known protein families, it is a valuable method for the subset of proteins where new domain families have yet to be discovered. We use CLUE as the first step to divide sequences into domains before building domain families. However, CLUE works for any arbitrary query sequence; it can be integrated into a sequence annotation system such as MAGPIE without a need for building families.

CLUE was evaluated by comparing predicted domain boundaries on every PDB sequence to the structural domain boundaries computed by the SCOP and CATH methods. For each structural domain family, we counted the number of instances where the boundary of the domain on a sequence had a predicted domain boundary within a distance of less than 30 amino acids. We excluded the cases where the domain boundary occurred at the N- or C-terminal end; the remaining cases were internal domain boundaries. Either the begin or the end position of a domain can be inside a sequence while the other is at a terminal end of the sequence. The number of cases among PDB sequences where both boundaries of a domain were internal to a sequence was very small. Therefore, two different counts for internal domain begin and end positions were computed. For each domain family, we calculated the percentage of internal instances for which a domain boundary was predicted. The average of the percentages across all families was taken to show the overall performance. CLUE predicts on average 66% of the internal begin positions of the instances of a SCOP domain family, and 65% of the internal end positions. For CATH domain families, the averages are 52% and 56%, respectively.
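The evaluation protocol (a per-family percentage, then an average over families) can be written as a short sketch. The data layout and function names are assumptions; the 30-residue tolerance is the one stated above.

```python
from statistics import mean


def family_recall(instances, tolerance=30):
    """Fraction of true internal boundaries with a prediction nearby.

    instances: list of (true_position, predicted_positions_on_that_sequence).
    """
    hits = sum(any(abs(true - p) < tolerance for p in predicted)
               for true, predicted in instances)
    return hits / len(instances)


def overall_performance(families, tolerance=30):
    """Unweighted average of the per-family fractions."""
    return mean(family_recall(inst, tolerance) for inst in families.values())


# Hypothetical families and positions, for illustration only.
families = {
    "famA": [(100, [95, 300]), (50, [90])],
    "famB": [(200, [229])],
}
print(overall_performance(families))
```

Averaging per family rather than per instance keeps large families from dominating the overall figure.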

The method presented here is efficient and scalable. It can be applied to any protein and does not require the construction of domain families for accurate structural domain boundary predictions. CLUE has been implemented with a web interface that serves predicted domain boundaries for proteins across whole genomes. The CLUE web site serves boundaries for proteins from the initial 12-genome dataset at genomes.rockefeller.edu/CLUE.


113. Comparative study of in vitro and in vivo protein evolution.
Vadim P. Valuev, Dmitry A. Afonnikov, Dmitry A. Grigorovich, Nikolay A. Kolchanov, Institute of Cytology and Genetics SB RAS;
valuev@bionet.nsc.ru
Short Abstract:

In amino acid composition, the in vitro evolved proteins deviate from native ones and follow the codon degeneracy more strongly; amino acid interchanges generally resemble those in native proteins, and better match families with restricted function. The study of pairwise correlations allowed some insight into the processes determining the structural and functional integrity of proteins. http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/

One Page Abstract:

Experiments applying in vitro protein evolution techniques began to appear in large numbers in the early 1990s. The process consists of sieving large pools of molecules (up to 10^9 individual members) through several consecutive rounds of selection and amplification, finally retrieving the molecules that most strongly show the desired property (Roberts and Ja, 1999). With various improvements, this technique has been applied to a number of goals, including selection for thermodynamically more stable proteins (Gu et al., 1995; Kim et al., 1998; Braisted and Wells, 1996), engineering of proteins with new or improved enzymatic activities (Baca et al., 1997; Fujii et al., 1998; Widersten and Mannervik, 1995), mapping epitopes and binding sites (Zozulya et al., 1999; Castano et al., 1995), finding substrates for enzymes (Matthews and Wells, 1993; Matthews et al., 1994), selecting antibodies (Clackson et al., 1991; Vaughan et al., 1998), and finding small peptide mimetics for large protein molecules (Wrighton and Gearing, 1999). We have compiled a database, ASPD (Artificial Selected Proteins/Peptides Database), that stores the published results of phage display experiments. The first release, ASPD 1.0, contains information on 120 experiments. A database entry corresponds to a set of peptides or proteins selected against one target. Generally, these sequences contain a common motif and can be aligned. Each entry contains a description of the scaffold and of the selection target, links to the SwissProt, PDB, Prosite and Enzyme databases, and the aligned set of sequences retrieved through phage display. ASPD is SRS-formatted and can be accessed at http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/.
The amino acid composition of the in vitro evolved proteins was compared with that of SwissProt; only amino acids at positions that had been randomized and retrieved through selection were taken into account, and positions that had not been randomized were ignored. The comparison shows the following trends: the most overrepresented amino acids in ASPD relative to SwissProt are tryptophan (whose percentage in ASPD is more than three times that in SwissProt), tyrosine and arginine; the most underrepresented are lysine, glutamic acid and valine. Closer examination of the distribution shows that the overall amino acid composition of ASPD follows the number of codons for each amino acid much more closely than the composition of SwissProt does. Superimposed on this codon-frequency effect is a preference for hydrophilic amino acids, which may be due to a bias intrinsic to phage display experiments, where selection is often made for amino acids that form part of active sites. The observation about codon frequencies, though very simple, suggests something important about native protein evolution: the sequences of native proteins are determined to a great extent by their evolutionary history and not by functional requirements. This is illustrated by the fact that phage display has yielded small mimetics of large protein molecules (such as erythropoietin) that have no sequence similarity with them (Wrighton et al., 1996; Wrighton and Gearing, 1999). We have also calculated the amino acid similarity matrix for our database. It correlates most strongly with the matrices of the BLOSUM family, with correlation values of about 80%. Its application in homology searches suggests that it is best suited to cases in which protein evolution is restricted by strong functional constraints so as to yield exactly isofunctional proteins.
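
As a rough illustration of the comparison with codon numbers, the sketch below derives the composition expected if every one of the 61 sense codons of the standard genetic code were equally likely, and computes over- and underrepresentation ratios against any reference composition. The uniform-codon assumption, the function names and the toy input are ours; the abstract does not describe how the comparison was actually computed.

```python
# Number of sense codons per amino acid in the standard genetic code (61 in total).
CODONS_PER_AA = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "H": 2, "K": 2, "F": 2, "Y": 2,
    "M": 1, "W": 1,
}


def codon_expected_composition():
    """Composition expected if all 61 sense codons were used uniformly (an assumption)."""
    total = sum(CODONS_PER_AA.values())
    return {aa: n / total for aa, n in CODONS_PER_AA.items()}


def composition(sequences):
    """Observed amino acid frequencies over a non-empty set of sequences."""
    counts = {aa: 0 for aa in CODONS_PER_AA}
    for seq in sequences:
        for aa in seq:
            if aa in counts:  # skip gaps and non-standard symbols
                counts[aa] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def representation_ratio(observed, reference):
    """Ratio above 1 means overrepresented relative to the reference composition."""
    return {aa: observed[aa] / reference[aa] for aa in reference if reference[aa] > 0}
```

The same `representation_ratio` function can be applied with a database-wide composition as the reference instead of the codon-based expectation.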
Each entry in the ASPD database was analyzed for the presence of pairwise correlations in terms of four amino acid properties: volume (Chothia, 1984), hydrophobicity (Eisenberg et al., 1984), isoelectric point (White et al., 1978) and polarity (Ponnuswamy et al., 1980). We found a number of clusters of correlated positions that correspond to structurally important regions of proteins. Such clusters were found on turns, where negative correlations in volume and both positive and negative correlations in isoelectric point were observed, and within the core, where correlations were mostly in hydrophobicity but also in isoelectric point and polarity. These clusters were not found in the families of native proteins. Additional information is available at http://wwwmgs.bionet.nsc.ru/mgs/gnw/aspd/. The work was supported by RFBR grant No. 00-04-49229 and its supplement No. 01-04-06240. VV is also an INTAS PhD fellow (YS-00-177).
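
A pairwise correlation between two alignment positions can be sketched as a Pearson correlation of a property value across the aligned sequences. The sketch below is an assumption-laden illustration: it substitutes the Kyte-Doolittle hydropathy scale for the Eisenberg scale cited above, and the function names are hypothetical. The abstract does not specify the statistic that was actually used.

```python
import math

# Kyte-Doolittle hydropathy values, used here as a stand-in property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def position_correlation(alignment, i, j, scale=KYTE_DOOLITTLE):
    """Correlate a property between alignment columns i and j (0-based).

    Sequences with a gap or unknown symbol at either column are skipped.
    """
    pairs = [
        (scale[s[i]], scale[s[j]])
        for s in alignment
        if s[i] in scale and s[j] in scale
    ]
    if len(pairs) < 2:
        return None
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

Running `position_correlation` over all column pairs of an entry and keeping the pairs with large absolute values would give candidate correlated positions; any significance threshold is left open here.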


114. DART: Finding Proteins with Similar Domain Architecture (up)
LY Geer, M Domrachev, DJ Lipman, SH Bryant, NCBI/NLM/NIH;
lewisg@ncbi.nlm.nih.gov
Short Abstract:

The Domain Architecture Retrieval Tool (DART) identifies proteins with similar domain composition. Domains in a query sequence are identified by a sensitive profile search. Proteins with similar domain architectures are retrieved and listed in ranked order. DART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps

One Page Abstract:

The Domain Architecture Retrieval Tool (DART), hosted by NCBI, performs similarity searching of proteins based on their domain architecture. The goal is to find protein similarities using consistent and sensitive protein domain profiles, rather than solely by sequence similarity. DART has been designed to be fast and informative. The underlying algorithm is based on domain annotation of a significant subset of all publicly known protein sequences through the use of Reverse-PSI-BLAST (RPS-BLAST) [1] and protein domain databases, including SMART [2] and Pfam [3].

Given a protein sequence, DART runs RPS-BLAST and displays the protein in a "beads on a string" style. DART then displays a ranked, graphical list of proteins with similar sets of domains. Ranking is done by the number of unique hits to domains that are the same as, or redundant with, the domains in the query sequence. The query can be refined taxonomically or by selecting domains of interest. DART is linked to CD-Search (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi), which also uses RPS-BLAST and can display the domain profile alignment in greater detail.
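
The ranking rule can be illustrated with a small Python sketch. It is not DART's implementation: the function name, the `cluster_of` mapping from a domain to the representative of its redundancy cluster, and the toy protein and domain names are assumptions made for the example.

```python
def rank_by_domain_hits(query_domains, database, cluster_of=None):
    """Rank proteins by the number of unique domains shared with the query.

    database: dict mapping a protein id to its list of domain ids.
    cluster_of: optional dict mapping a domain id to the representative of
    its redundancy cluster, so that redundant domains count as the same hit.
    """
    cluster_of = cluster_of or {}

    def canonical(domain):
        return cluster_of.get(domain, domain)

    query = {canonical(d) for d in query_domains}
    scored = []
    for protein, domains in database.items():
        shared = query & {canonical(d) for d in domains}
        if shared:
            scored.append((protein, len(shared)))
    # Most shared domains first; ties broken by protein id for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Toy data with illustrative names only.
toy_db = {
    "P1": ["SH2", "SH3", "Pkinase"],
    "P2": ["SH2"],
    "P3": ["PDZ"],
    "P4": ["S_TKc", "SH3"],
}
toy_redundancy = {"S_TKc": "Pkinase"}
```

Using sets means that repeated copies of a domain in one protein count once, which is one reading of "unique hits" in the text.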

To create the databases underlying DART, all the sequences in the NCBI non-redundant database (nr) [4] are aligned to Pfam, SMART, and other domain databases using RPS-BLAST. These alignments are sorted by sequence and by domain.

Redundancy between protein domains is used in ranking and querying the sequences because domain databases contain related domains. Two domains are defined as redundant if a significant number of their hits to nr overlap. Redundant pairs are clustered transitively to create a final list of redundant domains.
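
Transitive clustering of redundant pairs amounts to finding connected components. A minimal sketch using a union-find structure is shown below; the function name and the example identifiers are hypothetical, and how the redundant pairs themselves are detected is outside the sketch.

```python
def cluster_redundant(pairs):
    """Group domains into clusters so that redundancy is transitive.

    pairs: iterable of (domain_a, domain_b) tuples judged redundant.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for domain in list(parent):
        clusters.setdefault(find(domain), set()).add(domain)
    return sorted(sorted(c) for c in clusters.values())
```

For example, the pairs (A, B) and (B, C) place A, B and C in one cluster even though A and C were never compared directly.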

DART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps

[1] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Nucleic Acids Res. 1997 Sep 1; 25(17): 3389-3402.

[2] Schultz J, Copley RR, Doerks T, Ponting CP, Bork P. Nucleic Acids Res. 2000 Jan 1; 28(1): 231-234.

[3] Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL. Nucleic Acids Res. 2000 Jan 1; 28(1): 263-266.

[4] Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA. Nucleic Acids Res. 2000 Jan 1; 28(1): 10-14.


115. Statistical approaches for the analysis of immunoglobulin V-REGION IMGT data (up)
Christelle Pommié, Manuel Ruiz, Nathalie Syz, Véronique Giudicelli, LIGM Institut de Génétique Humaine;
Robert Sabatier, Laboratoire de physique Moléculaire;
Marie-Paule Lefranc, LIGM Institut de Génétique Humaine;
cpommie@ligm.igh.cnrs.fr
Short Abstract:

IMGT, the international ImMunoGeneTics database (http://imgt.cines.fr), is an integrated information system specializing in Immunoglobulins, TcR and MHC molecules of all vertebrate species. Our aim was to define the most appropriate statistical methods for analyzing the IMGT sequence and structural data, in order to establish amino acid correlations in 3D structures.

One Page Abstract:

Owing to their fundamental role in the immune system, the Immunoglobulin (Ig) and T cell Receptor (TcR) variable domains (corresponding to the V-J-REGION and V-D-J-REGION labels in IMGT, the international ImMunoGeneTics database, http://imgt.cines.fr) have been extensively studied. Moreover, owing to the sequencing efforts of recent years, all the human Ig and TcR genes are now characterized. Analysis of the correlation between sequences, structures and specificities of the variable domains has important implications for medical research (repertoire in autoimmune diseases, AIDS, leukemias, lymphomas, myelomas), therapeutic approaches (antibody engineering), and the study of genome diversity and genome evolution. The Ig and TcR V-REGIONs are a privileged case because their structure is conserved despite divergent sequences and because a considerable amount of genomic, structural and functional data is available. The IMGT unique numbering for Ig and TcR V-REGION sequences of all vertebrate species has been established to facilitate sequence comparison and cross-referencing between experiments from different laboratories, whatever the antigen receptor (Ig or TcR), the chain type (heavy or light chains for Ig; alpha, beta, gamma or delta chains for TcR) or the species. In the IMGT unique numbering, conserved amino acids from the framework regions (FR) always have the same number, whatever the Ig or TcR variable sequence and whatever the species they come from. The IMGT unique numbering has made it possible to redefine the limits of the FR and of the complementarity determining regions (CDR). The FR-IMGT and CDR-IMGT lengths themselves become crucial information that characterizes variable regions belonging to a group, a subgroup and/or a gene. FR amino acids located at the same position in different sequences can be compared without requiring sequence alignments. This also holds for amino acids belonging to CDR-IMGT of the same length.
The IMGT unique numbering permits rapid correlation between protein sequences and the three-dimensional (3D) structure of Ig and TcR V-REGIONs. Standardized multi-sequence alignments obtained with the IMGT unique numbering make it possible to apply statistical approaches to the amino acid physico-chemical properties, position by position. These analyses are not only useful for studying mutations and allele polymorphisms, but are also needed to establish correlations between amino acids in the protein 3D structures and to extract new knowledge in the IMGT/PROTEIN-DB database, currently in development. As an example of our approach, we describe below the statistical analysis of the hydropathy property of the amino acids found at standardized positions of the three frameworks of the V-REGIONs of two types of chains, the human immunoglobulin kappa and lambda light chains. A total of 1114 human rearranged productive Ig V-REGIONs were obtained, 585 belonging to the kappa chains and 529 to the lambda chains. The V-REGION nucleotide sequences were translated into amino acid sequences. Gaps and delimitations of the FR-IMGT and CDR-IMGT were created according to the IMGT unique numbering. For the V-REGIONs of each chain type, three sets were created, corresponding to FR1-IMGT (amino acid positions 1 to 26), FR2-IMGT (amino acid positions 39 to 55) and FR3-IMGT (amino acid positions 66 to 104), respectively. The six amino acid sequence sets were analyzed to obtain contingency tables containing the number of each amino acid at each position. The statistical analysis was carried out with two different but complementary multivariate descriptive statistical analysis (MDSA) methods, correspondence (or factor) analysis and hierarchical classification (Ward's method), using the ADE-4 software. The amino acid positions of the kappa and lambda FR1-IMGT, FR2-IMGT and FR3-IMGT sets were compared, two by two, for the amino acid "hydropathy" variable class. A total of six analyses were performed.
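
The contingency tables that feed the multivariate analyses can be sketched as follows. This is a hedged illustration only: the three-class hydropathy grouping, the function name and the toy sequences are assumptions (the abstract does not list the classes), and positions are 0-based indices into gapped sequences that are already laid out on a common numbering.

```python
from collections import Counter

# Assumed three-class hydropathy grouping, for illustration.
CLASSES = ["hydrophobic", "neutral", "hydrophilic"]
HYDROPATHY_CLASS = {}
HYDROPATHY_CLASS.update({aa: "hydrophobic" for aa in "ACILMFWV"})
HYDROPATHY_CLASS.update({aa: "neutral" for aa in "GHPSTY"})
HYDROPATHY_CLASS.update({aa: "hydrophilic" for aa in "RNDQEK"})


def contingency_table(sequences, positions):
    """Count hydropathy classes at each position.

    Returns a dict mapping position -> [hydrophobic, neutral, hydrophilic] counts.
    Gaps and unknown symbols are ignored.
    """
    table = {}
    for pos in positions:
        counts = Counter(
            HYDROPATHY_CLASS[seq[pos]]
            for seq in sequences
            if seq[pos] in HYDROPATHY_CLASS
        )
        table[pos] = [counts[c] for c in CLASSES]
    return table
```

A table of this shape (positions as rows, classes as columns) is the usual input to a correspondence analysis; the analysis itself is not reproduced here.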
A correspondence analysis (COA in ADE-4) was applied to each set of kappa amino acid positions (from FR1-IMGT, FR2-IMGT and FR3-IMGT, respectively), and the corresponding set of lambda amino acid positions (from FR1-IMGT, FR2-IMGT and FR3-IMGT, respectively) was projected for the hydropathy variable. One hundred fifty-seven (79 kappa and 78 lambda) amino acid positions from 1114 Ig V-REGION sequences were analysed by the correspondence and classification analysis methods. These methods, which are appropriate for the analysis of large data matrices, are particularly interesting in view of the large amount of data to be studied in IMGT. Moreover, used together, they provide different but complementary results and allow a reciprocal analysis of the data. Such an approach was feasible owing to the standardization of the amino acid positions in IMGT sequences. The statistical differences in the hydropathy variable at given amino acid positions have made it possible to define the characteristic hydropathy property of the kappa and lambda amino acids, respectively. Conversely, the statistical resemblances between kappa and lambda positions have made it possible to identify positions where the amino acid hydropathy property may be important for the conserved structure of the Ig fold. Similar analyses with other variables (amino acid solvent accessibility, hydrogen and van der Waals bonds) and on other sets of sequences will be particularly useful for establishing correlations between amino acid positions of the Ig fold.


116. Markovian Domain Signatures: Statistical Segmentation of Protein Sequences (up)
Gill Bejerano, Yevgeny Seldin, Naftali Tishby, School of Computer Science & Engineering, The Hebrew University;
jill@cs.huji.ac.il
Short Abstract:

We present a novel method for protein sequence domain detection and classification. Our method is fully automated, does not require multiple alignments, and handles heterogeneous unordered multi-domain groups. It constructs unique domain signatures through clustering regions of conserved statistics. Examples detect protein fusion events, and outperform HMM classification.

One Page Abstract:

Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional methods based on multiple sequence alignment, such as hidden Markov modeling (HMM), run into difficulties when faced with heterogeneous groups of proteins. Yet many families of proteins that share a common domain also contain instances of several other domains, without any common linear ordering. Ignoring this modularity may lead to poor or even false classification and annotation. An automated method that can decompose a group of proteins into the sequence domains it contains is therefore highly desirable.

We apply a novel method to this problem. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A variable memory Markov model (VMM) is built using a prediction suffix tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments. A deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. We show that regions of conserved statistics correlate well with protein sequence domains, by matching a unique signature to each domain. This is done in a fully automated manner, and does not require or attempt a multiple alignment. Several representative cases are presented. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences.
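
To give a feel for the variable memory idea, the sketch below implements a much simplified context model: it counts which symbol follows every context of up to `max_depth` residues and predicts with the longest context seen in training. It is not the authors' PST construction (there is no pruning, no segmentation and no deterministic annealing), and the class name, parameters and toy sequences are assumptions. The competition between models is reduced to comparing log-likelihoods of a segment.

```python
import math
from collections import defaultdict


class SuffixContextModel:
    """Simplified variable-order Markov model (illustrative, unpruned)."""

    def __init__(self, max_depth=3, alphabet_size=20, pseudocount=1.0):
        self.max_depth = max_depth
        self.alphabet_size = alphabet_size
        self.pseudocount = pseudocount
        # counts[context][symbol] = times symbol followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            for i, sym in enumerate(seq):
                for depth in range(min(self.max_depth, i) + 1):
                    self.counts[seq[i - depth:i]][sym] += 1

    def prob(self, context, sym):
        """P(sym | context), backing off to the longest context seen in training."""
        for depth in range(min(self.max_depth, len(context)), -1, -1):
            ctx = context[len(context) - depth:]
            if ctx in self.counts:
                seen = self.counts[ctx]
                total = sum(seen.values())
                return (seen.get(sym, 0) + self.pseudocount) / (
                    total + self.pseudocount * self.alphabet_size
                )
        return 1.0 / self.alphabet_size  # untrained model: uniform

    def log_likelihood(self, seq):
        return sum(math.log(self.prob(seq[:i], s)) for i, s in enumerate(seq))


def best_model(segment, models):
    """Name of the model that assigns the segment the highest log-likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(segment))
```

With one model per group of segments, reassigning each segment to its `best_model` and retraining is a crude stand-in for the competition step described in the text.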


117. Identification Of Novel Conserved Sequence Motifs In Human Transmembrane Proteins (up)
Eike Staub, Artemis Hatzigeorgiou, Bernd Hinzmann, Christian Pilarsky, Thomas Specht, Andre Rosenthal, metaGen Pharmaceuticals GmbH;
eike.staub@metagen.de
Short Abstract:

Here we present an approach to identify novel conserved protein motifs in human transmembrane (TM) proteins. Thousands of predicted TM protein sequences were scanned for known protein domains, transmembrane segments, low-complexity and coiled-coil regions. Using a PSIBLAST-based procedure we identified interesting new motifs in the remaining, previously uncharacterized sequence fragments.

One Page Abstract:

Here we present an approach to identify novel conserved protein motifs in human transmembrane (TM) proteins. Thousands of predicted TM protein sequences were scanned for known protein domains, transmembrane segments, low-complexity and coiled-coil regions. Using a PSIBLAST-based procedure we identified interesting new motifs in the remaining, previously uncharacterized sequence fragments.
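
The step of isolating the remaining, uncharacterized fragments can be sketched as interval subtraction: everything covered by a known domain, transmembrane segment, low-complexity or coiled-coil region is masked, and the uncovered stretches are kept. The function name, the 1-based inclusive coordinate convention and the minimum fragment length are assumptions for illustration; the abstract gives no such details.

```python
def uncovered_fragments(seq_len, annotated, min_len=30):
    """Return (start, end) fragments not covered by any annotated region.

    annotated: list of (start, end) intervals, 1-based and inclusive,
    possibly overlapping. min_len is a hypothetical cutoff below which
    a fragment is considered too short to search with.
    """
    fragments = []
    pos = 1  # first residue not yet covered
    for start, end in sorted(annotated):
        if start - pos >= min_len:
            fragments.append((pos, start - 1))
        pos = max(pos, end + 1)
    if seq_len - pos + 1 >= min_len:
        fragments.append((pos, seq_len))
    return fragments
```

Each returned fragment could then be used as a query in an iterative profile search; that part is not shown.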


118. Using Profile Scores to Determine a Tree Representation of Protein Relationships. (up)
K. Diemer, T. Hatton, P. Thomas, Celera Genomics;
diemerkl@fc.celera.com
Short Abstract:

An algorithm is introduced that uses profile scores to generate a tree of orthologs/paralogs and to split it into functional subgroups automatically. The similarity measure is based on the score of one sequence cluster to the profile of another. The algorithm is compared with other methods and expert curation.

One Page Abstract:

Many algorithms have been proposed for reconstructing the evolution of protein families from DNA or protein sequence information. The primary goal has been to model the most likely historical sequence of events that gave rise to the protein sequences observed today in the form of orthologs and paralogs.

Another use of phylogenetic trees has emerged: prediction of "attributes" of proteins, primarily function, from sequence information. This use emerged first from genetic "rescue" experiments, in which a defective protein in one organism can be functionally replaced by a related protein from another organism, and it has been repeatedly observed that proteins more closely related in sequence tend to be more closely related in function. It is also well known that the functional specificity of a given protein is generally conferred by only a subset of its constituent amino acids. Some of this specificity can be inferred from analysis of a protein family: positions that vary among the family members are not required for whatever function(s) the family members have in common, while positions that are strictly conserved may be important for those functions. Statistical profiles that describe the conservation patterns at different positions in a set of related proteins have been used to aid in phylogenetic reconstruction. Whether or not using profiles leads to more accurate phylogenetic reconstruction, they may lead to a greater correlation with function, which is the primary focus of the work presented here.

The algorithm introduced here uses agglomerative clustering, where the similarity measure used to join clusters is based on an approximation to the weighted score of the sequences in one cluster to the profile of the other cluster. Sequence fragments, which are not infrequent in current sequence databases, can be accommodated easily. A heuristic score-based measure is used to split the tree into functional subgroups. When assessing the performance of our algorithm, we examine the correlation between the resulting tree and the functions of the constituent proteins. We have evaluated the algorithm on alignments and corresponding expert functional annotations from publicly accessible websites, as well as several internally constructed test cases.
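
A minimal sketch of agglomerative clustering with a profile-based similarity is given below. It is an illustration under several assumptions, not the algorithm of this abstract: sequences are taken to be pre-aligned and of equal length, the profile is a plain column-wise frequency table with pseudocounts, the score is a log-odds against a uniform background, and the weighting, fragment handling and subgroup splitting described above are omitted. All names are hypothetical.

```python
import math
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def profile(cluster, pseudocount=1.0):
    """Column-wise residue probabilities for equal-length aligned sequences."""
    columns = []
    for j in range(len(cluster[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in cluster:
            if seq[j] in counts:  # gaps are ignored
                counts[seq[j]] += 1
        total = sum(counts.values())
        columns.append({aa: c / total for aa, c in counts.items()})
    return columns


def score(seq, prof):
    """Log-odds of a sequence under a profile versus a uniform background."""
    background = 1.0 / len(ALPHABET)
    return sum(
        math.log(col[aa] / background) for aa, col in zip(seq, prof) if aa in col
    )


def similarity(a, b):
    """Mean score of each cluster's sequences against the other cluster's profile."""
    prof_a, prof_b = profile(a), profile(b)
    a_to_b = sum(score(s, prof_b) for s in a) / len(a)
    b_to_a = sum(score(s, prof_a) for s in b) / len(b)
    return (a_to_b + b_to_a) / 2


def agglomerate(sequences):
    """Greedily merge the most similar pair of clusters; return the merge order."""
    clusters = [[s] for s in sequences]
    merges = []
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

The list of merges defines a binary tree; cutting that tree at a score threshold would be one way to obtain subgroups, but the heuristic used by the authors is not described in enough detail to reproduce.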


119. Apoptosis Signalling Pathway Database - Combining Complimentary Structural, Profile based, and Pair-wise Homologies. (up)
Kutbuddin S. Doctor, John C. Reed, Adam Godzik, The Burnham Institute;
Philip E. Bourne, San Diego Super Computer Center & University of California, San Diego;
ksdoctor@burnham-inst.org
Short Abstract:

This relational database system and web interface (http://apoptosis-db.org/) collects and organizes protein families involved in apoptosis. The database is organized around domains (profile based) that are partially indicative of apoptotic function. Sub-family classification based on combined pair-wise homology over domains provides reliable functional classification. Structural homology links profile based domains.

One Page Abstract:

This relational database system and web interface (http://apoptosis-db.org) collects and organizes protein families involved in apoptosis. The database is organized around domains (profile based) that are partially indicative of apoptotic function. Sub-family classification based on combined pair-wise homology over domains provides reliable functional classification. Structural homology links profile based domains which share more generalized functions.