Protein Function and Localization

204. (withdrawn)
205. Eukaryotic Protein Processing: Predicting the Cleavage Sites of Proprotein Convertases
206. Understanding multi-organelle predicted subcellular localization
207. Visualization and Interpretation of the Molecular Scanner Data
208. SFINX: A generic system for integrated graphical analysis of predicted protein sequence features
209. Protein Pathway Profiling
210. Ab initio prediction of human orphan protein function
211. A consensus based approach to genome data mining of beta barrel outer membrane proteins.
212. Predicting tyrosine sulfation sites in protein sequences
213. Characterization of aspartylglucosaminuria mutations
214. Elucidating a "theoretical" proteome of the Arabidopsis thaliana thylakoid
215. Machine Learning Algorithms in the Detection of Functional Relatedness of Proteins
216. Predicting protein functions based on InterPro and GO
217. iPSORT: Simple rules for predicting N-terminal protein sorting signals.



205. Eukaryotic Protein Processing: Predicting the Cleavage Sites of Proprotein Convertases
Peter Duckert, Søren Brunak, Nikolaj Blom, Center for Biological Sequence Analysis, Biocentrum-DTU, Technical University of Denmark, DK-2800, Denmark;
peterd@cbs.dtu.dk
Short Abstract:

Many biologically active proteins and peptides are generated by limited endoproteolysis of inactive precursors. Cleavage often occurs at sites containing basic amino acids. We examined sequence patterns characteristic of experimentally verified sites and describe a neural network based method for predicting whether a given site is a potential cleavage site.

One Page Abstract:

Many biologically active proteins and peptides are generated by limited endoproteolysis of inactive precursors. This is an important and evolutionarily ancient mechanism that determines the level and duration of specific biological activities and also ensures that the biologically active molecules are formed in the appropriate cellular compartments. After removal of the signal peptide, precursor cleavage often occurs at sites composed of single or paired basic amino acids (arginine [R] or lysine [K]) (Seidah & Chrétien, 1999).

The enzymes responsible for this cleavage are relatively few in number and have general functions. They have been molecularly and functionally characterized and shown to belong to a family of evolutionarily conserved serine proteases related to the subtilisin and kexin enzymes.

Seven mammalian members of a dibasic- and monobasic-specific subfamily of proteases related to the yeast subtilase kexin ("proprotein convertases" or PCs) are presently known: PC1, PC2, furin, PC4, PC5, PACE4, and PC7. Since not all mono- and dibasic peptides are potential cleavage sites of PCs, we examined the sequence patterns characteristic of experimentally verified sites and describe a neural-network-based method for predicting whether a given site is a potential cleavage site for the PC enzymes. Here we present preliminary work on the characterization and prediction of PC cleavage sites.
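
As an illustration of how candidate sites might be prepared as input for a neural network, the following minimal sketch lists every basic residue (R or K) in a sequence and one-hot encodes a window around it. The window size (`flank=4`), the sparse encoding, and the function names are assumptions made for this example; the abstract does not specify the encoding actually used.

```python
# Illustrative sketch only: window size and encoding are assumptions,
# not the published method.
BASIC = {"R", "K"}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def candidate_sites(seq):
    """Return 0-based indices of basic residues (R or K).

    Each such residue is treated as a candidate site; whether it is
    actually cleaved is what a trained predictor would decide.
    """
    return [i for i, aa in enumerate(seq) if aa in BASIC]


def encode_window(seq, pos, flank=4):
    """One-hot encode residues pos-flank .. pos+flank as a flat list.

    Positions beyond the sequence termini (or unknown residues) are
    encoded as all zeros.
    """
    vec = []
    for i in range(pos - flank, pos + flank + 1):
        row = [0] * len(AMINO_ACIDS)
        if 0 <= i < len(seq) and seq[i] in AMINO_ACIDS:
            row[AMINO_ACIDS.index(seq[i])] = 1
        vec.extend(row)
    return vec
```

Each encoded window would then be labelled as a verified or non-verified site and used as one training example.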


206. Understanding multi-organelle predicted subcellular localization
Joel Zupicich, Steven E. Brenner, William C. Skarnes, University of California, Berkeley;
joelz@socrates.berkeley.edu
Short Abstract:

We have approached the problem of protein localization using tools to identify domains that are unlikely to appear in a single polypeptide. Using stringent criteria for existing tools, we have identified a large class of proteins in the SwissProt-TrEMBL database that exhibit characteristics of multiple organelles.

One Page Abstract:

We have approached the problem of protein localization using tools to identify domains that are unlikely to appear in a single polypeptide. Using stringent criteria for existing computational tools, we have identified a large class of proteins in the SwissProt/TrEMBL database that exhibit characteristics of multiple organelles. Our results show that domains largely thought to be incompatible can exist in a single protein. Our data may lead to the discovery of new protein functions that are unlikely to be uncovered using classical biochemistry. In addition, these proteins are represented in taxonomically diverse species and are especially prevalent in C. elegans. Using subcellular localization methods in cell culture, we confirm our computational predictions for three mammalian proteins. In light of these results, the regulation of known proteins we identified may need to be reevaluated.


207. Visualization and Interpretation of the Molecular Scanner Data
Müller, Markus, Gras, Robin, Appel, Ron, Swiss Institute of Bioinformatics;
Hochstrasser, Denis, LCCC, Geneva University Hospital, Geneva, Switzerland;
markus.mueller@isb-sib.ch
Short Abstract:

The molecular scanner is a highly automated method that combines 2D-gel electrophoresis with peptide mass fingerprinting (PMF) techniques in order to identify the proteins in a 2D-gel. Based on visualization methods, we derive a coupled map lattice algorithm that improves the signal-to-noise ratio of the PMF identification.

One Page Abstract:

The molecular scanner is a highly automated tool that combines 2D-gel electrophoresis with peptide mass fingerprinting (PMF) techniques. Proteins separated in a 2D-gel are digested 'in parallel' and transferred onto a collecting membrane. Since diffusion in this process is not a problem, the location of the peptides on the membrane corresponds to the location of their proteins in the 2D-gel. A matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometer then scans the membrane yielding a list of peptide masses for each scanned point. We visualize all the obtained masses, which provides important information on the presence of chemical noise. Since chemical noise is shown to be a potential source for false matches in the PMF identification procedure, removing this noise improves the results. We then present an algorithm based on coupled map lattices that makes use of the intensity distribution of the detected masses. It calculates the centers of these distributions and groups together nearby centers of different masses. The masses belonging to the same group are then submitted to our in-house PMF identification program SmartIdent. Since these masses are purged of chemical noise and overlapping masses of other centers they provide an unambiguous identification which would not be the case if untreated mass lists were submitted. These identifications are then used to create a 2-dimensional protein map of the 2D-gel.
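
The grouping step described above can be illustrated with a much simpler stand-in. The sketch below is not the coupled map lattice algorithm; it only computes an intensity-weighted center for each mass and greedily groups masses whose centers lie close together. The data layout (`scans`), the `radius` parameter, and the function names are assumptions made for this example.

```python
import math


def intensity_center(points):
    """Intensity-weighted centroid of one mass.

    points: list of (x, y, intensity) tuples with positive intensities.
    """
    total = sum(w for _, _, w in points)
    cx = sum(x * w for x, _, w in points) / total
    cy = sum(y * w for _, y, w in points) / total
    return cx, cy


def group_masses(scans, radius=1.0):
    """Greedily group masses whose centers lie within `radius`.

    scans: dict mapping a mass to its list of (x, y, intensity) points.
    A mass joins the first group whose seed center is close enough;
    otherwise it starts a new group.
    """
    groups = []  # list of (seed_center, [masses])
    for mass in sorted(scans):
        center = intensity_center(scans[mass])
        for seed, members in groups:
            if math.dist(seed, center) <= radius:
                members.append(mass)
                break
        else:
            groups.append((center, [mass]))
    return [members for _, members in groups]
```

Each resulting group of masses would correspond to one mass list submitted to a PMF identification program.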


208. SFINX: A generic system for integrated graphical analysis of predicted protein sequence features
Erik Sonnhammer, Center for Genomics and Bioinformatics, Karolinska Institutet;
Erik.Sonnhammer@cgr.ki.se
Short Abstract:

A package for integrated graphical analysis of predicted protein sequence features is presented. The output from a large set of prediction programs for coiled coils, secondary structure, signal peptides, and transmembrane segments is presented graphically in the Blixem and Dotter viewers. The system is available at www.cgr.ki.se/SFINX.

One Page Abstract:

Correct prediction of functionally important sequence features is becoming increasingly important as we enter the era of functional genomics, when wet experiments are guided by bioinformatics. One important type of sequence feature is the signal for subcellular localization. Such information will greatly influence the hypothesis of biological function, and will guide the choice of experiments to test this hypothesis.

However, different programs often produce disagreeing predictions, even for transmembrane topology which is often considered trivial. It is therefore important to detect areas of disagreement between programs or algorithmic variants and use all available information to produce a model of the most reliable structural and functional sequence features in a given protein.
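
To make the idea of detecting disagreement concrete, the following minimal sketch takes segment predictions from several programs and reports the residue ranges where some, but not all, programs predict the feature. The input layout (1-based inclusive segments per program) and the function names are assumptions for this example; this is not the SFS format, which is not specified here.

```python
def per_residue_votes(length, predictions):
    """Count, for each residue, how many programs predict the feature.

    predictions: dict mapping program name to a list of (start, end)
    segments, 1-based and inclusive.
    """
    votes = [0] * length
    for segments in predictions.values():
        covered = set()
        for start, end in segments:
            covered.update(range(max(start, 1), min(end, length) + 1))
        for pos in covered:
            votes[pos - 1] += 1
    return votes


def disagreement_regions(length, predictions):
    """Return (start, end) runs where programs disagree (1-based, inclusive)."""
    n_programs = len(predictions)
    votes = per_residue_votes(length, predictions)
    regions, run_start = [], None
    for pos, v in enumerate(votes, start=1):
        disputed = 0 < v < n_programs
        if disputed and run_start is None:
            run_start = pos
        elif not disputed and run_start is not None:
            regions.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        regions.append((run_start, length))
    return regions
```

For two transmembrane predictors that place the same helix at residues 3-8 and 5-10, the disputed regions are the helix ends, 3-4 and 9-10.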

With these purposes in mind, a generic software design which allows many different sets of segmental or continuous-curve sequence features to be viewed in combination is presented. Any such data, generated by individual external programs, may be judged alongside a self dot-plot or a multiple alignment of database matches. The implementation is based on extensions to the graphical viewers Dotter and Blixem, and scripts that convert data from external programs to a simple generic data definition format called SFS (Sequence Feature Series). The entire package of scripts and graphical viewers is called SFINX.

A web server that can run these analyses and launch Blixem or Dotter on Windows and Unix machines is available at www.cgr.ki.se/SFINX. The output is passed to the viewer in the generic SFS data format and could thus in principle be displayed by other viewers as well. It is also possible to get the output of the predictions in XML.

The poster describes applications for analysis of compositional and repetitive features in protein sequences, such as predicted coiled coils, secondary structure, signal peptides, transmembrane segments, as well as general low-complexity or periodic subsequences. Dot-plots and flanking database matches provide valuable contextual information for these assignments. It further shows that disagreement between prediction programs is very common, and that simultaneous inspection of underlying propensities, predictions, and homology information can lead to significantly improved prediction of structural and functional features, and localization.


209. Protein Pathway Profiling
Peter Rieger, Head of Method Development;
prieger@kelman.de
Short Abstract:

The in silico protein-protein interaction mapping of the complete yeast genome with Kelman’s SPRAB technology is demonstrated. We constructed a GeneNetwork, representing all putative interactions between the proteins encoded by the yeast genome. The software tool GeneViator realizes visualization of data and the navigation within this network.

One Page Abstract:

Despite the complete sequencing of more and more genomes, including the human one, scientists are still only at the threshold of understanding the functions of numerous individual genes, especially with respect to their complex interplay. To move ahead in this field, researchers need advanced biocomputing solutions. Kelman offers such a high-end solution for bioinformatics and functional genome research, which ensures new levels of data consistency and exploitation. By means of SPRAB (Selective Protein Recognition And Binding) technology we can predict the genetically determined functional relationships between proteins. From the resulting computational protein-protein interaction map we are able to construct local gene networks, which provide an in-depth understanding of gene interplay, involving gene products in all their molecular versions.

Here we present the in silico protein-protein interaction mapping of the complete yeast genome, demonstrating the systematic application of SPRAB technology. The resulting Yeast-GeneNetwork represents the complete set of putative interactions between the proteins encoded by the yeast genome. Visualization of the data and navigation within this network of physical protein-protein interactions are realized by means of Kelman’s software tool GeneViator. Moreover, GeneViator enables the user to include information from other sources in the network, such as experimental interaction data and expression data. This opens a new dimension in uncovering and profiling protein pathways. The approach of considering and combining different aspects of gene interactions allows a raster search for relevant pathways, providing an in-depth understanding of gene function. The advantages of Kelman’s approach are exemplified here by selected findings from the Yeast-GeneNetwork.


210. Ab initio prediction of human orphan protein function
Lars Juhl Jensen, Center for Biological Sequence Analysis, The Technical University of Denmark;
R. Gupta, C.A.F. Andersen;
D. Devos, Protein Design Group, CNB-CSIC, Spain A. Krogh;
J. Tamames, Protein Design Group, CNB-CSIC, Spain A. Valencia;
H. Nielsen, S. Brunak;
N. Blom, C. Kesmir, C. Workman, H.H. Staerfeldt, K. Rapacki, S. Knudsen, Center for Biological Sequence Analysis, The Technical University of Denmark;
ljj@cbs.dtu.dk
Short Abstract:

We present a novel method for ab initio prediction of protein function from sequence data alone. The cellular role is predicted by neural networks that integrate predictions of post-translational modifications and other protein features known to play an important role in determining subcellular location and regulation of proteins.

One Page Abstract:

Of the 30,000 to 40,000 genes believed to be present in the human genome, not more than half can be assigned a functional role based on homology to known proteins. Traditionally, protein function has been viewed as something directly related to the conformation of the polypeptide chain. However, as the 3D structure is currently quite hard to calculate from the sequence, a computational strategy for the elucidation of orphan protein function may also benefit from the prediction of functional attributes that are more directly related to the linear sequence of amino acids.

Our approach to function prediction is based on the fact that a protein is not alone when performing its biological task. As it will have to operate using the same cellular machinery for modification and sorting as all the other proteins do, one can expect some conservation of essential types of post-translational modifications (PTMs). Because reasonably precise methods for prediction of PTMs from sequence exist today, our prediction method, which integrates such relevant features to assign orphan proteins to functional classes, can be applied to all proteins for which the sequence is known.

For any function prediction method, the ability to correctly assign the relationship depends strongly on the function classification scheme used. We predict a scheme of 12 cellular functions which is closely related to the 14-class classification originally proposed by Riley for the E. coli genome. All human sequences in SWISS-PROT were automatically assigned to classes by a system based on an additive scoring scheme of the SWISS-PROT keywords. These scores were then compared to two thresholds to obtain a positive and a negative set from which the most uncertain functional annotations are excluded. To minimize the problem of having similar sequences for training and testing, we used a heuristic algorithm to split our data set into two sets so that the similarity between the two sets was minimal.
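
The two-threshold assignment can be sketched as follows. The keyword weights, the threshold values, and the function names are hypothetical and chosen only for illustration; the actual scoring scheme is not given in this abstract.

```python
def keyword_score(keywords, weights):
    """Additive score: sum the weight of every keyword (unknown ones count 0)."""
    return sum(weights.get(k, 0.0) for k in keywords)


def assign_set(score, low, high):
    """Place a protein in the positive or negative set, or exclude it.

    Scores between the two thresholds are treated as too uncertain.
    """
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "excluded"


# Hypothetical weights for one functional class (illustration only).
example_weights = {"Transcription": 2.0, "DNA-binding": 1.0, "Membrane": -1.5}
```

With these example weights and thresholds of -1.0 and 2.0, a protein carrying the keywords "Transcription" and "DNA-binding" scores 3.0 and is placed in the positive set.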

To find the optimal combination of parameters for each of the different categories we used a bootstrap strategy. First, for every category a simple feedforward neural network with one hidden layer was trained on every separate feature to judge which features were potentially useful for prediction of at least one category. For each category a network was then trained on every pair of these features, and subsequently on combinations of more features, to find the best feature combinations. An ensemble for each functional class was made from the five best networks for that class.

The outputs of these networks were subsequently transformed into probabilistic scores based on Gaussian kernel density estimates of the score distributions of positive and negative examples, respectively. To calculate the combined prediction of an ensemble of networks we simply take the average of their probabilistic predictions.
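
A minimal sketch of this transformation is given below. It assumes that the probabilistic score is obtained by Bayes' rule from the two kernel density estimates with a fixed prior; the bandwidth, the prior, and all function names are assumptions for this example and are not stated in the abstract.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        ) / norm

    return density


def make_calibrator(pos_scores, neg_scores, bandwidth=0.1, prior=0.5):
    """Map a raw network output to P(positive | score).

    Assumption: Bayes' rule applied to the two density estimates.
    """
    pos_density = gaussian_kde(pos_scores, bandwidth)
    neg_density = gaussian_kde(neg_scores, bandwidth)

    def probability(score):
        p = pos_density(score) * prior
        n = neg_density(score) * (1 - prior)
        return p / (p + n) if p + n > 0 else prior

    return probability


def ensemble_probability(calibrators, raw_scores):
    """Average the probabilistic predictions of the ensemble members."""
    probs = [cal(s) for cal, s in zip(calibrators, raw_scores)]
    return sum(probs) / len(probs)
```

Each network in the ensemble gets its own calibrator, built from that network's outputs on the positive and negative examples.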

Interestingly, the combinations of attributes selected for a given category (among the 20+ initially considered) also implicitly characterize a particular functional class in an entirely new way. It appears that the use of post-translational modifications is essential for the prediction of several functional classes. In addition to attributes related to subcellular location, the most important features for predicting whether a protein is, say, regulatory or not are all PTMs. Similarly, PTMs are very important for correct assignment of proteins related to the cell envelope, replication and transcription.

The fact that (predicted) PTMs correlate strongly with the functional categories fits well with biological knowledge. For proteins with "regulatory function", two of the most important features were S/T phosphorylation and Y phosphorylation. It is very satisfying that the neural networks found this correlation, considering that reversible phosphorylation is a well-known and widely used regulatory mechanism. The choice of encoding also makes biological sense, as serines and threonines are known to be phosphorylated by the same kinases, while tyrosines are phosphorylated by different kinases.

The selection of category-relevant attributes is based on quantitative assessment of the ability to predict (assign) categories for orphan sequences non-similar to the sequences used to train the method. When the sensitivity is below 40%, the level of false positive predictions is very low. The confidence in the predictions can be used directly to separate assignments with a low probability of being wrong from those where the probability is higher.

We have used our method to estimate the breakdown of the human genome into functional categories. Using the predicted probability of each category for every protein, the number of proteins in each category was estimated by summing the probability of the category in question over all proteins.
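
The summation described above amounts to computing an expected count per category. A minimal sketch follows, with an assumed input layout (one dictionary of category probabilities per protein) and a hypothetical function name.

```python
def expected_category_counts(predictions):
    """Estimate the number of proteins per category.

    predictions: list with one dict per protein, mapping each category
    to its predicted probability. The estimate for a category is the
    sum of that category's probability over all proteins.
    """
    totals = {}
    for probs in predictions:
        for category, p in probs.items():
            totals[category] = totals.get(category, 0.0) + p
    return totals
```

Summing probabilities rather than counting thresholded assignments lets uncertain predictions contribute fractionally to the estimate.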


211. A consensus based approach to genome data mining of beta barrel outer membrane proteins.
Rajiv V. Basaiawmoit, Manjunath K.R., Krishnaswamy S., Bioinformatics centre, SBT, M.K.University;
rajivvaid@usa.net
Short Abstract:

We have developed a consensus-based approach using compositional analysis, hydropathy profiles, sequence comparison techniques, secondary structure prediction and structure based sequence profiles to delineate beta-barrel outer membrane proteins from the proteome databases. The results of the analysis on the available proteomes will be presented.

One Page Abstract:

Beta-barrel outer membrane proteins are associated with a variety of important functions, from passive trimeric porins like OmpF to monomeric active transporters like FhuA. The beta-barrel structures range from 8-stranded to 22-stranded barrels, with possible functional differentiation based on barrel size. They can also be responsible for the pathogenicity of an organism and for defense against attack proteins. Some porins are also involved in apoptosis. Their functional diversity makes them an interesting class of proteins. With the large-scale sequencing of genomes that is underway, it would therefore be advantageous to identify porins across proteomes. The successful identification of the putative function of a protein often depends on the first steps of a search (BLAST or FASTA) against a database. Analysis with a single method may give misleading results, which calls for a consensus-based approach to the identification of beta-barrel outer membrane proteins. We have developed a consensus-based approach using compositional analysis, hydropathy profiles, sequence comparison techniques (BLAST and FASTA), secondary structure prediction and structure-based sequence profiles to delineate beta-barrel outer membrane proteins from the proteome databases. We applied this to fully sequenced genomes and carried out a more detailed analysis of the spirochete genome sequences of Borrelia burgdorferi and Treponema pallidum. Large discrepancies were found in the annotations; for example, in one case the annotation terms the protein a porin, whereas our analysis shows that it is unlikely to be a beta-barrel protein. The work can be extended to the creation of a non-redundant, annotated database of porins and, with scripts for automation of the process, can also aid structural genomics initiatives for porins.
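
One simple way to combine several lines of evidence is a vote, sketched below. The voting rule, the `min_votes` threshold, and the method names used as dictionary keys are assumptions for this example; the abstract does not state how the individual results are combined.

```python
def consensus_call(evidence, min_votes=3):
    """Call a protein a beta-barrel candidate if enough methods agree.

    evidence: dict mapping a method name to True/False, where True
    means that method supports a beta-barrel outer membrane protein.
    """
    return sum(1 for supported in evidence.values() if supported) >= min_votes


def annotation_conflict(annotated_as_porin, evidence, min_votes=3):
    """Flag entries whose database annotation disagrees with the consensus."""
    return annotated_as_porin != consensus_call(evidence, min_votes)
```

An entry annotated as a porin but supported by only one of five methods would be flagged for closer inspection.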


212. Predicting tyrosine sulfation sites in protein sequences
Flavio Monigatti, Eva Jung, Amos Bairoch, Swiss Institute of Bioinformatics;
Eva.Jung@isb-sib.ch
Short Abstract:

Tyrosine sulfation is an important post-translational modification of secreted and/or membrane-bound proteins. To predict tyrosine sulfation sites in protein sequences, a novel program (the Sulfinator) will be presented that combines two different, serially switched Hidden Markov Models (HMM). The Sulfinator shall be made available at http://www.expasy.org/tools .

One Page Abstract:

Tyrosine sulfation is a ubiquitous post-translational modification of proteins that go through the secretory pathway within living cells. The biological role of sulfation is largely unknown; however, there is strong evidence that protein sulfation is required for optimal biological activity of proteins. No clear-cut acceptor motif utilized by tyrosylprotein sulfotransferases has been described that could be used for prediction of tyrosine sulfation sites in proteins. Here we present a novel method to predict tyrosine sulfation sites in protein sequences using two different, serially switched Hidden Markov Models (HMM). The first HMM consists of a linear chain fifteen amino acids long with the target tyrosine at position six. While the first HMM is responsible for the recognition of possible sulfation sites, the second HMM, a linear chain eleven amino acids long, assures the correct alignment of the previously matched sequences. In a test set of validated non-sulfated and sulfated sequences extracted from the SWISS-PROT database, the Sulfinator correctly predicts ~99% of the tyrosine sulfation sites (true positives) and 98% of the non-sulfated tyrosines (true negatives). The results from scanning available proteomes in the SWISS-PROT database suggest that tyrosine sulfation could be more abundant than previously anticipated. The Sulfinator shall be made available at http://www.expasy.org/tools .
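
The input to the first model can be pictured as a fifteen-residue window with the tyrosine at position six, that is, five residues before it and nine after. The sketch below only extracts such windows; it does not implement the HMMs. The padding character used at the sequence termini and the function name are assumptions for this example.

```python
def tyrosine_windows(seq, before=5, after=9, pad="-"):
    """Return (position, window) for every tyrosine in `seq`.

    position is 1-based. Each window has before + 1 + after residues
    with the tyrosine at index `before` (position six by default).
    Windows that run past a terminus are filled with `pad`.
    """
    padded = pad * before + seq + pad * after
    windows = []
    for i, aa in enumerate(seq):
        if aa == "Y":
            # Residue i of seq sits at index i + before in `padded`,
            # so the window starts at index i.
            windows.append((i + 1, padded[i : i + before + 1 + after]))
    return windows
```

Each window would then be scored by the model, one candidate tyrosine at a time.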


213. Characterization of aspartylglucosaminuria mutations
Jani Saarela, Minna Laine, National Public Health Institute, Finland;
Carita Oinonen, University of Joensuu, Finland;
Carina von Schantz, Anu Jalanko, National Public Health Institute, Finland;
Juha Rouvinen, University of Joensuu, Finland;
Leena Peltonen, University of California Los Angeles;
Jani.Saarela@ktl.fi
Short Abstract:

Aspartylglucosaminuria is a recessively inherited human disease. Altogether 26 different aspartylglucosaminuria mutations have been identified. Many of these interfere with the complex intracellular maturation and processing of the aspartylglucosaminidase polypeptide. We used the three-dimensional structure of functional aspartylglucosaminidase enzyme to predict structural consequences of aspartylglucosaminuria mutations.

One Page Abstract:

A deficiency of functional aspartylglucosaminidase (AGA) causes a lysosomal storage disease, aspartylglucosaminuria (AGU). The recessively inherited disease is enriched in the Finnish population, where 98% of AGU alleles contain one founder mutation, AGUFin. Elsewhere in the world, we and others have described 18 different sporadic mutations in AGU patients. Many of these mutations are predicted to interfere with the complex intracellular maturation and processing of the AGA polypeptide. Proper initial folding of AGA in the endoplasmic reticulum is dependent on intramolecular disulfide bridge formation and dimerization of two precursor polypeptides. The subsequent activation of AGA occurs autocatalytically in the endoplasmic reticulum, and the protein is transported via the Golgi to the lysosomal compartment using the mannose 6-phosphate receptor pathway. We used the three-dimensional structure of AGA to predict structural consequences of AGU mutations, including six novel mutations, and made an effort to characterize every known disease mutation by dissecting the effect of mutations on intracellular stability, maturation, transport, and the activity of AGA. Most mutations are substitutions replacing the original amino acid with a bulkier residue. Mutations of the dimer interface prevent dimerization in the ER, while active site mutations not only destroy the activity but also affect maturation of the precursor. Depending on their effects on the AGA polypeptide, the mutations can be categorized as mild, average, or severe. These data contribute to the expanding body of knowledge pertaining to the molecular pathogenesis of AGU.


214. Elucidating a "theoretical" proteome of the Arabidopsis thaliana thylakoid
Olof Emanuelsson, Stockholm Bioinformatics Center, Stockholm University, Sweden;
Jean-Benoît Peltier, Dept. of Plant Biology, Cornell University, USA;
Gunnar von Heijne, Stockholm Bioinformatics Center, Stockholm University, Sweden;
Klaas J. van Wijk, Dept. of Plant Biology, Cornell University, USA;
olof@sbc.su.se
Short Abstract:

Scanning the entire Arabidopsis genome using subcellular localization predictors (TargetP, SignalP-HMM) and a transmembrane predictor (TMHMM), we predict the total proteome size of the lumen of the chloroplast sub-compartment thylakoid to be somewhere between 200 and 400 different proteins, of which a substantial part lacks any functional annotation.

One Page Abstract:

The Arabidopsis thaliana genome offers good opportunities to develop and test whole-genome based approaches to theoretical proteomics. Using subcellular localization predictions (TargetP followed by SignalP-HMM) and subsequent transmembrane predictions (TMHMM 2.0), we have predicted the total proteome size of the lumen of the chloroplast sub-compartment thylakoid to be somewhere between 200 and 400 different proteins, of which a substantial part lacks any functional annotation and approximately 50% contain a TAT-pathway signal. We have also evaluated the combined predictor approach in several ways, specifically addressing the SignalP performance on the signal peptide-like thylakoidal transferring domain adjacent to the chloroplast transit peptide, and it has become clear that a thylakoid-dedicated signal peptide predictor could potentially be useful. The outcome of the predictions will be used by biologists for guidance of experimental verification of thylakoidal localization of newly discovered Arabidopsis proteins.
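
The chaining of predictors can be pictured as a sequence of filters over precomputed results. The sketch below does not call TargetP, SignalP or TMHMM; it assumes their outputs have already been parsed into records. The record fields, the rule of excluding proteins with predicted transmembrane helices, and all names are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    protein_id: str
    chloroplast_targeted: bool  # parsed from a localization predictor
    lumenal_signal: bool        # signal-peptide-like domain after the transit peptide
    tm_helices: int             # number of predicted transmembrane helices


def lumen_candidates(predictions, max_tm_helices=0):
    """Keep proteins that pass all three filters, in order.

    Assumption: lumenal proteins are soluble, so entries with more
    than `max_tm_helices` predicted helices are dropped.
    """
    return [
        p.protein_id
        for p in predictions
        if p.chloroplast_targeted
        and p.lumenal_signal
        and p.tm_helices <= max_tm_helices
    ]
```

Relaxing or tightening each filter is one way a range of proteome sizes, rather than a single number, could arise.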


215. Machine Learning Algorithms in the Detection of Functional Relatedness of Proteins
Mahesan Niranjan, Renata Camargo, The University of Sheffield;
r.camargo@dcs.shef.ac.uk
Short Abstract:

This poster reports on the application of machine learning algorithms to detect functional similarity between pairs of proteins. Decision making is based upon a set of features derived from structural, sequence and keyword descriptions in the literature. The work is based on protein pairs labelled by Holm and Sander (ISMB97).

One Page Abstract:

Detecting functional similarity between pairs of proteins may be cast as a machine learning problem in cases where many diverse pieces of information are available about the proteins. Such a problem was formulated by Holm and Sander (1997). They make available a dataset of 940 protein pairs, hand labelled as evolutionarily related or not. Features describing each pair are: (a) structural similarity as measured by the z-score of alignment, (b) sequence overlap being above or below a threshold, (c) Enzyme Class number characterising biochemical reactions, (d) experimental information on functional sites, (e) predicted functional sites and (f) overlap in keywords of literature describing the pair. Holm and Sander report on coverage/selectivity for each of the features of the database and suggest that more powerful machine learning algorithms may be applied to this problem.

In this poster we report on work that pursues this line by reconstructing the dataset of the same protein pairs and applying a range of machine learning algorithms, including logistic regression, Fisher Linear Discriminant Analysis and Support Vector Machines, to separate functionally related pairs from unrelated ones. A particular result concerns the selection of features appropriate for the classification.
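
As a minimal illustration of one of the listed methods, the sketch below trains a logistic regression classifier by batch gradient descent on pair features. It is written in plain Python for self-containment; the learning rate, the number of epochs, and the two-feature toy layout are assumptions for this example and are not taken from the Holm and Sander dataset.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    X: list of feature vectors (one per protein pair), y: 0/1 labels.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return True if the pair is classified as related."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5
```

In practice each feature vector would hold the six features (a) to (f), suitably scaled, and the learned weights give a first indication of which features matter for the classification.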

Reference:

Liisa Holm and Chris Sander (1997): Decision Support System for the Evolutionary Classification of Protein Structures, Proc. ISMB, 1997.


216. Predicting protein functions based on InterPro and GO
Wolfgang Fleischmann, Nicola Mulder, Alexander Kanapin, Evgueni Zdobnov, Rolf Apweiler, European Bioinformatics Institute;
fleischmann@ebi.ac.uk
Short Abstract:

We present a mapping between InterPro (a database of protein families, domains and functional sites) and GO (a controlled vocabulary of gene product functions, processes and components).

These data allow protein sequences to be classified with high coverage (47%) in a reliable and robust way.

One Page Abstract:

We observed that many domains and families in InterPro (www.ebi.ac.uk/interpro) are conserved enough to infer the function of proteins matching these InterPro signatures.

As a controlled vocabulary for the protein functions, we used the GO terms provided by the Gene Ontology Consortium (www.geneontology.org), as they are gaining support in the user community.

To gain reliable results, we inspected manually all SWISS-PROT proteins known to match a given InterPro entry and assigned the relevant GO terms for the function, process and component. We avoided domain specific terms and used only those terms that apply to the whole protein.

We assigned GO terms to 2567 of the 3914 InterPro entries. Broken down by GO ontologies, 2308 InterPro entries can predict the molecular function, 1943 entries the biological process, and 1090 entries the cellular component.

Using the InterPro2GO mapping, it is possible to infer GO terms for 47.0% of all SWISS-PROT + TrEMBL proteins. This coverage is expected to rise in the future, as we add new signatures to InterPro and continue to assign GO terms to the remaining InterPro entries.
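
How such a mapping yields annotations and a coverage figure can be sketched in a few lines. The in-memory dictionary layout, the placeholder identifiers, and the function names below are assumptions for this example and do not reflect the actual InterPro2GO file format or the InterProScan module.

```python
def infer_go_terms(interpro_hits, interpro2go):
    """Collect the GO terms mapped to any InterPro entry a protein matches."""
    terms = set()
    for entry in interpro_hits:
        terms.update(interpro2go.get(entry, ()))
    return terms


def coverage(proteins, interpro2go):
    """Fraction of proteins that receive at least one GO term.

    proteins: dict mapping a protein id to its list of InterPro matches.
    """
    if not proteins:
        return 0.0
    annotated = sum(
        1 for hits in proteins.values() if infer_go_terms(hits, interpro2go)
    )
    return annotated / len(proteins)


# Placeholder identifiers, for illustration only.
example_mapping = {"ENTRY_A": {"GO_TERM_1", "GO_TERM_2"}, "ENTRY_B": {"GO_TERM_3"}}
```

A protein matching only entries that have no assigned GO terms contributes to the denominator but not to the numerator, which is why coverage rises as more entries are mapped.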

The mapping is incorporated into the InterPro database and is accessible at www.ebi.ac.uk/interpro.

Furthermore, we regularly construct overview reports of the molecular function of the completely sequenced genomes, available at www.ebi.ac.uk/proteome.

To enable users to download the assignments and predict the function of proprietary protein sequences, we added a module to the InterProScan software.


217. iPSORT: Simple rules for predicting N-terminal protein sorting signals.
Hideo Bannai, Human Genome Center, Institute of Medical Science, University of Tokyo;
Yoshinori Tamada, Department of Mathematical Sciences;
Osamu Maruyama, Faculty of Mathematics, Kyushu University;
Kenta Nakai, Satoru Miyano, Human Genome Center, Institute of Medical Science, University of Tokyo;
bannai@ims.u-tokyo.ac.jp
Short Abstract:

Using a discovery-oriented approach of hypothesis generation, we search for simple, understandable rules with high accuracy for predicting N-terminal sorting signals of proteins. The prediction accuracy comes close to that of the state-of-the-art neural-network-based predictor, TargetP. An experimental web service is provided at http://www.hypothesiscreator.net/iPSORT/.

One Page Abstract:

The prediction of localization sites of various proteins is an important and challenging problem in the field of molecular biology. Currently, a neural-network-based system, TargetP, is the best predictor of N-terminal sorting signals in the literature. One drawback of neural networks, however, is that it is generally difficult to understand and interpret how and why they make such predictions. In this work, we aim to generate simple and interpretable rules as predictors and still achieve a practical prediction accuracy. We adopted a discovery-oriented approach which consists of an extensive search for simple rules and various attributes. The simple rules we search for include pattern matching over an alphabet of protein classification, and also rules consisting of amino acid attributes contained in the AAindex database. A rule is created for each signal type, and the rules are combined into a decision list for the final predictor. We have succeeded in finding rules for plant proteins which are almost as good as TargetP in terms of prediction accuracy, while still retaining a very simple and interpretable form. We further apply the acquired knowledge to non-plant proteins.

The rules we obtained are consistent with widely believed characteristics of the N-terminal sorting signals, and it is somewhat surprising that such accuracy could be obtained with such attributes.
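
The structure of a decision list built from per-signal rules can be sketched as follows. The two rules, the toy attribute scale, the thresholds, and the output labels are invented for illustration and are not the rules found by iPSORT.

```python
def mean_attribute(segment, scale):
    """Average an amino acid attribute over a segment (unknown residues skipped)."""
    values = [scale[aa] for aa in segment if aa in scale]
    return sum(values) / len(values) if values else 0.0


def make_decision_list(rules, default):
    """Build a classifier that returns the label of the first matching rule.

    rules: ordered list of (predicate, label) pairs.
    """
    def classify(seq):
        for predicate, label in rules:
            if predicate(seq):
                return label
        return default

    return classify


# Hypothetical attribute scale and rules (illustration only).
toy_scale = {"L": 1.0, "A": 0.5, "S": 0.0, "R": -1.0, "K": -1.0}
toy_classifier = make_decision_list(
    [
        (lambda s: mean_attribute(s[:20], toy_scale) > 0.6, "signal peptide"),
        (lambda s: s[:20].count("R") >= 3, "mitochondrial"),
    ],
    default="other",
)
```

Because the rules are tried in a fixed order and each is a single readable test, the reason for any prediction can be read directly off the list.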