Gene Finding

301. Searching for RNA Genes Using Base-Composition Statistics
302. dna2hmm - a homology based genefinder
303. DIGIT: a novel gene finding program by combining genefinders
304. GAZE: A generic, flexible tool for gene-prediction
305. An estimate of the total number of genes in microbial genomes based on length distributions
306. Potential binding sites for PPARg in promoters and upstream sequences
307. On the Species of Origin: Diagnosing the Source of Symbiotic Transcripts
308. PAGAN : Predict and Annotate Genes in genomic sequence based on ANalysis of EST Clusters
309. Nested Genes in the Human Genome
310. Integrating Protein Homology into the Twinscan System for Gene Structure Prediction
311. Improving exon detection using human-rodent genomic sequence comparison
312. In-silico to in-vivo analysis of whole proteome
313. Incorporating Additional Information to Hidden-Markov Models for Gene Prediction
314. Gene prediction in the post-genomic era
315. Improved Splice Site Prediction by Considering Local GC Content
316. Relaxed profile matching as a method for identifying putative novel proteins from the genome
317. Analyzing Alternatively Spliced Transcripts
318. EST curation with improved gene feature models
319. Annotation of Human Genomic Regions to Identify Candidate Disease Genes
320. Annotation of the E. coli genome revisited

301. Searching for RNA Genes Using Base-Composition Statistics
Peter Schattner, GencodeDecode Research;
Short Abstract:

The feasibility of using local single-base (GC%, G-C, A-T) and dinucleotide frequency variations for non-protein-coding RNA (ncRNA) genefinding was investigated. Significant frequency differences were found between ncRNAs and genomes and among ncRNAs. A search program based on GC% was developed and tested on M. jannaschii and C. elegans sequences.

One Page Abstract:

BACKGROUND: RNA-gene-finding programs for non-protein-coding RNAs (ncRNAs) with well-characterized sequences have achieved considerable success. However, identifying less-well-characterized ncRNAs has been significantly less successful. Indeed, a recent paper (Bioinformatics 16:573-585, 2000) suggests that it may be impossible to use sequence and secondary structure alone to detect ncRNAs within newly sequenced genomes. The same paper notes that, for some genomes, it may be possible to use the local percentage of GC bases (GC%) as a filter to screen for ncRNA-rich regions. More generally, one might look for multiple variations of single-base and/or dinucleotide statistics as a signature of ncRNA-rich regions. The present work investigates the feasibility of this approach. Single-base and dinucleotide statistics for a variety of ncRNAs are compiled, compared to the genomic background and applied to the task of screening for ncRNA-rich regions.

METHODS: ncRNA sequence data for tRNAs, rRNAs, snRNAs, srpRNAs, RNA pseudoknots and other small RNAs were obtained from public databases. For each ncRNA type, several base-composition statistics were computed: GC%, per-base G-C and A-T differences, e.g. (n(G) - n(C)) / (n(G) + n(C)), and normalized dinucleotide frequencies f(AB) / (f(A) * f(B)). Statistics were computed for complete chromosomes and 100 kb "isochores" taken from the M. jannaschii and C. elegans genomes. A simple ncRNA screening program based solely on local GC% values was also implemented. The program partitions a genomic region into two components: one (hopefully small) component with a high probability of containing ncRNAs and another with a low probability of containing ncRNAs. The program was run against sequences of the M. jannaschii and C. elegans genomes, with results compared to RNA annotations in the GenBank ".gbk" and WormBase ".gff" database files.

RESULTS: For M. jannaschii mean RNA GC% is 65.9% (+/-3.2) while overall genomic GC% is 31.4%. For C. elegans, mean RNA GC% is 52.5% (+/-9.5) while chromosomal GC% ranges from 34.7% (chromosome IV) to 36.3% (chromosome II). Significant GC% variations are observed both among ncRNA classes and between differing isochores on an individual chromosome. For example, in C. elegans, tRNA GC% is 58.8% (+/-3.9) while snRNA GC% is 40.2% (+/-4.8). On chromosome I, mean GC% of 100 kb isochores is 36.0% with a range from 33.3% to 40.3%. Average per-base G-C differences for M. jannaschii and C. elegans RNAs are 0.046 (+/-0.058) and 0.038 (+/-0.058) respectively, as compared to the genomic values of 0.061 (+/-0.360) and 0.021 (+/-0.242). RNA per-base A-T differences are -0.054 (+/-0.12) and -0.135 (+/-0.12) while the genomic values are 0.023 (+/-0.186) and 0.014 (+/-0.195) for M. jannaschii and C. elegans chromosome I, respectively. Again, variations exist among RNAs. For example, for four known C. elegans snrprnas, per-base G-C differences are 0.18 (+/-0.02). The most significant dinucleotide frequency variation was observed in M. jannaschii for CG dinucleotides, where the normalized RNA CG frequency is 0.75 (+/-0.25) while the genomic value is 0.34 (+/-0.46). The GC%-based filtering program identified a component with less than 1% of the M. jannaschii genome that contained all 43 ncRNAs annotated in the GenBank genomic .gbk file. In experiments over 1.5 Mb of C. elegans chromosome X, a component containing 5.3% of the sequence with 44 of 51 annotated ncRNAs was identified.

DISCUSSION / CONCLUSION: The present work suggests that, at least for genomes with very high AT content such as M. jannaschii, one can identify small genomic regions rich in ncRNAs using GC% filtering alone. In most cases, however, GC% screening will probably need to be supplemented by testing for additional statistical signatures and/or conventional primary and secondary sequence structure motifs in order to successfully isolate potential ncRNAs. The feasibility of such multi-stage ncRNA detectors is currently being investigated.
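The statistics named in the METHODS paragraph, and a windowed GC% screen of the kind described, can be sketched as follows. This is only an illustrative sketch, not the author's program: the function names, the window size, the step and the threshold are all assumptions chosen for the example, and the dinucleotide frequency is computed over overlapping dinucleotides, which the abstract does not specify.

```python
from collections import Counter


def gc_percent(seq):
    """GC% over the unambiguous bases (A, C, G, T) of a sequence."""
    counts = Counter(seq.upper())
    acgt = sum(counts[b] for b in "ACGT")
    return 100.0 * (counts["G"] + counts["C"]) / acgt if acgt else 0.0


def per_base_difference(seq, x, y):
    """(n(x) - n(y)) / (n(x) + n(y)), e.g. x='G', y='C' for the G-C difference."""
    seq = seq.upper()
    nx, ny = seq.count(x), seq.count(y)
    return (nx - ny) / (nx + ny) if nx + ny else 0.0


def normalized_dinucleotide(seq, a, b):
    """f(ab) / (f(a) * f(b)), counting overlapping dinucleotides (an assumption)."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    f_a = seq.count(a) / n
    f_b = seq.count(b) / n
    pairs = sum(1 for i in range(n - 1) if seq[i:i + 2] == a + b)
    f_ab = pairs / (n - 1)
    return f_ab / (f_a * f_b) if f_a and f_b else 0.0


def high_gc_windows(seq, window=100, step=50, threshold=50.0):
    """Return merged (start, end) intervals whose window GC% >= threshold.

    window, step and threshold are hypothetical values for illustration.
    """
    hits = []
    for start in range(0, max(len(seq) - window + 1, 1), step):
        end = min(start + window, len(seq))
        if gc_percent(seq[start:end]) >= threshold:
            if hits and start <= hits[-1][1]:
                # Overlapping or adjacent window: extend the previous interval.
                hits[-1] = (hits[-1][0], end)
            else:
                hits.append((start, end))
    return hits


# Toy AT-rich sequence with one GC-rich island at positions 200-300.
toy = "AT" * 100 + "GC" * 50 + "AT" * 100
candidate_regions = high_gc_windows(toy, threshold=60.0)
```

The intervals returned by `high_gc_windows` play the role of the small "high probability" component; everything else falls into the low-probability component.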

302. dna2hmm - a homology based genefinder
Betty Lazareva, Paul Thomas, Celera Genomics;
Short Abstract:

Dna2hmm is an implementation of an algorithm that aligns genomic DNA to a protein HMM with sequencing error correction. The algorithm is designed to work with SAM HMMs, in contrast to GeneWise, which uses the HMMER framework. Tests show dna2hmm to be more accurate than GeneWise, though slower.

One Page Abstract:

Dna2hmm is a software implementation of an algorithm that aligns genomic DNA to a protein HMM with sequencing error correction. The algorithm is specifically designed to work with SAM HMMs, in contrast to GeneWise, which works in the HMMER framework. Our approach involves full dynamic programming with an optimization function that is a sum of three components: the Viterbi alignment score of the predicted translation (which we refer to as the "SAM-like score"), a splicing score and, finally, a penalty for correction of errors in the DNA sequence.

We tested our algorithm by scoring a UCSC test set of well-annotated genes against a set of SAM HMMs we generated at varying levels of sequence similarity to the genes in the test set. In our comparison with GeneWise, we converted SAM HMMs to HMMER format while preserving the SAM NULL model, which improved GeneWise performance. Our experiments show a high correlation between Viterbi protein scores and SAM-like scores of the predicted translation, which makes for straightforward statistical interpretation of the alignment score. For GeneWise, we found the agreement between protein scores and predicted alignment scores to be much weaker. In addition, prediction accuracy at both the base-pair level and the exon level is significantly higher for the dna2hmm algorithm than for GeneWise. However, in the current implementation, dna2hmm is about 3 times slower than the latest version of GeneWise. Detailed comparison and testing results are discussed in our presentation.

Because the SAM-like dna2hmm score correlates well with the score of a protein translation, dna2hmm score results can be directly compared to protein sequence comparison scores and derived distributions for estimating statistical significance.

303. DIGIT: a novel gene finding program by combining genefinders
Tetsushi Yada, Human Genome Center, Institute of Medical Science, University of Tokyo;
Yasushi Totoki, Genomic Sciences Center, RIKEN;
Yoshio Takaeda, Mitsubishi Research Institute, Inc.;
Yoshiyuki Sakaki, Toshihisa Takagi, Human Genome Center, Institute of Medical Science, University of Tokyo;
Short Abstract:

We present here a general scheme for combining multiple genefinders. Using this scheme, we have developed a novel gene finding program named DIGIT, which combines FGENESH, GENSCAN and HMMgene. We present the results of benchmark tests and of the analysis of 2.7 billion bases of human genome sequence.

One Page Abstract:

We have developed a novel gene finding program named DIGIT, which finds genes by combining existing genefinders. It is well known that the reliability of gene annotation is increased by combining multiple genefinders. However, two problems arise when combining genefinders: (1) how to ensure frame consistency between exons within a gene and (2) how to take into account the exon scores given by the genefinders. We have addressed these problems by applying a hidden Markov model and a Bayesian procedure, and have implemented the scheme in DIGIT. Since our scheme provides a general framework for combining genefinders, DIGIT is able to combine most genefinders in a systematic manner. As well as presenting the detailed algorithm of DIGIT, we report here its prediction accuracy. DIGIT has been designed to combine FGENESH, GENSCAN and HMMgene, and its prediction accuracy has been assessed using three different data sets. For all data sets, DIGIT successfully discarded many false positive exons predicted by the genefinders and yielded remarkable improvements in sensitivity and specificity at the gene level compared with the best gene-level accuracies achieved by any single genefinder.

304. GAZE: A generic, flexible tool for gene-prediction
Kevin Howe, Richard Durbin, The Sanger Centre;
Short Abstract:

We have developed a gene-finding system called GAZE which allows gene-prediction data from multiple arbitrary sources to be integrated into complete gene-structures in a flexible and user-configurable manner. The system gains further flexibility by performing the integration over an XML model of gene-structure also supplied by the user.

One Page Abstract:

The ACEDB package contains, as a sub-component, a gene-prediction tool which, like several other gene-prediction programs available, has as its final stage something akin to "exon assembly". During this phase, signal and content sensors are combined using dynamic programming to produce a `best guess' prediction of genes within the region being considered. The model of a gene, in terms of its components and how they relate to each other, is hard-coded into the system. Originally, the parameters for the use of signal and content measures were also hard-coded, but this functionality has gradually been factored out as user-configurable. It remains the case, though, that there is no way to refine the model of gene-structure over which assembly takes place without changing the code. Other gene-prediction programs that we know of suffer from the same sort of rigidities.

We have developed a gene-prediction tool, GAZE, with the aim of addressing these inflexibilities. GAZE is a stand-alone gene-finding software system based upon an abstraction of the dynamic programming engine underlying the ACEDB gene-finding tool. It produces predictions of gene-structure by integrating arbitrary prediction information from multiple sources supplied by the user. This integration takes place over a model of gene-structure that is also completely defined by the user.

GAZE is controlled by a user-specified "structure file" (in XML format), which contains both a state-based model of gene-structure, and information for controlling precisely how the program should integrate the given gene-prediction data over the given model. Both the gene-prediction data supplied to GAZE, and its prediction of gene structure, are in GFF.

305. An estimate of the total number of genes in microbial genomes based on length distributions
Marie Skovgaard, L. J. Jensen, S. Brunak, D. Ussery, A. Krogh, Center for Biological Sequence Analysis;
Short Abstract:

In sequenced microbial genomes some of the annotated genes are actually not protein coding genes, but rather ORFs that occur by chance. Based on a comparison of the length distributions of annotated and known proteins, we estimate the number of true protein coding genes.

One Page Abstract:

In sequenced microbial genomes some of the annotated genes, usually marked in the public databases as hypothetical, are actually not protein coding genes, but rather open reading frames (ORFs) that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes.

We have plotted the length distribution of the annotated genes and compared it to the length distributions of those matching known proteins and those with no match to known proteins. From these plots it can be seen that the length distribution of proteins with no known matches differs from the distribution of proteins with matches to known proteins. Since the majority of the proteins with no known matches are short, this leads to the conclusion that too many short genes are annotated in many genomes.

Therefore we estimate the true number of protein coding genes for sequenced genomes by two different methods.

The first estimate is based on the assumption that the fraction of long genes matching the SWISS-PROT database equals the fraction of all genes that match SWISS-PROT. To obtain an estimate of the true number of proteins in each organism, we used the proteins in the SWISS-PROT database as a reference. Since ORFs longer than 200 amino acids are unlikely to occur by chance in most organisms, the fraction of those matching SWISS-PROT was used as an estimate of the fraction of the total number of true proteins that match SWISS-PROT. The estimated number of genes is then obtained by dividing the total number of matching proteins by this fraction.
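The arithmetic of this first estimate can be written out in a few lines. The sketch below is illustrative only: the function name, the input representation (one length and one match flag per annotated gene) and the toy numbers are assumptions, not data or code from the study.

```python
def estimate_true_gene_count(lengths_aa, has_match, long_cutoff=200):
    """Estimate the true number of genes from database matches.

    lengths_aa  -- length of each annotated protein in amino acids
    has_match   -- True if the protein matches the reference database
    long_cutoff -- ORFs longer than this are assumed to be real genes
    """
    long_flags = [m for n, m in zip(lengths_aa, has_match) if n > long_cutoff]
    if not any(long_flags):
        raise ValueError("no long matching genes; the fraction is undefined")
    # Fraction of long (presumed real) genes that match the database.
    matching_fraction = sum(long_flags) / len(long_flags)
    # Assume the same fraction of all true genes match the database.
    total_matching = sum(has_match)
    return total_matching / matching_fraction


# Toy example: 10 annotated genes, 4 of them long, 3 of the long ones matching.
lengths = [300, 250, 400, 220, 80, 90, 60, 70, 150, 50]
matches = [True, True, True, False, True, False, False, False, True, False]
estimate = estimate_true_gene_count(lengths, matches)  # 5 / 0.75
```

In this toy case 5 annotated proteins match the database and 75% of the long ones do, so the estimate is about 6.7 true genes against 10 annotated ones.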

The second estimate is completely independent of database matches. The maximal number of non-overlapping open reading frames longer than 100 triplets was found, and the estimate of the true number of genes was obtained by reducing this number by the number expected to occur at random. ORFs shorter than 100 triplets were excluded since relatively few genes that short are expected and the estimate becomes ill-behaved because of the huge number of short ORFs. The approximation of the corrections is quite crude, but serves as a control for the estimate based on alignment and SWISS-PROT.

Our estimates of the number of real protein coding genes are 10-30% lower than the number of annotated genes for the majority of microbial organisms. The two extremes are represented by M. genitalium, where the estimates are 1-5% lower, and A. pernix, where they are close to 50% lower.

306. Potential binding sites for PPARg in promoters and upstream sequences
Hubert Hackl, The Institute for Genomic Research, Rockville, MD, and Institute for Biomedical Engineering, Graz University of Technology, Graz, Austria;
Alexander Sturn, Institute for Biomedical Engineering, Graz University of Technology, Graz, Austria;
Vasiliki Michopoulos, National Institutes of Health, Bethesda, MD;
John Quackenbush, The Institute for Genomic Research, Rockville, MD;
Zlatko Trajanoski, Institute for Biomedical Engineering, Graz University of Technology, Graz, Austria;
Short Abstract:

We present the strategy and results of the search for binding sites in promoters to identify target genes for PPARg - a key player in adipogenesis. From experimentally verified binding sites for PPAR a novel position weight matrix was derived and an algorithm based on the information content was applied.

One Page Abstract:

Peroxisome proliferator-activated receptor gamma (PPARg) plays a key role in the differentiation of adipose tissues and is important in the adipose-specific expression of a number of genes. Given the centrality of this nuclear receptor in adipocyte differentiation and the development of obesity, the identification of potential target genes could improve the rational design of new classes of drugs to control obesity. PPARs bind DNA neither as homodimers nor as monomers but strictly depend on the retinoid X receptor (RXR) as a DNA-binding partner. The consensus sequence for the binding of PPAR:RXR is given by a 5' flanking region and two half sites with an adenine (A) in between (5'-AWCT AGGNCA A AGGTCA-3') [1].

In order to identify potential binding sites for PPARg, searches in promoters and upstream sequences from human, mouse and rat genes, as well as in GenBank entries for genes from other vertebrates, were performed with 1) the consensus sequence according to the IUPAC string convention, 2) the position weight matrix from the TRANSFAC database [3] and 3) a novel position weight matrix (PWM) derived from experimentally verified binding sites for PPARg. For the search with the consensus sequence and the TRANSFAC PWM, the program FindPatterns of the Wisconsin Package [4] and the online service of MatInspector [6] were used, respectively. To search with and evaluate the new context-specific PWM, and to determine the optimal threshold level, an algorithm based on the information content, similar to the MatInspector program [5], was implemented.

Searching with the IUPAC method yielded a noticeable number of matches, which could be refined by using the TRANSFAC PWM for PPARg, since the PWM captures more information than the consensus sequence. The search with the newly constructed PPAR matrix resulted in a reasonable number of potential sites and so far unidentified putative target genes (2% of the studied promoter sequences at a threshold level of 0.85).
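A PWM search weighted by information content can be sketched as below. This is a simplified illustration and not the algorithm implemented by the authors or by MatInspector: the scoring formula (per-column information content in bits used as a weight, normalized by the best attainable score) is an assumption, and the training sites are toy sequences, not the experimentally verified PPARg sites. Only the threshold value 0.85 is taken from the abstract.

```python
import math

BASES = "ACGT"


def build_pwm(sites):
    """Column-wise base frequencies from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for j in range(length):
        column = [site[j] for site in sites]
        pwm.append({b: column.count(b) / len(sites) for b in BASES})
    return pwm


def information_content(column):
    """Information content of one column in bits (0 = uniform, 2 = invariant)."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)


def similarity(pwm, word):
    """Information-weighted score of a word, scaled to the range [0, 1]."""
    weights = [information_content(col) for col in pwm]
    score = sum(w * col.get(b, 0.0) for w, col, b in zip(weights, pwm, word))
    best = sum(w * max(col.values()) for w, col in zip(weights, pwm))
    return score / best if best else 0.0


def scan(pwm, seq, threshold=0.85):
    """Return (position, word, score) for every window scoring >= threshold."""
    seq = seq.upper()
    k = len(pwm)
    hits = []
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        s = similarity(pwm, word)
        if s >= threshold:
            hits.append((i, word, round(s, 3)))
    return hits


# Hypothetical toy sites resembling a half site; not real binding-site data.
toy_sites = ["AGGTCA", "AGGTCA", "AGGGCA", "AGGTCA"]
toy_pwm = build_pwm(toy_sites)
toy_hits = scan(toy_pwm, "TTTAGGTCATTT")
```

Weighting by information content means a mismatch at a conserved position costs more than one at a variable position, which is the advantage of a PWM over a plain consensus string noted above.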

[1] Desvergne B, Wahli W: Peroxisome Proliferator-Activated Receptors: Nuclear Control of Metabolism. Endocrine Reviews 20:649-688, 1999

[2] Fickett JW: Quantitative Discrimination of MEF2 Sites. Mol Cell Biol 16:437-441, 1996

[3] Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Prüß, M., Reuter, I. and Schacherer, F.: TRANSFAC: an integrated system for gene expression regulation. NAR 28: 316-319, 2000

[4] Wisconsin Package, Genetics Computer Group, WI

[5] Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast versatile tools for detection of consensus matches in nucleotide sequence data. NAR 23:4878-4884, 1995


307. On the Species of Origin: Diagnosing the Source of Symbiotic Transcripts
Peter T. Hraber, Santa Fe Institute;
Jennifer W. Weller, Virginia Technical University;
Short Abstract:

Sequencing expressed tags from mixed cultures of interacting symbionts (pathogenic or mutualistic) can help identify genes that regulate symbiosis, but presents an analytic challenge: to determine from which organism a transcript originated. Previous solutions used nucleotide composition or similarity searches, but a comparative analysis of hexamer counts is more powerful.

One Page Abstract:

Most organisms have developed ways to recognize and interact with other species. Symbiotic interactions range from pathogenic to mutualistic. Some molecular mechanisms of interspecific interaction are well understood, but many remain to be discovered. Expressed sequence tags (ESTs) from cultures of interacting symbionts can help identify genes that regulate symbiosis, but present a unique challenge for functional analysis. Given a sequence expressed in an interaction between two symbionts, the challenge is to determine from which organism the transcript originated. For high-throughput sequencing from interaction cultures, a reliable computational approach is needed. Previous investigations into GC nucleotide composition and comparative similarity searching provide provisional solutions, but a comparative lexical analysis, which uses a likelihood-ratio test of hexamer counts, is more powerful. Tests against genes whose origin and function are known yielded 94% accuracy. Microbial transcripts comprised about 75% of a Phytophthora sojae-infected soybean (Glycine max cv Harasoy) library, contrasted with 15% or less in root tissue libraries of Medicago truncatula from axenic, Phytophthora medicaginis-infected, mycorrhizal, and rhizobacterial treatments. Many of the symbiotic transcripts were of unknown function, suggesting candidates for further functional investigation.
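One way to compare hexamer counts between two candidate source organisms is a log-likelihood ratio over the hexamers of the query transcript. The sketch below is an assumption-laden illustration, not the authors' lexical analysis: the function names, the use of overlapping hexamers, the add-one pseudocount and the toy training sequences are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 6  # hexamers


def kmer_counts(seqs, k=K):
    """Count overlapping k-mers (unambiguous bases only) in training sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def log_frequencies(counts, k=K, pseudocount=1.0):
    """Log relative frequency of every possible k-mer, with a pseudocount."""
    total = sum(counts.values()) + pseudocount * 4 ** k
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return {w: math.log((counts[w] + pseudocount) / total) for w in words}


def log_likelihood_ratio(seq, logf_a, logf_b, k=K):
    """Sum of log(f_a / f_b) over the k-mers of seq; > 0 favours organism A."""
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in logf_a:  # skip words containing ambiguous bases
            score += logf_a[word] - logf_b[word]
    return score


# Toy training sets standing in for known transcripts of two organisms.
logf_host = log_frequencies(kmer_counts(["GCGCGCGCGCGCGCGCGCGC" * 5]))
logf_microbe = log_frequencies(kmer_counts(["ATATATATATATATATATAT" * 5]))
score_1 = log_likelihood_ratio("GCGCGCGCGCGC", logf_host, logf_microbe)
score_2 = log_likelihood_ratio("ATATATATATAT", logf_host, logf_microbe)
```

A transcript would be assigned to the first organism when the score is positive and to the second when it is negative; a real application would train on known genes from each species and calibrate a decision threshold.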

308. PAGAN : Predict and Annotate Genes in genomic sequence based on ANalysis of EST Clusters
Sasivimol Kittivoravitkul, Marek Sergot, Department of Computing, Imperial College of Science, Technology and Medicine;
Short Abstract:

Unlike other gene index projects, which cluster whole ESTs, PAGAN clusters alignments of ESTs from similarity search results. This eliminates undesirable features of ESTs, enabling more precise prediction of exons. Gene structures are revealed by a further assembly and refinement process. The results are shown graphically at the annotation and nucleotide levels.

One Page Abstract:

Expressed Sequence Tags (ESTs) have been an essential source in gene discovery. A similarity search against the EST database, dbEST, could reveal homologous genes in a genomic sequence and provide additional information regarding those genes. Due to the high redundancy and low quality of ESTs, analyzing the similarity search results in order to find genes and their structures is not a trivial task.

We have developed a tool, called PAGAN, which annotates genes in genomic sequences by extracting gene information from the results of a similarity search of dbEST. PAGAN filters the search results using the degree of identity and the length of the homologous EST as criteria, as in Bailey et al. (1998), and then clusters them into groups that are likely to represent the same exons, using d2_cluster (Burke et al., 1999). Unlike other gene index projects, which cluster whole ESTs, PAGAN clusters only the parts of the ESTs that align with the genomic sequence. When whole ESTs are clustered, ESTs may be put together in the same cluster because of chimeric clones, contamination and other artifacts, whereas clustering only the aligned parts of the ESTs reduces these false joins and discards irrelevant information in the ESTs. To obtain the actual underlying exon of each cluster, a consensus sequence for each cluster is derived using PHRAP and CAP3.

We have further refined the results by using EST source information such as cloneID and polarity. Gaps in the search results caused by masking out repeats and low-complexity DNA sequences before performing the similarity search are also taken into account. The results of PAGAN are displayed graphically at the level of annotation, where they can be compared to the results of different kinds of analysis programs, and at the level of the nucleotide, which allows the user to inspect the variation among EST alignments in a cluster.

Preliminary experiments with benchmark data have shown that PAGAN can detect about 12% more exons than the similarity search of STACK (Christoffels et al., 2001).

References:

Bailey, L.C., Searls, D., Overton, G.C. (1998) Analysis of EST-Driven Gene Annotation in Human Genomic Sequence. Genome Research, 8, 362-376

Burke, J., Davison, D., Hide, W. (1999) d2_cluster: A Validated Method for Clustering EST and Full-length cDNA Sequence. Genome Research, 9, 1135-1142

Christoffels, A., Gelder, A.V., Greyling, G., Miller, R., Hide, T., and Hide, W. (2001) STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res., 29, 234-238

309. Nested Genes in the Human Genome
Zipora Y. Fligelman, Einat Hazkani-Covo, Sarah Pollock, Hanne Volpin, Nili Guttmann-Beck, Compugen Ltd.;
Short Abstract:

Using the LEADS platform, novel examples of the rare phenomenon of nested genes were found in the human genome. Known examples from the literature were also detected. Most gene-finding algorithms model one gene per locus, and probably miss this phenomenon.

One Page Abstract:

Zipora Y. Fligelman*, Einat Hazkani-Covo*, Sarah Pollock, Hanne Volpin, and Nili Guttmann-Beck.

Compugen Ltd, Pinchas Rosen 72 Tel-Aviv 69512 ISRAEL


The phenomenon of nested (or interleaved) genes, which are located within introns of another gene, is believed to be rare in the human genome since only a few examples are known. Most gene-finding programs predict one gene per locus, so even with knowledge of most of the genome sequence it is difficult to estimate the frequency of this biological phenomenon. Hence, nested genes may be more abundant in the human genome than previously thought.

The LEADS platform (Shoshan et al. 2001), which clusters and assembles ESTs and mRNA sequences to the genome, determines the exon / intron structure of the resulting genes, including alternative splicing. We used this platform to identify nested genes. We concentrated on examples where both genes were supported by known RNA sequences.

We verified the existence of five examples of nested genes known from the literature; in three of them the nested genes are encoded by opposite strands (Pohar et al. 1999; Dunham et al. 1999), while two show nested genes encoded by the same strand (Tycowski et al. 1993; Cervini et al. 1995). We have found at least 12 new examples of nested genes, encoded both by opposite strands and by the same strand. Identifying nested genes on the same strand is more complex, as it is difficult to distinguish the phenomenon from alternative splicing when insufficient EST samples are available. The results of our study have more than doubled the number of known nested genes.


· Cervini, R., Houhou, L., Pradat, P. F., Bejanin, S., Mallet, J., and Berrard, S. 1995. Specific vesicular acetylcholine transporter promoters lie within the first intron of the rat choline acetyltransferase gene. J Biol Chem. 270(42):24654-24657.

· Dunham, I., Shimizu, N., Roe, B. A., Chissoe, S., Hunt, A. R., Collins, J. E., Bruskiewich, R., Beare, D. M., Clamp, M., Smink, L. J., Ainscough, R., Almeida, J. P., Babbage, A., Bagguley, C., Bailey, J., Barlow, K., Bates, K. N., Beasley, O., Bird, C. P., Blakey, S., Bridgeman, A. M., Buck, D., Burgess, J., Burrill, W. D., and O'Brien, K. P., et al. 1999. The DNA sequence of human chromosome 22. Nature 402(6761):489-495.

· Pohar, N., Godenschwege, T. A. and Buchner, E. 1999. Invertebrate tissue inhibitor of metalloproteinase: structure and nested gene organization within the synapsin locus is conserved from Drosophila to human. Genomics. 57(2):293-296.

· Shoshan, A., Grebinskiy, V., Magen, A., Scolnicov, A., Fink, E., Lehavi, D., and Wasserman, A. Designing oligo libraries taking alternative splicing into account. In M. L. Bittner, et al., editors, Microarrays: Optical Technologies and Informatics, Proc. SPIE 4266, May 2001.

· Tycowski, K. T., Shu, M. D. and Steitz, J. A. 1993. A small nucleolar RNA is processed from an intron of the human gene encoding ribosomal protein S3. Genes Dev. 7(7A):1176-1190.

* These authors contributed equally to this work.

310. Integrating Protein Homology into the Twinscan System for Gene Structure Prediction
Paul Flicek, Ian Korf, Michael R. Brent, Washington University;
Short Abstract:

Twinscan is a new gene-structure prediction system that directly extends the probability model of Genscan, allowing it to exploit the patterns of conservation observed in local alignments between a target sequence and its homologs. We present an addition to the Twinscan system that incorporates protein homology into the probability model.

One Page Abstract:

Twinscan [1] is a new gene-structure prediction system that directly extends the probability model of Genscan, allowing it to exploit the patterns of conservation observed in local alignments between a target sequence and its homologs. Twinscan is specifically designed for the analysis of high-throughput genomic sequences. It can handle multiple, incomplete or no genes on the target sequence and allows for inversions, duplications and changes in intron-exon structure between the target sequence and its homologs.

We present an addition to the Twinscan system that incorporates protein homology into the probability model. This modification addresses one of the current limits of the Twinscan system: the requirement for the availability of a significant portion of an informant genome at an appropriate evolutionary distance. Our preprocessing step includes a simple BLASTX [2] search to identify proteins that are potentially homologous to the target sequence. Additionally, since we use the highest scoring matches, we allow the use of proteins that are evolutionarily more or less distant than the ideal informant genome.

By combining this with the current Twinscan system, patterns of conservation at both the protein and nucleotide level can be created, which helps enable the identification of conserved non-coding regions. These complementary patterns of conservation may be important for future work on the automated prediction of regulatory regions.

Our experiments have shown that this extension of Twinscan, using protein homology information alone to construct the conservation sequence, performs nearly as well as when the top homologs (i.e. one or more sequences from the informant genome that match a given target sequence best) are used. While the present work is limited in its ability to find genes that do not have protein matches in the databases, we feel that for many organisms it may be an important additional source of information until more appropriate informant genomes have been sequenced. Finally, this represents a first step toward using both genomic and protein homology together in an integrated gene-structure prediction system.

[1] Korf, I., Flicek, P., Duan, D., Brent, M. R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics (in press)

[2] Gish, W. and States, D.J. 1993. Identification of protein coding regions by database similarity search. Nature Genetics 3(3): 266-72

311. Improving exon detection using human-rodent genomic sequence comparison
Luis Mendoza, Wyeth W. Wasserman, Center for Genomics and Bioinformatics, Karolinska Institute;
Short Abstract:

We describe a method to identify coding exons in orthologous genomic sequence pairs. Based on a new alignment tool (DPB) and the GenScan algorithm, MaskScan outperforms a variety of methods in our comparative study. Results of a screen of human chromosome 22 and orthologous mouse sequences will be described.

One Page Abstract:

Functional sequences are, in general, preferentially conserved between species over the course of evolution. Therefore, comparative genomic sequence analysis can be used as a tool for the identification of functional elements within long nucleotide sequences.

We developed a method to identify coding exons in human DNA, based on the identification of conserved regions between orthologous human and rodent genomic fragments. The recognition of conserved segments is enabled by a new global alignment program (DPB). Subsequent to alignment, regions of low conservation are masked and the resulting sequence is analyzed for the presence of genes or exons. The performance of the approaches was measured in terms of sensitivity, specificity and accuracy at both the exon and nucleotide levels. For comparison, we also measured the performance of common gene finding programs (GeneID and GenScan), a conservation-based gene finding tool (SGP-1), and variations of our algorithm based on different gene identification and exon finding tools. MaskScan delivered the best performance of all the methods on our datasets of orthologous gene sequences.
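The masking step can be illustrated with a small sketch that takes an existing pairwise alignment and replaces poorly conserved target bases with N before the sequence is passed to a gene finder. This is not the DPB or MaskScan code: the function name, the sliding-window identity criterion, the window size and the identity threshold are all assumptions made for the example.

```python
def mask_low_conservation(target_aln, informant_aln, window=10, min_identity=0.7):
    """Mask target bases that lie in no window of sufficient identity.

    target_aln, informant_aln -- the two rows of a pairwise alignment,
                                 equal length, with '-' for gaps
    window, min_identity      -- hypothetical parameters for illustration
    Returns the ungapped target sequence with masked bases replaced by 'N'.
    """
    if len(target_aln) != len(informant_aln):
        raise ValueError("aligned sequences must have equal length")
    n = len(target_aln)
    if n == 0:
        return ""
    # A column counts as conserved if both rows carry the same base.
    match = [a == b and a != "-"
             for a, b in zip(target_aln.upper(), informant_aln.upper())]
    keep = [False] * n
    for start in range(0, max(n - window + 1, 1)):
        end = min(start + window, n)
        if sum(match[start:end]) / (end - start) >= min_identity:
            for i in range(start, end):
                keep[i] = True
    # Drop gap columns of the target; mask everything not kept.
    return "".join(base if k else "N"
                   for base, k in zip(target_aln, keep) if base != "-")


# Toy alignment: a conserved block followed by an unconserved block.
human = "ACGTACGTAC" + "TTTTTTTTTT"
mouse = "ACGTACGTAC" + "GGGGGGGGGG"
masked = mask_low_conservation(human, mouse, window=5, min_identity=0.9)
```

The masked sequence keeps its original coordinates, so exon predictions made on it can be mapped directly back to the unmasked human sequence.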

Results of a screen of human chromosome 22 for potential coding exons will be presented.

We have implemented web servers for both the alignment program in isolation and the MaskScan method.

312. In-silico to in-vivo analysis of whole proteome (up)
Rajani Kanth Vangala, Janki Rangatia, Gerhard Behre, Ludwig Maximilian University;
Short Abstract:

A computational method for inferring protein-protein interactions or target genes is proposed. The method is based on available, biologically proven data and on five independent tests against genome sequence. The AML1/ETO protein found in AML M2 was analysed, and the predicted interactions and target genes could also be shown in vivo.

One Page Abstract:

Large-scale efforts to measure, detect and analyse the whole proteome for new protein-protein interactions and target genes using various experimental methods are underway. However, all of these approaches are labour- and time-intensive and need further development to improve accuracy. Here we propose a computational method for inferring protein-protein interactions or target genes for a protein of interest. This Matrix Method (MM) is based on available, biologically proven data; for each kind of protein-protein interaction prediction, five independent tests against genome sequence are carried out. To identify target genes of a test protein, a DNA sequence to which the protein is known to bind is used in a genome-wide analysis. As a model we analysed the fusion gene product AML1/ETO, found in many cases of Acute Myeloid Leukemia (AML) subtype M2. The functional role of this fusion protein has not yet been worked out in detail for drug design purposes. The proteins and genes identified using this method were readily shown to interact biologically or to differ in expression levels, showing that new protein interactions or target genes can be identified by this method. This kind of analysis could improve our understanding of the whole proteome and help identify new drug targets in many diseases.

313. Incorporating Additional Information to Hidden-Markov Models for Gene Prediction (up)
Tomas Vinar, Brona Brejova, University of Waterloo;
Ming Li, University of California Santa Barbara;
Ying Xu, Oak Ridge National Laboratory;
Short Abstract:

HMMs are frequently used in gene finding. Since it is generally hard to expand HMMs to incorporate different sources of information, we use external programs that are combined using machine learning techniques. In this way we achieve high accuracy of HMM-based gene finders and flexibility to incorporate many data sources.

One Page Abstract:

Hidden Markov models (HMMs) are frequently used in gene finding. Some features can be expressed conveniently in terms of HMMs, such as the basic structural constraints (i.e. that coding sequence consists of several exons separated by introns), dicodon bias in exons, and various other signals (such as splice site signals). By using generalized forms of HMMs it is also possible to include more information in the model, e.g. the distribution of exon lengths. However, adding more information greatly increases the number of states of the model, its conceptual complexity, running time and memory requirements. Some types of information (such as homologs in protein databases) are hard to model at all using traditional HMMs.

Therefore we propose to build a relatively simple HMM and incorporate additional sources of information in the form of "advisors". Each advisor is a prediction algorithm that takes into account one kind of additional information that can be inferred from the sequence or genomic databases.

Different advisors may produce contradicting predictions. Their results are therefore combined using machine learning approaches. The combination of advisors gives for each position of the sequence a distribution of probability over all possible structural elements that may occupy that position. Probability of an entire gene structure is then a product over all positions of individual structural element probabilities. This probability is then combined with the probability of the same structure given by the HMM.
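The probability computation described above can be sketched as follows. The abstract does not say how the advisor probability and the HMM probability are combined, so the weighted sum of log-probabilities used here is an assumption, as are the function names and the `advisor_weight` parameter.

```python
import math

def advisor_log_prob(labels, position_dists):
    """Log of the product, over positions, of the combined advisor probability
    for the structural element assigned to each position."""
    if len(labels) != len(position_dists):
        raise ValueError("need one distribution per sequence position")
    total = 0.0
    for label, dist in zip(labels, position_dists):
        p = dist.get(label, 0.0)
        if p <= 0.0:
            return float("-inf")  # structure is ruled out by the advisors
        total += math.log(p)
    return total

def combined_score(labels, position_dists, hmm_log_prob, advisor_weight=1.0):
    """Assumed combination rule: HMM log-probability plus weighted advisor term."""
    return hmm_log_prob + advisor_weight * advisor_log_prob(labels, position_dists)
```

Working in log space avoids numerical underflow when the product runs over many thousands of sequence positions.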

Currently our advisors include: (a) signal detecting algorithms for finding splice sites, promoters, branch sites, etc.; (b) results of homology searches against EST and protein databases; (c) optional suggestions from a human expert, which allow a user to influence the gene finding process.

The main advantage of this approach is its modularity. In order to add new sources of data we just need to add an advisor and automatically adjust the weights in the combination phase. Therefore we can expect to achieve a combination of the high accuracy typical of HMM-based gene finders and the flexibility of supporting many data sources.

Acknowledgements: This research is conducted in collaboration with Bioinformatics Solutions, Inc.

314. Gene prediction in the post-genomic era (up)
Enrique Blanco, Institut Municipal d'Investigacions Mediques (IMIM) / Facultat d'Informatica de Barcelona - (Universitat Politecnica de Catalunya);
Genis Parra, Sergi Castellano, Josep F. Abril, Moises Burset, Institut Municipal d'Investigacions Mediques (IMIM);
Xavier Messeguer, Facultat d'Informatica de Barcelona - (Universitat Politecnica de Catalunya);
Roderic Guigo, Institut Municipal d'Investigacions Mediques (IMIM);
Short Abstract:

geneid is a program to predict genes in anonymous genomic sequences, designed with a simple hierarchical structure. Its accuracy is comparable to that of other existing tools, and it is very efficient in terms of time and memory consumption (a parallel version is also available). Recent applications of geneid to the reannotation of eukaryotic genomes will be described.

One Page Abstract:

In eukaryotes, genes are difficult to detect accurately because of long intergenic regions and the fragmentation of genes into exons and introns. A number of genome sequencing projects have now been finished or are in their final stages, providing gigabases of raw information that need to be processed to gain biologically relevant knowledge. The first step in this process is locating the protein-coding genes. So far, several drafts for these species have been published; more accurate descriptions are expected in the near future, and we are therefore entering the age of reannotation.

geneid was one of the first programs to predict full exonic structures of vertebrate genes in anonymous DNA sequences (Guigo et al. 1992). It was designed following a simple hierarchical structure (signal to exon to gene), although the scoring scheme used to assess the reliability of the predictions was rather heuristic. Here we present the new geneid version, developed during the last two years. This new version maintains the hierarchical structure of the original geneid while simplifying the scoring scheme and giving the scores a probabilistic meaning: they are now computed as log-likelihood ratios of different Markov models. Thus, the optimal gene is computed as the assembly of exons that maximizes the sum of the scores of its assembled exons, using an efficient dynamic programming algorithm to search the space of all potential gene structures.
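The idea of assembling exons so as to maximize the sum of their scores can be illustrated with a small dynamic program. This is a simplified sketch, not the geneid algorithm: it only requires that chosen exons do not overlap and ignores reading-frame compatibility, exon types and gene boundaries, all of which a real gene assembler has to respect. The function name and the tuple layout are hypothetical.

```python
from bisect import bisect_left

def best_exon_chain(exons):
    """exons: list of (start, end, score) with inclusive coordinates.

    Returns (best_total_score, chain), where chain is the highest-scoring set
    of mutually non-overlapping exons in left-to-right order.
    """
    exons = sorted(exons, key=lambda e: e[1])   # order by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimum using only the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_left(ends, start, 0, i)
        take = best[j][0] + score
        if take > best[i][0]:
            best.append((take, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])                # skipping this exon is better
    return best[-1]
```

Because the scores are log-likelihood ratios, an exon with a negative score is never added on its own merit, which is why the empty chain scores 0.0.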

The accuracy of geneid predictions in human and fly genomes is comparable to the accuracy of other programs. Moreover, the simplicity of the new version of geneid results in greater performance in terms of speed and memory usage. We have also developed a parallel version, implemented with Pthreads (POSIX) on top of the current modular structure, that processes the same input sequence while dividing the execution time by a factor linear in the number of processors.

Due to the simplicity of its design, geneid results may be employed in projects other than pure gene finding. We will describe two recent applications of geneid: first, to search for selenoprotein genes - genes in which the TGA codon encodes the amino acid selenocysteine, in addition to being a stop signal - and second, to predict novel genes in the human genome using shotgun mouse genome sequences.

geneid is freely available, and an on-line web server is also provided.


- Guigo et al. 1992. Prediction of gene structure. J. Mol. Biol. 226: 141-157.
- Parra et al. 2000. GeneID in Drosophila. Genome Research 10: 511-515.

315. Improved Splice Site Prediction by Considering Local GC Content (up)
Aaron Levine, Richard Durbin, The Sanger Centre;
Short Abstract:

We have produced a stand-alone splice site predictor that considers local GC content during the prediction process. Our predictor shows significant improvements over standard models at identifying both donor and acceptor sites, particularly in gene-rich high GC regions, and is designed to integrate easily into probabilistic gene prediction systems.

One Page Abstract:

Despite the recent completion of the draft sequence of the human genome, accurate ab initio prediction of complex mammalian genes remains a largely unsolved problem. As many mammalian genes consist of a large number of small exons separated by much longer introns, reliable identification of intron splicing junctions is a key problem on which successful gene prediction depends. However, mammalian splice site consensus sequences are notoriously degenerate and uninformative, leading to high false positive rates for even the best ab initio splice site predictors.

We have explored the utility of considering a variety of additional information sources and found that considering local GC content during splice site prediction leads to a significant improvement in accuracy over standard first-order weight matrix models. Stratification by local GC content has its most profound effects on predictions in gene-rich high GC areas and aids in the prediction of both donor and acceptor splice sites.

We have produced a stand-alone splice site predictor (available from our website), which predicts both canonical donor and acceptor sites, as well as rarer GC donor sites. Our predictor generates both log-odds scores and posterior probability values for each potential splice site and is designed to be easily integrated into probabilistic gene prediction systems. Preliminary results indicate that our splice predictor performs comparably to or better than other top splice predictors under most conditions.
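A minimal sketch of GC-stratified scoring is given below. It is not the authors' predictor: for brevity it uses a zero-order (position-independent) weight matrix rather than a first-order one, and the GC bin boundaries, function names and prior are hypothetical. The idea is to choose a model according to the local GC bin, compute a natural-log odds score for the candidate site, and convert that score to a posterior probability given a prior.

```python
import math

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence window."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def pick_stratum(gc, boundaries=(0.4, 0.5, 0.6)):
    """Index of the GC bin; the boundaries are illustrative assumptions."""
    return sum(gc >= b for b in boundaries)

def log_odds(site, signal_wm, background):
    """Log-odds of a site under a weight matrix versus a background model.

    signal_wm: one dict per site position mapping base -> probability.
    background: dict mapping base -> probability.
    """
    return sum(math.log(signal_wm[i][base] / background[base])
               for i, base in enumerate(site))

def posterior(log_odds_score, prior):
    """Posterior probability of a true site from log-odds and a prior."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-(log_odds_score + prior_log_odds)))
```

In use, one weight matrix and one background model would be trained per GC bin, and `pick_stratum(gc_fraction(window))` would select which pair to pass to `log_odds`.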

316. Relaxed profile matching as a method for identifying putative novel proteins from the genome (up)
Mark Ibberson, Achim Frauenschuh, Massimo De Francesco, Serono Pharmaceutical Research Institute;
Short Abstract:

We have used a relaxed profile strategy to identify novel putative coding regions from the human genome. A relaxed profile for a family of secreted proteins identified 7974 unique open reading frames, of which 253 were selected for further analysis based on signal peptide and secondary structure predictions.

One Page Abstract:

The function of about 40% of the human genome cannot be assigned based on sequence homology to currently characterized protein families (1,2). For these and hitherto unidentified proteins, alternative methods need to be employed in order to identify proteins sharing similar biological characteristics, but lacking sequence homology to known protein families. We have attempted to identify novel proteins that share similar predicted structural properties to a known protein family, but no obvious sequence homology. To do this, we created a regular expression-based profile loosely describing a family of secreted proteins and used it to search the human draft genome. The resulting matches were subsequently filtered using signal peptide and secondary structure prediction algorithms to identify candidates for future analysis. The profile, for the chemokine family of secreted proteins, identified a total of 7974 unique open reading frames from the draft human genome. Of these, approximately 30% (2441) were either identical or homologous to known proteins present in SwissProt/Trembl or Derwent Patents databases. The remaining 70% (5533) were analyzed using signal and structure prediction algorithms and 253 (~5%) were selected based on signal peptide and secondary structure predictions. A subset of these sequences is currently being validated experimentally. Our initial results suggest that this type of strategy could be useful for identifying distant members of protein families or evolutionarily unrelated proteins that have evolved similar biological functions.
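The regular expression-based search can be sketched as below. The pattern shown is purely illustrative (four cysteines with loosely constrained spacing) and is not the profile used in this work; the length limits and the function name are likewise assumptions. The input is taken to be a mapping from ORF identifiers to translated protein sequences.

```python
import re

# Hypothetical relaxed profile: C, 0-3 residues, C, then two more cysteines
# at loosely constrained distances. Not the authors' actual profile.
RELAXED_PROFILE = re.compile(r"C.{0,3}C.{20,30}C.{12,20}C")

def matching_orfs(orfs, profile=RELAXED_PROFILE, min_len=60, max_len=150):
    """Return names of translated ORFs whose length is plausible for a small
    secreted protein and whose sequence matches the relaxed profile."""
    hits = []
    for name, protein in orfs.items():
        if min_len <= len(protein) <= max_len and profile.search(protein):
            hits.append(name)
    return hits
```

Hits from such a permissive search would still need the signal peptide and secondary structure filters described above, since a relaxed pattern matches many unrelated sequences.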

1) International Human Genome Sequencing Consortium, Nature 409, 860-921.

2) Venter, J.C et al., Science 291, 1304-1351.

317. Analyzing Alternatively Spliced Transcripts (up)
Ann Loraine, Guoying Liu, Alan Williams, Ray Wheeler, Michael A. Siani-Rose, David Kulp, Affymetrix, Inc.;
Short Abstract:

Two collections of alternatively spliced human loci were prepared from public human genome data. The first collection was built from mRNA-to-genomic alignments; the second was made using the gene-finding program AltGenie. Comparing these, we identified high-quality AltGenie-predicted loci and used these to test a protein-homology-based scheme for assessing transcript quality.

One Page Abstract:

Current estimates project that one third to one half of all human genes undergo alternative splicing and therefore give rise to multiple protein forms, many of which exhibit different, even antagonistic, activities [Mironov; Lander; Taylor]. To advance understanding of the role of alternative splicing in generating protein diversity, we have built two collections of alternative splice predictions using the public Human Genome Project data released Oct. 7, 2000, the same version analyzed in Lander, et al. The first set (A) includes 96,832 transcripts from 18,948 loci and was built using AltGenie, a gene-finding program that first uses EST/mRNA-to-genomic alignments to detect internal exons and then applies statistical methods to detect protein-coding exons in flanking genomic sequence [Kulp and Wheeler, unpublished data]. The second collection (RS/C) contains gene predictions made by aligning previously reported mRNAs to genomic sequence. Although the RS/C collection is based on experimental evidence (mRNA sequence records), it is still predictive since it is not always possible to reconstruct the reference transcript and/or protein from genomic sequence.

Due to the limitations of current computational gene-finding methods, it is widely agreed that before a computed prediction can be accepted, it must be vetted by an expert curator or confirmed by experimental data. Manual and experimental inspection of predicted genes is expensive and time-consuming; thus, we are exploring ways to automate gene prediction analysis with the goal of building a reliable collection of alternatively spliced human genes. Curators typically use protein homology data to evaluate whether a novel predicted transcript is likely to be correct; that is, curators give more weight to a predicted transcript when it encodes a protein that is homologous to a previously characterized protein family. Following this same reasoning, we are testing methods for gene prediction analysis that use protein homology data to assess the quality of predicted transcripts.


Lander, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409(6822):860-921.

Mironov, A.A., Fickett, J.W., and Gelfand, M.S. 1999. Frequent alternative splicing of human genes. Genome Res. 9(12):1288-93.

Taylor, J.F., Zhang, Q.Q., Wyatt, J.R., and Dean, N.M. 1999. Induction of endogenous Bcl-xS through the control of Bcl-x pre-mRNA splicing by antisense oligonucleotides. Nature Biotech. 17: 1097-1100.

318. EST curation with improved gene feature models (up)
Debopriya Das, Eldar Giladi, Incyte Genomics;
Short Abstract:

We present an effective EST curation tool which combines EST hits on the genome with statistical models for gene features on the genomic sequence. For each EST, the tool generates a ranked set of segments which potentially represent exons or portions thereof.

One Page Abstract:

The availability of the human genomic sequence and of large repositories of ESTs provides the opportunity to generate an improved view of a gene, or a portion thereof, by developing algorithms that combine both data-sets. In this poster we present a simple tool which combines EST hits on the genome with statistical models for gene features on the genomic sequence, and generates for each EST a ranked set of segments which potentially represent exons or portions thereof, associated with the EST.

Our work is motivated by several applications. First, an important route to the discovery of rare genes, that is genes with a low expression level, involves seeking singletons and extending them into full length transcripts by generating primers from those singletons. By singleton we mean an EST which overlaps with no other EST from a given library or a collection of libraries. The difficulty with this approach is that some mRNAs are contaminated with un-spliced introns, and ESTs which are sequenced from them could be intronic singletons. Hence, it is important to have a tool which would help discriminate between singletons which are intronic and those which originate in an exon. By leveraging additional information from the genomic sequence our tool is able to improve the discrimination substantially. Another application of this tool is to aid curators in generating full length transcripts by providing them a ranked set of candidate exons associated with each EST.

In order to compute the candidate exons associated with an EST, the EST is first mapped to the genome. Then the genomic segment which encompasses the EST by 500-1000nt on each side is analyzed. In this segment potential donor and acceptor splice sites are identified and scored with a Markov model, and for each pair of sites the coding potential of the region bound between them is computed. The resulting segments are ranked based on those three scores using a simple ranking scheme. For the first exon, the method substitutes the coding potential score by a combination of a coding score for the segment downstream of the Methionine and a score from a model for UTR, for the region between the acceptor and the Methionine. An analogous variant exists for the last exon.
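The pairing and ranking step can be sketched as follows. The abstract does not specify the ranking scheme, so summing the three scores is an assumption, as are the length limits, the function name and the convention that the coding-potential model is supplied as a callable.

```python
def rank_candidate_exons(acceptors, donors, coding_score,
                         min_len=30, max_len=1000, top=5):
    """Rank acceptor/donor pairs as candidate internal exons.

    acceptors, donors: lists of (position, site_score) within the genomic
    segment around the EST. coding_score(start, end) returns the coding
    potential of the region between the two sites.
    """
    candidates = []
    for a_pos, a_score in acceptors:
        for d_pos, d_score in donors:
            length = d_pos - a_pos
            if min_len <= length <= max_len:   # donor must lie downstream
                total = a_score + d_score + coding_score(a_pos, d_pos)
                candidates.append((total, a_pos, d_pos))
    candidates.sort(reverse=True)              # highest combined score first
    return candidates[:top]
```

For first and last exons, the `coding_score` callable could be replaced by one that combines a coding score with a UTR model score, as the text describes.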

The statistical models which we use for each of the components mentioned above are Markov models analogous to those used by Genscan. In view of the large number of curated genes mapped to the genomic sequence, available in the Incyte database, we were able to use higher order models in some cases, and in general we obtained substantially improved performance on models trained on this large data collection.

The exon-curation tool which we present here differs from the available genefinding tools in that the latter try to find the global gene structure given a genomic piece, whereas we try to do a local analysis and curation of a single exon. However, we have identified a few examples in which our tool detects errors in the predictions of genefinding tools. Moreover, we find that this tool can identify an intron-free segment and the correct extent of the exon accurately in about 85-90% of cases, including the first and last coding exons. We are currently assessing the coding content of ESTs in public databases using this tool, and developing additional improved gene feature models.

319. Annotation of Human Genomic Regions to Identify Candidate Disease Genes (up)
Damian Smedley, Derek Huntley, Sasivimol Kittivoravitkul, Holger Hummerich, Imperial College, London, UK;
Soraya Bardien-Kruger, SANBI, South Africa;
Peter Little, University of New South Wales, Australia;
Win Hide, SANBI, South Africa;
Marek Sergot, Mark McCarthy, Imperial College, London, UK;
Short Abstract:

GANESH is an annotation tool for genomic regions identified in human disease gene studies. Daily updated gene predictions are based on similarity to known genes/ESTs, Genscan prediction, and mouse genome synergy. To select candidate genes we are developing automated methods to retrieve gene expression and function data.

One Page Abstract:

We have developed a set of tools, GANESH (Genome Annotation Network for Expressed Sequence Hits), to annotate small (10-20 Mb) regions of the human genome identified as containing potential disease genes. Genomic sequence, in the form of finished and unfinished BAC/PAC clones from the public genome effort, is retrieved and orientated according to the Golden Path. Annotation is then carried out on a clone by clone basis using a series of automated scripts which (i) blast the genomic sequence against embl, dbEST, STACKdb, SwissProt, Trembl, IPI, dbSnp and dbSts, (ii) identify Pfam domains and (iii) run Genscan exon predictions. Both the genomic sequence and the databases searched are automatically checked every night for updates and reprocessing occurs whenever required. Predicted genes are constructed based on parallel lines of evidence including (i) similarity to known genes and ESTs, (ii) Genscan predictions, and (iii) synergy with the emerging mouse genome. All predicted genes are stored, including the unlikely ones, as in a positional cloning approach every possible gene in the region should be considered as a potential candidate. However we categorize the predictions in terms of the levels of supporting evidence and hence likelihood of being a real gene. For instance, category 1.1 genes have an exact match to a known gene, category 1.2 genes have strong homology to a known gene, category 2 genes have exons predicted from Genscan or mouse genome comparisons with some backup EST evidence, and category 3 genes have EST evidence only. The results of the annotation are stored in a relational MySQL database and can be viewed remotely/locally using a java GUI. To drive selection of positional candidates for further study we are in the process of developing automated retrieval methods to collect attributes for each predicted gene.
These attributes will include qualitative (tissue expression profiles) and quantitative (microarray) expression data, functional data (from gene ontology and involvement in KEGG metabolic pathways), and finally the gene's position in the region. Again the results will be stored in a MySQL database with a java GUI allowing the biologist to recover all the candidate genes according to their particular criteria. Testing of candidates is likely to involve SNP association studies, so identified SNPs and their positions in the gene structure will also be stored in the attribute database. GANESH annotation on a region of chromosome 1 (1q21-24) will be presented. This region has been identified by several groups, including our own, as harbouring a gene involved in type 2 diabetes. In this region, 133 category 1.1 genes, 145 category 1.2 genes and 393 genes from the other categories, were identified. Examples of the gene attribute retrieval will also be demonstrated.
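The evidence categories given as examples above can be written as a small decision function. This is a sketch only: the boolean flags and the function name are hypothetical, the order in which the rules are tested is an assumption, and GANESH may define further categories that the abstract does not list.

```python
def evidence_category(exact_known_match=False, strong_homology=False,
                      genscan_exon=False, mouse_conserved=False,
                      est_support=False):
    """Assign a predicted gene to one of the evidence categories named in
    the abstract, testing the strongest evidence first."""
    if exact_known_match:
        return "1.1"    # exact match to a known gene
    if strong_homology:
        return "1.2"    # strong homology to a known gene
    if (genscan_exon or mouse_conserved) and est_support:
        return "2"      # predicted exons with some backup EST evidence
    if est_support:
        return "3"      # EST evidence only
    return None         # not covered by the categories described here
```

Storing the category alongside each prediction keeps the unlikely genes available to the biologist while still allowing candidates to be filtered by level of support.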

320. Annotation of the E. coli genome revisited (up)
Vera van Noort, David Ussery, Thomas Schou Larsen, Marie Skovgaard, Centre for Biological Sequence Analysis, DTU;
Short Abstract:

Our aim is to find the real protein coding genes in E. coli, using sequenced genomes of four different E. coli strains and four other enteric bacteria. We put all annotated genes into six different categories, ranging from confident to wrong using de novo prediction, homology searches and contextual information.

One Page Abstract:

E. coli is one of the most studied model organisms in biology. The genome of this enteric bacterium was sequenced in 1997 and, as in other sequenced genomes, about 30 percent of the genes were of unknown function. The genomes of four different E. coli strains and four other enteric bacteria are now available. The genomes of the four sequenced E. coli strains differ in size between 4,636,552 and 529,376 basepairs. The number of annotated genes also differs considerably between the strains, ranging from 4405 to 5502. A previous study has shown that the number of protein coding genes in the E. coli K12_MG1655 genome is overestimated by between 15 and 20%, which means that between 625 and 950 annotated genes are not protein coding regions but rather Open Reading Frames that occur by chance. This was estimated from the number of randomly occurring stop codons, based on AT content, as well as matches with non-hypothetical proteins from SwissProt. As more and more people use public databases and assume that all annotated genes in these databases are real, it is necessary to find out which annotated genes correspond to true genes and include these in the databases.
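The link between AT content and chance ORFs can be illustrated with a simple calculation. The sketch below assumes independent bases with equal A and T frequencies and equal G and C frequencies, which is a simplification and not necessarily the model used in the study cited above; the function names are hypothetical.

```python
def stop_codon_probability(at_content):
    """Probability that a random codon is TAA, TAG or TGA, assuming
    independent bases with P(A) = P(T) and P(G) = P(C)."""
    a = t = at_content / 2.0
    g = (1.0 - at_content) / 2.0
    return t * a * a + t * a * g + t * g * a   # TAA + TAG + TGA

def prob_orf_at_least(codons, at_content):
    """Probability that a random reading frame runs for `codons` codons
    without meeting a stop codon."""
    return (1.0 - stop_codon_probability(at_content)) ** codons
```

Under this model, AT-poor genomes have fewer random stop codons and therefore more long ORFs arising by chance, which is one reason short annotated ORFs deserve extra scrutiny.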

Our aim is to find the genes that are real protein coding genes. To do this we put all annotated genes into six different categories, ranging from confident to wrong. First, genes that match known proteins are considered "confident". Then genes that have close orthologs, i.e. known proteins with a common ancestor in another organism, are considered "conserved hypothetical", which means less confident. For E. coli we can use, for example, Salmonella to find close orthologs. If an ORF is conserved over distant organisms, we consider this ORF "conserved hypothetical". In addition to de novo prediction and homology searches we also use contextual information to find the right direction of the genes. This is necessary because conservation on one strand implies conservation on the other strand, for obvious reasons. Moreover, the genefinding might give some additional hypothetical genes, for which we can look for the same kinds of evidence as for the already annotated genes. This evidence can make them confident. Additional evidence can be found in DNA expression data from microarray experiments. For this, however, it is necessary to know the positions of the primers, because primers of wrongly annotated genes can be located in mRNA containing real expressed genes. Roughly 3000 genes are found to be expressed; many of the remaining genes are short ORFs that are unlikely to be real, that is, they are overpredicted. If the genefinding gives a low score and no evidence for that gene is found, we consider the gene wrongly annotated. "More confident" and "less confident" are used relative to the old annotation, depending on how confident we think the genes are.

We found evidence for 1,112 genes that are conserved on the DNA level between the genomes of 8 different enteric bacteria. Conserved, in this case, means more than 50% identity.