Gene Structure, Regulation, and Modeling

Tuesday July 24 9:00 - 9:45

The Modern RNA world: many genes don't encode proteins

Sean Eddy, Washington University


Some genes produce functional RNAs instead of encoding proteins. Current gene-finding approaches focus exclusively on protein-coding genes. The diversity of the "modern RNA world" is an open question. The availability of multiple complete genome sequences has made it possible for us and others to develop computational screening approaches for detecting novel noncoding RNAs. The most powerful screens are based on comparative genome sequence analysis. Early results from these screens in a number of genomes, coupled with experimental confirmation of the predictions, indicate that even the well-studied Escherichia coli genome has numerous previously uncharacterized noncoding RNA genes.

Tuesday July 24 9:45 - 10:10  

Promoter prediction in the human genome

Sridhar Hannenhalli, Samuel Levy, Celera Genomics


Computational prediction of eukaryotic pol II promoters has remained one of the most elusive problems despite the considerable effort devoted to it. Researchers have looked for various types of signals around the transcriptional start site (TSS), viz. oligonucleotide statistics, potential binding sites for core factors, clusters of binding sites, proximity to CpG islands, etc. The proximity of CpG islands to gene starts is now a well-established fact, although until recently it was based on very little genomic data. In this work we explore the possibility of enhancing promoter prediction accuracy by combining CpG island information with a few other biologically motivated, seemingly independent signals that cover most of what is currently known. We benchmarked the method on a much larger genomic dataset than those used in previous studies. We were able to improve slightly upon current prediction accuracy. Furthermore, we observe that CpG islands are the dominant signal and that the other signals do not improve the prediction. This suggests that computational prediction of promoters for genes with no associated CpG island (typically those with tissue-specific expression) may not even be possible when looking only at the immediate neighborhood of the TSS. We suggest some biological experiments and studies to better understand the biology of transcription.
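
As a rough illustration of what "CpG island information" can mean in practice, the sketch below computes the GC fraction and the observed/expected CpG ratio of a sequence window and applies the widely used thresholds of length >= 200 bp, GC >= 50%, and observed/expected >= 0.6. These thresholds and the function names are assumptions made for illustration; the abstract does not say which CpG island definition the authors used.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is c * g / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply commonly quoted CpG island thresholds (assumed, not the authors')."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

A window that passes this test near a candidate TSS would count as CpG-associated; a real pipeline would slide the window along the genome and merge overlapping hits.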

Tuesday July 24 10:45 - 11:10  

Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition

Uwe Ohler, Heinrich Niemann, Universität Erlangen-Nürnberg; Guo-chun Liao, Gerald M. Rubin, University of California at Berkeley


We present an approach to integrate physical properties of DNA, such as DNA bendability or GC content, into our probabilistic promoter recognition system MCPROMOTER. In the new model, a promoter is modeled as a sequence of consecutive segments, each described by a joint likelihood for the DNA sequence and the profiles of physical properties. Sequence likelihoods are modeled with interpolated Markov chains, physical properties with Gaussian distributions. The background uses two joint sequence/profile models for coding and non-coding sequences, each consisting of a mixture of a sense and an anti-sense submodel. On a large Drosophila test set, we achieved a reduction of about 30% in false positives compared with a model based solely on sequence likelihoods.
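
The idea of a joint sequence/profile likelihood for one segment can be sketched as follows. This is a minimal illustration, not MCPROMOTER itself: it uses a plain first-order Markov chain in place of an interpolated one, summarizes the physical property as a single number per segment, and assumes the sequence and the profile are independent given the segment model, so that their log-likelihoods add. All names and the layout of the model dictionary are hypothetical.

```python
import math


def markov_loglik(seq, init, trans):
    """Log-likelihood of seq under a first-order Markov chain.

    init[b] is P(first base = b); trans[a][b] is P(next = b | current = a).
    """
    ll = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur])
    return ll


def gaussian_loglik(x, mean, var):
    """Log-density of a univariate Gaussian at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def segment_loglik(seq, profile_value, model):
    """Joint log-likelihood, assuming sequence and profile are independent."""
    return (markov_loglik(seq, model["init"], model["trans"])
            + gaussian_loglik(profile_value, model["mean"], model["var"]))
```

Scoring a candidate promoter would then amount to summing segment_loglik over its segments and comparing the total with the same quantity under a background model.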

Tuesday July 24 11:10 - 11:35  

Computational expansion of genetic networks

Amos Tanay, Ron Shamir, Tel-Aviv University


We present a new methodology for computational analysis of gene and protein networks. The aim is to generate new educated hypotheses on gene functions and on the logic of the biological network circuitry, based on gene expression profiles. The framework supports the incorporation of biologically motivated network constraints and rules to improve specificity. Since current data is insufficient for de novo reconstruction, the method receives as input a known pathway core and suggests likely expansions to it. Network modeling is combinatorial, yet data can be probabilistic. At the heart of the approach are a fitness function, which estimates the quality of suggested network expansions given the core and the data, and a specificity measure of the expansions. The approach has been implemented in an interactive software tool called GENESYS. We report encouraging results in a preliminary analysis of the yeast ergosterol pathway based on transcription profiles. In particular, the analysis suggests a novel ergosterol transcription factor.

Tuesday July 24 11:35 - 12:00  

GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles

Carol Friedman, Queens College CUNY and Columbia University; Pauline Kra, Hong Yu, Michael Krauthammer, Andrey Rzhetsky, Columbia University


Systems that extract structured information from natural language passages have been highly successful in specialized domains. The time is opportune for developing analogous applications for molecular biology and genomics. We present a system, GENIES, that extracts and structures information about cellular pathways from the biological literature in accordance with a knowledge model that we developed earlier.

We implemented GENIES by modifying an existing medical natural language processing system, MedLEE, and performed a preliminary evaluation study. Our results demonstrate the value of the underlying techniques for the purpose of acquiring valuable knowledge from biological journals.

Tuesday July 24 12:00 - 12:25  

Designing better phages

Steven S. Skiena, State University of New York


We propose a method to engineer the genome of bacteriophages to increase their effectiveness as antibacterial agents. Specifically, we exploit the redundancy of the triplet code to design genomes that avoid restriction sites while producing the same proteins as wild-type phages. We give an efficient algorithm to minimize the number of restriction sites against sets of cutter sequences, and demonstrate that phage genomes can be significantly protected against surprisingly large sets of enzymes with no loss of function. Finally, we develop a model to explain why evolution has failed to eliminate many possible restriction sites despite selective pressure, thus motivating the need for genome-level sequence engineering.
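
To make the use of codon redundancy concrete, here is a small sketch that swaps synonymous codons to reduce the number of restriction sites in a coding sequence while leaving the encoded protein unchanged. It is a simple greedy local search written for illustration and is not the efficient algorithm referred to in the abstract. It assumes the standard genetic code, an in-frame coding sequence whose length is a multiple of three, and scanning of the given strand only; all function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate an in-frame coding sequence into one-letter amino acids."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_sites(dna, sites):
    """Count all (possibly overlapping) occurrences of each site in dna."""
    return sum(dna.startswith(s, i) for i in range(len(dna)) for s in sites)


def recode(dna, sites):
    """Greedily swap synonymous codons while the site count strictly drops."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    best = count_sites(dna, sites)
    improved = True
    while improved:
        improved = False
        for idx in range(len(codons)):
            for alt in SYNONYMS[CODON_TO_AA[codons[idx]]]:
                if alt == codons[idx]:
                    continue
                trial = codons[:idx] + [alt] + codons[idx + 1:]
                score = count_sites("".join(trial), sites)
                if score < best:
                    codons, best, improved = trial, score, True
                    break
    return "".join(codons), best
```

For example, the sequence GAATTC (Glu-Phe) contains an EcoRI site; replacing GAA with the synonymous GAG removes the site without changing the protein. A greedy search like this can stall in local minima, which is one reason an exact algorithm is preferable.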


Tuesday July 24 14:00 - 14:50

Computational Analysis of RNA Splicing

Christopher Burge, MIT


RNA splicing is an essential step in the expression of most eukaryotic genes. An important goal of research on this process is to determine a set of rules, perhaps encoded in a computer algorithm, that accurately predicts the splicing pattern of primary transcripts. Splicing of short introns is thought to proceed via an "intron definition" mechanism, in which the 5' and 3' splice signals (5'ss and 3'ss) are initially recognized and paired across the intron. We are using computational methods to help understand the specificity of RNA splicing by intron definition, taking advantage of available transcript data from five eukaryotes with essentially complete genomic sequences: yeast, fly, worm, mustard weed and human. We have found that short introns in Drosophila and C. elegans contain essentially all of the information for their recognition by the splicing machinery, and computer programs which simulate splicing specificity can identify the exact boundaries of approximately 95% of short introns in both organisms. In yeast, the 5'ss, branch signal and 3'ss can accurately identify intron locations but do not precisely determine the location of 3' cleavage in every intron. The 5'ss, branch signal and 3'ss are not sufficient to accurately identify short introns in human and plant transcripts. However, specific subsets of candidate intronic enhancer motifs can be identified in both human and Arabidopsis which contribute dramatically to the accuracy of splicing simulators. These motifs are predominantly U-rich in Arabidopsis and mostly contain G triplets (GGG) in human. We are developing computational methods to more specifically predict which oligonucleotides act as intronic or exonic splicing enhancers and the transcript locations where they function. Important applications of splicing simulators include the prediction of genes in genomic sequences and the prediction of the splicing phenotypes of genetic mutations or polymorphisms.

Tuesday July 24 14:50 - 15:15  

Disambiguating proteins, genes, and RNA in text: a machine learning approach

Vasileios Hatzivassiloglou, Pablo A. Duboué, Andrey Rzhetsky, Columbia University


We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended ways of defining contextual features for disambiguation are examined, and a fully unsupervised method for obtaining training examples is proposed. We train and evaluate our system over a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates up to 85%.

Tuesday July 24 15:50 - 16:15  

Gene recognition based on DAG shortest paths

John S. Chuang, Dan Roth, University of Illinois at Urbana-Champaign


We describe DAGGER, an ab initio gene recognition program which combines the output of high dimensional signal sensors in an intuitive gene model based on directed acyclic graphs. In the first stage, candidate start, donor, acceptor, and stop sites are scored using the SNoW learning architecture. These sites are then used to generate a directed acyclic graph in which each source-sink path represents a possible gene structure. Training sequences are used to optimize an edge weighting function so that the shortest source-sink path maximizes exon-level prediction accuracy. Experimental evaluation of prediction accuracy on two benchmark data sets demonstrates that DAGGER is competitive with ab initio gene finding programs based on Hidden Markov Models.
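
The graph step described above can be sketched with a standard single-pass shortest-path computation on a DAG. The sketch assumes that node ids are already a topological order (for example, candidate sites sorted by sequence position, with an artificial source and sink), and the edge weights in the example are invented for illustration; they stand in for whatever the trained edge weighting function would produce.

```python
def dag_shortest_path(num_nodes, edges, source, sink):
    """Shortest source-sink path in a DAG.

    edges is a list of (u, v, weight) with u < v, i.e. node ids are assumed
    to form a topological order. Returns (distance, path as list of nodes).
    """
    inf = float("inf")
    dist = [inf] * num_nodes
    prev = [None] * num_nodes
    dist[source] = 0.0
    # Sorting by u guarantees dist[u] is final before u's out-edges are relaxed.
    for u, v, w in sorted(edges):
        if dist[u] + w < dist[v]:
            dist[v] = dist[u] + w
            prev[v] = u
    if dist[sink] == inf:
        return inf, []
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[sink], path[::-1]


# Hypothetical toy graph: 0=source, 1=start, 2=donor, 3=acceptor, 4=stop, 5=sink.
toy_edges = [
    (0, 1, 0.0),  # source -> start codon
    (1, 2, 1.0),  # first exon
    (2, 3, 0.5),  # intron
    (3, 4, 1.0),  # second exon
    (1, 4, 4.0),  # alternative: single-exon gene
    (4, 5, 0.0),  # stop codon -> sink
]
best_cost, best_path = dag_shortest_path(6, toy_edges, 0, 5)
```

In this toy graph the two-exon structure (cost 2.5) beats the single-exon alternative (cost 4.0), so the returned path passes through the donor and acceptor nodes.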

Tuesday July 24 16:15 - 16:40  

An efficient algorithm for finding short approximate non-tandem repeats

Ezekiel F. Adebiyi, Universität Tübingen; Tao Jiang, University of California, Riverside; Michael Kaufmann, Universität Tübingen


We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet $\Sigma$ and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected number. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length $O(\log N)$) approximate repeats. The running time of our algorithm is $O(DN^{3pow(\epsilon)-1}\log N)$, where $\epsilon=D/P$ and $pow(\epsilon)$ is an increasing, concave function that is 0 when $\epsilon=0$ and about 0.9 for DNA and protein sequences.
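
For orientation, the problem statement can be turned into a naive quadratic baseline that compares every pair of non-overlapping length-P windows by edit distance. This is only a reference point against which a sub-quadratic, seed-based method would be measured; it is not the authors' algorithm. Restricting the comparison to equal-length, non-overlapping windows is a simplifying assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, mismatches all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def naive_repeats(s, p, d):
    """All pairs (i, j), j >= i + p, whose length-p windows differ by <= d edits."""
    hits = []
    for i in range(len(s) - p + 1):
        for j in range(i + p, len(s) - p + 1):
            if edit_distance(s[i:i + p], s[j:j + p]) <= d:
                hits.append((i, j))
    return hits
```

On the string ACGTTTACGA with P = 4 and D = 1, the only hit is the pair of windows ACGT (position 0) and ACGA (position 6). The cost grows roughly as N^2 P^2, which is what makes seed-based filtering attractive for long sequences.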

Tuesday July 24 16:40 - 17:05  

Integrating genomic homology into gene structure prediction

Ian Korf, Paul Flicek, Daniel Duan, Michael R. Brent, Washington University


TWINSCAN is a new gene-structure prediction system that directly extends the probability model of GENSCAN, allowing it to exploit homology between two related genomes. Separate probability models are used for conservation in exons, introns, splice sites, and UTRs, reflecting the differences among their patterns of evolutionary conservation. TWINSCAN is specifically designed for the analysis of high-throughput genomic sequences containing an unknown number of genes. In experiments on high-throughput mouse sequences, using homologous sequences from the human genome, TWINSCAN shows notable improvement over GENSCAN in exon sensitivity and specificity and dramatic improvement in exact gene sensitivity and specificity. This improvement can be attributed entirely to modeling the patterns of evolutionary conservation in genomic sequence.