Abstract
Some genes produce functional RNAs instead of encoding proteins. Current gene-finding approaches focus exclusively on protein-coding genes. The diversity of the ``modern RNA world'' is an open question. The availability of multiple complete genome sequences has made it possible for us and others to develop computational screening approaches for detecting novel noncoding RNAs. The most powerful screens are based on comparative genome sequence analysis. Early results from these screens in a number of genomes, coupled with experimental confirmation of the predictions, indicate that even the well-studied Escherichia coli genome has numerous previously uncharacterized noncoding RNA genes.
Abstract
Computational prediction of eukaryotic polII promoters has been one of the most elusive problems despite considerable effort devoted to the study. Researchers have looked for various types of signals around the transcriptional start site (TSS), viz. oligonucleotide statistics, potential binding sites for core factors, clusters of binding sites, proximity to CpG islands, etc. The proximity of CpG islands to gene starts is now a well-established fact, although until recently it was based on very little genomic data. In this work we explore the possibility of enhancing promoter prediction accuracy by combining CpG island information with a few other biologically motivated, seemingly independent signals that cover most of what is currently known. We benchmarked the method on a much larger genomic dataset than previous studies used. We were able to improve slightly upon current prediction accuracy. Furthermore, we observe that CpG islands are the most dominant signals and that the other signals do not improve the prediction. This suggests that computational prediction of promoters for genes with no associated CpG island (typically having tissue-specific expression) may not even be possible when looking only at the immediate neighborhood of the TSS. We suggest some biological experiments and studies to better understand the biology of transcription.
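To make the CpG island signal concrete, the following is a minimal sketch of the two statistics most often used to flag a CpG island in a sequence window: GC fraction and the observed/expected CpG ratio. The function names and the default thresholds (length of at least 200 bp, GC fraction of at least 0.5, ratio above 0.6) are assumptions taken from commonly cited criteria; the abstract does not say which definition this work used.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so this counts every CpG
    gc_fraction = (c + g) / n
    expected = c * g / n  # expected CpG count if C and G were placed independently
    ratio = cpg / expected if expected > 0 else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to one window (hypothetical helper)."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and ratio > min_ratio
```

In practice such a test would be applied to windows sliding upstream and downstream of a candidate TSS; how the window statistics are combined with the other signals is not specified in the abstract.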
Abstract
We present an approach to integrate physical properties of DNA, such as DNA bendability or GC content, into our probabilistic promoter recognition system MCPROMOTER. In the new model, a promoter is represented as a sequence of consecutive segments represented by joint likelihoods for DNA sequence and profiles of physical properties. Sequence likelihoods are modeled with interpolated Markov chains, physical properties with Gaussian distributions. The background uses two joint sequence/profile models for coding and non-coding sequences, each consisting of a mixture of a sense and an anti-sense submodel. On a large Drosophila test set, we achieved a reduction of about 30% of false positives when compared with a model solely based on sequence likelihoods.
Abstract
We present a new methodology for computational analysis of gene and protein networks. The aim is to generate new educated hypotheses on gene functions and on the logic of the biological network circuitry, based on gene expression profiles. The framework supports the incorporation of biologically motivated network constraints and rules to improve specificity. Since current data are insufficient for de novo reconstruction, the method receives as input a known pathway core and suggests likely expansions to it. Network modeling is combinatorial, yet data can be probabilistic. At the heart of the approach are a fitness function, which estimates the quality of suggested network expansions given the core and the data, and a specificity measure of the expansions. The approach has been implemented in an interactive software tool called GENESYS. We report encouraging results in a preliminary analysis of the yeast ergosterol pathway based on transcription profiles. In particular, the analysis suggests a novel ergosterol transcription factor.
Abstract
Systems that extract structured information from natural language passages have been highly successful in specialized domains. The time is opportune for developing analogous applications for molecular biology and genomics. We present a system, GENIES, that extracts and structures information about cellular pathways from the biological literature in accordance with a knowledge model that we developed earlier.
We implemented GENIES by modifying an existing medical natural language processing system, MedLEE, and performed a preliminary evaluation study. Our results demonstrate the value of the underlying techniques for the purpose of acquiring valuable knowledge from biological journals.
Abstract
We propose a method to engineer the genome of bacteriophages to increase their effectiveness as antibacterial agents. Specifically, we exploit the redundancy of the triplet code to design genomes that avoid restriction sites while producing the same proteins as wild-type phages. We give an efficient algorithm to minimize the number of restriction sites against sets of cutter sequences, and demonstrate that phage genomes can be significantly protected against surprisingly large sets of enzymes with no loss of function. Finally, we develop a model to explain why evolution has failed to eliminate many possible restriction sites despite selective pressure, thus motivating the need for genome-level sequence engineering.
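The idea of recoding a gene with synonymous codons so that it avoids restriction sites can be illustrated with a small dynamic program. This is a sketch under stated assumptions, not the algorithm from the paper: it assumes the standard genetic code, a single coding strand whose length is a multiple of three, and exact-match recognition sites; the names `recode`, `count_sites`, and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate a coding sequence codon by codon."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_sites(dna, sites):
    """Count all (possibly overlapping) occurrences of the given sites."""
    return sum(dna[i:i + len(s)] == s
               for s in sites for i in range(len(dna) - len(s) + 1))


def recode(dna, sites):
    """Return a synonymous sequence with the fewest site occurrences."""
    keep = max(len(s) for s in sites) - 1  # bases that can still start a site
    best = {"": (0, "")}  # trailing bases -> (site count, sequence so far)
    for i in range(0, len(dna), 3):
        aa = CODON_TO_AA[dna[i:i + 3]]
        nxt = {}
        for suffix, (cost, seq) in best.items():
            for codon in SYNONYMS[aa]:
                window = suffix + codon
                # Count only sites that end inside the newly added codon.
                new = sum(window[e - len(s):e] == s
                          for s in sites
                          for e in range(max(len(suffix) + 1, len(s)),
                                         len(window) + 1))
                key = window[-keep:] if keep else ""
                cand = (cost + new, seq + codon)
                if key not in nxt or cand < nxt[key]:
                    nxt[key] = cand
        best = nxt
    return min(best.values())[1]
```

For example, `recode("GAATTC", ["GAATTC"])` returns a sequence that still translates to `EF` but no longer contains the EcoRI recognition site. A real design would also have to consider the reverse strand, degenerate recognition sequences, and codon usage, none of which is modeled here.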
Abstract
RNA splicing is an essential step in the expression of most eukaryotic genes. An important goal of research on this process is to determine a set of rules, perhaps encoded in a computer algorithm, that accurately predicts the splicing pattern of primary transcripts. Splicing of short introns is thought to proceed via an ``intron definition'' mechanism, in which the 5' and 3' splice signals (5'ss and 3'ss) are initially recognized and paired across the intron. We are using computational methods to help understand the specificity of RNA splicing by intron definition, taking advantage of available transcript data from five eukaryotes with essentially complete genomic sequences: yeast, fly, worm, mustard weed and human. We have found that short introns in Drosophila and C. elegans contain essentially all of the information for their recognition by the splicing machinery, and computer programs that simulate splicing specificity can identify the exact boundaries of approximately 95% of short introns in both organisms. In yeast, the 5'ss, branch signal and 3'ss can accurately identify intron locations but do not precisely determine the location of 3' cleavage in every intron. The 5'ss, branch signal and 3'ss are not sufficient to accurately identify short introns in human and plant transcripts. However, specific subsets of candidate intronic enhancer motifs that contribute dramatically to the accuracy of splicing simulators can be identified in both human and Arabidopsis. These motifs are predominantly U-rich in Arabidopsis and mostly contain G triplets (GGG) in human. We are developing computational methods to more specifically predict which oligonucleotides act as intronic or exonic splicing enhancers and the transcript locations where they function. Important applications of splicing simulators include prediction of genes in genomic sequences and prediction of the splicing phenotypes of genetic mutations or polymorphisms.
Abstract
We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended ways of defining contextual features for disambiguation are examined, and a fully unsupervised method for obtaining training examples is proposed. We train and evaluate our system over a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates up to 85%.
Abstract
We describe DAGGER, an ab initio gene recognition program which combines the output of high dimensional signal sensors in an intuitive gene model based on directed acyclic graphs. In the first stage, candidate start, donor, acceptor, and stop sites are scored using the SNoW learning architecture. These sites are then used to generate a directed acyclic graph in which each source-sink path represents a possible gene structure. Training sequences are used to optimize an edge weighting function so that the shortest source-sink path maximizes exon-level prediction accuracy. Experimental evaluation of prediction accuracy on two benchmark data sets demonstrates that DAGGER is competitive with ab initio gene finding programs based on Hidden Markov Models.
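The shortest source-sink path step can be sketched as a single pass over the edges of a directed acyclic graph in topological order. This is a minimal illustration, not DAGGER's implementation: it assumes nodes are numbered so that every edge goes from a lower to a higher index (as candidate sites ordered by sequence position would be), and the function name and edge weights are hypothetical.

```python
def shortest_path(num_nodes, edges, source, sink):
    """Shortest path in a DAG whose edges (u, v, weight) all satisfy u < v."""
    inf = float("inf")
    dist = [inf] * num_nodes
    prev = [None] * num_nodes
    dist[source] = 0.0
    # Sorting by tail node visits edges in topological order, so dist[u] is
    # final before any edge leaving u is relaxed.
    for u, v, w in sorted(edges):
        if dist[u] + w < dist[v]:
            dist[v] = dist[u] + w
            prev[v] = u
    if dist[sink] == inf:
        return inf, []
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[sink], path[::-1]


# Toy graph: 0 = source, 1 = start, 2 = donor, 3 = acceptor, 4 = stop, 5 = sink.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 0.5), (3, 4, 1.0),
         (1, 4, 5.0), (4, 5, 0.0)]
cost, path = shortest_path(6, edges, 0, 5)
```

In the toy graph the two-exon path through the donor and acceptor nodes is cheaper than the single-exon path from start to stop, so it is the one returned. How the edge weighting function is trained is the substance of the method and is not reproduced here.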
Abstract
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length [...]) approximate repeats. The running time of our algorithm is [...], where [...] and [...] is an increasing, concave function that is 0 when [...] and about 0.9 for DNA and protein sequences.
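The notion of two substrings repeating "with at most D differences" corresponds to an edit distance threshold. The sketch below shows only that pairwise check, using the standard dynamic program for edit distance; it is not the seed-based sub-quadratic algorithm of the abstract, and the function names are assumptions.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and mismatches turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def is_approximate_repeat(x, y, d):
    """True if x and y differ by at most d edit operations."""
    return edit_distance(x, y) <= d
```

Applying this check to every pair of length-P windows of S would take time quadratic in N, which is the cost the seed characterization is meant to avoid.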
Abstract
TWINSCAN is a new gene-structure prediction system that directly extends the probability model of GENSCAN, allowing it to exploit homology between two related genomes. Separate probability models are used for conservation in exons, introns, splice sites, and UTRs, reflecting the differences among their patterns of evolutionary conservation. TWINSCAN is specifically designed for the analysis of high-throughput genomic sequences containing an unknown number of genes. In experiments on high-throughput mouse sequences, using homologous sequences from the human genome, TWINSCAN shows notable improvement over GENSCAN in exon sensitivity and specificity and dramatic improvement in exact gene sensitivity and specificity. This improvement can be attributed entirely to modeling the patterns of evolutionary conservation in genomic sequence.