Structural genomics has the goal of obtaining useful three-dimensional models of all proteins by a combination of experimental structure determination and comparative model building. We evaluate different strategies for optimizing the information returned per unit of effort. The strategy that maximizes structural coverage requires about seven times fewer structure determinations than a strategy in which targets are selected at random. With a reasonable choice of model quality and a coverage goal of 90%, we estimate the total effort of structural genomics: it would take approximately 16,000 carefully selected structure determinations to construct useful atomic models for the vast majority of all proteins. In practice, unless target selection is coordinated globally, the total effort will likely increase by a factor of three. The task can be accomplished within a decade, provided selection of targets is highly coordinated and significant funding is available. Based on a paper in Nature Structural Biology, June 2001, with Dennis Vitkup, Eugene Melamud and John Moult.
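The coverage-maximizing strategy described above can be illustrated as a greedy set-cover heuristic: at each step, solve the structure whose protein family would let the largest number of not-yet-covered proteins be modeled. This is a sketch under that assumption, not the paper's actual selection procedure; the function and data names are hypothetical.

```python
def greedy_targets(families):
    """Greedy target selection (set-cover heuristic): repeatedly pick the
    family whose structure determination adds the most newly modelable
    proteins, until no family contributes anything new.

    families: dict mapping family name -> set of protein ids that one
    solved structure from that family would allow to be modeled.
    Returns (order of chosen families, set of covered proteins).
    """
    covered = set()
    order = []
    remaining = dict(families)
    while remaining:
        # family contributing the most proteins not yet covered
        best = max(remaining, key=lambda f: len(remaining[f] - covered))
        if not (remaining[best] - covered):
            break  # nothing left to gain
        covered |= remaining.pop(best)
        order.append(best)
    return order, covered

# toy example with hypothetical families
fams = {
    "A": {"p1", "p2", "p3"},
    "B": {"p3", "p4"},
    "C": {"p5"},
}
order, covered = greedy_targets(fams)
```

Random selection would often re-pick proteins from already-covered families; the greedy rule is what makes carefully selected targets go several times further, as the abstract notes.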
Low complexity proteins and protein domains have sequences that appear highly non-random. Over the years, these sequences have been routinely filtered out during sequence similarity searches, because interest has focused on globular proteins and including these domains can severely skew search results. However, early work on these proteins, together with more recent studies of the related area of repeated protein sequences, suggests that low complexity protein domains have function and therefore warrant further investigation. 0j.py is a new tool for demarcating low complexity protein domains more accurately than has been possible to date. The paper describes 0j.py and its use in revealing proteins with repeated and poly-amino-acid peptides. Statistical methods are then employed to examine the distribution of these proteins across species, while keyword clustering is used to suggest roles performed by proteins through the use of low complexity domains.
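The abstract does not give 0j.py's algorithm, but a common baseline for demarcating low complexity regions is the Shannon entropy of amino-acid composition in a sliding window (low entropy = biased, repetitive composition). The sketch below uses that baseline; the window size and cutoff are illustrative assumptions, not 0j.py's parameters.

```python
import math

def window_entropy(seq, start, size):
    """Shannon entropy (bits) of the amino-acid composition of one window."""
    window = seq[start:start + size]
    counts = {}
    for aa in window:
        counts[aa] = counts.get(aa, 0) + 1
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def low_complexity_positions(seq, size=12, cutoff=2.2):
    """Window start positions whose entropy falls below the cutoff.

    A poly-amino-acid run (e.g. poly-Q) has entropy near 0; a diverse
    globular stretch approaches log2(size) bits. The size/cutoff values
    here are illustrative, not the thresholds used by 0j.py.
    """
    return [i for i in range(len(seq) - size + 1)
            if window_entropy(seq, i, size) < cutoff]
```

For example, a sequence containing a poly-Q stretch flanked by diverse residues is flagged only where the repeat sits, which is exactly why such regions skew similarity searches when left unfiltered.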
The typical output of many computational methods to identify binding sites is a long list of motifs containing some real motifs (those most likely to correspond to the actual binding sites) along with a large number of random variations of these. We present a statistical method to separate real motifs from their artifacts. This produces a short list of high quality motifs that is sufficient to explain the over-representation of all motifs in the given sequences. Using synthetic data sets, we show that the output of our method is very accurate. On various sets of upstream sequences in S. cerevisiae, our program identifies several known binding sites, as well as a number of significant novel motifs.
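The core idea, separating real motifs from their random variants, can be sketched with a simple greedy rule: scan candidates from most to least over-represented and discard any motif within a small Hamming distance of an already-accepted one, treating it as an artifact of that motif. This is a simplification of the statistical method in the abstract (which explains over-representation rather than just distance-filtering); all names and scores below are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length motifs."""
    return sum(x != y for x, y in zip(a, b))

def prune_artifacts(candidates, max_dist=1):
    """Greedy artifact removal sketch.

    candidates: dict motif -> over-representation score (e.g. a z-score).
    Keeps the strongest motifs; drops any motif within max_dist mismatches
    of an already-kept one, treating it as a random variation of it.
    Returns the short list of kept motifs, strongest first.
    """
    kept = []
    for motif, score in sorted(candidates.items(),
                               key=lambda kv: kv[1], reverse=True):
        if all(hamming(motif, k) > max_dist for k in kept):
            kept.append(motif)
    return kept
```

Applied to a long candidate list, this collapses clusters of near-duplicates into one representative per cluster, yielding the kind of short high-quality list the abstract describes.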
Molecular portraits, such as mRNA expression or DNA methylation patterns, have been shown to correlate strongly with phenotypic parameters. These molecular patterns can be revealed routinely on a genomic scale. However, class prediction based on these patterns is an under-determined problem, due to the extremely high dimensionality of the data compared with the usually small number of available samples. This makes a reduction of the data dimensionality necessary. Here we demonstrate how phenotypic classes can be predicted by combining feature selection and discriminant analysis. By comparing several feature selection methods, we show that the right dimension-reduction strategy is of crucial importance for classification performance. The techniques are demonstrated by methylation-pattern-based discrimination between acute lymphoblastic leukemia and acute myeloid leukemia.
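The combination of feature selection and discriminant analysis can be sketched as follows: rank features (e.g. methylation sites) by a two-sample t-like score, keep the top k, and classify a new sample by its nearer class centroid in the reduced space (a diagonal-LDA-style rule). This is an illustrative pipeline under those assumptions, not the paper's exact method; the data below are synthetic.

```python
import math

def t_score(xs, ys):
    """Two-sample t-like score for one feature (higher = more discriminative)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((v - mx) ** 2 for v in xs) / max(len(xs) - 1, 1)
    vy = sum((v - my) ** 2 for v in ys) / max(len(ys) - 1, 1)
    return abs(mx - my) / math.sqrt(vx / len(xs) + vy / len(ys) + 1e-9)

def select_and_classify(class0, class1, sample, k=2):
    """Feature selection + centroid-based discriminant rule (sketch).

    class0, class1: lists of training samples (each a list of feature values).
    Keeps the k features with the highest t-score, then assigns the sample
    to the class with the nearer centroid in that reduced space.
    """
    d = len(class0[0])
    scores = [t_score([s[j] for s in class0], [s[j] for s in class1])
              for j in range(d)]
    top = sorted(range(d), key=lambda j: scores[j], reverse=True)[:k]
    c0 = [sum(s[j] for s in class0) / len(class0) for j in top]
    c1 = [sum(s[j] for s in class1) / len(class1) for j in top]
    x = [sample[j] for j in top]
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1
```

With thousands of methylation sites and only a handful of samples, the selection step is doing most of the work, which is why the abstract stresses that the choice of dimension-reduction strategy dominates classification performance.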
Pattern discovery in unaligned DNA sequences is a challenging problem in both computer science and molecular biology. Several different methods and techniques have been proposed so far, but in most cases the signals in DNA sequences are very complicated and escape detection. Exact exhaustive methods can solve the problem only for short signals with a limited number of mutations. In this work, we extend exhaustive enumeration to longer patterns as well. In more detail, the basic version of the algorithm presented in this paper, given as input a set of sequences and an error ratio ε, finds all patterns that occur in at least q sequences of the set with at most ⌊εm⌋ mutations, where m is the length of the pattern. The only restriction is imposed on the location of mutations along the signal: a valid occurrence of a pattern can present at most ⌊εi⌋ mismatches in its first i nucleotides, and so on. However, we show how the algorithm can also be used when no assumption can be made on the position of mutations. In this case, it is also possible to estimate the probability of finding a signal according to the signal length, the error ratio, and the input parameters. Finally, we discuss some significance measures that can be used to rank the patterns output by the algorithm.
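The prefix-mismatch restriction can be made concrete: a window of the sequence is a valid occurrence of a pattern only if, for every prefix length i, it accumulates at most ⌊εi⌋ mismatches. The sketch below implements that validity check and the "occurs in at least q sequences" count; it is a minimal illustration of the occurrence rule, not the enumeration algorithm itself, and the function names are my own.

```python
import math

def valid_occurrence(pattern, window, eps):
    """Prefix-mismatch rule: the window is valid iff every prefix of
    length i contains at most floor(eps * i) mismatches."""
    mismatches = 0
    for i, (p, w) in enumerate(zip(pattern, window), start=1):
        if p != w:
            mismatches += 1
        if mismatches > math.floor(eps * i):
            return False
    return True

def count_supporting_sequences(pattern, sequences, eps):
    """Number of input sequences containing at least one valid occurrence.
    A pattern is reported when this count reaches the quorum q."""
    m = len(pattern)
    return sum(
        any(valid_occurrence(pattern, s[j:j + m], eps)
            for j in range(len(s) - m + 1))
        for s in sequences
    )
```

For instance, with ε = 0.25 a pattern of length 4 tolerates one mismatch overall, but only in its last position, since ⌊0.25·i⌋ = 0 for i < 4. This front-loaded constraint is what keeps the exhaustive search tractable for longer patterns: most candidate branches are pruned at an early prefix.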
This discussion addresses the question of whether the record of scientific research should be privately owned and controlled. We believe that the permanent, archival record of scientific research and ideas should neither be owned nor controlled by publishers, but should belong to the public, and should be made freely available.
We support the establishment of international online public libraries of science that contain the complete text of all published scientific articles in searchable and interlinked formats.