Evolution

222. Identifying orthologs and paralogs in speciation-duplication trees
223. Reconstructing the duplication history of tandemly repeated genes
224. New approaches for the analysis of gene family evolution
225. Calculating orthology support levels in large scale data analyses
226. Search Treespace and Evaluate Regions of an Alignment with LumberJack
227. Analysis of Large-Scale Duplications Involving Human Olfactory Receptors
228. Whole Genome Comparison by Metabolic Pathway Profiling



222. Identifying orthologs and paralogs in speciation-duplication trees
Lars Arvestad, Center for Genome Research, Karolinska Institutet;
lars.arvestad@cgr.ki.se
Short Abstract:

A speciation-duplication tree (SD-tree) is a tree where an inner node is a bifurcating S-node or a D-node without degree restriction. We consider the problem of computing an SD-tree given a species tree and pairwise evolutionary distances with the aim of identifying orthologs and paralogs.

One Page Abstract:

The study of proteins in model organisms is an important tool for advancing our understanding of human proteins. Assuming that homology often implies similar function, we can transfer knowledge from one organism to another. However, gene duplication can make it difficult to draw conclusions based on protein homology. Here, the identification of orthologous and paralogous protein relationships can be an important aid for the researcher.

Orthologs are homologs for which the most recent common ancestor (MRCA) corresponds to a speciation event. For paralogs, the MRCA derives from a gene duplication event.

We consider the problem of computing a speciation-duplication tree (SD-tree) for homologous sequences from a small set of species and present an algorithm for computing an SD-tree given a species tree and pairwise evolutionary distances.
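
The MRCA-based definitions above can be sketched in a few lines of Python. This is only an illustration of how ortholog and paralog pairs could be read off an SD-tree once it has been computed; it is not the algorithm of the abstract, and the tree encoding, the function names and the gene names are all hypothetical.

```python
from itertools import combinations

# Hypothetical SD-tree encoding (an assumption, not from the abstract):
# a leaf is a gene name (str); an inner node is a tuple (kind, children),
# where kind is "S" (speciation) or "D" (duplication).

def leaves(node):
    """Return the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [gene for child in children for gene in leaves(child)]

def classify_pairs(node, result=None):
    """Label every gene pair by the type of its most recent common ancestor."""
    if result is None:
        result = {}
    if isinstance(node, str):
        return result
    kind, children = node
    label = "ortholog" if kind == "S" else "paralog"
    # Genes taken from two different children have this node as their MRCA.
    for left, right in combinations([leaves(c) for c in children], 2):
        for a in left:
            for b in right:
                result[frozenset((a, b))] = label
    for child in children:
        classify_pairs(child, result)
    return result

# Toy tree: a speciation at the root, then a duplication on the human side.
tree = ("S", [("D", ["human_A1", "human_A2"]), "mouse_A"])
pairs = classify_pairs(tree)
for pair, label in sorted(pairs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), label)
```

Because D-nodes carry no degree restriction, the sketch accepts any number of children per node.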


223. Reconstructing the duplication history of tandemly repeated genes
Olivier ELEMENTO, IMGT (The International IMmunoGeneTics Database) & LIRMM Montpellier;
Olivier GASCUEL, LIRMM;
Marie-Paule LEFRANC, IMGT;
olivier@ligm.igh.cnrs.fr
Short Abstract:

We describe here a novel approach to the reconstruction of the duplication history of tandemly repeated genes, based on a model of duplication by unequal recombination. We present this model and its related mathematical objects, the reconstruction algorithms and their application to two data sets of immunogenetics sequences.

One Page Abstract:

Classical phylogenetic analysis studies the relationships between species based on the comparison of a single gene. Its main goal is to reconstruct a tree that represents the history of speciations. The problem we describe here is different: we aim to reconstruct the duplication history of a single gene within a single genome, and we consider only long (several kilobases), tandemly arranged sequences, each containing a single gene. Assuming that our sequences were not affected by gene conversion events and that our loci did not undergo any deletions, we introduce a simple model of duplication based solely on unequal recombination, which is commonly acknowledged as the primary mechanism responsible for tandem duplications. Our model allows simple duplications (a gene is duplicated and inserted adjacent to the initial gene) and block duplications (a block of 2 or n sequences is duplicated and inserted near the initial block). Although identical just after duplication, these sequences diverge over time as they accumulate their own mutations.

We then define three types of mathematical objects to describe the evolution of these clusters of tandemly repeated genes. First, we define what we call a time-valued duplication history, i.e. a description of the real duplication history. Since we cannot generally rely on the molecular clock hypothesis, inferring a time-valued duplication history from nucleotide sequences is not possible. In particular, the position of the root and the order in which duplications occurred cannot be recovered from DNA sequences. Consequently, we can only reconstruct what we call a duplication tree, i.e. an unrooted phylogeny whose topology is compatible with at least one duplication history. According to the model of duplication, the root of a duplication tree can only be situated somewhere (but not everywhere) in the tree between the most distant repeats on the locus. When rooted at one of its allowed branches, a duplication tree can be transformed into what we call an ordinal duplication history, i.e. a history in which duplication events are partially ordered.

Although a duplication tree is a phylogeny, it is easy to show that not all phylogenies can be duplication trees. We use an algorithm we call PDT (for PossibleDuplicationTree) to determine whether a given phylogeny with ordered leaves can be a duplication tree. This algorithm provides us with a mathematical characterisation of duplication histories and duplication trees. We also use the PDT algorithm to show that, for a given number of tandemly repeated sequences, the number of duplication trees is much smaller than the number of distinct phylogenies.

Given this model of duplication, we use an exhaustive search procedure to reconstruct duplication trees: given a set of nucleotide sequences, we compute the parsimony value of every possible duplication tree and select those that minimize this value. To speed up the reconstruction (especially when dealing with large numbers of repeated genes), we use a faster search procedure, not guaranteed to find the optimal tree, based on a greedy heuristic: starting with a tree made from the first three repeats, the procedure iteratively inserts new repeats into the growing tree such that each resulting tree minimizes the parsimony value. The procedure stops when all repeats are inserted.

We applied this model and these search procedures to two human loci containing tandemly repeated immunoglobulin and T-cell receptor genes: the IGLC and TRGV loci. We showed for both loci that the duplication tree found by our exhaustive search procedure corresponds to the most parsimonious phylogeny. Since the probability of a phylogeny being a duplication tree is small (0.04 in the TRGV case), this constitutes a strong validation of our initial hypothesis concerning the duplication mechanisms. In addition, the heuristic search reconstructs the same duplication tree as the exhaustive search, but much faster. These results remain stable under bootstrap analysis, indicating that this identity between the most parsimonious duplication tree and the most parsimonious phylogeny is not fortuitous. Compatibility of our reconstructed trees with known polymorphisms in the TRGV locus (two genes are missing in some individuals) provides further evidence that our reconstruction can provide good insights into the duplication histories of tandemly repeated genes.


224. New approaches for the analysis of gene family evolution
Jessica Siltberg, Jens Lagergren, Bengt Sennblad, David A. Liberles, Stockholm Bioinformatics Center, Stockholm University, 10691 Stockholm, Sweden;
jessica@sbc.su.se
Short Abstract:

Because different positions within proteins evolve at different rates, Ka/Ks is frequently underestimated when averaged over entire gene sequences, which hinders the detection of genes undergoing adaptive evolution. We present a simple covarion-based method for estimating Ka/Ks ratios calculated from variant residues in subclades of a phylogenetic tree operating with stationary substitution trends.

One Page Abstract:

Analyzing the ratio of nonsynonymous to synonymous nucleotide substitution rates (Ka/Ks) has emerged as a powerful technique for detecting proteins undergoing adaptive evolution or changes of function. Because different positions within proteins evolve at different rates, Ka/Ks is frequently underestimated when averaged over entire gene sequences. To correct for this, it is possible to compute Ka/Ks within sliding windows of the primary sequence. However, this approach ignores the largest underlying reason for site-specific variation in rates: selective pressures dictated by protein three-dimensional structure. As an alternative, we present a simple covarion-based method for estimating Ka/Ks ratios calculated from variant residues in subclades of a phylogenetic tree operating with stationary substitution trends.
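
The dilution effect described above can be made concrete with a small calculation. The sketch below is not the covarion-based method of the abstract; it assumes a plain Jukes-Cantor correction of the observed proportions of differences and uses invented counts for a short fast-evolving region embedded in an otherwise conserved gene. All names and numbers are hypothetical.

```python
import math

def jc_distance(p):
    """Jukes-Cantor correction for an observed proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ratio of corrected nonsynonymous to synonymous distances."""
    ka = jc_distance(nonsyn_diffs / nonsyn_sites)
    ks = jc_distance(syn_diffs / syn_sites)
    return ka / ks

# Invented counts: (nonsyn diffs, nonsyn sites, syn diffs, syn sites).
region = (12, 60, 2, 20)      # short region under positive selection
rest = (5, 540, 18, 180)      # remainder of the gene, strongly conserved
whole = tuple(a + b for a, b in zip(region, rest))

region_ratio = ka_ks(*region)
whole_ratio = ka_ks(*whole)
print(f"region Ka/Ks = {region_ratio:.2f}, whole-gene Ka/Ks = {whole_ratio:.2f}")
```

With these toy counts the region alone gives a ratio above 1, while the whole-gene average falls well below 1, which is the underestimation the abstract refers to.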


225. Calculating orthology support levels in large scale data analyses
Christian Storm, Erik Sonnhammer, Center for Genomics Research, Karolinska Institutet;
christian.storm@cgr.ki.se
Short Abstract:

Orthologous proteins in different species are likely to have similar biochemical function and biological role. Here we present a method that calculates orthology support levels by analyzing a set of bootstrap trees instead of the optimal tree.

One Page Abstract:

Orthologous proteins in different species are likely to have similar biochemical function and biological role. When annotating a newly sequenced genome by sequence homology, the most precise and reliable functional information can thus be derived from orthologs in other species. A standard method of finding orthologs is to compare the sequence tree with the species tree. However, since the topology of phylogenetic trees is not always reliable, one might get incorrect assignments. Here we present a method that resolves this problem by analyzing a set of bootstrap trees instead of the optimal tree. The frequency of orthology assignments in the bootstrap trees can be interpreted as a support value for possible orthology of the sequences. This approach is efficient enough to analyze large datasets on the scale of whole genomes. It is implemented in C and Java and calculates orthology support levels for all pairwise combinations of sequences from two groups of species. The method was tested on simulated datasets and on real data of homologous proteins.
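
A minimal sketch of the support calculation follows. It is not the authors' C/Java implementation. As a simplifying assumption it replaces the comparison with a species tree by a species-overlap rule: a node is treated as a duplication when its two subtrees share a species. The tree encoding, the "species|gene" leaf labels and the function names are hypothetical.

```python
from collections import Counter

def species(leaf):
    """Leaves are assumed to be labelled 'species|gene'."""
    return leaf.split("|")[0]

def collect(node, orthologs):
    """Return the leaves below node and record ortholog pairs on the way up."""
    if isinstance(node, str):
        return [node]
    left = collect(node[0], orthologs)
    right = collect(node[1], orthologs)
    shared = {species(x) for x in left} & {species(x) for x in right}
    if not shared:
        # No species in common: treat the node as a speciation event.
        for a in left:
            for b in right:
                orthologs.add(frozenset((a, b)))
    return left + right

def orthology_support(bootstrap_trees):
    """Fraction of bootstrap trees in which each pair is assigned as orthologous."""
    counts = Counter()
    for tree in bootstrap_trees:
        orthologs = set()
        collect(tree, orthologs)
        counts.update(orthologs)
    return {pair: n / len(bootstrap_trees) for pair, n in counts.items()}

# Four toy bootstrap trees; three share one topology, one disagrees.
t1 = (("human|a", "mouse|a"), ("human|b", "mouse|b"))
t2 = (("human|a", "human|b"), ("mouse|a", "mouse|b"))
support = orthology_support([t1, t1, t1, t2])
for pair, value in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), value)
```

Pairs that are orthologous under every topology receive full support, while pairs whose orthology depends on a single replicate receive a low value.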


226. Search Treespace and Evaluate Regions of an Alignment with LumberJack
Carolyn J. Lawrence, R. Kelly Dawe, Russell L. Malmberg, University of Georgia;
carolyn@dogwood.botany.uga.edu
Short Abstract:

The ML heuristic search algorithms currently available are computationally impractical for large datasets (especially those consisting of protein sequences). We are developing an ML heuristic search tool that we call LumberJack that progressively jackknifes an alignment to generate multiple NJ trees, then compares them based upon likelihood scores.

One Page Abstract:

Phylogenomics is a method of sequence-based function prediction by phylogenetic analysis (Eisen 1998). The phylogenomic method often yields more accurate functional hypotheses than techniques based solely upon sequence similarity (such as BLAST). It is implemented by constructing a reasonable phylogenetic tree for a given dataset, then mapping the functions of experimentally characterized proteins onto the tree. Kuhner and Felsenstein (1994) found that the optimality criterion most successful at inferring accurate phylogenies overall is ML. However, the ML heuristic search algorithms currently available are computationally impractical for large datasets (especially those consisting of protein sequences). We are developing an ML heuristic search tool that we call LumberJack. LumberJack progressively jackknifes an alignment to generate multiple NJ trees, then compares those trees statistically on the basis of their relative likelihood scores. This sampling procedure finds phylogenetic trees that are similar to those built by ML star decomposition (Adachi and Hasegawa 1996), but carries out only a fraction of the computations.
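
The jackknifing step can be sketched as follows. The abstract does not specify how LumberJack chooses the regions it removes, so this example assumes the simplest scheme: non-overlapping windows of fixed width, deleted one at a time. The function name, the window width and the toy alignment are hypothetical; building an NJ tree from each reduced alignment and scoring it by likelihood would be left to external phylogenetics software.

```python
def jackknife_alignments(alignment, window):
    """Yield (start, reduced alignment) with one window of columns removed."""
    length = len(next(iter(alignment.values())))
    for start in range(0, length, window):
        end = min(start + window, length)
        reduced = {name: seq[:start] + seq[end:] for name, seq in alignment.items()}
        yield start, reduced

# Toy alignment, invented for illustration only.
alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGTAC"}
replicates = list(jackknife_alignments(alignment, window=2))
for start, reduced in replicates:
    print(start, reduced)
```

Each replicate would yield one candidate tree, and comparing the trees then indicates which alignment regions most affect the likelihood.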


227. Analysis of Large-Scale Duplications Involving Human Olfactory Receptors
Tera Newman, Barbara Trask, U. of Washington dept. of Molecular Biotechnology and Fred Hutchinson Cancer Research Center;
Janet Young, Fred Hutchinson Cancer Research Center;
newmant@u.washington.edu
Short Abstract:

Olfactory receptor genes (ORs) have a complicated history of duplication, diversification, and pseudogenization. OR gene diversity has evolved in order to bind multitudes of odorants. Using sequence homology, phylogenetic analysis and genomic repeat structure, we have characterized the duplications of large blocks of a recently active ~100-member subfamily of ORs.

One Page Abstract:

Olfactory receptor genes (ORs) constitute a family of G-protein coupled receptors and are the largest protein family in the mammalian genome. The human OR family has approximately 1000 members that have expanded to more than 40 regions of the genome. This family has experienced sequence diversification and rampant pseudogenization. The diversity of the ORs likely arose from intense selective pressure to recognize the broad spectrum of volatile odorants encountered in different environmental niches. Here we provide an analysis of an ~100-member subfamily of ORs, labeled 7E, that has undergone substantial recent genomic expansion. The duplication events included more than 30 kb of genomic sequence surrounding the coding regions of the genes. We combined phylogenetic analysis of all 7E gene coding sequences with the overall sequence homology and large-scale repeat structure of the genomic regions surrounding each 7E member. Using the output of PAUP, Miropeats, and RepeatMasker, we have constructed highly informative graphics that depict the nature and extent of similarity of the genomic sequence around each 7E gene. Almost all members are pseudogenes, suggesting that the original ancestral copies were non-functional before the expansion. Phylogenetically clustered genes are not physically close in the genome, implying extensive inter-chromosomal, rather than local, expansion. The 7E subfamily separates well into two groups based on phylogenetic analysis of the coding sequence, the positions of two stop-codon polymorphisms, and the repeat-motif structure surrounding the genes. These surrounding regions may reveal common elements of the mechanisms of gene duplication. Further study of the 7E subfamily will increase understanding of the complex history of these genes. Additionally, 7E duplications may mediate large-scale chromosomal rearrangements similar to those that are involved in disease phenotypes.


228. Whole Genome Comparison by Metabolic Pathway Profiling
Li Liao, Sun Kim, Jean-Francois Tomb, DuPont Central Research & Development;
li.liao@usa.dupont.com
Short Abstract:

We developed a method to compare organisms based on whole-genome metabolic pathway profiles. The method includes scoring schemes and algorithms for evaluating profiles that are based on a hierarchy of attributes. Phylogenetic trees of 31 completed genomes were constructed and compared to conventional phylogenetic trees based on 16S rRNA.

One Page Abstract:

Traditionally, reconstruction of evolutionary relationships among organisms has been based on comparisons of 16S rRNA sequences. The significance of phylogenies based on these sequences has recently been questioned, with growing evidence for extensive lateral transfer of genetic material. Phylogenetic trees based on (protein) sequence analysis are not all congruent with traditional trees. As more genomes are sequenced and their metabolic pathways reconstructed, it becomes possible to perform genome comparisons from a biochemical/physiological perspective. We believe that such comparisons may yield novel insights into the evolution of metabolic pathways and bear relevance to metabolic engineering of industrial microbes.

We developed a computational method to compare organisms based on whole metabolic pathway analysis. The presence and absence of metabolic pathways in organisms are profiled, and the profiles are utilized for genome comparison. Scoring schemes and an algorithm were developed for evaluating generic profiles which are based on attributes that bear hierarchical relationships. Based on this methodology, phylogenetic trees of completed genomes were constructed. The results provide a perspective on the relationships among organisms that is different from conventional phylogenetic trees based on 16S rRNA.
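
As a rough illustration of profile-based comparison, the sketch below scores pairs of organisms by the Jaccard distance between their sets of present pathways. This is a deliberately flat simplification: the scoring schemes of the abstract account for hierarchical relationships between attributes, which are not described here and are not modelled. The organism labels, the pathway assignments and the function names are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 minus the share of pathways common to both profiles."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(profiles):
    """Pairwise distances between all organisms, keyed by name pair."""
    return {
        (x, y): jaccard_distance(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
    }

# Toy presence profiles: each organism maps to the set of pathways it has.
profiles = {
    "org1": {"glycolysis", "TCA cycle", "nitrogen fixation"},
    "org2": {"glycolysis", "TCA cycle"},
    "org3": {"glycolysis", "methanogenesis"},
}
distances = distance_matrix(profiles)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A distance matrix of this kind could then be passed to any distance-based tree-building method to obtain a profile-based tree for comparison with a 16S rRNA tree.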