The present invention relates to the field of computer science, and more particularly to a feature frequency based method of comparing and cataloguing electronically stored linear information, where the similarity of databases of information is evaluated based on the comparisons of short overlapping fragments of information (features), and to the proper selection of fragment length for comparison of linear information.
Keyword based information comparison methods are commonly used for comparisons of electronic data. However, keyword based similarity between two comparisons fails to capture the syntax and context in which those keyword usage similarities occur. Using human language as an example, it is possible for two articles to be written on the same subject to contain similar word frequencies if written by two different authors. There may be a common vocabulary, jargon or lingo associated with that particular topic of subject matter. However, it is highly unlikely for two separate authors two possess the same linguistic style and syntax unless a significant portion of one author borrowed material from the author. To compare documents at the level of distinguishing one author from another, differences in style and syntax must be captured.
Several methods exist for relating texts to each other using a keyword vector based system including Caid et al., in U.S. Pat. No. 5,619,709; Weissman et al, in U.S. Pat. No. 6,816,857, and Kasian et al., in publication number WO 2006/133050A2. The present methods differ from the above in the manner in which the linear information is preprocessed and in the manner in which the optimal feature lengths are found. This results in the novel ability to capture the syntactical idiosyncrasies of specific authors as well as the unique vocabularies associated with a certain genre or subject matter, and the ability to compare entire works or digitized data.
The present invention provides for a method of text comparison which categorizes texts by syntax and subject matter from the text contents itself, without any additional supervision. The classification itself is based upon word usage frequency, as well as the syntax (or ordering) of words within the text body.
Thus, the present invention provides a computer implemented method for comparing digital linear information. In an exemplary embodiment, the method includes (a) providing a database of information to be compared, where the data is stored in individual objects in a linear form, (b) preprocessing of each object by removing all punctuation and delimiting characters if needed, (c) reducing each object to a linear string of data, (d) applying a sliding window to the string of data to extract and count features of a given length, (e) assembling feature counts within the string of data in a feature frequency profile (FFP), (f) determining the best length feature for comparison, such as by using entropy information contained within the string, from a vocabulary feature profile and/or from a cumulative relative entropy profile, (g) comparing the FFPs of optimal length features using a distance metric, (h) assembling distances between objects into a symmetric pair-wise distance matrix, and (i) visualizing the distance matrix.
The method can be applied to comparing very large genomic sequences. When applied to a biological or genomic sequence, the method further includes the conversion of the sequence to a reduced two letter alphabet for comparison. The method can further include filtering features of low complexity, high frequency and reverse complement matching.
The present invention also provides a system for implementing the method on a computing device for large genome data or large text database. In an exemplary embodiment, the computing device is a multiprocessor computing device. In another embodiment, the vocabulary feature profile and cumulative relative entropy profile are used to determine the best lengths for feature comparison.
The present invention also provides a system for comparing digital linear information. In an exemplary embodiment, the system includes (a) a storage including a set of data items to be related, where each data item includes a plurality of terms, (b) a frame generator conFIG.d to generate a frame that selects a plurality of terms in the data items to associate, (c) a profile generator conFIG.d to generate feature frequency profiles to extract and count features of a given length within the frame, where the profile generator further includes instructions for determination of the best length feature for comparison, (d) a distance processor conFIG.d to compare the optimal best length features and assemble the distances between the objects into a pair-wise distance matrix, and (e) a visualization module to visualize the distance matrix. The system can be housed and implemented on a computing device, such as a multiprocessor computing device.
The presently described method is able to distinguish between different literature genres, subject matter and between different authors writing styles. The information that is compared between texts is essentially relative word frequency, where each word can actually represent one or more space-delineated words and short-range syntax (adjacent words) from the original texts. Although, the present examples have used human language texts specifically in English, comparisons can be made with this method between texts of any language. Also the comparison method described is not limited to the comparison of books alone. In a generalized form the method can be applied to any string of information. For example the text strings may be biological sequence information such as DNA sequences, entire genome sequences, or amino acid sequences. The method can also be applied to other linear digitized information such as visual and audio information.
The foregoing aspects and others will be readily appreciated by the skilled artisan from the following description of illustrative embodiments when read in conjunction with the accompanying drawings.
One embodiment of the invention is a computational method and system for the comparison and analysis of different objects of information within a database or collection. All objects are compared in a pair-wise fashion so the relative similarity between each object to every other object in the collection is known. The present invention also employs a vector based system and a distance measure to compare information. The present invention differs from the above in the manner in which the linear information is preprocessed and in the manner in which the optimal feature lengths are found. First human language texts are preprocessed by removing all punctuation marks and word delimiters (spaces). Entire texts are reduced to one long concatenated string. Next a short window of a given length is slid over the entire string and the frequencies of concatenated word fragments are counted. In the case of written texts these concatenated word fragments are able to capture the syntactical idiosyncrasies of specific authors as well as the unique vocabularies associated with a certain genre or subject matter.
In the present method, human language electronic texts are first pre-processed by transforming all letters to lowercase, removing spaces and punctuation, thereby removing the boundaries of words from the text. In one embodiment, each object in the database is preprocessed to remove word and punctuation boundaries. The data is preprocessed in such a way as to reduce the entire text to a long string of characters without the presence of any word or sentence delimiting boundaries. The frequencies of word fragments, or features, are used to reduce each item to a vector form which are herein referred as feature frequency profiles (FFP). The form of the information may not necessarily be human readable text but may also be in the form of un-annotated nucleic acid genomic sequence data or protein amino acid sequence. The information may also be in the form of machine readable code. The data must only be computer memory storable in a linear fashion. The nature of the content or information stored in irrelevant.
Then “words” (features represented by strings of alphabets) are counted with a sliding window of a given length of letters. In one embodiment the sliding window is eight letters as is used in the examples. Each text is then represented by a long vector of overlapping feature frequencies. Note, that the features are composed of portions of words and parts of preceding and following words. Thus, ‘short range’ syntax is encoded within the feature frequencies. The collection of frequencies for each l-letter length feature is used to form a word feature frequency profile.
A typical length text will be represented by a vector of several thousand features (“words”), where each feature occurs with a specific frequency. The vector is then length normalized, i.e. each frequency is divided by the total number of words found in the text. This enables texts of different length to be compared. Two texts, once converted to word vector form can be compared using many forms of distance or similarity measures. A collection of word vectors of several texts is stored in a word feature frequency profile. Several texts can be compared to each other in pairwise fashion, creating a distance or similarity matrix. The relative similarity of different texts to each other is visualized using multidimensional scaling, principal component analysis, hierarchical clustering or tree building. From the analysis, one can group similar books, index and catalogue them in a systematic hierarchy using a hierarchical clustering, neighbor joining, UPGMA or any other tree building method.
Each document within the dataset is preprocessed individually 102. For the case of text files a single document is obtained from the dataset 301. See the example of a fragment of text in
In state 103 the best length l for feature comparison must be determined from the processed text strings.
A VF profile can be created from the feature count information of a preprocessed string.
A CRE profile can be constructed from the feature frequency information of the preprocessed string.
where: Cl is the vector of feature counts
The CREP profile is itself based upon the cumulative relative entropy. The relative entropy between two FFPs of length l, Pl and Ql is,
where: K is the number of features in Pl and Ql.
where: fa1a2 . . . al is the frequency of a feature from Fl−1 formed from the letters a1a2 . . . al.
An expected FFP, {circumflex over (F)}l, can be found from Fl−1 and Fl−2. Further, {circumflex over (F)}l and {circumflex over (F)}l−1 can be used to find {circumflex over (F)}l+1, and thus all {circumflex over (F)}l+k up to infinite k can be found by iteratively applying the above relation to find the next longest expected FFP.
In order to measure how close the expected frequency is to the observed frequency for the entire probability distribution, the relative entropy (RE) between {circumflex over (F)}l and Fl is computed. The cumulative relative entropy (CRE) at l is defined as the sum of relative entropy from l to infinity (but in practice one can stop when RE˜0):
The CRE represents the accuracy of predicting FFPs for all lengths greater than or equal to l, given the prior distributions Fl−1 and Fl−2. For example starting at from l=3, F2 and F1 are used to calculate the expected FFP {circumflex over (F)}3−k when k=0 at state 504. The RE entropy is calculated between F3 and {circumflex over (F)}3 . A cumulative total is kept of the relative entropies for all k 505. When the relative entropy equals zero, the running total is stopped 506 and then the CRE for the next l is calculated 507. If a given sequence has zero CRE at feature length l, then the FFPs Fl−1 and Fl−2 have all the information necessary to form longer features. CRE for all l is assembled into a CRE profile 508. See
As shown in
Referring now to
Referring now to
Thus, the present method is able to distinguish between different literature genres, subject matter and between different authors writing styles. The information that is compared between texts is essentially relative word frequency, where each word can actually represent one or more space-delineated words and short-range syntax (adjacent words) from the original texts. In some embodiments, the present system is demonstrated using pieces of English books to demonstrate the process, however it should be noted that present system and methods are not limited to works of literature but could be applied to any form of written material, including but not limited to, computer programming code, or even biological sequence data, such as nucleic acid or genomic sequence data.
Genomic sequences can be analyzed in exactly the same way as human language texts, using the FFP method.
would be translated to the 2 letter R-Y alphabet:
The reduced alphabet is especially useful for comparing large genomes, because it substantially reduces memory allocation requirements. A further reduction can be accomplished by establishing equivalency between the reverse complement and its forward sequence. Both the R-Y alphabet and reverse complement matching were employed in comparisons of primate genomes, and features of length l=24 were used. For large genomes, such as this example, a multi-processor machine is used to calculate and compare FFPs.
The genomes of higher order Eukaryotes contain a large fraction of sequence which is repetitive or of low complexity, most often in non-genic or inter-genic regions. For genomes where the coverage is low or the assembly is incomplete, better results are observed if these sequences are removed. The complexity of a feature, Kf, is determined by comparing its size in bytes, before and after lossless compression.
K
f
=|s−s
compress
−s
header| Equation 7
The complexity of l-mers for a given l is normally distributed, and one can choose only the most complex features, which are generally of low frequency. Features with complexity greater than the average complexity μ(μ=10 for l=24), where μ is the average complexity are used for comparison.
In one embodiment, high frequency features should be disregarded because these frequency values tend to dominate the Jensen-Shannon distance score. The average (μ) and standard deviation (σ) of the count values, cl,i for all genomes are calculated and only those features in the range of 1≦cl,i≦μ+σ a are used for comparison.
All of the features described above may be embodied in software code and executed by a personal or general use computer. In one embodiment, a system comprising a computer having a processor and storage and a computer-implemented process of alignment-free comparison as described herein.
In another embodiment, the system may incorporate a multi-processor computer array to facilitate faster comparison of large databases or especially large data files. And in another embodiment, the system includes a web server that generates and serves pages of a host web site to computing devices of end users. The computing devices may include a variety of other types of devices, such as cellular telephones and Personal Digital Assistants (PDA). Tb web server may he implemented as a single physical server or a collection of physical servers. Certain embodiments may alternatively be embodied in another type of multi-user, interactive system, such as an interactive television system, or an online services network.
In such an embodiment, the web server provides user access to electronic information represented within a database or a collection of databases. An information acquisition processor that runs on, or in association with, the web server provides functionality for users to enter a search query for information they would like to find. In one embodiment, the information represented in the database may include documents, characters, words, images, songs, or videos or any other data that may be stored electronically and analyzed in linear form by the present methods. Many hundreds of thousands or millions of bytes of data may be stored in the database.
For example, in one embodiment, a document or other object in the information database may be retrieved using the information acquisition processor. Each object may be located by, for example, conducting a search for the item via the information acquisition processor, or by selecting the object from a browse tree listing.
In response to a query received by the information acquisition processor, the system sends the query to an FFP generator, which in addition to receiving the query, obtains the data to be queried and compared and generates the feature frequency profiles and the other vector profiles such as the vocabulary feature profile, and the cumulative relative entropy profile. The data can be stored in a database in order to generate the profile information based on the query along with the subsequent vectors and profiles generated. In certain system embodiments, a set limit can be placed on the number of FFPs that are created in order to address the substantially large amounts of relationships that can be created in web space, as discussed above.
The resulting profiles are then sent to the query results processor, which processes the results using the comparison process of the present invention, and optionally creates a visual representation of the distances and relationships found, and sends this data to the information acquisition processor. In one embodiment, the query results processor is a parallel processor or clustered machine which may be required if the data to be queried is very large, e.g., entire genomes or proteomes of several organisms. The results data may then be returned to computing devices that submitted the query via the Internet.
In another embodiment, the system comprising modules to carry out various steps as outlined in the FIG.s. For example, a module to find the optimum feature length having instructions to (a) retrieve a single preprocessed string 201, (b) for l=1 to 20, 202, move through each string with a sliding window of length l 203, (c) count feature frequencies of all features of length l, 204, (d) repeat for next length l 205, (e) create Vocabulary Feature Profile 206, (f) find lmax 207, (g) create cumulative relative entropy profile (CREP) 208, (h) find lCREmin 209, and (i) find optimum l feature length 210.
All of the various embodiments and methods described herein fall within the scope of this invention. As shown by the three examples, book comparison, primate genome comparison, other embodiments that are apparent to those of skill in the art, including embodiments which do not provide all of the benefits and features set forth herein and do not address all of the problems set forth herein, are also within the scope of this invention.
The present whole genome comparison of placental mammals, using feature frequency profiles (FFP), an alignment-free method is further described herein below.
The comparison of two closely related genomes at the base-by-base nucleotide sequence level can be routinely accomplished by traditional sequence alignment. However, as species diverge over time, genomic rearrangements, such as gene transposition, deletion and duplication make sequence alignment impractical. An alignment free method, such as the scheme presented here, can be used to overcome these issues associated with genome comparison. The FFP alignment-free method can compare genomes in their entirety at the nucleotide level in both the genic and non-genic regions. This method divides sequences into overlapping ‘words’ or l-mers of a given length or resolution, l. Then, two genomes are compared based on the relative differences in feature frequencies.
The present invention can be applied to the study of higher-order Eukaryotic phylogeny, especially of the placental mammals (Eutherians). The determination of the root of the placental lineage is a current subject of debate. The NIH funded-Mammalian Genome Project, led by a collaboration of the Broad Institute, Baylor and Washington Universities has sequenced many mammalian genomes to at least 1-2× coverage (each base is represented at least 1-2 times in overlapping sequence fragments). These low coverage genomes are a collection of unassembled contig/supercontig sequences. The whole-sequence alignment-free method is an ideal approach for comparing these fragmentary low-coverage genomes.
Based on previous molecular sequence analysis, the Eutherian mammals can be divided into four primary groups Afrotheria (elephants, manatees, aardvarks, tenrecs), Xenarthra (armadillos, sloths and anteaters), Laurasiatheria (cows, cats, dogs, bats, cetaceans), and Euarchontoglires (archonta-primates+glires-rabbits and rodents). Four contradictory hypotheses have been proposed for Eutherian diversification: Exafroplacentalia (afrotheria diverged first) (Murphy W J, et al. (2001) Science. 294, 2348-2351), Atlantogenata (Xenarthra and Afrotheria are sisters, but their common ancestor diverged first) (Douady C J, et al., (2002) Mol Phylogenet. Evol. 25, 200-209), Epitheria (Xenarthra diverged first) (Kriegs J O, et al., (2006) PLoS Biol 4, e91) and Exrodentiaplacentalia. (rodents diverged first, and rabbits diverged next). (Misawa K, Nei M (2003) J Mol Evol 57,S290-S296).
Results, Genome Size
Block-FFP, and Optimal Resolution
A few preliminary steps are necessary before a set of genomes can be compared. The initial step is to check the sizes of all the genomes, within the comparison set. The range in size from smallest to largest genome should be less than 4 fold and this is the case with our genome set, as shown in
Whole Genome Comparison
In this study, two methods were used for displaying the relatedness among genomes: multi-dimensional scaling (MDS) and tree building. Tree topologies were constructed from distance matrices from the FFP method. Features of length l=24 were used to form FFPs and calculated a Jensen Shannon divergence matrix. The matrix Dl=24, was used to construct a tree with the neighbor joining (
Individual Chromosome Comparison
The present invention can also compare genomes on the individual chromosome level. Only the higher coverage genomes are assembled into chromosomes. The MDS method can display distance information in the form of a two or three dimensional map. MDS can be used for both dimensional reduction and clustering. MDS are used primarily for clustering, which allows us to visually examine the distance relationships between the mammalian genomes. This technique also allows for the examination of clustering without implying any phylogenetic relationship.
Genic and Non-Genic Chromosome Comparisons
The majority of Eukaryotic genomes are composed of non-coding sequence. The present invention allows for the determining whether the non-coding sequence contains evolutionary information, and to what extent similar information is shared between the non-coding and genic sequence. Presumably, comparison of one genome to another using our method would primarily be a comparison of the non-coding sequence, as this is the major element. The present invention allows for determining whether comparisons of genic portions yielded similar genomic distance relationships as non-genic portions. Results of this test were shown in
Discussion and Conclusion
The Eutherian phylogeny observed by using the FFP alignment free method most and atlantogenata divergence hypothesis.
The present invention allows for examining the sequence similarity between chromosomes of several species (i.e. an all chromosome to all chromosome comparison.) Presumably this would recover some information about how specific chromosomes evolve. This was possible for the higher coverage depth genomes, where individual chromosomes have been assembled, not however for the survey genomes. A particularly surprising result was the quality of the clustering observed between chromosomes of a particular species (
Contrast this to the inter-chromosomal similarity of platypus and chicken, both are spread into long bands (
It was observed in the test case that the comparison of genic or non-genic sequences give similar distance information, as shown in
Thus, the alignment free FFP method according to an exemplary embodiment of the present invention is a useful tool for comparing genomes because it provides an unbiased measure of similarity, and is especially useful in the absence of a multiple sequence alignment. A key advantage of this method its ability to compare genic and non-genic sequence with equal utility. Whole genome sequence comparison provides much promise.
Materials and Methods
Data
Twenty-seven genomes were obtained from the public sources below. The following genomes where obtained from NCBI (ftp.ncbi.nlm.nih.gov): Homo sapiens, Mus, musculus (Mouse), Rattus norvegicus (Rat), Gallus gallus (Chicken), Pan troglodytes (Chimp), Ornithorhynchus anatinus (Platypus), Macaca mulatta (Monkey), Monodelphis domestica (Opposum), Equus caballus (Horse), Canis familiaris (Dog), Bos taurus (Cow). The next set of genomes was obtained from the Broad Institute (ftp.broad.mit.edu): Felis catus (Cat), Erinaceus europeaus (Hedgehog), Echinops telfari (Tenrec), Dasypus novemicinctus (Armadillo), Cavia porcellus (Guinea pig), Loxodonta Africana (Elephant), Microcebus murinus (mouse lemur), Myotis lucifugus (Microbat), Ochotona princeps (Pika), Oryctolagus cuniculus (Rabbit), Otolemur garnetti (Bushbaby), Sorex araneus (common shrew), Spermophilis tridecemlineatus (squirrel), Tupaia belangeri (tree shrew). Washington University (genome.wustl.edu): Pongo pygmaeus (orangutan), Callithrix jacchus (marmoset, new world monkey). Genomes from the latter two sources are low coverage sequences (1-2x) and consist of many unassembled contigs.
Methods
Feature Frequency Profiles and JS Divergence
To count the frequencies of each word in the genome, a sliding window of length l is run through the nucleic acid sequence from base position 1 to n−l+1. Several of the genomes are represented by a collection of assembled chromosomes and others are just a collection of unassembled contigs. When counting, l-mers continue over the whole genome, but the sliding window does not span over gaps present from sequencing. The counts are tabulated in the vector Cl,
Cl=<cn, . . . , clK (1)
where K is the number of features. The raw frequency counts are normalized to form a probability distribution vector or frequency profile,
giving the relative abundance of each l-mers. This normalization removes the sequence length dependence.
The distance between two probability vectors Pl and Ql is calculated using the Jensen-Shannon (JS) divergence,
where KL is the Kullback-Leibler divergence,
The JS divergence is a convenient distance measure for our purpose because it is metric and bounded between zero and one. In practice the JS divergence information is in the form of a distance matrix, D, containing pair-wise distances between genomes. For a set of n genomes, an n by n divergence matrix, Dl is formed by computing pairwise JS divergences using a specific feature length l.
Filtered l-mer Sets and the Purine-Pyrimidine Alphabet
The use of all l-mers for especially long sequences has some memory allocation limitations. The Cl count vector is implemented as a hash table data structure. Two forms of filtering were employed to overcome this limitation: 1) a reduced alphabet and 2) reverse/forward complement matching.
1) The most effective form of l-mer reduction is to assume that some words may be equivalent to each other, and indeed this is a logical manner in which to proceed, because sequence evolution is tolerant of many kinds of sequence substitutions. The bases A and G (both purine bases), and C and T (pyrimidines) can form two equivalent bases U and Y. Thus the 4 base alphabet sequence:
In the 2 letter alphabet, would be translated to:
The reduced alphabet is especially usefully for comparing large genomes, because it substantially reduces the number of 1-mers which must be stored in the Cl frequency hash table.
2) Another step was taken to reduce the l-mer set size by allowing the reverse complement to match the forward sequence. The above forward sequence would be equivalent to its reverse complement:
Only the forward or reverse complement need be stored and counted, thus reducing the hash table size by roughly half. For an even length l, the number of features:
Removal of Highly Repetitive and Low Complexity Words
The genomes of higher order Eukaryotes contain a large fraction of sequence which is repetitive or of low complexity, most often in non-coding or inter-genic regions of the chromosomes. For genomes where the coverage is low or the assembly is incomplete, it has been observed better results if these sequences are removed. Naturally, these regions are typically the most difficult to assemble in any sequencing project, and as a result are the least complete. The complexity of a feature, Kc is determined by comparing the size in bytes of the feature, before, s, and after lossless compression, scompress.
K
c
=|s−s
compress (6)
The compression is implemented simply using the gzip utility (with the—best argument). The complexity of features for a given l is normally distributed, and one can then choose only the most complex words. In the primate whole genome example, words of high complexity lying outside of μ+σ (μ=10,σ=3) were used.
Also, high frequency words are removed from the set of l-mers because these frequency values tend to dominate the Jensen-Shannon divergence score. The average and deviation of the count values in Cl for all genomes was calculated in the primate examples and only those features which occurred less than μ+σ times were chosen. In this case, the distribution of word frequencies is extremely skewed to high frequencies, with the majority of 2-letter words occurring between 100-300 times.
where nj is the length of the larger of the two genomes, i and 4 is the base alphabet size. Note, for comparison of large chromosomes, a simplified 2-letter alphabet was used for computational expediency. When sequences a and b are compared, each sequence is divided into m length blocks.
In this case Dl is the distance between each block-to-block comparison, and the probability distribution, Fl, is for each m-length block.
Genic, Non-Genic and Exonic Chromosomes
To compare tree topologies between non-genic, genic, and exonic genome sequences, the present invention partitioned human chromosomes 1 through 4 using the following method. First, the Genebank record (.gbk) for each chromosome was obtained. Those regions of the chromosome annotated as genes were concatenated together, with a special (‘X’) character separating the two sequences. This special character serves as stop marker to prevent l-mers overlapping gene segments. Note a large number of these genes are predicted. These genes form the genic chromosome. All regions in the chromosome assembly, not highlighted as genes in the .gbk record are concatenated together forming the non-genic chromosome. Finally all exons from the .gbk record are placed together forming the exonic chromosome. The various partitioned chromosomes were compared and their topological relationships are shown with a neighbor joining tree, as shown in
Jackknife Validation Tests with FFP
A jackknife validation test was used to access the robustness of the whole genome tree topology. All features for l=24, after filtering (˜2.5×107 features) were randomly divided into 100 different subsets. Each subset of features is used to form an individually normalized feature frequency profile, a divergence matrix D is calculated for each subset of features, and then a neighbor joining tree is constructed. A consensus tree was then built from the forest of trees using CONSENSE (Felsenstein, J. (1989) Phylogeny Inference Package (Version 3.2) Cladistics, 5, 164-166) and extended majority rule. The support values are indicated in the internal nodes of
Clustering and Tree Building
The distance relationships between sequences, Dl, were examined using 2 methods: (a) (
Conclusion
A whole genome comparison, including both the genic and non-genic sequence, is representative of the whole genome divergence, which may reflect the divergence of organism better than methods based on selected genes. The latter are subject to sampling effects which can lead to biased results supporting a specific gene phylogeny rather than organism phylogeny. From the FFP comparison, the trees reconstructed using FFP are bush-like—which is consistent with the hypothesis of a rapid mammalian radiation.
Non-coding (non-exonic) sequences such as inter-genic regions and introns contain evolutionary phylogenic signal. The signal from non-genic (the whole genome minus the exonic, intronic, and regulatory regions) sequences of mammals on a whole genome scale is very similar to the evolutionary signal present in exonic and genic regions.
The current FFP method has the ability to analyze rare genomic changes such as indels and retroposon insertions. These events constitute a significant portion of the evolutionary signal present in mammalian genomes.
The phylogenies of the whole genomes and each of the component classes have similar topologies and all agree well with the evolutionary phylogeny based on a recent large dataset, multi-species, multi-gene based alignment. Additionally, the FFP genome comparison methods can account for rare genomic changes, such as indels and retroposon insertions, in calculating genome-scale distances.
Phylogenetic and taxonomic studies of viruses have become increasingly important as more and more whole viral genomes are sequenced (1-4). Knowledge of viral taxonomy and phylogeny is not only useful for understanding the diversity and evolution of viruses not only within a viral family, but also among different viral families that may have a common origin (5). They also provide useful information in drug design against virally induced diseases (6).
One of the unusual aspects of viral genomes is that they exhibit high sequence divergence due to high mutation rate, genetic recombination, re-assortment, horizontal gene transfer (HGT), gene duplication, and gene gain/loss (7, 8). A direct consequence of the high sequence divergence and relatively small number of genes in viruses is that the number of highly conserved genes among different viral families is very small or, sometimes, undetectable. For example, the relationship among different families of eukaryote large DNA viruses (LDV) has often been studied based on multiple sequence alignment of a single gene, the DNA polymerase gene (9). Whether this single-gene based analysis can be used to properly infer viral species phylogeny is debatable.
Due to this and other limitations of multiple sequence alignment comparison of one or a few selected viral genes, there has been a growing interest in alignment-free methods for whole genome comparison and phylogenomic studies. For example, alignment-free approaches have been used in the reconstruction of virus genome trees for a few individual virus families and multiple virus families: The composition vector method was used to construct a genome tree for large dsDNA viruses. The average common substring approach was used for phylogenomic analysis of the reverse-transcribing viruses and the negative-sense ssRNA viruses; and tetranucleotide usage patterns were found useful for inferring phylogenomic relationship among bacteriophages and eukaryotic viruses. Besides genome trees, self-organizing maps have also been used to understand the grouping of viruses.
In the previous alignment-free phylogenomic studies using l-mer profiles, three important issues were not properly addressed: (a) the selection of the feature length, l, appears to be without logical basis; (b) no statistical assessment of the tree branching support was provided; (c) the effect of HGT on phylogenomic relationship was not considered. HGT in LDVs has been documented by alignment-based methods, but these studies have mostly searched for HGT from host to a single family of viruses, and there has not been a study of inter viral family HGT among LDVs.
To address these issues, the present invention provides an alignment-free method using feature frequency profiles (FFPs). The present invention uses the FFP method, supplemented by a novel HGT detection technique, to study the taxonomic grouping and phylogenomic relationship among subfamilies within each family, as well as phylogenomic relationship among 11 families of eukaryote large dsDNA viruses, including 4 dsDNA insect viruses which have not yet been assigned to any virus family by the International Committee on the Taxonomy of Viruses (ICTV). Altogether, 142 complete LDV proteomes from NCBI's non-redundant RefSeq database (Pruitt K D, Tatusova T, Maglott D R (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35:D61-65) were analyzed.
Results on the whole proteome tree reconstruction are presented, including the choice of optimal feature length, and the identification of inter viral family HGT genes. In order to increase the sensitivity of the FFP method, two filtering schemes were applied: the filtering of HGT candidate genes and the filtering of low-complexity features. Next, the overall features of the LDV proteome tree and grouping of individual viral families are described showing possible the evolutionary relationship among the individual families of the LDV, and the differences between the FFP phylogeny and existing alignment-based phylogenies of several individual viral families. Finally, the FFP tree is compared to a previous published alignment-free analysis.
Optimal Feature Length
When whole genomes/proteomes are compared using l-mer FFP, different feature (l-mer) lengths can lead to different tree topologies. Thus, determining the optimal feature length is critical for obtaining a reliable tree topology. Based on both cumulative relative entropy (CRE) and relative sequence divergence (RSD) analyses, the optimal feature length for LDV proteomes is determined to be 8 amino acids long (see Materials and Methods). This estimate depends on the range of proteome size compared and the sequence divergence properties of the viruses, as shown in
Horizontal Rene Transfer Between Viral Families
The present invention uses the Jensen-Shannon divergence (JS) (Lin J (1991) Divergence measures based on the Shannon entropy. IEEE T Inform Theory 37:145-151.) of pair-wise FFPs to estimate the dissimilarity of 2 proteomes. JS provides a summary statistic of given FFP pairs (see Materials and Methods), and to a first approximation, is a measure of the fraction of the common features between two genomes. Thus, JS can be dominated by one or more unusually similar genes as they may contribute the most number of shared features, and this can distort the tree topology. For viruses from different families, such genes can be considered as candidates for inter-family HGT and should be removed before constructing FFPs. The inter-family gene transfer may be the result of a direct viral gene transfer between two viruses while co-infecting the same host, or when two viruses capture the same cellular gene from their phylogenetically related hosts in two separate events. In either case, it was assumed that HGT event occurred more recently compared to viral speciation, thus, the HGT genes have much higher sequence similarity compared to the rest of common genes between two compared viral families.
With our criteria for inter-family HGT detection, as shown in Material and Methods and
Low Complexity Feature Filtering
Low complexity features are those 8-mers consisting of one or very few types of amino acids. They generally bear no or little phylogenetic signal and may lead to misleading phylogeny if not removed in the proteome tree reconstruction. For the LDV proteomes, 8-mers with K2<1.1 are filtered out (see Materials and Method).
The FFP Proteome Tree of LDV Superfamily
After deleting the HGT candidate proteins and filtering out the low complexity features, the whole proteome FFP tree is obtained for feature length 8 (
Relationship Among LDV Viral Families
A potential evolutionary relationship between families is also observed: The two families of iridovirus and ascovirus form a monophyletic group with high statistical support, in support of a gene-alignment based study (27); Nudivirus family is related to baculovirus family closer than to any other families of LDVs with a moderate support by the modified bootstrap method (see Phylgenetic tree building and robustness test in Materials and Methods); asfarvirus family is closer to poxvirus family than to any other families of LDVs with a relatively weak support. Finally, the above-mentioned six viral families form a large monophyletic group with moderate statistical support. It was also noticed that three herpes virus families (herpesviridae, alloherpesviridae and malacoherpesviridae) are not related phylgenetically (see Herpesviridae below).
In the following, the FFP phylogeny of individual viral families is compared to those based on alignment-based method.
Baculoviridae
The grouping of baculoviruses in the FFP tree, as shown in the outer ring of
Herpesviridae
Herpesviruses are morphologically distinct from other viruses and they divide into 3 families under the recently established order Herpesvirales (Davison A J, et al. (2009) The order Herpesvirales. Arch Viral 154:171-177; McGeoch D J, Rixon F J, Davison A J (2006) Topics in herpesvirus genomics and evolution. Virus Res 117:90-104), namely Herpesviridae, Alloherpesviridae and Malacoherpesviridae. In the FFP tree, each family forms a clade, but the 3 families do not cluster to form a monophyletic group, indicating a lack of inter-family phylogenetic relationship at the genome sequence level despite morphological similarities. The Herpesviridae clade further divides into 3 monophyletic subgroups. as shown in the middle ring of
At the genus level, all except the Rhadinovirus genus, as shown in inner ring of
Phycodnaviridae and Mimiviridae
There are nine phycodnaviruses and one mimivirus with complete proteomes in our data set. Each multi-member genus forms its own clade with high branch support. The recently sequenced marine green algae virus OtV5 (Derelle E, et al. (2008) Life-cycle and genome of OtV5, a large DNA virus of the pelagic marine unicellular green alga Ostreococcus tauri. PLoS ONE 3:e2250) is not yet included in the ICTV 2008 Official Taxonomy, though sequence comparison of the DNA polymerase gene suggested that it belong to the genus Prasinovirus. In the FFP tree, OtV5 is positioned next to the Chlorovirus genus, as is also the case with the DNA polymerase-based analysis.
The 9 phycodnaviruses do not form a monophyletic group in the FFP tree, because mimivirus, as shown in
Poxviridae
The grouping of poxviruses in the proteome tree is consistent with the ICTV classification. The highly supported poxvirus clade falls into 2 monophyletic groups corresponding to the entomopoxvirinae and chordopoxvirinae subfamilies, and the latter further divides into three monophyletic groups associated with reptilian, avian and mammalian hosts, respectively. Each genus forms a clade in the FFP tree, and the branching order of different genera mostly agrees with an analysis based on alignment of a core set of 35 genes shared in the chordopoxvirinae subfamily (
In the FFP tree, the unclassified crocodile poxvirus is the outgroup of the chordopoxvirinae clade and positioned next to the Avipoxvirus genus. This suggests that the crocodile poxvirus could be assigned to a new genus within the chordopoxvirinae subfamily.
Other Viruses
There are four insect viruses that are not assigned to any viral family. Two of them (HzNV1 and GbNV) are nudiviruses and they form a clade and related closest to the baculovirus family in the FFP tree, consistent with an analysis based on alignment of the DNA polymerase gene. The other two insect viruses causing salivary gland hypertrophy (MdSGHV and GpSGHV) form a clade with high support in the FFP tree, corroborating a recent finding that the two are related and form a distinct clade based on analysis of gene trees. They are related closest to WSSV among LDV families. The FFP tree also suggests that the two nudiviruses and the two SGHVs be separately assigned to two new viral families.
Comparison with Another Alijnment-Free Method
In a previous report on the reconstruction of the whole proteome phylogeny of large dsDNA viruses (Gao L, Qi J (2007) Whole genome molecular phylogeny of large dsDNA viruses using composition vector method. BMC Evol Biol 7:41), the authors used an l-mer-based composition vector (CV) method with subtracted background ‘noise’ modeled by a Markov Chain estimator. Notable differences between the FFP tree and the CV tree are, 1) the CV tree was based on l-mers of length 5, but it was shown that the optimal feature length should be 8; 2) the CV tree did not explicitly deal with HGT among LDV families; 3) the authors did not provide statistical assessment of branch support in the CV tree; 4) neither baculoviruses nor iridoviruses are monophyletic in the CV tree; 5) the phycodnaviruses do not form a monophyletic group, with or without the mimivirus in the CV tree; and 6) ascoviruses were not included in the CV tree, which could further distort the CV tree topology due to the extensive HGT between ascovirus and baculovirus.
FFP Method vs Multiple Sequence Alignment (MSA) Method
MSA method has to select a small set of highly conserved genes for alignment, and has to assume that phylogeny of those selected genes represents the phylogeny of species. Thus, MSA method can be applied only within individual families or for families that are closely related, and the genes selected for MSA truly represent each species of the families. Therefore, MSA method can not be used for comparing a large population of diverse multiple families of LDVs. For inferring phylogeny of diverse families, FFP method has at least three advantages: (a) the whole genome/proteome is used to represent each species; (b) it does not require selection highly conserved genes common to all diverse families compared; and (c) it is not very sensitive to large-scale genome rearrangement and other changes including gene gain and loss.
Two points of interest about the optimal feature length for comparing viral whole proteomes—(a) On random statistical basis, one would predict that the probability of finding a unique 8-amino acid sequence in proteins is one in 208, thus, each protein family in our collection of all proteins in LDVs has a unique 8-mer. It follows then that, if there is a common 8-mer between two proteomes, there must be a homologous protein in both viral species. This is true for two members in a virus species or two closely related viral species. For example, after excluding HGT genes and low complexity features, 95% of protein-pair from two closely related viruses (same family but different genera) that share the same 8-mer sequence are homologous with a blast E-value of 0.01 or lower. However, the prediction does not hold true for protein-pairs from two viruses of different families: for 50 type species representing all the LDV genera, it was found that only 53% protein-pairs from two viral families are homologous with a blast E-value of 0.01 or lower; (b) Even for two proteomes that share some common 8-mers, the profiles of the frequencies of all 8-mers are not same between the two proteomes. In addition to the advantages of FFP method mentioned in previous paragraph both points described here are important in understanding the clear difference between MSA method and FFP method.
Using the alignment-free FFP method of the present invention, the molecular phylogeny and horizontal gene transfer (HGT) between families for a broad population of large dsDNA eukaryote viruses consisting of eleven viral families were studied. The unique aspects of this study include: 1) the selection of feature length of l-mer optimal for phylogeny inference of the population; 2) a modified bootstrap support analysis of the branching orders in the FFP tree; and 3) identification of inter-family HGT candidate genes and exclusion of the genes from the FFP tree reconstruction. The analysis of the FFP tree for the broad population of LDVs suggests that the method is suitable for grouping diverse families of virues, subgrouping within individual families, finding possible evolutionary relationship among the families, and assigning “unclassified” species, even when there are no or few common genes among the broad population.
Materials and Methods
Dataset
The viral sequences were downloaded from NCBI's REFSEQ database (September 2008 release) (24). Protein sequences for large eukaryote dsDNA viruses are extracted from the .faa file. Polydnaviruses are excluded from consideration because they hardly share any common genes with other virus familiesThe final dataset of 142 LDVs consists of 11 viral families and 4 insect viruses unassigned to any family. The list of viruses is included as supplementary material.
Feature Frequency Profile (FFP) and Distance Matrix
The feature frequency profile of a given sequence is obtained by counting all overlapping features of length l by sliding a window of width l along the sequence, advancing one letter at a time. The FFP of a proteome is the total sum of the FFPs for each protein sequence contained therein. The present invention uses the normalized FFP, i.e. the probability of occurrence of each word in a proteome. The dissimilarity between two FFPs can be estimated from the Jensen-Shannon divergence (JS) (25). For two probability distributions P=(p1, p2, . . . ) and Q=(q1, q2, . . . ), JS is given by
where the KL(P,Q) is the Kullback-Leibler divergence (42) or relative entropy:
and the summation is over all features. Note that JS is bounded between 0 and 1. Strictly speaking, JS is not a distance metric, as it does not satisfy the triangle inequality relationship. However, this violation happens only for short feature lengths and is of no concern to us. For a given feature length l, the distance matrix for a collection of proteomes is constructed from all pair-wise JSs.
Relative Sequence Divergence (RSD), Cumulative Relative Entropy (CRE), and Optimal Feature Length
Since different feature lengths can lead to different tree topologies, it is important to know the optimal feature length for inferring reliable trees. Two methods exist for estimating the optimal feature length. The first is related to information theory and makes use of cumulative relative entropy (CRE) of individual proteomes. By contrast, the second method estimates the relative sequence divergence (RSD) of a proteome relative to a random sequence of the same size by comparing their relatedness (in terms of FFP) to a group of proteomes. Both methods give the same estimate for optimal feature length for the population of viruses in LDV superfamily.
CRE
This method estimates the minimal feature length for which the information content of a proteome can be approximated by its FFP. This is done by requiring the CRE between the FFP of a proteome and that of a Markov chain estimator to be small. Under a Markov chain model of order l−2, the expected l-mer frequencies of a sequence or proteome is given by frequencies of features of lengths l−1 and l−2 as follows,
where f denotes observed feature-frequencies of a proteome, ai denotes amino acid type at position i of a feature. The difference between the estimated and observed l-mer frequencies can be measured by the relative entropy KL({tilde over (P)}l,Pl), where {tilde over (P)}l and Pl are estimated and observed probability vectors of l-mers respectively. This difference as a function of feature length exhibits a peak, whose position can be estimated using random sequences (zero order Markov chains) and is well approximated by
lpeak≅log20N +1 (4)
where the base 20 is the number of amino acid types and Nis the proteome size.
A monotonically decreasing function can be constructed for the cumulative relative entropy (CRE),
The minimal feature length at which CRE(l) approaches zero can be used iteratively to infer approximate frequencies of increasingly longer features, and is defined as the optimal feature length for phylogeny inference. For a group of divergent sequences like LDVs, this is approximately given by
lCRE˜2log20N (6)
where N denotes the largest proteome size. For LDVs, the largest proteomes (i e mimivirus and phycodnaviruses) give lCRE≈8. This estimate is confirmed in
RSD
This method requires that on average, a biological sequence shares more features than a random sequence of the same length with a group of bio-sequences. For a group of n related biological sequences, the relative sequence divergence (RSD) for a biological sequence si with i=l . . . n can be defined as
where c(si,sj) denotes the number of common words between sequences si and sj. ri denotes a random sequence of zero order Markov chain with the same length as si. For short feature lengths (l<lpeak), nearly all possible features are used by both the random sequence and viral proteomes, and the RSD is around one. For longer feature lengths (l>lpeak), the feature space is sparsely sampled, with all the viral proteomes sampling one region and the random sequence a different region. As feature length increases, the overlap in feature space between the viral proteomes and random sequence becomes smaller and the RSD decreases to zero. Optimal feature length for phylogeny inference is obtained when RSD becomes much smaller than one.
In
Inter-Family HGT Candidates
HGT between viral families can cause some distortion of the tree topology, because JS can be biased by the few highly similar genes shared between two viruses as measured by the number of common 8-mers. For LDV proteomes at the optimal feature length l=8, the distribution of common 8-mers in a protein pair is illustrated in
To see the effect of using different HGT cutoffs (i.e. number of shared 8-mers) on LDV phylogeny, tree topologies with cutoffs ranging from 6 to 40 were compared. It was observed that the tree topology remains stable for a HGT cutoff in the range 13-31 (data not shown). For this work, a conservative HGT cutoff of 20 was used, and identified 164 HGT instances consisting of 8 genes (Suppl. Table S1). One hundred sixty four proteins coded by 8 HGT genes are excluded in the construction of the FFP proteome tree.
Filtering Out Low Complexity Features
These features generally bear no or little phylogenetic signal and could distort the tree topology if enough of them are present in the viral proteomes. One measure of feature complexity is the Shannon entropy
where i runs over the 20 amino acid types, ni is the occurrence frequency of amino acid type i in a given feature, and l is the feature length. This and another closely related complexity measure Ki were used to detect and exclude regions of low complexity in amino acid sequences (44) during sequence alignment. For 8-mers, K2 takes on values between 0 and 3, corresponding to using 1 and 8 amino acid types respectively.
The effect of using different low complexity cutoffs on phylogenetic tree reconstruction is noted but not shown. Excluding only the least complex features (i.e. homo 8-mers) causes appreciable change in the tree topology. For K2 between 0 and 1.5, it was observed that the tree topology is most stable for cutoffs 0.9, 1.1 and 1.3. Based on this analysis, 8-mer features with K2<1.1 were filtered out. These features account for 0.3% of the viral proteomes on average, and up to a maximum of 2% for the EhV86 proteome. By way of comparison, for random sequences with equal usage of different amino acid types, the fraction of 8-mers with K2<1.1 is less than 10−5. The compositional types of these low complexity features include A8, AxB8−x (x=1-4), and A6B1C1, where A, B and C denote different amino acid types.
Phylogenetic Tree Reconstruction and Robustness Test
Phylogenetic trees are constructed from distance matrices using BIONJ (Gascuel O (1997) BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 14:685-695). Robustness of the tree topology is estimated using a modified version of the bootstrap method (elsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791) which works as follows. A table is first constructed with each row representing one viral proteome and each column representing one feature present in a viral proteome. Each table element indicates the feature frequency in a proteome (zero if absent). The bootstrap is applied to the columns of the table except that columns that are re-drawn are treated as drawn only once (i.e. each column is either present or absent in the bootstrapped table). Thus, the re-sampled table has fewer columns but each feature maintains the same frequency as in the original table. This procedure is equivalent to a jackknife test deleting 1/e (i.e. 37%) of the features. A new distance matrix is then calculated for the re-sampled table. 200 replicates were used to estimate the branch support for the un-bootstrapped tree. For the LDV dataset, a significant proportion of the features are unique to only one proteome, and the resampling is expected to underestimate the branch support. This and other factors (Alfaro M E, Zoller S, Lutzoni F (2003) Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol Biol Evol 20:255-266.) were taken into consideration when making phylogenetic inferences. Trees are drawn using iTOL (Letunic I, Bork P (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23:127-128).
The foregoing description and Examples detail certain preferred embodiments and describes the best mode contemplated by the inventors. It will be appreciated, however, that no matter how detailed the foregoing may appear in text, the present embodiments may be practiced in many ways and the present embodiments should be construed in accordance with the appended claims and any equivalents thereof.
The term “comprising” is intended herein to be open-ended, including not only the recited elements, but further encompassing any additional elements. The present examples, methods, and procedures are meant to exemplify and illustrate the invention and should in no way be seen as limiting the scope of the invention. Any patents and publications mentioned in this specification are indicative of levels of those skilled in the art to which the invention pertains and are hereby incorporated by reference to the same extent as if each was specifically and individually incorporated by reference
This application is the national phase application of International application number PCT/US2009/060268, filed Oct. 9, 2009, which claims priority to and the benefit of U.S. Provisional Application No. 61/104,646, filed on Oct. 10, 2008, both of which are hereby incorporated by reference in their entirety.
This invention was made with government support under Contract No. DE-AC02-05CH11231 awarded by the U.S. Department of Energy and under National Institutes of Health Grant No. 3P50GM062412-0552. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US09/60268 | 10/9/2009 | WO | 00 | 4/8/2011 |
Number | Date | Country | |
---|---|---|---|
61104646 | Oct 2008 | US |