The present invention relates to a method for predicting a gene cluster including secondary metabolism-related genes from among gene clusters composed of a plurality of genes, a prediction program, and a prediction device.
Secondary metabolites have a high likelihood of being biologically active, and they are very useful as lead compounds for pharmaceuticals. There are a wide variety of secondary metabolites, and they are found in various organism species, such as actinomycetes, fungi, and plants. However, such secondary metabolites are expressed only under special conditions that may not yet have been revealed, and much about them remains unknown. Thus, it is believed that many secondary metabolites having useful properties remain undiscovered. Even if such secondary metabolites were to be discovered, it would be difficult to stably produce sufficient amounts thereof. Accordingly, problems arise when the use of such secondary metabolites is intended.
Along with innovative progress in DNA sequencing techniques in recent years, genomic information of various organism species (microorganisms, in particular) is accumulating at an accelerated rate. Accordingly, it is certain that genomic nucleotide sequences of several thousand or more types of microorganisms will be determined within a period of several years. Organisms whose genomic information remains unknown may be subjected to the aforementioned DNA sequencing techniques, so that genomic information thereof can be acquired rapidly in a cost-effective manner. Because of the accumulation of genomic information and convenience of genomic information analysis, comparative genomic analysis, such as whole-genome analysis and synteny analysis, becomes applicable to a wide variety of organism species.
With the use of databases constructed by accumulating detailed and vast amounts of genome information and information concerning the structures of secondary metabolites, diversity thereof, or the distribution thereof in the living world, accordingly, discovery of useful unknown secondary metabolites and identification of genes involved in biosynthesis of secondary metabolites (i.e., secondary metabolism-related genes) can be expected. However, it has been difficult to identify the secondary metabolism-related genes with high accuracy with the use of currently available comparative genome analysis techniques for the following reasons. That is, the distribution of secondary metabolism-related genes is often contradictory to phylogenetic trees of genera and species, and there are numerous genes whose functions remain unknown.
In the past, secondary metabolism-related genes had been analyzed on the basis of detection of known genes with high sequence homology (i.e., core genes), such as polyketide synthase (PKS) genes or nonribosomal peptide synthetase (NRPS) genes, and prediction of a cluster including genes associated therewith. Specific examples include SMURF described in “Khaldi Nora; Seifuddin Fayaz T.; Turner Geoff; et al., SMURF: Genomic mapping of fungal secondary metabolite clusters, FUNGAL GENETICS AND BIOLOGY, 47, 9, 736-741, 2010”, antiSMASH described in “Medema Marnix H.; Blin Kai; Cimermancic Peter et al., antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences, NUCLEIC ACIDS RESEARCH, 39, 339-346, 2011”, CLUSEAN described in “Weber T.; Rausch C.; Lopez P.; et al., CLUSEAN: A computer-based framework for the automated analysis of bacterial secondary metabolite biosynthetic gene clusters, JOURNAL OF BIOTECHNOLOGY, 140, 1-2, 13-17, 2009”, and ClustScan described in “Starcevic Antonio; Zucko Jurica; Simunkovic Jurica; et al., ClustScan: An integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures, NUCLEIC ACIDS RESEARCH, 36, 21, 6882-6892, 2008”.
However, clusters detected by such techniques are limited to secondary metabolic gene clusters including core genes, which are only a part of all clusters including secondary metabolism-related genes. In other words, it was impossible according to the aforementioned techniques to predict secondary metabolic gene clusters that do not include core genes, which possibly account for a half or more of all clusters.
Under the above circumstances, objects of the present invention are to provide a method that can predict a gene cluster including secondary metabolism-related genes with high accuracy, independent of the information concerning core genes, a prediction program, and a prediction device.
The present invention, which has attained the objects described above, includes the following.
(1) A method for predicting a gene cluster including secondary metabolism-related genes comprising:
a step of subjecting genes included in nucleotide sequence information of at least a pair of genomes to homology search mutually to identify homologous gene combinations in the nucleotide sequence information of the genomes and orthologous gene combinations in the homologous gene combinations;
a step of identifying a region the gene arrangement of which is conserved in the nucleotide sequence information of the other genomes as a gene cluster on the basis of the results of homology search; and
a step of identifying a synteny-like region in the gene cluster identified in the previous step on the basis of the presence of orthologous genes determined as a result of homology search and evaluating whether or not the gene cluster includes secondary metabolism-related genes on the basis of the rate of the synteny-like region in the gene cluster.
(2) The method of prediction according to (1), wherein the gene cluster is evaluated to include secondary metabolism-related genes when the rate of the genes included in the synteny-like region relative to the genes included in the whole gene cluster is not more than a given level.
(3) The method of prediction according to (2), wherein the given level is 25%.
(4) The method of prediction according to (1), wherein the synteny-like region includes at least two orthologous genes and the distance between neighboring orthologous genes is within a given distance in the nucleotide sequence information of genomes and in the nucleotide sequence information of the other genomes.
(5) The method of prediction according to (4), wherein the given distance is 10 kb to 30 kb.
(6) The method of prediction according to (1), wherein a synteny region and a non-synteny region are determined in advance using nucleotide sequence information of one of at least a pair of genomes subjected to comparison and nucleotide sequence information of a third genome that is different from the pair of genomes and the determined synteny region is designated as a synteny-like region.
(7) The method of prediction according to (1), wherein the step of gene cluster identification is followed by a step in which the number of homologous genes included in the identified gene cluster and/or the total number of genes included in the identified gene cluster are compared with the predetermined standard values and the step of evaluating whether or not the gene cluster includes secondary metabolism-related genes is carried out with regard to the gene cluster exhibiting the number of homologous genes not less than the standard value and/or the gene cluster exhibiting the total number of genes less than the standard value.
(8) The method of prediction according to (7), wherein the standard value for the number of homologous genes is designated 3 and the standard value for the total number of genes is designated 35.
(9) The method of prediction according to (1), wherein the step of gene cluster identification is followed by a step in which the total number of genes included in the identified gene cluster is compared with the predetermined standard value or a length of the identified gene cluster is compared with the predetermined standard value and the step of evaluating whether or not the gene cluster includes secondary metabolism-related genes is carried out with regard to the gene cluster exhibiting the total number of genes or the length less than the standard value,
wherein, in the step of evaluating whether or not the gene cluster includes secondary metabolism-related genes, genes neighboring the gene cluster to be evaluated are added to modify the gene cluster to comprise the number of genes defined as the standard value and a synteny-like region in the modified gene cluster consisting of the number of genes defined as the standard value is identified.
(10) The method of prediction according to (9), wherein the standard value for the total number of genes is designated 35.
(11) The method of prediction according to (1), wherein the step of gene cluster identification is followed by a step in which the total number of genes included in the identified gene cluster is compared with the predetermined standard value or a length of the identified gene cluster is compared with the predetermined standard value and the step of evaluating whether or not the gene cluster includes secondary metabolism-related genes is carried out with regard to the gene cluster exhibiting the total number of genes or the length less than the standard value,
wherein, in the step of evaluating whether or not the gene cluster includes secondary metabolism-related genes, a given number of genes or a given length of a region is added to modify the gene cluster to be evaluated and a synteny-like region in the modified gene cluster is identified.
(12) The method of prediction according to (1), wherein the step of gene cluster identification comprises starting the trace backing from a cell exhibiting the maximal score in the Smith-Waterman matrix built on the basis of the Smith-Waterman algorithm so as to identify a gene cluster.
(13) The method of prediction according to (12), wherein the step of gene cluster identification comprises assigning a score of 0 into a cell included in the identified gene cluster, subjecting the Smith-Waterman matrix to the trace backing so as to identify another region in which the gene arrangement is conserved, subjecting the identified region to the Smith-Waterman algorithm again so as to identify a region the gene arrangement of which is conserved, and identifying the region as a gene cluster.
(14) The method of prediction according to (1), wherein the step of gene cluster identification is followed by a step in which the total number of genes included in the identified gene cluster is compared with the predetermined standard value or a length of the identified gene cluster is compared with the predetermined standard value and a given number of genes or a given length of a region is added to the gene cluster so as to elongate the gene cluster to the standard size,
positive scores are given to the genes constituting the elongated gene cluster that are homologous to the genes constituting the gene cluster in the nucleotide sequence information of the other genomes to be compared, and negative scores are given to the genes that are not homologous,
scores are successively totaled from the gene located at the center of the gene cluster toward the ends and the genes exhibiting the maximal total scores are identified as the gene cluster boundaries, and
a region between the genes identified as the boundaries is identified as a gene cluster.
(15) The method of prediction according to (14), wherein the predetermined standard value for the total number of genes is designated 15 to 65.
This description includes part or all of the content as disclosed in the description and/or drawings of Japanese Patent Application No. 2012-210044, which is a priority document of the present application.
The present invention enables prediction of a novel cluster including secondary-metabolism-related genes, regardless of the presence or absence of core genes, by application of a technique of nucleotide sequence comparison to an arrangement of genes recognized as a sequence via a comparative genomics method and by distinguishing a region of interest from a simple synteny.
Hereafter, the present invention is described in detail with reference to the drawings.
The method for predicting a gene cluster including secondary metabolism-related genes according to the present invention comprises: a step of using the results of homology search conducted on genes included in at least a pair of genomes to identify a gene cluster on the basis of the arrangement of the compared genomic genes; and a step of determining whether or not the identified gene cluster includes secondary metabolism-related genes (
The term “secondary metabolism-related genes” used herein refers to genes involved in biosynthesis of secondary metabolites. The term “secondary metabolites” refers to metabolites that are not directly associated with vital activity of organisms. When substances synthesized by organisms are collectively referred to as “metabolites,” metabolites are classified as primary metabolites or secondary metabolites. In such a case, secondary metabolites can be metabolites other than primary metabolites. The term “primary metabolites” refers to substances that are directly associated with vital activity of organisms. Examples thereof include sugars, amino acids, lipids, and nucleic acids. That is, “secondary metabolites” may be defined as substances other than sugars, amino acids, lipids, and nucleic acids. Examples of secondary metabolites include antibiotics, alkaloid, terpenoid, flavonoid, polyketide, phenols, glycoside, and special amino acids that do not constitute a protein.
Genes involved in biosynthesis of secondary metabolites encompass genes encoding enzymes associated with assimilation reactions or dissimilation reactions of secondary metabolites, genes encoding proteins associated with translocation and/or accumulation of secondary metabolites, and genes encoding proteins associated with regulation of expression of such genes.
More specific examples of secondary metabolism-related genes include genes involved in biosynthesis of polyketide, nonribosomal peptide, alkaloid, terpenoid, flavonoid, and other compounds that are not classified as primary metabolites. It should be noted that gene clusters predicted by the prediction method according to the present invention do not always include the secondary metabolism-related genes specifically exemplified above and that such gene clusters occasionally include other secondary metabolism-related genes.
According to the method of the present invention, a gene cluster is first identified. The term “gene cluster” used herein refers to a group of a plurality of genes included in a given continuous region; and to a group of a plurality of genes whose arrangements are conserved among a plurality of genomes (e.g., between a pair of genomes). The term “continuous region” may be a region included in the entire genome or a part of the genome constituted by nucleic acids, such as chromosomes and mitochondria. Specifically, the term “gene cluster” refers to a group of a plurality of genes whose arrangements are conserved in a continuous region constituting the entire genome or a part of the genome.
Nucleotide sequence information of at least a pair of genomes is prepared in order to identify a gene cluster. Nucleotide sequence information of genomes is character data representing four types of nucleotides (i.e., adenine, guanine, cytosine, and thymine as A, G, C, and T, respectively). Nucleotide sequence information of genomes is represented starting from the 5′-end toward the 3′-end. Nucleotide sequence information of either or both of a pair of genomes may be obtained from a database storing nucleotide sequence information of various genomes, or such information may be obtained from a known or unknown organism via a DNA sequencing technique. Any of the DNA sequencing techniques described in, for example, Chapter 11 of Molecular Cloning A Laboratory Manual, Fourth Edition (Cold Spring Harbor Laboratory Press) can be employed.
Nucleotide sequence information of genomes may be obtained from any organism species. In other words, the prediction method of the present invention enables prediction of a gene cluster including secondary metabolism-related genes, regardless of organism species. Specific examples of organism species include plants, bacteria, actinomycetes, fungi, filamentous fungi, and mushrooms. In addition, nucleotide sequence information of genomes may be derived from an unknown organism species. For example, the nucleotide sequence of DNA that is extracted directly from the environment, such as from soil, sludge, lake water, or seawater, without culture (that is, so-called environmental DNA) may be determined, and the determined nucleotide sequence may be used as nucleotide sequence information of a genome. According to the prediction method of the present invention, specifically, a gene cluster including secondary metabolism-related genes existing in environmental DNA can be predicted.
In order to identify gene clusters based on nucleotide sequence information of at least a pair of genomes, at the outset, arrangements of a plurality of genes in the pair of genomes are compared on the basis of nucleotide sequence information of the genomes, and regions in which the gene arrangements are conserved are identified.
In order to compare the arrangements of genes, genes included in the nucleotide sequence information of the target pair of genomes are subjected to homology search mutually, and combinations of homologous genes between the nucleotide sequence information of the genomes and combinations of orthologous genes among the combinations of homologous genes are identified. To this end, the amino acid sequences encoded by a plurality of genes included in the nucleotide sequence information of the target pair of genomes are first deduced. The amino acid sequences can be deduced with the use of software for open reading frame analysis. With the use of such software for analysis, open reading frames (ORFs) in the three reading frames of the nucleotide sequence information of genomes represented starting from the 5′ end toward the 3′ end and in the three reading frames of the complementary strands thereof can be identified. In this case, genes in nucleotide sequence information of one genome are designated as xi (i=1, 2, . . . , I), and genes in nucleotide sequence information of the other genome are designated as yj (j=1, 2, . . . , J).
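By way of illustration only, the six reading frames scanned by open reading frame analysis (three on the strand as written and three on the complementary strand) can be enumerated as in the following minimal sketch. The function names are hypothetical and do not correspond to any particular analysis software; actual ORF detection additionally requires start and stop codon handling, which is omitted here.

```python
# Illustrative only: helper names are hypothetical, not from any ORF analysis software.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def six_frames(seq: str) -> dict:
    """Return the three reading frames of each strand, trimmed to whole codons."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            usable = (len(s) - offset) // 3 * 3  # drop a trailing partial codon
            frames[(strand, offset)] = s[offset:offset + usable]
    return frames


# Keys are (strand, offset); values are codon-aligned nucleotide strings.
frames = six_frames("ATGAAATAG")
```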
Subsequently, amino acid sequences of all genes included in nucleotide sequence information of one of the genomes are designated as query sequences, and homology search is carried out using the amino acid sequences of genes included in nucleotide sequence information of the other genome as database sequences. Homology search can be carried out with the use of conventional software for homology analysis, such as Blastp, FASTA, or Clustal. Also, the query sequences are replaced with the database sequences, and homology search is carried out in the same manner as described above.
According to homology search, genes exhibiting high sequence similarity can be identified mutually in the nucleotide sequence information of the pair of genomes. For example, a threshold is determined for a value exhibiting sequence similarity, and a combination of genes exhibiting a value exceeding such threshold can be identified as homologous genes. Among the combinations of genes identified as homologous genes, the combinations of genes satisfying a given standard can be identified as orthologous genes. “Orthologous genes” are defined as homologous genes diverged from a common ancestral gene by speciation.
Examples of values exhibiting sequence similarity include e-values, bit scores, and amino acid identities determined by Blast search. By designating a threshold for one or more such values, accordingly, combinations of homologous genes can be identified. More specifically, the e-value as a threshold can be set at, for example, 1.0e-20, preferably 1.0e-15, and particularly preferably 1.0e-10, in homology search between query sequences and database sequences and in homology search conducted with the use of the query sequences and the database sequences in reverse (such homology searches are collectively referred to as “a set of homology searches”). A combination of genes exhibiting an e-value at or below the threshold as a result of the set of homology searches can be identified as homologous genes from among the nucleotide sequence information of both genomes.
In order to identify orthologous genes from among the homologous genes identified in the manner described above, a standard is set so that a combination of genes satisfying the definition of orthologous genes described above can be selected. Specifically, when a combination of genes is found to be in the top 5, preferably in the top 3, and particularly preferably at the top of the list of a set of genes prepared in descending order of sequence similarity (e.g., the ascending order of the e-value) as a result of the set of homology searches, such combination of genes can be defined as a combination of orthologous genes. From among the combinations of homologous genes identified as a result of the set of homology searches, a combination of orthologous genes can also be identified by a method other than the method described above.
Subsequently, arrangements of genes in nucleotide sequence information of a pair of genomes are compared based on the results of homology search, and regions in which the gene arrangements are conserved are identified. In order to “compare the arrangements of genes in nucleotide sequence information of a pair of genomes,” assuming that a plurality of genes in the nucleotide sequence information of genomes constitute a string of letters in which genes are regarded as letters, an algorithm that searches for strings of letters and compares similarities thereof can be employed.
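The selection of homologous and orthologous gene combinations from a set of homology searches can be sketched as follows. This is a minimal illustration under stated assumptions: each search result is assumed to be available as (query, subject, e-value) tuples, a combination is treated as homologous when it passes the e-value threshold in both directions, and it is treated as orthologous when it additionally ranks within the top `top_n` hits in both directions. The function names and the input format are hypothetical.

```python
from collections import defaultdict


def ranked_hits(hits, evalue_max):
    """Map each query to its subjects in ascending e-value order, keeping hits at or below the threshold."""
    by_query = defaultdict(list)
    for query, subject, evalue in hits:
        if evalue <= evalue_max:
            by_query[query].append((evalue, subject))
    return {q: [s for _, s in sorted(v)] for q, v in by_query.items()}


def homologs_and_orthologs(hits_xy, hits_yx, evalue_max=1.0e-10, top_n=1):
    """Return (homologous pairs, orthologous pairs) as sets of (x, y).

    Assumption: a pair must pass the threshold in both search directions;
    top_n=1 corresponds to the "at the top of the list" case in the text.
    """
    fwd = ranked_hits(hits_xy, evalue_max)
    rev = ranked_hits(hits_yx, evalue_max)
    homologs, orthologs = set(), set()
    for x, subjects in fwd.items():
        for rank, y in enumerate(subjects):
            back = rev.get(y, [])
            if x not in back:
                continue
            homologs.add((x, y))
            if rank < top_n and back.index(x) < top_n:
                orthologs.add((x, y))
    return homologs, orthologs


# Made-up search results for illustration.
hits_xy = [("x1", "y1", 1e-50), ("x1", "y2", 1e-12), ("x2", "y2", 1e-30), ("x3", "y3", 1e-5)]
hits_yx = [("y1", "x1", 1e-48), ("y2", "x2", 1e-29), ("y2", "x1", 1e-11), ("y3", "x3", 1e-5)]
homologs, orthologs = homologs_and_orthologs(hits_xy, hits_yx)
```

In this made-up example, x1 and y2 are retained as homologous but not as orthologous, because each has a better-ranked partner, and x3 and y3 are discarded because their e-value exceeds the threshold.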
Examples of algorithms that can be used in this process include the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, and the k-tuple method for searching strings of letters. The Smith-Waterman algorithm is particularly preferable because it enables a local alignment search to be carried out with high sensitivity.
By employing the Smith-Waterman algorithm, specifically, arrangements of genes in nucleotide sequence information of a pair of genomes can be compared in the manner described below. Genes in the nucleotide sequence information of one of the genomes are designated as xi (i=1, 2, . . . , I), and genes in the nucleotide sequence information of the other genome are designated as yj (j=1, 2, . . . , J). According to the Smith-Waterman algorithm, the (J+1)×(I+1) matrix (two-dimensional) of the genes in the nucleotide sequence information of one of the genomes, xi (i=1, 2, . . . , I), and that of the genes in the nucleotide sequence information of the other genome yj (j=1, 2, . . . , J), are built (
The scores determined in accordance with the procedures shown below are recorded in the cells of the matrix. When homology is observed between xi and yj, specifically, the score is determined in accordance with the formula indicated below.
When no homology is observed, the score is determined in accordance with the formula indicated below.
When all the cells of the matrix are subjected to the scoring described above, the trace backing starts from the cell exhibiting the maximal score toward the cell exhibiting a score of 0. In the cells along the trace backing path, a set of coordinates exhibiting high homology between xi and yj is designated as R0. Gap and mismatch scores are penalty scores, and they are set within the range from approximately −0.4 to −0.1, and both the gap and mismatch scores are preferably −0.2.
R0={(j1,i1),(j2,i2), . . . ,(jn,in)},
provided that
j1≦j2≦ . . . ≦jn, i1≦i2≦ . . . ≦in
R0 is a set of coordinates indicating the pair of highly homologous genes which are located in a region in which gene arrangement is conserved. Specifically, R0 constitutes a gene cluster; that is, a group of a plurality of genes whose arrangements are conserved in the nucleotide sequence information of a pair of genomes. When a plurality of cells exhibit the maximal score in accordance with the matrix: (J+1)×(I+1), a plurality of gene clusters are identified through the process described above.
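The scoring and trace backing described above can be illustrated with the following minimal sketch, in which genes are treated as letters. The scoring formulas themselves are not reproduced in this passage, so the sketch assumes the standard Smith-Waterman recurrence with a match score of +1 (an assumed value) and gap and mismatch scores of −0.2 (the preferred value given above). The function names and the representation of homology as a set of (i, j) index pairs are hypothetical.

```python
def fill_matrix(homolog, I, J, match=1.0, mismatch=-0.2, gap=-0.2):
    """Build the (J+1) x (I+1) score matrix over gene indices.

    homolog: set of 1-based (i, j) pairs meaning x_i and y_j are homologous.
    match=1.0 is an assumed value; the -0.2 penalties follow the text.
    """
    sw = [[0.0] * (I + 1) for _ in range(J + 1)]
    ptr = [[None] * (I + 1) for _ in range(J + 1)]
    for j in range(1, J + 1):
        for i in range(1, I + 1):
            step = match if (i, j) in homolog else mismatch
            candidates = [
                (0.0, None),
                (sw[j - 1][i - 1] + step, (j - 1, i - 1)),
                (sw[j - 1][i] + gap, (j - 1, i)),
                (sw[j][i - 1] + gap, (j, i - 1)),
            ]
            # Keep the best score and remember which cell it came from.
            sw[j][i], ptr[j][i] = max(candidates, key=lambda c: c[0])
    return sw, ptr


def trace_back(sw, ptr, homolog):
    """Trace back from the maximal cell to a zero cell; return R0 as (j, i) coordinates."""
    cells = ((j, i) for j in range(len(sw)) for i in range(len(sw[0])))
    j, i = max(cells, key=lambda c: sw[c[0]][c[1]])
    coords = []
    while ptr[j][i] is not None:
        pj, pi = ptr[j][i]
        if (pj, pi) == (j - 1, i - 1) and (i, j) in homolog:
            coords.append((j, i))  # diagonal step over a homologous pair
        j, i = pj, pi
    return coords[::-1]


# Made-up example: x2~y1, x3~y2, x5~y3, x6~y4, with x4 inserted in one genome.
homolog = {(2, 1), (3, 2), (5, 3), (6, 4)}
sw, ptr = fill_matrix(homolog, I=6, J=5)
r0 = trace_back(sw, ptr, homolog)
```

Because the gap penalty is small relative to the match score, the inserted gene x4 does not split the conserved region, which is the behavior desired for clusters containing insertions or deletions of gene units.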
According to the prediction method of the present invention, whether or not the gene cluster R0 identified in the manner described above includes secondary metabolism-related genes can be determined in the manner described below in detail. According to the prediction method of the present invention, in addition to the gene cluster R0 identified in the manner described above, another gene cluster R′0 can be identified, and whether or not such gene cluster R′0 includes secondary metabolism-related genes can be determined in accordance with the procedures described below (
A gene cluster R′0 is a gene cluster other than the gene cluster R0 described above, and it is identified by subjecting the gene clusters Rm (m=1, 2, 3, . . . ) each identified as a region in which the gene arrangement is conserved in relation to xi (i=1, 2, . . . , I) and yj (j=1, 2, . . . , J) to alignment analysis again (denoted as “Alignment 2” in
A gene cluster including secondary metabolism-related genes is constituted by a wide variety of genes. When a gene cluster is compared with another gene cluster, accordingly, a large gap can appear as a result of insertion or deletion of a gene unit. In order to realize detection of a region containing many gaps as a gene cluster, a gene cluster Rm (m=1, 2, 3, . . . ) is identified with the use of the (J+1)×(I+1) matrix and scores obtained by the calculation described above. A method for identifying the gene cluster Rm (m=1, 2, 3, . . . ) is not particularly limited, and the process described below can be employed.
At the outset, “0” is assigned for all the cells indicated by the coordinates included in the set obtained in the previous step (starting from R0).
SW(j,i)=0
provided that
(j,i)=(j1,i1),(j2,i2), . . . (jn,in)
In the (J+1)×(I+1) matrix in which “0” is assigned for each cell of R0, subsequently, the trace backing starts again from the cell exhibiting the maximal score larger than 1 toward the cell exhibiting a score of 0. The cell exhibiting the maximal score larger than 1 satisfies the following condition, which is designated as “Condition *1.”
By starting the trace backing from the cell satisfying Condition *1 toward the cell exhibiting a score of 0, a set of coordinates indicating a cell in which high homology between xi and yj is exhibited can be identified (Rm). When a plurality of cells satisfy Condition *1 in accordance with the matrix: (J+1)×(I+1) in which “0” is assigned for each cell of R0, a plurality of gene clusters Rm (m=1, 2, 3, . . . ) are identified through the process described above.
If a plurality of gene clusters Rm (m=1, 2, 3, . . . ) identified in the manner described above are sufficiently near the gene cluster R0 that had already been identified, their scores would be influenced by the scores of the cells included in the cluster R0. In order to eliminate the influence of the scores of the cells included in the cluster R0, after a plurality of gene clusters Rm (m=1, 2, 3, . . . ) have been identified in the manner described above, accordingly, it is preferable for the identified gene clusters Rm to be subjected to an algorithm for searching strings of letters, such as the Smith-Waterman algorithm, to re-identify the arrangement of conserved genes.
Concerning those satisfying n(Rm)≧3 in the set of Rm (m=1, 2, 3, . . . ), more specifically, a region satisfying the following condition is extracted.
(j1≦j≦jn)∩(i1≦i≦in)
The scores are determined again while building the matrix (two-dimensional) in the manner as described above. Thus, a newly constructed gene cluster R′0 can be derived from the gene cluster Rm (m=1, 2, 3, . . . ) identified in the manner described above.
By repeating the above procedure until the trace backing from the cell satisfying Condition *1 toward the cell exhibiting a score of 0 can no longer be performed, gene clusters (R0, R′0, R″0 . . . ) to be subjected to evaluation as to whether or not such gene clusters include secondary metabolism-related genes can be identified.
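The repeated trace backing can be illustrated with the following minimal sketch. It is not the exact procedure of the invention: as simplifying assumptions, every cell visited during a trace back (not only the coordinates of homologous pairs) is set to 0 so that the loop always terminates, the match score is assumed to be +1 with penalties of −0.2, and each later cluster Rm with at least three homologous pairs is re-aligned within its own bounding region. All function names are hypothetical.

```python
def fill_matrix(homolog, I, J, match=1.0, penalty=-0.2):
    """Score matrix over gene indices; homolog holds 1-based (i, j) pairs."""
    sw = [[0.0] * (I + 1) for _ in range(J + 1)]
    ptr = [[None] * (I + 1) for _ in range(J + 1)]
    for j in range(1, J + 1):
        for i in range(1, I + 1):
            step = match if (i, j) in homolog else penalty
            cands = [(0.0, None), (sw[j - 1][i - 1] + step, (j - 1, i - 1)),
                     (sw[j - 1][i] + penalty, (j - 1, i)), (sw[j][i - 1] + penalty, (j, i - 1))]
            sw[j][i], ptr[j][i] = max(cands, key=lambda c: c[0])
    return sw, ptr


def pop_cluster(sw, ptr, homolog, min_score=1.0):
    """Trace back from the current maximal cell if its score exceeds min_score.

    Visited cells are zeroed (an assumption of this sketch). Returns the
    (j, i) coordinates of homologous pairs on the path, or None when done.
    """
    cells = ((j, i) for j in range(len(sw)) for i in range(len(sw[0])))
    j, i = max(cells, key=lambda c: sw[c[0]][c[1]])
    if sw[j][i] <= min_score:
        return None
    coords = []
    while sw[j][i] > 0.0:
        pj, pi = ptr[j][i]
        if (pj, pi) == (j - 1, i - 1) and (i, j) in homolog:
            coords.append((j, i))
        sw[j][i] = 0.0
        j, i = pj, pi
    return coords[::-1]


def realign(cluster, homolog):
    """Re-run the alignment inside the bounding region (j1..jn) x (i1..in) of a cluster."""
    (j1, i1), (jn, i_n) = cluster[0], cluster[-1]
    sub = {(i - i1 + 1, j - j1 + 1) for (i, j) in homolog if i1 <= i <= i_n and j1 <= j <= jn}
    sw, ptr = fill_matrix(sub, i_n - i1 + 1, jn - j1 + 1)
    local = pop_cluster(sw, ptr, sub) or []
    return [(j + j1 - 1, i + i1 - 1) for j, i in local]


def find_clusters(homolog, I, J, min_pairs=3):
    """Return [R0, R'0, R''0, ...] as lists of (j, i) coordinates."""
    sw, ptr = fill_matrix(homolog, I, J)
    found = []
    while (coords := pop_cluster(sw, ptr, homolog)) is not None:
        if not coords:
            continue  # path crossed no homologous pair (shadow of an earlier cluster)
        if not found:
            found.append(coords)  # R0 is used as is
        elif len(coords) >= min_pairs:
            redone = realign(coords, homolog)
            if redone:
                found.append(redone)
    return found


# Made-up example with two conserved blocks that cannot be joined in one alignment.
homolog = {(5, 1), (6, 2), (7, 3), (8, 4), (1, 5), (2, 6), (3, 7)}
clusters = find_clusters(homolog, I=8, J=7)
```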
It is determined whether or not the gene cluster represented by R0 or the gene clusters represented by R0, R′0, R″0 . . . identified in the manner described above include secondary metabolism-related genes (“Orthologue verification” in
According to the prediction method of the present invention, whether or not the gene cluster of interest includes secondary metabolism-related genes is determined by taking characteristic features, such as the facts that secondary metabolism-related genes are highly diversified and there are substantially no orthologous genes between different species, into consideration. Such characteristic features indicate that the proportion of synteny-like regions is small in a gene cluster including secondary metabolism-related genes. Accordingly, synteny-like regions in the identified gene clusters are identified, and whether or not the gene clusters of interest include secondary metabolism-related genes can be determined on the basis of the proportion of the synteny-like regions in the gene clusters.
More specifically, a synteny-like region in an identified gene cluster can be evaluated using the number of orthologous genes included in the gene cluster and the distance between such orthologous genes. In such a case, it is preferable that the scope of gene clusters to be evaluated be limited on the basis of gene cluster size or the number of homologous genes included in such gene clusters. Specifically, whether or not the gene cluster represented by R0 or the gene clusters represented by R0, R′0, R″0 . . . include(s), for example, 2 or more, and preferably 3 or more combinations of homologous genes is inspected. Also, whether or not the total number of genes is, for example, 50 or less, preferably 40 or less, and more preferably 35 or less is inspected. The gene clusters satisfying both conditions described above are preferably subjected to orthologue verification in order to identify synteny-like regions. A gene cluster that does not satisfy either condition is not subjected to the subsequent procedure, and it is rejected as a gene cluster that does not include secondary metabolism-related genes. When a standard such that the number of homologous gene combinations is 3 and the total number of genes is 35 is designated at this stage, for example, the scope of gene cluster is narrowed down under the conditions below (*2):
wherein n represents a position of a gene in a gene cluster; and in represents a position of a gene in the genome.
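The narrowing down of gene clusters prior to orthologue verification can be sketched as follows. Since the formula of Condition *2 is not reproduced here, the sketch assumes that the number of homologous gene combinations is the number of coordinates in the cluster and that the total number of genes is the span from the first to the last gene position in each genome. The function name and default values (3 combinations, 35 genes) are taken from the example standard above; the rest is hypothetical.

```python
def passes_size_filter(cluster, min_homologs=3, max_genes=35):
    """Check a cluster given as an ordered list of (j, i) gene positions.

    Assumptions of this sketch: the gene count is measured as the inclusive
    span between the first and last positions, and both genomes must satisfy it.
    """
    if len(cluster) < min_homologs:
        return False
    (j1, i1), (jn, i_n) = cluster[0], cluster[-1]
    span_x = i_n - i1 + 1
    span_y = jn - j1 + 1
    return span_x <= max_genes and span_y <= max_genes


# Made-up clusters for illustration.
compact = [(10, 100), (12, 103), (20, 110)]   # 3 pairs, spans of 11 genes
too_few = [(10, 100), (11, 101)]              # only 2 pairs
too_long = [(10, 100), (30, 120), (60, 150)]  # spans of 51 genes
results = [passes_size_filter(c) for c in (compact, too_few, too_long)]
```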
Subsequently, gene clusters satisfying the above conditions (e.g., Condition *2) are subjected to orthologue verification. Prior to orthologue verification, gene clusters are modified so as to adjust the number of genes included in each gene cluster to the total number of genes under the conditions described above (e.g., 35 genes under Condition *2) (
With regard to xi (i=1, 2, . . . , I) and yj (j=1, 2, . . . J), the sets of the total genes when the number of genes included in a gene cluster is 35 are represented by X and Y, respectively.
X={xi|i is an integer satisfying a≦i≦b, provided that a≦i1, in≦b, b−a+1=35}
Y={yj|j is an integer satisfying c≦j≦d, provided that c≦j1, jn≦d, d−c+1=35}
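One possible way of choosing the bounds a and b (and likewise c and d) is sketched below. The manner in which neighboring genes are distributed on either side of the cluster is not specified above, so the sketch assumes roughly symmetric padding, shifted where the window would otherwise run past either end of the genome. The function name and parameters are hypothetical.

```python
def extend_window(first, last, genome_size, size=35):
    """Return (a, b) with a <= first, last <= b, and b - a + 1 == size.

    Positions are 1-based gene indices. Symmetric padding is an assumption
    of this sketch, not a requirement stated in the text.
    """
    span = last - first + 1
    if span > size or genome_size < size:
        raise ValueError("cluster or genome does not fit the requested window size")
    pad = size - span
    a = first - pad // 2
    # Shift the window back inside the genome if it overruns either end.
    a = max(1, min(a, genome_size - size + 1))
    return a, a + size - 1


# Made-up positions for illustration.
middle = extend_window(50, 54, genome_size=100)
near_start = extend_window(10, 14, genome_size=100)
near_end = extend_window(95, 99, genome_size=100)
```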
Whether or not combinations of orthologous genes between the genes included in X and Y are present is determined on the basis of the results of the homology search described above (dashed arrow in
The synteny-like regions identified in X and Y are represented as subsets of X and Y; xSB and ySB, respectively. When the number of elements in each subset is not more than a given proportion relative to the number of elements in X and Y as a whole, respectively, it is determined that a gene cluster comprising xi and a gene cluster comprising yj include secondary metabolism-related genes. A given proportion is not particularly limited, and it can be 30%, 25%, or 20%. When a given proportion is designated as 25% (Condition *3), for example, those satisfying the following conditions can be predicted to be gene clusters including secondary metabolism-related genes.
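The evaluation based on the proportion of the synteny-like regions can be sketched as follows. The sketch rests on several assumptions that are not stated in this passage: a synteny-like region is taken to be a run of at least two orthologous gene combinations in which neighboring orthologous genes lie within a given distance in both genomes (20 kb is chosen here from the 10 kb to 30 kb range), every gene between the first and last orthologous gene of such a run is counted as belonging to the region, and gene positions are given in base pairs. All names are hypothetical.

```python
def synteny_like_fraction(orthologs, pos_x, pos_y, window_size=35, max_dist=20_000):
    """Return the fractions of X and Y covered by synteny-like regions.

    orthologs: (i, j) index pairs of orthologous genes inside the windows X and Y.
    pos_x, pos_y: mappings from gene index to genomic position in bp.
    """
    chains, current = [], []
    for pair in sorted(orthologs):
        if current:
            pi, pj = current[-1]
            close = (abs(pos_x[pair[0]] - pos_x[pi]) <= max_dist
                     and abs(pos_y[pair[1]] - pos_y[pj]) <= max_dist)
            if not close:
                chains.append(current)
                current = []
        current.append(pair)
    if current:
        chains.append(current)
    x_genes, y_genes = set(), set()
    for chain in chains:
        if len(chain) < 2:
            continue  # a single orthologous gene does not form a region
        xs = [i for i, _ in chain]
        ys = [j for _, j in chain]
        x_genes.update(range(min(xs), max(xs) + 1))
        y_genes.update(range(min(ys), max(ys) + 1))
    return len(x_genes) / window_size, len(y_genes) / window_size


def includes_secondary_metabolism(orthologs, pos_x, pos_y, max_fraction=0.25):
    """Apply the 25% example threshold to both windows."""
    fx, fy = synteny_like_fraction(orthologs, pos_x, pos_y)
    return fx <= max_fraction and fy <= max_fraction


# Made-up positions: one gene every 2 kb in both genomes.
pos = {k: k * 2000 for k in range(1, 100)}
sparse = [(3, 12), (4, 13), (20, 30)]          # one short run plus an isolated orthologue
dense = [(i, i + 10) for i in range(1, 21)]    # a long conserved run
calls = (includes_secondary_metabolism(sparse, pos, pos),
         includes_secondary_metabolism(dense, pos, pos))
```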
A method for predicting a gene cluster including secondary metabolism-related genes is not limited to a method involving the use of the synteny-like region identified in accordance with the procedure described above. A synteny-like region identified by another method may be used. An example of such a method is one in which nucleotide sequence information of different types of genomes and annotation information are used to determine a synteny region and a non-synteny region in advance.
With the use of the synteny region determined in advance as the synteny-like region in the method of the present invention, a gene cluster including secondary metabolism-related genes can be predicted in the manner described above. That is, a method of identifying a synteny-like region on the basis of a synteny region can be carried out in the same manner as with the method of determining a synteny-like region described in
According to this method, a gene cluster including secondary metabolism-related genes can occasionally be predicted with higher accuracy than with the method comprising detecting a gene cluster and then identifying a synteny-like region described above. In the case of comparison between highly related species such as A. flavus and A. oryzae, for example, some A. oryzae strains may have a gene cluster highly homologous to the aflatoxin biosynthesis gene cluster. In addition, other A. flavus or A. oryzae strains do not have the second gene cluster highly homologous to the gene cluster described above. Accordingly, the aflatoxin biosynthesis gene cluster that is present in A. flavus may not be detected. In such a case, a third genome is used to determine a synteny region in advance for one of the two types of organism species to be actually compared. This can improve predictability. According to this method, a synteny region is defined as a gene region that is present in common in relatively related species, such as Aspergillus.
In the method for predicting a gene cluster including secondary metabolism-related genes described above, the gene clusters to be evaluated were limited on the basis of the number of genes included in the gene clusters. According to the prediction method of the present invention, however, the gene clusters to be evaluated may be limited on the basis of gene cluster length. Specifically, gene cluster length may be compared with a given standard value, and a gene cluster with a length less than the standard value may be subjected to orthologue verification. While the standard value is not particularly limited, it may be, for example, 125 kb (corresponding to about 50 genes), preferably 100 kb (corresponding to about 40 genes), and more preferably 87.5 kb (corresponding to about 35 genes).
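The three standard values all correspond to about 2.5 kb per gene (125/50 = 100/40 = 87.5/35 = 2.5). A minimal sketch of the length-based filter follows; the coordinate convention (1-based, inclusive base-pair positions) and the function names are assumptions.

```python
KB_PER_GENE = 2.5  # implied by the correspondences given in the text


def length_threshold_kb(max_genes):
    """Convert a gene-count limit into the corresponding length limit in kb."""
    return max_genes * KB_PER_GENE


def passes_length_filter(start_bp, end_bp, threshold_kb=87.5):
    """True when the cluster is shorter than the standard value."""
    length_kb = (end_bp - start_bp + 1) / 1000
    return length_kb < threshold_kb


short_cluster = passes_length_filter(1, 80_000)   # 80 kb
long_cluster = passes_length_filter(1, 90_000)    # 90 kb
```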
According to a method for predicting a gene cluster including secondary metabolism-related genes, as described above, the number of genes included in a gene cluster was adjusted to a given level (e.g., 35) prior to orthologue verification. According to the prediction method of the present invention, however, a given number of genes or a region of a given length may be added to a gene cluster so as to modify the gene cluster prior to orthologue verification, and the modified gene cluster may then be subjected to orthologue verification.
A gene cluster can be modified by, for example, a method comprising modifying the gene cluster boundary, as described below. That is, the boundaries of particular gene clusters represented by R0, R′0, R″0 . . . are modified. Modification of the gene cluster boundary is synonymous with determination as to the necessity of addition of genes located outside the gene cluster identified by the method described in the [Identification of gene cluster] section above to the gene cluster.
More specifically, with regard to the gene clusters xi (i=1, 2, . . . , I) and yj (j=1, 2, . . . , J), for example, the sets of all genes when the number of genes included in each gene cluster is 35 are designated as X and Y, respectively.
X = {x_i | i is an integer satisfying a ≤ i ≤ b, provided that a ≤ i_1, i_n ≤ b, b − a + 1 = 35}
Y = {y_j | j is an integer satisfying c ≤ j ≤ d, provided that c ≤ j_1, j_n ≤ d, d − c + 1 = 35}
In order to modify the gene cluster boundary, a one-dimensional sequence (SC) comprising n(X) elements is prepared. The scores determined in accordance with, for example, the formulae shown below can be assigned to the elements of the sequence. When x_i is homologous to at least one of y_c, y_(c+1), . . . , y_(d−1), and y_d:
When x_i is not homologous to any of y_c, y_(c+1), . . . , y_(d−1), and y_d:
After the scores are determined for all the elements in the sequence, the elements exhibiting the maximal scores within the relevant ranges (1) and (2) indicated above are designated as i_start and i_stop, respectively. The set Y is subjected to the same procedure.
i_start and i_stop identified in the manner described above are designated as the gene cluster boundaries. Specifically, gene clusters with modified boundaries are represented as follows.
x_i (i_start ≤ i ≤ i_stop), y_j (j_start ≤ j ≤ j_stop)
The negative value used in the score SC(j) assigned when x_i is homologous to none of y_c, y_(c+1), . . . , y_(d−1), and y_d can be, for example, −0.1, −0.2, −0.3, −0.4, −0.5, or −1.
By modifying the boundaries of the gene clusters represented by R0, R′0, R″0 . . . in the manner described above, accuracy of prediction of the gene clusters including secondary metabolism-related genes through orthologue verification can be improved. Modification of the gene cluster boundary may be carried out before or after the process of orthologue verification described above.
The method for predicting a gene cluster including secondary metabolism-related genes according to the present invention described above can be implemented with the use of a computer equipped with an input unit, such as a mouse and a keyboard, a central processing unit (CPU), a storage unit including volatile and/or non-volatile memory, and an output unit, such as a display. A computer is preferably connected to a memory unit such as an external database or an external computer system through a communication network such as the internet or an intranet. Specifically, the prediction method according to the present invention can be provided as a prediction program that can predict a gene cluster including secondary metabolism-related genes with the use of the computer unit constituted as described above. In other words, a computer in which such prediction program has been installed is a prediction device for a gene cluster including secondary metabolism-related genes.
In order to implement the prediction method using a computer, nucleotide sequence information of a pair of genomes may be inputted into the computer from an external storage unit or a computer system through a communication network. Alternatively, the computer may be connected to a DNA sequencer through an interface, and sequence information may be inputted into the computer. In addition, storage media such as a DVD or a CD may be used to read nucleotide sequence information of a pair of genomes into the computer.
With the use of a computer, nucleotide sequence information of a pair of genomes can be subjected to homology search with the aid of a central processing unit, and the results of the homology search can be stored in the storage unit. With the use of a computer, in addition, the procedures for [Identification of gene clusters] and [Determination of gene cluster including secondary metabolism-related genes] described above can be performed with the use of software equipped with an algorithm that searches for strings of letters, such as the Smith-Waterman algorithm.
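As an illustration of how a string-search algorithm can be applied to gene order, the sketch below runs a Smith-Waterman local alignment over two lists of gene identifiers, treating a pair of genes as a "match" when the homology search reported them as homologous. The scoring values (match +2, mismatch −1, gap −1), the function name, and the gene identifiers are assumptions made for the example and are not taken from the text.

```python
def smith_waterman_genes(genes_x, genes_y, homologous,
                         match=2, mismatch=-1, gap=-1):
    """Local alignment of two gene orders.

    genes_x, genes_y: gene identifiers in genomic order.
    homologous:       set of (x_gene, y_gene) pairs judged homologous.
    Returns (best_score, aligned_homologous_pairs).
    """
    rows, cols = len(genes_x) + 1, len(genes_y) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair_score(i, j):
        return match if (genes_x[i - 1], genes_y[j - 1]) in homologous else mismatch

    # Fill the dynamic-programming matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + pair_score(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = pair_score(i, j)
        if H[i][j] == H[i - 1][j - 1] + s:
            if s == match:
                pairs.append((genes_x[i - 1], genes_y[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


genes_x = ["x1", "x2", "x3", "x4", "x5"]
genes_y = ["y1", "y2", "y3", "y4"]
homologous = {("x2", "y1"), ("x3", "y2"), ("x5", "y3")}
score, pairs = smith_waterman_genes(genes_x, genes_y, homologous)
```

In this toy input, gene x4 has no homolog, so the alignment bridges it with a gap and still recovers the three homologous pairs in conserved order.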
Hereafter, the present invention is described in greater detail with reference to the following examples, although the technical scope of the present invention is not limited to such examples.
In Example 1, 8 types of genomic data sets were used. The data of Aspergillus oryzae equivalent to the data registered at GenBank (AP007150-AP007177) were used. The data of Aspergillus flavus downloaded from GenBank in the GenBank file format were used (GenBank Accession NOs: EQ963472 to EQ963493). The data of Aspergillus fumigatus, Aspergillus nidulans, Aspergillus terreus, Magnaporthe grisea, Fusarium graminearum, and Chaetomium globosum were downloaded from the Broad Institute.
In Example 1, genes exhibiting e-values of 1.0e-10 or less as a result of homology search were designated as homologous genes. Also, a pair of genes was designated as a pair of orthologous genes when the pair was listed at the top of the list of gene pairs sorted in descending order of homology (i.e., ascending order of e-value) as a result of homology search.
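The two designations can be sketched as follows. The input format (tuples of query gene, subject gene, and e-value) and the function name are hypothetical, and "listed at the top" is interpreted here as the lowest e-value hit for each query gene, which is an assumption about the procedure.

```python
E_VALUE_CUTOFF = 1.0e-10  # threshold used in Example 1


def homologs_and_orthologs(hits, cutoff=E_VALUE_CUTOFF):
    """Split homology-search hits into homologous pairs and top-hit orthologs.

    hits: iterable of (query_gene, subject_gene, e_value) tuples.
    Returns (homologs, orthologs), where homologs is sorted by ascending
    e-value and orthologs maps each query gene to its best subject gene.
    """
    homologs = sorted((h for h in hits if h[2] <= cutoff), key=lambda h: h[2])
    orthologs = {}
    for query, subject, _e_value in homologs:
        # The first hit seen for a query has the lowest e-value.
        orthologs.setdefault(query, subject)
    return homologs, orthologs


hits = [
    ("x1", "y1", 1e-50),
    ("x1", "y2", 1e-20),
    ("x2", "y3", 1e-5),   # above the cutoff, not homologous
    ("x3", "y2", 1e-30),
]
homologs, orthologs = homologs_and_orthologs(hits)
```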
In Example 1, also, gene arrangement conservation was examined using the Smith-Waterman algorithm, and gene clusters represented by R0, R′0, R″0 . . . were identified. In order to identify a synteny-like region, standards to the effect that the number of homologous gene combinations included in the identified gene cluster should be at least 3 and the total number of genes should be less than 35 were established in Example 1. In addition, the term “synteny-like region” used herein refers to a region comprising a plurality of orthologous genes in which the distance between neighboring orthologous genes (although other genes may be present therebetween) is 10 kb or less, 20 kb or less, or 30 kb or less.
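The definition of a synteny-like region given above can be sketched as a single pass over the positions of orthologous genes. The use of one coordinate per gene (for example, its start position in base pairs) and the function name are assumptions; the 10 kb default is one of the three thresholds named in the text.

```python
def synteny_like_regions(ortholog_positions_bp, max_gap_bp=10_000):
    """Group orthologous genes whose neighbours lie within max_gap_bp.

    ortholog_positions_bp: genomic coordinates of the orthologous genes.
    Returns a list of regions, each a list of two or more positions.
    """
    regions, current = [], []
    for pos in sorted(ortholog_positions_bp):
        if current and pos - current[-1] > max_gap_bp:
            if len(current) >= 2:      # a region needs a plurality of genes
                regions.append(current)
            current = []
        current.append(pos)
    if len(current) >= 2:
        regions.append(current)
    return regions


regions = synteny_like_regions([1_000, 8_000, 15_000, 60_000, 100_000, 105_000])
```

Non-orthologous genes lying between two orthologous genes do not appear in the input at all, so they do not break a region as long as the distance condition holds.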
In Example 1, the original gene cluster in which the number of genes included in the synteny-like region (subsets of X and Y: xSB and ySB) is less than 25% (i.e., 8 or fewer) of the 35 genes was predicted to be a gene cluster including secondary metabolism-related genes.
With the use of 10 genomic nucleotide sequences of filamentous fungi such as A. flavus or A. oryzae for which genomic analyses had been completed, the number of gene clusters including secondary metabolism-related genes was predicted by the method described above, and Table 1 shows the results of such prediction. Table 1-1 shows the results attained by defining a synteny-like region as a region in which the distance between neighboring orthologous genes is 10 kb or less. Table 1-2 shows the results attained by designating such distance as 20 kb or less, and Table 1-3 shows the results attained by designating such distance as 30 kb or less. These results demonstrate that the results would not significantly vary if the synteny-like region were to be defined as a region in which the distance between neighboring orthologous genes was 10 kb to 30 kb.
[Tables 1-1, 1-2, and 1-3: each is a 10 × 10 matrix whose rows and columns are A. flavus, A. oryzae, A. terreus, A. fumigatus, A. nidulans, F. graminearum, F. verticillioides, F. oxysporum, C. globosum, and M. grisea. The cell values are not reproduced here.]
Table 2 shows the results of calculation of the proportion of gene clusters containing Q genes among the gene clusters predicted to include secondary metabolism-related genes in Example 1. The term “Q genes” refers to genes that are classified as secondary metabolism-related genes as a result of functional classification of clusters of orthologous groups (COG).
[Table 2: three 10 × 10 matrices whose rows and columns are A. flavus, A. oryzae, A. terreus, A. fumigatus, A. nidulans, F. graminearum, F. verticillioides, F. oxysporum, C. globosum, and M. grisea. The cell values are not reproduced here.]
The results shown in Table 2 demonstrate that gene clusters predicted to include secondary metabolism-related genes in Example 1 are highly likely to include Q genes. This indicates that a gene cluster including secondary metabolism-related genes can be predicted with high accuracy according to the method described in Example 1 and that a gene cluster including secondary metabolism-related genes, which could not be identified in accordance with a conventional methodology, is highly likely to be identified.
In Example 2, gene arrangement conservation was examined using the Smith-Waterman algorithm in the same manner as in Example 1, and gene clusters represented by R0, R′0, R″0 . . . were identified. In Example 2, also, gene clusters including secondary metabolism-related genes were predicted in the same manner as in Example 1 except for the points described below. That is, in a process for modifying the boundary between the identified gene clusters, a score of “+1” was assigned for each gene included in the gene cluster, which had been elongated to contain 35 genes, in the presence of homologous genes, a score of “−0.3” was assigned in the absence of homologous genes, the scores were summed from the center of the elongated gene cluster, and the gene exhibiting the maximal total of the scores was designated as the gene cluster boundary.
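The boundary modification of Example 2 can be sketched as follows. The scores (+1 and −0.3) are those stated above; the choice to include the central gene in both running sums, the tie-breaking rule (the position nearest the center wins), and the function name are assumptions.

```python
def modified_boundaries(has_homolog, gain=1.0, penalty=-0.3):
    """Locate the cluster boundaries inside an elongated window.

    has_homolog: list of booleans, one per gene in the window, True when the
                 gene has a homolog in the partner window.
    Returns (i_start, i_stop) as 0-based indices into the window.
    """
    center = len(has_homolog) // 2

    def best_end(indices):
        # Accumulate scores outward from the center and keep the position
        # where the running total is largest.
        total, best_total, best_index = 0.0, float("-inf"), center
        for i in indices:
            total += gain if has_homolog[i] else penalty
            if total > best_total:
                best_total, best_index = total, i
        return best_index

    i_stop = best_end(range(center, len(has_homolog)))
    i_start = best_end(range(center, -1, -1))
    return i_start, i_stop


# Toy window of 11 genes instead of 35, for readability.
window = [False, False, True, True, False, True, True, True, False, False, False]
i_start, i_stop = modified_boundaries(window)
```

Because each gene without a homolog costs only 0.3, a single such gene inside the cluster is bridged, whereas a run of them at the edge of the window is trimmed off.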
Some of the gene clusters including secondary metabolism-related genes predicted in Example 2 are shown in Table 3. Table 4 shows gene clusters including secondary metabolism-related genes that were predicted, as in Example 1, without modification of the gene cluster boundary.
[Tables 3 and 4: lists of predicted gene clusters with an "Error" column. Only the species entries survive; they are Aspergillus flavus, Aspergillus oryzae, Aspergillus terreus, Aspergillus fumigatus, Aspergillus nidulans, Fusarium graminearum, Fusarium verticillioides, Fusarium oxysporum, Chaetomium globosum, and Magnaporthe grisea. The cell values are not reproduced here.]
In Table 3 and Table 4, the column indicating “Error” represents the number of genes in the predicted gene cluster that are out of alignment toward the upstream direction (toward the 5′ end) and toward the downstream direction (toward the 3′ end) relative to the gene cluster that actually includes secondary metabolism-related genes.
As is apparent from Table 4, 94 genes were counted as errors when the gene cluster boundary was not modified, which means that each of the 21 gene clusters shown in Table 4 included about 4.5 errors on average. When the gene cluster boundary was modified, in contrast, 82 genes were counted as errors, and each of the 21 gene clusters included about 3.9 errors on average. Thus, by modifying the gene cluster boundary, a gene cluster including secondary metabolism-related genes can be detected with higher accuracy.
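The averages quoted above follow directly from the totals, as the short calculation below shows. The relative reduction is derived here from the same figures and is not a value stated in the text.

```python
clusters = 21
errors_unmodified = 94   # total error genes without boundary modification
errors_modified = 82     # total error genes with boundary modification

mean_unmodified = errors_unmodified / clusters   # about 4.5 per cluster
mean_modified = errors_modified / clusters       # about 3.9 per cluster

# Derived figure: fraction of error genes removed by boundary modification.
relative_reduction = (errors_unmodified - errors_modified) / errors_unmodified
```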
All publications, patents, and patent applications cited herein are incorporated herein by reference in their entirety.
Number | Date | Country | Kind |
---|---|---|---|
2012-210044 | Sep 2012 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2013/075702 | 9/24/2013 | WO | 00 |