The contents of the Electronic Sequence Listings filed herewith (Sequences_ST25.txt; Size: 2062 bytes; Date of Creation: Jul. 7, 2014) are herein incorporated by reference in their entirety.
The present invention relates in general terms to DNA genotypic data that is linked to clinical diagnosis, that is, phenotypic data. More particularly, a method is presented for extracting genomic classifiers of disease risk from genomic data, as obtained from microarray or gene chip assays, in conjunction with their phenotypic correlates. This leads to methods of disease forecasting and individual patient disease risk prediction, as well as to devices which accomplish these goals.
There are almost three billion (coding and non-coding) DNA base pairs in the human genome, about 99.5% of which are shared by all Homo sapiens. Each somatic cell contains a maternal and a paternal contribution, so the overwhelming majority of pairs are homozygous, but the remaining pairs appear as two alleles. These noteworthy deviant markers are termed single nucleotide polymorphisms (SNPs), which are heterozygous pairs or alleles. By definition, for an allele pair to be a SNP the frequency of the rarer allele must be greater than 1% in the population. A SNP for which both alleles produce the same polypeptide sequence is said to be a silent (synonymous) mutation. If a different polypeptide occurs, it is said to be a replacement polymorphism. There is a general view that this subset of the genome accounts for human variation, and in particular carries the potential for acquiring diseases. Replacement polymorphisms, which result in polypeptide substitution, are thought to be responsible for over half the known diseases of mutagenic origin (Stenson et al., 2009).
DNA genotyping is performed by microarrays, also referred to as gene chips. A microarray is a collection of microscopic DNA spots, referred to as reporters or probes, attached to a solid surface. A single chip can contain many hundreds of thousands of probes. While polymerase chain reaction (PCR) microarrays, or gene chips, have facilitated acquisition of vast quantities of genomic data, disappointment has been expressed at the lack of DNA variant linkages to human diseases, particularly in the case of complex disorders (Chakravarti, 2011).
Patents exist that associate SNPs with genetic-based diseases. For example, U.S. Patent Publication No. US2004/0132015 discloses a process for detecting mutations in regions determined by a codon scanning algorithm. A process for preparing a DNA chip using that process is also disclosed, as is a method for detecting mutations using DNA chips. Mutations associated with various genetic diseases can be detected and identified, so the DNA chip using the codon scanning algorithm can be applied to the diagnosis of diseases associated with genetic mutations.
A computer algorithm for mathematical allele combination from a gene type device is disclosed in U.S. Pat. No. 7,272,506. The patent discloses an automated method for identifying allele values from a data file and analyzing DNA polymorphisms. The method is used for distinguishing targeted polymorphic DNA sites without control samples.
U.S. Patent Publication No. US 2011/0014607 discloses methods for identifying imprinted genes. In some of the methods, a first data set is provided comprising a plurality of nucleic acid sequences corresponding to a plurality of genes known to be imprinted in a subject. A second data set includes a plurality of nucleic acid sequences corresponding to genes known not to be imprinted in a subject. One or more features are identified that, by themselves or in combination, are differentially present in or absent from the first data set as compared with the second data set. The one or more features are then applied to a test data set comprising a plurality of genomic DNA sequences that correspond to one or more genes for which the imprinting status is unknown, in order to identify an imprinted gene in a subject. The '607 Publication also discloses a method for identifying a feature in the subject with respect to an imprinted gene and methods for detecting the presence of susceptibility to a medical condition associated with parent-of-origin dependent monoallelic expression in the subject.
An algorithm for quantifying polymorphisms in an electropherogram is disclosed in U.S. Patent Publication No. US 2011/0238318. This Publication discloses a method of quantifying a particular target site of a cell or organism by performing genomic sequencing, in which a DNA sample is extracted from the cell or organism and treated to convert cytosine to uracil, and a fragment of the treated DNA is amplified. A sequence analysis is then performed using calculations made from an electropherogram.
U.S. Patent Publication No. US 2010/0285980 discloses a gene expression profile algorithm that provides a test for likelihood of recurrence of colorectal cancer and response to chemotherapy, and that involves analysis of gene expression values of prognostic and/or predictive genes. A biological sample can be obtained from a cancer patient. The expression levels are measured and analyzed to provide such information, and other methods of analysis are disclosed for identifying genes that co-express with a validated biomarker and that may be substituted for that biomarker in an assay.
The invention uses microarray genotyping data and reformulates the data in a new and unique allelic form which facilitates genetic marker classification of disease risk by novel and expeditious means; this may be an enabler of “personalized medicine”, a proposed avenue for dealing with disease association. As recently as Oct. 7, 2011, a Science editorial referred to present methodologies as a gaping hole. The present discoveries provide new approaches for the successful isolation of a kernel of genomic markers associated with a disease, and for the first time offer a successful way to properly analyze and predict multi-marker complex diseases.
The features of this invention are contained in the following elements:
1. Development of a novel reorganization of genomic DNA sequences which standardizes the order and acquisition of individual alleles.
2. Emphasis on an unambiguous allele locus, instead of the universally used SNP pair, that now results in a significantly enhanced discovery procedure for risk loci related to disease.
3. A new theoretical analysis, based on the premise that disease represents a small signal immersed in large genomic databases and which then produces a disease classifier and a composite Score yielding disease prognosis.
4. A new method for marker discovery that produces unbiased selection of high value potential disease risk loci.
5. A method for incorporating the contributions of a large number of genomic loci for grading the risk of disease; the focus of most studies on single SNPs may be the cause of the present failure to predict.
Consequences of those elements are the following:
An algorithm for detecting genomic markers associated with disease by contrasting these with non-disease controls.
A method for ordering markers in terms of their value for disease association.
A method for eliminating environmental, cultural, ethnic and other irrelevant elements from disease consideration.
An improved method is the product of an exploration of ways to improve the success of genomic disease prediction. It presents the elements of more successful prediction tools for identifying genomic locations of disease, based on genomic labeling of loci as determined by correlated SNP linkage in several additional GWAS (Genome Wide Association Studies) databases. The Wellcome Trust T2D (type 2, adult-onset diabetes) database has played a particularly important role in developing the new tools. Additionally, the development of a wellness classifier, probably due in part to counter-disease mutations, has proven to be a powerful new tool and concept.
The result has been a substantial increase in the successful genomic prediction of disease; for example, for the Wellcome Trust data there is compelling evidence of a greater than 99% successful prediction rate.
Additionally the new tools provide a sounder basis for selecting genomic reporters for the construction of gene arrays for specific diseases, based on SNP linkage associations for both disease and wellness loci.
An important element in the improved identification of disease versus wellness is the construction of a classifier designed to distinguish the two states in an optimal fashion.
Application of this improved method will positively influence the construction and use of gene arrays for purposes of public health, the discovery of disease mechanisms, and standards for the safe use of pharmacological drugs with known side effects.
An algorithm for isolating the classifier set, a particular case being the indicator vector for type 2 diabetes (T2D).
Those skilled in the art will appreciate the improvements and advantages that derive from the present invention upon reading the following detailed description, claims, and drawings. Further details are in Sirovich 2014a, 2014b both attached.
An outline of the algorithmic processes is furnished in the flow diagram which describes the components and the steps in the overall algorithmic procedures. Further details of the methods and their algorithmic connections are presented below in relation to
In Step F1 of
The methods of this application are based on clinically linked genomic data. The framework of the methods will be illustrated for type 2 diabetes, T2D: Finland-United States Investigation of NIDDM Genetics (FUSION) study, NIH-dbGap. Details of the database are as follows:
919 T2D cases; 787 normal glucose tolerant (NGT) controls; 315,693 common SNPs.
The data contained in the database (F1) may contain irrelevant sequences or faulty loci. The data set so accessed may optionally be prepared to facilitate or expedite analysis or to enhance the accuracy of the results. Such preparation may include selection of the number of sequences (at F3) and/or repair or deletion of faulty loci (at F4) as shown in
Repair/Delete Faulty Loci-F4
Data normally acquired by gene chips contains a relatively low level of missing data. Data management tools such as Matlab, the ‘R’ programming language, or the ‘Plink’ toolset (the last two are publicly available) easily accomplish this task. In accordance with the example being described, the method focuses primarily on loci with fewer than 2 missing symbols, reducing the number of SNPs to 272,423, and replaces the remaining missing symbols with the appropriate column mode symbols.
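The repair/delete step can be illustrated with a minimal Python sketch. It is not the procedure of the cited tools; the missing-symbol placeholder "0", the function name prepare_loci, and the toy data are assumptions made only for illustration. Columns (loci) with fewer than 2 missing symbols are kept, and their remaining gaps are filled with the column mode.

```python
from collections import Counter

MISSING = "0"  # assumed placeholder for a missing symbol (hypothetical)

def prepare_loci(rows, max_missing=1):
    """Keep loci with at most `max_missing` missing symbols and replace
    the remaining missing symbols by the column mode."""
    kept_indices, kept_columns = [], []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        observed = [s for s in column if s != MISSING]
        if column.count(MISSING) > max_missing or not observed:
            continue  # faulty locus: delete it
        mode = Counter(observed).most_common(1)[0][0]
        kept_indices.append(j)
        kept_columns.append([mode if s == MISSING else s for s in column])
    # Transpose the kept columns back into sequences (rows).
    cleaned = [list(seq) for seq in zip(*kept_columns)]
    return kept_indices, cleaned

rows = [
    ["A", "C", "0", "T"],
    ["A", "0", "0", "T"],
    ["G", "C", "G", "0"],
    ["A", "C", "G", "T"],
]
indices, cleaned = prepare_loci(rows)
```

In this toy input the third locus has two missing symbols and is deleted, while the single gaps in the second and fourth loci are repaired.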
Selection of a limited number of sequences (Step F3) is optional and no such restriction is required although it has been determined that comparable results will be obtained. Also, repairing and/or deleting faulty loci (Step F4) is optional. However, data preparation (Steps F3 and/or F4) does improve the results.
If the data is not to be “prepared” at F2, step 2 may be bypassed, at F5, in which case all the native data accessed at F1 is used, as will be described.
The nth SNP of a database sequence is registered as two alleles indexed as (2n−1, 2n) and referred to as the odd and even members of the SNP pair. According to the standard order introduced here, the higher symbol is moved to the odd position and aliased as 2, and the lower symbol is moved to the even position and aliased as 1. No information is lost, since the accompanying rs number of a SNP (part of the database) fully furnishes location and alleles. The information at a SNP of a sequence is determined by whether 22, 21 or 11 is registered. The steps that are followed are outlined in
Transformation of the alias symbols “1” & “2” into a digital form is achieved by the vectorization
1→[1, 0], 2→[0, 1]. (1)
For a typical sequence S this gives the symbol to vector formulation
S=(T,A,C,C,T,T,T,A,G,A)→[0,1,1,0,1,0,1,0,0,1,0,1,0,1,1,0,0,1,1,0]. (2) (Sequence No. 12)
Other equivalent aliasing symbols can be used.
The required transformations can be easily accomplished by the above mentioned software.
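A minimal Python sketch of the aliasing and vectorization steps follows. It assumes that "higher" and "lower" refer to alphabetical rank of the nucleotide symbols and that the two alleles of each SNP are known from its rs entry; the allele sets listed for the ten-allele example sequence are hypothetical, as are the function names.

```python
ONE_HOT = {1: [1, 0], 2: [0, 1]}  # vectorization rule (1)

def alias_snp(pair, alleles):
    """Alias one SNP pair in the standard order: returns [odd, even].

    `alleles` are the two symbols recorded for the SNP. The alphabetically
    higher symbol becomes 2 and the lower becomes 1 (an assumption of this
    sketch); the higher alias is placed at the odd position."""
    low, high = sorted(alleles)
    for symbol in pair:
        if symbol not in (low, high):
            raise ValueError(f"unexpected symbol {symbol!r}")
    return sorted((2 if s == high else 1 for s in pair), reverse=True)

def encode(sequence, snp_alleles):
    """Alias every SNP pair of a sequence, then one-hot encode the aliases."""
    aliases = []
    for n, alleles in enumerate(snp_alleles):
        aliases.extend(alias_snp(sequence[2 * n: 2 * n + 2], alleles))
    vector = [bit for a in aliases for bit in ONE_HOT[a]]
    return aliases, vector

S = ["T", "A", "C", "C", "T", "T", "T", "A", "G", "A"]
# Hypothetical allele sets for the five SNPs of S.
snp_alleles = [("A", "T"), ("C", "T"), ("A", "T"), ("A", "T"), ("A", "G")]
aliases, vector = encode(S, snp_alleles)
```

Each allele contributes two entries, so a sequence of ten alleles yields a vector of twenty entries.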
A feature of the invention is the examination of the digitized data to determine the high value markers or loci. The examination, at F8, involves identifying those markers or loci that reflect, relate to or are associated with the specific disease using a suitable mathematical model, typically a statistical model, that compares the database for both disease and control loci. Best results have been found when the statistical model eliminates or minimizes biased sampling. Examples of such mathematical models include the use of Odds Ratios (RO), at F8a, and/or Incremental Information (M), at F8b. However, other models, at F8n, are possible, and sampling models developed in the future are contemplated if these reduce bias and provide more accurate, consistent or reliable results.
For a collection of M sequences, at any SNP the probability at the odd allele, po(2), is easily calculated as the frequency of symbol 2 at the odd position, and similarly pe(1) is the probability of symbol 1 at the even allele.
The probability of a 22 SNP pair, P(22), is then given by
P(22)=1−pe(1), (3)
similarly the probability of a 11 SNP pair is
P(11)=1−po(2), (4)
and from this the SNP probability of a 21 pair is
P(21)=1−P(22)−P(11). (5)
If P(2) denotes the probability of symbol 2 for a SNP, then
P(1)=1−P(2). (6)
This reduces all allele and SNP probabilities to simple steps.
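The probabilities (3) to (5) can be checked on a small example. The following Python sketch assumes sequences already aliased in the standard order (each SNP is 22, 21 or 11); the function name and the four-sequence toy data are illustrative only.

```python
def snp_probabilities(aliased_rows, n):
    """Allele and SNP-pair probabilities at the n-th SNP (n is 1-based).

    The SNP occupies positions 2n-1 (odd) and 2n (even), which are the
    0-based list indices 2n-2 and 2n-1."""
    m = len(aliased_rows)
    po2 = sum(row[2 * n - 2] == 2 for row in aliased_rows) / m  # po(2)
    pe1 = sum(row[2 * n - 1] == 1 for row in aliased_rows) / m  # pe(1)
    p22 = 1 - pe1          # equation (3)
    p11 = 1 - po2          # equation (4)
    p21 = 1 - p22 - p11    # equation (5)
    return {"po(2)": po2, "pe(1)": pe1, "P(22)": p22, "P(11)": p11, "P(21)": p21}

# Four sequences, one SNP each: 22, 21, 11, 21.
rows = [[2, 2], [2, 1], [1, 1], [2, 1]]
probs = snp_probabilities(rows, 1)
```

Because the higher symbol always sits at the odd position, a 22 pair is exactly the case in which the even allele is not 1, which is why (3) needs only pe(1).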
It should be noted that the symbols 2 & 1 are the designated symbols at the odd and even alleles, but they might not be the most probable symbols at a locus; for example, the data might dictate that at some locus po(1)>po(2), in which case the non-designated symbol is the most probable. To make this clear, this can be written as follows:
probability of most probable allele: Π (7)
probability of rarer allele: Θ=1−Π (8)
The allele odds-ratio is given by
$\omega = \dfrac{p_d/(1-p_d)}{p_c/(1-p_c)}$, (9)
where the subscripts d & c refer to the disease and control cases; odds ratios are a standard way of contrasting the two conditions. This, for example, is a standard package contained in the Plink toolbox. The SNP counterpart will be denoted by Ω, which, as shown in Sirovich, 2014a, 2014b and later here, is less effective than (9). Large values of the odds-ratios have been generally regarded as potentially being associated with disease risk.
It is a general observation for genomic data that pd and pc are not very different, and to explore this consider
$p_d = p_c + \Delta$, (10)
with Δ relatively small. Hence the odds-ratio ω, (9), becomes
$\omega \approx 1 + \dfrac{\Delta}{p_c(1-p_c)}$, (11)
so that ω is large if pc≈0 or 1, thus indicating a possible strong unwanted selection bias.
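The sensitivity of the odds ratio to the control probability can be seen numerically. The Python sketch below assumes the standard definition of an odds ratio, (p_d/(1−p_d))/(p_c/(1−p_c)), and holds the shift Δ fixed at an arbitrary illustrative value while p_c moves toward 0 or 1.

```python
def odds_ratio(p_d, p_c):
    """Standard odds ratio contrasting disease and control probabilities."""
    return (p_d / (1 - p_d)) / (p_c / (1 - p_c))

delta = 0.01  # illustrative fixed shift between disease and control
for p_c in (0.01, 0.10, 0.50, 0.90, 0.98):
    w = odds_ratio(p_c + delta, p_c)
    print(f"p_c={p_c:.2f}  p_d={p_c + delta:.2f}  odds ratio={w:.3f}")
```

The same small shift produces an odds ratio close to 1 in the middle of the range and a much larger one near the ends, which is the selection bias described above.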
Shannon (1948) has demonstrated that entropy,
$S(p) = -\left(p\log_2 p + (1-p)\log_2(1-p)\right) = -\log_2\left(p^p q^q\right) = S(q)$, (12)
provides an optimal basis for unbiased sampling, Jaynes (1955). (Here we again follow the common convention that if p is the probability of one of two events then
q=1−p, (13)
is the probability of the second event.) In the present situation (12) leads to incremental information sampling
$M = S_d - S_c$, (14)
which is the counterpart to the odds ratio. Under the approximation (10) this yields, to first order in Δ,
$M \approx \Delta \log_2\dfrac{1-p_c}{p_c}$. (15)
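The entropy (12) and the incremental information (14) are straightforward to evaluate. The Python sketch below assumes base-2 logarithms and the usual convention that the entropy vanishes at p = 0 and p = 1; the function names are illustrative.

```python
import math

def entropy(p):
    """Binary entropy S(p) in bits, with S(0) = S(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

def incremental_information(p_d, p_c):
    """M = S_d - S_c for one locus."""
    return entropy(p_d) - entropy(p_c)

# S(p) = S(q), so it does not matter which of the two symbols is tracked.
symmetric = abs(entropy(0.2) - entropy(0.8)) < 1e-12
```

Unlike the odds ratio, the entropy is bounded (at most one bit), so M cannot become arbitrarily large at any locus.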
If D is the disease matrix with rows composed of disease sequences,
and C is the comparable control matrix
then the indicator v optimizes the criterion functional
with v suitably normalized, and the columns of C and D are restricted to the chosen set of high risk loci, which might emerge from odds ratios, incremental information or from some other method of choosing the admissible range of risk alleles.
The process of optimizing the above criterion falls under the category of standard mathematics, and an explicit indicator v emerges from typical eigenvector software that is universally available, in particular in Matlab and R. There are two parts to the solution: the first is a reduced set of allele locations, and the second is the set of indicated risk symbols at the corresponding loci, which is the classifier and in a general sense is a word. The indicator is a two-row matrix, the first row giving the loci and the second the classifier ‘word’. A sequence viewed in the space of loci defined by the first row of the indicator vector will be scored by the number of loci at which recorded symbols agree with the Classifier F10 of
Scoring for the FUSION database is displayed in
Additional details and further background are contained in the attached Sirovich 2014a & 2014b publications.
For the type 2 diabetes (T2D) database, the nominal incremental information criterion
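The scoring rule described above, counting the loci at which a sequence agrees with the classifier word, can be sketched in Python as follows. The indicator values and the aliased sequence are hypothetical, and 0-based locus indices are an assumption of the sketch; how the indicator itself is obtained is not shown.

```python
def score(sequence, indicator):
    """Count the loci at which the sequence agrees with the classifier word.

    `indicator` is the two-row structure (loci, word): the first row lists
    the loci, the second the risk symbol expected at each locus."""
    loci, word = indicator
    return sum(sequence[locus] == symbol for locus, symbol in zip(loci, word))

indicator = ([0, 3, 4, 7], [2, 2, 2, 1])   # hypothetical loci and risk symbols
sequence = [2, 1, 1, 1, 2, 2, 2, 1, 2, 1]  # hypothetical aliased sequence
s = score(sequence, indicator)
```

Here the sequence agrees with the word at three of the four loci, so its score is 3.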
M>0.07, (18)
produces roughly 15,000 potential high risk loci, which as mentioned above will be referred to as the admissibility class. This is a relatively large number which includes such unwanted contributions as environmental factors, cultural and ethnic elements and other unknown irrelevant effects, in addition to the actual loci of the specific disease in question. It is the role of the indicator analysis to optimally eliminate the unwanted elements while retaining the disease related loci. The classifier for T2D as determined from incremental information indicator analyses contains 1,355 risk loci. (This should be compared with 4,315 loci when using odds-ratio sampling, Sirovich, 2014a, and therefore represents an immense improvement in precision.) Of more importance is that the resolution factor, p, is now roughly 100 times better than was obtained using the odds ratio sampling of Sirovich, 2014a. In Table 1 below, the top 15 markers out of the 1,355 markers for T2D and their statistics are shown.
The highest value of incremental information is roughly 0.2, and a choice of threshold much greater than 0.07 reduces the size of the admissibility set; e.g., for M>0.09 the size of the admissibility set is 5300 and the classifier is about 400 in length. Use of this produces a large number of false positives and false negatives. In the same vein, a small threshold produces a large admissibility set which overfits the data with a loss of precision. The value in (18) was deemed to be the most suitable over a wide range of trials.
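[Table 1: top 15 T2D loci ranked by incremental information. Table body not recovered; surviving column-heading fragments: d, c, d, c, d/c.]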
In Table 1 the top 15 loci based on incremental information are listed in column 4. The true allele probabilities of (7) & (8) have been used and the notation of the table follows that convention. Note that the probability of the rarer symbol determines the risk.
The largely improbable multiple and prominent appearance of chromosome 7 in column 3 implies a link of T2D with chromosome 7.
A search of prior T2D studies (Perry et al., 2009; Morris et al., 2012; McCarthy & Zeggini, 2009) revealed a list of roughly 120 SNPs that have been associated with T2D. The intersection of this set with the presently studied 272,423 SNPs produced 35 candidate risk SNPs.
rs2237892, rs7578597, rs8050136, rs1111875, rs12970134, rs1387153, rs1470579, rs1496653, rs1552224, rs163184, rs16861329, rs17168486, rs2007084, rs2261181, rs2334499, rs243021, rs2447090, rs2612035, rs340874, rs391300, rs3923113, rs4299828, rs4607517, rs4812829, rs5215, rs6795735, rs7041847, rs7178572, rs7612463, rs7756992, rs7903146, rs8042680, rs831571, rs896854, rs9470794.
Only one of these, rs2237892, meets the conditions for being admissible, i.e., is one of the roughly 15,000 potential loci mentioned above. The pervasive reason for failure to belong to this admissibility class was the condition that
Θd≈Θc, (19)
which, since both rare probabilities are small, implies that the odds ratio is large but the incremental information is near zero. It is highly unlikely that a locus of this sort can be a true predictor. Virtually all loci determined on the basis of odds ratios reported in Sirovich, 2014a were wiped out by this condition.
Allele 347351 is the odd locus of rs2237892, a SNP that has been associated in the literature with T2D. For this locus Θd=0.108 and Θc=0.084, and therefore the odds ratio lies in the steep climb at the left of
Θd is almost 3 standard deviations from Θc, and in the usual manner leads to a significant p-value.
On the basis that Θd/Θc=1.28 it has been said in the literature that if the rarer symbol is found then it is 28% more likely that it signifies T2D rather than normalcy. However, the odds that this symbol will be found are only about 1 in 10, and so this is really a poor predictor of T2D, which is the reason that indicator analysis eliminates this as a candidate locus. By comparison, the top entry in Table 1 states that its symbol will be found roughly 1 in 6 times and, if found, is 68% more likely to signify T2D. As such it is an order of magnitude more effective a predictor of T2D, which is the case for the listed entries of Table 1.
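The figures quoted for the odd locus of rs2237892 (rare-symbol probabilities 0.108 for disease and 0.084 for control) can be worked through in a few lines of Python. The sketch assumes base-2 entropy and the standard odds-ratio definition; the variable names are illustrative.

```python
import math

def entropy(p):
    """Binary entropy in bits (base 2 is an assumption of this sketch)."""
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

theta_d, theta_c = 0.108, 0.084  # rare-symbol probabilities quoted in the text

ratio = theta_d / theta_c                                   # about 1.29
odds = (theta_d / (1 - theta_d)) / (theta_c / (1 - theta_c))  # about 1.32
info = entropy(theta_d) - entropy(theta_c)                  # about 0.078
admissible = info > 0.07                                    # criterion (18)

print(f"ratio={ratio:.3f}  odds ratio={odds:.3f}  M={info:.4f}  admissible={admissible}")
```

Under these assumptions the locus narrowly exceeds the 0.07 threshold, consistent with it being the only one of the 35 literature SNPs found admissible.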
The gold standard for determining the value of any classifier of disease is its ability to predict that disease. This section provides verification of the methodologies by providing virtually compelling evidence of disease prediction.
To test the predictive ability of indicators as disease classifiers a randomly chosen fixed percentage of the case and control sets are made to serve as the training set by which to determine a disease classifier, and then to interrogate the remaining test set for its success as a predictor. The results after doing this repeatedly are displayed in Table 2.
In Table 2 the first line gives the training set fraction, the second the average size of the classifier and the third the predictive success rate.
For all cases of the table, the incremental information threshold was taken as M=0.07, which was deemed to be better than other values. The number of trials was chosen so that there would be a strong likelihood that each sequence of the database would eventually appear in the test set at least once. For example, for the training fraction of 0.99 there were 9 case and 7 control sequences in each test set, randomly chosen in more than 1300 trials; for all other cases there were at least 500 trials. The indicator vector size and success rates represent averages over all trials.
The 70% partition clearly gives the best success rate and therefore was subjected to intense testing. The evidence for the 62.4% success rate of prediction is compelling, as is the evidence that with more data the success rate will become more robust (Sirovich, 2014b).
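The choice of the number of trials can be sanity-checked with a short Python calculation. It idealizes each trial as an independent random split in which a given sequence falls in the test set with probability equal to the test fraction, and treats sequences as independent; both are simplifying assumptions of this sketch.

```python
def prob_in_some_test_set(train_fraction, trials):
    """Chance that a given sequence lands in the test set at least once,
    assuming independent random splits (an idealization)."""
    return 1.0 - train_fraction ** trials

n_sequences = 919 + 787  # cases plus controls in the FUSION example

p_single = prob_in_some_test_set(0.99, 1300)
p_all = p_single ** n_sequences  # approximate chance that every sequence is tested

print(f"single sequence: {p_single:.6f}   all sequences: {p_all:.4f}")
```

With a training fraction of 0.99 and 1300 trials, the chance that any one sequence is never tested is of the order of a few in a million under these assumptions.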
A further improved method for the search and comparison of genomic sequences, in the form of disease and control populations, finds correlated activity of individual alleles. It involves determining correlations by the Pearson formula for the disease matrix of rare symbols (mutants), M; other quantities follow standard definitions.
From this follows a method that finds sets of alleles exhibiting correlated activity and, from these, selects alleles for which disease loci are substantially greater in the disease population than in the control population.
The aforementioned method additionally uses incremental information, as disclosed in U.S. patent application Ser. No. 14/268,982, incorporated as if fully set forth herein, in the selection procedure. This method predicts disease based on a selected set of alleles, Al, along with a set of mutant symbols, W, that exhibit sufficiently greater correlated allele activity compared to the control case, termed the linkage ratio, as well as high incremental information as described in my aforementioned application.
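The Pearson correlation between allele columns of a rare-symbol matrix can be sketched in plain Python as below. The matrix is a hypothetical 0/1 indicator (1 where the rare symbol occurs), and returning 0 for a constant column is a convention of this sketch. The linkage ratio itself is not computed here, since its formula is not reproduced in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: no correlation defined (sketch convention)
    return cov / math.sqrt(vx * vy)

def correlation_matrix(rows):
    """Correlations between all pairs of allele columns."""
    cols = list(zip(*rows))
    return [[pearson(a, b) for b in cols] for a in cols]

# Hypothetical rare-symbol matrix: rows are sequences, columns are alleles.
mutants = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
R = correlation_matrix(mutants)
```

Computing the same matrix for the control population would allow the two sets of correlations to be contrasted allele pair by allele pair.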
The classifier of disease in accordance with the method is constructed in the matrix form
A putative sequence can then be assigned an agreement score Sd with the classifier.
The method for determining a wellness classifier can now be carried out by the above described steps.
This is obtained by interchanging the disease and control populations in the above algorithm, and this also leads to a wellness classifier and a wellness score, Sc.
The degree of disease versus wellness of a putative genomic state sequence can be measured by data obtained from a patient. This is based on assessing the difference score Sd-Sc.
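A minimal Python sketch of the difference score follows. Both classifiers are given in the (loci, word) form; all loci, symbols and the patient sequence are hypothetical values chosen only to illustrate the arithmetic.

```python
def agreement(sequence, classifier):
    """Number of classifier loci at which the sequence carries the listed symbol."""
    loci, word = classifier
    return sum(sequence[locus] == symbol for locus, symbol in zip(loci, word))

def difference_score(sequence, disease_classifier, wellness_classifier):
    """Sd - Sc: positive values lean toward disease, negative toward wellness."""
    return agreement(sequence, disease_classifier) - agreement(sequence, wellness_classifier)

disease = ([0, 2, 5], [2, 1, 2])    # hypothetical disease classifier
wellness = ([1, 4], [2, 1])         # hypothetical wellness classifier
patient = [2, 1, 1, 1, 2, 2]        # hypothetical aliased patient sequence

diff = difference_score(patient, disease, wellness)
```

For this toy patient Sd = 3 and Sc = 0, giving a difference score of 3.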
The disease and wellness classifier is obtained on the basis of the specific ancestry of the disease and control populations for a specific disease. More than one ancestry disease-wellness classifier pair is assembled on a gene array for a specific disease. More than one ancestry disease-wellness classifier pair of one disease is assembled on a gene array. Many ancestry disease-wellness classifiers for many diseases can also be assembled on a gene array. The composition is determined of the probative gene arrays for a particular disease as used to assemble disease/control databases known as GWAS (Genome Wide Association Studies). The general approach involves study of a relatively small number of disease and control patients and the use of an extremely large number of possible alleles for the purposes of later GWAS data acquisition. A far better selection procedure for choosing the alleles would be based on incremental information (as disclosed in the aforementioned patent application) and the present linkage criterion of this application. The method can be used for assembling wide ranging ancestries and diseases on a single gene array for large-scale public health applications. The method is useful for determining a probative gene array on the basis of genome scans, based on the above steps, which searches for classifiers of high and low vulnerability to pharmaceuticals that carry the risk of detrimental side effects. For example, there are extremely effective drugs for treating osteoporosis on the market which nevertheless produce dangerous side effects in a very small set of cases. Identifying possible users who are vulnerable to the drug would greatly reduce “side effects” in the population, which would result in greater use and benefit in the general population. The method is useful for assembling a gene array to classify vulnerability to side effects of one pharmaceutical product.
The method is also useful for assembling the gene array to classify vulnerability to side effects of more than one pharmaceutical product.
Features and benefits or advantages of the invention over prior art approaches include:
Table 3 illustrates the results when T2D databases were examined; Wellcome Trust (WT) is far and away the most successful. The first line of the table, D/C, indicates potential locations for finding a disease classifier, while the second line, C/D, shows potential wellness loci. It is clear that in constructing the gene arrays for the Starr County (ST) & Fusion (FUS) databases there was no focus on wellness loci, which clearly prove to be as important as disease loci.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14268982 | May 2014 | US |
Child | 15614230 | | US |