The variations in many of the characteristics of people have a genetic basis (i.e., are the result of differences in their DNA), which changes the expression of proteins relevant to the characteristics. Sometimes the differences involve one or more base pairs. The single base pair differences are called single nucleotide polymorphisms, or SNPs for short.
Some SNPs do not appear to be associated with physical changes in people, but others may. Significant effort has been made to identify as many SNPs as possible in the human genome. SNPs that do not themselves change protein expression and cause disease may be close on the chromosome to harmful mutations. Because of this proximity, such SNPs may be correlated with harmful but unknown mutations and serve as markers for them. Such markers can help in discovery and identification of the mutations and aid the development of therapeutic drugs. Also, analyzing shifts in SNP allele frequency among different groups of people may help population geneticists to trace the evolution of the human race and to unravel the connections between different ethnic groups and races.
While a significant number of SNPs have been identified, it is not a simple matter to determine how variations in phenotypic characteristics, e.g. diseases, are related to individual SNPs, combinations of SNPs or sequences of SNPs, within a gene or an entire genotype. One method of trying to determine the genetic basis of a particular characteristic is an association study. Association studies attempt to identify the differences at a genetic level between people in which a characteristic of their phenotype (e.g. a particular disease) is present (“case” or “affected”) and people in which the characteristic is apparently not present (“control” or “unaffected”) in order to identify the genetic basis for the characteristic.
A significant portion of human DNA (over 99%) is invariant and SNPs make up a small amount of genetic information. Moreover, a large majority of SNPs do not appear to be associated with a particular characteristic of the phenotype in a simple way. Rather, expression of a characteristic in a phenotype appears to be determined in subtle and complex ways by many different and various combinations of several SNP alleles. Therefore, association studies are trying to identify a small and subtle signature within a very large amount of genetic information.
Nucleic acid arrays have been successfully used in association studies. One significant advance in association studies is outlined in, for example, U.S. patent application Ser. No. 10/106,097, filed Mar. 26, 2002 entitled “Methods for Genomic Analysis”, which is incorporated herein by reference. According to these methods, a significant number of SNPs are first identified across the genome, and optionally grouped into SNP haplotype groups. Thereafter, the SNPs are screened in case and control groups to identify the SNPs that correlate with a phenotype.
The following patents and patent applications are incorporated herein by reference in their entirety: U.S. Pat. No. 5,800,992, filed Jun. 25, 1996, entitled “Method of Detecting Nucleic Acids”; U.S. Pat. No. 5,861,242, filed Jan. 9, 1997, entitled “Array of Nucleic Acid Probes on Biological Chips for Diagnosis of HIV and Method of Using the Same”; U.S. Pat. No. 6,228,593, filed Jan. 14, 2000, entitled “Computer-Aided Probability Base Calling for Arrays of Nucleic Acid Probes on Chips”; U.S. patent application Ser. No. 09/922,492, filed Aug. 3, 2001, entitled “High Performance Wafer Scanning”; U.S. patent application Ser. No. 10/131,832, filed Apr. 24, 2002, entitled “Methods for Reducing Complexity of Nucleic Acid Samples”; U.S. patent application Ser. No. 10/236,480, filed Sep. 5, 2002, entitled “Methods for Amplification of Nucleic Acids”; and U.S. patent application Ser. No. 10/341,832, filed Jan. 14, 2003, entitled “Short Range PCR Primer Picking”.
While meeting with significant success, it is desirable to further improve the speed and efficiency with which association studies are performed, and further decrease the cost of such studies. The present invention meets these and other needs.
The present invention provides computer-implemented methods, apparatus, and systems for obtaining and analyzing nucleic acid sequence data. In particular, methods, apparatus, computer systems and programs are provided for analyzing data from nucleic acid arrays to characterize bi-allelic markers, such as single nucleotide polymorphisms (SNPs), in nucleic acid sequences.
In one aspect of the invention, a computer-implemented method is provided for characterizing a position in a nucleic acid segment or sequence. In one embodiment, the method comprises: inputting into a computer system a first measure of relative allele frequency at the interrogation position in the nucleic acid segment derived from a first sample collected from a first group of n individuals, wherein n is an integer equal to or larger than 2; inputting into the computer system a second measure of relative allele frequency at the interrogation position in the nucleic acid segment derived from a second sample collected from a second group of m individuals, wherein m is an integer equal to or larger than 2; and analyzing in the computer system the first measure and the second measure to characterize the interrogation position, for example, as being associated with a phenotypic characteristic of interest.
In another aspect of the invention, a data processing apparatus is provided for characterizing an interrogation position in a nucleic acid segment. The data processing apparatus comprises: a data processor; a storage device holding computer readable code in communication with the data processor, the computer readable code including: computer code which determines a first measure of relative allele frequency at the interrogation position in the nucleic acid segment derived from a first sample collected from a first group of n individuals, wherein n is an integer equal to or larger than 2; computer code which determines a second measure of relative allele frequency at the interrogation position in the nucleic acid segment derived from a second sample collected from a second group of m individuals, wherein m is an integer equal to or larger than 2; and computer code which analyzes the first measure and the second measure to characterize the interrogation position, for example, as being associated with a phenotypic characteristic of interest.
In yet another aspect of the invention, a computer readable medium is provided for holding computer readable code for characterizing a position in a nucleic acid segment and for carrying out the processes of: determining a first measure of relative allele frequency at the interrogation position in the nucleic acid segment derived from a first sample collected from a first group of n individuals, wherein n is an integer equal to or larger than 2; determining a second measure of relative allele frequency at the interrogation position in the nucleic acid segment derived from a second sample collected from a second group of m individuals, wherein m is an integer equal to or larger than 2; and analyzing the first measure and the second measure to characterize the interrogation position as being associated with a phenotypic characteristic of interest.
The above-described embodiments and other aspects of the invention are described in detail below in the section of “Detailed Description of the Invention”.
The invention, together with further advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:
In the Figures, like reference numerals refer to like components and elements.
The present inventions relate to computer programs for analysis and characterization of nucleic acid sequences, for example, bi-allelic markers in nucleic acid sequences. The inventions are particularly applicable in the field of association studies but are not limited to such applications. Rather, they can be used whenever it is desirable to reliably characterize a nucleotide position in a nucleic acid sequence. Such a nucleotide position is hereinafter referred to as “an interrogation position” which can be any nucleotide position within a nucleic acid sequence of interest, including but not limited to a SNP position or a nucleotide position within an expressed sequence tag (EST), a genome survey sequence (GSS), or a sequence tagged site (STS) that is operationally unique in a genome.
1. SNPs in the Human Genome Sequence
The invention is particularly applicable in analyzing and characterizing human DNA sequences collected-from different groups of individuals.
The published human genome sequences are composites made up of DNA sequences from a variety of people. One ‘correct’ DNA sequence cannot be assumed, but rather such composites can be considered a reference DNA sequence with which the DNA sequences of individuals can be compared. The human genome includes mutations which may or may not be common to the population in general. In general, “common” SNPs will be most useful herein, e.g., those that occur in more than 10% of the population, although in some embodiments common and rare SNPs may be utilized. As illustrated in
At the first position 11 in the reference HGS segment 10, Persons 1, 2 and 3 have the same base A as the base in the reference sequence A on both chromosome segments. Therefore, in general, this position is not considered to be a SNP position (based on the data in
At the second position 12, the reference sequence base T is different to the base G in all three people. This position may not be considered a common SNP, as all the population have the same different base, which may indicate that the reference sequence has a unique mutation at this position which is not present in any other person (recalling that the human genome sequence is merely an ad hoc reference sequence and not the only ‘correct’ sequence). Or, there may simply be a sequencing error in the reference sequence.
At the third position 13, Person 3 has a different base T in its maternal chromosome (17m) as compared to the reference base C in its paternal chromosome (17p) at the same position. This location is designated as a singleton SNP since it occurs infrequently in the population (e.g., only once in one chromosome within a population of 3 individuals in this example).
The fourth position can be considered a common SNP position. The reference allele in the HGS reference segment is A and Person 3 has the reference allele on both chromosomes 17m and 17p. Such a person has a homozygous reference genotype. Person 1 has the reference allele A on the paternal chromosome (15p) and the alternate allele G on the maternal chromosome (15m). This person has a heterozygous genotype. Person 2 has the alternate allele G on both maternal and paternal chromosomes (16m and 16p) and has a homozygous alternate genotype. Position 14 can be considered to be a common SNP as the alternate allele is common in the population. Although common SNPs are often interrogated in association studies to characterize as being associated with the phenotypic characteristic under investigation, rare and singleton SNPs can also be interrogated for the same purpose.
SNPs are most frequently biallelic, i.e.,. are characterized by having only two possible alleles. In this example the reference allele is A and the alternate allele is G, and not T or C. For each supposed SNP position identified in the human genome the reference and alternate alleles have been experimentally determined.
Preferably, the invention is applied to characterize SNP positions in nucleic acid segments collected from different groups of individuals in genotype-phenotype association studies. Association studies are useful for the identification of genetic components (e.g., genes and their regulatory regions, and unexpressed regions of genomic DNA) associated with phenotypic traits. These genetic components may be directly or indirectly involved in the manifestation of the phenotypic trait, whether causative or predictive. These genetic components may occur at one specific locus in the genome or at multiple loci on the same or different chromosomes.
2. Genotype-Phenotype Association Studies
Association studies may be accomplished by determining the genotypes (e.g. which allele is present at each of a given set of polymorphisms, which are typically SNPs) of individuals with the phenotype of interest (for example, individuals exhibiting a particular disease or individuals who respond in a particular manner to administration of a drug) and comparing the genotypes of these individuals to the genotypes of a control group of individuals who do not exhibit the phenotype of interest.
To facilitate an association study, a computer implemented method is provided herein for characterizing a single or multiple interrogation positions, such as SNP positions, in nucleic acid segments collected from case and control groups of individuals. According to the method, relative allele frequencies for the case and control groups at an interrogation position in the nucleic acid segments are input into a computer and analyzed in order to assess if the interrogation position can be characterized as being associated with the phenotypic characteristic of interest. Optionally, the genotypes of two or more populations of individuals may be compared on the basis of, for example, age, gender, ethnicity, or geographic location.
The phenotype of interest may be a disease, condition or other characteristic found in humans. A disease may be manifested as the presence of signs and/or symptoms in an individual or patient that are generally recognized as abnormal. Diseases may be diagnosed and categorized based on pathological changes. Signs may include any objective evidence of a disease such as changes that are evident by physical examination of a patient or the results of diagnostic tests which may include, among others, laboratory tests to determine the presence of variances or variant forms of certain genes in a patient. Symptoms are signs or indications in a patient of a disease, disorder, or condition that differs from normal function, sensation, or appearance, which may include, without limitation, physical disabilities, morbidity, pain, and other changes from the normal condition exhibited by an individual. The phenotypic characteristic of interest may also be an individual's reaction to administration of an agent (e.g., a drug, food, or alcohol) or other interventions, adverse or beneficial. The person's reaction may be susceptible or resistant to the effect(s) of the agent.
Examples of diseases or conditions of interest include, but are not limited to, neural disorders, connective tissue disorders, immune system disorders, skeletal/bone disorders, hematologicaL disorders, muscular disorders, hormonal disorders, infection by pathogens such as viruses, bacteria and parasites, reproductive disorders, gastrointestinal disorders, pulmonary disorders, cardiovascular disorders, renal disorders, proliferative disorders, and cancerous disease conditions. Specific examples of diseases include, but are not limited to, Alzheimer's disease, Huntington's disease, Creutzfeldt-Jacob syndrome, vascular dementia, fragile-X syndrome, primary or acquired immunodeficiency, viral infection, bacterial infection, parasitic infection, pneumonia, meningitis, herpes zoster, diabetes mellitus, ulcer, gastric reflux, liver diseases, autoimmune diseases such as multiple sclerosis, rheumatoid arthritis, Graves' disease, systemic lupus erythematosus, diabetes mellitus, aseptic meningitis, systemic scleroderma, autism, allergies, asthma, adult-onset idiopathic hypoparathyroidism and membranous glomerulonephritis, kidney failure, metabolic and congenital kidney disorders, heart disease, obesity, limb ischemia, leukemia, Hodgkin's disease, non-Hodgkin's lymphoma, and neuplasma located in the colon, abdomen, bone, breast, digestive system, liver, pancreas, peritoneum, endocrine glands (adrenal, parathyroid, pituitary, testicle, ovary, thymus, thyroid), eye, head and neck, nervous (central and peripheral), pelvis, skin, soft tissue, spleen, thorax, and urogenital tract.
Although SNPs within the coding regions of DNA may have functional consequences, SNPs identified in other portions of a gene or in nongenetic regions can provide identification of genetic polymorphisms of functional significance in an association study. For example, SNPs in the promoter, enhancers, silencers, or other regulatory sites, introns, splice sites, and 3′ untranslated regions of a gene may affect the transcription, translation and message processing of the genetic information thereby affecting protein levels or concentrations, availability, or activity. Many of these alterations may or may not affect protein cellular functions. However, the polymorphism may affect the response of an individual to a particular drug or treatment, for example, by interfering with the therapeutic effect on a target protein or nucleic acid. Thus, SNPs in both coding regions of DNA and noncoding regions of the genome can be analyzed in an association study, e.g., to identify SNPs associated with a particular response to a drug treatment. In one embodiment, loci across the part or all of the genome are evaluated without regard to whether any particular region is believed before the study to have a functional role. Another embodiment, SNPs in a particular region or regions, such as in the vicinity of a gene, which may be a coding or noncoding region(s), can be evaluated without previous knowledge that the region is associated with the phenotype of interest.
Once a genetic locus or multiple loci in the genome are associated with a particular phenotypic trait, such as disease susceptibility or drug response, the gene or genes or regulatory elements responsible for the trait may be identified. These genes or regulatory elements may then be used as therapeutic targets for the treatment of the disease, for more efficacious medicine specifically tailored to the genetic makeup of an individual, to develop diagnostics for identifying individuals with the trait, for further studying the biological pathways underlying the trait, or in a variety of other applications.
The control group 21 includes six people A-F chosen at random and each believed not to have the disease that is the subject of the study. The case group 22 includes six people A′-F′ each believed to have the disease. For the purpose of illustration, assume there are in fact six SNP positions, 1-6, which are associated with the disease, out of all the SNP positions in the human genome sequence. At each of the six disease SNP positions, the presence of a particular allele (designated by “x”) is involved with the occurrence of the disease. However, for this example, the individuals who display the trait of interest have at least two of the “x” alleles.
The purpose of an association study is to characterize each SNP position analyzed as being associated with the phenotypic characteristic of interest. In other words, in the context of the model illustrated in
This can be achieved by measuring and analyzing the relative allele frequency, P, of a SNP position for the case and control groups. That is, the allele frequencies of the alleles that are involved in the phenotypic trait of interest will be significantly higher in the case group (which displays said trait) than in the control group (which does not). Exemplary embodiments of methods for carrying out experiments and analyzing experimental data will now be described in greater detail.
1. General Scheme
Either the whole human genome or a region of interest within the human genome is selected at step 32 and the SNP positions in that region are identified and selected at step 34 using any available source of human genome data that contains information regarding SNPs, such as, for example, the U.C. Santa Cruz Human Genome Browser Gateway (http://genome.ucsc.edu/cgi-bin/hgGateway) or the NCBI dbSNP website (http://www.ncbi.nlm.nih.gov/SNP/). The reference and alternate allele data for the selected SNPs is used at step 36 to design and manufacture the assays, such as the probe arrays to be used in the experiment as will be described in greater detail below. The actual SNPs selected to be used in the experiment will depend on the nature, e.g. comprehensiveness, of the experiment being carried out, and may include common, rare, and singleton SNPs. Preferably, common SNPs are used. Preferably more than 300,000 SNPs are used, preferably more than 600,000 SNPs are used, and preferably more than 1 million SNPs are used across the genome.
Biological samples, for example, blood or tissue samples, are taken from individuals who are either “affected” (exhibit the phenotype of interest) or “unaffected” (do not exhibit the phenotype of interest) at step 38. The individuals who are affected form the “case” group, and the individuals who are unaffected form the “control” group at step 40. The number of individuals in the case group and in the control group may be the same or different, each preferably equal to or larger than 5, optionally equal to or larger than 20, optionally equal to larger than 50, optionally equal to or larger than 100, optionally equal to or larger than 200, optionally equal to or larger than 500, optionally equal to or larger than 1000, optionally between 10-100,000, optionally between 100-10,000, or optionally between 200-1,000. Further, studies may involve multiple phenotypes and, as such, a given population may be subdivided differently depending on the phenotype of interest under study. For example, one population may be studied for both heart disease and stroke, and while some individuals may be in the case group for both, other individuals may be in the case group for heart disease and in the control group for stroke, or vice versa. Thus, the two case groups from the same population would have different constituent individuals depending on which phenotype was being used as the criteria for dividing the population.
The samples for the association study may be pooled either as tissue samples from the case or control group, or optionally by pooling genetic materials from the group (i.e., nucleic acid such as genomic DNA, mitochondrial DNA, extragenomic DNA, cDNA, or RNA), and optionally by further amplification of the pooled genetic materials. The pooled nucleic acid may be amplified separately for the case and control groups, optionally under the same conditions of amplification reactions (e.g., PCR), optionally in PCR reactions run in parallel, and optionally in PCR reactions carried out by the same person. The amplicons of genetic materials from the case and control groups may be both labeled with a detectable marker (e.g., cychrome, fluorescein, Alexa-488, radioisotopes and biotin). Alternatively, the amplicons of genetic materials from the case may be labeled with a detectable marker (e.g., cychrome) different from the detectable marker on amplicons of the control group (e.g., fluorescein). Two-color labeling is described in U.S. Pat. No. 6,342,355, incorporated herein by reference in its entirety. The amplicons from the case and control groups can be pooled together to form one sample if different labels are used, in one embodiment. In certain embodiments, the genetic materials are labeled with a detectable marker prior to application to an array; in certain embodiments, the genetic materials are further stained with a detectable marker after application to the array.
The tissue samples for the members of the control group are pooled at step 42 which provides a pool of genetic material of the individuals in the population of the control group. Similarly, the tissue samples of the case group are pooled at step 42 to provide a pool of genetic material of the individuals in the population of the case group. Optionally, genetic materials may be isolated from each individual in the case or control group and pooled together to provide a sample for the association study.
By pooling tissues or nucleic acids from individuals in each group and characterizing SNPs in the pool, the cost of an association study can be significantly reduced by not scanning all SNPs separately in each individual. In one embodiment, two or more “rounds” are performed in an association study. In one round, all or a large fraction of the SNPs selected at step 34 are evaluated in a pooled manner. A subset of those SNPs (for example, those that are found to be in regions of the genome that are potentially involved in manifestation of the phenotype of interest or those that, based on the pooled analysis, appear to be associated with the phenotype of interest) are used in a second round in which each individual is genotyped for only those potentially associated SNPs. The efficiency of the association study is significantly increased by characterizing the SNPs that have been pooled together as compared to individually characterizing each SNP separately.
For example, for a study with case and control groups of 100 individuals each, 100-fold fewer experiments are required if samples are pooled for each group. In the case where two rounds are performed, the number of SNPs that need to be further characterized is significantly reduced e.g., from millions in the first round to a few thousands or fewer in the second round.
Target nucleic acids in the samples may be prepared using any technique known to those skilled in the art, which, preferably, results in the production of a nucleic acid molecule sufficiently pure to determine the presence or absence of one or more variations (e.g. SNPs) at one or more locations in the nucleic acid molecule. It is noted that although genomic DNA is often as a target nucleic acid, other nucleic acid samples may be used, such as RNA (e.g., mRNA) and cDNA, and both human and non-human (e.g., plant, bacterial, fungal, avian, reptilian, amphibian, fish or non-human mammalian) nucleic acid samples.
The DNA sequences of the case and control pools may be amplified by performing PCR reactions at step 44. Methods for performing PCR reactions are described in Innis et al. “PCR Protocols: a Guide to Methods and Applications”, Academic Press, Inc., 1990, and in U.S. patent application Ser. No. 10/042,492, filed Jan. 9, 2002, entitled “Methods for Amplification of Nucleic Acids”, both of which are incorporated herein by reference. The DNA samples from the case and control pool can be amplified as many times as required to provide enough DNA to allow statistically significant differences in SNP frequencies to be detected in the case and control groups. The control and case pools can each be amplified by two separate PCR reactions to provide sufficient pooled DNA case and control group samples to allow an experiment to be repeated, for example, three times for each PCR reaction, thereby providing six sets of experimental results for each group: three for each of the case and control groups from the first PCR reaction and three for each of the case and control groups from the second PCR reaction. Optionally, PCR may be performed on nucleic acid contained in a sample from each individual in the case or control group prior to the pooling of his/her sample with others in the group.
The SNPs are then assayed to determine the relative allele frequency of each SNP in the case and control groups. In one embodiment, the allelic frequency of SNPs or other genetic information can be detected by using oligonucleotide arrays. Depending on the number of SNPs to be screened, for example, oligonucleotide arrays with low, medium or high density can be employed. The density of the oligonucleotide array may be higher than 100 probes per square centimeters, optionally higher than 1000, optionally higher than 10,000, optionally higher than 1,000,000, optionally between 100-100,000,000, and optionally between 1,000,000-80,000,000 probes per square centimeters. Of course, other assays may also be utilized, such as sequencing methods, mass spectroscopy and electrophoresis.
Preferably, high-density oligonucleotide array chips or larger DNA probe array wafers (from which individual chips would otherwise be obtained by breaking up the wafer) are used in one embodiment of the invention. DNA probe array wafers generally comprise glass wafers on which high density arrays of DNA probes (short segments of DNA) have been formed. Each of these wafers can hold, for example, approximately 60 million or more DNA probes that are used to recognize DNA sequences. The recognition of sample DNA by the set of DNA probes on the glass wafer takes place through the mechanism of DNA hybridization. When a DNA sample hybridizes with an array of DNA probes, the sample binds to those probes that are complementary to the sample DNA sequence. By evaluating which probes hybridize to the sample DNA more strongly, it is possible to determine whether or not a known sequence of DNA is present in the sample DNA, for example, to detect the presence of a SNP.
The use of DNA probe arrays to obtain genetic information involves the following general steps: design and manufacture of DNA probe array wafers, preparation of the sample, hybridization of target DNA to the array, detection of hybridization events and data analysis to determine sequence. Preferred wafers are manufactured using a process adapted from semiconductor manufacturing to achieve cost effectiveness and high quality, and are available from Affymetrix, Inc. of California.
Probe arrays can be manufactured by a light-directed chemical synthesis process, which combines solid-phase chemical synthesis with photolithographic fabrication techniques as employed in the semiconductor industry. Using a series of photolithographic masks to define chip exposure sites, followed by specific chemical synthesis steps, the process constructs high-density arrays of oligonucleotides, with each probe in a predefined position in the array. Multiple probe arrays are synthesized simultaneously on a large glass wafer. This parallel process enhances reproducibility and helps achieve economies of scale.
Once fabricated, DNA probe arrays can be used to obtain genetic information about nucleic acid samples. The nucleic acid samples are tagged with a fluorescent reporter group by standard biochemical methods. The labeled samples are incubated with a DNA probe array, and segments of the samples bind, or hybridize, with complementary sequences on the DNA probe array. The DNA probe array is then scanned and the patterns of hybridization are detected by emission of light from the fluorescent reporter groups. Because the identity and position of each probe on the DNA probe array is known, the nature of the nucleic acid sequences in the sample applied to the DNA probe array can be determined. When these arrays are used for genotyping experiments, they may be referred to as genotyping arrays.
Once fabricated the arrays are ready for hybridization. The nucleic acid sample to be analyzed is isolated, amplified and labeled with a fluorescent reporter group. The labeled nucleic acid sample is then incubated with the array using a fluidics station and hybridization oven. After the hybridization reaction is complete, the array is inserted into the scanner, where patterns of hybridization are detected. The hybridization data are collected as light emitted from the fluorescent reporter groups already incorporated into the labeled nucleic acid, which is now bound to the probe array. Probes that most clearly match the labeled nucleic acid bind more of the nucleic acid, and hence accumulate more of the fluorescent signal than those that have mismatches. Since the sequence and position of each probe on the array are known, by complementarity, the identity of the nucleic acid sample applied to the probe array can be identified.
In reference to
In another embodiment, the case and control pools are differentially labeled and combined into a single pool and are hybridized with a single set of the designed genotyping arrays. In this way case and control group data can be obtained from the same physical arrays. More than one set of genotyping arrays can be used to replicate the experiment. Labels that may be used include, but are not limited to, cychrome, fluorescein, Alexa-488, or biotin (in certain embodiments, later stained with phycoerythrin-streptavidin after hybridization). In certain embodiments, the genetic materials are labeled with a detectable marker prior to application to an array; in certain embodiments, the genetic materials are further stained with a detectable marker after application to the array. Two-color labeling is described in U.S. Pat. No. 6,342,355, incorporated herein by reference in its entirety. Each array may be scanned such that the signal from both labels is detected simultaneously, or may be scanned twice to detect each signal separately. It has been found that measured intensities for cychrome and fluorescein labeled pools can have a non-linear relationship. One origin of this effect is believed to be due to the saturation of sample molecules with cychrome. Without wishing to be bound by theory, it is believed that adjacent cychrome on the surface of the labeled molecules may interfere and reduce the amount of light emitted. Therefore, in one embodiment, only a single staining step is used for cychrome, rather than two stains. Alternatively, a reduced amount of cychrome can be used in the staining in order to reduce the non-linearity in detected intensity. In certain embodiments, fluorescein-labeled pools may be further stained with Alexa-488. By the comparing the intensity of the label corresponding to the case group to the intensity of the label corresponding to the control group at a specific probe on the array, one can determine if a particular SNP allele appears more commonly in the case or control group.
The data analysis at step 48 characterizes the SNPs as either being associated with the phenotype being studied or not associated with the phenotype being studied. In one embodiment, the method can reduce the initial set of SNPs to a much smaller set of associated SNPs, probably several orders of magnitude fewer, e.g. tens of thousands from hundreds of thousands. The result of the first set of experiments is a reduced set of SNPs comprising a set of SNP positions 50, which have been-characterized as likely being associated with the phenotype being studied. Of course, in this embodiment, one will use a rougher filter that will potentially capture many non-associated SNPs as well as associated SNPs.
Typically, the association study experiments are repeated, but using the reduced set of associated SNPs 50, in the selection 34 of the SNPs to be used in designing a second set of arrays. Hence a second iteration of the association study experiments may be performed characterizing fewer SNPs (e.g., thousands of SNPs) resulting from the first experiment than those in the first round (e.g., millions of SNPs). Hence, the actual hybridization experiments can be replicated an increased number of times to help reduce random variations in the experimental data due to, for example, sampling or experimental error.
In one preferred embodiment, the association study is performed by using individual samples with the genotyping array (or other assays), rather than using pooled samples. Analysis of the data at step 48 during this second iteration further reduces the set of previously characterized SNPs so that a more definitive characterization of SNPs associated with the disease can be obtained. For example, after the second analysis on the reduced set of SNPs a set of tens, hundreds or thousands of SNPs characterized as associated with the disease can be identified. The set of SNPs likely associated with the disease having been identified, the particular alleles at those SNP positions can be determined from the genotypes of the members of the case and control groups, providing an indication of the genetic basis of the disease or condition. These associated SNPs, their alleles and allelic frequencies, and the genetic regions in which these SNPs are found can be used for a wide variety of purposes, e.g., to screen for individuals who may be susceptible to the disease, to help in the development of medicaments and diagnostics, or in other applications.
For example, if the associated SNPs are located in the coding region of a gene, function of this gene may be implicated in the manifestation of the phenotype of interest (e.g. succeptibility or resistance to a disease). In one embodiment, the gene containing the associated SNPs can be cloned into an expression vector for expressing the encoded protein recombinantly. Optionally, the encoded protein may be expressed in its native cellular environment, or isolated from other cellular components. Various agents can be screened for the ability to modulate activities of the gene and its product, including but not limited to small molecule compounds, antisense molecules, ribozymes, triple helix molecules, antibodies and other protein or peptide modulators. If the gene encodes a receptor which has a known cognate ligand, agonists and antagonists against the receptor may be developed based on the structures of the receptor and/or ligand. The agonists and antagonists may be tested in an in vitro or in vivo assay, for example a cell-based assay expressing the gene product, alone or in competition with the cognate ligand. Changes in the activities of the gene or its product in the presence or absence of the agent can be monitored using various detection methods known in the art, including but not limited to hybridization arrays, radioisotope-labeling, immunochemical staining, and fluorescence-activated cell sorting (FACS). Depending on the role of the gene played in the onset and/or progression of the disease/condition, the agent that effectively promotes or inhibits the activities of the gene or its product can be selected and used as a lead agent to be further developed into a pharmaceutical agent.
If the associated SNPs are located in a regulatory region of a gene, control of the expression of the gene may be implicated in the manifestation of the phenotype of interest. Various agents can be screened for the ability to modulate expression of the gene, including but not limited to small molecule compounds, antisense molecules, ribozymes, triple helix molecules, antibodies and other protein or peptide modulators. For example, antibodies or a protein modulator of the gene expression may be developed by using a yeast-two-hybrid system to screen for molecules that interfere with the binding of a transcription factor for the gene, thereby altering the control mechanism of the regulatory region containing the associated SNPs. Changes in the expression levels of the associated gene in the cells can be monitored using various detection methods known in the art, including but not limited to hybridization arrays, radioisotope-labeling, immunochemical staining, and fluorescence-activated cell sorting (FACS). Depending on the role of the gene played in the onset and/or progression of the disease/condition, the agent that effectively increases or decreases expression levels of the gene can be selected and used as a lead agent to be further developed into a pharmaceutical agent.
Various methods can be used in the analysis at step 48 of the experimental data to characterize the SNPs, and several exemplary embodiments will be described below.
2. Methods for Analysis of Experimental Data
The different data analysis methods can be implemented wholly, or in part, by suitable computer processes. Certain embodiments of the present invention employ processes acting under control of instructions and/or data stored in or transferred through one or more computer systems. Embodiments of the present invention also relate to an apparatus for performing these operations. This apparatus may be specially designed and/or constructed for the required purposes, or it may be a general-purpose computer selectively activated or reconfigured by a computer program and/or data structure stored in the computer. The processes presented herein are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required method steps. A particular structure for a variety of these machines will appear from the description given below.
In addition, embodiments of the present invention relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, magnetic tape; optical media such as CD-ROM devices and holographic devices; magneto-optical media; semiconductor memory devices, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM), and sometimes application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and signal transmission media for delivering computer-readable instructions, such as local area networks, wide area networks, and the Internet. The data and program instructions of this invention may also be embodied on a carrier wave or other transport medium (e.g., optical lines, electrical lines, and/or airwaves).
For example, the program instruction of this invention may be installed in or is part of a general purpose computer which can be part of a network, and also can be connected to a broader network such as the Internet, e.g., for data retrieval. Optionally, the program instruction is installed on a computer system in a manner such that the program instruction can be assessed over a network, e.g., over the Internet. Also optionally, the program instruction or a necessary part thereof may be downloaded from a remote computer over the network, or alternatively may be used for analysis with all or most of the functionality remaining on a storage computer or server. In the latter mode, analysis results can be transmitted to the remote computer.
Examples of program instructions include both machine codes, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter. Further, the program instructions include machine code, source code and any other code that directly or indirectly controls operation of a computing machine in accordance with this invention. The code may specify input, output, calculations, conditionals, branches, iterative loops, etc.
CPU 220 is also coupled to a variety of input/output devices 230 such as display 202, keyboard 208, mouse 210 and speakers. In general, an input/output device may be any of: video displays, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. CPU 220 optionally may be coupled to another computer or telecommunications network 234 using network interface 232. With such a network interface, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. Furthermore, method embodiments of the present invention may execute solely upon CPU 220 or may execute over a network such as the Internet in conjunction with a remote CPU that shares a portion of the processing.
For example, SNP or sequence data for analysis may be recorded in storage media or may be retrieved from a remote location or locations. Typically, such data would be at least temporarily stored in the computer system. Also at least temporarily stored in the computer system will be at least portion of the program instruction of the present invention so that at least a portion of the analysis according to the invention can be performed using the sequence data.
Depending on the program structure, the program instructions for the described analysis according to the present invention can be called or loaded in various ways. For example, the program instructions can be called in portions as needed, all of the instructions of an analysis can be loaded at one time, or analysis with one program can be completed and the resulting data stored before a further program is loaded for additional analysis.
1) General Steps of Data Anlysis
A number of intensity correction routines 301 can be carried out on the raw intensity data. Background intensity can inflate intensity measurements causing them to be higher than they should be based on the amount of sample bound to a probe on an array. Background intensity can be dependent or independent of the concentration of sample (target) applied to an array. Concentration-dependent background increases with increasing amounts of sample, and may represent, for example, mismatch and nonspecific hybridization of the sample to the array. Concentration-independent background does not increase with increasing amounts of sample, and can be the result of, for example, scattered or reflected light during scanning of the array. In one embodiment, a background intensity can be subtracted from all of the intensity measurements. In one embodiment, the value of the background intensity is set at the intensity of a probe cell that ranks the 1000th dimmest on the oligonucleotide array. The measured intensity for that probe cell is subtracted from all the intensities as a background correction. Intensity data that are equal to or less than zero after the background correction may be discarded from further evaluation. Of course, other measures could also be used to provide a background intensity value, for example, the 500th, dimmest probe cell or some percent (0.1-3%, for example) of a saturated probe cell.
In an embodiment in which the case and control samples are differently marked or labeled, then the method may correct for differences in detected intensity which are marker-dependent. For example, as noted above, it has been found that there is no linear relationship between the measured intensities for labeling with cychrome and fluorescein. A quadratic fit to measured cychrome and fluorescein intensity data can be carried out to provide a quadratic correction to the cychrome intensity data. The measured cychrome intensity data can then be subjected to this quadratic correction. At high cychrome intensities a parabolic correction function can be used. The correction can be scanner dependent and so different correction data may be used for intensity data collected from different scanners. The background correction described above can be carried out before or after any label-related corrections to the data have been carried out.
In one embodiment, at a data processing step 303, saturated probe cells can also be discarded from further evaluation. If the photomultiplier tube of the scanner measures a saturated probe cell, then the measured intensity is not proportional to the amount of the sample present at the probe cell. Therefore, data collected from the scanner during hybridization and measurement of the raw intensity data, or the intensity data reaching a maximum threshold, can be used to determine if the intensity is in fact a saturated intensity in which case the intensity can be discarded from further evaluation.
In one embodiment, intensity data for SNP positions that are physically too close to each other can be discarded at step 304. If two SNP positions are sufficiently close that a set of probes for a first genotype including one of the two SNPs would also detect a second genotype including the other of the two SNPs, then the intensity data for that set of probes can be discarded from further evaluation.
In a next data processing step 305, a conformance value, C, is calculated for the remaining intensity data, which provides an indication of the reliability of data and so can be used to remove unreliable data from the data set used for subsequent analysis. In general terms, the conformance calculation process 305 involves computing the conformance for each and every SNP on each of the arrays in the set of arrays for the experiment. The conformance calculation uses the non-discarded and corrected measured light intensity data 302 from the hybridized arrays. The conformance calculation process 305 will be described in greater detail below. After conformance values have been calculated for all the SNPs, the conformance data is thresholded 306 and the data for those SNPs having a conformance less than the threshold value are rejected from the data set used in the subsequent steps of the method. In an embodiment, the conformance threshold can be greater than approximately 0.8 and can be approximately 0.9 or greater.
In an additional data processing step (see
A process is then carried out at step 308 to determine an estimate of the relative allele frequency, P′, for each SNP position using the sanitized and conforming intensity data, as will be described in greater detail below. The estimate of relative allele frequency, P′, is computed for each SNP position for each of the experiments performed on case and control groups to provide a set of P′ values for each SNP position for the case and control groups, respectively. The P′ values generally indicate amount of target that has hybridized to the probe(s) for a particular SNP. Generally, if the case or control group has a higher proportion of individuals with a particular SNP, the corresponding feature on an array will be brighter (i.e., will have a higher intensity).
The members of the sets of P′, are then analyzed 310 for each SNP position to determine whether the SNP position can be considered likely to be associated with the disease. If it is determined 310 that the analysis of the P′ data indicates that the SNP position can be characterized as associated with the disease, then the SNP position is added 312 to the set of SNP positions 50 characterized as associated with the disease. If it is determined 310 that the SNP position is not likely to be associated with the disease then the SNP position can be discarded, or otherwise flagged as not, or unlikely to be, associated. This is process is carried out so that all of the SNP positions have been evaluated.
An optional association validation step 314 can be used. The validation step 314, includes applying a secondary test to the set of associated SNP positions 50 to determine whether the characterization as “associated” is consistent with a second criterion. In a preferred embodiment the secondary criterion is based on genetic and/or biological information. This provides a set of validated associated SNPs 50′, in which the validity of the association has been further supported.
Hence the result of the process 300 is a set of associated SNP positions 50, or optionally 50′, that are characterized as likely to be associated with the disease. This information can then be fed back at step 52 into the general association study method 30 as indicated in
The results of the above analysis can be output, stored, and/or transmitted to any of a variety of programs, or other functional units further use or preservation.
Various parts of the method will now be described in greater detail. Although the preceding and following flow charts illustrate various operations being carried out together or in a particular sequence, the flow charts are merely intended to be illustrative and not to limit the invention only to the specific order and combinations of operations and processes described. It will be appreciated that various combinations and/or sequences of operations and processes, including omitting various process steps, can be used to put the general methods and processes described herein into effect.
2) Tiling of Probes for a Target SNP
In the illustrated embodiment of the invention, for each particular SNP position of interest, or “target” SNP to be interrogated, a set of eighty probes (referred to as a “tiling” of probes) is provided on an array or distributed over one or more physical arrays. In other embodiments, a tiling of probes may comprise a set of less than eighty probes, for example, 40, 24, or fewer probes.
The target SNP also has a forward alternate sequence 406 associated with it, including the alternate allele, G, at the SNP position, and a reverse alternate sequence 408 complementary to the forward alternate sequence. In
Each of the reference forward, reference reverse, alternate forward and alternate reverse sequences has a probe tiling associated with it, 410, 420, 430 and 440 respectively, represented as an array of squares in
The probes tiling the array and their positions on the arrays are known and have been designed specifically for the reference allele sequences and alternate allele sequences at step 36 of the experimental protocol. As can be seen eighty probes in total are used for each target SNP. (Twenty probes each for the reference allele forward sequence, reference allele reverse sequence, alternate allele forward sequence and alternate allele reverse sequence.)
In the following, ‘tiling’ will indicate, depending on the context, either the twenty probes for a reference allele forward 410, reference allele reverse 420, alternate allele forward 430 or alternate allele reverse 440 sequence individually, or all eighty probes associated with the target SNP together.
Although a twenty-five nucleotide probe sequence has been described above, it will be appreciated that other probe nucleotide lengths and probe-formats can be used.
If conformance data is not required, as will be described further below, then single probes which are perfectly complementary to the respective reference forward, reference reverse, alternate forward and alternate reverse sequences could be used for each tiling 410, 420, 430, 440 in which case four probes could be used. Fewer or more than the five probe sets per forward and reverse sequence of the reference and alternate alleles can be used, but five probes sets has been found to provide reliable conformance data as will be described below.
3) Conformance Assessment
A first tiling of the multiplicity of eighty probe tilings is selected at step 602 and the intensity data measured from the probes of the first tiling is used in the conformance assessment. The intensity data for the reference or alternate allele is selected 604 and the intensity data for a first probe set, e.g. probe set 412 as indicated by the broken bold line in
If the perfect match probe is determined at step 610 by comparison with the recorded intensities of the other probes in the probe set to be the brightest probe in the probe set, then this is an indication that the probe has bound DNA having the intended sequence. Therefore that probe set can be considered to be conforming, i.e., to include useful experimental data and to be reliable. Therefore a count of the number of conforming probe sets for the reference tilings 410, 420 is updated 612. The process then updates a count 614 of the total number of probe sets for the tilings that have had their conformance evaluated. As can be seen the count of the total number of probe sets evaluated for the reference tilings is updated irrespective of whether a particular probe set conforms or not. If a perfect match probe is not the brightest, then that does not necessarily mean that all the data for the tiling is unreliable, e.g. because the probes are damaged or because there has been some other failure.
The process is then repeated for the next probe set 414 in the tiling 410 for the reference forward and reverse sequences and so on until all ten probe sets for the reference forward and reverse tiling have been evaluated. Then, the conformance for the reference sequence tiling 410, 420 is calculated in step 618. In general, the conformance, C, for a tiling is the number of conforming probe sets divided by the total number of probe sets in the tiling, which in this example is ten.
Hence a conformance for the reference tiling, CR, has been calculated. The process is then repeated 620 using the intensity data for the alternate forward and reverse sequences and a conformance for the alternate sequence, CA, is calculated. The conformance C for the target SNP is then set 622 as the greater of CA and CR to reflect that although the data for the alternate or reference sequence may not be reliable, the data for the other may be reliable and so the conformance value for the better data is used as the metric by which to assess the validity of the data for the target SNP. The conformance value C is stored and indexed by the target SNP and an identifier of whether the data corresponds to the control group or case group, and which experiment. The conformance for a next target SNP is then calculated using the data for the corresponding tiling 624 and the process is repeated until the intensity data for all the 80 probe tilings has been evaluated.
In general, conformance measures the number of times that the presence of one of the reference or alternate sequence is detected out of the number of times either was present. In other words, conformance measures the number of times that the probe that is a perfect match to the sample is the brightest feature (probe) for a given offset, indicating that it hybridized to the sample better than the other features for that offset. It has been found that conformance values below about 0.8 tend to be an indicator that the tiling did not reliably detect the intended nucleotide sequence in the sample. A conformance threshold of 0.8 or greater can be used, but optionally a conformance threshold of 0.9 is used as an indication that either the reference or alternate allele sequence was reliably detected in the DNA sample.
The process is carried out for all of the non-rejected intensity data for all of the experiments to provide a data set of conformance values for each target SNP for each experiment using the control and case group DNA samples. At step 306 of the general method shown in
4) AS/AB Assessment
5) Calculation of Relative Allele Frequency
The relative allele frequency, P, for a SNP position indicates the proportion of the reference and alternate alleles at the SNP position. P, in one specific embodiment, could be calculated from the concentration of the reference allele (cr) at a SNP position divided by the total concentration of the alternate and reference alleles at the SNP position (ca+cr), i.e. P=cr/(ca+cr). For example, in the scenario illustrated in
P′ has been found to have a relationship to P and, while in some circumstances it can be approximately linear, P′ typically has a higher order relationship to P. However, in general if P′ is determined for a particular SNP position for two different pools of the same nucleic acid sample, then P′ is found to vary by a small amount the majority of the time. Therefore, although the relationship between P and P′ may vary, calculations based on P′ can be robust. In particular, it has been found that the relationship between P and P′ may vary between production runs of physical arrays. Therefore it is preferable that the present method use data collected from arrays of the same production batch. However, the methods may also be used for data collected from arrays of different production batches.
As illustrated in
A probe pair count is incremented at step 710 and then the process is repeated at step 712 for each of the pairs of perfectly complementary probes in the tilings 410, 420 for the current target SNP position. When all of the probe pairs have been evaluated, an average value for P′ for the target SNP position is calculated 714 at step as the sum of the relative intensities for each probe pair divided by the total number of probe sets, which in this example is ten. This formula provides P′ for this target SNP position. This P′ value indicates the relative allele frequency for the population of the case or control groups, as the DNA samples for the case and control groups were each pooled. Hence the P′ value is that for the target SNP position averaged over the population of either the case or control groups. The averaged P′ value is stored. At step 718, the process is repeated for the next target SNP position until average P′ values have been calculated for all the target SNP positions using the stored intensity data.
The P′ value generally indicates the relative amount of target that has hybridized to the reference and alternate perfect match probe(s) for a particular SNP. In general, if the case or control group has a higher proportion of individuals with a particular SNP allele, the corresponding features on any array will be brighter. The difference in relative allelic frequency (ΔP′) is equal to the relative reference allelic frequency in a case group minus the relative reference allelic frequency in a control group, i.e. ΔP′=P′case−P′control. Where P′ is calculated based on the proportion of the reference allele relative to the alternate allele, a positive difference in relative allelic frequency (i.e. a positive ΔP′ value) indicates that the reference variant is associated with the case group and the alternate variant is associated with the control group; and a negative difference in relative allelic frequency (i.e. a negative ΔP′ value) indicates that the alternate variant is associated with the case group and the reference allele is associated with the control group.
6) Methods for Calculating P′
The computation of the estimate of relative allele frequency, P′, initially uses the intensity data for a first target SNP position from the target SNP positions whose data has not been rejected, for example, after the conformance test or the AS/AB assessment. For each particular SNP position of interest, or “target” SNP, to be interrogated, a set of eighty probes (referred to as a “tiling” of probes) may be provided, which include probes to interrogate at offsets −2, −1, 0, 1, 2 for both forward and reverse sequences for reference and alternate alleles. In other words, an eighty probe tiling may contain 20 probes for each of the “reference forward”, “reference reverse”, “alternate forward” and “alternate reverse” tilings, as shown in
In some embodiments, the P′ value can be calculated by taking an average of the intensity ratios (i.e., “Mean of the Intensity Ratios”), which is the sum of the relative intensities for each perfect match probe pair P′i divided by the total number of perfectly matching probe pairs. The following formula can be used for calculating P′ for this target SNP position:
where n=total number of the offsets.
The P′ value calculated using this formula may be intensity- and/or signal-dependent, for example sensitive to the “outliers”, which are data points outside the normal intensity range. Outlier may be due to “speckles”, which may cause a hybridization-independent increase in probe intensity. Outliers may also arise where, while IR and IA may not be individually outside the normal intensity range, IR is at the high end of the normal range and IA Is at the low end causing the ratio IR/IR+IA to be higher than it should be.
In other embodiments, the P′ value can be calculated by averaging the perfect match intensity measurements at a plurality of different offsets, then determining P′ by calculating a ratio of those average intensity measurements (i.e. “Ratio of the Mean Intensities”), which reduces the sensitivity of the P′ calculation to outliers:
In some embodiments, 24 perfect match probes (comprising 12 perfectly matching probe pairs) may be used to calculate P′ values since there are four additional probes in the 80 probe tiling that are perfectly complementary to the reference or alternate sequence, considering that at position 0 (or offset 0) there is a perfect match to the reference sequence in the alternate tiling and a perfect match to the alternate sequence in the reference tiling. In other words, at the 0 offset for a particular tiling (reference forward, alternate forward, reference reverse, or alternate reverse), there are actually two perfect match probes, one that is complementary to the reference sequence and the other that is complementary to the alternate sequence. This is because at the 0 offset the tilings are duplicated between the reference and alternate tilings for a given strand orientation (forward or reverse). For example, in the reference forward tiling in
In further embodiments, rather than an arithmetic mean calculation, a trimmed mean may be used to calculate P′ from the P′i values or the intensity measurements. A trimmed mean is found by ignoring the k smallest observations and the k largest observations and averaging the rest of the sample, and is discussed in greater detail below. In still further embodiments, a Winsorized mean may be used to calculate P′ from the P′i values or the intensity measurements. A Winsorized mean is similar to a trimmed mean except that instead of ignoring the k smallest and k largest observations, those observations are replaced by the nearest non-extreme values. For example, if k=2 the smallest and 2nd smallest values are each replaced by the 3rd smallest value, and the largest and 2nd largest values are each replaced by the 3rd largest value.
In other embodiments, a background correction may be subtracted from the individual intensity measurements prior to calculating P′ for a given SNP. Determination of values for a background correction are discussed in detail below.
As one of skill will readily recognize, these methods may be combined in different ways. For example, one may use the “Mean of the Intensity Ratios” method with 12 perfect match probes and a background correction, or one may use the “Ratio of the Mean Intensities” method with 10 perfect match probes and no background correction. Further, either method may use trimmed, Winsorized or arithmetic means to calculate P′. One particular embodiment is described below.
6) One Example of a Method for Calculating P′—.“Ratio of the Mean Intensities”
The P′ value may be calculated in a different way to reduce its sensitivity to the outliers as follows. This method involves averaging the intensity measurements at a plurality of different offsets, then determining P′ by calculating a ratio of those average intensity measures as shown in the equation below. In this example, the total number of perfectly matching probe pairs used is 12, instead of 10 as in the example described above, although the method may be used for 10 perfectly matching probe pairs, as well. Thus,the P′ value calculated in this example is computed from 12 intensities which represents a 20% increase in the amount of data compared to that calculated based on 10 matching probe sets.
In this example, the P′ value can be calculated by taking the average of the intensities before calculating the ratio so that the P′ value is less sensitive to the “outliers” by using the formula shown below.
Although an arithmetic mean may be taken, the sensitivity of P′ to the “outliers” can be further reduced by taking a trimmed mean of the intensities. A trimmed mean is found by ignoring the k smallest observations and the k largest observations and averaging the rest of the sample. The number of points excluded at each end of the distribution, k, can be chosen. In this example, there are 6 perfect matches for the forward sequence and 6 for the reverse sequence. Thus, there is a total of 12 for each reference and alternate allele, i.e., 6 each for IA,for, IR,for, IA,rev, and IA,rev. For each of IA,for, IR,for, IA,rev, and IR,rev, the six data points are ordered from smallest to largest.
Some centered fraction of each data set will be used for calculating the intensity averages. In this example, out of the 6 perfect match intensities for each of IA,for, IR,for, IA,rev, and IR,rev, the middle half of the ordered intensities is preferred for the calculation of the trimmed mean of the average intensity; so, for a data set of six points, three data points are desired. However, there is no “middle three” in an ordered set of six, so the middle four points are taken with the outside two points weighted at only ½ of the weight of the middle two points. In other words, the middle interval (values #3 and #4) is extended by one probe on each side (values #2 and #5) and the first (#2) and last (#5) are taken with a weight of ½. (Thus, the “middle three” is ½+1+1+½.) Since not all the data points are weighted the same, this trimmed mean is also a “weighted mean”. Of course, one of skill will readily recognize that the fraction of observations included in the calculation of average intensity may vary and range from, for example, 50% to 95%, depending on, for example, the size of the data set. Shown below is an example of how to calculate a trimmed mean for IA,for, <IA,for>, for the perfect match (PM) probes:
Similarly, <IR,forpm>,<IA,revpm>, and <IR,revpm> can be calculated using the same formula substituting the appropriate intensity measures from the “reference, forward”, “alternate reverse”, and “reference reverse” tilings, respectively.
In certain embodiments, a background correction is used to further analyze and process the intensity data prior to calculating P′. This may be performed by subtracting a background correction from each intensity measurement as discussed above. In other embodiments, a background correction for the average intensities can be calculated. For instance, when applying only reference sequence to an array, a “perfect” concentration-dependent and concentration-independent background subtraction would make the <IA>=0 (since no alternate sequence is being applied) while <IR>≠0. Since the difference between the reference and alternate sequence is at one nucleotide position (i.e., is one “mismatch”), the intensity of a probe containing one mismatch to the sequence applied would be an ideal background correction. As such, the concentration-dependent and concentration-independent background derived from mismatch probes can be subtracted from perfect match probes, and in order to have the <IA>=0 for the reference sequence, one needs to subtract as background the intensity of one mismatch. Since all the intensity data must be treated the same, this correction would be subtracted from both <IR> and <IA> measures. However, since the identity of the target sequence being applied to the array is not known in advance, the probes that represent only one mismatch are also not known. Therefore, as an approximation of the intensity of a single mismatch probe, the intensity of all the mismatch probes are averaged and that value (“background intensity”) is subtracted from <IR> and <IA> measures.
In certain embodiments, the background intensity can be calculated by taking an arithmetic mean of the intensities of mismatch probes, and in other embodiments a Winsorized mean of the intensities of mismatch probes is used. In still other embodiments, such as in the example described in detail below, a trimmed mean of the intensities of mismatch probes is utilized to calculate a background intensity value. As shown in the formula below, the mean intensity of the mismatch probes in the forward reference tiling (<IR,formm>) is calculated. The mean intensities of mismatch probes in the reverse reference (<IR,revmm>), forward alternate (<IA,formm>), and reverse alternate (<IA,revmm>) tilings may also be calculated in the same manner using the intensities of the respective mismatch probes. According to this formula, the middle half (½×14) in the distribution is preferred. Since 7 points are not evenly distributed in the middle of the ordered set of intensity measurements, 8 points in the middle are taken instead (disregarding the 3 points at each end of the distribution). The 1st and 8th points are weighted by half and the rest of the points are weighted as full values (still a total weight of 7 points) in calculating the mean:
The resulting mean intensity data (<IR,formm>, <IR,revmm>, <IA,formm>, and <IA,revmm>) may then be averaged and the resulting average intensity for the mismatch probes is used as Background in computing the P′ value:
Where <IR,formm>, <IR,revmm>, <IA,formm>, and <IA,revmm> are trimmed means.
Although trimmed means are used for the calculation of <IR,formm>, <IR,revmm>, <IA,formm>, and <IA,revmm> in this example, arithmetic or Winsorized means may also be used for these calculations. Further, although the mean intensity data is calculated separately for the four tilings (reference forward and reverse, and alternate forward and reverse) in this example, in other embodiments the intensities for all the mismatch probes for all the tilings may be averaged together in a single arithmetic, trimmed, or Winsorized mean calculation to derive a Background correction.
In still other embodiments, a background correction may be calculated across an entire array. For example, the intensities of all the mismatch probes on an array may be averaged (e.g., trimmed mean, Winsorized mean, or arithmetic mean) to calculate one background correction for all the intensity measurements collected from a single scan.
The computation of P′ value for some embodiments is summarized in the following equations. The intensity data for the perfect match probes for each of the reference and alternate sequences (forward and reverse) are identified and averaged by taking the trimmed mean, from which the average intensities for the perfect match reference and for the perfect match alternate (<IRpm> and <IApm>) are calculated:
Where <IR,forpm>, <IR,revpm>, <IA,forpm>, and <IA,revpm> are trimmed means.
The average intensity for the reference and alternate (<IR> and <IA>) are calculated by subtracting the Background from the average intensities for the perfect matches, from which the relative intensity, or P′, are calculated:
or, in another form,
These methods for calculating the P′ value are particularly useful for low intensities, although the example works well with a wide range of intensities. If the P′ values from various diploid samples are plotted on an intensity map, where IR is the X-axis and IA is the Y-axis, the distribution of P′ values can be illustrated in
By correcting the concentration-dependent and concentration-independent background, the P′ values are brought closer to what their intrinsic values should be. If the P′ values for various SNPs are plotted on a graph, they are now clustered at around P′=1, 0.5, and 0 (
The analysis of P′ or the difference in P′ (ΔP′) works well for both individual genotyping and for pooled genotyping where pool of cases and pool of controls from different individuals are used as the samples.
7) AS/AB Threshold Analysis for Calculation of P′
“Amplitude of Signal” (AS) & “Amplitude of Background” (AB) can be defined so that an acceptable AS to AB ratio (i.e. a “cut-off” value) can determine whether intensity data can be considered reliable for use in calculating P′ values. A flow diagram describing how to calculate an AS/AB ratio is shown in
The “amplitude of signal” (AS) is defined as the hypotenuse in a right triangle with <IRpm> and <IApm> being the other two sides of the right triangle, where <IRpm> and <IApm> are the intensity of perfect match reference probes and the intensity of perfect match alternate probes, respectively. According to the Pythagorean Theorem, which asserts that for a right triangle, the square of the hypotenuse is equal to the sum of the squares of the other two sides: a2+b2=c2, the AS can be calculated at step 2640B using:
Similarly, the “amplitude of background” (AB) is defined as the hypotenuse in a right triangle with <IRmm> and <IAmm> being the other two sides of the right triangle, where <IRmm> and <IAmm> are the intensity of mismatch reference probes and the intensity of mismatch alternate probes, respectively. According to the Pythagorean Theorem, the AB can be calculated at step 2640A using:
In various embodiments, different AS/AB ratio thresholds can be chosen. In some embodiments, the AS/AB ratio threshold is determined based on the free energy of association (ΔG) between the target and the probe on the array. Depending on the data and methodology, the threshold value can be any number within the range of 1 to 10. For a particular data set, if the AS/AB ratio is below the threshold value, this data will not be included in further analysis and is a “no pass”. One example is an AS/AB ratio threshold of 1.5 for the quality analysis, as shown in step 2660. The AS/AB ratio is another quality control step in the genotyping experiments and may be performed after the initial conformance assessment of each probe set. An AS/AB threshold may be used to accept or reject data to be used in either the “Mean of the Intensity Ratios” or the “Ratio of the Mean Intensities” P′ calculation.
By using the calculation methods provided herein, sets of P′ data for the control and case groups can be generated for all of the target SNP positions for each experiment. The experiment can be replicated a number of times, using the same DNA samples from the same control and case groups and the same array designs to provide a number of sets of P′ each corresponding to different one of the replicated experiments. For example, step 46 (
One of the aims of the present invention is to identify SNP positions that may be characterized as likely to be associated with the disease or other phenotypic trait exhibited by the members of the case group and not exhibited by members of the control group. The number of SNP positions that are ultimately characterizable as being associated with a phenotype by the methods described herein has been found to occur at the level of a few tens to a few hundred SNP positions per thousands to millions of initial SNP positions. In the following examples, for the purposes of explanation, it is assumed that six replicate experiments were carried out where data from multiple experiments is used.
8) Analysis of P′ Values to Identify Associated SNP Positions
i) Threshold Analysis
In a further embodiment, a variable standard deviation threshold value (cutoff) is used depending on the number of P′ values available to calculate the mean and standard deviation. For a small number of data items, a larger standard deviation may be acceptable and so the cutoff would be increased. However, for a larger number of data items, the standard deviation of the sample can be considered a reasonable estimate of the actual standard deviation of the underlying population and so a more stringent standard deviation cutoff can be used to help identify and reject unreliable results.
In an alternative embodiment, a chi-squared distribution is used to determine a standard deviation cutoff. The chi-squared distribution corresponding to the number of P′ values available and the maximum acceptable true standard deviation for the null hypothesis is used to determine the likelihood of observing as large a sample standard deviation under that null hypothesis. If it is determined that the sample standard deviation is not so large as to be inconsistent with the maximum acceptable standard deviation, then the sample standard deviation is “acceptable” and the thresholding step 810 is met. If the likelihood under this null hypothesis is unacceptably small, then the thresholding step 810 is not passed and the data items are rejected. Using the chi-squared distribution, a confidence level of approximately 95% and a maximum acceptable true standard deviation of 0.01-0.1 could be used to determine whether the calculated standard deviation is acceptable or not.
If both the case and control standard deviations pass the thresholding step 810, then the magnitude of the difference in the mean P′ for the case group and the mean P′ for the control group is calculated, i.e. |<P′cas>−<P′con>|=|Δ<P′>|. As it is the magnitude that is relevant in this embodiment, it will be understood that <P′con>−<P′cas> can be calculated instead. The magnitude of the difference in mean P′s is then thresholded 812. The threshold criterion used can be any criterion which is sufficient to distinguish with an acceptable degree reliability between significant |Δ<P′>|s and those which are unlikely to be significant. In one embodiment a threshold of 0.05 can be used. Optionally, a threshold of between 0.02 and 0.2 may be used. If the threshold is criterion is met, then the difference is deemed sufficient for the SNP position to be considered an associated SNP position, as there can be considered to be a genuine difference in the relative allele frequencies at the SNP position in the case and control groups. Therefore the SNP position is identified at step 814, by a flag or any other suitable mechanism, as being an associated SNP position. A next SNP position is then selected at step 816 for evaluation.
In another embodiment, a first threshold is used during a first iteration of the general method and a second threshold is used during the second iteration for the reduced number of SNPs. The second threshold can be greater than the first threshold. The second threshold can be between 0.05 and 0.5, optionally between 0.07 and 0.3, or optionally between 0.08 and 0.2. In a preferred embodiment a threshold of approximately 0.05 is used during the first iteration and a threshold of 0.10 is used in the second iteration. The first threshold acts as a coarse filter to identify associated SNPs and the second threshold acts as a finer filter so that the characterization of SNPs by the second iteration is more robust.
ii) T-Test Analysis
At step 822 the arithmetic mean <P′Con> and the standard deviation σCon for the control group P′ values for the selected SNP are calculated. At step 824 the arithmetic mean <P′Case> and the standard deviation σCase for the case group P′ values for the selected SNP are calculated.
The method then carries out a t-test at step 826 to determine whether it is unlikely that the control group and case group P′ values come from the same distribution. If this is the case, then it can be considered likely that the selected target SNP is associated with the disease as the frequency of occurrence of the reference or alternate allele at the selected target SNP position is different in the case group and the control group. However, if the P′ values appear to come from the same underlying distribution, then it is likely that the selected target SNP position does not distinguish the case and control groups as the frequency of occurrence of the alternate or reference allele at the SNP position can be considered to be statistically the same in the case and control groups.
A null hypothesis approach is used in which it is assumed that the P′ values for the case and control groups come from the same distribution and so ΔP′=0. A two sided t-test is used as ΔP′>0 and ΔP′<0 can both be indicative of an association between the chosen SNP and the phenotype of interest. The t-test returns a t-statistic which is used, together with the number of degrees of freedom (NCase+NCon−2), which is 10 in the event that two groups of 6 P′ data items are used, to obtain a p-value from a standard look up table. The p-value provides a measure of consistency of the data with the null hypothesis. A low p-value indicates that ΔP′=0 is unlikely, therefore the P′ values can be considered likely to come from different distributions.
The statistical significance of the P′Case and P′Con groups of values is assessed using a t-test 826. In particular the t-statistic is computed in one embodiment using:
Where σ={square root}{square root over ([(NCase−1)σ2Case+(NCon−1)σ2Con]/(NCase+NCon−2))}
where NCase and NCon are respectively the number of P′ case and control data items in the respective distributions and σ is the sample standard deviation under the null hypothesis. The t-statistic is then compared with a Student t distribution using the number of degrees of freedom (NCase+NCon−2) to obtain a p-value. The p-value is then assessed at step 828 to determine whether it is appropriate to reject the null hypothesis. A p-value of 0.01 corresponds to a 1% probability of observing as large a difference between cases and controls under the null hypothesis, and a p-value of 0.05 corresponds to a 5% probability. That is, for the latter case, five out of every hundred P′ distributions will be wrongly characterized as being separate P′ distributions rather than the same P′ distribution. In one embodiment a p-value of 0.05 is used in step 828, which provides a 95% confidence level, so that a SNP is identified as an associated SNP in step 830 if the t-test returns a 0.05 or lower p-value.
The process is then repeated for another one of the selected target SNP positions to determine whether that target SNP can be characterized as associated with the disease or not. The process is carried out until all of the target SNP positions have been evaluated and results in a set of data 50 identifying a set of associated SNP positions which have been characterized as being likely to be associated with the case group disease (or other phenotypic trait).
In this way, the large number of target SNPs (typically of order 106) used in the initial first tier association study experiments can be reduced by typically 99% as the target SNPs are segregated into those SNPs for which there is some evidence that ΔP′ is non-zero and those for which this is not the case. A second tier association study can then be carried out 52 for the same case and control groups but with a greatly reduced set of target SNPs which have now been characterized as more likely to be associated with the phenotype of the case group and those characterized target SNPs are taken into account in redesigning the arrays and the experimental protocol for the second tier association study experiments. In some cases, individual genotyping analysis rather than pooled genotyping analysis,is used in the second round.
Different p-values can be used in the first and second tiers of the general method. A lower confidence level may be used in the first tier to provide a relatively coarse filter to identify associated SNPs, and then a higher confidence level can be used to provide a finer filter for the reduced number of SNPs in the second iteration.
Also, in certain alternative embodiments, different confidence levels may be used for the SNP association criterion in step 828 depending on the P′ values. The t-test is a parametric test and assumes that the distributions being considered are normal. However, in practice the P′ distributions may not be normal, although the t-test method is still considered to provide reliable results. It has been found that P′ values close to 0.5 have a closer to normal distribution than those closer to 0 and 1. Therefore different confidence levels may be used depending on the P′ values of the distributions being evaluated. P′ values close to 0.5 may use a more stringent criterion as the t-test may be more applicable to the corresponding P′ distributions than P′ values close to 0 or 1.
iii) Analysis of P′ Data by Using Genetic Information
Another class of processes for analyzing the P′ data will now be described with reference to
As described in U.S. patent application Ser. No. 10/106,097, incorporated herein in its entirety, alleles (variants) making up blocks of polymorphisms (e.g., SNPs) are often correlated or “linked”, resulting in reduced genetic variability and defining a limited number of “SNP-haplotype patterns”, each of which reflects descent from a single, ancient ancestral chromosome (Fullerton, et al., Am. J. Hum. Genet. 67:881 (2000)). As is well known in the art, the term “linkage disequilibrium” or “linked” refers to genetic loci that tend to be transmitted from generation to generation together; e.g., genetic loci that are inherited non-randomly. “SNP haplotype block” means a group of variant or SNP locations that do not appear to recombine independently and that can be grouped together in blocks of variants or SNPs. In other words, a haplotype block is a set of correlated SNPs in a genomic DNA sequence. Typically, haplotype blocks found in genomic DNA sequences extend from a few kilobases up to greater than 100 kilobases. The term “SNP haplotype pattern” refers to the set of genotypes for SNPs in a SNP haplotype block for a single DNA strand. In other words, a haplotype pattern refers to an arrangement of one or more polymorphic nucleotides on a particular chromosome within a haplotype block. The haplotype pattern preserves the information of the phase of the SNPs, i.e., which set of SNPs was inherited from one parent, and which from the other.
This analysis process utilizes the observation that loci that are genetically linked, for example within a haplotype block, tend to be inherited together. Therefore, when one polymorphism is directly involved with the manifestation of a phenotypic trait of interest (e.g., causes a change in a protein directly responsible for the trait), other polymorphisms that are linked to the first will also be “associated” with the phenotypic trait, even if they are not directly involved in the trait. As such, multiple SNPs may be associated with a phenotypic property simply due to the fact that they are linked to a SNP that is directly involved in the manifestation of the phenotypic property. So, the characterization of one SNP as associated may be reinforced by the characterization of another SNP as associated if that other SNP is in the same haplotype block of DNA.
The characterization of one SNP as associated can be further reinforced by comparing the case and control group haplotype patterns that include at least one of the polymorphisms, and haplotype blocks that include at least one haplotype pattern. For example, sets or patterns of polymorphisms that occur at a higher or lower frequency in one population than in another indicate areas in the genome where phenotypic trait-related loci may be located. Optionally, the characterization may be reinforced by comparing the haplotype structures of a region of interest present in case and control groups to identify those polymorphisms or haplotype patterns that associate with a phenotypic trait of interest.
Therefore a first process includes assessing the physical separation of two SNPs with the assumption being that the closer two SNPs are to one another, the more likely it is that they are linked to one another. A ΔP′ value is calculated for each SNP and their proximity is determined. If the ΔP′ for both SNPs is greater than a threshold value, e.g., 0.02, and the separation of the SNPs is less than approximately 100 kilobases, then the two SNPs may both be characterized as associated. In other embodiments, the SNPs can be considered to be sufficiently proximate if their separation is less than approximately 50 kb, and preferably approximately 10 kb or less. Without wishing to be bound by theory, SNPs that are close to one another are more likely to be part of the same block of DNA than SNPs that are far apart, and that block of DNA may be involved in the disease mechanism.
A second similar process includes varying the proximity criterion using knowledge of the length of DNA blocks that are inherited. For example, a haplotype block map can be used to provide information as to the length of the block in which a first SNP with a finite ΔP′ is located. Examples of how to construct a haplotype map are described in detail in U.S. patent application Ser. Nos. 10/284,444, 10/166,341, and 60/323,059, all of which are incorporated by reference herein. The proximity criterion can then be adjusted based on the length of the block containing the first SNP. For example, if the haplotype block is approximately 20 kb long and the first SNP is located in the middle of the block, then only SNPs separated by less than approximately 10 kb from the first SNP, and therefore on the same block need have their ΔP′s evaluated, to see if the SNPs can be characterized as associated. In other words, if a first SNP is found to be associated with a phenotypic trait of interest, than one would expect that other SNPs in the same haplotype block as the first SNP are likely to also be associated with the same phenotypic trait of interest. If this is found not to be the case, then the first SNP's association with the phenotypic trait may be called into question.
The location of points of high recombination (owing to the cross-over of chromosomes during meiosis) can be used to define the size of blocks of DNA, as in general a recombination “hotspot” generally is not found within a haplotype block since the SNPs in such blocks tend to be inherited together. As such, typically recombination hotspots only occur between haplotype blocks. The position of such recombination hot spots can also be used in characterizing SNPs. For example if two SNPs are sufficiently close together to meet the proximity criterion and their ΔP′s are sufficiently similar, if they are on separate sides of a recombination hot spot then they are unlikely to be inherited together and so they would not likely be in the same haplotype block. So, if one SNP was shown to be associated with a phenotypic trait of interest, the putative association of the other SNP could not be assumed and would have to be determined independently. Further, the association of one SNP could not be used to validate the association of the other.
Another process uses SNP clustering within haplotype blocks to characterize a SNP position. Since the borders of haplotype blocks may be determined in various ways, SNPs that are clustered within a haplotype block are more likely to remain in that haplotype block even after a reanalysis of the haplotype structure that changes one or both borders of the haplotype block than are SNPs that are far apart in a block. That is, clustered SNPs are more likely to be linked to one another than are SNPs that are widely dispersed. This process therefore determines the ΔP′ values for a number of SNPs that occur in clusters in a haplotype block, and if the ΔP′ values are sufficiently similar and are above a given threshold then the selected SNPs can be characterized as associated.
Haplotype blocks are inherited and may be associated with the genetic basis for a disease. Irrespective, it has been observed that ΔP′s of SNPs that are associated with the disease tend to cluster around a similar value depending on the proximity of the SNPs on the chromosome. Haplotype blocks are typically on the order of 10s of kbs (kilobases) long and can include several to over 100 SNPs. Hence by assessing the similarity and proximity of ΔP′ values, and using haplotype map information, SNP locations can be characterized as associated or not, or their characterization reinforced by this secondary genetic information.
For example, the x's in
Now consider that position 856 is a haplotype block boundary that separates haplotype block B from haplotype block C. The ΔP′ value 861 in haplotype block C may have initially been considered to indicate an associated SNP position. However, the difference in its ΔP′ value as compared to the ΔP′ values for the other SNP positions in haplotype block C may be considered sufficient to indicate that it was falsely characterized as an associated SNP position. The fact that it is close in value to the ΔP′ value 859 is not sufficient to validate its association since haplotype block boundary 856 falls between the two positions.
Hence, the similarity of ΔP′ values for associated SNP positions, the proximity of the associated SNP positions, and/or haplotype map data can be used to provide a secondary indication of the validity of the characterization of a SNP position as being an associated SNP position.
For example, another means of validating an associated SNP position includes identifying a location of the nucleic acid segment in the human genome. The SNP position may occur within the coding or noncoding regions of a gene. It may be in a region that regulates the expression of a gene, such as a promoter or enhancer region. It may be in an exon or intron of a gene. If an associated SNP position is located in a coding or regulatory region of a gene, then the gene may be deemed to be associated with the phenotypic characteristic of interest. In such a case, the step of validating may further include cloning and expressing the associated gene to produce and/or characterize a protein product, or regulating expression of the associated gene in cells in vitro or in vivo and detecting changes of cells in response to the regulation. In certain embodiments, once an associated gene has been identified, the step of validating may further include screening a library of pharmaceutical candidates against the associated gene or gene product, and selecting the pharmaceutical candidates that modulate expression of the associated gene or activity of the associated gene product. These pharmaceutical candidates may also be screened to determine whether or not they have an effect, either direct or indirect, on the phenotype of interest. This screening may be performed in numerous ways well known to those of skill in the art including, but not limited to, administering the pharmaceutical candidate and monitoring the phenotype of interest for any changes in response to the pharmaceutical candidate. Other methods of use of associated SNP positions, associated genes, and pharmaceutical candidates -are described in detail in U.S. patent application Ser. No. 10/106,097, filed Mar. 26, 2002, entitled “Methods for Genomic Analysis”, which is incorporated herein by reference in its entirety for all purposes.
For each SNP position that has been characterized as an associated SNP position, a ΔP′ value was calculated, for example in step 812 of
As explained above, the similarity of ΔP′ values for associated SNP positions and their position with respect to haplotype block boundaries can be used to validate the characterization of the SNP position as an associated SNP position. A first two associated SNPs that are adjacent to one another are identified 1410 for evaluation, for example from the set of {associated SNPs} 50 shown in
If the ΔP′ values are dissimilar and there is no boundary (e.g. SNP positions 861 and 862 in
More than one block may be associated with a phenotype (for example, a single gene may contain more than one block or may overlap a block boundary, or associated blocks may be dispersed across a chromosome or a genome when multiple genes are involved). Therefore, the use of haplotype data can provide greater reliability by way of validating SNPs that have been characterized as associated based on their respective ΔP′s.
Further, the existence of a boundary does not necessarily require a difference in ΔP′ values between adjacent haplotype blocks and so a determination that the ΔP′ values are similar and there is a boundary is not inconsistent with the validity of a characterization of the SNPs as associated.
At this stage the method has therefore validated, or not, the characterization of the first two adjacent SNP positions as associated. In the example, the ΔP′ values for 853 and 855 are similar and there is no boundary and this is consistent with both 853 and 855 being associated SNPs. The method then determines whether there are adjacent SNPs in {associated SNPs} 50 that have not been analyzed by process 1400. If there are, then another adjacent pair of associated SNP positions are analyzed, for example 855 and 857 in
Hence, as can be seen, although the fourth and fifth associated SNP positions in
Various modifications and additions to this general methodology of using similarity and position data to validate the characterization of SNP positions as associated may be employed. For example, a ΔP′ for a one of the SNP positions on a haplotype block may be compared with an average ΔP′ for the rest of the SNP positions on that haplotype block in order to determine similarity, and a measure of the distribution of the ΔP′ values for the rest of the SNP positions, e.g. the standard deviation, may be used as, or to help determine, a threshold value.
Further, the distance between neighboring SNP positions may be used as a supplementary validation mechanism, instead of the ΔP′ similarity and boundary validation mechanisms described above.
Haplotype map information can be used to refine allele frequency (P or P′) and/or allele frequency difference (ΔP or ΔP′) data to reduce the rate of false positive associations. In particular, haplotype map information can be used to refine estimates of SNP allele frequency differences between a pooled sample of a case group and that of a control group. The method exploits the fact that within a haplotype block most of the variation in SNP allele frequencies can be accounted for by variation in frequencies of a relatively small set of common haplotype patterns. And, within a block, the sum of changes in these pattern frequencies should be approximately zero, to the extent that those patterns in the haplotype map accurately represent the total genetic diversity of that interval.
In one embodiment of the method, the haplotype map is constructed so that common haplotype patterns account for at least 80% of the genetic variation in a set of more than 20 haploid genomes selected from a globally diverse panel of cell lines. Examples of, how to construct a haplotype map (e.g., Chromosome 21 haplotype map) are described in detail in U.S. patent application Ser. Nos. 10/284,444, 10/166,341, and 60/323,059, all entitled “Human Genomic Polymorphisms”, all of which are incorporated by reference herein.
Optionally, the haplotype map may be constructed to account for a larger or smaller fraction of observed genetic variation; and the map could also be constructed using individuals from a specific population. For example, the map may be constructed from individuals of a particular ancestral heritage or individuals who exhibit a particular phenotype of interest. The specific method of construction of the map does not limit the application of this invention, but it is required that the map assigns SNPs to discrete haplotype blocks, and identifies a limited set of common haplotype patterns within each block.
In a particular embodiment, given ΔP′s for a haplotype block linear regression is used to solve for underlying haplotype pattern frequency differences using the following equation.
where ΔPi is the difference between the estimated reference allele frequency for SNP i within a haplotype block in a case group and the corresponding frequency in a control group; Δfj is the (unknown) frequency difference for a common haplotype pattern; j∈1 . . . N, where N is the total number of different patterns for the haplotype block; and mij is a coefficient that takes a value of +0.5 if the allele at position i in pattern j matches the reference allele, and −0.5 if it matches the alternate allele for that SNP. The reason for the 0.5 factor is that the frequency difference for an allele would otherwise be double counted when differences for the complete set of patterns are evaluated.
This linear regression model also requires that the sum of pattern frequency differences add up to 0:
This constraint can be folded into the above linear regression equation by substituting:
to obtain:
where
rij≡mij−miN
In the linear regression model, the ΔPi are the dependent variables, the {overscore (r)}ij are the independent variables, and the Δfj are the regression coefficients to be fitted. Either a standard statistical package or a dedicated computer program can be used to perform the linear regression. In the following example, the analysis was done using the R analysis package (http://www.r-project.org). Standard measures (R2, and the p-value for an F test) can be used to judge the quality of the fit of the SNP data to the haplotype pattern information. It is noted that deviations from a perfect fit may arise from experimental errors, and/or inaccuracies in the haplotype map.
As an exemplary illustration, the linear regression model described above was applied to refine allele frequency differences between a case and control group at 10 SNP position in a haplotype block. One haplotype block consisting of 10 SNPs was selected based on the haplotype map for Chromosome 21 (as described in U.S. patent application Ser. Nos. 10/284,444, 10/166,341, and 60/323,059, all of which are incorporated by reference herein). The 10 SNPs were identified here by their chromosomal positions in the me assembly build 30 and listed in Table 1 as follows.
The following four common patterns for this haplotype block had been previously identified described in U.S. patent application Ser. Nos. 10/284,444, 10/166,341, and 60/323,059) and listed here in Table 2:
The m matrix, indicating whether a position in a pattern matches the reference or alternate allele, is given by Table 3 shown below:
In Table 3, the four numbered columns represent the four common haplotype patterns and the ten numbered rows represent the ten SNPs in the haplotype block. As shown in the m matrix in Table 3, pattern 2 matches the reference allele at every SNP as column 2 all contains +0.5 values.
The r matrix was obtained by subtracting column 4 from the other three columns and is giving by Table 4:
In an association study using methods of the present invention, estimates of pooled allele frequency differences (ΔPi) were measured, e.g., by using hybridization to oligonucleotide arrays, and listed in Table 5 as follows:
The linear regression model was applied to the values in Table 5 against the r matrix in Table 4 and produced estimates for the pattern frequency differences as listed in Table 6. It is noted that the estimate for pattern 4 in Table 6 was obtained using the rule that the four differences must sum to zero.
In this instance, the quality of the fit to the observed SNP frequency differences is fairly high. The regression yielded an R2 of 0.91, meaning that 91% of the variance in the original ΔPi data was accounted for by the three independent Δfj values. An F-test for the fit gave a p-value <0.0005 indicating the fit is quite unlikely to arise by chance. Using the calculated Δfj values in Table 6 and the r matrix in Table 4, fitted values for the individual SNP frequency differences were generated and listed in Table 7.
These fitted allele frequency differences may be used to inform the selection of SNP's as candidates for further analysis, which might include individual genotyping in the same or different cohort to validate putative associations. In instances where the quality of the fit to the haplotype map is good, the fitted allele frequency differences should have lower variance than the raw data for individual SNPs, because they incorporate information about the expected correlations between SNPs.
According to the linear regression model described above, to permit estimation of frequency differences for N patterns in a haplotype block, data for at least N−1 SNPs should be available. The haplotype map may be constructed such that a block with N common haplotype patterns must contain at least N−1 SNPs, so given complete data, this requirement is always satisfied. Where exactly N−1 SNPs are present, there is no additional correlation information in the haplotype map. Preferably, data for more than N−1 SNPs is available per haplotype block. Inclusion of rare haplotype patterns in the analysis would make the relative numbers of SNPs and patterns less favorable for the method.
In one embodiment, the linear regression analysis is performed on allele frequency differences. One of the advantages associated with this method is that it is invariant to some systematic errors in the pooled allele frequency estimates. The method is valid as long as the pooled measurement is strongly correlated with actual allele frequency. If the measurement is systematically scaled or offset with respect to actual frequencies, any offsets will cancel out in the calculation of frequency differences, and the resulting fitted values will be in the same scale as the measurements. As in the example described above, allele frequency estimates were made based on relative intensities from hybridization to oligonucleotides. These intensity ratios are somewhat inaccurate measures of absolute allele frequencies, but are well suited to detecting small frequency differences between pools. The linear regression analysis of the present invention can be performed on these estimated allele frequency differences to render the data closer to the true values.
Optionally, the linear regression analysis of the present invention may be used to estimate the absolute frequencies (i.e., P or P′ values) of the haplotype patterns in a single pool, e.g., a sample from an individual or collected from a group of individuals. The P values may be measured by using the oligonucleotide hybridization assay described above or other genotyping methods known in the art. Then the linear regression analysis of the present invention is used to refine the data by using the following relationship:
where mij takes a value of +1 if the allele at position i in pattern j matches the reference allele, and 0 if it matches the alternate allele for that SNP. The frequencies are also constrained by:
This constraint can be used to reduce the number of free parameters in the regression by one. It also may be useful to constrain the regression such that 0≦fj≦1 to avoid impossible solutions.
These methods use allelic information in haplotype patterns from the separately derived haplotype map, but do not use information about SNP allele frequencies or haplotype pattern frequencies in the data used to derive the map. The linear regression methods for refining P and ΔP values should be generally applicable to analysis of data obtained from various population or subpopulations. In one situation, the haplotype map may be derived from one population of individuals whereas the P and/or ΔP values may be derived from measurement of relative allele frequencies of another, different population. In another situation, although less common, the haplotype map and P and/or ΔP values may be derived from the same population of individuals. While there is much evidence that SNP allele frequencies (and by extension, haplotype pattern frequencies) vary widely between different human subpopulations, it is believed that the haplotype boundaries and common patterns observed in a globally diverse population should largely be a superset of the boundaries and patterns observed in more narrowly derived subpopulations. Consequently, these methods may tend to conservatively underestimate the amount of actual SNP redundancy.
The methods described above can be used to further characterize the associated SNPs, thereby identifying SNP positions which may merit further investigations or to provide a weighting which may be given to the data in subsequent analysis.
iv) Analysis Based on Distribution of ΔP′ Values
A further class of processes for analyzing the P′ values to characterize SNP positions as associated or not will now be described with reference to
In a first analysis process, the distribution of ΔP′s for all the SNPs for a single experiment is determined. The associated SNPs are then identified as corresponding to those SNPs that fall in the top 0.5% and the bottom 0.5% of the distribution (or alternatively the top 1% if |ΔP′| is used). In general associated SNPs are rare, therefore taking extreme ΔP′ values as being indicative of an associated SNP can be used to remove the large majority of non associated SNPs from the large number of SNPs being evaluated. Other cut off criteria can be used. For example the top and bottom 1-10% of ΔP′ values can be used. The SNP positions having ΔP′ values falling within the extreme parts of the distribution are then identified as candidate associated SNPs.
Another analysis procedure uses distributions of ΔP′ values for all SNPs from separate replicated experiments, as illustrated in
If none of the SNP positions are associated, then the set of ΔP′ values will not be systematically biased. Therefore it is possible to estimate the number of SNPs that will fall within the top 5% or bottom 5% for repeated experiments, in the absence of any associations. Deviations away from that estimated number can therefore be taken as an indication of the existence of associations and the SNP positions whose ΔP′ values consistently fall within the extreme parts of the distribution for repeated experiments can be considered likely to be associated SNP positions and characterized as such.
An embodiment of a data processing method embodying the principles enumerated above will now be described in further detail with reference to the flow chart in
The method then identifies at step 1006 any SNP positions whose ΔP′ values occur in the top 5% in both experiments (replicates) {Top 5% Expt1}∩{Top 5% Expt2} and SNP positions whose ΔP′ values occur in the bottom 5% in both experiments {Bottom 5% Expt1}∩{Bottom 5% Expt2}. On the assumption of no associated SNP positions, the number of SNP positions whose ΔP′ values will fall within the top 5% in two experiments by chance would be approximately 5% of 5% of 1.7 million, which is approximately 4,250, and similarly for the bottom 5%. Therefore, if there are substantially more than 4,250 SNP positions common to the top 5% in both experiments, the some of them are likely associated SNP positions, and similarly for the bottom 5%. Although some of the SNP positions common to both experiments will by chance be non-associated, a significant number of the members of the set of SNPs common to the top 5% of both experiments will be associated SNP positions. As will be appreciated, this has reduced the initial set of 1.7 million candidate SNP positions by approximately three orders of magnitude and the reduced set of SNP positions can be characterized 1008 as likely being associated SNPs which can be investigated in greater detail.
If more than two sets of experimental results are available, then the number of SNP positions having ΔP′ values in the top 5% which are common to all of the experiments by chance will be reduced by 5% for each experiment. Therefore, for example, if six experiments have been carried out, the number of SNP positions that by chance have ΔP′ values falling within the top 5% in all six experiments would be approximately (0.05)6×1.7 million which is on the order of about 0.0266. Therefore any SNP positions common to all six experiments can be characterized 1008 as likely to be SNP positions associated with the phenotype of the case group under investigation. Therefore the intersection of the sets of SNP positions with ΔP′ values falling within the top 5% for different experiments will provide a subset of SNP positions likely to be associated with the phenotype. The certainty of the characterization of the SNP positions as associated will increase with the number of experiments. Further, for reduced initial candidate SNP sets, fewer experiments may be needed in order to provide a suitable level of statistical certainty. Furthermore, a reduced proportion of the distribution of ΔP′ values can be used to reduce the number of repeated experiments required in order to more definitively identify associated SNPs (e.g., a top 1% or 2%). However, that runs the risk of excluding SNP positions which are associated and so an increased number of repeated experiments may be preferred.
The set of SNP positions characterized as associated {associated SNPs} 50 (
v) Analysis Based on P′ from Forward/Reverse Probe Tiling
Another process 1026 for analyzing P′ data to identify associated SNPs, corresponding to method step 310, will now be described with reference to
In contrast, in the inconsistent instance, FWDP′ is higher in the control group than in the case group, but REVP′ is lower in the control group than in the case group. The alternate inconsistent instance is FWDP′ being lower in the control group than in the case group and the REVP′ being higher in the control group than in the case group. Therefore an inconsistent variation can be considered to be REVP′ and FWDP′ changing with opposite signs between the case and control groups. A line passing through the corresponding control group datum Xcontrol and the case group datum Xcase would have a negative gradient, which is indicative of an inconsistent scenario.
The process 1026 therefore determines at step 1030 whether the variation in FWDP′ and REVP′ between the case and control groups is consistent, for example by determining the sign of the gradient in the virtual, two-dimensional FWDP′ and REVP′ space. If it is determined that the variation is consistent, then the SNP being evaluated is identified as likely being an associated SNP 1032 and a next SNP is selected for evaluation at step 1034. Otherwise, the SNP is not identified as likely being associated and a next SNP is selected for evaluation 1034.
vi) Nonparametic Ranking Test
A further process 1040 for analyzing P′ data to identify associated SNPs, corresponding to method step 310, will now be described with reference to
In certain embodiments, the P′ analysis process 1040 includes some optional steps for determining whether the P′ data is suitable for analysis by the rank-sum test. The process calculates the standard deviation for the set of P′ values from repeated experiments for a SNP position for the case group and the control group. For a small set of P′ values resulting from a small number of repeated experiments, as there is a small sample size, there is a relatively large chance that the P′ values for the case group, or control group, would all be very similar, even though in reality the underlying distribution of P′ values is much greater. In order to reduce false positive identifications of associated SNPs from small P′ data sets, the standard deviation of the set of repeated experiment P′ case values and of the set of repeated experiment P′ control values are thresholded at step 1044 to identify clusters of P′ values having a very narrow distribution, indicative of a chance or fluke result. In one embodiment a standard deviation of 0.005 is used as the threshold standard deviation value. If either the case or control group P′ standard deviation is less than approximately 0.005, then the corresponding SNP position is rejected from further consideration. Sets of P′ data that pass this clustering step are then analyzed to identify associated SNPs by the remainder of the process. This checking of the P′ data is optional and is particularly useful when small groups of data are used. Steps 1042 and 1044 can be omitted if the sample sizes are sufficiently large.
The P′ values for the case and control group are then ranked 1046 by their absolute values, as illustrated by
The p-value is then thresholded against a confidence level to determine 1052 whether the null hypothesis that the case and control P′ data items come from the same distribution is sufficiently likely to be true. If the null hypothesis is sufficiently likely, then the SNP is not characterized as an associated SNP, as indicated by process flow 1054. If the null hypothesis is determined at step 1052 to be sufficiently unlikely to be true, then the SNP can be identified at step 1056 as an associated SNP. In an embodiment a 95% confidence level in accepting the alternative hypothesis is used for the p-value threshold. A 99% confidence level can be used if greater stringency is required in the identification of associated SNPs, e.g., in analyzing the results of a second tier experiment. The process 1040 can then be repeated for all SNPs that require evaluation.
vii) Analysis Based on Pairing Individual P′ Case and Control Values
A further process 1070 for analyzing P′ data to identify associated SNPs, corresponding to method step 310, will now be described with reference to
Pairing P′ values by common experimental conditions helps to prevent variations in experimental techniques and procedures from causing variations in the P′ values. A variety of different experimental conditions which can affect the outcome of an experiment can be used as the criterion by which to pair P′ case and control values and the following is by way of example only.
In a pairing step 1072, the process 1070 identifies P′ case and control values for a SNP position which were measured using array wafers or chips from the same production batch or lot and for which the experiment was carried out by the same person. In this way the effect of variations in array properties (e.g. probe properties) between lots and variations in the individual experimental technique of different people can be reduced. The process the calculates at step 1074 the difference, ΔP′, for the experimentally paired P′ case and control values for the SNPs positions and for each of the repeated experiments. The paired ΔP′s are then evaluated at step 1076 by any of a number of methods as will be described in greater detail below. Following evaluation of the paired ΔP′s, associated SNP positions are identified at step 1078.
One process for evaluating the paired ΔP′s is to carry out a paired t-test using the paired P′ values.
The P′ case and P′ control values can then be paired by PCR reaction and experiment number to provide six experimentally paired P′s and corresponding ΔP′s. Two pairs are illustrated in
in which <ΔP′> is the arithmetic mean of the individual ΔP′s, σΔP′ is the standard deviation of the ΔP′s and NΔP′ is the number of ΔP′s, which in this example is six. The p-value for the calculated t statistic is then determined at step 1124 from a look up table using the number of degrees of freedom, which is NΔP′−1 and being five in this example. The p-value is then used to determine at step 1126 whether the null hypothesis is sufficiently unlikely so that the alternate hypothesis ΔP′≠0 can be accepted. In one embodiment, a p-value of less than 0.05 is considered sufficient to reject the null hypothesis. Therefore, the p-value is thresholded at step 1126 and if the threshold is passed, the SNP is identified as an associated SNP 1128 (corresponding to step 1078 in
The paired t-test is a parametric test and therefore is predicated on the data having a particular distribution. In instances in which a parametric test is not appropriate, a non-parametric counterpart to the paired t-test can be used, such as Friedman's test.
A number of other process can be used to implement step 1076 of evaluating the set of experimentally paired ΔP′s. In one process, the median value of the set of ΔP′s for a SNP position is selected and thresholded. If the median ΔP′ is equal to or greater than approximately 0.05 then the SNP is considered to be an associated SNP and is identified as such. In another process, an Olympic average is taken of the set of ΔP′s for a SNP. The largest and smallest ΔP′s are rejected and an average is taken of the remaining ΔP′s. The Olympic average ΔP′ is then thresholded, and the SNP is identified as an associated SNP if the Olympic average is equal to or greater than approximately 0.05. A further process involves determining whether the set of ΔP′s all have the same sign. If the set of ΔP′s (where all the ΔP′s have been calculated consistently) all have the same sign (i.e. all positive or all negative) then the consistent determination of ΔP′ across repeated experiments is an indication that the ΔP′ is ≠0 and so the SNP is identified as likely being an associated SNP. In a further process, a ranking test, similar to that described with reference to
The above described processes provide example embodiments of methods for analyzing P′ values to determine whether a SNP position should be characterized as likely being associated or not. The processes use at least two measures of P′ or ΔP′ for a SNP position or multiple SNP positions, in order to characterize the SNP position. Various statistical tests have been described above and other parametric and nonparametric tests can be used depending on the P′ values available for analysis. Many of the processes use properties of the distribution of P′ or ΔP′ values, such as an arithmetic mean, standard deviation, median, etc, and other properties of distributions can be used. Various cut off values or thresholds are used in the processes which can be empirically determined and can vary depending on the experiments being carried out and/or on the required stringency. The use of genetic data to identify or validate SNP associations can also be used. A number of the steps used in the different example processes may be used in, or adapted or modified for use in, other of the example processes as will be apparent from the teachings herein.
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Therefore, the described embodiments should be taken as illustrative and not restrictive, and the invention should not be limited to the details given herein but should be defined by the following claims and their full scope of equivalents.
This application claims the benefit of U.S. Provisional Application No. 60/460,329 filed Apr. 3, 2003, incorporated by reference herein.
Number | Date | Country | |
---|---|---|---|
60460329 | Apr 2003 | US | |
60518107 | Nov 2003 | US |