The present innovation relates to systems and methods for communicating genomic information, and in particular relates to a methodology and graphic user interface for visualizing genomic information.
While it is understood that environment, diet, age, lifestyle, and general health can all play a role in an individual's response to medication, it is widely believed that an individual's genetic makeup is the key to creating personalized, efficacious, and safe medications. At the intersection of pharmacology and genomics lies the field of pharmacogenomics. This field is the study of how an individual's genetic inheritance affects drug response, and it holds the promise that drugs may be tailor-made for individuals and fine-tuned for their specific genetic makeup. In order to achieve this goal, pharmacogenomics combines biochemistry and other traditional pharmaceutical sciences with annotated knowledge of genes, proteins, and single nucleotide polymorphisms. Single nucleotide polymorphisms are believed to play a particularly important role in understanding etiologies of disease. Pharmacogenomics has the potential to dramatically reduce the estimated 100,000 deaths and 2 million hospitalizations that occur each year in the United States as the result of adverse drug response, as discussed in J. Lazarou, B. H. Pomeranz, and P. N. Corey. Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies. JAMA. Apr 15, 1998. 279(15):1200-5. It also promises more powerful medications, advance screening for disease susceptibility, the development of new and powerful vaccines, improvements in the drug discovery and approval process, and decreased cost for health care.
An example of the benefits of pharmacogenomics is the understanding of the DNA variations in the cytochrome P450 (CYP) family of liver enzymes, which are responsible for breaking down more than 30 different classes of drugs. Less active forms of these enzymes can result in poor metabolism of drugs and inefficient elimination from the body, which in turn can lead to drug overdose.
Another example is an enzyme called TPMT (thiopurine methyltransferase), which plays an important role in the breakdown of a class of therapeutics called thiopurines. Thiopurines are commonly used in chemotherapy treatment of common childhood leukemia. A small percentage of Caucasians have genetic variants that prevent them from producing an active form of this protein. As a result, thiopurines elevate to toxic levels in the patient because the inactive form of TPMT is unable to break down the drug. Today, doctors can use a genetic test to screen patients for this deficiency, and the TPMT activity is monitored to determine appropriate thiopurine dosage levels, as discussed in S. Pistoi. Facing your genetic destiny, part II. Scientific American. Feb. 25, 2002.
One of the recognized problems in the field of pharmacogenomics is discovery of the complex gene variations that affect drug response. The design of studies to find single nucleotide polymorphisms (SNPs) is tedious, as SNPs occur every 100 to 300 bases along the 3-billion-base human genome. Thus millions of SNPs must be identified and analyzed to determine their involvement in drug response. This pharmacogenomics problem is further compounded by the need to understand which genes are involved in disease; thus the big picture requires understanding the complex interplay of genetic modifications that affect disease and the genetic modifications that affect the efficacy of drugs. The process of designing studies to understand this interplay is both time consuming and costly.
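By way of illustration only, the scale of the problem stated above can be checked with simple arithmetic. The sketch below uses only the figures given in the text (one SNP every 100 to 300 bases, a 3-billion-base genome); the function and variable names are hypothetical.

```python
GENOME_BASES = 3_000_000_000  # approximate genome length stated in the text


def expected_snp_count(genome_bases: int, bases_per_snp: int) -> int:
    """Expected number of SNPs if one occurs every `bases_per_snp` bases."""
    return genome_bases // bases_per_snp


# Sparsest and densest spacing quoted in the text.
low = expected_snp_count(GENOME_BASES, 300)
high = expected_snp_count(GENOME_BASES, 100)
print(f"{low:,} to {high:,} SNPs")
```

Under these figures the estimate falls between 10 million and 30 million SNPs, which is consistent with the statement that millions of SNPs must be identified and analyzed.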
What is needed is a way to assist researchers in the process of designing such studies. The present teachings can fulfill this need.
In accordance with the present innovation, a method for displaying genomic information includes displaying a first axis representing a chromosome with units of basepairs. It also includes displaying on the first axis first and second sets of gene reference marks identifying genes located on forward and reverse strands of the chromosome. One or more sets of additional reference marks are further displayed, including genetic marker reference marks and haplotype reference marks. Each set of haplotype reference marks identifies one or more haplotype blocks for a population.
The method for visualizing genomic information and graphic user interface implementing the method is advantageous over previous viewing systems and methods in several ways. For example, the sets of gene reference marks can indicate intron and exon regions for one or more genes in the set. Also, the exon regions can be encoded with prediction power information for one or more populations that can be calculated via a statistical model. Further, the first linear axis displaying the chromosome in basepair units can be visually related to a nonlinear axis in LD units for a selected population. Yet further, the gene reference marks can be single-nucleotide polymorphisms. Further still, the navigation mechanism provided in an online browser format with complementary controls can permit the user to select a chromosome for display and/or navigate the chromosome and its displayed SNPs and haplotypes with name search and/or pan and zoom functionality. Yet further still, the user may be permitted to automatically query an online ordering system for assays by navigating the genomic data to a point of interest and selecting single-nucleotide polymorphisms.
Further areas of applicability of the disclosed methods will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating various embodiments, are intended for purposes of illustration only and are not intended to limit the scope of the innovation.
The present innovation will become more fully understood from the detailed description and the accompanying drawings, wherein:
The following description is merely exemplary in nature and is in no way intended to limit the methods, their application, or uses. Before proceeding to description of the visualization technique and graphic user interface with reference to
After the primers and probes are synthesized in the high-throughput manufacturing facility, quality-control steps can be implemented. For example, oligonucleotide integrity can be tested and assay performance can be tested against a panel of 10 individual genomic DNA samples. Only assays that pass QC tests at step 106 are moved on for validation in the population panels at step 108, which can include DNA samples from some number of African-American, Caucasian (from the Coriell Institute/NIGMS Human Variation panels), Chinese, and Japanese individuals. Some embodiments use 45 individuals from each population. Assay validation in population samples can help ensure that the locus is polymorphic and that the allele frequency will be adequate for association studies in a variety of populations. The performance of each assay can be benchmarked at step 110 against several criteria. Examples of such criteria are background signal, adequate signal generation, and specificity. Assays that meet performance criteria and some minimum minor allele frequency (for example 5%) at step 112 in at least one of the populations tested are annotated at step 114 and released for sale at step 116 at the Applied Biosystems on-line store.
Assay validation yield results have demonstrated that the SNP selection “triage” procedure can be effective in prioritizing SNPs with higher likelihood of being highly polymorphic in multiple populations. For example, in 258,260 assays validated on African-American and Caucasian populations, approximately 95% of the 122,287 SNP assays that passed the performance criteria described above were polymorphic. As shown in
Analysis of genotype data from reference samples is now described. The individual genotypes of the DNA samples generated during validation have enabled study of the profile of linkage disequilibrium across gene regions of the genome for these populations. Methods have been applied to identify haplotype blocks, regions of strong LD and low haplotype diversity, and locations with statistical power for finding association. In addition, metric maps can be constructed that are scaled to the strength of LD and can guide the selection of SNPs for association studies independent of block boundaries (cf. Maniatis et al., PNAS 99: 2228-33, 2002). Ultimately, one of the metrics of greatest practical utility will relate to the power of detecting an association between a disease or disease-risk phenotype and SNP markers in that region. Empirical data can provide an opportunity to estimate the power of an LD SNP map for a large number of known genes. These power estimations can be used to design a genetic study by selecting the adequate number of markers and sample size.
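The minor allele frequency criterion of step 112 can be sketched as follows. This is an illustrative example only, not the validation pipeline itself: it assumes biallelic SNPs, genotype calls encoded as two-character strings, and missing calls encoded as `None`. The function names and the data layout are hypothetical; the 5% default is the example threshold given in the text.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Minor allele frequency for one biallelic SNP.

    `genotypes` is a list of two-character strings such as "AA", "AG", "GG";
    missing calls are given as None and are skipped.
    """
    counts = Counter()
    for call in genotypes:
        if call is not None:
            counts.update(call)  # counts each allele character
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no calls, or monomorphic in this sample
    return min(counts.values()) / total


def passes_maf(genotypes_by_population, threshold=0.05):
    """True if the threshold is met in at least one population."""
    return any(
        minor_allele_frequency(calls) >= threshold
        for calls in genotypes_by_population.values()
    )


# 45 individuals per population, as in some embodiments described above.
example = {
    "population_1": ["AA"] * 45,               # monomorphic
    "population_2": ["AA"] * 40 + ["AG"] * 5,  # MAF = 5 / 90
}
print(passes_maf(example))
```

In this example the SNP is monomorphic in one population but has a minor allele frequency of about 5.6% in the other, so it meets the example threshold.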
Turning to
In the present example, the panel shows a section of chromosome 6. In some embodiments according to this example, vertical blue bars indicate SNPs, horizontal red bars are haplotype blocks (African American), and horizontal yellow bars are haplotype blocks (Caucasian). Genes on the forward strand and genes on the reverse strand are displayed separately, with introns shown in magenta. The first axis in basepairs (a linear scale) is visually related to a second axis in Linkage Disequilibrium Units (a nonlinear scale) by blue lines that indicate SNPs and their locations on the two axes. Gene bars are also color-coded to display prediction power based on linkage disequilibrium (bottom is Caucasian, top is African American). A power legend is in the upper right hand corner.
Using the empirical data, parsimonious subsets of SNPs (“tagging” SNPs) can be identified that have adequate power in disease association studies. This can greatly reduce the study time and cost. Furthermore, the data can allow the identification of regions where, due to the low LD, additional and complementary SNPs currently not in the validated set are needed. These custom assays can be ordered from a service that employs the same design algorithm. For example, the Assays-by-Design™ service from Applied Biosystems is such a service. According to the present teachings, one or more graphic user interfaces can be used to allow researchers to access the analyses of the reference data obtained in order to help them select SNPs for their studies.
Assays developed according to the method described above are commercially available and may be purchased via an online store as pictured in
As described above, a high-quality LD map of validated SNPs can be created by integrating information from both public and private human genome efforts. Expertise in assay design and bioinformatics can allow development of a set of validated SNPs and ready-to-use assay reagents for use with an easy workflow. The individual genotypes being generated can enable a survey of the magnitude of LD and the haplotype diversity across gene regions of the genome for these populations. This survey allows identification of regions that will require higher or lower SNP density to further optimize the map.
In order to further describe the development of the genomic information visualized according to the present teachings, a comparative study is presented of the patterns of linkage disequilibrium (LD) across three human autosomes: chromosomes 6, 21, and 22. A total of 19,860 SNPs with a median spacing ranging from 4 to 7 kb, covering more than 193 Mb of chromosomal segments, and overlapping 2,266 predicted gene regions, were genotyped in 45 African-American and 45 Caucasian DNA samples from the Coriell Institute. Levels of LD potentially useful for mapping extended 30-57% longer for Caucasians as compared to African-Americans, whereas chromosome 6 showed about 50% more extensive LD than the shorter chromosomes (21 and 22). Several methods were applied to find haplotype blocks, optimizing for a minimum number of blocks. However, for a given method multiple optimal solutions were obtained, and while overlapping, they differ by up to 37% in the location of boundaries. When comparing different methods, the differences in shared boundaries are more dramatic, although again significant overlap exists. When an optimal solution of the D′-based method was selected, mean haplotype block length ranged from 29 to 51 kb; blocks were on average 33-42% larger in the Caucasian population than in the African-American population, and 60% larger in chromosome 6 than in chromosomes 21 and 22. The blocks found in African-Americans overlap 70% in length with the Caucasian blocks, whereas the reverse is only about 50%, largely due to Caucasian-specific block segments. In the overlapped block segments, 70% of the common haplotypes are shared between the populations, but 21% are exclusive to African-Americans, and only 8.5% are unique to Caucasians. It was found that, even when up to 93% of the typed SNPs can be found participating in blocks of at least two SNPs, these blocks cover only 31-49% of the length of the chromosomal segments studied.
Utilizing previously developed theory for metric LD maps, population-specific LD maps were produced for the three chromosomes, which, when plotted against physical distance, show plateaus of strong LD and steps of high recombination. The total number of LD units in the maps was 35% greater in African-Americans than in Caucasians. LD was highly correlated to recombination rates estimated from high-resolution linkage maps, and to a lesser extent to SNP density and GC content. Finally, the average statistical power to find association on a per gene basis was estimated using the current SNP map, under reasonable assumptions for complex disease. The results suggest that an average power of over 0.8 for a sample of 500 cases and 500 controls can be obtained for at least 60% of the genes studied when the disease allele frequency is 0.1, and up to 93% when the frequency is 0.2. Together, these results point out areas and genes where additional SNPs would be required for finer coverage and definition of the LD patterns, but suggest that the current SNP density might provide an acceptable starting point to perform association studies and more exhaustive haplotype maps.
Recently, there has been tremendous interest in empirically establishing the patterns of allelic association, also known as linkage disequilibrium (LD), among polymorphic variants of the human genome. When two alleles at adjacent loci co-occur in a chromosomal segment more often than expected if they were segregating independently in the population, the loci are in linkage disequilibrium. The extent of LD across genomic regions is a useful parameter for defining the statistical power of association studies utilizing single-nucleotide polymorphisms (SNP) as surrogate genetic markers, and for guiding the selection and spacing of such polymorphisms to create a marker map useful in candidate gene, candidate region, and eventually whole-genome association studies.
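The definition of linkage disequilibrium given above can be made concrete with the standard pairwise measures D, D′, and r². The following is a minimal sketch that assumes haplotype and allele frequencies have already been estimated from phased data; the function name and argument names are illustrative and do not appear in the present teachings.

```python
def ld_stats(p_ab, p_a, p_b):
    """Pairwise LD measures for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    Returns (D, D', r^2).
    """
    # D is the excess of the observed haplotype frequency over the
    # frequency expected if the two loci segregated independently.
    d = p_ab - p_a * p_b

    # D' rescales D by its maximum attainable magnitude given the
    # allele frequencies, so that |D'| lies between 0 and 1.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


print(ld_stats(0.5, 0.5, 0.5))   # alleles always co-occur
print(ld_stats(0.25, 0.5, 0.5))  # alleles segregate independently
```

When the two alleles always occur together, D′ and r² both equal 1; when the haplotype frequency equals the product of the allele frequencies, D is zero and the loci are in linkage equilibrium.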
With the aim of developing a SNP map to serve as a resource for candidate-gene and candidate-region association studies, SNPs with a median spacing of less than 7 kb were selected, covering most of the length of three human autosomes: chromosomes 6, 21, and 22. Ninety samples from unrelated individuals from two human populations, African-Americans and Caucasians, were genotyped utilizing 5′ nuclease assays that are commercially available as part of a genome-wide set. The empirical results of this comparative study of LD across the three chromosomes and two populations studied are described: blocks with strong LD and low haplotype diversity are identified using a variety of algorithms, the characteristics of those blocks as well as the robustness of the different haplotype block definitions are analyzed, and metric maps for describing regional differences in LD and for guiding SNP selection for association studies are described. Finally, the results of haplotype-based power calculations for case-control studies are presented across the gene-spanning regions of these three chromosomes to better understand the utility of the SNP set examined here.
The TaqMan® probe-based 5′ nuclease assays were utilized to genotype 19,860 SNPs selected from the Celera Human RefSNP database (v 3.6) in 45 African-American and 45 Caucasian DNA samples from the Coriell Institute/NIGMS Human Variation panels. Those assays are commercially available as part of Applied Biosystems' Assays-on-Demand™ SNP Genotyping Products. All SNPs had heterozygosity greater than 0.1 in the respective population, and were tested for deviation from Hardy-Weinberg Equilibrium (p<0.001). In some embodiments, the SNP set covers a total of 193.6 Mb, or approximately 15% of the genome (75% of chromosome 6; 92% of chromosome 21; 89% of chromosome 22) without gaps greater than 60 kb. The mean SNP spacing ranges from 10.4 to 7.2 kb, whereas the median spacing ranges from 6.7 to 3.8 kb, indicating that for most covered segments there is high-resolution coverage.
Identification and analysis of haplotype blocks can be accomplished by implementing several methods to identify segments of strong LD and low haplotype diversity (i.e., “haplotype blocks”). Examples include the |D′| method of Gabriel et al. (Science 296:2225-9, 2002), the four-gamete rule, and an alternative method based on hypothesis testing using |D′| performed at two p-value thresholds of 0.05 and 0.001. One skilled in the art will appreciate that there are other methods for computing LD and haplotype blocks. Grouping SNPs into haplotype blocks by any method can yield several alternative partitions. For example, turning to
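One way the Hardy-Weinberg check mentioned above could be carried out is sketched below. The text does not state which test was used; a one-degree-of-freedom chi-square goodness-of-fit test is assumed here purely for illustration, and with samples of 45 individuals an exact test might be preferred in practice. The function name and the use of the 0.001 cutoff as a rejection threshold are assumptions.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Arguments are observed genotype counts (two homozygotes and the
    heterozygote). Returns (chi2, p_value) with one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = hwe_chi_square(20, 0, 20)  # no heterozygotes observed
print(chi2, p_value < 0.001)
```

In the example, a SNP with no observed heterozygotes despite equal allele frequencies departs strongly from Hardy-Weinberg proportions and would be flagged at the 0.001 level.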
In particular,
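Of the block-finding methods named above, the four-gamete rule is the simplest to illustrate. The sketch below assumes phased, biallelic haplotypes encoded as equal-length strings of "0" and "1", and uses a greedy left-to-right scan. The greedy scan is only one possible way of partitioning and is not asserted to be the procedure used in the study; as noted above, alternative partitions can exist. All names are hypothetical.

```python
def has_four_gametes(haplotypes, i, j):
    """True if all four allele combinations occur at SNP columns i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def greedy_four_gamete_blocks(haplotypes):
    """Partition SNP columns into blocks with no four-gamete pair inside.

    `haplotypes` is a non-empty list of equal-length strings over {"0", "1"}.
    Returns a list of (first_snp_index, last_snp_index) tuples.
    """
    n_snps = len(haplotypes[0])
    blocks = []
    start = 0
    for j in range(1, n_snps):
        # Close the current block as soon as SNP j shows all four
        # gametes with any SNP already in the block.
        if any(has_four_gametes(haplotypes, i, j) for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, n_snps - 1))
    return blocks


# SNPs 0 and 1 are perfectly associated; SNP 2 shows all four
# gametes with each of them.
sample = ["000", "001", "110", "111"]
print(greedy_four_gamete_blocks(sample))
```

For the sample haplotypes the scan places SNPs 0 and 1 in one block and SNP 2 in a block of its own.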
Construction of LD maps is now described. The description of LD patterns using the haplotype block paradigm does not fully describe the extent of LD that is useful for mapping in the greater than 50% of chromosomal intervals not encompassed by blocks in the study described. An alternative approach to describe the local patterns of LD is to calculate the metric linkage disequilibrium units (LDUs) between pairs of SNPs developed by Maniatis et al. (PNAS 99: 2228-33, 2002). These units are additive and provide a coordinate system whose scale is proportional to the regional differences in the strength of LD, in a fashion analogous to the recombination maps constructed in cM used to guide linkage studies.
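The additive property of LDUs can be sketched as a cumulative sum. The example assumes that a decay parameter (written here as epsilon) has already been estimated for each interval between adjacent SNPs, and that the LDU length of an interval is the product of that parameter and the physical distance; the estimation step itself is not shown. The function name and the input layout are hypothetical.

```python
from itertools import accumulate


def ldu_positions(kb_positions, epsilons):
    """Cumulative LDU coordinate for each SNP along a chromosome segment.

    kb_positions: sorted physical SNP positions in kb (length n)
    epsilons:     assumed per-interval decay parameters per kb (length n - 1)
    """
    if len(epsilons) != len(kb_positions) - 1:
        raise ValueError("need one epsilon per interval between adjacent SNPs")
    # LDU length of each interval: decay parameter times physical distance.
    interval_ldu = [
        eps * (right - left)
        for eps, left, right in zip(epsilons, kb_positions, kb_positions[1:])
    ]
    # Because LDUs are additive, map positions are running totals.
    return [0.0] + list(accumulate(interval_ldu))


# Two intervals of strong LD (no decay) around one interval of weak LD.
print(ldu_positions([0, 10, 20, 30], [0.0, 0.1, 0.0]))
```

Intervals with strong LD contribute little or nothing to the running total, while intervals with weak LD contribute a jump, so a plot of these coordinates against physical position shows flat stretches separated by steps.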
Turning now to
The LDU scale can be useful in that the relationships between regions of low haplotype diversity (i.e., blocks) are specified in terms of map distance. These block regions are evident on the LD map scale but it is more important to determine the number of LDUs in a region since any two blocks, by any definition, may be in high LD with each other. Therefore, reliance on tagging haplotype blocks may be locally inefficient for determining optimal marker coverage. Also, the fraction of the genome in inter-block regions is not characterized in terms of haplotype blocks but rather in terms of LD map structure that can be determined fully given sufficient marker density. A remarkable property of the LDU maps for the two populations is that their overall contour is rather similar—most of the differences are found in the magnitude of the steps in regions of low LD/high recombination. This suggests that it may be possible to develop a ‘standard’ LD map that is efficient for association mapping in all populations if suitably scaled.
The power of the SNP set for association studies is now discussed. An important question is whether the marker density provides enough statistical power for association studies given the empirically observed LD profile. In the study described herein, the power for finding association across genes in the three chromosomes was calculated under a fixed sample size which is typical of these types of studies. A haplotype-based test and parameters compatible with the common variant/common disease hypothesis of complex disease were utilized, assuming disease allele frequencies of 0.1 or 0.2. To calculate power, each common haplotype inferred in a gene window was assumed to be in LD with the disease allele and a power value calculated. To provide a single power value per gene, an average weighted on the haplotype frequencies was computed. This average gives greater weight to the power estimated for the common haplotypes, and presumes that common haplotypes might be more likely to harbour more recent disease mutations.
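The frequency-weighted average described above can be sketched as follows. The per-haplotype power values are taken as given inputs; how they are computed is outside the scope of this sketch. Normalizing by the sum of the common-haplotype frequencies is an assumption made for illustration, and the function name is hypothetical.

```python
def gene_power(haplotype_freqs, haplotype_powers):
    """Single power value for a gene window.

    haplotype_freqs:  frequencies of the common haplotypes in the window
    haplotype_powers: power computed for each haplotype, assuming that
                      haplotype is the one in LD with the disease allele
    Returns the average of the powers weighted by haplotype frequency.
    """
    if len(haplotype_freqs) != len(haplotype_powers):
        raise ValueError("one power value is required per haplotype")
    total = sum(haplotype_freqs)
    if total <= 0:
        raise ValueError("haplotype frequencies must sum to a positive value")
    weighted = sum(f * p for f, p in zip(haplotype_freqs, haplotype_powers))
    return weighted / total


# Two common haplotypes: the more frequent one dominates the average.
print(gene_power([0.6, 0.3], [0.9, 0.6]))
```

In the example the result is approximately 0.8, closer to the power of the more frequent haplotype than a simple unweighted mean of 0.75 would be.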
Turning now to
As described above, haplotype blocks for the entire length of three human autosomes were identified, and metric maps were constructed that are scaled to the strength of LD. The latter can guide the selection of SNPs for association studies independent of block boundaries. By all measures used, Caucasians showed about one-third more LD than African-Americans, and chromosome 6 exhibited up to 50% more LD than chromosomes 21 or 22. These results provide an empirical foundation for designing association studies, knowing in advance which genes have marker coverage likely to deliver adequate statistical power and which would require more SNPs and/or larger sample sizes.
An unzoomed view after chromosome selection shows the entire chromosomal axis 154. The chromosomal axis is in units of base pairs, including multiples thereof, such as kilobase or other multiple of basepair units. The user can change the resolution by zooming in and out, and may be permitted to zoom in to a point where single basepair units are employed. Zooming in can be achieved by a mouse left click. The zoomed view centers at the pointer location. A zoom out can be achieved by right clicking, which can automatically adjust zoom and pan settings minimally to achieve “round numbers” for desired axis positions as further explained below.
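The pointer-centered zoom behavior described above can be sketched as a function that computes a new visible window on the chromosomal axis. This is an illustrative example only: the zoom factor of 2, the clamping to the chromosome bounds, and all names are assumptions, and the adjustment toward round axis numbers is not shown.

```python
def zoom(view_start, view_end, center, factor, chrom_length):
    """Return a new (start, end) window in basepairs.

    factor > 1 zooms in and factor < 1 zooms out. The window is centered
    on `center` where possible and clamped to [0, chrom_length].
    """
    # New width, never below one basepair and never above the chromosome.
    width = max(1, round((view_end - view_start) / factor))
    width = min(width, chrom_length)
    # Center on the pointer, then shift back inside the chromosome.
    start = round(center - width / 2)
    start = max(0, min(start, chrom_length - width))
    return start, start + width


# Left click at position 500 in a 1,000 bp view: zoom in by a factor of 2.
print(zoom(0, 1000, 500, 2, 1000))
# Right click: zoom back out by the same factor.
print(zoom(250, 750, 500, 0.5, 1000))
```

A click near either end of the chromosome yields a window that is shifted rather than centered, so the view never extends past the chromosomal axis.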
Those skilled in the art can now appreciate from the foregoing description that these broad teachings can be implemented in a variety of forms. Therefore, while the teachings have been described in connection with particular examples thereof, the true scope thereof should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, the specification and the following claims.
This application claims the benefit of U.S. Provisional Application No. 60/466,310, filed on Apr. 28, 2003. The disclosure of the above application is incorporated herein by reference.
Number | Date | Country
---|---|---
60466310 | Apr 2003 | US