Genome-wide association studies have discovered many genetic loci associated with disease, but the molecular basis of these associations is often unresolved. Genome-wide regulatory and gene expression profiles measured across individuals and diseases reflect downstream effects of genetic variation, and may allow for functional annotation of disease-associated loci.
Complete genome sequences of individual patients will soon become integrated as part of routine clinical care. There exists a need for interpreting the clinical significance of novel genetic variants presenting in a patient's personal genome, which are known to be associated with diverse clinical disorders. Current tools and approaches for clinical assessment of genetic variation do not explicitly consider gene regulatory information and are typically focused on specific gene coding regions.
Embodiments of the present invention enable genome-wide systematic evaluation of potentially clinically relevant genetic variation in a personal genome. Among other things, the present invention provides methods embodied in a system that can be applied to genetic information comprising an individual genome to assess the regulatory impact of specific genetic variants and their possible impact on biological function or disease pathology.
An embodiment of the invention is comprised of databases and algorithms embodied in software, where the databases contain information providing genome-wide quantitative and genetic profiles of transcription factor binding measured across a multitude of individual human genomes, information of genetic variants associated with disease conditions, information on DNA motifs associated with transcription factor binding, as well as molecular profiles of disease pathology.
Embodiments of the present invention use genome-wide quantitative gene regulatory information to assess total genetic variation presenting in an individual genome. The present invention provides the ability to infer transcription factor binding events from genotypes. Also, the present invention provides the ability to associate individual variation in gene regulatory regions with biological function and disease pathology.
Applications of the present invention include clinical assessment of personal genomes, clinical assessment of cancer genomes, interpretation of genetic disease associations, and discovery of regulatory DNA biomarkers. In other embodiments, additional types of gene regulatory information can be added.
These and other embodiments can be more fully appreciated upon an understanding of the detailed description of the invention as disclosed below in conjunction with the attached figures.
The following drawings will be used to more fully describe embodiments of the present invention.
Among other things, the present invention relates to methods, techniques, and algorithms that are intended to be implemented in digital computer system 100 such as generally shown in
Computer system 100 may include at least one central processing unit 102 but may include many processors or processing cores. Computer system 100 may further include memory 104 in different forms such as RAM, ROM, hard disk, optical drives, and removable drives that may further include drive controllers and other hardware. Auxiliary storage 112 may also be included, which can be similar to memory 104 but may be more remotely incorporated, such as in a distributed computer system with distributed memory capabilities.
Computer system 100 may further include at least one output device 108 such as a display unit, video hardware, or other peripherals (e.g., printer). At least one input device 106 may also be included in computer system 100 that may include a pointing device (e.g., mouse), a text input device (e.g., keyboard), or touch screen.
Communications interfaces 114 also form an important aspect of computer system 100, especially where computer system 100 is deployed as a distributed computer system. Communications interfaces 114 may include LAN network adapters, WAN network adapters, wireless interfaces, Bluetooth interfaces, modems and other networking interfaces as currently available and as may be developed in the future.
Computer system 100 may further include other components 116 that may be generally available components as well as specially developed components for implementation of the present invention. Importantly, computer system 100 incorporates various data buses 118 that are intended to allow for communication of the various components of computer system 100. Data buses 118 include, for example, input/output buses and bus controllers.
Indeed, the present invention is not limited to computer system 100 as known at the time of the invention. Instead, the present invention is intended to be deployed in future computer systems with more advanced technology that can make use of all aspects of the present invention. It is expected that computer technology will continue to advance but one of ordinary skill in the art will be able to take the present disclosure and implement the described teachings on the more advanced computers or other digital devices such as mobile telephones or “smart” televisions as they become available.
Moreover, the present invention may be implemented on one or more distributed computers. Still further, the present invention may be implemented in various types of software languages including C, C++, and others. Also, one of ordinary skill in the art is familiar with compiling software source code into executable software that may be stored in various forms and in various media (e.g., magnetic, optical, solid state, etc.). One of ordinary skill in the art is familiar with the use of computers and software languages and, with an understanding of the present disclosure, will be able to implement the present teachings for use on a wide variety of computers.
The present disclosure provides a detailed explanation of the present invention that allows one of ordinary skill in the art to implement the present invention as a computerized method. Certain details are not included in the present disclosure so as not to detract from the teachings presented herein, but it is understood that one of ordinary skill in the art would be familiar with such details.
In an embodiment of the invention as shown in
User computing device 124 can be implemented in various forms such as desktop computer 128, laptop computer 130, smart phone 132, or tablet device 134. Other devices that may be developed and are capable of the computing actions described herein are also appropriate for use in conjunction with the present invention.
In the present disclosure, computing and other activities will be described as being conducted on either computer server 122 or user computing device 124. It should be understood, however, that many if not all of such activities may be reassigned from one device to the other while keeping within the present teachings. For example, for certain steps, computations that may be described as being performed on computer server 122 may, in a different embodiment, be performed on user computing device 124.
In an embodiment of the invention, computer server 122 is implemented as a web server on which Apache HTTP web server software is run. Computer server 122 can also be implemented in other manners such as an Oracle web server (known as Oracle iPlanet Web Server). In an embodiment computer server 122 is a UNIX-based machine but can also be implemented in other forms such as a Windows-based machine. Configured as a web server, computer server 122 is configured to serve web pages over network 126 such as the internet.
In an embodiment, user computing device 124 is configured so as to run web browser software. For example, where user computing device 124 is implemented as desktop computer 128 or laptop computer 130, currently available web browser software includes Internet Explorer, Firefox, and Chrome. Other browser software is available for different applications of user computing device 124. Still other software is expected to be developed in the future that is able to execute certain steps of the present invention.
In an embodiment, user computing device 124, through the use of appropriate software, queries computer server 122. Responsive to such query, computer server 122 provides information so as to display certain graphics and text on user computing device 124. In an embodiment, the information provided by computer server 122 is in the form of HTML that can be interpreted by and properly displayed on user computing device 124. Computer server 122 may provide other information that can be interpreted on user computing device 124.
It has been found that transcription factor binding is significant in the analysis of regulatory variation. For example, as shown in
In a certain sense, it is important to consider how binding of one factor can affect the binding of another. For example, as shown in
A systematic approach is presented herein to combine disease association, transcription factor binding, and gene expression data to assess the functional consequences of variants associated with hundreds of human diseases. In an analysis of genome-wide binding profiles of NFκB, it was found that disease-associated SNPs are enriched in NFκB binding regions overall, and specifically for inflammatory mediated diseases, such as asthma, rheumatoid arthritis, and coronary artery disease. Using genome-wide binding variation information for eight fully sequenced individuals, it was found that regions of NFκB binding correlated with disease-associated variants in an allele-specific manner (see pipeline method of
The association between genotype and phenotype is a fundamental problem in biology and translational medicine. Genome-wide association studies (GWASs) have identified many genetic variants associated with diseases [see Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106, 9362-9367 (2009); note that these and other references cited herein are incorporated by reference for all purposes], but such approaches rely on “tag” single nucleotide polymorphisms (SNPs) found on DNA microarrays. While these SNPs may lie in or near gene regions, their specific influences on the biology of disease are not necessarily determined in typical GWASs [Green, E. D., Guyer, M. S. National Human Genome Research Institute Charting a course for genomic medicine from base pairs to bedside. Nature 470, 204-213 (2011)]. Furthermore, disease-associated SNPs that are found outside of genic regions are often not further investigated because they are of unknown function.
Systems biology can provide an approach to bridge the gap between genotype and phenotype. For example, human variation in transcription factor (TF) binding has been correlated with polymorphisms in motifs for NFκB and PolII in ten individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010); Karczewski, K. J. et al. Discovering Cooperative Transcription Factor Associations using Binding Variation Information and the ALPHABIT Pipeline. 1-25 (2011)] and regulatory features across dozens of cell lines have been mapped extensively by the ENCODE project [Birney, E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799-816 (2007)].
It is, therefore, expected that polymorphisms that affect transcription factor binding can have a significant influence on disease because the differences in TF binding (that lead to downstream differences in expression) may be the true underlying cause of the disease association of the SNPs. These functional biology-rich sources of data can, therefore, be leveraged to suggest putative function for previously unannotated disease-associated SNPs.
In the present disclosure, the role of transcription factor binding sites in disease is described. As a non-limiting case study, genome-wide enrichments are explored for disease SNPs in NFκB (p65) binding regions to predict genotype-specific binding events associated with disease.
Shown in
As shown in step 210 of
NFκB Binding Regions are Enriched for Disease Associated SNPs
In a particular embodiment of the present invention, a compendium of disease SNPs [see Ashley, E. A. et al. Clinical assessment incorporating a personal genome. Lancet 375, 1525-1535 (2010); Chen, R., Davydov, E. V., Sirota, M. & Butte, A. J. Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS ONE 5, e13574 (2010)] was intersected with a set of 15,522 NFκB binding regions found in lymphoblastoid cell lines from ten individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)]. It was found that established disease-associated SNPs were overabundant in regions bound by NFκB (χ2=292.9; p=1.1e-65; Fisher's OR=2.95). These associations are not biased by the platforms used for disease association discovery, as NFκB regions are underrepresented on Affymetrix 6.0 and 500K arrays (Fisher's OR=0.8 and 0.82, respectively) and only slightly overrepresented on Illumina 550K and 1M (Fisher's OR= and 1.42, respectively), which represented a smaller portion of this analysis. Additionally, binding sites of a known interacting factor, Stat1, were also highly enriched for disease-SNPs; this enrichment was not present in promoter regions, as defined by PolII binding as shown in
As shown in
Disease-associated SNPs in NFκB binding regions are more pleiotropic (i.e., typically associated with more diseases) than the collection of known disease-associated SNPs (1.33 vs. 1.15; t-test p-value=8.7e-4; Mann-Whitney U-test p-value=2.7e-7). For example, as shown in
Disease Associated SNPs are Found in More Biologically Relevant Binding Regions
NFκB binding regions that harbor disease-associated SNPs are more strongly bound by NFκB, as determined by ChIP-Seq binding intensity, compared to the background of all NFκB binding regions. Additionally, these binding regions are less variable, indicating the potential for evolutionary constraint on these regions.
SNPs in NFκB Binding Regions Suggest a Mechanism for the Biology of Disease
In a systematic effort to assign functional annotation to disease-associated SNPs, a pipeline as shown in
Using genotype and NFκB binding information from eight individuals [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)], a preliminary, lower-power analysis was performed to identify candidate SNPs. In an assessment of SNPs in NFκB binding regions in linkage disequilibrium (R2>0.5) with a disease-associated SNP, SNPs associated with NFκB binding were found by an ANOVA. For instance, rs6135095, a SNP previously reported to be associated with atherosclerosis, shows significant association between genotype and NFκB binding in the 8 cell lines queried.
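As a non-limiting illustration, the genotype-versus-binding test described above can be sketched as a one-way ANOVA over binding signal grouped by genotype. The function name, the genotype labels, and the binding values below are hypothetical and are not taken from the study data. The sketch computes only the F statistic. A p-value would additionally require the F distribution, for example from a statistics library.

```python
from collections import defaultdict


def one_way_anova_f(genotypes, binding):
    """Return (F statistic, df_between, df_within) for binding grouped by genotype."""
    groups = defaultdict(list)
    for genotype, signal in zip(genotypes, binding):
        groups[genotype].append(signal)

    n_total = len(binding)
    k = len(groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two genotype groups and residual degrees of freedom")

    grand_mean = sum(binding) / n_total
    means = {g: sum(v) / len(v) for g, v in groups.items()}

    # Variation explained by genotype versus variation left within genotype groups.
    ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
    ss_within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)

    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        return float("inf"), df_between, df_within
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical example: genotypes at one SNP and binding signal in eight cell lines.
genotypes = ["CC", "CC", "CC", "CT", "CT", "CT", "TT", "TT"]
binding = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
f_stat, df_b, df_w = one_way_anova_f(genotypes, binding)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

With only eight individuals the test has limited power, which is consistent with the description of this analysis as preliminary.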
These variants associated with NFκB binding were linked with downstream expression effects of nearby genes. Considering all genes within 200 kb to be potential targets, disease-associated SNPs were found to be associated with changes in NFκB binding which were correlated with expression of nearby genes.
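As a non-limiting illustration, the selection of candidate target genes within 200 kb of a SNP can be sketched as follows. The gene records, coordinates, and function name are hypothetical. Measuring distance to the nearest edge of the gene body, rather than to the transcription start site, is an assumption of this sketch and is not specified above.

```python
WINDOW_BP = 200_000  # window size stated in the text


def genes_near_snp(snp_chrom, snp_pos, genes, window=WINDOW_BP):
    """Return names of genes lying within `window` base pairs of the SNP.

    `genes` is an iterable of (name, chromosome, start, end) tuples.
    """
    hits = []
    for name, chrom, start, end in genes:
        if chrom != snp_chrom:
            continue
        # Distance is zero when the SNP falls inside the gene body.
        if snp_pos < start:
            distance = start - snp_pos
        elif snp_pos > end:
            distance = snp_pos - end
        else:
            distance = 0
        if distance <= window:
            hits.append(name)
    return hits


# Hypothetical gene records: (name, chromosome, start, end).
genes = [
    ("GENE_A", "chr20", 1_000_000, 1_050_000),
    ("GENE_B", "chr20", 1_450_000, 1_500_000),
    ("GENE_C", "chr1", 1_100_000, 1_150_000),
]
nearby = genes_near_snp("chr20", 1_200_000, genes)
print(nearby)
```

Each gene returned this way would then be tested for correlation between its expression and the binding signal at the SNP.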
In an independent validation experiment, aortic tissues from 10 individuals were genotyped, and certain of them were found to carry the rs6135095 CT and TT genotypes. This SNP was found to be associated with binding of NFκB (by ChIP-qPCR) as well as expression of nearby genes (SIRPG, etc.).
ENCODE/HS Sites
PolII binding regions were not overrepresented for disease SNPs. Therefore, an enrichment for disease-associated SNPs was computed for various factors in several cell lines and it was found that these SNPs were overrepresented in certain of those factors.
Data Sources
Data on disease-SNP associations (p<0.01) were used as in [Ashley, E. A. et al. Clinical assessment incorporating a personal genome. Lancet 375, 1525-1535 (2010); Chen, R., Davydov, E. V., Sirota, M. & Butte, A. J. Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS ONE 5, e13574 (2010)]. ChIP-Seq data on eight cell lines with individual genome sequences was obtained from [Kasowski, M. et al. Variation in transcription factor binding among humans. Science 328, 232-235 (2010)]. All analyses were performed using dbSNP release 132 and hg19 coordinates.
ASB/ASE
ChIP-Seq reads were mapped to the hg19 assembly of the human genome using BWA. PCR duplicates were filtered using Picard tools. Variant calling files were downloaded from 1000 Genomes and converted to hg19 coordinates with VCF tools. Allele-specific binding (ASB) was determined on a per-heterozygote per-individual basis for the ten individuals. Reads were filtered to be above MAQ 30 mapping quality. For each individual, a binomial probability of success was determined based on the probability that a reference allele maps to the genome compared to a non-reference allele. Allele-specific expression (ASE) was similarly determined using reads from the transcriptome of each individual.
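As a non-limiting illustration, the allele-specific test at a single heterozygous site can be sketched as a two-sided exact binomial test on reference and non-reference read counts. The function name and the read counts are hypothetical. The parameter `p_ref` stands in for the per-individual probability that a reference-allele read maps, and its default of 0.5 is an assumption of this sketch.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, p_ref=0.5):
    """Two-sided exact binomial p-value for allelic imbalance at one site.

    Sums the probabilities of all outcomes that are no more likely than
    the observed reference-read count under the null proportion `p_ref`.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("site has no reads")
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small relative tolerance so that ties are counted despite rounding.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical read counts at one heterozygous SNP in one individual.
p_value = binomial_two_sided_p(ref_reads=9, alt_reads=1)
print(f"p = {p_value:.4f}")
```

This direct summation is adequate for modest read depth. At very deep coverage a log-space or library implementation would be preferable, and p-values across many sites would need correction for multiple testing.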
Statistical Analysis
Overall associations between NFκB binding regions and disease-associated SNPs were ascertained by chi-squared and Fisher's exact tests. Associations between individual SNPs and binding strengths were tested by two sample t-tests (with two genotypes grouped) or ANOVA for all 3 genotypes. All statistical analysis methods were performed using R statistical software (2.12.1).
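As a non-limiting illustration, the overall enrichment test can be sketched on a 2x2 table of SNP counts, with rows for inside versus outside binding regions and columns for disease-associated versus other SNPs. The function names and counts below are hypothetical and are not the study data. The chi-squared statistic is computed without continuity correction and the Fisher p-value is one-sided, both of which are assumptions of this sketch since the settings used in R are not stated above.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: probability of at least `a` in the top-left cell."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    # Hypergeometric tail with all table margins held fixed.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical counts:
# a = disease SNPs in binding regions, b = other SNPs in binding regions,
# c = disease SNPs outside regions,    d = other SNPs outside regions.
a, b, c, d = 3, 1, 1, 3
print(odds_ratio(a, b, c, d), chi_squared_2x2(a, b, c, d), fisher_exact_greater(a, b, c, d))
```

An odds ratio above 1 with a small p-value would indicate that disease-associated SNPs are overrepresented in binding regions.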
Using embodiments according to the present invention, it is possible to determine the previously unknown functional significance of regulatory variants. Indeed, embodiments of the present invention can be used to discover new transcription factor interactions. For example, using the present invention, disease-associated variants can be connected to molecular pathophysiology, which can explain the function of non-coding SNPs.
It should be appreciated by those skilled in the art that the specific embodiments disclosed above may be readily utilized as a basis for modifying or designing other techniques for carrying out the same purposes of the present invention. It should also be appreciated by those skilled in the art that such modifications do not depart from the scope of the invention as set forth in the appended claims.
This application claims priority to U.S. Provisional Application No. 61/526,242 filed Aug. 22, 2011, which is hereby incorporated by reference in its entirety for all purposes. This application claims priority to U.S. Provisional Application No. 61/526,095 filed Aug. 22, 2011, which is hereby incorporated by reference in its entirety for all purposes.
This invention was made with Government support under contract HG000237 awarded by the National Institutes of Health. The Government has certain rights in this invention.
Number | Date | Country
---|---|---
61526242 | Aug 2011 | US
61526095 | Aug 2011 | US