The subject matter disclosed herein is generally directed to identifying individuals with a genetic predisposition to coronary artery disease. In particular, the disclosure relates to a method for determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject, and in some instances, providing a treatment to those determined to have an increased genetic risk.
This patent application contains lengthy table sections. Copies of the tables have been submitted electronically in ASCII format and are hereby incorporated herein by reference, and may be employed in the practice of the invention. Said ASCII tables are, as follows: (1) BI-10219 Table A.txt (116KSNP_score), 3,217,459 bytes, created Jul. 12, 2017. (2) BI-10219 Table B.txt (6.6M Variant score) (fourteen parts: Part 1, 21,206,184 bytes; Part 2, 21,175,211 bytes; Part 3, 21,158,106 bytes; Part 4, 21,127,244 bytes; Part 5, 21,014,819 bytes; Part 6, 20,982,102 bytes; Part 7, 20,886,150 bytes; Part 8, 21,102,333 bytes; Part 9, 21,811,365 bytes; Part 10, 21,989,831 bytes; Part 11, 21,812,645 bytes; Part 12, 21,519,953 bytes; Part 13, 21,579,221 bytes; Part 14, 5,647,238 bytes) created Jul. 12, 2018. (3) BI-10219 Table C.txt (Top1% Variant score), 2,785,002 bytes, created Jul. 12, 2018. (4) BI-10219 Table D. txt, (sixteen parts: Part 1, 22,586,645 bytes; Part 2, 24,440,729 bytes; Part 3, 20,830,181 bytes; Part 4, 21,327,291 bytes; Part 5, 18,526,925 bytes; Part 6, 19,795,940 bytes; Part 7, 16,866,228 bytes; Part 8, 15,754,356 bytes; Part 9, 12,235,034 bytes; Part 10, 15,461,486 bytes; Part 11, 14,982,489 bytes; Part 12, 14,604,279 bytes; Part 13, 21,076,445 bytes; Part 14, 17,159,606 bytes; Part 15, 16,408,171 bytes; Part 16, 21,580,720 bytes;) created Jul. 12, 2018.
An increased risk of myocardial infarction in those with a parental history was first documented in 1951 (see Gertler et al., J Am. Med. Ass., 1951; 147(7):621-25), catalyzing efforts to identify the discrete DNA-based drivers of heritable risk. A molecular defect in the gene encoding the LDL receptor (LDLR) was identified as a driver of hypercholesterolemia and coronary risk in 1985. (See Lehrman et al., Science, 1985; 227(4683):140-46). Subsequent genome-wide association studies (GWAS) were performed based on arrays designed to capture variants common in the population. The first such analyses for coronary disease uncovered multiple risk variants in the chromosomal 9p21 locus in 2007. (See Samani et al., N. Eng. J Med., 2007; 357:443-53; Helgadottir et al., Science, 2007; 316:1491-1493; McPherson et al., Science, 2007; 316:1488-1491). Since then, more than 60 common genetic variants have been identified in progressively larger GWAS studies. (See Myocardial Infarction Genetics Consortium, Kathiresan S, Voight B F, et al., Nat Genet., 2009; 41(3):334-41; CARDIoGRAMplusC4D Consortium, Deloukas P, Kanoni S, et al., Nat Genet., 2013; 45:25-33; Nikpay et al., Nat Genet. 2015; 47(10):1121-30; Myocardial Infarction Genetics and CARDIoGRAM Exome Consortia Investigators, Stitziel N O, Stirrups K E, et al., N Engl J Med., 2016; 374(12):1134-44; Webb et al., J Am Coll Cardiol, 2017; 69(7):823-836). Furthermore, candidate gene analysis and whole exome sequencing, which captures variation in the 1% of the genome that encodes proteins, have associated a cumulative burden of rare, damaging variants in at least 9 genes with coronary risk. (See Do et al., Nature, 2015; 518(7537):102-6; Cohen et al., N Engl J Med., 2006; 354(12):1264-72; Myocardial Infarction Genetics Consortium Investigators, Stitziel N O, Won H H, et al., N Engl J Med., 2014; 371(22):2072-82; Nioi et al., N Engl J Med., 2016; 374(22):2131-41; Jorgensen et al., N Engl J Med., 2014 Jul. 3; 371(1):32-41; Crosby et al., Loss-of-function mutations in APOC3, triglycerides, and coronary disease, N Engl J Med., 2014; 371:22-31; Dewey et al., N Engl J Med., 2016; 374(12):1123-33; Khera et al., JAMA, 2017; 317(9):937-946).
Citation or identification of any document in this application is not an admission that such document is available as prior art to the present invention.
In one aspect, the disclosure relates to a method of determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject, the method comprising: identifying whether at least 95 single nucleotide polymorphisms (SNPs) from Table D are present in a biological sample from the subject; wherein the presence of a risk allele of a SNP from Table D indicates that the subject has an increased risk of coronary artery disease, and wherein the presence of an alternative allele indicates that the subject has a decreased risk of coronary artery disease. In another aspect, the invention relates to a method of determining the risk of developing coronary artery disease comprising odds ratios that are improved over methods in the prior art.
In another aspect, the invention relates to a method of determining a polygenic risk score (PRS) for developing coronary artery disease in a subject, the method comprising selecting at least 95 single nucleotide polymorphisms (SNPs) from Table D; identifying whether the at least 95 SNPs are present in a biological sample from the subject; and calculating the polygenic risk score (PRS) based on the presence of the SNPs.
In another aspect, the invention relates to a method of identifying a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject and providing a treatment to the subject, the method comprising obtaining a biological sample from the subject; identifying whether at least one single nucleotide polymorphism (SNP) from Table D is present in the biological sample; wherein the presence of a risk allele of a SNP from Table D indicates that the subject has an increased risk of coronary artery disease; and initiating a treatment for the subject, wherein the treatment comprises statins, ezetimibe, beta-blocking agents, angiotensin-converting-enzyme inhibitors, aspirin, anticoagulants, antiplatelet agents, angiotensin II receptor blockers, angiotensin receptor neprilysin inhibitors, calcium channel blockers, cholesterol-lowering medications, vasodilators, antidiuretics, renin-angiotensin system agents, lipid-modifying medicines, anti-inflammatory agents, nitrates, antiarrhythmic medicines, steroidal or non-steroidal anti-inflammatory drugs, DNA methyltransferase inhibitors and/or histone deacetylase inhibitors.
In another aspect, the invention relates to a method of reducing a risk of coronary artery disease, e.g., myocardial infarction, in a subject comprising administering to the subject a treatment which comprises one or more statins, beta-blocking agents, angiotensin-converting-enzyme inhibitors, aspirin, anticoagulants, antiplatelet agents, angiotensin II receptor blockers, angiotensin receptor neprilysin inhibitors, calcium channel blockers, cholesterol-lowering medications, vasodilators, antidiuretics, renin-angiotensin system agents, lipid-modifying medicines, anti-inflammatory agents, nitrates, antiarrhythmic medicines, steroidal or non-steroidal anti-inflammatory drugs, DNA methyltransferase inhibitors and/or histone deacetylase inhibitors, wherein the subject has a polygenic risk score that corresponds to a high risk group, and wherein the polygenic risk score is calculated by a method comprising selecting at least 95 single nucleotide polymorphisms (SNPs) from Table D; identifying whether the at least 95 SNPs are present in a biological sample from the subject; and calculating the polygenic risk score (PRS) based on the presence of the SNPs.
In another aspect, the invention relates to a method of determining a risk of developing coronary artery disease in a subject, the method comprising identifying whether at least 95 single nucleotide polymorphisms (SNPs) from Table D are present in a biological sample from the subject and calculating a polygenic risk score (PRS); wherein the presence of a risk allele of a SNP from Table D indicates that the subject has an increased risk of coronary artery disease, and wherein the presence of an alternative allele indicates that the subject has a decreased risk of coronary artery disease.
In another aspect, the invention relates to a method of determining a risk of developing coronary artery disease in a subject, the method comprising obtaining a biological sample from the subject; identifying whether at least 95 single nucleotide polymorphisms (SNPs) from Table D are present in the biological sample from the subject and, optionally, calculating a polygenic risk score (PRS); wherein the presence of a risk allele of a SNP from Table D indicates that the subject has an increased risk of coronary artery disease, and wherein the presence of an alternative allele indicates that the subject has a decreased risk of coronary artery disease.
In another aspect, the invention relates to a method of determining a risk of developing breast cancer in a subject, the method comprising determining the presence or absence of risk alleles associated with breast cancer; calculating a polygenic risk score for the subject; wherein the presence of a risk allele indicates that the subject has an increased risk of breast cancer, and wherein the presence of an alternative allele indicates that the subject has a decreased risk of breast cancer. The invention also relates to a method of determining the risk of developing breast cancer in a subject comprising odds ratios that are improved over methods in the prior art. In some aspects, the polygenic risk score does not comprise alleles of BRCA-1 or BRCA-2. In another aspect of the invention, the polygenic risk score comprises odds ratios indicative of breast cancer. In some aspects of the invention, the polygenic risk score comprises odds ratios determined on a plurality of genetic loci. In another aspect, the method comprises odds ratios of 1.5 or greater, or 1.75 or greater, or 2.0 or greater, or 2.25 or greater for the top 20% of the distribution; or 1.5 or greater, or 1.75 or greater, or 2.0 or greater, or 2.25 or greater, or 2.5 or greater, or 2.75 or greater for the top 5% of the distribution. In another aspect, the method comprises odds ratios equal to or greater than provided in Table 28. In particular, Table 28 provides odds ratios corresponding to stratified subject populations. For example, odds ratios can be from 1.0 to 1.5, 1.5 to 2.0, 2.0 to 2.5, 2.5 to 3.0, 3.0 to 3.5, 3.5 to 4.0, 4.0 to 4.5, 4.5 to 5.0, 5.0 to 5.5, 5.5 to 6.0, 6.0 to 6.5, 6.5 to 7.0, or higher, including individual values within the ranges. The odds ratios can be associated with, for example, the top quartile, the top quintile, the top 20%, the top 10%, the top 5%, the top 1%, the top 0.5%, or the top 0.25% of subject populations.
In some aspects of the invention, the polygenic risk score is used to guide enhanced diagnostic strategies, e.g., mammography, breast MRI, or breast ultrasound; or the polygenic risk score is used to guide chemoprevention; or the polygenic risk score is used to guide prophylactic breast surgery.
In another aspect, the invention relates to a method of determining a risk of developing obesity in a subject, the method comprising determining the presence or absence of risk alleles associated with obesity; calculating a polygenic risk score for the subject; wherein the presence of a risk allele indicates that the subject has an increased risk of obesity, and wherein the presence of an alternative allele indicates that the subject has a decreased risk of obesity. The invention also relates to a method of determining a risk of developing obesity in a subject comprising odds ratios that are improved over methods in the prior art. In some aspects of the invention, the polygenic risk score comprises odds ratios indicative of obesity. In another aspect of the invention, the polygenic risk score comprises odds ratios determined on a plurality of genetic loci. In another aspect, the method comprises odds ratios of 1.5 or greater, or 2.0 or greater, or 2.5 or greater, or 3.0 or greater, or 3.5 or greater, or 4.0 or greater for the top 20% of the distribution; or 1.5 or greater, or 2.0 or greater, or 2.5 or greater, or 3.0 or greater, or 3.5 or greater, or 4.0 or greater, or 4.5 or greater, or 5.0 or greater for the top 5% of the distribution. In another aspect of the invention, the method comprises odds ratios equal to or greater than provided in Table 28. In another aspect of the invention, the polygenic risk score is used to prescribe intensive lifestyle interventions, to prescribe anti-obesity medicines, or to prescribe bariatric surgery.
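By way of non-limiting illustration, an odds ratio for the top fraction of a polygenic score distribution, of the kind recited above, could be estimated from a 2x2 table as in the following Python sketch. The sketch computes an unadjusted odds ratio comparing the top fraction with the remainder of the population; this choice, the function name `odds_ratio_top_fraction`, and the example data are assumptions made for illustration and do not reproduce the analysis underlying Table 28, which may, for example, adjust for covariates.

```python
from typing import Sequence


def odds_ratio_top_fraction(
    scores: Sequence[float], is_case: Sequence[bool], fraction: float
) -> float:
    """Unadjusted odds ratio of disease for the top `fraction` of scores
    versus all remaining subjects, computed from a 2x2 table."""
    if len(scores) != len(is_case) or not scores:
        raise ValueError("scores and is_case must be non-empty and of equal length")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    n = len(scores)
    # Indices ordered from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_top = max(1, round(n * fraction))
    top = set(order[:n_top])
    a = sum(1 for i in top if is_case[i])                        # top, cases
    b = n_top - a                                                # top, controls
    c = sum(1 for i in range(n) if i not in top and is_case[i])  # rest, cases
    d = (n - n_top) - c                                          # rest, controls
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: a cell in the denominator is zero")
    return (a * d) / (b * c)


# Made-up data for ten hypothetical subjects (scores 1..10).
example_scores = list(range(1, 11))
example_cases = [True, True, False, False, False, False, False, False, False, True]
example_or = odds_ratio_top_fraction(example_scores, example_cases, 0.2)
```

In this made-up example the top 20% contains one case and one control and the remainder contains two cases and six controls, so the odds ratio is (1 x 6) / (1 x 2) = 3.0.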
In another aspect, the invention relates to a method of detecting single nucleotide polymorphisms (SNPs) in a subject, said method comprising: detecting whether at least 95 SNPs from Table D are present in a biological sample from a subject by contacting the biological sample with a set of probes to each SNP and detecting binding of the probes, by amplifying genome regions comprising the SNPs using a set of amplification primers, or by sequencing genomic regions comprising or enriched for the SNPs. In some embodiments, detecting whether at least 95 SNPs from Table D are present in the biological sample comprises detecting whether at least 100 SNPs are present in the biological sample. In some embodiments, detecting whether at least 95 SNPs from Table D are present in the biological sample comprises detecting whether at least 200 SNPs, or at least 500 SNPs, or at least 1000 SNPs, or at least 2000 SNPs, or at least 5000 SNPs, or at least 10,000 SNPs, or at least 20,000 SNPs, or at least 50,000 SNPs, or at least 75,000 SNPs, or at least 100,000 SNPs, or at least 500,000 SNPs, or at least 1,000,000 SNPs, or at least 2,000,000 SNPs, or at least 3,000,000 SNPs, or at least 4,000,000 SNPs, or at least 5,000,000 SNPs, or at least 6,000,000 SNPs are present in the biological sample.
Accordingly, it is an object of the invention not to encompass within the invention any previously known product, process of making the product, or method of using the product such that Applicants reserve the right and hereby disclose a disclaimer of any previously known product, process, or method. It is further noted that the invention does not intend to encompass within the scope of the invention any product, process, or making of the product or method of using the product, which does not meet the written description and enablement requirements of the USPTO (35 U.S.C. § 112, first paragraph) or the EPO (Article 83 of the EPC), such that Applicants reserve the right and hereby disclose a disclaimer of any previously described product, process of making the product, or method of using the product. It may be advantageous in the practice of the invention to be in compliance with Art. 53(c) EPC and Rule 28(b) and (c) EPC. All rights to explicitly disclaim any embodiments that are the subject of any granted patent(s) of applicant in the lineage of this application or in any other lineage or in any prior filed application of any third party is explicitly reserved. Nothing herein is to be construed as a promise.
It is noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as “comprises”, “comprised”, “comprising” and the like can have the meaning attributed to them in U.S. Patent law; e.g., they can mean “includes”, “included”, “including”, and the like; and that terms such as “consisting essentially of” and “consists essentially of” have the meaning ascribed to them in U.S. Patent law, e.g., they allow for elements not explicitly recited, but exclude elements that are found in the prior art or that affect a basic or novel characteristic of the invention.
These and other aspects, objects, features, and advantages of the example embodiments will become apparent to those having ordinary skill in the art upon consideration of the following detailed description of illustrated example embodiments.
An understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention may be utilized, and the accompanying drawings of which:
Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Definitions of common terms and techniques in molecular biology may be found in Molecular Cloning: A Laboratory Manual, 2nd edition (1989) (Sambrook, Fritsch, and Maniatis); Molecular Cloning: A Laboratory Manual, 4th edition (2012) (Green and Sambrook); Current Protocols in Molecular Biology (1987) (F. M. Ausubel et al. eds.); the series Methods in Enzymology (Academic Press, Inc.): PCR 2: A Practical Approach (1995) (M. J. MacPherson, B. D. Hames, and G. R. Taylor eds.): Antibodies, A Laboratory Manual (1988) (Harlow and Lane, eds.): Antibodies A Laboratory Manual, 2nd edition 2013 (E. A. Greenfield ed.); Animal Cell Culture (1987) (R. I. Freshney, ed.); Benjamin Lewin, Genes IX, published by Jones and Bartlett, 2008 (ISBN 0763752223); Kendrew et al. (eds.), The Encyclopedia of Molecular Biology, published by Blackwell Science Ltd., 1994 (ISBN 0632021829); Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by VCH Publishers, Inc., 1995 (ISBN 9780471185710); Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, N.Y. 1994), March, Advanced Organic Chemistry Reactions, Mechanisms and Structure 4th ed., John Wiley & Sons (New York, N.Y. 1992); and Marten H. Hofker and Jan van Deursen, Transgenic Mouse Methods and Protocols, 2nd edition (2011).
As used herein, the singular forms “a”, “an”, and “the” include both singular and plural referents unless the context clearly dictates otherwise.
The term “optional” or “optionally” means that the subsequent described event, circumstance or substituent may or may not occur, and that the description includes instances where the event or circumstance occurs and instances where it does not.
The recitation of numerical ranges by endpoints includes all numbers and fractions subsumed within the respective ranges, as well as the recited endpoints.
The terms “about” or “approximately” as used herein when referring to a measurable value such as a parameter, an amount, a temporal duration, and the like, are meant to encompass variations of and from the specified value, such as variations of +/−10% or less, +/−5% or less, +/−1% or less, and +/−0.1% or less of and from the specified value, insofar as such variations are appropriate to perform in the disclosed invention. It is to be understood that the value to which the modifier “about” or “approximately” refers is itself also specifically, and preferably, disclosed.
As used herein, a “biological sample” may contain whole cells and/or live cells and/or cell debris. The biological sample may contain (or be derived from) a “bodily fluid”. The present invention encompasses embodiments wherein the bodily fluid is selected from amniotic fluid, aqueous humour, vitreous humour, bile, blood serum, breast milk, cerebrospinal fluid, cerumen (earwax), chyle, chyme, endolymph, perilymph, exudates, feces, female ejaculate, gastric acid, gastric juice, lymph, mucus (including nasal drainage and phlegm), pericardial fluid, peritoneal fluid, pleural fluid, pus, rheum, saliva, sebum (skin oil), semen, sputum, synovial fluid, sweat, tears, urine, vaginal secretion, vomit and mixtures of one or more thereof. Biological samples include cell cultures, bodily fluids, and cell cultures from bodily fluids. Bodily fluids may be obtained from a mammalian organism, for example by puncture, or other collecting or sampling procedures.
The terms “subject,” “individual,” and “patient” are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. Tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro are also encompassed.
Various embodiments are described hereinafter. It should be noted that the specific embodiments are not intended as an exhaustive description or as a limitation to the broader aspects discussed herein. One aspect described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced with any other embodiment(s). Reference throughout this specification to “one embodiment”, “an embodiment,” “an example embodiment,” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” or “an example embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to a person skilled in the art from this disclosure, in one or more embodiments. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention. For example, in the appended claims, any of the claimed embodiments can be used in any combination.
All publications, published patent documents, and patent applications cited herein are hereby incorporated by reference to the same extent as though each individual publication, published patent document, or patent application was specifically and individually indicated as being incorporated by reference.
The present disclosure relates to Applicant's findings that led to the development of a genetic predictor that can identify a subset of the population at more than 4-fold higher risk for coronary artery disease (CAD), for example, myocardial infarction. This is among the strongest predictors ever developed for such an application. In certain embodiments, determination of the presence or absence of risk alleles is followed by calculating the polygenic risk score for the subject, wherein a high polygenic score indicates a higher risk for developing CAD.
Risk assessments using large numbers of SNPs offer the advantage of increased predictive power. In certain embodiments, the invention includes in the risk assessment large numbers of alleles, for example, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 SNPs from Table B or Table C or Table D. In some embodiments, risk assessment may comprise assessing all of the SNPs from Table D.
In some embodiments, the present disclosure provides a method of determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject, the method comprising identifying whether at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 single nucleotide polymorphisms (SNPs) from Table A or Table B or Table C or Table D are present in a biological sample from the subject; wherein the presence of a risk allele of a SNP from Table A or Table B or Table C or Table D indicates that the subject has an increased risk of coronary artery disease, and wherein the presence of an alternative allele indicates that the subject has a decreased risk of coronary artery disease.
In an embodiment, the invention provides a method of determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject comprising identifying whether the SNPs from Table A or Table B or Table C or Table D are present in a biological sample from the subject and calculating a polygenic risk score (PRS) for the subject based on the identified SNPs. The number of identified SNPs can be at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000.
In an embodiment, the invention provides a method of determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject, the method comprising identifying whether at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 single nucleotide polymorphisms (SNPs) from Table A or Table B or Table C or Table D are present in a biological sample from the subject and calculating a polygenic risk score (PRS); wherein the presence of a risk allele of a SNP from Table A or Table B or Table C or Table D indicates that the subject has an increased risk of coronary artery disease, and wherein the presence of an alternative allele indicates that the subject has a decreased risk of coronary artery disease.
In an embodiment, the invention provides a method of determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject comprising identifying whether the SNPs from Table A or Table B or Table C or Table D are present in a biological sample from the subject and calculating a polygenic risk score (PRS) for the subject based on the identified SNPs, wherein the PRS is calculated by summing the weighted risk score associated with each SNP identified. The number of identified SNPs can be at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000.
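By way of non-limiting illustration, the summation of weighted risk scores described above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used to generate Tables A to D: the function name `compute_prs`, the SNP identifiers, and the weights are hypothetical, and multiplying each weight by the number of risk alleles carried (0, 1, or 2) is one common convention assumed here.

```python
from typing import Mapping


def compute_prs(weights: Mapping[str, float], dosages: Mapping[str, int]) -> float:
    """Sum weight * risk-allele count over the SNPs identified in a subject.

    weights: hypothetical per-SNP weights keyed by SNP identifier.
    dosages: number of risk alleles (0, 1, or 2) the subject carries per SNP.
    SNPs absent from `dosages` are skipped (an assumption; imputing them is
    another option).
    """
    score = 0.0
    for snp_id, weight in weights.items():
        count = dosages.get(snp_id)
        if count is None:
            continue  # SNP was not identified in this sample
        if count not in (0, 1, 2):
            raise ValueError(f"invalid risk-allele count for {snp_id}: {count}")
        score += weight * count
    return score


# Made-up identifiers and weights, for illustration only.
example_weights = {"snp_1": 0.25, "snp_2": 0.5, "snp_3": -0.125}
example_dosages = {"snp_1": 2, "snp_2": 1, "snp_3": 0}
example_score = compute_prs(example_weights, example_dosages)
```

With these made-up values the score is 0.25 x 2 + 0.5 x 1 + (-0.125) x 0 = 1.0. A negative weight represents an allele assumed to lower risk.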
In an embodiment, the invention provides a method of determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject comprising identifying whether the SNPs from Table A or Table B or Table C or Table D are present in a biological sample from the subject, wherein identifying comprises measuring the presence of the at least 95 SNPs in the biological sample. The number of identified SNPs can be at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000.
The invention provides a method of determining a polygenic risk score (PRS) for developing coronary artery disease, e.g., myocardial infarction, in a subject, the method comprising selecting at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 single nucleotide polymorphisms (SNPs) from Table A or Table B or Table C or Table D; identifying whether the SNPs are present in a biological sample from the subject; and calculating the polygenic risk score (PRS) based on the presence of the SNPs.
In an embodiment, the invention provides a method of determining a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject comprising identifying whether the SNPs from Table A or Table B or Table C or Table D are present in a biological sample from the subject, calculating a polygenic risk score (PRS) for the subject based on the identified SNPs, and assigning the subject to a risk group based on the PRS. The PRS may be divided into quintiles, e.g., top quintile, intermediate quintile, and bottom quintile, wherein the top quintile of polygenic scores corresponds to the highest genetic risk group and the bottom quintile of polygenic scores corresponds to the lowest genetic risk group. The number of identified SNPs can be at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000.
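By way of non-limiting illustration, the assignment of subjects to quintile-based risk groups described above could be sketched in Python as follows. The function names `assign_quintiles` and `risk_group`, the rank-based splitting, the tie-breaking rule, and the treatment of the three middle quintiles as a single intermediate group are assumptions made for this example; the scores are made up.

```python
from typing import Dict, Mapping


def assign_quintiles(scores: Mapping[str, float]) -> Dict[str, int]:
    """Map each subject to a quintile, 1 (lowest scores) to 5 (highest scores).

    Rank-based: subjects are sorted by score and split into five groups of
    near-equal size. Ties are broken by subject identifier (an assumption).
    """
    if len(scores) < 5:
        raise ValueError("need at least five subjects to form quintiles")
    ordered = sorted(scores, key=lambda subject: (scores[subject], subject))
    n = len(ordered)
    return {subject: (rank * 5) // n + 1 for rank, subject in enumerate(ordered)}


def risk_group(quintile: int) -> str:
    """Label quintiles: 5 is high, 1 is low, 2 to 4 are intermediate."""
    if quintile == 5:
        return "high"
    if quintile == 1:
        return "low"
    return "intermediate"


# Made-up scores for ten hypothetical subjects.
example_scores = {f"subject_{i:02d}": float(i) for i in range(10)}
example_quintiles = assign_quintiles(example_scores)
example_groups = {s: risk_group(q) for s, q in example_quintiles.items()}
```

In this example the two subjects with the highest scores fall in quintile 5 and are labeled high risk. In practice the quintile boundaries might instead be fixed from a reference population rather than from the cohort being scored.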
In an embodiment, the invention provides a method for selecting subjects or candidates with a risk for developing coronary artery disease comprising identifying whether at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 single nucleotide polymorphisms (SNPs) from Table A or Table B or Table C or Table D are present in a biological sample from each subject or candidate; calculating a polygenic risk score (PRS) for each subject or candidate based on the identified SNPs; and selecting the subjects or candidates with a desired risk group.
For all CAD risk assessments, incorporation of large numbers of SNPs offers the advantage of increased predictive power. The invention further provides risk assessments outlined above incorporating for example, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 SNPs from Table B or Table C or Table D.
In certain embodiments of the invention, risk assessments comprise the highest weighted polymorphisms, including, but not limited to the top 50%, 55%, 60%, 70%, 80%, 90%, or 95% of SNPs from Table A or Table B or Table C or Table D. Table C, for example, comprises the highest weighted 10% of alleles (SNPs) of Table B.
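By way of non-limiting illustration, restricting a risk assessment to the highest weighted SNPs could be sketched in Python as follows. The function name `top_weighted_snps`, the SNP identifiers, and the weights are hypothetical, and ranking by the absolute value of the weight is an assumption; the tables themselves may define the ranking differently.

```python
import math
from typing import Dict, Mapping


def top_weighted_snps(weights: Mapping[str, float], fraction: float) -> Dict[str, float]:
    """Keep the `fraction` of SNPs with the largest absolute weight.

    Ranking by absolute value is an assumption; it treats strongly protective
    and strongly risk-raising alleles alike.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    if not weights:
        return {}
    keep = max(1, math.ceil(len(weights) * fraction))
    ranked = sorted(weights, key=lambda snp_id: abs(weights[snp_id]), reverse=True)
    return {snp_id: weights[snp_id] for snp_id in ranked[:keep]}


# Made-up identifiers and weights, for illustration only.
example_weights = {"snp_1": 0.5, "snp_2": -0.75, "snp_3": 0.125, "snp_4": 0.25}
example_top_half = top_weighted_snps(example_weights, 0.5)
```

Here the top 50% consists of `snp_2` and `snp_1`, the two entries with the largest absolute weights.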
In an embodiment, the method is used to select a population of subjects or candidates for clinical trials, e.g., a clinical trial to determine whether a particular treatment or treatment plan is effective against coronary artery disease, e.g., myocardial infarction. In an embodiment, the desired risk group is a population comprising high risk subjects or candidates. In an embodiment, the selected population of subjects or candidates are responders, i.e., the subjects or candidates are responsive to the treatment or treatment plan.
In an embodiment, the invention provides a method for selecting a population of subjects or candidates with a high risk for developing coronary artery disease comprising identifying whether at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 single nucleotide polymorphisms (SNPs) from Table A or Table B or Table C or Table D are present in a biological sample from each subject or candidate; calculating a polygenic risk score (PRS) for each subject or candidate based on the identified SNPs; and selecting the subjects or candidates in the high risk group. In an embodiment, the method is used to select a population of subjects or candidates for clinical trials, e.g., a clinical trial to determine whether a particular treatment or treatment plan is effective against coronary artery disease, e.g., myocardial infarction. In an embodiment, the selected candidates or subjects are divided into subgroups based on the identified SNPs for each subject or candidate, and the method is used to determine whether a particular treatment or treatment plan is effective against a particular SNP or a particular group of SNPs. In other words, the method can be employed to determine susceptibility of a population of subjects to a particular treatment or treatment plan, wherein the population of subjects is selected based on the SNPs identified in the subjects.
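By way of non-limiting illustration, dividing selected subjects into subgroups based on an identified SNP and comparing treatment response between the subgroups could be sketched in Python as follows. The function names `subgroup_by_carrier_status` and `response_rate`, the two-way carrier/non-carrier split, the treatment of an unmeasured SNP as non-carrier, and all data shown are assumptions made for this example.

```python
from typing import Dict, List, Mapping


def subgroup_by_carrier_status(
    dosages_by_subject: Mapping[str, Mapping[str, int]], snp_id: str
) -> Dict[str, List[str]]:
    """Split subjects into carriers and non-carriers of the risk allele at one SNP.

    A subject with no recorded count for the SNP is treated as a non-carrier
    (an assumption made to keep the example short).
    """
    groups: Dict[str, List[str]] = {"carrier": [], "non_carrier": []}
    for subject in sorted(dosages_by_subject):
        count = dosages_by_subject[subject].get(snp_id, 0)
        groups["carrier" if count > 0 else "non_carrier"].append(subject)
    return groups


def response_rate(subjects: List[str], responded: Mapping[str, bool]) -> float:
    """Fraction of the listed subjects who responded to the treatment."""
    if not subjects:
        raise ValueError("subgroup is empty")
    return sum(1 for s in subjects if responded[s]) / len(subjects)


# Made-up data for four hypothetical high-risk subjects.
example_dosages = {
    "subject_a": {"snp_1": 2},
    "subject_b": {"snp_1": 0},
    "subject_c": {"snp_1": 1},
    "subject_d": {},
}
example_responded = {
    "subject_a": True,
    "subject_b": False,
    "subject_c": True,
    "subject_d": True,
}
example_subgroups = subgroup_by_carrier_status(example_dosages, "snp_1")
example_rates = {
    name: response_rate(members, example_responded)
    for name, members in example_subgroups.items()
}
```

A difference in response rate between the subgroups, as in this made-up data, would only suggest a relationship; a real trial would require an appropriately powered statistical comparison.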
In any of the above embodiments, the method may further comprise an initial step of obtaining a biological sample from the subject.
In any of the above embodiments, the number of identified SNPs is at least 100 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 200 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 500 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 1,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 2,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 5,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 10,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 20,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 50,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 75,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 100,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 500,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 1,000,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 2,000,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 3,000,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 4,000,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 5,000,000 SNPs.
In any of the above embodiments, the number of identified SNPs is at least 6,000,000 SNPs.
In any of the above embodiments, the identified SNPs comprise the highest risk SNPs or SNPs with a weight risk score in the top 10%, top 20%, top 30%, top 40%, or top 50% in Table A or Table B or Table C or Table D.
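As a non-limiting sketch of how SNPs in a top percentage by weight might be selected, the following ranks SNPs by weight and keeps the requested fraction. The identifiers and weights are hypothetical placeholders rather than entries from Tables A-D, and ranking on the raw weight value is an assumption made for illustration.

```python
from typing import Dict, List


def top_weighted_snps(weights: Dict[str, float], fraction: float) -> List[str]:
    """Return the SNP IDs whose weights rank in the top `fraction` of the table."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]


# Hypothetical weight table (placeholder values).
snp_weights = {
    "snp_1": 0.01, "snp_2": 0.09, "snp_3": 0.03, "snp_4": 0.07, "snp_5": 0.05,
    "snp_6": 0.02, "snp_7": 0.08, "snp_8": 0.04, "snp_9": 0.06, "snp_10": 0.10,
}

top_20_percent = top_weighted_snps(snp_weights, 0.20)
print(top_20_percent)
```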
In any of the above embodiments, the identified SNPs comprise one or more of rs17517928, rs2972146, rs17843797, rs748431, rs7623687, rs12493885, rs10857147, rs7678555, rs1800449, rs10841443, rs2244608, rs11057401, rs3851738, rs7500448, and rs8108632.
Also disclosed herein are methods for detecting SNPs in a subject. In some cases, the method may include detecting whether one or more SNPs from Tables A, B, C, or D (e.g., Table D) are present in a biological sample from a subject. The detecting may include contacting the biological sample with a set of probes to each SNP, detecting binding of the probes, amplifying genomic regions comprising the SNPs using a set of amplification primers, sequencing genomic regions comprising or enriched for the SNPs, or any combination of these steps. In some cases, the method may detect whether at least 95 SNPs, at least 100 SNPs, at least 200 SNPs, or at least 500 SNPs, or at least 1000 SNPs, or at least 2000 SNPs, or at least 5000 SNPs, or at least 10,000 SNPs, or at least 20,000 SNPs, or at least 50,000 SNPs, or at least 75,000 SNPs, or at least 100,000 SNPs, or at least 500,000 SNPs, or at least 1,000,000 SNPs, or at least 2,000,000 SNPs, or at least 3,000,000 SNPs, or at least 4,000,000 SNPs, or at least 5,000,000 SNPs, or at least 6,000,000 SNPs are present in the biological sample.
In any of the above embodiments, the method further comprises initiating a treatment to the subject. The treatment can be determined or adjusted according to the risk of coronary artery disease or myocardial infarction. The treatment can comprise statins, ezetimibe, beta-blocking agents, angiotensin-converting-enzyme inhibitors, aspirin, anticoagulants, antiplatelet agents, angiotensin II receptor blockers, angiotensin receptor neprilysin inhibitors, calcium channel blockers, cholesterol-lowering medications, vasodilators, antidiuretics, renin-angiotensin system agents, lipid-modifying medicines, anti-inflammatory agents, nitrates, antiarrhythmic medicines, steroidal or non-steroidal anti-inflammatory drugs, DNA methyltransferase inhibitors and/or histone deacetylase inhibitors. The DNA methyltransferase inhibitors can be any DNA methyltransferase inhibitors known in the art, e.g., 5-aza-2′-deoxycytidine or 5-azacytidine. The histone deacetylase inhibitors can be any histone deacetylase inhibitors known in the art, e.g., vorinostat, romidepsin, panobinostat, belinostat or entinostat. The lipid-modifying medicines can be any lipid-modifying compounds known in the art, e.g., an antagonist of PCSK9, an antisense oligonucleotide targeting apolipoprotein C-III, and an antisense oligonucleotide to lower lipoprotein(a). The statins can be any statins known in the art, e.g., atorvastatin, fluvastatin, lovastatin, pravastatin, rosuvastatin, and simvastatin. Initiating a treatment can include devising a treatment plan based on the risk group, which corresponds to the PRS calculated for the subject.
In one embodiment, a treatment or a method of treatment can include gene therapy/genome editing and/or a nucleic acid vector used in gene therapy, as known in the art. In one embodiment, one or more target loci within the subject's genomic DNA are targeted and modified. A treatment method comprises gene editing tools available in the art, e.g., CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats), zinc finger nucleases, or meganucleases, where a target DNA locus, e.g., a gene of interest, is modified to create a mutation in the gene product, e.g., a protein or enzyme, with reduced activity or no activity (loss-of-function mutation). In some embodiments, vectors can comprise viral vectors, e.g., retroviruses, adenoviruses, adeno-associated viruses, and lentiviruses. Examples of a target locus of interest include the genes PCSK9, APOC3, ANGPTL8, LPL, CD36, HBB and NPC1L1.
The invention provides methods and models to establish causation of elements of alleles (e.g., chromosomal regions, genetic loci) identified as associated with increased disease risk. In an embodiment of the invention, a model animal, for example but not limited to a rat, a mouse, a dog, a pig, a non-human primate, or a chimeric animal comprising human cells can be employed. In an embodiment of the invention, an organ or organoid can be employed, which can be characterized as from a human or a non-human mammal. In an embodiment of the invention, a cell line from a human or non-human mammal can be employed.
The invention provides for modifying, for example mutating or modulating expression of, one or more genetic elements of a model. Such modifications can be made in a model organism singly, or in combination. In certain example embodiments, the one or more genetic elements may be modified using a nuclease. The term “nuclease” as used herein broadly refers to an agent, for example a protein or a small molecule, capable of cleaving a phosphodiester bond connecting nucleotide residues in a nucleic acid molecule. In some embodiments, a nuclease may be a protein, e.g., an enzyme that can bind a nucleic acid molecule and cleave a phosphodiester bond connecting nucleotide residues within the nucleic acid molecule. A nuclease may be an endonuclease, cleaving a phosphodiester bond within a polynucleotide chain, or an exonuclease, cleaving a phosphodiester bond at the end of the polynucleotide chain. Preferably, the nuclease is an endonuclease. Preferably, the nuclease is a site-specific nuclease, binding and/or cleaving a specific phosphodiester bond within a specific nucleotide sequence, which may be referred to as “recognition sequence”, “nuclease target site”, or “target site”. In some embodiments, a nuclease may recognize a single stranded target site, in other embodiments a nuclease may recognize a double-stranded target site, for example a double-stranded DNA target site. Some endonucleases cut a double-stranded nucleic acid target site symmetrically, i.e., cutting both strands at the same position so that the ends comprise base-paired nucleotides, also known as blunt ends. Other endonucleases cut a double-stranded nucleic acid target site asymmetrically, i.e., cutting each strand at a different position so that the ends comprise unpaired nucleotides.
Unpaired nucleotides at the end of a double-stranded DNA molecule are also referred to as “overhangs”, e.g., “5′-overhang” or “3′-overhang”, depending on whether the unpaired nucleotide(s) form(s) the 5′ or the 3′ end of the respective DNA strand.
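The distinction between blunt ends and 5′ or 3′ overhangs can be illustrated with a short sketch. The coordinate convention is an assumption adopted here for illustration: both cut positions are expressed in top-strand coordinates, with the top strand read 5′ to 3′, and a value k meaning the strand is cut between base k and base k+1. The function names are hypothetical.

```python
def classify_ends(top_cut: int, bottom_cut: int) -> str:
    """Classify the ends left by a double-strand cut.

    Both positions use top-strand coordinates (top strand read 5'->3');
    a value k means the strand is cut between base k and base k + 1.
    """
    if top_cut == bottom_cut:
        return "blunt"
    # Top strand cut upstream of the bottom-strand cut: the unpaired bases
    # sit at the 5' end of each strand.
    if top_cut < bottom_cut:
        return "5' overhang"
    # Top strand cut downstream of the bottom-strand cut: the unpaired bases
    # sit at the 3' end of each strand.
    return "3' overhang"


def overhang_length(top_cut: int, bottom_cut: int) -> int:
    """Number of unpaired nucleotides on each end (0 for blunt ends)."""
    return abs(top_cut - bottom_cut)


# Well-known restriction endonucleases used as reference points:
# EcoRI (G^AATTC) leaves 4-nt 5' overhangs, PstI (CTGCA^G) leaves 4-nt
# 3' overhangs, and EcoRV (GAT^ATC) leaves blunt ends.
print(classify_ends(1, 5), overhang_length(1, 5))
print(classify_ends(5, 1), overhang_length(5, 1))
print(classify_ends(3, 3), overhang_length(3, 3))
```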
The nuclease may introduce one or more single-strand nicks and/or double-strand breaks in the endogenous gene, whereupon the sequence of the endogenous gene may be modified or mutated via non-homologous end joining (NHEJ) or homology-directed repair (HDR).
In certain embodiments, the nuclease may comprise (i) a DNA-binding portion configured to specifically bind to the endogenous gene and (ii) a DNA cleavage portion. Generally, the DNA cleavage portion will cleave the nucleic acid within or in the vicinity of the sequence to which the DNA-binding portion is configured to bind.
In certain embodiments, the DNA-binding portion may comprise a zinc finger protein or DNA-binding domain thereof, a transcription activator-like effector (TALE) protein or DNA-binding domain thereof, or an RNA-guided protein or DNA-binding domain thereof.
Programmable nucleic acid-modifying agents in the context of the present invention may be used to modify endogenous cell DNA or RNA sequences, including DNA and/or RNA sequences encoding the target genes and target gene products disclosed herein. In certain example embodiments, the programmable nucleic acid-modifying agents may be used to edit a target sequence to restore native or wild-type functionality. In certain other embodiments, the programmable nucleic acid-modifying agents may be used to insert a new gene or gene product to modify the phenotype of target cells. In certain other example embodiments, the programmable nucleic acid-modifying agents may be used to delete or otherwise silence the expression of a target gene or gene product. Programmable nucleic acid-modifying agents may be used in both in vivo and ex vivo applications disclosed herein.
1. CRISPR/Cas Systems
In general, a CRISPR-Cas or CRISPR system as used herein and in documents, such as WO 2014/093622 (PCT/US2013/074667), refers collectively to transcripts and other elements involved in the expression of or directing the activity of CRISPR-associated (“Cas”) genes, including sequences encoding a Cas gene, a tracr (trans-activating CRISPR) sequence (e.g., tracrRNA or an active partial tracrRNA), a tracr-mate sequence (encompassing a “direct repeat” and a tracrRNA-processed partial direct repeat in the context of an endogenous CRISPR system), a guide sequence (also referred to as a “spacer” in the context of an endogenous CRISPR system), or “RNA(s)” as that term is herein used (e.g., RNA(s) to guide Cas, such as Cas9, e.g., CRISPR RNA and transactivating (tracr) RNA or a single guide RNA (sgRNA) (chimeric RNA)) or other sequences and transcripts from a CRISPR locus. In general, a CRISPR system is characterized by elements that promote the formation of a CRISPR complex at the site of a target sequence (also referred to as a protospacer in the context of an endogenous CRISPR system). See, e.g., Shmakov et al. (2015) “Discovery and Functional Characterization of Diverse Class 2 CRISPR-Cas Systems”, Molecular Cell, DOI: dx.doi.org/10.1016/j.molcel.2015.10.008.
In certain embodiments, a protospacer adjacent motif (PAM) or PAM-like motif directs binding of the effector protein complex as disclosed herein to the target locus of interest. In some embodiments, the PAM may be a 5′ PAM (i.e., located upstream of the 5′ end of the protospacer). In other embodiments, the PAM may be a 3′ PAM (i.e., located downstream of the 3′ end of the protospacer). The term “PAM” may be used interchangeably with the term “PFS” or “protospacer flanking site” or “protospacer flanking sequence”.
In a preferred embodiment, the CRISPR effector protein may recognize a 3′ PAM. In certain embodiments, the CRISPR effector protein may recognize a 3′ PAM which is 5′H, wherein H is A, C or U.
In the context of formation of a CRISPR complex, “target sequence” refers to a sequence to which a guide sequence is designed to have complementarity, where hybridization between a target sequence and a guide sequence promotes the formation of a CRISPR complex. A target sequence may comprise RNA polynucleotides. The term “target RNA” refers to a RNA polynucleotide being or comprising the target sequence. In other words, the target RNA may be a RNA polynucleotide or a part of a RNA polynucleotide to which a part of the gRNA, i.e. the guide sequence, is designed to have complementarity and to which the effector function mediated by the complex comprising CRISPR effector protein and a gRNA is to be directed. In some embodiments, a target sequence is located in the nucleus or cytoplasm of a cell.
In certain example embodiments, the CRISPR effector protein may be delivered using a nucleic acid molecule encoding the CRISPR effector protein. The nucleic acid molecule encoding a CRISPR effector protein may advantageously be a codon optimized CRISPR effector protein. An example of a codon optimized sequence is, in this instance, a sequence optimized for expression in a eukaryote, e.g., humans (i.e., being optimized for expression in humans), or for another eukaryote, animal or mammal as herein discussed; see, e.g., SaCas9 human codon optimized sequence in WO 2014/093622 (PCT/US2013/074667). Whilst this is preferred, it will be appreciated that other examples are possible and codon optimization for a host species other than human, or codon optimization for specific organs, is known. In some embodiments, an enzyme coding sequence encoding a CRISPR effector protein is codon optimized for expression in particular cells, such as eukaryotic cells. The eukaryotic cells may be those of or derived from a particular organism, such as a plant or a mammal, including but not limited to human, or non-human eukaryote or animal or mammal as herein discussed, e.g., mouse, rat, rabbit, dog, livestock, or non-human mammal or primate. In some embodiments, processes for modifying the germ line genetic identity of human beings and/or processes for modifying the genetic identity of animals which are likely to cause them suffering without any substantial medical benefit to man or animal, and also animals resulting from such processes, may be excluded. In general, codon optimization refers to a process of modifying a nucleic acid sequence for enhanced expression in the host cells of interest by replacing at least one codon (e.g., about or more than about 1, 2, 3, 4, 5, 10, 15, 20, 25, 50, or more codons) of the native sequence with codons that are more frequently or most frequently used in the genes of that host cell while maintaining the native amino acid sequence.
Various species exhibit particular bias for certain codons of a particular amino acid. Codon bias (differences in codon usage between organisms) often correlates with the efficiency of translation of messenger RNA (mRNA), which is in turn believed to be dependent on, among other things, the properties of the codons being translated and the availability of particular transfer RNA (tRNA) molecules. The predominance of selected tRNAs in a cell is generally a reflection of the codons used most frequently in peptide synthesis. Accordingly, genes can be tailored for optimal gene expression in a given organism based on codon optimization. Codon usage tables are readily available, for example, at the “Codon Usage Database” available at kazusa.or.jp/codon/, and these tables can be adapted in a number of ways. See Nakamura, Y., et al. “Codon usage tabulated from the international DNA sequence databases: status for the year 2000” Nucl. Acids Res. 28:292 (2000). Computer algorithms for codon optimizing a particular sequence for expression in a particular host cell, such as Gene Forge (Aptagen; Jacobus, Pa.), are also available. In some embodiments, one or more codons (e.g., 1, 2, 3, 4, 5, 10, 15, 20, 25, 50, or more, or all codons) in a sequence encoding a Cas correspond to the most frequently used codon for a particular amino acid.
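A minimal sketch of the codon replacement step described above follows. It assumes the standard genetic code and a simple "most preferred codon per amino acid" strategy. The preferred-codon mapping shown is a placeholder chosen for illustration and is not taken from an actual codon usage table, and the function names are hypothetical. The final check confirms that the native amino acid sequence is maintained.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def codon_optimize(dna: str, preferred: Dict[str, str]) -> str:
    """Replace each codon with the preferred synonymous codon, where one is given."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    optimized = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        # Fall back to the native codon when no preference is listed.
        optimized.append(preferred.get(amino_acid, codon))
    return "".join(optimized)


# Placeholder preferences (illustrative only, not real usage frequencies).
PREFERRED = {"L": "CTG", "A": "GCC"}

native = "ATGTTAGCTTAA"
optimized = codon_optimize(native, PREFERRED)
print(native, "->", optimized)
assert translate(optimized) == translate(native)
```

In practice the preference mapping would be derived from a codon usage table for the host cell of interest, and additional constraints (for example GC content or avoidance of particular motifs) may be applied.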
In certain embodiments, the methods as described herein may comprise providing a Cas transgenic cell in which one or more nucleic acids encoding one or more guide RNAs are provided or introduced operably connected in the cell with a regulatory element comprising a promoter of one or more gene of interest. As used herein, the term “Cas transgenic cell” refers to a cell, such as a eukaryotic cell, in which a Cas gene has been genomically integrated. The nature, type, or origin of the cell are not particularly limiting according to the present invention. Also the way the Cas transgene is introduced in the cell may vary and can be any method as is known in the art. In certain embodiments, the Cas transgenic cell is obtained by introducing the Cas transgene in an isolated cell. In certain other embodiments, the Cas transgenic cell is obtained by isolating cells from a Cas transgenic organism. By means of example, and without limitation, the Cas transgenic cell as referred to herein may be derived from a Cas transgenic eukaryote, such as a Cas knock-in eukaryote. Reference is made to WO 2014/093622 (PCT/US13/74667), incorporated herein by reference. Methods of US Patent Publication Nos. 20120017290 and 20110265198 assigned to Sangamo BioSciences, Inc. directed to targeting the Rosa locus may be modified to utilize the CRISPR Cas system of the present invention. Methods of US Patent Publication No. 20130236946 assigned to Cellectis directed to targeting the Rosa locus may also be modified to utilize the CRISPR Cas system of the present invention. By means of further example reference is made to Platt et al. (Cell; 159(2):440-455 (2014)), describing a Cas9 knock-in mouse, which is incorporated herein by reference. The Cas transgene can further comprise a Lox-Stop-polyA-Lox (LSL) cassette thereby rendering Cas expression inducible by Cre recombinase.
Delivery systems for transgenes are well known in the art. By means of example, the Cas transgene may be delivered in, for instance, a eukaryotic cell by means of a vector (e.g., AAV, adenovirus, lentivirus) and/or particle and/or nanoparticle delivery, as also described herein elsewhere.
It will be understood by the skilled person that the cell, such as the Cas transgenic cell, as referred to herein may comprise further genomic alterations besides having an integrated Cas gene or the mutations arising from the sequence specific action of Cas when complexed with RNA capable of guiding Cas to a target locus.
In certain aspects the invention involves vectors, e.g., for delivering or introducing in a cell Cas and/or RNA capable of guiding Cas to a target locus (i.e., guide RNA), but also for propagating these components (e.g., in prokaryotic cells). As used herein, a “vector” is a tool that allows or facilitates the transfer of an entity from one environment to another. It is a replicon, such as a plasmid, phage, or cosmid, into which another DNA segment may be inserted so as to bring about the replication of the inserted segment. Generally, a vector is capable of replication when associated with the proper control elements. In general, the term “vector” refers to a nucleic acid molecule capable of transporting another nucleic acid to which it has been linked. Vectors include, but are not limited to, nucleic acid molecules that are single-stranded, double-stranded, or partially double-stranded; nucleic acid molecules that comprise one or more free ends, no free ends (e.g., circular); nucleic acid molecules that comprise DNA, RNA, or both; and other varieties of polynucleotides known in the art. One type of vector is a “plasmid,” which refers to a circular double stranded DNA loop into which additional DNA segments can be inserted, such as by standard molecular cloning techniques. Another type of vector is a viral vector, wherein virally-derived DNA or RNA sequences are present in the vector for packaging into a virus (e.g., retroviruses, replication defective retroviruses, adenoviruses, replication defective adenoviruses, and adeno-associated viruses (AAVs)). Viral vectors also include polynucleotides carried by a virus for transfection into a host cell. Certain vectors are capable of autonomous replication in a host cell into which they are introduced (e.g., bacterial vectors having a bacterial origin of replication and episomal mammalian vectors).
Other vectors (e.g., non-episomal mammalian vectors) are integrated into the genome of a host cell upon introduction into the host cell, and thereby are replicated along with the host genome. Moreover, certain vectors are capable of directing the expression of genes to which they are operatively-linked. Such vectors are referred to herein as “expression vectors.” Common expression vectors of utility in recombinant DNA techniques are often in the form of plasmids.
Recombinant expression vectors can comprise a nucleic acid of the invention in a form suitable for expression of the nucleic acid in a host cell, which means that the recombinant expression vectors include one or more regulatory elements, which may be selected on the basis of the host cells to be used for expression, that are operatively linked to the nucleic acid sequence to be expressed. Within a recombinant expression vector, “operably linked” is intended to mean that the nucleotide sequence of interest is linked to the regulatory element(s) in a manner that allows for expression of the nucleotide sequence (e.g., in an in vitro transcription/translation system or in a host cell when the vector is introduced into the host cell). With regards to recombination and cloning methods, mention is made of U.S. patent application Ser. No. 10/815,730, published Sep. 2, 2004 as US 2004-0171156 A1, the contents of which are herein incorporated by reference in their entirety. Thus, the embodiments disclosed herein may also comprise transgenic cells comprising the CRISPR effector system. In certain example embodiments, the transgenic cell may function as an individual discrete volume. In other words, samples comprising a masking construct may be delivered to a cell, for example in a suitable delivery vesicle, and if the target is present in the delivery vesicle the CRISPR effector is activated and a detectable signal generated.
The vector(s) can include the regulatory element(s), e.g., promoter(s). The vector(s) can comprise Cas encoding sequences, and/or a single, but possibly also can comprise at least 3 or 8 or 16 or 32 or 48 or 50 guide RNA(s) (e.g., sgRNAs) encoding sequences, such as 1-2, 1-3, 1-4, 1-5, 3-6, 3-7, 3-8, 3-9, 3-10, 3-16, 3-30, 3-32, 3-48, 3-50 RNA(s) (e.g., sgRNAs). In a single vector there can be a promoter for each RNA (e.g., sgRNA), advantageously when there are up to about 16 RNA(s); and, when a single vector provides for more than 16 RNA(s), one or more promoter(s) can drive expression of more than one of the RNA(s), e.g., when there are 32 RNA(s), each promoter can drive expression of two RNA(s), and when there are 48 RNA(s), each promoter can drive expression of three RNA(s). By simple arithmetic and well established cloning protocols and the teachings in this disclosure one skilled in the art can readily practice the invention as to the RNA(s) for a suitable exemplary vector such as AAV, and a suitable promoter such as the U6 promoter. For example, the packaging limit of AAV is ~4.7 kb. The length of a single U6-gRNA (plus restriction sites for cloning) is 361 bp. Therefore, the skilled person can readily fit about 12-16, e.g., 13 U6-gRNA cassettes in a single vector. This can be assembled by any suitable means, such as a golden gate strategy used for TALE assembly (genome-engineering.org/taleffectors/). The skilled person can also use a tandem guide strategy to increase the number of U6-gRNAs by approximately 1.5 times, e.g., to increase from 12-16, e.g., 13 to approximately 18-24, e.g., about 19 U6-gRNAs. Therefore, one skilled in the art can readily reach approximately 18-24, e.g., about 19 promoter-RNAs, e.g., U6-gRNAs in a single vector, e.g., an AAV vector. A further means for increasing the number of promoters and RNAs in a vector is to use a single promoter (e.g., U6) to express an array of RNAs separated by cleavable sequences.
And an even further means for increasing the number of promoter-RNAs in a vector, is to express an array of promoter-RNAs separated by cleavable sequences in the intron of a coding sequence or gene; and, in this instance it is advantageous to use a polymerase II promoter, which can have increased expression and enable the transcription of long RNA in a tissue specific manner. (see, e.g., nar.oxfordjournals.org/content/34/7/e53.short and nature.com/mt/journal/v16/n9/abs/mt2008144a.html). In an advantageous embodiment, AAV may package U6 tandem gRNA targeting up to about 50 genes. Accordingly, from the knowledge in the art and the teachings in this disclosure the skilled person can readily make and use vector(s), e.g., a single vector, expressing multiple RNAs or guides under the control or operatively or functionally linked to one or more promoters—especially as to the numbers of RNAs or guides discussed herein, without any undue experimentation.
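The "simple arithmetic" referred to above can be reproduced in a few lines. The sketch uses the figures stated in the text (a packaging limit of about 4.7 kb, 361 bp per U6-gRNA cassette, a tandem factor of approximately 1.5, and up to about 16 promoters) and, as a simplifying assumption, treats the entire packaging capacity as available for cassettes. The function and constant names are hypothetical.

```python
import math

AAV_PACKAGING_LIMIT_BP = 4700   # ~4.7 kb, as stated in the text
U6_GRNA_CASSETTE_BP = 361       # single U6-gRNA plus cloning sites
TANDEM_FACTOR = 1.5             # approximate gain from a tandem guide strategy
MAX_PROMOTERS = 16              # "up to about 16" one-promoter-per-RNA cassettes


def max_cassettes(limit_bp: int, cassette_bp: int) -> int:
    """Whole cassettes that fit, assuming all capacity is used for cassettes."""
    return limit_bp // cassette_bp


def tandem_cassettes(n_cassettes: int, factor: float) -> int:
    """Approximate number of guides reachable with a tandem guide strategy."""
    return int(n_cassettes * factor)


def rnas_per_promoter(n_rnas: int, n_promoters: int) -> int:
    """RNAs each promoter must drive when RNAs outnumber promoters."""
    return math.ceil(n_rnas / n_promoters)


single = max_cassettes(AAV_PACKAGING_LIMIT_BP, U6_GRNA_CASSETTE_BP)
tandem = tandem_cassettes(single, TANDEM_FACTOR)
print(single, tandem)
print(rnas_per_promoter(32, MAX_PROMOTERS), rnas_per_promoter(48, MAX_PROMOTERS))
```

With these inputs the sketch yields 13 cassettes and about 19 tandem guides, and two or three RNAs per promoter for 32 or 48 RNAs respectively, consistent with the figures given above.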
The guide RNA(s) encoding sequences and/or Cas encoding sequences, can be functionally or operatively linked to regulatory element(s) and hence the regulatory element(s) drive expression. The promoter(s) can be constitutive promoter(s) and/or conditional promoter(s) and/or inducible promoter(s) and/or tissue specific promoter(s). The promoter can be selected from the group consisting of RNA polymerases, pol I, pol II, pol III, T7, U6, H1, retroviral Rous sarcoma virus (RSV) LTR promoter, the cytomegalovirus (CMV) promoter, the SV40 promoter, the dihydrofolate reductase promoter, the β-actin promoter, the phosphoglycerol kinase (PGK) promoter, and the EF1α promoter. An advantageous promoter is U6.
Additional effectors for use according to the invention can be identified by their proximity to cas1 genes, for example, though not limited to, within the region 20 kb from the start of the cas1 gene and 20 kb from the end of the cas1 gene. In certain embodiments, the effector protein comprises at least one HEPN domain and at least 500 amino acids, and wherein the C2c2 effector protein is naturally present in a prokaryotic genome within 20 kb upstream or downstream of a Cas gene or a CRISPR array. Non-limiting examples of Cas proteins include Cas1, Cas1B, Cas2, Cas3, Cas4, Cas5, Cas6, Cas7, Cas8, Cas9 (also known as Csn1 and Csx12), Cas10, Csy1, Csy2, Csy3, Cse1, Cse2, Csc1, Csc2, Csa5, Csn2, Csm2, Csm3, Csm4, Csm5, Csm6, Cmr1, Cmr3, Cmr4, Cmr5, Cmr6, Csb1, Csb2, Csb3, Csx17, Csx14, Csx10, Csx16, CsaX, Csx3, Csx1, Csx15, Csf1, Csf2, Csf3, Csf4, homologues thereof, or modified versions thereof. In certain example embodiments, the C2c2 effector protein is naturally present in a prokaryotic genome within 20 kb upstream or downstream of a Cas1 gene. The terms “orthologue” (also referred to as “ortholog” herein) and “homologue” (also referred to as “homolog” herein) are well known in the art. By means of further guidance, a “homologue” of a protein as used herein is a protein of the same species which performs the same or a similar function as the protein it is a homologue of. Homologous proteins may but need not be structurally related, or are only partially structurally related. An “orthologue” of a protein as used herein is a protein of a different species which performs the same or a similar function as the protein it is an orthologue of. Orthologous proteins may but need not be structurally related, or are only partially structurally related.
a) DNA Repair and NHEJ
In certain embodiments, nuclease-induced non-homologous end-joining (NHEJ) can be used to target gene-specific knockouts. Nuclease-induced NHEJ can also be used to remove (e.g., delete) sequence in a gene of interest. Generally, NHEJ repairs a double-strand break in the DNA by joining together the two ends; however, generally, the original sequence is restored only if two compatible ends, exactly as they were formed by the double-strand break, are perfectly ligated. The DNA ends of the double-strand break are frequently the subject of enzymatic processing, resulting in the addition or removal of nucleotides, at one or both strands, prior to rejoining of the ends. This results in the presence of insertion and/or deletion (indel) mutations in the DNA sequence at the site of the NHEJ repair. Two-thirds of these mutations typically alter the reading frame and, therefore, produce a non-functional protein. Additionally, mutations that maintain the reading frame, but which insert or delete a significant amount of sequence, can destroy functionality of the protein. This is locus dependent as mutations in critical functional domains are likely less tolerable than mutations in non-critical regions of the protein. The indel mutations generated by NHEJ are unpredictable in nature; however, at a given break site certain indel sequences are favored and are over represented in the population, likely due to small regions of microhomology. The lengths of deletions can vary widely; most commonly in the 1-50 bp range, but they can easily be greater than 50 bp, e.g., they can easily reach greater than about 100-200 bp. Insertions tend to be shorter and often include short duplications of the sequence immediately surrounding the break site. However, it is possible to obtain large insertions, and in these cases, the inserted sequence has often been traced to other regions of the genome or to plasmid DNA present in the cells.
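The two-thirds figure follows from the triplet nature of the genetic code: an insertion or deletion shifts the reading frame unless its length is a multiple of three. The following sketch illustrates this under the simplifying assumption that indel lengths are evenly distributed over a range, whereas, as noted above, real break sites favor particular indels. The function name is hypothetical.

```python
def is_frameshift(indel_length_bp: int) -> bool:
    """An indel alters the reading frame unless its length is a multiple of 3."""
    return indel_length_bp % 3 != 0


# Illustrative assumption: indel lengths of 1-30 bp are equally likely.
lengths = range(1, 31)
frameshift_fraction = sum(is_frameshift(n) for n in lengths) / len(lengths)
print(frameshift_fraction)  # two of every three lengths shift the frame
```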
Because NHEJ is a mutagenic process, it may also be used to delete small sequence motifs as long as the generation of a specific final sequence is not required. If a double-strand break is targeted near to a short target sequence, the deletion mutations caused by the NHEJ repair often span, and therefore remove, the unwanted nucleotides. For the deletion of larger DNA segments, introducing two double-strand breaks, one on each side of the sequence, can result in NHEJ between the ends with removal of the entire intervening sequence. Both of these approaches can be used to delete specific DNA sequences; however, the error-prone nature of NHEJ may still produce indel mutations at the site of repair.
Double-strand cleavage by the CRISPR/Cas system can be used in the methods and compositions described herein to generate NHEJ-mediated indels. NHEJ-mediated indels targeted to the gene, e.g., a coding region, e.g., an early coding region of a gene of interest, can be used to knock out (i.e., eliminate expression of) a gene of interest. For example, an early coding region of a gene of interest includes sequence immediately following a transcription start site, within a first exon of the coding sequence, or within 500 bp of the transcription start site (e.g., less than 500, 450, 400, 350, 300, 250, 200, 150, 100 or 50 bp).
In an embodiment, in which the CRISPR/Cas system generates a double strand break for the purpose of inducing NHEJ-mediated indels, a guide RNA may be configured to position one double-strand break in close proximity to a nucleotide of the target position. In an embodiment, the cleavage site may be between 0-500 bp away from the target position (e.g., less than 500, 400, 300, 200, 100, 50, 40, 30, 25, 20, 15, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 bp from the target position).
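As a non-limiting sketch, candidate guide RNAs can be screened by the distance between their predicted cleavage site and the target position. The guide names, coordinates, and function name are hypothetical, and the 500 bp default reflects the upper bound of the range given above.

```python
from typing import Dict, List


def guides_near_target(cut_sites: Dict[str, int], target_position: int,
                       max_distance_bp: int = 500) -> List[str]:
    """Return guide names whose cut site lies within max_distance_bp of the
    target position, ordered from nearest to farthest."""
    near = {name: abs(site - target_position)
            for name, site in cut_sites.items()
            if abs(site - target_position) <= max_distance_bp}
    return sorted(near, key=near.get)


# Hypothetical predicted cleavage coordinates for three candidate guides.
candidate_cut_sites = {"guide_a": 1020, "guide_b": 1600, "guide_c": 995}

selected = guides_near_target(candidate_cut_sites, target_position=1000)
closest = guides_near_target(candidate_cut_sites, target_position=1000, max_distance_bp=10)
print(selected, closest)
```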
In an embodiment, in which two guide RNAs complexing with CRISPR/Cas system nickases induce two single strand breaks for the purpose of inducing NHEJ-mediated indels, two guide RNAs may be configured to position two single-strand breaks to provide for NHEJ repair of a nucleotide of the target position.
b) dCas and Functional Effectors
Unlike CRISPR-Cas-mediated gene knockout, which permanently eliminates expression by mutating the gene at the DNA level, CRISPR-Cas knockdown allows for temporary reduction of gene expression through the use of artificial transcription factors. Mutating key residues in cleavage domains of the Cas protein results in the generation of a catalytically inactive Cas protein. A catalytically inactive Cas protein complexes with a guide RNA and localizes to the DNA sequence specified by that guide RNA's targeting domain; however, it does not cleave the target DNA. Fusion of the inactive Cas protein to an effector domain, also referred to herein as a functional domain, e.g., a transcription repression domain, enables recruitment of the effector to any DNA site specified by the guide RNA.
In general, the positioning of the one or more functional domains on the inactivated CRISPR/Cas protein is one which allows for correct spatial orientation for the functional domain to affect the target with the attributed functional effect. For example, if the functional domain is a transcription activator (e.g., VP64 or p65), the transcription activator is placed in a spatial orientation which allows it to affect the transcription of the target. Likewise, a transcription repressor will be advantageously positioned to affect the transcription of the target, and a nuclease (e.g., Fok1) will be advantageously positioned to cleave or partially cleave the target. This may include positions other than the N-/C-terminus of the CRISPR protein.
In certain embodiments, Cas protein may be fused to a transcriptional repression domain and recruited to the promoter region of a gene. Especially for gene repression, it is contemplated herein that blocking the binding site of an endogenous transcription factor would aid in downregulating gene expression.
In an embodiment, a guide RNA molecule can be targeted to known transcription response elements (e.g., promoters, enhancers, etc.), known upstream activating sequences, and/or sequences of unknown or known function that are suspected of being able to control expression of the target DNA.
In some methods, a target polynucleotide can be inactivated to effect the modification of the expression in a cell. For example, upon the binding of a CRISPR complex to a target sequence in a cell, the target polynucleotide is inactivated such that the sequence is not transcribed, the coded protein is not produced, or the sequence does not function as the wild-type sequence does. For example, a protein or microRNA coding sequence may be inactivated such that the protein is not produced.
c) Guide Molecules
As used herein, the terms “guide sequence” and “guide molecule,” in the context of a CRISPR-Cas system, comprise any polynucleotide sequence having sufficient complementarity with a target nucleic acid sequence to hybridize with the target nucleic acid sequence and direct sequence-specific binding of a nucleic acid-targeting complex to the target nucleic acid sequence. The guide sequences made using the methods disclosed herein may be a full-length guide sequence, a truncated guide sequence, a full-length sgRNA sequence, a truncated sgRNA sequence, or an E+F sgRNA sequence. In some embodiments, the degree of complementarity of the guide sequence to a given target sequence, when optimally aligned using a suitable alignment algorithm, is about or more than about 50%, 60%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, or more. In certain example embodiments, the guide molecule comprises a guide sequence that may be designed to have at least one mismatch with the target sequence, such that the RNA duplex formed between the guide sequence and the target sequence contains at least one mismatch. Accordingly, the degree of complementarity is preferably less than 99%. For instance, where the guide sequence consists of 24 nucleotides, the degree of complementarity is more particularly about 96% or less. In particular embodiments, the guide sequence is designed to have a stretch of two or more adjacent mismatching nucleotides, such that the degree of complementarity over the entire guide sequence is further reduced. For instance, where the guide sequence consists of 24 nucleotides, the degree of complementarity is more particularly about 96% or less, more particularly about 92% or less, more particularly about 88% or less, more particularly about 84% or less, more particularly about 80% or less, more particularly about 76% or less, more particularly about 72% or less, depending on whether the stretch of two or more mismatching nucleotides encompasses 2, 3, 4, 5, 6 or 7 nucleotides, etc.
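By way of illustration only, the relationship between the number of mismatching nucleotides and the degree of complementarity for a guide sequence of a given length can be sketched as follows. The function name `complementarity_percent` is hypothetical and not part of the disclosure. The percentages recited above step down by roughly 4% per mismatch for a 24-nucleotide guide; the exact quotients computed here are slightly lower and are consistent with the "about X% or less" language.

```python
def complementarity_percent(guide_length: int, mismatches: int) -> float:
    """Percent of guide positions that pair with the target.

    Hypothetical helper: assumes an ungapped duplex in which every
    position is either a match or a mismatch.
    """
    if guide_length <= 0:
        raise ValueError("guide_length must be positive")
    if not 0 <= mismatches <= guide_length:
        raise ValueError("mismatches must lie between 0 and guide_length")
    return 100.0 * (guide_length - mismatches) / guide_length


if __name__ == "__main__":
    # Exact values for a 24-nt guide carrying 1 to 7 mismatches.
    for k in range(1, 8):
        print(k, round(complementarity_percent(24, k), 1))
```

For a 24-nucleotide guide, one mismatch leaves 23 of 24 positions paired, and each further mismatch removes one more paired position from the numerator.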
In some embodiments, aside from the stretch of one or more mismatching nucleotides, the degree of complementarity, when optimally aligned using a suitable alignment algorithm, is about or more than about 50%, 60%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, or more. Optimal alignment may be determined with the use of any suitable algorithm for aligning sequences, non-limiting examples of which include the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, algorithms based on the Burrows-Wheeler Transform (e.g., the Burrows Wheeler Aligner), ClustalW, Clustal X, BLAT, Novoalign (Novocraft Technologies; available at www.novocraft.com), ELAND (Illumina, San Diego, Calif.), SOAP (available at soap.genomics.org.cn), and Maq (available at maq.sourceforge.net). The ability of a guide sequence (within a nucleic acid-targeting guide RNA) to direct sequence-specific binding of a nucleic acid-targeting complex to a target nucleic acid sequence may be assessed by any suitable assay. For example, the components of a nucleic acid-targeting CRISPR system sufficient to form a nucleic acid-targeting complex, including the guide sequence to be tested, may be provided to a host cell having the corresponding target nucleic acid sequence, such as by transfection with vectors encoding the components of the nucleic acid-targeting complex, followed by an assessment of preferential targeting (e.g., cleavage) within the target nucleic acid sequence, such as by Surveyor assay as described herein. Similarly, cleavage of a target nucleic acid sequence (or a sequence in the vicinity thereof) may be evaluated in a test tube by providing the target nucleic acid sequence, components of a nucleic acid-targeting complex, including the guide sequence to be tested and a control guide sequence different from the test guide sequence, and comparing binding or rate of cleavage at or in the vicinity of the target sequence between the test and control guide sequence reactions.
Other assays are possible, and will occur to those skilled in the art. A guide sequence, and hence a nucleic acid-targeting guide RNA may be selected to target any target nucleic acid sequence.
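As a non-limiting sketch of how a degree of complementarity might be computed from an optimal alignment, the following uses a plain Needleman-Wunsch global alignment, one of the algorithms named above. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration. The guide is compared with the reverse complement of the target strand, so that a perfectly complementary pair scores 100%.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement in the DNA alphabet (U is treated as T)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    dna = seq.upper().replace("U", "T")
    return "".join(comp[base] for base in reversed(dna))


def global_identity(a: str, b: str, match=1, mismatch=-1, gap=-1) -> float:
    """Percent identity over the columns of one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting matches and alignment columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return 100.0 * matches / columns if columns else 0.0


def percent_complementarity(guide: str, target_strand: str) -> float:
    """Guide vs. the strand it hybridizes to, both written 5' to 3'."""
    guide_dna = guide.upper().replace("U", "T")
    return global_identity(guide_dna, reverse_complement(target_strand))
```

For example, `percent_complementarity("GAUUACAGAUUACA", "TGTAATCTGTAATC")` evaluates a guide against a fully complementary target strand. In practice an established aligner would likely be preferred over this hand-written version.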
In certain embodiments, the guide sequence or spacer length of the guide molecules is from 15 to 50 nt. In certain embodiments, the spacer length of the guide RNA is at least 15 nucleotides. In certain embodiments, the spacer length is from 15 to 17 nt, e.g., 15, 16, or 17 nt, from 17 to 20 nt, e.g., 17, 18, 19, or 20 nt, from 20 to 24 nt, e.g., 20, 21, 22, 23, or 24 nt, from 23 to 25 nt, e.g., 23, 24, or 25 nt, from 24 to 27 nt, e.g., 24, 25, 26, or 27 nt, from 27 to 30 nt, e.g., 27, 28, 29, or 30 nt, from 30 to 35 nt, e.g., 30, 31, 32, 33, 34, or 35 nt, or 35 nt or longer. In certain example embodiments, the guide sequence is 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 nt.
In some embodiments, the guide sequence is an RNA sequence of between 10 and 50 nt in length, but more particularly of about 20-30 nt, advantageously about 20 nt, 23-25 nt or 24 nt. The guide sequence is selected so as to ensure that it hybridizes to the target sequence. This is described in more detail below. Selection can encompass further steps which increase efficacy and specificity.
In some embodiments, a guide sequence having a canonical length (e.g., about 15-30 nt) is used to hybridize with the target RNA or DNA. In some embodiments, a guide molecule longer than the canonical length (e.g., >30 nt) is used to hybridize with the target RNA or DNA, such that a region of the guide sequence hybridizes with a region of the RNA or DNA strand outside of the Cas-guide target complex. This can be of interest where additional modifications, such as deamination of nucleotides, are of interest. In alternative embodiments, it is of interest to maintain the limitation of the canonical guide sequence length.
In some embodiments, the sequence of the guide molecule (direct repeat and/or spacer) is selected to reduce the degree of secondary structure within the guide molecule. In some embodiments, about or less than about 75%, 50%, 40%, 30%, 25%, 20%, 15%, 10%, 5%, 1%, or fewer of the nucleotides of the nucleic acid-targeting guide RNA participate in self-complementary base pairing when optimally folded. Optimal folding may be determined by any suitable polynucleotide folding algorithm. Some programs are based on calculating the minimal Gibbs free energy. An example of one such algorithm is mFold, as described by Zuker and Stiegler (Nucleic Acids Res. 9 (1981), 133-148). Another example folding algorithm is the online webserver RNAfold, developed at the Institute for Theoretical Chemistry at the University of Vienna, using the centroid structure prediction algorithm (see e.g., A. R. Gruber et al., 2008, Cell 106(1): 23-24; and PA Carr and GM Church, 2009, Nature Biotechnology 27(12): 1151-62).
In some embodiments, it is of interest to reduce the susceptibility of the guide molecule to RNA cleavage, such as to cleavage by Cas13. Accordingly, in particular embodiments, the guide molecule is adjusted to avoid cleavage by Cas13 or other RNA-cleaving enzymes.
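As a rough illustration of estimating the fraction of nucleotides that can participate in self-complementary base pairing, the sketch below uses Nussinov-style base-pair maximization. This is a deliberately simpler stand-in for the free-energy-based folding programs named above (mFold, RNAfold) and is not equivalent to them: it reports the largest number of nested pairs that could form, which is an upper bound rather than a predicted structure. The function names, the allowance of G-U wobble pairs, and the minimum hairpin loop of 3 nucleotides are assumptions made for this example.

```python
# Watson-Crick pairs plus G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov recurrence)."""
    seq = rna.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i + 1][j]  # case: position i is left unpaired
            for k in range(i + min_loop + 1, j + 1):  # case: i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    inner = best[i + 1][k - 1]
                    outer = best[k + 1][j] if k + 1 <= j else 0
                    value = max(value, 1 + inner + outer)
            best[i][j] = value
    return best[0][n - 1]


def paired_fraction(rna: str, min_loop: int = 3) -> float:
    """Fraction of nucleotides involved in the maximal pairing."""
    if not rna:
        return 0.0
    return 2.0 * max_base_pairs(rna, min_loop) / len(rna)
```

A candidate guide could then be compared against one of the recited thresholds, e.g., `paired_fraction(guide) <= 0.25`, bearing in mind that a thermodynamic folding program would typically report fewer paired nucleotides than this upper bound.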
In certain embodiments, the guide molecule comprises non-naturally occurring nucleic acids and/or non-naturally occurring nucleotides and/or nucleotide analogs, and/or chemical modifications. Preferably, these non-naturally occurring nucleic acids and non-naturally occurring nucleotides are located outside the guide sequence. Non-naturally occurring nucleic acids can include, for example, mixtures of naturally and non-naturally occurring nucleotides. Non-naturally occurring nucleotides and/or nucleotide analogs may be modified at the ribose, phosphate, and/or base moiety. In an embodiment of the invention, a guide nucleic acid comprises ribonucleotides and non-ribonucleotides. In one such embodiment, a guide comprises one or more ribonucleotides and one or more deoxyribonucleotides. In an embodiment of the invention, the guide comprises one or more non-naturally occurring nucleotides or nucleotide analogs such as a nucleotide with a phosphorothioate linkage, a locked nucleic acid (LNA) nucleotide comprising a methylene bridge between the 2′ and 4′ carbons of the ribose ring, or a bridged nucleic acid (BNA). Other examples of modified nucleotides include 2′-O-methyl analogs, 2′-deoxy analogs, or 2′-fluoro analogs. Further examples of modified bases include, but are not limited to, 2-aminopurine, 5-bromo-uridine, pseudouridine, inosine, and 7-methylguanosine. Examples of guide RNA chemical modifications include, without limitation, incorporation of 2′-O-methyl (M), 2′-O-methyl 3′ phosphorothioate (MS), S-constrained ethyl (cEt), or 2′-O-methyl 3′ thioPACE (MSP) at one or more terminal nucleotides. Such chemically modified guides can exhibit increased stability and increased activity as compared to unmodified guides, though on-target vs. off-target specificity is not predictable. (See Hendel, 2015, Nat Biotechnol. 33(9):985-9, doi: 10.1038/nbt.3290, published online 29 Jun. 2015; Rahdar et al., 2015, PNAS, E7110-E7111; Allerson et al., J. Med. Chem.
2005, 48:901-904; Bramsen et al., Front. Genet., 2012, 3:154; Deng et al., PNAS, 2015, 112:11870-11875; Sharma et al., MedChemComm., 2014, 5:1454-1471; Hendel et al., Nat. Biotechnol. (2015) 33(9): 985-989; Li et al., Nature Biomedical Engineering, 2017, 1, 0066 DOI:10.1038/s41551-017-0066). In some embodiments, the 5′ and/or 3′ end of a guide RNA is modified by a variety of functional moieties including fluorescent dyes, polyethylene glycol, cholesterol, proteins, or detection tags. (See Kelly et al., 2016, J. Biotech. 233:74-83). In certain embodiments, a guide comprises ribonucleotides in a region that binds to a target RNA and one or more deoxyribonucleotides and/or nucleotide analogs in a region that binds to Cas13. In an embodiment of the invention, deoxyribonucleotides and/or nucleotide analogs are incorporated in engineered guide structures, such as, without limitation, stem-loop regions, and the seed region. For a Cas13 guide, in certain embodiments, the modification is not in the 5′-handle of the stem-loop regions. Chemical modification in the 5′-handle of the stem-loop region of a guide may abolish its function (see Li, et al., Nature Biomedical Engineering, 2017, 1:0066). In certain embodiments, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, or 75 nucleotides of a guide are chemically modified. In some embodiments, 3-5 nucleotides at either the 3′ or the 5′ end of a guide are chemically modified. In some embodiments, only minor modifications are introduced in the seed region, such as 2′-F modifications. In some embodiments, 2′-F modification is introduced at the 3′ end of a guide. In certain embodiments, three to five nucleotides at the 5′ and/or the 3′ end of the guide are chemically modified with 2′-O-methyl (M), 2′-O-methyl 3′ phosphorothioate (MS), S-constrained ethyl (cEt), or 2′-O-methyl 3′ thioPACE (MSP).
Such modification can enhance genome editing efficiency (see Hendel et al., Nat. Biotechnol. (2015) 33(9): 985-989). In certain embodiments, all of the phosphodiester bonds of a guide are substituted with phosphorothioates (PS) for enhancing levels of gene disruption. In certain embodiments, more than five nucleotides at the 5′ and/or the 3′ end of the guide are chemically modified with 2′-O-Me, 2′-F or S-constrained ethyl (cEt). Such chemically modified guides can mediate enhanced levels of gene disruption (see Rahdar et al., 2015, PNAS, E7110-E7111). In an embodiment of the invention, a guide is modified to comprise a chemical moiety at its 3′ and/or 5′ end. Such moieties include, but are not limited to, amine, azide, alkyne, thio, dibenzocyclooctyne (DBCO), or Rhodamine. In certain embodiments, the chemical moiety is conjugated to the guide by a linker, such as an alkyl chain. In certain embodiments, the chemical moiety of the modified guide can be used to attach the guide to another molecule, such as DNA, RNA, protein, or nanoparticles. Such chemically modified guides can be used to identify or enrich cells generically edited by a CRISPR system (see Lee et al., eLife, 2017, 6:e25312, DOI:10.7554).
In some embodiments, the modification to the guide is a chemical modification, an insertion, a deletion or a split. In some embodiments, the chemical modification includes, but is not limited to, incorporation of 2′-O-methyl (M) analogs, 2′-deoxy analogs, 2-thiouridine analogs, N6-methyladenosine analogs, 2′-fluoro analogs, 2-aminopurine, 5-bromo-uridine, pseudouridine (Ψ), N1-methylpseudouridine (me1Ψ), 5-methoxyuridine (5moU), inosine, 7-methylguanosine, 2′-O-methyl 3′ phosphorothioate (MS), S-constrained ethyl (cEt), phosphorothioate (PS), or 2′-O-methyl 3′ thioPACE (MSP). In some embodiments, the guide comprises one or more phosphorothioate modifications. In certain embodiments, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 25 nucleotides of the guide are chemically modified. In certain embodiments, one or more nucleotides in the seed region are chemically modified. In certain embodiments, one or more nucleotides in the 3′-terminus are chemically modified. In certain embodiments, none of the nucleotides in the 5′-handle is chemically modified. In some embodiments, the chemical modification in the seed region is a minor modification, such as incorporation of a 2′-fluoro analog. In a specific embodiment, one nucleotide of the seed region is replaced with a 2′-fluoro analog. In some embodiments, 5 to 10 nucleotides in the 3′-terminus are chemically modified. Such chemical modifications at the 3′-terminus of the Cas13 crRNA may improve Cas13 activity. In a specific embodiment, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 nucleotides in the 3′-terminus are replaced with 2′-fluoro analogs. In a specific embodiment, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 nucleotides in the 3′-terminus are replaced with 2′-O-methyl (M) analogs.
In some embodiments, the loop of the 5′-handle of the guide is modified. In some embodiments, the loop of the 5′-handle of the guide is modified to have a deletion, an insertion, a split, or chemical modifications. In certain embodiments, the modified loop comprises 3, 4, or 5 nucleotides. In certain embodiments, the loop comprises the sequence of UCUU, UUUU, UAUU, or UGUU (SEQ. I.D. Nos. 1-4).
In some embodiments, the guide molecule forms a stemloop with a separate non-covalently linked sequence, which can be DNA or RNA. In particular embodiments, the sequences forming the guide are first synthesized using the standard phosphoramidite synthetic protocol (Herdewijn, P., ed., Methods in Molecular Biology Vol. 288, Oligonucleotide Synthesis: Methods and Applications, Humana Press, New Jersey (2012)). In some embodiments, these sequences can be functionalized to contain an appropriate functional group for ligation using the standard protocol known in the art (Hermanson, G. T., Bioconjugate Techniques, Academic Press (2013)). Examples of functional groups include, but are not limited to, hydroxyl, amine, carboxylic acid, carboxylic acid halide, carboxylic acid active ester, aldehyde, carbonyl, chlorocarbonyl, imidazolylcarbonyl, hydrazide, semicarbazide, thiosemicarbazide, thiol, maleimide, haloalkyl, sulfonyl, allyl, propargyl, diene, alkyne, and azide. Once this sequence is functionalized, a covalent chemical bond or linkage can be formed between this sequence and the direct repeat sequence. Examples of chemical bonds include, but are not limited to, those based on carbamates, ethers, esters, amides, imines, amidines, aminotriazines, hydrazones, disulfides, thioethers, thioesters, phosphorothioates, phosphorodithioates, sulfonamides, sulfonates, sulfones, sulfoxides, ureas, thioureas, hydrazide, oxime, triazole, photolabile linkages, C-C bond forming groups such as Diels-Alder cyclo-addition pairs or ring-closing metathesis pairs, and Michael reaction pairs.
In some embodiments, these stem-loop forming sequences can be chemically synthesized. In some embodiments, the chemical synthesis uses automated, solid-phase oligonucleotide synthesis machines with 2′-acetoxyethyl orthoester (2′-ACE) (Scaringe et al., J. Am. Chem. Soc. (1998) 120: 11820-11821; Scaringe, Methods Enzymol. (2000) 317: 3-18) or 2′-thionocarbamate (2′-TC) chemistry (Dellinger et al., J. Am. Chem. Soc. (2011) 133: 11540-11546; Hendel et al., Nat. Biotechnol. (2015) 33:985-989).
In certain embodiments, the guide molecule comprises (1) a guide sequence capable of hybridizing to a target locus and (2) a tracr mate or direct repeat sequence, whereby the direct repeat sequence is located upstream (i.e., 5′) from the guide sequence. In a particular embodiment, the seed sequence (i.e., the sequence critical for recognition and/or hybridization to the sequence at the target locus) of the guide sequence is approximately within the first 10 nucleotides of the guide sequence.
In a particular embodiment, the guide molecule comprises a guide sequence linked to a direct repeat sequence, wherein the direct repeat sequence comprises one or more stem loops or optimized secondary structures. In particular embodiments, the direct repeat has a minimum length of 16 nts and a single stem loop. In further embodiments, the direct repeat has a length longer than 16 nts, preferably more than 17 nts, and has more than one stem loop or optimized secondary structure. In particular embodiments, the guide molecule comprises or consists of the guide sequence linked to all or part of the natural direct repeat sequence. A typical Type V or Type VI CRISPR-Cas guide molecule comprises (in 3′ to 5′ direction or in 5′ to 3′ direction): a guide sequence, a first complementary stretch (the “repeat”), a loop (which is typically 4 or 5 nucleotides long), a second complementary stretch (the “anti-repeat” being complementary to the repeat), and a poly A (often poly U in RNA) tail (terminator). In certain embodiments, the direct repeat sequence retains its natural architecture and forms a single stem loop. In particular embodiments, certain aspects of the guide architecture can be modified, for example by addition, subtraction, or substitution of features, whereas certain other aspects of guide architecture are maintained. Preferred locations for engineered guide molecule modifications, including but not limited to insertions, deletions, and substitutions, include guide termini and regions of the guide molecule that are exposed when complexed with the CRISPR-Cas protein and/or target, for example the stemloop of the direct repeat sequence.
In particular embodiments, the stem comprises at least about 4 bp comprising complementary X and Y sequences, although stems of more, e.g., 5, 6, 7, 8, 9, 10, 11 or 12 or fewer, e.g., 3, 2, base pairs are also contemplated. Thus, for example X2-10 and Y2-10 (wherein X and Y represent any complementary set of nucleotides) may be contemplated. In one aspect, the stem made of the X and Y nucleotides, together with the loop will form a complete hairpin in the overall secondary structure; and, this may be advantageous and the amount of base pairs can be any amount that forms a complete hairpin. In one aspect, any complementary X:Y basepairing sequence (e.g., as to length) is tolerated, so long as the secondary structure of the entire guide molecule is preserved. In one aspect, the loop that connects the stem made of X:Y basepairs can be any sequence of the same length (e.g., 4 or 5 nucleotides) or longer that does not interrupt the overall secondary structure of the guide molecule. In one aspect, the stemloop can further comprise, e.g. an MS2 aptamer. In one aspect, the stem comprises about 5-7 bp comprising complementary X and Y sequences, although stems of more or fewer basepairs are also contemplated. In one aspect, non-Watson Crick basepairing is contemplated, where such pairing otherwise generally preserves the architecture of the stemloop at that position.
In particular embodiments, the natural hairpin or stemloop structure of the guide molecule is extended or replaced by an extended stemloop. It has been demonstrated that extension of the stem can enhance the assembly of the guide molecule with the CRISPR-Cas protein (Chen et al. Cell. (2013); 155(7): 1479-1491). In particular embodiments, the stem of the stemloop is extended by at least 1, 2, 3, 4, 5 or more complementary basepairs (i.e., corresponding to the addition of 2, 4, 6, 8, 10 or more nucleotides in the guide molecule). In particular embodiments, these are located at the end of the stem, adjacent to the loop of the stemloop.
In particular embodiments, the susceptibility of the guide molecule to RNAses or to decreased expression can be reduced by slight modifications of the sequence of the guide molecule which do not affect its function. For instance, in particular embodiments, premature termination of transcription, such as premature termination of transcription from a U6 Pol-III promoter, can be avoided by modifying a putative Pol-III terminator (4 consecutive U's) in the guide molecule's sequence. Where such sequence modification is required in the stemloop of the guide molecule, it is preferably ensured by a basepair flip.
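A putative Pol-III terminator of the kind described above (a run of 4 consecutive U's) can be located in a guide sequence with a simple pattern search. The following is a minimal sketch; the function name `find_pol3_terminators` is hypothetical, and the sketch only reports where such runs begin. It does not attempt the basepair flip or any other corrective modification.

```python
import re


def find_pol3_terminators(guide: str, run_length: int = 4) -> list[int]:
    """Return 0-based start positions of runs of at least `run_length` U's.

    DNA-style input is accepted: T is treated as U before searching.
    """
    seq = guide.upper().replace("T", "U")
    pattern = re.compile(r"U{%d,}" % run_length)
    return [match.start() for match in pattern.finditer(seq)]
```

An empty list indicates that no run of the stated length was found; each longer run is reported once, at its first position.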
In a particular embodiment the direct repeat may be modified to comprise one or more protein-binding RNA aptamers. In a particular embodiment, one or more aptamers may be included such as part of optimized secondary structure. Such aptamers may be capable of binding a bacteriophage coat protein as detailed further herein.
In some embodiments, the guide molecule forms a duplex with a target RNA comprising at least one target cytosine residue to be edited. Upon hybridization of the guide RNA molecule to the target RNA, the cytidine deaminase binds to the single strand RNA in the duplex made accessible by the mismatch in the guide sequence and catalyzes deamination of one or more target cytosine residues comprised within the stretch of mismatching nucleotides.
A guide sequence, and hence a nucleic acid-targeting guide RNA may be selected to target any target nucleic acid sequence. The target sequence may be mRNA.
In certain embodiments, the target sequence should be associated with a PAM (protospacer adjacent motif) or PFS (protospacer flanking sequence or site); that is, a short sequence recognized by the CRISPR complex. Depending on the nature of the CRISPR-Cas protein, the target sequence should be selected such that its complementary sequence in the DNA duplex (also referred to herein as the non-target sequence) is upstream or downstream of the PAM. In the embodiments of the present invention where the CRISPR-Cas protein is a Cas13 protein, the complementary sequence of the target sequence is downstream or 3′ of the PAM or upstream or 5′ of the PAM. The precise sequence and length requirements for the PAM differ depending on the Cas13 protein used, but PAMs are typically 2-5 base pair sequences adjacent the protospacer (that is, the target sequence). Examples of the natural PAM sequences for different Cas13 orthologues are provided herein below and the skilled person will be able to identify further PAM sequences for use with a given Cas13 protein.
Further, engineering of the PAM Interacting (PI) domain may allow programming of PAM specificity, improve target site recognition fidelity, and increase the versatility of the CRISPR-Cas protein, for example as described for Cas9 in Kleinstiver B P et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature. 2015 Jul. 23; 523(7561):481-5. doi: 10.1038/nature14592. As further detailed herein, the skilled person will understand that Cas13 proteins may be modified analogously.
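To illustrate selecting target sequences that lie adjacent to a PAM, the sketch below scans one strand of a DNA sequence for a motif and reports the flanking protospacer. All specifics are assumptions for the example: the default motif `NGG` with the protospacer on its 5′ side is the arrangement commonly reported for a widely used Cas9 and is not a Cas13 motif taken from this disclosure; the 20-nt protospacer length, the reduced IUPAC table, and the function name are likewise illustrative.

```python
import re

# Reduced IUPAC table (assumption: only these codes are needed here).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "R": "[AG]", "Y": "[CT]"}


def find_targets(dna: str, pam: str = "NGG", protospacer_len: int = 20,
                 pam_side: str = "3prime"):
    """List (start, protospacer, pam) tuples found on the given strand.

    pam_side="3prime": the protospacer sits immediately 5' of the PAM.
    pam_side="5prime": the protospacer sits immediately 3' of the PAM.
    """
    seq = dna.upper()
    pam_regex = "".join(IUPAC[code] for code in pam.upper())
    hits = []
    # A lookahead is used so that overlapping PAM occurrences are all found.
    for match in re.finditer(f"(?=({pam_regex}))", seq):
        p = match.start()
        if pam_side == "3prime":
            start, end = p - protospacer_len, p
        else:
            start = p + len(pam)
            end = start + protospacer_len
        if start >= 0 and end <= len(seq):
            hits.append((start, seq[start:end], seq[p:p + len(pam)]))
    return hits
```

Only the strand supplied is scanned; the opposite strand could be covered by passing its sequence separately. The motif and side would need to be set for the particular CRISPR-Cas protein in use.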
In particular embodiments, the guide is an escorted guide. By “escorted” is meant that the CRISPR-Cas system or complex or guide is delivered to a selected time or place within a cell, so that activity of the CRISPR-Cas system or complex or guide is spatially or temporally controlled. For example, the activity and destination of the CRISPR-Cas system or complex or guide may be controlled by an escort RNA aptamer sequence that has binding affinity for an aptamer ligand, such as a cell surface protein or other localized cellular component. Alternatively, the escort aptamer may for example be responsive to an aptamer effector on or in the cell, such as a transient effector, such as an external energy source that is applied to the cell at a particular time.
The escorted CRISPR-Cas systems or complexes have a guide molecule with a functional structure designed to improve guide molecule structure, architecture, stability, genetic expression, or any combination thereof. Such a structure can include an aptamer.
Aptamers are biomolecules that can be designed or selected to bind tightly to other ligands, for example using a technique called systematic evolution of ligands by exponential enrichment (SELEX; Tuerk C, Gold L: “Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase.” Science 1990, 249:505-510). Nucleic acid aptamers can for example be selected from pools of random-sequence oligonucleotides, with high binding affinities and specificities for a wide range of biomedically relevant targets, suggesting a wide range of therapeutic utilities for aptamers (Keefe, Anthony D., Supriya Pai, and Andrew Ellington. “Aptamers as therapeutics.” Nature Reviews Drug Discovery 9.7 (2010): 537-550). These characteristics also suggest a wide range of uses for aptamers as drug delivery vehicles (Levy-Nissenbaum, Etgar, et al. “Nanotechnology and aptamers: applications in drug delivery.” Trends in biotechnology 26.8 (2008): 442-449; and, Hicke B J, Stephens A W. “Escort aptamers: a delivery service for diagnosis and therapy.” J Clin Invest 2000, 106:923-928.). Aptamers may also be constructed that function as molecular switches, responding to a cue by changing properties, such as RNA aptamers that bind fluorophores to mimic the activity of green fluorescent protein (Paige, Jeremy S., Karen Y. Wu, and Samie R. Jaffrey. “RNA mimics of green fluorescent protein.” Science 333.6042 (2011): 642-646). It has also been suggested that aptamers may be used as components of targeted siRNA therapeutic delivery systems, for example targeting cell surface proteins (Zhou, Jiehua, and John J. Rossi. “Aptamer-targeted cell-specific RNA interference.” Silence 1.1 (2010): 4).
Accordingly, in particular embodiments, the guide molecule is modified, e.g., by one or more aptamer(s) designed to improve guide molecule delivery, including delivery across the cellular membrane, to intracellular compartments, or into the nucleus. Such a structure can include, either in addition to the one or more aptamer(s) or without such one or more aptamer(s), moiety(ies) so as to render the guide molecule deliverable, inducible or responsive to a selected effector. The invention accordingly comprehends a guide molecule that responds to normal or pathological physiological conditions, including without limitation pH, hypoxia, O2 concentration, temperature, protein concentration, enzymatic concentration, lipid structure, light exposure, mechanical disruption (e.g., ultrasound waves), magnetic fields, electric fields, or electromagnetic radiation.
Light responsiveness of an inducible system may be achieved via the activation and binding of cryptochrome-2 and CIB1. Blue light stimulation induces an activating conformational change in cryptochrome-2, resulting in recruitment of its binding partner CIB1. This binding is fast and reversible, achieving saturation in <15 sec following pulsed stimulation and returning to baseline <15 min after the end of stimulation. These rapid binding kinetics result in a system temporally bound only by the speed of transcription/translation and transcript/protein degradation, rather than uptake and clearance of inducing agents. Cryptochrome-2 activation is also highly sensitive, allowing for the use of low light intensity stimulation and mitigating the risks of phototoxicity. Further, in a context such as the intact mammalian brain, variable light intensity may be used to control the size of a stimulated region, allowing for greater precision than vector delivery alone may offer.
The invention contemplates energy sources such as electromagnetic radiation, sound energy or thermal energy to induce the guide. Advantageously, the electromagnetic radiation is a component of visible light. In a preferred embodiment, the light is a blue light with a wavelength of about 450 to about 495 nm. In an especially preferred embodiment, the wavelength is about 488 nm. In another preferred embodiment, the light stimulation is via pulses. The light power may range from about 0-9 mW/cm2. In a preferred embodiment, a stimulation paradigm of as low as 0.25 sec every 15 sec should result in maximal activation.
The chemical or energy sensitive guide may undergo a conformational change upon induction by the binding of a chemical source or by the energy, allowing it to act as a guide and have the Cas13 CRISPR-Cas system or complex function. The invention can involve applying the chemical source or energy so as to have the guide function and the Cas13 CRISPR-Cas system or complex function; and optionally further determining that the expression of the genomic locus is altered.
There are several different designs of this chemical inducible system: 1. ABI-PYL based system inducible by Abscisic Acid (ABA) (see, e.g., http://stke.sciencemag.org/cgi/content/abstract/sigtrans;4/164/rs2), 2. FKBP-FRB based system inducible by rapamycin (or related chemicals based on rapamycin) (see, e.g., http://www.nature.com/nmeth/journal/v2/n6/full/nmeth763.html), 3. GID1-GAI based system inducible by Gibberellin (GA) (see, e.g., http://www.nature.com/nchembio/journal/v8/n5/full/nchembio.922.html).
A chemical inducible system can be an estrogen receptor (ER) based system inducible by 4-hydroxytamoxifen (4OHT) (see, e.g., http://www.pnas.org/content/104/3/1027.abstract). A mutated ligand-binding domain of the estrogen receptor called ERT2 translocates into the nucleus of cells upon binding of 4-hydroxytamoxifen. In further embodiments of the invention any naturally occurring or engineered derivative of any nuclear receptor, thyroid hormone receptor, retinoic acid receptor, estrogen receptor, estrogen-related receptor, glucocorticoid receptor, progesterone receptor, androgen receptor may be used in inducible systems analogous to the ER based inducible system.
Another inducible system is based on a transient receptor potential (TRP) ion channel design inducible by energy, heat or radio-wave (see, e.g., http://www.sciencemag.org/content/336/6081/604). These TRP family proteins respond to different stimuli, including light and heat. When this protein is activated by light or heat, the ion channel will open and allow ions such as calcium to enter across the plasma membrane. The ions that enter will bind to intracellular ion interacting partners linked to a polypeptide including the guide and the other components of the Cas13 CRISPR-Cas complex or system, and the binding will induce a change in the sub-cellular localization of the polypeptide, leading to the entire polypeptide entering the nucleus of cells. Once inside the nucleus, the guide and the other components of the Cas13 CRISPR-Cas complex will be active and modulate target gene expression in cells.
While light activation may be an advantageous embodiment, sometimes it may be disadvantageous especially for in vivo applications in which the light may not penetrate the skin or other organs. In this instance, other methods of energy activation are contemplated, in particular, electric field energy and/or ultrasound which have a similar effect.
Electric field energy is preferably administered substantially as described in the art, using one or more electric pulses of from about 1 Volt/cm to about 10 kVolts/cm under in vivo conditions. Instead of or in addition to the pulses, the electric field may be delivered in a continuous manner. The electric pulse may be applied for between 1 μs and 500 milliseconds, preferably between 1 μs and 100 milliseconds. The electric field may be applied continuously or in a pulsed manner for about 5 minutes.
As used herein, ‘electric field energy’ is the electrical energy to which a cell is exposed. Preferably the electric field has a strength of from about 1 Volt/cm to about 10 kVolts/cm or more under in vivo conditions (see WO97/49450).
As used herein, the term “electric field” includes one or more pulses at variable capacitance and voltage and including exponential and/or square wave and/or modulated wave and/or modulated square wave forms. References to electric fields and electricity should be taken to include reference to the presence of an electric potential difference in the environment of a cell. Such an environment may be set up by way of static electricity, alternating current (AC), direct current (DC), etc., as known in the art. The electric field may be uniform, non-uniform or otherwise, and may vary in strength and/or direction in a time dependent manner.
Single or multiple applications of electric field, as well as single or multiple applications of ultrasound are also possible, in any order and in any combination. The ultrasound and/or the electric field may be delivered as single or multiple continuous applications, or as pulses (pulsatile delivery).
Electroporation has been used in both in vitro and in vivo procedures to introduce foreign material into living cells. With in vitro applications, a sample of live cells is first mixed with the agent of interest and placed between electrodes such as parallel plates. Then, the electrodes apply an electrical field to the cell/implant mixture. Examples of systems that perform in vitro electroporation include the Electro Cell Manipulator ECM600 product, and the Electro Square Porator T820, both made by the BTX Division of Genetronics, Inc (see U.S. Pat. No. 5,869,326).
The known electroporation techniques (both in vitro and in vivo) function by applying a brief high voltage pulse to electrodes positioned around the treatment region. The electric field generated between the electrodes causes the cell membranes to temporarily become porous, whereupon molecules of the agent of interest enter the cells. In known electroporation applications, this electric field comprises a single square wave pulse on the order of 1000 V/cm, of about 100 μs duration. Such a pulse may be generated, for example, in known applications of the Electro Square Porator T820.
Preferably, the electric field has a strength of from about 1 V/cm to about 10 kV/cm under in vitro conditions. Thus, the electric field may have a strength of 1 V/cm, 2 V/cm, 3 V/cm, 4 V/cm, 5 V/cm, 6 V/cm, 7 V/cm, 8 V/cm, 9 V/cm, 10 V/cm, 20 V/cm, 50 V/cm, 100 V/cm, 200 V/cm, 300 V/cm, 400 V/cm, 500 V/cm, 600 V/cm, 700 V/cm, 800 V/cm, 900 V/cm, 1 kV/cm, 2 kV/cm, 5 kV/cm, 10 kV/cm, 20 kV/cm, 50 kV/cm or more. More preferably from about 0.5 kV/cm to about 4.0 kV/cm under in vitro conditions. Preferably the electric field has a strength of from about 1 V/cm to about 10 kV/cm under in vivo conditions. However, the electric field strengths may be lowered where the number of pulses delivered to the target site are increased. Thus, pulsatile delivery of electric fields at lower field strengths is envisaged.
Preferably the application of the electric field is in the form of multiple pulses such as double pulses of the same strength and capacitance or sequential pulses of varying strength and/or capacitance. As used herein, the term “pulse” includes one or more electric pulses at variable capacitance and voltage and including exponential and/or square wave and/or modulated wave/square wave forms.
Preferably the electric pulse is delivered as a waveform selected from an exponential wave form, a square wave form, a modulated wave form and a modulated square wave form.
A preferred embodiment employs direct current at low voltage. Thus, Applicants disclose the use of an electric field which is applied to the cell, tissue or tissue mass at a field strength of between 1V/cm and 20V/cm, for a period of 100 milliseconds or more, preferably 15 minutes or more.
Ultrasound is advantageously administered at a power level of from about 0.05 W/cm2 to about 100 W/cm2. Diagnostic or therapeutic ultrasound may be used, or combinations thereof.
As used herein, the term “ultrasound” refers to a form of energy which consists of mechanical vibrations the frequencies of which are so high they are above the range of human hearing. The lower frequency limit of the ultrasonic spectrum may generally be taken as about 20 kHz. Most diagnostic applications of ultrasound employ frequencies in the range of 1 to 15 MHz (from Ultrasonics in Clinical Diagnosis, P. N. T. Wells, ed., 2nd. Edition, Publ. Churchill Livingstone [Edinburgh, London & NY, 1977]).
Ultrasound has been used in both diagnostic and therapeutic applications. When used as a diagnostic tool (“diagnostic ultrasound”), ultrasound is typically used in an energy density range of up to about 100 mW/cm2 (FDA recommendation), although energy densities of up to 750 mW/cm2 have been used. In physiotherapy, ultrasound is typically used as an energy source in a range up to about 3 to 4 W/cm2 (WHO recommendation). In other therapeutic applications, higher intensities of ultrasound may be employed, for example, HIFU at 100 W/cm2 up to 1 kW/cm2 (or even higher) for short periods of time. The term “ultrasound” as used in this specification is intended to encompass diagnostic, therapeutic and focused ultrasound.
Focused ultrasound (FUS) allows thermal energy to be delivered without an invasive probe (see Morocz et al 1998 Journal of Magnetic Resonance Imaging Vol. 8, No. 1, pp. 136-142). Another form of focused ultrasound is high intensity focused ultrasound (HIFU) which is reviewed by Moussatov et al in Ultrasonics (1998) Vol. 36, No. 8, pp. 893-900 and TranHuuHue et al in Acustica (1997) Vol. 83, No. 6, pp. 1103-1106.
Preferably, a combination of diagnostic ultrasound and a therapeutic ultrasound is employed. This combination is not intended to be limiting, however, and the skilled reader will appreciate that any variety of combinations of ultrasound may be used. Additionally, the energy density, frequency of ultrasound, and period of exposure may be varied.
Preferably the exposure to an ultrasound energy source is at a power density of from about 0.05 to about 100 Wcm-2. Even more preferably, the exposure to an ultrasound energy source is at a power density of from about 1 to about 15 Wcm-2.
Preferably the exposure to an ultrasound energy source is at a frequency of from about 0.015 to about 10.0 MHz. More preferably the exposure to an ultrasound energy source is at a frequency of from about 0.02 to about 5.0 MHz or about 6.0 MHz. Most preferably, the ultrasound is applied at a frequency of 3 MHz.
Preferably the exposure is for periods of from about 10 milliseconds to about 60 minutes. Preferably the exposure is for periods of from about 1 second to about 5 minutes. More preferably, the ultrasound is applied for about 2 minutes. Depending on the particular target cell to be disrupted, however, the exposure may be for a longer duration, for example, for 15 minutes.
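The preferred ultrasound parameter ranges given above can be collected into a simple check. The sketch below is illustrative only: the function and constant names are hypothetical, the "about" bounds are treated as hard limits purely for the purpose of the example, and the energy density calculation assumes continuous wave delivery at constant power.

```python
# Broadest preferred ranges quoted in the preceding paragraphs.
POWER_W_CM2 = (0.05, 100.0)      # power density, W/cm^2
FREQ_MHZ = (0.015, 10.0)         # frequency, MHz
EXPOSURE_S = (0.010, 60 * 60.0)  # 10 milliseconds to 60 minutes, in seconds


def check_ultrasound(power_w_cm2, freq_mhz, exposure_s):
    """Return the names of parameters falling outside the quoted ranges."""
    checks = {
        "power_density": (power_w_cm2, POWER_W_CM2),
        "frequency": (freq_mhz, FREQ_MHZ),
        "exposure": (exposure_s, EXPOSURE_S),
    }
    return [name for name, (value, (lo, hi)) in checks.items()
            if not lo <= value <= hi]


def energy_density_j_cm2(power_w_cm2, exposure_s):
    """Energy density for continuous wave delivery: W/cm^2 * s = J/cm^2."""
    return power_w_cm2 * exposure_s


# 1 W/cm^2 at 3 MHz for 2 minutes, values drawn from the preferred ranges.
print(check_ultrasound(1.0, 3.0, 120.0))
print(energy_density_j_cm2(1.0, 120.0), "J/cm^2")
```

A parameter reported as out of range is not thereby excluded; it only falls outside the ranges stated as preferred.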
Advantageously, the target tissue is exposed to an ultrasound energy source at an acoustic power density of from about 0.05 Wcm-2 to about 10 Wcm-2 with a frequency ranging from about 0.015 to about 10 MHz (see WO 98/52609). However, alternatives are also possible, for example, exposure to an ultrasound energy source at an acoustic power density of above 100 Wcm-2, but for reduced periods of time, for example, 1000 Wcm-2 for periods in the millisecond range or less.
Preferably the application of the ultrasound is in the form of multiple pulses; thus, both continuous wave and pulsed wave (pulsatile delivery of ultrasound) may be employed in any combination. For example, continuous wave ultrasound may be applied, followed by pulsed wave ultrasound, or vice versa. This may be repeated any number of times, in any order and combination. The pulsed wave ultrasound may be applied against a background of continuous wave ultrasound, and any number of pulses may be used in any number of groups.
Preferably, the ultrasound may comprise pulsed wave ultrasound. In a highly preferred embodiment, the ultrasound is applied at a power density of 0.7 Wcm-2 or 1.25 Wcm-2 as a continuous wave. Higher power densities may be employed if pulsed wave ultrasound is used.
Use of ultrasound is advantageous as, like light, it may be focused accurately on a target. Moreover, ultrasound is advantageous as it may be focused more deeply into tissues unlike light. It is therefore better suited to whole-tissue penetration (such as but not limited to a lobe of the liver) or whole organ (such as but not limited to the entire liver or an entire muscle, such as the heart) therapy. Another important advantage is that ultrasound is a non-invasive stimulus which is used in a wide variety of diagnostic and therapeutic applications. By way of example, ultrasound is well known in medical imaging techniques and, additionally, in orthopedic therapy. Furthermore, instruments suitable for the application of ultrasound to a subject vertebrate are widely available and their use is well known in the art.
In particular embodiments, the guide molecule is modified by a secondary structure to increase the specificity of the CRISPR-Cas system; the secondary structure can protect against exonuclease activity and allow for 5′ additions to the guide sequence. Such a guide is also referred to herein as a protected guide molecule.
In one aspect, the invention provides for hybridizing a “protector RNA” to a sequence of the guide molecule, wherein the “protector RNA” is an RNA strand complementary to the 3′ end of the guide molecule to thereby generate a partially double-stranded guide RNA. In an embodiment of the invention, protecting mismatched bases (i.e. the bases of the guide molecule which do not form part of the guide sequence) with a perfectly complementary protector sequence decreases the likelihood of target RNA binding to the mismatched basepairs at the 3′ end. In particular embodiments of the invention, additional sequences comprising an extended length may also be present within the guide molecule such that the guide comprises a protector sequence within the guide molecule. This “protector sequence” ensures that the guide molecule comprises a “protected sequence” in addition to an “exposed sequence” (comprising the part of the guide sequence hybridizing to the target sequence). In particular embodiments, the guide molecule is modified by the presence of the protector guide to comprise a secondary structure such as a hairpin. Advantageously there are three or four to thirty or more, e.g., about 10 or more, contiguous base pairs having complementarity to the protected sequence, the guide sequence or both. It is advantageous that the protected portion does not impede thermodynamics of the CRISPR-Cas system interacting with its target. By providing such an extension including a partially double stranded guide molecule, the guide molecule is considered protected and results in improved specific binding of the CRISPR-Cas complex, while maintaining specific activity.
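By way of illustration, a protector RNA complementary to the 3′ end of a guide molecule can be derived as the reverse complement of the last n nucleotides of the guide. The sketch below assumes a plain A/C/G/U sequence written 5′ to 3′; the function name, the example guide sequence and the choice of n are hypothetical and are not taken from the disclosure.

```python
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def protector_rna(guide, n):
    """Return the RNA strand (5'->3') complementary to the last n nt of guide.

    Hypothetical helper: guide is an RNA sequence written 5'->3'.
    """
    guide = guide.upper()
    if set(guide) - set("ACGU"):
        raise ValueError("guide must contain only A, C, G and U")
    if not 0 < n <= len(guide):
        raise ValueError("n must be between 1 and the guide length")
    three_prime_end = guide[-n:]
    # Complement each base, then reverse so the result also reads 5'->3'.
    return three_prime_end.translate(_COMPLEMENT)[::-1]


# Made-up guide sequence; protect the final 4 nucleotides (AAGU).
print(protector_rna("GGAUCCAUGCAAGU", 4))  # ACUU
```

Annealing the returned strand to the guide would give the partially double-stranded arrangement described above, with the remainder of the guide left exposed.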
In particular embodiments, use is made of a truncated guide (tru-guide), i.e. a guide molecule which comprises a guide sequence which is truncated in length with respect to the canonical guide sequence length. As described by Nowak et al. (Nucleic Acids Res (2016) 44 (20): 9555-9564), such guides may allow catalytically active CRISPR-Cas enzyme to bind its target without cleaving the target RNA. In particular embodiments, a truncated guide is used which allows the binding of the target but retains only nickase activity of the CRISPR-Cas enzyme.
The present invention may be further illustrated and extended based on aspects of CRISPR-Cas development and use as set forth in the following articles and particularly as relates to delivery of a CRISPR protein complex and uses of an RNA guided endonuclease in cells and organisms:
The methods and tools provided herein may be designed for use with Cas13, a type II nuclease that does not make use of tracrRNA. Orthologs of Cas13 have been identified in different bacterial species as described herein. Further type II nucleases with similar properties can be identified using methods described in the art (Shmakov et al. 2015, 60:385-397; Abudayyeh et al. 2016, Science, 5; 353(6299)). In particular embodiments, such methods for identifying novel CRISPR effector proteins may comprise the steps of selecting sequences from the database encoding a seed which identifies the presence of a CRISPR Cas locus, identifying loci located within 10 kb of the seed comprising Open Reading Frames (ORFs) in the selected sequences, selecting therefrom loci comprising ORFs of which only a single ORF encodes a novel CRISPR effector having greater than 700 amino acids and no more than 90% homology to a known CRISPR effector. In particular embodiments, the seed is a protein that is common to the CRISPR-Cas system, such as Cas1. In further embodiments, the CRISPR array is used as a seed to identify new effector proteins.
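The selection steps described above can be sketched as a simple filter over candidate loci. This is a minimal illustration under stated assumptions: the `Orf` and `Locus` records, their field names and the example data are hypothetical, and homology is represented as a single precomputed percentage against the closest known effector rather than derived from any actual database search.

```python
from dataclasses import dataclass


@dataclass
class Orf:
    length_aa: int              # encoded protein length in amino acids
    homology_to_known: float    # percent homology to closest known effector


@dataclass
class Locus:
    name: str
    distance_to_seed_bp: int    # distance from the seed (e.g. Cas1 or array)
    orfs: list


def candidate_loci(loci, max_distance_bp=10_000, min_len_aa=700,
                   max_homology=90.0):
    """Keep loci within 10 kb of the seed having exactly one novel large ORF."""
    hits = []
    for locus in loci:
        if locus.distance_to_seed_bp > max_distance_bp:
            continue
        novel = [o for o in locus.orfs
                 if o.length_aa > min_len_aa
                 and o.homology_to_known <= max_homology]
        if len(novel) == 1:  # only a single ORF may qualify
            hits.append(locus.name)
    return hits


loci = [
    Locus("a", 5_000, [Orf(950, 40.0), Orf(300, 10.0)]),   # one qualifying ORF
    Locus("b", 15_000, [Orf(1000, 30.0)]),                 # too far from seed
    Locus("c", 2_000, [Orf(800, 95.0)]),                   # too similar to known
    Locus("d", 1_000, [Orf(900, 50.0), Orf(1200, 20.0)]),  # two qualifying ORFs
]
print(candidate_loci(loci))  # ['a']
```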
Also, “Dimeric CRISPR RNA-guided FokI nucleases for highly specific genome editing”, Shengdar Q. Tsai, Nicolas Wyvekens, Cyd Khayter, Jennifer A. Foden, Vishal Thapar, Deepak Reyon, Mathew J. Goodwin, Martin J. Aryee, J. Keith Joung Nature Biotechnology 32(6): 569-77 (2014), relates to dimeric RNA-guided FokI nucleases that recognize extended sequences and can edit endogenous genes with high efficiencies in human cells.
With respect to general information on CRISPR/Cas Systems, components thereof, and delivery of such components, including methods, materials, delivery vehicles, vectors, particles, and making and using thereof, including as to amounts and formulations, as well as CRISPR-Cas-expressing eukaryotic cells, CRISPR-Cas expressing eukaryotes, such as a mouse, reference is made to: U.S. Pat. Nos. 8,999,641, 8,993,233, 8,697,359, 8,771,945, 8,795,965, 8,865,406, 8,871,445, 8,889,356, 8,889,418, 8,895,308, 8,906,616, 8,932,814, and 8,945,839; US Patent Publications US 2014-0310830 (U.S. application Ser. No. 14/105,031), US 2014-0287938 A1 (U.S. application Ser. No. 14/213,991), US 2014-0273234 A1 (U.S. application Ser. No. 14/293,674), US2014-0273232 A1 (U.S. application Ser. No. 14/290,575), US 2014-0273231 (U.S. application Ser. No. 14/259,420), US 2014-0256046 A1 (U.S. application Ser. No. 14/226,274), US 2014-0248702 A1 (U.S. application Ser. No. 14/258,458), US 2014-0242700 A1 (U.S. application Ser. No. 14/222,930), US 2014-0242699 A1 (U.S. application Ser. No. 14/183,512), US 2014-0242664 A1 (U.S. application Ser. No. 14/104,990), US 2014-0234972 A1 (U.S. application Ser. No. 14/183,471), US 2014-0227787 A1 (U.S. application Ser. No. 14/256,912), US 2014-0189896 A1 (U.S. application Ser. No. 14/105,035), US 2014-0186958 (U.S. application Ser. No. 14/105,017), US 2014-0186919 A1 (U.S. application Ser. No. 14/104,977), US 2014-0186843 A1 (U.S. application Ser. No. 14/104,900), US 2014-0179770 A1 (U.S. application Ser. No. 14/104,837) and US 2014-0179006 A1 (U.S. application Ser. No. 14/183,486), US 2014-0170753 (U.S. application Ser. No. 14/183,429); US 2015-0184139 (U.S. application Ser. No. 14/324,960); Ser. No. 
14/054,414 European Patent Applications EP 2 771 468 (EP13818570.7), EP 2 764 103 (EP13824232.6), and EP 2 784 162 (EP14170383.5); and PCT Patent Publications WO2014/093661 (PCT/US2013/074743), WO2014/093694 (PCT/US2013/074790), WO2014/093595 (PCT/US2013/074611), WO2014/093718 (PCT/US2013/074825), WO2014/093709 (PCT/US2013/074812), WO2014/093622 (PCT/US2013/074667), WO2014/093635 (PCT/US2013/074691), WO2014/093655 (PCT/US2013/074736), WO2014/093712 (PCT/US2013/074819), WO2014/093701 (PCT/US2013/074800), WO2014/018423 (PCT/US2013/051418), WO2014/204723 (PCT/US2014/041790), WO2014/204724 (PCT/US2014/041800), WO2014/204725 (PCT/US2014/041803), WO2014/204726 (PCT/US2014/041804), WO2014/204727 (PCT/US2014/041806), WO2014/204728 (PCT/US2014/041808), WO2014/204729 (PCT/US2014/041809), WO2015/089351 (PCT/US2014/069897), WO2015/089354 (PCT/US2014/069902), WO2015/089364 (PCT/US2014/069925), WO2015/089427 (PCT/US2014/070068), WO2015/089462 (PCT/US2014/070127), WO2015/089419 (PCT/US2014/070057), WO2015/089465 (PCT/US2014/070135), WO2015/089486 (PCT/US2014/070175), WO2015/058052 (PCT/US2014/061077), WO2015/070083 (PCT/US2014/064663), WO2015/089354 (PCT/US2014/069902), WO2015/089351 (PCT/US2014/069897), WO2015/089364 (PCT/US2014/069925), WO2015/089427 (PCT/US2014/070068), WO2015/089473 (PCT/US2014/070152), WO2015/089486 (PCT/US2014/070175), WO2016/049258 (PCT/US2015/051830), WO2016/094867 (PCT/US2015/065385), WO2016/094872 (PCT/US2015/065393), WO2016/094874 (PCT/US2015/065396), WO2016/106244 (PCT/US2015/067177).
Mention is also made of U.S. application 62/180,709, 17 Jun. 15, PROTECTED GUIDE RNAS (PGRNAS); U.S. application 62/091,455, filed, 12 Dec. 14, PROTECTED GUIDE RNAS (PGRNAS); U.S. application 62/096,708, 24 Dec. 14, PROTECTED GUIDE RNAS (PGRNAS); U.S. applications 62/091,462, 12 Dec. 14, 62/096,324, 23 Dec. 14, 62/180,681, 17 Jun. 2015, and 62/237,496, 5 Oct. 2015, DEAD GUIDES FOR CRISPR TRANSCRIPTION FACTORS; U.S. application 62/091,456, 12 Dec. 14 and 62/180,692, 17 Jun. 2015, ESCORTED AND FUNCTIONALIZED GUIDES FOR CRISPR-CAS SYSTEMS; U.S. application 62/091,461, 12 Dec. 14, DELIVERY, USE AND THERAPEUTIC APPLICATIONS OF THE CRISPR-CAS SYSTEMS AND COMPOSITIONS FOR GENOME EDITING AS TO HEMATOPOETIC STEM CELLS (HSCs); U.S. application 62/094,903, 19 Dec. 14, UNBIASED IDENTIFICATION OF DOUBLE-STRAND BREAKS AND GENOMIC REARRANGEMENT BY GENOME-WISE INSERT CAPTURE SEQUENCING; U.S. application 62/096,761, 24 Dec. 14, ENGINEERING OF SYSTEMS, METHODS AND OPTIMIZED ENZYME AND GUIDE SCAFFOLDS FOR SEQUENCE MANIPULATION; U.S. application 62/098,059, 30 Dec. 14, 62/181,641, 18 Jun. 2015, and 62/181,667, 18 Jun. 2015, RNA-TARGETING SYSTEM; U.S. application 62/096,656, 24 Dec. 14 and 62/181,151, 17 Jun. 2015, CRISPR HAVING OR ASSOCIATED WITH DESTABILIZATION DOMAINS; U.S. application 62/096,697, 24 Dec. 14, CRISPR HAVING OR ASSOCIATED WITH AAV; U.S. application 62/098,158, 30 Dec. 14, ENGINEERED CRISPR COMPLEX INSERTIONAL TARGETING SYSTEMS; U.S. application 62/151,052, 22 Apr. 15, CELLULAR TARGETING FOR EXTRACELLULAR EXOSOMAL REPORTING; U.S. application 62/054,490, 24 Sep. 14, DELIVERY, USE AND THERAPEUTIC APPLICATIONS OF THE CRISPR-CAS SYSTEMS AND COMPOSITIONS FOR TARGETING DISORDERS AND DISEASES USING PARTICLE DELIVERY COMPONENTS; U.S. application 61/939,154, 12 Feb. 14, SYSTEMS, METHODS AND COMPOSITIONS FOR SEQUENCE MANIPULATION WITH OPTIMIZED FUNCTIONAL CRISPR-CAS SYSTEMS; U.S. application 62/055,484, 25 Sep. 
14, SYSTEMS, METHODS AND COMPOSITIONS FOR SEQUENCE MANIPULATION WITH OPTIMIZED FUNCTIONAL CRISPR-CAS SYSTEMS; U.S. application 62/087,537, 4 Dec. 14, SYSTEMS, METHODS AND COMPOSITIONS FOR SEQUENCE MANIPULATION WITH OPTIMIZED FUNCTIONAL CRISPR-CAS SYSTEMS; U.S. application 62/054,651, 24 Sep. 14, DELIVERY, USE AND THERAPEUTIC APPLICATIONS OF THE CRISPR-CAS SYSTEMS AND COMPOSITIONS FOR MODELING COMPETITION OF MULTIPLE CANCER MUTATIONS IN VIVO; U.S. application 62/067,886, 23 Oct. 14, DELIVERY, USE AND THERAPEUTIC APPLICATIONS OF THE CRISPR-CAS SYSTEMS AND COMPOSITIONS FOR MODELING COMPETITION OF MULTIPLE CANCER MUTATIONS IN VIVO; U.S. applications 62/054,675, 24 Sep. 14 and 62/181,002, 17 Jun. 2015, DELIVERY, USE AND THERAPEUTIC APPLICATIONS OF THE CRISPR-CAS SYSTEMS AND COMPOSITIONS IN NEURONAL CELLS/TISSUES; U.S. application 62/054,528, 24 Sep. 14, DELIVERY, USE AND THERAPEUTIC APPLICATIONS OF THE CRISPR-CAS SYSTEMS AND COMPOSITIONS IN IMMUNE DISEASES OR DISORDERS; U.S. application 62/055,454, 25 Sep. 14, DELIVERY, USE AND THERAPEUTIC APPLICATIONS OF THE CRISPR-CAS SYSTEMS AND COMPOSITIONS FOR TARGETING DISORDERS AND DISEASES USING CELL PENETRATION PEPTIDES (CPP); U.S. application 62/055,460, 25 Sep. 14, MULTIFUNCTIONAL-CRISPR COMPLEXES AND/OR OPTIMIZED ENZYME LINKED FUNCTIONAL-CRISPR COMPLEXES; U.S. application 62/087,475, 4 Dec. 14 and 62/181,690, 18 Jun. 2015, FUNCTIONAL SCREENING WITH OPTIMIZED FUNCTIONAL CRISPR-CAS SYSTEMS; U.S. application 62/055,487, 25 Sep. 14, FUNCTIONAL SCREENING WITH OPTIMIZED FUNCTIONAL CRISPR-CAS SYSTEMS; U.S. application 62/087,546, 4 Dec. 14 and 62/181,687, 18 Jun. 2015, MULTIFUNCTIONAL CRISPR COMPLEXES AND/OR OPTIMIZED ENZYME LINKED FUNCTIONAL-CRISPR COMPLEXES; and U.S. application 62/098,285, 30 Dec. 14, CRISPR MEDIATED IN VIVO MODELING AND GENETIC SCREENING OF TUMOR GROWTH AND METASTASIS.
Mention is made of U.S. applications 62/181,659, 18 Jun. 2015 and 62/207,318, 19 Aug. 2015, ENGINEERING AND OPTIMIZATION OF SYSTEMS, METHODS, ENZYME AND GUIDE SCAFFOLDS OF CAS9 ORTHOLOGS AND VARIANTS FOR SEQUENCE MANIPULATION. Mention is made of U.S. applications 62/181,663, 18 Jun. 2015 and 62/245,264, 22 Oct. 2015, NOVEL CRISPR ENZYMES AND SYSTEMS, U.S. applications 62/181,675, 18 Jun. 2015, 62/285,349, 22 Oct. 2015, 62/296,522, 17 Feb. 2016, and 62/320,231, 8 Apr. 2016, NOVEL CRISPR ENZYMES AND SYSTEMS, U.S. application 62/232,067, 24 Sep. 2015, U.S. application Ser. No. 14/975,085, 18 Dec. 2015, European application No. 16150428.7, U.S. application 62/205,733, 16 Aug. 2015, U.S. application 62/201,542, 5 Aug. 2015, U.S. application 62/193,507, 16 Jul. 2015, and U.S. application 62/181,739, 18 Jun. 2015, each entitled NOVEL CRISPR ENZYMES AND SYSTEMS and of U.S. application 62/245,270, 22 Oct. 2015, NOVEL CRISPR ENZYMES AND SYSTEMS. Mention is also made of U.S. application 61/939,256, 12 Feb. 2014, and WO 2015/089473 (PCT/US2014/070152), 12 Dec. 2014, each entitled ENGINEERING OF SYSTEMS, METHODS AND OPTIMIZED GUIDE COMPOSITIONS WITH NEW ARCHITECTURES FOR SEQUENCE MANIPULATION. Mention is also made of PCT/US2015/045504, 15 Aug. 2015, U.S. application 62/180,699, 17 Jun. 2015, and U.S. application 62/038,358, 17 Aug. 2014, each entitled GENOME EDITING USING CAS9 NICKASES.
Each of these patents, patent publications, and applications, and all documents cited therein or during their prosecution (“appln cited documents”) and all documents cited or referenced in the appln cited documents, together with any instructions, descriptions, product specifications, and product sheets for any products mentioned therein or in any document therein and incorporated by reference herein, are hereby incorporated herein by reference, and may be employed in the practice of the invention. All documents (e.g., these patents, patent publications and applications and the appln cited documents) are incorporated herein by reference to the same extent as if each individual document was specifically and individually indicated to be incorporated by reference.
2. TALE Systems
As disclosed herein editing can be made by way of the transcription activator-like effector nucleases (TALENs) system. Transcription activator-like effectors (TALEs) can be engineered to bind practically any desired DNA sequence. Exemplary methods of genome editing using the TALEN system can be found for example in Cermak T. Doyle E L. Christian M. Wang L. Zhang Y. Schmidt C, et al. Efficient design and assembly of custom TALEN and other TAL effector-based constructs for DNA targeting. Nucleic Acids Res. 2011; 39:e82; Zhang F. Cong L. Lodato S. Kosuri S. Church G M. Arlotta P Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription. Nat Biotechnol. 2011; 29:149-153 and U.S. Pat. Nos. 8,450,471, 8,440,431 and 8,440,432, all of which are specifically incorporated by reference.
In advantageous embodiments of the invention, the methods provided herein use isolated, non-naturally occurring, recombinant or engineered DNA binding proteins that comprise TALE monomers as a part of their organizational structure that enable the targeting of nucleic acid sequences with improved efficiency and expanded specificity.
Naturally occurring TALEs or “wild type TALEs” are nucleic acid binding proteins secreted by numerous species of proteobacteria. TALE polypeptides contain a nucleic acid binding domain composed of tandem repeats of highly conserved monomer polypeptides that are predominantly 33, 34 or 35 amino acids in length and that differ from each other mainly in amino acid positions 12 and 13. In advantageous embodiments the nucleic acid is DNA. As used herein, the term “polypeptide monomers”, or “TALE monomers” will be used to refer to the highly conserved repetitive polypeptide sequences within the TALE nucleic acid binding domain and the term “repeat variable di-residues” or “RVD” will be used to refer to the highly variable amino acids at positions 12 and 13 of the polypeptide monomers. As provided throughout the disclosure, the amino acid residues of the RVD are depicted using the IUPAC single letter code for amino acids. A general representation of a TALE monomer which is comprised within the DNA binding domain is X_{1-11}-(X_{12}X_{13})-X_{14-33 or 34 or 35}, where the subscript indicates the amino acid position and X represents any amino acid. X_{12}X_{13} indicate the RVDs. In some polypeptide monomers, the variable amino acid at position 13 is missing or absent and in such polypeptide monomers, the RVD consists of a single amino acid. In such cases the RVD may be alternatively represented as X*, where X represents X_{12} and (*) indicates that X_{13} is absent. The DNA binding domain comprises several repeats of TALE monomers and this may be represented as (X_{1-11}-(X_{12}X_{13})-X_{14-33 or 34 or 35})_z, where in an advantageous embodiment, z is at least 5 to 40. In a further advantageous embodiment, z is at least 10 to 26.
The TALE monomers have a nucleotide binding affinity that is determined by the identity of the amino acids in its RVD. For example, polypeptide monomers with an RVD of NI preferentially bind to adenine (A), polypeptide monomers with an RVD of NG preferentially bind to thymine (T), polypeptide monomers with an RVD of HD preferentially bind to cytosine (C) and polypeptide monomers with an RVD of NN preferentially bind to both adenine (A) and guanine (G). In yet another embodiment of the invention, polypeptide monomers with an RVD of IG preferentially bind to T. Thus, the number and order of the polypeptide monomer repeats in the nucleic acid binding domain of a TALE determines its nucleic acid target specificity. In still further embodiments of the invention, polypeptide monomers with an RVD of NS recognize all four base pairs and may bind to A, T, G or C. The structure and function of TALEs is further described in, for example, Moscou et al., Science 326:1501 (2009); Boch et al., Science 326:1509-1512 (2009); and Zhang et al., Nature Biotechnology 29:149-153 (2011), each of which is incorporated by reference in its entirety.
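The RVD-to-nucleotide preferences listed above lend themselves to a lookup table. The following sketch encodes only the pairings stated in this paragraph and checks whether an ordered list of RVDs is compatible with a DNA sequence; the names `RVD_TO_BASES` and `rvds_match` are hypothetical, and the example ignores binding affinity differences and any positions not specified by a full monomer.

```python
# Preferred bases per RVD, as stated in the paragraph above.
RVD_TO_BASES = {
    "NI": "A",
    "NG": "T",
    "IG": "T",
    "HD": "C",
    "NN": "AG",    # binds both adenine and guanine
    "NS": "ACGT",  # recognizes all four bases
}


def rvds_match(rvds, dna):
    """True if each base of dna is among the bases preferred by its RVD."""
    dna = dna.upper()
    if len(dna) != len(rvds):
        return False
    return all(base in RVD_TO_BASES[rvd] for rvd, base in zip(rvds, dna))


repeat_array = ["NI", "NG", "HD", "NN", "NS"]
print(rvds_match(repeat_array, "ATCGT"))  # True
print(rvds_match(repeat_array, "ATCAC"))  # True: NN accepts A, NS accepts C
print(rvds_match(repeat_array, "TTCGT"))  # False: NI prefers A
```

Because the number and order of monomers determine the target, reordering `repeat_array` changes which sequences are accepted.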
The TALE polypeptides used in methods of the invention are isolated, non-naturally occurring, recombinant or engineered nucleic acid-binding proteins that have nucleic acid or DNA binding regions containing polypeptide monomer repeats that are designed to target specific nucleic acid sequences.
As described herein, polypeptide monomers having an RVD of HN or NH preferentially bind to guanine and thereby allow the generation of TALE polypeptides with high binding specificity for guanine containing target nucleic acid sequences. In a preferred embodiment of the invention, polypeptide monomers having RVDs RN, NN, NK, SN, NH, KN, HN, NQ, HH, RG, KH, RH and SS preferentially bind to guanine. In a much more advantageous embodiment of the invention, polypeptide monomers having RVDs RN, NK, NQ, HH, KH, RH, SS and SN preferentially bind to guanine and thereby allow the generation of TALE polypeptides with high binding specificity for guanine containing target nucleic acid sequences. In an even more advantageous embodiment of the invention, polypeptide monomers having RVDs HH, KH, NH, NK, NQ, RH, RN and SS preferentially bind to guanine and thereby allow the generation of TALE polypeptides with high binding specificity for guanine containing target nucleic acid sequences. In a further advantageous embodiment, the RVDs that have high binding specificity for guanine are RN, NH, RH and KH. Furthermore, polypeptide monomers having an RVD of NV preferentially bind to adenine and guanine. In more preferred embodiments of the invention, polypeptide monomers having RVDs of H*, HA, KA, N*, NA, NC, NS, RA, and S* bind to adenine, guanine, cytosine and thymine with comparable affinity.
The predetermined N-terminal to C-terminal order of the one or more polypeptide monomers of the nucleic acid or DNA binding domain determines the corresponding predetermined target nucleic acid sequence to which the TALE polypeptides will bind. As used herein the polypeptide monomers and at least one or more half polypeptide monomers are “specifically ordered to target” the genomic locus or gene of interest. In plant genomes, the natural TALE-binding sites always begin with a thymine (T), which may be specified by a cryptic signal within the non-repetitive N-terminus of the TALE polypeptide; in some cases this region may be referred to as repeat 0. In animal genomes, TALE binding sites do not necessarily have to begin with a thymine (T) and TALE polypeptides may target DNA sequences that begin with T, A, G or C. The tandem repeat of TALE monomers always ends with a half-length repeat or a stretch of sequence that may share identity with only the first 20 amino acids of a repetitive full length TALE monomer and this half repeat may be referred to as a half-monomer, which is included in the term “TALE monomer”. Therefore, it follows that the length of the nucleic acid or DNA being targeted is equal to the number of full polypeptide monomers plus two.
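The length relationship stated above can be written out position by position. In the sketch below, the two extra positions are assumed, on one reading of this paragraph, to correspond to the base specified by repeat 0 and the base bound by the terminal half-monomer; the function name and labels are hypothetical.

```python
def target_site_layout(full_monomers):
    """Label each position of a target site for a given number of full monomers.

    Assumption: one position for repeat 0, one per full monomer,
    and one for the terminal half-monomer.
    """
    if full_monomers < 1:
        raise ValueError("at least one full monomer is required")
    layout = ["repeat 0"]
    layout += [f"monomer {i}" for i in range(1, full_monomers + 1)]
    layout.append("half-monomer")
    return layout


layout = target_site_layout(12)
print(len(layout))  # 14, i.e. the number of full monomers plus two
```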
As described in Zhang et al., Nature Biotechnology 29:149-153 (2011), TALE polypeptide binding efficiency may be increased by including amino acid sequences from the “capping regions” that are directly N-terminal or C-terminal of the DNA binding region of naturally occurring TALEs into the engineered TALEs at positions N-terminal or C-terminal of the engineered TALE DNA binding region. Thus, in certain embodiments, the TALE polypeptides described herein further comprise an N-terminal capping region and/or a C-terminal capping region.
An exemplary amino acid sequence of an N-terminal capping region is:
An exemplary amino acid sequence of a C-terminal capping region is:
As used herein, the predetermined “N-terminus” to “C-terminus” orientation of the N-terminal capping region, the DNA binding domain comprising the repeat TALE monomers, and the C-terminal capping region provides a structural basis for the organization of different domains in the d-TALEs or polypeptides of the invention.
The entire N-terminal and/or C-terminal capping regions are not necessary to enhance the binding activity of the DNA binding region. Therefore, in certain embodiments, fragments of the N-terminal and/or C-terminal capping regions are included in the TALE polypeptides described herein.
In certain embodiments, the TALE polypeptides described herein contain an N-terminal capping region fragment that includes at least 10, 20, 30, 40, 50, 54, 60, 70, 80, 87, 90, 94, 100, 102, 110, 117, 120, 130, 140, 147, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260 or 270 amino acids of an N-terminal capping region. In certain embodiments, the N-terminal capping region fragment amino acids are of the C-terminus (the DNA-binding region proximal end) of an N-terminal capping region. As described in Zhang et al., Nature Biotechnology 29:149-153 (2011), N-terminal capping region fragments that include the C-terminal 240 amino acids enhance binding activity equal to the full length capping region, while fragments that include the C-terminal 147 amino acids retain greater than 80% of the efficacy of the full length capping region, and fragments that include the C-terminal 117 amino acids retain greater than 50% of the activity of the full-length capping region.
In some embodiments, the TALE polypeptides described herein contain a C-terminal capping region fragment that includes at least 6, 10, 20, 30, 37, 40, 50, 60, 68, 70, 80, 90, 100, 110, 120, 127, 130, 140, 150, 155, 160, 170 or 180 amino acids of a C-terminal capping region. In certain embodiments, the C-terminal capping region fragment amino acids are of the N-terminus (the DNA-binding region proximal end) of a C-terminal capping region. As described in Zhang et al., Nature Biotechnology 29:149-153 (2011), C-terminal capping region fragments that include the C-terminal 68 amino acids enhance binding activity equal to the full length capping region, while fragments that include the C-terminal 20 amino acids retain greater than 50% of the efficacy of the full length capping region.
In certain embodiments, the capping regions of the TALE polypeptides described herein do not need to have identical sequences to the capping region sequences provided herein. Thus, in some embodiments, the capping region of the TALE polypeptides described herein have sequences that are at least 50%, 60%, 70%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical or share identity to the capping region amino acid sequences provided herein. Sequence identity is related to sequence homology. Homology comparisons may be conducted by eye, or more usually, with the aid of readily available sequence comparison programs. These commercially available computer programs may calculate percent (%) homology between two or more sequences and may also calculate the sequence identity shared by two or more amino acid or nucleic acid sequences. In some preferred embodiments, the capping region of the TALE polypeptides described herein have sequences that are at least 95% identical or share identity to the capping region amino acid sequences provided herein.
Sequence homologies may be generated by any of a number of computer programs known in the art, which include but are not limited to BLAST or FASTA. Suitable computer programs for carrying out alignments, such as the GCG Wisconsin Bestfit package, may also be used. Once the software has produced an optimal alignment, it is possible to calculate % homology, preferably % sequence identity. The software typically does this as part of the sequence comparison and generates a numerical result.
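By way of illustration only, the % sequence identity calculation on an alignment that has already been produced can be sketched in Python as follows. The function name, the example sequences, and the choice of denominator (the full alignment length, including gapped columns) are assumptions made for this sketch; alignment programs such as BLAST or FASTA may use other conventions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two sequences that are already aligned.

    Both inputs must have the same length, with `gap` marking gap columns.
    The denominator is the full alignment length (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    # A column counts as identical only if both residues match and are not gaps.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical aligned fragments; not sequences from this disclosure.
identity = percent_identity("MKT-LLV", "MKTALIV")
print(f"{identity:.1f}% identical")
```

A threshold such as "at least 95% identical" could then be checked by comparing the returned value against 95.0.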
In advantageous embodiments described herein, the TALE polypeptides of the invention include a nucleic acid binding domain linked to the one or more effector domains. The terms “effector domain” or “regulatory and functional domain” refer to a polypeptide sequence that has an activity other than binding to the nucleic acid sequence recognized by the nucleic acid binding domain. By combining a nucleic acid binding domain with one or more effector domains, the polypeptides of the invention may be used to target the one or more functions or activities mediated by the effector domain to a particular target DNA sequence to which the nucleic acid binding domain specifically binds.
In some embodiments of the TALE polypeptides described herein, the activity mediated by the effector domain is a biological activity. For example, in some embodiments the effector domain is a transcriptional inhibitor (i.e., a repressor domain), such as an mSin interaction domain (SID), a SID4X domain, or a Kruppel-associated box (KRAB) or fragments of the KRAB domain. In some embodiments the effector domain is an enhancer of transcription (i.e., an activation domain), such as the VP16, VP64 or p65 activation domain. In some embodiments, the nucleic acid binding domain is linked, for example, with an effector domain that includes but is not limited to a transposase, integrase, recombinase, resolvase, invertase, protease, DNA methyltransferase, DNA demethylase, histone acetylase, histone deacetylase, nuclease, transcriptional repressor, transcriptional activator, transcription factor recruiting, protein nuclear-localization signal or cellular uptake signal.
In some embodiments, the effector domain is a protein domain which exhibits activities which include but are not limited to transposase activity, integrase activity, recombinase activity, resolvase activity, invertase activity, protease activity, DNA methyltransferase activity, DNA demethylase activity, histone acetylase activity, histone deacetylase activity, nuclease activity, nuclear-localization signaling activity, transcriptional repressor activity, transcriptional activator activity, transcription factor recruiting activity, or cellular uptake signaling activity. Other preferred embodiments of the invention may include any combination of the activities described herein.
3. ZN-Finger Nucleases
Other preferred tools for genome editing for use in the context of this invention include zinc finger systems and TALE systems. One type of programmable DNA-binding domain is provided by artificial zinc-finger (ZF) technology, which involves arrays of ZF modules to target new DNA-binding sites in the genome. Each finger module in a ZF array targets three DNA bases. A customized array of individual zinc finger domains is assembled into a ZF protein (ZFP).
ZFPs can comprise a functional domain. The first synthetic zinc finger nucleases (ZFNs) were developed by fusing a ZF protein to the catalytic domain of the Type IIS restriction enzyme FokI. (Kim, Y. G. et al., 1994, Chimeric restriction endonuclease, Proc. Natl. Acad. Sci. U.S.A. 91, 883-887; Kim, Y. G. et al., 1996, Hybrid restriction enzymes: zinc finger fusions to Fok I cleavage domain. Proc. Natl. Acad. Sci. U.S.A. 93, 1156-1160). Increased cleavage specificity can be attained with decreased off target activity by use of paired ZFN heterodimers, each targeting different nucleotide sequences separated by a short spacer. (Doyon, Y. et al., 2011, Enhancing zinc-finger-nuclease activity with improved obligate heterodimeric architectures. Nat. Methods 8, 74-79). ZFPs can also be designed as transcription activators and repressors and have been used to target many genes in a wide variety of organisms. Exemplary methods of genome editing using ZFNs can be found for example in U.S. Pat. Nos. 6,534,261, 6,607,882, 6,746,838, 6,794,136, 6,824,978, 6,866,997, 6,933,113, 6,979,539, 7,013,219, 7,030,215, 7,220,719, 7,241,573, 7,241,574, 7,585,849, 7,595,376, 6,903,185, and 6,479,626, all of which are specifically incorporated by reference.
4. Meganucleases
As disclosed herein, editing can be made by way of meganucleases, which are endodeoxyribonucleases characterized by a large recognition site (double-stranded DNA sequences of 12 to 40 base pairs). Exemplary methods for using meganucleases can be found in U.S. Pat. Nos. 8,163,514; 8,133,697; 8,021,867; 8,119,361; 8,119,381; 8,124,369; and 8,129,134, which are specifically incorporated by reference.
In certain embodiments, the nuclease may be employed to mutate or regulate genetic elements singly or in combination in the organism. Thus, by varying one or more genetic elements in a model organism, the invention provides a means for establishing or confirming causality between genetic changes and phenotypic effects. The genetic changes can be the SNPs or any variation in linkage disequilibrium with the SNP.
Similarly, the model organisms can be used to test effectiveness of therapeutic intervention. In an embodiment, the invention is used to define or establish subgroups of individuals (or models) at elevated risk for coronary artery disease on the basis of different risk factors or combinations of risk factors. In one embodiment, the separate subgroups are used to characterize susceptibility to therapeutic interventions that may vary from subgroup to subgroup. In another embodiment, therapies are selected according to the SNPs identified in a subject.
In an aspect of the invention, there is targeted genomic editing to modify one or more genomic sequences of interest to reduce disease risk. One or more targets may be selected, depending on the genotypic and/or phenotypic outcome. For instance, one or more therapeutic targets may be selected, depending on (genetic) disease etiology or the desired therapeutic outcome. The (therapeutic) target(s) may be a single gene, locus, or other genomic site, or may be multiple genes, loci or other genomic sites. As is known in the art, a single gene, locus, or other genomic site may be targeted more than once, such as by use of multiple gRNAs.
According to the invention, genomic sequences associated with disease risk are identified by single nucleotide polymorphisms (SNPs). The SNPs are linked to the genomic sequences of interest, i.e., close to or within the genomic sequences of interest, and may or may not be causative of the risk variation. That is, functional differences between alleles distinguished by the SNPs may result from sequence variation of an SNP or from one or more differences between alleles located near to the location of the SNP. In either case, the invention provides for gene editing in order to reduce disease risk. In general, a higher risk allele would be edited to resemble more closely a lower risk allele. Often such editing would involve individual base changes, but can also involve insertions and deletions. For example, trinucleotide repeat regions may be edited to change the number of trinucleotide repeats.
In certain embodiments, the nuclease is used for gene editing. Nuclease based therapy or therapeutics may involve target disruption, such as target mutation, such as leading to gene knockout. Nuclease activity, such as CRISPR-Cas system based therapy or therapeutics may involve replacement of particular target sites, such as leading to target correction. Nuclease based therapy or therapeutics may involve removal of particular target sites, such as leading to target deletion. Nuclease activity, such as CRISPR-Cas system based therapy or therapeutics may involve modulation of target site functionality, such as target site activity or accessibility, leading for instance to (transcriptional and/or epigenetic) gene or genomic region activation or gene or genomic region silencing. The skilled person will understand that modulation of target site functionality may involve nuclease mutation (such as for instance generation of a catalytically inactive CRISPR effector) and/or functionalization (such as for instance fusion of the CRISPR effector with a heterologous functional domain, such as a transcriptional activator or repressor), as described herein elsewhere.
Accordingly, in an aspect, the invention relates to a method as described herein, comprising selection of one or more (therapeutic) targets, selecting one or more nuclease functions, and optimization of selected parameters or variables associated with the nuclease system and/or its functionality. In a related aspect, the invention relates to a method as described herein, comprising (a) selecting one or more (therapeutic) target loci, (b) selecting one or more nuclease system functionalities, (c) optionally selecting one or more modes of delivery, and preparing, developing, or designing a CRISPR-Cas system selected based on steps (a)-(c). Methods for selecting optimal Cas9 and Cas12 based systems are disclosed, for example, in International Patent Application Publication Nos. WO/2018/035388 and WO/2018/035387.
In certain embodiments, nuclease system functionality comprises genomic mutation. In certain embodiments, nuclease system functionality comprises single genomic mutation. In certain embodiments, nuclease system functionality comprises multiple genomic mutations. In certain embodiments, nuclease system functionality comprises gene knockout. In certain embodiments, nuclease system functionality comprises single gene knockout. In certain embodiments, nuclease system functionality comprises multiple gene knockout. In certain embodiments, nuclease system functionality comprises gene correction. In certain embodiments, nuclease system functionality comprises single gene correction. In certain embodiments, nuclease system functionality comprises multiple gene correction. In certain embodiments, nuclease system functionality comprises genomic region correction. In certain embodiments, nuclease system functionality comprises single genomic region correction. In certain embodiments, nuclease system functionality comprises multiple genomic region correction. In certain embodiments, nuclease system functionality comprises gene deletion. In certain embodiments, nuclease system functionality comprises single gene deletion. In certain embodiments, nuclease system functionality comprises multiple gene deletion. In certain embodiments, nuclease system functionality comprises genomic region deletion. In certain embodiments, nuclease system functionality comprises single genomic region deletion. In certain embodiments, nuclease system functionality comprises multiple genomic region deletion. In certain embodiments, nuclease system functionality comprises modulation of gene or genomic region functionality. In certain embodiments, nuclease system functionality comprises modulation of single gene or genomic region functionality. In certain embodiments, nuclease system functionality comprises modulation of multiple gene or genomic region functionality. 
In certain embodiments, nuclease system functionality comprises gene or genomic region functionality, such as gene or genomic region activity. In certain embodiments, nuclease system functionality comprises single gene or genomic region functionality, such as gene or genomic region activity. In certain embodiments, nuclease system functionality comprises multiple gene or genomic region functionality, such as gene or genomic region activity. In certain embodiments, nuclease system functionality comprises modulation of gene activity or accessibility optionally leading to transcriptional and/or epigenetic gene or genomic region activation or gene or genomic region silencing. In certain embodiments, nuclease system functionality comprises modulation of single gene activity or accessibility optionally leading to transcriptional and/or epigenetic gene or genomic region activation or gene or genomic region silencing. In certain embodiments, nuclease system functionality comprises modulation of multiple gene activity or accessibility optionally leading to transcriptional and/or epigenetic gene or genomic region activation or gene or genomic region silencing.
The methods as described herein may further involve selection of the nuclease system mode of delivery. In certain embodiments, gRNA (and tracr, if and where needed, optionally provided as a sgRNA) and/or CRISPR effector protein are or are to be delivered. In certain embodiments, gRNA (and tracr, if and where needed, optionally provided as a sgRNA) and/or CRISPR effector mRNA are or are to be delivered. In certain embodiments, gRNA (and tracr, if and where needed, optionally provided as a sgRNA) and/or CRISPR effector provided in a DNA-based expression system are or are to be delivered. In certain embodiments, delivery of the individual CRISPR-Cas system components comprises a combination of the above modes of delivery. In certain embodiments, delivery comprises delivering gRNA and/or CRISPR effector protein, delivering gRNA and/or CRISPR effector mRNA, or delivering gRNA and/or CRISPR effector as a DNA based expression system.
Accordingly, in an aspect, the invention relates to a method as described herein, comprising selection of one or more (therapeutic) targets, selecting nuclease system functionality, selecting nuclease system mode of delivery, and optimization of selected parameters or variables associated with the nuclease system and/or its functionality.
The methods as described herein may further involve selection of the nuclease system delivery vehicle and/or expression system. Delivery vehicles and expression systems are described herein elsewhere. By means of example, delivery vehicles of nucleic acids and/or proteins include nanoparticles, liposomes, etc. Delivery vehicles for DNA, such as DNA-based expression systems, include for instance biolistics, viral based vector systems (e.g. adenoviral, AAV, lentiviral), etc. The skilled person will understand that selection of the mode of delivery, as well as delivery vehicle or expression system, may depend on for instance the cell or tissues to be targeted. In certain embodiments, the delivery vehicle and/or expression system for delivering the nuclease systems or components thereof comprises liposomes, lipid particles, nanoparticles, biolistics, or viral-based expression/delivery systems.
Optimization of selected parameters or variables in the methods as described herein may result in optimized or improved nuclease system, such as CRISPR-Cas system based therapy or therapeutic, specificity, efficacy, and/or safety. In certain embodiments, one or more of the following parameters or variables are taken into account, are selected, or are optimized in the methods of the invention as described herein: CRISPR effector specificity, gRNA specificity, CRISPR-Cas complex specificity, PAM restrictiveness, PAM type (natural or modified), PAM nucleotide content, PAM length, CRISPR effector activity, gRNA activity, CRISPR-Cas complex activity, target cleavage efficiency, target site selection, target sequence length, ability of effector protein to access regions of high chromatin accessibility, degree of uniform enzyme activity across genomic targets, epigenetic tolerance, mismatch/bulge tolerance, CRISPR effector stability, CRISPR effector mRNA stability, gRNA stability, CRISPR-Cas complex stability, CRISPR effector protein or mRNA immunogenicity or toxicity, gRNA immunogenicity or toxicity, CRISPR-Cas complex immunogenicity or toxicity, CRISPR effector protein or mRNA dose or titer, gRNA dose or titer, CRISPR-Cas complex dose or titer, CRISPR effector protein size, CRISPR effector expression level, gRNA expression level, CRISPR-Cas complex expression level, CRISPR effector spatiotemporal expression, gRNA spatiotemporal expression, CRISPR-Cas complex spatiotemporal expression.
In certain embodiments, selecting one or more CRISPR-Cas system functionalities comprises selecting one or more of an optimal effector protein, an optimal guide RNA, or both.
In an exemplary method for modifying a target polynucleotide by integrating an exogenous polynucleotide template, a double stranded break is introduced into the genome sequence by the CRISPR complex, and the break is repaired via homologous recombination with an exogenous polynucleotide template such that the template is integrated into the genome. The presence of a double-stranded break facilitates integration of the template.
In an exemplary method for modifying a target polynucleotide by integrating an exogenous polynucleotide template, a single stranded break is introduced into the genome sequence by the nuclease, for example wherein the CRISPR-Cas protein is a nickase. The break is repaired via homologous recombination with an exogenous polynucleotide template such that the template is integrated into the genome. The presence of a single-stranded break facilitates integration of the template.
In certain embodiments, the therapeutic nuclease system is multiplexed for targeting multiple loci. In certain embodiments, this can be established by using multiple (tandem or multiplex) guide RNA (gRNA) sequences. In certain embodiments, said gRNA sequences are separated by a nucleotide sequence, such as a direct repeat (DR). In certain embodiments, said gRNA sequences are separated by a sequence cleavable by a host enzyme. In certain embodiments, a “self-inactivating” gRNA is included which targets an element of the CRISPR system.
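The tandem arrangement of guide sequences separated by a direct repeat can be sketched in Python as shown below. This is an illustrative sketch only: the function name and all sequences are hypothetical placeholders, and the actual order and content of repeat and guide elements depend on the CRISPR-Cas system used.

```python
def build_guide_array(guide_sequences, direct_repeat):
    """Concatenate guide sequences, separating neighbours with a direct repeat.

    The separator is placed only between adjacent guides; any leading or
    trailing repeat a given system may require is not modelled here.
    """
    if not guide_sequences:
        raise ValueError("at least one guide sequence is required")
    return direct_repeat.join(guide_sequences)


# Placeholder sequences chosen for readability; they are not real guides
# or a real direct repeat.
guides = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
array = build_guide_array(guides, direct_repeat="TTTT")
print(array)
```

The same function could model separation by a host-enzyme-cleavable sequence by passing that sequence as the separator.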
In certain embodiments, selecting an optimal effector protein comprises optimizing one or more of effector protein type, size, PAM specificity, effector protein stability, immunogenicity or toxicity, functional specificity, and efficacy, or other CRISPR effector associated parameters or variables as described herein elsewhere.
The invention further provides for targeted delivery whereby a nuclease system is preferably delivered to a cell type of interest. In one embodiment, it may be preferable for a CRISPR system engineered to target certain genetic loci to be delivered to a particular cell type wherein those loci are expressed and active. According to the invention, a CRISPR system can be preferentially targeted, without limitation, to a liver cell, an epithelial cell, a hematopoietic cell, or an immune cell. In an embodiment of the invention, a cell type of interest is preferentially targeted by using viral vectors of a particular serotype. In an embodiment of the invention, a cell type of interest is preferentially targeted by a vector particle displaying a target-specific ligand.
In any of the above embodiments, identifying whether the SNP is present includes obtaining information regarding the identity (i.e., of a specific nucleotide), presence or absence of one or more specific SNPs in a subject. Determining the presence of an SNP can, but need not, include obtaining a sample comprising DNA from a subject. The individual or organization who determines the presence of an SNP need not actually carry out the physical analysis of a sample from a subject; the methods can include using information obtained by analysis of the sample by a third party. Thus the methods can include steps that occur at more than one site. For example, a sample can be obtained from a subject at a first site, such as at a health care provider, or at the subject's home in the case of a self-testing kit. The sample can be analyzed at the same or a second site, e.g., at a laboratory or other testing facility. Identifying the presence of an SNP can be done by any DNA detection method known in the art, including sequencing at least part of a genome of one or more cells from the subject.
In certain example embodiments, detection of SNPs can be done by sequencing. Sequencing can be, for example, whole genome sequencing. In certain embodiments, the invention involves plate based single cell RNA sequencing (see, e.g., Picelli, S. et al., 2014, “Full-length RNA-seq from single cells using Smart-seq2” Nature protocols 9, 171-181, doi:10.1038/nprot.2014.006). In certain embodiments, the invention involves high-throughput single-cell RNA-seq and/or targeted nucleic acid profiling (for example, sequencing, quantitative reverse transcription polymerase chain reaction, and the like) where the RNAs from different cells are tagged individually, allowing a single library to be created while retaining the cell identity of each read. In this regard reference is made to Macosko et al., 2015, “Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets” Cell 161, 1202-1214; International patent application number PCT/US2015/049178, published as WO2016/040476 on Mar. 17, 2016; Klein et al., 2015, “Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells” Cell 161, 1187-1201; International patent application number PCT/US2016/027734, published as WO2016168584A1 on Oct. 20, 2016; Zheng, et al., 2016, “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing” Nature Biotechnology 34, 303-311; Zheng, et al., 2017, “Massively parallel digital transcriptional profiling of single cells” Nat. Commun. 8, 14049 doi: 10.1038/ncomms14049; International patent publication number WO2014210353A2; Zilionis, et al., 2017, “Single-cell barcoding and sequencing using droplet microfluidics” Nat Protoc. January; 12(1):44-73; Cao et al., 2017, “Comprehensive single cell transcriptional profiling of a multicellular organism by combinatorial indexing” bioRxiv preprint first posted online Feb. 
2, 2017, doi: dx.doi.org/10.1101/104844; Rosenberg et al., 2017, “Scaling single cell transcriptomics through split pool barcoding” bioRxiv preprint first posted online Feb. 2, 2017, doi: dx.doi.org/10.1101/105163; Vitak, et al., “Sequencing thousands of single-cell genomes with combinatorial indexing” Nature Methods, 14(3):302-308, 2017; Cao, et al., Comprehensive single-cell transcriptional profiling of a multicellular organism. Science, 357(6352):661-667, 2017; and Gierahn et al., “Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput” Nature Methods 14, 395-398 (2017), all the contents and disclosure of each of which are herein incorporated by reference in their entirety. In certain embodiments, the invention involves single nucleus RNA sequencing. In this regard reference is made to Swiech et al., 2014, “In vivo interrogation of gene function in the mammalian brain using CRISPR-Cas9” Nature Biotechnology Vol. 33, pp. 102-106; Habib et al., 2016, “Div-Seq: Single-nucleus RNA-Seq reveals dynamics of rare adult newborn neurons” Science, Vol. 353, Issue 6302, pp. 925-928; Habib et al., 2017, “Massively parallel single-nucleus RNA-seq with DroNc-seq” Nat Methods. 2017 October; 14(10):955-958; and International patent application number PCT/US2016/059239, published as WO2017164936 on Sep. 28, 2017, which are herein incorporated by reference in their entirety.
In certain example embodiments, target genomic regions of interest may be enriched from single cell sequencing libraries prior to sequencing analysis. Example enrichment methods are described, for example, in U.S. Provisional Application No. 62/576,031 entitled “Single Cell Cellular Component Enrichment from Barcoded Sequencing Libraries” filed Oct. 23, 2017.
SNPs may be detected through hybridization-based methods, including dynamic allele-specific hybridization (DASH), molecular beacons, and SNP microarrays, enzyme-based methods including RFLP, PCR-based, e.g., allelic-specific polymerase chain reaction (AS-PCR), polymerase chain reaction-restriction fragment length polymorphism (PCR-RFLP), multiplex PCR real-time invader assay (mPCR-RETINA), amplification refractory mutation system (ARMS), flap endonuclease, primer extension, 5′ nuclease, e.g., Taqman or 5′ nuclease allelic discrimination assay, and oligonucleotide ligation assay, and methods such as single strand conformation polymorphism, temperature gradient gel electrophoresis, denaturing high performance liquid chromatography, high-resolution melting of the entire amplicon, use of DNA mismatch-binding proteins, SNPlex, and Surveyor nuclease assay.
In any of the above embodiments, the subject can be an animal, including a mammal, such as a human or a non-human mammal.
In an embodiment, the invention provides a method of identifying a risk of developing coronary artery disease, e.g., myocardial infarction, in a subject and providing a treatment to the subject, the method comprising obtaining a biological sample from the subject; identifying whether at least one single nucleotide polymorphism (SNP) from Table A or Table B or Table C or Table D is present in the biological sample; wherein the presence of a risk allele of a SNP from Table A or Table B or Table C or Table D indicates that the subject has an increased risk of coronary artery disease or myocardial infarction; and initiating a treatment for the subject, wherein the treatment comprises statins, ezetimibe, beta-blocking agents, angiotensin-converting-enzyme inhibitors, aspirin, anticoagulants, antiplatelet agents, angiotensin II receptor blockers, angiotensin receptor neprilysin inhibitors, calcium channel blockers, cholesterol-lowering medications, vasodilators, antidiuretics, renin-angiotensin system agents, lipid-modifying medicines, anti-inflammatory agents, nitrates, antiarrhythmic medicines, steroidal or non-steroidal anti-inflammatory drugs, DNA methyltransferase inhibitors and/or histone deacetylase inhibitors.
In an embodiment, the invention provides a method of reducing a risk of coronary artery disease, e.g., myocardial infarction, in a subject comprising administering to the subject a treatment which comprises one or more statins, beta-blocking agents, angiotensin-converting-enzyme inhibitors, aspirin, anticoagulants, antiplatelet agents, angiotensin II receptor blockers, angiotensin receptor neprilysin inhibitors, calcium channel blockers, cholesterol-lowering medications, vasodilators, antidiuretics, renin-angiotensin system agents, lipid-modifying medicines, anti-inflammatory agents, nitrates, antiarrhythmic medicines, steroidal or non-steroidal anti-inflammatory drugs, DNA methyltransferase inhibitors and/or histone deacetylase inhibitors, wherein the subject has a polygenic risk score that corresponds to a high risk group. The polygenic risk score may be calculated by selecting at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, or at least 100,000 single nucleotide polymorphisms (SNPs) from Table A or Table B or Table C or Table D; identifying whether the at least 50, at least 95, at least 100, at least 200, at least 500, at least 1000, at least 2000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 75,000, at least 100,000, at least 500,000, at least 1,000,000, at least 2,000,000, at least 3,000,000, at least 4,000,000, at least 5,000,000, or at least 6,000,000 SNPs are present in a biological sample from the subject; and calculating the polygenic risk score (PRS) based on the presence of the SNPs.
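One common way to compute a polygenic risk score from detected SNPs is a weighted sum of risk-allele counts, sketched below in Python. This is a minimal illustration and not necessarily the calculation used with Tables A to D: the function name, the SNP identifiers, and the per-allele weights are hypothetical, and the handling of ungenotyped SNPs (skipped) is an assumption.

```python
def polygenic_risk_score(weights, dosages):
    """Weighted sum of risk-allele counts over the selected SNPs.

    weights: dict mapping a SNP identifier to its per-allele weight.
    dosages: dict mapping a SNP identifier to the number of risk alleles
             (0, 1 or 2) found in the subject's sample.
    SNPs with no genotype in `dosages` are skipped (an assumed convention).
    """
    score = 0.0
    for snp_id, weight in weights.items():
        dosage = dosages.get(snp_id)
        if dosage is None:
            continue  # SNP was not genotyped in this sample
        if dosage not in (0, 1, 2):
            raise ValueError(f"invalid risk-allele count for {snp_id}: {dosage}")
        score += weight * dosage
    return score


# Hypothetical identifiers and weights; not values from Tables A to D.
weights = {"snp_1": 0.10, "snp_2": 0.05, "snp_3": 0.20, "snp_4": 0.30}
dosages = {"snp_1": 2, "snp_2": 0, "snp_3": 1}  # snp_4 not genotyped
prs = polygenic_risk_score(weights, dosages)
print(round(prs, 3))
```

Assigning a subject to a high risk group would additionally require comparing the score against a reference distribution, which is not modelled in this sketch.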
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined in the appended claims.
As used herein, the term “coronary artery disease” includes, e.g., stable angina, unstable angina, myocardial infarction, and sudden cardiac death.
As used herein, the term “myocardial infarction”, also known as a heart attack, includes, e.g., early-onset MI.
As used herein, the term “biological sample” is used in its broadest sense. A biological sample may be obtained from a subject (e.g., a human) or from components (e.g., tissues) of a subject. The sample may be of any biological tissue or fluid with which biomarkers of the present invention may be assayed. Frequently, the sample will be a “clinical sample”, i.e., a sample derived from a patient. Such samples include, but are not limited to, bodily fluids, e.g., urine, whole blood, blood plasma, saliva; tissue or fine needle biopsy samples; and archival samples with known diagnosis, treatment and/or outcome history. The term biological sample also encompasses any material derived by processing the biological sample. Derived materials include, but are not limited to, cells (or their progeny) isolated from the sample, proteins or nucleic acid molecules extracted from the sample. Processing of the biological sample may involve one or more of, filtration, distillation, extraction, concentration, inactivation of interfering components, addition of reagents, and the like. In some embodiments, the biological sample is a whole blood sample. In some embodiments, the biological sample includes peripheral blood mononuclear cells (PBMCs) obtained from a subject. PBMCs can be extracted from whole blood using ficoll, a hydrophilic polysaccharide that separates layers of blood, and gradient centrifugation, which will separate the blood into a top layer of plasma, followed by a layer of PBMCs and a bottom fraction of polymorphonuclear cells (such as neutrophils and eosinophils) and erythrocytes.
As used herein, Table A refers to BI-10219 Table A.txt (116KSNP_score), 3,217,459 bytes, which contains 116859 SNPs and is submitted with this application. The information contained in Table A includes chromosome number, position of the nucleotide, allelic variants, risk allele, and weighted risk score. For example, the entry “1:109817192_A_G_A 0.11453” indicates that the SNP is on chromosome 1 and at nucleotide position 109817192. The allele is either A or G, and the risk allele is A. The weighted risk score associated with the risk allele is 0.11453.
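The entry format described above can be read programmatically. The following is a minimal sketch, assuming each line of the table holds a variant identifier of the form chromosome:position_allele1_allele2_riskallele followed by whitespace and the weighted risk score, as in the quoted example; the names ScoreEntry and parse_score_line are hypothetical and are not part of the submitted tables.

```python
from typing import NamedTuple, Tuple


class ScoreEntry(NamedTuple):
    """One parsed table entry (hypothetical container used for illustration)."""
    chromosome: str
    position: int
    alleles: Tuple[str, str]
    risk_allele: str
    weight: float


def parse_score_line(line: str) -> ScoreEntry:
    """Parse an entry such as '1:109817192_A_G_A 0.11453'.

    Assumes whitespace separates the variant identifier from the weight.
    """
    variant, weight = line.split()
    chromosome, rest = variant.split(":", 1)
    position, allele_1, allele_2, risk_allele = rest.split("_")
    if risk_allele not in (allele_1, allele_2):
        raise ValueError(f"risk allele {risk_allele!r} is not one of the listed alleles")
    return ScoreEntry(chromosome, int(position), (allele_1, allele_2),
                      risk_allele, float(weight))


entry = parse_score_line("1:109817192_A_G_A 0.11453")
print(entry)
```

The parsed fields match the description in the text: chromosome 1, position 109817192, alleles A and G, risk allele A, and weight 0.11453.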
Table B refers to BI-10219 Table B.txt (6.6M Variant score) (divided into parts 1-14) which contains 6,630,150 SNPs and is submitted with this application. The information contained in Table B includes chromosome number, position of the nucleotide, allelic variants, risk allele, and weighted risk score. Table C refers to BI-10219 Table C.txt (Top1% Variant score) which contains 66,296 SNPs and is submitted with this application. The information contained in Table C includes chromosome number, position of the nucleotide, allelic variants, risk allele, and weighted risk score.
Table D refers to BI-10219 Table B.txt which contains 6,630,150 SNPs and is submitted with this application. The information contained in Table D includes chromosome number, position of the nucleotide, allelic variants, risk allele, and weighted risk score.
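By way of illustration only, one common way to compute a polygenic risk score from such tables is a weighted sum of risk allele counts. The sketch below assumes that approach (the weight of each SNP multiplied by the number of risk alleles the subject carries, 0, 1, or 2, summed over the SNPs that were genotyped); the function name, the second variant identifier, and its weight are hypothetical.

```python
from typing import Dict, Tuple


def polygenic_risk_score(weights: Dict[str, float],
                         risk_allele_counts: Dict[str, int]) -> Tuple[float, int]:
    """Return (score, number of SNPs used).

    weights: variant identifier -> weighted risk score from a table.
    risk_allele_counts: variant identifier -> copies of the risk allele (0, 1, or 2).
    SNPs absent from the subject's genotype data are skipped (an assumption).
    """
    score = 0.0
    used = 0
    for variant_id, weight in weights.items():
        count = risk_allele_counts.get(variant_id)
        if count is None:
            continue
        if count not in (0, 1, 2):
            raise ValueError(f"unexpected risk allele count for {variant_id}: {count}")
        score += weight * count
        used += 1
    return score, used


weights = {
    "1:109817192_A_G_A": 0.11453,  # example entry quoted in the text
    "99:1_C_T_T": 0.05,            # hypothetical entry for illustration only
}
subject = {"1:109817192_A_G_A": 2, "99:1_C_T_T": 1}
score, used = polygenic_risk_score(weights, subject)
print(score, used)
```

A subject's score would then be compared with the distribution of scores in a reference population to decide whether it falls in a high risk group; that comparison is not shown here.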
As used herein, an “allele” is one of a pair or series of genetic variants of a polymorphism at a specific genomic location. A “response allele” is an allele that is associated with altered response to a treatment. Where a SNP is biallelic, both alleles will be response alleles (e.g., one will be associated with a positive response, while the other allele is associated with no or a negative response, or some variation thereof).
As used herein, “genotype” refers to the diploid combination of alleles for a given genetic polymorphism. A homozygous subject carries two copies of the same allele and a heterozygous subject carries two different alleles.
As used herein, a “haplotype” is one or a set of signature genetic changes (polymorphisms) that are normally grouped closely together on the DNA strand, and are usually inherited as a group; the polymorphisms are also referred to herein as “markers.” A “haplotype” as used herein is information regarding the presence or absence of one or more genetic markers in a given chromosomal region in a subject. A haplotype can consist of a variety of genetic markers, including indels (insertions or deletions of the DNA at particular locations on the chromosome); single nucleotide polymorphisms (SNPs) in which a particular nucleotide is changed; microsatellites; and minisatellites.
The term “chromosome” as used herein refers to a gene carrier of a cell that is derived from chromatin and comprises DNA and protein components (e.g., histones). The conventional internationally recognized individual human genome chromosome numbering identification system is employed herein. The size of an individual chromosome can vary from one type to another within a given multi-chromosomal genome and from one genome to another. In the case of the human genome, the entire DNA mass of a given chromosome is usually greater than about 100,000,000 base pairs.
The term “gene” refers to a DNA sequence in a chromosome that codes for a product (either RNA or its translation product, a polypeptide). A gene contains a coding region and includes regions preceding and following the coding region (termed respectively “leader” and “trailer”). The coding region is comprised of a plurality of coding segments (“exons”) and intervening sequences (“introns”) between individual coding segments.
As used herein, the terms “protein”, “polypeptide”, and “peptide” are used herein interchangeably, and refer to amino acid sequences of a variety of lengths, either in their neutral (uncharged) forms or as salts, and either unmodified or modified by glycosylation, side chain oxidation, or phosphorylation, or modified by deletion, insertion, or change in one or more amino acids.
As used herein, the terms “nucleic acid molecule” and “polynucleotide” are used herein interchangeably. They refer to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise stated, encompass known analogs of natural nucleotides that can function in a similar manner as naturally occurring nucleotides. The terms encompass nucleic acid-like structures with synthetic backbones, as well as amplification products.
As used herein, the term “hybridizing” refers to the binding of two single stranded nucleic acids via complementary base pairing. The term “specific hybridization” refers to a process in which a nucleic acid molecule preferentially binds, duplexes, or hybridizes to a particular nucleic acid sequence under stringent conditions (e.g., in the presence of competitor nucleic acids with a lower degree of complementarity to the hybridizing strand). In certain embodiments of the present invention, these terms more specifically refer to a process in which a nucleic acid fragment (or segment) from a test sample preferentially binds to a particular probe and to a lesser extent or not at all, to other probes, for example, when these probes are immobilized on an array.
The term “probe” refers to an oligonucleotide. A probe can be single stranded at the time of hybridization to a target. As used herein, probes include primers, i.e., oligonucleotides that can be used to prime a reaction, e.g., a PCR reaction.
The term “label” or “label containing moiety” refers to a moiety capable of detection, such as a radioactive isotope or group containing same, and nonisotopic labels, such as enzymes, biotin, avidin, streptavidin, digoxygenin, luminescent agents, dyes, haptens, and the like. Luminescent agents, depending upon the source of exciting energy, can be classified as radioluminescent, chemiluminescent, bioluminescent, and photoluminescent (including fluorescent and phosphorescent). A probe described herein can be bound, e.g., chemically bound to label-containing moieties or can be suitable to be so bound. The probe can be directly or indirectly labeled.
The term “direct label probe” (or “directly labeled probe”) refers to a nucleic acid probe whose label after hybrid formation with a target is detectable without further reactive processing of hybrid. The term “indirect label probe” (or “indirectly labeled probe”) refers to a nucleic acid probe whose label after hybrid formation with a target is further reacted in subsequent processing with one or more reagents to associate therewith one or more moieties that finally result in a detectable entity.
The terms “target,” “DNA target,” or “DNA target locus” refer to a nucleotide sequence that occurs at a specific chromosomal location. Each such sequence or portion is preferably at least partially single stranded (e.g., denatured) at the time of hybridization. When the target nucleotide sequences are located only in a single region or fraction of a given chromosome, the term “target region” is sometimes used. Targets for hybridization can be derived from specimens which include, but are not limited to, chromosomes or regions of chromosomes in normal, diseased or malignant human cells, either interphase or at any state of meiosis or mitosis, and either extracted or derived from living or postmortem tissues, organs or fluids; germinal cells including sperm and egg cells, or cells from zygotes, fetuses, or embryos, or chorionic or amniotic cells, or cells from any other germinating body; cells grown in vitro, from either long-term or short-term culture, and either normal, immortalized or transformed; inter- or intraspecific hybrids of different types of cells or differentiation states of these cells; individual chromosomes or portions of chromosomes, or translocated, deleted or other damaged chromosomes, isolated by any of a number of means known to those with skill in the art, including libraries of such chromosomes cloned and propagated in prokaryotic or other cloning vectors, or amplified in vitro by means well known to those with skill; or any forensic material, including but not limited to blood, or other samples.
As used herein, the terms “array”, “micro-array”, and “biochip” are used herein interchangeably. They refer to an arrangement, on a substrate surface, of hybridizable array elements, preferably, multiple nucleic acid molecules of known sequences. Each nucleic acid molecule is immobilized to a discrete spot (i.e., a defined location or assigned position) on the substrate surface. The term “micro-array” more specifically refers to an array that is miniaturized so as to require microscopic examination for visual evaluation.
The present invention will be further illustrated in the following Examples which are given for illustration purposes only and are not intended to limit the invention in any way.
Coronary artery disease (CAD) is a leading cause of disability and mortality worldwide (GBD 2015 Mortality and Causes of Death Collaborators, Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980-2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet 388, 1459-1544 (2016)). Genome-wide association studies (GWAS) have provided new clues to the pathophysiology for this common, complex disease. Largely using a case-control design with cases ascertained based on CAD status, published studies have highlighted at least 80 loci reaching genome-wide significance (Schunkert, H. et al., Nat Genet 43, 333-8 (2011); Deloukas, P. et al., Nat Genet 45, 25-33 (2013); CARDIoGRAMplusC4D Consortium. A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet 47, 1121-30 (2015); Myocardial Infarction Genetics and CARDIoGRAM Exome Consortia Investigators. Coding Variation in ANGPTL4, LPL, and SVEP1 and the Risk of Coronary Disease. N Engl J Med 374, 1134-44 (2016); Nioi, P. et al., N Engl J Med 374, 2131-41 (2016); Webb, T. R. et al., J Am Coll Cardiol 69, 823-836 (2017); Howson, J. M. M. et al., Nature Genetics (2017)).
Population-based biobanks such as UK Biobank offer new potential for genetic analysis of common complex diseases. New opportunities include scale, a diverse range of traits, and the ability to explore a fuller spectrum of phenotypic consequences for identified DNA variants. Leveraging the UK Biobank resource, Applicants sought to: 1) perform a genetic discovery analysis; 2) explore the phenotypic consequences and tissue-specific effects associated with CAD risk alleles; and 3) characterize the functional consequences of a risk mutation in a promising pathway.
Applicants designed a three-stage GWAS (
Characteristics of UK Biobank participants stratified by presence of CAD are presented in Table 1. CAD cases were more likely to be older, male, and on lipid-lowering therapy, to have a history of smoking, and to be affected with type 2 diabetes. After quality control, 9,061,845 DNA sequence variants were tested for association in 4,831 CAD patients and 115,455 controls in UK Biobank (Stage 1). A total of 269 variants at five distinct loci met the genome-wide significance threshold (P<5×10−8) (
After meta-analysis, 15 new loci exceeded genome-wide significance (Tables 3-4), bringing the total number of established CAD loci to 95. One of the 15 loci (HNF1A) has since been reported in Howson, J. M. M. et al., Nature Genetics (2017). Effect allele frequencies of the 15 newly identified loci ranged from 13% to 86%, with effect sizes ranging from 1.05 to 1.08. Descriptions of relevant loci appear in Table 5, and regional association plots for novel CAD loci are shown in
To move from these 15 DNA sequence variants to biologic insights, Applicants took two approaches: phenome-wide association scanning and functional analysis. Understanding the full spectrum of phenotypic consequences of a given DNA sequence variant may shed light on the mechanism by which a variant/gene leads to disease. Termed a ‘phenome-wide association study’ or “PheWAS”, this approach tests the association of a mapped disease variant with a broad range of human phenotypes (Denny, J. C. et al., Nat Biotechnol 31, 1102-10 (2013)). In collaboration with Genomics plc, Applicants conducted a PheWAS combining UK Biobank data, mRNA transcript phenotypes in the Genotype-Tissue Expression Project (GTEx) dataset (Aguet, F. et al. Local genetic effects on gene expression across 44 human tissues. bioRxiv (2016)), and an integrated set of GWAS results from a variety of publically available sources (Global Lipids Genetics Consortium et al., Nat Genet 45, 1274-83 (2013); Manning, A. K. et al., Nat Genet 44, 659-69 (2012); Prokopenko, I. et al., PLoS Genet 10, e1004235 (2014); Wood, A. R. et al., Nat Genet 46, 1173-86 (2014); Berndt, S. I. et al., Nat Genet 45, 501-12 (2013); Pattaro, C. et al., Nat Commun 7, 10023 (2016); Liu, J. Z. et al., Nat Genet 47, 979-86 (2015); Dastani, Z. et al., PLoS Genet 8, e1002607 (2012); Morris, A. P. et al., Nat Genet 44, 981-90 (2012)).
Applicants found that several of the newly identified DNA sequence variants correlated with a range of human traits (
Variant | Locus | Chr | Allele 1 | Allele 2 | Frequency | Beta | SE | P value | Trait | Source | Units
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
rs17517928 | FN1 | 2 | C | T | 0.75 | 0.016 | 0.003 | 1.19E−06 | Height | GIANT | Std Dev
rs2972146 | LOC646736 | 2 | T | G | 0.65 | 0.045 | 0.006 | 6.39E−14 | Fasting Insulin Adj BMI | MAGIC | Std Dev
rs2972146 | LOC646736 | 2 | T | G | 0.65 | −0.030 | 0.004 | 1.24E−11 | Body Fat Percentage | UK Biobank | Std Dev
rs2972146 | LOC646736 | 2 | T | G | 0.65 | −0.040 | 0.008 | 2.26E−06 | Adiponectin | ADIPOGen | Std Dev
rs2972146 | LOC646736 | 2 | T | G | 0.65 | 0.077 | 0.019 | 4.68E−05 | Type 2 Diabetes | DIAGAM | ln(OR)
rs2972146 | LOC646736 | 2 | T | G | 0.65 | −0.031 | 0.003 | 2.73E−20 | High Density Lipoprotein Cholesterol | GLGC | Std Dev
rs2972146 | LOC646736 | 2 | T | G | 0.65 | 0.028 | 0.003 | 1.41E−16 | Triglycerides | GLGC | Std Dev
rs17843797 | UMPS-ITGB5 | 3 | G | T | 0.13 | 0.029 | 0.006 | 2.94E−06 | Body Fat Percentage | UK Biobank | Std Dev
rs7623687 | RHOA | 3 | A | C | 0.86 | −0.115 | 0.024 | 2.30E−06 | Inflammatory Bowel Disease | IIBDGC | ln(OR)
rs10857147 | | 4 | T | A | 0.29 | 0.023 | 0.005 | 2.08E−06 | eGFRcrea | CKDGen | mL/min/1.73 m2
rs10857147 | | 4 | T | A | 0.29 | 0.866 | 0.091 | 1.90E−21 | Systolic BP | UK Biobank | mmHg
rs10857147 | | 4 | T | A | 0.29 | 0.491 | 0.051 | 4.93E−22 | Diastolic BP | UK Biobank | mmHg
rs10841443 | RP11-664H17.1 | 12 | G | C | 0.67 | 0.270 | 0.050 | 5.89E−08 | Diastolic BP | UK Biobank | mmHg
rs2244608 | HNF1A | 12 | G | A | 0.32 | 0.032 | 0.004 | 2.11E−20 | Low Density Lipoprotein Cholesterol | GLGC | Std Dev
rs2244608 | HNF1A | 12 | G | A | 0.32 | 0.028 | 0.003 | 2.71E−17 | Total Cholesterol | GLGC | Std Dev
rs11057401 | CCDC92 | 12 | T | A | 0.69 | −0.027 | 0.005 | 2.22E−09 | Body Fat Percentage | UK Biobank | Std Dev
rs11057401 | CCDC92 | 12 | T | A | 0.69 | 0.036 | 0.005 | 1.21E−15 | Waist Hip Ratio Adj BMI | UK Biobank | Std Dev
rs11057401 | CCDC92 | 12 | T | A | 0.69 | −0.052 | 0.009 | 2.24E−09 | Adiponectin | ADIPOGen | Std Dev
rs11057401 | CCDC92 | 12 | T | A | 0.69 | −0.028 | 0.005 | 1.03E−08 | High Density Lipoprotein Cholesterol | GLGC | Std Dev
rs11057401 | CCDC92 | 12 | T | A | 0.69 | 0.027 | 0.005 | 6.64E−08 | Triglycerides | GLGC | Std Dev
rs3851738 | CFDP1 | 16 | C | G | 0.6 | 0.016 | 0.003 | 1.80E−07 | Height | GIANT | Std Dev
rs3851738 | CFDP1 | 16 | C | G | 0.6 | 0.414 | 0.084 | 8.08E−07 | Systolic BP | UK Biobank | mmHg
rs7500448 | CDH13 | 16 | A | G | 0.75 | −0.050 | 0.010 | 6.57E−07 | Adiponectin | ADIPOGen | Std Dev
Compelling additional insights from the PheWAS emerged at the CCDC92 locus. Across 25 distinct traits and disorders, Applicants observed significant associations (P<0.00013) for CCDC92 p.Ser70Cys (rs11057401) with body fat percentage, waist-to-hip circumference ratio, as well as plasma high-density lipoprotein, triglyceride, and adiponectin levels. The directionality of these associations is a hallmark of insulin resistance and lipodystrophy (Manning, A. K. et al., Nat Genet 44, 659-69 (2012); Shungin, D. et al., Nature 518, 187-96 (2015)), and the association with plasma adiponectin levels localizes these genetic effects to adipose tissue. Recent work has highlighted two candidate genes at this locus, CCDC92 and DNAH10 (Lotta, L. A. et al., Nat Genet (2016)).
However, a few of the CAD loci (FN1, LOX, ITGB5, and ARHGEF26) did not associate with any of the studied risk factor traits and thus, appear to function through pathways beyond known CAD risk factors (
In aggregate, the analysis brings the total number of known CAD loci to 95 (Schunkert, H. et al., Nat Genet 43, 333-8 (2011); Deloukas, P. et al., Nat Genet 45, 25-33 (2013); CARDIoGRAMplusC4D Consortium. A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet 47, 1121-30 (2015); Myocardial Infarction Genetics and CARDIoGRAM Exome Consortia Investigators. Coding Variation in ANGPTL4, LPL, and SVEP1 and the Risk of Coronary Disease. N Engl J Med 374, 1134-44 (2016); Nioi, P. et al., N Engl J Med 374, 2131-41 (2016); Webb, T. R. et al., J Am Coll Cardiol 69, 823-836 (2017); Howson, J. M. M. et al., Nature Genetics (2017)), and in
At one of the new loci that did not relate to known risk factors, ARHGEF26 (encoding Rho Guanine Nucleotide Exchange Factor 26), Applicants performed functional studies. Prior experimental work had connected this gene with murine atherosclerosis (Samson, T. et al., PLoS One 8, e55202 (2013)). Earlier studies established a role for ARHGEF26 in facilitating the transendothelial migration of leukocytes, a key step in the initiation of atherosclerosis (van Rijssel, J. et al., Mol Biol Cell 23, 2831-44 (2012); van Buul, J. D. et al., J Cell Biol 178, 1279-93 (2007)). ARHGEF26 has been shown to activate RhoG GTPase by promoting the exchange of GDP by GTP and contributing to the formation of ICAM-1-induced endothelial docking structures that facilitate leukocyte transendothelial migration (van Rijssel, J. et al., Mol Biol Cell 23, 2831-44 (2012); van Buul, J. D. et al., J Cell Biol 178, 1279-93 (2007)). In addition, Arhgef26 −/− mice, when crossed with atherosclerosis-prone Apoe null mice, displayed less aortic atherosclerosis (Samson, T. et al., PLoS One 8, e55202 (2013)).
At ARHGEF26 p.Val29Leu (rs12493885), the 29Leu allele, observed in 85% of participants, is associated with increased risk for CAD. Applicants first examined the hypothesis that a haplotype block containing this variant may alter expression of ARHGEF26 in coronary artery. While this region demonstrates eQTL effects in a variety of tissues, there is no evidence of alteration of ARHGEF26 expression in coronary artery in both eQTL and allele specific expression analyses (
Next, Applicants examined whether ARHGEF26 p.Val29Leu may influence disease risk through its protein-altering consequence. Applicants knocked down endogenous ARHGEF26 through siRNA and observed decreased leukocyte transendothelial migration, leukocyte adhesion on endothelial cells, and vascular smooth muscle cell proliferation (Zahedi, F. et al., Cell Mol Life Sci (2016)) (
How could the ARHGEF26 29Leu mutation lead to a gain-of-function phenotype? Applicants evaluated its functional impact in two ways, addressing ARHGEF26 quality and quantity, respectively. First, could the 29Leu mutation alter ARHGEF26 nucleotide exchange activity on RhoG? To answer this question, Applicants developed a GTP-GDP nucleotide exchange assay using recombinant human full-length ARHGEF26 (wild-type or 29Leu) and RhoG proteins (Ellerbroek, S. M. et al., Mol Biol Cell 15, 3309-19 (2004)). In a cell-free system, equal amounts of wild-type or 29Leu ARHGEF26 protein were incubated with RhoG pre-loaded with GDP. After 60 minutes, Applicants observed no significant difference in nucleotide exchange activity between wild-type and 29Leu mutant ARHGEF26 (
Second, could the 29Leu allele affect cellular abundance of ARHGEF26 protein? Applicants examined this possibility by treating cells expressing wild-type or 29Leu mutant ARHGEF26 with cycloheximide, a protein synthesis inhibitor, and compared ARHGEF26 degradation over time by Western blotting. Compared to wild-type ARHGEF26, the 29Leu mutant protein displayed a longer half-life (
In summary, Applicants performed a gene discovery study for CAD using a large population-based biobank, identified 15 new loci, and explored the phenotypic consequences of CAD risk variants through PheWAS and in vitro functional analysis. These findings permit several conclusions. First, CAD cases phenotyped via electronic health records and verbal interviews exhibit similar genetic architecture to those derived in epidemiologic cohorts and can prove useful in gene discovery efforts. Second, phenome-wide association studies with risk variants can provide initial clues on how DNA sequence variants may lead to disease. Lastly, considerable experimental evidence in cells and rodents has suggested that transendothelial migration of leukocytes is a key step in the formation of atherosclerosis (Gerhardt, T. & Ley, K., Cardiovasc Res 107, 321-30 (2015)); here, Applicants provide human genetic support for a role of this pathway in CAD.
Applicants performed a three-stage sequential analysis to identify novel genetic loci associated with CAD. In Stage 1, Applicants first tested the association of DNA sequence variants with CAD in UK Biobank. Beginning in 2006, individuals aged 45 to 69 years old were recruited from across the United Kingdom for participation in the UK Biobank Study (Collins, R. What makes UK Biobank special? The Lancet 379, 1173-1174 (2012)). At enrollment, a trained healthcare provider ascertained participants' medical histories through verbal interview. In addition, participants' electronic health records (EHR) including inpatient International Classification of Disease (ICD-10) diagnosis codes and Office of Population and Censuses Surveys (OPCS-4) procedure codes, were integrated into UK Biobank. Individuals were defined as having CAD based on at least one of the following criteria:
All other individuals were defined as controls. In total, genotypes were available for 120,286 participants of European ancestry.
In Stage 2, Applicants took forward 2,190 variants that reached nominal significance in Stage 1 for meta-analysis in the Coronary ARtery DIsease Genome wide Replication and Meta-analysis (CARDIoGRAM) Exome Consortia exome array analysis, which incorporated 42,355 cases and 78,240 controls (Table 8). In Stage 3, Applicants took forward 387,174 variants that reached nominal significance in Stage 1 (and were not available in Stage 2) for meta-analysis into the CARDIoGRAMplusC4D 1000 Genomes imputation study containing 60,801 cases and 123,504 controls (http://www.cardiogramplusc4d.org/). Informed consent was obtained for all participants, and UK Biobank received ethical approval from the Research Ethics Committee (reference number 11/NW/0382). Our study was approved by a local Institutional Review Board at Partners Healthcare (protocol 2013P001840).
UK Biobank samples were genotyped using either the UK BiLEVE (Wain, L. V. et al., Lancet Respir. Med. 3, 769-781 (2015)) or UK Biobank Axiom arrays, with genotyping performed in 33 separate batches of samples by Affymetrix (High Wycombe, UK). A total of 806,466 directly genotyped DNA sequence variants were available after variant quality control (QC). The UK Biobank team then performed imputation from a combined 1000 Genomes/UK10K reference panel; phasing was performed using SHAPEIT-3 and imputation carried out via IMPUTE3. Variant level QC exclusion metrics applied to imputed data for GWAS included: call rate<95%, Hardy-Weinberg Equilibrium P-value<1×10−6, posterior call probability<0.9, imputation quality<0.4, and minor allele frequency (MAF)<0.005. Sex chromosome and mitochondrial genetic data were excluded from this analysis. In total, 9,061,845 imputed DNA sequence variants were included in our analysis. For sample QC, the UK Biobank analysis team removed individuals of relatedness 3rd degree or higher, and an additional 480 samples with an excess of missing genotype calls or more heterozygosity than expected were excluded. In total, genotypes were available for 120,286 participants of European ancestry.
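The variant-level exclusion thresholds listed above can be expressed as a simple filter. The following is a minimal sketch, assuming the five metrics have already been computed for each variant; the VariantStats container and the function name are hypothetical, and this is not the pipeline actually used by the UK Biobank team or by Applicants.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Per-variant QC metrics (hypothetical container for illustration)."""
    call_rate: float            # fraction of samples with a genotype call
    hwe_p: float                # Hardy-Weinberg Equilibrium P-value
    posterior_call_prob: float  # posterior call probability
    imputation_quality: float   # imputation quality score
    maf: float                  # minor allele frequency


def passes_variant_qc(v: VariantStats) -> bool:
    """Keep a variant only if it triggers none of the exclusion criteria in the text."""
    return (
        v.call_rate >= 0.95
        and v.hwe_p >= 1e-6
        and v.posterior_call_prob >= 0.9
        and v.imputation_quality >= 0.4
        and v.maf >= 0.005
    )


kept = passes_variant_qc(VariantStats(0.99, 0.2, 0.95, 0.8, 0.12))
dropped = passes_variant_qc(VariantStats(0.99, 0.2, 0.95, 0.8, 0.001))
print(kept, dropped)
```

Each threshold is the complement of an exclusion rule in the text, so a variant exactly at a threshold (for example, MAF of 0.005) is retained.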
The BOLT-LMM software (Loh, P. R. et al., Nat Genet 47, 284-90 (2015)) was used to perform linear mixed models (LMMs) for association testing. CAD case status was analyzed while adjusting for age, gender, and chip array at run-time. This analysis was used to derive statistical significance. As effect estimates from BOLT-LMM software are unreliable due to the treatment of binary phenotype data as quantitative data, Applicants performed logistic regression to derive effect estimates for each variant that exceeded genome-wide significance. Effect estimates of top variants were derived from logistic regression using allelic dosages adjusting for age, sex, chip at run-time, and ten principal components under the assumption of additive effects utilizing the R v3.2.0 (www.R-project.org) and SNPTEST (mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html) statistical software programs.
Stage 2 and 3 Meta-Analysis
In stage 2, top variants (P<0.05) from UK Biobank were then meta-analyzed with exome chip data from the CARDIoGRAM Exome Consortium (Myocardial Infarction Genetics and CARDIoGRAM Exome Consortia Investigators. Coding Variation in ANGPTL4, LPL, and SVEP1 and the Risk of Coronary Disease. N Engl J Med 374, 1134-44 (2016)). Tested variants in the CARDIoGRAM exome array study were analyzed through logistic regression with an additive model adjusting for study specific covariates and principal components of ancestry as appropriate. Top variants from UK Biobank that were not available for analysis in the CARDIoGRAM exome array study were then meta-analyzed with data from the 1000 Genomes imputed CARDIoGRAMplusC4D GWAS (CARDIoGRAMplusC4D Consortium. A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet 47, 1121-30 (2015)) in Stage 3.
Given differences in effect size units between the UK Biobank Stage 1 data and the CARDIoGRAM Exome/1000 Genomes CARDIoGRAMplusC4D data, both Stage 2 and 3 meta-analyses were performed via a weighted z-score method, adjusting for an unbalanced ratio of cases to controls. To derive effect size estimates for variants exceeding genome-wide significance, Applicants meta-analyzed logistic regression results using inverse-variance weighting with fixed effects (METAL software) (Willer et al., Bioinformatics 26, 2190-1 (2010)). Applicants set a combined statistical threshold of P<5×10−8 for genome wide significance. P values reported in analysis Stages 1, 2, and 3 are all two-sided.
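The two meta-analysis schemes named above can be sketched as follows. This is an illustrative sketch rather than the METAL implementation: it assumes the common convention of weighting each study's z-score by the square root of an effective sample size, N_eff = 4/(1/N_cases + 1/N_controls), to account for case-control imbalance. The case and control counts are those reported in the text for Stage 1 and Stage 2; the z-scores, effect estimates, and standard errors are hypothetical.

```python
import math
from typing import Sequence, Tuple


def effective_n(cases: int, controls: int) -> float:
    """Effective sample size for an unbalanced case-control study (assumed convention)."""
    return 4.0 / (1.0 / cases + 1.0 / controls)


def weighted_z(z_scores: Sequence[float], n_eff: Sequence[float]) -> float:
    """Combine per-study z-scores with weights equal to sqrt(effective sample size)."""
    weights = [math.sqrt(n) for n in n_eff]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def inverse_variance_fixed(betas: Sequence[float],
                           ses: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effects inverse-variance weighted estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return beta, math.sqrt(1.0 / sum(weights))


# Sample sizes from the text; z-scores are hypothetical.
n_stage1 = effective_n(4831, 115455)
n_stage2 = effective_n(42355, 78240)
z_combined = weighted_z([3.1, 4.8], [n_stage1, n_stage2])
print(z_combined, two_sided_p(z_combined))

# Hypothetical log-odds-ratio estimates and standard errors.
print(inverse_variance_fixed([0.06, 0.05], [0.02, 0.01]))
```

The z-score scheme needs only the direction and significance from each study, which is why it suits studies reporting effects in different units; the inverse-variance scheme requires effect estimates on a common scale.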
For all 15 novel DNA sequence variants associated with CAD in our study, Applicants collaborated with Genomics plc to conduct a phenome-wide association study. This PheWAS used the Genomics plc Platform, UK Biobank, and GTEx Consortium eQTL data. The Genomics plc Platform includes PheWAS data across 545 distinct molecular and disease phenotypes, at an integrated set of over 14 million common variants, from 677 GWAS studies. UK Biobank analyses within the Genomics plc Platform were conducted under a separate research agreement. Applicants selected 25 phenotypes across a range of relevant diseases, metabolic and anthropometric traits from either previously published GWAS datasets or UK Biobank. Complete details of phenotype definitions, sample sizes, and GWAS data sources are shown in Tables 9 and 10. In the PheWAS, quantitative traits were standardized to have unit variance, imputation was performed to generate results for all variants within the 1000 Genomes reference panel, and P values were recalculated based on a Wald test statistic for uniformity.
Phenotypes were declared to be significantly associated with the risk variant if they met a Bonferroni corrected P value of <0.00013 [0.05/(25 traits×15 DNA sequence variants)]. Phenome scan results were then depicted in a heatmap based on the Z-scores for all variant-disease/trait associations aligned to the CAD risk allele as implemented by the gplots package (https://cran.r-project.org/web/packages/gplots/gplots.pdf) in R v3.2.0. To identify loci that might influence gene expression, Applicants used previously published cis-expression quantitative trait locus (eQTL) mapping data from the Genotype-Tissue Expression (GTEx) Consortium Project across 44 tissues. Applicants queried the 15 novel variants identified in our study for overlap with genome-wide significant variant-gene pairs from the GTEx portal (gtexportal.org).
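The Bonferroni-corrected threshold stated above follows directly from the number of tests. A short worked check, using only the figures given in the text (25 traits, 15 variants, and a family-wise alpha of 0.05); the variable and function names are illustrative:

```python
ALPHA = 0.05
N_TRAITS = 25
N_VARIANTS = 15

n_tests = N_TRAITS * N_VARIANTS   # 375 variant-trait tests
threshold = ALPHA / n_tests       # approximately 0.00013


def is_significant(p_value: float) -> bool:
    """Declare an association significant if P is below the corrected threshold."""
    return p_value < threshold


print(n_tests, threshold)
# P value reported in the text for rs11057401 and body fat percentage
print(is_significant(2.22e-09))
```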
Allele-specific expression (ASE) data from the GTEx project were obtained from dbGaP (accession phs000424.v6.p1). The generation of these data is summarized in Aguet et al., and relied on methods described earlier. In brief, only uniquely mapping reads with base quality>10 at the SNP were counted, and only SNPs with coverage of at least 8 reads were reported. For ARHGEF26 p.Val29Leu, ASE counts were available for 20 heterozygous individuals. A two-sided binomial test was used to identify SNPs with significant allelic imbalance in each individual, and Benjamini-Hochberg adjusted p-values were calculated across all sites measured in an individual.
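The allelic imbalance test described above can be sketched with the standard library alone. This sketch assumes a null hypothesis of equal expression from the two alleles (probability 0.5 for a reference read) and applies the coverage cutoff of 8 reads from the text; the function names are hypothetical, and this is not the GTEx pipeline itself.

```python
from math import comb
from typing import List, Optional

MIN_COVERAGE = 8  # SNPs with fewer reads are not reported, per the text


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    that are no more likely than the observed reference read count."""
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-7)))


def ase_p_value(ref_reads: int, alt_reads: int) -> Optional[float]:
    """Return the test P value, or None when coverage is below the cutoff."""
    if ref_reads + alt_reads < MIN_COVERAGE:
        return None
    return binomial_two_sided_p(ref_reads, alt_reads)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical read counts at three heterozygous sites in one individual.
sites = [(8, 0), (10, 9), (30, 12)]
raw = [ase_p_value(ref, alt) for ref, alt in sites]
print(raw)
print(benjamini_hochberg(raw))
```

For very deep coverage the exact probabilities may underflow, in which case a library routine for the binomial test would be a better choice.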
HUVEC heterozygous for rs12493885 were identified from Caucasian donors by SNP genotyping. A 2.9 kb genomic fragment spanning from 5′ upstream of ARHGEF26 to exon 2 (rs12493885) was cloned into a pMiniT 2.0 vector (NEB) using the heterozygous HUVEC genomic DNA as a template, and sequenced for reference and alternative alleles. The −2516 to +2 reference and alternative haplotypes upstream of ARHGEF26 (NC_000003.12:154119477-154121994) were amplified from the 2.9 kb region by PCR with primers designed to create 5′ NheI and 3′ HindIII restriction sites in the PCR products. The amplified fragments were subcloned between the NheI and HindIII sites of a promoterless firefly luciferase (luc2) expression vector pGL4.10 (Promega), to create two plasmids: pGL4.10-Ref and pGL4.10-Alt. Promoterless pGL4.10-control, and pGL4.73[hRluc/SV40] vector containing the renilla luciferase hRluc reporter gene and an SV40 early enhancer/promoter, were used as negative control and co-reporter, respectively. Cells were cotransfected with equal amounts of luc2 expression plasmid (pGL4.10-control, pGL4.10-Ref and pGL4.10-Alt) and pGL4.73 vector by Lipofectamine 2000. Cells were harvested at 48 h after transfection and followed by a Dual-Glo Luciferase Assay (Promega) to measure firefly and renilla luciferase activities. The firefly luciferase activity was normalized to renilla luciferase in the same sample, and expressed as fold change relative to pGL4.10-control group.
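The normalization described above (firefly signal divided by renilla signal in the same sample, then expressed relative to the promoterless control) can be sketched as follows. The plasmid names come from the text; the function names and the luminescence readings are hypothetical.

```python
from typing import Dict, Tuple


def normalized_ratio(firefly: float, renilla: float) -> float:
    """Firefly luciferase activity normalized to the renilla co-reporter."""
    if renilla <= 0:
        raise ValueError("renilla signal must be positive")
    return firefly / renilla


def fold_change_vs_control(readings: Dict[str, Tuple[float, float]],
                           control: str = "pGL4.10-control") -> Dict[str, float]:
    """Express each normalized ratio as fold change relative to the control plasmid."""
    ratios = {name: normalized_ratio(f, r) for name, (f, r) in readings.items()}
    baseline = ratios[control]
    return {name: ratio / baseline for name, ratio in ratios.items()}


# Hypothetical (firefly, renilla) luminescence readings.
readings = {
    "pGL4.10-control": (100.0, 1000.0),
    "pGL4.10-Ref": (5000.0, 1000.0),
    "pGL4.10-Alt": (6000.0, 1200.0),
}
fold = fold_change_vs_control(readings)
print(fold)
```

Dividing by the renilla signal corrects for differences in transfection efficiency between wells before the reference and alternative haplotypes are compared.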
Human full-length ARHGEF26 (wild-type or 29Leu) and RhoG (residues 1-188) proteins, both with N-terminal His-SUMO tags, were expressed in E. coli BL21(DE3) cells in TB media. Nucleotide exchange assay samples were prepared in buffer containing 10 mM HEPES pH 7.4, 150 mM NaCl, 1 mM MgCl2, 0.5 μM MANT-GTP, and 2 mM TCEP with 1 μM ARHGEF26. Just prior to reading, RhoG protein, pre-loaded with GDP, was added to a final concentration of 0.4 μM. MANT-GTP fluorescence was monitored for 60 minutes on a SpectraMax M2 at 37° C. using an excitation wavelength of 280 nm and an emission wavelength of 440 nm with a 435 nm cutoff. Fluorescence data were imported into GraphPad Prism for analysis.
Functional Characterization of ARHGEF26 p.Val29Leu in Arterial Tissue
To investigate the functional effects of ARHGEF26 p.Val29Leu (rs12493885), Applicants knocked down the expression of endogenous ARHGEF26 in cultured human aortic endothelial cells (HAEC) and human coronary artery smooth muscle cells (HCASMC) by RNA interference. Applicants then overexpressed siRNA-resistant wild-type or mutant (29Leu) ARHGEF26, and measured leukocyte transendothelial migration, leukocyte adhesion on endothelial cells, and HCASMC proliferation in vitro. Applicants also evaluated the degradation of wild-type or 29Leu mutant ARHGEF26 with a cycloheximide chase assay and Western blotting.
Cell Culture
Human Aortic Endothelial Cells (HAEC), Human Umbilical Vein Endothelial Cells (HUVEC), and Human Coronary Artery Smooth Muscle Cells (HCASMC) were purchased from Lifeline Cell Technology and maintained in VascuLife EnGS Endothelial Medium and SMC Medium (Lifeline Cell Technology) free of antibiotics at 37° C. and 5% CO2. HAEC, HUVEC, and HCASMC at passages 2-6 were used for experiments. The HL60 cell line was purchased from Sigma-Aldrich. HEK293 and THP-1 cell lines were purchased from ATCC. HEK293 cells were maintained in high-glucose Dulbecco's Modified Eagle Medium with GlutaMAX Supplement and 10% fetal bovine serum (Thermo Fisher Scientific). HL60 and THP-1 cells were maintained in RPMI 1640 Medium supplemented with 10% non-heat-inactivated fetal bovine serum (Thermo Fisher Scientific). HL60 cells were differentiated for 5 days in medium containing 1.3% DMSO for leukocyte TEM assays. Cell line specificity was confirmed with tissue-specific markers: HAEC were von Willebrand Factor positive and smooth muscle α-actin negative, whereas HCASMC were von Willebrand Factor negative and smooth muscle α-actin positive. Both cell types were confirmed to be mycoplasma negative.
siRNA and ARHGEF26 Constructs
Silencer Select siRNA against the 3′UTR of human ARHGEF26 was custom-ordered from Thermo Fisher Scientific. Targeting efficiency of the siRNA was confirmed by Western blot of transfected cells. Non-targeting siRNA control was purchased from Thermo Fisher Scientific. The cDNA containing the complete open-reading frame of human ARHGEF26 (NM_015595.3) was obtained from the Mammalian Gene Collection (MGC) and cloned with an N-terminal FLAG-GGGS sequence into a pcDNA3.4 mammalian expression vector (Thermo Fisher Scientific) using NEBuilder HiFi DNA Assembly Master Mix (NEB). Wild-type ARHGEF26 and the 29Leu mutant were generated by site-directed mutagenesis (Q5 kit, NEB) and Sanger-sequenced. Vector without the FLAG-GGGS-ARHGEF26 insert was used as the control vector.
Transfection
HAEC and HCASMC were transfected in 6-well format using Lipofectamine 2000 Transfection Reagent (Invitrogen) following the manufacturer's protocol. Briefly, cells were plated at 90% confluency the day prior to transfection. Cells were then washed and replenished with Opti-MEM I Reduced Serum Medium. Per well, cells were co-transfected with 50 nM siRNA and 1 μg/mL ARHGEF26 vector (final concentrations). Medium was replaced at 4 hours post-transfection. Cells were trypsinized and re-plated one day after transfection (HAEC), or re-plated and starved in serum-free medium (HCASMC).
Leukocyte TEM Assay
The leukocyte TEM assay was modified from a previously described protocol (van Buul, J. D. et al., J Cell Biol 178, 1279-93 (2007)). HAEC were plated on an HTS Transwell 96-well permeable insert with 5.0 μm pore size (Corning) in 40 μL/well medium and allowed to settle for 8 hours. The medium in the transwell was then replaced with complete medium containing 10 ng/mL TNF-α (PeproTech), and the cells were cultured overnight. The next day, 235 μL/well serum-free endothelial cell medium containing 0.25% BSA with vehicle or 50 ng/mL SDF-1 (PeproTech) was placed on a 96-well white receiver plate. The medium in the transwell insert was removed and replaced with 75 μL/well serum-free endothelial cell medium containing 0.25% BSA and 200,000 differentiated HL60 cells. The insert was then gently placed in the receiver plate and incubated at 37° C. for 5 hours with the lid on. The insert was removed, and HL60 cells that had migrated into the receiver plate were quantified with a luminescent assay (CellTiter-Glo, Promega). A standard curve of HL60 cells was prepared by serial dilution on an identical white receiver plate, with the total HL60 cell input set as 100%. Differences in means of percentage of migrated cells per well were assessed by two-way ANOVA with uncorrected Fisher's LSD test within vehicle and SDF-1 subgroups, respectively, with the significance threshold set at P<0.05.
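The conversion from luminescence to percentage of migrated cells can be sketched as below. The text states only that a serial-dilution standard curve was used with total input set as 100%; the linear least-squares fit and the function names here are assumptions made for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def percent_migrated(luminescence, slope, intercept):
    """Invert the standard curve: luminescence -> percent of total HL60 input."""
    return (luminescence - intercept) / slope
```

Here `x` would hold the standard-curve dilutions as a percentage of the 200,000-cell input and `y` the matching luminescence readings; each receiver-plate well is then passed through `percent_migrated`.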
Leukocyte Adhesion Assay
HAEC were transfected and re-plated on a black-wall, clear-bottom 96-well plate and cultured until 100% confluence (48-72 hours post-transfection). Prior to the assay, HAEC were treated with 10 ng/mL TNF-α overnight. THP-1 cells were labeled with Calcein-AM cell-permeant dye (Thermo Fisher Scientific), washed, and added to wells containing HAEC at 200,000/well in serum-free medium containing 0.25% BSA, and incubated at 37° C. for 1 hour. The wells were washed four times in 37° C. PBS. After the final wash, the plate was drained thoroughly and 100 μL TBS buffer containing 1% NP-40 was added to each well. The plate was agitated for 10 min protected from light, and the fluorescence was measured on a plate reader. A standard curve was generated on an identical, separate plate. Differences in means of fluorescent intensity were assessed by one-way ANOVA with Dunnett's multiple comparisons test, with a multiplicity-adjusted P value below 0.05 considered statistically significant.
VSMC Proliferation
HCASMC were transfected and re-plated on a 96-well plate in serum-free medium and starved. After 48 hours, the medium was replaced with medium containing serum and cells were allowed to proliferate for 72 hours. To measure cell proliferation, the medium was removed and cell numbers in each well were quantified with a luminescent assay (CellTiter-Glo, Promega). Differences in means of luminescence were assessed by one-way ANOVA with Dunnett's multiple comparisons test, with a multiplicity-adjusted P value below 0.05 considered statistically significant.
Western Blot
Cells were harvested with lysis buffer (150 mM NaCl, 50 mM Tris HCl, 0.5% NP-40 and 0.1% sodium deoxycholate, pH 7.5) supplemented with fresh protease inhibitors (Pierce Protease Inhibitor Mini Tablet, EDTA free). Cell lysate was incubated for 15 min with rotation and centrifuged at 20,000 g for 15 min at 4° C. to remove insoluble materials. The protein concentration in the supernatant was measured by a bicinchoninic acid (BCA) assay kit (Thermo Fisher Scientific) and normalized with Laemmli sample buffer. Equal amounts of protein were separated by sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) on 4-20% Mini-PROTEAN TGX precast gels (Bio-Rad Laboratories), transferred to nitrocellulose membrane, and blocked with 5% non-fat milk in Tris-buffered saline supplemented with 0.05% Tween-20 (TBST) at room temperature for 1 hour. The membrane was then probed with primary antibodies to ARHGEF26 (Sigma-Aldrich), FLAG (M2 HRP-conjugated, Sigma-Aldrich), or actin (HRP-conjugated, Santa Cruz Biotechnology), respectively, in 1% non-fat milk in TBST. For ARHGEF26 blots, the membrane was then incubated with an HRP-conjugated anti-rabbit secondary antibody at room temperature for 1 hour. After extensive washing, the membranes were developed with an enhanced chemiluminescence substrate (EMD Millipore) and imaged on an Amersham Imager 600 (GE Healthcare).
Cycloheximide Chase Assay
FLAG-tagged WT or 29Leu ARHGEF26 was overexpressed in HEK293 cells for 48 hours. One day prior to the cycloheximide chase, WT and 29Leu ARHGEF26-transfected cells (12 wells each) were plated on the same 24-well plate at 150,000 cells per well in 500 μL medium. For the cycloheximide chase, 500 μL medium containing 100 μg/mL or 200 μg/mL cycloheximide (Enzo Life Sciences) was added to each well to achieve a final concentration of 50 μg/mL or 100 μg/mL. Cells were harvested in lysis buffer at the indicated time points post chase, and BCA-normalized lysates (20 μg per time point) were probed for FLAG by Western blot. For each cycloheximide dose, 2 blot sections (WT and 29Leu) from the same treated plate were blotted on the same membrane and simultaneously imaged.
Stage 2 and Stage 3 data contributed by CARDIoGRAM Exome and CARDIoGRAMplusC4D investigators are available at www.CARDIOGRAMPLUSC4D.ORG.
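The final cycloheximide concentrations quoted above follow from a simple dilution: an equal volume of medium is added to each well, so the concentration is halved. A small sketch, with a hypothetical helper name:

```python
def final_concentration(added_conc, added_volume, existing_volume):
    """Concentration after mixing a solution into drug-free medium.
    Any consistent units work (here micrograms/mL and microliters)."""
    return added_conc * added_volume / (added_volume + existing_volume)

low_dose = final_concentration(100, 500, 500)   # 100 ug/mL into 500 uL -> 50 ug/mL
high_dose = final_concentration(200, 500, 500)  # 200 ug/mL into 500 uL -> 100 ug/mL
```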
The genetic and phenotypic UK Biobank data are available upon application to the UK Biobank (www.ukbiobank.ac.uk/).
Both genetic and lifestyle factors are key drivers of coronary artery disease, a complex disorder that is the leading cause of death worldwide. (Lozano R, Naghavi M, Foreman K, et al. Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: a systematic analysis for the Global Burden of Disease Study 2010. Lancet 2012; 380:2095-2128). A familial pattern in the risk of coronary artery disease was first described in 1938 and was subsequently confirmed in large studies involving twins and prospective cohorts. (Müller C. Xanthomata, hypercholesterolemia, angina pectoris. Acta Med Scand 1938; 89:75-84; Gertler M M, Garn S M, White P D. Young candidates for coronary heart disease. J Am Med Assoc 1951; 147:621-625; Slack J, Evans K A. The increased risk of death from ischaemic heart disease in first degree relatives of 121 men and 96 women with ischaemic heart disease. J Med Genet 1966; 3:239-257; Marenberg M E, Risch N, Berkman L F, Floderus B, de Faire U. Genetic susceptibility to death from coronary heart disease in a study of twins. N Engl J Med 1994; 330:1041-1046; Lloyd-Jones D M, Nam B H, D'Agostino R B Sr, et al. Parental cardiovascular disease as a risk factor for cardiovascular disease in middle-aged adults: a prospective study of parents and offspring. JAMA 2004; 291:2204-2211). Since 2007, genomewide association analyses have identified more than 50 independent loci associated with the risk of coronary artery disease. (Samani N J, Erdmann J, Hall A S, et al. Genomewide association analysis of coronary artery disease. N Engl J Med 2007; 357:443-453; Helgadottir A, Thorleifsson G, Manolescu A, et al. A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science 2007; 316:1491-1493; McPherson R, Pertsemlidis A, Kavaslar N, et al. A common allele on chromosome 9 associated with coronary heart disease. 
Science 2007; 316:1488-1491; Myocardial Infarction Genetics Consortium. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nat Genet 2009; 41:334-341; Erdmann J, Grosshennig A, Braund P S, et al. New susceptibility locus for coronary artery disease on chromosome 3q22.3. Nat Genet 2009; 41:280-282; Coronary Artery Disease (C4D) Genetics Consortium. A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nat Genet 2011; 43:339-344; IBC 50K CAD Consortium. Large-scale gene-centric analysis identifies novel variants for coronary artery disease. PLoS Genet 2011; 7:e1002260-e1002260; The CARDIoGRAMplusC4D Consortium. Large-scale association analysis identifies new risk loci for coronary artery disease. Nat Genet 2013; 45:25-33; Nikpay M, Goel A, Won H H, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet 2015; 47:1121-1130). These risk alleles, when aggregated into a polygenic risk score, are predictive of incident coronary events and provide a continuous and quantitative measure of genetic susceptibility. (Kathiresan S, Melander O, Anevski D, et al. Polymorphisms associated with cholesterol and risk of cardiovascular events. N Engl J Med 2008; 358:1240-1249; Ripatti S, Tikkanen E, Orho-Melander M, et al. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. Lancet 2010; 376:1393-1400; Paynter N P, Chasman D I, Pare G, et al. Association between a literature-based genetic risk score and cardiovascular events in women. JAMA 2010; 303:631-637; Thanassoulis G, Peloso G M, Pencina M J, et al. A genetic risk score is associated with incident cardiovascular disease and coronary artery calcium: the Framingham Heart Study. Circ Cardiovasc Genet 2012; 5:113-121; Brautbar A, Pompeii L A, Dehghan A, et al. 
A genetic risk score based on direct associations with coronary heart disease improves coronary heart disease risk prediction in the Atherosclerosis Risk in Communities (ARIC), but not in the Rotterdam and Framingham Offspring, Studies. Atherosclerosis 2012; 223:421-426; Ganna A, Magnusson P K, Pedersen N L, et al. Multilocus genetic risk scores for coronary heart disease prediction. Arterioscler Thromb Vasc Biol 2013; 33:2267-2272; Mega J L, Stitziel N O, Smith J G, et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy: an analysis of primary and secondary prevention trials. Lancet 2015; 385:2264-2271; Tada H, Melander O, Louie J Z, et al. Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur Heart J 2016; 37:561-567; Abraham G, Havulinna A S, Bhalala O G, et al. Genomic prediction of coronary heart disease. Eur Heart J. 2016 Nov. 14; 37(43):3267-3278).
Much evidence has also shown that persons who adhere to a healthy lifestyle have markedly reduced rates of incident cardiovascular events. (Stampfer M J, Hu F B, Manson J E, Rimm E B, Willett W C. Primary prevention of coronary heart disease in women through diet and lifestyle. N Engl J Med 2000; 343:16-22; Folsom A R, Yatsuya H, Nettleton J A, Lutsey P L, Cushman M, Rosamond W D. Community prevalence of ideal cardiovascular health, by the American Heart Association definition, and relationship with cardiovascular disease incidence. J Am Coll Cardiol 2011; 57:1690-1696; Yang Q, Cogswell M E, Flanders W D, et al. Trends in cardiovascular health metrics and associations with all-cause and CVD mortality among US adults. JAMA 2012; 307:1273-1283; Xanthakis V, Enserro D M, Murabito J M, et al. Ideal cardiovascular health: associations with biomarkers and subclinical disease and impact on incidence of cardiovascular disease in the Framingham Offspring Study. Circulation 2014; 130:1676-1683; Chomistek A K, Chiuve S E, Eliassen A H, Mukamal K J, Willett W C, Rimm E B. Healthy lifestyle in the primordial prevention of cardiovascular disease among young women. J Am Coll Cardiol 2015; 65:43-51; Akesson A, Larsson S C, Discacciati A, Wolk A. Low-risk diet and lifestyle habits in the primary prevention of myocardial infarction in men: a population-based prospective cohort study. J Am Coll Cardiol 2014; 64:1299-1306). The promotion of healthy lifestyle behaviors, which include not smoking, avoiding obesity, regular physical activity, and a healthy diet pattern, underlies the current strategy to improve cardiovascular health in the general population. (Lloyd-Jones D M, Hong Y, Labarthe D, et al. Defining and setting national goals for cardiovascular health promotion and disease reduction: the American Heart Association's strategic Impact Goal through 2020 and beyond. Circulation 2010; 121:586-613).
Many observers assume that a genetic predisposition to coronary artery disease is deterministic. (White P D. Genes, the heart and destiny. N Engl J Med 1957; 256:965-969). However, genetic risk might be attenuated by a favorable lifestyle. Here, we analyzed data for participants in three prospective cohorts and one cross-sectional study to test the hypothesis that both genetic factors and baseline adherence to a healthy lifestyle contribute independently to the risk of incident coronary events and the prevalent subclinical burden of atherosclerosis. We then determined the extent to which a healthy lifestyle is associated with a reduced risk of coronary artery disease among participants with a high genetic risk.
The Atherosclerosis Risk in Communities (ARIC) study is a prospective cohort that enrolled white participants and black participants between the ages of 45 and 64 years, starting in 1987. (The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. Am J Epidemiol 1989; 129:687-702). For data from this study, we retrieved genotype and clinical data from the National Center for Biotechnology Information dbGaP server (accession number, phs000280.v3.p1). The Women's Genome Health Study (WGHS) is a prospective cohort of female health professionals derived from the Women's Health Study, a clinical trial initiated in 1992 to evaluate the efficacy of aspirin and vitamin E in the primary prevention of cardiovascular disease. (Ridker P M, Chasman D I, Zee R Y, et al. Rationale, design, and methodology of the Women's Genome Health Study: a genome-wide association study of more than 25,000 initially healthy American women. Clin Chem 2008; 54:249-255). The Malmo Diet and Cancer Study (MDCS) is a prospective cohort that enrolled participants between the ages of 44 and 73 years in Malmo, Sweden, starting in 1991. (Berglund G, Elmstahl S, Janzon L, Larsson S A. The Malmo Diet and Cancer Study: design and feasibility. J Intern Med 1993; 233:45-51). In this study, participants with prevalent coronary disease at baseline were excluded. The BioImage Study enrolled asymptomatic participants between the ages of 55 and 80 years who were at risk for cardiovascular disease, beginning in 2008. This study included quantification of subclinical coronary artery disease in Agatston units, a metric that combines the area and density of observed coronary-artery calcification. (Baber U, Mehran R, Sartori S, et al. Prevalence, impact, and predictive value of detecting subclinical coronary and carotid atherosclerosis in asymptomatic adults: the BioImage study. J Am Coll Cardiol 2015; 65:1065-1074).
We derived a polygenic risk score from an analysis of up to 50 single-nucleotide polymorphisms (SNPs) that had achieved genomewide significance for association with coronary artery disease in previous studies. Details regarding the cohort-specific genotyping platform and risk scores are provided in Table S1 in the Supplementary Appendix, available with the full text of this article at NEJM.org. (Erdmann J, Grosshennig A, Braund P S, et al. New susceptibility locus for coronary artery disease on chromosome 3q22.3. Nat Genet 2009; 41:280-282; Coronary Artery Disease (C4D) Genetics Consortium. A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nat Genet 2011; 43:339-344; IBC 50K CAD Consortium. Large-scale gene-centric analysis identifies novel variants for coronary artery disease. PLoS Genet 2011; 7:e1002260-e1002260; The CARDIoGRAMplusC4D Consortium. Large-scale association analysis identifies new risk loci for coronary artery disease. Nat Genet 2013; 45:25-33). An example of the calculation of the polygenic risk score is provided in Table S2. Individual participant scores were created by adding up the number of risk alleles at each SNP and then multiplying the sum by the literature-based effect size. (Ripatti S, Tikkanen E, Orho-Melander M, et al. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. Lancet 2010; 376:1393-1400). The genetic substructure of the population was assessed by calculating the principal components of ancestry. (Price A L, Patterson N J, Plenge R M, Weinblatt M E, Shadick N A, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006; 38:904-909).
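A weighted score of this kind can be sketched as follows. This sketch assumes the common reading in which the risk-allele count at each SNP (0, 1, or 2) is multiplied by that SNP's literature-based effect size and the products are summed; the SNP identifiers, weights, and function name are hypothetical and are not taken from Table S1 or Table S2.

```python
def polygenic_score(risk_allele_counts, effect_sizes):
    """Sum over SNPs of (number of risk alleles) x (per-SNP effect size).

    risk_allele_counts: dict mapping SNP id -> 0, 1, or 2
    effect_sizes: dict mapping SNP id -> literature-based weight
                  (assumed here to be on the log odds ratio scale)
    Only SNPs with a weight are scored; a missing genotype raises KeyError.
    """
    return sum(risk_allele_counts[snp] * beta for snp, beta in effect_sizes.items())

# Hypothetical three-SNP example
weights = {"snp_a": 0.10, "snp_b": 0.20, "snp_c": 0.30}
genotype = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
score = polygenic_score(genotype, weights)  # 2*0.10 + 1*0.20 + 0*0.30
```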
We adapted four healthy lifestyle factors from the strategic goals of the American Heart Association (AHA): no current smoking, no obesity (body-mass index [the weight in kilograms divided by the square of the height in meters], <30), physical activity at least once weekly, and a healthy diet pattern. (Lloyd-Jones D M, Hong Y, Labarthe D, et al. Defining and setting national goals for cardiovascular health promotion and disease reduction: the American Heart Association's strategic Impact Goal through 2020 and beyond. Circulation 2010; 121:586-613). A healthy diet pattern was ascertained on the basis of adherence to at least half of the following recently endorsed characteristics (Mozaffarian D. Dietary and policy priorities for cardiovascular disease, diabetes, and obesity: a comprehensive review. Circulation 2016; 133:187-225): consumption of an increased amount of fruits, nuts, vegetables, whole grains, fish, and dairy products and a reduced amount of refined grains, processed meats, unprocessed red meats, sugar-sweetened beverages, trans fats (WGHS only), and sodium (WGHS only). Because a detailed food-frequency questionnaire was not performed in the BioImage Study, diet scores in that cohort focused on self-reported consumption of fruits, vegetables, and fish. Additional details regarding cohort-specific metrics for lifestyle factors are provided in Table S3.
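The four-factor count described above can be sketched as below. The thresholds come from the text; the function and parameter names are hypothetical, and the cohort-specific definitions of each dietary characteristic (Table S3) are not reproduced here, so each one is represented simply as a boolean.

```python
def healthy_diet(criteria_met):
    """criteria_met: one boolean per dietary characteristic assessed in the cohort.
    Healthy if at least half of the assessed characteristics are met."""
    return 2 * sum(criteria_met) >= len(criteria_met)

def healthy_lifestyle_count(current_smoker, bmi, activity_sessions_per_week, criteria_met):
    """Number of healthy lifestyle factors present (0 to 4)."""
    factors = [
        not current_smoker,                # no current smoking
        bmi < 30,                          # no obesity
        activity_sessions_per_week >= 1,   # physical activity at least once weekly
        healthy_diet(criteria_met),        # healthy diet pattern
    ]
    return sum(factors)
```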
The primary study end point for the prospective cohort populations was a composite of coronary artery disease events that included myocardial infarction, coronary revascularization, and death from coronary causes. End-point adjudication was performed by a committee review of medical records within each cohort. In the BioImage Study, a cross-sectional analysis of baseline scores for coronary-artery calcification was performed.
We used Cox proportional-hazards models to test the association of genetic and lifestyle factors with incident coronary events. We compared hazard ratios for participants at high genetic risk (i.e., highest quintile of polygenic scores) with those at intermediate risk (quintiles 2 to 4) or low risk (lowest quintile), as described previously. (Mega J L, Stitziel N O, Smith J G, et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy: an analysis of primary and secondary prevention trials. Lancet 2015; 385:2264-2271; Tada H, Melander O, Louie J Z, et al. Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur Heart J 2016; 37:561-567). Similarly, we compared a favorable lifestyle (which was defined as the presence of at least three of the four healthy lifestyle factors) with an intermediate lifestyle (two healthy lifestyle factors) or an unfavorable lifestyle (no or only one healthy lifestyle factor). The primary analyses included adjustment for age, sex, self-reported education level, and the first five principal components of ancestry (unavailable in MDCS). In addition, WGHS analyses were adjusted for initial trial randomization to aspirin versus placebo and vitamin E versus placebo. We used Cox regression to calculate 10-year event rates, which were standardized to the mean of all predictor variables within each population. Because of a skewed distribution of scores for coronary-artery calcification in the BioImage Study, linear regression was performed on natural log-transformed calcification scores with an offset of 1. Predicted values were then reverse-transformed to calculate standardized scores, with higher values indicating an increased burden of coronary atherosclerosis. All the analyses were performed with the use of R software, version 3.1 (R Project for Statistical Computing).
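The risk categories and the calcification transform described above can be sketched as follows. The analyses themselves were run in R; this Python sketch is only illustrative, its function names are hypothetical, and the rank-based quintile assignment (with no special handling of tied scores) is an assumption.

```python
import math

def genetic_risk_categories(scores):
    """Label each polygenic score within one cohort:
    lowest quintile -> 'low', quintiles 2 to 4 -> 'intermediate',
    highest quintile -> 'high'."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    categories = [None] * n
    for rank, idx in enumerate(order):
        quintile = rank * 5 // n + 1  # 1 (lowest) through 5 (highest)
        if quintile == 1:
            categories[idx] = "low"
        elif quintile == 5:
            categories[idx] = "high"
        else:
            categories[idx] = "intermediate"
    return categories

def lifestyle_category(n_healthy_factors):
    """Map the count of healthy lifestyle factors (0 to 4) to a category."""
    if n_healthy_factors >= 3:
        return "favorable"
    if n_healthy_factors == 2:
        return "intermediate"
    return "unfavorable"

def log_calcification(agatston_units):
    """Natural log transform with an offset of 1, so a score of 0 maps to 0."""
    return math.log(agatston_units + 1)

def reverse_transform(log_value):
    """Undo log_calcification to return to Agatston units."""
    return math.exp(log_value) - 1
```

The offset of 1 keeps participants with a calcification score of zero in the regression, since the logarithm of zero is undefined.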
The populations in the prospective cohort studies included 7814 of 11,478 white participants in the ARIC cohort, 21,222 of 23,294 white women in the WGHS cohort, and 22,389 of 30,446 participants in the MDCS cohort for whom genotype and covariate data were available (Table 1, Characteristics of the Participants at Baseline). During follow-up, 1230 coronary events were observed in the ARIC cohort (median follow-up, 18.8 years), 971 coronary events in the WGHS cohort (median follow-up, 20.5 years), and 2902 coronary events in the MDCS cohort (median follow-up, 19.4 years) (Table S4). Categories of genetic and lifestyle risk were mutually independent within each cohort (
Polygenic risk scores approximated a normal distribution within each cohort (
Each cohort was divided into three lifestyle risk categories: favorable (at least three of the four healthy lifestyle factors), intermediate (two healthy lifestyle factors), or unfavorable (no or only one healthy lifestyle factor). Participants with an unfavorable lifestyle had higher rates of baseline hypertension and diabetes, a higher body-mass index, and less favorable levels of circulating lipids than did those with a favorable lifestyle (Tables S12, S13, and S14). An unfavorable lifestyle was associated with a higher risk of coronary events than a favorable lifestyle, with an adjusted hazard ratio of 1.71 (95% CI, 1.47 to 1.98) in the ARIC cohort, 2.27 (95% CI, 1.92 to 2.67) in the WGHS cohort, and 1.77 (95% CI, 1.61 to 1.95) in the MDCS cohort (
Within each category of genetic risk, lifestyle factors were strong predictors of coronary events (
Despite a paucity of well-validated genetic loci in black populations, we observed similar findings among black participants and white participants in the ARIC cohort (
A cross-sectional analysis of 4260 of 4301 white participants with available data from the BioImage Study showed that both genetic and lifestyle factors were associated with coronary-artery calcification (stratified according to the baseline characteristics in Tables S16 and S17). The standardized calcification score was 46 Agatston units (95% CI, 39 to 54) among participants at high genetic risk, as compared with 21 Agatston units (95% CI, 18 to 25) among those at low genetic risk (P<0.001). The calcification score was similarly higher among participants with an unfavorable lifestyle than among those with a favorable lifestyle: 46 Agatston units (95% CI, 40 to 53) versus 28 Agatston units (95% CI, 25 to 31) (P<0.001). Within each subgroup of genetic risk, a significant trend was observed toward decreased coronary-artery calcification among participants who were more adherent to a healthy lifestyle (
In this study, we have provided quantitative data about the interplay between genetic and lifestyle risk factors for coronary artery disease in three prospective cohorts and one cross-sectional study. High genetic risk was independent of healthy lifestyle behaviors and was associated with an increased risk (hazard ratio, 1.91) of coronary events and a substantially increased burden of coronary-artery calcification. However, within any genetic risk category, adherence to a healthy lifestyle was associated with a significantly decreased risk of both clinical coronary events and subclinical burden of coronary artery disease.
The results of this analysis support three noteworthy conclusions. First, our data indicate that inherited DNA variation and lifestyle factors contribute independently to a susceptibility to coronary artery disease. Our finding that a polygenic risk score has robust associations with incident coronary events is well aligned with previous studies of both primary and secondary prevention populations. (Kathiresan S, Melander O, Anevski D, et al. Polymorphisms associated with cholesterol and risk of cardiovascular events. N Engl J Med 2008; 358:1240-1249; Ripatti S, Tikkanen E, Orho-Melander M, et al. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. Lancet 2010; 376:1393-1400; Paynter N P, Chasman D I, Pare G, et al. Association between a literature-based genetic risk score and cardiovascular events in women. JAMA 2010; 303:631-637; Thanassoulis G, Peloso G M, Pencina M J, et al. A genetic risk score is associated with incident cardiovascular disease and coronary artery calcium: the Framingham Heart Study. Circ Cardiovasc Genet 2012; 5:113-121; Brautbar A, Pompeii L A, Dehghan A, et al. A genetic risk score based on direct associations with coronary heart disease improves coronary heart disease risk prediction in the Atherosclerosis Risk in Communities (ARIC), but not in the Rotterdam and Framingham Offspring, Studies. Atherosclerosis 2012; 223:421-426; Ganna A, Magnusson P K, Pedersen N L, et al. Multilocus genetic risk scores for coronary heart disease prediction. Arterioscler Thromb Vasc Biol 2013; 33:2267-2272; Mega J L, Stitziel N O, Smith J G, et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy: an analysis of primary and secondary prevention trials. Lancet 2015; 385:2264-2271; Tada H, Melander O, Louie J Z, et al. 
Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur Heart J 2016; 37:561-567; Abraham G, Havulinna A S, Bhalala O G, et al. Genomic prediction of coronary heart disease. Eur Heart J 2016 Nov. 14; 37(43):3267-3278). Such findings support long-standing beliefs that genetic variants that are identifiable from birth alter coronary risk. (Müller C. Xanthomata, hypercholesterolemia, angina pectoris. Acta Med Scand 1938; 89:75-84; Gertler M M, Garn S M, White P D. Young candidates for coronary heart disease. J Am Med Assoc 1951; 147:621-625; Slack J, Evans K A. The increased risk of death from ischaemic heart disease in first degree relatives of 121 men and 96 women with ischaemic heart disease. J Med Genet 1966; 3:239-257). Aside from slight differences in LDL cholesterol levels and a family history of coronary artery disease, genetic risk was independent of traditionally measured risk factors.
Second, a healthy lifestyle was associated with similar relative risk reductions in event rates across each stratum of genetic risk. Although the absolute risk reduction that was associated with adherence to a healthy lifestyle was greatest in the group at high genetic risk, our results support public health efforts that emphasize a healthy lifestyle for everyone. An alternative approach is to target intensive lifestyle modification to those at high genetic risk, with the expectation that disclosure of genetic risk can motivate behavioral change. However, whether the provision of such information can improve cardiovascular outcomes remains to be determined.
Third, patients may equate DNA-based risk estimates with determinism, a perceived lack of control over the ability to improve outcomes. (White P D. Genes, the heart and destiny. N Engl J Med 1957; 256:965-969). However, our results provide evidence that lifestyle factors may powerfully modify risk regardless of the patient's genetic risk profile. Indeed, alternative analytic approaches that incorporate more stringent cutoffs or weight the relative effect for each healthy lifestyle factor may lead to an even more pronounced coronary risk gradient.
In conclusion, after quantifying both genetic and lifestyle risk among 55,685 participants in three prospective cohorts and one cross-sectional study, we found that adherence to a healthy lifestyle was associated with a substantially reduced risk of coronary artery disease within each category of genetic risk.
Whole genome sequencing enables ascertainment of the complete spectrum of genetic variation—common and rare, coding and noncoding. Rapid declines in cost have led to substantial enthusiasm that such testing will further our understanding of complex trait genetics and permit DNA-based population stratification that could inform clinical management. (See Ashley E A., Towards precision medicine, Nat Rev Genet, 2016; 17(9):507-22). Here, Applicants test this hypothesis by performing high coverage whole genome sequencing in 2,369 individuals with myocardial infarction at an early age and compare their genome sequences with 4,218 coronary disease-free participants. Applicants determine the association of common single variants as well as rare variants in both coding and noncoding regions with disease risk and identify the prevalence and clinical impact of monogenic (single large-effect mutation) and polygenic (cumulative effect of many variants of small effect) risk pathways associated with myocardial infarction.
The design of the VIRGO study has been previously described. (See Lichtman et al., Circ Cardiovasc Qual Outcomes, 2010; 3(6):684-93.) In brief, 3,501 participants hospitalized with an acute myocardial infarction, age 18 to 55 years, were enrolled between 2009 and 2012 from 103 United States and 24 Spanish hospitals using a 2:1 female-to-male enrollment design. Baseline patient data were collected by medical chart abstraction and standardized in-person patient interviews administered by trained personnel during the index acute myocardial infarction admission. Individuals with available DNA and who had provided written informed consent for genetic analysis were included in the present study.
The TAICHI cohort recruited Taiwanese Chinese individuals at four academic centers. (See Assimes et al., PLoS One, 2016; 11(3):e0138014). Individuals with coronary disease were identified as those with a history of myocardial infarction, coronary revascularization, or a stenosis of ≥50% in a major epicardial vessel demonstrated by angiography. All cases experienced an early-onset coronary event (men≤50 years, women≤60 years) in the context of normal circulating lipid levels (LDL cholesterol<130 mg/dl or total cholesterol<185 mg/dl). Controls were enrolled from an epidemiology study and from several hospital endocrinology and metabolism departments, either as outpatients or as family members of outpatients. Subjects with a history of CAD were excluded.
The design of the MESA study has been previously described, and the protocol is available at www.mesa-nhlbi.org. (See Bild et al., Am J Epidemiol, 2002; 156:871-881). In brief, 6,181 men and women between the ages of 45 and 84 without prevalent cardiovascular disease were recruited between 2000 and 2002 from 6 United States communities. Individuals were excluded from the present study if informed consent for genetic testing had not been obtained or was withdrawn, if DNA was not available for sequencing, or if incident cardiovascular disease (myocardial infarction, coronary revascularization, angina, peripheral arterial disease, stroke, resuscitated cardiac arrest, or death due to cardiovascular causes) occurred through the period of last available follow-up in December 2014. Fasting plasma triglyceride, total cholesterol, and high density lipoprotein cholesterol (HDL-C) concentrations were measured as described previously. (See Tsai et al., Atherosclerosis, 2008; 200: 359-367). Low density lipoprotein cholesterol (LDL-C) was calculated based on the Friedewald formula in participants with triglycerides<400 mg/dL. Lipoprotein(a) concentrations were available in 2,521 of 3,761 (67%) sequenced individuals, measured via a latex-enhanced turbidimetric immunoassay (Denka Seiken, Tokyo, Japan) that is insensitive to Kringle 4 type 2 isoforms as reported previously. (See Guan et al., Arterioscler Thromb Vasc Biol, 2015 April; 35(4):996-1001).
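As a rough illustration of the Friedewald calculation referred to above, the following sketch assumes all concentrations are in mg/dL and uses the standard form LDL-C = total cholesterol − HDL-C − triglycerides/5. The function name is hypothetical and this is not the study's own code.

```python
from typing import Optional


def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float) -> Optional[float]:
    """Estimate LDL-C in mg/dL with the Friedewald formula.

    Returns None when triglycerides are 400 mg/dL or higher, where the
    estimate is not applied (matching the <400 mg/dL restriction in the text).
    """
    if triglycerides >= 400:
        return None
    # Triglycerides / 5 approximates VLDL cholesterol in mg/dL.
    return total_chol - hdl - triglycerides / 5.0
```

For example, a participant with total cholesterol 200, HDL-C 50 and triglycerides 150 mg/dL would be assigned an LDL-C of 120 mg/dL.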
Study participants with early-onset myocardial infarction were derived from the previously described Variation in Recovery: Role of Gender on Outcomes of Young AMI Patients (VIRGO) and TAICHI consortium and controls from the Multiethnic Study of Atherosclerosis (MESA) cohort and TAICHI consortium. The VIRGO study enrolled a multiethnic population of adult patients presenting to enrollment centers in the United States and Spain with a first myocardial infarction at age<55 years. (See Lichtman et al., Circ. Cardiovasc. Qual. Outcomes, 2010; 3(6):684-93). The TAICHI consortium enrolled patients with an early-onset coronary event (men≤50 years, women≤60 years) in the context of normal circulating lipid levels (LDL cholesterol<130 mg/dl or total cholesterol<185 mg/dl) and controls in academic centers in Taiwan. (See Assimes et al., PLoS One, 2016; 11(3):e0138014). The MESA study is a multiethnic prospective cohort that enrolled individuals in the United States free of cardiovascular disease between 2000 and 2002. (See Bild et al., Am. J. Epidemiol., 2002; 156:871-81). MESA participants were included as controls for this study if they remained free of incident cardiovascular disease through the end of 2014 (median follow-up 13.2 years).
†Lipoprotein(a) concentrations available in 2,521 controls from the MESA cohort.
Whole genome sequencing was performed using the Illumina HiSeqX platform at the Broad Institute of Harvard and MIT (Cambridge, Mass.). DNA samples were received into the Genomics Platform's Laboratory Information Management System via a scan of the tube barcodes using a Biosero flatbed scanner. This registered the samples and enabled the linking of metadata based on well position. All samples were then weighed on a BioMicro Lab's XL20 to determine the volume of DNA present in sample tubes. Following this, the samples were quantified in a process that uses PicoGreen fluorescent dye. Once volumes and concentrations were determined, the samples were handed off to the Sample Retrieval and Storage Team for storage in a −20° Celsius freezer.
Libraries were constructed and sequenced on the Illumina HiSeqX with the use of 151-bp paired-end reads for whole-genome sequencing. Output from Illumina software was processed by the Picard data-processing pipeline to yield BAM files containing well-calibrated, aligned reads. All sample information tracking was performed by automated LIMS messaging.
Samples underwent fragmentation by means of acoustic shearing using a Covaris focused-ultrasonicator, targeting 385 bp fragments. Following fragmentation, additional size selection was performed using a SPRI cleanup. Library preparation was performed using a commercially available kit provided by KAPA Biosystems (product KK8202) with palindromic forked adapters with unique 8 base index sequences embedded within the adapter (purchased from IDT). Following sample preparation, libraries were quantified using quantitative PCR (kit purchased from KAPA Biosystems) with probes specific to the ends of the adapters. This assay was automated using Agilent's Bravo liquid handling platform. Based on qPCR quantification, libraries were normalized to 1.7 nM. Samples were then pooled into 24-plexes and the pools were once again quantified by qPCR. Samples were then combined with HiSeq X Cluster Amp Mix 1, 2 and 3 into single wells on a strip tube using the Hamilton Starlet Liquid Handling system.
Cluster amplification of the templates was performed according to the manufacturer's protocol (Illumina) using the Illumina cBot. Flowcells were sequenced on HiSeq X with sequencing software HiSeq Control Software (HCS) version 3.3.76, then analyzed using RTA2. The following versions were used for aggregation and alignment to the hg19_decoy reference: picard (latest version available at the time of the analysis), GATK (3.1-144-g00f68a3) and BwaMem (0.7.7-r441).
A sample was considered sequence complete when the mean coverage was >30× (for the MESA cohort) or ≥20× (for VIRGO and TAICHI cohorts). Two quality control metrics reviewed along with coverage were the sample fingerprint LOD score and the percentage of contamination. At aggregation, Applicants performed an all-by-all comparison of the read group data and estimated the likelihood that each pair of read groups was from the same individual. If any pair had a LOD score<−20.00, the aggregation did not proceed and the sample was investigated. A fingerprint (FP) LOD≥3 was considered passing concordance with the sequence data (ideally LOD>10). A sample has an LOD of 0 when it failed to produce a passing fingerprint. The Fluidigm fingerprint was repeated once if it failed. Read groups with fingerprint LOD scores<−3.00 were blacklisted from the aggregation. Sample genotypes were determined via a joint callset using the Genome Analysis Toolkit HaplotypeCaller.
Reads were aligned to the human reference genome hg19.
Sample Quality Control.
6,809 individuals underwent whole genome sequencing, of whom 222 (3.3%) were excluded based on sequencing quality control metrics (Table 13). Sample exclusion criteria included:
Variant Quality Control.
After completion of sample-level quality control, variant quality control was performed using the Hail software package (https://github.com/hail-is/hail). (Ganna et al., Nat Neurosci., 2016; 19(12):1563-1565). In total, 17.6 of 152.2 million (12%) single nucleotide polymorphisms and 12.0 of 23.4 million (52%) insertion-deletion variants were filtered from subsequent analysis (Table 13).
Variant exclusion criteria included:
Race Subgroup Inference.
A panel of approximately 16,000 ancestry informative markers (Hoggart et al., Am J Hum Genet., 2003; 72(6):1492-1504) (AIMs) identified across six continental populations (Libiger O, Schork N J., Front Genet., 2012; 3:322) was chosen to derive principal components (PCs) of ancestry for all samples that passed quality control. Principal component analysis was performed using EIGENSTRAT. (See Price et al., Nat Genet., 2006; 38:904-909).
In order to assign a race to individuals without self-reported race or with discordant self-reported race and PC ancestry, a k-nearest neighbors (k-NN) classifier (Fix E, Hodges J L. Discriminatory analysis: Non-parametric discrimination: Consistency properties. Texas: USAF School of Aviation Medicine. 1951; pp 261-279; Cover T, Hart P., IEEE Trans Inf Theory, 1967; 13:21-27.) was applied using the first five PCs of ancestry. This analysis was done using the k-NN implementation from the Scikit-learn library in Python. (See Pedregosa et al., Journal of Machine Learning Research, 2011; 12:2825-2830). The classifier was built using MESA samples after removing 25 individuals with discordant self-reported race and PC ancestry as determined by visual inspection of PC1 and PC2. The remaining MESA samples were split into a training set (n=2,490) and a test set (n=1,246). A k-NN (k=5) classifier was built using self-reported race as the dependent variable (1: White/Caucasian, 2: Chinese American, 3: Black/African-American, 4: Hispanic) and PC1 to PC5 as features. The classifier had a 98.1% reclassification rate in the test set, with misclassifications generally occurring for Hispanic individuals. This classifier was then applied to all 6,587 samples to generate inferred race. Inferred race and self-reported race were concordant in 6,383 of 6,576 (97%) samples with nonmissing self-reported race.
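The k-NN assignment described above can be sketched as follows. This is a plain-Python stand-in for the Scikit-learn classifier named in the text, not the Applicants' code; it assumes Euclidean distance on the first five principal components and a simple majority vote, and the toy coordinates and function name are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_pcs, train_labels, query_pcs, k=5):
    """Assign a label by majority vote among the k nearest training samples.

    train_pcs: list of tuples (PC1..PC5); train_labels: integer group codes;
    query_pcs: tuple (PC1..PC5) for the sample to classify.
    """
    # Rank training samples by Euclidean distance to the query sample.
    neighbours = sorted(
        (math.dist(pcs, query_pcs), label)
        for pcs, label in zip(train_pcs, train_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy training data: two well-separated clusters in PC space (illustrative only).
train_pcs = [
    (0.0, 0.0, 0.0, 0.0, 0.0), (0.1, 0.0, 0.0, 0.0, 0.0), (0.0, 0.1, 0.0, 0.0, 0.0),
    (1.0, 1.0, 1.0, 1.0, 1.0), (1.1, 1.0, 1.0, 1.0, 1.0), (1.0, 0.9, 1.0, 1.0, 1.0),
]
train_labels = [1, 1, 1, 2, 2, 2]
inferred = knn_predict(train_pcs, train_labels, (0.05, 0.05, 0.0, 0.0, 0.0), k=3)
```

In practice the held-out test set would be scored the same way to estimate the classification rate before applying the classifier to all samples.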
The relationship of common (allele frequency≥0.01) biallelic individual single nucleotide polymorphisms or short insertion-deletion (<10 base pairs) variants with early-onset myocardial infarction was tested.
Single Variant Testing.
Single nucleotide polymorphisms and insertion-deletion variants with allele frequency≥1% were tested for association with early-onset myocardial infarction using logistic regression with adjustment for the first four principal components of ancestry.
Coding Variant Gene Burden Testing.
The group of rare (allele frequency<1%) coding variants tested for each gene was composed of 1) loss-of-function variants, 2) missense variants predicted to be damaging by each of 5 computer prediction algorithms, and 3) variants annotated as pathogenic in the ClinVar online genetics database. Loss-of-function variants were identified with LOFTEE (Loss-Of-Function Transcript Effect Estimator), a plugin for the Ensembl Variant Effect Predictor (VEP). (See McLaren et al., Genome Biol., 2016; 17(1):122; Lek et al., Nature, 2016; 536(7616):285-91). They were included when they were deemed high-confidence loss-of-function. The LOFTEE assessment includes stop-gained, splice site disrupting and frameshift variants. Rare missense variants were included if they were annotated as damaging or possibly damaging by each of 5 computer prediction algorithms (SIFT, PolyPhen2-HumDiv, PolyPhen2-HumVar, LRT, MutationTaster) as previously performed. (See Purcell et al., Nature, 2014; 506:185-90; Khera et al., J Am Coll Cardiol., 2016; 67(22):2578-89). Pathogenic variants were identified with the February 2017 release of the ClinVar database [https://github.com/macarthur-lab/clinvar] using the ‘clinical significance’ annotation. (See Landrum et al., Nucleic Acids Res. 2014; 42(database issue):D980-D985). Variants were included if at least one entry was assigned a ‘pathogenic’ clinical significance and there were no conflicting interpretations (e.g. simultaneous annotation as ‘uncertain,’ ‘benign,’ or ‘protective’). Variants assigned as benign were excluded from subsequent analyses. A collapsed burden test was performed with EPACTS v3.2.6 (EPACTS: Efficient and Parallelizable Association Container Toolbox [Internet]. [cited 2017 Apr. 13]; Available from: http://genome.sph.umich.edu/wiki/EPACTS) using a logistic Wald test between the outcome and 0/1-collapsed variants, with the first four principal components of ancestry included as covariates. Genes were tested when at least two variants met the inclusion criteria and the cumulative allele frequency of the damaging variants was above 0.001.
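The 0/1 collapsing step and the gene inclusion rule can be illustrated with the sketch below. It is not EPACTS code: the function names are hypothetical, genotypes are assumed to be complete alternate-allele counts (0, 1 or 2), and the cumulative allele frequency is assumed to be the total count of qualifying alleles divided by twice the number of samples.

```python
def collapse_carriers(genotypes):
    """Collapse per-variant genotypes to a 0/1 carrier indicator per sample.

    genotypes: one list per sample holding alternate-allele counts (0/1/2)
    for each qualifying variant in the gene.
    """
    return [1 if any(g > 0 for g in sample) else 0 for sample in genotypes]


def gene_is_testable(genotypes, min_variants=2, min_cum_af=0.001):
    """Apply the inclusion rule: at least two observed variants and a
    cumulative allele frequency above 0.001 (definition assumed, see lead-in)."""
    n_samples = len(genotypes)
    if n_samples == 0:
        return False
    n_variants = len(genotypes[0])
    # Count variants carried by at least one sample.
    observed = sum(
        1 for j in range(n_variants) if any(sample[j] > 0 for sample in genotypes)
    )
    cum_af = sum(sum(sample) for sample in genotypes) / (2 * n_samples)
    return observed >= min_variants and cum_af > min_cum_af


# Four samples, two qualifying variants in a hypothetical gene.
toy = [[0, 0], [1, 0], [0, 2], [0, 0]]
carriers = collapse_carriers(toy)
```

The resulting carrier indicator would then serve as the predictor in a logistic regression with the ancestry principal components as covariates.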
Regulatory Variant Gene Burden Testing.
Rare (MAF<1%) regulatory non-coding variants for testing were identified based on their location within enhancers and promoters in aortic tissue. Enhancer and promoter regions were annotated based on the Roadmap Epigenomics project. (See Roadmap Epigenomics Consortium., Kundaje et al., Nature, 2015; 518(7539):317-30). These regions were defined based on a chromatin state model (imputed data, 25 states) using observed DNaseI data (Reg2Map: HoneyBadger2-impute [Internet]. [cited 2017 Apr. 13]; Available from: https://personal.broadinstitute.org/meuleman/reg2map/HoneyBadger2-impute_release/), selecting DNaseI regions with −log10(p)≥10. The following states were included to define promoter regions: active TSS, promoter upstream TSS, promoter downstream TSS 1, promoter downstream TSS 2, poised promoter and bivalent promoter. The following states were included to define enhancer regions: transcribed 5′ preferential and enh, transcribed 3′ preferential and enh, transcribed and weak enhancer, active enhancer 1, active enhancer 2, active enhancer flank, weak enhancer 1, weak enhancer 2 and possible enhancer. For each tissue or cell line, the variants in promoter or enhancer regions were grouped to a gene based on their proximity to the transcription start site (TSS). The inclusion region for promoters was defined as TSS+/−5 kb or the end of the canonical transcript, if the canonical transcript was shorter than 5,000 bases. The inclusion region for enhancers was defined as TSS+/−20 kb or the end of the canonical transcript, if the canonical transcript was shorter than 20,000 bases. Variants that fell within the exon bounds+/−5 base pairs of the canonical transcript were excluded. A sequence kernel association test (SKAT-O) (Lee et al., Biostatistics., 2012 September; 13(4):762-75) was performed with EPACTS v3.2.6 for each regulatory non-coding gene group and tissue or cell line. The first four principal components of ancestry were included as covariates in the models.
Genes were tested when at least two variants met the inclusion criteria and the cumulative allele frequency of the damaging variants was above 0.001.
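One possible reading of the variant-to-gene grouping rule is sketched below. The function name and coordinates are hypothetical, the gene is assumed to lie on the plus strand, and capping the downstream bound at the transcript end is an interpretation of the text rather than a confirmed detail of the Applicants' pipeline.

```python
def in_regulatory_window(pos, tss, transcript_end, exons, element="promoter"):
    """Return True if a variant position falls in a gene's inclusion window.

    pos, tss, transcript_end: base-pair coordinates on one chromosome
    (plus strand assumed); exons: list of (start, end) for the canonical
    transcript; element: "promoter" (5 kb flank) or "enhancer" (20 kb flank).
    """
    flank = 5_000 if element == "promoter" else 20_000
    upstream = tss - flank
    # Assumed interpretation: for transcripts shorter than the flank,
    # the downstream bound stops at the transcript end.
    downstream = min(tss + flank, transcript_end)
    if not (upstream <= pos <= downstream):
        return False
    # Exclude variants within 5 bp of any exon of the canonical transcript.
    return not any(start - 5 <= pos <= end + 5 for start, end in exons)


# Hypothetical 3 kb transcript with two exons.
toy_exons = [(100_000, 100_200), (102_800, 103_000)]
```

With this toy gene, a variant at position 98,000 is assigned to the promoter group, whereas a variant at 100,100 is dropped because it lies inside an exon.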
Gene-based coding variant testing was performed by aggregating rare (minor allele frequency<0.01) variants that lead to loss-of-function, were annotated as ‘Pathogenic’ in the ClinVar clinical genetics database (see Landrum et al., Nucleic Acids Res., 2014 January; 42 (Database issue):D980-85), or missense variants classified as damaging or possibly damaging by each of five computer prediction algorithms. (See Khera et al., JAMA, 2017; 317(9):937-946; Do et al., Nature, 2015; 518(7537):102-6). Tissue-specific regulatory burden testing was performed by aggregating rare variants in promoter or enhancer regions and assigning them to genes based on chromosomal proximity to a gene's transcription start site (within 5 kilobases for promoters and 20 kilobases for enhancer regions). (See Roadmap Epigenomics Consortium, Kundaje et al., Nature, 2015; 518(7539):317-30). For both the coding and regulatory burden testing, genes were included in the analysis if the cumulative allele frequency in the study population was >0.001 and at least 2 variants were observed.
The three established monogenic risk pathways assessed for association with early-onset myocardial infarction were variants in LDLR, APOB, or PCSK9 linked with familial hypercholesterolemia (see Do et al., Nature, 2015; 518(7537):102-6; Khera et al., J Am. Coll. Cardiol., 2016; 67(22):2578-89), variants in LPL or APOA5 associated with defective clearance of triglyceride-rich lipoproteins (see Do et al., Nature, 2015; 518(7537):102-6; Khera et al., JAMA, 2017; 317(9):937-946), and at least two risk variants associated with lipoprotein(a) as previously described. (See Clarke et al., N. Engl. J. Med., 2009; 361(26):2518-28).
A polygenic risk score (PRS) for CAD was built using a p-value and LD-driven clumping procedure in PLINK version 1.90b (--clump). (See Chang et al., GigaScience, 2015; 4). Input included summary CAD association statistics for 8.3 million SNPs from a large 1000 Genomes imputed GWAS of primarily European individuals (CARDIoGRAMplusC4D Consortium, A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet., 2015; 47:1121-1130) and a reference LD panel of 503 European samples from 1000 Genomes phase 3 version 1. (See The 1000 Genomes Project Consortium, A global reference for human genetic variation, Nature, 2015; 526(7571):68-74). In brief, the algorithm forms clumps around SNPs with association p-values less than a provided threshold. Each clump contains all SNPs within 250 kb of the index SNP that are also in LD with the index SNP as determined by a provided r² threshold in the LD reference. The algorithm iteratively cycles through all index SNPs, beginning with the smallest p-value, only allowing each SNP to appear in one clump. The final output should contain the most significantly CAD-associated SNP for each LD-based clump across the genome. A PRS was built containing the index SNPs of each clump with association estimate betas (log odds) as weights.
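The clumping logic described above can be sketched in a few lines. This is a simplified illustration rather than PLINK's implementation: pairwise LD values are assumed to be precomputed from the reference panel, secondary p-value filters on clump members are omitted, and the function name and SNP identifiers are hypothetical.

```python
def clump(snps, ld_r2, p_threshold=0.05, r2_threshold=0.8, window=250_000):
    """Greedy p-value and LD-driven clumping.

    snps: dict of SNP id -> (chromosome, position, p_value)
    ld_r2: dict of frozenset({id_a, id_b}) -> r^2 in the LD reference panel
    Returns a dict of index SNP id -> list of SNP ids clumped with it.
    """
    remaining = set(snps)
    clumps = {}
    # Visit candidate index SNPs from the smallest p-value upward.
    for index in sorted(snps, key=lambda s: snps[s][2]):
        if index not in remaining or snps[index][2] >= p_threshold:
            continue
        remaining.discard(index)
        chrom, pos, _ = snps[index]
        members = [
            s for s in sorted(remaining)
            if snps[s][0] == chrom
            and abs(snps[s][1] - pos) <= window
            and ld_r2.get(frozenset((index, s)), 0.0) >= r2_threshold
        ]
        # Each SNP may appear in only one clump.
        remaining.difference_update(members)
        clumps[index] = members
    return clumps


# Hypothetical summary statistics for four SNPs on one chromosome.
toy_snps = {
    "rsA": ("9", 1_000, 1e-9),
    "rsB": ("9", 50_000, 1e-4),
    "rsC": ("9", 400_000, 1e-3),
    "rsD": ("9", 60_000, 0.2),
}
toy_ld = {frozenset(("rsA", "rsB")): 0.9, frozenset(("rsA", "rsC")): 0.95}
toy_clumps = clump(toy_snps, toy_ld)
```

Here the most significant SNP absorbs its nearby correlated neighbour, the distant SNP forms its own clump despite high LD because it lies outside the 250 kb window, and the SNP above the p-value threshold is never used as an index. The keys of the returned dict are the index SNPs that would enter the score.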
PRSs were created over a range of p-value (1, 0.5, 0.05, 5×10⁻⁴, 5×10⁻⁶, 5×10⁻⁸) and r² (0.2, 0.4, 0.6, 0.8) thresholds, giving 24 candidate scores. To determine the best score, Applicants applied each to an independent set of 4,831 European CAD cases and 115,455 European controls from the UK Biobank (Sudlow et al., PLoS Med., 2015; 12: e1001779) using PLINK 1.90b (Chang et al., GigaScience, 2015; 4) (--score). Scores were generated by multiplying the number of risk alleles for each variant by the respective weight, and then summing across all variants in the score. Missing values were imputed to the mean genotype of that variant estimated by inferred ancestry group.
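The scoring rule above (risk-allele count times weight, summed over variants, with missing genotypes set to a group mean) can be written as a short sketch. The function and variant names are hypothetical and the weights are invented for illustration; this is not PLINK's --score implementation.

```python
def polygenic_score(dosages, weights, mean_dosage):
    """Weighted sum of risk-allele counts for one individual.

    dosages: dict of variant id -> risk-allele count (0/1/2) or None if missing
    weights: dict of variant id -> log odds per risk allele
    mean_dosage: dict of variant id -> mean genotype in the individual's
        inferred ancestry group, used to impute missing genotypes
    """
    total = 0.0
    for variant, beta in weights.items():
        count = dosages.get(variant)
        if count is None:
            # Impute a missing genotype to the ancestry-group mean.
            count = mean_dosage[variant]
        total += count * beta
    return total


toy_weights = {"v1": 0.10, "v2": 0.20, "v3": 0.05}
toy_means = {"v1": 0.5, "v2": 0.4, "v3": 1.0}
toy_person = {"v1": 2, "v2": None, "v3": 1}
toy_score = polygenic_score(toy_person, toy_weights, toy_means)
```

In this toy case the score is 2×0.10 + 0.4×0.20 + 1×0.05 = 0.33, with the second variant contributing its imputed mean genotype.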
Beginning in 2006, individuals aged 45 to 69 years old were recruited from across the United Kingdom for participation in the UK Biobank Study. (See Sudlow et al., PLoS Med., 2015; 12: e1001779). At enrollment, a trained healthcare provider ascertained participants' medical histories through verbal interview. In addition, participants' electronic health records (EHR), including inpatient International Classification of Diseases (ICD-10) diagnosis codes and Office of Population Censuses and Surveys (OPCS-4) procedure codes, were integrated into UK Biobank. Individuals were defined as having CAD based on at least one of the following criteria:
A polygenic risk score provides a quantitative assessment of the cumulative risk associated with multiple common risk alleles for each individual. Scores for each individual participant are created by multiplying the number of risk alleles at each variant by the literature-based effect size for that variant and then summing across all variants. (See Tada et al., Eur Heart J., 2016; 37(6):561-7; Khera et al., N Engl J Med., 2016; 375(24):2349-2358; Abraham et al., Eur Heart J., 2016; 37(43):3267-3278). Applicants previously demonstrated that a literature-based polygenic risk score comprised of 50 genetic variants that have exceeded genome-wide levels of significance is associated with incident coronary events. (See Tada et al., Eur Heart J., 2016; 37(6):561-7; Khera et al., N. Engl. J. Med., 2016; 375(24):2349-2358). However, the inclusion of additional subthreshold variants in a polygenic risk score may confer additional predictive value. (See Abraham et al., Eur Heart J., 2016; 37(43):3267-3278). In order to test this hypothesis, Applicants derived 24 distinct polygenic risk scores using summary statistics for 8.3 million single nucleotide polymorphisms from a previously reported GWAS and an independent reference panel of whole genome sequence data from 503 European individuals. (See The 1000 Genomes Project Consortium, A global reference for human genetic variation, Nature, 2015; 526(7571):68-74; Nikpay et al., Nat. Genet., 2015; 47(10):1121-30). These 24 scores varied with regard to the inclusion thresholds for the previously reported p-value for association with coronary disease and the degree of independence from other variants in the score. In order to determine which of these scores had the best predictive capacity, an independent validation dataset from the UK Biobank was assembled. (See Sudlow et al., PLoS Med., 2015; 12:e1001779). Each of these 24 scores was tested for association with coronary artery disease in UK Biobank and the score with the highest area under the curve was selected.
This score was then applied to the whole genome sequencing dataset in order to determine the association of this polygenic risk score with myocardial infarction.
The association of each PRS with CAD status was determined using logistic regression adjusted for the first four principal components of ancestry. Area under the curve (AUC) was used to determine model discrimination. While each PRS showed a highly significant association with CAD status, the best PRS consisted of 116,859 SNPs and had an AUC of 0.619.
Best-performing score (table row): r² threshold 0.8; p-value threshold 5e−2; number of SNPs 116,859; 116,632 (99.8%); AUC 0.6185; 3.28; 1.54.
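For readers less familiar with the discrimination metric, the area under the curve can be read as the probability that a randomly chosen case has a higher score than a randomly chosen control. The sketch below computes that quantity directly from raw scores; it is an unadjusted pairwise illustration with a hypothetical function name and invented scores, not the covariate-adjusted regression model used in the analysis.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve from pairwise case-control comparisons.

    Each case-control pair contributes 1 if the case scores higher,
    0.5 for a tie and 0 otherwise.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


toy_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.4])
```

The pairwise loop is quadratic in sample size, so a rank-based formulation would be preferable for a dataset the size of the validation cohort.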
The association of genetic variants with early-onset myocardial infarction, tested either individually or via burden testing, was assessed using logistic regression, adjusted for four principal components of ancestry. Race-specific quintiles of the polygenic risk score were derived and risk estimates compared to previously published scores. (See Tada et al., Eur. Heart J., 2016; 37(6):561-7; Khera et al., N. Engl. J. Med., 2016; 375(24):2349-2358; Abraham et al., Eur. Heart J., 2016; 37(43):3267-3278). The relationship of monogenic risk pathway variants with intermediate phenotypes of circulating lipid values was determined using linear regression, adjusting for age, sex, cohort, and four principal components of ancestry.
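Deriving race-specific quintiles amounts to ranking each participant's score within his or her own group rather than across the pooled sample. The following sketch shows one way this could be done; the function name and the rank-based cut points are assumptions, and the source does not specify how ties or group sizes not divisible by five were handled.

```python
from collections import defaultdict


def race_specific_quintiles(scores, groups):
    """Return a quintile (1 = lowest, 5 = highest) for every sample,
    computed separately within each group."""
    members = defaultdict(list)
    for i, group in enumerate(groups):
        members[group].append(i)
    quintile = [0] * len(scores)
    for idx in members.values():
        # Rank samples within the group by score, then split ranks into fifths.
        ranked = sorted(idx, key=lambda i: scores[i])
        for rank, i in enumerate(ranked):
            quintile[i] = rank * 5 // len(ranked) + 1
    return quintile


toy_scores = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
toy_groups = ["a"] * 5 + ["b"] * 5
toy_quintiles = race_specific_quintiles(toy_scores, toy_groups)
```

Because the second group's scores are uniformly higher, pooled quintiles would place all of its members in the top half, whereas the group-specific version spreads both groups evenly across quintiles 1 through 5.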
High-coverage whole genome sequencing was performed on 6,809 individuals. Of the original samples, 222 (3.3%) were excluded based on sequencing quality control metrics or relatedness, resulting in a final study population of 6,587 individuals: 2,369 cases and 4,218 controls. This multiethnic population included 3,081 (47%) white, 1,298 (20%) black, 1,289 (20%) Asian and 919 (14%) Hispanic participants (Tables 11 & 12). Principal components analysis demonstrated that cases and controls were well-matched according to genetic ancestry.
A total of 145,897,548 genetic variants were observed in sequenced individuals, of which the majority were in either intronic (50.6%) or intergenic (32.8%) regions of the genome (Table 14).
Single variant testing of 9,655,540 single nucleotide polymorphisms with allele frequency≥1% was performed (genomic inflation factor [λ]=1.077), replicating two known associations at the genome-wide level of significance recommended for sequencing studies, P<5×10⁻⁹. (See Pulit et al., The multiple testing burden in sequencing-based disease studies of global populations, bioRxiv 053264; doi: https://doi.org/10.1101/053264).
Applicants tested for an excess burden among cases of rare (allele frequency<1%) damaging coding variants across 12,989 genes. Consistent with previous results derived from exome sequencing (see Do et al., Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction, Nature, 2015; 518(7537):102-6), the top signal was for damaging variants in LDLR, conferring an odds ratio of 3.47 (95% CI 2.02-5.95; p=5.8×10⁻⁶). Applicants also combined rare non-coding variants in aortic tissue-specific enhancer and promoter regions based on proximity to protein-coding genes, although no statistically significant associations were identified. For both coding and noncoding gene burden testing, genes with suggestive evidence of association (P<0.05) are provided.
A mutation in a monogenic risk pathway for myocardial infarction was observed in 4.8% of sequenced individuals.
Variants associated with defects in triglyceride lipolysis were noted in 24 (1.0%) of myocardial infarction cases and associated with 54 mg/dl (95% CI 15-93) higher circulating triglycerides and an odds ratio for myocardial infarction of 2.3 (95% CI 1.3-4.2). Furthermore, at least two variants associated with increased lipoprotein(a) were identified in 2.1% of myocardial infarction cases, with an odds ratio of 2.8 (95% CI 1.7-4.4) for myocardial infarction. Among 2,521 controls from the MESA cohort with lipoprotein(a) levels available, inheriting at least two variants known to increase lipoprotein(a) was associated with a 16.6 mg/dl (95% CI 4.7-29) higher circulating concentration.
Applicants derived 24 distinct polygenic risk scores based on results from a previously published analysis, with the number of genetic variants in each score ranging from 78 to 2.04 million. Each of these scores was evaluated in an independent testing dataset of individuals from the UK Biobank (Table 15).
Importantly, the polygenic risk score was selected from 24 scores derived and validated based on a previously published GWAS and the UK Biobank, both of which comprised primarily participants of European ancestry. Applicants next tested the association of polygenic risk categories with myocardial infarction in subpopulations stratified by race. Although the score was robustly associated with risk within each group, the performance was best in white participants, with a 6.5-fold (95% CI 5.0-8.5) risk gradient between those of low and high polygenic risk, as compared with gradients of 4.2-fold, 3.9-fold, and 3.1-fold in black, Asian, and Hispanic participants, respectively (p-interaction=0.001).
Applicants examined the quantitative importance and interplay of monogenic and polygenic risk pathways as they related to inherited risk of myocardial infarction. The risk associated with mutations in monogenic risk pathways was similar across strata of polygenic risk (p-interaction=0.08). Among the 2,369 individuals with myocardial infarction, 78 (3.3%) harbored a monogenic risk pathway mutation but were not in the top quintile of the polygenic risk score, 664 (28%) were in the top quintile of the polygenic risk score but did not harbor a monogenic risk pathway mutation, and 36 (1.5%) both harbored a monogenic pathway mutation and were in the top quintile of the polygenic score. As compared with those with no monogenic pathway mutation and low or intermediate polygenic risk, a monogenic risk pathway mutation or a high polygenic risk score each conferred a roughly three-fold increase in risk (OR 2.74 [95% CI 2.39-3.14] or 3.03 [95% CI 2.13-4.31], respectively). By contrast, those with both a monogenic pathway mutation and increased polygenic risk had a 5.88-fold (95% CI 3.20-11.09) increased risk of early-onset myocardial infarction.
In this study, Applicants compared the whole genome sequences of 2,369 individuals who suffered myocardial infarction at an early age with 4,218 control individuals free of cardiovascular disease. In a genetic association analysis, Applicants did not identify any new variants or genes associated with myocardial infarction. In a clinical interpretation framework integrating monogenic and polygenic risk pathways, Applicants observed a monogenic risk pathway mutation in 4.8% of individuals with early-onset myocardial infarction and these mutations conferred approximately three-fold increased risk. Applicants developed a new polygenic risk score of 116,859 genetic variants and this score demonstrated a 5.2-fold risk gradient across quintiles.
These results permit several conclusions of relevance to complex trait genetics. First, discovery of rare variant associations with disease in noncoding sequence is likely to require substantially increased sample sizes and improvements in the functional annotation of noncoding variants. Notably, the majority of observed variants reside in intergenic or intronic regions and are present in fewer than 1 in 1,000 individuals. Our analysis of rare variation in regulatory sequences in tissues of known relevance to human atherosclerosis did not identify statistically significant associations.
Second, a mutation in a monogenic risk pathway was identified in 4.8% of sequenced individuals. These mutations are linked to impaired clearance of LDL cholesterol (familial hypercholesterolemia), defective triglyceride lipolysis, and increased lipoprotein(a). In aggregate, such mutations conferred a three-fold increased risk, broadly consistent with previous reports. (See Do et al., Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction, Nature, 2015; 518(7537):102-6; Khera et al., Association of rare and common variation in the lipoprotein lipase gene With coronary artery disease, JAMA, 2017; 317(9):937-946; Khera et al., Diagnostic yield and clinical utility of sequencing familial hypercholesterolemia genes in patients with severe hypercholesterolemia, J Am. Coll. Cardiol., 2016; 67(22):2578-89; Clarke et al., Genetic variants associated with Lp(a) lipoprotein level and coronary disease, N. Engl. J. Med., 2009; 361(26):2518-28; Abul-Husn et al., Genetic identification of familial hypercholesterolemia within a single U.S. health care system, Science, 2016; 354(6319)). Importantly, each of these driving pathways can be targeted using potent therapeutics currently available or in development: statins, ezetimibe, and drugs targeting PCSK9 (monoclonal antibodies or RNA interference) to reduce LDL cholesterol, an antisense oligonucleotide targeting apolipoprotein C-III to accelerate triglyceride clearance, and an antisense oligonucleotide to lower lipoprotein(a). (See Sabatine et al., Evolocumab and clinical outcomes in patients with cardiovascular disease, N. Engl. J Med., 2017 May 4; 376(18):1713-1722; Gaudet et al., Antisense inhibition of apolipoprotein C-III in patients with hypertriglyceridemia, N. Engl. J. Med., 2015; 373(5):438-47; Viney et al., Antisense oligonucleotides targeting apolipoprotein(a) in people with raised lipoprotein(a): two randomised, double-blind, placebo-controlled, dose-ranging trials, Lancet, 2016; 388(10057):2239-2253). A stratified approach that targets use of these medications to those with a lifelong genetic perturbation in the relevant pathway may prove useful.
Third, inheritance of a disproportionate number of common genetic risk variants, each with a modest impact, represents another mechanism underlying genetic predisposition. Monogenic risk pathways and this polygenic risk contributed to risk of myocardial infarction in an additive fashion. Applicants derived and validated a new polygenic risk score that includes 116,859 genetic variants scattered across the genome. This expanded score significantly outperformed previous such scores with a more than five-fold risk gradient observed across score quintiles. (See Tada et al., Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history, Eur. Heart J., 2016; 37(6):561-7; Khera et al., Genetic risk, adherence to a healthy lifestyle, and coronary disease, N. Engl. J. Med., 2016; 375(24):2349-2358; Abraham et al., Genomic prediction of coronary heart disease, Eur. Heart J., 2016; 37(43):3267-3278). However, consistent with the development and validation of this and previous scores in individuals of European ancestry, significant heterogeneity in score performance was noted across racial subgroups. (See Martin et al., Human demographic history impacts genetic risk prediction across diverse populations, Am. J. Hum. Genet., 2017; 100(4):635-649). Evidence derived from randomized clinical trials suggests that those with increased polygenic risk derive increased absolute and relative coronary risk reduction with statin therapy. (See Mega et al., Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy: an analysis of primary and secondary prevention trials, Lancet, 2015; 385(9984):2264-71; Natarajan et al., Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting, Circulation, 2017 Feb. 21. [Epub ahead of print]). 
Similarly, absolute risk reductions associated with adherence to a healthy lifestyle were highest in the high genetic risk subgroup. (See Khera et al., Genetic risk, adherence to a healthy lifestyle, and coronary disease, N. Engl. J. Med., 2016; 375(24):2349-2358). Ascertainment of polygenic risk for common diseases may thus facilitate intensive prevention efforts via lifestyle or pharmacotherapy.
In conclusion, after assessment of more than 145 million genetic variants in 6,587 individuals of a multiethnic case-control study, Applicants identify both mutations in monogenic risk pathways and polygenic risk as important contributors to the genetic underpinnings of early-onset myocardial infarction.
Polygenic risk scores provide a quantitative metric of an individual's inherited risk based on the cumulative impact of many variants. Weights are generally assigned to each genetic variant according to the strength of its association with disease risk (effect estimate). Individuals are scored based on how many risk alleles they have for each variant (e.g., 0, 1, or 2 copies) included in the polygenic risk score.
Polygenic risk can be quantified by assessing the number of risk variants in each individual, weighted by the impact of each variant on disease. Here, previously published data for the association of 6.6 million common genetic variants with coronary artery disease (CAD) were used to derive several polygenic scores (
A genome-wide polygenic score was derived based on the association statistics of all available common (minor allele frequency≥0.01) single nucleotide polymorphisms with CAD, as determined by a published genome-wide association study of 60,801 individuals with CAD and 123,504 controls.16 The inter-relationship between these variants was assessed using a reference population of 503 Europeans from the 1000 Genomes study.17
The LDpred computational algorithm was then used to construct polygenic scores. (See Vilhjálmsson, B. J. et al., Am J Hum Genet., 97, 576-92 (2015)). LDpred creates a polygenic risk score using genome-wide variation with weights derived from a set of GWAS summary statistics. Unlike other methods that use variants most strongly associated with disease risk or a set of independent variants across the genome, LDpred includes all available variants in the derived risk score by shrinking effect estimate weights (log-odds) based on an external LD reference panel. This Bayesian approach calculates a posterior mean effect size for each variant based on a prior (association with CAD in a previously published study) and subsequent shrinkage based on the extent to which this variant is correlated with similarly associated variants in a reference population. The underlying Gaussian distribution additionally considers the fraction of causal (i.e., non-zero effect size) markers. Because this fraction is unknown for any given disease, LDpred uses a range of plausible values to construct eleven different polygenic scores. For score derivation, CAD summary statistics from a comprehensive 1000 Genomes imputed GWAS of primarily European individuals (CARDIoGRAMplusC4D Consortium, Am J Hum Genet. 97(4), 576-92 (2015)) and a linkage disequilibrium reference panel of 503 European samples from 1000 Genomes phase 3 version 5 (The 1000 Genomes Project Consortium, A global reference for human genetic variation, Nature, 526(7571):68-74 (2015)) were used. Single nucleotide polymorphisms (SNPs) with ambiguous strand (A/T or C/G) or minor allele frequency less than 1% were removed from the score derivation. This left 6,630,150 variants available for inclusion. In accordance with recommendations from the LDpred authors, the linkage disequilibrium radius was set at 2210 variants, equivalent to the number of SNPs used as input divided by 3000.
A range of values for ρ, the fraction of causal variants, was used (1, 0.5, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001). In addition, an infinitesimal model (each variant assumed to contribute to disease risk) (See Visscher, P. M. et al., Nat Rev Genet. 9(4):255-66 (2008)) and an unweighted model (raw log-odds for all variants input) were considered.
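The linkage disequilibrium radius described above can be reproduced with a short calculation. The following is only an illustrative sketch: the function name `ld_radius` is hypothetical, and rounding to the nearest whole variant is an assumption, since the text states only that the radius equals the number of input SNPs divided by 3000.

```python
def ld_radius(num_variants: int, divisor: int = 3000) -> int:
    """Return an LD radius as the number of input variants divided by `divisor`.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    if num_variants <= 0 or divisor <= 0:
        raise ValueError("num_variants and divisor must be positive")
    return round(num_variants / divisor)


# 6,630,150 variants were available for inclusion in the score derivation.
example_radius = ld_radius(6_630_150)
```

With the 6,630,150 variants reported above, this sketch yields a radius of 2210 variants, matching the value stated in the text.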
The best score was then determined based on maximal area under the curve from logistic regression models in a previously described CAD case-control cohort of 120,286 individuals (4,831 European CAD cases and 115,455 European controls) from the UK Biobank phase I cohort. (See Klarin, D. et al. Nat Genet. Jul. 17, 2017, doi: 10.1038/ng.3914 [Epub ahead of print]).
Scores were generated by multiplying the genotype dosage of each risk allele for each variant by its respective weight, and then summing across all variants in the score. Incorporating genotype dosages accounts for uncertainty in genotype imputation. All calculations were performed using Hail (https://github.com/hail-is/hail). Over 99.9% of variants in the LDpred-derived risk scores were available for scoring purposes in the UK Biobank phase I genotype release with sufficient imputation quality (INFO>0.3).
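The scoring step described above (dosage multiplied by weight, summed across variants) can be sketched in plain Python as follows. This is a minimal illustration and not the Hail-based implementation referred to in the text; the function name, the dictionary-based data layout, and the variant identifiers are all hypothetical.

```python
from typing import Mapping


def polygenic_score(dosages: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Sum risk-allele dosage x weight over the variants in the score.

    `dosages` maps a variant identifier to the imputed dosage of the risk
    allele (a value between 0 and 2); `weights` maps the same identifiers
    to per-variant weights. Variants missing from `dosages` are skipped,
    which is an assumption of this sketch.
    """
    total = 0.0
    for variant_id, weight in weights.items():
        dosage = dosages.get(variant_id)
        if dosage is None:
            continue  # variant not available for scoring in this individual
        if not 0.0 <= dosage <= 2.0:
            raise ValueError(f"dosage out of range for {variant_id}: {dosage}")
        total += dosage * weight
    return total


# Hypothetical weights and dosages, for illustration only.
example_weights = {"var1": 0.5, "var2": 0.25, "var3": 0.125}
example_dosages = {"var1": 2.0, "var2": 1.5}  # var3 was not genotyped
example_score = polygenic_score(example_dosages, example_weights)
```

Because the dosage may take fractional values such as 1.5, uncertainty in genotype imputation carries through to the score rather than being rounded to a hard genotype call.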
The association between each PRS and CAD status was determined using logistic regression, adjusted for the first four principal components of ancestry. Area under the curve (AUC) was used to determine model discrimination. While most PRS showed a highly significant association with CAD status, the PRS generated by LDpred with ρ=0.001 showed the best discrimination based on AUC (Table 17).
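For orientation, the area under the curve used to compare scores can be computed as the probability that a randomly chosen case has a higher score than a randomly chosen control. The sketch below computes that quantity directly from raw scores; it does not fit the logistic regression or adjust for principal components of ancestry as described above, and the function name and example values are hypothetical.

```python
from typing import Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Area under the ROC curve from pairwise case-control comparisons.

    Each case-control pair counts 1 if the case score is higher, 0.5 if the
    scores are tied, and 0 otherwise. This is quadratic in cohort size and
    is intended only as an illustration.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical scores, for illustration only.
example_auc = auc([0.9, 0.4, 0.7], [0.1, 0.4, 0.3, 0.8])
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from controls.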
A multiethnic early-onset (age≤60 years) CAD case-control cohort was assembled using cases from the previously described Variation in Recovery: Role of Gender on Outcomes of Young AMI Patients (VIRGO) study and TAICHI consortium and controls from the Multi-Ethnic Study of Atherosclerosis (MESA) cohort and TAICHI consortium. The VIRGO study enrolled a multiethnic population of adult patients in the United States and Spain with a first myocardial infarction at age≤55 years; its design has been previously described. (See Lichtman, J. H. et al., Circ Cardiovasc Qual Outcomes, 3, 684-93 (2010)). In brief, 3,501 participants hospitalized with an acute myocardial infarction, age 18 to 55 years, were enrolled between 2009 and 2012 from 103 United States and 24 Spanish hospitals using a 2:1 female-to-male enrollment design. Baseline patient data were collected by medical chart abstraction and standardized in-person patient interviews administered by trained personnel during the index acute myocardial infarction admission. Individuals with available DNA, all of whom were derived from United States enrollment centers, and who had provided written informed consent for genetic analysis were included in the present study.
The TAICHI consortium enrolled patients with an early-onset coronary event (men≤50 years, women≤60 years) in the context of normal circulating lipid levels (LDL cholesterol<130 mg/dl or total cholesterol<185 mg/dl) and controls in Taiwan. (See Assimes, T. L. et al., PLoS One, 11, e0138014 (2016)). Individuals with coronary disease were identified as those with a history of myocardial infarction, coronary revascularization, or a stenosis of ≥50% in a major epicardial vessel demonstrated by angiography. Controls were enrolled from an epidemiology study and from several hospital Endocrinology and Metabolism Departments, either as outpatients or as their family members. Subjects with a history of CAD were excluded.
The MESA study is a multiethnic prospective cohort that enrolled individuals in the United States free of cardiovascular disease between 2000 and 2002. The design of the MESA study has been previously described, and the protocol is available at www.mesa-nhlbi.org. (See Bild, D. E. et al., Am J. Epidemiol., 156, 871-881 (2002)). In brief, 6,181 men and women between the ages of 45 and 84 without prevalent cardiovascular disease were recruited between 2000 and 2002 from 6 United States communities. Individuals were excluded from the present study if informed consent for genetic testing had not been obtained or was withdrawn, if DNA was not available for sequencing, or if incident cardiovascular disease (myocardial infarction, coronary revascularization, angina, peripheral arterial disease, stroke, resuscitated cardiac arrest, death due to cardiovascular causes) occurred through the period of last available follow-up in December 2014. Fasting plasma triglyceride, total cholesterol, and high density lipoprotein cholesterol (HDL-C) concentrations were measured as described previously. (See Tsai, M. Y. et al., Atherosclerosis 200, 359-67 (2008)). Low density lipoprotein cholesterol (LDL-C) was calculated based on the Friedewald formula in participants with triglycerides<400 mg/dL. (See Friedewald, W. T. et al., Clin Chem 18(6), 499-502 (1972)).
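The Friedewald calculation referenced above estimates LDL-C as total cholesterol minus HDL-C minus one fifth of triglycerides, with all values in mg/dL. A minimal sketch follows; the function name is hypothetical, and returning `None` when triglycerides are 400 mg/dL or higher is a design choice of this sketch reflecting the restriction stated above.

```python
from typing import Optional


def friedewald_ldl(total_cholesterol: float,
                   hdl_cholesterol: float,
                   triglycerides: float) -> Optional[float]:
    """Estimate LDL-C (mg/dL) as TC - HDL-C - TG/5.

    Returns None when triglycerides are >= 400 mg/dL, because the estimate
    was applied above only to participants with triglycerides < 400 mg/dL.
    """
    if triglycerides >= 400.0:
        return None
    return total_cholesterol - hdl_cholesterol - triglycerides / 5.0


# Hypothetical lipid panel: TC 200, HDL-C 50, TG 150 (all mg/dL).
example_ldl = friedewald_ldl(200.0, 50.0, 150.0)
```

For the hypothetical panel shown, the estimate is 200 − 50 − 30 = 120 mg/dL.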
MESA participants were included as controls for this study if they remained free of incident cardiovascular disease through the end of 2014 (median follow-up 13.2 years). The polygenic score was calculated based on whole genome sequencing data. Because the polygenic score was derived and tested based on studies composed primarily of participants of European ancestry, Applicants determined whether the association of the polygenic score with early-onset CAD varied according to race or ethnicity.
Genotypes in the VIRGO-MESA-TAICHI cohort were ascertained using whole genome sequencing, performed at the Broad Institute of Harvard and MIT (Cambridge, Mass., USA). Libraries were constructed and sequenced on the Illumina HiSeqX with the use of 151-bp paired-end reads for whole-genome sequencing. Output from Illumina software was processed by the Picard data-processing pipeline to yield BAM files containing well-calibrated, aligned reads. All sample information tracking was performed by automated LIMS messaging. A sample was considered sequence complete when the mean coverage was ≥30× (for the MESA cohort) or ≥20× (for VIRGO and TAICHI cohorts). Two quality control metrics reviewed along with coverage were the sample fingerprint LOD score and percent contamination. At aggregation, an all-by-all comparison of the read group data was performed to estimate the likelihood that each pair of read groups was from the same individual. If any pair had a LOD score<−20.00, the aggregation did not proceed and the pair was investigated. A fingerprint (FP) LOD≥3 was considered passing concordance with the sequence data (ideally LOD>10). A sample had an LOD of 0 when it failed to have a passing fingerprint. The Fluidigm fingerprint was repeated once if failed. Read groups with fingerprint LOD scores<−3.00 were blacklisted from the aggregation. Sample genotypes were determined via a joint callset using the Genome Analysis Toolkit HaplotypeCaller.
6,809 individuals underwent whole genome sequencing, of whom 222 (3.3%) were excluded based on sequencing quality control metrics (Table 18). Sample exclusion criteria included:
Baseline characteristics of the 6,587 remaining individuals, stratified by early-onset coronary artery disease case versus control status, are provided in Table 19. Principal components analysis demonstrated that cases and controls were well-matched according to genetic ancestry. Mean sequencing depth was 31.7× (SD 3.8) across the study cohorts with similar quality metrics observed across cases and controls (
In order to assign race within this cohort, a panel of approximately 16,000 ancestry informative markers (AIMs) (see Hoggart, C. J. et al., Am J Hum Genet 72(6), 1492-1504 (2003)) identified across six continental populations was chosen to derive principal components (PCs) of ancestry for all samples that passed quality control. Principal component analysis was performed using EIGENSTRAT. (See Price, A. L. et al., Nat Genet 38, 904-9 (2006)).
In order to assign a race to individuals without self-reported race or with discordant self-reported race and PC ancestry, a k-nearest neighbors (k-NN) classifier (see Fix, E. et al., Texas: USAF School of Aviation Medicine, pp 261-279 (1951); Cover, T. et al., IEEE Trans Inf Theory. 13, 21-27 (1967)) was applied using the first five PCs of ancestry. This analysis was done using the k-NN implementation from the Scikit-learn library in Python. (See Pedregosa, F. et al., Journal of Machine Learning Research, 12, 2825-30 (2011)). The classifier was built using MESA samples after removing 25 individuals with discordant self-reported race and PC ancestry as determined by visual inspection of PC1 and PC2. The remaining MESA samples were split into a training set (n=2,490) and test set (n=1,246). A k-NN (k=5) classifier was built using self-reported race as the dependent variable (1: White/Caucasian, 2: Chinese American, 3: Black/African-American, 4: Hispanic) and PC1 to PC5 as features. The classifier had a 98.1% reclassification rate in the test set, with misclassifications generally occurring for Hispanic individuals. This classifier was then applied to all 6,587 samples to generate inferred race.
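The classification step described above can be illustrated with a small self-contained k-nearest neighbors routine. This sketch is a plain-Python stand-in and is not the Scikit-learn implementation used in the analysis; the function name, the toy principal component values, and the two-class labels are hypothetical, and Euclidean distance with simple majority voting is assumed.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def knn_predict(train_features: Sequence[Sequence[float]],
                train_labels: Sequence[Hashable],
                query: Sequence[float],
                k: int = 5) -> Hashable:
    """Predict a label by majority vote among the k nearest training samples.

    Distance is Euclidean over the supplied features (for example, PC1 to
    PC5). Ties in the vote are resolved in favor of the label seen first
    among the nearest neighbors.
    """
    if len(train_features) != len(train_labels):
        raise ValueError("features and labels must have the same length")
    if not 1 <= k <= len(train_features):
        raise ValueError("k must be between 1 and the number of samples")
    # Sort by distance only, so labels never need to be comparable.
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data: two well-separated clusters in five PCs (hypothetical).
toy_features = [
    (0.0, 0.0, 0.0, 0.0, 0.0), (0.1, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.1, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0, 1.0),
    (0.9, 1.0, 1.0, 1.0, 1.0), (1.0, 0.9, 1.0, 1.0, 1.0),
]
toy_labels = [1, 1, 1, 2, 2, 2]
near_first = knn_predict(toy_features, toy_labels, (0.05, 0.05, 0.0, 0.0, 0.0))
near_second = knn_predict(toy_features, toy_labels, (0.95, 0.95, 1.0, 1.0, 1.0))
```

In practice, holding out a test set, as was done with the MESA samples, allows the classification accuracy to be estimated before the classifier is applied to samples lacking a self-reported label.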
A second validation set for prevalent and incident CAD was assembled from individuals of European ancestry from the UK Biobank phase II cohort. (See Sudlow, C. et al., PLoS Med 12, e1001779 (2015)). The UK Biobank enrolled individuals aged 45 to 69 years old from across the United Kingdom beginning in 2006. Individuals who self-reported a history of myocardial infarction or coronary revascularization or were hospitalized for acute myocardial infarction or coronary revascularization in the electronic health record prior to enrollment were considered prevalent cases; all other individuals were considered controls. Incident coronary events were ascertained based on hospital admission for an acute myocardial infarction or coronary revascularization or fatal CAD as detected in the death registry.
Individuals in the UK Biobank underwent genotyping with one of two closely related custom arrays (UK BiLEVE Axiom Array or UK Biobank Axiom Array) consisting of over 800,000 genetic markers scattered across the genome. (See Bycroft et al., bioRxiv, doi.org/10.1101/166298 (2017)). Additional genotypes were imputed centrally using the Haplotype Reference Consortium and UK10K haplotype resource where available and the 1000 Genomes Phase 3 reference panel otherwise to generate imputation results. In order to analyze individuals with a relatively homogenous ancestry and owing to small percentages of non-British individuals, the present analysis was restricted to the white British ancestry individuals. This subpopulation was constructed centrally using a combination of self-reported ancestry and genetically confirmed ancestry using principal components. Additional exclusion criteria included outliers for heterozygosity or genotype missingness, discordant reported versus genotypic sex, putative sex chromosome aneuploidy, or withdrawal of informed consent. Each of these parameters was derived centrally as previously reported. (Bycroft, C. et al., 2017).
Baseline characteristics of the 288,980 remaining individuals for the prevalent coronary artery disease analysis are provided in Table 20. Current smoking, lipid-lowering medication, and parental history of heart disease were determined by self-report at the time of the enrollment survey. Diabetes mellitus, hypertension, and dyslipidemia were assessed based on a combination of self-report or a hospitalization diagnosis code reflecting these conditions prior to the date of UK Biobank enrollment.
Diagnosis of prevalent coronary artery disease was based on a composite of myocardial infarction or coronary revascularization. Myocardial infarction was based on self-report or hospital admission diagnosis, as performed centrally. (See Schnier, C. et al., Definitions of acute myocardial infarction (MI) and main MI pathological types for UK Biobank phase 1 outcomes adjudication; Version 1, January 2017. Available at: biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=461). This included individuals with ICD-9 codes of 410.X, 411.0, 412.X, 429.79 or ICD-10 codes of I21.X, I22.X, I23.X, I24.1, I25.2 in hospitalization records. Among the 280,304 individuals free of prevalent coronary artery disease at baseline, incident events included myocardial infarction, fatal coronary event, and coronary revascularization. Myocardial infarction was ascertained using the above ICD-10 diagnosis codes in hospitalization records or in the death registry as an underlying cause of death. Coronary revascularization, inclusive of percutaneous angioplasty or coronary artery bypass surgery, was extracted from OPCS (Office of Population, Censuses and Surveys: Classification of Interventions and Procedures) hospitalization procedure codes.
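The diagnosis code lists above can be expressed as a simple matching routine, in which an "X" suffix is treated as matching any subcode. The following is a sketch under stated assumptions: the function name is hypothetical, and codes are assumed to be supplied in dotted form (for example, "I21.4"), which may differ from the storage format of a given data source.

```python
# Code families in which any subcode qualifies (410.X, 412.X, I21.X, ...).
ICD9_FAMILIES = ("410", "412")
ICD10_FAMILIES = ("I21", "I22", "I23")
# Codes that must match exactly.
ICD9_EXACT = ("411.0", "429.79")
ICD10_EXACT = ("I24.1", "I25.2")


def is_mi_code(code: str, icd_version: int) -> bool:
    """Return True if a dotted ICD code is on the myocardial infarction list."""
    normalized = code.strip().upper()
    if icd_version == 9:
        families, exact = ICD9_FAMILIES, ICD9_EXACT
    elif icd_version == 10:
        families, exact = ICD10_FAMILIES, ICD10_EXACT
    else:
        raise ValueError("icd_version must be 9 or 10")
    if normalized in exact:
        return True
    # The part before the dot identifies the code family.
    family = normalized.split(".")[0]
    return family in families


example_hits = [
    is_mi_code("I21.4", 10),   # family match (I21.X)
    is_mi_code("I24.1", 10),   # exact match
    is_mi_code("410.9", 9),    # family match (410.X)
]
```

Keeping the family and exact-match lists separate avoids inadvertently matching neighboring codes such as I24.0 or 411.1, which are not on the list above.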
Individuals without evidence of an incident event were censored at the earlier of last hospitalization or death registry follow-up. This corresponded to February 2016 for England and Wales and October 2016 for Scotland participants.
The polygenic score was calculated using array-based genotyping and imputation. (Bycroft, C. et al., 2017).
The third validation study for incident events involved white participants free of prevalent CAD from the Atherosclerosis Risk in Communities (ARIC) study, a prospective cohort with emphasis on the epidemiology of cardiovascular disease that enrolled participants between the ages of 45 and 64 years starting in 1987. (See Am J Epidemiol., 129, 687-702 (1989)). Baseline lipid levels were measured in the ARIC central lipid laboratory using commercial reagents. (See Brown, S. A. et al. Arterioscler Thromb 13, 1139-58 (1993)). Genotype and clinical data were retrieved from the National Center for Biotechnology Information dbGAP server (accession: phs000280.v3.p1).
Genotyping was performed using the Affymetrix 6.0 array (Affymetrix, Santa Clara, Calif.) and subsequently imputed to the Haplotype Reference Consortium using the Michigan Imputation Server. (See Das, S. et al., Nat Genet 48, 1279-83 (2016)). Phasing was performed using the Eagle2 algorithm. (See Loh, P. R. et al., Nat Genet., 48, 1443-8 (2016)). A total of 4,954 variants were removed prior to imputation due to duplication, monomorphism or allele mismatch. Imputation was then performed on 799,246 variants using the minimac3 algorithm and the Haplotype Reference Consortium reference panel. (Loh, P. R. et al., 2016). Individuals were excluded if they had prevalent coronary artery disease at the time of enrollment, were outliers with respect to principal components of ancestry, or were related to another individual in the cohort. A composite CAD endpoint including myocardial infarction, coronary revascularization, and death from coronary causes was used in this study. Endpoint adjudication was performed by committee review of medical records for reported endpoints. (See ARIC manual of operations. No. 2. Cohort component procedures. Chapel Hill: University of North Carolina, ARIC Coordinating Center, School of Public Health, 1987). The polygenic score calculation was based on array-based genotyping data and subsequent imputation.
Within each cohort, individuals were categorized as having low (bottom quintile), intermediate (quintiles 2-4), or high (top quintile) polygenic risk. (See Khera et al., N Engl J Med, 375, 2349-58 (2016)). The relationship of these categories to prevalent CAD was determined using logistic regression, adjusting for principal components of ancestry. Principal components of ancestry are based on observed genotypic differences across individuals; their inclusion as covariates in regression analyses minimizes confounding by ancestry. (Price, A. L. et al., 2006). All UK Biobank validation analyses additionally included a genotyping array indicator variable in regression models. (Bycroft, C. et al., 2017). The association of the polygenic scores with incident events was determined by calculation of absolute incidence rates and subsequent Cox regression analyses adjusted for age, gender, traditional cardiovascular risk factors or scores, and principal components of ancestry as covariates. Discrimination was assessed using C-statistics and reclassification using the net reclassification index. (See Pencina, M. J. et al., Stat Med, 27, 157-72 (2008)). Tests of interaction between the polygenic score and traditional risk factors were performed within Cox regression analyses adjusted for age, gender, and principal components of ancestry.
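The quintile-based categorization described above can be sketched as follows. The function name and example scores are hypothetical, and the rank-based assignment of quintiles (with ties broken by input order) is an assumption of this sketch rather than a statement of how cut points were chosen in the analysis.

```python
from typing import List, Sequence


def risk_categories(scores: Sequence[float]) -> List[str]:
    """Label each score as 'low', 'intermediate', or 'high' polygenic risk.

    Scores are ranked within the cohort; the bottom fifth of ranks is 'low',
    the top fifth is 'high', and the middle three fifths are 'intermediate'.
    """
    n = len(scores)
    if n < 5:
        raise ValueError("at least five scores are needed to form quintiles")
    # Indices of individuals ordered from lowest to highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    categories = [""] * n
    for rank, index in enumerate(order):
        quintile = rank * 5 // n  # 0 (lowest) through 4 (highest)
        if quintile == 0:
            categories[index] = "low"
        elif quintile == 4:
            categories[index] = "high"
        else:
            categories[index] = "intermediate"
    return categories


# Hypothetical scores for ten individuals.
example_scores = [0.3, -1.2, 0.8, 2.1, -0.5, 0.0, 1.4, -2.0, 0.6, -0.1]
example_categories = risk_categories(example_scores)
```

Because the categories are defined within each cohort, the same raw score could fall in different categories in cohorts with different score distributions.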
Analyses were performed using R version 3.2.2 software (The R Foundation).
Using the association statistics of 6,630,150 genetic variants with CAD as input, the LDpred computational algorithm was implemented to derive eleven polygenic scores as previously recommended. (Vilhjálmsson, B. J. et al., 2015). These scores varied in the fraction of variants assumed to be causal for CAD. The relationship of each of the eleven polygenic scores with CAD was next assessed in the UK Biobank Phase I testing dataset comprised of 4,831 individuals with CAD and 115,455 controls. (Klarin, D. et al., 2017). The score assuming a fraction of causal variants of 0.001 (i.e., 0.1% of variants) achieved the highest area under the curve of 0.64 and was used in subsequent validation datasets (
The relationship of the polygenic score to early-onset CAD was examined in the VIRGO-MESA-TAICHI case-control cohort of 6,587 individuals (2,369 cases and 4,218 controls). Mean age was 57 years and 55% of the participants were female. This multiethnic population included 3,081 (47%) white, 1,298 (20%) black, 1,289 (20%) Asian, and 919 (14%) Hispanic participants (eTables 2-3). As compared to those with low polygenic risk, an increased odds of early-onset CAD was noted for both the intermediate (odds ratio 2.14; 95% CI 1.82-2.50) and high (odds ratio 4.79; 95% CI 3.99-5.75) risk categories (
The generalizability of the polygenic score was assessed by testing the association of polygenic risk categories with myocardial infarction in racial subpopulations. Although the score was associated with increased odds of early-onset CAD within each race (p<0.001 for each), the association was strongest in white participants (odds ratio for extreme quintiles 7.41; 95% CI 5.68-9.68) as compared with odds ratio for extreme quintiles of 2.82, 4.71, and 3.17 for Black, Asian, and Hispanic participants respectively (
The association of the polygenic score with prevalent CAD in a middle-aged European cohort was assessed in the UK Biobank Phase II dataset (N=288,980), inclusive of 8,676 individuals with CAD and 280,304 controls (Table 20). Mean age was 57 years and 55% of the cohort was female. Consistent with the observations noted in the testing dataset, an increased odds of CAD was noted for both the intermediate (odds ratio 1.88; 95% CI 1.75-2.03) and high (odds ratio 3.98; 95% CI 3.68-4.30) risk groups (
Among the 280,304 individuals free of CAD at baseline, 4,922 incident coronary events were observed over a median follow-up of 7.0 years (Table 21). Incident event rates were 1.3 (95% CI 1.2-1.5), 2.4 (2.3-2.5), and 4.3 (4.0-4.5) per 1000 person-years for individuals in the low, intermediate, and high polygenic risk categories (
Addition of the polygenic score to a baseline model containing age, sex, and principal components of ancestry led to an improvement in discrimination (increase in C-statistic from 0.733 to 0.759; p<0.001) and reclassification (net reclassification index of 0.36; 95% CI 0.33-0.38; p<0.001). When the baseline model additionally included the traditional cardiovascular risk factors of hypertension, diabetes, current smoking, family history of heart disease, and body-mass index, addition of the polygenic score led to an increase in the C-statistic from 0.762 to 0.783 (p<0.001) and a net reclassification index of 0.33 (95% CI 0.31-0.36; p<0.001).
An individual who is an extreme outlier in the polygenic score distribution may have a risk for CAD at least as great as a carrier of a familial hypercholesterolemia mutation (present in 0.5% of the population). Applicants compared the risk for CAD for those in the top 0.5% of the polygenic score distribution to the remaining 99.5% of the population, noting a substantially increased odds for prevalent CAD (odds ratio 4.46; 95% CI 3.79-5.22) and risk for incident CAD (hazard ratio 3.63; 95% CI 2.87-4.60).
An interaction of the polygenic score with age at baseline was noted (p-interaction<0.001), such that the risk gradient was more pronounced among younger individuals. For example, the hazard ratio for extreme quintiles of the polygenic score was 5.16 (95% CI 3.45-7.74) among individuals <50 years of age, 4.02 (95% CI 3.28-4.92) in those 50 to <60 years, and 2.99 (95% CI 2.66-3.36) among those ≥60 years (Table 22). By contrast, no such interaction was observed based on sex (p=0.66), family history of heart disease (p=0.55), or other cardiovascular risk factors (p>0.05 for each).
Incidence rates are calculated per 1000 person-years of follow-up.
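As a brief illustration of the incidence rate convention noted above, a rate per 1000 person-years is the number of events divided by the total person-years of follow-up, multiplied by 1000. The function name and the example counts below are hypothetical and are not drawn from the study data.

```python
def incidence_rate_per_1000(events: int, person_years: float) -> float:
    """Return the incidence rate expressed per 1000 person-years."""
    if events < 0:
        raise ValueError("events must be non-negative")
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * events / person_years


# Hypothetical example: 13 events observed over 10,000 person-years.
example_rate = incidence_rate_per_1000(13, 10_000.0)
```

In this hypothetical example the rate is 1.3 events per 1000 person-years.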
Additional validation of the association between the polygenic score and incident coronary events was provided in the ARIC prospective cohort: 1,119 incident coronary events were observed in 7,318 white individuals over a median follow-up of 18.9 years. Mean age was 54 years and 54% of the participants were female (Table 23). Incident event rates were 5.6 (95% CI 4.7-6.5), 8.7 (95% CI 8.0-9.3), and 13.5 (95% CI 12.1-15.0) per 1000 person-years for individuals in the low, intermediate, and high polygenic risk categories, respectively (
In the ARIC cohort, addition of the polygenic score to a baseline model containing age, sex, and principal components of ancestry led to an increase in the C-statistic from 0.672 to 0.697 (p<0.001) and a net reclassification index of 0.34 (95% CI 0.28-0.40). When the predicted risk as assessed by the Pooled Cohorts Equations was included in the baseline model containing age, sex, and principal components of ancestry, addition of the polygenic score led to an increase in the C-statistic from 0.726 to 0.739 (p<0.001) and net reclassification index of 0.34 (95% CI 0.28-0.41; p<0.001).
In this study, Applicants derived a new polygenic score for CAD inclusive of 6.6 million genetic variants. This score significantly and substantially improved prediction of CAD over previously published scores that included fewer variants. Individuals with high polygenic risk (top quintile of polygenic score), as compared to those with low polygenic risk (bottom quintile of polygenic score), had increased odds of early-onset CAD (odds ratio 4.79) and prevalent CAD in a middle-aged population-based cohort (odds ratio 3.98). Furthermore, such individuals were at significantly increased risk of incident CAD in both a large European cohort (hazard ratio 3.36) and a United States prospective cohort (hazard ratio 2.78). The polygenic score risk estimates remained significant after adjustment for traditional cardiovascular risk factors and led to an improvement in model discrimination and reclassification.
These results permit several conclusions. First, a polygenic score for CAD provides a continuous and quantitative metric for CAD that stratifies the population into varying trajectories of coronary risk. This stratification remained robust to adjustment for traditional cardiovascular risk factors, including family history of CAD (a product of shared DNA and shared environment), circulating biomarkers, and predicted 10-year risk based on the ACC/AHA Pooled Cohorts Equation. A key advantage of a DNA-based predictor is that the polygenic score can be assessed from the time of birth, well before the discriminative capacity of alternate risk prediction indices such as coronary artery calcification and circulating biomarkers becomes apparent.
Second, this finding reinforces the concept that heritable risk for complex disease may be driven by rare large-effect mutations or the cumulative impact of many small-effect variants. For example, three previous studies have identified a familial hypercholesterolemia mutation in about 0.5% of the population and noted that such individuals are at increased odds for prevalent CAD compared to non-carriers (reported odds ratios of 2.6, 3.3, and 4.2, respectively). (See Benn, M. et al., Eur Heart J., 37, 1384-94 (2016); Abul-Husn, N. S. et al., Science 354, doi: 10.1126/science.aaf7000 (2016); Khera, A. V. et al., J Am Coll Cardiol. 67, 2578-89 (2016)). Applicants demonstrate that, compared to the remaining 99.5% of the population, individuals in the top 0.5% of the polygenic score distribution have an even higher odds ratio for prevalent CAD of 4.5.
Third, new evidence from a multiethnic cohort is provided that the polygenic score can discriminate risk across racial groups. However, consistent with the derivation and validation of this and previous scores in individuals of European ancestry, score performance was best in white individuals as compared to other racial groups. Similar findings were noted in a recent analysis of polygenic scores in predicting height, schizophrenia, and type 2 diabetes. (See Martin, A. R. et al., Am J Hum Genet., 100, 635-49 (2017)). This does not suggest that genetic risk is less important in non-white individuals. Rather, large-scale efforts to refine variant risk estimates in multiethnic populations are warranted and can help ensure that such scores would not propagate health disparities if integrated into clinical practice. (See Popejoy, A. B. et al., Nature. 538(7624), 161-64 (2016)).
Ascertainment of individuals at increased polygenic risk for common diseases may facilitate intensive prevention efforts via lifestyle or pharmacotherapy. Evidence derived from randomized clinical trials suggests that those with increased polygenic risk derive increased absolute and relative coronary risk reduction with statin therapy. (See, Mega, J. L., et al., Lancet 385(9984), 2264-71 (2015), Natarajan, P. et al., Circulation 135, 2091-101 (2017)). Similarly, absolute risk reductions associated with adherence to a healthy lifestyle were highest in the high polygenic risk subgroup. (Khera et al., 2016). This potential utility must be weighed against possible untoward consequences, including increased cost of care, psychological distress or discrimination following genetic risk disclosure, and a sense of fatalism in those at high risk. Additional research is thus needed prior to widespread implementation. (See Green, E. D. et al., Nature 470(7333), 204-13 (2011)).
A key strength of this study involves the use of a recently developed computational approach to derive a comprehensive polygenic score of 6.6 million genetic variants for a complex disease and application to multiple independent datasets. Importantly, none of the CAD cases from the present validation studies were used in score derivation or testing, thus avoiding inflation of test statistics.
The identification of individuals at increased genetic risk for a common, complex disease can facilitate treatment or enhanced screening strategies to prevent disease manifestation. For example, with respect to coronary disease, approximately 1 in 250 individuals carry a rare, large-effect genetic mutation causal for increased low-density lipoprotein cholesterol (N. S. Abul-Husn, et al. Genetic identification of familial hypercholesterolemia within a single U.S. health care system. Science. 354 (2016); A. V. Khera, et al. Diagnostic yield and clinical utility of sequencing familial hypercholesterolemia genes in patients with severe hypercholesterolemia. J Am Coll Cardiol. 67, 2578-2589 (2016); M. Benn, et al. Mutations causative of familial hypercholesterolaemia: screening of 98 098 individuals from the Copenhagen General Population Study estimated a prevalence of 1 in 217. Eur Heart J. 37, 1384-1394 (2016)). A recent analysis in a large U.S. health care system demonstrated that such individuals have an odds ratio for coronary disease of 2.6 when compared to non-carriers and an odds ratio of 3.7 for early-onset disease (N. S. Abul-Husn, et al. Genetic identification of familial hypercholesterolemia within a single U.S. health care system. Science. 354 (2016)). Aggressive treatment to reduce circulating low-density lipoprotein cholesterol levels among carriers of such mutations can reduce coronary disease risk (Nordestgaard B G, et al. Familial hypercholesterolaemia is underdiagnosed and undertreated in the general population: guidance for clinicians to prevent coronary heart disease: consensus statement of the European Atherosclerosis Society. Eur Heart J. 34, 3478-90a (2013)).
Beyond rare monogenic mutations, a decade of genome-wide association studies (GWAS) has demonstrated that common single nucleotide polymorphisms contribute to a range of complex diseases (P.M. Visscher, et al. 10 Years of GWAS discovery: biology, function, and translation. Am J Hum Genet. 101, 5-22 (2017)). However, because the effect size of such polymorphisms tends to be modest, any individual polymorphism has limited utility for risk prediction. Polygenic scores (PS) provide a mechanism for aggregating the cumulative impact of common polymorphisms by summing the number of risk variant alleles in each individual weighted by the impact of each allele on risk of disease (International Schizophrenia Consortium, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 460, 748-752 (2009)). Applicants recently demonstrated that a coronary disease PS consisting of 50 common variants that had achieved genome-wide levels of statistical significance in previous studies can stratify the population into varying trajectories of risk (H. Tada, et al. Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur Heart J. 37, 561-567 (2016); A. V. Khera, et al. Genetic risk, adherence to a healthy lifestyle, and coronary disease. N Engl J Med. 375, 2349-2358 (2016)).
Simulated analyses based on GWAS effect size distributions suggest that the predictive power of such PSs may be markedly improved by considering a genome-wide set of common polymorphisms (N. Chatterjee, et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet. 45, 400-405 (2013); F. Dudbridge. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013); Y. Zhang, et al. https://doi.org/10.1101/175406 (2017)). However, it remains uncertain whether the extreme of a PS distribution can confer risk equivalent to a monogenic mutation (e.g., 4-fold increased risk). Here, Applicants demonstrate that a PS composed of a genome-wide set of common variants permits identification of individuals with 4-fold increased risk for coronary disease and subsequently generalize this approach to two additional complex diseases, breast cancer and severe obesity.
In order to develop an optimized polygenic score for coronary disease, Applicants derived two new PSs and compared them with two previously published scores in a testing dataset of 120,286 individuals of European ancestry from the UK Biobank, comprising 4,831 with coronary disease and 115,455 controls (H. Tada, et al. Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur Heart J. 37, 561-567 (2016); G. Abraham, et al. Genomic prediction of coronary heart disease. Eur Heart J. 37, 3267-3278 (2016); D. Klarin, et al. Genetic analysis in UK Biobank links insulin resistance and transendothelial migration pathways to coronary artery disease. Nat Genet. 49, 1392-1397 (2017)). The UK Biobank is a large observational study that enrolled individuals aged 40 to 69 years from across the United Kingdom beginning in 2006 (C. Sudlow, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015)).
Applicants derived the two new PSs using summary association statistics from our earlier GWAS as a starting point for the relationship of millions of common polymorphisms to risk for coronary disease (Supp. Methods; M. Nikpay, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 47, 1121-1130 (2015)). A reference population of 503 Europeans from the 1000 Genomes study was used to assess the correlation of a given polymorphism with others nearby (‘linkage disequilibrium’) (The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 526, 68-74 (2015)). For the first score, Applicants implemented a ‘pruning and thresholding’ strategy (PSP&T) to combine independent variants (r2<0.8 with other nearby variants) that exceeded nominal significance (p-value<0.05) in the previous GWAS. For the second score, Applicants used the recently developed LDpred computational algorithm (B. J. Vilhjálmsson, et al. Modeling linkage disequilibrium increases accuracy of polygenic scores. Am J Hum Genet. 97, 576-592 (2015)). This involves a Bayesian approach to calculate a posterior mean effect for all variants based on a prior (effect size in the prior GWAS) and subsequent shrinkage based on linkage disequilibrium.
All four scores demonstrated robust association with coronary disease in the testing dataset. But, the newly-derived genome-wide polygenic score of 6.6 million common single nucleotide polymorphisms (PSGW) demonstrated the maximal area-under-the-curve of 0.64 and was selected for use in subsequent analyses (Table 24).
Next, Applicants sought to validate this score in an independent dataset of the remaining 288,980 individuals of European ancestry in the UK Biobank. Mean age was 57 years and 55% of the cohort was female. 8676 (3.0%) of the participants had been diagnosed with coronary disease, defined by verbal interview with a trained nurse or by hospitalization for myocardial infarction or coronary revascularization in the electronic health record prior to enrollment.
Applicants tested the hypothesis that individuals with high PSGW might have risk equivalent to a monogenic coronary disease mutation (e.g., four-fold increased risk) by assessing progressively more extreme tails of the PSGW distribution and comparing risk with the remainder of the population (Table 25;
Coronary disease was noted in 663 of 7225 (9.2%) individuals with high PSGW as compared to 8013 of 281,755 (2.8%) of those in the remainder of the distribution (
In order to assess the generalizability of these observations, Applicants used a similar approach to construct separate PSs for two additional complex diseases with major public health implications—breast cancer and severe obesity. As for coronary disease, Applicants used summary association statistics from large prior GWASs as a starting point for the relationship of common polymorphisms to breast cancer or body-mass index (K. Michailidou, et al. Association analysis identifies 65 new breast cancer risk loci. Nature. 551, 92-94 (2017); A. E. Locke, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 518, 197-206 (2015)).
Among 157,897 females of the UK Biobank validation dataset, 6567 (4.2%) had been diagnosed with breast cancer at the time of enrollment. Individuals with high PS for breast cancer had a 2.9-fold increased risk when compared with the remaining 97.5% of the population (Table 27). Breast cancer was noted in 10.5% of individuals with high PS as compared to 4.0% of those in the remainder of the distribution (
Among 288,018 individuals of the UK Biobank validation dataset with body-mass index available, 5232 (1.8%) were severely obese at the time of enrollment, defined as body-mass index≥40 kg/m2. Individuals with high PS had a 5.5-fold increased risk of severe obesity when compared with the remaining 97.5% of the population (Table 28). Severe obesity was noted in 8.4% of individuals with high body-mass index PS as compared to 1.6% of those in the remainder of the distribution (
For three common diseases, Applicants demonstrate that the incorporation of a genome-wide set of common polymorphisms into a PS can identify subsets of the population at substantially increased risk.
These results permit several conclusions. First, Applicants provide empiric evidence that the cumulative impact of common polymorphisms on risk of disease can approach that of rare, monogenic mutations. The predictive capacity of PSs will likely continue to improve as larger discovery GWAS studies more precisely define the effect sizes for common polymorphisms across the genome (N. Chatterjee, et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet. 45, 400-405 (2013); F. Dudbridge. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013); Y. Zhang, et al. doi.org/10.1101/175406 (2017)). Second, high PSGW seems operable in a much larger fraction of the population as compared to rare monogenic mutations. For coronary disease, the largest gene-sequencing study to date identified a monogenic driver mutation related to increased low-density lipoprotein cholesterol in 94 of 12,298 (0.76%) afflicted individuals (N. S. Abul-Husn, et al. Genetic identification of familial hypercholesterolemia within a single U.S. health care system. Science. 354 (2016)). Here, Applicants identify high PSGW in 7.6% of individuals with coronary disease, a prevalence an order of magnitude higher. Third, traditional risk factor differences of high PSGW individuals versus the remainder of the distribution are modest and these individuals would thus be difficult to identify without direct genotyping. Fourth, a key advantage of a DNA-based diagnostic such as PSGW is that it can be assessed from the time of birth, well before the discriminative capacity of most traditional risk factors emerges, and may thus facilitate intensive prevention efforts. For example, Applicants recently demonstrated that high polygenic risk for coronary disease may be offset by adherence to a healthy lifestyle or cholesterol-lowering therapy with statin medications (A. V. Khera, et al. Genetic risk, adherence to a healthy lifestyle, and coronary disease. N Engl J Med. 375, 2349-2358 (2016); J. L. Mega, et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy: an analysis of primary and secondary prevention trials. Lancet. 385, 2264-2271 (2015); P. Natarajan, et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation. 135, 2091-2101 (2017)). Finally, Applicants demonstrate similar patterns for two additional heritable diseases, breast cancer and severe obesity, suggesting that this approach will provide a generalizable framework for risk stratification across a range of common, complex diseases.
In order to determine which of several polygenic risk score (PS) approaches yielded the maximal coronary disease risk discrimination, Applicants applied various PS to a testing dataset from the UK Biobank (D. Klarin, et al. Genetic analysis in UK Biobank links insulin resistance and transendothelial migration pathways to coronary artery disease. Nat Genet. 49, 1392-1397 (2017)). The UK Biobank is a large prospective cohort study that enrolled individuals from across the United Kingdom, aged 40-69 years at time of recruitment, starting in 2006 (C. Sudlow, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015)). Individuals underwent a series of anthropometric measurements and surveys, including medical history review with a trained nurse. The testing dataset was comprised of 120,286 individuals of European ancestry, including 4,831 participants with prevalent coronary disease and 115,455 controls.
Polygenic scores provide a quantitative metric of an individual's inherited risk based on the cumulative impact of many variants. Weights are generally assigned to each genetic variant according to the strength of its association with disease risk (effect estimate). Individuals are scored based on how many risk alleles they have for each variant (e.g., 0, 1, or 2 copies) included in the polygenic score.
Applicants tested four distinct approaches to PS derivation, ultimately choosing the best score in an independent testing dataset for subsequent analysis in the validation cohort.
First, Applicants applied a previously reported PS of 50 common genetic variants that had achieved genome-wide levels of statistical significance in earlier studies (H. Tada, et al. Risk prediction by genetic risk scores for coronary heart disease is independent of self-reported family history. Eur Heart J. 37, 561-567 (2016); A. V. Khera, et al. Genetic risk, adherence to a healthy lifestyle, and coronary disease. N Engl J Med. 375, 2349-2358 (2016)). Our prior work demonstrated that this score was predictive of incident coronary disease events in prospective cohort studies of >50,000 individuals.
Second, Applicants applied a PS comprised of 49,310 genetic variants that was derived from a 2013 CARDIoGRAMplusC4D genome-wide association study (GWAS) based on the Metabochip genotyping array (G. Abraham, et al. Genomic prediction of coronary heart disease. Eur Heart J. 37, 3267-3278 (2016)). To avoid redundancy due to linkage disequilibrium (LD), the correlation in inheritance pattern of nearby variants, the reported summary association statistics were thinned based on various LD r2 values. An r2 value of 0.7 was determined to be the optimal threshold via empiric testing of a range of values in an independent dataset. This score was previously shown to predict incident coronary disease events in multiple distinct cohorts (G. Abraham, et al. Genomic prediction of coronary heart disease. Eur Heart J. 37, 3267-3278 (2016)).
Third, Applicants computed a new score using a p-value and LD-driven clumping procedure in PLINK version 1.90b (C. C. Chang, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 4, 7 (2015)). Input included summary coronary disease association statistics for 8.3 million SNPs from the 2015 CARDIoGRAMplusC4D 1000 Genomes imputed GWAS of primarily European individuals and a reference LD panel of 503 European samples from 1000 Genomes phase 3 version 5 (M. Nikpay, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 47, 1121-1130 (2015); The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 526, 68-74 (2015)). In brief, the algorithm forms clumps around SNPs with association p-values less than a provided threshold. Each clump contains all SNPs within 250 kb of the index SNP that are also in LD with the index SNP as determined by a provided r2 threshold in the LD reference population. The algorithm iteratively cycles through all index SNPs, beginning with the smallest p-value, only allowing each SNP to appear in one clump. The final output contains, for each LD-based clump across the genome, the SNP most significantly associated with coronary disease. A PS was built containing the index SNPs of each clump with association estimate betas (log odds) as weights. PSs were created over a range of p-value (1, 0.5, 0.05, 5×10⁻⁴, 5×10⁻⁶, 5×10⁻⁸) and r2 (0.2, 0.4, 0.6, 0.8) thresholds. The best score for this approach was chosen based on maximal area-under-the-curve (AUC) in the testing dataset. This score was based on a p-value for statistical significance in the original GWAS of <0.05 and r2 value of <0.8.
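The clumping procedure described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the function name `clump`, the tuple layout, the pairwise r2 lookup table, and the example variant identifiers are all hypothetical, and the sketch assumes that pairs absent from the lookup table are uncorrelated.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical record layout: (variant id, chromosome, base-pair position, GWAS p-value)
Snp = Tuple[str, str, int, float]


def clump(snps: List[Snp],
          ld_r2: Dict[FrozenSet[str], float],
          p_threshold: float = 0.05,
          r2_threshold: float = 0.8,
          window_bp: int = 250_000) -> List[str]:
    """Greedy clumping sketch: return index SNP ids, most significant first."""
    # Only SNPs passing the p-value threshold can seed (or, in this sketch, join) a clump.
    candidates = sorted((s for s in snps if s[3] < p_threshold), key=lambda s: s[3])
    assigned = set()
    index_snps = []
    for snp_id, chrom, pos, _ in candidates:
        if snp_id in assigned:
            continue  # already absorbed into an earlier, more significant clump
        index_snps.append(snp_id)
        assigned.add(snp_id)
        for other_id, other_chrom, other_pos, _ in candidates:
            if other_id in assigned:
                continue
            if other_chrom != chrom or abs(other_pos - pos) > window_bp:
                continue
            # Assumption: pairs missing from the table have r2 = 0.
            if ld_r2.get(frozenset((snp_id, other_id)), 0.0) >= r2_threshold:
                assigned.add(other_id)
    return index_snps


# Toy input with made-up identifiers and values.
example_snps = [
    ("var_a", "9", 22_000_000, 1e-10),
    ("var_b", "9", 22_050_000, 1e-6),   # in strong LD with var_a
    ("var_c", "9", 22_100_000, 1e-4),   # nearby but weakly correlated
    ("var_d", "1", 1_000_000, 0.2),     # fails the p-value threshold
]
example_ld = {
    frozenset(("var_a", "var_b")): 0.9,
    frozenset(("var_a", "var_c")): 0.1,
}
example_index = clump(example_snps, example_ld)
```

In this toy input, var_b is absorbed into the clump led by var_a, var_c seeds its own clump, and var_d is excluded by the p-value threshold. The quadratic inner loop is kept for readability only and would not scale to millions of variants.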
Fourth, Applicants computed another new score using the recently developed LDpred computational algorithm (B. J. Vilhjálmsson, et al. Modeling linkage disequilibrium increases accuracy of polygenic scores. Am J Hum Genet. 97, 576-592 (2015)). LDpred creates a polygenic score using genome-wide variation with weights derived from a set of GWAS summary statistics. Unlike other methods that use variants most strongly associated with disease risk or a set of independent variants across the genome, LDpred includes all available variants in the derived risk score by shrinking effect estimate weights (log-odds) based on an external LD reference panel. This Bayesian approach calculates a posterior mean effect size for each variant based on a prior (association with coronary disease in the 2015 CARDIoGRAMplusC4D GWAS) and subsequent shrinkage based on the extent to which this variant is correlated with similarly associated variants in a reference population of 503 European samples from 1000 Genomes phase 3 version 5 (M. Nikpay, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 47, 1121-1130 (2015); The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 526, 68-74 (2015)). The underlying Gaussian distribution additionally considers the fraction of causal markers (i.e., those with non-zero effect sizes), referred to as ρ. Because this fraction is unknown for any given disease, a range of 7 plausible values was trialed in the testing dataset. Single nucleotide polymorphisms (SNPs) with ambiguous strand (A/T or C/G) or minor allele frequency less than 1% were removed from the score derivation. This left 6,630,150 variants available for inclusion. In accordance with recommendations from the LDpred authors, a linkage disequilibrium radius was set at 2210 variants, equivalent to the number of SNPs used as input divided by 3000. 
A range of ρ, the fraction of causal variants, was used—1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001—along with an infinitesimal (each variant assumed to contribute to disease risk) and unweighted model (raw log-odds for all variants input). The score with maximal AUC in the testing dataset (ρ=0.001) was carried forward in subsequent analysis.
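To make the role of ρ concrete, the following Python sketch computes a posterior mean effect for a single marker under a point-normal prior in the simplified case where linkage disequilibrium is ignored. It is a textbook Bayesian calculation offered for illustration, not the LDpred software, which additionally models LD among nearby variants. The function name, the argument names, and the assumption that effects are on a standardized scale with sampling variance 1/n are all introduced here for illustration.

```python
import math


def posterior_mean_no_ld(beta_hat: float, n: int, m: int, h2: float, rho: float) -> float:
    """Posterior mean of one marker effect under a point-normal prior, ignoring LD.

    Assumed model (illustrative only):
      with probability rho the true effect ~ Normal(0, h2 / (m * rho)), otherwise it is 0;
      the marginal estimate beta_hat ~ Normal(true effect, 1 / n).
    """
    var_causal = h2 / (m * rho)   # prior variance of a causal effect
    var_noise = 1.0 / n           # sampling variance of the GWAS estimate

    def normal_pdf(x: float, var: float) -> float:
        return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

    like_causal = rho * normal_pdf(beta_hat, var_causal + var_noise)
    like_null = (1.0 - rho) * normal_pdf(beta_hat, var_noise)
    total = like_causal + like_null
    # If both densities underflow, the estimate is so extreme that the wider
    # (causal) component dominates; treat the marker as causal in that case.
    prob_causal = like_causal / total if total > 0.0 else 1.0

    shrink = var_causal / (var_causal + var_noise)   # Gaussian shrinkage factor
    return prob_causal * shrink * beta_hat
```

With ρ = 1 the mixture collapses to a single Gaussian and every estimate is shrunk by the same factor; with small ρ, weak estimates are pulled strongly toward zero while strong estimates are retained.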
Scores were generated by multiplying the genotype dosage of each risk allele for each variant by its respective weight, and then summing across all variants in the score. Incorporating genotype dosages accounts for uncertainty in genotype imputation. All calculations were performed using the Hail software platform (https://github.com/hail-is/hail). Over 99.9% of variants in the LDpred-derived polygenic scores were available for scoring purposes in the testing dataset with sufficient imputation quality (INFO>0.3).
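The scoring step above amounts to a weighted sum, sketched here in plain Python. Applicants performed the calculation in Hail; this sketch does not use Hail, and the function name, dictionary layout, and example values are hypothetical.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of (risk-allele dosage x weight) over variants present in both mappings.

    dosages: variant id -> imputed risk-allele dosage, a value between 0 and 2
    weights: variant id -> per-allele weight (e.g., a log odds ratio)
    Variants in the score that are missing from the genotype data are skipped.
    """
    return sum(dosages[v] * w for v, w in weights.items() if v in dosages)


# Made-up weights and dosages for three variants.
example_weights = {"var_1": 0.12, "var_2": 0.05, "var_3": 0.30}
example_dosages = {"var_1": 1.98, "var_2": 0.0, "var_3": 1.02}
example_score = polygenic_score(example_dosages, example_weights)
```

Using fractional dosages such as 1.98 rather than hard-called genotypes is what allows imputation uncertainty to carry through to the score.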
The validation cohort was comprised of 288,980 UK Biobank participants distinct from those in the testing dataset described above. Individuals in the UK Biobank underwent genotyping with one of two closely related custom arrays (UK BiLEVE Axiom Array or UK Biobank Axiom Array) consisting of over 800,000 genetic markers scattered across the genome. Additional genotypes were imputed centrally using the Haplotype Reference Consortium resource as previously reported (C. Bycroft, et al. Genome-wide genetic data on ~500,000 UK Biobank participants. doi.org/10.1101/166298 (2017)). In order to analyze individuals with a relatively homogeneous ancestry and owing to small percentages of non-British individuals, the present analysis was restricted to individuals of white British ancestry. This subpopulation was constructed centrally using a combination of self-reported ancestry and genetically confirmed ancestry using principal components. Additional exclusion criteria included outliers for heterozygosity or genotype missingness, discordant reported versus genotypic sex, putative sex chromosome aneuploidy, or withdrawal of informed consent. Each of these parameters was derived centrally as previously reported (C. Bycroft, et al. Genome-wide genetic data on ~500,000 UK Biobank participants. doi.org/10.1101/166298 (2017)).
The 288,980 remaining participants served as the validation dataset for the prevalent coronary disease analysis. Current smoking, lipid-lowering medication use, and parental history of heart disease were determined by self-report at the time of the enrollment survey. Diabetes mellitus, hypertension, and dyslipidemia were assessed based on self-report or a hospitalization diagnosis code reflecting these conditions prior to the date of UK Biobank enrollment.
Diagnosis of prevalent coronary disease was based on a composite of myocardial infarction or coronary revascularization. Data from hospital admissions was available via the Hospital Episode Statistics for England, Scottish Morbidity Record, and Patient Episode Database for Wales. Myocardial infarction was based on self-report or hospital admission diagnosis, as performed centrally. This included individuals with ICD-9 codes of 410.X, 411.0, 412.X, 429.79 or ICD-10 codes of I21.X, I22.X, I23.X, I24.1, I25.2 in hospitalization records.
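The code lists above can be applied to a hospitalization record with a simple lookup, sketched below in Python. The function name and the dotted code format are assumptions made for illustration, as is the reading of ".X" as "the three-character category with or without any subcode"; the sketch covers only the myocardial infarction codes listed above.

```python
# Categories written as "410.X" etc. in the text: any subcode within the category matches.
ICD9_MI_CATEGORIES = {"410", "412"}
ICD9_MI_EXACT = {"411.0", "429.79"}
ICD10_MI_CATEGORIES = {"I21", "I22", "I23"}
ICD10_MI_EXACT = {"I24.1", "I25.2"}


def is_mi_code(code: str, system: str) -> bool:
    """Return True if a dotted ICD code falls in the myocardial infarction lists.

    system must be "ICD9" or "ICD10" (hypothetical labels used by this sketch).
    """
    code = code.strip().upper()
    if system == "ICD9":
        categories, exact = ICD9_MI_CATEGORIES, ICD9_MI_EXACT
    elif system == "ICD10":
        categories, exact = ICD10_MI_CATEGORIES, ICD10_MI_EXACT
    else:
        raise ValueError("system must be 'ICD9' or 'ICD10'")
    category = code.split(".")[0]   # text before the decimal point
    return category in categories or code in exact
```

Source data that store codes without the decimal point would need to be normalized before being passed to this function.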
Applicants sought to generalize the approach to polygenic score derivation, testing, and validation for two additional complex traits: breast cancer and severe obesity. Polygenic scores for breast cancer were created using the pruning and thresholding approach noted above. Input included summary association statistics from the 2017 OncoArray Consortium GWAS and a reference LD panel of 503 European samples from 1000 Genomes phase 3 version 5 (The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 526, 68-74 (2015); K. Michailidou, et al. Association analysis identifies 65 new breast cancer risk loci. Nature. 551, 92-94 (2017)). Owing to few male participants with breast cancer, analyses were restricted to female participants for both the testing and validation datasets. Prevalent breast cancer was based on self-report in interview with a trained nurse or a hospitalization for breast cancer prior to enrollment. The testing dataset was comprised of 63,349 individuals, of whom 2576 (4.1%) had been diagnosed with breast cancer. A PS based on variant pruning (r2<0.2) and a p-value for statistical significance in the original GWAS of <0.0005 obtained the highest AUC of 0.62 (odds ratio per standard deviation increment 1.54, 95% confidence interval 1.48-1.61) and was used in subsequent validation dataset analyses. 157,897 participants in the UK Biobank validation dataset were female (54.7%), of whom 6,567 (4.2%) had been diagnosed with breast cancer.
Polygenic scores for obesity were created using the pruning and thresholding and LDpred approaches as noted above. Input included summary association statistics from the 2015 Genome-Wide Investigation of Anthropometric Traits (GIANT) GWAS and a reference LD panel of 503 European samples from 1000 Genomes phase 3 version 5 (The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 526, 68-74 (2015); A. E. Locke, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 518, 197-206 (2015)). As for coronary disease, the relationship of each score to severe obesity was determined in the testing dataset of 120,286 individuals, of whom 2,417 were diagnosed with severe obesity on the basis of body-mass index≥40 kg/m2. The best score was chosen based on maximal AUC in this testing dataset. A score of 2,100,303 variants based on the LDPred algorithm (ρ=0.03) obtained the highest AUC of 0.72 (odds ratio per standard deviation increment of 2.27; 95% confidence interval 2.17-2.36) and was used in the subsequent validation dataset analyses. Body-mass index was available in 288,018 of 288,980 (99.7%) of the validation dataset used for coronary disease, and these individuals served as the validation cohort for the severe obesity analysis.
Multiple PSs were generated using the approaches described above, and scores were computed for each individual in the UK Biobank testing dataset. The discriminative capacity of each score was tested by calculating the AUC of a logistic regression model predicting coronary disease status with additional adjustment for the first four principal components of ancestry. Odds ratio per standard deviation increment was additionally determined to facilitate comparison across scores and to previous studies.
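The AUC itself can be computed from any set of predicted values and case/control labels with a rank-based formula, sketched below in Python. The sketch covers only the AUC calculation; fitting the logistic regression model with the principal components of ancestry is assumed to have been done elsewhere, and the function name and toy values are hypothetical.

```python
from typing import List


def auc(predictions: List[float], labels: List[int]) -> float:
    """Rank-based AUC: chance that a random case outranks a random control (ties count half)."""
    n = len(predictions)
    order = sorted(range(n), key=lambda i: predictions[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied predictions and give each its average rank.
        j = i
        while j + 1 < n and predictions[order[j + 1]] == predictions[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_cases = sum(1 for y in labels if y == 1)
    n_controls = n - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("AUC needs at least one case and one control")
    case_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (case_rank_sum - n_cases * (n_cases + 1) / 2.0) / (n_cases * n_controls)


# Toy predicted probabilities for two controls (0) and two cases (1).
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from controls.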
In the validation cohort, Applicants tested the hypothesis that individuals in the extreme of the PS distribution might have a four-fold increased risk of coronary disease as compared to the remainder of the population. Starting with the top 20% of the PS distribution versus all others, Applicants tested progressively more extreme segments of the distribution until a four-fold risk increase was noted. This assessment was performed via a logistic regression model that adjusted for age, sex, genotyping array, and the first four principal components of ancestry. Baseline characteristics of those with high PS versus the remainder of the population were tabulated and compared via t-test for continuous variables and chi-square test for categorical variables. A second model adjusting for traditional cardiovascular risk factors (diabetes mellitus, hypertension, smoking status, hypercholesterolemia, family history of heart disease, and body mass index) was then constructed.
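As a rough illustration of the tail-versus-remainder comparison, the Python sketch below computes a crude (unadjusted) odds ratio from a 2x2 table, using the coronary disease counts reported above for the high PSGW group (663 of 7225) and the remainder (8013 of 281,755). This is not the Applicants' model: the estimates described above come from logistic regression adjusted for age, sex, genotyping array, and principal components of ancestry, so a crude ratio should be expected to differ from them. The function name is hypothetical.

```python
def odds_ratio(cases_high: int, n_high: int, cases_rest: int, n_rest: int) -> float:
    """Crude odds ratio of disease for the high-score group versus the remainder."""
    controls_high = n_high - cases_high
    controls_rest = n_rest - cases_rest
    if min(cases_high, controls_high, cases_rest, controls_rest) <= 0:
        raise ValueError("every cell of the 2x2 table must be positive")
    return (cases_high / controls_high) / (cases_rest / controls_rest)


# Counts taken from the validation dataset described in the text.
crude_or = odds_ratio(663, 7225, 8013, 281755)
```

Repeating the same calculation for progressively smaller top fractions of the score distribution mirrors the stepwise search described above, again without the covariate adjustment.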
To assess for a gradient of risk for prevalent disease across the PS distribution, individuals were binned into groupings of 2.5% of the population and prevalence of coronary disease tabulated. Analyses for severe obesity and breast cancer were conducted in a similar fashion.
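The binning step can be sketched in Python as follows: individuals are sorted by score, split into 40 equal-sized groups (each 2.5% of the population), and the fraction with disease is tabulated per group. The function name and the handling of group boundaries are assumptions made for illustration.

```python
from typing import List


def prevalence_by_bin(scores: List[float], has_disease: List[int], n_bins: int = 40) -> List[float]:
    """Disease prevalence within consecutive score-ranked bins (40 bins = 2.5% each)."""
    n = len(scores)
    if n < n_bins:
        raise ValueError("need at least one individual per bin")
    order = sorted(range(n), key=lambda i: scores[i])   # lowest score first
    prevalence = []
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        members = order[start:stop]
        prevalence.append(sum(has_disease[i] for i in members) / len(members))
    return prevalence


# Toy data: 80 individuals, disease only in the two highest-scoring people.
toy_scores = [float(i) for i in range(80)]
toy_disease = [1 if s >= 78 else 0 for s in toy_scores]
toy_prevalence = prevalence_by_bin(toy_scores, toy_disease)
```

The returned list runs from the lowest-scoring bin to the highest, so a rising sequence indicates a gradient of risk across the score distribution.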
A key public health need is to identify individuals at high risk for a given disease to enable enhanced screening or preventive therapies. Because most common diseases have a genetic component, one important approach is to stratify individuals based on inherited DNA variation. Proposed clinical applications have largely focused on finding carriers of rare monogenic mutations at several-fold increased risk. Although most disease risk is polygenic in nature, it has not yet been possible to use polygenic predictors to identify individuals at risk comparable to monogenic mutations. This example shows exemplary methods for developing and validating genome-wide polygenic scores for five common diseases. The approach identified 8.0% of the population at greater than three-fold increased risk for coronary artery disease (CAD). For CAD, this prevalence was 20-fold higher than the carrier frequency of rare monogenic mutations conferring comparable risk.
For various common diseases, genes have been identified in which rare mutations confer several-fold increased risk in heterozygous carriers. An important example is the presence of a familial hypercholesterolemia mutation in 0.4% of the population, which confers an up to 3-fold increased risk for coronary artery disease (CAD). Aggressive treatment to lower circulating cholesterol levels among such carriers can significantly reduce risk. Another example is the p.E508K missense mutation in HNF1A, with carrier frequency of 0.1% of the general population and 0.7% of Latinos, which confers up to 5-fold increased risk for type 2 diabetes. Although ascertainment of monogenic mutations can be highly relevant for carriers and their families, the vast majority of disease occurs in those without such mutations.
For most common diseases, polygenic inheritance, involving many common genetic variants of small effect, plays a greater role than rare monogenic mutations. Previous efforts to create genome-wide polygenic scores (GPS) had only limited success, providing insufficient risk stratification for clinical utility (for example, identifying 20% of a population at 1.4-fold increased risk relative to the rest of the population). These initial efforts were hampered by three challenges: (i) the small size of initial genome-wide association studies (GWAS), which affected the precision of the estimated impact of individual variants on disease risk; (ii) limited computational methods for creating GPS; and (iii) lack of large datasets needed to validate and test GPS.
Using much larger studies and improved algorithms, this example shows that a GPS can identify subgroups of the population with risk approaching or exceeding that of a monogenic mutation. With this approach, we studied CAD.
For CAD, we created several candidate GPS based on summary statistics and imputation from recent large GWAS in participants of primarily European ancestry (Table 30). Specifically, we derived 24 predictors based on a pruning and thresholding method and 7 additional predictors using the recently described LDPred algorithm (
The UK Biobank has genotype data and extensive phenotypic information on 409,258 participants of British ancestry (average age 57 years; 55% female). The Best predictors
Table 30. Genome-wide polygenic score derivation and testing for five common, complex diseases. GWAS, genome-wide association study; AUC, area under the receiver-operator curve; GPS, genome-wide polygenic score. AUC was determined using a logistic regression model adjusted for age, sex, genotyping array, and the first four principal components of ancestry. Breast cancer analysis was restricted to female participants. For the LDPred algorithm, the tuning parameter ρ reflects the proportion of polymorphisms assumed to be causal for the disease. For the pruning and thresholding strategy, r2 reflects degree of independence from other variants in the linkage disequilibrium reference panel and p reflects the p-value noted for a given variant in the discovery GWAS.
Coronary artery disease: LDPred, ρ = 0.001; 6,629,369/6,630,150 polymorphisms (99.99%); AUC 0.806 (0.800-0.813)
We used an initial validation dataset of the 120,280 participants in the UK Biobank Phase 1 genotype data release to select the GPS with the best performance, defined as the maximum area under the receiver-operator curve (AUC). We then assessed the performance in an independent testing set comprised of the 288,978 participants in the UK Biobank Phase 2 genotype data release. For each disease, the discriminative capacity within the testing dataset was nearly identical to that observed in the validation dataset.
Taking CAD as an example, our polygenic predictors were derived from a GWAS involving 184,305 participants and evaluated based on their ability to detect the participants in the UK Biobank validation dataset diagnosed with CAD (Table 30). The predictors had AUC ranging from 0.79-0.81 in the validation set, with the best predictor (GPSCAD) involving 6,630,150 variants (Table 31). This predictor performed equivalently well in the testing dataset, with AUC of 0.81. The variants in the predictor are shown in Table D.
We then investigated whether our polygenic predictor, GPSCAD, could identify individuals at similar risk to the 3-fold increased risk conferred by a familial hypercholesterolemia mutation. Across the population, GPSCAD is normally distributed with the empirical risk of CAD rising sharply in the right tail of the distribution, from 0.8% in the lowest percentile to 11.1% in the highest percentile.
We found that 8% of the population had inherited a genetic predisposition that conferred ≥3-fold increased risk for CAD (Table 33).
Strikingly, the polygenic score identified 20-fold more people than found by familial hypercholesterolemia mutations in previous studies, at comparable or greater risk. Moreover, 2.3% of the population (‘carriers’) inherited ≥4-fold increased risk for CAD and 0.5% (‘carriers’) had inherited ≥5-fold increased risk. GPSCAD performed substantially better than two previously published polygenic scores for coronary artery disease that included 50 and 49,310 variants, respectively (Table 34).
GPSCAD has the advantage that it can be assessed from the time of birth, well before the discriminative capacity emerges for risk factors (for example, hypertension or type 2 diabetes) used in clinical practice to predict CAD. Moreover, even for our middle-aged study population, practicing clinicians could not identify the 8% of individuals at ≥3-fold risk based on GPSCAD in the absence of genotype information (Table 35).
For example, among conventional risk factors, hypercholesterolemia was present in 20% of those with ≥3-fold risk based on GPSCAD versus 13% of those in the remainder of the distribution, hypertension in 32% versus 28%, and family history of heart disease in 44% versus 35%. Making high GPSCAD individuals aware of their inherited susceptibility may facilitate intensive prevention efforts. For example, we previously showed that a high polygenic risk for CAD may be offset by either of two interventions: adherence to a healthy lifestyle or cholesterol-lowering therapy with statin medications.
The results above show that, for a number of common diseases, polygenic risk scores can now identify a substantially larger fraction of the population than found by rare monogenic mutations, at comparable or greater disease risk. Our validation and testing were performed in the UK Biobank population. Individuals who volunteered for the UK Biobank tended to be healthier than the general population; although this nonrandom ascertainment is likely to deflate disease prevalence, the relative impact of genetic risk strata can be generalizable across study populations. Additional studies are warranted to develop polygenic risk scores for many other common diseases with large GWAS data and to validate risk estimates within population biobanks and clinical health systems.
Polygenic risk scores differ in important ways from the identification of rare monogenic risk factors. Whereas identifying carriers of rare monogenic mutations requires sequencing of specific genes and careful interpretation of the functional effects of mutations found, polygenic scores can be readily calculated for many diseases simultaneously, based on data from a single genotyping array.
The potential to identify individuals at significantly higher genetic risk, across a wide range of common diseases and at any age, poses a number of opportunities for clinical medicine. Prevention and detection strategies may have utility regardless of underlying mechanism, as is the case for statin therapy for CAD, blood-thinning medications to prevent stroke in those with atrial fibrillation, or intensified mammography screening for breast cancer.
Polygenic scores provide a quantitative metric of an individual's inherited risk based on the cumulative impact of many common polymorphisms. Weights are generally assigned to each genetic variant according to the strength of its association with disease risk (effect estimate). Individuals are scored based on how many risk alleles they have for each variant (for example, 0, 1, or 2 copies) included in the polygenic score.
For our score derivation, we used summary statistics from recent GWAS studies conducted primarily among participants of European ancestry for five diseases and a linkage disequilibrium reference panel of 503 European samples from 1000 Genomes phase 3 version 5. UK Biobank samples were not included in any of the five discovery GWAS studies. DNA polymorphisms with ambiguous strand (A/T or C/G) were removed from the score derivation. For each disease, we computed a set of candidate genome-wide polygenic scores (GPS) using two derivation strategies: the LDPred algorithm and pruning and thresholding.
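The strand-ambiguity filter mentioned above can be illustrated with a minimal sketch. It is not the pipeline used in this disclosure; the function names and the tuple layout of the variant records are assumptions made for the example.

```python
# An A/T or C/G polymorphism reads the same on both DNA strands, so its
# risk allele cannot be aligned between datasets from allele labels alone.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """Return True when the two alleles are complementary (A/T or C/G)."""
    a1, a2 = allele1.upper(), allele2.upper()
    return COMPLEMENT.get(a1) == a2


def drop_ambiguous(variants):
    """Keep only variants whose alleles are not complementary.

    `variants` is an iterable of (variant_id, allele1, allele2) tuples,
    a hypothetical layout chosen for this sketch.
    """
    return [v for v in variants if not is_strand_ambiguous(v[1], v[2])]
```

Multi-base alleles never match the single-base lookup, so this sketch leaves them in place.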
The LDPred computational algorithm was used to generate seven candidate GPSs for each disease. This Bayesian approach calculates a posterior mean effect size for each variant based on a prior and subsequent shrinkage based on the extent to which this variant is correlated with similarly associated variants in the reference population. The underlying Gaussian distribution additionally considers the fraction of causal (i.e., non-zero effect size) markers via a tuning parameter, ρ. Because ρ is unknown for any given disease, a range of values for ρ was used: 1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001.
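The shrinkage idea can be illustrated in the special case of unlinked markers, where a point-normal prior gives a closed-form posterior mean. The sketch below is not the LDPred implementation, which additionally uses the linkage disequilibrium reference panel. It assumes standardized effect estimates with sampling variance 1/n, and the heritability input `h2` and all function and parameter names are assumptions introduced for the example.

```python
import math


def posterior_mean_no_ld(beta_hat, n, m, h2, rho):
    """Posterior mean effect of one variant, ignoring LD (simplified case).

    beta_hat: marginal effect estimate on a standardized scale (assumed)
    n: discovery GWAS sample size
    m: number of variants considered
    h2: heritability explained by the variants (an assumed input)
    rho: assumed fraction of causal variants, 0 < rho <= 1
    """
    var_causal = h2 / (m * rho)   # prior variance of a non-zero effect
    var_noise = 1.0 / n           # sampling variance of beta_hat
    var_marginal = var_causal + var_noise

    # Posterior probability that the variant is causal, computed in log space.
    if rho >= 1.0:
        prob_causal = 1.0
    else:
        log_causal = (math.log(rho) - 0.5 * math.log(var_marginal)
                      - 0.5 * beta_hat ** 2 / var_marginal)
        log_null = (math.log(1.0 - rho) - 0.5 * math.log(var_noise)
                    - 0.5 * beta_hat ** 2 / var_noise)
        prob_causal = 1.0 / (1.0 + math.exp(min(log_null - log_causal, 700.0)))

    shrinkage = var_causal / var_marginal   # equals h2 / (h2 + m * rho / n)
    return prob_causal * shrinkage * beta_hat


# The seven values of rho listed in the text.
RHO_GRID = [1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001]
```

With ρ = 1 every variant is shrunk by the same factor; with small ρ, weak estimates are pulled almost to zero while strong estimates are left nearly intact.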
A second approach, pruning and thresholding, was used to build an additional 24 candidate GPSs. Pruning and thresholding scores were built using a p-value and LD-driven clumping procedure in PLINK version 1.90b (clump). In brief, the algorithm forms clumps around SNPs with association p-values less than a provided threshold. Each clump contains all SNPs within 250 kb of the index SNP that are also in LD with the index SNP as determined by a provided r² threshold in the LD reference. The algorithm iteratively cycles through all index SNPs, beginning with the smallest p-value, only allowing each SNP to appear in one clump. The final output contains the most significantly disease-associated SNP for each LD-based clump across the genome. A GPS was built containing the index SNPs of each clump with association estimate betas (log odds) as weights. GPSs were created over a range of p-value (1, 0.5, 0.05, 5×10⁻⁴, 5×10⁻⁶, 5×10⁻⁸) and r² (0.2, 0.4, 0.6, 0.8) thresholds, for a total of 24 pruning and thresholding-based candidate scores for each disease. The resulting GPS for a p-value threshold of 5×10⁻⁸ and r² of <0.2 was denoted the ‘GWAS significant variant’ derivation strategy.
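The clumping procedure described above can be sketched as a greedy loop. This is an illustration of the described steps, not PLINK's implementation; the record layout, the caller-supplied `r2` lookup function, and the function name `clump` are assumptions.

```python
def clump(snps, r2, p_threshold, r2_threshold, window_bp=250_000):
    """Greedy p-value- and LD-driven clumping.

    snps: list of dicts with keys "id", "chrom", "pos", "p" (hypothetical layout)
    r2:   function (id_a, id_b) -> r-squared in the LD reference panel,
          supplied by the caller
    Returns the ids of the index SNPs, most significant first.
    """
    # Index candidates are SNPs passing the p-value threshold, best first.
    candidates = sorted((s for s in snps if s["p"] < p_threshold),
                        key=lambda s: s["p"])
    assigned = set()      # each SNP may appear in only one clump
    index_snps = []
    for index in candidates:
        if index["id"] in assigned:
            continue
        index_snps.append(index["id"])
        assigned.add(index["id"])
        for other in snps:
            if other["id"] in assigned:
                continue
            if other["chrom"] != index["chrom"]:
                continue
            if abs(other["pos"] - index["pos"]) > window_bp:
                continue
            if r2(index["id"], other["id"]) >= r2_threshold:
                assigned.add(other["id"])
    return index_snps


# The threshold grid from the text: 6 p-value cutoffs x 4 r-squared cutoffs.
P_THRESHOLDS = [1, 0.5, 0.05, 5e-4, 5e-6, 5e-8]
R2_THRESHOLDS = [0.2, 0.4, 0.6, 0.8]
```

Each returned index SNP would then enter the score with its discovery GWAS beta as the weight.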
For each disease, the thirty-one candidate GPSs were calculated in a validation dataset of 120,280 participants of European ancestry derived from the UK Biobank Phase 1 release. The UK Biobank is a large prospective cohort study that enrolled individuals from across the United Kingdom, aged 40-69 years at time of recruitment, starting in 2006. Individuals underwent a series of anthropometric measurements and surveys, including medical history review with a trained nurse.
Scores were generated by multiplying the genotype dosage of each risk allele for each variant by its respective weight, and then summing across all variants in the score using PLINK2 software. Incorporating genotype dosages accounts for uncertainty in genotype imputation. The vast majority of variants in the GPSs were available for scoring purposes in the validation dataset with sufficient imputation quality (INFO>0.3) (Tables 31 and 32).
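The scoring step reduces to a weighted sum, sketched below in plain Python. The disclosure reports using PLINK2 for this computation; the function name, the dictionary layout, and the choice to skip unavailable variants are assumptions of the example.

```python
def polygenic_score(dosages, weights):
    """Sum of risk-allele dosage times weight over the variants in the score.

    dosages: dict variant_id -> expected risk-allele count in [0, 2];
             fractional values reflect imputation uncertainty
    weights: dict variant_id -> per-allele weight (for example, log odds)
    Variants absent from `dosages` are skipped, an assumption of this sketch.
    """
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)
```

For instance, weights of 0.10, -0.05, and 0.20 with dosages of 2.0, 1.0, and 0.35 give a score of 0.20 - 0.05 + 0.07 = 0.22.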
For each of the five diseases, the score with the best discriminative capacity was determined based on maximal area under the receiver-operator curve (AUC) in a logistic regression model with the disease as the outcome and the disease-specific candidate GPS, age, sex, first four principal components of ancestry, and an indicator variable for genotyping array used as predictors (Tables 31 and 32). AUC confidence intervals were calculated using the “pROC” package within R.
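The selection criterion can be sketched as follows, taking each candidate model's predicted values as given. The sketch computes AUC from the Mann-Whitney U statistic rather than with the pROC package, and it does not fit the logistic regression; the function names and input layout are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    scores: predicted probabilities (or any risk score), one per individual
    labels: 1 for cases, 0 for controls
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Tied scores share the average of the ranks they span.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("AUC needs at least one case and one control")
    rank_sum_cases = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_cases - n_cases * (n_cases + 1) / 2.0
    return u / (n_cases * n_controls)


def best_candidate(candidate_predictions, labels):
    """Return the name of the candidate with the highest AUC.

    candidate_predictions: dict name -> list of model predictions
    """
    return max(candidate_predictions,
               key=lambda name: auc(candidate_predictions[name], labels))
```

In this framing, `best_candidate` would be run on the validation dataset and the chosen score then evaluated once on the independent testing dataset.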
The testing dataset comprised 288,978 UK Biobank Phase 2 participants distinct from those in the validation dataset described above. Individuals in the UK Biobank underwent genotyping with one of two closely related custom arrays (UK BiLEVE Axiom Array or UK Biobank Axiom Array) consisting of over 800,000 genetic markers scattered across the genome. Additional genotypes were imputed centrally using the Haplotype Reference Consortium resource, the UK10K panel, and the 1000 Genomes panel. In order to analyze individuals with a relatively homogeneous ancestry and owing to small percentages of non-British individuals, the present analysis was restricted to the white British ancestry individuals. This subpopulation was constructed centrally using a combination of self-reported ancestry and genetically confirmed ancestry using principal components. Additional exclusion criteria included outliers for heterozygosity or genotype missing rates, discordant reported versus genotypic sex, putative sex chromosome aneuploidy, or withdrawal of informed consent, derived centrally as previously reported.
For each of the five diseases, the proportion of variance explained was calculated using Nagelkerke's pseudo-R² metric (Table 37). The R² of the covariates alone was subtracted from the R² of the full model, which included the genome-wide polygenic score plus the covariates, thus yielding an estimate of the variance explained by the score. Covariates in the model included age, gender, genotyping array, and the first four principal components of ancestry.
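The subtraction described above can be sketched from model log-likelihoods using the standard Nagelkerke formula. The sketch assumes the three logistic models (intercept only, covariates only, covariates plus score) have already been fitted elsewhere; the function names are hypothetical.

```python
import math


def nagelkerke_r2(loglik_model, loglik_null, n):
    """Nagelkerke pseudo-R-squared from log-likelihoods.

    loglik_model: log-likelihood of the fitted logistic model
    loglik_null:  log-likelihood of the intercept-only model
    n:            number of individuals
    """
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


def incremental_r2(loglik_full, loglik_covariates, loglik_null, n):
    """R-squared of (score + covariates) minus R-squared of covariates alone."""
    return (nagelkerke_r2(loglik_full, loglik_null, n)
            - nagelkerke_r2(loglik_covariates, loglik_null, n))
```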
A sensitivity analysis was performed by removing one individual from each pair of related individuals (third-degree or closer; kinship coefficient>0.0442), confirming similar results within this subpopulation comprised of 222,529 of the 288,978 (77%) testing dataset participants (Table 38).
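One simple way to remove one individual from each related pair is sketched below. Which member of a pair is dropped is arbitrary here, and the disclosure does not specify that choice; the pair layout and function name are assumptions.

```python
def prune_related(pairs, threshold=0.0442):
    """Return the set of individuals to remove so no related pair remains.

    pairs: iterable of (id_a, id_b, kinship) tuples (hypothetical layout)
    A pair counts as related when kinship exceeds the threshold. If either
    member has already been removed, the pair is already broken.
    """
    removed = set()
    for a, b, kinship in pairs:
        if kinship <= threshold:
            continue
        if a in removed or b in removed:
            continue
        removed.add(b)   # arbitrary choice: drop the second member
    return removed
```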
Diagnosis of prevalent disease was based on a composite of data from self-report in an interview with a trained nurse and electronic health record (EHR) information, including inpatient International Classification of Disease (ICD-10) diagnosis codes and Office of Population and Censuses Surveys (OPCS-4) procedure codes.
Coronary artery disease ascertainment was based on a composite of myocardial infarction or coronary revascularization. Myocardial infarction was based on self-report or hospital admission diagnosis, as performed centrally. This included individuals with ICD-9 codes of 410.X, 411.0, 412.X, 429.79 or ICD-10 codes of I21.X, I22.X, I23.X, I24.1, I25.2 in hospitalization records. Coronary revascularization was assessed based on an OPCS-4 coded procedure for coronary artery bypass grafting (K40.1-40.4, K41.1-41.4, K45.1-45.5) or coronary angioplasty with or without stenting (K49.1-49.2, K49.8-49.9, K50.2, K75.1-75.4, K75.8-75.9).
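The composite definition above can be sketched as a code-matching function. The code lists are taken from the text; the dotted code format, the function name `has_cad`, and its argument layout are assumptions, and real records may store codes in a different format.

```python
# Myocardial infarction codes listed in the text.
ICD10_MI_PREFIXES = ("I21", "I22", "I23")   # I21.X, I22.X, I23.X
ICD10_MI_EXACT = {"I24.1", "I25.2"}
ICD9_MI_PREFIXES = ("410", "412")           # 410.X, 412.X
ICD9_MI_EXACT = {"411.0", "429.79"}


def _expand(block, first, last):
    """Expand a range such as K40.1-40.4 into individual codes."""
    return {f"{block}.{digit}" for digit in range(first, last + 1)}


# Coronary artery bypass grafting and angioplasty codes listed in the text.
OPCS4_REVASC = (
    _expand("K40", 1, 4) | _expand("K41", 1, 4) | _expand("K45", 1, 5)
    | _expand("K49", 1, 2) | _expand("K49", 8, 9) | {"K50.2"}
    | _expand("K75", 1, 4) | _expand("K75", 8, 9)
)


def has_cad(icd10=(), icd9=(), opcs4=(), self_reported_mi=False):
    """True if any myocardial infarction or revascularization criterion is met."""
    mi = (
        self_reported_mi
        or any(c.startswith(ICD10_MI_PREFIXES) or c in ICD10_MI_EXACT
               for c in icd10)
        or any(c.startswith(ICD9_MI_PREFIXES) or c in ICD9_MI_EXACT
               for c in icd9)
    )
    revascularization = any(c in OPCS4_REVASC for c in opcs4)
    return mi or revascularization
```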
Statistical Analysis within the Testing Dataset
For each disease, the GPS with the best discriminative capacity in the validation dataset was calculated in the testing dataset of 288,978 participants using genotyped and imputed variants with the Hail software package. The proportion of the population and of diseased individuals with a given magnitude of increased risk was determined by comparing progressively more extreme tails of the distribution to the remainder of the population in a logistic regression model predicting disease status and adjusted for age, gender, four principal components of ancestry, and genotyping array. Individuals were next binned into 100 groupings according to percentile of the GPS, and the unadjusted prevalence of disease within each bin was determined. We next compared the observed risk gradient across percentile bins to that which would be predicted by the GPS. For each individual, the predicted probability of disease was calculated using a logistic regression model with only the genome-wide polygenic score (GPS) as a predictor. The predicted prevalence of disease within each percentile bin of the GPS distribution was calculated as the average predicted probability of all individuals within that bin. Statistical analyses were conducted using R version 3.4.3 software (The R Foundation).
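The binning and tail-versus-remainder comparison can be sketched as follows. The odds ratio here is unadjusted, whereas the analysis described above adjusts for covariates in a logistic regression; the function names are assumptions, and the analysis itself was run in R.

```python
def percentile_bins(scores, n_bins=100):
    """Assign each individual a bin from 1 to n_bins by rank of the score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(scores) + 1
    return bins


def prevalence_by_bin(scores, has_disease, n_bins=100):
    """Unadjusted disease prevalence within each score bin."""
    cases = [0] * (n_bins + 1)
    totals = [0] * (n_bins + 1)
    for b, y in zip(percentile_bins(scores, n_bins), has_disease):
        totals[b] += 1
        cases[b] += y
    return {b: cases[b] / totals[b] for b in range(1, n_bins + 1) if totals[b]}


def tail_odds_ratio(scores, has_disease, top_fraction):
    """Unadjusted odds ratio of disease, top tail versus the remainder."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_top = round(len(scores) * top_fraction)
    top = set(order[len(order) - n_top:])
    cases_top = sum(1 for i in top if has_disease[i])
    controls_top = len(top) - cases_top
    cases_rest = sum(has_disease) - cases_top
    controls_rest = len(scores) - len(top) - cases_rest
    if controls_top == 0 or cases_rest == 0:
        raise ValueError("odds ratio undefined: empty cell in the 2x2 table")
    return (cases_top * controls_rest) / (controls_top * cases_rest)
```

Calling `tail_odds_ratio` with progressively smaller values of `top_fraction` mirrors the comparison of progressively more extreme tails to the remainder of the population.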
Various modifications and variations of the described methods, pharmaceutical compositions, and kits of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific embodiments, it will be understood that it is capable of further modifications and that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention that are obvious to those skilled in the art are intended to be within the scope of the invention. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains and may be applied to the essential features hereinbefore set forth.
This application claims the benefit of U.S. Provisional Application No. 62/531,762, filed Jul. 12, 2017, U.S. Provisional Application No. 62/583,997, filed Nov. 9, 2017, and U.S. Provisional Application No. 62/585,378, filed Nov. 13, 2017. The entire contents of the above-identified applications are hereby fully incorporated herein by reference.
This invention was made with government support under grant numbers HL127564 and HG00895 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
62531762 | Jul 2017 | US | |
62583997 | Nov 2017 | US | |
62585378 | Nov 2017 | US |