Breast cancer is the second deadliest cancer for U.S. women. Approximately one in eight women in the U.S. will develop invasive breast cancer over the course of their lifetime (NIH, 2019). Early detection of breast cancer is an effective strategy for reducing the death rate. If breast cancer is detected at the localized stage, the 5-year survival rate is 99% (NIH, 2019). However, only ∼62% of breast cancer cases are detected at the localized stage (NIH, 2019). In ∼30% of cases, breast cancer is detected after it has spread to the regional lymph nodes, reducing the 5-year survival rate to 85%. Furthermore, in 6% of cases, the cancer is diagnosed after it has spread to a distant part of the body beyond the lymph nodes, and the 5-year survival rate drops to 27%. To detect breast cancer early, the U.S. Preventive Services Task Force (USPSTF) recommends biennial screening mammography for women over 50 years old. For women under 50, the decision to screen must be individualized to balance the benefit of potential early detection against the risk of a false-positive diagnosis. False-positive mammography results, which typically lead to unnecessary follow-up diagnostic testing, become increasingly common for women 40 to 49 years old (Nelson et al., 2009). Nevertheless, for women at high risk for breast cancer (i.e., a lifetime risk of breast cancer higher than 20%), the American Cancer Society advises a yearly breast MRI and mammogram starting at 30 years of age (Oeffinger et al., 2015).
Polygenic risk scores (PRS) assess the genetic risk of complex diseases based on the aggregate statistical correlation of a disease outcome with many genetic variations across the whole genome. Single-nucleotide polymorphisms (SNPs) are the most commonly used genetic variations. While genome-wide association studies (GWAS) report only SNPs with statistically significant associations to phenotypes (Dudbridge, 2013), PRS can be estimated using a greater number of SNPs selected at more lenient adjusted p-value thresholds to improve prediction accuracy.
Previous research has developed a variety of PRS estimation models based on Best Linear Unbiased Prediction (BLUP), including gBLUP (Clark et al., 2013), rr-BLUP (Whittaker et al., 2000; Meuwissen et al., 2001), and other derivatives (Maier et al., 2015; Speed & Balding, 2014). These linear mixed models treat the effects of genetic variations as random effects, while fixed effects account for environmental factors and other covariates. Furthermore, linkage disequilibrium was utilized as a basis for the LDpred (Vilhjalmsson et al., 2015; Khera et al., 2018) and PRS-CS (Ge et al., 2019) algorithms.
PRS estimation can also be framed as a supervised classification problem: the input features are genetic variations and the output response is the disease outcome. Thus, machine learning techniques can be used to estimate PRS based on the classification scores achieved (Ho et al., 2019). A large-scale GWAS dataset may provide tens of thousands of individuals as training examples for model development and benchmarking. Wei et al. (2019) compared support vector machines and logistic regression for estimating the PRS of Type-1 diabetes; the best Area Under the receiver operating characteristic Curve (AUC) in that study was 84%. More recently, neural networks have been used to estimate human height from GWAS data, with the best R² scores in the range of 0.4 to 0.5 (Bellot et al., 2018). Amyotrophic lateral sclerosis was also investigated using Convolutional Neural Networks (CNN) with 4,511 cases and 6,127 controls (Yin et al., 2019), and the highest accuracy was 76.9%.
Significant progress has been made for estimating PRS for breast cancer from a variety of populations. In a recent study (Mavaddat et al., 2019), multiple large European female cohorts were combined to compare a series of PRS models. The most predictive model in that study used lasso regression with 3,820 SNPs and obtained an AUC of 65%. A PRS algorithm based on the sum of log odds ratios of important SNPs for breast cancer was used in the Singapore Chinese Health Study (Chan et al., 2018) with 46 SNPs and 56.6% AUC, the Shanghai Genome-Wide Association Studies (Wen et al., 2016) with 44 SNPs and 60.6% AUC, and a Taiwanese cohort (Hsieh et al., 2017) with 6 SNPs and 59.8% AUC. A pruning and thresholding method using 5,218 SNPs reached an AUC of 69% for the UK Biobank dataset (Khera et al., 2018).
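The sum-of-log-odds PRS used in several of the cohort studies above can be sketched as follows. The SNP identifiers and odds ratios in this example are illustrative placeholders, not values taken from the cited studies:

```python
import math

def log_odds_prs(genotypes, odds_ratios):
    """Classic PRS: sum over SNPs of risk-allele dosage times log odds ratio.

    genotypes: dict mapping SNP id -> risk-allele dosage (0, 1, or 2)
    odds_ratios: dict mapping SNP id -> per-allele odds ratio from a GWAS
    """
    return sum(genotypes[snp] * math.log(odds_ratios[snp])
               for snp in odds_ratios if snp in genotypes)

# Illustrative values only -- not from the cited studies.
odds_ratios = {"rs0001": 1.26, "rs0002": 1.10, "rs0003": 0.93}
genotypes = {"rs0001": 2, "rs0002": 1, "rs0003": 0}
score = log_odds_prs(genotypes, odds_ratios)
```

SNPs with odds ratios below 1 contribute negative terms, so the score aggregates both risk-increasing and protective alleles.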
The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present disclosure. The accompanying drawings illustrate one or more implementations described herein and, together with the description, explain these implementations. The drawings are not intended to be drawn to scale, and certain features and certain views of the figures may be shown exaggerated, to scale or in schematic in the interest of clarity and conciseness. Not every component may be labeled in every drawing. Like reference numerals in the figures may represent and refer to the same or similar element or function.
The present disclosure relates generally to the field of deep learning-based medical diagnostics. More particularly, it concerns deep neural networks and methods for training deep neural networks to provide estimated polygenic risk scores.
In one embodiment, the present disclosure is directed to computer-implemented methods of training a deep neural network for estimating a polygenic risk score for a disease. In some aspects, the method comprises collecting a first set of SNPs from at least 1,000 subjects with a known disease outcome from a database and a second set of SNPs from at least 1,000 other subjects with a known disease outcome from a database; encoding, independently, the first set of SNPs and the second set of SNPs by: labeling each subject as either a disease case or a control case based on the known disease outcome for the subject, and labeling each SNP in each subject as either homozygous for the minor allele, heterozygous, or homozygous for the major allele; optionally applying one or more filters to the first encoded set to create a first modified set of SNPs; training the deep neural network using the first encoded set of SNPs or the first modified set of SNPs; and validating the deep neural network using the second encoded set of SNPs.
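The per-SNP labeling described above is conventionally represented as an additive 0/1/2 encoding. A minimal sketch follows; the mapping from VCF-style genotype strings is an assumption about the input format:

```python
def encode_genotype(gt):
    """Map a genotype call to an additive 0/1/2 encoding.

    0 = homozygous major allele, 1 = heterozygous, 2 = homozygous minor allele.
    'gt' is a VCF-style call such as '0/0', '0/1', or '1/1', where '1'
    denotes the minor allele.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for a in alleles if a == "1")

def encode_subject(genotype_calls, disease_outcome):
    """Encode one subject: a 0/1/2 feature vector plus a 0/1 case/control label."""
    features = [encode_genotype(gt) for gt in genotype_calls]
    label = 1 if disease_outcome == "case" else 0
    return features, label

features, label = encode_subject(["0/0", "0/1", "1|1"], "case")
# features == [0, 1, 2], label == 1
```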
In some aspects, the filter comprises a p-value threshold.
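One simple form of such a filter retains only SNPs whose GWAS association p-value falls below a chosen threshold. A sketch, with an illustrative threshold value:

```python
def filter_by_pvalue(snp_pvalues, threshold=1e-3):
    """Return the SNP ids whose GWAS association p-value is below the threshold."""
    return [snp for snp, p in snp_pvalues.items() if p < threshold]

# Illustrative SNP ids and p-values.
snp_pvalues = {"rs0001": 4.2e-8, "rs0002": 0.03, "rs0003": 9.9e-4}
kept = filter_by_pvalue(snp_pvalues)  # ["rs0001", "rs0003"]
```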
In some aspects, the first set of SNPs and the second set of SNPs are both from at least 10,000 subjects. In some aspects, the SNPs are genome-wide. In some aspects, the SNPs are representative of at least 22 chromosomes. In some aspects, both the first set of SNPs and the second set of SNPs comprise the same at least 2,000 SNPs.
In some aspects, the disease is cancer. In some aspects, the cancer is breast cancer. In some aspects, the SNPs include at least five of the SNPs listed in Table 2.
In some aspects, the trained deep neural network has an accuracy of at least 60%. In some aspects, the trained deep neural network has an AUC of at least 65%.
In some aspects, the trained deep neural network comprises at least three hidden layers, and each layer comprises multiple neurons. For example, each layer may comprise 1000, 250, or 50 neurons.
In some aspects, the training the deep neural network comprises using stochastic gradient descent with regularization, such as dropout.
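A network of the shape described above (three hidden layers of 1000, 250, and 50 neurons, trained with stochastic gradient descent and dropout regularization) could be sketched in PyTorch as follows. The dropout rate, learning rate, and input dimension are illustrative choices, not values prescribed by the disclosure:

```python
import torch
import torch.nn as nn

class PRSNet(nn.Module):
    """Feedforward DNN: n_snps inputs -> 1000 -> 250 -> 50 -> 1 output."""
    def __init__(self, n_snps, dropout=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_snps, 1000), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1000, 250), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(250, 50), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(50, 1),  # logit; apply a sigmoid for a [0, 1] risk score
        )

    def forward(self, x):
        return self.layers(x)

model = PRSNet(n_snps=5273)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

# One illustrative training step on random stand-in data.
x = torch.randint(0, 3, (32, 5273)).float()  # 0/1/2 encoded genotypes
y = torch.randint(0, 2, (32, 1)).float()     # case/control labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```

In practice, the training loop would iterate over mini-batches of the encoded training set, with early stopping monitored on the validation set.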
In some aspects, the deep neural network comprises a linearization layer on top of a deep inner attention neural network. In some aspects, the linearization layer computes an output as an element-wise multiplication product of input features, attention weights, and coefficients. In some aspects, the network learns a linear function of an input feature vector, coefficient vector, and attention vector. In some aspects, the attention vector is computed from the input feature vector using a multi-layer neural network. In some aspects, all hidden layers of the multi-layer neural network use a non-linear activation function, and wherein the attention layer uses a linear activation function. In some aspects, the layers of the inner attention neural network comprise 1000, 250, or 50 neurons before the attention layer.
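The linearization-over-attention structure described above can be sketched in PyTorch as follows: the attention vector is produced from the input by a multi-layer network with non-linear hidden layers and a linear attention layer, and the output is a sum of element-wise products of input features, attention weights, and learned coefficients. Layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class InnerAttentionPRS(nn.Module):
    """Sketch of a linearization layer over a deep inner-attention network.

    Output = sum over features i of x_i * a_i(x) * c_i, where a(x) is an
    attention vector computed from the input and c is a learned coefficient
    vector.
    """
    def __init__(self, n_snps):
        super().__init__()
        self.attention_net = nn.Sequential(
            nn.Linear(n_snps, 1000), nn.ReLU(),
            nn.Linear(1000, 250), nn.ReLU(),
            nn.Linear(250, 50), nn.ReLU(),
            nn.Linear(50, n_snps),  # attention layer: linear activation
        )
        self.coefficients = nn.Parameter(torch.randn(n_snps))

    def forward(self, x):
        attention = self.attention_net(x)                      # a(x)
        return (x * attention * self.coefficients).sum(dim=1)  # linearized score

model = InnerAttentionPRS(n_snps=128)
x = torch.randint(0, 3, (4, 128)).float()  # 0/1/2 encoded genotypes
scores = model(x)                          # one score per subject
```

Because the output is linear in the per-feature products, the learned attention weights and coefficients can be inspected per SNP, which supports model interpretation.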
In one embodiment, provided herein are methods of using a deep neural network, trained using data from subjects with a disease by the methods of the present embodiments, to estimate a polygenic risk score for a patient for the disease. In some aspects, the methods comprise collecting a set of SNPs from a subject with an unknown disease outcome; encoding the set of SNPs by labeling each SNP in the subject as either homozygous for the minor allele, heterozygous, or homozygous for the major allele; and applying the deep neural network to obtain an estimated polygenic risk score for the patient for the disease.
In some aspects, the methods further comprise performing, or having performed, further screening for the disease if the polygenic risk score indicates that the patient is at risk for the disease.
In one embodiment, provided herein are methods for determining a polygenic risk score for a disease for a subject. In some aspects, the methods comprise (a) obtaining a plurality of SNPs from the genome of the subject; (b) generating a data input from the plurality of SNPs; and (c) determining the polygenic risk score for the disease by applying to the data input a deep neural network trained by the methods of the present embodiments. In some aspects, the methods further comprise performing, or having performed, further screening for the disease if the polygenic risk score indicates that the patient is at risk for the disease. In some aspects, the disease is breast cancer, and the method comprises performing, or having performed, a yearly breast MRI and mammogram if the patient’s polygenic risk score is greater than 20%.
In one embodiment, provided herein are polygenic risk score classifiers comprising a deep neural network that has been trained according to the methods provided herein.
In one non-limiting embodiment, the present disclosure is directed to a deep neural network (DNN) that was tested for breast cancer PRS estimation using a large cohort containing 26,053 cases and 23,058 controls. The performance of the DNN was shown to be higher than that of alternative machine learning algorithms and other statistical methods in this large cohort. Furthermore, DeepLIFT (Shrikumar et al., 2019) and LIME (Ribeiro et al., 2016) were used to identify salient SNPs used by the DNN for prediction.
Before further describing various embodiments of the apparatus, component parts, and methods of the present disclosure in more detail by way of exemplary description, examples, and results, it is to be understood that the embodiments of the present disclosure are not limited in application to the details of apparatus, component parts, and methods as set forth in the following description. The embodiments of the apparatus, component parts, and methods of the present disclosure are capable of being practiced or carried out in various ways not explicitly described herein. As such, the language used herein is intended to be given the broadest possible scope and meaning; and the embodiments are meant to be exemplary, not exhaustive. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting unless otherwise indicated as so. Moreover, in the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the disclosure. However, it will be apparent to a person having ordinary skill in the art that the embodiments of the present disclosure may be practiced without these specific details. In other instances, features which are well known to persons of ordinary skill in the art have not been described in detail to avoid unnecessary complication of the description. While the apparatus, component parts, and methods of the present disclosure have been described in terms of particular embodiments, it will be apparent to those of skill in the art that variations may be applied to the apparatus, component parts, and/or methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit, and scope of the inventive concepts as described herein. 
All such similar substitutes and modifications apparent to those having ordinary skill in the art are deemed to be within the spirit and scope of the inventive concepts as disclosed herein.
All patents, published patent applications, and non-patent publications referenced or mentioned in any portion of the present specification are indicative of the level of skill of those skilled in the art to which the present disclosure pertains, and are hereby expressly incorporated by reference in their entirety to the same extent as if the contents of each individual patent or publication was specifically and individually incorporated herein.
Unless otherwise defined herein, scientific and technical terms used in connection with the present disclosure shall have the meanings that are commonly understood by those having ordinary skill in the art. Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular.
As utilized in accordance with the methods and compositions of the present disclosure, the following terms and phrases, unless otherwise indicated, shall be understood to have the following meanings: The use of the word “a” or “an” when used in conjunction with the term “comprising” in the claims and/or the specification may mean “one,” but it is also consistent with the meaning of “one or more,” “at least one,” and “one or more than one.” The use of the term “or” in the claims is used to mean “and/or” unless explicitly indicated to refer to alternatives only or when the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” The use of the term “at least one” will be understood to include one as well as any quantity more than one, including but not limited to, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 100, or any integer inclusive therein. The phrase “at least one” may extend up to 100 or 1000 or more, depending on the term to which it is attached; in addition, the quantities of 100/1000 are not to be considered limiting, as higher limits may also produce satisfactory results. In addition, the use of the term “at least one of X, Y and Z” will be understood to include X alone, Y alone, and Z alone, as well as any combination of X, Y and Z.
As used in this specification and claims, the words “comprising” (and any form of comprising, such as “comprise” and “comprises”), “having” (and any form of having, such as “have” and “has”), “including” (and any form of including, such as “includes” and “include”) or “containing” (and any form of containing, such as “contains” and “contain”) are inclusive or open-ended and do not exclude additional, unrecited elements or method steps.
The term “or combinations thereof” as used herein refers to all permutations and combinations of the listed items preceding the term. For example, “A, B, C, or combinations thereof” is intended to include at least one of: A, B, C, AB, AC, BC, or ABC, and if order is important in a particular context, also BA, CA, CB, CBA, BCA, ACB, BAC, or CAB. Continuing with this example, expressly included are combinations that contain repeats of one or more item or term, such as BB, AAA, AAB, BBC, AAABCCCC, CBBAAA, CABABB, and so forth. The skilled artisan will understand that typically there is no limit on the number of items or terms in any combination, unless otherwise apparent from the context.
Throughout this application, the terms “about” or “approximately” are used to indicate that a value includes the inherent variation of error for the apparatus, composition, or the methods or the variation that exists among the objects, or study subjects. As used herein the qualifiers “about” or “approximately” are intended to include not only the exact value, amount, degree, orientation, or other qualified characteristic or value, but are intended to include some slight variations due to measuring error, manufacturing tolerances, stress exerted on various parts or components, observer error, wear and tear, and combinations thereof, for example.
The terms “about” or “approximately”, where used herein when referring to a measurable value such as an amount, percentage, temporal duration, and the like, is meant to encompass, for example, variations of ± 20% or ± 10%, or ± 5%, or ± 1%, or ± 0.1% from the specified value, as such variations are appropriate to perform the disclosed methods and as understood by persons having ordinary skill in the art. As used herein, the term “substantially” means that the subsequently described event or circumstance completely occurs or that the subsequently described event or circumstance occurs to a great extent or degree. For example, the term “substantially” means that the subsequently described event or circumstance occurs at least 90% of the time, or at least 95% of the time, or at least 98% of the time.
As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
As used herein, all numerical values or ranges include fractions of the values and integers within such ranges and fractions of the integers within such ranges unless the context clearly indicates otherwise. A range is intended to include any sub-range therein, although that sub-range may not be explicitly designated herein. Thus, to illustrate, reference to a numerical range, such as 1-10 includes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, as well as 1.1, 1.2, 1.3, 1.4, 1.5, etc., and so forth. Reference to a range of 2-125 therefore includes 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, and 125, as well as sub-ranges within the greater range, e.g., for 2-125, sub-ranges include but are not limited to 2-50, 5-50, 10-60, 5-45, 15-60, 10-40, 15-30, 2-85, 5-85, 20-75, 5-70, 10-70, 28-70, 14-56, 2-100, 5-100, 10-100, 5-90, 15-100, 10-75, 5-40, 2-105, 5-105, 100-95, 4-78, 15-65, 18-88, and 12-56. Reference to a range of 1-50 therefore includes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, etc., up to and including 50, as well as 1.1, 1.2, 1.3, 1.4, 1.5, etc., 2.1, 2.2, 2.3, 2.4, 2.5, etc., and so forth. Reference to a series of ranges includes ranges which combine the values of the boundaries of different ranges within the series. 
Thus, to illustrate reference to a series of ranges, for example, a range of 1-1,000 includes, for example, 1-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-75, 75-100, 100-150, 150-200, 200-250, 250-300, 300-400, 400-500, 500-750, 750-1,000, and includes ranges of 1-20, 10-50, 50-100, 100-500, and 500-1,000. The range 100 units to 2000 units therefore refers to and includes all values or ranges of values of the units, and fractions of the values of the units and integers within said range, including for example, but not limited to 100 units to 1000 units, 100 units to 500 units, 200 units to 1000 units, 300 units to 1500 units, 400 units to 2000 units, 500 units to 2000 units, 500 units to 1000 units, 250 units to 1750 units, 250 units to 1200 units, 750 units to 2000 units, 150 units to 1500 units, 100 units to 1250 units, and 800 units to 1200 units. Any two values within the range of about 100 units to about 2000 units therefore can be used to set the lower and upper boundaries of a range in accordance with the embodiments of the present disclosure. More particularly, a range of 10-12 units includes, for example, 10, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 11.0, 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, and 12.0, and all values or ranges of values of the units, and fractions of the values of the units and integers within said range, and ranges which combine the values of the boundaries of different ranges within the series, e.g., 10.1 to 11.5. Reference to an integer with more (greater) or less than includes any number greater or less than the reference number, respectively. Thus, for example, reference to less than 100 includes 99, 98, 97, etc. all the way down to the number one (1); and less than 10 includes 9, 8, 7, etc. all the way down to the number one (1).
Polygenic risk scores (PRS) estimate the genetic risk of an individual for a complex disease based on many genetic variants across the whole genome. Provided herein is a deep neural network (DNN) that was found to outperform alternative machine learning techniques and established statistical algorithms, including BLUP, BayesA, and LDpred. In the test cohort with 50% prevalence, the Area Under the receiver operating characteristic Curve (AUC) was 67.4% for DNN, 64.2% for BLUP, 64.5% for BayesA, and 62.4% for LDpred. BLUP, BayesA, and LDpred all generated PRS that followed a normal distribution in the case population. However, the PRS generated by the DNN in the case population followed a bi-modal distribution composed of two normal distributions with distinctly different means. This suggests that the DNN was able to separate the case population into a high-genetic-risk case sub-population with an average PRS significantly higher than that of the control population and a normal-genetic-risk case sub-population with an average PRS similar to that of the control population. This allowed the DNN to achieve 18.8% recall at 90% precision in the test cohort with 50% prevalence, which can be extrapolated to 65.4% recall at 20% precision in a general population with 12% prevalence. Interpretation of the DNN model identified salient variants that were assigned insignificant p-values by association studies but were important for DNN prediction. These variants may be associated with the phenotype through non-linear relationships.
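Extrapolation of this kind rests on the fact that recall (sensitivity) and the false-positive rate are properties of the classifier alone, while precision depends on prevalence. The sketch below carries a fixed operating point (18.8% recall at 90% precision at 50% prevalence) to a 12%-prevalence population; note that the 65.4% recall at 20% precision quoted above corresponds to a different threshold on the same ROC curve, which this fixed-threshold sketch does not reproduce:

```python
def precision_at_prevalence(recall, precision, prev_old, prev_new):
    """Re-compute precision at a new prevalence for the same decision threshold.

    Recall and the false-positive rate carry over unchanged; precision is
    re-derived from them at the new prevalence.
    """
    # Recover the false-positive rate from the old operating point.
    tp = prev_old * recall
    fp = tp * (1 - precision) / precision
    fpr = fp / (1 - prev_old)
    # Apply the same recall and FPR at the new prevalence.
    tp_new = prev_new * recall
    fp_new = (1 - prev_new) * fpr
    return tp_new / (tp_new + fp_new)

# DNN operating point from the 50%-prevalence test cohort, moved to 12%.
p = precision_at_prevalence(recall=0.188, precision=0.90,
                            prev_old=0.5, prev_new=0.12)
```

At the lower prevalence, the same threshold yields a markedly lower precision (roughly 55% here), illustrating why the quoted general-population figures correspond to a relaxed threshold trading precision for recall.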
While the present method is discussed in the context of breast cancer, the methods described herein can be applied to a variety of diseases, in particular genetically complex diseases, such as, for example, other types of cancer, diabetes, neurological disorders, and neuromuscular disorders.
Deep learning generally refers to methods that map data through multiple levels of abstraction, where higher levels represent more abstract entities. The goal of deep learning is to provide a fully automatic system for learning complex functions that map inputs to outputs, without using hand crafted features or rules. One implementation of deep learning comes in the form of feedforward neural networks, where levels of abstraction are modeled by multiple non-linear hidden layers.
On average, SNPs can occur at approximately 1 in every 300 bases and as such there can be about 10 million SNPs in the human genome. In some cases, the deep neural network is trained with a labeled dataset comprising at least about 1,000, at least about 2,000, at least about 3,000, at least about 4,000, at least about 5,000, at least about 10,000, at least about 15,000, at least about 18,000, at least about 20,000, at least about 21,000, at least about 22,000, at least about 23,000, at least about 24,000, at least about 25,000, at least about 26,000, at least about 28,000, at least about 30,000, at least about 35,000, at least about 40,000, or at least about 50,000 SNPs.
In some cases, the neural network may be trained such that a desired accuracy of PRS calling is achieved (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%). The accuracy of PRS calling may be calculated as the percentage of patients with a known disease state that are correctly identified or classified as having or not having the disease.
In some cases, the neural network may be trained such that a desired sensitivity of PRS calling is achieved (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%). The sensitivity of PRS calling may be calculated as the percentage of patients having a disease that are correctly identified or classified as having the disease.
In some cases, the neural network may be trained such that a desired specificity of PRS calling is achieved (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%). The specificity of PRS calling may be calculated as the percentage of healthy patients that are correctly identified or classified as not having a disease.
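The three metrics above can all be computed from a confusion matrix; a minimal sketch:

```python
def prs_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels and predictions.

    y_true, y_pred: sequences of 0 (no disease) / 1 (disease).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)   # fraction of disease cases correctly called
    specificity = tn / (tn + fp)   # fraction of healthy subjects correctly called
    return accuracy, sensitivity, specificity

acc, sens, spec = prs_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
# acc = 4/6, sens = 2/3, spec = 2/3
```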
In some cases, the methods, systems, and devices of the present disclosure are applicable to diagnose, prognosticate, or monitor disease progression in a subject. For example, a subject can be a human patient, such as a cancer patient, a patient at risk for cancer, a patient suspected of having cancer, or a patient with a family or personal history of cancer. The sample from the subject can be used to analyze whether or not the subject carries SNPs that are implicated in certain diseases or conditions, e.g., cancer, Neurofibromatosis 1, McCune-Albright, incontinentia pigmenti, paroxysmal nocturnal hemoglobinuria, Proteus syndrome, or Duchenne Muscular Dystrophy. The sample from the subject can be used to determine whether or not the subject carries SNPs and can be used to diagnose, prognosticate, or monitor any cancer, e.g., any cancer disclosed herein.
In another aspect, the present disclosure provides a method comprising determining a polygenic risk score for a subject, and diagnosing, prognosticating, or monitoring the disease in the subject. In some cases, the method further comprises providing treatment recommendations or preventative monitoring recommendations for the disease, e.g., the cancer. In some cases, the cancer is selected from the group consisting of: adrenal cancer, anal cancer, basal cell carcinoma, bile duct cancer, bladder cancer, cancer of the blood, bone cancer, a brain tumor, breast cancer, bronchus cancer, cancer of the cardiovascular system, cervical cancer, colon cancer, colorectal cancer, cancer of the digestive system, cancer of the endocrine system, endometrial cancer, esophageal cancer, eye cancer, gallbladder cancer, a gastrointestinal tumor, hepatocellular carcinoma, kidney cancer, hematopoietic malignancy, laryngeal cancer, leukemia, liver cancer, lung cancer, lymphoma, melanoma, mesothelioma, cancer of the muscular system, Myelodysplastic Syndrome (MDS), myeloma, nasal cavity cancer, nasopharyngeal cancer, cancer of the nervous system, cancer of the lymphatic system, oral cancer, oropharyngeal cancer, osteosarcoma, ovarian cancer, pancreatic cancer, penile cancer, pituitary tumors, prostate cancer, rectal cancer, renal pelvis cancer, cancer of the reproductive system, cancer of the respiratory system, sarcoma, salivary gland cancer, skeletal system cancer, skin cancer, small intestine cancer, stomach cancer, testicular cancer, throat cancer, thymus cancer, thyroid cancer, a tumor, cancer of the urinary system, uterine cancer, vaginal cancer, vulvar cancer, and any combination thereof.
In some cases, the determination of a PRS can provide valuable information for guiding the therapeutic intervention, e.g., for the cancer of the subject. For instance, SNPs can directly affect drug tolerance in many cancer types; therefore, understanding the underlying genetic variants can be useful for providing precision medical treatment of a cancer patient. In some cases, the methods, systems, and devices of the present disclosure can be used for application to drug development or developing a companion diagnostic. In some cases, the methods, systems, and devices of the present disclosure can also be used for predicting response to a therapy. In some cases, the methods, systems, and devices of the present disclosure can also be used for monitoring disease progression. In some cases, the methods, systems, and devices of the present disclosure can also be used for detecting relapse of a condition, e.g., cancer. A presence or absence of a known somatic variant or appearance of new somatic variant can be correlated with different stages of disease progression, e.g., different stages of cancers. As cancer progresses from early stage to late stage, an increased number or amount of new mutations can be detected by the methods, systems, or devices of the present disclosure.
Methods, systems, and devices of the present disclosure can be used to analyze a biological sample from a subject. The subject can be any human being. The biological sample for PRS determination can be obtained from a tissue of interest, e.g., a pathological tissue such as a tumor tissue. Alternatively, the biological sample can be a liquid biological sample containing cell-free nucleic acids, such as blood, plasma, serum, saliva, urine, amniotic fluid, pleural effusion, tears, seminal fluid, peritoneal fluid, and cerebrospinal fluid. Cell-free nucleic acids can comprise cell-free DNA or cell-free RNA. The cell-free nucleic acids used by methods and systems of the present disclosure can be nucleic acid molecules outside of cells in a biological sample. Cell-free DNA can occur naturally in the form of short fragments.
A subject to whom the methods of the present disclosure can be applied can be of any age and can be an adult, infant, or child. In some cases, the subject is within any age range (e.g., between 0 and 20 years old, between 20 and 40 years old, or between 40 and 90 years old, or even older). In some cases, the subject as described herein can be a non-human animal, such as a non-human primate, pig, dog, cow, sheep, mouse, rat, horse, donkey, or camel.
The use of the deep neural network can be performed with a total computation time (e.g., runtime) of no more than about 7 days, no more than about 6 days, no more than about 5 days, no more than about 4 days, no more than about 3 days, no more than about 48 hours, no more than about 36 hours, no more than about 24 hours, no more than about 22 hours, no more than about 20 hours, no more than about 18 hours, no more than about 16 hours, no more than about 14 hours, no more than about 12 hours, no more than about 10 hours, no more than about 9 hours, no more than about 8 hours, no more than about 7 hours, no more than about 6 hours, no more than about 5 hours, no more than about 4 hours, no more than about 3 hours, no more than about 2 hours, no more than about 60 minutes, no more than about 45 minutes, no more than about 30 minutes, no more than about 20 minutes, no more than about 15 minutes, no more than about 10 minutes, or no more than about 5 minutes.
In some cases, the methods and systems of the present disclosure may be performed using a single-core or multi-core machine, such as a dual-core, 3-core, 4-core, 5-core, 6-core, 7-core, 8-core, 9-core, 10-core, 12-core, 14-core, 16-core, 18-core, 20-core, 22-core, 24-core, 26-core, 28-core, 30-core, 32-core, 34-core, 36-core, 38-core, 40-core, 42-core, 44-core, 46-core, 48-core, 50-core, 52-core, 54-core, 56-core, 58-core, 60-core, 62-core, 64-core, 96-core, 128-core, 256-core, 512-core, or 1,024-core machine, or a multi-core machine having more than 1,024 cores. In some cases, the methods and systems of the present disclosure may be performed using a distributed network, such as a cloud computing network, which is configured to provide a similar functionality as a single-core or multi-core machine.
Various aspects of the technology can be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that can bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also can be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as can be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media can be involved in carrying one or more sequences of one or more instructions to a processor for execution.
Any of the methods described herein can be totally or partially performed with a computer system including one or more processors, which can be configured to perform the operations disclosed herein. Thus, embodiments can be directed to computer systems configured to perform the operations of any of the methods described herein, with different components performing a respective operation or a respective group of operations. Although presented as numbered operations, the operations of the methods disclosed herein can be performed at a same time or in a different order. Additionally, portions of these operations can be used with portions of other operations from other methods. Also, all or portions of an operation can be optional. Additionally, any of the operations of any of the methods can be performed with modules, units, circuits, or other approaches for performing these operations.
The present disclosure will now be discussed in terms of several specific, non-limiting, examples and embodiments. The examples described below, which include particular embodiments, will serve to illustrate the practice of the present disclosure, it being understood that the particulars shown are by way of example and for purposes of illustrative discussion of particular embodiments and are presented in the cause of providing what is believed to be a useful and readily understood description of procedures as well as of the principles and conceptual aspects of the present disclosure.
Breast cancer GWAS data. This study used a breast cancer genome-wide association study (GWAS) dataset generated by the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) project (Amos et al., 2017) and was obtained from the NIH dbGaP database under the accession number of phs001265.v1.p1. The DRIVE dataset was stored, processed and used on the Schooner supercomputer at the University of Oklahoma in an isolated partition with restricted access. The partition consisted of 5 computational nodes, each with 40 CPU cores (Intel Xeon Cascade Lake) and 200 GB of RAM. The DRIVE dataset in the dbGaP database was composed of 49,111 subjects genotyped for 528,620 SNPs using OncoArray (Amos et al., 2017). 55.4% of the subjects were from North America, 43.3% from Europe, and 1.3% from Africa. The disease outcome of the subjects was labeled as malignant tumor (48%), in situ tumor (5%), and no tumor (47%). In this study, the subjects in the malignant tumor and in situ tumor categories were labeled as cases and the subjects in the no tumor category were labeled as controls, resulting in 26,053 (53%) cases and 23,058 (47%) controls. The subjects in the case and control classes were randomly assigned to a training set (80%), a validation set (10%), and a test set (10%) (
Development of deep neural network models for PRS estimation. A variety of deep neural network (DNN) architectures (Bengio, 2009) were trained using Tensorflow 1.13. The Leaky Rectified Linear Unit (ReLU) activation function (Xu et al., 2019) was used on all hidden-layer neurons with the negative slope coefficient set to 0.2. The output neuron used a sigmoid activation function. The training error was computed using the cross-entropy function:

E = −[y log(p) + (1 − y) log(1 − p)]
where p ∈ [0, 1] is the prediction probability from the model and y ∈ {0, 1} is the prediction target, set to 1 for case and 0 for control. The prediction probability was considered as the PRS from DNN.
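As a concrete sketch, the loss and the hidden-layer activation described above can be written in a few lines of NumPy. This is a minimal illustration of the functions themselves, not the training code used in the study:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """Leaky ReLU with the negative-slope coefficient alpha set to 0.2, as in the text."""
    return np.where(x > 0, x, alpha * x)

def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for prediction probability p and target y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# A confident correct prediction incurs low loss; a confident wrong one, high loss.
print(float(cross_entropy(0.9, 1)))               # ~0.105
print(float(cross_entropy(0.9, 0)))               # ~2.303
print(leaky_relu(np.array([-1.0, 2.0])))          # [-0.2, 2.0]
```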
DNNs were evaluated using different SNP feature sets. SNPs were filtered using their Plink association p-values at the thresholds of 10⁻², 10⁻³, 10⁻⁴, and 10⁻⁵. DNN was also tested using the full SNP feature set without any filtering. The DNN models were trained using mini-batches with a batch size of 512. The Adam optimizer (Kingma & Ba, 2019), an adaptive learning rate optimization algorithm, was used to update the weights in each mini-batch. The initial learning rate was set to 10⁻⁴, and the models were trained for up to 200 epochs with early stopping based on the validation AUC score. Dropout (Srivastava et al., 2014) was used to reduce overfitting. Dropout rates of 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% were tested for the first hidden layer and the final dropout rate was selected based on the validation AUC score. The dropout rate was set to 50% on the other hidden layers in all architectures. Batch normalization (BN) (Ioffe & Szegedy, 2019) was used to accelerate the training process, and the momentum for the moving average was set to 0.9 in BN.
Development of alternative machine learning models for PRS estimation. Logistic regression, decision tree, random forest, AdaBoost, gradient boosting, support vector machine (SVM), and Gaussian naive Bayes were implemented and tested using the scikit-learn machine learning library in Python. These models were trained using the same training set as the DNNs and their hyperparameters were tuned using the same validation set based on the validation AUC (
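The benchmarking loop for the alternative models can be sketched as follows with scikit-learn. The data here is synthetic genotype-like data (0/1/2 minor-allele counts with a few simulated causal SNPs), and the hyperparameters are placeholders, not the tuned values from the study:

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(2000, 50)).astype(float)  # 2,000 subjects x 50 SNPs
w = rng.normal(0, 1, 50) * (rng.random(50) < 0.2)      # a few simulated causal SNPs
y = ((X - 1.0) @ w + rng.normal(0, 2.0, 2000) > 0).astype(int)  # noisy liability threshold
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "naive Bayes": GaussianNB(),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
best = max(aucs, key=aucs.get)
print(best, round(aucs[best], 3))
```

Each model is selected by its AUC on the held-out validation split, mirroring the tuning procedure described in the text.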
Development of statistical models for PRS estimation. The same training and validation sets were used to develop statistical models (
The score distributions of DNN, BayesA, BLUP and LDpred were analyzed with the Shapiro test for normality and the Bayesian Gaussian Mixture (BGM) expectation maximization algorithm. The BGM algorithm decomposed a mixture of two Gaussian distributions with weight priors at 50% over a maximum of 1000 iterations and 100 initializations.
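The distribution analysis can be illustrated with scipy and scikit-learn on synthetic PRS-like scores. The mixture weights, means, and standard deviations below are placeholders loosely modeled on the case distribution reported later, not the actual DNN scores:

```python
import numpy as np
from scipy.stats import shapiro
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
scores = np.concatenate([
    rng.normal(0.52, 0.10, 8650),   # normal-genetic-risk-like component
    rng.normal(0.88, 0.06, 1350),   # high-genetic-risk-like component
])
rng.shuffle(scores)

# Shapiro-Wilk rejects normality for the bimodal mixture.
stat, p = shapiro(scores[:5000])    # shapiro is intended for moderate sample sizes
print(p < 0.01)                     # True

bgm = BayesianGaussianMixture(n_components=2, max_iter=1000, n_init=5,
                              random_state=0)
bgm.fit(scores.reshape(-1, 1))
means = np.sort(bgm.means_.ravel())
print(np.round(means, 2))           # close to the two simulated component means
```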
DNN model interpretation. LIME and DeepLift were used to interpret the DNN predictions for subjects in the test set with DNN output scores higher than 0.67, which corresponded to a precision of 90%. For LIME, the submodular pick algorithm was used, the kernel size was set to 40, and the number of explainable features was set to 41. For DeepLift, the importance of each SNP was computed as the average across all individuals, and the reference activation value for a neuron was determined by the average value of all activations triggered across all subjects.
The breast cancer GWAS dataset containing 26,053 cases and 23,058 controls was generated by the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) project (Amos et al., 2017). The DRIVE data is available from the NIH dbGaP database under the accession number of phs001265.v1.p1. The cases and controls were randomly split to a training set, a validation set, and a test set (
Statistical significance of the disease association with the 528,620 SNPs was assessed with Plink using only the training set. To obtain unbiased benchmarking results on the test set, it was critical not to use the test set in the association analysis (
Previous studies (Khera et al., 2018; Gola et al., 2020) have used a large number of SNPs for PRS estimation on different datasets. In our study, the largest DNN model, consisting of all 528,620 SNPs, decreased the validation AUC score by 1.2% and the validation accuracy by 1.9% from the highest achieved values. This large DNN model relied on an 80% dropout rate to obtain strong regularization, while all the other DNN models utilized a 50% dropout rate. This suggested that DNN was able to perform feature selection without using association p-values, although the limited training data and the large neural network size resulted in complete overfitting with a 100% training accuracy and the lowest validation accuracy (
The effects of dropout and batch normalization were tested using the 5,273-SNP DNN model (
As an alternative to filtering, autoencoding was tested to reduce SNPs to a smaller set of encodings as described previously (Fergus et al., 2018; Cudie et al., 2018). An autoencoder was trained to encode 5,273 SNPs into 2,000 features with a mean square error (MSE) of 0.053 and a root mean square error (RMSE) of 0.23. The encodings from the autoencoder were used as the input features to train a DNN model with the same architecture as the ones shown in
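A minimal stand-in for the autoencoding experiment can be built with scikit-learn's MLPRegressor trained to reconstruct its own input, with the hidden layer serving as the encoding. The dimensions here (60 SNPs reduced to 20 encodings) are toy placeholders for the 5,273-to-2,000 reduction described above, and the study itself used a dedicated autoencoder rather than this sketch:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(1000, 60)).astype(float) / 2.0  # genotypes scaled to [0, 1]

ae = MLPRegressor(hidden_layer_sizes=(20,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)  # target = input: the reconstruction objective of an autoencoder

mse = np.mean((ae.predict(X) - X) ** 2)
print(round(float(mse), 3))

# Encodings: a forward pass through the first (hidden) layer only.
encodings = np.maximum(X @ ae.coefs_[0] + ae.intercepts_[0], 0.0)
print(encodings.shape)  # (1000, 20)
```

The extracted encodings would then replace the raw SNP features as DNN input, as in the experiment described above.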
The deep feedforward architecture benchmarked in
The 5,273-SNP feature set was used to test alternative machine learning approaches, including logistic regression, decision tree, naive Bayes, random forest, AdaBoost, gradient boosting, and SVM, for PRS estimation (
The performance of DNN was compared with three representative statistical models, including BLUP, BayesA, and LDpred (Table 1). Because the relative performance of these methods may be dependent on the number of training examples available, the original training set containing 39,289 subjects was down-sampled to create three smaller training sets containing 10,000, 20,000, and 30,000 subjects. As the 5,273-SNP feature set generated with a p-value cutoff of 10⁻³ may not be the most appropriate for the statistical methods, a 13,890-SNP feature set (p-value cutoff = 10⁻²) and a 2,099-SNP feature set (p-value cutoff = 10⁻⁵) were also tested for all methods.
Although LDpred also required training data, its prediction relied primarily on the provided p-values, which were generated for all methods using all 39,289 subjects in the training set. Thus, the down-sampling of the training set did not reduce the performance of LDpred. LDpred reached its highest AUC score at 62.4% using the p-value cutoff of 10⁻³. A previous study (Ge et al., 2019) that applied LDpred to breast cancer prediction using the UK Biobank dataset similarly obtained an AUC score of 62.4% at the p-value cutoff of 10⁻³. This showed consistent performance of LDpred in the two studies. When DNN, BLUP, and BayesA used the full training set, they obtained higher AUCs than LDpred at their optimum p-value cutoffs.
DNN, BLUP, and BayesA all gained performance with the increase in the training set sizes (Table 1). The performance gain was more substantial for DNN than BLUP and BayesA. The increase from 10,000 subjects to 39,289 subjects in the training set resulted in a 1.9% boost to DNN's best AUC, a 0.7% boost to BLUP, and a 0.8% boost to BayesA. This indicated the different variance-bias trade-offs made by DNN, BLUP, and BayesA. The high variance of DNN required more training data, but could capture non-linear relationships between the SNPs and the phenotype. The high bias of BLUP and BayesA had lower risk for overfitting using smaller training sets, but their models only considered linear relationships. The higher AUCs of DNN across all training set sizes indicated that DNN had a better variance-bias balance for breast cancer PRS estimation.
For all four training set sizes, BLUP and BayesA achieved higher AUCs using more stringent p-value filtering. When using the full training set, reducing the p-value cutoff from 10⁻² to 10⁻⁵ increased the AUCs of BLUP from 61.0% to 64.2% and the AUCs of BayesA from 61.1% to 64.5%. This suggested that BLUP and BayesA preferred a reduced number of SNPs that were significantly associated with the phenotype. On the other hand, DNN produced lower AUCs using the p-value cutoff of 10⁻⁵ than the other two higher cutoffs. This suggested that DNN can perform better feature selection in comparison to SNP filtering based on association p-values.
The four algorithms were compared using the PRS histograms of the case population and the control population from the test set in
The score histograms of DNN did not follow normal distributions based on the Shapiro normality test, with a p-value of 4.1 × 10⁻³⁴ for the case distribution and a p-value of 2.5 × 10⁻⁹ for the control distribution. The case distribution had the appearance of a bi-modal distribution. The Bayesian Gaussian mixture expectation maximization algorithm decomposed the case distribution into two normal distributions: Ncase1(µ = 0.519, σ = 0.096) with an 86.5% weight and Ncase2(µ = 0.876, σ = 0.065) with a 13.5% weight. The control distribution was resolved into two normal distributions with similar means and distinct standard deviations: Ncontrol1(µ = 0.471, σ = 0.1) with an 85.0% weight and Ncontrol2(µ = 0.507, σ = 0.03) with a 15.0% weight. The Ncase1 distribution had a similar mean as the Ncontrol1 and Ncontrol2 distributions. This suggested that the Ncase1 distribution may represent a normal-genetic-risk case sub-population, in which the subjects may have a normal level of genetic risk for breast cancer and the oncogenesis likely involved a significant environmental component. The mean of the Ncase2 distribution was higher than the means of both the Ncase1 and Ncontrol1 distributions by more than 4 standard deviations (p-value < 10⁻¹⁶). Thus, the Ncase2 distribution likely represented a high-genetic-risk case sub-population for breast cancer, in which the subjects may have inherited many genetic variations associated with breast cancer.
Three pairwise GWAS were performed among the high-genetic-risk case sub-population with DNN PRS > 0.67, the normal-genetic-risk case sub-population with DNN PRS < 0.67, and the control population (Table 5). The GWAS analysis of the high-genetic-risk case sub-population versus the control population identified 182 significant SNPs at the Bonferroni level of statistical significance. The GWAS analysis of the high-genetic-risk case sub-population versus the normal-genetic-risk case sub-population identified 216 significant SNPs. The two sets of significant SNPs found by these two GWAS analyses were very similar, sharing 149 significant SNPs in their intersection. Genes associated with these 149 SNPs were investigated with pathway enrichment analysis (Fisher's Exact Test; P < 0.05) using SNPnexus (Dayem et al., 2018) (Table 6). Many of the significant pathways were involved in DNA repair (O'Connor, 2015), signal transduction (Kolch et al., 2015), and suppression of apoptosis (Fernald & Kurokawa, 2013). Interestingly, the GWAS analysis of the normal-genetic-risk case sub-population versus the control population identified no significant SNPs. This supported the classification of the cases into the normal-genetic-risk subjects and the high-genetic-risk subjects based on their PRS scores from the DNN model.
In comparison with AUCs, it may be more relevant for practical applications of PRS to compare the recalls of different algorithms at a given precision that warrants clinical recommendations. At 90% precision, the recalls were 18.8% for DNN, 0.2% for BLUP, 1.3% for BayesA, and 1.3% for LDpred in the test set of the DRIVE cohort with a ∼50% prevalence. This indicated that DNN can make a positive prediction for 18.8% of the subjects in the DRIVE cohort and these positive subjects would have an average chance of 90% to eventually develop breast cancer. The American Cancer Society advises a yearly breast MRI and mammogram starting at the age of 30 years for women with a lifetime risk of breast cancer greater than 20%, which corresponds to a 20% precision for PRS. By extrapolating the performance in the DRIVE cohort, the DNN model should be able to achieve a recall of 65.4% at a precision of 20% in the general population with a 12% prevalence rate of breast cancer.
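The prevalence adjustment behind this extrapolation follows from Bayes' rule: with the sensitivity (recall) and false-positive rate of an operating point held fixed, the positive predictive value (precision) changes with disease prevalence. The numbers below are illustrative stand-ins, not the actual ROC of the DNN model:

```python
def precision(recall, fpr, prevalence):
    """PPV = TPR * prev / (TPR * prev + FPR * (1 - prev))."""
    tp = recall * prevalence          # true positives per screened subject
    fp = fpr * (1.0 - prevalence)     # false positives per screened subject
    return tp / (tp + fp)

# A hypothetical operating point yielding ~90% precision at ~50% prevalence...
r, f = 0.188, 0.02
print(round(precision(r, f, 0.50), 3))   # ~0.904
# ...gives a lower precision at the 12% population prevalence; a looser
# operating point (higher recall, more false positives) can therefore still
# meet the 20% precision target in the general population.
print(round(precision(r, f, 0.12), 3))
```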
While the DNN model used 5,273 SNPs as input, only a small set of these SNPs were particularly informative for identifying the subjects with high genetic risks for breast cancer. LIME and DeepLift were used to find the top-100 salient SNPs used by the DNN model to identify the subjects with PRS higher than the 0.67 cutoff at 90% precision in the test set (
Michailidou et al. (2017) summarized a total of 172 SNPs associated with breast cancer. Out of these SNPs, 59 were not included on OncoArray, 63 had an association p-value greater than 10⁻³ and were therefore not included in the 5,273-SNP feature set for DNN, 34 were not ranked among the top-1000 SNPs by either DeepLIFT or LIME, and 16 were ranked among the top-1000 SNPs by DeepLIFT, LIME, or both (Table 7). This indicates that many SNPs with significant association may be missed by the interpretation of DNN models.
The 23 salient SNPs identified by both DeepLift and LIME in their top-100 lists are shown in Table 2. Eight of the 23 SNPs had p-values higher than the Bonferroni level of significance and were missed by the association analysis using Plink. The potential oncogenesis mechanisms for some of the 8 SNPs have been investigated in previous studies. The SNP, rs139337779 at 12q24.22, is located within the gene, Nitric oxide synthase 1 (NOS1). Li et al. (Li et al., 2019) showed that the overexpression of NOS1 can up-regulate the expression of ATP-binding cassette, subfamily G, member 2 (ABCG2), which is a breast cancer resistance protein (Mao & Unadkat, 2015), and NOS1-induced chemo-resistance was partly mediated by the up-regulation of ABCG2 expression. Lee et al. (2009) reported that NOS1 is associated with the breast cancer risk in a Korean cohort. The SNP, chr13_113796587_A_G at 13q34, is located in the F10 gene, which encodes coagulation factor X. Tinholt et al. (2014) showed that increased coagulation activity and genetic polymorphisms in the F10 gene are associated with breast cancer. The BNC2 gene containing the SNP, chr9_16917672_G_T at 9p22.2, is a putative tumor suppressor gene in high-grade serous ovarian carcinoma (Casaratto et al., 2016). The SNP, chr2_171708059_C_T at 2q31.1, is within the GAD1 gene and the expression level of GAD1 is a significant prognostic factor in lung adenocarcinoma (Tsuboi et al., 2019). Thus, the interpretation of DNN models may identify novel SNPs with non-linear association with breast cancer.
While neural networks can provide high predictive performance, it has been a challenge to identify the salient features and important feature interactions used for their predictions. This represents a key hurdle for deploying neural networks in many biomedical applications that require interpretability, including predictive genomics. In this paper, a linearizing neural network architecture (LINA) was developed to provide both first-order and second-order interpretations at both the instance-wise and model-wise levels. LINA combines the representational capacity of a deep inner attention neural network with a linearized intermediate representation for model interpretation. In comparison with DeepLIFT, LIME, Grad*Input and L2X, the first-order interpretation of LINA had better Spearman correlation with the ground-truth importance rankings of features in synthetic datasets. In comparison with NID and GEH, the second-order interpretation results from LINA achieved better precision for identification of the ground-truth feature interactions in synthetic datasets. These algorithms were further benchmarked using predictive genomics as a real-world application. LINA identified larger numbers of important single nucleotide polymorphisms (SNPs) and salient SNP interactions than the other algorithms at given false discovery rates. The results showed accurate and versatile model interpretation using LINA.
An interpretable machine learning algorithm should have a high representational capacity to provide strong predictive performance, and its learned representations should be amenable to model interpretation and understandable to humans. The two desiderata are generally difficult to balance. Linear models and decision trees generate simple representations for model interpretation, but have low representational capacities for only simple prediction tasks. Neural networks and support vector machines have high representational capacities to handle complex prediction tasks, but their learned representations are often considered to be “black-boxes” for model interpretation (Bermeitinger et al., 2019).
Predictive genomics is an exemplar application that requires both a strong predictive performance and high interpretability. In this application, the genotype information for a large number of SNPs in a subject’s genome is used to predict the phenotype of this subject. While neural networks have been shown to provide better predictive performance than statistical models (Badré et al., 2020; Fergus et al., 2018), statistical models are still the dominant methods for predictive genomics, because geneticists and genetic counselors can understand which SNPs are used and how they are used as the basis for certain phenotype predictions. Neural network models have also been used in many other important bioinformatics applications (Ho Thanh Lam et al., 2020; Do & Le, 2020; Baltres et al., 2020) that can benefit from model interpretation.
To make neural networks more useful for predictive genomics and other applications, in certain non-limiting embodiments, the present disclosure is directed to a new neural network architecture, referred to as linearizing neural network architecture (LINA), to provide both first-order and second-order interpretations and both instance-wise and model-wise interpretations.
Model interpretation reveals the input-to-output relationships that a machine learning model has learned from the training data to make predictions (Molnar, 2019). The first-order model interpretation aims to identify individual features that are important for a model to make predictions. For predictive genomics, this can reveal which individual SNPs are important for phenotype prediction. The second-order model interpretation aims to identify important interactions among features that have a large impact on model prediction. For example, the second-order interpretation may reveal an XOR interaction between two features that jointly determine the output, even though neither feature is individually informative. For predictive genomics, this may uncover epistatic interactions between pairs of SNPs (Cordell, 2002; Phillips, 2008).
A general strategy for the first-order interpretation of neural networks, first introduced by Saliency (Simonyan et al., 2014), is based on the gradient of the output with respect to (w.r.t.) the input feature vector. A feature with a larger partial derivative of the output is considered more important. The gradient of a neural network model w.r.t. the input feature vector of a specific instance can be computed using backpropagation, which generates an instance-wise first-order interpretation. The Grad*Input algorithm (Shrikumar et al., 2017) multiplies the obtained gradient element-wise with the input feature vector to generate better scaled importance scores. As an alternative to using the gradient information, the Deep Learning Important FeaTures (DeepLIFT) algorithm explains the predictions of a neural network by backpropagating the activations of the neurons to the input features (Shrikumar et al., 2017). The feature importance scores are calculated by comparing the activations of the neurons with their references, which allows the importance information to pass through a zero gradient during backpropagation. The Class Model Visualization (CMV) algorithm (Simonyan et al., 2014) computes the visual importance of pixels in convolutional neural networks (CNNs). It performs backpropagation on an initially dark image to find the pixels that maximize the classification score of a given class.
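The gradient-based scores can be illustrated numerically on a toy differentiable model, where central finite differences stand in for backpropagation. The fixed linear-plus-tanh function below is an assumption for illustration, not a trained network:

```python
import numpy as np

w = np.array([1.5, -2.0, 0.0, 0.5])

def model(x):
    return np.tanh(x @ w)  # toy differentiable "network"

def gradient(f, x, h=1e-6):
    """Central finite differences in place of backpropagation."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 0.5, 3.0, -1.0])
saliency = gradient(model, x)   # Saliency: larger |gradient| = more important
grad_x_input = saliency * x     # Grad*Input rescales by the feature value
print(np.round(saliency, 3))        # [ 1.5 -2.   0.   0.5]
print(np.round(grad_x_input, 3))    # [ 1.5 -1.   0.  -0.5]
```

Note that the third feature has zero weight, so both scores correctly assign it zero importance, while Grad*Input rescales the remaining scores by each feature's value.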
While the algorithms described above were developed specifically for neural networks, model-agnostic interpretation algorithms can be used for all types of machine learning models. Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) fits a linear model to synthetic instances that have randomly perturbed features in the vicinity of an instance. The obtained linear model is analyzed as a local surrogate of the original model to identify the important features for the prediction on this instance. Because this approach does not rely on gradient computation, LIME can be applied to any machine learning model, including non-differentiable models. The studies in Examples 1-4 combined LIME and DeepLIFT to interpret a feedforward neural network model for predictive genomics. Kernel SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2022) uses a sampling method to find the Shapley value for each feature of a given input. The Multi-Objective Counterfactuals (MOC) method (Dandl et al., 2020) searches for the counterfactual explanations for an instance by solving a multi-objective optimization problem. The importance scores calculated by the L2X algorithm (Chen et al., 2018) are based on the mutual information between the features and the output from a machine learning model. L2X is efficient because it approximates the mutual information using a variational approach.
The second-order interpretation is more challenging than the first-order interpretation because d features would have (d² − d)/2 possible interactions to be evaluated. Computing the Hessian matrix of a model for the second-order interpretation is conceptually equivalent to, but much more computationally expensive than, computing the gradient for the first-order interpretation. Group Expected Hessian (GEH) (Cui et al., 2020) computes the Hessian of a Bayesian neural network for many regions in the input feature space and aggregates them to estimate an interaction score for every pair of features. The Additive Groves algorithm (Sorokina et al., 2007) estimates the feature interaction scores by comparing the predictive performance of the decision tree containing all features with that of the decision trees with pairs of features removed. Neural Interaction Detection (NID) (Tsang et al., 2018) avoids the high computational cost of evaluating every feature pair by directly analyzing the weights in a feedforward neural network. If some features are strongly connected to a neuron in the first hidden layer and the paths from that neuron to the output have high aggregated weights, then NID considers these features to have strong interactions.
Model interpretations can be further classified as instance-wise interpretations or model-wise interpretations. Instance-wise interpretation algorithms, including Saliency (Simonyan et al., 2014), LIME (Ribeiro et al., 2016) and L2X (Chen et al., 2018), provide an explanation for a model’s prediction for a specific instance. For example, an instance-wise interpretation of a neural network model for predictive genomics may highlight the important SNPs in a specific subject which are the basis for the phenotype prediction of this subject. This is useful for intuitively assessing how well grounded the prediction of a model is for a specific subject. Model-wise interpretation provides insights into how a model makes predictions in general. CMV (Simonyan et al., 2014) was developed to interpret CNN models. Instance-wise interpretation methods can also be used to explain a model by averaging the explanations of all the instances in a test set. A model-wise interpretation of a predictive genomics model can reveal the important SNPs for a phenotype prediction in a large cohort of subjects. Model-wise interpretations shed light on the internal mechanisms of a machine learning model.
Disclosed herein is a LINA architecture and first-order and second-order interpretation algorithms for LINA. The interpretation performance of the new methods has been benchmarked using synthetic datasets and a predictive genomics application in comparison with state-of-the-art (SOTA) interpretation methods. The interpretations from LINA were more versatile and more accurate than those from the SOTA methods.
(A) LINA Architecture. The key feature of the LINA architecture (
y = S(Kᵀ(A ∘ X) + b) = S(Σᵢ kᵢaᵢxᵢ + b)    (1)

where y is the output, X is the input feature vector, S() is the activation function of the output layer, ∘ represents the element-wise multiplication operation, K and b are respectively the coefficient vector and bias that are constant for all instances, and A is the attention vector that adaptively scales the feature vector of an instance. X, A and K are three vectors of dimension d, which is the number of input features. The computation by the linearization layer and the output layer is also expressed in a scalar format in Equation (1). This formulation allows the LINA model to learn a linear function of the input feature vector, coefficient vector, and attention vector.
The attention vector is computed from the input feature vector using a multi-layer neural network, referred to as the inner attention neural network in LINA. The inner attention neural network must be sufficiently deep for a prediction task owing to the designed low representational capacity of the remaining linearization layer in a LINA model. In the inner attention neural network, all hidden layers use a non-linear activation function, such as ReLU, but the attention layer uses a linear activation function to avoid any restriction in the range of the attention weights. This is different from the typical attention mechanism in existing attentional architectures which generally use the softmax activation function.
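A minimal NumPy sketch of the LINA forward pass, under stated assumptions: a one-hidden-layer inner attention network (ReLU hidden layer, linear attention layer), random untrained weights, toy dimensions, and a sigmoid output activation S:

```python
import numpy as np

rng = np.random.default_rng(3)
d, h = 8, 16                                       # toy feature/hidden dimensions
W1, b1 = rng.normal(0, 0.5, (d, h)), np.zeros(h)   # inner attention net, hidden layer
W2, b2 = rng.normal(0, 0.5, (h, d)), np.zeros(d)   # attention layer (linear activation)
K, b = rng.normal(0, 0.5, d), 0.0                  # coefficient vector and bias

def attention(x):
    hidden = np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden layer
    return hidden @ W2 + b2                 # linear activation: unrestricted weights

def lina_forward(x):
    a = attention(x)                        # A is computed from the input itself
    z = np.sum(K * a * x) + b               # linearization layer: sum_i k_i a_i x_i + b
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid output activation S

x = rng.normal(0, 1, d)
y = lina_forward(x)
print(0.0 < y < 1.0)  # True: a valid probability
```

The linear activation on the attention layer leaves the attention weights unrestricted, in contrast to softmax-based attention, exactly as described above.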
(B) The Loss Function. The loss function for LINA is composed of the training error loss, a regularization penalty on the coefficient vector, and a regularization penalty on the attention vector: Loss = E + β·||K||2 + γ·||A − 1||1,
where E is a differentiable convex training error function, ||K||2 is the L2 norm of the coefficient vector, ||A − 1||1 is the L1 norm of the attention vector minus 1, and β and γ are the regularization parameters. The coefficient regularization sets 0 as the expected value of the prior distribution for K, which reflects the expectation of uninformative features. The attention regularization sets 1 as the expected value of the prior distribution for A, which reflects the expectation of a neutral attention weight that does not scale the input feature. The values of β and γ and the choices of L2, L1, or L0 regularization for the coefficient and attention vectors are all hyperparameters that can be optimized for predictive performance on the validation set.
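The loss composition can be sketched as follows. The choice of mean squared error for E and of the squared norm for the L2 penalty are illustrative conventions, not mandated by the text:

```python
import numpy as np

def lina_loss(y_true, y_pred, K, A, beta=1e-4, gamma=1e-6):
    """Training error plus the two regularization penalties of a LINA-style
    loss: E + beta*||K||_2 + gamma*||A - 1||_1 (squared-norm convention)."""
    E = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    coef_penalty = beta * np.sum(np.asarray(K) ** 2)             # pulls K toward 0
    attn_penalty = gamma * np.sum(np.abs(np.asarray(A) - 1.0))   # pulls A toward 1
    return E + coef_penalty + attn_penalty
```

With zero coefficients and neutral attention weights (all ones), both penalties vanish and the loss reduces to the training error.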
(C) First-order Interpretation. LINA derives the instance-wise first-order interpretation from the gradient of the output, y, w.r.t. the input feature vector, X. The output gradient can be decomposed as follows: ∂y/∂x_i = S′(z)·(k_i a_i + Σ_{j=1}^{d} k_j x_j ∂a_j/∂x_i), where z = Σ_{j=1}^{d} k_j a_j x_j + b.
Proof: differentiating y = S(Σ_j k_j a_j x_j + b) w.r.t. x_i using the chain and product rules yields S′(z)·(k_i a_i + Σ_j k_j x_j ∂a_j/∂x_i), since x_i appears both directly in the i-th term of the sum and indirectly through every attention weight a_j.
The decomposition of the output gradient in LINA shows that the contribution of a feature in an attentional architecture comprises (i) a direct contribution to the output weighted by its attention weight and (ii) an indirect contribution to the output during attention computation. This indicates that using attention weights directly as a measure of feature importance omits the indirect contribution of a feature in the attention mechanism.
For the instance-wise first-order interpretation, the inventors defined FQ_i = k_i a_i + Σ_{j=1}^{d} k_j x_j ∂a_j/∂x_i as the full importance score for feature i, DQ_i = k_i a_i as the direct importance score for feature i, and IQ_i = Σ_{j=1}^{d} k_j x_j ∂a_j/∂x_i as the indirect importance score for feature i, so that FQ_i = DQ_i + IQ_i.
For the model-wise first-order interpretation, the inventors defined the model-wise full importance score (FP_i), direct importance score (DP_i), and indirect importance score (IP_i) for feature i as the averages of the absolute values of the corresponding instance-wise importance scores of this feature across all instances in the test set: FP_i = (1/n)·Σ_{t=1}^{n} |FQ_i^(t)|, DP_i = (1/n)·Σ_{t=1}^{n} |DQ_i^(t)|, and IP_i = (1/n)·Σ_{t=1}^{n} |IQ_i^(t)|, where n is the number of instances in the test set. Because absolute values are used, the model-wise FP_i of feature i is no longer the sum of its IP_i and DP_i.
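The decomposition FQ = DQ + IQ can be verified numerically against the true gradient of the output. The small tanh attention network, identity output activation, and finite-difference Jacobian below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d))
K = rng.normal(size=d)

def attention(x):
    # Hypothetical smooth inner attention network, for illustration only.
    return np.tanh(W @ x) + 1.0

def first_order_scores(x, eps=1e-5):
    """Instance-wise scores for an identity output activation:
    DQ_i = k_i a_i, IQ_i = sum_j k_j x_j da_j/dx_i, FQ_i = DQ_i + IQ_i."""
    a = attention(x)
    J = np.empty((d, d))                     # J[j, i] = da_j / dx_i
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        J[:, i] = (attention(x + e) - attention(x - e)) / (2 * eps)
    DQ = K * a                               # direct contribution
    IQ = J.T @ (K * x)                       # indirect contribution via attention
    return DQ + IQ, DQ, IQ

x = rng.standard_normal(d)
FQ, DQ, IQ = first_order_scores(x)
```

FQ should coincide (up to finite-difference error) with the gradient of y = Σ_j k_j a_j(x) x_j, which is exactly the decomposition above with S′ = 1.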
(D) Second-order Interpretation. It is computationally expensive and unscalable to compute the Hessian matrix for a large LINA model. Here, the Hessian matrix of the output w.r.t. the input feature vector is approximated using the Jacobian matrix of the attention vector w.r.t. the input feature vector, which is computationally feasible to calculate. Differentiating the output gradient once more gives ∂²y/∂x_r∂x_c = S″(z)·(∂z/∂x_r)(∂z/∂x_c) + S′(z)·(k_c ∂a_c/∂x_r + k_r ∂a_r/∂x_c + Σ_{j=1}^{d} k_j x_j ∂²a_j/∂x_r∂x_c), where z = Σ_j k_j a_j x_j + b (Equation 10).
By omitting the second-order derivatives of the attention weights, Equation (10) can be simplified as ∂²y/∂x_r∂x_c ≈ S″(z)·(∂z/∂x_r)(∂z/∂x_c) + S′(z)·(k_c ∂a_c/∂x_r + k_r ∂a_r/∂x_c) (Equation 11).
Equation (11) shows an approximation of the Hessian of the output using the Jacobian of the attention vector. The k-weighted sum of the omitted second-order derivatives of the attention weights constitutes the approximation error. The performance of the second-order interpretation based on this approximation is benchmarked using synthetic and real-world datasets.
For instance-wise second-order interpretation, the inventors define a directed importance score of feature r to feature c: SQ_{r,c} = k_c ∂a_c/∂x_r.
This measures the importance of feature r in the calculation of the attention weight of feature c. In other words, this second-order importance score measures the importance of feature r to the direct importance score of feature c for the output.
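Computing the directed scores only requires one pass over the attention Jacobian. The tanh attention network and finite differences below are illustrative assumptions; the directed score follows the definition SQ[r, c] = k_c ∂a_c/∂x_r:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W = rng.normal(size=(d, d))
K = rng.normal(size=d)

def attention(x):
    return np.tanh(W @ x) + 1.0  # hypothetical smooth attention network

def sq_scores(x, eps=1e-5):
    """Directed second-order scores SQ[r, c] = k_c * da_c/dx_r: the effect of
    feature r on the attention weight (and direct score) of feature c."""
    SQ = np.empty((d, d))
    for r in range(d):
        e = np.zeros(d); e[r] = eps
        da = (attention(x + e) - attention(x - e)) / (2 * eps)  # da/dx_r
        SQ[r] = K * da
    return SQ

x = rng.standard_normal(d)
SQ = sq_scores(x)
```

For this tanh network the Jacobian is available in closed form, so the finite-difference result can be checked exactly.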
For the model-wise second-order interpretation, the inventors defined an undirected importance score between feature r and feature c based on their average instance-wise second-order importance score in the test set:
(E) Recap of the LINA Importance Scores. The notations and definitions of all the importance scores for a LINA model are recapitulated below in Table 8. FQ and SQ are selected as the first-order and second-order importance score, respectively, for instance-wise interpretation. FP and SP are used as the first-order and second-order importance scores, respectively, for model-wise interpretation.
(A) California housing dataset. The California housing dataset (Kelley & Barry, 1997) was used to formulate a simple regression task, which is the prediction of the median sale price of houses in a district based on eight input features (Table 5). The dataset contained 20,640 instances (districts) for model training and testing.
(B) First-order benchmarking datasets. Five synthetic datasets, each containing 20,000 instances, were created using sigmoid functions to simulate binary classification tasks. These functions were created following the examples in (Chen et al., 2018) for the first-order interpretation benchmarking. All five datasets included ten input features, whose values were independently sampled from a standard Gaussian distribution: xi~N(0, 1), i ∈ {1, 2, ..., 10}. The target value was set to 0 if the sigmoid function output fell in (0, 0.5) and to 1 if it fell in [0.5, 1). The inventors used the following five sigmoid functions of different subsets of the input features:
(F1): This function contains four important features with independent squared relationships with the target. The ground-truth ranking of the features by first-order importance is X1, X2, X3, and X4. The remaining six uninformative features are tied in the last rank.
(F2): This function contains four important features with various non-linear additive relationships with the target. The ground-truth ranking of the features is X1, X4, X2, and X3. The remaining six uninformative features are tied in the last rank.
(F3): This function contains six important features with multiplicative interactions among one another. The ground-truth ranking of the features is X1, X2 and X3 tied in the first rank, X4, X5 and X6 tied in the second rank, and the remaining uninformative features tied in the third rank.
(F4): This function contains six important features with multiplicative interactions among one another and non-linear relationships with the target. The ground-truth ranking of the features is X1, X2 and X3 tied in the first rank, X4, X5 and X6 tied in the second rank, and the other four uninformative features tied in the third rank.
(F5): This function contains six important features with a variety of non-linear relationships with the target. The ground-truth ranking of the features is X1 and X2 tied in the first rank, X6 in the second, X3 in the third, X4 and X5 tied in the fourth, and the remaining uninformative features tied in the fifth.
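The dataset construction can be sketched as follows. The specific logit expression stands in for the elided F1-F5 formulas and is a hypothetical example; only the sampling scheme and the 0.5 threshold come from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20_000, 10
X = rng.standard_normal((n, d))          # x_i ~ N(0, 1), i = 1..10

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical stand-in for the F1-F5 formulas: a sigmoid of a function of a
# feature subset, thresholded at 0.5 to yield the binary target.
logits = X[:, 0] ** 2 + 0.5 * X[:, 1] ** 2 - 2.0
y = (sigmoid(logits) >= 0.5).astype(int)  # 1 if output in [0.5, 1), else 0
```

Because the sigmoid is monotone, thresholding its output at 0.5 is equivalent to thresholding the logit at 0.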
(C) Second-order benchmarking datasets. Ten synthetic regression datasets, referred to as F6-A, F7-A, F8-A, F9-A, and F10-A (-A datasets) and F6-B, F7-B, F8-B, F9-B, and F10-B (-B datasets), were created. The -A datasets followed the examples in Tsang et al. (2018) for the second-order interpretation benchmarking. The -B datasets used the same functions below to compute the target as the -A datasets, but included more uninformative features to benchmark the interpretation performance on high-dimensional data. Each -A dataset contained 5,000 instances and 13 input features. Each -B dataset contained 10,000 instances and 100 input features, only some of which were used to compute the target. In F7-A/B, F8-A/B, F9-A/B, and F10-A/B, the values of the input features of an instance were independently sampled from a uniform distribution: Xi~U(-1, 1), i ∈ {1, 2, ..., 13} in the -A datasets or i ∈ {1, 2, ..., 100} in the -B datasets. In the F6-A/B datasets, the values of the input features of an instance were independently sampled from two uniform distributions: Xi~U(0, 1), i ∈ {1, 2, 3, 6, 7, 9, 11, 12, 13} in the -A datasets and i ∈ {1, 2, 3, 6, 7, 9, 11, ..., 100} in the -B datasets; and Xi~U(0.6, 1), i ∈ {4, 5, 8, 10} in both. The value of the target for an instance was computed using the following five functions:
(F6-A) and (F6-B): This function contains eleven pairwise feature interactions: {(X1,X2), (X1,X3), (X2,X3), (X3,X5), (X7,X8), (X7,X9), (X7,X10), (X8,X9), (X8,X10), (X9,X10), (X2,X7)}.
(F7-A) and (F7-B): This function contains nine pairwise interactions: {(X1,X2), (X2,X3), (X3,X4), (X4,X5), (X4,X7), (X4,X8), (X5,X7), (X5,X8), (X7,X8)}.
(F8-A) and (F8-B): This function contains ten pairwise interactions: {(X1,X2), (X3,X4), (X5,X6), (X4,X7), (X5,X6), (X5,X8), (X6,X8), (X8,X9), (X8,X10), (X9,X10)}.
(F9-A) and (F9-B): This function contains thirteen pairwise interactions: {(X1,X2), (X1,X3), (X2,X3), (X2,X4), (X3,X4), (X1,X5), (X2,X5), (X3,X5), (X4,X5), (X6,X7), (X6,X8), (X7,X8), (X9,X10)}.
(F10-A) and (F10-B): cos(X1 * X2 * X3) + sin(X4 * X5 * X6). This function contains six pairwise interactions: {(X1,X2), (X1,X3), (X2,X3), (X4,X5), (X4,X6), (X5,X6)}.
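F10 is the one function whose formula survives in the text, so its -A dataset can be reproduced directly (the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5_000, 13                          # F10-A: 13 features, 5,000 instances
X = rng.uniform(-1.0, 1.0, size=(n, d))   # Xi ~ U(-1, 1)

# Target for F10: cos(X1*X2*X3) + sin(X4*X5*X6); X7..X13 are uninformative.
y = np.cos(X[:, 0] * X[:, 1] * X[:, 2]) + np.sin(X[:, 3] * X[:, 4] * X[:, 5])
```

The six ground-truth pairwise interactions are exactly the pairs within {X1, X2, X3} and within {X4, X5, X6}, as listed above.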
(D) Breast cancer dataset. The Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) project (Amos et al., 2017) generated a breast cancer dataset (NIH dbGaP accession number: phs001265.v1.p1) for genome-wide association study (GWAS) and predictive genomics. This cohort contained 26,053 case subjects with a malignant or in situ tumor and 23,058 control subjects with no tumor. The task for predictive genomics is a binary classification of subjects between cases and controls. The breast cancer dataset was processed using PLINK (Purcell et al., 2007), as described in Examples 1-4, to compute the statistical significance of the SNPs. Out of a total of 528,620 SNPs, 1541 SNPs had a p-value lower than 10^-6 and were used as the input features for predictive genomics. To benchmark the performance of the model interpretation, 1541 decoy SNPs were added as input features. The frequencies of homozygous minor alleles, heterozygous alleles, and homozygous dominant alleles were the same between the decoy SNPs and the real SNPs. Because the decoy SNPs have random relationships with the case/control phenotype, they should not be selected as important features or be included in salient interactions by model interpretation.
(E) Implementations and evaluation strategies. The California housing dataset was partitioned into a training set (70%), a validation set (20%), and a test set (10%). The eight input features were longitude, latitude, median age, total rooms, total bedrooms, population, households, and median income. The median house value was the target of the regression. All the input features were standardized to zero mean and unit standard deviation based on the training set. Feature standardization is critical for model interpretation in this case because the scale of a feature's importance scores is determined by the scale of its values, and comparing importance scores between features requires the features' values to be on the same scale. The LINA model comprised an input layer (8 neurons), five fully connected hidden layers (7, 6, 5, 4 and 3 neurons), and an attention layer (8 neurons) for the inner attention neural network, followed by a second input layer (8 neurons), a linearization layer (8 neurons), and an output layer (1 neuron). The hidden layers used ReLU as the activation function. No regularization was applied to the coefficient vector, and L1 regularization was applied to the attention vector (γ = 10^-6). The LINA model was trained using the Adam optimizer with a learning rate of 10^-2. The obtained LINA model achieved an RMSE of 71,055 on the test set. As a baseline for comparison, a gradient boosting model with 300 decision trees of maximum depth 5 achieved an RMSE of 77,852 on the test set.
For the first-order interpretation, each synthetic dataset was split into a cross-validation set (80%) for model training and hyperparameter optimization and a test set (20%) for performance benchmarking and model interpretation. A LINA model and a feedforward neural network (FNN) model were constructed using 10-fold cross-validation. For the first four synthetic datasets, the inner attention neural network in the LINA model had 3 layers containing 9 neurons in the first layer, 5 neurons in the second layer, and 10 neurons in the attention layer. The FNN had 3 hidden layers with the same number of neurons in each layer as the inner attention neural network in the LINA model. For the fifth function with more complex relationships, the first and second layers were widened to 100 and 25 neurons, respectively, in both the FNN and LINA models to achieve a predictive performance similar to that on the other datasets in their respective validation sets. Both the FNN and LINA models were trained using the Adam optimizer with a learning rate of 10^-2 and a mini-batch size of 32. The LINA model was trained with L2 regularization on the coefficient vector (β = 10^-4) and L1 regularization on the attention vector (γ = 10^-6). The values of β and γ were selected from 10^-2, 10^-3, 10^-4, 10^-5, 10^-6, 10^-7, and 0 based on the predictive performance of the LINA model on the validation set; no other hyperparameter tuning was performed. Batch normalization was used for both architectures. Both the FNN and LINA models achieved predictive performance of approximately 99% AUC on the test sets of the five first-order synthetic datasets, comparable to Chen et al. (2018). DeepLIFT (Shrikumar et al., 2017), LIME (Ribeiro et al., 2016), Grad*Input (Shrikumar et al., 2017), L2X (Chen et al., 2018) and Saliency (Simonyan et al., 2014) were used to interpret the FNN model and calculate the feature importance scores using their default configurations.
FP, DP, and IP scores were used as the first-order importance scores for the LINA model. The inventors compared the performances of the first-order interpretation of LINA with DeepLIFT, LIME, Grad*Input and L2X. The interpretation accuracy was measured using the Spearman rank correlation coefficient between the predicted ranking of features by their first-order importance and the ground-truth ranking. This metric was chosen because it encompasses both the selection and ranking of the important features.
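Because the ground-truth rankings include tied (uninformative) features, the Spearman coefficient must be computed with average ranks for ties. A minimal sketch (the exact tie-handling convention used in the study is not specified, so average ranks are an assumption):

```python
import numpy as np

def rank_average(v):
    """Ranks starting at 1, with tied values sharing the average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average of tied positions
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    ra, rb = rank_average(a), rank_average(b)
    ra -= ra.mean(); rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))
```

A perfect ranking (including correctly tied uninformative features) yields a coefficient of 1.0, matching the best entries reported in the benchmark.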
For the second-order interpretation benchmarking, each synthetic dataset was also split into a cross-validation set (80%) and a test set (20%). A LINA model, an FNN model for NID, and a Bayesian neural network (BNN) for GEH (Cui et al., 2020) were constructed based on the neural network architecture used in (Tsang et al., 2018) using 10-fold cross-validation. The inner attention neural network in the LINA model used 140 neurons in the first hidden layer, 100 neurons in the second hidden layer, 60 neurons in the third hidden layer, 20 neurons in the fourth hidden layer, and 13 neurons in the attention layer. The FNN model was composed of 4 hidden layers with the same number of neurons in each layer as LINA's inner attention neural network. The BNN model used the same architecture as the FNN model. The FNN, BNN and LINA models were trained using the Adam optimizer with a learning rate of 10^-3 and a mini-batch size of 32 for the -A datasets and 128 for the -B datasets. The LINA model was trained with L2 regularization on the coefficient vector (β = 10^-4), L1 regularization on the attention vector (γ = 10^-6), and batch normalization. Hyperparameter tuning was performed as described above to optimize the predictive performance. The FNN and BNN models were trained using the default regularization parameters of Cui et al. (2020) and Tsang et al. (2018). The FNN, BNN and LINA models all achieved R2 scores of more than 0.99 on the test sets of the five -A datasets, as in the examples in Tsang et al. (2018), while their R2 scores ranged from 0.91 to 0.93 on the test sets of the five high-dimensional -B datasets. Pairwise interactions in each dataset were identified from the BNN model using GEH (Cui et al., 2020), the FNN model using NID (Tsang et al., 2018), and the LINA model using the SP scores.
For GEH, the number of clusters was set to the number of features and the number of iterations was set to 20. NID was run using its default configuration. For a dataset with m pairs of ground-truth interactions, the top-m pairs with the highest interaction scores were selected from each algorithm’s interpretation output. The percentage of ground-truth interactions in the top-m predicted interactions (i.e., the precision) was used to benchmark the second-order interpretation performance of the algorithms.
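The top-m precision metric can be sketched as follows; the function name and toy scores are illustrative:

```python
def precision_at_m(scores, truth_pairs):
    """Precision of the top-m predicted pairs, where m is the number of
    ground-truth interactions. `scores` maps a feature pair to its
    interaction score (higher means stronger)."""
    m = len(truth_pairs)
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:m]
    truth = {frozenset(p) for p in truth_pairs}
    return sum(frozenset(pair) in truth for pair, _ in top) / m

# Toy example: two ground-truth pairs, three scored candidates.
scores = {(1, 2): 0.9, (2, 3): 0.8, (3, 4): 0.7}
truth = [(1, 2), (3, 4)]
```

Here the top-2 predictions are (1,2) and (2,3), of which only (1,2) is a true interaction, giving a precision of 0.5.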
For the breast cancer dataset, the 49,111 subjects were randomly divided into a training set (80%), a validation set (10%), and a test set (10%). The FNN model and the BNN model had 3 hidden layers with 1000, 250 and 50 neurons, as described in Examples 1-4, and used the same hyperparameters as in Examples 1-4. The inner attention neural network in the LINA model also used 1000, 250 and 50 neurons before the attention layer. All of these models had 3082 input neurons for the 1541 real SNPs and 1541 decoy SNPs. β was set to 0.01 and γ to 0, selected from 10^-2, 10^-3, 10^-4, 10^-5, 10^-6, 10^-7, and 0 based on the predictive performance of the LINA model on the validation set. Early stopping based on the validation AUC score was used during training. The FNN, BNN and LINA models achieved test AUCs of 64.8%, 64.8% and 64.7%, respectively, using both the 1541 real SNPs with p-values less than 10^-6 and the 1541 decoy SNPs. The test AUCs of these models were lower than that of the FNN model in Examples 1-4, which reached 67.4% using 5,273 real SNPs with p-values less than 10^-3 as input. As the same FNN architecture design was used in the two studies, the reduction in predictive performance in this study can be attributed to the more stringent p-value filtering, which retained only real SNPs with a high likelihood of a true association with the disease, and to the addition of decoy SNPs for benchmarking the interpretation performance.
DeepLIFT (Shrikumar et al., 2017), LIME (Ribeiro et al., 2016), Grad*Input (Shrikumar et al., 2017), L2X (Chen et al., 2018) and Saliency (Simonyan et al., 2014) were used to interpret the FNN model and calculate the feature importance scores using their default configurations. The FP score was used as the first-order importance score for the LINA model. After the SNPs were filtered at a given importance score threshold, the false discovery rate (FDR) was computed from the retained real and decoy SNPs above the threshold. The number of retained real SNPs was the total positive count for the FDR. The number of false positive hits (i.e., the number of unimportant real SNPs) within the retained real SNPs was estimated as the number of retained decoy SNPs. Thus, the FDR was estimated by dividing the number of retained decoy SNPs by the number of retained real SNPs. An importance-score-sorted list of SNPs from each algorithm was filtered at an increasingly stringent score threshold until reaching the desired FDR level. The interpretation performance of an algorithm was measured by the number of top-ranked features retained at 0.1%, 1% and 5% FDR and by the FDRs of the top-100 and top-200 SNPs ranked by the algorithm.
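The decoy-based FDR thresholding can be sketched as a walk down the score-sorted list; the function name and toy data are illustrative, not from the disclosure:

```python
import numpy as np

def retained_at_fdr(scores, is_decoy, fdr_target):
    """Walk down the score-sorted list; at each depth, the FDR is estimated as
    (# retained decoys) / (# retained reals). Returns the largest number of
    real features retained while the estimated FDR stays within the target."""
    order = np.argsort(scores)[::-1]       # highest importance score first
    decoys = reals = best = 0
    for idx in order:
        if is_decoy[idx]:
            decoys += 1
        else:
            reals += 1
            if decoys / reals <= fdr_target:
                best = reals
    return best

# Toy score list: two confident reals, then a decoy, a real, and a decoy.
scores = np.array([10.0, 9.0, 8.0, 7.0, 6.0])
is_decoy = np.array([False, False, True, False, True])
```

Tightening the FDR target shrinks the retained list: at 40% FDR three real features survive, but at 10% only the two above the first decoy do.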
For the second-order interpretation, pairwise interactions were identified from the BNN model using GEH (Cui et al., 2020), from the FNN model using NID (Tsang et al., 2018), and from the LINA model using the SP scores. For GEH, the number of clusters was set to 20 and the number of iterations was set to 20. While LINA and NID used all 4,911 subjects in the test set and completed their computation within an hour, the GEH results were computed for only 1000 random subjects in the test set over more than two days, because GEH, with its O(n²) computing cost in the number of subjects n, would have taken approximately two months to process the entire test set. NID was run using its default configuration on the FNN model. The interpretation accuracy was also measured by the numbers of top-ranked pairwise interactions detected at 0.1%, 1% and 5% FDR and by the FDRs of the top-1000 and top-2000 interaction pairs ranked by each algorithm. A SNP pair was considered a false positive if one or both of the SNPs in the pair was a decoy.
(A) Demonstration of LINA on a real-world application. In this section, the inventors demonstrate LINA using the California housing dataset, which has been used in previous model interpretation studies for algorithm demonstration in Cui et al. (2020) and Tsang et al. (2018). Four types of interpretations from LINA were presented, including the instance-wise first-order interpretation, the instance-wise second-order interpretation, the model-wise first-order interpretation, and the model-wise second-order interpretation.
1) Instance-wise interpretation. Table 9 shows the prediction and interpretation results of the LINA model for an instance (district # 20444) that had a true median price of $208,600. The predicted price of $285,183 was simply the sum of the eight element-wise products of the attention, coefficient, and feature columns plus the bias. This provided an easily understandable representation of the intermediate computation behind the prediction for this instance. For example, the median age feature had a coefficient of 213 in the model. For this instance, the median age feature had an attention weight of -275, which switched the median age to a negative feature and amplified its direct effect on the predicted price in this district.
The product of the attention weight and coefficient yielded the direct importance score of the median age feature (i.e., DQ = -58,524), which represented the strength of the local linear association between the median age feature and the predicted price for this instance. By assuming that the attention weights of this instance are fixed, one can expect a decrease of $58,524 in the predicted price for an increase in the median age by one standard deviation (12.28 years) for this district. But this did not consider the effects of the median age increase on the attention weights, which was accounted for by its indirect importance score (i.e., IQ = 91,930). The positive IQ indicated that a higher median age would increase the attention weights of other positive features and increase the predicted price indirectly. Combining the DQ and IQ, the positive FQ of 33,407 marked the median age to be a significant positive feature for the predicted price, perhaps through the correlation with some desirable variables for this district. This example suggested a limitation of using the attention weights themselves to evaluate the importance of features in the attentional architectures. The full importance scores represented the total effect of a feature’s change on the predicted price. For this instance, the latitude feature had the largest impact on the predicted price.
Table 10 presents a second-order interpretation of the prediction for this instance. The median age row in Table 10 shows how the median age feature impacted the attention weights of the other features. The two large positive SQ values of median age to the latitude and longitude features indicated significant increases in the two location features' attention weights with increasing median age. In other words, location became a more important determinant of the predicted price for districts with older houses. The total bedroom feature received a large positive attention weight for this instance. The total bedroom column in Table 10 shows that the longitude and latitude features were the two most important determinants of the attention weight of the total bedroom feature. This suggested how a location change may alter the direct importance of the total bedroom feature for the price prediction of this district.
2) Model-wise interpretation.
Some significant differences existed between the instance-wise interpretation and the model-wise interpretation (e.g., Table 9 vs.
(B) Benchmarking of the first-order and second-order interpretations using synthetic datasets. In real-world applications, the true importance of features for prediction cannot be determined with certainty and may vary among different models. Therefore, previous studies on model interpretation (Ribeiro et al., 2016; Cui et al., 2020) benchmarked their interpretation performance using synthetic datasets with known ground-truth feature importance. In this study, the inventors also compared the interpretation performance of LINA with the SOTA methods using synthetic datasets created as in previous studies (Chen et al., 2018; Tsang et al., 2018).
The performance of the first-order interpretation of LINA was compared with DeepLIFT, LIME, Grad*Input and L2X (Table 11). The three first-order importance scores from LINA, including FP, DP and IP, were tested. The DP score performed the worst among the three, especially in the F3 and F4 datasets which contained interactions among three features. This suggested the limitation of using attention weights as a measure of feature importance. The FP score provided the most accurate ranking among the three LINA scores because it accounted for the direct contribution of a feature and its indirect contribution through attention weights. The first-order importance scores were then compared among different algorithms. L2X and LIME distinguished many important features correctly from un-informative features, but their rankings of the important features were often inaccurate. The gradient-based methods produced mostly accurate rankings of the features based on their first-order importance. Their interpretation accuracy generally decreased in datasets containing interactions among more features. Among all the methods, the LINA FP scores provided the most accurate ranking of the features on average.
The performance of the second-order interpretation of LINA was compared with those of GEH and NID (Table 12). There were a total of 78 possible pairs of interactions among 13 features in each -A synthetic dataset and there were 4950 possible pairs of interactions among 100 features in each -B synthetic dataset. The precision from random guesses was only ∼12.8% on average in the -A datasets and less than 1% in the -B datasets. The three second-order algorithms all performed significantly better than the random guess. In the -A datasets, the average precision of LINA SP was ∼80%, which was ∼12% higher than that of NID and ∼29% higher than that of GEH. The addition of 87 un-informative features in the -B datasets reduced the average precision of LINA by ∼15%, that of NID by ∼13%, and that of GEH by ∼22%. In the -B datasets, the average precision of LINA SP was ∼65%, which was ∼9% higher than that of NID and ∼35% higher than that of GEH. This indicates that more accurate second-order interpretations can be obtained from the LINA models.
(Tables 11 and 12: per-dataset Spearman rank correlation coefficients for the first-order interpretation benchmark and precision values for the second-order interpretation benchmark.)
(C) Benchmarking of the first-order and second-order interpretation using a predictive genomics application. As the performance benchmarks in synthetic datasets may not reflect those in real-world applications, the inventors engineered a real-world benchmark based on a breast cancer dataset for predictive genomics. While it was unknown which SNPs and which SNP interactions were truly important for phenotype prediction, the decoy SNPs added by the inventors were truly unimportant. Moreover, a decoy SNP cannot have a true interaction, such as XOR or multiplication, with a real SNP to have a joint impact on the disease outcome. Thus, if a decoy SNP or an interaction with a decoy SNP is ranked by an algorithm as important, it should be considered a false positive detection. As the number of decoy SNPs was the same as the number of real SNPs, the false discovery rate can be estimated by assuming that an algorithm makes as many false positive detections from the decoy SNPs as from the real SNPs. This allowed the inventors to compare the number of positive detections by an algorithm at certain FDR levels.
The first-order interpretation performance of LINA was compared with those of DeepLIFT, LIME, Grad*Input and L2X (Table 13). At 0.1%, 1%, and 5% FDR, LINA identified more important SNPs than other algorithms. LINA also had the lowest FDRs for the top-100 and top-200 SNPs. The second-order interpretation performance of LINA was compared with those of NID and GEH (Table 14). At 0.1%, 1%, and 5% FDR, LINA identified more pairs of important SNP interactions than NID and GEH did. LINA had lower FDRs than the other algorithms for the top-1000 and top-2000 SNP pairs. Both L2X and GEH failed to output meaningful importance scores in this predictive genomics dataset. Because GEH needed to compute the full Hessian, it was also much more computationally expensive than the other algorithms.
The existing model interpretation algorithms and LINA can provide rankings of the features or feature interactions based on their importance scores at arbitrary scales. The inventors demonstrated that decoy features can be used in real-world applications to set thresholds for first-order and second-order importance scores based on the FDRs of retained features and feature pairs. This provided an uncertainty quantification of the model interpretation results without knowing the ground-truth in real-world applications.
The predictive genomics application provided a real-world test of the interpretation performance of these algorithms. In comparison with the synthetic datasets, the predictive genomics dataset was more challenging for model interpretation because of the low predictive performance of the models and the large number of input features. For this real-world application, LINA was shown to provide better first-order and second-order interpretation performance than existing algorithms at the model-wise level. Furthermore, LINA can provide instance-wise interpretation to identify important SNPs and SNP interactions for the prediction of individual subjects. Model interpretation is important for making biological discoveries from predictive models, because first-order interpretation can identify individual genes involved in a disease (Rivandi et al., 2018; Romualdo Cardoso et al., 2021) and second-order interpretation can uncover epistatic interactions among genes for a disease (Shaker & Senousy, 2019; van de Haar et al., 2019). These discoveries may provide new drug targets (Wang et al., 2018; Gao et al., 2019; Gonçalves et al., 2020) and enable personalized formulation of treatment plans (We et al., 2015; Zhao et al., 2021; Velasco-Ruiz et al., 2021) for breast cancer.
Conclusion. In this study, the inventors designed a new neural network architecture, referred to as LINA, for model interpretation. LINA uses a linearization layer on top of a deep inner attention neural network to generate a linear representation of model prediction. LINA provides the unique capability of offering both first-order and second-order interpretations and both instance-wise and model-wise interpretations. The interpretation performance of LINA was benchmarked to be higher than the existing algorithms on synthetic datasets and a predictive genomics dataset.
While the compositions, apparatus, and methods of this disclosure have been described in terms of particular embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the disclosure. All such similar variations and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the inventive concepts as defined by the appended claims.
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference in their entireties.
Amos et al., “The OncoArray Consortium: A Network for Understanding the Genetic Architecture of Common Cancers,” Cancer Epidemiol Biomarkers Prev , vol. 26, no. 1, pp. 126-135, January 2017, doi: 10.1158/1055-9965.EPI-16-0106.
Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, “Deep learning for computational biology,” Molecular Systems Biology, vol. 12, no. 7, p. 878, July 2016, doi: 10.15252/msb.20156651.
Badre, L. Zhang, W. Muchero, J. C. Reynolds, and C. Pan, “Deep neural network improves the estimation of polygenic risk scores for breast cancer,” Journal of Human Genetics, pp. 1-11, October 2020, doi: 10.1038/s10038-020-00832-7.
Baltres et al., “Prediction of Oncotype DX recurrence score using deep multi-layer perceptrons in estrogen receptor-positive, HER2-negative breast cancer,” Breast Cancer, vol. 27, no. 5, pp. 1007-1016, September 2020, doi: 10.1007/s12282-020-01100-4.
Bellot, G. de los Campos, and M. Pérez-Enciso, “Can Deep Learning Improve Genomic Prediction of Complex Human Traits?,” Genetics, vol. 210, no. 3, pp. 809-819, November 2018, doi: 10.1534/genetics.118.301298.
Bengio, “Learning Deep Architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1-127, January 2009, doi: 10.1561/2200000006.
Bermeitinger, T. Hrycej, and S. Handschuh, “Representational Capacity of Deep Neural Networks — A Computing Study,” Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, pp. 532-538, 2019, doi: 10.5220/0008364305320538.
Cesaratto et al., “BNC2 is a putative tumor suppressor gene in high-grade serous ovarian carcinoma and impacts cell survival after oxidative stress,” Cell Death & Disease, vol. 7, no. 9, Art. no. 9, September 2016, doi: 10.1038/cddis.2016.278.
Chan et al., “Evaluation of three polygenic risk score models for the prediction of breast cancer risk in Singapore Chinese,” Oncotarget, vol. 9, no. 16, pp. 12796-12804, January 2018, doi: 10.18632/oncotarget.24374.
Chang, C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell, and J. J. Lee, “Second-generation PLINK: rising to the challenge of larger and richer datasets,” Gigascience, vol. 4, no. 1, December 2015, doi: 10.1186/s13742-015-0047-8.
Chen, L. Song, M. Wainwright, and M. Jordan, “Learning to Explain: An Information-Theoretic Perspective on Model Interpretation,” in Proceedings of the 35th International Conference on Machine Learning, July 2018, pp. 883-892. Accessed: Nov. 04, 2021. [Online]. Available: https://proceedings.mlr.press/v80/chen18j.html
Clark, B. P. Kinghorn, J. M. Hickey, and J. H. van der Werf, “The effect of genomic information on optimal contribution selection in livestock breeding programs,” Genetics Selection Evolution, vol. 45, no. 1, p. 44, October 2013, doi: 10.1186/1297-9686-45-44.
Cordell, “Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans,” Human Molecular Genetics, vol. 11, no. 20, pp. 2463-2468, October 2002, doi: 10.1093/hmg/11.20.2463.
Cudic, H. Baweja, T. Parhar, and S. Nuske, “Prediction of Sorghum Bicolor Genotype from In-Situ Images Using Autoencoder-Identified SNPs,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), December 2018, pp. 23-31, doi: 10.1109/ICMLA.2018.00012.
Cui, P. Marttinen, and S. Kaski, “Learning Global Pairwise Interactions with Bayesian Neural Networks,” in ECAI 2020 — 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), 2020, vol. 325, pp. 1087-1094. doi: 10.3233/FAIA200205.
Dandl, C. Molnar, M. Binder, and B. Bischl, “Multi-Objective Counterfactual Explanations,” in Parallel Problem Solving from Nature - PPSN XVI, Cham, 2020, pp. 448-469. doi: 10.1007/978-3-030-58112-1_31.
Dayem Ullah, J. Oscanoa, J. Wang, A. Nagano, N. R. Lemoine, and C. Chelala, “SNPnexus: assessing the functional relevance of genetic variation to facilitate the promise of precision medicine,” Nucleic Acids Res, vol. 46, no. W1, pp. W109-W113, July 2018, doi: 10.1093/nar/gky399.
De, W. S. Bush, and J. H. Moore, “Bioinformatics Challenges in Genome-Wide Association Studies (GWAS),” in Clinical Bioinformatics, R. Trent, Ed. New York, NY: Springer, 2014, pp. 63-81.
Do and N. Q. K. Le, “Using extreme gradient boosting to identify origin of replication in Saccharomyces cerevisiae via hybrid features,” Genomics, vol. 112, no. 3, pp. 2445-2451, May 2020, doi: 10.1016/j.ygeno.2020.01.017.
Dudbridge, “Power and Predictive Accuracy of Polygenic Risk Scores,” PLOS Genetics, vol. 9, no. 3, p. e1003348, March 2013, doi: 10.1371/journal.pgen.1003348.
Fergus, A. Montanez, B. Abdulaimma, P. Lisboa, C. Chalmers, and B. Pineles, “Utilising Deep Learning and Genome Wide Association Studies for Epistatic-Driven Preterm Birth Classification in African-American Women,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp. 1-1, 2018, doi: 10.1109/TCBB.2018.2868667.
Fernald and M. Kurokawa, “Evading apoptosis in cancer,” Trends in Cell Biology, vol. 23, no. 12, pp. 620-633, December 2013, doi: 10.1016/j.tcb.2013.07.006.
Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189-1232, 2001.
Gao, Y. Quan, X.-H. Zhou, and H.-Y. Zhang, “PheWAS-Based Systems Genetics Methods for Anti-Breast Cancer Drug Discovery,” Genes, vol. 10, no. 2, Art. no. 2, February 2019, doi: 10.3390/genes10020154.
Ge, C.-Y. Chen, Y. Ni, Y.-C. A. Feng, and J. W. Smoller, “Polygenic prediction via Bayesian regression and continuous shrinkage priors,” Nature Communications, vol. 10, no. 1, pp. 1-10, April 2019, doi: 10.1038/s41467-019-09718-5.
Gola, J. Erdmann, B. Müller-Myhsok, H. Schunkert, and I. R. König, “Polygenic risk scores outperform machine learning methods in predicting coronary artery disease status,” Genetic Epidemiology, vol. 44, no. 2, pp. 125-138, March 2020, doi: 10.1002/gepi.22279.
Gonçalves et al., “Drug mechanism-of-action discovery through the integration of pharmacological and CRISPR screens,” Molecular Systems Biology, vol. 16, no. 7, p. e9405, 2020, doi: 10.15252/msb.20199405.
Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2nd ed. New York: Springer-Verlag, 2009.
Hastie, S. Rosset, J. Zhu, and H. Zou, “Multi-class adaboost,” Statistics and its Interface, vol. 2, no. 3, pp. 349-360, 2009.
Ho Thanh Lam et al., “Machine Learning Model for Identifying Antioxidant Proteins Using Features Calculated from Primary Sequences,” Biology, vol. 9, no. 10, Art. no. 10, October 2020, doi: 10.3390/biology9100325.
Ho, W. Schierding, M. Wake, R. Saffery, and J. O’Sullivan, “Machine Learning SNP Based Prediction for Precision Medicine,” Front. Genet., vol. 10, 2019, doi: 10.3389/fgene.2019.00267.
Hsieh et al., “A polygenic risk score for breast cancer risk in a Taiwanese population,” Breast Cancer Res Treat, vol. 163, no. 1, pp. 131-138, May 2017, doi: 10.1007/s10549-017-4144-5.
International Schizophrenia Consortium et al., “Common polygenic variation contributes to risk of schizophrenia and bipolar disorder,” Nature, vol. 460, no. 7256, pp. 748-752, August 2009, doi: 10.1038/nature08185.
Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv:1502.03167 [cs], March 2015, Accessed: Nov. 25, 2019. [Online]. Available: http://arxiv.org/abs/1502.03167.
Kelley Pace and R. Barry, “Sparse spatial autoregressions,” Statistics & Probability Letters, vol. 33, no. 3, pp. 291-297, May 1997, doi: 10.1016/S0167-7152(96)00140-X
Khera et al., “Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations,” Nat Genet, vol. 50, no. 9, pp. 1219-1224, September 2018, doi: 10.1038/s41588-018-0183-z.
Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 3rd International Conference for Learning Representations, 2015, Accessed: Nov. 26, 2019. [Online]. Available: http://arxiv.org/abs/1412.6980.
Kolch, M. Halasz, M. Granovskaya, and B. N. Kholodenko, “The dynamic control of signal transduction networks in cancer cells,” Nature Reviews Cancer, vol. 15, no. 9, Art. no. 9, September 2015, doi: 10.1038/nrc3983.
LeBlanc and C. Kooperberg, “Boosting predictions of treatment success,” Proc Natl Acad Sci USA, vol. 107, no. 31, pp. 13559-13560, August 2010, doi: 10.1073/pnas.1008052107.
Lee et al., “Candidate gene approach evaluates association between innate immunity genes and breast cancer risk in Korean women,” Carcinogenesis, vol. 30, no. 9, pp. 1528-1531, September 2009, doi: 10.1093/carcin/bgp084.
Li et al., “NOS1 upregulates ABCG2 expression contributing to DDP chemoresistance in ovarian cancer cells,” Oncology Letters, vol. 17, no. 2, pp. 1595-1602, February 2019, doi: 10.3892/ol.2018.9787.
Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” in Advances in Neural Information Processing Systems, 2017, vol. 30. Accessed: Jan. 31, 2022. [Online]. Available: https://papers.nips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
Maier et al., “Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder,” Am. J. Hum. Genet., vol. 96, no. 2, pp. 283-294, February 2015, doi: 10.1016/j.ajhg.2014.12.006.
Mao and J. D. Unadkat, “Role of the Breast Cancer Resistance Protein (BCRP/ABCG2) in Drug Transport—an Update,” AAPS J, vol. 17, no. 1, pp. 65-82, January 2015, doi: 10.1208/s12248-014-9668-6.
Mavaddat et al., “Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes,” The American Journal of Human Genetics, vol. 104, no. 1, pp. 21-34, January 2019, doi: 10.1016/j.ajhg.2018.11.002.
Meuwissen, B. J. Hayes, and M. E. Goddard, “Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps,” Genetics, vol. 157, no. 4, pp. 1819-1829, April 2001.
Michailidou et al., “Association analysis identifies 65 new breast cancer risk loci,” Nature, vol. 551, no. 7678, pp. 92-94, November 2017, doi: 10.1038/nature24284.
Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2019. [Online]. Available: https://christophm.github.io/interpretable-ml-book/
Nelson, K. Tyne, A. Naik, C. Bougatsos, B. K. Chan, and L. Humphrey, “Screening for Breast Cancer: An Update for the U.S. Preventive Services Task Force,” Annals of Internal Medicine, vol. 151, no. 10, pp. 727-737, November 2009, doi: 10.7326/0003-4819-151-10-200911170-00009.
NIH, “Female Breast Cancer - Cancer Stat Facts.” https://seer.cancer.gov/statfacts/html/breast.html (accessed Dec. 03, 2019).
O’Connor, “Targeting the DNA Damage Response in Cancer,” Molecular Cell, vol. 60, no. 4, pp. 547-560, November 2015, doi: 10.1016/j.molcel.2015.10.040.
Oeffinger et al., “Breast Cancer Screening for Women at Average Risk: 2015 Guideline Update From the American Cancer Society,” JAMA, vol. 314, no. 15, pp. 1599-1614, October 2015, doi: 10.1001/jama.2015.12783.
Phillips, “Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems,” Nat Rev Genet, vol. 9, no. 11, pp. 855-867, November 2008, doi: 10.1038/nrg2452.
Purcell et al., “PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses,” The American Journal of Human Genetics, vol. 81, no. 3, pp. 559-575, September 2007, doi: 10.1086/519795.
Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” in Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2016, pp. 1135-1144, doi: 10.1145/2939672.2939778.
Rivandi, J. W. M. Martens, and A. Hollestelle, “Elucidating the Underlying Functional Mechanisms of Breast Cancer Susceptibility Through Post-GWAS Analyses,” Frontiers in Genetics, vol. 9, 2018, Accessed: Feb. 01, 2022. [Online]. Available: frontiersin.org/article/10.3389/fgene.2018.00280
Romualdo Cardoso, A. Gillespie, S. Haider, and O. Fletcher, “Functional annotation of breast cancer risk loci: current progress and future directions,” Br J Cancer, pp. 1-13, November 2021, doi: 10.1038/s41416-021-01612-6.
Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85-117, January 2015, doi: 10.1016/j.neunet.2014.09.003.
Scott et al., “An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans,” Diabetes, May 2017, doi: 10.2337/db16-1253.
Shaker and M. A. Senousy, “Association of SNP-SNP interactions Between RANKL, OPG, CHI3L1, and VDR Genes With Breast Cancer Risk in Egyptian Women,” Clinical Breast Cancer, vol. 19, no. 1, pp. e220-e238, February 2019, doi: 10.1016/j.clbc.2018.09.004.
Shrikumar, P. Greenside, and A. Kundaje, “Learning Important Features Through Propagating Activation Differences,” in International Conference on Machine Learning, July 2017, pp. 3145-3153, Accessed: Nov. 11, 2019. [Online]. Available: http://proceedings.mlr.press/v70/shrikumar17a.html.
Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” 2014.
Sorokina, R. Caruana, and M. Riedewald, “Additive Groves of Regression Trees,” in Proceedings of the 18th European conference on Machine Learning, Berlin, Heidelberg, September 2007, pp. 323-334. doi: 10.1007/978-3-540-74958-5_31.
Speed and D. J. Balding, “MultiBLUP: improved SNP-based prediction for complex traits,” Genome Res., vol. 24, no. 9, pp. 1550-1557, September 2014, doi: 10.1101/gr.169375.113.
Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014.
Tinholt et al., “Increased coagulation activity and genetic polymorphisms in the F5, F10 and EPCR genes are associated with breast cancer: a case-control study,” BMC Cancer, vol. 14, November 2014, doi: 10.1186/1471-2407-14-845.
Tsang, D. Cheng, and Y. Liu, “Detecting Statistical Interactions from Neural Network Weights,” 2018.
Tsuboi et al., “Prognostic significance of GAD1 overexpression in patients with resected lung adenocarcinoma,” Cancer Medicine, vol. 8, no. 9, pp. 4189-4199, 2019, doi: 10.1002/cam4.2345.
van de Haar, S. Canisius, M. K. Yu, E. E. Voest, L. F. A. Wessels, and T. Ideker, “Identifying Epistasis in Cancer Genomes: A Delicate Affair,” Cell, vol. 177, no. 6, pp. 1375-1383, May 2019, doi: 10.1016/j.cell.2019.05.005.
Velasco-Ruiz et al., “POLRMT as a Novel Susceptibility Gene for Cardiotoxicity in Epirubicin Treatment of Breast Cancer Patients,” Pharmaceutics, vol. 13, no. 11, Art. no. 11, November 2021, doi: 10.3390/pharmaceutics13111942.
Vilhjálmsson et al., “Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores,” Am J Hum Genet, vol. 97, no. 4, pp. 576-592, October 2015, doi: 10.1016/j.ajhg.2015.09.001.
Wang, J. Ingle, and R. Weinshilboum, “Pharmacogenomic Discovery to Function and Mechanism: Breast Cancer as a Case Study,” Clinical Pharmacology & Therapeutics, vol. 103, no. 2, pp. 243-252, 2018, doi: 10.1002/cpt.915.
Wei et al., “From Disease Association to Risk Assessment: An Optimistic View from Genome-Wide Association Studies on Type 1 Diabetes,” PLOS Genetics, vol. 5, no. 10, p. e1000678, October 2009, doi: 10.1371/journal.pgen.1000678.
Wen et al., “Prediction of breast cancer risk based on common genetic variants in women of East Asian ancestry,” Breast Cancer Research, vol. 18, no. 1, p. 124, December 2016, doi: 10.1186/s13058-016-0786-1.
Whittaker, I. Royzman, and T. L. Orr-Weaver, “Drosophila Double parked: a conserved, essential replication protein that colocalizes with the origin recognition complex and links DNA replication with mitosis and the down-regulation of S phase transcripts,” Genes Dev, vol. 14, no. 14, pp. 1765-1776, July 2000.
Wu et al., “A genome-wide association study identifies WT1 variant with better response to 5-fluorouracil, pirarubicin and cyclophosphamide neoadjuvant chemotherapy in breast cancer patients,” Oncotarget, vol. 7, no. 4, pp. 5042-5052, November 2015, doi: 10.18632/oncotarget.5837.
Xu, N. Wang, T. Chen, and M. Li, “Empirical Evaluation of Rectified Activations in Convolutional Network,” arXiv:1505.00853 [cs, stat], November 2015, Accessed: Nov. 25, 2019. [Online]. Available: http://arxiv.org/abs/1505.00853.
Yin et al., “Using the structure of genome data in the design of deep neural networks for predicting amyotrophic lateral sclerosis from genotype,” bioRxiv, p. 533679, January 2019, doi: 10.1101/533679.
Zhao, J. Li, Z. Liu, and S. Powers, “Combinatorial CRISPR/Cas9 Screening Reveals Epistatic Networks of Interacting Tumor Suppressor Genes and Therapeutic Targets in Human Breast Cancer,” Cancer Res, vol. 81, no. 24, pp. 6090-6105, December 2021, doi: 10.1158/0008-5472.CAN-21-2555.
The present application claims the priority benefit of U.S. Provisional Application No. 63/241,645, filed Sep. 8, 2021, the entire contents of which is incorporated herein by reference.