This document describes technology that can be used for detecting an etiological factor of a disease in a subject having the disease.
Etiology is the study of causation, or origination. More completely, etiology is the study of the causes, origins, or reasons behind the way that things are, or the way they function, or it can refer to the causes themselves. The word is commonly used in medicine, (where it is a branch of medicine studying causes of disease) and in philosophy, but also in physics, psychology, government, geography, spatial analysis, theology, and biology, in reference to the causes or origins of various phenomena.
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence.
Etiological factors can be detected for various diseases, including cancers. For example, over the past decade a personalized approach for cancer diagnosis and treatment has evolved to include both genotypic and phenotypic characteristics of patient specific tumors. The identification and characterization of “driver DNA mutations” has been a critical aspect of defining cancer beyond tumor origin and morphology. These driver mutations have created an entirely new approach for development of targeted therapeutics such as Keytruda/PDL1 biomarker (DNA mismatch repair deficiency), Vitrakvi/NTRK gene fusion, and Rozlytrek/NTRK genetic mutation. However, linking biologically relevant “DNA mutations” to actionable and effective outcomes and development of new strategies to deliver “precision, personalized, preventive medicines” goals requires analyzing molecular data which deciphers the “history and footprints” of carcinogen forces, specific driver mutations but also global mutational signatures. This document provides supervised, machine-learning techniques that can identify signatures, called SuperSigs, that can have immediate applications for both prevention and therapy selection. For example, the methods described herein can enable the combination of knowledge about local molecular features (e.g. hot spot “driver mutations”) with global landscape features (e.g. the mutation rate of Cytosine to Adenine representing global damage to the DNA by carcinogens) to determine the optimal treatment choice or the probability of survival of a patient.
As demonstrated herein the SuperSigs technology described herein, contrary to current unsupervised and/or local feature approaches, can be used to enable precision medicine, by assigning patients to different cancer treatment regimens based on their mutational history. Availability of highly curated database signatures as a basis of defining the driving causes of mutations can enable clinicians to adopt a genome-wide holistic approach towards patient management by integrating endogenous, environmental, and inherited factors that are underlying the deadly “mutational DNA signatures”: a highly curated database of “mutational DNA signatures” created through the combination of thousands of human genome sequences with highly sophisticated analytical and mathematical algorithms to establish the footprints that lead up to the transformation of genes.
In one aspect, this document features methods for detecting an etiological factor of a disease in a subject having the disease. The methods can include, or consist essentially of, receiving training data that includes data objects each recording i) a disease label, ii) at least one corresponding mutational signature, and iii) corresponding etiological tags. The methods can include generating a first set of features based on single nucleotide mutations. The methods can include generating a second set of features based on dinucleotide mutations. The methods can include training a machine learning model on the first set of features and on the second set of features. The methods can include generating, from the machine learning model, a classifier that is configured to: operate by receiving a new-genomic-data-object, the new-genomic-data-object specific to the subject having the disease; and generate, from the new-genomic-data-object, a etiological-classification for the new-genomic-data-object, the etiological-classification indicating a corresponding etiological factor that matches one of the etiological tags. The methods can include receiving the subject's genome. The methods can include generating, from the subject's genome, a subject-genomic-data-object for the subject. The methods can include detecting an etiological factor for the subject by providing the subject-genomic-data-object to the classifier. In addition to the methods, computer-readable media, systems, devices, and software may be used.
In some aspects, the first set of features are possible substitutions of single nucleotides of a group consisting of C>A, C>G, C>T, T>A, T>C, and T>G.
In some aspects, the first set of features are defined using a pyrimidine of the mutated Watson-Crick base pair.
In some aspects, a third set of features is generated based on trinucleotide mutations, wherein training the machine learning model further comprises training the machine learning model on the third set of features.
In some aspects, a fourth set of features is generated based on all mutations, wherein training the machine learning model further comprises training the machine learning model on the fourth set of features.
In some aspects, training of the machine learning model comprises organizing the features into a partition tree that includes layers of nodes, each node representing a particular type of mutation and each child of the node representing possible mutations that are a type of mutation in the particular node.
In some aspects, the training of the machine learning model further comprises pruning the partition tree by removing a pruned node and all other nodes that are children of the pruned node.
In some aspects, the training of the machine learning model comprises selecting some, but not all, of the nodes as candidate nodes to be used for candidate testing; and testing the candidate nodes to generate first-phase candidate nodes.
In some aspects, training of the machine learning model further comprises:
generating second-phase candidates by, for each particular first-phase candidate node, adjusting a value for each parent node that is also a first-phase candidate node, the adjustment being based on the particular first-phase candidate node; selecting, as a second-phase candidate, a first-phase candidate with a remaining value above a threshold value.
In some aspects, training of the machine learning model further comprises generating final candidates by combining second-phase candidates of training data that did have a particular tag with training data that did not have the particular tag.
In some aspects, hypermethylation and hypomethylation are considered similarly and independently.
In some aspects, the disease is a cancer.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. Although methods and materials similar or equivalent to those described herein can be used to practice the invention, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
A training data object 600 can include data objects (e.g., rows in a table) that record i) a disease label, ii) at least one corresponding mutational signature, and iii) corresponding etiological tags.
Mutation features 602 can include data objects (e.g., rows in a table) that record features and one or more associated values for these features. Various mutation features may be associated with different kinds of mutations. For example, some mutation features 604 may be based on single nucleotide mutations (e.g., possible substitutions of single nucleotides of a group consisting of C>A, C>G, C>T, T>A, T>C, and T>G, and/or defined using a pyrimidine of the mutated Watson-Crick base pair). For example, some mutation features 604 may be based on dinucleotide mutations. For example, some mutation features 604 may be based on trinucleotide mutations. For example, some mutation features 604 may be based on all mutation types. Other types of mutations may be possible.
A genomic data object 604 can include variables for genes and non-genetic values. An etiologic factor classifier 606 or classifiers can receive a new genomic data object 604 and generate and etiologic classifications 604. The etiologic classifications 604 can indicate a corresponding etiological factor that matches one of the etiological tags.
Training data is received 702.
Sets of features are generated from nucleotide mutations 704 until all groups of mutations are processed 706.
A machine learning model is trained 708 on the features.
Training of the machine learning model comprises organizing the features into a partition tree that includes layers of nodes, each node representing a particular type of mutation and each child of the node representing possible mutations that are a type of mutation in the particular node.
Training of the machine learning model further comprises pruning the partition tree by removing a pruned node and all other nodes that are children of the pruned node.
Training of the machine learning model comprises selecting some, but not all, of the nodes as candidate nodes to be used for candidate testing; and testing the candidate nodes to generate first-phase candidate nodes.
Training of the machine learning model further comprises generating second-phase candidates by for each particular first-phase candidate node, adjusting a value for each parent node that is also a first-phase candidate node, the adjustment being based on the particular first-phase candidate node; selecting, as a second-phase candidate, a first-phase candidate with a remaining value above a threshold value.
Classifiers are generated 710. The classifiers are configured to operate by receiving a new-genomic-data-object, the new-genomic-data-object specific to the subject having the disease and generate, from the new-genomic-data-object, a etiological-classification for the new-genomic-data-object, the etiological-classification.
Training of the machine learning model further comprises: generating final candidates by: combining second-phase candidates of training data that did have a particular tag with training data that did not have the particular tag.
A subject's genome is received 712 as a subject-genomic-data-object.
Etiologic factor(s) are detected 714 by providing the subject-genomic-data-object to the classifier.
The memory 820 stores information within the computing system 800. In some implementations, the memory 820 is a computer-readable medium. In some implementations, the memory 820 is a volatile memory unit. In some implementations, the memory 820 is a non-volatile memory unit.
The storage device 830 is capable of providing mass storage for the computing system 800. In some implementations, the storage device 830 is a computer-readable medium. In various different implementations, the storage device 830 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device.
The input/output device 840 provides input/output operations for the computing system 800. In some implementations, the input/output device 840 includes a keyboard and/or pointing device. In some implementations, the input/output device 840 includes a display unit for displaying graphical user interfaces.
Some features described can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The apparatus can be implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine-readable storage device, for execution by a programmable processor; and method steps can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output. The described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer will also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM (compact disc read-only memory) and DVD-ROM (digital versatile disc read-only memory) disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
To provide for interaction with a user, some features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.
The computing device 900 includes a processor 902, a memory 904, a storage device 906, a high-speed interface 908 connecting to the memory 904 and multiple high-speed expansion ports 910, and a low-speed interface 912 connecting to a low-speed expansion port 914 and the storage device 906. Each of the processor 902, the memory 904, the storage device 906, the high-speed interface 908, the high-speed expansion ports 910, and the low-speed interface 912, are interconnected using various busses, and can be mounted on a common motherboard or in other manners as appropriate. The processor 902 can process instructions for execution within the computing device 900, including instructions stored in the memory 904 or on the storage device 906 to display graphical information for a GUI on an external input/output device, such as a display 916 coupled to the high-speed interface 908. In other implementations, multiple processors and/or multiple buses can be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices can be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 904 stores information within the computing device 900. In some implementations, the memory 904 is a volatile memory unit or units. In some implementations, the memory 904 is a non-volatile memory unit or units. The memory 904 can also be another form of computer-readable medium, such as a magnetic or optical disk.
The storage device 906 is capable of providing mass storage for the computing device 900. In some implementations, the storage device 906 can be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product can also contain instructions that, when executed, perform one or more methods, such as those described above. The computer program product can also be tangibly embodied in a computer- or machine-readable medium, such as the memory 904, the storage device 906, or memory on the processor 902.
The high-speed interface 908 manages bandwidth-intensive operations for the computing device 900, while the low-speed interface 912 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In some implementations, the high-speed interface 908 is coupled to the memory 904, the display 916 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 910, which can accept various expansion cards (not shown). In the implementation, the low-speed interface 912 is coupled to the storage device 906 and the low-speed expansion port 914. The low-speed expansion port 914, which can include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) can be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
The computing device 900 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a standard server 920, or multiple times in a group of such servers. In addition, it can be implemented in a personal computer such as a laptop computer 922. It can also be implemented as part of a rack server system 924. Alternatively, components from the computing device 900 can be combined with other components in a mobile device (not shown), such as a mobile computing device 950. Each of such devices can contain one or more of the computing device 900 and the mobile computing device 950, and an entire system can be made up of multiple computing devices communicating with each other.
The mobile computing device 950 includes a processor 952, a memory 964, an input/output device such as a display 954, a communication interface 966, and a transceiver 968, among other components. The mobile computing device 950 can also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. Each of the processor 952, the memory 964, the display 954, the communication interface 966, and the transceiver 968, are interconnected using various buses, and several of the components can be mounted on a common motherboard or in other manners as appropriate.
The processor 952 can execute instructions within the mobile computing device 950, including instructions stored in the memory 964. The processor 952 can be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 952 can provide, for example, for coordination of the other components of the mobile computing device 950, such as control of user interfaces, applications run by the mobile computing device 950, and wireless communication by the mobile computing device 950.
The processor 952 can communicate with a user through a control interface 958 and a display interface 956 coupled to the display 954. The display 954 can be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 956 can comprise appropriate circuitry for driving the display 954 to present graphical and other information to a user. The control interface 958 can receive commands from a user and convert them for submission to the processor 952. In addition, an external interface 962 can provide communication with the processor 952, so as to enable near area communication of the mobile computing device 950 with other devices. The external interface 962 can provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces can also be used.
The memory 964 stores information within the mobile computing device 950. The memory 964 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 974 can also be provided and connected to the mobile computing device 950 through an expansion interface 972, which can include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 974 can provide extra storage space for the mobile computing device 950, or can also store applications or other information for the mobile computing device 950. Specifically, the expansion memory 974 can include instructions to carry out or supplement the processes described above, and can include secure information also. Thus, for example, the expansion memory 974 can be provide as a security module for the mobile computing device 950, and can be programmed with instructions that permit secure use of the mobile computing device 950. In addition, secure applications can be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
The memory can include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The computer program product can be a computer- or machine-readable medium, such as the memory 964, the expansion memory 974, or memory on the processor 952. In some implementations, the computer program product can be received in a propagated signal, for example, over the transceiver 968 or the external interface 962.
The mobile computing device 950 can communicate wirelessly through the communication interface 966, which can include digital signal processing circuitry where necessary. The communication interface 966 can provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication can occur, for example, through the transceiver 968 using a radio-frequency. In addition, short-range communication can occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 970 can provide additional navigation- and location-related wireless data to the mobile computing device 950, which can be used as appropriate by applications running on the mobile computing device 950.
The mobile computing device 950 can also communicate audibly using an audio codec 960, which can receive spoken information from a user and convert it to usable digital information. The audio codec 960 can likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 950. Such sound can include sound from voice telephone calls, can include recorded sound (e.g., voice messages, music files, etc.) and can also include sound generated by applications operating on the mobile computing device 950.
The mobile computing device 950 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a cellular telephone 980. It can also be implemented as part of a smart-phone 982, personal digital assistant, or other similar mobile device.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The invention will be further described in the following examples, which do not limit the scope of the invention described in the claims.
Determining the etiologic basis of the mutations that are responsible for cancer is one of the fundamental challenges in modern cancer research. Different mutational processes induce different types of DNA mutations, providing “mutational signatures” that have led to key insights into cancer etiology. The most widely used signatures for assessing genomic data are based on unsupervised patterns that are then retrospectively correlated with certain features of cancer. This Example shows that supervised, machine-learning techniques can identify signatures, called SuperSigs, which are more predictive than those currently available. Surprisingly, it was found that aging causes different SuperSigs in different tissues, and the same is true for environmental exposures. SuperSigs associated with obesity were discovered, the most important lifestyle factor contributing to cancer in Western populations.
After evaluating the performance of the current unsupervised signatures, a new supervised algorithm was developed to determine whether it would outperform previously described unsupervised signatures and used it to study patients in whom clinical as well as sequencing information was available. Several new signatures were discovered that were often more strongly predictive of specific etiologic factors than previously described unsupervised signatures.
The value of a mutational signature can be assessed by either its prediction accuracy in classifying patients as exposed or not to the associated etiological factor, or by its correlation with exposure to that factor. Therefore, the statistical evaluation of a given mutational signature critically depends on the availability of clinical annotation data for the etiological factor associated to that signature. For example, in the absence of at least one set of patients for whom both sequencing data and smoking status information were available, it would be impossible to assess the value of a given mutational signature for smoking. When clinical annotation is available, it is also important to evaluate to what degree the given mutational signature improves upon some prior validated knowledge on the mutational effects of an etiological factor, if prior knowledge exists, e.g, the deamination at CpG dinucleotides with aging. The current unsupervised mutational signatures (see, e.g., Alexandrov et al., Nat Genet 47, 1402-1407 (2015); Alexandrov et al., Science 354, 618-622 (2016); Alexandrov et al., Nature 500, 415-421 (2013); and Alexandrov et al., Cell Rep 3, 246-259 (2013)) were evaluated in all of these three scenarios (
Consider first the case when clinical annotation is available and the main “peak” of a mutational signature, i.e. its most recurrent mutation, is already known before an unsupervised mutational signature is obtained. For example, prior validated knowledge indicated that aging induced [C>T]G mutations, and smoking C>A mutations. The added value of a mutational signature then depends on the extra information that that signature provides beyond the already known peak. This additional information is represented by the distribution of the weights on the other trinucleotides not previously described as significantly enriched by that etiological factor (
Moreover, aberrant results are obtained if the number of patterns selected during the unsupervised step is different from the true number of patterns present; the larger the difference between those two numbers the worse the results (see the “Partially-supervised Method Extension” section in the STAR★Methods for an example).
Supervised Method for Mutational Signatures with Low-Variance Features of Variable Length
Three key features differentiate the new approach to identify signatures from those previously published. First, the machine learning is supervised, i.e. it learns from the data by using the available annotation on clinical variables such as age, smoking status, and body mass index. After a supervised feature selection step, it then uses a supervised classification method—linear discriminant analysis (LDA)—to determine the mutational signatures. Besides classifying samples into exposed or not exposed, this second step provides a score for the evidence of a given exposure in each sample of the test set. This permits comparisons of the intensity of the exposure among different patients.
Second, a pre-determined base length, such as 3-base pairs, was not used as the fundamental unit of the mutational signatures. This provides greater flexibility because there is no reason to assume that all signatures are optimally described by the same base length units. In fact, even the same signature may be defined on units of variable base lengths. For example, a signature may be characterized by significantly elevated proportions of both C>A and A[C>T]G mutations, the former representing a single-base feature and the latter representing a 3-base feature of the signature.
Third, a probabilistic approach was employed to signature discovery. An important characteristic of any mutational process is its randomness. The effects of a mutational process on the genome are stochastic rather than deterministic, with certain mutation types being more probable (i.e. having higher frequencies) than others. Moreover, the mutational distribution caused by the same etiological factor varies greatly among exposed patients: a mutation type very frequent in some patients may not be common in others. From a biological point of view, it seems natural that each patient—and in fact each cell—may have her/his individualized signature characterizing a specific etiological factor. The signatures are therefore built only on selected features that are robust across the exposed population, i.e. features with relatively low variance, thereby increasing their predictive power.
SuperSigs Associated with Aging Vary with Tissue Type
It has long been known that certain types of mutations, such as C>T transitions resulting from cytosine deamination, accumulate with age. It was evaluated whether other mutational signatures of aging were present in cancers and whether they varied among tissue types. For this purpose, sequencing data from thirty types of cancers recorded in The Cancer Genome Atlas (TCGA) database were analyzed. To avoid confounding factors, this analysis was confined to patients without annotated cancer-associated environmental exposures and without known germline predispositions to cancer.
Signatures, which were termed “SuperSigs”, associated with aging in cancers of various types were discovered, examples of which are shown in
It was then wondered whether patients were “young” or “old” (as measured by the lowest and highest tertile, respectively, of the age distribution) could be predicted from the SuperSigs in their cancers. As depicted in
Supersigs Associated with Environmental and Other Factors Vary with Tissue Type
SuperSigs associated with specific environmental carcinogens were next identified. The analysis was performed after controlling for age and for other relevant covariates when available. SuperSigs for smoking, alcohol, hepatitis B and C virus infection (HBV, HCV), aristolochic acid (AA), and ultraviolet (UV) light were obtained (
In addition to documenting that SuperSigs could be attributed to the factors noted above, whether an individual was exposed to the factor could be predicted simply from the SuperSigs in the individual's cancer genome sequencing data. For example, lung adenocarcinoma (LUAD) patients were able to be classified as smokers or non-smokers with 0.89 prediction accuracy. Similarly, patients with esophageal carcinomas (ESCA) were correctly classified as drinking alcohol more than once per week vs. less than once per week with 0.86 prediction accuracy (
The SuperSigs associated with the same factors generally varied across tissues, just as they did with aging. For example, the SuperSigs associated with smoking were very different in lung, head and neck, pancreatic, and esophageal cancers (
The tissue-specific SuperSigs associated with environmental factors were often similar to the aging signature of the same tissue (
Obesity (as measured by a body mass index, BMI, greater than 30) has emerged as the major lifestyle factor contributing to cancer in general. How obesity contributes to cancer risk, however, is unknown. For example, obesity could lead to cancer by inducing mutations or by stimulating the growth of neoplastic cells that have already acquired mutations. If the former explanation were valid, there might be a mutational signature associated with obesity, but no such signature has been previously identified. Three cancer types associated with obesity in which adequate number of samples and body mass index data for a supervised machine learning approach were available: esophageal, uterine, and kidney cancer. SuperSigs were identify for obesity in two of these three cancer types (
The Proportion of Mutations Due to Aging
Finally, the supervised approach was applied to estimate the proportion of the overall mutational load that can be attributable to normal aging rather than to other mutational processes. When considering all 30 tissues, it was estimated that on average 66% (2.5% quantile: 0.13; median: 0.76; 97.5% quantile: 0.86) of the mutations can be attributable to the normal endogenous mutational processes associated with aging, that is normal DNA replication (Table 4). The proportion varied from 9% in endometrial cancer (UCEC-TCGA) patients with defects in the gene POL-ε to very high percentages like in patients with uveal melanoma (UM) where it was 85%. This estimated proportion is expected to be an overestimate, given the lack of full annotation for all environmental and inherited factors.
The results recorded above lead to several important conclusions. First, supervised machine learning led to new signatures for a variety of etiological factors. These new SuperSigs are better at predicting an exposure than the signatures derived from unsupervised learning.
A second observation is that the SuperSigs usually varied with tissue type. In the majority of previous studies of signatures, it has been assumed that a specific mutational process produces the same signature in all tissue types (see, e.g., Alexandrov et al., Nat Genet 47, 1402-1407 (2015); Alexandrov et al., Science 354, 618-622 (2016); Alexandrov et al., Nature 500, 415-421 (2013); and Alexandrov et al., Cell Rep 3, 246-259 (2013); see, e.g., Blokzijl et al., Nature 538, 260-264 (2016) and Hoang et al., Sci Transl Med 5, 197ra102 (2013) for exceptions). In contrast, the SuperSigs were usually tissue-specific. The fact that the same risk factor, such as alcohol, might give rise to different signatures in different tissues might be viewed as surprising given historical views of exogenous carcinogens such as UV light. However, recent studies have suggested that tissue-specific differences in chromatin organization might underlie the tissue specificity of mutations, at least during aging (Polak et al., Nature 518, 360-364 (2015)). Moreover, the tissue-specific nature of SuperSigs is consistent with the tissue specificity of cancer predisposition syndromes. For example, inherited mutations in the fundamental genes involved in DNA repair or recombination, such as BRCA2, might be expected to result in predispositions to cancers of all types, but they only increase cancer risk in a limited subset of tissues. These results show that the SuperSigs associated with BRCA2 indeed vary with tissue type. Clinical observations like these, together with the SuperSigs described here, support the idea that the nature of mutagenesis is highly dependent on tissue type, and often related to inflammation, suggesting important avenues for future research.
A total of 70 SuperSigs were defined but at most 2-3 of these SuperSigs appear to play a role in any single cancer. This stands in contrast to the widely used signatures discovered through unsupervised learning techniques. Even if only a subset of the unsupervised signatures are considered in the analysis of a given cancer type, there are multiple instances where each of these remaining unsupervised signatures is found in essentially every cancer patient. For example, signature 3, a signature for BRCA1 or 2 mutations, was found in virtually every breast cancer patient sequenced in TCGA (see Figure S32 in Alexandrov et al., Nature 500, 415-421 (2013)), whether the cancer had any relationship to the BRCA pathway or not. Similarly, signature 4, a signature for tobacco smoking, and signature 6, a signature associated with defective mismatch repair mechanisms (MMR), was found in virtually every liver cancer patient (see Figure S43 in Alexandrov et al., Nature 500, 415-421 (2013)), while MMR-deficiency is rare in liver cancers).
An important limitation of this method and of any other method is the quality of the clinical data currently available as well as the limited knowledge of the etiological factors patients are exposed to. There is currently much interest in performing genome-wide sequencing studies on very large numbers of cancer patients in whom clinical data are well-annotated. As such studies proceed, and as the knowledge of etiological fac tors advances, the power of the supervised learning approach described here will progressively increase. It is anticipated that this will lead to accurate estimates of the fraction of mutations attributable to each specific environmental, hereditary, and replicative factor. Conversely, in certain cohorts, this approach could lead to the detection of a sizable fraction of mutations that cannot be attributed to any known source, potentially leading to new insights into pathogenesis, and in particular, avoidable pathogenic agents. The supervised approach can be easily extended to a partially supervised one in order to deal with this situation.
A final conclusion relates to obesity. Obesity is now considered the primary environmental risk factor for cancers in general, and with its increasing incidence, the number of cancers impacted by it is huge (see, e.g., Giovannucci et al., Ann Intern Med 122, 327-334 (1995); Hruby et al., Am J Public Health 106, 1656-1662 (2016); and Song et al., Science 361, 1317-1318 (2018)). Yet the mechanisms underlying the effects of obesity on cancer risk are unknown. Numerous speculations about mechanism have been proposed, such as the effects of putative adipokines and a variety of other hormones or circulating metabolites on cell growth. The discovery of SuperSigs for obesity in some tissues indicates that at least in those tissues part of the risk from obesity may be attributed to mutagenesis. This observation thus leads to specific testable hypotheses that can advance the field. For example, what circulating molecules in obese patients increase the mutation rate, giving rise to the SuperSigs described here?
The hypermethylation and hypomethylation were considered similarly but independently and the unit of analysis is a gene. For hypermethylation, genes that are not included in the PolyComb 27 dataset were filtered out. Also, genes with less than 3 or with more than 7 probes were filtered out for hypermethylation. Now, for each gene in each sample, the percentage of probes that are hypermethylated in the sample was calculate. Based on these percentages, an empirical frequency distribution was generate with the following binning: (0,0.1,0.3,0.5,0.7,0.9,1) with first bin including 0 and the last including 1. The number of genes in each one of the 6 bins was considered as one of the hypermethylation features, for a total of 6 features per patient. The Wilcoxon test was performed to test which features (i.e. bins) are significantly differentially methylated between the two groups of patients (exposed vs not exposed) and keep only the features with an FDR smaller than 0.01. The same process was applied for hypomethylation.
Gene expression was used in the standard log 2 scale which spans from 0 to 16. The genes with a median of less 3 or more than 13 among samples in each patient group (exposed vs not exposed) were filtered out. Only genes whose median difference between the two groups is at least 3 were kept. If no genes remain, the threshold was lowered from 3 to the maximum seen over all genes minus 0.5. Among the remaining genes, the significance of differential expression was calculate using the p-value from the Wilcoxon test and adjust it by Benjamini-Hochberg process and only the genes with at most an 0.01 FDR were kept. At most 10 genes were kept if more than 10 genes are significant, and the top 3 genes were kept if less than 3 genes are significant.
10 times 5-fold CV was applied for Smoking in LUAD, Alcohol in LIHC, Smoking in PAAD, high BMI in UCEC, Smoking in KIRP, high BMI in KIRP, HepB in LIHC, HepC in LIHC with accuracy as the following:
Somatic exomic mutational data was downloaded from the TCGA Bioportal (portal.gdc.cancer.gov) and filtered out the mutations which have less than 5% Variant Allele Frequency (VAF). Out of the total thirty-three datasets available, large B-cell lymphoma (DLBC) was not included in the analysis because of the small number of samples available, while lung squamous cell carcinoma (LUSC) and mesothelioma (MESO) were excluded because of the extremely small number of patients unexposed to smoking and asbestos, respectively. For ovarian cancer (OV) and acute myeloid leukemia (LAML) whole genome sequencing data were used. The human genome reference build hg38 was used to determine the context (flanking bases) for each mutation. The clinical information was downloaded from the website Cbioportal (cbioportal.org). For calculating the background frequency of each trinucleotide on both the exome and the genome the R package, deconstructSigs was used. For the “Unsupervised Signature” method, the signatures were downloaded from the Cosmic Signature website (cancer.sanger.ac.uk/cosmic/signatures) and used the table cancer.sanger.ac.uk/signatures/matrix.png in order to determine which signatures were present in which tissue. The following method was used to assess the unsupervised signatures: to determine in a given patient the respective proportional contributions X of each mutational signature i=1, . . . , k, where a total of k signatures were present in that tissue, non-negative least square (FCNLS) was applied as in Alexandrov et al. (Nature 500, 415-421 (2013)) to
Y
j
=A
j1
X
1
+A
j2
X
2
+ . . . +A
jk
X
k
i.e. Y=AX in matrix form, where Yj is the total number of mutations of type j=1, . . . , 96, normalized so that ΣYj=1 in that patient, and Aji is the relative frequency of mutation type j in the mutational signature i, across each one of the k signatures present in that tissue.
All analyses were performed using R version 3.5.2. LDA was performed using the function lda from the package MASS. Logistic regression was performed using glm from the STATS package. Non-negative matrix factorization (NMF) was performed using the function nmf with method “Lee” from the package NMF.
To reduce the effect of confounding factors, a filtering scheme was applied as follows. In each tissue type, samples were divided into two main categories: 1) “unexposed”, meaning that based on the available clinical annotation, no known environmental factor was believed to have contributed to the development of the cancer (we treated NA environmental factors as unexposed), and 2) “exposed”. To mitigate the effects of other unknown factors in the unexposed group, any sample with a mutational load more than 3 times higher than the median number of mutations found among the unexposed samples was removed. Samples were also excluded if the total number of mutations was equal to zero on the exome, a probable indication of low neoplastic cell content. In general, samples with a mutation in POLE/POLE2/POLE3/POLE4 or POLD1/POLD2/POLD3/POLD4 genes were removed—except for when the signature for the specific effects of those mutations was the objective of the analysis. A tissue type was divided into subtypes whenever possible. Acute Myeloid Leukemia (AML) patients younger than 40 years old were not considered. Among the “exposed” samples, samples with known multi-factor exposures were excluded to minimize confounding factors and only evaluated samples with a single known exposure. For the age analysis, the unexposed samples were divided into three groups (younger, middle-aged, older), and eliminated the middle group before training the algorithm. When testing the algorithm, those two age groups were also considered.
To assess the value of the aging (#1) and smoking (#4) unsupervised signatures in Alexandrov et al. (Nature 500, 415-421 (2013)) beyond their main “peak”, i.e. C>A for smoking and [C>T]G for aging, since those peaks were already known. Thus, the value that the unsupervised signatures add to the previously known mutational peaks was evaluated. This essentially corresponds to evaluate if the part of the distribution of an unsupervised mutational signatures that is not the mutational “peak” adds any value to the peak, according to some measure of performance (prediction or correlation).
To do this, a “randomly generated smoking signature”, a signature for smoking in LUAD, was defined whose only property is a higher proportion of C>A mutations than the other mutation types and where, beside this “peak” at C>A, the proportion of all the other mutation types is assigned randomly. Similarly a “randomly generated aging signature”, a signature for aging, was defined whose only property is a higher proportion of [C>T]G mutations than the other mutation types and where, beside this “peak” at [C>T]G, the proportion of all the other mutation types is assigned randomly. This was done in two alternative ways: (i) generating the random signature using random samples or (ii) building a “randomly generated signature” from a uniform distribution. Specifically, for the smoking signature:
After obtaining these randomly generated signatures, the contribution of the random signature was calculated by applying non-negative linear regression. Thereafter, to evaluate the performance of the signature, the Area Under Curve obtained was calculated using the contribution (normalized by total number of mutations) of the randomly generated smoking signature to predict smoking status, as well as its Spearman correlation with the number of packs smoked by the person.
A similar process was applied to the age signature using the sequencing information of unexposed tissues only and it was compared with the performance of Signature 1 in Alexandrov et al. (Nature 500, 415-421 (2013)). The process was modified in three simple ways. It was assumed that the main types of mutations are: [C>T]G, [C>T]H, C>A, C>G, T>A, T>C, and T>G. Also, in the selection among the 30 signature candidates, only the samples whose [C>T]G proportion is at least as high as 0.9 of the maximum proportion of [C>T]G observed were kept. The randomly generated aging signature using random distribution is then the filtered sample with the maximum proportion of C>T substitutions. As usual, for age the contributions were not normalized by the total number of mutations.
All six types of possible substitutions were considered, with or without the context bases flanking those substitutions, as potential features. These features have variable length and can be grouped into 3 categories. The first category, composed of single nucleotides, contains only the six types of possible substitutions, regardless of the bases before (prefix) or after (suffix): C>A, C>G, C>T, T>A, T>C, and T>G, where all substitutions are referred to by the pyrimidine of the mutated Watson-Crick base pair. The second category, composed of dinucleotides, includes 48 substitutions with a specific base as a prefix or as a suffix (e.g. A[C>T] and [C>T]G); there are 24 with a prefix and 24 with a suffix. The third category, composed of trinucleotides, includes 96 substitutions with both a prefix and a suffix (e.g. A[C>T]G or G[C>T]G). Finally, the total number of mutations, Tot, was considered as a feature. Hence, there was a list of 151 potential features (6+48+96+1). These features construct a partitioning tree. In other words, the total number of mutations found in a sample can be seen as the root of all mutation types, and it is partitioned into mutations of the first category as its children, i.e. substitutions with neither prefix or suffix (e.g. C>T). Each mutation in the second category is the child of one in the first category (e.g. [C>T]G and A[C>T] are both children of C>T) and each third-category mutation is the child of two parents of the second category (e.g. A[C>T]G is the child of both [C>T]G and A[C>T]). Importantly there is dependence among features found on the same path when moving along this tree from the root to the leaves. The way this dependence was dealt with is described in the next section.
If the number of training samples were below a threshold (60 unexposed samples or 15 exposed samples), or if the median total number of mutations was <20, only a subset of the 151 features was considered. This subset was composed of 6 features: the first category of mutations (single nucleotides) and the total number of mutations. The reason for this is that it was assumed that the signal/noise ratio would be too low to determine whether second category (dinucleotide) or third category (trinucleotides) context mattered.
For each feature, it is possible to consider its absolute count or its relative frequency (its absolute count divided by the total number of all mutation types). In a patient exposed only to “aging”, i.e. unexposed to any known environmental or inherited factor, the relative frequency of a mutation type is expected to remain constant irrespective of age—as dictated by the aging signature—while the absolute count is expected to increase with age. In contrast, in a patient exposed to an environmental or inherited factor, the relative frequency of a mutation type as well as the count may change with age. Thus, absolute counts were used for determining age signatures, while one analysis was performed using relative frequencies and another one using absolute counts for all other signatures. The results of these two separate analyses were often comparable, except in terms of prediction accuracy where absolute counts often have an advantage, as expected. Thus, the results were reported using relative frequencies to be conservative. To improve accuracy, a log transformation was applied to count features, which is a standard tool in these types of analyses.
Next, it was aimed to purge unrelated or low signal/noise mutation types out of the total 151 potential features. As mentioned, there is a hierarchy among the mutation types, with parents, children, grandchildren, etc. along the partitioning tree. In general, not all 151 potential features of this tree will have counts that are significantly different from what is expected by chance after controlling for their representation on the exome. For each tissue and for each exposure, it was started from the root of the tree and “went down the tree” to find features whose counts are significantly different from those expected. Specifically, the null hypothesis was that there is perfect dependence among the potential features found on the same path when moving along the tree from the root to the leaves. Unless proven otherwise, the count of a given feature could be explained by the count of any of its parent(s), or more precisely of any of its ancestors, after adjusting for its expected representation in the exome. As an example, the null hypothesis for the total number of observed C>T mutations was that this number would be equal to its expected value, which is given by the total number of mutations observed, Tot, adjusted for the normal frequency of the “C” nucleotide on the exome (vs the “T”s), and the fact that there are three equally probable mutation types (i.e. C>A, C>G, and C>T) under the null. Thus, since C (i.e. C:G) nucleotides have a frequency of 0.506 on the exome (0.409 on the genome), then the expected value of C>T mutations on the exome would be given by Tot*0.506*⅓, since it was assumed a priori that a C has the same probability to mutate to an A, a G, or a T. As another example, [C>T]G, which is the child of C>T and the grandchild of the total number of mutations, would be tested twice to see if it significantly exceeded its expected number based on the total number of mutations as well as the number of C>T. Thus, the expected value of [C>T]G mutations would be given by Tot*0.506*⅓*X, where X is the expected frequency of CG out of all C nucleotides in the exome, as estimated by deconstructSigs.
To test each hypothesis, a one-sided binomial test was applied at a 0.05 significance level with a Bonferroni correction for 151 tests to control for multiple testing. The binomial test was based on the sum of the total number of mutations observed for that potential feature across all training samples, and the probability of success was set equal to the frequency of that potential feature, as expected by its representation on the exome. If the null hypothesis was rejected, that potential feature was selected as a “first-phase” candidate feature for the next supervised selection step.
Once a temporary list of candidate features had been selected, this list was updated and pruned by “going up the tree” by testing parents that had children that had also been selected. Indeed, some parent mutations may have been selected only because their children had higher than expected frequencies. In other words, the parent was tested by removing the contribution of the selected child to see if the count/frequency of the leftover in that parent would still be significantly higher than expected by chance. If it were, then that parent remained in the list of first-phase candidate features but only after having subtracted the contribution of the first-phase candidate feature child. If not, the parent was eliminated as a feature in that particular analysis. The feature was named “remaining mutations”—when significant—containing the leftover of the total number of mutations. The list of features that remained after this second selection were termed “second-phase candidate features”.
For every factor other than age, the above feature-engineering step was applied separately to samples from patients that were respectively unexposed or exposed to the factor under consideration. It was then combined these two lists of second-phase candidate features by considering the new partition formed by all intersections and relative complements of the elements in the original two partitions, i.e. the two original sets of second-phase candidate features. This new partition is the smallest refinement of the two original partitions (see also Table 4). When completed, this process provided the final list of candidate features.
For aging signatures, the feature engineering steps described above were applied only to samples from patients who were unexposed to any known environmental or inherited factor. This is because the age signature is not expected to change with aging, but simply to increase in its intensity in terms of mutation counts. The resulting second-phase candidate features constituted its “candidate features” list.
Once the list of candidate features was obtained, they were ranked using a bootstrap t-statistic with pooled variance for each class (young vs old, or unexposed vs exposed to an H or E factor) with 1000 iterations in the training set. For the analysis of absolute counts, features with negative median t-statistic were purged, in light of the biologically reasonable assumption that samples from older/exposed patients should not have a lower absolute count of a given mutation type than younger/unexposed patients. For the analysis of relative frequencies, features with negative median t-statistic were instead kept. The larger the absolute value of the t-statistic, the larger the evidence that the feature was affected by the tested variable (i.e., aging or some exposure). To stabilize the ranking of the features, first, second, and third category features were penalized by subtracting a penalty from the median t-statistics according to the following formula:
This penalty function was chosen a priori, and not optimized in cross-validation. The penalty increases as features are further down the tree, with the largest penalty (0.5) being assigned to features of the third category, i.e. trinucleotides. features that had a t-statistics >3, or in cases where the signal was weak (i.e. when all candidate features had a t-statistics <3), all features with a t-statistic within 0.5 of the top feature, were then selected. Again these values were chosen a priori, and not optimized in cross-validation. The set of these selected features constitute what were defined as mutational signatures and were used in the next step for prediction. The mutational signatures for each factor (aging or exposure) are depicted in
The significance of the signatures can be assessed by their ability to distinguish between groups of patients, i.e. exposed vs unexposed, or younger vs older patients. Thus, after the feature selection step, two alternative classifiers—using two types of distribution families—were used to test the predictive accuracy of each mutational signature: linear discriminant analysis (LDA) and logistic regression (Logit). Both methods yielded very similar results, and the results of LDA are reported.
In LDA, a multivariate normal distribution is used to model the features' mutational frequencies of a group of patients, with a mean vector equal to the empirical mean vector and a covariance matrix for the dependencies among the features. In logistic regression, the maximum entropy distribution is instead used to model the features' mutational frequencies in a group of patients, where the constraint on the maximum entropy distribution is that the expected value of each feature is equal to that of its observed average. In information theory language, features modeled by a maximum entropy distribution have minimum information about each other. For both families of distributions, the log ratio test was then used.
In
To compare the accuracy of the supervised and unsupervised methods, the area under the ROC curve (AUC) was selected. The results are presented in
If prediction accuracy were to be the only goal of the analysis, then other methods other than LDA and logistic regression, like for example Random Forest (RF), could be applied to achieve even higher accuracy (e.g. RF has an average 0.83 accuracy for the environmental and inherited factors' signatures, vs. 0.76 with LDA). At the same time, the results obtained with methods like RF are difficult to interpret in terms of the quantitative relationship among the selected features. However, there may be applications where accuracy is indeed the only goal.
When comparing the signatures of two different exposures a problem is that lack of common features, or at least the lack of perfect overlap between the two sets of selected features contained in the signatures. For example, Exposure 1, may have as selected features [C>T]G, [C>T]H, and the remaining mutations, with proportions 15%, 5%, and 80% respectively, while Exposure 2 may have A[C>T], B[C>T], and the remaining mutations, with proportions 3%, 7%, and 90%. As mentioned, the combination of the two lists is provided by a new partition formed by all intersections and relative complements of the two original partitions, i.e. the two original sets of features. This new partition is the smallest refinement of the two original partitions. In the example, this refinement will contain the following features: A[C>T]G, B[C>T]G, A[C>T]H, B[C>T]H and the remaining of mutations (Table 5).
When “projecting” signatures of Exposure 1 and Exposure 2 onto the new partition uniform distribution of the number of mutations within each feature was assumed. In the example, probabilities were assigned to A[C>T]G, B[C>T]G, A[C>T]H, B[C>T]H, and the remaining mutations, i.e. every mutation except the 4 listed (Table 5). The proportion of a selected feature in a given signature represents the value assigned to that feature in that signature. By assuming a uniform distribution a signature can easily be projected onto any desired refinement partition. See Table 5 for a depiction of this assignment.
To estimate the proportion of mutations due to aging in each specific sample, the median rate of mutations per year in the patient population of the corresponding cancer type and in the absence of any known environmental or inherited factor as first estimated. Then the frequency of each feature present in the cancer-specific supervised age signature was multiplied by that yearly mutation rate and by the patient's age of that specific sample. The number obtained by summing the above counts for each feature in the age signature is then divided by the total number of mutations observed in that sample. This resulting ratio, being forced to be not greater than 1, is the estimate for the proportion of somatic mutations attributable to age in that sample.
One limitation of a supervised approach is that it cannot be applied to find signatures of factors for which no annotation is currently available. It may indeed be desirable to have a method that is able to discover patterns of exposures, even when they are unknown. This limitation, however, can be overcome by using the supervised step, already described, and following it with an unsupervised one. That is, all exposures with available annotations can be taken advantage of to discover their supervised signatures. After learning those signatures, the effects of those supervised signatures can be “subtracted” from the mutational load of the patients exposed to those annotated factors. An unsupervised analysis, such as non-negative matrix factorization (NMF), can then be performed on the leftover, to investigate the presence of further mutational patterns.
This Example provides an example of how the supervised learning of a mutational signature (specifically the aging signature in this example) can be used to improve the performance of an unsupervised approach by discounting the effects of that supervised signature on the test data (this methodology is referred to herein as “partially supervised”).
To simplify matters, features were not engineered; rather, the 96 fundamental mutations as in Alexandrov et al. (Nature 500, 415-421 (2013)) were used. Only the datasets that show a higher average rate of mutation per year in the exposed samples than in the unexposed samples were used. This increase in the rate is required to conform to the premise of non-negativity and linearity in the NMF model. One half of the unexposed samples were use as the training set to learn the age signature (thus a supervised signature) and to estimate the mutation rate (number of mutations accumulated per year of age) so that the effect of age on the test set can be discounted. Next the test set was formed by bootstrapping over the left-out half of the unexposed samples and all exposed ones.
NMF (Lee et al., Nature 401, 788-791(1999)) with rank equal to 3 was applied to decompose the test set, thus obtaining two matrices: one containing the unsupervised signatures and a second one with the corresponding contributions of each of those signatures in each patient. These contributions have not been discounted for age yet. This is the standard unsupervised approach. However, in order to estimate the discounted contributions of a signature in each test sample, the effect of age of a patient on each unsupervised signature was now discounted, by multiplying the learned supervised age signature by the age of the patient, times the estimated mutation rate, and then projecting this vector onto the directions identified by NMF using Non-negative Linear Regression, and then subtracting these projected contributions of age from the contributions of the 3 unsupervised signatures obtained by NMF. To conform with premises of NMF, the negative discounted contributions were set to zero.
The direction whose contribution, divided by the total number of mutations, is the most associated (in terms of the highest AUC) to the exposure status using the known ground-truth, for both the unsupervised and the partially supervised methods, by using the not discounted and discounted contributions, respectively, was chosen. The area under the curve was then used to evaluate the association of the signature with the exposure status, where the contribution of each signature has been divided by the number of total mutations.
This whole process (from the random selection of half of the unexposed patients used to learn the age signature and so on) was repeated 50 times, and the average AUC over them was taken to account for the effect of randomness. This is what is depicted in
These discounted contributions are then averaged. This is what was defined as the partially supervised signature and their contributions. Finally, to obtain the “partially supervised signatures” Non-negative Linear Regression was used again but this time where the coefficients are known and the signatures are unknown. In other words, the decomposition M=SC was still used. Originally, M and S were known and C was wanted. Now, M and C are known and S is wanted. This way the contributions stay the same.
For another example, pretend no annotation for the presence of defects in the gene POL-ε among patients with endometrial cancer in the UCEC-TCGA dataset and no known POL-ε signature. Also assume a supervised aging signature for that tissue, as shown in
Though the example described above is informative about the power of the semi-supervised approach, at least when the signal is very strong as in the case of a POL-ε mutation, it also illustrates a critical weakness of unsupervised approaches in general. The POL-ε signature in
Determining the etiologic basis of the mutations that are responsible for cancer is one of the fundamental challenges in modern cancer research. Different mutational processes induce different types of DNA mutations, providing “mutational signatures” that have led to key insights into cancer etiology. The most widely used signatures for assessing genomic data are based on unsupervised patterns that are then retrospectively correlated with certain features of cancer.
This Example shows that supervised machine-learning techniques can identify signatures, called SuperSigs, that are more predictive than those currently available. Surprisingly, it was found that aging causes different SuperSigs in different tissues, and the same is true for environmental exposures. SuperSigs associated with obesity, the most important lifestyle factor contributing to cancer in Western populations, were discovered.
As demonstrated herein, a supervised algorithm has been developed to determine new mutational signatures, termed “SuperSigs”. It was then demonstrated that these supervised signatures could outperform previously described unsupervised signatures in predicting the presence of various etiological factors in patients for whom both clinical and sequencing information was available.
Supervised Method for Mutational Signatures with Low-Variance Features of Variable Length (SuperSigs)
To obtain SuperSigs signatures, sequencing data from thirty types of cancers recorded in The Cancer Genome Atlas (TCGA) database were analyzed. Four key features distinguish the approach for identifying signatures.
1) A primary methodological step is to use supervised machine learning, i.e. learn the signatures from the data, by using the available annotation on clinical variables such as age, smoking status, and body mass index. By using this information explicitly, stronger associations can be identified and better predictions can be made.
2) A pre-determined base length, such as 3-base pairs, is not specified as a fundamental unit of the mutational signatures. This provides greater flexibility because there is no reason to assume that all signatures are optimally described by the same base length units. In fact, a single signature may be defined on units of variable base lengths, featuring, for example, significantly elevated proportions of both C>A (i.e. a single-base substitution from C to A) and A[C>T]G (i.e. a single-base substitution from C to T with flanking bases A and G) mutations.
3) A probabilistic approach to signature discovery was employed. An important characteristic of any mutational process is its randomness. The mutational distribution caused by the same etiological factor varies greatly among exposed patients: a mutation type very frequent in some patients may not be common in others. From a biological point of view, it seems natural that each patient—and in fact each cell—may have her/his individualized signature characterizing a specific etiological factor. The signatures are therefore built only on a subset of selected features that are robust across the exposed population, i.e. features with relatively low variance, thereby increasing their predictive power.
4) There is no assumption that a given mutational process must have the same mutational signature across tissues, contrary to the approach developed by Alexandrov et al. (Nature 500, 415-421 (2013)) where a given signature (e.g. signature 1) is the same across all tissues.
The method for deriving mutational signatures is based on several steps. First, a nested tree containing all potential features was constructed, with all mutations as the root, and all six single-base substitutions (C>A, C>G, C>T, T>A, T>C, and T>G) as the first level, followed by single-base substitutions with one flanking base as the second level, and by single-base substitutions with two flanking bases as the third level, and where the edges are placed between features which share mutations (
After “pruning” the tree in order to keep only the features that have counts significantly different from their expected values, these remaining features are ranked based on their ability to classify a given exposure, i.e. to discriminate exposed patients from unexposed ones, as measured by the area under the receiver operating characteristic (ROC) curve (AUC). The set of n top features that provide the highest prediction performance in terms of AUC form the signature for a given exposure and are used for prediction (
The value of a mutational signature can be assessed by its prediction accuracy (AUC) in classifying patients as exposed or not to the associated etiological factor, or by its correlation with exposure to that factor. Statistical evaluations were provided for both, relying on the availability of clinical annotation for the etiological factor associated to that signature (
Mutational Signatures Add to Prior Knowledge about Etiologic Factors
In addition to simple performance, it is also important to evaluate the degree to which a given mutational signature improves upon prior knowledge about the mutational effects of an etiological factor (
To statistically evaluate the added value provided by the signatures of Alexandrov and colleagues, hereafter termed “unsupervised”, as well as of the SuperSigs, both of their performances were compared against random alternatives carrying no additional knowledge beyond the known peak, for both aging and smoking. These prior knowledge signatures were termed “random” because they just reflect random noise around the already known peak (
Sequencing data for thirty tumor types were obtained from the TCGA Genomics Commons. After splitting each dataset randomly into training and test partitions, the method above was applied to derive signatures of aging and smoking in the training data, evaluating performance in the test data. The SuperSigs aging signatures were applied to classify patients in a binary fashion (i.e., young versus old) yielded a median AUC of 0.72, calculated over 30 tumor types, significantly outperforming the random aging signature (single peak; median AUC=0.65), which was built on the well-supported observation that over time, cytosines will consistently deaminate to thymine in the CpG context (
The performance of these signatures was next evaluated with respect to smoking status across eight tissues known to be significantly affected by smoking. The SuperSigs added value to prior knowledge while the unsupervised signatures did not (median AUCs for smoking: SuperSigs=0.88, single peak=0.57, unsupervised=0.56) (
These data do not indicate that unsupervised signatures for aging and smoking are meaningless. However, the data indicate that the unsupervised signatures do not add any information to prior knowledge of a peak at [C>T]G for aging and at C>A for smoking. Optimally, an algorithm based on genome-wide cancer genomic sequencing data should add information that was not available from prior studies, and SuperSigs indeed added such information that goes beyond the previously known mutational peaks (
Supervised signatures perform better than unsupervised ones when no prior knowledge about an etiologic factor is available (second scenario in
The method can predict whether an individual patient was “exposed” to a given etiologic factor simply from the SuperSigs in that patient's cancer genome sequencing data. For example, the cross-validated AUC was 0.95 when classifying patients with lung adenocarcinomas (LUAD) as smokers versus never-smokers. Similarly, the AUC was 1.0 when classifying patients with head and neck cancers (HNSCC) as drinking alcohol more than once per week vs. less than once per week
When clinical annotation is not available for an etiologic factor (
SuperSigs for Aging and Other Factors Vary with Tissue Type
It has long been known that certain types of mutations, such as C>T transitions resulting from cytosine deamination, accumulate with age. It was wondered whether other mutational signatures of aging were present in cancers and whether they varied among tissue types. To avoid confounding factors as much as possible, the analysis was confined to patients without known cancer-associated environmental exposures and without known germline predispositions to cancer.
SuperSigs associated with aging were thereby obtained for each cancer type analyzed, examples of which are shown in
It was next sought to identify tissue-specific SuperSigs associated with specific environmental carcinogens. The analysis was performed after controlling for age and for other relevant covariates. Tissue-specific SuperSigs were obtained for smoking, alcohol, hepatitis B and C virus infection (HBV, HCV), aristolochic acid (AA), asbestos, and ultraviolet (UV) light (
In several cases, the SuperSigs associated with the same mutational factors varied across tissues, just as they did with aging. For example, the SuperSigs associated with smoking were very different in bladder, esophageal, head and neck, and lung cancers (
Note that tissue specific differences with respect to etiologic factors are not possible to discover with the unsupervised approach described by Alexandrov et al. (Nature 500, 415-421 (2013)) because the identity of a given signature across multiple tissues was a key theoretical assumption underpinning their approach.
The heatmap in
Moreover, in several cases, the tissue-specific mutational landscape associated with an environmental factor was similar to the aging mutational landscape of the same tissue (
These analyses then suggest that a major effect of environmental factors may simply be to increase the rate of cell division. Such increases would be linearly proportional to the increase in mutation rate and would not be associated with new signatures such as those caused by direct interaction of carcinogens with DNA. Increases in the rate of cell division are known to occur when tissues are damaged or inflamed.
Obesity (as measured by a body mass index, BMI, greater than 30) has emerged as the major lifestyle factor contributing to cancer in general. How obesity contributes to cancer risk, however, is unknown. For example, obesity could lead to cancer by inducing mutations or by stimulating the growth of neoplastic cells that have already acquired mutations. If the former explanation were valid, there might be a mutational signature associated with obesity, but no such signature has been previously identified. Four cancer types associated with obesity in which adequate number of samples and body mass index data for a supervised machine learning approach were available: colon, esophageal, kidney, and uterine cancer. SuperSigs for obesity were identified in all of these cancer types (
A common characteristic of these obesity signatures is that the rate of accumulation of certain mutation types increases under the effect of obesity while other mutation types decrease (
Finally, the supervised approach was applied to estimate the proportion of the overall mutational load that can be attributable to normal aging rather than to other mutational processes. When considering all 30 tissues, it was estimated that on average 70% of the mutations can be attributable to the normal endogenous mutational processes associated with aging, that is normal DNA replication (Table 10). This estimate is consistent with what previously reported in Tomasetti et al. (Science 355, 1330-1334 (2017)). The proportion varied widely across tissues, for example it is 2% on average in endometrial cancers (UCEC) of patients with POLe mutations to 90% in pancreatic cancer (PAAD) patients who smoke. This estimated proportion is expected to be an overestimate given the lack of full annotation for all environmental and inherited factors.
We downloaded somatic exomic mutational data from the TCGA Bioportal (portal.gdc.cancer.gov) and filtered out the mutations which have less than 5% Variant Allele Frequency (VAF). Out of the total thirty-three datasets available, large B-cell lymphoma (DLBC) was not included in the analysis because of the small number of samples available, while lung squamous cell carcinoma (LUSC) and mesothelioma (MESO) were excluded because of the extremely small number of patients unexposed to smoking and asbestos, respectively. For ovarian cancer (OV) and acute myeloid leukemia (LAML) whole genome sequencing data were used. The human genome reference build hg38 was used to determine the context (flanking bases) for each mutation. The clinical information was downloaded from the website Cbioportal (cbioportal.org). For calculating the background frequency of each trinucleotide on both the exome and the genome the R package, deconstructSigs was used. For the Unsupervised Signature method (Alexandrov et al. Nature 500, 415-421 (2013)), the signatures were downloaded from the Cosmic Signature website (cancer.sanger.ac.uk/cosmic/signatures) and used the table cancer.sanger.ac.uk/signatures/matrix.png in order to determine which signatures were present in which tissue.
All analyses were performed using R version 3.5.2. Logistic regression was performed using glm from the STATS package. LDA was performed using the function lda from the package MASS. Non-negative matrix factorization (NMF) was performed using the function nmf with method “Lee” from the package NMF.
To reduce the effect of confounding factors, several filtering criteria were applied. In each tissue type, samples were divided into two categories: 1) “unexposed”, meaning that no exposure to a known environmental factor was recorded, according to the available clinical annotation, and 2) “exposed”. To mitigate the effects of other unknown factors in the unexposed group, any sample with a mutational load more than 3 times higher than the median number of mutations found among the unexposed samples was removed. Samples were excluded if the total number of mutations was equal to zero on the exome, a probable indication of low neoplastic cell content. Samples with microsatellite instability (MSI) or with a mutation in POLE/POLE2/POLE3/POLE4 or POLD1/POLD2/POLD3/POLD4 genes were removed—except for when the signature for the specific effects of those mutations was the objective of the analysis—because of the known large increase in the number of mutations they induce. A tissue type was divided into subtypes whenever possible. Acute Myeloid Leukemia (AML) patients younger than 40 years old were not considered. Among the “exposed” samples, samples with known multi-factor exposures to minimize confounding factors were excluded and only samples with a single known exposure were evaluated. Samples with unknown exposure were treated as unexposed.
Mutation counts are used to characterize mutational burden when considering predictors of aging. For all other exposures, mutation rates (i.e. counts/age) are used. In a patient exposed only to time, i.e. unexposed to any known environmental or inherited factor, the rate of a mutation type is expected to remain constant irrespective of age—as dictated by the aging signature—while the absolute count is expected to increase with age. In contrast, in a patient exposed to an environmental or inherited factor, the rate of a mutation type as well as the count may change with respect to the age signature.
Details for the method developed to obtain the supervised mutational signatures are provided in
At its simplest, a mutational signature of exposure is nothing more than a set of substitutions that characteristically occur at different rates in exposed tissue than in unexposed tissue. In practice, though, a few considerations suggested by prior biological knowledge quickly turn a simple calculation into a complex engineering problem. Specifically, a key principle of the SuperSig approach is that signatures may not be optimally described by the same base length units. Accordingly, all single-base substitutions, with or without the flanking context bases, were consider as potential, signature features. In addition to 6 single base substitutions: C>A, C>G, C>T, T>A, T>C, and T>G, named according to the pyrimidine of the mutated Watson-Crick base pair, there are 48 dinucleotides, in which the substitution is paired with a specific base as a prefix or as a suffix but not both (e.g. A[C>T] or [C>T]G), as well as 96 trinucleotides (e.g. A[C>T]G), which include both flanking bases as context. Hence, there is a list of 151 potential features (6+48+96+1).
The resulting flexibility carries a price, however, as features are no longer independent. The simple substitution C>T spawns dinucleotide children, such as A[C>T], and trinucleotide grandchildren like A[C>T]G. Frequent, exposure-driven A[C>T] substitutions would increase the observed rates of both the C>T parent and the trinucleotide children, making it difficult to assign ownership to the correct generation. The section ContextMatters describes an approach to solving this problem, while the section CombiningPartitions describes how candidate signature features are combined to create a final signature.
The binomial test was based on an estimate of the sum of the number of mutations observed for that potential feature across all training samples, and the probability of success was set equal to the frequency of that potential feature, as expected by its representation on the exome. Specifically, the estimate of the sum of the number of mutations observed for that potential feature across all training samples was calculated by a bootstrap (100 times) for the sum of the pseudo count of that feature, of which the median was taken. The start for the pseudo count of the Total Mutations is set at 1000. For any other feature, the pseudo count starts from the proportion of that feature with respect to the exome, multiplied by 1000. Rounding was applied to the outcome.
All results were considered significant at a p-value of 0.05, subject to Bonferroni correction for 150 tests, as Total Mutations is not tested against. If the null hypothesis was rejected, that potential feature as a “first-phase” candidate feature was selected for the next supervised selection step. First-phase candidate features are colored in grey in
When combining two partitions, features may be overlapping. In that case the respective counts need to be distributed among the features of the refinement partition. Those counts were project as follows. For example, Partition 1, may consist of [C>T]G, [C>T]H, and the remaining mutations, with proportions 15%, 5%, and 80% respectively, while Partition 2 may consist of A[C>T], B[C>T], and the remaining mutations, with proportions 3%, 7%, and 90%, respectively. In the example, this refinement will contain the following features: A[C>T]G, B[C>T]G, A[C>T]H, B[C>T]H and the remaining of mutations (Table 7). When “projecting” counts of features in Partition 1 or Partition 2 onto a feature present in the refinement partition, the counts were split according to the expected frequencies observed on the exome (see Table 7, e.g. #ACG/#CG is the expected frequency of ACGs out of all CGs).
For aging signatures, the feature engineering steps described above were applied only to samples from patients who were unexposed to any known environmental or inherited factor. Therefore, this step of combining partitions was skipped, because there is only one partition, i.e. its second-phase candidate features, which automatically provided its “candidate features” list.
Each feature was ranked according to its ability to discriminate exposed samples from unexposed, based on the rates for that feature (or counts, as appropriate for the exposure). Discriminatory performance was measured by the area under the receiver operating characteristic (ROC) curve (AUC). As above, rather than calculating the AUC directly, it was estimated robustly by taking the median over 1000 bootstrapped samples. Features for which the median AUC ≤0.5 on a balanced dataset are discarded.
Among all these features, the n top-ranked features that provided the highest AUC in an inner loop of 5 iterations of 5-fold cross-validation using a multivariate, logistic regression classifier (LR) were selected. These n features were defined as the predictive features for a given exposure.
For the age analysis, the unexposed samples were divided into three groups of equal size (younger, middle-aged, older), based on the quantiles of the age distribution, and discarded the middle group before training the algorithm.
The set of n predictive features selected above form the supervised signature (SuperSig). Two values are associated to each one of these predictive features: 1) the difference in mean counts (age) or rates (all other exposures) between the exposed and unexposed cohorts, and 2) the beta (β) coefficient for that feature as estimated by logistic regression. Both vectors yield critical information.
The difference in means for each feature, which is the only constraint used by logistic regression in maximizing entropy over the dataset, provide a natural measure of the difference in counts or rates for that feature induced by a given exposure. These values were report in the figures such as in
The beta coefficients of the features in a logistic regression have also an intuitive interpretation, since the logarithm of the odds of being in the exposed class C versus the unexposed one, given the mutational data (counts or rates), is given by
Therefore, eβ of a feature is the factor by which the odds of being in the exposed class increase for every extra unit increase in that feature, when all other features are kept constant. The β coefficients of the mutational signatures for each factor (aging or exposure) can be found in Table 8 and are depicted in
Logistic Regression (LR) was used to test the predictive accuracy of each set of features representing a mutational signature as measured by AUC. the performance of Linear Discriminant Analysis (LDA) and Random Forest (RF), when applied to both feature selection and prediction was reported (Table 9). In both LR and LDA models the mean vectors equal the empirical mean vector. In addition, LDA also accounts for the dependencies among the features. All methods yielded relatively comparable results in cross-validation.
For the age analysis, the unexposed samples were again divided into three groups (younger, middle-aged, older) and discarded the middle group before training the algorithm. For all other exposures, unexposed and exposed formed the two groups except for ultraviolet light (UV) and asbestos, for which samples with respectively the lowest 10% and 33% of the Total Mutations count were used for the unexposed group, and all the other samples for the exposed one.
Training was performed using the counts the predictive features for age and the rates (=count/age) of the predictive features for all other exposures, over the two labeled groups, via 5 iterations of 5-fold cross-validation using LR.
The same quantities, counts for age and rates for all other factors, are used for testing. Again, for age, the middle-aged group was excluded from the test set.
When prior literature has established a strong relationship between an exposure and a particular mutational feature, i.e. [C>T]G for aging and C>A for smoking, it was evaluated whether any new candidate signatures actually improve on these central, peak feature. Specifically, the value of the aging (Signature #1) and smoking unsupervised signatures were assessed in Mucci et al. (JAMA 315, 68-76 (2016)), Stadler et al. (J Clin Oncol 28, 4255-4267 (2010)), Stewart et al. (“Cancer Etiology.” In: World Cancer Report 2014 (eds Stewart B W, Wild C P). IARC (2014)), and Tomasetti (Science 364, 938-939 (2019)), as well as of the SuperSigs, beyond the main “peaks” already known from prior knowledge, i.e. [C>T]G for aging and C>A for smoking. This essentially corresponds to evaluate if the part of the distribution of an unsupervised or supervised mutational signature that is not the mutational “peak” adds any value, according to some measure of performance (prediction or correlation).
To do this, a signature was generate for smoking, whose property is a higher proportion of C>A mutations than the other mutation types and where, beside this “peak” at C>A, the proportion of all the other mutation types is assigned randomly. Similarly a signature was generate for aging, whose property is a higher proportion of [C>T]G mutations than the other mutation types and where, beside this “peak” at [C>T]G, the proportion of all the other mutation types is assigned randomly. This was done by building “randomly generated single peak signatures”, or “single peak signatures” for brevity.
More precisely, for the smoking signature, this randomly generated smoking peak signature was created in a two-step process. In step one, 30 (since in Cosmic v.2 there are about 30 signatures) probability distributions were generated over the six main mutation types (which lack suffix and prefix base). Each distribution was created by sampling 6 numbers from a uniform distribution and by dividing them by their sum. The “smoking single peak signature” was then the distribution among them with the highest proportion of C>A substitutions. In step two, the obtained proportion of each of the six main mutation types was randomly broken down into the 16 fundamental trinucleotide mutations (16 for C>A, 16 for C>T, and so on).
A similar process was applied to the derivation of the randomly generated peak age signatures. The difference is that it was assumed the main types of mutations are now seven: [C>T]G, [C>T]H, C>A, C>G, T>A, T>C, and T>G, due to the fact that [C>T]G is needed as one of the features, since that is the peak obtained from prior-knowledge. Among the 30 signature candidates, the “aging single peak signature” is then the distribution with the maximum proportion of [C>T]G substitutions.
In order to compare the prediction accuracy (AUC) of all three sets of signatures (Alexandrov et al., single peak, and SuperSigs), the same prediction methodology was applied that was previously used in Alexandrov et al. to determine the contribution of each signature in each patient: non-negative least squares (NNLS).
More specifically, to determine in a given patient the respective proportional contributions (used as a score) X of each mutational signature i=1, . . . , k, where a total of k signatures are present in that tissue, NNLS is applied to
Y
i
=A
i1
X
1
+A
i2
X
2
+ . . . +A
ik
X
k
i.e. Y=AX in matrix form, where Y is the total number of mutations of type i, and Aij is the relative frequency (for Alexandrov et al. and single peak signatures) or the difference in mean count (SuperSigs for age) or rate (SuperSigs for all other etiological factors) of mutation type i in the mutational signature j, across each one of the k signatures present in that tissue.
The performance of the various methodologies is presented in
For Alexandrov et al. their Signature 1 was used for predicting age in one comparison, and the combination of the “clock-wise” unsupervised Signatures 1 and 5 as determined in Alexandrov et al., (Nat Genet 47, 1402-1407 (2015)) was used in the other comparison. The specific combination of signatures used for Alexandrov et al. in predicting smoking status was instead determined by the specific combinations provided for each tissue in Alexandrov et al. (Science 354, 618-622 (2016)).
Given that it was not possible to cross-validate directly the unsupervised method of Alexandrov et al. (Nature 500, 415-421 (2013)) the core methodology used in Alexandrov et al., which is non-negative matrix factorization (NMF), it was chosen to use and approximate their method in two alternative ways in order to perform cross-validation: 1) “BestNMF” and 2) “MatchedNMF”.
For both approaches, NMF was applied to the profile of the count mutations of the training samples, i.e. a matrix whose 96 rows represent mutation types and columns represent training samples. The rank parameter, r, of the NMF algorithm was set equal to what shown in Cosmic signature v2 (cancer.sanger.ac.uk/cosmic/signatures v2) for the tissue of interest. This parameter was hardwired to help the unsupervised method to limit model misspecification.
After obtaining the r signatures from NMF, two alternative methods were used to select among them the signature of a specific age or environmental factor: 1) for BestNMF, the signature whose contributions had the highest AUC in classifying exposure to the environmental factor on the training set were chosen; 2) for MatchedNMF, each of the identified signatures from the training set was paired to exactly one of those listed in Cosmic v2 for this specific tissue. This pairing process was obtained by maximizing the sum of the cosine similarity for each pair.
Then, on the test set, an NNLS algorithm was used to estimate the contribution of each signature on the test set.
The performance of the various methodologies is presented in
One limitation of a supervised approach is that it cannot be applied to find signatures of factors for which no annotation is currently available. It may indeed be desirable to have a method that is able to discover patterns of exposures, even when they are unknown. This limitation, however, can be overcome by using the supervised step, already described, and following it with an unsupervised one. That is, one can first take advantage of all exposures with available annotations to discover their supervised signatures. After learning those signatures, the effects of those supervised signatures can be “subtracted” from the mutational load of the patients exposed to those annotated factors. An unsupervised analysis, such as non-negative matrix factorization (NMF), can then be performed on the leftover, to investigate the presence of further mutational patterns.
An example is provided here of how the supervised learning of a mutational signature (specifically the aging signature in this example) can be used to improve the performance of an unsupervised approach by discounting the effects of that supervised signature on the test data. This methodology is referred to hereafter to as “partially supervised”.
To simplify matters, features were not engineered; rather, the 96 fundamental mutations as in Alexandrov et al. (Nature 500, 415-421 (2013)) were used. Only the datasets that show a higher average rate of mutation per year in the exposed samples than in the unexposed samples were used. This increase in the rate is required to conform to the premise of non-negativity and linearity in the NMF model. One half of the unexposed samples were use as the training set to learn the rate of each feature of the age signature (thus a supervised signature) so that the effect of age (i.e. controlling for age) on the test set can be discounted. Next the test set was formed by bootstrapping over the left-out half of the unexposed samples and all exposed ones.
NMF with rank equal to 3 was applied to decompose the test set, Y, thus obtaining two matrices, A and X: one containing the unsupervised signatures (A) and a second one with the corresponding contributions of each of those signatures in each patient (X). These contributions have not been discounted for age yet. This is the standard unsupervised approach. However, in order to estimate the discounted contributions of a signature in each test sample, the effect of age of a patient on each unsupervised signature was discounted by multiplying the learned supervised age signature by the age of the patient, times the estimated mutation rate, and then projecting this vector onto the directions identified by NMF using NNLS, and then subtracting these projected contributions of age from the contributions of the 3 unsupervised signatures obtained by NMF. To conform to the premises of NMF, the negative discounted contributions was set to zero.
The direction whose contribution, divided by the total number of mutations, is the most associated (in terms of the highest AUC) to the exposure status using the known ground-truth, was chosen for both the unsupervised and the partially supervised methods, by using the not discounted and discounted contributions, respectively. To obtain the “partially supervised signatures” non-negative linear regression was used again but this time where the contributions (X) are known and the signatures (A) are unknown. In other words, the decomposition is still Y=AX, but now, Y and X are known and A is estimated.
The AUC was used to evaluate the association of the signature with the exposure status, for both the unsupervised and partially supervised approach, where the contribution of each signature has been divided by the number of total mutations. this whole process (from the random selection of half of the unexposed patients used to learn the age signature and so on) was repeated 50 times and the average AUC over them was taken to account for the effect of randomness. This is what is depicted in
In this partially supervised extension, NMF was used to easily compare with the unsupervised approach by Alexandrov et al. (Nature 500, 415-421 (2013)). However, other methodologies (e.g. a classifier based on EM) may provide even better performance.
If there was no annotation for the presence of defects in the gene POL-ε among patients with endometrial cancer in the UCEC-TCGA dataset and the POL-ε signature was not known, the normalized results for an NMF decomposition are depicted in
Each predictive feature of the SuperSigs can be represented by its rate. For age, the “rate” of feature i, ria, is defined as the mean of the ratio:
in unexposed patients. This rate estimates the number of mutations of that particular feature accumulating per year and attributable to age. To estimate the proportion of mutations due to aging in each specific sample ria of each feature i present in the SuperSig age signature was multiplied by the patient's age of that specific sample. The number obtained by summing the above counts for each feature in the age SuperSig is then divided by the total number of mutations observed in that sample. This resulting ratio, being forced to be not greater than 1, is the estimate for the proportion of somatic mutations attributable to age in that sample (see Table 10).
The mutational landscape of an exposure in a tissue was defined as the 96-long vector (96 trinucleotide mutations) where each entry is given by the average count of that mutation type in the cohort of the samples with that exposure divided by the average age in that cohort. The mutational landscape of aging is obtained in the same way using the cohort of samples without any known exposure (“unexposed”). Then, the distance between any two mutational landscapes is given by 1—the Pearson's correlation between the two mutational landscapes (see
Robustness Analysis with Respect to Mislabeling
To assess the robustness of the methodology with respect to the quality of the clinical annotation, the labels were switch from unexposed to exposed (or vice versa) for 5%, 10%, 20%, and 25% of the samples in the training set. For example, non-smokers would be mislabeled as smokers and vice versa. Then the supervised method is rerun, including feature engineering and selection, on the training set to obtain new signatures. These new signatures are then used for prediction in the test set, where the original labels were used as the ground truth. The performance is reported in Table 11. AUCs at the different mislabeling percentages were compare and it was found that the method still outperforms the unsupervised method up to a mislabeling proportion of 20%, reaching a comparable prediction performance at a mislabeling proportion of 25%.
It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
T[C > G]A
T[C > G]A
C[T > A]G
[C > A]G
T[C > A]A
T[C > A]A
T[C > T]G
T[C > A]T
T[C > A]G
C[T > C]C
T[C > G]T
[C > A]H
T[C > G]A
T[C > A]Y
T[C > A]A
[C > G]V
[T > A]H
V[C > T]W
T[C > A]A
S[C > A]C
S[C > G]G
[C > G]V
[C > G]H
[T > A]H
C[T > A]G
T[C > G]A
V[C > T]W
T[C > A]H
T[C > A]G
T[C > G]A
T[C > A]G
T[C > A]H
T[C > A]A
T[C > G]A
T[C > A]G
S[C > A]C
T[C > A]H
T[C > G]A
T[C > A]Y
T[C > A]G
T[C > A]A
T[C > G]A
S[C > A]C
[C > A]D
T[C > G]V
[T > C]V
[C > G]V
[T > C]V
S[C > A]G
T[C > G]G
T[C > G]A
T[C > A]V
[C > G]V
[T > G]V
T[C > A]G
[T > A]H
C[T > A]G
C[T > A]H
T[C > A]G
C[T > A]G
[T > A]H
T[C > A]G
[C > A]H
[T > G]D
T[C > A]A
T[C > G]A
This application claims the benefit of U.S. Patent Application Ser. No. 62/858,007, filed on Jun. 6, 2019. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/036327 | 6/5/2020 | WO |
Number | Date | Country | |
---|---|---|---|
62858007 | Jun 2019 | US |