The present disclosure is generally directed to processing data to identify cancer-related mutations and microsatellite instability in cell-free DNA (cfDNA) sequence data.
The following description of the background of the present technology is provided simply as an aid in understanding the present technology and is not admitted to describe or constitute prior art to the present technology.
Tumors continually shed DNA into the circulation (circulating tumor DNA, or ctDNA), where it is readily accessible (Stroun et al., Eur J Cancer Clin Oncol 23:707-712 (1987)). Analysis of such cancer-derived cell-free DNA (cfDNA) has the potential to revolutionize cancer detection, tumor genotyping, and disease monitoring. For example, noninvasive access to tumor-derived DNA via liquid biopsies is particularly attractive for solid tumors. However, in most early- and many advanced-stage solid tumors, ctDNA blood levels are extremely low (˜0.1%) (Bettegowda, C. et al., Sci. Transl. Med. 6:224ra24 (2014); Newman, A. M. et al., Nat. Med. 20:548-554 (2014)), thus complicating ctDNA detection and analysis. Mutation fractions in cfDNA are often lower than those observed in tissue samples from the same subject and may approach the noise levels of next-generation sequencing workflows, making it impossible to distinguish true somatic mutations from artifacts. Recovery of cfDNA molecules and non-biological errors introduced during library preparation and sequencing limit analytical sensitivity and continue to represent a major obstacle for ultrasensitive ctDNA profiling.
The present disclosure is directed to more sensitive and high-throughput systems and methods for effective detection of somatic mutations and microsatellite instability from cfDNA, particularly for early-stage cancer subjects.
In one aspect, the disclosure is related to a computer-implemented method. The method includes receiving, by one or more processors, from a next generation sequencing device (i) a plurality of nucleic acid (e.g., cell-free DNA (cfDNA)) sequence read-pairs derived from a subject, each nucleic acid (e.g., cfDNA) sequence read from the plurality of nucleic acid (e.g., cfDNA) sequence reads including either a forward unique molecular identifier (UMI) or a reverse UMI, and (ii) a plurality of white blood cell (WBC)-derived sequence read-pairs derived from the subject, each WBC-derived sequence read from the plurality of WBC-derived sequence reads optionally including the forward UMI or the reverse UMI. The method further includes performing the following steps for each microsatellite locus of a plurality of microsatellite loci. The method includes identifying, by the one or more processors, a first subset of the plurality of nucleic acid (e.g., cfDNA) sequence reads and a second subset of the plurality of WBC-derived sequence reads, each read in the first subset and the second subset corresponding to the microsatellite locus. The method further includes identifying, by the one or more processors, from the first subset and the second subset, a set of alleles, each allele of the set of alleles having a distinct sequence. The method also includes determining, by the one or more processors, for each allele of the set of alleles, a number of nucleic acid (e.g., cfDNA) sequence reads that include the allele. The method further includes determining, by the one or more processors, for each allele of the set of alleles, a number of WBC-derived sequence reads that include the allele. The method also includes determining, by the one or more processors, for each allele in the set of alleles, an absolute difference based on a difference between the number of nucleic acid (e.g., cfDNA) sequence reads for the allele and the number of WBC-derived sequence reads for the allele.
The method also includes determining, by the one or more processors, for each microsatellite locus from the plurality of microsatellite loci, a distance based on a sum of absolute differences associated with all alleles in the set of alleles. The method further includes generating, by the one or more processors, a first distribution indicating a number of microsatellite loci having distances within a group of distinct distance intervals. The method further includes generating, by the one or more processors, a second distribution indicating a number of microsatellite loci having distances within the group of distinct distance intervals, the second distribution derived from distances associated with each microsatellite locus of the plurality of microsatellite loci observed in a reference sample. The method also includes determining, by the one or more processors, that a number of microsatellite loci in the first distribution above a threshold distance metric is greater than a number of microsatellite loci in the second distribution above the threshold distance metric to detect a presence of microsatellite instability in the subject. The method additionally includes storing, by the one or more processors, responsive to the determination, in one or more data structures, an association between the subject and the presence of microsatellite instability.
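The per-locus distance, the two distributions, and the threshold comparison described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the dictionary representation of allele counts, the half-open distance intervals, and the example threshold are all assumptions made for illustration.

```python
def locus_distance(cfdna_counts, wbc_counts):
    """Sum of absolute per-allele differences between cfDNA and WBC read counts.

    Each argument maps an allele (hypothetically keyed by a label such as "A10")
    to the number of reads that include it. Alleles seen in only one sample
    are counted as zero in the other.
    """
    alleles = set(cfdna_counts) | set(wbc_counts)
    return sum(abs(cfdna_counts.get(a, 0) - wbc_counts.get(a, 0)) for a in alleles)


def distance_histogram(distances, bin_edges):
    """Number of loci whose distance falls in each interval [lo, hi)."""
    hist = [0] * (len(bin_edges) - 1)
    for d in distances:
        for i in range(len(hist)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                hist[i] += 1
                break
    return hist


def loci_above(distances, threshold):
    """Number of loci whose distance exceeds the threshold distance metric."""
    return sum(1 for d in distances if d > threshold)


def msi_detected(sample_distances, reference_distances, threshold):
    """True when the sample has more loci above the threshold than the reference."""
    return loci_above(sample_distances, threshold) > loci_above(reference_distances, threshold)


# Illustrative values only; not data from the disclosure.
sample = [locus_distance({"A10": 50, "A9": 30}, {"A10": 80}), 5, 0]
reference = [2, 3, 1]
first_distribution = distance_histogram(sample, [0, 10, 50, 100])
second_distribution = distance_histogram(reference, [0, 10, 50, 100])
call = msi_detected(sample, reference, threshold=10)
```

In this sketch the stored association between the subject and the MSI call would simply be the boolean `call` written to whatever data structure the system uses.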
In some embodiments, the method further includes normalizing, by the one or more processors, for each allele of the set of alleles, the number of nucleic acid (e.g., cfDNA) sequence reads that include the allele based on a sum of the number of nucleic acid (e.g., cfDNA) sequence reads corresponding to all alleles in the set of alleles to generate a respective normalized number of nucleic acid (e.g., cfDNA) sequence reads corresponding to the allele, and normalizing, by the one or more processors, for each allele of the set of alleles, the number of WBC-derived sequence reads that include the allele based on a sum of the number of WBC-derived sequence reads corresponding to all alleles in the set of alleles to generate a respective normalized number of WBC-derived sequence reads corresponding to the allele, where, for each allele in the set of alleles, the absolute difference is based on a difference between the normalized number of nucleic acid (e.g., cfDNA) sequence reads for the allele and the normalized number of WBC-derived sequence reads for the allele.
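One way to picture the normalization is to divide each allele's read count by the total read count at the locus before taking absolute differences, so that cfDNA and WBC samples sequenced to different depths become comparable. The sketch below assumes that reading; the function names and the handling of a locus with zero coverage are illustrative choices, not details taken from the disclosure.

```python
def normalize(counts):
    """Convert per-allele read counts at one locus into fractions of the locus total."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: a locus with no coverage contributes zero for every allele.
        return {allele: 0.0 for allele in counts}
    return {allele: n / total for allele, n in counts.items()}


def normalized_distance(cfdna_counts, wbc_counts):
    """Sum over alleles of |normalized cfDNA count - normalized WBC count|."""
    cf = normalize(cfdna_counts)
    wb = normalize(wbc_counts)
    alleles = set(cf) | set(wb)
    return sum(abs(cf.get(a, 0.0) - wb.get(a, 0.0)) for a in alleles)


# Illustrative values only: the WBC sample is covered more deeply than the cfDNA sample.
shifted = normalized_distance({"A10": 60, "A9": 40}, {"A10": 200})
stable = normalized_distance({"A10": 30}, {"A10": 300})
```

With this normalization the distance at a locus lies between 0 (identical allele profiles) and 2 (no shared alleles), regardless of coverage.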
In some embodiments, the sum of absolute differences associated with all alleles in the set of alleles is based on a sum of an absolute difference between the normalized number of cfDNA sequence reads and the normalized number of WBC-derived sequence reads for each allele in the set of alleles. In some embodiments, the subject suffers from, or is suspected of having, Lynch Syndrome. In some embodiments, the subject harbors at least one mutation in one or more mismatch repair genes selected from the group consisting of MSH2, MSH6, MLH1, and PMS2. In some embodiments, the subject suffers from or is at risk for ovarian cancer, breast cancer, colorectal cancer, lung cancer, prostate cancer, gastric cancer, pancreatic cancer, cervical cancer, liver cancer, bladder cancer, cancer of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, head and neck cancer, or brain cancer. In some embodiments, the method further includes determining the presence of at least one mutation in an exon of a cancer-related gene selected from the group consisting of: AKT1, ALK, APC, AR, ARAF, ARID1A, ARID2, ATM, B2M, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CBFB, CCND1, CDH1, CDK4, CDKN2A, CIC, CREBBP, CTCF, CTNNB1, DICER1, DIS3, DNMT3A, EGFR, EIF1AX, EP300, ERBB2, ERBB3, ERCC2, ESR1, EZH2, FBXW7, FGFR1, FGFR2, FGFR3, FGFR4, FLT3, FOXA1, FOXL2, FOXO1, FUBP1, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HRAS, IDH1, IDH2, IKZF1, INPPL1, JAK1, KDM6A, KEAP1, KIT, KNSTRN, KRAS, MAP2K1, MAPK1, MAX, MED12, MET, MLH1, MSH2, MSH3, MSH6, MTOR, MYC, MYCN, MYD88, MYOD1, NF1, NFE2L2, NOTCH1, NRAS, NTRK1, NTRK2, NTRK3, NUP93, PAK7, PDGFRA, PIK3CA, PIK3CB, PIK3R1, PIK3R2, PMS2, POLE, PPP2R1A, PPP6C, PRKCI, PTCH1, PTEN, PTPN11, RAC1, RAF1, RB1, RET, RHOA, RIT1, ROS1, RRAS2, RXRA, SETD2, SF3B1, SMAD3, SMAD4, SMARCA4, SMARCB1, SOS1, SPOP, STAT3, STK11, STK19, TCF7L2, TERT, TGFBR1, TGFBR2, TP53, TP63, TSC1, TSC2, U2AF1, VHL, and XPO1.
In some embodiments, the at least one mutation is a deletion, an insertion, a translocation, an inversion, a copy number variant, or a point mutation. In some embodiments, the method further includes determining the presence of at least one genomic alteration in an intron of a cancer-related gene selected from the group consisting of: ALK, BRAF, EGFR, ETV6, FGFR2, FGFR3, MET, NTRK1, RET, and ROS1, or a promoter region of TERT. In some embodiments, the subject lacks detectable tumors.
In another aspect, the disclosure is related to a method for determining the efficacy of a therapy in a subject with an MSI-High tumor. The method includes administering the therapy to the subject. The method further includes detecting the presence of microsatellite instability in a first nucleic acid (e.g., cfDNA) sample obtained from the subject using any of the computer-implemented methods disclosed herein, following administration of the therapy. The method also includes determining that the therapy is effective when the first nucleic acid (e.g., cfDNA) sample shows a shift towards a distance metric that is associated with microsatellite stability (MSS) compared to that observed in a control sample obtained from the subject prior to administration of the therapy.
In some embodiments, the therapy is one or more of radiation therapy, chemotherapy, immunotherapy, or surgery. In some embodiments, chemotherapy includes the administration of one or more chemotherapeutic agents selected from the group consisting of abraxane, capecitabine, erlotinib, fluorouracil (5-FU), gemcitabine, irinotecan, leucovorin, nab-paclitaxel, cisplatin, docetaxel, oxaliplatin, tipifarnib, everolimus, sunitinib, dovitinib, ruxolitinib, pegylated-hyaluronidase, pemetrexed, folinic acid, paclitaxel, MK2206, GDC-0449, IPI-926, gamma secretase/RO4929097, M402, and LY293111. In some embodiments, immunotherapy includes the administration of one or more agents selected from the group consisting of immune checkpoint inhibitors (e.g., antibodies targeting CTLA-4, PD-1, PD-L1), ipilimumab, 90Y-Clivatuzumab tetraxetan, pembrolizumab, nivolumab, trastuzumab, cixutumumab, ganitumab, demcizumab, cetuximab, nimotuzumab, dalotuzumab, sipuleucel-T, CRS-207, and GVAX.
In another aspect, the disclosure is related to a system including one or more processors. The one or more processors are configured to receive from a next generation sequencing device (i) a plurality of nucleic acid (e.g., cfDNA) sequence read-pairs derived from a subject, each nucleic acid (e.g., cfDNA) sequence read from the plurality of nucleic acid (e.g., cfDNA) sequence reads including either a forward unique molecular identifier (UMI) or a reverse UMI, and (ii) a plurality of WBC-derived sequence read-pairs derived from the subject, each WBC-derived sequence read from the plurality of WBC-derived sequence reads optionally including the forward UMI or the reverse UMI. The one or more processors are configured to, for each microsatellite locus of a plurality of microsatellite loci: identify a first subset of the plurality of nucleic acid (e.g., cfDNA) sequence reads and a second subset of the plurality of WBC-derived sequence reads, each read in the first subset and the second subset corresponding to the microsatellite locus; identify, from the first subset and the second subset, a set of alleles, each allele of the set of alleles having a distinct sequence; determine, for each allele of the set of alleles, a number of nucleic acid (e.g., cfDNA) sequence reads that include the allele; determine, for each allele of the set of alleles, a number of WBC-derived sequence reads that include the allele; and determine, for each allele in the set of alleles, an absolute difference based on a difference between the number of nucleic acid (e.g., cfDNA) sequence reads for the allele and the number of WBC-derived sequence reads for the allele. The one or more processors are configured to determine, for each microsatellite locus from the plurality of microsatellite loci, a distance based on a sum of absolute differences associated with all alleles in the set of alleles.
The one or more processors are configured to generate a first distribution indicating a number of microsatellite loci having distances within a group of distinct distance intervals. The one or more processors are configured to generate a second distribution indicating a number of microsatellite loci having distances within the group of distinct distance intervals, the second distribution derived from distances associated with each microsatellite locus of the plurality of microsatellite loci observed in a reference sample. The one or more processors are configured to determine that a number of microsatellite loci in the first distribution above a threshold distance metric is greater than a number of microsatellite loci in the second distribution above the threshold distance metric to detect a presence of microsatellite instability in the subject. The one or more processors are configured to store, responsive to the determination, in one or more data structures, an association between the subject and the presence of microsatellite instability.
In some embodiments, the one or more processors are configured to normalize, for each allele of the set of alleles, the number of nucleic acid (e.g., cfDNA) sequence reads that include the allele based on a sum of the number of nucleic acid (e.g., cfDNA) sequence reads corresponding to all alleles in the set of alleles to generate a respective normalized number of nucleic acid (e.g., cfDNA) sequence reads corresponding to the allele, and normalize, for each allele of the set of alleles, the number of WBC-derived sequence reads that include the allele based on a sum of the number of WBC-derived sequence reads corresponding to all alleles in the set of alleles to generate a respective normalized number of WBC-derived sequence reads corresponding to the allele, where, for each allele in the set of alleles, the absolute difference is based on a difference between the normalized number of nucleic acid (e.g., cfDNA) sequence reads for the allele and the normalized number of WBC-derived sequence reads for the allele.
In one or more embodiments, the one or more processors are configured to generate a machine-learning or statistical classifier that generates a decision boundary on a coordinate space that separates a first set of data points that represent presence of microsatellite instability in sequence reads and a second set of data points that represent no presence of microsatellite instability in sequence reads, process the first distribution using the classifier to determine whether the first distribution belongs to the first set of data points or to the second set of data points, and determine microsatellite instability responsive to the classifier classifying the first distribution as belonging to the first set of data points that represent presence of microsatellite instability.
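As a rough illustration of a classifier that learns a decision boundary separating MSI from non-MSI distributions, the sketch below trains a plain perceptron on distance histograms expressed as fractions of loci per interval. The perceptron is a deliberately simple stand-in chosen so the example needs no external library; it is not the classifier of the disclosure, and the feature encoding, labels, function names, and training values are all assumptions.

```python
def train_perceptron(X, y, max_epochs=1000):
    """Learn a linear boundary w.x + b = 0 from labels in {-1, +1}.

    Returns (w, b, converged). For linearly separable data the perceptron
    stops after a finite number of updates.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified (or on the boundary): update
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


def classify(w, b, distribution):
    """Label a distribution by which side of the boundary it falls on."""
    score = sum(wj * xj for wj, xj in zip(w, distribution)) + b
    return "MSI" if score > 0 else "MSS"


# Hypothetical training histograms: fraction of loci in low / medium / high
# distance intervals. Label +1 = MSI present, -1 = MSI absent.
X = [
    [0.90, 0.08, 0.02], [0.95, 0.04, 0.01], [0.92, 0.06, 0.02],  # no MSI
    [0.50, 0.30, 0.20], [0.60, 0.25, 0.15], [0.70, 0.20, 0.10],  # MSI
]
y = [-1, -1, -1, 1, 1, 1]
w, b, converged = train_perceptron(X, y)
labels = [classify(w, b, x) for x in X]
```

A margin-maximizing classifier such as a support vector machine would be trained and applied in the same fit-then-classify pattern, with the first distribution supplied as the feature vector.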
In another aspect, the disclosure is related to a computer-implemented method to identify at least one mutation in cell free DNA (cfDNA) present in a sample processed by a next-generation sequencing device. The method includes receiving, by a computer server including one or more processors, from the next generation sequencing device a plurality of first cfDNA sequence reads derived from one strand of a template double-stranded cfDNA molecule (hereinafter referred to as the ‘sense’ strand), each cfDNA sequence read from the plurality of first cfDNA sequence reads including a first unique molecular identifier (UMI), and a plurality of second cfDNA sequence reads derived from the opposite (complementary) strand of the template double-stranded cfDNA molecule (hereinafter referred to as the ‘antisense’ strand), each cfDNA sequence read from the plurality of second cfDNA sequence reads including a second UMI. The method further includes identifying, by the computer server, a first set of mutations in each of the plurality of first cfDNA sequence reads. The method also includes identifying, by the computer server, a second set of mutations in each of the plurality of second cfDNA sequence reads. The method also includes identifying a first set of consensus mutations in the plurality of first cfDNA sequence reads, the first set of consensus mutations including mutations from the first set of mutations that appear in the same position in the respective cfDNA sequence reads of the plurality of first cfDNA sequence reads. The method further includes identifying a second set of consensus mutations in the plurality of second cfDNA sequence reads, the second set of consensus mutations including mutations from the second set of mutations that appear in the same position in the respective cfDNA sequence reads of the plurality of second cfDNA sequence reads.
The method further includes identifying a third set of consensus mutations selected from the first set of consensus mutations, each mutation in the third set of consensus mutations having a consistent mutation in the second set of consensus mutations. The method also includes identifying a WBC set of mutations in a plurality of white blood cell (WBC) sequence reads derived from the subject. The method additionally includes generating a final set of consensus mutations by removing from the third set of consensus mutations those consensus mutations that appear in the set of WBC mutations.
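The set logic of these last three steps can be sketched as follows. The sketch assumes, purely for illustration, that each mutation is recorded as a (chromosome, position, reference base, alternate base) tuple and that sense-strand and antisense-strand calls have already been expressed against the same reference strand, so that a consistent mutation is an identical tuple; the function name is hypothetical.

```python
def final_consensus_mutations(sense_consensus, antisense_consensus, wbc_mutations):
    """Keep mutations supported by both strands, then drop those also seen in WBC reads."""
    # Third set: sense-strand consensus mutations with a consistent antisense-strand mutation.
    third_set = {m for m in sense_consensus if m in antisense_consensus}
    # Final set: remove mutations that also appear in the matched WBC sample
    # (e.g., germline variants or clonal hematopoiesis).
    return third_set - set(wbc_mutations)


# Illustrative values only.
sense = {("chr7", 140453136, "A", "T"), ("chr17", 7578406, "C", "T"), ("chr1", 1000, "G", "A")}
antisense = {("chr7", 140453136, "A", "T"), ("chr17", 7578406, "C", "T")}
wbc = {("chr17", 7578406, "C", "T")}
final = final_consensus_mutations(sense, antisense, wbc)
```

Here the single-strand call on chr1 is discarded as a likely artifact, and the chr17 call is discarded because it is also present in the WBC reads.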
In some embodiments, the cfDNA in the sample comprises circulating tumor DNA (ctDNA). In some embodiments, the at least one mutation identified is in an exon of a cancer-related gene selected from the group consisting of:
AKT1, ALK, APC, AR, ARAF, ARID1A, ARID2, ATM, B2M, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CBFB, CCND1, CDH1, CDK4, CDKN2A, CIC, CREBBP, CTCF, CTNNB1, DICER1, DIS3, DNMT3A, EGFR, EIF1AX, EP300, ERBB2, ERBB3, ERCC2, ESR1, EZH2, FBXW7, FGFR1, FGFR2, FGFR3, FGFR4, FLT3, FOXA1, FOXL2, FOXO1, FUBP1, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HRAS, IDH1, IDH2, IKZF1, INPPL1, JAK1, KDM6A, KEAP1, KIT, KNSTRN, KRAS, MAP2K1, MAPK1, MAX, MED12, MET, MLH1, MSH2, MSH3, MSH6, MTOR, MYC, MYCN, MYD88, MYOD1, NF1, NFE2L2, NOTCH1, NRAS, NTRK1, NTRK2, NTRK3, NUP93, PAK7, PDGFRA, PIK3CA, PIK3CB, PIK3R1, PIK3R2, PMS2, POLE, PPP2R1A, PPP6C, PRKCI, PTCH1, PTEN, PTPN11, RAC1, RAF1, RB1, RET, RHOA, RIT1, ROS1, RRAS2, RXRA, SETD2, SF3B1, SMAD3, SMAD4, SMARCA4, SMARCB1, SOS1, SPOP, STAT3, STK11, STK19, TCF7L2, TERT, TGFBR1, TGFBR2, TP53, TP63, TSC1, TSC2, U2AF1, VHL, and XPO1.
In some embodiments, the at least one genomic alteration detected is in an intron of a cancer-related gene selected from the group consisting of: ALK, BRAF, EGFR, ETV6, FGFR2, FGFR3, MET, NTRK1, RET, and ROS1, or a promoter region of TERT. In some embodiments, the at least one mutation detected is in a microsatellite locus for microsatellite instability. In some embodiments, the at least one mutation detected is in a cancer-related gene selected from the group consisting of: BRCA1/2, MLH1, MSH2, MSH6, and PMS2. In some embodiments, the at least one mutation is a deletion, an insertion, a translocation, an inversion, a copy number variant, or a point mutation. In some embodiments, the cfDNA sample is serum, plasma, sweat, tears, urine, saliva, synovial fluid, lymphatic fluid, ascites fluid, amniotic fluid, or interstitial fluid. In some embodiments, the subject suffers from or is at risk for ovarian cancer, breast cancer, colorectal cancer, lung cancer, prostate cancer, gastric cancer, pancreatic cancer, cervical cancer, liver cancer, bladder cancer, cancer of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, head and neck cancer, or brain cancer.
In some embodiments, the method further includes trimming the first UMI from the plurality of first cfDNA sequence reads and trimming the second UMI from the plurality of second cfDNA sequence reads prior to identifying the first set of mutations and the second set of mutations. In some embodiments, the method further includes filtering the first set of mutations and the second set of mutations based on known hotspot mutations. In some embodiments, the method also includes filtering the first set of mutations and the second set of mutations based on a set of mutations identified in cfDNA sequence reads associated with healthy individuals. In some embodiments, the method also includes identifying the first set of consensus mutations in the plurality of first cfDNA sequence reads, the first set of consensus mutations including mutations from the first set of mutations that appear in the same position in more than half of the respective cfDNA sequence reads of the plurality of first cfDNA sequence reads. In some embodiments, the method further includes identifying the second set of consensus mutations in the plurality of second cfDNA sequence reads, the second set of consensus mutations including mutations from the second set of mutations that appear in the same position in more than half of the respective cfDNA sequence reads of the plurality of second cfDNA sequence reads.
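UMI trimming and the more-than-half consensus rule can be sketched as below. The sketch assumes a fixed-length UMI at the 5' end of each read and represents each read's mutations as a set of (position, reference base, alternate base) tuples; both are illustrative assumptions, as are the function names.

```python
from collections import Counter


def trim_umi(read_sequence, umi_length):
    """Split a read into its leading UMI and the remaining template sequence."""
    return read_sequence[:umi_length], read_sequence[umi_length:]


def majority_consensus(per_read_mutations):
    """Mutations found at the same position in more than half of the reads.

    per_read_mutations holds one collection of mutation tuples per read,
    for reads derived from the same strand of the same template molecule.
    """
    n_reads = len(per_read_mutations)
    support = Counter(m for muts in per_read_mutations for m in set(muts))
    return {m for m, count in support.items() if count > n_reads / 2}


# Illustrative values only.
umi, template = trim_umi("ACGTTTGACC", 4)
reads = [
    {(100, "C", "T")},
    {(100, "C", "T"), (105, "G", "A")},  # (105, G>A) appears in one read only
    {(100, "C", "T")},
    set(),
]
consensus = majority_consensus(reads)
```

A mutation supported by exactly half of the reads is excluded, because the rule requires strictly more than half.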
In some embodiments, the method further includes receiving, by the computer server including one or more processors, from the next generation sequencing device a plurality of first WBC sequence reads derived from the subject, each WBC sequence read from the plurality of first WBC sequence reads optionally including a first WBC UMI, and a plurality of second WBC sequence reads derived from the subject, each WBC sequence read from the plurality of second WBC sequence reads optionally including a second WBC UMI. The method also includes identifying, by the computer server, a first WBC set of mutations in each of the plurality of first WBC sequence reads. The method further includes identifying, by the computer server, a second WBC set of mutations in each of the plurality of second WBC sequence reads. The method also includes identifying a first WBC set of consensus mutations in the plurality of first WBC sequence reads, the first WBC set of consensus mutations including mutations from the first WBC set of mutations that appear in the same position in the respective WBC sequence reads of the plurality of first WBC sequence reads. The method also includes identifying a second WBC set of consensus mutations in the plurality of second WBC sequence reads, the second WBC set of consensus mutations including mutations from the second WBC set of mutations that appear in the same position in the respective WBC sequence reads of the plurality of second WBC sequence reads. The method further includes identifying the WBC set of mutations selected from the first WBC set of consensus mutations, each mutation in the WBC set of mutations having a consistent mutation in the second WBC set of consensus mutations. In some embodiments, having the consistent mutation in the second set of consensus mutations includes a nucleotide sequence that is complementary to a nucleotide sequence of the corresponding consensus mutation in the first set of consensus mutations.
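The complementarity criterion in the last sentence can be illustrated for single-nucleotide substitutions. The sketch assumes each strand's mutation is recorded in that strand's own bases as a (position, reference base, alternate base) tuple at a shared coordinate, so a sense-strand C>T is consistent with an antisense-strand G>A. Multi-base alleles, which would call for a reverse complement, are outside the scope of this sketch, and the names are hypothetical.

```python
# Watson-Crick base pairing for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(bases):
    """Return the base-by-base complement of a DNA string."""
    return bases.translate(COMPLEMENT)


def is_consistent(sense_mutation, antisense_mutation):
    """True when the antisense call is the complement of the sense call at the same position."""
    s_pos, s_ref, s_alt = sense_mutation
    a_pos, a_ref, a_alt = antisense_mutation
    return s_pos == a_pos and complement(s_ref) == a_ref and complement(s_alt) == a_alt


def duplex_supported(sense_consensus, antisense_consensus):
    """Sense-strand consensus mutations that have a consistent antisense-strand mutation."""
    return {
        s for s in sense_consensus
        if any(is_consistent(s, a) for a in antisense_consensus)
    }


# Illustrative values only.
supported = duplex_supported({(100, "C", "T"), (200, "A", "G")}, {(100, "G", "A"), (200, "T", "A")})
```

The same check applies unchanged to the WBC consensus sets described above.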
The foregoing and other objects, aspects, features, and advantages of the disclosure will become more apparent and better understood by referring to the following description taken in conjunction with the accompanying drawings, in which:
For purposes of reading the description of the various embodiments below, the following descriptions of the sections of the specification and their respective contents may be helpful:
Section A describes a network environment and computing environment which may be useful for practicing embodiments described herein.
Section B describes embodiments of systems and methods for identifying mutations in cell-free DNA.
Section C describes embodiments of systems and methods for detecting the presence of microsatellite instability in cell-free DNA.
The superior performance of the methods and systems disclosed herein with respect to detecting microsatellite instability in cfDNA may be attributed, at least in part, to the following technical features:
(a) Normalization of allelic coverage at the sample level as well as the microsatellite level, which helps mitigate inaccuracies caused by differences in coverage across samples and genomic regions;
(b) Use of an absolute distance associated with each microsatellite locus, which is a more robust estimate that is resistant to outliers and suitable for sparse data;
(c) Support Vector Machine (SVM) classifiers increase computational efficiency and are naturally resistant to overfitting; and
(d) Leveraging upstream collapsing and error suppression allows for highly accurate quantification of MSI.
The methods disclosed herein permit early detection of cancer in high-risk subjects, such as those with Lynch Syndrome, and can be used as an indicator of responsiveness to a particular therapeutic regimen. MSI detection is a critical component of clinical genomic profiling to guide diagnosis and treatment selection. Moreover, as shown in
A. Computing and Network Environment
Prior to discussing specific embodiments of the present solution, it may be helpful to describe aspects of the operating environment as well as associated system components (e.g., hardware elements) in connection with the methods and systems described herein. Referring to
Although
The network 104 may be connected via wired or wireless links. Wired links may include Digital Subscriber Line (DSL), coaxial cable lines, or optical fiber lines. The wireless links may include BLUETOOTH, Wi-Fi, Worldwide Interoperability for Microwave Access (WiMAX), an infrared channel or satellite band. The wireless links may also include any cellular network standards used to communicate among mobile devices, including standards that qualify as 1G, 2G, 3G, or 4G. The network standards may qualify as one or more generations of mobile telecommunication standards by fulfilling a specification or standards such as the specifications maintained by the International Telecommunication Union. The 3G standards, for example, may correspond to the International Mobile Telecommunications-2000 (IMT-2000) specification, and the 4G standards may correspond to the International Mobile Telecommunications Advanced (IMT-Advanced) specification. Examples of cellular network standards include AMPS, GSM, GPRS, UMTS, LTE, LTE Advanced, Mobile WiMAX, and WiMAX-Advanced. Cellular network standards may use various channel access methods, e.g., FDMA, TDMA, CDMA, or SDMA. In some embodiments, different types of data may be transmitted via different links and standards. In other embodiments, the same types of data may be transmitted via different links and standards.
The network 104 may be any type and/or form of network. The geographical scope of the network 104 may vary widely and the network 104 can be a body area network (BAN), a personal area network (PAN), a local-area network (LAN), e.g. Intranet, a metropolitan area network (MAN), a wide area network (WAN), or the Internet. The topology of the network 104 may be of any form and may include, e.g., any of the following: point-to-point, bus, star, ring, mesh, or tree. The network 104 may be an overlay network which is virtual and sits on top of one or more layers of other networks 104′. The network 104 may be of any such network topology as known to those ordinarily skilled in the art capable of supporting the operations described herein. The network 104 may utilize different techniques and layers or stacks of protocols, including, e.g., the Ethernet protocol, the internet protocol suite (TCP/IP), the ATM (Asynchronous Transfer Mode) technique, the SONET (Synchronous Optical Networking) protocol, or the SDH (Synchronous Digital Hierarchy) protocol. The TCP/IP internet protocol suite may include application layer, transport layer, internet layer (including, e.g., IPv6), or the link layer. The network 104 may be a type of a broadcast network, a telecommunications network, a data communication network, or a computer network.
In some embodiments, the system may include multiple, logically-grouped servers 106. In one of these embodiments, the logical group of servers may be referred to as a server farm 38 or a machine farm 38. In another of these embodiments, the servers 106 may be geographically dispersed. In other embodiments, a machine farm 38 may be administered as a single entity. In still other embodiments, the machine farm 38 includes a plurality of machine farms 38. The servers 106 within each machine farm 38 can be heterogeneous: one or more of the servers 106 or machines 106 can operate according to one type of operating system platform (e.g., WINDOWS NT, manufactured by Microsoft Corp. of Redmond, Wash.), while one or more of the other servers 106 can operate according to another type of operating system platform (e.g., Unix, Linux, or Mac OS X).
In one embodiment, servers 106 in the machine farm 38 may be stored in high-density rack systems, along with associated storage systems, and located in an enterprise data center. In this embodiment, consolidating the servers 106 in this way may improve system manageability, data security, the physical security of the system, and system performance by locating servers 106 and high performance storage systems on localized high performance networks. Centralizing the servers 106 and storage systems and coupling them with advanced system management tools allows more efficient use of server resources.
The servers 106 of each machine farm 38 do not need to be physically proximate to another server 106 in the same machine farm 38. Thus, the group of servers 106 logically grouped as a machine farm 38 may be interconnected using a wide-area network (WAN) connection or a metropolitan-area network (MAN) connection. For example, a machine farm 38 may include servers 106 physically located in different continents or different regions of a continent, country, state, city, campus, or room. Data transmission speeds between servers 106 in the machine farm 38 can be increased if the servers 106 are connected using a local-area network (LAN) connection or some form of direct connection. Additionally, a heterogeneous machine farm 38 may include one or more servers 106 operating according to a type of operating system, while one or more other servers 106 execute one or more types of hypervisors rather than operating systems. In these embodiments, hypervisors may be used to emulate virtual hardware, partition physical hardware, virtualize physical hardware, and execute virtual machines that provide access to computing environments, allowing multiple operating systems to run concurrently on a host computer. Native hypervisors may run directly on the host computer. Hypervisors may include VMware ESX/ESXi, manufactured by VMware, Inc., of Palo Alto, Calif.; the Xen hypervisor, an open source product whose development is overseen by Citrix Systems, Inc.; the HYPER-V hypervisors provided by Microsoft; or others. Hosted hypervisors may run within an operating system on a second software level. Examples of hosted hypervisors may include VMware Workstation and VIRTUALBOX.
Management of the machine farm 38 may be de-centralized. For example, one or more servers 106 may comprise components, subsystems and modules to support one or more management services for the machine farm 38. In one of these embodiments, one or more servers 106 provide functionality for management of dynamic data, including techniques for handling failover, data replication, and increasing the robustness of the machine farm 38. Each server 106 may communicate with a persistent store and, in some embodiments, with a dynamic store.
Server 106 may be a file server, application server, web server, proxy server, appliance, network appliance, gateway, gateway server, virtualization server, deployment server, SSL VPN server, or firewall. In one embodiment, the server 106 may be referred to as a remote machine or a node. In another embodiment, a plurality of nodes 290 may be in the path between any two communicating servers.
Referring to
The cloud 108 may be public, private, or hybrid. Public clouds may include public servers 106 that are maintained by third parties to the clients 102 or the owners of the clients. The servers 106 may be located off-site in remote geographical locations as disclosed above or otherwise. Public clouds may be connected to the servers 106 over a public network. Private clouds may include private servers 106 that are physically maintained by clients 102 or owners of clients. Private clouds may be connected to the servers 106 over a private network 104. Hybrid clouds 108 may include both the private and public networks 104 and servers 106.
The cloud 108 may also include a cloud based delivery, e.g. Software as a Service (SaaS) 110, Platform as a Service (PaaS) 112, and Infrastructure as a Service (IaaS) 114. IaaS may refer to a user renting the use of infrastructure resources that are needed during a specified time period. IaaS providers may offer storage, networking, servers or virtualization resources from large pools, allowing the users to quickly scale up by accessing more resources as needed. Examples of IaaS can include infrastructure and services (e.g., EG-32) provided by OVH HOSTING of Montreal, Quebec, Canada, AMAZON WEB SERVICES provided by Amazon.com, Inc., of Seattle, Wash., RACKSPACE CLOUD provided by Rackspace US, Inc., of San Antonio, Tex., Google Compute Engine provided by Google Inc. of Mountain View, Calif., or RIGHTSCALE provided by RightScale, Inc., of Santa Barbara, Calif. PaaS providers may offer functionality provided by IaaS, including, e.g., storage, networking, servers or virtualization, as well as additional resources such as, e.g., the operating system, middleware, or runtime resources. Examples of PaaS include WINDOWS AZURE provided by Microsoft Corporation of Redmond, Wash., Google App Engine provided by Google Inc., and HEROKU provided by Heroku, Inc. of San Francisco, Calif. SaaS providers may offer the resources that PaaS provides, including storage, networking, servers, virtualization, operating system, middleware, or runtime resources. In some embodiments, SaaS providers may offer additional resources including, e.g., data and application resources. Examples of SaaS include GOOGLE APPS provided by Google Inc., SALESFORCE provided by Salesforce.com Inc. of San Francisco, Calif., or OFFICE 365 provided by Microsoft Corporation. Examples of SaaS may also include data storage providers, e.g. DROPBOX provided by Dropbox, Inc. of San Francisco, Calif., Microsoft SKYDRIVE provided by Microsoft Corporation, Google Drive provided by Google Inc., or Apple ICLOUD provided by Apple Inc. of Cupertino, Calif.
Clients 102 may access IaaS resources with one or more IaaS standards, including, e.g., Amazon Elastic Compute Cloud (EC2), Open Cloud Computing Interface (OCCI), Cloud Infrastructure Management Interface (CIMI), or OpenStack standards. Some IaaS standards may allow clients access to resources over HTTP, and may use Representational State Transfer (REST) protocol or Simple Object Access Protocol (SOAP). Clients 102 may access PaaS resources with different PaaS interfaces. Some PaaS interfaces use HTTP packages, standard Java APIs, JavaMail API, Java Data Objects (JDO), Java Persistence API (JPA), Python APIs, web integration APIs for different programming languages including, e.g., Rack for Ruby, WSGI for Python, or PSGI for Perl, or other APIs that may be built on REST, HTTP, XML, or other protocols. Clients 102 may access SaaS resources through the use of web-based user interfaces, provided by a web browser (e.g. GOOGLE CHROME, Microsoft INTERNET EXPLORER, or Mozilla Firefox provided by Mozilla Foundation of Mountain View, Calif.). Clients 102 may also access SaaS resources through smartphone or tablet applications, including, e.g., Salesforce Sales Cloud, or Google Drive app. Clients 102 may also access SaaS resources through the client operating system, including, e.g., Windows file system for DROPBOX.
In some embodiments, access to IaaS, PaaS, or SaaS resources may be authenticated. For example, a server or authentication server may authenticate a user via security certificates, HTTPS, or API keys. API keys may include various encryption standards such as, e.g., Advanced Encryption Standard (AES). Data resources may be sent over Transport Layer Security (TLS) or Secure Sockets Layer (SSL).
The client 102 and server 106 may be deployed as and/or executed on any type and form of computing device, e.g. a computer, network device or appliance capable of communicating on any type and form of network and performing the operations described herein.
The central processing unit 121 is any logic circuitry that responds to and processes instructions fetched from the main memory unit 122. In many embodiments, the central processing unit 121 is provided by a microprocessor unit, e.g.: those manufactured by Intel Corporation of Santa Clara, Calif.; those manufactured by Motorola Corporation of Schaumburg, Ill.; the ARM processor and TEGRA system on a chip (SoC) manufactured by Nvidia of Santa Clara, Calif.; the POWER7 processor, those manufactured by International Business Machines of White Plains, N.Y.; or those manufactured by Advanced Micro Devices of Sunnyvale, Calif. The computing device 100 may be based on any of these processors, or any other processor capable of operating as described herein. The central processing unit 121 may utilize instruction level parallelism, thread level parallelism, different levels of cache, and multi-core processors. A multi-core processor may include two or more processing units on a single computing component. Examples of multi-core processors include the AMD PHENOM II X2, INTEL CORE i5 and INTEL CORE i7.
Main memory unit 122 may include one or more memory chips capable of storing data and allowing any storage location to be directly accessed by the microprocessor 121. Main memory unit 122 may be volatile and faster than storage 128 memory. Main memory units 122 may be dynamic random access memory (DRAM) or any variants, including static random access memory (SRAM), Burst SRAM or SynchBurst SRAM (BSRAM), Fast Page Mode DRAM (FPM DRAM), Enhanced DRAM (EDRAM), Extended Data Output RAM (EDO RAM), Extended Data Output DRAM (EDO DRAM), Burst Extended Data Output DRAM (BEDO DRAM), Single Data Rate Synchronous DRAM (SDR SDRAM), Double Data Rate SDRAM (DDR SDRAM), Direct Rambus DRAM (DRDRAM), or Extreme Data Rate DRAM (XDR DRAM). In some embodiments, the main memory 122 or the storage 128 may be non-volatile; e.g., non-volatile random access memory (NVRAM), flash memory, non-volatile static RAM (nvSRAM), Ferroelectric RAM (FeRAM), Magnetoresistive RAM (MRAM), Phase-change memory (PRAM), conductive-bridging RAM (CBRAM), Silicon-Oxide-Nitride-Oxide-Silicon (SONOS), Resistive RAM (RRAM), Racetrack, Nano-RAM (NRAM), or Millipede memory. The main memory 122 may be based on any of the above described memory chips, or any other available memory chips capable of operating as described herein. In the embodiment shown in
A wide variety of I/O devices 130a-130n may be present in the computing device 100. Input devices may include keyboards, mice, trackpads, trackballs, touchpads, touch mice, multi-touch touchpads and touch mice, microphones, multi-array microphones, drawing tablets, cameras, single-lens reflex camera (SLR), digital SLR (DSLR), CMOS sensors, accelerometers, infrared optical sensors, pressure sensors, magnetometer sensors, angular rate sensors, depth sensors, proximity sensors, ambient light sensors, gyroscopic sensors, or other sensors. Output devices may include video displays, graphical displays, speakers, headphones, inkjet printers, laser printers, and 3D printers.
Devices 130a-130n may include a combination of multiple input or output devices, including, e.g., Microsoft KINECT, Nintendo Wiimote for the WII, Nintendo WII U GAMEPAD, or Apple IPHONE. Some devices 130a-130n allow gesture recognition inputs through combining some of the inputs and outputs. Some devices 130a-130n provide for facial recognition which may be utilized as an input for different purposes including authentication and other commands. Some devices 130a-130n provide for voice recognition and inputs, including, e.g., Microsoft KINECT, SIRI for IPHONE by Apple, Google Now or Google Voice Search.
Additional devices 130a-130n have both input and output capabilities, including, e.g., haptic feedback devices, touchscreen displays, or multi-touch displays. Touchscreen, multi-touch displays, touchpads, touch mice, or other touch sensing devices may use different technologies to sense touch, including, e.g., capacitive, surface capacitive, projected capacitive touch (PCT), in-cell capacitive, resistive, infrared, waveguide, dispersive signal touch (DST), in-cell optical, surface acoustic wave (SAW), bending wave touch (BWT), or force-based sensing technologies. Some multi-touch devices may allow two or more contact points with the surface, allowing advanced functionality including, e.g., pinch, spread, rotate, scroll, or other gestures. Some touchscreen devices, including, e.g., Microsoft PIXELSENSE or Multi-Touch Collaboration Wall, may have larger surfaces, such as on a table-top or on a wall, and may also interact with other electronic devices. Some I/O devices 130a-130n, display devices 124a-124n or groups of devices may be augmented reality devices. The I/O devices may be controlled by an I/O controller 123 as shown in
In some embodiments, display devices 124a-124n may be connected to I/O controller 123. Display devices may include, e.g., liquid crystal displays (LCD), thin film transistor LCD (TFT-LCD), blue phase LCD, electronic papers (e-ink) displays, flexible displays, light emitting diode displays (LED), digital light processing (DLP) displays, liquid crystal on silicon (LCOS) displays, organic light-emitting diode (OLED) displays, active-matrix organic light-emitting diode (AMOLED) displays, liquid crystal laser displays, time-multiplexed optical shutter (TMOS) displays, or 3D displays. Examples of 3D displays may use, e.g. stereoscopy, polarization filters, active shutters, or autostereoscopy. Display devices 124a-124n may also be a head-mounted display (HMD). In some embodiments, display devices 124a-124n or the corresponding I/O controllers 123 may be controlled through or have hardware support for OPENGL or DIRECTX API or other graphics libraries.
In some embodiments, the computing device 100 may include or connect to multiple display devices 124a-124n, which each may be of the same or different type and/or form. As such, any of the I/O devices 130a-130n and/or the I/O controller 123 may include any type and/or form of suitable hardware, software, or combination of hardware and software to support, enable or provide for the connection and use of multiple display devices 124a-124n by the computing device 100. For example, the computing device 100 may include any type and/or form of video adapter, video card, driver, and/or library to interface, communicate, connect or otherwise use the display devices 124a-124n. In one embodiment, a video adapter may include multiple connectors to interface to multiple display devices 124a-124n. In other embodiments, the computing device 100 may include multiple video adapters, with each video adapter connected to one or more of the display devices 124a-124n. In some embodiments, any portion of the operating system of the computing device 100 may be configured for using multiple displays 124a-124n. In other embodiments, one or more of the display devices 124a-124n may be provided by one or more other computing devices 100a or 100b connected to the computing device 100, via the network 104. In some embodiments software may be designed and constructed to use another computer's display device as a second display device 124a for the computing device 100. For example, in one embodiment, an Apple iPad may connect to a computing device 100 and use the display of the device 100 as an additional display screen that may be used as an extended desktop. One ordinarily skilled in the art will recognize and appreciate the various ways and embodiments that a computing device 100 may be configured to have multiple display devices 124a-124n.
Referring again to
Client device 100 may also install software or application from an application distribution platform. Examples of application distribution platforms include the App Store for iOS provided by Apple, Inc., the Mac App Store provided by Apple, Inc., GOOGLE PLAY for Android OS provided by Google Inc., Chrome Webstore for CHROME OS provided by Google Inc., and Amazon Appstore for Android OS and KINDLE FIRE provided by Amazon.com, Inc. An application distribution platform may facilitate installation of software on a client device 102. An application distribution platform may include a repository of applications on a server 106 or a cloud 108, which the clients 102a-102n may access over a network 104. An application distribution platform may include application developed and provided by various developers. A user of a client device 102 may select, purchase and/or download an application via the application distribution platform.
Furthermore, the computing device 100 may include a network interface 118 to interface to the network 104 through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (e.g., 802.11, T1, T3, Gigabit Ethernet, Infiniband), broadband connections (e.g., ISDN, Frame Relay, ATM, Gigabit Ethernet, Ethernet-over-SONET, ADSL, VDSL, BPON, GPON, fiber optical including FiOS), wireless connections, or some combination of any or all of the above. Connections can be established using a variety of communication protocols (e.g., TCP/IP, Ethernet, ARCNET, SONET, SDH, Fiber Distributed Data Interface (FDDI), IEEE 802.11a/b/g/n/ac, CDMA, GSM, WiMax and direct asynchronous connections). In one embodiment, the computing device 100 communicates with other computing devices 100′ via any type and/or form of gateway or tunneling protocol e.g. Secure Socket Layer (SSL) or Transport Layer Security (TLS), or the Citrix Gateway Protocol manufactured by Citrix Systems, Inc. of Ft. Lauderdale, Fla. The network interface 118 may comprise a built-in network adapter, network interface card, PCMCIA network card, EXPRESSCARD network card, card bus network adapter, wireless network adapter, USB network adapter, modem or any other device suitable for interfacing the computing device 100 to any type of network capable of communication and performing the operations described herein.
A computing device 100 of the sort depicted in
The computer system 100 can be any workstation, telephone, desktop computer, laptop or notebook computer, netbook, ULTRABOOK, tablet, server, handheld computer, mobile telephone, smartphone or other portable telecommunications device, media playing device, a gaming system, mobile computing device, or any other type and/or form of computing, telecommunications or media device that is capable of communication. The computer system 100 has sufficient processor power and memory capacity to perform the operations described herein. In some embodiments, the computing device 100 may have different processors, operating systems, and input devices consistent with the device. The Samsung GALAXY smartphones, e.g., operate under the control of Android operating system developed by Google, Inc. GALAXY smartphones receive input via a touch interface.
In some embodiments, the computing device 100 is a gaming system. For example, the computer system 100 may comprise a PLAYSTATION 3, a PLAYSTATION PORTABLE (PSP), or a PLAYSTATION VITA device manufactured by the Sony Corporation of Tokyo, Japan; a NINTENDO DS, NINTENDO 3DS, NINTENDO WII, or NINTENDO WII U device manufactured by Nintendo Co., Ltd., of Kyoto, Japan; or an XBOX 360 device manufactured by the Microsoft Corporation of Redmond, Wash.
In some embodiments, the computing device 100 is a digital audio player such as the Apple IPOD, IPOD Touch, and IPOD NANO lines of devices, manufactured by Apple Computer of Cupertino, Calif. Some digital audio players may have other functionality, including, e.g., a gaming system or any functionality made available by an application from a digital application distribution platform. For example, the IPOD Touch may access the Apple App Store. In some embodiments, the computing device 100 is a portable media player or digital audio player supporting file formats including, but not limited to, MP3, WAV, M4A/AAC, WMA, Protected AAC, AIFF, Audible audiobook, and Apple Lossless audio file formats and .mov, .m4v, and .mp4 MPEG-4 (H.264/MPEG-4 AVC) video file formats.
In some embodiments, the computing device 100 is a tablet e.g. the IPAD line of devices by Apple; GALAXY TAB family of devices by Samsung; or KINDLE FIRE, by Amazon.com, Inc. of Seattle, Wash. In other embodiments, the computing device 100 is an eBook reader, e.g. the KINDLE family of devices by Amazon.com, or NOOK family of devices by Barnes & Noble, Inc. of New York City, N.Y.
In some embodiments, the communications device 102 includes a combination of devices, e.g. a smartphone combined with a digital audio player or portable media player. For example, one of these embodiments is a smartphone, e.g. the IPHONE family of smartphones manufactured by Apple, Inc.; a Samsung GALAXY family of smartphones manufactured by Samsung, Inc.; or a Motorola DROID family of smartphones. In yet another embodiment, the communications device 102 is a laptop or desktop computer equipped with a web browser and a microphone and speaker system, e.g. a telephony headset. In these embodiments, the communications devices 102 are web-enabled and can receive and initiate phone calls. In some embodiments, a laptop or desktop computer is also equipped with a webcam or other video capture device that enables video chat and video call.
In some embodiments, the status of one or more machines 102, 106 in the network 104 is monitored, generally as part of network management. In one of these embodiments, the status of a machine may include an identification of load information (e.g., the number of processes on the machine, CPU and memory utilization), of port information (e.g., the number of available communication ports and the port addresses), or of session status (e.g., the duration and type of processes, and whether a process is active or idle). In another of these embodiments, this information may be identified by a plurality of metrics, and the plurality of metrics can be applied at least in part towards decisions in load distribution, network traffic management, and network failure recovery as well as any aspects of operations of the present solution described herein. Aspects of the operating environments and components described above will become apparent in the context of the systems and methods disclosed herein.
B. Computer-Implemented Method for Identifying Mutations in Cell-Free DNA
cfDNA encompasses all small DNA fragments (˜167 base pairs) circulating in the blood, which can be isolated from the plasma component. In cancer subjects, some of these fragments come from cancer cells (i.e., circulating tumor DNA, or ctDNA), providing a window into the somatic, or acquired, mutations in their tumor(s).
Somatic mutation calling differs from germline mutation calling in that the fraction of DNA molecules harboring a mutation can vary widely due to tumor heterogeneity and chromosomal gains and losses. This challenge is compounded when trying to identify tumor mutations in cfDNA, as the fraction of tumor-derived DNA can be extremely low (˜0.1%). Consequently, the mutation fractions in cfDNA are often lower than those observed in tissue samples from the same subject and may approach the noise levels of next-generation sequencing workflows. This can make it impossible to distinguish true somatic mutations from artifacts. Effective somatic mutation calling from cfDNA, particularly for early-stage cancer subjects, requires suppressing errors introduced in sample preparation and sequencing.
One technique that has been developed for error suppression is the use of unique molecular indexes (UMIs), also known as molecular barcoding. Each DNA molecule is tagged with sequence adapters containing a specific sequence barcode (a UMI) to distinguish it from other molecules. As part of sample preparation, each molecule is copied multiple times, and each copy contains the same UMI. The techniques and methods discussed below identify all the copies of each molecule, group them together, and collapse them to derive a single consensus sequence in which sequencing errors are suppressed. Further, the consensus mutations are compared with consensus mutations identified in WBC sequence reads of the same subject. Any germline variants appearing in the consensus mutations associated with the cfDNA sequence reads can be removed, thereby providing an accurate list of identified hematopoietic variants. This reduces the errors associated with identification of mutations in cfDNA sequence reads, which in turn improves the accuracy and the confidence of the identified mutations in the cfDNA.
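The grouping and collapsing of UMI-tagged copies described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function name `collapse_by_umi`, the `(umi, start, sequence)` tuple layout, the assumption that all reads in a group have equal length, and the use of a simple per-position majority vote are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def collapse_by_umi(reads):
    """Group reads by (UMI, start position) and return one consensus per group.

    `reads` is an iterable of (umi, start, sequence) tuples. Reads that share
    a UMI and a start position are treated as copies of one original molecule.
    Assumption: all sequences within a group have the same length.
    """
    families = defaultdict(list)
    for umi, start, seq in reads:
        families[(umi, start)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        # Majority vote at each position across all copies of the molecule.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    ("AAC", 100, "ACGT"),
    ("AAC", 100, "ACGT"),
    ("AAC", 100, "ACTT"),  # third base differs in one copy: likely an error
    ("GGT", 100, "ACGA"),
]
print(collapse_by_umi(reads))
```

In this toy input the single discordant base in the `AAC` family is outvoted by the other two copies, while the `GGT` family, having only one copy, is passed through unchanged.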
Assay design and workflow for identification of mutations or variants in the cfDNA sequence reads is discussed below.
Assay Design
Sequence-specific DNA probes can be used to capture the desired regions of the genome for cfDNA analysis. Because one application of cfDNA analysis is to detect the presence of tumor-derived DNA, the captured regions have been selected to improve the probability that a given cancer has at least one mutation detectable by the assay.
Data from more than 20,000 tumors can be leveraged to select the most frequently mutated and the most clinically relevant protein-coding exons according to the following criteria.
1. Exons with at least one OncoKB Level 1-4 mutation in MSK-IMPACT 20 k. (OncoKB is a knowledgebase of the biological and clinical effects of tumor mutations, published in PMID 28890946. ‘MSK-IMPACT 20 k’ refers to the first 20,000 tumors sequenced using the MSK-IMPACT platform.)
2. Exons with at least 10 mutations at hotspot sites in MSK-IMPACT 20 k. (The list of hotspots is published in PMID 29247016.)
3. Exons with >30 mutations per Megabase in MSK-IMPACT 20 k.
4. All exons in protein kinase domains of selected druggable kinase genes (n=21).
5. All exons in frequently mutated tumor suppressor genes (n=25).
6. Additional exons and genes based on expert selection.
7. >160 microsatellite regions to detect the signature of microsatellite instability (‘MSI’).
Altogether, these exons can cover ˜230,000 base pairs and encompass part of 129 genes. Of the >20,000 subjects sequenced by MSK-IMPACT, 84% of cases have at least one mutation covered by this panel (including 94% of all breast cancers and 96% of all lung cancers).
While the above regions were included for the purpose of detecting somatic mutations with high sensitivity, probes have been designed for additional regions to detect other classes of genomic alterations, including:
1. Introns to detect structural variants that produce actionable gene fusions (in ALK, BRAF, EGFR, ETV6, FGFR2, FGFR3, MET, NTRK1, NTRK3, RET, ROS1).
2. Genes associated with clonal hematopoiesis to detect acquired mutations in blood cells.
3. >590 common SNPs to enable the characterization of genome-wide copy number profiles, identify changes in zygosity and copy number in key genes, and perform quality control (genetic fingerprinting and contamination detection).
These probes add another ˜171,000 base pairs. Because the regions in this second category do not require the same ultra-high level of coverage for error suppression and mutation calling, the capture probes have been mixed in unequal ratios. This allows sequencing to provide different levels of coverage and distribute sequence reads (and costs) efficiently.
Workflow
The workflow includes a wet lab process and a data processing process. The wet lab process includes collecting blood or body fluids (including, but not limited to, serum, plasma, sweat, tears, urine, saliva, synovial fluid, lymphatic fluid, ascites fluid, amniotic fluid, or interstitial fluid) from a cancer subject. Additionally or alternatively, in some embodiments, the subject suffers from or is at risk for ovarian cancer, breast cancer, colorectal cancer, lung cancer, prostate cancer, gastric cancer, pancreatic cancer, cervical cancer, liver cancer, bladder cancer, cancer of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, head and neck cancer, or brain cancer. The blood or body fluids can be processed to extract cfDNA using any method known in the art. For example, the blood of the subject can be subjected to two-spin centrifugation to isolate plasma and leukocytes (or white blood cells (WBC)). cfDNA is extracted from the non-cellular portion of the centrifuged body fluid. In addition, WBC DNA is extracted from the white blood cells. In instances where the cfDNA is extracted from non-blood body fluids, the WBC DNA can be extracted from a separate blood draw from the subject. The cfDNA and the WBC DNA are input to an assay. DNA adapters containing unique molecular indexes (UMIs) can be ligated or attached to the ends of the cfDNA and the WBC DNA.
cfDNAs and WBC DNAs associated with the same subject can be assigned unique sample barcodes. In this manner, subject specific analysis of the cfDNA and WBC DNA can be carried out. The process of adding sample barcodes to the cfDNA and the WBC DNA is known as multiplexing. This allows large numbers of libraries to be pooled and sequenced simultaneously during a single sequencing run. With multiplexed libraries, unique sample barcode sequences (see e.g.,
Many next-generation sequencing-based techniques rely upon a PCR amplification step to increase the concentration of the library generated from the DNA sample prior to sequencing. The amplification step is inherently biased: some sequences become overrepresented in the final library compared to their actual abundance within the DNA sample obtained from a subject. For this reason, PCR duplicates are generally identified and removed following alignment to the genome. In some next-generation sequencing-based techniques, the Picard software (Broad Institute, Cambridge, Mass.) is used to identify and remove PCR duplicates using their genomic coordinates.
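The idea of identifying PCR duplicates by genomic coordinates can be illustrated with the sketch below. It is a simplified stand-in and does not reproduce the Picard software or its interface; production tools apply more elaborate rules for defining a duplicate and for choosing which copy to keep. The function name and the dictionary keys `chrom`, `start`, `end`, and `strand` are hypothetical.

```python
def remove_coordinate_duplicates(alignments):
    """Keep the first alignment seen for each (chrom, start, end, strand).

    `alignments` is an iterable of dicts with the hypothetical keys
    'chrom', 'start', 'end', and 'strand'. Alignments sharing all four
    values are treated as PCR duplicates of one another.
    """
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln["chrom"], aln["start"], aln["end"], aln["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


alignments = [
    {"chrom": "chr7", "start": 1000, "end": 1167, "strand": "+", "name": "r1"},
    {"chrom": "chr7", "start": 1000, "end": 1167, "strand": "+", "name": "r2"},
    {"chrom": "chr7", "start": 1000, "end": 1167, "strand": "-", "name": "r3"},
    {"chrom": "chr7", "start": 1004, "end": 1171, "strand": "+", "name": "r4"},
]
print([a["name"] for a in remove_coordinate_duplicates(alignments)])
```

Coordinate-only deduplication discards distinct molecules that happen to share the same endpoints, which is one reason the UMI-based grouping described earlier is attractive for cfDNA.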
The PCR copies of the cfDNA and the WBC DNA can be used, as discussed below, for error suppression to produce highly accurate consensus sequences. The PCR copies can be provided to a next-generation (NG) sequencing device such as, for example, an Illumina sequencer, a Lymphotrac sequencer, an Ion Torrent sequencer, and a 454 pyro-sequencer. The NG sequencer can provide detailed chromosome analysis, and can employ techniques such as array comparative genomic hybridization (CGH), microarray, oligo array, single nucleotide polymorphism (SNP) array, whole genome array (WGA), and the like. The NG sequencer can provide raw genomic data to a genomic data processing system (such as the genomic data processing system 120,
Somatic allele fractions in cfDNA are often lower than those observed in tissue samples. Accurate somatic mutation calling at very low allele fractions (<0.1%) is challenging due to noise inherent in sample preparation procedures and Next Generation Sequencing. The techniques discussed herein can reduce noise levels below desired mutation detection levels.
The process 300 further includes identifying a first set of mutations in the sense strand cfDNA sequence reads and identifying a second set of mutations in the anti-sense strand cfDNA sequence reads (304).
The process 300 further includes identifying a first set of consensus mutations in the sense strand cfDNA sequence reads and a second set of consensus mutations in the anti-sense strand cfDNA sequence reads (306). The first set of consensus mutations includes mutations from the first set of mutations that appear at the same position in the respective sense strand cfDNA sequence reads. Similarly, the second set of consensus mutations includes mutations from the second set of mutations that appear at the same position in the respective anti-sense strand cfDNA sequence reads. For example,
The process 300 further includes identifying a third set of consensus mutations from the first set of consensus mutations, where each mutation in the third set of consensus mutations has a consistent mutation in the second set of consensus mutations (308). For example,
The process 300 further includes removing from the third set of consensus mutations associated with the cfDNA sequence reads those mutations that are also present in the WBC DNA sequence reads (e.g., in a third set of consensus mutations associated with the WBC DNA sequence reads) (310). For example, by removing the mutations in the third set of consensus mutations in the cfDNA sequence reads that are also present in the WBC DNA sequence reads, one can remove germline variants and identify clonal hematopoietic variants. After removal, the resulting set of mutations provides a more accurate list of cancer-derived mutations present in the cfDNA of the subject, thereby improving the accuracy of detection of disease in the subject. In some embodiments, the WBC DNA will not necessarily go through the same collapsing process as the cfDNA. Error suppression is not as critical for the control WBC DNA, since errors in the control do not lead to false positive mutation calls. In some embodiments, the process can sequence the WBC DNA to standard (not ultra-high) depth and can still use it to filter the cfDNA data.
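Steps 306 through 310 of the process 300 can be sketched with set operations, as shown below. This is an illustration under stated assumptions rather than the disclosed implementation: mutations are represented as hypothetical `(chrom, pos, ref, alt)` tuples already expressed relative to the same reference strand, a per-strand consensus mutation is taken to be one that appears in every read of the strand family (a stricter rule than may be used in practice), and "consistent" is taken to mean an identical tuple.

```python
def consensus_mutations(per_read_mutations):
    """Step 306 (assumed rule): mutations present in every read of a family."""
    sets = [set(m) for m in per_read_mutations]
    return set.intersection(*sets) if sets else set()


def filter_cfdna_mutations(sense_reads, antisense_reads, wbc_mutations):
    """Return cfDNA mutations supported on both strands and absent from WBC."""
    first = consensus_mutations(sense_reads)        # step 306, sense strand
    second = consensus_mutations(antisense_reads)   # step 306, anti-sense strand
    third = first & second                          # step 308: both strands agree
    return third - set(wbc_mutations)               # step 310: drop WBC matches


egfr = ("chr7", 55249071, "C", "T")
tp53 = ("chr17", 7577120, "C", "T")
sense = [{egfr, tp53, ("chr1", 100, "A", "G")}, {egfr, tp53}]
antisense = [{egfr, tp53, ("chr2", 5, "G", "A")}, {egfr, tp53}]
wbc = {tp53}  # also seen in white blood cells: germline or clonal hematopoiesis

print(filter_cfdna_mutations(sense, antisense, wbc))
```

In this example the two single-read, single-strand artifacts are discarded at steps 306 and 308, and the variant shared with the WBC DNA is removed at step 310, leaving only the first variant.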
In one or more embodiments, the process 300 also can include a polishing step, in which a large set of normal (non-cancer) cfDNA samples is sequenced using molecular barcoding and an error distribution is created from the artifacts observed in those samples at each genomic position. This allows attachment of a confidence value to the somatic mutations called in the cfDNA sequence reads. For example, cfDNA sequence reads from normal healthy donors (e.g., at least 10 individuals, equal distribution of gender) can be analyzed with the same assay to establish background error rates. The confidence values associated with the mutations can be further used to determine whether a mutation or a consensus mutation is a valid mutation or an artifact. The polishing step can further improve the accuracy of detecting mutations in the cfDNA sequence reads of the subject.
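One way such a confidence value could be computed is sketched below. The disclosure does not specify the statistical model, so the choices here are assumptions made for illustration: the background error rate at a position is pooled across the normal samples, and the confidence value is a one-sided binomial tail probability that the background alone would produce at least the observed number of variant-supporting reads. The function names are hypothetical.

```python
import math


def background_error_rate(normal_counts):
    """Pooled error rate at one position from (alt_reads, depth) per normal."""
    alt = sum(a for a, _ in normal_counts)
    depth = sum(d for _, d in normal_counts)
    return alt / depth if depth else 0.0


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    below = 0.0
    for i in range(k):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)


# Ten normal donors, each sequenced to 10,000x at this position.
normals = [(1, 10000)] * 5 + [(0, 10000)] * 5
rate = background_error_rate(normals)

# Subject sample: 10 variant-supporting reads out of 10,000 at the same position.
print(rate, binomial_tail(10, 10000, rate))
```

A small tail probability suggests the observation is unlikely to be background noise; the cutoff used to accept or reject a call would need to be chosen and validated separately.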
The process 300 also can include utilizing blacklists to further modify the final set of mutations identified in the cfDNA sequence reads. For example, recurrent errors seen in the cfDNA sequence reads of n or more (e.g., 2 or more) normal healthy donors can be added to a blacklist. Mutations in the final set of mutations associated with the cfDNA sequence reads of the subject that also appear in the blacklist can be removed from the final set, thereby further improving the accuracy of detecting mutations in the cfDNA sequence reads of the subject. The process 300 may also include removing mutations from the final set of mutations based on position-specific and class-specific error models.
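The blacklist construction and filtering described above can be sketched as follows. The function names, the `(chrom, pos, ref, alt)` tuple representation, and the default threshold of two donors (taken from the example value of n given above) are assumptions for illustration.

```python
from collections import Counter


def build_blacklist(normal_sample_calls, min_samples=2):
    """Collect calls that recur in at least `min_samples` normal donors.

    `normal_sample_calls` holds one collection of mutation tuples per donor.
    """
    counts = Counter()
    for calls in normal_sample_calls:
        counts.update(set(calls))  # count each call at most once per donor
    return {mutation for mutation, n in counts.items() if n >= min_samples}


def apply_blacklist(mutations, blacklist):
    """Return the subject's mutations that are not on the blacklist."""
    return [m for m in mutations if m not in blacklist]


artifact = ("chr12", 25398284, "C", "A")
normals = [
    [artifact],
    [artifact, ("chr3", 500, "G", "T")],
    [],
]
blacklist = build_blacklist(normals)
subject_calls = [artifact, ("chr7", 55249071, "C", "T")]
print(apply_blacklist(subject_calls, blacklist))
```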
In one or more embodiments, at least one identified mutation discussed above is in an exon of a cancer-related gene selected from the group consisting of:
AKT1, ALK, APC, AR, ARAF, ARID1A, ARID2, ATM, B2M, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CBFB, CCND1, CDH1, CDK4, CDKN2A, CIC, CREBBP, CTCF, CTNNB1, DICER1, DIS3, DNMT3A, EGFR, EIF1AX, EP300, ERBB2, ERBB3, ERCC2, ESR1, EZH2, FBXW7, FGFR1, FGFR2, FGFR3, FGFR4, FLT3, FOXA1, FOXL2, FOXO1, FUBP1, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HRAS, IDH1, IDH2, IKZF1, INPPL1, JAK1, KDM6A, KEAP1, KIT, KNSTRN, KRAS, MAP2K1, MAPK1, MAX, MED12, MET, MLH1, MSH2, MSH3, MSH6, MTOR, MYC, MYCN, MYD88, MYOD1, NF1, NFE2L2, NOTCH1, NRAS, NTRK1, NTRK2, NTRK3, NUP93, PAK7, PDGFRA, PIK3CA, PIK3CB, PIK3R1, PIK3R2, PMS2, POLE, PPP2R1A, PPP6C, PRKCI, PTCH1, PTEN, PTPN11, RAC1, RAF1, RB1, RET, RHOA, RIT1, ROS1, RRAS2, RXRA, SETD2, SF3B1, SMAD3, SMAD4, SMARCA4, SMARCB1, SOS1, SPOP, STAT3, STK11, STK19, TCF7L2, TERT, TGFBR1, TGFBR2, TP53, TP63, TSC1, TSC2, U2AF1, VHL, and XPO1.
In one or more embodiments, at least one identified mutation discussed above is in an intron of a cancer-related gene selected from the group consisting of: ALK, BRAF, EGFR, ETV6, FGFR2, FGFR3, MET, NTRK1, RET and ROS1 or a promoter region of TERT. In one or more embodiments, at least one mutation identified is in a microsatellite locus for microsatellite instability. In one or more embodiments, at least one mutation identified is in a cancer-related gene selected from the group consisting of: BRCA1/2, MLH1, MSH2, MSH6, PMS2. In one or more embodiments, at least one mutation identified is a deletion, an insertion, a translocation, an inversion, a copy number variant, or a point mutation.
The methods of the present disclosure include the use of dual index primers, which can significantly reduce the number of incorrectly assigned reads. See
The methods of the present disclosure are useful for early detection of cancer, monitoring disease progression and tumor burden, identifying clinically relevant alterations and mutational signatures, detecting minimal residual disease, as well as assessing subject responsiveness or acquired resistance to a particular therapy. In one aspect, the present disclosure provides a method for monitoring cancer progression in a subject comprising: detecting the presence of at least one mutation in a cancer-related gene in a cell-free DNA (cfDNA) sample obtained from the subject using any of the computer-implemented methods described herein. Cancer progression includes metastases to secondary organs, increases in tumor volume or tumor burden, or increased tumor proliferation. The methods of the present disclosure are useful for early detection of cancer. For example, in some embodiments, the subject lacks detectable tumors.
In another aspect, the present disclosure provides a method for determining the efficacy of a therapy in a subject suffering from cancer comprising: (a) administering the therapy to the subject; (b) detecting the presence of at least one mutation in a cancer-related gene in a first cell-free DNA (cfDNA) sample obtained from the subject using any of the computer-implemented methods described herein following administration of the therapy; and (c) determining that the therapy is effective when the first cfDNA sample shows a decrease in variant allele fraction compared to that observed in a control sample obtained from the subject prior to administration of the therapy. The control sample may be a cfDNA sample or a tumor sample. The therapy may include one or more of radiation therapy, chemotherapy, surgery, or immunotherapy. Examples of chemotherapeutic agents include, but are not limited to, abraxane, capecitabine, erlotinib, fluorouracil (5-FU), gemcitabine, irinotecan, leucovorin, nab-paclitaxel, cisplatin, docetaxel, oxaliplatin, tipifarnib, everolimus, sunitinib, dovitinib, ruxolitinib, pegylated-hyaluronidase, pemetrexed, folinic acid, paclitaxel, MK2206, GDC-0449, IPI-926, gamma secretase/RO4929097, M402, and LY293111. Examples of immunotherapeutic agents include, but are not limited to, immune checkpoint inhibitors (e.g., antibodies targeting CTLA-4, PD-1, PD-L1), ipilimumab, 90Y-Clivatuzumab tetraxetan, pembrolizumab, nivolumab, trastuzumab, cixutumumab, ganitumab, demcizumab, cetuximab, nimotuzumab, dalotuzumab, sipuleucel-T, CRS-207, and GVAX.
C. Computer-Implemented Method for Detecting Microsatellite Instability in Cell-Free DNA
Microsatellites are short, repeated sequences of DNA. Cancer cells that have defects in the DNA mismatch repair pathway accumulate errors at microsatellite regions when DNA is copied in the cell. Microsatellite instability (MSI) is a somatic genomic condition associated with impaired DNA mismatch repair (MMR) that leads to elevated mutation rates. MSI can arise sporadically in tumors due to somatic mutations in MMR-associated genes, or can arise due to the genetic condition known as Lynch Syndrome in which germline mutations in MMR-associated genes are inherited. MSI is observed in ˜2-5% of solid tumors.
The MSI signature (sporadic or inherited) is of particular clinical significance because it predicts responsiveness to immunotherapy. The immune checkpoint inhibitor pembrolizumab was approved by the FDA for all metastatic solid tumors with MSI or mismatch repair deficiency. Given the clinical significance and therapeutic relevance of MSI, it is critical that genomic profiling assays incorporate measurements of MSI. Moreover, there is evidence that MSI can be acquired later in cancer progression, so it is important to continue to monitor MSI over time.
MSI testing has traditionally been performed by PCR of 5-7 distinct ‘microsatellite’ sites throughout the genome. A similar condition ‘mismatch repair deficiency’ (MMR-d) is detected by immunohistochemistry for the proteins MLH1, MSH2, MSH6, and PMS2. Over the last few years, it has been established that MSI can be read out from next-generation sequencing of tumors using assays such as whole exome sequencing and MSK-IMPACT, a hybridization capture-based next-generation sequencing assay for targeted deep sequencing of all exons and selected introns of 341 key cancer genes in formalin-fixed, paraffin-embedded tumors (Cheng et al., J Mol Diagn. 17(3): 251-264 (2015)). Plasma cell-free DNA represents a non-invasive approach to longitudinally profile tumors. As most tumors that arise in subjects with Lynch Syndrome exhibit MSI, identification of MSI in nucleic acid (e.g., cfDNA) provides an opportunity for early detection of cancer in this high-risk population. However, while tumor sequencing is increasingly performed for MSI detection, the current methods typically fail when the tumor purity falls below ˜25%.
Standard NGS-based methods are expected to perform sub-optimally with respect to detecting MSI in nucleic acid (e.g., cfDNA) since the fraction of tumor-derived cfDNA in plasma is often 1% or lower, especially in early stage cancer. For example, MSIsensor is a C++ program that detects somatic microsatellite changes by computing length distributions of microsatellites per site (i.e., measures variable length insertions and deletions at microsatellite regions) in paired tumor and normal sequence data, and using these length distributions to statistically compare observed distributions in both samples. See Niu et al., Bioinformatics 30(7): 1015-1016 (2014). MSIsensor was used to detect MSI signatures in tumors that were sequenced by the NGS-based MSK-IMPACT panel, which screens >1,000 microsatellite regions in the human genome. As shown in
The data processing methods of the present disclosure are useful for detecting MSI during the early detection of cancer in subjects. Prior to detecting MSI, plasma cfDNA samples and matched white blood cell normal DNA samples are sequenced, and the corresponding sequence reads are processed using the methods described in Section B.
In some embodiments, the nucleic acid (e.g., cfDNA) sequence reads are derived from samples obtained from subjects that have an elevated risk for developing cancer, for example Lynch Syndrome subject samples. The nucleic acid (e.g., cfDNA) sequence reads derived from Lynch Syndrome subject samples may include protein-coding exons of mismatch repair genes (MSH2, MSH6, MLH1, PMS2), SNPs near the mismatch repair genes (useful in detecting allele-specific copy number (zygosity) changes), and/or at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 100, at least 110, at least 120, at least 130, at least 140, at least 150, at least 160, at least 170, at least 180, at least 190, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, or at least 1000 microsatellite regions within the human genome. See e.g., Arzimanoglou et al., Cancer 82(10):1808-20 (1998); Dahiya et al., Int J Cancer. 72(5):762-7 (1997). In certain embodiments, the subject suffers from, or is suspected of having Lynch Syndrome, and/or harbors at least one mutation in one or more mismatch repair genes selected from the group consisting of MSH2, MSH6, MLH1, and PMS2. Additionally, or alternatively, in some embodiments, the subject suffers from or is at risk for ovarian cancer, breast cancer, colorectal cancer, lung cancer, prostate cancer, gastric cancer, pancreatic cancer, cervical cancer, liver cancer, bladder cancer, cancer of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, head and neck cancer, or brain cancer.
Additionally, or alternatively, in some embodiments, the method further comprises determining the presence of at least one mutation in an exon of a cancer-related gene selected from the group consisting of:
AKT1, ALK, APC, AR, ARAF, ARID1A, ARID2, ATM, B2M, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CBFB, CCND1, CDH1, CDK4, CDKN2A, CIC, CREBBP, CTCF, CTNNB1, DICER1, DIS3, DNMT3A, EGFR, EIF1AX, EP300, ERBB2, ERBB3, ERCC2, ESR1, EZH2, FBXW7, FGFR1, FGFR2, FGFR3, FGFR4, FLT3, FOXA1, FOXL2, FOXO1, FUBP1, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HRAS, IDH1, IDH2, IKZF1, INPPL1, JAK1, KDM6A, KEAP1, KIT, KNSTRN, KRAS, MAP2K1, MAPK1, MAX, MED12, MET, MLH1, MSH2, MSH3, MSH6, MTOR, MYC, MYCN, MYD88, MYOD1, NF1, NFE2L2, NOTCH1, NRAS, NTRK1, NTRK2, NTRK3, NUP93, PAK7, PDGFRA, PIK3CA, PIK3CB, PIK3R1, PIK3R2, PMS2, POLE, PPP2R1A, PPP6C, PRKCI, PTCH1, PTEN, PTPN11, RAC1, RAF1, RB1, RET, RHOA, RIT1, ROS1, RRAS2, RXRA, SETD2, SF3B1, SMAD3, SMAD4, SMARCA4, SMARCB1, SOS1, SPOP, STAT3, STK11, STK19, TCF7L2, TERT, TGFBR1, TGFBR2, TP53, TP63, TSC1, TSC2, U2AF1, VHL, and XPO1.
The at least one mutation may be a deletion, an insertion, a translocation, an inversion, a copy number variant, or a point mutation. Additionally, or alternatively, in some embodiments, the method further comprises determining the presence of at least one genomic alteration in an intron of a cancer-related gene selected from the group consisting of: ALK, BRAF, EGFR, ETV6, FGFR2, FGFR3, MET, NTRK1, RET and ROS1 or a promoter region of TERT. The cfDNA sample may be serum, plasma, sweat, tears, urine, saliva, synovial fluid, lymphatic fluid, ascites fluid, amniotic fluid, or interstitial fluid.
In another aspect, the present disclosure provides a method for monitoring cancer progression in a subject comprising: detecting the presence of microsatellite instability in a nucleic acid (e.g., cfDNA) sample obtained from the subject using any of the computer-implemented methods described herein. Cancer progression includes metastases to secondary organs, increases in tumor volume or tumor burden, or increased tumor proliferation. The methods of the present disclosure are useful for early detection of cancer. For example, in some embodiments, the cfDNA sample does not comprise a mutation or genomic alteration in any cancer-related gene described herein. Additionally or alternatively, in some embodiments, the subject lacks detectable tumors.
In one aspect, the present disclosure provides a method for determining the efficacy of a therapy in a subject with an MSI-High tumor comprising: (a) administering the therapy to the subject; (b) detecting the presence of microsatellite instability in a first nucleic acid (e.g., cfDNA) sample obtained from the subject using any of the computer-implemented methods described herein following administration of the therapy; and (c) determining that the therapy is effective when the first nucleic acid (e.g., cfDNA) sample shows a shift towards a distance metric that is associated with microsatellite stability (MSS) compared to that observed in a control sample obtained from the subject prior to administration of the therapy. The control sample may be a nucleic acid (e.g., cfDNA) sample or a tumor sample. The therapy may include one or more of radiation therapy, chemotherapy, surgery, or immunotherapy. Examples of chemotherapeutic agents include, but are not limited to, abraxane, capecitabine, erlotinib, fluorouracil (5-FU), gemcitabine, irinotecan, leucovorin, nab-paclitaxel, cisplatin, docetaxel, oxaliplatin, tipifarnib, everolimus, sunitinib, dovitinib, ruxolitinib, pegylated-hyaluronidase, pemetrexed, folinic acid, paclitaxel, MK2206, GDC-0449, IPI-926, gamma secretase/RO4929097, M402, and LY293111. Examples of immunotherapeutic agents include, but are not limited to, immune checkpoint inhibitors (e.g., antibodies targeting CTLA-4, PD-1, PD-L1), ipilimumab, 90Y-Clivatuzumab tetraxetan, pembrolizumab, nivolumab, trastuzumab, cixutumumab, ganitumab, demcizumab, cetuximab, nimotuzumab, dalotuzumab, sipuleucel-T, CRS-207, and GVAX.
Microsatellite regions are some of the most error-prone sites in the genome. These Examples demonstrate that the ultra-high depth sequencing and UMI-based error-suppression achieved using the methods described in Section B and Section C significantly improved the sensitivity for detecting MSI.
Based on a reanalysis of >20,000 tumors sequenced by the MSK-IMPACT assay, a small subset of 165 (out of >1,000) of the most frequently mutated microsatellite regions were selected. MSI Score is based on an analysis that looks for DNA slippage (variable length insertions and deletions) at microsatellite regions. The score reflects the % of microsatellite regions with significantly more insertions/deletions in a tumor sample compared to a matched normal sample. The existing form of MSIsensor was used to detect the presence of MSI in nucleic acid (e.g., cfDNA) samples. As shown in
Plasma cfDNA samples and matched white blood cell normal DNA samples were deep-sequenced, and the corresponding sequence reads were processed using the methods described in Section B. The MSI detection algorithm disclosed herein directly compares the number of individual sequence reads observed for every possible allele (1 to N) at each of the 165 microsatellite sites. A vector of length N (upper limit was set as the largest possible read length) was created for each microsatellite site, and a distance metric was computed between plasma cfDNA and matched WBC samples after a per-sample, per-locus normalization was carried out. See
The process 1300 can select a microsatellite locus from a plurality of microsatellite loci for further processing of the sequence reads. For example, the process 1300 can include, for each microsatellite locus, identifying a first subset of cfDNA sequence reads and a second subset of WBC-derived sequence reads corresponding to the microsatellite locus. Thus, both the first subset and the second subset include sequence reads that correspond to the same microsatellite locus.
The process 1300 includes identifying from the first subset and the second subset, a set of alleles, each allele of the set of alleles having a distinct sequence (1306). One example set of alleles is shown in
The process 1300 includes determining, for each allele of the set of alleles, a number of cfDNA sequence reads and a number of WBC-derived sequence reads that include the allele (1308). For example, for Allele 1, the one or more processors can determine the number of cfDNA sequence reads in the first subset that include Allele 1. Similarly, for Allele 1, the one or more processors can determine the number of WBC-derived sequence reads in the second subset that include Allele 1. In a similar manner, the one or more processors can determine the number of sequence reads in each of the first and second subsets that include each allele in the set of alleles. Generally, the one or more processors can determine a number $h_{t,i}$ denoting a number of cfDNA sequence reads corresponding to an Allele i, and can determine a number $h_{n,i}$ denoting a number of WBC-derived sequence reads corresponding to the Allele i.
In some instances, the one or more processors can normalize the number of cfDNA sequence reads and the number of WBC-derived sequence reads. For example, the one or more processors can determine a normalized value $hn_{t,i}$ by dividing the value $h_{t,i}$ by a sum of the number of cfDNA sequence reads for all alleles ($\sum_i h_{t,i}$). Similarly, the one or more processors can determine a normalized value $hn_{n,i}$ by dividing the value $h_{n,i}$ by the sum of the number of WBC-derived sequence reads for all alleles ($\sum_i h_{n,i}$).
The process 1300 further includes determining, by the one or more processors, an absolute difference based on a difference between the number of cfDNA sequence reads for the allele and the number of WBC-derived sequence reads for the allele (1310). In particular, the one or more processors can, for each allele i, determine an absolute difference $a_i$ between the corresponding number ($h_{t,i}$) of cfDNA sequence reads for that allele and the number ($h_{n,i}$) of WBC-derived sequence reads for that allele. Thus, the absolute difference $a_i$ can be determined based on: $|h_{t,i} - h_{n,i}|$. In some instances, the absolute difference $a_i$ can be determined based on the normalized values. For example, the absolute difference $a_i$ can be determined based on: $|hn_{t,i} - hn_{n,i}|$.
The process 1300 includes determining, for each microsatellite locus from the plurality of microsatellite loci, a distance based on a sum of absolute differences associated with all alleles in the set of alleles (1310). As mentioned above, the set of alleles is associated with a microsatellite locus. To determine the distance, the one or more processors can add the absolute differences $a_i$ associated with all alleles. In particular, the one or more processors can determine a distance d for a microsatellite locus based on $\sum_i a_i$. Assuming that there are m microsatellite loci, the one or more processors can determine m distance values, one for each microsatellite locus. For example, the one or more processors can determine distances $d_1, d_2, d_3, \ldots, d_m$ corresponding to the m microsatellite loci.
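As a minimal, non-limiting sketch, the per-locus computation described above (allele counting, optional normalization, absolute differences, and their sum) can be expressed as follows. The function name `locus_distance` and the representation of each read as the allele sequence it carries at the locus are assumptions made for illustration.

```python
from collections import Counter

def locus_distance(cfdna_alleles, wbc_alleles, normalize=True):
    """Sum of absolute per-allele differences between cfDNA and WBC read counts.

    cfdna_alleles / wbc_alleles: one entry per read, giving the allele
    (e.g., the repeat sequence) that the read carries at this locus.
    """
    h_t = Counter(cfdna_alleles)   # cfDNA read count per allele
    h_n = Counter(wbc_alleles)     # WBC read count per allele
    t_total = sum(h_t.values())
    n_total = sum(h_n.values())
    if normalize and (t_total == 0 or n_total == 0):
        raise ValueError("cannot normalize a locus with no reads in one sample")

    distance = 0.0
    for allele in set(h_t) | set(h_n):   # union of alleles seen in either sample
        t = h_t[allele] / t_total if normalize else h_t[allele]
        n = h_n[allele] / n_total if normalize else h_n[allele]
        distance += abs(t - n)           # absolute difference for this allele
    return distance

# Hypothetical locus: WBC reads all carry a 10-unit repeat,
# while 2 of 10 cfDNA reads carry a 9-unit contraction.
cfdna = ["A" * 10] * 8 + ["A" * 9] * 2
wbc = ["A" * 10] * 10
d_norm = locus_distance(cfdna, wbc)                  # uses normalized counts
d_raw = locus_distance(cfdna, wbc, normalize=False)  # uses raw counts
```

With normalized counts the distance is bounded between 0 and 2 regardless of sequencing depth, which makes distances comparable across loci and samples.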
The process 1300 also includes generating, by the one or more processors, a first distribution indicating a number of microsatellite loci having distances within a group of distinct distance intervals (1312). The one or more processors can generate a frequency distribution of the distance values over a group of distance intervals. Example distributions are shown in
The process 1300 includes generating, by the one or more processors, a second distribution indicating a number of microsatellite loci having distances within the group of distinct distance intervals, where the second distribution is derived from distances associated with each microsatellite locus observed in a reference sample (1312). In particular, the reference samples can include cfDNA sequence reads and WBC-derived sequence reads from a reference subject. The process discussed above for determining the distance values for the microsatellite loci in samples associated with the subject can be similarly applied to the samples from the reference subject to determine the second distribution. Example second distributions associated with the reference samples are shown in
The process 1300 includes determining, by the one or more processors, that a number of microsatellite loci in the first distribution above a threshold value is greater than a number of microsatellite loci in the second distribution above the threshold value to detect the presence of microsatellite instability (1314). For example, referring to
In some instances, the one or more processors can adopt other methods to detect the presence of microsatellite instability from the first and the second distribution. In one example, the one or more processors use a Z-test statistic to compare the first distribution to the second distribution, and detect the presence of microsatellite instability if the score of the Z-test is above a threshold value. A larger score can indicate that the first distribution, which is associated with the subject, is different from the second distribution, which is associated with a reference subject.
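For illustration only, the distribution-based comparisons described above can be sketched as follows. The sketch bins per-locus distances into intervals, compares the number of loci above a threshold in the subject and reference samples, and computes a two-proportion Z statistic on the fraction of loci above the threshold. The bin width, the threshold, the function names, and the choice of a two-proportion Z-test are assumptions; the disclosure does not specify the form of the Z-test.

```python
import math

def histogram(distances, bin_width=0.1, n_bins=20):
    """Count loci per distance interval; values past the last bin are clamped into it."""
    counts = [0] * n_bins
    for d in distances:
        counts[min(int(d / bin_width), n_bins - 1)] += 1
    return counts

def count_above(distances, threshold):
    """Number of loci whose distance exceeds the threshold."""
    return sum(1 for d in distances if d > threshold)

def msi_by_threshold(subject, reference, threshold):
    """Call MSI when the subject has more loci above the threshold than the reference."""
    return count_above(subject, threshold) > count_above(reference, threshold)

def two_proportion_z(subject, reference, threshold):
    """Z statistic comparing the fraction of loci above the threshold in each sample."""
    n1, n2 = len(subject), len(reference)
    x1, x2 = count_above(subject, threshold), count_above(reference, threshold)
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (x1 / n1 - x2 / n2) / se

# Hypothetical per-locus distances for a subject and a reference sample.
subject = [0.05] * 5 + [0.5] * 5
reference = [0.05] * 9 + [0.5] * 1
call = msi_by_threshold(subject, reference, threshold=0.2)
z = two_proportion_z(subject, reference, threshold=0.2)
```

The threshold comparison and the Z statistic operate on the same per-locus distances, so either can be substituted for the other without changing the upstream processing.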
In some examples, the one or more processors can adopt machine learning techniques to detect the presence of microsatellite instability. For example, the one or more processors can utilize a classifier, such as, for example, a support vector machine (SVM), to determine whether the first distribution can be classified as having microsatellite instability. The classifier can be trained with data that is labeled with either the presence or lack of microsatellite instability. The classifier can build a model based on that data. Based on the model, the classifier can determine whether the first distribution can be classified as having the presence of microsatellite instability or no presence of microsatellite instability. The SVM is a non-probabilistic binary (linear or non-linear) classifier where examples are mapped onto a space such that examples of separate categories are divided by a clear gap that is as wide as possible. A new example, such as the first distribution, can be mapped onto the same space and predicted as belonging to the presence or no presence of microsatellite instability. The one or more processors feed data to an SVM to enable classification. The data can include, for example, distributions that indicate the presence of microsatellite instability and distributions that indicate no presence of microsatellite instability. The SVM can construct a hyperplane in a multi-dimensional space, which can be used for classification or regression. In some examples, the one or more processors can utilize other types of classifiers such as, for example, linear classifiers, quadratic classifiers, kernel estimators, neural networks, learning vector quantization, etc., to classify the first distribution as having microsatellite instability or not having microsatellite instability.
The process 1300 can further include storing, in one or more data structures, an association between the subject and the presence of microsatellite instability. For example, the one or more processors can store a data structure similar to that shown in
Results. The MSI detection model (Allelic Distance-based Microsatellite Instability Estimator or ADMIE) was trained using MSK-IMPACT results from 311 tumor tissue samples with confirmatory immunohistochemistry or PCR to establish the MSI status. Computed allelic distances were used to predict MSI/MSS status for a ‘held-out’ test set of MSK-IMPACT data from over 26,000 tumor tissues (
These results demonstrate that the data processing methods and systems disclosed herein are useful for detecting cancer-related mutations and microsatellite instability in cell-free DNA (cfDNA) sequence data with a high degree of accuracy and sensitivity.
The term “adapter” refers to a short, chemically synthesized, nucleic acid sequence which can be used to ligate to the end of a nucleic acid sequence in order to facilitate attachment to another molecule. The adapter can be single-stranded or double-stranded. An adapter can incorporate a short (typically less than 50 base pairs) sequence useful for PCR amplification or sequencing. In some embodiments, the adapter includes a unique molecular identifier.
The term “hold out” in the context of machine learning refers to splitting up a dataset into a ‘training set’ and ‘test set’. The training set is used to train a model, and the test set is used to see how well that model performs on unseen data.
The terms “variant allele fraction,” “VAF,” “mutant allele fraction” or “MAF” refer to the fraction of mutant (alternate) alleles over the total number of mutant (alternate) plus wild-type (reference) alleles.
“Unique molecular identifiers” or “UMIs” are random nucleotide sequences used to tag each DNA molecule (fragment) prior to library amplification, thereby aiding in the identification of PCR duplicates. If two reads align to the same location and have the same UMI, it is highly likely that they are PCR duplicates originating from the same DNA molecule prior to amplification. As a result, all sequence reads with identical genomic coordinates and UMIs can be collapsed into a single representative read, which is useful for obtaining an accurate estimate of the relative concentration of the DNA molecules in the DNA sample.
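As a non-limiting illustration, collapsing reads that share genomic coordinates and a UMI can be sketched as follows. The tuple layout of a read and the choice of the most frequent sequence as the representative read are assumptions made for this example; other consensus rules may be used.

```python
from collections import Counter, defaultdict

def collapse_by_umi(reads):
    """Collapse reads sharing (chrom, start, umi) into one representative sequence.

    reads: iterable of (chrom, start, umi, sequence) tuples (hypothetical layout).
    Returns a dict mapping (chrom, start, umi) -> representative sequence.
    """
    families = defaultdict(list)
    for chrom, start, umi, seq in reads:
        families[(chrom, start, umi)].append(seq)

    collapsed = {}
    for key, seqs in families.items():
        # Use the most frequent sequence in the family as its representative.
        collapsed[key] = Counter(seqs).most_common(1)[0][0]
    return collapsed

# Hypothetical reads: three PCR duplicates (one with an error) and one distinct molecule.
reads = [
    ("chr2", 1000, "ACGT", "TTAGGC"),
    ("chr2", 1000, "ACGT", "TTAGGC"),
    ("chr2", 1000, "ACGT", "TTAGGA"),  # likely a PCR or sequencing error
    ("chr2", 1000, "GGCA", "TTAGGC"),  # same position, different UMI
]
collapsed = collapse_by_umi(reads)
```

In this example the four reads collapse into two molecules, and the discordant base in the first family is outvoted by its duplicates.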
The term “plurality of first DNA reads” refers to DNA sequence reads that are derived from the first oligonucleotide strand (e.g., sense strand) of a double-stranded DNA molecule. In some embodiments, the plurality of first DNA reads originate from cfDNA or white blood cells (WBC).
The term “plurality of second DNA reads” refers to DNA sequence reads that are derived from the second oligonucleotide strand (e.g., anti-sense strand) of a double-stranded DNA molecule. The plurality of second DNA reads may be at least partially or completely complementary to the plurality of first DNA reads (e.g., at least 70%, 75%, 80%, 85%, 90%, or 95% complementary). In some embodiments, the plurality of second DNA reads originate from cfDNA or white blood cells (WBC). The term “white blood cells” or “WBC” refers to blood cells that are colorless, lack hemoglobin, contain a nucleus, and include lymphocytes, monocytes, neutrophils, eosinophils, and basophils.
The terms “complementary” or “complementarity” as used herein with reference to polynucleotides (i.e., a sequence of nucleotides such as an oligonucleotide or a target nucleic acid) refer to the base-pairing rules. The complement of a nucleic acid sequence as used herein refers to an oligonucleotide which, when aligned with the nucleic acid sequence such that the 5′ end of one sequence is paired with the 3′ end of the other, is in “antiparallel association.” For example, the sequence “5′-A-G-T-3′” is complementary to the sequence “3′-T-C-A-5′.” Complementarity need not be perfect; stable duplexes may contain mismatched base pairs, degenerative, or unmatched bases. Those skilled in the art of nucleic acid technology can determine duplex stability empirically considering a number of variables including, for example, the length of the oligonucleotide, base composition and sequence of the oligonucleotide, ionic strength and incidence of mismatched base pairs.
“Coverage” or “depth” as used herein refers to the number of reads that align to, or “cover,” known reference bases. The next-generation sequencing (NGS) coverage level often determines whether variant discovery can be made with a certain degree of confidence at particular base positions.
“Next-generation sequencing” or “NGS” as used herein, refers to any sequencing method that determines the nucleotide sequence of either individual nucleic acid molecules (e.g., in single molecule sequencing) or clonally expanded proxies for individual nucleic acid molecules in a high throughput parallel fashion (e.g., greater than $10^3$, $10^4$, $10^5$ or more molecules are sequenced simultaneously). In one embodiment, the relative abundance of the nucleic acid species in the library can be estimated by counting the relative number of occurrences of their cognate sequences in the data generated by the sequencing experiment. Next generation sequencing methods are known in the art. Examples of next generation sequencing techniques include, but are not limited to, pyrosequencing, reversible dye-terminator sequencing, SOLiD sequencing, ion semiconductor sequencing, sequencing by synthesis (SBS), Helioscope single molecule sequencing, etc. Next generation sequencing methods can be performed using commercially available kits and instruments such as the Life Technologies/Ion Torrent PGM or Proton, the Illumina HiSeq or MiSeq, and the Roche/454 next generation sequencing system.
As used herein, “oligonucleotide” refers to a molecule that has a sequence of nucleic acid bases on a backbone comprised mainly of identical monomer units at defined intervals. The bases are arranged on the backbone in such a way that they can bind with a nucleic acid having a sequence of bases that are complementary to the bases of the oligonucleotide. The most common oligonucleotides have a backbone of sugar phosphate units. A distinction may be made between oligodeoxyribonucleotides that do not have a hydroxyl group at the 2′ position and oligoribonucleotides that have a hydroxyl group at the 2′ position. Oligonucleotides of the method which function as primers or probes are generally at least about 10-15 nucleotides long and more preferably at least about 15 to 35 nucleotides long, although shorter or longer oligonucleotides may be used in the method. The exact size will depend on many factors, which in turn depend on the ultimate function or use of the oligonucleotide.
As used herein, a “sample” refers to a substance that is being assayed for the presence of a mutation in cfDNA, e.g., ctDNA. Processing methods to release or otherwise make available a nucleic acid for detection are well known in the art and may include steps of nucleic acid manipulation. A sample may be a body fluid. In some cases, a biological sample may consist of or comprise serum, plasma, sweat, tears, urine, saliva, synovial fluid, lymphatic fluid, ascites fluid, amniotic fluid, interstitial fluid, cerebrospinal fluid, and the like.
This application claims the benefit of and priority to U.S. provisional Patent Application No. 62/658,489, filed on Apr. 16, 2018, the contents of which are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/027487 | 4/15/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62658489 | Apr 2018 | US |