This disclosure relates to systems and methods for distinguishing somatic genomic sequences from germline genomic sequences.
Germline genomic sequences refer to those sequences that an organism inherits from its parents. In particular, if one or both of an organism's parents have certain genomic mutations (or if the organism experiences certain mutations in its very early development), those mutations may be germline to the organism, and will be passed to the organism's offspring (if any).
By contrast, somatic genomic sequences are sequences that are not passed from parent to child. For example, an organism may develop genomic mutations due to external factors (e.g., pollution, radiation, diet, smoking, etc.), with those genomic mutations being limited to certain tissues, fluids, or other anatomical material. In some cases, those mutations result in undesirable medical conditions including, but not limited to, cancer.
Precision medicine is a field in which a patient is treated with a therapy that is targeted to the individual characteristics of the patient or their condition. For many patients (including cancer patients), this may involve determining genomic information about both the patient's “normal” genomic state and the genomic state of the patient's “abnormal” tissue, fluid, or other anatomical material. This information may be derived from a sample from the patient, such as a tumor biopsy, a blood draw, or some other type of sample having both normal and abnormal tissue, fluid, or other anatomical material.
These samples may be assayed to determine (at least in part) the genomic sequences of the material contained therein. However, it is sometimes challenging to identify whether a particular genomic sequence comes from the patient's normal anatomical material or whether it comes from abnormal anatomical material; i.e., it is sometimes challenging to determine whether a particular genomic sequence is germline or somatic.
Understanding whether a genetic variant observed in the DNA of a cancer patient is of germline or somatic origin is critically important both in clinical practice and in cancer research. The somatic/germline distinction can be made, for example, by sequencing matched tumor and normal tissue from the same patient. Variants present in tumor but not in normal tissue are classified as somatic, whereas those present in both are classified as germline. However, such a dual-sample approach is constrained by cost as well as specimen availability. In clinical practice, matched normal specimens are typically not obtained. For example, in the case of a tissue biopsy, a single specimen containing both the tumor and its adjacent normal tissue is collected. Thus, there is a need to develop methods that can reliably classify detected variants as somatic or germline in origin using a single specimen.
Methods, devices, and computer readable media for distinguishing somatic genomic sequences from germline genomic sequences are described herein.
Disclosed herein are methods of identifying a genomic sequence of interest as germline or somatic, the methods comprising: providing a plurality of nucleic acid molecules obtained from a sample from a subject, wherein the plurality of nucleic acid molecules comprises a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; optionally, ligating one or more adapters onto one or more nucleic acid molecules from the plurality of nucleic acid molecules; amplifying nucleic acid molecules from the plurality of nucleic acid molecules; capturing nucleic acid molecules from the amplified nucleic acid molecules, wherein the captured nucleic acid molecules are captured from the amplified nucleic acid molecules by hybridization to one or more bait molecules; sequencing, by a sequencer, the captured nucleic acid molecules to obtain a plurality of sequence reads corresponding to one or more genomic loci; selecting, by one or more processors, a genomic sequence of interest at a genomic locus from the one or more genomic loci; selecting, by the one or more processors, one or more proxy genomic sequences for the genomic sequence of interest; determining, by the one or more processors, an allele frequency distance using an observed allele frequency of the genomic sequence of interest and a summary statistic or distribution indicative of observed allele frequencies of the one or more proxy genomic sequences; and identifying, by the one or more processors, the genomic sequence of interest as germline or somatic using the allele frequency distance.
In some embodiments, the subject is a cancer patient. In some embodiments, the sample comprises a tissue biopsy sample, a liquid biopsy sample, a circulating tumor cell (CTC) sample, a cell-free DNA (cfDNA) sample, or a normal control. In some embodiments, the sample is a liquid biopsy sample and comprises blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the tumor nucleic acid molecules are derived from a tumor portion of a heterogeneous tissue biopsy sample, and the non-tumor nucleic acid molecules are derived from a normal portion of the heterogeneous tissue biopsy sample. In some embodiments, the tumor nucleic acid molecules are derived from a circulating tumor DNA (ctDNA) fraction of a cell-free DNA sample, and the non-tumor nucleic acid molecules are derived from a non-tumor fraction of the cell-free DNA sample. In some embodiments, the one or more adapters comprise amplification primers or sequencing adapters. In some embodiments, the one or more bait molecules comprise one or more nucleic acid molecules, each comprising a region that is complementary to a region of a captured nucleic acid molecule. In some embodiments, amplifying nucleic acid molecules comprises performing a polymerase chain reaction (PCR) or isothermal amplification technique. In some embodiments, the sequencing comprises use of a next generation sequencing (NGS) technique. In some embodiments, the sequencer comprises a next generation sequencer. In some embodiments, the one or more proxy genomic sequences are located within a defined segment of the subject's genomic sequence, and the selected genomic sequence of interest is located within the same defined segment. In some embodiments, the subject's genomic sequence is segmented into a plurality of segments based on copy number uniformity within each segment. In some embodiments, the summary statistic is a mean allele frequency or a median allele frequency. 
In some embodiments, the allele frequency distance is determined using the observed allele frequency of the genomic sequence of interest and the distribution indicative of the observed frequencies of a plurality of proxy genomic sequences, and wherein the genomic sequence of interest is identified as germline or somatic based on a probability that the observed allele frequency of the genomic sequence of interest fits within or does not fit within the distribution.
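By way of a non-limiting illustration, the distribution-based comparison described above may be sketched as follows. The sketch assumes that the proxy allele frequencies are approximately normally distributed and have already been placed on a comparable scale; the function name `fit_probability` and the normal approximation are assumptions made for illustration and are not specified by this disclosure.

```python
from statistics import NormalDist, mean, stdev


def fit_probability(variant_af, proxy_afs):
    """Two-sided probability that a value drawn from the proxy distribution
    lies at least as far from the proxy mean as variant_af does.

    Assumption (illustrative only): proxy allele frequencies are modeled
    with a normal distribution fitted to the observed proxy values.
    """
    if len(proxy_afs) < 2:
        raise ValueError("at least two proxy allele frequencies are required")
    mu = mean(proxy_afs)
    sigma = stdev(proxy_afs)
    if sigma == 0:
        # Degenerate distribution: only an exact match fits.
        return 1.0 if variant_af == mu else 0.0
    z = abs(variant_af - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(z))


# A small probability suggests the variant does not fit the proxy
# distribution; a large probability suggests that it does.
proxy_afs = [0.48, 0.50, 0.52]
p_close = fit_probability(0.50, proxy_afs)
p_far = fit_probability(0.20, proxy_afs)
```

In this sketch, a variant whose allele frequency fits the distribution of nearby germline proxies would be a candidate germline variant, whereas a poor fit would point toward a somatic origin. Any cutoff applied to the probability would need to be chosen and validated separately.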
In some embodiments, a method of identifying a genomic sequence of interest as germline or somatic includes: selecting, by one or more processors, a genomic sequence of interest at a genomic locus from within a patient genomic sequence obtained for a patient sample comprising a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting, by the one or more processors, one or more proxy genomic sequences for the genomic sequence of interest; determining, by the one or more processors, an allele frequency distance using an observed allele frequency of the genomic sequence of interest and a summary statistic or distribution indicative of observed allele frequencies of the one or more proxy genomic sequences; and identifying (e.g., classifying), by the one or more processors, the genomic sequence of interest as germline or somatic using the allele frequency distance.
In some embodiments of the method, the summary statistic is a mean allele frequency or a median allele frequency. In some embodiments, the allele frequency distance is determined using the observed allele frequency of the genomic sequence of interest and the distribution indicative of the observed frequencies of a plurality of proxy genomic sequences, and wherein the genomic sequence of interest is identified as germline or somatic based on a probability that the observed allele frequency of the genomic sequence of interest fits within or does not fit within the distribution.
In some embodiments, the tumor nucleic acid molecules and the non-tumor nucleic acid molecules comprise DNA molecules. In some embodiments, the tumor nucleic acid molecules and the non-tumor nucleic acid molecules comprise RNA molecules.
In some embodiments, the method further includes sequencing the tumor nucleic acid molecules and the non-tumor nucleic acid molecules from the patient sample to determine the patient genomic sequence. In some embodiments, the patient genomic sequence is obtained or determined using a next generation sequencing technique. In some embodiments, the sequencer is a next generation sequencer.
In some embodiments of the method, the one or more proxy genomic sequences are located within a defined segment of the patient genomic sequence, and the selected genomic sequence of interest is located within the same defined segment. In some embodiments, the patient genomic sequence is segmented into a plurality of segments based on copy number uniformity within each segment. In some embodiments, the method comprises segmenting the patient genomic sequence into a plurality of segments.
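As a non-limiting illustration, selecting proxy genomic sequences from the same defined segment as the genomic sequence of interest may be sketched as follows. The `Segment` and `Snp` record types, the half-open coordinate convention, and the function names are hypothetical and are introduced only for this example.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive (half-open interval, an assumption of this sketch)


class Snp(NamedTuple):
    chrom: str
    pos: int
    af: float   # observed allele frequency


def find_segment(segments: List[Segment], chrom: str, pos: int) -> Optional[Segment]:
    """Return the segment containing the given locus, or None if no segment does."""
    for seg in segments:
        if seg.chrom == chrom and seg.start <= pos < seg.end:
            return seg
    return None


def proxies_in_same_segment(segments: List[Segment], snps: List[Snp],
                            chrom: str, pos: int) -> List[Snp]:
    """Return SNPs located on the same segment as the locus of interest,
    excluding any SNP at the locus itself."""
    seg = find_segment(segments, chrom, pos)
    if seg is None:
        return []
    return [s for s in snps
            if s.chrom == seg.chrom and seg.start <= s.pos < seg.end and s.pos != pos]


# Placeholder coordinates and frequencies, for illustration only.
segments = [Segment("chr1", 0, 1000), Segment("chr1", 1000, 2000)]
snps = [Snp("chr1", 100, 0.50), Snp("chr1", 900, 0.48), Snp("chr1", 1500, 0.30)]
proxies = proxies_in_same_segment(segments, snps, "chr1", 500)
```

Restricting proxies to the same segment keeps the comparison among loci that share a copy number state, so that differences in allele frequency are less likely to be explained by copy number alone.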
In some embodiments of the method, the patient genomic sequence is determined using targeted sequencing. In some embodiments, the targeted sequencing comprises targeted sequencing of one or more genes associated with cancer, or a portion thereof. In some embodiments, the targeted sequencing comprises targeted sequencing of one or more exon regions.
In some embodiments, the method includes: identifying, by one or more processors, a genomic sequence of interest in a patient sample at a genomic locus; identifying, by the one or more processors, one or more proxy genomic sequences for the sequence of interest; comparing, by the one or more processors, an observed frequency of the sequence of interest to a centrality measure of observed frequencies of the one or more proxy genomic sequences; and based on the comparison, identifying (e.g., classifying or characterizing) the genomic sequence of interest as either germline or somatic.
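For illustration only, the comparison of an observed frequency to a centrality measure may be sketched as follows. The sketch uses the median as the centrality measure and an absolute difference as the distance; the cutoff value of 0.1 and the function names are hypothetical placeholders, not values taught by this disclosure.

```python
from statistics import median


def allele_frequency_distance(variant_af, proxy_afs):
    """Absolute difference between the variant allele frequency and the
    median allele frequency of the proxy genomic sequences."""
    if not proxy_afs:
        raise ValueError("at least one proxy allele frequency is required")
    return abs(variant_af - median(proxy_afs))


def classify_by_distance(variant_af, proxy_afs, threshold=0.1):
    """Label a variant 'germline' when its allele frequency is close to the
    proxy median and 'somatic' otherwise.

    The threshold is a hypothetical placeholder and would need to be
    chosen empirically.
    """
    distance = allele_frequency_distance(variant_af, proxy_afs)
    return "germline" if distance <= threshold else "somatic"


proxy_afs = [0.45, 0.50, 0.55]
call_near = classify_by_distance(0.47, proxy_afs)
call_far = classify_by_distance(0.15, proxy_afs)
```

The intuition behind this sketch is that, in a sample mixing tumor and non-tumor material, a germline variant is carried by both cell populations and so tends to track nearby germline proxies, whereas a somatic variant is carried only by the tumor fraction.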
In some embodiments of the method, the one or more proxy genomic sequences include a single nucleotide polymorphism (SNP).
In some embodiments of the method, the one or more proxy genomic sequences include an allele.
In some embodiments, the method further comprises identifying, by the one or more processors, a segment of a patient's genome in which the genomic locus is included. In some embodiments, identifying, by the one or more processors, the segment includes performing a segmentation procedure on a continuous portion of the patient's genome. In some embodiments, the portion of the patient's genome is large enough to identify three distinct segments. In some embodiments, the proxy is identified, by the one or more processors, to be located on the same segment as the genomic locus. In some embodiments, the segmentation procedure identifies segments according to whether a genomic parameter is equal across the entirety of each individual segment. In some embodiments, the genomic parameter is copy number.
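As a simplified, non-limiting illustration of a segmentation procedure in which the genomic parameter is copy number, consecutive genomic bins having an equal copy number may be merged into segments. The sketch assumes that an integer copy number call is already available for each bin; in practice, copy number signals are noisy and a statistical segmentation algorithm (for example, circular binary segmentation) would typically be used instead. The function name is hypothetical.

```python
def segment_by_copy_number(bin_copy_numbers):
    """Merge consecutive bins with equal copy number into segments.

    Returns a list of (start_index, end_index_exclusive, copy_number)
    tuples. Assumes clean integer copy number calls per bin, which is a
    simplification made for this illustration.
    """
    segments = []
    start = 0
    n = len(bin_copy_numbers)
    for i in range(1, n + 1):
        # Close the current run at the end of the input or when the
        # copy number changes.
        if i == n or bin_copy_numbers[i] != bin_copy_numbers[start]:
            segments.append((start, i, bin_copy_numbers[start]))
            start = i
    return segments


# Six bins yielding three distinct segments.
example_segments = segment_by_copy_number([2, 2, 3, 3, 3, 1])
```

Here the copy number is equal across the entirety of each returned segment, which matches the criterion described above.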
In some embodiments of any of the above methods of identifying a genomic sequence of interest as germline or somatic, the step of identifying, by the one or more processors, the genomic sequence of interest as germline or somatic includes: inputting an allele frequency distance into a trained statistical model; and outputting, from the trained statistical model, a value indicative of a likelihood that the genomic sequence of interest is germline or a value indicative of a likelihood that the genomic sequence of interest is somatic. In some embodiments, the allele frequency distance is adjusted to correct for a contamination level in the patient sample, a low sequencing read depth, a noisy estimation of allele frequencies, a low segment germline single nucleotide polymorphism (SNP) count, or high variability in segment germline SNP allele frequency. In some embodiments, the trained statistical model comprises a function that associates the allele frequency distance with the value indicative of a likelihood that the genomic sequence of interest is germline or the value indicative of a likelihood that the genomic sequence of interest is somatic.
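By way of example only, a function that associates the allele frequency distance with a value indicative of a likelihood of germline origin may take a logistic form. In the sketch below, the intercept and slope are hypothetical placeholder coefficients; in practice they would be obtained by training and are not values provided by this disclosure.

```python
import math


def germline_probability(distance, intercept=2.0, slope=-30.0):
    """Map an allele frequency distance to a value between 0 and 1
    indicative of the likelihood that the variant is germline.

    The intercept and slope are hypothetical placeholders standing in
    for trained model coefficients.
    """
    z = intercept + slope * distance
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def somatic_probability(distance, intercept=2.0, slope=-30.0):
    """Complementary value indicative of the likelihood of somatic origin."""
    return 1.0 - germline_probability(distance, intercept, slope)


p_germline_small = germline_probability(0.0)
p_germline_large = germline_probability(0.5)
```

With a negative slope, a larger distance from the proxy allele frequencies yields a lower germline likelihood and a correspondingly higher somatic likelihood.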
In some embodiments, the trained statistical model is a logistic regression model. In some embodiments, the trained statistical model is trained using tumor samples with known germline sequences. In some embodiments, the trained statistical model is trained using data for tumor samples with known germline sequences and known somatic sequences. In some embodiments, the method further comprises training the statistical model using data for tumor samples with known germline sequences. In some embodiments, the method further comprises training the statistical model using data for tumor samples with known germline sequences and known somatic sequences.
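As a non-limiting illustration, a logistic regression model of this kind may be trained on allele frequency distances labeled with known germline or somatic status. The sketch below uses plain gradient descent on a single feature; the training values are synthetic toy numbers chosen only to make the example run and do not represent data from any tumor sample. The function names, learning rate, and epoch count are likewise assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(distances, labels, learning_rate=0.5, epochs=2000):
    """Fit weight w and bias b so that sigmoid(w * distance + b)
    approximates the probability that a variant is germline (label 1)
    rather than somatic (label 0), using batch gradient descent on the
    mean log loss."""
    w, b = 0.0, 0.0
    n = len(distances)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(distances, labels):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Synthetic toy data: small distances labeled germline (1),
# large distances labeled somatic (0).
toy_distances = [0.01, 0.02, 0.03, 0.05, 0.20, 0.25, 0.30, 0.35]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = train_logistic(toy_distances, toy_labels)
```

A production implementation would more likely rely on an established statistics library and would include regularization and held-out validation, neither of which is shown here.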
In some embodiments, the trained statistical model is trained using data for variant allele frequencies that excludes variants located in genomic regions known to have allele frequencies that deviate from expected values. In some embodiments, the method further comprises training the statistical model using data for variant allele frequencies that excludes variants located in genomic regions known to have allele frequencies that deviate from expected values.
In some embodiments, the trained statistical model is trained using data that incorporates prior knowledge of the likelihood of a variant being a germline variant, a somatic variant, or a clonal hematopoiesis of indeterminate potential (CHIP) variant based on historical data or databases. In some embodiments, the method further comprises training the statistical model using data that incorporates prior knowledge of the likelihood of a variant being a germline variant, a somatic variant, or a clonal hematopoiesis of indeterminate potential (CHIP) variant based on historical data or databases.
In some embodiments, the trained statistical model is trained using data that accounts for a noise level for a given variant call and its genomic context. In some embodiments, the method further comprises training the statistical model using data that accounts for a noise level for a given variant call and its genomic context.
In some embodiments, the one or more proxy genomic sequences include a single nucleotide polymorphism (SNP). In some embodiments, the one or more proxy genomic sequences include an allele. In some embodiments of the method, the genomic sequence of interest includes a genomic variant.
In some embodiments of the method, the method further comprises generating, by the one or more processors, a report indicating the genomic sequence of interest as either germline or somatic. In some embodiments, the method comprises transmitting the report, for example to a healthcare provider. In some embodiments, the report is transmitted via a computer network or a peer-to-peer connection.
In some embodiments of any of the above methods, the patient sample is derived from a tissue biopsy comprising tumor tissue and non-tumor tissue. In some embodiments, the tissue biopsy is a solid tissue biopsy or a liquid biopsy. In some embodiments, the tissue biopsy is a liquid biopsy comprising blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample comprises cell-free DNA (cfDNA) obtained from the subject. In some embodiments, the patient sample comprises circulating tumor DNA (ctDNA) obtained from the subject.
Further described herein is a method of treating cancer in a patient, which includes identifying, by the one or more processors, one or more genomic sequences of interest as somatic using any of the methods described above; selecting a cancer treatment modality based on the one or more identified somatic sequences; and treating the cancer using the selected cancer treatment modality. In some embodiments, the one or more identified somatic sequences are associated with successful cancer treatment using the selected treatment modality. In some embodiments, the method comprises determining, by the one or more processors, a microsatellite instability status of the cancer using the one or more identified somatic sequences; and selecting the cancer treatment modality based on the microsatellite instability status of the cancer. In some embodiments, the method includes determining, by the one or more processors, a tumor mutational burden for the cancer using the one or more identified somatic sequences; and selecting the cancer treatment modality based on the tumor mutational burden being above a predetermined tumor mutational burden threshold. In some embodiments, the cancer treatment modality comprises administration of an effective amount of one or more anti-cancer agents to the patient if the tumor mutational burden is above a predetermined threshold. In some embodiments, the one or more anti-cancer agents comprises an immuno-oncology agent. In some embodiments, the immuno-oncology agent is an immune checkpoint inhibitor.
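For illustration only, a tumor mutational burden may be expressed as the number of identified somatic variants per megabase of sequenced territory and compared against a predetermined threshold. In the sketch below, the function names and the default threshold of 10 mutations per megabase are hypothetical placeholders; this disclosure does not specify a particular threshold value.

```python
def tumor_mutational_burden(somatic_variant_count, sequenced_bases):
    """Return somatic variants per megabase of sequenced territory."""
    if sequenced_bases <= 0:
        raise ValueError("sequenced_bases must be positive")
    if somatic_variant_count < 0:
        raise ValueError("somatic_variant_count must be non-negative")
    return somatic_variant_count * 1_000_000 / sequenced_bases


def is_above_tmb_threshold(tmb, threshold=10.0):
    """True when the burden is above the predetermined threshold.

    The default threshold is a hypothetical placeholder.
    """
    return tmb > threshold


# 24 somatic variants over 1.2 megabases of targeted territory.
tmb = tumor_mutational_burden(24, 1_200_000)
high_tmb = is_above_tmb_threshold(tmb)
```

Because only variants identified as somatic are counted, misclassifying germline variants as somatic would inflate the computed burden, which is one reason the germline/somatic determination matters for this use.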
Also described herein is a method of monitoring cancer progression or recurrence in a patient, which includes identifying, by the one or more processors, one or more genomic sequences of interest as somatic using any of the methods described above; and detecting, by the one or more processors, the presence or absence of the one or more genomic sequences of interest identified as somatic within a second patient sample obtained from the patient after the cancer has been treated. In some embodiments, the method comprises obtaining the second patient sample from the patient. In some embodiments, the method comprises treating the cancer in the patient after the first patient sample is obtained from the patient and before the second patient sample is obtained from the patient. In some embodiments, the second patient sample comprises cell-free DNA. In some embodiments, detecting the presence or absence of the one or more genomic sequences of interest identified as somatic within the second patient sample comprises sequencing nucleic acid molecules in the second patient sample.
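As a non-limiting illustration, detecting the presence or absence of previously identified somatic sequences in a second patient sample may be sketched as a lookup of each somatic variant among the variants called in the second sample. The variant representation as a (chromosome, position, reference allele, alternate allele) tuple and the function name are assumptions of this sketch, and the coordinates shown are arbitrary placeholders rather than real variants.

```python
def detect_somatic_variants(baseline_somatic, followup_variants):
    """Return a mapping from each baseline somatic variant to True if it
    is present among the variants called in the follow-up sample, and
    False otherwise.

    Variants are represented as (chrom, pos, ref, alt) tuples, which is
    an assumption made for this illustration.
    """
    followup = set(followup_variants)
    return {variant: (variant in followup) for variant in baseline_somatic}


# Arbitrary placeholder variants, for illustration only.
baseline_somatic = [("chr1", 1000, "A", "T"), ("chr2", 2000, "G", "C")]
followup_variants = [("chr1", 1000, "A", "T"), ("chr3", 3000, "C", "G")]
status = detect_somatic_variants(baseline_somatic, followup_variants)
any_detected = any(status.values())
```

In this sketch, detection of any tracked somatic variant in the second sample could be flagged for further review, whereas absence of all tracked variants would be reported as such.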
Further described herein is a method of selecting a neoantigen for a cancer vaccine personalized for a subject having cancer, comprising: identifying, by the one or more processors, one or more genomic sequences of interest as somatic using any of the methods described above, wherein the one or more genomic sequences of interest identified as somatic is located within an exon region of a gene; and selecting, by the one or more processors, from the one or more genomic sequences of interest identified as somatic, a genomic sequence that encodes a neoantigen suitable as a cancer vaccine for the subject. In some embodiments, the method comprises making a vaccine comprising the neoantigen.
Also described herein is a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: select a genomic sequence of interest at a genomic locus from within a patient genomic sequence obtained for a patient sample comprising a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; select one or more proxy genomic sequences for the genomic sequence of interest; determine an allele frequency distance using an observed allele frequency of the genomic sequence of interest and a summary statistic or distribution indicative of observed allele frequencies of the one or more proxy genomic sequences; and identify the genomic sequence of interest as germline or somatic using the allele frequency distance. In some embodiments, the summary statistic is a mean allele frequency or a median allele frequency. In some embodiments, the allele frequency distance is determined using the observed allele frequency of the genomic sequence of interest and the distribution indicative of the observed frequencies of a plurality of proxy genomic sequences, and wherein the genomic sequence of interest is identified as germline or somatic based on a probability that the observed allele frequency of the genomic sequence of interest fits within or does not fit within the distribution. In some embodiments, the tumor nucleic acid molecules and the non-tumor nucleic acid molecules comprise DNA molecules. In some embodiments, the tumor nucleic acid molecules and the non-tumor nucleic acid molecules comprise RNA molecules.
In some embodiments of the non-transitory computer-readable storage medium, the one or more proxy genomic sequences are located within a defined segment of the patient genomic sequence, and the selected genomic sequence of interest is located within the same defined segment. In some embodiments, the patient genomic sequence is segmented into a plurality of segments based on copy number uniformity within each segment.
In some embodiments of the non-transitory computer-readable storage medium, the one or more programs further comprise instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to segment the patient genomic sequence into a plurality of segments.
In some embodiments of the non-transitory computer-readable storage medium, the patient genomic sequence is determined using targeted sequencing. In some embodiments, the patient genomic sequence is determined using next generation sequencing. In some embodiments, the targeted sequencing comprises targeted sequencing of one or more genes associated with cancer, or a portion thereof. In some embodiments, the targeted sequencing comprises targeted sequencing of one or more exon regions.
In some embodiments, a non-transitory computer-readable storage medium stores one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to: identify a genomic sequence of interest in a patient sample at a genomic locus; identify one or more proxy genomic sequences for the sequence of interest; compare an observed frequency of the sequence of interest to a centrality measure of observed frequencies of the one or more proxy genomic sequences; and based on the comparison, characterize the genomic sequence of interest as either germline or somatic.
In some embodiments of the non-transitory computer readable storage medium, the one or more programs further comprise instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to generate a report indicating the genomic sequence of interest as either germline or somatic. In some embodiments, the electronic device comprises a display, and the one or more programs further comprise instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to display the report.
In some embodiments of the non-transitory computer readable storage medium, the one or more proxy genomic sequences include a single nucleotide polymorphism (SNP).
In some embodiments of the non-transitory computer readable storage medium, the one or more proxy genomic sequences include an allele.
In some embodiments of the non-transitory computer readable storage medium, the one or more programs further comprise instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to identify a segment of a patient's genome in which the genomic locus is included. In some embodiments, identifying the segment includes performing a segmentation procedure on a continuous portion of the patient's genome. In some embodiments, the portion of the patient's genome is large enough to identify three distinct segments. In some embodiments, the one or more proxy genomic sequences are identified to be located on the same segment as the genomic locus. In some embodiments, the segmentation procedure identifies segments according to whether a genomic parameter is equal across the entirety of each individual segment. In some embodiments, the genomic parameter is copy number.
In some embodiments of the non-transitory computer readable storage medium, the genomic sequence of interest includes a genomic variant.
In some embodiments of the non-transitory computer readable storage medium, the one or more programs further comprise instructions, which when executed by one or more processors of the electronic device, cause the electronic device to receive sequencing data associated with the patient genomic sequence. In some embodiments, the one or more programs further comprise instructions, which when executed by one or more processors of the electronic device, cause the electronic device to assemble the patient genomic sequence using the sequencing data. In some embodiments, the one or more programs further comprise instructions, which when executed by one or more processors of the electronic device, cause the electronic device to operate a sequencer to sequence nucleic acid molecules derived from the patient sample, thereby obtaining the sequencing data.
In some embodiments of the non-transitory computer readable storage medium, the one or more programs further comprise instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to generate a report indicating the genomic sequence of interest as either germline or somatic. In some embodiments, the one or more programs further comprise instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to transmit the report using a computer network.
Also described herein is an electronic device, comprising: one or more processors; and a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for: selecting a genomic sequence of interest at a genomic locus from within a patient genomic sequence obtained for a patient sample comprising a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting one or more proxy genomic sequences for the genomic sequence of interest; determining an allele frequency distance using an observed allele frequency of the genomic sequence of interest and a summary statistic or distribution indicative of observed allele frequencies of the one or more proxy genomic sequences; and identifying the genomic sequence of interest as germline or somatic using the allele frequency distance. In some embodiments, the summary statistic is a mean allele frequency or a median allele frequency. In some embodiments, the allele frequency distance is determined using the observed allele frequency of the genomic sequence of interest and the distribution indicative of the observed frequencies of a plurality of proxy genomic sequences, and wherein the genomic sequence of interest is identified as germline or somatic based on a probability that the observed allele frequency of the genomic sequence of interest fits within or does not fit within the distribution. In some embodiments, the tumor nucleic acid molecules and the non-tumor nucleic acid molecules comprise DNA molecules. In some embodiments, the tumor nucleic acid molecules and the non-tumor nucleic acid molecules comprise RNA molecules. In some embodiments, the patient genomic sequence is determined using next generation sequencing.
In some embodiments of the electronic device, the one or more proxy genomic sequences are located within a defined segment of the patient genomic sequence, and the selected genomic sequence of interest is located within the same defined segment. In some embodiments, the patient genomic sequence is segmented into a plurality of segments based on copy number uniformity within each segment. In some embodiments, the one or more programs further include instructions for segmenting the patient genomic sequence into a plurality of segments.
In some embodiments of the electronic device, the patient genomic sequence is determined using targeted sequencing. In some embodiments, the targeted sequencing comprises targeted sequencing of one or more genes associated with cancer, or a portion thereof. In some embodiments, the targeted sequencing comprises targeted sequencing of one or more exon regions.
In some embodiments, an electronic device comprises: one or more processors; and a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for: identifying a genomic sequence of interest in a patient sample at a genomic locus; identifying one or more proxy genomic sequences for the sequence of interest; comparing an observed frequency of the sequence of interest to a centrality measure of observed frequencies of the one or more proxy genomic sequences; and based on the comparison, characterizing the genomic sequence of interest as either germline or somatic.
In some embodiments of the electronic device, the one or more proxy genomic sequences include a single nucleotide polymorphism (SNP).
In some embodiments of the electronic device, the one or more proxy genomic sequences include an allele.
In some embodiments of the electronic device, the one or more programs further include instructions for identifying a segment of a patient's genome in which the genomic locus is included. In some embodiments, identifying the segment includes performing a segmentation procedure on a continuous portion of the patient's genome. In some embodiments, the portion of the patient's genome is large enough to identify three distinct segments. In some embodiments, the proxy is identified to be located on the same segment as the genomic locus. In some embodiments, the segmentation procedure identifies segments according to whether a genomic parameter is equal across the entirety of each individual segment. In some embodiments, the genomic parameter is copy number.
In some embodiments of the electronic device, the genomic sequence of interest includes a genomic variant.
In some embodiments of the electronic device, the one or more programs further comprise instructions for receiving sequencing data associated with the patient genomic sequence. In some embodiments, the one or more programs further comprise instructions for assembling the patient genomic sequence using the sequencing data. In some embodiments, the one or more programs further comprise instructions for causing a sequencer to sequence nucleic acid molecules derived from the patient sample, thereby obtaining the sequencing data.
In some embodiments of the electronic device, the one or more programs further include instructions for generating a report indicating the genomic sequence of interest as either germline or somatic. In some embodiments, the one or more programs further include instructions for transmitting the report via a computer network or a peer-to-peer connection. In some embodiments, the device further comprises a display and the one or more programs further include instructions for displaying the report.
In some embodiments of the electronic device, the patient sample is derived from a tissue biopsy comprising tumor tissue and non-tumor tissue. In some embodiments, the tissue biopsy is a solid tissue biopsy or a liquid biopsy. In some instances, the tissue sample is a liquid biopsy comprising blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample comprises cell-free DNA (cfDNA) obtained from the subject. In some embodiments, the patient sample comprises circulating tumor DNA (ctDNA) obtained from the subject.
Also described herein is a system, comprising any of the electronic devices described herein and a sequencer configured to sequence nucleic acid molecules derived from the patient sample. In some embodiments, the sequencer is a next generation sequencer.
Disclosed herein are methods of identifying a genomic sequence of interest as germline or somatic, the methods comprising: identifying, by one or more processors, a genomic sequence of interest in a patient sample at a genomic locus; identifying, by the one or more processors, a proxy genomic sequence for the genomic sequence of interest; comparing, by the one or more processors, an observed allele fraction of the genomic sequence of interest to an observed allele fraction of the proxy genomic sequence; and identifying, by the one or more processors, the genomic sequence of interest as germline or somatic based on the comparison. In some embodiments, the proxy genomic sequence has the same copy number as the genomic sequence of interest. In some embodiments, identifying, by the one or more processors, the genomic sequence of interest as germline or somatic comprises: inputting an allele frequency distance into a trained statistical model; and outputting, from the trained statistical model, a value indicative of a likelihood that the genomic sequence of interest is germline or a value indicative of a likelihood that the genomic sequence of interest is somatic. In some embodiments, the allele fraction of the genomic sequence and the allele fraction of the proxy genomic sequence are determined using a next generation sequencing technique. In some embodiments, the allele fraction of the genomic sequence and the allele fraction of the proxy genomic sequence are determined using a microarray technique. In some embodiments, the patient sample comprises a solid tissue biopsy or a liquid biopsy. In some embodiments, the patient sample is a liquid biopsy comprising blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample comprises cell-free DNA (cfDNA) obtained from the subject. In some embodiments, the patient sample comprises circulating tumor DNA (ctDNA) obtained from the subject. 
In some embodiments, the patient is a cancer patient.
All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference in their entirety to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference in its entirety. In the event of a conflict between a term herein and a term in an incorporated reference, the term herein controls.
Methods, devices, and computer readable media for distinguishing somatic genomic sequences from germline genomic sequences are described herein. A genomic sequence of interest in a patient sample at a genomic locus can be identified. Then, for the sequence of interest, one or more proxy genomic sequences can be identified. The observed frequency of the sequence of interest can be compared to a centrality measure of observed frequencies of the one or more proxy genomic sequences; and, based on the comparison, the genomic sequence of interest can be characterized as either a germline sequence or a somatic sequence.
Several methods have been developed to determine the somatic/germline status of variants in a single-sample setting, including matching to public germline databases such as dbSNP, or using surrogates constructed from a large number of normal individuals in place of the matched normal. See, for example, Hiltemann, et al., Discriminating somatic and germline mutations in tumor DNA samples without matching normal, Genome Res. vol. 25, no. 9, pp. 1382-1390 (2015). However, such methods are ineffective when dealing with rare germline variants that are limited to a family or small population. There is also a so-called “basic method”, in which variants with allele frequency (or allele fraction) near 50% or 100% are regarded as germline and those not satisfying this criterion are classified as somatic. See Jones, et al., Personalized genomic analyses for cancer mutation discovery and interpretation, Sci. Transl. Med., vol. 7, no. 283, p. 283ra53 (2015). This basic method fails to account for the fact that aneuploidy can drive the allele frequency of germline variants significantly away from the 50% or 100% expectations. The terms “allele frequency” and “allele fraction” are used interchangeably herein and refer to the fraction of sequence reads corresponding to a particular allele relative to the total number of sequence reads for a genomic locus.
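By way of illustration only, the allele fraction definition and the “basic method” described above can be sketched as follows. The function names and the tolerance of 0.1 around the 50% and 100% expectations are assumptions made for this sketch; the cited reference does not prescribe them here.

```python
def allele_fraction(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads at a locus that support a particular allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def basic_method_call(af: float, tolerance: float = 0.1) -> str:
    """Call 'germline' when af is near 0.5 or 1.0, otherwise 'somatic'.

    The tolerance is a hypothetical value chosen for illustration.
    """
    near_heterozygous = abs(af - 0.5) <= tolerance
    near_homozygous = abs(af - 1.0) <= tolerance
    return "germline" if (near_heterozygous or near_homozygous) else "somatic"
```

Under this rule a germline variant whose allele fraction has been shifted by aneuploidy, for example to about 0.69, would be called somatic, which is the failure mode noted above.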
Published in early 2018, the SGZ (somatic-germline-zygosity) algorithm sought to provide a solution to the single-sample somatic/germline classification problem by accounting for tumor content, tumor ploidy, and the local copy number. SGZ was demonstrated to greatly outperform the “basic method” in somatic/germline calling accuracy in validation datasets (Sun, et al., A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal, PLoS Comput Biol., vol. 14, no. 2, p. e1005965 (2018), which is incorporated herein by reference in its entirety). Application of the SGZ algorithm in FMI's deep massively parallel sequencing (MPS)-based diagnostic products has enabled effective somatic/germline status determination for short variants (substitutions and indels), and the algorithm has become an indispensable tool for applications such as Tumor Mutational Burden (TMB) estimation.
The methods described herein for somatic/germline classification represent a further improvement over the SGZ approach. The new approach is built upon the same underlying principle, i.e., in a tumor/normal admixture, somatic and germline variants often have different expected allele frequencies that are dictated by tumor fraction, tumor ploidy and local copy number. However, in contrast to SGZ, which estimates expected germline allele frequency by computational modeling of tumor fraction, tumor ploidy and local copy number, the new methods disclosed herein directly infer the expected germline allele frequency from known germline SNPs located on the same copy number segment as the variant in question. Thus, using the method described herein, it is not necessary to determine or model the copy number or tumor purity to obtain an accurate call for somatic and germline variants.
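To make the dependence on tumor fraction and local copy number concrete, the following sketch uses a simple two-component tumor/normal admixture model. The formulas, function names, and parameter names are assumptions chosen for illustration; they are not taken from this disclosure and are not asserted to be the exact SGZ model. The normal component is assumed to be diploid.

```python
def expected_germline_af(purity: float, tumor_copies: int,
                         variant_copies_in_tumor: int,
                         variant_copies_in_normal: int = 1) -> float:
    """Expected allele fraction of a germline variant in a tumor/normal mix.

    Both tumor and normal DNA carry the variant, so both contribute reads.
    """
    variant = (purity * variant_copies_in_tumor
               + (1 - purity) * variant_copies_in_normal)
    total = purity * tumor_copies + (1 - purity) * 2  # normal assumed diploid
    return variant / total


def expected_somatic_af(purity: float, tumor_copies: int,
                        variant_copies_in_tumor: int) -> float:
    """Expected allele fraction of a somatic variant (absent from normal DNA)."""
    total = purity * tumor_copies + (1 - purity) * 2
    return purity * variant_copies_in_tumor / total


# Hypothetical sample with 60% tumor purity.
diploid_germline = expected_germline_af(0.6, tumor_copies=2, variant_copies_in_tumor=1)
gained_germline = expected_germline_af(0.6, tumor_copies=4, variant_copies_in_tumor=3)
diploid_somatic = expected_somatic_af(0.6, tumor_copies=2, variant_copies_in_tumor=1)
```

In this hypothetical sample the germline expectation moves from 0.5 on a diploid segment to about 0.69 on a segment with a copy number gain. A germline SNP on the same segment would be shifted in the same way, which is why its observed allele frequency can stand in for the modeled value.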
In some embodiments, a trained model, such as a logistic regression model, is used to predict the probability of a variant being somatic based on the difference between the observed variant allele frequency and the inferred expected germline variant allele frequency. In some embodiments, the model is trained using data for matched tumor/normal pairs and validated with independent datasets. In some embodiments, the model is trained using data for tumor samples with known germline (and, optionally, known somatic) sequences. In some embodiments, the model is trained using data for mixed tumor/normal samples with known germline (and, optionally, known somatic) sequences. The validation shows the new classifier outperforms SGZ in sensitivity and positive predictive value (PPV) for somatic variant classification.
A determined genomic sequence may be a somatic variant sequence or a germline sequence. Publicly accessible databases of known germline sequences exist (see, for example, dbSNP (available at www.ncbi.nlm.nih.gov/snp/) or gnomAD (available at gnomad.broadinstitute.org)), and a match between a known germline sequence and a sequence determined by sequencing nucleic acids in a sample obtained from a subject indicates that the sequence associated with the sample is likely to be a germline sequence. However, failure to match a known germline sequence does not demonstrate that the sequence is a somatic variant sequence, as it could be a previously unknown (or unrecorded) germline sequence of the subject. The methods described herein allow for the classification of the sequence as a germline sequence or somatic variant sequence.
Methods described herein allow for the identification of a genomic sequence of interest as a germline sequence or a somatic sequence. In some embodiments, the somatic sequence is associated with a cancer in a patient. For example, a patient sample can include a mixture of tumor nucleic acid molecules (i.e., nucleic acid molecules derived from a tumor, either directly (such as in the case of a tumor biopsy) or indirectly (such as in the case of a liquid biopsy or bodily fluid sample comprising circulating tumor DNA (ctDNA) as well as cell-free DNA (cfDNA))) and non-tumor nucleic acid molecules (i.e., nucleic acid molecules derived from non-tumorous, and preferably healthy, tissue, cells, liquid biopsy samples, or bodily fluid samples). The methods may include a step of selecting a genomic sequence of interest from within a patient genomic sequence (i.e., a genomic sequence obtained for the patient, which may be a whole genome or a portion thereof (e.g., an exome or a targeted region within the whole genome)), and a step of selecting one or more proxy genomic sequences for the genomic sequence of interest. The patient genomic sequence may include one or more alleles at any given locus (e.g., a somatic sequence and/or a germline sequence at any given locus).
Nucleic acid molecules from a sample (for example, a mixed tumor/normal tissue sample, or a cell-free DNA (cfDNA) sample containing a mixture of ctDNA and non-tumor cfDNA) can be sequenced to determine a patient genomic sequence. A genomic sequence of interest can be identified or selected at a genomic locus from the patient genomic sequence. The selected genomic sequence is a test sequence which is to be characterized as germline or somatic. In some embodiments, the genomic sequence of interest differs from a reference sequence. In some embodiments, the genomic sequence of interest differs from a sequence in a selected germline sequence database.
The genomic region 100 shown in
The techniques described below involve characterizing a sequence of interest 102 within the genomic region 100 as either germline or somatic. The characterization is assisted by use of a reference sequence 104. The reference sequence 104 is an exemplary genomic sequence that represents a “normal” (e.g., non-cancerous) patient. In some implementations, the reference sequence 104 can include a sequence determined by the Human Genome Project, e.g. hg19.
In the reference sequence 104, there are known regions of polymorphism 106a, 106b. A region of polymorphism 106a, 106b is a region (comprising any number of bases from a single base to several hundred or more bases) in which variation of a particular organism's genomic sequence is expected across a population of organisms, without adverse consequences corresponding to the variations. For example, in humans there are regions of polymorphism that correspond to various hair colors, eye colors, or other individualized characteristics. The genomic region 100 corresponding to an actual patient sample will have specific base values 108a, 108b at the positions in the region 100 corresponding to the polymorphic regions 106a, 106b in the reference sequence 104. In other words, the polymorphic regions 106a, 106b of the reference sequence 104 are the locations at which certain characteristics of a person (e.g. hair color) are determined; the base values 108a, 108b are the individualized determinations of those characteristics (e.g., red hair) that describe the specific patient.
In some cases, polymorphic regions 106a, 106b include one or more single nucleotide polymorphisms (or “SNPs”). In some cases, regions of polymorphism can include entire alleles or portions thereof.
Determining a genomic sequence (e.g., a genomic region 100) from a physical sample can be accomplished in a variety of ways. One such way is described in U.S. Pat. No. 9,340,830, and another is described in U.S. Pat. Pub. 2017/0356053, the entireties of both of which are incorporated by reference herein. More generally, there is a category of machines, called genomic sequencers, that are operable to determine the genetic sequence of an input sample. In some instances, the disclosed methods and systems may be implemented using any of a variety of next generation sequencing (NGS) techniques and sequencers, including cyclic array sequencers configured for massively parallel sequencing and single molecule sequencers. Moreover, there are a variety of sub-regions of the genomes of humans and other organisms that are known to be relevant to a variety of medical conditions.
The techniques described herein do not depend on the use of a particular sequencing platform or particular sequencing techniques, and any of these machines and accompanying techniques may be used in step 202. In some instances, the disclosed methods may be implemented using alternative nucleic acid sequence analysis techniques, e.g., microarrays, fluorescence in situ hybridization (FISH), and the like.
In some implementations, the region (i.e., sequence) of interest 102 is identified to correspond to a known genetic locus within the reference sequence 104. In some implementations, the region of interest 102 corresponds to a mutation with respect to the reference sequence 104 (i.e., a subsection of the genomic region 100 other than a polymorphic region that has a different genetic sequence from that of the corresponding part of reference sequence 104). In some implementations, the sequence of interest corresponds to a gene relevant to a medical condition that the patient possesses. In some implementations, the region of interest 102 is an oncogene or portion thereof.
In step 204, one or more proxy genomic sequences for the genomic sequence of interest are identified. The selected one or more proxy genomic sequences may be known germline sequences (for example, based on being matched with a known germline sequence from a database of known germline sequences, or by sequencing healthy tissue, cells, or cell-free DNA from the subject or another healthy individual). Referring to
The germline status of a particular candidate proxy sequence may be known from research literature, publicly available databases (e.g., dbSNP (available at www.ncbi.nlm.nih.gov/snp/) or gnomAD (available at gnomad.broadinstitute.org)), or may be discovered by other ab initio means. On the other hand, somatic variants can be identified from matched tumor/normal samples; i.e., samples from the same patient that contain both tumor DNA and non-tumor (“normal”) DNA. In particular, variants seen in tumor DNA but not in corresponding normal DNA are necessarily somatic. Known somatic variants may also be discovered by other ab initio means.
Referring to
A variety of segmentation procedures are known in the art. For example, iSeg (described in Girimurugan, et al., iSeg: an Efficient Algorithm for Segmentation of Genomic and Epigenomic Data, BMC Bioinformatics 19:131 (2018), the entirety of which is incorporated herein), CBS (described in Olshen, et al., Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data, Biostatistics 2004 October; 5(4):557-72, the entirety of which is incorporated by reference herein), SLMSuite (described in Orlandini, et al., SLMSuite: A Suite of Algorithms for Segmenting Genomic Profiles, BMC Bioinformatics 18:321 (2017)), and PELT (described in Killick, et al., Optimal detection of changepoints with a linear computational cost, Journal of the American Statistical Association, 107:500 (2012), the entirety of which is incorporated by reference herein) are four among many such algorithms. In some embodiments, the patient genomic sequence is segmented into a plurality of segments based on copy number uniformity within each segment.
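As a minimal sketch of segmenting by copy number uniformity, the following groups consecutive genomic bins that share the same copy number and then selects candidate proxies that fall on the same segment as a locus of interest. It assumes that an integer copy number has already been assigned to each bin; it is a simple run-length grouping for illustration and is not an implementation of iSeg, CBS, SLMSuite, or PELT, which operate on noisy signal. All names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (start_bin, end_bin_exclusive, copy_number)


def segment_by_copy_number(copy_numbers: List[int]) -> List[Segment]:
    """Split bins into maximal runs over which the copy number is equal."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(copy_numbers) + 1):
        # Close the current run at the end of the data or when the value changes.
        if i == len(copy_numbers) or copy_numbers[i] != copy_numbers[start]:
            segments.append((start, i, copy_numbers[start]))
            start = i
    return segments


def proxies_on_same_segment(locus_bin: int, snp_bins: List[int],
                            segments: List[Segment]) -> List[int]:
    """Return the known-germline SNP bins that share a segment with the locus."""
    for start, end, _ in segments:
        if start <= locus_bin < end:
            return [s for s in snp_bins if start <= s < end and s != locus_bin]
    return []


segments = segment_by_copy_number([2, 2, 3, 3, 3, 2])
proxies = proxies_on_same_segment(3, [0, 2, 4, 5], segments)
```

Here the input yields three distinct segments, and only the SNPs in bins 2 and 4 share the copy number 3 segment with the locus in bin 3.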
Referring back to
In step 206, the frequencies of the proxies 110 are identified. In step 208, the allele frequencies (allele fractions) of sequences from the region of interest (i.e., genomic sequence of interest) 102 are identified. Here, “frequency” refers to a normalized statistical frequency, for example, the number of occurrences of a sequence or proxy within the sample, divided by the total number of occurrences of any sequence at the same genomic locus. In some implementations, several frequency measurements may be made. Allele frequencies of the genomic sequence of interest and the one or more proxy genomic sequences can be determined by sequencing the nucleic acid molecules in the sample from the subject. In some instances, allele frequencies may be determined using other methodologies, e.g., microarrays or fluorescence in situ hybridization (FISH) techniques. When using several proxies, outlier proxy frequencies may be discarded and the remaining frequencies may be combined as a single statistical centrality measure (e.g., a summary statistic, such as mean, median, mode, or others, or a distribution (such as a probability distribution) of the allele frequencies of the proxy sequences) so that step 210 involves a single numerical comparison. For example, in some embodiments, the centrality measure (summary statistic) is a mean allele frequency for the one or more proxy sequences. In some embodiments, the centrality measure (summary statistic) is a median allele frequency for the one or more proxy sequences. When a single proxy genomic sequence is used, the centrality measure of observed frequencies of the proxy genomic sequence is the frequency of that proxy sequence. The centrality measure may be, in some embodiments, a distribution of the observed allele frequencies for the proxy sequences.
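One possible way to discard outlier proxy frequencies and reduce the remainder to a single centrality measure is sketched below. The median is used as the summary statistic, as in some embodiments above; the outlier rule (distance from the median greater than a multiple of the median absolute deviation) and its default of 3.0 are assumptions made for this sketch, since the text does not specify how outliers are identified.

```python
from statistics import median
from typing import List


def proxy_centrality(proxy_afs: List[float], max_mad_multiples: float = 3.0) -> float:
    """Median proxy allele frequency after dropping outliers.

    Outlier rule (assumed): drop values farther from the median than
    max_mad_multiples times the median absolute deviation (MAD).
    """
    if not proxy_afs:
        raise ValueError("at least one proxy allele frequency is required")
    center = median(proxy_afs)
    mad = median([abs(af - center) for af in proxy_afs])
    if mad == 0:
        # A single proxy, or proxies that mostly agree exactly: nothing to trim.
        return center
    kept = [af for af in proxy_afs if abs(af - center) <= max_mad_multiples * mad]
    return median(kept)


centrality = proxy_centrality([0.60, 0.62, 0.61, 0.59, 0.95])
single = proxy_centrality([0.40])
```

With a single proxy the function returns that proxy's frequency, consistent with the single-proxy case described above.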
In decision 210, the proxy frequency or frequencies (for example, a centrality measure of the observed frequencies of the one or more proxy sequences) are compared to the frequency or frequencies of the region of interest to determine if they are equal. Here and throughout this application the term “equal” includes “equal to within a desired range” or “equal to within a desired threshold” that can routinely be determined based on desired selectivity and specificity of the process 200. The range or threshold may be set, for example, using a statistical threshold or statistical test selected by one skilled in the art. If several proxies 110 are used and individual comparisons are made instead of combining the proxy frequencies as described above, then decision 210 results in a “yes” if a certain proportion of the comparisons (e.g., greater than 50%, greater than 55%, greater than 60%, greater than 65%, greater than 70%, greater than 75%, greater than 80%, greater than 85%, greater than 90%, or greater than 95%) are equal.
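The per-proxy variant of decision 210 can be sketched as a simple vote. The function names, the absolute tolerance of 0.05 used for “equal to within a desired threshold”, and the default proportion of 50% are assumptions chosen for illustration.

```python
from typing import List


def is_equal(af_a: float, af_b: float, tolerance: float = 0.05) -> bool:
    """'Equal' in the sense used here: equal to within a chosen tolerance."""
    return abs(af_a - af_b) <= tolerance


def decision_210(af_interest: float, proxy_afs: List[float],
                 tolerance: float = 0.05,
                 min_fraction_equal: float = 0.5) -> bool:
    """Return True ('yes') if more than min_fraction_equal of proxies match."""
    if not proxy_afs:
        raise ValueError("at least one proxy allele frequency is required")
    n_equal = sum(is_equal(af_interest, p, tolerance) for p in proxy_afs)
    return n_equal / len(proxy_afs) > min_fraction_equal


proxies = [0.58, 0.61, 0.63, 0.90]
matches = decision_210(0.60, proxies)      # three of the four proxies agree
no_match = decision_210(0.30, proxies)     # none of the proxies agree
```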
If the proxy frequency is equal to the frequency of the sequence of interest, then the sequence of interest is classified as germline (step 212). Otherwise, the sequence of interest is classified as somatic (step 214). Alternatively, if proxies 110 were selected to be known to encode somatic information (instead of germline), then equal frequencies are interpreted as the sequence of interest being somatic and unequal frequencies are interpreted as the sequence of interest being germline.
In some implementations, the comparison in decision 210 may also be used to eliminate potentially erroneous classifications. In particular, the frequency of a true somatic variant is necessarily less than that of a true germline variant, because both tumor and non-tumor DNA contribute to a germline variant's frequency count, while only tumor DNA contributes to a somatic variant's frequency count. Thus, in some implementations, if the frequency of the sequence of interest is greater than the proxy frequency, then the sequence of interest is classified as germline.
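Steps 212 and 214, together with the safeguard just described, can be sketched as follows for the case in which the proxies are known germline sequences. The function name and the tolerance of 0.05 are hypothetical.

```python
def classify_against_proxy(af_interest: float, proxy_af: float,
                           tolerance: float = 0.05) -> str:
    """Classify a sequence of interest using a germline proxy frequency."""
    # Safeguard: a somatic variant should not exceed the germline expectation,
    # so a frequency above the proxy frequency is treated as germline.
    if af_interest > proxy_af:
        return "germline"
    # Step 212: frequencies equal to within the tolerance.
    if abs(af_interest - proxy_af) <= tolerance:
        return "germline"
    # Step 214: frequency clearly below the germline expectation.
    return "somatic"
```

For example, with a proxy frequency of 0.61, an observed frequency of 0.60 or 0.80 would be called germline, and an observed frequency of 0.30 would be called somatic.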
By way of example, in some embodiments, comparing the observed frequency of the genomic sequence of interest to the centrality measure of observed frequencies of the one or more proxy genomic sequences can include determining an “allele frequency distance” (AFDIS) of the genomic sequence of interest from the expected allele frequency. The expected allele frequency if the genomic sequence of interest is a germline sequence is determined based on the frequency of the one or more proxy sequences (or summary statistic indicative of the observed frequencies of the one or more proxy sequences), which are assumed to be germline based on the selection of the one or more proxy sequences. The AFDIS may be numerically expressed, in some embodiments, according to
$$\mathrm{AFDIS} = \mathrm{AF}_{\mathrm{germline}} - \mathrm{AF}_{\mathrm{variant}}$$
wherein $\mathrm{AF}_{\mathrm{germline}}$ is the expected allele frequency if the genomic sequence of interest were germline, as determined based on the observed allele frequency of the one or more proxy sequences, and $\mathrm{AF}_{\mathrm{variant}}$ is the observed allele frequency of the genomic sequence of interest.
In some embodiments, the allele frequency distance may be determined using a distribution of observed frequencies of the proxy genomic sequences. The distribution can be used to determine a probability that the genomic sequence of interest is germline or somatic. In some embodiments, the allele frequency distance is a probability that the observed frequency of the genomic sequence of interest fits within (or does not fit within) the distribution of observed frequencies of a plurality of proxy sequences. For example, if the allele frequency of the genomic sequence of interest fits within the distribution, the genomic sequence of interest may be identified as a germline sequence. If the allele frequency of the genomic sequence of interest does not fit within the distribution, the genomic sequence of interest may be identified as somatic. One skilled in the art may select a statistical test or predetermined threshold to determine if the allele frequency of the genomic sequence of interest fits within the distribution.
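One way to test whether an observed allele frequency “fits within” the distribution of proxy frequencies is sketched below. It assumes, purely for illustration, that the proxy frequencies are approximately normally distributed and applies a two-sided test at a significance level of 0.05; the text leaves the choice of statistical test and threshold to one skilled in the art, so both are assumptions, as are the function and parameter names.

```python
from statistics import NormalDist, mean, stdev
from typing import List


def fits_proxy_distribution(af_interest: float, proxy_afs: List[float],
                            alpha: float = 0.05) -> bool:
    """Return True if af_interest is consistent with the proxy distribution."""
    if len(proxy_afs) < 2:
        raise ValueError("at least two proxy allele frequencies are required")
    mu = mean(proxy_afs)
    sigma = stdev(proxy_afs)
    if sigma == 0:
        return af_interest == mu
    z = abs(af_interest - mu) / sigma
    # Two-sided tail probability under a standard normal (assumed model).
    p_value = 2 * (1 - NormalDist().cdf(z))
    return p_value >= alpha


proxy_afs = [0.58, 0.60, 0.61, 0.62, 0.59]
germline_like = fits_proxy_distribution(0.61, proxy_afs)
somatic_like = fits_proxy_distribution(0.30, proxy_afs)
```

A sequence that fits the distribution would be identified as germline and one that does not as somatic, as described above.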
In some embodiments, the allele frequency distance may be used to classify the genomic sequence of interest. For example, in some embodiments, if the allele frequency distance is above a selected threshold, the genomic sequence of interest is classified as somatic. In some embodiments, if the allele frequency distance is below a selected threshold, the genomic sequence of interest is classified as germline. The threshold may be set based on the accuracy or specificity tolerance desired.
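The allele frequency distance and the threshold rule can be sketched together as follows. The threshold of 0.1 is a hypothetical value; in practice it would be set from the accuracy or specificity tolerance desired. A distance exactly at the threshold is treated as germline in this sketch, which is likewise an assumption.

```python
def afdis(af_germline_expected: float, af_variant: float) -> float:
    """Allele frequency distance: expected germline AF minus observed AF."""
    return af_germline_expected - af_variant


def classify_by_afdis(distance: float, threshold: float = 0.1) -> str:
    """Call 'somatic' above the threshold and 'germline' otherwise."""
    return "somatic" if distance > threshold else "germline"


# Hypothetical values: proxies suggest 0.61, the variant is observed at 0.30.
distance = afdis(0.61, 0.30)
call = classify_by_afdis(distance)
```

Because the distance is signed, a variant observed above the expected germline frequency has a negative distance and is called germline by this rule.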
In some embodiments, classification of the genomic sequence of interest as germline or somatic may include the use of a statistical model. The statistical model can receive, for example, an allele frequency distance for a given genomic sequence of interest, and output a classification of the genomic sequence of interest as somatic (or likely somatic) or germline (or likely germline). The classification may be based on a probability of the genomic sequence of interest being somatic or germline. In some implementations, the genomic sequence of interest may be classified as ambiguous, for example, if the probability of the sequence being somatic or germline is not sufficiently high. The probability threshold for making a call can be based on a desired specificity and/or accuracy of the call. For example, in some embodiments, if the probability of the genomic sequence of interest being somatic is above any one of 0.8, 0.85, 0.9, 0.95, 0.96, 0.97, 0.98, or 0.99 (or any selected value therebetween), the genomic sequence of interest is classified as somatic, and if the probability of the genomic sequence of interest being somatic is below any one of 0.2, 0.15, 0.1, 0.05, 0.04, 0.03, 0.02, or 0.01 (or any selected value therebetween), the genomic sequence of interest is classified as germline. Genomic sequences of interest that are not classified as somatic or germline, based on the statistical model, may be labeled as ambiguous.
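The three-way call described above can be sketched as a function of the model's output probability. The default cutoffs of 0.9 and 0.1 are one pair drawn from the example values listed; the function and parameter names are hypothetical.

```python
def call_from_probability(p_somatic: float,
                          somatic_cutoff: float = 0.9,
                          germline_cutoff: float = 0.1) -> str:
    """Map a probability of being somatic to a somatic/germline/ambiguous call."""
    if not 0.0 <= p_somatic <= 1.0:
        raise ValueError("p_somatic must be between 0 and 1")
    if p_somatic > somatic_cutoff:
        return "somatic"
    if p_somatic < germline_cutoff:
        return "germline"
    return "ambiguous"
```

Moving the two cutoffs further apart trades a larger ambiguous class for fewer erroneous calls.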
In some embodiments, the statistical model is trained using data from one or more matched tumor/normal sample pairs. Normal samples in the matched tumor/normal sample pair can be sequenced to establish a ground truth for germline sequences, and the tumor sample can be sequenced to establish a ground truth for somatic variant sequences (i.e., those sequences that are not germline according to the matched normal sample). Sequencing data from the tumor sample, which can include a mixture of normal and tumor nucleic acid molecules, can be used to determine allele frequency distances for selected genomic sequences of interest, which are then labeled as somatic (probability of being somatic, psomatic, being equal to 1) or germline (psomatic being equal to 0). A function associating allele frequency distance to probability of being somatic can then be generated using the training data.
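A minimal sketch of generating such a function from labeled training data is shown below, assuming a one-variable logistic regression fitted by plain gradient descent. The training values are invented for illustration and are not data from this disclosure; the learning rate, iteration count, and all names are likewise assumptions, and a production system would more likely rely on an established statistics library.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(afdis_values: List[float], labels: List[int],
                 learning_rate: float = 1.0,
                 n_iter: int = 5000) -> Tuple[float, float]:
    """Fit p_somatic = sigmoid(b0 + b1 * AFDIS) by gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(afdis_values)
    for _ in range(n_iter):
        grad0, grad1 = 0.0, 0.0
        for x, y in zip(afdis_values, labels):
            error = sigmoid(b0 + b1 * x) - y  # gradient of the log loss
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# Invented training data: label 0 = germline (p_somatic = 0), 1 = somatic.
train_afdis = [0.00, 0.02, -0.01, 0.03, 0.05, 0.25, 0.30, 0.35, 0.20, 0.40]
train_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
b0, b1 = fit_logistic(train_afdis, train_labels)


def p_somatic(distance: float) -> float:
    """Probability of being somatic for a given allele frequency distance."""
    return sigmoid(b0 + b1 * distance)
```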
Other methods of training the statistical model may be used. For example, in some embodiments, the model is trained using only data for germline sequences or only data for somatic sequences.
In some implementations, the comparison of decision 210 may be indirectly performed by way of a statistical model. For example, if the median allele frequency of a collection of proxies is used as the centrality measure of step 206, then a logistic regression model may be constructed that describes the difference of the allele frequency of the sequence of interest from the median allele frequency of the proxies. In some implementations, this logistic regression model can be constructed from data for a collection of matched tumor/normal samples, such that the difference described in the previous sentence is proportional to the log-odds
$$\log\frac{p}{1-p}$$
where p represents the probability that the sequence of interest comprises a somatic variant.
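The logistic relationship can be written out explicitly as follows. The coefficients $\beta_0$ and $\beta_1$ are notation introduced here for illustration and would be estimated from training data; strict proportionality between the difference and the log-odds corresponds to the special case $\beta_0 = 0$.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\mathrm{AFDIS}
\quad\Longrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\mathrm{AFDIS})}}
```

With $\beta_1 > 0$, a larger allele frequency distance gives a probability of being somatic closer to 1, and a distance near zero or below gives a probability closer to 0.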
The rationale underlying this characterization is that each proxy is physically close to the sequence of interest in the patient's genome. Thus, it is likely that the proxy and the sequence of interest experience the same or similar genomic dynamics or mutations, such as duplication events or deletions. Rather than attempting to model the specific dynamics of the sequence of interest to correlate observed frequencies with germline/somatic status, this approach replaces such a model with a direct empirical measurement. Insofar as prior art models have historically been insensitive or inaccurate to some degree, this approach provides an advantage.
The methods described herein may further include generating a report that indicates one or more genomic sequences of interest as germline or somatic. The generated report can be transmitted to the patient, healthcare providers, or others (for example, using a computer network). The report is particularly beneficial for evaluating cancer treatment therapies, making treatment decisions, monitoring cancer progression or recurrence, designing personalized cancer vaccines, and other beneficial uses.
Input device 420 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, or voice-recognition device. Output device 430 can be any suitable device that provides output, such as a display, touch screen, haptics device, or speaker.
Memory storage 440 can be any suitable device that provides storage, such as an electrical, magnetic or optical memory including a RAM, cache, hard drive, or removable storage disk. Communication device 460 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device. The components of the computer can be connected in any suitable manner, such as via a physical bus or wirelessly.
Software, such as the SGZ module 450 and other sequence analysis and variant calling program modules, which can be stored in memory storage 440 and executed by processor(s) 410, can include, for example, code for the AFDIS-based logistic regression models and other programming that embodies the functionality of the present disclosure (e.g., as embodied in the devices as described above).
Software, such as the SGZ module 450 and other sequence analysis and variant calling program modules, can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a computer-readable storage medium can be any medium, such as storage 440, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.
Software, such as the SGZ module 450 and other sequence analysis and variant calling program modules, can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a transport medium can be any medium that can communicate, propagate or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic or infrared wired or wireless propagation medium.
Device 400 may be connected to a network, which can be any suitable type of interconnected communication system. The network can implement any suitable communications protocol and can be secured by any suitable security protocol. The network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.
Device 400 can implement any operating system suitable for operating on the network. Software, such as the SGZ module 450 and other sequence analysis and variant calling program modules, can be written in any suitable programming language, such as C, C++, Java or Python. In various embodiments, application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service, for example.
The subject samples (e.g., patient samples) used with the methods described herein may include a mixture of tumor and non-tumor nucleic acid molecules. The tumor nucleic acid molecules may be obtained directly or indirectly from the tumor. For example, the tumor nucleic acid molecules may be obtained from a tissue biopsy of a tumor. Tumor biopsies often include both tumor and non-tumor tissue, thereby providing a mixture of tumor and non-tumor nucleic acid molecules. In some embodiments, the tumor and non-tumor nucleic acid molecules are obtained from a bodily fluid or liquid biopsy sample (e.g., blood, plasma, spinal fluid, etc.), that may include cell-free (or circulating free) DNA including tumor (e.g., circulating tumor DNA, or ctDNA) and non-tumor cell-free nucleic acid molecules.
The patient sample may be taken, for example, from a subject with cancer, a subject suspected of having cancer, or a subject having previously been treated for a cancer. In certain embodiments, the sample is acquired from a subject having a solid tumor, a hematological cancer, or a metastatic form thereof. In certain embodiments, the sample is obtained from a subject having a cancer, or at risk of having a cancer. In certain embodiments, the sample is obtained from a subject who has not received a therapy to treat a cancer, is receiving a therapy to treat a cancer, or has received a therapy to treat a cancer, as described herein.
A variety of tissues can be the source of the samples used in the present methods. Genomic or subgenomic nucleic acid (e.g., DNA or RNA) can be isolated from a subject's sample (e.g., a sample comprising tumor cells, a blood sample, a blood constituent sample, a sample comprising cell-free DNA (cfDNA), a sample comprising circulating tumor DNA (ctDNA), a sample comprising circulating tumor cells (CTCs), or any normal control (e.g., a normal adjacent tissue (NAT))).
In some embodiments, the sample is acquired from a liquid biopsy. A liquid biopsy patient sample may be derived from, for example, blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva.
In some embodiments, the patient sample is derived from a solid tissue sample, such as a solid tumor biopsy. Solid tumor biopsies often include a mixture of tumor and non-tumor tissue. In some embodiments, the solid tissue biopsy sample is a fresh sample. In some embodiments, the solid tissue biopsy sample is a frozen sample or previously frozen sample. In some embodiments, the solid tissue biopsy sample is a preserved sample (for example, a chemically preserved sample). In certain embodiments, the sample is a formalin-fixed paraffin-embedded (FFPE) sample.
In some embodiments, the tumor purity of the patient sample (i.e., the portion of the sample that is tumor nucleic acid molecules compared to total nucleic acid molecules) for any of the sample types disclosed herein is about 1% or more, about 5% or more, about 10% or more, about 15% or more, about 20% or more, about 25% or more, about 30% or more, about 40% or more, about 50% or more, about 60% or more, about 70% or more, or about 80% or more. In some embodiments, the tumor purity of the patient sample is about 99% or less, about 95% or less, about 90% or less, about 85% or less, about 80% or less, about 75% or less, about 70% or less, about 60% or less, about 50% or less, about 40% or less, about 30% or less, about 25% or less, or about 20% or less.
In one embodiment, the method further includes obtaining a sample, e.g., a patient sample described herein. The sample can be acquired directly or indirectly. In an embodiment, the sample is acquired, e.g., by isolation or purification, from a sample that comprises cfDNA. In an embodiment, the sample is acquired, e.g., by isolation or purification, from a sample that comprises ctDNA. In an embodiment, the sample is acquired, e.g., by isolation or purification, from a sample that comprises both malignant cells and non-malignant cells (e.g., tumor-infiltrating lymphocytes). In an embodiment, the sample is acquired, e.g., by isolation or purification, from a sample that comprises CTCs. In some embodiments, the sample is obtained by a solid tissue biopsy.
A sequencing library can be prepared from a patient sample using known methods. The nucleic acid molecules may be purified or isolated from the patient sample. In some embodiments, the isolated nucleic acids are fragmented or sheared using a known method. For example, nucleic acid molecules may be fragmented by physical shearing methods (e.g., sonication), enzymatic cleavage methods, chemical cleavage methods, and other methods well known to those skilled in the art. The nucleic acid may be ligated to an adapter sequence for sequencing. In some instances, the adapter may comprise an amplification primer and/or sequencing adapter. In some instances, nucleic acid molecules purified or isolated from the patient sample, or the sequencing library prepared therefrom, may be amplified, e.g., using a polymerase chain reaction (PCR) or isothermal amplification method known to those of skill in the art.
In some embodiments, the nucleic acid molecules from the patient sample and used to prepare a sequencing library (or a selected (e.g., captured) subset thereof) are sequenced to generate a patient genomic sequence. Sequencing methods are well known in the art, and may be performed using multiplexed (e.g., next-generation) or single molecule sequencing. The patient genomic sequence determined by sequencing need not be the full genome of the patient. For example, in some embodiments, targeted sequencing methods (e.g., using specific probes (or bait) molecules for hybridization-based capture) are used to sequence portions of the patient's genome (i.e., less than the full genome). See, for example, U.S. Pat. No. 9,340,830 B2. Targeted sequencing may be used to target, for example, one or more exon regions, one or more intron regions, one or more intragenic regions, one or more 3′-UTRs (untranslated regions), and/or one or more 5′-UTRs.
In some embodiments, targeted sequencing may be used to sequence one or more genes, or portions of one or more genes, associated with cancer. Exemplary genes associated with cancer that may be sequenced using targeted sequencing include, but are not limited to ABL2, AKT2, AKT3, ARAF, ARFRP1, ARID1A, ATM, ATR, AURKA, AURKB, BCL2, BCL2A1, BCL2L1, BCL2L2, BCL6, BRCA1, BRCA2, CARD11, CBL, CCND1, CCND2, CCND3, CCNE1, CDH1, CDH2, CDH20, CDH5, CDK4, CDK6, CDK8, CDKN2B, CDKN2C, CHEK1, CHEK2, CRKL, CRLF2, DNMT3A, DOT1L, EPHA3, EPHA5, EPHA6, EPHA7, EPHB1, EPHB4, EPHB6, ERBB3, ERBB4, ERG, ETV1, ETV4, ETV5, ETV6, EWSR1, EZH2, FANCA, FBXW7, FGFR4, FLT1, FLT4, FOXP4, GATA1, GNA11, GNAQ, GNAS, GPR124, GUCY1A2, HOXA3, HSP90AA1, IDH1, IDH2, IGF1R, IGF2R, IKBKE, IKZF1, INHBA, IRS2, JAK1, JAK3, JUN, KDR, LRP1B, LTK, MAP2K1, MAP2K2, MAP2K4, MCL1, MDM2, MDM4, MEN1, MITF, MLH1, MPL, MRE11A, MSH2, MSH6, MTOR, MUTYH, MYCL1, MYCN, NF2, NKX2-1, NTRK1, NTRK3, PAK3, PAX5, PDGFRB, PIK3R1, PKHD1, PLCG1, PRKDC, PTCH1, PTPN11, PTPRD, RAF1, RARA, RICTOR, RPTOR, RUNX1, SMAD2, SMAD3, SMAD4, SMARCA4, SMARCB1, SMO, SOX10, SOX2, SRC, STK11, TBX22, TET2, TGFBR2, TMPRSS2, TOP1, TSC1, TSC2, USP9X, VHL, WT1, ABL1, AKT1, ALK, APC, AR, BRAF, CDKN2A, CEBPA, CTNNB1, EGFR, ERBB2, ESR1, FGFR1, FGFR2, FGFR3, FLT3, HRAS, JAK2, KIT, KRAS, MET, MLL, MYC, NF1, NOTCH1, NPM1, NRAS, PDGFRA, PIK3CA, PTEN, RB1, RET, and TP53.
In certain embodiments, the sample is acquired from a subject having a cancer. Exemplary cancers include, but are not limited to, B cell cancer, e.g., multiple myeloma, melanomas, breast cancer, lung cancer (such as non-small cell lung carcinoma or NSCLC), bronchus cancer, colorectal cancer, prostate cancer, pancreatic cancer, stomach cancer, ovarian cancer, urinary bladder cancer, brain or central nervous system cancer, peripheral nervous system cancer, esophageal cancer, cervical cancer, uterine or endometrial cancer, cancer of the oral cavity or pharynx, liver cancer, kidney cancer, testicular cancer, biliary tract cancer, small bowel or appendix cancer, salivary gland cancer, thyroid gland cancer, adrenal gland cancer, osteosarcoma, chondrosarcoma, cancer of hematological tissues, adenocarcinomas, inflammatory myofibroblastic tumors, gastrointestinal stromal tumor (GIST), colon cancer, multiple myeloma (MM), myelodysplastic syndrome (MDS), myeloproliferative disorder (MPD), acute lymphocytic leukemia (ALL), acute myelocytic leukemia (AML), chronic myelocytic leukemia (CML), chronic lymphocytic leukemia (CLL), polycythemia vera, Hodgkin lymphoma, non-Hodgkin lymphoma (NHL), soft-tissue sarcoma, fibrosarcoma, myxosarcoma, liposarcoma, osteogenic sarcoma, chordoma, angiosarcoma, endotheliosarcoma, lymphangiosarcoma, lymphangioendotheliosarcoma, synovioma, mesothelioma, Ewing's tumor, leiomyosarcoma, rhabdomyosarcoma, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatoma, bile duct carcinoma, choriocarcinoma, seminoma, embryonal carcinoma, Wilms' tumor, bladder carcinoma, epithelial carcinoma, glioma, astrocytoma, medulloblastoma, craniopharyngioma, ependymoma, pinealoma, hemangioblastoma, acoustic neuroma, oligodendroglioma, meningioma, neuroblastoma, retinoblastoma, follicular lymphoma, diffuse large B-cell lymphoma, mantle cell lymphoma, hepatocellular carcinoma, thyroid cancer, gastric cancer, head and neck cancer, small cell cancers, essential thrombocythemia, agnogenic myeloid metaplasia, hypereosinophilic syndrome, systemic mastocytosis, familial hypereosinophilia, chronic eosinophilic leukemia, neuroendocrine cancers, carcinoid tumors, and the like.
In an embodiment, the cancer is a hematologic malignancy (or premalignancy). As used herein, a hematologic malignancy refers to a tumor of the hematopoietic or lymphoid tissues, e.g., a tumor that affects blood, bone marrow, or lymph nodes. Exemplary hematologic malignancies include, but are not limited to, leukemia (e.g., acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), chronic myelogenous leukemia (CML), hairy cell leukemia, acute monocytic leukemia (AMoL), chronic myelomonocytic leukemia (CMML), juvenile myelomonocytic leukemia (JMML), or large granular lymphocytic leukemia), lymphoma (e.g., AIDS-related lymphoma, cutaneous T-cell lymphoma, Hodgkin lymphoma (e.g., classical Hodgkin lymphoma or nodular lymphocyte-predominant Hodgkin lymphoma), mycosis fungoides, non-Hodgkin lymphoma (e.g., B-cell non-Hodgkin lymphoma (e.g., Burkitt lymphoma, small lymphocytic lymphoma (CLL/SLL), diffuse large B-cell lymphoma, follicular lymphoma, immunoblastic large cell lymphoma, precursor B-lymphoblastic lymphoma, or mantle cell lymphoma) or T-cell non-Hodgkin lymphoma (mycosis fungoides, anaplastic large cell lymphoma, or precursor T-lymphoblastic lymphoma)), primary central nervous system lymphoma, Sézary syndrome, Waldenström macroglobulinemia), chronic myeloproliferative neoplasm, Langerhans cell histiocytosis, multiple myeloma/plasma cell neoplasm, myelodysplastic syndrome, or myelodysplastic/myeloproliferative neoplasm. Premalignancy, as used herein, refers to a tissue that is not yet malignant but is poised to become malignant.
In some embodiments, the sample is obtained, e.g., collected, from a subject, e.g., patient, with a condition or disease, e.g., a hyperproliferative disease (e.g., as described herein) or a non-cancer indication. In some embodiments, the disease is a hyperproliferative disease. In some embodiments, the hyperproliferative disease is a cancer, e.g., a solid tumor or a hematological cancer. In some embodiments, the cancer is a solid tumor. In some embodiments, the cancer is a hematological cancer, e.g., a leukemia or lymphoma.
In some embodiments, the subject has a cancer. In some embodiments, the subject has been, or is being treated, for cancer. In some embodiments, the subject is in need of being monitored for cancer progression or regression, e.g., after being treated with a cancer therapy. In some embodiments, the subject is in need of being monitored for relapse of cancer. In some embodiments, the subject is at risk of having a cancer. In some embodiments, the subject has not been treated with a cancer therapy. In some embodiments, the subject has a genetic predisposition to a cancer (e.g., having a mutation that increases his or her baseline risk for developing a cancer). In some embodiments, the subject has been exposed to an environment (e.g., radiation or chemical) that increases his or her risk for developing a cancer. In some embodiments, the subject is in need of being monitored for development of a cancer.
In some embodiments, the patient has been previously treated with a targeted therapy, e.g., one or more targeted therapies. In some embodiments, for a patient who has been previously treated with a targeted therapy, a post-targeted therapy sample, e.g., specimen, is obtained, e.g., collected. In some embodiments, the post-targeted therapy sample is a sample obtained, e.g., collected, after the completion of the targeted therapy.
In some embodiments, the patient has not been previously treated with a targeted therapy. In some embodiments, for a patient who has not been previously treated with a targeted therapy, the sample comprises a resection, e.g., an original resection, or a recurrence, e.g., disease recurrence post-therapy, e.g., non-targeted therapy. In some embodiments, the sample is or is part of a primary tumor or a metastasis, e.g., metastasis biopsy. In some embodiments, the sample is obtained from a site, e.g., tumor site, with the highest percent of tumor, e.g., tumor cells, as compared to adjacent sites, e.g., adjacent sites with tumor cells. In some embodiments, the sample is obtained from a site, e.g., tumor site, with the largest tumor focus as compared to adjacent sites, e.g., adjacent sites with tumor cells.
In some embodiments, the subject is a human.
The genomic profile of a cancer can often affect the likelihood of success of various cancer treatment modalities. For example, a given anti-cancer agent may be more likely to successfully treat a particular cancer having one genomic profile versus another. The methods described herein can be used to characterize the genomic profile of a cancer by distinguishing somatic sequences, which may be attributed to the cancer, from germline sequences.
By way of example, a method of treating cancer in a patient can include identifying (e.g., classifying) one or more genomic sequences of interest as somatic using a method described herein, and selecting a cancer treatment modality based on the one or more identified somatic sequences. The cancer can then be treated using an effective amount of the selected cancer treatment modality. This allows for personalized cancer treatment of the patient based on the somatic sequences specific to that patient's cancer. In contrast, if the treatment selection was based on a germline variant rather than a somatic variant, there is some risk that the selected treatment modality may be ineffective for the patient's cancer.
Exemplary cancer treatment modalities may include, for example, a selected chemotherapeutic agent, a selected immune-oncology agent (such as an immune checkpoint inhibitor), resection surgery, radiation therapy, targeted therapy, gene expression modulators, angiogenesis inhibitors, and hormone therapy, among others.
The cancer treatment may be selected, for example, based on an association between the one or more identified somatic sequences and successful cancer treatment using the selected treatment modality. Exemplary associations between cancer type, somatic sequence, and treatment modality are listed in Table 1.
Microsatellite instability (MSI) status of a cancer can be useful for selecting a treatment modality for the cancer. Microsatellite instability can result from deficient DNA mismatch repair (MMR) pathways in a cancer cell, which results in an abnormally high frequency of genetic mutations. See Kim, et al., The Landscape of Microsatellite Instability in Colorectal and Endometrial Cancer Genomes, Cell, vol. 155, no. 4, pp. 858-868 (2013). MSI status is generally characterized as being high (MSI-H), low (MSI-L), or stable (MSS) (or, alternatively, MSI-H or not MSI-H; or MSI-H or MSI-undetermined) based on MSI signatures. MSI-H status has been detected for multiple types of solid tumors, and may be an indicator of successful cancer treatment using certain cancer treatment modalities. See Cortes-Ciriano, et al., A molecular portrait of microsatellite instability across multiple cancers, Nature Communications, vol. 8, no. 15180 (2017). Mutations in the microsatellites (i.e., MSI events) can be detected by distinguishing somatic sequences from germline sequences using the methods described herein.
Success of certain cancer treatment modalities has been associated with MSI-H status of a cancer. For example, a PD-1 inhibitor (namely, pembrolizumab) has been found to be particularly effective in treating MSI-H solid tumors (for example, unresectable or metastatic solid tumors). In some embodiments, the cancer determined to have an MSI-H status is treated with an effective amount of an immune-oncology agent. In some embodiments, the cancer determined to have an MSI-H status is treated with an effective amount of an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is AMP-224, AMP-514, atezolizumab, AUNP12, avelumab, BGB-A317, BMS-986189, CA-170, camrelizumab, cemiplimab, CK-301, dostarlimab, durvalumab, ipilimumab, INCMGA00012, KN035, nivolumab, pembrolizumab, sintilimab, spartalizumab, tislelizumab, or toripalimab. In some embodiments, the cancer determined to have an MSI-H status is treated with an effective amount of a PD-1 inhibitor, a PD-L1 inhibitor, or a CTLA-4 inhibitor. In some embodiments, the cancer determined to have an MSI-H status is treated with an effective amount of pembrolizumab.
In some embodiments, the method of treating cancer includes identifying (e.g., classifying) one or more genomic sequences of interest as somatic using the method described herein; determining a microsatellite instability status of the cancer using the identified somatic sequences; and selecting a cancer treatment modality based on the microsatellite instability status of the cancer. The cancer can then be treated using an effective amount of the selected cancer treatment modality. In some embodiments, the cancer is colorectal cancer, endometrial cancer, biliary cancer, bladder cancer, breast cancer, esophageal cancer, gastric cancer, gastroesophageal junction cancer, pancreatic cancer, prostate cancer, renal cell cancer, retroperitoneal adenocarcinoma, sarcoma, small cell lung cancer, small intestinal cancer, or thyroid cancer.
In some embodiments, tumor mutational burden (TMB) of the cancer is determined using one or more somatic sequences identified using the method described herein to select a treatment modality. TMB is a genomic biomarker for the cancer that quantifies the frequency of somatic mutations in a patient's tumor. TMB-high correlates with higher neoantigen expression, which helps the immune system recognize tumors. It has been detected across numerous tumor types and has been associated with improved response rate and prolonged progression-free survival for patients on immunotherapy. See Goodman, et al., Tumor Mutational Burden as an Independent Predictor of Response to Immunotherapy in Diverse Cancers, Mol. Cancer Ther., vol. 16, no. 11, pp. 2598-2608 (2017).
The tumor mutational burden can be determined for a cancer by identifying somatic sequences associated with the cancer using the method described herein.
TMB can provide a quantitative value such that a cancer treatment modality may be selected based on the tumor mutational burden being above or below a predetermined tumor mutational burden threshold. In some embodiments, the predetermined threshold is about 5 mutations/Mb, about 10 mutations/Mb, about 15 mutations/Mb, about 20 mutations/Mb, about 25 mutations/Mb, about 30 mutations/Mb, about 40 mutations/Mb, about 50 mutations/Mb, or higher, or any number therebetween (for example, the predetermined threshold may be between about 5 mutations/Mb and about 50 mutations/Mb). By way of example, certain immune-oncology agents have been found to be particularly effective when used to treat tumors having a high tumor mutational burden. See, for example, Fabrizio, et al., Beyond microsatellite testing: assessment of tumor mutational burden identifies subsets of colorectal cancer who may respond to immune checkpoint inhibition, J. Gastrointestinal Oncology, vol. 9, no. 4, pp. 610-617 (2018).
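By way of a non-limiting illustration, the comparison of a tumor mutational burden against a predetermined threshold can be sketched as follows. The function names, the 1.2 Mb sequenced territory, and the count of 18 somatic variants are hypothetical values chosen for illustration only; the sketch simply assumes that TMB is expressed as the number of variants classified as somatic per megabase of sequenced territory.

```python
def tumor_mutational_burden(somatic_mutation_count: int, region_size_bp: int) -> float:
    """Return somatic mutations per megabase (Mb) of sequenced territory."""
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    return somatic_mutation_count / (region_size_bp / 1_000_000)


def is_above_tmb_threshold(tmb: float, threshold: float = 10.0) -> bool:
    """True when the TMB is above the predetermined threshold (mutations/Mb).

    The default of 10 mutations/Mb is one of the exemplary thresholds;
    any other predetermined value may be passed instead.
    """
    return tmb > threshold


# Hypothetical example: 18 variants classified as somatic across 1.2 Mb.
tmb = tumor_mutational_burden(18, 1_200_000)
print(f"TMB = {tmb:.1f} mutations/Mb; above threshold: {is_above_tmb_threshold(tmb)}")
```

Because only variants classified as somatic are counted, germline variants misclassified as somatic would inflate the value in this sketch, which is why the somatic/germline distinction precedes the TMB calculation.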
In some embodiments, the cancer determined to have a TMB above a predetermined threshold is treated with an effective amount of an immune-oncology agent. In some embodiments, the cancer determined to have a TMB above a predetermined threshold is treated with an effective amount of an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is AMP-224, AMP-514, atezolizumab, AUNP12, avelumab, BGB-A317, BMS-986189, CA-170, camrelizumab, cemiplimab, CK-301, dostarlimab, durvalumab, ipilimumab, INCMGA00012, KN035, nivolumab, pembrolizumab, sintilimab, spartalizumab, tislelizumab, or toripalimab. In some embodiments, the cancer determined to have a TMB above a predetermined threshold is treated with an effective amount of a PD-1 inhibitor, a PD-L1 inhibitor, or a CTLA-4 inhibitor. In some embodiments, the cancer determined to have a TMB above a predetermined threshold is treated with an effective amount of pembrolizumab. In some embodiments, the cancer determined to have a TMB above a predetermined threshold is treated with an effective amount of pembrolizumab, wherein the predetermined threshold is about 10 mutations/Mb.
In some embodiments, the method of treating cancer includes identifying one or more genomic sequences of interest as somatic using the method described herein; determining a tumor mutational burden for the cancer using the one or more identified somatic sequences; and selecting the cancer treatment modality based on the tumor mutational burden being above a predetermined tumor mutational burden threshold. The cancer can then be treated using an effective amount of the selected cancer treatment modality. In some embodiments, the cancer is colorectal cancer, endometrial cancer, biliary cancer, bladder cancer, breast cancer, esophageal cancer, gastric cancer, gastroesophageal junction cancer, pancreatic cancer, prostate cancer, renal cell cancer, retroperitoneal adenocarcinoma, sarcoma, small cell lung cancer, small intestinal cancer, or thyroid cancer.
Cancer progression monitoring and/or minimum residual disease detection is beneficial for evaluating a cancer treatment plan and/or monitoring a patient for cancer recurrence. A cancer patient may be treated for a cancer to a point where the cancer is no longer detectable. Nevertheless, the patient may remain susceptible to recurrence. The patient may be monitored for cancer recurrence by detecting nucleic acid molecules derived from a recurring tumor (for example, ctDNA molecules). In other embodiments, a cancer patient may be treated for a disease, and progression of the cancer (e.g., an increase or decrease in the amount of cancer) may be monitored by quantifying the amount of detected tumor nucleic acid molecules in the patient (e.g., a ctDNA level).
Identification of somatic sequences may be particularly useful in monitoring cancer progression or detecting minimum residual disease of a cancer. The somatic sequences provide a genomic signature for the cancer, and they can be used to distinguish tumor nucleic acid molecules from non-tumor nucleic acid molecules.
Patient samples may be obtained and analyzed at two or more time points to monitor cancer progression or recurrence of the cancer. A first sample is analyzed to identify one or more somatic sequences according to the methods described herein. The first sample may be obtained before, during, or after cancer treatment, although the patient generally has some amount of detectable cancer.
A second sample may be obtained at a later time point after the patient has been treated for the cancer, and can be analyzed to determine if the one or more of the identified somatic sequences are present in the sample. The presence of the somatic sequences indicates that the patient still has the cancer or that the cancer has recurred. Failure to detect the somatic sequences does not definitively prove that the patient is free from cancer, but indicates that the cancer level may be low.
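The comparison between the two time points can be sketched, purely for illustration, as an intersection between the somatic sequences identified in the first sample and the sequences detected in the second sample. The variant tuples and coordinates below are hypothetical placeholders rather than real findings, and the function name is an assumption of this sketch.

```python
from typing import Iterable, Set, Tuple

# (chromosome, position, reference allele, alternate allele); hypothetical encoding.
Variant = Tuple[str, int, str, str]


def persistent_somatic_variants(
    first_sample_somatic: Iterable[Variant],
    second_sample_detected: Iterable[Variant],
) -> Set[Variant]:
    """Return somatic variants from the first sample that are also seen in the second."""
    return set(first_sample_somatic) & set(second_sample_detected)


# Placeholder variants for illustration only.
first_sample = {("chr1", 1000, "C", "T"), ("chr2", 2000, "G", "A")}
second_sample = {("chr1", 1000, "C", "T"), ("chr3", 3000, "A", "G")}

hits = persistent_somatic_variants(first_sample, second_sample)
if hits:
    print("Somatic sequences detected: cancer persists or has recurred.")
else:
    print("Somatic sequences not detected: cancer level may be low.")
```

An empty intersection is reported only as a possibly low cancer level, consistent with the caution that failure to detect the somatic sequences does not prove that the patient is free from cancer.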
The second patient sample may be the same type of sample as the first patient sample, or may be a different sample type. In some embodiments, the second patient sample is obtained from a liquid biopsy. For example, the liquid biopsy patient sample may be blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample is obtained from a solid tissue sample such as a solid tumor biopsy. In some embodiments, the solid tissue biopsy sample is a fresh sample. In some embodiments, the solid tissue biopsy sample is a frozen sample or previously frozen sample. In some embodiments, the solid tissue biopsy sample is a preserved sample (for example, a chemically preserved sample). In certain embodiments, the sample is a formalin-fixed paraffin-embedded (FFPE) sample.
The somatic sequences may be detected in DNA or RNA (or both) from the second sample. The presence or absence of the somatic sequences in the second sample may be detected by sequencing, quantitative PCR (qPCR), reverse-transcription PCR (RT-PCR), fluorescent in situ hybridization (FISH), or any other suitable method of specific detection of the one or more somatic sequences. In certain embodiments, the nucleic acid molecules are isolated from the second sample. In some embodiments, the nucleic acid molecules are detected directly from the second sample.
In some embodiments, if the presence of the one or more somatic sequences is identified in the second sample, the patient may be treated for cancer using the same treatment modality with which the cancer was previously treated or using a different treatment modality.
In some embodiments, a method of monitoring cancer progression or recurrence in a patient includes identifying one or more genomic sequences of interest as somatic using a method described herein, wherein the patient sample is obtained from a patient having cancer; obtaining a second patient sample from the patient after the cancer has been treated; and detecting the presence or absence of the one or more genomic sequences of interest identified as somatic within the second patient sample. For example, the one or more genomic sequences of interest may be identified as somatic by selecting a genomic sequence of interest at a genomic locus from within a patient genomic sequence obtained for a patient sample comprising a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting one or more proxy genomic sequences for the genomic sequence of interest; determining an allele frequency distance using an observed allele frequency of the genomic sequence of interest and a summary statistic indicative of observed frequencies of the one or more proxy genomic sequences; and identifying the genomic sequence of interest as germline or somatic using the allele frequency distance. In some embodiments, the method comprises treating the cancer in the patient after the first patient sample is obtained from the patient and before the second patient sample is obtained from the patient. In some embodiments, the method comprises treating the cancer in the patient if the presence of the one or more genomic sequences of interest identified as somatic is detected within the second patient sample.
Somatic sequences detected in exon regions of various genes may be suitable as neoantigens, for example in the development of a personalized cancer vaccine. Peptides can be generated based on the nucleic acid sequence encoded by the somatic variant sequence, which can stimulate the immune system to kill the cancer cells. See, for example, Richters, et al., Best practices for bioinformatics characterization of neoantigens for clinical utility, Genome Medicine, vol. 11, no. 56 (2019).
In some embodiments, a method of selecting a neoantigen for a cancer vaccine personalized for a subject having cancer includes identifying one or more genomic sequences of interest as somatic using the method described herein, wherein the one or more genomic sequences of interest identified as somatic are located within an exon region of a gene; and selecting, from the one or more genomic sequences of interest identified as somatic, a genomic sequence that encodes a neoantigen suitable as a cancer vaccine for the subject. For example, the one or more genomic sequences of interest may be identified as somatic by selecting a genomic sequence of interest at a genomic locus from within a patient genomic sequence obtained for a patient sample comprising a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting one or more proxy genomic sequences for the genomic sequence of interest; determining an allele frequency distance using an observed allele frequency of the genomic sequence of interest and a summary statistic indicative of observed frequencies of the one or more proxy genomic sequences; and identifying the genomic sequence of interest as germline or somatic using the allele frequency distance.
In some embodiments, the method further comprises making a vaccine comprising the neoantigen.
The following example is provided to illustrate an exemplary embodiment of the invention described herein, and is not intended to limit the scope of the invention.
Previously described SGZ algorithms (see, e.g., Sun, et al. (2018), ibid.) can be used to determine the difference in expected variant allele frequency for somatic and germline variants (e.g., a mutation that replaces a C with a T) provided that the tumor fraction for the sample, allele count of the variant, and copy number of the genomic locus were determined, as shown in
wherein p is the tumor purity, V is the variant allele count, and C is the copy number of the allele. For example, given a tumor purity (p) of the sample as 0.25, a variant allele count (V) of 3, and a copy number (C) of 4, if the variant is somatic the expected allele frequency is 0.3 and if germline the expected allele frequency is 0.6. See, for example, Sun, et al., A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal, PLoS Comput Biol., vol. 14, no. 2, p. e1005965 (2018).
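The worked numbers above can be reproduced with the following sketch. The two expressions are not quoted from this document; they are the forms of the cited SGZ model that are consistent with the example values, under the assumptions that the non-tumor compartment is diploid and that a germline variant is heterozygous in non-tumor cells. The function name is hypothetical.

```python
from typing import Tuple


def expected_allele_frequencies(p: float, V: int, C: int) -> Tuple[float, float]:
    """Return (expected somatic AF, expected germline AF).

    p: tumor purity, V: variant allele count in tumor cells,
    C: copy number of the locus in tumor cells.
    Assumes diploid non-tumor cells carrying one variant copy if germline.
    """
    total_copies = p * C + 2 * (1 - p)   # tumor copies plus diploid normal copies
    af_somatic = (p * V) / total_copies  # variant present only in tumor cells
    af_germline = (p * V + (1 - p)) / total_copies  # plus one copy per normal cell
    return af_somatic, af_germline


af_somatic, af_germline = expected_allele_frequencies(p=0.25, V=3, C=4)
print(f"somatic: {af_somatic:.2f}, germline: {af_germline:.2f}")
```

With p = 0.25, V = 3, and C = 4, the denominator is 0.25 × 4 + 2 × 0.75 = 2.5, giving 0.75 / 2.5 = 0.3 for a somatic variant and 1.5 / 2.5 = 0.6 for a germline variant, in agreement with the values stated above.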
This Example provides an alternative approach to the previously described SGZ algorithms, which does not require modeling the tumor purity, variant allele count, or copy number values. The allele frequency distance from the expected germline allele frequency (AFDIS) is determined as:
AFDIS=AFgermline−AFvariant
AFgermline is the allele frequency of the sequence assuming the sequence is a definitive germline sequence, as defined by the allele frequency of the corresponding proxy sequences. AFvariant is the observed allele frequency of the given sequence being characterized. To understand the allele frequency distance distribution for germline variants, genomic sequences from 3802 tumor samples were segmented based on copy number uniformity using the Circular Binary Segmentation algorithm described in Olshen, et al., Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data, Biostatistics vol. 5, no. 4, pp. 557-572 (October 2004). Approximately 2.1 million known germline variants (identified in the dbSNP and/or gnomAD database) from the 3802 samples were selected, and the allele frequency (based on sequencing) of each germline variant was compared to the median allele frequency of proxy sequences within the same segment to determine the allele frequency distance of each germline variant. The probability density of the ˜2.1 million germline variants from 3,802 samples is shown in
An AFDIS threshold of 0.1, corresponding to a cumulative probability of 0.993 on the empirical cumulative distribution function (ECDF) of the germline allele frequency distances described above, was empirically determined to separate somatic from germline variants effectively. As indicated in Table 2, AFDIS thresholds ranging from about 0.05 to 0.1 all provided good discrimination between somatic and germline variants. Nevertheless, as explained below, a trained statistical model was built to estimate the probability of any given sequence being germline or somatic.
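A minimal sketch of the AFDIS calculation and a simple threshold rule is given below. It assumes that AFgermline is taken as the median observed allele frequency of the proxy sequences in the same segment, and that a variant whose AFDIS exceeds the threshold is called somatic; this decision rule, the function names, and the example allele frequencies are illustrative assumptions and do not represent the trained model.

```python
from statistics import median
from typing import Sequence


def allele_frequency_distance(variant_af: float, proxy_afs: Sequence[float]) -> float:
    """AFDIS = AFgermline - AFvariant, with AFgermline as the proxy median."""
    if not proxy_afs:
        raise ValueError("at least one proxy allele frequency is required")
    return median(proxy_afs) - variant_af


def classify_by_threshold(afdis: float, threshold: float = 0.1) -> str:
    """Assumed rule: a distance above the threshold suggests a somatic variant."""
    return "somatic" if afdis > threshold else "germline"


# Hypothetical proxy germline variants in the same copy-number segment.
proxies = [0.58, 0.60, 0.62]
for af in (0.30, 0.59):
    d = allele_frequency_distance(af, proxies)
    print(f"AF={af:.2f} AFDIS={d:.2f} -> {classify_by_threshold(d)}")
```

Using the median rather than the mean as the summary statistic keeps a single outlying proxy from shifting the germline reference in this sketch.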
Allele frequency distance was then determined for 92 genotype-matched high purity/low purity tumor samples with known germline sequences, somatic sequences, and tumor purity. The low purity sample was used to establish ground truth for the somatic/germline status of selected sequences, as in general a low purity sample is considered to be a close approximation of a normal sample and allows for reliable determination of somatic versus germline status of variants within.
Using the available data from 21 matched tumor/normal pairs (lung squamous cell carcinoma (n=5), ovary serous carcinoma (n=4), lung adenocarcinoma (n=3), breast invasive ductal carcinoma (n=2), anus carcinoma (n=1), bladder urothelial carcinoma (n=1), CRC (n=1), kidney clear cell carcinoma (n=1), ovary high grade serous carcinoma (n=1), skin sarcoma (n=1), uterus endometrial adenocarcinoma (n=1)), a logistic regression model was generated. The matched tumor/normal pairs allowed for confident determination of somatic and germline sequences.
wherein psomatic is the probability that a given variant is a somatic variant. See
The AFDIS data calculated as discussed above for variants in a total of 188 tumor samples in three different testing sets were input into the trained model to determine the probability of each selected sequence being somatic or germline. Based on somatic variant probability, the variant sequence was labeled as somatic (if above the somatic probability threshold), germline (if below the germline probability threshold), or ambiguous (i.e., between the somatic probability threshold and the germline probability threshold). See
The results of classification by the AFDIS classifier for a set of 93 tumor samples with matched normal samples used in the validation of the prior SGZ method demonstrate an improvement over the prior SGZ methods, as shown in
A non-limiting example of data for the sample-level sensitivity performance of the method is shown in
Non-limiting examples of data for classification of variants in the BRCA1 and BRCA2 genes are shown in
The disclosed methods for discriminating between somatic and germline variants are based on a comparison of allele frequency (AF) of the variant in question to the allele frequencies of known variants in close proximity to its genomic location. In some instances, as noted above, known germline variants in germline databases (e.g., public databases) can be used for comparison. If the AF of the variant in question is very similar to, or very different from, those of the known germline variants located in close proximity, one would conclude that the variant in question is very likely, or unlikely, to be germline, respectively.
In general, the AF of a given variant is determined mainly by its copy number and by the tumor fraction of the sample. Because tumor fraction is constant for a particular sample, the AF of a given variant in a given sample is largely determined by its copy number. This means that, to infer the somatic/germline status of a variant, its AF can be compared to the AF of germline variants of the same copy number. Two non-limiting examples of implementing such comparisons are described below and in Example 4.
In one implementation, one calculates an “allele frequency distance” (AFDIS) that represents the distance between the AF of the variant in question and the median AF of germline variants located on the same copy number segment (e.g., located on the same physically continuous piece of genomic segment, or located on a discontinuous piece of genomic segment as long as the segment is present at the same copy number as the variant in question). Initially, AFDIS was calculated as:
AFDIS=|MAFvariant−MAFsegment|
where MAF=minor allele frequency, i.e., the minor allele frequency for both the variant of interest and the median of the minor allele frequencies for the segment germline variants was used to calculate their absolute distance. A logistic regression model was then trained with a training dataset consisting of known somatic and germline variants to capture the relationship between “somatic probability” and AFDIS. The model was subsequently improved by using a distance with direction, i.e., redefining AFDIS as AFDIS=AFsegment−AFvariant, where AFsegment is the median of the allele frequencies for the segment germline variants. In this equation, the sign of AFDIS accounts for somatic variants having a lower allele frequency compared to germline variants of the same copy number when there is normal tissue, cells, or cfDNA admixed in the sample. This is because sequencing reads originating from the normal part of the sample or from normal cells in the blood carry germline variants but not somatic variants. The logistic regression model is trained to recognize that negative AFDIS is associated with a low probability of the variant being somatic. The use of the directional AFDIS calculation improved the performance of the model for discriminating between somatic and germline variants.
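By way of non-limiting illustration, the two AFDIS definitions and the single-variable logistic model can be sketched as follows. The logistic form is the standard one for a single predictor. The intercept, slope, and probability thresholds are hypothetical placeholders rather than fitted or disclosed values, and the function names are likewise illustrative. The three-way labeling follows the somatic/germline/ambiguous scheme described above.

```python
import math
from statistics import median


def minor_af(af: float) -> float:
    """Minor allele frequency: the smaller of AF and 1 - AF."""
    return min(af, 1.0 - af)


def afdis_absolute(variant_af: float, segment_afs: list[float]) -> float:
    """Initial definition: |MAFvariant - MAFsegment|."""
    segment_maf = median([minor_af(a) for a in segment_afs])
    return abs(minor_af(variant_af) - segment_maf)


def afdis_directional(variant_af: float, segment_afs: list[float]) -> float:
    """Improved definition: AFsegment - AFvariant (sign retained)."""
    return median(segment_afs) - variant_af


def somatic_probability(afdis: float, intercept: float, slope: float) -> float:
    """Standard logistic function with AFDIS as the single predictor."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * afdis)))


def label(p_somatic: float, somatic_threshold: float,
          germline_threshold: float) -> str:
    """Three-way call based on the somatic probability."""
    if p_somatic > somatic_threshold:
        return "somatic"
    if p_somatic < germline_threshold:
        return "germline"
    return "ambiguous"


if __name__ == "__main__":
    # Hypothetical coefficients and thresholds, for illustration only.
    INTERCEPT, SLOPE = -3.0, 40.0
    SOMATIC_T, GERMLINE_T = 0.9, 0.1
    segment = [0.48, 0.50, 0.52]  # hypothetical segment germline AFs
    for af in (0.20, 0.55, 0.425):
        d = afdis_directional(af, segment)
        p = somatic_probability(d, INTERCEPT, SLOPE)
        print(af, round(d, 3), label(p, SOMATIC_T, GERMLINE_T))
```

With a positive slope, a negative directional AFDIS (variant AF above the segment median) yields a low somatic probability, which is the behavior the text attributes to the trained model.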
The AFDIS-based approach has an advantage of simplicity and ease of calculation, and thus can be easily modified to include other considerations in a given implementation. Specifically, since AFDIS is the single predictive variable in the logistic regression model, one can easily adjust the AFDIS value to modify the outcome to account for other potential technical issues. For example, to account for increased uncertainty introduced by mild contamination of the nucleic acid sample, one can apply an adjustment to the AFDIS value according to the contamination level to move the AFDIS value into a range corresponding to more accurate classification of somatic/germline variants by the model. Similar adjustments can be made to account for additional uncertainties introduced by factors such as low read depth, noisy AF estimation, low segment germline SNP count, high variability in segment germline SNP AF, etc. The degree and manner of implementing these adjustments can be engineered and tuned using training datasets comprising known somatic and germline variants.
In this particular implementation, a large dataset of known germline variants was constructed, each with their own AF and the corresponding segment MAF, which is the median MAF of other known germline variants located in the same copy number segment
The disclosed methods provide exemplary techniques for selecting somatic variants from baseline tissue or liquid biopsy samples for plasma monitoring. Several additional measures have been devised to further enhance performance for this particular purpose, including: i) selection of well-behaved variants (e.g., by excluding variants located in genomic regions known to have or expected to have allele frequencies deviating from expected values (such as variants located in regions with repetitive sequences or in regions that share homology with other regions of the genome)) for constructing the logistic regression model, ii) incorporating prior knowledge of the likelihood of a variant being a germline, somatic, or clonal hematopoiesis of indeterminate potential (CHIP) variant based on historical data and public databases, and iii) taking into consideration the noise level of the variant call and its genomic context. These measures were found to enhance performance of somatic variant classification.
The ability of the disclosed AFDIS-based logistic regression models to distinguish somatic variants from germline variants in a sample was verified using, for example, data from matched tumor/normal pairs. The initial training and test datasets used for developing the logistic regression model and non-limiting examples of the resulting performance metrics (# false positives (FP), sensitivity, and positive predictive value (PPV)) for variant-level and sample-level performance are summarized in Table 4 and Table 5, respectively.
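By way of non-limiting illustration, the sensitivity and PPV metrics named above can be computed from confusion counts as sketched below. The sketch assumes that a somatic call is treated as the positive class, and the counts shown are hypothetical and are not taken from Tables 4 through 9.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true somatic variants that were called somatic."""
    if tp + fn == 0:
        raise ValueError("no true positives or false negatives")
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of somatic calls that are truly somatic."""
    if tp + fp == 0:
        raise ValueError("no positive calls")
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    tp, fn, fp = 90, 10, 5
    print(round(sensitivity(tp, fn), 3))
    print(round(positive_predictive_value(tp, fp), 3))
```

The same two functions apply at either the variant level or the sample level, depending on whether the counts tally individual variants or whole samples.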
The dataset used in a variant calling pipeline verification study included data from 86 matched tissue/peripheral blood mononuclear cell (PBMC) sample pairs. The variant-level and sample-level performance metrics are summarized in Table 6 and Table 7, respectively.
The dataset used in additional variant calling pipeline verification studies included data from 746 matched tissue/peripheral blood mononuclear cell (PBMC) sample pairs. The variant-level and sample-level performance metrics are summarized in Table 8 and Table 9, respectively.
It will be appreciated that the methods and systems described above are set forth by way of example and not of limitation. Numerous variations, additions, omissions, and other modifications will be apparent to one of ordinary skill in the art. In addition, the order or presentation of method steps in the description and drawings above is not intended to require this order of performing the recited steps unless a particular order is expressly required or otherwise clear from the context.
The method steps of the invention(s) described herein are intended to include any suitable method of causing one or more other parties or entities to perform the steps, unless a different meaning is expressly provided or otherwise clear from the context. In some aspects, such parties or entities need not be under the direction or control of any other party or entity, and need not be located within a particular jurisdiction. Thus, for example, a description or recitation of “adding a first number to a second number” includes causing one or more parties or entities to add the two numbers together. For example, if person X engages in an arm's length transaction with person Y to add the two numbers, and person Y indeed adds the two numbers, then both persons X and Y perform the step as recited: person Y by virtue of the fact that he actually added the numbers, and person X by virtue of the fact that he caused person Y to add the numbers. Furthermore, if person X is located within the United States and person Y is located outside the United States, then the method is performed in the United States by virtue of person X's participation in causing the step to be performed.
The terminology used in the description of the various described embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various described embodiments and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The disclosures of all publications, patents, and patent applications referred to herein are each hereby incorporated by reference in their entireties. To the extent that any reference incorporated by reference conflicts with the instant disclosure, the instant disclosure shall control.
While particular embodiments of the present invention have been shown and described, it will be apparent to those skilled in the art that various changes and modifications in form and details may be made therein without departing from the spirit and scope of the invention as defined by the following claims. The claims that follow are intended to include all such variations and modifications that might fall within their scope, and should be interpreted in the broadest sense allowable by law.
This application claims the benefit of U.S. Provisional Patent Application No. 63/035,572, filed on Jun. 5, 2020, and of U.S. Provisional Patent Application No. 63/041,437, filed on Jun. 19, 2020, both of which are incorporated by reference herein in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/035751 | 6/3/2021 | WO | |

Number | Date | Country |
---|---|---|
63041437 | Jun 2020 | US |
63035572 | Jun 2020 | US |