Circulating throughout the bloodstream of a pregnant woman and separate from cellular tissue are small pieces of deoxyribonucleic acid (DNA), often referred to as cell-free DNA (cfDNA). The cfDNA in the maternal bloodstream includes cfDNA from both the mother (i.e., maternal cfDNA) and the fetus (i.e., fetal cfDNA). The fetal cfDNA originates from the placental cells undergoing apoptosis, and constitutes up to 30% of the total circulating cfDNA, with the balance originating from the maternal genome.
Recent technological developments have allowed for noninvasive prenatal screening of chromosomal aneuploidy in the fetus by exploiting the presence of fetal cfDNA circulating in the maternal bloodstream. Noninvasive methods relying on cfDNA sampled from the pregnant woman's blood serum are particularly advantageous over chorionic villi sampling or amniocentesis, both of which risk substantial injury and possible pregnancy loss.
Various noninvasive cfDNA-based screening procedures have proven to be useful in positively identifying certain chromosomal abnormalities, including trisomy 21 (i.e., Down syndrome), trisomy 18 (i.e., Edwards syndrome), trisomy 13 (i.e., Patau syndrome), microdeletions, and various other small fetal copy number variations. False-positive rates of detection for these disorders are relatively low with noninvasive cfDNA-based screening. However, a high proportion of all false-positive results in such screenings can be ascribed to copy-number variants in the maternal DNA.
The disclosures of all publications referred to herein are each hereby incorporated herein by reference in their entireties. To the extent that any reference incorporated by reference conflicts with the instant disclosure, the instant disclosure shall control.
As will be described in greater detail below, the instant disclosure describes various systems and methods for optimizing performance of DNA-based noninvasive prenatal screens to reduce false aneuploidy calls and for performing DNA-based noninvasive prenatal screens.
In one embodiment, a computer-implemented method for optimizing performance of a DNA-based noninvasive prenatal screen may include generating a plurality of synthetic sequencing datasets, each of the plurality of synthetic sequencing datasets representing genetic sequencing data from a sample including maternal and fetal cell-free DNA (cfDNA), by, for each of the plurality of synthetic sequencing datasets, (i) generating at least one of a plurality of synthetic copy number variants including a synthetic number of copies of at least a portion of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest, and (ii) modifying a real sequencing dataset, which includes genetic sequencing data from a real test sample including maternal and fetal cfDNA, by replacing a number of real sequencing reads from the one or more segments within the region of interest in the real test sample with the synthetic number of sequencing reads. The computer-implemented method may also include calculating a potential impact of each of the plurality of synthetic copy number variants on a fetal chromosomal abnormality call during DNA-based noninvasive prenatal screening based on the plurality of synthetic sequencing datasets.
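By way of illustration only, the generation of synthetic sequencing datasets described above may be sketched as follows. The sketch assumes that each segment is represented by a per-bin read count and that the synthetic read count is obtained by simple proportional scaling of the real count; the function names `spike_in_cnv` and `synthetic_series`, the bin layout, and the example counts are hypothetical and are not taken from the disclosure. The scaling treats the whole sample as carrying the variant and does not model fetal fraction.

```python
from typing import Dict, List, Sequence


def spike_in_cnv(real_bins: Sequence[int], start: int, end: int,
                 real_copies: int, synth_copies: int) -> List[int]:
    """Return a copy of real_bins with bins [start, end) replaced by synthetic counts.

    Assumption: the synthetic count is the real count scaled by
    synth_copies / real_copies (simple proportional scaling).
    """
    if real_copies <= 0:
        raise ValueError("real_copies must be positive")
    scale = synth_copies / real_copies
    synthetic = list(real_bins)  # leave the real dataset untouched
    for i in range(start, end):
        synthetic[i] = round(real_bins[i] * scale)
    return synthetic


def synthetic_series(real_bins: Sequence[int], sizes: Sequence[int],
                     real_copies: int = 2, synth_copies: int = 3) -> Dict[int, List[int]]:
    """Build one synthetic dataset per CNV size (in bins), each starting at bin 0."""
    return {n: spike_in_cnv(real_bins, 0, n, real_copies, synth_copies)
            for n in sizes}


# Hypothetical per-bin read counts from one real test sample.
real = [100, 102, 98, 100, 100, 100]
datasets = synthetic_series(real, sizes=[1, 2, 4])
print(datasets[2])
```

Each entry of `datasets` can then be passed through the same calling pipeline as a real sample, so that the effect of CNV size on the resulting call can be measured.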
In some embodiments, the method may further include determining, based on the calculated potential impacts of the plurality of synthetic copy number variants on the fetal chromosomal abnormality calls, at least one threshold feature value utilized in the DNA-based noninvasive prenatal screening to identify likely false fetal chromosomal abnormality calls. The threshold feature value may include a threshold percentage of a chromosome covered by at least one copy number variant. The threshold feature value may additionally or alternatively include a threshold base pair length of at least one copy number variant. A feature value above the threshold feature value may indicate a likely false fetal chromosomal abnormality call. The method may further include calculating a potential impact of each of a plurality of real copy number variants on a fetal chromosomal abnormality call during the DNA-based noninvasive prenatal screening based on a plurality of real sequencing datasets each including genetic sequencing data of a real reference sample including one of the plurality of real copy number variants. In this example, determining the at least one threshold feature value utilized in the DNA-based noninvasive prenatal screening may further include determining the at least one threshold feature value based on the calculated potential impacts of both the plurality of synthetic copy number variants and the plurality of real copy number variants on the fetal chromosomal abnormality calls.
In at least one embodiment, the region of interest may include a chromosome or a selected portion of a chromosome. Calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include determining a quantity of target sequencing reads in each of the plurality of synthetic sequencing datasets, the target sequencing reads corresponding to identified target sequences. The target sequencing reads may each be mappable to a unique location in a reference genome. The at least one of the plurality of synthetic copy number variants may include a synthetic maternal copy number variant. The at least one of the plurality of synthetic copy number variants may additionally include a synthetic fetal copy number variant.
In some embodiments, calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include calculating a statistical z-score for each of the plurality of synthetic sequencing datasets. Calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include calculating a statistical z-score change attributable to at least one of the plurality of synthetic copy number variants. The method may further include correlating each of the calculated statistical z-scores and/or each of the calculated statistical z-score changes to a copy number variant size of the at least one of the plurality of synthetic copy number variants. The method may further include correlating each of the calculated statistical z-scores to a copy number variant type of at least one of the plurality of synthetic copy number variants. Calculating the statistical z-score for each of the plurality of synthetic sequencing datasets may include calculating a statistical z-score for the region of interest in the corresponding synthetic sequencing dataset. In this example, calculating the statistical z-score for the region of interest in the corresponding synthetic sequencing dataset may include calculating an average read count in the region of interest in the corresponding synthetic sequencing dataset.
In at least one embodiment, calculating the statistical z-score for each of the plurality of synthetic sequencing datasets may include calculating a statistical z-score for another region of interest in the corresponding synthetic sequencing dataset. In this example, calculating the statistical z-score for the other region of interest in the corresponding synthetic sequencing dataset may include calculating an average read count in the other region of interest in the corresponding synthetic sequencing dataset. Additionally or alternatively, calculating the statistical z-score for each of the plurality of synthetic sequencing datasets may include determining a number of target sequencing reads in each of a plurality of bins. In this example, calculating the statistical z-score for each of the plurality of synthetic sequencing datasets may further include calculating the statistical z-score based on the average number of target sequencing reads per bin for the plurality of bins.
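As a non-limiting sketch, a statistical z-score based on the average number of target sequencing reads per bin may be computed as shown below. The disclosure does not fix a particular formula at this point, so the sketch assumes the conventional definition (sample value minus the reference mean, divided by the reference standard deviation) and assumes that the reference distribution is taken from a set of reference samples. The function name `region_z_score` and the example counts are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def region_z_score(sample_bins: Sequence[float],
                   reference_bin_sets: Sequence[Sequence[float]]) -> float:
    """Z-score of the sample's average reads per bin against reference samples.

    Assumption: z = (sample average - mean of reference averages)
                    / sample standard deviation of reference averages.
    """
    ref_means = [mean(bins) for bins in reference_bin_sets]
    spread = stdev(ref_means)  # requires at least two reference samples
    if spread == 0:
        raise ValueError("reference samples have zero variance")
    return (mean(sample_bins) - mean(ref_means)) / spread


# Hypothetical bin counts for one region of interest.
references = [[98, 100, 102], [99, 101, 103], [97, 99, 101]]
sample = [104, 106, 108]
z = region_z_score(sample, references)
print(z)
```

In practice the per-bin counts would typically be normalized before this step; the sketch operates on whatever values it is given.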
According to some embodiments, one or more of the plurality of synthetic sequencing datasets may further include sequencing reads from one or more additional segments corresponding to real copy number variants in the respective real test samples. Each of the plurality of synthetic copy number variants may include a deletion or a duplication. The region of interest may include at least a portion of human chromosome 1, 13, 18, 21, or X. In at least one embodiment, calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include calculating a potential impact of each of the plurality of synthetic copy number variants on a fetal chromosomal abnormality call for a specified chromosome that includes the region of interest during DNA-based noninvasive prenatal screening. Additionally or alternatively, calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include calculating a potential impact of each of the plurality of synthetic copy number variants on a fetal chromosomal abnormality call for a chromosome that does not include the region of interest during DNA-based noninvasive prenatal screening. In at least one embodiment, the fetal chromosomal abnormality call may include a chromosomal aneuploidy call. The chromosomal aneuploidy call may include a chromosomal trisomy call and/or a chromosomal monosomy call. According to some embodiments, the fetal chromosomal abnormality call may include a chromosomal microdeletion call, and/or a chromosomal microduplication call.
In some embodiments, the synthetic number of sequencing reads from each of the one or more segments within the region of interest may be generated by increasing or decreasing the number of real sequencing reads from the one or more segments within the region of interest in the real test sample in proportion to an integer number of copies of the region of interest in the real test sample. In this example, the number of real sequencing reads from each of the one or more segments within the region of interest in the real test sample may be normalized by dividing the number of real sequencing reads from each segment from the real test sample by an average number of real sequencing reads from a corresponding segment from one or more real reference samples. Additionally or alternatively, the number of real sequencing reads from each of the one or more segments within the region of interest in the real test sample may be normalized by dividing the number of real sequencing reads from each segment from the real test sample by an average number of real sequencing reads from one or more segments within the region of interest in the real test sample. The number of real sequencing reads from each of the one or more segments within the region of interest in the real test sample may be normalized for GC content bias or mappability. In at least one embodiment, the number of real sequencing reads from each of the one or more segments within the region of interest in the real test sample may be normalized by fitting a probability distribution based on random subsampling.
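The two division-based normalizations described above may be sketched, for illustration only, as follows. The helper names `normalize_by_reference` and `normalize_by_region_average` and the example counts are hypothetical, and the mean is used as the average (the disclosure also permits a median). Normalization for GC content bias or mappability and the subsampling-based distribution fit are not shown.

```python
from statistics import mean
from typing import List, Sequence


def normalize_by_reference(sample: Sequence[float],
                           references: Sequence[Sequence[float]]) -> List[float]:
    """Divide each segment's count by the average count of the corresponding
    segment across one or more reference samples."""
    normalized = []
    for i, count in enumerate(sample):
        ref_avg = mean(ref[i] for ref in references)
        if ref_avg == 0:
            raise ValueError(f"reference average is zero for segment {i}")
        normalized.append(count / ref_avg)
    return normalized


def normalize_by_region_average(sample: Sequence[float]) -> List[float]:
    """Divide each segment's count by the average count over the segments
    of the same region in the same sample."""
    region_avg = mean(sample)
    if region_avg == 0:
        raise ValueError("region average is zero")
    return [count / region_avg for count in sample]


# Hypothetical counts for three segments.
sample = [150, 200, 25]
references = [[100, 200, 50], [100, 200, 50]]
by_reference = normalize_by_reference(sample, references)
by_region = normalize_by_region_average(sample)
print(by_reference, by_region)
```

Normalizing against reference samples removes segment-specific depth differences shared by all samples, whereas normalizing against the region average only rescales the sample to its own overall depth.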
According to some embodiments, the method may further include determining, based on the calculated potential impacts of the plurality of synthetic copy number variants on the fetal chromosomal abnormality calls, robustness of a fetal abnormality caller. In this example, the method may further include modifying the fetal abnormality caller based on the determined robustness of the fetal abnormality caller. Determining the robustness of the fetal abnormality caller may include determining a specificity of the fetal abnormality caller over a range of synthetic copy number variant sizes.
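For illustration, specificity over a range of synthetic copy number variant sizes may be tabulated as sketched below. The sketch assumes that every synthetic dataset was built from a real test sample known to be unaffected, so that any positive call is a false positive; the function name `specificity_by_cnv_size` and the example results are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def specificity_by_cnv_size(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Compute specificity per synthetic CNV size.

    results: (cnv_size_bp, called_positive) pairs, one per synthetic dataset.
    Assumption: all underlying samples are true negatives, so
    specificity = negative calls / all calls at that size.
    """
    totals: Dict[int, int] = defaultdict(int)
    negatives: Dict[int, int] = defaultdict(int)
    for size, called_positive in results:
        totals[size] += 1
        if not called_positive:
            negatives[size] += 1
    return {size: negatives[size] / totals[size] for size in sorted(totals)}


# Hypothetical caller output for synthetic duplications of three sizes.
results = [
    (100_000, False), (100_000, False),
    (1_000_000, False), (1_000_000, True),
    (5_000_000, True), (5_000_000, True),
]
spec = specificity_by_cnv_size(results)
print(spec)
```

A size at which specificity begins to fall is one candidate for a threshold feature value, and a caller modification can be evaluated by re-running the same tabulation.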
In some embodiments, a method for performing a DNA-based noninvasive prenatal screen on a sample that includes maternal DNA and fetal DNA may include (i) isolating cfDNA fragments from a sample that includes maternal cfDNA and fetal cfDNA, (ii) sequencing each of the cfDNA fragments to obtain a plurality of fragment sequencing reads, (iii) identifying target sequencing reads of the plurality of fragment sequencing reads, the identified target sequencing reads being mappable to specified locations of a reference genome, (iv) determining, out of the identified target sequencing reads, a quantity of target sequencing reads for a region of interest, (v) calculating a statistical z-score for the region of interest based on the quantity of target sequencing reads for the region of interest, (vi) determining whether the calculated statistical z-score for the region of interest is outside of a predetermined z-score range, a calculated statistical z-score outside of the predetermined z-score range representing a positive call for a fetal chromosomal abnormality in the region of interest of the fetal DNA, (vii) determining whether maternal genomic DNA from the individual includes at least one copy number variant, and (viii) determining, when the maternal genomic DNA from the individual is determined to include at least one copy number variant, whether a feature value of the at least one copy number variant is greater than a threshold feature value, a feature value greater than the threshold feature value indicating that a call for the fetal chromosomal abnormality is likely a false call.
According to at least one embodiment, the threshold feature value may include a threshold percentage of a chromosome covered by the at least one copy number variant. In this example, the threshold percentage may include about 8% or more. In some embodiments, the threshold percentage may include between about 8% and about 16% and/or between about 10% and about 14%. In at least one embodiment, the threshold feature value may include a threshold base pair length of the at least one copy number variant. According to some embodiments, the threshold feature value may be determined based on analysis of a plurality of synthetic sequencing datasets each representing genetic sequencing data, each of the plurality of synthetic sequencing datasets being generated by (i) generating at least one of a plurality of synthetic copy number variants including a synthetic number of copies of at least a portion of a specified region of interest represented by a synthetic number of sequencing reads from one or more segments within the specified region of interest, and (ii) modifying a real sequencing dataset that includes genetic sequencing data of a real test sample by replacing a number of real sequencing reads from the one or more segments within the specified region of interest in the real test sample with the synthetic number of sequencing reads. The threshold feature value may be further determined by calculating a potential impact of each of the plurality of synthetic copy number variants on a fetal chromosomal abnormality call during DNA-based noninvasive prenatal screening based on the plurality of synthetic sequencing datasets.
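The comparison of a feature value against a threshold percentage may be sketched, by way of example only, as follows. The default of 8% is taken from the "about 8% or more" example above; the interval representation, the function names `percent_chromosome_covered` and `likely_false_call`, and the chromosome length used in the example are hypothetical.

```python
from typing import Iterable, Tuple


def percent_chromosome_covered(cnv_intervals: Iterable[Tuple[int, int]],
                               chromosome_length_bp: int) -> float:
    """Percentage of a chromosome covered by CNVs.

    cnv_intervals: half-open (start_bp, end_bp) intervals; overlapping
    intervals are counted only once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(cnv_intervals):
        start = max(start, last_end)  # skip any part already counted
        if end > start:
            covered += end - start
            last_end = end
    return 100.0 * covered / chromosome_length_bp


def likely_false_call(cnv_intervals: Iterable[Tuple[int, int]],
                      chromosome_length_bp: int,
                      threshold_percent: float = 8.0) -> bool:
    """True when the covered percentage exceeds the threshold feature value."""
    return percent_chromosome_covered(cnv_intervals, chromosome_length_bp) > threshold_percent


# Hypothetical chromosome of 50 Mb with two overlapping maternal duplications.
cnvs = [(1_000_000, 4_000_000), (3_000_000, 6_000_000)]
pct = percent_chromosome_covered(cnvs, 50_000_000)
flagged = likely_false_call(cnvs, 50_000_000)
print(pct, flagged)
```

A threshold base pair length could be checked in the same manner by comparing the covered length, rather than the percentage, against the threshold.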
According to some embodiments, the fetal chromosomal abnormality may include a chromosomal aneuploidy. In this example, the chromosomal aneuploidy may include a chromosomal trisomy and/or a chromosomal monosomy. In at least one embodiment, the fetal chromosomal abnormality may include at least one of a chromosomal microdeletion and a chromosomal microduplication. The at least one copy number variant may include at least one of a deletion and a duplication. The region of interest may include a chromosome or a selected portion of a chromosome. In some embodiments, the region of interest and the at least one copy number variant may be located in the same chromosome. In at least one embodiment, the region of interest and the at least one copy number variant may be located in different chromosomes. The region of interest may include at least a portion of human chromosome 1, 13, 18, 21, or X.
In at least one embodiment, the method may further include (i) adjusting, when the feature value of the at least one copy number variant is greater than the threshold feature value, a quantity of target sequencing reads in at least one variant region corresponding to the at least one copy number variant to generate an adjusted set of target sequencing reads, (ii) generating an adjusted quantity of target sequencing reads for the region of interest based on the adjusted set of target sequencing reads, (iii) calculating an adjusted statistical z-score for the region of interest based on the adjusted quantity of target sequencing reads, and (iv) determining whether the adjusted statistical z-score for the region of interest is outside of the predetermined z-score range. Generating the adjusted quantity of target sequencing reads for the region of interest may include replacing sequencing reads of the quantity of target sequencing reads in the at least one variant region with the adjusted set of target sequencing reads. Adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include increasing the number of target sequencing reads in the at least one variant region. Additionally or alternatively, adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include decreasing the number of target sequencing reads in the at least one variant region. According to some embodiments, adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include removing target sequencing reads in the at least one variant region.
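As a non-limiting sketch, the adjustment of target sequencing reads in a variant region may be expressed as shown below. The sketch assumes per-bin read counts and supports two of the options described above: removing the reads in the variant region, or rescaling them (which decreases counts for a duplication and increases them for a deletion). The function name `adjust_variant_bins`, its parameters, and the example counts are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence, Set


def adjust_variant_bins(bins: Sequence[float], variant_bins: Set[int],
                        observed_copies: Optional[int] = None,
                        expected_copies: int = 2) -> List[float]:
    """Return an adjusted set of per-bin read counts.

    If observed_copies is None, bins in the variant region are removed.
    Otherwise their counts are rescaled by expected_copies / observed_copies.
    """
    adjusted: List[float] = []
    for i, count in enumerate(bins):
        if i not in variant_bins:
            adjusted.append(count)
        elif observed_copies is not None:
            adjusted.append(count * expected_copies / observed_copies)
        # else: the bin lies in the variant region and is dropped
    return adjusted


# Hypothetical region with a maternal duplication spanning bins 2 and 3.
bins = [100, 100, 150, 150, 100]
variant = {2, 3}
removed = adjust_variant_bins(bins, variant)
rescaled = adjust_variant_bins(bins, variant, observed_copies=3)
print(mean(bins), mean(removed), mean(rescaled))
```

The adjusted average reads per bin would then be used to calculate the adjusted statistical z-score, which is compared against the same predetermined z-score range.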
In some embodiments, determining the quantity of target sequencing reads for the region of interest may include determining a number of target sequencing reads in each of a plurality of bins corresponding to the region of interest. Calculating the statistical z-score for the region of interest based on the quantity of target sequencing reads for the region of interest may include calculating the statistical z-score for the region of interest based on the average number of target sequencing reads per bin for the plurality of bins corresponding to the region of interest. In at least one embodiment, the method may further include (i) calculating, when the feature value of the at least one copy number variant is greater than the threshold feature value, an adjusted statistical z-score for the region of interest, and (ii) determining whether the adjusted statistical z-score for the region of interest is outside of the predetermined z-score range. Calculating the adjusted statistical z-score for the region of interest may include adjusting the calculated statistical z-score based on the feature value of the at least one copy number variant.
According to some embodiments, a method for performing a DNA-based noninvasive prenatal screen on a sample that includes maternal DNA and fetal DNA may include (i) isolating cfDNA fragments from a sample that includes maternal cfDNA and fetal cfDNA, (ii) sequencing each of the cfDNA fragments to obtain a plurality of fragment sequencing reads, (iii) identifying target sequencing reads of the plurality of fragment sequencing reads, the identified target sequencing reads being mappable to specified locations of a reference genome, (iv) analyzing the identified target sequencing reads to determine whether maternal genomic DNA from the individual includes at least one copy number variant, (v) adjusting, when the maternal genomic DNA from the individual is determined to include at least one copy number variant, a quantity of target sequencing reads of the identified target sequencing reads for at least one variant region corresponding to the at least one copy number variant to generate an adjusted set of target sequencing reads, (vi) determining, out of the identified target sequencing reads, a quantity of target sequencing reads for a region of interest, (vii) generating an adjusted quantity of target sequencing reads for the region of interest based on the adjusted set of target sequencing reads, (viii) calculating a statistical z-score for the region of interest based on the adjusted quantity of target sequencing reads for the region of interest, and (ix) determining whether the calculated statistical z-score for the region of interest is outside of a predetermined z-score range, a calculated statistical z-score outside of the predetermined z-score range representing a positive call for a fetal chromosomal abnormality in the region of interest of the fetal DNA.
According to some embodiments, generating the adjusted quantity of target sequencing reads for the region of interest may include replacing sequencing reads of the quantity of target sequencing reads in the at least one variant region with the adjusted set of target sequencing reads. Adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include increasing the number of target sequencing reads in the at least one variant region. Additionally or alternatively, adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include decreasing the number of target sequencing reads in the at least one variant region. In at least one embodiment, adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include removing target sequencing reads in the at least one variant region. In some embodiments, determining the quantity of target sequencing reads for the region of interest may include determining a number of target sequencing reads in each of a plurality of bins corresponding to the region of interest. Calculating the statistical z-score for the region of interest based on the adjusted quantity of target sequencing reads for the region of interest may include calculating the statistical z-score for the region of interest based on the average number of target sequencing reads per bin for the plurality of bins corresponding to the region of interest.
Features from any of the above-mentioned embodiments may be used in combination with one another in accordance with the general principles described herein. These and other embodiments, features, and advantages will be more fully understood upon reading the following detailed description in conjunction with the accompanying drawings and claims.
The accompanying drawings illustrate a number of example embodiments and are a part of the specification. Together with the following description, these drawings demonstrate and explain various principles of the instant disclosure.
Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the example embodiments described herein are susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. However, the example embodiments described herein are not intended to be limited to the particular forms disclosed. Rather, the instant disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.
The present disclosure is generally directed to systems and methods for optimizing performance of DNA-based noninvasive prenatal screens to reduce false aneuploidy calls and for performing DNA-based noninvasive prenatal screens. The present disclosure is also generally directed to systems and methods for performing DNA-based noninvasive prenatal screens on samples that include both maternal DNA and fetal DNA.
Noninvasive prenatal screens can be used to determine fetal abnormalities for one or more test chromosomes using cell-free DNA from a test maternal blood sample. The results of screening can, for example, inform a patient's decision whether to pursue invasive diagnostic testing (such as amniocentesis or chorionic villus sampling), which has a small (but non-zero) risk of miscarriage. Aneuploidy detection using noninvasive cfDNA analysis is linked to fetal fraction (that is, the proportion of cfDNA in the test maternal sample attributable to fetal origin). Aneuploidy may manifest in noninvasive prenatal screens that rely on a measured test chromosome dosage as a statistical increase or decrease in the count of quantifiable products (such as sequencing reads) that can be attributed to the test chromosome relative to an expected test chromosome dosage (that is, the count of quantifiable products that would be expected if the test chromosome were disomic). Various cfDNA-based noninvasive prenatal screening systems and methods are disclosed, for example, in U.S. Patent Publication No. 2014/0342354 and U.S. Patent Application No. 62/424,303.
Conventional aneuploidy detection may rely on an underlying assumption that the maternal cfDNA in a particular sample includes few or no copy number variants (CNVs) on a given chromosome. Thus, cfDNA samples used in noninvasive prenatal screening are implicitly assumed to include the same proportion of genetic material from the maternal chromosome. However, chromosomes for different individuals typically vary to a lesser or greater extent due to CNVs, including CNVs where one or more genomic regions in the chromosomes are duplicated or deleted. For example, one or more duplications in a particular maternal chromosome belonging to a pregnant woman effectively add to the length of the maternal chromosome and may likewise increase the proportion of cfDNA derived from the maternal chromosome. Conversely, one or more deletions in a particular maternal chromosome may decrease the proportion of cfDNA derived from the maternal chromosome.
Sequencing of cfDNA from individuals having at least one CNV in a chromosome of interest may result in reads leading to false fetal aneuploidy, microdeletion, and/or microduplication interpretations, particularly considering that the vast majority of cfDNA is maternally derived. The mean amount of fetal DNA in cfDNA samples is 13%, although samples may contain as little as about 2% or as much as about 30% fetal DNA. Because the maternal DNA portion of a cfDNA sample is substantially higher than the fetal DNA portion, the impact of CNVs in the maternal DNA may be significant when analyzing the cfDNA sample. Typically, relatively shorter CNVs will not affect detection results in conventional noninvasive prenatal screening. However, longer CNVs of 250 kb and larger have been predicted to increase false-positive aneuploidy calls by 40-fold or more. See, for example, Snyder et al., N Engl J Med, 372:1639-45 (2015). Recent studies of false-positive calls in noninvasive prenatal screens for trisomies 13, 18, and 21 attributed one-third to one-half of the false-positives to duplications in a portion of maternal chromosome 13, 18, or 21. See, for example, Strom et al., N Engl J Med, 376:188-89 (2017); Chudova et al., N Engl J Med, 375:97-98 (2016). Accordingly, CNVs in maternal DNA, particularly duplications, may be a significant contributor to false-positive calls for aneuploidies, including false-positive calls for trisomies 13, 18, and 21. Deletions in maternal DNA may also contribute to false-negative calls for aneuploidies in noninvasive prenatal screens.
In some embodiments, a noninvasive prenatal screen performed on a cfDNA sample from an individual having a duplication or a deletion in a chromosome of interest in the maternal DNA may result in a false-positive or false-negative fetal aneuploidy, microdeletion, or microduplication call. For example, a maternal sequence duplication may, if large enough, increase a total amount of cfDNA corresponding to a specified chromosome such that, during screening of the cfDNA, the percentage of total sequencing reads corresponding to the specified chromosome is greater than a minimum percentage required to declare a positive result for aneuploidy in the specified chromosome. Often, the percentage of total sequencing reads for the specified chromosome may be used to determine a statistical z-score. A z-score greater than the upper limit of a specified range may result in a positive call for an aneuploidy (e.g., duplication) in the fetal chromosome and a z-score below a lower limit of the specified range may result in a positive call for another type of aneuploidy (e.g., a deletion), while a z-score within the specified range may result in a negative aneuploidy call.
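The range-based decision described above may be sketched as follows, for illustration only. The range limits of -3 and 3 are assumed defaults chosen for the example and are not specified by the disclosure; the function name `classify_z_score` and the returned labels are likewise hypothetical.

```python
def classify_z_score(z: float, lower: float = -3.0, upper: float = 3.0) -> str:
    """Map a z-score to a call using a predetermined z-score range.

    Assumption: the limits -3.0 and 3.0 are illustrative defaults only.
    """
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    if z > upper:
        return "positive (excess dosage)"
    if z < lower:
        return "positive (reduced dosage)"
    return "negative"


for z in (4.2, -3.5, 0.7):
    print(z, classify_z_score(z))
```

Because a sufficiently large maternal duplication shifts the z-score in the same direction as a fetal trisomy, a call made this way cannot by itself distinguish the two, which motivates the additional maternal CNV checks described herein.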
Many maternal CNVs (mCNVs) may not affect the overall sequencing read counts during noninvasive prenatal screening to a degree significant enough to result in a false-positive or negative aneuploidy call, as illustrated in
Additional factors contributing to whether or not a maternal CNV is likely to influence an aneuploidy call for a particular chromosome include, for example, the size of maternal CNV with respect to the size of the particular chromosome, whether the maternal CNV is located in the particular chromosome, the number of maternal CNVs in the chromosome, the type of maternal CNV, and the fetal DNA fraction in the cfDNA sample. One or more of these factors may be analyzed to determine a potential impact on an aneuploidy call.
In some embodiments, mCNVs may be detected using a moving-window approach that considers copy-number values in bins (e.g., 20 kb bins) tiling each chromosome. A bin's copy-number value may be a fractional number (e.g., 1.997) that reflects the bin's read depth and results from multiple normalization steps, as described in greater detail below. The presence or absence of an mCNV may be assessed at each bin i. First, the median copy-number value across, for example, 10 bins i through i+9 may be calculated in both a sample of interest and in background samples. A z-score may be computed for each sample's observed median copy-number value relative to the background average. Bins i through i+9 may be classified as part of an mCNV if (1) the absolute median copy-number value is <1.5 or >2.5, and (2) the absolute z-score is determined to be significant. As some genomic bins may be filtered out elsewhere in the analysis pipeline (e.g., for spuriously high read depth or for “unmappable” regions with redundant sequences that complicate unique mapping of reads), gaps of up to, for example, five genomic bins within mCNVs may be allowed. Consecutive mCNV calls of the same type may be merged if the resulting call has a significant z-score. For example, a 12-bin mCNV may be called by merging three mCNV calls starting at bins i, i+1, and i+2, or a 25-bin call may be made by merging calls starting at bins i and i+15 (if bins i+10 through i+14 were a gap). The edges of merged calls may be trimmed by up to 10 bins on either side, with the final mCNV boundaries determined by the pair of edges that maximizes the absolute z-score of the call. Due to the trimming, calls smaller than 200 kb may be possible if the trimmed set of bins yields a large enough absolute z-score.
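A simplified sketch of the moving-window classification and merging steps is given below by way of example. It assumes a 10-bin window, a z-score cutoff of 3 as the measure of significance (the disclosure does not quantify "significant"), and background samples supplied as per-bin copy-number values. The merge step joins calls separated by up to `max_gap` bins without checking whether those bins were actually filtered, does not re-test the merged call's z-score, and omits edge trimming, so merged boundaries may extend beyond the true variant. The names `window_calls` and `merge_calls` and all example values are hypothetical.

```python
from statistics import mean, median, stdev
from typing import List, Sequence, Tuple

WINDOW = 10  # bins i through i+9


def window_calls(sample: Sequence[float],
                 background: Sequence[Sequence[float]],
                 z_cutoff: float = 3.0) -> List[Tuple[int, str]]:
    """Return (start_bin, kind) for each window classified as part of an mCNV."""
    calls = []
    for i in range(len(sample) - WINDOW + 1):
        med = median(sample[i:i + WINDOW])
        bg = [median(b[i:i + WINDOW]) for b in background]
        spread = stdev(bg)  # requires at least two background samples
        if spread == 0:
            continue
        z = (med - mean(bg)) / spread
        if abs(z) < z_cutoff:
            continue
        if med > 2.5:
            calls.append((i, "dup"))
        elif med < 1.5:
            calls.append((i, "del"))
    return calls


def merge_calls(calls: Sequence[Tuple[int, str]],
                max_gap: int = 0) -> List[Tuple[int, int, str]]:
    """Merge same-type windows into (first_bin, last_bin, kind) calls.

    Windows are merged when they overlap, touch, or are separated by at
    most max_gap bins. Expects calls sorted by start bin.
    """
    merged: List[Tuple[int, int, str]] = []
    for start, kind in calls:
        end = start + WINDOW - 1
        if merged and merged[-1][2] == kind and start <= merged[-1][1] + 1 + max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end), kind)
        else:
            merged.append((start, end, kind))
    return merged


# Hypothetical 30-bin chromosome with a duplication in bins 10 through 21.
sample = [2.0] * 10 + [3.0] * 12 + [2.0] * 8
background = [[1.98] * 30, [2.0] * 30, [2.02] * 30]
found = merge_calls(window_calls(sample, background))
print(found)
```

With these inputs the merged call spans more bins than the 12 duplicated bins, because any window whose median exceeds 2.5 contributes all ten of its bins; this overshoot is what the edge-trimming step described above is intended to correct.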
As shown in
The following will provide, with reference to
Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Numeric ranges are inclusive of the numbers defining the range.
Reference to “about” a value or parameter herein includes (and describes) variations that are directed to that value or parameter per se. For example, the term “about,” as used herein, may represent plus or minus ten percent (10%) of a value. For example, “about 100” refers to any number between 90 and 110.
The term “average,” as used herein, refers to either a mean or a median, or any value used to approximate the mean or median.
A “bin” is an arbitrary genomic region from which a quantifiable measurement can be made. When multiple bins (i.e., a plurality of bins) are subjected to common analysis, the length of each arbitrary genomic region is preferably the same and tiled across a region of interest without overlaps. Nevertheless, the bins can be of different lengths, and can be tiled across the region of interest with overlaps or gaps.
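As an illustration of the preferred case, the following sketch tiles a region of interest with equal-length, non-overlapping bins. The function name `tile_bins`, the half-open coordinate convention, and the 20 kb default are assumptions made for this example only.

```python
def tile_bins(region_start, region_end, bin_size=20_000):
    """Tile [region_start, region_end) with non-overlapping bins.

    Returns a list of half-open (start, end) intervals. The final bin is
    shortened if the region length is not a multiple of bin_size.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    return [
        (start, min(start + bin_size, region_end))
        for start in range(region_start, region_end, bin_size)
    ]
```

For example, `tile_bins(0, 50_000)` yields two 20 kb bins followed by one 10 kb bin; a caller that requires strictly equal lengths could drop the final partial bin.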
The term “copy number variant” or “CNV,” as used herein, refers to any duplication or deletion of a region of interest.
The term “deletion,” as used herein, refers to any decrease in the number of copies of a region of interest relative to one or more real reference samples. For example, if the one or more real reference samples have two copies of a region of interest, a deletion can refer to a single copy of the region of interest. If the one or more real reference samples have four copies of a region of interest, a deletion can refer to one, two, or three copies of the region of interest.
The term “duplication,” as used herein, refers to any increase in the number of copies of a region of interest relative to one or more real reference samples, including three or more, four or more, five or more, etc. copies of the region of interest.
A “genetic variant caller,” as used herein, refers to any method or technique (including software) that can be used to identify one or more genetic features. Genetic features that can be identified by a genetic variant caller include, but are not limited to, the copy number of a region of interest, an insertion, a deletion, a translocation, an inversion, or a small nucleotide variant (SNV). An “abnormality caller,” as used herein, refers to any method or technique (including software) that can be used to identify an abnormal number of chromosomes in fetal DNA. For example, an abnormality caller may identify an additional chromosome resulting in a trisomy of the chromosome.
A “mappable” sequencing read, as used herein, refers to a sequencing read that aligns with a unique location in a genome. A sequencing read that maps to zero or two or more locations in the genome is considered not “mappable.”
A “maternal sample,” as used herein, refers to any sample taken from a pregnant mammal which comprises a maternal source and a fetal source of nucleic acids. The term “training maternal sample” refers to a maternal sample that is used to train a machine-learning model.
The term “maternal cell-free DNA” or “maternal cfDNA,” as used herein, refers to cell-free DNA originating from a chromosome from a maternal cell that is neither placental nor fetal. The term “fetal cell-free DNA” or “fetal cfDNA” refers to a cell-free DNA originating from a chromosome from a placental cell or a fetal cell.
The term “normal,” as used herein, when used to characterize a putative fetal chromosomal abnormality, such as a microdeletion, microduplication, or aneuploidy, indicates that the putative fetal chromosomal abnormality is not present. The term “abnormal” when used to characterize a putative fetal chromosomal abnormality indicates that the putative fetal chromosomal abnormality is present.
A “number of sequencing reads,” as used herein, refers to an absolute number of sequencing reads or a normalized number of sequencing reads.
A “real sample,” as used herein, refers to a nucleic acid sequence or sequencing reads originating from a nucleic acid sequence that originates from a physical sample subjected to genetic sequencing without the sequence, sequencing reads, or number of sequencing reads being altered. A “real reference sample” refers to a real sample that is compared to a synthetic sample (e.g., a synthetic copy number variant) by the genetic variant caller. A “real test sample,” as used herein, refers to a real sample that is used to generate the synthetic sample.
A “real sequencing read,” as used herein, refers to a sequencing read that originates from a real sample without alteration of the sequence. A “number of real sequencing reads” refers to an absolute number of real sequencing reads or a normalized number of sequencing reads, but does not refer to a number of sequencing reads that has been altered to reflect an increase in a number of copies of any segment or region of interest and/or portion of a chromosome of interest.
A “segment,” as used herein, refers to a sub-region in a region of interest that serves as a locus of origin for sequencing reads. The segment can be as short as a single base or can be as long as the region of interest. Multiple segments within a region of interest may be, but need not be, continuous, contiguous, or overlapping.
The term “synthetic copy number variant,” as used herein, refers to an artificial nucleic acid sequence generated using real sequencing reads from a real sample with an increase or decrease in the number of copies of a region of interest and/or portion of a chromosome of interest compared to the real sample. The synthetic copy number variant need not be (although, in some embodiments, could be) an aligned or assembled nucleic acid sequence, and can be represented by a synthetic number of sequencing reads (i.e., an absolute number or a normalized number of sequencing reads).
A “synthetic number of copies,” as used herein, refers to the number of copies of a region of interest in the synthetic copy number variant, and can be an increase or decrease in the number of copies relative to the real sample.
A “synthetic number of sequencing reads,” as used herein, refers to a number of real sequencing reads that has been altered to reflect an increase or a decrease in the number of copies of a segment within a region of interest and/or portion of a chromosome of interest. The real sequencing reads originate from the same segment (i.e., originate from a corresponding segment) within the region of interest and/or portion of the chromosome of interest as the sequencing reads in the synthetic number of sequencing reads. The synthetic number of sequencing reads is an absolute number of sequencing reads or a normalized number of sequencing reads.
A “synthetic variant,” as used herein, in a reference genome refers to a variant artificially introduced into a nucleic acid sequence in the reference genome, unless context clearly indicates otherwise. The “inverse” of a synthetic variant refers to the opposite consequence of the synthetic variant that would appear in a nucleic acid sequence when compared to the reference sequence comprising the synthetic variant.
A “variation,” as used herein, refers to any statistical metric that defines the width of a distribution, and can be, but is not limited to, a standard deviation, a variance, or an interquartile range.
A “value of likelihood,” as used herein, refers to any value achieved by directly calculating likelihood or any value that can be correlated to or otherwise indicative of likelihood. The term “value of likelihood” includes an odds ratio.
A “value of statistical significance,” as used herein, is any value that indicates the statistical distance of a tested event or hypothesis from a null or reference hypothesis, such as a z-score, a p-value, or a probability.
A “z-score” (i.e., standard score, z-value, normal score, standardized variable, etc.) as used herein, refers to a number of standard deviations an observation value or data point is from an average value and may refer to an aneuploidy z-score, not a z-score of an mCNV.
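As a compact restatement of this definition, and assuming for illustration that the average is taken as the mean and the variation as the standard deviation of a reference distribution (the symbols x, μ, and σ are introduced here and do not appear elsewhere in this disclosure), the z-score of an observed value x may be written as:

```latex
z = \frac{x - \mu}{\sigma}
```

Under the broader definitions of “average” and “variation” given above, μ could instead be a median and σ another measure of the width of the distribution.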
It is understood that aspects and variations of the invention described herein include “consisting” and/or “consisting essentially of” aspects and variations.
Where a range of values is provided, it is to be understood that each intervening value between the upper and lower limit of that range, and any other stated or intervening value in that stated range, is encompassed within the scope of the present disclosure. Where the stated range includes upper or lower limits, ranges excluding either of those included limits are also included in the present disclosure.
Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.
It is to be understood that one, some or all of the properties of the various embodiments described herein may be combined to form other embodiments of the present invention.
The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.
The practice of the present invention employs, unless otherwise indicated, conventional techniques of immunology, biochemistry, chemistry, molecular biology, microbiology, cell biology, genomics and recombinant DNA, which are within the skill of the art. See e.g. Sambrook, Fritsch and Maniatis, MOLECULAR CLONING: A LABORATORY MANUAL, 2nd edition (1989); CURRENT PROTOCOLS IN MOLECULAR BIOLOGY (F. M. Ausubel, et al. eds., (1987)); the series METHODS IN ENZYMOLOGY (Academic Press, Inc.): PCR 2: A PRACTICAL APPROACH (M. J. MacPherson, B. D. Hames and G. R. Taylor eds. (1995)), Harlow and Lane, eds. (1988) ANTIBODIES, A LABORATORY MANUAL, and ANIMAL CELL CULTURE (R. I. Freshney, ed. (1987)).
Exemplary computer programs which can be used to determine identity between two sequences include, but are not limited to, the suite of BLAST programs, e.g., BLASTN, BLASTX, TBLASTX, BLASTP, and TBLASTN, as well as BLAT, all publicly available on the Internet. See also, Altschul, et al., 1990 and Altschul, et al., 1997.
Sequence searches may be carried out, using any suitable software, without limitation, including, for example, using the BLASTN program when evaluating a given nucleic acid sequence relative to nucleic acid sequences in the GenBank DNA Sequences and other public databases. The BLASTX program is preferred for searching nucleic acid sequences that have been translated in all reading frames against amino acid sequences in the GenBank Protein Sequences and other public databases. Both BLASTN and BLASTX are run using default parameters of an open gap penalty of 11.0, and an extended gap penalty of 1.0, and utilize the BLOSUM-62 matrix. (See, e.g., Altschul, S. F., et al., Nucleic Acids Res. 25:3389-3402, 1997).
Alignment of selected sequences in order to determine “% identity” between two or more sequences, may be performed using any suitable software, without limitation, including, for example, the CLUSTAL-W program in MacVector version 13.0.7, operated with default parameters, including an open gap penalty of 10.0, an extended gap penalty of 0.1, and a BLOSUM 30 similarity matrix.
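Once two sequences have been aligned by such software, a “% identity” value can be computed from the aligned strings. The sketch below is illustrative only: the function name `percent_identity` is hypothetical, and the denominator used here (all alignment columns, excluding columns in which both sequences have a gap) is one common convention that may differ from the convention applied by any particular alignment program.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two already-aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Ignore columns where both sequences carry a gap character.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0

    # A match is a column with the same non-gap residue in both sequences.
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)
```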
In some embodiments, targeted sequencing and/or high-depth whole-genome sequencing may be utilized to sequence cfDNA fragments. Any high-throughput quantitative data that reflects the dose of a particular genomic region may be used, be it from next-generation sequencing (NGS), microarrays, or any other high-throughput quantitative molecular biology technique. In at least one embodiment, sequences from a region of interest may be isolated and enriched, where possible, with hybrid-capture probes or PCR primers, which should be designed such that the captured and sequenced fragments contain at least one sequence that distinguishes a gene from its homolog(s). For example, hybrid-capture probes may be designed to anneal adjacent to the few bases that differ between the gene and the homolog(s)/pseudogene(s) (“diff bases”). Where such distinguishing sequence is scarce, multiple probes may be used to capture distinguishable fragments to diminish the effect of biases inherent to each particular probe's sequence. Amplicon sequencing can be used as an alternative to hybrid-capture as a means to achieve targeted sequencing.
In some embodiments, sequences from a region of interest may be isolated with oligonucleotides adhered to a solid support. Oligonucleotides to which the solid support is exposed for attachment may be of any suitable length, and may comprise one or more sequence elements. Examples of sequence elements include, but are not limited to, one or more amplification primer annealing sequences or complements thereof, one or more sequencing primer annealing sequences or complements thereof, one or more common sequences shared among multiple different oligonucleotides or subsets of different oligonucleotides, one or more restriction enzyme recognition sites, one or more target recognition sequences complementary to one or more target polynucleotide sequences, one or more random or near-random sequences (e.g. one or more nucleotides selected at random from a set of two or more different nucleotides at one or more positions, with each of the different nucleotides selected at one or more positions represented in a pool of oligonucleotides comprising the random sequence), one or more spacers, and combinations thereof. Two or more sequence elements can be non-adjacent to one another (e.g. separated by one or more nucleotides), adjacent to one another, partially overlapping, or completely overlapping.
In some embodiments, the oligonucleotide sequence attached to the support or the target sequence to which it specifically hybridizes may comprise a causal genetic variant. In general, causal genetic variants are genetic variants for which there is statistical, biological, and/or functional evidence of association with a disease or trait. A single causal genetic variant can be associated with more than one disease or trait. In some embodiments, a causal genetic variant can be associated with a Mendelian trait, a non-Mendelian trait, or both. Causal genetic variants can manifest as variations in a polynucleotide, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, or more sequence differences (such as between a polynucleotide comprising the causal genetic variant and a polynucleotide lacking the causal genetic variant at the same relative genomic position). Non-limiting examples of types of causal genetic variants include single nucleotide polymorphisms (SNP), deletion/insertion polymorphisms (DIP), copy number variants (CNV), short tandem repeats (STR), restriction fragment length polymorphisms (RFLP), simple sequence repeats (SSR), variable number of tandem repeats (VNTR), randomly amplified polymorphic DNA (RAPD), amplified fragment length polymorphisms (AFLP), inter-retrotransposon amplified polymorphisms (IRAP), long and short interspersed elements (LINE/SINE), long tandem repeats (LTR), mobile elements, retrotransposon microsatellite amplified polymorphisms, retrotransposon-based insertion polymorphisms, sequence specific amplified polymorphism, and heritable epigenetic modification (for example, DNA methylation).
In some embodiments, a plurality of target polynucleotides may be amplified according to a method that comprises exposing a sample comprising a plurality of target polynucleotides to an apparatus of the invention. In some embodiments, the amplification process comprises bridge amplification. In some embodiments, a plurality of polynucleotides may be sequenced according to a method that comprises exposing a sample comprising a plurality of target polynucleotides to an apparatus of the invention.
In some embodiments, adapted polynucleotides may be subjected to an amplification reaction that amplifies target polynucleotides in the sample. Amplification primers may be of any suitable length, such as about, less than about, or more than about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100, or more nucleotides, any portion or all of which may be complementary to the corresponding target sequence to which the primer hybridizes (e.g. about, less than about, or more than about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or more nucleotides). “Amplification” refers to any process by which the copy number of a target sequence is increased. Methods for primer-directed amplification of target polynucleotides are known in the art, and include without limitation, methods based on the polymerase chain reaction (PCR). Conditions favorable to the amplification of target sequences by PCR are known in the art, can be optimized at a variety of steps in the process, and depend on characteristics of elements in the reaction, such as target type, target concentration, sequence length to be amplified, sequence of the target and/or one or more primers, primer length, primer concentration, polymerase used, reaction volume, ratio of one or more elements to one or more other elements, and others, some or all of which can be altered. In general, PCR involves the steps of denaturation of the target to be amplified (if double stranded), hybridization of one or more primers to the target, and extension of the primers by a DNA polymerase, with the steps repeated (or “cycled”) in order to amplify the target sequence. Steps in this process can be optimized for various outcomes, such as to enhance yield, decrease the formation of spurious products, and/or increase or decrease specificity of primer annealing. 
Methods of optimization may include adjustments to the type or amount of elements in the amplification reaction and/or to the conditions of a given step in the process, such as temperature at a particular step, duration of a particular step, and/or number of cycles.
Typically, annealing of a primer to its template takes place at a temperature of 25 to 90° C. A temperature in this range will also typically be used during primer extension, and may be the same as or different from the temperature used during annealing and/or denaturation. Once sufficient time has elapsed to allow annealing and also to allow a desired degree of primer extension to occur, the temperature can be increased, if desired, to allow strand separation. At this stage the temperature will typically be increased to a temperature of 60 to 100° C. High temperatures can also be used to reduce non-specific priming problems prior to annealing, and/or to control the timing of amplification initiation, e.g. in order to synchronize amplification initiation for a number of samples. Alternatively, the strands may be separated by treatment with a solution of low salt and high pH (>12) or by using a chaotropic salt (e.g. guanidinium hydrochloride) or by an organic solvent (e.g. formamide).
Following strand separation (e.g. by heating), a washing step may be performed. The washing step may be omitted between initial rounds of annealing, primer extension and strand separation, such as if it is desired to maintain the same templates in the vicinity of immobilized primers. This allows templates to be used several times to initiate colony formation. The size of colonies produced by amplification on the solid support can be controlled, e.g. by controlling the number of cycles of annealing, primer extension and strand separation that occur. Other factors which affect the size of colonies can also be controlled. These include the number and arrangement on a surface of immobilized primers, the conformation of a support onto which the primers are immobilized, the length and stiffness of template and/or primer molecules, temperature, and the ionic strength and viscosity of a fluid in which the above-mentioned cycles can be performed.
In some embodiments, bridge amplification may be followed by sequencing a plurality of oligonucleotides attached to the solid support. In some embodiments, sequencing comprises or consists of single-end sequencing. In some embodiments, sequencing comprises or consists of paired-end sequencing. Sequencing can be carried out using any suitable sequencing technique, wherein nucleotides are added successively to a free 3′ hydroxyl group, resulting in synthesis of a polynucleotide chain in the 5′ to 3′ direction. The identity of the nucleotide added is preferably determined after each nucleotide addition. Sequencing techniques using sequencing by ligation, wherein not every contiguous base is sequenced, and techniques such as massively parallel signature sequencing (MPSS) where bases are removed from, rather than added to the strands on the surface are also within the scope of the invention, as are techniques using detection of pyrophosphate release (pyrosequencing). Such pyrosequencing based techniques are particularly applicable to sequencing arrays of beads where the beads have been amplified in an emulsion such that a single template from the library molecule is amplified on each bead. In some embodiments, sequencing comprises treating bridge amplification products to remove substantially all or remove or displace at least a portion of one of the immobilized strands in the “bridge” structure in order to generate a template that is at least partially single-stranded. The portion of the template which is single-stranded will thus be available for hybridization with a sequencing primer. The process of removing all or a portion of one immobilized strand in a bridged double-stranded nucleic acid structure may be referred to herein as “linearization.”
In some embodiments, a sequencing primer may include a sequence complementary to one or more sequences derived from an adapter oligonucleotide, an amplification primer, an oligonucleotide attached to the solid support, or a combination of these. In general, extension of a sequencing primer produces a sequencing extension product. The number of nucleotides added to the sequencing extension product that are identified in the sequencing process may depend on a number of factors, including template sequence, reaction conditions, reagents used, and other factors. In some embodiments, a sequencing primer is extended along the full length of the template primer extension product from the amplification reaction, which in some embodiments includes extension beyond a last identified nucleotide. In some embodiments, the sequencing extension product is subjected to denaturing conditions in order to remove the sequencing extension product from the attached template strand to which it is hybridized, in order to make the template partially or completely single-stranded and available for hybridization with a second sequencing primer.
In some embodiments, one or more, or all, of the steps of the method described herein may be automated, such as by use of one or more automated devices. In general, automated devices are devices that are able to operate without human direction—an automated system can perform a function during a period of time after a human has finished taking any action to promote the function, e.g. by entering instructions into a computer, after which the automated device performs one or more steps without further human operation. Software and programs, including code that implements embodiments of the present invention, may be stored on some type of data storage media, such as a CD-ROM, DVD-ROM, tape, flash drive, or diskette, or other appropriate computer readable medium. Various embodiments of the present invention can also be implemented exclusively in hardware, or in a combination of software and hardware. For example, in one embodiment, rather than a conventional personal computer, a Programmable Logic Controller (PLC) is used. As known to those skilled in the art, PLCs are frequently used in a variety of process control applications where the expense of a general purpose computer is unnecessary. PLCs may be configured in a known manner to execute one or a variety of control programs, and are capable of receiving inputs from a user or another device and/or providing outputs to a user or another device, in a manner similar to that of a personal computer. Accordingly, although embodiments of the present invention are described in terms of a general purpose computer, it should be appreciated that the use of a general purpose computer is exemplary only, as other configurations may be used.
In some embodiments, automation may include the use of one or more liquid handlers and associated software. Several commercially available liquid handling systems can be utilized to run the automation of these processes (see, for example, liquid handlers from Perkin-Elmer, Beckman Coulter, Caliper Life Sciences, Tecan, Eppendorf, Apricot Design, and Velocity 11). In some embodiments, automated steps include one or more of fragmentation, end-repair, A-tailing (addition of adenine overhang), adapter joining, PCR amplification, sample quantification (e.g. amount and/or purity of DNA), and sequencing. In some embodiments, hybridization of amplified polynucleotides to oligonucleotides attached to a solid surface, extension along the amplified polynucleotides as templates, and/or bridge amplification is automated (e.g. by use of an Illumina cBot). In some embodiments, sequencing may be automated. A variety of automated sequencing machines are commercially available, and include sequencers manufactured by Life Technologies (SOLiD platform, and pH-based detection), Roche (454 platform), and Illumina (e.g. flow cell based systems, such as Genome Analyzer, HiSeq, or MiSeq systems). Transfer between 2, 3, 4, 5, or more automated devices (e.g. between one or more of a liquid handler, a bridge amplification device, and a sequencing device) may be manual or automated.
In some embodiments, exponentially amplified target polynucleotides may be sequenced. Sequencing may be performed according to any method of sequencing known in the art, including sequencing processes described herein, such as with reference to other aspects of the invention. Sequence analysis using template dependent synthesis can include a number of different processes. For example, in the ubiquitously practiced four-color Sanger sequencing methods, a population of template molecules is used to create a population of complementary fragment sequences. Primer extension is carried out in the presence of the four naturally occurring nucleotides, and with a sub-population of dye labeled terminator nucleotides, e.g., dideoxyribonucleotides, where each type of terminator (ddATP, ddGTP, ddTTP, ddCTP) includes a different detectable label. As a result, a nested set of fragments is created where the fragments terminate at each nucleotide in the sequence beyond the primer, and are labeled in a manner that permits identification of the terminating nucleotide. The nested fragment population is then subjected to size based separation, e.g., using capillary electrophoresis, and the label associated with each different sized fragment is detected to identify the terminating nucleotide. As a result, the sequence of labels moving past a detector in the separation system provides a direct readout of the sequence information of the synthesized fragments, and by complementarity, the underlying template. Other examples of template dependent sequencing methods include sequence by synthesis processes, where individual nucleotides are identified iteratively, as they are added to the growing primer extension product (e.g., pyrosequencing).
In certain embodiments, one or more of modules 622 in
As illustrated in
As illustrated in
As illustrated in
In some embodiments, synthetic sequencing module 624 may generate each of the plurality of synthetic sequencing datasets by generating at least one of a plurality of synthetic copy number variants including a synthetic number of copies of at least a portion of a region of interest represented by a synthetic number of sequencing reads from one or more segments within the region of interest. Each of the plurality of synthetic copy number variants may include a deletion or a duplication. Additionally, synthetic sequencing module 624 may generate each of the plurality of synthetic sequencing datasets by then modifying a real sequencing dataset, which includes genetic sequencing data from a real test sample including maternal and fetal cfDNA, by replacing a number of real sequencing reads from the one or more segments within the region of interest in the real test sample with the synthetic number of sequencing reads. In at least one embodiment, the at least one of the plurality of synthetic copy number variants may include a synthetic maternal copy number variant and a corresponding synthetic fetal copy number variant. For example, cfDNA samples analyzed in non-invasive prenatal screening that are determined to include a maternal CNV are commonly treated as including the CNV in the fetal DNA as well as the maternal DNA, with the CNV being assumed to be passed from the mother to the child. Accordingly, attempts to distinguish a maternal CNV from a fetal CNV may not be made. In some examples, the at least one of the plurality of synthetic copy number variants may be generated to represent a synthetic maternal copy number variant without a corresponding synthetic fetal copy number variant. 
For example, to determine the impact of maternal CNV on a fetal chromosomal abnormality call in a cfDNA sample that does not include a corresponding fetal CNV, a synthetic sequencing dataset may be generated to represent a synthetic sample that includes a synthetic maternal CNV with no corresponding fetal CNV.
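One way to picture the replacement of real read counts with a synthetic number of sequencing reads is the following sketch. It is a simplified illustration and not the disclosed method: it assumes per-bin read counts, a diploid baseline of two copies, and read depth that scales linearly with the fetal-fraction-weighted copy number. The function name `synthesize_cnv_counts` and all parameter names are hypothetical.

```python
def synthesize_cnv_counts(real_counts, start_bin, end_bin, fetal_fraction,
                          maternal_copies, fetal_copies, baseline_copies=2):
    """Return per-bin counts with a synthetic CNV over bins [start_bin, end_bin).

    Assumption (not from the text): expected depth in the affected bins is
    proportional to the mixture-weighted copy number
        (1 - fetal_fraction) * maternal_copies + fetal_fraction * fetal_copies
    divided by the baseline copy number.
    """
    if not 0.0 <= fetal_fraction <= 1.0:
        raise ValueError("fetal_fraction must be between 0 and 1")
    scale = ((1.0 - fetal_fraction) * maternal_copies
             + fetal_fraction * fetal_copies) / baseline_copies
    # Bins outside the region keep their real counts unchanged.
    return [
        count * scale if start_bin <= i < end_bin else float(count)
        for i, count in enumerate(real_counts)
    ]
```

Under these assumptions, a maternal duplication that is not inherited would be modeled with `maternal_copies=3, fetal_copies=2`, whereas a duplication also carried by the fetus would use `fetal_copies=3`; comparing the two synthetic datasets is one way to examine how the presence or absence of the fetal CNV changes the signal seen by an abnormality caller.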
Real samples having a copy number variant, such as a duplication or deletion, for a particular region of interest (such as a gene or plurality of genes) may be relatively rare. Many putative CNVs may be identified from a retrospective analysis of whole-genome sequencing data from previously sequenced DNA samples from individuals. The vast majority of putative CNVs in such a retrospective analysis may represent relatively shorter CNVs of several thousand base pairs to several hundred thousand base pairs in length and spanning only a small portion of the respective chromosomes harboring the CNVs. However, many potential CNVs and/or CNV lengths may not be represented in such sequencing data. Particularly, relatively larger CNVs, which are much more likely to result in a false aneuploidy call in cfDNA-based prenatal screening, are much less common in the general population (see, e.g.,
In order to supplement the retrospective data for purposes of optimizing the performance of the DNA-based noninvasive prenatal screen, synthetic CNVs in human chromosomes 1, 13, 18, 21, and/or X and/or any other human chromosomes may be generated. In some embodiments, each of the plurality of synthetic sequencing datasets may include a synthetic number of sequencing reads for one or more segments of a reference chromosome. Each of the plurality of synthetic sequencing datasets may represent a chromosome or portion of a chromosome having at least one of a plurality of synthetic maternal copy number variants (e.g., deletions and/or duplications) at locations corresponding to the one or more segments of the reference chromosome.
The one or more segments of the reference chromosome may be of any suitable length, without limitation. For example, the one or more segments of the reference chromosome may each be about 1 base to about 250 million bases in length (such as about 1 base to about 50 bases in length, about 50 bases to about 100 bases in length, about 100 bases to about 250 bases in length, about 250 bases to about 500 bases in length, about 500 bases to about 1000 bases in length, about 1000 bases to about 2000 bases in length, about 2000 bases to about 4000 bases in length, about 4000 bases to about 8000 bases in length, about 8000 bases to about 16,000 bases in length, about 16,000 bases to about 32,000 bases in length, about 32,000 bases to about 64,000 bases in length, about 64,000 bases to about 125,000 bases in length, about 125,000 bases to about 250,000 bases in length, about 250,000 bases to about 500,000 bases in length, about 500,000 bases to about 1 million bases in length, about 1 million bases to about 2 million bases in length, about 2 million bases to about 4 million bases in length, about 4 million bases to about 8 million bases in length, about 8 million bases to about 16 million bases in length, about 16 million bases to about 32 million bases in length, about 32 million bases to about 64 million bases in length, about 64 million bases to about 125 million bases in length, or about 125 million bases to about 250 million bases in length). 
In some embodiments, the one or more segments of the reference chromosome may each be about 1 base or more (such as about 50 bases or more, about 100 bases or more, about 250 bases or more, about 500 bases or more, about 1000 bases or more, about 2000 bases or more, about 4000 bases or more, about 8000 bases or more, about 16,000 bases or more, about 32,000 bases or more, about 64,000 bases or more, about 125,000 bases or more, about 250,000 bases or more, about 500,000 bases or more, about 1 million bases or more, about 2 million bases or more, about 4 million bases or more, about 8 million bases or more, about 16 million bases or more, about 32 million bases or more, about 64 million bases or more, or about 125 million bases or more). In some embodiments, the one or more segments of the reference chromosome may include one or more genes (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more genes). In some embodiments, the one or more segments of the reference chromosome may include one or more exons (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more exons).
The one or more segments of the reference chromosome may or may not be continuous, contiguous, or partially overlapping. In some embodiments, the one or more segments of the reference chromosome may include 1 or more segments (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250 or more segments). The synthetic number of sequencing reads (or a portion of the sequencing reads) may each correspond to one of the one or more segments of the reference chromosome (i.e., the sequencing reads can be aligned to segments, for example using a reference sequence). It is understood that a portion of the synthetic number of sequencing reads may not accurately map to a particular segment (for example, a sequencing read may map to more than one segment or may map to no segment); such un-mappable or un-alignable sequencing reads are optionally ignored or discarded.
In some embodiments, at least a portion of one or more real samples may be sequenced to generate real sequencing reads. The real sequencing reads may be generated from one or more real samples (e.g., one or more sequencing libraries from the one or more real samples) using any known sequencing method, such as massively parallel sequencing (for example, using an Illumina HiSeq 2500 system). In some embodiments, at least one region of interest, such as one or more specified chromosomes (e.g., chromosome 1, 13, 18, 21, X, and/or Y), and/or one or more portions thereof, may be enriched, which can increase the proportion of sequencing reads that correspond to the enriched regions. For example, one or more regions of interest may be enriched by PCR (for example, by including one or more primers that hybridize to portions of segments within the regions of interest with genomic DNA from a real sample, and amplifying the segments within the regions of interest). In some embodiments, one or more regions of interest may be enriched by combining capture probes (such as biotinylated DNA, RNA, or synthetic oligonucleotides) that hybridize to segments within the regions of interest with genomic DNA (which is preferably sheared). The capture probes may then be used to isolate DNA fragments that include segments from the regions of interest, and those DNA fragments can be sequenced to generate sequencing reads.
In some embodiments, real sequencing reads may be normalized. For example, in some embodiments, the real sequencing reads may be normalized for GC content and/or mappability. For example, some segments within one or more regions of interest may have a higher GC content than other segments within the region of interest. The higher GC content may increase or decrease the assay efficiency within that segment, inflating or deflating the relative number of sequencing reads for reasons other than copy number. Methods to normalize GC content may include, for example, methods as described in Fan & Quake, PLoS ONE, vol. 5, e10439 (2010). Similarly, certain segments within the one or more regions of interest may be less easily mappable (or alignable to a reference region of interest) than others, so that a number of sequencing reads may be excluded, thereby deflating the relative number of sequencing reads for reasons other than copy number. Mappability at a given position in the genome may be predetermined for a given read length, k, by segmenting every position within a region of interest into k-mers and aligning the sequences back to the region of interest. K-mers that align to a unique position in the interrogated region are labeled “mappable,” and k-mers that do not align to a unique position in the region of interest are labeled “not mappable.” A given segment may be normalized for mappability by scaling the number of reads in the segment by the inverse of the fraction of the mappable k-mers in the segment. For example, if 50% of k-mers within a segment are mappable, the number of observed reads from within that segment may be scaled by a factor of 2.
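By way of illustration only, the mappability normalization described above may be sketched in Python as follows. The function names, the toy sequence, and the use of exact-match k-mer uniqueness in place of a true alignment step are assumptions made for this sketch and are not part of the disclosure.

```python
from collections import Counter


def mappable_fraction(region, start, end, k):
    """Fraction of k-mers starting in [start, end) that occur exactly once in `region`.

    Assumption: exact-match uniqueness stands in for aligning each k-mer
    back to the region of interest.
    """
    kmer_counts = Counter(region[i:i + k] for i in range(len(region) - k + 1))
    starts = [i for i in range(start, end) if i + k <= len(region)]
    if not starts:
        return 0.0
    unique = sum(1 for i in starts if kmer_counts[region[i:i + k]] == 1)
    return unique / len(starts)


def normalize_for_mappability(read_count, fraction_mappable):
    """Scale a segment's read count by the inverse of its mappable k-mer fraction."""
    if fraction_mappable <= 0.0:
        raise ValueError("segment has no mappable k-mers and cannot be normalized")
    return read_count / fraction_mappable


toy_region = "ACGTACGTTTGA"  # hypothetical sequence, for illustration only
fraction = mappable_fraction(toy_region, 0, 4, k=4)
scaled = normalize_for_mappability(60, fraction)
half_mappable = normalize_for_mappability(60, 0.5)  # the 50% case: scaled by a factor of 2
```

In this sketch, a segment with no mappable k-mers raises an error rather than being scaled, which is one possible handling among several (such a segment could alternatively be excluded).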
In some embodiments, the synthetic number of sequencing reads from each of the one or more segments may be generated by increasing or decreasing a number of real sequencing reads from one or more segments within a region (e.g., the region of interest) in the real test sample and/or within a region (e.g., the region of interest) in a reference sequence that is, for example, derived based on a combination of a plurality of test samples. For example, if a first number of real sequencing reads corresponds to a first segment in a region of interest, and a second number of real sequencing reads corresponds to a second segment in the region of interest, and the real sample has two copies of the region of interest, a synthetic copy number variant representing a duplication having three copies of the region of interest may be generated by generating a first synthetic number of sequencing reads corresponding to the first segment by increasing the first number of real sequencing reads to reflect three copies of the first segment, and generating a second synthetic number of sequencing reads corresponding to the second segment by increasing the second number of real sequencing reads to reflect three copies of the second segment. Since the synthetic number of sequencing reads corresponding to the first segment and the second segment are increased to reflect three copies, the synthetic copy number variant has three copies of the region of interest having the first segment and the second segment. In some embodiments, the synthetic number of sequencing reads may be normalized. For example, in some embodiments, the synthetic number of sequencing reads may be normalized for GC content and/or mappability.
In some embodiments, the synthetic number of sequencing reads may be generated by multiplying the number of real sequencing reads by a factor (such as 1.5 to increase the copy number from two to three, or 0.5 to decrease the copy number from two to one) and/or by applying binomial downsampling to the number of real sequencing reads (e.g., to simulate deletions). In some embodiments, the expected ratio of bin copy numbers in maternal duplications vs. non-mCNV regions may be 3/2=1.50, but this factor may be observed to be slightly lower at 2.88/2=1.44. This approach assumes that simulated mCNVs were inherited by the fetus. mCNVs not inherited by the fetus may have a marginally decreased signal in proportion to the fetal fraction, and this may reduce their potentially compromising effect on specificity but also make them slightly more difficult to detect. In some embodiments, the synthetic number of sequencing reads is generated by adding (or subtracting) a number of sequencing reads (such as 50% of the average number of real sequencing reads corresponding to all segments within the region of interest) to (or from) the number of real sequencing reads. In some embodiments, the number of sequencing reads may be normalized such that a single copy of a region of interest is represented by a normalized number of sequencing reads (e.g., 0.5), and two copies of a region of interest are represented by a normalized number of sequencing reads (e.g., 1). Thus, in some embodiments, a number of normalized sequencing reads (such as 0.5) may be added to the normalized number of sequencing reads to increase the number of copies in the synthetic copy number variant, and a number of normalized sequencing reads (such as 0.5) may be subtracted from the normalized number of sequencing reads to decrease the number of copies in the synthetic copy number variant.
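As a non-limiting sketch, the multiplicative and binomial-downsampling approaches described above may be expressed in Python as follows. The function names, the per-segment read counts, and the fixed random seed are hypothetical and chosen only for illustration.

```python
import random


def scale_counts(real_counts, factor):
    """Multiply per-segment read counts by a fixed factor (e.g., 1.5 for two -> three copies)."""
    return [count * factor for count in real_counts]


def binomial_downsample(real_counts, keep_probability, seed=0):
    """Keep each real read independently with probability `keep_probability`.

    Each segment's synthetic count is therefore a binomial draw with the number
    of trials equal to the real count (e.g., 0.5 to simulate two -> one copy).
    """
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(count) if rng.random() < keep_probability)
        for count in real_counts
    ]


real_counts = [200, 180, 220]  # hypothetical read counts for three segments
synthetic_duplication = scale_counts(real_counts, 1.5)
synthetic_deletion = binomial_downsample(real_counts, 0.5)
```

Unlike multiplication by a fixed factor, the binomial draw preserves integer read counts and introduces sampling noise, which may make a simulated deletion resemble real data more closely.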
In some embodiments, the number of real sequencing reads may be increased or decreased to generate the synthetic number of sequencing reads to represent a synthetic copy number variant with an integer number of copies of the region of interest (such as 1, 2, 3, 4, 5, or more copies of the region of interest). In at least one embodiment, the number of real sequencing reads from each of the one or more segments within the region of interest in the real test sample may be normalized by dividing the number of real sequencing reads from each segment from the real test sample by an average number of real sequencing reads from a corresponding segment from one or more real reference samples or by an average number of real sequencing reads from one or more segments within the region of interest in the real test sample. According to some embodiments, the number of real sequencing reads from each of the one or more segments within the region of interest in the real test sample may be normalized by fitting a probability distribution based on random subsampling. For example, rather than multiplying by a set value to normalize the number of real sequencing reads, a probability distribution based on random subsampling may be used (e.g., a binomial distribution with the number of trials equaling the depth and the probability of success equaling 0.5). Any suitable systems and methods for generating synthetic sequencing reads may be utilized, without limitation, including, for example, systems and methods disclosed in U.S. Patent Application No. 62/418,622.
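For illustration, normalization of a test sample against the per-segment average of one or more reference samples may be sketched as follows. The function name and the example read counts are assumptions of this sketch.

```python
def normalize_to_reference(test_counts, reference_samples):
    """Divide each segment's test read count by the mean count of the
    corresponding segment across the reference samples."""
    normalized = []
    for index, count in enumerate(test_counts):
        reference_mean = sum(sample[index] for sample in reference_samples) / len(reference_samples)
        if reference_mean == 0:
            raise ValueError(f"segment {index} has no reference reads")
        normalized.append(count / reference_mean)
    return normalized


test_counts = [150, 100]                   # hypothetical test-sample counts for two segments
reference_samples = [[100, 90], [100, 110]]  # hypothetical reference-sample counts
normalized = normalize_to_reference(test_counts, reference_samples)
```

Under the convention in which two copies correspond to a normalized value of 1, the first segment in this example (1.5) would be consistent with three copies and the second segment (1.0) with two copies.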
Returning to
Abnormality caller module 626 may calculate the potential impact of each of the plurality of synthetic copy number variants on the corresponding fetal chromosomal abnormality call in a variety of ways. For example, abnormality caller module 626 may determine whether a synthetic CNV has a large enough effect on a calculated z-score of a fetal chromosomal abnormality call to change its interpretation (i.e., whether the z-score is inside or outside of a “normal” z-score range). In some examples, abnormality caller module 626 may determine whether or not each synthetic sequencing dataset is likely to result in a false fetal chromosomal abnormality call during noninvasive prenatal screening, which utilizes cfDNA containing both maternal DNA and fetal DNA. By way of example, abnormality caller module 626 may determine whether sequences contributed by one or more duplications represented in a synthetic sequencing dataset would contribute enough additional reads utilized during noninvasive prenatal screening to push the total reads for a corresponding sample above a positive call threshold, resulting in a false-positive aneuploidy call. (See, e.g.,
In some embodiments, calculating the potential impact of the synthetic copy number variants on a fetal chromosomal abnormality call may include determining a quantity of target sequencing reads in each of the plurality of synthetic sequencing datasets, the target sequencing reads corresponding to identified target sequences. For example, abnormality caller module 626 may determine a quantity of target sequencing reads in each of the plurality of synthetic sequencing datasets. In some embodiments, the target sequencing reads may be reads of a specified length or lengths (e.g., k-mers) that are mappable to a reference genome. In some embodiments, the target sequencing reads may be sequencing reads that are each mappable to a reference sequence. In at least one embodiment, the target sequencing reads may be unique reads that each match only a single point (i.e., unique location) in a reference genome. In at least one embodiment, mappable target sequencing reads may be utilized by abnormality caller module 626, and un-mappable or un-alignable sequencing reads may be ignored or discarded.
In various embodiments, calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include calculating a value indicative of the potential effect of the copy number variant represented in each of the synthetic sequencing datasets. In some embodiments, a value of statistical significance (e.g., z-score or standard score, p-value, probability, etc.) may be calculated to determine the potential impact.
In at least one embodiment, abnormality caller module 626 may calculate a statistical z-score for each of the plurality of synthetic sequencing datasets. In cfDNA-based noninvasive prenatal screening, a value of likelihood that the fetal cfDNA in the test maternal sample is abnormal (e.g., aneuploid or includes a microdeletion or a microduplication) may be determined using a z-score, which is a statistical value indicating how many standard deviations a quantity of target sequences for a specified chromosome or portion of a chromosome in a cfDNA sample from a pregnant individual is from a mean or median reference quantity for the specified chromosome or portion of the chromosome.
For purposes of calculating the potential impact of each of the plurality of synthetic CNVs represented in the plurality of synthetic sequencing datasets on the aneuploidy call, a statistical z-score may be calculated for each of the plurality of synthetic sequencing datasets. In some embodiments, calculating the statistical z-score for each of the plurality of CNVs may further include calculating a quantity of target sequencing reads in a region of interest (e.g., chromosome or selected portion of chromosome) attributable to at least one CNV, such as a synthetic CNV. For example, a number of target sequencing reads obtained for a specified chromosome (e.g., 1, 13, 18, 21, X, or any other specified chromosome), or chromosome of interest, or selected portion of the chromosome, corresponding to the synthetic sequencing datasets may be determined in comparison to a number of target sequencing reads obtained from the specified chromosome or selected portion of the chromosome. For example, for a region of interest that includes a CNV, an average number of read counts may be determined for the region of interest represented by the synthetic sequencing dataset.
The z-score may be determined based on an average number of read counts in the region of interest (i.e., chromosome or portion of chromosome) of the synthetic sequencing dataset with respect to a background that includes a distribution of the average number of read counts in the region of interest of a plurality of other samples (i.e., a sample population), which includes, for example, a plurality of samples that do not include the CNV. The z-score may be determined by dividing a difference between the average number of read counts in the region of interest of the synthetic sequencing dataset and the average number of read counts of the sample population in the region of interest by a variation (e.g., average absolute deviation) in the average number of read counts for the sample population (or by a variation in the average number of read counts for all samples, including the synthetic sequencing dataset and/or additional synthetic chromosomes). In some embodiments, the background may be generated, at least in part, based on reference samples that are tailored to the synthetic sequencing dataset. For example, reference samples sharing one or more common characteristics with the synthetic sequencing dataset may be selected for the background. In one example, reference samples sharing a similar cfDNA fetal fraction may be utilized to generate the background. In some examples, the background used for a synthetic sequencing dataset may additionally or alternatively be generated, at least in part, based on reference samples that were sequenced and analyzed in one or more batches (e.g., a batch of samples sequenced on the same next-generation sequencing (NGS) sample plate), including real test samples that were sequenced in the same batch as the real test sample used to generate the synthetic sequencing dataset.
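A minimal sketch of the z-score computation described above is shown below, using the average absolute deviation of the background as the measure of variation. The function names and the example values are hypothetical; other measures of variation could be substituted.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def average_absolute_deviation(values):
    """Mean absolute distance of each value from the mean of the values."""
    center = mean(values)
    return mean([abs(value - center) for value in values])


def z_score(sample_average, background_averages):
    """(sample - background mean) / background variation.

    `sample_average` is the average read count in the region of interest for the
    synthetic sequencing dataset; `background_averages` holds the same quantity
    for each sample in the background population.
    """
    variation = average_absolute_deviation(background_averages)
    if variation == 0:
        raise ValueError("background has zero variation; z-score is undefined")
    return (sample_average - mean(background_averages)) / variation


background = [98.0, 100.0, 102.0, 100.0]  # hypothetical background averages
z = z_score(104.0, background)
```

In this example the background mean is 100 and the average absolute deviation is 1, so a synthetic dataset averaging 104 reads in the region of interest lies 4 deviations above the background.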
In some embodiments, target reads for the remainder of the genome, aside from the specified chromosome corresponding to the synthetic sequencing datasets, may correspond to reads obtained from chromosomes including few or no CNVs. In at least one embodiment, each of the target reads for the remainder of the genome may correspond to sequencing reads obtained from a reference genome and/or to sequencing reads obtained from real samples having few or no CNVs. In some embodiments, one or more of the target reads for the remainder of the genome may correspond to sequencing reads obtained from chromosomes including one or more CNVs (e.g., reads from real samples or reference samples, and/or reads from synthesized chromosome sequencing reads). In some embodiments, a z-score may be determined for a region of interest for a chromosome and/or portion of a chromosome that does not include a CNV, such as a simulated CNV.
In at least one embodiment, calculating the potential impact of each of the plurality of synthetic CNVs on the fetal chromosomal abnormality call may further include calculating a statistical z-score change attributable to the at least one CNV represented by the respective synthetic sequencing dataset. For example, calculating the statistical z-score change attributable to at least one CNV represented by a synthetic sequencing dataset may include calculating a statistical z-score for the region of interest in the synthetic sequencing dataset with respect to a z-score from a corresponding background dataset. A difference (or change) in z-score between the synthetic sequencing dataset and the background dataset may be attributed and correlated to the at least one synthetic CNV. In some embodiments, calculated statistical z-score changes may each be correlated to a CNV size of the at least one of the plurality of synthetic CNVs.
In some embodiments, calculating the potential impact of each of the plurality of synthetic CNVs on the fetal chromosomal abnormality call may further include determining whether or not a value of statistical significance, such as a statistical z-score, calculated for each of the plurality of synthetic CNVs is outside of a threshold range. For example, abnormality caller module 626 may use a specified range of z-scores to determine whether each of the plurality of synthetic CNVs is likely to affect a fetal chromosomal abnormality call for the specified chromosome during DNA-based noninvasive prenatal screening. In some embodiments, a range of z-scores determined to correlate to synthetic CNVs that are not likely to affect a fetal chromosomal abnormality call may range from about −6 to about 6, about −5 to about 5, about −4 to about 4, about −3.5 to about 3.5, about −3 to about 3, about −2.5 to about 2.5, or about −2 to about 2. A calculated z-score outside of at least one of these ranges may be determined to correlate to a synthetic CNV that is likely to affect a fetal chromosomal abnormality call, with a value outside a range corresponding to a potential false fetal chromosomal abnormality determination (i.e., false-positive or false-negative). In some embodiments, a z-score range may be adjusted based on other samples from a batch used to generate a synthetic sequencing dataset and/or based on characteristics of the synthetic sequencing dataset (e.g., fetal fraction).
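The range check described above may be sketched as follows. The function names and the default range of −3 to 3 (one of the example ranges listed above) are chosen for illustration; treating a z-score exactly at a boundary as within the range is an assumption of this sketch.

```python
def position_relative_to_range(z, z_range=(-3.0, 3.0)):
    """Report whether a z-score is below, within, or above a 'normal' range."""
    low, high = z_range
    if z < low:
        return "below"
    if z > high:
        return "above"
    return "within"


def likely_affects_call(z, z_range=(-3.0, 3.0)):
    """True if the z-score falls outside the range, i.e., the synthetic CNV
    may be likely to affect the fetal chromosomal abnormality call."""
    return position_relative_to_range(z, z_range) != "within"


synthetic_z_scores = [0.4, 3.6, -4.1]  # hypothetical z-scores for three synthetic datasets
flags = [likely_affects_call(z) for z in synthetic_z_scores]
```

A narrower or wider range (for example, one adjusted for fetal fraction or batch) can be passed through the `z_range` parameter.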
In some embodiments, the method may further include correlating each of the calculated statistical z-scores, or z-score changes, to a size of the at least one synthetic CNV represented in the corresponding synthetic sequencing dataset. For example, analysis module 628 shown in
In some embodiments, the method may further include correlating each of the calculated statistical z-scores, or z-score changes, to a type of the at least one CNV represented in the corresponding synthetic sequencing dataset. For example, analysis module 628 shown in
According to at least one embodiment, calculating the statistical z-score for the region of interest in the corresponding synthetic sequencing dataset may include calculating an average read count in the region of interest in the corresponding synthetic sequencing dataset. For example, calculating the statistical z-score for each of the plurality of synthetic sequencing datasets may include determining a number of target sequencing reads in each of a plurality of bins (see, e.g.,
In some embodiments, calculating the statistical z-score for each of the plurality of synthetic sequencing datasets may include calculating a statistical z-score for another region of interest in the corresponding synthetic sequencing dataset. Calculating the statistical z-score for the other region of interest in the corresponding synthetic sequencing dataset may, for example, include calculating an average read count in the other region of interest in the corresponding synthetic sequencing dataset. In at least one embodiment, one or more of the plurality of synthetic sequencing datasets may further include sequencing reads from one or more additional segments corresponding to real copy number variants in the respective real test samples.
According to some embodiments, one or more of the systems described herein may determine, based on the calculated potential impacts of the plurality of synthetic CNVs on the fetal chromosomal abnormality calls, at least one threshold feature value utilized in the DNA-based noninvasive prenatal screening to identify likely false fetal chromosomal abnormality calls. For example, analysis module 628 shown in
In some embodiments, analysis module 628 may determine the at least one threshold feature value based on correlations between z-scores and one or more characteristics of corresponding CNVs represented in the respective synthetic sequencing datasets. In at least one embodiment, the at least one threshold feature value may include a threshold percentage of a corresponding chromosome covered by at least one CNV and/or a threshold base pair length of at least one CNV in the specified chromosome. For example, numerous synthetic sequencing datasets for one or more other chromosomes may be used to determine correlations between z-scores and percentages of chromosomes covered by corresponding CNVs and/or base pair lengths of CNVs. These correlations may be utilized to determine one or more threshold values and/or ranges of values for CNVs that may be utilized in noninvasive prenatal screenings to identify likely false fetal chromosomal abnormality calls for one or more chromosomes. For example, a threshold CNV value may be determined based on identification of an increased potential for a false fetal chromosomal abnormality call above the threshold CNV value. In some embodiments, such correlations may be utilized to determine likelihoods of false fetal chromosomal abnormality calls for one or more chromosomes based on a percentage of a chromosome covered by one or more CNVs and/or a base pair length of one or more CNVs.
In some embodiments, a threshold percentage of a chromosome covered by at least one maternal CNV may be utilized as a threshold CNV value in DNA-based noninvasive prenatal screening of more than one chromosome. For example, while human chromosome 21 has far fewer base pairs (approximately 48 Mb) than human chromosome 13 (approximately 115 Mb), the same or substantially the same threshold percentage of a chromosome covered by at least one maternal CNV may be utilized in noninvasive prenatal screening for fetal chromosomal abnormality in both chromosome 21 and chromosome 13. While a much longer CNV may be necessary to potentially trigger a false fetal chromosomal abnormality call for chromosome 13 than for chromosome 21, the threshold percentage of the chromosome occupied by the CNVs, above which a false fetal chromosomal abnormality call may be triggered, may be the same or substantially the same for both chromosome 13 and chromosome 21.
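The relationship between a shared threshold percentage and chromosome-specific CNV lengths may be illustrated with the following sketch. The approximate chromosome lengths are taken from the passage above; the 5% threshold, the CNV coordinates, and the function names are hypothetical values chosen only for illustration.

```python
# Approximate chromosome lengths in base pairs, as stated in the text above.
CHROMOSOME_LENGTH_BP = {"chr13": 115_000_000, "chr21": 48_000_000}


def percent_covered(cnv_intervals, chromosome_length):
    """Percentage of a chromosome covered by CNVs given as half-open (start, end)
    intervals; overlapping or adjacent intervals are merged before summing."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(cnv_intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100 * covered / chromosome_length


def cnv_length_at_threshold(threshold_percent, chromosome_length):
    """CNV length in base pairs that corresponds to a threshold percentage."""
    return threshold_percent * chromosome_length / 100


HYPOTHETICAL_THRESHOLD_PERCENT = 5  # illustrative only; not a value from the disclosure
lengths_at_threshold = {
    name: cnv_length_at_threshold(HYPOTHETICAL_THRESHOLD_PERCENT, length)
    for name, length in CHROMOSOME_LENGTH_BP.items()
}
chr21_coverage = percent_covered(
    [(0, 1_000_000), (500_000, 2_400_000)], CHROMOSOME_LENGTH_BP["chr21"]
)
```

With these illustrative numbers, the same 5% threshold corresponds to a 2.4 Mb CNV on chromosome 21 but a 5.75 Mb CNV on chromosome 13.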
In some embodiments, the at least one threshold feature value may be utilized in response to certain factors during noninvasive prenatal screening. For example, the at least one threshold feature value may be utilized in response to at least one positive fetal chromosomal abnormality call (e.g., an initial aneuploidy call) by an abnormality caller. In at least one embodiment, when an abnormality caller returns a positive call indicating a fetal chromosomal abnormality (e.g., trisomy, monosomy, microdeletion, microduplication, etc.) in a chromosome during noninvasive prenatal screening, the at least one threshold feature value may be utilized to further review and/or confirm the positive call. For example, quality-control metrics and/or manual review, such as computer-assisted manual review, of the sequenced cfDNA sample may be utilized to identify a maternal CNV, such as a duplication, in the chromosome for which the fetal aneuploidy was called. If a maternal CNV, or likely maternal CNV, is identified in the chromosome, the size of the CNV may be calculated. The threshold feature value may be utilized to determine whether the CNV likely resulted in a false-positive fetal chromosomal abnormality call. For example, if the CNV value (e.g., CNV size) is above the threshold feature value, the positive fetal chromosomal abnormality call may be determined to likely be a false-positive call. However, if the CNV value is below the threshold feature value, the positive fetal chromosomal abnormality call may be determined to likely be a true-positive call. Such a determination may result in more accurate false-positive fetal chromosomal abnormality determinations during noninvasive prenatal screening, while also preventing expectant mothers from unnecessarily undertaking invasive follow-up testing to confirm the existence of a fetal chromosomal abnormality in cases where the noninvasive prenatal screening produces a false-positive call due to a maternal CNV.
In some embodiments, the impact of a false fetal chromosomal abnormality call (e.g., false-positive or false-negative) due to a maternal CNV may be mitigated by identifying the location and/or type of maternal CNV and performing further steps to undo the effect of the maternal CNV on fetal chromosomal abnormality detection.
In some embodiments, the at least one threshold feature value may be utilized in response to at least one negative fetal chromosomal abnormality call by an abnormality caller. In at least one embodiment, when an abnormality caller returns a negative fetal chromosomal abnormality call for a chromosome during noninvasive prenatal screening, the at least one threshold feature value may be utilized to further review and/or confirm the negative call. For example, quality-control metrics and/or manual review, such as computer-assisted manual review, of the sequenced cfDNA sample may be utilized to identify a maternal CNV, such as a deletion, in the chromosome. If a maternal CNV, or likely maternal CNV, is identified in the chromosome, the size of the CNV may be calculated. The threshold feature value may be utilized to determine whether the CNV likely resulted in a false-negative fetal chromosomal abnormality call. For example, if the CNV value (e.g., CNV size) is above the threshold feature value, the negative fetal chromosomal abnormality call may be determined to likely be a false-negative call. However, if the CNV value is below the threshold feature value, the negative fetal chromosomal abnormality call may be determined to likely be a true-negative call.
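The review of positive and negative calls against a threshold feature value, as described in the preceding paragraphs, may be summarized in the following sketch. The function name, the returned labels, and the example values are hypothetical. The text does not specify how a CNV value exactly equal to the threshold is handled; this sketch treats only values strictly above the threshold as exceeding it, which is an assumption.

```python
def review_call(call_is_positive, maternal_cnv_value, threshold_feature_value):
    """Re-interpret an abnormality call in light of an identified maternal CNV.

    `maternal_cnv_value` is a CNV feature such as size or percentage of the
    chromosome covered, or None if no maternal CNV was identified.
    """
    if maternal_cnv_value is None:
        # No maternal CNV identified: the original call is left unchanged.
        return "positive" if call_is_positive else "negative"
    exceeds_threshold = maternal_cnv_value > threshold_feature_value
    if call_is_positive:
        return "likely false positive" if exceeds_threshold else "likely true positive"
    return "likely false negative" if exceeds_threshold else "likely true negative"


# Hypothetical review of a positive call with a maternal duplication covering
# 7% of the chromosome against an illustrative 5% threshold.
outcome = review_call(True, 7.0, 5.0)
```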
In some embodiments, the method may include determining, based on the calculated potential impacts of the plurality of synthetic copy number variants on the fetal chromosomal abnormality calls, robustness of a fetal abnormality caller. For example, analysis module 628 may determine, based on the calculated potential impacts of the plurality of synthetic CNVs on the fetal chromosomal abnormality calls, robustness of one or more fetal abnormality callers. In some examples, the robustness may be determined based on the calculated potential impacts of the plurality of synthetic CNVs and potential or observed impacts of a plurality of real CNVs. In at least one embodiment, the method may further include modifying the fetal abnormality caller based on the determined robustness of the fetal abnormality caller. According to some embodiments, determining the robustness of the fetal abnormality caller may include determining a specificity of the fetal abnormality caller over a range of synthetic copy number variant sizes. For example, analysis module 628 may determine a specificity of the fetal abnormality caller over a range of synthetic CNVs, such as a range of percentages of a corresponding chromosome covered by a CNV.
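One way to tabulate the specificity of a fetal abnormality caller over a range of synthetic CNV sizes is sketched below. The sketch assumes that every simulated sample is truly unaffected, so that any positive call is a false positive; the function name, the size bins, and the simulated calls are hypothetical.

```python
from collections import defaultdict


def specificity_by_cnv_size(simulated_calls):
    """Compute specificity per CNV size bin.

    `simulated_calls` is an iterable of (size_bin, called_positive) pairs for
    synthetic datasets derived from unaffected samples, so specificity in each
    bin is the fraction of datasets called negative.
    """
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for size_bin, called_positive in simulated_calls:
        totals[size_bin] += 1
        if not called_positive:
            negatives[size_bin] += 1
    return {size_bin: negatives[size_bin] / totals[size_bin] for size_bin in sorted(totals)}


# Hypothetical calls, binned by percentage of the chromosome covered by the CNV.
simulated_calls = [
    (1, False), (1, False),
    (5, False), (5, True),
    (10, True), (10, True),
]
specificity = specificity_by_cnv_size(simulated_calls)
```

A caller whose specificity stays high as the size bin increases would be considered more robust to maternal CNVs than one whose specificity falls off at small CNV sizes.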
In at least one embodiment, the determined correlations between z-scores and one or more characteristics of corresponding CNVs represented in the respective synthetic sequencing datasets may be utilized to determine and/or improve the robustness of a fetal abnormality caller utilized in DNA-based noninvasive prenatal screening. For example, such correlations may demonstrate that a particular abnormality caller (e.g., an outlier-robust algorithm) is likely to correctly identify euploidies and fetal chromosomal abnormalities (e.g., aneuploidies, microdeletions, and/or microduplications) with high specificity in fetal DNA when the maternal DNA in the cfDNA sample includes one or more CNVs in a chromosome of interest. The correlations may be used to modify one or more fetal abnormality callers and/or to select a fetal abnormality caller that is best suited to identify fetal chromosomal abnormalities in cfDNA samples having a range of maternal CNV sizes. Moreover, these correlations may demonstrate that the abnormality caller is likely to correctly identify euploidies and fetal chromosomal abnormalities in fetal DNA up to a determined maternal CNV size (e.g., a threshold CNV size) in the chromosome of interest. In some embodiments, the threshold feature value may differ depending on the type of maternal CNV (e.g., duplication and/or deletion) in the chromosome of interest and/or based on the type of call (e.g., positive or negative fetal chromosomal abnormality) indicated by an abnormality caller during noninvasive prenatal screening. In at least one embodiment, the threshold feature may additionally or alternatively differ based on the amount of fetal fraction in a given cfDNA sample (e.g., a sample including a high fetal fraction may be impacted less by CNVs due to a better sample signal obtained from the fetal fraction).
According to some embodiments, calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include calculating a potential impact of each of the plurality of synthetic copy number variants on a fetal chromosomal abnormality call for a specified chromosome that includes the region of interest during DNA-based noninvasive prenatal screening. For example, abnormality caller module 626 may utilize a synthetic CNV in chromosome 21 to calculate the potential impact of the synthetic CNV on a fetal chromosomal abnormality call for chromosome 21. Additionally or alternatively, calculating the potential impact of each of the plurality of synthetic copy number variants on the fetal chromosomal abnormality call may further include calculating a potential impact of each of the plurality of synthetic copy number variants on a fetal chromosomal abnormality call for a chromosome that does not include the region of interest during DNA-based noninvasive prenatal screening. For example, abnormality caller module 626 may utilize a synthetic CNV in a chromosome other than chromosome 21 to calculate the potential impact of the synthetic CNV on a fetal chromosomal abnormality call for chromosome 21.
In some embodiments, the method may further include calculating a potential impact of each of a plurality of real copy number variants on a fetal chromosomal abnormality call during the DNA-based noninvasive prenatal screening based on a plurality of real sequencing datasets each including genetic sequencing data of a real reference sample including one of the plurality of real copy number variants. The real copy number variants may be CNVs observed in one or more real test samples. Additionally, determining the at least one threshold feature value utilized in the DNA-based noninvasive prenatal screening may further include determining the at least one threshold feature value based on the calculated potential impacts of both the plurality of synthetic copy number variants and the plurality of real copy number variants on the fetal chromosomal abnormality calls. For example, analysis module 628 in
In at least one embodiment, the method may further include calculating a potential impact of each of a plurality of real sequencing datasets on a fetal chromosomal abnormality call for a specified chromosome during the DNA-based noninvasive prenatal screening, the real sequencing datasets corresponding to sequenced cfDNA samples determined to have at least one copy number variant in the specified chromosome. For example, abnormality caller module 626 in
In some embodiments, determining the at least one threshold feature value utilized in the DNA-based noninvasive prenatal screening may further include determining the at least one threshold feature value based on the calculated potential impacts of both the plurality of synthetic sequencing datasets and the plurality of real sequencing datasets on the fetal chromosomal abnormality calls. For example, analysis module 628 in
mCNVs may be common on the chromosomes that noninvasive prenatal screens frequently interrogate (4.5% of patients have an mCNV on chromosome 13, 18, or 21) and can cause frequent false positives if not properly neutralized at the algorithmic level. Even noninvasive prenatal tests that share a common sequencing approach (e.g., whole genome sequencing (WGS) of cfDNA) may nevertheless have very different test specificities based on the sophistication of their mCNV handling. Using 87,255 empirical and 30,000 simulated samples, the impact on specificity of various mCNV-mitigation strategies was quantified and a very wide range of values was observed. As will be described in greater detail below, noninvasive prenatal screening approaches described herein, which may exclude bins in mCNVs from downstream calculations, may reduce the expected rate of mCNV-caused false positives roughly 600-fold relative to the algorithms that were used in the early iterations of WGS-based noninvasive prenatal screens and that may still be used in practice in clinical laboratories (1 in 580,000 vs. 1 in 960 false positives across trisomies 13, 18, and 21; see, e.g.,
Algorithmic analysis approaches tailored to mCNVs, as described herein, may result in better specificity than strategies that have robust features but are not mCNV-specific. For example, a “Value-filtering” analysis strategy that excludes genomic bins based on their copy-number values (see, e.g.,
Though mostly tailored to retain specificity, mCNV-mitigation approaches may be designed to retain sensitivity for aneuploidies. With the “mCNV filtering” analysis strategy, the small values and variance of ΔZdup mean that mCNVs may minimally affect the z-score in either direction, suggesting that the filtering process does not compromise sensitivity. The “mCNV filtering” analysis strategy may slightly boost sensitivity by avoiding false negative results in trisomic samples where the aneuploidy-inflated z-score is lowered to normal levels due to a maternal deletion.
Additionally, mCNVs on non-tested chromosomes (i.e., autosomes other than chromosomes 13, 18, or 21)—or even mCNVs in other patient samples—could affect the z-score of a test chromosome. WGS-based noninvasive prenatal screens often involve normalization of NGS read depth to calculate a z-score, and this normalization could include one or many chromosomes, as well as other samples in a background cohort. Robust normalization, including a large number of background samples and/or filtering out mCNVs before normalization, can mitigate spurious z-score changes due to cryptic mCNVs in the analysis pipeline. Expert manual review of both z-scores and bin-level copy-number data across all autosomes can further safeguard against mCNV-caused false positives.
With proper algorithm design and extensive testing that leverages empirical and simulated data, as described herein, high specificity in noninvasive prenatal screens may be possible even in the presence of mCNVs that range widely in size. Importantly, by using the “mCNV-filtering” analysis strategy described herein, achieving robustness to mCNVs—and the corresponding rise in positive predictive value—may not compromise detection of true aneuploidies and, thereby, may preserve both high sensitivity and a low test-failure rate. While the identification and analysis of mCNVs may provide biological insight into the impact of large copy-number variants, mCNV removal upstream of fetal aneuploidy assessment may be important to maintain exemplary test performance, which will be especially critical as noninvasive prenatal screening adoption increases in the wider, general obstetric population.
NGS device 910 may include any suitable device or a plurality of devices for isolating polynucleotide fragments and sequencing the isolated polynucleotide sequences. NGS device 910 may include a manual, automated, or semi-automated device for performing any of the NGS procedures and steps as described herein. As will be described in greater detail below, modules 922 may include an abnormality caller module 924 that identifies abnormalities (e.g., aneuploidies, microdeletions, microduplications, etc.) in fetal DNA and an analysis module 926 that determines CNVs in maternal chromosomes and identifies likely true and/or false fetal chromosomal abnormality determinations based on threshold feature values. Modules 922 may also include a correction module 928 that adjusts sequencing read quantities and/or z-scores to compensate for CNVs.
In certain embodiments, one or more of modules 922 in
As illustrated in
As illustrated in
As illustrated in
At step 1004, one or more of the systems described herein may sequence each of the cfDNA fragments to obtain a plurality of fragment sequencing reads. For example, NGS device 910 in
At step 1006, one or more of the systems described herein may identify target sequencing reads of the plurality of fragment sequencing reads, the identified target sequencing reads being mappable to specified locations of a reference genome. For example, abnormality caller module 924 in
In at least one embodiment, one or more of the systems described herein may identify target sequencing reads by aligning cfDNA fragment sequence to a reference sequence. For example, abnormality caller module 924 in
The alignment data output may be provided in the format of a computer file. In certain embodiments, the output is a FASTA file, VCF file, text file, or an XML file containing sequence data such as a sequence of the nucleic acid aligned to a sequence of the reference genome. In other embodiments, the output contains coordinates or a string describing one or more mutations in the subject nucleic acid relative to the reference genome. Alignment strings known in the art include Simple UnGapped Alignment Report (SUGAR), Verbose Useful Labeled Gapped Alignment Report (VULGAR), and Compact Idiosyncratic Gapped Alignment Report (CIGAR) (Ning, Z., et al., Genome Research 11(10):1725-9 (2001)). In some embodiments, the output is a sequence alignment—such as, for example, a sequence alignment map (SAM) or binary alignment map (BAM) file—including a CIGAR string (the SAM format is described, e.g., in Li, et al., The Sequence Alignment/Map format and SAMtools, Bioinformatics, 2009, 25(16):2078-9). In some embodiments, CIGAR displays or includes gapped alignments one-per-line. CIGAR is a compressed pairwise alignment format reported as a CIGAR string. In some embodiments, a second alignment using a second algorithm may be performed after a first alignment using a first algorithm. In some examples, filtering based on mapping quality may be optionally performed.
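The CIGAR strings mentioned above can be illustrated with a minimal parsing sketch. The function names (`parse_cigar`, `reference_span`, `query_length`) are hypothetical and are not part of any module described herein; the operation letters, and which of them consume reference or query bases, follow the published SAM format.

```python
import re

# One CIGAR operation is a run length followed by a single operation letter.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Per the SAM format: operations that advance along the reference / the read.
REF_CONSUMING = set("MDN=X")
QUERY_CONSUMING = set("MIS=X")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string such as '10M2I5M' into (length, op) pairs."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Reject strings containing anything other than well-formed operations.
    if not ops or "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)
```

The reference span is what determines which genomic bins a read is counted in during the read-quantity step, which is why it is computed separately from the read length.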
At step 1008, one or more of the systems described herein may determine, out of the identified target sequencing reads, a quantity of target sequencing reads for a region of interest. For example, abnormality caller module 924 in
At step 1010, one or more of the systems described herein may calculate a statistical z-score for the region of interest based on the quantity of target sequencing reads for the region of interest. For example, abnormality caller module 924 in
In some embodiments, calculating the statistical z-score for the specified chromosome may include calculating a percentage of the quantity of the target sequencing reads for the specified chromosome relative to the total quantity of target sequencing reads. In some embodiments, abnormality caller module 924 may calculate a z-score (i.e., zcfDNA) using the percentage of the quantity of the target sequencing reads for the specified chromosome relative to the total quantity of target sequencing reads according to the following Equation (2):
\[ z_{cfDNA} = \frac{\%_{cfDNA} - \mathrm{Med}\%_{reference}}{\mathrm{MAD}_{reference}} \tag{2} \]
where %cfDNA is the percentage of the quantity of the target sequencing reads for the specified chromosome with respect to the total quantity of target sequencing reads for the genome, Med%reference is the average percentage of the target sequencing reads for a sample population and/or reference population for the specified chromosome, and MADreference is an average absolute deviation for the sample population and/or reference population for the specified chromosome. Additionally or alternatively, any suitable technique for calculating a z-score, or any other value of statistical significance, as described herein may be utilized. In at least one embodiment, calculating the statistical z-score for the region of interest based on the quantity of target sequencing reads for the region of interest may include calculating the statistical z-score for the region of interest based on an average number of target sequencing reads per bin for a plurality of bins corresponding to the region of interest. For example, the average number of reads per bin for a background based on reference samples may be subtracted from the average number of reads per bin for the sample, and the difference may be divided by the average absolute deviation (or dispersion) of the background.
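The z-score calculation described above may be sketched as follows. This is a minimal illustration rather than the implementation of abnormality caller module 924. It assumes that Med%reference and MADreference are the median and the median absolute deviation of the reference percentages, which is one reading of the "average" and "average absolute deviation" named in the text. The function names and the reference values in the usage note are hypothetical.

```python
from statistics import median


def percent_reads(chrom_reads: int, total_reads: int) -> float:
    """Percentage of all target reads that map to the specified chromosome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * chrom_reads / total_reads


def z_score(sample_pct: float, reference_pcts: list[float]) -> float:
    """(sample - center) / dispersion, using the reference population.

    Assumption: center is the median and dispersion is the median absolute
    deviation (MAD) of the reference percentages.
    """
    center = median(reference_pcts)
    mad = median([abs(p - center) for p in reference_pcts])
    if mad == 0:
        raise ValueError("reference population has zero dispersion")
    return (sample_pct - center) / mad
```

With a hypothetical reference population of `[1.28, 1.30, 1.32, 1.29, 1.31]` percent, a sample at 1.35 percent lies about five dispersion units above the center.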
At step 1012, one or more of the systems described herein may determine whether the calculated statistical z-score for the region of interest is outside of a predetermined z-score range, a calculated statistical z-score outside of the predetermined z-score range representing a positive call for a fetal chromosomal abnormality in the region of interest of the fetal DNA. For example, abnormality caller module 924 in
In some embodiments, abnormality caller module 924 may use a specified range of z-scores, with the upper limit of the specified range being a threshold value for a fetal aneuploidy call. In some embodiments, a range of z-scores may range from about −6 to about 6, about −5 to about 5, about −4 to about 4, about −3.5 to about 3.5, about −3 to about 3, about −2.5 to about 2.5, or about −2 to about 2. A calculated statistical z-score greater than an upper limit of at least one of these ranges may be determined to correlate to a likely fetal aneuploidy (e.g., trisomy) and a z-score below a lower limit of at least one of these ranges may be determined to correlate to a likely fetal aneuploidy (e.g., monosomy). Accordingly, abnormality caller module 924 may indicate a positive call for fetal aneuploidy based on a z-score greater than the upper limit or less than a lower limit of the specified range.
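The range-based call described above reduces to a comparison against the two limits of the range. The sketch below assumes the range of about −3 to about 3, which is only one of the ranges listed; the function name and the returned labels are hypothetical.

```python
def call_from_z(z: float, lower: float = -3.0, upper: float = 3.0) -> str:
    """Classify a z-score against a predetermined range [lower, upper]."""
    if lower >= upper:
        raise ValueError("lower limit must be below upper limit")
    if z > upper:
        # Over-representation of reads, e.g., consistent with a trisomy.
        return "positive-high"
    if z < lower:
        # Under-representation of reads, e.g., consistent with a monosomy.
        return "positive-low"
    return "negative"
```

Keeping the limits as parameters reflects that the range itself may be tuned, for example based on synthetic and/or real sequencing datasets.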
In some embodiments, the threshold feature z-score value and/or range may be a z-score value and/or range that has been determined based on analysis of a plurality of synthetic sequencing datasets and/or a plurality of real sequencing datasets. The threshold z-score value and/or range may be determined in accordance with any of the systems and methods disclosed herein. At step 1014, one or more of the systems described herein may determine whether maternal genomic DNA from the individual includes at least one copy number variant. For example, when the calculated statistical z-score for the specified chromosome is determined, based on the statistical z-score for the specified chromosome, to be greater than a threshold statistical z-score, analysis module 926 in
Analysis module 926 may determine whether maternal genomic DNA from the individual includes at least one copy number variant in a variety of ways. In one example, when abnormality caller module 924 returns a positive call indicating a fetal chromosomal abnormality (e.g., trisomy, monosomy, microdeletion, microduplication, etc.) during noninvasive prenatal screening based on the calculated statistical z-score being outside of a specified range, quality-control metrics and/or manual review, such as computer-assisted manual review, of the sequenced cfDNA sample may be utilized by analysis module 926 to identify a maternal CNV, such as at least one duplication and/or deletion, in the chromosome for which the fetal aneuploidy was called and/or in another chromosome. Any suitable analysis of the cfDNA sample and/or data obtained from the cfDNA sample (e.g., sequencing data) may be utilized to identify the maternal CNV, without limitation. Maternal CNVs may be identified based on the sample and/or corresponding data utilized to obtain the z-score and make the aneuploidy call. In some embodiments, an additional sample may be obtained from the individual or a stored sample may be retested if necessary to confirm the presence or absence of a maternal CNV. For example, genomic DNA may be extracted from a stored blood or saliva sample and retested to confirm the presence or absence of a maternal CNV. In at least one embodiment, a sample of the maternal DNA may have been obtained and/or sequenced prior to pregnancy and/or prior to obtaining the cfDNA sample, providing maternal sequencing data for maternal DNA that does not include fetal DNA and/or that includes a much lower quantity of fetal DNA. In some embodiments, an extracted genomic DNA sample obtained during pregnancy (e.g., from blood, saliva, etc.) may include a minimal quantity of fetal DNA.
In some embodiments, a copy caller may be utilized to identify one or more maternal CNVs and/or potential maternal CNVs. For example, a hidden Markov model (HMM) (see, e.g., Boufounos, P., et al., Journ. of the Franklin Inst. 341: 23-36 (2004)), a Gaussian mixture model (see, e.g., U.S. Patent Application No. 62/452,974), a breakpoint caller (see, e.g., U.S. Patent Application No. 62/452,985), and/or any other suitable technique may be utilized to identify one or more CNVs in the specified chromosome, without limitation. Various systems and methods that may be utilized for identifying CNVs may be found, for example, in U.S. Pat. No. 9,092,401, U.S. Patent Publication No. 2016/0140289, U.S. Patent Publication No. 2015/0205914, and U.S. Patent Publication No. 2016/0188793. An operator of system 900 may manually initiate and/or perform at least a portion of the CNV determination review utilizing abnormality caller 924.
In some embodiments, one or more of the systems described herein may calculate read depths for base positions of the plurality of target polynucleotide fragments relative to each base position of a reference sequence. For example, analysis module 926 in
At step 1016, one or more of the systems described herein may determine, when the maternal genomic DNA from the individual is determined to include at least one copy number variant, whether a feature value of the at least one copy number variant is greater than a threshold feature value, a feature value greater than the threshold feature value indicating that a call for the fetal chromosomal abnormality is likely a false call. For example, when a maternal CNV, or likely maternal CNV, is identified in one or more chromosomes (including the specified chromosome and/or one or more other chromosomes), analysis module 926 in
In some embodiments, when a maternal CNV, or likely maternal CNV, is identified in one or more chromosomes (including the specified chromosome and/or one or more other chromosomes), the size of the CNV may be calculated. The threshold feature value may be utilized to determine whether the CNV likely resulted in a false fetal chromosomal abnormality call. For example, if the CNV size is above a predetermined threshold CNV size, a positive fetal chromosomal abnormality call may be determined to likely be a false-positive call. However, if the CNV size is below the threshold CNV size, a positive fetal chromosomal abnormality call may be determined to likely be a true-positive call. In some embodiments, the CNV type (e.g., duplication or deletion) may be determined. If, for example, the CNV includes at least one duplication in the specified chromosome, the size of the at least one duplication (e.g., CNV base pair length and/or percentage of chromosome covered by the CNV) may be determined for the at least one duplication (i.e., size of the at least one duplication or combined size of multiple duplications). If the length of the CNV(s) and/or percentage of chromosome covered by the CNV(s) exceeds a predetermined threshold length and/or percentage of chromosome, then a positive fetal chromosomal abnormality call may be determined to likely be a false-positive call. The threshold feature value may comprise any suitable CNV length and/or percentage of chromosome covered by the CNV, without limitation. For example, the threshold percentage of a chromosome covered by the at least one CNV may include a percentage of about 4% or more (e.g., about 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30% or more of the chromosome covered by the at least one CNV).
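The size-threshold check described above may be sketched as follows. The `Cnv` record, the function names, the 4% default threshold, and the chromosome length used in the usage note are illustrative assumptions; overlapping duplications are merged so that the combined size of multiple duplications is not double-counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cnv:
    start: int  # first base of the variant (0-based, inclusive)
    end: int    # one past the last base (exclusive)
    kind: str   # "duplication" or "deletion"


def covered_length(cnvs: list[Cnv], kind: str) -> int:
    """Total bases covered by CNVs of one kind, merging overlaps."""
    intervals = sorted((c.start, c.end) for c in cnvs if c.kind == kind)
    total = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def likely_false_positive(cnvs: list[Cnv], chrom_length: int,
                          threshold_fraction: float = 0.04) -> bool:
    """True when duplications cover more of the chromosome than the threshold."""
    fraction = covered_length(cnvs, "duplication") / chrom_length
    return fraction > threshold_fraction
```

For example, two overlapping duplications spanning 2 Mb in total on a hypothetical 40 Mb chromosome cover 5% of it, which exceeds a 4% threshold and would flag a positive call as a likely false positive.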
Such a determination may result in more accurate true-positive and false-positive fetal chromosomal abnormality determinations during noninvasive prenatal screening. Additionally, identifying likely false chromosomal abnormality calls, such as false-positive chromosomal abnormality calls, during noninvasive prenatal screening may enable expectant mothers to avoid unnecessarily undertaking invasive follow-up testing to confirm the existence of a fetal chromosomal abnormality in cases where the screening produces the likely false-positive call due to a maternal CNV.
In some embodiments, the present systems and methods may additionally or alternatively be utilized to determine whether negative chromosomal abnormality calls are true-negative or false-negative calls. For example, when abnormality caller module 924 returns a negative call for fetal chromosomal abnormality in a specified chromosome during noninvasive prenatal screening based on the calculated statistical z-score being within a specified range, quality-control metrics and/or manual review, such as computer-assisted manual review, of the sequenced cfDNA sample may be utilized to identify a maternal CNV, such as a deletion, in the chromosome for which the negative call was returned. In at least one embodiment, review of the sample may be performed when the z-score resulting in the negative call is within a specified sub-range, such as a sub-range adjacent to the upper limit or lower limit of the specified z-score range. Such a sub-range may represent z-scores that, while not greater than an upper z-score value or less than a lower z-score value of a predetermined range utilized to make a positive chromosomal abnormality call, are nonetheless within sufficiently close proximity to the upper or lower z-score value to merit further review for a potential false-negative call. For example, a sub-range of z-scores may range from a z-score of about 1, about 1.5, about 2, about 2.5, about 3, about 3.5, about 4, about 4.5, about 5, or about 5.5, to an upper limit, or threshold z-score value (e.g., about 6, about 5, about 4, about 3.5, about 3, about 2.5, or about 2). Additionally or alternatively, a sub-range of z-scores may range from a z-score of about −1, about −1.5, about −2, about −2.5, about −3, about −3.5, about −4, about −4.5, about −5, or about −5.5, to a lower limit, or threshold z-score value (e.g., about −6, about −5, about −4, about −3.5, about −3, about −2.5, or about −2).
A calculated statistical z-score within the specified sub-range may be determined to correlate to a potential false-negative chromosomal abnormality call.
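The sub-range test for potential false-negative calls may be sketched as below. The call range of −3 to 3 and the margin of 1 (giving sub-ranges of 2 to 3 and −3 to −2) are assumed values chosen from the options listed above, and the function name is hypothetical.

```python
def potential_false_negative(z: float,
                             call_range: tuple[float, float] = (-3.0, 3.0),
                             margin: float = 1.0) -> bool:
    """True when a negative call sits close enough to a limit to merit review."""
    lower, upper = call_range
    if z < lower or z > upper:
        # Outside the range is a positive call, not a negative one.
        return False
    # Within `margin` of either limit: candidate for maternal CNV review.
    return z >= upper - margin or z <= lower + margin
```

A z-score of 2.5 would be routed to review under these assumptions, while a z-score of 0.5 would not.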
In some embodiments, when a z-score is calculated and determined to be within a sub-range indicating a potential false-negative chromosomal abnormality call, analysis module 926 may determine whether maternal genomic DNA from the individual includes at least one copy number variant in the specified chromosome, such as one or more deletions, in a variety of ways. For example, when an abnormality caller 924 returns a negative chromosomal abnormality call for the specified chromosome, quality-control metrics and/or manual review, such as computer-assisted manual review, of the sequenced cfDNA sample may be utilized to identify a maternal CNV, such as at least one deletion. Any suitable analysis of the cfDNA sample and/or data obtained from the cfDNA sample (e.g., sequencing data) may be utilized to identify the maternal CNV as described herein, without limitation.
In at least one embodiment, when a CNV or potential CNV, such as at least one deletion, is identified, analysis module 926 in
According to some embodiments, the method may further include adjusting, when the feature value of the at least one copy number variant is greater than the threshold feature value, a quantity of target sequencing reads in at least one variant region corresponding to the at least one copy number variant to generate an adjusted set of target sequencing reads. For example, correction module 928 in
In some embodiments, adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include increasing and/or decreasing the number of target sequencing reads in the at least one variant region corresponding to the at least one CNV. According to some embodiments, adjusting the quantity of target sequencing reads in the at least one variant region to generate the adjusted set of target sequencing reads may include removing target sequencing reads in the at least one variant region. In some embodiments, correction module 928 may utilize various techniques catered to a specific cfDNA sample or type of cfDNA sample. In some embodiments, the quantity of target sequencing reads may be adjusted by reducing or increasing target sequencing read counts in one or more bins corresponding to the at least one CNV. In at least one example, correction module 928 may additionally or alternatively ignore certain sequencing read bins based on specified criteria. For example, outlier bins, such as bins including too many or too few reads, may be removed or ignored (e.g., only bins having sequencing reads in the 5th to 95th percentile based on read counts may be analyzed). Corresponding bins in background samples may also be removed or ignored. A number of bins removed may be selected to ensure that a resulting fetal chromosomal abnormality call utilizing the adjusted set of target sequencing reads maintains a desired level of specificity.
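The bin-removal adjustments described above may be sketched as follows. This is an illustration under stated assumptions, not the behavior of correction module 928: bins are addressed by integer index, the bins overlapping a maternal CNV are supplied as a set, and percentile cut points are taken with the standard library's inclusive quantile method. The function names are hypothetical.

```python
from statistics import quantiles


def bins_to_keep(counts: list[float], cnv_bins: set[int],
                 low_pct: int = 5, high_pct: int = 95) -> list[int]:
    """Indices of bins retained after CNV masking and percentile filtering."""
    # Step 1: drop every bin that overlaps an identified maternal CNV.
    candidates = [i for i in range(len(counts)) if i not in cnv_bins]
    # Step 2: drop outlier bins outside the [low_pct, high_pct] percentiles.
    # quantiles(..., n=100) returns 99 cut points; index k-1 is percentile k.
    cuts = quantiles([counts[i] for i in candidates], n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    return [i for i in candidates if low <= counts[i] <= high]


def select(counts: list[float], keep: list[int]) -> list[float]:
    """Apply the same bin selection to a sample or to a background sample."""
    return [counts[i] for i in keep]
```

Returning indices rather than filtered counts lets the same selection be applied to the corresponding bins of each background sample, so that the sample and the background are compared over an identical set of bins.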
The method may also include generating an adjusted quantity of target sequencing reads for the region of interest based on the adjusted set of target sequencing reads. For example, correction module 928 in
In some embodiments, the method may include calculating an adjusted statistical z-score for the region of interest based on the adjusted quantity of target sequencing reads. For example, abnormality caller module 924 in
In some embodiments, the method may further include calculating, when the feature value of the at least one copy number variant is greater than the threshold feature value, an adjusted statistical z-score for the region of interest and determining whether the adjusted statistical z-score for the region of interest is outside of the predetermined z-score range. For example, correction module 928 in
Any of the above-described adjustments to real sequencing reads and/or statistical z-scores, such as any of the above-described functionalities performed by correction module 928 in
As illustrated in
At step 1110, one or more of the systems described herein may adjust, when the maternal genomic DNA from the individual is determined to include at least one copy number variant, a quantity of target sequencing reads of the identified target sequencing reads in at least one variant region corresponding to the at least one copy number variant to generate an adjusted set of target sequencing reads. At step 1112, one or more of the systems described herein may determine, out of the identified target sequencing reads, a quantity of target sequencing reads for a region of interest.
At step 1114, one or more of the systems described herein may generate an adjusted quantity of target sequencing reads for the region of interest based on the adjusted set of target sequencing reads. At step 1116, one or more of the systems described herein may calculate a statistical z-score for the region of interest based on the adjusted quantity of target sequencing reads for the region of interest. At step 1118, one or more of the systems described herein may determine whether the calculated statistical z-score for the region of interest is outside of a predetermined z-score range, a calculated statistical z-score outside of the predetermined z-score range representing a positive call for a fetal chromosomal abnormality in the region of interest of the fetal DNA.
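The adjust-then-call flow of steps 1110 through 1118 may be sketched end to end as follows. The sketch assumes per-bin read depths that are already normalized, a background cohort supplied as a list of per-bin depth lists, removal of CNV bins as the adjustment, and a median and median-absolute-deviation z-score. All names and numeric values are hypothetical.

```python
from statistics import mean, median


def adjusted_call(sample_bins: list[float],
                  background_bins: list[list[float]],
                  cnv_bins: set[int],
                  z_range: tuple[float, float] = (-3.0, 3.0)) -> tuple[float, bool]:
    """Return (z-score, positive_call) after excluding maternal CNV bins."""
    # Adjust the read set: exclude bins in the variant region(s).
    keep = [i for i in range(len(sample_bins)) if i not in cnv_bins]
    # Adjusted quantity: average depth per retained bin.
    sample_depth = mean([sample_bins[i] for i in keep])
    # The same bins are excluded from every background sample.
    bg_depths = [mean([bg[i] for i in keep]) for bg in background_bins]
    center = median(bg_depths)
    mad = median([abs(d - center) for d in bg_depths])
    if mad == 0:
        raise ValueError("background cohort has zero dispersion")
    z = (sample_depth - center) / mad
    lower, upper = z_range
    return z, (z < lower or z > upper)
```

Passing an empty `cnv_bins` set gives the unadjusted result, which makes it straightforward to compare the call before and after the adjustment for the same sample.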
Computing system 1210 broadly represents any single or multi-processor computing device or system capable of executing computer-readable instructions. Examples of computing system 1210 include, without limitation, workstations, laptops, client-side terminals, servers, distributed computing systems, handheld devices, or any other computing system or device. In its most basic configuration, computing system 1210 may include at least one processor 1214 and a system memory 1216.
Processor 1214 generally represents any type or form of physical processing unit (e.g., a hardware-implemented central processing unit) capable of processing data or interpreting and executing instructions. In certain embodiments, processor 1214 may receive instructions from a software application or module. These instructions may cause processor 1214 to perform the functions of one or more of the example embodiments described and/or illustrated herein.
System memory 1216 generally represents any type or form of volatile or non-volatile storage device or medium capable of storing data and/or other computer-readable instructions. Examples of system memory 1216 include, without limitation, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory device. Although not required, in certain embodiments computing system 1210 may include both a volatile memory unit (such as, for example, system memory 1216) and a non-volatile storage device (such as, for example, primary storage device 1232, as described in detail below). In one example, one or more of modules 622 from
In some examples, system memory 1216 may store and/or load an operating system 1240 for execution by processor 1214. In one example, operating system 1240 may include and/or represent software that manages computer hardware and software resources and/or provides common services to computer programs and/or applications on computing system 1210. Examples of operating system 1240 include, without limitation, LINUX, JUNOS, MICROSOFT WINDOWS, WINDOWS MOBILE, MAC OS, APPLE'S IOS, UNIX, GOOGLE CHROME OS, GOOGLE'S ANDROID, SOLARIS, variations of one or more of the same, and/or any other suitable operating system.
In certain embodiments, example computing system 1210 may also include one or more components or elements in addition to processor 1214 and system memory 1216. For example, as illustrated in
Memory controller 1218 generally represents any type or form of device capable of handling memory or data or controlling communication between one or more components of computing system 1210. For example, in certain embodiments memory controller 1218 may control communication between processor 1214, system memory 1216, and I/O controller 1220 via communication infrastructure 1212.
I/O controller 1220 generally represents any type or form of module capable of coordinating and/or controlling the input and output functions of a computing device. For example, in certain embodiments I/O controller 1220 may control or facilitate transfer of data between one or more elements of computing system 1210, such as processor 1214, system memory 1216, communication interface 1222, display adapter 1226, input interface 1230, and storage interface 1234.
As illustrated in
As illustrated in
Additionally or alternatively, example computing system 1210 may include additional I/O devices. For example, example computing system 1210 may include I/O device 1236. In this example, I/O device 1236 may include and/or represent a user interface that facilitates human interaction with computing system 1210. Examples of I/O device 1236 include, without limitation, a computer mouse, a keyboard, a monitor, a printer, a modem, a camera, a scanner, a microphone, a touchscreen device, variations or combinations of one or more of the same, and/or any other I/O device.
Communication interface 1222 broadly represents any type or form of communication device or adapter capable of facilitating communication between example computing system 1210 and one or more additional devices. For example, in certain embodiments communication interface 1222 may facilitate communication between computing system 1210 and a private or public network including additional computing systems. Examples of communication interface 1222 include, without limitation, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, and any other suitable interface. In at least one embodiment, communication interface 1222 may provide a direct connection to a remote server via a direct link to a network, such as the Internet. Communication interface 1222 may also indirectly provide such a connection through, for example, a local area network (such as an Ethernet network), a personal area network, a telephone or cable network, a cellular telephone connection, a satellite data connection, or any other suitable connection.
In certain embodiments, communication interface 1222 may also represent a host adapter configured to facilitate communication between computing system 1210 and one or more additional network or storage devices via an external bus or communications channel. Examples of host adapters include, without limitation, Small Computer System Interface (SCSI) host adapters, Universal Serial Bus (USB) host adapters, Institute of Electrical and Electronics Engineers (IEEE) 1394 host adapters, Advanced Technology Attachment (ATA), Parallel ATA (PATA), Serial ATA (SATA), and External SATA (eSATA) host adapters, Fibre Channel interface adapters, Ethernet adapters, or the like. Communication interface 1222 may also allow computing system 1210 to engage in distributed or remote computing. For example, communication interface 1222 may receive instructions from a remote device or send instructions to a remote device for execution.
In some examples, system memory 1216 may store and/or load a network communication program 1238 for execution by processor 1214. In one example, network communication program 1238 may include and/or represent software that enables computing system 1210 to establish a network connection 1242 with another computing system (not illustrated in
Although not illustrated in this way in
As illustrated in
In certain embodiments, storage devices 1232 and 1233 may be configured to read from and/or write to a removable storage unit configured to store computer software, data, or other computer-readable information. Examples of suitable removable storage units include, without limitation, a floppy disk, a magnetic tape, an optical disk, a flash memory device, or the like. Storage devices 1232 and 1233 may also include other similar structures or devices for allowing computer software, data, or other computer-readable instructions to be loaded into computing system 1210. For example, storage devices 1232 and 1233 may be configured to read and write software, data, or other computer-readable information. Storage devices 1232 and 1233 may also be a part of computing system 1210 or may be a separate device accessed through other interface systems.
Many other devices or subsystems may be connected to computing system 1210. Conversely, all of the components and devices illustrated in
The computer-readable medium containing the computer program may be loaded into computing system 1210. All or a portion of the computer program stored on the computer-readable medium may then be stored in system memory 1216 and/or various portions of storage devices 1232 and 1233. When executed by processor 1214, a computer program loaded into computing system 1210 may cause processor 1214 to perform and/or be a means for performing the functions of one or more of the example embodiments described and/or illustrated herein. Additionally or alternatively, one or more of the example embodiments described and/or illustrated herein may be implemented in firmware and/or hardware. For example, computing system 1210 may be configured as an Application Specific Integrated Circuit (ASIC) adapted to implement one or more of the example embodiments disclosed herein.
In addition, one or more of the modules described herein may transform data, physical devices, and/or representations of physical devices from one form to another. Additionally or alternatively, one or more of the modules recited herein may transform a processor, volatile memory, non-volatile memory, and/or any other portion of a physical computing device from one form to another by executing on the computing device, storing data on the computing device, and/or otherwise interacting with the computing device.
The present invention is described in further detail in the following examples, which are offered to illustrate, but not in any way to limit, the scope of the invention as claimed. The attached figures are meant to be considered integral parts of the specification and description of the invention.
A plurality of real sequencing datasets was obtained from 87,255 real maternal cfDNA samples. Additionally, a plurality of synthetic sequencing datasets for 30,887 synthetic maternal cfDNA samples was generated in accordance with systems and methods described herein. A z-score for a chromosomal aneuploidy was calculated for chromosomes harboring mCNV duplications in the plurality of real sequencing datasets and the plurality of synthetic sequencing datasets.
Correlations between z-scores and percentages of respective chromosomes occupied by maternal copy number variants (duplications and deletions) as illustrated, for example, in
To determine which algorithmic features in a noninvasive prenatal screening pipeline minimize the effect of mCNVs on z-scores, various analysis approaches were used to collectively analyze numerous synthetic sequencing datasets generated in accordance with systems and methods described herein. Six different analysis strategies were used to calculate aneuploidy z-scores for synthetic sequencing datasets each including sequencing data representing various maternal duplications in chromosome 13, 18, or 21.
For each of chromosomes 13, 18, and 21, at least 10,000 mCNV-harboring samples were simulated, each using as a baseline a randomly chosen sample shown to be both euploid (via the “mCNV filtering” analysis strategy described below) and void of mCNVs. Most samples (83%) were chosen for exactly one round of simulation, with the rest used in several rounds (15% in two and 2% in three or more). The sizes of the mCNVs were selected to span a logarithmic range, and the position of each mCNV was randomly chosen. The mCNV size values used in downstream analyses were based on algorithm-detected boundaries rather than the simulated boundaries (e.g., a 3 Mb simulated duplication identified as being 2.8 Mb by the mCNV-finding algorithm is represented in the plots and associated analyses herein based on the 2.8 Mb size).
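The size and position sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `log_spaced_sizes` and `place_mcnv`, the size range, the chromosome length, and the uniform placement rule are hypothetical and are not taken from the disclosure, which states only that sizes span a logarithmic range and positions are chosen at random.

```python
import math
import random


def log_spaced_sizes(min_bp, max_bp, n):
    """Return n sizes (in bp) evenly spaced on a log10 scale, endpoints included."""
    if n < 2:
        raise ValueError("n must be at least 2")
    lo, hi = math.log10(min_bp), math.log10(max_bp)
    step = (hi - lo) / (n - 1)
    return [round(10 ** (lo + i * step)) for i in range(n)]


def place_mcnv(chrom_length_bp, size_bp, rng):
    """Pick a uniformly random start so that the mCNV fits inside the chromosome."""
    if size_bp > chrom_length_bp:
        raise ValueError("mCNV is larger than the chromosome")
    start = rng.randint(0, chrom_length_bp - size_bp)
    return start, start + size_bp


rng = random.Random(0)        # fixed seed so the sketch is reproducible
CHROM_LENGTH = 100_000_000    # hypothetical chromosome length, not a real value
sizes = log_spaced_sizes(200_000, 20_000_000, 5)   # hypothetical size range
intervals = [place_mcnv(CHROM_LENGTH, s, rng) for s in sizes]
```

Because downstream analyses use algorithm-detected boundaries, the sizes drawn here would serve only as inputs to the simulation, not as the values plotted.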
To calculate the specificity of each analysis strategy as a function of mCNV size, the z-score of a euploid sample harboring an mCNV was modeled as a random variable $Z = Z_{\mathrm{mCNV}-} + \Delta Z_{\mathrm{dup}}$. $Z_{\mathrm{mCNV}-}$ represents the z-score of a sample without an mCNV. It follows a standard normal distribution $N(\mu = 0, \sigma = 1)$ and is not a function of mCNV size. By contrast, for an mCNV of size $s$, $\Delta Z_{\mathrm{dup}}$ is normally distributed with mean $\mu_{\mathrm{dup}}$ and standard deviation $\sigma_{\mathrm{dup}}$ calculated from the $\Delta Z_{\mathrm{dup}}$ values of the 200 simulated samples whose mCNV sizes were closest to $s$. Assuming $Z_{\mathrm{mCNV}-}$ and $\Delta Z_{\mathrm{dup}}$ are independent, $Z$ is a normal random variable with mean $\mu_{\mathrm{dup}}$ and standard deviation $(1 + \sigma_{\mathrm{dup}}^{2})^{0.5}$. Since the simulations introduced mCNVs into otherwise euploid samples, any modeled positives (i.e., $Z = Z_{\mathrm{mCNV}-} + \Delta Z_{\mathrm{dup}} > 3$) were false positives. Furthermore, any modeled samples with $Z_{\mathrm{mCNV}-} > 3$ were considered to be statistical false positives. Hence, the false-positive rate (FPR) attributable to mCNVs was calculated by omitting these statistical false positives:
$$\mathrm{FPR}_{\mathrm{mCNV}} = P(Z_{\mathrm{mCNV}-} + \Delta Z_{\mathrm{dup}} > 3) - P(Z_{\mathrm{mCNV}-} > 3)$$
Specificity was calculated as $1 - \mathrm{FPR}_{\mathrm{mCNV}}$. The specificity as a function of mCNV size was estimated for each chromosome separately using simulated samples with mCNVs introduced on the chromosome of interest.
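The false-positive-rate model above can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, the normal tail probability is computed with the standard-library `math.erfc`, and the use of the sample (n−1) standard deviation in `local_moments` is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def normal_sf(x, mu=0.0, sigma=1.0):
    """Upper-tail probability P(X > x) for X ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))


def local_moments(sizes, delta_z, s, k=200):
    """Mean and SD of delta-Z over the k samples whose mCNV size is closest to s.

    Assumption: sample (n-1) standard deviation; the source does not specify.
    """
    nearest = sorted(zip(sizes, delta_z), key=lambda p: abs(p[0] - s))[:k]
    values = [dz for _, dz in nearest]
    return statistics.fmean(values), statistics.stdev(values)


def fpr_mcnv(mu_dup, sigma_dup, threshold=3.0):
    """FPR attributable to mCNVs: P(Z > threshold) minus the statistical baseline."""
    total_sigma = math.sqrt(1.0 + sigma_dup ** 2)   # SD of Z_mCNV- + delta-Z_dup
    return normal_sf(threshold, mu_dup, total_sigma) - normal_sf(threshold)


def specificity(mu_dup, sigma_dup, threshold=3.0):
    """Specificity with respect to mCNV-caused false positives."""
    return 1.0 - fpr_mcnv(mu_dup, sigma_dup, threshold)
```

With no shift and no added spread (`mu_dup = 0`, `sigma_dup = 0`), the two tail probabilities cancel and the mCNV-attributable false-positive rate is zero, which matches the intent of subtracting the statistical false positives.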
As a first step toward measuring the impact of mCNVs on noninvasive prenatal screening performance, mCNV frequency, size, and positional bias were surveyed in the 87,255 patient samples. Using a rolling-window z-score algorithm, mCNVs ≥200 kb were identified. On average, patients had 1.07 autosomal mCNVs, and 65% of patients had at least one mCNV. There were 37% more deletions than duplications overall, but duplications were generally larger than deletions (median sizes 360 kb and 260 kb, respectively; Kruskal-Wallis H-test p<0.05).
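The text names a rolling-window z-score algorithm without specifying it. The following is a minimal sketch of one possible detector, not the algorithm actually used: the function name `find_mcnvs`, the window length, the z cutoff, the MAD-based spread estimate, the bin size, and the toy data are all assumptions made for illustration. Only the 200 kb minimum size comes from the text.

```python
import math
import statistics


def find_mcnvs(bin_values, bin_size_bp, window=5, z_cut=4.0, min_size_bp=200_000):
    """Return (start_bp, end_bp, kind) calls, with kind "dup" or "del".

    Each window mean is converted to a z-score against a robust baseline and
    assigned to the window's center bin; adjacent flagged bins are merged.
    """
    center = statistics.median(bin_values)
    # 1.4826 * MAD approximates the standard deviation for normal data.
    mad = statistics.median(abs(v - center) for v in bin_values)
    sigma = 1.4826 * mad
    if sigma == 0:
        raise ValueError("bin values have zero spread")
    se = sigma / math.sqrt(window)   # standard error of a window mean

    flags = [0] * len(bin_values)    # +1 gain, -1 loss, 0 neutral
    for i in range(len(bin_values) - window + 1):
        z = (statistics.fmean(bin_values[i:i + window]) - center) / se
        if abs(z) >= z_cut:
            flags[i + window // 2] = 1 if z > 0 else -1

    calls, start = [], None
    for idx in range(len(flags) + 1):
        sign = flags[idx] if idx < len(flags) else 0
        prev = flags[idx - 1] if idx > 0 else 0
        if sign != prev:
            if prev != 0 and (idx - start) * bin_size_bp >= min_size_bp:
                kind = "dup" if prev > 0 else "del"
                calls.append((start * bin_size_bp, idx * bin_size_bp, kind))
            start = idx if sign != 0 else None
    return calls


# Toy profile: 200 bins of 50 kb near copy number 1.0, with a gain in bins 100-119.
bins = [0.98 if k % 2 == 0 else 1.02 for k in range(200)]
for k in range(100, 120):
    bins[k] = 1.5
calls = find_mcnvs(bins, bin_size_bp=50_000)
```

In this toy example the detected segment is slightly wider than the implanted one, because windows straddling the edge still exceed the cutoff. This is one way detected boundaries can differ from simulated boundaries.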
Chromosomes 13, 18, and 21 are commonly tested in noninvasive prenatal screening, and mCNVs on these chromosomes may pose the most direct risk for false positives. On these chromosomes, 2.1% of all patients had at least one duplication and 2.5% had at least one deletion with 4.5% having an mCNV of either type (see, e.g.,
The positional distribution of mCNVs was investigated to evaluate whether, if mCNV positions were highly predictable, an algorithm could achieve robustness simply by masking out (or “blacklisting”) such regions. It was observed that mCNVs were not distributed uniformly (see, e.g.,
The impact of mCNVs on aneuploidy-calling fidelity as a function of mCNV size was next explored. Empirically observed mCNVs rarely spanned ≥1% of a chromosome, which prohibited a statistically powered assessment of the impact of these large mCNVs. To overcome the sparsity of empirical data, simulations to systematically analyze the effects of maternal duplications on trisomy detection were implemented. To create a simulated sample harboring an mCNV of a given size and position, the bin-level copy-number data corresponding to the region of interest was scaled by an empirically derived factor in a euploid and mCNV-free sample. Simulated samples strongly resembled their observed counterparts, both at the level of bin profile and the distribution of bin copy-number values. The bin copy number within simulated mCNVs was very slightly overdispersed compared to the bin copy numbers within detected patient mCNVs. The strong overlap between median z-scores for the empirical and simulated samples (see, e.g.,
Maternal duplications have been observed to exert an upward pressure on z-scores, and this effect was reproduced in the simulated data on autosomes (see, e.g.,
An estimate of cumulative false positives due to mCNVs per 100,000 was calculated as the weighted sum of the empirical maternal-duplication size-prevalence data (see, e.g.,
The “Robust” analysis strategy (
The “Robust+Gaussian” analysis strategy (
The “Z-correction” analysis strategy (
The “Value filtering” analysis strategy (
The “mCNV filtering” analysis strategy (
To evaluate the algorithmic strategies through a more clinically relevant lens, the expected frequency of false-positive aneuploidy calls resulting from mCNVs on chromosomes 13, 18, and 21 was evaluated. Using the measured relationship between duplication size and ΔZdup (see
On average, mCNVs were predicted to cause a false-positive result of trisomy 13, 18, or 21 for 1 in 960 patients using the “Simple” analysis strategy. This false-positive rate is similar to the rates reported by laboratories prior to incorporating changes that mitigate the effect of mCNVs: in outcome studies, Chudova et al. reported 3 mCNV-caused false positives in 1914 patients (a rate of 1 in 640), and Strom et al. reported 61 mCNV-caused false positives in 31,278 patients (a rate of 1 in 510). See Chudova et al., N. Engl. J. Med., vol. 375, pp. 97-98 (2016), and Strom et al., N. Engl. J. Med., vol. 376, pp. 188-189 (2017). The false-positive rate estimated for the “Simple” analysis strategy is also consistent with aggregate statistics of noninvasive prenatal screening specificity from meta-analyses over the time period when comparable methods were common.
Overall, mCNV-aware analysis strategies (“Z-correction”, “Value filtering”, and “mCNV filtering” analysis strategies) had higher specificity than mCNV-unaware approaches (“Simple”, “Robust”, and “Robust+Gaussian” analysis strategies). All mCNV-aware analysis strategies increased the pooled specificity for the three common trisomies 13, 18, and 21 such that the aggregate false-positive rate was fewer than 1 in 100,000 tests. Remarkably, relative to the “Simple” analysis strategy, with one false positive expected for every 960 samples, the “mCNV filtering” analysis strategy is expected to incur only one mCNV-caused false positive for every 580,000 samples, representing a 600-fold reduction.
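The rate comparisons in this passage follow from simple arithmetic on the figures already stated. The short sketch below reproduces them; the helper name `one_in` and the variable names are illustrative, and all inputs are the numbers quoted in the text.

```python
def one_in(events, total):
    """Express a rate of events/total as the N in '1 in N'."""
    return total / events


chudova = one_in(3, 1914)       # 638.0, reported as about 1 in 640
strom = one_in(61, 31_278)      # about 512.8, reported as about 1 in 510

simple_n = 960                  # "Simple": one false positive per 960 samples
filtering_n = 580_000           # "mCNV filtering": one per 580,000 samples

fold_reduction = filtering_n / simple_n           # about 604, i.e. roughly 600-fold
simple_per_100k = 100_000 / simple_n              # about 104 per 100,000 tests
filtering_per_100k = 100_000 / filtering_n        # about 0.17 per 100,000 tests
```

Expressed per 100,000 tests, the “mCNV filtering” figure is below one, consistent with the statement that the aggregate false-positive rate was fewer than 1 in 100,000.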
The preceding description has been provided to enable others skilled in the art to best utilize various aspects of the example embodiments disclosed herein. This example description is not intended to be exhaustive or to be limited to any precise form disclosed. Many modifications and variations are possible without departing from the spirit and scope of the instant disclosure. The embodiments disclosed herein should be considered in all respects illustrative and not restrictive. Reference should be made to the appended claims and their equivalents in determining the scope of the instant disclosure.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments are contemplated. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting. Unless otherwise noted, the terms “a” or “an,” as used in the specification and claims, are to be construed as meaning “at least one of.” In addition, for ease of use, the words “including” and “having,” and variants thereof (e.g., “includes” and “has”) as used in the specification and claims, are interchangeable with and have the same meaning as the word “comprising” and variants thereof (e.g., “comprise” and “comprises”).
This application claims the benefit of U.S. Provisional Patent Application No. 62/486,450, filed Apr. 17, 2017 and titled SYSTEMS AND METHODS FOR OPTIMIZING PERFORMANCE OF DNA-BASED NONINVASIVE PRENATAL SCREENS TO REDUCE FALSE ANEUPLOIDY CALLS, U.S. Provisional Patent Application No. 62/508,265, filed May 18, 2017 and titled SYSTEMS AND METHODS FOR PERFORMING AND OPTIMIZING PERFORMANCE OF DNA-BASED NONINVASIVE PRENATAL SCREENS, U.S. Provisional Patent Application No. 62/527,858, filed Jun. 30, 2017 and titled SYSTEMS AND METHODS FOR PERFORMING AND OPTIMIZING PERFORMANCE OF DNA-BASED NONINVASIVE PRENATAL SCREENS, and U.S. Provisional Patent Application No. 62/529,909, filed Jul. 7, 2017 and titled SYSTEMS AND METHODS FOR PERFORMING AND OPTIMIZING PERFORMANCE OF DNA-BASED NONINVASIVE PRENATAL SCREENS, the disclosure of each of which is incorporated by reference herein in its entirety.
Application No. | Filing Date | Country
---|---|---
62/529,909 | Jul 2017 | US
62/527,858 | Jun 2017 | US
62/508,265 | May 2017 | US
62/486,450 | Apr 2017 | US