The field of invention covers analysis of human DNA for the purpose of detecting disease-causing aberrations in adult and/or prenatal or neonatal individuals, as well as sample tracking, quantification of sample impurities, and measurement of fetal fraction in maternal ccfDNA.
The present invention relates to single-cell DNA analyses (scDNA) and in particular to CNV and SNP aberrations in cancerous cells. It also relates to fetal contribution to circulating cell-free DNA (ccfDNA) isolated from maternal plasma and sequenced for the purpose of non-invasive prenatal testing (NIPT).
The invention improves multiple aspects of scDNA and ccfDNA data processing, including estimation of GC bias, reduction or elimination of systematic and stochastic errors from measured whole genome resequencing data, scaling of scDNA copy-number profiles, detection of features in CNV profiles, clustering of single cells based on similarity of their CNV profiles, measurement of sample impurity levels and fetal fractions based on measured SNP allele frequencies, detection of fetal copy-number variants based on measured SNP allele frequencies in ccfDNA, haplotyping of scDNA and fetal component of ccfDNA based on phased parental SNPs and measured scDNA or ccfDNA SNP allele frequencies.
What is needed is a collection of accurate algorithms to successfully extract biological signal from technical noise.
In one aspect of the invention, methods make use of Fourier coefficients of the distribution of read counts to accurately quantify GC bias in a highly complex CNV profile of a cancerous cell.
In one aspect of the invention, methods reduce or eliminate technical variability from copy-number profiles. The removal of variability includes both systematic errors and stochastic noise. The resulting normalized profile enables more accurate scaling and detection of CNVs and whole-chromosome aneuploidies, both in scDNA and ccfDNA scenarios, as well as measurement of fetal fractions in ccfDNA.
In one aspect of the invention, methods accurately estimate the scale of the highly complex cancerous scDNA copy number profile.
In one aspect of the invention, methods accurately measure fetal fraction in normalized copy number profile extracted from a ccfDNA sample.
In one aspect of the invention, methods segment a CNV profile into homogeneous regions using a combination of Fourier coefficients and differential variance ratio.
In one aspect of the invention, methods enhance cell clustering by quantifying the degree of similarity among scDNA copy number profiles using a combination of Fourier expansion and differential variance ratio.
In one aspect of the invention, methods quantify the level of sample contamination based on measured SNP allele frequencies.
In one aspect of the invention, methods measure fetal fraction in ccfDNA based on measured SNP allele frequencies.
In one aspect of the invention, methods detect fetal copy number variants and whole chromosome aneuploidies in ccfDNA based on measured SNP allele frequencies.
In one aspect of the invention, methods perform haplotyping of scDNA (such as IVF quality control biopsy sample) or NIPT fetal DNA in a ccfDNA sample based on known parental phasing information and on measured SNP allele frequencies. The phasing of fetal DNA enables detection of monogenic diseases without targeting disease loci. Instead, it is enough to phase SNPs that flank the disease loci.
Solutions addressing these needs are set forth in the sections that follow.
The overall object of this invention is to enhance accuracy and robustness of detection and quantification of biological features in human genomic data against the background of technical and stochastic errors introduced by measurement techniques, instrumentation, laboratory handling, environmental factors, and chance.
It is an object of this invention to achieve accurate quantification of linear and quadratic coefficients of the parabola describing GC bias in copy-number profiles of highly complex cancerous cells. Due to multiple ploidy levels in cancerous DNA, standard regression is insufficient for the task. The goal is to formulate a generalizable mathematical framework that works equally successfully with the entire spectrum of profile complexity, ranging from nearly-diploid cells to cancer cells with highly rearranged genomes.
It is also an object of this invention to reduce or eliminate systematic and stochastic errors present in whole genome resequencing data, both from scDNA and ccfDNA experiments.
Another object of this invention is to improve accuracy and robustness of CNV profile scaling, particularly in scDNA data obtained from cancerous cells that underwent complex genome rearrangements.
It is an object of this invention to enhance the accuracy of measured fetal fractions in ccfDNA.
It is an object of this invention to enhance detection of copy number variants in scDNA and ccfDNA.
It is an object of this invention to improve clustering of single cells based on similarities between their CNV profiles.
It is an object of this invention to quantify the presence of contaminants in a DNA sample.
It is an object of this invention to enable affordable and accurate haplotyping of the fetal genome using scDNA or ccfDNA data and known parental haplotypes in order to detect transmission of monogenic diseases from the carrier (or affected) parent(s).
Still other objects and advantages of the invention will in part be obvious and will in part be apparent from the specification and drawings.
In order to overcome systematic and stochastic errors when extracting biological signal from scDNA and ccfDNA measurements, the present disclosure applies mathematical modeling techniques applicable to both whole-genome resequencing data and to targeted SNP allele counts. The invention implements these statistical and algorithmic solutions in the context of a genomic analysis workflow, comprising sample collection, sample storage and transportation, DNA extraction, library preparation, DNA sequencing, alignment of sequenced DNA fragments to the human reference, counting of reads aligned to specific genomic locations, as well as computer software and hardware used to effectively process the count data in an automated fashion and to automatically create accurate reports regarding subject's disease status.
The invention accordingly comprises the several steps and the relation of one or more of such steps with respect to each of the others, and the apparatus embodying features of construction, combinations of elements, and arrangement of parts that are adapted to effect such steps, all as exemplified in the following detailed disclosure, and the scope of the invention will be indicated in the claims.
To accurately quantify GC bias, the methods of the invention generate a distribution of read counts per bin, perform Fourier expansion of the distribution, generate a metric called Fourier Coefficient Ratio (FCR) by combining Fourier coefficients of different orders, and vary trial GC coefficients until achieving optimal FCR value.
To reduce or eliminate systematic and stochastic errors from whole genome resequencing scDNA or ccfDNA data, the methods of the invention generate posterior ploidy probability distribution and evaluate its measure of central tendency for each genomic bin satisfying predefined quality criteria. The resulting values are referred to as normalized profile values.
To improve accuracy and robustness of scDNA CNV profile scaling, the methods of the invention manipulate normalized profile values to generate a penalty function, and then vary trial scale values until the penalty reaches its optimum.
To enhance the accuracy of measured fetal fraction values in ccfDNA, the methods of the invention use normalized CNV profile values obtained by other methods of the invention. Still other methods of the invention generate likelihood or posterior probability distributions of observed SNP allele counts and maximize the likelihood or the posterior probability.
To enhance detection of copy number variants in scDNA or in ccfDNA data, the methods of the invention perform Fourier expansion of the CNV profile, combine the Fourier coefficients of the resulting series to generate Fourier Coefficient Ratio, generate differential profile consisting of successive differences of the original profile, evaluate variances of the original profile and the differential profile, generate Differential Profile Variance, and combine Fourier Coefficient Ratio with Differential Profile Variance to arrive at a measure of profile homogeneity. The measure of profile homogeneity is then used to segment the profile into homogeneous regions.
To improve clustering of single cells, the methods of the invention combine CNV profiles of individual cells in a pairwise fashion and generate a similarity metric from the combined profile's Fourier Coefficient Ratio and the combined profile's Differential Variance Ratio.
To quantify the presence of contaminants in a DNA sample, the methods of the invention use measured allele counts to construct the posterior probability distribution of contamination levels and then evaluate the distribution's central tendency.
To accurately measure fetal fraction in a ccfDNA sample, the methods of the invention use measured allele counts to construct the posterior probability distribution of the fetal fraction and then evaluate the distribution's central tendency.
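As an illustration of the fetal-fraction approach summarized above, the following Python sketch builds a posterior distribution of the fetal fraction from SNP allele counts. It is a minimal didactic model, not the disclosure's specified procedure: it assumes SNPs at which the mother is homozygous (AA) and the fetus heterozygous (AB), so the expected B-allele frequency is f/2, a binomial read-count model, and a uniform prior on a fetal-fraction grid; the counts shown are hypothetical.

```python
import numpy as np

def fetal_fraction_posterior(b_counts, depths, grid=None):
    """Posterior over fetal fraction f on a grid (assumed model: at SNPs
    where the mother is AA and the fetus AB, the expected B-allele
    frequency is f/2; binomial counts; uniform prior over the grid)."""
    if grid is None:
        grid = np.linspace(0.001, 0.5, 500)
    log_post = np.zeros_like(grid)
    for b, d in zip(b_counts, depths):
        p = grid / 2.0                  # expected B-allele frequency at fraction f
        log_post += b * np.log(p) + (d - b) * np.log(1.0 - p)
    log_post -= log_post.max()          # stabilize before exponentiation
    post = np.exp(log_post)
    post /= post.sum()
    return grid, post

# Hypothetical counts at four informative SNPs, depth 1000, simulated f = 0.10.
grid, post = fetal_fraction_posterior([52, 47, 55, 49], [1000] * 4)
f_hat = float((grid * post).sum())      # central tendency: posterior mean
```

The central tendency of the posterior (here the mean; a mode or median would serve equally well) is then reported as the fetal-fraction estimate.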
To enhance detection of fetal copy number variants in ccfDNA, the methods of the invention use measured SNP allele frequencies to generate posterior ploidy probability distributions and then identify each distribution's mode.
To enable affordable and accurate haplotyping of the fetal genome using scDNA or ccfDNA data and known parental haplotypes in order to detect transmission of monogenic diseases from the carrier (or affected) parent(s), the methods of the invention generate likelihoods of fetal transmission patterns per SNP locus, combine these likelihoods across genomic regions, and segment genomic regions into homogeneous subregions, thus identifying paternally and maternally inherited haplotypes and parental recombination sites.
For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:
1. NIPT: Non-Invasive Prenatal Testing.
2. scDNA: Single-cell DNA.
3. ccfDNA: Circulating cell-free DNA.
4. SNP: Single nucleotide polymorphism.
5. CNV: Copy-number variant.
Detailed lists of additional abbreviations and notational symbols are included in relevant specification sections.
The variable g∈[0, 1] describes the fractional GC content per bin. In practice, the lower bound gL and the upper bound gU are tighter: g∈[gL,gU], where 0≤gL≤gU≤1. For example, the values gL=0.3, gU=0.7 may be appropriate for the human genome.
For convenience, we also introduce the reference GC value g0. The value g0 is arbitrarily chosen and may match the median GC content, the mean GC content, the mode of the GC content distribution ℙ(g), the midpoint of the segment [gL,gU], or any conveniently selected point on the GC axis within the interval [gL,gU] or outside of it. The final results do not depend on the choice of g0, but the numerical values of the model coefficients L and Q do (see below).
The gaps separating g0 from the lower and upper bounds gL and gU are labeled ΔL and ΔU, respectively:
ΔL=g0−gL (1)
ΔU=gU−g0 (2)
GC Bias
To define GC bias, we first introduce three new objects: bin mappability m, overall scale S, and the random variable X(g). The parameter S represents the mean bin count across the entire genome, normalized with respect to bin ploidy. The parameter m represents bin-specific efficiency of read alignment, as known to those skilled in the art. The random variable X(g) is the observed read count per ploidy unit for a bin characterized by GC content g. GC bias G(g) is defined as the asymptotic value of the ratio between the observed read count per unit ploidy X(g) vs. the product of the scale S and the bin mappability m:

G(g)=lim S→∞ X(g)/(Sm) (3)
Division by bin-specific mappability m separates GC-dependent bias from GC-independent bias. The limit in Eq. 3 eliminates stochastic effects of coverage depth. As a result, the function G(g) describes only the effects of GC content on read counts per bin across the copy-number profile.
The GC bias function G(g) measures the deviation of the observed read counts per bin per ploidy unit X(g) from the constant value mS. The GC bias G(g) represents systematic errors in measured bin counts due to variations in GC content per bin g. Ideally, in the absence of GC bias, G(g) identically equals one. In real measurements, read counts per bin may correlate with the bin GC content g, in which case GC bias G(g) deviates from one. In general, the function G(g) is unknown and may assume a complex analytical form reflecting the mechanistic causes of the bias. Here, GC bias is modeled phenomenologically by expanding the general function G(g) into a Taylor series around g0 and truncating the expansion at the second order, yielding the following parabola with parameters L (linear coefficient) and Q (quadratic coefficient):
G(g;L,Q,g0)=1+L(g−g0)+Q(g−g0)², g∈[gL,gU] (4)
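For concreteness, the parabola of Eq. 4 can be evaluated, and applied as a multiplicative correction, in a few lines of Python. This is an illustrative sketch only: the GC values, counts, coefficient values, and the choice g0=0.5 below are all hypothetical.

```python
import numpy as np

def gc_bias(g, L, Q, g0=0.5):
    """Second-order GC bias model of Eq. 4: G(g; L, Q, g0)."""
    return 1.0 + L * (g - g0) + Q * (g - g0) ** 2

# Hypothetical per-bin GC contents and observed read counts (illustration).
g = np.array([0.35, 0.45, 0.55, 0.65])
counts = np.array([98.0, 105.0, 110.0, 102.0])

# One normalization option: divide observed counts by the trial parabola.
corrected = counts / gc_bias(g, L=0.2, Q=-0.5)
```

Division by the parabola is only one normalization option; the posterior-probability alternative of Section B avoids multiplicative error inflation.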
The model coefficients L and Q are interpreted as the first and second derivatives, respectively, of the function G(g), taken at the point g0 (with additional multiplier ½ in the case of Q):

L=G′(g0) (5)

Q=½G″(g0) (6)
Constraints on GC Bias Coefficients L and Q
To be physically plausible, the bias function G(g;L,Q,g0) must be non-negative on the domain g∈[gL,gU]. This non-negativity requirement imposes constraints on parameters L and Q:
The intersection point L0 is found as follows:
Profile Uniformity
In the absence of GC bias, the two GC coefficients L and Q vanish and G(g) reduces to one. This scenario is the most desirable because it is the most favorable to performance metrics for detection of copy-number variants (CNVs), including sensitivity, specificity, positive predictive value, negative predictive value, false positive rate, false negative rate, true positive rate, and true negative rate.
The importance of the GC bias lies in its impact on the measured copy-number profile, which directly affects the performance of CNV detection. Higher GC bias (larger deviations of G(g) from one) means a less uniform profile, which translates to lower CNV detection performance (lower sensitivity, specificity, positive predictive value, negative predictive value, true positive rate, and true negative rate, as well as higher false positive and false negative rates).
In absence of biological variability, the variance of a copy number profile x can be modeled using the following expression:

Var(x)=E·P/S+P²[Var(m)+L²Var(g)+Q⁴Var(g²)+C(m,g,L,Q)] (10)
In Eq. 10, P is the profile's ploidy, x is the bin count divided by the scale S, Var(m) is the variance of the genomic region's mappability, Var(g) is the variance of the GC content per bin, L and Q are the linear and quadratic GC bias coefficients, and the term C(m,g,L,Q) collects all pairwise covariance contributions, which are not negligible. The term E is close to 1. Deviations E−1 reflect the presence of duplicate reads. The term P/S is the stochastic Poisson contribution, responsible for deterioration of profile evenness as the scale S (average read count per bin per chromosome copy) decreases.
Division by m and bin filtering based on m significantly reduce Var(m). Two dominant contributions to unevenness originate from L² (with proportionality constant Var(g)≈0.014 for all hg19 bins) and from Q⁴ (proportional to Var(g²)≈0.004 in hg19). Larger absolute values of L and Q translate to increased profile unevenness. Accurate estimation of L and Q enables removal of these two major contributors to unevenness. Errors δL and δQ (differences between true and estimated coefficient values) result in residual unevenness proportional to Var(g)δL²+Var(g²)δQ⁴ (plus correlation terms). To optimize profile evenness, estimation of bias coefficients must reduce the error terms δL and δQ as much as possible.
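The dependence of profile unevenness on the bias coefficients can be illustrated numerically. The sketch below is a simplified assumption-laden illustration (simulated GC values; mappability, covariance, and Poisson terms omitted): it computes the variance of the trend G(g) across bins, which vanishes when L=Q=0 and grows with the magnitudes of L and Q. The simulated Var(g)≈0.013 is close to the ≈0.014 quoted for hg19 bins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-bin GC content; Var(g) of Uniform(0.3, 0.7) is ~0.013.
g = rng.uniform(0.3, 0.7, size=100_000)

def gc_trend(g, L, Q, g0=0.5):
    return 1.0 + L * (g - g0) + Q * (g - g0) ** 2

# Variance of the trend across bins: the systematic (GC-driven) part of
# profile unevenness. It is zero for an unbiased profile (L = Q = 0).
unevenness = {(L, Q): float(np.var(gc_trend(g, L, Q)))
              for (L, Q) in [(0.0, 0.0), (0.5, 0.0), (0.5, 2.0)]}
```

Larger trial coefficients produce strictly larger trend variance, mirroring the statement that larger |L| and |Q| translate to increased profile unevenness.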
GC Normalization
Often, in real data, L≠0 and/or Q≠0. The resulting measured read counts per bin are then biased by the per-bin GC content g. To recover the CNV performance lost due to GC bias, one first needs to estimate model coefficients L and Q from the data. Next, the bias G(g) is computationally reduced or removed. One possible approach divides the observed counts with the estimated parabola G(g;L,Q,g0). An alternative, novel approach is discussed below (Section B. scDNA Profile Normalization by Posterior Ploidy Probability). The computational reduction or elimination of GC bias from measured read counts is known as GC normalization.
For successful GC normalization, it is essential to accurately estimate the values of the bias coefficients L and Q from the data. This estimation is the primary focus of the currently discussed aspect of the invention (GC Bias estimation by Fourier Expansion).
Prior Art: GC Bias Coefficient Estimation in Mostly Diploid Copy-Number Profiles
When the copy-number profile is mostly diploid, as is the case in non-invasive prenatal tests (NIPT), L and Q are estimated straightforwardly. The scatter plot of X(g) vs. g in diploid profiles consists of a single ribbon of data points, representing the trend G(g) with added noise. The minor complication with under-represented sex chromosomes in male individuals is easily handled, for example by focusing only on autosomes. When the curvature of the function G(g) is negligible, L can be obtained as the slope of the linear regression of X(g) vs. g. This approach was first proposed by Dr. Lin Tang (unpublished results; private communication) and subsequently used by Željko Džakula in his invention PERUN. PERUN was essential for the successful launch of the high-throughput version of the NIPT MaterniT21, used to process at least 100,000 samples in the first year after the launch. PERUN increased the accuracy of detection of fetal trisomies T21, T18, and T13, and enabled the earliest breakthroughs in detection of fetal sex chromosome abnormalities and fetal subchromosomal abnormalities, measurements of fetal fraction in male fetuses and in trisomy/monosomy pregnancies, as well as detection of fetal autosomal aneuploidies other than T21, T13, and T18. PERUN also enabled usage of maternal CNVs to track the sample's chain of custody, removal of maternal CNVs from NIPT profiles to reduce fetal false positives and false negatives, as well as earliest examples of maternal cancer detection in NIPT samples (Željko Džakula, unpublished results).
The quadratic form of the bias model is Željko Džakula's extension of PERUN.
Difficulties with GC Bias Coefficient Estimation in Non-Diploid Copy-Number Profiles
Rather than a single “ribbon” of data points, as in diploid profiles, the scatter plots of X(g) vs. g in cancer samples display multiple parallel “ribbons” of data points, one for each ploidy level. Each “ribbon” follows the same rule given by the function G(g) (plus added noise). However, unlike in diploid profiles, the observations are no longer single-valued. On the contrary, observed read counts X(g) in cancer profiles follow a multi-valued function of g, reflecting multiple ploidy levels across the genome. Consequently, the parameters L and Q in cancer samples become difficult or impossible to estimate using simple linear or polynomial regression.
Prior Art: Estimation of GC Bias Coefficients by Minimizing Entropy
Currently, the only existing procedure that successfully estimates L and Q in copy-number profiles with multiple ploidy levels, such as cancer, is the minimization of entropy invented by Željko Džakula in October 2017 for the purpose of normalizing GC bias in single-cell CNV profiles. The entropy-based procedure is routinely used to process thousands of scDNA samples per run. The current invention disclosure refers to this entropy-based prior art for comparison and to assess relative performance of the newly proposed alternative approach (Fourier-based, see below).
Željko Džakula's entropy-based procedure proceeds as follows. First, a histogram of raw read counts is created, including all bins across the genome (except those filtered out due to low mappability or other filtering criteria). An upper bound is then imposed on bin counts and the distribution's tail beyond this bound is truncated. In some embodiments, the upper bound is obtained as a product of a constant factor and the mode of the distribution of bin counts. The same upper bound is enforced in all subsequent iterations to prevent variation of the edges of the support as the optimized parameters L and Q are being varied. The resulting frequencies of counts per bin are then converted to normalized probabilities p. If GC bias is present, the density peaks in this histogram will be wider than in the absence of the GC bias. In addition, GC bias induces overlaps among peaks originating from different ploidy levels. As a result, the entropy of a GC-biased histogram of bin counts is high. The estimation of L and Q therefore iteratively varies coefficient values, applies the GC correction G(g;L,Q,g0) using trial coefficient values L and Q, and regenerates the histogram until the histogram's entropy H=−Σp log(p) is minimized. The values L and Q corresponding to the lowest found value of H are reported as GC coefficient estimates.
The previous paragraph described Željko Džakula's entropy based procedure as he originally formulated and implemented it in 2017, and as it is currently being practiced at 10X Genomics. The current invention disclosure also adds the following improvement to this prior art: Before generating the histogram of read counts, the counts are divided by bin mappability values. The resulting bin counts, modulated by mappability, are then used to generate bin count histograms and to minimize entropy. Modulation by mappability significantly improves the convergence of the entropy minimization procedure and reduces the uncertainties in final GC coefficient estimates.
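The entropy-minimization procedure, including the mappability-division improvement described above, can be sketched as follows. This is an illustrative grid-search implementation, not the production code: the cutoff factor of 3 applied to the histogram mode, the coefficient grids, and the reference value g0=0.5 are assumed choices.

```python
import numpy as np

def entropy_of_counts(counts, upper):
    """Shannon entropy H = -sum(p log p) of the truncated bin-count histogram."""
    c = counts[(counts > 0) & (counts <= upper)]
    hist = np.bincount(np.round(c).astype(int))
    p = hist[hist > 0] / hist[hist > 0].sum()
    return -np.sum(p * np.log(p))

def estimate_gc_by_entropy(counts, m, g, grid_L, grid_Q, g0=0.5):
    """Grid search minimizing histogram entropy; counts are first divided
    by mappability m, per the improvement described above."""
    c = counts / m
    # Fixed upper bound: an assumed factor of 3 times the histogram mode,
    # held constant across all iterations.
    upper = 3 * np.bincount(np.round(c).astype(int)).argmax()
    best = (np.inf, None)
    for L in grid_L:
        for Q in grid_Q:
            trend = 1.0 + L * (g - g0) + Q * (g - g0) ** 2
            if np.any(trend <= 0):        # enforce non-negativity of G(g)
                continue
            H = entropy_of_counts(c / trend, upper)
            if H < best[0]:
                best = (H, (L, Q))
    return best[1]
```

On synthetic counts generated with a known linear bias, the grid search recovers the generating coefficients, since the correct correction collapses the histogram into sharp, low-entropy peaks.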
Alternative Approach: Estimation of GC Bias Coefficients by Maximizing Fourier Coefficient Ratio
The current invention disclosure proposes an alternative procedure, based on Fourier expansion of the distribution of read counts per bin. The basic idea is as follows: absence of GC bias is associated with sharp probability peaks in the histogram of bin counts. The corresponding Fourier series of the histogram therefore features increased absolute values of Fourier coefficients of higher orders, relative to Fourier coefficient values of the first order. A histogram of GC-biased bin counts, on the contrary, has diffuse and/or overlapping probability peaks, corresponding to decreased higher-order Fourier coefficients relative to the first-order Fourier coefficients.
Based on the principles laid out above, the estimation of L and Q proceeds as follows. First, a histogram of raw bin counts is generated. A pair of cutoff percentile values is selected. In some embodiments, the percentile cutoffs are chosen at 5% and 95%. In some embodiments, only the upper bound is enforced. Cutoff percentile values are applied to the distribution of raw bin counts to yield lower and upper bin count cutoffs iL and iU, respectively. In some embodiments, iL is set to 1. Bin counts beyond these cutoff values are truncated. The bin count cutoffs are maintained through all subsequent iterations to prevent variation of the edges of the support as the optimized parameters L and Q are being varied. Next, frequencies of observed bin counts are normalized to 1, yielding normalized bin count probabilities pi, i∈[iL,iU].
Next, the closed interval of integer bin counts [iL,iU] is transformed to generate an equivalent discrete set of equidistant real values belonging to the segment [δ,2π] such that i maps to xi=iδ, where δ=2π/iU is the increment separating neighboring points.
The next step selects the Fourier truncation order N. In some embodiments, N takes the value 2, 3, . . . , or 10. In some embodiments, N takes an integer value larger than 10.
The iterative optimization is initialized by assigning zero or a conveniently chosen negative real value to the “best” penalty. In some embodiments, the “best” penalty is initialized with negative infinity.
The iterative procedure assigns trial values to GC coefficients L and Q such that the non-negativity constraints (Eqs. 7-8) are satisfied. In some embodiments, trial values (L,Q) are selected from a predefined two dimensional grid satisfying the non-negativity constraints. The trend parabola G(g;L,Q,g0) is evaluated for the current pair of trial values (L,Q). Raw read counts are divided by bin mappability m and by the trial trend G(g;L,Q,g0). The resulting “corrected” read counts are trimmed within bounds [iL,iU] and used to generate normalized probabilities pi of bin counts as described above.
In some embodiments, division of raw read counts by mappability may be omitted.
Next, cosine and sine Fourier coefficients are evaluated. In some embodiments, Fast Fourier Transform (FFT) is used. In some embodiments, numerical integration is performed using Darboux sums:

an=(1/π)Σi pi cos(nxi), bn=(1/π)Σi pi sin(nxi), i∈[iL,iU], n=1, 2, …, N (11)
In some embodiments, the integration in Eq. 11 is performed using numerical integration methods, such as polynomial interpolation, rectangle rule, trapezoidal rule, composite rule, extended rule, iterated rule, Newton-Cotes rule, Simpson's rule, Clenshaw-Curtis quadrature, Gauss-Kronrod quadrature, generalized mid-point rule, Monte-Carlo integration, or any other quadrature formula known to those skilled in the art.
Finally, the Fourier Coefficient Ratio F1 value corresponding to the current trial grid point (L,Q) is evaluated as follows:

F1=[Σn=2…N(an²+bn²)]/(a1²+b1²) (12)
In some embodiments, absolute values |an| and |bn| are used instead of squares. In some embodiments, positive even powers other than two (4, 6, 8, . . . ) are used.
If the current value of the Fourier Coefficient Ratio F1 exceeds the “best” penalty, both best penalty and best (L,Q) pair of values are updated.
The optimization procedure described above continues until all grid points (L,Q) are exhausted. The final best grid point (L,Q) represents the optimal solution given the data and the resolution of the (L,Q) grid.
In some embodiments, grid search is followed by additional numerical optimization to fine-tune the solution. In some embodiments, the optimal grid point is used as the initial guess for the numerical optimization. In some embodiments, the additional numerical optimization is achieved using Nelder-Mead, Gradient Descent, BFGS, L-BFGS-B, SANN, or any other optimizer known to those skilled in the art. In some embodiments, grid search is bypassed and numerical optimization starts from an arbitrarily selected initial guess. In some embodiments, the initial guess is the origin.
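The Fourier-based estimation can be sketched end to end as follows. This is an illustration under stated assumptions, not the reference implementation: only the upper percentile cutoff is enforced (with iL=1), trimming of corrected counts is implemented by clipping, the truncation order N=5 and the coefficient grids are arbitrary choices, and squared coefficients are used in the ratio.

```python
import numpy as np

def fourier_coefficient_ratio(p, N=5):
    """F1: combined power of Fourier orders 2..N relative to order 1."""
    n_pts = len(p)
    x = (np.arange(n_pts) + 1) * 2.0 * np.pi / n_pts   # counts mapped onto (0, 2*pi]
    power = []
    for n in range(1, N + 1):
        a_n = np.sum(p * np.cos(n * x))
        b_n = np.sum(p * np.sin(n * x))
        power.append(a_n ** 2 + b_n ** 2)
    return sum(power[1:]) / power[0]

def estimate_gc_by_fcr(counts, m, g, grid_L, grid_Q, g0=0.5, N=5):
    """Grid search maximizing the Fourier Coefficient Ratio of the
    histogram of mappability- and trend-corrected bin counts."""
    c = counts / m
    i_lo, i_hi = 1, int(np.percentile(c, 95))          # fixed support edges
    best = (-np.inf, None)                             # "best" penalty init
    for L in grid_L:
        for Q in grid_Q:
            trend = 1.0 + L * (g - g0) + Q * (g - g0) ** 2
            if np.any(trend <= 0):                     # non-negativity constraint
                continue
            cc = np.clip(np.round(c / trend).astype(int), i_lo, i_hi)
            hist = np.bincount(cc, minlength=i_hi + 1)[i_lo:i_hi + 1]
            p = hist / hist.sum()
            f1 = fourier_coefficient_ratio(p, N)
            if f1 > best[0]:
                best = (f1, (L, Q))
    return best[1]
```

A correct trial pair collapses the corrected counts into a sharp peak, whose higher-order Fourier power approaches that of the first order, maximizing F1; diffuse, GC-biased histograms yield strictly smaller ratios.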
B. scDNA Profile Normalization by Posterior Ploidy Probability
Normalization of Single Cell and Pure Sample Copy Number Profiles
The objective of deriving a copy-number profile from whole genome resequencing data is to extract location-specific ploidy information from observed bin counts. In raw data, ploidy information is obscured by multiple potential sources of error, including overall scale estimate, stochastic noise, and systematic biases. Scale and stochastic noise are related to overall coverage depth and are treated in separate sections of the current invention disclosure. Normalization reduces or removes biases from the bin count profile, including both mappability-induced bias and GC-related bias.
Conventionally, bias removal is achieved either by applying an additive, multiplicative, or hybrid (both additive and multiplicative) correction to observed bin counts (for example, cf. PERUN). Multiplicative corrections are often suboptimal due to their tendency to inflate errors in regions where low counts are expected. By treating the GC component of the bias as additive, rather than multiplicative, hybrid approaches (such as PERUN) greatly improve uniformity of normalized profiles. However, the mappability-related portion of the bias still has multiplicative character, thus suffering from error inflation. The current invention disclosure proposes an alternative approach, based on posterior bin count probabilities. The new approach naturally accounts for stochastic nature of observed bin counts and produces normalized profiles that are significantly less prone to error inflation than existing multiplicative or hybrid approaches.
When removing systematic errors related to mappability and GC content, the scale S may but does not have to be available. If no scale estimate exists at the time when GC/mappability normalization is performed, the resulting normalized profile represents the product of overall scale S and bin-specific ploidy P. When a scale estimate exists, S can be trivially factored out of the normalized product PS, directly yielding ploidy profile P. We first describe the former case, where normalization generates the product PS of the unknown ploidy P and unknown S.
The observed read count n in a bin characterized by mappability m and GC content g is a random variable that follows the Poisson distribution ℙ(n|P,S,m,g,L,Q), where L and Q represent the profile-specific linear and quadratic GC bias coefficients. The expectation of this conditional probability equals λ=PSmG(g;L,Q,g0), where G(g;L,Q,g0)=1+L(g−g0)+Q(g−g0)² is the parabola describing the profile-specific GC bias and g0 is an arbitrarily chosen reference point on the GC content axis.
The prior distribution for the product PS can be modeled as the continuous uniform distribution 𝒰(PS;0,Nmax), where Nmax is the maximum observed count. This simple model applies to the first bin in each chromosome.
For bins other than the first bin in a chromosome, the prior for the product PS is constructed as a mixture of the uniform prior and the posterior previously obtained for the preceding bin. The mixture coefficient allows balancing robustness with respect to noise vs. sensitivity to ploidy changes.
The prior distribution ℙ(n) for the bin counts n can be modeled in several ways, including:
Once the form of the prior probability ℙ(n) for bin counts is selected, the desired posterior distribution ℙ(PS|n,m,g,L,Q) for the product PS, given the observed bin count n, is obtained using Bayes theorem:

ℙ(PS|n,m,g,L,Q)=ℙ(n|PS,m,g,L,Q)ℙ(PS)/ℙ(n) (13)
In the simplest case, when both the priors for the bin count and for the product PS are modeled as uniform distributions (as in the first bin per chromosome), Eq. 13 reduces to the likelihood:

ℙ(PS|n,m,g,L,Q)∝ℙ(n|PS,m,g,L,Q)=λ^n e^(−λ)/n!, λ=PSmG(g;L,Q,g0) (14)
The estimate for the product PS can then be obtained as 1) the location of the mode of the posterior distribution ℙ(PS|n,m,g,L,Q), 2) as its mean, or 3) as its median. Along with the estimate for PS one can also obtain a measure of uncertainty in the estimate as the standard deviation, MAD (median absolute deviation), half-width, second moment, or related spread metric of the posterior ℙ(PS|n,m,g,L,Q).
Once the posterior distribution for a given bin is constructed and used to estimate the product PS, it is also passed to the procedure that constructs the prior for PS in the next bin. As mentioned above, the prior for the next bin is a mixture of the current posterior and the uniform distribution. Mixture coefficients reflect the frequency of CNV breakpoints.
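A minimal sketch of this sequential posterior-normalization scheme follows. The details are illustrative assumptions, not the disclosure's specification: a discretized PS grid bounded by the maximum observed count, a Poisson likelihood, the posterior mean as the central-tendency estimate, and a fixed mixture weight alpha for the uniform component of the next bin's prior.

```python
import numpy as np
from math import lgamma

def normalize_profile(counts, m, g, L, Q, g0=0.5, alpha=0.1):
    """Sequential per-bin posterior over the product PS.
    Prior: mixture of a uniform component (weight alpha, reflecting the
    chance of a CNV breakpoint) and the previous bin's posterior.
    Likelihood: Poisson with rate lambda = PS * m * G(g; L, Q, g0).
    Returns the posterior-mean PS estimate for every bin."""
    ps_grid = np.linspace(1e-3, counts.max(), 512)
    uniform = np.full_like(ps_grid, 1.0 / len(ps_grid))
    prior = uniform.copy()                         # first bin: uniform prior
    trend = 1.0 + L * (g - g0) + Q * (g - g0) ** 2
    estimates = np.empty(len(counts), dtype=float)
    for i, n in enumerate(counts):
        lam = ps_grid * m[i] * trend[i]
        log_lik = n * np.log(lam) - lam - lgamma(n + 1)
        post = prior * np.exp(log_lik - log_lik.max())
        post /= post.sum()
        estimates[i] = np.sum(ps_grid * post)      # central tendency (mean)
        prior = alpha * uniform + (1 - alpha) * post
    return estimates
```

The carried-over posterior concentrates the prior around the local PS level, which is what makes the normalized estimates less noisy than per-bin multiplicative corrections.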
The estimate for PS obtained as described in the previous paragraph is to a large degree free of systematic biases due to mappability m and GC content g. It is also much less affected by stochastic noise than the conventional normalized profiles. As such, a collection of PS estimates obtained for a large number of bins across the genome can be used to reliably estimate the scale S, as described in a separate section below.
If a separate procedure is used to obtain an estimate for the scale S prior to normalization, the procedure described above can still be used in almost identical form to obtain final normalized and scaled ploidy profile P, rather than the product PS. One possible advantage of such approach is that the known value of the continuous variable S becomes a part of the condition in the expression for the posterior, leaving only the discrete random variable P. The prior on the ploidy P can then be more restrictive, only allowing integer P values.
C. scDNA CNV Profile Scaling
The removal of mappability- and GC-related biases from raw bin counts yields the normalized PS profile, where P is the per-bin ploidy and S is the overall scale, as described above. The ultimate objective is the ploidy profile P, rather than the product PS. The procedure described below estimates the scale S starting from the normalized PS profile as its input, thus enabling evaluation of the desired ploidy P by dividing the PS profile by the estimate S.
Ideally, the normalized profile PS consists of highly uniform segments. Since ploidy P can only take integer values, the elevations of PS segments are integer multiples of the scale S. At high enough coverage depth per bin, fluctuations around these normalized discrete segment elevations are minimal. The scale estimator takes advantage of the fact that division of the normalized profile PS by the true scale value S can only yield integer values (or values very close to integers). The estimator thus constructs a penalty (or target) function T(s) as follows:

T(s) = Σi=1…M (Ni/s − └Ni/s┐)²  (16)
The variable s in Eq. 16 represents a positive real trial value for the scale S. The sum runs over all bin indices i from 1 to the total number of bins M. The symbol Ni denotes the normalized product PS at bin i. The symbol └x┐ denotes rounding of x to nearest integer.
The penalty T(s) is evaluated for all candidate scale values s from some predetermined value smin to smax. In some embodiments, the starting value smin is set to 2. In some embodiments, the upper bound smax is set to max(Ni). In some embodiments, s is incremented from smin to smax in integer steps. In some embodiments, s is incremented in fractional steps. In some embodiments, the range of trial values s is scanned in equal increments. In some embodiments, the range of trial values s is scanned in unequal steps. In some embodiments, any standard minimizer known to those skilled in the art is used to minimize T(s). Examples include, but are not limited to, exhaustive search, stochastic search, bracketing, region elimination, halving, golden section search, and Newton-Raphson. Regardless of which minimizer is used, the scale S is assigned the trial value s that yields the lowest penalty T(s):

S = argmins T(s)  (17)
The same estimate S can be obtained if the squared difference (Ni/s − └Ni/s┐)² in Eq. 16 is replaced by any of its monotonic functions. For example, the penalty may be reformulated to use absolute values, rather than squares:

T(s) = Σi=1…M |Ni/s − └Ni/s┐|  (18)

Alternatively, any positive even power of the difference Ni/s − └Ni/s┐, or any positive power of its absolute value, can yield the same estimate S.
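Under the above definitions, the exhaustive-search minimizer can be sketched as follows (the quadratic penalty of Eq. 16 is used, though the absolute-value variant of Eq. 18 works identically; the function name and step size are illustrative):

```python
import numpy as np

def estimate_scale(ps, s_min=2.0, s_max=None, step=0.05):
    """Exhaustive search for the scale S minimizing the penalty T(s) of Eq. 16."""
    if s_max is None:
        s_max = float(ps.max())                  # upper bound smax = max(Ni)
    s_grid = np.arange(s_min, s_max + step, step)
    ratios = ps[None, :] / s_grid[:, None]       # Ni / s for every trial value s
    penalty = ((ratios - np.rint(ratios)) ** 2).sum(axis=1)  # T(s), Eq. 16
    return float(s_grid[np.argmin(penalty)])     # Eq. 17
```

On a profile with sufficiently long segments of both even and odd ploidy, the minimum of T(s) recovers the true scale.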
The procedure described above is highly accurate when the profile contains sufficiently long segments with both even and odd ploidies P. In profiles strongly dominated by even ploidies, the penalty contributions from odd segments may be drowned in noise, resulting in scale estimates that are (almost exactly) twice the true value. This can happen, for example, in a profile of a female subject with very few and very short heterozygous deletions or duplications. To overcome this difficulty, the scale estimate obtained from Eq. 17 is further tested for presence of short odd segments, as follows:
First, the normalized PS profile is divided by the scale estimate S previously obtained using Eq. 17. The resulting ratio is then shifted down by its floor, to yield the collapsed profile:

Ci = Ni/S − ⌊Ni/S⌋  (19)

where ⌊x⌋ denotes the floor of x.
The collapsed profile Ci takes values that range from 0 to 1:

Ci ∈ [0, 1]  (20)
When the scale estimate S is close to the true scale value, the collapsed profile values gather close to zero or one, leaving the middle portion of the segment [0, 1] almost completely empty. If, however, the scale estimate S is twice the true value, the segments with odd ploidy give rise to a set of bins with Ci values close to 0.5. At the same time, even-valued ploidies yield collapsed ratios close to zero or one. The presence of a non-negligible number of bins with Ci≈0.5 indicates that the estimate S should be divided by two to yield a value close to the true scale. The observed collapsed bin elevations are therefore used to generate a histogram, with Ci values on the x-axis and their frequencies on the y-axis. In some embodiments, the histogram contains ≈100 data points, with a resolution of ≈10⁻². In some embodiments, the histogram resolution scales with the number of bins.
If contributions from odd-valued ploidies are small relative to even-valued ploidies, the peak at Ci≈0.5 may be overshadowed by the spikes at 0 and 1. To facilitate detection of the odd-valued peak at 0.5 (if the peak exists), even-valued peaks at 0 and 1 are removed as follows. Once the histogram of collapsed elevations Ci is available, its x-axis is truncated on both ends to remove the spikes at zero and one. In some embodiments, histogram bins below CL or above CU are trimmed, leaving only histogram bins within the segment Ci∈[CL,CU]. In some embodiments, CL=0.2 and CU=0.8. In some embodiments, CL and CU are user-defined. In some embodiments, CL and CU are automatically determined based on the widths of histogram spikes at 0 and 1. The removal of spikes at 0 and 1 yields truncated histogram of collapsed elevations.
Next, the baseline of the truncated histogram is estimated and subtracted. Baseline points are obtained by selecting all histogram values corresponding to flanks of the truncated Ci domain. In some embodiments, baseline points are selected within the composite, disjoint segment [0.2, 0.4]∪[0.6, 0.8]. In some embodiments, histogram points with zero frequency values are omitted from estimation of the baseline. The selected baseline points are then used as inputs to polynomial fitting procedure yielding a smooth baseline estimate. In some embodiments, the polynomial least squares procedure uses a parabola to fit the baseline. In some embodiments, a fourth-order polynomial is used to fit the baseline. The resulting polynomial baseline estimate is then subtracted from all histogram elevations to yield baseline-corrected histogram of collapsed elevations.
Next, the procedure determines whether the histogram contains a spike at Ci≈0.5. First, the median μ and MAD (median absolute deviation) σ of the baseline-corrected histogram frequencies in predetermined flanks of the Ci domain are obtained. In some embodiments, the flanks comprise the composite, disjoint segment [0.2, 0.4]∪[0.6, 0.8]. Next, the highest K histogram points in the central region of the Ci domain are identified. In some embodiments, the central region matches the segment [0.4, 0.6]. In some embodiments, the number K of selected highest histogram points is 5. In some embodiments, K=3. The median frequency m of the selected K highest central histogram frequencies is then converted to a Z value as follows:
z=(m−μ)/σ (21)
m in Eq. 21 is the median of K highest central histogram frequencies, μ is the median of histogram frequencies in the flanks, and σ is the MAD of the histogram frequencies in the flanks.
Next, the procedure uses a preset cutoff value z0 to determine whether there is a spike at 0.5. In some embodiments, z0=6. In some embodiments, z0=3. If z<z0, there is no evidence supporting the existence of a spike at Ci≈0.5, and the null hypothesis (S is correct) is not rejected. If z≥z0, the null hypothesis is rejected, meaning that the histogram indeed contains a spike at 0.5 and the current scale estimate S is twice the true value. If the collapsed elevation data suggest that this alternative hypothesis should be accepted, the scale estimate is corrected by division by two: S←S/2.
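The spike test can be sketched as follows (a simplified sketch: a 100-bin histogram, a parabolic baseline fitted over the flanks [0.2, 0.4]∪[0.6, 0.8], the K=5 highest central points, and the MAD-based Z score of Eq. 21; the helper name is ours):

```python
import numpy as np

def spike_at_half(collapsed, z0=6.0, k=5):
    """Detect a histogram spike at C ~ 0.5 indicating a doubled scale estimate."""
    freq, edges = np.histogram(collapsed, bins=100, range=(0.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    flank = ((centers >= 0.2) & (centers <= 0.4)) | ((centers >= 0.6) & (centers <= 0.8))
    coef = np.polyfit(centers[flank], freq[flank], 2)   # parabolic baseline fit
    corrected = freq - np.polyval(coef, centers)        # baseline-corrected histogram
    mu = float(np.median(corrected[flank]))             # flank median
    sigma = float(np.median(np.abs(corrected[flank] - mu))) or 1.0  # flank MAD
    central = (centers >= 0.4) & (centers <= 0.6)
    top = np.sort(corrected[central])[-k:]              # K highest central points
    z = (float(np.median(top)) - mu) / sigma            # Eq. 21
    return z >= z0
```

If the test returns true, the scale estimate is halved (S←S/2), as described above.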
The above procedure yields correct scale estimates in most profiles. The only exceptions are profiles consisting entirely of even-valued ploidies. Examples include replicating cells at the end of S phase, neuronal cells in G2 phase, or cells that underwent unscheduled endoreplication, as well as some ATCC cell lines, such as MOLT-16, MOLT-13, IM-9, LOUCY, GP5D, FARAGE, DAUDI, and ARPE-19. A similar situation results when two or more cells with identical karyotypes are captured in the same cell bead or share the same cell barcode, making their sequencing data indistinguishable. In most such cases, copy number alone is insufficient to determine the correct scale, and the scale estimate obtained as described above is unavoidably an integer multiple of the true scale (usually double). However, in the special case of an ideal female profile, an additional assumption that the subject's genome is diploid can remove the degeneracy and allow for correct scale estimation. In an ideal female profile with no deletions or duplications, all autosomal bins have ploidy two, as do the X chromosomal bins. The absence of chr Y in such profiles translates to zero ploidy values for chr Y bins (ignoring the interference from chr Y bins that are similar to chr X regions). Under these circumstances, the PS profile only takes two values: zero (within chr Y) or some positive constant real number (elsewhere), plus noise. A similar situation exists if homozygous deletions are present in autosomal or chr X regions, but heterozygous features are absent. This scenario also includes diploid-haploid hydatidiform moles, such as HM1 or HM13. To allow for correct scale estimation in such cases, the procedure is as follows.
If z<z0, an additional test is performed to answer the question whether the normalized PS profile assumes one or multiple distinct positive values (in addition to zero, and disregarding noise). If there are multiple distinct positive values, no further actions are taken, and the value of S obtained by minimizing the penalty T(s) is used to extract ploidy from the PS profile.
If, on the other hand, z<z0 and PS only takes a single positive value (in addition to zero), the additional assumption that the genome is diploid implies that S should be divided by 2 to yield the best estimate for the scale: S←S/2. This step is correct only if the diploid assumption is true, and will lead to an incorrect result in a haploid genome (such as a gamete). To avoid reporting erroneous scale values in haploid cells, the procedure allows a user-defined flag indicating whether the input biological material is haploid (as in sperm or egg cells). If the haploid flag is set to true, the last test (single or multiple PS levels) is not performed.
D. ccfDNA Profile Normalization by Posterior Ploidy Probability
ℙ(n|S, M, F, f, g, m, L, Q): Likelihood (conditional probability of observing bin count n, given the scale S, maternal ploidy M, fetal ploidy F, fetal fraction f, GC content g, mappability m, and GC bias coefficients L and Q).
ℙ(n; λ): Poisson distribution of the non-negative integer random variable n, with mean λ.
ℙ(W|n): Posterior distribution of combined ploidy values W, given the observed bin count n.
In NIPT ccfDNA samples, the dominant maternal contribution to bin count profile is usually diploid, with minor deviations due to maternal and/or fetal CNVs. The overwhelmingly diploid character of the profile makes the estimation of the scale trivial: S is obtained as the mean or the median of the autosomal portion of the maternal genome.
Unlike in scDNA analyses of cancer CNV profiles, where the scale usually represents the number of reads per bin per chromosome copy, the scale in the ccfDNA NIPT case often represents read count per bin per pair of chromosomes. Typically, scaling is performed by dividing all raw bin counts with the mean or median of the autosomal bins. Thus scaled, a diploid ccfDNA NIPT profile is normally centered on 1, rather than on the number of chromosomes (2 in the diploid case). Deviations from 1 suggest either maternal or fetal CNVs or aneuploidies.
To successfully extract biological features (subchromosomal CNVs and whole-chromosome aneuploidies, either fetal or maternal) from a scaled NIPT ccfDNA count profile, one first needs to address technical sources of unevenness, including mappability, GC bias, and stochastic errors due to counting statistics. To reduce these sources of error, we model the raw bin count n as a Poisson random variable with the mathematical expectation given by the following expression:

λ = (S/2)·[(1 − f)·M + f·F]·m·G(g,L,Q,g0)  (22)
Eq. 22 is similar to the expression used for mean raw bin counts in the scDNA case. The terms g (GC content per bin), L (linear GC bias coefficient), Q (quadratic GC bias coefficient), g0 (GC reference point), and m (mappability) have the same meaning as in the scDNA case. The parabola G(g,L,Q,g0) describing the GC bias is also the same as in the scDNA case. The scale S in Eq. 22 is divided by two, reflecting the ccfDNA convention to scale diploid genomes to one. The main difference is that the single cell ploidy P is here replaced with a combination of the maternal ploidy M and fetal ploidy F, with the mixture coefficients defined by the fetal fraction f. Unlike the integer-valued single cell ploidy P, the combination of maternal and fetal ploidies W=(1−f)M+fF is real-valued and takes different values in different samples, depending on the fetal fraction f.
The proposed solution to the problem of removing systematic errors constructs the likelihood as the conditional probability of observing raw count n, given the combination W of maternal and fetal ploidies, as a Poisson probability with the mean given by Eq. 22:

ℙ(n|S,W,g,m,L,Q) = ℙ(n; λ), λ = (S/2)·W·m·G(g,L,Q,g0)  (23)
The terms S, m, and G(g,L,Q,g0) in Eq. 23 are known for each bin. The bin count n is measured. The only unknown in Eq. 23 is the combined maternal and fetal ploidy W. The term W is the biological source of variation that we want to extract from measurements. To estimate W, we apply Bayes' theorem and evaluate the posterior distribution ℙ(W|n) by multiplying the likelihood ℙ(n|S,W,g,m,L,Q) with the prior ℙ(W) and dividing by the prior for raw bin counts ℙ(n). The choices for the two priors are the same as in the scDNA case. In some embodiments, the count prior ℙ(n) is uniform. In some embodiments, the ploidy prior ℙ(W) is uniform for the first bin, and each subsequent bin is assigned a weighted mixture ploidy prior ℙ(W), composed of two contributions: a uniform distribution and the posterior ℙ(W|n) obtained for the previous bin. In some embodiments, the mixture weights for the uniform and the previous posterior are 0.05 and 0.95, respectively.
In some embodiments, the posterior probability ℙ(W|n) is evaluated for a plurality of values W ranging from 0 to max(n)/S. In some embodiments, the values of the combined ploidy W are sampled equidistantly in intervals of 10⁻³.
Once the posterior probability distribution ℙ(W|n) is evaluated for a range of representative W values in a given bin, the most likely value for the weighted ploidy W is obtained as either 1) the mean, 2) the median, or 3) the mode of the posterior ℙ(W|n). In some embodiments, the preferred method to estimate W is to use the mean of the posterior distribution ℙ(W|n).
The uncertainty in the estimate of the combined ploidy W can be obtained from the posterior distribution ℙ(W|n) as its spread. In some embodiments, the uncertainty is estimated as the standard deviation, MAD, half-width, or inter-quartile range of ℙ(W|n). In some embodiments, the uncertainties in W per bin are used in downstream analyses, for example when measuring fetal fraction or segmenting the normalized W profile to detect fetal and/or maternal CNVs.
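The per-bin posterior computation for the combined ploidy W can be sketched as follows (assuming Poisson counts with the mean of Eq. 22, a uniform count prior, W sampled in steps of 10⁻³, and the 0.05/0.95 mixture prior; as an implementation choice of ours, the trial range is widened to 2·max(n)/S so that the diploid value W = 2 lies inside the grid):

```python
import numpy as np
from math import lgamma

def estimate_w(counts, S, m, G, dw=1e-3, w_prev=0.95):
    """Posterior mean and SD of the combined ploidy W = (1-f)M + fF per bin."""
    w = np.arange(dw, 2.0 * counts.max() / S + 4 * dw, dw)  # trial W values
    uniform = np.full(w.size, 1.0 / w.size)
    prior = uniform.copy()
    means, sds = [], []
    for n, mi, gi in zip(counts, m, G):
        lam = 0.5 * S * w * mi * gi                     # Poisson mean, Eq. 22
        loglik = n * np.log(lam) - lam - lgamma(n + 1)  # Eq. 23 in log form
        post = np.exp(loglik - loglik.max()) * prior
        post /= post.sum()
        mu = float(np.dot(w, post))                     # posterior mean of W
        means.append(mu)
        sds.append(float(np.sqrt(max(np.dot((w - mu) ** 2, post), 0.0))))
        prior = (1.0 - w_prev) * uniform + w_prev * post
    return np.array(means), np.array(sds)
```

The returned per-bin spread values can feed the downstream fetal-fraction and segmentation analyses mentioned above.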
E. Profile Segmentation using Fourier Coefficient Ratio
Ỹ: Scaled array of pre-processed bin counts, obtained by dividing the elements of Y by their total sum.
To facilitate copy-number analyses, genomic coordinates are segregated into an ordered sequence of segments called genomic bins, for convenience also referred to simply as bins. Genomic bins can be topologically contiguous or consist of disjunct sets of genomic coordinates. Successive bins may have non-zero overlap, they may touch one another without overlapping, or may be separated by genomic stretches of varying lengths. The same is true for bins that are not immediate neighbors in the ordered sequence. Genomic bins may be of equal or varying lengths. Typical bin sizes can range from several base-pairs (bp) to multiple kilobases (kb), including bin sizes as large as a few megabases (MB). Examples of bin sizes include 5 kb, 10 kb, 20 kb, 50 kb, 100 kb, 200 kb, 500 kb, 1 Mb, 2 Mb, as well as any positive integer values within or outside this range. The total number of bins is denoted with symbol N.
The sequence of genomic bins is associated with bin indices and with the derived sequence of N real numbers x, reflecting bin indices and thus indirectly also reflecting genomic coordinates of individual bins. The coordinates x are compressed into the segment [0, 2π]. For that reason, this disclosure refers to x as “compressed bin coordinates.” In the special case of contiguous, non-overlapping bins of equal lengths, the sequence x contains elements spanning the range from 0 to 2π and evaluated as follows:
xi = 2πi/N, i ∈ {1, 2, …, N}  (24)
The distance between two successive compressed bin coordinates x is labeled δ and evaluated as follows:

δ = xj+1 − xj = 2π/N, j ∈ {1, 2, …, N−1}  (25)
Further discussions assume non-overlapping, contiguous bins of equal lengths. Extensions from Eq. 24 to the more general cases of overlapping or disjunct bins, possibly having varying lengths, are straightforward to those skilled in the art and thus wholly included in the current disclosure.
Typically, each chromosome or assembly contig is assigned a separate sequence of genomic bins. Alternatively, the entire genome may be represented by a single sequence of bins, effectively merging binning partitions of individual chromosomes or assembly contigs. A single bin may comprise genomic segments from the same chromosome or assembly contig or may include genomic coordinates from multiple different chromosomes or assembly contigs.
To detect copy-number variants (CNVs) in a given genome, the starting point is to sequence fragmented genomic DNA. Next, one aligns the resulting sequences of genomic fragments to the reference genome. Each bin is assigned the count of sequenced reads whose alignment coordinates fall within that bin. The symbol W denotes the resulting array of raw bin counts.
Prior to detecting copy number variants, the raw bin counts W are subject to various pre-processing steps, including but not limited to scaling with respect to read counts per bin per chromosome, normalization with respect to bin mappability, filtering based on bin mappability, repetitiveness, N-base content, or GC content, and normalization with respect to GC content per bin. Various approaches to these pre-processing steps are elaborated elsewhere in this document. The current discussion assumes that the various scaling, filtering, and normalization steps have already been applied to raw bin counts W and that the resulting pre-processed bin counts Y are minimally affected by technical sources of variation. As such, the counts Y are deemed ready to be inspected for presence of biological copy-number variants.
For convenience, the pre-processed bin counts Y are scaled with respect to their overall number to yield the scaled bin counts Ỹ:

Ỹi = Yi / Σj=1…N Yj  (26)
To identify copy-number variant edges and ploidy values, we employ a uniformity statistic Q whose value reflects the uniformity of the profile. Ideally, a uniform (homogeneous) profile should give rise to uniformity statistic values q that follow a well-defined, narrow probability distribution. We denote the uniform-profile case as the null hypothesis H0 and the resulting distribution of the uniformity statistic as ℙ0(q). Non-uniform profiles, consisting of two or more segments at different elevations, correspond to the alternative hypothesis H1. If the alternative hypothesis is true, it should result in a distribution ℙ1(q) that minimally overlaps with the null distribution ℙ0(q).
Copy-number detection based on the uniformity statistic Q employs the segmentation procedure roughly outlined as follows. The segmentation starts by evaluating the uniformity statistic Q for the entire profile or its selected portion. If the resulting value q is well within the null distribution ℙ0(q), the profile (or its selected portion) is uniform and segmentation is finished. Otherwise, the current segment is iteratively divided (either by bisection or using an optimized procedure, as elaborated below) and tested for uniformity. If uniform, the subdivisions are flagged as such. Non-uniform segments are further subdivided until all subdivisions are uniform or a pre-defined resolution limit is reached. Finally, neighboring uniform segments of the same elevation are merged. The end result is a set of profile segments with uniform elevations.
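The outlined recursion can be sketched generically, with the uniformity test supplied as a callable so that F(K), G(K), or any other statistic with a null-distribution cutoff can be plugged in (function names, the minimum segment length, and the merge tolerance are illustrative):

```python
import numpy as np

def segment(y, is_uniform, lo=0, hi=None, min_len=10):
    """Recursively bisect [lo, hi) until every piece passes the uniformity test."""
    if hi is None:
        hi = len(y)
    if hi - lo <= min_len or is_uniform(y[lo:hi]):   # resolution limit or uniform
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (segment(y, is_uniform, lo, mid, min_len)
            + segment(y, is_uniform, mid, hi, min_len))

def merge_segments(segments, y, tol=0.25):
    """Merge neighboring segments whose median elevations agree within tol."""
    merged = [segments[0]]
    for lo, hi in segments[1:]:
        plo, phi = merged[-1]
        if abs(np.median(y[lo:hi]) - np.median(y[plo:phi])) < tol:
            merged[-1] = (plo, hi)
        else:
            merged.append((lo, hi))
    return merged
```

The final merged list is the set of uniform-elevation segments described above.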
Segments in a segmented profile may have integer or real elevations.
To effectively perform segmentation, the procedure needs an optimal choice of uniformity statistic Q. The inventor of the currently disclosed procedure has previously described two possible choices for Q: Jensen-Shannon entropy and Differential Variance Ratio. Both successfully detect CNVs by segmenting genomic profiles into regions of uniform elevation. The current disclosure describes two novel and superior alternatives: the Fourier Coefficient Ratio F(K) and the Fourier Coefficient Ratio divided by the Differential Variance, G(K), further explained below.
Before defining the two newly proposed statistics and explicitly stating the expressions to evaluate them, we first introduce notation and definitions for the Fourier transform of the scaled profile Ỹ. Given the array Ỹ and the increment δ (cf. Eq. 25), the zero-order Fourier coefficient a0 is:

a0 = (δ/2π) Σi=1…N Ỹi  (27)
The integer K (K≥1) represents the truncation order of the Fourier expansion of the scaled bin count array Ỹ. The k-th (1≤k≤K) cosine coefficient of the Fourier expansion is ak, evaluated as follows:

ak = (δ/π) Σi=1…N Ỹi cos(k·xi)  (28)
The corresponding k-th sine Fourier coefficient bk (1≤k≤K) is evaluated as follows:

bk = (δ/π) Σi=1…N Ỹi sin(k·xi)  (29)
The first proposed candidate for the uniformity statistic Q is the Fourier Coefficient Ratio, denoted by the symbol F(K) and evaluated as the sum of squares of the sine and cosine Fourier coefficients ak and bk (with k running from 1 to the truncation order K), divided by the square of the zero-order Fourier coefficient a0:

F(K) = [Σk=1…K (ak² + bk²)] / a0²  (30)
The formulation of the statistic F(K) according to Eq. 30 is motivated by the fact that, ideally, a0 is the only non-zero Fourier coefficient in a uniform profile. In reality, due to noise in the scaled bin counts Ỹ, higher-order Fourier coefficients ak and bk (1≤k≤K) may slightly deviate from zero. The resulting estimate of F(K) is therefore a random variable with a narrow distribution ℙ0(F) centered close to zero, satisfying one of the fundamental criteria for the optimal statistic Q.
The value of the Fourier Coefficient Ratio F(K) depends on the choice of the truncation order K. Eq. 30 therefore represents a family of related statistics, one for each K (K≥1). The optimal choice of K depends on the profile length, desired resolution of CNV detection, and the amount of noise. In practice, K may take values 3, 5, 7, or any value including but not limited to integers from 3 to 10.
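Eqs. 24-30 translate directly into code (a sketch assuming contiguous, non-overlapping, equal-length bins; the function name is ours):

```python
import numpy as np

def fourier_coefficient_ratio(y, K=5):
    """Uniformity statistic F(K): Fourier power of orders 1..K over a0 squared."""
    N = y.size
    x = 2.0 * np.pi * np.arange(1, N + 1) / N            # compressed coords, Eq. 24
    delta = 2.0 * np.pi / N                              # Eq. 25
    a0 = delta / (2.0 * np.pi) * y.sum()                 # Eq. 27
    k = np.arange(1, K + 1)[:, None]
    a = delta / np.pi * (y * np.cos(k * x)).sum(axis=1)  # Eq. 28
    b = delta / np.pi * (y * np.sin(k * x)).sum(axis=1)  # Eq. 29
    return float((a ** 2 + b ** 2).sum() / a0 ** 2)      # Eq. 30
```

A flat profile yields F(K) very close to zero, while a two-elevation profile yields a value dominated by the k = 1 term, in line with Eq. 31.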
When evaluated according to Eq. 30 on a non-uniform profile Ỹ, the statistic F(K) follows an alternative-hypothesis distribution ℙ1(F) distinct from ℙ0(F). The exact location and spread of ℙ1(F), as well as the amount of overlap between the distributions ℙ0(F) and ℙ1(F), depend on the amount of noise within uniform segments of the (possibly) non-uniform profile Ỹ, the lengths of the segments, and the differences among their elevations. Larger distances separating segment elevations and increased fractional lengths (closer to π on the x scale, or to ½ on the scale from 0 to 1) of stretches placed at different elevations result in a larger gap between ℙ0(F) and ℙ1(F). On the other hand, larger variance within uniformly elevated segments reduces the gap. In most practically relevant cases, the gap is sufficient for the segmentation to achieve high performance levels (low false positive and false negative rates).
As an example, consider a profile consisting of two segments at elevations E and E+Δ. Assume that the first segment occupies x coordinates from the origin to f, with the second placed between f and 2π. The Fourier coefficients ak are then:

ak = −(Δ/(kπ)) sin(kf)
Similarly, the sine Fourier coefficients bk are:

bk = (Δ/(kπ)) (cos(kf) − 1)
Notice:

ak² + bk² = (Δ/(kπ))² [sin²(kf) + (cos(kf) − 1)²] = (4Δ²/(k²π²)) sin²(kf/2)

The resulting expression for F(K) is then:

F(K) = (4Δ²/π²) [Σk=1…K sin²(kf/2)/k²] / a0², a0 = E + Δ(2π − f)/(2π)  (31)
The dominant term on the RHS of Eq. 31 corresponds to k=1. It increases quadratically with the gap between elevations, Δ. In addition, it reaches its maximum at f=π and vanishes at f=0 or f=2π, in accord with intuitive expectations. Higher-order terms (k>1) additionally strengthen this trend. This example illustrates that the uniformity statistic F(K) exhibits the desired behavior when the alternative hypothesis H1 is true.
While both the null-hypothesis distribution ℙ0(F) and the alternative-hypothesis distribution ℙ1(F) of F(K) show desirable features (ℙ0(F) is tightly concentrated near zero, while ℙ1(F) is shifted to the right of ℙ0(F), with the distance between the two distributions increasing as the gap Δ and the affected fraction f increase), they also both present the same difficulty: the mean values of both ℙ0(F) and ℙ1(F) depend on the variance within the uniform profile segments. Two otherwise identical flat profiles with different amounts of noise result in different ℙ0(F) distributions, with their mean, median, and mode values shifting in accordance with the extent of spread in each profile. The same is true for ℙ1(F). In other words, F(K) is not a pivotal quantity. It would be more convenient if both the null-hypothesis and alternative-hypothesis distributions of the uniformity statistic Q were independent of the variance of the uniform segments of the profile.
The solution is to convert the non-pivotal uniformity statistic F(K) into a pivotal uniformity statistic by dividing it by another statistic that behaves the same as F(K) when the profile is flat (and, in particular, exhibits the same dependence on the variance of the flat profile), but unlike F(K) does not "sense" the presence of non-uniformity in the profile (segments at different elevations). One such statistic, sensitive to variance in flat portions of the profile but insensitive to changes in elevation, is the Differential Variance, D, defined below.
Before defining the Differential Variance D, we first introduce the successive-differences profile z. The array z contains N−1 elements derived from the scaled profile Ỹ by subtracting Ỹ's neighboring elements from one another:

zi = Ỹi+1 − Ỹi, i ∈ {1, 2, …, N−1}  (32)
The array z can be viewed as the differential of Ỹ: z = (dỸ/dx)·δ. The Differential Variance statistic D is defined as the variance of the successive-differences profile z:

D = Var(z)  (33)
If the original profile Ỹ is flat, D is directly proportional to the variance of Ỹ, with a proportionality constant of 2:

D = 2·Var(Ỹ)  (34)
When the profile consists of two or more segments at different elevations, the Differential Variance D is proportional to the variance of the individual uniform segments within the non-uniform profile Ỹ. The operation of taking successive differences flattens the elevation, so that a non-uniform profile gives almost the same value of D as a uniform profile with the same within-segment variance. The changes in elevation in the profile Ỹ, if they exist, are reflected in z as upward or downward spikes, depending on whether the elevation increases or decreases at a given location. The spikes in z can potentially affect the value of D. To avoid this undesired effect, extreme values of |z| can be trimmed prior to evaluating D.
When the profile Ỹ is flat, the uniformity statistic F(K) is proportional to the Differential Variance D, and both are proportional to the variance of Ỹ. Thus, dividing F(K) by D results in a statistic that is independent of noise when the profile is flat. We therefore define a new uniformity statistic G(K), representing the Fourier Coefficient Ratio F(K) normalized with respect to D:
G(K)=F(K)/D (35)
Similarly to F(K), the statistic G(K) is a function of the Fourier truncation order K. Also, as in the case of F(K), the normalized uniformity statistic G(K) estimated from a uniform profile is tightly distributed close to the origin. Unlike F(K), the mean and spread of ℙ0(G) do not depend on the noise in the profile Ỹ, meaning that G(K) is pivotal. The use of the redefined uniformity statistic G(K) is therefore preferred over F(K).
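A corresponding sketch of the pivotal statistic G(K), with optional trimming of |z| spikes before evaluating the Differential Variance (the trim fraction and function name are illustrative):

```python
import numpy as np

def g_statistic(y, K=5, trim=0.01):
    """Pivotal uniformity statistic G(K) = F(K) / D (Eqs. 30, 32-35)."""
    N = y.size
    x = 2.0 * np.pi * np.arange(1, N + 1) / N
    k = np.arange(1, K + 1)[:, None]
    a0 = y.mean()                                    # Eq. 27 with delta = 2*pi/N
    coef = 2.0 / N                                   # delta / pi
    a = coef * (y * np.cos(k * x)).sum(axis=1)       # Eq. 28
    b = coef * (y * np.sin(k * x)).sum(axis=1)       # Eq. 29
    F = float((a ** 2 + b ** 2).sum() / a0 ** 2)     # Eq. 30
    z = np.diff(y)                                   # successive differences, Eq. 32
    z = z[np.abs(z) <= np.quantile(np.abs(z), 1.0 - trim)]  # trim elevation spikes
    return F / float(np.var(z))                      # Eqs. 33 and 35
```

Because the noise dependence cancels in the ratio, G(K) separates flat from stepped profiles regardless of the noise level.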
F. Single Cell Clustering using Fourier Coefficient Ratio
In a heterogeneous mixture of cells with different copy-number profiles, subsets of cells are usually grouped together based on the degree of similarity of their copy number profiles. The motivation for grouping is 1) to improve signal to noise by using collective information from measurements collected on multiple similar cells, and 2) to reconstruct the evolutionary path leading to the present heterogeneous state. The grouping is achieved using one of multiple existing clustering techniques. Two main aspects of cell clustering include: a) clustering algorithm, and b) dissimilarity (or similarity) measure. The current disclosure proposes a new dissimilarity measure: Fourier Coefficient Ratio of Cell Profile Difference.
The following procedure summarizes the use of the Fourier Coefficient Ratio to quantify the degree of dissimilarity between two copy number profiles: each copy number profile is first scaled, the two scaled profiles are then subtracted from one another, and the Fourier Coefficient Ratio (divided by the Differential Variance Ratio) is evaluated on the resulting difference profile.
The resulting value of the Fourier Coefficient Ratio divided by Differential Variance Ratio represents the dissimilarity for the given pair of copy number profiles. When two profiles are similar, this metric takes values close to zero. Very dissimilar profiles give rise to high FCR values, which quickly increase as the dissimilarity between profiles increases.
In some embodiments, the Fourier Coefficient Ratio without division by the Differential Variance Ratio is used as the dissimilarity metric. In some embodiments, Jensen-Shannon entropy is used as the dissimilarity metric. In some embodiments, the Differential Variance Ratio is used as the dissimilarity metric.
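One plausible realization of this dissimilarity can be sketched as follows (a sketch, not necessarily the exact disclosed formulation: each profile is scaled to unit mean, the profiles are subtracted, and the low-order Fourier power of the difference is divided by its Differential Variance; since the difference profile has near-zero mean, the raw Fourier power is used in place of the a0-normalized ratio):

```python
import numpy as np

def profile_dissimilarity(p1, p2, K=5, trim=0.01):
    """Fourier-power-based dissimilarity of two copy-number profiles."""
    d = p1 / p1.mean() - p2 / p2.mean()        # difference of scaled profiles
    N = d.size
    x = 2.0 * np.pi * np.arange(1, N + 1) / N
    k = np.arange(1, K + 1)[:, None]
    coef = 2.0 / N
    a = coef * (d * np.cos(k * x)).sum(axis=1)
    b = coef * (d * np.sin(k * x)).sum(axis=1)
    power = float((a ** 2 + b ** 2).sum())     # low-order Fourier power of difference
    z = np.diff(d)                             # successive differences of difference
    z = z[np.abs(z) <= np.quantile(np.abs(z), 1.0 - trim)]
    return power / float(np.var(z))            # normalized by Differential Variance
```

Shared CNV structure cancels in the difference, so similar cells score near zero, while a CNV present in only one profile leaves low-order Fourier power and a large score; the resulting pairwise matrix can feed any standard clustering algorithm.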
G. Quantification of Impurities in cfDNA Using Minor Allele Frequencies
𝒰(x; l, u): Continuous uniform distribution for random variable x, with domain [l, u].
ℬ(x; α, β): Beta distribution for random variable x, with parameters α and β.
ℙ(g): Prior probability of contamination. Equals 1 − pC when g = __, or pC otherwise.
ℙ(G, g): Prior probability for the given combination of sample and contamination genotypes G and g.
ℙ(ψ|nA, nR): Conditional probability of alternate allele frequency ψ, given the observed allele counts nA and nR.
Problem Formulation
Given measured raw allele counts across a panel of preselected SNP loci in a human sample, estimate contamination level, along with the associated uncertainty.
Genotypes and Allele Frequencies
Notation: uppercase symbols R and A represent reference and alternate alleles at a given locus in the main sample. Lowercase r and a stand for reference and alternate alleles in contaminating DNA material, if present. Two consecutive underscores denote absence of contamination. The following twelve scenarios are possible:
G.g: ℙ(ψ|nA, nR)
RR.rr: 𝒰(ƒ; 0, ½)
RR.ra: ℬ(ƒ/2; nA + 1, nR + 1)
RR.aa: ℬ(ƒ; nA + 1, nR + 1)
RA.rr: ℬ((1 − ƒ)/2; nA + 1, nR + 1)
RA.ra: 𝒰(ƒ; 0, ½)
RA.aa: ℬ((1 + ƒ)/2; nA + 1, nR + 1)
AA.rr: ℬ(1 − ƒ; nA + 1, nR + 1)
AA.ra: ℬ(1 − ƒ/2; nA + 1, nR + 1)
AA.aa: 𝒰(ƒ; 0, ½)
Table 0.1 lists the prior probabilities ℙ(G) and ℙ(g) in columns E-F for all 12 scenarios.
The conditional probability of contamination level values f, given the observed allele counts nA and nR, is obtained by summing up contributions from all twelve scenarios:

ℙ(f|nA,nR) = ΣG,g ℙ(G,g)·ℙ(ψG,g(f)|nA,nR)  (36)

where ψG,g(f) denotes the scenario-specific alternate allele frequency.
To enhance the predictive power of Eq. 36, we introduce non-negative weights w(G,g), satisfying the normalization conditions:

w(G,g) ≥ 0  (37)

ΣG,g w(G,g) = 1  (38)
In some embodiments, all weights w(G,g) are equal. In some embodiments, selected combinations of sample and contamination genotypes G,g are assigned higher weights than other scenarios. In some embodiments, uninformative scenarios (RR.rr, RA.ra, and AA.aa) are assigned lower weights than the remaining, informative scenarios. In some embodiments, scenarios with heterozygous contamination genotypes are assigned higher weights than other scenarios. In some embodiments, scenarios with homozygous contamination genotypes are assigned higher weights than other scenarios. In some embodiments, an arbitrary subset of scenarios is favored by assignment of higher weights.
The introduction of weights transforms Eq. 36 as follows:

ℙ(f|nA,nR) = ΣG,g w(G,g)·ℙ(G,g)·ℙ(ψG,g(f)|nA,nR)  (39)
The fraction of contamination f0 is estimated as the central tendency of the posterior probability distribution (f|nA,nR). In some embodiments, the estimate f0 is obtained as the mode (location of the maximum) of the posterior (f|nA,nR):
f0 = argmax (f|nA,nR)   (40)
In some embodiments, the estimate f0 is obtained as the mean of the posterior (f|nA,nR):
f0 = ∫0½ (f|nA,nR) f df   (41)
In some embodiments, the estimate f0 is obtained as the median of the posterior (f|nA,nR), i.e. as the solution of the following integral equation:
∫0f0 (f|nA,nR) df = ½   (42)
The uncertainty in f0 is estimated as the spread of the posterior probability distribution (f|nA,nR). In some embodiments, the uncertainty in f0 is obtained as the variance of the posterior (f|nA,nR):
Var(f0) = ∫0½ (f|nA,nR) f² df − (∫0½ (f|nA,nR) f df)²   (43)
In some embodiments, the uncertainty in f0 is obtained as the half-width of the posterior (f|nA,nR). In some embodiments, the uncertainty in f0 is obtained as the MAD (median absolute deviation) of the posterior (f|nA,nR).
In some embodiments, uninformative loci (those with reference allele counts too low, or alternate allele counts too low) are filtered out. In some embodiments, reference or alternate allele counts are deemed too low if they are lower than a user-defined cutoff c. In some embodiments, loci with nA<c or nR<c are filtered out. In some embodiments, the cutoff c is set to 1.
In some embodiments, loci with too few total counts are filtered out. In some embodiments, a user-defined lower limit t on total allele counts is imposed. In some embodiments, the lower limit t is 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or any value in between. In some embodiments, loci with nA+nR<t are filtered out.
In some embodiments, loci with too many total counts are filtered out. In some embodiments, a user-defined upper limit T on total allele counts is imposed. In some embodiments, the upper limit T is 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, or any value in between. In some embodiments, loci with nA+nR>T are filtered out.
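The three locus filters above can be combined into a single pass; the sketch below is illustrative, with default cutoffs (c = 1, t = 50, T = 5000) chosen from the ranges mentioned in the text.

```python
# Illustrative composite of the per-allele cutoff c, the total-count
# lower limit t, and the total-count upper limit T described above.
def filter_loci(loci, c=1, t=50, T=5000):
    """Keep loci whose allele counts pass the informativeness filters.

    `loci` is a list of (n_A, n_R) alternate/reference count pairs.
    """
    kept = []
    for n_A, n_R in loci:
        if n_A < c or n_R < c:   # uninformative: an allele barely seen
            continue
        if n_A + n_R < t:        # too few total counts
            continue
        if n_A + n_R > T:        # too many total counts
            continue
        kept.append((n_A, n_R))
    return kept
```

For example, `filter_loci([(0, 80), (30, 40), (3, 2), (4000, 3000)])` keeps only `(30, 40)` under the default cutoffs.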
In some embodiments, loci with drop-out alleles are corrected by introducing additive factors νA and νR to allele counts. In those embodiments, posterior probabilities are evaluated by replacing nA and nR with nA+νA,nR+νR, respectively.
In some embodiments, loci with drop-out alleles are corrected by introducing multiplicative factors μA and μR to allele counts. In those embodiments, posterior probabilities are evaluated by replacing nA and nR with nAμA,nRμR, respectively.
In some embodiments, loci with drop-in alleles are corrected by introducing additive factors ϵA and ϵR to allele counts. In those embodiments, posterior probabilities are evaluated by replacing nA and nR with nA−ϵA,nR−ϵR, respectively.
In some embodiments, loci with drop-in alleles are corrected by introducing multiplicative factors ηA and ηR to allele counts. In those embodiments, posterior probabilities are evaluated by replacing nA and nR with nAηA,nRηR, respectively.
In some embodiments, background noise is estimated by counting alternate alleles at loci that are likely homozygous for reference allele, or by counting reference alleles at loci that are likely homozygous for alternate allele.
In some embodiments, locus-specific and sample-specific GC bias is corrected by using additive or multiplicative correction. In some embodiments, the GC correction is a linear or quadratic function of the GC content of the region surrounding the locus. In some embodiments, the GC correction is also a function of a sample-specific coefficient.
In some embodiments, reference bias is corrected by applying an additive or multiplicative term to reference and/or alternate allele counts.
In some embodiments, the conditional probability (f|nA,nR) is modeled using a Poisson distribution, rather than a beta distribution, following the same approach as the model described in Section H. (Measurement of Fetal Fraction in cfDNA Using Minor Allele Frequencies).
Once the continuous distributions of contamination level are constructed for all loci in a sample, the distributions are logarithmically transformed and added together to yield a log(P) profile over all loci. The log(P) profile is then further examined to estimate the contamination level per sample, as well as the associated uncertainty. Mean, mode, median, variance, and MAD for the combined continuous probability distribution of the contamination are obtained using standard procedures known to those skilled in the art.
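The per-sample aggregation step can be sketched as follows, assuming each locus has already produced a posterior over contamination levels f on a shared uniform grid (function and variable names here are illustrative, not from the disclosure):

```python
import numpy as np

def combine_locus_posteriors(locus_posteriors, f_grid):
    """Sum log-posteriors across loci; return mode, mean, and variance."""
    log_p = np.sum(np.log(locus_posteriors), axis=0)  # the log(P) profile
    log_p -= log_p.max()                              # numerical stability
    p = np.exp(log_p)
    df = f_grid[1] - f_grid[0]
    p /= p.sum() * df                                 # renormalize
    mode = f_grid[np.argmax(p)]                       # central tendency
    mean = np.sum(p * f_grid) * df
    var = np.sum(p * f_grid**2) * df - mean**2        # spread
    return mode, mean, var
```

The mode, mean, and variance returned here correspond to the summary statistics named in the text; median and MAD follow analogously from the normalized grid.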
H. Measurement of Fetal Fraction in cfDNA Using Minor Allele Frequencies
(ƒ): Prior distribution of fetal fraction values.
(ƒ; α, β): Beta distribution of continuous random variable ƒ ϵ [0, 1], with shape parameters α and β.
(nR, nA): Prior distribution of observed reference and alternate allele counts.
(nR, nA|ƒ): Conditional probability of observing nR reference allele counts and nA alternate allele counts, given fetal fraction ƒ.
(ƒ|nR, nA): Posterior probability of fetal fraction ƒ, given observed reference and alternate allele counts nR and nA.
(n, λ): Poisson distribution of non-negative integer random variable n with mean λ.
This section describes a procedure to estimate fetal fraction in circulating cell-free DNA from maternal plasma based on reference and alternate allele counts in a set of preselected SNP loci. The panel of SNPs is selected such that all loci have significant alternate allele frequency in the population. As a result, it is likely that a NIPT sample will have a sufficient number of informative loci, where either the mother is homozygous and the fetus heterozygous or the fetus is homozygous and the mother heterozygous.
The approach proposed here constructs the posterior probability of fetal fraction f for a single locus (f|nR,nA), conditioned on the observed reference and alternate allele counts nR and nA for that locus. Assuming independent loci, posterior distributions from different loci across the SNP panel are multiplied together to yield the final combined posterior distribution (f|{nR,nA}), where the set {nR,nA} denotes measured allele counts for all loci. The fetal fraction estimate is obtained as the average, median, or mode of the posterior distribution (f|{nR,nA}). The uncertainty in the estimate is derived from the variance of (f|{nR,nA}).
Fetal Fraction Prior
The prior distribution for f can be either uniform on the segment from 0 to 1 (uninformative), or derived from observed fetal fraction values in a large number of NIPT samples (informative). The observed prior can be modeled using the beta distribution with shape parameters αF and βF:
The shape parameters αF and βF are chosen such that the prior (f) has mean μF and standard deviation σF values matching the average and standard deviation values observed in a representative set of NIPT samples. The shape parameters are related to μF and σF as follows:
Solving the above two equations for shape parameters gives the following expressions:
Typical values observed in NIPT samples are μF=0.10 and σF=0.04, giving the following values for shape parameters:
αF=5.525 (49)
βF=49.725 (50)
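The moment-matching step can be verified numerically using the standard beta-distribution moment relations; the helper below is a sketch (the function name is not from the disclosure) and reproduces Eqs. 49-50:

```python
def beta_shape_from_moments(mu, sigma):
    """Shape parameters of a beta distribution with mean mu and s.d. sigma."""
    nu = mu * (1.0 - mu) / sigma**2 - 1.0  # nu equals alpha + beta
    return mu * nu, (1.0 - mu) * nu

# Typical NIPT values from the text: mu_F = 0.10, sigma_F = 0.04
alpha_F, beta_F = beta_shape_from_moments(0.10, 0.04)
# alpha_F = 5.525 and beta_F = 49.725, matching Eqs. 49-50
```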
Genotype Scenarios
Assuming that both maternal and fetal loci are diploid and that no de novo mutations take place in fetal DNA, there are seven possible combinations of maternal and fetal genotypes. Table 0.2 lists all possible scenarios, as well as their prior probabilities (based on the alternate allele population frequency pA) and the corresponding expectation values for allele counts.
The locus-specific expected total read count N is not necessarily the same as the observed total read count at that locus (nR+nA). Rather than the actual per-locus count, N represents a fraction of the total read counts across the entire SNP panel, with the proportionality constant reflecting the characteristics of the locus. Various biases (allele drop-in, allele drop-out, locus drop-out, GC bias, mappability bias) can be modeled by introducing multiplicative or additive terms modulating the expected total read count N in Tbl. 0.3.
Likelihood
The joint conditional probability of observed reference and alternate allele counts nR and nA, given a fetal fraction value f, is a sum of terms corresponding to different scenarios 1-7. Each term in the sum is a product of the scenario's weights wM(M,pA) and wF(M,F,pA) (listed in Tbl. 0.2) and two Poisson distributions—one for nR with the mean λR(M,F,f,N)=E(nR), and the other for nA with the mean λA(M,F,f,N)=E(nA) (Tbl. 0.3):
The symbol (n, λ) in Eq. 51 denotes the Poisson distribution of integer random variable n with mean λ. Likelihood (nR,nA|f) is a function of the population frequency pA and the expected total read count at the given locus N. In the interest of space, this dependence is not explicitly spelled out. The sum in Eq. 51 runs over allowed genotype pairs (M,F) tabulated in Tbls. 0.2-0.3.
In some embodiments, the summation in Eq. 51 omits uninformative terms. In some embodiments, additional non-zero positive weights are assigned to selected scenarios.
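A minimal sketch of the sum in Eq. 51 follows. Each genotype scenario contributes its weights times a product of two Poisson terms; the (w_M, w_F, lam_R_fn, lam_A_fn) tuple layout stands in for the entries of Tbls. 0.2-0.3 and is an assumption of this example.

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """Poisson probability of observing n counts given mean lam."""
    return lam ** n * exp(-lam) / factorial(n)

def likelihood(n_R, n_A, f, N, scenarios):
    """Weighted sum over genotype scenarios, as in Eq. 51.

    Each scenario supplies weights w_M, w_F and callables mapping
    (f, N) to the scenario's Poisson means for nR and nA.
    """
    total = 0.0
    for w_M, w_F, lam_R_fn, lam_A_fn in scenarios:
        total += (w_M * w_F
                  * poisson_pmf(n_R, lam_R_fn(f, N))
                  * poisson_pmf(n_A, lam_A_fn(f, N)))
    return total
```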
Allele Count Prior
The joint prior probability for observed allele counts nR and nA is obtained by integrating the product of the likelihood (nR,nA|f) and the fetal fraction prior (f) over the entire domain of fetal fraction values:
(nR,nA)=∫01(nR,nA|f)(f)df (52)
The actual value of the integral in Eq. 52 depends on the choice for the fetal fraction prior (f) (uniform or beta-distribution). In either case, the integrals are evaluated straightforwardly using standard tabulated values. The count prior (nR,nA) is a function of the population frequency pA and the expected total read count N. If the informative fetal fraction prior is used, the count prior is also a function of the shape parameters αF and βF.
Per-Locus and Overall Fetal Fraction Posterior Distributions
The desired posterior distribution of fetal fraction values f, conditioned on the observed allele counts nR and nA for a given locus, is obtained from the likelihood (nR,nA|f) and from the two priors (f) and (nR,nA) by applying Bayes' theorem:
The overall posterior distribution of fetal fraction values f, conditioned on all observed allele counts {nR,nA} across the entire SNP panel, is obtained by multiplying per-locus posterior distributions (f|nR,nA):
The product index i in Eq. 54 runs over all K SNPs in the panel, assuming that the individual loci are independent. In practice, rather than multiplying per-locus posterior probabilities, the final result is obtained by summing up their logarithms, to improve numerical stability as known to those skilled in the art.
Fetal Fraction Estimate and Error
The sample's fetal fraction is estimated as the central tendency of the overall posterior distribution (f|{nR,nA}). In some embodiments, the estimate is obtained as the mean of (f|{nR,nA}). In some embodiments, the estimate is obtained as the median of (f|{nR,nA}). In some embodiments, the estimate is obtained as the mode of (f|{nR,nA}). The associated error bar is obtained from the spread of (f|{nR,nA}). In some embodiments, the error bar is obtained from the standard deviation of (f|{nR,nA}). In some embodiments, the error bar is obtained from the MAD of (f|{nR,nA}). In some embodiments, the error bar is obtained as the half-width of (f|{nR,nA}).
Quantification of contaminants, described in the previous section, can be reformulated in terms of the approach described here. The main differences are that additional genotype scenarios are possible and no restrictions are imposed on minor genotype contribution. Otherwise, the same framework described here for measurement of fetal fractions also applies to measurements of contaminant levels.
I. Detection of Fetal Copy-Number Variants in cfDNA Using Minor Allele Frequencies
Table 0.4 lists all possible combinations of maternal and fetal genotypes, assuming diploid mother and heterozygous fetal deletion at the locus in question. The probabilities of maternal genotypes wM(M,pA) and the probabilities of fetal genotypes wF(M,F,pA) are also listed. Both maternal and paternal allele losses are included, along with the corresponding weight factors.
Table 0.5 lists the expectation values for reference and alternate allele counts (nR and nA, respectively), assuming the maternal genotype is diploid and the fetus has a heterozygous deletion.
Table 0.6 lists all possible combinations of maternal and fetal genotypes, assuming diploid mother and heterozygous fetal duplication at the locus in question. The probabilities of maternal genotypes wM(M,pA) and the probabilities of fetal genotypes wF(M,F,pA) are also listed. Both maternal and paternal allele gains are included, along with the corresponding weight factors.
Table 0.7 lists the expectation values for reference and alternate allele counts (nR and nA, respectively), assuming the maternal genotype is diploid and the fetus has a heterozygous duplication.
Detection of fetal copy number variation at a given locus or genomic segment based on measured reference and alternate allele counts consists of several steps. The first step estimates the posterior distribution (f|{nR,nA}) of fetal fraction values, given observed reference and alternate allele counts at a plurality of genomic loci, as described on page 63. In some embodiments, the loci used to estimate the posterior distribution of fetal fraction values belong to genomic regions that are expected to be diploid in both the mother and the fetus. In some embodiments, the set of genomic loci used to estimate (f|{nR,nA}) excludes the SNP loci used to detect copy
number variants. In some embodiments, the set of genomic loci used to estimate (f|{nR,nA}) includes the SNP loci used to detect copy number variants. In some embodiments, fetal fraction estimation is performed assuming only diploid genotype scenarios (Tbl. ??). In some embodiments, fetal fraction estimation includes scenarios that account for fetal deletions and/or duplications (Tables 0.4 and 0.6). In some embodiments, the posterior distribution of fetal fractions is estimated by means other than measured allele counts. In some embodiments, the posterior distribution of fetal fraction values is obtained from whole genome resequencing data, including but not limited to chrX representation, chrY representation, fetal aneuploidy (trisomy or monosomy) of autosomal chromosomes, fetal subchromosomal CNVs, maternal subchromosomal CNVs, read count non-uniformity, or observed read lengths. In some embodiments, the posterior distribution (f|{nR,nA}) is replaced with the prior (f) (see page 63).
The model operates with predefined (or user-defined) prior probabilities of duplications and deletions, pI and pD, respectively. The prior probability of diploid scenario is obtained as 1−pD−pI.
The next step evaluates likelihood values (nR,nA|f,M,F) per scenario,
including all euploid scenarios (Tbl. ??), all fetal deletion scenarios (Tbl. 0.4), and all fetal duplication scenarios (Tbl. 0.6). The likelihoods are obtained as products of two Poisson distributions per locus (one for reference counts and one for alternate allele counts). The Poisson expectation values are listed in Tbl. ?? for diploid scenarios, Tbl. 0.5 for fetal deletions, and Tbl. 0.7 for fetal duplications. The expressions adopt the same form as previously used to estimate fetal fraction (cf. Eq. ??). These likelihoods are only evaluated for the loci that are being tested for copy number variations.
The scenario specific likelihoods are next multiplied with prior probabilities of deletions (pD), duplications (pI), or diploid genotypes, as appropriate.
Next, likelihoods are also multiplied with weights wM(M,pA) and wF(M,F,pA) listed in Tables ??, 0.4, and 0.6. For brevity, future references to likelihoods (nR,nA|f,M,F) assume that the prior probabilities pD, pI and weights wM(M,pA), wF(M,F,pA) are absorbed into the corresponding products.
The likelihoods (nR,nA|f,M,F) are functions of the fetal fraction f. The next step multiplies these likelihoods with the posterior distribution of fetal fraction values (f|{nR,nA}) obtained in the first step (not necessarily on the same set of SNP loci as the ones that are tested for copy-number variations). The resulting products are then integrated over the entire fetal fraction range, yielding prior distribution of allele counts (nR,nA|M,F) marginalized over fetal fractions and only dependent on maternal and fetal genotypes M and F, respectively.
The next step sums up the values (nR,nA|M,F) over all scenarios M, F, yielding prior count probabilities (nR,nA).
In the next step, scenario likelihoods (nR,nA|f,M,F) (including priors pD, pI and weights M, F) are multiplied by posterior distribution of fetal fractions (f|{nR,nA}) and divided by the count prior (nR,nA). Some embodiments omit division by the count prior. The resulting function of fetal fraction f is then integrated over the entire range of f. This integration yields the posterior probabilities of different genotype combinations M, F, one value per scenario.
Finally, the genotype scenario with highest posterior probability is selected as the option with most data support. If the most likely scenario has single fetal allele, the fetus has a deletion at that locus. If the fetal genotype F in the most likely scenario has three alleles, the procedure detected a duplication. Otherwise, the genomic region is diploid both in maternal and fetal genomes.
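The steps above can be sketched as a per-locus scenario classifier. Each scenario supplies a label, a CNV-state prior, a genotype weight, and callables mapping the fetal-fraction grid to Poisson means; this layout, and the omission of the count-prior division (permitted by the text), are assumptions of this example.

```python
import numpy as np
from math import lgamma

def log_pois(n, lam):
    """Elementwise Poisson log-pmf for scalar count n and array mean lam."""
    return n * np.log(lam) - lam - lgamma(n + 1)

def classify_locus(n_R, n_A, f_grid, f_posterior, scenarios):
    """Return the scenario label with the highest posterior mass."""
    df = f_grid[1] - f_grid[0]
    best_label, best_score = None, -np.inf
    for label, prior, weight, lam_R_fn, lam_A_fn in scenarios:
        lik = prior * weight * np.exp(log_pois(n_R, lam_R_fn(f_grid))
                                      + log_pois(n_A, lam_A_fn(f_grid)))
        score = np.sum(lik * f_posterior) * df  # marginalize over f
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A locus is then called deleted, duplicated, or diploid according to the fetal genotype in the winning scenario, as described in the final step above.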
R(F): Multiplier used to evaluate the mathematical expectation for the number of reference allele counts.
A(F): Multiplier used to evaluate the mathematical expectation for the number of alternate allele counts.
(nR, nA|F, P, M): Joint conditional probability of observing allele counts nR and nA, given transmission pattern F and parental allele assignments P and M.
(F): Likelihood for a region covering one or more SNP loci.
Background
In addition to whole chromosome aneuploidies and subchromosomal copy-number variation, single nucleotide polymorphisms can also predispose the offspring to inherited diseases. In particular, there are more than 10,000 known monogenic diseases, with the incidence at birth estimated to approach 1%. While medically important, prenatal detection of fetal monogenic diseases from ccfDNA is hampered by the overwhelming background of maternal DNA. Moreover, targeting multiple disease loci proves prohibitively expensive and difficult to scale for routine testing and/or screening purposes. A particularly promising approach takes advantage of parental carrier and phasing information, when available. Rather than targeting disease loci, the phasing approach targets a well-defined set of SNP loci with significant alternate allele frequency in the population. For example, Agilent's SureSelect Target Enrichment platform for Illumina Paired-End sequencing allows targeting ≈250,000 SNPs, with average distance between two neighboring SNPs on the order of 12.4 kb. At minor allele frequency close to 0.5, the number of paternally informative loci in a typical human genome exceeds 62,000. The average distance between two neighboring paternally informative loci is therefore close to 50 kb. With a similar number of maternally informative loci, the resolution of identifying disease-carrying haplotypes is close to 25 kb. If on average a typical human chromosome contains 1.6 recombination events (J. Wang, et al., Cell 150:402, 2012; R. Milo and R. Phillips (2016): Cell Biology by the Numbers, Garland Science, Taylor & Francis Group, LLC, New York), this resolution is sufficient to reliably detect transmission of parental monogenic diseases to the offspring without actually targeting the disease loci, thus greatly simplifying the workflow and reducing its cost.
Measuring allele counts at well characterized, not necessarily disease-related, SNP loci flanking the monogenic disease hotspots is sufficient to ascertain whether the offspring received the variant homologue from the carrier parent.
The current disclosure proposes a novel approach to establishing haplotype blocks inherited paternally and/or maternally when parental genomes are phased and offspring's allele frequencies are measured. Offspring allele frequencies can be quantified either from pure prenatal or neonatal sample (for example, single cell DNA from non-invasively harvested trophoblasts, obtained through cervical swab, Moser G, Drewlo S, Huppertz B, Armant DR (2018): Trophoblast retrieval and isolation from the cervix: origins of cervical trophoblasts and their potential value for risk assessment of ongoing pregnancies. Hum Reprod Update. 2018 Jul. 1; 24(4):484-496. doi: 10.1093/humupd/dmy008) or from a mixture of maternal and fetal DNA (as in ccfDNA). If parental carrier information is available, this novel approach enables accurate determination of whether the disease-carrying parental haplotype block(s) have been transmitted to the offspring. The same protocol is also applicable to in-vitro fertilization (IVF) quality control, based on sequencing of single cells biopsied from blastocysts (Kuznyetsov et al. (2018): Evaluation of a novel non-invasive preimplantation genetic screening approach. PLoS One. 2018; 13(5): e0197262. Published online 2018 May 10. doi: 10.1371/journal.pone.0197262).
Overview of the Proposed Algorithm
The proposed approach to haplotyping the offspring assumes that a plurality of genomic loci, such as biallelic single nucleotide polymorphisms (SNPs), can be targeted and their allele frequencies can be measured. These loci constitute an SNP panel. The approach also assumes that parental genotypes are known across the entire SNP panel. Further, the approach assumes that parental alleles have been phased and that haplotype block structure is available at least for the parent who is the disease carrier. In some embodiments, only one parent's haplotypes are known. In some embodiments, both parents' haplotypes are known. Finally, the approach assumes that offspring's allele frequencies can be quantified either by NGS or by microarray measurements, such as aCGH.
In summary, the proposed approach haplotypes the offspring by evaluating posterior probabilities (F|nR,nA,P,M) of parental haplotype transmission patterns F, given the measured fetal allele frequencies nR, nA and the known parental haplotypes P, M. The transmission patterns F represent a combination of offspring's genotypes and their possible parental origins. For example, in a diploid fetus (excluding uniparental disomy), the transmission pattern F can take four possible values: PH1.MH1, PH1.MH2, PH2.MH1, or PH2.MH2, where PHn and MHn are phased paternal and maternal alleles, respectively. Each transmitted parental allele can take values R or A (reference and alternate allele), depending on haplotypes found in the parents. The posterior probabilities (F|nR,nA,P,M) are evaluated per locus across the entire SNP panel. Next, a sequence of per-locus posterior probabilities corresponding to adjacent genomic loci is segmented into homogeneous fetal regions representing inherited haplotype blocks. Segmentation edges represent recombination sites. The haplotyping procedure infers transmission of disease based on whether the affected haplotype is passed from the carrier parent to the offspring. The procedure is also capable of detecting whole chromosome aneuploidies, subchromosomal copy number variations, as well as parental origin of the duplicated or missing fetal genomic segments. Combined with the procedures to measure fetal fraction (described in Section 206), this approach offers non-invasive prenatal detection of the full spectrum of monogenic diseases as well as copy number variations in a single sequencing run.
Transmission Patterns
The fetal transmission pattern F is a categorical variable that can take a value from a countably infinite discrete set. The number of practically relevant scenarios is small. In the case of the heterozygous deletion of the maternally inherited allele, the possible F values include PH1.None and PH2.None. If, in addition to the loss of the maternal allele, the paternally inherited haplotype is duplicated, the variable F takes the value PH1×2.None or PH2×2.None. When both paternal alleles are present, but the maternal allele is missing, F is assigned the value PH1+PH2.None. Similarly, a missing paternal allele allows the following F values: None.MH1, None.MH2, None.MH1×2, None.MH2×2, or None.MH1+MH2. Higher paternal copy numbers combined with missing maternal alleles, as well as the symmetrically configured scenarios, can also be modeled in the same fashion.
A euploid locus inherits one allele from the father and one from the mother, allowing the following four values for F: PH1.MH1, PH1.MH2, PH2.MH1, and PH2.MH2.
As the copy number increases, the number of possible transmission patterns F grows as well. For example, a subset of possible duplications of paternally inherited allele(s) include PH1×2.MH1, PH1×2.MH2, PH2×2.MH1, PH2×2.MH2, PH1+PH2.MH1, PH1+PH2.MH2. Symmetrical configurations PH1.MH1×2, PH1.MH2×2, PH2.MH1×2, PH2.MH2×2, PH1.MH1+MH2, and PH2.MH1+MH2 account for maternal duplications. Additional scenarios are presumably less likely as their complexity increases, but the model can include them when appropriate.
Poisson Expectations
In a biallelic SNP, any one of the four parental alleles (PH1, PH2, MH1, and MH2) can either match the reference allele R or the alternate allele A, leading to a number of combinatorial possibilities for allelic realization of the transmission pattern F. These possible realizations then determine the possible expectation values for measured allele counts nR and nA in a single-cell experiment or in a circulating cell-free DNA experiment. The expectations λR and λA in the prenatal single-cell measurements (such as those obtained on DNA from trophoblasts non-invasively obtained by cervical swab) are proportional to the total expected read count N at the given locus and to the multiplier R(F) (for reference allele) or A(F) (for alternate allele):
λR(F,N)=R(F)N (56)
λA(F,N)=A(F)N (57)
The multipliers R(F) and A(F) absorb the dependence of λR, λA on the fetal transmission pattern F. The model can also include additional factors as either multiplicative or additive modifiers, such as mappability bias m, GC bias G(g,L,Q,g0), drop-out rate, drop-out allele rate, and drop-in allele rate. N is a product of the total number of reads across all loci in the SNP panel, a factor representing a fraction of all reads expected to map to the locus in question, and a factor combining the above mentioned biases.
In circulating cell-free DNA measurements, the expected allele frequencies μR and μA must also account for the fetal fraction f. The proportionality factors θR(F,f) and θA(F,f) reflect the relative amount of fetal DNA in the sample, as well as the allelic realization of the fetal transmission pattern F.
μR(F,f,N)=θR(F,f)N (58)
μA(F,f,N)=θA(F,f)N (59)
Tables of Multiplier Values
Tables 0.8-0.19 list a subset of possible parental allele combinations, fetal inheritance patterns, scDNA multipliers R and A, and ccfDNA multipliers θR and θA. In the interest of space, these tables cover some of the more frequently encountered scenarios, while omitting the less likely ones, but can be straightforwardly extended to include additional possibilities.
Both scDNA multipliers R, A and ccfDNA multipliers θR, θA depend on actual allele assignments P and M to parental haplotypes PH1, PH2, MH1, and MH2 (or parental allele realizations). These dependencies are tabulated in Tables 0.8-0.19. However, to simplify equations and save space, the notation does not stress this dependence. For example, rather than explicitly writing out R(F,P,M), the tables and equations below simply refer to R(F).
Likelihoods
In single-cell experiments, the joint conditional probability of observing allele counts nR and nA given fetal transmission pattern F is modeled using a product
of two Poisson distributions, assuming independence of data generated by the two alleles:
The joint conditional probability (nR,nA|F,P,M) is also referred to as likelihood. P and M are particular allele assignments to transmitted parental alleles PH1, PH2, MH1, and MH2. The mathematical expectations λR(F) and λA(F) are
given by Eqs. 56-57. The multipliers R(F) and A(F) are listed in Tables 0.8-0.19, columns 7-8. The normalization constant Z is obtained by summation over the set Φ of all values F included in the model, with fixed paternal and maternal allele realizations P and M:
In a ccfDNA experiment, the likelihood (nR,nA|F,P,M) is evaluated differently to take into account the fetal fraction f. The scenario corresponding to fetal
transmission pattern F is modeled using a product of three terms: the distribution of fetal fractions and two Poisson probabilities with expectations μR(F,f,N) and μA(F,f,N) given by Eqs. 58-59. Finally, the three-term product is marginalized
over all possible fetal fraction values to factor out f, giving the following expression:
The multipliers θR(F,f) and θA(F,f) are listed in Tables 0.8-0.19, columns 9-10. Eq. 62 assumes that the processes of generating nR and nA are independent.
The normalization constant Ξ is obtained by summation over all values of F included in the model:
The posterior distribution (f|{nR,nA}) featuring in Eqs. 62-63 can be evaluated as described in Section 206. In some embodiments, fetal fraction prior (f) is used instead of (f|{nR,nA}) (see Section 206). In some embodiments, a single (most likely) point estimate of f is used, rather than integrating the distribution of
f over its entire range.
Summations in Eqs. 61 and 63 run over F values corresponding to a fixed set of paternal P and maternal M alleles actually transmitted to the offspring.
In some embodiments, the set Φ includes all values of transmission patterns F listed in Tables 0.8-0.19. In some embodiments, the set Φ is a non-trivial subset of values of transmission patterns F tabulated in Tables 0.8-0.19. In some embodiments, the set Φ is a non-trivial superset of values of transmission patterns F
tabulated in Tables 0.8-0.19. In some embodiments, the set Φ only includes diploid fetal transmission patterns F listed in Tables 0.14-0.15.
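As an illustration of the single-cell likelihood of Eqs. 60-61, the sketch below evaluates a Poisson-product likelihood for every transmission pattern F in the set Φ and normalizes by the constant Z. The `multipliers` mapping from each pattern to its (ρR, ρA) pair stands in for the tabulated values in Tables 0.8-0.19; the tiny floor avoiding log(0) for zero multipliers is an implementation choice of this example.

```python
from math import exp, lgamma, log

def log_pois_scalar(n, lam):
    """Poisson log-pmf for integer count n and mean lam."""
    return n * log(lam) - lam - lgamma(n + 1)

def pattern_posteriors(n_R, n_A, N, multipliers):
    """Normalized per-pattern likelihoods over the set Phi (Eqs. 60-61)."""
    raw = {F: exp(log_pois_scalar(n_R, max(r * N, 1e-12))
                  + log_pois_scalar(n_A, max(a * N, 1e-12)))
           for F, (r, a) in multipliers.items()}
    Z = sum(raw.values())                  # normalization constant Z
    return {F: v / Z for F, v in raw.items()}
```

The ccfDNA version (Eqs. 62-63) follows the same shape, with the Poisson means replaced by μR(F,f,N), μA(F,f,N) and an extra integration over the fetal fraction distribution.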
Most Likely Transmission Pattern Per Genomic Region
To determine the transmission pattern with most support from experimental data within a given genomic region, likelihoods from Eqs. 60 and 62 obtained for individual loci are combined together by multiplication or, equivalently but more conveniently, by summation of their logarithms. A region containing loci i∈1, . . . , K
characterized by known and phased parental alleles P(i) and M(i) and by measured fetal allele frequencies nR(i) and nA(i) is thus assigned the following set of
log-likelihood values:
The result is a set of likelihood (or log-likelihood) values for the entire genomic region, one value per F pattern from the set Φ. The most likely pattern F0 is the one with the highest likelihood:
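In code form, the regional selection of Eqs. 64-65 amounts to summing per-locus log-likelihoods and taking an argmax; the (num_loci, num_patterns) array layout below is an assumption of this example.

```python
import numpy as np

def most_likely_pattern(loglik, patterns):
    """Pick F0, the pattern with the highest regional log-likelihood.

    `loglik[i, j]` holds the log-likelihood of pattern `patterns[j]`
    at locus i; summing over loci yields one value per F in Phi.
    """
    regional = loglik.sum(axis=0)
    return patterns[int(np.argmax(regional))]
```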
Region Homogeneity
The estimate F0 may contradict some loci, where the locus-specific likelihood values lend more support to a scenario different from F0. If all loci in a region have likelihood distributions (nR(i),nA(i)|F(i),P(i),M(i)) such that none lends more support to F≠F0, the region is homogeneous. A homogeneous region contains a single haplotype block from each parent (in a diploid scenario).
Some loci may be uninformative, in which case (nR,nA|F≠F0,P,M) may be equal to (nR,nA|F0,P,M) without violating the homogeneity requirement. If at least one locus in a region lends more support to a transmission pattern F≠F0 (meaning that (nR,nA|F≠F0,P,M)>(nR,nA|F0,P,M)), the region is inhomogeneous. An inhomogeneous region contains two or more haplotype blocks inherited from one or both parents.
Region Segmentation
The goal of the analysis is to partition the fetal genome into homogeneous regions. Many algorithms to effectively partition a region are known to those skilled in the art, including but not limited to bisection, golden ratio section, and partitioning at the first or the last locus contradicting the selection of the most likely regional pattern F0. Various approaches differ in terms of time and memory complexity, but all eventually converge to a set of maximally sized homogeneous subregions, representing unbroken inherited haplotype blocks.
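The bisection variant can be sketched recursively. A region is taken as homogeneous when no locus lends strictly more support to a pattern other than the regional best pattern F0 (ties at uninformative loci are allowed); the per-locus log-likelihood array layout is an assumption of this example.

```python
import numpy as np

def segment(loglik, start=0, end=None):
    """Recursively bisect [start, end) into homogeneous subregions.

    `loglik[i, j]` is the log-likelihood of pattern j at locus i.
    Returns a list of (start, end) half-open index intervals.
    """
    if end is None:
        end = loglik.shape[0]
    if end - start <= 1:
        return [(start, end)]
    region = loglik[start:end]
    f0 = int(np.argmax(region.sum(axis=0)))    # regional best pattern F0
    if np.all(region[:, f0] >= region.max(axis=1) - 1e-12):
        return [(start, end)]                  # homogeneous region
    mid = (start + end) // 2                   # bisection
    return segment(loglik, start, mid) + segment(loglik, mid, end)
```

A full implementation would also merge adjacent subregions that share a parental haplotype, as discussed under recombination breakpoints.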
Recombination Breakpoints
In addition to segmentation, the determination of parental haplotype inheritance patterns also includes a merging step. If, for example, the most likely transmission patterns F0 in two neighboring homogeneous subregions are PH1.MH1 and PH1.MH2, it is likely that PH1 continues unbroken through the interval separating the two subregions and covers them both. In that case, the gap separating the two homogeneous subregions represents the maternal recombination event where M switches from MH1 to MH2.
Because of the distance between neighboring loci in the SNP panel and because of the unavoidable presence of uninformative loci, the intervals separating neighboring homogeneous subregions may be several kilobases long. The recombination breakpoints lie somewhere within these intervals, but the data are insufficient to indicate where exactly. To avoid overinterpreting the data, the final haplotype edge calls coincide with the first and the last informative loci within a homogeneous block. The intervals separating neighboring haplotypes represent uncertainties in the locations of recombination events. Detection of transmission of a monogenic disease associated with these uncertainty intervals requires additional information, such as a denser set of SNPs.
To validate algorithms intended for analyses of cancer CNV profiles, a list of human cell lines available from ATCC (web page https://www.atcc.org/search?title=ATCC%20Human%20Cell%20Lines%20-%20Alphanumeric#q=%40productline%3DC021&sort=relevancy&f:contentTypeFacetATCC=[Products]) was compiled and searched for copy number profile availability at Institute for Cancer Research (ICR, cf. https://cansarblack.icr.ac.uk/).
The copy number profiles of the resulting 438 cell lines were downloaded from ICR and used as templates for simulating whole genome resequencing data at different depths and GC bias levels. The simulated profiles were then used to validate the proposed algorithms and to characterize their performance.
The proposed method to estimate GC bias coefficients in cancerous CNV profiles using Fourier Coefficient Ratio is tested on 1,000 simulated raw count profiles based on the ICR ground truth for the cell line 22Rv-1. The performance of the Fourier Coefficient Ratio approach is compared to results obtained using Željko Džakula's earlier invention based on minimization of entropy.
For each cell, a raw count profile was initiated by uniformly sampling the scale between 50 and 100 reads per 50 kb bin (roughly equivalent to 3-6 million reads per cell). Next, to generate linear (L) and quadratic (Q) GC bias coefficients, the center of the two-dimensional (L,Q) distribution was randomly chosen. The mean linear coefficient was uniformly sampled between −4 and 4. The mean quadratic coefficient was then uniformly sampled between a lower bound, imposed by the previously selected mean linear coefficient value and the non-negativity condition, and an upper bound arbitrarily set to 6. Non-negativity is enforced for gL=0.01, gU=0.99, and g0=0.45. Once the center of the two-dimensional distribution of linear and quadratic GC coefficients is fixed, the (L,Q) distribution is defined as a two-dimensional Gaussian with standard deviations along both the L and Q dimensions set to 2 and the covariance fixed at 0.5. Actual GC coefficient values for each individual cell are sampled from this two-dimensional Gaussian, making sure that the non-negativity constraint is satisfied.
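The sampling of (L,Q) pairs under the non-negativity constraint can be sketched as follows. This is a minimal Python sketch, assuming the parabolic form G(g,L,Q,g0) = 1 + L(g−g0) + Q(g−g0)²; the function names and the rejection-sampling strategy are illustrative:

```python
import random

def gc_bias(g, L, Q, g0=0.45):
    # Assumed parabolic form of the GC bias function G(g, L, Q, g0).
    return 1.0 + L * (g - g0) + Q * (g - g0) ** 2

def nonnegative(L, Q, gL=0.01, gU=0.99, g0=0.45):
    # A parabola is non-negative on [gL, gU] iff it is non-negative at both
    # endpoints and at its vertex (when the vertex lies inside the interval).
    vals = [gc_bias(gL, L, Q, g0), gc_bias(gU, L, Q, g0)]
    if Q > 0:
        vertex = g0 - L / (2 * Q)
        if gL <= vertex <= gU:
            vals.append(gc_bias(vertex, L, Q, g0))
    return min(vals) >= 0

def sample_coefficients(mean_L, mean_Q, sd=2.0, rho=0.5 / 4):
    # Correlated two-dimensional Gaussian draw (covariance 0.5, i.e.
    # correlation 0.125 for sd = 2), resampled until non-negativity holds.
    while True:
        zL, z = random.gauss(0, 1), random.gauss(0, 1)
        zQ = rho * zL + (1 - rho ** 2) ** 0.5 * z
        L, Q = mean_L + sd * zL, mean_Q + sd * zQ
        if nonnegative(L, Q):
            return L, Q
```

Rejection sampling is the simplest way to enforce the constraint; it is efficient here because the Gaussian mass violating non-negativity is modest for permissible means.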
To simulate raw bin counts, GC content per bin values were evaluated based on human reference sequence hg19 using Bioconductor. Mappability values were extracted from the file wgEncodeCrgMapabilityAlign100mer.bigWig (downloaded from the UCSC Genome Browser, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/, on Mar. 16, 2020).
For each cell, raw counts per bin were sampled using a Poisson distribution centered at P·S·m·G(g,L,Q,g0). The ploidy values P are taken from the ICR ground truth profile. Scale S, linear GC coefficient L, and quadratic GC coefficient Q are sample-specific. Mappability m and GC content g are bin-specific and constant across samples. The constant g0=0.45 is arbitrarily chosen and fixed for all calculations.
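The count generation step can be sketched as below (an illustrative Python sketch; the parabolic form of G and the normal approximation used for large Poisson means are assumptions made for the sketch, not part of the specification):

```python
import math, random

def poisson(lam, rng=random):
    """Poisson sampler: Knuth's method for small means, normal
    approximation for large means (adequate for this simulation)."""
    if lam > 30:
        return max(0, round(rng.gauss(lam, lam ** 0.5)))
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_bin_counts(ploidy, scale, mapp, gc, L, Q, g0=0.45):
    """Raw count per bin ~ Poisson(P * S * m * G(g, L, Q, g0))."""
    counts = []
    for P, m, g in zip(ploidy, mapp, gc):
        G = 1.0 + L * (g - g0) + Q * (g - g0) ** 2  # assumed parabolic bias
        counts.append(poisson(P * scale * m * max(G, 0.0)))
    return counts
```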
Linear and quadratic GC coefficients are estimated using entropy and Fourier Coefficient Ratio as described above. The residual errors (estimated minus true coefficient values) observed when entropy is used are illustrated in
Tbl. 0.20 characterizes errors in linear and quadratic GC coefficient estimates obtained using Entropy (column 2) or Fourier Coefficient Ratio (column 3). Rows A-B list mean and median errors in linear GC coefficients. Rows C-D show standard deviation and MAD values for errors in linear coefficients. Rows E-F show mean and median errors in quadratic GC coefficients. Rows G-H list standard deviation and MAD of errors in quadratic coefficients. Finally, row I shows correlation coefficients between errors in linear and quadratic coefficients (also illustrated in
As a rule, the Fourier Coefficient Ratio makes smaller errors than the entropy-based approach when estimating linear GC bias coefficients, and FCR's errors are more tightly distributed. On average, errors in quadratic GC bias coefficient estimates are larger when the Fourier Coefficient Ratio is used, and their distribution is more widely spread. Both entropy and the Fourier Coefficient Ratio show correlation between errors in linear and quadratic coefficients. The correlation is weaker when entropy is used.
Once GC coefficients L and Q are known, the unevenness in raw bin counts can be reduced by dividing by the mappability m and by the GC bias function G(g,L,Q,g0). The effects of this normalization are illustrated on one cell with extremely high GC bias (L>5, Q>10).
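The normalization itself is a per-bin division; a minimal Python sketch (assuming the parabolic form of G; the function name is illustrative):

```python
def normalize(raw, mapp, gc, L, Q, g0=0.45):
    """Divide raw bin counts by mappability and by the GC bias parabola,
    leaving a profile proportional to ploidy * scale."""
    out = []
    for n, m, g in zip(raw, mapp, gc):
        G = 1.0 + L * (g - g0) + Q * (g - g0) ** 2  # assumed form of G
        out.append(n / (m * G))
    return out
```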
Finally,
These results demonstrate that Fourier Coefficient Ratio is capable of removing GC bias from CNV profiles at least as successfully as entropy.
To test the CNV profile normalization algorithm based on posterior ploidy probability, ICR copy-number profiles for 231 cell lines were used as starting templates for simulated raw bin counts and as ground truth when assessing performance. The number of simulated cells per cell line varied from 75-76 (183 cell lines) to 696-699 (26 cell lines) to 1,000 (21 cell lines), also including one cell line with 742 cells. The total number of cells used for validation is 53,776.
Simulated read counts used uniformly sampled scale values S that ranged between 50 and 100. GC coefficient values L and Q were randomly generated within the entire permissible V-shaped region spanning the linear GC coefficient values from −6.8 to 7.1 and quadratic coefficient values from −4.1 to 13.8 (cf.
Raw counts per 50 kb bin were generated using a Poisson distribution with the mean set to P·S·m·G(g,L,Q,g0) (P=bin ploidy, S=scale, m=mappability, g=GC content per bin, L=linear GC coefficient, Q=quadratic GC coefficient, g0=0.45, and G is a parabola describing the GC bias). Bins with mappability below 0.90, GC content lower than 0.3, or GC content above 0.7 were filtered out, leaving 52,437 useful bins (down from 61,927 total). The posterior probability for bin ploidies used a uniform count prior. The ploidy prior was set to a mixture of uniform (0.1%) and the ploidy posterior for the previous bin (99.9%), except for the first bin, where the ploidy prior was uniform. Final ploidy values were estimated from the locations of the maxima of the posterior ploidy distributions per bin.
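The sequential prior construction can be sketched as follows. This is a minimal Python sketch, assuming per-bin Poisson likelihoods with mean p·S·m·G (passed in as a precomputed per-bin value); the function name and ploidy cap are illustrative:

```python
import math

def poisson_loglik(n, lam):
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def posterior_ploidy_modes(counts, bin_means, eps=0.001, max_ploidy=8):
    """Per-bin posterior over integer ploidies 0..max_ploidy.  The ploidy
    prior for each bin mixes a uniform component (0.1%) with the previous
    bin's posterior (99.9%); the first bin uses a uniform prior.
    `bin_means` holds S*m*G(g,L,Q,g0) per bin, so the Poisson mean at
    ploidy p is p * bin_means[i]."""
    P = max_ploidy + 1
    prior = [1.0 / P] * P
    modes = []
    for n, mu in zip(counts, bin_means):
        post = [prior[p] * math.exp(poisson_loglik(n, max(p * mu, 1e-9)))
                for p in range(P)]
        z = sum(post)
        post = [x / z for x in post]
        modes.append(max(range(P), key=lambda p: post[p]))
        prior = [eps / P + (1 - eps) * q for q in post]
    return modes
```

The dominant previous-bin component makes the posterior sticky, suppressing single-bin noise, while the small uniform component lets the posterior jump at genuine copy-number transitions.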
To separate inaccuracies in GC coefficient estimation from performance inherent to posterior ploidy probability algorithm, the validation used true L and Q values, rather than estimates based on entropy or Fourier Coefficient Ratio. A separate study focuses on the impact of errors in L and Q on normalized profile accuracy.
Performance of the posterior ploidy probability mode estimates was quantified by counting bins where the location of the mode differs from ground truth ploidy by more than 0.5. The number of errors per cell decreases with increasing scale S (
The scaling procedure is validated using the same 231 cell lines (55,260 cells) previously used to validate normalization by posterior ploidy probability. The procedure was successful in 213 cell lines (45,879 cells,
Among the 18 failed cell lines, only a fraction of cells gives incorrect results (2,898 out of 9,381 cells total). Most failed cell lines are characterized by mean ploidy close to an even value (2 or 4), as illustrated by Table 0.21 and FIG. Among cell lines with mean ploidy close to 2, most were scaled correctly, and only a minority failed. The cell line DAUDI has mean ploidy close to 3.85, yet all its cells failed the scaling; the reason is that this cell line's copy number profile mostly consists of segments with even ploidy values, with a negligible number of bins at odd elevations.
As a rule, cell lines where scaling failed have very few bins that deviate from ploidy 2, as illustrated in
O. Reduction to Practice of the System and Method to Determine Posterior Probability of Fetal Ploidy and Fetal Fraction in ccfDNA
The proposed method to extract posterior probability of combined maternal and fetal ploidy from ccfDNA whole genome resequencing data was tested on simulated ccfDNA copy number profiles.
Since the dashed curve in the middle inset in
Any profile segmentation technique known to those skilled in the art, including but not limited to bisection, golden ratio section, partitioning at spikes of the differential profile, and simple bin-by-bin crawling can make use of DVR, entropy, or FCR as the homogeneity metric when detecting CNVs.
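The bin-by-bin crawling variant with a pluggable homogeneity metric can be sketched as below (an illustrative Python sketch; the window size, the callable-metric interface, and the simple mean-difference metric used in the test are assumptions, standing in for DVR, entropy, or FCR):

```python
def crawl_segment(profile, metric, threshold, window=10):
    """Bin-by-bin crawl: test each candidate breakpoint by comparing the
    current segment against the next `window` bins with a homogeneity
    metric (DVR, entropy, or FCR would be supplied as the callable here).
    Returns breakpoint indices, including 0 and len(profile)."""
    breaks = [0]
    for i in range(window, len(profile) - window):
        if i - breaks[-1] < window:
            continue  # do not create segments shorter than the window
        left = profile[breaks[-1]:i]
        right = profile[i:i + window]
        if metric(left, right) > threshold:
            breaks.append(i)
    return breaks + [len(profile)]
```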
Fourier Coefficient Ratio values obtained from pairwise differences of similar copy-number profiles follow a tight distribution centered near origin. Distributions of FCR values evaluated on pairwise differences of dissimilar profiles are shifted to the right from the origin. The extent of the shift depends on the amount of dissimilarity, as illustrated by
The proposed algorithms that use allele frequencies as inputs are validated using simulated data based on a set of 65 human biallelic SNP loci with high minor allele frequency. Table 0.22 shows the SNP panel used to generate allele counts. Genomic coordinates correspond to the human reference GRCh38. To validate the algorithm that quantifies contamination in a mixture of sample and impurity contributions, 1,000 data sets were simulated. The total number of reads per sample uniformly varies from 30,000 to 80,000. The amount of contaminating DNA uniformly varies from 0 to 0.25. In any sample, a subset of loci fails ("drops out"). For different samples, the fraction of failed loci follows a uniform distribution from 0 to 0.04. A given locus is assigned a capacity to absorb a certain fraction of the total number of reads in the sample (last column in Table 0.22, labeled "Depth Fraction"). The actual number of reads per locus is generated as a Poisson random variable with mean equal to the product of the total reads per sample and the depth fraction characteristic of the locus in question. Read counts are apportioned to the two alleles according to the relationships listed in . ??.
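The per-sample depth simulation can be sketched as follows (an illustrative Python sketch; function names are hypothetical, a normal approximation replaces the Poisson sampler for the large means involved, and the allele apportionment from the elided table is deliberately not reproduced):

```python
import random

def poisson_approx(lam, rng=random):
    # Normal approximation; adequate for the depths used here (mean >> 30).
    return max(0, round(rng.gauss(lam, lam ** 0.5)))

def simulate_sample_depths(depth_fractions, rng=random):
    """One simulated sample: total reads ~ U(30,000, 80,000), contamination
    ~ U(0, 0.25), dropout fraction ~ U(0, 0.04), and per-locus depth
    ~ Poisson(total * depth_fraction), with dropped-out loci set to zero."""
    total = rng.uniform(30_000, 80_000)
    contamination = rng.uniform(0.0, 0.25)
    drop_frac = rng.uniform(0.0, 0.04)
    depths = [0 if rng.random() < drop_frac else poisson_approx(total * d, rng)
              for d in depth_fractions]
    return contamination, depths
```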
Contamination level per sample was predicted from allele counts as the location of the maximum of the posterior probability ℙ(f|{nA,nR}) (cf. Eq. 40).
S. Reduction to Practice of the System and Method to Measure Fetal Fraction from Minor Allele Counts
The human SNP panel tabulated in . ?? was used to simulate 1,000 NIPT ccfDNA samples with fetal fraction uniformly sampled between 0 and 0.25. The total number of reads per sample was uniformly sampled between 30,000 and 80,000. Up to 4% of the 65 SNPs in the panel were randomly dropped out, with the fraction of failing loci uniformly varying from sample to sample between 0 and 4%. Actual read counts were generated using a Poisson distribution with the mean given by the product of the locus-specific "Depth Fraction" value (. ??) and the total read count per sample. Reads were apportioned to reference and alternate alleles according to . 0.3. Fetal fraction values f were estimated from simulated allele counts as the mean values of the posterior distributions ℙ(f|{nR,nA}) (cf. Eq. 54).
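The posterior-mean estimate of f can be sketched with a grid posterior. This is a minimal Python sketch, not the elided Eq. 54: it assumes, purely for illustration, loci where the mother is homozygous reference and the fetus heterozygous, so the expected alternate-allele fraction is f/2; the binomial likelihood and grid resolution are also assumptions:

```python
import math

def binom_loglik(k, n, p):
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def fetal_fraction_posterior_mean(loci, grid_size=251):
    """Posterior mean of f on a uniform grid over [0, 0.25].  Each locus is
    (nR, nA); the illustrative genotype configuration (mother hom-ref,
    fetus het) gives E[alt fraction] = f / 2."""
    grid = [0.25 * i / (grid_size - 1) for i in range(grid_size)]
    logpost = [sum(binom_loglik(nA, nR + nA, f / 2) for nR, nA in loci)
               for f in grid]
    m = max(logpost)
    w = [math.exp(l - m) for l in logpost]  # normalize in log space first
    z = sum(w)
    return sum(f * wi / z for f, wi in zip(grid, w))
```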
T. Reduction to Practice of the System and Method to Detect Fetal Copy Number Variation from Minor Allele Counts
The human SNP panel tabulated in . ?? was used to simulate 300 ccfDNA samples with fetal fraction values uniformly sampled between 0 and 0.25. The total numbers of reads per sample were uniformly sampled from 30,000 to 80,000. The fraction of failed loci per sample was uniformly sampled from 0 to 4%. In each sample, a different chromosome was randomly selected and the fetus was randomly assigned a maternal deletion, paternal deletion, maternal duplication, or paternal duplication of the entire selected chromosome. All other fetal and maternal chromosomes were diploid. Total read counts per locus were generated using a Poisson distribution with the mean given by the product of the locus-specific "Depth Fraction" value (. ??) and the total read count per sample. Reference and alternate allele counts were simulated as Poisson random variables with mean values given by . 0.7.
U. Reduction to Practice of the System and Method to Haplotype Fetal Genome from scDNA or ccfDNA Minor Allele Counts and Phased Parental Genomes
The haplotyping scDNA algorithm was validated using two tests. First, a single homogeneous fetal region was generated containing 10,000 SNP loci. A set of phased parental alleles was simulated for each locus, with reference and alternate alleles treated as equally likely. A euploid parental transmission pattern was randomly chosen and assigned to the fetal region. Fetal scDNA SNP allele counts were generated using Poisson distribution as appropriate for the chosen parental transmission pattern (cf. Tbl. 0.14). Total number of reads per locus was uniformly sampled between 100 and 1,000. These simulated allele counts were then used to predict the phase and the parental transmission pattern. Next, the prediction was compared with the ground truth. Also, outliers (if any) were counted and recorded. This procedure was repeated 100 times. None of the 100 simulated data sets showed any discrepancies: predicted parental transmission patterns matched the ground truth in all informative loci. The uninformative loci (such as those where both parents are homozygous for the same allele) did not contradict predictions in any of the 100 simulated sets.
The second test generates a fetal genomic segment consisting of three homogeneous subregions. Each homogeneous subregion contains a variable number of SNPs; the actual number of SNP loci per homogeneous subregion is uniformly sampled between 40 and 100. Parental alleles are randomly assigned, again assuming equally probable reference and alternate alleles. Each homogeneous fetal subregion is randomly assigned a parental transmission pattern, with two neighboring subregions prevented from having the same pattern. Fetal allele counts are simulated using a Poisson distribution with the total number of reads per locus randomly sampled between 100 and 1,000. Reference and alternate allele counts have expectation values as specified in Tbl. 0.14 for the scDNA case. Finally, based on the simulated allele counts, the fetal genome is segmented into homogeneous subregions and each subregion is assigned a predicted parental transmission pattern. Predictions are then compared with the ground truth and any discrepancies recorded. The procedure was repeated 100 times. Discrepancies were only observed in uninformative loci adjacent to recombination sites, sandwiched between the two informative loci closest to the recombination site. Otherwise, all predicted phasing blocks and their parental transmission patterns matched the ground truth, and all predicted recombination sites were located within the uncertainty regions due to uninformative loci.
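The core of the first test, recovering the transmission pattern of a single homogeneous region, can be sketched as below. This is an illustrative Python sketch, not the elided Tbl. 0.14: it assumes the diploid scDNA expectation that the alternate-allele fraction at a locus is the mean of the two transmitted alleles, uses a binomial likelihood, and clamps expected fractions to allow for sequencing error; all names are hypothetical:

```python
import math, random

PATTERNS = ["PH1.MH1", "PH1.MH2", "PH2.MH1", "PH2.MH2"]

def expected_alt(pattern, paternal, maternal):
    # Diploid scDNA: the expected alternate-allele fraction at a locus is
    # the mean of the two transmitted alleles (0 = reference, 1 = alternate).
    p = paternal[int(pattern[2]) - 1]
    m = maternal[int(pattern[6]) - 1]
    return (p + m) / 2

def most_likely_pattern(loci):
    """loci: list of (paternal_pair, maternal_pair, nR, nA).  Returns the
    transmission pattern F0 with the highest summed log-likelihood."""
    def loglik(pattern):
        total = 0.0
        for pa, ma, nR, nA in loci:
            q = min(max(expected_alt(pattern, pa, ma), 0.01), 0.99)
            total += nA * math.log(q) + nR * math.log(1 - q)
        return total
    return max(PATTERNS, key=loglik)
```

Uninformative loci (both parental haplotypes identical) contribute equally to every pattern and so cannot flip the result, mirroring the behavior described above.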
This application claims the benefit of U.S. Provisional Patent Application No. 63/012,256, filed Apr. 19, 2020, which is incorporated by reference herein in its entirety.