METHOD FOR EVALUATING CONSENSUS BASE ERROR RATE AND A SYSTEM THEREOF

FIELD OF THE INVENTION

The present invention is related to a technical field of DNA sequencing data analysis, and particularly related to an evaluation method of a consensus base error rate.

BACKGROUND OF THE INVENTION

Detecting rare mutations is promising for personalized cancer diagnosis and treatment. While somatic variant calling guides the targeted therapy, detecting circulating tumor DNA serves as a non-invasive and cost-effective mean for characterizing cancer progression and monitoring the treatment response. Adopting these approaches, however, is challenging when the variant allele frequency (VAF) is comparable or even lower than the frequency of sequencing errors and other technical artifacts.

To address the challenge, molecular barcodes (MBCs) or unique molecular identifiers (UMIs) have been used and shown to effectively reduce sequencing errors. In UMI sequencing, each DNA template is first labeled by a unique UMI. The labeled templates are then amplified via polymerase chain reaction (PCR) and the resulting amplicons are randomly sampled for sequencing. By taking consensus of the sequenced amplicons, occasional sequencing errors can be eliminated.

However, PCR errors can be introduced during amplification, leading to error bases in a good portion of final amplicons when the errors arise in the first few cycles.

Such error bases can dominate in an UMI group especially when only few amplicons are sampled for sequencing (called small UMI size here). In real application, UMIs of a small size usually constitute majority of the data. When there are only 2 to 3 DNA sequences with the same UMI label, taking consensus may not eliminate PCR and/or sequencing errors if exist. Therefore, considering both PCR and sequencing errors is essential in the analysis of UMI sequencing data.

Several computational tools have been proposed to call variants using UMI data, e.g., DeepSNVMiner, smCounter, MAGERI, smCounter2, and UMI-VarCal. However, all these tools apply a heuristic approach to handle PCR errors. For example, smCounter estimates quality of a called variant using a joint probabilistic model of PCR and sequencing errors. The latter is calculated based on sequencing quality while a hand-waving heuristic approximation is used for the former. On the other hand, MAGERI only treats UMIs of size ≥5 and assumes all sequencing errors are eliminated in the consensus sequences. This approach discards majority of the data and the absence of sequencing errors cannot be fully guaranteed.

To handle PCR errors in the consensus sequences, MAGERI considers that error rates of a consensus base follow a beta distribution. Parameters of this phenomenological model are obtained via fitting to UMI training data of a known DNA template. This phenomenological model is later adopted in smCounter2. To utilize UMI data, smCounter2 builds a second beta-distribution model for UMIs of size <5, but that is still phenomenological and may not describe sequencing errors accurately. More importantly, these phenomenological models cannot be used to estimate quality of each UMI consensus base. Therefore, it is not appropriate to analyze UMI consensus sequences using conventional variant callers, as done by fgbio (http://fulcrumgenomics.github.io/fgbio/) or Mutect2 which requires quality information of each consensus base.

Additionally, the rule-based approach DeepSNVMiner is also unable to reveal PCR errors or sequencing errors, leading to low-confidence of the mutation detection. Besides, using UMI-VarCal to evaluate confidence of mutation detection is not accurate, as UMI-VarCal is also a rule-based method. The limitations of these approaches result from a lack of solid statistical framework to describe PCR and sequencing errors during UMI sequencing.

Chinese Pat. No. CN112687339B, entitled “Method and device for counting sequence errors in plasma DNA fragment sequencing data”, discloses a method and a device for counting sequence errors in plasma DNA fragment sequencing data.

Reading true UMI in a library building process, and taking the true UMI as a UMI reference sequence set; counting the times of reads UMI errors of all reads cycles in sequencing data and the proportion of error bases to current cycles bases; and counting the sequence error information of the reads of the complete testing template, and counting the error rate of each cycle by identifying the 5′ end UMI sequence and the 3′ end UMI sequence of the reads of the complete testing template. The cycle error rate counted by the method can be used for evaluating the data quality, and can also provide data correction bases and quality values of the bases in the subsequent clustering error correction process, so that prior probability is provided for clustering error correction and quality value correction, and the accuracy of low-frequency mutation detection is greatly improved; in this disclosure, PCR amplification error rate and sequencing error rate are integrated as a reference of base correction, but the error rate is determined post template sequencing, where sequence acquisition and alignment are conducted among distinct nucleic acid fragments. Such a method lacks any means to judge the true and false of the consensus base, resulting in difficulties of illustrating consensus base error rate and assigning quality score to each thereof. Moreover, such method is unattainable without UMI sequence library and sequence alignment.

SUMMARY OF THE INVENTION

Low-frequency mutation detection is the foundation of precision medicine and personalized diagnosis, and development of this field will be profoundly influenced by evaluating the error rate of consensus bases especially when the mutation frequency is lower than sequencing error rate. With existing technologies, molecular barcodes are widely used to eliminate sequencing errors. Molecular barcodes with unique sequence are labeled to distinct nucleic acid fragments, and the DNA fragments carrying the molecular barcodes are subjected to PCR amplification prior to sequencing of the randomly sampled amplicons. Taking consensus of the amplicon sequences with the same UMI can eliminate sequencing errors in the amplicon sequences.

Nonetheless, certain PCR errors are inevitable during nucleic acid sequence amplification, which requires additional processing. Current computational tools are capable of handling molecular barcode data, but their evaluations of PCR error are based on empirical rules or phenomenological model, thus is unable to provide a quality score for each individual consensus base. This escalates uncertainty and inconvenience of mutant detection. Such limitation confronted by the current technologies results from a lack of a statistical framework to describe errors during PCR amplification and sequencing process simultaneously.

Accordingly, to provide a solution for issues of the current technologies, the present invention discloses a statistical framework for quantifying PCR errors and sequencing errors. The model facilitates quality evaluation of consensus bases using molecular barcodes, and assigns a quality score to each consensus base for determining confidence of the consensus base. The computational foundation of the aforementioned framework involves the nature of PCR process and sequencing quality information. The framework can also be applied to evaluate PCR error rate of each data set and provides a quality score for each consensus base.

Concretely, provided in the first aspect of the present invention is a method for evaluating consensus base error rate, comprising: selecting one or multiple amplified bases from an amplicon pool, wherein the amplicon pool is amplified from an unknown base N; sequencing a j^thbase of the selected amplified bases to be an observed base s_j, and pooling the observed base s_jinto an observed base collection S, wherein j is any one of positive integers; calculating a consensus base c(S) according to the observed base collection S; and calculating a consensus base error rate Pc(S), wherein the consensus base error rate Pc(S) is a probability that the consensus base c(S) is distinct from the unknown base N, and wherein: the consensus base error rate Pc(S) is a sum of probabilities that the unknown base N from which the observed base collection S is derived is identical to an assumed base b, but the assumed base b is distinct from the consensus base c(S), and wherein the assumed base b is selected from a group consisting of adenine (A), thymine (T), guanine (G) and cytosine (C), and the assumed base b is distinct from the consensus base c(S).

Preferably, the consensus base error rate Pc(S) is calculated through the following equation:

$P_{c (S)} = P (N \neq c (S) | S) = \sum_{b \neq c (S)} P (N = b | S) = \frac{\sum_{b \neq c (S)} P (S | N = b)}{\sum_{b = {A, C, G, T}} P (S | N = b)};$

wherein, the P(S|N=b) is a probability of the observed base collection S being randomly selected from the amplicon pool if the unknown base N is identical to the assumed base b.

Preferably, the P(S|N=b) is calculated by a method comprising: defining a fraction of false bases F_R, wherein the fraction of false bases F_Ris a ratio of an amplified false base to the amplified bases in the amplicon pool, and the amplified false base is a base distinct from the assumed base b; and calculating the P(S|N=b) through the following equation:

$P (S | N = b) = \sum_{F_{R}} P (F_{R}) \cdot P (S | N = {b∩F}_{R} = F_{R});$

wherein the P(S|N=b ∩F_R=F_R) is a sum product of P(s_j|N=b ∩F_R=F_R) of each observed base s_jif the unknown base N is identical to the assumed base b and a ratio of the amplified false base in the amplicon pool is F_R.

More preferably, if the observed base s_jis identical to the assumed base b, the P(s_j|N=b ∩F_R=F_R) is equal to a fraction of true bases F_R′; if the observed base s_jis distinct from the assumed base b, the P(s_j|N=b ∩F_R=F_R) is equal to the fraction of false bases F_R, and wherein a sum of the fraction of true bases F_R′ and the fraction of false bases F_Ris 1.

More preferably, the method further comprises: modifying the P(s_j|N=b ∩F_R=F_R) to a P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) based on a sequencing accuracy rate P_seq1, a sequencing error rate P_seq2and an amplified base ratio F_b″≠b, wherein the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) is a probability that the observed base s_jis identical to another observed base b′ if the unknown base N is identical to the assumed base b and a ratio of one amplified base b″ distinct from the assumed base b in the amplicon pool is F_b″≠b, wherein: if the amplified base b″ is distinct from the assumed base b, the ratio of the amplified base b″ is equal to the fraction of false bases F_R, where the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) is modified to satisfy the following equation:

$P (s_{j} = b^{'} | N = b \cap {F_{b^{″} \neq b}} = {F_{b^{″} \neq b}}) = {\begin{matrix} P_{pcr 1} \times P_{seq 1} + P_{pcr 2} \times P_{seq 2} & b^{'} = b \\ P_{pcr 3} \times P_{seq 1} + P_{pcr 4} \times P_{seq 2} + P_{pcr 5} \times P_{seq 2} & b^{'} \neq b \end{matrix};$

wherein, if the another observed base b′ is identical to the assumed base b, P_per1represents a probability that the assumed base b is amplified into a base identical to the another observed base b′, and P_per2represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b;

- wherein, if the another observed base b′ is distinct from the assumed base b, P_per3represents a probability that the assumed base b is amplified into a base identical to the another observed base b′, P_per4represents a probability that the assumed base b is amplified into a base identical to the assumed base b, and P_per5represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b and also distinct from the another observed base b′;
- wherein, the sequencing accuracy rate P_seq1represents a probability that the another observed base b′ is correctly sequenced, and the sequencing error rate P_seq2represents a probability that the amplified base b″ is falsely sequenced as the another observed base b′.

More preferably, the method further comprises: calculating the sequencing accuracy rate P_seq1and the sequencing error rate P_seq2based on a sequencing quality score Q_iof the another observed base b′, wherein: the sequencing accuracy rate P_seq1is calculated through the following equation:

$P_{seq 1} = 1 - 1 0^{- Q_{i} / 10};$

the sequencing error rate P_seq2is calculated through the following equation:

$P_{seq 2} = \frac{1 0^{- Q_{i} / 1 0}}{3} .$

More preferably, the observed base collection S comprises d observed bases s_j, wherein d observed bases s_jcomprise n_bbases identical to the assumed base b and (d-n_b) the amplified false bases distinct from the assumed base b, and wherein d is the maximum integer value of j, and n_bis a positive integer less than or equal to d;

- if the observed base s_jis distinct from the assumed base b, the P(s_j|N=b ∩F_R=F_R) is equal to the fraction of false bases F_R, and a sum product of the P(s_j|N=b ∩F_R=F_R) is calculated through the following equation:

$\prod_{j = 1}^{d} P (s_{j} ❘ N = b ⋂ F_{R} = F_{R}) = {(1 - F_{R})}^{n_{b}} F_{R}^{d - n_{b}} .$

More preferably, the P(S|N=b) is obtained by moment approximation through the following equation:

$P (S ❘ N = b) = \sum_{k = 0}^{n_{b}} {C_{k}^{n_{b}} (- 1)}^{k} M_{(k + d - n_{b})};$

- wherein the M_(k+d−n_b₎is a (k+d-n_b)^thmoment among a probability distribution of the fraction of false bases F_R, where k is any one of non-negative integer less than or equal to n_b;
- wherein an approximated value of the M_(k+d-n_b⁾is r/2^k+d−nb, where r represents a probability of polymerase chain reaction error occurrence at a gene locus corresponding to the unknown base N in each cycle.

Preferably, the method further comprises: retrieving a template sequence and labeling a molecular barcode thereto, and producing the amplicon pool by conducting a polymerase chain reaction based on the template sequence; and establishing the observed base collection S by sequencing bases randomly sampled from the amplicon pool, wherein the molecular barcode comprises a nucleic acid capture sequence or a unique molecular identifier sequence (UMI).

Provided in another aspect of the present invention is a system for evaluating consensus base error rate, comprising a polymerase chain reaction quality evaluation module configured to execute any one of the aforementioned methods so as to calculate a consensus base error rate P_c(S).

Preferably, the system further comprises a sequencing quality evaluation module, being signally connected to the polymerase chain reaction quality evaluation module and configured to read a sequencing quality score Q_iof another observed base b′ so as to calculate a sequencing accuracy rate P_seq1and a sequencing error rate P_seq2, wherein:

- the sequencing accuracy rate P_seq1is calculated through the following equation:

$P_{seq 1} = 1 - 10^{- Q_{i} / 10};$

- the sequencing error rate P_seq2is calculated through the following equation:

$P_{seq 2} = \frac{10^{- Q_{i} / 10}}{3};$

- a consensus base error rate Pc(S) modifying module, configured in the polymerase chain reaction quality evaluation module and being signally connected to the sequencing quality evaluation module so as to modify the P(s_j|N=b ∩F_R=F_R)to be P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠_b}) based on the sequencing accuracy rate P_seq1, the sequencing error rate P_seq2and an amplified base ratio F_b″≠b, where the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) represents a probability that the observed base s_jis identical to the another observed base b′ if the unknown base N is identical to the assumed base b, and a ratio of an amplified base b″ distinct from the assumed base b in the amplicon pool is F_b″≠b, and wherein:
- if the amplified base b″ is distinct from the assumed base b, and the ratio of the amplified base F_b″≠bis equal to the fraction of amplified bases F_R, the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) is modified to satisfy the following equation:

$P (s_{j} = b^{'} ❘ N = b ⋂ {F_{b^{″} \neq b}} = {F_{b^{″} \neq b}}) = {\begin{matrix} P_{pcr 1} \times P_{seq 1} + P_{pcr 2} \times P_{seq 2} & b^{'} = b \\ P_{pcr 3} \times P_{seq 1} + P_{pcr 4} \times P_{seq 2} + P_{pcr 5} \times P_{seq 2} & b^{'} \neq b \end{matrix};$

- wherein, if the another observed base b′ is identical to the assumed base b, P_per1represents a probability that the assumed base b is amplified into a base identical to the another observed base b′, and P_per2represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b;
- wherein, if the another observed base b′ is distinct from the assumed base b, P_per3represents a probability that the assumed base b is amplified into the another observed base b′, P_per4represents a probability that the assumed base b is amplified into a base identical to the assumed base b, and P_per5represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b and also distinct from the another observed base b′;
- wherein, the sequencing accuracy rate P_seq1represents a probability that the another observed base b′ is correctly sequenced, and the sequencing error rate P_seq2represents a probability that the amplified base b″ is falsely sequenced as the another observed base b′.

Provided in yet one another aspect of the present invention is a device for evaluating consensus base error rate, comprising a memory and a processor, wherein the memory is configured to store a program, and the processor is configured to execute the program so as to implement the aforementioned method for evaluating consensus base error rate.

The technical solutions provided in the present invention facilitate reflecting accurate error rates in detection of low-frequency mutation by calculating a quality score specific to each one of consensus bases. With regard of the quality score of the consensus base, expectedly, low-frequency mutation shall be detected with improved accuracy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a flowchart illustrating the method provided in the first aspect of the present invention;

FIG. 1B is a flowchart illustrating the method provided in the second aspect of the present invention;

FIG. 1C is a probability distribution diagram illustrating the probability distribution of false error bases F₁₀after a 10-cycle PCR amplification under PCR error rate of 0.01;

FIG. 2 is a probability distribution diagram illustrating the agreement of theoretical values and approximated values of error rates in different amplification cycle;

FIG. 3A is a block chart illustrating the system for evaluating consensus base error rate in the third aspect of the present invention;

FIG. 3B is a block chart illustrating the system for evaluating consensus base error rate in the fourth aspect of the present invention;

FIG. 4 is a block chart illustrating the device for evaluating consensus base error rate in the fifth aspect of the present invention; and

FIGS. 5A to 5B are probability distribution diagrams illustrating the agreement of observed and expected frequencies of consensus base quality in Example 4.

DETAILED DESCRIPTION OF THE INVENTION

The following section provides a more detailed explanation of the present invention by combining specific embodiments with diagrams. The detailed descriptions in the embodiments aim to facilitate a better understanding of the invention. However, those skilled in the art will readily appreciate that certain technical features may be omitted in different circumstances or substituted with other components, materials, or methods. In some cases, certain operations related to the invention are not explicitly shown or described in the specification to avoid overshadowing the core aspects of the invention with excessive details. For those skilled in the art, a comprehensive understanding of these related operations can be readily achieved based on the descriptions in the specification and common technical knowledge in the field.

As shown in FIG. 1A, provided in the first aspect of the present invention is a method for evaluating consensus error rate, comprising:

- amplified base selection process (S1): selecting one or multiple amplified bases from an amplicon pool, wherein the amplicon pool is amplified from an unknown base N;
- observed base collection establishment process (S2): sequencing a j^thbase of the selected amplified bases to be an observed base s_j, and pooling the observed base s_jinto an observed base collection S, wherein j is any one of positive integers;
- consensus base c(S) calculation process (S3): calculating a consensus base c(S) according to the observed base collection S; and
- consensus base error rate P_c(S)calculation process (S4): calculating a consensus base error rate P_c(S), wherein the consensus base error rate P_c(S)is a probability that the consensus base c(S) is distinct from the unknown base N, and wherein:
- the consensus base error rate P_c(S)is a sum of probabilities that the unknown base N from which the observed base collection S is derived is identical to an assumed base b, but the assumed base b is distinct from the consensus base c(S), and wherein the assumed base b is selected from a group consisting of adenine (A), thymine (T), guanine (G) and cytosine (C), and the assumed base b is distinct from the consensus base c(S).

The term “consensus base c(S)” used herein refers to the observed base s_jwith the highest occurrence frequency in the observed base collection S. For instance, in an observed base collection S having 6 observed base s_j, wherein the observed bases s_jare A, A, A, T, T and A, respectively, the consensus base c(S) in this observed base collection S is A whose occurrence frequency is ⅔ and larger than that of T. Accordingly, there are occurrences of non-consensus base c(S)′ among the amplified bases derived from the unknown base N. Regardless of sequencing error, the non-consensus base c(S)′ is caused by an error in polymerase chain reaction. Therefore, in the first aspect of the present invention, probability that the consensus base c(S) is not the assumed base, namely the error rate of consensus base c(S), is evaluated to serve as reference of judging confidence of the consensus base c(S).

In the consensus base error rate P_c(S)calculation process (S4), the consensus base error rate P_c(S)is calculated through the following equation:

$P_{c (S)} = P (N \neq c (S) ❘ S) = \sum_{b \neq c (S)} P (N = b ❘ S) = \frac{\sum_{b \neq c (S)} P (S ❘ N = b)}{\sum_{b = {A, C, G, T}} P (S ❘ N = b)};$

- wherein, the P(S|N=b) is a probability of the observed base collection S being randomly selected from the amplicon pool if the unknown base N is identical to the assumed base b.

In the consensus base error rate P_c(S)calculation process (S4), the P(S|N=b) is calculated by a method comprising: defining a fraction of false bases F_R, wherein the fraction of false bases F_Ris a ratio of an amplified false base to the amplified bases in the amplicon pool, and the amplified false base is a base distinct from the assumed base b; and calculating the P(S|N=b) through the following equation:

$P (S ❘ N = b) = \sum_{F_{R}} P (F_{R}) \cdot P (S ❘ N = b ⋂ F_{R} = F_{R});$

- wherein the P(S|N=b ∩F_R=F_R) is a sum product of P(s_j|N=b ∩F_R=F_R) of each observed base s_jif the unknown base N is identical to the assumed base b and a ratio of the amplified false bases in the amplicon pool is F_R.

In various embodiments, if the observed base s_jis identical to the assumed base b, the P(s_j|N=b ∩F_R=F_R) is equal to a fraction of true bases F_R′; if the observed base s_jis distinct from the assumed base b, the P(s_j|N=b ∩F_R=F_R) is equal to the fraction of false bases F_R, and wherein a sum of the fraction of true bases F_R′ and the fraction of false bases F_Ris 1.

In various embodiments, the observed base collection S comprises d observed bases s_j, wherein d observed bases s_jcomprise n_bbases identical to the assumed base b and (d-n_b) the amplified false bases distinct from the assumed base b, and wherein d is the maximum integer value of j, and n_bis a positive integer less than or equal to d;

- if the observed base s_jis distinct from the assumed base b, the P(s_j|N=b ∩F_R=F_R) is equal to the fraction of false bases F_R, and a sum product of the P(s_j|N=b ∩F_R=F_R) is calculated through the following equation:

$\prod_{j = 1}^{d} P (s_{j} ❘ N = b ⋂ F_{R} = F_{R}) = {(1 - F_{R})}^{n_{b}} F_{R}^{d - n_{b}} .$

In various embodiments, calculation of the P(S|N=b) is further expanded as followings:

$\begin{matrix} P (S ❘ N = b) = \sum_{F_{R}} P (F_{R}) \cdot P (S ❘ N = b ⋂ F_{R} = F_{R}) \\ ≅ \sum_{F_{R}} P (F_{R}) \cdot \prod_{j = 1}^{d} P (s_{j} ❘ N = b ⋂ F_{R} = F_{R}) \\ = \sum_{F_{R}} P (F_{R}) \cdot {(1 - F_{R})}^{n_{b}} \cdot F_{R}^{d - n_{b}} \end{matrix}$

- wherein P(F_R)·F_R^d-n^brepresents the moment of a probability distribution of the fraction of false bases F_Rin the observed base collection S. Since the amplicon pool is established by one or multiple cycles of amplification, calculation of the moment will be extremely intensive with a growing number of amplification cycles. Principally, a value of the moment P(F_R)·F_R^d-n^bcan be calculated by approximation to reduce the intensity of calculations if only considering the first one amplification error. As such, approximation of the moment can be done as followings:

$\begin{matrix} P (S ❘ N = b) = \sum_{F_{R}} P (F_{R}) \cdot {(1 - F_{R})}^{n_{b}} \cdot F_{R}^{d - n_{b}} \\ = \sum_{F_{R}} P (F_{R}) \cdot \sum_{k = 0}^{n_{b}} {C_{k}^{n_{b}} (- 1)}^{k} F_{R}^{k + d - n_{b}} \\ = \sum_{k = 0}^{n_{b}} {C_{k}^{n_{b}} (- 1)}^{k} M_{(k + d - n_{b})} \end{matrix};$

- wherein the M_(k+d-n_b₎is (k+d-n_b)^thmoment among a probability distribution of the fraction of false bases F_R, where k is any one of non-negative integer less than or equal to n_b;
- wherein an approximated value of the M_(k+d-n_b₎is r/2^k+d−nb, where r represents a probability of polymerase chain reaction error occurrence at a gene locus corresponding to the unknown base N in each cycle.

With reference to FIG. 1B, provided in the second aspect of the present invention is a method for evaluating consensus error rate P_c(S). The steps and computational process of the method are overall the same as those in the first aspect, but the method includes sequencing error rates after amplification cycles are completed in addition to calculation of error rate during amplification cycles of the unknown base N.

To further elaborate, in an observed base collection S having 6 observed bases s_jrepresenting A, A, A, T, T and A, the consensus base c(S) is A. The non-consensus base c(S) occurs in the observed base collection S, among the amplified bases derived from the unknown base N, may be a base distinct from the assumed base, such as T, or may be a base identical to the assumed base, such as A, being sequenced as T resulting from a sequencing error, or may be a base distinct from the assumed base and distinct from the non-consensus base C(S)′, such as C or G, being sequenced as T resulting from a sequence error. Considering such situations, a polymerase chain reaction error rate and a sequencing error rate are integrated into calculation of the consensus base error rate P_c(S)to serve as a reference of judging the confidence of the consensus base c(S).

Accordingly, in the second aspect of the present invention, the method further comprises:

- consensus base error rate P_c(S)modification process (S4-1): modifying the P(s_j|N=b ∩F_R=F_R) to a P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) based on a sequencing accuracy rate P_seq1, a sequencing error rate P_seq2and an amplified base ratio F_b″≠b, wherein the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) is a probability that the observed base s_jis identical to another observed base b′ if the unknown base N is identical to the assumed base b and a ratio of one amplified base b″ distinct from the assumed base b in the amplicon pool is F_b″≠b, wherein:
- if the amplified base b″ is distinct from the assumed base b, the ratio of the amplified base b″ is equal to the fraction of false bases F_R, where the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) is modified to satisfy the following equation:

$\begin{matrix} P (s_{j} = b^{'} ❘ N = b ⋂ {F_{b^{″} \neq b}} = {F_{b^{″} \neq b}}) \\ = {\begin{matrix} P_{pcr 1} \times P_{seq 1} + P_{pcr 2} \times P_{seq 2} & b^{'} = b \\ \begin{matrix} P_{pcr 3} \times P_{seq 1} + P_{pcr 4} \times P_{seq 2} + \\ P_{pcr 5} \times P_{seq 2} \end{matrix} & b^{'} \neq b \end{matrix} \end{matrix};$

- wherein, if the another observed base b′ is identical to the assumed base b, P_per1represents a probability that the assumed base b is amplified into a base identical to the another observed base b′, and P_per2represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b;
- wherein, if the another observed base b′ is distinct from the assumed base b, P_per3represents a probability that the assumed base b is amplified into a base identical to the another observed base b′, P_per4represents a probability that the assumed base b is amplified into a base identical to the assumed base b, and P_per5represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b and also distinct from the another observed base b′;
- wherein, the sequencing accuracy rate P_seq1represents a probability that the another observed base b′ is correctly sequenced, and the sequencing error rate P_seq2represents a probability that the amplified base b″ is falsely sequenced as the another observed base b′.

In preferred embodiments, the method further comprises: sequencing accuracy rate and sequencing error rate calculation process (S4a): calculating the sequencing accuracy rate P_seq1and the sequencing error rate P_seq2based on a sequencing quality score Q_iof the another observed base b′, wherein: the sequencing accuracy rate P_seq1is calculated through the following equation:

$P_{seq 1} = 1 - 10^{- Q_{i} / 10};$

- the sequencing error rate P_seq2is calculated through the following equation:

$P_{seq 2} = \frac{10^{- Q_{i} / 10}}{3} .$

In various embodiments, further with reference to FIGS. 1A to 1B, before selection of the amplified bases, the method further comprises an amplicon pool establish process (Sla), comprising:

- retrieving a template sequence and ligating a molecular barcode thereto, and producing the amplicon pool by conducting a polymerase chain reaction based on the template sequence; and
- establishing the observed base collection S by sequencing bases randomly sampled from the amplicon pool, wherein the molecular barcode comprises a nucleic acid capture sequence or a unique molecular identifier sequence (UMI), but not limited to this.

The following provides a further explanation of the specific derivations underlying the computational basis in the various embodiments of the present invention. First of all, it is assumed that after an ideal PCR cycle, each DNA template is duplicated, and the new copy has a sequence identical to the DNA template. While each nucleotide in the single-strand template remains intact, PCR error may occur at a locus in the new copy, resulting in a distinct base, thereby creating distinct bases in the subsequent PCR cycle.

Assumed that after R cycles, 2^Rmolecules (a.k.a. amplicons) are expected to arise from one double-strand DNA template, some may contain a base distinct from the one on the original template (called a false base). Fraction of the false bases depends on the PCR error rate r and the cycle number i at which the errors occur. From the 2^Ramplicons, only a small portion (e.g., 1˜1,000 in 2^20-30amplicons) will be randomly selected for sequencing. When a sampled amplicon carries a false base, the PCR error can be observed. Therefore, observing PCR error can be imagined as a process of sampling amplicons that carry a false base from the 2^Ramplicons, and the fraction of false bases F_Rin the amplicon determines the frequency of PCR error. Thus, deriving a probability distribution of fraction of false bases F_Ris required for PCR error modeling.

In the following, it is assumed that PCR errors at different loci are independent and only one locus is discussed. Let t_iand f_ibe the numbers of true bases and false bases at the i^thcycle, Equation (I) describes the probability of having m and n additional false bases arising from the ti true bases and the f false bases in the next cycle (the (i+1)^thcycle), respectively. In Equation (I), r is the PCR error rate per base per cycle. Namely, r represents PCR error rate at a locus corresponding to the unknown base N (namely, the true base). On the other hand, the probability of m additional false bases arising from the t_itrue bases simply follows a binomial distribution. For an additional false base to derive from the f_ifalse bases, it requires either no PCR error (with probability 1-r) or a PCR error that does not turn the false base back to the true one, which probability is r*⅔ assuming no base preference in PCR error. This explains the (1-r/3) term in Equation (I).

With Equation (I), the probability of having f_i+1false bases at the (i+1)^thcycle can be calculated via considering all possible combinations of additional false bases arising from the false bases and true bases at the i^thcycle. In the equation (II), the Kronecker delta function δ ensures (_f_i+1_−f_i) additional false bases from the false bases and true bases at the it cycle. The number f_ican be 0, 1, 2, . . . and up to 2ⁱ−1 because at least the starting true base will persist through the duplications. The upper bound is further limited by _f_i+1because _f_i_≤f_i+1, i.e., false bases remain false in the next cycle.

$\begin{matrix} P_{add} (m, n; t_{i}, f_{i}) = {C_{m}^{t_{i}} (r)}^{m} {(1 - r)}^{t_{i} - m} \cdot {C_{n}^{f_{i}} (1 - \frac{r}{3})}^{n} {(\frac{r}{3})}^{f_{i} - n} & (I) \end{matrix}$

$\begin{matrix} P (f_{i + 1}) = \sum_{f_{i} = 0}^{\min (2^{i} - 1, f_{i + 1})} P (f_{i}) \sum_{m = 0}^{t_{i}} \sum_{n = 0}^{f_{i}} δ (m + n, f_{i + 1} - f_{i}) P_{add} (m, n; t_{i}, f_{i}) & (II) \end{matrix}$

With the above dynamic equation, the probability distribution of P(F_R) can be obtained via calculation of P(f_i+1). For example, shown in FIG. 1C is the probability distribution of F₁₀under a PCR error rate 0.01. As shown in FIG. 1C, three minor peaks appear near fractions 0.5, 0.25, and 0.125. These peaks correspond to a first PCR error at the 1^st2^ndand 3^rdcycle, resulting in about ½, ¼, and ⅛ of the bases being false, respectively. In addition to mathematical induction, random PCR errors may be simulated as well. In one example, with 1 million times of mathematical simulation with a PCR error rate 0.01 in a 10-cycle PCR, the resulting distribution of fractions of false bases agree well with that of mathematical induction.

With the probability distribution of F_R, a probability that a consensus base is not the true base may be calculated assuming no sequencing error. At one specific locus, let a sequenced base s_ibe the base on the i^thsampled amplicon and S={s_i} be the collection of all bases. The term “consensus base c(S)” refers to the most frequent base in the observed base collection S. Probability P(N≠c(S)|S) that consensus base c(S) is not the unknown base N can be calculated via a Bayesian approach in Equation (III).

$\begin{matrix} \begin{matrix} P (N \neq c (S) ❘ S) = \sum_{b \neq c (S)} P (N = b ❘ S) \\ = \frac{\sum_{b \neq c (S)} P (S ❘ N = b) \cdot P (N = b)}{\sum_{b = {A, C, G, T}} P (S ❘ N = b) \cdot P (N = b)} \end{matrix} & (III) \end{matrix}$

Assuming no prior knowledge on the unknown base N, which means that probability of the unknown base N being any one of A, T, C, G is the same, P(N=b) will be a constant ¼. Thus, the terms P(N=b) can be crossed out from the equation (III). P(S|N=b) is the probability of obtaining observed base collection S via random sampling from 2^Ramplicons arising from an assumed base b. The probability P(S|N=b) depends on the fraction of false bases F_Rin the amplicons and can be calculated via considering all possible fraction of false bases F_Rvalues as in Equation (IV). Given an F_R, the probability that a randomly sampled base is false is simply F_Ras illustrated in Equation (V).

$\begin{matrix} \begin{matrix} P (S | N = b) = \sum_{F_{R}} P (F_{R}) \cdot P (S | N = b ⋂ F_{R} = F_{R}) \\ ≅ \sum_{F_{R}} P (F_{R}) \cdot \prod_{i} P (s_{i} | N = b ⋂ F_{R} = F_{R}) \end{matrix} & (IV) \end{matrix}$

$\begin{matrix} P (s_{i} | N = b ⋂ F_{R} = F_{R}) = {\begin{matrix} 1 - F_{R} & s_{i} = b \\ F_{R} & s_{i} \neq b \end{matrix} & (V) \end{matrix}$

When multiple amplicons are sampled, the above formulae involve the calculation of ΣP(F_R)*F_R^k·ΣP(F_R)*F_R^krefers to the k^thmoment of the fraction of false bases F_Rprobability distribution. In principle, the numerical values can be obtained using Equation (II). However, the calculation of ΣP(F_R)*F_R^kis too intense to be handled practically for a large R. Therefore, an approximation of these moments is needed. The moments can be approximated via considering only the first PCR error when the error rate is relatively small. For example, in a PCR of 10 cycles, there are 2¹⁰−1=1,023 duplication events. Under a PCR error rate 0.001, the mean number of PCR errors is about 1.

Besides, a PCR error occurring at an earlier cycle results in a higher fraction of false bases in the final amplicons, and thus contributes more to the moments. When only the first PCR error is considered, the moments can be calculated as in Equation (VI). In the equation (VI), i stands for the cycle at which the first PCR error occurs. For a relatively small PCR error rate r (i.e., r<2^−R), probability that the first PCR error occurs at the i^thcycle is approximately 2^i-1*r because there are 2^i-1amplicons before the i^thcycle and the chance of PCR error on each amplicon is r.

The resulting fraction of false bases F_Ris (½ⁱ) without further PCR errors. Notwithstanding, further PCR errors indeed may occur, but their contributions to moments belong to higher order terms (i.e., r², r³, . . . ), and the moments are much smaller when r<<1. FIG. 2 shows theoretical values of the moments of fraction of false bases F_Rat different PCR cycles. As shown in FIG. 2, theoretical values of the first 5 moments of fraction of false bases F_Rat different cycles under a PCR error rate 0.001 may be calculated based on Equation (II). Meanwhile, FIG. 2 also shows the approximated moments based on Equation (VI), and these moments agree well with the theoretical values. With the approximated moments, probability that a consensus base c(S) is not the unknown base N (true base) can be calculated efficiently without the detailed probability distribution of fraction of false bases F_R. This is the key simplification in PCR error modelling.

$\begin{matrix} M_{k} (R, r) = \sum_{i = 1}^{R} 2^{i - 1} {r (\frac{1}{2^{i}})}^{k} = \frac{r}{2} \cdot \sum_{i = 1}^{R} 2^{- (k - 1) i} = \frac{r}{2} \cdot \frac{(1 - 2^{- (k - 1) R})}{(2^{k - 1} - 1)} \overset{R ≫ 1}{\to} \frac{r}{2} \cdot \frac{1}{(2^{k - 1} - 1)} \overset{k ≫ 1}{\to} \frac{r}{2^{k}} & (VI) \end{matrix}$

With the moments, probability that the consensus base c(S) is not the unknown base N (true base) is calculated explicitly in Equations (VII) and (VIII). For instance, two types of observed bases s_jare assumed. Without loss of generality, let the observed bases s_jbe A's and C's, and in the observed base collection S, the number n_Aof A's is larger than n_Cof C's, thus the consensus base c(S) is A. In Equation (VIII), when PCR error rate r<<1, probabilities P(S|N=G) and P(S|N=T) of the unknown base N not being G or T are neglected, because their magnitudes are much smaller than the probabilities P(S|N=A) and P(S|N=C) of the unknown base N being A or C. It is worth of notion, Equations (VII) and (VIII) should be modified if a different number of base types is observed and the modifications are shown later.

$\begin{matrix} P (S | N = b) ≅ \sum_{F_{R}} P (F_{R}) \cdot \prod_{i = 1}^{d} P (s_{i} | N = b ⋂ F_{R} = F_{R}) = \sum_{F_{R}} P (F_{R}) \cdot {(1 - F_{R})}^{n_{b}} \cdot F_{R}^{d - n_{b}} = \sum_{F_{R}} P (F_{R}) \cdot \sum_{k = 0}^{n_{b}} {C_{k}^{n_{b}} (- 1)}^{k} F_{R}^{k + d - n_{b}} = \sum_{k = 0}^{n_{b}} {C_{k}^{n_{b}} (- 1)}^{k} M_{(k + d - n_{b})} & (VII) \end{matrix}$

$\begin{matrix} \begin{matrix} P (N \neq A | S) = \begin{matrix} \frac{\sum_{b = {C, G, T}} (S | N = b)}{\sum_{b = {A, C, G, T}} (S | N = b)} ≅ \\ \frac{P (S | N = C)}{P (S | N = A) + P (S | N = C)} \end{matrix} \\ = \frac{\sum_{k = 0}^{n_{C}} {C_{k}^{n_{C}} (- 1)}^{k} M_{(k + n_{c})}}{\sum_{k = 0}^{n_{A}} {C_{k}^{n_{A}} (- 1)}^{k} M_{(k + n_{A}) 0 0} + \sum_{k = 0}^{n_{C}} {C_{k}^{n_{C}} (- 1)}^{k} M_{(k + n_{c})}} \end{matrix} & (VIII) \end{matrix}$

In addition to PCR error, sequencing error can result in a false observed bases s_j. Therefore, sequencing error is also integrated into evaluation of probability that the consensus base c(S) is not the unknown base N. In general terms, probability q_ithat the sequenced base s_iis an error can be estimated by the quality score Q_ias in Equation (IX). If there is no preference of false bases in sequencing error, the probability of observing one of the false bases is q_i/3. With such information, PCR error and sequencing error for estimating quality of a consensus base c(S) can be integrated.

$\begin{matrix} q_{i} = 10^{- Q_{i} / 10} = P (s_{i} \neq N) = P (s_{i} \neq b | N = b) & (IX) \end{matrix}$

To consider both PCR error and sequencing error, the particular details of estimating probability that consensus base c(S) is not the unknown base N shall be further elaborated on Equation (IV). Specifically, P(s_i|N=b∩F_R=F_R) shall be rewritten. Equation (X) describes the probability of observing another observed base b′ when the unknown base N is the assumed base b, and F_b″≠bstands for a fraction of false bases of another type of amplified base b″ in the final amplicon pool and represents the set of the amplified bases b″ when placed in “{ }”. When another observed base b′ is the assumed base b, it can be the result of no PCR error (PE) nor sequencing error (SE), or the result of a PCR error giving rise to a false amplified base b″ (PE_bb″≠b) and a sequencing error that turns the amplified base b″ back to the assumed base b (SE_b″b).

When the another observed base b′ is not the assumed base b, there are three possible schemes: (1) a PCR error giving rise to the another observed base b′ distinct from the assumed base b (PE_bb′) and no sequencing error, (2) no PCR error but a sequencing error that turns the assumed base b into the another observed base b′ (SE_bb′), and (3) a PCR error giving rise to a false amplified base b″ that is not the assumed base b nor the another observed base b′ (PE_{bb″≠{b,b′}}) and a sequencing error that turns the amplified base b″ into the another observed base b′ (SE_{b″ b′}).

Assuming independence between PCR error and sequencing error, probabilities of the above schemes are shown explicitly in Equation (X). In the equation (X), F_Rstands for the total fraction of false bases, F_R=Σ_b″≠bF_b″ for instance, which matches the earlier definition. Combining Equation (X) into Equations (VII) and (VIII), a quality score of the consensus base c(S) can be calculated and given to while taking into account both PCR error and sequencing error.

$\begin{matrix} P (s_{i} = b^{'} | N = b ⋂ {F_{b^{″} \neq b}} = {F_{b^{″} \neq b}}) = {\begin{matrix} P (\overline{PE} ⋂ \overline{SE}) + P ({PE}_{{bb}^{″} \neq b} ⋂ {SE}_{b^{″} b}) & b^{'} = b \\ P ({PE}_{{bb}^{'}} ⋂ \overline{SE}) + P (\overline{PE} ⋂ {SE}_{{bb}^{'}}) + P ({PE}_{{bb}^{″} \neq {b, b^{'}}} ⋂ {SE}_{b^{″} b}) & b^{'} \neq b \end{matrix} = {\begin{matrix} (1 - F_{R}) (1 - q_{i}) + F_{R} \cdot \frac{q_{i}}{3} & b^{'} = b \\ F_{b^{'}} (1 - q_{i}) + (1 - F_{R}) \cdot \frac{q_{i}}{3} + \sum_{b^{″} \neq {b, b^{'}}} F_{b}^{″} \cdot \frac{q_{i}}{3} & b^{'} \neq b \end{matrix} = {\begin{matrix} (1 - q_{i}) + (\frac{4 q_{i}}{3} - 1) F_{R} & b^{'} = b \\ \frac{q_{i}}{3} + (1 - \frac{4 q_{i}}{3}) F_{b^{'}} & b^{'} \neq b \end{matrix} & (X) \end{matrix}$

In the above formulation, three types of false observed bases are lumped together. The formulation needs further modification because the three types of false observed bases do not appear equally likely. For example, given A as the assumed base b, it is more likely to observe two C's in the final amplicon pool than one C and one G because the former can be achieved with only one PCR error while the latter requires at least two PCR errors.

Therefore, all four types of bases shall be considered in the formulation. Specifically, instead of using the probability distribution P(F_R) to modify the formulation, the four base version P({F_C, F_G, F_T}) will substitute of P(F_R) when the assumed base b is A. It should be noted that the key simplification of the PCR error modeling is approximation of moments. With four types of bases, approximation of the multivariate moments defined in Equation (XI) is required. Again, the moments can be approximated via considering only the first few PCR errors when r is much smaller than 1.

$\begin{matrix} M_{x y z} (R, r) = \sum_{{F_{C}, F_{G}, F_{T}}} P ({F_{C}, F_{G}, F_{T}}) F_{C}^{x} F_{G}^{y} F_{T}^{z} \equiv \sum_{{F_{b^{″} \neq A}}} P ({F_{b^{″} \neq A}}) F_{C}^{x} F_{G}^{y} F_{T}^{z} & (XI) \end{matrix}$

The evaluation model can also be extended to consider the double-strand nature of DNA. For a double-strand template, a PCR error produces an error base on the synthesized strand, and it takes one more cycle for the error base to settle on both strands. Therefore, a PCR error at the 1^stcycle leads to false bases in a quarter instead of half of the final amplicon pool. To further elaborate, a UMI usually labels only one strand of a DNA template. With these features, the multivariate moments can be approximated. On this occasion, the level of a multivariate moment is defined as the number of zeros in the orders (i.e., x, y, and z). A higher-level moment is smaller.

The above examples are intended merely to explain mathematical induction of the statistical model of the method for evaluating consensus base error rate in the present invention, but not intended to limit practical implementation of the method.

Provided in third aspect of the present invention is a system (1) for evaluating consensus base error rate. As shown in FIG. 3A, the system (1) comprises a polymerase chain reaction quality evaluation module (11) configured to execute the method in the first aspect of the present invention so as to calculate a consensus base error rate P_c(S).

In various embodiments, as shown in FIG. 3A, the system (1) further comprises a scoring module (14), being signally connected to the polymerase chain reaction quality evaluation module (11), configured to identify the consensus base error rate P_c(S)and assign the consensus base c(S) a quality score Q_c.

In other embodiments, as shown in FIG. 3A, the system (1) further comprises a determining module (15), being signally connected to the scoring module (14) and the polymerase chain reaction quality evaluation module (11), configured to receive the quality score Q_cand compare the quality score Q_cwith a second quality score P′_c(S)so as to generate a determined instruct, wherein the second quality score P′_c(S)is output by the polymerase chain reaction quality evaluation module (11), where:

- if the second quality score P′_c(S)larger than the quality score Q_c, the determined instruct presents that the consensus base c(S) is confident;
- if the second quality score P′_c(S)smaller than the quality score Q_c, the determined instruct presents that the consensus base c(S) is not confident.

Exemplarily, when a user obtains the quality score Q_c, the user may focus on the base set with high quality scores, and sets a cutoff score Q_xtaking the quality score Q_cas a baseline. In the subsequent batch processing of the sequencing data, confidence of the consensus base c(S) may be determined according to the cutoff score Q_x. In the aforementioned case, the cutoff score setting may be extended to a consensus sequence consisting of one or multiple consensus bases c(S). Another example is provided herein to explain determination of a consensus sequence's confidence. Table 1 demonstrates a consensus sequence 1 and a consensus sequence 2 obtained from different batches of sequencing data resulting from an unknown sequence. The system (1) identifies that the consensus sequence 1 is 5′-CAA-3′ according to observed base collections including S_1-1={A₁C₁C₁}, S_1-2={A₂G₂A₂} and S_1-3={G₃A₃A₃} read from 5′ end to 3′ end. On the other hand, the consensus sequence 2 is determined to be 5′-CAG-3′ according to observed base collections including S_2-1={A₁C₁C₁}, S_2-2={A₂G₂A₂} and S_2-3={G₃G₃A₃}. After calculation of quality score of each consensus base, the system (1) identifies that the quality score Q_cof consensus base A obtained from the observed base collection S_1-3is smaller than the cutoff score Q_x, thereby giving a determined instruct of non-confidence. In contrast, all consensus bases in the observed base collections S_2-1, S_2-2and S_2-3have quality scores Q_clarger than the cutoff score Q_x, and thus the system (1) gives a determined instruct of confidence. In this context, the user is capable of determining the consensus sequence confidence based on the determined instructs provided by the system (1) so that the most appropriate sequence reading can be made.

TABLE 1

Consensus

Observed
Consensus
Quality
Determined

sequence

base collection
base
score (Qc)
instruct

1
5′-CAA-3′
S_1-1
A₁C₁C₁
C
>Q_x
confident

S_1-2
A₂G₂A₂
A
>Q_x
confident

S_1-3
G₃A₃A₃
A
<Q_x
Not confident

2
5′-CAG-3′
S_2-1
A₁C₁C₁
C
>Q_x
confident

S_2-2
A₂G₂A₂
A
>Q_x
confident

S_2-3
G₃G₃A₃
G
>Q_x
confident

Provided in the fourth aspect of the present invention is a system or evaluating consensus base error rate. As shown in FIG. 3B, the basic elements and processing principles are the same as those in third aspect of the present invention, but the system (1) further comprises a sequencing quality evaluation module (12), being signally connected to the polymerase chain reaction quality evaluation module (11), configured to read a sequencing quality score Q_iof another observed base b′ so as to calculate a sequencing accuracy rate P_seq1and a sequencing error rate P_seq2, wherein:

the sequencing accuracy rate P_seq1is calculated through the following equation:

$P_{seq 1} = 1 - 1 0^{- Q_{i} / 1 0};$

the sequencing error rate P_seq2is calculated through the following equation:

$P_{s e q 2} = \frac{10^{- Q_{i} / 1 0}}{3};$

- a consensus base error rate P_c(S)modifying module (11a), being configured in the polymerase chain reaction quality evaluation module (11) and signally connected to the sequencing quality evaluation module so as to modify the P(s_j|N=b ∩F_R=F_R)to be P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) based on the sequencing accuracy rate P_seq1, the sequencing error rate P_seq2and an amplified base ratio F_b″≠b, where the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) represents a probability that the observed base s_jis identical to the another observed base b′ if the unknown base N is identical to the assumed base b, and a ratio of a amplified base b″ distinct from the assumed base b in the amplicon pool is F_b*_b, and wherein:
- if the amplified base b″ is distinct from the assumed base b, and the ratio of the amplified base F_b″≠bis equal to the fraction of amplified bases F_R, the P(s_j=b′|N=b ∩{F_b″≠b}={F_b″≠b}) is modified to satisfy the following equation:

$P (s_{j} = b^{'} | N = b ⋂ {F_{b^{″} \neq b}} = {F_{b^{″} \neq b}}) = {\begin{matrix} P_{p c r 1} \times P_{s e q 1} + P_{p c r 2} \times P_{s e q 2} & b^{'} = b \\ P_{p c r 3} \times P_{s e q 1} + P_{p c r 4} \times P_{s e q 2} + P_{p c r 5} \times P_{s e q 2} & b^{'} \neq b \end{matrix};$

- wherein, if the another observed base b′ is identical to the assumed base b, P_per1represents a probability that the assumed base b is amplified into a base identical to the another observed base b′, and P_per2represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b;
- wherein, if the another observed base b′ is distinct from the assumed base b, P_per3represents a probability that the assumed base b is amplified into the another observed base b′, P_per4represents a probability that the assumed base b is amplified into a base identical to the assumed base b, and P_per5represents a probability that the assumed base b is amplified into the amplified base b″, where the amplified base b″ is distinct from the assumed base b and also distinct from the another observed base b′;
- wherein, the sequencing accuracy rate P_seq1represents a probability that the another observed base b′ is correctly sequenced, and the sequencing error rate P_seq2represents a probability that the amplified base b″ is falsely sequenced as the another observed base b′.

Please refer to FIG. 4, provided in fifth aspect of the present invention is a device (2) for evaluating consensus base error rate, comprising a memory (21) and a processor (22), wherein the memory (21) is configured to store a program (P), and the processor (22) is configured to execute the program (P) so as to realize the method for evaluating consensus base error rate in the first aspect or the second aspect of the present invention, or to realize the system in third aspect or the fourth aspect of the present invention.

The entirety or part of the functions of the method provided by the present invention can be implemented through hardware integration or software programming. When all or part of the functions are implemented via software, the program can be stored in a memory, which may include a read-only memory (ROM), random-access memory (RAM), magnetic disk, optical disk, hard disk, or other storage media. The processor executes the program to realize the aforementioned functions. For example, the program can be stored in the memory of a sequencing device, and by executing the program through the processor, all or part of the aforementioned functions can be achieved. Additionally, the program may also be stored in a server, another computer, a magnetic disk, an optical disk, flash memory, or a portable hard drive. Through downloading, transmission, copying, or other methods, the program can be transferred to the memory of the sequencing device or used to overwrite or update different versions of the sequencing system. Once the processor executes the program stored in the memory, all or part of the functions of the above method can be implemented.

Example 1: Calculating Consensus Base c(S) Quality Considering Both PCR and Sequencing Errors

Example 1 illustrates a calculation process of probability that a consensus base c(S) is not a true base using the aforesaid method. An observed base collection includes three sequenced bases S={A₁C₂A₃} and the consensus base c(S) is determined to be A. The following equation (XII) shows probability of observing each base given a known assumed base b and composition of the false bases {F_b″≠b} based on Equation (X). In equation (XII), the u_kiand v_kiare defined to simplify notations and are used when the observed base is identical to and distinct from the assumed base b, respectively. The two indices k and i indicate order (i.e., terms associated with F^k) and the i^thobserved base, respectively. With Equation (XII), probabilities of observing an observed base collection S given an assumed base b can be calculated with Equation (XIII).

The calculations can be expressed in terms of weighted multivariate moments. The weights are determined by the following factors together, including whether the observed bases match the assumed base b (i.e., u or v), orders of the moments, and sequencing qualities of the observed bases (i.e., q_i's). A factor “3” appears before some moments because the corresponding “F” term is purely F_Rrepresenting the total fraction of 3 types of false bases. With these weights, probability that a consensus base c(S) is not the assumed base b can be calculated in a more tractable manner with Equation (XIV). In Example 1, exactly two types of bases are present. When there are three types of observed bases, the calculations can be modified following a similar logic.

$\begin{matrix} P (A_{1, 3} | N = A ⋂ {F_{b^{″} \neq A}} = {F_{b^{″} \neq A}}) = (1 - q_{1, 3}) + (\frac{4 q_{1, 3}}{3} - 1) (F_{C} + F_{G} + F_{T}) \equiv u_{0} (q_{1, 3}) + u_{1} (q_{1, 3}) F_{R} \equiv u_{0 (1, 3)} + u_{1 (1, 3)} F_{R} P (C_{2} | N = A ⋂ {F_{b^{″} \neq A}} = {F_{b^{″} \neq A}}) = \frac{q_{2}}{3} + (1 - \frac{4 q_{2}}{3}) F_{C} \equiv v (q_{2}) + v_{1} (q_{2}) F_{C} \equiv v_{0 2} + v_{1 2} F_{C} P (A_{1, 3} | N = C ⋂ {F_{b^{″} \neq C}} = {F_{b^{″} \neq C}}) = v_{0 (1, 3)} + v_{1 (1, 3)} F_{A} P (C_{2} | N = C ⋂ {F_{b^{″} \neq C}} = {F_{b^{″} \neq C}}) = u_{0 2} + u_{1 2} F_{R} P (A_{1, 3} | N = G, T ⋂ {F_{b^{″} \neq G, T}} = {F_{b^{″} \neq G, T}}) = v_{0 (1, 3)} + v_{1 (1, 3)} F_{A} P (C_{2} | N = G, T ⋂ {F_{b^{″} \neq G, T}} = {F_{b^{″} \neq G, T}}) = v_{0 2} + v_{1 2} F_{C} & (XII) \end{matrix}$

$\begin{matrix} P (S | N = A) ≅ \sum_{{F_{b^{″} \neq A}}} P ({F_{b^{″} \neq A}}) \cdot \prod_{i = 1}^{3} P (s_{i} | N = A ⋂ {F_{b^{″} \neq A}} = {F_{b^{″} \neq A}}) = \sum_{{F_{b^{″} \neq A}}} P ({F_{b^{″} \neq A}}) \cdot (u_{0 1} + u_{11} F_{R}) (v_{0 2} + v_{1 2} F_{C}) (u_{0 3} + u_{1 3} F_{R}) = \sum_{{F_{b^{″} \neq A}}} P ({F_{b^{″} \neq A}}) \cdot [u_{01} v_{0 2} u_{0 3} + (u_{11} v_{0 2} u_{0 3} + u_{01} v_{0 2} u_{1 3}) F_{R} + u_{11} v_{0 2} u_{1 3} F_{R}^{2}] + \sum_{{F_{b^{″} \neq A}}} P ({F_{b^{″} \neq A}}) \cdot [⁠ u_{01} v_{1 2} u_{0 3} F_{C} + (u_{11} v_{1 2} u_{0 3} + u_{01} v_{1 2} u_{1 3}) F_{R} F_{C} + u_{11} v_{1 2} u_{1 3} F_{R}^{2} F_{C}] = u_{01} v_{0 2} u_{0 3} + (u_{11} v_{0 2} u_{0 3} + u_{01} v_{0 2} u_{1 3}) (M_{100} + M_{010} + M_{0 0 1}) + u_{11} v_{0 2} u_{0 3} (M_{2 0 0} + M_{0 2 0} + M_{0 0 2} + 2 M_{110} + 2 M_{101} + 2 M_{0 1 1}) + u_{01} v_{1 2} u_{0 3} \cdot M_{100} + (u_{11} v_{1 2} u_{0 3} + u_{01} v_{1 2} u_{1 3}) (M_{2 0 0} + M_{110} + M_{101}) + u_{11} v_{1 2} u_{1 3} (M_{3 0 0} + M_{1 2 0} + M_{102} + 2 M_{210} + 2 M_{2 0 1} + 2 M_{1 1 1}) ≅ u_{01} v_{0 2} u_{0 3} + (u_{11} v_{0 2} u_{0 3} + u_{01} v_{0 2} u_{1 3}) \cdot 3 M_{100} + u_{11} v_{0 2} u_{0 3} \cdot 3 M_{2 0 0} + u_{01} v_{1 2} u_{0 3} \cdot M_{100} + (u_{11} v_{1 2} u_{0 3} + u_{01} v_{1 2} u_{1 3}) M_{2 0 0} + u_{11} v_{1 2} u_{1 3} M_{3 0 0} = u_{0 1} v_{0 2} u_{0 3} + (3 u_{1 1} v_{0 2} u_{0 3} + u_{01} v_{1 2} u_{0 3} + 3 u_{0 1} v_{0 2} u_{1 3}) M_{100} + (u_{1 1} v_{1 2} u_{0 3} + 3 u_{11} v_{0 2} u_{1 3} + u_{01} v_{1 2} u_{1 3}) M_{2 0 0} + u_{11} v_{1 2} u_{1 3} M_{3 0 0} \equiv a_{0} \cdot M_{0 0 0} + a_{1} \cdot M_{1 0 0} + a_{2} \cdot M_{2 0 0} + a_{3} \cdot M_{3 0 0} = \sum_{k = 0}^{3} a_{k} \cdot M_{k 0 0} P (S | N = C) ≅ \sum_{{F_{b^{″} \neq C}}} P ({F_{b^{″} \neq C}}) \cdot (v_{0 1} + v_{11} F_{A}) (u_{0 2} + u_{1 2} F_{R}) (v_{0 3} + v_{1 3} F_{A}) ≅ v_{0 1} u_{0 2} v_{0 3} + (v_{11} u_{0 2} v_{0 3} + 3 v_{01} u_{1 2} v_{0 3} + v_{01} u_{0 2} v_{1 3}) M_{100} + (v_{11} u_{1 2} v_{0 3} + v_{11} u_{0 2} v_{1 3} + v_{01} u_{1 2} v_{1 3}) M_{2 0 0} + v_{11} u_{1 2} v_{1 3} M_{3 0 0} \equiv \sum_{k = 0}^{3} c_{k} \cdot M_{k 0 0} P (S | N = G, T) ≅ \sum_{{F_{b^{″} \neq G, T}}} P ({F_{b^{″} \neq G, T}}) \cdot (v_{0 1} + v_{11} F_{A}) (v_{0 2} + v_{1 2} F_{C}) (v_{0 3} + v_{1 3} F_{A}) = v_{01} v_{0 2} v_{0 3} + (v_{11} v_{0 2} v_{0 3} + v_{01} v_{1 2} v_{0 3} + v_{01} v_{0 2} v_{1 3}) M_{100} + v_{11} v_{0 2} v_{1 3} M_{2 0 0} + (v_{01} v_{1 2} v_{1 3} + v_{11} v_{1 2} v_{0 3}) M_{110} + v_{11} v_{1 2} v_{1 3} M_{210} ≅ v_{01} v_{0 2} v_{0 3} + (v_{11} v_{0 2} v_{0 3} + v_{01} v_{1 2} v_{0 3} + v_{01} v_{0 2} v_{1 3}) M_{100} + v_{11} v_{0 2} v_{1 3} M_{2 0 0} \equiv \sum_{k = 0}^{3} g_{k}, t_{k} \cdot M_{k 0 0} & (XIII) \end{matrix}$

$\begin{matrix} P (N \neq A | S = {A_{1} C_{2} A_{3}}) = \frac{P (S | N = C) + P (S | N = G) + P (S | N = T)}{P (S | N = A) + P (S | N = C) + P (S | N = G) + P (S | N = T)} = \frac{\sum_{k = 0}^{3} (c_{k} + g_{k} + t_{k}) \cdot M_{k 0 0} (R, r)}{\sum_{k = 0}^{3} (a_{k} + c_{k} + g_{k} + t_{k}) \cdot M_{k 0 0} (R, r)} & (XIV) \end{matrix}$

Example 2: Estimating PCR Error Rates

Example 2 takes adenine (A) as the assumed base b, when one type of false bases (C, G, or T) is present, the first level moment M_x00's in Equation (XI) can be modified via replacing the (r/3) by the corresponding substitution rate, such as p_AC, p_AG, or P_AT. As shown in Equation (XV), m_x00is defined as the part of moment without substitution rate, thereby denoted by p_bb′. Similarly, the second level moments can be modified via replacing the (r/3)²by products of the corresponding substitution rates, such as p_AC*p_AGfor M_xy0's when the false base types are C and G as shown in Equation (XVI). To calculate quality of a consensus base c(S), one can simply replace the single-variate moments in the original formulation with the multivariate moments.

$\begin{matrix} M_{x 0 0}^{b b^{″}} (R) = P_{{bb}^{″}} \sum_{i = 1}^{R} 2^{i - 1} {(\frac{1}{2^{i}})}^{x} \equiv P_{{bb}^{″}} \cdot m_{x 0 0} (R) \overset{R ≫ 1 & x ≫ 1}{\to} p_{b b^{″}} \cdot \frac{1}{2^{x}} & (XV) \end{matrix}$

$\begin{matrix} M_{xy 0}^{b b^{'} b^{″}} (R) \equiv p_{{bb}^{'}} p_{{bb}^{″}} \cdot m_{xy 0} (R) \overset{R ≫ 1}{\to} {\begin{matrix} p_{{bb}^{'}} p_{{bb}^{″}} (\frac{R^{2}}{4} - 1) & x = y = 1 \\ p_{{bb}^{'}} p_{{bb}^{″}} (c_{x 10 R} R + c_{x 10 c}) & x > 1 & y = 1 \\ p_{{bb}^{'}} p_{{bb}^{″}} (c_{xy 10 c}) & x > 1 & y > 1 \end{matrix} & (XVI) \end{matrix}$

Practically, these PCR substitution rates can be estimated using a maximal likelihood approach when the assumed bases b are known. In Example 2, let's focus on the loci where the template base is A. With an assumption of no sequencing error, a non-A base at the locus indicates a PCR error.

In Example 2, the loci with two or three types of false bases are neglected, as their frequency is much smaller under a low PCR error rate. Given observed bases at the loci, likelihood of the observation is described in Equation (XVII), where independence between loci is assumed. In the following equations, S_iis the observed bases at locus i, {S_i} stands for the observed base collection, and {p_bb″} represents the PCR error evaluation model described in the present invention. Calculation of the likelihood further requires the probability that all observed bases are A's as shown in Equation (XVIII), and the probability that the observed base collection includes some false bases, such as cytosines (C's) as shown in Equation (XIX).

$\begin{matrix} L ({S_{i}} | N = A & {p_{{bb}^{″}}}) = \prod_{i} P (S_{i} | N = A & {p_{{bb}^{″}}}) & (XVIII) \end{matrix}$

$\begin{matrix} P (S_{i} = {A_{1} \dots A_{a_{i}}} | N = A & {p_{{bb}^{″}}}) ≅ 1 + \sum_{k = 1}^{a_{i}} {C_{k}^{a_{i}} (- 1)}^{k} (p_{A C} + p_{A G} + p_{A T}) \cdot m_{k 0 0} \equiv 1 + a m_{i} \cdot (p_{A C} + p_{A G} + p_{A T}) & (XVIII) \end{matrix}$

$\begin{matrix} P (S_{i} = {A_{1} \dots A_{a_{i}} C_{1} \dots C_{c_{i}}} | N = A & {p_{{bb}^{″}}}) ≅ \sum_{k = 0}^{a_{i}} {C_{k}^{a_{i}} (- 1)}^{k} p_{A C} \cdot m_{(a_{i} + c_{i}) 0 0} \equiv c m_{i} \cdot p_{A C} & (XIX) \end{matrix}$

In Example 2, a maximal likelihood approach is applied to estimate the substitution rate p_AC, and p_ACgives a zero derivative of the log-likelihood. In Equation (XX), n_AAand n_ACare numbers of loci with all bases as A and containing only A's and C's, respectively. Based on the Equation (XX), p_ACis estimated to be proportional to number of loci at which the A is turned into a C, wherein the denominator involves moments for loci without a false base, which reflects the PCR process.

$\begin{matrix} \frac{d \log L}{{dp}_{A C}} = \sum_{i = 1}^{n_{A A}} \frac{d}{{dp}_{A C}} \log P (S_{i} | N = A & {p_{{bb}^{″}}}) + \sum_{j = 1}^{n_{A C}} \frac{d}{{dp}_{A C}} \log P (S_{j} | N = A & {p_{{bb}^{″}}}) = 0 & (XX) \end{matrix}$

$\sum_{i = 1}^{n_{A A}} \frac{d}{{dp}_{A C}} \log (1 + a m_{i} \cdot (p_{A C} + p_{A G} + p_{A T})) + \sum_{j = 1}^{n_{A C}} \frac{d}{{dp}_{A C}} \log (c m_{i} \cdot p_{A C}) = 0$

$\overset{{p_{{bb}^{″}}} ≪ 1}{\to} \sum_{i = 1}^{n_{A A}} \frac{d}{{dp}_{A C}} a m_{i} \cdot (p_{A C} + p_{AG} + p_{A T}) + \sum_{j = 1}^{n_{A C}} \frac{1}{p_{A C}} = 0$

$\sum_{i = 1}^{n_{A A}} a m_{i} + \frac{n_{A C}}{p_{A C}} = 0 \to p_{A C} = \frac{n_{A C}}{- \sum_{i = 1}^{n_{A A}} a m_{i}} = \frac{n_{A C}}{\sum_{i = 1}^{n_{A A}} (\sum_{k = 1}^{a_{i}} {C_{k}^{a_{i}} (- 1)}^{k + 1} m_{k 0 0})}$

Example 3: Estimating Background Errors

Once error probabilities of UMI all consensus bases are calculated, one can set a cutoff score and focus only on the high-quality bases. Among the high-quality bases, it is fair to assume only few false bases due to PCR and/or sequencing errors. Therefore, most of the remaining false bases can be attributed to background errors, for example, spontaneous mutation during DNA fragmentation. By applying UMI sequencing on known DNA templates, background error rate in the UMI sequencing can be estimated using the consensus base error rate P_c(S)obtained in Example 1.

Example 4

UMI sequencing of random sequences was simulated to examine validity of consensus qualities derived from the PCR and sequencing error model as described hereinabove. One hundred random sequences of length 100 bp were generated as the targeted reference sequences of “normal cells”. Totally 10⁷loci in 1000 normal cells were subject to mutation detection. Each targeted reference sequence of each normal cell was attached with a UMI label, and the UMI-labeled double-strand sequences were subject to 20-cycle PCR amplification, where the error rate r was 0.001.

For each UMI, three or five of the amplicons were randomly sampled for simulated Illumina sequencing. Sequenced amplicons of each UMI were assembled into a consensus sequence and quality of each consensus base was calculated using the method for evaluating consensus base error rate as described hereinabove. In Example 4, false consensus bases were identified and the frequency for each quality score was also calculated. The observed frequency was then compared to the expected value (e.g., 10⁻⁴for Q=40).

FIG. 5A shows a good agreement between the observed and expected frequencies across all consensus qualities with a UMI size three. As shown in FIG. 5B, for UMI size five, the overall agreement was still good except for the slightly larger fluctuations for some quality scores because there were not enough loci with such scores and/or only few or even no false consensus bases as the error rate was too low. The model fitting in Example 4 demonstrates that the method provided in the present invention was able to inform the true error probabilities of consensus bases (P_c(S)), and the error probability can be the base of quality scores calculation of a consensus base or even a consensus sequence.

The present invention provides a method for evaluating consensus base error rate based on a consensus base error rate evaluation model incorporating PCR error and sequencing error, and accurate evaluation of the consensus base error rate in UMI sequencing can be achieved using the aforesaid method.

Moreover, by using the method, each consensus base in a consensus sequence can be assigned of a precise quality score, which eliminates the impacts of low-frequency mutation on sequencing accuracy, thereby enhancing sequencing precision, and providing developmental foundation of precision medicine in the future.

	Number	Date	Country
Parent	PCT/CN2022/112752	Aug 2022	WO
Child	19039364		US

METHOD FOR EVALUATING CONSENSUS BASE ERROR RATE AND A SYSTEM THEREOF

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

CROSS-REFERENCE TO RELATED APPLICATION

Continuations (1)