The present disclosure relates to the field of biotechnologies, and in particular, to a UMI and an application thereof, a molecular identifier group, an adapter, an adapter ligation reagent, kits, a method for constructing a DNA library and a method for sequencing a gene.
Next generation sequencing (NGS, also referred to as second generation sequencing) technologies are currently the most widely used sequencing technologies, and have advantages of high sequencing depth, large throughput, high accuracy and good sensitivity.
In one aspect, a unique molecular identifier (UMI) is provided. The UMI includes: at least one random base and at least one fixed base.
In some embodiments, the at least one random base includes a plurality of random bases, and/or the at least one fixed base includes a plurality of fixed bases. The plurality of random bases and/or the plurality of fixed bases are arranged consecutively; or at least two random bases of the plurality of random bases are arranged at intervals, and/or at least two fixed bases of the plurality of fixed bases are arranged at intervals.
In some embodiments, in a case where the at least one random base includes the plurality of random bases, and the at least two random bases of the plurality of random bases are arranged at intervals, every two random bases arranged at intervals are separated by one to five fixed bases.
In some embodiments, the at least one random base includes at least three random bases. Every two of the at least three random bases are arranged at intervals from each other and separated by a group of fixed bases, and every two groups of fixed bases each include a same number of fixed bases.
In some embodiments, for the every two adjacent groups of fixed bases, at least one fixed base of one group of fixed bases is different from a fixed base of another group of fixed bases.
In some embodiments, the every two random bases arranged at intervals are separated by 2 to 4 fixed bases, and the 2 to 4 fixed bases are different from each other.
In some embodiments, the at least one random base includes three random bases.
In some embodiments, the UMI includes 7 to 11 bases.
In another aspect, a molecular identifier group is provided. The molecular identifier group includes: two unique molecular identifiers (UMIs) binding to each other through complementary pairing of at least a portion of bases thereof. At least one UMI is the UMI described above.
In yet another aspect, an adapter is provided. The adapter includes: a first strand, a second strand and at least one unique molecular identifier (UMI). Each UMI is located on the first strand or the second strand. The at least one UMI is the UMI described above.
In some embodiments, the at least one UMI includes two UMIs. The two UMIs are respectively located on the first strand and the second strand, and bind to each other through complementary pairing of at least a portion of bases thereof.
In some embodiments, the first strand is a forward strand, and the second strand is a reverse strand. The first strand includes a first sequencing primer sequence. The second strand includes a second sequencing primer sequence. A UMI located on the first strand is located downstream of the first sequencing primer sequence. A UMI located on the second strand is located upstream of the second sequencing primer sequence.
In yet another aspect, an adapter ligation reagent is provided. The adapter ligation reagent includes a plurality of kinds of adapters. The plurality of kinds of adapters are each the adapter as described above. Of the plurality of kinds of adapters, for every two kinds of adapters, at least one random base of at least one UMI included in one kind of adapter is different from at least one random base of at least one UMI included in another kind of adapter.
In yet another aspect, a kit is provided. The kit includes: the adapter ligation reagent described above.
In yet another aspect, an application of unique molecular identifiers (UMIs) described above in sequencing a gene is provided.
In some embodiments, the gene includes a deoxyribonucleic acid (DNA) molecule for expressing genetic information. The UMIs are configured to label different DNA molecules.
In yet another aspect, a method for constructing a deoxyribonucleic acid (DNA) library is provided. The method includes:
In yet another aspect, a method for sequencing a gene is provided. The method includes: performing gene sequencing on deoxyribonucleic acid (DNA) by using the DNA library obtained by the method for constructing the DNA library described above.
In yet another aspect, a kit is provided. The kit includes: the DNA library obtained by the method for constructing the DNA library described above.
In order to describe technical solutions in the present disclosure more clearly, accompanying drawings to be used in some embodiments of the present disclosure will be introduced briefly below. However, the accompanying drawings to be described below are merely accompanying drawings of some embodiments of the present disclosure, and a person having ordinary skill in the art can obtain other drawings according to these accompanying drawings. In addition, the accompanying drawings in the following description may be regarded as schematic diagrams, but are not limitations on an actual size of a product, an actual process of a method and an actual timing of a signal involved in the embodiments of the present disclosure.
Technical solutions in some embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. However, the described embodiments are merely some but not all embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure shall be included in the protection scope of the present disclosure.
Unless the context requires otherwise, throughout the description and the claims, the term “comprise” and other forms thereof such as the third-person singular form “comprises” and the present participle form “comprising” are construed in an open and inclusive sense, i.e., “including, but not limited to”. In the description of the specification, the terms such as “one embodiment”, “some embodiments”, “exemplary embodiments”, “example”, “specific example” or “some examples” are intended to indicate that specific features, structures, materials or characteristics related to the embodiment(s) or example(s) are included in at least one embodiment or example of the present disclosure. Schematic representations of the above terms do not necessarily refer to the same embodiment(s) or example(s). In addition, the specific features, structures, materials or characteristics may be included in any one or more embodiments or examples in any suitable manner.
Hereinafter, the terms such as “first” and “second” are used for descriptive purposes only, but are not to be construed as indicating or implying the relative importance or implicitly indicating the number of indicated technical features. Thus, the features defined with “first” and “second” may explicitly or implicitly include one or more of the features. In the description of the embodiments of the present disclosure, the term “a plurality of/the plurality of” means two or more unless otherwise specified.
The phrase “at least one of A, B and C” has a same meaning as the phrase “at least one of A, B or C”, and they both include the following combinations of A, B and C: only A, only B, only C, a combination of A and B, a combination of A and C, a combination of B and C, and a combination of A, B and C.
The phrase “A and/or B” includes the following three combinations: only A, only B, and a combination of A and B.
The phrase “applicable to” or “configured to” as used herein indicates an open and inclusive expression, which does not exclude devices that are applicable to or configured to perform additional tasks or steps.
As used herein, the term “DNA” is an abbreviation for deoxyribonucleic acid. DNA, as a carrier of genetic information in biological cells, is mainly used to guide synthesis of RNA and proteins in a body. The DNA is a macromolecular polymer composed of deoxynucleotides. The deoxynucleotide is composed of a phosphate, a deoxyribose and a base. There are four main kinds of bases, i.e., adenine (A), guanine (G), cytosine (C) and thymine (T).
As used herein, the term “RNA” is an abbreviation for ribonucleic acid. The RNA, as a carrier of genetic information that exists in biological cells and some viruses and viroids, is mainly used to guide synthesis of proteins in the body. The RNA is a macromolecular polymer composed of ribonucleotides. A ribonucleotide is composed of a phosphate, a ribose and a base. There are four main kinds of bases, i.e., adenine (A), guanine (G), cytosine (C) and uracil (U).
At present, next-generation sequencing technologies are widely used in reproductive genetics, tumor detection and other fields, especially in liquid biopsy. In a process of preparing a library for liquid biopsy, replication by a polymerase chain reaction (PCR) amplification enzyme has a certain base error rate. In addition, an error rate of a sequencer in reading bases is in a range of 0.01% to 0.1% (that is, there are 1 to 10 erroneous bases in every 10,000 bases) during a sequencing process. Such noise mutations (also referred to as non-native mutations) appear in samples with low frequency mutations or ultra-low frequency mutations, which makes it difficult to determine whether mutations at frequencies of 1% or less are genuine gene mutations or noise mutations caused by sequencing or PCR errors. Detection of these low frequency mutations is significant, so unique molecular identifiers (UMIs), which are also referred to as molecular barcodes, are introduced into original DNA fragments. A principle of the detection is to add a unique identifier sequence to each original DNA fragment, so that the identifier sequence is sequenced together with the original DNA fragment after library construction and PCR amplification. In this way, according to different identifier sequences, it may be possible to distinguish DNA templates (referred to as DNA molecules below) from different sources from each other, and to determine which mutations are false positive mutations caused by random errors during PCR amplification and sequencing and which are mutations actually carried by a patient, thereby improving detection sensitivity and specificity.
UMIs label the original DNA fragments, so that different DNA molecules from different sources carry different molecular identifiers; when sequencing results are analyzed, same inserts (i.e., the original DNA fragments) are screened out; if two ends of a same insert have complementary paired UMI adapters, the UMI adapters may be used to mark forward and reverse strands (a forward strand and a reverse strand) of the same insert; and if mutant bases appear at a same position in both the forward and reverse strands, such a mutation is marked as a real mutation. In this way, an original mutation state is restored.
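The strand-confirmation logic described above may be illustrated with a minimal Python sketch. It is not part of the disclosed method. It assumes that reads sharing a UMI have already been grouped and aligned to equal length, that reverse-strand reads have been converted to the forward orientation, and that a simple per-position majority vote is an acceptable consensus rule. The function names `consensus`, `mutations` and `duplex_mutations` are hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Majority base at each position over equal-length reads sharing one UMI (assumed rule)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def mutations(reference, sequence):
    """Set of (position, observed base) pairs where the sequence differs from the reference."""
    return {(i, b) for i, (r, b) in enumerate(zip(reference, sequence)) if r != b}


def duplex_mutations(reference, forward_reads, reverse_reads):
    """Keep only mutations seen at the same position in both strand consensuses."""
    fwd = mutations(reference, consensus(forward_reads))
    rev = mutations(reference, consensus(reverse_reads))
    return sorted(fwd & rev)


reference = "ACGTAC"
# A mutation (G to T at index 2) present on both strands is kept.
kept = duplex_mutations(reference,
                        ["ACTTAC", "ACTTAC", "ACTTAG"],
                        ["ACTTAC", "ACTTAC", "AGTTAC"])
# A change seen only on the forward strand is discarded as noise.
dropped = duplex_mutations(reference, ["ACGTTC"] * 3, ["ACGTAC"] * 3)
print(kept, dropped)
```

In this sketch, isolated errors within a UMI family are outvoted by the majority, and a change that survives on only one strand is not reported.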
At present, there are two main strategies for introducing the UMIs. A first strategy is to introduce single-ended UMIs. For example, eight random bases may be added to a P5 end of an adapter instead of an Index. Adapters synthesized in this way are simple, economical and broadly applicable, and have been widely used. However, in a process of constructing a library, binding of UMI adapters is random, so that a single original DNA fragment may bind with two different UMI adapters, and a forward strand and a reverse strand may bind with different UMIs. Consequently, information of the original forward and reverse strands cannot be tracked, and then forward and reverse strand sequences cannot be corrected accurately. In addition, if there is a base mutation in a UMI sequence, a number of bases in the original DNA fragment will increase, resulting in introduction of potential false positive mutations. A second strategy is to introduce double-ended UMIs. That is, in related art, a single-strand adapter (a sequence) is first synthesized, the single-strand adapter (the sequence) including a first sequence and a second sequence, the second sequence including protection bases of a restriction endonuclease and a double-stranded molecular identifier of random bases; then, a double-strand adapter is formed by annealing the single-strand adapter sequence; and finally, an adapter with a 3′-dT tail may be obtained by enzyme cleavage. Although such a double-strand adapter may effectively solve the problem that the single-ended UMIs cannot track the original forward and reverse strands, false positive mutations will still be introduced when a UMI sequence itself is mutated.
Some embodiments of the present disclosure provide an adapter 10. As shown in
Considering an example where the adapters 10 are Y-shaped adapters, the adapters may be classified into long adapters (complete Y-shaped adapters) and short adapters (incomplete Y-shaped adapters) according to whether the adapters 10 can match a PCR-free library. The long adapter binds to both ends of a DNA fragment to be sequenced (i.e., the original DNA fragment described above) by TA ligation, and if a library yield is sufficient, the library may be loaded onto a sequencer and sequenced directly without PCR amplification. In contrast, after the short adapter binds to both ends of the DNA fragment by TA ligation, PCR amplification must first be performed by using indexing primers complementary to the short adapter to form a complete adapter before sequencing on the sequencer is performed. Such a difference between the short adapter and the long adapter is mainly caused by different manners of introducing Index sequences of the short adapters and the long adapters. The Index sequence is configured to label samples of different sequences to be tested. A single sample may include tens of thousands of DNA molecules, and UMIs 20 are used to label different DNA molecules in a same sample or in different samples.
Considering an example where the adapters 10 are the long adapters, the adapters 10 may be classified into single-ended index adapters and double-ended index adapters. The single-ended index adapter only has an Index sequence at a P7 end, and the double-ended index adapter has Index sequences at both a P5 end and the P7 end.
Here, considering an example where the adapter 10 is a single-ended Index adapter, in a case where the adapter 10 includes one UMI 20, as shown in
In some embodiments, the at least one UMI 20 includes at least one random base and at least one fixed base.
In a single UMI 20, the numbers of random base(s) and fixed base(s), and the manner in which the random base(s) and the fixed base(s) are arranged, are not specifically limited.
In some embodiments, considering an example where there are one random base and one fixed base, the random base and the fixed base may be arranged in sequence in a same direction. For example, the random base and the fixed base are sequentially arranged in a direction from a 5′ end to a 3′ end of the UMI sequence; alternatively, the random base and the fixed base are sequentially arranged from the 3′ end to the 5′ end of the UMI sequence.
In some other embodiments, considering an example where there are a plurality of random bases and/or a plurality of fixed bases, there are two possible cases.
In a first case, there are the plurality of random bases and/or the plurality of fixed bases, and the plurality of random bases and/or the plurality of fixed bases are arranged consecutively.
In this case, depending on whether there is one random base or there are the plurality of random bases, and whether there is one fixed base or there are the plurality of fixed bases, there are many possible situations. Here, a direction from the 5′ end to the 3′ end of the UMI sequence is referred to as a first direction, and a direction from the 3′ end to the 5′ end of the UMI sequence is referred to as a second direction. In a first situation, there are the plurality of random bases and the one fixed base. In this situation, the plurality of random bases are arranged consecutively, and the fixed base may be located on a side of the plurality of random bases in the first direction or in the second direction. In a second situation, there are the plurality of fixed bases and the one random base. In this situation, the plurality of fixed bases are arranged consecutively, and the random base may be located on a side of the plurality of fixed bases in the first direction or in the second direction. In a third situation, there are the plurality of random bases and the plurality of fixed bases. In this situation, the plurality of random bases and the plurality of fixed bases are both arranged consecutively, and the plurality of fixed bases may be located on a side of the plurality of random bases in the first direction or in the second direction.
In a second case, there are the plurality of random bases and/or the plurality of fixed bases, and at least two random bases of the plurality of random bases and/or at least two fixed bases of the plurality of fixed bases are arranged at intervals.
In this case, depending on whether there is one random base or there are the plurality of random bases and whether there is one fixed base or there are the plurality of fixed bases, there are many possible situations. In a first situation, there are the plurality of random bases and the one fixed base. In this situation, the fixed base is located between any two adjacent random bases of the plurality of random bases. In a second situation, there are the plurality of fixed bases and the one random base. In this situation, the random base is located between any two adjacent fixed bases of the plurality of fixed bases. In a third situation, there are the plurality of random bases and the plurality of fixed bases. In this situation, there are a plurality of possible arrangements. In a first arrangement, at least two random bases of the plurality of random bases are arranged at intervals, and the plurality of fixed bases are located between any two random bases arranged at intervals. In a second arrangement, at least two fixed bases of the plurality of fixed bases are arranged at intervals, and the plurality of random bases are located between any two fixed bases arranged at intervals. In a third arrangement, at least two random bases of the plurality of random bases are arranged at intervals, and at least two fixed bases of the plurality of fixed bases are arranged at intervals. That is, at least two bases of the plurality of random bases are arranged at intervals and at least two bases of the plurality of fixed bases are arranged at intervals. There may be one or more fixed bases between two random bases arranged at intervals, and there may also be one or more random bases between two fixed bases arranged at intervals.
Considering an example where there are the plurality of random bases and the at least two random bases of the plurality of random bases are arranged at intervals, every two random bases arranged at intervals may be separated by 1 to 5 fixed bases.
It will be noted that, the “random base”, just as its name implies, means that the base is random, which may be any one selected randomly from the four bases (A, T, C and G), and may be represented by N. Random bases are selected from different bases, and may be used to label different DNA molecules.
For example, considering an example where there is one random base in a single UMI 20, the N in the UMI 20 may be any one selected from the four bases. In this case, according to different Ns in UMIs 20, there may be four kinds of UMIs 20. The four kinds of UMIs 20 may be formed into 4² (i.e., 16) kinds of adapter combinations (a single DNA molecule binds with two adapters), so that 4² (i.e., 16) different DNA molecules may be labeled and detected.
Considering an example where there are three random bases in a single UMI 20, each N in the UMI 20 may be any one selected from the four bases. Since the three Ns have 4³ (i.e., 64) combinations, there may be 4³ (i.e., 64) kinds of UMIs 20. The 64 kinds of UMIs 20 may be formed into 64² (i.e., 4096) kinds of adapter combinations (a single DNA molecule binds with two adapters), so that 64² (i.e., 4096) different DNA molecules may be labeled and detected.
It will be seen that, the larger the number of random bases is, the more kinds of UMIs there are, and the larger the number of DNA molecules that can be labeled becomes.
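The counting argument above can be checked with a short Python sketch. It is only an illustration. The function names `umi_kinds` and `adapter_pair_kinds` are hypothetical, and the sketch assumes that each random base is drawn independently from A, T, C and G and that the two adapters on one DNA molecule are counted as an ordered pair.

```python
def umi_kinds(n_random):
    """Number of distinct UMIs when each of n_random positions may be A, T, C or G."""
    return 4 ** n_random


def adapter_pair_kinds(n_random):
    """Number of ordered UMI pairs when one DNA molecule binds with two adapters."""
    return umi_kinds(n_random) ** 2


for n in (1, 3):
    print(n, umi_kinds(n), adapter_pair_kinds(n))
```

For one random base this gives 4 UMIs and 16 combinations, and for three random bases it gives 64 UMIs and 4096 combinations, matching the figures in the text.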
The fixed bases are selected from known fixed bases, and used for correcting a sequence to be tested and the UMI when the two have errors during amplification or sequencing, which reduces the introduction of false positive mutations.
As shown in
The 100 original sequences binding with UMI adapters are amplified by PCR and enriched to obtain a DNA library including 100 original sequences binding with the UMI adapters (for distinction, the remaining 99 original sequences binding with UMI adapters that are obtained by copying are recorded as original sequences 1′). As shown in a first case in
It will be seen that, by using a UMI 20 partially composed of fixed base(s), it may be possible to ensure diversity of the adapters to label different original DNA fragments; and moreover, noise mutations introduced by PCR amplification or sequencing may be avoided to a certain extent, which improves detection accuracy.
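One way to picture how known fixed bases help to expose errors in a sequenced UMI is the following minimal Python sketch. It is an illustration under stated assumptions rather than the disclosed procedure. The pattern `NACNGTN` is a hypothetical UMI layout in which `N` marks a random position and every other letter is a fixed base, and the helper names `fixed_mismatches` and `restore_fixed` are likewise hypothetical.

```python
def fixed_mismatches(pattern, observed):
    """Indices where a fixed position of the pattern disagrees with the observed UMI read."""
    if len(pattern) != len(observed):
        raise ValueError("pattern and observed UMI must have the same length")
    return [i for i, (p, o) in enumerate(zip(pattern, observed))
            if p != "N" and p != o]


def restore_fixed(pattern, observed):
    """Rewrite fixed positions to their known bases, leaving random positions untouched."""
    return "".join(o if p == "N" else p for p, o in zip(pattern, observed))


PATTERN = "NACNGTN"                               # hypothetical layout, N = random base
clean = fixed_mismatches(PATTERN, "TACGGTA")      # no fixed position disagrees
noisy = fixed_mismatches(PATTERN, "TAGGGTA")      # fixed C at index 2 was read as G
repaired = restore_fixed(PATTERN, "TAGGGTA")
print(clean, noisy, repaired)
```

A mismatch at a fixed position can only come from amplification or sequencing, since the true base there is known in advance. An error at a random position cannot be detected in this way.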
In some embodiments of the present disclosure, there are at least three random bases, the at least three random bases are arranged at intervals from each other, every two random bases arranged at intervals are separated by a group of fixed bases, and every two groups of fixed bases each include a same number of fixed bases.
In these embodiments, by limiting the number of the random bases to at least three, it may be possible to ensure that at least 4096 different DNA molecules are labeled by the UMIs 20, which increases a number of molecules to be tested, thereby improving the detection accuracy of samples. In addition, by providing a group of fixed bases between every two random bases arranged at intervals and setting every two groups of fixed bases each to include the same number of fixed bases, it may be possible to improve arrangement regularity of the random bases and the fixed bases, so that it is easy to determine whether a mutation is a mutation of a fixed base or a mutation of an original DNA fragment itself, which reduces the introduction of false positive mutations and improves the detection accuracy. In addition, it is found through testing that, for UMIs 20 with a same number of bases, detection accuracy in a case where every two random bases of the plurality of random bases are arranged at intervals from each other is higher than in a case where the plurality of random bases are consecutively arranged.
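The interleaved layout described in these embodiments may be sketched in Python as follows. This is a hedged illustration only. The function names `build_pattern` and `expand` are hypothetical, the fixed groups `AC` and `GT` are arbitrary example values, and the sketch assumes that fixed bases appear only between random bases.

```python
from itertools import product


def build_pattern(n_random, fixed_groups):
    """Interleave n_random 'N' positions with equal-length groups of fixed bases."""
    if n_random < 3:
        raise ValueError("these embodiments use at least three random bases")
    if len(fixed_groups) != n_random - 1:
        raise ValueError("one fixed group is needed between every two random bases")
    if len({len(g) for g in fixed_groups}) != 1:
        raise ValueError("all fixed groups must contain the same number of bases")
    parts = []
    for i in range(n_random):
        parts.append("N")
        if i < n_random - 1:
            parts.append(fixed_groups[i])
    return "".join(parts)


def expand(pattern):
    """List every concrete UMI obtained by filling each N with A, T, C or G."""
    slots = ["ATCG" if c == "N" else c for c in pattern]
    return ["".join(p) for p in product(*slots)]


pattern = build_pattern(3, ["AC", "GT"])   # example fixed groups, chosen arbitrarily
umis = expand(pattern)
print(pattern, len(umis), umis[0])
```

With three random bases the expansion yields 64 distinct UMIs, so that pairing two adapters gives the 4096 combinations mentioned above.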
In some other embodiments, since the fixed bases play a role of excluding noise mutations introduced by PCR amplification or sequencing, the larger the number of the fixed bases in a single UMI 20, the better an error tolerance during detection, and then the more accurate the detection. Here, considering an example where the UMI 20 includes three random bases and four fixed bases, there are the four fixed bases in the UMI 20 for error tolerance in a subsequent sequencing, so that an error tolerance rate may be 4 divided by 7 times 100%, i.e., about 57%.
However, since fixed bases occupy part of the amount of data to be tested, and this occupation grows as the number of the fixed bases increases, more fixed bases are not always better.
In light of this, in some embodiments, every two random bases arranged at intervals are separated from each other by two to four fixed bases.
In these embodiments, by providing two to four fixed bases between every two random bases, an occupation of the data to be tested due to an excessive number of fixed bases may be avoided while the error tolerance rate of detection is ensured.
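The trade-off between error tolerance and UMI length can be tabulated with a small Python sketch. It takes the error tolerance rate to be the number of fixed bases divided by the total number of bases, as in the 4/7 example given for three random bases and four fixed bases. The function name `error_tolerance_rate` is hypothetical, and the sketch assumes that fixed bases occur only in the gaps between random bases.

```python
def error_tolerance_rate(n_random, fixed_per_gap):
    """Return (UMI length, fraction of fixed bases) for an interleaved UMI."""
    n_fixed = (n_random - 1) * fixed_per_gap
    length = n_random + n_fixed
    return length, n_fixed / length


for k in (2, 3, 4):
    length, rate = error_tolerance_rate(3, k)
    print(f"{k} fixed bases per gap: length {length}, rate {rate:.0%}")
```

Under these assumptions, three random bases with two to four fixed bases per gap give UMI lengths of 7, 9 and 11 bases, and the rate rises from 4/7 to 8/11 as the UMI grows longer.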
The two to four fixed bases may be same or different, which are not specifically limited here.
In some embodiments, the two to four fixed bases are different from each other.
In these embodiments, since the two to four fixed bases are different from each other, it may be possible to avoid a concentration of a same kind of fluorescence (labeling a same kind of bases) during sequencing (the same kind of bases are prone to the concentration of the same kind of fluorescence), thereby preventing a problem of inaccurate reading due to the concentration of the fluorescence, which improves the detection accuracy.
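As an illustrative aside, the candidate fixed groups that satisfy the mutually-different constraint can be enumerated with a short Python sketch. The function names `distinct_fixed_groups` and `all_distinct` are hypothetical, and the sketch only counts ordered groups drawn from A, T, C and G. It makes no claim about which groups perform best in practice.

```python
from itertools import permutations

BASES = "ATCG"


def all_distinct(group):
    """True if no base is repeated within the fixed group."""
    return len(set(group)) == len(group)


def distinct_fixed_groups(size):
    """All ordered groups of `size` fixed bases in which every base is different."""
    return ["".join(p) for p in permutations(BASES, size)]


for size in (2, 3, 4):
    print(size, len(distinct_fixed_groups(size)))
```

Since there are only four kinds of bases, a group of mutually different fixed bases cannot contain more than four bases, which is consistent with the upper bound of four in these embodiments.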
In some other embodiments, for every two adjacent groups of fixed bases, at least one fixed base of one group of fixed bases is different from a fixed base of the other group of fixed bases.
For example, considering an example where there are three random bases in a single UMI 20, the three random bases are arranged at intervals from each other, and every two random bases are separated by one fixed base (i.e., every two adjacent groups of fixed bases each including one fixed base), since for every two adjacent groups of fixed bases, at least one fixed base of one group of fixed bases is different from a fixed base of the other group of fixed bases, a sequence of the UMI may be represented as follows:
where N1 and N2 are different from each other, and are each any one independently selected from A, T, C and G; and the three Ns may be same or different, and are each any one independently selected from A, T, C and G.
That is, considering an example where N1 is A and N2 is C, the two adjacent groups of fixed bases are respectively A and C, which are different from each other.
Compared with N1 and N2 that are a same kind of base, it may be possible to avoid the concentration of the same kind of fluorescence (labeling the same kind of bases) during sequencing (the same kind of bases are prone to the concentration of the same kind of fluorescence), thereby preventing the problem of inaccurate reading due to the concentration of the fluorescence, which improves the detection accuracy.
For another example, considering an example where there are three random bases in a single UMI 20, the three random bases are arranged at intervals from each other, and every two random bases are separated by two fixed bases (i.e., every two adjacent groups of fixed bases each including two fixed bases), since for every two adjacent groups of fixed bases, at least one fixed base of one group of fixed bases is different from a fixed base of the other group of fixed bases, a sequence of the UMI may be represented as follows:
where N3 and N4 may be same or different, and are each any one independently selected from A, T, C and G; N5 and N6 may be same or different, and are each any one independently selected from A, T, C and G; at least one of N5 and N6 is different from at least one of N3 and N4; and the three Ns may be same or different, and are each any one independently selected from A, T, C and G.
In this case, according to whether N3 and N4 are the same, there may be two possible cases. In a first case, N3 and N4 are the same. In this case, there may be two possible situations according to whether N5 and N6 are the same. In a first situation, N5 and N6 are different. In this situation, at least one of the fixed bases N5 and N6 is different from a fixed base of the fixed bases N3 and N4. That is, one of the fixed bases N5 and N6 is different from each of the two fixed bases N3 and N4; or both of the two fixed bases N5 and N6 are different from each of the fixed bases N3 and N4. In a case where one of the fixed bases N5 and N6 is different from each of the two fixed bases N3 and N4, considering an example where both N3 and N4 are As, N5 and N6 may be respectively A and C, or A and G, or A and T. In a case where N5 and N6 are respectively A and C, the two adjacent groups of fixed bases are respectively AA and AC, and for the two groups of fixed bases, one fixed base (C) of one group of fixed bases is different from the two fixed bases (As) in the other group of fixed bases. In a case where N5 and N6 are respectively A and G, the two adjacent groups of fixed bases are respectively AA and AG, and for the two groups of fixed bases, one fixed base (G) in one group of fixed bases is different from the two fixed bases (As) in the other group of fixed bases. In a case where N5 and N6 are respectively A and T, the two adjacent groups of fixed bases are respectively AA and AT, and for the two groups of fixed bases, one fixed base (T) in one group of fixed bases is different from the two fixed bases (As) in the other group of fixed bases. In a case where both of the two fixed bases N5 and N6 are different from each of the fixed bases N3 and N4, still considering an example where both N3 and N4 are As, N5 and N6 may be respectively C and G, C and T, or G and T. 
In a case where N5 and N6 are respectively C and G, the two adjacent groups of fixed bases are respectively AA and CG, and for the two groups of fixed bases, both of the two fixed bases (C and G) in one group of fixed bases are different from the two fixed bases (As) in the other group of fixed bases. In a case where N5 and N6 are respectively C and T, the two adjacent groups of fixed bases are respectively AA and CT, and for the two groups of fixed bases, both of the two fixed bases (C and T) in one group of fixed bases are different from the two fixed bases (As) in the other group of fixed bases. In a case where N5 and N6 are respectively G and T, the two adjacent groups of fixed bases are respectively AA and GT, and for the two groups of fixed bases, both of the two fixed bases (G and T) in one group of fixed bases are also different from the two fixed bases (As) in the other group of fixed bases. In a second situation, N5 and N6 are the same. In this situation, at least one of the fixed bases N5 and N6 is different from a fixed base of the fixed bases N3 and N4. That is, both of the two fixed bases N5 and N6 are different from each of the two fixed bases N3 and N4. For example, still considering an example where both N3 and N4 are As, N5 and N6 may be both Ts, Gs or Cs. In a case where N5 and N6 are both Ts, the two fixed bases N5 and N6 (Ts) are different from the two fixed bases N3 and N4 (As). In a case where N5 and N6 are both Gs, the two fixed bases N5 and N6 (Gs) are different from the two fixed bases N3 and N4 (As). In a case where N5 and N6 are both Cs, the two fixed bases N5 and N6 (Cs) are different from the two fixed bases N3 and N4 (As).
In a second case, N3 and N4 are different. In this case, according to whether N5 and N6 are same, there may be two possible situations. In a first situation, N5 and N6 are different. In this situation, at least one of the fixed bases N5 and N6 is different from a fixed base of the fixed bases N3 and N4. That is, one of the fixed bases N5 and N6 is different from one of the fixed bases N3 and N4, or both of the two fixed bases N5 and N6 are different from each of the two fixed bases N3 and N4. In a case where one of the fixed bases N5 and N6 is different from one of the fixed bases N3 and N4, considering an example where N3 and N4 are respectively A and T, N5 and N6 may be respectively A and C, A and G, T and C, or T and G. In a case where N5 and N6 are respectively A and C, one (C) of the fixed bases N5 and N6 is different from one (T) of the fixed bases N3 and N4. In a case where N5 and N6 are respectively A and G, one (G) of the fixed bases N5 and N6 is different from one (T) of the fixed bases N3 and N4. In a case where N5 and N6 are respectively T and C, one (C) of the fixed bases N5 and N6 is different from one (T) of the fixed bases N3 and N4. In a case where N5 and N6 are respectively T and G, one (G) of the fixed bases N5 and N6 is different from one (T) of the fixed bases N3 and N4. In a case where both of the two fixed bases N5 and N6 are different from each of the two fixed bases N3 and N4, still considering an example where N3 and N4 are respectively A and T, N5 and N6 may be respectively G and C. In a case where N5 and N6 are respectively G and C, both of the two fixed bases N5 and N6 (G and C) are different from each of the two fixed bases N3 and N4 (A and T). In a second situation, N5 and N6 are the same. In this situation, at least one of the fixed bases N5 and N6 is different from a fixed base of the fixed bases N3 and N4. That is, the two fixed bases N5 and N6 are different from one or two of the fixed bases N3 and N4. 
For example, still considering an example where N3 and N4 are respectively A and T, N5 and N6 may be both As, Ts, Cs or Gs. In a case where N5 and N6 are both As, the two fixed bases N5 and N6 are different from one (T) of the fixed bases N3 and N4. In a case where N5 and N6 are both Ts, the two fixed bases N5 and N6 are different from one (A) of the fixed bases N3 and N4. In a case where N5 and N6 are both Cs, the two fixed bases N5 and N6 are different from each of the two fixed bases N3 and N4. In a case where N5 and N6 are both Gs, the two fixed bases N5 and N6 are also different from each of the two fixed bases N3 and N4.
In these embodiments, similar to the above embodiments where every two random bases are separated by one fixed base (i.e., the two adjacent groups of fixed bases each include one fixed base), it is also possible to avoid the concentration of the same kind of fluorescence during sequencing (bases of the same kind are labeled with the same kind of fluorescence, so a run of such bases is prone to this concentration), thereby preventing inaccurate reading due to the concentration of the fluorescence, which improves the detection accuracy.
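The rule for two adjacent groups of fixed bases can be illustrated with a minimal sketch. It assumes one reading of the rule that is consistent with the examples listed above: two groups are acceptable when some base occurs in one group but not in the other. The function name `groups_differ` is hypothetical and is used only for illustration.

```python
def groups_differ(group_a: str, group_b: str) -> bool:
    """Return True if some base occurs in one group of fixed bases
    but not in the other (an assumed reading of the rule above)."""
    return set(group_a) != set(group_b)


# Examples taken from the cases discussed above.
# N3, N4 = A, A: any N5, N6 pair other than A, A is acceptable.
same_first_group = {pair: groups_differ("AA", pair)
                    for pair in ("CG", "CT", "GT", "TT", "GG", "CC", "AA")}

# N3, N4 = A, T: the listed N5, N6 pairs are all acceptable.
mixed_first_group = {pair: groups_differ("AT", pair)
                     for pair in ("AC", "AG", "TC", "TG", "GC", "AA", "TT")}

print(same_first_group)
print(mixed_first_group)
```

Under this reading, identical groups such as AA followed by AA are rejected, which matches the purpose of avoiding long runs of the same base.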
In some embodiments, there are three random bases.
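In these embodiments, by limiting the number of the random bases to three, it may be possible to label 4096 different DNA molecules (for example, 4^3 = 64 sequences per UMI, and 64 × 64 = 4096 combinations when two such UMIs are used together), thereby meeting application requirements.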
In some embodiments, the UMI 20 includes seven to eleven bases.
In these embodiments, by limiting the number of the bases included in the UMI 20 to seven to eleven, it may be possible to avoid not only an occupation of the data to be tested due to an excessively large length of the UMIs 20, but also detrimental effects on an improvement of the error tolerance rate (e.g., caused by too few fixed bases) and/or on labeling of a large number of DNA molecules (e.g., caused by too few random bases) due to an excessively short length of the UMIs 20.
In some embodiments, as shown in
In these embodiments, the two UMIs 20 may be a first UMI and a second UMI, respectively. In this case, there are two possible situations. In a first situation, the first UMI 20 may be located between a first sequencing primer sequence 111 and a first amplification primer sequence 112. The second UMI 20 may be located between a second sequencing primer sequence 121 and a second amplification primer sequence 122. The two UMIs 20 bind to each other through complementary pairing of at least a portion of bases thereof. In this situation, the adapter 10 is the same as a single-ended UMI adapter, which cannot track forward and reverse strands. In a second situation, as shown in
In some embodiments, as shown in
Some embodiments of the present disclosure provide an adapter ligation reagent. The adapter ligation reagent includes: a plurality of kinds of adapters 10, ligases, a buffer, etc. The plurality of kinds of adapters 10 are adapters 10 each as described above. The ligases may be, for example, DNA ligases or RNA ligases, and are used to promote a ligation of the plurality of kinds of adapters 10 and end-repaired DNA fragments. The buffer provides a stable pH environment for an adapter ligation reaction. Of the plurality of kinds of adapters, for every two kinds of adapters 10, at least one UMI 20 included in one kind of adapter 10 has at least one random base that is different from at least one random base of at least one UMI 20 included in another kind of adapter 10.
In these embodiments, the plurality of kinds of adapters 10 are all UMI adapters. At least one UMI 20 included in the UMI adapter includes at least one random base and at least one fixed base. Random bases corresponding to the plurality of kinds of adapters are selected from different bases, so that different DNA molecules may be labeled through the different UMI adapters to realize a sequencing of the plurality of different DNA molecules. Fixed bases are selected from the known fixed bases, so that the sequence to be tested and the UMI may be corrected when either of them acquires errors during amplification or sequencing, which reduces the introduction of false positive mutations.
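One way the known fixed bases might be used to detect and repair errors in an observed UMI is sketched below. The disclosure does not specify a correction procedure, so this is only an illustrative assumption: the template `NTTNTTN` follows the layout of three random bases separated by two fixed Ts, and the function names are hypothetical.

```python
from typing import List


def fixed_base_mismatches(observed: str, template: str) -> List[int]:
    """Return the positions where an observed UMI disagrees with the
    known fixed bases of the template ('N' marks a random position)."""
    if len(observed) != len(template):
        raise ValueError("observed UMI and template must have equal length")
    return [i for i, (obs, ref) in enumerate(zip(observed, template))
            if ref != "N" and obs != ref]


def restore_fixed_bases(observed: str, template: str) -> str:
    """Overwrite fixed positions with their known bases; random
    positions are kept exactly as observed."""
    if len(observed) != len(template):
        raise ValueError("observed UMI and template must have equal length")
    return "".join(obs if ref == "N" else ref
                   for obs, ref in zip(observed, template))


template = "NTTNTTN"          # assumed layout: N, two fixed Ts, N, two fixed Ts, N
observed = "ATGCTTG"          # hypothetical read with an error at a fixed position
errors = fixed_base_mismatches(observed, template)
repaired = restore_fixed_bases(observed, template)
print(errors, repaired)
```

A read with mismatches at fixed positions could alternatively be discarded rather than repaired; which policy is appropriate depends on the application and is not stated in the disclosure.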
Some embodiments of the present disclosure provide a kit. The kit includes the adapter ligation reagent described above.
That is, the kit may be an adapter ligation kit. A kit refers to a box used to contain chemical reagents for detection of chemical components, drug residues, virus types, etc.
The kit here refers to a box containing the adapter ligation reagent.
Beneficial technical effects of the kit in the embodiments of the present disclosure are the same as the beneficial technical effects of the adapter in the embodiments of the present disclosure, which will not be repeated here.
Some embodiments of the present disclosure provide an application of the UMI 20 in sequencing a gene. The UMI 20 includes the at least one random base and the at least one fixed base.
In some embodiments, the gene may include a deoxyribonucleic acid (DNA) molecule or a ribonucleic acid (RNA) molecule for expressing genetic information. UMIs 20 are configured to label different DNA molecules or RNA molecules.
For example, the gene may include a circulating free DNA (cfDNA). The UMIs 20 may be used in the UMI adapters to label different cfDNA molecules.
Some embodiments of the present disclosure provide a method for constructing a DNA library or an RNA library. The method includes the following steps.
Fragmented DNA is obtained.
The fragmented DNA may be obtained by mechanical breakage or enzymatic hydrolysis.
Of course, before the fragmented DNA is obtained, complementary DNA (cDNA) may be obtained by reverse transcription of mRNA, and the fragmented DNA may be obtained after the cDNA is broken.
In some embodiments, some DNA is cell-free DNA in blood, such as circulating free DNA (cfDNA), which is naturally fragmented and may be directly obtained from the blood or be commercially available. The cfDNA is a kind of DNA that is in a free and cell-free state outside the cell.
End-repair and adenine (A) addition are performed on the fragmented DNA or RNA to obtain end-repaired products.
For example, considering an example where the fragmented DNA is cfDNA, end-repair and A addition may be performed on the cfDNA by using a KAPA Biosystem (KAPA) kit.
The end-repaired products are treated with the adapter ligation reagent described above, so that the adapters in the adapter ligation reagent react with the end-repaired products to obtain adapter ligation products.
That is, the above adapter ligation reagent including the plurality of kinds of adapters is used to bind the adapters to the end-repaired products. Each end-repaired product may include a forward strand and a reverse strand. A single end-repaired product may bind with two adapters 10. In a case where each adapter 10 includes a single UMI 20, the adapters 10 are single-ended UMI adapters, which may label different end-repaired products but cannot track forward and reverse strands of the end-repaired products. In a case where the adapters 10 are double-ended UMI adapters, the forward and reverse strands of the end-repaired products may be tracked, so that it may be possible to mark a mutation as real in a case where mutated bases appear at a same position on both the forward and reverse strands, which may further improve the detection accuracy.
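The strand-agreement idea for double-ended UMI adapters can be sketched as follows. This is a simplified illustration, not the procedure of the disclosure: it assumes that variant calls for the forward and reverse strands of one original molecule have already been made and expressed against the same reference coordinates, and the function name `duplex_supported` is hypothetical.

```python
from typing import Dict


def duplex_supported(forward_calls: Dict[int, str],
                     reverse_calls: Dict[int, str]) -> Dict[int, str]:
    """Keep only mutated bases seen at the same position on both the
    forward-strand and reverse-strand reads of one original molecule."""
    return {pos: base for pos, base in forward_calls.items()
            if reverse_calls.get(pos) == base}


# Hypothetical calls: position -> mutated base (same reference orientation).
forward = {101: "T", 250: "G"}
reverse = {101: "T", 300: "A"}
real = duplex_supported(forward, reverse)
print(real)
```

Calls present on only one strand (positions 250 and 300 in this example) are treated as likely amplification or sequencing noise and are not marked as real mutations.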
The adapter ligation products are enriched to obtain the DNA library or the RNA library.
For example, the adapter ligation products may be enriched by PCR amplification.
Since the UMIs 20 in the adapters 10 each include the at least one random base and the at least one fixed base, and the random bases are selected from different bases, the UMIs 20 may label different DNAs depending on the different random bases. In addition, the fixed bases are selected from the known fixed bases, so that it is possible to correct a sequence to be tested and the UMI when either of them acquires errors in amplification or sequencing, which reduces the introduction of false positive mutations.
Some embodiments of the present disclosure provide a method for sequencing and detecting a gene. The method includes:
performing gene sequencing on DNA or RNA by using the DNA library or the RNA library obtained by the method for constructing the DNA library or the RNA library described above.
In the embodiments of the present disclosure, since the DNA library or the RNA library obtained by using the method for constructing the DNA library or the RNA library is used for gene sequencing, and DNA molecules or RNA molecules in the DNA library or the RNA library all bind with adapters 10 including UMIs 20, the DNA molecules or the RNA molecules may be labeled by the UMIs 20, and then fixed bases may be used to correct errors generated during sequencing or amplification in a subsequent sequencing process, which may reduce the introduction of false positive mutations and improve the detection accuracy.
Some embodiments of the present disclosure provide a kit. The kit includes: the DNA library or the RNA library obtained by the method for constructing the DNA library or the RNA library described above.
Of course, in some embodiments, the kit may further include a targeted capture kit. The targeted capture kit may include a targeted capture reagent. The targeted capture reagent may perform targeted capture by hybridization or multiplex PCR (which may be performed before an enrichment step in a library construction process), which both allow sequencing of selected genes.
Some embodiments of the present disclosure provide a UMI group. As shown in
That is, the two UMIs 20 may be located on the first strand 11 and the second strand 12 of the adapter 10. Reference may be made to the above description of the adapter 10 including two UMIs 20, which will not be repeated here.
Some embodiments of the present disclosure provide a method for preparing the adapter 10. The adapter 10 includes the at least one UMI 20. As shown in
In a step S1), the first strand 11 and the second strand 12 are synthesized. Each UMI 20 is located on the first strand 11 or the second strand 12, and the at least one UMI includes the at least one random base and the at least one fixed base.
For example, the first strand 11 and the second strand 12 may be respectively synthesized by chemical synthesis (i.e., DNA synthesis) rather than by biosynthesis.
Of course, in a case where the UMI group is obtained, a strand (e.g., the first strand 11) and a portion of another strand (e.g., the second strand 12) that is not complementary to the first strand 11 may be synthesized based on the UMI group, and then a portion of the second strand 12 that is complementary to the first strand 11 is synthesized through base complementary pairing.
In a step S2), the first strand 11 and the second strand 12 are annealed to obtain the adapter 10.
That is, when the two single strands, i.e., the first strand 11 and the second strand 12, are synthesized by the above step, the first strand 11 and the second strand 12 may bind to each other through complementary pairing of a portion of bases thereof by specific annealing.
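The complementary pairing that annealing relies on can be checked with a short sketch. It applies only to the portion of the two strands that is meant to pair; the example sequences are hypothetical and are not taken from the sequence listing.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def can_pair(region_a: str, region_b: str) -> bool:
    """True if two 5'->3' regions are fully complementary to each other."""
    return reverse_complement(region_a) == region_b


# Hypothetical paired region: a UMI on the first strand and the
# matching region on the second strand, both written 5'->3'.
first_strand_umi = "ATTCTTG"
second_strand_umi = reverse_complement(first_strand_umi)
print(second_strand_umi, can_pair(first_strand_umi, second_strand_umi))
```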
In order to objectively evaluate technical effects of the embodiments of the present disclosure, detailed exemplary description of embodiments of the present application is given through the following embodiments and experimental examples.
In a step 1), 64 first strands 11, each carrying a UMI molecular tag 20, and 64 second strands 12, each carrying a UMI molecular tag 20, are synthesized. The UMI 20 on each first strand 11 is located downstream of a first sequencing primer sequence 112 and includes three random bases N, every two random bases N being separated from each other by two fixed bases T, with a thio-modified end. The UMI 20 on each second strand 12 is located upstream of a second sequencing primer sequence 121 and includes three random bases N, every two random bases N being separated from each other by two fixed bases, with a phosphate group bound to an end.
A sequence of the first strand 11 is as shown in SEQ ID NO: 1 in a sequence listing, and a sequence of the second strand 12 is as shown in SEQ ID NO: 2 in the sequence listing.
The first strand 11 and the second strand 12 may also be as shown in Table 1 below.
In a case where the Ns of the UMI 20 on the first strand 11 are selected from the four different bases, there are 64 kinds of sequences for UMIs 20 on each of the first strands 11 and the second strands 12. The 64 kinds of sequences for the UMIs 20 are as shown in Table 2 below.
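The count of 64 follows from three random positions with four possible bases each (4 × 4 × 4 = 64). A minimal sketch of the enumeration is given below; it assumes the first-strand layout described in step 1), written here as the template `NTTNTTN`, and the function name `expand_template` is hypothetical. The actual sequences are those listed in Table 2.

```python
from itertools import product
from typing import List


def expand_template(template: str, alphabet: str = "ACGT") -> List[str]:
    """Replace every 'N' in the template with each base of the alphabet,
    returning all resulting UMI sequences."""
    n_random = template.count("N")
    sequences = []
    for combo in product(alphabet, repeat=n_random):
        bases = iter(combo)
        sequences.append("".join(next(bases) if c == "N" else c
                                 for c in template))
    return sequences


umis = expand_template("NTTNTTN")   # assumed layout of the first-strand UMI
print(len(umis))                    # number of distinct UMI sequences
print(len(umis) * len(umis))        # pairs when two such UMIs are combined
```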
In a step 2), the first strands 11 and the second strands 12 that pair with each other are resuspended to 100 μM by using 100 μL of a buffer reagent. The buffer reagent includes: 10 mM of trihydroxymethyl aminomethane (Tris), which keeps the pH of the buffer reagent at 7.5, 2 mM of ethylenediaminetetraacetic acid (EDTA) and 50 mM of NaCl.
In a step 3), 10 μL of the first strands 11, 10 μL of the second strands 12 and 80 μL of the buffer reagent are taken into a PCR tube, mixed well and briefly centrifuged.
In a step 4), the PCR tube is placed in a PCR instrument to react at a program temperature of 95° C. for 10 min; the PCR instrument is turned off after the reaction is completed; and the PCR tube is removed after its temperature drops to room temperature (cooling takes about 2 h, the room temperature being about 25° C.).
In a step 5), 1 μL of a sample in the PCR tube is taken to perform quality inspection by an automatic nucleic acid and protein analyzer (Qsep100). The result is as shown in
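As a quick check of the concentrations implied by steps 2) and 3), the dilution arithmetic can be sketched as follows. It assumes that each strand stock is at 100 μM, as stated in step 2); the function name `final_concentration_um` is hypothetical.

```python
def final_concentration_um(stock_um: float, stock_ul: float,
                           total_ul: float) -> float:
    """Concentration after dilution, from C1 * V1 = C2 * V2."""
    return stock_um * stock_ul / total_ul


# Step 3): 10 uL of first strands + 10 uL of second strands + 80 uL of buffer.
total_volume_ul = 10 + 10 + 80
per_strand_um = final_concentration_um(100, 10, total_volume_ul)
print(total_volume_ul, per_strand_um)
```

Under these assumptions each strand is present at 10 μM in the 100 μL annealing mix.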
Steps in Embodiment 2 are substantially the same as the steps in Embodiment 1, which will not be repeated here. A difference is that, in a step 1) here, a portion of the fixed bases of the UMIs on the first strands 11 and the second strands 12 are changed.
In Embodiment 2, a sequence of a first strand 11 is as shown in SEQ ID NO: 131 in the sequence listing, and a sequence of a second strand 12 is as shown in SEQ ID NO: 132 in the sequence listing.
The first strand 11 and the second strand 12 may also be as shown in Table 3 below.
In a case where the Ns of the UMI 20 on the first strand 11 are selected from the four different bases, there are 64 kinds of sequences for UMIs 20 on each of the first strands 11 and the second strands 12. The 64 kinds of sequences for the UMIs 20 are as shown in Table 4 below.
Steps in Embodiment 3 are substantially the same as the steps in Embodiment 1, which will not be repeated here. A difference is that, in a step 1) here, a portion of the fixed bases of the UMIs on the first strands 11 and the second strands 12 are changed.
In Embodiment 3, a sequence of a first strand 11 is as shown in SEQ ID NO: 261 in the sequence listing, and a sequence of a second strand 12 is as shown in SEQ ID NO: 262 in the sequence listing.
The first strand 11 and the second strand 12 may also be as shown in Table 5 below.
In a case where the Ns of the UMI 20 in the first strand 11 are selected from the four different bases, there are 64 kinds of sequences for UMIs 20 on first strands 11 and second strands 12. The 64 kinds of sequences of the UMI 20 are as shown in Table 6 below.
In a step 1), the cfDNA standards with a plurality of mutation sites and mutation frequencies of 1% and 0.1% customized from GeneWell are used as samples. The used standards are cfDNA samples, which do not need to be fragmented and may be directly used for constructing a library.
In a step 2), end repair and A addition are performed on the cfDNA by using a KAPA kit.
In a step 3), the cfDNA is bound with adapters by using a KAPA kit and the adapters synthesized in Embodiment 1, so as to obtain adapter ligation products.
In a step 4), the adapter ligation products are amplified, enriched and purified to obtain a cfDNA library.
In a step 5), targeted capture is performed on the adapter ligation products by using a complete kit of Integrated DNA Technologies (IDT) to obtain adapter ligation products of selected genes.
In a step 6), the cfDNA library obtained in the step 4) is used as the sample and is sequenced on a NovaSeq 6000 (Illumina) instrument according to a routine use of the instrument.
In a step 7), the FastQC software is used to perform basic quality control on the sequencing output data. Actual detected sites and mutations are substantially consistent with theoretical values. Specific detection results are as shown in Tables 7 and 8 below.
Steps in Experimental example 2 are substantially the same as the steps in Experimental example 1, which will not be repeated here. A difference is that adapters synthesized in Embodiment 2 are used for constructing a library in a step 3) here. Actual detected sites and mutations are also substantially consistent with the theoretical values. Specific detection results are as shown in Tables 7 and 8 below.
Steps in Experimental example 3 are substantially the same as the steps in Experimental example 1, which will not be repeated here. A difference is that adapters synthesized in Embodiment 3 are used for library construction in a step 3) here. Actual detected sites and mutations are also substantially consistent with the theoretical values. Specific detection results are as shown in Tables 7 and 8 below.
In Table 7, the actually detected mutation frequencies of the different mutation sites of the selected genes in Experimental example 1 are substantially between 0.089% and 0.12%, which are relatively accurate compared with the theoretical mutation frequency (0.1%); the actually detected mutation frequencies of the different mutation sites of the selected genes in Experimental example 2 are substantially between 0.081% and 0.150%, which are also relatively accurate compared with the theoretical mutation frequency; and the actually detected mutation frequencies of the different mutation sites of the selected genes in Experimental example 3 are substantially between 0.079% and 0.140%, which are still relatively accurate compared with the theoretical mutation frequency.
In Table 8, the actually detected mutation frequencies of the different mutation sites of the selected genes in Experimental example 1 are substantially between 0.80% and 1.20%, which are relatively accurate compared with the theoretical mutation frequency (1%); the actually detected mutation frequencies of the different mutation sites of the selected genes in Experimental example 2 are substantially between 0.85% and 1.30%, which are also relatively accurate compared with the theoretical mutation frequency; and the actually detected mutation frequencies of the different mutation sites of the selected genes in Experimental example 3 are substantially between 0.78% and 1.25%, which are still relatively accurate compared with the theoretical mutation frequency.
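To make "relatively accurate" more concrete, the reported ranges can be expressed as relative deviations from the theoretical mutation frequencies. The sketch below uses only the range endpoints quoted in the text for Tables 7 and 8; the per-site values of the tables are not reproduced, and the names used are hypothetical.

```python
def relative_deviation(observed: float, expected: float) -> float:
    """Signed deviation of an observed frequency from the expected one,
    as a fraction of the expected value."""
    return (observed - expected) / expected


# (theoretical frequency in %, {experimental example: (low %, high %)})
REPORTED_RANGES = {
    "Table 7": (0.1, {1: (0.089, 0.12), 2: (0.081, 0.150), 3: (0.079, 0.140)}),
    "Table 8": (1.0, {1: (0.80, 1.20), 2: (0.85, 1.30), 3: (0.78, 1.25)}),
}

for table, (expected, ranges) in REPORTED_RANGES.items():
    for example, (low, high) in ranges.items():
        low_dev = relative_deviation(low, expected)
        high_dev = relative_deviation(high, expected)
        print(f"{table}, Experimental example {example}: "
              f"{low_dev:+.0%} to {high_dev:+.0%} of theoretical")
```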
To sum up, by using the UMIs partially composed of fixed bases, it may be possible to ensure the diversity of the adapters to label the different original DNA fragments; and moreover, the noise mutations introduced by PCR amplification or sequencing may be avoided to a certain extent, which improves the detection accuracy.
The foregoing descriptions are merely specific implementations of the present disclosure. However, the protection scope of the present disclosure is not limited thereto. Changes or replacements that any person skilled in the art could conceive of within the technical scope of the present disclosure shall be included in the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
This application is a national phase entry under 35 USC 371 of International Patent Application No. PCT/CN2021/134159 filed on Nov. 29, 2021, which is incorporated herein by reference in its entirety.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2021/134159 | 11/29/2021 | WO | |