The Sequence Listing associated with this application is part of the application and is provided in the form of an ASCII text file in lieu of a paper copy, and is hereby incorporated by reference into the specification. The name of the Sequence Listing file is "SEQUENCE LISTING.txt". The text file is 2057 bytes in size, was created on May 24, 2022, and is being electronically submitted via EFS-Web.
This application is a continuation of International Patent Application No. PCT/CN2022/077072, with an international filing date of Feb. 21, 2022, designating the United States, now pending, which is based on Chinese Patent Application No. 202111422884.6, filed on Nov. 26, 2021. The contents of these specifications are incorporated herein by reference.
The present disclosure relates to the technical field of high-throughput sequencing, and in particular, relates to a method for high-throughput sequencing based on an internal reference with a known index.
With the development of genomics technology, high-throughput sequencing technology, also referred to as next-generation sequencing (NGS), has been widely used in several fields. In infectious disease prevention and control, it is used for the investigation of outbreaks of infectious diseases in hospitals, identification of unknown pathogens, and detection of resistance gene mutations in pathogens. In the early diagnosis and precise treatment of tumors (for example, lung cancer, breast cancer, gastrointestinal tumors, melanoma, and the like), it is used for detection of driver gene mutations associated with individualized tumor treatment, tumor genomics research, and exploration of tumor heterogeneity, drug resistance, and the process and mechanism of tumor clonal evolution. In the early screening and diagnosis of genetic diseases, it is used for genetic disease diagnosis, neonatal screening, prenatal screening, pre-implantation screening, and other applications such as genetic deafness and non-invasive screening. High-throughput sequencing technology is developing rapidly toward greater convenience and economy. Because high-throughput sequencing sequences tens of thousands of DNA molecules at the same time, more samples are mixed and loaded together, and the samples are typically distinguished from each other in library preparation by means of adapters (Y-adapters, U-adapters, blunt-ended adapters, and bubble-adapters) or tags (barcodes or indexes) introduced by PCR amplification.
Studies have found that sequencing platforms based on ExAmp (exclusive amplification), such as HiSeq 3000/4000, HiSeq X Ten and NovaSeq, have the problem of index error distribution (i.e., index hopping) when samples are mixed and loaded for sequencing, with a sample error distribution rate exceeding 1% and index hopping rates as high as 6% for PCR-free libraries. Even with the cumbersome non-combinatorial dual index solution, the index contamination rate may only be reduced by 0.08%.
In June 2018, researchers from Shenzhen BGI studied the problem of index hopping on the DNB sequencing platform using three mainstream library preparation methods. The BGISEQ sequencer utilizes the unique DNA nanoball (DNB) sequencing technology to perform library amplification based on rolling circle replication (RCR). This linear amplification may avoid the error accumulation associated with conventional PCR. The DNB-based NGS application achieves a sample error distribution rate as low as 0.0001% to 0.0004% using only a single index. In addition, in a blank control in which water is used instead of DNA and an index is added, the probability of mismatch on the DNB sequencing platform is one in 36 million reads, i.e., 0.0000028%. For the PCR-free library, the average contamination rate is about 0.0004%.
The inventors have found that while the index mismatch rate is lower on the BGI platform compared to the Illumina platform, index hopping is present on both platforms. More importantly, it is difficult to monitor index hopping.
An object of embodiments of the present disclosure is to provide a method for high-throughput sequencing based on an internal reference with a known index, which is capable of monitoring index hopping in the process of sequencing based on the internal reference.
In view of the above, the embodiments of the present disclosure provide a method for high-throughput sequencing based on an internal reference with a known index. The technical solutions are as follows:
A method for high-throughput sequencing based on an internal reference with a known index includes:
generating a random DNA sequence, adding a single-ended adapter DNA sequence containing the known index at both ends of the DNA sequence to obtain an internal reference sequence, and synthesizing a sequencing quality control sequence based on the internal reference sequence;
performing, based on the sequencing quality control sequence, high-throughput sequencing on a library of samples to be tested to obtain sequencing data;
performing result analysis on the sequencing data to obtain a sample error distribution rate of the library of samples to be tested, and ending the high-throughput sequencing.
Compared with the related art, the embodiments of the present disclosure mainly achieve the following beneficial effects:
According to the present disclosure, index hopping may be effectively monitored based on the internal reference sequence containing the known index, and index hopping is reflected by data, which helps the experimenter analyze the causes of index hopping and adjust the experimental scheme accordingly.
For clearer descriptions of technical solutions according to the embodiments of the present disclosure, drawings that are to be referred to for description of the embodiments are briefly described hereinafter. Apparently, the drawings described hereinafter merely illustrate some embodiments of the present disclosure. Persons of ordinary skill in the art may also derive other drawings based on the drawings described herein without any creative effort.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. The terms used herein in the specification of the present disclosure are only intended to illustrate the specific embodiments of the present disclosure, instead of limiting the present disclosure. The terms “comprise,” “include,” and any variations thereof in the specification and claims of the present disclosure and in the description of the drawings are intended to cover a non-exclusive inclusion. Terms such as “first,” “second,” and the like in the specification, claims, or the accompanying drawings of the present disclosure are intended to distinguish different objects, but are not intended to define a specific sequence.
The term “embodiment” in this specification signifies that the specific characteristics, structures or features described with reference to the embodiments may be covered in at least one embodiment of the present disclosure. This term, when it appears in various positions of the specification, neither indicates the same embodiment, nor indicates an independent or optional embodiment that is exclusive of the other embodiments. A person skilled in the art would implicitly or explicitly understand that the embodiments described in this specification may be incorporated with other embodiments.
The following embodiments are given for better understanding of the present disclosure, rather than for limiting the present disclosure. In the embodiments described hereinafter, the experimental methods, unless otherwise specified, are all routine methods. In the embodiments described hereinafter, the experimental materials, unless otherwise specified, are all purchased from common biochemical reagent suppliers.
To make a person skilled in the art better understand the technical solutions of the embodiments of the present disclosure, the technical solutions of the present disclosure are clearly and completely described with reference to the accompanying drawings of the embodiments of the present disclosure.
Still referring to
S1: generating a random DNA sequence, adding a single-ended adapter DNA sequence containing the known index at both ends of the random DNA sequence to obtain an internal reference sequence, and synthesizing a sequencing quality control sequence based on the internal reference sequence;
S2: performing, based on the sequencing quality control sequence, high-throughput sequencing on a library of samples to be tested to obtain sequencing data; and
S3: performing result analysis on the sequencing data to obtain a sample error distribution rate of the library of samples to be tested, and ending the high-throughput sequencing.
In this embodiment, the single-ended adapter sequence is a single-ended adapter sequence with a known index provided by the BGI Group. The sequencing platform used for high-throughput sequencing of the library of samples to be tested with the sequencing quality control sequence is a sequencer manufactured by the BGI Group, and MGI 2000 sequencing is performed. According to the present disclosure, index hopping may be monitored based on an internal reference sequence with a known index, and index hopping is reflected by data. This helps the experimenter analyze the causes and adjust the experimental scheme.
It should be noted that the specific single-ended adapter sequence according to the present disclosure includes, but is not limited to, the single-ended adapter sequence provided by the BGI Group, and the sequencing platform used includes, but is not limited to, the sequencer manufactured by the BGI Group. Alternatively, an adapter sequence supplied by Illumina and a corresponding sequencer manufactured by Illumina may also be used.
I. Generation and Screening of Random DNA Sequence:
Specifically, generating the random DNA sequence includes:
selecting a reverse virus sequence by a species-specific screening algorithm; and
cutting the selected reverse virus sequence into a preset size, making a comparison against a pathogen database and a host database, and determining the reverse virus sequence as the random DNA sequence in the case that the selected reverse virus sequence is not present in the pathogen database or the host database.
In this embodiment, the screening process of the random sequence is as follows: a reverse virus sequence is directly selected by a species-specific screening algorithm; the selected sequence is then cut into 150 bp fragments by jellyfish; the fragments are aligned against the pathogen database and the host database by BLASTN; and in the case that a fragment aligns to neither the host nor any species, the sequence is determined as the random sequence.
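The reversal and cutting steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the screening algorithm of the embodiment: the function names `reverse_sequence` and `cut_into_fragments` are hypothetical, "reverse" is assumed to mean reversing the base order without complementing, plain string slicing into non-overlapping windows stands in for the jellyfish-based cutting, and the BLASTN comparison against the pathogen and host databases is not reproduced.

```python
def reverse_sequence(seq: str) -> str:
    """Return the sequence read back to front (no complementing; an assumption)."""
    return seq[::-1]


def cut_into_fragments(seq: str, size: int = 150) -> list[str]:
    """Split seq into consecutive, non-overlapping fragments of `size` bases.

    A trailing piece shorter than `size` is dropped.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


# Toy input for illustration only, not a real viral genome.
toy_genome = "ACGT" * 100  # 400 bases
candidates = cut_into_fragments(reverse_sequence(toy_genome), size=150)
print(len(candidates), [len(c) for c in candidates])
```

Each candidate fragment would then be compared against the pathogen and host databases, and only fragments without a hit would be kept as random sequences.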
The fragment size of the random sequence generated in the present disclosure is the size of the fragment in the library of samples to be tested minus the size of the two single-ended adapter sequences. That is, the fragment size of the internal reference sequence generated from the random sequence is determined according to the size of the fragment in the library of samples to be tested. For example, the fragment size in the library of samples to be tested in the present disclosure is about 250 bp; in this case, the fragment size of the random sequence is set as 150 bp, a specific single-ended adapter sequence is added at both ends of the random sequence, and the fragment size of the obtained internal reference sequence is 246 bp. The random sequence is a gene sequence of a non-pathogenic pathogen. The random sequence is a bacteriophage sequence or a phytogenic pathogen sequence, wherein both the bacteriophage sequence and the phytogenic pathogen sequence are gene sequences of a non-pathogenic pathogen.
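The size relationship stated above can be written out as simple arithmetic. In the sketch below, the function names are hypothetical, and the combined adapter length of 96 bp is not stated in the text; it is only inferred here from the 246 bp internal reference size and the 150 bp random sequence size given in the example.

```python
def adapters_total_bp(internal_reference_bp: int, random_bp: int) -> int:
    """Combined length of the two adapters implied by the stated sizes."""
    return internal_reference_bp - random_bp


def max_random_bp(library_fragment_bp: int, adapters_bp: int) -> int:
    """Random sequence size = library fragment size minus both adapters."""
    return library_fragment_bp - adapters_bp


adapters_bp = adapters_total_bp(internal_reference_bp=246, random_bp=150)
print(adapters_bp)                      # 96 (inferred, not stated in the text)
print(max_random_bp(250, adapters_bp))  # 154, close to the 150 bp actually chosen
```

Under this reading, a library fragment size of about 250 bp leaves room for a random sequence of roughly 150 bp, which matches the value used in the example.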
II. Synthesis of Sequencing Quality Control Sequence: A BGI Single-Ended Adapter is Added and Cloned into a pUC57 Vector.
Synthesizing the sequencing quality control sequence based on the internal reference sequence includes:
cloning the internal reference sequence into the pUC57 vector, and synthesizing the sequencing quality control sequence.
1. Synthesis of Sequencing Quality Control Sequence (i.e., Internal Reference Sequence Plasmid):
Sequences SEQ ID NO:1, SEQ ID NO:2, SEQ ID NO:3, and SEQ ID NO:4 in Table 1 are synthesized by Sangon Biotech (Shanghai) Co. Ltd.
In SEQ ID NO:1, gaacgacatggctacgatccgactt and aagtcggaggccaagcggtcttaggaagacaataggtccgatcaactccttggctcaca denote single-ended adapter sequences, and taggtccgat denotes a sequence (10 bp) with a known index (barcode).
In SEQ ID NO:2, gaacgacatggctacgatccgact and aagtcggaggccaagcggtcttaggaagacaaggacggaatccaactccttggctcaca denote single-ended adapter sequences, and GGACGGAATC denotes a sequence with a known index.
In SEQ ID NO:3, gaacgacatggctacgatccgactt and aagtcggaggccaagcggtcttaggaagacaacttactgccgcaactccttggctcaca denote single-ended adapter sequences, and CTTACTGCCG denotes a sequence with a known index.
In SEQ ID NO:4, gaacgacatggctacgatccgactt and aagtcggaggccaagcggtcttaggaagacaaacctaattgacaactccttggctcaca denote single-ended adapter sequences, and ACCTAATTGA denotes a sequence with a known index.
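The position of the known index inside each adapter can be checked with a short Python sketch. The four adapter strings are copied from the description above; the split into a shared 32-base prefix and a shared 17-base suffix is an observation made here by comparing those four strings, and the names `PREFIX`, `SUFFIX`, and `extract_index` are hypothetical.

```python
# Flanks shared by the four index-containing adapters (inferred by comparison).
PREFIX = "aagtcggaggccaagcggtcttaggaagacaa"
SUFFIX = "caactccttggctcaca"

ADAPTERS = {
    "SEQ ID NO:1": "aagtcggaggccaagcggtcttaggaagacaataggtccgatcaactccttggctcaca",
    "SEQ ID NO:2": "aagtcggaggccaagcggtcttaggaagacaaggacggaatccaactccttggctcaca",
    "SEQ ID NO:3": "aagtcggaggccaagcggtcttaggaagacaacttactgccgcaactccttggctcaca",
    "SEQ ID NO:4": "aagtcggaggccaagcggtcttaggaagacaaacctaattgacaactccttggctcaca",
}


def extract_index(adapter: str) -> str:
    """Return the bases between the shared prefix and suffix of an adapter."""
    adapter = adapter.lower()
    if not (adapter.startswith(PREFIX) and adapter.endswith(SUFFIX)):
        raise ValueError("adapter does not carry the expected flanking sequences")
    return adapter[len(PREFIX):len(adapter) - len(SUFFIX)]


for name, adapter in ADAPTERS.items():
    print(name, extract_index(adapter))
```

Each extracted index is 10 bases long and agrees with the index named for the corresponding sequence (compared case-insensitively).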
2. Identification of First-Generation Sequencing on Sequencing Quality Control Sequence (i.e., Internal Reference Sequence Plasmid):
Sequencing results are as illustrated in
As seen from the sequencing peak maps in
3. Amplification of Sequencing Quality Control Sequence (i.e., Internal Reference Sequence Plasmid):
The sequencing quality control sequence is amplified by a designated PCR amplification primer to obtain a target sequencing quality control sequence.
1) The plasmid is dissolved, and 60 ng of the dissolved plasmid is subjected to amplification in accordance with the following reaction system and reaction procedure to obtain a sequence containing a BGI single barcode.
2) Agarose Electrophoresis
Whether an amplification product is consistent with an expected product is detected by agarose gel electrophoresis. Specifically, an agarose gel is formulated, a sample is added into the agarose gel, and an electrophoresis experiment is carried out by an electrophoresis apparatus to obtain an electrophoresis result image; and whether a band position is consistent with an expected position is determined according to the electrophoresis result image.
1.2 g of agarose is added into 100 mL of 1×TAE electrophoresis buffer, and shaken well. The solution is heated in a microwave oven until the agarose is completely dissolved. The solution is cooled to 60° C., 6 μL Gelred fluorescent dye is added, and the solution is shaken well.
The dissolved agarose is poured into a combed gel plate and solidified by cooling at room temperature.
The gel is placed in an electrophoresis tank, 1×TAE electrophoresis buffer is added until the gel is covered by the buffer by a height of 1-2 mm, and the comb is carefully pulled upwards vertically.
2 μL of PCR amplification product is pipetted onto a parafilm, 2 μL of 3× Loading Buffer is added and mixed well, the mixture is carefully loaded into the sample wells, and 6 μL of DNA marker (DL2000 of Tiangen) is added into the last well.
The power switch is turned on, the voltage is adjusted to 150 V, and the electrophoresis time is 30 min. Through observation, the band of bromophenol blue moves from a negative electrode to a positive electrode.
Upon completion of electrophoresis, imaging is carried out by a Biorad gel imager, a picture obtained by imaging is saved, and a picture of the electrophoresis result upon amplification (upon purification) is shown in
3) Qsep Analysis
A concentration of the products of PCR amplification obtained in 2) is measured by Qubit, and the concentrations of SEQ ID NO:1, SEQ ID NO:2, SEQ ID NO:3 and SEQ ID NO:4 are respectively: 12.5, 14.7, 21.6, and 19.2 ng/μL.
The sample is diluted to a loading concentration (1-2 ng/μL) recommended by Qsep 100, and then fragment size analysis is performed according to the instructions for Qsep 100. The obtained results are shown in
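The dilution to the Qsep 100 loading range follows from C1·V1 = C2·V2. The sketch below applies it to the four Qubit concentrations reported above; the function name `diluent_volume_ul`, the 1 μL sample volume, and the 1.5 ng/μL target (the midpoint of the 1-2 ng/μL range) are assumptions made for illustration, not values from the embodiment.

```python
def diluent_volume_ul(stock_ng_per_ul: float, target_ng_per_ul: float,
                      sample_ul: float = 1.0) -> float:
    """Volume of diluent to add to `sample_ul` of stock to reach the target."""
    if target_ng_per_ul <= 0 or target_ng_per_ul > stock_ng_per_ul:
        raise ValueError("target must be positive and not exceed the stock")
    # C1*V1 = C2*V2  ->  V2 = V1*C1/C2; the diluent is V2 - V1.
    return sample_ul * (stock_ng_per_ul / target_ng_per_ul - 1.0)


# Qubit concentrations (ng/uL) reported for the four amplified sequences.
measured = {"SEQ ID NO:1": 12.5, "SEQ ID NO:2": 14.7,
            "SEQ ID NO:3": 21.6, "SEQ ID NO:4": 19.2}

for name, conc in measured.items():
    print(f"{name}: add {diluent_volume_ul(conc, 1.5):.2f} uL diluent per 1 uL sample")
```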
4. MGI 200 Sequencing of Target Sequencing Quality Control Sequence Together with Normal Library
The amount of the target sequencing quality control sequence to be added (the addition amount) is determined, and the target sequencing quality control sequence is added into the library of samples to be tested according to the addition amount to obtain a mixed library; the mixed library is subjected to high-throughput sequencing to obtain the sequencing data.
1) Library pooling: According to the addition amount, the amplified and purified target sequencing quality control sequence is mixed with four standard libraries in an equal amount to form a mixed library, and a concentration of the mixed library is quantitatively determined again.
2) Cyclization:
A test is carried out according to the instructions for the BGI cyclization kit (Article No.: 1000005259, Kit Version No.: V2.0).
3) DNB Preparation and Loading Sequencing
The DNB is prepared and loading sequencing is carried out according to the instructions for BGI MGISEQ-200 FCL SE 50.
5. Analysis of Off-Machine Data
Obtaining a sample error distribution rate by performing result analysis on the sequencing data includes:
performing statistical collection based on the sequencing data to obtain the total number of sequenced sequences, the number of internal reference sequences with known indexes, and the number of sequences specifically detected for pathogens; and determining the sample error distribution rate based on the total number of sequenced sequences, the number of internal reference sequences with known indexes, and the number of sequences specifically detected for pathogens. The library of samples to be tested is a pathogen library.
After the off-machine sequencing result (fq.gz file) is obtained, data filtering is performed. Specifically, low-quality data, adapters, human host sequences, and the like are removed. The sequencing data is aligned with the original known sequences by the BWA alignment tool, and analysis is carried out by the pathogen analysis process developed in the laboratory. The results are shown in Table 3:
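The text names the three counts used but does not give an explicit formula for the sample error distribution rate. The sketch below shows one plausible reading, stated here as an assumption: among the reads assigned to a quality control index, the rate is the fraction that are pathogen-specific, since the quality control itself contains no pathogen sequence. The function name and the read counts are hypothetical and are not taken from Table 3.

```python
def sample_error_distribution_rate(total_reads: int,
                                   internal_reference_reads: int,
                                   pathogen_reads: int) -> float:
    """Fraction of reads under a QC index that are pathogen-specific.

    Assumed definition (not given in the text): pathogen_reads / total_reads.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if internal_reference_reads + pathogen_reads > total_reads:
        raise ValueError("counts exceed the total number of reads")
    return pathogen_reads / total_reads


# Hypothetical counts for one quality control index.
rate = sample_error_distribution_rate(total_reads=1_000_000,
                                      internal_reference_reads=990_000,
                                      pathogen_reads=250)
print(f"{rate:.6%}")
```

The internal reference count is used here only as a sanity check; a different definition, for example one normalized by the internal reference reads, would change the denominator but not the overall approach.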
Escherichia coli
Pseudomonas aeruginosa
Aspergillus fumigatus
Vibrio vulnificus
Acinetobacter baumannii
Moraxella osloensis
Haemophilus influenzae
Cryptococcus gattii
Streptococcus pneumoniae
Staphylococcus aureus
Neisseria flava
Brucella intermedia
Klebsiella pneumoniae
Porphyromonas gingivalis
Neisseria flavescens
Mycoplasma hyorhinis
As seen from the results in Table 3-1 and Table 3-2, the pathogen analysis result of each sequencing quality control contains, to a greater or lesser extent, reads of some pathogens (a read being a sequence obtained from one reaction in high-throughput sequencing). This indicates that index hopping is present in this batch of data, and that the pathogens involved in each hopping event are related to the samples involved in each pooling. This also indicates that the sequencing quality control can be used as an indicator to monitor index hopping. If the hopping species are related to a clinical causative agent, this provides a reference for the interpreter and makes the clinical results more accurate.
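The relationship between hopped species and pooled samples can be illustrated by a simple lookup. In the sketch below, the function name `trace_hopped_species`, the sample labels, and the species assignments are made up for illustration; they are not the data of Table 3.

```python
def trace_hopped_species(qc_species: set[str],
                         pooled: dict[str, set[str]]) -> dict[str, list[str]]:
    """For each species seen under a QC index, list the pooled samples containing it."""
    return {
        species: sorted(name for name, found in pooled.items() if species in found)
        for species in sorted(qc_species)
    }


# Made-up example: species detected in one quality control and in two pooled samples.
qc_species = {"Escherichia coli", "Klebsiella pneumoniae"}
pooled = {
    "sample_A": {"Escherichia coli", "Staphylococcus aureus"},
    "sample_B": {"Klebsiella pneumoniae"},
}
print(trace_hopped_species(qc_species, pooled))
```

A species that appears under the quality control index and in a pooled sample points to that sample as a likely source of the hopped reads; a species with no matching sample would suggest another source of contamination.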
It should be understood that, although the various steps in the flowcharts of drawings are shown in order as indicated by the arrows, the steps are not necessarily performed in the order indicated by the arrows. The steps are performed in no strict order unless explicitly stated herein, and may be performed in other orders. Furthermore, at least some of the steps in the flowcharts of the drawings may include sub-steps or stages, which are not necessarily performed at the same time, but may be performed at different times, or in different orders, and may be performed in turn or in alternation with at least some of the other steps or sub-steps or stages of other steps.
It is apparent that the embodiments described above are only exemplary ones, but not all embodiments of the present disclosure, and that the attached drawings illustrate exemplary embodiments of the present disclosure but do not limit the scope of the present disclosure. The present disclosure may be embodied in many different forms and, on the contrary, these embodiments are provided to make the disclosure of the present disclosure more thoroughly and completely understood. Although the present disclosure has been described in detail with reference to the above-mentioned embodiments, those skilled in the art will be able to make modifications to the technical solutions disclosed in the above-mentioned specific embodiments or make equivalent substitutions for some of the technical features. Any equivalent structure made based on the specification and accompanying drawings of the present disclosure, even if being directly or indirectly applied to some other related technical fields, shall all fall within the protection scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202111422884.6 | Nov 2021 | CN | national |
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2022/077072 | Feb 2022 | US |
Child | 17720285 | | US |