This Patent Application comprises a sequence listing presented in an ASCII text file entitled, “515.0010c1 Sequence List_ ST25.txt” which was created on Aug. 19, 2022, and which has a file size of 1 kilobyte, and which is hereby incorporated by reference into this Patent Application in its entirety.
Deoxyribonucleic Acid (DNA) is made of nucleotides that are typically arranged in a double helix of sequential base pairs. DNA sequencers process a sample, such as soil or tissue, to characterize the genetic content of the sample. The sequencing data from the DNA sequencers is often used for medical and biological research. For example, the DNA sequencing data from human and animal tissue is often used for cancer treatment. In another example, the DNA sequencing data from soil and waste may be used to research bacterial species like Escherichia coli and Klebsiella pneumoniae.
Various DNA sequencers and sequencing techniques are available that produce prodigious amounts of sequencing data. These giant data products are further complicated by a variety of different data formats and biological information represented by the data.
DNA sequencing copies (amplifies) relatively small portions of the DNA across various regions in a genetic sample. A large number of DNA copies (amplicons) are often generated for target regions in the genetic material. The DNA sequencer then reads these amplicons to generate sequencing data. Amplicon read depth refers to the number of amplicons with the same nucleotide pattern from the same genome location. For example, E. coli has multiple DNA variants, and the read depth for one of the variants is the number of amplicons that correspond to the reference sequence for that DNA variant.
Sequencing data may be interpreted to characterize the genetic content in a sample. For example, K. pneumoniae DNA may be processed to characterize various regions of the bacterial genome at differing read depths. The sequencing data is further processed to characterize details like Single Nucleotide Polymorphisms (SNPs) for antibiotic resistance.
To identify a SNP, the amplicon sequencing analysis pipeline compares the sequencing data to a reference sequence in an alignment process. Reference sequences may come in various lengths relative to the sequencing data from the sample. The amount of overlap between the amplicon sequencing data and the reference sequence is referred to as amplicon coverage. The ratio of identity between the amplicon sequencing data and the reference sequence is referred to as amplicon identity. High amplicon coverage and high amplicon identity are desired when matching DNA sequence data to a DNA reference sequence.
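The two measures defined above can be illustrated with a minimal sketch. It assumes a pairwise alignment is already available as two gap-padded rows of equal length, treats amplicon coverage as the fraction of reference positions overlapped by the read, and treats amplicon identity as the fraction of overlapped positions that match. The function name and these exact formulas are illustrative assumptions and are not taken from the ASAP system.

```python
from typing import Tuple


def coverage_and_identity(aligned_read: str, aligned_ref: str) -> Tuple[float, float]:
    """Return (coverage, identity) for two gap-padded alignment rows.

    Assumed definitions (illustrative only):
      coverage = overlapped reference positions / total reference positions
      identity = matching positions / overlapped positions
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("alignment rows must have equal length")

    # Reference length excludes gap characters inserted by the aligner.
    ref_length = sum(1 for base in aligned_ref if base != "-")

    overlapped = 0
    matches = 0
    for read_base, ref_base in zip(aligned_read, aligned_ref):
        # A column counts as overlap only when both rows carry a base.
        if read_base == "-" or ref_base == "-":
            continue
        overlapped += 1
        if read_base == ref_base:
            matches += 1

    coverage = overlapped / ref_length if ref_length else 0.0
    identity = matches / overlapped if overlapped else 0.0
    return coverage, identity


# The read spans six of eight reference positions with one mismatch.
coverage, identity = coverage_and_identity("ACGTAC--", "ACGAACTT")
```

In this example the read overlaps six of the eight reference positions and matches at five of them, so a pipeline applying a high identity threshold might reject the match even though most of the reference is covered.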
Unfortunately, medical and biological research using DNA sequencing is a complex process. The ability to run different assays with various functions and multiple reference sequences creates even more complexity. In addition, interpreting the resulting output is cumbersome.
An Amplicon Sequencing Analysis Pipeline (ASAP) system characterizes a genetic sample. The ASAP system receives assay configuration data that individually associates reference sequences with genetic characteristics and that specifies target genetic characteristics. The ASAP system processes amplicon sequencing data and the reference sequences for the genetic characteristics to identify ones of the reference sequences. The ASAP system processes the identified ones of the reference sequences and the individual associations between the reference sequences and the genetic characteristics to identify a presence of the target genetic characteristics in the genetic sample. The ASAP system transfers genetic data indicating the presence of the target genetic characteristics in the genetic sample and indicates interpretation metrics for the amplicon sequencing read depth and quality related to the presence of the target genetic characteristics.
Amplicon sequencer 110 amplifies the DNA in a sample to generate amplicons (AMP on the figure) of the DNA. Amplicon sequencer 110 then reads the amplicons to generate amplicon sequencing data and amplicon sequencing metrics. The amplicon sequencing data describes the nucleotide arrangements. The amplicon sequencing metrics describe the amplicon sequencing operation with data like read confidence factors.
ASAP system 120 receives assay configuration data into its user interface. The assay configuration data individually associates reference sequences (REF SEQ on the figure) and genetic characteristics (CHAR on the figure). For example, assay requirements may be loaded into a data template that the data processor translates into a JavaScript Object Notation (JSON) file for alignment and interpretation. The data processor then executes the JSON file to drive a target assay. For a complex target assay, the JSON file may include individual genetic characteristics, their logical operators (AND/OR/NOT), significance data, and assay thresholds and other requirements.
Exemplary genetic characteristics include species identity, species variant, antibiotic resistance, virulence factor, species strain, haploid type, and Single Nucleotide Polymorphisms (SNPs). Interpretation metrics indicate sequencing read depth and quality. Exemplary interpretation metrics include the number of mapped/unmapped reads, number of aligning reads, breadth of coverage, consensus sequence, depth of coverage at each position, consensus proportion at each position, average depth of coverage, SNP position, SNP depth of coverage, amount of reference and SNP calls, number/percentage of reads containing SNPs, full base distribution at position, significance of SNP, SNP Region of Interest (ROI), number of reads aligning to ROI, reference sequence of ROI, most prevalent nucleotide sequence at ROI, number/percentage of reads containing the prevalent sequence, and amino acid and nucleotide sequence distributions.
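Several of the per-position metrics listed above (depth of coverage, full base distribution, consensus sequence, and consensus proportion) can be sketched as follows. The sketch assumes the reads have already been aligned and padded with "-" to a common length against one reference. The function name, the dictionary keys, and the use of "N" for uncovered positions are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, List


def position_metrics(aligned_reads: List[str]) -> List[Dict]:
    """Compute depth, base distribution, and consensus for each alignment column."""
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    metrics = []
    for position in range(length):
        # Count only real bases; "-" marks a position the read does not cover.
        distribution = Counter(
            read[position] for read in aligned_reads if read[position] != "-"
        )
        depth = sum(distribution.values())
        if depth:
            consensus, count = distribution.most_common(1)[0]
            proportion = count / depth
        else:
            consensus, proportion = "N", 0.0  # assumed placeholder for no coverage
        metrics.append(
            {
                "position": position + 1,  # 1-based position in the amplicon
                "depth": depth,
                "distribution": dict(distribution),
                "consensus": consensus,
                "consensus_proportion": proportion,
            }
        )
    return metrics


reads = ["ACGT", "ACGA", "AC-T"]
per_position = position_metrics(reads)
consensus_sequence = "".join(entry["consensus"] for entry in per_position)
average_depth = sum(entry["depth"] for entry in per_position) / len(per_position)
```

A low consensus proportion at a position, such as the last column in this example, is the kind of signal that would prompt a closer look at a possible SNP or a mixed population.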
ASAP system 120 compares the amplicon sequencing data to the reference sequence(s) to identify genetic characteristics based on the individual associations in the assay configuration data. ASAP system 120 generates interpretation metrics like read depth for the various genetic characteristics. ASAP system 120 may also generate other interpretation metrics like the amplicon breadth of coverage and the amplicon identity for specific genetic characteristics. For a complex assay, ASAP system 120 generates interpretation metrics to readily assess the amplicon sequencing depth and quality of the assay.
ASAP system 120 drives its communication interface to transfer genetic data with the genetic characteristics and related interpretation metrics like confidence, read depth, amplicon coverage, and amplicon identity for the individual genetic characteristics. The genetic data may be graphically presented in custom spreadsheets. Thus, medical and biological researchers may use the genetic data to easily analyze the significance and accuracy of their amplicon-based assays.
The assay configuration data drives ASAP system 120 to perform pipeline analysis. The assay configuration data specifies the interpretation of sequencing metrics like read depth and confidence. The assay configuration data may further specify amplicon identity thresholds for the reference sequences. ASAP system 120 uses the amplicon identity thresholds to determine amplicon identity between the amplicon sequencing data and the reference sequences. The genetic data indicates these amplicon identity thresholds for individual genetic characteristics like antibiotic resistance and virulence factor.
The assay configuration data may be configured as a JSON file that specifies an assay type, various target function(s), multiple reference sequences, significance, and interpretation metrics. The input JSON may optionally specify SNPs and ROIs. Some examples of assay types include presence/absence, SNP, gene variant, and ROI. A user may also define target functions, such as species ID, strain ID, antibiotic resistance type, and/or virulence factor detection. The following JSON template is provided for illustrative purposes.
ASAP system 120 receives assay configuration data that individually associates reference sequences and genetic characteristics. The assay configuration data may specify interpretation metrics like amplicon identity and/or coverage on an individual genetic characteristic basis. For example, the assay configuration data may specify an amplicon identity to determine a specific strain of E. coli and another amplicon identity to associate antibiotic resistance with that E. coli strain. Thus, a researcher can perform a complex target assay. The interpretation metrics may indicate sequencing depth and quality for individual genetic characteristics. For complex target assays, the interpretation metrics are rendered to easily associate select sequencing metrics with individual characteristics that form the target assay.
ASAP system 120 aligns the amplicon sequencing data to the reference sequences and interprets the resulting matches, mismatches, additions, and omissions. ASAP system 120 associates genetic characteristics with reference sequences based on individual associations in the assay configuration data. ASAP system 120 generates interpretation metrics like read depth, amplicon coverage, and amplicon identity for individual genetic characterization(s). ASAP system 120 transfers the genetic data with the genetic characteristics and the interpretation metrics. For a complex assay, ASAP system 120 may transfer assay data that correlates amplicon sequencing depth and quality with the genetic characteristics of the assay.
The operator object includes the attribute type (i.e. OR, AND, or NOT). The operator object is optional and invoked when two or more targets, SNPs, and/or ROIs are specified in the assay configuration data. The target object has the attributes: function, gene name, start position, end position, and reverse complement flag. Examples of the function attribute include species ID, strain ID, resistance type, and virulence factor. A reverse complement flag may be used to denote whether the target should be reverse complemented. The significance object has the attributes message and resistance. The message attribute indicates the message that is displayed when the conditions of the selected assay are met.
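How an operator might combine per-target results and trigger the significance message can be sketched as below. Everything in the configuration is hypothetical: the key names, gene names, positions, and message text are placeholders rather than the actual ASAP schema. The sketch also assumes that NOT means none of the listed targets were detected and that a missing operator object behaves like AND.

```python
from typing import Dict, List, Optional

# Hypothetical assay configuration; every key name and value is illustrative.
ASSAY = {
    "name": "example_assay",
    "type": "presence/absence",
    "operator": {"type": "AND"},
    "targets": [
        {"function": "species ID", "gene_name": "gene_a",
         "start_position": 1, "end_position": 200, "reverse_complement": False},
        {"function": "resistance type", "gene_name": "gene_b",
         "start_position": 1, "end_position": 350, "reverse_complement": True},
    ],
    "significance": {"message": "Species detected with resistance marker"},
}


def combine(operator_type: str, results: List[bool]) -> bool:
    """Combine per-target detection results with a logical operator."""
    operator_type = operator_type.upper()
    if operator_type == "AND":
        return all(results)
    if operator_type == "OR":
        return any(results)
    if operator_type == "NOT":
        # Assumption: NOT is satisfied when none of the targets is detected.
        return not any(results)
    raise ValueError(f"unsupported operator type: {operator_type}")


def interpret(assay: Dict, detected: Dict[str, bool]) -> Optional[str]:
    """Return the significance message when the assay conditions are met."""
    results = [detected.get(target["gene_name"], False) for target in assay["targets"]]
    # Assumption: with no operator object, all listed targets must be present.
    operator_type = assay.get("operator", {}).get("type", "AND")
    if combine(operator_type, results):
        return assay["significance"]["message"]
    return None


message = interpret(ASSAY, {"gene_a": True, "gene_b": True})
```

Keeping the operator logic in one small function means a new operator type only needs one additional branch, while the per-target detection step stays unchanged.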
The assay array has the attributes name and type. Examples of the type attribute include presence/absence, gene variant, SNP, and ROI. The SNP array contains the attributes position, reference, variant, and name. The position attribute indicates the position of the SNP in the amplicon. The reference attribute indicates a reference character. The SNP array may be given attributes indicating whether individual SNPs are important or not important. The ROI array includes the attributes: positions, amino acid sequence (amino seq), nucleotide sequence (n tide seq), mutations, and name. The positions attribute may be specified as a range (i.e. m-n), a comma-separated list (i.e. j, k, l, and m), or a combination of the two.
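The positions attribute formats described above (a range, a comma-separated list, or a combination) could be expanded into explicit positions along these lines. The function name is hypothetical, and the sketch assumes ranges are inclusive at both ends and that positions are positive integers.

```python
from typing import List


def parse_positions(spec: str) -> List[int]:
    """Expand a positions string such as "3-5, 9, 12-13" into sorted integers."""
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range start exceeds end: {part}")
            # Assumption: the range m-n includes both m and n.
            positions.update(range(start, end + 1))
        else:
            positions.add(int(part))
    return sorted(positions)


roi_positions = parse_positions("3-5, 9, 12-13")
```

Returning a sorted, de-duplicated list keeps overlapping entries such as "3-5, 4" from counting a position twice.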
The following is an example of a spreadsheet template for entering assay configuration data, provided for illustrative purposes. For a complex assay, the spreadsheet would indicate additional functions, reference sequences, operators, significance data,
E. coli
E. coli x
E. coli y
Communication transceiver 611 comprises communication components, such as ports, bus interfaces, signal processors, memory, software, and the like. Data processing circuitry 603 comprises circuit boards, bus interfaces, integrated micro-circuitry, and associated electronics. Data storage system 604 comprises non-transitory, machine-readable, data storage media, such as flash drives, disc drives, memory circuitry, servers, and the like. Software 605 comprises machine-readable instructions that control the operation of processing circuitry 603 when executed. Software 605 includes software modules 606-610 and may also include operating systems, applications, data structures, utilities, and the like. Amplicon sequencing analysis pipeline 600 may be centralized or distributed. All or portions of software 605 may be externally stored on one or more storage media, such as circuitry, discs, and the like. Some conventional aspects of amplicon sequencing analysis pipeline system 600 are omitted for clarity, such as power supplies, enclosures, and the like.
When executed by data processing circuitry 603, software modules 606-610 direct data processing circuitry 603 to perform the following operations. Assay modules 606 receive assay configuration data through a user interface and transfer the assay configuration data, including reference sequences, to alignment modules 608. Sequencer modules 607 receive DNA reader output signals to develop amplicon sequencing data and amplicon sequencing metrics for alignment modules 608. Alignment modules 608 position amplicon sequencing data against reference sequence(s). Interpretation modules 609 process the aligned sequence data and the assay configuration data to determine genetic characteristics and associated interpretation metrics like coverage or identity. Genetic data modules 610 format and transfer genetic data indicating genetic characteristics and associated interpretation metrics like an E. coli strain SNP related to its read confidence, read depth, amplicon coverage, and amplicon identity.
The above description and associated figures teach the best mode of the invention. The following claims specify the scope of the invention. Note that some aspects of the best mode may not fall within the scope of the invention as specified by the claims. Those skilled in the art will appreciate that the features described above can be combined in various ways to form multiple variations of the invention. As a result, the invention is not limited to the specific embodiments described above, but only by the following claims and their equivalents.
This Patent Application is a continuation application of and claims priority to U.S. patent application Ser. No. 15/576,495 entitled, “ENHANCED AMPLICON SEQUENCING ANALYSIS” which was filed on Nov. 22, 2017, which claims the benefit of PCT Patent Application PCT/US2016/033810 entitled, “ENHANCED AMPLICON SEQUENCING ANALYSIS,” which was filed on May 23, 2016, and which claims the benefit of and priority to United States Provisional Application 62/165,612, entitled “SYSTEMS AND METHODS FOR AMPLICON SEQUENCING ANALYSIS,” which was filed on May 22, 2015, all of which are hereby incorporated by reference into this U.S. Patent Application in their entirety.
This invention was made with government support under R01 AI090782 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62165612 | May 2015 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 15576495 | Nov 2017 | US
Child | 17835499 | | US