The embodiments described herein relate generally to DNA sequencing techniques, and more particularly to systems, methods and devices for next-generation sequencing.
INTRODUCTION
Advances in DNA sequencing (next-generation sequencing, NGS) have made it practical and affordable for both research and commercial applications. NGS has enabled projects as diverse as elucidating evolutionary relationships1-4, improving agricultural practices5-8, making pre-natal diagnoses9, estimating protein structure10 and examining disease etiologies11-14. Some of the first routine commercial applications of NGS are likely to be in personalized medicine, where genomic profiles may predict treatment response15,16.
As NGS increases in popularity, automated pipelines for processing large numbers of samples will be vital, particularly for whole genome sequencing (WGS). Since NGS studies suffer from biases introduced during library preparation17 and sequencing18, a key part of increased automation is quality control to ensure outputs meet pre-specified constraints.
A new, improved solution is thus needed for quality control in NGS to ensure outputs meet pre-specified constraints.
The present disclosure relates to systems and methods for determining the amount of sequencing required to achieve a target sequencing quality of a genetic sample to be sequenced, as well as systems and methods for genome sequencing.
In an aspect, there is disclosed a method of determining the amount of sequencing required to achieve a target sequencing quality of a genetic sample to be sequenced, the method comprising: receiving the genetic sample; sequencing a portion of the genetic sample; generating from the sequencing a sequencing quality metric, said sequencing quality metric belonging to a category of sequencing quality metrics; determining the amount of sequencing of the genetic sample required to achieve the target sequencing quality by inputting the sequencing quality metric into a model trained with a plurality of reference sequencing quality metrics to predict sequencing quality.
In another aspect, there is disclosed a system for genetic sequencing, the system comprising: a device for receiving a genetic sample; a device for sequencing one or more portions of the genetic sample; a device for capturing sequencing data; at least one processor configured to: generate signals for commencing sequencing the one or more portions of the genetic sample; during or after sequencing of the one or more portions, receiving sequencing data for the one or more portions; generating from the sequencing data, at least one sequencing quality metric, said at least one sequencing quality metric belonging to at least one category of sequencing quality metrics; and generating signals for continuing or aborting the sequencing of the same or additional portions of the genetic sample based on a determination of the amount of sequencing of the genetic sample required to achieve a target sequencing quality using said at least one sequencing quality metric and a model trained with reference sequencing quality metrics to predict sequencing quality.
In another aspect, there is disclosed a system for genome sequencing, the system comprising: a device for receiving a genome sample; a device for sequencing one or more portions of the genome sample; a device for capturing sequencing data; and at least one processor configured to perform the methods described herein.
In various further aspects, the disclosure provides corresponding systems and devices, and logic structures such as machine-executable coded instruction sets for implementing such systems, devices, and methods.
In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
In the figures, embodiments are illustrated by way of example. It is to be expressly understood that the description and figures are only for the purpose of illustration and as an aid to understanding.
Embodiments will now be described, by way of example only, with reference to the attached figures, described below.
Since NGS studies suffer from biases introduced during library preparation17 and sequencing18, a key part of increased automation is quality control to ensure outputs meet pre-specified constraints. For instance, it is critical to pre-determine the amount of sequencing required to achieve a specific outcome (e.g. some minimum statistical confidence in identifying variants). Over-sequencing incurs time and money costs, while under-sequencing causes reduced prediction accuracy or delays for additional data collection. As another example, groups developing and evaluating new protocols must rapidly assess if data quality is improved or degraded.
To date, such techniques have been elusive, with only a handful of heuristic studies in the literature19-23. Several factors confound the prediction of sequencing quality, including variability amongst machines and reagent batches, and the complexity of sequencing libraries23. Similarly, the integrity and quality of DNA vary widely: clinical specimens that have been formalin-fixed and paraffin-embedded (FFPE) typically yield degraded DNA that is challenging to sequence24,25. With large-scale exome and whole-genome sequencing studies26 increasing in prevalence, the need for robust quality control is growing. To tackle these challenges, we introduce SeqControl: a technique for predicting overall experimental qualities using a small amount of sequencing, enabling production-scale use of NGS quality- and process-control.
Preferred embodiments of methods, systems, and apparatus suitable for use in implementing the invention are described through reference to the drawings.
The following discussion provides many example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.
In an aspect, there is disclosed a method of determining the amount of sequencing required to achieve a target sequencing quality of a genetic sample to be sequenced, the method comprising: receiving the genetic sample; sequencing a portion of the genetic sample; generating from the sequencing a sequencing quality metric, said sequencing quality metric belonging to a category of sequencing quality metrics; determining the amount of sequencing of the genetic sample required to achieve the target sequencing quality by inputting the sequencing quality metric into a model trained with a plurality of reference sequencing quality metrics to predict sequencing quality.
In some embodiments, the category of sequencing quality metrics is selected from the group consisting of overall coverage, coverage distribution, base-wise coverage, base-wise quality, sequencing experimental information, read-level sequence-quality, read-level mapping-quality, and coverage of genomic repeats.
In an embodiment, the sequencing quality metric belonging to the overall coverage category is selected from the group consisting of uncollapsed coverage, collapsed coverage, masked coverage and clusters. For each of these a summary statistic can be generated such as, but not limited to, the mean, median, standard-deviation, first-quartile and third-quartile.
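By way of illustration only, the summary statistics named above could be computed as in the following minimal sketch. The function name summarize, the per-window input values and the choice of the "inclusive" quartile method are assumptions made for this example and are not taken from the disclosure.

```python
import statistics


def summarize(values):
    """Return mean, median, standard deviation and quartiles for one metric."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical collapsed-coverage values, one per genomic window.
collapsed_coverage = [28, 30, 31, 29, 35, 27, 30, 32]
summary = summarize(collapsed_coverage)
```

The same helper could be applied unchanged to uncollapsed coverage, masked coverage or cluster counts.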
In another embodiment, the sequencing quality metric belonging to the coverage distribution category is selected from the group consisting of unique start points and average reads/starts.
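As a non-limiting sketch, the two coverage distribution metrics could be derived from the mapped start coordinates of the reads. The disclosure does not define how a start point is keyed; treating a start point as a distinct (chromosome, position, strand) tuple, and the function name start_point_metrics, are assumptions made for this example.

```python
def start_point_metrics(read_starts):
    """Return (unique start points, average reads per start point).

    read_starts: iterable of (chromosome, position, strand) tuples,
    one per mapped read (an assumed representation).
    """
    reads = list(read_starts)
    unique_starts = len(set(reads))
    average = len(reads) / unique_starts if unique_starts else 0.0
    return unique_starts, average


# Hypothetical reads: six reads falling on four distinct start points.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),
    ("chr1", 250, "+"),
    ("chr2", 40, "+"),
    ("chr2", 40, "+"),
]
unique_starts, reads_per_start = start_point_metrics(reads)
```

A higher average reads/starts value for a fixed number of reads suggests more duplication and hence a less complex library.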
In yet another embodiment, the sequencing quality metric belonging to the base-wise coverage category is selected from the group consisting of percentage of bases that reach 0× coverage, percentage of bases that reach 1× coverage, percentage of bases that reach 2× coverage, percentage of bases that reach 3× coverage, percentage of bases that reach 4× coverage, percentage of bases that reach 5× coverage, percentage of bases that reach 6× coverage, percentage of bases that reach 7× coverage, percentage of bases that reach 8× coverage, percentage of bases that reach 9× coverage, percentage of bases that reach 10× coverage, percentage of bases that reach 20× coverage, percentage of bases that reach 30× coverage, percentage of bases that reach 40× coverage, percentage of bases that reach 50× coverage, percentage of bases that reach 75× coverage, percentage of bases that reach 100× coverage, percentage of bases that reach 150× coverage, percentage of bases that reach 200× coverage, percentage of bases that reach 250× coverage, percentage of bases that reach 500× coverage, and percentage of bases that reach 1000× coverage.
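The base-wise coverage metrics listed above could, for example, be computed from a per-base depth vector as sketched below. This example assumes that "reach N× coverage" means a depth of at least N (so the 0× value is trivially 100%); another reading, in which the 0× metric counts bases with no coverage, is equally possible. The names THRESHOLDS and percent_bases_reaching are hypothetical.

```python
# Depth thresholds taken from the list of base-wise coverage metrics.
THRESHOLDS = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50,
              75, 100, 150, 200, 250, 500, 1000)


def percent_bases_reaching(depths, thresholds=THRESHOLDS):
    """Map each threshold to the percentage of bases with depth >= threshold."""
    total = len(depths)
    if total == 0:
        raise ValueError("at least one base is required")
    return {
        t: 100.0 * sum(1 for d in depths if d >= t) / total
        for t in thresholds
    }


# Hypothetical per-base depths for an eight-base region.
depths = [0, 1, 5, 10, 30, 30, 50, 100]
coverage_profile = percent_bases_reaching(depths)
```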
In other embodiments, the sequencing quality metric belonging to the base-wise quality category is selected from the group consisting of percentage of bases receiving a base-wise genotype quality score greater than 0, percentage of bases receiving a base-wise genotype quality score of at least 10, percentage of bases receiving a base-wise genotype quality score of at least 20, percentage of bases receiving a base-wise genotype quality score of at least 30, percentage of bases receiving a base-wise genotype quality score of at least 40, percentage of bases receiving a base-wise genotype quality score of at least 50, percentage of bases receiving a base-wise genotype quality score of at least 60, percentage of bases receiving a base-wise genotype quality score of at least 70, percentage of bases receiving a base-wise genotype quality score of at least 80, percentage of bases receiving a base-wise genotype quality score of at least 90, percentage of bases receiving a base-wise genotype quality score of at least 100, and percentage of bases receiving the maximum base-wise genotype quality score.
In yet another embodiment, the sequencing quality metric belonging to the sequencing experimental information category is selected from the group consisting of machine ID, machine name, cluster density, technician name and technician ID.
In another embodiment, the sequencing quality metric belonging to the read-level sequence-quality category is selected from the group consisting of average per-read quality, maximum per-read quality, median per-read quality, SD of per-read quality, first-quartile of per-read quality and third-quartile of per-read quality.
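For illustration, per-read quality values could be obtained from FASTQ quality strings and then summarized. The sketch below assumes Phred+33 encoding and defines the per-read quality as the mean Phred score of the read; both are assumptions of this example, as is the function name per_read_quality.

```python
import statistics


def per_read_quality(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    return statistics.mean(ord(ch) - offset for ch in quality_string)


# Hypothetical quality strings for four reads.
quality_strings = ["IIII", "I5I5", "5555", "II55"]
read_qualities = [per_read_quality(q) for q in quality_strings]

# Read-level sequence-quality metrics across all reads.
average_quality = statistics.mean(read_qualities)
maximum_quality = max(read_qualities)
median_quality = statistics.median(read_qualities)
sd_quality = statistics.stdev(read_qualities)
```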
In yet another embodiment, the sequencing quality metric belonging to the read-level mapping-quality category is selected from the group consisting of mapping quality of the read, average mapping quality, median mapping quality, standard deviation of mapping quality, first quartile of mapping quality, and third quartile of mapping quality.
In some embodiments, the methods disclosed herein generate a plurality of sequencing quality metrics. In an embodiment, the plurality of sequencing quality metrics consists of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49 or 50 sequencing quality metrics.
In yet another embodiment, the plurality of sequencing quality metrics comprises at least 2 sequencing quality metrics selected from 2 different categories of sequencing quality metrics.
In other embodiments, the plurality of sequencing quality metrics comprises at least 3 sequencing quality metrics selected from 3 different categories of sequencing quality metrics.
In yet other embodiments, the plurality of sequencing quality metrics comprises at least 4 sequencing quality metrics selected from 4 different categories of sequencing quality metrics.
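One way to check that a chosen plurality of metrics spans the required number of categories is sketched below. The metric identifiers in CATEGORY_OF and the function name spans_at_least are hypothetical and serve only to make the example concrete.

```python
# Hypothetical metric identifiers mapped to the categories named above.
CATEGORY_OF = {
    "collapsed_coverage": "overall coverage",
    "uncollapsed_coverage": "overall coverage",
    "unique_start_points": "coverage distribution",
    "pct_bases_10x": "base-wise coverage",
    "pct_bases_gq30": "base-wise quality",
}


def spans_at_least(metric_names, n_categories):
    """True if the metrics are drawn from at least n_categories categories."""
    categories = {CATEGORY_OF[name] for name in metric_names}
    return len(categories) >= n_categories


selected = ["collapsed_coverage", "unique_start_points", "pct_bases_10x"]
```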
In some embodiments, the model disclosed herein is trained by inputting the plurality of reference sequencing quality metrics into the model.
In some embodiments, the model comprises a random forest classifier, a neural network, K-nearest neighbours, support vector machines, linear regression, linear discriminant analysis, or decision trees.
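As a minimal sketch of one of the listed model types, a K-nearest neighbours classifier can be written with the standard library alone. The feature choice (collapsed coverage per lane and percentage of bases reaching 10× coverage), the "pass"/"fail" labels, the reference values and the function name knn_predict are all assumptions of this example; a practical implementation would typically rescale the features first.

```python
import math
from collections import Counter


def knn_predict(reference, query, k=3):
    """Predict a label for query by majority vote of its k nearest references.

    reference: list of (feature_vector, label) pairs built from
    reference sequencing quality metrics.
    """
    nearest = sorted(reference, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical reference samples: (coverage per lane, % bases >= 10x) -> outcome.
reference = [
    ((10.1, 92.0), "pass"),
    ((9.8, 90.5), "pass"),
    ((10.4, 93.1), "pass"),
    ((6.2, 70.3), "fail"),
    ((5.9, 68.8), "fail"),
    ((6.5, 72.0), "fail"),
]

prediction = knn_predict(reference, (9.9, 91.0))
```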
In some embodiments, the portion of the genetic sample is greater than 0% and less than 100% of the genetic sample. In an embodiment, the portion of the genetic sample is less than 50%, less than 40%, less than 30%, less than 20%, less than 10%, less than 5%, or less than 3% of the genetic sample. In some embodiments, the portion of the genetic sample is at least 1% of the genetic sample. In yet other embodiments, the portion of the genetic sample is at least 2% of the genetic sample.
In another embodiment, the portion of the genetic sample is between 2% and 50% of the genetic sample.
In some embodiments, the genetic sample is a genome. In yet other embodiments, the genetic sample originates from a tumour genome.
In other embodiments, the genetic sample originates from a non-tumour genome. In yet other embodiments, the genetic sample is a targeted sequence of a portion of a genome. In one embodiment, the targeted sequence of a portion of the genome is an exome. In another embodiment, the targeted sequence of a portion of the genome is a targeted panel.
In some embodiments, the target sequencing quality is a sequencing depth. In one embodiment, the sequencing depth is between 1× and 500×. In another embodiment, the sequencing depth is between 10× and 100×. In yet another embodiment, the sequencing depth is greater than 1×. In another embodiment, the sequencing depth is 10×, 20×, 30×, 40×, 50×, 60×, 70×, 80×, 90×, 100×, 110×, 120×, 130×, 140×, 150×, 160×, 170×, 180×, 190× or 200×.
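For orientation, mean sequencing depth is conventionally the number of mapped bases divided by the size of the target region. The sketch below pairs that definition with a naive linear extrapolation of the number of lanes needed for a target depth. The linear extrapolation is only a baseline assumed for this example; it is not the trained model of the disclosure, which exists precisely because yield per lane varies between samples. The function names and input values are hypothetical.

```python
import math


def mean_depth(mapped_bases, target_size):
    """Mean depth: mapped bases divided by the size of the target region."""
    return mapped_bases / target_size


def lanes_for_target(depth_per_lane, target_depth):
    """Naive estimate: lanes needed if every lane yields the same depth."""
    return math.ceil(target_depth / depth_per_lane)


# Hypothetical first lane: 9 Gb mapped against a 3 Gb genome gives 3x.
depth_per_lane = mean_depth(9_000_000_000, 3_000_000_000)
lanes_needed = lanes_for_target(depth_per_lane, target_depth=50)
```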
In another aspect, there is disclosed a system for genetic sequencing, the system comprising: a device for receiving a genetic sample; a device for sequencing one or more portions of the genetic sample; a device for capturing sequencing data; at least one processor configured to: generate signals for commencing sequencing the one or more portions of the genetic sample; during or after sequencing of the one or more portions, receiving sequencing data for the one or more portions; generating from the sequencing data, at least one sequencing quality metric, said at least one sequencing quality metric belonging to at least one category of sequencing quality metrics; and generating signals for continuing or aborting the sequencing of the same or additional portions of the genetic sample based on a determination of the amount of sequencing of the genetic sample required to achieve a target sequencing quality using said at least one sequencing quality metric and a model trained with reference sequencing quality metrics to predict sequencing quality.
In an embodiment, the at least one category of sequencing quality metrics is selected from the group consisting of overall coverage, coverage distribution, base-wise coverage, and base-wise quality.
In another embodiment, the at least one sequencing quality metric belonging to the overall coverage category is selected from the group consisting of uncollapsed coverage, collapsed coverage, masked coverage and clusters.
In still another embodiment, the at least one sequencing quality metric belonging to the coverage distribution category is selected from the group consisting of unique start points and average reads/starts.
In yet another embodiment, the at least one sequencing quality metric belonging to the base-wise coverage category is selected from the group consisting of percentage of bases that reach 0× coverage, percentage of bases that reach 1× coverage, percentage of bases that reach 2× coverage, percentage of bases that reach 3× coverage, percentage of bases that reach 4× coverage, percentage of bases that reach 5× coverage, percentage of bases that reach 6× coverage, percentage of bases that reach 7× coverage, percentage of bases that reach 8× coverage, percentage of bases that reach 9× coverage, percentage of bases that reach 10× coverage, percentage of bases that reach 20× coverage, percentage of bases that reach 30× coverage, percentage of bases that reach 40× coverage, percentage of bases that reach 50× coverage, percentage of bases that reach 75× coverage, percentage of bases that reach 100× coverage, percentage of bases that reach 150× coverage, percentage of bases that reach 200× coverage, percentage of bases that reach 250× coverage, percentage of bases that reach 500× coverage, and percentage of bases that reach 1000× coverage.
In another embodiment, the at least one sequencing quality metric belonging to the base-wise quality category is selected from the group consisting of percentage of bases receiving a base-wise genotype quality score greater than 0, percentage of bases receiving a base-wise genotype quality score of at least 10, percentage of bases receiving a base-wise genotype quality score of at least 20, percentage of bases receiving a base-wise genotype quality score of at least 30, percentage of bases receiving a base-wise genotype quality score of at least 40, percentage of bases receiving a base-wise genotype quality score of at least 50, percentage of bases receiving a base-wise genotype quality score of at least 60, percentage of bases receiving a base-wise genotype quality score of at least 70, percentage of bases receiving a base-wise genotype quality score of at least 80, percentage of bases receiving a base-wise genotype quality score of at least 90, percentage of bases receiving a base-wise genotype quality score of at least 100, and percentage of bases receiving the maximum base-wise genotype quality score.
In some embodiments, the at least one sequencing quality metric comprises at least 2 sequencing quality metrics selected from 2 different categories of sequencing quality metrics.
In some embodiments, the at least one sequencing quality metric comprises at least 3 sequencing quality metrics selected from 3 different categories of sequencing quality metrics.
In some embodiments, the at least one sequencing quality metric comprises at least 4 sequencing quality metrics selected from 4 different categories of sequencing quality metrics.
In some embodiments, the model is trained by inputting the plurality of reference sequencing quality metrics into the model.
In one embodiment, the model comprises a random forest classifier, a neural network, K-nearest neighbours, support vector machines, linear regression, linear discriminant analysis, or decision trees.
In another aspect, there is disclosed a system for genome sequencing, the system comprising: a device for receiving a genome sample; a device for sequencing one or more portions of the genome sample; a device for capturing sequencing data; and at least one processor configured to perform the methods described herein.
The system 2600 is a computer-implemented system, wherein aspects may be provided through computer implementation in the form of hardware, software, embedded software, firmware, etc., using various computing devices, such as servers, processors, memory, non-transitory computer-readable media, etc. In some embodiments, one or more networks 2670 may be utilized for electronic communication. The system 2600 may be provided as part of a single computing device (e.g., a supercomputer), or a series of distributed devices, some of which may be physical, and some of which may be virtual. For example, system 2600 may be provided in the form of distributed networking resources, such as a “cloud computing” implementation (for example, a distributed set of processors, storage, memory, etc., may co-operate in performing computational functions based on instructions from one or more controllers).
The system 2600 may be used to improve and/or streamline various aspects of healthcare and/or bioinformatics analytics, for example by determining the amount of sequencing required to achieve a specific outcome. Sequencing and its downstream analytics may require significant resources, time and computational processing, so avoiding unnecessary sequencing may provide various advantages, provided that prediction accuracy is not overly reduced. The amount of sequencing required for a specific outcome may be influenced by various factors, such as the quality of the input genetic information. That quality may be assessed through one or more quality metrics, which may be determined by conducting a small amount of sequencing. The quality metrics may be assessed across various categories, and metrics may reflect coverage depth across the genome.
The system 2600 may include various computerized components that may be specifically configured for use in providing features related to genome sequencing, such as providing computerized determinations of the amount of sequencing required to achieve a target sequencing quality of a genetic sample, having regard to various characteristics related to a particular genome sample. Some computerized components may include or be connected to sensors and physical apparatuses adapted to perform various steps involved in sequence analysis, such as analyzing or extracting information from physical samples, searching against biological databases, applying various models, performing sequence alignment, identifying sequence differences, comparing sequences, etc.
In some embodiments, the system 2600 includes a genetic data receiver unit 2602, a partial sequencer unit 2603, a quality assessment unit 2604, a training unit 2606, a threshold determination unit 2608, a sequence prediction unit 2610, a retraining unit 2612, and a DNA fragment library 2614.
The system 2600 may further include a data storage 2680, which may be used to store various aspects of electronic information, such as genetic information, quality metrics, prior determination threshold results, logical rules, etc.
In some embodiments, the system 2600 may further be adapted to interoperate with a sample receiver 2650. The sample receiver may be configured to receive a physical sample and to extract various biological (e.g., genetic) data from the sample. In some embodiments, the system may be configured to receive data from a data source (e.g., from a repository, or sample information provided by a laboratory apparatus). In some embodiments, the system 2600 may further include a physical sample receiver. The genetic data receiver unit 2602 may be configured to receive genetic data from various sources. For example, the genetic sample may be a genome, may originate from a tumor genome, may originate from a non-tumor genome, or may be a targeted sequence of a portion of a genome (e.g., an exome or a targeted panel). The genetic sample may be derived from various types of tissue and tissue preparations, such as blood, frozen preparations, tumors, formalin-fixed paraffin-embedded (FFPE) tissue, etc.
The partial sequencer unit 2603 may be utilized to sequence a portion of the genetic sample and to generate various sequencing quality metrics from that sequencing. For example, a sequencing quality metric may be generated, belonging to a category of sequencing quality metrics. The quality metrics may be generated through the analysis of various correlations and correlation profiles, for example, determining relationships between different variables and factors.
The sequencing quality metric may be provided in various types of electronic representations, and in some embodiments, a plurality of sequencing quality metrics are generated. For example, a plurality of sequencing quality metrics may be generated, including 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49 or 50 sequencing quality metrics. In some embodiments, the plurality of sequencing quality metrics comprises at least 2 sequencing quality metrics selected from 2 different categories of sequencing quality metrics. In some embodiments, the plurality of sequencing quality metrics comprises at least 3 sequencing quality metrics selected from 3 different categories of sequencing quality metrics. In some embodiments, the plurality of sequencing quality metrics comprises at least 4 sequencing quality metrics selected from 4 different categories of sequencing quality metrics.
The sequencing quality metrics may, for example, be selected from the group consisting of overall coverage, coverage distribution, base-wise coverage, base-wise quality, sequencing experimental information, read-level sequence-quality, read-level mapping-quality, and coverage of genomic repeats, among others. For example, there may be metrics related to machine ID, machine-name, cluster density, technician name/ID, average per-read quality, maximum per-read quality, median per-read quality, SD of per-read quality, first-quartile of per-read quality, third-quartile of per-read quality, mapping quality of the read, average mapping quality, median mapping quality, SD of mapping quality, first quartile of mapping quality, third quartile of mapping quality, among others. The sequencing quality metric belonging to the overall coverage category may also be selected from the group consisting of uncollapsed coverage, collapsed coverage, masked coverage and clusters. For each of these, a summary statistic can be generated such as, but not limited to, the mean, median, standard-deviation, first-quartile and third-quartile. In some embodiments, a sequencing quality metric belonging to the coverage distribution category may also be selected from the group consisting of unique start points and average reads/starts.
In some embodiments, the sequencing quality metric belonging to the base-wise coverage category may be selected from the group consisting of percentage of bases that reach 0× coverage, percentage of bases that reach 1× coverage, percentage of bases that reach 2× coverage, percentage of bases that reach 3× coverage, percentage of bases that reach 4× coverage, percentage of bases that reach 5× coverage, percentage of bases that reach 6× coverage, percentage of bases that reach 7× coverage, percentage of bases that reach 8× coverage, percentage of bases that reach 9× coverage, percentage of bases that reach 10× coverage, percentage of bases that reach 20× coverage, percentage of bases that reach 30× coverage, percentage of bases that reach 40× coverage, percentage of bases that reach 50× coverage, percentage of bases that reach 75× coverage, percentage of bases that reach 100× coverage, percentage of bases that reach 150× coverage, percentage of bases that reach 200× coverage, percentage of bases that reach 250× coverage, percentage of bases that reach 500× coverage, and percentage of bases that reach 1000× coverage, among others.
In some embodiments, the sequencing quality metric belonging to the base-wise quality category may be selected from the group consisting of percentage of bases receiving a base-wise genotype quality score greater than 0, percentage of bases receiving a base-wise genotype quality score of at least 10, percentage of bases receiving a base-wise genotype quality score of at least 20, percentage of bases receiving a base-wise genotype quality score of at least 30, percentage of bases receiving a base-wise genotype quality score of at least 40, percentage of bases receiving a base-wise genotype quality score of at least 50, percentage of bases receiving a base-wise genotype quality score of at least 60, percentage of bases receiving a base-wise genotype quality score of at least 70, percentage of bases receiving a base-wise genotype quality score of at least 80, percentage of bases receiving a base-wise genotype quality score of at least 90, percentage of bases receiving a base-wise genotype quality score of at least 100, and percentage of bases receiving the maximum base-wise genotype quality score.
The model application unit 2604 may be configured to apply one or more models, utilizing various quality metrics in performing an analysis. In some embodiments, the model application unit 2604 may be configured with a pre-defined target sequencing quality. In some embodiments, the model application unit 2604 may be configured to determine a target sequencing quality based on a particular desired outcome.
The model application unit 2604 may be configured to input the one or more sequencing quality metrics into a model trained with a plurality of reference sequencing quality metrics to predict sequencing quality. In some embodiments, the genetic information may be analyzed through the use of various machine learning classification tools, binary models, etc. For example, the models may be configured such that particular metrics are weighted differently than others (e.g., based on importance).
For example, a model may be utilized that comprises one or a combination of a random forest classifier, a neural network, K-nearest neighbors, support vector machines, linear regression, linear discriminant analysis, or decision trees.
A training unit 2606 may also be configured to train the model applied by the model application unit 2604 by inputting the plurality of reference sequencing quality metrics into the model. The model may be refined (e.g., retrained) over time, for example, by validating the prediction ability of various models. For example, a random forest can be trained by the training unit 2606.
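Validating the prediction ability of a model could, for instance, be done by leave-one-out cross-validation over the reference samples, as sketched below. The disclosure does not prescribe a validation scheme; leave-one-out, the single-nearest-neighbour predictor used as a stand-in model, the sample values and the function names are all assumptions of this example.

```python
import math


def nearest_label(reference, query):
    """Stand-in model: label of the single closest reference sample."""
    return min(reference, key=lambda item: math.dist(item[0], query))[1]


def leave_one_out_accuracy(samples):
    """Hold out each sample in turn and predict it from all the others."""
    hits = 0
    for i, (features, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if nearest_label(training, features) == label:
            hits += 1
    return hits / len(samples)


# Hypothetical reference samples: (coverage per lane, % bases >= 10x) -> outcome.
samples = [
    ((10.1, 92.0), "pass"),
    ((9.8, 90.5), "pass"),
    ((10.4, 93.1), "pass"),
    ((6.2, 70.3), "fail"),
    ((5.9, 68.8), "fail"),
    ((6.5, 72.0), "fail"),
]

accuracy = leave_one_out_accuracy(samples)
```

A retraining step could then be triggered whenever the measured accuracy falls below a chosen level as new reference samples accumulate.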
Based on the output of the model application unit 2604, the threshold determination unit 2608 may be configured to conduct a determination of the amount of sequencing of the genetic sample required to achieve the target sequencing quality.
For example, a threshold may be provided, wherein the threshold is a percentage value indicating that the portion of the genetic sample required is greater than 0% and less than 100% of the genetic sample. In some embodiments, the portion of the genetic sample required is less than 50%, less than 40%, less than 30%, less than 20%, less than 10%, less than 5%, or less than 3% of the genetic sample. In some embodiments, the portion of the genetic sample required is at least 1% of the genetic sample. In some embodiments, the portion of the genetic sample required is at least 2% of the genetic sample. In some embodiments, the portion of the genetic sample required is between 2% and 50% of the genetic sample.
The threshold may also be a sequencing depth. For example, a sequencing depth may be between 1× and 500×, between 10× and 100×, or greater than 1×, according to various embodiments.
The sequence prediction unit 2610 may be configured to issue control command instructions, for example to start and/or stop sequencing, based on, for example, whether a particular threshold output by the threshold determination unit 2608 has been reached.
In some embodiments, the system 2600 may be implemented as an input into a genome sequencing unit 2654, and the system 2600, through the sequence prediction unit 2610, may be configured to provide various instructions as to when enough sequencing has been completed to achieve a target sequencing quality of a genetic sample to be sequenced.
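The control decision described above can be sketched as a small function that compares the sequencing done so far with the predicted requirement. Measuring sequencing in lanes, the budget parameter, the rule that sequencing is aborted when the predicted requirement exceeds the budget, and the names Command and next_command are assumptions made for this example.

```python
from enum import Enum


class Command(Enum):
    CONTINUE = "continue"
    STOP = "stop"
    ABORT = "abort"


def next_command(lanes_done, lanes_required, lanes_budget):
    """Choose a control command from the predicted sequencing requirement.

    lanes_required would come from the threshold determination step;
    lanes_budget is a hypothetical upper limit set by the operator.
    """
    if lanes_required > lanes_budget:
        # Assumed policy: the target cannot be reached within budget.
        return Command.ABORT
    if lanes_done >= lanes_required:
        return Command.STOP
    return Command.CONTINUE


command = next_command(lanes_done=2, lanes_required=5, lanes_budget=8)
```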
The technical solution of embodiments may be in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disk read-only memory (CD-ROM), a USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided by the embodiments.
The embodiments described herein are implemented by physical computer hardware, including computing devices, servers, receivers, transmitters, processors, memory, displays, and networks. The embodiments described herein provide useful physical machines and particularly configured computer hardware arrangements. The embodiments described herein are directed to electronic machines and methods implemented by electronic machines adapted for processing and transforming electromagnetic signals which represent various types of information. The embodiments described herein pervasively and integrally relate to machines, and their uses; and the embodiments described herein have no meaning or practical applicability outside their use with computer hardware, machines, and various hardware components. Substituting the physical hardware particularly configured to implement various acts for non-physical hardware, using mental steps for example, may substantially affect the way the embodiments work. Such computer hardware limitations are clearly essential elements of the embodiments described herein, and they cannot be omitted or substituted for mental means without having a material effect on the operation and structure of the embodiments described herein. The computer hardware is essential to implement the various embodiments described herein and is not merely used to perform steps expeditiously and in an efficient manner.
For simplicity only one computing device is shown but system 2600 may include more computing devices operable by users to access remote network resources and exchange data. The computing devices may be the same or different types of devices. The computing device 2600 includes at least one processor, a data storage device (including volatile memory or non-volatile memory or other data storage elements or a combination thereof), and at least one communication interface. The computing device components may be connected in various ways including directly coupled, indirectly coupled via a network, and distributed over a wide geographic area and connected via a network (which may be referred to as “cloud computing”).
For example, and without limitation, the computing device may be a server, network appliance, set-top box, embedded device, computer expansion module, personal computer, laptop, personal data assistant, cellular telephone, smartphone device, UMPC tablet, video display terminal, gaming console, electronic reading device, and wireless hypermedia device or any other computing device capable of being configured to carry out the methods described herein.
Each processor 2702 may be, for example, any type of general-purpose microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, a programmable read-only memory (PROM), or any combination thereof.
Memory 2704 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), Ferroelectric RAM (FRAM) or the like.
Each I/O interface 2706 enables computing device 27 to interconnect with one or more input devices, such as a keyboard, mouse, camera, touch screen and a microphone, or with one or more output devices such as a display screen and a speaker.
Each network interface 2708 enables computing device 27 to communicate with other components, to exchange data with other components, to access and connect to network resources, to serve applications, and perform other computing applications by connecting to a network (or multiple networks) capable of carrying data including the Internet, Ethernet, plain old telephone service (POTS) line, public switch telephone network (PSTN), integrated services digital network (ISDN), digital subscriber line (DSL), coaxial cable, fiber optics, satellite, mobile, wireless (e.g. Wi-Fi, WiMAX), SS7 signaling network, fixed line, local area network, wide area network, and others, including any combination of these.
Although the embodiments have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein.
Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
As can be understood, the examples described above and below, and illustrated herein are intended to be exemplary only.
We studied 53 whole-genomes (26 blood-derived normals and 27 prostate tumours) sequenced as part of an International Cancer Genome Consortium26 (ICGC) project studying intermediate-risk prostate cancer. Tumours were treatment-naïve and on average contained 83% tumour cells (Table 1;
We measured 15 sequencing data quality metrics across four categories: overall coverage, coverage distribution, base-wise coverage and base-wise variant-calling confidence scores (Table 2). Overall coverage metrics reflect average coverage depth across the genome. Coverage-distribution metrics describe how evenly reads are distributed. Base-wise coverage metrics measure the proportion of bases that reach a given coverage, which is a measure of sequencing uniformity. Base-wise SAMtools quality scores quantify confidence in per-base genotype predictions. All metrics can be rapidly evaluated with open-source software (Table 3), and source code is available.
Do Different Quality Metrics Have Independent Predictive Power?
We first assessed the variability between different runs (i.e. lanes) of the same sequencing library to verify our dataset reflects the heterogeneity seen in large-scale sequencing datasets. For each sample we created all possible lane groupings: every individual lane, every pair of lanes, etc. (
Next, we studied the relationship between each pair of metrics by calculating Spearman's rank correlation coefficient (ρ) separately for normal and tumour samples (
To exploit this result, we sought to determine if multi-lane behaviour could be predicted from single-lane values of a metric. We calculated correlations between single-lane values and all-lane values (4-lane for normal, 6-lane for tumour) for each metric (
Can Different Quality Metrics Predict One Another?
The strong pair-wise correlations between metrics (
Since collapsed coverage is currently the most widely used target quality metric in sequencing centres, we further investigated this model. In the validation cohort (
Can Production-Ready Models Be Developed?
While these linear models provide insight, production applications require rapid, reliable and accurate predictions. Binary predictions with confidence metrics are optimal for automation, suggesting the use of machine learning classification techniques. To demonstrate this approach, we started with a practical question. Most sequencing studies set a target coverage depth, but how many lanes of sequencing will be required to reach it for a given library?
We began by assessing the performance of the state-of-the-art tool for predicting library complexity, preseq23. Overall preseq failed to run for 52% (85/162) of our tumour sample lanes, including all FFPE data. Of the 48% (77 lanes) where results were generated, only 36 (47%) yielded preseq-estimated 95% confidence intervals that contained the true value (
We therefore sought to directly predict if a given amount of sequencing would reach a pre-specified threshold. We used 15 randomly selected samples for training and the remaining 12 for validation. Ten of the validation samples were sequenced from fresh-frozen tumour tissue while the remaining two used FFPE-preserved tissue. We selected our thresholds to reflect the most common targets in ICGC projects26: for normal data, whether 4 lanes of sequencing would achieve 30× coverage; for tumour data, whether 6 lanes would achieve 50× coverage. Our classifier uses 15 metrics on a single lane to predict the results of multi-lane sequencing. Each validation lane was used, yielding 72 predictions for tumours and 48 for normals.
We began by testing if metrics univariately separated these two classes, and 12/15 did (Wilcoxon rank-sum test; p<10^-3;
We considered the entire set of predictions for the tumour data by the confidence of our classifier (number of yes votes from the forest), where more homogeneous results (close to 0 or 1) indicate higher confidence (
Can Protocol Variations Be Detected From Quality Data?
Next we sought to determine if quality metrics diverge over time in a large sequencing centre, as might be caused by technician-specific bias, protocol refinements or reagent-batch differences. We split our 27 tumour genomes into two groups based on sequencing-date, training a model with the 10 oldest samples (See Table 4 in Appendix) and validating on the 17 newest (See Table 5 in Appendix). This age-dichotomized model was significantly less accurate (
Can Accurate Predictions Be Made From Even Smaller Amounts of Sequencing?
Our initial analysis shows high accuracy using ~16-25% of the total data for modelling (i.e. a complete lane). However, for routine production use it would be highly advantageous to use even smaller subsets so multiple samples could be evaluated simultaneously using barcoded libraries. Combined with same-day sequencing instruments, this would allow for routine quality-assessment prior to full-scale sequencing, reducing cost and increasing quality. To identify the minimum amount of data required for accurate predictions, we trained SeqControl with a fraction of a lane of sequencing then validated its prediction ability as before (Tables 8-10 in Appendix). Performance was only modestly decreased when the amount of training sequence changed from a full-lane (AUC=0.993) to an eighth of a lane (AUC=0.967;
Samples
The sample DNA analyzed in this study came from the Canadian Prostate Cancer Genome Network (CPC-GENE) project (http://icgc.org/icgc/cgp/70/392/70542). Data was collected from twenty-six prostate cancer patients and consisted of tumour tissue DNA and paired DNA extracted from normal blood samples (referred to as normal DNA) for each patient. In addition, one unmatched tumour sample was also included for which normal data was unavailable. Twenty-five of the tumour samples were prepared via flash-freezing and the remaining two were prepared by formalin-fixation and paraffin-embedding (FFPE). Fresh-frozen post-RP (radical prostatectomy) specimens were from the University Health Network BioBank. FFPE tissue blocks were obtained from the Department of Pathology, University Health Network. Blood samples were collected at the time of informed consent, which followed local Research Ethics Board (REB) and ICGC guidelines (UHN REB study protocols UHN 06-0822-CE and UHN 11-0024-CE).
To estimate the cellularity and purity of tumour samples, we hybridized an aliquot of DNA from each sample to an OncoScan Affymetrix microarray. To calculate tumour purity from this data, we used the qpure algorithm32, which requires Log R Ratio (LRR) and B allele frequency (BAF) for each probe. These were computed using the two intensity values for each probe (i.e. one for each allele interrogated at each position) using the equations: LRR=log2(X+Y) and BAF=Y/(X+Y), where Y and X are intensity values corresponding to the minor and major alleles, respectively. We used the tumorpurity.mixture.gam.adjust output of qpure as our estimate of cellularity.
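The two per-probe quantities defined above can be sketched in Python as follows. The function name and argument order are illustrative assumptions; the formulas are those stated in the text, with X the major-allele intensity and Y the minor-allele intensity.

```python
import math


def lrr_baf(x: float, y: float) -> tuple[float, float]:
    """Compute (LRR, BAF) for one probe as defined in the text.

    x: intensity for the major allele, y: intensity for the minor allele.
    LRR = log2(X + Y) and BAF = Y / (X + Y).
    """
    total = x + y
    if total <= 0:
        raise ValueError("combined intensity must be positive")
    return math.log2(total), y / total
```

For instance, intensities X=3.0 and Y=1.0 give LRR=2.0 and BAF=0.25.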
Sequencing
Pico-green quantified gDNA (50 ng) was sheared to 300 bp using a Covaris S2 Ultra-sonicator (Covaris Inc., Woburn, Mass., USA) followed by 3× volume AMPure XP SPRI bead clean-up (Beckman Coulter Genomics, Danvers, Mass., USA Cat#A63881). The resulting bead-DNA mixture was transferred to a 96-well PCR plate and libraries were constructed using enzymatic reagents from KAPA Library Preparation Kits (KAPA Biosciences, Woburn, Mass., USA Cat#KK8201) according to previously reported protocols for end repair, A-tailing and adapter ligation33. Adapter-ligated libraries were enriched by adding 3 μL of 25 μM Illumina F & R PE enrichment primers (Integrated DNA Technologies, Coralville, Iowa, USA), 75 μL of 2× KAPA HiFi HotStart ReadyMix (KAPA Biosciences, Woburn, Mass., USA Cat#KK2602) and 33 μL of nuclease-free water (Life Technologies, Carlsbad, Calif., USA Cat#AM993) to 36 μL of eluted DNA and amplified across 3 individual PCR reaction tubes. Veriti 96-well Thermal Cyclers (Life Technologies, Carlsbad, Calif., USA) were used to incubate libraries (45 s at 98° C.) and cycled 10 times for 15 s at 98° C., 30 s at 65° C. and 30 s at 72° C. Following a 0.6× SPRI bead clean-up, post-PCR enriched libraries were eluted in 40 μL of elution buffer (Qiagen, Hilden, Germany, Cat#19086) and validated using an Agilent Bioanalyzer (using the High Sensitivity DNA Kit; Agilent Technologies, Santa Clara, Calif., USA Cat#5067-4626).
Libraries were quantified on an Illumina Eco Real-Time PCR machine (Illumina Inc., San Diego, Calif., USA) using KAPA Illumina Library Quantification Kits (KAPA Biosciences, Woburn, Mass., USA Cat#KK4835). Paired-end sequencing of 2×101 cycles was carried out for all libraries on the Illumina HiSeq 2000 platform (Illumina Inc., San Diego, Calif., USA). For normal sample CPCG0004R, six lanes were sequenced. For all other normal samples, four lanes were sequenced. All tumour samples had six lanes of sequencing (
Alignment and Pre-Processing
All sequencing was performed on Illumina HiSeq 2000 machines, with raw base-call and intensity files transferred to network storage during sequencing. FASTQ files were generated using Illumina's CASAVA (v1.8.2), then aligned to the UCSC hg19 human reference (with no repeat-masking) using the Novoalign short-read aligner (v2.07.14; http://www.novocraft.com/). Table 3 lists the parameter values used for this and other processing steps. Novoalign produced output in SAM format34 with properly-configured read groups generated by the Picard tool suite (v1.41; http://picard.sourceforge.net/). These files were converted to BAM format, sorted by coordinate and indexed using Picard (v1.41). All lane-level BAM files were then filtered individually using SAMtools34 (v0.1.18) to remove unmapped reads and reads mapping to multiple genomic locations (Table 3).
To create data sets consisting of varying numbers of lanes, the lane-level BAM files were grouped. For the first ten patients, we generated all possible lane groupings: every individual lane, every pair of lanes, every set of three lanes, etc. (
Quality Metrics
In order to assess the quality of the sequencing data, a list of 15 quality metrics was generated (Table 2). These metrics were evaluated for each grouping using a custom Perl wrapper script. Uncollapsed coverage, collapsed coverage, number of clusters, number of unique start points and average reads per start point were calculated directly from the group-level BAM files (see above).
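The lane groupings on which the metrics were evaluated (every individual lane, every pair of lanes, and so on) can be enumerated with a short Python sketch. The function name and lane labels are hypothetical; the sketch only lists groupings and does not merge BAM files.

```python
from itertools import combinations


def lane_groupings(lanes: list[str]) -> list[tuple[str, ...]]:
    """List every non-empty subset of lanes, ordered by group size."""
    groups: list[tuple[str, ...]] = []
    for size in range(1, len(lanes) + 1):
        # combinations() yields each unordered group of `size` lanes once.
        groups.extend(combinations(lanes, size))
    return groups
```

A sample with n lanes yields 2^n - 1 groupings, so four lanes give 15 groupings and six lanes give 63.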
To determine how many bases were covered at or above various depths, the BEDTools software suite35 (v2.11.2) was used (Table 3). To account for unknown bases in the reference genome, the coverage files were masked—that is, adjusted so that each unknown (‘N’) base in the UCSC hg19 FASTA files would have a coverage value of NA instead of zero. The resulting files were used to find masked coverage across all known bases in the genome.
To generate information about variant calling confidence, a slightly edited version of SAMtools (v0.1.18) was used. The original source code was changed so that the output file generated by the bcftools view command contained only the CHROM, POS and QUAL fields instead of a full VCF file to reduce computing time and disk space (approximately 4-fold savings). The SAMtools mpileup function (v0.1.18) was run on each group-level BAM and the results were piped to the edited bcftools view function in order to obtain a genotype quality score for each base in the genome (Table 3). This score represents the Phred-scaled probability that the base called is incorrect, so that higher quality scores indicate higher confidence calls (http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40).
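The Phred scale referred to above relates a quality score Q to an error probability P by Q = -10·log10(P). A minimal Python sketch of the conversion in both directions follows; the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)
```

A score of 20 therefore corresponds to a 1 in 100 chance that the call is wrong, and a score of 30 to a 1 in 1,000 chance.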
Metric values for all group-level BAMs can be found in Table 4 (oldest 10 samples) and Table 5 (newest 17 samples).
Linear Modelling
Linear modelling was performed in R (v3.0.1) and used the lm function of the base stats package. The models were evaluated on testing data using the predict function of the stats package.
Note that for the dotmap in
Univariate & Principal Components Analysis
Univariate analysis and PCA were performed using all lanes of the 15 tumour training samples. For each metric, a Wilcoxon rank-sum test between lanes from the two outcome classes (<50× vs. ≥50× collapsed coverage) was run using the wilcox.test function in the R base stats package (v3.0.3), and the p-values were extracted. PCA was performed on metric data from all lanes using the prcomp function of the base stats package (v3.0.3).
Random Forest Classification
Random forests28 were trained using the randomForest package (v4.6-7) in R (v3.0.1). Input variables consisted of all metric values for a single-lane group and outcomes consisted of binary (0 or 1) values representing whether the corresponding all-lane grouping (4-lane for normal, 6-lane for tumour) for this sample achieved the specified target coverage depth (30× for normal, 50× for tumour). The forests were grown to 100,000 trees, and all other parameters used default settings. The trained forests were used to classify testing data with the predict function. See Table 12 for a full list of parameters used. Results from random forest training and testing, including the metric data used for input, can be found in Table 6 (in the Appendix) (tumour) and Table 7 (in Appendix) (normal).
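The way inputs and outcomes are paired can be sketched in Python. This is only an illustration of the data layout described above, not the R code used in the study: the function name and dictionary structures are assumptions, and the forest training itself is not shown.

```python
def build_examples(single_lane_metrics: dict[str, list[dict[str, float]]],
                   all_lane_coverage: dict[str, float],
                   target_depth: float):
    """Pair each single-lane metric vector with its sample's binary outcome.

    single_lane_metrics: sample ID -> one metric dictionary per lane.
    all_lane_coverage:   sample ID -> coverage of the all-lane grouping.
    Returns (features, labels); a label is 1 when the all-lane grouping
    reached target_depth and 0 otherwise.
    """
    features, labels = [], []
    for sample, lanes in single_lane_metrics.items():
        label = 1 if all_lane_coverage[sample] >= target_depth else 0
        for metrics in lanes:
            # Every lane of a sample shares the sample-level outcome.
            features.append(metrics)
            labels.append(label)
    return features, labels
```

The resulting feature and label lists could then be passed to any random forest implementation. Note that all lanes from one sample carry the same label, which is why the training and validation split is made by sample rather than by lane.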
For assessing variable importance, the cforest function of the party package (v1.0-13, R v3.0.3) was used. In contrast to the randomForest package, this method accounts for correlations between variables in its forest training and variable importance measures, and therefore gives an unbiased estimate of importance in our classifiers36. For consistency with the randomForest package, an mtry value of 3 was used in model training. See Table 13 for a full list of parameters used.
ROC Comparison
The pROC R package37 (v1.5.4) was used to calculate a p-value representing the significance of the difference between two ROC curves. The comparison was performed with the roc.test function using Venkatraman's test for unpaired ROC curves with 100,000 bootstraps. All other parameters used default settings.
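For orientation, the area under a ROC curve (AUC) that such comparisons build on can be computed directly from classifier scores. The Python sketch below uses the rank-based definition of AUC; it is not Venkatraman's test and does not reproduce pROC, and the function name is an assumption.

```python
def auc(scores_pos: list[float], scores_neg: list[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Here the scores could be, for example, the fraction of trees in a forest voting "yes" for each lane.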
Dilution Analysis
To split each individual BAM file we used Java (v1.6.0_25), Picard (v1.92), and the Rsamtools package (v1.14.3) in the R statistical environment (v3.0.3). Table 14 shows all the parameters used in this step.
For each BAM file we first created a BAM index file (.bai) using Picard's BuildBamIndex function. We then created k files such that each file contained 1/k of the unique read IDs, sampled at random, without replacement, from the original full-lane BAM. From each of the k files we then created a new BAM file representing the subset of reads using Picard's FilterSamReads function (Table 14). The resulting set of k files represents all the reads from the full-lane BAM partitioned into k subsets. This was repeated for k=2, 4, and 8 to generate data for half-lanes, quarter-lanes, and eighth-lanes. Quality metric values for these partial-lane BAMs were generated using the same methods described for full-lane data.
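The random partition of read IDs into k subsets can be sketched in Python as follows. The function name and the fixed seed are illustrative assumptions; reading IDs from a BAM file and writing the filtered BAMs (done with Rsamtools and Picard in the study) are not shown.

```python
import random


def partition_read_ids(read_ids: list[str], k: int,
                       seed: int = 0) -> list[list[str]]:
    """Split the unique read IDs into k disjoint, randomly assigned subsets."""
    if k < 1:
        raise ValueError("k must be at least 1")
    unique_ids = sorted(set(read_ids))   # paired reads share a single ID
    rng = random.Random(seed)            # seeded only for reproducibility
    rng.shuffle(unique_ids)
    # Dealing the shuffled IDs round-robin gives subsets whose sizes
    # differ by at most one and that together cover every ID exactly once.
    return [unique_ids[i::k] for i in range(k)]
```

Each returned list could be written to a text file and supplied to a read-filtering tool to produce one partial-lane BAM per subset.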
Random forest analysis was repeated using dilution data for training and testing. Results of this analysis are summarized in Table 11.
Visualization
All plots were created using custom R scripts executed in the R statistical environment (v3.0.1, v3.0.3). Plots were drawn using the lattice (v0.20-15, v0.20-28) and latticeExtra (v0.6-24) packages.
Data Access
Metric values for all groupings are located in Table 4 and Table 5. For the original ten samples (Table 4), all groupings of every size were generated and processed for both normal and tumour tissue. For the additional seventeen samples (Table 5), only single-lane and all-lane data (4-lane for normal, 6-lane for tumour) was generated and processed. Partial-lane data for the 17 tumours can be found in Tables 8, 9, and 10 (in Appendix) (half-lane, quarter-lane, and eighth-lane, respectively). All raw sequencing data will be deposited in the European Genotype-Phenotype Archive (EGA study accession number: EGAS00001000573) and array data will be deposited in GEO. All code for generating SeqControl metric values and training/testing random forests can be downloaded as SeqControl package v0.0.1.
A weighted k-nearest neighbours model was tuned, and revealed that using k=2 and a distance of 2 with a triangular kernel yielded the best model. The parameter space was searched over values of k from 1 to 100 and distance from 1 to 100. Ten-fold cross-validation was performed on the tuned model to create the ROC curves provided. This was an analysis of 508 lanes of normal and 843 lanes of tumour using the full set of expanded metrics (
One of the major challenges in predicting sequencing quality is variability in library complexity. While the definition of library complexity may vary slightly between applications, it is generally defined as the number of unique DNA fragments in the library (i.e. the number of fragments collected from the original sheared input DNA). In a sequencing context, complexity is sometimes described as the unique reads contained in the total reads from an experiment.
Multiple factors can influence library complexity. The amount of starting DNA is critical in the fragmentation and size selection steps. Since fragmentation is random, an insufficient amount of input DNA leads to a decreased chance of obtaining a fragment of the desired length that spans any given region. Low input volume leads to a library containing fragments from only a small portion of the overall target region, so sufficient starting DNA is critical to generating a high-complexity library. This can be a serious limitation for applications like cancer where the available sample material is often limited. The PCR step is another major source of variability. Bias towards some fragments being over-amplified (for example as caused by GC-content, fragment length, etc.) leads to the under-amplification of others. This results in an overall imbalance and reduces the number of well-represented regions. Consequently the number of PCR cycles performed is critical to creating complex libraries, and this design decision needs to be optimized according to the specific application. The significant bias introduced by the PCR step has led to an increasing focus on amplification-free protocols in recent years. These show promising results but are still in their infancy for human whole-genome sequencing applications.
The inherent variability in library complexity introduced by these factors makes quality prediction challenging, since all libraries have different characteristics and behave differently as additional sequencing is performed. Some researchers create multiple libraries for a single sample that can either use the same or different target fragment size in order to account for the random biases. Other issues cannot be avoided, and this highlights the need for a method to characterize and predict quality on a per-library basis.
What is the Intra- and Inter-Sample Variance in Quality?
To examine the variability in the metric data, we initially focused on 10 tumours and 9 normal references (i.e. the first ten patients analyzed; Table 4). In all cases, we observed marked differences in the quality metrics both within a sample (between lanes) and between samples (
Are Different Quality Metrics Related?
The information content of additional sequencing is non-linearly dependent on several factors, including the number of lanes sequenced, the number of unique molecules in a library and lane-to-lane variability in sequencing, and eventually saturates in both coverage (
Interestingly, normal and tumour samples show divergent correlation profiles. The difference between the two profiles (
How Well Do Existing Tools Predict Sequencing Quality?
A recently developed tool, preseq7, is the state-of-the-art for predicting library complexity. Given a small sub-experiment, it uses a non-parametric empirical Bayes method to predict complexity—measured as the number of unique reads present in a total number of reads sequenced. We applied preseq to all 162 lane-level BAMs from our twenty-seven tumour samples (Table 15 in Appendix). Of these, 52% (85/162) failed because preseq was unable to calculate bootstrap-based 95% confidence intervals. The initial report of preseq suggested this as a typical challenge for low-quality data. Our samples showed similar characteristics to other samples at our centre, although with smaller insert-sizes due to the low-input library preparation protocol used (50 ng input DNA). Interestingly, even on the 77/162 successful lanes, preseq results differed significantly from our observed results (
The definition of library complexity differs according to how duplicate reads are defined (i.e. single-end or paired-end), and this represents the main difference we see between preseq and our results. This discrepancy could easily be accounted for with considered analysis of the preseq results.
However, a more serious limitation is that it is not trivial to deduce how many sequencing lanes are required to yield a desired total number of reads. Our metric data shows that the total number of reads in a lane can vary widely, even within a sample. For experimental design, there is value in determining the actual amount of sequencing that should be performed.
Correlations
Metric values for all groups were read into the R statistical environment (v3.0.1). All Spearman correlations were calculated using the cor function of the base stats package (v3.0.1). Clustering in heatmaps was performed using the diana function of the cluster package (v1.14.4) which implements the divisive analysis clustering technique.
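As an illustration of the statistic itself (the study used R's cor function), Spearman's rank correlation can be computed in plain Python by ranking each variable, with tied values given their average rank, and then taking the Pearson correlation of the ranks. The function names are hypothetical.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Because only ranks are used, the coefficient captures monotonic relationships between two metrics even when they are not linearly related. The sketch does not handle the degenerate case in which one variable is constant.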
Preseq Testing
The preseq tool7 (v0.0.1) was tested by downloading and installing the open-source software. Individual lane-wise BAM files that were filtered according to our pipeline (Table 3) were used as input to the lc_extrap function to produce predicted complexity curve values for each lane. The predicted number of unique reads in the corresponding 6-lane data was deduced from the lc_extrap output using the observed total reads. These predictions were then compared to the observed unique read counts. Complexity curves for the lane-level and 6-lane BAMs were also generated for plotting purposes. All non-default input settings used are listed in Table 16.
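One way to read a predicted unique-read count off an extrapolated complexity curve at the observed total read count is linear interpolation between neighbouring curve points. The Python sketch below assumes the curve is available as (total reads, expected unique reads) pairs; the function name, the input layout and the choice of linear interpolation are assumptions and may differ from the procedure actually used.

```python
from bisect import bisect_left


def predict_unique_reads(curve: list[tuple[float, float]],
                         total_reads: float) -> float:
    """Interpolate expected unique reads at total_reads from a curve.

    curve: (total_reads, expected_unique_reads) points, in any order.
    """
    points = sorted(curve)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if not xs or not xs[0] <= total_reads <= xs[-1]:
        raise ValueError("total_reads lies outside the curve's range")
    i = bisect_left(xs, total_reads)
    if xs[i] == total_reads:
        return ys[i]                     # exact curve point, no interpolation
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (total_reads - x0) / (x1 - x0)
```

The interpolated value can then be compared with the unique read count observed in the corresponding all-lane data.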
Discussion
Data quality remains a critical issue in sequencing studies. Failure to meet pre-planned coverage thresholds is common: in the CPC-GENE ICGC project, 24% of normal samples and 25% of tumours did not initially reach target depth and required “top-up” sequencing. This imprecision may be acceptable in research settings, but not when sequencing is a component of clinical and industrial processes. For such applications, systematic strategies for evaluating and predicting data quality are needed. We demonstrate the viability of using statistical process control techniques for next-generation sequencing to refine and enhance existing sequencing pipelines. However, systematic retrospective evaluations are computationally challenging: over 260 TB and 88,000 CPU hours were invested into this study.
Therefore a first key step for most centres will be modifying existing Laboratory Information Management Systems (LIMS) and pipelines to routinely collect and store quality metrics. Over time, prospective data-collection will create large databases without the costs of retrospective analyses. Periodic model re-training will facilitate detection of changes in error profiles (
While our approach is general, some models may need retraining for new platforms and experimental protocols. For example, the SeqControl methodology could be applied to targeted sequencing data (e.g. exome or ChIP-Seq), but each library construction protocol may require its own model. Similarly, each application may require specific tuning: 50× coverage is common today but some studies sequence to 200×29 or deeper30,31. New technologies will have different error and performance characteristics, necessitating development of new control models during initial work-up. Input sample type (e.g. different tissue or tumour types) may also influence sequencing results: the correlation of classifier votes with tumour cellularity hints at this. Fortunately, Random Forests can be trained in a few CPU minutes. SeqControl is open-source and can be easily extended with novel statistical and feature-selection approaches that incorporate these additional covariates. This flexibility underlies the wide range of research and commercial applications that can benefit from the use of predictive quality control.
The challenge of accurately predicting outputs of a complex system given varied and incomplete information about the inputs is common in industrial engineering. For NGS data, as in other industries, the solution will likely involve the application and extension of techniques from the fields of control theory and statistical process control. While the prediction of next-generation sequencing data quality is an emerging field, our results suggest there is untapped value in retaining and analyzing quality metric data. SeqControl represents rigorous control and optimization of next-generation sequencing.
It will be appreciated by those skilled in the art that other variations of the embodiments described herein may also be practiced without departing from the scope of the invention. Other modifications are therefore possible.
Although the disclosure has been described and illustrated in exemplary forms with a certain degree of particularity, it is noted that the description and illustrations have been made by way of example only. Numerous changes in the details of construction and combination and arrangement of parts and steps may be made. Accordingly, such changes are intended to be included in the invention, the scope of which is defined by the claims.
Except to the extent explicitly stated or inherent within the processes described, including any optional steps or components thereof, no required order, sequence, or combination is intended or implied. As will be understood by those skilled in the relevant arts, with respect to both processes and any systems, devices, etc., described herein, a wide range of variations is possible, and even advantageous, in various circumstances, without departing from the scope of the invention, which is to be limited only by the claims. All references and publications mentioned in this specification, including in the following reference list, are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CA2015/050710 | 7/27/2015 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62029112 | Jul 2014 | US |