The present disclosure relates generally to the field of data related to biological samples, such as sequence data. More particularly, the disclosure relates to techniques for determining copy number variation based on sequencing data.
Genetic sequencing has become an increasingly important area of genetic research, promising future uses in diagnostic and other applications. In general, genetic sequencing involves determining the order of nucleotides for a nucleic acid such as a fragment of RNA or DNA. Some techniques involve whole genome sequencing, which involves a comprehensive method of analyzing a genome. Other techniques involve targeted sequencing of a subset of genes or regions of the genome. Targeted sequencing focuses on regions of interest, generating a smaller and more compact data set. Further, targeted sequencing reduces sequencing costs and data analysis burdens while also allowing deep sequencing at high coverage levels for detection of variants in the regions of interest. Examples of such variants may include somatic mutations, single nucleotide polymorphisms, and copy number variations. Detection of variants may provide clinicians with information about disease likelihood or susceptibility. Accordingly, there is a need for improved detection of variants in sequencing data.
The present disclosure provides a novel approach for detection of copy number variations in a biological sample. As provided herein, copy number variations (CNVs) are genomic alterations that result in an abnormal number of copies of one or more genomic regions. Structural genomic rearrangements such as duplications, multiplications, deletions, translocations, and inversions can cause CNVs. Like single-nucleotide polymorphisms (SNPs), certain CNVs have been associated with disease susceptibility. The term “copy number variation” herein may refer to variation in the number of copies of a nucleic acid sequence present in a test sample of interest in comparison with an expected copy number. For example, for humans, the expected copy number of autosome sequences (and X chromosome sequences in females) is two. Other organisms may have different expected copy numbers according to their genomic structure. Copy number variation may be the result of duplication or deletion. In certain embodiments, copy number variants refer to sequences of at least 1 kb that are duplicated or deleted. In one embodiment, copy number variants may be at least a single gene in size. In another embodiment, copy number variants may be at least 140bp, 140-280bp, or at least 500bp.
In one embodiment, a “copy number variant” refers to the sequence of nucleic acid in which copy-number differences are found by comparison of a sequence of interest in a test sample with an expected level of the sequence of interest. As provided herein, a reference sample is derived from a set of sequencing data of unmatched samples to generate normalization information that permits an individual test sample to be normalized such that deviations from expected copy numbers may be determined on normalized sequencing data. The normalization data is generated using the techniques provided herein and permits normalization to a hypothetical most representative sample matched to the test sample. By normalizing the test sample, noise introduced by sequencing or other bias is removed.
In certain embodiments, the raw sequencing data coverage from a targeted sequencing run is normalized to reduce technical and biological noise to improve CNV detection. In one embodiment, samples of interest (e.g., formalin-fixed paraffin-embedded samples) are sequenced according to a desired sequencing technique, such as a targeted sequencing technique that uses a sequencing panel of probes to target regions of interest. Once the sequencing data is collected, the sequencing data is normalized to remove noise, and the normalized data is subsequently analyzed to detect CNVs.
In one embodiment, a method of normalizing copy number is provided that includes the steps of receiving a sequencing request from a user to sequence one or more regions of interest in a biological sample; acquiring baseline sequencing data from the one or more regions of interest from a plurality of baseline biological samples that are not matched to the biological sample; determining copy number normalization information using the baseline sequencing data, wherein the copy number normalization information comprises at least one copy number baseline for a region of interest of the one or more regions of interest; and providing the copy number normalization information to the user.
In another embodiment, a method of detecting copy number variation is provided that includes the steps of acquiring sequencing data from a biological sample, wherein the sequencing data comprises a plurality of raw sequencing read counts for a respective plurality of regions of interest; and normalizing the sequencing data to remove region-dependent coverage. The normalizing comprises: for each region of interest, comparing a raw sequencing read count of one or more bins in a region of interest of the biological sample to a baseline median sequencing read count to generate a baseline-corrected sequencing read count for the one or more bins in the region of interest, wherein the baseline median sequencing read count for one or more bins in the region of interest is derived from a plurality of baseline samples that are not matched to the biological sample and is determined from only the most representative portions of the baseline sequencing data for each region of interest; and removing GC bias from the baseline-corrected sequencing read count to generate a normalized sequencing read count for each region of interest. The method also includes determining copy number variation in each region of interest based on the normalized sequencing read count of the one or more bins in each region of interest.
In another embodiment, a method of assessing a targeted sequencing panel is provided that includes the steps of identifying a first plurality of targets in a genome for a targeted sequencing panel, wherein the first plurality of targets corresponds to portions of a respective plurality of genes; determining a GC content of each of the first plurality of targets; eliminating targets of the first plurality of targets with GC content outside of a predetermined range to yield a second plurality of targets smaller than the first plurality of targets; when, after the eliminating, an individual gene has fewer than a predetermined number of targets corresponding to portions of the individual gene, identifying additional targets in the individual gene; adding the additional targets to the second plurality to yield a third plurality of targets; and providing a sequencing panel comprising probes specific for the third plurality of targets.
The present techniques are directed to analysis and processing of sequencing data for improved somatic copy number variation (CNV) detection. CNV detection is often confounded by various types of bias introduced during sample preservation, library preparation, or sequencing. Without bias, read depth/coverage should be uniform across the genome for diploid regions, and proportionally higher (lower) for copy number gain (loss) regions. With bias, this assumption is no longer valid, at least for regions of the genome that are subject to bias. Removal of bias or normalizing the data first, e.g., prior to CNV detection, achieves more accurate CNV calling as provided herein.
Provided herein are techniques that generate a reference baseline for an individual biological sample that is useful for normalizing the sequencing data before assessing variations that are representative of copy number changes for one or more regions of interest in a genome. The disclosed techniques provide reference or normalization information without relying on a matched sample from the individual from whom the test sample is obtained to normalize a test sample. While other techniques may use the patient’s own tissue to generate the reference, using a matched sample taken from the same individual as the biological sample presents certain challenges. For example, variation in sample collection (sample quality, selected tissue sites) may mean that the reference sample is not truly representative of normal tissue. Further, insofar as the introduction of bias that influences sequencing data may vary between samples, the matched reference sample may have a different level of introduced bias relative to the test sample, which in turn may lead to inaccuracies and inadequately normalized data. In addition, not all test samples have available matched tissue or matched tissue of sufficiently high quality for sequencing.
Accordingly, the disclosed techniques facilitate more accurate copy number variation assessment by generating normalization information with reduced bias and without using a matched sample. The normalization information may be used to normalize a set of sequencing data prior to CNV detection in the individual sample. The normalization information is generated using a set or pool of unmatched reference baseline biological samples. Sequencing data generated from this set is then used to generate normalization information that is representative of a most typical hypothetical matched reference sample. That is, the normalization information represents a virtual calibrated gold standard reference against which any individual test sample may be normalized.
In certain embodiments, CNVs may be detected using whole genome sequencing techniques. However, such techniques are expensive and involve generating data that may be outside the regions of interest. In other embodiments, using targeted sequencing techniques to detect CNVs is less expensive and is associated with a faster turnaround time. In targeted sequencing, the targeted probes are used to pull down regions of interest from the sample DNA for sequencing; the probes used may vary depending on the regions of interest and the desired detection outcome. However, the coverage of sequencing data from a targeted sequencing run may be variable due to varying characteristics of the regions of interest (e.g., the target sequences) in the genome, the probes, and the quality of the sample itself. For example, probes specific for larger targets (e.g., longer exons) will typically have more reads or coverage than probes for smaller targets. In another example, degraded areas of the DNA in a biological sample will have fewer reads. In yet another example, GC-rich or GC-poor regions of interest will have variations in coverage that may be nonlinear. Accordingly, variability in coverage for sequencing data from targeted sequencing runs may introduce noise that interferes with the accuracy of CNV detection based on coverage/read depth.
Table 1 illustrates the common types of sequencing bias/noise present in enrichment data. For example, different probes may have different pull-down efficiency, thereby creating uneven coverage across different regions (baseline effect). Coverage might also be GC dependent; regions with low or high GC content have lower coverage in general. Additionally, coverage might be affected by formalin-fixed paraffin-embedded (FFPE) sample quality or sample type. All of the aforementioned artifacts present challenges for amplification detection. CNV Robust Analysis aims to remove these biases (i.e., using data normalization) before CNV calling.
The disclosed techniques leverage a panel of reference normal samples to remove the need to use a matched normal sample in read count normalization of a tumor sample. Specifically, sequence read count bias is strongly correlated with the tissue type and DNA quality of a test sample, with an impact equivalent to, if not stronger than, that of the germline genetics of the sample. Therefore, given a good variety of reference normal samples representing different tissue types and different DNA quality, CRAFT assembles in silico a “virtual” matched normal sample for a test tumor sample through a linear combination of all the reference normal samples.
The panel of reference normal samples goes through a data-driven clustering process to form read count baselines. Each reference baseline is a representative of certain tissue type, DNA quality, and other systematic background on read count bias, rather than the true copy number changes in a genome. For a test sample, a linear regression of the reference baselines is performed against the sample read count data to determine the coefficient of each baseline. Each test sample results in a unique set of coefficients, mimicking a virtual matched normal sample. When a user acquires sequencing data with the particular sequencing panel, the user can normalize the acquired sequencing data using the coefficients. In one embodiment, coefficients may be applied via a linear combination to yield a weighted copy number value for a particular region of interest (e.g., a gene).
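By way of non-limiting illustration, the regression of the reference baselines against the sample read counts may be sketched as an ordinary least-squares fit. The function name `fit_virtual_normal`, the array layout (one column per baseline), and the toy values are assumptions made for this sketch and are not taken from the disclosure.

```python
import numpy as np

def fit_virtual_normal(baselines, sample):
    """Fit sample bin counts as a linear combination of reference baselines.

    baselines: (n_bins, n_baselines) array, one column per reference baseline.
    sample:    (n_bins,) array of bin counts for the test sample.
    Returns (coefficients, virtual_normal), where virtual_normal is the
    weighted sum of the baselines, i.e. the "virtual" matched normal.
    """
    baselines = np.asarray(baselines, dtype=float)
    sample = np.asarray(sample, dtype=float)
    coef, *_ = np.linalg.lstsq(baselines, sample, rcond=None)
    return coef, baselines @ coef

# Toy data (hypothetical): two baselines over five bins.
baselines = np.array([[100.0, 80.0],
                      [120.0, 60.0],
                      [90.0, 110.0],
                      [70.0, 95.0],
                      [130.0, 75.0]])
# A sample that behaves like 70% of baseline 1 plus 30% of baseline 2.
sample = 0.7 * baselines[:, 0] + 0.3 * baselines[:, 1]

coef, virtual_normal = fit_virtual_normal(baselines, sample)
```

In this toy case the recovered coefficients are approximately 0.7 and 0.3, and each test sample would yield its own coefficient set in the same way.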
To that end, the disclosed techniques eliminate or reduce copy number variation assessment errors that result from sequencing bias.
At step 12, a user acquires a biological sample of interest for assessment. The biological sample may be a tissue sample, fluid sample, or other sample containing at least a portion of a genome or genomic DNA. In certain embodiments, the biological sample is fresh, frozen, or preserved using standard histopathological preservatives such as FFPE. The biological sample may be a test sample or may be an internal sample used to generate the normalization information. In embodiments in which the biological sample is assessed using a targeted sequencing panel, the user transmits a targeted sequencing request to a provider, whereby the request includes a selected pre-existing sequencing panel and/or a customized sequencing panel based on desired regions of interest in the genomic DNA of the sample. The request may include customer information, biological sample organism information, biological sample type information (e.g., information identifying whether the sample is fresh, frozen, or preserved), tissue type, and desired sequencing assay type. The request may also include nucleic acid sequences for desired probes of a sequencing panel and/or nucleic acid sequences of regions of interest in a genome that may be used by the provider to design and/or generate probes for a targeted sequencing panel.
The provider receives the request at step 14 and designs and/or generates probes to be used in the sequencing based on the designated probe set and/or the designated regions of interest (e.g., bins) at step 16. In certain embodiments, for pre-existing sequencing panels, the probes may be generated and kept in inventory before the request is received at step 14. The probes are provided to the user at step 20 and, subsequent to any relevant sample preparation at step 22, used to sequence the biological sample at step 24. The user acquires sequencing data from the sequencing at step 26.
When the user selects probes for a targeted sequencing panel, the probes are also used in a baseline sequencing reaction on a set of non-matched samples (e.g., other biological samples that are not matched to or from the same individual as the biological sample) to acquire baseline sequencing data at step 28. The baseline sequencing data is used to generate normalization information at step 30, which is provided to the user at step 32. Using the normalization information, the user normalizes the sequencing data of the test sample and subsequently analyzes the acquired sequencing data of the biological sample at step 34 to identify copy number variants for locations that are included in the targeted sequencing panel. That is, in the context of a targeted sequencing panel, which facilitates sequencing of only a portion of the genome, only copy number variants present in the sequenced portion can be identified. This is in contrast to whole genome applications in which copy number variants throughout the entire genome may be identified according to the present techniques.
In response to identifying the copy number variants, an output may be provided to the user at step 36. The output may include a displayed graphical user interface (see
The user may be an external or internal user of sequencing services of the provider. For example, the steps of the flow diagram 10 may be performed as a part of calibrating or generating any new targeted sequencing panel product, which may also include an external request for a customized sequencing panel. A given targeted sequencing panel will be associated with particular bias tendencies based on the regions of interest targeted by the panel probes. This bias may interfere with accurate assessment of copy number variation. Accordingly, the steps of the flow diagram 10 may be performed when any targeted sequencing panel that includes a set of probes is designed, modified, or updated. In another embodiment, if a user request includes regions of interest in a genome, a panel including a set of probes may be generated and evaluated using the disclosed techniques to yield normalization information. The normalization information may be evaluated using a set of metrics. If the metrics indicate that the panel yields poor normalization information, the panel may be discarded and the probes redesigned (e.g., shifted 50 bp in either direction). The new probes may be tested using the steps of the flow diagram 50 until high quality normalization information is obtained. In one embodiment, the metrics are obtained by applying the normalization information before identifying copy number variants in an internal sample. If the identified copy number variants across the sequenced regions deviate from an expected distribution, an output may be provided indicating that a new sequencing panel (e.g., a probe redesign) should be triggered. The expected distribution may be associated with a likely distribution of copy number variants. For example, most variants are within a two or three-fold change in either direction. If the internal sample is shown to have a larger than expected distribution of 10-fold or higher variants, the analyzed sample may be indicated as deviating from the expected distribution.
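As one hedged illustration of such a metric, the check below flags a set of per-gene fold changes when too large a share of them is 10-fold or higher in either direction. The function name `deviates_from_expected` and the 5% cutoff (`max_extreme_fraction`) are hypothetical choices for this sketch; the disclosure does not specify a numeric cutoff.

```python
import numpy as np

def deviates_from_expected(fold_changes, extreme_fold=10.0,
                           max_extreme_fraction=0.05):
    """Return (flag, extreme_fraction) for a set of positive fold changes.

    A gain of 10x and a loss to 0.1x are both treated as a 10-fold change.
    max_extreme_fraction is an assumed cutoff, not a value from the source.
    """
    fc = np.asarray(fold_changes, dtype=float)
    magnitude = np.maximum(fc, 1.0 / fc)  # symmetric fold-change magnitude
    extreme_fraction = float(np.mean(magnitude >= extreme_fold))
    return extreme_fraction > max_extreme_fraction, extreme_fraction

# Hypothetical internal samples: one typical, one dominated by extreme calls.
typical = [1.0, 1.2, 0.8, 2.0, 0.5, 1.1, 0.9, 3.0, 1.0, 1.0]
noisy = [12.0, 0.05, 1.0, 15.0, 1.0, 0.08, 1.0, 1.0, 20.0, 1.0]

typical_flag, typical_fraction = deviates_from_expected(typical)
noisy_flag, noisy_fraction = deviates_from_expected(noisy)
```

Under these assumptions, a flagged result would correspond to the output indicating that a probe redesign should be triggered.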
The sequencing data generated by sequencing the biological sample may be analyzed to characterize any copy number variation after being normalized using the normalization information. It should be understood that the biological sample sequencing data and the baseline sequencing data may be in the form of raw data, base call data, or data that has gone through primary or secondary analysis.
Further, it should be understood that CNVs may be identified as being part of a gene, an intragenic region, etc. It should also be understood that CNV detection may be associated with duplicate or deleted sequences. Accordingly, CNV detection may represent duplicate copies of a nucleic acid region, such as a region including one or more genes. In one embodiment, CNVs are duplicate or deleted genomic regions of at least 1kb in size.
Sequencing coverage describes the average number of sequencing reads that align to, or “cover,” known reference bases. The coverage level often determines whether variant discovery can be made with a certain degree of confidence at particular base positions. At higher levels of coverage, each base is covered by a greater number of aligned sequence reads, so base calls can be made with a higher degree of confidence. Reads are not distributed evenly over an entire genome, simply because the reads will sample the genome in a random and independent manner. Therefore, many bases will be covered by fewer reads than the average coverage, while other bases will be covered by more reads than average. This is expressed by the coverage metric, which is the number of times a genome has been sequenced (the depth of sequencing). For targeted resequencing, coverage refers to the number of times the targeted subset of the genome is sequenced. The disclosed embodiments address noise in sequencing coverage due to bias.
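As a minimal sketch of the coverage metric, per-base depth over a region can be tallied from aligned read intervals and then averaged. The function name `per_base_depth`, the half-open 0-based interval convention, and the toy reads are assumptions for illustration only.

```python
import numpy as np

def per_base_depth(read_intervals, region_length):
    """Count how many reads cover each base of a region.

    read_intervals: iterable of (start, end) pairs, 0-based, end exclusive.
    Reads are clipped to the region [0, region_length).
    """
    diff = np.zeros(region_length + 1, dtype=int)
    for start, end in read_intervals:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1   # a read begins covering here
            diff[end] -= 1     # and stops covering here
    return np.cumsum(diff[:-1])

# Four hypothetical reads aligned to a 10-base region.
reads = [(0, 5), (2, 8), (4, 10), (4, 9)]
depth = per_base_depth(reads, 10)
mean_coverage = depth.mean()
```

Here the mean coverage is 2.2x, yet individual bases are covered between one and four times, which illustrates why some bases fall below and others above the average.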
In the depicted embodiment, the sequencing device 60 includes a separate sample processing device 62 and an associated computer 64. However, as noted, these may be implemented as a single device. Further, the associated computer 64 may be local to or networked with the sample processing device 62. In the depicted embodiment, the biological sample may be loaded into the sample processing device 62 as a sample slide 70 that is imaged to generate sequence data. For example, reagents that interact with the biological sample fluoresce at particular wavelengths in response to an excitation beam generated by an imaging module 72 and thereby return radiation for imaging. For instance, the fluorescent components may be generated by fluorescently tagged nucleic acids that hybridize to complementary molecules of the components or to fluorescently tagged nucleotides that are incorporated into an oligonucleotide using a polymerase. As will be appreciated by those skilled in the art, the wavelength at which the dyes of the sample are excited and the wavelength at which they fluoresce will depend upon the absorption and emission spectra of the specific dyes. Such returned radiation may propagate back through the directing optics. This retrobeam may generally be directed toward detection optics of the imaging module 72.
The imaging module detection optics may be based upon any suitable technology, and may be, for example, a charge-coupled device (CCD) sensor that generates pixelated image data based upon photons impacting locations in the device. However, it will be understood that any of a variety of other detectors may also be used including, but not limited to, a detector array configured for time delay integration (TDI) operation, a complementary metal oxide semiconductor (CMOS) detector, an avalanche photodiode (APD) detector, a Geiger-mode photon counter, or any other suitable detector. TDI mode detection can be coupled with line scanning as described in U.S. Pat. No. 7,329,860, which is incorporated herein by reference. Other useful detectors are described, for example, in the references provided previously herein in the context of various nucleic acid sequencing methodologies.
The imaging module 72 may be under processor control, e.g., via a processor 74, and the sample receiving device 18 may also include I/O controls 76, an internal bus 78, non-volatile memory 80, RAM 82 and any other memory structure such that the memory is capable of storing executable instructions, and other suitable hardware components that may be similar to those described with regard to
The present techniques facilitate detecting or calling CNVs in biological samples (e.g., tumor samples) without first normalizing the sequencing data to matched sequencing data. The technique uses a preprocessing step to generate a manifest file and a baseline file, which are used as input parameters for the normalization step. The manifest file and the baseline file are generated independent of and prior to analysis of a sample of interest to determine copy number variation. The manifest file and the baseline file are generated from non-matched samples (i.e., non-matched normal samples) and are determined via the baseline generation technique as provided herein. Baseline generation may be performed on the non-matched normal samples and the results of the baseline generation stored as baseline information (or normalization information) for access by executable instructions of the normalization technique. For example, a user with a sample of interest may perform analysis of one or more CNVs. In certain embodiments, after generation and storage, the baseline information is used in the analysis of a plurality of samples of interest at different and/or subsequent time points. The user may access the stored files based on the sequencing panel that corresponds to the baseline information.
In one embodiment, the copy number normalization information, once generated, is fixed for a particular sequencing panel. That is, the copy number normalization information is associated with the particular probes of the sequencing panel and is stored by the provider and sent to the user of the particular sequencing panel. Different sequencing panels have different copy number normalization information. In another example, a CNV-calling software package may store a plurality of different copy number normalization information, each associated with different sequencing panels. The user may select the appropriate normalization information based on the sequencing panel used to acquire the sequencing data. Alternatively, the sequencing device 60 may automatically acquire the appropriate copy number normalization information based on information input by the user related to the sequencing panel used. The CNV-calling software package may also be capable of receiving updates from a remote server if the copy number normalization information is refined by the provider.
The problem of somatic copy number variation detection is solved by identifying representative baseline coverage behavior using a hierarchical clustering method and then leveraging linear regression and Loess regression for data normalization, as summarized in
The preprocessing (algorithm training) may include the following steps:
After baseline generation, normalization (applied to assessed samples) is performed using the reference baseline generated above. The new sample is first scaled to the normalization information by target size and median bin count 114.
1. Baseline correction 116: for a new sample, model its bin count as a linear combination of baselines: Y~c1+c2+c3. Due to potential CNVs in the new sample, outliers are first removed from Y, and the linear model is built on the outlier-removed values. Bin counts that deviate by more than 3 standard deviations are considered outliers. In certain embodiments, outliers are masked. In other embodiments, only extreme outliers are removed or masked. Then, the ratio of Y to the linear model prediction is used as the baseline-corrected value.
2. Robust loess regression 118 to remove GC bias after step 1.
3. For each gene, calculate its fold change 124 by comparing its median bin value to the genome median. Additional statistics, e.g., t-stat for each gene 126, may also be determined.
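Steps 1 and 3 above may be sketched as follows, under stated assumptions: outliers are taken as bin counts more than 3 standard deviations from the mean of Y (the reference point is not specified above), the sample is assumed to be already scaled, and the names `baseline_correct`, `gene_fold_changes`, and the gene labels are hypothetical. The GC correction of step 2 and the per-gene t-statistic are omitted from this sketch.

```python
import numpy as np

def baseline_correct(sample, baselines, n_sd=3.0):
    """Step 1: fit Y on the baselines with outlier bins masked.

    Returns (corrected, keep), where corrected = Y / model prediction and
    keep marks the bins used to build the linear model.
    """
    y = np.asarray(sample, dtype=float)
    X = np.asarray(baselines, dtype=float)
    # Assumption: outliers are measured from the mean of Y.
    keep = np.abs(y - y.mean()) <= n_sd * y.std()
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return y / (X @ coef), keep

def gene_fold_changes(corrected, gene_of_bin):
    """Step 3: per-gene median bin value relative to the genome median."""
    corrected = np.asarray(corrected, dtype=float)
    gene_of_bin = np.asarray(gene_of_bin)
    genome_median = np.median(corrected)
    return {str(g): float(np.median(corrected[gene_of_bin == g]) / genome_median)
            for g in np.unique(gene_of_bin)}

# Toy data (hypothetical): 40 bins, two baselines, one amplified gene.
b1 = np.linspace(80.0, 120.0, 40)
b2 = np.tile([90.0, 110.0, 100.0, 95.0], 10)
baselines = np.column_stack([b1, b2])
sample = 0.6 * b1 + 0.4 * b2
sample[:2] *= 4.0  # simulate a 4-fold amplification in the first two bins
genes = np.array(["GENE_A"] * 2 + ["GENE_B"] * 19 + ["GENE_C"] * 19)

corrected, keep = baseline_correct(sample, baselines)
fold_changes = gene_fold_changes(corrected, genes)
```

Because the amplified bins are masked before the fit, the linear model is built only on the remaining bins, and the ratio for the amplified gene reflects its gain rather than being absorbed into the coefficients.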
The present techniques do not use or require matched normal samples to perform normalization. Instead, the normalization techniques herein use non-matched normal samples to generate reference baselines from which fold changes are detected. In certain embodiments, a plurality of normal samples are used to determine the reference baselines, and clustering of sequencing data of the plurality of samples is performed to determine the most representative normal bins. Accordingly, the reference baseline values are assessed on a per bin basis and not on a per sample basis. In addition, the present techniques incorporate more than one baseline behavior value in historical normal samples. The present techniques leverage linear regression for baseline correction, and Loess for GC correction. Results achieved include 100% sensitivity in R2 DVT study (including certain no-calls).
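The GC correction may be illustrated with a simplified local regression: a tricube-weighted local linear fit with bisquare reweighting, written here in plain NumPy. This is a stand-in sketch rather than the specific robust Loess implementation of the present techniques; the window fraction, the number of robustness passes, and the toy GC trend are all assumptions.

```python
import numpy as np

def robust_loess(x, y, frac=0.5, robust_iters=2):
    """Local linear regression with tricube weights and bisquare reweighting."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(frac * n))))  # points per local window
    design = np.column_stack([np.ones(n), x])
    robust_w = np.ones(n)
    fitted = np.zeros(n)
    for _ in range(robust_iters + 1):
        for i in range(n):
            d = np.abs(x - x[i])
            h = max(np.sort(d)[k - 1], 1e-12)  # window half-width
            w = np.clip(1.0 - (d / h) ** 3, 0.0, None) ** 3 * robust_w
            sw = np.sqrt(w)
            beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
            fitted[i] = beta[0] + beta[1] * x[i]
        resid = y - fitted
        s = np.median(np.abs(resid))
        if s == 0:
            break
        # Bins far from the trend (e.g., true CNVs) are down-weighted to zero.
        robust_w = np.clip(1.0 - (resid / (6.0 * s)) ** 2, 0.0, None) ** 2
    return fitted

def gc_correct(values, gc):
    """Divide baseline-corrected values by their fitted GC trend."""
    return np.asarray(values, dtype=float) / robust_loess(gc, values)

# Toy data (hypothetical): coverage dips at GC extremes; one amplified bin.
gc = np.linspace(0.3, 0.7, 41)
values = 1.0 - 4.0 * (gc - 0.5) ** 2
values[20] *= 3.0
corrected = gc_correct(values, gc)
```

The robustness passes matter here because a true amplification should not pull the fitted GC trend upward; after reweighting, the amplified bin remains close to its 3-fold value while the other bins are flattened toward 1.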
In comparison to other techniques, the normalization as provided yields better performance than control-free approaches in terms of limit of blank (LoB) and limit of detection (LoD). Further, normalization is more economical relative to techniques using matched normal samples that require additional sample processing. CNV calling using normalization is more economical because the sequencing costs do not include costs for sequencing of matched normal samples. Accordingly, the sequencing run and operation of the sequencing device are more efficient. Other approaches, such as reference-free approaches, do not yield high quality results due to probe pull-down effects. Statistical techniques that use singular value decomposition (SVD) or principal component analysis (PCA) also do not yield high quality results and/or have limited applicability for certain sample types.
In particular embodiments, a bin as provided herein refers to a contiguous nucleic acid region of interest of a genome. A bin may be exonic, intronic, or intragenic. Bins or bin regions may include variants, and, therefore, generally refer to the location or region of the genome rather than a fixed nucleic acid sequence. Bin counting is done at the fragment level, not the read level. For example, genes A and B, as shown in
In certain embodiments, probe target selection may be improved to reduce the introduction of noise in the sequencing data. For example, in one technique, the probe selection may occur as outlined: for each gene, identify the number of targets with GC content between 0.3 and 0.8. If the number is smaller than 20, identify regions not covered by the current probe design. Create equally spaced windows of size 140bp and compute the GC content and mappability (75mer) for each window. Select the top K windows by mappability and GC content. For the Y chromosome, which is used for gender classification, randomly select 40 regions with mappability of 1 and GC between 0.4 and 0.6.
In one example, 116 out of 170 genes in probe set 2C have fewer than 20 targets. 1042 additional targets are selected. 31 out of 49 amp genes have fewer than 20 targets. 350 additional targets are selected. For the Y chromosome, 40 targets are selected for gender classification. In sum, to cover all the 49 amp genes with at least 20 targets/gene, add 390 additional targets (140bp windows) to probe set 2C. FGF4, CDK4 and MYC still have fewer than 20 targets due to small gene size. Gene targets for certain genes are shown in Table 2.
Embodiments of the disclosed techniques include graphical user interfaces that display copy number variation information and that provide outputs or indications to a user and/or receive user input.
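The window selection outlined above can be sketched as follows. The function names, the non-overlapping tiling of 140bp windows, the ranking rule (mappability first, then closeness of GC content to 0.5), and the precomputed `mappability` lookup are all assumptions for illustration; the source does not specify how mappability and GC content are combined when ranking.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def candidate_windows(sequence, window=140):
    """Yield (start, gc) for consecutive non-overlapping windows."""
    for start in range(0, len(sequence) - window + 1, window):
        yield start, gc_fraction(sequence[start:start + window])

def select_windows(sequence, mappability, k, window=140,
                   gc_min=0.3, gc_max=0.8):
    """Pick up to k window starts with acceptable GC content.

    mappability: dict of window start -> score in [0, 1]; a hypothetical
    precomputed input (e.g., from a 75mer mappability track).
    """
    scored = []
    for start, gc in candidate_windows(sequence, window):
        if gc_min <= gc <= gc_max:
            # Assumed ranking: higher mappability, then GC closer to 0.5.
            scored.append((mappability.get(start, 0.0), -abs(gc - 0.5), start))
    scored.sort(reverse=True)
    return [start for _, _, start in scored[:k]]

# Toy uncovered region (hypothetical): four 140bp windows.
region = ("AT" * 70                      # GC 0.0, rejected
          + "GC" * 35 + "AT" * 35        # GC 0.5, accepted
          + "G" * 140                    # GC 1.0, rejected
          + ("GCA" * 47)[:140])          # GC about 0.67, accepted
mappability = {0: 1.0, 140: 0.9, 280: 1.0, 420: 1.0}

top_one = select_windows(region, mappability, k=1)
top_two = select_windows(region, mappability, k=2)
```

In this toy region only the windows starting at 140 and 420 pass the GC filter, and the window at 420 ranks first because of its higher mappability.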
Technical effects of the disclosed embodiments include improved and more accurate determination of CNVs in a biological sample. Copy number variants may be associated with genetic disorders, cancer progression, or other adverse clinical conditions. Accordingly, improved CNV detection may permit sequencing data to provide richer and more meaningful information to clinicians. Further, the disclosed CNV assessment techniques may be used in conjunction with targeted sequencing techniques, which sequence only a portion of the genome. In this manner, CNVs may be identified from a more efficient sequencing strategy. The normalization techniques as provided herein address bias introduced into sequencing data that affects sequencing coverage counts.
While only certain features of the disclosure have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the disclosure.
This application claims priority to and the benefit of International Application No. PCT/US2017/052766, filed on Sep. 21, 2017, which claims priority to and the benefit of U.S. Provisional Application No. 62/398,354, entitled “SOMATIC COPY NUMBER VARIATION DETECTION” and filed Sep. 22, 2016, and to U.S. Provisional Application No. 62/447,065, entitled “SOMATIC COPY NUMBER VARIATION DETECTION” and filed Jan. 17, 2017, the disclosures of which are incorporated herein by reference for all purposes.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2017/052766 | 9/21/2017 | WO |
Number | Date | Country | |
---|---|---|---|
62447065 | Jan 2017 | US | |
62398354 | Sep 2016 | US |