The invention relates to methods, systems, and apparatuses relating to DNA sequencing. Specifically, the invention relates to evaluating reference sequences for use in genome sequencing.
Next Generation Sequencing (NGS) (e.g., whole-genome shotgun sequencing, pyrosequencing, ion semiconductor sequencing, sequencing by synthesis) is a powerful tool for sequencing an entire genome by generating a large number of sequence reads spanning a whole genome and assembling the sequence reads into longer contigs.
One approach to genome sequencing using NGS is “de novo sequencing,” where sequence reads are assembled into longer contigs by lining up overlapping ends of the sequence reads. However such an approach requires much computing power and time, for example, due to the high number of contigs and scaffolds that are generated. In contrast reference-assisted assembly provides a more efficient way to assemble genome-wide or chromosome-wide sequencing by arranging contigs and scaffolds into putative chromosomes using information from a reference sequence. Thus, the current paradigm for whole-genome sequencing analysis using NGS is the reference-assisted assembly. In order for reference-assisted assembly to work well, the genome being sequenced must be sufficiently similar to the reference genome.
However, while an individual's genome may have high homology in the vast majority of the reference genome, often there are regions in which homology is insufficient. For example, persons with African ancestry have many regions of the genome that widely diverge from commonly used reference sequences, such as the GRCh38 human genome reference. In these insufficient homology regions (“IHRs”) of the reference genome, a sequence alignment program (or “aligner”) may cause sequence reads to align incorrectly, resulting in erroneous variant calls. In such instances, some aligners will attempt to perform a de novo reassembly in these regions based on various measures. For example, the Genome Analysis Toolkit (GATK) Haplotype Caller calculates a per-base “activity” measure based on the variability of individual bases aligned at each position. More specifically, GATK HaplotypeCaller may be used to identify IHRs (“Active Regions”) in which the samples being analyzed show substantial evidence of variation relative to the reference sequence in the following way: (1) based on a reference alignment, GATK HaplotypeCaller computes an activity score (between 0 and 1) for each individual genome position; (2) activity scores are then “smoothed” (i.e., averaged over a sliding window) to yield an “activity profile.” Regions exceeding a threshold value are selected for local reassembly; (3) once regions are selected, GATK HaplotypeCaller would proceed by performing local reassembly and then determining a list of possible haplotypes. The activity score represents the probability that the position contains a variant. The raw activity profile, which is generated from the activity scores, is a function of activity per position, which is an estimate of the variability at that position represented by the aligned bases at that position (A, C, G, or T). The raw activity profile is smoothed to produce the activity profile, from which Active Region thresholds and intervals are determined.
However, GATK's HaplotypeCaller produces a simple measure that often misses other regions of insufficient homology. In other words, when a user uses GATK HaplotypeCaller to detect IHRs of a reference genome sequence, the user may not detect some such regions. When used in reference-assisted assembly, a reference genome sequence that has undetected IHRs with respect to a genomic sample set of reads may result in false-positive and/or incorrect variant calls. In addition, presently available aligners such as GATK HaplotypeCaller may identify regions of a reference sequence as IHRs, for example due to sequencing errors, when these regions have sufficient homology to genomic samples. Accordingly, there is a need for improvements in identifying regions of low homology in reference-assisted assemblies. Therefore, an improved method for identifying IHRs in a reference genome sequence is needed to provide an improved reference-assisted assembly of a genomic sample of a subject with fewer false-positives and a more reliable genetic sequence analysis of a genomic sample of a subject.
The invention is based, at least in part, upon the discovery that IHRs of a reference sequence with respect to a genomic sample of a subject can be more successfully identified by analyzing the alignment abnormality features of corresponding sequence reads. Using this approach, a user can identify IHRs that would otherwise not have been identified using presently available alternatives such as GATK HaplotypeCaller. Additionally, using this approach, a user can prevent identifying non-IHRs as IHRs as it is the case with GATK HaplotypeCaller.
In various aspects, the invention provides a computer-implemented method for identifying an insufficient homology region (“IHR”) within a reference sequence with respect to a genomic sample of a subject, the method being based on determining the extent of abnormality of a particular combination of features exhibited by an aligned sequence read, the method comprising: (a) obtaining an initial alignment based on a reference sequence, the initial alignment comprising a plurality of sequence reads derived from the genomic sample of the subject and the reference sequence; (b) determining an expected background level for each of two or more types of alignment abnormalities in the initial alignment, the background levels being specific to the genomic sample; (c) determining an unexpectedness of observed alignment abnormalities for each sequence read by calculating a score based on the two or more types of observed alignment abnormalities in each sequence read; (d) identifying the insufficient homology region within the reference sequence, wherein the insufficient homology region is associated with sequence reads having a lower combined score than a threshold score; and (e) tagging the identified insufficient homology region for a remedial action.
In various aspects, the invention provides a data processing system for designating an insufficient homology region of a reference sequence, the system comprising: at least one memory operable to store bioinformatics data; and a processor communicatively coupled to the at least one memory, the processor being configured to: (a) obtain an initial alignment based on a reference sequence, the initial alignment comprising a plurality of sequence reads derived from the genomic sample of the subject and the reference sequence; (b) determine an expected background level for each of two or more types of alignment abnormalities in the initial alignment, the background levels being specific to the genomic sample; (c) determine an unexpectedness of observed alignment abnormalities for each sequence read by calculating a score based on the two or more types of observed alignment abnormalities in each sequence read; (d) identify the insufficient homology region within the reference sequence, wherein the insufficient homology region is associated with sequence reads having a lower combined score than a threshold score; and (e) tag the identified insufficient homology region for a remedial action.
In various aspects, the invention provides a non-transitory computer-readable medium comprising instructions which, when implemented by one or more computers, cause the one or more computers to: (a) obtain an initial alignment based on a reference sequence, the initial alignment comprising a plurality of sequence reads derived from the genomic sample of the subject and the reference sequence; (b) determine an expected background level for each of two or more types of alignment abnormalities in the initial alignment, the background levels being specific to the genomic sample; (c) determine an unexpectedness of observed alignment abnormalities for each sequence read by calculating a score based on the two or more types of observed alignment abnormalities in each sequence read; (d) identify the insufficient homology region within the reference sequence, wherein the insufficient homology region is associated with sequence reads having a lower combined score than a threshold score; and (e) tag the identified insufficient homology region for a remedial action.
It should be understood by those skilled in the art, any of the aspects above can be combined with any one or more of the features below.
In various embodiments, the initial alignment comprises a reference-assisted assembly of the sequence reads to the reference sequence.
In various embodiments, the types of observed alignment abnormalities comprise at least two of: (a) abnormal read pairing type alignment abnormality; (b) soft clipping type alignment abnormality; (c) mismatch type alignment abnormality; (d) insertion type alignment abnormality; and (e) deletion type alignment abnormality.
In various embodiments, if the alignment abnormality is a mismatch type abnormality, then the step of determining the expected background levels comprises assessing the expected background level of the alignment abnormality based on the following covariates: (a) base position of the abnormality with respect to the sequence read; and (b) base quality of the base position of the abnormality.
In various embodiments, if the alignment abnormality is an insertion type abnormality or a deletion type abnormality, then the step of determining the expected background levels comprises assessing the expected background level of an alignment abnormality based on a base position of the abnormality with respect to the sequence read.
In various embodiments, if the alignment abnormality is an abnormal read pairing type abnormality, then the step of determining the expected background levels comprises assessing the expected background levels of the alignment abnormality by dividing a total count of the abnormality by a total number of the sequence reads.
In various embodiments, if the alignment abnormality is a soft-clipping type of abnormality, then the step of determining the expected background levels comprises assessing the expected background level of the alignment abnormality based on the following covariates: (a) length of a soft clip of a read; and (b) determination of whether the abnormality occurs at the 3′ end of the read or the 5′ end of the read.
In various embodiments, calculating a score based on the two or more types of observed alignment abnormalities in each sequence read comprises: (a) calculating an individual p-value for each of two or more types of observed alignment abnormalities in each sequence read; and (b) combining a plurality of individual p-values to generate a single combined p-value for the observed alignment abnormalities for each sequence read; wherein the insufficient homology region has a lower combined p-value than a threshold p-value.
In various embodiments, the p-value of each type of observed alignment abnormality in each sequence read is calculated by performing one of: (a) p=Pr(X≥o), wherein X˜Poisson(Σi∈Aλi), A={i: −10 log10(λi)≥20}, and o=Σi∈AI(mismatch at position i), λi is the base quality at position i and I(⋅) is the indicator function; (b) p=Σi∈Ari, wherein A={i:ri≤rx}, wherein p is calculated separately for first and second reads and 5′ and 3′ ends of the sequence reads, and ri is the rate at which the order and end of the reads are clipped for i bases, which may be estimated from the alignment data itself; (c) p=1 if the sequence read is free of an abnormality, and p=rabnormal read pairing if the sequence read is abnormally paired in a read pair, where rabnormal read pairing is the overall rate of a mapped read pair being abnormally paired; and (d) p=Pr(X≥o), wherein o is the number of insertions or deletions (“indels”) observed in a read, X˜Poisson(Σiλi) and λi are the respective background likelihoods for having an indel aligned (i.e., non-soft-clipped) at position i, and wherein background rates for indels are computed separately for each of the two strands of the read.
In various embodiments, the step of combining a plurality of individual p-values to generate a single combined p-value for the observed alignment abnormalities for each sequence read is performed using Fisher's method.
In various embodiments, the step of combining a plurality of individual p-values comprises a plurality of steps such that at each step, some individual p-values are combined first and the resultant intermediate combined p-values are combined with some or all of remaining individual p-values.
In various embodiments, the step of generating the combined p-value further comprises the step of generating the combined p-value further comprises smoothing the combined p-value for the observed alignment abnormalities of each sequence window.
In various embodiments, the remedial action comprises: (a) a de novo local assembly of the identified insufficient homology region; (b) a masking of sequences of the insufficient homology region; or (c) a determination by a user on the appropriateness of the reference sequence for use with respect to the genomic sample of the subject.
These and other advantages of the present technology will be apparent when reference is made to the accompanying drawings and the following description.
While the invention comprises embodiments in many different forms, they are shown in the drawings and will herein be described in detail several specific embodiments with the understanding that the present disclosure is to be considered as an exemplification of the principles of the technology and is not intended to limit the invention to the embodiments illustrated.
The invention is based, at least in part, upon the discovery that IHRs of a reference sequence can be identified more successfully than currently available methods such as those used in GATK HaplotypeCaller by analyzing the alignment abnormality features of sequence reads on a per-read basis, rather than a per-position basis. Further, the invention allows flagging the identified IHRs of the reference genome for a remedial action such as masking such regions from variant calling, reassembling the sequence reads aligned to the IHR de novo, performing targeted resequencing, and the like.
The invention provides a novel approach to detecting IHRs by considering various properties of individual sequence reads in the reference alignment, and estimating the likelihood that an alignment pattern of a particular sequence read is observed given an estimated background rate for different alignment anomalies. These results can be summarized in, e.g., sliding windows across the reference alignment, identifying IHRs as those in which the values within a window significantly diverge from expectation.
IHRs of a reference alignment exhibit various “signatures” that may be detected and/or identified after an initial alignment. For example, features such as abnormal read pairing, mismatches, and soft clipping may be detected and analyzed statistically. Regions in which these features exceed the expected background level may be identified as IHRs (lacking sufficient homology to the sample being sequenced) and thus being at risk for spurious variant calls.
IHRs may have a degree of homology relative to a particular genomic sequence that is in between very high homology and very low homology. Regions of reference sequence with very high homology to a set of sample sequence reads will allow alignment of said sequence reads and will subsequently produce very few variant calls. In contrast, regions of reference sequence with very low homology to a set of sample sequence reads will not allow alignment of said sequence reads to the reference sequence. However, IHRs can be problematic in the context of sequence alignment because sequence reads may be similar enough to the reference in these regions so as to permit alignment, albeit incorrect alignment that results in spurious variant calls that do not reflect the true sequence of the sample. IHRs can be patient specific, sequencing run-specific, and/or sample-specific.
In some embodiments, the present invention estimates the likelihood of an observed alignment pattern of a sequence read given it is aligned against a reference sequence of high homology. Sequence alignment programs (“aligners”) typically provide an alignment pattern result in the form of a CIGAR string which describes the number of matches, mismatches, insertions, and deletions that may be present in the alignment at a given region. Additional information which can be generated includes the length of soft-clipping (zero being no soft-clipping), and whether the sequence read has an abnormally paired mate. Under a null model of the reference sequence being highly homologous, each type of alignment anomaly, i.e., non-match pattern (mismatches, insertions, deletions, soft-clipping and abnormal pairing) should be independent and follow a Poisson distribution with some background rate. The unexpectedness of each of these features can be first evaluated independently and then statistically combined to estimate an overall per-read unexpectedness value, for example using Fisher's p-value combination method.
The exact background rate at which alignment anomalies occur can be sample, technology and/or sequencing run-specific. For example, two different sequencing runs may generate different distributions of quality scores, quality score-specific mismatch rates, or soft-clipping lengths. Thus, in some embodiments, accurate background estimates are obtained for each alignment anomaly type. Background rates for each anomaly type may be calculated, for example, based on the initial alignment data.
In some embodiments, unexpectedness (e.g., p-value) can be measured for each sequence read for each alignment anomaly type (e.g., soft-clip, mismatch, insertion, deletion, abnormal pairing). For each sequence read, the overall unexpectedness can be calculated based on the estimated sample and sequencing run-specific background values. In other words, the extent of abnormality of a particular combination of features exhibited by a sequence read can be determined for each sequence read. In some embodiments, the individual p-values are combined using Fisher's method to yield the final p-value that indicates how unlikely it is to observe the exact alignment pattern for each sequence read, assuming that each sequence read was aligned against a highly homologous reference sequence and all of its errors were generated independently from the background distribution. The present invention is preferable over other available alternatives because the present invention can determine if a sequence read is abnormal given the background distribution of all reads, and does not requires additional training of an algorithm or parameter estimation.
In some embodiments, as each sequence read in the alignment has an associated combined p-value, the combined p-values can be averaged using a smoothing function across windows of the reference alignment. Regions of the reference genome that are averaged in excess of a certain value may be identified as IHRs. In some embodiments, this may be further visualized as an additional “track” in the alignment.
Advantageously, the present invention can help a user to detect IHRs in a reference sequence with low false-negative results such as IHRs that would otherwise not have been detected using existing approaches, such as GATK.
“Genomic sequence data” herein means digital nucleic sequence information that provides a user with the genomic DNA sequence information of a subject. The genomic sequence data may be information that is stored in an electronic format, for example in a subject's medical records, or obtained from a gene sequencing apparatus, for example, a whole genome DNA sequencer.
“Reference sequence” herein means digital nucleic sequence information, either publicly available or proprietary sequence information, that provides a user with a representative example of a species' set of genes. In some embodiments, reference sequence may comprise sequences that are representative of the human genome. In one exemplary embodiment, the human genome reference sequence may be GRCh38. Reference sequence may represent nucleic sequences for an entire genome, or a portion of a genome.
“Local assembly” herein means an assembly of sequence reads into longer contigs, which may be limited to certain regions of the genome, for example, regions where the reference genome has insufficient homology with respect to the subject sample sequence. A local assembly may be de novo, i.e., the sequence reads are assembled without the aid of a reference sequence. De novo local assembly may be achieved by use of software or algorithms that assemble sequence reads into longer contigs or scaffolds by aligning overlapping ends of the sequence reads. Alternately, a local assembly may also incorporate the reference sequence.
“Remedial action” herein means actions or steps that are taken by a user of any of the embodiments of the invention in response to identification of IHRs of the reference sequence with respect to a genomic sample of a subject. Remedial action may be taken to increase accuracy, speed, and/or efficiency of genome sequencing and/or otherwise improve genome sequencing. Remedial actions can include, for example, masking the sequences of IHRs prior to making variant calls, performing local assembly of sequence reads aligned to locations within or nearby the IHR, and/or flagging the IHRs for needed improvement of reference sequences in future use.
A person skilled in the art will recognize a variety of different computer-based technologies that can be used to carry out disclosures contained herein. For example, the devices, systems and methods disclosed herein can be implemented using one or more computer systems, such as the exemplary embodiment of a computer system 100 shown in
As shown in
The various elements of the computer system 100 can be coupled to a bus system 112. The illustrated bus system 112 is an abstraction that represents any one or more separate physical busses, communication lines/interfaces, and/or multi-drop or point-to-point connections, connected by appropriate bridges, adapters, and/or controllers. The computer system 100 can also include one or more network interface(s) 106, one or more input/output (TO) interface(s) 108, and one or more storage device(s) 110.
The network interface(s) 106 can enable the computer system 100 to communicate with remote devices (e.g., other computer systems) over a network, and can be, for example, remote desktop connection interfaces, Ethernet adapters, and/or other local area network (LAN) adapters. The IO interface(s) 108 can include one or more interface components to connect the computer system 100 with other electronic equipment. For example, the IO interface(s) 108 can include high speed data ports, such as USB ports, 1394 ports, etc. Additionally, the computer system 100 can be accessible to a human user, and thus the IO interface(s) 108 can include displays, speakers, keyboards, pointing devices, and/or various other video, audio, or alphanumeric interfaces. The storage device(s) 110 can include any conventional medium for storing data in a non-volatile and/or non-transient manner. The storage device(s) 110 can thus hold data and/or instructions in a persistent state (i.e., the value is retained despite interruption of power to the computer system 100). The storage device(s) 110 can include one or more hard disk drives, flash drives, USB drives, optical drives, various media cards, and/or any combination thereof and can be directly connected to the computer system 100 or remotely connected thereto, such as over a network. The elements illustrated in
Although an exemplary computer system is depicted and described herein, it will be appreciated that this is for the sake of generality and convenience. In other embodiments, the computer system may differ in architecture and operation from that shown and described herein. For example, the computer system may be directly connected to a sequencing system or may be a part of a sequencer system.
In various aspects, the invention provides a computer-implemented method for identifying an insufficient homology region within a reference sequence with respect to a genomic sample of a subject, the method being based on determining the extent of abnormality of a particular combination of features exhibited by an aligned sequence read, the method comprising: (a) obtaining an initial alignment based on a reference sequence, the initial alignment comprising a plurality of sequence reads derived from the genomic sample of the subject and the reference sequence; (b) determining an expected background level for each of two or more types of alignment abnormalities in the initial alignment, the background levels being specific to the genomic sample; (c) determining an unexpectedness of observed alignment abnormalities for each sequence read by calculating a score based on the two or more types of observed alignment abnormalities in each sequence read; (d) identifying the insufficient homology region within the reference sequence, wherein the insufficient homology region is associated with sequence reads having a lower combined score than a threshold score; and (e) tagging the identified insufficient homology region for a remedial action.
As shown in
As shown in
The process of determining expected background levels of the different types of alignment abnormalities can depend on the type of alignment abnormalities. For example, if the alignment abnormality is a mismatch type abnormality, then the expected background level of the alignment abnormality may be determined based on (1) base position of the abnormality with respect to the sequence read; and (2) base quality of the base position of the abnormality. For another example, if the alignment abnormality is an insertion type abnormality or a deletion type abnormality, then the expected background levels of the alignment abnormality is determined based on a base position of the abnormality with respect to the sequence read. For yet another example, if the alignment abnormality is an abnormal read pairing type abnormality, then the step of determining the expected background levels comprise assessing the expected background level of the alignment abnormality by dividing a total count of the abnormality by a total number of the sequence reads. For still further yet another example, if the alignment abnormality is a soft-clipping type of abnormality, then the expected background levels of the alignment abnormality may be determined based on (a) length of a soft clip of a read; and (b) determination of whether the abnormality occurs at the 3′ end of the read or the 5′ end of the read.
As shown in
The process of calculating individual p-values of the different types of observed alignment abnormalities can depend on the type of alignment abnormalities. For example, the p-value for observing a mismatch type may be calculated as follows: p=Pr(X≥o), wherein X˜Poisson(Σi∈Aλi), A={i: −10 log10(λi)≥20}, and o=Σi∈AI(mismatch at position i), λi is the base quality at position i and I(⋅) is the indicator function. (The set A in this example comprises all positions having a base quality more than 20, though one may also vary this threshold.) As another example, the p-value for observing a soft clip type alignment abnormality may be calculated as follows: p=Σi∈Ari, wherein A={i:ri≤rx}, wherein p is calculated separately for the first and second reads and 5′ and 3′ ends of the sequence reads, and ri is the rate at which the order and end of the reads are clipped for i bases. For yet another example, the p-value for observing an abnormal pairing type alignment abnormality may be calculated as follows: p=1 if the sequence read is free of an abnormality, and p=rabnormal read pairing if the sequence read contains the abnormality, where rabnormal read pairing is the overall rate of a mapped read pair being abnormally paired. For yet another example, the p-value for observing an indel type alignment abnormality may be calculated as follows: p=Pr(X≥o), wherein o is the number of indels observed in a read, X˜Poisson(Σiλi) and λi are the respective background likelihoods for having an indel aligned (i.e., non-soft-clipped) at position i, and wherein background rates for indels are computed separately for each of the two strands of the read.
As shown in
As shown in
The following examples are illustrative and not restrictive. Many variations of the technology will become apparent to those of skill in the art upon review of this disclosure. The scope of the technology should, therefore, be determined not with reference to the examples, but instead should be determined with reference to the appended claims along with their full scope of equivalents.
A “per-read” approach of determining the likelihood of observing different types of alignment abnormality features of sequence reads in an initial alignment.
Calculated likelihood of observing different types of alignment abnormalities is shown in an exemplary hypothetical initial alignment between a reference genome and sequence reads of a subject.
As shown in
A comparison between IHRs identified using GATK HaplotypeCaller (i.e., Active Regions) and IHRs identified using an algorithm according to the present invention are shown for a reference sequence for a specific sample.
An initial alignment of sample genetic sequences against a reference genome was used to compare IHRs of the reference genome identified using GATK HaplotypeCaller (i.e., Active Regions) and IHRs of the reference genome identified using an algorithm according to the present invention.
Briefly, the genetic sample sequences were derived from a PCR-free sequencing run of sample NA12878, produced as part of the 1000 Genomes project and aligned to the GRCh38 reference genome using the software Burrows-Wheeler Aligner BWA MEM. The IHRs of the reference genome were identified using an algorithm as described elsewhere in the patent application and compared against IHRs (Active Regions) that were identified using GATK HaplotypeCaller. IHRs identified using GATK HaplotypeCaller are shown in red, whereas IHRs identified using an algorithm according to the present invention are shown by higher “low homology score,” shown in blue bars.
As shown in
Identification of IHRs containing different types of alignment abnormalities, including abnormal pairings.
An initial alignment of sample genetic sequences against a reference genome was obtained as described above in Example 2.
As shown in
A reference sequence may have IHRs identified by GATK HaplotypeCaller but still be suitable for use in reference-assisted assembly.
An initial alignment of sample genetic sequences against a reference genome was obtained as described above in Example 2.
As shown in
A reference sequence may have regions that are identified as IHRs by GATK HaplotypeCaller that are suitable for use in reference-assisted assembly.
An initial alignment of sample genetic sequences against a reference genome was obtained as described above in Example 2.
As shown in
This application claims priority to U.S. Provisional Patent Application No. 62/595,603, filed Dec. 7, 2017, the contents of which are hereby incorporated by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
62595603 | Dec 2017 | US |