An ongoing challenge of next-generation sequencing data analysis is the accurate calling of insertions and deletions (“indels”). Reasons for this difficulty include a lower rate of occurrence, difficulty mapping to the correct location in the genome, and the presence of repeat regions in the genome which prevent unique mapping. Another reason is the inability or inaccuracy of current aligners to correctly identify variants at the ends of sequencing reads. This arises due to the lack of a two-sided context in which to place the variant call.
Shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer-implemented method, computer system, and computer program product.
According to an embodiment, a computer-implemented method for sequencing data read realignment includes: obtaining from a sequence alignment dataset an initial alignment of a read sequence to a reference sequence, the initial alignment comprising an aligned read; performing realignment processing on the initial alignment, the realignment processing realigning the read sequence to the reference sequence to produce one or more candidate realignments, and the realignment processing comprising: identifying one or more candidate indels, the one or more candidate indels comprising zero or more indels in the aligned read and zero or more indels aligned proximal to the aligned read as indicated by the sequence alignment dataset; creating a flattened aligned read based at least on removing from the aligned read any indels indicated by the initial alignment; and determining one or more candidate realignments of the read sequence to the reference sequence based on introducing, for each candidate realignment of the one or more candidate realignments, a respective at least one candidate indel of the one or more candidate indels into the flattened aligned read; and providing the initial alignment or a selected candidate realignment of the one or more candidate realignments based on one or more selection criteria.
The one or more candidate indels can comprise a plurality of candidate indels, and the determining the one or more candidate realignments can comprise commencing iteratively introducing the plurality of candidate indels into the flattened aligned read, wherein each iteration of the iteratively introducing provides a candidate realignment of the one or more candidate realignments by introducing into the flattened aligned read the respective at least one candidate indel for the candidate realignment.
The iteratively introducing can introduce permutations of one or more candidate indels of the plurality of candidate indels into the flattened aligned read to obtain, for each permutation of the permutations, a different candidate realignment of the one or more candidate alignments.
The realignment processing can further comprise: checking a provided candidate realignment of the one or more candidate realignments to determine whether and aligned read of the provided candidate realignment, the aligned read of the provided candidate realignment having the introduced respective one or more candidate indels, aligns with the reference sequence with no bases mismatched as between the aligned read of the provided candidate realignment and the reference sequence; halting the iteratively introducing based on determining that the aligned read of the provided candidate realignment aligns with the reference sequence with no bases mismatched; and selecting the provided candidate realignment as the selected candidate realignment, wherein the providing outputs the selected candidate realignment based on the aligned read of the provided candidate realignment aligning with the reference sequence.
The realignment processing can further comprise prioritizing the plurality of indels for the iteratively introducing, wherein the iteratively introducing introduces the plurality of indels in order of priority based on the prioritizing.
The prioritizing can prioritize an indel indicated by a reference indel dataset to be a prior known indel over an indel not indicated by the reference indel dataset to be a prior known indel. Additionally or alternatively, the prioritizing can prioritize an indel of longer length over an indel of shorter length. Additionally or alternatively, the prioritizing can prioritize an indel indicated in a greater number of aligned reads of the sequence alignment dataset over an indel indicated in a lesser number of aligned reads of the sequence alignment dataset. Additionally or alternatively, the prioritizing can prioritize an indel indicated in a greater proportion of aligned reads of the sequence alignment dataset that correspond to a location of the indel relative to the reference sequence over an indel indicated in a lesser proportion of aligned reads of the sequence alignment dataset. Additionally or alternatively, the prioritizing can prioritize, as between different indels indicated in a same number of aligned reads of the sequence alignment dataset, an indel whose location relative to a reference genomic sequence indicated by the sequence alignment dataset is upstream from a location, relative to the reference genomic sequence, indicated for another indel.
The selection criteria may be based at least in part on one or more of: number of bases mismatched, number of indels, location of indels relative to a reference genomic sequence indicated by the sequence alignment dataset, and number of softclipped bases.
The selection criteria can prioritize one or more of: an alignment having no indels and only a single base mismatched over an alignment having one or more indels for the providing; an alignment having a lesser number of bases mismatched over an alignment having a greater number of bases mismatched for the providing; as between different alignments having a same number of bases mismatched, an alignment having a lesser number of softclips of a specified type over an alignment having a greater number of softclips of the specified type for the providing; and as between different alignments having a same number of bases mismatched, an alignment having a lesser number of indels over an alignment having a greater number of indels for the providing.
The realignment processing can further comprise selecting a best candidate realignment of the one or more candidate realignments based on a first criteria of the one or more selection criteria, wherein the selected candidate realignment is the selected best candidate realignment, and wherein the outputting selects between the initial alignment and the best realignment candidate based on a second criteria of the one or more selection criteria.
An embodiment of a computer-implemented method can further comprise determining whether the obtained initial alignment is eligible for realignment, the determining being based at least in part on one or more of: identifying whether there are one or more bases mismatched as between the aligned read of the initial alignment and the reference sequence; identifying whether the aligned read comprises a softclip; identifying whether the initial alignment is not a secondary alignment; and identifying whether there are candidate indels around the aligned read in a region of bases of a reference genomic sequence of the sequence alignment dataset.
An embodiment of a computer-implemented method can further comprise determining whether the obtained initial alignment is eligible for realignment, and performing the realignment processing and the providing the initial alignment or selected candidate realignment based on determining that the obtained initial alignment is eligible for realignment; repeating, for each additional initial alignment of one or more additional initial alignments of the sequence alignment dataset, the obtaining and the determining whether the obtained additional initial alignment is eligible for realignment; and performing processing for each additional initial alignment of the one or more additional initial alignments, the performing processing comprising (i) providing the additional initial alignment as is, absent performing the realignment processing, or (ii) performing the realignment processing and the providing the additional initial alignment or selected candidate realignment.
Further, a computer system for sequencing data read realignment, comprising memory and at least one processor can be configured to execute program instructions to perform methods according to aspects described herein.
Yet further, a computer program product for sequencing data read realignment, comprising a tangible storage medium storing program instructions for execution can perform methods according to aspects described herein.
Additional features and advantages are realized through the concepts described herein. Numerous inventive aspects and features are disclosed herein, and unless inconsistent, each disclosed aspect or feature is combinable with any other disclosed aspect or feature as desired by a particular application, for instance, to facilitate detecting an image obstruction.
Aspects described herein are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The development of next generation sequencing technology (NGS) has transformed genetic sequencing, permitting generation of a high volume of copies of genetic sequences, such as from the genome of an organism, aligning these sequences to create a putative recapitulation of the sequence of nucleotides of the copied genetic sequences. By identifying the sequence of nucleotide base pairs in the aligned copies, the sequence of nucleotides in the original sequence may be determined. One use of such technology is for identification, understanding, prevention, treatment, or cure of disease. For example, NGS may be used to identify an individual's genomic sequences to identify whether she possesses nucleotide sequences believed to underlie or render susceptibility to particular diseases, or identify such sequences which may do so, or determine whether a given pharmacological or other treatment may be beneficial in treating disorders of such an individual.
The high volume of sequence information that must be processed in order to derive the sequence of nucleotides from alignment of copies thereof is, in many cases, massive. For example, there are approximately three billion base pairs in the human genome. The ability to determine such large nucleotide sequence requires advanced computer processing technology. For example, synthesizing many copies of somewhat overlapping and/or adjoining portions of a large set of genetic sequences (for example, the billions of nucleotides in an entire reference genome, the tens or hundreds of millions of nucleotides in a chromosome or chromosomes, or long portions of a chromosome or other genomic sequence) via high throughput processing, and subsequently aligning them with one another to recapitulate, and identify the sequence of nucleotides of, the copied sequence typically requires processing of high volumes of data by a computer.
In many cases, errors may occur, leading to inaccurate representations of a genomic sequence in an alignment created thereof. A significant component of NGS technology includes a capability to identify and correct such errors. Where large genetic sequences are being sequenced, the number of potential errors may consequently be large as well. Computer technology is therefore desired to identify where such potential errors are, determine whether or not they are errors, and, if they are, what the correct sequences should be, often requiring selecting between multiple possibly correct sequences. Because of the potentially large number of such potential errors across huge spans of genetic sequences, automated processing of identifying and rectifying such errors as a component of the computer processing employed in NGS is highly desirable.
For example, the sequence of nucleotides within a chromosome possessed by most of the population may be known. An individual's sequences may then be determined and compared to such known sequences. Differences between the individual's sequences and the known sequences may be of importance, medically, genealogically, or otherwise. However, the presence of errors or potential errors in the aligned sequence determined for the individual by NGS complicates the identification of differences between the individual's genetic sequence and a known sequence, such as if an error is present but unidentified, or a difference between the individual's sequence and the known sequence is erroneously not detected. The present disclosure includes computer technology for improving automated identification and correction of certain types of errors that may occur in NGS and the associated informatics processing used to produce sequence alignments. Advantages include decreased processing time and increased identification and correction of errors, thereby improving usability of NGS tools and related technology.
Specifically, aspects described herein address problems of false positive (often single nucleotide variant) and false negative (often indel) variant calls caused by improper alignment of indel-containing sequencing data reads to a reference genome. Processes described herein may realign reads in a way that honors the existing representation of true indels and rejects low-frequency “noisy” variants, all in a short running time. Generally, one or more reads, or read sequences, may correspond to a position in a genetic sequence being sequenced by NGS. With the production of many reads spanning, in aggregate, all positions of a sequence being sequenced and aligning them in order from positions corresponding to one end of the sequence being sequenced to the other and identifying the order of nucleotides represented thereby, the complete sequence may be determined. As each read or reads corresponding to a position in the genetic sequence being sequenced are identified as corresponding to that position, they may be considered as having been aligned, or an aligned read. Errors may arise, however, in identifying, or calling, the presence of indels in an alignment, owing to difficulties in accurately identifying with certainty indels indicated by aligned reads.
The lack of a two-sided context presents a challenge to the accurate calling of indels. When calling indels, a two-sided context may help indicate where the variation begins and ends.
Sequencing data gathered as part of a sequence analysis is stored in a sequence alignment dataset. Common file types for storing sequence alignment data are the SAM (.sam) and BAM (.bam) file formats. Sequence alignment software (“aligners”) outputs a sequence alignment dataset file, e.g. a BAM file, that indicates alignments of read sequence(s) to a reference genome and indicates evidence that indel(s) may be present in these aligned reads. Aligners will usually have higher penalties for opening “gaps” (indels) than they do for assigning mismatches, which becomes especially pronounced at the end of a read. As a result, many sequence variations may be mistakenly called mismatches or may be softclipped even when other read evidence indicates that there may be an indel present.
Aspects described herein reprocess a sequence alignment dataset file, taking information from reads aligned nearby as indicated in a source/original/input sequence alignment dataset to form a surrounding context. This approach collects existing indel observations from the initial alignments as indicated in the input sequence alignment dataset and processes them by attempting to realign imperfectly aligned reads around observed indels such that mismatches are minimized. In some examples, reads that are not initially indicated to contain any indels are realigned such that they do indicate an indel relative to the reference. There may initially be little evidence in the sequence alignment dataset that a particular read contains any indels. However, aspects described herein can “rescue” the read when presence of an indel is more appropriately to be indicated by a realignment. By way of specific example, it may be the case that only one read aligned to a region of the reference genomic sequence indicated in the input sequence alignment dataset reflects an indel but after processing the initial alignments as described herein, several reads, e.g. of an output sequence alignment dataset output by processes described herein, support an indel being present.
In addition to reducing false negatives as above, approaches described herein can also reduce false positives by eliminating some mismatches or some indels initially indicated in one or more reads of the input sequence alignment dataset.
Processes described herein present a local indel realignment algorithm. This can help minimize mismatches by realigning input reads around indels, such as those observed in the input sequence alignment dataset file and/or indicated in a reference indel dataset, such as a ‘priors’ Variant Call Format (.VCF) file. The VCF priors may be provided as an input to an algorithm and indicate assumed indels in the source sequence alignment dataset file.
At a high level, a computer system may receive as input an input sequence alignment dataset and execute an algorithm to read through the input dataset, collect existing indel observations, and process one or more initial alignments from the sequence alignment dataset by attempting to realign the read of each initial alignment around observed indels. The algorithm may provide a new, ‘realigned’ sorted indexed sequence alignment dataset, for instance as an output BAM or other sequence alignment dataset file. In cases where the realignment of the read to the reference is better than the initial alignment of the read to the reference, the realignment may be output in place of the initial alignment. Otherwise, the initial alignment may be output as-is from the input sequence alignment dataset. The output sequence alignment dataset may be another file separate from the original sequence alignment dataset, or may be a modified version of the input sequence alignment dataset, in which the algorithm may directly modify/overwrite the original sequence alignment dataset.
In particular examples, the algorithm steps through the input sequence alignment dataset collecting existing indel observations and adding them to a set of candidate indels for use in realignment processing for a particular initial alignment. Whether an observed indel is considered a candidate can depend on any desired parameter, such as the observed allele frequency of the indel. In some examples, a user-configurable threshold allele frequency is provided as a parameter or other input to the algorithm, for instance as a command line parameter or as a parameter specified as an option in a software setting. An observed indel that appears at least as frequently as the frequency indicated by the threshold may be considered a candidate indel. The frequency may include a total number of reads aligned to a given position in a reference sequence that indicate the presence of a given indel at said position. Or, the frequency may include a proportion, out of a total number of reads aligned to a given position in a reference sequence, that indicate the presence of a given indel at said position. The configurable threshold may be set as low as 1, indicating that appearance of the indel in only one read aligned to a given position of a reference sequence constitutes enough evidence for the indel to be considered a candidate. Or, the configurable threshold may be a pre-determined proportion, of between 0 and 1, of how many reads aligned to a given position in a reference sequence, out of a total number of such reads, indicating the presence of the indel constitutes enough evidence for the indel to be considered a candidate. In practice, noise and other considerations may dictate that the frequency be set to something higher. In addition, any indel provided in an optional priors VCF reference indel dataset may be considered as a candidate indel.
A computer system in reading through the sequence alignment dataset may proceed generally from a beginning to an end of the reference genomic sequence being mapped. Candidate indels relevant to an individual alignment may occur before or after that alignment's original position (i.e. upstream or downstream relative to the reference genomic sequence). Upcoming, to-be-processed, reads may provide further support to a candidate indel. Because of this, the algorithm may hold encountered initial alignments in memory until they are considered cleared for processing, rather than processing those initial alignments immediately without reading alignments of reads to locations further down the reference genomic sequence. Cleared alignments are those whose position as indicated in the sequence alignment dataset is upstream from an upstream end of a window, of configurable window size, past the end position of the alignment. This allows collection of candidate indels for a given read from the regions before and after that read. The genomic window size correlates to a number of bases past the initial alignment that must have been read before the algorithm is satisfied that it has collected the information that is deemed potentially relevant to the alignment. The window size may be configurable, for instance as a command line parameter. A larger window size allows for larger and more distant indels to be considered but performance of the computer system may be negatively impacted if the window size is set too large due to the greater demand on resources. In particular examples, window sizes of 250-1000 bases may be sufficient for general use.
Initial alignments that are cleared for processing may undergo processing, an example method of which is described and depicted with reference to
If instead at 304 it is determined that the alignment is eligible for realignment processing (304—Yes), for instance if it is not perfectly aligned, there are softclip(s), it is not a secondary alignment, and/or there are candidate indels in the region, the process continues by attempting realignment processing to realign the initial alignment (308). Such realignment is described in further detail below as part of a read realignment procedure. This realignment procedure provides what is considered a “best” realignment. After the realignment processing, it is determined whether the best realignment is at least as good as the original initial alignment (310). If not, then the initial alignment is output as-is (306). Otherwise, the best realignment is output (312). Thus, in any case where an initial alignment is processed, an alignment of the given read to the reference may by output, the alignment being either the initial alignment (306) or the realigned alignment (312).
By the time an alignment is being considered for realignment (308), all candidate indels that were observed and could impact that read sequence of the alignment (including any from the original alignment itself, surrounding indels, and any “priors”) have been collected to form a set of candidate indels that are candidates for introduction to provide a candidate realignment of the read to the reference. An iterative process is commenced that introduces each candidate indel (and in some examples combinations of two or more such candidate indels) into a flattened version of the aligned read. In some examples, the indel(s) are introduced from the left (i.e., from the upstream, or 5-prime, direction) and the right sides (i.e., from the downstream, or 3-prime direction) of the flattened aligned read. Each iteration provides a resulting ‘candidate realignment’ that is evaluated to determine how good the realignment is. The evaluation may consider any desired indicator(s) of quality, for instance a number of mismatched bases between the aligned read of the realignment and the reference, number of indels, locations of indels, and/or number of softclipped bases as examples.
One concept described herein is a position map, which is an array of chromosomal coordinates for each base in the read. The position map is a data structure used to represent the sequences in the sequence alignment dataset.
Read realignment may involve manipulation of the position map and subsequent comparison of the nucleotide-position pairs to the reference genome. Each aligned read being realigned may first be stripped of its existing indels and non-N type softclips to create a “blank slate”. This provides a read that effectively starts with the assumption that it is indel-free. The indel-free read is referred to herein as a “flattened” read sequence, or flattened aligned read—a flattened version of the read in the initial alignment. Candidate indels are then iteratively introduced into the flattened aligned read and evaluated for agreement with the reference. This introduction may be accomplished by manipulating the position map. The resulting nucleotide-position pairs may then be compared to the reference genome.
One goal in finding the desired realignment may be to prioritize minimizing mismatches and then minimizing the number of indels to reach the best realignment. A realignment with a single indel and no mismatches may be considered to be as good as it gets, in which case realignment processing for that alignment may cease and return the realignment. This may then be compared to the initial alignment to determine which is the better alignment to be output. Alternatively, when a ‘perfect alignment’ is not encountered during the realignment processing, the ‘best’ candidate realignment from the combinations considered may be compared to the original alignment, and the better selected for output as described below.
When determining the “best” candidate alignment, rules may be used and applied in an order or priority. In some examples, the current best candidate realignment is stored and compared with a next determined realignment candidate. The two are compared according to the rule(s) and if the realignment candidate is better, it is prioritized as the new best candidate realignment, replacing the old candidate realignment. An example of such rules and prioritization is:
The input sequence alignment dataset may be processed by a computer system to read the data alignment-by-alignment. These initial alignments are read into memory and each initial alignment is eventually cleared for processing based on a sliding window as described above. If this processing determines the cleared initial alignment is eligible for realignment processing, then read realignment processing as described and depicted with reference to
The process begins by getting all proximal candidate indels for this originally aligned read, i.e. observed indels in the region (702). Proximal indels may be those that are within the region or window considered relevant to this read alignment, and thus may be any indels seen in any of several different alignments indicated in the sequence alignment dataset. This set of indels, optionally together with indel(s) indicated in a reference indel dataset as ‘priors’, or known/assumed-to-be present indels, forms the set of candidate indels.
The process then ranks these candidate indels (704) relevant to the initial alignment. This ranking or prioritization may be based on any desired rule(s), examples of which are as follows and applied in the following order:
The ranking is an indication of which indels are weighted heavier in terms of the probability of their presence as compared to other indels. If two possible candidate indels could have provided two different candidate realignments with a same number (zero or more) of bases mismatched, the prioritization indicates which indels are to be more heavily trusted. The example priority rules above push toward the known, longer, more frequently occurring indels. The prioritization reflects the more likely true indels present.
The process of
Continuing with the process of
Otherwise, if there are bases mismatched as between the realigned read of the best candidate realignment and the reference (708—N), the process proceeds by comparing the best candidate realignment with the original alignment (712). Ultimately the goal is to output the better alignment of the two. Thus, based on the comparison, the process determines whether the best candidate realignment provided by 706 is better than the original alignment (714). If so, the process outputs this best realignment (710). In a particular example, if the best candidate realignment is better than or as good as the original alignment, a mapping quality is adjusted, if appropriate (e.g. set to 40 if the original quality was 20 or lower and the realignment has no mismatches), and the process outputs the best candidate realignment to an output sequence alignment dataset after this mapping quality adjustment. Returning to inquiry 714, if the best candidate realignment is not better than or as good as the original alignment (714—N), then the process outputs the original alignment (716).
The selection criteria for selecting the better of the original alignment and the best candidate realignment may be the same or different than the selection criteria applied to determine the best of the realignment candidate from 706. In particular examples, the selection criteria for selecting the best candidate realignment and/or the better of the best candidate realignment and the original alignment may be based on: number of bases mismatched as between the aligned read of the alignment and the reference sequence, number of indels indicated by the alignment, location(s), corresponding to the indel(s), in the reference genomic sequence indicated in the sequence alignment dataset, and/or number of softclipped bases indicated by the alignment. “Alignment” in the foregoing encompasses both alignment (as in original alignment) and realignment (as in a candidate realignment), since both cases present an alignment of the corresponding read to the reference sequence.
As an example, the selection criteria can prioritize one or more of: an alignment having no indels and only a single base mismatched (as between the read and the reference of the alignment) over an alignment having one or more indels; an alignment having a lesser number of bases mismatched over an alignment having a greater number of bases mismatched; as between alignments having a same number of bases mismatched, an alignment having a lesser number of softclips of a specified type, such as N, over an alignment having a greater number of softclips of the specified type; and/or as between different alignments having a same number of bases mismatched, an alignment having a lesser number of indels over an alignment having a greater number of indels.
By way of specific example, assume there are n candidate indels {I1, I2, I3, . . . , In} ranked in order of priority, and the iterating proceeds through combinations of 1, then 2, then 3 indels. The iteratively introducing the indels into the flattened aligned read will proceed in the following order, with each iteration providing a candidate realignment:
[Iterations of one indel:] Introduce I1, then I2, then I3, . . . , then In; then
[Iterations of two indels:] Introduce I1+I2, then I1+I3, . . . , then I1+In, then I2+I3, then I2+I4, . . . , then I2+In, . . . , then In−1+In; then
[Iterations of three indels:] Introduce I1+I2+I3, . . . , then In−2+In−1+In.
The introduction of the indels injects the indel(s) into the flattened aligned read and checks how the modified read alignment lines up with the reference genome, which checking may be assisted by the modified position map produced.
As noted, if at any point during the iterating a candidate realignment that perfectly aligns to the reference is provided, the processing can break and select that candidate as the best candidate realignment to provide (
Referring to
The result obtained from 810 is a candidate realignment. The process of
If at 816 is it determined that the new best realignment is not a perfect alignment, or if at 812 it is determined that the result obtained was no better than the current best realignment, the process returns to 804 to determine whether there are additional indel permutations to try. If not, then the process returns the current best realignment (818). It is seen that this process continues iterating until there are no more indel permutations to try (804—No), or a perfect alignment is provided by candidate realignment that is determined (816—Yes).
For the right anchor result processing,
The ‘add indel and get result’ processing (1006 of
The process of
If the position map allows the indel to be introduced, the process determines whether the new position map (with the indel inserted) is valid (1104). If not, then there is a failure to add the indel and the process returns NULL (1114) or some other desired result, then ends. Otherwise, the process proceeds by determining whether the candidate indel is an insertion (1106). If so, it is determined whether the bases of the read sequence match the putative insertion (1108). The bases of the read sequence may match the putative insertion if the bases in the read sequence at the position of the putative insertion are the same bases as those specified in the putative insertion. As an illustrative example, if the following read sequence ATCTGA were anchored at position 10 (i.e. the 5-prime A is at chrN:10), and the putative insertion is chrN:12 C>CTG, it would be considered a match, because the next two bases in the read sequence after the C at chrN:12 are TG. By contrast, if the putative insertion is, as another illustrative example, chrN:12 C>CAA, it would not be a match because the next two bases in the read sequence after the C at chrN:12 are not AA. If the bases of the read sequence do not match the putative insertion, then there is a failure to add the indel and the process returns NULL (1114) or some other desired result, then ends.
If instead at 1108 it is determined that the bases of the read sequence match the putative insertion (1108—Yes), or if it was determined at 1106 that the indel is not an insertion, e.g. it is a deletion, then the process proceeds by determining a new CIGAR positions string and start position based on the adjusted position map (1110). It then returns the resulting realignment with the indel added (1112) and ends.
Pseudocode is provided below for an example GetBestAlignment Routine (corresponding to
GetBestAlignment refers to the routine performed on the list of ranked candidate indels for introducing to the flattened read. Within this process, RealignToTargets is performed on each candidate indel both singly and in combination with other candidate indel(s). If at any point the introduction of a single indel results in a read with no mismatches, the process can exit with that realignment being considered the best realignment candidate. Otherwise, the process returns the “best” realignment, as measured by the rules/selection criteria described above from all assessed combinations of one to n indels, where n is the maximum number of indels to introduce.
GetBestAlignment Routine Pseudocode:
RealignToTargets Routine Pseudocode:
Aspects described herein can be used to adjust and improve sequencing data alignments output from an initial aligner. An aligner may output an initial sequence alignment dataset, which is provided as input to software configured to perform aspects described herein. The software outputs a sequence alignment dataset with realignments of one or more of the initial alignments.
The following presents a comparison of indel realignment according to aspects described herein (referred to below as Realigner) with indel realignment of the GATK Indel realigner offered by the Eli and Edythe L. Broad Institute of MIT and Harvard (“Broad Institute”), Cambridge, Mass., USA.
A distinction of the Realigner lies in its ability to accurately realign reads around observed mutations and do so in less time than existing methods. To demonstrate this, Realigner was compared against the likely best-known local indel realigner in the bioinformatics community, the GATK Indel realigner (see, e.g. DePristo, M., Banks, E., Poplin, R., Garimella, K., Maguire, J., & Hartl, C. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491-498, (2011)) to determine whether it performed at least as well in a shorter amount of time.
Sensitivity and Specificity on Simulated Variant Data:
Methods
To evaluate sensitivity, the following experiment was performed:
1. Simulate individual-variant FASTQ files of 200 insertions and 200 deletions of length 4-25 bp (400 simulation FASTQs total).
2. Align simulated FASTQ files using the iSAAC aligner offered by Illumina, Inc., San Diego, Calif., U.S.A.. Two conditions were assessed: with and without ‘Priors’. Supplying a list of priors to iSAAC allows indels in the list to be favored over calling a string of mismatches at that position.
3. Realign each of the above conditions (with priors, without priors) using each of Realigner, GATK, and no realignment.
4. Call variants using the Pisces variant caller offered by Illumina, Inc.
5. Evaluate sensitivity and specificity of variants called.
Samples used in the analysis:
200 insertions and 200 deletions were randomly selected from a pool of 2000 mid-length (4-25 bp) indels.
Evaluation of Called Variants
Each simulation sample is expected to have exactly one called variant. To evaluate the sensitivity and specificity of the outcome, all called variants are extracted from the VCF (yielding 0 to many variants, 0 to 1 of which will match the expected variant). The resulting variants are compared to the expected “truth” variant, resulting in one of the outcomes listed in
Results
Using priors in the initial iSAAC alignment increased sensitivity for all conditions. Without realignment, 48.5% of variants were successfully called with no false positives. With GATK realignment, that portion rose to 48.8%, while Realigner achieved 75.3%. In all cases, if the variant was correctly called and passing, there were no other passing variants. In some examples, Realigner used with priors may produce fewer false negatives and fewer false positives than GATK realignment.
Specificity on FFPE Normal Samples:
Methods
To evaluate specificity on realistic samples, normal (non-disease) samples were used. To sufficiently challenge the realigners, FFPE samples were used, which typically have poor DNA quality leading to a large number of low-frequency “noise” variants. In particular for Realigner, each of these low-frequency variants represents an opportunity to introduce a false variant.
Because these are normal, non-cancer samples, we assume that all true variants are at diploid frequencies (˜50% for heterozygous, and ˜100% for homozygous variants). Thus, anything in the “somatic” range (<20% VAF) can be considered a false positive. Furthermore, the lower the resulting somatic mutation count, all other things being equal, the more accurate a realignment method may be considered to be.
The following experiment was performed:
1. Run the iSAAC variant caller using a priors VCF containing the targeted variants from the Catalog of Somatic Mutations in Cancer (COSMIC) online database.
2. Realign the BAM files using Realigner or GATK.
3. Call variants using the Pisces variant caller.
4. Assess somatic mutation rate.
The analysis was conducted on 20 FFPE Normal samples, which were prepped and sequenced using the TruSight Tumor 170 assay offered by Illumina Inc., and had been processed through the TruSight Tumor 170 informatics pipeline up to the Alignment step.
Results
Realigner showed a lower somatic mutation rate (proxy for false positive rate in non-cancer samples) than either non-realigned or GATK-realigned results across the board (in only three of twenty cases did Realigner have a higher FP (false positive) rate than GATK, and all three were extremely close). Realigner appeared to have more aggressive deletion calling (see
Runtime Evaluation:
Methods
The same 20 samples used for the FFPE Normal evaluation were assessed for compute time required to go from input BAM to realigned output BAM. The input BAM files each contained approximately 60 million reads.
Results
In all cases, Realigner was significantly faster than GATK on the mid-size BAMs.
The Realigner is a fast and accurate indel realignment algorithm that maintains fidelity to the representation of existing indels. It relies upon the presence of existing signal in the input sequence alignment dataset to realign around an indel. In the examples above, Realigner does especially well when used on BAM files that have been generated by iSAAC with consideration for priors, as this maximizes the likelihood that the input BAM will contain at least one read with the indel.
The expected gold standard for local realignment would involve a pileup approach with consensus generation and local realignment of the consensus. However, consensus-based solutions have been shown to be costly in terms of time and computing requirements. In contrast, the Realigner treats each read individually, using the context of proximal observed indels for a much simpler, candidate-based approach.
Accordingly, processes for sequence alignment processing are described herein.
The selection criteria can be based at least in part on: number of bases mismatched, number of indels, location of indels relative to a reference genomic sequence indicated by the sequence alignment dataset, and/or number of softclipped bases. In some examples, the selection criteria prioritizes: an alignment having no indels and only a single base mismatched over an alignment having one or more indels for the providing; an alignment having a lesser number of bases mismatched over an alignment having a greater number of bases mismatched for the providing; as between different alignments having a same number of bases mismatched, an alignment having a lesser number of softclips of a specified type over an alignment having a greater number of softclips of the specified type for the providing; and/or as between different alignments having a same number of bases mismatched, an alignment having a lesser number of indels over an alignment having a greater number of indels for the providing.
Referring again to
The process of
The process of
The iteratively introducing introduces permutations of one or more candidate indels of the plurality of candidate indels into the flattened read to obtain, for each permutation of the permutations, a different candidate realignment of the one or more candidate alignments.
The realignment processing (
This selection of the best candidate realignment may include checking a provided candidate realignment to determine whether the aligned read of the provided candidate realignment, the aligned read having the introduced respective one or more candidate indels, aligns with the reference sequence with no bases mismatched as between the aligned read of the provided candidate realignment and the reference sequence. Based on determining that the aligned read of the provided candidate realignment aligns with the reference sequence with no bases mismatched, the iteratively introducing the candidate indel(s) into the flattened aligned read can halt, and the provided candidate realignment with no bases mismatched can be provides as the selected candidate realignment (2010). In these cases, the providing (
The example of
Processes described herein may be performed singly or collectively by one or more computer systems.
The system 2200 includes one or more processors or processing units 2250 and a memory 2252 that includes volatile memory 2254 (e.g. Random Access Memory, RAM) and non-volatile memory 2056. Memory 2252 may further include removable/non-removable, volatile/non-volatile computer system storage media. Further, memory 2252 may include one or more readers for reading from and writing to a non-removable, non-volatile magnetic media, such as a hard drive, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk, and/or an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM. The system 2200 may also include a variety of computer readable tangible storage media. Such media may be any available media, such as volatile and non-volatile media, and removable and non-removable media.
Memory 2252 may include at least one program product having a set (e.g., at least one) of program modules implemented as executable instructions that, when executed, carry out functions described herein. Executable instructions 2258 may include an operating system, one or more application programs, other program modules, and program data or other types of software. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on, that perform particular tasks or implement particular abstract data types. Program modules may carry out functions, processes, methods and the like described herein including, but not limited to, sequencing data read realignment.
Components of the computer system 2200 may be coupled by an internal bus 2260 that may be implemented as one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
Computer system 2200 may also communicate with one or more external devices such as a keyboard, a pointing device, a display 2262, etc.. and/or any devices (e.g., network card, modem, etc.) that enable computer system 2200 to communicate with one or more other computer systems, such as a server or other system hosted in a cloud computing environment. Such communication can occur via I/O interfaces 2264, which may include a network interfaces to interface to one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via a suitable network adapter.
Further aspects of sequencing using a computer system are now described.
In the depicted embodiment, the sequencing device 2300 includes a separate sample processing device 2318 and an associated computer system 2320. However, as noted, these may be implemented as a single device. Further, the associated computer 2320 may be local to or networked with (e.g. as a cloud or other remoter offering) the sample processing device 2318. In some embodiments, the computer 2320 may be a cloud computing device that is remote from the sequencing device 2300. That is, the computer 2320 may be capable of communicating with the sequencing device 2300 through a cloud computing environment. In the depicted embodiment, the biological sample may be loaded into the sample processing device 2318 as a sample slide 2370 that is imaged to generate sequence data. For example, reagents that interact with the biological sample fluoresce at particular wavelengths in response to an excitation beam generated by an imaging module 2372 and thereby return radiation for imaging. For instance, the fluorescent components may be generated by fluorescently tagged nucleic acids that hybridize to complementary molecules of the components or to fluorescently tagged nucleotides that are incorporated into an oligonucleotide using a polymerase. As will be appreciated by those skilled in the art, the wavelength at which the dyes of the sample are excited and the wavelength at which they fluoresce will depend upon the absorption and emission spectra of the specific dyes. Such returned radiation may propagate back through directing optics. This retrobeam may generally be directed toward detection optics of the imaging module 2372.
The imaging module detection optics may be based upon any suitable technology, and may be, for example, a charged coupled device (CCD) sensor that generates pixilated image data based upon photons impacting locations in the device. However, it will be understood that any of a variety of other detectors may also be used including, but not limited to, a detector array configured for time delay integration (TDI) operation, a complementary metal oxide semiconductor (CMOS) detector, an avalanche photodiode (APD) detector, a Geiger-mode photon counter, or any other suitable detector. TDI mode detection can be coupled with line scanning. Other useful detectors are described, for example, in the references provided previously herein in the context of various nucleic acid sequencing methodologies.
The imaging module 2372 may be under processor control, e.g., via a processor 2374, and the sample receiving device 2318 may also include I/O controls 2376, an internal bus 2378, non-volatile memory 2380, RAM 2382 and any other memory structure such that the memory is capable of storing executable instructions, and other suitable hardware components that may be similar to those described with regard to
Turning now to
A cloud facility 2412 includes a plurality of computer systems/nodes 2414. The computing resources of the nodes 2414 may be pooled to serve multiple consumers, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. Examples of resources include storage, processing, memory, network bandwidth, and virtual machines. The nodes 2414 may communicate with one another to distribute resources, and such communication and management of distribution of resources may be controlled by a cloud management module residing in one or more nodes 2414. The nodes 2414 may communicate via any suitable arrangement and protocol. Further, the nodes 2414 may include servers associated with one or more providers. For example, certain programs or software platforms may be accessed via a set of nodes 2414 provided by the owner of the programs while other nodes 2414 are provided by data storage companies. Certain nodes 2414 may also be overflow nodes that are used during higher load times.
In one embodiment, a cloud management module is responsible for load management and cloud resources. The load management may be implemented through consideration of a variety of factors, including user access level and/or total load in the cloud computing environment (peak times versus average load times). The project type may also be considered. In one embodiment, public health emergencies may be prioritized over other types of projects. Further, a user may manage costs by offering certain runs as lower priority that are held until cloud usage is below a certain threshold.
The cloud facility 2412 is configured to communicate with various users (e.g. user computer systems) for generating biological data. Such data may include sequence data generated via a sequencing device 2416, which in particular embodiments may include a sequencing device 2418 that includes a module to accept a biological sample and generate sequence data and an associated computer 2420 that includes executable instructions for analyzing or communicating the sequence data to the cloud facility 2412. It should be understood that, in certain embodiments, the sequencing device 2416 may also be implemented as an all-in-one device. The sequencing device 2416 is configured to communicate with the cloud facility 2412 via a suitable communications link 2424. The communication with the cloud facility 2412 may include communication via a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via the communications link 2424. In particular, the communications link 2424 sends sequence data 2426 and, in certain embodiments, authentication information 2428, to the cloud computing environment 2412. The authentication information may confirm that the sequencing device 2416 is a client of the cloud facility 2412.
As noted, the cloud facility 2412 may serve multiple users or clients with associated devices, e.g., devices 2416a, 2416b, and 2416c. Further, the cloud facility 2412 may also be accessed by other types of clients, such as secondary users 2430 or third party software holders. Accordingly, the cloud facility 2412 may provide different types of services depending on the access level of the particular client. A sequencing client may have access to storage and data analysis services, while a secondary user 2430 may have access only to shared or public sequences. Third party software holders may negotiate with sequencing clients to determine appropriate access privileges. For example, open source software may be offered for free or on limited license basis, while other types of software may be offered according to various fee or subscription bases.
Further, a primary user (or secondary user) may also interact with the cloud facility 2412 through any appropriate access device, such as a mobile device or other computer system that includes components similar to those described with regard to the computer 2420. That is, once the sequence data has been communicated to the cloud facility 2412, further interaction with and access to the sequence data may not necessarily be coupled to the sequence device 2416. Such embodiments may be beneficial in embodiments in which the owner of the biological sample and/or sequence data has contracted for sequencing, e.g., to a core laboratory facility. In such embodiments, the primary user may be the owner while the core laboratory facility associated with the sequencing device 2416 is at most a secondary user after the sequence data has been communicated to the cloud facility 2412. In certain embodiments, the sequence data may be accessed through security parameters such as a password-protected client account in the cloud facility 2412 or association with a particular institution or IP address. The sequence data may be accessed by downloading one or more files from the cloud facility 2412 or by logging into a web-based interface or software program that provides a graphical user display in which the sequence data is depicted as text, images, and/or hyperlinks. In such an embodiment, the sequence data may be provided to the primary or secondary user in the form of data packets transmitted via a communications link or network.
The cloud facility 2412 may execute user interaction software (e.g., via a web-based interface or application platform) that provides a graphical user interface for users and that facilitates access to sequence data, a community or group of researchers, data analysis programs, available third party software, and user selections for load balancing and instrument settings. For example, in particular embodiments, settings for a sequencing run on a sequencing device 2416 may be set via the cloud facility 2412. Accordingly, the cloud facility 2412 and an individual sequencing device 2416 may be capable of two-way communication. Such an embodiment may be particularly useful for controlling parameters of a remote sequencing run.
Results of a sequencing run and various analyses can be stored in files taking the form of FASTQ files, binary alignment files (bam), *.bcl, *.vcf, and/or *.csv files, as examples. The output files may be in formats that are compatible with sequence data viewing, modification, annotation, manipulation, alignment, and realignment software. Accordingly, the accessible sequence alignment dataset provided herein may be in the form of raw data, partially processed or processed data, and/or data files compatible with particular software programs. In this regard, a computer system, such as a computer system of or in communication with a sequencing device, or a cloud facility computer system, as examples, can obtain a bam or other sequencing alignment dataset and process the file by, for instance, reading its data and performing operations to carrying out aspects described herein. The computer system can then output a file having sequencing alignment data, for instance another bam file. Further, the output files may be compatible with other data sharing platforms or third party software.
Although various embodiments are described above, these are only examples. For example, computing environments of other architectures can be used to incorporate and use one or more embodiments.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of one or more embodiments has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain various aspects and the practical application, and to enable others of ordinary skill in the art to understand various embodiments with various modifications as are suited to the particular use contemplated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2017/061661 | 11/15/2017 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62480330 | Mar 2017 | US | |
62447103 | Jan 2017 | US | |
62422841 | Nov 2016 | US |