The instant application contains a Sequence Listing which has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on April 27, 2023, is named TP109360USUTL1_SL.xml and is 3,922 bytes in size.
This application generally relates to methods, systems, and computer-readable media for applying a deep learning artificial neural network for correction of signal data obtained by next-generation sequencing systems and, more specifically, to correcting the signal data to improve the accuracy of base calling.
The simulated signal measurements may be generated from a simulation model that includes model parameters such as a “carry forward” parameter and an “incomplete extension” parameter. One cause of phase synchrony loss is the failure of a sequencing reaction to incorporate one or more nucleotide species on a template strand for a given flow cycle, which may result in that template strand being behind the main template population in sequence position. This effect is referred to as an “incomplete extension” error. Another cause of phase synchrony loss is the improper incorporation of one or more nucleotide species on a template strand, which may result in that template strand being ahead of the main population in sequence position. This is referred to as a “carry forward” error. Carry forward errors may result from the misincorporation of a nucleotide species, or in certain instances, where there is incomplete removal of a previous nucleotide species in a reaction well (e.g. incomplete washing of the reaction well). Thus, as a result of a given flow cycle, the population of template strands may be a mixture of strands in different phase-states.
However, it has been observed from real data that a significant portion of error events are “systematic” and not random. Even low homopolymers with low noise variance may be affected by systematic error. Unlike random error, systematic error can be reproduced in different experiments. However, the origin of the systematic error may be complex and/or unknown. Systematic error may cause a shift in the distribution of signal measurements, such that the signal measurements may include noise with a non-zero mean.
Aspects of embodiments of the present disclosure apply an artificial neural network to mitigate the effects of systematic error in the signal measurements and improve the accuracy of base calling.
In some embodiments, the base calling step may perform phase estimation and normalization, and run a solver algorithm to identify the best partial sequence fit and make base calls. The base sequences for the sequence reads are stored in unmapped BAM files. The base calling step may generate the total number of reads, the total number of bases, and the average read length as quality control (QC) measures to indicate the base call quality. The base calls may be made by analyzing any suitable signal characteristics (e.g., signal amplitude or intensity). The signal processing and base calling for use with the present teachings may include one or more features described in U.S. Pat. Appl. Publ. No. 2013/0090860 published Apr. 11, 2013, U.S. Pat. Appl. Publ. No. 2014/0051584 published Feb. 20, 2014, and U.S. Pat. Appl. Publ. No. 2012/0109598 published May 3, 2012, each incorporated by reference herein in its entirety.
Once the base sequence for the sequence read is determined, the sequence reads may be provided to the alignment step, for example, in an unmapped BAM file. The alignment step maps the sequence reads to a reference genome to determine aligned sequence reads and associated mapping quality parameters. The alignment step may generate a percent of mappable reads as a QC measure to indicate alignment quality. The alignment results may be stored in a mapped BAM file. Methods for aligning sequence reads for use with the present teachings may include one or more features described in U.S. Pat. Appl. Publ. No. 2012/0197623, published Aug. 2, 2012, incorporated by reference herein in its entirety.
The BAM file format structure is described in “Sequence Alignment/Map Format Specification,” Sep. 12, 2014 (github.com/samtools/hts-specs). As described herein, a “BAM file” refers to a file compatible with the BAM format. As described herein, an “unmapped” BAM file refers to a BAM file that does not contain aligned sequence read information and mapping quality parameters and a “mapped” BAM file refers to a BAM file that contains aligned sequence read information and mapping quality parameters.
The variant calling step may include detecting single-nucleotide polymorphisms (SNPs), insertions and deletions (InDels), multi-nucleotide polymorphisms (MNPs), and complex block substitution events. In various embodiments, a variant caller can be configured to communicate variants called for a sample genome as a *.vcf, *.gff, or *.hdf data file. The called variant information can be communicated using any file format as long as the called variant information can be parsed and/or extracted for analysis. The variant detection methods for use with the present teachings may include one or more features described in U.S. Pat. Appl. Publ. No. 2013/0345066, published Dec. 26, 2013, U.S. Pat. Appl. Publ. No. 2014/0296080, published Oct. 2, 2014, and U.S. Pat. Appl. Publ. No. 2014/0052381, published Feb. 20, 2014, and U.S. Pat. No. 9,953,130 issued Apr. 24, 2018, each of which is incorporated by reference herein in its entirety. In some embodiments, the variant calling step may be applied to molecular tagged nucleic acid sequence data. Variant detection methods for molecular tagged nucleic acid sequence data may include one or more features described in U.S. Pat. Appl. Publ. No. 2018/0336316, published Nov. 22, 2018, incorporated by reference herein in its entirety.
An artificial neural network (ANN) may operate in a training mode and an inference mode. During the training mode, the ANN receives training data from the input channels. The training data corresponds to a truth set of base calls. The ANN adapts its parameters in accordance with a loss function, as described below.
The flow order channels include entries corresponding to positions of all the signal measurements in the signal measurement channel, according to the nucleotide that was flowed to obtain each signal measurement. The input channels comprise arrays having the same size and may include padding with zeros. For example, a channel may have dimensions of 1×576. For a number of input signal measurements of 550, the signal measurement channel would be padded with zeros to a length of 576. The other input channels would also include 550 corresponding values and zero padding to a length of 576.
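As a brief illustration, the padding just described might be implemented as follows. This is a hedged sketch, not part of the disclosed system; the NumPy usage and variable names are assumptions, while the channel length of 576 and the measurement count of 550 come from the example above.

    import numpy as np

    N = 576                              # fixed channel length from the example
    n_meas = 550                         # number of actual signal measurements

    signal = np.random.rand(n_meas)      # placeholder signal measurements
    signal_channel = np.zeros((1, N))    # 1 x 576 signal measurement channel
    signal_channel[0, :n_meas] = signal  # measurements first, zero padding after

The remaining input channels would be built as arrays of the same padded length, with their 550 values followed by zeros.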
The simulated signal measurement channel may comprise an array of simulated signal measurements predicted using a simulation model that predicts what the expected signal measurements would be for a particular base to be called. The simulation model may be constructed in any suitable fashion to mathematically describe the sequencing process. For example, generating the simulated signal measurements may include one or more features described in U.S. Pat. Appl. Publ. No. 2012/0109598 published May 3, 2012, incorporated by reference herein in its entirety. The simulated signal measurements may be generated from a simulation model that includes model parameters such as a “carry forward” parameter and an “incomplete extension” parameter, as described above.
In some embodiments, the ANN may comprise a convolutional neural network (CNN). For example, the CNN may have a U-Net architecture. See, e.g., Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597v1 [cs.CV], 18 May 2015. A U-Net architecture may comprise a plurality of processing layers of an encoding portion, or contracting path, and a decoding portion, or expansive path. The encoding, or contracting, path is a convolutional network that includes multiple convolutional layers, each applying a plurality of convolutions followed by a batch normalizing operation and a non-linear activation function, e.g., the Sigmoid Linear Unit (SiLU). For the encoding, or contracting, path, one or more of the convolutional layers may be followed by a pooling layer performing max pooling operations. During contraction, the dimensions of the input signal information are reduced by the max pooling operations while feature information is increased by the convolution operations. The decoding, or expansive, path combines the feature and signal information through a sequence of up-convolutions and concatenations with high-resolution features from the encoding, or contracting, path.
A convolutional layer of a CNN may apply a plurality of convolutions to a given input channel with a plurality of convolutional kernels. In some convolutional layers, the convolutional kernel may have a size of B. A given convolutional kernel is convolved with the array of values of a given input channel to the layer. The convolution operation may include multiply, add and shift operations as follows:
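The operations themselves are not reproduced in this text. A minimal Python sketch, consistent with the cross-correlation interpretation and zero padding described in the next paragraph, and assuming an odd kernel size B, might be:

    def conv1d_same(x, w):
        # Cross-correlation of kernel w (size B, assumed odd) with array x;
        # x is zero-padded so the output has the same length as the input.
        B = len(w)
        pad = B // 2
        xp = [0.0] * pad + list(x) + [0.0] * pad
        # multiply-and-add at each shift of the kernel over the input
        return [sum(w[k] * xp[n + k] for k in range(B))
                for n in range(len(x))]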
The input array may be padded with zeros so that the convolved array has the same size as the input array. Convolution may include reflecting or reversing the feature map values prior to the convolution calculations. The convolution, as described herein, is consistent with a cross-correlation of the convolutional kernel and the input array. The terms "convolution" and "cross-correlation" are used interchangeably herein.
For multiple input channels, a particular convolutional kernel may be applied to a corresponding one of the input channels. The collection of convolutional kernels applied to multiple input channels is referred to as a filter. Each convolutional kernel “slides” over its respective input channel, producing a per-channel convolved array. The per-channel convolved arrays are then summed together to form a summed array (feature map) for an output channel. Optionally, a bias value may be added to each element of the summed array for an output channel. Multiple output channels for a given layer may be produced. The number of output channels corresponds to the number of filters applied to the input channels of the layer. The weights of the convolutional kernels and bias values may be learned during training mode and fixed at constant values during inference mode.
A batch normalization may be applied to each of the output channels of a given layer. The batch normalization for each output channel is calculated according to Equation (1):
y = ((x−E[x])/sqrt(Var[x]))*γ+β (1)
where y is the normalized value for the element in the output channel, x is the element value in the output channel, E[x] is the expected value, or mean value, of the elements in the output channel, Var[x] is the variance of the elements in the output channel, and γ and β are per-channel scale and shift parameters that may be learned during training mode. By default, the values of γ may be set to 1 and the values of β may be set to 0.
Batch normalization may be applied before or after an activation function is applied to normalize values for the next layer. In some embodiments, the activation function may be the sigmoid linear unit function (SiLU). The SiLU function is given by equation (2):
SiLU(x) = x/(1 + e^(−x)) (2)
where x may be the normalized value or non-normalized value of an output channel. Embodiments of the present disclosure may use other activation functions such as a rectified linear unit function (ReLU), leaky ReLU function, a softmax function, a sigmoid function, and the like.
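As a non-authoritative sketch of how one such convolutional layer might be composed, the following PyTorch helper stacks a convolution, the batch normalization of Equation (1) (with learnable γ and β), and the SiLU activation of Equation (2); PyTorch itself and the helper name are assumptions for illustration, not the disclosed implementation.

    import torch.nn as nn

    def conv_bn_silu(in_ch, out_ch, kernel_size):
        # One convolutional layer as described: convolution with stride 1 and
        # "same" zero padding, batch normalization (gamma initialized to 1,
        # beta to 0), then the SiLU activation x / (1 + e^(-x)).
        return nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, stride=1, padding="same"),
            nn.BatchNorm1d(out_ch),
            nn.SiLU(),
        )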
A pooling layer may apply maximum pooling (MaxPool) operations to the output channel of a convolutional layer that is input to the pooling layer. The MaxPool operation for a kernel size of 2 and stride value of 2 performs the following operations:
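Those operations are not reproduced in this text; a minimal sketch consistent with the halving of the feature map dimension described below is:

    def maxpool_k2_s2(x):
        # Each output element is the maximum of a non-overlapping pair of
        # inputs, so the output array is half the length of the input.
        return [max(x[2 * i], x[2 * i + 1]) for i in range(len(x) // 2)]

For example, maxpool_k2_s2([0.1, 0.9, 0.4, 0.3]) returns [0.9, 0.4].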
Operations by the convolution layers and pooling layers typically reduce (downsample) the dimension of the array (feature map) of an input channel or keep it unchanged. To expand the dimensions, another type of CNN layer applies transposed convolution operations to increase (upsample) the dimension of the output array. A 1D transposed convolution operator of kernel size 2 and stride value of 2 applies the following operations to a 1D array of size S of an input channel to the layer:
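Those operations likewise are not reproduced here; a sketch consistent with the described doubling of the output dimension, with kernel weights w[0] and w[1], is:

    def conv_transpose_k2_s2(x, w):
        # 1D transposed convolution with kernel size 2 and stride 2: each
        # input element spreads w[0] and w[1] into two adjacent outputs,
        # expanding an array of size S to size 2S.
        S = len(x)
        y = [0.0] * (2 * S)
        for i in range(S):
            y[2 * i] += w[0] * x[i]
            y[2 * i + 1] += w[1] * x[i]
        return y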
In some embodiments, the CNN may include a Convolutional Block Attention Module (CBAM), between the encoding, or contracting, path and the decoding, or expansive, path. See, e.g., Woo, S. et al. CBAM: Convolutional Block Attention Module, arXiv:1807.06521v2 [cs.CV], 18 Jul. 2018. The U-Net bottleneck part between the contracting path and the expansive path encodes the most powerful and discriminative semantic features. The CBAM includes a spatial-wise attention module and a channel-wise attention module, each applied to an intermediate feature map input to the CBAM. The spatial-wise attention module may comprise a standard convolutional layer with pooling. In the present context, "spatial" corresponds to the temporal dimension of the array of signal measurements. The channel-wise attention module emphasizes inter-channel relationships of features. In more detail, the channel-wise attention module applies max-pooling and average-pooling to the input features to the CBAM on a per-channel basis to generate two descriptor arrays, which denote average-pooled features and max-pooled features, respectively. Each descriptor array is input to a multi-layer perceptron (MLP) to produce an average-pooled output feature vector for the average-pooled features and a max-pooled output feature vector for the max-pooled features. The MLP may have two linear layers which may use activation functions. An element-wise summation of the max-pooled output feature vector and the average-pooled output feature vector is followed by a sigmoid function to produce a channel attention map, or array of channel attention values. The channel attention map has a length of the number of channels. An element-wise multiplication of the channel attention map and the intermediate feature map input to the CBAM forms a channel-refined feature map. The channel-refined feature map is provided to the spatial-wise attention module. The spatial-wise attention module applies max-pooling and average-pooling to the channel-refined feature map along the channel axis to generate a max-pooled feature map and an average-pooled feature map. The max-pooled feature map and average-pooled feature map are concatenated and convolved by a convolutional layer, followed by a sigmoid function to form a spatial attention map, or array of spatial attention features. The CBAM output is determined by an element-wise multiplication of the spatial attention map and the channel-refined feature map to form the refined output feature map of the CBAM. The weights for the MLP and the convolutional kernels may be learned during the training mode.
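The following PyTorch module is a hedged sketch of the 1D CBAM just described, following Woo et al.; the class name, the reduction parameter default, and the SiLU between the MLP's linear layers are assumptions, not the disclosed implementation.

    import torch
    import torch.nn as nn

    class CBAM1d(nn.Module):
        def __init__(self, channels, reduction=16, spatial_kernel=7):
            super().__init__()
            # Shared two-layer MLP for the channel-wise attention module.
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.SiLU(),
                nn.Linear(channels // reduction, channels),
            )
            # Convolution for the spatial-wise attention module: 2 input
            # channels (max-pooled and average-pooled), 1 output channel.
            self.spatial_conv = nn.Conv1d(2, 1, spatial_kernel, stride=1,
                                          padding=spatial_kernel // 2)

        def forward(self, x):                        # x: (batch, C, L)
            # Channel attention: pool over the temporal ("spatial") axis,
            # run both descriptors through the MLP, sum, apply sigmoid.
            ch_att = torch.sigmoid(self.mlp(x.mean(dim=2)) +
                                   self.mlp(x.amax(dim=2))).unsqueeze(2)
            x = x * ch_att                           # channel-refined feature map
            # Spatial attention: pool along the channel axis, concatenate,
            # convolve, and apply sigmoid.
            pooled = torch.cat([x.amax(dim=1, keepdim=True),
                                x.mean(dim=1, keepdim=True)], dim=1)
            sp_att = torch.sigmoid(self.spatial_conv(pooled))
            return x * sp_att                        # refined output feature map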
The decoding, or expansive, path may combine the feature and signal information by applying up-convolutions and concatenations with high-resolution features from the encoding, or contracting, path. A copy and concatenate step, also called a skip connection, may copy the feature map of a channel (layer) at a particular scale in the encoder path and concatenate it with the upsampled feature map of a channel (layer) at the same scale in the decoding path. The concatenated feature maps increase the number of feature channels.
In more detail, at layer 602 of layer group 611, a number C of one-dimensional convolutions are applied to each of the M input channels. Each convolutional kernel has a size of B. A given convolutional kernel is convolved with the values of a given input channel with a stride value of 1. Batch normalization may be applied to each of the C feature maps generated by the convolutions to produce normalized feature maps. An activation function, such as SiLU, is applied to the normalized feature maps to produce the output feature maps for the output channel of layer 602. The scale for layer 602 is C×N. The number of output channels C corresponds to the number of filters applied to the input channels of the layer 602. The output channels of layer 602 are input to layer 603. At layer 603, a number C of one-dimensional convolutions with stride value of 1 are applied to the output channels of layer 602, followed by batch normalization and activation function, such as SiLU, to produce the output feature maps for the C output channels of layer 603. The scale for layer 603 is C×N. The C output channels of layer 603 are input to pooling layer 604. At pooling layer 604, a MaxPool operation having a kernel size of 2 and stride value of 2 is applied to each output channel of layer 603, which reduces the dimension of the feature map in each channel to N/2. The scale for layer 604 is C×(N/2).
The layer group 612 doubles the number of feature channels to 2C and reduces the dimensions of the feature maps to N/4. In more detail, the C output channels of layer 604 are input to layer 605 of layer group 612. A number 2C of one-dimensional convolutions are applied to each of the C output channels from layer 604. Each convolutional kernel has a size of B. A given convolutional kernel is convolved with the values of a given input channel with a stride value of 1. Batch normalization may be applied to each of the 2C feature maps generated by the convolutions to produce normalized feature maps. An activation function, such as SiLU, is applied to the normalized feature maps to produce the output feature maps for the output channel of layer 605. The scale for layer 605 is 2C×(N/2). The number of output channels 2C corresponds to the number of filters applied to the input channels of the layer 605. The output channels of layer 605 are input to layer 606. At layer 606, a number 2C of one-dimensional convolutions with stride value of 1 are applied to the output channels of layer 605, followed by batch normalization and activation function, such as SiLU, to produce the output feature maps for the 2C output channels of layer 606. The scale for layer 606 is 2C×(N/2). The 2C output channels of layer 606 are input to pooling layer 607. At pooling layer 607, a MaxPool operation having a kernel size of 2 and stride value of 2 is applied to each output channel of layer 606, which reduces the dimension of the feature map in each channel to N/4. The scale for layer 607 is 2C×(N/4).
The layer group 613 doubles the number of feature channels from 2C to 4C and reduces the dimensions of the feature maps to N/8. In more detail, the 2C output channels of layer 607 are input to layer 608 of layer group 613. A number 4C of one-dimensional convolutions are applied to each of the 2C output channels from layer 607. Each convolutional kernel has a size of B. A given convolutional kernel is convolved with the values of a given input channel with a stride value of 1. Batch normalization may be applied to each of the 4C feature maps generated by the convolutions to produce normalized feature maps. An activation function, such as SiLU, is applied to the normalized feature maps to produce the output feature maps for the output channel of layer 608. The scale for layer 608 is 4C×(N/4). The number of output channels 4C corresponds to the number of filters applied to the input channels to the layer 608. The output channels of layer 608 are input to layer 609. At layer 609, a number 4C of one-dimensional convolutions with stride value of 1 are applied to the output channels of layer 608, followed by batch normalization and activation function, such as SiLU, to produce the output feature maps for the 4C output channels of layer 609. The scale for layer 609 is 4C×(N/4). The 4C output channels of layer 609 are input to pooling layer 610. At pooling layer 610, a MaxPool operation having a kernel size of 2 and stride value of 2 is applied to each output channel of layer 609, which reduces the dimension of the feature map in each channel to N/8. The scale for layer 610 is 4C×(N/8).
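Under the same assumptions as the earlier sketches, the three encoder layer groups might be assembled as follows, reusing the conv_bn_silu helper sketched above; M, C, and B are the parameters named in the text, and the numeric values are hypothetical.

    import torch.nn as nn

    def layer_group(in_ch, out_ch, B):
        # Two convolutional layers followed by MaxPool (kernel 2, stride 2),
        # mirroring layers 602-604, 605-607, and 608-610.
        return nn.Sequential(
            conv_bn_silu(in_ch, out_ch, B),
            conv_bn_silu(out_ch, out_ch, B),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )

    M, C, B = 8, 32, 5                        # hypothetical example values
    group_611 = layer_group(M, C, B)          # output scale C x (N/2)
    group_612 = layer_group(C, 2 * C, B)      # output scale 2C x (N/4)
    group_613 = layer_group(2 * C, 4 * C, B)  # output scale 4C x (N/8)

In a full U-Net, the pre-pooling outputs of layers 603, 606, and 609 would also be retained for the skip connections described below.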
The 4C output channels of pooling layer 610 provide the intermediate feature map input to the CBAM 630. As described above, the CBAM applies a channel-wise attention module and a spatial-wise attention module to the output from pooling layer 610. The channel-wise attention module uses a parameter that defines the size of the output of the first linear layer, which corresponds to the size of the input for the second linear layer of the MLP. The channel-wise attention module produces a channel attention map with dimension of 4C. The spatial-wise attention module applies a convolutional layer having a kernel size of 7, a stride value of 1, 2 input channels (one from max-pooling and one from average-pooling), and one output channel to the channel-refined feature map. The spatial-wise attention module produces a spatial attention map with dimension of N/8. The CBAM generates the refined output feature map having a scale of 4C×(N/8).
In more detail for the decoder 650, the 4C channels of the refined output feature map of the CBAM are input to the convolution transpose layer 631 of layer group 615. The convolution transpose layer 631 is configured to perform transposed convolution operations of kernel size 2 and stride value of 2 for upsampling the features of the 4C input channels by a factor of 2. The scale for layer 631 is 4C×(N/4). The encoded features of layer 609, having the same scale of 4C×(N/4), are concatenated 641 with the upsampled features of layer 631 to form concatenated features. The concatenated features are provided to the convolutional layer 632. At layer 632, a number 4C of one-dimensional convolutions are applied to the concatenated features for each of the 4C channels. Each convolutional kernel has a size of B. A given convolutional kernel is convolved with the values of a given input channel with a stride value of 1. Batch normalization may be applied to each of the 4C feature maps generated by the convolutions to produce normalized feature maps. An activation function, such as SiLU, is applied to the normalized feature maps to produce the output feature maps for the output channels of layer 632. The scale for layer 632 is 4C×(N/4). The number of output channels 4C corresponds to the number of filters applied to the input channels to the layer 632. The output channels of layer 632 are input to layer 633. At layer 633, a number 4C of one-dimensional convolutions with stride value of 1 are applied to the output channels of layer 632, followed by batch normalization and activation function, such as SiLU, to produce the output feature maps for the 4C output channels of layer 633. The scale for layer 633 is 4C×(N/4). The 4C output channels of layer 633 are input to layer 634.
The layer group 616 halves the number of feature channels to 2C and increases the dimensions of the feature maps to N/2. In more detail, the 4C output channels of layer 633 are input to layer 634 of layer group 616. The convolution transpose layer 634 is configured to perform transposed convolution operations of kernel size 2 and stride value of 2 for upsampling the feature maps of the 4C input channels by a factor of 2 and producing 2C output channels. The scale for layer 634 is 2C×(N/2). The encoded features of layer 606, having the same scale of 2C×(N/2), are concatenated 642 with the upsampled features of layer 634 to form concatenated features. The concatenated features are provided to the convolutional layer 635. At layer 635, a number 2C of one-dimensional convolutions are applied to the concatenated features for each of the 2C channels. Each convolutional kernel has a size of B. A given convolutional kernel is convolved with the values of a given input channel with a stride value of 1. Batch normalization may be applied to each of the 2C feature maps generated by the convolutions to produce normalized feature maps. An activation function, such as SiLU, is applied to the normalized feature maps to produce the output feature maps for the output channel of layer 635. The scale for layer 635 is 2C×(N/2). The number of output channels 2C corresponds to the number of filters applied to the input channels to the layer 635. The output channels of layer 635 are input to layer 636. At layer 636, a number 2C of one-dimensional convolutions with stride value of 1 are applied to the output channels of layer 635, followed by batch normalization and activation function, such as SiLU, to produce the output feature maps for the 2C output channels of layer 636. The scale for layer 636 is 2C×(N/2). The 2C output channels of layer 636 are input to layer 637.
The layer group 617 halves the number of feature channels to C and increases the dimensions of the feature maps to N. In more detail, the 2C output channels of layer 636 are input to layer 637 of layer group 617. The convolution transpose layer 637 is configured to perform transposed convolution operations of kernel size 2 and stride value of 2 for upsampling the feature maps of the 2C input channels by a factor of 2 and producing C output channels. The scale for layer 637 is C×N. The encoded features of layer 603, having the same scale of C×N, are concatenated 643 with the upsampled features of layer 637 to form concatenated features. The concatenated features are provided to the convolutional layer 638. At layer 638, a number C of one-dimensional convolutions are applied to the concatenated features for each of the C channels. Each convolutional kernel has a size of B. A given convolutional kernel is convolved with the values of a given input channel with a stride value of 1. Batch normalization may be applied to each of the C feature maps generated by the convolutions to produce normalized feature maps. An activation function, such as SiLU, is applied to the normalized feature maps to produce the output feature maps for the output channel of layer 638. The scale for layer 638 is C×N. The number of output channels C corresponds to the number of filters applied to the input channels to the layer 638. The output channels of layer 638 are input to layer 639. At layer 639, a number C of one-dimensional convolutions with stride value of 1 are applied to the output channels of layer 638, followed by batch normalization and activation function, such as SiLU, to produce the output feature maps for the C output channels of layer 639. The scale for layer 639 is C×N. The C output channels of layer 639 are input to the last layer 640. At layer 640, a number CL of convolutional kernels, each having a size of BL, are applied to each of the C input channels to produce a CL×N output 660.
The last layer 640 may use a convolutional layer instead of a linear layer to perform regression for the output channels. For example, the convolutional kernel may have a size of BL=1 to reduce dimensionality, and the output may have CL channels, e.g., CL=3. During inference mode, one of the CL output channels provides the signal correction values. During training mode, all of the CL output channels, e.g., CL=3, may be used for multi-task learning to improve overall model performance. For example, the CL=3 output channels of the last layer 640 used during training mode may include an array of signal correction values, an array of labeled simulated signal measurements, and an array of maximum allowed residual values. The simulation model may be applied to a labeled base sequence to generate the labeled simulated signal measurements, as described below. The maximum allowed residual values are the maximum error allowed for calling homopolymer lengths corresponding to labeled measurements, as described below.
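A hedged sketch of one decoder layer group and the last layer, under the same PyTorch assumptions and reusing the conv_bn_silu helper sketched earlier, might be:

    import torch
    import torch.nn as nn

    class DecoderGroup(nn.Module):
        # One decoder layer group (e.g., 615, 616, or 617): transposed
        # convolution for upsampling, concatenation with the matching
        # encoder features (skip connection), then two convolutional layers.
        def __init__(self, in_ch, out_ch, B):
            super().__init__()
            self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)
            self.conv1 = conv_bn_silu(2 * out_ch, out_ch, B)  # after concat
            self.conv2 = conv_bn_silu(out_ch, out_ch, B)

        def forward(self, x, skip):
            x = torch.cat([skip, self.up(x)], dim=1)  # concatenations 641-643
            return self.conv2(self.conv1(x))

    C, CL, BL = 32, 3, 1                           # hypothetical example values
    last_layer = nn.Conv1d(C, CL, kernel_size=BL)  # layer 640: CL x N output

With this parameterization, layer group 615 would be DecoderGroup(4 * C, 4 * C, B), group 616 DecoderGroup(4 * C, 2 * C, B), and group 617 DecoderGroup(2 * C, C, B).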
During inference mode, the output 660 provides an array of signal correction values having a dimension of 1×N.
In an example, Table 3 lists parameters for the U-Net, with entries corresponding to the layers described above.
As noted above, artificial neural networks for performing signal correction operations using neural network architectures in accordance with embodiments of the present disclosure are trained using training data sets. The training data sets match sequencing data, obtained from sequencing experiments applied to a sample of a known reference that has well-characterized genotyping information, to corresponding known truth base sequences. Industry standard cell line samples for NGS benchmarking, such as NA12878 (a.k.a. HG001) and NA24385 (a.k.a. HG002), may be used as references for training the ANN. For example, labeling the sequence data of an NA12878 library provides the corresponding "ground truth". Other well-characterized samples, either publicly available or proprietary, can be used for training. For example, a sample may be derived from an assay such as the AMPLISEQ™ CARRIERSEQ™ ECS panel (Thermo Fisher Scientific) for expanded carrier screening (ECS) or the AMPLISEQ™ Exome panel (Thermo Fisher Scientific) for genome and DNA sequencing.
Training an ANN generally involves initializing the ANN (e.g., setting the weights in the network, such as the weights in the convolutional kernels, to random values), and providing the training input data to the network. The output of the network is then compared against the labeled training data to generate an error signal (e.g., a difference between the current output and the “ground truth” output), and a backpropagation algorithm is used with gradient descent to update the weights, over many iterations, such that the network computes a result closer to the desired ground truth.
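A minimal, generic sketch of that procedure is shown below. It is an illustration only, not the training code of this disclosure: the stand-in model, the random tensors, the learning rate, and the choice of the Adam optimizer are all assumptions.

    import torch
    import torch.nn as nn

    model = nn.Conv1d(1, 1, kernel_size=5, padding=2)  # stand-in network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    inputs = torch.randn(16, 1, 576)   # batch of input channels
    truth = torch.randn(16, 1, 576)    # labeled "ground truth" targets

    for step in range(100):            # many iterations in practice
        optimizer.zero_grad()
        pred = model(inputs)           # forward pass
        loss = loss_fn(pred, truth)    # error signal vs. ground truth
        loss.backward()                # backpropagation of gradients
        optimizer.step()               # gradient-based weight update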
In addition, the process and system can align the reads or sequences with each other and optionally with an expected sequence, referred to herein as alignment. The labeling system can utilize either the aligned read sequences from the alignment process or can use the raw base sequence from the base calling process to associate the read sequences and the associated signal from the sequencer with a known sequence, its variants, and a simulated signal measurement, such as a predicted signal measurement in flow space or a predicted flow measurement. The labeling system can further determine a signal correction value to associate with the flow measurement and associated read sequence.
Such a process and system find particular use when the same region of a genome, for example, is amplified and sequenced many times and stored in many reads. For example, targeted sequencing using library preparation methods, such as AmpliSeq™ by Ion Torrent, provides many copies of targeted regions that are sequenced, providing many stored reads of the same targeted region. Alternatively, random fragments of the genome can be used when enough copies of the same fragment (i.e., group) are obtained.
Training a neural network, such as a U-Net convolutional neural network, utilizes a significant amount of data. A labeling system can be utilized to retrieve measurements and label the measurements with the appropriate expected values and underlying parameters. Such labeling can be complicated by variability in the expected values of the measurements.
In an example, a measured series or sequence can be compared to a set of series or sequences to determine which series or sequence most closely matches the measured series or sequence. Some series or sequences within the set of series or sequences may include one or more positions that have more than one valid or expected value, referred to as variants. Each of these variants represents the valid or expected value. As such, the variants represent valid results to be detected and do not represent error. The measured series or sequence can be matched to the variant within the matched series or sequence. Proper labeling of the measured series or sequence with the appropriate sequence and variant allows for the detection of such variants instead of treating the variants as errors to be corrected. For example, a series can be a series of signal measurements in flow space. In another example, a sequence can be a sequence of base calls.
Such labeling techniques find particular use when sequencing biological systems, such as nucleic acids or proteins. For example, segments of a larger nucleic acid sample can be amplified to provide multiple copies of these nucleic acid segments. Such segments can then be sequenced to determine which alleles are present within the sample. The alleles relate to which variants are present on different sequences derived from the segments. In a particular example, the amplification is designed to amplify specific amplicon sequences found at regions of the sample nucleic acid. The measured sequences can be matched with expected amplicon sequences of the set of amplicon sequences. Once the most likely amplicon sequence is matched with the measured sequence, variants within the amplicon sequence are compared to the measured sequence to determine which variants represent an expected or true value to associate with the measured sequence. As such, the measured sequence can be labeled with the matched amplicon sequence and the associated variant.
As described above, it is desired to correct errors in a flow measurement (signal measurements in flow space) prior to determining a sequence based on the flow measurement. The flow measurement is a measurement indicative of incorporation events in response to the flow of nucleotides in an ordered series in flow space. Flow space represents the coordinate system including a series of nucleotide flows based on the ordered flow of nucleotides through the system that may cause incorporations. Flow measurements can be utilized to determine a base sequence that can be matched with a sequence of a set of sequences in base space and further matched with variants within the matched sequence. Base space represents the coordinate system including a position of bases within a sequence. The matched sequence and associated variants can then be used to generate an expected flow measurement. The expected flow measurement, flow measurement, and flow order can be utilized to train a neural network to provide corrections to the flow measurement signal, improving the determination of sequences based on the flow measurements.
To label measurements, a set of expected or acceptable values is generated and then associated with the measurements that closely match the expected values. The labeled measurements can then be grouped and statistical values determined about the labeled measurements.
In particular, a set of labels can be generated, as illustrated at block 1102.
In a particular example, a set of amplicon sequences can be generated based on expected sequences or segments derived from a nucleic acid sample. For example, each of the amplicon sequences can represent the expected sequence of a segment or region of a chromosome. In an example, such chromosomes may be well characterized in a public database or may be proprietary.
Some sequences of the set of sequences can include variants at positions within the sequence. As illustrated at block 1204, the variants within the sequences can be identified. For example, there may be variants at known positions on a chromosome. Amplicon sequences incorporating those known positions can be identified and the variants within that amplicon identified. The variants can represent single-nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs), insertions or deletions (InDels), among other variants.
For example, in chromosome 1 (chr1) of the human genome, depending on the cell line (e.g., HG001 or HG002), there are known single nucleotide polymorphisms (SNPs) at chr1:11794698 and at chr1:11794839. The reference genome can be stored in a reference genome file (.fasta or .fasta.fai). For well characterized genomes, such variants are known and can be identified. Amplicons can be designed that overlap with known variants. Such amplicons can be defined in a BED file (.bed). Each amplicon defined in the BED file can be compared (walked) to determine which positions within the amplicon include a variant and what the permissible values of that variant are. In an example, the true variants and associated amplicons can be defined in a true-variant variant call format (VCF) file (.vcf).
In the example, there are two possible ground truth base sequences (on the forward strand) in the amplicon, represented as:
where {phased_seq_1} and {phased_seq_2} represent the truth of amplicon insert sequences (expected insert sequences) generated by the first and second phased alleles, respectively.
In some embodiments, targeted regions of a genome can be sequenced. In an example, amplicons can be used to amplify target regions of sample nucleic acids. For example, a known set of amplicons can be used to amplify target regions of a nucleic acid sample, and each of the read sequences resulting from sequencing the amplified target regions includes at least a portion of the amplicon used to target that region. As illustrated at block 1304, an amplicon can be assigned to the read sequence. For example, an amplicon/primer of a set of expected amplicons can be matched with the read sequence. In an example, an amplicon matching algorithm can be found in the Torrent Variant Caller available from Ion Torrent of Thermo Fisher Scientific, Inc.
At least a portion of the gene-specific or amplicon primer can remain on the nucleic acid sequenced to form the read sequence. As illustrated at block 1306, the primer lengths can be determined for a given read sequence. To generate an expected flow space flow measurement based on the identified amplicon, the primers on the 5′ end and the 3′ end of the read are determined. In some examples, the primer region does not contain variants. As a result, the primer sequences can be determined by the reference genome and the length of the primers. Ideally, the lengths of the primers can be inferred from the amplicon start/end positions and the read alignment start/end positions. However, sequencing errors, soft clipping, or other alignment artifacts can affect the start/end positions of the read. In general, the 3′ end of a read can be noisier and suffer from homopolymer insertions and deletions (HP-INDELs), soft clipping, and quality trimming. Optionally, the length of the 3′ primer of a read can be estimated by the 5′ primer of reads on opposite strands. For example, segments of both strands of a double helix DNA can be sequenced, and the 3′ ends inferred from the 5′ end of the opposite strand.
As illustrated at block 1308, the read sequence can be labeled. In an example, the read sequence and associated flow measurement can first be labeled in base space, identifying the amplicon and variants within the amplicon associated with the read sequence. In addition, the read sequence can be further labeled with an expected flow measurement generated based on the matched amplicon sequence (predicted flow measurement) and variants, as well as model parameters and nucleotide flow order. Further, the read sequence can be labeled with an indicator of strand direction, the 5′ end overhead, the 3′ end overhead, or a residual array.
In some cases, errors in the base calling make identification of the amplicon and associated variants difficult. As such, a predicted flow measurement (signal measurements in flow space) can be generated based on flow order and the matched amplicon, as illustrated at block 1404. For example, a simulation model and associated model parameters, such as a “carry forward” parameter and an “incomplete extension” parameter, can be used to generate a predicted flow measurement.
Based on the most likely match of the predicted flow measurement, the amplicon and associated variants can be associated with the read sequence and its associated flow measurement, as illustrated at block 1406. For example, the read sequence and flow measurement can be labeled with the predicted flow measurement, and optionally with the amplicon, variant, nucleotide flow order, strand direction, or modeling parameters. For example, a predicted flow measurement can be generated from the matched amplicon and most likely variant using the simulation model. In an example, the model and associated parameters can be used along with the nucleotide flow order to provide a predicted flow measurement.
As illustrated at block 1408, the residual or difference between the flow measurement and the predicted flow measurement can be determined for each position within the series of the flow measurement and predicted flow measurement. A hash key can be generated for each read sequence according to the read labeling results.
As illustrated at block 1504, an average residual at each position in the series of the flow measurement and the predicted flow measurement can be determined. In particular, a mean of the residual at each position can be determined. In an example, the average residual can be provided as an array.
As illustrated at block 1506, a variance at each position can also be determined. The variance at each flow can indicate how noisy the flow measurements are at that flow (i.e., each flow of a nucleotide). Residuals and measurements depend on the phasing model. Different reads may have different model parameters that may or may not further induce variability in the residuals. Variability of model parameters can cause minor variability in the predicted signals. The variance at each flow can be used to analyze or regularize the regression loss during the training steps. In an example, the variance can be provided as an array.
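As an illustrative sketch of blocks 1504 and 1506, the per-position statistics might be computed as follows; the array shapes (reads by flows, for one group) and the NumPy usage are assumptions.

    import numpy as np

    group_flows = np.random.rand(40, 576)   # measured flows for one group
    group_pred = np.random.rand(40, 576)    # predicted flows for the group

    residuals = group_flows - group_pred        # per-read residuals (block 1408)
    mean_residual = residuals.mean(axis=0)      # average residual array
    residual_variance = residuals.var(axis=0)   # per-flow variance array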
As such, a set of read sequences determined based on flow measurements are labeled with an associated expected flow measurement and, optionally, an amplicon, variant, nucleotide flow order, residual or mean residual array, or model parameters, such as phase modeling parameters. Such labeled read sequences can be utilized to train a neural network to predict adjustment factors (e.g., signal correction values) that can be used to adjust flow measurements that are used to provide improved base calling. Optionally, the ANN can provide a corrected flow measurement.
In an example, a channel is represented by an array. For example, a flow measurement can be represented by a 1×N array in which N is a number, such as an even number greater than the number of flows within a flow measurement. Similarly, the predicted flow measurement can be represented by a 1×N array. In another example, the flow order can be represented by a 4×N array (e.g., flow order channels) in which the flow of each nucleotide is recorded in each of the rows of the channel. In a particular example, a position within the flow order (e.g., signal position channel) and a series mask (e.g., signal mask channel) are provided as input channels. For example, a signal position channel can be dedicated to indicate position, having a range of values from 0 to 1. Owing to various errors or conditions of the simulation model, the expected signal generated by the simulation model can vary based on position. In addition, the size of the array may be greater than the number of data points, for example, an even number greater than the number of data points. A series mask (e.g., signal mask channel) can be used to indicate which columns of the array contain data and which are set to null.
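A hedged sketch of assembling these input channels follows; the flow order string, N, and the number of flows are hypothetical examples.

    import numpy as np

    N = 576                                   # array size (even, > data points)
    n_flows = 550                             # actual flows in the measurement
    flow_order = ("TACG" * (N // 4))[:N]      # repeating nucleotide flow order

    measurement = np.zeros((1, N))            # 1 x N flow measurement channel
    predicted = np.zeros((1, N))              # 1 x N predicted flow channel

    flow_channels = np.zeros((4, N))          # 4 x N flow order channels
    for i, base in enumerate(flow_order[:n_flows]):
        flow_channels["ACGT".index(base), i] = 1.0

    position = np.arange(N) / (N - 1)         # signal position channel in [0, 1]
    mask = np.zeros((1, N))                   # series mask: 1 = data, 0 = null
    mask[0, :n_flows] = 1.0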
As illustrated at block 1604, an array of generated residual values (signal correction values) can be provided as an output channel. During training, the generated flow measurement (signal values) can also be provided as an output channel. It has been found that utilizing both the generated residual array and another factor, such as the generated flow measurement, as outputs to be predicted by the neural network improves the training of the neural network. When the neural network is used for inference, output associated with the other factor can be disregarded. In an example, the output signal correction values can be compared to the residual array or the mean residual array. For example, a difference can be determined. In another example, the generated flow measurement can be compared to the predicted flow measurement; for example, a difference can be determined.
As illustrated at block 1606, a convolutional neural network can be iteratively trained using the input channels and the output channels. During inference mode, the prediction of the second factor, such as the predicted flow measurements or variance array, can be ignored, and the predicted adjustment factors or predicted residuals can be used to adjust the flow measurements to improve base calling. In an example, the training optimizes the parameters of the ANN to minimize the mean squared error. For example, the training minimizes the mean square error between the residual array or mean residual array and the output (signal correction values) from the neural network. In another example, the training also minimizes the mean square error between the predicted flow measurements and the output (signal values) from the neural network. Optionally, other output channels, such as a variance channel, can be used to improve training. The methods for training may comprise a machine learning algorithm, for example, a stochastic gradient descent (SGD) algorithm, Adam, Adadelta, Adagrad, or other adaptive learning algorithm.
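As an interpretation of the multi-output training just described, the combined objective might be sketched as a sum of per-channel mean squared errors; the CL=3 channel layout follows the earlier description, while the function itself is an assumption.

    import torch
    import torch.nn as nn

    mse = nn.MSELoss()

    def multitask_loss(output, labels):
        # output, labels: (batch, 3, N) tensors holding the signal correction
        # values, the predicted (simulated) signal values, and a third labeled
        # channel, each regressed with its own mean squared error.
        return sum(mse(output[:, c], labels[:, c]) for c in range(3))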
In an example, the training set can include between 1,000 and 10,000,000 labeled reads. For example, the training set can include 10,000 to 5,000,000 labeled reads, such as 100,000 to 5,000,000 labeled reads or 500,000 to 5,000,000 labeled reads. The labeled reads can be grouped into 1,000 to 100,000 groups. For example, the labeled reads can be grouped into 1,000 to 50,000 groups, such as 5,000 to 50,000 groups. A majority of the groups have at least 10 reads, such as at least 20 reads or at least 30 reads, but generally, not greater than 10,000 reads. In an example, the mean group size is in a range of 10 to 100 reads, such as a range of 20 to 100 reads or a range of 30 to 80 reads. In an example, labeled reads of the training set can have between 50 and 1000 bases in base space, such as 50 to 600 bases or 100 to 400 bases. In flow space, the flow measurements associated with reads can have between 100 and 10,000 flows, such as between 100 and 5,000 flows or between 400 and 2,000 flows.
The artificial neural network architecture may be implemented in any processor. The network architecture may be implemented across multiple processors, multiple devices, or both.
According to various exemplary embodiments, one or more features of any one or more of the above-discussed teachings and/or exemplary embodiments may be performed or implemented using appropriately configured and/or programmed hardware and/or software elements. Determining whether an embodiment is implemented using hardware and/or software elements may be based on any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds, etc., and other design or performance constraints.
Examples of hardware elements may include processors, microprocessors, input(s) and/or output(s) (I/O) device(s) (or peripherals) that are communicatively coupled via a local interface circuit, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), graphics processing units (GPUs), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. The local interface may include, for example, one or more buses or other wired or wireless connections, controllers, buffers (caches), drivers, repeaters and receivers, etc., to allow appropriate communications between hardware components. A processor is a hardware device for executing software, particularly software stored in memory. The processor can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer, a semiconductor based microprocessor (e.g., in the form of a microchip or chip set), a macroprocessor, or generally any device for executing software instructions. A processor can also represent a distributed processing architecture. The I/O devices can include input devices, for example, a keyboard, a mouse, a scanner, a microphone, a touch screen, an interface for various medical devices and/or laboratory instruments, a bar code reader, a stylus, a laser reader, a radio-frequency device reader, etc. Furthermore, the I/O devices also can include output devices, for example, a printer, a bar code printer, a display, etc. Finally, the I/O devices further can include devices that communicate as both inputs and outputs, for example, a modulator/demodulator (modem; for accessing another device, system, or network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, etc.
Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Software in memory may include one or more separate programs, which may include ordered listings of executable instructions for implementing logical functions. The software in memory may include a system for identifying data streams in accordance with the present teachings and any suitable custom made or commercially available operating system (O/S), which may control the execution of other computer programs such as the system, and provide scheduling, input-output control, file and data management, memory management, communication control, etc.
According to various exemplary embodiments, one or more features of any one or more of the above-discussed teachings and/or exemplary embodiments may be performed or implemented using appropriately configured and/or programmed non-transitory machine-readable medium or article that may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the exemplary embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, scientific or laboratory instrument, etc., and may be implemented using any suitable combination of hardware and/or software. The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, read-only memory compact disc (CD-ROM), recordable compact disc (CD-R), rewriteable compact disc (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disc (DVD), a tape, a cassette, etc., including any medium suitable for use in a computer. Memory can include any one or a combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, EPROM, EEPROM, flash memory, hard drive, tape, CDROM, etc.). Moreover, memory can incorporate electronic, magnetic, optical, and/or other types of storage media. Memory can have a distributed architecture where various components are situated remote from one another, but are still accessed by the processor. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, etc., implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
According to various exemplary embodiments, one or more features of any one or more of the above-discussed teachings and/or exemplary embodiments may be performed or implemented at least partly using a distributed, clustered, remote, or cloud computing resource.
According to various exemplary embodiments, one or more features of any one or more of the above-discussed teachings and/or exemplary embodiments may be performed or implemented using a source program, executable program (object code), script, or any other entity comprising a set of instructions to be performed. When the software is a source program, the program can be translated via a compiler, assembler, interpreter, etc., which may or may not be included within the memory, so as to operate properly in connection with the O/S. The instructions may be written using (a) an object oriented programming language, which has classes of data and methods, or (b) a procedural programming language, which has routines, subroutines, and/or functions, which may include, for example, C, C++, R, Pascal, Basic, Fortran, Cobol, Perl, Java, and Ada.
According to various exemplary embodiments, one or more of the above-discussed exemplary embodiments may include transmitting, displaying, storing, printing or outputting to a user interface device, a computer readable storage medium, a local computer system or a remote computer system, information related to any information, signal, data, and/or intermediate or final results that may have been generated, accessed, or used by such exemplary embodiments. Such transmitted, displayed, stored, printed or outputted information can take the form of searchable and/or filterable lists of runs and reports, pictures, tables, charts, graphs, spreadsheets, correlations, sequences, and combinations thereof, for example.
Example 1 is a method for correcting signal measurements, including: providing a plurality of signal measurements to a channel of an input layer to an artificial neural network (ANN), wherein the input layer includes one or more channels; applying the ANN to the plurality of signal measurements to generate a plurality of signal correction values; subtracting the plurality of signal correction values from the plurality of signal measurements to form a plurality of corrected signal measurements; and applying base calling to the plurality of corrected signal measurements to produce a sequence of base calls.
Example 2 includes the subject matter of Example 1, and further specifies that the input layer further includes a channel for a plurality of simulated signal measurements, wherein the plurality of simulated signal measurements corresponds to the plurality of signal measurements.
Example 3 includes the subject matter of Example 1, and further specifies that the input layer further includes a channel for representing a flow order corresponding to nucleotides flowed, wherein the plurality of signal measurements was detected in response to the nucleotides flowed in the flow order.
Example 4 includes the subject matter of Example 3, and further specifies that the flow order is represented by four binary arrays in four channels of the input layer, wherein a 1 in a position in the array indicates that a particular nucleotide was flowed in that position in the flow order to generate the corresponding signal measurement, wherein flow orders for nucleotides A, T, C, and G are each represented in a respective one of the arrays.
Example 5 includes the subject matter of Example 1, and further specifies that the input layer further includes a channel for an array of values indicating positions of the plurality of signal measurements.
Example 6 includes the subject matter of Example 1, and further specifies that the input layer further includes a channel for a signal mask, wherein the signal mask includes an array having 1's in positions corresponding to the plurality of signal measurements and 0's in positions corresponding to no signal measurement, wherein a number of signal measurements is less than or equal to a size of the array.
Example 7 includes the subject matter of Example 1, and further specifies that the ANN includes a convolutional neural network (CNN).
Example 8 includes the subject matter of Example 7, and further specifies that the CNN includes a U-Net.
Example 9 includes the subject matter of Example 8, and further specifies that the U-Net includes an encoder configured to receive the channels of the input layer and to generate feature maps at a plurality of scales.
Example 10 includes the subject matter of Example 9, and further specifies that the encoder further includes a plurality of layer groups, wherein each layer group includes one or more convolutional layers.
Example 11 includes the subject matter of Example 10, and further specifies that the convolutional layer applies a plurality of convolutions to input channels provided to the convolutional layer to produce a plurality of feature maps.
Example 12 includes the subject matter of Example 11, and further specifies that the convolutional layer further includes a batch normalization applied to the plurality of feature maps to produce normalized feature maps.
Example 13 includes the subject matter of Example 12, and further specifies that the convolutional layer further includes applying an activation function to the normalized feature maps to produce output feature maps for output channels of the convolutional layer.
Example 14 includes the subject matter of Example 13, and further specifies that the activation function includes a sigmoid linear unit function (SiLU).
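For reference, the sigmoid linear unit of Example 14 is conventionally defined as

$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}},$$

where $\sigma$ denotes the logistic sigmoid function.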
Example 15 includes the subject matter of Example 10, and further specifies that each layer group further includes a pooling layer, wherein the pooling layer receives output channels from a last convolutional layer in the one or more convolutional layers.
Example 16 includes the subject matter of Example 15, and further specifies that the pooling layer applies a MaxPool operation having a kernel size of 2 and stride value of 2 to each output channel.
Example 17 includes the subject matter of Example 10, and further specifies that a number of the convolutional layers is two.
Example 18 includes the subject matter of Example 10, and further specifies that a number of layer groups in the plurality of layer groups is three.
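By way of illustration and not limitation, an encoder layer group consistent with Examples 10-18 may be sketched in PyTorch as follows, assuming one-dimensional inputs; the convolution kernel size and channel widths are assumptions not taken from the disclosure.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # Convolution, then batch normalization, then SiLU activation
    # (Examples 11-14).
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.SiLU(),
    )

def encoder_group(in_ch: int, out_ch: int) -> nn.Sequential:
    # Two convolutional layers (Example 17) followed by a MaxPool with
    # kernel size 2 and stride 2 (Examples 15-16); three such groups
    # form the encoder (Example 18), halving the scale at each group.
    return nn.Sequential(
        conv_block(in_ch, out_ch),
        conv_block(out_ch, out_ch),
        nn.MaxPool1d(kernel_size=2, stride=2),
    )
```

In a full U-Net, the pre-pooling outputs of each group would also be retained for the same-scale concatenations of Example 30.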
Example 19 includes the subject matter of Example 9, and further specifies that the U-Net further includes a decoder, wherein the decoder receives the feature maps having the plurality of scales from the encoder.
Example 20 includes the subject matter of Example 19, and further specifies that the U-Net further includes a Convolutional Block Attention Module (CBAM), wherein the CBAM is applied to outputs of a last pooling layer of the encoder and provides refined feature maps to a first layer of the decoder.
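By way of illustration and not limitation, a one-dimensional adaptation of the CBAM of Example 20 may be sketched as follows; the reduction ratio and spatial kernel size follow the original CBAM formulation (Woo et al., 2018) and are assumptions here.

```python
import torch
import torch.nn as nn

class CBAM1d(nn.Module):
    """Channel attention followed by spatial attention over 1-D feature maps."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: a shared MLP over average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: a convolution over pooled channel statistics.
        self.spatial = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length), e.g., the last pooling layer's outputs.
        avg = self.mlp(x.mean(dim=2))               # (batch, channels)
        mx = self.mlp(x.amax(dim=2))                # (batch, channels)
        ca = torch.sigmoid(avg + mx).unsqueeze(2)   # channel attention weights
        x = x * ca
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)  # (batch, 2, length)
        sa = torch.sigmoid(self.spatial(pooled))    # spatial attention weights
        return x * sa                               # refined feature maps for the decoder
```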
Example 21 includes the subject matter of Example 19, and further specifies that the decoder further includes a second plurality of layer groups, wherein each layer group includes a convolution transpose layer.
Example 22 includes the subject matter of Example 21, and further specifies that the convolution transpose layer applies a plurality of transposed convolutions to input channels provided to the convolution transpose layer to produce a plurality of upsampled feature maps.
Example 23 includes the subject matter of Example 21, and further specifies that each layer group of the decoder further includes one or more convolutional layers.
Example 24 includes the subject matter of Example 23, and further specifies that the convolutional layer applies a plurality of convolutions to input channels provided to the convolutional layer to produce a plurality of feature maps.
Example 25 includes the subject matter of Example 24, and further specifies that the convolutional layer further includes a batch normalization applied to the plurality of feature maps to produce normalized feature maps.
Example 26 includes the subject matter of Example 25, and further specifies that the convolutional layer further includes applying an activation function to the normalized feature maps to produce output feature maps for output channels of the convolutional layer.
Example 27 includes the subject matter of Example 26, and further specifies that the activation function includes a sigmoid linear unit function (SiLU).
Example 28 includes the subject matter of Example 23, and further specifies that a number of the convolutional layers is two.
Example 29 includes the subject matter of Example 23, and further specifies that a first convolutional layer of the layer group of the decoder receives output channels from the convolution transpose layer of the layer group.
Example 30 includes the subject matter of Example 23, and further includes concatenating feature maps from a layer group of the encoder with feature maps from the convolution transpose layer to form concatenated feature maps, wherein the feature maps from the layer group of the encoder and the feature maps from the convolution transpose layer have a same scale.
Example 31 includes the subject matter of Example 30, and further includes applying a first convolutional layer of the layer group to the concatenated feature maps.
Example 32 includes the subject matter of Example 23, and further specifies that a second convolutional layer of the layer group of the decoder receives output channels from a first convolutional layer of the layer group.
Example 33 includes the subject matter of Example 21, and further specifies that a number of layer groups in the second plurality of layer groups is three.
Example 34 includes the subject matter of Example 21, and further includes applying a plurality of convolutions to a plurality of outputs of a last layer group of the second plurality of layer groups to produce the plurality of signal correction values.
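By way of illustration and not limitation, a decoder layer group consistent with Examples 21-33 and the output head of Example 34 may be sketched as follows; channel widths and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderGroup(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        # Transposed convolutions upsample the feature maps (Example 22).
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)
        # The first convolutional layer consumes the concatenated maps
        # (Examples 30-31); the second consumes the first's outputs
        # (Example 32). Each is convolution + batch norm + SiLU
        # (Examples 24-27).
        self.convs = nn.Sequential(
            nn.Conv1d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.SiLU(),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)
        # Concatenate same-scale encoder feature maps (Example 30).
        x = torch.cat([x, skip], dim=1)
        return self.convs(x)

# Output head (Example 34): a final convolution maps the outputs of the
# last of the three decoder groups (Example 33) to the signal correction
# values of Example 1; the 16-channel width is an assumption.
output_head = nn.Conv1d(16, 1, kernel_size=1)
```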
Example 35 includes the subject matter of Example 1, and further specifies that the plurality of signal measurements are provided by a nucleic acid sequencing instrument.
Example 36 is a system for correcting signal measurements, including: a machine-readable memory; and a processor configured to execute machine-readable instructions, which, when executed by the processor, cause the system to perform a method, including: providing a plurality of signal measurements to a channel of an input layer to an artificial neural network (ANN), wherein the input layer includes one or more channels; applying the ANN to the plurality of signal measurements to generate a plurality of signal correction values; subtracting the plurality of signal correction values from the plurality of signal measurements to form a plurality of corrected signal measurements; and applying base calling to the plurality of corrected signal measurements to produce a sequence of base calls.
Example 37 includes the subject matter of Example 36, and further specifies that the input layer further includes a channel for a plurality of simulated signal measurements, wherein the plurality of simulated signal measurements corresponds to the plurality of signal measurements.
Example 38 includes the subject matter of Example 36, and further specifies that the input layer further includes a channel for representing a flow order corresponding to nucleotides flowed, wherein the plurality of signal measurements was detected in response to the nucleotides flowed in the flow order.
Example 39 includes the subject matter of Example 38, and further specifies that the flow order is represented by four binary arrays in four channels of the input layer, wherein a 1 in a position in the array indicates that a particular nucleotide was flowed in that position in the flow order to generate the corresponding signal measurement, wherein flow orders for nucleotides A, T, C, and G are each represented in a respective one of the arrays.
Example 40 includes the subject matter of Example 36, and further specifies that the input layer further includes a channel for an array of values indicating positions of the plurality of signal measurements.
Example 41 includes the subject matter of Example 36, and further specifies that the input layer further includes a channel for a signal mask, wherein the signal mask includes an array having 1's in positions corresponding to the plurality of signal measurements and 0's in positions corresponding to no signal measurement, wherein a number of signal measurements is less than or equal to a size of the array.
Example 42 includes the subject matter of Example 36, and further specifies that the ANN includes a convolutional neural network (CNN).
Example 43 includes the subject matter of Example 42, and further specifies that the CNN includes a U-Net.
Example 44 includes the subject matter of Example 43, and further specifies that the U-Net includes an encoder configured to receive the channels of the input layer and to generate feature maps at a plurality of scales.
Example 45 includes the subject matter of Example 44, and further specifies that the encoder further includes a plurality of layer groups, wherein each layer group includes one or more convolutional layers.
Example 46 includes the subject matter of Example 45, and further specifies that the convolutional layer applies a plurality of convolutions to input channels provided to the convolutional layer to produce a plurality of feature maps.
Example 47 includes the subject matter of Example 46, and further specifies that the convolutional layer further includes a batch normalization applied to the plurality of feature maps to produce normalized feature maps.
Example 48 includes the subject matter of Example 47, and further specifies that the convolutional layer further includes applying an activation function to the normalized feature maps to produce output feature maps for output channels of the convolutional layer.
Example 49 includes the subject matter of Example 48, and further specifies that the activation function includes a sigmoid linear unit function (SiLU).
Example 50 includes the subject matter of Example 45, and further specifies that each layer group further includes a pooling layer, wherein the pooling layer receives output channels from a last convolutional layer in the one or more convolutional layers.
Example 51 includes the subject matter of Example 50, and further specifies that the pooling layer applies a MaxPool operation having a kernel size of 2 and stride value of 2 to each output channel.
Example 52 includes the subject matter of Example 45, and further specifies that a number of the convolutional layers is two.
Example 53 includes the subject matter of Example 45, and further specifies that a number of layer groups in the plurality of layer groups is three.
Example 54 includes the subject matter of Example 44, and further specifies that the U-Net further includes a decoder, wherein the decoder receives the feature maps having the plurality of scales from the encoder.
Example 55 includes the subject matter of Example 54, and further specifies that the U-Net further includes a Convolutional Block Attention Module (CBAM), wherein the CBAM is applied to outputs of a last pooling layer of the encoder and provides refined feature maps to a first layer of the decoder.
Example 56 includes the subject matter of Example 54, and further specifies that the decoder further includes a second plurality of layer groups, wherein each layer group includes a convolution transpose layer.
Example 57 includes the subject matter of Example 56, and further specifies that the convolution transpose layer applies a plurality of transposed convolutions to input channels provided to the convolution transpose layer to produce a plurality of upsampled feature maps.
Example 58 includes the subject matter of Example 56, and further specifies that each layer group of the decoder further includes one or more convolutional layers.
Example 59 includes the subject matter of Example 58, and further specifies that the convolutional layer applies a plurality of convolutions to input channels provided to the convolutional layer to produce a plurality of feature maps.
Example 60 includes the subject matter of Example 59, and further specifies that the convolutional layer further includes a batch normalization applied to the plurality of feature maps to produce normalized feature maps.
Example 61 includes the subject matter of Example 60, and further specifies that the convolutional layer further includes applying an activation function to the normalized feature maps to produce output feature maps for output channels of the convolutional layer.
Example 62 includes the subject matter of Example 61, and further specifies that the activation function includes a sigmoid linear unit function (SiLU).
Example 63 includes the subject matter of Example 58, and further specifies that a number of the convolutional layers is two.
Example 64 includes the subject matter of Example 58, and further specifies that a first convolutional layer of the layer group of the decoder receives output channels from the convolution transpose layer of the layer group.
Example 65 includes the subject matter of Example 58, and further includes concatenating feature maps from a layer group of the encoder with feature maps from the convolution transpose layer to form concatenated feature maps, wherein the feature maps from the layer group of the encoder and the feature maps from the convolution transpose layer have a same scale.
Example 66 includes the subject matter of Example 65, and further includes applying a first convolutional layer of the layer group to the concatenated feature maps.
Example 67 includes the subject matter of Example 58, and further specifies that a second convolutional layer of the layer group of the decoder receives output channels from a first convolutional layer of the layer group.
Example 68 includes the subject matter of Example 56, and further specifies that a number of layer groups in the second plurality of layer groups is three.
Example 69 includes the subject matter of Example 56, and further includes applying a plurality of convolutions to a plurality of outputs of a last layer group of the second plurality of layer groups to produce the plurality of signal correction values.
Example 70 includes the subject matter of Example 36, and further specifies that the plurality of signal measurements are provided by a nucleic acid sequencing instrument.
Example 71 is a non-transitory machine-readable storage medium including instructions which, when executed by a processor, cause the processor to perform a method for correcting signal measurements, including: providing a plurality of signal measurements to a channel of an input layer to an artificial neural network (ANN), wherein the input layer includes one or more channels; applying the ANN to the plurality of signal measurements to generate a plurality of signal correction values; subtracting the plurality of signal correction values from the plurality of signal measurements to form a plurality of corrected signal measurements; and applying base calling to the plurality of corrected signal measurements to produce a sequence of base calls.
Example 72 includes the subject matter of Example 71, and further specifies that the input layer further includes a channel for a plurality of simulated signal measurements, wherein the plurality of simulated signal measurements corresponds to the plurality of signal measurements.
Example 73 includes the subject matter of Example 71, and further specifies that the input layer further includes a channel for representing a flow order corresponding to nucleotides flowed, wherein the plurality of signal measurements was detected in response to the nucleotides flowed in the flow order.
Example 74 includes the subject matter of Example 73, and further specifies that the flow order is represented by four binary arrays in four channels of the input layer, wherein a 1 in a position in the array indicates that a particular nucleotide was flowed in that position in the flow order to generate the corresponding signal measurement, wherein flow orders for nucleotides A, T, C, and G are each represented in a respective one of the arrays.
Example 75 includes the subject matter of Example 71, and further specifies that the input layer further includes a channel for an array of values indicating positions of the plurality of signal measurements.
Example 76 includes the subject matter of Example 71, and further specifies that the input layer further includes a channel for a signal mask, wherein the signal mask includes an array having 1's in positions corresponding to the plurality of signal measurements and 0's in positions corresponding to no signal measurement, wherein a number of signal measurements is less than or equal to a size of the array.
Example 77 includes the subject matter of Example 71, and further specifies that the ANN includes a convolutional neural network (CNN).
Example 78 includes the subject matter of Example 77, and further specifies that the CNN includes a U-Net.
Example 79 includes the subject matter of Example 78, and further specifies that the U-Net includes an encoder configured to receive the channels of the input layer and to generate feature maps at a plurality of scales.
Example 80 includes the subject matter of Example 79, and further specifies that the encoder further includes a plurality of layer groups, wherein each layer group includes one or more convolutional layers.
Example 81 includes the subject matter of Example 80, and further specifies that the convolutional layer applies a plurality of convolutions to input channels provided to the convolutional layer to produce a plurality of feature maps.
Example 82 includes the subject matter of Example 81, and further specifies that the convolutional layer further includes a batch normalization applied to the plurality of feature maps to produce normalized feature maps.
Example 83 includes the subject matter of Example 82, and further specifies that the convolutional layer further includes applying an activation function to the normalized feature maps to produce output feature maps for output channels of the convolutional layer.
Example 84 includes the subject matter of Example 83, and further specifies that the activation function includes a sigmoid linear unit function (SiLU).
Example 85 includes the subject matter of Example 80, and further specifies that each layer group further includes a pooling layer, wherein the pooling layer receives output channels from a last convolutional layer in the one or more convolutional layers.
Example 86 includes the subject matter of Example 85, and further specifies that the pooling layer applies a MaxPool operation having a kernel size of 2 and stride value of 2 to each output channel.
Example 87 includes the subject matter of Example 80, and further specifies that a number of the convolutional layers is two.
Example 88 includes the subject matter of Example 80, and further specifies that a number of layer groups in the plurality of layer groups is three.
Example 89 includes the subject matter of Example 79, and further specifies that the U-Net further includes a decoder, wherein the decoder receives the feature maps having the plurality of scales from the encoder.
Example 90 includes the subject matter of Example 89, and further specifies that the U-Net further includes a Convolutional Block Attention Module (CBAM), wherein the CBAM is applied to outputs of a last pooling layer of the encoder and provides refined feature maps to a first layer of the decoder.
Example 91 includes the subject matter of Example 89, and further specifies that the decoder further includes a second plurality of layer groups, wherein each layer group includes a convolution transpose layer.
Example 92 includes the subject matter of Example 91, and further specifies that the convolution transpose layer applies a plurality of transposed convolutions to input channels provided to the convolution transpose layer to produce a plurality of upsampled feature maps.
Example 93 includes the subject matter of Example 91, and further specifies that each layer group of the decoder further includes one or more convolutional layers.
Example 94 includes the subject matter of Example 93, and further specifies that the convolutional layer applies a plurality of convolutions to input channels provided to the convolutional layer to produce a plurality of feature maps.
Example 95 includes the subject matter of Example 94, and further specifies that the convolutional layer further includes a batch normalization applied to the plurality of feature maps to produce normalized feature maps.
Example 96 includes the subject matter of Example 95, and further specifies that the convolutional layer further includes applying an activation function to the normalized feature maps to produce output feature maps for output channels of the convolutional layer.
Example 97 includes the subject matter of Example 96, and further specifies that the activation function includes a sigmoid linear unit function (SiLU).
Example 98 includes the subject matter of Example 93, and further specifies that a number of the convolutional layers is two.
Example 99 includes the subject matter of Example 93, and further specifies that a first convolutional layer of the layer group of the decoder receives output channels from the convolution transpose layer of the layer group.
Example 100 includes the subject matter of Example 93, and further includes concatenating feature maps from a layer group of the encoder with feature maps from the convolution transpose layer to form concatenated feature maps, wherein the feature maps from the layer group of the encoder and the feature maps from the convolution transpose layer have a same scale.
Example 101 includes the subject matter of Example 100, and further includes applying a first convolutional layer of the layer group to the concatenated feature maps.
Example 102 includes the subject matter of Example 93, and further specifies that a second convolutional layer of the layer group of the decoder receives output channels from a first convolutional layer of the layer group.
Example 103 includes the subject matter of Example 91, and further specifies that a number of layer groups in the second plurality of layer groups is three.
Example 104 includes the subject matter of Example 91, and further includes applying a plurality of convolutions to a plurality of outputs of a last layer group of the second plurality of layer groups to produce the plurality of signal correction values.
Example 105 includes the subject matter of Example 71, and further specifies that the plurality of signal measurements are provided by a nucleic acid sequencing instrument.
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 63/338,703, filed May 5, 2022, and U.S. Provisional Application No. 63/338,810, filed May 5, 2022. The entire contents of the aforementioned applications are incorporated by reference herein.
Number | Date | Country
---|---|---
63338810 | May 2022 | US
63338703 | May 2022 | US