Embodiments of the present invention refer to a method for performing the normalization of the intensity values obtained to perform sequencing analysis. Additional embodiments refer to a corresponding computer program and the corresponding apparatus.
Next generation sequencing (NGS) characterizes the sequence of millions of DNA molecules in parallel. To this end, millions of DNA molecules are immobilized randomly at different positions on an imaging surface and copied to form local clusters of clonal DNA molecules. Sequencing of these template molecules is performed by synthesis of complementary DNA, incorporating fluorescently labeled nucleotides, with one of four distinct fluorescent dyes for each specific nucleotide (A, C, G, and T). Specific sequencing chemistry ensures that each sequencing iteration (cycle) incorporates only one nucleotide at a time. For each cycle the sequencing apparatus (sequencer) takes images with four distinct filter settings (channels), one for each nucleotide-specific wavelength defined by the respective fluorescent dye. The entire sequencing run is therefore represented by a set of n×c images, where n equals the number of channels (typically 4) and c equals the number of cycles.
This set-up enables deduction of the nucleotide sequence of the template DNA molecules from the sequence of images acquired for all channels for each sequencing cycle: given that the position of a template DNA cluster is known, the intensity profile over all channels at this given position allows deduction of the incorporated fluorescently labeled nucleotide (base-calling). In theory, non-zero intensities should be detectable in only one channel, namely the channel which represents the nucleotide present in the template DNA. A variety of factors including imaging and sequencing noise result in a deviation from this optimal case: non-zero intensity values are typically observed for all channels. Performing base-calling for all sequencing cycles for a given position allows deduction of the full nucleotide sequence (read).
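The base-calling idea described above can be illustrated with a minimal sketch. It assumes the simplest conceivable rule, namely that the channel with the highest intensity determines the base; the names `CHANNELS`, `call_base` and `call_read` and the intensity values are hypothetical and not taken from the description.

```python
# Hypothetical sketch: the channel order and the "highest intensity wins"
# rule are assumptions made for illustration only.
CHANNELS = ("A", "C", "G", "T")


def call_base(intensities):
    """Return the base whose channel shows the highest intensity.

    `intensities` holds one value per channel, in the order of CHANNELS.
    """
    if len(intensities) != len(CHANNELS):
        raise ValueError("expected one intensity per channel")
    best = max(range(len(CHANNELS)), key=lambda i: intensities[i])
    return CHANNELS[best]


def call_read(profiles):
    """Call one base per cycle and join the calls into a read."""
    return "".join(call_base(p) for p in profiles)


# One cluster position observed over three cycles; no channel is exactly zero.
profiles = [
    (0.9, 0.1, 0.2, 0.1),
    (0.1, 0.2, 0.8, 0.3),
    (0.2, 0.1, 0.1, 0.7),
]
print(call_read(profiles))
```

Because such a rule compares raw values across channels, any channel-specific offset or scale directly shifts the decision, which is the bias problem addressed in the following.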
As indicated above, the signal distribution among the channels determines base-calling. However, deviations from the optimal case (i.e. only one channel is characterized by non-zero intensity values) can be ascribed to two components, namely bias and noise.
While the first can be described as a systematic offset from zero, the latter can be described as a fluctuation around this offset from sample to sample. Importantly, these components can be channel-specific and thus may lead to biased base-calling. Several channel-specific, bias-introducing phenomena are known and are typically corrected in the algorithm chain leading to base-calling. These corrections include background correction and crosstalk correction. However, only known phenomena can be corrected using such a model-based approach. Therefore, there is a need for an improved approach.
It is an objective of the present invention to provide a concept for base-calling, or especially for the post-image analysis used in base-calling procedures, that is less affected by bias and noise effects.
This objective is solved by the subject matter of the independent claims.
Embodiments of the present invention provide a method for performing a normalization of intensity values obtained to perform a sequencing analysis. The method comprises the four basic steps of “receiving a plurality of image data for a plurality of channels”, “parametrization of an intensity distribution over all or a subset of all positions for the plurality of channels”, “combining parametrized distributions for the plurality of channels” and “determining for each of the plurality of channels a transfer function”. The image data comprise a plurality of images (e.g. generated by a sequencing apparatus). Each received image describes for the respective channel, for example for the four channels A, C, G, and T (belonging to four bases), an intensity distribution in terms of a spatial light density distribution over all positions or at least over a subset of all positions of the respective image. The intensity distribution is parametrized to obtain a parametrized distribution for each of the plurality of channels. Starting from the parametrized distributions for the plurality of channels, the parametrized distributions are combined with the aim of obtaining a common distribution for all or for at least two of the plurality of channels. The last step of determining the transfer function for each channel is performed such that the respective transfer function for the respective channel maps the corresponding intensity distribution to the common distribution.
Teachings disclosed herein are based on the finding that the intensity distribution is typically channel-specific, leading to systematic differences between the channels. However, since the A-, C-, G-, and T-base-calls should be randomly and equally distributed over the entire image, the overall intensity levels of all channels should be comparable to each other. In order to maintain this comparability, all or the relevant channels can be adapted with regard to their intensity level. This is achieved by collapsing each channel-specific intensity distribution into one common distribution and determining corresponding correction factors for the respective channel such that the channel can afterwards be mapped to the (collapsed) common distribution. The usage of the correction factor for the respective channel produces unbiased intensity profiles over all channels, thereby leading to unbiased base-calling (without or with reduced systematic differences). Here, it is beneficial that the correction of the base-calling can be performed without knowing the exact phenomenon leading to the bias.
According to another embodiment, the step of parametrization is performed using the substep of maximum-likelihood estimation, maximum-a-posteriori estimation, determining a summary statistic for one or more or all channels, or determining a specific parameter set describing an intensity distribution (e.g. the maxima and minima of said distribution) for one or more or all channels. These estimation or determination procedures beneficially make it possible to parametrize different intensity distributions such that they are comparable to each other.
According to embodiments a distinction between two types of intensity distributions is made, namely non-normal distributions and normal distributions. Commonly, the type of intensity distribution has an impact on the type of the transfer function, so that typically a non-linear transformation is used for non-normal distributions, whereas a linear transformation is used for normal distributions. For example, the parametrized distribution for one or more or all of the plurality of channels is described by a Gaussian distribution with a distinct mean and a distinct standard deviation. Here, the respective transfer function for the respective channel can be derived from the Gaussian distribution and comprises the mathematical operations of subtracting the mean from the corresponding intensity distribution and/or dividing the corresponding intensity distribution by the standard deviation. According to another embodiment, transfer functions like a log-transformation, a square-root transformation, a Box-Cox transformation or a Yeo-Johnson transformation can be used for non-normal intensity distributions. Alternatively, another transformation that transforms a non-normal distribution into an approximately normal distribution can be applied. Since the intensity distributions for different sequencing analysis techniques or procedures vary with regard to their type, it is beneficial to have different transformation approaches for handling the different types of intensity distributions, such that the normalization process can be applied to each use case.
Although the above embodiments suggest performing the normalization over the different channels within one cycle of a sequencing analysis, it should be noted that according to further embodiments the parametrization, combining and determining are performed for a plurality of cycles. In this case, the method comprises the step of determining the transfer functions such that they comprise smoothing functions. Here, smoothing means that the transfer function smooths the intensity distribution over the multiple cycles such that there are no jumps from cycle to cycle, i.e. the smoothing function is selected such that it limits the maximum change of the intensity value over the number of cycles and/or such that the intensity values within at least two subsequent cycles remain on a constant level. This case depends on the assumption that channel-specific distributions of subsequent cycles are similar and thus their characterization parameters (i.e. distribution parameters or summary statistics) are correlated. According to another variant, the normalization over a plurality of cycles may be performed such that a normalization over all cycles of the analysis is achieved. Here, the respective transfer functions for the respective channels comprise a normalization function by which the intensity values of the plurality of positions are normalized over the number of cycles for all channels and/or by which the intensity values remain on an average constant level over the number of cycles. This approach makes it possible to correct effects like an intensity drift. However, in case there is a trend which causes decreasing or increasing intensity values over the number of cycles, this approach may have a negative effect on the sequencing analysis. Therefore, the normalization function is determined only if no trend of the one or more channels over the number of cycles is expected or detected.
This trend detection may be performed based on the image data before applying the transfer function to the plurality of channels and cycles.
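The description does not state how a trend is detected. As one hedged possibility, the sketch below fits a least-squares line to the per-cycle mean intensity of a channel and flags a trend when the slope exceeds a fraction of the average level. The function names and the threshold `rel_threshold` are illustrative assumptions.

```python
def linear_slope(values):
    """Least-squares slope of values against their index (the cycle number)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two cycles")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def has_trend(cycle_means, rel_threshold=0.01):
    """Flag a trend when the change per cycle exceeds a fraction of the level.

    rel_threshold is an illustrative choice, not a value from the description.
    """
    level = sum(cycle_means) / len(cycle_means)
    slope = linear_slope(cycle_means)
    if level == 0:
        return slope != 0
    return abs(slope) > rel_threshold * abs(level)
```

Under this sketch, a normalization to a constant level over all cycles would only be chosen for channels where `has_trend` returns `False`.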
Of course, it is possible to perform the normalization (parametrization, combining and determining of the common target distribution) over the different channels and different cycles. Thus, according to another embodiment the basic method may be performed differently, namely such that the steps of receiving a plurality of image data and parametrization of the intensity distribution are performed for a plurality of channels and cycles, wherein the steps of combining and determining are performed for the plurality of cycles. Expressed in other words, this means that the transfer function enables a mapping over a plurality of cycles instead of a mapping over a plurality of channels. Here, the principles discussed in the context of the smoothing function may be applied for at least one channel. As described above, the combination of normalizing the intensity distribution over cycles and channels is also possible.
The above embodiments start from the assumption that preferably (but not necessarily) all of the channels are mapped to a common distribution function. However, especially in case when a smoothing over a plurality of cycles is performed, it may be beneficial that a common distribution function is determined for each channel or for at least two channels independently. Here, the common distribution function for all of the plurality of channels may be described by a set of functions comprising for each channel a common distribution function, wherein at least two of the common distribution functions differ from each other. Each respective transfer function is determined such that the respective transfer function for a respective channel maps the corresponding intensity distribution to the corresponding common distribution function of the channel. As mentioned above, this approach having a plurality of common distribution functions is selected, if a smoothing over a plurality of cycles is performed. Here, the plurality of common distribution functions may be used beneficially if a trend of one or more channels over the number of cycles is expected or detected.
According to another embodiment, a computer program having a program code for performing one of the above methods or method steps is provided.
According to another embodiment, an apparatus for performing a normalization of intensity values obtained to perform a sequencing analysis is provided. Here, the apparatus comprises an interface and a processor. The interface receives the plurality of image data, wherein the processor performs the parametrization, the combination and the determining of the transfer functions.
Embodiments of the present invention will subsequently be discussed referring to the enclosed figures, wherein:
Below, embodiments will be discussed in detail referring to the enclosed figures. Here, identical reference numerals are provided to elements or method steps having similar or identical functions, so that the description thereof is mutually applicable and interchangeable.
This is exemplarily illustrated by
Typically a distinction is made between four nucleotides (A, C, G and T), such that four distinguishable fluorescent dyes are used. Each dye can be detected using its own channel. The channels can be analyzed using different filter settings, so that a plurality of images 11a to 11d has to be analyzed during one cycle. Each cycle refers to a sequencing iteration (cf. cycle 11, cycle 12, cycle 13, etc., cycle 15, etc.). The background is that during each sequencing iteration/cycle 11 to 15, just one nucleotide can be detected. Just for the sake of completeness, it should be noted that between the single cycles 11 to 15 a procedure comprising cleaving the fluorescent dye and extending the sequencing primer is performed such that a new base/nucleotide can be incorporated.
Since a set of n images (11a-11d) is required for each sequencing cycle 11 to 15, the entire sequencing run is represented by a set of n×c images 11a to 15d, where n equals the number of channels (typically four) and c equals the number of cycles (here five). Expressed in other words, this means that each sequencing run has two main dimensions, namely the dimension defined by the number of cycles and the dimension defined by the number of channels.
As indicated above, the signal distribution among the channels a-d (A, C, G and T) determines base-calling. In the optimal case just one channel is characterized by non-zero intensity values. This optimal case is shown by
However, in reality the measured intensity signal is distorted by biasing and noise components. Such biased intensity values are illustrated by
The method 100 makes it possible to remove the channel-specific contribution to bias and/or noise by performing a normalization. Here, the data-driven approach makes it possible to normalize intensity values over all channels a to d (preferably within one cycle 11, 12, etc. or 15). Within the first step 110 the plurality of image data for the plurality of channels, i.e., for the example of
The intensity distributions over the relevant positions of the images 11a to 11d, or of at least two images, are parametrized in order to obtain a parametrized distribution for the plurality of channels. This step is marked by the reference numeral 120. The parametrized distribution can, for example, be a summary statistic or can be described using a specific parameter, e.g., the mean. Therefore, the step 120 can comprise the sub-step of determining a significant parameter or parameter set describing the behavior of the respective channel, like the summary statistic. Alternatively, a maximum-likelihood estimation or a maximum-a-posteriori estimation can be used.
Here it should be noted that the intensity distribution over all positions and/or the parametrized distributions describe, for example, the number of counts in correlation to respective intensity values for the respective channel of the plurality of channels a-d (cf.
Within the next step 130, the plurality of the parametrized distributions for all or at least two channels a to d (A, C, G and T) are combined such that a common distribution for all or the relevant channels within the first cycle can be obtained. For example, the common distribution may be an average of all or all relevant parametrized distributions.
In the context of the common distribution it should be noted that the common distribution (for at least two, all relevant or all channels a-d) represents an average of at least two, all relevant or all parametrized distributions/intensity distributions of the plurality of channels. For example, the common distribution may describe an averaged number of counts in correlation to averaged intensity values for at least two, all relevant or all channels a-d (cf.
Starting from the common distribution, the intensity values of the respective channels 11a to 11d can be mapped to the common distribution using a respective transfer function for each channel 11a to 11d. This step is marked by the reference numeral 140. The respective transfer functions can be used for filtering or normalizing the images 11a to 11d. Since the respective transfer function is determined based on all or at least all relevant positions (a sub-set of all positions) of the respective channel 11a to 11d, the transfer function causes the channels 11a to 11d to show, on average, the same behavior, so that channel-specific effects can be avoided or eliminated.
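The sequence of parametrization, combination and determination of the transfer functions can be sketched as follows. This is a minimal illustration under stated assumptions: each channel is summarized by its mean and standard deviation, the common distribution is taken as the plain average of these parameters, and the transfer function is linear. All function names are hypothetical.

```python
from statistics import fmean, pstdev


def parametrize(intensities):
    """Parametrization: describe one channel by mean and standard deviation."""
    return fmean(intensities), pstdev(intensities)


def combine(params):
    """Combination: average the channel parameters into a common distribution.

    Averaging the standard deviations is an assumption of this sketch.
    """
    means, sds = zip(*params)
    return fmean(means), fmean(sds)


def make_transfer(channel_params, common_params):
    """Determination: linear map from the channel distribution to the common one."""
    mu, sd = channel_params
    mu_common, sd_common = common_params
    if sd == 0:
        raise ValueError("channel has zero spread")
    return lambda x: (x - mu) / sd * sd_common + mu_common


def normalize_channels(channels):
    """Run all steps on a dict mapping channel name to a list of intensities."""
    params = {name: parametrize(vals) for name, vals in channels.items()}
    common = combine(list(params.values()))
    transfers = {name: make_transfer(p, common) for name, p in params.items()}
    # Optional application of the transfer functions to the intensity values.
    return {name: [transfers[name](x) for x in vals]
            for name, vals in channels.items()}
```

For example, `normalize_channels({"A": [1.0, 2.0, 3.0], "C": [10.0, 20.0, 30.0]})` returns two channels with the same mean and spread, although channel C was ten times brighter before.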
Example: in a basic case in which one channel merely has a substantially higher brightness than another channel, the determined transfer function dims the entire bright channel, i.e. each intensity value of every single position within the channel. Due to the dimming (application of the channel-specific transfer function), the intensity values belonging to the single positions within the channels are more comparable than without applying the transfer function or normalization procedure.
If now, the image analysis of the respective channel 11a to 11d or, in more detail, of each relevant position (sub-set of all positions) or at least of one position of the respective channel 11a to 11d, is performed while applying the respective transfer function to the intensity value, the intensity values of the different channels 11a to 11d are distinguishable from each other. This is illustrated by
The application of the channel specific transfer function to the intensity values is an optional step which is marked by the reference numeral 150. The usage of the method steps 110 to 140 enables determination of the transfer functions, wherein the performing of all method steps 110 to 150 enables the normalization of the channels within one cycle. Therefore, starting from a basic approach, the step 150 is an optional step.
Although in embodiments the normalization has been described in context of normalizing different channels to each other, the normalization can also additionally or alternatively be performed such that one or more channels are normalized over the plurality of cycles. This approach is illustrated by
With respect to
By the usage of the channel-specific transfer functions, the observed intensities over all positions 40O1 and 40O2 can be distorted such that they follow a common distribution. This means that the goal of the approach is to collapse all channel-specific intensity distributions into one common distribution, thereby producing unbiased intensity profiles over all channels and leading to unbiased base-calling. This collapse may be done by means of the channel-specific transfer functions determined using the method 100. The result is shown by
Here, the observed intensity values of all positions within the respective channels are distorted using the transfer function, such that the intensity distributions 40O1′ and 40O2′ are achieved.
Starting from the normalization between two or more channels, a preferred way to determine the respective transfer functions enabling collapse of all channel-specific intensity distribution into one distribution will be discussed. For this, the observed intensity distributions have to be characterized.
In order to collapse all channel-specific intensity distributions into one common distribution, the observed intensity distributions have to be characterized. Distribution characterization can be classified into two approaches: i) parametrization of a defined distribution and ii) characterization of an undefined distribution. The first case applies if the phenomena leading to noise and signal are known and the resulting family of observed distributions can be derived. In this case parametrization can be performed by probabilistic modeling and standard procedures like maximum-likelihood estimation or maximum-a-posteriori estimation. If no probabilistic model of the observed intensity distribution is known, characterization may occur via summary statistics. Depending on the complexity of the underlying distribution and on how distinct the channel-specific distributions are, this can be performed by a single summary statistic or a combination of summary statistics. Commonly applicable summary statistics include mean, mode, standard deviation (SD) as well as order statistics.
Given the set of characterized intensity distributions for all channels, the second step is to find a common distribution that all channel-specific distributions can be collapsed to. This is performed by channel-specific transformations of the data such that the resulting intensity distributions follow a distribution characterized by specified parameters or summary statistics. The nature of the transformation depends on the underlying distribution and may be a simple linear transformation, a non-linear transformation, a set of linear or non-linear transformations or a combination of both sets. In the simplest case, intensity distribution may be described by a Gaussian distribution. In this case channel specific distributions can be collapsed by normalizing the Gaussian distribution by subtraction of the mean and division by SD.
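The Gaussian case can be written out as a short worked equation. The notation is assumed for illustration: x is an observed intensity in channel k, and μ_k and σ_k are the mean and SD of that channel; the numerical values are made up.

```latex
% Assumed notation: x is an observed intensity in channel k,
% \mu_k and \sigma_k are the mean and SD of that channel.
T_k(x) = \frac{x - \mu_k}{\sigma_k},
\qquad
x \sim \mathcal{N}(\mu_k, \sigma_k^2)
\;\Longrightarrow\;
T_k(x) \sim \mathcal{N}(0, 1) \quad \text{for every channel } k.

% Worked arithmetic with made-up values:
% channel 1: \mu_1 = 120, \sigma_1 = 15, x = 150
% channel 2: \mu_2 = 60,  \sigma_2 = 10, x = 80
T_1(150) = \frac{150 - 120}{15} = 2,
\qquad
T_2(80) = \frac{80 - 60}{10} = 2.
```

Both raw values lie two standard deviations above their channel mean, so after the transformation they become directly comparable even though the raw intensities differ by almost a factor of two.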
Analogously, instead of or in addition to the intensity distribution normalization between the plurality of channels, a cycle-specific normalization can be used.
The intensity normalization approach described above may be applied to all cycles or for each cycle individually. The latter case may be desirable to correct for differences introduced as a function of sequencing cycle (i.e. drift). Drift may occur for example in the case of cycle specific sequencing configuration (i.e. adapted imaging or chemistry) or due to decaying or increasing performance. The cycle-specific normalization can be performed as described above. Alternatively, the cycle context can be used to smooth normalization transformations. This case depends on the assumption that channel-specific distributions of subsequent cycles are similar and thus their characterization parameters (i.e. distribution parameters or summary statistics) are correlated. For smoothing, channel specific distribution characterization is performed as described above for each cycle or groups of cycles. Smoothing may be performed by a variety of functions including sliding window approaches (mean, median, Gaussian filter, local model fitting) or model fitting on the entire cycle set (e.g. polynomials). To estimate the transformation model to the specified target distribution, the parameter estimated by the smoothing function is used instead of the parameter derived from the distribution characterization.
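One of the sliding window approaches mentioned above can be sketched as follows. The sketch smooths a per-cycle parameter (for example the channel mean) with a centred window; the median is used as the default statistic here because it tolerates a single outlier cycle. The function name, the window handling at the borders and the default window size are assumptions.

```python
from statistics import median


def smooth_sliding(values, window=3, stat=median):
    """Smooth a per-cycle parameter with a centred sliding window.

    The window is truncated at the first and last cycles. `stat` may be any
    function that reduces a list to one number, e.g. statistics.fmean.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    return [
        stat(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]
```

The smoothed value of each cycle would then replace the raw parameter of that cycle when the transformation to the target distribution is estimated.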
This approach will be discussed with respect to
In order to eliminate the systematic differences, the distributions can be normalized using the above-discussed approach according to which the plurality of channels is distorted using the transfer functions such that each channel is mapped to a common distribution. The transformation can be performed such that the general trend of the parameter versus cycle is retained.
This approach is illustrated by
Expressed in other words, this means that according to an embodiment, a normalization between the single channels together with a smoothing can be used. Due to the interchannel-normalization, the distinction between the same is improved wherein the smoothing enables avoidance of jumps between subsequent cycles.
According to another embodiment, a further approach normalizes the channels both with respect to the other channels and over all cycles such that the parameter profiles are stationary with respect to the plurality of cycles. This approach is illustrated by
Below, an enhanced example will be discussed. Let us assume that we measure intensities in 4 distinct channels and that for each cycle the channel-specific intensity values are distributed according to a Gaussian distribution with distinct mean, but identical SD. In this case the intensity-distribution normalization is simple: i) The channel-specific distribution for each cycle is characterized by the mean of the intensities. ii) Mean versus cycle is determined for each channel individually and smoothed using an appropriate method (e.g. windowed mean). iii) A target distribution for each cycle is determined. Let us further assume that the mean of intensities increases with respect to cycles with a similar function for all channels and that we want to retain this general trend. To retain the mean versus cycle trend, the mean of the target distribution is specified such that it ranges from 0 to 1 and follows the overall trend. iv) Transformation is performed by subtracting the smoothed observed channel mean and adding the mean of the target distribution.
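The four steps i) to iv) of this example can be sketched in code. The input layout (a dict mapping each channel to a list of per-cycle intensity lists), the function names, and the way the target mean is built (the average of the smoothed channel means, rescaled to the range 0 to 1) are assumptions of this sketch; the description only requires that the target mean ranges from 0 to 1 and follows the overall trend.

```python
from statistics import fmean


def windowed_mean(values, window=3):
    """Centred windowed mean, truncated at the first and last cycles."""
    half = window // 2
    return [fmean(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def rescale_01(values):
    """Rescale values linearly so that they range from 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def normalize_run(run, window=3):
    """run maps a channel name to a list (cycles) of intensity lists."""
    # i) characterize each channel and cycle by the mean intensity
    means = {ch: [fmean(c) for c in cycles] for ch, cycles in run.items()}
    # ii) smooth mean versus cycle for each channel individually
    smoothed = {ch: windowed_mean(m, window) for ch, m in means.items()}
    # iii) target mean per cycle: overall trend rescaled to [0, 1] (assumption)
    n_cycles = len(next(iter(smoothed.values())))
    overall = [fmean([s[c] for s in smoothed.values()]) for c in range(n_cycles)]
    target = rescale_01(overall)
    # iv) subtract the smoothed channel mean and add the target mean
    return {
        ch: [[x - smoothed[ch][c] + target[c] for x in cycles[c]]
             for c in range(n_cycles)]
        for ch, cycles in run.items()
    }
```

With two channels whose means differ by a constant offset but rise identically over the cycles, the output values of both channels coincide while the rising trend is retained.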
Other examples of collapsing channel intensity distribution may include normalizing Gaussian distributions with distinct mean and standard deviation via standardization (i.e. mean=0, SD=1) or transformation of non-normal distribution to approximate normal distributions (e.g. log-transformation, square-root transformation, Box-Cox transformation, Yeo-Johnson transformation) followed by the standardization of the resulting approximately normal distribution.
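The combination of a normalizing transformation followed by standardization can be sketched as follows. The Box-Cox transform is used with a fixed parameter `lam` supplied by the caller (lam = 0 gives the log-transformation, lam = 0.5 a shifted and scaled square root); estimating `lam` from the data is outside the scope of this sketch, and the function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value; lam = 0 gives the log."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


def standardize(values):
    """Shift and scale values to mean 0 and SD 1."""
    mu, sd = fmean(values), pstdev(values)
    if sd == 0:
        raise ValueError("values have zero spread")
    return [(v - mu) / sd for v in values]


def normalize_skewed(values, lam=0.0):
    """Transform towards normality first, then standardize the result."""
    return standardize([box_cox(v, lam) for v in values])
```

For intensities that can be zero or negative, for instance after background subtraction, the Box-Cox transform is not defined, which is the situation the Yeo-Johnson transformation is intended for.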
Although, referring to the above discussion, all embodiments have been described in the context of a method it should be noted that the idea may also be implemented as an apparatus. Here, all the above implementation aspects may also be used in context of the apparatus.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.
The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
Number | Date | Country | Kind |
---|---|---|---|
17177001.9 | Jun 2017 | EP | regional |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2018/066278 | 6/19/2018 | WO | 00 |