The present application is a non-provisional patent application claiming priority to European Patent Application No. 18248105.1, filed Dec. 27, 2018, the contents of which are hereby incorporated by reference.
This application contains a Sequence Listing submitted as an electronic text file named “19-2278_Sequences_ST25.txt” having a size of 521 bytes and originally created on Apr. 17, 2020. The information contained in this electronic file is hereby incorporated by reference in its entirety pursuant to 37 CFR § 1.52(e)(5).
Various example embodiments relate to processing of reads of a sequenced nucleic acid such as a DNA or RNA sequence. More particular, they relate to determining the number of occurrences of respective k-mers in the nucleic acid from the reads.
When sequencing a DNA sequence, the order of nucleotides in the DNA sequence is determined. As there are four different nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T), the result of the sequencing may be represented as a string composed with letters of the four letter alphabet A, C, G and T.
Illumina dye sequencing is a popular sequencing technique because it offers a high throughput and low sequencing cost. The output of the Illumina dye sequencing are short reads wherein a single read represents a short subsequence, for example 100 to 250 nucleotides, of the DNA sequence. The output does not contain positional information for the reads, (i.e., the location of the reads within the DNA sequence is unknown). Each nucleotide within the DNA sequence is typically covered by multiple fully or partially overlapping reads. The average number of reads that cover a nucleotide in the DNA sequence, the sequencing coverage, typically ranges from 30 to 50.
A problem with Illumina dye sequencing is that the reads contain sequencing errors, for example substitutions, insertions and deletions. The error rate is typically in the range of 1% to 2%, the vast majority being substitutions.
Different techniques have been developed to derive information on the DNA sequence from such error-containing reads. One such technique uses a directed graph representation called a de Bruijn graph. According to this technique, k-mers are first derived from the reads, for example all possible sequences with a length of k nucleotides appearing in the reads. Each unique k-mer is then represented as a node of the graph wherein each node is associated with the number of occurrences of the respective k-mer in the reads, the read support of the node. Similarly, the number of k-1 overlaps from one k-mer to another one, the read support of an arc, is associated with the arc between the respective k-mers. The read support of the nodes and the arcs is then used to infer the occurrences of the k-mers in the original DNA sequence, further referred to as the multiplicities of the nodes and arcs. This inference may be based on the so-called k-mer spectrum (i.e., a histogram of the number of k-mers as a function of their read support). From this spectrum, thresholds are derived for classifying the read support of the nodes into multiplicities.
A problem with the above technique is that the k-mer spectrum is a superposition of distributions corresponding to the different multiplicities. Because these distributions overlap, the multiplicities may be erroneously assigned. For example, an erroneous k-mer may be assigned a multiplicity of one because it occurs more often than average in the reads.
Embodiments of the present disclosure alleviate the above identified shortcomings by processing reads of a sequenced nucleic acid that results in fewer errors in the obtained multiplicities.
According to a first aspect, a computer-implemented method for processing reads of a sequenced nucleic acid comprising the following steps: i) determining k-mers from the reads; ii) determining first counts indicative for the respective number of occurrences of the respective k-mers in the reads, and second counts indicative for the respective number of overlaps between respective k-mers in the reads; iii) determining a k-mer spectrum from the first counts; iv) determining first output counts indicative for the respective number of occurrences of the respective k-mers in the nucleic acid based on a position of the k-mers in the k-mer spectrum and based on a constraint. The constraint comprising: i) a first output count of a respective k-mer equals the total number of overlaps of the respective k-mer with other k-mers in the nucleic acid; and/or ii) a first output count of a respective k-mer equals the total number of overlaps of other k-mers with the respective k-mer.
In other words, when representing the k-mers by a directed graph, the first counts would correspond to the read support of the nodes and the second counts would correspond to the read support of the directed edges, the arcs, between the respective nodes. The first output counts then correspond with the determined multiplicity of the k-mers in the nucleic acid. According to the above method, the output counts are not solely based on the k-mer spectrum (i.e., on a fixed threshold obtained from the k-mer spectrum). A further constraint is applied taking into account the neighbourhood of the respective k-mer. When representing the k-mers by a directed graph, this constraint imposes that the directed graph is a balanced directed graph (i.e., that for each node), its multiplicity is equal to the sum of the multiplicities of its incoming arcs and equal to the sum of the multiplicities of its outgoing arcs. This improves the statistical power of assigning multiplicities to nodes or arcs. Specifically, it also improves the accurate identification of nodes or arcs with multiplicity zero, and thus in a better identification of sequencing errors in the reads.
The above method may further comprise determining second output counts indicative for the respective number of overlaps between respective k-mers in the nucleic acid based on the position of the k-mers in the k-mer spectrum and based on the constraint. In other words, also the multiplicities of the arcs may be derived in a similar way.
According to an example embodiment, the first and/or second output counts may be determined by assigning probabilities for candidate output counts. In other words, as the k-mer spectrum already provides information on the probabilities for a certain multiplicity of a node or arc based on the read support, the constraint may be used to further adapt the probabilities on a per node or arc basis. A candidate output count is then assigned a lower probability when violating the constraint.
According to an embodiment, the determining the first and/or second output counts is performed by a probabilistic graphical model. Such a model allows deriving the constraint based multiplicities. The model may for example be a conditional random field, CRF. This way, the output counts may be obtained by iteratively estimating the parameters of the k-mer spectrum and the node- and arc multiplicities in an expectation-maximization setting. More generally, the first output counts and/or second output counts may be the unobserved variables of the model. Similarly, the first and/or second counts are the observed variables of the model.
According to a further embodiment, the determining the first and/or second output counts further comprises selecting a neighbourhood of a respective k-mer and constructing a probabilistic graphical model of the neighbourhood and determining the first output count and/or the second output count associated with the respective k-mer therefrom.
By limiting the number of nodes to a neighbourhood of the k-mer, the number of random variables in the model is reduced. Moreover, as the multiplicities of the neighbouring node have the biggest impact on the constraint, the output counts are obtained faster without a significant loss in accuracy.
According to various example embodiments, the overlaps between the k-mers are k-1 overlaps.
According to a second aspect, the disclosure relates to a data processing system configured to perform the method according to the first aspect.
According to a third aspect, the disclosure relates to a computer program comprising instructions which, when the program is executed by a computer, cause the computer to perform the method according to the first aspect.
According to a fourth aspect, the disclosure relates to a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to perform the method according to the first aspect.
The above, as well as additional, features will be better understood through the following illustrative and non-limiting detailed description of example embodiments, with reference to the appended drawings.
All the figures are schematic, not necessarily to scale, and generally only show parts which are necessary to elucidate example embodiments, wherein other parts may be omitted or merely suggested.
Example embodiments will now be described more fully hereinafter with reference to the accompanying drawings. That which is encompassed by the claims may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided by way of example. Furthermore, like numbers refer to the same or similar elements or components throughout.
Various example embodiments relate to processing of reads of a sequenced nucleic acid. More particular, they relate to determining the number of occurrences of respective k-mers in the nucleic acid from the reads. The below description uses a deoxyribonucleic acid, or shortly DNA sequence, as an example of a nucleic acid but the below description may also be applied to other forms of nucleic acid such as a ribonucleic acid, or shortly RNA sequence.
Similarly, a de Bruijn graph may be constructed from the reads 110 obtained from a DNA sequence 100. This is illustrated in
Furthermore, by the introduction of read errors, different erroneous nodes and arcs may be introduced in the de Bruijn graph. This is illustrated in
The below example embodiments describe how to process the reads 410 of a sequenced DNA sequence 210 in order to assign the node and/or arc multiplicities. To this respect,
At a next step 505, the arc and node counts 553, 554 of the k-mers are determined. In other words, the de Bruijn graph 550 is determined. Although the counts determine the de Bruijn graph, it is not essential that a graph 550 is constructed. The node and arc counts may for example also be stored in a table or database thereby representing the de Bruijn graph 550. Due to the errors in the reads, different errors are introduced in the graph 550. For example, a new intermediate node 555 and end node 558 is introduced together with connecting arcs 556, 557 and 559. Also, a new arc 580 is introduced.
Then, in a next step 507, the arc and/or node multiplicities are determined based on the k-mer spectrum 508, 530 and based on the following constraints 509, 510: i) the multiplicity of a node should equal the sum of the multiplicities of the outgoing arcs of that node; and ii) the multiplicity of a node should equal the sum of the multiplicities of the arcs entering the node. This constraint relates to an inherent property of the de Bruijn graph representing the original DNA sequence.
An example 530 of a k-mer spectrum is also shown in
One way to determine the multiplicities therefrom is to determine hard multiplicity intervals 531, 532, 533, etc., each indicating a count interval within which a k-mer is assigned a certain multiplicity. However, as will be demonstrated further below, this is not desired. Therefore, according to step 507, the constraints 509, 510 are further taken into account in assigning the multiplicities. This may be done by, for each node and arc, assigning probabilities for the different multiplicities based on the k-mer spectrum and, additionally, increasing or decreasing the probability based on the how well the graph fulfils the constraint. In other words, each node is assigned a set of candidate output counts, each with a certain probability based on the k-mer spectrum. These probabilities are then further adjusted according to the constraints 509, 510. From these probabilities, a decision on the multiplicity of each node and arc may be taken.
Regarding node 570, based on the k-mer spectrum cutoff 531, the node 570 and arc 571 would be assigned multiplicity zero as shown in
Different techniques may be used to apply the constraints 509 and 510 in the determination of the multiplicities. According to an example embodiment, a probabilistic graphical model, more particular a conditional random field (CRF) model, is used. When determining multiplicities of nodes and arcs while also taking into account constraints 509, 510, information of neighbouring nodes and arcs may be taken into account as demonstrated in the simplified example above. This results in a high dimensional problem. By using conditional random fields, this high dimensional problem is transformed into probabilities that are products of factors and independencies are more easily represented. Moreover, efficient computational methods may be used to infer the individual probabilities present in the model. The CRF model allows combining the information contained in the k-mer spectrum 530 with constraints 509, 510.
A CRF model is a type of Probabilistic Graphical model or graph model wherein the nodes represent variables and the edges represent direct probabilistic interactions between these variables. These variables are further divided into a set of observed variables and a set of unobserved variables from which information is inferred. When using a CRF model for performing step 507, the observed variables X={X1, . . . , Xm} may correspond to the coverage of arcs in the de Bruijn graph 550 and the unobserved variables Y={Y1, . . . , Ym} may correspond to the multiplicities of nodes and arcs in the resulting de Bruijn graph 650. Probabilistic interactions may then be represented by factors (φi(Di)(Di⊆X∪Y; Di⊆/X) such that all variables in the scope Di of φi form a clique in the CRF. The joint conditional probability over all variables may then be calculated as
wherein Z(X) is the partition function that normalises the product of factors such that P(Y|X) is a probability distribution. By performing the conditioning on X, the modelling of dependencies between these variables may be avoided and only the conditional dependencies between the Ys are modelled. Two types of factors may further be discerned in the model. The first type, referred to as the “singleton factors” φa, models the relationship between the multiplicity of the arcs in the de Bruijn graph and their observed k-mer coverage. The second type, referred to as “conservation of flow factors” φflow, impose the conservation of flow according to the constraints 509 and 510 through the resulting de Bruijn graph 650. Edges connect nodes such that all variables that are in the scope Di of a factor, form a clique in the CRF. The two different factor types will pull the belief, i.e. probability for a certain multiplicity in the graph, in a certain direction. In other words, the final multiplicity-probabilities will be the result of an interplay between the observed k-mer coverage in graph 550, because of φa, and an implied conservation of flow according to constraints 509, 510, because of φa. In some cases, both may point to the same multiplicity. In other cases, φa will put more belief in multiplicity=1, while φflow imposes the multiplicity to be equal to zero, for example in case of a highly covered error, or the other way around, for example in case of a low coverage region.
According to an embodiment, the CRFs are built for small regions of the de Bruijn graph. This way, computational complexity is greatly reduced compared with the case where the whole de Bruijn graph is taken into account. This is further illustrated in
For each Xa and Ya corresponding to an arc in the de Bruijn graph, a “singleton factor” 880
φa(Ya,Xa):(Val(Ya,Xa)+
is created that represents the probability that an arc has multiplicity m given that a coverage C is observed. Because the time and memory requirements of the subsequent inference algorithms depend on the number of possible values for the variables, Val(Ya) is restricted to [max(0, mopt−α),mopt+α] wherein α is a tuneable parameter, for example α=2, and mopt is the current estimation for the most likely multiplicity of Ya. To estimate these probabilities, a mixed model of negative binomials may be fit based on the k-mer spectrum 530 of the data:
For this mixed model some parameters are to be estimated: λerr, the mean coverage of the erroneous k-mers, i.e. those with multiplicity 0, and λ, the mean coverage of k-mers with multiplicity 1 in the DNA sequence. For each multiplicity with m>1 the negative binomial has mean mλ. Each negative binomial may further be weighted according to the estimated number of k-mers with that multiplicity in the dataset, wm. To determine the variance of a negative binomial an overdispersion factor f may be used such that the variance of a negative binomial with mean λ is equal to ƒλ.
For each Yn corresponding to a node n in the de Bruijn graph, two “conservation of flow factors” are created. Each node generates a conservation of flow factor for the incoming arcs and one for the outgoing arcs. The scope of these factors then contains the corresponding CRF-node Yn and all Ya corresponding to incoming respectively outgoing arcs of node n, i.e., for the incoming arcs:
These factors
φflow(Yn,Ya, . . . ,Ya):Val(Yn,Ya, . . . ,Ya)+
assign a value of 1 to combinations of values such that the sum of the multiplicities of the incoming respectively outgoing arcs is equal to the multiplicity of the node and a low value (e.g., 10-7) otherwise. While the singleton factors represent the knowledge present in each node and arc individually, the conservation of flow factors will enforce a consistent assignment (following a conservation of flow of multiplicity) over all nodes and edges represented in the CRF. An analogous equation may be constructed for the outgoing arcs.
The product of all factors is the joint probability of the multiplicities over all nodes and arcs in the selected neighbourhood 860. The following equation may determine the probability distribution over the possible multiplicities of a node n (or similarly an arc a):
P(Yn|X)=ΣY\Y
This may be done using a variable elimination algorithm performing an exact calculation of these probabilities.
The values of the factors in the above CRF depend on parameters which also may be estimated. These estimates depend on the multiplicities of the nodes and arcs in the de Bruijn graph, which are also unknown. Therefore, the CRF-model may be used in an Expectation-Maximisation setting as follows:
1) Use data heuristics to find a first estimate for these parameters.
2) Estimate (E) the most likely multiplicities for the nodes and arcs of the de Bruijn graph, using a CRF with factors based on the current parameters.
3) Estimate (M) new values for the parameters using the newly inferred multiplicities, weighted by their probability.
Convergence is then reached when the estimated multiplicity of a certain percentage of the nodes and arcs does not change between different E-steps.
The parameters in the CRF that may be determined are: i) the negative binomial distributions λ and λerr, ii) the overdispersion factor f determining the variance of the negative binomial, and iii) the weights wi for possible multiplicities. To calculate the values of these parameters the following values derived from the data may be used: cov(n) denoting the k-mer coverage of a node n; cov(a) denoting the k-mer coverage of an arc a; wherein the representation of the de Bruijn graph concatenates nodes that have only one incoming and one outgoing arc, i.e. cov(n) is the total k-mer coverage of all k-mers in the concatenated node. The set of all k-mers in node n is further denoted by k-mers(n), and its size by |k-mers(n)|. The current multiplicity estimate for nodes and arcs is denoted by mult(n) and mult(a) respectively. Since nodes correspond to (concatenated) k-mers, while the arcs correspond to (k-1)-mers, the underlying distributions will differ slightly. Therefore, all parameters are estimated separately for the nodes and the arcs of the de Bruijn graph.
The negative binomial may be estimated as follows:
wherein the weights wm are calculated as the number of nodes/arcs assigned with multiplicity m up until a certain tuneable multiplicity m′. All weights wm, m>m′ may be calculated as a fraction of wm′: wm=β(m−m′)wm′. β<1 is determined as wm′/wm′−1 because most nodes will have a low multiplicity and modelling higher multiplicity weights by, for example, a geometric distribution provides desired results. To determine the overdispersion factor f, the sample variance for each multiplicity m<m′ is calculated separately as
The factor f may be calculated by taking a weighted mean of the sample variances based on the number of nodes/arcs that are assigned with a certain multiplicity:
Table 1 shows results obtained by the above described method. The method was performed on reads obtained from a S. enterica genome for different read coverages (Coverage) and for different sizes of the CRF (CRF-size). The simulation where the CRF-size is zero corresponds to the case where only the k-mer spectrum static thresholds are used. All results were compared with a reference genome of the S. enterica to obtain an accuracy (Acc) measure for each simulation.
As used in this application, the term “circuitry” may refer to one or more or all of the following:
(a) hardware-only circuit implementations such as implementations in only analog and/or digital circuitry and
(b) combinations of hardware circuits and software, such as (as applicable):
(c) hardware circuit(s) and/or processor(s), such as microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not used for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
Although the present disclosure has been illustrated by reference to specific embodiments, it will be apparent to those skilled in the art that the disclosure is not limited to the details of the foregoing illustrative embodiments, and that the present disclosure may be embodied with various changes and modifications without departing from the scope thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the disclosure being indicated by the appended claims rather than by the foregoing description, and all changes which come within the scope of the claims are therefore intended to be embraced therein.
It will furthermore be understood by the reader of this patent application that the words “comprising” or “comprise” do not exclude other elements or steps, that the words “a” or “an” do not exclude a plurality, and that a single element, such as a computer system, a processor, or another integrated unit may fulfil the functions of several aspects recited in the claims. Any reference signs in the claims shall not be construed as limiting the respective claims concerned. The terms “first”, “second”, third”, “a”, “b”, “c”, and the like, when used in the description or in the claims are introduced to distinguish between similar elements or steps and are not necessarily describing a sequential or chronological order. Similarly, the terms “top”, “bottom”, “over”, “under”, and the like are introduced for descriptive purposes and not necessarily to denote relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and embodiments of the disclosure are capable of operating according to the present disclosure in other sequences, or in orientations different from the one(s) described or illustrated above.
While some embodiments have been illustrated and described in detail in the appended drawings and the foregoing description, such illustration and description are to be considered illustrative and not restrictive. Other variations to the disclosed embodiments can be understood and effected in practicing the claims, from a study of the drawings, the disclosure, and the appended claims. The mere fact that certain measures or features are recited in mutually different dependent claims does not indicate that a combination of these measures or features cannot be used. Any reference signs in the claims should not be construed as limiting the scope.
Number | Date | Country | Kind |
---|---|---|---|
18248105.1 | Dec 2018 | EP | regional |