The invention relates to the identification of microorganisms, and particularly bacteria, by means of spectrometry.
The invention is in particular applicable to the identification of microorganisms by means of mass spectrometry, for example of MALDI-TOF type (“Matrix-assisted laser desorption/ionization time of flight”), of vibrational spectrometry, and of autofluorescence spectroscopy.
It is known to use spectrometry or spectroscopy to identify microorganisms, and more particularly bacteria. For this purpose, a sample of an unknown microorganism is prepared, after which a mass, vibrational, or fluorescence spectrum of the sample is acquired and pre-processed, particularly to remove the baseline and the noise. The peaks of the pre-processed spectrum are then “compared”, by means of classification tools, with data from a knowledge base built from a set of reference spectra, each associated with an identified microorganism.
More particularly, the identification of microorganisms by classification conventionally comprises:
Typically, a spectrometry identification device comprises a spectrometer and a data processing unit receiving the measured spectra and implementing the second above-mentioned step. The first step is implemented by the manufacturer of the device, who determines the classification model and the prediction model and integrates them into the machine before its use by a customer.
Algorithms of the support vector machine or SVM type are conventional supervised learning tools, particularly suited to learning high-dimensional classification models aimed at classifying a large number of species.
However, even though SVMs are particularly suited to high-dimensional problems, determining a classification model with such algorithms is very complex.
First, conventionally-used SVM algorithms belong to so-called “flat” algorithms, which consider the species to be classified as equivalent and, as a corollary, also consider classification errors as equivalent. Thus, from an algorithmic viewpoint, a classification error between two close bacteria has the same weight as a classification error between a bacterium and a fungus. It is then up to the user, based on his knowledge of the microorganisms used to generate the training spectra, on the structure of the actual spectra, and on his algorithmic knowledge, to modify the “flat” SVM algorithm used so as to minimize the severity of its classification errors. Setting aside the difficulty of modifying a complex algorithm, such a modification is highly dependent on the user himself.
Further, even when some ten or several tens of different training spectra are available for each microorganism species to build the classification model, this number remains very low. Not only may the variety of the training spectra be very small as compared with the total variety of the species, but a limited number of instances also mechanically exacerbates the specificity of each spectrum. Thereby, the obtained classification model may be inaccurate for certain species, making the subsequent step of predicting an unknown microorganism very difficult. Here again, it is up to the user to interpret the results given by the identification, to know their degree of relevance, and thus, in the end, to deduce an exploitable result therefrom.
The present invention aims at providing a method of identifying microorganisms by spectrometry or spectroscopy based on a classification model obtained by an SVM-type supervised learning method which minimizes the severity of identification errors, thus enabling unknown microorganisms to be identified substantially more reliably.
For this purpose, an object of the invention is a method of identifying by spectrometry unknown microorganisms from among a set of reference species, comprising:
According to the invention:
In other words, the invention specifically introduces a priori information which has not been considered up to now in supervised learning algorithms used in the building of classification models for the identification of microorganisms, that is, a hierarchical tree-like representation of the microorganism species in terms of evolution and/or of clinical phenotype. Such a hierarchical representation is for example a taxonomic tree having its structure essentially guided by the evolution of species, and accordingly which intrinsically contains a notion of similarity or of proximity between species.
The SVM algorithm is thus no longer a “flat” algorithm, the species being no longer interchangeable. As a corollary, classification errors are no longer considered identical by the algorithm. By establishing a link between the species to be classified, the method according to the invention thus explicitly and/or implicitly takes into account the fact that they have information in common, and thus also non-common information, which accordingly helps distinguish species, and thus minimizes classification errors as well as the impact of the small number of training spectra per species.
Such a priori information is introduced into the algorithm by means of a structuring of the data and of the variables due to the tensor product. Thus, the structure of the data and of the variables of the algorithm associated with two species is all the more similar as these species are close in terms of evolution and/or of clinical phenotype. Since SVM algorithms are algorithms aiming at optimizing a cost function under constraints, the optimization thus necessarily takes into account similarities and differences between the structures associated with the species.
In a way, it may be set forth that the proximity between species is “qualitatively” taken into account by the structuring of the data and variables. According to the invention, the proximity between species is also “quantitatively” taken into account by a specific selection of the loss functions involved in the definition of the constraints of the SVM algorithm. Such a “quantitative” proximity of the species is for example determined according to a “distance” defined on the trees of the reference species or may be determined totally independently therefrom, for example, according to specific needs of the user. This thus results in a minimizing of classification errors as well as a gain in robustness of the identification with respect to the paucity of the training spectra.
Finally, the classification model now relates to the classification of the nodes of the tree of the hierarchical representation, including roots and leaves, and no longer only to species. Particularly, if during a prediction implemented on the spectrum of an unknown microorganism, it is difficult to determine the species to which the microorganism belongs with a minimum degree of certainty, the prediction is capable of identifying to which larger group (genus, family, order . . . ) of microorganisms the unknown microorganism belongs. Such precious information may for example be used to implement other types of microbial identifications specific to said identified group.
According to an embodiment, loss functions associated with pairs of nodes are equal to distances separating the nodes in the tree of the hierarchical representation. Thereby, the algorithm is optimized for said tree, and the loss functions do not depend on the user's know-how and knowledge.
According to an embodiment, loss functions associated with pairs of nodes are respectively greater than distances separating the nodes in the tree of the hierarchical representation. Thus, another type of a priori information may be introduced in the building of the classification model. Particularly, the algorithmic separability of the species may be forced by selecting loss functions having a value greater than the distance in the tree.
According to an embodiment, the loss functions are calculated:
The loss functions particularly make it possible to set the separability of the species with respect to the training spectra and/or the SVM algorithm used. It is in particular possible to detect species with a low separability and to implement an algorithm which modifies the loss functions to increase this separability.
In a first variation:
Thereby, the impact of having introduced the taxonomy and/or clinical phenotype information contained in the tree of the hierarchical representation is assessed and the remaining errors or classification defects are minimized by selecting loss functions as a function thereof.
According to a second variation:
Just as in the first variation, the remaining error and classification defects are corrected while keeping in the loss functions quantitative information relative to the distances between species in the tree.
Particularly, the current values of the loss functions are calculated according to relation:
Δ(y_i, k) = α × Ω(y_i, k) + (1 − α) × Δ_confusion(y_i, k)
where Δ(y_i, k) are said current values of the loss functions for node pairs (y_i, k) of the tree, Ω(y_i, k) and Δ_confusion(y_i, k) respectively are the first and second matrices, and α is a scalar number between 0 and 1. More particularly, α is in the range from 0.25 to 0.75, particularly from 0.25 to 0.5.
Such a convex combination provides both a high accuracy of the identification and a minimization of the severity of identification errors.
More particularly, the initial values of the loss functions are set to zero for pairs of identical nodes and equal to 1 otherwise.
According to an embodiment, a distance Ω separating two nodes n_1, n_2 in the tree of the hierarchical representation is determined according to relation:
Ω(n_1, n_2) = depth(n_1) + depth(n_2) − 2 × depth(LCA(n_1, n_2))
where depth(n_1) and depth(n_2) respectively are the depths of nodes n_1, n_2, and depth(LCA(n_1, n_2)) is the depth of the closest common ancestor LCA(n_1, n_2) of nodes n_1, n_2 in said tree. Distance Ω thus defined is the minimum distance capable of being defined in a tree.
According to an embodiment, the prediction model is a prediction model for the tree nodes to which the unknown microorganism to be identified belongs. It is thus possible to predict nodes which are ancestors to the leaves corresponding to the species.
According to an embodiment, the optimization problem is formulated according to relations:
under constraints:
ξ_i ≥ 0, ∀i ∈ [1,N]
⟨W, Ψ(x_i, y_i)⟩ ≥ ⟨W, Ψ(x_i, k)⟩ + ƒ(Δ(y_i, k), ξ_i), ∀i ∈ [1,N], ∀k ∈ Y\y_i
in which expressions:
In a first variation, function ƒ(Δ(y_i, k), ξ_i) is defined according to relation ƒ(Δ(y_i, k), ξ_i) = Δ(y_i, k) − ξ_i. In a second variation, function ƒ(Δ(y_i, k), ξ_i) is defined according to relation
Particularly, the prediction step comprises:
T_ident = arg max_k s(x_m, k), k ∈ [1,T]
The invention also aims at a method of identifying a microorganism by mass spectrometry, comprising:
The present invention will be better understood on reading the following description, provided as an example only, in relation with the accompanying drawings, where the same reference numerals designate the same or similar elements, among which:
A method according to the invention applied to MALDI-TOF spectrometry will now be described in relation with the flowchart of
The method starts with a step 110 of acquiring a set of training mass spectra of a new microorganism species to be integrated in a knowledge base, for example, by means of MALDI-TOF (“Matrix-assisted laser desorption/ionization time of flight”) mass spectrometry. MALDI-TOF mass spectrometry is well known per se and will not be described in further detail hereafter. Reference may for example be made to Jackson O. Lay's document, “MALDI-TOF mass spectrometry of bacteria”, Mass Spectrometry Reviews, 2001, 20, 172-194. The acquired spectra are then preprocessed, particularly to denoise them and remove their baseline, as known per se.
The peaks present in the acquired spectrum are then identified at step 112, for example, by means of a peak detection algorithm based on the detection of local maximum values. A list of peaks for each acquired spectrum, comprising the location and the intensity of the spectrum peaks, is thus generated.
Advantageously, the peaks are identified in a predetermined mass-to-charge range [m_min; m_max], preferably the range [m_min; m_max] = [3,000; 17,000] Thomson. Indeed, it has been observed that the information sufficient to identify the microorganisms is contained in this range of mass-to-charge ratios, and that it is thus not needed to take a wider range into account.
The method carries on, at step 114, with a quantization or “binning” step. To achieve this, range [m_min; m_max] is divided into intervals of predetermined, for example constant, widths, and for each interval comprising a plurality of peaks, a single peak is kept, advantageously the peak having the highest intensity. A vector is thus generated for each measured spectrum. Each component of the vector corresponds to a quantization interval and has, as a value, the intensity of the peak kept for this interval, value “0” meaning that no peak has been detected in the interval.
As a variation, the vectors are “binarized” by setting the value of a component of the vector to “1” when a peak is present in the corresponding interval, and to “0” when no peak is present in this interval. This results in increasing the robustness of the subsequently-performed classification algorithm calibration. The inventors have indeed noted that the information relevant, particularly, to identify a bacterium is essentially contained in the absence and/or the presence of peaks, and that the intensity information is less relevant. It can further be observed that the intensity is highly variable from one spectrum to the other and/or from one spectrometer to the other. Due to this variability, it is difficult to take into account raw intensity values in the classification tools.
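By way of illustration only, the following sketch shows one possible implementation of the binning and binarization steps described above; the interval width, the range bounds, and the function name are assumptions introduced for the example, not values imposed by the method.

```python
import numpy as np

def bin_peaks(mz, intensity, m_min=3000.0, m_max=17000.0, bin_width=1.0, binary=True):
    """Quantize a peak list over [m_min; m_max]: one component per interval,
    keeping only the most intense peak of each interval; with binary=True the
    vector merely records the presence (1) or absence (0) of a peak."""
    n_bins = int(np.ceil((m_max - m_min) / bin_width))
    x = np.zeros(n_bins)
    for m, h in zip(mz, intensity):
        if m_min <= m < m_max:
            j = int((m - m_min) // bin_width)
            x[j] = max(x[j], h)          # keep the highest peak of the interval
    return (x > 0).astype(float) if binary else x

# Toy example: three detected peaks, two of which fall into the same interval.
x = bin_peaks([3500.2, 3500.7, 9100.0], [120.0, 80.0, 45.0])
```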
In parallel, the training spectrum peak vectors, called “training vectors” hereafter, are stored in the knowledge base. The knowledge base thus lists K microorganism species, called “reference species”, and a set X = {x_i}, i ∈ [1,N], of N training spectra x_i ∈ ℝ^p, where p is the number of peaks retained for the mass spectra.
At the same time, or consecutively, the K listed species are classified, at step 116, according to a tree-like hierarchical representation of the reference species in terms of evolution and/or of clinical phenotype.
In a first variation, the hierarchical representation is a taxonomic representation of living beings applied to the listed reference species. As known per se, the taxonomy of living organisms is a hierarchical classification of living beings which classifies each living organism according to the following order, from the least specific to the most specific: domain, kingdom, phylum, class, order, family, genus, species. The taxonomy used is for example that determined by the “National Center for Biotechnology Information” (NCBI). The taxonomy of living organisms thus implicitly comprises evolutionary data, close microorganisms at an evolutionary level comprising more components in common than microorganisms that are more remote in terms of evolution. Thereby, the evolutionary “proximity” has an impact on the “proximity” of spectra.
In a second variation, the hierarchical representation is a “hybrid” taxonomic representation obtained by taking into account phylogenetic characteristics, for example, species evolution characteristics, and phenotype characteristics, such as, for example, the Gram status (+/−) of the bacteria, which is based on the thickness/permeability of their membranes, or their aerobic or anaerobic character. Such a representation is for example illustrated in
Generally, the tree of the hierarchical representation is a graphical representation connecting end nodes, or “leaves”, corresponding to the species to a “root” node by a single path formed of intermediate nodes.
At a next step 118, the tree nodes, or “taxons”, are numbered with integers k ∈ Y = [1,T], where T is the number of nodes in the tree, including leaves and the root, and the tree is transformed into a set Λ = {Λ(k)}, k ∈ [1,T], of binary vectors Λ(k) ∈ {0,1}^T.
More particularly, the T nodes of the tree are respectively numbered from 1 to T, for example, in accordance with the different paths from the root to the leaves, as illustrated in the tree of
Other vectorial representations of the tree keeping these links are of course possible.
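As an illustration, the sketch below builds such binary vectors for a toy tree, under the assumption (not spelled out in this excerpt) that Λ(k) marks node k together with all of its ancestors, which is one encoding consistent with the proximity property discussed further below; the parent dictionary and function name are hypothetical.

```python
import numpy as np

def make_lambda(parent, T):
    """Binary vectors Lambda(k) in {0,1}^T marking node k and all of its ancestors.

    `parent` maps each node number (1..T) to its parent, the root being mapped
    to None. With this (assumed) encoding, two nodes share as many non-zero
    components as they have ancestors in common."""
    Lam = {}
    for k in range(1, T + 1):
        v = np.zeros(T)
        node = k
        while node is not None:
            v[node - 1] = 1.0            # nodes are 1-indexed, components 0-indexed
            node = parent[node]
        Lam[k] = v
    return Lam

# Hypothetical 5-node tree: root 1, intermediate nodes 2 and 3, leaves 4 and 5.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 3}
Lam = make_lambda(parent, T=5)           # e.g. Lam[4] = [1, 1, 0, 1, 0]
```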
To better understand what follows, the following notations are introduced. Each training vector x_i corresponds to a specific reference species labeled with an integer y_i ∈ [1,T], that is, the number of the corresponding leaf in the tree of the hierarchical representation. For example, the 10th training vector x_10 corresponds to the species represented by leaf number “24” of the tree of
At a next step 120, new “structured training” vectors Ψ(x_i, k) ∈ ℝ^(p×T) are generated according to relations:
Ψ(x_i, k) = x_i ⊗ Λ(k), ∀i ∈ [1,N], ∀k ∈ [1,T]  (1)
where ⊗: ℝ^p × ℝ^T → ℝ^(p×T) is the tensor product between space ℝ^p and space ℝ^T. A vector Ψ(x_i, k) is thus a concatenation of T blocks of dimension p, where the blocks corresponding to the components of vector Λ(k) equal to 1 are equal to vector x_i and the other blocks are equal to the zero vector 0_p of ℝ^p. Referring again to the example of
and vector Ψ(x_i, 5) is equal to
It can thus be observed that the closer nodes are to one another in the tree of the hierarchical representation, the more their structured vectors share common non-zero blocks. Conversely, the more nodes are remote, the less their structured vectors share non-zero blocks in common, such observations thus in particular applying to leaves representing reference species.
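A minimal sketch of relation (1) follows; np.kron places a copy of the peak vector in every block where the corresponding component of Λ(k) equals 1, and zeros elsewhere. The toy vectors are assumptions for the example only.

```python
import numpy as np

def structured_vector(x, lam_k):
    """Psi(x, k) = x tensor Lambda(k): a concatenation of T blocks of dimension p,
    where the blocks corresponding to components of Lambda(k) equal to 1 contain x
    and the remaining blocks are the zero vector 0_p."""
    return np.kron(lam_k, x)                       # shape (T * p,)

# Toy example with p = 3 and T = 5: Lambda marks node 4 and its ancestors 2 and 1
# (hypothetical numbering), so Psi has three non-zero blocks out of five.
x = np.array([0.0, 1.0, 1.0])
lam_4 = np.array([1.0, 1.0, 0.0, 1.0, 0.0])
psi = structured_vector(x, lam_4)
```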
At a next step 122, loss functions of a structured multi-class SVM type algorithm applied to all the nodes of the tree of the hierarchical representation are calculated.
More particularly, a multi-class SVM algorithm structured in accordance with the hierarchical representation according to the invention is defined according to relations:
under constraints:
ξ_i ≥ 0, ∀i ∈ [1,N]  (3)
⟨W, Ψ(x_i, y_i)⟩ ≥ ⟨W, Ψ(x_i, k)⟩ + ƒ(Δ(y_i, k), ξ_i), ∀i ∈ [1,N], ∀k ∈ Y\y_i  (4)
in which expressions:
As can be observed, the proximity between species, such as coded by the hierarchical representation, and such as introduced into the structure of the structured training vector, is taken into account via the constraints. Particularly, the closer species are to one another in the tree, the more their data are coupled. The reference species are thus no longer considered as interchangeable by the algorithm according to the invention, conversely to conventional multi-class SVM algorithms, which consider no hierarchy between species and consider said species as being interchangeable.
Further, the structured multi-class SVM algorithm according to the invention quantitatively takes into account the proximity between reference species by means of loss functions Δ(yi,k).
According to a first variation, function ƒ is defined according to relation:
ƒ(Δ(y_i, k), ξ_i) = Δ(y_i, k) − ξ_i  (5)
According to a second variation, function ƒ is defined according to relation:
In an advantageous embodiment, loss functions Δ(y_i, k) are equal to a distance Ω(y_i, k) defined in the tree of the hierarchical representation according to relation:
Δ(y_i, k) = Ω(y_i, k) = depth(y_i) + depth(k) − 2 × depth(LCA(y_i, k))  (7)
where depth(y_i) and depth(k) respectively are the depths of nodes y_i and k in said tree, and depth(LCA(y_i, k)) is the depth of the ascending node, or closest common “ancestor” node, LCA(y_i, k) of nodes y_i, k in said tree. The depth of a node is for example defined as being the number of nodes which separate it from the root node.
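A short sketch of relation (7) is given below; the tree is represented by a hypothetical parent dictionary, and the depth of the root is taken as 0, consistent with the definition of depth given above.

```python
def ancestors(node, parent):
    """Nodes from `node` up to the root, `node` included."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def tree_distance(n1, n2, parent):
    """Omega(n1, n2) = depth(n1) + depth(n2) - 2 * depth(LCA(n1, n2)), relation (7)."""
    a1, a2 = ancestors(n1, parent), ancestors(n2, parent)
    lca = next(n for n in a1 if n in set(a2))       # closest common ancestor
    depth = lambda n: len(ancestors(n, parent)) - 1
    return depth(n1) + depth(n2) - 2 * depth(lca)

# Hypothetical tree: root 1, intermediate nodes 2 and 3, leaves 4 and 5.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 3}
assert tree_distance(4, 5, parent) == 4             # leaves on different branches
assert tree_distance(4, 2, parent) == 1             # leaf and its direct ancestor
```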
As a variation, loss functions Δ(yi, k) are of a nature different from that of the hierarchical representation. These functions are for example defined by the user according to another hierarchical representation, to his know-how and/or to algorithmic results, as will be explained in further detail hereafter.
Once the loss functions have been calculated, the method according to the invention carries on with the implementation, at step 124, of the multi-class SVM algorithm such as defined in relations (2), (3), (4), (5) or (2), (3), (4), (6).
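Purely as a sketch, the following toy solver illustrates step 124 for the margin-rescaling form of relation (5). The objective of relation (2) is not reproduced in this excerpt; the usual soft-margin objective (1/2)·‖W‖² + C·Σξ_i is assumed here, and the data layout and function name are hypothetical.

```python
import numpy as np
import cvxpy as cp

def fit_structured_svm(Psi, y, Delta, C=1.0):
    """Toy solver for the structured multi-class SVM of relations (3)-(5).

    Psi[i][k]  : structured vector Psi(x_i, k) for training sample i and node k
    y[i]       : index of the node (leaf) labelling sample i
    Delta[j,k] : loss function Delta(j, k) for the pair of nodes (j, k)
    The objective (1/2)*||W||^2 + C*sum(xi_i) is an assumption, relation (2)
    not being reproduced in the text."""
    N, T = len(Psi), len(Psi[0])
    d = Psi[0][0].shape[0]
    W = cp.Variable(d)
    xi = cp.Variable(N, nonneg=True)                       # relation (3)
    constraints = []
    for i in range(N):
        for k in range(T):
            if k == y[i]:
                continue
            # relation (4) with f(Delta, xi) = Delta - xi, i.e. relation (5)
            constraints.append(W @ Psi[i][y[i]] >= W @ Psi[i][k] + Delta[y[i], k] - xi[i])
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi)), constraints)
    problem.solve()
    return W.value
```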
The result produced by the algorithm thus is vector W, which is the classification model of the tree nodes, deduced from the combination of the information contained in training vectors x_i, from the positioning of their associated reference species in the tree, from the information as to the proximity between species contained in the hierarchical representation, and from the information as to the distance between species contained in the loss functions. More particularly, each weight vector w_l, l ∈ [1,T], represents the normal vector of a hyperplane of ℝ^p forming a border between the instances of node “l” of the tree and the instances of the other nodes k ∈ [1,T]\l of the tree.
Training steps 112 to 124 of the classification model are implemented once in a first computer system. Classification model W = (w_1 w_2 … w_T)^T and vectors Λ(k) are then stored in a microorganism identification system comprising a MALDI-TOF-type spectrometer and a computer processing unit connected to the spectrometer. The processing unit receives the mass spectra acquired by the spectrometer and implements the prediction, determining, based on model W and on vectors Λ(k), with which nodes of the tree of the hierarchical representation the mass spectra acquired by the mass spectrometer are associated.
As a variation, the prediction is performed on a distant server accessible by a user, for example, by means of a personal computer connected to the Internet to which the server is also connected. The user loads non-processed mass spectra obtained by a MALDI-TOF type mass spectrometer onto the server, which then implements the prediction algorithm and returns the results of the algorithm to the user's computer.
More particularly, for the identification of an unknown microorganism, the method comprises a step 126 of acquiring one or a plurality of mass spectra thereof, a step 128 of preprocessing the acquired spectra, as well as a step 130 of detecting the peaks of the spectra and of determining a peak vector x_m ∈ ℝ^p, for example as previously described in relation with steps 110 to 114.
At a next step 132, a structured vector is calculated for each node in the tree of the hierarchical representation, k ∈ Y=[1,T], according to relation:
Ψ(x_m, k) = x_m ⊗ Λ(k)  (8)
after which a score associated with node k is calculated according to relation:
s(x_m, k) = ⟨W, Ψ(x_m, k)⟩  (9)
The identified tree node T_ident ∈ [1,T] for the unknown microorganism then is, for example, that corresponding to the highest score:
T_ident = arg max_k s(x_m, k), k ∈ [1,T]  (10)
Other prediction models are of course possible.
Apart from the score associated with identified taxon T_ident, the scores of the ancestor nodes and of the daughter nodes of taxon T_ident, if they exist, are also calculated by the prediction algorithm. Thus, for example, if the score of taxon T_ident is considered low by the user, the latter has available the scores associated with the ancestor nodes, and thus additional, more reliable information.
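A minimal sketch of the prediction of relations (8)-(10) follows; it also keeps the scores of all nodes so that ancestor and daughter scores remain available, as described above. The argument names are hypothetical.

```python
import numpy as np

def predict_node(x_m, W, Lam):
    """Relations (8)-(10): build Psi(x_m, k) for every node k, score it against
    the classification model W, and return the best-scoring node together with
    the full score dictionary (ancestors and daughters included)."""
    scores = {k: float(W @ np.kron(lam_k, x_m)) for k, lam_k in Lam.items()}
    t_ident = max(scores, key=scores.get)            # relation (10)
    return t_ident, scores
```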
A specific embodiment of the invention where loss functions Δ(yi,k) are calculated according to a minimum distance defined in the tree of the hierarchical representation has just been described.
Other alternative calculations of loss functions Δ(yi,k) will now be described.
In a first variation, the loss functions defined at relation (7) are modified according to a priori information enabling to obtain a more robust classification model and/or to ease the resolution of the optimization problem defined by relations (2), (3), and (4). For example, the loss function Δ(yi,k) of a pair of nodes (yi,k) may be selected to be low, in particular smaller than distance Ω(yi,k), which means that identification errors are tolerated between these two nodes. Releasing constraints on one or a plurality of pairs of species mechanically amounts to increasing constraints on the other pairs of species, the algorithm being then set to more strongly differentiate the other pairs. Similarly, loss function Δ(yi,k) of a pair of nodes (yi,k) may be selected to be very high, particularly greater than distance Ω(yi,k), to force the algorithm to differentiate nodes (yi,k), and thus to minimize identification errors therebetween. In particular, it is possible to release or to reinforce constraints bearing on pairs of reference species by means of their respective loss functions.
In a second variation, illustrated in the flowchart of
The method of calculating loss functions Δ(y_i, k) starts with the selection, at step 140, of initial values for them. For example, Δ(y_i, k) = 0 when y_i = k, and Δ(y_i, k) = 1 when y_i ≠ k, functions ƒ thus being reduced to ƒ(Δ(y_i, k), ξ_i) = 1 − ξ_i. Other initial values are of course possible for the loss functions, functions ƒ(ξ_i) = 1 − ξ_i appearing in the constraints of the above-discussed algorithms being then replaced with functions ƒ(Δ(y_i, k), ξ_i) of relation (5) or (6) with the initial values of the loss functions.
The calculation method carries on with the estimation of the performance of the SVM algorithm for the selected loss functions Δ(yi,k). Such an estimation comprises:
Calibration vectors x̃_i are for example acquired at the same time as training vectors x_i. Particularly, for each reference species, the spectra associated therewith are distributed into a training set and a calibration set, from which the training vectors and the calibration vectors are respectively generated.
The loss function calculation method carries on, at step 148, with the modification of the values of the loss functions according to the calculated confusion matrix. The obtained loss functions are then used by the SVM algorithm for calculating final classification model W, or a test is carried out at step 150 to know whether new values of the loss functions are calculated by implementing steps 142, 144, 146, 148 according to values of the loss functions modified at step 148.
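The loop below merely outlines steps 140 to 150; the helpers train_svm, predict, confusion and update_losses are placeholders standing for the SVM training, prediction, confusion-matrix and loss-update steps detailed in the following examples, not functions defined by the text.

```python
import numpy as np

def calibrate_losses(train_data, calib_data, n_nodes, n_rounds=3):
    """Outline of the iterative loss-function calculation (steps 140-150).

    The four helpers called below are placeholders standing for steps 142, 144,
    146 and 148; they are not provided here."""
    Delta = 1.0 - np.eye(n_nodes)                     # 0/1 initial values (step 140)
    for _ in range(n_rounds):                         # step 150 decides whether to iterate
        model = train_svm(train_data, Delta)          # step 142 (placeholder)
        predictions = predict(model, calib_data)      # step 144 (placeholder)
        C_tilde = confusion(predictions, calib_data)  # step 146 (placeholder)
        Delta = update_losses(Delta, C_tilde)         # step 148 (placeholder)
    return Delta
```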
In a first example of the loss function calculation method, the SVM algorithm executed at step 142 is of the one-versus-all type. This algorithm is not hierarchical and only considers the reference species, referred to with integers k ∈ [1,K]; it solves an optimization problem for each reference species k according to relations:
under constraints:
ξ_i ≥ 0, ∀i ∈ [1,N]  (12)
q_i(⟨w_k, x_i⟩ + b_k) ≥ 1 − ξ_i, ∀i ∈ [1,N]  (13)
in which expressions:
The prediction model is provided by the following relation and applied, at step 144, to each of calibration vectors x̃_i:
G(x̃_i) = arg max_k (⟨w_k, x̃_i⟩ + b_k), k ∈ [1,K]  (14)
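Relations (11)-(14) describe a standard one-versus-all linear SVM; as an illustration only, the sketch below uses scikit-learn's LinearSVC (whose default multi-class strategy is one-versus-rest) as an off-the-shelf stand-in, which is an assumption rather than the algorithm prescribed by the text.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data standing in for the training and calibration vectors.
X_train = np.random.default_rng(0).random((40, 100))
y_train = np.repeat(np.arange(4), 10)                # 4 reference species, 10 spectra each
X_cal = np.random.default_rng(1).random((12, 100))

ova = LinearSVC(C=1.0).fit(X_train, y_train)         # one binary SVM per species
scores = ova.decision_function(X_cal)                # <w_k, x~_i> + b_k for each species k
pred = scores.argmax(axis=1)                         # prediction rule of relation (14)
```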
An inter-species confusion matrix C_species ∈ ℝ^(K×K) is then calculated, at step 146, according to relation:
C_species(i, k) = FP(i, k), ∀i, k ∈ [1,K]  (15)
where FP(i,k) is the number of calibration vectors of species i predicted by the prediction model as belonging to species k.
Still at step 146, a normalized inter-species confusion matrix C̃_species ∈ ℝ^(K×K) is then calculated according to relation:
where N_i is the number of calibration vectors for the species bearing reference i.
Finally, step 146 ends with the calculation of a normalized inter-node confusion matrix C̃_taxo ∈ ℝ^(T×T) as a function of normalized confusion matrix C̃_species. For example, a propagation scheme of values C̃_species(i, k) from the leaves to the root is used to calculate values C̃_taxo(i, k) for pairs (i, k) of nodes other than the reference species. Particularly, for a pair of nodes (i, k) ∈ [1,T]² of the tree of the hierarchical representation for which a component C̃_taxo(i_C, k_C) has already been calculated for each pair of nodes (i_C, k_C) of set {i_C}×{k_C}, where {i_C} and {k_C} respectively are the sets of “daughter” nodes of nodes i and k, the component C̃_taxo(i, k) for pair (i, k) is set equal to the average of components C̃_taxo(i_C, k_C).
At step 148, the loss function Δ(y_i, k) of each pair of nodes (y_i, k) is calculated as a function of the normalized inter-node confusion matrix C̃_taxo.
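The sketch below follows the normalization and propagation just described; the row-wise division by N_i reflects the description of relation (16) (not reproduced here), the children map and bottom-up ordering are hypothetical inputs, and pairs mixing a leaf with an internal node are left out of this minimal version.

```python
import numpy as np

def node_confusion(C_species, N_per_species, leaf_of_species, children, bottom_up, T):
    """Step 146: row-normalize the inter-species confusion matrix by N_i, copy
    the values onto the corresponding leaf nodes, then fill pairs of internal
    nodes (taken children-before-parents) with the average over their
    daughter-node pairs."""
    C_tilde = np.asarray(C_species, float) / np.asarray(N_per_species, float)[:, None]
    C_taxo = np.zeros((T, T))
    for i, li in enumerate(leaf_of_species):
        for k, lk in enumerate(leaf_of_species):
            C_taxo[li, lk] = C_tilde[i, k]
    for i in bottom_up:                               # internal nodes, deepest first
        for k in bottom_up:
            vals = [C_taxo[ic, kc] for ic in children[i] for kc in children[k]]
            C_taxo[i, k] = float(np.mean(vals))
    return C_taxo
```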
According to a first option of step 148, loss function Δ(yi,k) is calculated according to relation:
where λ ≥ 0 is a predetermined scalar controlling the contribution of confusion matrix C̃_taxo in the loss function.
According to a second option of step 148, loss function Δ(yi,k) is calculated according to relation:
where ⌈·⌉ is the rounding to the next highest integer, and β ≥ 0 and l > 0 are predetermined scalars setting the contribution of confusion matrix C̃_taxo in the loss function. For example, by setting l = 10, confusion matrix C̃_taxo contributes by β per 10% of confusion between nodes (y_i, k).
According to a third option of step 148, a first component Δ_confusion(y_i, k) of loss function Δ(y_i, k) is calculated according to relation (17) or (18), after which loss function Δ(y_i, k) is calculated according to relation:
Δ(y_i, k) = α × Ω(y_i, k) + (1 − α) × Δ_confusion(y_i, k)  (19)
where 0≤α≤1 is a scalar setting a tradeoff between a loss function only determined by means of a confusion matrix and a loss function only determined by means of a distance in the tree of the hierarchical representation.
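Relations (17) and (18) are not reproduced in this excerpt; the confusion term below is only one possible reading of the description of relation (18) (a contribution of β per 10% of confusion when l = 10), while the blend implements relation (19) as given above.

```python
import numpy as np

def delta_confusion(C_taxo, beta=1.0, l=10):
    """A guess at relation (18), based solely on its description: a contribution
    of beta per 1/l of confusion, i.e. beta * ceil(l * C_taxo)."""
    return beta * np.ceil(l * np.asarray(C_taxo))

def blended_loss(Omega, C_taxo, alpha=0.5, beta=1.0, l=10):
    """Relation (19): Delta = alpha * Omega + (1 - alpha) * Delta_confusion."""
    return alpha * np.asarray(Omega) + (1.0 - alpha) * delta_confusion(C_taxo, beta, l)
```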
In a second example of the loss function calculation method, step 142 corresponds to the execution of a multi-class SVM algorithm which solves a single optimization problem for all reference species k ∈ [1,K], each training vector x_i being associated with its reference species bearing as a reference number an integer y_i ∈ [1,K], according to relations
under constraints:
ξ_i ≥ 0, ∀i ∈ [1,N]  (21)
⟨w_{y_i}, x_i⟩ ≥ ⟨w_k, x_i⟩ + 1 − ξ_i, ∀i ∈ [1,N], ∀k ∈ [1,K]\y_i  (22)
where ∀k ∈ [1,K], w_k ∈ ℝ^p is a weight vector associated with species k.
The prediction model is provided by the following relation and applied, at step 144, to each of calibration vectors x̃_i:
G(x̃_i) = arg max_k ⟨w_k, x̃_i⟩, k ∈ [1,K]  (23)
Steps 146 and 148 of the second example are identical to steps 146 and 148 of the first example.
In a third example of the loss function calculation method, step 142 corresponds to the execution of the structured multi-class SVM based on a hierarchical representation according to relations (2), (3), (4), (5) or (2), (3), (4), (6). At step 144, the prediction model according to the following relation is then applied to each of calibration vectors x̃_i:
G(x̃_i) = arg max_k ⟨W, Ψ(x̃_i, k)⟩, k ∈ E  (29)
where E = {y_k^species} is the set of references of the nodes of the tree of the hierarchical representation corresponding to the reference species.
An inter-species confusion matrix C_species ∈ ℝ^(K×K) is then deduced from the results of the prediction on calibration vectors x̃_i, and the loss function calculation method carries on identically to that of the first example.
Of course, the confusion may be calculated according to prediction results bearing on all the taxons in the tree.
Embodiments where the SVM algorithm implemented to calculate the classification model is a structured multi-class SVM model based on a hierarchical representation, particularly an algorithm according to relations (2), (3), (4), (5) or according to relations (2), (3), (4), (6), have been described.
The principle of loss functions Δ(y_i, k), which quantify an a priori proximity between the classes envisaged by the algorithm, that is, the nodes of the tree of the hierarchical representation in the previously-described embodiments, also applies to multi-class SVM algorithms which are not based on a hierarchical representation. For such algorithms, the considered classes are the reference species, represented in the algorithms by integers k ∈ [1,K], and the loss functions are only defined for the pairs of reference species, and thus for pairs (y_i, k) ∈ [1,K]².
Particularly, in another embodiment, the SVM algorithm used to calculate the classification model is the multi-class SVM algorithm according to relations (20), (21), and (22), replacing function ƒ(ξ_i) = 1 − ξ_i of relation (22) with function ƒ(Δ(y_i, k), ξ_i) according to relation (5) or relation (6), that is, according to relations (20), (21), and (22bis):
⟨w_{y_i}, x_i⟩ ≥ ⟨w_k, x_i⟩ + ƒ(Δ(y_i, k), ξ_i), ∀i ∈ [1,N], ∀k ∈ [1,K]\y_i  (22bis)
The prediction model applied to identify the species of an unknown microorganism then is the model according to relation (23).
Experimental results of the method according to the invention will now be described, in the following experimental conditions:
The performance of the method according to the invention is assessed by means of a cross-validation defined as follows:
Further, different indicators are taken into account to assess the performance of the method:
The following algorithms have been analyzed and compared:
The parameter C retained for each of these algorithms is that providing the best micro-accuracy and macro-accuracy.
The following table lists for each of these algorithms the micro-accuracy and the macro-accuracy.
These results, and particularly the above table and
It can thus be deduced from the foregoing that the introduction of a priori information, in the form of a hierarchical representation of the reference species, particularly a taxonomy and/or clinical phenotype representation, and of quantitative distances between species, in the form of loss functions, makes it possible to manage the tradeoff between, on the one hand, the global accuracy of the identification of unknown microorganisms and, on the other hand, the severity of identification errors.
Analyses have also been made on loss functions equal to a convex combination of the distance in the tree and confusion loss function according to relation (19), more particularly for the “SVM_cost_taxo_conf” algorithm according to relations (20), (21), (22bis). Function ƒ(Δ(yi,k),ξi) is defined according to relation (6) and loss functions Δ(yi,k) are calculated by implementing the second example of the method of calculating loss functions Δ(yi,k), with Δ(yi,k) being defined according to relations (18) and (19), replacing the inter-node confusion matrix with the inter-species confusion matrix. The “SVM_cost_taxo_conf” algorithm has been implemented for different values of parameter α, that is, values 0, 0.25, 0.5, 0.75, and 1, parameter β in relation (18) being equal to 1, and parameter C in relation (20) being equal to 1,000. The results of this analysis are illustrated in
As can be noted in the drawings, when parameter α comes close to one, the loss functions being thus substantially defined only by the distance in the tree of the hierarchical representation, the accuracy decreases and the severity of errors increases. Similarly, when parameter α comes close to zero, the loss functions being substantially defined from a confusion matrix only, the accuracy per species decreases and the severity of errors increases.
However, for values of parameter α within range [0.25; 0.75], and particularly within range [0.25; 0.5], a greater accuracy can be observed, the lowest accuracy per species being greater by 60% than the lowest accuracy per species of the SVM_cost_0/1 algorithm. A substantial decrease of severe prediction errors, and particularly having a taxonomy cost greater than 6, can also be observed. Further, it can be observed that for values of α close to 0.5, particularly for value 0.5 illustrated in the drawings, the number of errors having a taxonomy cost equal to 2 is decreased as compared with the number of errors of same cost with values of α close to 0.25.
Preliminary analyses show a similar impact for a “SVM_struct_taxo_conf” algorithm implementing relations (2), (3), (4), (8)-(10) with, as a function ƒ(Δ(yi,k),ξi), that defined at relation (6) and, as loss functions Δ(yi,k), those calculated by implementing the second example of the method of calculating loss functions Δ(yi,k) by using relations (18) and (19).
Embodiments applied to MALDI-TOF-type mass spectrometry have been described. These embodiments apply to any type of spectrometry and spectroscopy, particularly, vibrational spectrometry and autofluorescence spectroscopy, only the generation of training vectors, particularly the pre-processing of spectra, being likely to vary.
Similarly, embodiments where the spectra used to generate the training data have no structure have been described.
Now, the spectra are “structured” by nature, that is, their components, the peaks, are not interchangeable. Particularly, a spectrum comprises an intrinsic sequencing, for example, according to the mass-to-charge ratio for mass spectrometry or according to the wavelength for vibrational spectrometry, and a molecule or an organic compound may give rise to a plurality of peaks.
According to the present invention, the intrinsic structure of the spectra is also taken into account by implementing non-linear SVM-type algorithms using symmetric, positive-definite kernel functions K(x,y) which quantify the structural similarity of a pair of spectra (x,y). The scalar products between two vectors appearing in the above-described SVM algorithms are then replaced with said kernel functions K(x,y). For more details, reference may for example be made to chapter 11 of the document “Kernel Methods for Pattern Analysis” by John Shawe-Taylor & Nello Cristianini, Cambridge University Press, 2004.
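As a purely illustrative example of such a kernel, a Gaussian kernel is one classical symmetric positive-definite function that could replace the scalar products of the algorithms above; the text does not prescribe any particular kernel.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1e-3):
    """Classical symmetric positive-definite kernel K(x, y) = exp(-gamma * ||x - y||^2),
    given only as an example of a kernel that could replace the scalar products."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))
```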
This application is a continuation of U.S. application Ser. No. 14/387,777, which is a national stage of PCT/EP2013/056889 filed Apr. 2, 2013, each of which claims priority of European application No. EP12305402.5 filed Apr. 4, 2012, the contents of each of which are incorporated by reference herein.