The present invention relates to the technical field of molecular biology applied to bacterial genomics, and in particular to the field of predicting phenotypic traits of bacteria from their genomes. The invention is particularly applicable to the prediction of antibiotic susceptibility and virulence of bacteria present in a biological sample.
A. Molecular Technology for Phenotype Prediction
The susceptibility of a bacterial strain to an antibiotic, i.e. its sensitivity or resistance in the context of a treatment based on the antibiotic administered to a human or an animal, cannot be directly observed by a human being. Specifically, direct observation of the strain, even using microscopes, does not make it possible to determine its behavior toward the antibiotic. In vitro diagnosis in a bacterial context consists, by nature, in making this phenotypic nature observable and therefore ultimately exploitable for a clinician. During the 20th century, in vitro diagnostic technologies essentially combined culture-based sample preparation techniques, notably to make the bacterial strains present in the samples visible and manipulable, and techniques for optical measurement of the behavior of the strains in the presence of an antibiotic. For example, a conventional microbiology laboratory workflow involves first spreading a sample from a patient suspected of having a bacterial infection on a culture medium to produce bacterial colonies that are visible to a human operator or an automated system after incubation. In a second stage, when the colonies are large enough, a technician or an automated system takes a colony, mixes it with an antibiotic at different concentrations and introduces the mixtures into a device that measures the optical density of each mixture and deduces therefrom the susceptibility to the antibiotic. Since the optical density indicates the bacterial proliferation, it unambiguously characterizes the sensitivity or resistance of the bacterium: if the density increases, this means that the bacterium is proliferating despite the presence of the antibiotic, and thus that it is resistant to the antibiotic at the concentration under consideration.
Today, the combination of sample preparation technologies and optical density-based measurement technologies has significant limitations in the face of rapid global development in the prokaryotic kingdom, namely the acquisition of multi-drug resistance to antibiotics, which is estimated to be responsible for more deaths than cancer by 2050. First of all, these technologies are not agnostic as regards bacteria. Specifically, depending on the chosen culture medium, certain strains will grow and others will not, and as such these technologies do not make it possible to characterize the antibiotic susceptibility of all bacterial species.
Secondly, these techniques are extremely slow since they are based on bacterial culturing that takes a long time. Thus, obtaining an antibiogram of a bacterium takes at least 30 hours from the time the sample is taken. This delay does not allow effective treatment of patients who are systematically given a broad-spectrum antibiotic cocktail as a first-line treatment. In addition to the consequences for the patient, this inappropriate and massive administration of antibiotics reinforces the selection pressure of multi-drug-resistant bacteria and thus contributes toward their expansion. It is thus believed at the present time that conventional in vitro diagnostic technologies are increasingly unsuitable for treating patients and are, to a certain extent, one of the reasons for the emergence of multi-drug resistance.
The maturation of molecular biology technologies, in particular bacterial DNA and/or RNA characterization technologies, such as polymerase chain reaction (PCR), DNA chips or sequencing, is bringing about a paradigm shift in the analysis of antibiotic resistance in laboratories. Firstly, they are more agnostic toward bacterial species. For example, metagenomic technology makes it possible to process bacterial DNA in a biological sample irrespective of the bacterial species present. Secondly, they aim to provide a result in a few hours, with some, such as PCR, even providing a result in less than 20 minutes. On the other hand, molecular techniques for characterizing antibiotic susceptibility are based on genomic signatures (absence/presence of genes, genetic mutations, predictive models, etc.) characterizing said susceptibility.
One of the main differences between these two molecular technologies is the nature of the genomic signatures. In the case of PCR, the genomic signature is molecular, and therefore tangible: it is translated into primers introduced into a reaction mixture, these primers specifically targeting bacterial DNA sequences introduced into the mixture, and its detection is generally achieved by measuring an optical signal. In contrast, in the case of WGS, genomic signatures are digital since the sequencing produces a digital genome and the processing of said genome is computerized. Whereas the WGS technology allows, at the very least, the implementation of PCR genomic signatures, it above all allows complex exploitation of the digital genome and the use of predictive models of antibiotic susceptibility that are impossible to implement using PCR technologies. Thus, the WGS technology is advantageously based on a computer design of the genomic signatures 30, this design advantageously exploiting large genomic and phenotypic knowledge bases with the aid of complex analytical tools, for example machine learning technologies such as parsimoniously constrained logistic regressions.
This being the case, all molecular techniques are based on the same technical foundation, namely the measurement of genomic information from bacterial strains and the processing of said information so as to extract information regarding the behavior of the strains in the presence of an antibiotic. Moreover, while these computer technologies, and more particularly those of sequencing, are different in nature from those associated with optical density analysis of more conventional microbiological technologies, they do not change the technical nature of the in vitro diagnosis they perform. For example, in the case of infectious diagnostics, it is still a matter of applying technologies for processing a biological sample so as to determine whether a patient has a bacterial infection and to know the behavior of the infectious bacterium in the presence of an antibiotic in order to administer an appropriate antibiotic treatment.
B. Interpretability of Molecular Technologies for Phenotype Prediction
Focusing more particularly on genomic signatures, the first approaches consisted in identifying previously identified antibiotic resistance markers in the bacterial genome, referred to as “direct association” approaches. While these approaches are effective when the genetic mechanisms leading to resistance are well known and simple, as is the case for most resistance mechanisms in species such as Mycobacterium tuberculosis, they suffer from major limitations: incomplete knowledge of resistance mechanisms in many species and antibiotics, resulting, for example, in incomplete databases, difficulty in taking into account differences in the predictive powers of markers and the multifactorial aspect of antibiotic susceptibility (e.g. epistasis, combination of multiple mutations, etc.), etc. Faced with these difficulties, the genetic determinism of antibiotic susceptibility is being addressed more effectively by new approaches based on advanced computer technologies, and in particular by supervised machine learning technologies, the learning and application architecture of which can be summarized as follows:
The above generic description involves first defining the machine learning variables that describe the bacterial genome. There are many ways to describe said genome, one of them being the description in “k-mers”, i.e. the list of nucleic acid sequences of length k (i.e. number of bases) constituting the genome. As described in M. Jaillard Dancette's thesis, “Vers une cartographie des polymorphismes liés à la résistance aux antimicrobiens [Toward a mapping of polymorphisms related to antimicrobial resistance]”, 2018, this description is particularly suited to bacterial genomes, which are haploid and highly plastic compared to eukaryotic genomes. In other words, this description effectively captures the diversity of genetic mechanisms underlying antibiotic susceptibility in bacteria.
However, this description has a number of drawbacks that may have a negative impact on machine learning technologies, including the following:
For high-risk fields, particularly that of human health, it is important to reduce the dimensionality of the variables to increase the interpretability of prediction models. Specifically, machine learning tools are sensitive to biases in the learning data, for example a lack of genomic diversity, and to biases associated with an incomplete formulation of susceptibility as a function of the bacterial genome, for example a failure to take into account strong correlations between different genomic regions. By reducing the dimensionality, prediction models become easier to interpret for both learning tool experts and bacterial genomics experts, enabling biases to be detected and thus enabling the construction of appropriate learning data or the reformulation of the problem that the learning tool is intended to solve. Similarly, since they are more interpretable, prediction models are easier to validate for use in a high-risk field if no biases are apparent after their analysis.
Among the tools that allow a strong reduction in dimensionality, parsimonious machine learning tools, for instance penalized lasso regressions or decision tree-based tools, make it possible to obtain a number of predictive k-mers, i.e. k-mers that are retained in the prediction model, of the order of a thousand or even a hundred. However, these tools are unstable in a high-dimensional environment in which the variables are highly correlated. Thus, they can select predictor variables that together form a genomic unit that does not necessarily have any biological reality, and as such the prediction models remain difficult to interpret. Certain techniques can take into account strong correlations between variables, notably elastic-net penalty-based regressions which combine L1-type penalties with L2-type penalties, leading to the selection of groups of correlated predictor variables. It should be noted, however, that this clustering remains mainly algorithmic and that the groups of variables retained are still difficult to interpret biologically.
Other tools, for instance group-lasso tools, enable the a priori clustering of descriptive variables into genomic units. In this context, all the variables in a unit are either predictive or not, depending on the selection or non-selection of the unit by the group-lasso strategy. However, since the description in k-mers is difficult to interpret for the reasons mentioned above, the a priori definition in explanatory units is difficult. In particular, this assumes that high correlation phenomena in a high-dimensional space are well understood and formalized by experts in bacterial genomics, since a lack of knowledge, or imperfect knowledge of these phenomena, translates into bias in the machine learning algorithm.
C. Application of Molecular Prediction Technologies to Biological Variability
In parallel to the problem presented above, when a prediction is based on groups of genome sequences, the question arises as to whether a group should be considered present in the genome of a strain only if all the sequences that make up the group are present in the genome in an identical manner. If such a criterion is applied, it assumes that the learning data are complete enough to encompass all the genomic variability of the bacterial species. Besides the fact that it is difficult to judge the completeness of the learning data at a given time, said data very often become incomplete over time for certain bacterial species due to the very significant plasticity of their genome. Consequently, the application of a criterion that is too strict leads to a high rate of false positives or false negatives.
The problems which have just been described in relation to the genomic prediction of the susceptibility of a bacterial strain to one or more antibiotics arise in the same terms for any genomic determination of a phenotypic trait of the strain, for example its virulence or its ribotype.
The aim of the present invention is to provide a genomic prediction of a phenotypic trait, in particular of the susceptibility to an antibiotic, which is more representative of the biological reality of said trait of the bacterial species to which the strain belongs while at the same time maintaining the performance of the prediction.
To this end, one subject of the invention is a process for determining a phenotypic trait of a bacterial strain, involving:
According to the invention, said groups are chosen so that:
In other words, the invention forces machine learning to consider genome sequences of equivalent weight in the prediction of the phenotypic trait as groups rather than as independent variables. The inventors thus noted that the prediction does not suffer a degradation in performance while, at the same time, the learned group(s) that have the most predictive weight correspond to a biological reality, for example correspond to whole genes or groups of specific mutations. This approach has been validated notably on bacterial species whose genomic origin of susceptibility is well documented.
In the text hereinbelow, an application to antibiotic susceptibility is described, it being understood that the application also applies to any phenotypic trait.
According to one embodiment of the invention, a first subset of the genome sequences is a set of unclustered variables of a first susceptibility prediction model. In other words, this filtering makes it possible to eliminate from the outset the sequences that have no predictive weight by advantageously using unconstrained learning tools for clustering, which are thus a priori the most efficient.
In particular, the first prediction model is a model trained on a database of genomes and susceptibilities of bacterial strains belonging to the bacterial species, the learning belonging to the class of parsimonious learning, advantageously a logistic regression, notably of the LASSO type. The term “parsimonious learning” means learning that defines a prediction model whose variables form a subset of the descriptive variables of a problem, directed toward minimizing the size of said subset without having an impact on the prediction performance of the model.
In other words, parsimonious approaches allow the selection of a first set of highly predictive genome sequences, which will be joined by other sequences that have very similar co-occurrence patterns. More particularly, a parsimonious approach can identify regions of the genome that are involved in antibiotic susceptibility. Advantageously, a second subset of the genome sequences is a set of genome sequences having a co-occurrence rate with the first set above the predetermined threshold. In other words, while parsimonious approaches allow an initial identification of genomic regions of interest, they generally fail, due to correlations between variables, to exhaustively describe said regions. By extending the first set of sequences obtained by parsimonious learning with a second set of co-occurring sequences, a set of sequences allowing an exhaustive description of said regions is generated.
According to one embodiment, the genome sequences correspond to nodes of a compacted graph obtained from genome sequences of constant length, of length k in number of bases, or “k-mers”, and occurring in the genomes of strains belonging to the bacterial species. In particular, said compacted graph is calculated by a) calculating a De Bruijn graph of alphabet A, T, G, C from said genome sequences of constant length, b) compacting linear paths of the De Bruijn graph in order to obtain said compacted graph. In this manner, a set of unique genome sequences is obtained, the redundancy existing in the descriptive sequences of constant length, namely the k-mers, being eliminated.
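Purely by way of illustration, the compaction of steps a) and b) may be sketched in Python as follows. This is a minimal, forward-strand-only sketch under stated assumptions: the function name is illustrative, and reverse complements, which a production tool such as DBGWAS handles, are ignored here. K-mers are taken as nodes, an edge joining two k-mers when the (k-1)-suffix of the first equals the (k-1)-prefix of the second, and maximal non-branching paths are merged into unitigs:

```python
from collections import defaultdict

def compact_debruijn(kmers):
    """Compact the linear paths of a node-centric De Bruijn graph into
    unitigs (forward strand only; illustrative sketch)."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))

    # index k-mers by their (k-1)-prefix to find successors quickly
    prefix_index = defaultdict(list)
    for km in kmers:
        prefix_index[km[:k - 1]].append(km)

    succs = {km: prefix_index.get(km[1:], []) for km in kmers}
    preds = defaultdict(list)
    for km, ss in succs.items():
        for s in ss:
            preds[s].append(km)

    def is_path_start(km):
        # a unitig starts where the walk cannot be extended unambiguously left
        return len(preds[km]) != 1 or len(succs[preds[km][0]]) != 1

    unitigs, visited = [], set()
    for km in kmers:
        if km in visited or not is_path_start(km):
            continue
        path, cur = [km], km
        visited.add(km)
        # extend right while the path stays linear (out-degree 1, in-degree 1)
        while len(succs[cur]) == 1 and len(preds[succs[cur][0]]) == 1:
            cur = succs[cur][0]
            if cur in visited:
                break
            visited.add(cur)
            path.append(cur)
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return unitigs
```

As a usage example, the five 3-mers of the sequence ATGGCGT form a single linear path and are compacted back into that sequence.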
According to one embodiment, the clustering of the genome sequences involves:
In this manner, sequences that are systematically present at the same time in each of the genomes of the strains of the bacterial species which was used for constructing the compacted graph, or absent at the same time, or some present and some absent at the same time, are clustered into single genomic units, reducing the number of variables, and also a hidden redundancy.
According to one embodiment, the value of a group of genome sequences for the application of the prediction model applied to the bacterial strain is equal to the percentage of said first groups of said group that are present in the genome of the bacterial strain (328).
According to one embodiment, the value of a group of genome sequences for the application of the prediction model applied to the bacterial strain is calculated by:
According to one embodiment, the clustering of genome sequences involves a) calculating a dendrogram based on the co-occurrence rates of genome sequences in genomes of strains belonging to the bacterial species, and b) clustering the dendrogram at a predetermined height so as to obtain the groups of genome sequences. In particular, the dendrogram is calculated on the first groups of genome sequences. This occurrence-based hierarchical approach allows the number of sequence groups to be comprehensibly adjusted. By adjusting the cluster height, it is possible to optimize the biological value of the selected groups.
According to one embodiment, the prediction model is trained on a database of genomes and susceptibilities of bacterial strains belonging to the bacterial species, the learning algorithm belonging to the parsimonious learning class, in particular based on logistic regression. In particular, the learning is performed by logistic regression, notably of the LASSO type. By the parsimonious approach, the most predictive groups are selected, these groups constituting genomic signatures of susceptibility.
Advantageously, the value of a group of genome sequences for parsimonious learning is calculated by
According to one embodiment, the value of a group of genome sequences for the prediction model applied to the bacterial strain is equal to the percentage of genome sequences of said group present in the genome of the bacterial strain. In particular, a genome sequence is present in the genome of the bacterial strain when a percentage of genome sequences of constant length constituting said sequence is higher than a predetermined threshold.
A subject of the invention is also a process for identifying genomic signatures that are predictive of antibiotic susceptibility of a bacterial species, the process involving:
A subject of the invention is also a use of a genomic signature identified according to a process for identifying genomic signatures of the abovementioned type, in a molecular test for predicting a phenotypic trait, in particular susceptibility to an antibiotic, in particular a polymerase chain reaction type test, or a DNA chip-based test, said test at least partially targeting the genomic signature.
A subject of the invention is also a computer program product storing computer-executable instructions for applying a model for predicting a phenotypic trait, in particular the susceptibility to an antibiotic, of a bacterial strain of the abovementioned type.
A subject of the invention is also a system for determining the susceptibility to an antibiotic of a bacterial strain, comprising:
A subject of the invention is also a computer program product storing computer-executable instructions for performing a process for identifying genomic signatures predictive of a phenotypic trait, in particular of the susceptibility to an antibiotic, of a bacterial strain of the abovementioned type.
The invention will be better understood on reading the following description, which is given purely by way of example, and made in relation to the appended drawings, in which identical references denote identical or similar elements, and in which:
With reference to
The first part 30 begins by establishing, in 300, a genomic and phenotypic database for said species. In particular, a set of strains is collected, for example from patients infected with said species, and each collected strain is sequenced to obtain its complete genome, for example using a MiSeq sequencing platform from the company Illumina, and an antibiogram is established to determine its susceptibility to the antibiotic—resistant (“R”), intermediate (“I”) or sensitive (“S”) in accordance with the critical concentrations (or “breakpoints”) of the CLSI standard or the EUCAST standard—for example using a Vitek 2 from the company bioMérieux. Preferably, the genomes take the form of assembled sequences (or “contigs”) produced by digitally assembling the digital sequences produced by the sequencing platform (or “reads”) in a manner known per se. The complete digital genome and antibiotic susceptibility of each strain are stored in a computer database to form a learning data set and a test data set.
Advantageously, but optionally, the “resistant” and “intermediate” states are merged so as to obtain two antibiotic susceptibility states. In this way, a binary classification problem is defined which distinguishes between sensitive (“S”) bacterial strains and non-sensitive (“NS”) bacterial strains. For example, the S state is coded with the number 0 and the NS state is coded with the number 1.
The remainder of step 30 is performed by computer and begins, in 302, by splitting the database in two, so as to obtain a learning database and a test database. Preferably, the split is a “10-fold” split, with nine-tenths of the database constituting the learning database and the remaining tenth constituting the test database. The total number of bacterial strains used for the learning database is noted hereinbelow as N.
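Purely by way of illustration, the split of step 302 may be sketched as follows; the function name and the fixed seed are illustrative assumptions, and only a single fold of the “10-fold” scheme is shown:

```python
import random

def split_database(records, seed=0):
    """Step 302 sketch: shuffle the strain records and reserve one tenth
    as the test database, the remaining nine tenths forming the learning
    database (one fold of a 10-fold scheme)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 10
    return shuffled[cut:], shuffled[:cut]   # (learning base, test base)
```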
In the next step 304, a set of descriptive variables of the bacterial genome is determined. With reference to
In a subsequent step 310, the unitigs are clustered into units by deduplication, the minor allele frequency (MAF) of which units is optionally higher than a predetermined threshold “SMAF” of between 98% and 99.5%, for example 99%. In particular:
The remaining units, referred to hereinbelow as the “MAF units”, are unique, different from each other, and are subsequently used as variables in the machine learning tools described below. This clustering into units and filtering significantly reduce the redundancy induced by the description of the genomes in k-mers without modifying the intrinsic information value of the MAF units regarding their possible involvement in antibiotic susceptibility. Steps 304-310 are described in greater detail in the thesis by M. Jaillard Dancette, “Vers une cartographie des polymorphismes liés à la résistance aux antimicrobiens [Towards a mapping of polymorphisms related to antimicrobial resistance]”, 2018, and in the article by Jaillard M. et al., “A fast and agnostic method for genome-wide association studies: Bridging the gap between k-mers and genetic events”, PLOS Genetics, 2018, and implemented, for example, using the DBGWAS software developed by M. Jaillard (https://gitlab.com/leoisl/dbgwas).
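Purely by way of illustration, the deduplication into units of step 310 may be sketched as follows. The exact form of the SMAF filter is an assumption of this sketch: unitigs sharing an identical presence/absence pattern across the genomes are grouped into a single unit, and units whose majority presence/absence state exceeds the threshold are discarded as quasi-invariant:

```python
import numpy as np

def deduplicate_units(X, smaf=0.99):
    """Group columns of the (N genomes x p unitigs) binary matrix X that
    carry an identical presence/absence pattern into units, then drop
    quasi-invariant units (illustrative interpretation of the SMAF filter)."""
    patterns = {}
    for j in range(X.shape[1]):
        # identical byte patterns <=> identical presence/absence columns
        patterns.setdefault(X[:, j].tobytes(), []).append(j)
    units = []
    for cols in patterns.values():
        col = X[:, cols[0]]
        majority = max(col.mean(), 1.0 - col.mean())  # majority-state frequency
        if majority <= smaf:                           # keep variable units only
            units.append(cols)
    return units
```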
The learning part 30 continues with a step 312 of clustering the MAF units into a limited number of variables that are both predictive of antibiotic susceptibility and have enhanced biological significance, referred to hereinbelow as “clusters”.
Step 312 begins, in 314, with the selection of MAF units that are predictive of antibiotic susceptibility. Advantageously, this selection is performed using a penalized logistic regression tool of the LASSO type. More particularly, for each value λ of a set of positive values {λ1, λ2, . . . , λm}, the following optimization problem is solved:
in which relationships:
Any MAF unit that is activated for any of the values of λ, and is thus predictive, is chosen, i.e. the jth MAF unit is chosen if ∃k∈{1, 2, . . . , m} such that β̂j(λk)≠0. As is known, LASSO tools incrementally activate their variables along the regularization path {λ1, λ2, . . . , λm}, with one or more variables being added to those already activated as the path is traversed, up to the activation of a maximum of N variables. The set {λ1, λ2, . . . , λm} is advantageously chosen so as to obtain activation of a maximum of MAF units. For example, it is calculated according to the method described in the article by Friedman et al., “Regularization Paths for Generalized Linear Models via Coordinate Descent”, Journal of Statistical Software, 2010, or as implemented, for example, by the package glmnet 3.0.2, implemented in R, and available, for example, from the website https://cloud.r-project.org/web/packages/glmnet/index.html. In particular, this method allows the number m of regularization values λ to be chosen in advance. This number is, for example, chosen equal to 100.
The chosen MAF units are referred to hereinbelow as “active MAF units”, the number of active MAF units being noted pa. Their ranks, noted ai (i.e. their column indices in the matrix X), are stored in a set a={a1, a2, . . . , ai, . . . , apa}.
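Purely by way of illustration, the selection of active units along the regularization path may be sketched as follows. Proximal gradient descent stands in here for the coordinate descent of glmnet; the step size, iteration count and λ grid are illustrative assumptions of this sketch:

```python
import numpy as np

def lasso_logistic_path(X, y, lambdas, n_iter=500, lr=0.1):
    """Return the sorted list of variables activated anywhere along the
    L1 regularization path (step 314 sketch)."""
    n, p = X.shape
    active = set()
    for lam in lambdas:
        beta = np.zeros(p)
        for _ in range(n_iter):
            # gradient of the average logistic loss
            grad = X.T @ (1.0 / (1.0 + np.exp(-(X @ beta))) - y) / n
            beta -= lr * grad
            # soft-thresholding: proximal operator of the L1 penalty
            beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
        active |= set(np.flatnonzero(np.abs(beta) > 1e-8).tolist())
    return sorted(active)
```

As a usage note, a variable perfectly associated with the phenotype is activated at moderate penalties, while a strong enough penalty deactivates every variable.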
Step 312 then continues, in 316, by identifying the MAF units not chosen in step 314 but which have a minimal co-occurrence rate in the genomes with the active MAF units, and then adding the MAF units thus identified to the list of active MAF units. The co-occurrence rate of two units is advantageously measured by the correlation between their corresponding columns of the matrix X. For this purpose, a matrix G, of dimensions pa×p, is calculated such that:
∀(i,j)∈[1,pa]×[1,p], Gi,j=cor(X·,ai, X·,j)
where cor is, for example, the Bravais-Pearson linear correlation, X·,ai denoting the column of the matrix X of rank ai and X·,j its jth column.
Then, any MAF unit having a correlation higher than a predefined threshold s1 with an active MAF unit is chosen, and added to the list of active MAF units, the set of MAF units thus chosen being referred to hereinbelow as “extended active MAF units”. In other words, the jth MAF unit, j∈[1, p], is chosen if ∃i/Gi,j>s1. s1 is a number that sets the required co-occurrence rate between the extended active MAF units. It is between 0.5 and 1, preferably between 0.8 and 1, and advantageously between 0.9 and 0.95, for example 0.95. The total number of extended active MAF units, noted pe, is then much lower than the initial number p of k-mers (pa<pe<<p), of the order of 10^3 as opposed to 10^5 to 10^6. The ranks of the extended active MAF units, noted el (i.e. their column indices in the matrix X), are stored in a set e={e1, e2, . . . , el, . . . , epe}.
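Purely by way of illustration, the extension of step 316 may be sketched as follows; the function name is an illustrative assumption, and numpy's Pearson correlation plays the role of cor:

```python
import numpy as np

def extend_active(X, active, s1=0.95):
    """Step 316 sketch: extend the active set with every column of X whose
    Bravais-Pearson correlation with at least one active column exceeds
    the co-occurrence threshold s1."""
    corr = np.corrcoef(X, rowvar=False)   # p x p correlation matrix
    G = corr[active, :]                   # rows: active units, columns: all units
    extended = set(active)
    for j in range(X.shape[1]):
        if np.any(G[:, j] > s1):
            extended.add(j)
    return sorted(extended)
```

In the example below, the third column duplicates the first (correlation 1), so it joins the extended active set while the uncorrelated middle column does not.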
In a subsequent step 318, groups, or “clusters”, of highly co-occurring extended active MAF units are explicitly defined by implementing a clustering analysis tool. Preferably, hierarchical clustering is implemented based on a distance matrix calculated from the co-occurrence rates between the extended active MAF units. The hierarchical clustering is, for example, that described in the article by Bühlmann P. et al., “Correlated variables in regression: Clustering and sparse estimation”, Journal of Statistical Planning and Inference, 2013, or implemented by the “hclust” function of the Stats 3.6.2 package implemented in R and available, for example, from the website https://www.rdocumentation.org/packages/stats/versions/3.6.2/source, using an aggregation criterion based on the closest distance (or “single-linkage”).
More particularly, the distance matrix used by the hierarchical clustering is a matrix D, of dimensions pe×pe, calculated as follows, where X·,ei denotes the column of the matrix X of rank ei:
∀(i,j)∈[1,pe]×[1,pe], Di,j=|1−cor(X·,ei, X·,ej)|
A dendrogram of the extended active MAF units is thus obtained. This dendrogram is then cut at a height 1−s2, as illustrated in
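Purely by way of illustration, the clustering of step 318 may be sketched as follows. A union-find over the thresholded distance graph replaces an explicit dendrogram here: connected components of the graph whose edges satisfy Di,j ≤ 1−s2 are equivalent to cutting a single-linkage dendrogram at height 1−s2 (an implementation choice of this sketch):

```python
import numpy as np

def cocluster(X, s2):
    """Step 318 sketch: group the columns of X whose pairwise distance
    |1 - cor| does not exceed 1 - s2, via connected components
    (equivalent to a single-linkage dendrogram cut at height 1 - s2)."""
    p = X.shape[1]
    D = np.abs(1.0 - np.corrcoef(X, rowvar=False))
    parent = list(range(p))

    def find(i):
        # union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if D[i, j] <= 1.0 - s2:
                parent[find(i)] = find(j)

    groups = {}
    for j in range(p):
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())
```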
In a subsequent step 320, for each cluster Cj, the unitigs constituting it are broken down into k-mers, with k between 15 and 50, for example k=31, by sliding, in steps of 1, a window of length k over the unitigs. For each cluster Cj, if qj is used to denote the number of unitigs constituting it, qj sets of k-mers respectively associated with the unitigs are stored. For the sake of clarity, the k-mers and unitigs associated with the clusters are called “cluster k-mers” and “cluster unitigs”, respectively.
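Purely by way of illustration, the decomposition of step 320 may be sketched as follows (the function name is illustrative):

```python
def kmerize(unitig, k=31):
    """Step 320: decompose a cluster unitig into its constituent k-mers by
    sliding a window of length k, in steps of 1, over the sequence."""
    return [unitig[i:i + k] for i in range(len(unitig) - k + 1)]
```

A unitig of length L thus yields L−k+1 k-mers.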
In a subsequent step 322, a model for predicting antibiotic susceptibility is trained, the variables of which are the clusters C1, C2, . . . , Cj, . . . , Cpc, pc denoting the number of clusters obtained in step 318.
Step 322 begins, in 324, by calculating the value of each cluster for each genome in the learning database. Advantageously, this value is equal to the average of the values of the MAF units constituting it. A matrix Y of dimensions N×pc (pc being the number of clusters) is thus obtained, such that:
∀(i,j)∈[1,N]×[1,pc], Yi,j = (1/|Cj|) Σu∈Cj Xi,u
where |Cj| is the number of extended active MAF units of the cluster Cj.
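Purely by way of illustration, the averaging of step 324 may be sketched as follows (the function name is illustrative):

```python
import numpy as np

def cluster_values(U, clusters):
    """Step 324 sketch: the value of a cluster for a genome is the mean of
    the values of the MAF units constituting it.  U is an
    (N genomes x units) presence/absence matrix; the result holds one
    column per cluster."""
    return np.column_stack([U[:, cols].mean(axis=1) for cols in clusters])
```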
In 326, several prediction models of antibiotic susceptibility are trained from the learning database, with the next step 328 consisting in selecting the one with the best performance according to predetermined criteria.
Advantageously, the prediction models are trained using penalized logistic regression which allows the number of predictive clusters finally retained by the model to be reduced while maintaining a high level of performance. In particular, the prediction models are models according to the following relationships:
in which relationships:
Step 326 starts by training the model G using a parsimonious logistic regression tool on the learning database, e.g. a LASSO type penalized regression. In particular, a model G is calculated for each value λic of a set of positive values {λ1c, λ2c, . . . , λic, . . . , λMc} by solving an optimization problem according to the relationship:
in which relationship:
For example, it is calculated according to the method described above in connection with step 314, with a number M=100. 100 models are thus obtained, noted Gλic, and thus 100 prediction models according to the relationships (5)-(6), each depending on the threshold value Sp, denoted hereinbelow as Gλic,Sp.
An estimate of the performance of each prediction model Gλic,Sp is then produced on the test database.
In particular, in a step 330, for each genome in the test base, the following are performed:
Advantageously, the detection threshold suni depends on the length of the cluster unitig. In particular, for k between 15 and 50, for example k=31, noting L the length of a cluster unitig for which it is sought to know the presence in the genome, if L≤61 then suni=90%, if 61<L≤100 then suni=80% and if 100<L then suni=70%.
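Purely by way of illustration, the length-dependent presence test may be sketched as follows (the function name is illustrative, and a toy k=3 is used in the example below purely for brevity, whereas the text uses k between 15 and 50):

```python
def unitig_present(unitig, genome_kmers, k=31):
    """Presence test of a cluster unitig in a genome (step 330 sketch):
    the unitig is declared present when the fraction of its k-mers found
    in the genome exceeds the length-dependent threshold s_uni."""
    L = len(unitig)
    s_uni = 0.90 if L <= 61 else (0.80 if L <= 100 else 0.70)
    kmers = [unitig[i:i + k] for i in range(L - k + 1)]
    hits = sum(1 for km in kmers if km in genome_kmers)
    return hits / len(kmers) > s_uni
```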
Step 328 continues, in 330, by calculating the value of a cluster as being equal to the mean of the detection indicators of the extended MAF units constituting it, determined in the preceding step. It should be noted that for options i, ii and iii, this mean corresponds to the percentage of extended MAF units detected as present. Once the cluster values have been calculated for all the genomes in the test base, step 330 continues with a model selection strategy that maximizes the sensitivity, specificity and parsimony of the model (i.e. minimizes the number of clusters actually used to predict susceptibility). To do this, for each model Gλic,Sp, the sensitivity, equal to TP/(TP+FN), and the specificity, equal to TN/(TN+FP), are calculated,
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives described in Table 1, respectively.
The values of sensitivity and of (1 minus the specificity) for the threshold values Sp are then plotted on an ROC curve on the y-axis and x-axis respectively, and the area under the ROC curve, denoted “AUC”, is calculated and stored.
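Purely by way of illustration, the ROC sweep and AUC calculation may be sketched as follows; the trapezoidal integration and the function name are implementation choices of this sketch:

```python
def roc_auc(scores, labels):
    """Sweep the decision threshold Sp over the observed scores, record
    (1 - specificity, sensitivity) points and integrate the ROC curve by
    the trapezoidal rule (steps 330 onwards, sketch)."""
    pts = []
    for sp in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= sp and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < sp and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < sp and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= sp and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        pts.append((1.0 - spec, sens))
    pts.sort()
    # area under the piecewise-linear ROC curve
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

A perfectly separating model reaches an AUC of 1.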
An optimal value of the threshold Sp is then determined for each model Gλic.
The model finally retained among the models Gλ is, for example, the one achieving the best compromise between sensitivity, specificity and parsimony, in accordance with the selection strategy described above.
The chosen model is stored in a computer memory for subsequent use 40 consisting in predicting the antibiotic susceptibility for a bacterial strain of the bacterial species. The retained clusters, and thus the constituent unitigs, thus form a genomic signature of antibiotic susceptibility.
In particular, this prediction consists in:
The process described in
Table 3 lists the performance of the model trained in accordance with the process according to the invention, known as the “cluster-lasso model” or “cluster-lasso”, and the performance of a prediction model trained solely on the basis of a prior art lasso logistic regression, called the “lasso model” or “lasso”. The performance presented in this table corresponds to an estimation by cross-validation, according to the procedure described previously. For the latter, models according to relationships (1) and (2) are calculated, and thresholds and a final model are chosen in the manner described previously for the process according to the invention, and thus on the basis of the same performance criteria.
The “support” columns indicate the number of predictor variables retained for the prediction model, i.e. the number of clusters for which β̂j≠0 for the cluster-lasso and the number of “active MAF units” for which β̂j≠0 for the lasso. The “unitigs” columns indicate the total number of unitigs retained for the genomic signature, with the number of unitigs in the broadest of the predictor variables in parentheses.
As may be seen, the performance of the cluster-lasso model is similar to that of a prediction model whose learning variables are not constrained by clustering. It is thus noted that the two models show similar performance in terms of balanced accuracy bACC and of AUC, confirming that taking or not taking into account the correlation between the traits has a limited impact in terms of predictive performance. It is also noted that the model support is often slightly smaller with the cluster-lasso (for 8 out of 10 antibiotics), suggesting that several traits chosen separately with the lasso ended up merging into a single cluster through the cluster-lasso. As expected, the total number of unitigs involved in a cluster-lasso model is significantly larger. It is noted that this number is not evenly distributed among the predictive clusters. For example, in the model predicting susceptibility to meropenem, 159 of the 164 unitigs are present in a single cluster, suggesting the presence of a gene as a predictive genomic trait.
Similarly,
The interpretability can even be very detailed. For example, as regards the OmpK36 annotated subgraph obtained for the cluster-lasso signature (top right panel of
The second subgraph obtained for the lasso signature (bottom left panel of
The additional information provided by the cluster-lasso thus makes it possible to conclude that the first causal variable of cefoxitin resistance is a local mutation in the OmpK36 gene. Advantageously, molecular technologies (PCR, NGS, etc.) for predicting cefoxitin resistance will specifically target this mutation. Furthermore, the second causal variable is the acquisition of a complete blaKPC gene, and any DNA sequence specific to blaKPC may advantageously be used by such technologies to predict cefoxitin resistance.
Other bacterial species/antibiotic pairs were tested. Without going into as much detail as for Klebsiella pneumoniae, the following is described for Salmonella species, Staphylococcus aureus and Neisseria gonorrhoeae:
Tables 4 and 5,
Unlike the lasso model, which identifies a possible set of point mutations in the TetB genes as the origin of tetracycline resistance, without this being conclusive in view of the large number of mutations that this would imply, the invention based on the cluster-lasso attributes the acquisition of resistance to the presence of the TetA gene (cluster 1) and the TetB/TetD genes (cluster 2), as well as to the acquisition of the TetR gene.
As regards gentamicin resistance, the invention concludes that the AAC3 gene (cluster 1) is acquired and that the OXA, IMP and TEM genes are involved in the resistance mechanisms, whereas the lasso model fails to identify the OXA and IMP genes.
Tables 6 and 7,
As regards cefixime resistance in N. gonorrhoeae, the invention identifies the acquisition of several recombinations in the penM gene.
Staphylococcus aureus
Tables 8 and 9,
As regards tetracycline resistance in S. aureus, the invention identifies the acquisition of the TetK gene (cluster 1), but rules out the acquisition of the TetM gene (clusters 2 and 3) as a highly predictive genomic determinant of tetracycline resistance in view of the low coefficients of the clusters involved, unlike the lasso model, which interprets the TetM gene as highly predictive.
Steps 302, 304, 312, 320, just like steps 60 and 80 described below, are performed by computer, for example a computing unit comprising one or more processors, storage space and random-access memory, capable of storing computer instructions which, when executed, perform the calculations described previously. The computing unit is, for example, a personal computer, a server or a computing cluster. Similarly, steps 402, 404 are performed by computer, for example a computing unit as described previously. The unit performing steps 302, 304, 312, 320 and the unit performing steps 402, 404 may be different units or one and the same unit. Advantageously, the predicted susceptibility is displayed on a computer screen, stored in a laboratory or hospital computer system to supplement a patient's record when the bacterial strain infects a patient, or transmitted to a clinician's mobile device, for example a smartphone.
Step 330 describes a way of detecting the presence or absence of a genome sequence, in particular a unitig, or of a set of genome sequences, in particular a set of units clustering unitigs, in a bacterial genome. More generally, the embodiment addresses the issue of whether a sequence or set of sequences should be detected identically in the genome, or whether a certain level of difference may be accepted between the sequence or group of sequences sought and the sequences of the genome in order to decide on their presence or absence. In particular, as explained in the preamble, perfect homology assumes that the learning data is complete enough to encompass all the variability of the biological species, which is de facto difficult, notably in view of the plasticity of their genomes.
Furthermore, the sequenced genome can be tainted by error, notably when it is in the form of “reads”, i.e. sequences produced at the output of sequencing platforms before any bioinformatics processing such as consensus assembly or filtering of poor-quality reads. In this case, a sequence, although present in the genome, may be detected as absent due to sequencing errors and vice versa. In particular, bioinformatics processing generally includes filtering of low-quality reads, and optionally consensus assembly of the filtered reads to obtain assembled sequences, or “contigs”. The optional nature of the assembly generally depends on the context in which the sample analysis is performed. The effect of assembly is to significantly reduce sequencing errors in contigs, at the present time to a level of 10−5 for SBS technologies used in the platforms of the company Illumina Inc. and to a level of 10−2 for nanopore technologies used in the platforms of the company Oxford Nanopore Technologies Ltd. On the other hand, since assembly requires high computing power and time, it is not very well suited to “POC” (“point of care”) genomic applications where the computing environment is generally not very powerful and/or to fast, or even real-time applications. In this context, genomic analysis, for example to determine the identity of the one or more species present in the sample and/or their susceptibility to one or more antibiotics as described previously, is performed directly on the filtered or unfiltered reads. However, sequencing errors are of the order of 2 to 3% for SBS technologies and up to 12% for nanopore technologies. Without special precautions, genomic analysis can lead to an equally high error rate.
The process 50 comprises a step 70 of sequencing a sample comprising one or more microorganism strains, in the text hereinbelow, and by way of example only, one or more bacterial strains, and pre-processing the reads produced by the sequencing platform, and a step 80 of detecting a predetermined genome sequence in the genome of one of the bacterial strains. This step 80 optionally includes the detection of at least one predetermined set of genome sequences in said genome.
Step 80 uses a breakdown of the genome sequence, and also a certain number of parameters, calculated in a step 60, for example performed prior to the process 50, and stored in a database DB. More particularly, with reference to
Step 70 involves, as is known per se, and for example described in relation to
The step 80 of detection of the genome sequence SEQ starts with a first test, in 800, which consists in determining whether this detection is performed on contigs or on reads. If the detection is performed on contigs, i.e. genome sequences whose error rate is low enough to allow the use of perfect homology through the detection of k-mers, the process continues, in 802, by detecting the presence or absence of each k-mer kmi of the KM set in the contigs. In particular, the k-mer kmi is detected if it is identically present in at least one of the contigs.
In a subsequent step 804, a test is performed to determine whether the sequence SEQ should be detected identically in the bacterial genome. If so, the sequence SEQ is detected, in 806, if all the k-mers kmi in the KM set are detected as being present in the contigs. It is noted in this regard that due to the sequencing technology and/or the assembly technology, the sequence SEQ does not necessarily end up whole in a contig, but may instead be split between several contigs, with the probability of such splitting increasing with the length L of the sequence SEQ. The breakdown into k-mers thus makes it possible to identify the latter in the genome even if it is not present as such in the contigs.
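The exact detection of 802-806 can be sketched as follows; the helper names, example data and the choice k=31 (within the 15-50 range given earlier) are illustrative assumptions, not the patented code.

```python
# Minimal sketch of steps 802-806: break the target sequence SEQ into
# k-mers and declare it present "identically" when every k-mer occurs in
# at least one contig, even if SEQ itself is split across contigs.
import random

def kmers(seq, k=31):
    """Set of all k-mers of a sequence (empty if len(seq) < k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def detect_identical(seq, contigs, k=31):
    present = set()
    for contig in contigs:
        present |= kmers(contig, k)
    return kmers(seq, k) <= present  # all k-mers found somewhere

# An 80 bp sequence split over two contigs with an overlap of k-1 bases
# is still detected, because every one of its k-mers survives the split.
random.seed(1)
seq = "".join(random.choice("ACGT") for _ in range(80))
contigs = [seq[:60], seq[30:]]
```

Dropping the second contig removes the k-mers from the middle of the sequence, so the same call then reports the sequence as absent.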
If the sequence SEQ is not sought in an identical manner in 804, the genome is searched in 808 for at least the sequence SEQ or one of its variants. These variants correspond, for example, to mutations in the original sequence SEQ or to imperfect identification of the latter. As described above, the sequence SEQ may be the product of identification of a genetic determinant (e.g. of resistance, virulence, identity, etc.) based on a priori data or knowledge. If the latter is imperfect, the sequence SEQ may not reflect the full diversity of said determinant. By permitting the identification of the sequence SEQ or one of its variants, the process allows the initial imperfection of the data and knowledge to be corrected and thus the genetic determinant to be detected. To a lesser extent, it also allows possible residual sequencing errors in the contigs to be taken into account. Furthermore, independently of the error correction, it also makes it possible, from a single sequence SEQ, to detect whether at least one member of a set of variants is present in the genome, without having to perfectly detect each of said variants.
In a first variant, in 808, the sequence SEQ or one of its variants is detected if the percentage of its constituent k-mers present in the genome is greater than a predetermined threshold suni, for example greater than 70%. In a preferred variant, this percentage depends on the length L of the sequence SEQ, and more particularly decreases as a function of L. Specifically, it is preferable to keep a sufficiently long k-mer length, in particular greater than 15 and preferably greater than 30, so that the k-mers remain specific to the sequence SEQ. Thus, as the length L decreases, a given difference with the sequence SEQ is found in an increasingly large percentage of k-mers, whereas for a long sequence the same difference affects only a small percentage of them; the percentage suni decreasing as a function of the length L makes it possible to correct for this effect. As illustrated in
Once the sequence SEQ has been detected, optionally, the process continues, at 810, with the detection of a set of genome sequences, denoted {SEQ1, . . . , SEQi, . . . SEQE}, in a manner described below.
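The variant detection of 808, with the length-dependent threshold suni given earlier, can be sketched as follows; names and example data are illustrative assumptions, not the patented implementation.

```python
# Hedged sketch of step 808 with k=31: SEQ or one of its variants is
# declared present when the fraction of its k-mers found in the genome
# reaches s_uni(L), the length-dependent threshold given in the text.
import random

def s_uni(L):
    """Detection threshold decreasing with the sequence length L."""
    if L <= 61:
        return 0.90
    if L <= 100:
        return 0.80
    return 0.70

def detect_with_variants(seq, genome_kmers, k=31):
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    found = sum(km in genome_kmers for km in kms)
    return found / len(kms) >= s_uni(len(seq))

# Example: the genome carries a copy of a 150 bp sequence with one
# substitution in the middle; that mutation invalidates at most 31 of the
# 120 k-mers (~26%), so the 70% threshold still detects the variant.
random.seed(2)
seq = "".join(random.choice("ACGT") for _ in range(150))
variant = seq[:75] + ("A" if seq[75] != "A" else "C") + seq[76:]
genome_kmers = {variant[i:i + 31] for i in range(len(variant) - 30)}
```

Requiring perfect identity would reject this mutated copy, whereas the fractional threshold tolerates it while still rejecting unrelated sequences.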
If the detection of the sequence SEQ is performed on reads (test 800), and thus before any bioinformatics processing correcting sequencing errors, the process takes these errors into account to accurately detect the presence or absence of a k-mer in the reads, and thus in the genome. The advantage of detecting directly from the reads lies in the speed of data processing, which takes less than 2 to 3 minutes for a given computational environment, whereas assembly alone can take over an hour in the same environment.
In a first variant, a k-mer is detected if the reads contain a minimum number of copies of said k-mer, for example a number greater than or equal to 3. However, this variant has the drawback of not taking into account the sequencing depth of coverage. To a first approximation, the sequencing errors are distributed over the entire genome and as such the greater the sequencing depth of coverage, the higher the probability of detecting the k-mer. However, due to sequencing errors, it is difficult to ascertain whether the k-mer is actually present in the genome or whether the k-mer detected in the reads is the product of another k-mer bearing sequencing errors. In a preferred variant, detection depends on the actual sequencing depth of coverage for the bacterial strain in which the sequence SEQ is sought.
To this end, a test 812 is performed to determine whether the sample is a metagenomic sample, or more generally a sample containing several different species (several bacterial species, human DNA, or other), or a sample prepared from an isolate of the bacterial strain. In the latter case, since only one strain is present, all reads belong to said strain and the sequencing depth of coverage, noted “cov”, is calculated, in 814, for example according to the relationship:

cov=Nr/Ng (10)
in which relationship Nr is the total number of bases included in the reads and Ng is the number of bases in a reference genome of the bacterial species to which the bacterial strain belongs, preferentially having a size equal or close to the mean of the genome sizes observed for said species (e.g. Ng=4.4 million base pairs (Mbp) for Mycobacterium tuberculosis).
A copy number, noted Ncov, which needs to be detected in the reads to affirm that a k-mer is indeed present in the genome, is then calculated in 816 according to the following relationship:
Ncov=τ×cov (11)
where τ is a predetermined parameter, preferentially taking into account the sequencing error rate of the sequencing technology used, advantageously between 5% and 15%, preferably greater than or equal to 10%, for example 10%. For the latter percentage and a depth of coverage of 100, 10 identical copies of a k-mer must therefore be detected to determine that it is actually present in the reads. A rate of 10% notably enables the presence of a k-mer to be accurately detected in reads produced by a GridION platform from the company Oxford Nanopore Technologies Ltd using the R9.4 library preparation kit from the same company. This rate also allows accurate detection in reads produced by SBS-type sequencing technology, for example the MiSeq platform from the company Illumina Inc.
Detection of the presence or absence of each k-mer kmi of the KM set in the reads is then performed in 818. In particular, the k-mer kmi is detected if there are at least Ncov identical copies in the reads. The process then continues with step 804 described previously.
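Steps 814-818 for an isolate sample can be sketched as follows; the function and variable names, and the toy data, are illustrative assumptions rather than the patented code.

```python
# Hedged sketch of steps 814-818: depth of coverage cov = Nr/Ng (step 814),
# copy threshold Ncov = tau x cov (relation (11)), and a k-mer is called
# present when at least Ncov identical copies occur in the reads (818).
from collections import Counter

def kmer_counts(reads, k=31):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def detect_in_reads(kmer, reads, genome_size, tau=0.10, k=31):
    n_r = sum(len(r) for r in reads)   # Nr: total number of sequenced bases
    cov = n_r / genome_size            # depth of coverage (step 814)
    n_cov = tau * cov                  # copy threshold (relation (11))
    return kmer_counts(reads, k)[kmer] >= n_cov

# Toy example: 100 reads of 31 bases over a 31-base "genome" give cov = 100,
# so with tau = 10% at least 10 identical copies of a k-mer are required.
km = "ACGT" * 7 + "ACG"                # 31-base k-mer
reads = [km] * 12 + ["T" * 31] * 88
```

A k-mer seen only a handful of times (e.g. the product of sequencing errors) falls below the threshold and is called absent, while genuinely present k-mers, supported by the coverage, are retained.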
If the sample comprises several species (test 812), the process consists in determining the sequencing depth of coverage for the bacterial species under consideration. More particularly, “taxonomic binning” is performed in 820, such binning consisting in assigning to each read an origin among the species present in the sample. This type of binning is well known in the prior art and uses, for example, the classification described in the article by Wood D. E. et al., “Kraken: ultrafast metagenome sequence classification using exact alignments”, Genome Biology, 2014, or as performed by the “Kraken2” software downloadable from the website https://github.com/DerrickWood/kraken2/releases.
The sequencing depth of coverage of said bacterial species under consideration is then calculated, for example according to either of the following relationships:

cov=Nrm/Ng or cov=p×Nr/Ng
in which relationships Nrm is the total number of bases included in the reads assigned to the bacterial species to which the bacterial strain under consideration belongs, Ng is the number of bases in a mean-size reference genome of the bacterial species, and p is the relative proportion of the bacterial species in the sample. This relative proportion is calculated, for example, using the classification described in the aforementioned article by Wood D. E. et al., or as performed by the “Kraken2” software. The process continues with step 816 of calculating the copy number as a function of the sequencing depth of coverage just calculated.
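The two coverage estimates for a metagenomic sample can be sketched as follows; the taxonomic binning itself (e.g. by Kraken2) is mocked here as a precomputed read-to-species mapping, and all names and data are illustrative assumptions.

```python
# Hedged sketch for metagenomic samples (820 onward): the depth of coverage
# of the species of interest is estimated either from the bases of the reads
# assigned to it by binning (Nrm/Ng), or from its relative proportion p in
# the sample (p x Nr/Ng). Both should agree when p matches the base fraction.
def coverage_from_binning(reads, assignment, species, genome_size):
    n_rm = sum(len(r) for r, sp in zip(reads, assignment) if sp == species)
    return n_rm / genome_size

def coverage_from_proportion(reads, p, genome_size):
    n_r = sum(len(r) for r in reads)
    return p * n_r / genome_size

# 30% of the reads (and of the bases) belong to the species of interest
reads = ["A" * 100] * 30 + ["C" * 100] * 70
assignment = ["K. pneumoniae"] * 30 + ["E. coli"] * 70
cov_binning = coverage_from_binning(reads, assignment, "K. pneumoniae", 1000)
cov_proportion = coverage_from_proportion(reads, p=0.30, genome_size=1000)
```

The resulting species-specific coverage then feeds step 816 to set the copy threshold exactly as in the isolate case.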
Referring again to step 810 of detecting the set of genome sequences, denoted {SEQ1, . . . , SEQi, . . . SEQE}, this step follows the detection of each sequence SEQi in the manner described previously. More particularly, the detection of said set is implemented according to one of the following options:
Firstly, it is noted that the detection performance of sequences SEQ or sets of SEQs is very similar to that obtained directly by k-mer identification, as evidenced by the performance of a cluster-lasso based process compared to the performance of a lasso-based process as described previously.
Secondly, the detection process 50 is robust with respect to the type of sequencing technology employed, and notably to its sequencing errors. The following table illustrates, for 37 test strains of the Klebsiella pneumoniae species in isolate form, with variant detection (percentages 70%, 80%, 90%) for unitigs (equivalent to SEQi) and unit detection according to option i) above, the results of a cluster-lasso prediction of resistance to different antibiotics for sequencing by a MiSeq (“Illumina”) and by a GridION (“ONT”). The two sequencing technologies are tested both on their reads and on the contigs produced by the assembly of their reads. Table 10 shows that the results are similar for the two technologies, even though they have significantly different sequencing error rates, and that the results are also similar for reads and contigs.
Furthermore, the prediction performance (AUC) as a function of sequencing depth of coverage for the ONT technology is shown in
A particular embodiment of the invention has been described. This process may be modified in accordance with the following features, taken alone or in combination:
Number | Date | Country | Kind |
---|---|---|---|
20162655.3 | Mar 2020 | WO | international |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/056013 | 3/10/2021 | WO |