The invention relates to a method of creating and analyzing mass spectrometer signals and more particularly to a method of creating characteristic profiles of mass spectra and an identification model for analyzing and identifying features of microorganisms by analyzing mass spectrometry (MS) data of their biomolecules. The characteristic profile is a protein expression pattern obtained by analyzing signals from matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) of isolated microorganisms having the same feature. The MALDI-TOF MS data of the isolated microorganisms are processed by density-based clustering to find mass-to-charge ratio (m/z) values with a high probability of occurrence. These values together form a characteristic profile for a specific feature of the microorganisms. Then, machine learning methods are used to integrate the profiles of different features of microorganisms in order to create feature classification models, which are used to analyze matched vectors of microorganisms having unknown features, thereby identifying and analyzing the features of those microorganisms.
Conventionally, technologies that use MS to identify the species of an unknown microorganism involve comparing the MS of the unknown isolated microorganism to those of known microorganisms in an isolated MS database, or comparing the isolated MS of the unknown microorganism to the characteristic MS species profiles of known microorganisms. The database comparison approach requires gathering all the isolated MS data of known microorganisms in a database. However, microorganisms evolve constantly, so a huge amount of MS data of known isolated microorganisms must be gathered in the database. Further, in the identification step, comparing the isolated MS of the unknown microorganism against the large isolated MS database of known microorganisms is time consuming. Efficient and accurate comparison requires large data storage, which in turn requires complex hardware.
To address the above problems, an intelligent method of creating characteristic profiles of mass spectra and an identification model for analyzing and identifying features of microorganisms has been disclosed. The method can quickly compare mass spectrometer signal data. However, it must first discretize the data and then use density-based clustering to find m/z values with a high probability of occurrence in the discretized data, thereby addressing the drift of MS signals between test batches. The discretization, however, neither identifies the corresponding signals nor provides a possible drifting range. In short, the method is not capable of identifying proteins.
Thus, the need for improvement does exist.
It is therefore one object of the invention to provide a method of creating characteristic profiles of mass spectra and an identification model for analyzing and identifying microorganisms, comprising the steps of (1) obtaining data of MALDI-TOF MS of microorganisms having the same features; (2) using a kernel density estimation to generate characteristic profiles of an m/z of the data; (3) creating a characteristic MS profile based on the m/z; (4) repeating steps (1) to (3) until characteristic MS profiles of a plurality of features of the microorganisms are obtained; (5) comparing m/z of MALDI-TOF MS spectrum of microorganisms having known features with the characteristic MS profiles to obtain a plurality of first matched vectors; (6) using a machine learning method to establish a feature classification model; (7) using MALDI-TOF MS to analyze microorganisms having unknown features; (8) comparing the m/z of MALDI-TOF MS spectrum of the microorganisms having unknown features with the characteristic MS profiles to obtain a plurality of second matched vectors; (9) using the feature classification model to analyze the second matched vectors; and (10) identifying the microorganisms having the unknown features.
Preferably, the machine learning method is Support Vector Machine (SVM), Artificial Neural Network (ANN), k Nearest Neighbor (kNN), Logistic Regression (LR), Fuzzy Logic, Bayesian Algorithms, Decision Tree Induction Algorithm (DT), Random Forest (RF), Deep Learning, or any combination thereof.
Preferably, the kernel used in the kernel density estimation is a uniform kernel, a triangular kernel, a biweight kernel, a triweight kernel, an Epanechnikov kernel, or a Gaussian kernel.
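By way of illustration only, the kernels named above can be written in their standard textbook forms and plugged into a generic one-dimensional kernel density estimate. The following is a minimal sketch and not the implementation of the invention; the names `KERNELS` and `kde` and the `bandwidth` parameter are assumptions introduced here for the example.

```python
import math

# Standard kernel functions K(u); each integrates to 1 over the real line.
# All but the Gaussian kernel are zero outside |u| <= 1.
def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0

def triangular(u):
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0

def biweight(u):
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1 else 0.0

def triweight(u):
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1 else 0.0

def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

# Hypothetical lookup table so a kernel can be selected by name.
KERNELS = {
    "uniform": uniform,
    "triangular": triangular,
    "epanechnikov": epanechnikov,
    "biweight": biweight,
    "triweight": triweight,
    "gaussian": gaussian,
}

def kde(x, samples, bandwidth, kernel=gaussian):
    """Kernel density estimate at x: (1 / (n * h)) * sum of K((x - s) / h)."""
    if not samples:
        raise ValueError("samples must not be empty")
    n = len(samples)
    return sum(kernel((x - s) / bandwidth) for s in samples) / (n * bandwidth)
```

In this sketch `samples` would hold the m/z values of peaks pooled from spectra sharing one feature, and `bandwidth` (in m/z units) controls how widely each observed peak is smoothed; a suitable value is an assumption that would have to be tuned for the instrument.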
Preferably, the microorganisms are bacteria, molds, or viruses.
Preferably, the features of the microorganisms are species, sub-species, resistance to antibiotics, or toxicity.
The method of the invention has the following advantages and benefits in comparison with the conventional art:
Precise m/z values can be obtained: creating characteristic profiles of mass spectra and an identification model for analyzing and identifying features of microorganisms facilitates summarizing the m/z values of the characteristic peaks. It addresses the problem, associated with discretization, of MS signals drifting or shifting between different batches of an experiment, as well as the problem of being unable to correctly locate the signals to be aligned. Therefore, the correct locations of the signals to be aligned can be found, precise m/z values can be obtained, and protein identification is made easier.
Both identification precision and resolution are greatly increased: creating characteristic profiles of mass spectra and an identification model for analyzing and identifying features of microorganisms can greatly increase both identification precision and resolution. It can address the difficulty that conventional methods have in distinguishing closely related microorganism species (e.g., Shigella and E. coli). Further, it can be easily extended to the identification of species, sub-species, resistance to antibiotics, or toxicity. With the increased precision of MS data analysis, healthcare workers can use the analysis result to choose antibiotics correctly for infection control in near real time.
The signal drift or shift problem is solved: the m/z comparison of the invention can solve the signal drift problem in microorganism MALDI-TOF MS data when the MS data are acquired from different batches of an experiment. Creation of the matched vectors facilitates the construction of microorganism identification models using machine learning methods. Machine learning methods are characterized by high accuracy, high performance and high repeatability. Thus, the analysis results of MS signals of the invention can be widely used in many applications. This in turn decreases the need for manual operation and manual intervention and, ultimately, greatly reduces both manpower and cost.
The above and other objects, features and advantages of the invention will become apparent from the following detailed description taken with the accompanying drawings.
Referring to
T1: obtaining data of MALDI-TOF MS of microorganisms having the same features;
T2: using a kernel density estimation to generate characteristic profiles of an m/z of the data, wherein the kernel used in the kernel density estimation is a uniform kernel, a triangular kernel, a biweight kernel, a triweight kernel, an Epanechnikov kernel, or a Gaussian kernel;
T3: creating a characteristic MS profile based on the m/z;
T4: repeating the steps T1 to T3 until characteristic MS profiles of a plurality of features of the microorganisms are obtained;
T5: comparing m/z of MALDI-TOF MS spectrum of microorganisms having known features with the characteristic MS profiles generated by a Gaussian function to obtain a plurality of first matched vectors;
T6: using a machine learning method to establish a feature classification model;
T7: using MALDI-TOF MS to analyze microorganisms having unknown features;
T8: comparing the m/z of MALDI-TOF MS spectrum of the microorganisms having unknown features with the characteristic MS profiles to obtain a plurality of second matched vectors;
T9: using the feature classification model to analyze the second matched vectors; and
T10: identifying the microorganisms having the unknown features. Sub-species of Staphylococcus haemolyticus are taken as an example in conjunction with
Referring to
Referring to
In
ST types respectively. Further, the maximum and minimum area values are calculated and taken as the aligned central points and the drifting ranges. Finally, all peak values and their ranges are combined to obtain a model having aligned m/z.
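One possible reading of the alignment described above can be sketched as follows: a Gaussian kernel density estimate is evaluated on a grid of m/z values, each local maximum is taken as an aligned central point, and the drifting range around it extends to the flanking local minimum. This is a hedged sketch under stated assumptions rather than the procedure of the invention; the function names, the grid `step`, the `bandwidth`, and the `rel_floor` cut-off (which stops the range once the density falls below a fraction of the peak height) are all introduced here for illustration.

```python
import math

def gaussian_kde(x, samples, bandwidth):
    """Gaussian kernel density estimate of the pooled peak m/z values at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

def characteristic_profile(peaks, bandwidth=0.5, step=0.05, rel_floor=0.05):
    """Return a list of (center, low, high) tuples in ascending m/z order.

    center: grid point where the density has a local maximum.
    low/high: drifting range, found by walking downhill from the maximum
    until the density stops decreasing (a local minimum) or drops below
    rel_floor times the height of the maximum (an assumed cut-off).
    """
    lo = min(peaks) - 3.0 * bandwidth
    hi = max(peaks) + 3.0 * bandwidth
    n = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n)]
    dens = [gaussian_kde(x, peaks, bandwidth) for x in grid]

    profile = []
    for i in range(1, n - 1):
        # Local maximum of the estimated density.
        if not (dens[i - 1] < dens[i] >= dens[i + 1]):
            continue
        floor = rel_floor * dens[i]
        left = i
        while left > 0 and dens[left - 1] < dens[left] and dens[left - 1] >= floor:
            left -= 1
        right = i
        while right < n - 1 and dens[right + 1] < dens[right] and dens[right + 1] >= floor:
            right += 1
        profile.append((grid[i], grid[left], grid[right]))
    return profile
```

Here `peaks` would be the m/z values of detected peaks pooled over all spectra of one sub-species, so that a signal drifting slightly between batches contributes to the same density maximum instead of being split across discretization bins.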
As shown in
As shown in
The steps T1 to T3 are repeated until characteristic MS profiles of a plurality of specific sub-species are obtained. After the characteristic MS profiles of the specific sub-species have been obtained, the MS data of a plurality of microorganisms of known sub-species can be compared, signal by signal, with the characteristic MS profile of each sub-species to create a plurality of matched vectors as a training dataset. A plurality of different conventional machine learning methods are then trained on this dataset to establish a sub-species classification identification model.
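The creation of matched vectors and the training of a classifier can be sketched as follows. The text does not specify the encoding of a matched vector, so this example assumes a binary vector with one entry per profile signal, set to 1 when the spectrum contains a peak within that signal's drifting range. A small k Nearest Neighbor classifier, one of the methods listed above, stands in for the machine learning step. The function names and all m/z values are hypothetical and are not measured data.

```python
from collections import Counter

def matched_vector(spectrum_mz, profile):
    """Binary vector: entry j is 1 if any peak lies in the j-th drifting range.

    profile is a list of (center, low, high) tuples (assumed representation).
    """
    return [
        1 if any(low <= mz <= high for mz in spectrum_mz) else 0
        for (_center, low, high) in profile
    ]

def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _dist, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical characteristic profile: (center, low, high) in m/z units.
profile = [
    (3000.0, 2999.0, 3001.0),
    (4200.0, 4199.0, 4201.0),
    (5100.0, 5099.0, 5101.0),
]

# Hypothetical training spectra of known sub-species (peak m/z lists).
training = [
    ([3000.2, 5100.3], "ST3"),
    ([2999.7, 5099.8], "ST3"),
    ([4200.1, 5100.4], "ST42"),
    ([4199.6, 5100.0], "ST42"),
    ([3000.1, 4200.2], "other"),
    ([4200.3], "other"),
]
train_vectors = [matched_vector(mz, profile) for mz, _ in training]
train_labels = [label for _, label in training]

# Spectrum of an isolate with unknown sub-species, slightly drifted.
unknown = matched_vector([3000.4, 5100.6], profile)
predicted = knn_predict(train_vectors, train_labels, unknown, k=3)
```

Because the comparison is made against a range and not a single discretized value, a peak that has drifted by a fraction of an m/z unit still sets the same entry of the matched vector.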
Referring to
Referring to
As shown in the dichotomy model of each of ST3, ST42 and other ST types of
As shown in
In conclusion, the Gaussian function in cooperation with different machine learning methods achieves an excellent identifying effect, e.g., an accuracy of about 0.90, which is better than that of density-based clustering. Further, the standard deviation of the accuracy is very small, which means that the performance of the machine learning methods is highly consistent.
It is clear from the above embodiment that the novel and nonobvious method of the invention can obtain more accurate characteristic MS profiles of species. Further, the machine learning methods being used can more precisely identify microorganism sub-species. It is understood that sub-species is one feature of microorganisms. In other words, the method of the invention can be easily extended to the identification of species, sub-species, resistance to antibiotics, or toxicity.
While the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modifications within the spirit and scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
108133321 | Sep 2019 | TW | national |