The present invention relates to a method and a system for identifying the structure of a compound.
It is important to identify the structures of compounds, in particular low molecular weight compounds such as metabolites, natural products, drugs and pollutants, in various fields such as physiology, medicine, food and the environment. Such structure identification enables, for example, the identification of useful natural products, the specification of pollutants and the development of biomarkers.
Mass spectrometry is one technique for identifying the structure of a compound. It is used for this purpose in analytical chemistry, biopharmaceuticals, and environmental research and industry. In a mass spectrometry method, for example, a compound contained in a sample is first separated by liquid chromatography, the separated compound is ionized, and a mass spectrum is acquired.
In mass spectrometry, information on the mass-to-charge ratios of the precursor ion of an analyte compound and of the fragment ions generated by cleaving it is acquired and, in some cases, utilized for estimating the structure of the compound. Particularly for low molecular weight compounds, however, the number of observed fragment ions is often small, which makes it difficult to estimate the cleavage sites in the structure. In addition, the number of candidate structures to be considered is enormous, so it is difficult to identify the structure of the compound from mass spectrometry information alone.
In recent years, analytical instruments that integrate mass spectrometry and ion mobility spectrometry have become available. In one example (Patent Document 1), information on the collision cross section of the precursor ion of an analyte compound measured by such an instrument is matched to information stored in a database of mass-to-charge ratio and collision cross section values acquired by measuring standard compounds in advance. In this method, however, there is a problem that the compounds whose structures can be determined are limited to those for which standard compounds have been measured.
Patent Document: JP 2018-517905A
Thus, an object of the present invention is to provide a method and a system for identifying a structure, which can identify structures of various kinds of compounds.
A first method for identifying the structure of a compound of the present invention comprises,
matching the spectral data in which measured mass-to-charge ratios (m/z) and collision cross sections (CCS) of fragment ions of an analyte compound are combined, to spectral data in a reference spectral database in which structure information on fragment ions of a standard compound and mass-to-charge ratios and collision cross sections of fragment ions thereof are combined.
A second method for identifying the structure of a compound of the present invention comprises,
A third method for identifying the structure of a compound of the present invention comprises,
A system for identifying the structure of the compounds of the present invention comprises,
According to the present invention, it is possible to provide a method and a system for identifying the structure of a compound, which can identify the structures of various compounds.
Item 1. A method for identifying the structure of a compound, which comprises
(matching step of) matching the spectral data in which measured mass-to-charge ratios (m/z) and collision cross sections (CCS) of fragment ions of an analyte compound are combined, to spectral data in a reference spectral database in which structure information on fragment ions of a standard compound and mass-to-charge ratios and collision cross sections of fragment ions thereof are combined.
Item 2. The method according to Item 1, which further comprises constructing a database including structure information and data of mass-to-charge ratios and collision cross sections of fragment ions of a standard compound.
Item 3. The method according to Item 1 or 2, which further comprises (spectral data generation step of) generating spectral data by subjecting the fragment ions of the analyte compound to mass spectrometry analysis and ion mobility analysis.
Item 4. The method according to any one of Items 1 to 3, which further comprises matching the compound spectral data in which the measured m/z and CCS of the precursor ion of the analyte compound are combined, to precursor ion reference spectral data in which structure of precursor ion of the standard compound and m/z and CCS of the same precursor ion are combined.
Item 5. The method according to Item 4, which further comprises generating precursor ion spectral data by subjecting the precursor ion of the analyte compound to mass spectrometry analysis and ion mobility analysis.
Item 6. The method according to any one of Items 1 to 5, wherein in matching, the ones that have masses matched to the observed mass-to-charge ratios of fragment ions of the analyte compound are searched from fragment ions of the standard compounds in the reference spectral data as candidates, with regard to each candidate, both the mass-to-charge ratio and collision cross section are matched with a predetermined tolerance between the spectral data of the analyte compound and the reference spectral database to calculate matching scores, and the candidate showing a top score is determined as a real structure of fragment ion of the analyte compound.
Item 7. A method for identifying the structure of a compound, which comprises
Item 8. The method according to Item 7, wherein in acquiring a candidate structure, instead of or in addition to acquiring a candidate structure from chemical structures included in a chemical structure database, a candidate structure is acquired from chemical structures generated with an algorithm of molecular structure generation for generating a theoretically possible structure based on the estimated elemental composition of the analyte compound.
Item 9. The method according to Item 7 or 8, wherein in matching, the measured m/z ratios and CCSs of fragment ions of the analyte compound are matched to the m/z ratios and the acquired CCSs of the estimated fragment ions of the candidate structures, and with regard to each candidate structure, a matching score is calculated based on the number of matched pairs of the mass-to-charge ratio and collision cross section and peak intensities of the fragment ions, and the candidate showing a top score is determined as a real structure of the fragment ion of the analyte compound.
Item 10. A method for identifying the structure of a compound, which comprises
Item 11. The method according to Item 10, wherein in acquiring a candidate structure, instead of acquiring a candidate structure by searching a chemical structure having the substructure in a chemical structure database, a candidate substructure is acquired by matching the observed mass-to-charge ratios and collision cross sections of the fragment ions to a reference spectral database in which structures, mass-to-charge ratios and collision cross sections of fragment ions of a standard compound are combined and/or a theoretical spectral database including structures of fragment ions and theoretically calculated CCSs thereof, and a candidate structure is acquired from chemical structures generated with an algorithm of molecular structure generation for generating a theoretically possible structure based on the substructure and the estimated elemental composition of the analyte compound.
Item 12. The method according to Item 10 or 11, wherein in the first matching, the ones that have masses matched to the observed mass-to-charge ratios of fragment ions of the analyte compound are searched from fragment ions of the standard compounds in the reference spectral data as candidates, with regard to each candidate, both the mass-to-charge ratio and collision cross section are matched with a predetermined tolerance between the spectral data of the analyte compound and the reference spectral database to calculate matching scores, and the candidate showing a top score is determined as a real structure of fragment ion of the analyte compound,
in the second matching, the measured m/z ratios and CCSs of fragment ions of the analyte compound are matched to the m/z ratios and the acquired CCSs of the estimated fragment ions of the candidate structures, and with regard to each candidate structure, a matching score is calculated based on the number of matched pairs of the mass-to-charge ratio and collision cross section and peak intensities of the fragment ions, and the candidate showing a top score is determined as a real structure of the fragment ion of the analyte compound.
Item 13. A system for identifying the structure of a compound, which comprises
Item 14. The system according to Item 13, wherein the matching means searches the ones that have masses matched to the observed mass-to-charge ratios of fragment ions of the analyte compound from fragment ions of the standard compounds in the reference spectral data as candidates, calculates matching scores by matching both the mass-to-charge ratio and collision cross section between the spectral data of the analyte compound and the reference spectral database with a predetermined tolerance with regard to each candidate, and determines the candidate showing a top score as a real structure of fragment ion of the analyte compound.
Item 15. A system for identifying the structure of a compound, which comprises
Item 16. The system according to Item 13, wherein the matching means matches the measured m/z ratios and CCSs of fragment ions of the analyte compound to the m/z ratios and the acquired CCSs of the estimated fragment ions of the candidate structures, calculates a matching score based on the number of matched pairs of the mass-to-charge ratio and collision cross section and peak intensities of the fragment ions with regard to each candidate structure, and determines the candidate showing a top score as a real structure of the fragment ion of the analyte compound.
Item 17. A system for identifying the structure of a compound, which comprises
Item 18. The system according to Item 17, wherein the first matching means searches the ones that have masses matched to the observed mass-to-charge ratios of fragment ions of the analyte compound from fragment ions of the standard compounds in the reference spectral data as candidates, calculates matching scores by matching both the mass-to-charge ratio and collision cross section between the spectral data of the analyte compound and the reference spectral database with a predetermined tolerance with regard to each candidate, and determines the candidate showing a top score as a real structure of fragment ion of the analyte compound,
the second matching means matches the measured m/z ratios and CCSs of fragment ions of the analyte compound to the m/z ratios and the acquired CCSs of the estimated fragment ions of the candidate structures, calculates a matching score based on the number of matched pairs of the mass-to-charge ratio and collision cross section and peak intensities of the fragment ions with regard to each candidate structure, and determines the candidate showing a top score as a real structure of the fragment ion of the analyte compound.
Item 19. An apparatus for identifying the structure of a compound, which comprises
Item 20. A method for identifying the structure of a compound, which comprises
Next, embodiments of the present invention will be explained. Incidentally, the present invention is not limited or restricted by the embodiments mentioned below. In
The first embodiment of the present invention is the above-mentioned method for identifying the structure of a compound, and the subject of the method is a person. In the present embodiment, among the procedures carried out by a person, a procedure that can also be interpreted as being carried out on a computer can be interpreted as, for example, a person causing the computer to carry out the corresponding procedure. As mentioned above, the present embodiment includes generation of reference spectral data derived from the standard compound, theoretical generation of reference spectral data from known compound structures, calculation of the mass-to-charge ratio and collision cross section from a fragment structure, and the matching step.
In this step, the precursor ion and fragment ions of the ionized compound are measured by mass spectrometry and ion mobility spectrometry to generate spectral data. In order to generate the said data, first, the mass-to-charge ratio and collision cross section of the fragment ions of the analyte compound are measured. Specifically, for example, the analyte compound is analyzed by mass spectrometry and ion mobility spectrometry (
As used herein, mass-to-charge ratio (m/z) typically refers to the mass-to-charge ratio in the usual sense as employed in the mass spectrometry art, but it also includes values derived or converted therefrom. It will be clear to a person skilled in the art that many different converted values (e.g., log(m/z)) could be used for the purposes of the present invention and that they are still within the scope of the invention. This also applies to other embodiments of the present invention.
Here, in ion mobility spectrometry, ions are separated according to their mobility. Drift time is the measured time taken for an ion to travel through the ion mobility device. Collision cross section (CCS) can be acquired by applying a calibration function to the drift time. Therefore, both drift time and CCS reflect the same chemical properties of the analyte compound or its fragment ions, and both can be used in the same fashion in this workflow as values for distinguishing chemical structures. Accordingly, in the present invention, drift time can be used in place of “CCS”. In addition, it is naturally understood that “CCS” as used in the present invention includes any value directly correlated with the ion mobility of the compound or its fragment ions, such as “a raw ion mobility value measured by ion mobility separation device” and “any value calculated from the raw data of ion mobility measurement”. As used herein, the ion mobility information of a precursor ion or a fragment ion refers to information obtained using ion mobility spectrometry that characterizes the precursor ion or the fragment ion, respectively. Typically, ion mobility information is expressed as collision cross section or drift time, but it may take other forms derived from ion mobility spectrometry measurements as long as it characterizes the precursor ion or the fragment ion in question. Ion mobility information may be unique or specific to the precursor ion or the fragment ion, or may depend on the ion mobility instrument or on the calibration or conversion method used. In the present invention, non-limiting examples of “ion mobility information” include “CCS”, “drift time”, “a raw ion mobility value measured by ion mobility separation device” and “any value calculated from the raw data of ion mobility measurement”. These also apply to other embodiments of the present invention.
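As an illustration of applying a calibration function to drift time, the following is a minimal sketch. The power-law form CCS = a * t^b, the function names and the calibrant values are all assumptions made for this example; the present description does not specify the form of the calibration function, and the calibrant data below are synthetic.

```python
import math


def fit_power_law(drift_times_ms, ccs_values):
    """Least-squares fit of log(CCS) = log(a) + b * log(t).

    The power-law form is an assumption for illustration only.
    """
    xs = [math.log(t) for t in drift_times_ms]
    ys = [math.log(c) for c in ccs_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def drift_time_to_ccs(drift_time_ms, a, b):
    """Apply the fitted calibration function to a measured drift time."""
    return a * drift_time_ms ** b


# Synthetic calibrants generated from CCS = 150 * t^0.5 (not real measurements).
calibrant_times = [2.0, 4.0, 8.0]
calibrant_ccs = [150.0 * t ** 0.5 for t in calibrant_times]

a, b = fit_power_law(calibrant_times, calibrant_ccs)
ccs_at_4ms = drift_time_to_ccs(4.0, a, b)
```

Because the calibration is a monotonic mapping, matching on drift time and matching on the calibrated CCS distinguish structures in the same way, which is why either value can be used in the workflow.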
Information on fragment ions of the analyte compound is acquired as follows. First, the precursor ion of the analyte compound is isolated by applying a mass filter such as quadrupole or ion trap device equipped with the mass spectrometry instrument. Then, fragmentation of the precursor ion of the analyte compound is caused by using a fragmentation device such as collision-induced dissociation, high-energy collision-induced dissociation, electron capture dissociation or electron transfer dissociation. Also, fragment ions can be acquired when the analyte compound is measured by gas chromatography, since it causes electron transfer dissociation.
When analysis is carried out for a sample containing multiple compounds, these compounds are first separated by a chromatography method such as liquid chromatography, capillary electrophoresis or supercritical fluid chromatography (
Alternatively, when analysis is carried out by coupling a mass spectrometry instrument to a compound separation instrument such as liquid chromatography, capillary electrophoresis or supercritical fluid chromatography, the fragment ions of the analyte compound are acquired by data independent acquisition method such as SWATH (
In this step, the spectral data in which the measured mass-to-charge ratios and collision cross sections of the fragment ions of the analyte compound are combined is matched to the reference spectral data in which the values of the mass-to-charge ratio and collision cross section of the fragment ions acquired by measuring a standard compound and estimated structures of the fragment ions are described. The matching method will be mentioned later.
The above-mentioned spectral data (reference spectral data) of standard compounds are, for example, included in a reference spectral database. This database is used for structure identification of the analyte compound. For example, this database may be stored in an auxiliary storage device of each user’s client terminal, or may be stored in a server. In the present invention, for example, the user may use a previously constructed reference spectral database, or may construct a reference spectral database and use it, or may update a reference spectral database by including the previously generated spectral data of the analyte compound in the reference spectral database.
As mentioned above, for example, a reference spectral database including reference spectral data of standard compounds having known structures can be constructed. For example, a reference spectral database can be constructed by measuring mass-to-charge ratios and collision cross sections of fragment ions of standard compounds in the same manner as the measurement for the above-mentioned mass-to-charge ratios and collision cross sections of an analyte compound, and combining with the structures of the fragment ions of the standard compounds to generate reference spectral data (
First, with regard to a standard compound or a standard sample, mass-to-charge ratios and collision cross sections of the precursor ions of these compounds are measured, and these values are stored in a database together with the structures of these compounds. Then, the above-mentioned precursor ions of the compounds are fragmented, mass-to-charge ratios and collision cross sections of the fragment ions are measured, and these values are stored in the database.
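The two-stage registration described above can be sketched as a small in-memory data structure. The class names, field names and the numeric values are hypothetical placeholders introduced for this example; an actual database may equally be held on a client terminal or a server.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IonRecord:
    mz: float                        # measured mass-to-charge ratio
    ccs: float                       # measured collision cross section (or drift time)
    structure: Optional[str] = None  # structure identifier; None until estimated


@dataclass
class ReferenceEntry:
    name: str
    structure: str
    precursor: IonRecord
    fragments: List[IonRecord] = field(default_factory=list)


class ReferenceSpectralDatabase:
    """Hypothetical in-memory store keyed by standard compound name."""

    def __init__(self) -> None:
        self._entries: Dict[str, ReferenceEntry] = {}

    def register_precursor(self, name, structure, mz, ccs):
        # Stage 1: precursor m/z and CCS stored with the compound structure.
        self._entries[name] = ReferenceEntry(
            name, structure, IonRecord(mz, ccs, structure)
        )

    def register_fragment(self, name, mz, ccs, structure=None):
        # Stage 2: fragment m/z and CCS stored under the same compound.
        self._entries[name].fragments.append(IonRecord(mz, ccs, structure))

    def get(self, name):
        return self._entries[name]


# Placeholder values, not real measurements.
db = ReferenceSpectralDatabase()
db.register_precursor("standard_A", "STRUCTURE_A", mz=300.0, ccs=170.0)
db.register_fragment("standard_A", mz=150.0, ccs=125.0)
db.register_fragment("standard_A", mz=120.0, ccs=115.0, structure="FRAGMENT_A1")
```

Leaving the fragment structure optional reflects the order of operations: the measured values are stored first, and the fragment structure is attached once it has been estimated.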
The structures of the measured fragment ions of the standard compound are estimated as follows. Specifically, for example, the structures of theoretical and potential fragment ions are calculated by systematically cleaving covalent bonds in the structure of the standard compound (the precursor ion of the compound), and the mass-to-charge ratio of the calculated structure is calculated from this structure. When the calculated mass-to-charge ratio matches to the value of the observed fragment ion, the structure of the matched fragment ion is registered as the possible structure of the fragment ion together with the measured mass-to-charge ratio and collision cross section.
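The systematic cleavage and mass matching described above can be sketched as follows. This is a simplified illustration under stated assumptions: the molecule is a hand-written graph of heavy atoms with implicit hydrogen counts, only one bond is cleaved at a time, each fragment is treated as a singly charged cation (fragment mass minus one electron mass), and hydrogen rearrangements are ignored. All function names are hypothetical.

```python
from collections import defaultdict

MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}
ELECTRON = 0.000549


def components(atom_ids, bonds):
    """Connected components of the molecular graph after a bond is removed."""
    adj = defaultdict(set)
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in atom_ids:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps


def fragment_mass(atoms, comp):
    """Monoisotopic mass of a fragment, including implicit hydrogens."""
    return sum(
        MONOISOTOPIC[atoms[i][0]] + atoms[i][1] * MONOISOTOPIC["H"] for i in comp
    )


def enumerate_fragments(atoms, bonds):
    """Cleave each bond in turn; map fragment atom set -> calculated m/z (z = 1)."""
    frags = {}
    for i in range(len(bonds)):
        remaining = bonds[:i] + bonds[i + 1:]
        comps = components(sorted(atoms), remaining)
        if len(comps) < 2:
            continue  # ring bond: a single cleavage yields no separate fragment
        for comp in comps:
            frags[comp] = fragment_mass(atoms, comp) - ELECTRON
    return frags


def match_fragments(frags, observed_mz, tol=0.005):
    """Keep calculated fragments whose m/z matches an observed peak."""
    return [
        (mz_obs, tuple(sorted(comp)), round(mz_calc, 5))
        for mz_obs in observed_mz
        for comp, mz_calc in frags.items()
        if abs(mz_calc - mz_obs) <= tol
    ]


# Toy molecule CH3-CH2-OH as (element, implicit H count) per heavy atom.
atoms = {0: ("C", 3), 1: ("C", 2), 2: ("O", 1)}
bonds = [(0, 1), (1, 2)]
frags = enumerate_fragments(atoms, bonds)
matches = match_fragments(frags, observed_mz=[31.0178])
```

In this toy case the observed peak is assigned to the fragment made of atoms 1 and 2 (CH2OH), which would then be registered together with the measured mass-to-charge ratio and collision cross section.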
Alternatively, in order to estimate the structures of the observed fragment ions of the standard compound, for example, a fragmentation model and/or the structure rearrangement accompanying fragmentation is applied to the molecular structure of the standard compound. This structure estimation can be achieved by, for example, a fragmentation prediction tool such as CFM-predict (Allen et al., Nucleic Acids Res. 42: W94-9 (2014)), which is a machine learning model trained with fragmentation spectral data and molecular structures as inputs. Also, fragmentation rules and mechanisms collected from the published literature can be applied to the chemical structures of the standard compounds (the precursor ions of the compounds) to predict the structures of the fragment ions. This prediction can be achieved by, for example, scanning substructures with SMARTS pattern matching using a substructure library such as RDKit. A tool that predicts fragment structures by applying fragmentation rules, such as Mass Frontier (Thermo Fisher Scientific), can also be used to generate fragment structures. The fragment structures whose mass-to-charge ratios match the peaks in the measured fragment spectrum are added to the entry of the compound together with the observed mass-to-charge ratios and collision cross sections.
Hereinafter, the above-mentioned matching step will be mentioned in detail. In the present step, reference spectral data included in the reference spectral database is used, so that the present method is sometimes referred to as reference spectral database-dependent method. Incidentally, the system mentioned in detail below is an example, and does not limit the present invention in any way.
By comparing the spectral data of the analyte compound with the reference spectral data, the present step is accomplished. In
wherein m is the number of fragment ions in the reference spectral data of the candidate compound. When the mass-to-charge ratio and collision cross section of a fragment ion in the reference spectral data match the mass-to-charge ratio and collision cross section of a measured fragment ion of the analyte compound, then P_i = 1. When the mass-to-charge ratio or collision cross section of a fragment ion in the reference spectral data does not match that of a measured fragment ion of the analyte compound, then P_i = 0. I_i is the intensity of the corresponding peak in the measured spectrum. The acquired scores are used to compare candidates, and the candidate showing the top score is determined to be the real structure of the analyte compound. Incidentally, the candidates need not be evaluated strictly with the above-mentioned function; a similar function that considers the number of matched fragment ions and their signal intensities can also be utilized.
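The following is a minimal sketch of such a scoring step. It does not reproduce the scoring function of the present description; it is one assumed example of "a similar function" that combines the match indicator and the measured peak intensity, here normalized by the total measured intensity. The tolerances, function names and spectra are hypothetical.

```python
def match_score(measured, reference, mz_tol=0.01, ccs_tol_pct=2.0):
    """Intensity-weighted fraction of the measured spectrum explained.

    measured:  list of (mz, ccs, intensity) for the analyte fragment ions
    reference: list of (mz, ccs) for one candidate's fragment ions
    A reference fragment counts as matched only if BOTH m/z and CCS agree
    within tolerance. Assumed scoring form, for illustration only.
    """
    total_intensity = sum(inten for _, _, inten in measured)
    if not total_intensity:
        return 0.0
    score = 0.0
    for ref_mz, ref_ccs in reference:
        best = 0.0
        for mz, ccs, inten in measured:
            mz_ok = abs(mz - ref_mz) <= mz_tol
            ccs_ok = abs(ccs - ref_ccs) <= ref_ccs * ccs_tol_pct / 100.0
            if mz_ok and ccs_ok:
                best = max(best, inten)  # match indicator 1, weighted by intensity
        score += best
    return score / total_intensity


def rank_candidates(measured, candidates):
    """Return (score, name) pairs sorted so that the top score comes first."""
    scored = [(match_score(measured, ref), name) for name, ref in candidates.items()]
    return sorted(scored, reverse=True)


# Placeholder spectra: both candidates share fragment m/z values, but
# candidate_B has a different CCS for the first fragment.
measured = [(100.05, 120.0, 50.0), (150.08, 140.0, 30.0), (200.10, 160.0, 20.0)]
candidates = {
    "candidate_A": [(100.05, 121.0), (150.08, 139.0)],
    "candidate_B": [(100.05, 130.0), (150.08, 139.0)],
}
ranking = rank_candidates(measured, candidates)
```

The two placeholder candidates cannot be separated by mass-to-charge ratio alone; requiring the collision cross section to match as well is what moves candidate_A to the top of the ranking.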
As mentioned above, by matching the spectral data of the analyte compound to the reference spectral database, the method of the present embodiment can identify the structures of various compounds. In the present invention, in addition to or in place of the reference spectral database, the theoretical spectral database described later can be used for matching (this also applies to other embodiments of the present invention).
In the prior art, only the ion mobility of the compound itself (the precursor ion) is utilized, and the ion mobilities of the fragment ions are not utilized. Accordingly, compound structure identification is possible only when actual measurement data already exist for the same compound. Moreover, in current ion mobility analyzers, the collision cross section information of a precursor ion does not have sufficient resolution for identifying the compound, and there are cases where a plurality of compounds have the same collision cross section, so that it is virtually impossible to identify the structure from a huge number of candidate structures. In contrast, the method of the present embodiment can identify the structure of a compound with high precision even when the number of candidate structures is enormous and the resolution of the ion mobility analyzer is not sufficient, because it utilizes the collision cross sections of a plurality of fragment ions derived from the compound, each of which reflects a unique structure.
The present invention is also related to a method for identifying the structure of a compound, which comprises matching the spectral data in which measured mass-to-charge ratios (m/z) and collision cross sections (CCS) or drift time of both precursor ion and fragment ions of an analyte compound are combined, to spectral data in a reference spectral database in which structure information on fragment ions of a standard compound and mass-to-charge ratios and CCS or drift time of both precursor ion and fragment ions thereof are combined. (see
In the present invention, in addition to the structure information, mass-to-charge ratios and collision cross sections (CCS) of fragment ions, the mass-to-charge ratio and CCS of the precursor ion of the fragment ions can be used. Specifically, when matching the spectral data of the analyte compound to the spectral databases, the matching is based on the mass-to-charge ratio and CCS or drift time of both the precursor ion and the fragment ions of an analyte compound, a standard compound, and the like. Such matching allows for more accurate or faster identification of a compound. This also applies to other embodiments of the present invention.
In addition, the method described above can further comprise constructing a database including mass-to-charge ratios (m/z), CCS or drift time of fragment ions together with their structures deduced by systematic bond cleavage or a fragmentation prediction model, and mass-to-charge ratios (m/z), CCS or drift time of precursor ion of standard compounds.
The method of the present invention can also identify the substructure of a compound. Thus, the present invention is also related to a method for identifying the substructure of a compound, which comprises matching the spectral data in which measured mass-to-charge ratios (m/z) and collision cross sections (CCS) or drift time of fragment ions of an analyte compound are combined, to spectral data in a reference spectral database in which structural information, mass-to-charge ratios and CCS or drift time of fragment ions of a standard compound and/or a theoretical spectral database including structures of fragment ions and CCS or drift time that are theoretically calculated or predicted by machine learning thereof are combined (see
In this case, the substructure to be identified is the registered structure of fragment ion which is matched with a predetermined tolerance. The description of “Substructure acquisition step” described later can be applied here.
In identifying the substructure of a compound, mass-to-charge ratio and CCS or drift time of fragment ions observed can be used as queries to search against the aforementioned spectral database that contain fragment structures and mass-to-charge ratio and CCS or drift time acquired from standard compounds.
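A query of this kind can be sketched as a tolerance search over registered fragment records. The class name, tolerances and records below are hypothetical placeholders; the sorted list with binary search is simply one convenient way to narrow the search by mass-to-charge ratio before checking the collision cross section.

```python
import bisect


class FragmentIndex:
    """Hypothetical index of (mz, ccs, structure) fragment records."""

    def __init__(self, records):
        self._records = sorted(records)
        self._mzs = [rec[0] for rec in self._records]

    def query(self, mz, ccs, mz_tol=0.01, ccs_tol_pct=2.0):
        """Return registered structures matching both m/z and CCS within tolerance."""
        lo = bisect.bisect_left(self._mzs, mz - mz_tol)
        hi = bisect.bisect_right(self._mzs, mz + mz_tol)
        return [
            structure
            for _, ref_ccs, structure in self._records[lo:hi]
            if abs(ref_ccs - ccs) <= ref_ccs * ccs_tol_pct / 100.0
        ]


# Placeholder records: two fragments share an m/z but differ in CCS.
index = FragmentIndex([
    (91.054, 110.0, "FRAGMENT_X"),
    (91.054, 125.0, "FRAGMENT_Y"),
    (105.070, 118.0, "FRAGMENT_Z"),
])
hits = index.query(mz=91.055, ccs=111.0)
```

Each structure returned for an observed fragment ion is a candidate substructure of the analyte compound.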
In addition, the method described above can further comprise constructing a database including mass-to-charge ratios (m/z), CCS or drift time of fragment ions together with their structures deduced by systematic bond cleavage or a fragmentation prediction model.
In
The second embodiment of the present invention is the above-mentioned method for identifying the structure of an analyte compound, and the subject of the method is a person. In the present embodiment, among the procedures carried out by a person, a procedure that can be interpreted as being executed on a computer can be interpreted as, for example, a person causing the computer to carry out the procedure. The present embodiment is a method for identifying a chemical structure when the analyte compound is not registered in the above-mentioned reference spectral database of standard compounds. The present embodiment includes the above-mentioned mass-elemental composition deduction step, the candidate structure acquisition step, the estimated fragment ion structure acquisition step, the estimated fragment ion collision cross section acquisition step and the matching step. In
In this step, first, as described in “1-1. Measurement for mass-to-charge ratio (m/z) and collision cross section (CCS)” of the above-mentioned embodiment 1, the mass-to-charge ratio and collision cross section of the precursor ion and fragment ions of the analyte compound are measured. From the measurement results, the mass and/or elemental composition of the analyte compound is deduced. The elemental composition of the analyte compound is deduced taking into account the ionized adduct type and measurement error, and then filtered with valence rules and elemental ratios. Candidate molecular structures are estimated based on the candidate elemental composition, and further selected by applying the same process to the fragment ions of the analyte compound.
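The deduction of candidate elemental compositions can be sketched as a brute-force enumeration. Everything specific here is an assumption for illustration: the element set is restricted to C, H, N and O, only proton adducts are handled, the valence rule is approximated by requiring a non-negative integer ring-and-double-bond equivalent, and the hydrogen-to-carbon ratio limit is an arbitrary example threshold.

```python
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}
# Mass shift from neutral molecule to ion (proton adducts only, as an assumption).
ADDUCT_SHIFT = {"[M+H]+": 1.007276, "[M-H]-": -1.007276}


def format_formula(counts):
    """Render element counts such as {'C': 8, 'H': 10} as 'C8H10'."""
    return "".join(
        el + (str(n) if n > 1 else "") for el, n in counts.items() if n > 0
    )


def candidate_formulas(mz, adduct="[M+H]+", ppm=5.0, max_hc_ratio=3.1):
    """Enumerate CHNO compositions consistent with a measured m/z."""
    neutral = mz - ADDUCT_SHIFT[adduct]
    tol = neutral * ppm * 1e-6  # measurement error window in Da
    hits = []
    for c in range(1, int(neutral // MASS["C"]) + 1):
        rem_c = neutral - c * MASS["C"]
        for n in range(0, int(rem_c // MASS["N"]) + 1):
            rem_n = rem_c - n * MASS["N"]
            for o in range(0, int(rem_n // MASS["O"]) + 1):
                rest = rem_n - o * MASS["O"]
                h = round(rest / MASS["H"])
                mass = c * MASS["C"] + h * MASS["H"] + n * MASS["N"] + o * MASS["O"]
                if abs(mass - neutral) > tol:
                    continue
                rdbe = c - h / 2 + n / 2 + 1  # ring and double bond equivalents
                if rdbe < 0 or rdbe != int(rdbe):
                    continue  # simplified valence filter
                if h / c > max_hc_ratio:
                    continue  # simplified elemental ratio filter
                counts = {"C": c, "H": h, "N": n, "O": o}
                hits.append((format_formula(counts), (mass - neutral) / neutral * 1e6))
    return hits


# m/z calculated for a protonated C8H10N4O2 molecule (194.080376 + 1.007276).
formulas = candidate_formulas(195.087652)
formula_names = [name for name, _ in formulas]
```

Several compositions may survive the filters for a single precursor mass; as the description notes, applying the same process to the fragment ions narrows the list further.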
In the present embodiment, the present step is carried out by both the chemical structure database dependent method and the chemical structure database independent method. Incidentally, the present step need not be carried out by both methods; for example, it may be carried out by the chemical structure database dependent method alone, or by the chemical structure database independent method alone. In the present embodiment, a candidate structure is searched by the chemical structure database dependent method as mentioned below, and a candidate structure is generated by the chemical structure database independent method.
As the chemical structure database, for example, PubChem and ChEBI may be mentioned, but not limited to these. When structure determination is carried out by searching the true structure in the list of candidate chemical structures in the chemical structure database, the search is carried out by matching the mass of the analyte compound based on the measured mass-to-charge ratio of the analyte compound to select a candidate structure (“Chemical structure DB-dependent” in
When the candidate structure is determined without using any candidate information such as the chemical structure database, a candidate structure is generated by an algorithm of molecular structure generation. Specifically, for example, using a theoretical molecular structure generation tool such as Open Molecule Generator (Peironcely et al., J. Cheminform. (2012) 4: 21), MOLGEN or the like, a theoretically possible candidate structure is generated based on the deduced elemental composition of the analyte compound.
After a candidate structure is searched and selected by the above-mentioned chemical structure database dependent method and a candidate structure is generated by the above-mentioned chemical structure database independent method, estimated (hypothetical) fragment ions are generated (predicted) from these candidate structures of the analyte compound, taking into account systematic cleavage and/or a fragmentation model and/or structure rearrangement caused by the fragmentation, in the same manner as described in “2-2. Construction of reference spectral database” of the above-mentioned embodiment 1.
The collision cross sections of the hypothetical fragment ions are acquired by searching the reference spectral database mentioned above and/or the theoretical spectral database mentioned later. Incidentally, in this step, it is not always necessary to search the reference spectral database; the theoretical spectral database alone may be searched. As a result of this search, when fragment ions having the same structures as the hypothetical fragment ions are found, the registered collision cross sections for those fragment ions are retrieved. When fragment ions having the same structures as the hypothetical fragment ions are not found, the collision cross sections of these hypothetical fragments are calculated based on the structures of the hypothetical fragments by the method mentioned later. For example, the collision cross sections of the hypothetical fragments acquired by the calculation can be registered in the theoretical spectral database together with the structures of the same hypothetical fragments. According to this procedure, the fragment structures covered by the theoretical spectral database are enriched.
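The lookup-then-calculate procedure above can be sketched as follows. The databases are modeled as plain dictionaries keyed by a structure identifier, and the calculator is a placeholder stub standing in for the theoretical calculation or prediction described later; all names and values are hypothetical.

```python
def acquire_ccs(structure, reference_db, theoretical_db, calculate_ccs):
    """Return (ccs, source) for a hypothetical fragment ion structure.

    Lookup order: reference spectral database, then theoretical spectral
    database, then calculation. A calculated value is registered in the
    theoretical database so that later queries can reuse it.
    """
    if structure in reference_db:
        return reference_db[structure], "reference"
    if structure in theoretical_db:
        return theoretical_db[structure], "theoretical"
    ccs = calculate_ccs(structure)
    theoretical_db[structure] = ccs  # enrich the theoretical database
    return ccs, "calculated"


calls = []


def placeholder_calculator(structure):
    """Stub only: NOT a real CCS model, it just returns a dummy number."""
    calls.append(structure)
    return 100.0 + len(structure)


# Placeholder entries, not real measurements.
reference_db = {"FRAG_1": 112.3}
theoretical_db = {"FRAG_2": 98.7}

first = acquire_ccs("FRAG_3", reference_db, theoretical_db, placeholder_calculator)
second = acquire_ccs("FRAG_3", reference_db, theoretical_db, placeholder_calculator)
```

The second query for the same structure is served from the theoretical database without invoking the calculator again, which is the enrichment effect described above.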
The above-mentioned theoretical spectral database includes the mass-to-charge ratios and the theoretically calculated collision cross sections of the fragment ions. Such a database can be constructed as follows. That is, first, a two-dimensional structure of the compound is acquired from resources of chemical structures such as a chemical structure database as a textual chemical structure identifier such as InChI and SMILES or a file such as SDF. Then, the identifier or file is converted into a 2D structure by using a chemoinformatics library such as RDKit, and the structures of the fragment ions of the compound are deduced by applying fragmentation models and/or rules as described in “2-2. Construction of reference spectral database” of the above-mentioned first embodiment. Then, the mass-to-charge ratios and collision cross sections thereof are calculated by using the generated fragment structures. Calculation of collision cross section is described in detail below. The structures of the fragment ions are registered in the theoretical spectral database together with the calculated mass-to-charge ratios and collision cross sections, and tagged with the calculation method (
The theoretical collision cross section is calculated as follows. That is, first, a 2D structure of the compound described as a chemical identifier such as InChI or SMILES or as a file such as SDF is acquired from a chemical structure database such as PubChem. The most acidic and most basic atoms are estimated using a pKa calculation tool such as the Marvin pKa plugin (ChemAxon), a chemoinformatics library or the like, in order to predict protonation, deprotonation and other adduction sites. In the case of a positive ion, the ion form of the compound is generated by attaching an adduct such as a proton or a sodium ion to the most basic atom. In the case of a negative ion, a hydrogen connected to the most acidic atom is removed to generate a negative ion structure.
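As a non-limiting illustration, the mass-to-charge ratio of an ion form generated as above may be computed from the molecular formula as sketched below. The adduction site affects the ion structure (and hence the collision cross section) but not the mass-to-charge ratio, so only the formula is needed here. The function names, the restriction to C, H, N and O, and the rounded monoisotopic masses are assumptions made for this sketch.

```python
import re

# Rounded monoisotopic masses in Da (only the elements used in this sketch).
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}
SODIUM = 22.989770
ELECTRON = 0.000549

# Mass shift applied to the neutral molecule M, and the ion charge.
ADDUCTS = {
    "[M+H]+": (MONOISOTOPIC["H"] - ELECTRON, +1),      # proton attached
    "[M+Na]+": (SODIUM - ELECTRON, +1),                # sodium ion attached
    "[M-H]-": (-(MONOISOTOPIC["H"] - ELECTRON), -1),   # proton removed
}


def monoisotopic_mass(formula: str) -> float:
    """Monoisotopic mass of a simple formula such as 'C6H12O6'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def ion_mz(formula: str, adduct: str) -> float:
    """Mass-to-charge ratio of the given ion form of a neutral molecule."""
    shift, charge = ADDUCTS[adduct]
    return (monoisotopic_mass(formula) + shift) / abs(charge)


if __name__ == "__main__":
    for name in ADDUCTS:
        print(name, round(ion_mz("C6H12O6", name), 4))
```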
For example, 100 conformers of the precursor ion of a compound are generated, and their structures are optimized by applying the MMFF94 molecular force field using a chemoinformatics tool kit such as RDKit. The conformation with the lowest energy is selected, and its electron distribution is calculated by density functional theory with the B3LYP exchange-correlation functional and the 6-31G* basis set using a computational chemistry tool such as Gaussian. The collision cross sections are calculated by the Trajectory method or the Exact Hard Sphere Scattering method using MOBCAL software modified for the buffer gas used for the ion mobility analysis (for example, N2 or CO2).
In this approach, the collision cross sections of the precursor ion and/or fragment ions of a compound are predicted by a statistical and/or machine learning method.
With regard to the deduced structure of each fragment ion in the reference spectral database, molecular descriptors and fingerprints are calculated as follows. A molecular descriptor generator such as PaDEL-Descriptor is used to generate the molecular descriptors (constitutional, WHIM, topological, fingerprint). A circular ECFP fingerprint is generated with a chemoinformatics tool kit such as RDKit. With regard to the 3D conformer fingerprint and descriptors, first, conformer candidates are generated for each compound and/or each of its fragment ion structures as described in “4-2. Calculation of theoretical collision cross section” of the present embodiment. A spherical extended 3D fingerprint (E3FP) is generated by the E3FP algorithm. The 3D conformer related values such as volume, molecular surface area and the like are calculated from the optimized 3D conformer described in “4-2. Calculation of theoretical collision cross section” of the present embodiment.
The molecular descriptors and fingerprints obtained above are used as a training data set to build a support vector regression model that predicts the collision cross section, using a machine learning library such as Scikit-learn. Combinations of two parameters, the constraint violation cost and gamma, are evaluated on the training data via a cross-validation test. The pair of parameters that gives the lowest mean squared error of prediction is selected for the prediction model.
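As a non-limiting illustration, the cross-validated selection of the cost and gamma parameters may be sketched as follows. To keep the sketch free of external libraries, an RBF kernel ridge regressor is swapped in for the support vector regression model, with the ridge term 1/cost playing a role only loosely analogous to the constraint violation cost; with Scikit-learn the same search would typically be written with `SVR` and `GridSearchCV`. All function names, the fold assignment and the parameter grids are assumptions of this sketch.

```python
import math
from itertools import product


def rbf(a, b, gamma):
    """RBF kernel between two descriptor vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_predict(X_train, y_train, X_test, cost, gamma):
    """Kernel ridge stand-in for SVR: solve (K + I/cost) alpha = y - mean."""
    n = len(X_train)
    mean = sum(y_train) / n
    K = [[rbf(X_train[i], X_train[j], gamma) + (1.0 / cost if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, [y - mean for y in y_train])
    return [mean + sum(a * rbf(x, xt, gamma) for a, xt in zip(alpha, X_train))
            for x in X_test]


def cv_mse(X, y, cost, gamma, k=5):
    """Mean squared prediction error over k cross-validation folds."""
    errors = []
    for fold in range(k):
        test = [i for i in range(len(X)) if i % k == fold]
        train = [i for i in range(len(X)) if i % k != fold]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test], cost, gamma)
        errors += [(p - y[i]) ** 2 for p, i in zip(preds, test)]
    return sum(errors) / len(errors)


def grid_search(X, y, costs, gammas, k=5):
    """Return the (cost, gamma) pair with the lowest cross-validated MSE."""
    return min(product(costs, gammas), key=lambda cg: cv_mse(X, y, cg[0], cg[1], k))
```

In practice each row of `X` would hold the descriptor and fingerprint values of one ion and `y` the corresponding collision cross sections; the selected pair would then be used to train the final model on the whole training set.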
In the present invention, the theoretically calculated CCS (or drift time) may be a CCS (or drift time) that is either calculated theoretically or predicted by machine learning. For calculating or predicting the CCS (or drift time), for example, support vector machines, deep learning, and the like can be used. Specifically, the following literature can be referenced for implementation: Anal Chem. 2019 Apr 16;91(8):5191-5199. doi: 10.1021/acs.analchem.8b05821. Epub 2019 Apr 1.; Anal Chem. 2016 Nov 15;88(22):11084-11091. doi: 10.1021/acs.analchem.6b03091. Epub 2016 Nov 1. This also applies to other embodiments of the present invention.
In this step, for each predicted fragment ion structure of each candidate structure of the analyte compound, the mass-to-charge ratio and collision cross section of the candidate fragment (hypothetical fragment ion) are acquired by searching for the same structure in the reference spectral database and/or the theoretical spectral database. When the candidate fragment structures are not found in these databases, calculation and/or prediction of the collision cross sections is carried out by the above-mentioned 4-2 and/or 4-3. Incidentally, at this time, by registering the calculated collision cross sections of the fragment ions in the theoretical spectral database, the database can be enriched. Matching is carried out, for example, as follows. That is, first, the mass-to-charge ratios and collision cross sections derived from the standard compounds or hypothetical fragment ions are matched to the mass-to-charge ratios and collision cross sections in the spectral data of the fragment ions of the analyte compound. Then, each compound candidate is evaluated by the scoring function shown below based on the number of matched pairs of the mass-to-charge ratio and the collision cross section and the peak intensities of the fragment ions.
In the scoring function, m is the number of fragment ions in the spectral data of the measured analyte compound. When the mass-to-charge ratio and collision cross section of a hypothetical fragment ion match those of the i-th measured fragment ion, Pi=1. When they do not match, Pi=0. Ii is the intensity of the peak of the measured fragment ion. Wi is a weight factor. Wi=1 when the collision cross section is acquired from the reference spectral database. Wi=0.8 or 0.6 when the collision cross section is derived from “4-2. Calculation of theoretical collision cross section” or “4-3. Prediction of collision cross section by machine learning model”, respectively. The acquired scores are used to compare the candidates, and the candidate showing the top score is determined to be the real structure of the analyte compound. Incidentally, the evaluation of the candidates is not strictly limited to the above-mentioned function, and a similar function that considers the number of matched fragment ions and their signal intensities can also be utilized.
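As a non-limiting illustration, the matching and scoring described above may be sketched as follows. The exact form of the scoring function is not reproduced here; the sketch assumes a weighted sum over the m measured fragment ions, namely the sum of Pi × Ii × Wi, which is one form consistent with the definitions above. The tolerance values (10 ppm for the mass-to-charge ratio, 2% for the collision cross section), the class and function names, and the rule of taking the highest weight when several hypothetical fragments match one peak are likewise assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MeasuredPeak:
    mz: float         # measured mass-to-charge ratio
    ccs: float        # measured collision cross section
    intensity: float  # peak intensity, Ii


@dataclass
class HypotheticalFragment:
    mz: float
    ccs: float
    weight: float  # Wi: 1.0 reference database, 0.8 calculated, 0.6 predicted


def is_match(peak: MeasuredPeak, frag: HypotheticalFragment,
             mz_tol_ppm: float = 10.0, ccs_tol_pct: float = 2.0) -> bool:
    """True when both m/z and CCS agree within the (assumed) tolerances."""
    mz_ok = abs(peak.mz - frag.mz) <= frag.mz * mz_tol_ppm * 1e-6
    ccs_ok = abs(peak.ccs - frag.ccs) <= frag.ccs * ccs_tol_pct / 100.0
    return mz_ok and ccs_ok


def score_candidate(peaks: List[MeasuredPeak],
                    fragments: List[HypotheticalFragment]) -> float:
    """Assumed score: sum over measured peaks of Pi * Ii * Wi."""
    total = 0.0
    for peak in peaks:
        weights = [f.weight for f in fragments if is_match(peak, f)]
        if weights:  # Pi = 1; unmatched peaks (Pi = 0) contribute nothing
            total += peak.intensity * max(weights)
    return total
```

The score of each candidate structure would be computed against the same list of measured peaks, and the candidates would then be ranked by score.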
By matching as mentioned above, according to the method of the present embodiment, it is possible to identify the structure of a compound even if the corresponding compound is not registered in the reference spectral database of the standard compound or the standard sample.
The third embodiment of the present invention is the above-mentioned method for identifying the structure of an analyte compound, and the subject of the method is a person. In the present embodiment, among the procedures carried out by a person, a procedure that can be interpreted as being executed on a computer can be interpreted as, for example, the person causing the computer to carry out the procedure. The present embodiment includes the above-mentioned first matching step, the substructure acquisition step, the candidate structure acquisition step, the estimated (hypothetical) fragment ion structure acquisition step, the estimated (hypothetical) fragment ion collision cross section acquisition step, and the second matching step. The method of the present embodiment may sometimes be referred to as the spectral similarity method.
The present step can be carried out in the same manner as in “2. Matching step” of the above-mentioned embodiment 1. That is, the measured spectral data of the fragment ions of the analyte compound is searched against the reference spectral database.
When multiple fragment ions of the analyte compound are found to match fragment ions in the reference spectral data in both mass-to-charge ratio and collision cross section within a predetermined tolerance, the registered structures of these fragment ions are acquired. Here, the structures of these fragment ions can be regarded as candidates for the substructures of the analyte compound. This is because the matched fragment ions share the same unique combination of the mass-to-charge ratio and the collision cross section. In other words, it can be deemed highly likely that such standard compounds and the analyte compound have these fragment ions as common substructures.
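As a non-limiting illustration, the acquisition of substructure candidates described above may be sketched as follows. The tuple layout of the reference entries, the tolerance values, the structure labels and the function name are assumptions of this sketch, and the collision cross section values in the usage example are made up for illustration only.

```python
from typing import List, Set, Tuple

# (mass-to-charge ratio, collision cross section, registered structure)
ReferenceEntry = Tuple[float, float, str]


def substructure_candidates(
    measured: List[Tuple[float, float]],
    reference: List[ReferenceEntry],
    mz_tol_ppm: float = 10.0,   # assumed m/z tolerance
    ccs_tol_pct: float = 2.0,   # assumed CCS tolerance
) -> Set[str]:
    """Collect registered structures whose m/z and CCS both match a measured fragment ion."""
    found: Set[str] = set()
    for mz, ccs in measured:
        for ref_mz, ref_ccs, structure in reference:
            mz_ok = abs(mz - ref_mz) <= ref_mz * mz_tol_ppm * 1e-6
            ccs_ok = abs(ccs - ref_ccs) <= ref_ccs * ccs_tol_pct / 100.0
            if mz_ok and ccs_ok:
                found.add(structure)
    return found


if __name__ == "__main__":
    # Illustrative reference entries; the CCS values are not measured data.
    reference = [
        (91.0542, 110.0, "C7H7+"),
        (105.0335, 118.0, "C7H5O+"),
    ]
    measured = [(91.0545, 111.0), (105.0335, 130.0)]
    print(substructure_candidates(measured, reference))
```

In the usage example only the first measured fragment ion agrees with a reference entry in both values, so only that entry's structure is returned as a substructure candidate.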
The candidate structures of the analyte compound can be searched for, for example, by “2-1. Chemical structure database dependent method” of the above-mentioned embodiment 2. Then, each retrieved candidate structure is further checked as to whether or not it contains the substructure retrieved from the reference spectral database.
When the candidate structures are determined without any previously selected structure candidates (that is, without using any chemical structure database), the candidate structures are generated by, for example, a molecular structure generator algorithm as described in “2-2. Chemical structure database independent method” of the above-mentioned embodiment 2.
The present step can be carried out in the same manner as in “3. Estimated (hypothetical) fragment ions structure acquisition step” of the above-mentioned embodiment 2.
The present step can be carried out in the same manner as in “4. Estimated (hypothetical) fragment ion collision cross section acquisition step” of the above-mentioned embodiment 2.
The present step can be carried out in the same manner as in “5. Matching step” of the above-mentioned embodiment 2.
By matching as mentioned above, the structures of various compounds can be further identified by combining various databases and theories.
The fourth embodiment of the present invention is a system for identifying the structure of a compound, and includes a spectral data generation means, a reference spectral database, a matching means and an output means as mentioned above, and the spectral data generation means and the output means are connected to the reference spectral database and the matching means via a communication network outside the system.
In this system, a matching means is provided at the server 170 side, and a reference spectral database is stored in the server 170. For example, spectral data generated by the spectral data generation means 110 is transmitted to the server 170, and the spectral data is matched to the reference spectral data at the server 170, as mentioned above. Also, the matching result is output by the output means 130.
According to the present embodiment, because the server has the matching means, the advanced arithmetic processing required for the matching can be carried out on the server, and as a result, for example, the structure of the compound can be identified at a higher speed. In addition, because the reference spectral database is stored in the server, the database can be updated more frequently than in the case where the reference spectral database is stored in individual client terminals, and as a result, for example, it is possible to identify the structures of more diverse compounds.
The system of the present embodiment may be, for example, a system corresponding to any of the above-mentioned embodiments 1 to 3, and can execute each step explained in the above-mentioned embodiments 1 to 3. For example, the server may include various kinds of databases such as the above-mentioned chemical structure database, the theoretical spectral database and the like. Also, in the system of the present embodiment, the spectral data generation means is a CPU of a client terminal, but the present invention is not limited to this, and, for example, the server may have the spectral data generation means. When such an embodiment is employed, the spectral data generation can be carried out on the server side so that, for example, the structure of the compound can be identified at an even higher speed.
The fifth embodiment of the present invention is an apparatus for identifying the structure of a compound, and includes a spectral data generation means, a reference spectral database and a matching means as mentioned above.
According to the present embodiment, for example, the structure of an analyte compound can be identified with the single apparatus even when it is not in a communication environment. Also, for example, the reference spectral database may be updated in a communication environment.
The apparatus of the present embodiment may be, for example, an apparatus which corresponds to any of the above-mentioned embodiments 1 to 3, and can execute each step explained in the above-mentioned embodiments 1 to 3. For example, the present apparatus may include various kinds of databases such as the above-mentioned chemical structure database, theoretical spectral database and the like. These databases may also be updated, for example, in a communication environment.
The sixth embodiment of the present invention is a method for identifying the structure of a compound, and each step of the method is executed by a computer. The present embodiment includes the above-mentioned steps, including the matching step. The method of the present embodiment may be executed by, for example, the system described in the above-mentioned embodiment 4 or the apparatus described in the above-mentioned embodiment 5, and can execute those parts of the steps explained in the above-mentioned embodiments 1 to 3 that are executed by a computer.
The seventh embodiment of the present invention is a program capable of executing the above-mentioned compound identification method on a computer. The present embodiment may be recorded, for example, on a recording medium. As the recording medium, for example, there may be mentioned a random-access memory (RAM), a read-only memory (ROM), a hard disk (HD), a USB memory, an optical disk, a floppy (registered trademark) disk (FD) and the like.
Hereinabove, the present invention has been explained with reference to the embodiments, but the present invention is not limited to the above-mentioned embodiments. Various modifications that can be understood by those skilled in the art can be made to the constitution and details of the present invention within the scope of the present invention.
All patents, patent applications, publications of patent applications, and other material, such as articles, books, specifications, publications, documents, things, and/or the like, referenced herein are hereby incorporated herein by this reference in their entirety for all purposes, excepting any prosecution file history associated with same, any of same that is inconsistent with or in conflict with the present document, or any of same that may have a limiting effect as to the broadest scope of the claims now or later associated with the present document.
As mentioned above, according to the present invention, it is possible to provide a method and a system for identifying the structure of a compound, which can identify the structures of various compounds. Therefore, the present invention can be applied to a wide range of fields such as physiology, medicine, food, environment and the like.
Number | Date | Country | Kind |
---|---|---|---|
2018-114216 | Jun 2018 | JP | national |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17251523 | Dec 2020 | US |
Child | 18165013 | | US |