Aspects of the present disclosure relate to the analysis of small molecule components of a complex mixture and, more particularly, to a method and associated apparatus and computer program product for analyzing and elucidating the structure of small molecule components or compounds of a complex mixture, with such small molecule analysis including metabolomics, which is the study of small molecules produced by an organism's metabolic processes, or other analysis of small molecules produced through metabolism.
Compounds are diverse and numerous. The total number of compounds found in nature is unknown but is estimated to be at least in the tens of thousands. Ion data repositories (e.g., libraries) contain named compounds. Unnamed compounds are observed in biological samples but are not currently associated with library entries. Unnamed compounds may show significant correlations with disease, genetic variants, and other important biological metadata. As metabolomics data is collected on more and larger human cohorts, the number of biologically significant unnamed compounds is increasing. However, at present, the capability of elucidating the chemical structure of these unnamed compounds is a bottleneck that blocks the development of novel biomarkers, biological insights, and clinical interventions.
As currently practiced, structural elucidation of unnamed compounds is a laborious, time-consuming process that relies on manual examination of individual spectra and database searches driven by human pattern recognition and know-how. Specifically, although the exact mass of an unnamed compound can equate to a small number of candidate molecular formulas, publicly available databases contain either too few (e.g., zero) or far too many (e.g., hundreds) compound structures for each molecular formula. Therefore, in the current elucidation workflow, a human analyst must manually examine the liquid chromatography and tandem mass spectrometry (LC-MS1/MS2) data, at great length, and may still be unable to propose a small number (e.g., 1-5) of testable structural compound candidates. That is, determining the identity and, ultimately, structure of a detected (but unnamed) compound is an important aspect of more thoroughly understanding the compound composition of a sample, but the current process of structural elucidation is manual and time-consuming.
As such, there exists a need for a method and associated apparatus and computer program product for analyzing and elucidating the structure of compound components or compounds of a complex mixture with increased speed and success rate compared to the manual process. It is also desirable to automatically predict key structural features and stratify structural candidates based on the LC-MS/MS characteristics of the unnamed compound.
The above and other needs are met by aspects of the present disclosure which, in some aspects, provide a method and associated apparatus and computer program product for analyzing and elucidating the structure of small molecule components or compounds of a complex mixture, and which determine the structure of new compounds in one or more samples in an automated manner that is faster and more accurate than existing manual methods. This is accomplished, in part, through tools and models built on large amounts of data from both an existing ion repository/chemical library and publicly available sources to quickly generate a list of arithmetically possible molecular formulas for a given molecular mass.
In some instances, the method and associated apparatus and computer program product for analyzing and elucidating the structure of small molecule components or compounds of a complex mixture involve:
1. for each unnamed compound, using precomputed tables to find molecular formulas for the mass of the unnamed compound that are arithmetically possible, satisfy double-bond constraints, are statistically similar to molecular formulas of known compounds, and satisfy isotopic constraints found during MS1 analysis;
2. aggregating the MS2 spectra stored for each unnamed compound to form a consensus MS2 spectrum;
3. annotating each ion of the consensus MS2 spectrum with a possible molecular formula and, using a precomputed table, finding substructures observed in a library corresponding to those molecular formulas;
4. constructing a table showing consistency of ion molecular formulas with unnamed compound molecular formulas;
5. finding the compounds in a library with MS2 spectra that are most similar to the consensus MS2 spectrum of the unnamed compound according to a measure of spectrum-to-spectrum similarity;
6. searching a large private database and/or aggregating public database information for compounds with SMILES strings that plausibly explain the MS2 spectrum using a measure of SMILES-to-spectrum similarity;
7. using precomputed models, predicting the presence of various substructures expressed as SMILES strings from the MS2 spectrum of each unnamed compound;
8. attaching all of the above to a stored representation of the unnamed compound for future reporting; and
9. optionally, when desired, generating a graphical report of all of the above for the human elucidator, including color drawings of molecular structures.
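By way of non-limiting illustration only, the first operation listed above (finding arithmetically possible molecular formulas that also satisfy a double-bond constraint) may be sketched in Python as follows. This sketch enumerates formulas directly rather than consulting precomputed tables. The element set (CHNOPS), the per-element count limits, the 5 ppm tolerance, the assumed valences, and the function names are all assumptions made for illustration and are not taken from the disclosure.

```python
from itertools import product

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
        "O": 15.99491462, "P": 30.97376200, "S": 31.97207117}

def double_bond_equivalents(c, h, n, p):
    """Rings plus double bonds, assuming valences C=4, N=3, P=3, H=1 (O, S divalent)."""
    return c - h / 2 + (n + p) / 2 + 1

def candidate_formulas(mass, ppm=5.0, limits=None):
    """Return (formula_dict, exact_mass) pairs whose mass lies within `ppm` of `mass`."""
    # Hypothetical per-element caps; a real workflow would choose these differently.
    limits = limits or {"C": 30, "N": 5, "O": 8, "P": 1, "S": 1}
    tol = mass * ppm * 1e-6
    hits = []
    ranges = [range(limits[e] + 1) for e in ("C", "N", "O", "P", "S")]
    for c, n, o, p, s in product(*ranges):
        if c == 0:
            continue  # only carbon-containing formulas are considered here
        heavy = (c * MONO["C"] + n * MONO["N"] + o * MONO["O"]
                 + p * MONO["P"] + s * MONO["S"])
        # Pick the hydrogen count that best closes the remaining mass gap.
        h = round((mass - heavy) / MONO["H"])
        if h < 0:
            continue
        exact = heavy + h * MONO["H"]
        if abs(exact - mass) > tol:
            continue
        dbe = double_bond_equivalents(c, h, n, p)
        if dbe < 0 or dbe != int(dbe):
            continue  # reject negative or half-integer values
        hits.append(({"C": c, "H": h, "N": n, "O": o, "P": p, "S": s}, exact))
    return hits

# Usage: list the candidates for a neutral mass of 358.2221 u.
for formula, exact in candidate_formulas(358.2221):
    print(formula, round(exact, 4))
```

In such a sketch, the statistical-similarity and isotopic constraints named in the first operation would be applied as further filters on the returned candidates.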
Aspects of the present disclosure further provide:
One particular aspect of the present disclosure provides a method of analyzing data for one or more samples, the data for each sample being obtained from a component separation and tandem mass spectrometer system including a mass spectrometer conducting a first mass spectrometry step or function (MS1) and a second mass spectrometry step or function (MS2), and structurally elucidating small molecule components of the one or more samples. Such a method comprises, for each sample, determining a molecular mass of a fragment of a candidate compound, determining possible molecular formulas having the molecular mass of the fragment, and aggregating MS2 spectra for each of a plurality of fragments of the candidate compound to form a candidate MS2 spectrum of the candidate compound. Possible molecular formulas and compound structures of an ion in the candidate MS2 spectrum are determined, with the ion comprising one or more fragments, that are consistent with the possible molecular formulas of the fragments. Known compounds having an MS2 spectrum similar to the candidate MS2 spectrum of the candidate compound are determined, and known compounds having a compound structure plausibly corresponding to the MS2 spectrum of the candidate compound are determined. A probability of the MS2 spectrum of each fragment having one or more compound substructures is determined, and a combination of known fragment spectra forming a compound spectrum statistically similar to the candidate MS2 spectrum of the candidate compound is determined. The determined possible molecular formulas and compound structures, determined known compounds, determined compound substructures, and determined combination of known fragment spectra are then associated with the MS2 spectrum of the candidate compound and fragments thereof.
The present disclosure thus includes, without limitation, the following example embodiments:
Example Embodiment 1: A method of analyzing data for one or more samples, the data for each sample being obtained from a component separation and tandem mass spectrometer system including a mass spectrometer conducting a first mass spectrometry step or function (MS1) and a second mass spectrometry step or function (MS2), and structurally elucidating small molecule components of the one or more samples, said method comprising, for each sample, determining a molecular mass of a fragment of a candidate compound; determining possible molecular formulas having the molecular mass of the fragment; aggregating MS2 spectra for each of a plurality of fragments of the candidate compound to form a candidate MS2 spectrum of the candidate compound; determining possible molecular formulas and compound structures of an ion in the candidate MS2 spectrum, the ion comprising one or more fragments, that are consistent with the possible molecular formulas of the fragments; determining known compounds having an MS2 spectrum similar to the candidate MS2 spectrum of the candidate compound; determining known compounds having a compound structure plausibly corresponding to the MS2 spectrum of the candidate compound; determining a probability of the MS2 spectrum of each fragment having one or more compound substructures; determining a combination of known fragment spectra forming a compound spectrum statistically similar to the candidate MS2 spectrum of the candidate compound; and associating the determined possible molecular formulas and compound structures, determined known compounds, determined compound substructures, and determined combination of known fragment spectra with the MS2 spectrum of the candidate compound and fragments thereof.
Example Embodiment 2: The method of any preceding example embodiment, or combinations thereof, wherein determining possible molecular formulas having the molecular mass of the fragment, comprises determining arithmetically possible molecular formulas for the molecular mass of the fragment, with the arithmetically possible molecular formulas satisfying double-bond constraints, being statistically similar to molecular formulas of known metabolites, and satisfying isotopic constraints from MS1 analysis.
Example Embodiment 3: The method of any preceding example embodiment, or combinations thereof, wherein determining possible molecular formulas and compound structures of an ion in the candidate MS2 spectrum consistent with the possible molecular formulas of the fragments thereof, comprises determining possible isomeric substructures corresponding to the possible molecular formulas and compound structures of the ion.
Example Embodiment 4: The method of any preceding example embodiment, or combinations thereof, wherein determining known compounds having a compound structure plausibly corresponding to the MS2 spectrum of the candidate compound, comprises determining known compounds each having a SMILES string identifier plausibly corresponding to the MS2 spectrum of the candidate compound according to a measure of SMILES-to-spectrum similarity.
Example Embodiment 5: The method of any preceding example embodiment, or combinations thereof, wherein determining known compounds having a compound structure plausibly corresponding to the MS2 spectrum of the candidate compound, comprises ranking plausible molecular formulas of the determined known compounds based on statistical similarity of the plausible molecular formulas to molecular formulas in a compound library.
Example Embodiment 6: The method of any preceding example embodiment, or combinations thereof, wherein determining a probability of the MS2 spectrum of each fragment having one or more compound substructures, comprises predicting whether one or more compound substructures, each expressed as a SMILES string, is present in the fragment from the MS2 spectrum of each fragment.
Example Embodiment 7: An apparatus for analyzing data for one or more samples, the data for each sample being obtained from a component separation and tandem mass spectrometer system including a mass spectrometer for conducting a first mass spectrometry step or function (MS1) and a second mass spectrometry step or function (MS2), the apparatus comprising a processor and a memory storing executable instructions that, in response to execution by the processor, cause the apparatus to at least perform the method steps of any preceding example embodiment, or combinations thereof.
Example Embodiment 8: A computer program product for analyzing data for one or more samples, the data for each sample being obtained from a component separation and tandem mass spectrometer system including a mass spectrometer for conducting a first mass spectrometry step or function (MS1) and a second mass spectrometry step or function (MS2), the computer program product comprising at least one non-transitory computer readable storage medium having computer-readable program code stored thereon, the computer-readable program code comprising program code for performing the method steps of any preceding example embodiment, or combinations thereof.
These and other features, aspects, and advantages of the present disclosure will be apparent from a reading of the following detailed description together with the accompanying drawings, which are briefly described below. The present disclosure includes any combination of two, three, four, or more features or elements set forth in this disclosure, regardless of whether such features or elements are expressly combined or otherwise recited in a specific embodiment description herein. This disclosure is intended to be read holistically such that any separable features or elements of the disclosure, in any of its aspects and embodiments, should be viewed as intended, namely to be combinable, unless the context of the disclosure clearly dictates otherwise.
It will be appreciated that the summary herein is provided merely for purposes of summarizing some example aspects so as to provide a basic understanding of the disclosure. As such, it will be appreciated that the above described example aspects are merely examples and should not be construed to narrow the scope or spirit of the disclosure in any way. It will be appreciated that the scope of the disclosure encompasses many potential aspects, some of which will be further described below, in addition to those herein summarized. Further, other aspects and advantages of such aspects disclosed herein will become apparent from the following detailed description taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the described aspects.
Having thus described the disclosure in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
The present disclosure now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all aspects of the disclosure are shown. Indeed, the disclosure may be embodied in many different forms and should not be construed as limited to the aspects set forth herein; rather, these aspects are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.
The various aspects of the present disclosure mentioned above, as well as many other aspects of the disclosure, are described in further detail herein. The apparatuses, methods, and computer program products associated with aspects of the present disclosure are exemplarily disclosed, in some instances, in conjunction with an appropriate analytical device which may, in some instances, comprise a separator portion or separation portion (e.g., a chromatograph) and/or a detector portion (e.g., a spectrometer). One skilled in the art will appreciate, however, that such disclosure is for exemplary purposes only to illustrate the implementation of various aspects of the present disclosure. Particularly, the apparatuses, methods, and computer program products associated with aspects of the present disclosure can be adapted to any number of processes that are used to generate complex sets of data for each sample (e.g., within a single sample), or over/across a plurality of samples, whether biological, chemical, or biochemical, in nature. For example, aspects of the present disclosure may be used with and applied to a variety of different analytical devices and processes including, but not limited to: analytical devices including a separator portion (or “component separator” or “component separation” portion) comprising a liquid chromatograph (LC), a gas chromatograph (GC), a supercritical fluid chromatograph (SFC), a capillary electrophoresis (CE) analyzer; a cooperating detector portion (or “mass spectrometer” portion) comprising a mass spectrometer (MS); an ion mobility spectrometry mass spectrometer (IMS-MS); and an electrochemical array (EC); and/or combinations thereof (e.g., a tandem mass spectrometer including MS1 and MS2 functionality). In some aspects of the present disclosure, the detector portion may be used without a separator portion.
In this regard, one skilled in the art will appreciate that the aspects of the present disclosure as disclosed herein are not limited to metabolomics analysis. For example, the aspects of the present disclosure as disclosed herein can be implemented in other applications where there is a need to characterize or analyze small molecules present within a sample or complex mixture, regardless of the origin of the sample or complex mixture. For instance, the aspects of the present disclosure as disclosed herein can also be implemented in a bioprocess optimization procedure where the goal is to grow cells to produce drugs or additives, or in a drug metabolite profiling procedure where the goal is to identify all metabolites that are the result of biotransformations of an administered xenobiotic. Some other non-limiting examples of other applications could include a quality assurance procedure for consumer product manufacturing where the goal may be to objectively ensure that desired product characteristics are met, in procedures where a large number of sample components can give rise to a particular attribute, such as taste or flavor (e.g., cheese, wine or beer), or scent/smell (e.g., fragrances). One common theme thus exhibited by the aspects of the present disclosure as disclosed herein is that the small molecules in the sample can be analyzed using the various apparatus, method and computer program product aspects disclosed herein.
For example, the components of a particular sample 100 may pass through a column associated with the separator portion/separation portion at different rates and exhibit different spectral responses (e.g., associated with intensity as a function of retention time), as detected by the first mass spectrometer functionality (MS1) of the detector portion, based upon their specific characteristics. The second mass spectrometer functionality (MS2) adds a second phase of mass fragmentation which may be implemented, for example, to facilitate quantitation of low levels of compounds in the presence of a high sample matrix background. As will be appreciated by one skilled in the art, the analytical device 110 may generate a set of spectrometry data, corresponding to each sample 100 and having three or more dimensions (e.g., quantifiable sample properties) associated therewith, wherein the data included in the data set generally indicates the composition (e.g., sample components) of the sample 100. In some aspects, the data set may comprise, for example, data for each sample related to retention time, sample or component (ion) mass, intensity, or even sample indicia or identity. However, such data must first be appropriately analyzed in order to determine the sample composition (e.g., ions, metabolites).
In some instances, a three-dimensional data set (MS1 or MS2) for each of one or more samples may be selected or otherwise designated for further analysis, with each dimension corresponding to a quantifiable sample property. An example of such a three-dimensional set of spectrometry data is shown generally in
According to other aspects of the present disclosure, different analytical devices may be used to generate a three or more dimensional set of analytical data corresponding to the sample 100. For example, the analytical device may include, but is not limited to: various combinations of a separator portion/separation portion comprising one of a liquid chromatograph (LC) (positive or negative channel) and a gas chromatograph (GC), a supercritical fluid chromatograph (SFC), a capillary electrophoresis (CE) analyzer; and a cooperating detector portion comprising one of a mass spectrometer (MS); an ion mobility spectrometer (IMS), a tandem mass spectrometer (MS1 and MS2); and an electrochemical array (EC). In some aspects, the analytical device may include a detector portion without a separator portion. One skilled in the art will appreciate that such complex three or more dimensional data sets may be generated by other appropriate analytical devices that may be in communication with components of aspects of the present disclosure as described in further detail herein.
One or more samples 100 may be taken individually from a well plate 120 and/or from other types of sample containers and introduced individually into the analytical device 110 for analysis and generation of the corresponding three or more dimensional data set (see, e.g.,
As shown in
Furthermore, the analytical device 110 may be in communication with one or more processor devices 130 (and associated user interfaces and/or displays 150) via a wire line and/or wireless computer network including, but not limited to: the Internet, local area networks (LAN), wide area networks (WAN), or other networking types and/or techniques that will be appreciated by one skilled in the art. The user interface/display 150 may be used to receive user input and to convey output such as, for example, displaying any or all of the communications involving the system, including the manipulations and analyses of sample data disclosed herein, as will be understood and appreciated by one skilled in the art. The database may be structured using commercially available software, such as, for example, Oracle, Sybase, DB2, or other database software. As shown in
The processor device 130 may, in some aspects, be capable of converting each of the data sets, each including, for example, data indicating a relationship between various sample parameters such as ion mass, retention time, and intensity (see, e.g.,
According to some aspects, the processor device 130 may be configured to selectively execute the executable instructions/computer-readable program code portions stored by the memory device 140, if necessary, in cooperation with the ion data repository/library/database also stored by the memory device 140, so as to accomplish, for instance, the identification, quantification, representation, curation, and/or other analysis of a selected sample component (i.e., a metabolite, molecule, or ion, or portion thereof) in each of the plurality of samples (or within a single sample), from the two-dimensional data set representing the respective sample among the plurality of samples. In doing so, the sample component of interest from the sample to be analyzed is first determined from at least one known characteristic associated with the sample. The at least one known characteristic associated with the sample may include, for example, at least a general type or classification, a source, etc. In some aspects, the at least one known characteristic may involve a particular nature of the sample, wherein the particular nature of the sample may vary considerably, from generally comprising mixtures or complex mixtures including small molecules, to particularly and exemplarily including, without limitation: blood samples, urine samples, cell cultures, saliva samples, plant tissue and organs (e.g., leaves, roots, stems, flowers, etc.), plant extracts, culture media, membranes, cellular compartments/organelles, cerebral spinal fluid (CSF), milk, soda products, food products (e.g., yogurt, chocolate, juice), and/or other types of biological, chemical, and/or biochemical samples. The at least one known characteristic, in particular aspects, indicates which metabolites and/or chemical/molecular components of interest may be present in that sample (or which metabolites and/or chemical/molecular components which are not expected to be in the sample). 
That is, in addition to data regarding discrete particular ions, the ion data repository/library/database may also include empirical data or other information associated with the known characteristic of the sample.
Accordingly, upon identifying the at least one known characteristic of the sample or receiving the at least one known characteristic as an input via the user interface/display 150, the processor 130 may be configured to execute computer-readable program code portions stored by the memory device 140 for implementing the empirical data and other information to correlate the one or more known characteristics with one or more particular ions, small molecules or metabolites expected to be present in such a sample having that known characteristic. That is, in some aspects, such information and empirical data associated with the one or more known characteristics provides a context to the sample and the data obtained therefrom, wherein the context provides an indicium at least as to a basic component or constituent of the sample, or where relevant data may be located within the ion data repository/database. In turn, the particular identifying data associated with the indicium of the basic component or constituent, or information location within the ion data repository/database, further indicates candidate ions, compounds and components that may be present or are expected or predicted to be present in the sample under analysis. That is, in particular aspects, comparing the known characteristic to empirical data included in the ion data repository, wherein the empirical data includes relational information between known characteristics and certain ions, allows the determination therefrom of the one or more ions corresponding to the known characteristic(s). 
In particular aspects of the disclosure, the selecting, based on the known characteristic, of one or more ions from the ion data repository expected to be included in the sample may be facilitated by more extensive information and empirical data received and housed within the ion data repository/database, wherein any “learning” by the processor 130 represents efficiencies and accuracies gained from additional correlative information.
According to particular aspects, as shown in
Such a method comprises, for each sample, determining a molecular mass of a fragment of a candidate compound (Block 530); determining possible molecular formulas having the molecular mass of the fragment (Block 535); and aggregating MS2 spectra for each of a plurality of fragments of the candidate compound to form a candidate MS2 spectrum of the candidate compound (Block 540). Possible molecular formulas and compound structures of an ion in the candidate MS2 spectrum, the ion comprising one or more fragments, are determined that are consistent with the possible molecular formulas of the fragments (Block 545), and known compounds having an MS2 spectrum similar to the candidate MS2 spectrum of the candidate compound are also determined (Block 550). In further aspects, known compounds having a compound structure plausibly corresponding to the MS2 spectrum of the candidate compound are determined (Block 555), a probability of the MS2 spectrum of each fragment having one or more compound substructures is determined (Block 560), and a combination of known fragment spectra forming a compound spectrum statistically similar to the candidate MS2 spectrum of the candidate compound is determined (Block 565). The determined possible molecular formulas and compound structures, determined known compounds, determined compound substructures, and determined combination of known fragment spectra are then associated with the MS2 spectrum of the candidate compound and fragments thereof (Block 570).
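By way of non-limiting illustration only, the aggregation of MS2 spectra into a single candidate (consensus) spectrum may be sketched in Python as follows. The disclosure does not specify the aggregation rule; the rule used here (grouping peaks whose masses lie within a fixed tolerance of the preceding peak, keeping groups observed in a minimum fraction of scans, and averaging) is an assumption, as are the function name and the parameters `mz_tol` and `min_fraction`.

```python
def consensus_spectrum(spectra, mz_tol=0.005, min_fraction=0.5):
    """Merge several MS2 scans, each a list of (mz, intensity), into one spectrum.

    Assumed rule: peaks are chained into a group while the gap to the previous
    peak is at most `mz_tol`; a group is kept only if it appears in at least
    `min_fraction` of the scans.
    """
    # Pool all peaks with positive intensity, remembering the scan of origin.
    pooled = sorted((mz, inten, i)
                    for i, scan in enumerate(spectra)
                    for mz, inten in scan if inten > 0)
    groups, current = [], []
    for peak in pooled:
        if current and peak[0] - current[-1][0] > mz_tol:
            groups.append(current)
            current = []
        current.append(peak)
    if current:
        groups.append(current)

    consensus = []
    for group in groups:
        scans_seen = {i for _, _, i in group}
        if len(scans_seen) / len(spectra) < min_fraction:
            continue  # peak not reproducible enough across scans
        total = sum(inten for _, inten, _ in group)
        # Intensity-weighted mean mass; intensity averaged over all scans.
        mz = sum(m * inten for m, inten, _ in group) / total
        consensus.append((mz, total / len(spectra)))
    return consensus

# Usage with three toy scans (illustrative values only).
scans = [[(100.000, 10.0), (150.001, 5.0)],
         [(100.002, 30.0)],
         [(100.001, 20.0), (200.000, 1.0)]]
print(consensus_spectrum(scans))
```

Chaining peaks by their gap to the previous peak is simple but can merge long runs of closely spaced peaks; a fixed-width binning scheme would be one alternative.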
Further aspects of the methods, apparatuses, and computer program products disclosed herein are as follows:
1. A criterion (harmony) for identifying plausible molecular formulas based on statistical similarity to molecular formulas in a compound library.
A formula that is arithmetically consistent with a given accurate mass may still not be chemically possible. Assessing “chemical possibility” computationally is difficult, requiring analysis of volumes of atoms, distribution of charges, etc. The goal is to determine the molecular formula of metabolites specifically by asking the question of interest: “Is this formula plausibly the formula of a metabolite?”
In instances where the formulas of (e.g., several thousand) metabolites are available in a library, and wherein the formulas of many more metabolites could be retrieved from public databases, the question can be rephrased: “Is this formula like the formulas of compounds that can currently be detected?” In statistical learning theory, this is known as the anomaly detection problem.
One goal is to assign to any formula a number that is relatively high for formulas resembling those already in the library, and low for formulas that are not seemingly plausible. As an example, C18H35N2OPS is arithmetically consistent with accurate mass 358.2221 u, but may not necessarily correlate to the metabolites that can be or have been detected (e.g., in context). The number assigned to C18H35N2OPS by the process described below is thus 1 (e.g., on a scale from 0 to 100), wherein the assigned number represents the harmony of that formula.
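By way of non-limiting illustration only, the arithmetic consistency of a formula with a given mass may be checked as sketched below in Python. The sketch treats 358.2221 u as the observed mass, which is an interpretation of the example above, and uses a 5 ppm tolerance as an assumed threshold; the function names are hypothetical.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
        "O": 15.99491462, "P": 30.97376200, "S": 31.97207117}

def parse_formula(formula):
    """Turn a string such as 'C18H35N2OPS' into {'C': 18, 'H': 35, ...}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def exact_mass(formula):
    """Monoisotopic mass of a neutral formula built from the elements in MONO."""
    return sum(MONO[e] * n for e, n in parse_formula(formula).items())

def ppm_error(theoretical, observed):
    """Signed mass error in parts per million relative to the observed mass."""
    return (theoretical - observed) / observed * 1e6

observed = 358.2221  # assumed to be the observed mass in the example
for formula in ("C18H35N2OPS", "C16H30N4O5"):
    mass = exact_mass(formula)
    print(formula, round(mass, 4), round(ppm_error(mass, observed), 2))
```

Under these assumptions, both C18H35N2OPS and C16H30N4O5 fall within 5 ppm of 358.2221 u, which illustrates why arithmetic consistency alone cannot select between candidate formulas.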
To use any anomaly detection algorithm, it will be necessary to define the “features” of every formula. An effective set of features (with values for the example C18H35N2OPS, exact mass 358.2208 u), in one example, could be:
Any of a number of published anomaly detection algorithms could be applied, but the one chosen to exemplify the method is Local Outlier Factor (LOF). (Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (2000). LOF: Identifying Density-based Local Outliers. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD. pp. 93-104. doi: 10.1145/335191.335388. ISBN 1-58113-217-4.)
The inputs to the process are (a) the features of all the entries in the library and (b) the features of the formula to be assessed as a possible anomaly. The methodology begins with a “training” phase on the library entries' features (a), which can be carried out once for all subsequent use (such a process may take, for example, on the order of a few minutes). The training phase assigns a local outlier factor (LOF) to every formula in the library. Formulas with lower LOFs are identified as “unusual”. The distribution of LOFs for the library entries has a long lower tail.
The LOF for the example formula C18H35N2OPS is −0.645, putting it among the lowest 1% of library formulas, and its harmony is reported as 1. Another formula for the same mass, C16H30N4O5, has an LOF of 0.228 at the 59th percentile, harmony=59. Other evidence (e.g., context) also indicates that C16H30N4O5 is the correct formula for the feature detected at this mass.
The training phase may be time-consuming, and it is desirable in any instance to precompute the statistical model once so as to give consistent results across runs of the program. Therefore, the results of the training phase are saved in a persistent form (for example, Python pickle files), and the harmony of each query formula is quickly computed using the saved statistical model.
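By way of non-limiting illustration only, the train-once, score-many arrangement described above may be sketched in Python as follows. The sketch is a simplified, pure-Python rendering of the local outlier factor computation and is not the implementation used to obtain the values reported herein. The feature vectors, the neighborhood size `k`, the sign convention (the negated LOF is used so that lower scores are more unusual), and the mapping of a score to harmony as a percentile rank among library scores are all assumptions.

```python
import math
import pickle
from bisect import bisect_right

def _neighbors(point, data, k, skip=None):
    """The k nearest (distance, index) pairs to `point`, optionally skipping one index."""
    dists = sorted((math.dist(point, q), j) for j, q in enumerate(data) if j != skip)
    return dists[:k]

def _lrd(neigh, kdist):
    """Local reachability density: inverse of the mean reachability distance."""
    reach = [max(kdist[j], d) for d, j in neigh]
    return 1.0 / (sum(reach) / len(reach) + 1e-10)  # small term avoids division by zero

def train(library, k=3):
    """Precompute everything needed to score later queries against the library."""
    neigh = [_neighbors(p, library, k, skip=i) for i, p in enumerate(library)]
    kdist = [nb[-1][0] for nb in neigh]
    lrd = [_lrd(nb, kdist) for nb in neigh]
    # Negated LOF, so that lower scores mean "more unusual" (assumed convention).
    scores = sorted(-sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
                    for i, nb in enumerate(neigh))
    return {"library": library, "k": k, "kdist": kdist, "lrd": lrd, "scores": scores}

def harmony(model, features):
    """Percentile rank (0-100) of the query's score among the library scores."""
    nb = _neighbors(features, model["library"], model["k"])
    lrd_q = _lrd(nb, model["kdist"])
    score = -sum(model["lrd"][j] for _, j in nb) / (len(nb) * lrd_q)
    return round(100 * bisect_right(model["scores"], score) / len(model["scores"]))

# Usage: hypothetical two-dimensional feature vectors standing in for formula features.
library = [(x, y) for x in range(5) for y in range(5)]
saved = pickle.dumps(train(library))   # persist the trained model once
model = pickle.loads(saved)            # reload it for later runs
print(harmony(model, (2, 2)), harmony(model, (20, 20)))
```

Persisting the trained model means every subsequent query is scored against identical library statistics, which is what gives consistent results across runs.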
If the concept herein for identifying metabolites is expanded to identifying formulas that are plausibly drugs or plausibly in another category, the same analysis principles herein will apply.
2. MS2SMILES criterion for assessing similarity of the MS/MS fragmentation spectrum of an unknown molecule to the structure of a known molecule as represented by a SMILES string.
To search for similarities between X-compounds, which have known spectra but unknown SMILES strings, and known compounds in a database, which have SMILES strings but no spectra, it is useful to have a measure of similarity between a spectrum and a SMILES string. This measure has been termed MS2SMILES.
A SMILES string can be broken into fragments, wherein each fragment has an exact mass, but no intensity. On the other hand, a spectrum has observed accurate masses and measured intensities. If the spectrum and SMILES string correspond, it is expected that some of the spectral masses will match some of the fragment masses. However, some fragments are never directly represented in any analyzed spectrum, and many spectral masses arise from processes other than the simple breakage of a single covalent bond.
To measure the similarity of a spectrum to a SMILES string, the subset of the spectral masses that match fragment masses (e.g., within ±5 ppm) is determined; these are termed the assigned masses. The intensities of the assigned masses are then summed to provide a raw score.
If comparing a single spectrum to a plurality of SMILES strings, it may suffice to compare raw scores to find the “best” match, but the raw score itself gives little indication of absolute quality of the match. It would thus be helpful to know what fraction of all intensities has been assigned to fragment masses. However, this number will be diluted by spectral masses that can't ever arise from any single-bond-breakage process in any plausible molecule; roughly ⅓ of the masses in the spectra in the library fall into this category. These masses are effectively unassignable in any circumstance. To identify them requires a list of assignable exact masses. Given a large database of SMILES strings, all can be fragmented to find a large set of fragments, and the masses of these fragments are the assignable exact masses.
The final MS2SMILES score is:
(sum of intensities of assigned masses)/(sum of intensities of assignable masses)
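The ratio above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function names, the toy spectrum, and the mass lists are hypothetical, the tolerance defaults to 5 ppm, and an ion is assumed to count as assigned only if it is also assignable, which keeps the score at or below 1.

```python
def within_ppm(observed, reference, ppm):
    """True if observed lies within ppm parts-per-million of reference."""
    return abs(observed - reference) <= reference * ppm * 1e-6


def ms2smiles_score(spectrum, fragment_masses, assignable_masses, ppm=5.0):
    """spectrum is a list of (mass, intensity) pairs.

    fragment_masses: exact masses of the candidate's in silico fragments.
    assignable_masses: exact masses of fragments from the whole database.
    """
    assigned = 0.0
    assignable = 0.0
    for mass, intensity in spectrum:
        if any(within_ppm(mass, m, ppm) for m in assignable_masses):
            assignable += intensity
            if any(within_ppm(mass, m, ppm) for m in fragment_masses):
                assigned += intensity
    return assigned / assignable if assignable else 0.0


# Hypothetical toy data; these are not values from any real spectrum.
toy_spectrum = [(100.0004, 60.0), (150.0, 30.0), (200.0, 10.0), (250.0, 20.0)]
toy_fragments = [100.0, 200.0]            # fragments of the candidate SMILES
toy_assignable = [100.0, 150.0, 200.0]    # 250.0 is unassignable

score = ms2smiles_score(toy_spectrum, toy_fragments, toy_assignable)
```

In the toy data, the ion at 250.0 is excluded from the denominator because no database fragment explains it, so the score is 70/100 rather than 70/120.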
As an example, here is the matching of spectrum X-21821 (an X-compound) and the SMILES string of indolyl-3-acryloylglycine (IAG):
The MS2 spectrum of X-21821 appears in tabular form in the left two columns of the Table below. Three masses (74, 117, 143) do not correspond to any known fragments for any compound in the database, so these intensities are not assignable in column 3. Of the 12 assignable masses, only three can be assigned to the fragments of IAG. The total assigned intensity is 138.6, and the total assignable intensity is 166.4. The MS2SMILES score is therefore 138.6/166.4 = 0.833, a very high score (1 being a “perfect score”). X-21821 was confirmed to be IAG. The structure of IAG is shown below.
It is generally expected that many fragments are not necessarily matched by ions. As such, it is not anticipated that every fragment found theoretically by in silico fragmentation will be realized during mass spectrometry. On the other hand, many assignable ions are not assigned. This is likely because the same accurate mass can be generated by both single-bond-breaking and other processes. Another possible source of ions is another compound that contaminated the MS2 of the compound of interest. Since the ions that were assigned in the example included the two largest ions, the score is still high. The fact that one of the ions also matches a very large fragment of the SMILES string for IAG, one representing a large fraction of its mass, further supports the hypothesis that the spectrum does correspond to IAG. On the other hand, for example, the spectral mass 116.0505 could be any fragment with molecular formula CoHN (or other formulas of similar mass).
To search for similarities between unnamed compounds with known spectra but unknown SMILES string, and known compounds with SMILES strings but no spectra in a database, it is useful to have the MS2SMILES measure of similarity between a spectrum and a SMILES string. This matching involves, for example, matching masses and losses in the observed spectrum to masses and losses that could potentially occur in the compound represented by the SMILES string when subjected to fragmentation in a mass spectrometer. There is no expectation that every potential ion or loss suggested for a SMILES string will be found in the observed spectrum. There is generally no attempt to predict a complete spectrum with intensities; rather, only the presence or absence of an ion/loss is of interest.
Predictions can be generated in two ways:
Given a SMILES string, an observed spectrum, and the set of models, the compatibility of the SMILES string with the spectrum can be assessed by deterministically suggesting ions (and losses) from the SMILES string according to A and then statistically predicting ions and losses for the SMILES string according to B. The match is then evaluated by looking at what “fraction” of the observed spectrum is accounted for by the suggestions and predictions for the SMILES string. In such an aspect, that “fraction” is defined as the ratio of the summed intensities of the observed ions matched by the SMILES string to the sum of the intensities of all observed ions.
In general, there may be thousands of models for observed ions. Evaluating each for a SMILES string could require an impractical amount of computation if carried out for many SMILES strings. However, this is unnecessary when comparing a single spectrum because at most only a few dozen models are of interest, namely the models for the ions (and losses) actually observed.
3. A methodology for quickly searching for all compounds similar to a fragmentation spectrum in a large (>100,000 entries) database of compounds represented by SMILES strings (RapidSMatch) using the MS2SMILES criterion.
With a large database of compounds with SMILES assembled from public sources, it is desirable to be able to search the database quickly for all compounds similar to a single query spectrum of an X-compound according to the MS2SMILES criterion, since the compounds with the highest scores will be candidate structures for the X-compound. It is not preferred to search compound by compound mainly because each SMILES string in the database would need to be fragmented for each query. Fragmentation is a processor-intensive operation. It is preferable to pre-compute fragmentations once and save those fragmentations in a computer file storage format (e.g., disk file) or other persistent data format. Once fragmentation is completed, other chemical structures enabling fast processing of query spectra can be precomputed and saved as well.
For each SMILES string in the database:
In particular aspects, while the precomputation step is performed one time for the entire database, it operates on one SMILES string at a time. Therefore, adding new strings to the database is readily accomplished as those strings become available from public or other sources.
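One way to picture a precomputed structure that supports both fast queries and one-string-at-a-time additions is an inverted index from fragment mass to compound. The sketch below is an assumption for illustration and is not the disclosed RapidSMatch procedure: `FragmentIndex`, `BIN_WIDTH`, and the toy masses are hypothetical, fragmentation is taken as already done, and matching uses coarse fixed-width bins instead of a ppm tolerance, so hits would still need rescoring with the MS2SMILES criterion.

```python
from collections import defaultdict

BIN_WIDTH = 0.01  # hypothetical bin width in Da, chosen only for illustration


def mass_bin(mass):
    """Map a mass to an integer bin number."""
    return int(round(mass / BIN_WIDTH))


class FragmentIndex:
    """Inverted index: mass bin -> ids of compounds with a fragment there."""

    def __init__(self):
        self.bins = defaultdict(set)

    def add_compound(self, compound_id, fragment_masses):
        """Index one compound's precomputed fragment masses."""
        for mass in fragment_masses:
            self.bins[mass_bin(mass)].add(compound_id)

    def query(self, spectrum):
        """Return {compound_id: summed intensity of coarsely matched ions}."""
        totals = defaultdict(float)
        for mass, intensity in spectrum:
            b = mass_bin(mass)
            hits = set()
            for neighbor in (b - 1, b, b + 1):  # tolerate bin-edge effects
                hits |= self.bins.get(neighbor, set())
            for compound_id in hits:
                totals[compound_id] += intensity
        return dict(totals)


# Hypothetical toy entries; real fragment masses would come from fragmentation.
index = FragmentIndex()
index.add_compound("A", [100.0, 200.0])
index.add_compound("B", [150.0])
raw_scores = index.query([(100.001, 60.0), (150.0, 30.0), (300.0, 10.0)])
```

Since `add_compound` touches only the bins for the new compound's fragments, a newly available SMILES string can be indexed without reprocessing the rest of the database.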
4. A method for predicting the RI of a molecule based solely on its SMILES string or its molecular formula, based on a statistical analysis of compounds in the library.
The RI (retention index) of a compound is a property of an individual compound and is determined by the peculiarities of the particular LC process employed. RI depends on the three-dimensional structure of the compound (mainly what substructures are on the surface of the molecule interacting with the static and mobile phases of the LC setup). As such, RI generally cannot be predicted with high accuracy from the two-dimensional SMILES string. However, even a very rough prediction of RI is sufficient to eliminate many candidate structures for an unknown spectrum.
In alternate aspects, two methods can be implemented, one based solely on the molecular formula (“one-dimensional structure”) of the molecules, and another based on the SMILES string.
Method 1: RI Prediction by Linear Regression from Molecular Formula Only.
Precomputation: Use compounds from the library to create a simple linear regression model. Independent variables are the counts of atoms of the elements represented in the library compounds. The dependent variable is the RI. The coefficients of the linear model are made persistent.
Query: The coefficients of the linear model are applied to the counts of atoms of the query formula or SMILES, giving a predicted RI.
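The query step of Method 1 can be sketched as follows. This is a minimal sketch under stated assumptions: the intercept and per-element coefficients are hypothetical placeholders rather than fitted values, the function names are illustrative, the formula parser handles only simple formulas without parentheses or charges, and elements absent from the model are assumed to contribute zero.

```python
import re


def atom_counts(formula):
    """Count atoms per element in a simple formula such as 'C16H30N4O5'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def predict_ri(formula, intercept, coefficients):
    """Apply saved linear-model coefficients to the atom counts."""
    counts = atom_counts(formula)
    # Assumption: elements without a coefficient contribute nothing.
    return intercept + sum(
        coefficients.get(element, 0.0) * n for element, n in counts.items()
    )


# Hypothetical coefficients for illustration; not values fitted to any library.
toy_intercept = 1000.0
toy_coefficients = {"C": 100.0, "H": 10.0, "N": -50.0, "O": -20.0}

counts = atom_counts("C16H30N4O5")
predicted = predict_ri("C16H30N4O5", toy_intercept, toy_coefficients)
```

With the placeholder coefficients the prediction is 1000 + 1600 + 300 - 200 - 100 = 2600; in practice the coefficients would be those made persistent by the precomputation step.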
The second method uses an estimate of logP, the octanol-water partition coefficient, as an independent variable. logP is defined experimentally, but it can be estimated from a SMILES string using various software packages. The software package chosen for the exemplification herein is RDKit. As estimates vary among software packages, it is important to use the same software package for all estimates.
In addition, for an existing ion repository, SMARTS matching was used to count the number of primary amine groups (—NH2, or *N in SMILES notation) in the SMILES strings. Adding this independent variable may improve accuracy.
In other aspects, other SMILES fragments may be later found that may correlate with RI and those SMILES fragments added as independent variables.
Method 2: RI Prediction by Linear Regression from SMILES.
Precomputation: Use compounds from the library to create a linear regression model. Independent variables are:
Since these two methods implement statistical models built on relatively simple chemical properties of the compounds in the library, these methods may not necessarily be reliably extrapolated to formulas not statistically similar to the formulas in the implemented library (e.g., for formulas with low harmony).
A graphical report for the elucidation analysis herein is generated and displayed on a display for human evaluation, including color drawings of molecular structures.
Aspects of the present disclosure thus provide methods of analyzing and elucidating metabolomics data from a LC/tandem MS system, as disclosed herein. In addition to providing appropriate apparatuses and methods, aspects of the present disclosure also provide associated computer program products for performing the functions/operations/steps disclosed herein, in the form of, for example, a non-transitory computer-readable storage medium (i.e., memory device 140,
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these disclosed embodiments pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that embodiments of the invention are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the invention. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the disclosure. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated within the scope of the disclosure. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
It should be understood that although the terms first, second, etc. may be used herein to describe various steps or calculations, these steps or calculations should not be limited by these terms. These terms are only used to distinguish one operation or calculation from another. For example, a first calculation may be termed a second calculation, and, similarly, a second step may be termed a first step, without departing from the scope of this disclosure. As used herein, the term “and/or” and the “/” symbol includes any and all combinations of one or more of the associated listed items.
As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes”, and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Therefore, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/IB2022/057633 | 8/15/2022 | WO | |

| Number | Date | Country | |
|---|---|---|---|
| 63233594 | Aug 2021 | US | |