The present invention relates generally to the analysis of the entire collection of organic chemical reactions and compounds reported in the literature over the past two centuries in the form of a complex network in either normal, one-mode graph or bipartite graph representations. Specifically, the invention relates to methods, algorithms, computer-readable storage media and other applications derived from the graph-theoretical analysis of this network.
The synthesis of organic compounds is one of the most important and creative pursuits in modern science, requiring not only technical expertise, but also imagination, intuition and individual judgment (Tietz, L. et al., Angew. Chem. Int. Ed. Engl., 1993, 32, 131; Corey, E. J. et al., The Logic of Chemical Synthesis, Wiley-Interscience, New York, 1995; Nicolaou, K. C. et al., Angew. Chem. Int. Ed., 2000, 39, 44). Sometimes, as in the title of Nicolaou's classic review cited above, chemical synthesis is equated with art, which, by definition, reflects individual imagination and often defies convention, statistics and order. Yet the universe of chemistry that humans are collectively creating, one comprising millions upon millions of known reactions and compounds, is surprisingly well-ordered, and its evolution obeys trends that have not changed since the pioneering times of Lavoisier.
On the most abstract level, the millions of known chemicals and reactions constituting organic chemistry can be represented as a complex network, in which compounds correspond to nodes and reactions to directed connections between these nodes (Albert, R. et al., Rev. Mod. Phys., 2002, 74, 47). Recently, it has been shown that such a network has a scale-free topology similar to that of the World Wide Web and that, by analyzing its time evolution, it is possible to derive statistical laws that describe and also predict how and which types of molecules could be synthesized (Grzybowski, B. A. et al., Angew. Chem. Int. Ed., 2005, 44, 7263). This scale-free topology has also been used to demonstrate the existence of a small set of strongly connected, chemically diverse core compounds from which the majority of other known organic compounds in the periphery can be made in three or fewer synthetic steps, and that these core compounds are surrounded by small island compounds that do not connect either to the core or to the periphery (Grzybowski, B. A. et al., Angew. Chem. Int. Ed., 2006, 45, 5348). Utilizing such a network could have many applications.
One such application concerns chemical warfare. With the increasing risks associated with terrorist organizations, chemical weapons might be considered an ideal mode of attack, since they are both cheap and easy to transport. In addition, many of these deadly substances can now be synthesized readily from commercially available substrates using synthetic procedures available from public sources. Indeed, the 1995 terrorist attack in the Tokyo subway was carried out with sarin synthesized by cult members using common and unregulated precursors obtained through a network of front companies. This example underscores the need to monitor chemical inventories and purchase orders to prevent select chemicals, some of them apparently benign, from falling into the hands of terrorist organizations.
Current methods of chemical agent control rely on static lists of “chemicals of interest.” These lists can be “flat” (for example, the list of 320 compounds compiled by the U.S. Department of Homeland Security), or multi-tiered like the 1993 CW Convention list (see http://www.opcw.org/html/db/cwc/eng/cwc_annex_on_chemicals.html). Unfortunately, control methods based on static lists are easy to circumvent, either by developing trivially different chemical analogs, or by utilizing readily available, non-scheduled starting materials that are two or more synthetic steps away from the target compound. Moreover, static lists can only provide risk assessment and are incapable of dealing with the concept of intent. These difficulties cannot be overcome by simple list expansion, since this would place an undue burden on legitimate chemical industry and academic research while not preventing a determined and synthetically skilled terrorist from obtaining suitable precursors (or their close analogs) under false pretenses. As such, an improved method for assessing risk and managing chemical inventories and purchases is required.
Similarly, the discovery and/or design of reactions that proceed sequentially in one reaction vessel is among the holy grails of modern organic synthesis. The ability to perform multiple reactions in one reaction vessel and in a well-defined sequence can simplify and accelerate multiple-step syntheses, and can translate into significant economic savings by reducing the amounts of byproducts and by eliminating intermediate purification steps, which account for as much as 60% of the total synthetic cost (see Chem. Soc. Rev. 2004, 33, 302-312; ChemSusChem 2008, 1, 718-724; Chem. Rev. 2005, 105, 1001-1020; Biotech. Lett. 2000, 22, 871-874). Therefore, a method of identifying such reactions is desired.
In light of the foregoing, it is an object of the present invention to provide a computer-implemented method for analyzing a conversion of a plurality of organic chemical reactions into a projected or bipartite graph. In a normal, projected graph or representation, compounds correspond to nodes, and directed edges are assigned for a given reaction by connecting all reactants to all products. In a bipartite representation, compounds correspond to substance nodes and are connected through reaction nodes by directed edges. Such a method of analysis makes it possible to derive statistical laws that describe and also predict how and which types of molecules could be synthesized, and which new reactions could be developed.
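The two representations described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming each reaction is supplied as a pair of reactant and product names; the function names and the toy reactions are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict

def build_projected(reactions):
    """Projected graph: a directed edge from every reactant to every product."""
    edges = defaultdict(set)
    for reactants, products in reactions:
        for r in reactants:
            for p in products:
                edges[r].add(p)
    return dict(edges)

def build_bipartite(reactions):
    """Bipartite graph: substance node -> reaction node -> substance node."""
    consumed_by = defaultdict(set)  # substance -> ids of reactions that use it
    inputs, outputs = {}, {}        # reaction id -> its reactants / its products
    for rid, (reactants, products) in enumerate(reactions):
        inputs[rid] = frozenset(reactants)
        outputs[rid] = frozenset(products)
        for r in reactants:
            consumed_by[r].add(rid)
    return dict(consumed_by), inputs, outputs

# Toy data (hypothetical): A + B -> C, then C -> D + E
reactions = [(("A", "B"), ("C",)), (("C",), ("D", "E"))]
projected = build_projected(reactions)
consumed_by, inputs, outputs = build_bipartite(reactions)
```

The projected form loses the fact that A and B are needed together to make C, whereas the bipartite form keeps it in `inputs[0]`.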
As such, it is further an object of the invention to provide a computer-implemented method of monitoring an organic compound or compounds of interest by analyzing such a conversion of a plurality of organic chemical reactions into a projected or bipartite graph.
It is yet another object of the invention to provide a computer-implemented method of economically optimizing multiple reactions in parallel by analyzing such a conversion of a plurality of organic chemical reactions into a projected or bipartite graph.
It is still another object of the invention to provide a computer-implemented method of automatically identifying reactions that can be performed sequentially by analyzing such a conversion of a plurality of organic chemical reactions into a projected or bipartite graph.
Accordingly, it will be understood by those skilled in the art that one or more aspects of this invention can meet certain objectives, while one or more other aspects can meet certain other objectives. Each objective may not apply equally, in all its respects, to every aspect of this invention. As such, the following objects can be viewed in the alternative with respect to any one aspect of this invention.
Other objects, features, benefits and advantages of the present invention will be apparent from this summary and the following descriptions of certain embodiments, and will be readily apparent to those skilled in the art. Such objects, features, benefits and advantages will be apparent from the above as taken into conjunction with the accompanying examples, data, and all reasonable inferences to be drawn therefrom.
Illustrating certain non-limiting aspects and embodiments of this invention, a computer-implemented method for analyzing a translation of a plurality of organic chemical reactions retrieved from a database to a projected graph 100 or bipartite graph 110 or network is disclosed. In a projected network 100 of the method, the molecules are nodes 102 and the reactions are the arrows 104 connecting them (
In a specific embodiment and referring to
Referring now to the drawings in detail wherein like numbers represent like elements throughout,
In
The network is constructed from a database or databases that store published organic chemical reactions. For example, Crossfire Beilstein Database (BD, Elsevier Information Systems) is the largest repository of organic reactions (see Grzybowski, B. A. et al., Angew. Chem. Int. Ed., 2005, 44, 7263; Grzybowski, B. A. et al., Angew. Chem. Int. Ed., 2006, 45, 5348; and Grzybowski, B. A. et al., Nature Chemistry, 2009, 1, 31, all of which are incorporated herein by reference). In choosing BD, its well-established criterion for the classification of chemical substances as “organic” and its comprehensive coverage of the chemical literature dating back to 1779 are adopted. While BD is not without omissions (for example, it reports only select types of polymer and is not a comprehensive repository of proteins, DNA, or many important non-covalent organic architectures), it provides the single most complete description of organic chemistry and its evolution. A processor is coupled to the database. The processor is configured to prune the database to remove catalysts, solvents, substances that participate in no reactions, duplicate reactions, and reactions that lack either reactants or products (that is, half reactions), leaving a universe of known organic chemistry comprising some 6.5 million substances and about 7.0 million reactions connecting them. In the translation of organic synthesis into a network of chemical connectivity, each compound node is represented by some characteristic of the compound, such as, for example, its molecular mass (99.7% of the compounds in BD have mass data).
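The pruning step might look roughly as follows. This sketch assumes, for simplicity, that catalysts and solvents are given as one global set of names (`excluded`); in a real database such roles would more likely be annotated per reaction. All names and data are hypothetical.

```python
def prune(reactions, substances, excluded):
    """Drop catalysts/solvents, half reactions, duplicates and unused substances."""
    seen, kept = set(), []
    for reactants, products in reactions:
        r = frozenset(reactants) - excluded
        p = frozenset(products) - excluded
        if not r or not p:       # half reaction: nothing left on one side
            continue
        if (r, p) in seen:       # duplicate reaction
            continue
        seen.add((r, p))
        kept.append((r, p))
    # keep only substances that still take part in at least one reaction
    used = set().union(*(r | p for r, p in kept))
    return kept, set(substances) & used

raw = [(["A", "Pd"], ["B"]), (["A"], ["B"]), (["C"], [])]
kept, remaining = prune(raw, {"A", "B", "C", "Pd", "Z"}, excluded={"Pd"})
```

Here the second reaction is a duplicate once the catalyst is removed, the third is a half reaction, and the substances C, Pd and Z no longer participate in anything.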
Beginning with the entries from the first years of the 1800s, both the numbers of molecules and the numbers of chemical reactions have been increasing exponentially to create a network whose complexity exceeds that of metabolic networks and rivals that of the World Wide Web. Despite its apparent randomness (
With respect to
The region in the network outside of the core can be subdivided into a large periphery 20 (
Finally, unconnected to the core/periphery are the network's islands 30, which are typically small (less than four molecules on average) but together constitute about 18% of the network. The most connected molecules in each island 30 are usually either complex natural products or specialized substances (for example, non-natural isotopes). While some islands 30 reflect imperfections of the database and its failure to report the existing syntheses connecting island molecules to the rest of chemistry, a sizeable fraction corresponds to substances that are difficult synthetic targets whose total syntheses have not yet been reported despite numerous attempts (
Within the general framework above, the architecture of the network is further characterized by local connectivity measures. In particular, the number of reaction arrows emanating from each node, kout, corresponds to the number of times a given molecule is used as a reaction substrate (redundancy), and the number of reaction arrows pointing towards the nodes, kin, corresponds to the number of times a molecule is used as reaction product (betweenness) (
Although this scaling might not seem very illuminating, it implies that chemistry has the so-called scale-free structure (Albert, R. et al., Rev. Mod. Phys. 2002, 74, 47) similar to that of the World Wide Web, the Internet, metabolic networks, and even societies. This scale-free architecture is akin to a fractal in the sense that the structural/connectivity motifs characterizing the entire network repeat themselves in all of its subnetworks. Another distinguishing feature of being scale-free is the presence of highly connected “hub” molecules directly analogous to the hubs of the airline system (Atlanta, Chicago, London, Frankfurt and so on) facilitating transportation from one poorly connected airport to another. Likewise, in organic chemistry, the synthesis of one molecule from another by a series of chemical transformations will probably use one or more of these versatile hub compounds as intermediates. Also, the fact that the scale-free structure is conserved as the network evolves in time indicates that it grows by the mechanism of preferential attachment, whereby highly connected substances are more likely to participate in new reactions than poorly connected compounds. The more times a molecule is previously used as a synthetic substrate, i.e. the larger its kout, the higher the chances it will be used again in the future. Similarly, the higher its kin, the more likely it is that chemists will try to make it by a new reaction. Colloquially speaking, molecular “celebrities” such as p-nitrophenol are becoming ever more popular (
As such, simple molecular descriptors (mass, degree of unsaturation, number of stereocenters and so on) are analyzed relatively easily. Analysis of molecular masses offers some interesting insights. For example, despite the enormous progress in the synthetic methodology since the times of Hofmann and Perkin, the most commonly used substrates and products remain those of molecular weights MWsubst≈150 g/mol and MWprod≈250 g/mol, respectively (
A related observation is that the shapes of the mass distributions in
In light of the foregoing, an embodiment of the invention is a computer-implemented method for monitoring organic compounds, the method comprising translating a plurality of organic chemical reactions to a projected or bipartite graph, wherein compounds within the graph correspond to substance nodes and are connected by directed edges representing reactions (projected) or by directed edges through reaction nodes (bipartite); selecting a target compound or compounds within the graph; running a reverse depth-first search or searches outward from the target compound or compounds to identify all possible synthetic pathways of the target compound; measuring topological graph or network indices of a precursor compound or compounds of the target compound as a result of the reverse depth-first search; and ranking the precursor compounds to the target compound using the topological indices to determine which precursor compounds are more likely to be used to make the target compound.
Alternatively, the method comprises running a combinatorial breadth-first search outward from the target compound to identify all minimal sets of precursor compounds; measuring extended topological graph or network indices of the minimal sets as a result of the combinatorial breadth-first search; and ranking the minimal sets to the target compound using the topological indices to determine which are more likely to be used to make the target compound. In another embodiment, the method comprises running both an outward, reverse depth-first search (DFS) and an outward combinatorial breadth-first search.
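One possible reading of the combinatorial breadth-first search is sketched below. It assumes a bipartite-style lookup `made_by` that maps each product to the reactant sets of the reactions producing it, and it treats a precursor set as "minimal" when no proper subset of it was also found; both the data layout and this definition of minimality are assumptions made for illustration.

```python
from collections import deque

def precursor_sets(made_by, target, max_steps):
    """Breadth-first expansion: replace one compound by the reactants that make it."""
    start = frozenset([target])
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, steps = queue.popleft()
        if steps == max_steps:
            continue
        for compound in current:
            for reactants in made_by.get(compound, ()):
                nxt = (current - {compound}) | reactants
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    seen.discard(start)
    return seen

def minimal_sets(sets):
    """Keep only the sets that have no proper subset among the candidates."""
    return {s for s in sets if not any(other < s for other in sets)}

# Toy network (hypothetical): T from X + Y or from Z; X from P + Q; Z from Y
made_by = {
    "T": [frozenset({"X", "Y"}), frozenset({"Z"})],
    "X": [frozenset({"P", "Q"})],
    "Z": [frozenset({"Y"})],
}
all_sets = precursor_sets(made_by, "T", max_steps=2)
minimal = minimal_sets(all_sets)
```

In this toy network Y alone suffices (via Z), so the larger sets containing Y are not minimal under the assumed definition.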
In a specific embodiment, the method can be used to identify synthetic routes to controlled substances, such as, for example, narcotics and chemical weapons. By identifying such synthetic routes, one can monitor the combined inventories of chemical suppliers for the purchases of “red-flag sets” of precursor compounds that signal the intent to make controlled substances. In such an embodiment, the search or searches are preferably run on a bipartite representation, since such a representation retains information about, for example, the multiple precursors required for one reaction path.
Examples of topological graph indices, topological network indices, and extended topological graph or network indices include, but are not limited to, synthetic distance, betweenness, redundancy and selectivity. The method can employ from one up to all of these indices to monitor target compounds.
By “synthetic distance” is meant the number of reaction nodes separating a substrate or substrates from a target. Here, the premise is that the closer a substance is synthetically to a target compound, the higher its risk of being used in malicious activity.
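Synthetic distance as defined here can be computed with an ordinary reverse breadth-first search. The sketch below assumes the same hypothetical `made_by` lookup (product mapped to the reactant sets of the reactions that make it) and returns, for each compound, the smallest number of reaction nodes between it and the target.

```python
from collections import deque

def synthetic_distance(made_by, target):
    """Minimum number of reaction nodes between each precursor and the target."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        compound = queue.popleft()
        for reactants in made_by.get(compound, ()):
            for r in reactants:
                if r not in dist:          # first visit is the shortest route
                    dist[r] = dist[compound] + 1
                    queue.append(r)
    return dist

# Toy network (hypothetical): T from X + Y or from Z; X from P + Q
made_by = {
    "T": [frozenset({"X", "Y"}), frozenset({"Z"})],
    "X": [frozenset({"P", "Q"})],
}
distances = synthetic_distance(made_by, "T")
```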
“Betweenness” is based on the so-called “betweenness centrality” of a node/molecule and refers to the number of synthetic pathways, e.g. of length up to dmax, passing through this molecule to the target (
“Redundancy” quantifies the number of synthetic pathways starting at a given compound and terminating at the target (
“Selectivity” complements betweenness and redundancy. Although a compound may have a high redundancy measure for the synthesis of a specific target, it may also be involved in a large number of other, innocuous syntheses. Consequently, its overall rank as a preferred precursor should be less than that of a compound which is used almost exclusively in the synthesis of a target. The issue may be readily addressed in the context of network topology by examining the local connectivity, i.e. the number of incoming/outgoing reactions, of a precursor compound and/or the ratio of preferred synthetic pathways to non-preferred ones.
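One simple way to turn this idea into a number is the fraction of a compound's outgoing reactions that lead, directly or indirectly, toward the target. The sketch below uses a projected graph (`edges`: compound mapped to the set of its products); this particular ratio is an illustrative assumption, not a formula given in the text.

```python
def reaches_target(edges, target):
    """All compounds from which the target can be reached (target included)."""
    reverse = {}
    for src, dsts in edges.items():
        for d in dsts:
            reverse.setdefault(d, set()).add(src)
    found, stack = {target}, [target]
    while stack:
        node = stack.pop()
        for prev in reverse.get(node, ()):
            if prev not in found:
                found.add(prev)
                stack.append(prev)
    return found

def selectivity(edges, compound, target):
    """Share of the compound's outgoing edges that lie on a route to the target."""
    out = edges.get(compound, set())
    if not out:
        return 0.0
    on_path = reaches_target(edges, target)
    return sum(1 for p in out if p in on_path) / len(out)

# Toy network (hypothetical): A is used only for T; B has many other uses
edges = {"A": {"T"}, "B": {"T", "U", "V", "W"}, "U": {"X"}}
sel_a = selectivity(edges, "A", "T")
sel_b = selectivity(edges, "B", "T")
```

Under this measure A, which is used only to make the target, ranks above B, which has three unrelated uses.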
In an embodiment of the invention, reverse depth-first searches (DFS) outward from the target compound are run to enumerate all possible pathways (reaction dependencies between compounds are neglected) and to collect betweenness/redundancy statistics of compounds encountered along the way. Selectivity measures are collected by analyzing the local connectivity of the molecules and performing forward DFSs analogous to those above.
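A reverse depth-first enumeration of this kind might be sketched as follows. The code works on a projected graph (so reaction dependencies are neglected, as stated above), limits pathways to `d_max` steps, counts redundancy as the number of pathways starting at a compound, and counts betweenness as the number of pathways in which the compound is an intermediate. The data layout (`made_from`: product mapped to its precursors) and these exact counting conventions are assumptions for illustration.

```python
from collections import Counter

def path_statistics(made_from, target, d_max):
    """Enumerate simple pathways into the target by reverse DFS."""
    redundancy, betweenness = Counter(), Counter()

    def dfs(node, path):
        # `path` runs forward from `node` to the target
        if len(path) > 1:
            redundancy[node] += 1          # one more pathway starts at `node`
            for mid in path[1:-1]:
                betweenness[mid] += 1      # ...and passes through each intermediate
        if len(path) - 1 == d_max:         # depth limit reached
            return
        for prev in made_from.get(node, ()):
            if prev not in path:           # avoid revisiting compounds (cycles)
                dfs(prev, [prev] + path)

    dfs(target, [target])
    return redundancy, betweenness

# Toy network (hypothetical): T from A or B; A and B both from C
made_from = {"T": {"A", "B"}, "A": {"C"}, "B": {"C"}}
redundancy, betweenness = path_statistics(made_from, "T", d_max=2)
```

The four pathways found are A→T, B→T, C→A→T and C→B→T, so C has the highest redundancy in this toy example.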
The method of the invention identifies “molecules of interest” not included in Schedules 1 and 2 of the Chemical Weapons Convention, e.g. 2,2-diphenyl-2-hydroxyacetic acid methyl ester, which can be used to prepare 3-quinuclidinyl benzilate (BZ).
In another embodiment, a combinatorial breadth-first search outward from the target compound is run to identify all minimal sets 400 of precursor compounds 106 (
Thus, when the target compound is a controlled substance such as a narcotic or chemical weapon, the highest-ranked sets are those most likely to be used by anyone wishing to synthesize it. In sum, the ability to identify minimal sets is important (i) for the regulation of precursors to controlled substances, and (ii) for assessing the intent of an individual. If chemical sales, chemical inventories, purchase orders and the like are monitored using the methods of the invention, it becomes possible to estimate both the likelihood of intent to make a controlled substance and the probability that it will be successfully synthesized.
As an example, when the method of the invention is performed on sarin, a combinatorial breadth-first search outward from the target compound is run to identify all minimal sets of precursor compounds, as shown in
In another example,
Thus, a much more efficient regulatory strategy would be to employ the methods of the invention, i.e. to monitor the combined inventories of chemical suppliers for purchases of “red-flag sets” of precursors that signal the intent to make dangerous substances. In the specific PCP example above, the government would be alerted not when one buys piperidine alone, but when an entity acquires at least two out of three key substances (cyclohexanone, piperidine, and the Grignard reagent phenylmagnesium bromide).
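The "at least two out of three" rule can be illustrated with a short check over pooled purchase records. The record format, the entity names and the function name are all hypothetical; the three compound names come from the PCP example above.

```python
def flagged_entities(purchases, red_flag_set, threshold):
    """Return entities that bought at least `threshold` members of the set.

    purchases: iterable of (entity, compound) pairs pooled across suppliers.
    """
    bought = {}
    for entity, compound in purchases:
        if compound in red_flag_set:
            bought.setdefault(entity, set()).add(compound)
    return {e for e, items in bought.items() if len(items) >= threshold}

PCP_SET = {"cyclohexanone", "piperidine", "phenylmagnesium bromide"}
purchases = [
    ("lab-1", "piperidine"),       # a single precursor: not flagged
    ("lab-2", "piperidine"),
    ("lab-2", "cyclohexanone"),    # second precursor, possibly another supplier
    ("lab-2", "ethanol"),
]
alerts = flagged_entities(purchases, PCP_SET, threshold=2)
```

Because the records are pooled across suppliers, splitting an order between vendors does not avoid the alert in this sketch.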
Finally, in still another embodiment of the invention, the minimal-set algorithms are implemented in the form of a user-friendly interface that, for different types of targets, yields CAS numbers that can then be mapped onto the commercial databases of chemical compounds. An example of this capability is illustrated in
Another embodiment of the invention is a computer-implemented method of economically optimizing multiple reactions in parallel, the method comprising translating a plurality of organic chemical reactions retrieved from a database to a bipartite graph, wherein a first set of nodes of the graph is associated with one or more organic compounds connected by directed edges through a second set of nodes of the bipartite graph associated with one or more reactions; identifying a product or set of products, P, from the graph; selecting a set of precursor compounds for the product; determining a connectivity, k, from the graph for each precursor compound; identifying a cost per mole for each precursor, Si, using, for example, a mathematical formula of the type Si≅β/√k, wherein √k is the square root of k and β is a constant; calculating a total cost function, Ctot, using, for example, a mathematical formula of the type Ctot=ΣiSi+αNrxn, wherein Nrxn represents the total number of reactions and α represents the average cost of performing one reaction; and back-propagating from the product to find optimal precursor compounds.
The method of the invention determines what set of precursors and reaction pathways an entity should use to minimize its overall production cost. Mathematically, this problem is equivalent to finding a set of precursors that minimizes the cost function, Ctot, represented as a sum of the costs of reagents and all other labor costs, Ctot=Csubst+Clabor, where Csubst=ΣiSi and Clabor=αNrxn. The link between the formula Ctot=ΣiSi+αNrxn and the architecture of the network is the correlation between the cost of a precursor and its local network connectivity. Analysis of specific substances (see Angew. Chem. Int. Ed., 2005, 44, 7263; J. Phys. Org. Chem. 2009, 22, 897, incorporated herein by reference) reveals that synthetically popular substances are less expensive than poorly connected ones, with the cost per mole being proportional to the inverse square root of the molecule's network connectivity, Si≅β/√k, where β is a constant. Using this cost relation, stochastic search algorithms (based on a simulated annealing Monte Carlo optimization, i.e. repeated random sampling) can be back-propagated from the products to find optimal substrates, which minimize the total production costs, Ctot. More important are the universal trends that hold for different companies and for different labor costs, as characterized by the dimensionless parameter, χ=α/β.
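The cost function can be evaluated directly once the connectivities are known. The sketch below computes the total cost as the sum of β/√k over the precursors plus the per-reaction cost times the number of reactions, and compares two candidate routes. Note that it simply takes the minimum over an explicit list of routes rather than performing the simulated-annealing search mentioned in the text, which would be needed on a network of realistic size. The route format, connectivity values and function names are hypothetical.

```python
import math

def total_cost(precursors, n_reactions, connectivity, alpha, beta):
    """Substrate cost (sum of beta / sqrt(k)) plus labor cost (alpha per reaction)."""
    substrate_cost = sum(beta / math.sqrt(connectivity[p]) for p in precursors)
    return substrate_cost + alpha * n_reactions

def cheapest_route(routes, connectivity, alpha, beta):
    """Exhaustive choice among candidate (precursors, n_reactions) routes."""
    return min(routes, key=lambda r: total_cost(r[0], r[1], connectivity, alpha, beta))

connectivity = {"A": 100, "B": 4, "C": 25}   # hypothetical k values
routes = [
    (("A",), 3),        # cheap, popular precursor but three steps
    (("B", "C"), 1),    # costlier precursors but a single step
]
cheap_labor = cheapest_route(routes, connectivity, alpha=1.0, beta=10.0)
costly_labor = cheapest_route(routes, connectivity, alpha=5.0, beta=10.0)
```

With low labor cost the longer route from the well-connected precursor wins; raising the labor cost flips the choice, which illustrates why the ratio of the two constants governs the outcome.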
Still another embodiment of the invention is a computer-implemented method of automatically identifying reactions that can be performed sequentially, the method comprising translating a plurality of organic chemical reactions retrieved from a database to a bipartite graph, wherein a first set of nodes of the graph is associated with one or more organic compounds connected by directed edges through a second set of nodes of the bipartite graph associated with one or more reactions; identifying and/or enumerating reaction chains within the graph; eliminating those reaction chains wherein all precursors and reagents involved therein are mutually reactive; eliminating reaction chains wherein precursors are not weakly connected; and identifying the remaining reaction chains. By “weakly connected” is meant a connectivity that is lower than a given threshold value.
This method of the invention identifies sequential, one-pot reactions. The method starts with the identification/enumeration of reaction chains of the form A→B→C→ . . . within the network. Of course, not all such chains are compatible with one-pot synthetic procedures. To select those that are, two criteria are imposed: (i) that all intermediates and reagents involved in the chain are not mutually reactive, and (ii) that the intermediates are weakly connected (this property eliminates substances that have high synthetic “promiscuity”). These rules allow for the identification of numerous candidate reaction sequences that meet the criteria of chemical orthogonality and for which common reaction conditions exist.
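The chain enumeration and the two filters can be sketched as below. The sketch assumes a projected graph (`edges`), a per-compound connectivity table (`degree`), an upper connectivity limit `k_max` standing in for "weakly connected", and a lookup of incompatible compound pairs (`reactive_pairs`); how such reactivity data would actually be obtained is not specified here, so that lookup is purely hypothetical.

```python
from itertools import combinations

def chains(edges, length):
    """All simple chains A -> B -> ... with `length` reaction steps."""
    def extend(path):
        if len(path) == length + 1:
            yield tuple(path)
            return
        for nxt in sorted(edges.get(path[-1], ())):
            if nxt not in path:
                yield from extend(path + [nxt])
    for start in edges:
        yield from extend([start])

def one_pot_candidates(edges, length, degree, k_max, reactive_pairs):
    """Keep chains whose intermediates are weakly connected and non-interfering."""
    keep = []
    for chain in chains(edges, length):
        if any(degree[m] > k_max for m in chain[1:-1]):
            continue   # a highly connected ("promiscuous") intermediate
        if any(frozenset(pair) in reactive_pairs for pair in combinations(chain, 2)):
            continue   # two species in the pot would react with each other
        keep.append(chain)
    return keep

# Toy network (hypothetical): A -> B -> C -> D, with C highly connected
edges = {"A": {"B"}, "B": {"C"}, "C": {"D"}}
degree = {"A": 5, "B": 2, "C": 50, "D": 1}
candidates = one_pot_candidates(edges, 2, degree, k_max=10, reactive_pairs=set())
blocked = one_pot_candidates(edges, 2, degree, k_max=10,
                             reactive_pairs={frozenset({"A", "C"})})
```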
As an example, the two steps in
Another embodiment of the invention is a computer-readable medium storing instructions that, when executed by a computer, cause the computer to perform the methods disclosed above. The term computer-readable medium, as used herein, refers to any medium that participates in providing instructions to a processor unit for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage devices. Volatile media include dynamic memory, such as main memory or random access memory (“RAM”). Common forms of computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, and any other memory chip or cartridge, or any other medium from which a computer can read.
It is understood by those skilled in the art that the one or more steps of the application methods of the invention are performed by configuring one or more computer processors to perform such steps. In particular, a computer network can be employed, as in
The disclosures of all articles and references, including patents, are incorporated herein by reference.
The invention and the manner and process of making and using it are now described in such full, clear, concise and exact terms as to enable any person skilled in the art to which it pertains, to make and use the same. It is to be understood that the foregoing describes preferred embodiments of the present invention and that modifications may be made therein without departing from the spirit or scope of the present invention as set forth in the claims.
This application claims priority benefit of application Ser. No. 61/157,431 filed Mar. 4, 2009 and of application Ser. No. 61/165,034 filed Mar. 31, 2009, the entirety of both of which is incorporated herein by reference.
Number | Date | Country
---|---|---
61157431 | Mar 2009 | US
61165034 | Mar 2009 | US