Not Applicable
Not Applicable
Not Applicable
The various embodiments of the present disclosure relate generally to polymer chemical informatics, specifically systems and methods for predicting properties of chemical polymers.
Polymers are an integral part of everyday life and instrumental in the progress of technologies for future innovations. The sheer magnitude and diversity of the polymer chemical space provide opportunities for crafting polymers that accurately match application demands, yet also come with the challenge of efficiently and effectively browsing the gigantic space of polymer systems. The nascent field of polymer informatics allows access to the depth of the polymer universe and demonstrates the potency of machine learning (ML) models to overcome this challenge. ML frameworks have enabled substantial progress in the development of polymer property predictors and solving inverse problems in which polymers that meet specific property requirements are either identified from candidate sets or are freshly designed using genetic or generative algorithms.
Thus, a need exists for systems and methods that can effectively and efficiently traverse the expansive world of polymers to determine properties for varying polymer chemical structures on demand for real-life applications.
An exemplary embodiment of the present invention comprises an integrated, end-to-end, completely machine-driven polymer informatics pipeline that uses an encoder-only transformer chemical language model to replace conventional handcrafted fingerprints. The present invention is more than two orders of magnitude faster than conventional approaches.
Another exemplary embodiment of the present invention comprises a method that views polymers as a chemical language of atoms and atom connectivities. With this perspective, language model development techniques adopted by the Natural Language Processing (NLP) community become applicable. The instant innovations include the creation of a training set of 100 million polymers, the creation of a transformer-based architecture designed specifically to learn the chemical language of polymers, and the training of this architecture to yield a machine-driven polymer fingerprinting model. These fingerprints can then be used in downstream machine learning models to build polymer property predictors.
Another exemplary embodiment of the present invention comprises a method comprising preprocessing polymer simplified molecular input line entry system (“PSMILES”) strings, feeding the preprocessed PSMILES strings to an encoder-only transformer chemical language model, fingerprinting at least a portion of an output of the encoder-only transformer chemical language model, and predicting polymer properties from the fingerprinted output.
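By way of a non-limiting illustration, the four steps of this method (preprocessing, encoding, fingerprinting, and property prediction) may be sketched as the following composition of stand-in functions. Every function body, name, and numerical value below is an illustrative assumption, not the disclosed model; a production pipeline would substitute the trained tokenizer, the encoder-only transformer, and the trained property predictor, while keeping the same data flow between the four stages.

```python
# Hypothetical end-to-end sketch of the four method steps; all names
# and values are illustrative, not part of any published API.

def preprocess(psmiles: str) -> list:
    """Placeholder tokenizer: a real one splits on chemical tokens."""
    return list(psmiles)

def encode(tokens: list) -> list:
    """Stand-in for the encoder-only transformer: one vector per token."""
    return [[float(ord(c)) for c in tok.ljust(4)[:4]] for tok in tokens]

def fingerprint(embeddings: list) -> list:
    """Average token embeddings into a fixed-length polymer fingerprint."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def predict_properties(fp: list) -> dict:
    """Placeholder property head mapping the fingerprint to properties;
    "Tg_proxy" is a made-up name for illustration only."""
    return {"Tg_proxy": sum(fp)}

fp = fingerprint(encode(preprocess("[*]CC[*]")))
props = predict_properties(fp)
```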
In any of the exemplary embodiments, the method is an integrated end-to-end completely machine-driven polymer informatics pipeline.
In any of the exemplary embodiments, the preprocessing comprises canonicalizing the PSMILES strings into standardized data strings, tokenizing at least a portion of the standardized data strings, and masking at least a portion of the tokenized standardized data strings.
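As a non-limiting sketch of the tokenizing and masking steps of this preprocessing, consider the fragment below. Canonicalization is omitted because a real implementation would delegate it to a cheminformatics toolkit (e.g., RDKit); the token pattern and masking rate are illustrative assumptions only.

```python
import random
import re

# Illustrative token pattern for SMILES-like strings; an assumption
# for demonstration, not the disclosed tokenizer.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|\d|[A-Za-z()=#+\-/%@]")

def tokenize(psmiles):
    """Split a PSMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(psmiles)

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random fraction of tokens with [MASK]; return the
    masked sequence and {position: original token} training targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = tokenize("[*]CC([*])c1ccccc1")  # a polystyrene-like repeat unit
masked, targets = mask_tokens(tokens, mask_rate=0.3)
```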
In any of the exemplary embodiments, the method further comprises creating a training set of polymers, training the encoder-only transformer chemical language model with the training set, and yielding a machine-driven polymer fingerprinting model for the fingerprinting.
In any of the exemplary embodiments, the preprocessing comprises converting chemical fragments from a set of first polymers into standardized data strings by representing the set of first polymers as the PSMILES strings, separating the PSMILES strings into one or more tokens, and predicting, via a first machine learning algorithm, one or more tokens from the PSMILES strings.
In any of the exemplary embodiments, the fingerprinting comprises computing, via a processor device, one or more unique fingerprints for the PSMILES strings, and mapping, via the encoder-only transformer chemical language model, one or more properties of the set of first polymers and one or more properties of a set of second polymers to the one or more unique fingerprints.
In any of the exemplary embodiments, the predicting comprises predicting, via the encoder-only transformer chemical language model, one or more properties for a new polymer based at least in part on one or more of the properties of the set of first polymers and one or more of the properties of the set of second polymers.
In any of the exemplary embodiments, the representing the set of first polymers as the PSMILES strings comprises canonicalizing each PSMILES string to create the PSMILES strings for the set of first polymers.
In any of the exemplary embodiments, the separating the PSMILES strings into one or more tokens comprises parsing through the PSMILES strings using one or more text delimiters and tokenizing the PSMILES strings based at least in part on the one or more text delimiters.
In any of the exemplary embodiments, the predicting, via the first machine learning algorithm, the one or more tokens from the PSMILES strings comprises creating a masked portion of one or more of the tokens and an unmasked portion of one or more of the tokens for the PSMILES strings.
In any of the exemplary embodiments, the creating the masked portion and the unmasked portion within the PSMILES strings comprises embedding one or more of the tokens of the unmasked portion with a numerical weight, and predicting the masked portion based on the numerical weight for one or more of the tokens of the unmasked portion.
In any of the exemplary embodiments, the embedding one or more of the tokens of the unmasked portion with a numerical weight comprises passing one or more of the tokens of the unmasked portion through one or more neural encoder layers and one or more neural decoder layers, and updating the numerical weight through one or more of the neural encoder layers and one or more of the neural decoder layers for one or more of the tokens of the unmasked portion.
In any of the exemplary embodiments, the updating the numerical weight through one or more of the neural encoder layers and one or more of the neural decoder layers for one or more of the tokens of the unmasked portion comprises determining a syntactical relationship between one or more of the tokens within the PSMILES strings by creating an attention map for one or more of the tokens, wherein the attention map is configured to plot an attention score for one or more of the tokens.
In any of the exemplary embodiments, the mapping, via the encoder-only transformer chemical language model, the one or more properties of the set of first polymers and the one or more properties of the set of second polymers to the one or more unique fingerprints comprises receiving an input vector of one or more of the unique fingerprints, and mapping the input vector with the polymer properties via a selector vector, wherein the selector vector is a binary vector configured to represent the polymer properties using a binary number format.
In any of the exemplary embodiments, the method further comprises mapping the polymer properties based at least in part on one or more of the unique fingerprints, and outputting one or more of the polymer properties for one or more of the unique fingerprints by filtering the output of the one or more polymer properties based at least in part on one or more search parameters.
Another exemplary embodiment of the present invention comprises a system for predicting polymer properties comprising a processor device configured to convert chemical fragments from a set of first polymers into a set of second polymers different than the first polymers, convert the set of second polymers into the PSMILES strings, separate the PSMILES strings into one or more tokens, and compute one or more unique fingerprints for the PSMILES strings with the encoder-only transformer chemical language model.
In any of the exemplary embodiments, the processor device is further configured to parse through the PSMILES strings using one or more text delimiters, and tokenize the PSMILES strings based at least in part on one or more of the text delimiters.
In any of the exemplary embodiments, the processor device is further configured to train a machine learning algorithm configured to predict one or more tokens of the PSMILES strings.
In any of the exemplary embodiments, the machine learning algorithm is further configured to use natural language processing (NLP) on one or more of the tokens of each of the PSMILES strings, and create a masked portion of one or more of the tokens and an unmasked portion of one or more of the tokens for the PSMILES strings.
In any of the exemplary embodiments, the machine learning algorithm is further configured to embed one or more of the tokens of the unmasked portion with a numerical weight, and analyze the numerical weight for one or more of the tokens of the unmasked portion to predict the masked portion.
In any of the exemplary embodiments, the machine learning algorithm is further configured to pass one or more of the tokens of the unmasked portion through one or more neural encoder layers and one or more neural decoder layers, update the numerical weight through one or more of the neural encoder layers and one or more of the neural decoder layers for one or more of the tokens of the unmasked portion, determine a syntactical relationship between one or more of the tokens within the PSMILES strings, and create an attention map for one or more of the tokens, wherein the attention map is a plot of an attention score for one or more of the tokens.
Another exemplary embodiment of the present invention comprises a system for predicting polymer properties comprising a processor device configured to receive an input vector, map via a machine learning algorithm each entry of the input vector with a selector vector indicative of polymer properties, and output polymer properties for one or more of the entries of the input vector, wherein one or more of the entries of the input vector is indicative of a unique fingerprint for one or more polymers.
In any of the exemplary embodiments, the machine learning algorithm is a multitask deep neural network.
In any of the exemplary embodiments, the processor device is further configured to filter the output of one or more of the polymer properties based at least in part on one or more search parameters.
In any of the exemplary embodiments, the selector vector is a binary vector configured to represent the polymer properties using a binary number format.
Another exemplary embodiment of the present disclosure provides a method for predicting polymer properties that can comprise converting chemical fragments from a plurality of first polymers into standardized data strings, separating each of the standardized data strings into one or more tokens, predicting, via a first machine learning algorithm, one or more tokens from each of the standardized data strings, computing, via a processor device, one or more unique fingerprints for each of the standardized data strings, and mapping, via a second machine learning algorithm, one or more properties of the plurality of first polymers and one or more properties of a plurality of second polymers to the one or more unique fingerprints.
In any of the embodiments disclosed herein, the method may further comprise predicting, via the second machine learning algorithm, the one or more properties for a new polymer based at least in part on the one or more properties of the plurality of first polymers and the one or more properties of the plurality of second polymers.
In any of the embodiments disclosed herein, the standardized data strings may comprise a polymer simplified molecular input line entry system (“PSMILES”) string.
In any of the embodiments disclosed herein, representing the standardized data strings for the chemical fragments from the plurality of first polymers as PSMILES strings may comprise canonicalizing each PSMILES string to create the standardized data strings for the plurality of first polymers.
In any of the embodiments disclosed herein, separating each of the standardized data strings into one or more tokens may comprise parsing through each of the standardized data strings using one or more text delimiters, and parsing through each of the standardized data strings using the one or more text delimiters may comprise tokenizing each of the standardized data strings based at least in part on the one or more text delimiters.
In any of the embodiments disclosed herein, predicting, via the first machine learning algorithm, one or more tokens of each of the standardized data strings may comprise creating a masked portion of the one or more tokens and an unmasked portion of the one or more tokens for each of the standardized data strings. Creating the masked portion and the unmasked portion within each of the standardized data strings may comprise embedding each of the one or more tokens of the unmasked portion with a numerical weight, and predicting the masked portion based on the numerical weight for each of the one or more tokens of the unmasked portion.
In any of the embodiments disclosed herein, embedding each of the one or more tokens of the unmasked portion with a numerical weight may comprise passing each of the one or more tokens of the unmasked portion through one or more neural encoder layers and one or more neural decoder layers, and updating the numerical weight through each of the one or more encoder layers and each of the one or more decoder layers for each of the one or more tokens of the unmasked portion.
In any of the embodiments disclosed herein, updating the numerical weight through each of the one or more encoder layers and each of the one or more decoder layers for each of the one or more tokens of the unmasked portion may comprise determining a syntactical relationship between the one or more tokens within each standardized data string. Determining the syntactical relationship between the one or more tokens within each standardized data string may comprise creating an attention map for the one or more tokens, wherein the attention map can be configured to plot an attention score for each of the one or more tokens.
In any of the embodiments disclosed herein, utilizing the second machine learning algorithm to map the one or more unique fingerprints to a plurality of polymer properties may comprise receiving an input vector of the one or more unique fingerprints, and mapping the input vector with the plurality of polymer properties via a selector vector. The selector vector may be a binary vector that can be configured to represent the plurality of polymer properties using a binary number format.
In any of the embodiments disclosed herein, the method may further comprise mapping the plurality of polymer properties based at least in part on the one or more unique fingerprints, which may comprise outputting the one or more polymer properties for each of the one or more unique fingerprints. Outputting the plurality of polymer properties for each of the one or more unique fingerprints may comprise filtering the output of the one or more polymer properties based at least in part on one or more search parameters.
Another embodiment of the present disclosure provides a system for predicting polymer properties, the system comprising a processor that can be configured to convert chemical fragments from a plurality of first polymers into a plurality of second polymers different than the first polymers, convert the plurality of second polymers into standardized data strings, separate the standardized data strings into one or more tokens, and compute a unique fingerprint for each of the standardized data strings.
In any of the embodiments disclosed herein, the standardized data strings may be a plurality of polymer simplified molecular input line entry system (PSMILES) strings.
In any of the embodiments disclosed herein, the processor may be further configured to parse through each of the standardized data strings using one or more text delimiters, and tokenize each of the standardized data strings based at least in part on the one or more text delimiters.
In any of the embodiments disclosed herein, the processor may be further configured to train a machine learning algorithm that can be configured to predict one or more tokens of each of the standardized data strings. The machine learning algorithm may be further configured to use natural language processing (NLP) on the one or more tokens of each of the standardized data strings.
In any of the embodiments disclosed herein, the machine learning algorithm may be further configured to create a masked portion of the one or more tokens and an unmasked portion of the one or more tokens for each of the standardized data strings. The machine learning algorithm may be further configured to embed each of the one or more tokens of the unmasked portion with a numerical weight, and to analyze the numerical weight for each of the one or more tokens of the unmasked portion to predict the masked portion.
In any of the embodiments disclosed herein, the machine learning algorithm may be further configured to pass each of the one or more tokens of the unmasked portion through one or more neural encoder layers and one or more neural decoder layers. The machine learning algorithm may be further configured to update the numerical weight through each of the one or more encoder layers and each of the one or more decoder layers for each of the one or more tokens of the unmasked portion. The machine learning algorithm may be further configured to determine a syntactical relationship between the one or more tokens within each standardized data string.
In any of the embodiments disclosed herein, the machine learning algorithm may be further configured to create an attention map for the one or more tokens, wherein the attention map is a plot of an attention score for each of the one or more tokens.
Another embodiment of the present disclosure provides a system for predicting polymer properties, the system comprising a processor that can be configured to receive an input vector, map via a machine learning algorithm each entry of the input vector with a selector vector indicative of a plurality of polymer properties, and output the plurality of polymer properties for each entry of the input vector. Each entry of the input vector may be indicative of a unique fingerprint for each of a plurality of polymers.
In any of the embodiments disclosed herein, the machine learning algorithm may be a multitask deep neural network. The selector vector may be a binary vector configured to represent the plurality of polymer properties using a binary number format. The processor may be further configured to filter the output of the plurality of polymer properties based at least in part on one or more search parameters.
These and other aspects of the present disclosure are described in the Detailed Description below and the accompanying drawings. Other aspects and features of embodiments will become apparent to those of ordinary skill in the art upon reviewing the following description of specific, exemplary embodiments in concert with the drawings. While features of the present disclosure may be discussed relative to certain embodiments and figures, all embodiments of the present disclosure can include one or more of the features discussed herein. Further, while one or more embodiments may be discussed as having certain advantageous features, one or more of such features may also be used with the various embodiments discussed herein. In similar fashion, while exemplary embodiments may be discussed below as device, system, or method embodiments, it is to be understood that such exemplary embodiments can be implemented in various devices, systems, and methods of the present disclosure.
The following detailed description of specific embodiments of the disclosure will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the disclosure, specific embodiments are shown in the drawings. It should be understood, however, that the disclosure is not limited to the precise arrangements and instrumentalities of the embodiments shown in the drawings.
To facilitate an understanding of the principles and features of the present disclosure, various illustrative embodiments are explained below. The components, steps, and materials described hereinafter as making up various elements of the embodiments disclosed herein are intended to be illustrative and not restrictive. Many suitable components, steps, and materials that would perform the same or similar functions as the components, steps, and materials described herein are intended to be embraced within the scope of the disclosure. Such other components, steps, and materials not described herein can include, but are not limited to, similar components or steps that are developed after development of the embodiments disclosed herein.
It must also be noted that, as used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural references unless the context clearly dictates otherwise. For example, reference to a component is intended also to include composition of a plurality of components. References to a composition containing “a” constituent is intended to include other constituents in addition to the one named.
Also, in describing the exemplary embodiments, terminology will be resorted to for the sake of clarity. It is intended that each term contemplates its broadest meaning as understood by those skilled in the art and includes all technical equivalents which operate in a similar manner to accomplish a similar purpose.
By “comprising” or “containing” or “including” is meant that at least the named compound, element, particle, or method step is present in the composition or article or method, but does not exclude the presence of other compounds, materials, particles, or method steps, even if such other compounds, materials, particles, or method steps have the same function as what is named.
It is also to be understood that the mention of one or more method steps does not preclude the presence of additional method steps or intervening method steps between those steps expressly identified. Similarly, it is also to be understood that the mention of one or more components in a composition does not preclude the presence of additional components than those expressly identified.
The materials described as making up the various elements of the invention are intended to be illustrative and not restrictive. Many suitable materials that would perform the same or a similar function as the materials described herein are intended to be embraced within the scope of the invention. Such other materials not described herein can include, but are not limited to, for example, materials that are developed after the time of the development of the invention.
Referring back to
As would be appreciated by one of skill in the art, machine learning is a subfield within artificial intelligence (AI) that enables computer systems and other related devices to learn how to perform tasks and improve performance of those tasks over time. The system can incorporate machine learning approaches including, but not limited to, supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms, reinforcement learning algorithms, and the like. In some embodiments, the first machine learning algorithm 140 can be configured to use natural language processing (NLP) to process PSMILES strings and determine a syntactical relationship between the one or more tokens 132. As a result, by determining a syntactical relationship between the one or more tokens 132, the first machine learning algorithm 140 can begin to learn the chemical structures of polymers via learning PSMILES strings.
In general, NLP is a machine learning technology that can allow a machine learning algorithm to interpret, manipulate, and comprehend language. With respect to the present technology, the first machine learning algorithm can be configured to use a Transformer architecture within NLP technology to predict one or more tokens 132 of a standardized data string 130. As known in the art, Transformer architectures have features such as encoders, decoders, and attention layers, which can enable Transformer architectures to develop an understanding of inputs based on position as well as to predict missing parts of inputs, such as in the case of training a machine learning algorithm. Transformer architectures can be employed in several different models, such as encoder-only models, decoder-only models, and encoder-decoder models, within NLP technology. It should be appreciated that system 100 can incorporate encoder-only models, decoder-only models, and encoder-decoder models of the Transformer architecture within the first machine learning algorithm 140. In some embodiments of the present technology, the first machine learning algorithm 140 may be further configured to pass one or more tokens 132 through one or more encoder layers and one or more decoder layers.
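As one non-limiting illustration of how a Transformer architecture can account for token position, the classic sinusoidal positional encoding may be sketched as follows. The tiny sequence length and model dimension below are assumptions for illustration only; they are not parameters of the disclosed embodiments.

```python
import math

# Sinusoidal positional encodings: each position receives a distinct
# vector of sines and cosines, letting otherwise order-blind attention
# layers distinguish token positions.
def positional_encoding(seq_len, d_model):
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # even dimensions use sine, odd dimensions use cosine
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(4, 4)  # 4 positions, 4-dimensional toy model
```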
As shown in
In some embodiments, the first machine learning algorithm 140 may pass each token 132 of the unmasked portion 136 through one or more encoder layers and one or more decoder layers. As each of the tokens passes through the one or more encoder layers and one or more decoder layers, the first machine learning algorithm 140 may update the numerical weight for each token 132 of the unmasked portion 136 to determine a syntactical relationship. As a result, the first machine learning algorithm 140 may predict the tokens 132 of the masked portion 134 once a syntactical relationship is determined. In some embodiments, the first machine learning algorithm may create an attention map 144. The attention map 144, an example of which is shown in
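A minimal sketch of the attention scores underlying such an attention map is given below, assuming toy token embeddings in place of learned query and key projections; the numbers are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_map(embeddings):
    """Scaled dot-product attention scores: each row shows how strongly
    one token attends to every token, and rows sum to 1."""
    d = len(embeddings[0])
    scores = []
    for q in embeddings:
        row = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
               for k in embeddings]
        scores.append(softmax(row))
    return scores

# three toy embeddings standing in for, e.g., tokens "[*]", "C", "C"
amap = attention_map([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

Note that the two identical toy embeddings receive identical attention scores, which is the kind of syntactical regularity an attention map makes visible.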
The system 100 can be further configured to compute a unique fingerprint 150 for each of the standardized data strings 130. As shown in
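One hedged sketch of how token-level encoder outputs may be pooled into a fixed-length fingerprint follows; averaging over the token axis is a common choice (an assumption here, not a statement of the disclosed implementation), and it guarantees that polymers with different numbers of tokens still map to vectors of the same dimension.

```python
# Pool per-token encoder outputs (toy numbers) into one fixed-length
# vector by averaging over the token axis.
def pool_fingerprint(token_vectors):
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

short = pool_fingerprint([[1.0, 2.0], [3.0, 4.0]])               # 2 tokens
longer = pool_fingerprint([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 tokens
# both fingerprints have the same dimension despite different lengths
```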
In some embodiments, the system 100 can be further configured to map the unique fingerprints 150 for each standardized data string to a selector vector 162 via a second machine learning algorithm 160. The selector vector 162 can be represented as a binary vector, which can be configured to represent the plurality of polymer properties 170 using a binary number system. It should be appreciated that the selector vector may represent the plurality of polymer properties 170 using other types of number systems including but not limited to hexadecimal, decimal, and the like. It should also be appreciated that the selector vector 162 not having the same dimensions as the input vector 152 does not impact the accuracy of the plurality of polymer properties 170 predicted for each entry of the input vector 152.
In some embodiments, the second machine learning algorithm 160 may be a multitask deep neural network trained via multitask learning (MTL). As known in the art, MTL is a subfield of machine learning wherein a machine learning model can be trained to perform multiple tasks at once. MTL can be advantageous when used in conjunction with NLP technologies, such as the Transformers discussed in the present technology, because the multiple tasks performed are related or share some similarity. With respect to some embodiments of the present disclosure, the second machine learning algorithm 160 may use MTL to analyze various features of the unique fingerprints 150, represented as entries in the input vector 152, to predict the plurality of polymer properties 170 for each unique fingerprint 150. The system 100 may also use, as the second machine learning algorithm 160, a single-task machine learning algorithm or a multitask machine learning algorithm without a neural network to perform the same analysis.
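The selector-vector idea behind such a multitask predictor may be sketched as follows: the fingerprint is concatenated with a binary selector naming the requested property, so a single shared model serves all properties. The property names and weights below are toy assumptions, not trained parameters or disclosed values.

```python
# Illustrative property list; names are hypothetical.
PROPERTIES = ["glass_transition", "density", "band_gap"]

def selector(prop):
    """Binary selector vector identifying the requested property."""
    return [1.0 if p == prop else 0.0 for p in PROPERTIES]

def predict(fingerprint, prop, weights, bias=0.0):
    """Single shared linear head over [fingerprint + selector]; a stand-in
    for the multitask deep neural network."""
    x = fingerprint + selector(prop)
    return sum(wi * xi for wi, xi in zip(weights, x)) + bias

fp = [0.2, 0.5]                    # toy 2-d fingerprint
w = [1.0, 2.0, 10.0, 20.0, 30.0]   # toy weights: 2 fp dims + 3 selector dims
tg = predict(fp, "glass_transition", w)
rho = predict(fp, "density", w)    # same model, different selector
```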
Once the second machine learning algorithm has predicted the plurality of polymer properties 170 for each unique fingerprint 150, the system 100 may be further configured to output the plurality of polymer properties for each unique fingerprint 150. As shown in
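Filtering the output of predicted properties against one or more search parameters may be sketched as the range query below; the polymer names, property names, and values are illustrative assumptions only.

```python
# Toy predicted-property table; all values are illustrative.
predictions = {
    "polymer_A": {"density": 1.05, "glass_transition": 373.0},
    "polymer_B": {"density": 0.92, "glass_transition": 250.0},
}

def filter_by(preds, prop, lo, hi):
    """Keep polymers whose predicted `prop` lies within [lo, hi]."""
    return {name: p for name, p in preds.items() if lo <= p[prop] <= hi}

hits = filter_by(predictions, "glass_transition", 300.0, 400.0)
```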
In some embodiments, the system 100 and method 200 can also be implemented in a computing environment, as shown in
As shown in
The computer system 310 also includes a system memory 330 coupled to the bus 305 for storing information and instructions to be executed by processors 320. The system memory 330 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 331 and/or random access memory (RAM) 332. The system memory RAM 332 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM). The system memory ROM 331 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM). In addition, the system memory 330 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 320. A basic input/output system (BIOS) 333 containing the basic routines that help to transfer information between elements within computer system 310, such as during start-up, may be stored in ROM 331. RAM 332 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 320. System memory 330 may additionally include, for example, operating system 334, application programs 335, other program modules 336 and program data 337.
The computer system 310 also includes a disk controller 340 coupled to the bus 305 to control one or more storage devices for storing information and instructions, such as a hard disk 341 and a removable media drive 342 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid state drive). The storage devices may be added to the computer system 310 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
The computer system 310 may also include a display controller 365 coupled to the bus 305 to control a display 366, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. The computer system 310 includes an input interface 360 and one or more input devices, such as a keyboard 362 and a pointing device 361, for interacting with a computer user and providing information to the processor 320. The pointing device 361, for example, may be a mouse, a trackball, or a pointing stick for communicating direction information and command selections to the processor 320 and for controlling cursor movement on the display 366. The display 366 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 361.
The computer system 310 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 320 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 330. Such instructions may be read into the system memory 330 from another computer readable medium, such as a hard disk 341 or a removable media drive 342. The hard disk 341 may contain one or more datastores and data files used by embodiments of the present invention. Datastore contents and data files may be encrypted to improve security. The processors 320 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 330. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
As stated above, the computer system 310 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processor 320 for execution. A computer readable medium may take many forms including, but not limited to, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as hard disk 341 or removable media drive 342. Non-limiting examples of volatile media include dynamic memory, such as system memory 330. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the bus 305. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
The computing environment 300 may further include the computer system 310 operating in a networked environment using logical connections to one or more remote computers, such as remote computer 380. Remote computer 380 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 310. When used in a networking environment, computer system 310 may include modem 372 for establishing communications over a network 371, such as the Internet. Modem 372 may be connected to bus 305 via user network interface 370, or via another appropriate mechanism.
Network 371 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 310 and other computers (e.g., remote computer 380). The network 371 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-11 or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 371.
The embodiments of the present disclosure may be implemented with any combination of hardware and software. In addition, the embodiments of the present disclosure may be included in an article of manufacture (e.g., one or more computer program products) having, for example, computer-readable, non-transitory media. The media has embodied therein, for instance, computer readable program code for providing and facilitating the mechanisms of the embodiments of the present disclosure. The article of manufacture can be included as part of a computer system or sold separately.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
An executable application, as used herein, comprises code or machine readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
A graphical user interface (GUI), as used herein, comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions. The GUI also includes an executable procedure or executable application. The executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user. The processor, under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without user direct initiation of the activity.
In any of the embodiments described herein, the system 100 can use a data set of at least 35,517 data points spanning 29 different properties. The properties can represent thermal, thermodynamic & physical, electronic, optical & dielectric, mechanical, and permeability characteristics of chemical polymers, or any other class of measurable or computable properties. The properties can include but not be limited to glass transition temperature (Tg), melting temperature (Tm), thermal degradation temperature (Td), heat capacity (cp), atomization energy (Eat), limiting oxygen index (Oi), crystallization tendency (Xc), density (ρ), band gap (chain) (Egc), band gap (bulk) (Egb), electron affinity (Eea), ionization energy (Ei), electron injection barrier (Eib), cohesive energy density (δ), refractive index (DFT) (nc), refractive index (exp) (ne), dielectric constant (DFT) (kc), dielectric constant at frequency “f” (kf), Young's modulus (E), tensile strength at yield (σy), tensile strength at break (σb), elongation at break (ϵb), O2 gas permeability (μO2), N2 gas permeability (μN2), CO2 gas permeability (μCO2), H2 gas permeability (μH2), He gas permeability (μHe), CH4 gas permeability (μCH4), and any other property that is measurable or computable. The system 100 can be observed to perform with a high degree of accuracy (R2>0.80) with respect to predicting properties for polymers, performing comparably to traditional polymer fingerprinting methods.
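The coefficient of determination (R2) used to quantify prediction accuracy above follows the standard definition; the sketch below is an illustrative NumPy implementation of that formula, not code from the disclosed system:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect predictor yields R2 = 1.0; a predictor no better than the mean of the data yields R2 = 0.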
The system 100 can be configured to be more than two orders of magnitude (>200 times) faster than traditional polymer fingerprinting methods. The system 100 can also be scalable to cloud-based computing systems. As shown in
The following examples further illustrate aspects of the present disclosure. However, they are in no way a limitation of the teachings or disclosure of the present disclosure as set forth herein.
The table above shows an exemplary training data set for the property predictors. The properties are sorted into categories, the category provided at the top of each block. The data set contains 29 properties (dielectric constants kf are available at 9 different frequencies f). HP and CP stand for homopolymer and copolymer, respectively.
Once polyBERT has completed its unsupervised learning task using the 100 million hypothetical PSMILES strings, multitask supervised learning maps polyBERT polymer fingerprints to multiple properties to produce property predictors. The property data set in TABLE 1 was used for training the property predictors. The data set contains 28,061 (≈80%) homopolymer and 7,456 (≈20%) copolymer (total of 35,517) data points of 29 experimental and computational polymer properties that pertain to 11,145 different monomers and 1,338 distinct copolymer chemistries, respectively. Each of the 7,456 copolymer data points involved two distinct comonomers at various compositions. The copolymer data points are for random copolymers, which are adequately handled by the adopted fingerprinting strategy (see Methods section). Alternating copolymers are treated as homopolymers with appropriately defined repeat units for fingerprinting purposes. Other flavors of copolymers may also be encoded by adding additional fingerprint components.
polyBERT
polyBERT iteratively ingests 100 million hypothetical PSMILES strings to learn the polymer chemical language, as sketched in
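Before ingestion, each PSMILES string is broken into tokens. The regex-based tokenizer below is a minimal illustration using a common SMILES tokenization pattern; it is an assumption for demonstration purposes, since polyBERT's actual vocabulary is learned from the training corpus rather than fixed by a regular expression:

```python
import re

# Common SMILES tokenization pattern (illustrative assumption): bracket
# atoms, two-letter elements, chirality marks, ring-closure labels,
# then single characters (atoms, digits, bonds, branches, the star).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|%\d{2}|[A-Za-z]|\d|[=#\-\+\(\)/\\\.~\*])"
)

def tokenize_psmiles(psmiles: str):
    """Split a PSMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(psmiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == psmiles, "tokenizer dropped characters"
    return tokens
```

For example, the ester-containing repeat unit `[*]CC(=O)[*]` splits into the endpoint markers, atoms, branch parentheses, and the double-bond symbol as separate tokens.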
The training with 80 million PSMILES strings renders polyBERT an expert polymer chemical linguist who knows grammatical and syntactical rules of the polymer chemical language. polyBERT learns patterns and relations of tokens via the multi-head self-attention mechanism and fully connected feed-forward network of the Transformer encoders. The attention mechanism instructs polyBERT to devote more focus to a small but essential part of a PSMILES string. polyBERT's learned latent spaces after each encoder block are numerical representations of the input PSMILES strings. The polyBERT fingerprint is the average over the token dimension (sentence average) of the last latent space (dotted line in
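The sentence-average pooling described above (averaging the last latent space over the token dimension) can be sketched in NumPy. The array shapes and the padding-mask convention are illustrative assumptions, not code from the actual model:

```python
import numpy as np

def sentence_average_fingerprint(last_hidden, attention_mask):
    """Average the final encoder latent space over the token dimension,
    ignoring padding tokens, to obtain one fingerprint per polymer.

    last_hidden:    (batch, tokens, hidden) array of encoder outputs
    attention_mask: (batch, tokens) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(float)   # (batch, tokens, 1)
    summed = (last_hidden * mask).sum(axis=1)        # sum over real tokens
    counts = mask.sum(axis=1)                        # number of real tokens
    return summed / counts                           # (batch, hidden)
```

The masking ensures that padding positions, present only to equalize sequence lengths within a batch, do not dilute the fingerprint.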
To draw analogies and assess chemical relevancy, polyBERT fingerprints were compared with the handcrafted Polymer Genome (PG) fingerprints that numerically encode polymers at three different length scales. The PG fingerprint vector for the data set in this work has 945 components and is sparsely populated (93.9% zeros). The reason for this ultra-sparsity is that many PG fingerprint components count chemical groups in polymers; a fingerprint component of zero indicates that a chemical group is not present. In contrast, polyBERT fingerprint vectors have 600 components and are fully dense (0% zeros). Fully dense and lower-dimensional fingerprints are often advantageous for ML models whose computation time scales superlinearly (O(n^s), s>1) with the data set size (n), such as Gaussian process or kernel ridge techniques. Moreover, in the case of neural networks, sparse and high-dimensional input vectors can cause an unnecessarily high memory load that reduces training and inference speed. The dimensionality of polyBERT fingerprints is a parameter that can be chosen arbitrarily to yield the best training result.
polyBERT learns chemical motifs and relations in the PSMILES strings using the Transformer encoders, each of which includes an attention and feed-forward network layer (see
Not surprisingly, the computations of polyBERT and PG fingerprints scale nearly linearly with the number of PSMILES strings although their performance (i.e., pre-factor) can be quite different, as shown in the log-log scaled
For benchmarking the property prediction accuracy of polyBERT and PG fingerprints, multitask deep neural networks were trained for each property category. Multitask deep neural networks have demonstrated best-in-class results for polymer property predictions while being fast, scalable, and readily amenable if more data points become available. Unlike single-task models, multitask models simultaneously predict numerous properties (tasks) and harness inherent but hidden correlations in data to improve their performance. Such correlation exists, for instance, between Tg and Tm, but the exact correlation varies across specific polymer chemistries. Multitask models learn and improve from these varying correlations in data. The training protocol of the multitask deep neural networks follows state-of-the-art methods involving five-fold cross-validation and a consolidating meta learner that forecasts the final property values based upon the ensemble of cross-validation predictors.
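One common way to realize such a multitask predictor, consistent with the concatenation-conditioned design named later in this disclosure, is to concatenate the polymer fingerprint with a one-hot vector selecting the property (task) to predict. The toy NumPy forward pass below uses random, untrained weights and assumed layer sizes purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

N_FP, N_TASKS, N_HIDDEN = 600, 29, 64   # illustrative sizes (assumptions)

# Toy weights; in practice these are learned from the property data set.
W1 = rng.normal(size=(N_FP + N_TASKS, N_HIDDEN)) * 0.01
b1 = np.zeros(N_HIDDEN)
w2 = rng.normal(size=N_HIDDEN) * 0.01

def predict_property(fingerprint, task_id):
    """Concatenation-conditioned multitask prediction of one property:
    the input is [fingerprint ; one-hot task selector]."""
    selector = np.zeros(N_TASKS)
    selector[task_id] = 1.0                       # select the property
    x = np.concatenate([fingerprint, selector])   # (600 + 29,)
    h = np.maximum(0.0, x @ W1 + b1)              # ReLU hidden layer
    return float(h @ w2)                          # scalar property value
```

Because one network serves all 29 tasks, its shared hidden layers can exploit the inter-property correlations described above.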
The ultrafast and accurate polyBERT-based polymer informatics pipeline allows the system to predict all 29 properties of the 100 million hypothetical polymers that were originally created to train polyBERT.
Other Advantages of polyBERT: Beyond Speed and Accuracy
The feed-forward network as shown in
A second advantage of the polyBERT approach is interpretability. Analyzing the chemical relevancy of polyBERT fingerprints in greater detail can reveal chemical functions and interactions of structural parts of the polymers. As illustrated with the examples of the three polymers in
Yet another advantage of the polyBERT approach is its coverage of the entire chemical space. Molecule SMILES strings are a subset of polymer SMILES strings and differ only by the two star ([*]) symbols that indicate the two endpoints of the polymer repeat unit. polyBERT has no intrinsic limitations or functions that obstruct predicting fingerprints for molecule SMILES strings. The experiments described herein show consistent and well-conditioned fingerprints for molecule SMILES strings using polyBERT that required only minimal changes in the canonicalization routine.
Here, a generalizable, ultrafast, and accurate polymer informatics pipeline is described that is seamlessly scalable on cloud hardware and suitable for high-throughput screening of huge polymer spaces. polyBERT, which is a Transformer-based NLP model modified for the polymer chemical language, is the critical element of the pipeline. After training on 100 million hypothetical polymers, the polyBERT-based informatics pipeline arrives at a representation of polymers and predicts polymer properties over two orders of magnitude faster than, but at the same accuracy as, the best pipeline based on handcrafted PG fingerprints.
The total polymer universe is gigantic, but currently limited by experimentation, manufacturing techniques, resources, and economic aspects. Contemplating different polymer types such as homopolymers, copolymers, and polymer blends, novel undiscovered polymer chemistries, additives, and processing conditions, the number of possible polymers in the polymer universe is truly limitless. Searching this extraordinarily large space, enabled by property predictions, is limited by prediction speed. The accurate prediction of 29 properties for 100 million hypothetical polymers in a reasonable time demonstrates that polyBERT is an enabler of extensive explorations of this gigantic polymer universe at scale. polyBERT paves the pathway for the discovery of novel polymers 100 times faster (and potentially even faster with newer GPU generations) than state-of-the-art informatics approaches, at the same accuracy as slower handcrafted fingerprinting methods, by leveraging Transformer-based ML models originally developed for NLP. polyBERT fingerprints are dense and chemically pertinent numerical representations of polymers that adequately measure polymer similarity. They can be used for any polymer informatics task that requires numerical representations of polymers, such as property predictions (demonstrated here), polymer structure predictions, ML-based synthesis assistants, etc. polyBERT fingerprints have a huge potential to accelerate past polymer informatics pipelines by replacing the handcrafted fingerprints with polyBERT fingerprints. polyBERT may also be used to directly design polymers based on fingerprints (that can be related to properties) using polyBERT's decoder that has been trained during the self-supervised learning. This, however, requires retraining and structural updates to polyBERT and is thus part of future work.
The string representations of homopolymer repeat units in this work are PSMILES strings. PSMILES strings follow the SMILES syntax definition but use two stars to indicate the two endpoints of the polymer repeat unit (e.g., [*]CC[*] for polyethylene). The raw PSMILES syntax is non-unique; i.e., the same polymer may be represented using many PSMILES strings; canonicalization is a scheme to reduce the different PSMILES strings of the same polymer to a single unique canonicalized PSMILES string. polyBERT requires canonicalized PSMILES strings because polyBERT fingerprints change with different writings of PSMILES strings. In contrast, PG fingerprints are invariant to the way of writing PSMILES strings and, thus, do not require canonicalization.
As described herein, the canonicalize PSMILES Python package was developed to find the canonical form of PSMILES strings in four steps: (i) it finds the shortest PSMILES string by searching for and removing repetition patterns, (ii) it connects the polymer endpoints to create a periodic PSMILES string, (iii) it canonicalizes the periodic PSMILES string using RDKit's canonicalization routines, and (iv) it breaks the periodic PSMILES string to create the canonical PSMILES string.
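Step (i) above, reduced to its essence, finds the smallest repeating period of a string. The toy function below illustrates that idea on plain character strings only; it is an illustrative simplification, as the real routine must respect SMILES syntax (rings, branches, bracket atoms) and is considerably more involved:

```python
def smallest_repeat_unit(backbone: str) -> str:
    """Return the shortest substring whose repetition rebuilds `backbone`.
    Toy illustration of repetition-pattern removal; ignores SMILES syntax.
    """
    n = len(backbone)
    for period in range(1, n + 1):
        # A valid period must divide the length and tile the string exactly.
        if n % period == 0 and backbone[:period] * (n // period) == backbone:
            return backbone[:period]
    return backbone
```

For instance, a backbone written redundantly as three repetitions, such as "COCOCO", reduces to its "CO" period, while a string with no internal repetition is returned unchanged.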
Fingerprinting converts geometric and chemical information of polymers (based upon the PSMILES string) to machine-readable numerical representations in the form of vectors. These vectors are the polymer fingerprints and can be used for property predictions, similarity searches, or other tasks that require numerical representations of polymers.
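For the similarity searches mentioned above, a standard choice for comparing dense vectors, and presumably applicable to fingerprints of this kind, is cosine similarity; the sketch below is illustrative and not drawn from the disclosed system:

```python
import numpy as np

def cosine_similarity(fp_a, fp_b) -> float:
    """Cosine similarity between two polymer fingerprint vectors:
    1.0 for identical directions, 0.0 for orthogonal vectors."""
    fp_a = np.asarray(fp_a, dtype=float)
    fp_b = np.asarray(fp_b, dtype=float)
    return float(fp_a @ fp_b / (np.linalg.norm(fp_a) * np.linalg.norm(fp_b)))
```

Ranking candidate polymers by cosine similarity to a query fingerprint yields a simple nearest-neighbor search over a fingerprinted polymer library.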
The polyBERT fingerprints were compared with the handcrafted Polymer Genome (PG) polymer fingerprints. PG fingerprints capture key features of polymers at three hierarchical length scales. At the atomic scale (1st level), PG fingerprints track the occurrence of a fixed set of atomic fragments (or motifs). The block scale (2nd level) uses the Quantitative Structure-Property Relationship (QSPR) fingerprints for capturing features on larger length-scales as implemented in the cheminformatics toolkit RDKit. The chain scale (3rd level) fingerprint components deal with “morphological descriptors” such as the ring distance or length of the largest side-chain.
As discussed, the composition-weighted comonomer fingerprints were summed to compute copolymer fingerprints, F = Σi ci Fi, where the sum runs over the N comonomers in the copolymer, Fi is the fingerprint vector of comonomer i, and ci is the fraction of comonomer i. This approach renders copolymer fingerprints invariant to the order in which one may sort the comonomers and satisfies the two main demands of uniqueness and invariance to different (but equivalent) periodic unit specifications. While the current fingerprinting scheme is most appropriate for random copolymers, other copolymer flavors may be encoded by adding additional fingerprint components. Contrary to homopolymer fingerprints, copolymer fingerprints may not be interpretable (e.g., the composition-weighted sum of the fingerprint component “length of largest side-chain” of two homopolymers has no physical meaning).
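The composition-weighted sum F = Σi ci Fi can be written directly in NumPy; the order-invariance claimed above follows from the commutativity of the sum:

```python
import numpy as np

def copolymer_fingerprint(fingerprints, fractions):
    """Composition-weighted sum of comonomer fingerprints:
    F = sum_i c_i * F_i, with the fractions c_i summing to 1."""
    F = np.asarray(fingerprints, dtype=float)   # (N_comonomers, dim)
    c = np.asarray(fractions, dtype=float)      # (N_comonomers,)
    assert np.isclose(c.sum(), 1.0), "comonomer fractions must sum to 1"
    return c @ F                                # weighted sum, shape (dim,)
```

Permuting the comonomers together with their fractions leaves the result unchanged, which is exactly the invariance property the scheme requires.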
Multitask deep neural networks simultaneously learn multiple polymer properties to utilize inherent correlations of properties in data sets. The training protocol of the concatenation-conditioned multitask predictors follows state-of-the-art techniques involving five-fold cross-validation and a meta learner that forecasts the final property values based upon the ensemble of cross-validation predictors. After shuffling, the data set was split into two parts: 80% was used to train the five cross-validation models, and the remaining 20% was used to train and validate the meta learners. The Hyperband method of the Python package KerasTuner was used to fully optimize all hyperparameters of the neural networks, including the number of layers, number of nodes, dropout rates, and activation functions. The Hyperband method finds the best set of hyperparameters by minimizing the Mean Squared Error (MSE) loss function. Data set stratification of all splits was performed based on the polymer properties. The multitask deep neural networks are implemented using the Python API of TensorFlow.
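The 80/20 split, five-fold cross-validation, and ensemble consolidation described above can be sketched as follows. A simple averaging of the five predictors stands in for the trained meta learner, and the `fit`/`predict` callables are placeholders for any model; both simplifications are assumptions for illustration:

```python
import numpy as np

def five_fold_with_ensemble(X, y, fit, predict, seed=0):
    """Train 5 cross-validation models on an 80% split, then consolidate
    their predictions on the held-out 20% by averaging (a stand-in for
    the trained meta learner described in the text)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                  # shuffle first
    cut = int(0.8 * len(X))
    train, held_out = idx[:cut], idx[cut:]         # 80% / 20% split

    folds = np.array_split(train, 5)
    models = []
    for k in range(5):                             # leave one fold out each time
        tr = np.concatenate([folds[j] for j in range(5) if j != k])
        models.append(fit(X[tr], y[tr]))

    # Consolidate the five predictors on the held-out split.
    preds = np.mean([predict(m, X[held_out]) for m in models], axis=0)
    return preds, y[held_out]
```

With a linear least-squares model plugged in for `fit`/`predict` and noiseless linear data, the ensemble recovers the held-out targets exactly, which makes the protocol easy to verify.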
Experiments were conducted using a private infrastructure, which has an estimated carbon efficiency of 0.432 kgCO2eq kWh−1. A total of 31 hours of computations were performed on four Quadro GP100 16 GB GPUs (thermal design power of 235 W) for training polyBERT. Total emissions are estimated to be 12.6 kgCO2eq. About 8 hours of computations on four GPUs were necessary for training the cross-validation and meta learner models, with an estimated emission of 3.3 kgCO2eq each for the polyBERT and Polymer Genome fingerprints. The total emissions for predicting 29 properties for 100 million hypothetical polymers are estimated to be 5.5 kgCO2eq, taking a total of 13.5 hours. Estimations were conducted using a Machine Learning Impact calculator.
It is to be understood that the embodiments and claims disclosed herein are not limited in their application to the details of construction and arrangement of the components set forth in the description and illustrated in the drawings. Rather, the description and the drawings provide examples of the embodiments envisioned. The embodiments and claims disclosed herein are further capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purposes of description and should not be regarded as limiting the claims.
Accordingly, those skilled in the art will appreciate that the conception upon which the application and claims are based may be readily utilized as a basis for the design of other structures, methods, and systems for carrying out the several purposes of the embodiments and claims presented in this application. It is important, therefore, that the claims be regarded as including such equivalent constructions.
Furthermore, the purpose of the foregoing Abstract is to enable the United States Patent and Trademark Office and the public generally, and especially including the practitioners in the art who are not familiar with patent and legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The Abstract is neither intended to define the claims of the application, nor is it intended to be limiting to the scope of the claims in any way.
This is a Continuation Application of International Application No. PCT/US2023/073627 filed 7 Sep. 2023, which International Application claims the benefit of U.S. Provisional Application Ser. No. 63/374,761, filed 7 Sep. 2022, the entire contents and substance of which are incorporated herein by reference in their entirety as if fully set forth below.
This invention was made with government support under Grant No. GR10005221 awarded by the Office of Naval Research (ONR), and Grant No. GR00004636 awarded by the National Science Foundation (NSF). The government has certain rights in the invention.
Number | Date | Country
---|---|---
63374761 | Sep 2022 | US

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2023/073627 | Sep 2023 | WO
Child | 18595796 | | US