The present disclosure relates to systems and methods for chemical toxicity prediction; more specifically, the present disclosure relates to systems and methods for toxicity prediction of chemicals in industrial wastes, wastewater, discharged water, exhaust gases, products or byproducts.
In recent years, the United Nations has promoted the Sustainable Development Goals (SDGs), and China is promoting a 2050 net-zero emission pathway policy and transformation policy. Under the green consumption and net-zero policies, many industries are undergoing a transformation to gradually develop and apply emerging green consumer chemicals that are more energy-saving and more efficient. Since these emerging green consumer chemicals are applied to various industrial processes as well as to innovative, up-to-date research and development, toxicity hazard evaluation of such emerging chemicals is necessary. However, toxicity testing and relevant hazard data are lacking for emerging green consumer chemicals. Therefore, applicable data are collected from internationally established chemical toxicity databases, and the structure and toxicity of a chemical can be simulated and predicted through data mining technology and artificial intelligence technology. This saves research funds, considerable time, and experimental animals, and benefits the industrial transformation, the competitiveness, and the business opportunities of the domestic green consumer chemical industry in its net-zero emission transformation.
In the present disclosure, operation systems and methods for green consumer chemical toxicity prediction are established through the collection and integration of domestic and foreign databases combined with the establishment of a quantitative structure-activity relationship (QSAR) model cascaded with analysis instruments, so as to achieve the promotion of green chemistry.
Some embodiments of the present disclosure provide a method for chemical toxicity prediction. The method comprises: receiving test data from an analysis instrument; selecting a candidate chemical according to the test data; determining a hazard translated level and a hazard evaluation level of the candidate chemical according to a molecular fingerprint of the candidate chemical; and predicting the toxicity of the candidate chemical by using a quantitative structure-activity relationship (QSAR) model based on the hazard translated level and the hazard evaluation level.
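The four operations of the method can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the `Candidate` record, the use of a measured mass as the matching key, the `level_table` lookup keyed by fingerprint, and the `placeholder_model` function are all hypothetical names introduced here, and the placeholder stands in for a trained QSAR model rather than reproducing one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    mass: float                  # hypothetical matching key derived from test data
    fingerprint: FrozenSet[int]  # hypothetical: indices of the set fingerprint bits


# (hazard translated level, hazard evaluation level)
Levels = Tuple[int, int]


def select_candidate(measured_mass: float, library: List[Candidate],
                     tolerance: float = 0.01) -> Optional[Candidate]:
    """Return the library entry closest to the measured value, within tolerance."""
    in_range = [c for c in library if abs(c.mass - measured_mass) <= tolerance]
    return min(in_range, key=lambda c: abs(c.mass - measured_mass), default=None)


def predict_toxicity(measured_mass: float,
                     library: List[Candidate],
                     level_table: Dict[FrozenSet[int], Levels],
                     model: Callable[[Candidate, int, int], float]) -> Optional[float]:
    """Receive test data, select a candidate, look up its levels, run the model."""
    candidate = select_candidate(measured_mass, library)
    if candidate is None:
        return None  # nothing in the library matches the test data
    levels = level_table.get(candidate.fingerprint)
    if levels is None:
        return None  # no hazard levels recorded for this fingerprint
    translated, evaluation = levels
    return model(candidate, translated, evaluation)


# Illustrative placeholder values only; these are not measured data.
library = [
    Candidate("chem_a", 100.000, frozenset({1, 4, 9})),
    Candidate("chem_b", 250.500, frozenset({2, 4})),
]
level_table = {frozenset({1, 4, 9}): (1, 2), frozenset({2, 4}): (3, 3)}


def placeholder_model(candidate: Candidate, translated: int, evaluation: int) -> float:
    # Stand-in for a trained QSAR model: returns the mean of the two levels.
    return (translated + evaluation) / 2


result = predict_toxicity(100.004, library, level_table, placeholder_model)
```

Passing the model in as a callable keeps the sketch independent of how the QSAR model is actually trained.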
Some embodiments of the present disclosure provide a system for chemical toxicity prediction. The system comprises: a database; a host comprising a memory and a processor; and an analysis instrument. The memory stores instructions that cause the host to execute the following operations: receiving test data of a sample from the analysis instrument; selecting a candidate chemical from the database according to the test data; determining a hazard translated level and a hazard evaluation level of the candidate chemical from the database according to a molecular fingerprint of the candidate chemical; and predicting the toxicity of the candidate chemical by using a quantitative structure-activity relationship (QSAR) model based on the hazard translated level and the hazard evaluation level.
The present disclosure will become more understandable from the following detailed description when read with the accompanying drawings. It is noted that various features may not be drawn to scale. In fact, the sizes of various features can be arbitrarily increased or decreased for clarity of description.
The following disclosed contents are used for implementing many different embodiments or examples of different features of the provided subject. Next, specific examples for operations, components and configurations will be described to simplify the present disclosure. Of course, these operations, components, and configurations are only examples and not intended to limit the present disclosure. For example, a first operation executed before or after a second operation in the description may include embodiments in which the first and second operations are executed together, and may also include embodiments in which additional operations are executed between the first and second operations. For example, in the following description, the formation of the first feature above, on, or within the second feature may include embodiments formed by direct contact between the first feature and the second feature, and may also include embodiments where additional features can be formed between the first feature and the second feature so that the first feature and the second feature do not come into direct contact. In addition, the present disclosure can repeatedly refer to reference numbers and/or letters. This repetition is for the purpose of simplicity and clarity, and does not specify the relationships between the various embodiments and/or configurations discussed.
For the convenience of description, temporal relative terms such as “before”, “in the front”, “at the back”, “after” and other similar terms can be used herein to describe a relationship between one operation or feature depicted in the figures and another operation or feature (or multiple operations or features). The temporal relative terms are intended to encompass different sequences of the operations depicted in the various figures. In addition, for the convenience of description, spatial relative terms such as “below”, “under”, “lower part”, “above”, “upper part” and other similar terms can be used herein to describe a relationship between one component or feature depicted in the figures and another component or feature (or multiple components or features). In addition to the orientations depicted in the figures, the spatial relative terms are also intended to encompass different orientations of devices in use or operation. The device can be oriented in other ways (rotated by 90 degrees or in other orientations), and the spatial relative descriptors used herein can be interpreted accordingly. For ease of description, relative terms about connection, such as “connect”, “connected”, “connection”, “couple”, “coupled”, “communicate” and other similar terms, can be used herein to describe an operational connection, coupling or link between two components or features. The relative terms used for connection are intended to cover different connections, couplings or links of devices or components. The devices or components can be directly or indirectly connected, coupled or linked to each other via another component. The devices or components can be connected, coupled or linked to each other by wire or wirelessly.
As used herein, unless the context clearly indicates otherwise, the singular terms “a/an” and “the” may include plural references. For example, unless the context clearly indicates otherwise, reference to a device may include plural devices. The terms “comprise” and “include” indicate the presence of the stated features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements and/or components or a combination thereof. The term “and/or” may include any one of the listed items or any and all combinations thereof.
In addition, quantities, ratios, and other numerical values are sometimes presented in a range format. It should be noted that such a range format is used for convenience and conciseness, and should be flexibly understood as including not only the numerical values explicitly specified as range limits, but also all individual numerical values or sub-ranges within that range, as if each numerical value and sub-range were explicitly specified.
The natures and purposes of the embodiments are described in detail as follows. However, it is understood that the present disclosure provides many applicable inventive concepts, which can be embodied in a wide variety of specific situations. The specific embodiments described only illustrate specific ways of making and using the present invention, without limiting its scope.
In the present disclosure, high-tech industrial wastewater and discharged water may be used as test and analysis samples. The purposes of the present disclosure may be to understand analytical principles, interpret analytical data, and understand system operations. The samples disclosed in the present disclosure may also contain unknown chemical substances for which no fragment information is available; the present disclosure can first focus on identifiable chemicals. The present disclosure can confirm the feasibility of analysis. The present disclosure can comprise analysis requirements for unknown chemical substances.
The present disclosure can integrate multiple national attention or control lists as well as multiple international organization attention or control lists. The present disclosure can combine multiple per/polyfluoroalkyl polyolefin databases and lists. By chemical structure comparison, the present disclosure can integrate chemical substance information and store it in relevant work databases. The present disclosure can comprise a relevant application programming interface (API), so as to quickly obtain characteristic information of specific chemical substances. In the present disclosure, a molecular fingerprint group compatible with PubChem can be established based on the simplified molecular input line entry specification (SMILES) representation of chemical structures, this information is stored in relevant work databases, and an easy-to-read chemical characteristic interface is also designed. PubChem refers to an open chemistry database of the National Institutes of Health (NIH) in the United States. The present disclosure can compile a training dataset with a total of 3,810 entries based on carcinogenic hazard endpoints. In the present disclosure, a QSAR model can be designed based on the principles provided by the Organisation for Economic Co-operation and Development (OECD), and the training data processing, the hazard endpoints, the algorithms, and the definition of the predictable applicability domain are planned. In the present disclosure, a toxicity prediction model prototype can be established after soliciting expert opinions via an expert meeting, and the predicted chemical characteristics are input into the work database. For the instrument measurement and identification flow, the raw data of the instrument are converted into a public format in the present disclosure. The present disclosure comprises an instrument signal analysis flow and parameter adjustment. In the present disclosure, actual detection can be performed on test data and chemicals can be identified.
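One common way to relate molecular fingerprints to an applicability domain is a similarity threshold against the training set. The following is a minimal sketch under that assumption; the disclosure does not state which similarity measure or threshold is used, so the Tanimoto coefficient, the function names, and the `threshold` value of 0.5 are illustrative choices only. Fingerprints are represented here as sets of the indices of their set bits.

```python
from typing import FrozenSet, Iterable


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared set bits divided by the union of set bits."""
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treat as dissimilar
    return len(a & b) / len(union)


def in_applicability_domain(query: FrozenSet[int],
                            training: Iterable[FrozenSet[int]],
                            threshold: float = 0.5) -> bool:
    """Assumed rule: the query is in-domain if any training fingerprint
    is at least `threshold` similar to it."""
    return any(tanimoto(query, fp) >= threshold for fp in training)


# Hypothetical bit indices, for illustration only.
fp_a = frozenset({1, 2, 3, 4})
fp_b = frozenset({3, 4, 5, 6})
similarity = tanimoto(fp_a, fp_b)  # 2 shared bits / 6 bits in the union
```

A query outside the domain would typically be flagged rather than predicted, since the model has seen no structurally similar training compound.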
In the present disclosure, chemical toxicity data can be predicted by QSAR model and added into the work database.
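The step of adding predicted data to the work database can be sketched as a look-up-then-predict routine: existing records are returned as they are, and the model is called only when no record exists. The dictionary-based `work_db`, the record layout, and the `predict` callable are hypothetical stand-ins introduced for illustration.

```python
from typing import Callable, Dict


def get_toxicity(chem_id: str,
                 work_db: Dict[str, dict],
                 predict: Callable[[str], float]) -> dict:
    """Return the stored record if present; otherwise predict, store, and return."""
    record = work_db.get(chem_id)
    if record is not None:
        return record  # existing data are used as they are
    record = {"value": predict(chem_id), "source": "predicted"}
    work_db[chem_id] = record  # add the prediction to the work database
    return record


# Illustrative placeholders only; values are not real toxicity data.
calls = []


def fake_model(chem_id: str) -> float:
    calls.append(chem_id)  # record each invocation of the stand-in model
    return 0.0


work_db = {"chem_a": {"value": 1.0, "source": "measured"}}
first = get_toxicity("chem_a", work_db, fake_model)   # served from the database
second = get_toxicity("chem_b", work_db, fake_model)  # predicted and stored
third = get_toxicity("chem_b", work_db, fake_model)   # now served from the database
```

Tagging each record with its source keeps measured and predicted values distinguishable in later queries.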
In recent years, the United Nations has promoted the SDGs, and China is promoting a 2050 net-zero emission pathway policy and transformation policy. Under the green consumption and net-zero policies, many industries are undergoing transformation; that is to say, emerging green consumer chemicals that are more energy-saving and more efficient are gradually being developed and applied. Since these emerging green consumer chemicals are applied to various industrial processes as well as to innovative, up-to-date research and development, toxicity hazard evaluation of such emerging chemicals is necessary. However, toxicity testing and relevant hazard data are lacking for emerging green consumer chemicals. Therefore, applicable data are collected from internationally established chemical toxicity databases, and data mining technology and artificial intelligence technology are used to simulate structures and predict chemical toxicity. This saves research funds, considerable time, and experimental animals, benefits industrial transformation and competitiveness, and benefits the business opportunities of the domestic green consumer chemical industry in its net-zero emission transformation.
In the present disclosure, operation flows, systems, and methods for green consumer chemical toxicity prediction are established through the collection and integration of domestic and foreign databases combined with the establishment of a quantitative structure-activity relationship (QSAR) model cascaded with an analysis instrument, so as to achieve the promotion of green chemistry. The work flow 100 of the present disclosure is seen in
First, in the present disclosure, chemical information in various databases is integrated by collecting international chemical databases and comparing by chemical structures, so as to obtain structure and characteristic information of green consumer chemicals.
Second, in the present disclosure, a green consumer chemical prediction model is established and further integrated into a chemical characteristic prediction system. A chemical molecule fingerprint group is established according to chemical structures so as to establish a chemical characteristic prediction flow, which comprises:
Third, a chemical characteristic structure prediction model is established.
Fourth, an instrument measurement and identification flow is established and cascaded with the chemical characteristic prediction system and method.
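The first step above, integrating records from several databases, can be sketched as a keyed merge. The disclosure compares entries by chemical structure; as a simplifying assumption, this sketch uses a validated CAS Registry Number as the merge key instead. The check-digit rule is the standard CAS one (the last digit equals the weighted sum of the preceding digits, weighted 1, 2, 3, ... from right to left, modulo 10). The source names and record fields are hypothetical.

```python
import re
from typing import Dict, List

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def merge_sources(sources: Dict[str, List[dict]]) -> Dict[str, dict]:
    """Merge records from several sources into one entry per valid CAS number."""
    merged: Dict[str, dict] = {}
    for source_name, records in sources.items():
        for record in records:
            cas = record["cas"].strip()
            if not is_valid_cas(cas):
                continue  # skip entries whose identifier fails validation
            entry = merged.setdefault(cas, {"cas": cas, "sources": []})
            entry["sources"].append(source_name)
            for key, value in record.items():
                entry.setdefault(key, value)  # the first source to set a field wins
    return merged


# Hypothetical source lists; 50-00-0 is formaldehyde, 7732-18-4 has a bad check digit.
sources = {
    "list_a": [{"cas": "50-00-0", "name": "formaldehyde"}],
    "list_b": [{"cas": " 50-00-0 ", "listed": True}, {"cas": "7732-18-4"}],
}
merged = merge_sources(sources)
```

Letting the first source win on conflicting fields is only one possible policy; a production flow would more likely record conflicts for review.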
With the evolution of the times, an attitude and movement of modern society has taken shape, accompanied by economic, social and cultural elements, which is called green consumerism. Green consumerism is a comprehensive and responsible management process that can identify, anticipate and meet the needs of stakeholders while maintaining environmental and natural well-being and causing no harm to human health. Green consumerism establishes a balance between the behavior of buyers and the profit goals of organizations, as shown in
The green chemistry movement was initiated in the early 1990s by the US Environmental Protection Agency (US EPA), with the aim of encouraging industry and academia to use chemistry to prevent pollution. More specifically, the mission of green chemistry is to “promote the reduction or elimination of hazardous substances in the design, manufacturing and use of chemical products, or create innovative chemical technologies.” The green chemistry movement has expanded from the United States to Europe, Australia and Asia. It is evident that green chemistry encompasses the concept of sustainability, rather than just pollution prevention. Green chemistry includes 12 principles that express this concept, as shown in
With the advancement of risk evaluation and safety analysis of chemical substances in chemical products, the demand for rapid screening and prudent quality control continues to grow, and there is an urgent need to develop new strategies for predicting chemical characteristics. The quantitative structure-activity relationship (QSAR) model is a convenient tool whose principle is to study the interaction between small organic molecules and biomacromolecules by simulation and calculation, using the physical, chemical and structural property parameters of the molecules. This method can be applied to drugs, pesticides, chemicals, etc. The QSAR model is an important tool for chemical risk evaluation, cross-model prediction and read-across of chemicals.
The OECD QSAR Toolbox is software jointly developed by the OECD and the European Chemicals Agency (ECHA). The software can be used by government agencies, the chemical industry and academia. The software version adopted in the present disclosure is 4.5 SP1, which was released in March 2022. The built-in database of the software includes 59 sub-databases, over 100,000 chemicals, 3 million experimental measurement data, and 902 QSAR models in four main fields, as shown in Table 2 below. Various novel chemicals have emerged in various applications, and the EU REACH regulation requires that chemical registration information include toxicological information. However, the existing toxicological data are insufficient, and animal protection groups are calling for fewer animal experiments. Therefore, it is desirable to use an alternative method in which computer calculation replaces animal experiments.
The OECD QSAR Toolbox can search the built-in database based on the CAS number or name of a chemical. If the built-in database already contains tested, publicly available toxicological data for the chemical, no prediction is needed. For the OECD QSAR Toolbox operation interface, refer to
After the CAS number is entered, the software presents the structural formula and the information stored in the database. The classification includes structural information, physical and chemical properties, environmental fate and transport, environmental toxicity, and human health hazards. Clicking on the symbol before a classification expands the tree structure. If the data for the hazard endpoint to be reviewed are insufficient, calculation is needed as an alternative method. The user can click on the “Data Gap Filling” button and select the desired hazard endpoint. For example, if it is desired to review the aquatic toxicity endpoint under environmental toxicity, the QSAR button in the red box is clicked after the selection, as shown in
Computational toxicology is one of the emerging research fields of the 21st century and is used to study the toxicity of chemical substances. Even now, many chemicals still lack complete toxicological data. Computational toxicology can be used for preliminary screening and hazard ranking of chemicals, so as to plan further toxicological experiments and understand the toxicity mechanisms of chemicals.
The Integrated Chemical Environment (ICE) is a platform released in March 2017 by the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) in the United States. ICE provides chemical data from animal and non-animal testing for querying the test endpoints described in chemical safety regulations and toxicology information, including acute oral toxicity, skin and eye irritation, skin sensitization and endocrine activity. In addition, ICE also provides the query and prediction of physical and chemical property data of chemicals (including solubility, melting point and molecular weight), as well as high-throughput screening (HTS) data from Tox21.
OncoLogic™, an expert prediction system for evaluating the potential carcinogenicity of chemical substances, is a set of knowledge rules derived from studies of animal or human cancers caused by chemical substances, and it simulates the predictive judgment of human experts. The user provides chemical information to OncoLogic™, and the potential carcinogenicity of the chemical is evaluated by utilizing the knowledge rules integrated in the prediction system. The Open Structure-activity/property Relationship App (OPERA) also provides QSAR/QSPR models that are reliable and meet regulatory requirements for analyzing the characteristics of chemicals in the environment, including prediction of estrogen/androgen activity, physical properties, acute toxicity, pharmacokinetic parameters, ecological toxicity parameters, etc. PRED-SKIN and other skin sensitization data (SkinSensDB), Tox21BodyMap, and the Threshold of Toxicological Concern (TTC) can also be used as references.
In addition, other well-known databases have also been integrated for the characteristics of chemicals, such as ChEMBL (a chemical database of biologically active molecules with drug-like characteristics), which provides information on biologically active molecules with drug-like properties, and the restricted substance specification of the Apparel and Footwear International RSL Management (AFIRM) Group for product packaging ingredients. The Comprehensive Global Database of per/polyfluoroalkyl substances (PFASs) integrates information on per/polyfluoroalkyl substances from 15 countries or territories including the United States and the European Union, such as physical and chemical properties, uses, exposure routes and potential health and environmental impacts, so as to provide comprehensive data for policy formulation and management. The European Commission's Priority List can provide information on whether a chemical is included in the list of “suspected endocrine disruptors”. The Perkins&Will Precautionary List mainly compiles substances that have been classified by regulatory authorities as harmful to humans or the environment, and lists their GreenScreen Benchmark Score and GSPI Six Classes. The Material Declaration for Products of and for the Electrotechnical Industry comes from the International Electrotechnical Commission (IEC). This database specifies which substances, substance groups and material categories in the electrical and electronic industry need to be included in a material declaration, and provides data format specifications for software developers to exchange material declaration data. Toxicological databases can also be used to identify chemical substances that may cause harm to the environment.
Various toxicology databases have been established internationally, such as the US Environmental Protection Agency's high-throughput screening toxicity test database (ToxCast) and its Toxicity Reference Database (ToxRefDB) of in vivo animal toxicity test results. ToxCast holds data on approximately 1,800 chemical substances from a wide range of sources, including industrial and consumer goods, food additives, and potential green chemicals that may be safer alternatives to existing chemicals. These data can be used to explore the chemical and biological activity space associated with a broad range of toxicological outcomes of regulatory concern, to screen chemical substances against over 700 high-throughput assay endpoints, and to develop toxicity prediction models by using the chemical substances, as shown in
In addition to integrating the existing chemical databases, the operational process disclosed in the present disclosure can also refer to two international indicator sources for chemical evaluation and screening: GreenScreen® for Safer Chemicals and the Globally Harmonized System of Classification and Labelling of Chemicals (GHS) issued by the United Nations. These two international standards are used as screening conditions, so that a chemical list can be input and scored to facilitate a more precise grade classification of chemicals. GreenScreen® for Safer Chemicals is a chemical safety evaluation method adopted by multinational corporations and multiple US state governments to assist in identifying chemicals of very high concern and selecting safer alternatives. Its evaluation method can be used for in-depth evaluation of products, processes or any chemical substances so as to determine and compare their hazards, and to assist enterprises in selecting safer alternatives and making management decisions.
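Scoring an input chemical list against a classification standard can be sketched as a simple ranking. The sketch below assumes GHS-style integer categories, in which a lower category number within a hazard class denotes a more severe hazard, and ranks each chemical by its most severe category. Sub-categories such as 1A and 1B are ignored for simplicity, and the function names, chemical names and category values are hypothetical placeholders rather than real classifications.

```python
from typing import Dict, List, Optional, Tuple


def most_severe_category(classifications: Dict[str, int]) -> Optional[int]:
    """Lowest category number across hazard classes, or None if unclassified."""
    return min(classifications.values(), default=None)


def rank_by_concern(
    chemicals: Dict[str, Dict[str, int]]
) -> List[Tuple[str, Optional[int]]]:
    """Order chemicals from highest to lowest concern; unclassified ones go last."""
    scored = [(name, most_severe_category(classes))
              for name, classes in chemicals.items()]
    return sorted(scored,
                  key=lambda item: (item[1] is None, item[1] or 0, item[0]))


# Placeholder input list; the categories are invented for illustration.
chemicals = {
    "chem_a": {"acute_toxicity": 3, "carcinogenicity": 2},
    "chem_b": {"acute_toxicity": 1},
    "chem_c": {},
}
ranking = rank_by_concern(chemicals)
```

Taking the single most severe category is a conservative choice; a weighted score across hazard classes would be an alternative.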
Another tool is the GreenScreen® List Translator. The List Translator is a list-based hazard screening method intended to help users quickly identify known and established chemicals of high concern, and it can serve as a reference template for this project (GreenScreen, 2013). As shown in
The Globally Harmonized System of Classification and Labelling of Chemicals (GHS) is an internationally recognized system for the classification and labeling of chemicals. The GHS was established by the United Nations and adopts a set of globally unified and standardized chemical classification and labeling standards to replace the separate classification and labeling standards used by each country (Winder et al., 2005). The GHS outlines the following standardization requirements: chemical substances and mixtures are classified according to physical, health, and environmental hazards, as shown in
One of the objectives of the present disclosure is to collect predictive analysis tools and databases from various international sources (including Europe, the United States, Japan, China and New Zealand). In the present disclosure, the existing regulations and evaluation standards are integrated in combination with bioinformatics, computer simulation (in silico) modeling, systems biology methods and the like to predict the health and safety effects of chemicals on biological systems or environments. One of the objectives of the present disclosure is to promote the development of predictive toxicology as an alternative method for managing related issues. In the present disclosure, the guidelines and suggestions established by previous domestic research projects can be continued and applied domestically by creating the most suitable databases. Since the domestic safer-alternative integration platform and database are not yet complete, one of the objectives of the present disclosure is to identify and position safe alternatives for chemical substances controlled in Taiwan, and to establish a prototype of a complete database of chemical hazards. The present disclosure can also integrate toxic and concerned chemical substance registration and reporting systems and cross-departmental chemical management information collection platforms (such as the Ministry of Environment's Chemical Cloud) to reduce environmental hazards and enhance industrial value.
The present disclosure can comprise the following contents:
In one embodiment of the present disclosure, data from the United Nations GHS database in NCBI PubChem are used. In one embodiment of the present disclosure, the Python programming language in combination with the BeautifulSoup package, and the Ruby programming language in combination with the Wombat package, are used to obtain the latest GHS project list and store it in the work database. In addition to the PubChem United Nations GHS database, the present disclosure can further analyze: (1) the PubChem database of the National Institutes of Health in the United States, (2) the CompTox Chemical Dashboard of the United States Environmental Protection Agency, (3) the Domestic Substances List (DSL) of Canada, (4) the ECHA announcement of the European Chemicals Agency, (5) the German Federal Environment Agency's “System Regulations for the Treatment of Substances Hazardous to Water” (AwSV) (published in the Federal Law Gazette, 2017), (6) the ChEMBL database, (7) the AFIRM packaging restricted substance list, (8) the fluoroalkyl polyolefin database and list, and (9) the European Commission's Priority List, so as to obtain information and toxicity data on chemical substances in line with international conventions.
The chemical database “PubChem” is a chemical molecule database maintained by the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) in the United States, and was launched in 2004. PubChem can provide characteristic data on natural or artificially synthesized chemical molecules. PubChem includes three sub-databases, namely Compound, Substance, and BioAssay. The relationship among these three sub-databases can be seen in
The CompTox Chemical Dashboard is a chemical database constructed by the United States Environmental Protection Agency. The database provides information on over 870,000 chemicals, including their physical properties, environmental fate, hazard assessment, human absorption, distribution, metabolism, elimination, and exposure, as well as information on structurally similar chemicals, aliases, and related literature. CompTox can therefore gather a wide range of chemical toxicology information.
The chemical substances on the Domestic Substances List (DSL) of Canada are subject to government regulation when they are commercially manufactured in or imported into Canada. The initial version of the DSL was first published in the Canada Gazette, Part II on May 4, 1994. The initial version regulates approximately 23,000 chemicals, and the list is updated about 12 times a year on average, with chemical substances added or removed. The latest Excel file of the chemical substance list can be downloaded from a database on the official website. Besides, when the DSL was established in Canada, the existence, bioaccumulation, and environmental and human hazards of various chemical substances were evaluated, and this toxicity information is important for chemical testing.
The ECHA announcement of the European Chemicals Agency is based on: (1) Annex 17 of the REACH regulation, (2) the Candidate List of Substances of Very High Concern, and (3) the List of Substances of Very High Concern-Priority List. Annex 17 of the REACH regulation refers to a restriction list issued under the European Union's regulation on the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), which lists the occasions and conditions under which the use of chemical substances is prohibited and is updated periodically. The Candidate List of Substances of Very High Concern refers to Article 59(10) of REACH; the use of substances on this list in the European Union is subject to regulation. The List of Substances of Very High Concern-Priority List is issued by the European Chemicals Agency (ECHA): under REACH, ECHA is required to recommend priority substances from the candidate list for inclusion in REACH, and to submit the transitional arrangements, related exemptions and review periods of these substances to the European Commission while taking into account the opinions of the Member State Committee. The List of Substances of Very High Concern-Priority List includes all substances already included in an ECHA draft or final recommendation, and substances that are not recommended will be re-evaluated in subsequent rounds.
The German Federal Environment Agency's System Regulation for the Treatment of Substances Hazardous to Water (AwSV) aims to protect water bodies from adverse changes caused by chemicals and mixtures. The German Federal Environment Agency requires that substances and mixtures processed or stored in facilities located in Germany be classified into three water hazard levels (WGK) according to their water hazard characteristics: (1) WGK1: slightly hazardous to water; (2) WGK2: clearly hazardous to water; and (3) WGK3: highly hazardous to water. In addition to the three water hazard levels, substances can also be classified as not hazardous to water (nwg) or generally hazardous to water (awg). There are approximately 12,000 pieces of chemical information in this list.
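When such classifications are stored in a work database, the five possible labels can be represented as a closed enumeration so that malformed codes are rejected on import. This is a minimal sketch; the enumeration name, the parsing helper and the normalization rule (ignoring case and spaces) are assumptions introduced for illustration.

```python
from enum import Enum


class WaterHazard(Enum):
    NWG = "not hazardous to water"
    AWG = "generally hazardous to water"
    WGK1 = "slightly hazardous to water"
    WGK2 = "clearly hazardous to water"
    WGK3 = "highly hazardous to water"


def parse_water_hazard(code: str) -> WaterHazard:
    """Map a raw code such as 'WGK 2' or 'nwg' to its enumeration member.

    Raises KeyError for codes outside the five known labels.
    """
    normalized = code.upper().replace(" ", "")
    return WaterHazard[normalized]


level = parse_water_hazard(" wgk 2 ")
```

Raising on unknown codes, rather than defaulting to a level, avoids silently assigning a hazard class to a substance that was never classified.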
The ChEMBL database provides information on biologically active molecules with drug-like properties. The ChEMBL database is currently managed and maintained by the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL-EBI) in the UK, covering over 2.3 million compounds and 1900 activity analysis data. The data sources of the ChEMBL database include published literature, clinical data and other publicly available databases. The ChEMBL database provides users with information on the molecular characteristics, drug-likeness, drug action mechanisms and clinical data of compounds, as well as the metabolic mechanisms through which they take part in biological reactions. The ChEMBL database can thus provide users with basic information on a molecule and its reactions as a drug. The ChEMBL database can be seen in
The International Organization for the Management of Restricted Chemical Substances (AFIRM Group) has proposed substance specifications for the ingredients used in product packaging. The members of the organization include many well-known clothing brands such as Adidas and H&M. The members must test and report on the ingredients listed on the restricted substance list. The restricted substance list for packaging gives, for each restricted substance in each type of packaging, the limit values, uses, testing methods and the limit values that must be reported, and is provided to the member brands as a basis for production, testing and acceptance of test reports. In addition, a restricted substance list has also been proposed for the product itself, which specifies the ingredients that manufacturers may use in the production of clothing, shoes and socks, accessories, jewelry, sports equipment, wearable equipment, household textiles and the like. The limit values, uses, testing methods and reportable limit values of the restricted substances are likewise specified so that member brands can be monitored.
The Comprehensive Global Database of PFASs is maintained by the PFC group of the Organisation for Economic Co-operation and Development (OECD) and the United Nations Environment Programme (UNEP), and was established in 2012. Per/polyfluoroalkyl substances (PFAS) are a class of chemical substances that includes perfluorinated chemicals (PFCs) and perfluoroalkyl acids (PFAAs). These substances began to be widely used in the 1940s. Owing to characteristics such as water resistance, oil resistance and low friction, they are in most cases used as surface coatings for food packaging, containers, etc. However, since these substances are highly stable and difficult to decompose in the environment, their use has been monitored by the United States, the European Union and other countries in recent years. The Comprehensive Global Database of PFASs integrates control information on per/polyfluoroalkyl substances from 15 countries or territories such as the United States and the European Union, for example physical and chemical properties, uses, exposure methods, and potential health and environmental impacts, so as to provide comprehensive data for policy formulation and management. In addition, the database also includes relevant research findings and suggestions, including management policies and regulations, clearance methods and available alternatives in various countries, to help policy makers make decisions. The screening flowchart for the updated list of the Comprehensive Global Database of PFASs is seen in
SMILES is a string encoding system for describing the structures of chemicals, which can be converted into the three-dimensional structures of chemicals and is widely used in the field of computational chemistry. In the present disclosure, the identifiers used by other database systems, such as CAS numbers and the CID of the PubChem database system, are also collected and organized. Meanwhile, SMILES strings that are missing from some data sources are supplemented to provide comprehensive chemical information.
The CompTox database is a resource provided by the US EPA that extensively collects and organizes toxicological information on chemical substances. This database focuses on providing toxicological experimental data for various chemical substances, which is crucial for understanding their harmful characteristics and potential risks to human health and the environment. Toxicological experimental data are the results of studying the adverse effects of chemicals in living organisms through a series of experimental methods. These data cover multiple aspects, including acute toxicity, chronic toxicity, mutagenicity, teratogenicity, etc. These experiments can reveal the degree of damage that chemicals may cause to cells, tissues and the entire biological system. The toxicological experimental data in the CompTox database are not only numbers, but also include detailed descriptions of the experimental methods, the settings of the experimental parameters, the statistical analysis of the experimental results, etc. This enables researchers to better evaluate the reliability and applicability of the data and to ensure the correct understanding and application of the toxicological information of chemical substances. In addition, one of the advantages of the CompTox database is its integration. It not only integrates internal experimental data from the US Environmental Protection Agency, but also includes third-party data from different sources, which expands the scope and richness of the database. Therefore, in the present disclosure, toxicological experimental data on PFAS, with a total of 5120 records, were also supplemented from the CompTox database.
The Priority List of the European Commission is described as follows. To establish standards and priorities for listing chemicals on the list of suspected endocrine disruptors, the European Commission began a series of studies in 2000 to establish a Priority List through a coherent method, so as to further evaluate the role of these chemicals as endocrine disruptors. The Priority List is established in two stages. First, an independent review of the evidence of endocrine disrupting effects and of human or animal exposure is conducted, followed by consultation with stakeholders and scientific committee members to determine the priority of implementation matters. The European Commission analyzed 564 suspected endocrine disruptors published by various organizations, and ultimately classified 194 first class chemicals and 126 second class chemicals, which were compiled into a database archive. The review process of substances includes:
According to the GreenScreen® Method, after the chemical substance analysis reports from all source databases are downloaded, a program is used to convert the information into JSON format and store it for data exchange. Subsequently, a program calculates the GreenScreen® hazards based on a scoring method, evaluating and integrating the 18 hazard endpoints into a comprehensive risk evaluation that is stored for future queries.
The database design of the present disclosure can include 16 tables, including CAS numbers, chemical names, relationship tables between CAS numbers and chemical names, hazards, hazard evaluation, hazard evaluation and restriction list relationship indexes, restricted chemicals, toxicity, risk, risk evaluation, exposure pathways, restriction lists and a species summary table. These tables also include relevant information linking the restriction specifications of different countries and regions to the tables. The 16 tables consist of 14 primary tables and 2 auxiliary tables for storing many-to-many relationships. The auxiliary tables store the indexes between primary tables, namely the many-to-many relationships between CAS numbers and chemicals and between regulatory lists and hazard evaluations. A database model association diagram is shown in
The work database of the present disclosure is constructed by using a Docker container system, which makes the database easy to manage, maintain and transfer. The system adopts Docker Compose to cascade a database management system (MariaDB), a database website (Ruby on Rails), a caching system (Redis) and an automatic updating workflow system (Airflow). The open source Airflow workflow platform, led and maintained by the Apache Software Foundation, is adopted to write and automatically run the update workflows. Airflow is developed in Python, and is a commonly used framework in the field of data science and big data. Both Google and Amazon cloud computing provide Airflow services to cascade their own cloud services. Domestic companies such as Cathay Financial automatically update data through Airflow, so as to reduce maintenance manpower and time costs. The present disclosure can further include 15 database update programs, and an Airflow container is added to Docker Compose for the convenience of unified management, initiation of updates and storage in the database.
The supplemented chemical information includes SMILES information. In the present disclosure, SMILES is converted into PubChem molecular fingerprint groups by using PaDEL descriptor software and stored in the work database. In order to conveniently and quickly obtain chemical characteristic data from the work database, the inventor of the present disclosure developed a RESTful (Representational State Transfer) application programming interface (API), by which CAS numbers or PubChem CIDs can be used to search the hazard information of specific chemicals. The endpoint for obtaining the data is as follows: GET /chemicals/(parameter 1: casno or pubchem_cid)/(parameter 2: corresponding database code)/(parameter 3: all). Parameter 1 and parameter 2 are described as follows:
The format of the returned file is JSON (JavaScript Object Notation), which includes: hazard endpoints, data types (2 hazard evaluation methods), data sources, credibility, hazard evaluation levels or predicted values, annotations.
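As a minimal illustrative sketch, the endpoint and the JSON return format described above might be used as follows. It is assumed here that parameter 1 selects the identifier type and parameter 2 carries the CAS number or PubChem CID itself; the function names and the JSON field names `hazard_endpoint` and `level` are hypothetical and are not taken from the actual API.

```python
import json
from urllib.parse import quote

# Identifier types named in the endpoint description.
ALLOWED_ID_TYPES = {"casno", "pubchem_cid"}


def build_chemical_path(id_type, code, scope="all"):
    """Build the request path GET /chemicals/<id_type>/<code>/<scope>.

    Assumption: parameter 1 is the identifier type and parameter 2 is
    the identifier value (CAS number or PubChem CID).
    """
    if id_type not in ALLOWED_ID_TYPES:
        raise ValueError("id_type must be 'casno' or 'pubchem_cid'")
    return "/chemicals/{}/{}/{}".format(id_type, quote(str(code), safe=""), scope)


def summarize_response(body):
    """Parse a JSON response body into (hazard endpoint, level) pairs.

    The field names used here are hypothetical placeholders.
    """
    records = json.loads(body)
    return [(r["hazard_endpoint"], r["level"]) for r in records]
```

For example, `build_chemical_path("casno", "335-67-1")` yields the path `/chemicals/casno/335-67-1/all`, which a client would append to the base address of the work database website.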
The example shown in
In order to present the queried information to users concisely and clearly, a chemical hazard information interface with a toxicity data visualization portion is also designed in the present disclosure.
In the present disclosure, all data sources are integrated and stored in the work database on a per-list basis. With carcinogenicity as the hazard endpoint, the hazard evaluations in the data are transformed into hazard levels by using the GreenScreen method. The hazard evaluations of the data sources related to carcinogenicity are exported from the work database, high-credibility lists (such as lists provided by multinational organizations, the United States and the European Union) are retained, hazard evaluation information with unknown hazard levels is removed, and finally 3810 pieces of hazard information are obtained. Since the GreenScreen method includes a hazard translator level (list translator score) and a hazard evaluation level (Benchmark score), classifications are comprehensively labeled (high hazard H, medium hazard M, low hazard L) in the present disclosure based on the combination of the hazard translator level (level 1, possible level 1) and the hazard evaluation level (H, M, L). If the hazard translator level is “level 1”, the classification is labeled as high hazard H. If the hazard translator level is “possible level 1”, the classification is labeled as medium hazard M. The hazard evaluation levels H, M and L are transformed into the corresponding classifications, labeled as high hazard H, medium hazard M and low hazard L. Table 5 is a data field table according to an embodiment of the present disclosure.
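The labeling rule above can be sketched as a small function. This is only an illustration: the function name is hypothetical, and the order of precedence (checking the translator level before the evaluation level when both are present) is an assumption that the description does not state.

```python
def classify_hazard(translator_level=None, evaluation_level=None):
    """Map GreenScreen levels to a comprehensive label H, M or L.

    Assumption: the translator level is consulted first; the source
    text does not say which level wins when both are available.
    Returns None when neither level yields a known label.
    """
    if translator_level is not None:
        key = translator_level.strip().lower()
        if key == "level 1":
            return "H"  # "level 1" is labeled high hazard
        if key == "possible level 1":
            return "M"  # "possible level 1" is labeled medium hazard
    if evaluation_level is not None:
        key = evaluation_level.strip().upper()
        if key in {"H", "M", "L"}:
            return key  # evaluation levels map directly to H, M, L
    return None
```

Records for which the function returns `None` correspond to the entries with unknown hazard levels that are removed before training.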
According to the decision made by the Organisation for Economic Co-operation and Development (OECD) in November 2004, a QSAR model should provide the following five pieces of information:
The present disclosure adopts the principle provided by OECD to design a QSAR model and predict the classification label of the target hazard endpoint through the QSAR model. In the present disclosure, “carcinogenicity” is used as the target hazard endpoint. The construction method is described in detail below.
The prediction endpoint of the QSAR model described in the present disclosure is “carcinogenicity”. Chemicals that may have, or are confirmed to have, a carcinogenic risk are classified as positive; chemicals that do not have the relevant hazards are classified as negative.
The training data sources may comprise the following three sources:
Due to different sources of the training data, the training data need to be integrated. The integration flow is as shown in
In order to further improve the usability of the QSAR model described in the present disclosure, inorganic compounds and organometallic compounds are excluded. Therefore, after data cleansing is performed with the PubChem molecular fingerprints, the SMILES structural strings are analyzed and the chemicals containing metals are removed from the dataset applicable to this model. The excluded elements are: Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Ac, Th, Pa, U, Np, Pu, Am, Cm, Bk, Cf, Es, Fm, Md, No, Lr, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Rf, Db, Sg, Bh, Hs, Mt, Ds, Rg, Cn, Al, Ga, In, Tl, Sn, Pb, Bi, Si, Ge, As, Sb, Se, Te, Po, and At.
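A minimal sketch of such a SMILES-based exclusion is given below. It relies on the SMILES convention that every element outside the small organic subset (B, C, N, O, P, S, F, Cl, Br, I) must be written inside square brackets, so only bracket atoms are inspected. The function name and the regular expression are illustrative assumptions, not the program used in the disclosure.

```python
import re

# Excluded elements as enumerated in the description (thallium written as Tl).
EXCLUDED_ELEMENTS = set(
    "Sc Ti V Cr Mn Fe Co Ni Cu Zn Y Zr Nb Mo Tc Ru Rh Pd Ag Cd "
    "La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu "
    "Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr "
    "Hf Ta W Re Os Ir Pt Au Hg Rf Db Sg Bh Hs Mt Ds Rg Cn "
    "Al Ga In Tl Sn Pb Bi Si Ge As Sb Se Te Po At".split()
)

# Inside a bracket atom: optional isotope digits, then the element symbol.
# Aromatic symbols such as [se] or [as] are lowercase and handled separately.
_BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def has_excluded_element(smiles):
    """Return True if any bracket atom in the SMILES is an excluded element."""
    for symbol in _BRACKET_ATOM.findall(smiles):
        if symbol.capitalize() in EXCLUDED_ELEMENTS:
            return True
    return False


def keep_organic(smiles_list):
    """Keep only the SMILES strings that contain no excluded element."""
    return [s for s in smiles_list if not has_excluded_element(s)]
```

A cheminformatics toolkit that parses the SMILES into atoms would be more robust than a regular expression; the sketch only shows the filtering logic.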
Since the FooDB dataset is used as a negative dataset source, its data quantity is much greater than that of the positive international list dataset. To reduce the analysis error caused by data imbalance, the FooDB dataset is subjected to a similarity comparison with the international list dataset before the combination, and only the items in FooDB that are similar to those in the international list dataset are retained.
The similarity comparison method is described as follows:
After the similarity comparison, the datasets are combined, and overlapping chemical items are identified by comparing their InChI. If overlapping chemicals have conflicting labels, the label is taken in the following order of priority: international list dataset > DeepCare dataset > FooDB dataset. The reasons are as follows. The sources of the international list dataset are all carcinogenicity-related lists compiled by governments and other impartial agencies, for which many past experiments have confirmed the potential or actual carcinogenic hazards. The DeepCare dataset derives from the US Food and Drug Administration and is also credible. The FooDB data are established by detecting the characteristics of chemicals in food; the chemicals in FooDB are considered to be less carcinogenic and can fill in negative data, so the FooDB data are listed last in order.
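The priority rule for conflicting labels can be illustrated with the following sketch. The record layout `(inchi, source, label)` and the source keys are hypothetical names chosen for the example.

```python
# Lower rank means higher priority, following the order stated above.
PRIORITY = {"international_list": 0, "DeepCare": 1, "FooDB": 2}


def merge_by_priority(records):
    """Merge (inchi, source, label) records, one label per InChI.

    When the same InChI appears in several sources, the label of the
    highest-priority source is kept.
    """
    best = {}
    for inchi, source, label in records:
        rank = PRIORITY[source]
        if inchi not in best or rank < best[inchi][0]:
            best[inchi] = (rank, source, label)
    return {inchi: (source, label) for inchi, (_, source, label) in best.items()}
```

For instance, a chemical labeled negative by FooDB and positive by the international list dataset ends up with the positive label, regardless of the order in which the records are read.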
Finally, after the datasets are combined, duplicate chemical items are removed by using the PubChem molecular fingerprints to ensure that each molecular fingerprint corresponds to exactly one label.
After being preprocessed, the training data is divided into a training dataset and a validation dataset. The training dataset is used for defining the predictable applicability domain and for model training. The validation dataset is used for evaluating the precision of the model and for selecting the optimal model for prediction. The division into training and validation datasets is repeated 10 times with different random seeds to obtain the average performance of the model training.
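The repeated division can be sketched as follows. The 80/20 ratio and the way the seeds are derived are assumptions made for the example; the description only states that the division is repeated 10 times with different random seeds.

```python
import random


def repeated_splits(items, n_repeats=10, validation_fraction=0.2, base_seed=0):
    """Return n_repeats (training, validation) pairs, one per random seed.

    validation_fraction and base_seed are illustrative assumptions.
    """
    splits = []
    for i in range(n_repeats):
        rng = random.Random(base_seed + i)  # a different seed per repeat
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_val = max(1, round(len(shuffled) * validation_fraction))
        splits.append((shuffled[n_val:], shuffled[:n_val]))
    return splits


def average(values):
    """Average a list of per-split performance scores."""
    return sum(values) / len(values)
```

A model would be trained and scored once per pair, and the ten scores passed to `average` to obtain the reported mean performance.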
The applicability domain of the present disclosure must meet three conditions:
The model of the present disclosure can be trained and used for prediction with three methods: a support vector machine (SVM), a C5.0 decision tree model, and a random forest.
The SVM maps the data into a high-dimensional space through kernel functions and finds a hyperplane in that space that separates the two labels, “carcinogenic” and “non-carcinogenic”. The SVM is trained with two different kernel functions: a polynomial kernel (poly) and a radial basis function kernel (rbf). In the training process of the SVM, the hyperparameters C and γ (when the rbf kernel is used) need to be selected and adjusted, and the adjustment of the hyperparameters is therefore performed by using 10-fold cross validation in conjunction with random search. That is, the training dataset is divided into 10 equal folds; each fold in turn is used as the internal validation dataset while the other nine folds are used as training data, and the resulting performances are averaged to obtain the average performance under this group of hyperparameters. Then, another group of hyperparameters is generated at random. The 10-fold division and model training described above are repeated to find the optimal hyperparameter configuration.
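The procedure above can be sketched independently of any particular SVM library. In the sketch below, `train_and_score` is a hypothetical callable supplied by the caller that trains a model on the training indices with the given hyperparameters and returns its score on the validation indices; the log-uniform sampling ranges for C and γ and the number of trials are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def random_search_cv(n_samples, train_and_score, n_trials=20, k=10, seed=0):
    """Random search over (C, gamma), scored by k-fold cross validation.

    Returns the best hyperparameters and their average fold score.
    """
    rng = random.Random(seed)
    folds = k_fold_indices(n_samples, k, seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Assumed search space: log-uniform draws for both hyperparameters.
        params = {"C": 10 ** rng.uniform(-2, 3), "gamma": 10 ** rng.uniform(-4, 1)}
        scores = []
        for i, val_idx in enumerate(folds):
            # The other k-1 folds together form the training data.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(train_and_score(train_idx, val_idx, params))
        avg = sum(scores) / k
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```

The same loop applies to the decision tree and random forest by changing the sampled hyperparameters.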
The C5.0 decision tree model generates a tree from the training data, finding classification rules in the training data by splitting on the features, and the classification rules are then used for the prediction of new data. The C5.0 decision tree model also has hyperparameters to be selected: the maximum decision tree depth (max depth) and the minimum number of samples in a leaf node (min samples leaf). Similarly, 10-fold cross validation in combination with random search is used for the hyperparameter adjustment.
The random forest randomly samples the training data to generate multiple decision trees and performs prediction and classification by majority vote. The hyperparameters adjusted for the random forest are the max depth and the minimum number of samples required to split an internal node (min samples split). Similarly, 10-fold cross validation in combination with random search is used for the hyperparameter adjustment.
The model evaluation metrics are accuracy, sensitivity, specificity, precision, negative predictive value (npv), F1-score and Matthews correlation coefficient (mcc). The calculation formulas of the evaluation metrics are as follows:
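For reference, the standard definitions of these metrics in terms of the confusion matrix counts (true positives TP, false positives FP, true negatives TN, false negatives FN) can be computed as in the sketch below. Returning 0.0 when a denominator is zero is a convention assumed for the example.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 for a zero denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from confusion matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),      # true positive rate
        "specificity": _ratio(tn, tn + fp),      # true negative rate
        "precision": _ratio(tp, tp + fp),        # positive predictive value
        "npv": _ratio(tn, tn + fn),              # negative predictive value
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```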
The present disclosure can use the above flow for model training. After the feasibility of the flow is confirmed by an expert meeting, a model prototype for chemical characteristic structure prediction is established.
In one embodiment of the present disclosure, a trial operation is performed on high-tech industrial discharge wastewater samples and per/polyfluoroalkyl substances (PFAS) to establish a model prototype for characteristic structures of cosmetics. The model prototype is tested through a prediction operation flow on chemicals that at least comprise: PFOA (perfluorooctanoic acid, CAS: 335-67-1), PFOS (perfluorooctanesulfonic acid, CAS: 1763-23-1), and PFNA (perfluorononanoic acid, CAS: 375-95-1).
The present disclosure comprises establishing an instrument measurement and identification process cascaded with a chemical characteristic prediction operation process. The present disclosure comprises converting the instrument raw data into a public format by a utility program. In the present disclosure, Docker containerization and the Wine framework are used to execute the ProteoWizard MSConvert conversion software so as to convert the instrument measurement raw data into the public mzXML format. The execution method of the present disclosure comprises running the ProteoWizard Docker image through the Docker program, running the Wine program to launch the msconvert program, converting the raw file into an mzXML file, and outputting the mzXML file. The schematic diagram of the execution method of the present disclosure can be shown in
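As an illustrative sketch, the command line for such a conversion could be assembled as shown below. The image name is an assumption (the publicly distributed ProteoWizard image is used as a default), as are the mount points `/data` and `/out` and the function name; the `--mzXML` and `-o` options are standard msconvert options. The sketch only builds the argument list, which could then be passed to `subprocess.run`.

```python
from pathlib import PurePosixPath

# Assumed default image; replace with the image actually deployed.
DEFAULT_IMAGE = "chambm/pwiz-skyline-i-agree-to-the-vendor-licenses"


def build_msconvert_command(raw_file, out_dir, image=DEFAULT_IMAGE):
    """Build a `docker run ... wine msconvert ...` argument list.

    raw_file and out_dir are absolute POSIX paths on the host; they are
    mounted into the container as /data and /out (hypothetical names).
    """
    raw = PurePosixPath(raw_file)
    return [
        "docker", "run", "--rm",
        "-e", "WINEDEBUG=-all",             # silence Wine debug output
        "-v", "{}:/data".format(raw.parent),
        "-v", "{}:/out".format(PurePosixPath(out_dir)),
        image,
        "wine", "msconvert",                # Wine launches msconvert
        "/data/{}".format(raw.name),
        "--mzXML",                          # write the public mzXML format
        "-o", "/out",
    ]
```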
The present disclosure comprises calling XCMS to perform peak detection. The XCMS analysis software was developed by the Scripps Research Institute, a well-known research institution in the United States. It is one of the preferred analysis software packages in the field of small molecule and metabolite research, and has an application programming interface (API) that can be used to write programs for cascading. XCMS can analyze data from liquid chromatography mass spectrometers and gas chromatography mass spectrometers. In addition, it can help users to process, analyze and visualize massive mass spectrometry data. Preprocessing, peak detection, peak alignment, statistical analysis and other operations can be performed on mass spectrometry data by using XCMS. When XCMS is used to perform feature extraction, different methods can be selected, for example methods based on peak shape or on signal intensity. Then, the extracted features are subjected to peak alignment to eliminate the variability between different samples. Finally, analysis and visualization are performed using statistical methods.
In the present disclosure, the data measured by the instrument can be used for analysis. In the present disclosure, peak profiling is performed by using the centWave algorithm in XCMS. The centWave algorithm is suitable for data collected in centroid mode by a high resolution mass spectrometer. Another method in XCMS, the Matched Filter algorithm, is suitable for data collected in centroid mode or in profile mode by a low resolution mass spectrometer, as shown in Table 6.
The LC-MS/MS technology plays an important role in metabolomics, where it helps to identify metabolites in a non-targeted manner and thereby to gain a deeper understanding of the metabolic processes in organisms. The LC-MS/MS technology achieves the detection and identification of metabolites by combining liquid chromatography (LC) and mass spectrometry (MS) techniques. This technology includes two main steps: liquid chromatography separation and mass spectrometry analysis. In the liquid chromatography separation step, a mixed sample is passed through chromatographic columns so as to separate the different compounds. Next, the separated compounds are analyzed in a mass spectrometer. The mass spectrometry data collection method can adopt data-dependent acquisition (DDA): this method performs fragmentation analysis on the top N strongest ions (m/z values) in an MS1 scan, and then proceeds to the next N scans. This method generates clean MS2 fragment spectra in real time during the collection process; however, only a limited quantity of ions are fragmented, resulting in a low coverage rate of detected metabolites and poor quantification accuracy of compounds. XCMS can process large-scale LC-MS/MS data to provide identification and quantitative information of metabolites. By using XCMS, the present disclosure can detect metabolite features from the raw data and perform steps such as peak detection, peak alignment and qualitative analysis. Spectrum comparison is performed by downloading the MassBank of North America (MoNA) database and using the MetaboAnnotation package to calculate similarity. A spectrum with a similarity of 0.8 or higher is considered a successful match. Subsequently, the PubChem CID in the MoNA chemical annotation can be cascaded back to the work database to obtain the chemical characteristic information.
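The idea of scoring a measured MS2 spectrum against a library spectrum and applying the 0.8 threshold can be illustrated with a simplified cosine similarity on binned peaks. This sketch is a stand-in for explanation only: it is not the MetaboAnnotation implementation, and the bin width, the function names and the reading of the threshold as "greater than or equal to 0.8" are assumptions.

```python
import math


def binned(spectrum, bin_width=0.01):
    """Sum the intensities of (m/z, intensity) peaks into m/z bins."""
    bins = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine_similarity(a, b, bin_width=0.01):
    """Cosine similarity between two spectra after binning (0.0 to 1.0)."""
    va, vb = binned(a, bin_width), binned(b, bin_width)
    dot = sum(v * vb.get(k, 0.0) for k, v in va.items())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_match(a, b, threshold=0.8):
    """Treat a similarity at or above the threshold as a successful match."""
    return cosine_similarity(a, b) >= threshold
```

A matched library entry would then supply the PubChem CID used to look up the chemical in the work database.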
Table 7 is a table showing the quantity of spectra in each MS level.
Table 8 shows parameters for centWave algorithm setting and corresponding results.
Chromatographic peak detection is performed on the MS2 level data. The results are as follows:
##MSn data (Spectra) with 840 spectra in an MsBackendMzR backend:
In the STD_150 group, a total of 840 MS2 spectra are identified.
##MSn data (Spectra) with 3160 spectra in an MsBackendMzR backend:
In the T3 group, a total of 3160 MS2 spectra are identified.
##MSn data (Spectra) with 5267 spectra in a MsBackendMzR backend:
In the T4 group, a total of 5267 MS2 spectra are identified.
##MSn data (Spectra) with 3058 spectra in a MsBackendMzR backend:
In the T5 group, a total of 3058 MS2 spectra are identified.
Table 9 shows the results of the mass spectrometry comparison of each group using the MetaboAnnotation package.
The present disclosure comprises preprocessing and parameter adjustment. Before peak identification, the sample data is preprocessed, for example by denoising, smoothing and background correction, using functions such as noiseFilterGaussian( ) and fillPeaks( ) in XCMS. Multiple parameters need to be adjusted for the XCMS software to achieve optimal effects. In the present disclosure, multiple sets of instrument detection data are used to adjust the parameters. Parameter adjustment has been completed for sample analysis, and a total ion chromatogram covering a mass-to-charge ratio (m/z) range of from 77.00029 to 1154.89783 is as shown in
IPO (Isotopologue Parameter Optimization) is an R package for optimizing XCMS parameters, which helps researchers to better adjust the parameters of XCMS and thereby improve the analysis results of LC-MS/MS data. It takes effect in three aspects. Identification: by optimizing the parameters of XCMS, IPO supports the recognition and identification of metabolites in liquid chromatography tandem mass spectrometry data. Through the adjustment of the parameters, IPO can better calibrate the peak spectra and determine the starting and ending positions of peaks, thereby identifying metabolites more accurately. Parameter optimization: IPO automatically searches for an optimal XCMS parameter combination by applying an optimization algorithm, thereby achieving the optimal data processing effect. This saves researchers a lot of manual optimization time while ensuring the reliability and consistency of the analysis results. Overall optimization: IPO not only focuses on single XCMS parameters, but also considers the interactions between multiple relevant parameters, thereby achieving overall analysis and optimization. Through the automatic search for the optimal parameter combination, IPO helps researchers to save time while improving the efficiency of the analysis and the reliability of the results. With proper data preparation and IPO operation, the metabolite information in liquid chromatography tandem mass spectrometry data can be understood more comprehensively.
Table 10 is a table showing the optimal parameter settings in each group.
In the present disclosure, the high-concern or control lists of various countries and international organizations are integrated with several per/polyfluoroalkyl substance (PFAS) databases and lists, chemical information is integrated through chemical structure comparison, and application programming interfaces (APIs) are written for fast access to the characteristic information of specific chemicals. In the present disclosure, a PubChem molecular fingerprint group is established from the simplified molecular input line entry specification (SMILES) of the chemical structures, the above information is stored in the work database, and an easy-to-read chemical characteristic interface is also designed. In the present disclosure, a training dataset with 3810 pieces of information is organized based on the carcinogenicity hazard endpoint. In the present disclosure, the QSAR model is designed based on the principles provided by the OECD; the training dataset processing, the hazard endpoint, the algorithms and the definition of the applicability domain are planned; a toxicity prediction model prototype is then established after seeking expert opinions through an expert meeting; and the predicted chemical characteristics are input into the work database. For the instrument measurement and identification process, the conversion of the instrument raw data into a public format, the instrument signal analysis process and the parameter adjustment are completed in the present disclosure, and chemicals are actually detected and identified from test information.
For the sake of description, spatially relative terms such as “bottom”, “under”, “lower part”, “above”, “upper part”, “left” and “right” can be used herein to describe the relationship between one component or feature shown in the figure and another component or feature. In addition to the orientation shown in the figure, spatially relative terms are intended to cover different orientations of the device in use or operation. The device can be oriented in other ways (rotated by 90 degrees or in other orientations), and the spatially relative descriptive terms used herein can be interpreted accordingly. It should be understood that when a component is “connected” or “coupled” to another component, this component can be directly connected or coupled to the other component, or an intermediate component can be present.
As used in the present disclosure, the terms “approximately”, “essentially”, “generally” and “about” are used to describe and account for minor variations. When used in combination with events or situations, the terms can refer to situations where the events or situations occur precisely, as well as situations where the events or situations occur approximately. As used herein, for a given value or range, the term “about” usually means being within ±10%, ±5%, ±1%, or ±0.5% of the given value or range. A range can be indicated herein as from one endpoint to the other endpoint, or between two endpoints. Unless otherwise specified, all ranges disclosed in the present disclosure include the endpoints. The term “generally coplanar” can refer to two surfaces located within a few micrometers (μm) along the same plane, such as those located within 10 μm, 5 μm, 1 μm, or 0.5 μm along the same plane. When referring to values or characteristics that are generally the same, this term can refer to values within ±10%, ±5%, ±1%, or ±0.5% of an average value.
The above descriptions briefly illustrate several embodiments and detailed features of the present disclosure. The embodiments described in the present disclosure can be easily used as a basis for designing or modifying other programs and structures for achieving the same or similar objectives and/or obtaining the same or similar advantages introduced in the embodiments of the present disclosure. Such equivalent constructions do not depart from the spirit and scope of the present disclosure, and can be subject to various changes, substitutions, and modifications without departing from the spirit and scope of the present disclosure.
The present application claims priority from U.S. provisional patent application Ser. No. 63/598,148, filed Nov. 13, 2023, the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
63598148 | Nov 2023 | US |