The present disclosure relates generally to disease prediction and control, and relates particularly to a method for predicting the occurrence of a disease and controlling the disease with a predicted treatment for the disease.
Plant diseases caused by pathogens such as fungi affect the crops and soils of farms and are a constant problem for the agricultural industry. Fungal diseases may account for as many as two-thirds of all plant diseases. Frequently, chemical pesticides are applied, or the entire farmland is abandoned, to eliminate fungal diseases. Plant diseases thus lead to huge economic losses. Improving and applying pest and disease models to predict the occurrence of pathogen infection and to suggest proper control measures remains a challenge for the agricultural community, especially given the extent of climate change observed in recent years.
However, the rapid advance of information technology in recent decades provides innovative approaches to solving problems in different fields, including agriculture. Modeling of crop diseases and pests has been applied to develop support capabilities for scheduling scouting or pesticide applications. A better prediction of the time and probability of disease occurrence, or even of the type of pathogen and a suitable pesticide, is highly desirable.
On the other hand, while various types and large amounts of information related to crop management can now be collected and stored for later analysis, there is a need for efficient utilization of these vast data. Organizing and assessing such an amount of data and finding the best approach for analysis can be time-consuming, and there is a need for efficient data modeling that can process vast data in a short time to allow timely action in response to pathogenic threats.
The present disclosure provides a system for disease control of plants. The present disclosure also predicts the probability of a disease occurrence and recommends a suitable and effective control measure for an identified pathogen and/or crop. The present disclosure further provides an integrated database that includes related data for the prediction of the effective control measure for plant diseases. Also provided in the present disclosure is a system and method of examining weather conditions and crop management practices to model a risk of disease occurrence in a field over a specific time period, and to generate a prediction of the disease occurrence in the field. Further provided in the present disclosure is an indication to growers, landowners, crop advisors, and other responsible entities of a possible pathogen presence in a field under observation to enable one or more responsive management actions. Still further provided in the present disclosure is an advisory service with recommended management actions and other alerts and notifications to such growers, landowners, crop advisors and other responsible entities where there is a risk or prediction of pathogen presence in a field under observation.
The present disclosure provides a system for disease control, comprising: a plurality of sensors configured to detect environmental information; and a processor configured to build up a disease prediction model by collecting disease data and weather data, combining the disease data and the weather data to form combined data, processing the combined data by a machine training and testing process, and identifying a plurality of patterns of disease occurrence, wherein the disease prediction model is configured to calculate a probability of the disease occurrence according to the environmental information and the patterns. In one embodiment, the weather data collected by the disease prediction model includes at least one of observation time, pressure, temperature, dew point temperature, relative humidity, wind speed, wind direction, precipitation, sunshine duration, visibility, ultraviolet index, and cloud amount.
In one embodiment, the disease data of the disease prediction model includes a positive label and a negative label indicating the disease occurrence.
In one embodiment, the processor of the disease prediction model is configured further by extracting features from the disease data and the weather data, wherein the features are processed by the processor for the machine training and testing process.
In one embodiment, the machine training and testing process uses a Convolutional Neural Network (CNN).
In one embodiment, the present disclosure provides a system for disease control with the sensors configured to send the environmental information to the disease prediction model through an Internet of Things (IoT) technology. In another embodiment, the environmental information includes at least one of relative humidity, temperature, rainfall, and pressure. In yet another embodiment, the weather data are collected over a period of 5 days, 7 days, 10 days, 14 days, 18 days or 21 days. In an embodiment, the weather data are collected over a period of 14 days.
In one embodiment, the present disclosure also provides a system for disease control, wherein the processor is configured to build up the disease prediction model further by classifying the patterns into a negative output indicating that a disease will not occur or a positive output indicating that a disease will occur. In another embodiment, the disease prediction model is further configured to raise a warning according to the negative output or the positive output. In another embodiment, the processor is further configured to build up a spore generation model configured to calculate a spore generation rate based on the environmental information. In yet another embodiment, the spore generation model is based on relative humidity and temperature. In still yet another embodiment, the relative humidity and the temperature on which the spore generation model is based are treated as independent events.
In one embodiment, the present disclosure provides a system for disease control, wherein the processor is configured to provide the time of the disease occurrence through the disease prediction model and the spore generation model. In another embodiment, the disease prediction model or the spore generation model is configured to send the probability of the disease occurrence or the time of the disease occurrence to a spraying system through an Internet of Things (IoT) technology. In yet another embodiment, the spore generation rate is Botrytis cinerea's spore germination rate, Myceliophthora thermophila's spore germination rate, Aspergillus niger's spore germination rate, P. oryzae's spore germination rate, Diplodia corticola's spore germination rate, or Pseudocercospora's spore germination rate.
In one embodiment, the present disclosure also provides a system for disease control, wherein the processor further includes a peptide prediction model configured to predict a peptide with an antifungal function by a Scoring Card Method (SCM). In another embodiment, the peptide prediction model involves calculating a score for a peptide by determining the propensities of dipeptides that make up the peptide. In yet another embodiment, the peptide prediction model involves calculating a score for a peptide by analysis of sequence of the peptide. In another embodiment, the peptide prediction model is further configured to comprise a search system containing relationships of hosts, pathogens, and corresponding peptides. In a further embodiment, the system for disease control is connected to a spraying system configured to spray the peptide with the antifungal function on a field based on the probability of the disease occurrence.
Other embodiments and features of the present disclosure will become apparent from the following descriptions of the embodiments, taken together with the accompanying drawings, which illustrate, by way of example, the principles of the disclosure.
The present disclosure is a framework under which systems and methods for predicting occurrences of different diseases and providing treatments thereof are developed. The framework makes use of machine learning and big data analysis, and includes a peptide prediction model and a disease occurrence prediction model.
Under the framework provided by the present disclosure, the peptide prediction model comprises a database involving an SCM-based antifungal peptide prediction system and related data of the target diseases. The disease occurrence prediction model is built by CNN technology to predict the probability and the outbreak timing of diseases. The components of the framework are connected by IoT technology, and the system works on cloud computing of aggregated data.
The peptide prediction model allows a user to efficiently identify the target peptide for use as the control measure for a disease. To predict the target antifungal peptides for fungal diseases, an antifungal database is established with an antifungal prediction system, which evaluates and predicts potential antifungal peptides, and a search system containing relationships of hosts, pathogens, and corresponding peptides. The antifungal database therefore allows queries for the hosts, pathogens, and corresponding peptides according to the users' needs, and supports both new drug discovery and old drug repurposing for the antifungal peptides.
The present disclosure utilizes artificial intelligence to strengthen the power of large datasets with an antifungal peptide prediction system, which is based on the SCM configured with further optimization. The antifungal peptide prediction system of the present disclosure evaluates and predicts the antifungal characteristic of a peptide based only on the sequence analysis, and provides a method for peptide prediction with simplicity, interpretability, and acceptable accuracy.
The SCM is based on Support Vector Machine (SVM), and is a method known in the literature [1]. To predict and evaluate the antifungal property of a peptide, SCM is introduced into the peptide prediction model with the perspective of biological information for machine learning. The SCM used in the present peptide prediction model can predict not only the peptide function but also the important domains of the peptides. In the present peptide prediction model, the SCM includes at least two parts, i.e., the calculation of the dipeptide score and the intelligent genetic algorithm (IGA), which is based on genetic algorithms.
The peptide prediction model is implemented with datasets, scoring of peptides by analyzing dipeptides and weights, and Intelligent Genetic Algorithm (IGA), which are further described herein.
The datasets of the peptide prediction model of the present disclosure comprise positive data and negative data. The positive data are the peptides that have antifungal properties and can comprise peptides from the antifungal databases, such as CAMP, PhytAMP, or those known in the literature and published in the public domain, such as PubMed. The negative data are the peptides that do not have antifungal properties and can comprise peptides that are not annotated as antifungal in the protein and peptide databases, for example, UniProt. The training dataset and the test dataset are created by reducing the sequence identity of the positive data and negative data, and the data are divided into two portions so that each dataset has an equal amount of positive and negative data.
The “dipeptide” consists of two amino acids (AA) and is viewed as the smallest functional unit.
The initial weight value for each dipeptide is the ratio of the dipeptide appearing in the positive datasets minus the ratio appearing in the negative datasets. The weight value is then further optimized by IGA.
A selection method is used to select weights. Two weights are picked from among all weights: the one with the highest fitness value and one chosen by the selection method. The fitness value is calculated as a function of the correlation coefficient between the initial and optimized propensity scores and the AUC, i.e., the area under the ROC (Receiver Operating Characteristic) curve. An AUC closer to 1 indicates higher accuracy of the prediction model.
The peptide prediction model further comprises the IGA (intelligent genetic algorithm), in which crossover selection and optimization are implemented. Herein, crossover selection means that a pair of parameters of the two weights is randomly selected for exchange. The optimization is known in the art [2] and involves a method for large-parameter optimization in which the selection function is designed to reduce the number of different parameter sets.
The peptide prediction model is further configured to comprise a search system containing connections between related data in the peptide prediction. The related data can include hosts, pathogens and peptides. These related data are aggregated into a single antifungal database which provides an efficient search for potential peptides for a given host or a given pathogen. The antifungal database also allows cross-match between hosts and peptides or between pathogens and peptides, thereby realizing repurposing of a previously identified drug.
The disease occurrence model provides the daily probability of disease occurrence. In the disease occurrence model, the Convolutional Neural Network (CNN) method is used to capture weather patterns that are hard for humans to recognize. Furthermore, the disease occurrence model is coupled with a warning system and an auto-spraying system through the IoT technology to apply the predicted peptide from the peptide prediction model to the farms.
The disease occurrence model is implemented with datasets that include the past fungal disease data and the weather data, based on the CNN method with a softmax function, a model cost function and an optimizer. Further, the disease occurrence model is connected to a system of IoT, which is further described herein.
The disease control system of the present disclosure is based at least in part on the weather conditions that are shown to be related to the incidence of fungal disease. In the present disclosure, weather data representing four weather conditions, i.e., relative humidity, temperature, air pressure, and rainfall, are collected over a period of time. In one embodiment, the weather data are collected for the past 14 days. In one embodiment, a total of 11 features based on the collected weather data are used in the convolutional neural network (CNN) to calculate the daily probability of the disease occurrence. In another embodiment, in addition to the CNN, the spore germination rate is also calculated to provide a more accurate prediction of the time of spore germination. The components of the system, such as the sensors for collecting the data and the sprayers that apply the predicted peptide based on the predicted occurrence time, are connected over IoT.
The disease occurrence model comprises two different kinds of data, i.e., the fungal disease data and the weather data at the time when the fungal diseases occurred. The fungal disease data can be obtained from the government agency, and the weather data are collected from the Central Meteorological Bureau. Preprocessing of the data includes combining the fungal disease data and the weather data, and the fungal disease data that have no corresponding weather data are deleted. These data are then standardized for machine training and testing. A CNN is adopted to recognize the patterns of the weather data features automatically. The pattern of weather change favorable to the fungal diseases is recognized and captured by the CNN. In the disease occurrence model of the present disclosure, the CNN is used to identify the weather change that is suitable for the occurrence of fungal diseases at a specific time.
The disease occurrence model further comprises a max pooling layer in addition to the CNN. After the data pass through the CNN layers, the amount of data increases enormously, and the added max pooling layer helps to reduce the computational complexity of the model and to find the best tendency of the data.
The disease occurrence model further comprises a full connection layer that converts the max pooling output into a high-dimensional space and classifies it into two classes, i.e., negative (the disease did not occur) and positive (the disease occurred).
The disease occurrence model further comprises a softmax function to transform the output from the CNN into the disease occurrence probability. The network output before transformation can be hard for humans to interpret. The softmax function transforms the output into a disease occurrence probability that can be understood by both machines and humans.
In addition, cross-entropy is used to evaluate the differences between the predicted values and actual values in the training stage. Independent data used to test the disease occurrence model showed that the model delivers an accuracy of 83%.
The disease occurrence model further comprises spore germination modeling that predicts a spore germination rate, so that the occurrence of diseases is predicted more effectively and in a more timely manner. The spore germination modeling comprises fitting a linear equation for the spore germination rate based on the humidity and a cubic equation for the spore germination rate based on the temperature. A general spore germination model is thereby obtained by multiplying the two. Spore germination experiments are carried out to verify the model and determine the coefficients.
In the following descriptions of the present disclosure, reference is made to the exemplary embodiments illustrating the principles of the present disclosure and how it is practiced. Other embodiments will be utilized to practice the present disclosure, and structural and functional changes will be made thereto without departing from the scope of the present disclosure.
The positive (antifungal) dataset was obtained from online public databases such as CAMP, APD, and PhytAMP, in addition to new peptides collected in the local database, while the negative dataset (peptides without antifungal properties) was collected from UniProt, the public database for proteins and peptides, from entries with no associated antifungal or antimicrobial characteristics.
The collected datasets undergo pre-processing, including deleting the peptides that contain non-standard amino acids. Then, the peptides of the datasets are limited to lengths of 10 to 100 amino acids (AA) because antifungal peptides are typically between 10 and 100 amino acids long. Furthermore, the peptides are filtered to a sequence identity of no more than 25%. Then, equal amounts of negative data and positive data are selected. Afterwards, the positive and negative datasets are randomly distributed, and one-third of the data is used as an independent testing set.
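The residue and length filters described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the 25% sequence-identity filter is deliberately left out because it normally relies on an external clustering tool that the text does not name.

```python
# Minimal sketch of the residue and length filters (names are hypothetical).
# The 25% sequence-identity filter is NOT implemented here.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def keep_peptide(seq, min_len=10, max_len=100):
    """Return True if seq contains only standard residues and has an allowed length."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


def preprocess(peptides):
    """Normalize case and whitespace, then keep only peptides passing both filters."""
    cleaned = (seq.strip().upper() for seq in peptides)
    return [seq for seq in cleaned if keep_peptide(seq)]


if __name__ == "__main__":
    # "ACD" is too short and "ACDEFGHIKX" contains the non-standard letter X.
    print(preprocess(["ACDEFGHIKL", "ACD", "ACDEFGHIKX", "acdefghiklmn"]))
```

Balancing the classes and holding out one-third of the data for testing would follow this step.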
Briefly, for every peptide in the dataset, the dipeptide frequency is calculated. Then, an initial weight for each specific dipeptide is given through statistical methods. Multiplying the dipeptide frequency matrix by the weight matrix yields the peptide score. The higher the score of an evaluated peptide, the more likely it is to possess an antifungal function.
Each peptide will form a 400×1 matrix of dipeptide frequency because there are 20 types of amino acids (AA) which result in 400 types of dipeptide frequencies as shown by the formula below:
$$20\ \text{AA} \times 20\ \text{AA} = 400\ \text{dipeptides}$$
Each peptide will then get a score according to its peptide sequences by multiplying the dipeptide frequencies with the scorecard matrix of weight.
The calculated score of the peptide is compared with the threshold to predict whether it is an antifungal peptide or a non-antifungal peptide.
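The scoring step can be illustrated with a short sketch. It assumes each peptide is at least two residues long and contains only the 20 standard amino acids. The function names and the dictionary representation of the scorecard are illustrative choices, not taken from the disclosure.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered dipeptides.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequency(seq):
    """Return the 400-entry dipeptide frequency vector of one peptide as a dict."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1  # a peptide of length L contains L - 1 dipeptides
    return {dp: c / total for dp, c in counts.items()}


def peptide_score(seq, scorecard):
    """Dot product of the frequency vector with the scorecard weights."""
    freq = dipeptide_frequency(seq)
    return sum(freq[dp] * scorecard[dp] for dp in DIPEPTIDES)


def is_antifungal(seq, scorecard, threshold):
    """Predict 'antifungal' when the score exceeds the threshold."""
    return peptide_score(seq, scorecard) > threshold
```

With a scorecard in which every weight is the same value, every peptide scores exactly that value, which is a quick sanity check on the frequency normalization.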
The initial weights used in the scoring of the peptide are obtained by first determining P(ij), the dipeptide frequency of the positive dataset, and N(ij), the dipeptide frequency of the negative dataset, which are calculated by the equations below, where $n_{ij}$ and $L_p - 1$ represent the number of occurrences of the ij-th dipeptide and the sum of the lengths of all peptides each minus 1, respectively:
Then, each weight S(ij) is obtained by subtracting the frequency in the negative data, N(ij), from the frequency in the positive data, P(ij):
$$S_{ij} = P_{ij} - N_{ij}$$
Each weight thus obtained is normalized to [0, 1] and then multiplied by 1000:
The initial scoring card containing a set of dipeptide weights S′(ij) is thus obtained.
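A possible construction of the initial scoring card is sketched below. Two points are assumptions rather than statements of the disclosure. First, the frequency of a dipeptide in a dataset is taken as its occurrence count divided by the sum of (length minus 1) over all peptides. Second, the normalization to [0, 1] is taken to be min-max scaling. All names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dataset_frequency(peptides):
    """Pooled dipeptide frequency: n_ij divided by the sum of (L_p - 1)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for seq in peptides:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
        total += len(seq) - 1
    return {dp: c / total for dp, c in counts.items()}


def initial_scorecard(positives, negatives):
    """S_ij = P_ij - N_ij, min-max scaled (an assumption) to the range 0..1000."""
    p = dataset_frequency(positives)
    n = dataset_frequency(negatives)
    raw = {dp: p[dp] - n[dp] for dp in DIPEPTIDES}
    lo, hi = min(raw.values()), max(raw.values())
    # Assumes hi != lo, i.e. the two datasets differ in at least one dipeptide.
    return {dp: 1000 * (s - lo) / (hi - lo) for dp, s in raw.items()}
```

Under this scaling, a dipeptide that is equally frequent in both datasets lands somewhere between the extremes, while the most positive-enriched dipeptide receives 1000 and the most negative-enriched receives 0.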
IGA is then used to optimize the initial scoring card.
The fitness calculation is further described herein. First, a confusion matrix is calculated by comparing the predictions with the labels and categorizing the results into four classes: TP (True Positive), FP (False Positive), FN (False Negative), and TN (True Negative), as shown in
Then, the ROC curve is drawn taking TPR as the y-axis and FPR as x-axis, as shown in
In addition to the AUC value, the Pearson coefficient of the amino acids between the initial scoring card and the scoring card under test is also considered for fitness calculation. Different weights are given for each value, with 0.9 of the AUC value and 0.1 of the Pearson coefficients for the best training performance. Use of the Pearson coefficient in the model avoids overtraining.
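The weighted fitness can be sketched as follows. The AUC is computed here by pairwise comparison of positive and negative scores, which is one standard way to obtain the area under the ROC curve; the disclosure does not say how its AUC is computed. The inputs to `pearson` are assumed to be the amino-acid scores of the initial and candidate scoring cards. All names are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def fitness(auc_value, pearson_r, w_auc=0.9, w_r=0.1):
    """Weighted sum with the 0.9 / 0.1 weights stated in the text."""
    return w_auc * auc_value + w_r * pearson_r
```

The Pearson term rewards candidates that stay close to the statistically derived starting card, which is how it acts as a brake on overtraining.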
Then, to optimize the initial scoring card, advanced crossover is used to produce variation for machine learning. For each round, two weights are selected by a selection method. After the advanced crossover, which is an optimized form of the normal crossover, mutation is performed, and the new weights are put into the population.
Specifically, the selection method picks two weights from all weights: one with the highest fitness value, namely the highest AUC, which is probably the best weight, and another selected using the roulette method. The roulette method divides a wheel into areas, one for each weight of the score card, in proportion to its fitness. A weight with higher fitness gets a larger area (
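The parent selection described here can be sketched as below. The sketch assumes all fitness values are non-negative, which roulette selection requires. The function names and the optional `rng` argument are hypothetical conveniences for reproducibility.

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off


def select_parents(population, fitnesses, rng=random):
    """Return (fittest individual, roulette-selected individual)."""
    best_index = max(range(len(population)), key=fitnesses.__getitem__)
    return population[best_index], roulette_select(population, fitnesses, rng)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable. The second parent may coincide with the first, since the text does not say whether duplicates are excluded.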
After the parents are selected, IGA is used to optimize the crossover. IGA is based on the normal Genetic Algorithms (GA) where the crossover selection is the most important selection. After selecting two parents, crossover involves choosing a pair of parameters to exchange, and then the exchanged score card is returned into the new population (
To choose the best set of parameters for crossover, a target function shown as follows is maximized, where each x1, x2, x3 represents a pair of dipeptide frequencies under evaluation:
$$f(x_1, x_2, x_3) = 100x_1 - 10x_2 - x_3$$
For each x1, x2, x3, two candidates are chosen, just like the two parents in the crossover step. To maximize the function of IGA, the first step is to create an OA-array shown below:
$$\text{maximize } y(x_1, x_2, x_3) = 100x_1 - 10x_2 - x_3, \quad x_1 \in \{1, 2\},\ x_2 \in \{3, 4\},\ x_3 \in \{5, 6\}$$
To evaluate the value of x1, the key is to eliminate the effects of x2 and x3.
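One way to see how an orthogonal array separates the factors is the sketch below. It uses the standard two-level L4 array for three factors, which is an assumption, since the array itself is not reproduced in the text. Summing the objective over the rows that share a level of one factor balances the other two factors, so their effects cancel when the two sums are compared.

```python
# Standard L4(2^3) orthogonal array: each row gives level 0 or 1 for three factors.
# Every pair of columns contains each level combination exactly once.
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]


def y(x1, x2, x3):
    """Example objective from the text."""
    return 100 * x1 - 10 * x2 - x3


def best_levels(objective, candidates, array=L4):
    """Choose, per factor, the candidate level with the larger summed objective."""
    results = [
        objective(*(candidates[f][level] for f, level in enumerate(row)))
        for row in array
    ]
    chosen = []
    for factor, levels in enumerate(candidates):
        effect = [0.0, 0.0]
        for row, value in zip(array, results):
            effect[row[factor]] += value  # main effect of this factor level
        chosen.append(levels[0] if effect[0] >= effect[1] else levels[1])
    return chosen


candidates = [(1, 2), (3, 4), (5, 6)]
best = best_levels(y, candidates)
print(best, y(*best))  # [2, 3, 5] 165
```

Only four of the eight possible combinations are evaluated, yet the best one (x1=2, x2=3, x3=5) is recovered because the objective is additive in its factors.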
After the crossover, the new offspring undergoes a mutation section. In the mutation section, the program chooses a random number to determine whether to mutate or not. If the result is yes, it randomly chooses an allele of the offspring and sets a random number. The mutation section increases the randomness of the model.
Following the mutation section, the new offspring joins the population, and then the program sorts all scoring cards in the population according to their fitness values. After sorting the population, the last step is to filter out the scoring cards that rank outside the max_population number.
The program is terminated after 30 generations to avoid overtraining. When it reaches its end condition of 30 generations, it returns the final score card with the best fitness on the training data.
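The mutation and population-truncation steps could look like the following sketch. The mutation rate of 0.1 and the value range of 0 to 1000 are assumptions (the range follows the normalized weights described earlier); neither value is stated for this step. Function names are hypothetical.

```python
import random


def mutate(scorecard, rate=0.1, low=0.0, high=1000.0, rng=random):
    """With probability `rate`, set one randomly chosen weight to a random value."""
    child = dict(scorecard)  # copy so the parent card is left untouched
    if rng.random() < rate:
        key = rng.choice(sorted(child))
        child[key] = rng.uniform(low, high)
    return child


def add_and_truncate(population, offspring, fitness_fn, max_population):
    """Add the offspring, sort by fitness (best first), keep the top max_population."""
    ranked = sorted(population + [offspring], key=fitness_fn, reverse=True)
    return ranked[:max_population]
```

Repeating selection, crossover, mutation, and truncation for a fixed number of generations gives the overall loop.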
Following the steps described in Example 1, the final ROC curve and the result of test datasets with an antifungal peptide having sequence identity of 25% (AFP25) is shown in
The test accuracy, i.e., the overall performance of classifying positive data as positive and negative data as negative, is 76%. The sensitivity, i.e., the performance of classifying positive data as positive, is 77%. The specificity, i.e., the performance of classifying negative data as negative, is 76%. The suitable threshold value is 354, and a peptide with a score higher than this value is considered an antifungal peptide.
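For reference, the three metrics follow from the confusion-matrix counts as in the sketch below. The function name is illustrative, and the counts in the usage line are made-up numbers chosen only to exercise the code; they are not results from the disclosure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (true positive rate) and specificity from counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # positives classified as positive
        "specificity": tn / (tn + fp),   # negatives classified as negative
    }


# Illustrative counts only (not data from the disclosure).
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

The false positive rate used on the x-axis of an ROC curve is 1 minus the specificity.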
The score distributions of the positive datasets and negative datasets are shown in
For the five amino acids with the lowest scores (D, E, S, T, V), four of them are hydrophilic, while most of the hydrophilic amino acids have a higher score (average score=362.73>threshold: 350). Additionally, among the five highest-scoring amino acids, cysteine contains a thiol functional group that can form a disulfide bond, and lysine (K) and arginine (R) readily form hydrogen bonds.
To show the result of the scoring card, the peptides are visualized by color representation of the dipeptide score on its 3D structure. The region of a peptide with a higher dipeptide score is shaded darker. The region of a peptide with lower dipeptide score is represented with lighter shades. The important region of an antifungal peptide is thus identified by the dark-shaded area.
Therefore, the predicted active sites visualized in the 3D structure with scoring cards correspond to those reported in the literature, indicating that the SCM of the antifungal peptide prediction model indeed possesses the ability to correctly determine the antifungal active sites.
A model predicting the disease occurrence relating to the daily weather was established based on the neural network. There are two kinds of data used in the prediction system, i.e., the disease data collected from the government agency and the weather data from the Central Meteorological Bureau's website that correspond to the disease data. Then, the two data are combined, and the disease data that do not have the weather data to match with are deleted.
The final data then contains a weather feature and a label. The weather feature is a two-dimensional array having 14 days×11 features. The 11 features include relative humidity, rainfall and the maximum, minimum, average of the temperature and air pressure. The label contains two classes, negative (no disease occurrence) and positive (disease occurrence). The flow chart of data processing in the model is shown in
The weather condition affects the spore germination and the health of plants. This relationship between the weather condition and the disease occurrence is recognized by the Convolutional Neural Network (CNN) to catch the specific weather patterns that lead to disease occurrences.
$$f(x_i) = \max(x_i)$$
For example, if the input array is [2,5,1,7,0,4], the max pooling filter size is 2, and the filter step (the distance the filter moves) is 1, the first max pooling output is max(2,5)=5, the second output is max(5,1)=5, and so on. Because the max pooling output is a two-dimensional tensor, it is flattened into a one-dimensional tensor for the subsequent full connection layer.
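The worked example above can be reproduced with a few lines of plain Python. The function names are illustrative, and a real model would use a deep learning framework's pooling layer rather than this hand-written version.

```python
def max_pool_1d(values, size=2, stride=1):
    """Slide a window of `size` over `values` in steps of `stride`, keeping each max."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, stride)
    ]


def flatten(rows):
    """Flatten a two-dimensional list into a one-dimensional list."""
    return [value for row in rows for value in row]


print(max_pool_1d([2, 5, 1, 7, 0, 4]))  # [5, 5, 7, 7, 4]
print(flatten([[5, 5], [7, 7]]))        # [5, 5, 7, 7]
```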
After flattening the max pooling layer, the full connection layer is used to classify the max pooling result. The full connection layer is a basic neural network layer that can switch the max pooling layer output into the high dimensional space and then classify them into two classes, namely the negative (no disease occurrence) and positive (disease occurrence).
However, the network output is a number that is difficult for humans to understand and use, so that the softmax function is used to transform the number into the disease occurrence probability (
In the above formula, σ is the softmax function, Z is the final output of the network, K is the total number of outputs, and j is the index of the jth output.
Afterwards, cross-entropy is chosen as the network cost function because it performs well in mutually exclusive classification tasks. The formula used for the cross-entropy is as follows:
In the above formula, H is the cross-entropy function, y′i is the real label, and y is the network prediction output.
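As an illustration, the standard textbook definitions of softmax and cross-entropy are sketched below; they are assumed, not confirmed, to match the formulas referred to in the text. Subtracting the maximum before exponentiating is a common numerical-stability step and is an addition of this sketch.

```python
import math


def softmax(z):
    """Standard softmax: exp(z_j) / sum_k exp(z_k), computed stably."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(true_label, predicted):
    """Standard cross-entropy: -sum_i y'_i * log(y_i) for a one-hot true label."""
    return -sum(t * math.log(p) for t, p in zip(true_label, predicted) if t > 0)


probs = softmax([0.0, 0.0])            # two equal outputs -> [0.5, 0.5]
loss = cross_entropy([0, 1], probs)    # equals log(2)
print(probs, loss)
```

For the two-class case here, the second softmax output can be read directly as the predicted probability of disease occurrence.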
Parameters of the neural network are then optimized by an Adam optimizer, a commonly used method for optimizing neural networks.
After training, the model is tested with independent test data, with the result shown in
In addition to the daily disease occurrence prediction with the CNN method described in Example 4, a more accurate timing of the prediction is made by including the calculation of the spore germination rate, since the spore germination must happen before disease occurrence. The conditions that lead to spore germination are identified and confirmed by experiment.
Humidity and temperature are found to affect the spore germination the most. A general model for the spore germination rate as a function of temperature or humidity is built from data on several fungal species, so that the same functional form fits each of the species.
First, the spore germination data published in the literature are used to fit the functions. As a result, the spore germination rate based on temperature is fitted by a cubic equation, $f_1(x) = ax^3 + bx^2 + cx + d$, where x represents the temperature. The spore germination rate based on humidity is fitted by a linear equation, $f_2(x) = ax + b$, where x represents the humidity. The general spore germination rate is therefore $f_1(x) \times f_2(x)$.
Myceliophthora thermophila's spore germination rate based on temperature is shown in
$$y = 0.0004x^3 - 0.0132x^2 + 4.0447x - 24.746$$
Aspergillus niger's spore germination rate based on temperature is shown in
$$y = 0.0326x^3 - 4.0593x^2 + 160.2x - 1943.7$$
P. oryzae's spore germination rate based on temperature is shown in
$$y = 0.06x^3 - 4.389x^2 + 106.86x - 774.66$$
Diplodia corticola's spore germination rate based on temperature is shown in
$$y = -0.0041x^3 + 0.0169x^2 + 7.1531x - 32.071$$
Therefore, the spore germination rate based on temperature is fitted by a cubic equation:
$$f_1(x) = ax^3 + bx^2 + cx + d$$
For the spore germination rate based on humidity, Aspergillus niger's spore germination rate based on relative humidity is shown in
$$y = 352.38x - 254.14$$
Pseudocercospora's spore germination rate based on relative humidity is shown in
$$y = 0.1071x - 10.165$$
As a result, the spore germination rate based on the relative humidity is a linear equation:
$$f_2(x) = ax + b$$
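To illustrate how such a line can be fitted, the sketch below applies ordinary least squares in plain Python. The fitting method is an assumption, since the disclosure does not state which one was used. The sample points are synthetic: they are generated from the Aspergillus niger line quoted above, with humidity assumed to be expressed as a fraction, purely to show that the fit recovers the coefficients. They are not measured data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic points generated from y = 352.38x - 254.14 (not experimental data).
humidity = [0.75, 0.80, 0.85, 0.90, 0.95, 1.00]
rate = [352.38 * x - 254.14 for x in humidity]
a, b = fit_line(humidity, rate)
print(round(a, 2), round(b, 2))
```

The cubic temperature curve can be fitted in the same spirit with a polynomial least-squares routine.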
The two equations are treated as independent events and are multiplied to form the general fungal spore germination model:
$$f_1 \times f_2$$
Therefore, temperature and humidity conditions are the only factors needed to calculate the spore germination rate in the specified environment.
Experiments are then conducted to determine the coefficients for the general fungal spore germination model and to verify the model. The experiment was divided into two parts, as shown in
The Botrytis cinerea's spore germination rate based on temperature is thereby:
$$f_1 = -0.0625x^3 + 2.9974x^2 - 37.865x + 141.68$$
The Botrytis cinerea's spore germination rate based on relative humidity is thereby:
$$f_2 = 316.88x - 216.88$$
The result of the formula derived from the above is compared with the measured spore germination to verify that the temperature and relative humidity act as independent events on spore germination. The conditions are randomly chosen to be temperatures of 23 and 13 degrees Celsius and relative humidities of 97% and 80%.
Therefore, the final formula of the spore germination model is:
$$f_1 \times f_2 = \left[(-0.0625x_1^3 + 2.9974x_1^2 - 37.865x_1 + 141.68) \div 100\right] \times \left[(316.88x_2 - 216.88) \div 100\right]$$
In the formula, x1 is temperature, and x2 is relative humidity. According to the above experiment results, the temperature and relative humidity are proved to be independent events, and the formula for the model is quite precise.
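The final formula translates directly into code, as in the sketch below. Two points are assumptions. Relative humidity is taken as a fraction (0.97 for 97%), since the humidity term then equals 100% at a fraction of 1.0. Each factor is also clamped to the range 0 to 1 so that conditions outside the fitted range do not yield negative rates; the disclosure describes no clamping. Function names are hypothetical.

```python
def temperature_factor(temp_c):
    """Cubic temperature term of the Botrytis cinerea model, as a fraction."""
    pct = -0.0625 * temp_c**3 + 2.9974 * temp_c**2 - 37.865 * temp_c + 141.68
    return pct / 100


def humidity_factor(rh):
    """Linear humidity term; rh is assumed to be a fraction such as 0.97."""
    return (316.88 * rh - 216.88) / 100


def germination_rate(temp_c, rh):
    """Product of the two factors, each clamped to [0, 1] (clamping is an assumption)."""
    def clamp(v):
        return max(0.0, min(1.0, v))
    return clamp(temperature_factor(temp_c)) * clamp(humidity_factor(rh))


for temp_c, rh in [(23, 0.97), (23, 0.80), (13, 0.97), (13, 0.80)]:
    print(temp_c, rh, round(germination_rate(temp_c, rh), 3))
```

The four conditions printed are the temperature and humidity combinations named in the verification experiment.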
Sensors of the weather conditions such as those detecting temperature and humidity are connected over IoT and transfer the values to the processors of the prediction model. The daily probability of disease occurrence is calculated. If the calculated probability exceeds a certain value, which can be set by the user, the user is informed that the disease may occur, and advised to spray the predicted antifungal peptide. The user is allowed to decide whether to automatically spray.
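The decision step at the end of this flow can be sketched as follows. The default threshold of 0.5, the `auto_spray` flag, and the returned action names are hypothetical placeholders for the user-configurable settings the text describes.

```python
def advise(probability, threshold=0.5, auto_spray=False):
    """Map a daily disease probability to an action name.

    threshold and auto_spray stand in for user settings; the values are illustrative.
    """
    if probability <= threshold:
        return "no_action"
    # Above the threshold: spray automatically only if the user opted in.
    return "spray" if auto_spray else "notify_user"


print(advise(0.3))                    # no_action
print(advise(0.8))                    # notify_user
print(advise(0.8, auto_spray=True))   # spray
```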
While some of the embodiments of the present disclosure have been described in detail in the above, it is, however, possible for those of ordinary skill in the art to make various modifications and changes to the particular embodiments shown without substantially departing from the teaching and advantages of the present disclosure. Such modifications and changes are encompassed in the spirit and scope of the present disclosure as set forth in the appended claims.
The references listed below cited in the application are each incorporated by reference as if they were incorporated individually.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/MY2018/000033 | 10/29/2018 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62577764 | Oct 2017 | US |