METHOD OF ANALYZING INFLUENCE FACTOR FOR PREDICTING CARBON DIOXIDE CONCENTRATION OF ANY SPATIOTEMPORAL POSITION

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

Pursuant to 35 U.S.C. § 119 and the Paris Convention Treaty, this application claims foreign priority to Chinese Patent Application No. 202111524281.7 filed Dec. 14, 2021, the contents of which, including any intervening amendments thereto, are incorporated herein by reference. Inquiries from the public to applicants or assignees concerning this document or the related applications should be directed to: Matthias Scholl P.C., Attn.: Dr. Matthias Scholl Esq., 245 First Street, 18th Floor, Cambridge, Mass. 02142.

BACKGROUND

The disclosure relates to the field of remote-sensing monitoring of greenhouse gas, and more particularly to a machine-learning-based atmospheric carbon dioxide spatiotemporal distribution simulation, and a global sensitivity analysis method of an influence factor.

Carbon dioxide is a major greenhouse gas and a major contributor to global warming; therefore, accurate knowledge of the spatiotemporal distribution of carbon dioxide concentration and its changing trend is important for understanding and mitigating the greenhouse effect. Satellite observation can accurately provide ground carbon dioxide information with a given spatiotemporal resolution and a relatively long observation time series. However, because of clouds, aerosols, and the like, the satellite observation data may contain gaps in actual applications, which makes it difficult to accurately analyze the regional spatiotemporal distribution of carbon dioxide. A common method of obtaining the carbon dioxide concentration at any position of a region is interpolation, including spatial-domain interpolation and interpolation based on a physical model built from the time-series behavior of carbon dioxide; the spatial-domain interpolation has low accuracy, while the physical-model interpolation yields a very complex model with low computing efficiency. With the continuous development of machine learning algorithms, research has begun on applying various neural network and machine learning models to regional CO₂ simulation modeling so as to fill the spatiotemporal gaps in XCO2 data. In this way, a high-accuracy, wide-coverage carbon dioxide spatiotemporal distribution graph can be generated efficiently. However, existing machine-learning-based methods usually consider only the environmental factors or only the anthropogenic emission factors for modeling, although the carbon dioxide concentration is influenced by both. A method that considers both is therefore still needed.

Furthermore, regional carbon dioxide distribution is affected by many factors such as the natural environment and anthropogenic emission. The influence factors and the underlying processes are complex, and there are many relevant studies. However, most of them are qualitative analyses or correlation analyses between environmental factors and carbon dioxide concentration, and quantitative methods for evaluating the degree of influence of multiple factors are few. Thus, the contribution and influence of different environmental factors on atmospheric carbon dioxide concentration cannot yet be analyzed quantitatively.

SUMMARY

The disclosure aims to provide a machine-learning-based atmospheric carbon dioxide spatiotemporal distribution simulation, and a global sensitivity analysis method of an influence factor. In this way, simulation can be achieved for a region without carbon dioxide data of satellite observation to obtain a carbon dioxide spatiotemporal distribution mode of the entire region and the importance degree of an environmental factor affecting regional carbon dioxide distribution is quantized by using the global sensitivity analysis for the model.

To achieve the above purpose, the technical solution adopted by the disclosure provides a method of analyzing an influence factor for predicting a carbon dioxide concentration of any spatiotemporal position. Firstly, an atmospheric carbon dioxide spatiotemporal distribution simulation method is proposed. This simulation method constructs a simulation model simulating carbon dioxide concentration distribution of any position of a region based on machine learning algorithm in combination with carbon dioxide data of satellite observation and corresponding environmental factors; next, by use of a global sensitivity analysis method, quantitative evaluation on the importance of multiple influence factors for regional carbon dioxide distribution is achieved. The method comprises the following steps:

at step 1, in combination with regional environmental characteristics, classifying environmental factors affecting regional carbon dioxide distribution into, comprising but not limited to, a ground and vegetation coverage factor, a climate meteorological factor, and a combustion emission factor;

at step 2, in combination with satellite carbon dioxide observation data and environmental factors, using machine learning algorithm to construct a regional carbon dioxide spatiotemporal distribution simulation model and training the simulation model using a training dataset;

at step 3, for the constructed carbon dioxide spatiotemporal distribution simulation model, firstly using a test dataset to verify a model prediction accuracy, and then inputting environmental factor data without satellite observation into the trained carbon dioxide spatiotemporal distribution simulation model to obtain a predicted carbon dioxide concentration and finally obtaining a regional carbon dioxide concentration distribution graph;

at step 4, in combination with the constructed regional carbon dioxide spatiotemporal distribution simulation model and the global sensitivity analysis method, calculating a sensitivity of the carbon dioxide concentration for each input parameter, i.e. environmental factor;

at step 5, counting the sensitivities of the regional carbon dioxide concentration for different environmental factors obtained by the global sensitivity analysis method, and quantitatively analyzing the size of the sensitivity of each parameter to finally determine an influence degree of each environmental factor along with the regional carbon dioxide distribution.

Furthermore, in step 1, the environmental factor classification specifically comprises ground coverage type, vegetation coverage, climate type, precipitation, atmospheric temperature, wind velocity and direction, anthropogenic emission amount, and biomass combustion emission amount of a region.

The vegetation coverage is represented by normalized difference vegetation index (NDVI) data, which may be obtained from the L3 vegetation index product of the MODIS satellite; the anthropogenic emission statistics come from the high-resolution global anthropogenic emission dataset ODIAC; the biomass combustion data come from the Global Fire Emissions Database GFED4; the atmospheric temperature and precipitation data come from the Chinese 1 km-resolution monthly average atmospheric temperature and precipitation datasets provided by the National Tibetan Plateau Data Center; the ground coverage data come from the annual global land coverage dataset published by the European Space Agency; the climate type data come from the Köppen climate zoning dataset; and the wind velocity and direction data come from the ERA5 dataset.

Further, the machine learning algorithm used in step 2 is the eXtreme Gradient Boosting tree (XGBoost), which is a tree ensemble model based on gradient boosting, wherein the basic construction idea of the model is: first constructing an initial sub-tree to fit the data and obtain a corresponding fitting residue, and then constructing subsequent sub-trees based on the residue of the previous model until the model residue is less than a threshold, the final simulation result being the sum of all sub-tree results; the specific construction steps are as follows:

initially constructing a weak learner to obtain a residue corresponding to an initial model;

for each subsequent training iteration, based on the existing model, adding one weak learner to fit a residue of a previous model;

through continuous learning, fitting K weak learners to reduce the residue between a model prediction result and a true value until the residue is less than a threshold, and the model is terminated, where the final model prediction value is a result obtained by performing weighted summing using K base learners.
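By way of a non-limiting illustration, the residue-fitting loop described in the three steps above may be sketched as follows. This is a minimal sketch and not the XGBoost implementation: the one-split regression tree ("stump"), the single input feature, the function names `fit_stump` and `boost`, and the fixed weight `learning_rate` applied to every learner are all assumptions made here for brevity.

```python
def fit_stump(xs, residues):
    """Fit a one-split regression tree (stump) to the residues by least squares."""
    mean_all = sum(residues) / len(residues)
    best = None  # (squared error, threshold, left value, right value)
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residues) if x <= threshold]
        right = [r for x, r in zip(xs, residues) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    if best is None:  # all inputs identical: fall back to a constant learner
        return lambda x: mean_all
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def boost(xs, ys, max_learners=50, learning_rate=0.5, tol=1e-3):
    """Add weak learners one at a time, each fitted to the current residue."""
    learners = []
    preds = [0.0] * len(ys)
    for _ in range(max_learners):
        residues = [y - p for y, p in zip(ys, preds)]
        rmse = (sum(r * r for r in residues) / len(residues)) ** 0.5
        if rmse < tol:  # residue below the threshold: terminate
            break
        stump = fit_stump(xs, residues)
        learners.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        # Final prediction: weighted sum of the K fitted weak learners.
        return sum(learning_rate * f(x) for f in learners)

    return predict, len(learners)
```

Calling `boost(xs, ys)` returns the prediction function together with the number K of learners actually fitted, which is smaller than `max_learners` whenever the residue threshold is reached first.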

Further, the specific implementation of performing training using the training dataset in step 2 is as follows:

first performing preprocessing for the training dataset, comprising data cleaning, data encoding and data transformation, wherein the data cleaning comprises removal of missing value, abnormal value and noise, and the data transformation comprises normalization and dimension reduction;

wherein the data encoding is to encode non-numerical features and input into the model for training, that is, encode the environmental factors i.e. the ground vegetation type, the climate type and wind direction, by using one-hot encoding;

performing normalization processing for the data in the following formula:

$z_{q}^{'} = \frac{z_{q} - mean (z_{q})}{std (z_{q})}$

wherein mean(z_q) is a mean value of data of environmental factor z_q, and std(z_q) is a standard deviation of the data of the environmental factor z_q;

next, inputting the preprocessed training dataset into the XGBoost model and performing parameter adjustment and further optimization for the XGBoost model, and repeating iterations to obtain an optimal carbon dioxide spatiotemporal distribution simulation model.
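The encoding and normalization steps above may be illustrated with the following minimal sketch. The category names and sample values are hypothetical, the helper names `one_hot` and `z_score` are introduced here, and the use of the population standard deviation is an assumption, since the disclosure does not state which form of std(z_q) is used.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode a non-numerical feature as one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def z_score(values):
    """Apply z' = (z - mean(z)) / std(z) to one numerical feature."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mu) / sigma for v in values]


# Hypothetical samples: a ground coverage type and a temperature in deg C.
categories, encoded = one_hot(["forest", "cropland", "forest"])
normalized = z_score([10.0, 12.0, 14.0])
```

After normalization each numerical factor has zero mean and unit standard deviation, so factors measured in different units enter the model on a comparable scale.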

Further, the base learner of the XGBoost model is CART tree, and for a dataset with m features of n samples D=(x_i,y_i) (|D|=n,x_i∈R^m,y_i∈R), the final prediction value obtained by training is expressed below:

${\hat{y}}_{i} = φ (x_{i}) = \sum_{k = 1}^{K} f_{k} (x_{i})$

wherein K is the number of base learners, x_i is the i-th sample, y_i is the label corresponding to the i-th sample, and f_k(⋅) is the model of the k-th tree, wherein the k-th tree is described by a leaf-node mapping q and a corresponding weight part ω, i.e.:

$f_{k} (x_{i}) = ω_{q (x_{i})}$

wherein ω_{q(x_i)} is the weight of the leaf node q where the sample x_i is located, and q(x_i) is the position of the leaf node where the sample x_i is located; that is, for any one sample x_i, the weight at its leaf node is ω_{q(x_i)};

for each iteration, the model fits the previous predicted residue and therefore, when a t-th base learner is generated, the prediction model is expressed as:

${\hat{y}}_{i}^{(t)} = {\hat{y}}_{i}^{(t - 1)} + f_{k}^{(t)} (x_{i})$

a target function is expressed as:

$L^{(t)} = \sum_{i = 1}^{n} l (y_{i}, ({\hat{y}}_{i}^{(t - 1)} + f_{k}^{(t)} (x_{i}))) + Ω (f_{k}^{(t)})$

wherein the target function is composed of two parts: in the first part, the function l(⋅,⋅) describes the difference between the true value and the fitted value, which is calculated based on Euclidean distance; the second part is a regularization term Ω(f_k^(t)) for preventing overfitting, i.e.

$Ω (f_{k}^{(t)}) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} ω_{j}^{2}$

which limits the complexity of each tree, wherein T is the number of leaf nodes of the CART tree, γ and λ are hyperparameters that respectively penalize the number of leaf nodes and the magnitude of the leaf weights in the regularization term, and ω_j is the weight value of the j-th leaf node; to minimize the target function, XGBoost performs a second-order Taylor expansion of the target function, which is approximately expressed as:

$L^{(t)} ≅ \sum_{i = 1}^{n} [l (y_{i}, {\hat{y}}_{i}^{(t - 1)}) + g_{i} f_{k}^{(t)} (x_{i}) + \frac{1}{2} h_{i} f_{k}^{{(t)}^{2}} (x_{i})] + γ T + \frac{1}{2} λ \sum_{j = 1}^{T} ω_{j}^{2}$

wherein g_i is the first-order derivative, defined as

$g_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})$

and h_i is the second-order derivative

$h_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}}^{2} l (y_{i}, {\hat{y}}_{i}^{(t - 1)}),$

and the following result is obtained by substituting into the target function:

$L^{(t)} ≅ \sum_{j = 1}^{T} [(\sum_{i ∈ I_{j}} g_{i}) ω_{j} + \frac{1}{2} (\sum_{i ∈ I_{j}} h_{i} + λ) ω_{j}^{2}] + γ T$

wherein I_j is the set of samples located at the j-th leaf node and the constant term l(y_i, ŷ_i^(t−1)) is omitted; each iteration minimizes the target function to obtain the optimal leaf nodes of the t-th base learner and the optimal weight ω_j corresponding to each leaf node j.
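As a brief worked step, offered as an illustration rather than as part of the claimed method: writing G_j and H_j (symbols introduced here) for the sums of g_i and h_i over the samples located at the j-th leaf node, the approximated target function is quadratic in each ω_j, so setting its derivative with respect to ω_j to zero gives, provided H_j + λ > 0,

$ω_{j}^{*} = - \frac{G_{j}}{H_{j} + λ}$

and substituting this value back yields the minimum of the target function for a fixed tree structure,

$L^{(t) *} = - \frac{1}{2} \sum_{j = 1}^{T} \frac{G_{j}^{2}}{H_{j} + λ} + γ T$

which may be used to compare candidate tree structures, a smaller value indicating a better structure.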

Further, the global sensitivity analysis method used in step 4 is Sobol method, the sensitivity of which is calculated by decomposing an output total variance into a sum of a variance of each parameter and a variance of mutual interaction of parameters, and then performing sensitivity grading calculation based on a ratio of a contribution of the parameter to the output variance;

for each environmental factor, a change range and a probability distribution are calculated and then a corresponding sensitivity index is calculated in combination with the regional carbon dioxide spatiotemporal distribution simulation model;

the regional carbon dioxide spatiotemporal distribution simulation model is expressed as: y=f(x₁′,x₂′, . . . , x_p′), wherein f is a trained XGBoost model, x₁′,x₂′, . . . , x_p′ are environmental factors affecting carbon dioxide distribution and are input parameters of the XGBoost model; the total variance of the XGBoost model is:

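$D = \int f^{2} (x') dx' - f_{0}^{2}$

wherein f₀ is the constant term of the model, i.e. the mean value of the model output over the input space, and a partial variance of the XGBoost model is:

$D_{π_{1}, π_{2}, \dots, π_{s}} = \int \dots \int f_{π_{1}, π_{2}, \dots, π_{s}}^{2} (x_{π_{1}}', x_{π_{2}}', \dots, x_{π_{s}}') dx_{π_{1}}' dx_{π_{2}}' \dots dx_{π_{s}}'$

wherein f_{π₁,π₂, . . . ,π_s} is the component of f that depends only on the factors x_π₁′, x_π₂′, . . . , x_π_s′, 1≤π₁< . . . <π_s≤p, and s=1, 2, . . . , p, and the sensitivity S_{π₁,π₂, . . . ,π_s} of each environmental factor is: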

$S_{π_{1}, π_{2}, \dots, π_{s}} = \frac{D_{π_{1}, π_{2}, \dots, π_{s}}}{D}$

wherein S_π₁ is the first-order sensitivity index of the environmental factor x_π₁′, which represents the influence of that parameter alone on the model output, and S_{π₁,π₂, . . . ,π_s} is the s-order sensitivity index of the environmental factors x_π₁′, x_π₂′, . . . , x_π_s′, which represents the joint influence of the s parameters on the model;

further, a total sensitivity index of each environmental factor is obtained, and the total sensitivity index TS_π₁ of the environmental factor x_π₁′ is defined as the sum of all sensitivity indexes that involve that factor:

$TS_{π_{1}} = S_{π_{1}} + S_{π_{1}, π_{2}} + \dots + S_{π_{1}, π_{2}, \dots, π_{s}}$

the total sensitivity index of each environmental factor obtained by Sobol method is used to evaluate the final sensitivity of the influence factors affecting the regional carbon dioxide distribution, achieving quantitative influence degree analysis.
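The disclosure does not state how the variance integrals are evaluated numerically. One common choice, shown here only as a hedged sketch, is Monte Carlo estimation with two independent sample matrices (the Saltelli estimator for first-order indexes and the Jansen estimator for total indexes). The sketch assumes that all inputs are independent and uniformly distributed on [0, 1]; the function name `sobol_indices` and the toy model used in the usage note are hypothetical stand-ins for the trained XGBoost model.

```python
import random


def sobol_indices(model, p, n_samples=20000, seed=0):
    """Estimate first-order and total Sobol indexes of model(x), x in [0, 1]^p."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(p)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(p)] for _ in range(n_samples)]
    fa = [model(row) for row in a]
    fb = [model(row) for row in b]
    both = fa + fb
    f0 = sum(both) / len(both)                            # mean model output
    var = sum((v - f0) ** 2 for v in both) / len(both)    # total variance D
    first, total = [], []
    for i in range(p):
        s_acc = 0.0
        t_acc = 0.0
        for j in range(n_samples):
            mixed = a[j][:]        # row j of matrix A ...
            mixed[i] = b[j][i]     # ... with factor i taken from matrix B
            f_mixed = model(mixed)
            # Saltelli estimator; centering fb by f0 only reduces noise.
            s_acc += (fb[j] - f0) * (f_mixed - fa[j])
            # Jansen estimator of the total-effect variance.
            t_acc += (fa[j] - f_mixed) ** 2
        first.append(s_acc / n_samples / var)
        total.append(t_acc / (2 * n_samples) / var)
    return first, total
```

For the toy model `lambda x: x[0] + x[1] * x[2]` the exact values are S = (12/19, 3/19, 3/19) and TS = (12/19, 4/19, 4/19), so `sobol_indices(lambda x: x[0] + x[1] * x[2], p=3)` should return estimates close to these; the gap between TS and S for the second and third inputs reflects their interaction.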

Compared with the prior art, the disclosure has the following advantages and beneficial effects.

In the disclosure, when retrieval is performed for a regional carbon dioxide distribution, a machine learning model is built by making full consideration for all ground environments, climatic meteorology and anthropogenic combustion emission factors relating to carbon dioxide concentration, thus achieving more accurate and quicker prediction for the regional carbon dioxide concentration distribution. Further, based on the built machine learning model, in a case of consideration of interactive effect, the sensitivity of each factor affecting the regional CO2 growth is quantitatively evaluated so as to provide scientific guidance for formulation of carbon emission policy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart illustrating a general method of an embodiment of the disclosure;

FIG. 2A is a regional carbon dioxide distribution graph of satellite carbon dioxide observation data, and FIG. 2B is a regional carbon dioxide distribution graph obtained by modeling retrieval according to an embodiment of the disclosure; and

FIGS. 3A-3B are sector graphs of sensitivity indexes of influence factors according to an embodiment of the disclosure.

DETAILED DESCRIPTION

To describe the technical solution and technical advantages of the disclosure in more details, the disclosure will be fully described below in combination with specific embodiments and accompanying drawings.

As shown in FIG. 1, the disclosure provides a method of analyzing an influence factor for predicting a carbon dioxide concentration of any spatiotemporal position. The method generally comprises two parts. In a first part, regional carbon dioxide simulation modeling is performed based on machine learning algorithm to achieve simulation for a region without satellite observation carbon dioxide data so as to obtain a carbon dioxide spatiotemporal distribution mode of the entire region; in a second part, according to a trained regional carbon dioxide spatiotemporal distribution simulation model, in combination with a global sensitivity analysis method, an importance degree of an environmental factor affecting regional carbon dioxide distribution is quantized. The specific implementation process is described below.

1. The specific steps of the regional carbon dioxide simulation modeling method based on machine learning algorithm are described below.

At step 1, environmental factor data affecting regional carbon dioxide distribution are collected, including but not limited to ground coverage type, vegetation coverage, climate type, precipitation, atmospheric temperature, wind velocity and direction, anthropogenic emission amount statistic data, and biomass combustion emission amount of a region, and then matched with the satellite observation carbon dioxide data to obtain training and verification datasets of a machine learning model.

At step 2, a machine learning algorithm is selected to construct a carbon dioxide distribution simulation model and the model is trained in combination with environmental factors and the satellite carbon dioxide training dataset.

The specific steps of performing training are as follows: preprocessing the training dataset, comprising data cleaning (removal of missing value, abnormal value and noise and the like), data encoding and data transformation (normalization and dimension reduction and the like) and so on.

For missing values in the dataset, when only a few values are missing, the affected samples are deleted.

For the processing for abnormal value and noise, the noise is firstly detected by statistic characteristics of data or clustering method, and then the data is “smoothed” by using a method such as binning, clustering, regression, and combination of computer check and manual check to remove the abnormal values and noise in the data.
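As a hedged illustration of the cleaning step, the sketch below deletes samples that contain missing values and then removes abnormal values detected by a simple statistical rule. The interquartile-range rule with a factor of 1.5 is an assumption chosen here as one example of "statistic characteristics of data"; the disclosure does not prescribe a particular statistic, and the function name `clean_records` and the field names in the usage note are hypothetical.

```python
from statistics import quantiles


def clean_records(records, key, whisker=1.5):
    """Drop records with missing fields, then drop outliers of records[key]."""
    # Step 1: delete samples that contain any missing value.
    complete = [r for r in records if all(v is not None for v in r.values())]
    # Step 2: flag abnormal values with the interquartile-range rule (assumption).
    q1, _, q3 = quantiles([r[key] for r in complete], n=4)
    spread = q3 - q1
    low, high = q1 - whisker * spread, q3 + whisker * spread
    return [r for r in complete if low <= r[key] <= high]
```

For example, given records such as `{"xco2": 410, "temp": 15.0}`, a record whose `temp` is `None` is deleted in the first step, and a record with an implausible `xco2` of 480 among values near 410 is removed in the second step.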

The data encoding is mainly to encode the non-numerical features and input them into the model for training. In this experiment, it is mainly required to encode the environmental factors such as ground coverage type, climate type and wind direction, by using one-hot encoding.

Data preprocessing also requires normalization processing for the data in the following formula:

$z_{q}^{'} = \frac{z_{q} - mean (z_{q})}{std (z_{q})}$

where mean(z_q) is a mean value of data of environmental factor z_q, and std(z_q) is a standard deviation of the data of the environmental factor z_q.

Furthermore, the machine learning algorithm used in step 2 is the eXtreme Gradient Boosting tree (XGBoost), which is a tree ensemble model based on gradient boosting, wherein the basic construction idea of the model is: first constructing an initial sub-tree to fit the data and obtain a corresponding fitting residue, and then constructing subsequent sub-trees based on the residue of the previous model until the model residue is less than a threshold, the final simulation result being the sum of all sub-tree results; the specific construction steps are as follows:

initially constructing a weak learner to obtain a residue corresponding to an initial model;

for each subsequent training iteration, based on the existing model, adding one weak learner to fit a residue of a previous model;

through continuous learning, fitting K weak learners to reduce the residue between a model prediction result and a true value until the residue is less than a threshold, and the model is terminated where the final model prediction value is a result obtained by performing weighted summing using K base learners.

Furthermore, the base learner of the XGBoost model is CART tree, and for a dataset with m features of n samples D=(x_i,y_i)(|D|=n,x_i∈R^m,y_i∈R), the final prediction value obtained by training is expressed below:

${\hat{y}}_{i} = φ (x_{i}) = \sum_{k = 1}^{K} f_{k} (x_{i})$

$f_{k} (x_{i}) = ω_{q (x_{i})}$

wherein ω_{q(x_i)} is the weight of the leaf node q where the sample x_i is located, and q(x_i) is the position of the leaf node where the sample x_i is located; that is, for any one sample x_i, the weight at its leaf node is ω_{q(x_i)};

for each iteration, the model fits the previous predicted residue and therefore, when a t-th base learner is generated, the prediction model is expressed as:

${\hat{y}}_{i}^{(t)} = {\hat{y}}_{i}^{(t - 1)} + f_{k}^{(t)} (x_{i})$

a target function is expressed as:

$L^{(t)} = \sum_{i = 1}^{n} l (y_{i}, ({\hat{y}}_{i}^{(t - 1)} + f_{k}^{(t)} (x_{i}))) + Ω (f_{k}^{(t)})$

$Ω (f_{k}^{(t)}) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} ω_{j}^{2}$

$L^{(t)} ≅ \sum_{i = 1}^{n} [l (y_{i}, {\hat{y}}_{i}^{(t - 1)}) + g_{i} f_{k}^{(t)} (x_{i}) + \frac{1}{2} h_{i} f_{k}^{{(t)}^{2}} (x_{i})] + γ T + \frac{1}{2} λ \sum_{j = 1}^{T} ω_{j}^{2}$

wherein g_i is the first-order derivative, defined as

$g_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})$

and h_i is the second-order derivative

$h_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}}^{2} l (y_{i}, {\hat{y}}_{i}^{(t - 1)}),$

and the following result is obtained by substituting into the target function:

$L^{(t)} ≅ \sum_{j = 1}^{T} [(\sum_{i ∈ I_{j}} g_{i}) ω_{j} + \frac{1}{2} (\sum_{i ∈ I_{j}} h_{i} + λ) ω_{j}^{2}] + γ T$

wherein I_j is the set of samples located at the j-th leaf node and the constant term l(y_i, ŷ_i^(t−1)) is omitted. Each iteration minimizes the target function to obtain the optimal leaf nodes of the t-th base learner and the optimal weight ω_j corresponding to each leaf node j.
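To make the roles of g_i, h_i, λ and γ concrete, the following sketch evaluates a candidate split numerically. It assumes the squared-error loss l = ½(y − ŷ)², for which g_i = ŷ_i − y_i and h_i = 1, and it uses the closed-form leaf weight −G/(H + λ) that minimizes the quadratic target function, G and H being the sums of g_i and h_i in a leaf. The function names and the sample concentrations in the usage note are hypothetical.

```python
def gradient_stats(y_true, y_pred):
    """g_i and h_i for the squared-error loss l = 0.5 * (y - y_hat) ** 2."""
    grads = [p - y for y, p in zip(y_true, y_pred)]
    hess = [1.0] * len(y_true)
    return grads, hess


def leaf_weight(grads, hess, lam):
    """Weight that minimizes G * w + 0.5 * (H + lam) * w ** 2 for one leaf."""
    return -sum(grads) / (sum(hess) + lam)


def structure_score(grads, hess, lam):
    """G ** 2 / (H + lam): larger means the leaf lowers the target more."""
    return sum(grads) ** 2 / (sum(hess) + lam)


def split_gain(grads, hess, left_idx, right_idx, lam, gamma):
    """Reduction of the target function when one leaf is split into two."""
    def pick(values, idx):
        return [values[i] for i in idx]

    parent_idx = list(left_idx) + list(right_idx)
    left = structure_score(pick(grads, left_idx), pick(hess, left_idx), lam)
    right = structure_score(pick(grads, right_idx), pick(hess, right_idx), lam)
    parent = structure_score(pick(grads, parent_idx), pick(hess, parent_idx), lam)
    return 0.5 * (left + right - parent) - gamma
```

With hypothetical observations of 400, 402, 410 and 412 ppm and a previous prediction of 405 ppm for every sample, splitting the first two samples from the last two gives leaf weights of −8/3 and 4 for λ = 1, i.e. the new tree pulls the low group down and the high group up; a split is worth keeping only when its gain exceeds zero after the γ penalty.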

The preprocessed training dataset is input into the XGBoost model and parameter adjustment and further optimization are performed for the XGBoost model, and iterations are repeated to obtain an optimal carbon dioxide distribution simulation model.

At step 3, for the constructed carbon dioxide distribution simulation model, a test dataset is firstly used to verify a model prediction accuracy, and then environmental factor data without satellite observation is input into the trained carbon dioxide distribution simulation model to obtain a predicted carbon dioxide concentration and finally, a regional carbon dioxide concentration spatiotemporal distribution is obtained.
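The gap-filling part of step 3 may be sketched as below. The grid-cell keys, the function name `fill_gaps`, and the `predict` callable standing in for the trained model are hypothetical; keeping the satellite value wherever one exists, rather than replacing it with a prediction, is likewise an assumption of this sketch.

```python
def fill_gaps(observed, factors, predict):
    """Build a complete concentration map for every cell that has factor data.

    observed: dict mapping cell -> satellite CO2 value (cells may be absent).
    factors:  dict mapping cell -> preprocessed environmental factor vector.
    predict:  callable returning a CO2 concentration for one factor vector.
    """
    filled = {}
    for cell, feats in factors.items():
        value = observed.get(cell)
        # Use the observation when available, otherwise the model prediction.
        filled[cell] = value if value is not None else predict(feats)
    return filled
```

The resulting dictionary covers every cell of the region for which environmental factor data exist, which is what allows a continuous distribution graph to be drawn.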

2. According to the above trained regional carbon dioxide spatiotemporal distribution simulation model and the global sensitivity analysis method, the importance of the influence factors is quantitatively analyzed, comprising the following steps.

At step 4, in combination with the constructed regional carbon dioxide spatiotemporal distribution simulation model and the global sensitivity analysis method, a sensitivity of the carbon dioxide distribution for each environmental factor is calculated.

At step 5, the sensitivities of the regional carbon dioxide concentration for different environmental factors obtained by the global sensitivity analysis method are counted, and the size of the sensitivity of each parameter is quantitatively analyzed to finally determine an influence degree of each environmental factor along with the regional carbon dioxide distribution.

The global sensitivity analysis method used in step 4 is Sobol method which is performed in the following step:

The regional carbon dioxide spatiotemporal distribution simulation model is expressed as: y=f(x₁′,x₂′, . . . , x_p′), wherein f is a trained XGBoost model, x₁′,x₂′, . . . , x_p′ are environmental factors affecting carbon dioxide distribution and are input parameters of the XGBoost model, and p is the number of model parameters, i.e. the 9 influence factors in step 1; the total variance of the XGBoost model is:

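$D = \int f^{2} (x') dx' - f_{0}^{2}$

wherein f₀ is the constant term of the model, i.e. the mean value of the model output over the input space, and a partial variance of the model is:

$D_{π_{1}, π_{2}, \dots, π_{s}} = \int \dots \int f_{π_{1}, π_{2}, \dots, π_{s}}^{2} (x_{π_{1}}', x_{π_{2}}', \dots, x_{π_{s}}') dx_{π_{1}}' dx_{π_{2}}' \dots dx_{π_{s}}'$

wherein f_{π₁,π₂, . . . ,π_s} is the component of f that depends only on the factors x_π₁′, x_π₂′, . . . , x_π_s′, 1≤π₁< . . . <π_s≤p, and s=1, 2, . . . , p, and the sensitivity S_{π₁,π₂, . . . ,π_s} of each environmental factor is: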

$S_{π_{1}, π_{2}, \dots, π_{s}} = \frac{D_{π_{1}, π_{2}, \dots, π_{s}}}{D}$

wherein S_π₁ is the first-order sensitivity index of the environmental factor x_π₁′, which represents the influence of that parameter alone on the model output, and S_{π₁,π₂, . . . ,π_s} is the s-order sensitivity index of the environmental factors x_π₁′, x_π₂′, . . . , x_π_s′, which represents the joint influence of the s parameters on the model;

further, a total sensitivity index of each environmental factor is obtained, and the total sensitivity index TS_π₁ of the environmental factor x_π₁′ is defined as the sum of all sensitivity indexes that involve that factor:

$TS_{π_{1}} = S_{π_{1}} + S_{π_{1}, π_{2}} + \dots + S_{π_{1}, π_{2}, \dots, π_{s}}$

In step 5, the total sensitivity index of each environmental factor obtained by Sobol method is used to evaluate the final sensitivity of the influence factors affecting the regional carbon dioxide distribution, achieving quantitative influence degree analysis.
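By way of a small worked example, using a toy function chosen here for illustration rather than the trained model: for y = x₁′x₂′ with x₁′ and x₂′ independent and uniformly distributed on [0, 1], the mean output and the total variance are

$f_{0} = \frac{1}{4}, \quad D = \frac{1}{9} - \frac{1}{16} = \frac{7}{144}$

the partial variances are

$D_{1} = D_{2} = \frac{1}{48}, \quad D_{1, 2} = D - D_{1} - D_{2} = \frac{1}{144}$

and the sensitivity indexes follow as

$S_{1} = S_{2} = \frac{3}{7}, \quad S_{1, 2} = \frac{1}{7}, \quad TS_{1} = S_{1} + S_{1, 2} = \frac{4}{7}$

so the total sensitivity index of a factor exceeds its first-order index by exactly the share contributed through interaction with the other factor.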

3. Embodiment

In this embodiment of the present disclosure, the CO₂ concentration distribution in the eastern region of China is simulated by XGBoost modeling using OCO-2 satellite XCO2 observation data of 2016 and the corresponding environmental factors. FIGS. 2A-2B show the satellite observation data and the modeling retrieval result, respectively. To evaluate the accuracy of the simulation model constructed using the machine learning algorithm, a determination coefficient R² and a root mean square error RMSE are used; the final modeling accuracy obtained after parameter adjustment and optimization is shown in Table 1.

TABLE 1

Modeling accuracy

Training samples	Test samples	R²	RMSE
3153 (70%)	1351 (30%)	0.6751	1.6362 ppm
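The two accuracy measures reported in Table 1 follow their standard definitions, which may be sketched as follows. The function names and the sample values in the usage note are hypothetical and are not the data of the embodiment.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Determination coefficient: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Both functions are intended to be applied to the test samples only, so that the reported accuracy reflects data the model has not seen during training; RMSE carries the unit of the concentration (ppm), whereas R² is dimensionless.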

By using the global sensitivity analysis method and the constructed carbon dioxide simulation model, quantitative evaluation is performed for the sensitivities of the influence factors to obtain the results as shown in Table 2.

TABLE 2

First-order sensitivity index and total sensitivity index of each environmental factor estimated using the global sensitivity analysis method

Environmental factors	First-order sensitivity index	Total sensitivity index
Ground coverage type	0.013060	0.015529
Vegetation coverage	0.300257	0.320699
Climate type	0.006008	0.007367
Precipitation	0.291814	0.301615
Atmospheric temperature	0.262991	0.277399
Wind velocity and direction	0.713833	0.727576
Anthropogenic emission amount	0.000197	0.000208
Biomass combustion emission	0.000915	0.001157

To more visually display the sizes of the sensitivities of different environmental factors on the total carbon dioxide distribution, a sector graph of sensitivity indexes is drawn to determine ratios of the influence factors as shown in FIGS. 3A-3B.

As shown in FIGS. 3A-3B, the environmental factors sorted in descending order of sensitivity are wind velocity and direction, vegetation, precipitation, atmospheric temperature, ground coverage type, climate type, biomass combustion emission, and anthropogenic emission; the indexes of wind velocity and direction, vegetation, precipitation, and atmospheric temperature are large, which indicates that these are the major factors affecting regional carbon dioxide distribution.
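The ranking stated above can be reproduced from the total sensitivity indexes of Table 2, as in the following sketch. Dividing each index by the sum of all indexes is only one plausible way of obtaining the ratios of a sector graph; the disclosure does not state how the ratios in the figures were computed.

```python
# Total sensitivity indexes copied from Table 2.
total_index = {
    "Ground coverage type": 0.015529,
    "Vegetation coverage": 0.320699,
    "Climate type": 0.007367,
    "Precipitation": 0.301615,
    "Atmospheric temperature": 0.277399,
    "Wind velocity and direction": 0.727576,
    "Anthropogenic emission amount": 0.000208,
    "Biomass combustion emission": 0.001157,
}

# Sort the factors in descending order of total sensitivity.
ranked = sorted(total_index, key=total_index.get, reverse=True)

# Normalize to shares of the summed index (assumption for the sector graph).
index_sum = sum(total_index.values())
shares = {name: total_index[name] / index_sum for name in ranked}

for name in ranked:
    print(f"{name}: {shares[name]:.1%}")
```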

As known from the model accuracy, it is feasible to simulate the regional carbon dioxide spatiotemporal distribution by using the model. The method provided by the disclosure can fill in the gaps of satellite observation data by simulating the regional carbon dioxide concentration spatiotemporal distribution with the environmental factors. Further, a method of quantitatively evaluating the influence degrees of the environmental factors on the regional carbon dioxide distribution is proposed so as to determine the influence sizes and specific degrees of various environmental factors on the regional carbon dioxide distribution.

The specific embodiments described in the disclosure are merely illustrated based on the spirit of the disclosure. Those skilled in the art can make various changes or supplementations or similar replacements to the specific embodiments described herein without departing from the spirit of the disclosure or the scope defined by the appended claims.

Claims

1. A method of analyzing an influence factor for predicting a carbon dioxide concentration of any spatiotemporal position, the method comprising:
1) in combination with regional environmental characteristics, classifying environmental factors affecting regional carbon dioxide distribution into a plurality of factors comprising a ground coverage type factor, a vegetation coverage factor, a climate type factor, a precipitation factor, an atmospheric temperature factor, wind velocity and direction factors, an anthropogenic emission amount factor, and a biomass combustion emission factor; wherein, in 1), the vegetation coverage factor is from the L3 Normalized Difference Vegetation Index of the Moderate-Resolution Imaging Spectroradiometer (MODIS) satellite; the ground coverage type factor is from the annual global land coverage data from the European Space Agency; the climate type factor is from the Köppen climate zoning dataset; the precipitation factor and the atmospheric temperature factor are from the Chinese 1 km-resolution monthly average precipitation and atmospheric temperature data from the National Tibetan Plateau Data Center; the wind velocity and direction factors are from the wind velocity and direction data of the ERA5 dataset; the anthropogenic emission amount factor is from the high resolution global anthropogenic emission dataset ODIAC; and the biomass combustion emission factor is from the biomass combustion emission amount data of the Global Fire Emissions Database GFED4;
2) in combination with OCO-2 satellite carbon dioxide observation data and the environmental factors, using the eXtreme Gradient Boosting tree (XGBoost) machine learning algorithm to construct a Regional Carbon Dioxide Spatiotemporal distribution simulation (RCDS) model and training the simulation model using a training dataset;
3) for the constructed RCDS model, first using a test dataset to verify a model prediction accuracy, and then inputting environmental factor data without satellite observation into the trained carbon dioxide spatiotemporal distribution simulation model to obtain a predicted carbon dioxide concentration and finally obtaining a regional carbon dioxide concentration distribution graph;
4) in combination with the constructed regional carbon dioxide spatiotemporal distribution simulation model and a global sensitivity analysis method, calculating a sensitivity of the carbon dioxide concentration for each input environmental factor parameter;
5) counting the sensitivities of the regional carbon dioxide concentration for different environmental factors obtained by the global sensitivity analysis method, and quantitatively analyzing the size of the sensitivity of each parameter to finally determine an influence degree of each environmental factor along with the regional carbon dioxide distribution.
2. The method of claim 1, wherein the machine learning algorithm used in 2) is the eXtreme Gradient Boosting tree (XGBoost), which is a tree ensemble model based on gradient boosting; the basic construction idea of the XGBoost model is: first constructing an initial sub-tree to fit the data and obtain a corresponding fitting residue, and constructing subsequent sub-trees based on the initial sub-tree fitting residue until the subsequent sub-tree fitting residue is less than a threshold, the final simulation result being a sum of all sub-tree results; the specific construction steps are as follows: initially constructing a weak base learner to obtain a residue corresponding to an initial sub-tree model; for each subsequent training iteration, based on the existing sub-tree model, adding one weak learner to fit a residue of a previous sub-tree model; through continuous learning, fitting K weak learners to reduce the residue between a model prediction result and a true value until the residue is less than a threshold, whereupon the model is terminated, the final model prediction result being a result obtained by performing weighted summing using the K base learners.
3. The method of claim 1, wherein the specific implementation of performing training using the training dataset in 2) is as follows: first performing preprocessing for the training dataset, comprising data cleaning, data encoding and data transformation, wherein the data cleaning comprises removal of missing values, abnormal values and noise, and the data transformation comprises normalization and dimension reduction; the data encoding is to encode non-numerical features for input into the model for training, that is, to encode the environmental factors comprising the ground vegetation type, the climate type and the wind direction by using one-hot encoding; performing normalization processing for the data in the following formula:
$z_{q}^{'} = \frac{z_{q} - mean (z_{q})}{std (z_{q})}$
wherein mean(z_q) is a mean value of data of environmental factor z_q, and std(z_q) is a standard deviation of the data of the environmental factor z_q.
4. The method of claim 2, wherein the base learner of the XGBoost model is a CART tree, and for a dataset with m features of n samples D=(x_i,y_i) (|D|=n, x_i∈R^m, y_i∈R), the final CART tree prediction value obtained by training is expressed below:
${\hat{y}}_{i} = φ (x_{i}) = \sum_{k = 1}^{K} f_{k} (x_{i})$
5. The method of claim 1, wherein the global sensitivity analysis method used in 4) is the Sobol method, the sensitivity of which is calculated by decomposing an output total variance into a sum of a variance of each parameter and a variance of mutual interaction of parameters, and then performing sensitivity grading calculation based on a ratio of a contribution of the parameter to the output variance; for each environmental factor, a change range and a probability distribution are calculated and then a corresponding sensitivity index is calculated in combination with the regional carbon dioxide spatiotemporal distribution simulation model; the regional carbon dioxide spatiotemporal distribution simulation model is expressed as: y=f(x₁′,x₂′, . . . , x_p′), wherein f is a trained XGBoost model, x₁′,x₂′, . . . , x_p′ are environmental factors affecting carbon dioxide distribution and are input parameters of the XGBoost model; the total variance of the XGBoost model is: D=∫f²(x′)dx′−f₀², wherein f₀ is the constant term of the XGBoost model, i.e. the mean value of the model output over the input space, and a partial variance of the XGBoost model is: D_{π₁,π₂, . . . ,π_s}=∫ . . . ∫f²_{π₁,π₂, . . . ,π_s}(x_π₁′,x_π₂′, . . . ,x_π_s′)dx_π₁′dx_π₂′ . . . dx_π_s′, wherein f_{π₁,π₂, . . . ,π_s} is the component of f that depends only on the factors x_π₁′, x_π₂′, . . . , x_π_s′, 1≤π₁< . . . <π_s≤p, and s=1, 2, . . . , p, and the sensitivity S_{π₁,π₂, . . . ,π_s} of each environmental factor is:
$S_{π_{1}, π_{2}, \dots, π_{s}} = \frac{D_{π_{1}, π_{2}, \dots, π_{s}}}{D}$

Priority Claims (1)

Number	Date	Country	Kind
202111524281.7	Dec 2021	CN	national
