The present invention relates generally to data mining and analysis and, more specifically, to a system which integrates and analyzes hydrocarbon well data from available databases to provide valuable insight into production enhancement and well stimulation/completion.
Over the past decade, data relating to hydrocarbon exploration has been compiled into various databases. The data compilations include general well and job information, job level data, pumping data, as well as wellbore and completion data. There are software platforms available to search those databases to locate existing jobs in a particular location and retrieve certain information related to those jobs.
However, to date, those platforms lack an automated, efficient and statistically rigorous decision-making algorithm that searches the data for patterns which may be used to evaluate an aspect of a well, such as well performance. It would be desirable to provide an analytical platform or system that could be utilized to, among other things, (1) evaluate the effectiveness of previous well treatments; (2) quantify the characteristics which made those treatments effective; (3) identify anomalously good or bad wells; (4) determine what factors contributed to the differences; (5) determine if the treatment program can be improved; (6) determine if the analysis can be automated; or (7) determine how to best use available data that contains both categorical and continuous variables along with missing values.
In view of the foregoing, there is a need in the art for a system which addresses those deficiencies by analyzing hydrocarbon well-related data in order to determine those data variables which best indicate or predict well performance.
Illustrative embodiments and related methodologies of the present invention are described below as they might be employed in a system for data mining and analysis of well data. In the interest of clarity, not all features of an actual implementation or methodology are described in this specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure. Further aspects and advantages of the various embodiments and related methodologies of the invention will become apparent from consideration of the following description and drawings.
After system 100 has analyzed the data as described herein, attention may be drawn to a particular set of well jobs to, among other things, determine, based on the data output as described herein, if job pause time in a particular region is high, and if so, to determine whether the foregoing is due to a particular customer, service representative, or some other factor.
To achieve the foregoing objectives, as will be described herein, certain exemplary embodiments of WDMA system 100 analyze the wellbore-related data by applying a Classification and Regression Tree (“CART”) methodology on desired datasets. In certain embodiments, the present invention improves the interpretation capability of trees by performing a Normal Score Transform (“NST”) and/or a clustering technique on both discrete and continuous variables.
Referring to
Although not explicitly shown in
Moreover, those skilled in the art will appreciate that the present invention may be practiced with a variety of computer-system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. Any number of computer systems and computer networks are acceptable for use with the present invention. The invention may be practiced in distributed-computing environments where tasks are performed by remote-processing devices that are linked through a communications network. In a distributed-computing environment, program modules may be located in both local and remote computer-storage media, including memory storage devices. The present invention may, therefore, be implemented in connection with various hardware, software or a combination thereof in a computer system or other processing system.
Still referring to
Serving as the database component of database mining and analysis engine 110, data mining module 112 is utilized by processor 102 to capture datasets for computation from a server database (not shown). In certain exemplary embodiments, the server database may be, for example, a local or remote SQL server which includes well job details, wellbore geometry data, pumping schedule data per stage, post job summaries, bottom-hole information, formation information, etc. As will be described herein, exemplary embodiments of the present invention utilize data mining module 112 to capture key variables from the database corresponding to different job IDs using server queries. After the data is extracted, data mining and analysis engine 110 communicates the dataset to data analysis module 114.
Data analysis module 114 is utilized by processor 102 to analyze the data extracted by data mining module 112. An exemplary data analysis platform may be, for example, Matlab®, as will be readily understood by those ordinarily skilled in the art having the benefit of this disclosure. As described herein, WDMA system 100, via data analysis module 114, analyzes the dataset to identify those data variables which indicate or predict well performance.
Referring to
At block 202, WDMA system 100 initializes and displays a graphic user interface via display 108, the creation of which will be readily understood by ordinarily skilled persons having the benefit of this disclosure. Here, WDMA system 100 awaits entry of queries reflecting dataset extraction. In one exemplary embodiment, SQL queries may be utilized to specify the data to be extracted from the database. Such queries may include, for example, field location, reservoir name, name of the variables, further calculations required for new variables, etc. At block 204, once one or more queries have been detected by WDMA system 100, processor 102 instructs data mining module 112 to extract the corresponding dataset(s). Exemplary dataset variables may include, for example, average pressure, crew, pressures, temperatures, slurry volume, proppant mass, screen out, hydraulic power, etc. for a particular well.
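Purely by way of illustration, the sketch below shows what such a query-driven extraction might look like in Python against a small in-memory stand-in for the server database; the job_summary table and its columns are hypothetical and are not taken from the specification.

```python
import sqlite3
import pandas as pd

# Small in-memory stand-in for the server database; the job_summary table
# and its columns are hypothetical and for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job_summary (
        job_id INTEGER, field TEXT, reservoir TEXT,
        avg_pressure REAL, slurry_volume REAL, proppant_mass REAL, screen_out INTEGER
    );
    INSERT INTO job_summary VALUES
        (101, 'FieldA', 'ResX', 5200.0, 1800.0,  95000.0, 0),
        (102, 'FieldA', 'ResX', 6100.0, 2300.0, 120000.0, 1);
""")

# A query specifying field location, reservoir name and the variables of interest.
query = """
    SELECT job_id, avg_pressure, slurry_volume, proppant_mass, screen_out
    FROM job_summary
    WHERE field = 'FieldA' AND reservoir = 'ResX';
"""
dataset = pd.read_sql_query(query, conn)
```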
At block 206, WDMA system 100 detects a user input that defines a response (i.e., output) variable y and predictor (i.e., input) variables xi for i=(1, . . . n), that are the subject of the analysis. As described herein, such selections may be made via a graphical user interface. Based upon a given response variable, a number of predictor variables are also chosen by the user. The predictor and response variables are selected from the data available in the dataset. For example, screen-out may be selected as the response variable, with predictor variables being engineer, customer, depth, average rate, clean volume, etc. The predictor variables may be categorical (engineer, customer, for example) or continuous (depth, clean volume, for example) in nature, and all values may be identified in standard oil-field units.
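The following sketch illustrates one way the selected response and predictor variables might be separated out of the extracted dataset, with categorical predictors numerically encoded; the column names mirror the examples above but are otherwise hypothetical, and the integer encoding is an assumption made because scikit-learn trees, unlike native CART, require numeric inputs.

```python
import pandas as pd

# Hypothetical selection: 'screen_out' as the response variable, with a mix
# of categorical and continuous predictors (column names illustrative only).
RESPONSE = "screen_out"
PREDICTORS = ["engineer", "customer", "depth", "avg_rate", "clean_volume"]

def build_xy(df: pd.DataFrame):
    """Split the dataset into the predictor matrix X and response y, encoding
    categorical predictors as integer codes (an assumption: scikit-learn trees,
    unlike native CART, require numeric inputs)."""
    X = df[PREDICTORS].copy()
    for col in X.select_dtypes(include=["object", "category"]).columns:
        X[col] = X[col].astype("category").cat.codes
    return X, df[RESPONSE]
```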
At block 208, WDMA system 100 performs pre-processing of the dataset in order to remove corrupted data. In certain exemplary embodiments, pre-processing of the dataset includes de-noising and/or removing outliers in the variables in order to provide a high-quality dataset which will form the basis of the analysis. In an exemplary embodiment, outliers may be removed if they are characterized as values deviating from the mean by more than three times the standard deviation, although other merit factors may be utilized. In addition, the data entered into the database may comprise incomplete or inconsistent data. Incomplete data may include NAN or NULL data, or data suffering from thoughtless entry. Noisy data may include data resulting from faulty collection or human error. Inconsistent data may include data having different formats or inconsistent names.
As previously described, certain exemplary embodiments of WDMA system 100 utilize a CART data analysis methodology. As understood in the art, classification or regression trees are produced by separating observations into subgroups by creating splits on predictors. These splits produce logical rules that are very comprehensible in nature. Once constructed, they may be applied on any sample size and are capable of handling missing values and may utilize both categorical and continuous variables as input variables.
Although CART is capable of handling missing values, inaccurate or erroneous entries can greatly affect the analysis. Even though CART is capable of accounting for outliers in the input variables xi for i=(1, . . . n), it does not work well with outliers in the output variable y, as a few unusually high or low y values may have a large influence on the mean of a particular node and, in turn, produce a high residual sum of squares that may lead to incorrect interpretation. In this exemplary embodiment, based on the assumption of a normal distribution, outliers are characterized as those observations that deviate by more than three times the standard deviation from the mean, although other deviations may be utilized as would be understood by those ordinarily skilled in the art having the benefit of this disclosure. Therefore, at block 208, WDMA system 100 performs pre-processing of the dataset to remove outliers and other corrupted data. After WDMA system 100 removes the corrupted data, the dataset is ready for further analysis.
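A minimal sketch of this pre-processing step, assuming the extracted dataset sits in a pandas DataFrame and applying the three-standard-deviation rule described above to the response variable; the column name in the usage comment is illustrative.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, response: str, n_std: float = 3.0) -> pd.DataFrame:
    """Drop incomplete rows and remove response-variable outliers."""
    clean = df.dropna()                               # discard NAN/NULL entries
    y = clean[response]
    # Keep observations within n_std standard deviations of the mean
    # (three, per the characterization of outliers above).
    mask = (y - y.mean()).abs() <= n_std * y.std()
    return clean[mask]

# Example usage on the hypothetical extracted dataset:
# cleaned = preprocess(dataset, response="avg_pressure")
```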
At block 210, WDMA system 100 normalizes the dataset using, for example, an NST methodology. As will be understood by ordinarily skilled persons having the benefit of this disclosure, CART interpretations may not be sensible when the output variable has a skewed distribution. In such cases, it becomes important to normalize the predictor and response variables before using them for interpretation using CART. Accordingly, certain exemplary embodiments of the present invention utilize NST to transform a dataset to resemble a standard normal distribution. Thus, at block 210, data mining and analysis engine 110 first ranks the original values yi for i=(1, . . . , N) of the variable in order. In one preferred embodiment, the order is an ascending order. Next, the cumulative frequency, or pk quantile, for the observation of rank k is calculated using:

p_k = w_1 + w_2 + . . . + w_k,

where wk is the weight of the sample with rank k. If the weights of the data samples are not available, a default weight of 1/N is used for each sample.
The NST of the data sample with rank k is the pk quantile of the standard normal distribution. Here:
y_NST,k = G^-1(p_k), where G(·) is the cumulative standard normal distribution.
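The forward and inverse transforms can be sketched as follows, assuming equal default weights of 1/N and clipping the largest quantile slightly below 1 so that the inverse normal distribution remains finite; the mapping table is retained so the values can later be transformed back to original units at block 214.

```python
import numpy as np
from scipy.stats import norm

def normal_score_transform(y, weights=None):
    """Forward NST: rank the values, compute the p_k quantiles as running
    sums of the weights, and map them through the inverse standard normal
    CDF. Returns the scores and the table needed to invert the transform."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    if weights is None:
        weights = np.full(N, 1.0 / N)     # default equal weights (sum to one)
    order = np.argsort(y)                 # ascending ranks of the original values
    p = np.cumsum(np.asarray(weights)[order])   # p_k = w_1 + ... + w_k
    p = np.clip(p, 1e-6, 1.0 - 1e-6)      # keep the largest quantile finite
    scores = norm.ppf(p)                  # y_NST,k = G^-1(p_k)
    y_nst = np.empty(N)
    y_nst[order] = scores
    return y_nst, (y[order], scores)      # (original sorted, scores) mapping table

def inverse_nst(z, table):
    """Inverse NST (block 214): interpolate normal scores back to original units."""
    original, scores = table
    return np.interp(z, scores, original)
```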
Referring back to
In certain exemplary embodiments described herein, the splitting criteria for the regression or classification tree methodologies utilized by WDMA system 100 include minimizing the mean squared error for regression trees and utilizing Gini's diversity index, twoing or entropy for classification trees. Such splitting criteria will be understood by those ordinarily skilled in the art having the benefit of this disclosure. Nevertheless, in certain exemplary embodiments, it is desirable to select an appropriate tree size, as a tree can become very complex as it grows, accounting for several questions at each node. Therefore, the present invention utilizes the NST of the dataset at block 210 in order to optimize the dataset before utilizing it for prediction, analysis or classification purposes.
In view of the foregoing, exemplary embodiments of the present invention determine the optimal tree size such that cross-validation error is minimized. In one exemplary embodiment, to obtain a suitably sized tree, WDMA system 100 may model an overly complex tree and then prune it back at block 212, as would be understood by those ordinarily skilled in the art having the benefit of this disclosure. Here, the residual error on the training data will decrease or remain the same with an increase in the depth of the tree; however, this does not guarantee low error on the testing data because that data is not used to build the model. In an alternative embodiment, WDMA system 100 may utilize cross-validation to decide on the optimal decision tree, as would also be understood by those same ordinarily skilled persons having the benefit of this disclosure. In cross-validation, the optimal depth of the tree is obtained such that the resulting model is suitable for making predictions for a new dataset. In yet another exemplary embodiment, a user may define a maximum number of samples per node in order to limit the tree growth.
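A sketch of this step, assuming scikit-learn's DecisionTreeRegressor is an acceptable stand-in for the CART regression tree and that categorical predictors have already been numerically encoded; cross-validation over candidate depths selects the tree size, which is one of the approaches described above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def fit_regression_tree(X, y_nst, max_depths=range(2, 11)):
    """Grow candidate regression trees of increasing depth and keep the one
    with the lowest cross-validation mean squared error, mirroring the
    optimal tree-size selection discussed above."""
    best_depth, best_score = None, -np.inf
    for depth in max_depths:
        tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=5)
        # Five-fold cross-validation; scoring is negative MSE, so larger is better.
        score = cross_val_score(tree, X, y_nst, cv=5,
                                scoring="neg_mean_squared_error").mean()
        if score > best_score:
            best_depth, best_score = depth, score
    return DecisionTreeRegressor(max_depth=best_depth,
                                 min_samples_leaf=5).fit(X, y_nst)
```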
At block 214, after applying CART, WDMA system 100 then performs an inverse NST on the transformed dataset variables in order to transform them back into their original units for display in a classification or regression tree as shown in
At block 216, WDMA system 100 outputs the results of the analysis. In this exemplary embodiment, the results are output in tree format. As such, a user may then perform visual analysis and/or event prediction. In other words, the tree may be utilized for two purposes. First, the tree may be utilized for prediction or classification of the output (i.e., response variable y) for a new set of input variables xi where i=(1, . . . n) (i.e., once a model is developed, it may be utilized for prediction purposes on any number of samples). Second, in the case of visual analysis, the tree may be utilized by a user to understand the structural relationship between the y and xi variables to determine a list of logical questions which may be subsequently utilized to define predictor/output variables. Although described herein as a tree, WDMA system 100 may output the results as, for example, an earth model, plotted graph, two or three-dimensional image, etc., as would be understood by those ordinarily skilled in the art having the benefit of this disclosure.
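The two uses of the tree output can be illustrated with a small, self-contained scikit-learn example; the tiny demonstration data and feature names are hypothetical and stand in for the pruned tree produced at block 212.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# A tiny illustrative fit; in practice 'fitted' is the pruned tree from block 212
# and the feature names are the selected predictor variables.
X_demo = np.array([[1000.0, 50.0], [1500.0, 60.0], [3000.0, 80.0], [3500.0, 90.0]])
y_demo = np.array([0.2, 0.3, 1.1, 1.3])
fitted = DecisionTreeRegressor(max_depth=2).fit(X_demo, y_demo)

# Visual analysis: the tree rendered as If-Then-Else rules.
print(export_text(fitted, feature_names=["depth", "avg_rate"]))

# Event prediction: predict the response for a new input sample
# (values are in NST units here; apply the inverse NST for original units).
print(fitted.predict(np.array([[2800.0, 75.0]])))
```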
Thereafter, at block 218, WDMA system 100 determines the importance of dataset variables. In determining variable importance, WDMA system 100 measures the contribution of a particular predictor variable in the tree formation. For classification and regression trees, WDMA system 100 computes the variable importance by summing the node error due to splits on every predictor (i.e., difference between the node error of the parent node and the two child nodes) and dividing the sum by the number of tree nodes. Node error is the mean square error in the case of regression trees and misclassification probability in case of classification trees, as would be understood by those ordinarily skilled in the art having the benefit of this disclosure. Table 1 below illustrates an exemplary ranking of exemplary predictor variables based upon their importance.
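The importance measure described here can be sketched against a fitted scikit-learn tree as shown below; this is an illustrative reimplementation of the stated rule (summed drop in node error over all splits on a predictor, divided by the number of tree nodes), not the library's built-in feature_importances_, and weighting each node's error by its sample count is an assumption.

```python
import numpy as np

def predictor_importance(fitted_tree, n_features):
    """For each predictor, sum the drop in node error over all splits made on
    that predictor (parent error minus the two child errors, weighted here by
    node sample counts) and divide by the total number of tree nodes."""
    t = fitted_tree.tree_
    importance = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                                # leaf node: no split
            continue
        drop = (t.n_node_samples[node] * t.impurity[node]
                - t.n_node_samples[left] * t.impurity[left]
                - t.n_node_samples[right] * t.impurity[right])
        importance[t.feature[node]] += drop
    return importance / t.node_count

# Ranking predictors from most to least important (as in Table 1):
# ranking = np.argsort(predictor_importance(fitted, X.shape[1]))[::-1]
```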
The effect of NST on the regression tree will now be illustrated utilizing an exemplary case study. Referring back to
Still referring to the exemplary case study, the tree illustrated in
Referring back to
Visualization of data is an important feature of any data mining analysis. Once the dimension of the data is 3 or higher, human visualization of data becomes quite difficult. As such, certain exemplary embodiments of the present invention utilize Multidimensional Scaling (“MDS”) at block 216 to enhance the analysis of WDMA system 100 with data visualization, as this technique reduces the dimension of the data for visualization purposes, as will be understood by those ordinarily skilled in the art having the benefit of this disclosure. In this exemplary embodiment, data analysis module 114 comprises the MDS functionality. For visualization purposes, WDMA system 100 utilizes Euclidean distance and, hence, calculates the symmetric Euclidean distance matrix Θ ∈ R^(N×N) (also known as the dissimilarity matrix), where Θ_ij = ||θ_i − θ_j||_2 and θ_i ∈ R^d, i, j = 1 . . . N, represents the data in the NST domain.
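A sketch of this visualization step, assuming scipy and scikit-learn are acceptable stand-ins for the data analysis platform: the symmetric Euclidean dissimilarity matrix is computed on the NST-domain data and passed to metric MDS to obtain low-dimensional coordinates for plotting.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def mds_coordinates(data_nst, n_components=2):
    """Compute the symmetric N x N Euclidean dissimilarity matrix on the
    NST-domain data and reduce it to n_components dimensions for display."""
    D = squareform(pdist(np.asarray(data_nst), metric="euclidean"))
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=0)
    return mds.fit_transform(D)
```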
Referring back to block 210, many of the large-scale conventional clustering techniques focus on grouping based on the Euclidean distance, with the inherent assumption that all the data points lie in a linear Euclidean domain. However, certain exemplary embodiments of the present invention overcome this limitation through utilization of the kernel-based clustering method described herein, by embedding the data points into a high-dimensional non-linear domain and defining their similarity using a nonlinear kernel distance function. Accordingly, through utilization of the foregoing clustering methodology in block 210 (after NST is performed), WDMA system 100 will generate any desired number of dataset clusters.
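One way to sketch such a kernel-based clustering is kernel k-means with an RBF kernel, in which distances to cluster centroids are evaluated through the kernel matrix rather than in the original space; the kernel choice and its parameter are illustrative assumptions rather than requirements of the specification.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(X, n_clusters, gamma=1.0, n_iter=100, seed=0):
    """Cluster X with kernel k-means: distances to cluster centroids are
    evaluated implicitly in the RBF feature space via the kernel matrix."""
    K = rbf_kernel(X, gamma=gamma)                  # K[i, j] = k(x_i, x_j)
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue                            # empty cluster: leave at inf
            # ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_j K_ij + mean_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # assignments stable: converged
            break
        labels = new_labels
    return labels
```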
In an alternative exemplary embodiment of the present invention, WDMA system 100 may perform this clustering technique without utilizing the NST of the dataset. In such an embodiment, after removing the corrupted data at block 208, WDMA system 100 will cluster the dataset at block 210, then proceed on to CART analysis of block 212. Likewise, in an alternative embodiment, any of the methodologies described herein may be conducted without removing the corrupted data. Those ordinarily skilled in the art having the benefit of this disclosure realize any variety of the features described herein may be combined as desired.
The effect of NST and clustering on the regression tree will now be illustrated utilizing another exemplary case study. In this example, a five-cluster output was selected, using JPT, for example, as the response variable by which the dataset was divided into clusters. Thereafter, trees were created and the clusters were plotted within a three-dimensional view after performing k-means clustering on the post-NST dataset. Thereafter, pruning was conducted as previously described herein. The resubstitution error for each cluster is summarized in Table 2 below.
As expected, improvement was observed in the resubstitution error after performing clustering. For five clusters, the decrease in error was around 9.8%. Increasing the number of clusters resulted in further decreases in error: for example, for six clusters the decrease in error was approximately 14%, and for eight clusters it was around 18%.
As described herein, exemplary embodiments of the present invention provide a system to data-mine and identify significant reservoir-related variables (i.e., predictor variables) influencing a defined output variable, thus providing valuable insight into production enhancement and well stimulation/completion. The present invention is useful in its ability to parse complex data into a series of If-Then-Else type questions involving important predictor variables. The system then presents the results in a simple, intuitive and easy-to-understand format that makes it a very efficient tool to handle any kind of data that includes categorical, continuous and missing values, which is particularly desirable in the evaluation of hydrocarbon well data. In addition, the ability of the present invention to rank predictor variables based on their order of importance makes it equally competitive with stepwise regression, and the use of NST reduces the standard deviation in many nodes, thus yielding better interpretation capability. Moreover, CART performed after k-means clustering improves predictions related to the hydrocarbon well.
Although CART methodologies were described herein, other tree methods may also be utilized such as, for example, Boosted Trees. Moreover, multivariate adaptive regression splines, neural networks or ensemble methods that combine a number of trees, such as, for example, a tree bagging technique, may also be utilized herein, as will be readily understood by those ordinarily skilled in the art having the benefit of this disclosure.
The foregoing methods and systems described herein are particularly useful in planning, altering and/or drilling wellbores. As described, the system analyzes well data to identify characteristics that indicate performance of a well. Once identified, the data is presented visually using a tree or some other suitable form. This data can then be utilized to identify well equipment and/or develop a well workflow or stimulation plan. Thereafter, a wellbore is drilled, stimulated, altered and/or completed in accordance with those characteristics identified using the present invention.
Those of ordinary skill in the art will appreciate that, while exemplary embodiments and methodologies of the present invention have been described statically as part of implementation of a well placement or stimulation plan, the methods may also be implemented dynamically. Thus, a well placement or stimulation plan may be updated in real-time based upon the output of the present invention, such as, for example, during drilling or stimulation operations. Also, after implementing the well placement or stimulation plan, the system of the invention may be utilized during the completion process on the fly or iteratively to determine optimal well trajectories, fracture initiation points and/or stimulation design as wellbore parameters change or are clarified or adjusted. In either case, the results of the dynamic calculations may be utilized to alter a previously implemented well placement or stimulation plan.
An exemplary methodology of the present invention provides a computer-implemented method to analyze wellbore data, the method comprising extracting a dataset from a database, the dataset comprising wellbore data, detecting an output variable, removing corrupted data from the dataset, calculating a normal distribution for the dataset, thus creating a normalized dataset, performing a classification and regression tree (“CART”) analysis on the normalized dataset based upon the output variable and, based upon the CART analysis, determining one or more predictor variables that correlate to the output variable. Another exemplary method further comprises determining a contribution of the one or more predictor variables on the output variable and ranking the one or more predictor variables based on their influence on the output variable. In yet another method, calculating the normal distribution further comprises utilizing a Normal Score Transform to calculate the normal distribution of the dataset.
In another method, calculating the normal distribution further comprises performing a clustering technique on the normalized dataset. In yet another, determining one or more predictor variables further comprises displaying the one or more predictor variables utilizing a multidimensional scaling technique. Another methodology further comprises displaying the one or more predictor variables in the form of a tree or earth model. In yet another, determining the one or more predictor variables further comprises determining an optimal tree size. In another, determining the one or more predictor variables further comprises performing an inverse transformation on the normalized dataset. In yet another, a wellbore is drilled, completed or stimulated based on the determined one or more predictor variables.
Another exemplary methodology of the present invention provides a computer-implemented method to analyze wellbore data, the method comprising extracting a dataset from a database, the dataset comprising wellbore data, detecting an output variable, removing corrupted data from the dataset, performing a clustering technique on the dataset, performing a classification and regression tree (“CART”) analysis on the clustered dataset based upon the output variable and, based upon the CART analysis, determining one or more predictor variables that correlate to the output variable. In another, performing the clustering technique further comprises normalizing the dataset. In yet another, a wellbore is drilled, completed or stimulated based on the determined one or more predictor variables.
An exemplary embodiment of the present invention provides a system to analyze wellbore data, the system comprising a processor and a memory operably connected to the processor, the memory comprising software instructions stored thereon that, when executed by the processor, cause the processor to perform a method comprising extracting a dataset from a database, the dataset comprising wellbore data, detecting an output variable, removing corrupted data from the dataset, calculating a normal distribution for the dataset, thus creating a normalized dataset, performing a classification and regression tree (“CART”) analysis on the normalized dataset based upon the output variable and, based upon the CART analysis, determining one or more predictor variables that correlate to the output variable. In another embodiment, calculating the normal distribution further comprises performing clustering on the normalized dataset. In yet another embodiment, a wellbore is drilled, completed or stimulated based on the determined one or more predictor variables.
Although various embodiments and methodologies have been shown and described, the invention is not limited to such embodiments and methodologies and will be understood to include all modifications and variations as would be apparent to one skilled in the art. For example, the invention as described herein may also be embodied in one or more systems comprising processing circuitry to perform the described mining and analysis, or may be embodied in a computer program product comprising instructions to perform the described mining and analysis. Therefore, it should be understood that the invention is not intended to be limited to the particular forms disclosed. Rather, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US12/62658 | 10/31/2012 | WO | 00