The present disclosure relates generally to the field of machine learning. More specifically, the present disclosure relates to systems and methods for improved machine learning using data completeness and collaborative learning techniques.
Completeness of data is key for a variety of computer-based applications, particularly for building machine learning and deep learning models. Such models are useful in a variety of industries. For example, survey data is often modeled to analyze sites for discovering new oil or gas reserves. Further, in the investment industry, accurate information about investment options can be used to determine investment strategy.
Various software systems have been developed for processing data to build models using machine learning. Typically, outliers and null values widely exist in collected data. Conventional approaches mainly fill the null values and replace the outliers with a fixed value. The filled values may be created using statistical metrics of the data set (such as minimum, maximum, or mean), backward or forward filling with neighboring data, local regression, or traditional machine learning and AI technologies.
The conventional approaches are generally inaccurate and time consuming, particularly when employing a machine learning and AI-based approach. These conventional approaches also do not provide clarity as to which known attributes should be input into machine learning and AI-based approaches. As such, the ability to quickly and accurately fill in outliers and null values in data to build accurate models is a powerful tool for a wide range of professionals. Accordingly, the machine learning systems and methods disclosed herein solve these and other needs.
The present disclosure relates to systems and methods for improved machine learning using data completeness and collaborative learning techniques. The system first receives one or more sets of data. For example, the data sets can be received from an array of sensors. The system then classifies samples within the data into a multi-dimensional tree data structure. Next, the system identifies outliers and null values within the tree. Then, the system fills in the outliers and null values based on neighboring values. For example, data points close to one another in the tree data structure can be considered neighbors. In some cases, attributes may not be filled completely based on neighbors due to lack of neighbors. For these values, collaborative filtering AI technology can also be utilized to fill the rest of the missing values of all data attributes.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present disclosure relates to systems and methods for improved machine learning using data completeness and collaborative learning techniques, as described in detail below in connection with
In step 14, the system performs an outlier and null value filling phase based on neighbor information. Specifically, the system processes the indexed and partitioned data to detect and classify one or more values in the data as either a null or a value that is outside of expected parameters, e.g., an outlier. In an embodiment, the system can detect and classify the objects in the data using artificial intelligence modeling software, such as a data tree-generating architecture, as described in further detail below. The artificial intelligence modeling software replaces the outliers and null values using data points closely associated with the outliers and null values.
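By way of non-limiting illustration, the classification of a value as a null, an outlier, or an acceptable value could be sketched as follows. The helper name `classify_value`, the sample depths, and the expected range of 1,000 to 30,000 are assumptions introduced only for this example; the disclosure does not prescribe a particular test for what lies outside of expected parameters.

```python
import math
from typing import Optional


def classify_value(value: Optional[float], low: float, high: float) -> str:
    """Label a value as 'null', 'outlier', or 'ok' (hypothetical helper)."""
    # Treat both None and NaN as null values.
    if value is None or math.isnan(value):
        return "null"
    # Anything outside the assumed expected range is an outlier.
    if value < low or value > high:
        return "outlier"
    return "ok"


# Illustrative vertical depths; the expected range is an assumption.
depths = [8500.0, None, 9100.0, 250000.0, float("nan")]
labels = [classify_value(d, low=1000.0, high=30000.0) for d in depths]
print(labels)
```

A fixed range is only one possible definition of expected parameters; a statistical rule derived from the data could be substituted without changing the surrounding flow.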
In step 16, the system performs an overall attribute filling phase based on neighbor information. Specifically, the system fills in missing attributes that are not associated with the outliers and null values as will be described in further detail below. In step 18, the system determines if further outliers and/or null values exist in the data set(s). If so, the system repeats step 14. If not, the process is concluded.
The process steps of the invention disclosed herein could be embodied as computer-readable software code executed by one or more processors of one or more computer systems, and could be programmed using any suitable programming languages including, but not limited to, C, C++, C#, Java, Python or any other suitable language. Additionally, the computer system(s) on which the present disclosure can be embodied includes, but is not limited to, one or more personal computers, servers, mobile devices, cloud-based computing platforms, etc., each having one or more suitably powerful microprocessors and associated operating system(s) such as Linux, UNIX, Microsoft Windows, MacOS, etc. Still further, the invention could be embodied as a customized hardware component such as a field-programmable gate array (“FPGA”), application-specific integrated circuit (“ASIC”), embedded system, or other customized hardware component without departing from the spirit or scope of the present disclosure.
It should be noted that during or prior to the index and partition phase, the system can use a plurality of sensors to detect one or more characteristics of one or more objects (e.g., vertical depth, lateral length, water consumption, etc. of oil well sites). Additionally or alternatively, data collected outside the system can be entered into the system for processing.
In step 24, the system selects a tree-generating algorithm. In an embodiment, the selected algorithm is a k-dimensional B-tree algorithm. In step 24, the selected data is indexed and partitioned into a tree structure. The generated tree structure may be multi-dimensional.
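As a minimal sketch of indexing and partitioning samples into a multi-dimensional tree, the following example builds a plain binary k-d tree. This is a simplified stand-in chosen for brevity and is not the k-dimensional B-tree algorithm of the embodiment; the names `Node`, `build_kdtree`, and `tree_size`, and the sample points, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point                    # sample stored at this node
    axis: int                       # dimension used to split at this level
    left: Optional["Node"] = None   # samples with smaller values on the axis
    right: Optional["Node"] = None  # samples with larger values on the axis


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Recursively partition points, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2  # the median sample becomes the splitting node
    return Node(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def tree_size(node: Optional[Node]) -> int:
    """Count the samples stored in the tree."""
    if node is None:
        return 0
    return 1 + tree_size(node.left) + tree_size(node.right)


# Illustrative two-attribute samples (values are made up).
samples = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
root = build_kdtree(samples)
print(root.point, tree_size(root))
```

Because each level splits on a different attribute, samples that end up near one another in the tree tend to be close across all indexed attributes, which is what later neighbor lookups rely on.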
Turning briefly to
For the purposes of the above example, the physical and categorical attributes are documented as numerical values proportional to the similarity of neighboring categories. The numerical values are also set up to provide context to the values. For example, numerical representations of a location index may be based on alphabetical order.
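The alphabetical location index mentioned above could, for instance, be produced as in the following sketch. The function name `alphabetical_index` and the location names are hypothetical and appear here only to make the example concrete.

```python
def alphabetical_index(categories):
    """Map each distinct category to its 0-based rank in alphabetical order."""
    return {name: rank for rank, name in enumerate(sorted(set(categories)))}


# Hypothetical location names, used only for illustration.
locations = ["Permian", "Bakken", "Eagle Ford", "Bakken"]
location_index = alphabetical_index(locations)
encoded = [location_index[name] for name in locations]
print(location_index)
print(encoded)
```

An alphabetical rank is useful only to the extent that alphabetical proximity reflects similarity between categories; where it does not, a different ordering would be needed to keep the numerical values proportional to similarity.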
Returning to
In step 34, the system identifies values neighboring the identified outliers and null values within the tree structure. As described above, the system labels adjacent attributes within the tree architecture as neighbors. In step 36, the system creates new values for the outliers and null values based on the values neighboring the outliers and null values. In step 38, the system replaces the outliers and null values with the created values.
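Steps 34 through 38 can be illustrated with the following sketch, which assumes that each flagged value is replaced by the mean of its k nearest unflagged neighbors. The averaging rule, the function name `fill_from_neighbors`, and the sample data are assumptions for this example. For brevity the neighbors are found by a brute-force distance scan, whereas the described system would look them up in the tree structure.

```python
import math
from statistics import mean


def fill_from_neighbors(coords, values, flagged, k=2):
    """Return a copy of values with flagged entries replaced by the mean
    of the k nearest unflagged entries (hypothetical helper)."""
    valid = [i for i in range(len(values)) if i not in flagged]
    filled = list(values)
    for i in flagged:
        # Step 34: find the closest records that hold trustworthy values.
        nearest = sorted(valid, key=lambda j: math.dist(coords[i], coords[j]))[:k]
        # Steps 36 and 38: create the new value and write it into the copy.
        filled[i] = mean(values[j] for j in nearest)
    return filled


# Illustrative positions and lateral lengths (values are made up).
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
lateral_length = [5000.0, None, 5200.0, 9000.0, 900000.0, 9400.0]
flagged = {1, 4}  # index 1 is a null value, index 4 is an outlier
filled = fill_from_neighbors(coords, lateral_length, flagged)
print(filled)
```

Only unflagged records are used as neighbors, so a null or outlier never contributes to the value created for another flagged record.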
Creating values based on neighboring attributes produces values that are more accurate than those produced by conventional methods, such as replacing outliers or null values with fixed values, using standard metrics of data (minimum, maximum, or mean values), or applying traditional machine learning algorithms. The values created in step 36 are the product of a collaborative approach, using multiple known attributes, rather than the product of a select few as in conventional methods. The described method also provides a quicker method for filling outliers and null values. The values are replaced in one step, e.g., step 38, rather than replacing each value sequentially as done in conventional methods.
In step 44, the system creates new attribute values to fill missing attributes identified in step 42 using collaborative filtering artificial intelligence, as described above. In step 46, the missing attributes are filled with the created values.
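One simple form of collaborative filtering that could serve in step 44 is a neighborhood-based scheme, sketched below under stated assumptions: each missing attribute is estimated as a similarity-weighted average of the same attribute in other records, with cosine similarity computed over the attributes both records have observed. The disclosure does not specify the collaborative filtering model, so the function names `cosine_similarity` and `collaborative_fill`, the choice of similarity measure, and the scaled sample rows are all illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity over the attributes that both rows have observed."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    dot = sum(x * y for x, y in shared)
    norm_a = math.sqrt(sum(x * x for x, _ in shared))
    norm_b = math.sqrt(sum(y * y for _, y in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def collaborative_fill(rows):
    """Fill each None with a similarity-weighted average of that column."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Pair each other row's similarity with its observed value.
            weighted = [
                (cosine_similarity(row, other), other[col])
                for j, other in enumerate(rows)
                if j != i and other[col] is not None
            ]
            total = sum(w for w, _ in weighted)
            if total > 0:
                filled[i][col] = sum(w * v for w, v in weighted) / total
    return filled


# Illustrative attribute rows scaled to the range 0..1 (values are made up).
rows = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, None],
    [0.20, 0.10, 0.15],
    [None, 0.15, 0.10],
]
completed = collaborative_fill(rows)
print(completed)
```

Similarities are computed from the original rows rather than from partially filled ones, so the order in which missing attributes are visited does not affect the result. Other collaborative filtering variants, such as matrix factorization, could be substituted.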
One advantage of the system disclosed herein is that it quickly completes large data sets (e.g., data sets gigabytes in size, and greater). In this regard, the described system was employed to fill null values in oil well attribute data. Attributes included well vertical depth, lateral length, and the water and proppant consumed in oil extraction operations. Data for 314,000 wells were analyzed. The system identified 145 million neighbor attributes within 5 minutes of processing using a tree-generating algorithm. By comparison, a single computer using a geo indexing method required 30 minutes to identify neighboring characteristics within the same data set. The geo indexing method also failed frequently because the process ran out of computing resources.
Collaborative AI filtering, as described in step 44, was also employed to analyze the 314,000 oil wells. Main attributes of the wells were identified in 4 minutes. By comparison, a traditional approach that builds regression models using conventional machine learning and artificial intelligence techniques required 2 hours to fill null values. Collaborative AI filtering was also found to be 20% more accurate in filling the null values.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make variations and modifications without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is desired to be protected by Letters Patent is set forth in the following claims.
This application claims priority to U.S. Provisional Patent Application Ser. No. 63/142,551 filed Jan. 28, 2021, the entire disclosure of which is hereby expressly incorporated by reference.
Number | Date | Country
---|---|---
63142551 | Jan 2021 | US