Certain example embodiments described herein relate to machine learning systems and/or methods. More particularly, certain example embodiments described herein relate to systems and/or methods that perform improved, automated data cleansing for machine learning algorithms.
Machine learning is used in a wide variety of contexts including, for example, facial recognition, automatic search term/phrase completion, song and product recommendations, identification of anomalous behavior in computing systems (e.g., indicative of viruses, malware, hacking, etc.), and so on. Machine learning typically involves building a model from which decisions or determinations can be made. Building a machine learning application and the model that supports it oftentimes involves a significant amount of effort and experience, especially when trying to implement best practices in connection with model building.
Preprocessing for machine learning models frequently involves missing value imputation, feature normalization, data encoding, and/or other operations to help ensure that the collected data values conform to the requirements of the algorithm. As is known, data imputation refers generally to the process of replacing missing data with other (e.g., substituted) values; feature normalization refers generally to a technique used to standardize the range of independent variables or features of data; and data encoding refers generally to operations by which categorical variables are converted into numerical form for consumption by machine learning algorithms and/or similar conversions. Similar to feature normalization, data normalization is known for use in data processing and generally is performed during the data preprocessing step.
Referring once again to FIG. 1, accuracy checks oftentimes are performed here, and further feature engineering may be performed, e.g., in the event that the accuracy is unacceptable. Once a suitable accuracy has been reached, the model can be deployed in connection with the machine learning application and/or unknown data can be predicted (step S110).
Data collection as referred to in FIG. 1 provides the raw data that is then subjected to preprocessing data cleansing operations such as, for example, the missing value imputation, feature normalization, and data encoding operations noted above.
There are several methods for each of the preprocessing data cleansing operations listed above that can be chosen from and applied to the data, and different approaches are better suited to different kinds of data. As is known, each preprocessing operation, and even the particular technique selected to implement a given operation, can greatly influence the results of the machine learning algorithms.
To help understand problems associated with data cleansing, consider the following example, which involves a dataset about the salaries of different people who have different attributes. In this example, the following table includes data that can be used in model building, e.g., to predict the salary of a new employee.
As can be seen from the table above, as one example, the person named “User 1”, who is male, 28 years old, in the profession “Profession_A”, and has 5 years of experience, earns 50,000. A “NaN” entry denotes a missing value, i.e., information that is not available in the data. The columns “Name”, “Age”, “Gender”, “Profession”, and “Experience” are independent variables or features, and the “Salary” column is the target or dependent variable.
As alluded to above, the task is to build a model to help predict the salary of a new employee with certain specified attributes, based on the data in the table above. However, the raw data from the table above cannot be directly passed to a machine learning algorithm. The data needs to be preprocessed, as the machine learning algorithm in this example is designed to accept numerical data and cannot accept missing values or alphanumeric values as input.
Non-numeric data can be processed and then fed to the machine learning algorithms. To treat missing values for a numerical feature (e.g., for a column or independent variable), for example, instances with missing values can be removed, missing values can be replaced with a mean or median value, a value from another instance can be copied, etc. Each of these approaches to treating missing values can affect the performance of the final model. That is, the approach selected to impute the value directly influences the population of data (the total set of observations that can be made, in statistics terms) and, hence, directly influences the predictive power of the model, which refers to how well the model has learned the pattern in the training data to make predictions on new data with less error.
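By way of illustration, and not limitation, the following Python snippet (using the pandas library, consistent with other examples herein) sketches three of these missing value treatment options; the column name “Experience” is assumed purely for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Experience": [5.0, 3.0, np.nan, 10.0, np.nan, 7.0]})

# Option 1: remove instances (rows) that contain missing values.
dropped = df.dropna(subset=["Experience"])

# Option 2: replace missing values with the column mean.
mean_filled = df["Experience"].fillna(df["Experience"].mean())

# Option 3: replace missing values with the column median.
median_filled = df["Experience"].fillna(df["Experience"].median())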
In general, for cleaning a column that has class/categorical information (e.g., gender, family type, etc.), one-hot encoding, label encoding, and/or the like may be used as a data preprocessing approach. As is known, one-hot encoding is a process by which categorical variables are converted into a form that can be provided to machine learning algorithms to do a better job in prediction, and it generally involves the “binarization” of data. In terms of missing value imputation for such categorical columns, imputation with a frequently occurring class (e.g., the categorical mode), a new “other” class, and/or the like, may be used.
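By way of example and without limitation, the following snippet sketches both encoding approaches using pandas and scikit-learn; the “Gender” column is assumed for illustration:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Female"]})

# One-hot encoding: each class becomes its own binary (0/1) column.
one_hot = pd.get_dummies(df["Gender"], prefix="Gender")

# Label encoding: each class is mapped to a single integer code.
labels = LabelEncoder().fit_transform(df["Gender"])  # e.g., array([1, 0, 1, 0])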
Similarly, in general, for cleaning a column that has numerical values (e.g., salary, age, weight, etc.), numerical data preprocessing approaches such as scaling, standardization, and/or the like may be used. Standardization of datasets is a common requirement for many machine learning estimators, which might behave badly if the individual features do not more or less look like standard normally distributed data (e.g., a Gaussian distribution with zero mean and unit variance). In this vein, StandardScaler is a class in the Python scikit-learn (sklearn) library that can be used to standardize features by removing the mean and scaling to unit variance. In terms of missing value imputation for such numerical columns, a mean, median, some high value, a mode, a random value occurring in the dataset, etc., may be used.
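For instance, the following non-limiting snippet shows how the StandardScaler class may be applied to a numerical column such as age:

import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[28.0], [35.0], [42.0], [31.0]])  # one feature, one column

# Remove the mean and scale to unit variance, as described above.
ages_scaled = StandardScaler().fit_transform(ages)

# The result has (approximately) zero mean and unit variance.
print(ages_scaled.mean(), ages_scaled.std())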
In view of the foregoing, it will be appreciated that data cleansing is widely implemented as a highly manual task. And because people come up with many different ways to perform preprocessing of the data, it oftentimes is highly subjective as well, especially as the structure of the data becomes more complicated.
Some approaches work on the basis of identifying the dataset that is most similar to the new dataset, but a high degree of similarity will not always occur. Moreover, even when it can be assumed that the new dataset is most similar to a given reference dataset, applying the same preprocessing techniques to all of the columns might not yield the best possible results. For example, a column with name values and a column with gender values would be processed with the same preprocessing strategy, which is unlikely to produce good results. Approaches that focus on better accuracy tend to target hyper-parameter tuning more than identifying preprocessing techniques, which will not always produce well-trained models.
It will be appreciated that it would be desirable to overcome the above-identified and/or other problems. For example, it will be appreciated that it would be desirable to improve machine learning algorithms, e.g., by implementing an enhanced preprocessing approach.
One aspect of certain example embodiments relates to overcoming the above-described and/or other issues. For example, one aspect of certain example embodiments relates to improving machine learning algorithms, e.g., by implementing an enhanced preprocessing approach.
Another aspect of certain example embodiments relates to automating the selection of data cleansing preprocessing operations by considering such operations as a classification problem. In machine learning, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the “spam” or “non-spam” class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (e.g., gender, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In this way, certain example embodiments decide what preprocessing operations are to be taken individually for each column of the data, e.g., by training a classifier model on the descriptive information of the data columns. As will become clearer from the below, the approach of certain example embodiments is different from the state of the art, where data preprocessing for model building is different from the data correction and data quality management process.
In certain example embodiments, a machine learning system is provided. A non-transitory computer readable storage medium stores thereon a dataset having data from which a machine learning model is buildable. An electronic computer-mediated interface is configured to receive a query processable in connection with a machine learning model. Processing resources including at least one hardware processor operably coupled to a memory are configured to execute instructions to perform functionality comprising: accessing at least a portion of the dataset; for each of a plurality of independent variables in the accessed portion of the dataset: generating meta-features for the respective independent variable; providing, as input to at least first and second pre-trained classification models that are different from one another, the generated meta-features for the respective independent variable; receiving, as output from the first pre-trained classification model, an indication of one or more missing value imputation operations appropriate for the respective independent variable; and receiving, as output from the second pre-trained classification model, an indication of one or more other preprocessing data cleansing related operations appropriate for the respective independent variable; transforming the data in the dataset by selectively applying to the data the one or more missing value imputation operations and the one or more other preprocessing data cleansing-related operations, in accordance with the independent variables associated with the data; building the machine learning model based on the transformed data; and enabling queries received over the electronic interface to be processed using the built machine learning model.
According to certain example embodiments, the dataset is a database and the data thereof is stored in a tabular structure of the database, e.g., in which the independent variables correspond to different columns in the database. In some cases, all columns in the database will be treated as independent variables, except for a column including data of a type on which predictions are to be made in response to queries received over the electronic interface.
According to certain example embodiments, the generated meta-features for a given independent variable include basic statistics for the data associated with that independent variable and/or an indication as to whether a seeming numerical variable likely is a categorical variable. With respect to the latter, in some instances and for a given independent variable, the indication as to whether a seeming numerical variable likely is a categorical variable may be based on a determination as to whether a count of the unique data entries thereof divided by the total number of data entries is less than a threshold value.
According to certain example embodiments, the first and/or second pre-trained classification models may be able to generate output indicating that no operations are appropriate for a given independent variable.
According to certain example embodiments, the first and second pre-trained classification models may be generated independently from one another yet may be based on a common set of meta-features generated from at least one training dataset.
According to certain example embodiments, the at least one training dataset may be different from the dataset stored on the non-transitory computer readable storage medium. In some cases, independent variables in the at least one training dataset may have one or more missing value imputation operations and one or more other preprocessing data cleansing-related operations manually assigned thereto.
In addition to the features of the previous paragraphs, counterpart methods, non-transitory computer readable storage media tangibly storing instructions for performing such methods, executable computer programs, and the like, are contemplated herein, as well.
These features, aspects, advantages, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.
These and other features and advantages may be better and more completely understood by reference to the following detailed description of exemplary illustrative embodiments in conjunction with the drawings, of which:
Certain example embodiments described herein relate to systems and/or methods for automating the selection of data cleansing operations for a machine learning algorithm at the preprocessing stage, using a classification approach typically used in more substantive machine learning processing. Certain example embodiments automatically choose the kind of preprocessing operations needed to make the data acceptable to machine learning algorithms. In certain example embodiments, it becomes feasible to predict the data cleansing operations for a particular column or for a complete dataset very quickly, which helps improve performance at the preprocessing phase in an automatic manner that removes subjectivity and does not require reliance on the accuracy values of the model performance.
Certain example embodiments implement powerful classification algorithms and leverage manually prepared data to train them. Classification algorithms have already proven their proficiency in learning patterns within data. Thus, in some instances, it is reasonable to treat the data prepared for the training as already encoding the information that a data scientist would use to decide what preprocessing operations need to be taken for the data columns.
Details concerning an example implementation are provided below. It will be appreciated that this example implementation is provided to help demonstrate concepts of certain example embodiments, and aspects thereof are non-limiting in nature unless specifically claimed. For example, descriptions concerning example code, classifiers, classes, functions, data structures, data sources, etc., are non-limiting in nature unless specifically claimed.
Certain example embodiments involve data cleansing being performed in two independent tasks, namely, missing value imputation and selection of preprocessing steps.
To help explain how this may be done, consider once again the example dataset provided in the Background and Summary section, above. To prepare the meta-features of the data, Python's pandas library “describe()” function was used to generate standard meta-features. Other meta-features were derived as well. The following table provides an overview of the generated and derived meta-features.
The “dtypea” column does not come from Python's inbuilt libraries or functions. Instead, it is logic implemented in certain example embodiments that has been built to handle special cases and to improve the accuracy of the model. It can be considered a part of feature engineering in the model-building exercise. This column in essence helps capture those instances where the data provided is numerical but carries its information in accordance with a categorical variable. For example, sometimes a data column like gender will be coded numerically, e.g., with 0 representing “male” and 1 representing “female”. In this particular scenario, by the data type definition, Python will consider it a numerical variable. However, the “dtypea” value will essentially serve as a flag, enabling this kind of data to be identified and treated like a categorical variable instead of a numerical variable (which is its original data type). To derive “dtypea” as in the table above, the following example program logic may be used:
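(The function name and the use of the pandas library in this sketch are provided by way of example only.)

import pandas as pd
from pandas.api.types import is_numeric_dtype

def derive_dtypea(column: pd.Series, threshold_value: float) -> str:
    """Illustrative derivation of the 'dtypea' meta-feature described above."""
    if is_numeric_dtype(column):
        # A numeric column with few unique values relative to its total
        # number of rows likely encodes a categorical variable (e.g., a
        # gender column coded as 0/1).
        if column.nunique() / len(column) < threshold_value:
            return "categorical"
        return "numerical"
    return "categorical"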
In this example, “thresholdValue” is an empirical value and is calculated as the ratio of the maximum number of classes in a column to the number of rows (max number of classes/number of rows).
In the table above, “medMean” is the difference between the mean and median values of a column. It also does not come from Python's in-built libraries but instead is derived from this simple mathematical formula. This variable is developed through feature engineering, helps provide information concerning the spread of the data, and can be used to help in deciding on an appropriate missing value imputation approach for numerical data columns. Generally, a data scientist can use this information to decide which value should be used to fill missing values via imputation, e.g., depending on the difference of the values.
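One non-limiting way to derive “medMean” and use it in such a decision is sketched below; the tolerance value and the function names are illustrative assumptions only:

import pandas as pd

def med_mean(column: pd.Series) -> float:
    # The "medMean" meta-feature: mean minus median.
    return column.mean() - column.median()

def suggest_numeric_imputation(column: pd.Series, tolerance: float = 0.1) -> str:
    # Illustrative heuristic: a large mean/median gap relative to the
    # column's spread suggests skewed data, for which the median is the
    # more robust fill value; otherwise, the mean may be used.
    spread = column.std()
    if pd.isna(spread) or spread == 0:
        spread = 1.0
    return "median" if abs(med_mean(column)) / spread > tolerance else "mean"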
To prepare the meta-features of the dataset above, the example code set forth in the Code Appendix may be used. The sample of the training dataset, following step S404 in FIG. 4, includes these generated and derived meta-features.
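By way of a non-limiting sketch, two such classifiers may be trained on these meta-features as follows; the file name, the use of random forests, and the exact label columns are illustrative assumptions consistent with the description herein:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# One row of meta-features per training dataset column, with manually
# assigned labels "Target_M" (missing value imputation operation) and
# "Target_P" (other preprocessing operation). The file name is hypothetical.
training_meta = pd.read_csv("training_meta_features.csv")

features = training_meta.drop(columns=["Target_M", "Target_P"])

# Two independent classifiers, one per prediction task. The choice of
# random forests here is illustrative only.
model_m = RandomForestClassifier().fit(features, training_meta["Target_M"])
model_p = RandomForestClassifier().fit(features, training_meta["Target_P"])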
The application of the trained models to predict the missing value imputation and preprocessing operations (“Target_M” and “Target_P”, respectively) to be applied to independent variables in a dataset will now be demonstrated. In this regard, FIG. 7 shows an example prediction process.
In step S704, meta-features of the data are extracted. This leads to a table of meta-features for each independent variable (column) of the new dataset.
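Continuing the illustrative sketch above (with model_m and model_p as the previously trained classifiers), the per-column prediction may proceed along the following lines; the simplified meta_features_of helper is a hypothetical stand-in for the fuller extraction logic in the Code Appendix:

import pandas as pd

def meta_features_of(column: pd.Series) -> pd.DataFrame:
    # Simplified stand-in for the meta-feature extraction in the Code
    # Appendix; returns a single-row frame of basic statistics. In practice,
    # the same meta-feature extraction used at training time must be applied.
    return pd.DataFrame([{
        "count": column.count(),
        "nunique": column.nunique(),
        "pct_missing": column.isna().mean(),
    }])

new_data = pd.read_csv("new_dataset.csv")  # hypothetical input file

for column_name in new_data.columns:
    meta = meta_features_of(new_data[column_name])
    imputation_op = model_m.predict(meta)[0]     # e.g., "mean" or "mode"
    preprocessing_op = model_p.predict(meta)[0]  # e.g., "one_hot_encoding"
    print(column_name, imputation_op, preprocessing_op)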
As will be appreciated from the foregoing, the output from the algorithm is correct with respect to how the models have been trained. This approach as a whole advantageously helps to automate the data cleansing process in a faster, less subjective, more predictable way. Moreover, certain example embodiments advantageously can be extended to predict and implement additional types of preprocessing and/or data imputation approaches, e.g., to help increase the effectiveness of the approach as needed and/or desired. Once the data cleansing approaches are determined, they can be applied to the data in the datasets as appropriate. Finally, the models can be reliably trained and reliably used for future machine learning applications.
Although certain example embodiments are described as having data coming from a database with a table structure and with database columns providing variables, it will be appreciated that other example embodiments may retrieve data and/or process data from other sources. XML, JSON, and/or other stores of information may serve as data sources in certain example embodiments. In these and/or other structures, independent and/or dependent variables may be explicitly or implicitly defined by labels, tags, and/or the like.
It will be appreciated that the machine learning system described herein may be implemented in a computing system (e.g., a distributed computing system) comprising processing resources including at least one hardware processor and a memory operably coupled thereto, and a non-transitory computer readable storage medium tangibly storing the dataset(s), pre-trained classification models, etc. The non-transitory computer readable storage medium may store the finally built machine learning model, and that finally built machine learning model may be consulted to respond to queries received over an electronic, computer-mediated interface (e.g., an API, web service call, and/or the like). The queries may originate from remote computing devices (including their own respective processing resources) and applications residing thereon and/or accessible therethrough. Those applications may be used in connection with any suitable machine learning context, including the example contexts discussed above. The processing resources of the machine learning system may be responsible for generation of the pre-trained classification models, execution of code for meta-feature generation, generation of the finally built machine learning model, etc.
In this regard, it will be appreciated that as used herein, the terms system, subsystem, service, engine, module, programmed logic circuitry, and the like may be implemented as any suitable combination of software, hardware, firmware, and/or the like. It also will be appreciated that the storage locations, stores, and repositories discussed herein may be any suitable combination of disk drive devices, memory locations, solid state drives, CD-ROMs, DVDs, tape backups, storage area network (SAN) systems, and/or any other appropriate tangible non-transitory computer readable storage medium. Cloud and/or distributed storage (e.g., using file sharing means), for instance, also may be used in certain example embodiments. It also will be appreciated that the techniques described herein may be accomplished by having at least one processor execute instructions that may be tangibly stored on a non-transitory computer readable storage medium.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
The following is example code in the Python language that can be used to generate the meta-features of the data in certain example embodiments. It will be appreciated that other programming languages and/or approaches may be used in different example embodiments and that this example code is provided by way of example and without limitation, unless explicitly claimed.
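The sketch below is one illustrative possibility, in which the function name, the chosen statistics, and the threshold value are assumptions consistent with the description above:

import pandas as pd
from pandas.api.types import is_numeric_dtype

THRESHOLD_VALUE = 0.05  # empirical; see the discussion of "thresholdValue" above

def generate_meta_features(df: pd.DataFrame) -> pd.DataFrame:
    # Build one row of meta-features per column, combining describe()-style
    # statistics with the derived "dtypea" and "medMean" meta-features.
    rows = []
    for name in df.columns:
        column = df[name]
        meta = {
            "column": name,
            "count": column.count(),
            "nunique": column.nunique(),
            "pct_missing": column.isna().mean(),
        }
        if is_numeric_dtype(column):
            meta["mean"] = column.mean()
            meta["std"] = column.std()
            meta["min"] = column.min()
            meta["max"] = column.max()
            meta["medMean"] = column.mean() - column.median()
            is_coded_categorical = column.nunique() / len(column) < THRESHOLD_VALUE
            meta["dtypea"] = "categorical" if is_coded_categorical else "numerical"
        else:
            meta["medMean"] = None
            meta["dtypea"] = "categorical"
        rows.append(meta)
    return pd.DataFrame(rows)

# Example usage with a small, hypothetical dataset:
if __name__ == "__main__":
    example = pd.DataFrame({
        "Age": [28, 35, None, 42],
        "Gender": ["Male", "Female", "Male", None],
    })
    print(generate_meta_features(example))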