The present disclosure, in some embodiments thereof, relates to data exploration and visualization and, more specifically, but not exclusively, to methods and systems for automatically identifying, in a dataset, insufficient data for learning, or records with anomalous combinations of feature values, by partitioning the data space into human-interpretable regions.
The feature space of a dataset is the product of the ranges (if numeric) or sets of values (if categorical) of the dataset features. An example with three features is {STATE in {AL, . . . , WY}} & {AGE >= 35} & {$0<=INCOME<=$500,000}. Due to domain-specific or other factors, while this feature space circumscribes (contains within it) all dataset observations, the dataset observations may be unevenly spread within the potential feature space. For instance, there may be few people with high incomes while most people have low or moderate incomes. This means different areas of the potential space have differing observation density, and many areas may be empty. Being able to describe the feature space according to observation density is a basic task for data exploration and conceptualization. It can also be useful when the data is to be used for a learning task, as the density of a given area can differentially affect a machine learning (ML) model's accuracy when used for training or testing.
It is an object of the present disclosure to describe a system and a method for automatically identifying, in a dataset, insufficient data for learning, or records with anomalous combinations of feature values, by partitioning the data space into human-interpretable regions.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
In one aspect, the present disclosure relates to a computerized method for automatically identifying in a dataset insufficient data for learning, or records with anomalous combinations of feature values, by partition of numeric and/or categorical data space into human-interpretable regions, comprising:
receiving a dataset of numeric and/or categorical features with a plurality of observations;
calculating observation density for each observation according to a distance- or anomaly-based metric, and receiving a density measurement representing a density of each observation;
partitioning the dataset along the numeric and/or categorical features according to the density measurement of each observation by a perpendicular cut along the feature spaces, receiving a map of a plurality of hyper-rectangular shapes representing various levels of density including empty spaces;
displaying the received map of the plurality of hyper-rectangular shapes, being human-interpretable regions, on a Graphic user interface, GUI, wherein the plurality of hyper-rectangular shapes are selectable and present information about the selected hyper-rectangular shape's level of density when selected by a user.
In a second aspect, the present disclosure relates to a system for automatically identifying in a dataset insufficient data for learning, or records with anomalous combinations of feature values, by partition of numeric and/or categorical data space into human-interpretable regions, comprising:
a processor executing a code, adapted to:
a graphic user interface, GUI, controlled by the processor, which displays the map with the plurality of hyper-rectangular shapes, being human-interpretable regions, wherein the plurality of hyper-rectangular shapes are selectable and present information about the selected hyper-rectangular shape's level of density when selected by a user.
In a third aspect, the present disclosure relates to a computer program product for automatically identifying in a dataset insufficient data for learning, or records with anomalous combinations of feature values, by partition of numeric and/or categorical data space into human-interpretable regions, the computer program product comprising:
In a further implementation of the first, second and third aspects, machine learning is applied to the dataset, which comprises:
insufficient data for a trained machine learning model to provide high accuracy results; or
records with anomalous combinations of feature values, which indicate errors in the machine learning model results or errors in the input data.
In a further implementation of the first aspect, the method further comprises:
calculating additional metrics of the volume spanned by the hyper-rectangular shapes, and of the deviation of the fraction of observations contained in a hyper-rectangular shape from the fractional volume contained in the hyper-rectangular shape, to measure how unevenly the observations are spread in a multi-dimensional space.
In a further implementation of the first aspect, partitioning is done so that the partitions differ as much as possible among themselves in density.
In a further implementation of the first aspect, calculating observation density for each observation is done by clustering the dataset of numeric and/or categorical features using Ordering Points To Identify the Clustering Structure, OPTICS, with Gower's metric.
In a further implementation of the first aspect, calculating observation density for each observation is done by calculating an anomaly score for each observation, where a higher anomaly score corresponds to lower density.
In a further implementation of the first aspect, the anomaly-based metric is Isolation Forests, IF.
In a further implementation of the first aspect, partitioning the dataset along the numeric and/or categorical features is done by a regression decision tree.
In a further implementation of the first aspect, partitioning the dataset along the numeric and/or categorical features is recursive.
In a further implementation of the first aspect, the perpendicular cut is successive.
In a further implementation of the first aspect, a given percentage p of the highest values of a target yi is considered as indicating outliers, and the observations xi with the highest p percent of targets yi are omitted before defining S and conducting the partition.
In a further implementation of the first aspect, a given percentage p of the lowest values of a target yi is considered as indicating outliers, and the observations xi with the lowest p percent of targets yi are omitted before defining S and conducting the partition.
In a further implementation of the first aspect, an empty space is searched for in two locations:
In a further implementation of the first aspect, an internal empty space is found in a given region, which splits the region into at least one empty region and at least two new non-empty regions.
In a further implementation of the first aspect, machine learning is applied to the dataset, which comprises:
insufficient data for a trained machine learning model to provide high accuracy results; or
records with anomalous combinations of feature values, which indicate errors in the machine learning model results or errors in the input data.
In a further implementation of the second aspect, the processor is further adapted to:
calculate additional metrics of the volume spanned by the hyper-rectangular shapes, and of the deviation of the fraction of observations contained in a hyper-rectangular shape from the fractional volume contained in the hyper-rectangular shape, to measure how unevenly the observations are spread in a multi-dimensional space.
In a further implementation of the third aspect, machine learning is applied to the dataset, which comprises:
insufficient data for a trained machine learning model to provide high accuracy results; or
records with anomalous combinations of feature values, which indicate errors in the machine learning model results or errors in the input data.
In a further implementation of the third aspect, the computer program product further comprises:
program instructions to calculate additional metrics of the volume spanned by the hyper-rectangular shapes, and of the deviation of the fraction of observations contained in a hyper-rectangular shape from the fractional volume contained in the hyper-rectangular shape, to measure how unevenly the observations are spread in a multi-dimensional space.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
The present disclosure, in some embodiments thereof, relates to data evaluation and presentation and, more specifically, but not exclusively, to methods and systems for automatically identifying, in a dataset, insufficient data for learning, or records with anomalous combinations of feature values, by partitioning the data space into human-interpretable regions.
Data exploration and analysis algorithms usually relate to the data existing in the dataset. However, they do not address insufficient data, lack of data, or absence of data.
In addition, the task of conducting data exploration, visualization, and analysis, on a structured feature dataset is challenging. The difficulties often increase when the data is high-dimensional and has mixed feature types (numeric, nominal categorical, ordinal categorical). Furthermore, it is often difficult to provide analyses that are in a meaningful format for a human user to interpret and gain insights into the data.
There are methods for grouping dataset records by density. However, clustering methods do not group the data into subsets whose shape or nature is human-interpretable. In particular, such groupings are hard to visualize when the data has more than a few dimensions, or contains a mix of numeric and categorical features.
There is therefore a need for a method and system for identifying, in a dataset, insufficient data for learning or records with anomalous combinations of feature values, which provide human-interpretable results.
The present disclosure, in some embodiments thereof, describes a system and a method, which identify in a dataset of arbitrary feature dimension size, insufficient data for learning, or records with anomalous combinations of feature values, where the features may be of mixed type, by partitioning the feature space into regions according to the relative density of observed points. The regions are presented in a form that is intuitive for human interpretation. From these regions, a user can understand where in the potential space most of the data are. According to some embodiments of the present disclosure, the method also finds empty spaces of the same form, where data records are not observed to exist. These may be empty due to domain-specific feature constraints (i.e., a knowledgeable person would expect them to be empty), but they may still be of interest to the user.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer and/or computerized device, partly on the user's computer and/or computerized device, as a stand-alone software package, partly on the user's computer (and/or computerized device) and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer and/or computerized device through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to
Reference is now made to
At 202, code 102 is executed by processor 101, and observation density is calculated for each observation according to a distance- or anomaly-based metric, receiving a density measurement representing the density of each observation. According to some embodiments of the present disclosure, the calculation of each observation density is done by clustering the dataset of numeric and/or categorical features using, for example, Ordering Points To Identify the Clustering Structure (OPTICS) with Gower's metric. Another option for calculating the observation density may be to calculate an anomaly score for each observation, where a higher anomaly score corresponds to lower density. An example of an anomaly-based metric that may be used is Isolation Forests (IF).
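As a minimal sketch of the anomaly-based option (assuming scikit-learn; illustrative rather than the claimed implementation), an Isolation Forest's negated score_samples output can serve as the density target yi, since score_samples is higher for more "normal" points:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 dense inliers around the origin plus one clear outlier
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[8.0, 8.0]]])

# score_samples is higher for normal points, so its negation gives a
# target y_i where a higher value means more anomalous, i.e., lower density
forest = IsolationForest(random_state=0).fit(X)
y = -forest.score_samples(X)

print(y.shape)  # one density target per observation
```

Under this sketch, the appended outlier receives a visibly higher target than typical inliers, matching the intended monotone relation between yi and sparsity.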
According to some embodiments of the present disclosure, at 203, the dataset is partitioned along the numeric and/or categorical features according to the density measurement of each observation by perpendicular cuts along the feature space, and a map of a plurality of hyper-rectangular shapes representing various levels of density, including empty spaces, is received. Optionally, the partition along the numeric and/or categorical features may be recursive, and the perpendicular cut may be successive. The partition of the dataset along the numeric and/or categorical features may be done, for example, by a regression decision tree. The map with the plurality of hyper-rectangular shapes, being human-interpretable regions, is displayed at 204 on a display of a Graphic User Interface, GUI. The plurality of hyper-rectangular shapes are selectable and present information about the selected hyper-rectangular shape's level of density when selected by a user. At 205, optionally, additional metrics of the volume spanned by the hyper-rectangular shapes, and of the deviation of the fraction of observations contained in a hyper-rectangular shape from the fractional volume contained in the hyper-rectangular shape, are calculated, to measure how unevenly the observations are spread in a multi-dimensional space.
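Steps 202 and 203 can be sketched end-to-end, assuming scikit-learn estimators (the names and parameters here are illustrative, not part of the disclosure): an anomaly score stands in for the density target, and a regression tree's leaves realize the perpendicular cuts into hyper-rectangular regions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# two dense clusters separated by an empty band
X = np.vstack([rng.normal(-5.0, 0.5, size=(150, 2)),
               rng.normal(5.0, 0.5, size=(150, 2))])

# step 202: anomaly-based density target (higher = lower density)
y = -IsolationForest(random_state=0).fit(X).score_samples(X)

# step 203: recursive perpendicular cuts via a regression tree on y;
# each leaf of the tree is a hyper-rectangular region of similar density
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)  # leaf (region) id per observation
regions, counts = np.unique(leaf_ids, return_counts=True)
print(len(regions), sorted(counts))
```

Each leaf id corresponds to one region of the density map; the per-region counts feed the optional metrics of step 205.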
According to some embodiments of the present disclosure, the partition is done so that the partitions differ as much as possible among themselves in density. That is, it is desirable, if possible, to have a set of partitions where some are dense, some sparse, and some empty, rather than a set of partitions that are more similar in terms of the density of observations in them. This is done without having to specify a grid or discretization on numeric features, and can handle different feature types together. According to some embodiments of the present disclosure, the resulting partitions are interpretable to humans and easily defined mathematically, such that they may be easily mapped to another dataset. The criterion for being “human-interpretable” is that partition definitions be defined as a conjunction of ranges (numeric) or sets of values (categorical), and create a hyper-rectangular shape, which is intuitive for humans to understand. In addition, the results of the method of the present disclosure tell the user where the data are not, by providing empty spaces, which contain no data. The empty spaces are also informative and provide useful information to the user. For example, in the case of a trained machine learning model which received an input dataset with empty spaces, the user may infer that the dataset contains insufficient data for the trained machine learning model to provide high accuracy results. Alternatively, the user may infer that the dataset contains records with anomalous combinations of feature values, which indicate errors in the machine learning model results or errors in the input data. According to some embodiments of the present disclosure, the code 102 executed by processor 101 is able to handle numeric and categorical features together without needing a pre-gridding of the feature space, in contrast to other methods, which handle numeric features only and require a pre-gridding of the feature space.
According to some embodiments of the present disclosure, an example for implementing the method described herein may be by applying the following steps iteratively: calculating a numeric target y={yi}, i=1, . . . , n, that serves as an approximation for the density of observation i; then, using regression trees with target y to partition the feature space S on features F1, . . . , Fp, and carving out empty space along the way from the newest resulting split.
According to some embodiments of the present disclosure, a numeric target yi is required for each observation xi, which represents this observation's multivariate density within the feature space S. Alternatively, yi may also be some score which does not directly measure density but is associated with some attributes of density. Observations with high density should have many other points within a small neighborhood, while sparse points should have relatively few neighbors or be surrounded by more empty space. yi then serves as the target for the partition. As such, it should be approximately monotonically increasing or decreasing with the density of xi; that is, a higher value can represent higher density, or more anomalousness, which should correspond to lower density. There are several methods that may be used. One is to use a distance-based clustering algorithm, such as OPTICS (Ordering Points To Identify the Clustering Structure), where a numeric output such as the core distance of xi may be used as yi; the cluster identifications are not used. In OPTICS, the core distance is the distance from xi to its mth closest neighbor, where m is the (user-specified) minimum number of observations within an ε-radius neighborhood of xi for it to be considered a core point. Thus, a lower core distance should indicate a higher density of xi. Gower distance is a metric which calculates the multivariate distance between observations xi and xj as the average of their feature-wise distances. The distances may be tailored to the feature type (e.g., range-normalized Manhattan distance for numeric, Dice coefficient for nominal categorical, or Manhattan distance for ordered categorical), which means it may apply to mixed data types. If Gower distance is used as the distance metric in, for example, OPTICS, the core distance (i.e., yi) may be considered a distance-based density measure of observation xi.
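A hedged sketch of the distance-based option follows. The Gower matrix here is a simplified hand-rolled version for one numeric and one nominal feature (range-normalized Manhattan plus 0/1 mismatch, averaged), and scikit-learn's OPTICS consumes it as a precomputed metric; its core_distances_ attribute then provides yi. This is illustrative, not the claimed implementation.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(2)
num = rng.normal(0.0, 1.0, size=(60, 1))   # one numeric feature
cat = rng.integers(0, 2, size=(60, 1))     # one nominal categorical feature

# simplified Gower distance: range-normalized Manhattan for the numeric
# feature, 0/1 mismatch for the categorical feature, averaged per pair
d_num = np.abs(num - num.T) / (num.max() - num.min())
d_cat = (cat != cat.T).astype(float)
gower = (d_num + d_cat) / 2.0

# OPTICS on the precomputed Gower matrix; core_distances_ holds the
# distance to the m-th nearest neighbor (m = min_samples), used as y_i
optics = OPTICS(min_samples=5, metric="precomputed").fit(gower)
y = optics.core_distances_
print(y.shape)  # one distance-based density target per observation
```

A lower core distance marks a denser observation, so this yi decreases with density, the mirror image of the anomaly-score variant.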
Depending on the implementation, clustering-based methods can also scale poorly with n in terms of computational complexity. An alternative to obtaining yi by clustering with an appropriate distance-based metric is to use an anomaly score. Here, a higher anomaly score should correspond to lower density, but it may not correspond directly or proportionately to distance-based sparsity. One such anomaly scoring method is Isolation Forests (IF), which builds a forest (ensemble) of trees on subsets of the features. The trees perform binary splits on the ranges of the features to isolate observations. The fewer splits required to isolate an observation xi, the more anomalous it is. The anomaly score (normalized to [0, 1]) may be used as yi, so that a higher score indicates lower density. IF is very fast and computationally light. It is important to note that since the feature space S=∩pj=1 dom(Fj) (where dom(Fj) denotes the domain of Fj) is the bounding hyper-rectangular shape of all observations in D, its definition is sensitive to outliers if they affect the boundary points of the domain of a feature. For example, suppose F1 is INCOME, and the current domain is [$0, $200,000]. If a new observation is added with F1=$1,000,000, dom(F1) grows five-fold. Assuming none of the other feature domains are affected, S now grows five-fold along the F1 dimension. Since V(S) must always be 1, this means this single observation has created an empty region {$200,000<INCOME<$1,000,000} of volume approximately 0.8, and so the non-empty regions built on D previously would now shrink by approximately a factor of 5. Such outliers will tend to receive a score yi that indicates high sparsity or anomalousness.
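The INCOME domain-inflation arithmetic above can be checked directly (numbers taken from the example in the text):

```python
# Normalized length of the INCOME dimension before and after the outlier
old_lo, old_hi = 0, 200_000   # original dom(F1) = [$0, $200,000]
new_hi = 1_000_000            # single new observation at $1,000,000

growth = (new_hi - old_lo) / (old_hi - old_lo)
print(growth)  # dom(F1) grows five-fold

# Fraction of the new (normalized) F1 range occupied by the empty region
# {$200,000 < INCOME < $1,000,000}
empty_fraction = (new_hi - old_hi) / (new_hi - old_lo)
print(empty_fraction)  # 0.8 of the F1 dimension is now empty
```

This confirms that one extreme observation can dominate the normalized volume along its feature, motivating the trimming discussed next.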
According to some embodiments of the present disclosure, to make the partition more robust, it may be wise to treat a given percentage p of the highest values of the target yi as indicating outliers, and to omit the observations xi with the highest p percent of targets yi before defining S and conducting the partition. According to some other embodiments, a given percentage p of the lowest values of the target yi is treated as indicating outliers, and the observations xi with the lowest p percent of targets yi may be omitted before defining S and conducting the partition. For example, the highest 1% of sparsity scores may be omitted before defining S and conducting the partition. If this million-dollar income observation is unique in the dataset, including it in the partition may make the results non-robust, and so it may be dropped. However, when, for example, 5% of the observations have incomes of $1,000,000, and the next highest income is $200,000, these high earners will likely be neighbors of each other in S, giving them less extreme density targets than otherwise. Even though they are unusual relative to the other observations, some of them will likely be included in D even if, for example, the sparsest 1% are trimmed. Trimming the sparsest observations can affect observations not on the boundaries of S if, for example, they are surrounded by relatively empty space. In this case, trimming them may give a more parsimonious representation of the empty space than if the partition has to ‘cut around’ these observations. This is similar to the general decision of how many outliers to trim from a sample when estimating the population distribution, to make the estimate robust to outliers. When points that are unusual but not very extreme are trimmed, the resulting estimate may not be accurate, since these unusual points in fact do describe the distribution.
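A minimal sketch of the trimming step, assuming numpy and stand-in targets (illustrative only): observations whose sparsity target falls in the highest p percent are dropped before S is defined.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(1000, 3))  # observations x_i
y = rng.random(1000)                      # stand-in sparsity targets y_i

p = 1.0                                   # trim the sparsest p percent
cutoff = np.percentile(y, 100.0 - p)
keep = y <= cutoff                        # drop the highest-y observations

X_trimmed, y_trimmed = X[keep], y[keep]
print(X_trimmed.shape)  # roughly 990 of 1000 observations remain
```

Trimming the lowest p percent instead (the other embodiment) only flips the comparison to `y >= np.percentile(y, p)`.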
According to some embodiments of the present disclosure, once there is a numeric target y which serves as a measurement associated with some attributes of the density, a density-based partition of S may be obtained by a model mapping D→y. If the model performs hierarchical binary splits on the input features F1, . . . , Fp, this yields interpretable rectangular shapes on these features. One such technique is the regression tree. Regression trees construct binary trees on the input features, which at each node conduct a binary split on one of the features (or on a one-hot encoding column corresponding to one level of a nominal categorical feature) such that the mean squared error (or a similar metric) of the numeric target {yi} for observations xi is minimized given the choice of split on the range of the chosen feature Fj.
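A hedged sketch of recovering the hyper-rectangular region definitions from a fitted scikit-learn regression tree follows. The walk over `tree_`'s `children_left`, `children_right`, `feature`, and `threshold` arrays relies on sklearn's documented tree internals; the data and helper name are illustrative, not the claimed implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.random((300, 2))
# density proxy: higher on one half of the first feature's range
y = (X[:, 0] > 0.5).astype(float) + rng.normal(0, 0.01, 300)

tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, y)
t = tree.tree_

def leaf_boxes(node=0, bounds=None):
    """Recursively collect (lower, upper) bounds per feature for each leaf."""
    if bounds is None:
        bounds = [(-np.inf, np.inf)] * X.shape[1]
    if t.children_left[node] == -1:  # leaf: one hyper-rectangular region
        return [list(bounds)]
    f, thr = t.feature[node], t.threshold[node]
    left = list(bounds); left[f] = (bounds[f][0], min(bounds[f][1], thr))
    right = list(bounds); right[f] = (max(bounds[f][0], thr), bounds[f][1])
    return (leaf_boxes(t.children_left[node], left)
            + leaf_boxes(t.children_right[node], right))

boxes = leaf_boxes()
print(len(boxes), "hyper-rectangular regions")
```

Each returned box is a conjunction of per-feature ranges, i.e., exactly the human-interpretable region definition described above.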
According to some embodiments of the present disclosure, code 102 executed by processor 101 is able to carve out regions that represent empty space in S. An empty space is searched for in two locations:
1. only on the feature and split value used by the regression tree at that step;
2. after splitting, at any feature and location of the ‘outside’ of the observed points in that region.
According to some other embodiments of the present disclosure, another heuristic would be to find internal empty space in a given region, which would split the region into at least one empty region and at least two new non-empty regions.
According to some embodiments of the present disclosure, after carving out empty space from splits, a heuristic is also employed to carve out empty space anywhere on the ‘outside’ of a given region Sk after it is formed from a regression tree split, as opposed to specifically at the regression tree split feature and value. Code 102 executed by processor 101 determines the empty space between the boundaries of Sk and the boundaries of the observed {xi: xi∈Sk}. For instance, in the above, S2={207<F1≤237}, but the observed values span only [216, 233]. The empty boundary space for this feature in region S2 is s′2,1=(207, 216) ∪ (233, 237]. For real-valued, integer-valued, or ordered categorical features, this empty space may be a union of two sets. If p≥2, this trimming can occur on any of the features, while the first form of trimming occurs only on the feature used to split the nodes, and only at that split threshold. The (potentially zero-size) boundary gaps s′k,j in region Sk are calculated for each feature Fj, j=1, . . . , p. If any have L(s′k,j)>minL, these are iteratively trimmed in decreasing order of size, resulting in an empty space of Sk∩s′k,j, and Sk is redefined as Sk\(Sk∩s′k,j), subtracting the empty space. That is, the interval sk,j, which defines Sk on feature Fj, is redefined as sk,j\s′k,j. For instance, S2 could then be re-defined as the smaller S2={216≤F1≤233}, and two new empty regions, {207<F1<216} and {233<F1≤237}, would be added, since the empty gap S2∩s′2,1 consists of non-contiguous intervals. In each iteration, before trimming, the region Sk=∩pj=1 sk,j. Carving on Fj results in a trimmed region Sk=(∩i≠j sk,i)∩(sk,j\s′k,j) and one or two (if s′k,j is non-contiguous) empty region(s) (∩i≠j sk,i)∩(s′k,j), which have the same definition on all features except j. If, before trimming, sk,j≠dom(Fj), then Sk was already defined on Fj. Hence, region Sk's dimension does not increase after trimming, and also equals the dimension of the empty region.
Otherwise, if sk,j=dom(Fj) before trimming, its dimension would increase by 1, and so would the empty space. Thus, for trimming to occur at each iteration, two conditions must be met:
1. The dimension of Sk must remain ≤p* after trimming. That is, it must have either been <p*, or, if the dimension was =p*, it must have been defined on Fj.
2. The resulting empty region must be large enough on all dimensions. That is, there must be L(sk,i)>minL, ∀i≠j, and L(s′k,j)>minL as well. This restriction applies only to the empty space; recall that no such restriction is put on the regression tree when forming the regions initially. If, for instance, the resulting Sk is re-defined such that L(sk,j\s′k,j)≤minL (it is ‘narrow’ along feature Fj), this is fine.
In the example of S2 above, defined only on F1, L(s′2,1) is too small to trim, so S2 is left as is. Note that for simplicity, the illustrations of regression trees and empty space carving have used a univariate numeric case, but the same calculations of empty space carving can be done on categorical features as well. For instance, suppose there are two features F1 and F2, where F2 is LOCATION, with dom(F2)={North, East, South}. S is first partitioned only on F1 into regions S1 and S2. That is, S1=s1,1 and S2=s2,1, which are complementary subsets of dom(F1). Even though S1 is not defined on F2 (i.e., currently s1,2=dom(F2)), assume S1 only contains observations with F2∈{East, North} (all South individuals are in S2). There is therefore an empty gap in this region, s′1,2={F2∈{South}}, where the complement is s1,2\s′1,2={North, East}. That is, S1 could be defined on F2 as well, since the data it contains do not span the full dom(F2). This gap has length L(s′1,2)=1/3, since it contains one of three possible values of LOCATION, and forms a 2-dimensional empty region defined as S3=S1∩s′1,2. If the maximum dimension p*>1, and if S3 is large enough on all sides (both L(s1,1), L(s′1,2)>minL), then S3 (South individuals satisfying S1, which are unobserved in D) becomes a new empty region, defined as S3=s3,1∩s3,2, where s3,1=s1,1 and s3,2=s′1,2. S1 is re-defined as S1\(S1∩s′1,2)=s1,1∩(s1,2\s′1,2) (narrowing S1 to omit South individuals, a combination that is not observed in D).
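The numeric boundary-gap computation described above can be sketched as follows, using the (207, 237] interval and observed span [216, 233] from the earlier S2 example; the observed values and the minL threshold here are illustrative.

```python
import numpy as np

# Region S2 is defined on F1 as (207, 237]; observed values span [216, 233]
region_lo, region_hi = 207.0, 237.0
observed = np.array([216.0, 221.5, 228.0, 233.0])

x_min, x_max = observed.min(), observed.max()
# Empty boundary gaps s'_{2,1} = (207, 216) and (233, 237]
gaps = [(region_lo, x_min), (x_max, region_hi)]
gap_lengths = [hi - lo for lo, hi in gaps]

min_len = 2.0  # minL: only carve gaps longer than this threshold
carved = [g for g, length in zip(gaps, gap_lengths) if length > min_len]
print(carved)  # gaps exceeding minL become new empty regions
```

Each carved gap becomes an empty region with the same definition as S2 on every other feature, and S2 itself shrinks to the observed span.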
Reference is now made to
A visualization of a scatterplot of the dataset observations, with an overlay of rectangles representing the resulting partition is presented in
According to some embodiments of the present disclosure, once a density partition S1, . . . , SK is made of S, some calculations may be performed to summarize the results. A realistic dataset should have unevenly-distributed points within S, and these partitions by density, in addition to empty spaces, if found, should characterize the domain constraints of the features. Hence, a realistic dataset, or a synthetic dataset generated to have realistic (and not independent) inter-feature associations, should have regions of various volumes and densities. In addition to summarizing the distribution of observation density in a single dataset, the distributions of two different datasets may also be compared in the following non-parametric way. Say a partition on D results in K regions.
Let φ(Sk), k=1, . . . , K, be the fraction of the observations in D contained in region Sk, with φ(Sk)=0 if Sk is empty space. If the distribution of observations in D were perfectly uniform within the feature space S, each region Sk should have V(Sk)=φ(Sk). Regions Sk that are denser than average should have φ(Sk)>V(Sk); that is, the region covered by Sk contains a higher fraction of observations than its volume (which is a fraction of the total feature space volume). Recall that both Σk φ(Sk)=Σk V(Sk)=1. This suggests the chi-squared statistic χ(D, {Sk})=n·Σk (φ(Sk)−V(Sk))²/V(Sk), where n is the number of observations in D, and the statistic χ(D, {Sk}) follows a chi-squared distribution with K−1 degrees of freedom. A perfectly uniform dataset D should have χ(D, {Sk})≈0; in the most extreme case, the algorithm will not be able to generate a partition (detecting no variation in density), and thus K=1 with S1=S, and V(S1)=φ(S1)=1. The volumes are set as the expected values of φ(Sk) under the null hypothesis of uniformity, since V(Sk)>0, ∀k, while φ(Sk) may equal 0. The p-value of this statistic, according to the chi-squared distribution, can measure the likelihood that the distribution is non-uniform, and the value
can be used as a simple metric to compare uniformity of density between different datasets D and D′ by their respective density partitions, where higher values indicate less uniformity.

According to some embodiments of the present disclosure, an implementation of the present disclosure may be used to identify empty spaces in the dataset of a trained machine learning model, indicating to the user that predictions in empty or very sparse regions are probably inaccurate and should not be used. Since a received dataset is partitioned into dense, sparse, and empty regions, empty or very sparse regions are areas where there may not be enough data either to train a machine learning model (e.g., perhaps observations there should be excluded from training) or to trust the predictions of the machine learning model on observations there. When observations from a test dataset fall into previously empty regions, or are out of bounds of the previous feature space (e.g., they have a higher or lower value on at least one feature than what was observed previously), these may be of special concern. Consider, for example, a dataset of two features of humans, "height" and "weight". Assume the domain of height is 0.5-2.5 m and the domain of weight is 50 kg-200 kg. However, assume there are no people in the region {150-200 kg} and {2-2.5 m}; that is, this is an empty region. Assume that a machine learning algorithm, for instance a nearest neighbors classifier, was trained to predict some other feature based on height and weight. Assume that the test set contains two observations: one person who weighs 175 kg and is 2.25 m tall, and another person who weighs 300 kg. The first is in a previously empty region; the second is out of the previous domain. The machine learning classifier gives a prediction result for both, but those predictions are likely not to be trusted.
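The uniformity statistic discussed earlier in this section can be sketched in code as follows. This is an illustrative sketch assuming the standard Pearson goodness-of-fit form for χ(D, {Sk}), an assumption consistent with the stated K−1 degrees of freedom; function names are not from the disclosure.

```python
import math

def uniformity_statistic(phi, volumes, n_obs):
    """Chi-squared statistic comparing observed observation fractions
    phi(S_k) (which sum to 1) to expected fractional volumes V(S_k)
    (which sum to 1 and are all strictly positive)."""
    return n_obs * sum((p - v) ** 2 / v for p, v in zip(phi, volumes))

# For K = 2 regions (1 degree of freedom) the p-value has the closed form
# P(chi^2_1 > x) = erfc(sqrt(x / 2)), computable with the standard library.
def p_value_df1(stat):
    return math.erfc(math.sqrt(stat / 2.0))

# A perfectly uniform dataset: phi(S_k) = V(S_k), statistic 0, p-value 1.
print(uniformity_statistic([0.5, 0.5], [0.5, 0.5], 100))   # -> 0.0
# A dense/sparse split: most observations crowd into half the volume,
# giving a large statistic (about 64) and a vanishingly small p-value.
print(uniformity_statistic([0.9, 0.1], [0.5, 0.5], 100))
```

For general K, the survival function of the chi-squared distribution with K−1 degrees of freedom (e.g., from a statistics library) would replace the df=1 closed form.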
However, the classifier itself will typically not provide an indicator that the prediction may be unreliable. Executing the code 102 by processor 101, by contrast, easily flags these two observations and explains that the first is in a previously empty region defined on the height and weight features (which may be only a subset of the dataset features) and the second is out of the previously observed domain. That is, the result is explainable.
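The flagging behavior described above can be sketched as follows. The region bounds mirror the height/weight example; all names (explain_observation, EMPTY_REGIONS, the feature keys) are illustrative assumptions rather than the actual code 102.

```python
# Sketch of flagging test observations against the training-time partition.
TRAIN_DOMAIN = {"height_m": (0.5, 2.5), "weight_kg": (50.0, 200.0)}
# Empty region found in the training data: tall AND heavy individuals.
EMPTY_REGIONS = [{"height_m": (2.0, 2.5), "weight_kg": (150.0, 200.0)}]

def explain_observation(obs):
    """Return a human-readable flag for a test observation, or None."""
    # First check: out of the previously observed feature-space bounds.
    for feat, (lo, hi) in TRAIN_DOMAIN.items():
        if not (lo <= obs[feat] <= hi):
            return f"out of previously observed domain on {feat}"
    # Second check: inside a previously empty hyper-rectangular region.
    for region in EMPTY_REGIONS:
        if all(lo <= obs[f] <= hi for f, (lo, hi) in region.items()):
            feats = ", ".join(region)
            return f"in a previously empty region defined on {feats}"
    return None

print(explain_observation({"height_m": 2.25, "weight_kg": 175.0}))
print(explain_observation({"height_m": 1.80, "weight_kg": 300.0}))
```

The first test observation is flagged as lying in the previously empty region defined on height and weight; the second is flagged as out of the previously observed domain on weight.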
Another example of using the method disclosed herein is in causal inference. Essentially, the method may reveal whether certain empty or sparse regions (hyper-rectangular shapes) are also data subsets that lack any diversity of observed values of a target class variable being modeled in causal inference (hereinafter, lack of “positivity”). For instance, in causal inference it may be desired to predict the effect of taking a given medication on a person's blood pressure, as compared to not taking the medication. To predict this outcome for Parkinson patients, it is necessary to observe some Parkinson patients who did take the medication and some who did not. If there is only data on Parkinson patients who took the medication, this is a lack of “positivity”, since there is no data at all about what would happen if these patients did not take the medication. According to some embodiments of the present disclosure, this subset is the hyper-rectangular shape {PARKINSON PATIENTS={yes}} and {MEDICATION={no}}, which is empty; hence {PARKINSON PATIENTS={yes}} is a relevant subset for which either no predictions can be made, or the predictions made for these hyper-rectangular shapes perhaps cannot be trusted.
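A minimal sketch of such a positivity check, under the assumptions of the example above; the function and feature names are illustrative, not from the disclosure.

```python
from itertools import product

def positivity_violations(rows, subgroup_feats, treatment_feat):
    """Return subgroup value-combinations lacking treatment diversity.

    rows           : list of dict-like records
    subgroup_feats : features defining candidate subgroups
    treatment_feat : treatment variable that needs all its values observed
                     within each subgroup ("positivity")
    """
    domains = [sorted({r[f] for r in rows}) for f in subgroup_feats]
    treatment_dom = {r[treatment_feat] for r in rows}
    violations = []
    for combo in product(*domains):
        # Treatment values actually observed inside this subgroup.
        seen = {r[treatment_feat] for r in rows
                if all(r[f] == v for f, v in zip(subgroup_feats, combo))}
        if seen and seen != treatment_dom:
            # e.g. {PARKINSON=yes} with only MEDICATION=yes observed:
            # the cell {PARKINSON=yes, MEDICATION=no} is an empty region.
            violations.append((dict(zip(subgroup_feats, combo)),
                               treatment_dom - seen))
    return violations

rows = [{"PARKINSON": "yes", "MEDICATION": "yes"},
        {"PARKINSON": "no", "MEDICATION": "yes"},
        {"PARKINSON": "no", "MEDICATION": "no"}]
print(positivity_violations(rows, ["PARKINSON"], "MEDICATION"))
# -> [({'PARKINSON': 'yes'}, {'no'})]
```

Each reported violation corresponds to an empty hyper-rectangular region of the kind identified by the density partition.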
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant methods and systems for automatically identifying in a dataset insufficient data for learning, or records with anomalous combinations of feature values, by partition of numeric and/or categorical data space into human-interpretable regions will be developed and the scope of the term methods and systems for automatically identifying in a dataset insufficient data for learning, or records with anomalous combinations of feature values, by partition of numeric and/or categorical data space into human-interpretable regions, is intended to include all such new technologies a priori.
As used herein the term “about” refers to ±10%.
The terms “comprises”, “comprising”, “includes”, “including”, “having” and their conjugates mean “including but not limited to”. This term encompasses the terms “consisting of” and “consisting essentially of”.
The phrase “consisting essentially of” means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment of the invention may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
It is the intent of the Applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.