The present invention generally relates to unbalanced datasets, and more particularly relates to systems and methods for sampling and augmenting unbalanced datasets.
Classification problems are quite common in the machine learning world. When data is collected from real world scenarios, it is often highly unbalanced. Imbalanced data refers to those types of datasets where the target class has an uneven distribution of observations, i.e., one class label has a very high number of observations and the other has a very low number of observations. In other words, a particular class may have a huge number of datapoints in comparison to the others.
Further, managing huge datasets poses significant challenges. For example, there may be several difficulties in storing, indexing, and managing the large amounts of data that are required for certain systems to function. One area in which such problems arise includes systems that search for and identify a target class of data from large datasets. Storage of the actual data points makes up much of the storage volume in a database.
While there are some methods to train a model on an unbalanced dataset (weighting the cost, data augmentation, etc.), it remains a challenging issue and can result in a machine learning model that has high overall accuracy but is completely useless for the target application.
Accordingly, there is a need for a system and method to balance the unbalanced datasets. Further, there is a need to balance the majority and minority classes of data within unbalanced datasets.
This summary is provided to introduce a selection of concepts, in a simplified format, that are further described in the detailed description of the invention. This summary is neither intended to identify key or essential inventive concepts of the invention nor intended for determining the scope of the invention.
According to one embodiment of the present disclosure, a method for sampling a set of data points associated with a single class is disclosed. The method includes receiving a required number of reduced set of data points. Further, the method includes determining a neighbour count for the set of data points based on the required number of reduced set of data points and based on a number of the set of data points. Furthermore, the method includes creating a plurality of clusters from the set of data points, each of the plurality of clusters comprising a plurality of similar data points selected based on a similarity threshold, wherein a number of the plurality of similar data points in each of the plurality of clusters is less than or equal to the neighbour count. Additionally, the method includes selecting a representative data point from the plurality of similar data points, for each of the plurality of clusters. Finally, the method includes providing the reduced set of data points based on representative data points, wherein a number of the representative data points corresponds to the received required number of reduced data points.
According to another embodiment of the present disclosure, a method for sampling and augmenting a dataset of images associated with a first class and a second class, respectively, to balance the dataset of images is disclosed. The method includes receiving a required number of reduced set of images associated with the first class. Further, the method includes creating a plurality of clusters from a set of images associated with the first class based on the required number of reduced set of images. Furthermore, the method includes selecting a representative image from a plurality of similar images in the corresponding cluster, for each of the plurality of clusters. Additionally, the method includes providing the reduced set of images based on representative images, wherein a number of the representative images corresponds to the received required number of reduced images. Still further, the method includes generating a median image corresponding to the set of images associated with the first class, wherein the median image represents a background of the set of images. Moreover, the method includes creating a non-defect artifact mask based on a difference of intensity occurring at each pixel between the median image and the set of images associated with the first class. In addition, the method includes extracting a defect foreground based on the median image and each defect image of another set of images associated with the second class, wherein the defect foreground comprises the defect and at least one non-defect artifact. Further, the method includes removing the at least one non-defect artifact from the defect foreground based on the non-defect artifact mask to generate a defect foreground without artifacts for each defect image. Finally, the method includes providing, for each defect image, a new synthetic defect image based on the median image and the defect foreground without artifacts.
According to yet another embodiment of the present disclosure, a system for sampling of a set of data points associated with a single class is disclosed. The system comprises a memory storing instructions, and a processor configured to execute the instructions to perform operations to: receive a required number of reduced set of data points; determine a neighbour count for the set of data points based on the required number of reduced set of data points and based on a number of the set of data points; create a plurality of clusters from the set of data points, each of the plurality of clusters comprising a plurality of similar data points selected based on a similarity threshold, wherein a number of the plurality of similar data points in each of the plurality of clusters is less than or equal to the neighbour count; for each of the plurality of clusters, select a representative data point from the plurality of similar data points; and provide the reduced set of data points based on representative data points wherein a number of the representative data points corresponds to the received required number of reduced data points.
According to yet another embodiment of the present disclosure, a system for sampling and augmenting a dataset of images associated with a first class and a second class, respectively, to balance the dataset of images is disclosed. The system comprises a memory storing instructions, and a processor configured to execute the instructions to perform operations to: receive a required number of reduced set of images associated with the first class; create a plurality of clusters from a set of images associated with the first class based on the required number of reduced set of images; for each of the plurality of clusters, select a representative image from a plurality of similar images in the corresponding cluster; provide the reduced set of images based on representative images, wherein a number of the representative images corresponds to the received required number of reduced images; generate a median image corresponding to the set of images associated with the first class, wherein the median image represents a background of the set of images; create a non-defect artifact mask based on a difference of intensity occurring for each pixel between the median image and the set of images associated with the first class; extract defect foreground based on the median image and each defect image from another set of images associated with the second class, wherein the defect foreground comprises the defect and at least one non-defect artifact; remove the at least one non-defect artifact from the defect foreground based on the non-defect artifact mask to generate a defect foreground without artifacts for each defect image; and provide, for each defect image, a new synthetic defect image based on the median image and the defect foreground without artifacts.
To further clarify the advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not have necessarily been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help to improve understanding of aspects of the present invention. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the various embodiments and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are explanatory of the invention and are not intended to be restrictive thereof.
Reference throughout this specification to “an aspect”, “another aspect” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrase “in an embodiment”, “in another embodiment” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components preceded by “comprises . . . a” does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.
The present invention is directed towards a method and system for addressing the issue of class imbalance in a dataset. Specifically, the embodiments are directed towards sampling/reducing the data points in the majority class of the dataset. Further, some embodiments are directed towards augmentation process of the minority class of data points within the dataset, where the class imbalance at the outset was severe.
Further, at step 102, one or more statistical features are extracted from the data points of the majority class. In an exemplary embodiment where each input data point corresponds to an image, some exemplary statistical features of each of the data points may include, but are not limited to, the average, median, standard deviation, minimum, maximum, skew, and kurtosis of the pixels of the image. Subsequently, these extracted statistical features are clustered to identify similar data points.
At step 104 of the process flow, data points which are similar to each other are grouped or clustered based on the similarity of their extracted statistical features. In one embodiment, the similarity of extracted statistical features may be identified based on a predefined similarity threshold. Accordingly, a plurality of clusters is created from the set of data points of the majority class. Each of the clusters comprises a plurality of similar data points selected based on the similarity threshold between the data points. All data points which do not form a part of any cluster remain in the initial set of data points. In some embodiments, a cluster formation may be initiated based on a seed data point or image selected from the initial set of received data points/images associated with the majority class. A seed data point may be selected randomly or based on one or more embodiments, as discussed throughout the disclosure.
At step 106, a representative data point is selected from each cluster or group, wherein each representative data point represents the corresponding cluster. Subsequently, a group of representative data points is output as a reduced set of data points which represents the majority class. The above process may be repeated until a required reduction is achieved. However, in each subsequent iteration, the data points which formed a part of any cluster(s)/group(s) during any previous iteration(s) are not taken into consideration. Only the remaining data points, which were never a part of any cluster/group, are taken into consideration. In an embodiment, a similarity check may be performed after each iteration to ensure that enough similarity is left in the remaining set of data points for another iteration of reduction.
According to various embodiments of the present invention, the similarity threshold is a control of the radius of the acceptable region around the seed data point 202 of a cluster. The similarity threshold 204 may be configurable based on a required reduction of the data set, as per a predefined mapping table. This is explained in conjunction with various Figures throughout this disclosure. Thus, the larger the region defined by the similarity threshold, the more data points there are per cluster and, consequently, the fewer representative data points there are in the final output.
At step 302, the method 300 comprises receiving information related to the full distribution of data points in an unbalanced set of data points associated with a plurality of classes. The unbalanced set of data points may include data points associated with a majority class and a minority class. The further steps of the present embodiment are associated with sampling of the set of data points associated with the majority class from the unbalanced set of data points. Further, in an exemplary embodiment, the data points may correspond to, but are not limited to, images or transaction data points.
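The feature extraction of step 102 can be illustrated with a minimal sketch. It assumes each image is available as a NumPy array of pixel intensities; the function name `extract_statistical_features` is hypothetical, and the skew and kurtosis are computed as standardized moments (excess kurtosis), which is one common convention and not necessarily the one used in any particular embodiment.

```python
import numpy as np

def extract_statistical_features(image: np.ndarray) -> np.ndarray:
    """Return [mean, median, std, min, max, skew, kurtosis] of the pixel values.

    Hypothetical helper: the feature order and moment conventions are assumptions.
    """
    pixels = np.asarray(image, dtype=np.float64).ravel()
    mean = pixels.mean()
    std = pixels.std()
    centered = pixels - mean
    if std == 0:
        # Constant image: standardized moments are undefined, so report 0.
        skew, kurt = 0.0, 0.0
    else:
        skew = np.mean(centered ** 3) / std ** 3
        kurt = np.mean(centered ** 4) / std ** 4 - 3.0  # excess kurtosis
    return np.array(
        [mean, np.median(pixels), std, pixels.min(), pixels.max(), skew, kurt]
    )

# Usage: one feature vector per image of the majority class.
features = extract_statistical_features(np.array([[0, 0], [0, 4]]))
```

Each image is thereby reduced to a short vector, so that similarity between images can be evaluated on the vectors rather than on the raw pixels.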
At step 304, the method 300 comprises receiving a required number of reduced set of data points associated with the majority class of data points. In an exemplary embodiment, the required number of reduced set of data points may be, but is not limited to, 2%, 5% or 10% of a total number of data points within the majority class of the unbalanced data. In one embodiment, the required number of reduced set of data points may be automatically determined based on the full distribution of data points in the unbalanced set of data points. For example, the system implementing the present invention may be configured to automatically set the required number of reduced set of data points based on predefined information or a formula stored in a database. In some other embodiments, the required number of reduced data points associated with the majority class of data points may be received as an input from the user.
At step 306, the method 300 comprises determining a compression ratio based on the number of the set of data points and the required number of reduced set of data points. In an exemplary embodiment, the compression ratio may be the ratio of the required number of reduced data points to the total number of the set of data points associated with the majority class.
At step 308, the method 300 comprises determining a neighbour count for the set of data points based on the required number of reduced set of data points and based on a number of the set of data points associated with the majority class. Further, a similarity threshold may also be selected based on the required number of reduced set of data points and based on a number of the set of data points associated with the majority class. In an embodiment, the similarity threshold and the neighbour count may be determined based on the compression ratio.
In an exemplary embodiment, the neighbour count may be determined based on a predefined ruleset stored within a database associated with the system executing the present invention, as discussed later throughout the disclosure. An exemplary ruleset for determination of the neighbour count based on the compression ratio and a similarity threshold is provided below in Table 1:
As depicted, a specific range of the compression ratio leads to determination or selection of a specific similarity threshold and a neighbour count.
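The mapping from compression ratio to similarity threshold and neighbour count can be sketched as a simple range lookup. The ranges and values in `HYPOTHETICAL_RULESET` below are invented placeholders for illustration only and are not the values of Table 1; the function names are likewise hypothetical, and the compression ratio is assumed to be the required count divided by the total count.

```python
def compression_ratio(required_count: int, total_count: int) -> float:
    """Ratio of the required number of reduced data points to the total number."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return required_count / total_count

# Placeholder ruleset, NOT the values of Table 1:
# (upper bound of compression ratio, similarity threshold, neighbour count)
HYPOTHETICAL_RULESET = [
    (0.05, 0.90, 20),
    (0.10, 0.95, 10),
    (1.00, 0.98, 5),
]

def lookup_parameters(ratio: float, ruleset=HYPOTHETICAL_RULESET):
    """Return (similarity_threshold, neighbour_count) for the first matching range."""
    for upper_bound, similarity_threshold, neighbour_count in ruleset:
        if ratio <= upper_bound:
            return similarity_threshold, neighbour_count
    raise ValueError("compression ratio outside the ruleset")

# Usage: 1,000 required data points out of 100,000 in the majority class.
ratio = compression_ratio(1000, 100000)
threshold, k = lookup_parameters(ratio)
```

In this sketch a smaller compression ratio, i.e., a stronger required reduction, selects a larger neighbour count, so that each cluster absorbs more data points.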
At step 310, the method 300 comprises creating a plurality of clusters from the set of data points, each of the plurality of clusters comprising a plurality of similar data points selected based on a similarity threshold. As discussed previously, in an embodiment, the creation of each cluster may be initiated with selection of a seed data point, and other data points within each cluster may be identified based on the similarity threshold and the neighbour count.
Further, in an embodiment, the cluster may be created by extracting at least one statistical feature for each data point of the set of data points. Subsequently, a statistical distance between the at least one statistical feature of each of two data points from the set of data points may be determined, wherein the two data points are identified within a particular threshold distance. Finally, the plurality of similar data points from the set of data points associated with majority class may be selected based on the determined statistical distance among the plurality of similar data points, wherein a count of the plurality of similar data points is less than or equal to the neighbour count.
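The cluster creation described above can be sketched as a greedy, seed-based grouping. This is an illustrative sketch under stated assumptions: the statistical distance is taken to be the Euclidean distance between feature vectors, the seed is simply the first unassigned data point, the seed is counted towards the neighbour count, and the names `cluster_by_seed` and `distance_threshold` are hypothetical.

```python
import numpy as np

def cluster_by_seed(features, distance_threshold, neighbour_count):
    """Greedy seed-based grouping of feature vectors (2-D array, one row per point).

    Returns (clusters, leftover): clusters is a list of index lists whose first
    entry is the seed; leftover holds indices that joined no cluster.
    """
    features = np.asarray(features, dtype=np.float64)
    unassigned = list(range(len(features)))
    clusters, leftover = [], []
    while unassigned:
        seed = unassigned.pop(0)  # assumption: first unassigned point is the seed
        rest = np.array(unassigned, dtype=int)
        dists = np.linalg.norm(features[rest] - features[seed], axis=1)
        within = dists <= distance_threshold
        # Nearest points first, capped so the cluster (seed included)
        # never exceeds the neighbour count.
        nearest_first = rest[within][np.argsort(dists[within])]
        members = nearest_first[: neighbour_count - 1].tolist()
        if members:
            clusters.append([seed] + members)
            taken = set(members)
            unassigned = [i for i in unassigned if i not in taken]
        else:
            leftover.append(seed)  # stays in the set for a later iteration
    return clusters, leftover

# Usage: six one-dimensional feature vectors.
points = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]
clusters, leftover = cluster_by_seed(points, distance_threshold=0.5, neighbour_count=3)
representatives = [cluster[0] for cluster in clusters]  # seed as representative
```

Here the seed of each cluster doubles as its representative data point, which is one of the options described for step 312.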
At step 312, the method 300 comprises selecting, for each of the plurality of clusters, a representative data point from the plurality of similar data points. In an embodiment, the representative data point may correspond to the seed data point. In some embodiments, the representative data point may correspond to multiple data points. The various embodiments for selection of the representative data points are explained in conjunction with
At step 314, the method 300 comprises providing the reduced set of data points based on representative data points. Once a representative data point is selected for each of the plurality of clusters, an output comprising such representative data points is provided which corresponds to the reduced set of data points.
At step 316, the method 300 comprises determining whether the output dataset size is within a predefined range of the required number of reduced set of data points. The predefined range may be set within the system implementing the present invention, as discussed later throughout this disclosure. Based on a determination that the output dataset size is within the predefined range of the required number of reduced data points, the method 300 proceeds to step 322 where the reduced set of data points comprising the plurality of representative data points is provided as an output.
At step 318, the method 300 comprises determining a modified set of data points after removing the plurality of similar data points from the set of data points, based on a determination that the output dataset size is not within the predefined range of required number of reduced data points.
At step 320, the method 300 comprises determining whether the modified set of data points has a similarity greater than another similarity threshold. This other similarity threshold is predefined and stored within the system implementing the present invention. The modified set of data points corresponds to the remaining data points of the initially received set of data points associated with the majority class after removing the plurality of similar data points that were included within any of the clusters. In response to a determination that the modified set of data points does have a similarity greater than the other similarity threshold, the method 300 proceeds to step 306 where a new compression ratio may be determined based on the required set of reduced data points in the next iteration/round and the remaining/modified set of data points after removing the plurality of similar data points that were a part of any cluster in the previous rounds.
At step 322, the method 300 comprises providing the reduced set of data points based on the representative data points as an output, based on a determination that the modified set of data points does not have a similarity greater than the other similarity threshold.
At step 424, the method 400 comprises generating a median image corresponding to the set of images associated with the first class. In one embodiment, generating the median image comprises calculating, for each pixel of the median image, a median intensity occurring at the corresponding pixel across the set of images associated with the majority class of images. Subsequently, the median image is generated based on the calculated median intensity for each pixel of the set of images associated with the majority class of images.
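The per-pixel median of step 424 can be sketched as follows, assuming the images of the first class are equally sized NumPy arrays; the function name `median_image` is hypothetical.

```python
import numpy as np

def median_image(images):
    """Per-pixel median across a list of equally sized images."""
    stack = np.stack([np.asarray(img, dtype=np.float64) for img in images], axis=0)
    return np.median(stack, axis=0)

# Usage: the outlier value 100 does not affect the median background.
normal_images = [np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.full((2, 2), 100.0)]
background = median_image(normal_images)
```

Because the median is robust to outliers, an occasional bright or dark pixel in a few images does not leak into the estimated background.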
At step 426, the method 400 comprises creating a non-defect artifact mask based on a difference of intensity occurring at each pixel between the median image and the set of images associated with the first class. The non-defect artifact mask identifies visible features in the foreground that are not defects. These may arise out of edges and texture differences in the image.
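One possible way to build such a mask is sketched below. It marks a pixel as a non-defect artifact if any image of the first class deviates from the median image by more than a threshold at that pixel. The threshold value, the use of a union across images, and the name `non_defect_artifact_mask` are assumptions made for illustration.

```python
import numpy as np

def non_defect_artifact_mask(normal_images, background, intensity_threshold=30.0):
    """Boolean mask of pixels where any normal image differs strongly from the background.

    intensity_threshold is a hypothetical tuning parameter.
    """
    stack = np.stack([np.asarray(img, dtype=np.float64) for img in normal_images], axis=0)
    difference = np.abs(stack - np.asarray(background, dtype=np.float64))
    return (difference > intensity_threshold).any(axis=0)

# Usage: one normal image has an edge artifact at pixel (0, 0).
background = np.zeros((2, 2))
with_artifact = np.array([[50.0, 0.0], [0.0, 0.0]])
mask = non_defect_artifact_mask([with_artifact, np.zeros((2, 2))], background)
```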
At step 428, the method 400 comprises extracting a defect foreground based on the median image and each defect image of another set of images associated with the second class. The defect foreground is a visible feature identifying a defect present in the foreground.
At step 430, the method 400 comprises removing the at least one non-defect artifact from the defect foreground based on the non-defect artifact mask to generate a defect foreground without artifacts for each defect image.
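Steps 428 and 430 can be sketched together: the defect foreground is taken as the pixels of a defect image that deviate from the median image, and the pixels flagged by the non-defect artifact mask are then removed. The threshold and the function name are hypothetical, and the sketch assumes the mask and images share the same shape.

```python
import numpy as np

def defect_foreground_without_artifacts(
    defect_image, background, artifact_mask, intensity_threshold=30.0
):
    """Return (foreground, clean_mask) for one defect image.

    foreground keeps the defect pixel values and is 0 elsewhere;
    intensity_threshold is a hypothetical tuning parameter.
    """
    defect_image = np.asarray(defect_image, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)
    # Step 428: foreground = pixels deviating from the background (defect + artifacts).
    foreground_mask = np.abs(defect_image - background) > intensity_threshold
    # Step 430: drop pixels known to be non-defect artifacts.
    clean_mask = foreground_mask & ~np.asarray(artifact_mask, dtype=bool)
    foreground = np.where(clean_mask, defect_image, 0.0)
    return foreground, clean_mask

# Usage: pixel (0, 0) is a known artifact, pixel (1, 1) is the defect.
background = np.zeros((3, 3))
defect_image = np.zeros((3, 3))
defect_image[0, 0] = 100.0
defect_image[1, 1] = 200.0
artifact_mask = np.zeros((3, 3), dtype=bool)
artifact_mask[0, 0] = True
foreground, clean_mask = defect_foreground_without_artifacts(
    defect_image, background, artifact_mask
)
```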
At step 432, the method 400 comprises creating a library of each of the defect foreground without artifacts associated with each defect image.
At step 434, the method 400 comprises sampling at least one defect foreground without artifacts from the library.
At step 436, the method 400 comprises cropping and morphing the selected at least one defect foreground without artifacts to generate a morphed version of the at least one defect foreground without artifacts.
At step 438, the method 400 comprises blending the morphed version of the at least one defect foreground without artifacts into a new foreground to generate a new defect foreground.
At step 440, the method 400 comprises providing, for each defect image, a new synthetic defect image based on the median image and the defect foreground without artifacts. The synthetic defect image is an image that simulates an image of a defective item.
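The final composition can be sketched as pasting a cleaned defect foreground onto the median image. This sketch covers only a simple alpha blend and omits the cropping and morphing of steps 436 and 438; the `alpha` parameter and the name `synthesize_defect_image` are assumptions.

```python
import numpy as np

def synthesize_defect_image(background, foreground, foreground_mask, alpha=1.0):
    """Blend foreground pixels into the background where foreground_mask is True.

    alpha is a hypothetical blending weight: 1.0 pastes the defect unchanged.
    """
    background = np.asarray(background, dtype=np.float64)
    foreground = np.asarray(foreground, dtype=np.float64)
    blended = alpha * foreground + (1.0 - alpha) * background
    return np.where(np.asarray(foreground_mask, dtype=bool), blended, background)

# Usage: paste a single defect pixel onto a uniform background.
background = np.full((2, 2), 10.0)
foreground = np.zeros((2, 2))
foreground[0, 1] = 200.0
mask = foreground > 0
synthetic = synthesize_defect_image(background, foreground, mask, alpha=0.5)
```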
Thus, in steps 424-440, the present invention attempts to decompose a given image into a foreground and a background. Typically, the background captures the features that are present in a normal (OK) item. The foreground captures features that are edge or texture artifacts (i.e., non-defect artifacts) as well as defect features. Once the foreground-background decomposition is done, the defects from the foreground can be extracted to compile a defect library. The defect library, once composed, is used to create synthetic images of defective items.
At step 602, the workflow 600 may include obtaining image labels and bounding boxes associated with the set of images associated with the minority class.
Subsequently, the synthetic generation method 400 is applied for augmenting the images in minority class.
Further, at steps 604 and 606, the workflow 600 may include determining whether the new synthetic defect image indicates a performance improvement by performing an ablation test on the new synthetic defect image. If there is a performance improvement, then the workflow proceeds to step 608, which includes outputting the new synthetic defect image as a final output. The set of images is then deployed for production at step 608. Alternatively, based on a determination that the new synthetic defect image does not indicate a performance improvement, the workflow 600 may proceed to deploy the synthetic generation 400 technique again, including repeating the steps of generating the median image, creating, extracting the defect foreground, removing, and providing the new synthetic defect image.
Further, a continuous check is performed for performance of the synthetic generated images. Upon detecting any performance degradation during deployment of the synthetic generated images, the workflow 600 may proceed for manual intervention through a message on a user interface of the system/device implementing the present invention. The manual intervention is required at the input level only. The manual operator need not handle/inspect the synthetic generated images. Additionally, the manual intervention may also be required in case there is no performance improvement detected even after multiple iterations/rounds of synthetic image generation.
At step 706, the workflow comprises comparing the T1 false positive rate with the T2 false negative rate and proceeding to step 708 upon detecting that the false positive rate is less than the false negative rate. Additionally, upon detecting that the T1 false positive rate is greater than the T2 false negative rate, the workflow proceeds to infer that the synthetic generated images comprise misclassified images, and there is a need for a further iteration of implementing method 400. Steps 706 and 708 describe a recursive method where the defect library is created using only those images of the minority class which get misclassified as majority class by the classifier. Once the defect library is created, all the remaining steps are similar to
While the above discussed steps in
In one embodiment, the system 800 may be included within a mobile device or a server. Examples of the mobile device may include, but are not limited to, a laptop, a smart phone, a tablet, or any electronic device having a capability to access the internet and to install software application(s). The system 800 may further include a processor/controller 802, an I/O interface 804, modules 806, a transceiver 808, and a memory 810.
In some embodiments, the memory 810 may be communicatively coupled to the at least one processor/controller 802. The memory 810 may be configured to store data, instructions executable by the at least one processor/controller 802. In some embodiments, the modules 806 may be included within the memory 810. The memory 810 may further include a database 812 to store data. The one or more modules 806 may include a set of instructions that may be executed to cause the system 800 to perform any one or more of the methods disclosed herein. The one or more modules 806 may be configured to perform the steps of the present disclosure defined in
In one embodiment, the memory 810 may communicate via a bus within the system 800. The memory 810 may include, but is not limited to, non-transitory computer-readable storage media, such as various types of volatile and non-volatile storage media including, but not limited to, random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. In one example, the memory 810 may include a cache or random-access memory for the processor/controller 802. In alternative examples, the memory 810 is separate from the processor/controller 802, such as a cache memory of a processor, the system memory, or other memory. The memory 810 may be an external storage device or database for storing data. The memory 810 may be operable to store instructions executable by the processor/controller 802. The functions, acts or tasks illustrated in the figures or described may be performed by the programmed processor/controller 802 for executing the instructions stored in the memory 810. The functions, acts or tasks are independent of the particular type of instructions set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing, and the like.
Further, the present invention contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that a device connected to a network may communicate voice, video, audio, images, or any other data over a network. Further, the instructions may be transmitted or received over the network via a communication port or interface or using a bus (not shown). The communication port or interface may be a part of the processor/controller 802 or may be a separate component. The communication port may be created in software or may be a physical connection in hardware. The communication port may be configured to connect with a network, external media, the display, or any other components in the system, or combinations thereof. The connection with the network may be a physical connection, such as a wired Ethernet connection, or may be established wirelessly. Likewise, the additional connections with other components of the system 800 may be physical or may be established wirelessly. The network may alternatively be directly connected to the bus.
In one embodiment, the processor/controller 802 may include at least one data processor for executing processes in Virtual Storage Area Network. The processor/controller 802 may include specialized processing units such as, integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. In one embodiment, the processor/controller 802 may include a central processing unit (CPU), a graphics processing unit (GPU), or both. The processor/controller 802 may be one or more general processors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data. The processor/controller 802 may implement a software program, such as code generated manually (i.e., programmed).
The processor/controller 802 may be disposed in communication with one or more input/output (I/O) devices via the I/O interface 804. The I/O interface 804 may employ communication protocols such as code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like.
The processor/controller 802 may be disposed in communication with a communication network via a network interface. The network interface may be the I/O interface 804. The network interface may connect to a communication network. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc.
As depicted in
Subsequently, the data points at step 910 may include 60,000 representative images. Further, at step 912, it may be determined whether the output number of images (i.e., 60,000) is within a predefined range of the required number of reduced images, for example, whether the number of output data points is within an error margin of +/−10% of the required number. For instance, for a required reduced set of 1000 images, if the number of output images is between 900 and 1100, the determination would be that the output number of images is within the predefined range of the required number of reduced images.
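By way of a non-limiting illustration, the range determination of step 912 may be sketched as follows. The function name within_required_range and the parameter tolerance are hypothetical names introduced only for this sketch, and the 10% margin is merely the exemplary value given above.

```python
def within_required_range(output_count: int, required_count: int,
                          tolerance: float = 0.10) -> bool:
    """Return True if output_count lies within +/- tolerance of required_count.

    The names and the default tolerance are illustrative assumptions,
    not part of the disclosed method.
    """
    # Compare the absolute deviation against the allowed margin.
    return abs(output_count - required_count) <= tolerance * required_count


# Exemplary values from the description: a required reduced set of 1000 images.
print(within_required_range(1100, 1000))    # inside the 900 to 1100 range
print(within_required_range(60_000, 1000))  # far above the required number
```

With the exemplary figures, an output of 60,000 images falls outside the range, which is why the method proceeds to the similarity check rather than to the output step.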
Since the output number is much greater than the required number, the method 900 proceeds to step 914. At step 914, it may be determined whether there is enough similarity in the remaining images for another round of reduction.
This step serves as a safety check to prevent data loss that would result from clustering dissimilar images. As an exemplary scenario, another round of reduction may be performed if more than 40% of the images are still quite similar (i.e., within 10% of the maximum similarity score). For instance, if 1500 images remain, 10% of the remaining similarity distribution is 0.98, and the number of images with a similarity of at least 0.98 is greater than 600, then the next iteration would be performed. The similarity distribution is the collection of all similarity scores obtained from the pairwise comparison of images within the dataset. Thus, the 60th percentile of all similarity values within the comparison matrix should be greater than 90% of the maximum of all similarity values within the comparison matrix.
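By way of a non-limiting illustration, the safety check of step 914 may be sketched as follows. The sketch assumes that the similarity scores are available as a flat list of numeric values; the function name enough_similarity and its parameters are hypothetical, and the 90% and 40% figures are the exemplary values from the scenario above.

```python
from typing import Sequence


def enough_similarity(scores: Sequence[float],
                      near_max_fraction: float = 0.90,
                      min_share: float = 0.40) -> bool:
    """Return True if more than min_share of the scores lie within
    (1 - near_max_fraction) of the maximum score.

    All names and default values are illustrative assumptions.
    """
    if not scores:
        return False  # nothing left to compare, so no further reduction
    cutoff = near_max_fraction * max(scores)
    share = sum(1 for s in scores if s >= cutoff) / len(scores)
    return share > min_share


# Half of the scores are close to the maximum, so another round is allowed.
print(enough_similarity([0.99] * 5 + [0.50] * 5))
# Only 30% of the scores are close to the maximum, so reduction stops.
print(enough_similarity([0.99] * 3 + [0.50] * 7))
```

Requiring more than 40% of the scores to be at or above 90% of the maximum is one way of expressing the percentile condition stated above; other formulations of the same check are possible.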
In case there is not enough similarity in the remaining images, the method 900 proceeds to step 916 to output the currently selected representative images. In case there is enough similarity in the remaining images, the method proceeds to a second iteration, as illustrated in
As depicted, in
Similarly, in case the neighbour count is kept as 3, there would be 4 clusters and hence 4 representative data points may be provided as an output. In another example, in case the neighbour count is kept as 4, there would be 3 clusters and hence 3 representative data points may be provided as an output. Thus, the larger the neighbour count, the more data points are included per cluster and the fewer representative data points appear in the final output.
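By way of a non-limiting illustration, the inverse relationship between the neighbour count and the number of clusters may be sketched as follows. The sketch assumes a set of 12 data points, a figure that is merely consistent with the examples above, and assumes that every cluster is filled up to the neighbour count. Since a cluster may hold fewer data points when the similarity threshold is not met, the computed value is only a lower bound. The function name min_cluster_count is hypothetical.

```python
import math


def min_cluster_count(num_points: int, neighbour_count: int) -> int:
    """Smallest possible number of clusters (and hence representatives)
    when each cluster holds at most neighbour_count data points.

    Illustrative assumption: clusters are filled completely.
    """
    if num_points < 0 or neighbour_count < 1:
        raise ValueError("num_points must be >= 0 and neighbour_count >= 1")
    return math.ceil(num_points / neighbour_count)


# Assumed example of 12 data points.
for k in (3, 4):
    print(k, min_cluster_count(12, k))
```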
Compression ratio (CR)=size of output dataset/size of input dataset
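By way of a non-limiting illustration, the compression ratio may be computed as sketched below. The function name compression_ratio and the dataset sizes used in the example are hypothetical and are not results of the disclosed method.

```python
def compression_ratio(output_size: int, input_size: int) -> float:
    """CR = size of output dataset / size of input dataset."""
    if input_size <= 0:
        raise ValueError("input_size must be positive")
    return output_size / input_size


# Hypothetical sizes: 1000 input data points reduced to 200.
cr = compression_ratio(200, 1000)
print(f"CR = {cr:.2f}, reduction = {(1 - cr) * 100:.0f}%")
```

Under this definition, a smaller compression ratio corresponds to a stronger reduction; a compression ratio of 0.2 corresponds to removing 80% of the input data points.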
Further,
Given an unbalanced dataset as input (i.e., number of initial set of data points > number of defects), the first part of the invention (sampling/reduction) reduces the amount of redundant information from the majority class, as depicted in PIE CHART 1. In most cases, this is not enough to obtain a balanced dataset (i.e., number of initial set of data points = number of defects), as depicted in PIE CHART 2. Further, the second part of the invention increases the number of defect images until the desired performance is obtained, as depicted in PIE CHART 3.
The present invention provides for various technical advancements based on the key features discussed above. First, the present invention facilitates down sampling a dataset by 80% while still achieving a single digit false positive rate (FPR) comparable to the FPR of a model based on a dataset manually annotated by an experienced engineer. The FPR indicates the proportion of actual negatives that are misclassified as positives. Thus, the present invention facilitates cutting down manual intervention and the associated cost while still retaining the robustness of the model.
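By way of a non-limiting illustration, the false positive rate may be computed from the standard definition FPR = FP / (FP + TN), where FP is the number of false positives and TN is the number of true negatives. The function name and the counts used below are hypothetical and are not measured results.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """FPR = FP / (FP + TN), i.e., the share of actual negatives
    that the model wrongly labels as positive."""
    actual_negatives = false_positives + true_negatives
    if actual_negatives == 0:
        raise ValueError("no actual negatives to evaluate")
    return false_positives / actual_negatives


# Hypothetical counts: 5 of 100 actual negatives are flagged as positive.
print(f"FPR = {false_positive_rate(5, 95):.1%}")
```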
Additionally, the present sampling method does not require check(s) for misclassification, as the criterion for selection is the intrinsic statistical features of the dataset and not the characteristics of a model, unlike some previously known methods (e.g., those related to Condensed Nearest Neighbour Rule undersampling). Further, the present invention facilitates a dynamic selection of the number of examples that form a cluster, based on the neighbour count and the threshold selected (e.g., if the neighbour count is 5, a cluster may be of size 1, 2, 3, 4, or 5 and is not fixed to a rigid value, as in some previously known methods). This allows for more relaxed selection criteria and ensures that only redundant data is removed, in contrast to known rigid value methods such as near miss undersampling. Additionally, the presented sampling method for the majority class does not require any knowledge of the minority class or of the border between the two classes. The sampling is performed from data points in the overall distribution of data points and is not localized to a border, as in some previously known methods.
Further, upon combining sampling and data augmentation, the present invention is able to outperform existing methods and to reduce the problems caused by imbalanced datasets to a greater extent than conventional methods.
While specific language has been used to describe the present subject matter, any limitations arising on account thereof are not intended. As would be apparent to a person skilled in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein. The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment.