One or more aspects of embodiments according to the present disclosure relate to classifiers, and more particularly to a system and method for data augmentation, for use in training a classifier.
Automatic classifiers may exhibit relatively poor performance when trained with data having a data imbalance over a binary class, or when the amount of training data is relatively small given the input data dimension.
Thus, there is a need for an improved system and method for data augmentation.
According to an embodiment of the present invention, there is provided a method for classification, the method including: forming a first training dataset and a second training dataset from a labeled input dataset; training a first classifier with the first training dataset; training a variational auto encoder with the second training dataset, the variational auto encoder including an encoder and a decoder; generating a third dataset, by feeding pseudorandom vectors into the decoder; labeling the third dataset, using the first classifier, to form a third training dataset; forming a fourth training dataset based on the third dataset; and training a second classifier with the fourth training dataset.
In some embodiments, the first training dataset is the labeled input dataset.
In some embodiments, the second training dataset is the labeled input dataset.
In some embodiments, the forming of the first training dataset includes: oversampling the labeled input dataset, to produce a first supplementary dataset; and combining the labeled input dataset and the first supplementary dataset to form the first training dataset.
In some embodiments, the oversampling of the labeled input dataset includes using a synthetic minority over-sampling technique.
In some embodiments, the oversampling of the labeled input dataset includes using an adaptive synthetic over-sampling technique.
In some embodiments, the fourth training dataset is the same as the third training dataset.
In some embodiments, the forming of the fourth training dataset includes combining: a first portion of the labeled input dataset, and the third training dataset to form the fourth training dataset.
In some embodiments, the forming of the fourth training dataset includes combining: a first portion of the labeled input dataset, the first supplementary dataset, and the third training dataset to form the fourth training dataset.
In some embodiments, the method further includes validating the second classifier with a second portion of the labeled input dataset, different from the first portion of the labeled input dataset.
In some embodiments, the forming of the second training dataset includes: oversampling the labeled input dataset, to produce a first supplementary dataset; and combining the labeled input dataset and the first supplementary dataset to form the second training dataset.
In some embodiments, the labeled input dataset includes: majority class data including a first number of data elements and minority class data including a second number of data elements, the first number exceeding the second number by a factor of at least 5.
In some embodiments, the first number exceeds the second number by a factor of at least 15.
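By way of illustration only, the imbalance factor described above may be computed as the ratio of the number of majority class data elements to the number of minority class data elements. The following is a minimal sketch; the function name `imbalance_factor`, the label strings, and the counts (150 and 10) are hypothetical and are chosen only to make the example concrete.

```python
from collections import Counter


def imbalance_factor(labels):
    """Return (majority_label, minority_label, ratio) for a binary label list."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common orders the classes from largest to smallest count.
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    return major, minor, n_major / n_minor


# Hypothetical labeled input dataset: 150 majority and 10 minority elements.
labels = ["good"] * 150 + ["no good"] * 10
major, minor, ratio = imbalance_factor(labels)

meets_factor_5 = ratio >= 5    # first number exceeds second by a factor of at least 5
meets_factor_15 = ratio >= 15  # first number exceeds second by a factor of at least 15
```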
According to an embodiment of the present invention, there is provided a system, including: a processing circuit configured to: form a first training dataset and a second training dataset from a labeled input dataset; train a first classifier with the first training dataset; train a variational auto encoder with the second training dataset, the variational auto encoder including an encoder and a decoder; generate a third dataset, by feeding pseudorandom vectors into the decoder; label the third dataset, using the first classifier, to form a third training dataset; form a fourth training dataset based on the third dataset; and train a second classifier with the fourth training dataset.
In some embodiments, the first training dataset is the labeled input dataset.
In some embodiments, the second training dataset is the labeled input dataset.
In some embodiments, the forming of the first training dataset includes: oversampling the labeled input dataset, to produce a first supplementary dataset; and combining the labeled input dataset and the first supplementary dataset to form the first training dataset.
In some embodiments, the oversampling of the labeled input dataset includes using a synthetic minority over-sampling technique.
In some embodiments, the oversampling of the labeled input dataset includes using an adaptive synthetic over-sampling technique.
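By way of illustration only, oversampling of minority class data by interpolation between neighboring minority class data elements may be sketched as follows. This is a simplified sketch in the spirit of a synthetic minority over-sampling technique and is not a complete implementation of any particular published algorithm; the function name `smote_like_oversample`, the neighbor count `k`, and the example data are assumptions introduced for illustration.

```python
import math
import random


def smote_like_oversample(minority, num_new, k=3, seed=0):
    """Create num_new synthetic elements by interpolating between a randomly
    chosen minority element and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority elements")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(num_new):
        base = rng.choice(minority)
        # Nearest neighbors of the base element, by Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical two-dimensional minority class data.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first_supplementary = smote_like_oversample(minority, num_new=5)
```

In this sketch, the synthetic elements play the role of the first supplementary dataset, which may be combined with the labeled input dataset to form a training dataset.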
According to an embodiment of the present invention, there is provided a system for classifying manufactured parts as good or defective, the system including: a data collection circuit; and a processing circuit, the processing circuit being configured to: form a first training dataset and a second training dataset from a labeled input dataset; train a first classifier with the first training dataset; train a variational auto encoder with the second training dataset, the variational auto encoder including an encoder and a decoder; generate a third dataset, by feeding pseudorandom vectors into the decoder; label the third dataset, using the first classifier, to form a third training dataset; form a fourth training dataset based on the third dataset; and train a second classifier with the fourth training dataset.
These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims, and appended drawings wherein:
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary embodiments of a system and method for data augmentation provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As denoted elsewhere herein, like element numbers are intended to indicate like elements or features.
A classifier over a binary class may have the task of assigning data samples to one of two classes, and there may be a significant imbalance in the training data used to train such a classifier. For example, in a manufacturing process for manufacturing electronic parts, it may be the case that the majority of the parts are acceptable, or “good”, and a small minority of the parts are in some way defective, or “no good”. For this reason, when data are obtained during the manufacturing and testing process, most of the data may be from good devices, i.e., an imbalance may be present in the data. Such an imbalance may be an obstacle when training an automated classifier to classify parts as “good” or “no good”.
Further, the number of measurements obtained for each part may be large, i.e., the number of dimensions of each data element (a data element being the set of measurements for an item, such as a manufactured part, to be classified) may be large. This may be a further obstacle when training an automated classifier, especially when the number of training data elements in either class is small in light of the dimensions of each data element.
For example, when manufacturing mobile displays, trace data may be acquired during the manufacturing process for display panels. The trace data may include, for example, measurements of temperature and pressure in the manufacturing process, as a function of time. Multiple temperature and pressure sensors may be used, and each sensor may be sampled multiple times (e.g., three or four times per day, over a period of multiple days). The trace data resulting from these measurements may, for example, include about 64 time traces each having about 304 measurements, e.g., a total of over 19,000 measurements, so that each data element has over 19,000 dimensions.
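By way of illustration only, the dimension count in the example above follows from 64 × 304 = 19,456. The following minimal sketch flattens a set of time traces into a single data element; the constant names, the function name `flatten_traces`, and the placeholder values of 0.0 are hypothetical.

```python
NUM_TRACES = 64                # about 64 time traces per data element
MEASUREMENTS_PER_TRACE = 304   # about 304 measurements per trace


def flatten_traces(traces):
    """Concatenate all time traces of one item into a single flat data element."""
    return [value for trace in traces for value in trace]


# Placeholder trace data; real values would come from the sensors.
traces = [[0.0] * MEASUREMENTS_PER_TRACE for _ in range(NUM_TRACES)]
element = flatten_traces(traces)
dimension = len(element)  # 64 * 304 = 19456
```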
Various methods, as described in further detail below, may be used to address some of the obstacles mentioned above. Referring to
The data preprocessing circuit 110 may receive raw trace data (e.g., a number of time traces, as mentioned above) from the data collection circuits 105 and may reformat the data, e.g., into two dimensional arrays (e.g., 224×224 arrays). The size of the two dimensional arrays may be selected to be comparable to the size of images commonly classified by neural networks. The reformatting may then make it possible to reuse certain portions of the code implementing a neural network classifier of images, for use in some embodiments.
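By way of illustration only, one possible reformatting of a flat data element into a 224×224 array is sketched below. The present description does not specify how the array is filled; padding the unused positions with a constant value is an assumption made for this sketch, as are the function name `to_square_array` and the `fill` parameter.

```python
def to_square_array(values, side=224, fill=0.0):
    """Place a flat list of measurements row by row into a side x side array,
    padding the unused trailing positions with a constant (an assumption)."""
    capacity = side * side
    if len(values) > capacity:
        raise ValueError("data element does not fit in the target array")
    padded = list(values) + [fill] * (capacity - len(values))
    return [padded[row * side:(row + 1) * side] for row in range(side)]


# A hypothetical data element with 64 * 304 = 19456 measurements.
element = [1.0] * (64 * 304)
array = to_square_array(element)
```

Because 224 × 224 = 50,176 exceeds 19,456, the example data element fits in the array with room to spare.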
The model resulting from the training of the first classifier (e.g., the trained first classifier, or a copy of its neural network, programmed with the weights resulting from the training of the first classifier) may then be used, at 220, to label a third dataset, to form a third training dataset. The machine learning model may be any one of multiple forms including a classifier, a regressor, an autoencoder, etc. The third dataset may be generated, at 225, by a data augmentation method using a variational auto encoder as discussed in further detail below. The data augmentation method, at 225, may use as input a second training dataset, which may be, for example, the labeled input dataset 205, or the combination of the labeled input dataset 205 and the first supplementary dataset.
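By way of illustration only, the generation at 225 and the labeling at 220 may be sketched as follows. The decoder and the first classifier are represented here by hypothetical stand-in functions (`toy_decoder` and `toy_classifier`) rather than trained neural networks, and drawing the pseudorandom vectors from a standard normal distribution is an assumption of this sketch.

```python
import random


def generate_and_label(decoder, first_classifier, num_samples, latent_dim, seed=0):
    """Feed pseudorandom latent vectors into the decoder to generate a third
    dataset, then label each generated element with the first classifier."""
    rng = random.Random(seed)
    third_training_dataset = []
    for _ in range(num_samples):
        # Pseudorandom vector (standard normal components, an assumption).
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        x = decoder(z)            # generated data element
        y = first_classifier(x)   # label assigned by the first classifier
        third_training_dataset.append((x, y))
    return third_training_dataset


# Hypothetical stand-ins for a trained decoder and a trained first classifier.
def toy_decoder(z):
    return [2.0 * z[0], z[0] + z[1], -z[1]]


def toy_classifier(x):
    return "no good" if sum(x) > 2.0 else "good"


third_training_dataset = generate_and_label(
    toy_decoder, toy_classifier, num_samples=100, latent_dim=2
)
```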
A second classifier may then be trained, at 230, using a combination of one or more of (i) a first portion 235 of the labeled input dataset 205 (produced from the labeled input dataset 205 by a data splitter 240), (ii) the first supplementary dataset, and (iii) the third training dataset. The model resulting from the training of the second classifier (e.g., the trained second classifier, or a copy of its neural network, programmed with the weights resulting from the training of the second classifier) may then be validated, at 245, using a second portion 250 of the labeled input dataset 205 (also produced from the labeled input dataset 205 by the data splitter 240). The second portion 250 (which is used for validation) may be different from the first portion 235 (which is used for training), e.g., it may be the remainder of the labeled input dataset 205.
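By way of illustration only, the data splitting and the forming of the fourth training dataset may be sketched as follows. The function names, the 80/20 split fraction, the shuffling, and the placeholder data are assumptions of this sketch and are not specified in the present description.

```python
import random


def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Split a labeled dataset into a first portion (training) and a second
    portion (validation) that is the remainder of the dataset."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def form_fourth_training_dataset(first_portion, first_supplementary, third_training):
    """Combine the available training sources; any of them may be empty."""
    return list(first_portion) + list(first_supplementary) + list(third_training)


# Placeholder labeled input dataset of (data element, label) pairs.
labeled_input = [(i, "good") for i in range(10)]
first_portion, second_portion = split_dataset(labeled_input)

fourth_training = form_fourth_training_dataset(
    first_portion,
    first_supplementary=[(100, "no good")],   # e.g., from oversampling
    third_training=[(200, "good"), (201, "no good")],  # e.g., generated and labeled
)
```

Keeping the second portion out of the fourth training dataset allows it to serve as held-out data when the second classifier is validated.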
The performance of the second classifier after training (i.e., the performance of the model resulting from the training of the second classifier) in the validation step, at 245, may be used to assess whether the second classifier is suitable for use in production, e.g., to make a determination, for each manufactured part, whether it is to be used, or discarded (or reworked).
The table of
The table of
It may be seen that the performance shown in
In some embodiments, k-fold validation is used to obtain a more reliable assessment of the accuracy of a classifier 115 constructed according to methods described herein.
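By way of illustration only, k-fold validation partitions the data elements into k folds and uses each fold once for validation while the remaining folds are used for training. The following minimal sketch generates the index sets; the function name `k_fold_indices` and the example sizes are hypothetical, and shuffling or stratification by class, which may be desirable in practice, is omitted.

```python
def k_fold_indices(num_elements, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= num_elements:
        raise ValueError("k must be between 2 and the number of elements")
    indices = list(range(num_elements))
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        num_elements // k + (1 if i < num_elements % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(num_elements=10, k=3))
```

The accuracy measured on each validation fold may then be averaged to obtain the overall assessment.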
In some embodiments, each of the first classifier (or “first classifier model”) 310 and the second classifier (or “second classifier model”) 325 may be a SqueezeNet, ResNet, or VggNet neural network, suitably trained, as described herein. The variational auto encoder may be constructed as described in “Auto-Encoding Variational Bayes” by D. Kingma and M. Welling, available at arxiv.org/abs/1312.6114, the entire content of which is incorporated herein by reference.
In some embodiments, one or more of the data preprocessing circuit 110, the classifier 115, and the system that executes the method illustrated in
As used herein, a “portion” of a thing means all of, or less than all of, the thing. As such, a portion of a dataset means a proper subset of the dataset, or the entire dataset.
It will be understood that, although the terms “first”, “second”, “third”, etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the inventive concept.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. As used herein, the term “major component” refers to a component that is present in a composition, polymer, or product in an amount greater than an amount of any other single component in the composition or product. In contrast, the term “primary component” refers to a component that makes up at least 50% by weight or more of the composition, polymer, or product. As used herein, the term “major portion”, when applied to a plurality of items, means at least half of the items.
As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of “may” when describing embodiments of the inventive concept refers to “one or more embodiments of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration. As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.
It will be understood that when an element or layer is referred to as being “on”, “connected to”, “coupled to”, or “adjacent to” another element or layer, it may be directly on, connected to, coupled to, or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being “directly on”, “directly connected to”, “directly coupled to”, or “immediately adjacent to” another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of “1.0 to 10.0” is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.
Although exemplary embodiments of a system and method for data augmentation have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a system and method for data augmentation constructed according to principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.
The present application claims priority to and the benefit of U.S. Provisional Application No. 62/830,131, filed Apr. 5, 2019, entitled “SYSTEM AND METHOD FOR DATA AUGMENTATION FOR TRACE DATASET”, the entire content of which is incorporated herein by reference.
Number | Date | Country
---|---|---
62830131 | Apr 2019 | US