The disclosure relates to a classification method. More particularly, the disclosure relates to a classification method for classifying unlabeled data into an inlier data set or an outlier data set.
Outlier detection in machine learning technology is the process of identifying data instances that deviate significantly from the normal distribution within a dataset. Detecting outliers is crucial in various applications, including medical prediction, fraud detection, network security, quality control, and anomaly detection in healthcare or industrial processes.
An embodiment of the disclosure provides a data classification method, which includes following steps. Unlabeled images are obtained. Q prediction rounds are executed on the unlabeled images. Q is a positive integer. Each of the Q prediction rounds includes: randomly selecting assumed inlier images among the unlabeled images; computing a first similarity matrix comprising first similarity scores of the unlabeled images relative to the assumed inlier images; and generating intermediate inlier-outlier predictions about the unlabeled images in one prediction round according to the first similarity matrix. The intermediate inlier-outlier predictions about the unlabeled images generated respectively in the Q prediction rounds are aggregated to select aggregate-predicted inlier images among the unlabeled images. A second similarity matrix is computed, and the second similarity matrix includes second similarity scores of the unlabeled images relative to the aggregate-predicted inlier images. Each of the unlabeled images is classified into an inlier data set or an outlier data set according to the second similarity matrix, so as to generate inlier-outlier predictions of the unlabeled images.
Another embodiment of the disclosure provides a data classification method, which includes following steps. Unlabeled images are obtained. An assigned inlier image is selected among the unlabeled images. A similarity matrix is computed and the similarity matrix includes first similarity scores of the unlabeled images relative to the assigned inlier image. Each of the unlabeled images is classified into an inlier data set or an outlier data set according to the similarity matrix, so as to generate inlier-outlier predictions of the unlabeled images.
Another embodiment of the disclosure provides a data classification method, which includes the following steps. Unlabeled images are obtained. Q prediction rounds are executed about the unlabeled images. Q is a positive integer. Each of the Q prediction rounds includes: randomly selecting assumed inlier images among the unlabeled images; computing a first similarity matrix comprising first similarity scores of the unlabeled images relative to the assumed inlier images; and generating intermediate inlier-outlier predictions about the unlabeled images in one prediction round according to the first similarity matrix. The intermediate inlier-outlier predictions about the unlabeled images generated respectively in the Q prediction rounds are aggregated to select aggregate-predicted inlier images among the unlabeled images. A second similarity matrix is computed, and the second similarity matrix includes second similarity scores of the unlabeled images relative to the aggregate-predicted inlier images. Each of the unlabeled images is classified into an inlier data set or an outlier data set according to the second similarity matrix, so as to generate first inlier-outlier predictions of the unlabeled images. A part of the first inlier-outlier predictions of the unlabeled images is displayed. An adjustment input revised from the first inlier-outlier predictions is obtained. A third similarity matrix is computed, and the third similarity matrix includes third similarity scores of the unlabeled images relative to the adjustment input. Each of the unlabeled images is classified into the inlier data set or the outlier data set according to the third similarity matrix, so as to generate second inlier-outlier predictions of the unlabeled images.
It is to be understood that both the foregoing general description and the following detailed description are by way of example, and are intended to provide further explanation of the invention as claimed.
The disclosure can be more fully understood by reading the following detailed description of the embodiment, with reference made to the accompanying drawings as follows:
Reference will now be made in detail to the present embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
Reference is made to
It is desirable to train a machine learning model according to a training dataset without the outliers. If the training dataset includes outliers, it can have several effects on the performance and behavior of machine learning models. For example, the outliers in the training dataset may cause issues like model bias, increased model complexity, overfitting, reduced robustness, and difficulty in anomaly detection.
In a healthcare application, unlabeled images can include different kinds of medical examination images, such as chest X-ray images, brain MRI images and abdominal ultrasound images. These different medical examination images have different usages in different diagnosis.
For example, the chest X-ray images are beneficial in training a machine learning model for detecting pneumonia, and the brain MRI images and the abdominal ultrasound images are not suitable for detecting the pneumonia. Including the brain MRI images and the abdominal ultrasound images in the training dataset can be harmful in training the pneumonia detection model. In this case, the chest X-ray images should be regarded as inliers, and the brain MRI images and the abdominal ultrasound images should be regarded as outliers.
Manually labeling inliers and outliers within datasets can be a time-consuming and costly process, especially when dealing with large datasets. It requires human experts to review and identify inliers and outliers, which can be impractical for big datasets.
In some embodiments, the data classification method 100 provides an easier way to generate inlier-outlier predictions of the unlabeled images in a dataset. Reference is further made to
The input interface 220 is configured to receive unlabeled images ULIMG and other manual instructions. In some embodiments, the electronic device 200 can classify the unlabeled images ULIMG, and then display the classification result (i.e., inlier-outlier predictions PRED of the unlabeled images ULIMG) on the displayer 280. The input interface 220 can include a data transmission interface, a wireless communication circuit, a keyboard, a mouse, a microphone or any equivalent input device. The processing unit 240 is coupled with the input interface 220, the storage unit 260 and the displayer 280. The storage unit 260 is configured to store a program code. The program code stored in the storage unit 260 is configured for instructing the processing unit 240 to execute the data classification method 100 shown in
Reference is further made to
In some embodiments, as shown in
It is noticed that there are six images IMG1-IMG6 of the unlabeled images ULIMG shown in
As shown in
As shown in
On the other hand, step S116 is executed to collect some manual-input labels MLB by the input interface 220. In some embodiments, a user can manually assign the manual-input labels MLB corresponding to the images IMG1-IMG6 of the unlabeled images ULIMG. Reference is further made to
It is noticed that, in this case, the user provides the manual-input labels MLB on a part (i.e., two images IMG1 and IMG5) of the unlabeled images ULIMG. The other four images IMG2, IMG3, IMG4 and IMG6 remain unlabeled. The data classification method 100 shown in
In aforesaid embodiments, the manual-input labels MLB include one inlier label and one outlier label to select one assigned inlier image INL and one assigned outlier image OUTL. However, the disclosure is not limited thereto.
In other embodiments, the manual-input labels MLB include 1, 2, 3 or more inlier labels to select at least one assigned inlier image INL. The manual-input labels MLB include 0, 1, 2, 3 or more outlier labels. In other words, the assigned outlier image OUTL is not necessarily required in generating the inlier-outlier predictions PRED.
As shown in
As shown in
In some embodiments, the processing unit 240 is configured to perform a similarity algorithm between feature vectors extracted from the unlabeled images (e.g., IMG1-IMG6) and the assigned inlier image INL, so as to compute the first similarity scores SSc1. The similarity algorithm can be selected from a cosine similarity algorithm, a Euclidean distance similarity algorithm, a Manhattan distance algorithm or a Hamming distance algorithm.
The processing unit 240 can perform a cosine similarity algorithm to calculate the first similarity scores SSc1 in a cosine similarity equation as:

similarity(A, B) = (A · B) / (‖A‖ × ‖B‖) (1)
In aforesaid equation (1), A and B are feature vectors of two images to be compared.
For example, the processing unit 240 can perform the cosine similarity algorithm to calculate one similarity score SS21 between the image IMG2 and the assigned inlier image INL (i.e., the image IMG1) in the cosine similarity equation as:

SS21 = (V1 · V2) / (‖V1‖ × ‖V2‖) (2)
In aforesaid equation (2), V1 is the feature vector of the image IMG1, and V2 is the feature vector of the image IMG2. In the same way, other similarity scores in the first similarity scores SSc1 can be calculated based on the similarity algorithm.
If the feature vectors of the two images IMG2 and IMG1 are similar to each other, the similarity score SS21 will be closer to 1. In this case, because the images IMG2 and IMG1 are similar to each other, the similarity score SS21 is 0.92, which is close to 1. On the other hand, if the feature vectors of two images are different from each other, the similarity score will be closer to 0.
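The cosine similarity computation described above can be sketched in Python as follows. The feature vectors below are illustrative stand-ins for vectors extracted from the images, not values taken from the disclosure:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors: (A . B) / (|A| * |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors standing in for V1 (image IMG1) and V2 (image IMG2).
v1 = [0.9, 0.8, 0.1]
v2 = [0.8, 0.9, 0.2]
ss21 = cosine_similarity(v1, v2)  # similar vectors give a score close to 1
```

Identical vectors score exactly 1 and orthogonal vectors score 0, which matches the interpretation of the similarity scores above.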
Similarly, the processing unit 240 can perform the cosine similarity algorithm to calculate the second similarity scores SSc2 between the images IMG1-IMG6 and the assigned outlier image OUTL (i.e., the image IMG5).
For example, the processing unit 240 can perform the cosine similarity algorithm to calculate one similarity score SS15 between the image IMG1 and the assigned outlier image OUTL (i.e., the image IMG5) in the cosine similarity equation as:

SS15 = (V1 · V5) / (‖V1‖ × ‖V5‖) (3)
In aforesaid equation (3), V1 is the feature vector of the image IMG1, and V5 is the feature vector of the image IMG5. In this case, because the images IMG1 and IMG5 are different from each other, the similarity score SS15 is 0.53, which is not close to 1.
As shown in
As shown in
As shown in
In this case as shown in
In this case as shown in
In aforesaid embodiments shown in
In some other embodiments, the assigned outlier image OUTL is not necessarily required in generating the inlier-outlier predictions PRED. If the manual-input labels only select the assigned inlier image INL without selecting the assigned outlier image OUTL, step S140 can be executed, by the processing unit 240, to classify each of the unlabeled images ULIMG by comparing the first similarity scores SSc1 in
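The two classification variants of step S140 — comparing each image's inlier similarity score against its outlier similarity score, or against a threshold when no outlier image was assigned — can be sketched as follows. The function name, the score values, and the default threshold are illustrative assumptions, not the patented implementation:

```python
def classify(inlier_scores, outlier_scores=None, threshold=0.95):
    """Classify each unlabeled image as 'inlier' or 'outlier'.

    inlier_scores: similarity of each image to the assigned inlier image (SSc1).
    outlier_scores: similarity of each image to the assigned outlier image (SSc2);
    when None (no outlier label given), a threshold on SSc1 is used instead.
    """
    predictions = {}
    for name, s_in in inlier_scores.items():
        if outlier_scores is not None:
            # Closer to the assigned inlier than to the assigned outlier -> inlier.
            predictions[name] = "inlier" if s_in >= outlier_scores[name] else "outlier"
        else:
            predictions[name] = "inlier" if s_in >= threshold else "outlier"
    return predictions

# With the illustrative scores from the description, IMG2 (0.92 to the assigned
# inlier) and IMG5 (0.53) are separated by a threshold of 0.90.
preds = classify({"IMG2": 0.92, "IMG5": 0.53}, threshold=0.90)
```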
Based on aforesaid embodiments, the data classification method 100 shown in
As shown in
In some embodiments, when the user reviews the inlier-outlier predictions PRED and learns that the current threshold similarity value is not ideal, step S152 can be executed by the data classification method 100 to collect an adjusted threshold similarity value according to a feedback command inputted through the input interface 220. For example, the threshold similarity value can be adjusted lower into the adjusted threshold similarity value “0.90”. As shown in
In some other embodiments, when the user reviews the inlier-outlier predictions PRED and learns that the manual-input labels are not ideal, step S154 can be executed by the data classification method 100 to collect adjusted manual-input labels according to a feedback command inputted through the input interface 220. For example, the user can manually assign images IMG1 and IMG4 as “inlier”, and delete the “outlier” label from the image IMG5. As shown in
The data classification method 100 shown in
In some embodiments, as shown in
After step S314, the data classification method 300 is configured to execute Q prediction rounds R1, R2 . . . RQ about the unlabeled images ULIMG to generate intermediate inlier-outlier predictions in each of the Q prediction rounds R1, R2 . . . RQ. Q is a positive integer.
During the prediction round R1, step S320 is executed by the processing unit 240 to randomly select assumed inlier images among the unlabeled images ULIMG. The assumed inlier images are randomly sampled from the unlabeled images ULIMG and regarded as "inlier" in this prediction round R1. In practical applications, the unlabeled images ULIMG in the dataset usually include relatively more inlier data and relatively fewer outlier data (e.g., a ratio of inlier to outlier can be 5:1 or even 10:1). Therefore, the assumed inlier images randomly sampled from the unlabeled images ULIMG are more likely to be actual inlier data and less likely to be actual outlier data.
Later in the prediction round R1, step S330 is executed by the processing unit 240 to compute a first similarity matrix, which includes first similarity scores of the unlabeled images ULIMG relative to the assumed inlier images. Details of step S330 are similar to step S130 in aforementioned embodiments, and not repeated here. The difference between step S330 in
Later in the prediction round R1, step S340 is executed by the processing unit 240 to classify each of the unlabeled images ULIMG based on the first similarity matrix and to generate intermediate inlier-outlier predictions about the unlabeled images in the prediction round R1. Details of step S340 are similar to step S140 in aforementioned embodiments, and not repeated here.
As shown in
After the Q prediction rounds R1-RQ are finished, the data classification method 300 goes to step S350, which is executed by the processing unit 240 to aggregate the intermediate inlier-outlier predictions in the Q prediction rounds R1-RQ, so as to select aggregate-predicted inlier images. Reference is further made to
As shown in
As shown in
On the other hand, the image IMG2 is classified as "outlier" in the intermediate inlier-outlier predictions PREDR1, such that the image IMG2 is disqualified. Similarly, the image IMG4 is classified as "outlier" in the intermediate inlier-outlier predictions PREDR2, such that the image IMG4 is disqualified.
In some embodiments, Q is configured to be a positive integer between about 10 and about 20. If Q is smaller than 10, the aggregate-predicted inlier images may not be accurate enough (e.g., actual outlier images may be mixed into the aggregate-predicted inlier images by chance). If Q is higher than 20, the selection can become too strict, making it difficult to select the aggregate-predicted inlier images.
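Steps S320 through S350 can be sketched as the following ensemble loop. The sample size, the threshold, and the unanimous-vote aggregation rule are illustrative assumptions; the description above only requires that each round randomly assumes inliers, scores every image against them, and that the rounds' intermediate predictions are aggregated:

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def aggregate_prediction_rounds(features, q=10, sample_size=3, threshold=0.9):
    """Run Q prediction rounds and keep images predicted "inlier" in every round.

    features maps an image name to its feature vector. In each round, a few
    images are randomly assumed to be inliers (step S320), every image is
    scored against them (step S330) and classified (step S340); the rounds'
    intermediate predictions are then aggregated (step S350).
    """
    names = list(features)
    votes = {name: 0 for name in names}
    for _ in range(q):
        assumed_inliers = random.sample(names, sample_size)
        for name in names:
            # This round's score: the image's best similarity to any assumed inlier.
            best = max(cosine_similarity(features[name], features[a])
                       for a in assumed_inliers)
            if best >= threshold:
                votes[name] += 1
    # Aggregate-predicted inliers: images classified "inlier" in all Q rounds.
    return [name for name in names if votes[name] == q]
```

Because actual inliers dominate the dataset, they tend to survive every round, while an actual outlier is disqualified as soon as one round's random sample contains no image resembling it.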
As shown in
As shown in
In some embodiments, step S380 is executed to display the inlier-outlier predictions (referring to
In embodiments shown in
In some other embodiments, the disclosure provides a hybrid approach based on the data classification method 100 shown in
As discussed in aforementioned embodiments, the first inlier-outlier predictions of the unlabeled images ULIMG can be generated in step S570. Step S580 is executed to display a part of the first inlier-outlier predictions of the unlabeled images ULIMG. In practical applications, the first inlier-outlier predictions are generated from thousands of unlabeled images ULIMG. Step S580 is configured to display a relatively small portion of the first inlier-outlier predictions on the displayer 280. Reference is further made to
In some embodiments, the first inlier-outlier predictions are generated automatically without waiting for manual-input labels. The partial predictions PREDp1 of the first inlier-outlier predictions may include some false predictions. The user can review the partial predictions PREDp1 of the first inlier-outlier predictions, and provide an adjustment input ADJ corresponding to the partial predictions PREDp1. In the embodiments shown in
As shown in
In response to the adjustment input ADJ (and the adjusted manual-input labels LBadj), step S591 is executed, by the processing unit 240, to select the images IMG3 and IMG4 among the unlabeled images ULIMG as assigned inlier images INL based on the adjustment input ADJ, and also select the images IMG5 and IMG6 among the unlabeled images ULIMG as assigned outlier images OUTL based on the adjustment input ADJ.
As shown in
As shown in
As shown in
In some embodiments, the user can review the second inlier-outlier predictions on the displayer 280. If the second inlier-outlier predictions are not correct, the user can provide another adjustment input again and the data classification method 500 can repeat steps S590 to S594 again.
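The refinement pass of steps S590 to S594 — re-scoring every image against the user-adjusted inlier and outlier labels — can be sketched as follows. The function names and feature vectors are illustrative assumptions, not the patented implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def refine_predictions(features, adjusted_inliers, adjusted_outliers):
    """Re-classify every image against the user-adjusted labels (adjustment input ADJ).

    features maps an image name to its feature vector; adjusted_inliers and
    adjusted_outliers are the image names the user marked. Each image is
    assigned to whichever group it resembles more, yielding the second
    inlier-outlier predictions.
    """
    def best_score(name, group):
        return max(cosine_similarity(features[name], features[g]) for g in group)

    predictions = {}
    for name in features:
        s_in = best_score(name, adjusted_inliers)
        s_out = best_score(name, adjusted_outliers)
        predictions[name] = "inlier" if s_in >= s_out else "outlier"
    return predictions
```

If the second predictions are still not correct, the same function can simply be re-run with a new adjustment input, mirroring the repetition of steps S590 to S594 described above.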
The data classification method 500 in
In some embodiments, the inlier data set is utilized as training data for training a machine-learning model. The outlier data set is filtered out and not utilized as the training data. In this case, the outlier data set will not affect the training process of the machine-learning model. In this case, it can avoid issues like model bias, increased model complexity, overfitting, reduced robustness, and difficulty in anomaly detection during the training process of the machine-learning model.
Although the present invention has been described in considerable detail with reference to certain embodiments thereof, other embodiments are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the embodiments contained herein.
It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of this invention provided they fall within the scope of the following claims.
This application claims the priority benefit of U.S. Provisional Application Ser. No. 63/382,723, filed Nov. 8, 2022, which is herein incorporated by reference.
Other Publications

The office action of the corresponding Taiwanese application No. TW112142925 issued on Apr. 22, 2024.
The office action of the corresponding Japanese application No. JP2023-190037 issued on Nov. 19, 2024.
Prior Publication Data

Number | Date | Country
---|---|---
20240160660 A1 | May 2024 | US

Related U.S. Application Data

Number | Date | Country
---|---|---
63488976 | Mar 2023 | US
63382723 | Nov 2022 | US