This application claims priority to Vietnamese Application No. 1-2020-05312 filed on Sep. 15, 2020. The aforementioned application is incorporated herein by reference in its entirety.
Embodiments of the present invention relate to a label inference system that combines a novel active learning method, called Online Active Learning (OAL), with a human in the loop for efficient annotation.
A chest X-ray (CXR) is one of the most popular and important imaging examination methods for screening, diagnosing, and managing public health. However, the clinical interpretation of a CXR requires the expertise of highly qualified radiologists.
Furthermore, several biases make the diagnosis even more problematic. First, there is geography bias: some diseases appear frequently in specific areas but are very rare in others. Second, there is expertise bias: radiologists tend to be good at diagnosing only a specific set of diseases. Third, inconsistency among radiologists, especially on ambiguous cases, produces noisier labeled data. An automated CXR interpretation system that assists radiologists in decision making would therefore help tackle these problems.
Automated CXR interpretation software operating at the level of an experienced radiologist could provide great benefits in both the consistency and the speed of diagnosis. However, it is challenging to develop software that matches the expertise and experience of practicing radiologists. Taking advantage of recent progress in Artificial Intelligence (AI) and Deep Learning, many systems can outperform humans in terms of accuracy on a number of computer vision tasks. However, Deep Learning generally requires large-scale, high-quality labeled datasets to achieve human-level accuracy. Such datasets are not easy to obtain in practice, for two main reasons: labeling a large amount of data requires expertise, and consensus among doctors cannot be reached easily. Thus, building a high-quality, large labeled dataset is costly and time consuming.
There are several methods to obtain qualified labels. In one approach, three radiologists were selected at random for each image, from a cohort of 11 American Board of Radiology-certified radiologists to label a test set, or from 13 individuals, including board-certified radiologists and radiology residents, to label a validation set. Adjudication proceeded until consensus, or up to a maximum of five rounds. This method produces high-quality labels but also consumes a lot of time and money, so it is only well suited to constructing high-quality test and validation sets.
There are several publicly available chest X-ray datasets that can be used for image classification and retrieval tasks. CheXpert is a large CXR dataset and competition for automated chest X-ray interpretation. This dataset features uncertainty labels and radiologist-labeled reference-standard evaluation sets. Its 224,316 chest radiographs were collected from 65,240 patients at Stanford Hospital. Fourteen observations were extracted based on their prevalence in the reports and their clinical relevance, using rule-based labeling tools. Evaluation of the dataset focuses on the five observations used for the competition task: Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion.
Similar to CheXpert, MIMIC-CXR, proposed in “MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports,” p. 317, 2019, by A. E. W. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng, contains 371,920 chest X-rays associated with 227,943 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016. Each imaging study consists of one or more images but is most often associated with two images: a frontal view and a lateral view. Images are provided with 14 labels derived from a natural language processing tool applied to the notes of the corresponding radiology reports. MIMIC-CXR and CheXpert share a common labeling tool for extracting a set of labels from radiology reports.
Another large dataset, PAthology Detection in Chest radiographs (PadChest), includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at Hospital San Juan in Spain from 2009 to 2017. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped to standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remainder of the set was labeled using a supervised method based on a recurrent neural network with attention mechanisms.
Finally, the United States National Institutes of Health (NIH) repository contains 108,948 frontal-view chest X-rays (ChestX-ray8) corresponding to 32,717 different patients, multi-labeled with 14 different thoracic diseases. Another publicly available dataset from NIH (ChestX-ray14) consists of 112,120 frontal chest radiograph images from 30,805 patients. ChestX-ray14 is enriched for various thoracic abnormalities relative to the general population.
From ChestX-ray14, the final labels for four findings (Pneumothorax, Airspace Opacity, Nodule or Mass and Fracture) of 2,412 validation images and 1,962 test images were assigned via adjudicated review by certified radiologists. Each image was first reviewed independently by three radiologists. If all radiologists agreed after the initial review, then that label became final. For images with label disagreements, images were returned for additional reviews. Anonymous labels and any notes from the previous rounds were also available during each iterative review. Adjudication proceeded until consensus, or up to a maximum of five rounds. For the small number of images for which consensus was not reached, the majority vote label was used.
Overall, the labels in these CXR datasets, used as ground truth for training and validating Machine Learning (ML) models, were mostly extracted automatically using Natural Language Processing (NLP) techniques. Such techniques are limited in dealing with multi-language ambiguity and with the uncertainties in radiology reports. Furthermore, most of the annotations are not validated by radiologists or professional physicians to ensure their accuracy and quality.
The present invention is directed to providing a label inference system capable of improving label quality.
The present invention is also directed to providing a label inference system capable of saving costs.
One aspect of the present invention includes a label inference system including: a data generator configured to generate a training set and a test set, each including a plurality of images labeled with experts' annotations; a data trainer configured to perform training for a base model based on the generated training set and test set; a determiner configured to identify whether an evaluation metric f1 of the training model satisfies a base evaluation metric f1base; and a data inference unit configured to perform inference using the training set, the test set, and an unlabeled data set with the training model satisfying the base evaluation metric f1base.
The data generator may expand the training set by labeling and adding another batch of data into the training set when the determiner identifies that the evaluation metric f1 of the training model is less than or equal to the base evaluation metric f1base.
The data trainer may perform training for the base model based on the expanded training set and the test set.
When the determiner identifies that the evaluation metric f1 of the training model is greater than the base evaluation metric f1base, the data inference unit may select an unlabeled data set from a data pool.
The data inference unit may calculate a current inference value of the unlabeled data set from a weighted average value of a previous inference value and a current estimated value.
The data inference unit may calculate the current inference value of the unlabeled data set according to Equation 1 below.
p̂t = Fμ(p̂t-1, Pt),  [Equation 1]
wherein p̂t represents a current inference value, p̂t-1 represents a previous inference value, Pt represents a current estimated value, Fμ represents an online update operator with momentum μ, and t is a natural number.
The current inference values p̂t may include samples with high confidence scores p̂tH and samples with low confidence scores p̂tL.
The data inference unit may extract samples with thresholded high confidence scores p̂tH,thresh by hard-thresholding a subset of the samples with the high confidence scores p̂tH using a threshold τt.
The data inference unit may extract samples with relabeled low confidence scores p̂tL,relabel, which are a subset of the samples with low confidence scores relabeled by a user.
The data inference unit may extract outliers of the training set and the test set, which need to be relabeled by a user, using the threshold τt.
The data inference unit may perform inference using the updated training set and test set, which are modified by relabeling, and the updated unlabeled data set.
The data inference unit trains a new snapshot of the training model for some iterations t using the updated unlabeled data set, training set, and test set.
When the determiner identifies that the evaluation metric f1 of the training model is greater than the target evaluation metric f1target, a terminal condition of the data inference unit is triggered.
The data inference unit may calculate the current inference value of the unlabeled data set using momentum.
The momentum is set to 0.5.
The above and other objects, features and advantages of the present invention will become more apparent to those of ordinary skill in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
However, the technical idea of the present invention is not limited to some embodiments set forth herein and may be embodied in many different forms, and one or more of components of these embodiments may be selectively combined or substituted within the scope of the present invention.
All terms (including technical and scientific terms) used in embodiments of the present invention have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention pertains, unless otherwise defined. Terms, such as those defined in commonly-used dictionaries, should be interpreted as having a meaning in the context of the relevant art.
In addition, the terms used in embodiments of the present invention are for the purpose of describing embodiments only and are not intended to be limiting of the present invention.
As used herein, singular forms are intended to include plural forms as well, unless the context clearly indicates otherwise. Expressions such as “at least one (or one or more) of A, B and C” should be understood to include one or more of all possible combinations of A, B, and C.
In addition, terms such as first, second, A, B, (a), and (b) may be used to describe components of embodiments of the present invention.
These terms are only for distinguishing a component from other components and thus the nature, sequence, order, etc. of the components are not limited by these terms.
When one component is referred to as being “coupled to,” “combined with,” or “connected to” another component, it should be understood that the component is directly coupled to, combined with or connected to the other component or is coupled to, combined with or connected to the other component via another component therebetween.
When one component is referred to as being formed or disposed “on (above) or below (under)” another component, it should be understood that the two components are in direct contact with each other or one or more components are formed or disposed between the two components. In addition, it should be understood that the terms “on (above) or below (under)” encompass not only an upward direction but also a downward direction with respect to one component.
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings, and the same or corresponding components will be assigned the same reference numerals even in different drawings, with redundant descriptions thereof omitted herein.
Referring to the drawings, the label inference system 10 according to an embodiment may perform: generating a training set and a test set, each including a plurality of images labeled with experts' annotations (S201); performing training for a base model based on the training set and the test set (S202); determining whether an evaluation metric f1 of the base model satisfies a base evaluation metric f1base (S203); if No, expanding the training set by labeling and adding another batch of data into the training set when the evaluation metric f1 of the base model is less than or equal to the base evaluation metric f1base (S204), and performing training for the base model based on the expanded training set and the test set (S202); if Yes, selecting an unlabeled data set when the evaluation metric f1 of the base model is greater than the base evaluation metric f1base (S205); and performing inference using the training set, the test set, and the unlabeled data set (S206).
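The initial-phase loop above (steps S201 through S205) can be sketched as follows. This is only an illustrative sketch: `train`, `evaluate`, and `label_batch` are hypothetical toy stand-ins, not functions from the disclosure, and a real system would train a deep model and compute a genuine F1 score.

```python
def train(train_set, test_set):
    # Toy "model": just remembers how much labeled data it was trained on.
    return {"n_labeled": len(train_set)}

def evaluate(model, test_set):
    # Toy F1 proxy: grows with the amount of labeled data, capped at 1.0.
    return min(1.0, model["n_labeled"] / 50.0)

def label_batch(data_pool, batch_size=10):
    # Simulates expert annotation of one batch drawn from the pool (S204).
    return [(data_pool.pop(), 1) for _ in range(min(batch_size, len(data_pool)))]

def build_base_model(train_set, test_set, data_pool, f1_base, max_rounds=10):
    """Initial phase: retrain and expand the training set batch by batch
    until the model's F1 on the test set exceeds f1_base (S202-S204)."""
    model = train(train_set, test_set)             # S202
    rounds = 0
    while evaluate(model, test_set) <= f1_base:    # S203
        if not data_pool or rounds >= max_rounds:
            break                                  # labeling budget exhausted
        train_set.extend(label_batch(data_pool))   # S204: expert labeling
        model = train(train_set, test_set)         # S202 (repeat)
        rounds += 1
    return model, train_set
```

With these stand-ins, the loop keeps adding 10-sample labeled batches until the proxy F1 clears the f1_base bar, mirroring the No branch of step S203.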
The label inference system 10 according to an embodiment may comprise a two-phase data construction flow, which consists of an initial phase for constructing a reasonably modest model and an iterative phase involving incremental human annotation, in which an online update operation plays an important role in minimizing the necessary external intervention.
First, the data generator 11 may generate a training set and a test set, each including a plurality of images labeled with experts' annotations. Each of the training set and the test set may include a formally defined number of images, i.e., batch data, together with medical experts' annotations. The training set and the test set may be used to train and evaluate a base model θbase.
In an embodiment, a data set including annotations may be obtained using weakly supervised labels and unsupervised methods to gradually improve data quality.
In an embodiment, VBCheX, which is a completely new annotated data set of chest X-ray images collected from hospitals for research purposes, may be used as the training set and the test set. VBCheX may refer to a data set including the largest number of manual annotations for seventeen pathologies and labels for tuberculosis which is an infectious disease.
The data trainer 12 may perform training for the base model based on the expanded training set and the test set.
The determiner 13 may determine whether the evaluation metric f1 of the base model satisfies the base evaluation metric f1base. The data generator 11 may expand the training set by labeling and adding another batch of data into the training set when the determiner 13 determines that the evaluation metric f1 of the base model is less than or equal to the base evaluation metric f1base. The data trainer 12 may repeatedly perform training for the base model based on the expanded training set and the test set.
After repeated training of the base model, the determiner 13 may compare the evaluation metric f1 of the base model and the base evaluation metric f1base with each other. The data inference unit 14 may select an unlabeled data set when the evaluation metric f1 of the base model satisfies the base evaluation metric f1base, i.e., when the evaluation metric f1 of the base model is greater than the base evaluation metric f1base. In this case, the initial phase described above ends.
At an end point of the initial phase, the training set includes ninit batches of data and the test set may include one batch of data.
Next, the data inference unit 14 may perform inference using the training set, the test set, and the unlabeled data set. The data inference unit 14 may use a reasonably modest base model and then select an unlabeled data set and combination of the training set containing the ninit batches of data and the test set including one batch of data for performing inference.
The data inference unit 14 may calculate a current inference value of the unlabeled data set from a weighted average value of a previous inference value and a current estimated value. In the case of the unlabeled data set, a current inference value p̂t is a weighted average of a previous inference value p̂t-1 and a current estimated value Pt at a specific iteration t. For example, the data inference unit 14 may calculate the current inference value p̂t of the unlabeled data set according to Equation 1 below.
p̂t = Fμ(p̂t-1, Pt),  [Equation 1]
In Equation 1 above, Fμ represents an online update operator with momentum μ.
As a result of the inference, the inference value p̂t may potentially include samples with high confidence scores p̂tH and samples with low confidence scores p̂tL.
The data inference unit 14 may extract samples with thresholded high confidence scores p̂tH,thresh by hard-thresholding the subset of the samples with the high confidence scores p̂tH using a threshold τt. In addition, the data inference unit 14 may extract samples with relabeled low confidence scores p̂tL,relabel, which are a subset of the samples with low confidence scores relabeled by a user, and may extract outliers of the training set and the test set, which need to be relabeled by a user, using the threshold τt.
The subset of the samples with the high confidence scores p̂tH is hard-thresholded by the threshold τt to produce the samples with thresholded high confidence scores p̂tH,thresh, and is combined with the samples with relabeled low confidence scores p̂tL,relabel, whose annotations are relabeled by a user. The current training set and test set are also re-evaluated based on the threshold τt to extract outliers that need to be relabeled by a user.
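The confidence-based split described above can be sketched as follows. The symmetric split rule around τt and the helper name are assumptions made for illustration, not details taken from the disclosure.

```python
def partition_by_confidence(scores, tau=0.9):
    """Split per-sample inference scores into hard-thresholded pseudo-labels
    (the high confidence subset) and indices queued for expert relabeling
    (the low confidence subset). A symmetric rule around tau is assumed."""
    pseudo_labels = {}   # index -> hard label (thresholded high confidence)
    to_relabel = []      # low confidence indices to send to an annotator
    for i, p in enumerate(scores):
        if p >= tau:
            pseudo_labels[i] = 1      # confident positive
        elif p <= 1.0 - tau:
            pseudo_labels[i] = 0      # confident negative
        else:
            to_relabel.append(i)      # uncertain: ask a user to relabel
    return pseudo_labels, to_relabel
```

The same rule can be run over the current training and test sets to flag outliers: a sample whose stored label disagrees with a confident prediction is a candidate for re-evaluation.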
Next, the data inference unit 14 may perform inference using the updated training set and test set, which are modified by relabeling, and the updated unlabeled data set. After relabeling is performed by a user, the data inference unit 14 trains a new snapshot of a model θt for some iterations t using a triplet of the modified data sets (the updated unlabeled data set, the training set, and the test set). When an evaluation metric is satisfied (e.g., the evaluation metric f1 is greater than the target evaluation metric f1target, i.e., f1 > f1target), a terminal condition is triggered.
In an embodiment, the user is preferably a doctor.
Therefore, in the next online active learning iteration, the training set may be self-corrected by relabeling of the outliers associated therewith, and expanded by an amount of new labeling with respect to the relabeled low confidence scores p̂tL,relabel of the unlabeled data set.
Using the label inference system 10 according to the embodiment, the number of additionally labeled samples for unlabeled data and the number of relabeling operations for the training set and the test set can be minimized, reducing costs in the long term.
In a binary classification task, the data inference unit 14 may decide whether an image includes a predefined label according to Equation 2 below, using the distance from a decision boundary parameterized by a trained weight vector ω and a feature vector f(x, θ) extracted from the image via a deep CNN.
P(C=1|x) = σ(ωᵀf(x, θ))
P(C=0|x) = 1 − P(C=1|x)  [Equation 2]
For example, in Equation 2, P(C=1|x) may represent the probability that the image includes the predefined label and P(C=0|x) may represent the probability that the image does not include the predefined label. Conversely, P(C=1|x) may represent the probability that the image does not include the predefined label and P(C=0|x) may represent the probability that the image includes the predefined label. In addition, in Equation 2, f(x, θ) represents a feature extracted from the CNN, ω represents a trained weight vector, and σ represents a sigmoid function. A general approach to active learning includes calculating an uncertainty score uncertain(x) using entropy according to Equation 3 below, assigning a label to instances with a low uncertainty score, and randomly selecting instances for further annotation.
uncertain(x) = −P(C=1|x)log P(C=1|x) − P(C=0|x)log P(C=0|x)  [Equation 3]
This approach is equivalent to selecting an instance with a high confidence score p(c=1|x) or p(c=0|x) to assign a label and picking in-between instances for annotation.
However, only a little information is obtained from data instances close to a high confidence threshold, and thus this approach yields a small amount of information. In particular, instances with high confidence scores have very discriminative feature vectors, which lie far above or below the decision boundary (i.e., ωᵀf(x, θ) ≫ 0, or ωᵀf(x, θ) ≪ 0 in the case of a negative instance with a high confidence score).
Such instances are not as informative as instances whose feature vectors lie at the decision boundary (i.e., ωᵀf(x, θ) ≈ 0), which have non-discriminative features for which a classifier cannot assign a label of 1 or 0 and thus carry the most additional information for further learning.
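The relationship between Equation 2, the entropy-based uncertainty score, and near-boundary selection can be sketched as follows; the function names and the [0.4, 0.6] selection band are illustrative assumptions (the band matches the range later used in the experiments).

```python
import math

def sigmoid(z):
    # P(C=1|x) = sigma(w^T f(x, theta)), as in Equation 2.
    return 1.0 / (1.0 + math.exp(-z))

def uncertain(p):
    """Binary entropy as an uncertainty score: maximal at p = 0.5 (the
    decision boundary) and near zero for highly confident predictions."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_near_boundary(logits, low=0.4, high=0.6):
    """Pick indices whose confidence falls in [low, high], i.e. whose
    feature vectors lie near the decision boundary (w^T f(x) ~ 0)."""
    return [i for i, z in enumerate(logits) if low <= sigmoid(z) <= high]
```

A logit of 0 maps to a confidence of exactly 0.5, where the entropy score peaks; large-magnitude logits map to confidences near 0 or 1, where the score vanishes, matching the argument that far-from-boundary instances are uninformative.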
In the case of human annotators, mistakes may occur when working with thousands of instances. Such mistakes may cause noise in learning, thus making it difficult to converge learning. In order to reduce human error, the data inference unit 14 may detect all possible noisy labeled instances in a training set and provide these instances to an annotator for label re-evaluation.
For example, the success of training a supervised deep learning model may be attributed to the collection of a large-scale data set annotated by human annotators. However, the human annotation process, especially for medical imaging applications, generates a large number of noisy labels due to human observation errors or computer-generated label errors. Many studies have shown that noisy labels may negatively affect the performance of a model. In recent years, much attention has been paid to handling noisy labels. A noisy-label processing method is described herein with reference to various approaches that may be classified into three main strategies: database-based methods, network architecture-based methods, and training procedure-based methods.
In the database-based method, inaccurate data samples are identified and corrected or discarded during the training process. Noisy data is predicted as soft labels for final training by using an ensemble of trained classifiers. CleanNet identifies correct or incorrect labels by estimating the similarity between feature vectors of data samples. In addition, label smoothing is advantageously applied to model distillation of noisy data. These methods, based on data cleansing and pre-processing, appear to be effective for training models on highly noisy data sets.
In the network architecture-based method, network architectures for training with noise have been developed in several approaches, using an additional noise layer or a Generative Adversarial Network (GAN), for the purpose of learning a transition matrix between noisy and actual labels.
Recently, training procedure-based methods for dealing with noisy labels have been applied in various ways. In the Co-teaching method, proposed in “Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels” by Bo Han, Quanming Yao, X.Y., G.N., M.X., W.H., I.T., and M.S., NeurIPS, 2018, two networks are trained in parallel, each using the other's small-loss samples for training. In “Mixup: Beyond Empirical Risk Minimization,” proposed by Hongyi Zhang, Moustapha Cisse, Y. N. D., and D. L-P., ICLR, 2018, label noise is handled by training models on new data samples and labels generated by combining pairs of training samples and their labels. In addition, a combination of the Co-teaching method and the Mixup method was shown by Berthelot et al. to be an effective way to reduce classification errors on data with label noise. Data re-weighting is a simple way to handle label noise during the training process: during loss minimization, the optimization weights assigned to likely clean samples are much higher than those assigned to noisy samples. This method may be effective for data with a limited amount of clean samples. Label consistency, another effective strategy, may be applied by updating a teacher model trained with noisy-label data, a student model trained with clean data, and labels predicted by the teacher model.
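As an illustration of the Mixup strategy mentioned above, a minimal sketch follows; plain Python lists stand in for tensors, and the Beta-distributed mixing coefficient follows the cited paper, with the optional explicit `lam` argument added here for determinism.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.4, lam=None):
    """Mixup: form a convex combination of two training examples and of
    their labels. lam is drawn from Beta(alpha, alpha) unless supplied."""
    if lam is None:
        lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Because the mixed label is itself a convex combination, a mislabeled example only ever contributes a fraction of the target, which is what softens the effect of label noise.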
In addition to selecting instances for additional labeling, the data inference unit 14 may employ a data distillation approach to generate pseudo-labels for high confidence data (p ≥ 0.9 or p ≤ 0.1). However, such pseudo-labeled data may be inherently noisy because the model may not be strong enough to generate consistent labels at the start of the active learning iterations. That is, high confidence instances in one iteration may become low confidence instances in the next iteration.
The data inference unit 14 may calculate a current inference value of the unlabeled data set using momentum. For example, the data inference unit 14 may apply an approach of stabilizing a confidence output of a model using momentum according to Equation 4 below.
p̂t = Fμ(p̂t-1, Pt) = μp̂t-1 + (1 − μ)Pt  [Equation 4]
In Equation 4, p̂t represents a final confidence score, Pt represents a distilled confidence score of the model in an iteration t, and μ represents a momentum parameter for controlling the effect of previous scores on the final confidence score.
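Equation 4 is an exponential moving average over active learning iterations, and can be sketched directly; the element-wise list form below is an illustrative assumption (per-sample scores would be tensors in practice), with μ = 0.5 as in the described embodiment.

```python
def momentum_update(p_prev, p_cur, mu=0.5):
    """Online update operator F_mu of Equation 4:
    p_hat_t = mu * p_hat_{t-1} + (1 - mu) * P_t,
    applied element-wise to per-sample confidence scores."""
    return [mu * a + (1.0 - mu) * b for a, b in zip(p_prev, p_cur)]
```

A sample whose raw confidence oscillates between iterations (e.g., 0.95 and then 0.05) is smoothed toward 0.5, so it is no longer pseudo-labeled with spurious confidence and instead falls into the low confidence band for relabeling.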
In the experiments, the VBCheX dataset was tested with respect to two pathologies, i.e., lung lesion and cardiomegaly. A standard baseline was constructed following a normal industrial process, i.e., random instances from the labeled dataset were added in batches of 10,000 instances. For active learning, an experiment was conducted on a naive sampling approach that selects a batch of 10,000 instances with a confidence score between a negative high confidence threshold and a positive high confidence threshold (i.e., 0.1 < p̂t < 0.9). In the proposed method, the same high confidence threshold as in the naive approach was used, but for labeling, all data points with a confidence score p̂t between 0.4 and 0.6 were selected.
Therefore, it can be seen that the number of instances with a confidence score near the decision boundary is quite reasonable compared to selecting batches of 10,000 instances for further labeling. In addition, data instances near the decision boundary are more informative than those chosen by the naive sampling approach and provide the best performance improvement on the test set. Additionally, using momentum may further stabilize the process of adding pseudo-labeled data instances to the training set, thereby improving the F1 score on the test set.
The term ‘unit’ as used herein refers to software or a hardware component, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC), which performs certain functions. However, the term ‘unit’ is not limited to software or hardware. A ‘unit’ may be configured to reside in an addressable storage medium or to execute on one or more processors. Thus, the term ‘unit’ may include, for example, components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and parameters. Components and functions provided in ‘units’ may be combined into a smaller number of components and ‘units’ or may be further divided into sub-components and ‘sub-units’. In addition, the components and ‘units’ may be implemented to execute on one or more CPUs in a device or on a secure multimedia card.
While embodiments of the present invention have been described above, it will be apparent to those of ordinary skill in the art that various modifications and changes may be made therein without departing from the spirit and scope of the present invention described in the following claims.
Number | Date | Country | Kind |
---|---|---|---|
1-2020-05312 | Sep 2020 | VN | national |