The present invention relates to an information processing apparatus, a method, and a non-transitory computer readable medium for determining a threshold for class label scores such that the expected recall of the classifier is above a user-defined value.
In many situations, classification accuracy can be improved by collecting more covariates. However, acquiring some of the covariates might incur costs. As an example, consider the diagnosis of whether or not a patient has diabetes. Collecting information (covariates) such as age and gender involves almost no cost, whereas taking blood measurements clearly involves costs (e.g. the working-hour cost of a medical doctor).
On the other hand, there is also a cost of wrongly classifying the patient. There are two types of misclassification. First, the patient may be classified as having no diabetes although the patient is suffering from diabetes. The resulting cost is called the false negative misclassification cost, denoted as $c_{1,0}$. Second, the patient may be classified as having diabetes although the patient is not suffering from diabetes. The resulting cost is called the false positive misclassification cost, denoted as $c_{0,1}$.
Defining both misclassification costs $c_{1,0}$ and $c_{0,1}$ is crucial for rational decision making based on a Bayes procedure. A Bayes procedure for binary classification is defined as follows:

$$\delta(x) = \begin{cases} 1 & \text{if } c_{1,0}\, p(y=1|x) \ge c_{0,1}\, p(y=0|x), \\ 0 & \text{otherwise,} \end{cases} \quad (1)$$
where $y \in \{0,1\}$ is the class label, and $x$ is the covariate vector (also called the feature vector in the machine learning literature). Note that we assume here that $c_{0,0} = c_{1,1} = 0$. Label 1 denotes the true label, e.g. that the patient has diabetes, whereas label 0 denotes the false label.
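For illustration, the decision rule in Equation (1) can be sketched in a few lines of Python; the cost values and the function name below are hypothetical examples, not part of the disclosure:

```python
# Minimal sketch of the Bayes decision rule in Equation (1),
# assuming c_00 = c_11 = 0 and hypothetical cost values.

def bayes_decision(p_y1_given_x: float, c10: float, c01: float) -> int:
    """Return 1 if the expected cost of predicting 0 (a false negative,
    weighted by c10) exceeds the expected cost of predicting 1 (a false
    positive, weighted by c01), and 0 otherwise."""
    return 1 if c10 * p_y1_given_x >= c01 * (1.0 - p_y1_given_x) else 0

# Example: with c10 = 500 and c01 = 100, the patient is classified as
# having diabetes whenever p(y=1|x) >= c01 / (c01 + c10) = 1/6.
print(bayes_decision(p_y1_given_x=0.2, c10=500.0, c01=100.0))  # -> 1
```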
Methods described in NPL 1 try to collect only as many covariates as necessary to minimize the total cost of classification, i.e. the cost of collecting the covariates plus the expected cost of misclassification.
For that purpose, NPL 1 assumes a pre-defined sequence of covariate sets $S_1 \subset S_2 \subset \dots \subset S_q$. First the covariates in $S_1$ are acquired, and then, depending on the observed values of $S_1$, either the covariates $S_2 \setminus S_1$ are additionally acquired or a classification decision is made. In case the covariates $S_2 \setminus S_1$ are acquired, the same procedure is repeated analogously. The strategy for either acquiring additional covariates or classifying is chosen such that the total cost of classification is minimized in expectation.
Note that based on each covariate set $S_i$, where $i \in \{1, \dots, q\}$, a classifier is trained returning the probability $p(y = 1 | x_{S_i})$, where $x_{S_i}$ is the vector denoting the observed values for the covariates $S_i$.
NPL 1: Andrade et al., “Efficient Bayes Risk Estimation for Cost-Sensitive Classification”, Artificial Intelligence and Statistics, 2019.
NPL 2: Kanao et al., “PSA Cut-off Nomogram That Avoid Over-detection of Prostate Cancer in Elderly Men”, The Journal of Urology, 2009.
A Bayes procedure, and in particular the method in NPL 1, requires that all misclassification costs are specified. In most situations the misclassification cost $c_{0,1}$ is relatively easy to specify. For example, in the medical domain, it is relatively easy to specify the medical costs of treating a healthy patient who has no diabetes but is wrongly classified as having diabetes.
On the other hand, it is more difficult to specify $c_{1,0}$. For example, it is difficult to monetize the exact cost of the case in which a diabetes patient dies although he might have been saved. Therefore, in the medical domain, it is more common to require a guarantee on the recall. The machine learning term “recall” is used herein, although the term “sensitivity” is more common in the medical field. In particular, it is common practice to require that the recall is 95% (see e.g. NPL 2).
However, as mentioned above, a Bayes procedure requires the specification of $c_{1,0}$ and cannot make guarantees on the required recall.
The present disclosure has been accomplished to solve the above problems and an object of the present disclosure is thus to provide an information processing apparatus, etc., capable of determining thresholds of a classification procedure which can ensure a user-specified recall.
An information processing apparatus according to the present disclosure is an information processing apparatus for determining a threshold on classification scores, including: a score ranking component configured to sort the classification scores of the positive samples in an evaluation data set in increasing order; and an iteration component configured to lower the threshold, starting from the highest score, until the number of samples whose score is not lower than the threshold is at least a user-specified recall value times the number of true labels in the evaluation data set.
A method according to the present disclosure is a method for determining a threshold on classification scores, including: sorting the classification scores of the positive samples in an evaluation data set in increasing order; and lowering the threshold, starting from the highest score, until the number of samples whose score is not lower than the threshold is at least a user-specified recall value times the number of true labels in the evaluation data set.
A non-transitory computer readable medium according to the present disclosure is a non-transitory computer readable medium storing a program for causing a computer to execute a method for determining a threshold on classification scores, the method comprising: sorting the classification scores of the positive samples in an evaluation data set in increasing order; and lowering the threshold, starting from the highest score, until the number of samples whose score is not lower than the threshold is at least a user-specified recall value times the number of true labels in the evaluation data set.
The present disclosure can determine a threshold t that guarantees that, in expectation, the recall of a classification procedure is at least as large as a user-specified value r.
Example embodiments according to the present disclosure will be described hereinafter with reference to the drawings.
For clarity of description, the following description and the drawings are omitted or simplified as appropriate. Further, each element shown in the drawings as a functional block performing various kinds of processing can be formed of a CPU (Central Processing Unit), a memory, and other circuits in hardware, and may be implemented by programs loaded into the memory in software. Those skilled in the art will therefore understand that these functional blocks may be implemented in various ways: by hardware only, by software only, or by a combination thereof, without any limitation. Throughout the drawings, the same components are denoted by the same reference signs, and overlapping descriptions are omitted as appropriate.
Instead of requiring the specification of the misclassification cost $c_{1,0}$, the present disclosure allows the use of a user-specified recall r, e.g. r = 95%.
In order to guarantee that the recall of the classification procedure is at least r, the present disclosure calculates a threshold t on the classification probability $p(y = 1|x)$ based on an empirical estimate on hold-out data (i.e. evaluation data). The threshold t output by the present disclosure is only as small as necessary to guarantee a recall of at least r. For example, a threshold of 0 would trivially lead to 100% recall, but would typically have very low precision.
Furthermore, the acquired threshold t and a user-specified false positive cost $c_{0,1}$ allow the calculation of the false negative cost $c_{1,0}$ by using the properties of a Bayes procedure.
The core components of the threshold estimation apparatus 100 according to the first embodiment of the present disclosure are illustrated in
First, a threshold estimation apparatus according to a first embodiment will be described with reference to
In the following, the indicator function $1_M$ used herein equals 1 if the expression M is true, and 0 otherwise. Furthermore, a data set for evaluation with n samples, denoted as $(y^{(k)}, x^{(k)})_{k=1}^{n}$, can be used.
First, Score Ranking Component 10 can remove all samples for which $y^{(k)} = 0$ and calculate $p^{(k)} := p(y^{(k)} = 1 | x^{(k)})$ for $k = 1, \dots, n_T$, where $n_T$ is the number of true samples (i.e. $n_T := \sum_{k=1}^{n} 1_{y^{(k)}=1}$). Furthermore, Score Ranking Component 10 sorts the entries in increasing order and removes all duplicates; the resulting sequence is denoted as $\{p^{(m)}\}_{m=1}^{n_H}$.
Next, Iteration Component 20 can perform the following steps outlined in Algorithm 1.
Algorithm 1: Determine threshold t for one classifier.
Using the threshold t output by Algorithm 1, the classifier defined by

$$\delta_t(x) := \begin{cases} 1 & \text{if } p(y=1|x) \ge t, \\ 0 & \text{otherwise} \end{cases}$$

is guaranteed, in expectation, to have a recall of at least r.
This can be seen as follows. Given a distribution over (y, x) such that $E[1_{y=1}] > 0$, the recall of a classifier δ is defined as:

$$R_\delta := p(\delta(x) = 1 \mid y = 1).$$
Since $p(x|y = 1)$ is unknown, the evaluation data $(y^{(k)}, x^{(k)})_{k=1}^{n}$ is used to estimate $R_\delta$:

$$\hat{R}_\delta = \frac{1}{n_T} \sum_{k=1}^{n_T} 1_{p^{(k)} \ge t},$$

where $t = p^{(m)}$, with m being the value of the loop variable after exiting the while loop in Algorithm 1.
As described above, the iteration component 20 (corresponding to Algorithm 1) iterates the threshold downward from the highest score returned by the score ranking component until the number of samples with a score not lower than the current threshold is at least the user-specified recall value times the number of true labels in the evaluation data set.
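A minimal Python sketch of the score ranking step and of Algorithm 1, reconstructed from the description above, could look as follows; the function and variable names are illustrative, since the algorithm listing itself is not reproduced here:

```python
import numpy as np

def determine_threshold(y: np.ndarray, scores: np.ndarray, r: float) -> float:
    """y: binary labels of the evaluation data; scores: p(y=1|x) per sample;
    r: user-specified recall. Returns the largest threshold t whose
    empirical recall on the evaluation data is at least r."""
    pos = scores[y == 1]                  # scores of the true samples
    n_t = len(pos)                        # number of true labels
    candidates = np.sort(np.unique(pos))  # Score Ranking Component 10
    m = len(candidates) - 1               # start at the highest score
    # Iteration Component 20: lower t until >= r * n_t positives pass.
    while np.sum(pos >= candidates[m]) < r * n_t:
        m -= 1
    return float(candidates[m])

# Example: a recall of 0.75 requires t = 0.4 for these toy scores
# (3 of the 4 positive samples score at least 0.4).
y = np.array([1, 1, 1, 1, 0, 0])
p = np.array([0.9, 0.8, 0.4, 0.2, 0.7, 0.1])
print(determine_threshold(y, p, r=0.75))  # -> 0.4
```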
Finally, examples in the drawings illustrate Algorithm 1. The threshold value t starts at 0.9, the highest classification score. Next, the threshold for classification is lowered to 0.8 (i.e. the second highest classification score), and so on, until the empirical recall estimate reaches the user-specified value r.
Next, we consider the situation where we are given q score functions $p(y = 1 | x_{S_i})$, one for each $i \in \{1, \dots, q\}$, and one score function is chosen according to some strategy, e.g. the selection strategy outlined in NPL 1. Here, we do not make any assumptions about the selection strategy.
The threshold for defining classifier $\delta_{t_i, S_i}(x_{S_i})$ is denoted as $t_i$. The classifier $\delta_{t_i, S_i}(x_{S_i})$ is defined as follows:

$$\delta_{t_i, S_i}(x_{S_i}) := \begin{cases} 1 & \text{if } p(y=1|x_{S_i}) \ge t_i, \\ 0 & \text{otherwise.} \end{cases}$$
In the following, the thresholds $t_i$ can be found such that the following requirement is satisfied:

$$p(\delta_{t_1,S_1} = 1, \delta_{t_2,S_2} = 1, \dots, \delta_{t_q,S_q} = 1 \mid y = 1) \ge r. \quad (2)$$
When the above inequality is satisfied, it is ensured that the recall of any classifier selection strategy is at least r. The reason is as follows. Assume that the label of the sample is true (i.e. y = 1) and some classifier $\delta_{t_i,S_i}$ outputs label 0; then an adversarial selection strategy (a selection strategy which tries to produce the lowest recall) will select this classifier. Otherwise, if all classifiers output label 1, then even an adversarial selection strategy must select a classifier $\delta_{t_i,S_i}$ whose output is 1. By the requirement of Inequality (2), the latter case happens with probability at least r.
As described before, a data set for evaluation with n samples is denoted as $(y^{(k)}, x^{(k)})_{k=1}^{n}$. Without loss of generality, we assume that the samples are sorted such that all positive samples (i.e. y = 1) come first, that is, $y^{(k)} = 1$ for $k \in \{1, 2, \dots, n_T\}$, where $n_T$ denotes the total number of positive samples.
For each classifier $i \in \{1, \dots, q\}$, Threshold estimation apparatus 100 determines a threshold $t_i$ as follows:
First, Score ranking component 10 calculates $p_i^{(k)} := p(y^{(k)} = 1 | x_{S_i}^{(k)})$ for $k = 1, \dots, n_T$. Next, for each classifier i, Score ranking component 10 sorts the $p_i^{(k)}$ in increasing order and removes duplicates. The resulting sequence is denoted as $\{p_i^{(m)}\}_{m=1}^{n_{H_i}}$.
Iteration Component 20 then performs the following steps described in Algorithm 2.
Algorithm 2: Determine thresholds ti for different classifiers.
For evaluating Inequality (2), Iteration Component 20 uses the empirical estimate of $p(\delta_{t_1,S_1} = 1, \delta_{t_2,S_2} = 1, \dots, \delta_{t_q,S_q} = 1 \mid y = 1)$, that is,

$$\frac{1}{n_T} \sum_{k=1}^{n_T} \prod_{i=1}^{q} 1_{p_i^{(k)} \ge t_i}.$$
Note that the while loop in the above algorithm necessarily exits, since eventually, for all $i \in \{1, 2, \dots, q\}$, we have $t_i = p_i^{(1)}$, where $p_i^{(1)}$ is the smallest score in the evaluation data for classifier i.
Furthermore, Threshold estimation apparatus 100 determines thresholds that are only as large as necessary to guarantee that, in expectation, the recall is at least r.
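Since the listing of Algorithm 2 is not reproduced above, the following Python sketch is only one plausible realization: it assumes a simple round-robin order for lowering the per-classifier thresholds, and takes only the exit condition (the empirical estimate of Inequality (2) reaching r) from the description:

```python
import numpy as np

def determine_thresholds(pos_scores: np.ndarray, r: float) -> np.ndarray:
    """pos_scores: shape (q, n_t), entries p(y=1 | x_Si) for each classifier
    i and each positive evaluation sample k; r: user-specified recall."""
    q, n_t = pos_scores.shape
    # Sorted unique score sequence per classifier; start each threshold
    # at that classifier's highest score.
    seqs = [np.sort(np.unique(pos_scores[i])) for i in range(q)]
    idx = [len(s) - 1 for s in seqs]
    t = np.array([seqs[i][idx[i]] for i in range(q)])

    def empirical_joint_recall(t):
        # Fraction of positive samples classified 1 by ALL q classifiers.
        return np.mean(np.all(pos_scores >= t[:, None], axis=0))

    i = 0
    while empirical_joint_recall(t) < r:
        if idx[i] > 0:          # lower classifier i's threshold one step
            idx[i] -= 1
            t[i] = seqs[i][idx[i]]
        i = (i + 1) % q         # round-robin over the classifiers
    return t

pos = np.array([[0.9, 0.6, 0.3],
                [0.8, 0.7, 0.2]])
print(determine_thresholds(pos, r=0.65))  # -> [0.6 0.7]
```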
It is noted that the above procedure performed by Threshold estimation apparatus 100 can be simplified (and sped up) if it is required that all thresholds $t_i$ are the same, which is denoted as t.
First, score ranking component 10 as shown in
Furthermore, consider the indicator

$$\prod_{i=1}^{q} 1_{p_i^{(k)} \ge t},$$

which indicates whether sample k is correctly classified as y = 1 by all classifiers when threshold t is assumed.
The iteration component 20 as shown in
Algorithm 3: Determine threshold t common to different classifiers.
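A minimal Python sketch of Algorithm 3, assuming the candidate thresholds are simply the pooled distinct scores iterated from highest to lowest, could look as follows (names are illustrative):

```python
import numpy as np

def determine_common_threshold(pos_scores: np.ndarray, r: float) -> float:
    """pos_scores: shape (q, n_t), scores p(y=1|x_Si) of positive samples;
    returns a single threshold t shared by all q classifiers."""
    q, n_t = pos_scores.shape
    # Candidate thresholds: all distinct scores, highest first.
    candidates = np.sort(np.unique(pos_scores))[::-1]
    for t in candidates:
        # Number of samples classified correctly by ALL classifiers at t.
        n_correct = np.sum(np.all(pos_scores >= t, axis=0))
        if n_correct >= r * n_t:
            return float(t)
    return float(candidates[-1])

# Toy example matching the walkthrough below: the count of samples
# correct under all classifiers grows 0 -> 1 -> 2 as t is lowered,
# and the procedure ends at t = 0.3.
pos = np.array([[0.9, 0.3, 0.2],
                [0.5, 0.8, 0.1]])
print(determine_common_threshold(pos, r=0.65))  # -> 0.3
```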
Finally, examples in the drawings illustrate Algorithm 3. At the initial (highest) threshold, the number of samples which are classified correctly by all classifiers is 0. Next, as the threshold is lowered, this number becomes 1. Finally, the threshold reaches 0.3, and since the number of correctly classified samples is now at least the user-specified recall value times the number of true labels, the procedure ends and returns threshold t = 0.3. As described above, the iteration component 20 stops the iteration once the number of samples for which all scores from the different classifiers are not lower than the threshold is at least the user-specified recall value times the number of true labels in the evaluation data.
Finally, the threshold t as determined using Algorithm 1 or Algorithm 3 can be used to determine the false negative cost $c_{1,0}$, which in turn is used to define a Bayes classifier.
The complete diagram of false negative cost determination apparatus 200 is shown in
Assuming that classifier δ is a Bayes classifier, False negative cost calculation component 30 (see
ensures that the recall of the classifier is at least r. Specifically, the false negative misclassification cost is determined as the reciprocal of the threshold minus 1, multiplied by the false positive misclassification cost, which is assumed to be provided by the user. The reason is as follows. Assuming that δ is a Bayes procedure (see the definition in Equation (1)), we have

$$t = \frac{c_{0,1}}{c_{0,1} + c_{1,0}}, \quad \text{and hence} \quad c_{1,0} = \left(\frac{1}{t} - 1\right) c_{0,1}.$$
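As an illustration, the calculation performed by False negative cost calculation component 30 amounts to this one-line formula; the function name below is hypothetical:

```python
# Sketch of the false negative cost calculation: c10 follows from the
# threshold t and the user-provided false positive cost c01.

def false_negative_cost(t: float, c01: float) -> float:
    """Solve t = c01 / (c01 + c10) for c10, i.e. c10 = (1/t - 1) * c01."""
    return (1.0 / t - 1.0) * c01

# Example: t = 0.3 and c01 = 100 imply c10 = (1/0.3 - 1) * 100 ≈ 233.3,
# i.e. false negatives are weighted about 2.3 times false positives.
print(false_negative_cost(t=0.3, c01=100.0))  # ~233.33
```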
Therefore, the false negative cost determination apparatus 200 can obtain the false negative cost $c_{1,0}$ for a classifier δ in this way.
The processor 1202 performs the processing of the threshold estimation apparatus 100 and the false negative cost determination apparatus 200 described with reference to the flowcharts in the above embodiments by reading software (a computer program) from the memory 1203 and executing the software. The processor 1202 may be, for example, a microprocessor, an MPU, or a CPU. The processor 1202 may include a plurality of processors.
The memory 1203 is configured by a combination of a volatile memory and a non-volatile memory. The memory 1203 may include a storage disposed apart from the processor 1202. In this case, the processor 1202 may access the memory 1203 via an unillustrated I/O interface.
In the example in
In the above-described example embodiments, the programs may be stored in various types of non-transitory computer readable media and thereby supplied to computers. The non-transitory computer readable media include various types of tangible storage media.
Examples of the non-transitory computer readable media include a magnetic recording medium (such as a flexible disk, a magnetic tape, and a hard disk drive) and a magneto-optic recording medium (such as a magneto-optic disk).
Further, examples of the non-transitory computer readable media include CD-ROM (Read Only Memory), CD-R, and CD-R/W. Further, examples of the non-transitory computer readable media include a semiconductor memory. The semiconductor memory includes, for example, a mask ROM, a PROM (Programmable ROM), an EPROM (Erasable PROM), a flash ROM, and a RAM (Random Access Memory).
These programs may be supplied to computers by using various types of transitory computer readable media. Examples of the transitory computer readable media include an electrical signal, an optical signal, and an electromagnetic wave. The transitory computer readable media can be used to supply programs to a computer through a wired communication line (e.g., electric wires and optical fibers) or a wireless communication line.
Note that the present disclosure is not limited to the above-described example embodiments and can be modified as appropriate without departing from the spirit and scope of the present disclosure. Further, the present disclosure may be implemented by combining these example embodiments as desired.
Although the present disclosure is explained above with reference to example embodiments, the present disclosure is not limited to the above-described example embodiments.
Guaranteeing the recall of a decision procedure (classifier) is important for many risk-critical applications. For example, in the medical domain it is common to require a minimum value for the recall.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/006653 | 2/13/2020 | WO |