Active feature probing using data augmentation

Information

  • Patent Grant
  • 7958064
  • Patent Number
    7,958,064
  • Date Filed
    Wednesday, October 10, 2007
  • Date Issued
    Tuesday, June 7, 2011
Abstract
Systems and methods are disclosed that perform active feature probing using data augmentation. Active feature probing is a means of actively gathering information when the existing information is inadequate for decision making. The data augmentation technique generates factitious data that complete the existing information. Using the factitious data, the system is able to estimate the reliability of classification, determine the most informative feature to probe, and then gather the additional information. The features are sequentially probed until the system has adequate information to make the decision.
Description
BACKGROUND

In tasks that classify instances with few known features, it is usually difficult to build useful classifiers without adequate information because the results are inaccurate or unreliable. To improve the accuracy, the classifier needs to gather more information before classification. As an example, consider a technical support center: a customer calls the center with a technical issue. At the beginning, the customer (the information source) may only provide limited information (feature values) to a representative at the center. To identify the issue (classification), the representative asks questions (unknown features), and the customer provides some answers (values of probed features) or may volunteer some additional information (values of non-probed features). After a few rounds, the representative may identify (classify) the problem correctly and provide a suitable solution. In such tasks, in addition to the accuracy of identifying the problem, the efficiency, i.e., the number of probings, is also an important criterion for evaluating the performance.


A pre-built decision tree may be considered as a straightforward approach to such a task. At each non-leaf node, the tree classifier probes the feature values. With the supplied feature value, the classifier follows the branch. Repeating the process, the classifier reaches a leaf node and makes the prediction with adequate information. However, ignoring the given feature values at the beginning and the volunteered feature values during the process makes this approach inefficient. Moreover, the data source may not be able to provide all feature values for some instances, which requires the tree to have an “unknown” branch for each split. Instead of using static pre-built decision trees, the system dynamically creates a split based on the given information.


To dynamically probe feature values, the system needs to estimate which features are the most informative to make the decision. To estimate the information given by an unknown feature, the system needs to classify the instance given the known feature values and the unknown feature under estimation. On one hand, building classifiers for all possible feature subsets is impractical, because the number of possible combinations of features is exponentially large when the number of features is large. On the other hand, building classifiers on-the-fly is also impractical because of the cost of building classifiers.


SUMMARY

Systems and methods are disclosed that perform active feature probing using data augmentation. Active feature probing actively gathers information when the existing information is inadequate for decision making. The data augmentation technique generates factitious data that complete the existing information. Using the factitious data, the system is able to estimate the reliability of classification, determine the most informative feature to probe, and then gather the additional information. The features are sequentially probed until the system has adequate information to make the decision.


In one aspect, a method to classify information includes augmenting existing information with factitious data; performing active feature probing using the factitious data; generating a model based on the active feature probing with factitious data; and classifying information using the model.


Implementations of the system can include one or more of the following. The system learns existing information. One or more feature vectors can be generated by clustering data from the existing information and the factitious data. A classification model can be generated using the one or more feature vectors. Information can be learned by clustering data from the existing information and the factitious data to generate one or more feature vectors and by generating a classification model from the one or more feature vectors. The classifying of information can be done at run-time. The system can evaluate information in a new case to be classified. The evaluating of information can include determining the entropy of a set of factitious cases generated from the new case. The system can gather additional information for the new case. The system can identify the most similar past case as a solution. The system can evaluate confidence on one or more outcomes after several probings. The system can measure the entropy for the one or more outcomes with







H(y \mid x_{obs}) = -\sum_{y \in \mathcal{Y}} \Pr(y \mid x_{obs}) \log \Pr(y \mid x_{obs}).








The system can determine a feature to probe. The feature can be selected to minimize an expected loss. The system can determine a goal for a given xobs as:











\arg\min_{i \in un} \; E_{\Pr(x_i \mid x_{obs})} \, \mathcal{L}\bigl(\Pr(y \mid x_{obs+i})\bigr),






    • where ℒ(Q(y)) is a loss function of a distribution Q(y) on 𝒴, Pr(y | x_{obs+i}) = E[Pr(y | x)], the expectation being taken over the remaining unknown features x_{un\i}, obs+i = obs ∪ {i}, and un\i = {i′ ∈ un | i′ ≠ i}.





The system can evaluate a probability by marginalizing a full probability










\Pr(x_i, y \mid x_{obs}) = \sum_{x_{un\setminus i} \in \mathcal{X}_{un\setminus i}} \Pr(x_i, y, x_{un\setminus i} \mid x_{obs})

= \sum_{x_{un\setminus i} \in \mathcal{X}_{un\setminus i}} \Pr(y \mid x_{obs}, x_{un\setminus i}, x_i) \Pr(x_{un\setminus i}, x_i \mid x_{obs})

= \sum_{x_{un\setminus i} \in \mathcal{X}_{un\setminus i}} \Pr(y \mid x) \Pr(x_{un} \mid x_{obs}).










The system can obtain the distribution of the class label given all features, Pr(y|x), by a classification method. This can be done by sampling a number of augmented instances using a Monte Carlo method to approximate Pr(x_i, y | x_obs) without explicitly summing over all possible values of x_{un\i}. The system can sample instances by drawing virtual instances. Samples can be obtained from a distribution using a Gibbs sampling method or from a distribution using kernel density estimation. The system can dynamically build a decision split based on sampled data from







\Pr(x_{un} \mid x_{obs}) = \Pr(x_{un}, x_{obs}) / \Pr(x_{obs}) \propto \frac{1}{\Pr(x_{obs})} \sum_{X \in \mathcal{D}} K(x, X) \propto \sum_{X \in \mathcal{D}} K_{obs}(x_{obs}, X_{obs}) \prod_{i \in un} K_i(x_i, X_i).











Advantages of the system may include one or more of the following. The system needs to build only one classifier, and classifies a new data point by classifying its augmented factitious data. The sampling method using the kernel density model is highly efficient and provides interactive operation to quickly serve the user or support staff. The system efficiently uses existing information while providing high accuracy.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows an exemplary system to perform active feature probing using data augmentation.



FIG. 2 shows an exemplary learning process.



FIG. 3 shows an exemplary run time process.



FIG. 4 shows an exemplary sampling method.





DESCRIPTION


FIG. 1 shows an exemplary system to perform active feature probing using data augmentation. Active feature probing is a means of actively gathering information when the existing information is inadequate for decision making. The data augmentation technique generates factitious data that complete the existing information. Using the factitious data, the system is able to estimate the reliability of classification, determine the most informative feature to probe, and then gather the additional information. The features are sequentially probed until the system has adequate information to make the decision.


Turning now to FIG. 1, the system includes a learning process 1 and a run-time process 2. In the learning process 1, the process receives a data set with questions and answers to those questions (12). Next, clustering is performed on the question data set (14). A feature vector (canonical questionnaire) is determined for the data set (16). The learning process can repeatedly perform questionnaire imputation (22) on past cases (24). From the past cases, the process can detect questions and answers (12). From the questionnaire imputation, the learning process can also impute the feature vector for each case (18). Next, the learning process 1 can generate a model for help-desk assistance (20). Alternatively, the learning process 1 can classify cases (26) and then generate the model for help-desk assistance (20).


Once the model has been built, the run time process 2 can operate using the model. A new case can be submitted (30). The system can perform questionnaire generation by sampling the questionnaires (32). Next, a set of highly probable questionnaires is identified (34), and the user can probe by active learning (36). The most similar past case is selected as the final solution (40).


The system augments an instance to multiple factitious instances (multiple imputation). The goal of data augmentation is to set each unknown feature to a value according to the data model. Instead of setting the features to their most likely values, multiple imputation sets the features to values drawn from the distribution of the data. The system assumes the classifier and the data model can be learned from the existing training data. Once the system can estimate the information given by each unknown feature, the system is able to choose the most informative one and probe its actual value. The system can repeat the process until it knows sufficient information about the instance. Then, the system can predict the target of the instance. A rough sketch of this probe-until-sufficient loop appears below.
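
As a rough illustration only, the following Python sketch shows one way such a probe-until-sufficient loop could be organized. The helper functions passed in (sample_factitious, class_distribution, select_feature, probe) and the thresholds are hypothetical placeholders for the components described below, not the specific implementation of this disclosure.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (zero terms ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def active_probe(x_obs, sample_factitious, class_distribution,
                 select_feature, probe, max_probes=10, h_stop=0.1):
    """Sequentially probe features until the class distribution is confident.

    x_obs              -- dict {feature index: known value}
    sample_factitious  -- draws completed (factitious) instances consistent with x_obs
    class_distribution -- estimates Pr(y | x_obs) from the factitious sample
    select_feature     -- picks the most informative unknown feature
    probe              -- queries the information source for a feature value
    """
    for _ in range(max_probes):
        samples = sample_factitious(x_obs)      # data augmentation step
        p_y = class_distribution(samples)       # approximate Pr(y | x_obs)
        if entropy(p_y) < h_stop:               # confident enough: stop probing
            break
        i = select_feature(samples, x_obs)      # most informative unknown feature
        x_obs[i] = probe(i)                     # gather its actual value
    return int(np.argmax(class_distribution(sample_factitious(x_obs))))
```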


Turning now to FIG. 2, the learning process 1 is detailed. First, existing cases are retrieved (202). Next, the process generates questionnaire(s) by clustering sentences in existing cases and generates a feature vector for each corresponding answer to each questionnaire (204). The process 1 then builds a classification model using the feature vectors (206). Next, a model is generated for the run-time process 2.
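
A minimal sketch of this learning step, assuming scikit-learn is available, might cluster the sentences of past cases to define a canonical questionnaire and then fit a classifier on the resulting feature vectors. The particular component choices here (TF-IDF, k-means, logistic regression) and the function names are illustrative assumptions, not components mandated by this disclosure.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def learn_model(case_sentences, case_labels, n_questions=100):
    """case_sentences: one list of sentences per past case; case_labels: class per case."""
    all_sentences = [s for case in case_sentences for s in case]
    vec = TfidfVectorizer()
    kmeans = KMeans(n_clusters=n_questions, n_init=10)
    kmeans.fit(vec.fit_transform(all_sentences))     # clusters act as questionnaire items

    def to_feature_vector(case):
        # mark which questionnaire items (clusters) the case's sentences cover
        x = np.zeros(n_questions)
        if case:
            for c in kmeans.predict(vec.transform(case)):
                x[c] = 1.0
        return x

    X = np.array([to_feature_vector(case) for case in case_sentences])
    clf = LogisticRegression(max_iter=1000).fit(X, case_labels)
    return vec, kmeans, clf, to_feature_vector
```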



FIG. 3 shows the run time process 2 in more detail. First, the process 2 receives a new case (302). The process evaluates the information of each question for the new case (304). The importance is measured by the entropy computed from a set of factitious cases generated from the new case. The generation process uses an efficient sampling method shown in more detail in FIG. 4. The process then gathers additional information by asking one or more informative questions (306). A solution is then determined (308).



FIG. 4 shows one embodiment of an efficient sampling method. The process computes kernel values between the new case and each existing case based on the known features (402). Next, the process draws N sample cases from the existing cases according to the product of the kernel values of the known features (404). For each sample and each unknown feature i, the process sets f to the sample's i-th feature value and draws a new feature value according to the kernel distribution around the value f (406).


In the following discussion on active feature probing, let 𝒴 be the set of classes, let the features be denoted by their indexes {1, . . . , d}, and let 𝒳 = 𝒳_1 × . . . × 𝒳_d be the d-dimensional feature space. In the feature space, the system uses NA as a possible choice, which represents an unknown feature value. For simplicity, in one embodiment, the system assumes all feature variables are categorical. For a given variable x = (x_1, . . . , x_d) ∈ 𝒳, let obs be the set of indexes whose corresponding feature values are currently known, and un be the set of the remaining indexes. The system denotes the components of x with indexes in a set s by x_s; for example, x_obs is the known part of x, and x_un the unknown part of x. In active feature probing, the system looks for a policy π that, for a given instance x to be classified, maps x either to a feature in {1, . . . , d} whose value is to be probed, or to a class in 𝒴 to be its label. The criterion of an optimal probing policy needs to make a tradeoff between classification accuracy and the average number of features probed before classification. Given a criterion, the optimization may involve a Bellman equation, which can be solved by reinforcement learning.


In one illustrative empirical feature probing method, the system avoids the complicated reinforcement learning framework. In this embodiment, the system defines a policy in a greedy way. The idea of the greedy approach is to find the most informative feature to probe, similar to finding the best split when building a decision tree, until certain stop criteria are met. For the stop criteria, the system considers two cases. On one hand, when the system has sufficient confidence in the outcomes after several probings, further probings may add little accuracy at a large cost of performing the probings. The system may consider stopping further probings. To measure the confidence, the system can empirically use the entropy of the outcomes, i.e.,







H(y \mid x_{obs}) = -\sum_{y \in \mathcal{Y}} \Pr(y \mid x_{obs}) \log \Pr(y \mid x_{obs}).








On the other hand, when the system is not at all confident in the outcomes after many probings, further probings may not increase the accuracy. The system may consider stopping the probing to save the cost of probing.
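
For example, with a set of factitious completions of x_obs and a pre-trained classifier providing Pr(y | x), the entropy above can be approximated by averaging the classifier's outputs over the sample. The sketch below assumes NumPy and a scikit-learn-style predict_proba interface; the threshold names are illustrative only.

```python
import numpy as np

def posterior_over_samples(clf, samples, weights=None):
    """Approximate Pr(y | x_obs) by averaging Pr(y | x) over factitious completions x."""
    proba = clf.predict_proba(samples)              # shape (m, number of classes)
    if weights is None:
        weights = np.full(len(samples), 1.0 / len(samples))
    return weights @ proba                          # weighted average over the sample

def conditional_entropy(clf, samples, weights=None):
    """H(y | x_obs) = -sum_y Pr(y | x_obs) log Pr(y | x_obs)."""
    p = posterior_over_samples(clf, samples, weights)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def should_stop(clf, samples, n_probed, h_stop=0.1, max_probes=10):
    """Stop when confident (low entropy) or when the probing budget is exhausted."""
    return conditional_entropy(clf, samples) < h_stop or n_probed >= max_probes
```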


Now the system looks for a feature to probe. Let ℒ(Q(y)) be the loss function of a distribution Q(y) on 𝒴. Active feature probing finds a feature to minimize the expected loss. Given x_obs, the goal is written as











\arg\min_{i \in un} \; E_{\Pr(x_i \mid x_{obs})} \, \mathcal{L}\bigl(\Pr(y \mid x_{obs+i})\bigr),





where Pr(y | x_{obs+i}) = E[Pr(y | x)], the expectation being taken over the remaining unknown features x_{un\i}, obs+i = obs ∪ {i}, and un\i = {i′ ∈ un | i′ ≠ i}.


When the system uses log loss, i.e.,

ℒ(Q(y)) = −E_{Q(y)} log Q(y),

the goal is actually to find the feature with the maximum information gain to probe. Maximum information gain is one of many criteria for finding the best split when building a decision tree. The information gain, i.e., the conditional mutual information between a variable x_i and y given x_obs, is defined as







I(x_i, y \mid x_{obs}) = H(y \mid x_{obs}) + \sum_{x_i \in \mathcal{X}_i} \sum_{y \in \mathcal{Y}} \Pr(x_i, y \mid x_{obs}) \log \Pr(y \mid x_{obs}, x_i).




Because












\Pr(y \mid x_{obs}, x_i) = \frac{\Pr(x_i, y \mid x_{obs})}{\sum_{y' \in \mathcal{Y}} \Pr(x_i, y' \mid x_{obs})}, \qquad \Pr(y \mid x_{obs}) = \sum_{x_i \in \mathcal{X}_i} \Pr(x_i, y \mid x_{obs}),





the system only needs to estimate Pr(x_i, y | x_obs) to evaluate each I(x_i, y | x_obs). To avoid using a large number of models to estimate Pr(x_i, y | x_obs) for all combinations of obs and i, the system evaluates the probability by marginalizing the full probability,










\Pr(x_i, y \mid x_{obs}) = \sum_{x_{un\setminus i} \in \mathcal{X}_{un\setminus i}} \Pr(x_i, y, x_{un\setminus i} \mid x_{obs})

= \sum_{x_{un\setminus i} \in \mathcal{X}_{un\setminus i}} \Pr(y \mid x_{obs}, x_{un\setminus i}, x_i) \Pr(x_{un\setminus i}, x_i \mid x_{obs})

= \sum_{x_{un\setminus i} \in \mathcal{X}_{un\setminus i}} \Pr(y \mid x) \Pr(x_{un} \mid x_{obs}).











It is easy to model and obtain the distribution of the class label given all features, Pr(y|x), by various classification methods. Given x_obs, if the system can sample a sufficient number of augmented instances x = (x_obs, x_un) using Monte Carlo methods, the system can approximate all Pr(x_i, y | x_obs) without explicitly summing over all possible values of x_{un\i}.

Various Monte Carlo methods can be used for different data models, as discussed in more detail below. Suppose that the system has a sample set of augmented instances, denoted 𝒮 = {x^(j)}, where x_obs^(j) = x_obs for all j, and there is a weight w_j associated with each sample x^(j). The value of w_j depends on the sampling method; unless otherwise noted, w_j = 1/|𝒮|. The system can define the population mutual information as








\tilde{I}(x_i, y; \mathcal{S}) = \sum_{x_i, y} \tilde{p}(x_i, y; \mathcal{S}) \log \tilde{p}(x_i, y; \mathcal{S}) - \sum_{x_i} \tilde{p}(x_i; \mathcal{S}) \log \tilde{p}(x_i; \mathcal{S}) - \sum_{y} \tilde{p}(y; \mathcal{S}) \log \tilde{p}(y; \mathcal{S}),















where














\tilde{p}(x_i, y; \mathcal{S}) = \sum_{x^{(j)} \in \mathcal{S}} \tilde{p}(x^{(j)}; \mathcal{S}) \, \Pr(y \mid x^{(j)}) \, [x_i^{(j)} = x_i],

\tilde{p}(x_i; \mathcal{S}) = \sum_{x^{(j)} \in \mathcal{S}} \tilde{p}(x^{(j)}; \mathcal{S}) \, [x_i^{(j)} = x_i],

\tilde{p}(y; \mathcal{S}) = \sum_{x^{(j)} \in \mathcal{S}} \tilde{p}(x^{(j)}; \mathcal{S}) \, \Pr(y \mid x^{(j)}),

\tilde{p}(x^{(j)}; \mathcal{S}) = w_j,






and [·] is the Iverson bracket, i.e., [x′ = x] = 1 if x′ equals x, and 0 otherwise.


The population mutual information is an approximation to the mutual information; the system can use Bayesian mutual information, based on Bayesian entropy, as an alternative.
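
A small NumPy sketch of these population estimates follows. It assumes the sample set S is given as an array of completed instances with weights w_j and that a classifier's predict_proba supplies Pr(y | x^(j)); these interface choices are assumptions made only for illustration.

```python
import numpy as np

def population_mutual_information(clf, S, i, weights=None):
    """Estimate I~(x_i, y; S) from a weighted set S of factitious instances.

    S       -- array of shape (m, d) holding completed instances x^(j)
    i       -- index of the candidate feature to probe
    weights -- w_j for each sample; defaults to 1/|S|
    """
    m = len(S)
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights, dtype=float)
    proba = clf.predict_proba(S)                       # Pr(y | x^(j)), shape (m, k)
    values = np.unique(S[:, i])

    # p~(x_i, y; S): for each value v, sum_j w_j Pr(y | x^(j)) [x_i^(j) = v]
    p_xy = np.array([w[S[:, i] == v] @ proba[S[:, i] == v] for v in values])
    p_x = p_xy.sum(axis=1)                             # p~(x_i; S)
    p_y = p_xy.sum(axis=0)                             # p~(y; S)

    def plogp(p):
        p = p[p > 0]
        return float(np.sum(p * np.log(p)))

    # I~ = sum p~(x,y) log p~(x,y) - sum p~(x) log p~(x) - sum p~(y) log p~(y)
    return plogp(p_xy.ravel()) - plogp(p_x) - plogp(p_y)
```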


Given the sample set 𝒮, the greedy policy becomes







\pi(x_{obs}) = \begin{cases} \arg\max_{y \in \mathcal{Y}} \tilde{p}(y; \mathcal{S}) & \text{if the stop criteria are met,} \\ \arg\max_{i \in un} \tilde{I}(x_i, y; \mathcal{S}) & \text{otherwise.} \end{cases}









Next, the method of probing feature values is discussed. The feature value probing operation sequentially probes the feature with the largest mutual information with the label of the data; a rough sketch of this greedy step appears below. To estimate the mutual information, the system needs to sample factitious instances, which is discussed next.
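
One plausible rendering of the greedy policy above is sketched here; it reuses the hypothetical population_mutual_information helper from the earlier sketch and is only a schematic reading of the policy, not a definitive implementation.

```python
import numpy as np

def greedy_policy(clf, S, unknown, weights=None, h_stop=0.1,
                  max_probes=10, n_probed=0):
    """Return ('label', y) when the stop criteria are met,
    otherwise ('probe', i) for the most informative unknown feature."""
    m = len(S)
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights, dtype=float)
    p_y = w @ clf.predict_proba(S)                  # p~(y; S)

    h = -float(np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0])))
    if h < h_stop or n_probed >= max_probes or not unknown:
        return ('label', int(np.argmax(p_y)))       # arg max_y p~(y; S)

    # arg max_i I~(x_i, y; S), using the population mutual information sketch above
    gains = {i: population_mutual_information(clf, S, i, w) for i in unknown}
    return ('probe', max(gains, key=gains.get))
```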


In general cases, it is difficult to directly model and learn the large number of combinations of Pr(x_i, y | x_obs). The system uses Monte Carlo methods to approximate Pr(x_i, y | x_obs) in the marginalization above. This section discusses the methods of sampling the virtual instance set 𝒮. Augmenting instances requires the underlying data model, which is usually domain specific.


In one embodiment, the system can explore a few commonly used data models. Based on these data models, the system is able to pick a suitable sampling method to draw virtual instances.


With some data models, such as Bayesian networks, random fields, and Boltzmann machines, the system knows the distribution of one feature given the rest of the features, i.e., Pr(x_i | x_−i), where −i represents the set {i′ : 1 ≤ i′ ≤ d and i′ ≠ i}.


If the underlying structure is known, the models can incorporate prior knowledge of the structure efficiently. Even without much prior knowledge, the system can in many cases use generalized linear models for each feature i against the rest of the features, for example, logistic regression

Pr(x_i = 0 | x_−i) = 1/(1 + exp(α^T x_−i + b)).
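
For binary features, one plausible (assumed, not prescribed) way to obtain the conditionals Pr(x_i | x_−i) is to fit one logistic regression per feature on the training data, as in this scikit-learn sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_conditionals(X_train):
    """Fit a model of Pr(x_i | x_-i) for each binary feature i."""
    d = X_train.shape[1]
    models = []
    for i in range(d):
        rest = np.delete(X_train, i, axis=1)          # x_-i
        target = X_train[:, i]
        if len(np.unique(target)) < 2:                # constant feature: store its value
            models.append(float(target[0]))
        else:
            models.append(LogisticRegression(max_iter=1000).fit(rest, target))
    return models

def conditional_prob_one(models, x, i):
    """Pr(x_i = 1 | x_-i) under the fitted per-feature model."""
    model = models[i]
    if isinstance(model, float):
        return model
    rest = np.delete(np.asarray(x, dtype=float), i).reshape(1, -1)
    return float(model.predict_proba(rest)[0, list(model.classes_).index(1)])
```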


Once the system has Pr(x_i | x_−i) for each i ∈ un, it is easy to obtain samples from the distribution Pr(x_un | x_obs) by the Gibbs sampling method. The Gibbs sampling is outlined in Algorithm 1.












Algorithm 1 Gibbs sampling from a conditional model

1: randomly draw x_un ∈ 𝒳_un
2: for j = 1 to burn-in number do
3:   for i ∈ un do
4:     update x by drawing x_i ∈ 𝒳_i with a probability of Pr(x_i | x_−i)
5:   end for
6: end for
7: return x









The Gibbs sampling method is easy to implement and has very few constraints. However, when the number of unknown features is large, it runs very slowly. Therefore, the system may also consider two special models to obtain samples.
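
A direct NumPy rendering of Algorithm 1 above, restricted to binary features, might look like the following; conditional_prob_one is the assumed per-feature conditional helper from the earlier sketch, and the burn-in length is an illustrative parameter.

```python
import numpy as np

def gibbs_sample(models, x_obs, unknown, d, burn_in=100, rng=None):
    """Algorithm 1: draw one completion of the unknown features by Gibbs sampling.

    models  -- per-feature conditional models giving Pr(x_i = 1 | x_-i)
    x_obs   -- dict {feature index: observed binary value}
    unknown -- list of unknown feature indexes
    d       -- total number of features
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(d)
    for i, v in x_obs.items():
        x[i] = v
    for i in unknown:                                 # 1: random initialization
        x[i] = float(rng.integers(0, 2))
    for _ in range(burn_in):                          # 2: burn-in sweeps
        for i in unknown:                             # 3: only unknown features move
            p1 = conditional_prob_one(models, x, i)   # 4: Pr(x_i = 1 | x_-i)
            x[i] = 1.0 if rng.random() < p1 else 0.0
    return x                                          # 7: one factitious instance
```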


Next, the operation of the system using a nonparametric kernel density model is discussed. When the amount of training data is large enough, kernel density estimation captures the distribution of the data relatively well. The kernel density model can be thought of as an extreme case of a mixture model. Kernel density estimation is a nonparametric method to estimate the distribution Pr(x | 𝒟),








\Pr(x \mid \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{X \in \mathcal{D}} K(x, X),





where K: 𝒳 × 𝒳 → ℝ is a kernel with ∫_x K(x, X) dx = 1, and 𝒟 = {X^(n) ∈ 𝒳} is the set of training data. When the system can write the kernel function K as








K(x, X) = \prod_{i} K_i(x_i, X_i),





the system can say the kernel is a product kernel. With a product kernel, the feature variables around each training data point can be thought of as “locally independent”. A widely used kernel, the normalized radial basis function (RBF) kernel, is a product kernel for continuous features, for example,









K_i(x_i, X_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x_i - X_i)^2}{2\sigma_i^2}\right),





where σ_i^2 is known as the bandwidth of the kernel. For binary features, the system can use a kernel like

K_i(x_i, X_i) = β_i + (1 − 2β_i)[x_i = X_i],

where a small positive β_i mimics the bandwidth of the kernel.
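
As a concrete sketch of these per-feature kernels and their product, with the bandwidths σ_i and β_i treated as illustrative parameters:

```python
import numpy as np

def rbf_kernel_i(x_i, X_i, sigma_i=1.0):
    """Normalized RBF kernel component for a continuous feature."""
    return np.exp(-(x_i - X_i) ** 2 / (2.0 * sigma_i ** 2)) / (np.sqrt(2.0 * np.pi) * sigma_i)

def binary_kernel_i(x_i, X_i, beta_i=0.05):
    """Kernel component for a binary feature; a small beta_i mimics a bandwidth."""
    return beta_i + (1.0 - 2.0 * beta_i) * float(x_i == X_i)

def product_kernel(x, X, kernels):
    """K(x, X) = prod_i K_i(x_i, X_i), with one kernel component per feature."""
    return float(np.prod([k(a, b) for k, a, b in zip(kernels, x, X)]))
```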


The system can use a sampling method that takes advantage of a product kernel with little learning cost. Here, the system assumes it has a product kernel. The system can efficiently infer the marginal distribution Pr(x_un | x_obs), because








\Pr(x_{un} \mid x_{obs}) = \Pr(x_{un}, x_{obs}) / \Pr(x_{obs}) \propto \frac{1}{\Pr(x_{obs})} \sum_{X \in \mathcal{D}} K(x, X) \propto \sum_{X \in \mathcal{D}} K_{obs}(x_{obs}, X_{obs}) \prod_{i \in un} K_i(x_i, X_i),





where K_obs(x_obs, X_obs) = ∏_{i∉un} K_i(x_i, X_i). Then, the system can sample Pr(x_un | x_obs) as shown in Algorithm 2.












Algorithm 2 Sampling from a kernel density model

1: draw X ∈ 𝒟 with a probability proportional to K_obs(x_obs, X_obs)
2: for all i ∈ un do
3:   draw x_i ∈ 𝒳_i with a probability of K_i(x_i, X_i)
4: end for
5: return x = (x_obs, x_un)









When the bandwidth of the kernel K goes to zero, the algorithm degrades to sampling X where X_obs = x_obs. This degradation can cause sampling from an empty set and over-fitting issues. In one embodiment, a heuristic variant of the nonparametric sampling method, similar to one used for texture synthesis, can be used, where only the k nearest neighbors are used as samples, without giving an explicit formula for the distribution of the samples.
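
The following NumPy sketch follows Algorithm 2 for binary features with the β kernel above; the training matrix D, the kernel parameter beta, and the sample count are assumptions made only for illustration.

```python
import numpy as np

def sample_from_kernel_density(D, x_obs, unknown, n_samples=100, beta=0.05, rng=None):
    """Algorithm 2: draw factitious completions of x_obs from a kernel density model.

    D       -- training data, array of shape (n, d) with binary features
    x_obs   -- dict {feature index: observed value}
    unknown -- list of unknown feature indexes
    """
    rng = np.random.default_rng() if rng is None else rng
    obs_idx = np.array(sorted(x_obs))
    obs_val = np.array([x_obs[i] for i in obs_idx])

    # step 1: weight each training case X by K_obs(x_obs, X_obs)
    agree = D[:, obs_idx] == obs_val
    k_obs = np.prod(np.where(agree, 1.0 - beta, beta), axis=1)
    probs = k_obs / k_obs.sum()

    samples = np.zeros((n_samples, D.shape[1]))
    samples[:, obs_idx] = obs_val
    for s in range(n_samples):
        X = D[rng.choice(len(D), p=probs)]            # draw X proportional to K_obs
        for i in unknown:                             # steps 2-4: perturb each unknown X_i
            flip = rng.random() < beta                # K_i keeps X_i with probability 1 - beta
            samples[s, i] = 1.0 - X[i] if flip else X[i]
    return samples                                    # step 5: x = (x_obs, x_un)
```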


Next, exemplary test results are presented. The experiment uses document data from Topic Detection and Tracking (TDT-2). The system picked a total of 4802 documents from the seven topics with the largest numbers of documents. The system selected the top 100 most informative words (features) from the documents. All features are converted into binary values, i.e., “True” if the document contains the word, “False” otherwise. The data set is split into 70% for training and 30% for testing. First, the system uses multinomial logistic regression and a decision tree to classify the documents, assuming all feature values are known. The results show the upper bound that active feature probing could achieve.
















method                 error rate
logistic regression     5.0%
decision tree          10.9%










To simulate the technical support center scenario in which callers have provided some information, the experiment assumes that one random “True” feature is known. The system tries random probing as a baseline. The random probing approach randomly selects a few features other than the given “True” feature(s), augments the data with the default value, “False”, and classifies with the logistic regression model.
















random probing    error rate
 0 probings       75%
 1 probing        74%
10 probings       60%
20 probings       51%
30 probings       44%
40 probings       38%











The decision tree approach ignores the known features and uses a static decision tree. For comparison, the system limits the number of features to be probed, i.e., the maximum depth of the tree. The results are as follows.
















decision tree    error rate
5 probings       18.2%
8 probings       12.2%











The system tries two of the proposed approaches. One uses generalized linear regression as the data model with Gibbs sampling, as described above, denoted GLR. The other uses kernel density estimation with nonparametric sampling, as described above, denoted KDE. The results are as follows.
















active feature probing    error rate
GLR, 5 probings           16.1%
GLR, 8 probings           11.5%
KDE, 5 probings            8.3%
KDE, 8 probings            5.4%










As the results show, active feature probing results in higher accuracy than random probing. The active feature probing methods also outperform the static decision tree. The accuracy of the KDE method with 8 probings almost matches that of the logistic regression method on all features.


The above system can perform classification when the available information is insufficient. The system addresses gathering further information by actively probing the information source. The active feature probing system dynamically selects the most informative feature to probe, based on the information gain. To estimate the information gain, the system uses sampling techniques to avoid building a large number of classifiers. The system can deploy several sampling methods for difficult situations, including a nonparametric sampling method based on kernel density estimation. These methods outperform random probing and static decision tree probing.


The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.


By way of example, a block diagram of a computer to support the system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).


Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.


The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.

Claims
  • 1. A method to classify information, comprising: augmenting existing information with factitious data; performing active feature probing using the factitious data; generating a model based on the active feature probing with factitious data; classifying information using the model; and determining a goal for a given xobs as:
  • 2. The method of claim 1, comprising learning existing information.
  • 3. The method of claim 2, comprising generating one or more feature vectors by clustering data from the existing information and the factitious data.
  • 4. The method of claim 3, comprising generating a classification model using the one or more feature vectors.
  • 5. The method of claim 1, learning information by clustering data from the existing information and the factitious data to generate one or more feature vectors and by generating a classification model from the one or more feature vectors.
  • 6. The method of claim 1, wherein the classifying information is done at run-time.
  • 7. The method of claim 6, comprising evaluating information in a new case to be classified.
  • 8. The method of claim 7, wherein the evaluating information comprises determining entropy of a set of factitious cases generated from the new case.
  • 9. The method of claim 6, comprising identifying the most similar past case as a solution.
  • 10. The method of claim 1, comprising gathering additional information for the new case.
  • 11. The method of claim 1, comprising determining a confidence on one or more outcomes after several probings.
  • 12. The method of claim 11, comprising measuring an entropy for the one or more outcomes with
  • 13. The method of claim 1, comprising determining a feature to probe.
  • 14. The method of claim 1, comprising finding a feature to minimize an expected loss.
  • 15. The method of claim 1, comprising obtaining the distribution of class label given all features, Pr(y|x), by a classification method.
  • 16. The method of claim 15, comprising sampling a number of augmented instances using a Monte Carlo method.
  • 17. The method of claim 16, comprising approximating Pr(xi, y|xobs) without explicitly summing over all possible un\i.
  • 18. The method of claim 1, comprising drawing virtual instances.
  • 19. The method of claim 18, comprising obtaining samples from a distribution using a Gibbs sampling method.
  • 20. The method of claim 18, comprising obtaining samples from a distribution using a kernel density estimation.
  • 21. A method to classify information, comprising: augmenting existing information with factitious data; performing active feature probing using the factitious data; generating a model based on the active feature probing with factitious data; classifying information using the model; and
  • 22. A method to classify information, comprising: augmenting existing information with factitious data; performing active feature probing using the factitious data; generating a model based on the active feature probing with factitious data; classifying information using the model;
Parent Case Info

This application claims the benefit of U.S. Provisional Application 60/869,799, filed Dec. 13, 2006, the content of which is hereby incorporated by reference.

US Referenced Citations (7)
Number Name Date Kind
6058205 Bahl May 2000 A
6557011 Sevitsky Apr 2003 B1
7087018 Comaniciu et al. Aug 2006 B2
7458936 Zhou et al. Dec 2008 B2
7738705 Casadei Jun 2010 B2
20040133083 Comaniciu et al. Jul 2004 A1
20040193036 Zhou et al. Sep 2004 A1
Related Publications (1)
Number Date Country
20080147852 A1 Jun 2008 US
Provisional Applications (1)
Number Date Country
60869799 Dec 2006 US