The embodiments described herein relate to a system and method for active learning and modeling for field-specific data streams. According to various aspects of the embodiments, a system and method are used to actively learn from and model field-specific data. For example, the system may employ query-by-transduction (QBT) as the active learning algorithm, which may be used to generate training data for classifying unlabelled data points. Based on whether data points are interesting, the system may selectively and iteratively add the data points to the training data until an appropriate stopping threshold is reached. Once the training data is generated, the system may use the training data to classify unlabelled data.
In operation, classifications of prior unlabelled data points may be used in a variety of applications. For example, in a stream-based setting, the system may observe streaming data points and dynamically classify the observed data points. In particular, among other applications in a stream-based setting, streaming video may be analyzed to detect changes to the streaming video (such as, for example, detecting scene changes in a movie and monitoring security/surveillance cameras). In a pool-based setting, the system may use the training data to select relevant data from a pool of data points. In particular, among other applications in a pool-based setting, relevant medical data points (such as from a patient's medical record) may be selected to assist in the diagnosis, prognosis, and care of a patient.
According to various aspects of the embodiment, active learning device 110 may use data observing module 116 to observe (i.e., receive and/or select) data points 106a . . . 106n from data sources 104a . . . 104n. Data sources 104a . . . 104n may be streaming and/or pooled, as appropriate. In other words, data points 106a . . . 106n may be streaming data and/or be pooled data.
According to various aspects of the embodiment, at least one processor 114 may initialize a Support Vector Machine (SVM) 115. SVM 115 is a classifier that provides classifications of data within training data 160, thereby providing an analytical framework for classifying data points 106a . . . 106n. Training data 160 may be dynamically generated as data points 106a . . . 106n are observed by data observing module 116 and selectively added to training data 160. To generate training data 160, active learning device 110 may initialize SVM 115 with training data 160 that includes an initial set of data points, from which classifications of the initial set of data points are generated.
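By way of illustration, the following is a minimal sketch of initializing SVM 115 with training data 160 that includes an initial set of classified data points. It assumes scikit-learn's SVC as one possible realization of SVM 115; the seed points, labels, and parameter values are illustrative placeholders rather than part of the embodiments.

```python
# Minimal sketch: initialize SVM 115 with seed training data 160.
# scikit-learn's SVC is assumed as one possible SVM implementation;
# the points, labels, and parameters below are illustrative only.
import numpy as np
from sklearn.svm import SVC

# Initial set of classified data points (training data 160).
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([-1, -1, 1, 1])  # yi in {-1, +1}

# Soft-margin SVM with regularization parameter C and kernel function K(.).
svm = SVC(C=1.0, kernel="rbf")
svm.fit(X_train, y_train)
```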
Using the analytical framework provided by SVM 115, QBT module 118 generates training data 160 by, among other things, selectively adding observed data points 106a . . . 106n to training data 160. QBT module 118 selectively adds a data point(s) 106a . . . 106n to training data 160 when the data point(s) 106a . . . 106n is interesting. Data point(s) 106a . . . 106n may be interesting when an uncertainty exists regarding whether the data point(s) 106a . . . 106n belongs to one of at least two classifications within the analytical framework provided by SVM 115. Such uncertainty suggests that the data point(s) 106a . . . 106n belongs to a new classification of data, thereby enriching training data 160 with the new classification. Thus, active learning system 100 may learn from a data point(s) 106a . . . 106n that is interesting because the data point(s) 106a . . . 106n may represent a new classification of data. QBT module 118 may continue generating training data 160 until a stopping threshold indicates that training should terminate.
According to an aspect of the embodiment, once generated, training data 160 may be used to identify particular data among a plurality of data points (not shown) that may be important or otherwise relevant for a particular field. For example, data selection device 170 may use training data 160, trained on a particular field, to identify important information from among a diverse or otherwise large body of information. In particular, data selection device 170 may use training data 160 to mine medical databases and/or other health records of patients (not shown) in order to identify which medical information is relevant for particular patients, particular diseases, diagnoses, and/or other particular fields. A healthcare professional may use data selection device 170 to mine a patient's medical record and select data from the medical record that may be important for diagnosing the patient, for example. In this manner, training data 160 may be trained on a variety of fields in order to identify important information for each field.
According to various aspects of the embodiment, label assignment module 202 may assign a plurality of labels to an observed data point(s) 106a . . . 106n according to the classifications of training data 160 from SVM 115. Each label indicates a possible classification of the observed data point(s) 106a . . . 106n from among the classifications of training data 160, based on the analytical framework provided by SVM 115. For example, if training data 160 includes data that is classified according to nine classifications, nine labels (one for each classification) may be assigned to an observed data point(s) 106a . . . 106n, each predicting that the observed data point(s) 106a . . . 106n belongs to the respective classification. The foregoing is an example only; any number of classifications may exist within training data 160, and all or at least a portion of the classifications may be predicted for an observed data point(s) 106a . . . 106n by a respective label as appropriate.
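As a brief sketch of this label assignment, one candidate label may be enumerated per classification present in training data 160. The function name and data below are hypothetical:

```python
# Sketch of label assignment module 202: one candidate label per
# classification found in training data 160 (names are illustrative).
import numpy as np

def assign_candidate_labels(y_train):
    """Return one candidate label per distinct classification."""
    return np.unique(y_train)

y_train = np.array([-1, -1, 1, 1])
candidate_labels = assign_candidate_labels(y_train)  # array([-1, 1])
```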
According to various aspects of the embodiment, confidence analysis module 204 may determine a confidence metric for each of the assigned labels using SVM 115. The confidence metric may indicate a level of confidence that a corresponding assigned label predicts a classification for the observed data point(s) 106a . . . 106n. According to various aspects of the embodiment, the confidence metric is a p-value, which may be calculated using a measure of strangeness. Strangeness is a measure of how much a data point(s) 106a . . . 106n differs from other data points.
According to various aspects of the embodiment, strangeness (and therefore a p-value) may be determined based on the analytical framework provided by SVM 115. For example, given training data 160 {(x1,y1),(x2,y2), . . . ,(xn,yn)}, where yi ∈ {−1,1}, SVM 115 seeks the separating hyperplane that yields a maximal margin. In the separable case, the set of training data 160 is separated without error and the distance between the closest training data 160 and the hyperplane is maximal; in the nonseparable case, the margin may be maximized with minimum misclassification loss. When an unknown instance xn+1 is included in training data 160 with a potential label yn+1=y*, the Lagrange multipliers α1,α2, . . . ,αn,αn+1 associated with the data in training data 160 and (xn+1,y*) may be used as the strangeness measure under SVM 115. The Lagrange multipliers αi, i=1, . . . ,n+1, may be found by maximizing the dual formulation of the soft-margin SVM 115, which may be expressed as:

W(α) = Σi αi − (1/2) Σi,j αiαjyiyjK(xi,xj), (1)

where the sums run over i,j=1, . . . ,n+1, subject to the constraints

Σi αiyi = 0

and 0 ≤ αi ≤ C, i=1, . . . ,n+1, where K(·,·) is a kernel function. Strangeness and the Lagrange multipliers are related as follows: sets of training data 160 outside the margin have zero Lagrange multipliers; for sets of training data 160 on the margin, the values of the Lagrange multipliers are between 0 and C; and sets of training data 160 within the margin have the Lagrange multiplier value C. The sets of training data 160 within the margin are therefore more strange than sets of training data 160 outside the margin.
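The strangeness measure may be sketched as follows, again assuming scikit-learn's SVC: SVM 115 is refit on training data 160 augmented with the tentative example (xn+1, y*), and the Lagrange multipliers are read from the fitted model. SVC exposes yi·αi for support vectors via its dual_coef_ attribute, and αi = 0 for points outside the margin; the function name and defaults are illustrative:

```python
# Sketch: strangeness as the Lagrange multipliers of a soft-margin SVM
# fit on training data 160 augmented with the tentative (x_{n+1}, y*).
import numpy as np
from sklearn.svm import SVC

def strangeness(X_train, y_train, x_new, y_star, C=1.0, kernel="rbf"):
    """Return alpha_1 .. alpha_{n+1} for the augmented training set."""
    X = np.vstack([X_train, np.asarray(x_new).reshape(1, -1)])
    y = np.append(y_train, y_star)
    svm = SVC(C=C, kernel=kernel)
    svm.fit(X, y)
    alphas = np.zeros(len(y))
    # dual_coef_ holds y_i * alpha_i for support vectors only; the
    # absolute value recovers alpha_i (0 < alpha_i <= C). All other
    # points lie outside the margin and keep alpha_i = 0.
    alphas[svm.support_] = np.abs(svm.dual_coef_[0])
    return alphas
```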
According to various aspects of the embodiment, a p-value function may be defined based on strangeness. For example, if xn+1 is an observed data point(s) 106a . . . 106n and αn+1y* is the strangeness of the observed data point(s) 106a . . . 106n for an assigned label y*, then t((x1,y1),(x2,y2), . . . ,(xn+1,y*)) may be the p-value of xn+1 for the assigned label y*, given training data 160 {(x1,y1),(x2,y2), . . . ,(xn,yn)}. In this example, a p-value function t: (X×Y)n+1→[0,1] may be expressed as:
t((x1,y1),(x2,y2), . . . ,(xn+1,y*)) = #{i=1, . . . ,n : αi ≥ αn+1y*} / n. (2)
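Equation (2) translates directly into a short helper; the name is illustrative, and the input is the strangeness vector α1, . . . ,αn+1 computed above, with the observed point's strangeness in the last position:

```python
# Sketch of the p-value function of equation (2): the fraction of the n
# training examples at least as strange as the tentative (x_{n+1}, y*).
import numpy as np

def p_value(alphas):
    """alphas: alpha_1 .. alpha_{n+1}; the last entry is alpha_{n+1}^{y*}."""
    alpha_new = alphas[-1]
    n = len(alphas) - 1
    return np.sum(alphas[:n] >= alpha_new) / n
```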
According to various aspects of the embodiment, confidence metrics of assigned labels may be analyzed to determine whether observed data point(s) 106a . . . 106n are interesting. As previously noted, data point(s) 106a . . . 106n may be interesting when an uncertainty exists regarding whether the data point(s) 106a . . . 106n belongs to one of at least two classifications within the analytical framework of SVM 115. According to various aspects of the embodiment, closeness selection module 206 may determine a closeness metric, which is a measure of uncertainty, among at least two assigned labels for the observed data point(s) 106a . . . 106n based on their respective confidence metrics. An uncertainty exists regarding whether the observed data point(s) 106a . . . 106n belongs to a first classification or a second classification when a difference between first and second confidence metrics is small. In other words, uncertainty increases as the difference between at least two confidence metrics approaches zero.
Three cases may exemplify determining whether an uncertainty exists between two labels, "j" and "k," assigned to observed data point(s) 106a . . . 106n, according to their respective confidence metrics Pj and Pk. In these examples, labels j and k predict that observed data point(s) 106a . . . 106n belongs to classifications "j" and "k," respectively. Confidence metrics Pj and Pk are the levels of confidence that labels j and k, respectively, plausibly predict classifications for the observed data point(s) 106a . . . 106n.
Case 1: Pj is high and Pk is low.
Case 2: Pj is high and Pk is high.
Case 3: Pj is low and Pk is low.
Cases 2 and 3 may indicate a data point(s) 106a . . . 106n that is interesting. In case 2, the observed data point(s) 106a . . . 106n plausibly belongs to either classification j or classification k; in case 3, it plausibly belongs to neither; in both cases there exists a level of uncertainty whether labels j and k predict a classification for data point(s) 106a . . . 106n. Case 1 may indicate that data point(s) 106a . . . 106n is not interesting because there may exist a high level of certainty that label j predicts that data point(s) 106a . . . 106n belongs to classification j. These cases are examples only, and an indication of "high" or "low" confidence metrics is not dispositive.
According to various aspects of the embodiment, closeness selection module 206 may determine a closeness score that measures a level of closeness between confidence metrics Pj and Pk (for example, the two highest confidence metrics for the observed data point(s) 106a . . . 106n, with Pj ≥ Pk). The closeness score may be expressed as:
Pj − Pk. (3)
Closeness selection module 206 may compare the closeness score to a selection threshold. When the closeness score is less than the selection threshold, data point(s) 106a . . . 106n may be determined to be interesting and added to training data 160.
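This selection rule may be sketched as follows: the two highest confidence metrics are compared, and the observed data point is deemed interesting when their difference (the closeness score of equation (3)) falls below the selection threshold. The threshold value is illustrative only:

```python
# Sketch of closeness selection module 206: compare the top two p-values
# against a selection threshold (the value 0.2 is illustrative).
import numpy as np

def is_interesting(p_values, selection_threshold=0.2):
    """p_values: one confidence metric per label assigned to the point."""
    p_sorted = np.sort(p_values)[::-1]     # descending order
    closeness = p_sorted[0] - p_sorted[1]  # Pj - Pk, equation (3)
    return closeness < selection_threshold

is_interesting(np.array([0.90, 0.10]))  # Case 1: False (not interesting)
is_interesting(np.array([0.85, 0.80]))  # Case 2: True (interesting)
is_interesting(np.array([0.05, 0.04]))  # Case 3: True (interesting)
```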
According to one aspect of the embodiment, SVM 115 may be initialized with training data 160 that includes an initial set of data points that are classified in an operation 302. In an operation 304, at least one data point(s) 106a . . . 106n may be observed. Data point(s) 106a . . . 106n may be interesting and may enrich the classifications included in training data 160. As such, in an operation 306, a determination may be made whether data point(s) 106a . . . 106n is interesting. If in an operation 308, data point(s) 106a . . . 106n is not interesting, processing may return to operation 304, wherein another data point(s) 106a . . . 106n is observed.
Returning to operation 308, if data point(s) 106a . . . 106n is determined to be interesting, data point(s) 106a . . . 106n may be added to training data 160 in an operation 310. Upon adding data point(s) 106a . . . 106n to training data 160 in operation 310, training data 160 may include a sufficient number of data points 106a . . . 106n. As such, in an operation 312, a determination is made whether training is complete. If in an operation 314 training is determined to be incomplete, a new data point(s) 106a . . . 106n may be observed in operation 304. If in operation 314 training is complete, training may be terminated in an operation 316, wherein training data 160 may be used to classify data.
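Tying the sketches together, the flow of operations 302 through 316 might be realized as the loop below, composed from the strangeness, p_value, and is_interesting helpers sketched above. The label given to an added point and the stopping threshold (here, a maximum number of additions) are assumptions for illustration; the embodiments leave both open to configuration:

```python
# Sketch of the training flow of operations 302-316 using the helpers
# sketched earlier (strangeness, p_value, is_interesting).
import numpy as np

def qbt_train(X_train, y_train, stream, selection_threshold=0.2,
              max_additions=50):
    """Iteratively grow training data 160 with interesting data points."""
    added = 0
    for x_new in stream:                                    # operation 304
        labels = np.unique(y_train)
        p_values = np.array([                               # operation 306
            p_value(strangeness(X_train, y_train, x_new, y_star))
            for y_star in labels
        ])
        if is_interesting(p_values, selection_threshold):   # operation 308
            # Assumption: label the added point with its most plausible
            # (highest p-value) classification.
            y_best = labels[np.argmax(p_values)]
            X_train = np.vstack([X_train, np.asarray(x_new).reshape(1, -1)])
            y_train = np.append(y_train, y_best)            # operation 310
            added += 1
            if added >= max_additions:                      # operations 312-316
                break
    return X_train, y_train
```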
In an operation 506, a determination may be made whether the closeness metric is below a selection threshold. The selection threshold may be predefined or otherwise configurable. When the closeness metric is less than the selection threshold, data point(s) 106a . . . 106n may be determined to be interesting and added to training data 160. As previously noted, data point(s) 106a . . . 106n is interesting when an uncertainty exists regarding whether the data point(s) 106a . . . 106n belongs to one of at least two classifications within the analytical framework of SVM 115, and that uncertainty increases as the difference between at least two confidence metrics approaches zero. In this manner, the selection threshold sets the level of certainty above which data point(s) 106a . . . 106n is deemed to be not interesting.
If in operation 506 the closeness metric is less than the selection threshold, data point(s) 106a . . . 106n is determined to be interesting in an operation 508 because, as the closeness metric approaches zero, greater uncertainty exists regarding whether data point(s) 106a . . . 106n belongs to the classifications respectively predicted by the labels corresponding to each of the top two confidence metrics. If in operation 506 the closeness metric exceeds the selection threshold, data point(s) 106a . . . 106n may be determined to be not interesting in an operation 510.
According to an aspect of the embodiment, active learning device 110 may be accessible over a network 108, via any wired or wireless communications link, using one or more user terminals 102. Network 108 may include any one or more of, for instance, the Internet, an intranet, a PAN (Personal Area Network), a LAN (Local Area Network), a WAN (Wide Area Network), a SAN (Storage Area Network), a MAN (Metropolitan Area Network), or other network. Examples of terminal 102 may include any one or more of, for instance, a personal computer, portable computer, personal digital assistant (PDA), workstation, web-enabled mobile phone, WAP device, web-to-voice device, or other device. Those having skill in the art will appreciate that the embodiments described herein may work with various system configurations.
In this specification, “a” and “an” and similar phrases are to be interpreted as “at least one” and “one or more.”
Many of the elements described in the disclosed embodiments may be implemented as modules. A module is defined here as an isolatable element that performs a defined function and has a defined interface to other elements. The modules described in this disclosure may be implemented in hardware, software, firmware, wetware (i.e., hardware with a biological element) or a combination thereof, all of which are behaviorally equivalent. For example, modules may be implemented as a software routine written in a computer language (such as C, C++, Fortran, Java, Basic, Matlab or the like) or a modeling/simulation program such as Simulink, Stateflow, GNU Octave, or LabVIEW MathScript. Additionally, it may be possible to implement modules using physical hardware that incorporates discrete or programmable analog, digital and/or quantum hardware. Examples of programmable hardware include: computers, microcontrollers, microprocessors, application-specific integrated circuits (ASICs); field programmable gate arrays (FPGAs); and complex programmable logic devices (CPLDs). Computers, microcontrollers and microprocessors are programmed using languages such as assembly, C, C++ or the like. FPGAs, ASICs and CPLDs are often programmed using hardware description languages (HDL) such as VHSIC hardware description language (VHDL) or Verilog that configure connections between internal hardware modules with lesser functionality on a programmable device. Finally, it needs to be emphasized that the above mentioned technologies are often used in combination to achieve the result of a functional module.
The disclosure of this patent document incorporates material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, for the limited purposes required by law, but otherwise reserves all copyright rights whatsoever.
In addition, implementations of the embodiment may be made in hardware, firmware, software, or any suitable combination thereof. Aspects of the embodiment may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable storage medium may include read only memory, random access memory, magnetic disk storage media, optical storage media, flash memory devices, and others, and a machine-readable transmission medium may include forms of propagated signals, such as carrier waves, infrared signals, digital signals, and others. Further, firmware, software, routines, or instructions may be described herein in terms of specific example aspects and implementations of the embodiment, and as performing certain actions. However, it will be apparent that such descriptions are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, or instructions.
Aspects and implementations may be described herein as including a particular feature, structure, or characteristic, but every aspect or implementation may not necessarily include the particular feature, structure, or characteristic. Further, when a particular feature, structure, or characteristic is described in connection with an aspect or implementation, it will be understood that such feature, structure, or characteristic may be included in connection with other aspects or implementations, whether or not explicitly described. Thus, various changes and modifications may be made to the provided description without departing from the scope or spirit of the embodiment. As such, the specification and drawings should be regarded as examples only, and the scope of the embodiment is to be determined solely by the appended claims.
While various embodiments have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail can be made therein without departing from the spirit and scope. In fact, after reading the above description, it will be apparent to one skilled in the relevant art(s) how to implement alternative embodiments. Thus, the present embodiments should not be limited by any of the above described exemplary embodiments. In particular, it should be noted that, for example purposes, the above explanation has focused on using p-values for confidence metrics. However, one skilled in the art will recognize that embodiments of the invention could use any other confidence metric.
In addition, it should be understood that any figures which highlight the functionality and advantages are presented for example purposes only. The disclosed architecture is sufficiently flexible and configurable, such that it may be utilized in ways other than that shown. For example, the steps listed in any flowchart may be re-ordered or only optionally used in some embodiments.
Further, the purpose of the Abstract of the Disclosure is to enable the U.S. Patent and Trademark Office and the public generally, and especially the scientists, engineers and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The Abstract of the Disclosure is not intended to be limiting as to the scope in any way.
Finally, it is the applicant's intent that only claims that include the express language “means for” or “step for” be interpreted under 35 U.S.C. 112, paragraph 6. Claims that do not expressly include the phrase “means for” or “step for” are not to be interpreted under 35 U.S.C. 112, paragraph 6.
This Application claims the benefit of U.S. Provisional Patent Application Ser. No. 61/053,350, filed May 15, 2008, which is hereby incorporated by reference herein in its entirety.
| Number | Date | Country |
|---|---|---|
| 61/053,350 | May 2008 | US |