This document relates generally to constructing and using computer predictive models and more particularly to using semi-supervised learning systems and methods for generating predictive models.
Computer predictive models have found applicability in many diverse areas. However, difficulty arises in using predictive models when the training targets are not fully known. A non-limiting example where predictive models encounter unknown targets is when predictive models are to assess whether fraud may have occurred with respect to monetary-related transactions. Current predictive model approaches have difficulty in discerning legitimate monetary-related transactions from fraudulent ones.
In accordance with the teachings provided herein, systems and methods for operation upon data processing devices are provided for performing semi-supervised learning. For example, a method and system can be configured to receive a target data set, wherein the target data set includes known targets and unknown targets. A supervised model such as a neural network model is generated using the known targets. The unknown targets are used with the neural network model to generate values for the unknown targets. Analysis with an unsupervised model (e.g., using an approach such as outlier detection analysis) is performed using the target data set in order to determine which of the unknown targets are outliers. A comparison of the list of outlier unknown targets is performed with the values for the unknown targets that were generated by the neural network model. The subset of unknown targets to investigate is determined based upon the comparison.
Semi-supervised situations involve generation of predictive models typically (but not always) by means of a small amount of labeled data and a large amount of unlabeled data (e.g., collectively target data set 42). Semi-supervised situations can arise because the cost associated with the labeling process may render a fully labeled training set impractical, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value.
The users 32 can interact with the predictive model construction system 34 through a number of ways, such as over one or more networks 36. A server 38 accessible through the network(s) 36 can host the predictive model construction system 34. Data store(s) 40 can store the data to be analyzed (e.g., target data set 42) as well as any intermediate or final data calculations and data results.
The predictive model construction system 34 can be a web-based tool that provides users with flexibility and functionality for generating predictive models when the targets in the target data set 42 are only partially known. Moreover, the predictive model construction system 34 can be used separately or in conjunction with other software programs, such as with other predictive model construction techniques.
With reference to
Process 50 constructs models for predicting target values for the entries contained in the target data set 42. Many different types of predictive models can be constructed, such as artificial neural network predictive models. An artificial neural network is constructed of interconnected neurons designed to model the target(s), and its structure changes based on how it is trained, such as through the training process 50. More specifically, neural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and targets or to find patterns in data.
The training process 50 begins with a set of interconnected nodes and alters the strength (e.g., weights) of the connections in the network to produce outputs. The training process 50 is provided with the input data about the target data set 42 and a cost function to be minimized. The cost function can be any function of the target values and predicted target values from the model under construction, such as the norm of the difference between the predicted and original target values. With the input data and the cost function, process 50 generates a neural network model using the known targets 100 (or at least a portion of the known targets 100). The unknown targets 110 (or at least a portion of the unknown targets 110) are used with the generated neural network model to generate values (i.e., results 130) for the unknown targets 110.
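As a non-limiting illustration of the training and scoring steps just described, the following minimal sketch fits a model on known targets by minimizing a squared-error cost and then scores unknown targets. It is an assumption-laden stand-in: a single sigmoid neuron replaces the full neural network, and the names `train_single_neuron`, `score`, the learning rate, the epoch count, and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_single_neuron(inputs, targets, epochs=2000, lr=0.5):
    """Adjust connection weights by gradient descent on a squared-error cost."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Gradient of 0.5 * (pred - t)**2 with respect to the pre-activation.
            grad = (pred - t) * pred * (1.0 - pred)
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def score(model, x):
    """Generate a value in (0, 1) for one observation."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical toy data: one input per entry; a target of 1 marks anomalous activity.
known_inputs = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
known_targets = [0, 0, 0, 1, 1, 1]
unknown_inputs = [[0.05], [0.95]]

model = train_single_neuron(known_inputs, known_targets)
results = [score(model, x) for x in unknown_inputs]
```

In this sketch only the known targets influence the weights; the unknown targets are passed through the trained model solely to obtain values for them.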
An outlier detection process 52 is performed using the target data set 42 for determining which of the unknown targets 110 are outliers. An unknown target being identified as an outlier by process 52 is an indication of anomalous activity, such as the possibility that fraud may have occurred.
In training process 52, both the input data set and the output data set are the target data set 42. The learning process tries to reproduce the input data as the target. Let the vector $x = (x_1, x_2, \ldots, x_p)^T$ represent an observation with p inputs to the unsupervised learning process, with mean $\mu$ and covariance $\Sigma$. Let the vector $y = (y_1, y_2, \ldots, y_p)^T$ represent the same observation with p outputs from the unsupervised learning process. The cost function can be any function of the input data and target output of the model under construction, such as, in this example, the Mahalanobis distance between the inputs and outputs, which is defined as $\sqrt{(x-y)^T \Sigma^{-1} (x-y)}$. The difference between the inputs and outputs is also defined as the reconstruction error of the unsupervised learning process and can be represented as $E = (x - y)$. The Mahalanobis distance based reconstruction error can thus be defined as $\sqrt{E^T \Sigma^{-1} E}$. The covariance matrix can be expressed in terms of the eigenvalue matrix $\Lambda$ and the eigenvector matrix $U$ as $\Sigma = U \Lambda U^T$. Therefore, $\sqrt{E^T \Sigma^{-1} E}$ can be computed from $\sqrt{(U^T E)^T \Lambda^{-1} (U^T E)}$. $\Sigma$ can be noisy. The computation of the reconstruction error can be done by using the first m eigenvalues in $\Lambda$ of the covariance matrix $\Sigma$. An alternate approach is to use the first m eigenvalues and a small value for the remaining eigenvalues. A more general approach could be to weight the different eigenvalues differently in computing the reconstruction error. The inputs with the highest reconstruction error are deemed to be anomalous inputs.
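The eigenvalue-based computation above can be sketched as follows, purely for illustration. The sketch assumes the eigenvalues and eigenvectors of the covariance matrix have already been obtained elsewhere and are supplied in descending eigenvalue order; the function name `mahalanobis_reconstruction_error` and the parameters `m` and `small_value` are hypothetical.

```python
import math


def mahalanobis_reconstruction_error(error, eigenvalues, eigenvectors,
                                     m=None, small_value=None):
    """Compute sqrt(sum_i (u_i . E)**2 / lambda_i) over the selected eigenvalues.

    error        -- reconstruction error E = x - y (list of p floats)
    eigenvalues  -- eigenvalues of the covariance matrix, largest first
    eigenvectors -- eigenvectors[i] is the eigenvector paired with eigenvalues[i]
    m            -- number of leading eigenvalues to use (default: all p)
    small_value  -- if given, substituted for the remaining p - m eigenvalues;
                    if omitted, the remaining components are dropped
    """
    p = len(eigenvalues)
    m = p if m is None else m
    total = 0.0
    for i in range(p):
        if i < m:
            lam = eigenvalues[i]
        elif small_value is not None:
            lam = small_value
        else:
            continue
        # Component of E along the i-th eigenvector (one entry of U^T E).
        projection = sum(u * e for u, e in zip(eigenvectors[i], error))
        total += projection ** 2 / lam
    return math.sqrt(total)


# Hypothetical two-input example with an axis-aligned covariance matrix.
eigenvalues = [4.0, 1.0]
eigenvectors = [[1.0, 0.0], [0.0, 1.0]]
E = [3.0, 4.0]

full = mahalanobis_reconstruction_error(E, eigenvalues, eigenvectors)
first_m = mahalanobis_reconstruction_error(E, eigenvalues, eigenvectors, m=1)
floored = mahalanobis_reconstruction_error(E, eigenvalues, eigenvectors,
                                           m=1, small_value=0.25)
```

Dropping the trailing eigenvalues ignores error along low-variance directions, whereas substituting a small value for them magnifies such error; which behavior is preferable depends on the application at hand.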
Process 140 performs a comparison between the list of unknown targets (that have been identified as outliers) and the values 130 for the unknown targets that were generated by the neural network model. The subset 150 of unknown targets to investigate is determined based upon the comparison by process 140.
The scores of the neural network model can assume a number of different formats. As an illustration, the scores can be binary in nature (e.g., a value of 1 to indicate that a transaction constitutes fraud; and a value of 0 to indicate that fraud has not occurred). The scores could also encompass a range of values. For example, a continuous range from 0 to 100 can indicate the degree by which an item in the target data set 42 can be considered fraudulent, wherein a value of 0 can establish the lower end of the fraud spectrum (i.e., not a fraud event) and a value of 100 can establish the upper end of the fraud spectrum.
Different techniques are available in order to perform the compression and uncompression processes 300 and 310 such as by using nonlinear replicator neural networks (e.g., autoregressive neural networks, etc.) as shown at 350 in
As shown in
Process 140 compares the results 130 (e.g., the unknown target scores) from the neural network with the rank order list from the outlier detection process 52. The comparison process 140 can use output subset criteria 500 in determining which of the unknown targets should be included in the subset 150 of unknown targets to investigate. As an example, the output subset criteria 500 can specify that the subset 150 should include only those unknown targets that have a relatively high score as determined by the neural network model as well as those that were highly ranked as outliers by the outlier detection process. The combination of the analyses performed by the neural network and by the outlier detection process enhances the fidelity of the selection of which of the unknown targets constitute anomalous behavior. It should be understood that many different types of criteria can be used in order to determine the subset of unknown targets to investigate, and that the criteria should be based upon the application at hand.
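One possible form of the comparison is sketched below as a non-limiting example. The function name `select_subset`, the score threshold, the `top_k` cutoff, and the toy identifiers are hypothetical. By default the sketch assumes the criteria require an unknown target to satisfy both conditions; because the criteria can also be read as admitting a target that satisfies either condition, a `require_both` flag is provided.

```python
def select_subset(scores, outlier_ranking, score_threshold=0.8, top_k=3,
                  require_both=True):
    """Pick unknown targets to investigate from model scores and an outlier ranking.

    scores          -- dict mapping target id to the supervised model's score
    outlier_ranking -- target ids ordered from most to least anomalous
    """
    high_scoring = {t for t, s in scores.items() if s >= score_threshold}
    top_outliers = list(outlier_ranking[:top_k])
    if require_both:
        # Assumed reading: high score AND highly ranked as an outlier.
        return [t for t in top_outliers if t in high_scoring]
    # Alternative reading: high score OR highly ranked as an outlier.
    rest = sorted(high_scoring - set(top_outliers),
                  key=lambda t: scores[t], reverse=True)
    return top_outliers + rest


# Hypothetical scores and rank order list for five unknown targets.
scores = {"t1": 0.95, "t2": 0.40, "t3": 0.85, "t4": 0.90, "t5": 0.10}
outlier_ranking = ["t3", "t2", "t1", "t5", "t4"]

both = select_subset(scores, outlier_ranking)
either = select_subset(scores, outlier_ranking, require_both=False)
```

Requiring both conditions yields a smaller subset on which the two analyses agree, which keeps the subsequent investigation focused.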
The investigation process 600 is resource efficient because the investigation (at this stage) only needs to focus on the subset 150 and not the entire (and potentially large) corpus of unknown targets. As an example of process 600, analyzing the subset 150 can include examination by human analysts of the unknown targets to determine which targets within the data subset 150 constitute anomalous behavior.
In addition to or as a supplement to manual investigation, analysis of the target subset 150 can include examination of the subset 150 by another software program that uses more resource intensive techniques to determine a more accurate set of values for the unknown targets in the subset 150.
In any event, process 600 results in more accurate target values being produced for the data subset 150. The target results are fed into process 50 so that the neural network model can be improved, which in turn yields more accurate results 130 for use by the comparison process 140 in subsequent iterations.
The predictive model construction approaches described herein can be utilized for many different purposes where target values are unknown. For example, predictive models can be constructed in order to analyze whether fraud may have occurred with respect to a financial transaction.
The target data set 800 includes both known targets 802 and unknown targets 808. The known targets 802 in this example include known fraud targets 804 or instances of fraud (e.g., an entity had filed a false tax return; an entity had used a stolen credit card to purchase an item; etc.) as well as known non-fraud targets (e.g., an entity had filed a correct tax return; an entity had made a legitimate credit card purchase; etc.). In this example, the number of legitimate type targets overwhelmingly outnumbers the known fraud targets. The unknown targets 808 constitute entities, such as individuals or organizations, whose transactions (e.g., filing a tax return; making a credit card purchase; etc.) do not contain an indication of whether fraud has occurred.
As discussed above, the target data set 800 contains only a partial set of known targets (e.g., known targets 802 and unknown targets 808). Process 810 trains a model (e.g., neural network) for predicting values or scores that are indicative of whether fraud has occurred with respect to the known targets 802.
With the input data and the cost function, process 810 generates a neural network model using the known targets 802. The unknown targets 808 are used with the generated neural network model to generate values (i.e., fraud-indicative scores 830) for the unknown targets 808.
An outlier detection process 820 is performed using the target data set 800 in order to determine which of the unknown targets 808 are outliers. An unknown target being identified in the results 840 as an outlier by process 820 is an indication of fraudulent activity. More specifically in this example, the results 840 of the outlier detection process 820 can be a rank order list. The rank order list contains the unknown targets that were determined to be outliers ordered by the error amount that they exhibited from the compression and uncompression processes. The unknown targets exhibiting a higher amount of error are considered more likely to involve fraudulent activity than the unknown targets that exhibit a smaller amount of error.
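The construction of such a rank order list can be sketched as follows. This is only an illustrative assumption of how the list might be built: the `reconstruct` callable stands in for the compression and uncompression processes, a plain Euclidean norm is used as the error amount for brevity, and the names `rank_order_list`, `error_cutoff`, and the toy data are hypothetical.

```python
import math


def rank_order_list(observations, reconstruct, error_cutoff):
    """Return (target id, error) pairs for outliers, largest error first."""
    outliers = []
    for target_id, x in observations.items():
        y = reconstruct(x)  # compress, then uncompress
        err = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
        if err > error_cutoff:
            outliers.append((target_id, err))
    return sorted(outliers, key=lambda pair: pair[1], reverse=True)


def mean_reconstruct(x):
    """Toy stand-in: compress two inputs to their mean, then expand back."""
    mean = sum(x) / len(x)
    return [mean] * len(x)


# Hypothetical unknown targets with two inputs each.
unknown = {
    "a": [1.0, 1.0],
    "b": [1.0, 3.0],
    "c": [0.0, 6.0],
    "d": [2.0, 2.2],
}

ranked = rank_order_list(unknown, mean_reconstruct, error_cutoff=1.0)
ranked_ids = [target_id for target_id, _ in ranked]
```

Targets whose inputs are poorly reproduced by the round trip appear at the head of the list and are treated as the more likely candidates for fraudulent activity.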
Process 850 performs a comparison between the list of unknown targets (that have been identified as outliers) and the scores 830 for the unknown targets that were generated by the neural network model. The subset 870 of unknown targets to investigate is determined based upon the comparison by process 850.
Process 850 compares the fraud-indicative scores 830 from the neural network with the rank order list from the outlier detection process 820. The comparison process 850 uses subset criteria 860 in determining which of the unknown targets should be included in the subset 870 of unknown targets to investigate. In this example, the subset criteria 860 specifies that the subset 870 should include only those unknown targets that have a relatively high score as determined by the neural network model as well as those that were highly ranked as outliers by the outlier detection process.
Process 880 performs an investigation of the unknown targets that passed the fraud-indicative criteria 860. As a result of the investigation 880, the true target values become known for the subset 870 and are used to retrain the neural network at 810. The retrained neural network is then applied to the remaining unknown targets in the target data set 800 in order to generate new fraud-indicative scores.
Similarly, the target values 890 that are known for the subset 870 can also be used to refine the outlier detection process as indicated in 1000. The improved outlier detection process 820 performs outlier detection upon the remaining unknown targets 808 in order to produce a new fraud-indicative outlier list. Process 850 then uses the new fraud-indicative outlier list as well as the new fraud-indicative scores generated by process 810 when it is to perform its comparison operations. Process 850 results in a new subset of unknown targets to investigate. The investigation and retraining operations can continue until the models have reached a particular level of precision and/or until no more investigations are desired.
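The iterative investigate-and-retrain cycle can be outlined with the following sketch. It is a simplified assumption of the control flow only: the callables `train`, `detect_outliers`, `compare`, and `investigate` are hypothetical placeholders, the toy implementations use a threshold model and a sort in place of a neural network and an outlier detector, the comparison assumes a target must satisfy both criteria, and refinement of the outlier detector itself is omitted.

```python
def semi_supervised_loop(known, unknown, train, detect_outliers, compare,
                         investigate, max_rounds=5):
    """Repeat train -> score -> detect -> compare -> investigate."""
    known, unknown = dict(known), dict(unknown)
    for _ in range(max_rounds):
        if not unknown:
            break
        model = train(known)
        scores = {tid: model(x) for tid, x in unknown.items()}
        ranking = detect_outliers(unknown)
        subset = compare(scores, ranking)
        if not subset:
            break  # nothing left that meets the criteria
        for tid in subset:
            x = unknown.pop(tid)
            known[tid] = (x, investigate(tid, x))  # true value now known
    return known, unknown


# Hypothetical toy implementations.
def train(known):
    fraud = [x for x, label in known.values() if label == 1]
    legit = [x for x, label in known.values() if label == 0]
    cut = (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2
    return lambda x: 1.0 if x > cut else 0.0


def detect_outliers(unknown):
    return sorted(unknown, key=lambda tid: unknown[tid], reverse=True)


def compare(scores, ranking, top_k=2):
    return [tid for tid in ranking[:top_k] if scores[tid] >= 0.5]


truth = {"u1": 0, "u2": 1, "u3": 1, "u4": 0, "u5": 1}
known = {"k1": (0.1, 0), "k2": (0.9, 1)}
unknown = {"u1": 0.2, "u2": 0.8, "u3": 0.95, "u4": 0.3, "u5": 0.7}

labeled, remaining = semi_supervised_loop(
    known, unknown, train, detect_outliers, compare,
    investigate=lambda tid, x: truth[tid])
```

Here the loop stops when no remaining unknown target meets the criteria or when `max_rounds` is reached; a precision-based stopping rule could be substituted.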
While examples have been used to disclose the invention, including the best mode, and also to enable any person skilled in the art to make and use the invention, the patentable scope of the invention is defined by claims, and may include other examples that occur to those skilled in the art. Accordingly the examples disclosed herein are to be considered non-limiting. As an illustration,
As another illustration, the systems and methods disclosed herein could use different types of models as supervised models and unsupervised models. For example, linear regression models and logistic regression models can be used as supervised models in the disclosed operational scenarios; and principal component analysis type models can be used as unsupervised models in the disclosed operational scenarios.
As yet another illustration, the systems and methods disclosed herein may be implemented on various types of computer architectures, such as for example on a networked system, on a single general purpose computer, etc. For example,
It should be understood that the analytical systems described herein (e.g., tax fraud analysis system, purchase card fraud analysis system, etc.) can be implemented in other ways, such as on a stand-alone computer for access by a user as shown at 1300 in
It is further noted that the systems and methods may include data signals conveyed via networks (e.g., local area network, wide area network, internet, combinations thereof, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data signals can carry any or all of the data disclosed herein that is provided to or from a device.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform methods described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, etc.) may be stored and implemented in one or more different types of computer-implemented ways, such as different types of storage devices and programming constructs (e.g., data stores, RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The systems and methods may be provided on many different types of computer-readable media including computer storage mechanisms (e.g., CD-ROM, diskette, RAM, flash memory, computer's hard drive, etc.) that contain instructions (e.g., software) for use in execution by a processor to perform the methods' operations and implement the systems described herein.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate a situation where only the disjunctive meaning may apply.
This application claims priority to and the benefit of U.S. Application Ser. No. 60/902,380, (entitled “Computer-Implemented Semi-supervised Learning Systems And Methods” and filed on Feb. 20, 2007), of which the entire disclosure (including any and all figures) is incorporated herein by reference. This application contains subject matter that may be considered related to subject matter disclosed in: U.S. Application Ser. No. 60/902,378, (entitled “Computer-Implemented Modeling Systems and Methods for analyzing Computer Network Intrusions” and filed on Feb. 20, 2007); U.S. Application Ser. No. 60/902,379, (entitled “Computer-Implemented Systems and Methods For Action Determination” and filed on Feb. 20, 2007); U.S. Application Ser. No. 60/902,381, (entitled “Computer-Implemented Guided Learning Systems and Methods for Constructing Predictive Models” and filed on Feb. 20, 2007); U.S. Application Ser. No. 60/786,039 (entitled “Computer-Implemented Predictive Model Generation Systems And Methods” and filed on Mar. 24, 2006); U.S. Application Ser. No. 60/786,038 (entitled “Computer-Implemented Data Storage For Predictive Model Systems” and filed on Mar. 24, 2006); and to U.S. Provisional Application Ser. No. 60/786,040 (entitled “Computer-Implemented Predictive Model Scoring Systems And Methods” and filed on Mar. 24, 2006); of which the entire disclosures (including any and all figures) of all of these applications are incorporated herein by reference.