This invention relates generally to prediction of responses using mathematical algorithms for quality measurements and more specifically to the use of Classification and Regression Tree (CART) analysis for prediction of responses.
In a financing example, an amount collected on a charged off loan is a function of many demographic variables, as well as historic and current information on the debtor. To predict the amount paid for an individual borrower, a statistical model would need to be built from an analysis of trends between the account information and the amount paid by “similar” borrowers, that is, borrowers with similar profiles. CART tools allow an analyst to sift, i.e., data mine, through the many complex combinations of these explanatory variables to isolate which ones are the key drivers of an amount paid.
Commercially available tools for CART analysis exist; however, there is no known tool that allows the user to build a model that predicts more than one measurement at a time (i.e., more than one response in a CART application). It would be desirable to develop a CART tool that allows a user to predict more than one measurement at a time, thereby allowing for a multivariate response CART analysis.
The present invention is, in one aspect, a method of allowing inclusion of more than one response variable in a Classification and Regression Tree (CART) analysis. The method includes predicting y using p explanatory variables, where y is a multivariate response vector. A statistical distribution function is then described at “parent” and “child” nodes using a multivariate normal distribution, which is a function of y. A split function, in which “child” node distributions are individualized compared to the parent node, is then defined.
Classification and Regression Tree (CART) analysis is founded on the use of p explanatory variables, X1, X2, . . . , Xp, to predict a response, y, using a multi-stage, recursive algorithm as follows:
1. For each node, P, evaluate every eligible split, s, of the form Xi∈S, Xi∉S, on each predictor variable, by associating a split function, φ(s,P)≧0 which operates on P. The split forms a segregation of data into two groups. The set S can be derived in any useful way.
2. Choose the best split for each node according to φ(s,P). This could be the maximum or minimum split function value for that node, for example. Each split produces two child nodes.
3. Repeat 1 and 2 for each child node.
4. Stop when a priori conditions are met.
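The four steps above can be sketched in Python as a recursive routine. This is a minimal illustration, not the tool described here: the names `Node` and `grow`, the use of threshold sets S = {Xi ≤ t} for numeric predictors, the choice of the maximum split function value as "best", and the stopping parameters `min_size` and `min_gain` are all assumptions made for the example. The split function is passed in as `phi`, so any of the split functions discussed in this description could be supplied.

```python
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np


@dataclass
class Node:
    indices: np.ndarray                # rows of the data that fall in this node
    feature: Optional[int] = None      # index i of the splitting variable Xi
    threshold: Optional[float] = None  # split is Xi <= threshold vs. Xi > threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def grow(X: np.ndarray, y: np.ndarray, indices: np.ndarray,
         phi: Callable[[np.ndarray, np.ndarray, np.ndarray], float],
         min_size: int = 5, min_gain: float = 1e-9) -> Node:
    """Recursively partition the rows in `indices` (steps 1 to 4)."""
    node = Node(indices=indices)
    # Step 4 (assumed stopping rule): too few rows to form two children.
    if len(indices) < 2 * min_size:
        return node

    best_gain, best_i, best_t = min_gain, None, None
    # Step 1: evaluate every eligible split on every predictor variable.
    for i in range(X.shape[1]):
        values = np.unique(X[indices, i])
        for t in (values[:-1] + values[1:]) / 2.0:  # midpoints between values
            go_left = X[indices, i] <= t
            left, right = indices[go_left], indices[~go_left]
            if len(left) < min_size or len(right) < min_size:
                continue
            gain = phi(y[indices], y[left], y[right])
            # Step 2: keep the best split (here, the largest phi value).
            if gain > best_gain:
                best_gain, best_i, best_t = gain, i, t

    if best_i is None:  # Step 4: no split improves on min_gain.
        return node

    # Step 3: repeat for each of the two child nodes.
    go_left = X[indices, best_i] <= best_t
    node.feature, node.threshold = best_i, best_t
    node.left = grow(X, y, indices[go_left], phi, min_size, min_gain)
    node.right = grow(X, y, indices[~go_left], phi, min_size, min_gain)
    return node
```

Because `y` is indexed by row only, the same routine accepts a single response column or an n×r matrix of responses; only `phi` needs to know how to score a multivariate node.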
The p(y) notation is used, with subscripts where appropriate, to describe the probability density functions at the parent and child nodes in the sequel. Parameter nomenclature typically associated with multivariate continuous distributions is used in
Several measures of diversity in the univariate response setting have been advocated. One, called Node Impurity, is negative entropy:
I(P)=−∫log(p(y))p(y)dy
with a split function defined as φ(s,P)=I(P)−I(L)−I(R).
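As a worked instance of the node impurity integral, assume (for illustration only) that the response at node P is univariate normal with mean μ and variance σ². Evaluating the integral gives a closed form:

```latex
p(y) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left(-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right)
\quad\Longrightarrow\quad
I(P) = -\int \log\bigl(p(y)\bigr)\,p(y)\,dy = \tfrac{1}{2}\log\bigl(2\pi e\,\sigma^{2}\bigr)
```

Under this assumption the impurity depends on the variance but not on the mean, so a split function built only from I(P), I(L) and I(R) responds to differences in spread between nodes rather than differences in location.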
Other known regression tree methodologies handle longitudinal data by using split functions that address within-node homogeneity in either mean structure (a Hotelling/Wald-type statistic) or covariance structure (a likelihood ratio split function), but not both. Another methodology uses five multivariate split criteria that involve measures of generalized variance, association, and fuzzy logic. In addition, the use of tree methods on multiple binary responses, with the introduction of a generalized entropy criterion, has been investigated.
CART analysis and methodology can be applied, for example, for valuation of non-performing commercial loans. A valuation of n non-performing commercial loans involves ascribing (underwriting) the loans with values for a recovery amount, expressed as a percentage of unpaid principal balance, and a value for recovery timing, expressed in months after an appropriate baseline date (e.g., date of acquisition). Recovery amount and timing information is sufficient to calculate the present value of future cash flows, a key part of portfolio valuation. Underwriters of defaulted loans use their individual and collective experience to ascribe these values. Statistical models can be used to associate underwriters' values with key loan attributes that shed light on the valuation process.
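The step from recovery amount and timing to present value can be illustrated with a short Python sketch. The function name `present_value`, the treatment of the recovery as a single lump-sum cash flow, the monthly compounding convention, and the discount rate parameter are assumptions for the example; the text above specifies only that amount is a percentage of unpaid principal balance and timing is in months after the baseline date.

```python
def present_value(unpaid_balance: float, recovery_pct: float,
                  recovery_months: float, annual_rate: float) -> float:
    """Discount one expected recovery back to the baseline date.

    unpaid_balance  -- unpaid principal balance of the loan
    recovery_pct    -- recovery amount as a percentage of unpaid balance
    recovery_months -- recovery timing, in months after the baseline date
    annual_rate     -- assumed annual discount rate (e.g. 0.12 for 12%)
    """
    cash_flow = unpaid_balance * recovery_pct / 100.0
    monthly_rate = annual_rate / 12.0  # assumed monthly compounding
    return cash_flow / (1.0 + monthly_rate) ** recovery_months
```

For example, `present_value(100_000, 40, 12, 0.12)` discounts a 40,000 recovery expected twelve months after acquisition. Both inputs matter: a model that predicts only the amount, or only the timing, cannot produce this value on its own.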
In the multivariate normal case, the node impurity equation results in
An implementation using the above equation, with maximum likelihood estimations imputed, when compared to the split function acts as a diversity measure on covariance structure only. A Hotelling/Wald-type statistic, as a diversity measure on mean structure only, results in:
Node 66 is split into two nodes 72 and 74 where node 72 signifies that 121 of the 132 loans of node 66 are the subject of a lawsuit, while node 74 signifies that eleven of the loans are in collections. The 121 loans of node 72 are further separated into nodes 76 and 78, showing that of the 121 loans that are subjects of lawsuits, 54 are secured by assets such as real estate, shown in node 76, while 67 of the loans are unsecured, shown by node 78.
Typically, in known applications, separate CART models are built for each response variable. Described below are applications where a single multivariate CART model, which uses multiple response variables, is built. The form of the probability density function under multivariate normality is:
p(y) = (2π)^(−nr/2) |Σ|^(−n/2) exp{−(1/2) tr[Σ^(−1) (y−μ)ᵀ(y−μ)]}
where n=sample size (number of observations), r=number of response variables, y=n×r matrix of response values, μ=n×r matrix of mean response values, where each row is the same r-vector mean, and Σ=r×r matrix of covariance values for the responses. The structure of the above equation encompasses repeated measures and time series models. It is assumed that the observations are not correlated, i.e., the covariance matrix for the rows of y is the identity matrix of size n. Node homogeneity, as depicted in
where E_{L,R} signifies the expected value, taken over the joint distribution arising from the child nodes. Note that the implied node impurity measure in the above equation is related to the node impurity equation in the univariate case, in that node impurity is measured in comparison with a proposed split, s, and the child probability density functions involved:
I(s,P) = −∫ log(p_P(y)) p_L(y_L) p_R(y_R) dy_L dy_R = −E_{L,R}[log(p_P(y))].
Under the probability density function for p(y), the split function φ is calculated using matrix calculus:
In one embodiment, the present invention uses Kullback-Leibler divergence as a node split criterion. This criterion has an interpretation related to the node impurity function earlier described. Kullback-Leibler divergence is a general measure of discrepancy between probability distributions that is usually a function of mean and covariance structure.
That φ(s,P) is a valid split function is guaranteed by the information inequality, which states that KL(p_L p_R, p_P)≧0, and equals zero if and only if p_L p_R = p_P, i.e., the parent node is optimally homogeneous. Kullback-Leibler divergence, in this context, measures the information gain, resulting from the use of individualized statistical distributions for the child nodes in
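A Kullback-Leibler split function of this kind can be sketched in Python with NumPy. This is an illustration under stated assumptions rather than the implementation described here: the helper names `mle`, `kl_normal` and `kl_split` are hypothetical, each node is modeled as multivariate normal with maximum likelihood estimates plugged in, rows are treated as independent (so the divergence of the joint child distribution from the parent is a sum over observations), and the standard closed form for the divergence between two normal distributions is used.

```python
import numpy as np


def mle(y):
    """Maximum likelihood mean vector and covariance matrix (divisor n)."""
    mu = y.mean(axis=0)
    centered = y - mu
    return mu, centered.T @ centered / len(y)


def kl_normal(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for r-variate normal distributions."""
    r = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - r + logdet1 - logdet0)


def kl_split(yL, yR):
    """Divergence of the individualized child fits from the pooled parent fit.

    yL, yR are (n_L x r) and (n_R x r) arrays of responses in the two
    proposed child nodes; the parent node is their union.
    """
    muP, SP = mle(np.vstack([yL, yR]))
    muL, SL = mle(yL)
    muR, SR = mle(yR)
    # Independent rows: each observation contributes the divergence of its
    # own child distribution from the parent distribution.
    return (len(yL) * kl_normal(muL, SL, muP, SP)
            + len(yR) * kl_normal(muR, SR, muP, SP))
```

The value is non-negative and is zero when both children have the same fitted mean and covariance as the parent, in line with the information inequality. Under these assumptions the trace and mean terms cancel against the dimension term once maximum likelihood estimates are plugged in, so the value reduces to (n/2)log|Σ_P| − (n_L/2)log|Σ_L| − (n_R/2)log|Σ_R|, where the parent covariance estimate absorbs any difference between the child means. The criterion is therefore sensitive to both mean and covariance structure.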
Another split function used in practice for univariate response settings, and adaptable for multivariate responses is the least squares split function:
where ȳ signifies the sample average of observations, with the subscript designating the node from which the sums and averages come. The split function equation
in this case reduces to Σ=σ²=Σ_L=Σ_R, r=1, and the implementation version of the above equation, with maximum likelihood estimations imputed, is proportional to:
and agrees with the least squares equation, but for the dependence on sample sizes n_L and n_R.
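For comparison, a least squares split function can be sketched in Python as the reduction in within-node sum of squares. The form used here (parent sum of squares minus the two child sums of squares) is the one commonly used in regression trees and is assumed for illustration; the helper names `sum_sq` and `least_squares_split` are hypothetical, and summing squared deviations across all response columns is one assumed way of adapting it to multivariate responses.

```python
import numpy as np


def sum_sq(y):
    """Sum of squared deviations from the node average, over all columns."""
    y = np.asarray(y, dtype=float)
    return float(((y - y.mean(axis=0)) ** 2).sum())


def least_squares_split(yL, yR):
    """Reduction in within-node sum of squares achieved by the split."""
    yP = np.concatenate([yL, yR], axis=0)  # parent node = union of children
    return sum_sq(yP) - sum_sq(yL) - sum_sq(yR)
```

For a single response with child values 1, 2, 3 and 7, 8, 9, the parent sum of squares is 58 and each child contributes 2, so the split function is 54. The same number equals (n_L·n_R/n) times the squared difference of the child averages, which shows that this criterion responds to mean structure only.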
Server 112 is configured to perform multivariate CART analysis to assess valuation and to predict future performance in non-performing commercial loans. In one embodiment, server 112 is coupled to computers 114 via a WAN or LAN. A user may dial in or log in directly to an Intranet or the Internet to gain access. Each computer 114 includes an interface for communicating with server 112. The interface allows a user to input data relating to a portfolio of non-performing loans and to receive valuations of the loans and predictions of future loan performance. A CART analysis tool, as described above, is stored in server 112 and can be accessed by a requester at any one of computers 114.
As shown by the commercial loan example, multivariate CART response methodology is useful for determination of recovery timings and amounts, and is more efficient than known univariate response models in that one model is used to data mine through multiple covariates to predict future loan performances.
While the invention has been described in terms of various specific embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the claims.