The invention is related to the field of loan risk assessment and the determination of risk associated with a plurality of loan accounts. The invention is specifically directed towards a system, method, and apparatus for loan risk prediction via utilization of multiple algorithms to independently select features from a plurality of loan account histories X, the plurality of loan account histories containing variables x describing each loan account. A computing device utilizes one or a plurality of algorithms to independently select features from the plurality of loan account histories, the selected features being functions of the received variables x. The selected features are then grouped into a first data structure xf. A voting algorithm or voting algorithms are then applied to the selected features, and the results are grouped into a second data structure xr. A third data structure xI of interaction terms is then generated from the second data structure xr. A fourth data structure, xNL, is then defined by the mathematical union xr∪xI or x∪xI (where x denotes the set of all the original features in X). These data structures are used directly and indirectly to generate further data structures and various models for loan risk prediction.
This application is related to the co-filed U.S. patent application Ser. No. 14/221,944 and U.S. patent application Ser. No. 14/222,099. These patent applications are incorporated in their entirety here.
The personal lending industry, including the lending of student loans, auto loans, commercial loans, and mortgages, as well as other types of personal loans, is valued at trillions of dollars in the United States in the twenty-first century. The total value of mortgages outstanding alone in the United States is $10 trillion. The total value of all student loans outstanding in the United States as of 2013 is between $902 billion and $1 trillion. The sheer volume of this debt leads to a large amount of competition among lenders, each trying to extend the greatest number of loans which have a reasonable chance of being repaid with interest. The tendency to over-purchase existing personal loan accounts from other lenders, as well as to over-lend, leads to situations such as that presented in the 2009 Financial Crisis, in which defaults on large numbers of mortgages and mortgage-backed securities consisting of individual homeowners' mortgages led to the failure of the entire banking industry, and the need for government bailouts to prevent another Great Depression.
Personal loan accounts consist of accounts such as auto loans, home mortgages, personal lines of credit, credit cards, student loans, and similar types of lending arrangements made to individuals. Whether a lender or loan servicer obtains management of personal loan accounts through direct lending or via assignment of an existing personal loan account, the need to obtain information on loan risks remains. In any event, once management of a personal loan account has been obtained, it is necessary to continuously monitor the potential for default of the personal loan account itself. Collection services likewise require information on the status of loans, on whether collection should be pursued, and on how aggressively to pursue it. Monitoring of loan account status is required to determine whether the personal loan remains an asset valuable enough to remain “on the books” or whether to file a lawsuit against the personal loan holder to collect on the debt, sell the personal loan to another owner or loan servicer, or take similar extreme recourse.
Accordingly, a need exists for a system, method, and apparatus for loan risk prediction which facilitates assessment of future risk and other statistics regarding a plurality of loan account histories.
The present invention is directed towards a system, method, and apparatus for loan risk prediction comprising receiving by a computing device a plurality of loan account histories X containing variables x transmitted from a database; utilizing by the computing device a plurality of algorithms to independently select features from the plurality of loan account histories (in various embodiments, the plurality of algorithms number between two and eight), the selected features being functions of the received variables x; grouping the selected features selected from the plurality of loan account histories into a first data structure xf; applying by the computing device a voting algorithm or voting algorithms to the selected features selected from the plurality of loan account histories and grouping results into a second data structure xr; generating by the computing device a third data structure xI of interaction terms from the second data structure xr; generating by the computing device a fourth data structure xNL where xNL equals xr∪xI or x∪xI. A model then executes selecting significant features from the fourth data structure xNL, and generates a fifth data structure xNLR. The fourth data structure xNL may also be used to form a data structure XNL, by selecting elements of X whose indices are in the fourth data structure xNL. The fifth data structure xNLR may be used to form a data structure XNLR by selecting elements of X whose indices are in xNLR.
A nonlinear model is generated y=f(XNLR) where f is a nonlinear function, the nonlinear model y indicating risk associated with each of the received plurality of loan account histories on a monthly or other periodic basis for a time period into the future.
The plurality of algorithms independently selecting features may select features from the plurality of loan account histories by operating in parallel (i.e., simultaneously) or sequentially (i.e., one after another). The plurality of algorithms may be two or more of the following: (1) an Elastic Net algorithm; (2) a LASSO algorithm; (3) a Stepwise Regression with the RIC Penalty Algorithm; and/or (4) a Multivariate Adaptive Regression Splines Algorithm.
In a further embodiment of the invention the second data structure xr is used by the computing device to create a data structure Xr that is, in turn, used to generate a linear model, the linear model indicating risk associated with each of the received plurality of loan account histories on a periodic basis for a time period into the future. The time period into the future may be one week, one month, two months, six months, or one year. The linear model may be defined by an equation z=g(Xr). The data structure Xr is formed by selecting elements of X whose indices are in xr. This may occur, by way of example, via selection of elements in the columns of X whose column indices are in xr.
In an embodiment of the invention, the voting algorithm or voting algorithms are applied to the selected features selected from the plurality of loan account histories to create a second data structure xr, and also perform the steps of: (1) selecting variables that appear at least r times in the first data structure xf, (2) selecting variables that appear r times pairwise, and/or (3) selecting variables that appear r times in models that have a certain average accuracy.
In another embodiment of the invention, after generating the nonlinear model y, M algorithms are used to independently confirm features in the generated nonlinear model y. M may be an integer between one and eight, and the M algorithms may be one or more of the following: an Elastic Net Algorithm, a LASSO Algorithm, a Stepwise Regression with the RIC Penalty Algorithm, and/or a Multivariate Adaptive Regression Splines Algorithm.
In a further embodiment of the invention, the third data structure xI of interaction terms comprises sets of two elements and sets of three elements.
Finally, in another embodiment of the invention the generated nonlinear model y is stored in a non-transitory computer-readable storage for future use with test data.
All embodiments of the invention must utilize computing devices to process the large amounts of data being considered (i.e., hundreds, thousands, or even millions of loan account histories, together with even more variables describing such loan account histories), making manual processing of the large amounts of data impractical and allowing for fast scanning and early risk warning for a plurality of loan account histories associated with a large amount of data.
These and other aspects, objectives, features, and advantages of the disclosed technologies will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
Describing now in further detail these exemplary embodiments with reference to the figures as described above, the system, method, and apparatus for Voting Mechanism and Multi-Model Feature Selection to Aid for Loan Risk Prediction, is described below. It should be noted that the drawings are not to scale.
“Homoscedasticity” and “heteroscedasticity” are typically defined within the context of a sequence or a vector of random variables in the field of statistics. A sequence is “homoscedastic” if, even though the variables or vectors are random, they possess approximately the same finite variance. A sequence is “heteroscedastic” if, on the other hand, the variables within a sequence of random variables or vectors possess largely dissimilar variances. Whether a sequence possesses a dissimilar variance or not is determined by comparison to a “heteroscedasticity score threshold.” In the field of statistics, homoscedasticity or heteroscedasticity is tested for using the White test, the Breusch-Pagan test, the Koenker-Basset test, Goldfeld-Quandt test, or any other means presently existing or after-arising. Within the context of this patent application and related patent applications, “homoscedasticity” or “heteroscedasticity” refers to the homoscedasticity or heteroscedasticity of provided sample data, i.e., sample data involving a plurality of loan account histories which are transmitted from a database.
A “loan account” (within the context of this and associated patent applications) and the associated “loan account history” describing the loan account is a record of debt for the lending of money (typically, for a specific purpose such as a payment for school tuition, refinancing a house, purchasing an automobile, etc.). A loan account contains one or more of the following: principal amount, interest rate, terms of repayment, date(s) of repayment, etc. As discussed within this patent application and associated patent applications, a loan account and an associated loan account history will exist in a format accessible to a computing device for processing, such as a spreadsheet, a .csv file, a matrix (as defined by certain programming languages), an array, a database entry, a linked-list, a tree-structure, or other types of computer files or variables (or any other presently existing or after-arising equivalent). Variables tracked include the origination date of the loan, the original amount of the loan, the remaining principal balance to be paid, the date of the monthly payment, the current interest rate, the terms of repayment, the number of original monthly payments, the number of remaining monthly payments, whether each monthly payment was timely (true/false), the number of days delinquent for every monthly payment (an integer from 0 upward), the credit score of the loan account holder at various points in time, etc. In a further embodiment of the invention, variables further include loan status (ls) (current or not), delinquency days (dd), and forbearance months (fm).
A “computing device,” as discussed in the context of this patent application and related patent applications, refers to one or multiple computer processors acting together, a logic device or devices, an embedded system or systems, or any other device or devices allowing for programming and decision making. Multiple computer systems may also be networked together in a local-area network or via the internet to perform the same function. In one embodiment, a computing device may be multiple processors or circuitry performing discrete tasks in communication with each other. The system, method, and apparatus described herein are implemented in various embodiments to execute on a “computing device” or “computing devices,” that is, as is commonly known in the art, a device specially programmed to perform the task at hand. A computing device is a necessary element to process the large amount of data (i.e., thousands, tens of thousands, hundreds of thousands, or even more loan accounts, loan account histories, and associated variables). Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium. Computer program code for carrying out operations of the present invention may operate on any or all of the “server,” “computing device,” “computer device,” or “system” discussed herein. Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages, such as Visual Basic, “C,” or similar programming languages. After-arising programming languages are contemplated as well.
A “data structure,” as discussed within the context of this patent application and related patent applications refers to a computer-based storage unit allowing for the storage of single or multiple types of data. The data structure may take the form of any computer-based storage unit functioning at any level of an OSI model, including computer files, .csv files, matrixes, a linked-list, arrays, tree structures, objects, variables, text files, SQL-databases or database entries, packets, frames, or any presently existing or after-arising equivalent. The “data structure” for the purposes defined herein can actually be one or multiple computer-storage units transmitted sequentially or in parallel.
Referring to
At step 130, selected features selected from the plurality of loan account histories are grouped into a first data structure xf. In one embodiment of the invention, the first data structure is implemented as or to include a vector xf=[xf1 . . . xfN]. Features whose indices appear more frequently in xf are more representative of the risk associated with the set of loan accounts X. In one embodiment of the invention, xf contains all the indices of the features present in X selected by the algorithms.
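The grouping at step 130 can be illustrated with a minimal Python sketch. It assumes X is held as a list of rows and that feature indices are 1-based column numbers. The two scoring functions (variance and value_range) and the helper names top_k_columns and pool_selected_features are hypothetical stand-ins chosen only to keep the sketch self-contained; they are not the Elastic Net, LASSO, Stepwise Regression, or Multivariate Adaptive Regression Splines algorithms named in this application.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def variance(col):
    """Population variance of one column (toy scoring rule, an assumption)."""
    mean = sum(col) / len(col)
    return sum((v - mean) ** 2 for v in col) / len(col)


def value_range(col):
    """Spread between largest and smallest value (toy scoring rule, an assumption)."""
    return max(col) - min(col)


def top_k_columns(X, k, score):
    """Return the 1-based indices of the k columns of X with the highest score."""
    cols = list(zip(*X))
    ranked = sorted(range(len(cols)), key=lambda j: score(cols[j]), reverse=True)
    return [j + 1 for j in ranked[:k]]


def pool_selected_features(X, selectors, parallel=True):
    """Run every selector on X and concatenate the selected indices into xf."""
    if parallel:
        # Selectors run simultaneously; map() keeps results in selector order.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda s: s(X), selectors))
    else:
        # Selectors run one after another.
        results = [s(X) for s in selectors]
    return [idx for selected in results for idx in selected]


X = [[1, 10, 5],
     [2, 10, 9],
     [3, 10, 1]]
selectors = [
    partial(top_k_columns, k=2, score=variance),
    partial(top_k_columns, k=2, score=value_range),
]
xf = pool_selected_features(X, selectors)
xf_sequential = pool_selected_features(X, selectors, parallel=False)
```

Because xf keeps one entry per selection, an index chosen by several selectors appears several times, which is the property the voting stage relies on.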
At step 140 a voting algorithm or voting algorithms are applied to the selected features selected from the plurality of loan account histories and the results are grouped into a second data structure xr. In an embodiment of the invention, the second data structure xr is generated from the vector xf as a subset of feature indices, containing the indices of the features whose index appears at least r times in vector xf. In a further embodiment of the invention, r is predefined by default or set by a user as a value between 1 and a fraction of N (e.g., the nearest integer to 20, 30, 40 or 50% of N). Other embodiments may increase this further or change the value of r. Increasing r, while decreasing accuracy, does improve processing time. In yet a further embodiment of the invention the voting algorithm or algorithms include (1) selecting variables such that they have appeared r times pairwise in the first data structure xf; (2) selecting variables such that they appear r times in models that have a certain average accuracy; (3) selecting variables such that they appear r times pairwise; and (4) selecting variables such that occurrences in models with higher weightage (because of model type, efficiency, etc.) are included. The voting algorithm or algorithms produce a subset of features that will be used as potential individual (linear) and interaction (nonlinear) terms during the derivation of a nonlinear model. The voting algorithm or algorithms also function to select the more statistically significant features as selected by multiple algorithms.
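The simplest voting rule described above, keeping every index that appears at least r times in xf, can be sketched in Python as follows. The function name vote and the sample vector are illustrative assumptions; the pairwise, accuracy-based, and weighted variants are not shown.

```python
from collections import Counter


def vote(xf, r):
    """Return the sorted feature indices that appear at least r times in xf."""
    counts = Counter(xf)
    return sorted(idx for idx, count in counts.items() if count >= r)


# Illustrative pooled selections: index 5 was chosen by only one algorithm.
xf = [1, 3, 8, 1, 3, 5, 1, 8, 3]
xr = vote(xf, r=2)
xr_strict = vote(xf, r=3)
```

With these sample values, raising r from 2 to 3 drops index 8, which shows how a larger r yields a smaller feature set.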
The second data structure xr may be used to form a data structure Xr that is, in turn, used to generate a linear model, the linear model indicating risk associated with each of the received plurality of loan account histories on a periodic basis for a time period into the future. The linear model may be defined by an equation z=g(Xr). The data structure Xr may be formed by selecting all the elements of X whose indices are in xr (such as, for example, all the elements in the columns of X whose column indices are in xr).
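As a rough illustration, the column selection X(*, xr) and a linear function g might look like the Python sketch below. It assumes 1-based column indices and a row-major list-of-lists layout for X. The weights and bias are hypothetical placeholders; in practice they would be fitted to historical data, which this sketch does not attempt.

```python
def select_columns(X, indices):
    """Form X(*, indices): keep the columns of X whose 1-based index is listed."""
    return [[row[j - 1] for j in indices] for row in X]


def linear_model(Xr, weights, bias=0.0):
    """Evaluate z = g(Xr) as a weighted sum per row (weights are assumed given)."""
    return [bias + sum(w * v for w, v in zip(weights, row)) for row in Xr]


X = [[1, 2, 3],
     [4, 5, 6]]
xr = [1, 3]
Xr = select_columns(X, xr)
z = linear_model(Xr, weights=[0.5, 2.0], bias=1.0)
```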
At step 150, a third data structure xI of interaction terms is generated from the second data structure xr by the computing device. As previously, in some embodiments of the invention the third data structure xI takes the form of a vector or any sort of computer-implemented structure. The “interaction terms” are, in some embodiments, a vector of all possible combinations of elements in xr. In further embodiments of the invention, interaction terms comprise sets of two elements and sets of three elements in xr. Let xI denote the set of all the interaction terms formed from all the elements of the set xr. For example, if xr=[1 3 8] and the interaction terms comprise sets of two elements of xr, then xI=[(1,3) (1,8) (3,8) (1,1) (3,3) (8,8)].
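The example above, in which each element is also paired with itself, corresponds to combinations with replacement. A minimal Python sketch under that reading follows; the function name interaction_terms and the orders parameter are assumptions, and whether three-element terms may repeat an index is likewise assumed by analogy with the two-element example.

```python
from itertools import combinations_with_replacement


def interaction_terms(xr, orders=(2,)):
    """Build xI as tuples of indices from xr, one batch per requested order."""
    terms = []
    for order in orders:
        terms.extend(combinations_with_replacement(xr, order))
    return terms


xr = [1, 3, 8]
xI_pairs = interaction_terms(xr)                         # two-element terms only
xI_pairs_and_triples = interaction_terms(xr, orders=(2, 3))
```

The ordering of terms differs from the listing in the text, but the set of pairs is the same.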
Optionally, after step 150 execution proceeds to step 160 or step 165. At step 160, a fourth data structure xNL is generated using the formula xNL=xr∪xI. The mathematical “∪” (or “union”) operator has the typical meaning one of skill in the art would assign to it, specifically the meaning associated with the mathematical union operator. Optionally, execution may proceed from step 150 to 165 where the fourth data structure is generated with a new feature set xNL=x∪xI, containing all the original features in X, plus interaction terms between features selected by the voting stage with a potentially different value of r. The fourth data structure xNL, as previously, may take the form of a vector in some embodiments of the invention or any sort of computer-implemented structure.
In an embodiment of the invention, the new feature set xNL=xr∪xI is used to create a new data structure XNL. XNL is, in turn, input to a nonlinear model that seeks to further reduce the set of features contained in xNL and produce a reduced set of features xNLR, whose use in predictive tasks results in better performance than the selection of features discussed in connection with step 120. The new data structure XNL is formed by X(*, xNL), or equivalently by X(*, xr)∪X(*, xI). XNL may also be formed by X∪X(*, xI). Since xI contains indices denoting interaction terms, X(*, xI) consists of columns containing the element-wise product between the columns indexed by the elements of xI. For example, if xI=[(1,3) (1,8) (3,8) (1,1) (3,3) (8,8)], then a column of X(*, xI) comprises the element-wise multiplication between columns 1 and 3 of X, another comprises the element-wise multiplication between columns 1 and 8 of X, and so on.
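The construction of XNL can be sketched in Python as below, assuming X is a list of rows and indices are 1-based. The function name build_XNL and the keep_all_original flag are hypothetical; the flag switches between the xr∪xI form and the x∪xI form described for steps 160 and 165.

```python
from math import prod


def build_XNL(X, xr, xI, keep_all_original=False):
    """Append one element-wise product column per interaction term in xI.

    The leading columns are X(*, xr), or all of X when keep_all_original is True.
    """
    rows = []
    for row in X:
        base = list(row) if keep_all_original else [row[j - 1] for j in xr]
        interactions = [prod(row[j - 1] for j in term) for term in xI]
        rows.append(base + interactions)
    return rows


X = [[2, 3, 5],
     [1, 4, 6]]
xr = [1, 3]
xI = [(1, 3), (1, 1), (3, 3)]
XNL = build_XNL(X, xr, xI)
XNL_all = build_XNL(X, xr, xI, keep_all_original=True)
```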
In a further embodiment of the invention, the heteroscedasticity score of xNL may be calculated. This process is discussed in J. R. Schott, “A Test for the Equality of Covariance Matrices when the Dimension is Large Relative to the Sample Sizes,” J
for every k, may be defined, to minimize
instead of eTe=y−ŷ, to account for the heteroscedastic data. This is further discussed in C. Tofallis, “Least Squares Percentage Regression,” J
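The defining equations for this passage are not reproduced in the text above. Judging only from the title of the cited Tofallis paper, one plausible reading is that the ordinary residual y(k)−ŷ(k) is replaced by the relative residual (y(k)−ŷ(k))/y(k) and that the sum of its squares is minimized. The Python sketch below illustrates that reading for a single predictor; it is an assumption, not the formula of this application, and the function name fit_percentage_line is hypothetical. It requires every y value to be nonzero.

```python
def fit_percentage_line(xs, ys):
    """Fit y ≈ a + b*x by minimizing sum(((y - a - b*x) / y) ** 2).

    Dividing each residual by y turns the problem into ordinary least squares
    of the constant 1 on u = 1/y and v = x/y, solved here by the 2x2 normal
    equations. All y values must be nonzero.
    """
    u = [1.0 / y for y in ys]
    v = [x / y for x, y in zip(xs, ys)]
    suu = sum(ui * ui for ui in u)
    svv = sum(vi * vi for vi in v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    su, sv = sum(u), sum(v)
    det = suu * svv - suv * suv
    a = (su * svv - suv * sv) / det
    b = (suu * sv - suv * su) / det
    return a, b


# Noise-free data generated from y = 2 + 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
a, b = fit_percentage_line(xs, ys)
```

Weighting each residual by 1/y means that an error of a given size counts for more on a small y than on a large one, which is the usual motivation for relative-error fitting when the variance grows with the magnitude of the response.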
At step 170, a model executes that selects significant features from the fourth data structure xNL to form a fifth data structure xNLR. In an embodiment of the invention, xNL may be further reduced to generate a new feature set xNLR; that is, feature selection algorithms may be executed on the features indicated by xNL, which, it should be noted, may contain interaction terms. In an embodiment of the invention, a single model selects significant features via operation in a simultaneous or sequential fashion. In an alternate embodiment of the invention, a plurality of models is executed to select significant features.
At step 172, the fourth data structure xNL is used to form XNL by selecting elements of X whose indices are in the fourth data structure xNL. At step 175, the fifth data structure xNLR may be used to form a data structure XNLR by selecting elements of X whose indices are in xNLR.
As execution proceeds to step 180, a nonlinear model y=f(XNLR) is generated. In an embodiment of the invention, XNLR is a subset of XNL. f is a nonlinear function, the nonlinear model y indicating risk associated with each of the received plurality of loan account histories on a periodic basis for a time period into the future. XNLR is formed by X(*, xNLR). The result is a low-dimensional nonlinear model with high accuracy. In an embodiment of the invention, risk is indicated via output of risk factors y∈R^n assigned to all bank accounts j months ahead (Mc+j) from the current month. Let y(k)∈R denote the risk factor assigned to bank account k. The data structure XNLR may be formed by selecting elements in X (via review of the columns of X or other means) whose indices are in xNLR. The generated nonlinear model y is stored in a non-transitory computer-readable storage medium for future use with test data.
In a further embodiment of the invention at step 180, a computation of risk associated with each bank account is performed based upon the value of three variables at month Mc+j: loan status (ls), delinquency days (dd), and forbearance months (fm). Other variables may be used in further embodiments. In various embodiments the computation of risk values or risk intervals associated with each bank account is performed by inspection of the set x. Generation of rules to assign risk values or risk intervals may be performed via standard logic, fuzzy logic, or even via an expert carrying out an inspection of the accounts themselves prior to later calculations by the computing device as discussed herein. The time period into the future for which risk is calculated for the plurality of loan accounts may be one week, one month, two months, six months, one year, or any other time period.
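A rule set built with standard logic over the three variables could resemble the Python sketch below. Every threshold and every risk value in it is a hypothetical placeholder chosen for illustration; the application does not specify them, and a fuzzy-logic or expert-derived rule set would look different.

```python
def risk_value(ls_current, dd, fm):
    """Map loan status, delinquency days, and forbearance months to a risk value.

    ls_current: True if the loan status is current at month Mc+j.
    dd: delinquency days (non-negative integer).
    fm: forbearance months (non-negative integer).
    All cut-offs and outputs below are illustrative assumptions.
    """
    if not ls_current or dd >= 90:
        return 1.0   # not current, or seriously delinquent
    if dd >= 30 or fm >= 6:
        return 0.6   # moderately delinquent or long forbearance
    if dd > 0 or fm > 0:
        return 0.3   # minor delinquency or short forbearance
    return 0.0       # current with no delinquency or forbearance


# One (ls, dd, fm) triple per account.
accounts = [(True, 0, 0), (True, 10, 0), (True, 45, 0), (False, 0, 0)]
y = [risk_value(ls, dd, fm) for ls, dd, fm in accounts]
```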
At step 185, M algorithms independently confirm features in the generated nonlinear model y. The M algorithms utilized may be, for example, an Elastic Net algorithm, a LASSO algorithm, a Stepwise Regression with the RIC penalty algorithm, and a Multivariate Adaptive Regression Splines Algorithm. At step 190, execution terminates in an embodiment of the invention. Other embodiments of the invention allow for returning to start 100 in order to perform further calculations by the computing device.
Referring to
Referring to
Referring to
The preceding description has been presented only to illustrate and describe the invention. It is not intended to be exhaustive or to limit the invention to any precise form disclosed. Many modifications and variations are possible in light of the above teachings.
The preferred embodiments were chosen and described in order to best explain the principles of the invention and its practical application. The preceding description is intended to enable others skilled in the art to best utilize the invention in its various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims.
The invention described herein is to be construed in a manner consistent with all relevant local, municipal, federal, and international laws and is not intended to violate the law in any way.