The technology described herein relates generally to distributed data processing and more specifically to scenario analysis using distributed data processing.
In accordance with the teachings provided herein, systems and methods are provided for generating multiple system state projections for one or more scenarios. For example, a central coordinator software component executes on a root data processor and provides commands and data to a plurality of node coordinator software components. Each of the node coordinator software components is associated with, and executes on, a separate node data processor. The node data processors have volatile computer memory for access by a node coordinator software component and by threads executing on the node data processor. A node coordinator software component manages threads which execute on its associated node data processor and which perform a set of matrix operations with respect to a set of simultaneous linear equations. Stochastic simulations use results of the matrix operations to generate multiple state projections. Threads execute on their associated node data processors and perform a portion of the scenario evaluations based upon the state projections and upon scenario information provided by a user computer, thereby generating scenario evaluation results. The volatile computer memory of a node data processor retains the results of the scenario evaluations that were performed at that node data processor.
The central coordinator software component is configured to receive ad hoc questions from the user computer and provide responses to the ad hoc questions by aggregating and concatenating the scenario evaluation results provided by each of the node data processors.
The central coordinator software component processes the ad hoc questions from the user computer by instructing the node coordinator software component to access and process the results of the scenario evaluations that are stored in the volatile memory of its associated node data processor.
One or more data stores 36 can store the data to be analyzed by the grid computing environment 30 as well as any intermediate or final data generated by the grid computing environment. However, in certain embodiments, the configuration of the grid computing environment 30 allows its operations to be performed such that intermediate and final data results can be stored solely in volatile memory (e.g., RAM), without a requirement that intermediate or final data results be stored to non-volatile types of memory (e.g., disk).
This can be useful in certain situations, such as when the grid computing environment 30 receives ad hoc queries from a user and when responses, which are generated by processing large amounts of data, need to be generated on-the-fly. In this non-limiting situation, the grid computing environment 30 is configured to retain the processed information within the grid memory so that responses can be generated for the user at different levels of detail as well as allow a user to interactively query against this information.
In addition to the grid computing environment 30 handling such large problems, the grid computing environment 30 can be configured to allow a user to pose multiple ad hoc questions at different levels of granularity. For example, a user may inquire as to the relative risk exposure a particular set of stocks might have in the oil sector. To respond to this type of inquiry from the user, the grid computing environment 30 aggregates all of the oil sector price information together and makes a determination of the exposure that might exist in the future for the oil sector. Upon viewing the results, the user may wish to learn which specific oil company stocks are contributing the most risk. Without an OLAP or relational database environment being required, the grid computing environment 30 aggregates all of the oil company price information and makes a determination of the company-level risk exposure that might exist in the oil sector in the future. Additionally, because the underlying data results are retained throughout the queries of the user, the grid computing environment 30 can provide other items of interest. For example, in addition to a user's earlier query involving Chevron and Exxon stock, the user now wishes to add Sun Oil to the portfolio to see how the portfolio is affected. In response, the grid computing environment 30 adds the position pricing information that has already been generated and retained in memory for Sun Oil as well as for the other companies. As another example, the user can specify in a subsequent query that they wish to reduce the number of Exxon shares held and have that position analyzed.
As an example of an implementation environment, the grid computing environment 30 can comprise a number of blade servers, and a central coordinator 100 and the node coordinators (106, 108) are associated with their own blade server. In other words, a central coordinator 100 and the node coordinators (106, 108) execute on their own respective blade server. In this example, each blade server contains multiple cores, and as shown in
The central coordinator 100 comprises a node on the grid. For example, there might be 100 nodes, with only 50 nodes specified to be run as node coordinators. The grid computing environment 30 runs the central coordinator 100 as a 51st node, selecting the central coordinator node randomly from within the grid. Accordingly, the central coordinator 100 has the same hardware configuration as a node coordinator.
As shown in
With respect to data transfers involving the central coordinator 100, the central coordinator 100 communicates with the client (or another source) to obtain the input data to be processed. The central coordinator 100 divides up the input data and sends the correct portion of the input data for routing to the node coordinators. The central coordinator 100 also may generate random numbers for use by the node coordinators in simulation operations as well as aggregate any processing results from the node coordinators. The central coordinator 100 manages the node coordinators, and each node coordinator manages the threads which execute on its respective machine.
A node coordinator allocates memory for the threads with which it is associated. Associated threads are those that are in the same physical blade server as the node coordinator. However, it should be understood that other configurations could be used, such as multiple node coordinators being in the same blade server to manage different threads which operate on the server. Similar to a node coordinator managing and controlling operations within a blade server, the central coordinator 100 manages and controls operations within a chassis.
As shown in
The X′X matrix is further processed by performing, at the node coordinators, adjustments at 404 to the rows of X′X data stored at the node coordinators. This processing results in obtaining a root, such as a Cholesky root (L′ matrix). To generate the system state projections 412, stochastic simulations are performed at 410 at the node coordinators based upon the generated L′ matrix that was distributed to the node coordinators at 406 and based upon vectors of random numbers that were distributed to the node coordinators at 408. After the system state projections are calculated, each node coordinator will have a roughly equal number of system state projections, with each system state containing values for all of the factors from the input data.
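The Cholesky root computation described above can be sketched in pure Python (a minimal single-machine illustration of how the L′ factor is obtained from X′X, not the patent's distributed implementation; the function name and dense list-of-lists representation are assumptions for illustration):

```python
def cholesky_upper(a):
    """Return the upper-triangular Cholesky root U (the L' matrix),
    with U^T U = a, for a symmetric positive-definite matrix a."""
    p = len(a)
    u = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            # subtract contributions from the rows already finalized
            s = a[i][j] - sum(u[k][i] * u[k][j] for k in range(i))
            u[i][j] = s ** 0.5 if i == j else s / u[i][i]
    return u
```

In the grid, the same factor would be produced row by row across the node coordinators rather than on a single machine.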
The scenario condition information provided by the user is received by the central coordinator and distributed at 500 by the central coordinator to the node coordinators. Each node coordinator instructs its threads to call scenario analysis functions at 502 for the system state projections that are present on that node. When this is accomplished, each node coordinator has scenario analysis results for the system state projections for which it is responsible as shown at 504.
The grid computing environment examines the history of these risk factors to determine how they may affect stock prices. The grid computing environment then projects forward from the risk factor historical data (e.g., via a stochastic model) by generating at 700 market state projections 702 for all of the risk factors. For example, market state projections in this field may examine how oil prices, as well as currency exchange rates, varied over the past couple of years, and then perform stochastic simulations using the historical risk factor data to project how they might possibly perform in the future (e.g., over the next year).
As an illustration, the grid computing environment is provided with several years of historical information for the risk factors. As shown at 800 in the example of
For each of these market states (e.g., oil is at $75 over the next year on average, the dollar will be $1.39 to the euro, and unemployment will be 10%), the grid computing environment examines how much a person's 200 shares of Exxon stock will be worth, and similarly, how much the person's 300 shares of Chevron stock will be worth. The grid computing environment takes each of the market state projections into the future, and generates a price for the different stock positions.
To achieve a relatively high level of confidence, a large number of risk factors is examined. As an illustration, the number of risk factors in
With reference to
This input data set can be supplied by the user over a network and stored only in volatile memory, thereby helping, if needed, to mitigate security concerns. However, it should be understood that other situations may allow the input data set to be stored and provided on non-volatile medium.
For risk pricing applications which only involve a relatively small number of risk factors, processing time using conventional approaches can be acceptable. However, once the problem becomes inordinately large, such as having the grid computing environment track tens of thousands of risk factors (e.g., 40,000 risk factors), processing time can approach multiple days. In addition to the large number of risk factors, the issue is further exacerbated because, to acquire a needed level of confidence, the grid computing environment must also generate thousands of market state projections (e.g., 10,000 or more market state projections). This only serves to increase further the overall amount of processing time required to handle such large data sets, with some runs using conventional approaches lasting as many as 5-7 days.
As another indication of the relatively large nature of the problem, it is not uncommon for a user to provide a million positions to evaluate. With this number of positions to price and the grid computing environment generating 10,000 market state projections, this will result in 10 billion items to process. A grid computing environment as disclosed herein can be configured to efficiently handle such a large-scale problem.
With reference back to
To generate these curves for the risk factors, the grid computing environment uses stochastic simulation techniques. Stochastic simulation techniques differ from methods which use forecasting of risk factors to understand risk. For example, a forecasting model probably would not have predicted unemployment to have risen to 10% and beyond in 2009 because only a couple years ago it was much lower. In contrast, a stochastic simulation may have simulated a situation where unemployment did reach 10% and beyond in 2009.
After the market state projections are generated at 700, the next step involves pricing each of the positions at 704. A list of held stocks, bonds, or loans (e.g., positions) are received from the user. A pricing function uses this information as well as the generated market state projections to generate prices 706 for each of the positions under the different market state projections 702.
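The pricing step described above can be illustrated with a short sketch (hypothetical; `equity_price` and the position fields are illustrative stand-ins for the client-supplied pricing functions, which are far richer in practice):

```python
def price_positions(positions, states, price_fn):
    """Apply the pricing function to every position under every market
    state projection; prices[s][p] is position p's price under state s."""
    return [[price_fn(pos, state) for pos in positions] for state in states]

# Hypothetical pricing function: shares held times the simulated value of
# the risk factor the instrument is keyed to.
def equity_price(pos, state):
    return pos["shares"] * state[pos["factor_index"]]
```

With 1,000,000 positions and 10,000 states, the inner call would run 10 billion times, which is why the patent distributes this loop across threads on the grid.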
After the prices 706 of positions are generated, the next step is to process at 708 any queries from a user. Because the grid computing environment retains the pricing information on the grid, responses can be generated on the fly. In other words, the grid computing environment does not need to know beforehand what is to be asked. Previous approaches would have to pre-aggregate the data up to the level at which the user's question was asked (e.g., an industry sector level information), thereby losing more detailed pricing information (e.g., company-specific level information). In the grid computing environment disclosed herein, the grid computing environment keeps the lower level information live in memory and does not aggregate information until the grid computing environment receives a query from a user. Additionally, the pricing information staying out in the grid is in contrast to previous approaches wherein the data was written to a central disk location. The central disk location approach constituted a single point which operated as a bottleneck in the process.
As an example using the data of
Accordingly, the grid starts with rows of the X matrix, and the calculated X′X matrix will be “p” by “p.” Because the matrix is symmetrical, only the upper or lower triangular portion of the matrix is stored. In this example, the upper triangular portion is stored.
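The construction of the upper-triangular portion of X′X can be sketched as follows (pure Python, single process; the dictionary keyed by index pairs stands in for the chunked storage, and the `rows` argument stands in for the subset of input rows a given worker processes — both are assumptions for illustration):

```python
def upper_xtx(x, rows=None):
    """Accumulate the upper-triangular part of X'X for an n-by-p matrix X.
    `rows` restricts the accumulation to the input rows this worker owns."""
    n, p = len(x), len(x[0])
    if rows is None:
        rows = range(n)
    # store only entries (i, j) with i <= j, since X'X is symmetric
    tri = {(i, j): 0.0 for i in range(p) for j in range(i, p)}
    for r in rows:
        row = x[r]
        for i in range(p):
            for j in range(i, p):
                tri[(i, j)] += row[i] * row[j]
    return tri
```

Because the accumulation is a sum over input rows, partial results built by different workers add entrywise, which mirrors how the threads each build the portion of the triangle they are responsible for.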
The processing of a row by a node coordinator involves instructing its threads to read that row, and each thread will build a portion of the upper triangular matrix for which it is responsible. The X′X matrix is stored in chunks as shown at 1300 in
Each node coordinator knows which portion of the triangle is its responsibility to construct based upon how many other nodes there are and how many threads per node there are (i.e., “n” and “p” of
For example, the central coordinator can indicate to the 20 node coordinators that there will be 80 overall threads that will be working on a 40,000×40,000 size matrix. Based on this information, each node coordinator (e.g., node coordinators 1-20) knows on which portion of the matrix it is to work. The central coordinator then sends out a row from the n by p input matrix to a node coordinators. As an illustration in
The wave technique of
When a node coordinator finishes processing, it can begin the next iteration of processing. This can occur even if subsequent node coordinators have not completed their first iteration of processing. For example, if node coordinator 3 completes its processing for the first iteration, then node coordinator 3 can begin processing for the second iteration (i.e. the data provided during the second wave) even if a subsequent node coordinator has not completed its processing for the first iteration.
To form the L′ matrix using the wave technique, the node coordinators perform a Cholesky decomposition upon the X′X matrix. For this, the grid computing environment uses a forward Doolittle approach. The forward Doolittle approach for forming the Cholesky decomposition results in a decomposition of a symmetric matrix into the product of a lower or upper triangular matrix and its transpose. The forward Doolittle approach is discussed further in: J. H. Goodnight, A Tutorial On The Sweep Operator, The American Statistician, vol. 33, no. 3 (August 1979), pp. 149-158. (This document is incorporated herein by reference for all purposes.)
The forward Doolittle approach essentially performs Gaussian elimination without the need to make a copy of the matrix. In other words, the grid computing environment constructs the L′ matrix as the grid computing environment goes through the matrix (i.e., as the grid computing environment sweeps the matrix a row at a time). As the node coordinators work on the matrix, they create an inverse matrix. Because of this, storage of a second copy of the entire matrix is not needed, and the work can be done in place, thereby significantly reducing memory requirements.
For example, the Doolittle approach allows the grid computing environment to start at a row and adjust all rows of the node coordinators below it and the grid computing environment is not required to go back up. For example, if the grid computing environment were on row three, then the grid computing environment never needs to go back up to rows one and two of the matrix. Whereas if it were a full sweep, the grid would have to go back to earlier rows in order to make the proper adjustments for the current row. This allows the grid to send out a row that is being operated upon by other nodes, and when a node coordinator receives that row to work on, the node coordinator already has everything that it needs to make the adjustment to that portion of the row. Accordingly, the grid computing environment can do this very efficiently by only having to go through the matrix twice to form the L′ matrix. Additionally, each node coordinator is given approximately the same amount of work to do. This prevents bottlenecks from arising if a node coordinator takes longer to complete its task.
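The in-place, forward-only property described above can be illustrated with a minimal sketch (single machine, upper triangle stored in a dense list of lists; this illustrates the forward-sweep idea, not the patent's distributed Doolittle code):

```python
def forward_sweep(a):
    """In-place forward sweep: overwrite the upper triangle of symmetric
    positive-definite `a` with its Cholesky factor U (U^T U = a).
    Once row k is finalized it is never revisited; only rows below it
    are adjusted, so a finalized row can be broadcast to other workers."""
    p = len(a)
    for k in range(p):
        pivot = a[k][k] ** 0.5
        for j in range(k, p):
            a[k][j] /= pivot          # finalize row k
        for i in range(k + 1, p):     # adjust only the rows below row k
            f = a[k][i]
            for j in range(i, p):
                a[i][j] -= f * a[k][j]
    return a
```

The key point matching the text: after row k is sent out, a worker adjusting a later row already has everything it needs, and rows one through k are never touched again.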
Upon completion of the row adjustments by the threads of all of the node coordinators, the X′X matrix will have been adjusted for all rows and is now an L′ matrix distributed among the node coordinators.
To complete the market state projection calculations, the node coordinators are provided with the entire L′ matrix as illustrated at 1700 in
While other approaches can be used (e.g., another approach is to generate the market state projections using the distributed L′), the approach of providing the entire L′ matrix to the node coordinators is used because the generated L′ matrix contains a significant number of zeros. Because of this, a subset of L′ is formed, which is, in this example, a 500×40,000 matrix that is distributed to the node coordinators. Additionally, an advantage of each node coordinator having the L′ matrix is that the subsequent market state projections can be calculated more quickly because this obviates the requirement for a node coordinator to fetch additional rows of information when calculating market states. Because each node coordinator is no longer storing just its portion of the L′ matrix, the node's memory is reconfigured to transition from storing only that node coordinator's specific portion of the L′ matrix to storing the entire 500×40,000 L′ matrix.
As an alternative, the grid computing environment could have each node coordinator individually generate the random numbers it needs for its simulation operations. However, this alternate approach may exhibit certain drawbacks. For example, random numbers are typically generated using seeds. If each node coordinator starts with a predictable seed, then a deterministic set of random numbers (e.g., a reproducible sequence) may arise among the node coordinators. For example if the root seed is 1 for a first node coordinator, the root seed is 2 for a second node coordinator, and so forth, then the resulting random numbers of the node coordinators may become deterministic because of the progressive and incremental values of the seeds for the node coordinators.
Because the central coordinator generates and distributes the random numbers for use by the node coordinators, it is ensured that the random numbers utilized by the node coordinators do not change the ultimate results whether the results are generated with two node coordinators or twenty node coordinators. In this approach, the central coordinator uses a single seed to generate all of the random numbers that will be used by the node coordinators and will partition the random numbers among the node coordinators.
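The single-seed approach can be sketched as follows (illustrative only; the patent does not specify the generator, so Python's stdlib `random` is assumed). Because the entire stream comes from one seed, splitting it among two or twenty node coordinators leaves the combined sequence, and therefore the ultimate results, unchanged:

```python
import random

def partition_randoms(seed, n_nodes, per_node):
    """Generate one stream of standard normal draws from a single seed,
    then split it into contiguous blocks, one per node coordinator."""
    rng = random.Random(seed)
    stream = [rng.gauss(0.0, 1.0) for _ in range(n_nodes * per_node)]
    return [stream[i * per_node:(i + 1) * per_node] for i in range(n_nodes)]
```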
The grid computing environment can be configured such that while the node coordinators are constructing the L′ matrix, the central coordinator is constructing a vector of random numbers for subsequent use by the node coordinators in generating market state projections.
More specifically, the market state projections are determined by computing a UL′ matrix, wherein U is a vector of random numbers. The calculations are repeated K times for K different random vectors, wherein K is selected by the user (e.g., K equals 10,000). A value of 10,000 for K results in 10,000 vectors of size 40,000 each for use in generating market state projections. Additionally, the market state projections are calculated by adding a base case to UL′. (The large number of market state projections can be needed to reach a relatively high degree of confidence.)
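The UL′ computation plus base case can be sketched as follows (a hypothetical pure-Python illustration; the `seed` argument and list representation are assumptions, and in the patent the random vectors come from the central coordinator rather than being generated locally):

```python
import random

def simulate_states(u_factor, base, k, seed):
    """K market state projections: each is base + z U, where z is a vector
    of independent standard normal draws and U is the Cholesky root (L')."""
    rng = random.Random(seed)
    p = len(base)
    states = []
    for _ in range(k):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        states.append([base[j] + sum(z[i] * u_factor[i][j] for i in range(p))
                       for j in range(p)])
    return states
```

With K = 10,000 and p = 40,000 this yields the 10,000 vectors of size 40,000 described above.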
With respect to the base case, the market state projections generated by a node coordinator are generated from the base case, which, in this example, comprises current values of the risk factors. For example, in the case of the oil price risk factor, the base case can be the current values for oil prices.
To help expedite processing of the positions, each thread of a node is assigned a particular portion of the problem to solve. As an illustration,
In
With respect to pricing functions, a client may indicate, in the position data for each type of instrument (e.g., a stock, a bond, a loan, etc.), which pricing function should be used. For example, a Wall Street company can indicate how much a share of Chevron will be worth if the grid computing environment can provide information about the market state projections. Many different types of pricing functions can be used, such as those provided by FINCAD®. FINCAD® (which is located in Surrey, B.C., Canada) provides an analytics suite containing financial functions for pricing and measuring the risk of financial instruments.
The grid computing environment can be configured to map the stored risk factors to the pricing functions so that the pricing functions can execute. If needed, the grid computing environment can mathematically manipulate any data before it is provided as a parameter to a pricing function. In this way, the grid computing environment acts as the "glue" between the risk factors of the grid computing environment and the specific parameters of the pricing functions. For example, a pricing function may be called for a particular bond and may calculate prices of positions based upon a set of parameters (e.g., parameters "a," "b," and "c"). The grid's risk factors are directly or indirectly mapped to the parameters of the pricing function. A system risk factor may map directly to parameter "a," while a different system risk factor may need to be mathematically manipulated before it can be mapped to parameter "b" of the pricing function.
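This "glue" mapping can be sketched as follows (the parameter names and the mapping structure are hypothetical illustrations, not the patent's interface):

```python
def bind_parameters(state, mapping):
    """Map simulated risk-factor values onto a pricing function's parameters.
    `mapping` gives, per parameter name, a factor index and an optional
    transform to apply before the value is handed to the pricing function."""
    params = {}
    for name, (idx, transform) in mapping.items():
        value = state[idx]
        params[name] = transform(value) if transform else value
    return params
```

A direct mapping passes the factor value through unchanged (parameter "a" below), while an indirect mapping applies a transform first (parameter "b").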
The number of calls by the node coordinator to the pricing function may be quite large. For example, suppose there are 1,000,000 positions and 10,000 market state projections. The overall number of pricing calls by the node coordinators will be 1,000,000 times 10,000 calls (i.e., 10,000,000,000).
A pricing function can provide many different types of outputs. For example, a pricing function can provide an array of output values, and the grid computing environment can select which of the outputs is most relevant to a user's question. For a bond-related pricing function, the output values can include the price of the bond, the exposure of the bond, etc.
Each node coordinator maintains all of its pricing information results in its memory and optionally writes to a file in case a different user would like to access the results. Upon request by the central coordinator, each node coordinator sends its pricing information to the central coordinator for further processing. An example of node coordinators storing the pricing results is shown at 3200 in
This figure also illustrates the degree to which memory reconfiguration occurs at the node coordinators from when they generate the X′X matrix, the L′ matrix, the market state projections, and the position pricing results. The node coordinators change their node memory layouts as they generate each of the aforementioned data. Upon the final reconfiguration of the memory by each node coordinator, the user can then query (indirectly through the central coordinator) against the position pricing results which are stored at the node coordinators.
As illustrated in
As noted above, position pricing results are retained in memory after they are created. The ability to do this entirely within memory, without a requirement to write the results to disk, can yield advantages within certain contexts. For example, the grid computing environment can be processing sensitive financial information which may be subject to regulations on preserving the confidentiality of the information. Because the sensitive financial information is only retained within memory, security regulations about sensitive financial data and their storage on non-volatile storage media are not implicated. Additionally, the user queries against pricing information which is stored in memory; after the querying process has completed, the information is removed from volatile memory at the end of the session. Accordingly, in this example, information is not stored to disk, thereby eliminating or significantly reducing risk of a security breach. However, it should be understood that various other storage approaches can be utilized to suit the situation at hand, such as storing position pricing information in non-volatile memory for use at a later time. This can be helpful if a user would like to resume a session that had occurred several weeks ago, or to allow another (authorized) user to access the position pricing information.
With respect to the aggregation of results from the node coordinators,
As an illustration, consider a situation wherein all of the node coordinators have Google and Microsoft stock information, and the first node coordinator has position information for the first 1000 market state projections. The first node coordinator sends its Google and Microsoft position pricing results for its market state projections to the central coordinator for aggregation. Similarly, the other node coordinators send to the central coordinator their Google and Microsoft position pricing results for their respective market state projections. The central coordinator will join these sets to satisfy the user query. (It is noted that each node coordinator (in parallel with the other node coordinators) also performs its own form of aggregation upon the position pricing information received from its respective threads.) In short, because the underlying originally generated data is continuously stored either in memory or on disk, the central coordinator can answer ad hoc user queries at any level. This obviates the requirement that a grid must know the query before generating the market state projections and position pricing.
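The join performed by the central coordinator can be sketched as a concatenation of per-node results keyed by instrument (a simplified illustration; real aggregation would also compute descriptive statistics over the combined rows):

```python
def aggregate(partials):
    """Concatenate per-node pricing results, keyed by instrument.
    Each node contributes the rows for its own market state projections."""
    combined = {}
    for node_result in partials:
        for instrument, rows in node_result.items():
            combined.setdefault(instrument, []).extend(rows)
    return combined
```

Because each node holds a disjoint slice of the market state projections, concatenation recovers the full set of prices per instrument.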
The central coordinator can be configured to retain the last query and its results in memory so that if the last query's results are relevant to a subsequent query, then such results can be used to handle that subsequent query. This removes the need to have to retrieve information from the node coordinators to handle the subsequent query. A central coordinator could be configured to discard a query's results if a subsequent query does not map into the most recent query. In this approach, the central coordinator would retrieve position pricing results from the node coordinators in order to satisfy the most recent query.
The query results sent back to the client can be used in many different ways, such as stored in a database at the client location, displayed on a graphical user interface, processed further for additional analysis, etc.
To assist in the classification variable processing, the node coordinators associate levels to the values within their respective position pricing data. The node coordinators keep track that each position is associated with a particular level of a classification variable. Accordingly during the querying phase, a user query may indicate that the client wishes to have an accumulation based upon a particular classification variable and to be provided with descriptive statistics associated with that classification variable or a combination of the classification variables (e.g., cross-classification of variables, such as for this region provide a company-by-company breakdown analysis). The central coordinator receives from the node coordinators their respectively processed data and aggregates them.
If the user prefers information at a higher level for a query, then the node coordinators aggregate their respective detailed pricing information to satisfy the first query. If the user provides a second query which is at a level of greater detail, then the node coordinators aggregate their detailed pricing information at the more detailed level to satisfy the second query. At these different levels, a user can learn whether they are gaining or losing money.
For example, the user can learn that the user has a higher level of risk of losing money in the computer industry sector, but only a low risk of losing money in a different industry sector. The user can then ask to see greater detail about which specific companies are losing money for the user within the computer industry. Upon receiving this subsequent query, the node coordinators process the position pricing data associated with the industry sector classification variable at a greater level of detail than the initial query, which was at the higher industry sector level.
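Aggregation at different classification-variable levels can be sketched as follows (a minimal illustration; the field names are hypothetical, and a real system would track many classification variables and statistics beyond a sum):

```python
def roll_up(prices, key):
    """Aggregate detailed position prices to the classification level named
    by `key` (e.g., sector for a coarse view, company for a finer one)."""
    totals = {}
    for row in prices:
        totals[row[key]] = totals.get(row[key], 0.0) + row["price"]
    return totals
```

Because the detailed rows stay resident in memory, the same data can be rolled up by sector for one query and by company for the next, without pre-aggregation.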
This written description uses examples to disclose the invention, including the best mode, and also to enable a person skilled in the art to make and use the invention. The patentable scope of the invention may include other examples. For example, the systems and methods described herein may be used for market stress testing purposes as shown in
The examples of
As another example of the wide scope of the systems and methods disclosed herein, the systems and methods may include data signals conveyed via networks (e.g., local area network, wide area network, internet, combinations thereof, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data signals can carry any or all of the data disclosed herein that is provided to or from a device.
As another example of the wide scope of the systems and methods disclosed herein, it should be understood that the techniques disclosed herein are not limited to risk pricing, but can also include any type of problem that involve large data sets and matrix decomposition. As another example, it should be understood that a configuration can be used such that a conventional approach is used to generate market state projections (e.g., through use of the SAS Risk Dimensions product), but the position pricing approaches disclosed herein are used. Correspondingly, a configuration can be used such that the market state generation approach as disclosed herein can provide output to a conventional position pricing application.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate situation where only the disjunctive meaning may apply.
This application is a continuation of prior application Ser. No. 12/705,204, filed Feb. 12, 2010, the contents of which are hereby incorporated by reference in their entirety.
Parent Case: U.S. application Ser. No. 12/705,204, filed February 2010 (US).
Child Case: U.S. application Ser. No. 14/543,340 (US).