Many decisions (e.g., business decisions) are complex and thus difficult to make. For example, accurately forecasting future demand for a business's products enables planning for ordering the raw materials and components used to make those products. Factors that may affect such demand forecasting include the general state of the economy, seasonal variations, competitive factors, etc.
For a detailed description of various illustrative implementations, reference will now be made to the accompanying drawings in which:
The implementations described herein are directed to a semi-automatic system that permits a user to derive specific information from a set of transactional and industry-specific data. The information is obtained from a process of selection/generation of suitable data analysis models. A user is afforded insight to how the system functions and can control operation of the system during the model selection process, as well as being provided with results from operation of the system (e.g., an appropriate model).
The non-transitory storage device 160 is shown in
The distinction among the various engines 100-120 and among the software modules 170-190 is made herein for ease of explanation. In some implementations, however, the functionality of two or more of the engines/modules may be combined into a single engine/module. Further, the functionality described herein as being attributed to each engine 100-120 is applicable to the software module corresponding to that engine, and the functionality described herein as being performed by a given module is likewise applicable to the corresponding engine.
Overall Operation
The final model selection engine 120 narrows down the list of candidate models by evaluating how well each candidate model performs given the transactional data 210 and the user-provided output requirements. Via the output delivery engine 121, the final model is provided to the user who can select that model for future use or reject the model, change one or more of the transactional data 210, industry-specific data 212, and output requirements 214, add more models and tests, and force the system to re-evaluate to generate another proposed model.
The various processes described herein need not be performed strictly sequentially. For example, the final model selection process 218 may cause control to loop back to the candidate model selection process 216 or prompt the user for additional input in the input collection process 200. Given its dynamic nature, the system allows for a continuous verification and updating of the selections made in previous processes. For example, during the final model selection process 218, the system may automatically detect that an outcome of the analysis does not satisfy one or more of the output requirements 214 provided by the user. This might be because, for example, new transactional or industry-specific data 210, 212 was entered since the output requirements 214 were initially set. In that case, the system will re-perform the candidate model selection process 216, and determine if a different candidate model would result in a better performance. Alternatively, the input collection engine 100 will request the user to enter additional transaction data 210 and/or industry-specific data 212 or to update the output requirements 214.
The output delivery process 219 provides, for example, a graphical user interface to present to the user the results of the model selection process. The presentation may be in the form of a list of the candidate model selection options that were analyzed during the final model selection process 218, as well as information regarding the results of the various tests that were performed.
Input
Referring to
Transactional data 210 may include information regarding individual customer transactions. Examples include products sold, prices, quantities, dates of sale, etc. Industry-specific data 212 refers to market structure and industry-relevant information that affects the candidate model selection process 216, such as the number of brands in the market, the price points of different brands, and the market share of the different brands, across regions and time. Industry-specific data 212 also may include market research reports with unstructured information about the major factors identified as influential in a purchase decision in the situation at hand. A report comprising text, figures, and tables may be analyzed using text analysis to extract the main concepts of interest, such as market share of brands, number of brands in the market, annual unit sales, annual revenues, growth rates of unit sales and of revenues (e.g., over the last 5 years), number of customers, growth rate of customers over time (e.g., over the last 5 years), number of repeat versus new customers, customer segmentation dimensions, brand tiers, and the cohorts in each tier.
The output requirements 214 may include a description of the analysis requested by the user. The output requirements may include the outcome variable to be estimated. For example, a user may want an estimate of the price elasticity of the user's own brand A (“own price elasticity”). By way of another example, the user may want to estimate the purchase probability of a customer for another brand B. In another example, the user may want an estimate of the churn risk score of all customers over the next 365 days. The output requirements also may include the accuracy levels with which the outcome variables are to be estimated. An example of an accuracy level might be 95% accuracy or a 5% statistical significance level. In another example, the estimated value might be required to fall within 2 standard deviations of the actual values in holdout tests. The output requirements also may include other performance level variables. For example, the model might be required to compute the scores for 1 million customers in less than 5 minutes.
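For illustration only, the output requirements 214 described above could be captured as a simple structured record. A minimal Python sketch follows; the field names are hypothetical and are not part of the described implementation:

```python
from dataclasses import dataclass

@dataclass
class OutputRequirements:
    """Hypothetical record for output requirements 214 (field names are illustrative)."""
    outcome_variable: str        # e.g., "own_price_elasticity" for brand A
    confidence_level: float      # e.g., 0.95 for 95% accuracy
    max_holdout_sd: float        # estimate must fall within N standard deviations in holdout tests
    max_runtime_seconds: float   # e.g., score 1 million customers in under 5 minutes

# Example corresponding to the requirements discussed in the text
reqs = OutputRequirements(
    outcome_variable="own_price_elasticity",
    confidence_level=0.95,
    max_holdout_sd=2.0,
    max_runtime_seconds=300.0,
)
```

Such a record would let the final model selection engine 120 check each requirement mechanically, e.g., comparing a measured runtime against `max_runtime_seconds`.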
Through the analysis of the industry-specific data, the candidate model selection engine 110 may inform the process about the market structure and industry in which the transactional data 210 is being analyzed and may create a set of selection criteria to be used in subsequent parts of the process. The candidate model selection engine 110 uses information about the market structure (e.g., that there are two brands in the market) to decide the set of variables to be included in a demand function formulation to estimate the price elasticity of the user's own brand A. Further, the candidate model selection engine 110 uses the information that, for example, brands A and B are consumer packaged goods to decide the functional form of the demand function formulation that would be relevant for that type of industry. The candidate model selection engine 110 may use a set of rules to narrow down the model specification. Such rules are identified by the input collection engine based on previously performed analyses and may include formulae and the variables relevant to such formulae.
User input of the transactional data 210, industry-specific data 212 and output requirements 214, as well as updated or new models and tests may be obtained in several ways. For instance, a user may provide information 252 via input device 152 in a suitable manner (e.g., a structured file or completion of an electronic form), or the input collection engine 100 may explicitly request 254 from the user information that will help it to identify the market structure and industry or fill any identified data gaps.
Alternatively or additionally, the input collection engine 100 may implement a learning process 250 to obtain industry-specific data 212 based on previous uses of the system. For example, if at least a threshold number (e.g., 3) of previous users have entered new industry-specific data 212 via the input collection engine 100 that turned out to be relevant in the candidate model selection process, then the input collection engine learns to solicit that same type of industry-specific data automatically from a user who uses the system in the future.
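A minimal sketch of such a threshold-based learning rule is shown below; the function and field names are hypothetical and only illustrate the counting logic described above:

```python
from collections import Counter

def learned_prompts(past_relevant_fields, threshold=3):
    """Return the industry-specific field names that at least `threshold`
    previous users supplied and that proved relevant, so the system can
    solicit them automatically from future users."""
    counts = Counter(past_relevant_fields)
    return sorted(field for field, n in counts.items() if n >= threshold)

# e.g., three prior users entered "num_brands"; only two entered "growth_rate"
fields = ["num_brands", "num_brands", "growth_rate", "num_brands", "growth_rate"]
print(learned_prompts(fields))  # -> ['num_brands']
```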
Candidate Model Selection
Information usable by the candidate model selection engine 110 may include the industry-specific data 212, the output requirements 214 (which help to identify the final target of the analysis), and selection criteria (e.g., rules) 268 specifying how to process information learned by the system from past user experience 270. Industry-specific data 212 may be clustered in homogeneous groups/categories that the candidate model selection engine 110 uses to identify the market structure and industry type. These groups may be related to the structure of the industry 280 (e.g., degree of competition) and to characteristics of the customers 280 (e.g., demographics, income, degree of risk aversion, etc.). The groups are created by the user via the input collection engine 100 or by the candidate model selection engine 110 based on previous user experience.
The candidate model selection engine 110 selects one or more candidate model options 264 based on, for example, the industry-specific data 212, the output requirements 214, and a set of rules. The candidate model selection engine 110 examines the set of rules and checks the rules against the information presented, i.e., the output requirements 214 and the industry-specific data 212. The rules may be stored in a library of rules, and some rules may emerge from learning; for example, upon detecting that users' selections are correlated with specific industry data, corresponding automatic model selection rules may be created. The rules may specify, for example, how the industry-specific data 212 is to be used by the candidate model selection engine 110. For example, to estimate the price elasticity of the user's own brand A, the candidate model selection engine 110 determines from the industry-specific data 212 whether the market structure is a monopoly (e.g., a single brand), an oligopoly (a few brands), or perfect competition (many brands). The candidate model options 264 may be stored in pre-existing libraries that may be updated with additions or deletions by the user(s). The selection is such that the set of modeling options 264 identified (e.g., models of competition versus monopoly, stationary versus dynamic models, linear versus non-linear models, etc.) is supported by data availability and is consistent with the satisfaction of the output requirements.
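The rule-based mapping from market structure and requested outcome variable to candidate model options can be sketched as follows. The brand-count thresholds, variable names, and functional forms below are illustrative assumptions, not values prescribed by the described implementation:

```python
def select_candidate_models(num_brands, outcome_variable):
    """Rule-based sketch: map market structure and the requested outcome
    variable to a set of candidate demand-model specifications."""
    # Classify market structure from the number of brands (thresholds assumed)
    if num_brands == 1:
        structure = "monopoly"
    elif num_brands <= 5:
        structure = "oligopoly"            # a few brands
    else:
        structure = "perfect_competition"  # many brands

    variables = ["own_price", "own_advertising"]
    if structure != "monopoly":
        # Competitor variables matter once competition is present
        variables += ["competitor_price", "competitor_advertising"]

    # Functional forms considered depend on the requested outcome
    if outcome_variable == "own_price_elasticity":
        forms = ["linear", "log-linear", "log-log"]
    else:
        forms = ["logit"]  # e.g., purchase-probability or churn-risk outcomes

    return structure, [{"form": f, "variables": list(variables)} for f in forms]

structure, candidates = select_candidate_models(
    num_brands=2, outcome_variable="own_price_elasticity")
```

With two brands in the market, the sketch classifies the structure as an oligopoly and includes competitor price and advertising in every candidate specification, mirroring the two-brand example in the text.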
Based on, for example, the industry-specific data 212, the output requirements 214, and the selection criteria (rules) 268, the candidate model selection engine 110 selects one or more models to include in the candidate model selection options 264 for further examination by the final model selection engine 120 (described below). Some implementations may include, as an input to the candidate model selection engine, the past user's experiences 270. Input 270 may include a default model resulting from a previous run of the system. For example, a particular model previously may have been determined by the final model selection engine 120 to be the model of choice for analyzing the data. That particular model may be specified to the candidate model selection engine 110 by the user. In some implementations, the final model selection engine 120 determines whether the specified default model is acceptable or not (as described below). If the default model remains acceptable, no other model options are tested by the final model selection engine 120. If, however, the default model is determined not to be acceptable by the final model selection engine 120, the user is so notified and control loops back to the candidate model selection engine 110 to identify one or more other candidate model selection options as explained above.
Final Model Selection
The operation performed by the final model selection engine 120 is illustrated in
The candidate model selection options 264 are subject to a predetermined set of tests (e.g., collinearity test, endogeneity test, etc.) by the final model selection engine 120. The candidate model option(s) that do not pass the tests are dropped from further consideration. The results of these tests may be displayed to a user via output device 154 so that the user can see which candidate model options were rejected and which were accepted.
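As one illustrative example of such a screening test, a collinearity check might use the condition number of the standardized regressors as a simple proxy; the threshold below is an assumption for the sketch, not a value specified by the implementation:

```python
import numpy as np

def fails_collinearity_test(X, threshold=30.0):
    """Reject a candidate design matrix that is near-collinear, using the
    condition number of the standardized regressors as a simple proxy."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    return bool(np.linalg.cond(Z) > threshold)

rng = np.random.default_rng(0)
a = rng.normal(size=100)
ok = np.column_stack([a, rng.normal(size=100)])                    # independent columns
bad = np.column_stack([a, a + rng.normal(scale=1e-4, size=100)])   # near-duplicate column
```

Candidate options whose design matrices fail such a test would be dropped, and the pass/fail outcome could be shown to the user via output device 154.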
The final model selection engine 120 also “cleans” the transactional data 210 to produce cleaned data 260. Cleaning the transactional data 210 may include processing the data in accordance with the data format requirements of the candidate models. Before, during, or after a set of models is selected by the final model selection engine 120, the models may need the data to be presented in a certain format. For example, the transactional data 210 may need to be ordered by date or normalized (e.g., demeaned and scaled); the variables may have their means subtracted and be divided by their standard deviations. Certain attributes may need to be computed. The variables may need to be transformed to a logarithmic scale, or quadratic values of the attributes may need to be computed for inclusion in the formulation. Further, any missing observations or outliers in the data may need to be accounted for. For example, the user may want to estimate the price elasticity of the user's own brand A. Further, the industry-specific data 212 on the market structure may indicate that the market comprises two brands A and B and that the market share of brand B is large enough to influence the demand for brand A. In this example, the final model selection engine 120 determines that the demand function formulation should include not only the own price of brand A but also the competitive brand B's price. Further, the final model selection engine 120 determines that the transactional data supplied by the user does not include price data for brand B on different purchase visits. The ‘gap’ is the information on the competitor's price for brand B, as it is a known important variable in the demand function formulation for brand A for this type of market structure and for this type of outcome variable.
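The normalization step described above (dropping missing observations, optionally moving to a logarithmic scale, then demeaning and scaling) can be sketched as:

```python
import numpy as np

def clean_series(values, log_scale=False):
    """Clean one numeric column of transactional data: drop missing
    observations, optionally apply a logarithmic transform, then
    subtract the mean and divide by the standard deviation."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                  # account for missing observations
    if log_scale:
        x = np.log(x)                    # logarithmic transform
    return (x - x.mean()) / x.std()      # demean and scale

prices = [10.0, 12.0, float("nan"), 11.0, 13.0]
z = clean_series(prices)
```

The resulting column has zero mean and unit standard deviation, which is the normalized form some candidate models would require.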
Given the set of candidate models 264 determined by the candidate model selection engine 110, the final model selection engine 120 verifies which candidate model performs best in terms of the output requirements 214 set by the user. Different test data and testing algorithms (e.g., stationary time series versus dynamic time series models) are employed by the final model selection engine 120 to test the candidate model options 264, with the results being compared. For example, the candidate models for demand estimation of brand A may be a linear model, a log-linear model, and a log-log model with various variables, such as brand A sales being a function of the price of brand A, advertising levels of brand A, the price of brand B, advertising levels of brand B, brand A sales from the previous period, brand B sales from the previous period, etc. The final model selection engine 120 may estimate all three models using calibration data and compare various fit statistics in the calibration sample, such as R-squared, the number of variables with the correct sign (positive or negative), the log-likelihood function value, etc. The final model selection engine 120 also may assess the performance of the models by performing tests for the three models in holdout samples and comparing model fit statistics in the holdout samples, such as the Bayesian information criterion, the Akaike information criterion, the log-likelihood value, and the hit rate (e.g., the number of times the actual value and estimated value are within a predefined confidence interval). Based on these fit tests, the final model selection engine 120 may select the final model and present the analysis to the user. For example, the final model selection engine 120 may present to the user the estimated value of the price elasticity of the user's own brand A.
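A simplified sketch of the calibration/holdout comparison follows. It fits each candidate specification by ordinary least squares on calibration data and ranks candidates by holdout R-squared; the synthetic data, candidate names, and choice of a single scoring metric are illustrative assumptions (the implementation described above compares several statistics):

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def holdout_r2(beta, X, y):
    """R-squared of a fitted model evaluated on a holdout sample."""
    X1 = np.column_stack([np.ones(len(X)), X])
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def select_final_model(candidates, X_cal, y_cal, X_hold, y_hold):
    """Estimate each candidate (a list of regressor column indices) on the
    calibration sample and keep the one with the best holdout fit."""
    best = None
    for name, cols in candidates.items():
        beta = fit_ols(X_cal[:, cols], y_cal)
        score = holdout_r2(beta, X_hold[:, cols], y_hold)
        if best is None or score > best[1]:
            best = (name, score, beta)
    return best

rng = np.random.default_rng(0)
# Synthetic example: columns are own price and competitor price;
# sales truly depend on both, so the richer specification should win.
X = rng.normal(size=(200, 2))
y = 5.0 - 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
candidates = {"own_price_only": [0], "own_plus_competitor": [0, 1]}
name, score, beta = select_final_model(candidates, X[:150], y[:150], X[150:], y[150:])
```

Because the synthetic sales depend on the competitor's price, the specification that includes it achieves the better holdout fit and is selected.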
In the event that one candidate model option 264 does not perform better than another, multiple alternative models are provided to the input collection engine 100 and offered to the user via output device 154 for the user to make a final choice. In the case of an analysis that involves a dynamic inflow of data, the final model selection engine 120 verifies that the past modeling choices are still optimal and robust. That is, if a model has been generated and selected by a user, but additional transactional data 210, industry-specific data 212, and/or output requirements 214 are provided by the input collection engine 100, the processes described above with regard to engines 100, 110, and 120 are iterated to ensure that the currently determined model is the correct choice. If it is not, a different model is offered to the user.
In accordance with at least some implementations, the final model presented to the user is a model equation with the estimated values of the unknown parameters in the equation, as well as the estimates of the target outcome variable(s). Further, the final model selection engine 120 may perform the final selected model on the transactional data 210 to generate output processed data, which is then provided to a user via the output delivery engine (e.g., displaying, printing, etc.).
When the default model (selected in previous runs) underperforms in terms of some predetermined dimension, the software automatically re-considers the choices made in earlier operations, such as the candidate model selection process 216. For example, in the above example, the model may not complete the computation in less than 1 hour, or the confidence interval may be larger than what was requested. Control then loops back to the candidate model selection process 216, or the input collection engine 100 may request additional inputs from the user. A verification/test may be automatically performed (306) or user-induced via a user command 304 via input device 152. An example of a verification test is a test or series of tests to check whether the output requirements 214 are being met. For example, to check whether the estimates of the price elasticity of brand A are robust (fall in the same range), the final model selection engine 120 may perform repeated executions of the model with the same calibration data and with other calibration samples and then compare the estimated values over the repeated executions. Whenever the user is aware of a structural change that may require re-evaluating the model selection, the user can force the system to start from the input collection process 200 again, taking into account the new information. The selection criteria 302 are determined by the type of analysis based on the output requirements 214. For example, if the user becomes aware that the market now has a new brand C that should be included in the demand formulation for brand A, the user can restart the model selection process. The selection of the candidate models will then include this new information about the market structure (including brand C) in deciding the set of models, use additional variables in the demand formula, and estimate the price elasticity for brand A. The user may still want the same level of accuracy. The selection criteria 302 in
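The robustness check described above (repeated estimation over different calibration samples, then comparison of the estimated elasticities) can be sketched as follows. The bootstrap resampling scheme, the log-log specification, and the tolerance are illustrative assumptions:

```python
import numpy as np

def repeated_elasticity(prices, sales, n_runs=20, seed=0):
    """Re-estimate the log-log price coefficient (the price elasticity) on
    bootstrap calibration samples and return the min/max of the estimates,
    so the spread can serve as a robustness check."""
    rng = np.random.default_rng(seed)
    lp, ls = np.log(prices), np.log(sales)
    estimates = []
    for _ in range(n_runs):
        idx = rng.choice(len(lp), size=len(lp), replace=True)  # resampled calibration set
        X = np.column_stack([np.ones(len(idx)), lp[idx]])
        beta, *_ = np.linalg.lstsq(X, ls[idx], rcond=None)
        estimates.append(beta[1])                              # slope = elasticity
    estimates = np.asarray(estimates)
    return float(estimates.min()), float(estimates.max())

# Synthetic data with a true elasticity of -1.5 for brand A
rng = np.random.default_rng(1)
prices = np.exp(rng.normal(2.0, 0.3, size=300))
sales = np.exp(3.0 - 1.5 * np.log(prices) + rng.normal(0, 0.05, size=300))
lo, hi = repeated_elasticity(prices, sales)
robust = (hi - lo) < 0.15   # estimates fall in the same narrow range
```

If the spread of the repeated estimates exceeded the tolerance, the system could flag the output requirements 214 as unmet and loop back to the candidate model selection process 216.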
The disclosed implementation is flexible and can be used in different market structure and industry situations, with different levels of data availability and for different business needs. Further, the system described herein is easy to use and will reduce the time and complexity of the process of developing a data analytical modeling solution. The system is transparent and allows the business user to interact with it and ensure it is correctly specified. It is dynamic and can learn from current user inputs, past inputs from the user as well as inputs by other users. The system can enable the use of data analytical models to be offered as a service offering in the cloud.
The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.