The present invention relates to a method of generating a data set from data stored in at least one database. In particular, the invention relates to a method of automatically generating a standardized data set for inputting to an analytical model.
The task of analysing recorded data for extracting useful information has become increasingly difficult with the ever increasing volume and complexity of data in modern industry, science and business. The process of sorting through vast amounts of data and producing relevant information, often referred to as data mining, can be extremely tedious and time consuming. Automatic data analysis using more complex and sophisticated tools for producing useful information from vast amounts of stored data has become more and more common. Through the use of sophisticated algorithms, analysts can, for example, identify key attributes of business processes, predict client behaviour and use this information to target business opportunities. Such mathematical and statistical technique implementations are known as advanced analytical engines or analytical models, and may be classified under functions such as, for example, classification, regression, clustering, segmentation, attribute importance, association rules, and times series forecasting. Development of such models is however a costly and time consuming process while keeping the models up to date requires further investment of time and costs.
Data generated by industries and businesses can be stored in databases such as operational databases, data warehouses or data marts. Data marts are typically tailored to store data oriented to a specific purpose or subject. Data warehouse design principles generally require data to be stored in its most basic form, as “Atomic” data, and they generally contain a significant number of database tables consisting of raw data columns.
Operational databases are generally optimized to preserve data integrity and speed of recording of business transactions through use of database normalization. Data warehouses are optimized for speed of data retrieval. Frequently data in data warehouses are denormalised via a dimension-based model. Also, to speed data retrieval, data warehouse data is often stored multiple times—in their most granular form and in summarized forms known as aggregates.
Both operational databases and data warehouses typically follow an entity-relationship data schema and are usually based on Relational DataBase Management Systems (RDBMS). There is a vast array of tools and techniques to extract, transform, and load (ETL) data between repositories, and there is a wealth of techniques for performing data manipulations, generally using a standard data and meta-data query language known as Structured Query Language (SQL).
Data for use in an analysis may be gathered from multiple data sources, from data recorded by an industry or business in operational databases and data warehouses, as well as from third party data providers. Third party data providers may supply different types of data such as demographic data lifestyle data, customer interests and the like.
To operate effectively, advanced analytical modelling techniques require data to be fed to the model in a simplified form such as a single table known as an analytical data set containing a vision of the entities of interest at a given time. Analytical data sets may be regarded as virtual data tables where each row represents a given entity of interest and columns are made up of attributes, also known as analytical variables or explanatory values, for describing the different entities. An analytical record is the group of attributes used to describe the entity. An analytical data set can sometimes be referred to as a virtual flat file. Such tables should be as complete as possible for the analysis and generally require more sophisticated data attributes than the raw data attributes stored in the source databases. An attribute definition or expression describe how a given attribute is derived from the operational data for an analytical record and may be comprised of primitives and/or computational expressions. Primitives are generally base attributes while computational expressions may include predicates, aggregates or other functions. An entity is defined as the object of analytical interest and may include, for example: customers, products, shops etc. In customer analytics, an analytical data set is sometimes expressed as the ‘360’ view of the customer. A customer can be described by thousands of attributes that can be computed from atomic data contained within a customer data warehouse. Effective analysis necessitates easy reconstruction of these attributes for a given population of customers at a given time.
Recent mathematical developments and the spread of best practice methods have allowed the degree of automation of analytical modelling techniques to be improved. Mathematical and statistical engines still require, however, a repeatable and industrialized process in order to create the analytical data sets used as their inputs and to maintain these data sets through time.
U.S. Pat. No. 7,047,251 describes a standardized customer application for inputting customer data into analytical models. U.S. Pat. No. 7,272,617 relates to the creation of an analytical data set for modelling in a customer relationship management system. These systems do not, however, deal in an automated manner with attributes describing customer entities, which may vary over time.
Accordingly, it is an object of the present invention to provide an improved method of automatically generating an analytical data set for input into an analytical model.
In general terms, the invention sets out to provide a method of and a system for automatically generating a standardised dataset for input to an analytical model by providing a cross product of a time stamped population of an entity of interest and an analytical record describing the entity.
In accordance with a first aspect of the invention, there is provided a method of generating a dataset from data stored in at least one data base, for input into an analytical model, the method comprising the steps of: defining a time stamped population comprising a plurality of tuples, each tuple comprising an entity identifier of an entity for analysis, and at least one reference time stamp associated with the corresponding entity identifier; and creating a dataset by generating at least one time dependent attribute value for each entity identifier from data associated with said entity identifier in the at least one database, the or each time dependent attribute value representing a time dependent parameter of the corresponding entity identifier and being generated according to a corresponding attribute definition, wherein the or each time dependent attribute value is generated as a function of the corresponding time stamp.
In this way, the method according to the invention provides a standardised input for an analytical model. Since advanced analytics techniques can now be used in very high dimensional space (some techniques, for example, automatically handle thousands of attributes describing an entity), the method of the present invention addresses an unfulfilled need for automatically creating very wide analytical data sets that manage time dependent attribute computations in a formal way, requiring a minimal amount of programming knowledge and human intervention.
The proposed automation method, dealing with time dependent attributes, is beneficial and effective for integrating data mining tasks into a scheduled environment as well as for allowing the implementation of back-testing facilities without the need for specific programming, and is of importance for the overall productivity of data mining activities.
The method may include the preliminary steps of defining the entity as the object of analysis of the analytical model; and defining an analytical record for describing the entity, the analytical record comprising at least one time dependent attribute defined by the corresponding attribute definition. Naturally, the analytical record may also include one or more attributes which are non time dependent.
Features of embodiments of the method of the invention may include:
A second aspect of the invention provides a method of analysing data using an analytical model, the method comprising generating a data set for input to the analytical model according to the method hereinbefore described; inputting the dataset to the analytical model and performing analysis of data according to the analytical model.
A third aspect of the invention provides a method of predicting the behaviour of an entity, comprising analysing data using an analytical model according to the method hereinbefore described.
A fourth aspect of the invention provides a method of training, scoring or back-testing an analytical model comprising analysing data using the analytical model according to the method hereinbefore described.
The methods according to the invention may be computer implemented. They may be implemented in software on a programmable apparatus. They may also be implemented solely in hardware or in software, or in a combination thereof.
According to a fifth aspect of the invention there is provided a system for generating a data set from data stored in at least one data base, for input into an analytical model, the system comprising: an input for receiving data from a database;
a processor for defining a time stamped population comprising a plurality of tuples, each tuple comprising an entity identifier of an entity for analysis, and at least one reference time stamp associated with the corresponding entity identifier; and for creating a dataset by generating at least one time dependent attribute value for each entity identifier from data associated with said entity identifier in the at least one database, each attribute value representing a time dependent parameter of the corresponding entity identifier and being generated according to an attribute definition, wherein the or each time dependent attribute value is generated as a function of the corresponding time stamp; and an output for transmitting the data set to the analytical model.
Features of embodiments of the system of the invention may include:
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:—
A first embodiment of the method according to the invention will be described with reference to
With reference to
With reference to
As illustrated in
The user may wish to add a new attribute that will describe a customer at a given date with the amount billed on the previous month called PM_Billing before this date. For this, he can join data contained in the “2007_Billings_T” Table 11_2 and an equivalent table “2006_Billings_T” containing data for the year 2006 based on the customer identifier (“Id”) and define the attribute with an expression following a switch statement such as: <<switch on (RTS), when [month(RTS)=01 and year(RTS)=2007], return 2006_Billings_T.M12, when [month(RTS)=02 and year(RTS)=2007], return 2007_Billings_T.M01, when [month(RTS)=03 and year(RTS)=2007], return 2007_Billings_T.M02, . . . >>. The user may also decide to define an attribute by computing some aggregates based on the number of purchases that have taken place over the two previous months for each of the products. The number of purchases of product A during the previous month could be called “PM_ProductA” and may, for example, be expressed as: <<count_filtered_aggregate(“Transactions_T, “T.Id”, “Id”, “Date”, RTS-1 month, RTS, “Product”, ‘A’)>> which can be translated as: count the number of transactions contained in the “Transactions_T” table when joined on columns “Customers_T.Id” and “Transactions_T.Id” where the “Date” of the transactions must be between RTS-1 month and RTS, keeping only transactions where the attribute “Product” is equal to “A”.
The resulting analytical data set will contain at least one time dependent attribute. In the present example four attributes are time dependent: Age, PM_Billing, PM_Product_A and PM_Product_B.
It will be appreciated that the method according to the invention is not limited to a specific language expression for defining attributes; these expressions can be made by SQL or provided by a graphical user interface. It will also be appreciated that the method according to the invention is not limited to specific data manipulations. The concept of the method according to the invention is that at least one data manipulation for definition of an attribute refers to a specific date value used in time dependent expressions to generate the desired time dependent attributes.
The attributes of the analytical records may or may not be grouped into homogeneous attributes sometimes referred to as ‘Domains’. A domain is a group of attributes having similar characteristics describing a homogeneous section of an entity. For example, an analytical record describing customers may have a demographic domain or a behavioural domain. Domains may also be created for data generated by the analytical model, for example, score and segment domains.
The initial two steps (S1&S2) of the method may be implemented with an easy to use expression editor or some programming language expertise. It will be appreciated, however, that steps S1 and S2 are not required each time the method according to the invention is run. The first time an analytical project deals with a specific entity, a user defines this entity and the associated analytical record. All subsequent projects on the same entity may reuse the Analytical Record once it has been defined. Optionally, several analytical records focusing on very different domains may be associated with one entity.
Referring to
It will be appreciated that in some embodiments of the invention an entity identifier may have more than one reference time stamp, or different entity identifiers may be associated with different time stamps. A given entity may be represented at different time stamps in a single time stamped population.
The fourth step (S4) is the generation of the Analytical Data Set. This step is fully automated when provided with a given Analytical Record and a corresponding time stamped population for the same entity. In this embodiment of the invention, this step is implemented through generation of SQL statements by the dataset generation processor 25. Execution of these SQL statements by the database query engine 15 results in retrieval of data from the database 10 for populating the cells of the analytical dataset and creation of the analytical data set containing at least one attribute value dependent on the RTS of the time stamped population.
As illustrated in
The SQL statements executed by the database query engine 15 may be provided as very complex ‘select’ statements returning a volatile result set, only available at the time of the query, or the resulting select statement may be executed in order to create a dataset according to the choice of the user. In the resulting analytical table as shown in
The system according to the invention may be implemented in the form of hardware in a microcontroller, in the form of software on a software medium or a programmable component in a non-volatile memory executed by a microprocessor.
The method according to this embodiment of the invention provides the advantage that there is no need to redefine the analytical record each time the time factor for analysis changes. Simply by changing the reference time stamp in the time stamped population a new data set adapted to a new desired time reference can be generated. The invention thus provides management of time dependent attributes in a formalised and highly automated manner, with the minimum amount of human intervention. As a result generation of datasets becomes significantly less time consuming and costly. Analytical models for analysing or predicting entity behaviour can be maintained, retrained and back-tested more effectively.
The method according to the embodiment of the invention may find use in many analytical applications such as in entity behaviour analysis, predictive modelling or in the training, scoring, retraining and back-testing of analytical models etc. For example, the method may be used in customer analytics for customer segmentation, i.e. grouping customers having similar characteristics and then using these groups to create lists of customers to target for specific campaigns.
A potential application of the method according to the invention is for automating advanced analytical models to update, on a scheduled basis, model retraining with the most recent version of data, or simply to apply advanced analytical models (sometimes referred to as ‘scoring’) to the most recent view of the entities of interest. For example, a telecommunications operator may be interested in scoring all its post-paid customers in order to see which customers are most at risk of leaving for a competitor operator and to deploy retention programs for those who are worth retaining. For this, the operator needs to score its entire customer base every month. It is likely that the predictive analytical model used for scoring will use as one of the key influencers some behavioural data (for example: the number of calls given in the last month). In order for the analysis to be fully automated, the process generating the data set compiling information on all post-paid customers with all their attribute values should also be automated. The present invention provides a formal way of automatically gathering the most recent values for attributes that are time dependent.
Another potential application of the method of the invention is in model back-testing. For example, in the previous example, the business owner of the telecommunications operator may decide to ask the analytical team to test an advanced analytical model on past data in order to see how the system would have performed if it had been used in the previous 6 months for detecting the potential leavers. In this case, the advanced analytical model can be used on a data set compiling the attribute values describing the customers as they were 6 months ago, 5 months ago, and so on . . . . The present method provides a way of easily reconstructing the analytical data set compiling information on the post-paid customers as they were known in the system at a given date.
A further potential application is when a proposed advanced analytical model needs more data in order to obtain robust results than a single time period would provide. In the example developed earlier around the telecommunications operator, let us assume that the business owner wishes to have a dedicated model for a specific segment of customers for example (the “5 stars” segment of very valuable clients). In order to ensure that the number of “5 stars” customers potentially leaving for a competitor is correctly detected, he may wish to have a specific model developed for this segment. If the number of people in such a segment is too low, it will be difficult for any modelling technique to find any robust statistical law a way of overcoming this limitation is to compile a training data set that will concatenate “snapshots” of “5 stars” customers not only on the last month, but on several past months: in this case the training data set may contain attributes values for a given customer at different dates. The present invention provides a formal process for deriving attributes values of a given entity of interest at a given time (to compile several versions of a single entity at different stages of its life cycle).
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications will be apparent to a skilled person in the art which lie within the scope of the present invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB2008/054156 | 7/9/2008 | WO | 00 | 1/7/2011 |