Priority is claimed to European Patent Application No. EP 16 15 9981.6, filed on Mar. 11, 2016, the entire disclosure of which is hereby incorporated by reference herein.
The present invention relates to a method and a system for generating a training model for fabricating synthetic data. The present invention also relates to a method and a system for fabricating synthetic data. The present invention also relates to a corresponding computer-readable medium containing a program to perform any of the presented methods.
An aspect of the present invention is the modeling of real data for the purpose of synthesis of statistically similar data. One of the purposes of data synthesis is to create secure and privacy compliant randomized data that retains most of the statistical nuances of the original real data. The challenge to achieving this lies in modeling real data without any end use-case in mind.
In the general setting, this is an ill-posed problem and has no solution that models all relationships within the data automatically, due to the combinatorial complexity in detecting and modeling.
In an exemplary embodiment, the present invention provides a method for generating a training model for fabricating synthetic data. The method includes: a) providing training data comprising information on a plurality of objects, the training data further comprising information on a plurality of states and state transitions of at least one object of the plurality of objects, the state transitions of the at least one object being chronologically ordered into a plurality of sessions; b) clustering a plurality of states and/or a plurality of state clusters into a plurality of state transition clusters; c) clustering sessions of respective state transition clusters into a plurality of session clusters; d) modeling each session cluster using a stochastic process; and e) training a probabilistic model using information on a respective object, the session clusters corresponding to the respective object, start times of sessions corresponding to respective session clusters, end times of sessions corresponding to the respective session clusters, and delays between the respective session clusters.
The present invention will be described in even greater detail below based on the exemplary figures. The invention is not limited to the exemplary embodiments. All features described and/or illustrated herein can be used alone or combined in different combinations in embodiments of the invention. The features and advantages of various embodiments of the present invention will become apparent by reading the following detailed description with reference to the attached drawings which illustrate the following:
The present invention offers, among others, a tractable and efficient solution for certain domains of data and analytics.
A direct consequence of this is that the present invention is not restricted to any specific use case for which we model data. The present invention provides a generic way for someone knowledgeable in the semantics of the data to map it into a machine readable format that allows for automatic modeling and synthesis.
Exemplary embodiments of the invention provide a method and a system for generating a training model for fabricating synthetic data. Exemplary embodiments of the present invention further provide a method and a system for fabricating synthetic data. Exemplary embodiments of the present invention further provide a corresponding computer-readable medium containing a program to perform any of the aforementioned methods.
The method according to the present invention is computer-implemented. However, it is understood by the skilled person that there are also other ways of implementing the method according to the present invention.
The invention relates to a method for generating a training model for fabricating synthetic data comprising the steps of:
a) providing training data comprising information on a plurality of objects, the training data further comprising information on a plurality of states and state transitions of at least one object of the plurality of objects, the state transitions of the at least one object being chronologically ordered into a plurality of sessions;
b) clustering a plurality of states and/or a plurality of state clusters into a plurality of state transition clusters;
c) clustering the sessions of respective state transition clusters into a plurality of session clusters;
d) modeling each session cluster using a stochastic process; and
e) training a probabilistic model using information on the respective object, the corresponding session clusters, start times of the corresponding sessions of the session cluster, the end times of the corresponding sessions of the session cluster and the delays between the corresponding session clusters.
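The data hierarchy assumed by steps a) to e) can be sketched with simple data structures. The class and field names below are illustrative assumptions, not part of the claimed method; they merely mirror the described relationship of objects, sessions and state transitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical structures mirroring the hierarchy used by the method:
# a tracked object owns chronologically ordered Sessions, each Session
# being a list of state transitions (Transactions) with associated Items.

@dataclass
class Transaction:
    from_state: int                       # e.g. a state/location identifier
    to_state: int
    start_time: float                     # UNIX timestamp
    items: Dict[str, object] = field(default_factory=dict)

@dataclass
class Session:
    transactions: List[Transaction]

    def start_time(self) -> float:
        return self.transactions[0].start_time

    def end_time(self) -> float:
        return self.transactions[-1].start_time

@dataclass
class TrackedObject:
    object_id: str
    sessions: List[Session]
```

Start times, end times and delays between sessions, as used in step e), follow directly from this representation.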
Preferably, the method further comprises: clustering a plurality of states of respective objects into a plurality of state clusters after step a) and clustering the plurality of state clusters into a plurality of state transition clusters.
Preferably, the stochastic process in step d) comprises a Markov Chain.
Preferably, the probabilistic model in step e) comprises a Bayesian Network.
Preferably, the method further comprises at least one of the following steps before step a):
removing empty data and/or removing outlier data from the training data;
allocating a unique identifier OID to each object;
allocating a unique identifier TID to each state transition; and
splitting at least one chronologically ordered sequence of state transitions into a plurality of sessions.
Preferably, at least one constraint is introduced into the probabilistic model.
Preferably, the clustering of the sessions of respective state transition clusters into a plurality of session clusters in step c) is performed using the Fréchet distance to measure the distance between sessions.
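As an illustration, the (discrete) Fréchet distance between two sessions, viewed as polygonal curves of locations, can be computed with the classical Eiter-Mannila dynamic programme. The function name and the point representation are assumptions for this sketch; the description does not fix a concrete algorithm.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between curves p and q, each a list of
    (x, y) points (Eiter & Mannila dynamic programme)."""
    def d(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]    # coupling measure table
    ca[0][0] = d(p[0], q[0])
    for i in range(1, n):                 # first column: walk along p only
        ca[i][0] = max(ca[i - 1][0], d(p[i], q[0]))
    for j in range(1, m):                 # first row: walk along q only
        ca[0][j] = max(ca[0][j - 1], d(p[0], q[j]))
    for i in range(1, n):
        for j in range(1, m):
            ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]),
                           d(p[i], q[j]))
    return ca[n - 1][m - 1]
```

The resulting distance matrix over all session pairs could then feed any distance-based clustering of sessions.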
Preferably, the clustering of the sessions of respective state transition clusters into a plurality of session clusters in step c) is performed using the word2vec tool (see references [1] and [2]).
Preferably, at least one node of the Markov Chain represents a corresponding state transition cluster.
Preferably, the at least one node is linked to a further Bayesian Network for training to predict at least one item of the respective state transition cluster.
Preferably, the at least one node is linked to a further Bayesian Network for training to predict the delay time until the subsequent state transition following the corresponding state transition cluster is performed.
Preferably, at least one constraint is introduced into the stochastic process.
Preferably, the clustering of the plurality of states of respective objects into a plurality of state clusters is performed using vector quantization, preferably using a k-means heuristic, more preferably using a k-means heuristic with the Euclidean distance as a distance measure.
Preferably, the clustering of the plurality of state clusters into a plurality of state transition clusters is performed using vector quantization, preferably using a k-means heuristic, more preferably using a k-means heuristic with the Euclidean distance as a distance measure.
The invention also relates to a method for fabricating synthetic data comprising any of the aforementioned methods.
Preferably, the method for fabricating synthetic data comprises the following steps:
The invention also relates to a system for generating a training model for fabricating synthetic data using any of the aforementioned methods for generating a training model for fabricating synthetic data.
The invention also relates to a system for fabricating synthetic data using any of the aforementioned methods for fabricating synthetic data.
The invention also relates to a computer-readable medium containing program instructions for causing a computer to perform any of the aforementioned methods.
The present invention has, among others, the following advantages:
The present invention will be explained in more detail in the following with reference to preferred exemplary embodiments and with reference to the attached drawings, in which:
The present invention relates to modeling and synthesizing datasets that contain certain salient features such as an ObjectID that can be followed over time, and wherein records essentially can be said to describe “Transactions” or state-transitions and have associated variables (called “Items”) that are dependent on the Transaction.
One example of such a dataset is movement data from taxi trips, with the ObjectID denoting the driver. Thus, every trip the driver makes can be seen as a "Transaction" from one location (state) to another (state), and all the associated data collected, such as distance, cost, tip, time, etc., relates to the Transaction (trip) being performed.
Similarly, the method extends to other motion datasets, including long-haul train and bus networks as well as cellular and data networks. The current implementation supports two main types of functionality:
(1) Creating a new model for a given data source as can be seen in
According to this embodiment, the model creation process basically comprises five steps:
(2) Fabricating synthetic data according to the learned model as can be seen in
Input Data and Data Mapping
In order to distinguish between the different data fields and maintain a generic structure, each data source needs to be mapped against the generic input structure needed by the system.
This task is performed by a Data Mapper in a pre-processing step which, as will be discussed below, is individually designed and configured for different input data.
As stated earlier, the present invention is capable of modeling the state transition behaviour of multiple objects within the data set.
In this embodiment, the class of data sets for which the algorithm works is defined as those data sets comprising one or more ObjectIDs that are consistent or persistent throughout the dataset and whose behaviour is of interest within the dataset.
This ObjectID could represent an individual (anonymized) or an item such as a movie, etc. Furthermore, the object performs various transactions that include various state transitions. Preferably, the number of states that denote a transaction is some constant greater than or equal to 1 and valid for all Transactions within the data set. Each Transaction has therefore a Start Time and the elapsed time between one state and the next within the Transactions are denoted by the term “Duration”.
Thus, for example, with a taxi trip being a Transaction, the starting location and the destination represent two states, and the time between the start and the end of the trip is the Duration. Preferably, when the Transaction is a single-state action, the Duration can be assumed to be zero if not otherwise stated.
Preferably, the states of a Transaction are denoted as Cartesian coordinates. However, according to other embodiments, the states can be extended easily to categorical data too.
Algorithm Flow
The algorithm progresses by tracking the Transactions of each ObjectID in terms of Sequences, Sessions, Transactions and States in the following manner.
A Sequence is defined as the complete list of “Transactions” or all possible records per unique ObjectID.
The Sequence is then split into a plurality of Sessions. According to other embodiments, a Session can be considered as a sequence of closely timed Transactions separated from other Sessions by a long temporal pause.
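The splitting of a Sequence into Sessions at long temporal pauses can be sketched as follows. The gap threshold `max_gap` is an assumed, tunable parameter not fixed by the description.

```python
# Split one Sequence (chronologically ordered Transaction start times of a
# single ObjectID) into Sessions wherever the temporal gap between two
# consecutive Transactions exceeds max_gap (here: seconds, illustrative).

def split_into_sessions(start_times, max_gap=3600.0):
    """start_times: sorted UNIX timestamps of one object's Transactions.
    Returns a list of Sessions, each a list of timestamps."""
    sessions = []
    current = []
    for t in start_times:
        if current and t - current[-1] > max_gap:
            sessions.append(current)      # long pause: close the Session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions
```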
Object behaviour is modeled by first hierarchically clustering the various Sequences and component Sessions.
One of the advantages of breaking Sequences into Sessions is that Sequences could be very long and thereby hinder any reasonable attempt at clustering similar Sequences.
The Sessions themselves are made up of individual Transactions. Every Transaction comprises an ObjectID making some set of state transitions each of which takes some time (Duration) and has related variables (Items).
In order to group similar Sessions together (preferably using k-means), Transactions are first grouped together, following the hierarchical clustering paradigm of working bottom-up to reduce the number of data elements being worked with and to find more compact representations of them.
Therefore, the following fields can be handled; some are mandatory and some are optional fields according to this embodiment:
Creating a New Model Based on Input Data
This section describes the process of training a new synthesization model based on real data. The complete process is depicted in
Preprocessing
It is assumed that the data is already in the format as described above.
Before the modelling starts, the input data is transformed to perform object tracking for modelling, resulting in the creation of two new columns as can be seen in the below table:
By adding the SessionID, state transitions are modelled in a 4-level hierarchy as depicted in
Hierarchical Transition Clustering
The main objective of this step is to compress the "Big Data" input into a compact representation (clusters) of the core characteristics needed to allow efficient statistical modelling of the data in the later stages of the algorithm.
As described above, transitions occur at 4 hierarchical levels: Sequences→Sessions→Transactions→States(Locations).
In order to best represent such a hierarchy, the data are clustered bottom-up, beginning with the States (locations) and ending at Session level. Sequences are not currently clustered as they are expected to be too different and do not lend themselves to clustering.
Single State (Location) Clustering (First Hierarchy Level). Please note that the generic concept of State clustering is referred to interchangeably with the term Location clustering.
The Location Clustering step aims at grouping single locations of interest and thereby imposing a network structure onto the data. This helps in clustering the Transactions. An exemplary sketch of the single location clustering can be seen in
All single locations (latitude and longitude coordinates) are extracted to form an N×2 feature matrix. Subsequently, a k-means heuristic is used to cluster all single locations into similar groups. Thereby, the Euclidean distance is used as a distance measure. The resulting centroids are added to each location of the Transactions as depicted in the below table. In the next clustering stage (Transaction clustering), those newly created single location centroids will be used to find Transaction clusters.
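A minimal sketch of this k-means clustering of single locations with Euclidean distance is given below. A production system would rather rely on a library implementation (e.g. Spark MLlib), and the initialization scheme and iteration count here are illustrative assumptions.

```python
import numpy as np

# Minimal k-means over an N x 2 latitude/longitude feature matrix,
# using Euclidean distance, as described for the Location Clustering step.

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each location to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids as cluster means; keep empty clusters in place.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The returned centroid of each location would then replace the raw coordinates in the subsequent Transaction clustering stage.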
Preferably, the parameter k is determined using at least one of the following metrics: BIC, AIC or the Gap statistic. An example of the Gap statistic can be found in reference [3].
Note that all of these metrics require performing k-means repeatedly.
Trip Clustering (Second Hierarchy Level)
In the previous clustering stage, single locations were clustered together. In this stage, similar Transactions are being clustered to allow compression and modelling of Transaction information.
Instead of using the raw real data locations, the single location centroids are used to represent Transactions. This helps in creating connected Transaction clusters as depicted in
Preferably, the parameter k is determined using at least one of the following metrics: BIC, AIC or the Gap statistic.
Note that all of these metrics require performing k-means repeatedly.
Session Clustering (Third Hierarchy Level)
As discussed above, Sequences were split into Sessions to make movements clusterable. One of the motivations is that shorter movement patterns lend themselves better to clustering because patterns might become visible.
Following the Location Modelling, in the Session Modelling phase below, each cluster of Sessions will be transformed into a Markov Chain model that models movements and items.
Different approaches to clustering Sessions were tested; three ideas will be presented in the following, each resulting in an additional column SessionClusterID as shown in the below table:
Compared to the previous clustering stages, the Session clustering is the most difficult stage. While it is again difficult to determine the right k, this stage has various further pitfalls to overcome:
Session Modelling
In the location modelling stages, the aim was to compress data to its statistical core characteristics by finding groups of similar behaviour on different hierarchical levels of movements (Single Locations, Transactions, Sessions and Sequences).
Scope of the Session Modelling Stage
The Session Modelling stage aims at modelling movement on the Session level, which is the transition from Transaction cluster to Transaction cluster. As described above, Sessions were the result of splitting long sequences into multiple smaller parts to improve clustering results. The question of which Sessions are performed after one another will be modelled in the sequence modelling stage below.
Implementation
The concept in this embodiment is to use Markov Chains to model transitions between Transaction clusters, extended with additional machine learning models to allow the incorporation of all influencing factors that lead to a certain transition, as well as the generation of items. The prototypical implementation of the extended Markov Chain model contains four top-level elements: standard nodes, edges between nodes, start nodes and end nodes. Standard nodes themselves represent Transaction clusters and contain further predictive models, as can be seen in
In
1) All IDs of successors & predecessors including transition probabilities;
2) Access to a Bayesian Network for predicting transaction delays trained on information about current & previous Transaction Cluster;
3) Access to a Bayesian Network for item generation, trained on the current Transaction Cluster.
Every Markov Chain models one Session cluster. It comprises one start node, one end node, an arbitrary number of standard nodes (each representing a Transaction cluster), and transition probabilities between those nodes, as can be seen in
Transaction Cluster nodes of one Markov Chain (or Session) can also appear again in other Markov Chains that are representing other Session clusters.
The start node is the entry point into the Markov Chain and allows sampling the first Transaction Cluster for the Session from its transition probabilities. The start node itself is not connected to any Transaction cluster.
Similarly, the end node is not connected to any Transaction Cluster. Once the end node is reached, a Session has ended. Each node contains various elements which altogether form the extended Markov Chain implementation used.
Successors and Predecessors. Each node contains a list of its possible successors and predecessors, including their transition probabilities. However, according to this embodiment, the predecessor list of the start node is empty. Conversely, according to this embodiment, the successor list of the end node is empty. The transition probabilities are, in this embodiment, calculated by counting frequencies and then building the empirical cumulative distribution function from the results. Successors are sampled using the method of inverse transformation sampling. However, other embodiments could consider conditional probabilities computed based on previously visited Transaction clusters, for example.
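The frequency counting, empirical CDF construction and inverse transformation sampling described above can be sketched as follows; the function names are illustrative.

```python
import random
from collections import Counter

# Estimate transition probabilities of a node by counting observed
# successor frequencies, build the empirical cumulative distribution
# function, and draw successors via inverse transformation sampling.

def build_successor_cdf(observed_successors):
    counts = Counter(observed_successors)
    total = sum(counts.values())
    cdf = []
    cumulative = 0.0
    for succ, c in sorted(counts.items()):
        cumulative += c / total
        cdf.append((succ, cumulative))     # (successor, cumulative prob.)
    return cdf

def sample_successor(cdf, rng=random):
    u = rng.random()                       # uniform draw on [0, 1)
    for succ, f in cdf:
        if u <= f:
            return succ
    return cdf[-1][0]                      # guard against rounding error
```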
Bayesian Networks for Transaction Delay Prediction. Each Transaction Cluster node of a Markov Chain preferably has access to a Bayesian network to predict the delay time until the next Transaction is performed. The Bayesian network is trained on the current and all previous Transaction Clusters, if they exist. Different Markov Chains might be able to access the same Bayesian Network objects if they contain the same Transaction clusters.
Bayesian Network for Item Generation. Each Transaction Cluster Node of a Markov Chain has access to a Bayesian Network that allows predicting the items for the current Transaction Cluster. In the case of a taxi trip, items could be the tip a passenger has given, the costs, the duration of the trip and/or the distance. Different Markov Chains can access the same Bayesian Network Objects if they contain the same Transaction clusters.
In the following all main challenges encountered during the development are discussed:
Sequence Modelling
While the aim of the Session Modelling stage is to model transitions within a Session cluster, this section combines multiple Sessions into a single Sequence belonging to a single ObjectID.
In order to do so, the real clustered data is used to build a table containing one Session cluster per row belonging to an ObjectID, with its start time, end time and delay time to the next Session. This table is then used to train a Bayesian Network that learns the relations between ObjectIDs, SessionClusterID, start time, end time and delay to the next SessionCluster. Subsequently, this Bayesian Network is used to produce a table of synthetic data containing similar information. This table is then used to create sequences of Sessions for an ObjectID in the synthesis process.
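Building the per-Session table used to train that Bayesian Network can be sketched as follows. The column names mirror those used later in the synthesis table, and the handling of an object's last Session (delay 0) is an illustrative assumption.

```python
# Build one row per Session (cluster) of an ObjectID, with its start time,
# end time and delay to the next Session, as the training table for the
# sequence-level Bayesian Network.

def build_sequence_table(sessions_by_object):
    """sessions_by_object: {object_id: [(session_cluster_id, start, end), ...]},
    with sessions chronologically ordered per object."""
    rows = []
    for oid, sessions in sessions_by_object.items():
        for i, (cluster_id, start, end) in enumerate(sessions):
            if i + 1 < len(sessions):
                delay = sessions[i + 1][1] - end   # gap to next Session start
            else:
                delay = 0.0                        # last Session: no successor
            rows.append({"ObjectID": oid,
                         "SessionClusterID": cluster_id,
                         "StartTimeOfSession": start,
                         "EndTimeOfSession": end,
                         "Delay": delay})
    return rows
```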
Preferably, the end time of a Session is integrated in the prediction of the delay between Transactions.
With low probability, sequences can last for a long time and end in the far future. Therefore, manual constraints are preferably added to prevent such sequences.
Fabricating Synthetic Data based on Model
This section relates to the process of fabricating synthetic data based on a training model. The complete process is shown in
Implementation
According to this embodiment, the process starts by loading a list of serialized R-Objects that contain all information needed to synthesize data.
Next the user specifies the total number of Sessions to be generated. As a result, based on a trained Bayesian Network, a table is generated as shown in the below table:
The table contains n rows, each having an ObjectID, SessionClusterID, StartTimeOfSession, EndTimeOfSession and a Delay time. Each row represents the top level description of a single Session that will be generated as part of the synthesis:
Next, the table is split based on ObjectIDs. Thereby, each resulting table per ObjectID represents a sequence of Session Clusters that will be part of the synthesis. As depicted in
In the course of the synthesis, for each ObjectID, the Markov Chain model for each predicted Session Cluster is loaded and traversed according to its trained properties. Thereby, items and delays between Transactions are generated using the trained Bayesian Networks for each Transaction Cluster, as can be seen in
As described above, the entry point to a Markov Chain is its starting node. It is traversed according to its transition probabilities, until the end node has been reached, the last starting time for a Transaction is reached, or the maximum number of Transactions for that Session Cluster is achieved.
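The traversal just described can be sketched as follows. Representing a chain as a mapping from node to successor lists with probabilities is a simplification for illustration, and the time-based stop condition is omitted here.

```python
import random

# Traverse one (simplified) Markov Chain during synthesis: start at the
# start node and sample successors until the end node is hit or the maximum
# number of Transactions for the Session is reached.

def traverse_chain(chain, start="START", end="END", max_transactions=100,
                   rng=random):
    """chain: {node: (successor_list, probability_list)}; returns the list
    of visited Transaction cluster nodes (the synthesized Session)."""
    path = []
    node = start
    while node != end and len(path) < max_transactions:
        successors, probs = chain[node]
        node = rng.choices(successors, weights=probs, k=1)[0]
        if node != end:
            path.append(node)   # each visited node is a Transaction cluster
    return path
```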
If one Session has ended, the predicted delay time from the ObjectID table, as shown in the above table, is used to calculate the first starting time of the next Session.
Preferably, the resulting synthetic data table contains one row per Transaction and preferably has the same format as the input data.
The presented invention has been evaluated on the New York Taxi Data Set and an extract of Call Data Records coming from the mobile network of Telekom.
Data
The used New York Taxi data set includes 30 GB of single taxi trips for the year 2013, each row of data containing at least the following information: Medallion; hack_licence; Pickup_Datetime; Pickup_latitude; Pickup_Longitude; dropoff_datetime; dropoff_latitude; dropoff_longitude; trip_distance; trip_time_in_secs; surcharge; fare_amount; tip_amount; total_amount. The New York Taxi Data Set 2013 is retrievable from: http://www.andresmh.com/nyctaxitrips/
A visualization of all pickup and drop-off locations is depicted in
The second type of data comprises different extracts of Call Data Records coming from the mobile network of Deutsche Telekom. Each row of data contains a single event that happened when a mobile phone signal was handled by a tower, as can be seen in
Thereby, each row contains the following information: ObjectID; StateID; Timestamp; Item. The ObjectID represents the individual using the phone, the StateID represents the tower to which a connection was established, the Timestamp contains the time at which it happened and the Item contains the type of event that has happened (e.g. handover).
Technologies
Two development technologies can be used: Spark and R. Spark is preferably used for the first half of the algorithm, which comprises the Location and Movement clustering.
R is used in this embodiment for the subsequent stages, namely Session Modelling, Sequence Modelling and finally the fabrication of synthetic data. However, these stages could also be realized in other languages such as SCALA or Python on a parallel platform like Spark.
The visual analysis of the results can be performed using Tableau, as it allows an interactive analysis of geospatial data. Static visual analyses can be performed in R.
Reasoning for Combining R and Spark
R is a scientific scripting language designed for statistics. Therefore, the language natively contains constructs helpful for statistical analysis. It also allows an interactive analysis of data without the need to continuously recompile code. It is suitable for quick and dirty "data wrangling" and testing of ideas, and offers an immense number of different statistical packages.
On the downside, R was not developed with multi-core or production-quality systems in mind. Hence, speed, parallelization capabilities and stability are not always sufficient. Also, memory management is inefficient and can lead to unnecessary consumption of RAM. Further, R is not natively object-oriented.
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning. On the positive side, Spark is currently the fastest Big Data processing system and is intended for multi-core and multi-cluster systems. It is also intended for the development of production-ready systems which are stable and maintainable.
Further, it has more efficient memory management than R. For every execution, the complete code is compiled, which makes it fast. Spark is written using the object-oriented programming language Scala. However, it is also possible to write applications using Java or Python. Scala is preferred as it is the easiest and most direct way to develop Spark applications. On the downside, interactive analysis and quick and dirty testing of ideas are not as easy as with R.
Also, due to the novelty of Spark, it currently offers only very few statistical libraries. Further, the employed map-reduce paradigm can reduce development speed if the developer is not used to it.
Hence, it is preferred to first use an interactive and rich tool like R, for the verification and testing of new ideas. Once a part of the pipeline is verified and fully understood, it is preferably realized in SCALA or Python on a parallel platform like Spark.
The present invention is applied and tested on two data sets: First, an extract of the freely available Taxi Data from New York City (downloadable at http://www.andresmh.com/nyctaxitrips/) is used to develop and test the implementation of the invention.
The Telekom-owned CDR data serves as a second test data set. This section describes the data sets used, and compares the original data to the results of the synthesis.
New York City Taxi Data
In the raw NYC Taxi data the information of one trip is divided into two different files—one containing the trip data, the other containing the trip fares. These two files are merged to get the necessary data structure for the algorithm.
The used input file for the algorithm is depicted in
A row represents a trip or, as it is called in the algorithm, a Transaction between two locations. The ObjectID is an alphanumeric string which represents a pseudonymization of the taxi driver. The PickupTimestamp represents the date and time of the trip start, and the DropoffTimestamp represents the end of a trip. During the algorithm, the time is converted into a UNIX timestamp to facilitate processing. FromLat and FromLong are the coordinates of the pickup location of the taxi; analogously, ToLat and ToLong are the coordinates of the drop-off location.
The Distance of a trip, respectively between pickup and drop-off is given in miles, the Duration is displayed in seconds. Cost and Tip are represented in US-Dollars. Within the data Tips are only reported for credit card transactions. Cash tips are not included. The columns FromLat, FromLong, ToLat and ToLong represent the state/location information. Distance, Cost and Tip belong to the items of the transaction.
Some results of the comparison of real clustered data and the synthesized data are shown in
In the synthetic data, similar patterns of the clustered locations are visible. However, a higher number of sessions, respectively transactions, can improve the similarity of both data sets regarding the pickup locations. Furthermore, a higher number of clusters can result in a more precise location synthesis.
Comparing the pickup times of the real and the synthesized data, an almost equal period could be achieved, as can be seen in
The relationship between Distance and Tip is represented in
CDR Data
The CDR Data contains call data records from cell towers, in practice information about when and from where an SMS was sent. For testing purposes, 15 minutes of data is provided and transformed into the necessary algorithm data structure.
An extract of the used data structure is depicted in
The StateID represents an internal code for the location of the tower, which is used as a categorical variable in this case. The Timestamp gives the UNIX time (including milliseconds) of the event. Since there is no end time, the duration of a transaction is always zero. The SMSType displays the type of transaction; there are two different types. AgeCategory is a categorical value for an internally defined age group. Gender represents the gender of the object. The last three variables belong to the items of the transaction. Furthermore, the last two columns contain missing values which have to be replaced (in this case with the value 0) for the modelling part (in order to create and use the Bayesian networks).
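The replacement of missing item values with 0 prior to modelling can be sketched as follows; the row representation and the column names are illustrative assumptions.

```python
# Replace missing values in selected item columns with a fill value so that
# the downstream models receive complete records. Rows are represented as
# dicts; columns with value None (or absent) count as missing.

def fill_missing(rows, columns=("AgeCategory", "Gender"), fill_value=0):
    filled = []
    for row in rows:
        new_row = dict(row)                  # do not mutate the input
        for col in columns:
            if new_row.get(col) is None:
                new_row[col] = fill_value
        filled.append(new_row)
    return filled
```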
In the following, the results of the synthetic data are compared with the real CDR data.
The general behaviour is displayed in the synthetic data, e.g. including the peaks between 11:00 and 11:10 a.m.
Comparing the synthesized and real data regarding the ObjectID, as can be seen in
This issue can be addressed by modeling variables/items which belong to the object separately from the general item modeling and synthesis (where the items are assigned to a certain transaction).
Considering the segmentation by AgeGroup, the SMS type “DTAP SMS MT” was better synthesized than the other one. The AgeGroup “2” has a smaller share in the synthetic data than in the real one. Furthermore, the AgeGroup “6” is not represented at all in the synthetic data.
The second item in the data is the AgeGroup which represents a range of age for a certain object. The distribution of the age groups within the real and synthetic data is shown in
Considering Gender in the synthetic data, the relative amount throughout all transactions is comparable with the relative amounts within the real data, as can be seen in
During the development of the present invention, a group of general challenges emerged that are addressed by the present invention.
A summarization of those challenges is depicted in
Automated Clustering
The chosen clustering algorithm is k-means. It is a heuristic approach to clustering data which cannot guarantee the best cluster centers for a dataset, as the initialization is random.
However, it is much faster than agglomerative hierarchical clustering, which has a complexity of O(n²).
In order to automatically determine a reasonable number of clusters for a given dataset, it is necessary to have some kind of metric that evaluates the goodness, or quality, of the clusters for a varying number of clusters.
The so-called elbow approach is one proposal for finding the right k. An elbow diagram is shown in
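The elbow approach described above can be sketched in Python as follows. This is a minimal, hypothetical illustration and not the implementation of the present invention: a heuristic k-means with random initialization is run for increasing k, and the within-cluster sum of squares (the "goodness" metric) is recorded; the point where it stops dropping sharply marks the elbow.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    # Heuristic k-means with random initialization, so the best
    # cluster centers are not guaranteed.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: (p - centers[i]) ** 2)].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster happens to be empty.
        centers = [sum(cl) / len(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def wcss(points, centers):
    # Within-cluster sum of squares, evaluated for varying k
    # in the elbow approach.
    return sum(min((p - c) ** 2 for c in centers) for p in points)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]  # two obvious groups
scores = {k: wcss(data, kmeans(data, k)) for k in (1, 2, 3)}
```

On this toy dataset the score drops sharply from k=1 to k=2 and barely changes afterwards, so the elbow suggests k=2.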
Domain Knowledge needed
Preferably, domain knowledge is used to model the possible relationships between variables in the Bayesian networks.
Incorporating the past
Preferably, items are not modeled based on the past. That means past items do not influence current items. Preferably, delays between transactions are only based on the previous transaction.
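The first-order assumption on delays described above can be sketched as follows. This is a hypothetical illustration: the cluster labels, mean delays, and the exponential delay distribution are assumptions for the sketch, not part of the specification.

```python
import random

# Assumed mean delay (e.g. in minutes) per session cluster; the delay
# before the next transaction depends only on the cluster of the
# *previous* transaction, not on any earlier history.
DELAY_PARAMS = {"A": 5.0, "B": 30.0}

def sample_delays(cluster_sequence, rng):
    # Draw one delay per transition, conditioned solely on the
    # previous transaction's cluster (first-order assumption).
    return [rng.expovariate(1.0 / DELAY_PARAMS[prev])
            for prev in cluster_sequence[:-1]]

rng = random.Random(42)
delays = sample_delays(["A", "A", "B", "A"], rng)
```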
Constraints of Data Types
The present invention is capable of modeling data that is presented as a flat table with no missing data values. If missing data exists, the data-mapper assigns a categorical label to it so that the core algorithms can process it seamlessly.
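A minimal sketch of such a data-mapper step, assuming rows are Python dictionaries and using an illustrative sentinel label (the label name is an assumption, not prescribed by the specification):

```python
MISSING = "__MISSING__"  # illustrative categorical label for missing values

def map_missing(rows):
    # Replace None entries with a categorical label so downstream models
    # (e.g. Bayesian networks) only ever see a fixed set of values.
    return [{k: (MISSING if v is None else v) for k, v in row.items()}
            for row in rows]

rows = [{"ObjectID": "u1", "Item": None}, {"ObjectID": "u2", "Item": "sms"}]
clean = map_missing(rows)
```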
Valid Data Types
Essentially categorical data, i.e. nominal and ordinal scaled data, is modeled.
All data in which the various variables (i.e. columns of the dataset) relate to a fixed set of predefined values/labels (either numeric or otherwise) can be processed and modeled by the present invention. An example of such categorical data is shown in the table below:
In contrast to such categorical data, interval or ratio data, which maps to real numbers (such as age, income, or temperature), cannot be dealt with directly.
Furthermore, no dedicated models are developed for such real-valued variables. The following table contains both categorical (ObjectID) and interval/ratio variables. Bayesian networks and decision trees can deal with such data, but in a less efficient manner.
Possible Extensions to Non-Categorical Data
As the present invention preferably uses categorical data, any real valued data can be quantized into a potentially ‘large’ set of bins (N bins), thereby making the data categorical (by falling into one of the N bins).
This can be done by the data-mapper that pre-processes data before transferring it over to the core methods that model and synthesize data.
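The quantization into N bins can be sketched as follows. Equal-width binning is assumed here for illustration; the actual data-mapper may use other binning schemes.

```python
def make_binner(values, n_bins):
    # Equal-width quantization of a real-valued column into n_bins
    # categorical labels, as the data-mapper might do before modeling.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    def to_bin(v):
        # Clamp the maximum value into the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        return f"bin_{i}"
    return to_bin

ages = [18, 25, 33, 47, 62, 71]   # illustrative real-valued column
binner = make_binner(ages, 4)
labels = [binner(a) for a in ages]  # now categorical
```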
Tweaking the Clustering Phase of the Current System
Preferably, the present invention for data modeling and synthesis assumes one variable field (column of data) to be designated as a StateID that is alphanumeric and used for clustering the data.
The idea behind this clustering is, on the one hand, to break the data into meaningful groups of similar data (clusters) while simultaneously reducing the amount of data that (though similar) needs to be processed individually.
According to an embodiment, generic multi-dimensional clustering approaches are used with a variety of distance metrics allowing for far greater flexibility. Preferably, all the real valued columns are designated jointly as a StateID (meant for clustering) label. A direct impact of this is that the columns lend themselves to meaningful clustering as shown in the below table:
This table shows a typical data set with real-valued entries, such as locations in latitude and longitude, which could be clustered if the start and end locations were all used as one multi-dimensional StateID. Furthermore, Dist (distance) can be quantized and thereby made categorical.
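A minimal sketch of such multi-dimensional clustering, treating the joint start and end coordinates as one StateID and using the Euclidean distance metric (the trip coordinates below are illustrative, not taken from the dataset):

```python
import math
import random

def dist(a, b):
    # Euclidean distance over the joint multi-dimensional StateID.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=30, seed=1):
    # Heuristic k-means over tuples; random init, so not guaranteed optimal.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

# Each row's StateID is the joint (start_lat, start_lon, end_lat, end_lon).
trips = [
    (48.13, 11.57, 48.14, 11.60),  # illustrative Munich-area trips
    (48.12, 11.58, 48.15, 11.59),
    (52.52, 13.40, 52.50, 13.42),  # illustrative Berlin-area trips
    (52.53, 13.41, 52.51, 13.40),
]
centers = kmeans(trips, 2)
```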
Constraints on Data Columns (variables)
Beyond the above constraints on which domains of data sets the presented method is suitable for, there are some additional constraints on the nature of the data that can preferably be processed.
The datasets preferably comprise:
Constraints of Relationship Types
According to an embodiment, the present invention uses a state transition approach that groups (clusters) similar transitory behaviour and then models each of these “unique” clustered behaviours.
Abilities of the present invention include, among others, modeling interdependencies between variables at a given state based on current and previous states alone.
Privacy Preservation
The exact details of how the privacy of users is further enforced and preserved are preferably not specified in terms of particular computational algorithms. Preferably, standard mechanisms to ensure privacy, such as “k-anonymity”, are implemented in the methods and systems of the present invention, preferably together with approaches such as differential privacy.
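As a sketch of the k-anonymity criterion mentioned above (the field names and threshold are illustrative; this is not the enforcement mechanism of the present invention, only a standard check):

```python
from collections import Counter

def k_anonymous(records, quasi_ids, k):
    # A dataset is k-anonymous with respect to the quasi-identifiers if
    # every combination of their values occurs at least k times.
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

records = [
    {"AgeCategory": "2", "Gender": "F"},
    {"AgeCategory": "2", "Gender": "F"},
    {"AgeCategory": "6", "Gender": "M"},  # a unique combination
]
```

Here the single ("6", "M") record breaks 2-anonymity, which is the kind of situation suppression or generalization would have to repair.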
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. It will be understood that changes and modifications may be made by those of ordinary skill within the scope of the following claims. In particular, the present invention covers further embodiments with any combination of features from different embodiments described above and below. Additionally, statements made herein characterizing the invention refer to an embodiment of the invention and not necessarily all embodiments.
The terms used in the claims should be construed to have the broadest reasonable interpretation consistent with the foregoing description. For example, the use of the article “a” or “the” in introducing an element should not be interpreted as being exclusive of a plurality of elements. Likewise, the recitation of “or” should be interpreted as being inclusive, such that the recitation of “A or B” is not exclusive of “A and B,” unless it is clear from the context or the foregoing description that only one of A and B is intended. Further, the recitation of “at least one of A, B and C” should be interpreted as one or more of a group of elements consisting of A, B and C, and should not be interpreted as requiring at least one of each of the listed elements A, B and C, regardless of whether A, B and C are related as categories or otherwise. Moreover, the recitation of “A, B and/or C” or “at least one of A, B or C” should be interpreted as including any singular entity from the listed elements, e.g., A, any subset from the listed elements, e.g., A and B, or the entire list of elements A, B and C.
[1]—https://code.google.com/archive/p/word2vec/
[2]—Mikolov et al., Efficient Estimation of Word Representations in Vector Space; retrievable from arxiv: http://arxiv.org/pdf/1301.3781.pdf
[3]—https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/
Number | Date | Country | Kind
---|---|---|---
16 15 9981.6 | Mar 2016 | EP | regional