The number of disparate sources of consumer data has increased dramatically over the last two decades, largely due to the growth of the internet and the proliferation and accessibility of connected digital devices. Consolidating disparate data sources can yield a comprehensive understanding of consumers: their attributes, behaviors, locations, interests, and tendencies. Many applications rely on such an understanding, including, but not limited to, advertising, marketing, customer service, fraud detection, and homeland security. However, consolidating data from disparate sources creates challenges. One such challenge is auditing. To illustrate, data must be moved from its origin to a central location, merged with other data originating from different sources, and then typically processed through multiple transformations. Moreover, many transformations change the cardinality of the originally received data: one record can transform into many, and many records can transform into one. These transformations physically blend similar and dissimilar data elements to derive new data elements. To further complicate matters, derived data elements are often blended with other derived data elements. The result is a series of derived data elements that often bear little or no resemblance to their original source data elements. Auditing transformations makes it possible to answer questions like “where did this data come from?” and “where did this data go?”. These answers are necessary for effective data governance and, in many cases, required for compliance with regulations governing data protection, privacy, and usage. This invention presents a method for using attributional metadata to effectively audit the lineage of consumer data through multiple phases of transformation.
The features, aspects, and advantages of the exemplary embodiments are understood when the following Detailed Description is read with reference to the accompanying drawings, wherein:
The exemplary embodiments will now be described more fully hereinafter with reference to the accompanying drawings. The exemplary embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. These embodiments are provided so that this disclosure will be thorough and complete and will fully convey the exemplary embodiments to those of ordinary skill in the art. Moreover, all statements herein reciting embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future (i.e., any elements developed that perform the same function, regardless of structure).
Thus, for example, it will be appreciated by those of ordinary skill in the art that the diagrams, schematics, illustrations, and the like represent conceptual views or processes illustrating the exemplary embodiments. The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing associated software. Those of ordinary skill in the art further understand that the exemplary hardware, software, processes, methods, and/or operating systems described herein are for illustrative purposes and, thus, are not intended to be limited to any particular named manufacturer.
As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first device could be termed a second device, and, similarly, a second device could be termed a first device without departing from the teachings of the disclosure.
For the exemplary embodiments to enable data flow through various datasets by way of workflows illustrated in
When the data presented to the system is tabular, that is, represented as rows and columns, a splitting (one to many) and merging (many to one) phenomenon can occur at the cell and row level, in addition to the dataset level. While exemplary embodiments address non-tabular data as well as tabular data, because the vast majority of consumer data processed today is tabular, and because tabular data presents additional challenges that exemplary embodiments address, the remainder of this disclosure will focus on tabular data.
In both cases just described, the transformation changes the cardinality of the data, and this presents a challenge for simple tracking methods. For example, a simple method to track data is to assign a row identifier, stored in a field on each row of a table, and then carry that row identifier through the various transformations into other tables. This works only if the cardinality of all tables remains the same, that is, if each row in Table A has one and only one counterpart row in Table B after transformation. While this is true of very simple transformations, it is often not true of more complex transformations.
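The limitation of row identifiers can be illustrated with a minimal sketch. The table contents, the field names (row_id, customer, amount), and the two transformation functions are hypothetical and chosen only for illustration; they do not come from the disclosure.

```python
from collections import defaultdict

# Hypothetical input table: each row carries its own row identifier.
table_a = [
    {"row_id": 1, "customer": "alice", "amount": 10},
    {"row_id": 2, "customer": "alice", "amount": 15},
    {"row_id": 3, "customer": "bob", "amount": 7},
]


def one_to_one(rows):
    """Cardinality-preserving transformation: the row identifier carries over."""
    return [{"row_id": r["row_id"], "amount_cents": r["amount"] * 100} for r in rows]


def many_to_one(rows):
    """Aggregation: several input rows merge into a single output row."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    # No single input row_id describes an output row, so none can be carried.
    return [{"customer": c, "total": t} for c, t in totals.items()]


table_b = one_to_one(table_a)   # every output row still has exactly one row_id
table_c = many_to_one(table_a)  # the row for "alice" derives from row_id 1 and 2
```

In table_c the output row for "alice" has two parent rows, so a single carried row identifier can no longer express where the row came from.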
Exemplary embodiments address such challenges by relying upon two abstractions recognized by the data consolidation platform: “workflow instances” and “dataset instances”. A dataset instance represents a specific chunk of data whose existence emanated from the execution of a data transformation task. A workflow instance is an entity whose existence results from the compiling and dispatching of instructions to accomplish a data transformation task. A workflow instance is an entity composed of dynamic variables resolved at run-time in conjunction with metadata values copied from the schema entities illustrated in
The consolidation platform tracks the details of dispatched workflow instances by assigning a globally unique value (known as the “transformation workflow instance identifier”) to each workflow instance entity. The consolidation platform also tracks the details of dataset instances by assigning a globally unique value (known as the “cohort dataset instance identifier”) to each dataset instance entity. A dataset instance (that is, a chunk of data) is a cohort by virtue of its lifecycle emanating from the same workflow instance. The physical data resulting from the execution of the workflow instance includes its assigned cohort dataset instance identifier. This identifier value is physically stored in a column named dataset_instance_id within the target table (that is, the dataset) affected by the workflow instance. Within this model, one or more rows in an input table, each with its corresponding dataset instance identifier, can be transformed into a new cohort of rows in an output table, and all rows resulting from the transformation are assigned the new cohort dataset instance identifier value. As such, exemplary embodiments can track sets of input rows through transformations into a set of output rows, thus addressing the more complex transformations which change the cardinality of the data as illustrated in
It should be noted that this technique does not provide for lineage tracking of instances of individual cells or rows. For example, it does not provide the ability to trace a specific cell of data through transformations to another specific cell of data. While this is theoretically possible, exemplary embodiments do not address it because of the additional processing and storage costs that would be required. Exemplary embodiments provide for lineage tracking of tables and of cohorts of rows within the tables. Exemplary embodiments also provide for capturing a snapshot of the transformation logic at the time of execution. Foundational to exemplary embodiments is the assertion that these three capabilities (table lineage, cohort lineage, and a snapshot of the transformation logic) are necessary and sufficient to audit the lineage of consumer data through multiple phases of transformation, and that they do so with minimal overhead.
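The cohort-level tracking described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the platform's implementation: the LineageCatalog class, its run and upstream methods, and the sample data are all hypothetical. Only the dataset_instance_id column name and the idea of one globally unique identifier per workflow instance and per dataset instance come from the text.

```python
import uuid
from collections import defaultdict


class LineageCatalog:
    """Hypothetical in-memory stand-in for the metadata schema."""

    def __init__(self):
        self.workflow_instances = {}  # workflow instance id -> audit record
        self.produced_by = {}         # dataset instance id -> workflow instance id

    def run(self, logic_snapshot, input_rows, transform):
        """Execute one workflow instance and stamp its output cohort."""
        workflow_id = str(uuid.uuid4())  # transformation workflow instance identifier
        dataset_id = str(uuid.uuid4())   # cohort dataset instance identifier
        input_ids = sorted({r["dataset_instance_id"] for r in input_rows})
        # Every output row receives the same new cohort identifier,
        # regardless of how many input rows it was derived from.
        output_rows = [dict(r, dataset_instance_id=dataset_id)
                       for r in transform(input_rows)]
        self.workflow_instances[workflow_id] = {
            "inputs": input_ids,
            "output": dataset_id,
            "logic_snapshot": logic_snapshot,
        }
        self.produced_by[dataset_id] = workflow_id
        return dataset_id, output_rows

    def upstream(self, dataset_id):
        """Answer 'where did this data come from?' at cohort granularity."""
        seen, stack = set(), [dataset_id]
        while stack:
            workflow_id = self.produced_by.get(stack.pop())
            if workflow_id is None:
                continue  # identifier unknown to this catalog
            for parent in self.workflow_instances[workflow_id]["inputs"]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def total_per_customer(rows):
    """Many-to-one transformation used only for this example."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    return [{"customer": c, "total": t} for c, t in totals.items()]


catalog = LineageCatalog()
src_a, rows_a = catalog.run("ingest source A", [], lambda _: [
    {"customer": "alice", "amount": 10}, {"customer": "alice", "amount": 15}])
src_b, rows_b = catalog.run("ingest source B", [], lambda _: [
    {"customer": "bob", "amount": 7}])
merged_id, merged_rows = catalog.run(
    "sum amount by customer", rows_a + rows_b, total_per_customer)
```

Here catalog.upstream(merged_id) returns the two ingest cohorts even though the aggregation changed the row count. Consistent with the limitation stated above, the sketch cannot say which input row produced which output cell.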
It is common for transformation logic defined in metadata schema entities (
In order to facilitate the capture of transformation logic, exemplary embodiments express the workflow lineage, and the transformation logic within the workflows, in a declarative language. All datasets, workflows, and the relationships between them are expressed in this declarative language, which is pre-compiled at runtime to resolve dynamic and temporal variables. The output of the pre-compiler is fully resolved and describes the lineage and the transformations with enough precision to support governance and compliance requirements. The pre-compiler output is captured as a snapshot, associated with the transformation workflow instance identifier, and stored permanently in the metadata schema. An example of the pre-compiler output is shown in
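Separately from the figure referenced above, which is not reproduced here, the pre-compile and snapshot steps can be sketched in a few lines. Everything in this sketch is an assumption made for illustration: the declaration fields, the ${run_date} placeholder syntax (Python's string.Template), the workflow instance identifier "wf-0001", and the in-memory snapshots dictionary standing in for the metadata schema. The disclosure does not specify the declarative language or its storage format.

```python
import json
from datetime import date
from string import Template

# Hypothetical declarative workflow definition with a temporal variable.
declaration = {
    "workflow": "daily_purchase_rollup",
    "source": "purchases_${run_date}",
    "target": "purchase_totals",
    "logic": "SUM(amount) GROUP BY customer WHERE purchase_date = '${run_date}'",
}

# Hypothetical snapshot store, keyed by transformation workflow instance identifier.
snapshots = {}


def precompile(declaration, variables):
    """Resolve every ${...} placeholder; raises KeyError if one is undefined."""
    return {key: Template(value).substitute(variables)
            for key, value in declaration.items()}


def dispatch(workflow_instance_id, declaration, variables):
    """Pre-compile the declaration and keep the resolved form as a snapshot."""
    resolved = precompile(declaration, variables)
    # Serialize the fully resolved form so later edits to the declaration
    # cannot change what the audit record says was executed.
    snapshots[workflow_instance_id] = json.dumps(resolved, sort_keys=True)
    return resolved


resolved = dispatch("wf-0001", declaration,
                    {"run_date": date(2016, 12, 6).isoformat()})
```

Because the snapshot holds the resolved text rather than the template, an auditor can see which source table and date filter a given workflow instance actually used.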
Exemplary embodiments may be utilized in any operating environment. For example, the server 100 storing the electronic database 120 may perform row transformation, consolidate workflows, and generate snapshots of declarative transformation logic (as above explained). The algorithm 110 instructs the processor 108 to perform operations via a network interface to the communications network 104. Information may be received as packets of data according to a packet protocol (such as any of the Internet Protocols). The packets of data contain bits or bytes of data describing the contents, or payload, of a message. A header of each packet of data may contain routing information identifying an origination address and/or a destination address. The algorithm 110, for example, may instruct the processor 108 to inspect packetized information for network addresses (e.g., IP address), cellular identifiers (e.g., telephone number, MSISDN), and/or any other data contained within header or payload.
Exemplary embodiments may be applied regardless of networking environment. Exemplary embodiments may be easily adapted to stationary or mobile devices having cellular, WI-FI®, near field, and/or BLUETOOTH® capability. Exemplary embodiments may be applied to mobile devices utilizing any portion of the electromagnetic spectrum and any signaling standard (such as the IEEE 802 family of standards, GSM/CDMA/TDMA or any cellular standard, and/or the ISM band). Exemplary embodiments, however, may be applied to any processor-controlled device operating in the radio-frequency domain and/or the Internet Protocol (IP) domain. Exemplary embodiments may be applied to any processor-controlled device utilizing a distributed computing network, such as the Internet (sometimes alternatively known as the “World Wide Web”), an intranet, a local-area network (LAN), and/or a wide-area network (WAN). Exemplary embodiments may be applied to any processor-controlled device utilizing power line technologies, in which signals are communicated via electrical wiring. Indeed, exemplary embodiments may be applied regardless of physical componentry, physical configuration, or communications standard(s).
Exemplary embodiments may utilize any processing component, configuration, or system. Any processor could be multiple processors, which could include distributed processors or parallel processors in a single machine or multiple machines. The processor can be used in supporting a virtual processing environment. The processor could include a state machine, an application specific integrated circuit (ASIC), or a programmable gate array (PGA), including a field PGA. When any of the processors execute instructions to perform “operations”, this could include the processor performing the operations directly and/or facilitating, directing, or cooperating with another device or component to perform the operations.
This application claims domestic benefit of U.S. Provisional Application 62/430,379 filed Dec. 6, 2016 and incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62430379 | Dec 2016 | US