As used in Master Data Management (MDM) and Data Quality Management (DQM), a “golden record” is a representation of a real-world entity. In a specific implementation, a “golden record” has multiple views of any object depending on a viewer's account and survivorship rules associated therewith. It is understood that changing golden records in a datastore is an O(n), or linear, process. Big O notation, or asymptotic notation, is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Asymptotic notation characterizes functions according to their growth rates. In a big data context, it would normally be necessary to shut down a system to integrate a new data set (e.g., a third-party data set) into an existing one.
A unique architecture enables efficient modelling of entities, relationships, and interactions that typically form the basis of a business. These models enable insights, scalability, and management not previously available in the prior art. It will be appreciated that with the information model discussed herein, there is no need to consider tables, foreign keys, or any of the low-level physicality of how the data is stored.
An information model may be utilized as a part of a multi-tenant platform. In a specific implementation, a configuration sits in a layer on top of the RELTIO™ platform and natively enjoys capabilities provided by the platform such as matching, merging, cleansing, standardization, workflow, and so on. Entities established in a tenant may be associated with custom and/or standard interactions of the platform. The ability to hold and link three kinds of data (i.e., entities, relationships, and interactions) in the platform and leverage the confluence of them in one place provides the power to model and understand a business.
In various embodiments, the metadata configuration is based on an n-layer model. One example is a 3-layer model (e.g., which is the default arrangement). In some embodiments, each layer is represented by a JSON file (although it will be appreciated that many different file structures may be utilized, such as BSON or YAML).
The information models may be utilized as a part of a connected, multi-tenant system.
In various embodiments, the platform 102 is multi-domain and enables seamless integration of many types of data and from many sources to create master profiles of any data entity—person, organization, product, location. Users can create master profiles for consumers, B2B customers, products, assets, sites, and connect them to see the complete picture.
The platform 102 may enable an API-first approach to data integration and orchestration. Users (e.g., tenants) can use APIs and various application-specific connectors to ease integration. Additionally, in some embodiments, users can stream data to analytics or data science platforms for immediate insights.
Along with the built-in data loader, event streaming capabilities, data APIs, and partner connectors, the integration hub system 202 enables rapid links to user systems using the platform 102. The integration hub system 202 may enable users to build automated workflows to get data to and from the platform 102 with any number of SaaS applications in just hours or days. Faster integration enables faster access to unified, trusted data to drive real-time business operations.
The L3 layer 302 typically inherits from the L2 layer 304 (an industry-focused layer), which in turn inherits from the L1 layer 306 (an industry-agnostic layer). Usually, the L3 layer 302 refers to an L2 layer 304 container and inherits all data items (or “objects”) from the L2 layer 304 container. However, it is not required that the L3 layer 302 refer to the L2 layer 304 container; it can stand alone.
The L2 layer 304 may inherit the objects from the L1 layer. Whereas there is only a single L1 306 set of objects, the objects at the L2 layer 304 may be grouped into industry-specific containers. Like the L1 layer 306, the containers at the L2 layer 304 may be controlled by product management and may not be accessible by customers.
Life sciences is a good example of an L2 layer 304 container. The L2 layer 304 container may inherit the Organization entity type (discussed further herein) from the L1 layer 306 and extend it to the Health Care Organization (HCO) type needed in life sciences. As such, the HCO type enjoys all of the attribution and other properties of the Organization type but defines additional attributes and properties needed by an HCO.
The L1 layer 306 may contain entities such as Party (an abstract type) and Location. In some embodiments, the L1 layer 306 contains a fundamental relationship type called HasAddress that links the Party type to the Location type. The L1 layer 306 also extends the Party type to Organization and Individual (both are non-abstract types).
There may be only one L1 layer 306, and its role is to define industry-agnostic objects that can be inherited and utilized by industry specific layers that sit at the L2 layer 304. This enables enhancement of the objects in the L1 layer 306, potentially affecting all customers. For example, if an additional attribute was added into the HasAddress relationship type, it typically would be available for immediate use by any customer of the platform.
Any object can be defined in any layer. It is the consolidated configuration resulting from the inheritance between the three layers that is commonly referred to as the tenant configuration or metadata configuration. In a specific implementation, metadata configuration consolidates simple, nested, and reference attributes from all the related layers. Values described in a higher layer override the values from the lower layers. The number of layers does not affect the inheritance.
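By way of illustration only, the consolidation of layers into a tenant configuration can be sketched as a recursive merge in which a higher layer overrides a lower one. The dictionary shapes, the `consolidate` helper, and the sample labels and attributes below are assumptions made for this sketch and do not represent the platform's actual configuration format.

```python
from copy import deepcopy


def _merge_into(target, source):
    """Recursively copy source into target; values in source win on conflict."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)
        else:
            target[key] = deepcopy(value)


def consolidate(*layers):
    """Merge layers given from lowest (e.g., L1) to highest (e.g., L3)."""
    result = {}
    for layer in layers:
        _merge_into(result, layer)
    return result


# Hypothetical layer contents (illustrative only).
l1 = {"Organization": {"label": "Organization",
                       "attributes": {"Name": {"type": "String"}}}}
l2 = {"HCO": {"extends": "Organization",
              "attributes": {"Specialty": {"type": "String"}}}}
l3 = {"Organization": {"label": "Company"}}

tenant_config = consolidate(l1, l2, l3)
print(tenant_config["Organization"]["label"])
```

Because the merge simply folds over however many layers are supplied, the same override behavior applies to two layers, three layers, or more.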
Often, entity types can materialize in single instances, such as the “Alyssa” example above. In another example, the L1 layer may define the abstract “Party” entity type with a small collection of attributes. The L1 layer may then be configured to define the “Individual” entity type and the “Organization” entity type, both of which inherit from “Party,” both of which are non-abstract, and both of which add attributes specific to their type and business function. Continuing with the concept of inheritance, the HCP entity type may be defined in the L2 Life Sciences container (to represent physicians); it inherits from the “Individual” type but also defines a small collection of attributes unique to the HCP concept. Thus, there is an entity taxonomy of “Party,” “Individual,” and “HCP,” and the resulting HCP entity type provides the developer and user with the aggregate attribution of “Party,” “Individual,” and “HCP.”
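As a minimal sketch of the taxonomy just described, the aggregate attribution of a type can be computed by walking its inheritance chain from the root type down. The attribute names and the `ENTITY_TYPES` structure are hypothetical and are chosen only to make the example concrete.

```python
# Hypothetical entity type definitions; attribute names are illustrative only.
ENTITY_TYPES = {
    "Party": {"extends": None, "abstract": True, "attributes": ["Name"]},
    "Individual": {"extends": "Party", "abstract": False,
                   "attributes": ["FirstName", "LastName"]},
    "Organization": {"extends": "Party", "abstract": False,
                     "attributes": ["LegalName"]},
    "HCP": {"extends": "Individual", "abstract": False,
            "attributes": ["Specialty"]},
}


def taxonomy(type_name):
    """Return the inheritance chain from the root type down to type_name."""
    chain = []
    current = type_name
    while current is not None:
        chain.append(current)
        current = ENTITY_TYPES[current]["extends"]
    return list(reversed(chain))


def aggregate_attributes(type_name):
    """Collect attributes contributed by every type in the chain."""
    attributes = []
    for name in taxonomy(type_name):
        attributes.extend(ENTITY_TYPES[name]["attributes"])
    return attributes


print(taxonomy("HCP"))
print(aggregate_attributes("HCP"))
```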
Once the entity types are defined, the user can link entities together in a data model by defining relationships between them using the relationship type. For example, a user can post a relationship independently to link two entities together, or the client can mention a relationship in a JSON, which then posts the relationship and the two entities all at once.
A relationship type 404 describes the links or connections between two specific entities (e.g., entities 406 and 408). A relationship type 404 and the entities 406 and 408 described together form a graph. Some common relationship types are Organization to Organization, Subsidiary Of, Partner Of, Individual to Individual, Parent of/Child Of, Reports To, Individual to Organization/Organization to Individual, Affiliated With, Employee Of/Contractor Of.
The platform 102 may enable the user to define metadata properties and attributes for relationship types. The user can define any number of metadata properties. The user can also define several attributes for a relationship type, such as name, description, direction (undirected, directed, bi-directional), start and end entities, and more. Attributes of one relationship type can inherit attributes from other relationship types.
Hierarchies may be defined through the definition of relationship subtypes. For example, if a user defines “Family” as a relationship type, the user can define “Parent” as a subtype. One hierarchy contains one or many relationship types; all the entities connected by these relationships form a hierarchy. For example, if Entity A has a HasChild relationship to Entity B, and Entity B has a HasChild relationship to Entity C, then A, B, and C form a hierarchy. In the same hierarchy, the user can add Subsidiary as a relationship, and if Entity D is a subsidiary of Entity C, then A, B, C, and D all become part of a single hierarchy.
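The A, B, C, and D example can be sketched as a graph traversal over only those relationships whose type belongs to the hierarchy. The tuple representation of a relationship and the `SubsidiaryOf` type name are assumptions made for this sketch.

```python
from collections import defaultdict


def hierarchy_members(relationships, hierarchy_types, start):
    """Return every entity reachable from start via the hierarchy's relationship types."""
    adjacency = defaultdict(set)
    for source, rel_type, target in relationships:
        if rel_type in hierarchy_types:
            # Membership ignores direction: both ends belong to the hierarchy.
            adjacency[source].add(target)
            adjacency[target].add(source)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


relationships = [
    ("A", "HasChild", "B"),
    ("B", "HasChild", "C"),
    ("D", "SubsidiaryOf", "C"),
]

print(sorted(hierarchy_members(relationships, {"HasChild"}, "A")))
print(sorted(hierarchy_members(relationships, {"HasChild", "SubsidiaryOf"}, "A")))
```

Adding the second relationship type to the hierarchy is what pulls Entity D into the same set as A, B, and C.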
Interactions 410 are lightweight objects that represent any kind of interaction or transaction. As a broad term, interaction 410 stands for an event that occurs at a particular moment such as a retail purchase or a measurement. It can also represent a fact in a period of time such as a sales figure for the month of June.
Interactions 410 may have multiple actors (entities), and can have varying record lengths, columns, and formats. The data model may be defined using attribute types. As a result, the user can build a logical data model rather than relying on physical tables and foreign keys; define entities, relationships, and interactions in granular detail; make detailed data available to content and interaction designers; provide business users with rich, yet streamlined, search and navigation experiences.
In various embodiments, four manifestations of the attribute type include Simple, Nested, Reference, and Analytic. The simple attribute type represents a single characteristic of an entity, relationship, or interaction. The nested, reference and analytic attribute types represent combinations or collections of simple sub-attribute types.
The nested attribute type is used to create collections of simple attributes. For example, a phone number is a nested attribute. The sub-attributes of a phone number typically include Number, Type, Area code, Extension. In the example of a phone number, the sub-attributes are only meaningful when held together as a collection. When posted as a nested attribute, the entire collection represents a single instance, or value, of the nested attribute. Posts of additional collections are also valid and serve to accumulate additional nested attributes within the entity, relationship or interaction data type.
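One possible way to picture a nested attribute is as an immutable collection of sub-attributes, where each post adds a further value to the owning object. The class names and field names below are hypothetical and follow the phone number example only loosely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PhoneValue:
    """One value of a nested Phone attribute; sub-attributes travel together."""
    number: str
    type: str
    area_code: str = ""
    extension: str = ""


@dataclass
class EntityRecord:
    entity_type: str
    phones: List[PhoneValue] = field(default_factory=list)

    def post_phone(self, value: PhoneValue) -> None:
        # Each post accumulates another nested value rather than overwriting.
        self.phones.append(value)


contact = EntityRecord("Individual")
contact.post_phone(PhoneValue(number="5550100", type="Mobile", area_code="415"))
contact.post_phone(PhoneValue(number="5550199", type="Work", area_code="415", extension="12"))
print(len(contact.phones))
```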
The reference attribute type facilitates easy definition of relationships between entity types in a data model.
A user may utilize the reference attribute type when they need one entity to make use of the attributes of another entity without natively defining the attributes of both. For example, the L1 layer in the information model defines a relationship that links an Organization and an Individual using the affiliatedwith relationship type. The affiliatedwith relationship type defines the Organization entity type to be a reference attribute of the Individual entity type. This approach to data modeling enables easier navigation between entities and easier refined search.
Easier navigation between entities: In the example of the Organization and Individual entities that are related using the affiliatedwith relationship type, specifying an attribute of previous employer for the Individual entity type enables this attribute to be presented as a hyperlink on the individual's profile facet. From there, the user can navigate easily to the individual's previous employer.
Easily refined search: When attributes of a referenced entity and relationship type are available to be indexed as though they were native to the referencing entity, business users can more easily refine search queries. For example, in a search of a data set that contains 100 John Smith records, entering John Smith in the search box will return 100 John Smith records. Adding Acme to the search criteria will return only those records with John Smith that have a reference, and thus an attribute, that contains the word Acme.
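The John Smith and Acme example can be sketched as follows, under the assumption that the searchable text of an entity is built from its own attribute values plus those of the entities it references. The record layout, identifiers, and the second organization name are invented for illustration.

```python
def indexed_text(entity_id, entities):
    """Searchable text: native values plus values of referenced entities."""
    entity = entities[entity_id]
    values = list(entity["attributes"].values())
    for ref_id in entity["references"]:
        values.extend(entities[ref_id]["attributes"].values())
    return " ".join(values).lower()


def search(query, entities, entity_type="Individual"):
    """Return IDs of entities of entity_type whose indexed text contains every term."""
    terms = query.lower().split()
    return [
        entity_id
        for entity_id, entity in entities.items()
        if entity["type"] == entity_type
        and all(term in indexed_text(entity_id, entities) for term in terms)
    ]


# Hypothetical data set: 100 John Smith records, three affiliated with Acme.
entities = {
    "org-acme": {"type": "Organization", "attributes": {"Name": "Acme Corp"}, "references": []},
    "org-other": {"type": "Organization", "attributes": {"Name": "Globex Inc"}, "references": []},
}
for i in range(100):
    entities[f"ind-{i}"] = {
        "type": "Individual",
        "attributes": {"FirstName": "John", "LastName": "Smith"},
        "references": ["org-acme"] if i < 3 else ["org-other"],
    }

print(len(search("John Smith", entities)))
print(len(search("John Smith Acme", entities)))
```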
The analytic attribute type is lightweight. In various embodiments, it is not managed in the same way that other attributes are managed when records come together during a merge operation. The analytic attribute type may be used to receive and hold values delivered by an analytics solution.
The user may utilize the analytic attribute type when they want to make a value from their analytics solution, such as Reltio Insights, available to a business user or to other applications using the Reltio REST API. For example, if an analytics implementation calculates a customer's lifetime value and that value needs to be available to a business user while they are looking at the customer's profile, the user may define an analytic attribute to hold this value and provide instructions to deliver the result of the calculation to this attribute.
In a specific implementation, the platform 102 assigns entity IDs (EIDs) to each item of data that enters the platform. As such, the platform can appropriately be characterized as including an EID assignment engine. Importantly, a lineage-persistent relational database management system (RDBMS) retains the EIDs for each piece of data, even if the data is merged and/or assigned a new EID. As such, the platform can appropriately be characterized as including a legacy EID retention engine, which has the task of ensuring when new EIDs are assigned, legacy EIDs are retained in a legacy EID datastore. The legacy EID retention engine can at least conceptually be divided into a legacy EID survivorship subengine responsible for retaining all EIDs that are not promoted to primary EID as legacy EIDs and a lineage EID promotion subengine responsible for promoting an EID of a first data item merged with a second data item to primary EID of the merged data item. An engine responsible for changing data items, including merging and unmerging (previously merged) data items can be characterized as a data item update engine. Cross-tenant durability also becomes possible when legacy EIDs are retained. In a specific implementation, a cross-tenant durable EID lineage-persistent RDBMS has an n-Layer architecture, such as a 3-Layer architecture.
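A minimal sketch of EID assignment, promotion, and legacy retention is given below. It assumes an in-memory registry and a simple "E<number>" identifier format, neither of which is stated in this paper, and it omits unmerging for brevity.

```python
class EidRegistry:
    """Hypothetical in-memory stand-in for the EID engines described above."""

    def __init__(self):
        self._next = 1
        self.primary_of = {}  # every EID ever issued -> current primary EID
        self.legacy = {}      # primary EID -> set of retained legacy EIDs

    def assign(self):
        """EID assignment: issue a new EID that is its own primary."""
        eid = f"E{self._next}"
        self._next += 1
        self.primary_of[eid] = eid
        self.legacy[eid] = set()
        return eid

    def resolve(self, eid):
        """Look up the current primary EID for any EID, including legacy ones."""
        return self.primary_of[eid]

    def merge(self, winner, loser):
        """Promote the winner's EID to primary; retain the loser's EIDs as legacy."""
        winner, loser = self.resolve(winner), self.resolve(loser)
        if winner == loser:
            return winner
        retired = self.legacy.pop(loser) | {loser}
        self.legacy[winner] |= retired
        for eid in retired:
            self.primary_of[eid] = winner
        return winner


registry = EidRegistry()
a, b, c = registry.assign(), registry.assign(), registry.assign()
registry.merge(a, b)   # a becomes primary; b is retained as legacy
registry.merge(c, a)   # c becomes primary; a and b are retained as legacy
print(registry.resolve(b), sorted(registry.legacy[c]))
```

Because every EID ever issued remains resolvable, a caller holding an old identifier can still reach the merged data item.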
Data may come from multiple sources. The process of receiving data items can be referred to as “onboarding” and, as such, the platform 102 can be characterized as including a new dataset onboarding engine. Each data source is registered and, in a specific implementation, all data that is ultimately loaded into a tenant will be associated with a data source. If no source is specified when creating a data item (or “object”), the source may have a default value. As such, the platform can be characterized as including an object registration engine that registers data items in association with their source.
A crosswalk can represent a data provider or a non-data provider. Data providers supply attribute values for an object, and the attributes are associated with the crosswalk. Non-data providers are associated with an overall entity (or relationship); they may be used to link an L1 (or L2) object with an object in another system. Crosswalks do not necessarily just apply to the entity level; each supplied attribute can be associated with data provider crosswalks. Crosswalks are analogous to the Primary Key or Unique Identifier in the RDBMS industry.
The engines and datastores of the platform 102 can be connected using a computer-readable medium (CRM). A CRM is intended to represent a computer system or network of computer systems. A “computer system,” as used herein, may include or be implemented as a specific purpose computer system for carrying out the functionalities described in this paper. In general, a computer system will include a processor, memory, non-volatile storage, and an interface. A typical computer system will usually include at least a processor, memory, and a device (e.g., a bus) coupling the memory to the processor. The processor can be, for example, a general-purpose central processing unit (CPU), such as a microprocessor, or a special-purpose processor, such as a microcontroller.
Memory of a computer system includes, by way of example but not limitation, random access memory (RAM), such as dynamic RAM (DRAM) and static RAM (SRAM). The memory can be local, remote, or distributed. Non-volatile storage is often a magnetic floppy or hard disk, a magnetic-optical disk, an optical disk, a read-only memory (ROM), such as a CD-ROM, EPROM, or EEPROM, a magnetic or optical card, or another form of storage for large amounts of data. During execution of software, some of this data is often written, by a direct memory access process, into memory by way of a bus coupled to non-volatile storage. Non-volatile storage can be local, remote, or distributed, but is optional because systems can be created with all applicable data available in memory.
Software in a computer system is typically stored in non-volatile storage. Indeed, for large programs, it may not even be possible to store the entire program in memory. For software to run, if necessary, it is moved to a computer-readable location appropriate for processing, and for illustrative purposes in this paper, that location is referred to as memory. Even when software is moved to memory for execution, a processor will typically make use of hardware registers to store values associated with the software, and a local cache that, ideally, serves to speed up execution. As used herein, a software program is assumed to be stored at an applicable known or convenient location (from non-volatile storage to hardware registers) when the software program is referred to as “implemented in a computer-readable storage medium.” A processor is considered “configured to execute a program” when at least one value associated with the program is stored in a register readable by the processor.
In one example of operation, a computer system can be controlled by operating system software, which is a software program that includes a file management system, such as a disk operating system. One example of operating system software with associated file management system software is the family of operating systems known as Windows from Microsoft Corporation of Redmond, Wash., and their associated file management systems. Another example of operating system software with its associated file management system software is the Linux operating system and its associated file management system. The file management system is typically stored in the non-volatile storage and causes the processor to execute the various acts required by the operating system to input and output data and to store data in the memory, including storing files on the non-volatile storage.
The bus of a computer system can couple a processor to an interface. Interfaces facilitate the coupling of devices and computer systems. Interfaces can be for input and/or output (I/O) devices, modems, or networks. I/O devices can include, by way of example but not limitation, a keyboard, a mouse or other pointing device, disk drives, printers, a scanner, and other I/O devices, including a display device. Display devices can include, by way of example but not limitation, a cathode ray tube (CRT), liquid crystal display (LCD), or some other applicable known or convenient display device. Modems can include, by way of example but not limitation, an analog modem, an ISDN modem, a cable modem, and other modems. Network interfaces can include, by way of example but not limitation, a token ring interface, a satellite transmission interface (e.g., “direct PC”), or other network interface for coupling a first computer system to a second computer system. An interface can be considered part of a device or computer system.
Computer systems can be compatible with or implemented as part of or through a cloud-based computing system. As used in this paper, a cloud-based computing system is a system that provides virtualized computing resources, software and/or information to client devices. The computing resources, software and/or information can be virtualized by maintaining centralized services and resources that the edge devices can access over a communication interface, such as a network. “Cloud” may be a marketing term and for the purposes of this paper can include any of the networks described herein. The cloud-based computing system can involve a subscription for services or use a utility pricing model. Users can access the protocols of the cloud-based computing system through a web browser or other container application located on their client device.
A computer system can be implemented as an engine, as part of an engine, or through multiple engines. As used in this paper, an engine includes at least two components: 1) a dedicated or shared processor or a portion thereof; 2) hardware, firmware, and/or software modules executed by the processor. A portion of one or more processors can include some portion of hardware less than all of the hardware comprising any given one or more processors, such as a subset of registers, the portion of the processor dedicated to one or more threads of a multi-threaded processor, a time slice during which the processor is wholly or partially dedicated to carrying out part of the engine's functionality, or the like. As such, a first engine and a second engine can have one or more dedicated processors, or a first engine and a second engine can share one or more processors with one another or other engines. Depending upon implementation-specific or other considerations, an engine can be centralized, or its functionality distributed. An engine can include hardware, firmware, or software embodied in a computer-readable medium for execution by the processor. The processor transforms data into new data using implemented data structures and methods, such as is described with reference to the figures in this paper.
The engines described in this paper, or the engines through which the systems and devices described in this paper can be implemented, can be implemented as cloud-based engines. As used in this paper, a cloud-based engine is an engine that can run applications and/or functionalities using a cloud-based computing system. All or portions of the applications and/or functionalities can be distributed across multiple computing devices and need not be restricted to only one computing device. In some embodiments, the cloud-based engines can execute functionalities and/or modules that end users access through a web browser or container application without having the functionalities and/or modules installed locally on the end-users' computing devices.
As used in this paper, datastores are intended to include repositories having any applicable organization of data, including tables, comma-separated values (CSV) files, traditional databases (e.g., SQL), or other applicable known or convenient organizational formats. Datastores can be implemented, for example, as software embodied in a physical computer-readable medium on a general- or specific-purpose machine, in firmware, in hardware, in a combination thereof, or in an applicable known or convenient device or system. Datastore-associated components, such as database interfaces, can be considered “part of” a datastore, part of some other system component, or a combination thereof, though the physical location and other characteristics of datastore-associated components are not critical for an understanding of the techniques described in this paper.
Datastores can include data structures. As used in this paper, a data structure is associated with a way of storing and organizing data in a computer so that it can be used efficiently within a given context. Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by an address, a bit string that can be itself stored in memory and manipulated by the program. Thus, some data structures are based on computing the addresses of data items with arithmetic operations, while other data structures are based on storing addresses of data items within the structure itself. Many data structures use both principles, sometimes combined in non-trivial ways. The implementation of a data structure usually entails writing a set of procedures that create and manipulate instances of that structure. The datastores, described in this paper, can be cloud-based datastores. A cloud based datastore is a datastore that is compatible with cloud-based computing systems and engines.
Assuming a CRM includes a network, the network can be an applicable communications network, such as the Internet or an infrastructure network. The term “Internet” as used in this paper refers to a network of networks that use certain protocols, such as the TCP/IP protocol, and possibly other protocols, such as the hypertext transfer protocol (HTTP) for hypertext markup language (HTML) documents that make up the World Wide Web (“the web”). More generally, a network can include, for example, a wide area network (WAN), metropolitan area network (MAN), campus area network (CAN), or local area network (LAN), but the network could at least theoretically be of an applicable size or characterized in some other fashion (e.g., personal area network (PAN) or home area network (HAN), to name a couple of alternatives). Networks can include enterprise private networks and virtual private networks (collectively, private networks). As the name suggests, private networks are under the control of a single entity. Private networks can include a head office and optional regional offices (collectively, offices). Many offices enable remote users to connect to the private network offices via some other network, such as the Internet.
Matching is a powerful area of functionality and can be leveraged in various ways to support different needs. The classic scenario is that of matching and merging entities (Profiles). Within the architecture discussed herein, relationships that link entities can also and often do match and merge into a single relationship. This may occur automatically and is discussed herein.
Matching can be used on profiles within a tenant to deduplicate them. It can be used externally from the tenant on records in a file to identify records within that file that match to profiles within a tenant. Matching may also be used to match profiles stored within a Data Tenant to those within a tenant.
Unlike other systems, in various embodiments, the architecture is designed to operate in real-time. Before the match and merge processes occur, every profile that is created or updated may be cleansed on-the-fly by the profile-level cleansers. Thus, the 3-step sequence of cleanse, match, and merge may be designed to occur in real-time anytime a profile is created or updated. This behavior makes the platform 102 ideal for real-time operational use within a customer's ecosystem.
Lastly, the survivorship architecture is responsible for creating the classic “golden record”, but in a specific implementation, it is a view, materialized on-the-fly. It is returned to any API call fetching the profile and contains a set of “Operational Values” from the profile, which are selected in real-time based on survivorship rules defined for the entity type.
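A golden record materialized as a view can be sketched as a function that selects one operational value per attribute at read time. The two rules shown (most recent value and source priority), the field names, and the source names are assumptions for this sketch; the actual survivorship rules are whatever is configured for the entity type.

```python
def most_recent(candidates):
    """Hypothetical rule: the most recently updated value survives."""
    return max(candidates, key=lambda c: c["updated"])


def source_priority(order):
    """Hypothetical rule factory: the value from the highest-ranked source survives."""
    rank = {source: i for i, source in enumerate(order)}
    return lambda candidates: min(
        candidates, key=lambda c: rank.get(c["source"], len(rank))
    )


def operational_values(profile, rules):
    """Build the view on the fly; nothing is written back to the profile."""
    view = {}
    for attribute, candidates in profile.items():
        rule = rules.get(attribute, most_recent)
        view[attribute] = rule(candidates)["value"]
    return view


profile = {
    "FirstName": [
        {"value": "Bill", "source": "CRM", "updated": "2021-03-01"},
        {"value": "William", "source": "ERP", "updated": "2023-05-01"},
        {"value": "Billy", "source": "Web", "updated": "2020-07-15"},
    ],
}

default_view = operational_values(profile, {})
crm_first_view = operational_values(
    profile, {"FirstName": source_priority(["CRM", "ERP"])}
)
print(default_view, crm_first_view)
```

Since the view is computed per call, two callers with different rule sets can see different operational values for the same underlying profile, and all contributed values remain stored.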
In various embodiments, matching may operate continuously and in real-time. For example, when a user creates or updates a record in the tenant, the platform cleanses and processes the record to find matches within the existing set of records.
Each entity type (e.g., contact, organization, product) may have its own set of match groups. In some embodiments, each match group holds a single rule along with other properties that dictate the behavior of the rule within that group. Comparison Operators (e.g., Exact, ExactOrNull, and Fuzzy) and attributes may comprise a single rule.
Match tokens may be utilized to help the match engine quickly find candidate match values. A comparison formula within a match rule may be used to adjudicate a candidate match pair and will evaluate to true or false (or a score if matching is based on relevance).
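The division of labor between match tokens and the comparison formula can be sketched as follows. The token scheme (first letter of the last name plus postal code), the field names, and the treatment of an empty value as null are assumptions for illustration; they are not the platform's token or comparator classes.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Hypothetical match token: first letter of last name plus postal code."""
    return {f"{record['last_name'][:1].lower()}|{record['zip']}"}


def candidate_pairs(records):
    """Only records that share a token are ever compared."""
    buckets = defaultdict(list)
    for record_id, record in records.items():
        for token in tokens(record):
            buckets[token].append(record_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def exact(a, b):
    return a == b


def exact_or_null(a, b):
    # Assumed semantics: a missing value on either side does not block a match.
    return not a or not b or a == b


def comparison_formula(a, b):
    """Boolean formula: Exact(last_name) AND ExactOrNull(first_name)."""
    return (exact(a["last_name"].lower(), b["last_name"].lower())
            and exact_or_null(a["first_name"], b["first_name"]))


records = {
    "r1": {"first_name": "John", "last_name": "Smith", "zip": "94105"},
    "r2": {"first_name": "", "last_name": "smith", "zip": "94105"},
    "r3": {"first_name": "John", "last_name": "Smythe", "zip": "94105"},
    "r4": {"first_name": "John", "last_name": "Smith", "zip": "10001"},
}

pairs = candidate_pairs(records)
matches = {p for p in pairs if comparison_formula(records[p[0]], records[p[1]])}
print(sorted(pairs), sorted(matches))
```

In this sketch, r4 is never compared with r1 because the two share no token, which illustrates why the choice of tokens affects both speed and recall.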
In some embodiments, the matching function may do one of three things with a pair of records: Nothing (if the comparison formula determines that there is no match); Issue a directive to merge the pair; Issue a directive to queue the pair for review by a data steward. In some embodiments, the architecture may include the following:
The matchGroups construct is a collection of match groups with rules and operators that are needed for proper matching. If the user needs to enable matching for a specific entity type in a tenant, then the user may include the matchGroups section within the definition of the entity type in the metadata configuration of the tenant. The matchGroups section will contain one or more match groups, each containing a single rule and other elements that support the rule.
Looking at a match group in a JSON editor, the user can easily see the high-level, classic elements within it. The rule may define a Boolean formula (see the and operator that anchors the Boolean formula in this example) for evaluating the similarity of a pair of profiles given to the match group for evaluation. It is also within the rule element that four other very common elements may be held: ignoreInToken (optional), Cleanse (optional), matchTokenClasses (required), and comparatorClasses (required). The remaining elements that are visible (URI, label, and so on), and some not shown in the snapshot, surround the rule and provide additional declarations that affect the behavior of the group and in essence, the rule.
Each match group may be designated to be one of four types: automatic, suspect, <custom>, and relevance_based. The type the user selects may govern whether the user develops a Boolean expression or an arithmetic expression for the comparison rule. The types are described below.
Behavior of the automatic type: With this setting for type, the comparison formula is purely Boolean and if it evaluates to TRUE, the match group will issue a directive of merge which, unless overridden through precedence, will cause the candidate pair to merge.
Behavior of the suspect type: With this setting for type, the comparison formula is purely Boolean and if it evaluates to TRUE, the match group will issue a directive of queue for review which, unless overridden through precedence, will cause the candidate pair to appear in the “Potential Matches View” of the MDM UI.
Behavior of the relevance_based type: Unlike the preceding rules, all of which are based on a Boolean construction of the rule formula, the relevance-based type expects the user to define an arithmetic scoring algorithm. The range of the match score determines whether to merge records automatically or create potential matches.
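The behavior of the automatic, suspect, and relevance_based types can be summarized in a short sketch. The group dictionaries, the scoring function, and the score ranges (0.9 and above to merge, 0.6 up to 0.9 to queue for review) are hypothetical values chosen for the example.

```python
def group_directive(group, a, b):
    """Return 'merge', 'queue_for_review', or None for one match group."""
    kind = group["type"]
    if kind == "automatic":
        return "merge" if group["rule"](a, b) else None
    if kind == "suspect":
        return "queue_for_review" if group["rule"](a, b) else None
    if kind == "relevance_based":
        score = group["score"](a, b)
        for low, high, directive in group["action_thresholds"]:
            if low <= score <= high:
                return directive
        return None
    raise ValueError(f"unsupported match group type: {kind}")


def fraction_equal(a, b):
    """Hypothetical arithmetic score: share of compared fields that are equal."""
    fields = ("first_name", "last_name", "email")
    return sum(a[f] == b[f] for f in fields) / len(fields)


automatic = {"type": "automatic",
             "rule": lambda a, b: a["email"] == b["email"] and a["last_name"] == b["last_name"]}
suspect = {"type": "suspect",
           "rule": lambda a, b: a["first_name"] == b["first_name"] and a["last_name"] == b["last_name"]}
relevance = {"type": "relevance_based", "score": fraction_equal,
             "action_thresholds": [(0.9, 1.0, "merge"), (0.6, 0.9, "queue_for_review")]}

p1 = {"first_name": "John", "last_name": "Smith", "email": "js@example.com"}
p2 = {"first_name": "John", "last_name": "Smith", "email": "john.smith@example.org"}

print(group_directive(automatic, p1, p2),
      group_directive(suspect, p1, p2),
      group_directive(relevance, p1, p2))
```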
If a negativeRule exists in the matchGroups and it evaluates to true, any merge directives from the other rules are demoted to queue for review. Thus, in that circumstance, no automatic merges will occur.
The Scope parameter of a match group defines whether the rule should be used for internal matching, external matching, or both. External matching occurs in a non-invasive manner, and the results of the match job are written to an output file for the user to review. Values for Scope are: ALL (the default setting), where the match group is enabled for internal and external matching; NONE, where matching is disabled for the match group; INTERNAL, where the match group is enabled for matching records within the tenant only; and EXTERNAL, where the match group is enabled only for matching of records from an external file to records within the tenant. In a specific implementation, external matching is supported programmatically via an External Match API and is available through an External Match Application found within a console, such as a RELTIO™ Console.
If set to true, then only the OV of each attribute will be used for tokenization and for comparisons. For example, if the First Name attribute contains “Bill”, “William”, “Billy”, but “William” is the OV, then only “William” will be considered by the cleanse, token, and comparator classes.
The rule is the primary component within the match group. It contains the following key elements each described in detail: IgnoreInToken, Cleanse, matchTokenClasses, comparatorClasses, Comparison formula.
A negative rule allows a user to prevent any other rule from merging records. A match group can have a rule or a negative rule. The negative rule has the same architecture as a rule but has the special behavior that if it evaluates to true, it will demote any directive of merge coming from another match group to queue for review. To be sure, most match groups across most customers' configurations use a rule for most matching goals. But in some situations, it can be advantageous to additionally dedicate one or more match groups to supporting a negative rule for the purpose of stopping a merge based on usually a single condition. And when the condition is met, the negative rule prevents any other rule from merging the records. So in practice, the user might have seven match groups each of which use a rule, while the eighth group uses a negative rule.
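The interplay between automatic, suspect, and negative rules described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `MatchGroup` class, the `resolve_directive` function, and the assumption that a merge directive takes precedence over queue for review are all hypothetical, since the precedence mechanism is not detailed here.

```python
from dataclasses import dataclass

MERGE = "merge"
QUEUE_FOR_REVIEW = "queue_for_review"


@dataclass
class MatchGroup:
    name: str
    type: str               # "automatic" or "suspect"
    matched: bool           # result of the Boolean comparison formula
    negative: bool = False  # True if the group holds a negative rule


def resolve_directive(groups):
    """Return the final directive for one candidate pair, or None."""
    directives = []
    for g in groups:
        if g.negative or not g.matched:
            continue  # negative rules never issue a directive themselves
        directives.append(MERGE if g.type == "automatic" else QUEUE_FOR_REVIEW)
    if not directives:
        return None
    negative_fired = any(g.negative and g.matched for g in groups)
    # Assumed precedence: merge wins unless a negative rule demotes it.
    if MERGE in directives and not negative_fired:
        return MERGE
    return QUEUE_FOR_REVIEW
```

With one automatic and one suspect group both evaluating to true, the sketch returns a merge directive; adding a negative rule that evaluates to true demotes the result to queue for review.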
The platform 102 may include a mechanism to proactively monitor match rules in tenants across all environments. In some embodiments, after data is loaded into the tenant, the proactive monitoring system inspects every rule in the tenant over a period of time and the findings are recorded. Based on the percentage of entities failing the inspections, the proactive monitoring system detects and bypasses match rules that might cause performance issues, and the client may be notified. The bypassed match rules will not participate in the matching process.
In various embodiments, the user receives a notification when the proactive monitoring system detects a match rule that needs review. ScoreStandalone and scoreIncremental elements may be used to calculate a Match Score for a profile that is designated as a potential match and can assist a data steward when reviewing potential matches.
Relevance-based matching is designed primarily as a replacement of the strategy that uses automatic and suspect rule types. With Relevance-based matching, the client may create a scoring algorithm of the user's own design. The advantage is that in most cases, a strategy based on Relevance-based matching can reduce the complexity and overall number of rules. The reason for this is that the two directives of merge and queue for review which normally require separate rules (automatic and suspect respectively) can often be represented by a single Relevance-Based rule.
In step 604, match rules are created. Using Relevance-based matching, the client could create a match rule that contains a collection of attributes to test as a group.
In step 606, weights may be assigned to attributes to govern their relative importance in the rule. Weights can be set from 0.0 to 1.0. If the client does not explicitly set a weight for an attribute, it may receive a default weight of 1.0 during execution of the rule. For example, the user may start with all weights equal to 1.0 and with actionThresholds of 0.0-0.5 for queue_for_review and 0.5-1.0 for auto_merge, then perform some trial runs and examine the results. If too many obvious matches are being set to queue_for_review, then weights may be adjusted and the actionThresholds modified (e.g., to perhaps 0.0-0.7, and 0.7-1.0). The user may iterate and experiment until the results are optimized for the data set.
In step 608, score comparison of entities is performed. In step 610, the relevance_based match rules use the match token classes in the same way as they are used in suspect and automatic match rules. However, the comparison of the two entities works differently. Every comparator class provides a relevance value while comparing values. The relevance is in the range of 0 to 1. For example, BasicStringComparator returns 0 if two values are different and returns 1 if two values are identical. Fractional values can be a result of DistinctWordsComparator or other comparators. Every attribute is assigned a weight according to the importance of the attribute. If the weight is not assigned explicitly, then it is equal to 1 for simple attributes, or to the maximum of the weights of the sub-nested attributes for nested or reference attributes. If an attribute has multiple values, then the maximum value of relevance is selected.
In various embodiments, the following information describes participants of the formulae: RelevanceScoreAND is the relevance score of the AND operand (the relevance score of the match rule); Nsimple is the number of simple attributes (e.g., FirstName, LastName) participating in the AND operator directly; weighti is the configured weight of the i-th simple attribute; relevancei is the calculated relevance of the i-th simple attribute; Nnest is the number of nested and reference attributes (e.g., Phone-no, Email-ID, Address) participating in the AND operator directly; weightj is the configured weight of the j-th nested or reference attribute; relevancej is the calculated relevance of the j-th nested/reference attribute; Nlogical is the number of logical operands (for example, AND or OR) participating in the AND operator directly; relevancek is the calculated relevance of the k-th logical operand (the weight of a logical operand is fixed to 1). RelevanceScoreOR=max(relevance1, . . . , relevancei, . . . , relevanceN), where relevancei is the relevance of a simple attribute, nested attribute, or logical operand participating in the OR operand directly. RelevanceScoreNOT=1−RelevanceScoreAND,OR,exact, . . . (The relevance score of the NOT operand is equal to 1 minus the relevance score of the operand having this negation.)
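The AND formula itself is not reproduced in this text. One reconstruction that is consistent with the variable definitions above, and with the weighted-average arithmetic of the BasicStringComparator example in this description, is shown below together with the OR and NOT expressions as stated. The AND expression should be read as an assumed form rather than the authoritative formula.

```latex
\mathrm{RelevanceScore}_{AND} =
\frac{\sum_{i=1}^{N_{simple}} weight_i \cdot relevance_i
      + \sum_{j=1}^{N_{nest}} weight_j \cdot relevance_j
      + \sum_{k=1}^{N_{logical}} relevance_k}
     {\sum_{i=1}^{N_{simple}} weight_i
      + \sum_{j=1}^{N_{nest}} weight_j
      + N_{logical}}

\mathrm{RelevanceScore}_{OR} = \max(relevance_1, \ldots, relevance_N)

\mathrm{RelevanceScore}_{NOT} = 1 - \mathrm{RelevanceScore}_{operand}
```

The term N_logical appears in the denominator because each logical operand carries a fixed weight of 1.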
BasicStringComparator provides the relevance values and the score is calculated as follows: true for First Name; true for LastName; false for Suffix. The score is calculated as (1*1+1*1+0*1)/(1+1+1)≈0.66. With a score of 0.66 the directive for this pair will be set to queue_for_review.
The example below shows the use of the verifyMatches API when using Relevance-based matching. Noteworthy items are: relevance values appear for every attribute comparison and for the entire rule; the match action name is shown if the relevance is within the corresponding threshold range, and is null if it is not within any actionThreshold range; and the matched field will be true if the relevance is within any actionThreshold range.
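The arithmetic in this example can be reproduced with a short recursive scorer. The sketch below assumes a weighted average for AND (with logical operands weighted 1), the maximum for OR, and one minus the operand for NOT; the class and function names are hypothetical, and nested or reference attributes are represented by the same `Attr` class as simple ones for brevity.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Attr:
    relevance: float     # comparator output in [0, 1]
    weight: float = 1.0  # default weight when none is configured


@dataclass
class And:
    operands: List["Node"]


@dataclass
class Or:
    operands: List["Node"]


@dataclass
class Not:
    operand: "Node"


Node = Union[Attr, And, Or, Not]


def score(node: Node) -> float:
    """Compute the relevance score of a rule expression tree."""
    if isinstance(node, Attr):
        return node.relevance
    if isinstance(node, Or):
        return max(score(op) for op in node.operands)
    if isinstance(node, Not):
        return 1.0 - score(node.operand)
    # AND: weighted average; logical operands carry a fixed weight of 1.
    numerator = denominator = 0.0
    for op in node.operands:
        weight = op.weight if isinstance(op, Attr) else 1.0
        numerator += weight * score(op)
        denominator += weight
    return numerator / denominator if denominator else 0.0


# First Name matches, Last Name matches, Suffix differs; all weights are 1.
rule = And([Attr(1.0), Attr(1.0), Attr(0.0)])
print(score(rule))
```

The printed value is two thirds, which corresponds to the 0.66 reported in the example.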
In the match group configuration, the user may define Weights and actionThresholds. The weight property allows the client to assign a relative weight (strength) for each attribute. For example, the user may decide that Middle Name is less reliable and thus less important than First Name.
The actionThreshold allows the client to define a range of scores to drive a directive. For example, the user might decide that the match group should merge the profile pair if the score is between 0.9 and 1.0, but should queue the pair for review if the score falls into a lower range of 0.6 to 0.9.
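The mapping from a relevance score to a directive can be sketched as a simple range lookup. The ranges below are the ones from the example above; the `directive_for` function and the choice to let the first listed range win at a shared boundary (such as exactly 0.9) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (action, lower bound, upper bound), using the ranges from the example in the text.
ACTION_THRESHOLDS: List[Tuple[str, float, float]] = [
    ("auto_merge", 0.9, 1.0),
    ("queue_for_review", 0.6, 0.9),
]


def directive_for(score: float, thresholds=ACTION_THRESHOLDS) -> Optional[str]:
    """Return the action whose range contains the score, or None."""
    for action, low, high in thresholds:
        if low <= score <= high:
            return action  # assumed: first listed range wins at a boundary
    return None
```

Under these ranges a score of 0.95 yields `auto_merge`, a score of 0.66 yields `queue_for_review`, and a score of 0.3 yields no directive.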
The user can configure a relevance-based match rule with multiple action thresholds having the same action type but with a different relevance score range.
In the above example, the type is potential match for two different action thresholds. The user can differentiate such thresholds by assigning appropriate labels. The user can generate potential matches with different labels based on the range of the relevance score, which allows the user to differentiate between higher and lower relevance score matches. The user can resolve matches quickly based on the label. In the example above, based on the relevance score, some potential matches can be considered for merging directly while others must be reviewed before any action is taken. The results of the API to get potential matches and the external match API will contain a relevance value and a matchActionLabel corresponding to each action type configured under the actionThreshold parameter. For more information, see Potential Matches API and External Match API.
Using operators like equals and notEquals prevents tokenization from generating tokens. These operators should not have an impact on tokenization, if we want to compare and conclude that even though address and/or email and/or phone are different, the remaining attributes match enough to take the score above the threshold.
In some embodiments, the following options are available for equals, notEquals, and in constraints: 1) strict (Boolean value with default=true): allows the constraint to be skipped before the match tokens and relevance score are computed; 2) weight (decimal with default=0.0): allows the constraint to participate in the relevance score calculation. (The two options and their default values ensure backward compatibility.)
An example of a formula to calculate relevance score is:
The formulae have the following variables: Roperand—the relevance score of an operand (for example: exact, exactOrNull, exactOrAllNull, fuzzy, etc.); Rconstraint—the relevance score calculated for a constraint (for example: equals, notEquals, in); Woperand—configured weight for an operand; Wconstraint—configured weight for a constraint.
In at least some organizations, profiles are maintained across systems and there are instances where multiple records of the same profile exist. There may be inconsistencies in each record. In such cases, it would be beneficial to merge these records and maintain one record with the complete information. There are also instances where two profiles are related to each other.
There are certain match pairs that the user can configure such that the system can automatically take action on those. Other match pairs that require manual review are resolved using the Potential Match screen. Match rules and Match IQ (discussed herein) may be utilized to determine if two records are a match, not a match, or a potential match.
The user can also use the Match Score to decide if a profile is a potential match. Based on predefined match rules, each potential match is given a Match Score; the higher the score, the higher the probability that the record is a match for the profile. In some embodiments, the Match Score of a potential match will have a value of more than 0 only if the standalone and incremental scores are configured for the match rules.
There may be instances when certain profiles, in spite of being a potential match, are excluded from the profile view due to these match rules. In such cases, the user can manually search by entering the search criteria in the “Search” field and include these profiles as potential matches.
The user may have the option of viewing the Potential Matches perspective in the classic mode or the new mode.
In various embodiments, Match IQ uses machine learning (ML) to simplify and accelerate the data matching process. With Match IQ, business users can create a model for matching records by selecting the entity type and related attributes, with little or no IT help. They can then train the ML model through an active learning process by reviewing pairs of records and indicating which are a match and which are not. As users confirm the matches, machine learning adjusts the matching model and presents additional record pairs to further refine the model.
After a sufficient number of representative record pairs have been matched or not matched, the user can download and review the match results. A downloaded file may show a sample set of match results and a relevance score for each record pair. The higher the relevance score, the more likely the records match. If needed, the user can retrain the model by answering more questions or even creating an alternate model to compare the matching results.
After the results are satisfactory, the data steward or other user with approval authority can review, approve and publish the model to use with internal and/or external data. The user also provides publishing settings based upon the relevance score range—for example, to define that match pairs with a relevance score of 0.8 to 1 should be matched and merged.
The end-to-end process, driven and performed by business users, typically takes only a day or two to complete and produces the quality matches customers require. In some embodiments, Match IQ uses machine learning technology to help ensure unified and reliable data across virtually unlimited data sources. The ML matching model, created with active learning using resolutions of suspected matched pairs, can be effectively applied to future match pairs. This provides a consistent way for business users and data stewards to match and merge data for increased quality, reliability, and business value.
Once a matching model is trained, no user interaction is required but the model can be retrained if needed. Because match and merge operations are performed using these models and calculated relevance scores, the process is rapid, consistent, and reliable. As the business grows or changes, the models can easily be adjusted to accommodate additional data sources. This enables matching and merging at the scale and speed of business.
The streamlined matching process, which does not require IT specialists or coding, enables customers to get up and running faster and with less effort. Typically, they can progress from initial subscription to completing their match-and-merge operations in a matter of days. Compare this to the weeks or months required by more traditional approaches. This same process is used to perform matching for new data sources as they are added, providing additional time savings and increased productivity.
No definition of matching requirements is needed; instead, users select matched pairs and machine learning creates the models. This greatly reduces the possibility of matching requirements not being correctly identified that might generate incorrect matches or miss valid matches. In addition, because machine learning creates and adjusts the matching model without configuration by IT specialists, coding errors are a thing of the past. This not only reduces errors in the match-and-merge process, but it also saves significant time as it creates a repeatable process. Customers have an option to use both Match IQ and traditional rule-based matching together if needed.
With all the time saved by using Match IQ, those involved (data owners, data stewards, IT, and other business users) will find they have more time available for work that adds value to the business. They can use their time to focus on creating better user experiences, data improvement initiatives, or streamlining other processes.
In step 704, the model is trained. When the user trains a model, the user identifies records as matches or non-matches (e.g., by answering a series of questions). After the completion of the Preparing Data stage, the model moves under the Training lane. At this stage, the model is ready for training. There can be variations where records are neither close to matches nor non-matches. Such records then become the input to the training process where the user may be prompted with questions seeking confirmation on whether a particular pair is a match or not.
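The selection of ambiguous pairs for user review can be illustrated with uncertainty sampling, a common active-learning strategy. The text does not state which selection strategy is used, so the sketch below is an assumption; the function name and the 0.5 midpoint are hypothetical.

```python
def pick_pairs_for_review(scored_pairs, batch_size=3):
    """Return the ids of the pairs the model is least certain about.

    scored_pairs is a list of (pair_id, probability_of_match) tuples.
    Pairs whose probability is closest to 0.5 are neither clear matches
    nor clear non-matches, so they are presented to the user first.
    """
    by_uncertainty = sorted(scored_pairs, key=lambda pair: abs(pair[1] - 0.5))
    return [pair_id for pair_id, _ in by_uncertainty[:batch_size]]
```

For example, given pairs scored 0.98, 0.52, 0.10, 0.45, and 0.70, a batch of two would contain the pairs scored 0.52 and 0.45. The user's answers would then be fed back as labeled training data.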
A machine learning methodology may be utilized. For example, a neural network may be utilized for training. Alternately, as other examples, gradient boosted decision trees or random forests may be utilized.
In step 706, results are curated. In various embodiments, the graphical user interface may display details related to the model and results may be displayed (e.g., downloaded). Matches may be run and reviewed by the user to curate the results for further training and model improvement.
In step 708, the user may publish the model. The user may choose to publish the model for internal and external matching. In some embodiments, the user may select external or internal.
For example, if the user selects external, the model may be used to match data from an external file with the data in the tenant. If the user selects internal, the model may be used to match the data within the tenant along with the match rules configured for the tenant.
In various embodiments, the user may define a custom action and a corresponding relevance score range. This allows the user to execute custom actions for relevance scores that are received for relevance-based rules. If a match pair falls within the defined range, then the custom action is executed. In a specific implementation, the relevance score range the user specifies for one action cannot overlap with the relevance score range of another custom action.
In various embodiments, survivorship and merging are separate concepts and processes. Again, think of an entity as a container of crosswalks and their associated attributes and values. A merged entity may be an aggregation of crosswalks from two or more entities. The additional crosswalks continue to bring their own attributes and values with them. If the acquiring (winning) entity already has the same attribute URI that the incoming entity is bringing, then the values from the attributes will accumulate within the attribute, yet the integrity of which crosswalk each value within the attribute came from is maintained for several purposes including the need to return the attribute and its values to the original entity it came from if an unmerge is requested. If the acquiring entity does not already have the same attribute URI that the incoming entity is bringing, then the new attribute URI becomes established within the entity.
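The container view of an entity described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Entity` class, the `merge` and `unmerge` functions, and the crosswalk identifiers are hypothetical, and an attribute value is reduced to a (crosswalk, value) pair.

```python
from collections import defaultdict


class Entity:
    def __init__(self, entity_id):
        self.entity_id = entity_id
        # attribute URI -> list of (crosswalk_id, value) pairs
        self.attributes = defaultdict(list)

    def add_value(self, attribute_uri, crosswalk_id, value):
        self.attributes[attribute_uri].append((crosswalk_id, value))


def merge(winner, incoming):
    """Accumulate the incoming values, keeping track of their crosswalks."""
    for uri, entries in incoming.attributes.items():
        winner.attributes[uri].extend(entries)  # new URIs are created on demand
    return winner


def unmerge(entity, crosswalk_ids, restored_id):
    """Move the values contributed by the given crosswalks into a new entity."""
    restored = Entity(restored_id)
    for uri in list(entity.attributes):
        kept = []
        for crosswalk_id, value in entity.attributes[uri]:
            if crosswalk_id in crosswalk_ids:
                restored.add_value(uri, crosswalk_id, value)
            else:
                kept.append((crosswalk_id, value))
        if kept:
            entity.attributes[uri] = kept
        else:
            del entity.attributes[uri]
    return restored
```

Because every value stays tagged with its crosswalk, the unmerge step needs no history of the merge itself; it only partitions values by origin.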
In some embodiments, unlike other MDM systems, survivorship is a separate process that doesn't occur during the merge. It is a process that executes in real-time when the entity is being retrieved during an API call. Survivorship may not depend on how the crosswalks and attributes came into the consolidated profile nor the order that they arrived. Survivorship processes each attribute according to the attribute's defined survivorship rule, and produces an Operational Value (OV) for the attribute on the fly. Depending on the type of survivorship rule selected, there could be one or more OVs for an attribute. For example, the user might choose the aggregation rule for the address attribute for the purpose of returning all addresses a person is related to. Conversely the user might choose the frequency rule for “first name” to return the one name that occurs most frequently in the “first name” attribute. Note also that the role of the username making the API call also factors into the survivorship rule used. This feature allows one survivorship rule for an attribute to be stored with one username role, while another survivorship rule for the same attribute is stored with another username role. A fetch of the entity by each username role might return different OVs.
When configuring the survivorship rules for the attributes of an entity type, the user can do this largely from the UI, but there are some advanced survivorship strategies that may be defined through metadata configuration.
In step 806, in the Sources view while editing the survivorship for each attribute, the user can instantly see the effect on the screen in step 808, which may guide the user. After the user makes a rule adjustment, the entity is fetched again using the new version of the rule, and so the user sees the effect instantaneously.
In the Contact Last Name example, the survivorship rule is Recency, so the selected value is the one that was provided most recently. The value that is selected by the survivorship rule and that may be displayed by default in the UI and in an API request is called the Operational Value or OV, displayed on the left on the screen. This may be depicted in
In various embodiments, the OV is not stored or persisted anywhere, instead it is evaluated when the data is accessed. If the survivorship rule is changed, then the new survivorship rule may automatically supply the new OV when the data is retrieved.
This may be a differentiator from traditional MDM systems: Firstly a single attribute can have multiple values, either from different sources or even from the same source; secondly there may not be a persisted “golden record,” instead there may be a set of attribute level survivorship rules that are evaluated at run time to select the operational value from the available attribute values.
A set of survivorship rules can be grouped into a “Ruleset”, which can be tied to a user role. In this way the OVs can differ according to a user's role, so someone in the Finance department will see OVs from the Finance system, whereas someone in the Sales department will see OVs from the CRM system. In some embodiments, survivorship rules: are set individually for each attribute; are evaluated dynamically at run time when data is retrieved; have a variety of rule types e.g. recency, source system, aggregation, frequency; and can be set for a user role (e.g., a person in sales can have one set of OVs and a person in customer support can have a different set of OVs).
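The idea of role-specific rulesets evaluated at read time can be sketched as follows. The data, role names, and helper functions are hypothetical; the point of the sketch is only that no OV is stored and that the same attribute values yield different OVs depending on the ruleset tied to the caller's role.

```python
# Stored attribute values; no OV is persisted alongside them.
VALUES = {
    "FirstName": [
        {"value": "Bill", "source": "CRM", "updated": "2021-03-01"},
        {"value": "William", "source": "Finance", "updated": "2020-01-15"},
    ]
}


def most_recent(entries):
    """Recency-style rule: ISO date strings compare chronologically."""
    return [max(entries, key=lambda e: e["updated"])["value"]]


def from_source(source):
    """Build a rule that returns the values contributed by one source."""
    def rule(entries):
        return [e["value"] for e in entries if e["source"] == source]
    return rule


# One ruleset per role; each ruleset maps an attribute to its rule.
RULESETS = {
    "finance": {"FirstName": from_source("Finance")},
    "sales": {"FirstName": from_source("CRM")},
    "default": {"FirstName": most_recent},
}


def fetch_ov(attribute, role):
    """Compute the OV on the fly using the ruleset for the caller's role."""
    ruleset = RULESETS.get(role, RULESETS["default"])
    return ruleset[attribute](VALUES[attribute])
```

Here a finance user would receive "William" and a sales user "Bill" for the same entity. Replacing an entry in `RULESETS` changes the result of the very next fetch, since nothing is precomputed.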
To be sure, any declarations of survivorship performed via the UI may be written to the metadata configuration and the user may observe their JSON construction via a JSON editor.
Recency (Last Update Date, also known as LUD) Rule is an example survivorship rule. This rule selects the value within the attribute that was posted most recently. It might seem that the rule need only compare the LastUpdateDate of the crosswalks that contribute to the attribute, find the most recently updated crosswalk, and use the value that comes from that crosswalk as the Operational Value (OV). But the real process may be a bit more complex. There are three timestamps associated with an attribute value that play a role in determining the effective LastUpdateDate for the attribute value. They are: (1) Crosswalk Update Date, which is updated at the crosswalk level and reflects the best information about when the source record was most recently updated; (2) Crosswalk Source Publish Date, which is also updated at the crosswalk level but is entirely under the user's control, and is an optional field the user can write to capture the business publish date of the data (e.g., for a quarterly data file, the user might post the value of Mar. 31, 2020 into this field); and (3) Single Attribute Update Date, which is an internally managed timestamp associated with an actual value in the attribute's array of values and is updated separately from the crosswalk.updateDate if the value experiences a partial override operation, in which case it will be more recent than the crosswalk.
The Recency rule may calculate the effective timestamp of an attribute value to be the most recent of the three values discussed above: sourcePublishDate, SingleAttrUpdateDates, LastUpdateDate. Once it calculates that for each value in the attribute, it returns the most recent attribute value(s) as the OV of the attribute.
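A minimal sketch of this calculation is shown below. The dictionary keys are hypothetical names for the three timestamps, and the sketch assumes that at least the crosswalk update date is always present.

```python
from datetime import datetime


def effective_timestamp(value):
    """Return the most recent of the three timestamps that are present."""
    candidates = [
        value.get("crosswalk_update_date"),
        value.get("source_publish_date"),
        value.get("single_attr_update_date"),
    ]
    return max(ts for ts in candidates if ts is not None)


def recency_ov(values):
    """Return the value(s) with the newest effective timestamp."""
    newest = max(effective_timestamp(v) for v in values)
    return [v["value"] for v in values if effective_timestamp(v) == newest]


values = [
    {"value": "Smith", "crosswalk_update_date": datetime(2020, 1, 10)},
    {
        "value": "Smyth",
        "crosswalk_update_date": datetime(2019, 5, 1),
        "source_publish_date": datetime(2020, 3, 31),
    },
]
print(recency_ov(values))
```

In this data the second crosswalk was updated earlier, but its source publish date is later, so "Smyth" is selected.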
Another example survivorship rule is the Source System Rule. This rule allows the user to organize a set of sources in order of priority, as a source for the OV. The user may use the gear icon to arrange the sources. The gear icon in the UI will appear when the user chooses the Source System rule. Using this rule, the survivorship logic will test each source in order (starting at the top of the list). If the source tested has contributed a value into the attribute, then that value will be the OV of the attribute. If it has not, then the logic will try the next source in the list. This cycle will continue until a value from a source has been found or the logic has exhausted the list. If there are multiple crosswalks from the same source, then the OV will be sourced from the most recent crosswalk.
Another example survivorship rule is the Frequency Rule. This rule calculates the OV as the value within the attribute that is contributed by the greatest number of crosswalks.
Another example survivorship rule is the Aggregation Rule. If an attribute has more than one value and Aggregation is chosen for the survivorship rule, then all unique values held within the attribute are returned as the OV of the attribute. This is easy to see in the UI.
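The priority walk described for the Source System rule can be sketched as follows. The function name, the dictionary keys, and the source names are hypothetical.

```python
def source_system_ov(values, source_order):
    """Return the value from the highest-priority source that contributed one."""
    for source in source_order:
        from_source = [v for v in values if v["source"] == source]
        if from_source:
            # Several crosswalks from the same source: take the most recent.
            latest = max(from_source, key=lambda v: v["updated"])
            return latest["value"]
    return None  # the list was exhausted without finding a value


values = [
    {"source": "CRM", "value": "Bill", "updated": "2020-01-01"},
    {"source": "CRM", "value": "Billy", "updated": "2021-06-01"},
    {"source": "ERP", "value": "William", "updated": "2022-02-02"},
]
print(source_system_ov(values, ["Billing", "CRM", "ERP"]))
```

The first source in the list contributed nothing, so the logic falls through to CRM and returns "Billy", the value from the more recent of the two CRM crosswalks.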
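A sketch of the Frequency rule using a counter is shown below. Returning every value tied for the highest count is an assumption, since tie handling is not described here; the function name is hypothetical.

```python
from collections import Counter


def frequency_ov(values):
    """Return the value(s) contributed by the greatest number of crosswalks.

    values is a list of (crosswalk_id, value) pairs, one per contribution.
    """
    counts = Counter(value for _, value in values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, n in counts.items() if n == top]
```

For contributions of "Mike", "Michael", and "Mike" from three crosswalks, the sketch returns only "Mike".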
Another example survivorship rule is the OldestValue Rule. The Oldest Value strategy finds the crosswalk with the oldest create date. All values within the attribute that were provided by this crosswalk are selected as the OV. Other attributes are not affected.
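The OldestValue strategy can be sketched in the same style. The dictionary keys and function name are hypothetical, and each entry is assumed to carry the create date of the crosswalk that provided it.

```python
def oldest_value_ov(values):
    """Return all values provided by the crosswalk with the oldest create date."""
    if not values:
        return []
    oldest = min(values, key=lambda v: v["created"])["crosswalk"]
    return [v["value"] for v in values if v["crosswalk"] == oldest]


phones = [
    {"crosswalk": "crm:1", "created": "2018-04-01", "value": "555-0100"},
    {"crosswalk": "crm:1", "created": "2018-04-01", "value": "555-0101"},
    {"crosswalk": "erp:9", "created": "2021-09-15", "value": "555-0199"},
]
print(oldest_value_ov(phones))
```

Both values from the older crosswalk are selected, while the value from the newer crosswalk is not.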
Another example survivorship rule is the MinValue Rule. This rule selects the minimum value held in the attribute. The minimum value is defined as follows for different data types: for Numeric, MinValue is the smallest numeric value; for Date, MinValue is the minimum timestamp value; for Boolean, False is the MinValue; for String, MinValue is based on the lexicographical sort order of the strings.
Another example survivorship rule is the MaxValue Rule. This rule selects the maximum value held in the attribute. The maximum value is defined as follows for different data types: for Numeric, MaxValue is the largest numeric value; for Date, MaxValue is the maximum timestamp value; for Boolean, True is the MaxValue; for String, MaxValue is based on the lexicographical sort order of the strings.
Another example survivorship rule is the OtherAttributeWinnerCrosswalk Rule. This rule leverages the crosswalk that was chosen by the outcome of another attribute's survivorship. For example, suppose an entity has a Name attribute and an Address attribute that should be tightly coupled. The user may then want to ensure that the address selected as the OV comes from the same crosswalk that produced the OV of the name.
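The MinValue and MaxValue orderings for the four data types happen to coincide with the default ordering in Python, which makes them easy to illustrate. The function names are hypothetical. One caveat: Python compares strings by code point, so uppercase letters sort before lowercase ones; whether the platform's lexicographical order is case-sensitive is not stated here.

```python
from datetime import date


def min_value_ov(values):
    """Smallest number, earliest date, False before True, first string."""
    return min(values)


def max_value_ov(values):
    """Largest number, latest date, True after False, last string."""
    return max(values)


print(min_value_ov([3, 1.5, 2]))
print(max_value_ov([date(2020, 3, 31), date(2021, 1, 1)]))
print(min_value_ov([True, False]))
print(max_value_ov(["Adams", "Baker", "Clark"]))
```

All values passed to one call are assumed to share a single data type, as they would within one attribute.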
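The coupling between the two attributes can be sketched as follows. The sketch assumes, for illustration only, that the primary attribute (Name) is resolved by recency; the function names and data layout are hypothetical.

```python
def recency_winner(values):
    """Return the entry with the latest update date (assumed primary rule)."""
    return max(values, key=lambda v: v["updated"])


def other_attribute_winner_ov(entity, attribute, primary_attribute):
    """Select values of `attribute` from the crosswalk that won `primary_attribute`."""
    winner_crosswalk = recency_winner(entity[primary_attribute])["crosswalk"]
    return [v["value"] for v in entity[attribute] if v["crosswalk"] == winner_crosswalk]


entity = {
    "Name": [
        {"crosswalk": "crm:1", "value": "Ann Lee", "updated": "2021-05-01"},
        {"crosswalk": "erp:9", "value": "A. Lee", "updated": "2020-02-01"},
    ],
    "Address": [
        {"crosswalk": "erp:9", "value": "1 Old Rd", "updated": "2022-01-01"},
        {"crosswalk": "crm:1", "value": "5 New St", "updated": "2019-01-01"},
    ],
}
print(other_attribute_winner_ov(entity, "Address", "Name"))
```

Although the ERP address is more recent, the address from the CRM crosswalk is selected because that crosswalk produced the winning name.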
The user can define whether pinned/ignored or unpinned/unignored statuses (flags) should survive when two attributes with the same value but with different flags get merged.
Returning to the flowchart in
The Survivorship rules (also known as survivorship strategy or OV rules) define a way to govern which attribute values must be identified as the OV. Survivorship is important to defining the golden record (final state) of any object that a business considers important.
When an entity or relationship is the result of previous merges, it contains the aggregation of attributes and attribute values from the contributing objects. As a result, any attribute, whether it be a simple, nested, or reference, may contain multiple values. For example, after merging with two other entities, the first name attribute of an entity could contain three values: ‘Mike,’ ‘Mikey,’ and ‘Michael.’
Through Advanced Search, you can use the has all option to search for Source System Names for which to add values to attributes for the crosswalk. From the values you specify, the system may choose the best value from these recent values. Furthermore, although multiple values are shown, you have the option to select the configuration to use to not calculate survivorship based on all the system sources but to calculate survivorship only on certain sources. All rules work on the entire set of crosswalks that exist for the record.
If the user does not want all the survivorships to be calculated based on all of the records or all of the crosswalks that exist on any records, then the user may set Survivorship Rules from the Sources View of any entity.
While it is important to store all the contributing values in the attribute for audit purposes, ultimately, the ‘best value’ or set of values for the attribute may be determined so that they can be returned to Hub users and calling applications in a request. These ‘best values’ are called the Operational Values, or winner values, and referred to as the OV of the attribute.
In the Hub, the OV is primarily shown next to the attribute label. The Hub provides an indicator if additional, yet non-OV values exist. The indicator is a blue oval with a + and a number in it. The number indicates how many additional unique values are held within the attribute. Clicking on the oval will navigate the user to the Sources view, where all source crosswalks and all contributed values can be seen for each attribute.
Each attribute can have 0, 1, or multiple values that have been marked as OV. The OV flag is a Boolean property used by the Hub to determine which attribute values must be shown to the user. The OV flag of each attribute value is calculated just-in-time whenever the entity's values are requested by either the Hub or a calling application.
Survivorship strategy is configurable for each entity type. Survivorship strategy can be changed on the fly and will take effect immediately. This ensures that you have the agility to change the rules for calculating the OV flags at any time, and the new definition will affect the very next payload returned from the database. Survivorship rules can be configured via the Hub or via the Configuration API.
Survivorship rules can be set for simple, nested, sub-nested, and referenced attributes. However, survivorship rules cannot be set for sub attributes of referenced attributes because survivorship rules for sub attributes are taken from the referenced entity/relation and cannot be overridden on the sub attribute level. For example, if an address attribute has sub attributes such as AddressLine1, AddressLine2, and City, the survivorship rules for these sub attributes will be determined by the survivorship rules that are set for the Location entity. However, sub attributes can be used as a link in additional fields of strategy (primaryAttributeUri, comparisonAttributeUri).
In a more advanced implementation, a user may use the OtherAttributeWinner crosswalk and advanced strategies behavior for calculating the Operational Value. In some embodiments, the user can identify a source to calculate the Operational Value based on the value of another attribute. The user may define a survivorship rule that gives precedence to certain data sources based on the value of another attribute.
For example, assume a configuration where the relationship type ProductToCountry includes a nested attribute for Language and an attribute for Type. Example Relationship type: ProductToCountry with attributes: Language (Nested), Type (Simple String). The user can apply a survivorship rule that specifies the source used to calculate the OV for the ProductToCountry.Language.Overview attribute based on the value of the Type relationship type attribute.
With the Complex OV rule type, the user can define the survivorship strategy for a nested attribute based conditionally on values of a sub-attribute. This is accomplished using the optional “filter” property for a survivorship group mapping. Thus, a survivorship strategy, which is defined in the “survivorshipStrategy” property of the mapping, will be applied only for attributes which match the filter criteria. In this way, several survivorship strategies can be leveraged to treat different sub-attribute types. The resulting winners for the nested attribute are the aggregation of winners emerging from each strategy.
Advantageously, the techniques described above facilitate dynamic survivorship in an RDBMS. Specifically, objects have dynamic survivorship from on-the-fly changes made to them. The changes can include changes to primary EID when an object is merged with another object, where the primary EID of the object is either retained or replaced by another, the latter of which automatically causes the previous primary EID to be retained as a legacy EID. Moreover, a portion of an object prior to merge with another object retains the legacy EID in association with that portion so that if the merge is ever undone, the legacy EID survives for the subsequently unmerged portion. Dynamic survivorship enables cross-tenant durability, such that changes to aspects of an L1 (e.g., platform-layer) object retain legacy EID at L1 and applicable legacy EID on the tenant where the object is updated, allowing matching of an object at any tenant regardless of EID used at a given tenant.
The engine responsible for ensuring dynamic survivorship can be characterized as a dynamic survivorship engine, which can itself be characterized as comprising an object matching engine, a lineage EID promotion engine, and a legacy EID retention engine.
The present application claims priority to U.S. Provisional Patent Application Ser. No. 63/353,006 filed Jun. 16, 2022, which is incorporated by reference herein.
Number | Date | Country
---|---|---
63353006 | Jun 2022 | US