1. Field of the Invention
Embodiments of the present invention generally relate to database techniques for gathering and organizing data and, more particularly, to a method and apparatus for gathering and organizing data for one or more entities.
2. Description of the Related Art
Large enterprises purchase substantial amounts of information technology (IT) resources, e.g., computers, printers, scanners, and so on. Systems currently exist for gathering information about the IT resources using manual data entry techniques, where an administrator enters information about the IT resources at the moment of purchase or installation. This information may include serial number, purchase date, software installed and information about the software, and so on. A database of such information is useful in monitoring assets, determining expected life of the resources, tracking software license compliance, and the like. However, such a manual system is expensive and time-consuming to operate. Further, such a system cannot account for any device that is installed without the administrator's knowledge. In large corporations having many offices worldwide, the likelihood of the database being incorrect is very high.
Information and insight are at the core of every intelligent business decision. Given the importance of information technology in driving an organization's success, making informed decisions regarding enterprise-wide IT infrastructure and resources is critical. Simply put, an organization must have accurate data regarding the organization's assets in order to make sound business decisions. Not only does the organization need data; it needs data that clearly supports decision-making and promotes an efficient and cost-effective use of resources.
Typical issues with the information gathered about an organization's IT assets include:
Therefore, there is a need for a method and apparatus for gathering information for one or more entities and organizing the information to be analyzed.
A method and apparatus are provided for gathering and organizing data pertaining to an entity by extracting data from a plurality of data sources associated with one or more tenants, organizing the data into connector files having a predefined structure, associating the data in each connector file with a tenant parameter, and storing the connector files in memory.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
Table 1 depicts one embodiment of a table populated by UnionTableAction;
Table 2 depicts an exemplary mapping of a union table to a prototype table;
Table 3 depicts a representative sample of an intermediate table that is created by UnionTableAction; and
Table 4 depicts the relationships of various inputs and outputs used for billing categorization.
One embodiment of the invention is a software platform (referred to as the system) that provides a cohesive view of an organization's information technology (IT) assets and the opportunity to identify infrastructure savings. Other embodiments may comprise gathering and organizing information related to real estate (e.g., home and property sales), network transmission equipment inventories, medical or dental records, or any venture where data records are utilized.
Some embodiments of the invention provide several advantages to assist in the successful management of IT assets:
With near real-time access to accurate information about IT resources, embodiments of the invention can show a discrete asset, who is using the asset, what software applications are present on the asset, when the lease for the asset expires, and so on.
The data from the data sources 106 is coupled through a communications network 108 to the system (e.g., including a server 110). The system 100 may comprise multiple servers and data storage units, although only one server 110 and one data storage unit 112 are shown. The data storage unit 112 may comprise disk arrays, redundant storage, a storage area network system, or any other type of digital data storage system.
The server comprises at least one central processing unit (CPU) 114, support circuits 116, and memory 118. The CPU 114 may be one or more commercially available processors, microprocessors, application specific integrated circuits, microcontrollers, and the like. The support circuits 116 are well known circuits that facilitate functionality of the CPU and comprise clock circuits, I/O circuits, network interface circuits, cache, power supplies, and the like. The memory 118 comprises one or more types of circuit or system for storing digital data and executable programs. Such memory includes, but is not limited to, read only memory, random access memory, disk storage, optical storage, and the like. The memory 118 stores a database 122 and database software 120 (e.g., a structured query language (SQL) database and supporting software). The database 122 stores IT data 121 received from the data sources 102 and metadata 126 used by the database software to perform the functions described below.
In operation, the data 121 from the data sources 102, as described in detail below, is extracted from the sources and organized to create records identifying IT assets of each tenant 104 and various attributes of the IT assets. The database 122 is utilized for storing asset information, manipulating and organizing the information, reporting on the information and so on. The metadata 126, which is either defined or generated by a “configurator” (e.g., a system administrator) facilitates manipulation and organization of the IT data.
More specifically, the database is a relational database containing logical schemas used for different stages of data organization, supporting extract, transform, and load (ETL) processes and data analytics such as:
In a Structured Query Language (SQL) environment, the metadata is used as a configuration tool to generate the SQL-based statements and organize the data into tables to facilitate SQL queries. As such, the embodiments of the invention are referred to herein as being metadata-based. The data and metadata regarding assets are organized as records. The information within records, or the records themselves, can be organized and manipulated as tables.
The system 100 manipulates the information within the database (herein referred to as data integration) to achieve a plurality of goals:
The term “gold record” describes a unique asset record that represents ‘best of’ asset data drawn from multiple data sources. To produce a gold record, embodiments of the invention provide a configurable data Integration Framework that allows an organization to manipulate its data in various ways, including:
Embodiments of the invention are capable of monitoring and analyzing the IT assets of multiple organizations, or divisions within a single organization. These organizations/divisions are herein referred to as tenants. The system 100 gathers and organizes asset information separately for each tenant as described below.
More specifically, the system receives data regarding IT assets from a plurality of sources including, but not limited to, asset management systems, human resources systems, procurement systems, messaging systems, IT asset protection systems (SPAM, virus, and the like), and so on. The data extracted from these sources is then consolidated, reconciled and analyzed in accordance with one embodiment of the invention. The most common external sources include (but are not limited to):
Each external source typically contains significant overlap in the type of data stored. For instance, hardware inventory detail will often be contained in both Asset Management and Agent-based Discovery systems. One embodiment of the invention identifies disparities in the data details that should be consistent between different data sources. Cost savings and increased revenue result from comparing lease/contract and billing systems with physical inventory systems, exposing overpayment and under-billing opportunities. The logic used to identify such opportunities is configured within the Integration Framework.
The Generic Connector defines a standardized approach to load data from external source systems into the system database. In addition and as a further embodiment of the invention, this standardized approach can be performed in a flexible manner using a Dynamic Connector, as described below.
The Generic Connector consists of a set of data format specifications to which the extracted data sources (Tivoli, SMS, etc.) must conform in order to be processed correctly. Each format specification describes a specific entity/concept, such as Asset, Contract, User, etc. The format specification defines the format of a flat file (table) containing the asset information, including placement of data, columns to use, headings to use, and so on.
A combination of scripts, tools, and network structures is used to facilitate the movement and transformation of data from the external source systems into the system; more specifically, a Connector Repository. Data Extraction is the process of reading data from the external source system. The methods used to read the data are largely dependent on the external system's architecture. The extraction process often includes a conversion step to prepare the data for the transformation process.
At step 310, the method 300 queries whether a generic or dynamic connector is to be used. If a generic connector is used, the method 300 proceeds to step 312. Selection of generic or dynamic is typically a predefined configuration parameter related to the absence or presence of a dynamic configuration file.
At step 312, data transformation is performed; data transformation is the process of applying rules and data functions to the extracted data to generate a desired data output format. In this case, the data is stored as a set of flat files that conform to the Generic Connector format, organized by tenant.
At step 322, once the Generic Connector Files have been generated, they are transferred to the Connector Repository, which is a central location that collects all tenant data files to be loaded and used by the system.
The Connector Repository is a storage location where all tenant Generic Connector Files are stored. Once the files have been transferred to this repository they have reached their initial entry point for use by the system. The data stored in the file system-based repository has undergone an initial transformation that conforms to the Generic Connector data model, but primarily contains the original, non-cleansed data, i.e., the data is organized, but not cleansed.
In one embodiment of the invention, the repository is implemented as an FTP server whereby the Generic Connector Files are sent to a shared directory from each of the external sources via an FTP transport mechanism, such as a secure FTP transport mechanism. A script is typically used to copy the Generic Connector Files from the initial directory to an import/staging directory for processing by an Integration Framework. This decoupling allows for file updates to be sent at any time of the day without impacting any processing being performed by the system. The repository is configured to:
To define a generic connector file, the Generic Connector defines a set of data structures that represent the asset-related entity types that can be imported into the system. The Generic Connector approach is entity-centric rather than source system or system type-centric. That is, there is a single asset connector format, to which all systems that contain assets (Asset Management, Agent Discovery, and the like) shall conform. The primary entity types include but are not limited to:
Every Generic Connector table has the following keys:
The following list defines the characteristics of one embodiment of the Generic Connector file format:
The system supports the configuration of multiple directories in which the Generic Connector Files extracts are stored. The system supports multiple Generic Connector Files of the same type, either in the same directory or in multiple directories. All directories are directly accessible by the database server process.
The directories are partitioned to reduce the possibility of file name clashing and to allow for more finely grained access control. The directory names are not interpreted as indicative of the tenant name or source type.
The filename has an indicative prefix, corresponding to the type of data it contains, i.e. CON_ASSET, CON_USER, etc. Further, name clashing must be avoided where multiple sources and/or tenants are providing extracts of the same entity type. For that, there are several conventions that should be employed and strictly enforced by agreement, though none are mandatory since the best strategy is dependent on the particulars of a specific implementation. Possible examples include:
The management of extracted Generic Connector Files (adding, updating, deleting, replacing) is outside of the scope of this document and the system.
The number of files of a specific type need not be configured or known in advance. The system recognizes, imports and processes all files in the configured directories matching the indicative prefixes. The union of all files shall be considered to be the definition of “current state” with respect to the external systems.
Use of a single global date format for all generic connector file extracts is recommended. A parameter is supported in metadata to allow for the configuration of this global date format mask, such as “YYYY-MM-DD HH24:MI:SS”. Additionally, generic connector structures that contain date fields contain a DATE_FORMAT column that allows for the specification of a date format mask that applies only to the single row. In this way, non-default date formats can still be supported, albeit with the requirement of additional work during extraction. In the case where the default date format is to be used, the DATE_FORMAT can be left null.
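As an illustration (the INSTALL_DATE column and the query form are assumed; the default mask and the per-row DATE_FORMAT column come from this description), the per-row mask might be applied during import roughly as follows:

    -- Hypothetical sketch: apply the per-row DATE_FORMAT where present,
    -- otherwise fall back to the configured global date format mask.
    SELECT NATIVE_ASSET_ID,
           TO_DATE(INSTALL_DATE,
                   COALESCE(DATE_FORMAT, 'YYYY-MM-DD HH24:MI:SS')) AS INSTALL_DATE
    FROM   CON_ASSET;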
At step 324, the connector files are accessed within the repository and processed (filtered) as they are loaded, at step 326, into import tables, the structure of which mirrors the structure of the original data source. There may be a small amount of transformation performed in this process, but the use of staging tables (see
If the method 300 selects a dynamic connector at step 310, the method 300 proceeds to step 314. Given the file and structure requirements for the Generic Connector, allowing for flexibility in defining Generic Connector structure formats eases the burden on administrators performing data extractions. In addition, some organizations may not have data for all the Connector columns, also necessitating the ability to configure a more flexible extract format.
At step 314, the data is transformed into a connector file. The Dynamic Connector is a mechanism that allows the definition of more than one format for the CON_* tables and allows data issues to be addressed. In one embodiment, the Dynamic Connector uses a separate XML configuration file paired with the data extract file. The configuration file is generated at step 316 using input from an administrator 320. This configuration file directs the Integration Framework to allow for a customized interpretation of the associated CON_* data file's content; however, there are requirements that must be met to allow for the use of custom data extracts that do not fit the Generic Connector model.
For each con file (CON_*) data extract that is governed by a dynamic connector configuration file, there is a matching configuration xml file that, in one embodiment, is defined as follows:
The connector file and its related configuration file are stored at step 318 in the connector repository.
There are two major sections to the configuration file: (1) file level parameters and (2) column level parameters.
The supported file-level parameters are:
The optional dataType column attribute (only used with the virtual and override types) determines whether or not the virtual value is wrapped in single quotes. Here are the supported dataTypes:
Given a connector file extract named “CON_ASSET.csv”:
The data extract file, CON_ASSET.csv, contains the four required fields (TENANT_CODE, DATA_PARTITION_CODE, SOURCE_CODE, NATIVE_ASSET_ID) plus a simple data format of ASSET_NAME and ASSET_TAG.
The CFG-CON_ASSET.xml file that modifies the connection from the system to the customized six-column CON_ASSET.csv file would resemble the following:
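A minimal sketch of such a file follows; only the <fileParameters> tag, the <column> elements, the dateFormatMask parameter and the six column names are taken from this description, while the root element name and the name attribute are assumed for illustration:

    <!-- Hypothetical sketch of CFG-CON_ASSET.xml; exact markup varies by implementation. -->
    <dynamicConnector file="CON_ASSET.csv">
      <fileParameters>
        <dateFormatMask>YYYY-MM-DD HH24:MI:SS</dateFormatMask>
      </fileParameters>
      <columns>
        <column name="TENANT_CODE"/>
        <column name="DATA_PARTITION_CODE"/>
        <column name="SOURCE_CODE"/>
        <column name="NATIVE_ASSET_ID"/>
        <column name="ASSET_NAME"/>
        <column name="ASSET_TAG"/>
      </columns>
    </dynamicConnector>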
In one embodiment of the invention, file-level parameters are specified using the <fileParameters> tag as shown here:
To only override the dateFormatMask parameter, only that parameter in the <fileParameters> tag need be specified:
If the CON_ASSET.csv file extract has ASSET_TAG first then ASSET_NAME, just change the ordering of the XML <column> elements:
The following are column element configuration examples (virtual, split, ignore, override):
virtual type:
A data extract is provided containing a single data partition for a single tenant, but lacks the required TENANT_CODE or DATA_PARTITION_CODE columns, necessitating the creation of those columns. In this case, the TENANT_CODE column is supplied with a value of “TENANT1” and the DATA_PARTITION_CODE column with a value of “PARTITION1”.
split type:
A CON_ASSET data extract does not have a NATIVE_ASSET_ID populated, but does have a unique ASSET_TAG column. However, this column is already being mapped to ASSET_TAG in import. This would require the creation of a split column for ASSET_TAG going to NATIVE_ASSET_ID, while retaining the existing mapping for the ASSET_TAG column.
ignore type:
A CON_ASSET data extract has extra “junk” columns that do not require importing, necessitating the designation of ignore columns for any columns to be suppressed. These columns are still created at the external table level (since the column offsets must be accounted for), but they are not created in the wrapper view, and are not participants in import.
Note: The “IGNORE” mapping type in IF_TABLE_COLUMN_MAP can be used, but this would still require a record in IF_TABLE_COLUMN_MAP. In addition, if there are other extracts that are attempting to use the Generic Connector (no Dynamic Connector), those extracts would have to include the junk columns in order to line up the column offsets correctly. Using the ignore type eliminates the need for a record of these columns in metadata.
Ignore columns do need to have unique names within the columns in the file. Ignore columns that do not match a record in IF_TABLE_COLUMN_MAP are created with the default parameters of VARCHAR(255) in the external table.
override type:
An extract has a column that is empty or only partially populated and the value in the column should or could be uniform for the whole file. An example could be a metadata column like LOCATION_DATA_SOURCE_TYPE or a column like TOTAL_DISK_UNITS. This would require the creation of override columns, which override the value in an existing external column with a specific value. Note that functionally this is essentially the same as an ignore column-virtual column pair, but can be done with a single type:
externalName attribute:
Documenting the column mapping inside the XML is a good practice and a way to make sure that the mappings are understood, maintainable and kept up-to-date. It may be most useful where the name of the GC column and the external column name are significantly different.
dataType attribute for virtual and override column types:
In a virtual or override column, the target column is a numeric type, not a char type. This might require a data type conversion, such as forcing all TOTAL_DISK values to be “0” in an extract that is missing the column.
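Tying the four column types and the two attributes together, a consolidated hypothetical fragment might look as follows (the element and attribute syntax is assumed; only the type names, the dataType and externalName attributes, and the column names come from this description):

    <!-- Hypothetical column-element sketch; syntax is illustrative only. -->
    <columns>
      <!-- virtual: supply constant values for required columns missing from the extract -->
      <column name="TENANT_CODE" type="virtual" value="TENANT1"/>
      <column name="DATA_PARTITION_CODE" type="virtual" value="PARTITION1"/>
      <!-- split: reuse the extract's ASSET_TAG value to also populate NATIVE_ASSET_ID -->
      <column name="ASSET_TAG"/>
      <column name="NATIVE_ASSET_ID" type="split" externalName="ASSET_TAG"/>
      <!-- ignore: suppress a junk column while keeping the column offsets aligned -->
      <column name="JUNK_COLUMN_1" type="ignore"/>
      <!-- override: force a uniform numeric value for the whole file -->
      <column name="TOTAL_DISK" type="override" value="0" dataType="number"/>
    </columns>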
At step 324, the movement and organization of data within the system schemas to import tables is managed by the system's Integration Framework. The Integration Framework allows for the configuration and performance of various ETL and database functions.
Note: Table names and naming conventions shown here will vary between different system implementations.
In
The data is then transferred into the various asset staging tables (STG_*) 408. At this point, no processing of data has been performed amongst data from the sources; for a single, unique asset or user that is present in one or more source systems there will be one or more records in the staging tables 408.
The asset records reaching target tables 410 are considered to be the “Gold” records (the processed asset information using the ‘best of’ data from multiple data sources), although further data scrubbing is typically performed on the processed data in the target tables 410. At this point, the asset data integration is considered to be complete from the perspective of mapping and transfer functions.
Note: For asset data that does not map to the system schema, extended tables can be created to contain information not accounted for in the system structure. Table extension and view extension are handled as part of the Integration Framework. Both the analytics views and reports will have to be modified to take advantage of any custom fields in any extended assets tables.
Within the integration process, tasks—also referred to here as processes—are configured in the Integration Framework to perform specific manipulations of the warehouse tables and the data contained therein.
In one embodiment, at step 606 the processes performed on the target tables include at least one of:
At step 608, the method 600 maps the fields from the various import tables and transfers the mapped data into the various asset staging tables. As mentioned earlier, at this point in staging, an asset that is present in one or more source systems will have one or more records in the staging tables. At the end of this process, the data has been placed into the various data warehouse target table structures.
At step 608, additional key assignment and association of asset records occurs. Unique assets identified by common fields are assigned key values, and customer-specific logic is applied to the staging tables. At the end of this section of the process, the one or more asset records from one or more sources will be linked by common, system-generated asset IDs.
At step 610, data processing is performed to create a final consolidation of records into the single, reconciled asset record, referred to as the Gold record. Each field in the final reconciled asset record will be fed using a source-prioritization structure that can be configured to vary on a column-by-column basis. At step 612, the Gold asset records are transferred to the target tables.
At step 612, the method 600 allows for setting of column values that are not part of the imported data. Examples include setting statuses and default values. The method 600 ends at step 614.
The data within the target tables, staging tables, and extension tables can be used in reports. Of particular interest are the staging tables, which contain data source identifiers for each tenant, allowing the system to accommodate analytics focused on data reconciliation and conflict issues.
The SYSTEM_META schema 710 is the main controller of the system platform. The tables contained in the SYSTEM_META schema hold data related to various components of the system platform. The tables pertinent to a discussion of the system Integration Framework include the following:
The structures of the DW_1 and DW_2 schemas 706, 707 are duplicates of each other. This is due to their function in the system platform as twin repositories for asset fact data, or asset data that is specific to an individual asset (such as serial numbers, or machine names) and technically should not be found in other asset records. The tables contained in the SYSTEM_DW_1 and DW_2 schemas 706, 707 hold asset data as it moves through the processes of data gathering and organization, as described above.
The function of the SYSTEM_DW 704 is to store dimensional data, or data about an asset that is not unique in nature, such as a machine model or the state/province location of an asset's user. Most of the DW Common tables are prefixed with a ‘D_’ or ‘DW_’. The only exceptions are tables related to remarks and status on various asset management areas.
The SYSTEM_AV schema 702 is a container for various Analytic Views and database synonyms. The creation of this schema was necessary to shield the reporting engine from the schema switch mechanism.
Two goals of this database schema switch mechanism are:
The issue is handled with the creation of two mirrored data warehouses, specifically the SYSTEM_DW_1 and _2 schemas 706, 707 described above. Simply put, one schema is used for reporting purposes while the other is loaded with fresh data by the ETL process. As soon as the ETL process is finished, the two are switched: the loading schema becomes the reporting schema, and vice versa. Views and synonyms contained in the SYSTEM_AV schema are used to implement the switching mechanism.
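A minimal sketch of how the switch could be realized with synonyms, assuming an asset fact table named F_ASSET (the table name is illustrative only):

    -- Reporting currently points at SYSTEM_DW_1 while SYSTEM_DW_2 is being loaded.
    CREATE OR REPLACE SYNONYM SYSTEM_AV.F_ASSET FOR SYSTEM_DW_1.F_ASSET;

    -- After the ETL load of SYSTEM_DW_2 completes, the roles are swapped.
    CREATE OR REPLACE SYNONYM SYSTEM_AV.F_ASSET FOR SYSTEM_DW_2.F_ASSET;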
In the system, certain tasks must be performed in order to successfully integrate asset data from multiple sources. Such processes in the system Integration Framework and related Integration Framework tables include:
In addition, many of the process areas listed above must be configured to occur in a specific order during the system integration synchronization. This is also handled by the Integration Framework, as well as a few other miscellaneous tasks.
In the area of source definition, tables define data sources to the Integration Framework. Supporting metadata tables include the following:
Creation and deletion processes create the data structures used by the remainder of the reconciliation processes: import and staging tables, columns that extend the data model, and custom data entities. Provided functions include:
Supporting metadata tables include the following:
Data mapping and transfer processes perform the heavy lifting of the overall reconciliation process, by transferring data through external and import tables, through staging tables, to the target base and extension tables. Provided functions include:
Supporting metadata tables include the following:
Data transformation processes perform some form of data transformation. Motivations for data transformation include data normalization and conversion. Data transformation can be applied at multiple stages of the standard process. Provided functions include:
Supporting metadata tables include the following:
Error and conflict processes identify faulty or conflicting data and remove such data from tables to prevent it from proceeding further in the integration process. Error records are moved to specific error tables. Additionally, data quality tracking is performed to ensure accurate system operation.
Data Quality Tracking (DQT) provides insight into the integrated data of a system deployment, as well as the operation of the current metadata-driven rules and processes by which the data is integrated. DQT data can be thought of as output “metadata” using the standard definition of the term: data about data. (This is distinct from the configuration metadata that is generated and/or defined by the user as an input into the integration process.) The DQT data describes qualities and attributes of the data as well as the workings and efficacy of the integration process.
DQT is part of the overall consolidation process, in which data from multiple sources are combined; where there are multiple sources of data for the same entity (an asset, a user, etc.), the system allows for prioritized reduction to a single “gold” record, the “best view” of the entity.
DQT tables are structured in a set of star schemas to support dynamic, open-ended growth of data and structures. DQT is organized into categories by the nature of the data, the nature of the operation to determine the data, and the granularity of the data. All DQT tables start at the entity level, where typical entities supported by the system out-of-the-box are Asset, User, Contract, etc. As such, a particular implementation may have a set of DQT tables related to Assets, another set of DQT tables related to Users, etc. Each set can be enabled or disabled in configuration metadata, and within a set different features and functions can be enabled or disabled in configuration metadata.
As an example, the following description uses only the DQT tables for Assets as a case study, but it is important to note that the solution is not restricted to Assets.
The Asset DQT tables are organized by granularity of the data being described. For example, some DQT functions operate on specific, individual fields/columns in a record, while some DQT functions are applicable to the full record/row of data, etc. The granularity organization (with its standard table prefix) is as follows:
Within each table, there may be multiple columns, which may be related to one or more DQT functions. There may or may not be any relationship between the columns, except for the fact that they operate at the same level of granularity.
Column DQT tracking tracks information about each individual field on each individual record. The items that are tracked in a column-wise fashion are:
During the consolidation process, the integration framework determines conflicts by getting the count of distinct non-null values from all sources for a particular column in a particular record. If that count is greater than 1, then this is considered to be a conflict (“Y”); otherwise it is not a conflict (“N”).
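For example, the conflict flag for a single tracked column could be computed along the following lines (a sketch only; the staging table and column names are taken from the examples in this description, and the actual statements are generated from metadata):

    -- One row per consolidated asset: 'Y' when more than one distinct,
    -- non-null SERIAL_NUMBER is reported across the associated sources.
    SELECT ASSET_ID,
           CASE WHEN COUNT(DISTINCT SERIAL_NUMBER) > 1
                THEN 'Y' ELSE 'N' END AS SERIAL_NUMBER_CONFLICT
    FROM   STG_F_ASSET
    WHERE  SERIAL_NUMBER IS NOT NULL
    GROUP  BY ASSET_ID;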
Populated source, populated priority and/or populated weight are determined at the time that the multiple source values are reduced down to a single value during consolidation. At the time that the correct source value is determined, the source ID, prioritization and/or weight are also available, and these values are stored in the related DQT table.
With column-wise tracking, the system can answer questions like the following:
Not only can the aggregate counts be determined, but the system facilitates drilling to the specific records and/or individual fields in question, as well as back to the individual sources that make up this consolidated record.
Row DQT tracking tracks information about individual records at the record level, not at an individual column level. The items that are tracked in a row-wise fashion are:
During association, the primary key of the final target table for the entity (a.k.a. the entity ID, i.e., ASSET_ID for assets) is assigned and linked between different source records, where the association rules determine that these records are referring to the same entity. Thus, at the end of association, the records in the asset staging table will all have an assigned ASSET_ID, and records that have the same ASSET_ID are referring to the same entity, even if the records come from different sources.
The Source count is determined by counting the number of distinct sources for each individual entity ID in the staging table.
Row conflict is determined by counting the number of column conflicts found for a particular record. If this number is greater than 0, then row conflict=“Y”; otherwise row conflict=“N”. Thus, row tracking is performed after and is dependent on column tracking.
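As a sketch (the staging structure is simplified, and DQT_COL_ASSET with an IS_CONFLICT column is an assumed shape for the column-level tracking table), the two row-level items could be derived as follows:

    -- Source count: number of distinct sources contributing to each asset.
    SELECT ASSET_ID,
           COUNT(DISTINCT SOURCE_ID) AS SOURCE_COUNT
    FROM   STG_F_ASSET
    GROUP  BY ASSET_ID;

    -- Row conflict: 'Y' when any tracked column reported a conflict for the asset.
    SELECT ASSET_ID,
           CASE WHEN SUM(CASE WHEN IS_CONFLICT = 'Y' THEN 1 ELSE 0 END) > 0
                THEN 'Y' ELSE 'N' END AS ROW_CONFLICT
    FROM   DQT_COL_ASSET
    GROUP  BY ASSET_ID;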
With row-wise tracking, the system can answer questions like the following:
Source DQT Tracking tracks information about the associated sources for each individual record. The item that is tracked in a source-wise fashion is:
As described previously, after association every record from each source for the entity will have an assigned entity ID in the staging table. Thus, for each entity, the system determines the individual sources in which data for this entity was found by selecting the entity ID and the source ID. The system also actively stores the negative flags (i.e., the source IDs where a record was not found for a particular entity) since this simplifies the reporting of negative data, which can be of interest to a customer.
With source-wise tracking, the system can answer questions like the following:
Source Match DQT tracking tracks information about the relationship between source pairs for each individual record. This data is determined and stored at multiple levels of granularity in the source match DQT table. To understand source match tracking, the concept of Association is described first.
Association uses metadata-driven rules to find records between different sources that are in fact the same entity. Typically, there will be a prioritized set of match rules for linking between sources. For example, for asset association, the metadata may be configured to match using the following columns, in order:
What this means is, the system first tries to find any matches between, say, source 1 and 2 by matching on Asset Name, Asset Tag and Serial Number. Any assets with matches on all three attributes will be assigned the same Asset ID in both sources. Then, the system matches just on Asset Name and Asset Tag, then just on Asset Name and Serial Number, etc. down through the priority list. Note that this will waterfall down through all the sources, though the relationship between each pair of sources can use different association rules, as configured in metadata.
Once a pair has been found and the asset ID is assigned, that record will not be reassigned a different asset ID based upon a later, lower priority rule. So if an asset matches Asset 1 using Asset Name and Asset Tag, it will also be considered Asset 1, even if it also happens to match Asset 2 on Serial Number alone (since Serial Number match is a lower priority match condition.)
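A simplified sketch of how the highest-priority rule might be applied between two sources (the staging table and attribute columns follow the examples in this description; SOURCE_ID and the numeric source identifiers are assumed, and the actual statements are generated from metadata):

    -- Assign the ASSET_ID already assigned in source 1 to unassigned source 2
    -- records that match on all three attributes; lower-priority rules then run
    -- the same pattern against the records that remain unassigned.
    -- MIN() resolves ambiguous multi-matches to a single ID in this sketch.
    UPDATE STG_F_ASSET t
    SET    t.ASSET_ID = (SELECT MIN(s.ASSET_ID)
                         FROM   STG_F_ASSET s
                         WHERE  s.SOURCE_ID     = 1
                         AND    s.ASSET_NAME    = t.ASSET_NAME
                         AND    s.ASSET_TAG     = t.ASSET_TAG
                         AND    s.SERIAL_NUMBER = t.SERIAL_NUMBER)
    WHERE  t.SOURCE_ID = 2
    AND    t.ASSET_ID IS NULL;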
Source Matching DQT Tracking has three major facets to it:
Direct rule matching is, in essence, tracking the results of the association rule matching described above. (Technically, it is re-evaluating each rule independently, for reasons described later.) Thus, direct rule matching will store, for each individual asset, whether there is a match between the sources in question for a particular association rule, i.e. was there a match between source 1 and 2 for Asset ID 1 when matching on Asset Name, Asset Tag and Serial Number? Metadata controls which source pairs are evaluated.
Direct source matching stores the following information:
For Direct Rule Matching, the system reevaluates each linking association rule independently using the same logic as association, and stores the results (both positive and negative) in the Source Match DQT table.
With Direct Rule Matching, the system can answer questions like the following:
Direct source matching is a single record distillation of all of the individual direct rule matches for a particular source pair for a particular entity. That is, if there are any direct rule matches between sources for a particular record, then there is a direct source match, else there is not a direct source match. In other words, direct source match indicates whether there is a direct (source-to-source) match between sources. (There is also the idea of an indirect match, discussed below.)
Additionally, beyond the types of information tracked for Direct Rule Matching, the system also tracks another piece of information:
This is an important distinction between the evaluation of association and the re-evaluation of Source Matching DQT. Source Matching indicates what can match, and association and consolidation indicate what did match. Because of potential ambiguities and/or other issues in incoming data, this can be a very real difference that can be of significant interest to a customer. (For example, a single asset in one source could match ambiguously to multiple records in a different source. Only one will be the match from the association point of view (i.e. Is Associated=Y), but both will have a direct source match. This can indicate a data error or other issue.)
Direct Source Matching Has Match takes the direct rule matching records as an input, and creates a single Y/N flag based upon whether any individual rule had a match. Is Associated takes the Source Tracking (QSRC) data as an input, where Is Associated is set to “Y” where the entity is found in both sources, otherwise “N”.
With Direct Source Matching, the system can answer questions like the following:
Indirect Source Matching is the final piece of the source matching puzzle. Indirect Source Matching determines assets that have been associated (as determined from Source Tracking) but do not have a direct relationship (as determined by Direct Rule/Source Matching above). Indirect Source Matching uses the same flags as direct source matching (Has Match, Is Associated, etc.), but has a different match type indicator, so that it is clear that the record is for indirect matching, not direct source or direct rule matching.
An asset can get indirect matches via a few different scenarios. For example, consider a single asset (“Asset 10”) and three sources for asset data. Presume that Asset 10 in Source 1 has a match to Source 2. This is a direct source match, and whichever rule(s) matched are direct rule matches. But because the association rules can differ from source to source, say that no direct match was found between Source 1 and Source 3. However, if the Asset 10 record in Source 2 matches an asset record in Source 3 (which can be according to rules that are particular to this pair of sources) then this same Asset ID will be assigned to the Source 3 record. Thus, Asset 10 will be found in all 3 sources, but there is no direct relationship between Sources 1 and 3 for this asset. Thus, all direct rule match and direct source match records will indicate Has Match=N, even though Is Associated=Y.
Thus, the system considers this to be an Indirect Source Match between sources 1 and 3, because it did match, even though there wasn't a direct relationship. The relationship can be through an intermediary source like above, or an even more circuitous path.
Indirect Source Matching takes the Direct Source Matching and Source Tracking data as input. If an entity is associated between a source pair but does not have a direct match between that pair, then Indirect Source Match Has Match=Y, else Has Match=N.
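As a sketch (the table name DQT_SRC_MATCH_ASSET, its columns and the match-type code are assumed; only the Has Match / Is Associated logic comes from this description):

    -- Derive the indirect flag from the direct source match rows:
    -- associated between the pair, yet no direct match was found.
    SELECT ASSET_ID,
           SOURCE_ID_A,
           SOURCE_ID_B,
           CASE WHEN IS_ASSOCIATED = 'Y' AND HAS_MATCH = 'N'
                THEN 'Y' ELSE 'N' END AS INDIRECT_HAS_MATCH
    FROM   DQT_SRC_MATCH_ASSET
    WHERE  MATCH_TYPE = 'DIRECT_SOURCE';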
With Indirect Source Matching, the system can answer questions like the following:
Summary group DQT tracking is somewhat different in nature from the above DQT types. The concept of a summary group is a set of records of an entity type (i.e., Assets) that have a (potentially arbitrary) relationship with each other. The system does not impose any specific relationship, and intentionally keeps the structure and relationship as simple as possible, to make the capabilities flexible while keeping the reporting simple.
The system uses summary groups as part of aggregate table detail, where the aggregate table contains individual “category” records with counts of records that have a particular quality in common to them. The system uses summary groups in several aggregate tables, including a Source Comparison aggregate table, which rolls up the DQT data described above in various ways to simplify reporting.
Examples of specific summary groups used in a typical Source Comparison aggregate include:
But again, there is no specific limitation on the contents of a summary group. So these would be equally valid summary groups:
Basically, if the item can be queried for in the database with freeform SQL (or even multiple SQL statements), the system can place it in a summary group, and then aggregate it, report on it and (most importantly) drill to it to see the original records in question.
As described above, a Summary Group is a group of records with an arbitrary but deterministic relationship. Without summary groups, grouping records arbitrarily so that they can be reported on simply and aggregated can be a significant challenge that typically puts pressure on the reporting solution to execute and/or replicate the business logic of the framework. This violates the separation of concerns, where the reporting solution should focus on the presentation of data, not on the generation of data.
Thus, at its core, a summary group is the addition of a single layer of indirection allowing for the assignment of a unifying summary group ID to a set of ID pointers (references) to the original records of interest. The system stores the summary group ID and the entity ID pointers in the summary group table.
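A minimal sketch of this structure and of a freeform population, with assumed table and column names, might be:

    -- Assumed minimal structure for the summary group table described above.
    CREATE TABLE DQT_SUMMARY_GROUP (
        SUMMARY_GROUP_ID NUMBER NOT NULL,   -- unifying identifier for the group
        ENTITY_ID        NUMBER NOT NULL    -- pointer to the original record (e.g., ASSET_ID)
    );

    -- Example population with freeform SQL: assets found in exactly one source.
    INSERT INTO DQT_SUMMARY_GROUP (SUMMARY_GROUP_ID, ENTITY_ID)
    SELECT 101, ASSET_ID
    FROM   STG_F_ASSET
    GROUP  BY ASSET_ID
    HAVING COUNT(DISTINCT SOURCE_ID) = 1;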
Summary groups are a means to an end. The meaning of the groups is what provides value. Summary groups are used for:
In both of these cases, reporting on the above in a clean, simple and extensible (non-hardcoded) way would have been extremely difficult and/or impossible without summary groups. Further, summary group applications include the use of an aggregate “Y/N” report allowing for navigating between all possible permutations of source combinations (i.e., assets with records in sources A, B and D and not in sources C, E and F).
A union table stores copies of records that originate in multiple, disparate tables; an action is then created that populates the union table. One example of a union table instance is a table containing a union of error records related to assets (the E_ASSET table). (Note that union tables do not have to be unions of error tables, but this is the primary motivating case.)
Note that the structure of E_ASSET is focused on the high priority columns for assets (ASSET_NAME, SERIAL_NUMBER, etc.), not every possible asset column. The reason is that every column mapping adds complexity from a metadata configuration perspective, so focusing on the most important columns (as well as those most likely to participate in error checks) provides important focus. Also, the more columns in the union table, the more sparse the records will probably be, as each source table may only have a subset of the union columns. The list of proposed columns was determined by reviewing the most common columns on various reports.
In one embodiment of the invention, there are four metadata tables related to union tables:
Each table has its own purpose. The only table that must be populated by the system configurator to turn on this functionality is IF_UNION_TABLE.
From this table, UnionTableAction knows:
From this information, UnionTableAction determines which tables should be mapped into E_ASSET. Since this is an ERROR type union table, then the candidates are: every error table that corresponds to a table that maps into STG_F_ASSET. Now, all of those relationships can be found in metadata (IF_TABLE_COLUMN_MAP, IF_DATA_ERR_CHK_TABLE, etc.). UnionTableAction traverses those relationships and finds the appropriate tables, which are populated in IF_UNION_TABLE_MAP.
As shown below for CSC, there are two error tables (one at import and one at staging) that correspond to STG_F_ASSET. Table 1 is an example of a union table that was populated automatically by UnionTableAction.
In Table 1, besides the table mappings, there are a couple of other columns including:
Union Tables may be used as follows:
Continuing, IF_UNION_TABLE_COL_ALIAS is used to create column aliases. Since the system knows the union table (and its columns) and knows the prototype table (and its columns), the system starts automatic mapping of prototype columns to union columns. This is actually an intermediate mapping table that is used to handle column mappings as well as to create column aliases. The aliases are simply other names for the columns. Table 2 depicts a brief sample of this mapping.
Any column of a mapped table that maps to the prototype table column having the column alias name above is mapped to the corresponding union table column in the union table.
In one specific example, for “ASSET_NAME”, any column that maps (via IF_TABLE_COLUMN_MAP or the error extension) to STG_F_ASSET.ASSET_NAME (the prototype) is mapped to E_ASSET.ASSET_NAME.
Within the mapping process there is an alias priority, which means that the system uses a hierarchy of mappings, and for each table to be mapped in, the system continues from highest priority to lowest priority until the system finds a column match.
Also, the system uses an ALIAS_MODEL with a value of “PROTOTYPE”. What this means is that the system maps columns that map to the PROTOTYPE table column with the name of COLUMN_ALIAS. It is also possible to have an ALIAS_MODEL of “SOURCE”. With this, the COLUMN_ALIAS values refer directly to columns on the source tables (which might have completely different names, such as “HOSTNAME” at an import level). However, it is expected that using the PROTOTYPE alias model will be the most common way to do this, as an import level HOSTNAME column will still get picked up and mapped correctly to E_ASSET.ASSET_NAME, as long as in IF_TABLE_COLUMN_MAP the import HOSTNAME column is mapped to STG_F_ASSET.ASSET_NAME (the prototype).
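For instance, once the HOSTNAME-to-ASSET_NAME mapping has been resolved, the population statement generated for one mapped source table might take a form such as the following (abbreviated to two columns; the actual statement is driven by IF_UNION_TABLE_COLUMN_MAP):

    -- Import-level HOSTNAME lands in E_ASSET.ASSET_NAME because the metadata
    -- maps HOSTNAME to the prototype column STG_F_ASSET.ASSET_NAME.
    INSERT INTO E_ASSET (TENANT_CODE, ASSET_NAME)
    SELECT TENANT_CODE, HOSTNAME
    FROM   ERR_IMP_ASSET;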
ALIAS_PRIORITY set 1 is automatically populated, using just the exact column names of E_ASSET. Any additional mappings, or tweaks, need to be added as additional sets. For example, certain tweaks were performed above, where alternate column names have been added for PHYSICAL_STATUS_NAME and FINANCIAL_STATUS_NAME. Thus, if a mapped table has a column that maps into either of these, it will be mapped into the corresponding union column. (If a table has both, the alias priority determines the actual one used.)
All of the above forms an intermediate mapping table. At the end of this process, the system has to make a determination of source tables and columns to be mapped explicitly into the union tables and columns. This could (in theory) be calculated on the fly from the above, but instead the system takes the explicit route and stores the results of the column mappings in another metadata table, IF_UNION_TABLE_COLUMN_MAP.
UnionTableAction takes a first stab at it, and Table 3 depicts a representative sample.
So at the end of all of this, E_ASSET is populated with, in the above case, records mapped from both ERR_STG_F_ASSET and ERR_IMP_ASSET. Thus, the system may drill from an error overview report and see the actual records (or rather their union table proxies, but that should be transparent to the end user.) Error functions provided include:
Conflict functions provided include:
During association, the sync process finds and links records that refer to the same entity, an “entity” in this case being defined as an asset, a software item, a user, and the like. Data comes from different sources, but the data often concerns identical entities; association finds the links between records and gives each record the same identifier in the staging tables.
Association is also used to create the IDs that are the PKs and FKs for the final target tables. Some association actions create (and associate) new records where they are needed. Association is invoked on data that is already in the staging tables. However, because a process stage ID can be specified, an association step can be performed at any point in the process.
For example, at the end of the association process, it is clear what record from source B refers to asset 10 from source A. Association functions include:
Import table association is another type of association made in the Integration Framework. Import table association defines the relationships between imported tables and supports a variety of import table relationships such as:
The Key Merge action assigns unique values from a sequence to the set of distinct values in a specified, configurable set of columns in a table. If the same set of values occurs multiple times in the table, all occurrences of the distinct set of values are assigned the same key value. Provided functions include:
In some cases, the temporary table is not really a temporary table but an existing table; in this case, it is not created or dropped.
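A sketch of the key-merge pattern, using ROW_NUMBER in place of a database sequence and assuming the configured distinct-value columns are ASSET_NAME and SERIAL_NUMBER:

    -- 1. Generate one key per distinct value set in a temporary (or existing) table.
    CREATE TABLE TMP_KEY_MERGE AS
    SELECT ASSET_NAME, SERIAL_NUMBER,
           ROW_NUMBER() OVER (ORDER BY ASSET_NAME, SERIAL_NUMBER) AS NEW_KEY
    FROM  (SELECT DISTINCT ASSET_NAME, SERIAL_NUMBER FROM STG_F_ASSET);

    -- 2. Every row sharing the same distinct value set receives the same key.
    UPDATE STG_F_ASSET s
    SET    s.ASSET_ID = (SELECT t.NEW_KEY
                         FROM   TMP_KEY_MERGE t
                         WHERE  t.ASSET_NAME    = s.ASSET_NAME
                         AND    t.SERIAL_NUMBER = s.SERIAL_NUMBER);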
Consolidation processes function to combine the data to obtain a Gold data record for a unique asset, user, or other entity by defining the priority of data coming from a particular source over another. Consolidation is performed column by column, meaning that the value for one column may be taken from source A if possible, and if not then from source B, source C, and so on, while for other columns the order may be reversed.
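For a single attribute, the column-wise, source-prioritized selection can be sketched as follows (SOURCE_PRIORITY stands in for the configurable per-column priority; the actual statements are generated from metadata):

    -- Pick, per asset, the SERIAL_NUMBER supplied by the highest-priority source
    -- that actually has a value; other columns may use a different priority order.
    SELECT ASSET_ID, SERIAL_NUMBER
    FROM  (SELECT ASSET_ID, SERIAL_NUMBER,
                  ROW_NUMBER() OVER (PARTITION BY ASSET_ID
                                     ORDER BY SOURCE_PRIORITY) AS RN
           FROM   STG_F_ASSET
           WHERE  SERIAL_NUMBER IS NOT NULL)
    WHERE  RN = 1;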
In one embodiment of the invention, consolidation is performed by creating a plurality of tables, where each table contains assets comprising the same attribute, e.g., a narrow table for each attribute such as asset name, serial number, and so on. Each of these tables is consolidated into a temporary table using a single SQL statement. The temporary tables are then joined to form a multi-dimensional table containing the consolidated (gold) records. This procedure achieves the gold records significantly faster than processing the data on a record-by-record basis. Provided functions include:
The system possesses the capability to segregate an outsourcer's multiple clients (referred to herein as tenants) by assigning unique identifiers to each individual client's data. In this way, queries and subsequent reporting on IT assets can be filtered for display for each individual tenant only, while an outsourcer organization may opt to make distinctions based on its own hierarchy.
Tenant configuration within the Integration Framework consists of identifying a client and associating that identifier with asset data as it enters the system schema and passes through the import, staging, and target phases. Provided functions include:
Closely related to the creation/deletion processes and tables shown earlier, Data Structure Configuration and Definition tables identify data structures and how they are related. Provided functions include:
Rulesets are used in Multi-Tenant scenarios to allow for the conditional application of a specific action. This was implemented in the system to facilitate metadata configuration via the reuse/reapplication of existing rules across multiple platform tenants when applicable.
A rule is a generic term for a unit of configured metadata intended to drive a specific action, i.e. association rule or data transform rule. A single rule may correspond to one or more specific metadata records in one or more metadata tables, based upon the particular configuration options of the action in question. A single rule should not be interpreted to span multiple actions, even of a single type (i.e. 1 data transform=1 rule.)
An action type is a type of action that is being performed. For the purpose of rulesets, the action types are “chunky”, so multiple specific action classes may be considered the same action type. For example, Link action and Assign action are both considered of the same type “Association” action.
The currently identified action types to be used for tenant rulesets are:
A ruleset is a collection of rules of a single type that have been determined by the human configurator to have a relationship such that they should be applied (or not applied) as an atomic entity. That is, either all rules in a ruleset are applied for a specific tenant, or none are. A ruleset is identified by its type and identifier (ID). The ID does not need to be unique across rulesets of different types, so for example there may be association ruleset 1 as well as data transform ruleset 1. Rulesets of different types should be viewed as completely independent of each other, and thus there is no inherent relationship between rulesets of different type even if they have the same ruleset ID.
The action type for a specific rule or set of rules is referred to as a rule or ruleset type. Each ruleset has a specific type, which is one of the action types identified above.
The system systematically applies the concept of a ruleset. In one embodiment of the invention, where the system operates in a non-multi-tenant mode, almost all actions configured in metadata are directly applied to the customer data without implicit condition. In an embodiment that operates in a multi-tenant mode, there shall be an additional level of indirection between the rules and actions configured in metadata and the various tenants' data. All metadata rules (transforms, associations, etc.) shall have a mandatory attribute of a ruleset ID.
A tenant will not be assigned directly to a rule, but rather will be associated with one or more rulesets. In this way, specific, related groups of rules will be applied to tenant data, and multiple tenants can share the same rules without requiring massive duplication of metadata rule configuration. Further, additional tenants shall be able to be introduced into a live system with little or no mandatory metadata configuration, provided that the new tenant's sources and rules conform to existing definitions within the system.
Every metadata rule has a ruleset ID. All rules of the same type (i.e. association or data transform) and same ruleset ID are applied (or not applied) together to the same set of tenants as configured in metadata. (It is not necessary for this ruleset to be atomic in the ACID transactional sense.)
A ruleset has a type, corresponding to the type of metadata rules it represents. A ruleset also has an ID that is unique within the type. A tenant can be assigned one or more rulesets of a type. Tenants can be assigned multiple rulesets, and the same ruleset can be assigned to multiple tenants.
It is an explicit consideration that the amount of configuration that the end customer must perform in order to introduce a new tenant into an existing deployment shall be minimized where possible. Configuration changes are forced on an exception basis only (i.e., where a new tenant needs alternate behavior that has not yet been configured). Default behavior and rule and ruleset reuse are encouraged to reduce complexity and increase manageability of a large deployment.
There is a default ruleset for each type (ID 1). All tenants inherit the default ruleset, even if it is not explicitly configured. In this way, a new tenant with “typical” behavior (as configured for the sources within a deployment) can be introduced into a deployment and start working “out of the box” with no additional configuration required.
In the interest of having a “single line of code” for both multi-tenant and non-multi-tenant system deployments, the aforementioned ruleset behavior is only applicable for multi-tenant scenarios. Thus, the activity of multi-tenant rulesets and their behavior is governed by at least one global parameter that enables/disables rulesets. This may be the same global parameter that enables/disables multi-tenancy in general.
In terms of implementation, the addition of rulesets in a multi-tenant deployment has the primary effect of adding an additional WHERE clause to the various generated SQL statements. For example, in an Association link action, the clause shall also match on tenant IDs and restrict the set of affected tenant IDs to those that are associated with the ruleset. The ruleset determines which tenant IDs are applicable for the particular rule, for example:
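The following fragment is illustrative only; the tenant-to-ruleset mapping table IF_TENANT_RULESET, its columns, and the table aliases are assumed:

    -- Additional restriction appended to a generated association statement:
    -- only tenants assigned to association ruleset 1 are affected.
    AND s.TENANT_ID = t.TENANT_ID
    AND t.TENANT_ID IN (SELECT TENANT_ID
                        FROM   IF_TENANT_RULESET
                        WHERE  RULESET_TYPE = 'ASSOCIATION'
                        AND    RULESET_ID   = 1)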
Other actions will be affected in a similar manner. Functions provided include:
A remaining set of miscellaneous tables do not easily fit into a defined process, but are configurable within the Integration Framework. Provided functions include:
IT Outsourcers manage IT assets for their customers and generate revenue by billing the customer for assets under management. Such billing may require billing reconciliation.
At its core, billing reconciliation has three primary goals:
Billing reconciliation leverages many aspects of DQT (particularly source tracking and source match tracking), as well as other information and business rules, to put each consolidated asset into a billing category.
Considerations that typically are inputs to this categorization include:
These various inputs form a matrix, where each cell of the matrix represents a billing category, such as “Verified as Billable” or “Underbilled by Scope”.
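Purely as an illustration of how matrix cells map to categories (the input flags, the table name and the third category label are assumed; only the two quoted category names appear in this description), such an assignment might be expressed as:

    -- Derive a billing category from two example inputs.
    SELECT ASSET_ID,
           CASE
             WHEN IS_BILLED = 'Y' AND IN_DISCOVERED_INVENTORY = 'Y'
               THEN 'Verified as Billable'
             WHEN IS_BILLED = 'N' AND IN_DISCOVERED_INVENTORY = 'Y'
               THEN 'Underbilled by Scope'
             ELSE 'Requires Review'
           END AS BILLING_CATEGORY
    FROM   BILLING_RECON_ASSET;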
Note that the system determines a major category, such as:
A sub category, such as:
And a billing category cause (i.e. the “reason” the particular category was assigned), such as:
Table 4 depicts the relationships of the various inputs and outputs for billing categorization.
As discussed previously, the gold record is the final, consolidated record that represents the best view of the entity by combining and consolidating the data from multiple sources. As also discussed, DQT allows for simple reporting of the number of sources that constitute a gold record, and their relationships. In addition, the system stores a copy of the gold record in the staging table in staging format. This allows for intuitive reporting where a user can drill from a standard (gold) record report to a source detail report to see a specific gold record plus all the records from the various sources that were consolidated into the gold record, in a simple table layout. This is a tremendous source of information and benefit to the customer, as it shows in striking detail the records from the various sources that are familiar to the customer, plus their relationships, which can be verified through simple visual inspection (i.e., see that the Asset Name matches, see that the Serial Number matches, etc.).
This is particularly effective with Billing Reconciliation, since a simple visual presentation of the related source records affords easy verification by the customer and instills general confidence in the reported results.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof.
This application claims benefit of U.S. provisional patent application Ser. No. 60/999,275, filed Oct. 17, 2007, which is herein incorporated by reference.