The invention relates generally to automatic business content discovery, and more specifically, to discovering business content via data validation rules bound to business terms.
Organizations today have large data stores storing business content in the form of Information Technology (IT) assets. Business content may be information critical for the business and its operations. For example, an enterprise may store different types of data in different systems such as legacy systems, enterprise information systems, relational databases, object databases, file stores, and so on.
Within a huge infrastructure and a complex IT landscape, an organization may have the need to organize, profile, and monitor data periodically. Because of a complex IT landscape, the organization may need to employ IT professionals to profile data manually. Thus, the monitoring and profiling of data may consume a lot of resources.
Many organizations have operations in different geographic regions and intricate supply chains involving many stakeholders. As data sources become larger and the data exchanged on a daily basis becomes more complex with the growing number of stakeholders, it may be beneficial for an organization to streamline the profiling and monitoring of data.
These and other benefits and features of embodiments of the invention will be apparent upon consideration of the following detailed description of preferred embodiments thereof, presented in connection with the following drawings.
In various embodiments, a method to automatically discover business content is described. The method of the various embodiments includes binding business terms to data validation rules, discovering business content based on the data validation rules, and binding business content to data elements. In various embodiments, data is profiled and monitored using data validation rules.
In various embodiments, a system is described. The system of the embodiments includes a catalog to store business terms and data validation rules, a data services engine to discover business content from a variety of data sources, and a user interface.
In various embodiments, a user interface provides dialogs and screens for creating business terms and data validation rules. The user interface also provides dialogs and screens for data analysis and profiling.
The claims set forth the embodiments of the invention with particularity. The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. The embodiments of the invention, together with its advantages, may be best understood from the following detailed description taken in conjunction with the accompanying drawings.
Embodiments of techniques for ‘Method and System for Automatic Business Content Discovery’ are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment”, “this embodiment” and similar phrases, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of these phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Metadata is information about information. Metadata typically constitutes a subset or representative values of a larger data set. Metadata describes how structure and calculation rules are stored, plus, optionally, additional information on data sources, definitions, transformations, quality, date of last update, user privilege information, etc.
A data source is a source of information, such as a database. A data source table is a database table, structured file, or the like whose data content is used at least in part to define the data content of a target table by mapping at least a portion of the data content of the data source table to the target table using a data federation program.
Data sources include sources of data that enable data storage and retrieval. Data sources may include databases, such as relational, transactional, hierarchical, multidimensional (e.g., OLAP), object oriented databases, and the like. Further data sources may include tabular data (e.g., spreadsheets, delimited text files), data tagged with a markup language (e.g., XML data), transactional data, unstructured data (e.g., text files, screen scrapings), hierarchical data (e.g., data in a file system, XML data), files, one or more reports, and any other data source accessible through an established protocol, such as Open Data Base Connectivity (ODBC) and the like. Data sources may also include sources in which the data is not stored, such as data streams, broadcast data, and the like.
Master data contains information that is needed often and in some predictable or accepted form. Master data may be stored in a computer system, in a network of computer systems or in a variety of data stores. Master data may be persistent data that defines data relevant for the operation of a company or organization.
For example, the master data of a cost center contains the name of the cost center, the person responsible for the cost center, and the corresponding hierarchy area. In another example, the master data of a vendor contains the name, address, and bank information for the vendor. In a further example, the master data of a user in a computer system may contain the user's authorizations in the system, the name of their default printer, and other information.
A business term is a term used in an organization to describe an asset of the organization. Business terms are collected in a vocabulary of words and phrases, or notation systems. Using business terms, users describe the content type of their data, for example, employee, social security number, driver's license number, address, etc. Master data of an organization may be defined and described as a business term and stored in a business term repository or catalog.
A simple business term describes an atomic content of a basic data element (e.g., social security number and purchase order number). A compound business term is a business term which incorporates several simple business terms. For example, the compound business term employee may incorporate several simple business terms such as name, last name, social security number, etc.
The content type of a piece of data may describe the nature of the data as required by the definition of the data in a business term.
A business term can also be bound to reference data. In that case, only values of the business terms from the pool of reference data are valid. For example, a name may be required to be checked and found in a name dictionary. In another example, company name may be required to be checked and found in a firm name dictionary. Such reference data can be used if the format of the business term cannot be uniformly defined. For example, a social security number is a sequence of 9 digits in a prescribed format so its format is standard. However, a name cannot be expected to have an exact number of characters in an exact format.
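By way of illustration only, a binding of a business term to reference data may be sketched as a membership check against a dictionary of known values. The function name reference_rule, the case-insensitive comparison, and the sample firm names are assumptions made for this sketch and are not part of the described embodiments.

```python
from typing import Callable, Iterable


def reference_rule(reference_values: Iterable[str]) -> Callable[[str], bool]:
    """Build a rule that accepts only values found in the reference data.

    Hypothetical helper: values are compared after trimming whitespace and
    folding case, which is an assumption of this sketch.
    """
    allowed = {value.strip().casefold() for value in reference_values}

    def check(value: str) -> bool:
        return value.strip().casefold() in allowed

    return check


# Illustrative firm name dictionary (made-up entries).
firm_name_rule = reference_rule(["Acme Corporation", "Globex Inc."])
```

In this sketch, firm_name_rule("acme corporation") is accepted because the value appears in the dictionary, whereas a value absent from the dictionary is rejected regardless of its format.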
Business terms may also have parent-child relationships. For example, the business term “organization” may have “employees.” Thus, employee business terms are child business terms to the parent business term organization.
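One possible representation of such parent-child relationships is a simple tree of terms, sketched below. The class name Term and its methods are hypothetical and are chosen only to illustrate the relationship.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class Term:
    name: str
    parent: Optional["Term"] = None
    children: List["Term"] = field(default_factory=list)

    def add_child(self, child: "Term") -> "Term":
        """Attach a child business term to this term and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the chain of term names from the root down to this term."""
        parts = []
        node: Optional["Term"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))


organization = Term("organization")
employee = organization.add_child(Term("employee"))
```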
Some business data may have data validation rules that define the basic structure or pattern of a data element representing such data. For example, a social security number is a sequence of digits in the format “999-99-9999.” Data validation rules to be applied to simple business terms are simple rules. Data validation rules to be applied to compound terms are compound rules. A compound rule is a collection of rules that are relevant for a term. For example, a compound rule for an employee business term may define that the employee term is expected to have four fields, such as “name”, “address”, “social security number”, and “driver's license number.” If such a data element is found, further rules to match each of the fields to a business term will be applied. For example, four rules will be applied to verify that the employee data element not only has the four required fields, but also each field is of a required format.
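By way of illustration only, the simple and compound rules described above may be sketched as follows. The simple rule for the social security number uses the “999-99-9999” format given above. The checks for name and address (non-empty text) and the pattern for the driver's license number are assumptions made for this sketch, since no format for those fields is specified herein.

```python
import re
from typing import Callable, Dict, Mapping

SimpleRule = Callable[[str], bool]


def pattern_rule(pattern: str) -> SimpleRule:
    """Build a simple rule that requires the whole value to match a pattern."""
    compiled = re.compile(pattern)
    return lambda value: compiled.fullmatch(value) is not None


def non_empty(value: str) -> bool:
    """Assumed placeholder rule for free-form fields such as names."""
    return bool(value.strip())


# Simple rule: nine digits in the format 999-99-9999.
ssn_rule = pattern_rule(r"\d{3}-\d{2}-\d{4}")

# Compound rule: one simple rule per required field of the employee term.
EMPLOYEE_RULE: Dict[str, SimpleRule] = {
    "name": non_empty,
    "address": non_empty,
    "social security number": ssn_rule,
    # Assumed format; actual license formats differ by jurisdiction.
    "driver's license number": pattern_rule(r"[A-Z0-9]{5,20}"),
}


def satisfies_compound_rule(
    element: Mapping[str, str], rule: Mapping[str, SimpleRule]
) -> bool:
    """Check that every required field is present and passes its simple rule."""
    if not all(field_name in element for field_name in rule):
        return False
    return all(check(element[field_name]) for field_name, check in rule.items())
```

The compound check first verifies that the data element has the required fields and only then applies the per-field rules, mirroring the two-stage verification described above.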
In various embodiments, a data validation rule may specify that a business term conforms to reference data. Such embodiments are relevant for data in business terms that cannot be uniformly specified in a format, such as, but not limited to, names.
According to various embodiments, business terms, their definitions, and data validation rules are stored in a catalog as a repository. A catalog may hold business terms relevant for an organization. For example, one organization may define the business term “employee” to have a social security number, a name, and an address. Another organization may define the business term “employee” to have an ID, a name, a social security number, and a driver's license number.
In various embodiments, data quality tools assess the completeness, validity, consistency, timeliness, and accuracy of a data set in view of a specific use, because different uses may impose different requirements on the data. In other words, one use of the data may require that the data be 99% accurate, while another use of the data may require that the data be 97% accurate.
In various embodiments, a system may be implemented to maintain a repository of business terms and data validation rules. In various embodiments, the bindings may be applied to tie business terms to one or more data validation rules that apply to the terms. So for instance, a repository may contain a textual definition of a term and bindings that bind the term to one or more data validation rules. In various embodiments, the system may be configured to periodically discover data elements related to selected business terms in selected data sources that conform to the one or more data validation rules bound to the term. Data elements that are found to satisfy their respective data validation rules may then be bound to the data validation rules. This additional binding is also referred to as “profiling” and serves as a stamp of validity of the data element. Furthermore, the system may periodically monitor data elements to determine whether they continue to satisfy their corresponding data validation rules.
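The repository, binding, discovery, profiling, and monitoring steps described above may be sketched, under simplifying assumptions, as follows. The class Catalog, its method names, and the representation of data sources as in-memory dictionaries of columns are all hypothetical; a data element is treated here as a column, and it is considered conforming only if every value satisfies every rule bound to the term.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], bool]
Sources = Dict[str, Dict[str, List[str]]]  # source name -> column name -> values
ElementId = Tuple[str, str]                # (source name, column name)


def ssn_format(value: str) -> bool:
    """Rule for the SSN format 999-99-9999."""
    return re.fullmatch(r"\d{3}-\d{2}-\d{4}", value) is not None


@dataclass
class Catalog:
    definitions: Dict[str, str] = field(default_factory=dict)
    rules: Dict[str, Rule] = field(default_factory=dict)
    term_rules: Dict[str, List[str]] = field(default_factory=dict)
    profiled: Dict[ElementId, List[str]] = field(default_factory=dict)

    def add_term(self, term: str, definition: str) -> None:
        self.definitions[term] = definition
        self.term_rules.setdefault(term, [])

    def bind(self, term: str, rule_name: str, rule: Rule) -> None:
        """Bind a business term to a data validation rule."""
        self.rules[rule_name] = rule
        self.term_rules[term].append(rule_name)

    def _conforms(self, values: List[str], rule_names: List[str]) -> bool:
        if not values or not rule_names:
            return False
        return all(self.rules[name](v) for name in rule_names for v in values)

    def discover(self, term: str, sources: Sources) -> List[ElementId]:
        """Find conforming data elements and bind them to the term's rules."""
        rule_names = self.term_rules[term]
        found = []
        for source, columns in sources.items():
            for column, values in columns.items():
                if self._conforms(values, rule_names):
                    self.profiled[(source, column)] = list(rule_names)
                    found.append((source, column))
        return found

    def monitor(self, sources: Sources) -> List[ElementId]:
        """Return profiled data elements that no longer satisfy their rules."""
        return [
            (source, column)
            for (source, column), rule_names in self.profiled.items()
            if not self._conforms(sources.get(source, {}).get(column, []), rule_names)
        ]
```

In such a sketch, discover would be invoked periodically for selected terms and sources, and monitor would be invoked periodically to detect profiled data elements whose content has drifted away from the bound rules.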
In an exemplary embodiment, an exemplary business term “SSN” may stand for social security number and may be bound to an exemplary data validation rule specifying a format for the SSN as “999-99-9999.” According to the process described in
In various exemplary embodiments, the following exemplary code may be used to generate a data validation rule for a social security number:
At process block 212, a validity threshold relevant for the data validation rule is received. In various embodiments, the validity threshold may be used to determine a likelihood that data matches one or more data validation rules. At process block 214, the data elements matching the format specified in the data validation rule are determined. At process block 216, the data elements determined to have matched the rules are sent to a user interface for approval.
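One way to read the validity threshold is as a minimum fraction of values in a data element that must match the rule before the data element is proposed for approval. The following is a minimal sketch under that assumption; the function names, the treatment of a data element as a column of values, and the sample data are hypothetical.

```python
import re
from typing import Dict, List, Pattern

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def match_ratio(values: List[str], pattern: Pattern[str]) -> float:
    """Return the fraction of values that fully match the rule's format."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if pattern.fullmatch(value))
    return hits / len(values)


def candidates(
    columns: Dict[str, List[str]], pattern: Pattern[str], threshold: float
) -> List[str]:
    """Return names of columns whose match ratio meets the validity threshold.

    The returned names stand in for the data elements that would be sent
    to a user interface for approval.
    """
    return [
        name
        for name, values in columns.items()
        if match_ratio(values, pattern) >= threshold
    ]
```

With a threshold below 1.0, a column containing a few malformed or missing values may still be proposed as a match, which leaves the final decision to the approving user.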
In various embodiments, data in business terms may also be used in searching data sources for matching data elements. For example, a business term can contain valid values which can be used in matching data elements. Further, a business term may include sample data that can be used in matching data elements. A business term can also include a definition to be used in matching data elements from data sources. Using both data in data validation rules and data in business terms to match data elements may be useful in searching data sources, as data elements may be matched more efficiently and more precisely. Also, better matching techniques can result in savings of time and resources.
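As an illustrative sketch only, combining a format rule with the valid values of a business term may be expressed as requiring a value to pass both checks when both are available. The class TermEvidence, the country code example, and the decision to combine the checks conjunctively are assumptions of this sketch; sample data and definitions are not modeled.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TermEvidence:
    pattern: Optional[str] = None                          # format from a bound rule
    valid_values: Set[str] = field(default_factory=set)    # values from the term


def match_fraction(values: List[str], term: TermEvidence) -> float:
    """Fraction of values passing the format check and, if any valid values
    are defined, also appearing among them."""
    if not values:
        return 0.0
    compiled = re.compile(term.pattern) if term.pattern else None
    hits = 0
    for value in values:
        format_ok = compiled is None or compiled.fullmatch(value) is not None
        value_ok = not term.valid_values or value in term.valid_values
        if format_ok and value_ok:
            hits += 1
    return hits / len(values)
```

In this sketch a two-letter value such as "XX" passes the format check alone but is rejected once the valid values of the term are also consulted, which illustrates how the additional data can make matching more precise.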
Some embodiments of the invention may include the above-described methods being written as one or more software components. These components, and the functionality associated with each, may be used by client, server, distributed, or peer computer systems. These components may be written in a computer language corresponding to one or more programming languages such as functional, declarative, procedural, object-oriented, lower level languages and the like. They may be linked to other components via various application programming interfaces and then compiled into one complete application for a server or a client. Alternatively, the components may be implemented in server and client applications. Further, these components may be linked together via various distributed programming protocols. Some example embodiments of the invention may include remote procedure calls being used to implement one or more of these components across a distributed programming environment. For example, a logic level may reside on a first computer system that is remotely located from a second computer system containing an interface level (e.g., a graphical user interface). These first and second computer systems can be configured in a server-client, peer-to-peer, or some other configuration. The clients can vary in complexity from mobile and handheld devices, to thin clients and on to thick clients or even other servers.
The above-illustrated software components are tangibly stored on a computer readable medium as instructions. The term “computer readable medium” should be taken to include a single medium or multiple media that stores one or more sets of instructions. The term “computer readable medium” should be taken to include any article that is capable of undergoing a set of changes to store, encode, or otherwise carry a set of instructions for execution by a computer system which causes the computer system to perform any of the methods or process steps described, represented, or illustrated herein. Examples of computer-readable media include, but are not limited to: magnetic media, such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer readable instructions include machine code, such as that produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hard-wired circuitry in place of, or in combination with machine readable software instructions.
A data source is an information resource. Data sources include sources of data that enable data storage and retrieval. Data sources may include databases, such as, relational, transactional, hierarchical, multi-dimensional (e.g., OLAP), object oriented databases, and the like. Further data sources include tabular data (e.g., spreadsheets, delimited text files), data tagged with a markup language (e.g., XML data), transactional data, unstructured data (e.g., text files, screen scrapings), hierarchical data (e.g., data in a file system, XML data), files, one or more reports, and any other data source accessible through an established protocol, such as, Open Data Base Connectivity (ODBC), produced by an underlying software system (e.g., ERP system), and the like. Data sources may also include a data source where the data is not tangibly stored or otherwise ephemeral such as data streams, broadcast data, and the like. These data sources can include associated data foundations, semantic layers, management systems, security systems and so on.
A semantic layer is an abstraction overlying one or more data sources. It removes the need for a user to master the various subtleties of existing query languages when writing queries. The provided abstraction includes a metadata description of the data sources. The metadata can include terms meaningful for a user in place of the logical descriptions used by the data source. For example, common business terms may be used in place of table and column names. These terms can be localized and/or domain specific. The layer may include logic associated with the underlying data, allowing it to automatically formulate queries for execution against the underlying data sources. The logic includes connection to, structure for, and aspects of the data sources. Some semantic layers can be published, so that they can be shared by many clients and users. Some semantic layers implement security at a granularity corresponding to the underlying data sources' structure or at the semantic layer. The specific forms of semantic layers include data model objects that describe the underlying data source and define dimensions, attributes, and measures with the underlying data. The objects can represent relationships between dimension members, and provide calculations associated with the underlying data.
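By way of illustration only, the substitution of business terms for table and column names may be sketched as a mapping from which a query is formulated automatically. The table name CUST_MSTR, the column names, the business terms, and the function build_query are made up for this sketch, which handles only terms that resolve to a single table.

```python
from typing import Dict, List, Tuple

# Hypothetical semantic layer: business term -> (table name, column name).
SEMANTIC_LAYER: Dict[str, Tuple[str, str]] = {
    "Customer Name": ("CUST_MSTR", "CUST_NM"),
    "Customer City": ("CUST_MSTR", "CITY_CD"),
}


def build_query(terms: List[str], layer: Dict[str, Tuple[str, str]]) -> str:
    """Formulate a SELECT statement from business terms.

    Simplification of this sketch: all requested terms must map to the
    same table, so no joins are generated.
    """
    tables = {layer[term][0] for term in terms}
    if len(tables) != 1:
        raise ValueError("this sketch supports terms from exactly one table")
    table = tables.pop()
    columns = ", ".join(f'{layer[term][1]} AS "{term}"' for term in terms)
    return f"SELECT {columns} FROM {table}"
```

A user of such a layer requests "Customer Name" and "Customer City" without knowing the underlying table and column names.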
The above descriptions and illustrations of embodiments of the invention, including what is described in the Abstract, are not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. These modifications can be made to the invention in light of the above detailed description. Rather, the scope of the invention is to be determined by the following claims, which are to be interpreted in accordance with established doctrines of claim construction.