This invention relates to a data warehousing method and system, and more specifically, to a method and system that cleans and prioritizes data for a data warehouse.
Most data warehousing projects consolidate data from different source systems, each of which typically uses a different data organization and/or format, whether or not the data is relevant or of interest to the end-users. Common data source formats include relational databases (e.g., DB2), flat files (e.g., XML), and non-relational database structures such as information management system (IMS), virtual storage access method (VSAM), and indexed sequential access method (ISAM). The current approach to creating a data warehouse is to extract the data from a variety of sources, to transform the data from the original source to a form for the data warehouse, and to load the data into the data warehouse. To facilitate the transformation of the data, predetermined rules are used, and typically the predetermined rules do not get the transformation right because data is excluded or incorrectly transformed. The predetermined rules are set up using data profile surveys, but are not based on user requirements. This results in a high cost for the transformation, which is driven higher still by the desire to move over as much of the data that can be obtained for extraction as possible.
This invention provides methods and computer program products for a reduced volume precision data quality information cleansing feedback process. More specifically, a method according to one embodiment of the invention receives a request from a user for information from an electronic information warehouse. In response to the request, the information is transmitted to the user. Feedback is received from the user, wherein the feedback includes errors in content of the information and errors in relationship data. The relationship data has data describing how a data entry in the information relates to other data entries in the information. The feedback also includes proposals on how to correct the errors in the content and the errors in the relationship data. In another embodiment, the user is prompted for feedback.
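The feedback described above can be pictured as a small record type. The following is a minimal sketch only; the class names, field names, and sample values are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentError:
    """An error in the content of a data entry, with an optional proposed fix."""
    entry_id: str
    field_name: str
    observed_value: str
    proposed_value: Optional[str] = None


@dataclass
class RelationshipError:
    """An error in how one data entry relates to another data entry."""
    entry_id: str
    related_entry_id: str
    observed_relation: str
    proposed_relation: Optional[str] = None


@dataclass
class Feedback:
    """Feedback from one user about information returned for a request."""
    user_id: str
    content_errors: List[ContentError] = field(default_factory=list)
    relationship_errors: List[RelationshipError] = field(default_factory=list)

    def has_proposals(self) -> bool:
        """Return True if any reported error carries a proposed correction."""
        return any(
            e.proposed_value is not None for e in self.content_errors
        ) or any(
            e.proposed_relation is not None for e in self.relationship_errors
        )


# Hypothetical sample: one content error and one relationship error.
feedback = Feedback(
    user_id="analyst-1",
    content_errors=[
        ContentError("order-17", "country", "Untied States", "United States")
    ],
    relationship_errors=[
        RelationshipError("order-17", "customer-9", "billed_to", "shipped_to")
    ],
)
```

Keeping content errors and relationship errors in separate lists mirrors the two kinds of error that the feedback is said to include.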
Furthermore, the method according to one embodiment of the invention creates correction rules based on the feedback and monitors information request behavior patterns to identify selected types of information by the user and non-selected types of information by the user. The information contained in the information warehouse is modified using the correction rules to produce modified information, wherein the modifying reduces the volume of the information. The modification of the information removes the non-selected types of information and processes only relevant transactional data to build a data warehouse. Thus, the modification of the information only processes relevant data for analysis.
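One way to picture the modification step is as a filter followed by rule application. This is a sketch under stated assumptions: records are plain dictionaries with a hypothetical "type" key, a correction rule is a function from record to record, and the set of selected types is supplied by the caller.

```python
from typing import Callable, Dict, Iterable, List, Set

Record = Dict[str, str]
CorrectionRule = Callable[[Record], Record]


def modify_information(
    records: Iterable[Record],
    rules: Iterable[CorrectionRule],
    keep_types: Set[str],
) -> List[Record]:
    """Drop non-selected record types, then apply every correction rule."""
    rules = list(rules)
    modified = []
    for record in records:
        if record["type"] not in keep_types:
            continue  # non-selected type: removed, which reduces volume
        for rule in rules:
            record = rule(record)
        modified.append(record)
    return modified


def fix_country(record: Record) -> Record:
    """Hypothetical correction rule derived from user feedback."""
    if record.get("country") == "Untied States":
        return {**record, "country": "United States"}
    return record


records = [
    {"type": "order", "country": "Untied States"},
    {"type": "inventory", "country": "Canada"},
    {"type": "order", "country": "Canada"},
]
modified = modify_information(records, [fix_country], keep_types={"order"})
```

In this sample the inventory record is dropped because no user selected that type, so the modified information is both corrected and smaller than the input.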
The method, according to one embodiment of the invention, displays the modified information to the user. Further, alerts are sent to a data quality operations team, wherein the alerts include the correction rules. A response to the alerts is received from the data quality operations team, wherein the response includes an acceptance, rejection and/or modification of the correction rules. In one embodiment of the invention, the alerts are sent before the information is modified; in another embodiment, the alerts are sent after the information is modified.
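The review of correction rules by the data quality operations team can be sketched as follows. The sketch models the embodiment in which alerts are sent before the information is modified, and it assumes (this is not stated in the specification) that a rule without a response is held back. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    MODIFY = "modify"


@dataclass(frozen=True)
class Rule:
    rule_id: str
    field_name: str
    wrong_value: str
    right_value: str


@dataclass
class Response:
    rule_id: str
    decision: Decision
    replacement: Optional[Rule] = None  # used only with Decision.MODIFY


def apply_responses(proposed: List[Rule], responses: List[Response]) -> List[Rule]:
    """Return the rules that survive review by the operations team."""
    by_id = {r.rule_id: r for r in responses}
    approved = []
    for rule in proposed:
        response = by_id.get(rule.rule_id)
        if response is None or response.decision is Decision.REJECT:
            continue  # assumed policy: unreviewed rules are not applied
        if response.decision is Decision.ACCEPT:
            approved.append(rule)
        elif response.replacement is not None:
            approved.append(response.replacement)  # modified rule
    return approved


proposed = [
    Rule("r1", "country", "Untied States", "United States"),
    Rule("r2", "status", "shpd", "shipped"),
    Rule("r3", "region", "N/A", "Unknown"),
]
responses = [
    Response("r1", Decision.ACCEPT),
    Response("r2", Decision.MODIFY, Rule("r2", "status", "shpd", "Shipped")),
    Response("r3", Decision.REJECT),
]
approved = apply_responses(proposed, responses)
```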
Moreover, the method, according to one embodiment of the invention, receives additional feedback from the user and/or an additional user. The correction rules are updated based on the additional feedback to produce updated correction rules. The updating of the correction rules adds and/or removes rules from the correction rules. Further, the modified information is updated using the updated correction rules to produce updated modified information. The method also stores the information in a data warehouse and updates the data warehouse by replacing the information with the modified information.
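Updating the correction rules from additional feedback might look like the sketch below. It assumes a deliberately simple rule representation, a mapping from a (field, wrong value) pair to the corrected value; the function names and sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

RuleKey = Tuple[str, str]       # (field name, wrong value)
RuleSet = Dict[RuleKey, str]    # maps to the corrected value
Record = Dict[str, str]


def update_rules(
    rules: RuleSet, additions: RuleSet, removals: Iterable[RuleKey]
) -> RuleSet:
    """Return a new rule set with rules removed and added per the feedback."""
    updated = dict(rules)
    for key in removals:
        updated.pop(key, None)
    updated.update(additions)
    return updated


def apply_rules(records: List[Record], rules: RuleSet) -> List[Record]:
    """Replace every field value that a rule marks as wrong."""
    return [
        {name: rules.get((name, value), value) for name, value in record.items()}
        for record in records
    ]


rules = {
    ("country", "Untied States"): "United States",
    ("region", "N/A"): "Unknown",
}
updated_rules = update_rules(
    rules,
    additions={("status", "shpd"): "shipped"},
    removals=[("region", "N/A")],
)

# The warehouse contents are replaced by the updated modified information.
modified_information = [{"country": "United States", "status": "shpd"}]
warehouse = apply_rules(modified_information, updated_rules)
```

Returning a new rule set rather than mutating the old one keeps the earlier rules available, which can help if the data quality operations team later needs to compare versions.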
The present invention is described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.
One embodiment of the invention combines services oriented architecture (SOA), subject matter expertise and rules driven technology to deliver an optimized approach in maintaining and building trusted information for business intelligence. This framework enables the creation and delivery of quality information warehouses at lower costs and at faster rates than is currently possible. As discussed below, the cleansing process reduces the volume of information contained in the information warehouse and only processes relevant transactional data. By combining this framework with end-user expertise and translating rules into embedded web services, this system streamlines and optimizes information repository builds. This illustrative embodiment places a strong focus on web interaction, analysis of requested information, and the ability of end-users to influence what they know to be valid. In at least one embodiment, provided inputs are translated into dynamic rules in the form of “feedback” instructions that drive the data refresh, build, and cleansing processes. This enables an enterprise to build information warehouses selectively instead of having to process every single transaction. This selective process capability ultimately reduces the cost and the time needed to build and maintain information warehouses.
In at least one embodiment of the invention, published and subscribed web services are used to implement alerts and process rules that are solicited directly from the user community as opposed to standard IT processes of requirements gathering and internal development work. This illustrative embodiment supports “Information on Demand” from three perspectives: 1) providing an automated tool, 2) providing a process methodology and 3) leveraging subject matter expertise through an implemented active feedback loop.
End users submit requests for business intelligence or information from web connected applications using pervasive and non pervasive computing devices. These requests are processed through enterprise mash-up applications or other web based user interfaces (UI) that are enabled with logic to receive information requests, to dispatch XML based web services that monitor information requests, and to collect parameter driven rules that influence how the information is constructed and refreshed on a scheduled or real time basis.
In at least one embodiment of the invention, the ability to issue alerts when changes to data content are requested is included. This information in at least one embodiment is transmitted to other systems and to data quality operations personnel that can react to the requested changes. By enriching the information warehouse build with external rules that are driven by subject matter experts and end-users, the process is optimized from a cost, speed and volume perspective. Enterprises no longer need to process every possible available transaction to deliver trusted information sources. Different embodiments of the invention provide capabilities to analyze information requested along with user driven correction rules to reconstruct how the information sources get built and updated.
Different embodiments of this invention include at least one of the following features: connections to Information Sources, instructions for Information Retrieval, dynamically constructed user driven specifications for information source builds, Publish/Subscribe web services to control dynamic build rules, Publish/Subscribe web services for triggering alerts, ability to collect feedback from user communities to drive and optimize the information build process, and ability to improve data quality by associating rules dynamically from subject matter experts.
In at least one embodiment, as illustrated in
In another embodiment of the invention, as illustrated in
First, requests for business intelligence metrics and information analytics are issued through internet connected pervasive and non pervasive computing devices 210, for example, from user requests or software calls. This activity can occur for any information domain where electronically stored information is preprocessed, cleansed, transformed and subsequently loaded into databases 230 known as data marts, data cubes or information warehouses. Requests are sent to web enabled applications as noted below. Results are then returned to the requesting interfaces.
Once information requests are received, the requests are parsed, analyzed and subsequently converted by information search application 220 into retrieval instructions for needed information and data stored in databases 230. In addition to requesting preprocessed information, in at least one embodiment of the invention, a feedback alert and process engine 270 (described in more detail below) monitors the type of transactions that the requests are focusing on. This is done to help determine which types of information are being queried versus which types are not. This information will be used in subsequent information warehouse builds to help reduce the amount of data processed and/or prioritize the data. In addition to monitoring and recording the types of information requests being made, the end-users in at least one embodiment are also prompted to indicate anomalies in the information they are viewing. This information is routed to a storage area using, for example, XML based web services.
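The monitoring of request types can be illustrated with a simple counter. This is only a sketch: the type names are hypothetical, and the specification does not say how the feedback alert and process engine 270 records requests internally.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def summarize_requests(
    requested_types: Iterable[str], known_types: Set[str]
) -> Tuple[Dict[str, int], Set[str]]:
    """Count requests per known type and list the types never requested."""
    counts = Counter(t for t in requested_types if t in known_types)
    not_queried = known_types - set(counts)
    return dict(counts), not_queried


known_types = {"order", "inventory", "customer"}
request_log = ["order", "customer", "order"]
queried, not_queried = summarize_requests(request_log, known_types)
```

The set of types that were never requested is the candidate list for exclusion from a subsequent information warehouse build.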
Furthermore, information source containers 230 are used to house information accessed during requests for business intelligence and other information analytics. A variety of formats can be used, such as relational, flat, and cube. The containers 230 are created from collecting raw transactional data from systems such as order entry, inventory, and customer information capture systems. Web service rules 240 (also referred to herein as “correction rules”) are created to enrich the information in containers 230. Specifically, the web service rules 240 are created by analyzing data that is being requested and feedback received from end-users. The system looks for repeated patterns of usage and, based on the requests being made, maintains a statistical model within the metadata container to optimize builds based on data usage.
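The specification does not define the statistical model. As one illustrative possibility only, the model could be the share of requests per information type, with a hypothetical threshold deciding which types a build processes and in what order.

```python
from collections import Counter
from typing import Dict, Iterable, List


def usage_model(request_log: Iterable[str]) -> Dict[str, float]:
    """Return each information type's share of all recorded requests."""
    counts = Counter(request_log)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {info_type: n / total for info_type, n in counts.items()}


def build_order(model: Dict[str, float], min_share: float = 0.0) -> List[str]:
    """List types to build, most requested first, skipping rarely used ones."""
    kept = [t for t, share in model.items() if share >= min_share]
    return sorted(kept, key=lambda t: (-model[t], t))


request_log = ["order"] * 6 + ["customer"] * 3 + ["inventory"]
model = usage_model(request_log)
order = build_order(model, min_share=0.2)
```

With the sample log, inventory accounts for one request in ten and falls below the assumed 0.2 threshold, so the build would process order data first and customer data second.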
As described below, these rules are stored in the “FEEDBACK METADATA” container 280. Processor 250 performs the extract, transform and load processes required to harvest raw transactional data and turn it into usable information, which is subsequently used to drive business decisions and influence business processes. Rules stored in the “FEEDBACK METADATA” container 280 are used to build publish and subscribe rules to drive the information build process performed by processor 250.
The data containers 230 store inbound raw data transactions 260 that can be of any type or domain. As described below, these transactions are used as input. The feedback alert and process engine 270 performs a server process that takes in process rules and information request behavior patterns wrapped in, for example, XML messages, Really Simple Syndication (RSS), JavaScript Object Notation (JSON), Simple Object Access Protocol (SOAP), Atom, or any user defined messaging format, as web services.
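As an illustration of process rules and request patterns wrapped in an XML message, the sketch below parses a hypothetical message layout with Python's standard `xml.etree.ElementTree` module. The element and attribute names are assumptions; the specification does not define a schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple, Union

# Hypothetical message layout: request patterns plus one process rule.
MESSAGE = """<feedback user="analyst-1">
  <request type="order"/>
  <request type="order"/>
  <request type="customer"/>
  <rule field="country" wrong="Untied States" right="United States"/>
</feedback>"""

Parsed = Dict[str, Union[str, List[str], List[Tuple[str, str, str]]]]


def parse_feedback(xml_text: str) -> Parsed:
    """Extract the user, requested types, and rules from one message."""
    root = ET.fromstring(xml_text)
    return {
        "user": root.get("user", ""),
        "requested_types": [r.get("type", "") for r in root.findall("request")],
        "rules": [
            (r.get("field", ""), r.get("wrong", ""), r.get("right", ""))
            for r in root.findall("rule")
        ],
    }


parsed = parse_feedback(MESSAGE)
```

A production service receiving messages from outside the enterprise would normally also validate the message and guard against malicious XML, which this sketch omits.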
This process also publishes processing rules that are subscribed to by an “Extract, Transform, Load and Dynamic Rules Processing Engine” (not shown). This information is also stored in the “FEEDBACK METADATA” container 280 described below. As also described below, XML contained web service alerts 215 are triggered from the feedback alert and process engine 270. These service alerts are used to indicate issues with the information being viewed. These alerts would be used to drive data quality monitoring dashboards whose recipients are either people or systems.
A data repository, or “FEEDBACK METADATA” container, 280 is used to retain information from the feedback alert and process engine 270. Further, publish and subscribe web service implementations are performed by an XML Service Pub/Sub 290 to drive optimized rules for refreshing the information sources.
A Pub/Sub component 205 indicates that a publish and subscribe web service process has been implemented to drive the dynamic rules that are used to influence which data gets transformed. This also includes any subject matter expert rules that are entered through the user interfaces 210. Moreover, XML based web services 215 that contain alert messages that are emitted from the feedback alert and process engine 270 are identified.
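The publish and subscribe pattern referred to above can be sketched with an in-memory broker. This is a stand-in for illustration only; the embodiments describe web services, and the topic names and message shapes here are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class Broker:
    """Minimal in-memory publish/subscribe broker."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to receive every message on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver a message to all subscribers; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
etl_rules: List[Any] = []   # stands in for the rules processing engine
dashboard: List[Any] = []   # stands in for a data quality dashboard

broker.subscribe("rules", etl_rules.append)
broker.subscribe("alerts", dashboard.append)

delivered = broker.publish("rules", {"field": "country", "right": "United States"})
broker.publish("alerts", "country value corrected in order-17")
unheard = broker.publish("unused-topic", "no subscribers")
```

Separating the "rules" and "alerts" topics reflects the two uses described: one drives which data gets transformed, the other notifies people or systems of issues.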
Through internet connected pervasive and non pervasive computing devices 225, a data quality operations team interacts with and monitors feedback rules that are being driven by the end-user community. The feedback loop concept 235 reduces transaction volumes required to keep information sources up to date per data warehousing processes. This includes information transform rules that are established via end-user input and brokered by web services.
The data warehousing software in at least one embodiment is shared, simultaneously serving multiple customers in a flexible, automated fashion. It is standardized, requiring little customization, and it is scalable, providing capacity on demand such as in a pay-as-you-go model.
Correction rules are created based on the feedback (340). As discussed above, the feedback is translated into dynamic rules that drive the data refresh, build, and cleansing processes. This enables an enterprise to build information warehouses selectively instead of having to process every single transaction. The information is modified using the correction rules to produce modified information (350), wherein the modification reduces the volume of the information. This selective process capability ultimately reduces the cost and the time needed to build and maintain information warehouses. In at least one embodiment, the modified information is displayed to the user (360).
Correction rules are created based on the feedback (440). As discussed above, the feedback is translated into dynamic rules that drive the data refresh, build, and cleansing processes. This enables an enterprise to build information warehouses selectively instead of having to process every single transaction. The process monitors information request behavior patterns to identify selected types of information by the user and non-selected types of information by the user (450). This is done to help determine which types of information are being queried versus which types are not. This information will be used in subsequent information warehouse builds to help reduce the amount of data processed. Specifically, the non-selected types of information are removed when modifying and/or updating the information.
The information is modified using the correction rules to produce modified information (450). The modification reduces a volume of the information. This selective process capability ultimately reduces the cost and the time needed to build and maintain information warehouses. After modifying the information, alerts are sent to a data quality operations team (452). The alerts include the correction rules and modifications to the information. As described above, published and subscribed web services are used to implement alerts and process rules that are solicited directly from the user community as opposed to standard IT processes of requirements gathering and internal development work. The data quality operations team reviews the correction rules and the modifications to the information.
The modifying of information only processes relevant transactional data to build a data warehouse (454). Moreover, the modification of information only processes relevant data for analysis (456). As discussed above, the feedback loop reduces transaction volumes required to keep information sources up-to-date per data warehousing processes. This includes information transform rules that are established via end-user input and brokered by web services.
The modified information is displayed to the user (460). The process further includes receiving a request for the information from an additional user and displaying the modified information to the additional user (470). Additionally, the process receives additional feedback from the user and/or an additional user and updates the correction rules based on the additional feedback to produce updated correction rules (480). Furthermore, the modified information is updated using the updated correction rules to produce updated modified information. As discussed above, rules stored in the “FEEDBACK METADATA” container are used to build publish and subscribe rules to drive the information build process.
The updating of the correction rules adds and/or removes rules from the correction rules (482). As discussed above, by enriching the information warehouse build with external rules that are driven by subject matter experts and end-users, the process is optimized from a cost, speed and volume perspective. Enterprises no longer need to process every possible available transaction to deliver trusted information sources.
At least one embodiment of the invention takes the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, at least one embodiment of the invention takes the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium is any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
A representative hardware environment for practicing at least one embodiment of the invention is depicted in
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.