The present invention generally relates to data processing and management processes and, more particularly, to an adaptive data cleaning process and system.
The quality of a large real-world data set depends on a number of issues, but the source of the data is the crucial factor. Data entry and acquisition are inherently prone to errors, both simple and complex. Much effort is often devoted to reducing entry errors in this front-end process, but the fact remains that errors in a large data set are common. The field error rate for a large data set is typically around 5% or more, and up to half of the time needed for a data analysis is typically spent cleaning the data. Generally, data cleaning is applied to large data sets. Data cleaning is the process of scrubbing data to improve the accuracy of a large data set. Ideally, data cleaning should be able to eliminate obvious transcription errors; to correct erroneous entries, such as erroneous part numbers or invalid codes; to update missing data, such as pricing or lead times; and to recognize that there may exist multiple sources and definitions of data. Effective data cleaning should incorporate electronic notes to explain the rationale for rule-based or manual selections, should provide an audit trail, and should be easy to operate.
Data cleaning is often done using a manual process, which is laborious, time consuming, and prone to errors. Consequently, methods that enable automated detection of errors in large data sets or that assist in detecting errors are of great interest. The process of automated data cleaning is typically multifaceted and a number of problems must be addressed to solve any particular data cleaning problem. Generally, possible error types need to be defined and determined, a search for errors needs to be conducted and the errors need to be identified, and the uncovered errors need to be corrected.
For example, current supply chain software solution vendors, such as i2 Technologies, IBM, Manugistics, MCA Solutions, Systems Exchange, or Xelus, have well-developed and well-thought-out internal data structures. These structures must be mapped to a customer's source system and must be updated on a periodic basis. The mapping is “hardwired” during implementation, requiring recoding when sources or business rules change. Furthermore, the development of an intermediate database that stores customer data prior to loading into the supply chain software is often needed. Also, current supply chain software solutions do not support archiving results, archiving the inputs that lead to the results, or versioning data over time. This prevents a customer from auditing the decision process that leads, for example, to the stocking recommendations for a piece of heavy equipment, such as aircraft, trucks, ships, or machinery. With service part stock levels for repairable items, such as heavy equipment having a long life, running into the tens to hundreds of millions of dollars, auditability is an important requirement for many customers.
Extract, Transform, and Load (ETL) tools are typically used to bridge the gap between source systems and an intermediate database. ETL tools are used to convert data from one operating system and brand of database software to another. ETL tools apply limited business rules to transform and filter data. ETL tools are not designed to handle multiple sources of the same data. Furthermore, when business rules are applied to multiple sources of data, they are applied during the data collection process, which precludes later visibility of changes to more than one source of data. ETL tools also do not support versioning of data, which includes tracking changes in data over time.
In 2000, Ventana Systems, Inc., Harvard, Mass., U.S.A., developed a data cleaning solution for The Boeing Company, Long Beach, Calif., U.S.A., for the supply chain software solution for the C-17 airlift program. This prior art cleaning solution is written in Oracle and C++, with an Excel-like user interface. The data cleaning solution advances the prior art by allowing users to change data in a database and color-coding the data that was changed, by developing a way to allow changes to data to persist over time using simple decision tree logic, and by allowing users to select the data elements that they wish to clean. Still, this prior art data cleaning solution suffers from several limitations. For example, the solution uses global variables that can be changed by any routine rather than using data encapsulation, it uses a complex internal data structure that makes it difficult to maintain, and the loading of the data by the application must adhere to a strict procedure or the data may become corrupted.
As can be seen, there is a need for a method for data cleaning that is automated and enables selection of data from multiple sources. Furthermore, there is a need for a data cleaning process that allows support for archiving results, archiving the inputs that lead to the results, or versioning data over time. Still further, there is a need for a data cleaning process that can be easily implemented into existing data management systems.
There has, therefore, arisen a need to provide a process for data cleaning that offers standardized procedures, that complements corporate common data warehouse projects, and that selects data from multiple sources. There has further arisen a need to provide a process for data cleaning that recognizes that different customers may need to see different sources of ostensibly the same data element, and that there may exist multiple versions of what should theoretically be the same data. There has still further arisen a need to provide a process for adaptive data cleaning that enables archiving both the data used for an analysis and the results of the analysis.
In one aspect of the present invention, a data cleaning process comprises the steps of: validating data loaded from at least two source systems using data formatting utilities and data cleaning utilities; appending the validated data to a normalized data cleaning repository; selecting the priority of the source systems; creating a clean database; creating and maintaining a cross-reference between the unique data identifiers; loading consistent, normalized, and cleansed data from the clean database into a format required by data systems and software tools using the data; creating standardized data cleaning and management reports using the consistent, normalized, and cleansed data; and updating the consistent, normalized, and cleansed data by a user without updating the source systems. The clean database contains unique data identifiers for each data element from the at least two source systems.
In another aspect of the present invention, a data cleaning process for a supply chain comprises the steps of: loading data from multiple source systems to a master table of data elements and sources; selecting precedence of the source systems; reviewing high driver and error reports; cleaning logistics data contained in the master table of data elements and sources; approving consistent, normalized, and cleansed data of the master table of data elements and sources and providing the cleansed data to data systems and software tools using the data; initiating inventory optimization of stock level and reorder points using a strategic inventory optimization model using the cleansed data; providing spares analysis including stock level and reorder point recommendations; archiving supporting data for customer audit trail; creating reports; and purchasing spares to cover shortfalls according to the reports.
In a further aspect of the present invention, a data cleaning system includes data formatting utilities, data cleaning utilities, a normalized data cleaning repository, source prioritization utilities, a clean database, cross-reference utilities, and a data cleaning user interface. The data formatting utilities are used to validate data downloaded from at least two source systems. The data cleaning utilities are used to clean the data. The source prioritization utilities are used to select the priority of the at least two source systems. The normalized data cleaning repository receives the formatted and cleansed data. The clean database combines the cleansed and prioritized data. The clean database is a single source of item data containing the best value and unique data identifiers for each data element. The cross-reference utilities are used to create and maintain a cross-reference between the unique data identifiers. The data cleaning user interface enables a user to update the clean database.
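By way of illustration only, the following Python sketch shows one way the validate, append, prioritize, and combine steps of such a system might fit together. All names and data structures are hypothetical assumptions for this sketch, not the claimed implementation.

```python
# Minimal sketch of the data cleaning flow, assuming simple in-memory
# structures. Names are hypothetical, not the patented implementation.

def clean(sources: dict, priority: list) -> dict:
    """sources maps source name -> {item id -> {data element -> value}}."""
    # Validate and append to a normalized repository of
    # (source, identifier, element, value) rows.
    repository = []
    for name, items in sources.items():
        for item_id, elements in items.items():
            for element, value in elements.items():
                repository.append((name, item_id.strip().upper(), element, value))

    # Build the clean database: for each data element, the highest-priority
    # source that supplies a value wins (lowest priority is written first).
    clean_db = {}
    for name in reversed(priority):
        for src, item_id, element, value in repository:
            if src == name:
                clean_db.setdefault(item_id, {})[element] = value
    return clean_db

if __name__ == "__main__":
    sources = {
        "warehouse": {"1234-A ": {"unit_price": "100.00"}},
        "external":  {"1234-A":  {"unit_price": "95.00", "lead_time": "30"}},
    }
    print(clean(sources, priority=["warehouse", "external"]))
    # {'1234-A': {'unit_price': '100.00', 'lead_time': '30'}}
```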
These and other features, aspects and advantages of the present invention will become better understood with reference to the following drawings, description and claims.
The following detailed description is of the best currently contemplated modes of carrying out the invention. The description is not to be taken in a limiting sense, but is made merely for the purpose of illustrating the general principles of the invention, since the scope of the invention is best defined by the appended claims.
Broadly, the present invention provides an adaptive data cleaning process and system that standardizes the process of collecting and analyzing data from disparate sources for optimization models. The present invention further generally provides a data cleaning process that provides complete auditability of the inputs and outputs of optimization models or other tools or models that are run periodically using a dynamic data set, which changes over time. The adaptive data cleaning process and system as in one embodiment of the present invention enables consistent analysis, eliminates one-time database coding, and reduces the time required to adjust to changing data sources, and may be used, for example, for inventory optimization models or during the development of supply chain proposals. One embodiment of the present invention provides a data cleaning process that is suitable for, but not limited to, applications in the aircraft industry, both military and commercial, for example for supply chain management. One embodiment of the present invention provides a data cleaning process that is further suitable for, but not limited to, applications in industries that utilize heavy equipment having a long life. The data cleaning process as in one embodiment of the present invention may be used where a large database needs to be managed, where the database receives data from multiple sources, for example, in large corporations that need to combine data from several sub-organizations, and where the data to be managed relate to high-value goods, such as heavy equipment in transportation industries. The data cleaning process as in one embodiment of the present invention may further be used, for example, for inventory management, order management, consumer data management, or in connection with industrial maintenance.
In one embodiment, the present invention provides a data cleaning process that selects data from multiple sources and uses heuristics based on precedence to select the best source from the multiple sources and to select the best value for forecasting. Existing ETL (Extract, Transform, and Load) tools are not designed to handle multiple sources of the same data. Current ETL tools may load data from multiple sources but require a software developer or user to create custom logic to select one source over another. Furthermore, sources may not be added or deleted after initial implementation of a typical ETL tool without manual intervention by a software developer or user. Contrary to the prior art, the data cleaning process, as in one embodiment of the present invention, allows an unlimited number of data elements and sources to be added or dropped at any time. Contrary to prior art data cleaning processes, the data cleaning process as in one embodiment of the present invention may recognize that different users, such as customers, may need to see different sources of ostensibly the same data element, such as a unit price, which may have an internal value for buying a part and an external value for selling the part. In this example, both values of the price are valid, and which one is used depends upon the application. The data cleaning process as in one embodiment of the present invention may have the ability to display multiple values for selected data elements from different sources. The user may override the original selection with information that may be more accurate than the information in the source system. Unlike traditional databases, where only one value for each data element is visible, the data cleaning process as in one embodiment of the present invention may provide versioning to previous values and traceability to all versions of each data element available from different source systems.
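A minimal Python sketch of such a precedence heuristic follows; the function name, the source names, and the treatment of a user override are assumptions for illustration, not the claimed method.

```python
# Sketch of precedence-based source selection (hypothetical names).
# Every candidate value is retained for display; only the "best" is selected.

def select_best(candidates: dict, precedence: list, override=None):
    """candidates maps source name -> value for one data element."""
    if override is not None:
        return override, "user override"    # a manual correction wins
    for source in precedence:               # first listed source present wins
        if source in candidates:
            return candidates[source], source
    raise KeyError("no source supplied this data element")

# An internal buying price and an external selling price may both be valid;
# the precedence list chooses per application, while `prices` stays visible.
prices = {"internal_erp": 412.50, "external_catalog": 459.00}
print(select_best(prices, ["internal_erp", "external_catalog"]))
# (412.5, 'internal_erp')
print(select_best(prices, ["external_catalog", "internal_erp"]))
# (459.0, 'external_catalog')
```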
In one embodiment, the present invention provides a data cleaning process that has the ability to capture and identify all changes being made to data elements in the data repository area, and to redisplay the changes back to the user. Information about changes to a data element, regardless of whether the changes are screen changes or mass updates, may be captured by tracking the user making the change, the date of the change, and comments explaining why the change was made. This is an advantage over prior art data cleaning processes, which generally allow only flagging the suspected data and which generally require the change to be made to the system of record. In many cases, the system of record is a customer database, or a departmental database, for which the data cleaner does not have update authority. As a result, prior art data cleaning solutions that force the user to update the system of record are often impractical. Contrary to the prior art, the data cleaning process as in one embodiment of the present invention provides dated versioning of both the inputs and outputs of computer models, tracking changes to data over time. Existing ETL tools do not support versioning data over time. The data cleaning process, as in one embodiment of the present invention, allows auditability of both results and the data and data sources upon which the results were based. The data cleaning process, as in one embodiment of the present invention, further ensures data integrity by screening the data against user-definable business rules. Furthermore, the data cleaning process, as in one embodiment of the present invention, allows user additions and deletions, for example, to part numbers from source systems, maintaining traceability to what was added and flagging deleted data for traceability, rather than physically deleting the data. Consequently, data is electronically tagged as deleted, but not physically removed from the data repository. Still further, the data cleaning process, as in one embodiment of the present invention, adds automated notes, and allows for manual notes, which may be attached to each data element and provide information on automated processing, format conversions, and other data quality information. This provides auditability when data must be converted for an analysis, for example, when normalizing currency from British pounds to United States dollars.
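The following Python sketch illustrates one possible shape for such an audit trail: an append-only version history in which deletions are flagged rather than removed. The classes and field names are illustrative assumptions, not the patented schema.

```python
# Sketch of an append-only version history for one data element.
# Field names are illustrative assumptions, not the patented schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Version:
    value: object
    source: str            # source system, or the user id for manual edits
    changed_by: str
    changed_at: datetime
    comment: str = ""
    deleted: bool = False   # flagged as deleted, never physically removed

@dataclass
class DataElement:
    history: list = field(default_factory=list)

    def update(self, value, source, user, comment=""):
        self.history.append(Version(value, source, user, datetime.now(), comment))

    def delete(self, user, comment=""):
        last = self.history[-1]
        self.history.append(Version(last.value, last.source, user,
                                    datetime.now(), comment, deleted=True))

    def current(self):
        return self.history[-1]     # the full trail stays in self.history

price = DataElement()
price.update(412.50, "internal_erp", "loader", "initial load")
price.update(398.00, "jsmith", "jsmith", "new quote from supplier")
print([(v.value, v.changed_by, v.comment) for v in price.history])
```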
In one embodiment, the present invention provides a data cleaning process that may be used, for example, in connection with supply chain software tools and that may allow archiving and sharing the results of such supply chain software tools. Currently existing data repositories store only the current input data required to perform an analysis. The data cleaning process, as in one embodiment of the present invention, will allow archiving both the data used at the time the analysis was performed and the results of the analysis. This provides complete auditability of the source of data and the model results based upon that data. This is important, for example, for government supply chain contracts and commercial contracts, where auditability of the rationale behind the purchase of costly maintenance spares is required. There are no known supply chain tools that support archiving of data and results. In addition, the data cleaning process, as in one embodiment of the present invention, allows thresholds and triggers to be established at the data element level, providing alerts that notify, for example, asset managers and data owners that specific data elements are suspect and should be reviewed. These thresholds are particularly important when large amounts of data are being updated, as it may be physically impossible as well as error-prone to scan each and every data element for errors. Furthermore, the data cleaning process, as in one embodiment of the present invention, provides defaults to fill in critical missing data, while flagging the missing data for manual review. This makes it more likely that all parts will be included in an analysis, compared with traditional solutions of deleting an entire item if any data element for that item is missing or invalid. The data cleaning process, as in one embodiment of the present invention, provides traceability to all data elements for which defaults have been used.
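A minimal Python sketch of element-level screening follows, combining a threshold check with default filling; the specific limits and default values are made-up examples, since the patent leaves such rules user-definable.

```python
# Sketch of element-level thresholds and defaults. The limits and the
# default value are made-up examples; such rules are user-definable.
def screen(record, thresholds, defaults):
    flags = []
    for element, (low, high) in thresholds.items():
        value = record.get(element)
        if value is not None and not (low <= value <= high):
            flags.append(f"{element}={value} outside [{low}, {high}]: review")
    for element, default in defaults.items():
        if record.get(element) is None:
            record[element] = default   # keep the item in the analysis
            flags.append(f"{element} missing: default {default} used")
    return record, flags

record, flags = screen({"unit_price": 1500000.0, "lead_time": None},
                       thresholds={"unit_price": (0.01, 1000000.0)},
                       defaults={"lead_time": 90})
print(flags)
# ['unit_price=1500000.0 outside [0.01, 1000000.0]: review',
#  'lead_time missing: default 90 used']
```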
Referring now to the drawings, a data cleaning system 20 as in one embodiment of the present invention may include an ETL tool 21, data formatting utilities 22, data cleaning utilities 23, a normalized data cleaning repository 24, source prioritization utilities 26, cross-reference utilities 27, standardized data cleaning and management reports 28, and a data cleaning user interface 29, and may provide cleansed data to data systems and software tools 16.
The ETL tool 21 may extract data from the data warehouse 14 or from external data sources 15, may transform the extracted data to a common format for data cleaning, and may load the transformed data into the data cleaning system 20. This operation may also be performed using custom database queries. The data warehouse 14 and the external data sources 15 may be source systems or sources for source data. The data formatting utilities 22 may be used to adjust unique data identifiers to a common format as part of the data validation.
The data formatting utilities 22 may account for data entry issues in which slight variations in a unique data identifier, such as inclusion of a dash or blank spaces, may cause identifiers to be interpreted as different items when they should not be.
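As an illustration, a normalization routine along these lines might strip dashes and spaces and fold case before identifiers are compared; the exact rules below are assumptions, not the patented formatting logic.

```python
# Sketch of unique-identifier normalization: "12-345 a" and "12345A"
# should resolve to the same item. The rules shown are assumptions.
import re

def normalize_identifier(raw: str) -> str:
    # Drop dashes and whitespace, then uppercase, so formatting
    # variants of one part number compare equal.
    return re.sub(r"[-\s]", "", raw).upper()

assert normalize_identifier("12-345 a") == normalize_identifier("12345A")
```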
The data cleaning utilities 23 may be used to clean data from the source systems, such as the data warehouse 14 and the external data sources 15, as part of the data validation, and to ensure the validity of the data loaded from each source system (the data warehouse 14 or the external data sources 15) into the data cleaning format.
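One way such validity checks might look is a small set of per-element rules, as in the Python sketch below; the rules themselves are illustrative, the business rules being user-definable as described above.

```python
# Sketch of rule-based validity checks on loaded records. The specific
# rules are illustrative examples of user-definable business rules.
RULES = {
    "unit_price": lambda v: isinstance(v, (int, float)) and v > 0,
    "lead_time":  lambda v: isinstance(v, int) and 0 < v < 3650,
}

def validate(record):
    errors = []
    for element, rule in RULES.items():
        if element in record and not rule(record[element]):
            errors.append(f"invalid {element}: {record[element]!r}")
    return errors

print(validate({"unit_price": -5, "lead_time": 30}))
# ['invalid unit_price: -5']
```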
The normalized data cleaning repository 24 may receive the formatted and cleansed data from different source systems. The normalized data cleaning repository 24 may load cleansed data from different source systems, such as the data warehouse 14 and the external data sources 15, into a master data table.
The source prioritization utilities 26 may be used to select the priority of data sources, such as the data warehouse 14 and the external data sources 15. Source systems, such as the data warehouse 14 and the external data sources 15, may typically be loaded and maintained by disparate organizations, leading to different values being stored for what is ostensibly the same data element 32. This is common both within large organizations with multiple departments, and across customers, suppliers, and government organizations.
The master table of data elements and sources 30 may hold the formatted and cleansed data loaded from the different source systems, together with the source priorities selected using the source prioritization utilities 26.
The cross-reference utilities 27 may be used to create and maintain a cross-reference between unique data identifiers 31. Different data sources may use different unique data identifiers 31, such as section reference, NSN (defined as either NATO (North Atlantic Treaty Organization) stock number or national stock number), or part number and manufacturer's code. Often, unique data identifiers 31 will be cross-referenced within a particular data source. This may allow a cross-reference to be developed as the clean database is created from multiple sources, such as the data warehouse 14 or the external data sources 15. It may further be possible to create a unique reference number for each item. A one-to-many, many-to-one, or many-to-many relationship in a cross-reference may occur when a unique data identifier 31 in one scheme maps to multiple unique data identifiers 31 in another scheme, and vice versa. Consequently, the prioritized data cleaning master table of data elements and sources 30 may often contain duplicate unique data identifiers 31. The cross-reference utilities 27 may provide utilities to delete unwanted duplicates and to correct discrepancies in the cross-reference. Furthermore, a unique reference number may be created to enable data systems 16, which are fed data from the data cleaning system 20, to receive a truly unique data identifier number. This may enable data systems 16 and connected applications to execute without requiring that the cross-reference be perfect. Some applications, for example, for an automobile having four tires plus a spare tire, may enable a unique item identifier to be used multiple times. Other applications, for example, a purchasing system, which requires that a particular model tire only list the preferred supplier and most recently quoted price, may require a unique item identifier to occur only once. To solve this problem, an indentured master data item list may be created and maintained. When required, the master data item list allows a unique item identifier to be used multiple times. An example is a list of parts of a military aircraft. For example, a helicopter may contain six rotor blades, three as part of the forward pylon assembly and three as part of the aft pylon assembly. A purchasing system 161 may only need to know the annual buy for rotor blades, while an inventory optimization system 163 may want to know the required demand per blade and the quantity of blades per assembly. A set of utilities may enable duplicate data in the master data item list to be merged with unique item data in the master table of data elements and sources 30.
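The Python sketch below illustrates one possible cross-reference structure in which several identifier schemes map to a generated unique reference number; the data layout and the sample identifiers are invented for illustration, not taken from the patent.

```python
# Sketch of a cross-reference in which several identifier schemes map to
# one generated unique reference number. The layout and the sample
# identifiers are invented for illustration.
from itertools import count

ref_numbers = count(1)
xref = {}   # (scheme, identifier) -> unique reference number

def register(identifiers):
    """identifiers maps scheme name (e.g. 'nsn', 'part_number') -> value."""
    for key in identifiers.items():
        if key in xref:             # reuse the number if any identifier is known
            ref = xref[key]
            break
    else:
        ref = next(ref_numbers)     # otherwise mint a new reference number
    for key in identifiers.items():
        xref[key] = ref             # many identifiers may map to one item
    return ref

a = register({"nsn": "1615-01-168-5837", "part_number": "114R3501-5"})
b = register({"part_number": "114R3501-5", "section_ref": "62-30-01"})
assert a == b                       # cross-referenced to the same item
```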
The ETL tool 21 or custom database queries may be used to load the consistent, normalized and cleansed data from the master table of data elements and sources 30 into the format required for data systems and software tools 16, such as supply chain software 161, integrated information systems 162, inventory management systems 163, contracts and pricing 164, engineering 165, and simulation 166.
Also, standardized data cleaning and management reports 28 may be created. Often, management reports in one system are similar or even identical to management reports in another system. The data cleaning system 20 may provide some of the most common reports against the master table of data elements and sources 30. For example, a line count report may be created that may tally the number of unique item identifiers 31 in the master table of data elements and sources 30.
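As an illustration, the Python sketch below produces a line count and a simple high driver report; treating "high drivers" as the items with the largest annual dollar volume is an interpretation adopted for this sketch, not the patent's definition, and the sample data are invented.

```python
# Sketch of two standardized reports against the master table: a line
# count and a "high driver" list (here assumed to mean the items with
# the largest annual dollar volume).
items = [
    {"id": "A1", "annual_demand": 40, "unit_price": 9000.0},
    {"id": "B2", "annual_demand": 3,  "unit_price": 250.0},
    {"id": "C3", "annual_demand": 12, "unit_price": 78000.0},
]

# Line count report: number of unique item identifiers.
line_count = len({item["id"] for item in items})

# High driver report: top items by annual demand times unit price.
high_drivers = sorted(items,
                      key=lambda i: i["annual_demand"] * i["unit_price"],
                      reverse=True)[:10]

print(line_count)                        # 3
print([i["id"] for i in high_drivers])   # ['C3', 'A1', 'B2']
```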
The data cleaning user interface 29 may enable closed-loop data cleaning. Data cleaning is most often performed on the “front line” by users of the execution systems (data systems and software tools 16), such as inventory management 163. These users frequently update data in the course of obtaining new quotes, or making corrections to data while working with, for example, customers, suppliers, or repair shops. Users must have a way to update the data cleaning system 20 without updating the source systems, such as the data warehouse 14 or the external data sources 15. This may be necessary because the source system, such as the data warehouse 14 or the external data sources 15, is often under the control of another organization, or even another customer or supplier. Consequently, it may not be practical or even feasible to update the source system (14 and/or 15). The data cleaning user interface 29 may enable users of data systems and software tools 16, which make decisions based upon the cleansed data provided by the data cleaning system 20, to update the data cleaning system 20. This enables all data systems and software tools 16, for example the supply chain software 161, to maintain consistency based on updates to the cleansed data. Manual updates may be date and time stamped, may include traceability to the user making the update, and may include a comment field to capture information deemed important by the user. The data cleaning user interface 29 may be web enabled. The source prioritization utilities 26 may enable data systems and software tools 16, which rely upon information from the data cleaning system 20, to select or not select updates from a given user (or users of a particular software tool, such as the supply chain software 161) based upon specific requirements. Manual updates may persist over time during subsequent updates to the source system, such as the data warehouse 14 or the external data sources 15. If the source data stays the same, the data cleaning value may be used. If the source data changes to the same value (within a user-specified tolerance band) as the data cleaning value, the source data may be selected and the data cleaning value may be flagged as source system updated. If the source data changes, but is outside the user-specified tolerance band, the data element 32 may be flagged for manual review.
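The tolerance-band behavior described in this paragraph can be sketched directly; in the following Python fragment the function name, the 5% tolerance, and the flag strings are assumptions for illustration.

```python
# Sketch of how a manual correction might persist across source refreshes,
# following the tolerance-band logic described above. The function name,
# tolerance, and flag strings are assumptions.
def reconcile(old_source, new_source, cleaned, tolerance=0.05):
    """Return (value_to_use, flag) after a source-system refresh."""
    if new_source == old_source:
        return cleaned, None                        # source unchanged: keep manual value
    if abs(new_source - cleaned) <= tolerance * cleaned:
        return new_source, "source system updated"  # source caught up with the correction
    return cleaned, "manual review"                 # source moved elsewhere: flag it

print(reconcile(100.0, 100.0, 95.0))   # (95.0, None)
print(reconcile(100.0, 96.0, 95.0))    # (96.0, 'source system updated')
print(reconcile(100.0, 150.0, 95.0))   # (95.0, 'manual review')
```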
The data cleaning system 20 may be integrated into a computer system (not shown). The computer system may be used for executing the utilities, such as the ETL (Extract, Transform, and Load) tools 21, the data formatting utilities 22, the data cleaning utilities 23, the normalized data cleaning repository 24, the source prioritization utilities 26, the master table of data elements and sources 30, the cross-reference utilities 27, the standardized data cleaning and management reports 28, and the data cleaning user interface 29.
Referring now to the flow charts, a data cleaning process in accordance with one embodiment of the present invention may comprise the steps described in the following.
In step 52, data formatting utilities 22 of the data cleaning system 20 may be used to adjust the unique data identifiers 31 to a common format as part of the data validation.
The data cleaning utilities 23 of the data cleaning system 20 may then be used to clean the loaded data and to ensure the validity of the data loaded from each source system, such as the data warehouse 14 or the external data sources 15.
In step 54, the validated and cleansed data may be appended into the normalized data cleaning repository 24.
In step 55, the priority of data sources may be selected. Step 55 may include determining the number of unique data elements 32 supplied by each data source and selecting the priority of the data sources, such as the data warehouse 14 and the external data sources 15.
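For instance, a coverage count of this kind might be computed as in the following sketch, where the row layout and source names are invented for illustration:

```python
# Sketch of one input to the priority decision: counting how many unique
# (item, element) pairs each source supplies. Row layout is invented.
from collections import defaultdict

rows = [("warehouse", "A1", "unit_price"), ("warehouse", "A1", "lead_time"),
        ("external", "A1", "unit_price"), ("external", "B2", "unit_price")]

coverage = defaultdict(set)
for source, item, element in rows:
    coverage[source].add((item, element))

for source, cells in sorted(coverage.items(), key=lambda kv: -len(kv[1])):
    print(source, len(cells))
# warehouse 2
# external 2
```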
In step 56, a clean database may be created from multiple sources, such as the data warehouse 14 or the external data sources 15. The clean database may contain the best value and unique data identifiers 31 for each data element 32.
In step 57, a cross-reference may be created between unique data identifiers 31. Step 57 may include prioritizing cross-referenced data based upon the unique data identifier. For example, a scheme may identify the section reference as the best value for describing an item uniquely, followed by an NSN (NATO stock number or national stock number), and followed by a part number and a manufacturer's code.
In step 58, the cross-reference between the unique data identifiers 31 may be maintained by a utility. Step 58 may include reviewing inconsistencies developed when creating a database, such as the master table of data elements and sources 30, from multiple sources, deleting unwanted duplicates, and correcting discrepancies in the cross-reference.
In step 59, a unique reference number may be created for each data element 32. The unique reference number may enable data systems 16, which are fed data from the data cleaning system 20, to receive a truly unique data identifier number.
In step 61, an indentured master data item list that may contain the unique item identification number may be maintained. When required, the master data item list may allow a unique item identification number to be used multiple times. Step 61 may include merging duplicate item data in the master data item list with unique item data in the master table of data elements and sources 30.
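Returning to the rotor-blade example given earlier, an indentured list and its two views might be sketched as follows; the structure and names are assumptions, while the quantities follow the helicopter example above.

```python
# Sketch of an indentured item list for the rotor-blade example: one
# unique item identifier appears under two assemblies, serving both the
# purchasing view and the inventory-optimization view.
indentured_list = [
    # (assembly, unique item identifier, quantity per assembly)
    ("forward pylon assembly", "rotor_blade", 3),
    ("aft pylon assembly",     "rotor_blade", 3),
]

# Purchasing view: duplicates merged into a total quantity per item.
total = sum(qty for _, item, qty in indentured_list if item == "rotor_blade")
print(total)    # 6 rotor blades per helicopter

# Inventory-optimization view: quantity of the item by assembly.
for assembly, item, qty in indentured_list:
    print(assembly, item, qty)
```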
In step 62, the consistent, normalized, and cleansed data may be loaded from the master table of data elements and sources 30 into the format required by the data systems and software tools 16 using the data.
In step 63, standardized data cleaning and management reports, such as line count reports and high driver reports 40, may be created using the consistent, normalized, and cleansed data.
In step 64, the data cleaning system 20 may be updated by a user through the data cleaning user interface 29 without updating the source systems, such as the data warehouse 14 or the external data sources 15.
It should be understood, of course, that the foregoing relates to exemplary embodiments of the invention and that modifications may be made without departing from the spirit and scope of the invention as set forth in the following claims.
This application claims the benefit of U.S. Provisional Application No. 60/673,420, filed Apr. 20, 2005.