This invention relates generally to information processing. More particularly, this invention relates to parallel processing of data profiling information.
Database profiling is the process of analyzing a database to determine its structure and internal relationships. Database profiling assesses such issues as the tables used, their keys and number of rows, the columns used and the number of rows with a value, relationships between tables, and columns copied or derived from other columns. Database profiling can also include analysis of tables and columns used by different applications, how tables and columns are populated and changed, and the importance of different tables and columns. Database profiling is useful when planning and managing data conversion and data cleanup projects. In addition, database profiling can be an initial step in defining a data quality domain, which is used in data quality profiling.
In some respects, database profiling is analogous to data processing operations performed on a database. Database profiling operations are also analogous to operations performed when migrating data from a source (e.g., a database) to a target (e.g., another database, a data mart or a data warehouse), a process sometimes referred to as Extract, Transform and Load (ETL). Unlike database and ETL operations, database profiling is potentially applied to multiple varied data sources and therefore requires different processing techniques. For example, data profiling systems may store metadata related to the data attributes being processed instead of actual data.
Current data profiling systems provide rudimentary forms of data processing and characterization. These tools fail to provide efficient data processing operations. Accordingly, it would be desirable to provide improved data profiling techniques that address data processing and characterization deficiencies associated with prior art approaches.
The invention includes a computer readable medium comprising executable instructions to process data in a data profiling system. The executable instructions include executable instructions to establish a plurality of attribute profiling threads, distribute columns of a selected row of a table across the plurality of attribute profiling threads, and generate data profiling information.
The invention provides significant performance improvements. Data profiling operations commonly entail reading millions of rows from a source and then calculating the attributes of every column. The parallel processing of the invention enables the processing of columns in one row on different threads.
The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The input/output devices 104 may include a keyboard, mouse, touch screen, display, printer and the like. A network interface circuit 108 is also connected to the bus 106. The network interface circuit 108 provides connectivity to a network (not shown). Thus, the invention may operate in a networked environment, such as a client/server environment or a peer-to-peer network where multi-threading operations of the invention are distributed across a number of processors.
A memory 110 is also connected to the bus 106. The memory 110 stores executable instructions to implement operations associated with the invention. The memory 110 may also store a data source (e.g., a database) 112. The data source stores data that is processed by a multi-thread profiling module 114. The multi-thread profiling module 114 includes executable instructions to implement multi-thread profiling processing operations of the invention.
A thread refers to a string of execution. Threads allow a computer program to split itself into two or more simultaneously running tasks. Multiple threads can be executed in parallel on a set of computers or on a single computer. Multi-threading generally occurs by time slicing (e.g., a single processor switches between different threads) or by multiprocessing (e.g., where threads are executed on separate processors). Many modern operating systems directly support both time-sliced and multiprocessor threading with a process scheduler. Operating system kernels commonly allow programmers to manipulate threads via a system call interface. Programs can implement threading by using timers, signals, or other methods to interrupt their own execution and perform ad hoc time-slicing.
Any number of multi-threading techniques may be used in accordance with the invention. In one embodiment of the invention, the multi-thread profiling module 114 includes executable instructions to establish a set of attribute profiling threads. The set of attribute profiling threads is configured as time sliced attribute profiling threads on a single processor. In another embodiment of the invention, the multi-thread profiling module 114 includes executable instructions to establish a set of attribute profiling threads on multiple processors. The multiple processors may be in a single machine or may be distributed across a network. In one embodiment of the invention, the multi-thread profiling module 114 includes executable instructions to establish a number of attribute profiling threads equal to the lower of two values: a minimum degree of available processing parallelism (either on a single machine or a set of machines) and the total number of columns to be processed.
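The thread-count rule of the last embodiment can be sketched as follows. This is a minimal illustration, not the implementation of the multi-thread profiling module 114; the function name `attribute_thread_count` is hypothetical, and the use of `os.cpu_count()` as the measure of available parallelism on a single machine is an assumption.

```python
import os


def attribute_thread_count(column_count, available_parallelism=None):
    """Return the lower of the available parallelism and the column count.

    `available_parallelism` is a hypothetical parameter; when omitted, the
    local CPU count is assumed to be a reasonable stand-in for it.
    """
    if column_count < 0:
        raise ValueError("column_count must be non-negative")
    if available_parallelism is None:
        # os.cpu_count() may return None, so fall back to a single thread.
        available_parallelism = os.cpu_count() or 1
    return min(available_parallelism, column_count)
```

Under this sketch, a table with eight columns on hardware offering four-way parallelism would receive four threads, while a three-column table on sixteen-way hardware would receive only three, since additional threads would have no column to process.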
The multi-thread profiling module 114 produces profile data 116, which may be stored in a repository 118. The data and executable modules of memory 110 may be distributed across a network. The operations of the invention are significant. Where those operations are performed on a computer or within a network is not significant, nor is the precise implementation of those operations significant.
The multi-thread profiling module 114 generates profile data 116. In one embodiment, the profile data is normalized to a standard format. For example, the profile data 116 may be normalized to include a data store identification 210, a table identification 212, a column identification 214, a row identification 216, a column value 218 and attributes 220. The attributes 220 may in turn include an attribute identification 222 and attribute information 224.
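One way to picture the normalized format is as a pair of record types. The sketch below is illustrative only: the class names, field names and field types are assumptions, and only the correspondence to reference numerals 210 through 224 comes from the description.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(frozen=True)
class Attribute:
    attribute_id: str    # attribute identification 222
    attribute_info: str  # attribute information 224


@dataclass
class ProfileRecord:
    data_store_id: str   # data store identification 210
    table_id: str        # table identification 212
    column_id: str       # column identification 214
    row_id: int          # row identification 216
    column_value: Any    # column value 218
    attributes: List[Attribute] = field(default_factory=list)  # attributes 220


# Hypothetical identifiers, used purely to show how a record is populated.
record = ProfileRecord(
    data_store_id="store-1",
    table_id="orders",
    column_id="freight",
    row_id=1,
    column_value=0.5,
    attributes=[Attribute(attribute_id="a-1", attribute_info="Low Value")],
)
```

Keeping the attributes in a separate type mirrors the description, in which a row identification maps to an attribute identification that in turn maps to attribute information.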
The second machine 304 also includes two profile threads, namely, profile threads 310 and 312. Profile thread 310 is assigned to process values from the fifth and sixth columns, in this case, values V_5 and V_6. Profile thread 312 is assigned to process values from the seventh and eighth columns, namely values V_7 and V_8.
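The column assignment described above amounts to splitting the columns of a row into contiguous blocks, one block per profile thread. The following sketch assumes an eight-column row and four profile threads in total; the function name `assign_columns` and the near-equal split policy are assumptions rather than details taken from the description.

```python
def assign_columns(column_count, thread_count):
    """Split column numbers 1..column_count into contiguous, near-equal blocks.

    Returns one list of column numbers per thread. When the columns do not
    divide evenly, the earliest threads each take one extra column.
    """
    if thread_count <= 0:
        raise ValueError("thread_count must be positive")
    base, extra = divmod(column_count, thread_count)
    blocks = []
    start = 1
    for thread_index in range(thread_count):
        size = base + (1 if thread_index < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


# Eight columns over four threads: the last two blocks correspond to the
# assignments given for profile threads 310 and 312.
blocks = assign_columns(8, 4)
```

Here `blocks` is `[[1, 2], [3, 4], [5, 6], [7, 8]]`, so the third and fourth blocks match the columns handled by profile threads 310 and 312 on the second machine 304.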
The multi-thread profiling module 114 configures each profile thread to track specified profiling information for the column that it processes, such as a low value, a high value, a low value count, a high value count, average value, median value, minimum string length, maximum string length, average string length, median string length, distinct count, distinct percent, null count, null percent, zero count, zero percent, blank count, blank percent, and the like. This processing results in profile data 116. The profile data 116 may then be applied to a repository 118 using standard techniques.
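A per-column tracker of this kind can be sketched as a small accumulator that is updated once per row, with one worker per column. The sketch below covers only a handful of the listed measures; the class `ColumnProfile`, the function `profile_rows`, and the choice of Python's `ThreadPoolExecutor` are all assumptions made for illustration, and the rows are assumed to be held in memory with comparable, hashable values in each column.

```python
from concurrent.futures import ThreadPoolExecutor


class ColumnProfile:
    """Running profile for one column (a subset of the listed measures)."""

    def __init__(self):
        self.count = 0
        self.null_count = 0
        self.zero_count = 0
        self.low = None
        self.high = None
        self.distinct = set()

    def update(self, value):
        self.count += 1
        if value is None:
            self.null_count += 1
            return
        if value == 0:
            self.zero_count += 1
        if self.low is None or value < self.low:
            self.low = value
        if self.high is None or value > self.high:
            self.high = value
        self.distinct.add(value)


def profile_rows(rows, column_count, thread_count):
    """Profile every column of `rows`, with one task per column."""
    profiles = [ColumnProfile() for _ in range(column_count)]

    def work(column_index):
        # Each profile is touched by exactly one task, so no locking is needed.
        for row in rows:
            profiles[column_index].update(row[column_index])

    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # Consuming the iterator re-raises any exception from a worker.
        list(pool.map(work, range(column_count)))
    return profiles


rows = [(1, None), (0, 5), (3, 5)]
profiles = profile_rows(rows, column_count=2, thread_count=2)
```

Because each accumulator belongs to a single column, the workers never share mutable state, which is what makes the column-wise split straightforward to parallelize. Percentages such as null percent can be derived afterwards from `null_count` and `count`.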
GUI block 406 allows the mapping of a row identification value to an attribute identification. For example, row identification value “778” from GUI block 402 maps to an attribute identification of “615.0” in GUI block 406. The attribute identification value allows mapping to attribute information. For example, GUI block 408 links the attribute identification “615.0” to the attribute information of “Low Value” for the given column. The attribute information also includes the specified value of “0.000002”, which is the column value shown in GUI block 402.
Naturally, any number of configurations may be used to display profile data 116. The configuration of
The GUI 500 facilitates drilling down to source information. For example, cell 510 is at the intersection of row 512, which holds the column value “SHIPNAME”, and the “Distincts” column 514. Information on this cell is provided in block 516. By clicking on the first entry of block 516, i.e., Save-a-lot Markets, the thirty-one records associated with this entity are displayed in block 518. Thus, an embodiment of the invention allows a user to drill down to data source information.
An embodiment of the present invention relates to a computer storage product with a computer-readable medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications; they thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
This application claims the benefit of U.S. Provisional Application No. 60/720,277, entitled “Apparatus and Method for Parallel Processing of Data Profiling Information,” filed on Sep. 23, 2005, the contents of which are hereby incorporated by reference in their entirety.
Number | Date | Country
---|---|---
60720277 | Sep 2005 | US