Businesses and other organizations that consume and/or produce datasets have a substantial interest in quality assurance of those datasets. Datasets can be of substantial size, often containing many thousands, or even millions of records, such that automated analysis of those datasets is the only feasible way to examine whether those records are meeting predetermined quality assurance metrics. A quality assurance analysis of a dataset often utilizes a validity specification that defines when a record of the dataset is, and is not, considered valid based on the contents of the record. In addition, statistical analyses of the values of data fields of the records, such as counts of instances of each value, can also produce valuable quality assurance information over and above the checking of validity.
As an illustration of such a process, a record format for a dataset may contain a postal code data field that is associated with a validity rule that checks whether the postal code conforms to an appropriate format (e.g., in the U.S., either five digits or five digits plus a four digit extension). Application of such a validity rule to the records of the data set may indicate how many records of the dataset contain valid or invalid U.S. postal code values. Further quality assurance information might also be obtained from statistical analysis of the postal code values, even if those values were properly formatted. A quality assurance concern might be surfaced, for example, if a large number of the postal code values unexpectedly have the same value. This distribution of postal code values might indicate a data processing error in a process that produces or modifies the postal code values.
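The postal code rule described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the names `US_POSTAL_CODE` and `is_valid_us_postal_code`, the sample records, and the assumption that the four digit extension is separated by a hyphen are all hypothetical.

```python
import re

# Assumed format: five digits, optionally followed by a hyphen and four digits.
US_POSTAL_CODE = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_us_postal_code(value: str) -> bool:
    """Return True when the whole value matches the assumed U.S. format."""
    return US_POSTAL_CODE.fullmatch(value) is not None


# Hypothetical records used only to exercise the rule.
records = [
    {"postal_code": "02144"},
    {"postal_code": "02144-1234"},
    {"postal_code": "2144"},
    {"postal_code": "ABCDE"},
]

valid_count = sum(is_valid_us_postal_code(r["postal_code"]) for r in records)
invalid_count = len(records) - valid_count
```

Applied to every record, such a rule yields the valid and invalid totals mentioned above; it says nothing about how the valid values are distributed, which is where the statistical analysis comes in.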
An analysis that utilizes both the validation and statistical approaches is referred to herein as “profiling” a dataset. The collective results of such an analysis are referred to herein as a “data profile.”
According to some aspects, a computer-implemented method of operating a data processing system to generate a data profile based on a dataset having an associated record format defining a plurality of fields is provided, a value census for the dataset comprising a first plurality of values each having an associated field of the plurality of fields and a plurality of count values, wherein a count value indicates a number of times a respective field and value combination occurs in the at least one dataset, and a validation specification comprising a plurality of validation rules defining criteria for invalidity for one or more fields of the plurality of fields, the method comprising generating a validation census based at least in part on the dataset and the validation specification, the validation census comprising a second plurality of values each having an associated field of the plurality of fields, and a plurality of indications of invalidity, each indication of invalidity being associated with one of the second plurality of values, and generating a data profile of the at least one dataset based at least in part on the value census and the validation census, wherein generating the data profile comprises matching ones of the second plurality of values and their associated fields with ones of the first plurality of values and their associated fields, and producing a data profile for the first plurality of values of the value census at least in part by enriching ones of the first plurality of values and their associated fields with indications of invalidity of the validation census associated with matching ones of the second plurality of values and their associated fields.
According to some aspects, a computer system is provided comprising at least one processor, at least one user interface device, and at least one computer readable medium comprising processor-executable instructions that, when executed, cause the at least one processor to perform a method of generating a data profile based on a dataset having an associated record format defining a plurality of fields, a value census for the dataset comprising a first plurality of values each having an associated field of the plurality of fields and a plurality of count values, wherein a count value indicates a number of times a respective field and value combination occurs in the at least one dataset, and a validation specification comprising a plurality of validation rules defining criteria for invalidity for one or more fields of the plurality of fields, the method comprising generating, based at least in part on the dataset and the validation specification, a validation census comprising a second plurality of values each having an associated field of the plurality of fields, and a plurality of indications of invalidity, each indication of invalidity being associated with one of the second plurality of values, and generating, based at least in part on the value census and the validation census, a data profile of the at least one dataset by matching ones of the second plurality of values and their associated fields with ones of the first plurality of values and their associated fields, and producing a data profile for the first plurality of values of the value census at least in part by enriching ones of the first plurality of values and their associated fields with indications of invalidity of the validation census associated with matching ones of the second plurality of values and their associated fields.
According to some aspects, a computer system for generating a data profile based on a dataset having an associated record format defining a plurality of fields is provided, a value census for the dataset comprising a first plurality of values each having an associated field of the plurality of fields and a plurality of count values, wherein a count value indicates a number of times a respective field and value combination occurs in the at least one dataset, and a validation specification comprising a plurality of validation rules defining criteria for invalidity for one or more fields of the plurality of fields, comprising at least one processor, means for generating, based at least in part on the dataset and the validation specification, a validation census comprising a second plurality of values each having an associated field of the plurality of fields, and a plurality of indications of invalidity, each indication of invalidity being associated with one of the second plurality of values, and means for generating, based at least in part on the value census and the validation census, a data profile of the at least one dataset by matching ones of the second plurality of values and their associated fields with ones of the first plurality of values and their associated fields, and producing a data profile for the first plurality of values of the value census at least in part by enriching ones of the first plurality of values and their associated fields with indications of invalidity of the validation census associated with matching ones of the second plurality of values and their associated fields.
The foregoing apparatus and method embodiments may be implemented with any suitable combination of aspects, features, and acts described above or in further detail below. These and other aspects, embodiments, and features of the present teachings can be more fully understood from the following description in conjunction with the accompanying drawings.
Various aspects and embodiments will be described with reference to the following figures. It should be appreciated that the figures are not necessarily drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing.
As discussed above, generating a “data profile” may include any analysis on data that generates both validity (e.g., which records are valid or invalid) and statistical (e.g., counts of values, mean values, etc.) information. In a conventional data profiling process, records of a dataset are examined one at a time. For each record, a validity specification is used to check data fields of the record to determine whether any data fields contain invalid data according to the specification, and if so, which ones and for what reason. In addition, information on the various values that appear in each field is retained so that statistics on the values of each field may be produced at the end of the process. Such a process is, however, inflexible in that both validity checking and statistical processing of the dataset occur in tandem, leading to substantial processing in a single operation. Moreover, an entire dataset must be profiled as a unit in order to produce the desired invalidity and statistical information. This process can be very computationally intensive for large datasets, such that a large processing operation is necessary to produce a desired data profile of the dataset. An illustrative example of this approach is shown in
The inventors have recognized and appreciated that profiling of a dataset can be broken up into multiple stages. Firstly, a “value census” can be produced from the dataset that summarizes how frequently values occur in each data field of the dataset, thereby producing a list of field-value pairs with associated count values. Secondly, and separately, a “validation census” can be produced from the dataset that identifies invalid fields for records of the dataset. The data from each census are then combined by “enriching” the value census data with any indications of invalidity that are associated with the field-value pairs of the value census in the validation census. An advantage of this approach is that a process that generates the value census need not utilize the same software that generates the validation census or that enriches the value census using the validation census. Indeed, this technique allows the value census to be generated in any location and using any suitable software, greatly increasing the flexibility afforded to a data profiling process.
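As a rough sketch of the first stage, a value census can be thought of as a count of each field-value pair. The function name `build_value_census`, the dictionary-based record representation, and the sample data are assumptions made for illustration only.

```python
from collections import Counter


def build_value_census(records):
    """Count how often each (field, value) pair occurs across the records."""
    census = Counter()
    for record in records:
        for field, value in record.items():
            census[(field, value)] += 1
    return census


# Hypothetical dataset with two fields per record.
records = [
    {"country": "US", "postal_code": "02144"},
    {"country": "US", "postal_code": "02144"},
    {"country": "UK", "postal_code": "02421"},
]

value_census = build_value_census(records)
```

Note that nothing in this stage consults a validation specification, which is why it can be run separately and with different software from the validation stage.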
This multi-stage approach accordingly produces two sets of distinct data—the value census and the validation census—that each do not generally provide a complete data profile of a dataset. For instance, it is possible for a particular value of a data field to be sometimes valid yet also sometimes invalid, because the validity specification for that data field may depend on the values of one or more other data fields. For at least this reason, a complete understanding of the data cannot generally be gained purely from statistics on the data field values represented by the value census, because a count of a number of instances of a data field value provides no information on its validity. Nor can the validation census fully convey quality on its own because it indicates only instances of invalidity or validity, and does not comprise any statistics with regard to the data values.
As a non-limiting example, a validation specification to be applied may indicate that a postal code data field is invalid when a country data field is equal to “US” and the postal code does not contain five numeric digits (in this example, other country data field values could have different, additional validity rules for the postal code data field). A value census of a dataset may indicate that a particular postal code of “02144” appears 100 times in the dataset, but this alone does not indicate which of these 100 values might be valid and which might be invalid, because the value census does not determine which respective country values correspond to these 100 values. Indeed, to perform this kind of analysis would involve the checking of validity; something the value census is intended to avoid in the described multi-step process. Similarly, the validation census would indicate, for each of the records containing the “02144” postal code value, which records are valid and which are invalid. It is true that the validation census could be analyzed to determine how many times the “02144” postal code appears in the validation census, but again this would replicate the processing required to produce the value census and it is this type of combined processing that the described multi-step process is intended to avoid.
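The dependence of validity on another field can be illustrated with a small sketch. The U.S. rule follows the example above; the UK rule is a deliberately simplified, hypothetical stand-in (it only requires at least one letter), and the function name `postal_code_issue` is likewise an assumption.

```python
import re


def postal_code_issue(record):
    """Return a reason string if the postal code is invalid for the record's
    country, or None if no rule is violated."""
    code = record["postal_code"]
    if record["country"] == "US" and not re.fullmatch(r"[0-9]{5}", code):
        return "Invalid US Postal Code"
    # Hypothetical stand-in for a UK rule: require at least one letter.
    if record["country"] == "UK" and not re.search(r"[A-Za-z]", code):
        return "Invalid UK Postal Code"
    return None


# The same postal code value appears with two different country values.
records = [
    {"country": "US", "postal_code": "02144"},
    {"country": "UK", "postal_code": "02144"},
]

issues = [postal_code_issue(r) for r in records]
```

Here the value “02144” is valid in the first record and invalid in the second, so a count of “02144” alone cannot reveal how many of its instances are valid.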
Enriching the value census with the validation census presents a challenge, however, because the contents of the validation census do not readily have a one-to-one mapping with the contents of the value census. Since the value census indicates only the counts of data fields and their respective values, there is no way to discern from the value census which records contributed to which field-value counts. The validation census, on the other hand, produces indications of invalidity associated with particular records of the data. Yet, corresponding entries of the validation census need to be identified for a given entry in the value census before the value census entry can be enriched.
The inventors have recognized and appreciated techniques for combining a value census and a validation census by matching field-value pairs from each census. Results of applying a validation specification to a dataset may be processed to “roll up” indications of invalidity for records to produce counts of invalid field-value pairs, allowing the field-value pairs of the value census to be enriched with matching field-value pairs in the validation census. In some cases, a count of a given field-value pair in the value census may not match the corresponding count of the same field-value pair in the validation census; this occurs when only some, but not all, instances of that field and value combination are invalid. In such cases, the process that combines the census data can interpret the censuses to produce an appropriate output.
According to some embodiments, application of a validation specification to a dataset by a data processing system may produce indications of invalidity that are associated with particular records of the dataset. For instance, a data processing system may receive the dataset and validation specification as inputs and may produce data that contains records of the dataset in addition to data for each record that specifies whether any data fields of the record were found to be invalid, and for what reason(s). While such a result may conventionally be a helpful indication of validity, as discussed above, an output of this form cannot be easily used to enrich a value census. As such, according to some embodiments, production of a validation census may aggregate indications of validity that are associated with respective records of the dataset to produce indications of invalidity associated with respective field-value pairs.
According to some embodiments, a data processing system that aggregates indications of validity associated with respective records of the dataset may “roll up” those indications such that a list of indications of invalidity is produced for each field-value combination. That is, aggregation may produce, for each record in the dataset, a record in the validation census containing a field-value pair from the dataset record along with an indication of why that field-value pair was found to be invalid. A roll up operation may perform additional aggregation on such data so that each unique field-value pair appears once in the validation census with an associated list of reasons for invalidity. In some embodiments, a count value for each instance of invalidity may also be produced in the validation census for an associated field-value pair.
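One possible shape for such a roll up is sketched below. It assumes that per-record validation output is available as a list of `(field, value, reason)` tuples for each record; that representation, the function name `roll_up`, and the sample data are illustrative assumptions and not a format prescribed by the description.

```python
from collections import Counter, defaultdict


def roll_up(record_issues):
    """Aggregate per-record issues into per-(field, value) reason counts."""
    census = defaultdict(Counter)
    for issues in record_issues:
        for field, value, reason in issues:
            census[(field, value)][reason] += 1
    # Convert to plain dicts so each unique pair appears once with its reasons.
    return {pair: dict(reasons) for pair, reasons in census.items()}


# One inner list per dataset record; an empty list means no issues were found.
record_issues = [
    [("postal_code", "02421", "Invalid UK Postal Code")],
    [],
    [("cust_id", "7773XXX", "invalid for type")],
    [("postal_code", "02421", "Invalid UK Postal Code")],
]

validation_census = roll_up(record_issues)
```

After the roll up, the result no longer refers to individual records, so it can be matched against a value census keyed on the same field-value pairs.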
According to some embodiments, a value census may include an indication of type validity of each field-value pair present in a dataset. A data processing system may produce the value census by accessing a record format describing the dataset and collating instances of each value of each data field of the record format. As such, the data processing system may additionally check whether such values conform to the expected type as defined by the record format. For instance, a record format including a telephone number data field may have a numeric type. Whilst collating the number of instances of each value of this field to produce the value census, the data processing system may additionally determine, for each of these values, whether the value conforms to the expected numeric type. The value census may accordingly include an indication of whether each of the included field-value pairs is valid or invalid for type.
According to some embodiments, a value census may include an indication of nullity for each field-value pair. In some cases, a record format describing a dataset may indicate, for one or more fields of the record, whether particular values of that field are to be interpreted as null values. Whilst empty field values are commonly defined to be interpreted as null, in general any suitable field value can be defined to be null. In some embodiments, a data processing system generating a value census from a dataset may additionally check whether such values are those defined to be null as per the record format. For instance, a record format including a name field may be defined to be a string field that is null when empty. Whilst collating the number of instances of each value of this field to produce the value census, the data processing system may additionally determine, for each of these values, whether the value is empty, and therefore considered to be null. The value census may accordingly include an indication of whether each of the included field-value pairs is or is not null. An indication of nullity may be generated for a value census independently of whether an indication of type validity is also generated. That is, either or both indications may be generated for a value census.
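The type validity and nullity indications can be sketched together as annotations on each value census entry. The `RECORD_FORMAT` structure, the function name `build_annotated_value_census`, and the choice to skip the type check for null values are all assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical record format: per field, a type check and the values treated as null.
RECORD_FORMAT = {
    "phone": {"is_valid_type": lambda v: v.isascii() and v.isdigit(), "nulls": {""}},
    "name": {"is_valid_type": lambda v: True, "nulls": {""}},
}


def build_annotated_value_census(records, record_format):
    """Count field-value pairs and annotate each with nullity and type validity."""
    counts = Counter(
        (field, value) for record in records for field, value in record.items()
    )
    census = {}
    for (field, value), count in counts.items():
        spec = record_format[field]
        is_null = value in spec["nulls"]
        census[(field, value)] = {
            "count": count,
            "is_null": is_null,
            # Assumption: null values are not type-checked in this sketch.
            "valid_for_type": None if is_null else spec["is_valid_type"](value),
        }
    return census


records = [
    {"name": "Ann", "phone": "6175550100"},
    {"name": "", "phone": "617-555"},
    {"name": "Ann", "phone": "6175550100"},
]

census = build_annotated_value_census(records, RECORD_FORMAT)
```

Either annotation could be dropped without affecting the other, consistent with the two indications being generated independently.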
Following below are more detailed descriptions of various concepts related to, and embodiments of, techniques for integrating validation results in data profiling. It should be appreciated that various aspects described herein may be implemented in any of numerous ways. Examples of specific implementations are provided herein for illustrative purposes only. In addition, the various aspects described in the embodiments below may be used alone or in any combination, and are not limited to the combinations explicitly described herein.
As shown in the illustrative example of
In the example of
In process 100, each of the rules of the validation specification 104 may be checked against the records of the dataset 102 in step 110. Furthermore, the data processing system, when executing step 110, may count how many times each field contains each unique value found in dataset 102. The system may then collate these statistics to produce the data profile 120. In view of the above process, for each record when executing step 110, the data processing system evaluates all of the validation rules in validation specification 104 against the record and updates running totals of the unique instances of each value of each field in the dataset 102. As discussed above, executing all of these actions for each record of a dataset can occupy a significant amount of time and may require significant processing power to complete, especially when the dataset contains many fields and many records. Breaking up the process into the production of a value census and a validation census by a data processing system can provide greater flexibility in how and where this processing takes place, since the production of a data profile need not be a single, monolithic process but can take place in multiple locations by multiple different computing systems, and even with multiple different types of software.
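For contrast with the multi-stage approach, the single-pass behavior of step 110 might look roughly like the following sketch. The function `profile_single_pass`, the convention that a rule returns a `(field, reason)` tuple or `None`, and the example rule are hypothetical.

```python
from collections import Counter


def profile_single_pass(records, rules):
    """Validate and count in one loop over the records."""
    value_counts = Counter()
    invalid_fields = []
    for index, record in enumerate(records):
        # Evaluate every validation rule against this record.
        for rule in rules:
            result = rule(record)
            if result is not None:
                field, reason = result
                invalid_fields.append((index, field, reason))
        # Update running totals for each field-value pair.
        for field, value in record.items():
            value_counts[(field, value)] += 1
    return value_counts, invalid_fields


def name_must_not_be_empty(record):
    # Hypothetical rule: returns (field, reason) on failure, None otherwise.
    return ("name", "empty name") if record["name"] == "" else None


records = [{"name": "Ann"}, {"name": ""}, {"name": "Ann"}]
value_counts, invalid_fields = profile_single_pass(records, [name_must_not_be_empty])
```

Because rule evaluation and counting share one loop, neither can be moved to a different system or run at a different time without restructuring the whole operation.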
According to some embodiments, data profile 120 may include statistics regarding absolute or relative occurrences of unique values of one or more of the fields of dataset 102 in addition to a list of occurrences of invalid field values. Portions of such data are represented in
The value census 210 may have been produced by the same data processing system executing process 200, or may have been produced by a different data processing system. Indeed, the flexibility of the approach described herein, and of which an illustrative example is depicted by
In the example of
In the example of
To aid in illustration of the flexibility of the process shown in
In the example of
In step 221 of
In step 222 of
In step 223 of
It will be appreciated that there may be many different ways in which a data processing system may check records of a dataset against a validation specification and produce a list of validation issues associated with field-value pairs of the dataset, and accordingly that
In the example of
For example, the Cust. ID value of 7773XXX appears once in the dataset as shown by the count of that field and value pair in the value census 381. In addition, the same field-value pair appears in the validation census with the same count and an identified reason for invalidity (“invalid for type”). The resulting enriched census data 383 includes a summary of this instance of the field-value pair. As another example, the postal code value of 02421 occurs 18 times in the dataset as shown by the value census 381. This field-value pair appears in the validation census as an “Invalid UK Postal Code” with a count of 1. When enriching the value census with this validation census data, therefore, it is inferred that there are 17 valid instances of the postal code-02421 field-value pair and one invalid instance, as summarized in the enriched census data 383.
In some embodiments, step 331 may be implemented as a join operation by the data processing system executing step 330. That is, the union of records in the value census and validation census may be produced using the combination of field and value as a join key. The resulting data may include, for each field-value pair in the value census, an indication of any issues and their counts that occurred with respect to that field-value pair in addition to a count of the number of times the pair appeared in the dataset.
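A join keyed on the field-value combination can be sketched with plain dictionaries. The function name `join_censuses` and the dictionary layouts are assumptions; the “Cust. ID” and postal code “02421” entries mirror the example discussed above, while the “02144” entry and its count are invented purely to show a pair with no matching validation entry.

```python
def join_censuses(value_census, validation_census):
    """Attach validation issues to each value census entry, using
    (field, value) as the join key. Pairs without issues get an empty dict."""
    return {
        pair: {"count": count, "issues": validation_census.get(pair, {})}
        for pair, count in value_census.items()
    }


value_census = {
    ("cust_id", "7773XXX"): 1,
    ("postal_code", "02421"): 18,
    ("postal_code", "02144"): 5,  # hypothetical entry with no issues
}
validation_census = {
    ("cust_id", "7773XXX"): {"invalid for type": 1},
    ("postal_code", "02421"): {"Invalid UK Postal Code": 1},
}

joined = join_censuses(value_census, validation_census)
```

Every pair from the value census is retained in the result, so pairs that never failed validation still appear with their counts.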
Irrespective of how the field-value pairs are matched in step 331, in step 332 each field-value pair appearing in the value census is examined to determine whether its count value in the value census is equal to the total of the count values from amongst all matching field-value pairs in the validation census. If the counts match, this indicates that all of the instances of the field-value pair were invalid in the dataset. Accordingly, in this case data is produced in step 333 that indicates all instances of this field-value pair were found invalid. In the example of
Alternatively, in step 334 a field-value pair appearing in the value census may have a count value that is greater than the total of the count values from amongst all matching field-value pairs in the validation census. This indicates that less than all of the instances of the field-value pair were found by the data processing system to be invalid. In step 334, the data processing system determines the number of valid instances of the field-value pair by subtracting the sum of all matching count values in the validation census from the count from the value census. In some cases, where the sum of all matching count values from the validation census is zero, this indicates that all of the instances of the field-value pair were found to be valid. Alternatively, only a portion of the instances of the field-value pair may have been found to be valid. In act 335, data is produced that indicates that some instances of the field-value pair were found valid and, in some cases, that also indicates that some instances of the field-value pair were found invalid. In the example of
It will be appreciated that, in some embodiments, it may not be necessary to perform the logical evaluation of act 334 because the sum of all matching count values from the validation census cannot exceed the count from the value census for a given field-value pair. As such, the logical evaluation of act 334 may be redundant once the logical evaluation of act 332 has been determined and found to be false. Accordingly, a data processing system may execute step 335 directly after evaluating step 332.
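The count comparison of steps 332 through 335 can be sketched as follows. The function name `enrich`, the output layout, and the status labels are assumptions; the input values reproduce the “Cust. ID” and postal code “02421” figures from the example above.

```python
def enrich(value_census, validation_census):
    """Enrich each value census entry with valid and invalid instance counts."""
    enriched = {}
    for pair, total in value_census.items():
        issues = validation_census.get(pair, {})
        invalid = sum(issues.values())
        if invalid > total:
            # The validation census cannot report more instances than exist.
            raise ValueError(f"invalid count exceeds total for {pair}")
        if invalid == total:
            # Counts match: every instance was invalid (steps 332 and 333).
            status = "all invalid"
        elif invalid == 0:
            status = "all valid"
        else:
            status = "mixed"
        enriched[pair] = {
            "status": status,
            "valid": total - invalid,  # subtraction described for step 334
            "invalid": issues,
        }
    return enriched


value_census = {("cust_id", "7773XXX"): 1, ("postal_code", "02421"): 18}
validation_census = {
    ("cust_id", "7773XXX"): {"invalid for type": 1},
    ("postal_code", "02421"): {"Invalid UK Postal Code": 1},
}

enriched = enrich(value_census, validation_census)
```

Since the sum of invalid counts is checked never to exceed the total, a single equality test is enough to separate the fully invalid case from the rest, which is the redundancy noted above.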
In the example of
In contrast to
In some embodiments, step 431 may be implemented as a join operation by the data processing system executing step 430. That is, the union of records in the value census and validation census may be produced using the combination of field and value as a join key. The resulting data may include, for each field-value pair in the value census, an indication of any issues and their counts that occurred with respect to that field-value pair (e.g., as per the vector format of the validation census 420) in addition to a count of the number of times the pair appeared in the dataset.
Irrespective of how the field-value pairs are matched in step 431, in step 432 each field-value pair appearing in the value census is examined to determine whether its count value in the value census is equal to the total of the count values from amongst all matching field-value pairs in the validation census. If the counts match, this indicates that all of the instances of the field-value pair were invalid in the dataset. Accordingly, in this case data is produced in step 433 that indicates all instances of this field-value pair were found invalid. In the example of
Alternatively, in step 434 a field-value pair appearing in the value census may have a count value that is greater than the total of the count values from amongst all matching field-value pairs in the validation census. This indicates that less than all of the instances of the field-value pair were found by the data processing system to be invalid. In step 434, the data processing system determines the number of valid instances of the field-value pair by subtracting the sum of all matching count values in the validation census from the count from the value census. In some cases, where the sum of all matching count values from the validation census is zero, this indicates that all of the instances of the field-value pair were found to be valid. Alternatively, only a portion of the instances of the field-value pair may have been found to be valid. In act 435, data is produced that indicates that some instances of the field-value pair were found valid and, in some cases, that also indicates that some instances of the field-value pair were found invalid. In the example of
It will be appreciated that, in some embodiments, it may not be necessary to perform the logical evaluation of act 434 because the sum of all matching count values from the validation census cannot exceed the count from the value census for a given field-value pair. As such, the logical evaluation of act 434 may be redundant once the logical evaluation of act 432 has been determined and found to be false. Accordingly, a data processing system may execute step 435 directly after evaluating step 432.
In the example of
In the example of
In the example of
In some embodiments, step 531 may be implemented as a join operation by the data processing system executing step 530. That is, the union of records in the value census and validation census may be produced using the combination of field and value as a join key. The resulting data may include, for each field-value pair in the value census, an indication of any issues and their counts that occurred with respect to that field-value pair (e.g., as per the vector format of the validation census 520) in addition to a count of the number of times the pair appeared in the dataset.
Irrespective of how the field-value pairs are matched in step 531, in step 532 each field-value pair appearing in the value census is examined to determine whether the value census indicates the value is valid for type. In the example of
Returning to
Alternatively, in step 535 a field-value pair appearing in the value census may have a count value that is greater than the total of the count values from amongst all matching field-value pairs in the validation census. This indicates that less than all of the instances of the field-value pair were found by the data processing system to be invalid. In step 535, the data processing system determines the number of valid instances of the field-value pair by subtracting the sum of all matching count values in the validation census from the count from the value census. In some cases, where the sum of all matching count values from the validation census is zero, this indicates that all of the instances of the field-value pair were found to be valid. Alternatively, only a portion of the instances of the field-value pair may have been found to be valid. In act 536, data is produced that indicates that some instances of the field-value pair were found valid and, in some cases, that also indicates that some instances of the field-value pair were found invalid. In the example of
It will be appreciated that, in some embodiments, it may not be necessary to perform the logical evaluation of act 535 because the sum of all matching count values from the validation census cannot exceed the count from the value census for a given field-value pair. As such, the logical evaluation of act 535 may be redundant once the logical evaluation of act 533 has been determined and found to be false. Accordingly, a data processing system may execute step 536 directly after evaluating step 533.
When the value census indicates a field-value pair is determined to be not valid for type in step 532, flow proceeds to step 537 in which the indication of type validity of the value census is examined to see whether it is invalid for type. If the field-value pair is invalid for type, in step 538 the data processing system produces data that indicates all instances of this field-value pair were found invalid. In the example of
When the value census indicates that a field-value pair is neither valid nor invalid for type in steps 532 and 537, flow proceeds to step 539 in which the indication of type validity of the value census is examined to see whether it is NULL. If the indication of type validity of the value census is NULL, in step 540 the instance count of the value census is compared with the total count from the validation census for the same field-value pair to determine whether these counts match. If the counts match, this indicates that all of the instances of the field-value pair were NULL and invalid in the dataset. Accordingly, in this case data is produced in step 541 that indicates all instances of this field-value pair were found NULL and invalid. In the example of
Alternatively, in step 542 a field-value pair appearing in the value census may have a count value that is greater than the total of the count values from amongst all matching field-value pairs in the validation census. This indicates that less than all of the instances of the field-value pair were found by the data processing system to be invalid. In step 542, the data processing system determines the number of NULL and valid instances of the field-value pair by subtracting the sum of all matching count values in the validation census from the count from the value census. In some cases, where the sum of all matching count values from the validation census is zero, this indicates that all of the instances of the field-value pair were found to be NULL and valid. Alternatively, only a portion of the instances of the field-value pair may have been found to be NULL and valid. In act 543, data is produced that indicates that some instances of the field-value pair were found NULL and valid and, in some cases, that also indicates that some instances of the field-value pair were found NULL and invalid. In the example of
The technology described herein is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the technology described herein include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The computing environment may execute computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The technology described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 610 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 610 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 610. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 630 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 631 and random access memory (RAM) 632. A basic input/output system 633 (BIOS), containing the basic routines that help to transfer information between elements within computer 610, such as during start-up, is typically stored in ROM 631. RAM 632 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 620. By way of example, and not limitation,
The computer 610 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
The computer 610 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 680. The remote computer 680 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 610, although only a memory storage device 681 has been illustrated in
When used in a LAN networking environment, the computer 610 is connected to the LAN 671 through a network interface or adapter 670. When used in a WAN networking environment, the computer 610 typically includes a modem 672 or other means for establishing communications over the WAN 673, such as the Internet. The modem 672, which may be internal or external, may be connected to the system bus 621 via the user input interface 660, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 610, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Having thus described several aspects of at least one embodiment of this invention, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art.
Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and scope of the invention. Further, though advantages of the present invention are indicated, it should be appreciated that not every embodiment of the technology described herein will include every described advantage. Some embodiments may not implement any features described as advantageous herein and in some instances one or more of the described features may be implemented to achieve further embodiments. Accordingly, the foregoing description and drawings are by way of example only.
The above-described embodiments of the technology described herein can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component, including commercially available integrated circuit components known in the art by names such as CPU chips, GPU chips, microprocessor, microcontroller, or co-processor. Alternatively, a processor may be implemented in custom circuitry, such as an ASIC, or semicustom circuitry resulting from configuring a programmable logic device. As yet a further alternative, a processor may be a portion of a larger circuit or semiconductor device, whether commercially available, semi-custom or custom. As a specific example, some commercially available microprocessors have multiple cores such that one or a subset of those cores may constitute a processor. However, a processor may be implemented using circuitry in any suitable format.
Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.
Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
In this respect, the invention may be embodied as a computer readable storage medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs (CD), optical discs, digital video disks (DVD), magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the invention discussed above. As is apparent from the foregoing examples, a computer readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such a computer readable storage medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above. As used herein, the term “computer-readable storage medium” encompasses only a non-transitory computer-readable medium that can be considered to be a manufacture (i.e., article of manufacture) or a machine. Alternatively or additionally, the invention may be embodied as a computer readable medium other than a computer-readable storage medium, such as a propagating signal.
The terms “program” or “software,” when used herein, are used in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present invention as discussed above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present invention.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments.
Also, data structures may be stored in computer-readable media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that conveys relationship between the fields. However, any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
Various aspects of the present invention may be used alone, in combination, or in a variety of arrangements not specifically discussed in the embodiments described in the foregoing, and are therefore not limited in their application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.
Also, the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
Further, some actions are described as taken by a “user.” It should be appreciated that a “user” need not be a single individual, and that in some embodiments, actions attributable to a “user” may be performed by a team of individuals and/or an individual in combination with computer-assisted tools or other mechanisms.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
Also, the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
Number | Name | Date | Kind |
---|---|---|---|
5179643 | Homma et al. | Jan 1993 | A |
5566072 | Momose et al. | Oct 1996 | A |
5604899 | Doktor | Feb 1997 | A |
5742806 | Reiner et al. | Apr 1998 | A |
5842200 | Agrawal et al. | Nov 1998 | A |
5845285 | Klein | Dec 1998 | A |
5966072 | Stanfill et al. | Oct 1999 | A |
6134560 | Kliebhan | Oct 2000 | A |
6138123 | Rathbun | Oct 2000 | A |
6163774 | Lore | Dec 2000 | A |
6343294 | Hawley | Jan 2002 | B1 |
6546416 | Kirsch | Apr 2003 | B1 |
6553366 | Miller et al. | Apr 2003 | B1 |
6601048 | Gavan et al. | Jul 2003 | B1 |
6657568 | Coelho et al. | Dec 2003 | B1 |
6741995 | Chen et al. | May 2004 | B1 |
6788302 | Ditlow et al. | Sep 2004 | B1 |
6801938 | Bookman et al. | Oct 2004 | B1 |
6839682 | Blume et al. | Jan 2005 | B1 |
6879976 | Brookler et al. | Apr 2005 | B1 |
6952693 | Wolff et al. | Oct 2005 | B2 |
6957225 | Zait et al. | Oct 2005 | B1 |
6959300 | Caldwell et al. | Oct 2005 | B1 |
7013290 | Ananian | Mar 2006 | B2 |
7031843 | Bullard | Apr 2006 | B1 |
7032212 | Amir et al. | Apr 2006 | B2 |
7039627 | Modelski et al. | May 2006 | B1 |
7043476 | Robson | May 2006 | B2 |
7047230 | Gibbons | May 2006 | B2 |
7058819 | Okaue | Jun 2006 | B2 |
7117222 | Santosuosso | Oct 2006 | B2 |
7130760 | Ilic | Oct 2006 | B2 |
7149736 | Chkodrov et al. | Dec 2006 | B2 |
7359847 | Gabele et al. | Apr 2008 | B2 |
7376656 | Blakeley et al. | May 2008 | B2 |
7386318 | Moon et al. | Jun 2008 | B2 |
7392169 | Gabele et al. | Jun 2008 | B2 |
7395243 | Zielke et al. | Jul 2008 | B1 |
7403942 | Bayliss | Jul 2008 | B1 |
7426520 | Gorelik et al. | Sep 2008 | B2 |
7433861 | Santosuosso | Oct 2008 | B2 |
7584205 | Stanfill et al. | Sep 2009 | B2 |
7587394 | Chang et al. | Sep 2009 | B2 |
7689542 | Yoaz et al. | Mar 2010 | B2 |
7694088 | Bromley et al. | Apr 2010 | B1 |
7698163 | Reed et al. | Apr 2010 | B2 |
7698345 | Samson et al. | Apr 2010 | B2 |
7720878 | Caldwell et al. | May 2010 | B2 |
7756873 | Gould et al. | Jul 2010 | B2 |
7774346 | Hu et al. | Aug 2010 | B2 |
7813937 | Pathria et al. | Oct 2010 | B1 |
7849075 | Gould et al. | Dec 2010 | B2 |
7877350 | Stanfill et al. | Jan 2011 | B2 |
7899833 | Stevens et al. | Mar 2011 | B2 |
7904464 | Golwalkar et al. | Mar 2011 | B2 |
7912867 | Suereth et al. | Mar 2011 | B2 |
7958142 | Li et al. | Jun 2011 | B2 |
7966305 | Olsen | Jun 2011 | B2 |
8069129 | Gould et al. | Nov 2011 | B2 |
8122046 | Chang et al. | Feb 2012 | B2 |
8145642 | Cruanes et al. | Mar 2012 | B2 |
8250044 | Santosuosso | Aug 2012 | B2 |
8271452 | Longshaw | Sep 2012 | B2 |
8296274 | Leppard | Oct 2012 | B2 |
8326824 | Agrawal et al. | Dec 2012 | B2 |
8359296 | Santosuosso | Jan 2013 | B2 |
8396873 | Xie | Mar 2013 | B2 |
8412713 | Stewart et al. | Apr 2013 | B2 |
8447743 | Santosuosso | May 2013 | B2 |
8463739 | Williamson | Jun 2013 | B2 |
8560575 | Gradin | Oct 2013 | B2 |
8572018 | Mishra et al. | Oct 2013 | B2 |
8615519 | Froemmgen | Dec 2013 | B2 |
8666919 | Miranda | Mar 2014 | B2 |
8762396 | Hudzia et al. | Jun 2014 | B2 |
8775447 | Roberts | Jul 2014 | B2 |
8825695 | Studer et al. | Sep 2014 | B2 |
8856085 | Gorelik | Oct 2014 | B2 |
8868580 | Gould et al. | Oct 2014 | B2 |
8924402 | Fuh et al. | Dec 2014 | B2 |
9251212 | Cao et al. | Feb 2016 | B2 |
9275367 | Neway | Mar 2016 | B2 |
9323748 | Anderson et al. | Apr 2016 | B2 |
9323749 | Anderson et al. | Apr 2016 | B2 |
9323802 | Gould | Apr 2016 | B2 |
9336246 | Gorelik | May 2016 | B2 |
9449057 | Anderson et al. | Sep 2016 | B2 |
9569434 | Anderson et al. | Feb 2017 | B2 |
9652513 | Anderson et al. | May 2017 | B2 |
9892026 | Isman et al. | Feb 2018 | B2 |
9971798 | Khan et al. | May 2018 | B2 |
9990362 | Anderson et al. | Jun 2018 | B2 |
10719511 | Anderson | Jul 2020 | B2 |
20020073138 | Gilbert et al. | Jun 2002 | A1 |
20020120602 | Overbeek et al. | Aug 2002 | A1 |
20020161778 | Linstedt | Oct 2002 | A1 |
20020198877 | Wolff et al. | Dec 2002 | A1 |
20030023868 | Parent | Jan 2003 | A1 |
20030033138 | Bangalore et al. | Feb 2003 | A1 |
20030063779 | Wrigley | Apr 2003 | A1 |
20030135354 | Gabele | Jul 2003 | A1 |
20030140027 | Huttel et al. | Jul 2003 | A1 |
20030208744 | Amir et al. | Nov 2003 | A1 |
20040023666 | Moon et al. | Feb 2004 | A1 |
20040049492 | Gibbons | Mar 2004 | A1 |
20040073534 | Robson | Apr 2004 | A1 |
20040083199 | Govindugari et al. | Apr 2004 | A1 |
20040111410 | Burgoon et al. | Jun 2004 | A1 |
20040181514 | Santosuosso | Sep 2004 | A1 |
20040181533 | Santosuosso | Sep 2004 | A1 |
20040249810 | Das et al. | Dec 2004 | A1 |
20040260711 | Chessell | Dec 2004 | A1 |
20050048564 | Emili | Mar 2005 | A1 |
20050055369 | Gorelik | Mar 2005 | A1 |
20050065914 | Chang et al. | Mar 2005 | A1 |
20050071320 | Chkodrov | Mar 2005 | A1 |
20050075831 | Ilic | Apr 2005 | A1 |
20050102297 | Lloyd et al. | May 2005 | A1 |
20050102325 | Gould et al. | May 2005 | A1 |
20050108631 | Amorin et al. | May 2005 | A1 |
20050114368 | Gould et al. | May 2005 | A1 |
20050114369 | Gould et al. | May 2005 | A1 |
20050154715 | Yoaz et al. | Jul 2005 | A1 |
20050177578 | Chen et al. | Aug 2005 | A1 |
20050183094 | Hunt | Aug 2005 | A1 |
20050192994 | Caldwell et al. | Sep 2005 | A1 |
20050240354 | Mamou | Oct 2005 | A1 |
20060041544 | Santosuosso | Feb 2006 | A1 |
20060064313 | Steinbarth et al. | Mar 2006 | A1 |
20060069717 | Mamou | Mar 2006 | A1 |
20060074881 | Vembu et al. | Apr 2006 | A1 |
20060089827 | Gabele | Apr 2006 | A1 |
20060294055 | Santosuosso | Dec 2006 | A1 |
20060294129 | Stanfill et al. | Dec 2006 | A1 |
20070011668 | Wholey et al. | Jan 2007 | A1 |
20070021995 | Toklu et al. | Jan 2007 | A1 |
20070050381 | Hu et al. | Mar 2007 | A1 |
20070073721 | Belyy et al. | Mar 2007 | A1 |
20070106666 | Beckerle et al. | May 2007 | A1 |
20070214179 | Hoang | Sep 2007 | A1 |
20070288490 | Longshaw | Dec 2007 | A1 |
20070299832 | Chang et al. | Dec 2007 | A1 |
20080071904 | Schuba et al. | Mar 2008 | A1 |
20080114789 | Wysham | May 2008 | A1 |
20080140646 | Inoue et al. | Jun 2008 | A1 |
20080189269 | Olsen | Aug 2008 | A1 |
20080215602 | Samson et al. | Sep 2008 | A1 |
20080222089 | Stewart et al. | Sep 2008 | A1 |
20080306920 | Santosuosso | Dec 2008 | A1 |
20080319942 | Courdy et al. | Dec 2008 | A1 |
20090216717 | Suereth et al. | Aug 2009 | A1 |
20090226916 | DeSimas | Sep 2009 | A1 |
20100057697 | Golwalkar | Mar 2010 | A1 |
20100057777 | Williamson | Mar 2010 | A1 |
20100114976 | Castellanos | May 2010 | A1 |
20100250563 | Wu et al. | Sep 2010 | A1 |
20110029478 | Broeker | Feb 2011 | A1 |
20110040874 | Dugatkin | Feb 2011 | A1 |
20110066602 | Studer et al. | Mar 2011 | A1 |
20110119221 | Mishra | May 2011 | A1 |
20110137940 | Gradin | Jun 2011 | A1 |
20110153667 | Parmenter et al. | Jun 2011 | A1 |
20110225191 | Xie | Sep 2011 | A1 |
20110296108 | Agrawal | Dec 2011 | A1 |
20110313979 | Roberts | Dec 2011 | A1 |
20120158745 | Gorelik | Jun 2012 | A1 |
20120197887 | Anderson et al. | Aug 2012 | A1 |
20120250563 | Liu et al. | Oct 2012 | A1 |
20120281012 | Neway | Nov 2012 | A1 |
20120323927 | Froemmgen | Dec 2012 | A1 |
20130006931 | Nelke et al. | Jan 2013 | A1 |
20130024430 | Gorelik | Jan 2013 | A1 |
20130031044 | Miranda et al. | Jan 2013 | A1 |
20130031367 | Mao et al. | Jan 2013 | A1 |
20130100957 | Suzuki et al. | Apr 2013 | A1 |
20130159353 | Fuh | Jun 2013 | A1 |
20130166576 | Hudzia et al. | Jun 2013 | A1 |
20130247008 | Mitran et al. | Sep 2013 | A1 |
20140047015 | Sheshagiri et al. | Feb 2014 | A1 |
20140095233 | Yeung | Apr 2014 | A1 |
20140114926 | Anderson et al. | Apr 2014 | A1 |
20140114927 | Anderson et al. | Apr 2014 | A1 |
20140114968 | Anderson et al. | Apr 2014 | A1 |
20140114987 | Hoeng et al. | Apr 2014 | A1 |
20140115013 | Anderson et al. | Apr 2014 | A1 |
20140147013 | Shandas | May 2014 | A1 |
20140222752 | Isman et al. | Aug 2014 | A1 |
20140294993 | Bueno et al. | Oct 2014 | A1 |
20150106341 | Gould et al. | Apr 2015 | A1 |
20150199352 | Bush et al. | Jul 2015 | A1 |
20150220838 | Martin | Aug 2015 | A1 |
20150254292 | Khan | Sep 2015 | A1 |
20160012100 | Anderson et al. | Jan 2016 | A1 |
20160232115 | Sawal et al. | Feb 2016 | A1 |
20160078100 | Anderson et al. | Mar 2016 | A1 |
20160239532 | Gould et al. | Aug 2016 | A1 |
20170139996 | Marquardt et al. | May 2017 | A1 |
20170154075 | Anderson et al. | Jun 2017 | A1 |
20180165181 | Isman et al. | Jun 2018 | A1 |
Number | Date | Country |
---|---|---|
1314634 | Sep 2001 | CN |
1749224 | Mar 2006 | CN |
1853181 | Oct 2006 | CN |
1993755 | Jul 2007 | CN |
101191069 | Jun 2008 | CN |
101208696 | Jun 2008 | CN |
101271471 | Sep 2008 | CN |
101271472 | Sep 2008 | CN |
101661510 | Mar 2010 | CN |
102203773 | Sep 2011 | CN |
102436420 | May 2012 | CN |
102681946 | Sep 2012 | CN |
103080932 | May 2013 | CN |
103348598 | Oct 2013 | CN |
1136918 | Sep 2001 | EP |
1302871 | Apr 2003 | EP |
2261820 | Dec 2010 | EP |
03-002938 | Jan 1991 | JP |
H07-502617 | Mar 1995 | JP |
08-030637 | Feb 1996 | JP |
10-055367 | Feb 1998 | JP |
10-091633 | Apr 1998 | JP |
10-320423 | Dec 1998 | JP |
11-238065 | Aug 1999 | JP |
2001-43237 | Feb 2001 | JP |
2001-142827 | May 2001 | JP |
2002-024262 | Jan 2002 | JP |
2007-066017 | Mar 2007 | JP |
2010-072823 | Apr 2010 | JP |
2012-038066 | Feb 2012 | JP |
2012-503256 | Feb 2012 | JP |
10-2006-0080588 | Jul 2006 | KR |
WO 0010103 | Feb 2000 | WO |
WO 0057312 | Sep 2000 | WO |
WO 0079415 | Dec 2000 | WO |
WO 03071450 | Aug 2003 | WO |
WO 2005029369 | Mar 2005 | WO |
WO 2009095981 | Aug 2009 | WO |
WO 2010033834 | Mar 2010 | WO |
Entry |
---|
U.S. Appl. No. 10/941,373, filed Sep. 15, 2004, Gould et al. |
U.S. Appl. No. 10/941,401, filed Sep. 15, 2004, Gould et al. |
U.S. Appl. No. 10/941,402, filed Sep. 15, 2004, Gould et al. |
U.S. Appl. No. 13/360,230, filed Jan. 27, 2012, Anderson et al. |
U.S. Appl. No. 13/827,558, filed Mar. 14, 2013, Isman et al. |
U.S. Appl. No. 13/957,641, filed Aug. 2, 2013, Anderson et al. |
U.S. Appl. No. 13/957,664, filed Aug. 2, 2013, Anderson et al. |
U.S. Appl. No. 13/958,057, filed Aug. 2, 2013, Anderson et al. |
U.S. Appl. No. 14/059,590, filed Oct. 22, 2013, Anderson et al. |
U.S. Appl. No. 14/156,544, filed Jan. 16, 2014, Bush et al. |
U.S. Appl. No. 14/519,030, filed Oct. 20, 2014, Gould et al. |
U.S. Appl. No. 14/625,902, filed Feb. 19, 2015, Khan et al. |
U.S. Appl. No. 14/859,502, filed Sep. 21, 2015, Anderson et al. |
U.S. Appl. No. 14/954,434, filed Nov. 30, 2015, Anderson et al. |
U.S. Appl. No. 15/135,852, filed Apr. 22, 2016, Gould et al. |
U.S. Appl. No. 15/431,008, filed Feb. 13, 2017, Anderson et al. |
AU 2009200294, Jun. 12, 2012, Examiner's Report. |
CA 2,655,731, Dec. 3, 2009, Canadian Communication. |
CA 2,655,735, May 4, 2009, Canadian Communication. |
CN 201210367944.3, Mar. 27, 2015, Chinese First Office Action. |
CN 201210367944.3, Nov. 4, 2015, Chinese Communication. |
EP 04784113.5, Jul. 30, 2010, Summons to Attend Oral Proceedings. |
EP 14746291.5, Sep. 5, 2016, European Search Report. |
JP 2006-526986, Oct. 13, 2010, Notification of Reasons for Refusal |
JP 2006-526986, Nov. 22, 2012, Japanese Communication. |
JP 2010-153799, May 8, 2012, Japanese Communication. |
JP 2010-153799, Feb. 12, 2013, Japanese Communication. |
JP 2010-153800, May 8, 2012, Japanese Communication. |
JP 2013-551372, Oct. 27, 2015, Notification of Reasons for Refusal. |
PCT/US2012/022905, May 2, 2012, International Search Report and Written Opinion. |
PCT/US2013/053351, Oct. 25, 2013, International Search Report and Written Opinion. |
PCT/US2014/014186, Aug. 20, 2014, International Search Report and Written Opinion. |
PCT/US2015/011518, May 12, 2015, International Search Report and Written Opinion. |
PCT/US2015/011518, Jul. 19, 2016, International Preliminary Report on Patentability. |
PCT/US2015/016517, May 18, 2015, International Search Report and Written Opinion. |
PCT/US2018/015274, Jul. 24, 2018, International Search Report and Written Opinion. |
Australian Examiner's Report for Australian Application No. 2009200294 dated Jun. 12, 2012. |
Canadian Communication for Canadian Application No. 2,655,731 dated Dec. 3, 2009. |
Canadian Communication for Canadian Application No. 2,655,735 dated May 4, 2009. |
Chinese First Office Action for Chinese Application No. 201210367944.3 dated Mar. 27, 2015. |
Chinese Communication for Chinese Application No. 201210367944.3 dated Nov. 4, 2015. |
Summons to Attend Oral Proceedings for EP Application No. 04784113.5 dated Jul. 30, 2010. |
European Search Report for European Application No. 14746291.5 dated Sep. 5, 2016. |
English Translation of Notification of Reasons for Refusal for Japanese Application No. 2006-526986 dated Oct. 13, 2010. |
Japanese Communication for Japanese Application No. 2006-526986 dated Nov. 22, 2012. |
Japanese Communication for Japanese Application No. 2010-153799 dated May 8, 2012. |
Japanese Communication for Japanese Application No. 2010-153799 dated Feb. 12, 2013. |
Japanese Communication for Japanese Application No. 2010-153800 dated May 8, 2012. |
English Translation of Notification of Reasons for Refusal for Japanese Application No. 2013-551372 dated Oct. 27, 2015. |
International Search Report and Written Opinion for International Application No. PCT/US2012/022905 dated May 2, 2012. |
International Search Report and Written Opinion for International Application No. PCT/US2013/053351 dated Oct. 25, 2013. |
International Search Report and Written Opinion for International Application No. PCT/US2014/014186 dated Aug. 20, 2014. |
International Search Report and Written Opinion for International Application No. PCT/US2015/011518 dated May 12, 2015. |
International Preliminary Report on Patentability for International Application No. PCT/US2015/011518 dated Jul. 19, 2016. |
International Search Report and Written Opinion for International Application No. PCT/US2015/016517 dated May 18, 2015. |
International Search Report and Written Opinion for International Application No. PCT/US2018/015274 dated Jul. 24, 2018. |
[No Author Listed], Ascential Software. http://www.ascentialsoftware.com (2003). 15 pages. |
[No Author Listed], Avellino. http://www.avellino.com (2003). 8 pages. |
[No Author Listed], Data Profiling: The Foundation for Data Management. DataFlux Corporation. XP-002313258. 2004. 17 pages. |
[No Author Listed], Evoke. Evoke Software. http://www.evokesoftware.com 2003. 71 pages. |
[No Author Listed], Profiling: Take the First Step Toward Assuring Data Quality. IBM. White paper, GC-18-9728-00, Dec. 2005. 16 pages. |
Alur et al., IBM Websphere Information Analyzer and Data Quality Assessment. ibm.com/redbooks Dec. 2007 p. 1-642, retrieved from the Internet:http://www.ibm.com/redbooks/pdfs/sg247508.pdf. |
Apte et al., Business Application for Data Mining. Communications of the ACM. Aug. 2002;45(8):49-53. |
Bagchi et al., Dependency Interference Algorithms for Relational Database Design. Computers in Industry. 14. 1990;4:319-50. |
Bell et al., Discovery of Data Dependencies in Relational Databases. University of Dortmund, LS-8, Report 14, Apr. 3, 1995. 22 pages. |
Bitton et al., A Feasability and Performance Study of Dependency Inference. Department of Electrical Engineering and Computer Science, University of Illinois at Chicago. IEEE.1989;635-41. |
Brown et al., BHUNT: Automatic Discovery of Fuzzy Algebraic Contraints in Relational Data. 29th VLDB Conference. Sep. 9, 2003. XP-002333907. 12 pages. |
Bruno et al., Efficient Creation of Statistics over Query Expressions. The Computer Society. 2003;201-12. |
Chaudhuri., An Overview of Query Optimization on Relational Systems. Proceedings of the 17th ACM Sigact-Sigmod-Sigart Symposium on Principles of Database Systems. 1998;34-43. XP-000782631. |
Chilimbi et al., Quantifying the Effectiveness of Testing via Efficient Residual Path Profiling. Proceeding ESEC-FSE companion '07. The 6th Joint Meeting on European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering. 2007;545-8. |
Cox et al., Integrating gene and protein expression data: pattern analysis and profile mining. Methods. Mass Spectrometry in Proteomics. 2005;35(3):303-14. |
Dasu et al., Mining Database Structure; Or, How to Build a Data Quality Browser. ACM SIGMOD. 2002;240-51. |
Florescu et al., A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. INRIA Rocquencourt. May 1999. 31 pages. |
Gauch et al., User Profiles for Personalized Information Access. The Adaptive Web, LNCS 4321. 2007;54-89. |
Graefe, Query Evaluation Techniques for Large Databases. ACM Computing Surveys. 1993;25(2):98 pages. |
Henrad et al., Data Dependency Elicitation in Database Reverse Engineering. Institut d'Informattique. University of Namur, Belgium., 2001;11-9. |
Huhtala et al., Efficient Discovery of Functional and Approximate Dependencies Using Paritions. Proceedings of the 14th International Conference on Data Engineering. Feb. 23-27, 1998. 12 pages. |
Huhtala et al., Efficient Discovery of Functional and Approximate Dependencies Using Partitions (Extended Version). University of Helsinki, Department of Computer Science Series of Publications C, Report C-1997-79, Nov. 1997. 35 pages. |
Huhtala et al., TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. The Computer Journal. 1999;42(2):100-11. |
Jaedicke et al., On Parallel Processing of Aggregate and Scalar Functions in Object-Relational DBMS. ACM. 1998;XP-002313223:379-89. |
Jahnke et al., Adaptive Tool Support for Database Reverse Engineering. AG-Softwaretechnik, Universitat Paderborn, Germany. IEEE. 1999;278-82. |
Johnson et al., Comparing Massive High-Dimensional Data Sets. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD 98). Aug. 27-31, 1998;229-33. |
Kandel et al., Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment. AVI '12 Proceedings of the International Working Conference on Advanced Visual Interfaces. May 2012;1-8. |
Kivinen et al., Approximate Inference of Functional Dependencies from Relations. Theoretical Computer Science. 1995;149:129-49. |
Kouris et al., Using Information Retrieval Techniques for Supporting Data Mining. Data & Knowledge Engineering. Elsevier BC, NL. 2005;52(3):353-83. |
Lee et al., Bitmap Indexes for Relational XML Twig Query Processing. OIKM '09, Nov. 2-6, 2009;465-74. |
Lemire et al., Sorting Improves Word-Aligned Bitmap Indexes. Data & Knowledge Engineering. Dec. 2009;1-43. |
Li et al., A Practical External Sort for Shared Disk MPPs. http://www.thearling.com/text/sc93/sc93. Thearling. 1993. 24 pages. |
Lopes et al., Efficient Discovery of Functional Dependencies and Armstrong Relations. Proceedings of the 7th International Conference on Extending Database Technology (EDBT 2000), LNCS 1777. Mar. 27-31, 2000;350-64. |
Lynch, Canonicalization: a fundamental tool to facilitate preservation and management of digital information. D-Lib Magazine. 1999;5(9):1-7. |
Mannila, Theoretical Frameworks for Data Mining. SIGKDD Explorations. Jan. 2000;1(2):30-2. |
Milne et al., Predicting Paper Making Defects On-line Using Data Mining. Knowledge-Based Systems. Jul. 24, 1998;11:331-8. |
Mobasher, Data Mining for Web Personalization. The Adaptive Web, LNCS 4321. 2007;90-135. |
Munakata, Integration of Distributed Heterogeneous Information Sources. Systems, Control and Information, Japan, The Institute of Systems, Control and Information Engineers, Dec. 15, 1996;40(12):514-21. |
Naumann, Data Profiling Revisited. SIGMOD Record. 2014;42(4):40-9. |
Novelli et al., FUN: An Efficient Algorithm for Mining Functional and Embedded Dependencies. Proceedings of the 8th International Conference on Database Theory (ICDT 2001), LNCS 1973. Jan. 4-6, 2001;189-203. |
Olson, Data Profiling Technology, Chapters 7 and 8. Elsevier Science. Jan. 2003. 23 pages. |
Olson, Know Your Data: Data Profiling Solutions for Today's Hot Projects. DM Review, XP-002313222. 2000;1-4. |
Petit et al., Towards the Reverse Engineering of Denormalized Relational Databases. Laboratoire d'Ingenierie des Systemes d'Information, Lyon. 1996;218-27. |
Rahm et al., Data Cleaning: Problems and Current Approaches. XP-002284896. 2000. 12 pages. |
Wyss et al., FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. (Extended Abstract) Computer Science Department, Indiana University. XP-002333906, 2001;101-10. |
Yan et al., Algorithm for discovering multivalued dependencies. ACM Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM '01) Nov. 5-10, 2001;556-8. |
Yao et al., FD_Mine: Discovering Functional Dependencies in a Database Using Equivalences. University of Regina, Department of Computer Science, Technical Report TR Apr. 2002, Aug. 2002. 17 pages. |
Yao et al., FD_Mine: Discovering Functional Dependencies in a Database Using Equivalencies. Proceedings of the 2nd IEEE International Conference on Data Mining. Dec. 9-12, 2002. 4 pages. |
Yao et al., Mining functional dependencies from data. Springer Science+Business Media. Data Mining and Knowledge Discovery. Sep. 15, 2007;16(2):197-219. |
Yoon et al., BitCube: A Three-Dimensional Bitmap Indexing for XML Documents. Journal of Intelligent Information Systems. 2001;241-54. |
Young et al., A Fast and Stable Incremental Clustering Algorithm. 2010 Seventh International Conference on Information Technology. IEEE. 2010;204-9. |
Number | Date | Country
---|---|---
20190228108 A1 | Jul 2019 | US |