The following relates generally to the information acquisition, transmission, and processing arts, information quality control arts, information logging arts, medical imaging device machine and service logging arts, patient monitoring arts, and related arts.
Information for a given task may be acquired via numerous pathways, such as manual data entry, reading sensor devices, generating data by electronic computations, reading data from a data storage device, various combinations thereof, or so forth. Such data can be inaccurate for numerous reasons, such as data entry errors, sensor glitches or failures, values dropped or corrupted during electronic data transmission, reading data from an incorrect data file, or so forth. It is known to perform automated data checking to flag some such errors, for example an exact zero value may be flagged as suspicious if the source is such that it should not generate exact zero values, or data of a wrong data type may be flagged (e.g. an integer when the value should be a floating point value), or other similar data checks. However, these automated checks can potentially miss many errors that do not fail any of these limited quality control checks. A human domain expert could detect many other errors, but manual review of large quantities of data may be impractical due to the sheer data volume and/or limited availability of human domain experts with requisite special expertise.
Accuracy of information is important in many applications. As one illustrative example, modern medical imaging devices are complex systems with tens of thousands of interacting components, and are commonly designed to incorporate control and monitoring electronics which generates large volumes of machine log data. Such systems are serviced on routine and as-needed bases, and service log data are generated during servicing by manual data entry performed by the service engineer and possibly by other mechanisms such as records of an online parts ordering system, service log data automatically generated by the imaging device electronics, and/or so forth. Inaccuracies in machine or service log data can lead to increased cost due to unnecessary or ineffective servicing, and in some cases can lead to degradation of the image quality of acquired clinical images which can have adverse impact on the clinical services provided by the medical imaging device. Yet the volume and domain specificity of much of the content of the machine and service logs makes detection of errors difficult.
As another illustrative example, modern electronic medical record (EMR), electronic health record (EHR), cardiovascular information systems (CVIS), and other such patient data recordation systems present high complexity. A single multifunction patient monitor may acquire numerous vital signs (for example, heart rate, respiration rate, blood oxygenation level, capnography data, one or more types of blood pressure, electrocardiogram, electroencephalogram, or so forth) and possibly other patient monitoring data such as intravenous therapy flow rate. These data may be acquired intermittently or continuously (or, more precisely, at a high sampling rate). These data are subject to errors due to sensor glitches from patient movements or other problems with the patient coupling, and due to corruption during transmission from the patient monitor to the server computer maintaining the EHR. Other patient data in an EMR may include demographic information manually entered by clerical staff and/or retrieved from other database systems such as those of another hospital or an insurance company, and these data are prone to data entry errors, data retrieval errors, or the like. Errors in patient data can have adverse consequences for the patient's clinical treatment and for ancillary activities such as insurance billing. Yet the volume and domain specificity of much of the content of an EMR makes detection of errors difficult.
These are merely some illustrative examples, and it will be appreciated that accuracy of information is similarly important in many other fields in the medical domain as well as in numerous fields beyond the medical domain.
The following discloses certain improvements.
In some non-limiting illustrative embodiments disclosed herein, an electronic data processing system comprises an electronic processor and a non-transitory storage medium storing instructions readable and executable by the electronic processor to perform an electronic data processing method. The method includes: performing pairwise comparisons of columns of at least one table to associate a first column and a second column with a one-to-N relationship by detecting violations of the one-to-N relationship for data of the first and second columns and determining that a count of the violations of the one-to-N relationship for the first and second columns is less than a threshold; and indicating, on a display, possible data errors corresponding to the violations of the one-to-N relationship for data of the first and second columns. The performing and the indicating are performed for the one-to-N relationship being a one-to-one relationship and/or for the one-to-N relationship being a one-to-many relationship.
In some non-limiting illustrative embodiments disclosed herein, a non-transitory storage medium stores instructions that are readable and executable by an electronic processor to perform an electronic data processing method comprising: identifying a first column and a second column of at least one table which have a one-to-one or one-to-many relationship with fewer than a threshold number of violations of the one-to-one or one-to-many relationship; and indicating, on a display, possible data errors in the at least one table corresponding to the violations of the one-to-one or one-to-many relationship.
In some non-limiting illustrative embodiments disclosed herein, an electronic data processing method is disclosed. Using an electronic processor, a first column j and a second column j′ of at least one table are identified which have a mostly one-to-N relationship, where the entry in row i and column j of the table is denoted by t(i,j). On a display, possible data errors in the at least one table corresponding to violations of the identified one-to-N relationship are indicated. The one-to-N relationship is a one-to-one relationship defined as t(i,j)=t(i′,j) if and only if t(i,j′)=t(i′,j′) or a one-to-many relationship from column j to column j′ defined as t(i,j)=t(i′,j) if t(i,j′)=t(i′,f) but not necessariy vice versa, where i and i′ are different rows of the at least one table. The same can be defined for columns from different tables, say column j from table t and column j′ from table t′, provided that a one-to-one relationship is known between the rows of the two tables.
One advantage resides in providing an electronic data processing system with information-agnostic data quality checking.
Another advantage resides in providing an electronic data processing system with data quality checking that makes minimal a priori assumptions.
Another advantage resides in providing an electronic data processing system with data quality checking that does not rely upon a priori information beyond the data itself.
Another advantage resides in providing an electronic data processing system with data quality checking operating on one-to-one relations between different columns of data in the same table or in different tables.
Another advantage resides in providing an electronic data processing system with data quality checking operating on one-to-many relations between different columns of data in the same table or in different tables.
Another advantage resides in providing an electronic medical data processing system with one or more of the foregoing benefits.
A given embodiment may provide none, one, two, more, or all of the foregoing advantages, and/or may provide other advantages as will become apparent to one of ordinary skill in the art upon reading and understanding the present disclosure.
The invention may take form in various components and arrangements of components, and in various steps and arrangements of steps. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
Data quality assessment techniques disclosed herein provide for detecting data errors in an information-agnostic fashion; that is, without taking into consideration the information represented by the data values. The disclosed approaches assumes data in a table format in which the rows correspond to different entities (e.g., different patients, different medical imaging devices, or so forth) while the columns correspond to data fields. Hence, entry t(i,j) represents the value of the data field represented by j for entity i. The data may be stored in a single table, or may be stored in two or more different tables so long as corresponding rows can be identified, for example based on an index value identifying the entity (such as a machine identifier in the case of medical imaging device machine logs, a patient name or patient identifier, PID, in the case of patient data, or so forth). This is the extent of a priori assumptions—there are no further assumptions made regarding the informational content of the various columns j.
In description herein, tabular data is described in terms of “rows” corresponding to different entities, and “columns” corresponding to different data fields. It will be appreciated that other analogous terms may be employed in a specific implementation or description, for example a “row” may alternatively be referred to as a “record”, a “tuple”, a “vector”, or other analogous phraseology—all such analogous phraseology is to be understood as being encompassed by the term “row” as used herein. Similarly, a “column” may be alternatively referred to as a “field”, a “parameter”, a “property”, an “attribute”, a “vector element”, or other analogous phraseology— all such analogous phraseology is to be understood as being encompassed by the term “column” as used herein. Still further, the term “table” may alternatively be referred to as a list of tuples, a set of vectors, or so forth—again, all such analogous phraseology is to be understood as being encompassed by the term “table” as used herein.
With reference to
The electronic data processing system 10 receives data for processing from one or more sources, such as an illustrative data entry terminal 20 (for example, used to enter patient information such as name, age, gender, ethnicity, reason for admission, and so forth), a patient monitor 22 (e.g. an illustrative bedside multifunction monitor such as may be provided in a hospital room, or a more specific patient monitor generating specific data such as an electrocardiogram, an intravenous infusion pump generating flow rate data, various combinations thereof, and/or so forth), a medical imaging device 24 generating automatically machine logs and/or automatically and/or manually generated service logs, various combinations thereof, and/or so forth. The illustrative input sources 20, 22, 24 are merely examples, and a given embodiment of the electronic data processing system 10 may include some subset of these inputs, additional inputs, multiple instances of a given data type of input, and/or so forth. For example, if the electronic data processing system 10 is an Electronic Medical Record (EMR) (this term being broadly used herein to encompass any electronic data processing system for recording patient data, such as an electronic health record (EHR), cardiovascular information systems (CVIS), or so forth) then the inputs may include the data entry terminal 20 used for patient admissions and a large number of patient monitors 22. As another example, if the electronic data processing system 10 is a call center for a medical imaging devices manufacturer then the inputs may include a fleet of medical imaging devices 24 such as magnetic resonance imaging (MRI) scanners, computed tomography (CT) scanners, positron emission tomography (PET) scanners, gamma cameras used for single photon emission computed tomography (SPECT) imaging, hybrid medical imaging devices such as PET/CT scanners, various combinations thereof, and/or so forth. It will be appreciated that the received data may be of a wide range of different data types (e.g. free-form text, integers, floating point values, logical data fields, structured data fields having a closed set of possible values for example, generated by a drop-down list of a graphical user interface, and/or so forth).
The received data are received by a data aggregator 28 which formats the received data as at least one table 30. In the following, a single table is assumed, for example having the format of the illustrative table 30 of
With continuing reference to
The resulting violations 40, 50 (if any) are taken to be possible errors. In one approach, these possible errors are presented to a user via a user interface (UI) 52 presented on an illustrative desktop or notebook computer 54 having a display 56 and one or more user input devices 58 (such as an illustrative keyboard and trackpad, and/or a mouse, trackball, touch-sensitive overlay of the display 56, various combinations thereof, and/or so forth). For example, possible data errors corresponding to the violations 40, 50 of the one-to-N relationship for data of the first and second columns may be indicated, e.g. displayed, on the display 56. In one approach, indication is displayed on the display 56 that the violation is not consistent with the one-to-N relationship. Instead of the illustrative desktop or notebook computer 54, the user interface 52 may be presented on a cellular telephone (i.e. cellphone), tablet computer, or any other user interfacing device having a display or other output component for presenting the possible errors (these are the violations 40, 50) and one or more user input devices for receiving input from the user. Via the UI 52, the user may correct the error by replacing the current value with a new value, or may delete the value entirely (or replace it with some “null” value if deletion is impermissible); or, the user may confirm that the present value is in fact correct.
Data 60 which are not violations, together with data 62 that are violations and have been reviewed and confirmed or corrected or nulled or otherwise remediated by the user via the US 52, are collected and stored in a database on a non-transitory storage medium 64 (for example a hard disk or other magnetic storage medium, a solid state drive or other electronic storage medium, an optical disk or other optical storage medium, a RAID, and/or so forth). The format in which the data are stored is application dependent. As an illustrative example, in a machine logging application (e.g. for the medical imaging device 24), the table 30 is suitably stored in a machine or service log of a medical imaging device, and the possible data errors are indicated as possible machine or service log errors. As another illustrative example, in a patient data recordation application, the table 30 is suitably stored in an electronic medical record (EMR), and the possible data errors are indicated as possible patient data errors. These are merely illustrative examples, and the disclosed pairwise columns-based data consistency checking may be employing in numerous other medical and non-medical electronic data processing tasks.
It should be noted that the non-transitory storage medium 64 may be the same as, different from, or partially overlapping with, the non-transitory storage medium 14 storing the instructions which are readable and executable by the electronic processor 12 to implement the electronic data processing system 10. Similarly, in some embodiments the computer 54 may be also used as a data entry computer 20.
In the following, some additional aspects of the electronic data processing system 10, and of the data quality checking performed by the pairwise columns comparator 32 in particular, are described.
The basic idea of the pairwise columns comparator 32 is to infer directly from the data stored in each pair of columns in a table whether or not there is a one-to-one or a one-to-many relation between them, approximately. A test is used to determine to what extent the relation of the data stored in the two columns (designated without loss of generality here as a “first” column and a “second column”) is one-to-one or one-to-many. And if it is likely that the relation is one-to-one or one-to-many (determined at respective blocks 38, 48), then the violations of this relation (e.g., violations 40, 50) are flagged as a potential data quality issue. These potential data quality issues 40, 50 can be indicated as alerts in a dashboard UI 52, so that subject matter experts can examine whether indeed these are data quality issues or whether the suspected relation is not valid. If the latter is the case, then this will be remembered by the data quality checker 32, so that in a next evaluation these alerts are not raised again.
To more precisely describe some preferred embodiments, the following notation is employed herein. Let T be an n×m database table of entries (table 30), with n rows and m columns. Entry t(i,j) denotes the entry in row i and column j. Typically, a table stores information on entities of the same type (e.g. the “type” may be a patient, or a medical imaging device, or so forth). Each row corresponds to information on a single entity (a single medical imaging device, or a single patient, . . . ) and each column corresponds to a specific property (or data field or so forth) of such entities (type, age, . . . ). The pairwise columns comparator 32 is comparing a “first” column j and a second column j′ (where again “first” and “second” are arbitrary labels for the chosen two columns to be compared, and do not denote ordinal position of the columns in the table). Using this notation, two columns j and j′ have a one-to-one relationship if and only if:
for any pair i and i′: t(i,j)=t(i′,j) if and only if t(i,j′)=t(i′,j′) (1)
Said another way, the one-to-one relationship is defined for the first column j and the second column j′ as:
t(i,j)=t(i′,j) if and only if t(i,j′)=t(i′,j′) (2)
where i and i′ are different rows of the table. For example, in a table with information on countries, the columns with country name and capital name will have a one-to-one relation.
Two columns j and j′ have a one-to-many relationship if and only if:
for any pair i and i′:t(i,j′)=t(i′,j′) implies that t(i,j)=t(i′,j) (3)
but not necessarily the other way around. Said another way, the one-to-many relationship is defined for the first column j and the second column j′ as:
if t(i,j′)=t(i′,j′) then t(i,j)=t(i′,j) (4)
where i and i′ are different rows of the table. For example, in a table with information on persons, there is a one-to-many relation from the column with country name to the column with city name. If two persons live in the same city, then their country will also be the same, assuming for the sake of argument that city names are unique in the world.
To infer a possible one-to-one or one-to-many relation (identified generically herein as one-to-N relation, where N can be “one” or “many”), an undirected, weighted bipartite graph G=(Vj,Vj′,E) is used, where Vj denotes a node set on the left side of the graph defining the values that occur in column j, and Vj′ denotes a node set on the right side of the graph defining the values that occur in column j′. The set E of edges only connect a node from Vj with a node from Vj′, i.e., there are no edges between two nodes from Vj and no edges between two nodes from Vj′. There is an undirected edge e=(u, v)∈E between u∈Vj and v∈Vj′ if and only if the corresponding values occur together in one or more rows of table T. Both nodes and edges have weights whose value is an integer. More particularly, each node u∈Vj has a weight w(u) denoting the number of times that the corresponding value occurs in column j. Each node v∈Vj′, has a weight w(v) denoting the number of times that the corresponding value occurs in column j′. Each edge e=(u, v) has a weight w(e), also represented here as weight w(u, v), denoting the number of rows in which the values corresponding to u and v co-occur, where by definition w(u, v)=0 if (u, v)∉E.
Put another way, the weighted bipartite graph has: a first (i.e. left) part comprising nodes (Vj) representing values occurring in the first column (j) weighted by counts (i.e. weights w(u)) of the occurrences of the respective values in the first column; a second (i.e. right) part comprising nodes (Vj′) representing values occurring in the second column (j′) weighted by counts (i.e. weights w(v)) of the occurrences of the respective values in the second column, and edges (E) connecting nodes of the first part and nodes of the second part having weights (w(u, v)) corresponding to counts of co-occurrences of the respective values represented by the connected nodes in the respective first and second columns. In one approach for the pairwise columns comparison, such a weighted bipartite graph representing the first and second columns is generated; and violations of the one to N relationship for the data of the first and second columns are detected using the bipartite graph.
It is not difficult to see that for each node u∈Vj the following holds:
And that for each node v∈Vj′ the following holds:
With reference now to
Furthermore, let the degree of a node of a weighted bipartite graph be defined as the number of edges that are connected to the node. If a null value (usually interpreted as a missing value) is interpreted as just one of the possible entry values, then the degree of each node is at least one, since each value occurring in one of the columns must have at least one matching value in the other column.
The following definitions are employed: a fully one-to-one relation between columns j and j′ is present if all nodes in graph G have a degree one. A fully one-to-many relation from column j to column j′ is present if all nodes in Vj′ (i.e., the node set corresponding to values in column j′) have a degree one.
To infer approximate (i.e. mostly) one-to-one or one-to-many relations, let m(u) denote the maximum weight of an edge connected to node u∈V. For node v∈Vj′, let m(v) denote the maximum weight of an edge connected to node v∈Vj′. By definition, m(u)≤w(u) and m(v)≤w(v) and for a fully one-to-one relation it follows that m(u)=w(u) and m(v)=w(v) for all u∈Vj and v∈Vj′. Furthermore, for fully one-to-many relations from column j to column j′ it follows that m(v)=w(v) for all v∈Vj′.
For approximately one-to-one or one-to-many relations, the number of values of the total n values that occur in a single column, say column j, that violate the relation can be counted by the following expression, called the number of violations nov(j) of column j:
nov(j)=Σu∈V
Likewise, for column j′, the count of violations is given by:
nov(j′)=Σv∈V
For a table with n rows, for both columns j, j′ the following respectively hold: nov(j)<n and nov(j′)<n. For an approximate (i.e. mostly) one-to-one relation, the number of violations for both columns should be counted, and the relative (i.e. normalized) number of violations can be defined by:
where the value ½(nov(j)+nov(j′)) is normalized by the total number of rows (n) in the table 30. For an approximate (i.e. mostly) one-to-many relation from column j to column j′, it is not required that the nodes in U all have a degree one, so only the violations for column j′ are relevant. Hence for a one-to-many relation the relative number of violations is given by:
where the value nov(j′) is normalized by the total number of rows (n) in the table 30.
To quantitatively define a “mostly” one-to-one relation, a threshold T1-1 is specified, and there is a (mostly) one-to-one relation between columns j and j′ if and only if:
In a similar fashion, to quantitatively define a “mostly” one-to-many relation, a threshold T1-many is specified, and there is a (mostly) one-to-many relation between columns j and j′ if and only if:
In Equations (11) and (12) different thresholds T1-1 and T1-many are employed. However, these may optionally be the same value, i.e. T1-1=T1-many is a possibility. The above approach to identify approximate one-to-one and one-to-many relations is an illustrative embodiment. Alternative embodiments can be employed for quantitatively defining (mostly) one-to-one or (mostly) one-to-many relations, such as, for example, counting the average degree (minus one) of nodes to establish a relative number of violations.
By way of illustration, for the weighted bipartite graph for columns 1 and 2, shown in
Similarly, for the weighted bipartite graph for columns 1 and 3, shown in
In this way, given a number of pairs of columns for which a one-to-one or a one-to-many relation holds approximately, the violations 40, 50 can be identified and raised as possible data quality issues. In a dashboard (or other UI 52), the suspected type of relation that holds approximately can be marked, so that a subject matter expert can indicate whether or not this relation is true. And if so, the violations can be inspected by the subject matter expert and explicitly marked as data quality issues. Note that in this way, a subject matter expert is only bothered with questions about potential one-to-one or one-to-many relations, whenever there are potential exceptions for these relations. In this way, data quality issues can be identified in an early stage so that actions can be taken to further pollute the contents of the database.
In some embodiments, the UI 52 optionally presents one or more solutions to identified data quality issues (i.e., solutions for the detected violations 40, 50). For example, if it has been established that there is (likely) to be a one-to-one relation between columns j and j′, then if value uk in column j almost always co-occurs with value vl in column j′, then for the few exceptions where it co-occurs with value vm in column j′ the UI 52 suitably suggests to change values vm into vl for the cases where they co-occur with uk in column j. Similar suggestions can be made for one-to-many relations.
Furthermore, once a relation has been confirmed by a subject matter expert, it can be treated with high priority in subsequent data quality checks.
With reference now to
Additionally, the approach is modified when comparing columns across tables by the normalization factor. As noted with reference to Equations (9) and (10), the count of the violations is normalized by the total number of rows (n) in the table 30. When comparing columns across two tables 301, 302, on the other hand, the count of the violations is normalized by the total number of corresponding rows of the first and second tables 301, 302. For example, the normalization factor is 2 in the example of
As a further variant, the disclosed approach can be extended to more than two columns. For example, column pairs can be compared to a third column to determine if there is a 1-1 or 1-N correlation. Such second order analysis might help further narrow the exceptions, find new exceptions, or help in root cause analysis methodologies. Hence, this second order analysis entails performing further comparisons of columns of at least one table 30, 301, 302 to associate a first pair of columns and a third column with a one-to-N relationship (e.g., a 1-1 relationship or a 1-many relationship) by detecting violations of the one-to-N relationship for data of the first pair of columns and the third column and determining that a count of the violations of the one-to-N relationship for the first pair of columns and the third column is less than a threshold. Various approaches could be used to define a one-to-N relationship in this context. In one approach, a match can be found for a row if a 1-1 (or 1-N) relation holds between either column of the pair and the third column. In another approach, a match can be found for a row if a 1-1 (or 1-N) relation holds between each column of the pair and the third column.
The invention has been described with reference to the preferred embodiments. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the exemplary embodiment be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2020/067198 | 6/19/2020 | WO |
Number | Date | Country | |
---|---|---|---|
62866679 | Jun 2019 | US |