The subject matter disclosed herein relates to the accuracy and completeness of data records drawn from historical and current sources.
Data may be collected and stored for numerous industrial, commercial, and personal applications. For example, routine transactions may generate various types of data or new data points in an ongoing sequence. Such data may in turn be reviewed, evaluated, and used in various decision making processes, such as maintenance or repair tracking or planning in a building or vehicle context, budgetary planning, financial forecasting, or regulatory compliance and planning.
Inaccurate and incomplete data, however, may result in errors in these various processes or, more generally, may result in inaccurate conclusions being drawn, improper actions being taken, or proper action not being taken. Such data problems may result from various sources, such as a set of data being incomplete, data points being recorded inaccurately, or data points being improperly characterized or categorized. These types of errors may arise in historical data or in data being collected currently or contemporaneously, and may arise in both fixed choice and free text data collection methodologies.
In one embodiment, a computer-implemented method is provided for processing data. The method includes the acts of accessing a data record and performing a text mining operation on the data record using seeds derived from a semantic template encompassing the data record. One or more fields of a data instance are populated using data elements derived from the analysis of the data record by the text mining operation. The data instance is based on the semantic template. The data instance is then updated based on semantic rules defined by the semantic template. The seeds are updated, and the steps of performing the text mining operation, populating the one or more fields of the data instance, and updating the data instance based on the semantic rules are iterated to generate a final data instance.
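Purely as a hedged, non-limiting illustration of the iterative flow summarized above, a minimal sketch might be written as follows; the function names, the toy keyword scoring, and the flat dictionary representation of the data instance are assumptions made for illustration rather than the claimed implementation.

```python
# Minimal sketch (illustrative only) of the iterative loop: text mine with seeds,
# populate the instance, apply semantic rules, and repeat until nothing changes.
def text_mine(record, seeds):
    """Toy text mining: score each seed word by whether it appears in the record."""
    tokens = record.lower().split()
    return {field: (word, 0.9 if word in tokens else 0.0)
            for field, word in seeds.items()}

def apply_rules(instance, rules):
    """Apply each semantic rule; a rule takes an instance dict and returns a copy."""
    for rule in rules:
        instance = rule(dict(instance))
    return instance

def build_instance(record, seeds, rules, threshold=0.5, max_iter=5):
    instance = {field: None for field in seeds}
    for _ in range(max_iter):
        for field, (value, prob) in text_mine(record, seeds).items():
            if instance[field] is None and prob >= threshold:
                instance[field] = value
        updated = apply_rules(instance, rules)
        if updated == instance:   # nothing changed since the last pass: converged
            return updated
        instance = updated
        # A fuller sketch would also re-derive the seeds from the updated instance here.
    return instance

seeds = {"product_type": "microwave", "contact_name": "smith"}
print(build_instance("Mr. Smith reported a broken microwave", seeds, rules=[]))
# {'product_type': 'microwave', 'contact_name': 'smith'}
```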
In a further embodiment, a data processing system is provided. The data processing system comprises a memory storing one or more routines; and a processing component configured to communicate with the memory and to execute the one or more routines stored in the memory. The one or more routines, when executed by the processing component, cause acts to be performed comprising: accessing a data record related to a transaction; accessing a set of seeds derived from a semantic template that describes the transaction; text mining the data record using the set of seeds; populating one or more fields of a semantic instance using data elements identified in the data record by the text mining, wherein the semantic instance is based on the semantic template and wherein the one or more fields are populated based upon probabilities generated by the text mining; and analyzing the semantic instance based on one or more semantic rules associated with the semantic instance to validate the populated one or more fields of the semantic instance.
In an additional embodiment, one or more non-transitory computer-readable media are provided encoding one or more processor-executable routines. The one or more routines, when executed by a processor, cause acts to be performed comprising: accessing a data record related to a transaction; accessing a semantic template, derived from a plurality of representative transactions, that describes the transaction; and generating a data instance corresponding to the data record by iteratively: performing statistical text mining of the data record using seeds derived from the semantic template; and analyzing the data instance using one or more semantic rules derived from the semantic template.
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings, in which like characters represent like parts throughout the drawings.
Inaccurate and incomplete data can result in errors in the conclusions drawn from that data. That is, poor data quality can result in poor decision making, whether in an automated context (where a computer or other machine is provided the data and takes a corresponding action) or a human actor context. Such inaccurate and incomplete data may be present in data collected by “fixed choice” mechanisms (e.g., “check the box”) or by “free text” data entry, where a user types or writes a free form entry or record. However, as will be appreciated, “free text” data entry can introduce much more variability and uncertainty than data entry using fixed fields or options. As discussed herein, aspects of the present approach would drive the usability of free text fields to approach that of fields filled in by drop-down or radio box methods, and would facilitate and improve subsequent decision-making, automated or otherwise. Likewise, inaccurate and incomplete data may be present in historical data contexts, where data may be collected or translated from paper or other media, as well as in contemporaneous or real-time data collection. Aspects of the present approach may be used to improve archived or existing data records as well as to facilitate or improve contemporaneous data collection. While certain examples and discussions within the present disclosure may relate to the processing of free text data fields for the reasons noted above, it should also be appreciated that the present approaches may be used to improve the quality and accuracy of data acquired in non-free text contexts, such as where the data entry options are limited to specific values or choices. For example, semantic templates, as discussed herein, may also be used to merge data sources that are not free text to improve the quality of data and data collection in these contexts as well.
The presently disclosed approaches relate to cleaning and controlling the accuracy of electronic data surrounding transactions (e.g., business transactions) or other similarly structured events. In particular, as discussed herein, the structure of the events is captured as semantic model templates, and instances of these models are created from the data. Fields with missing entries can be highlighted for acquisition or entry of the missing data. The content of fields with questionable or ambiguous data may be flagged for review and/or more suitable contents for the fields can be suggested. The electronic data to be processed may have been acquired at different times, i.e., may have varied temporality, corresponding to archived events, historical events, finalized and completed current events, or currently occurring events with some components yet to happen in the future. For cleaning and accuracy control of historical data, the approaches discussed herein may be exercised periodically as additional new data becomes available, more extensive semantic templates are generated, or improved algorithms become available. For data arriving in real time, such as by conversation or simultaneous data entry, repeated application of the approach creates a converging semantic solution that can be used immediately for reasoning or classification purposes.
In addition, in one implementation the present approach addresses the generally low quality of data entered by humans in free text data input situations, such as may be found in customer service requests, for example. One instantiation of this approach would suggest wording improvements and sentence structure changes to people entering data into an otherwise free text field. This results in normalized data structures that flow smoothly into the event model discussed herein.
With the foregoing in mind, a general description is provided below of suitable electronic devices that may be used in the implementation of the present approaches to improve the accuracy or completeness of acquired data. In particular,
As will be appreciated, the various functional blocks shown in
With the foregoing discussion of suitable systems in mind, the present disclosure relates to the generation and use of a set of semantic models (i.e., templates) that describe a typical or generic instance of an event and encompass the various possible data that may be entered with respect to the event. For example, in one implementation, to construct a representation of a past, a present, or a future event, an iterative solution may be employed that begins with the creation of a set of semantic models (templates) describing a comprehensive instance of a generalized event and all the available data describing that event. As used herein, the templates are semantic models describing the events (e.g., transactions) to which the data refers.
This process is depicted graphically by the flowchart 50 of
Turning to
Initial instances 74 based on the semantic templates 56 may then be created. For example, fields containing one of a fixed number of entries can be mapped to the instances 74 by copying the contents from the original data (or from another comparable set of data) into the instances 74. In the depicted embodiment, free text fields in the data (e.g., transactions 52) are mapped using statistical text mining techniques that use the n-tuples extracted from the original semantic templates 56. In such an implementation, text mining techniques may structure the free text field within the transactions 52 for later semantic processing and may use information from the free text field to populate individual fields of the instance.
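A hedged sketch of this mapping step is given below; the "fixed"/"free_text" labels, the single free text field, and the candidate phrase lists are illustrative assumptions rather than features of any particular embodiment.

```python
# Illustrative only: fixed-choice fields are copied verbatim into the instance,
# while the free text field is mined against candidate phrases from the template.
def mine_free_text(text, candidates):
    """Toy text mining: return the first candidate phrase found in the text."""
    lowered = text.lower()
    for phrase in candidates:
        if phrase.lower() in lowered:
            return phrase
    return None

def create_initial_instance(transaction, field_specs, candidates_by_field):
    instance = {}
    for field, spec in field_specs.items():
        if spec == "fixed":
            instance[field] = transaction.get(field)          # copy the fixed entry
        else:  # "free_text"
            instance[field] = mine_free_text(
                transaction.get("free_text", ""), candidates_by_field.get(field, []))
    return instance

transaction = {"status": "open", "free_text": "Customer says the dishwasher leaks."}
specs = {"status": "fixed", "product_type": "free_text"}
candidates = {"product_type": ["microwave", "dishwasher", "refrigerator", "range"]}
print(create_initial_instance(transaction, specs, candidates))
# {'status': 'open', 'product_type': 'dishwasher'}
```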
Next, semantic reasoning (block 80) is applied to the instance 74 populated using the statistical text mining algorithms. The semantic reasoning uses known rules or logic to evaluate the data fields filled in by text mining, to identify incongruous data fields, and to validate the remaining data fields. The semantic reasoning may also suggest contents for other data fields of the semantic instance 74 as constructed so far. Once semantic reasoning is done, a check may be performed to see if anything has changed (block 82) since the last semantic reasoning on the instance 74. If so, the process may be iterated by extracting (block 70) an updated set of n-tuples from the semantic instance 74. These new n-tuples drive the biases that the statistical approach uses to generate the most likely matches, and the biases converge along with the instance 74 to drive toward the best solution. In one such implementation, the text mining techniques are driven by n-tuple structures created from the evolving instance 74 or, in the case of the first iteration, from the original semantic template 56.
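As a hedged illustration of this reasoning-and-convergence step, semantic rules might be modeled as simple functions over the instance, with the change check deciding whether another iteration is warranted; the specific rule shown is a hypothetical example only.

```python
# Illustrative only: each rule takes an instance dict and returns a (possibly
# corrected) copy; the change flag corresponds to the "anything changed?" check.
def semantic_reason(instance, rules):
    updated = dict(instance)
    for rule in rules:
        updated = rule(updated)
    return updated, updated != instance

def rule_year_is_four_digits(instance):
    """Retract a year entry that is not a plausible four-digit year."""
    year = instance.get("date_year")
    if year is not None and not (str(year).isdigit() and len(str(year)) == 4):
        instance["date_year"] = None
    return instance

print(semantic_reason({"date_year": "20x5", "contact_name": "Smith"},
                      [rule_year_is_four_digits]))
# ({'date_year': None, 'contact_name': 'Smith'}, True)
```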
In certain implementations, text mining techniques may also be guided by semantic templates 56 and may allow some degree of structure to be imposed on or implied from a set of unstructured data (e.g., free text fields, and so forth). For example, words, word pairs, and/or n-tuple data structures 72 identified by analysis of a semantic template 56 may be used for text mining of data acquired in an unstructured form, such as to identify likely structured relationships within the data that can be leveraged in subsequent analysis. For example, n-tuple data structures for use in text mining may be taken or derived from patterns within the semantic structure. These data structures derived from the semantic template 56 (or from the instance 74) and used for text mining may be simple listings of paired word relationships or may be more complex patterns that may represent semantic structures themselves. By way of example, for each field in an instance 74 generated based on a set of unstructured data, such as a free text field, a distribution of likely entries may be constructed, along with associated probabilities or rankings. The most likely entry exceeding a threshold likelihood may be entered into the respective field of the instance 74. In such an example, a confidence score or other likelihood indicator may be displayed in conjunction with a field entry determined in this manner.
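A hedged sketch of the per-field distribution and threshold just described, assuming the candidate-to-probability mapping has already been produced by whatever text mining is employed:

```python
# Illustrative only: choose the most likely candidate for a field, enter it only
# when it exceeds a threshold, and keep the probability as a confidence score.
def select_entry(candidate_probs, threshold=0.6):
    """candidate_probs maps candidate entries for one field to probabilities."""
    if not candidate_probs:
        return None, 0.0
    best, prob = max(candidate_probs.items(), key=lambda item: item[1])
    return (best, prob) if prob >= threshold else (None, prob)

print(select_entry({"microwave": 0.72, "dishwasher": 0.18, "range": 0.10}))
# ('microwave', 0.72) -- entered, with 0.72 displayed as the confidence score
print(select_entry({"microwave": 0.41, "dishwasher": 0.39}))
# (None, 0.41) -- below the threshold, so the field is left open for review
```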
With the foregoing in mind,
As depicted in
In one implementation, one or more of the fields 94 may be characterized by the type or structure of data that may be entered, such as text strings, numeric strings, e-mail addresses, number strings formatted as or having characteristics of a date or phone number, and so forth. Such constraints may be useful in parsing or generating the n-tuples for text mining and/or for parsing data into the template 56 or assigning probabilities to unstructured data for which a structure is being derived.
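By way of a hedged illustration only, such type or structure constraints might be expressed as simple patterns against which a candidate entry is checked; the field names and patterns below are assumptions made for this sketch.

```python
import re

# Illustrative field-type constraints: a candidate entry must match the pattern for
# its field before it is accepted into the instance or used to bias the text mining.
FIELD_PATTERNS = {
    "contact_email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "contact_phone": re.compile(r"^\+?[\d\-() ]{7,15}$"),
    "date_year":     re.compile(r"^\d{4}$"),
}

def matches_field_type(field, value):
    pattern = FIELD_PATTERNS.get(field)
    return True if pattern is None else bool(pattern.match(value))

print(matches_field_type("contact_email", "j.smith@example.com"))  # True
print(matches_field_type("date_year", "twenty-fifteen"))           # False
```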
In the depicted example, sample data 104, such as from representative transactions 52, may be associated with particular fields 94. For example, under the “type” field within the product data structure 98, various examples are listed (i.e., microwave, dishwasher, refrigerator, range) which may be derived from representative transactions that have been structured in accordance with the template 56. Similarly, other fields 94 are shown having representative data 104 (e.g., contact title, date day, date year, contact name, and so forth).
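As a hedged aside, one illustrative way such representative values could be turned into simple word-pair n-tuples for the text mining discussed herein is sketched below; the particular pairing scheme is an assumption made for illustration only.

```python
from itertools import combinations

# Illustrative only: pair each field name with its representative entries, and pair
# the representative entries with one another, to form simple word-pair n-tuples.
def derive_ntuples(field_samples):
    """field_samples maps a field name to a list of representative entries."""
    ntuples = set()
    for field, samples in field_samples.items():
        for sample in samples:
            ntuples.add((field.lower(), sample.lower()))
        for a, b in combinations(sorted(samples), 2):
            ntuples.add((a.lower(), b.lower()))
    return ntuples

samples = {"type": ["microwave", "dishwasher", "range"], "contact_title": ["Mr", "Ms"]}
for pair in sorted(derive_ntuples(samples)):
    print(pair)   # e.g. ('type', 'microwave'), ('dishwasher', 'range'), ('mr', 'ms'), ...
```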
Such representative data 104 may be useful in generating n-tuples, as discussed herein. For example, turning to
With the foregoing in mind,
Turning to
In particular, the “date” data structure 102 and associated fields and the “contact” data structure 100 and associated fields are partially populated based on the call log record data to generate the example instance 130, which includes a date instance 132 and a contact instance 134. Data derived from the call record 120 and used to populate the instance 130 is shown as entered data 136. In the depicted example, an improperly assigned data element 140 is also shown in the instance 130. In particular, the data element “microwave” was incorrectly specified as a contact address based on the probabilities generated by the text mining operation.
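A hedged sketch of the kind of corrective semantic rule discussed next, which could retract such an entry, is given below; the field names and the product list are assumed for this example only.

```python
# Illustrative only: a known product type mined into the contact address field is
# retracted and instead suggested as the product type for a product instance.
KNOWN_PRODUCT_TYPES = {"microwave", "dishwasher", "refrigerator", "range"}

def correct_misassigned_product(instance):
    address = instance.get("contact_address")
    if address and address.lower() in KNOWN_PRODUCT_TYPES:
        instance["contact_address"] = None             # retract the incongruous entry
        instance.setdefault("product_type", address)   # suggest it as the product type
    return instance

print(correct_misassigned_product({"contact_address": "microwave", "contact_name": "Smith"}))
# {'contact_address': None, 'contact_name': 'Smith', 'product_type': 'microwave'}
```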
In certain embodiments, however, semantic rules derived from the semantic template 56 may, subsequent to the text mining operation, evaluate the initial instance 130 to address such errors and to thereby generate an improved instance 144. In this example, turning to
As will be appreciated, however, the relationship between the respective date instance 132, contact instance 134, and product instance 142 is still undefined. In this example, turning to
Based on the results of the additional round of text mining, and turning to
As will be appreciated from the preceding discussion and example, the present approach may be used to process and improve data sets that can be described by a semantic model. Typical of this class of data sets are business events (e.g., order to remittance), sales transactions, or inspection records. Further, processing of this event data in accordance with the present approaches would also allow the content of free text fields to be normalized within the context of a larger semantic model. Additionally, as discussed herein, the present approach may be implemented as an iterative process, where the extent of the semantic instance grows with the information added by each iterative use of statistical text mining, and the power of the text mining extends through the quality of the n-tuple set and the reasoning biases extracted from the semantic instance.
Of note, the present approach allows data improvement using both semantic reasoning and statistical modeling. The hypotheses generated as a result of such a hybrid technique are better than those derived using either method alone. In particular, by combining semantic reasoning with statistical modeling, a level of certainty is captured that covers cases in which few examples are available for training the statistical models. Conversely, by combining statistical modeling with semantic reasoning, non-obvious relations can be identified by the statistical models. The iterative application of both approaches allows hypotheses to be created, validated, and retracted.
In practice, the present approach can be fully automated and embedded in other applications. The approach is data-agnostic, and through the use of text mining may process both fixed and free-text fields. Further, the present approach may be implemented in various manners, such as by embedding in batch programs or GUIs.
Technical effects of the invention include the use of both semantic and statistical models to improve the completeness and accuracy of data instances.
This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.