The present disclosure relates generally to the field of computer-based question-and-answer (QA) systems. More specifically, the present disclosure relates to partial information retrieval using data provenance techniques.
In today's information-laden world, consumers of data have unlimited and unfettered access to minutiae, as well as to broad general principles. However, the veracity, reliability, and accuracy of that data in decision-making can be problematic. By its very nature, digital data is an incomplete approximation of a natural phenomenon, and statistical or quantified methods of analysis produce approximations at some level of granularity. Further, logical inference structures are approximations of sentient behavioral processes. As a result, questions answered through digital processes can be unreliable because neither the granularity of the data, nor the path through which it passes (that is, its provenance), can be directly examined. Consequently, there is a significant need to track data provenance using rules on domain-dependent information in a manner that captures not only where data comes from, but also the granularity of the data reasoned upon.
The importance of being able to gather and examine data provenance, as well as to allow decision making when data is incomplete, can be of significant use in particular domains. For example, in the field of computer-based health maintenance journals, it is beneficial to provide patients with a vehicle through which to answer specific questions of their healthcare providers about whether they have adhered to a course of action involving medication and lifestyle activities. This problem requires monitoring daily activity including diet, exercise, sleep, relaxation, and medication to produce direct outcomes such as acceptable blood sugar and cholesterol level, weight, and aerobic capacity. Input information can come from a variety of sources, such as direct sensors (e.g., scales), inferential sensors such as blood pressure cuffs, glucometers, pulse oximeters, and databases such as the U.S. Food and Drug Administration (FDA) Food Source database. Data can also come from human- or machine-interpreted nutrition labels.
However, two major information retrieval problems are critical to developing a useful health journal software application: (1) how the daily data is recorded; and (2) the accuracy of the recording. Small errors in detail based on assumptions (inferences) rather than fine-grained information can accumulate into misinformation. Moreover, in the context of question answering, such errors can produce a wrong answer. Furthermore, inaccurate targets, such as daily calorie consumption, can set patients up for failure. Additionally, an adherence problem emerges as a result of the foregoing problems. Patients give up on tracking their diets and taking medication, and often rebel against activity trackers because the mapping of data collection to results is at the wrong level of granularity (triggering the wrong inference set that conflicts with the results given the level of granularity). The goal of a computer-based health maintenance journal therefore poses a challenge of balancing the reliability of the input data with the expected outcomes, without overburdening the patient with details.
Other domains where data provenance is of importance include, but are not limited to, natural language question answering (e.g., chatbots, chatbot platforms, etc.), Internet search, generative artificial intelligence, real-time data acquisition and process control, sporting events, and medical diagnosis and treatment. Still further, generative artificial intelligence systems (large language models) currently fail to track data transformations, and tend to create opacity rather than transparency in their reasoning. Further, their heuristics, rather than being transparently represented within a domain, are instead produced using trained learning systems, almost all of which are derivatives of neural networks. However, the input data (e.g., the training data set) and the training process are often not documented, which can lead to serious “hallucinations” within such systems. Additionally, while semantics-based natural language processing (NLP) question-and-answer (QA) systems exist, such systems typically require exact knowledge representation or extremely fine-grained coded heuristics, and do not take into account domain knowledge.
Accordingly, what would be desirable are systems and methods for partial information retrieval using data provenance techniques which address the foregoing and other needs.
The present disclosure relates to systems and methods for partial information retrieval using data provenance techniques. The system includes a partial information retrieval processor that executes an event trigger software agent which identifies an event, a query listener agent which generates a query in response to the identified event, and a partial information retrieval agent which processes the query in accordance with one or more modular domain heuristic data structures and generates response data that includes provenance information. Optionally, the system can include a knowledge base updating agent which updates a knowledge base (e.g., a collection of one or more modular domain heuristic data structures) using the response data, as well as a response generator agent which generates a human-readable (e.g., text or image) response based on the response data. Advantageously, the system allows for the generation of natural language answers to questions in circumstances where only partial information is available, such as partially-identified question types or question contexts. A visualization user interface is also provided, which allows for visualization of partial information retrieval outcomes.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present disclosure relates to systems and methods for partial information retrieval using data provenance techniques, as discussed in detail below in connection with
The processor 12 could include one or more suitable computer systems programmed in accordance with the present invention, such as a personal computer, server, cloud computing platform, or other suitable processor. The processor 12 could be programmed using any suitable high- or low-level programming language, including, but not limited to, C, C++, Java, JavaScript, Python, or other suitable programming language. Additionally, the processing steps disclosed herein could be embodied as computer-readable instructions stored in a non-transitory, computer-readable medium provided in, or in communication with, the processor 12. The network 16 could be any suitable computer network including, but not limited to, a local area network, wide area network, an intranet, the Internet, a cellular data communications network, or any other suitable computer network. The end-user computing device 18 could include, but is not limited to, a personal computer, a tablet computer, a laptop computer, a mobile computing device, a smart cellular telephone, or any other suitable computing device. Additionally, it is noted that the functions performed by the processor 12 could be performed by the end-user computing device 18. Still further, the end-user computing device 18 could include a smart watch, and/or the queries generated by the computing device 18 could alternatively (or, additionally) be generated by one or more sensors, sensor agents that ask questions (e.g., in a form that is not intuitive to a human being), an Internet-of-Things (IoT) device, an artificial intelligence (AI) platform and/or chip, or any other device/platform/service that can generate a question.
The event trigger agent 22 monitors for an event that is suitable for a question-and-answer cycle. Such triggers could include, but are not limited to, information generated by a sensor (such as an electro-mechanical sensor, an audio input, a gesture input, an environmental sensor, etc.), a query generated by a user (e.g., using one of the end-user computing devices 18 of
The query listener agent 24 can be instantiated by a recognized event trigger (e.g., by the event trigger agent 22), and is a listener software object that has baseline information about its triggers and can also include data provenance information. For example, if the agent 24 is instantiated in response to an event generated by a sensor (and detected by the trigger agent 22), it can keep track of information relating to the sensor, such as sensor type, manufacturer name, model number, serial number, etc. If the agent 24 is instantiated in response to a cloud-based event detected by the agent 22, it can keep track of information such as a Uniform Resource Locator (URL) link associated with the cloud-based event, an indication of whether a website corresponding to the URL link is a government database, or is associated with a non-profit or commercial entity. The agent 24 records the time of the event detected by the agent 22 and parses event data into the query data structure 26.
The query data structure 26 includes data relating to a question type, a question context, and provenance data (either created by the agent 24 or added to a provenance map that the agent 24 may have received). The question type is domain-dependent and can be as specific or broad as an application domain may require. It provides the query instruction type for accessing domain knowledge structures. For example, the question type could be as specific as a formal Structured Query Language (SQL) or more general (e.g., a text stream), depending on the domain. The question context provides the focus of the query, and can be represented as an attribute of an attribute value pair, where the result returned by the system is the value. Similar to the question type, the question context can be a text stream to focus the response of a large language model. The provenance data can be a map of the processes through which the input data (e.g., event data generated by the event agent 22) is transformed into output (e.g., response structure), including the input into each process in the map. The map can be organized by granularity (much like a physical map, where high-level countries are shown, followed by states, cities, etc.). The provenance data is utilized by the partial information retrieval agent 28 to produce a response. Simplified provenance maps can be single bundles of event data, originating sensor information, and time information. The granularity of the map can be dependent upon the requirements of the domain database search techniques utilized by the system. Additionally, the provenance map can be fully or partially encrypted to prevent tampering, and a key can also be included that only allows for decryption of the map by processes that require the map for decision-making purposes.
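By way of a non-limiting sketch, the query data structure described above could be modeled as follows; all class, field, and value names here are assumptions for illustration, not part of the disclosure:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProvenanceEntry:
    """One step in the provenance map: a process and the inputs it received."""
    process: str                    # e.g., "sensor_event", "fda_lookup"
    inputs: dict                    # input data transformed by this process
    granularity: int                # 0 = coarsest, higher = finer detail
    timestamp: Optional[float] = None

@dataclass
class QueryDataStructure:
    question_type: str              # e.g., an SQL statement or a text stream
    question_context: str           # the attribute whose value is sought
    provenance: list = field(default_factory=list)

# A simplified provenance map: a single bundle of event data,
# originating sensor information, and time information.
query = QueryDataStructure(
    question_type="How many Y in X",
    question_context="carbs",
    provenance=[ProvenanceEntry(
        process="sensor_event",
        inputs={"sensor_type": "scale", "manufacturer": "AcmeSensing"},
        granularity=0,
        timestamp=1698057600.0,
    )],
)
```

A fuller implementation could nest `ProvenanceEntry` objects by granularity, much like the physical-map analogy above, and encrypt the bundle before transmission.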
The partial information retrieval agent 28 receives and processes the query data structure 26 and creates the response data structure 32, and its behavior is controlled by the modular, domain-specific heuristics data structure 30. A detailed description of the processes carried out by the agent 28 is provided below in connection with
The modular domain heuristic data structure 30 includes information relating to domain knowledge, domain heuristics, and partial information retrieval heuristics. These components are defined at levels of specificity necessary to determine the sufficiency of a query structure, and can be embodied as knowledge objects in software. Additionally, the domain knowledge, domain heuristics, and partial information retrieval heuristics can be implemented as processes ranging from fixed selection statements (“if . . . then” conditional logic) in software code, to machine learning architectures. It is noted that the data structure 30 is modular in nature, in that, depending on the domain being processed by the system, one or more customized data structures 30 could be utilized and/or substituted by the agent 28. For example, if the agent 28 is processing a query in the domain of food data, a specific data structure 30 with its own customized heuristics germane to food data could be utilized by the agent 28 to develop a query response (the heuristics of the data structure 30 controlling operation of the agent 28), whereas if the agent 28 needs to process a future query in a completely different domain (e.g., well data, cemetery data, library data, etc.), then a different, customized (modular) data structure 30 could be utilized by the agent 28 with its own set of heuristics in order to control operation of the agent 28 and development of a response tailored to that domain.
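As a non-limiting sketch, the modularity described above could be implemented as a registry keyed by domain, so that the retrieval agent swaps in the customized heuristics module matching the current query's domain (the class, function, and registry names are illustrative assumptions):

```python
class DomainHeuristics:
    """Hypothetical modular bundle of domain knowledge and heuristics."""
    def __init__(self, domain, knowledge, rules=None):
        self.domain = domain
        self.knowledge = knowledge      # e.g., {"apple": {"carbs": 34}}
        self.rules = rules or []        # callables applied by the agent

    def lookup(self, context):
        return self.knowledge.get(context)

# Registry of modular data structures; the agent selects one per domain.
REGISTRY = {}

def register(module):
    REGISTRY[module.domain] = module

def heuristics_for(domain):
    """Substitute the customized (modular) structure for the current query."""
    return REGISTRY[domain]

register(DomainHeuristics("food", {"apple": {"carbs": 34}}))
register(DomainHeuristics("library", {"moby dick": {"shelf": "F-MEL"}}))
```

Under this sketch, a food-domain query and a library-domain query are answered by entirely separate knowledge and rule sets, without changing the agent itself.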
The domain knowledge can be encoded in a form appropriate for a domain, ranging from a simple lookup to complex primitive and composite knowledge structures. The information encoded in the data structure 30 can be incomplete or extraneous, thereby mitigating the need to “clean” or enhance the information. For example, a data object representing an apple might include detailed information about the nutritional value of a medium apple. The provenance of the data may be from a commercial website for weight management. Alternatively, an entry for ‘apple’ might include both the ‘medium’ apple as well as the full details for 100 grams of apple from the FDA Food Source database. Alternatively, the domain knowledge could be as simple as “apple, carbs, 34.”
The domain heuristics of the data structure 30 comprise rules or inferences that can be applied to the domain structure to determine whether there is sufficient or insufficient data available to produce a response to a query. For example, a heuristic may be encoded that states that if there is only one entry, such as “apple, carbs, 34,” then the system should use such entry and note the provenance associated with the entry (e.g., indicating that the entry came from a user's journal entry, for example), but if there is an FDA Food Source also available, then the system should also use that source in developing a response and include both sources in the provenance for later disambiguation.
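The single-entry heuristic described above might be sketched as follows; the entry format and function name are assumptions for illustration:

```python
def apply_entry_heuristic(entries):
    """Sketch of the heuristic above. `entries` is a list of dicts such as
    {"value": 34, "source": "journal"} for one context (e.g., apple carbs).
    Returns the selected value and the provenance sources to record."""
    sources = [e["source"] for e in entries]
    if len(entries) == 1:
        # Only one entry: use it and note its provenance.
        return entries[0]["value"], sources
    fda = [e for e in entries if "FDA" in e["source"]]
    if fda:
        # An FDA Food Source entry is also available: use it, and keep
        # both sources in the provenance for later disambiguation.
        return fda[0]["value"], sources
    return entries[0]["value"], sources
```

For instance, a lone journal entry `{"value": 34, "source": "journal"}` is used as-is, while the presence of an FDA entry causes both sources to be carried forward in the provenance.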
The partial information retrieval heuristics of the data structure 30 comprise a set of rules that can be tailored to a domain, and control operation of the agent 28 to generate a response to a query generated by the query listener agent 24. They provide general principles for deciding whether the question type and question context are recognized, and the set of possible results returned can be bundled into either a successful response or an unsuccessful response. Processing steps carried out by the agent 28 are described in more detail below in connection with
The response data structure 32 bundles all of the information collected by the agent 28 into a domain-specific data structure, and is processed by both the knowledge base update agent 34 and the response generator agent 36. Included in the structure 32 is an indication of whether the response is successful or unsuccessful in answering the query generated by the query agent 24.
The knowledge base update agent 34 is an optional component of the system that can update one or more persistent stores (e.g., update one or more of the modular domain heuristic data structures 30), either in real time or through a deferred batch process, and with or without human supervision. Using this agent 34, the system can develop an up-to-date repository of information relating to domain-specific queries and associated heuristics.
The response generator agent 36 generates a natural language response to the query generated by the agent 24, which can be generated and transmitted to the user (e.g., to one or more of the devices 18 of
In step 50, a determination is made as to whether a query response failure (which could be generated in steps 46 or 48) matters. If so, step 52 occurs, wherein the system generates and returns a response. Otherwise, step 54 occurs, wherein the system adds the response to the response structure. Then, in step 56, a determination is made as to whether there are more questions in the query to be answered. If not, step 52 occurs. Otherwise, step 58 occurs, wherein the system performs further partial information processing on the remaining questions by recursively calling the processing steps 40 of
In the event that only a partial question type is determined to be identified in step 60, then step 72 occurs, wherein a determination is made as to whether the question context can be identified. If so, step 74 occurs, wherein the system attempts to match the partial question type with the question context. Then, in step 76, a determination is made as to whether the partial question type and question context match. If so, the full bundle is returned in step 86, a success notification S3 is generated, and processing ends. Otherwise, step 78 occurs, wherein the system attempts to match the partial question type with the partial question context. Then, step 80 occurs, wherein a determination is made as to whether the partial question type matches the partial question context. If so, step 86 occurs, wherein the full bundle is returned, a success notification S3 is generated, and processing ends. In such circumstances, the question “carbs in apples” could have generated a question type such as “How many Y in X” where the intent was “Does X have Y.” The candidate types inferred to be sufficient are collected into an unordered collection (perhaps represented as a list) of partial question types. The question context is then evaluated to determine whether it is sufficiently identified. If a single perfect match occurs, it is used with each item in the partial question type list to produce a list of potential matches. If the potential list contains at least one match, the results are bundled and returned as a successful result.
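One possible, non-limiting sketch of this partial-matching logic, with the notification codes carried in the returned bundle (the data representation is an assumption for illustration):

```python
def resolve_partial_type(partial_types, knowledge, context):
    """Sketch of the partial question type/context matching described above.
    `partial_types` is the unordered collection of candidate question types;
    `knowledge` is a set of (type, context) pairs the domain can answer;
    `context` is the question context extracted from the query."""
    known_contexts = {c for (_, c) in knowledge}
    if context not in known_contexts:
        # Context cannot be identified: bundle the input, notify F3.
        return {"input": context, "notification": "F3"}
    # Pair the context with each item in the partial question type list.
    potential = [(t, context) for t in partial_types if (t, context) in knowledge]
    if potential:
        # At least one match: bundle and return as a successful result (S3).
        return {"bundle": potential, "notification": "S3"}
    # No type/context match: return the failed result with its input so
    # that more refined questions can be posed (F2).
    return {"input": (partial_types, context), "notification": "F2"}
```

For example, with a knowledge base containing only `("Does X have Y", "apple")`, the candidate list `["How many Y in X", "Does X have Y"]` yields a successful S3 bundle, while the context "honey crisp" yields an F3 failure.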
In the event that a negative determination is made in step 80, step 82 occurs, wherein the input data is bundled, a failure notification F2 is generated, and processing ends. In such circumstances, there is still value in understanding and returning the failed result. For example, the question “Antioxidants in a honey crisp” may lead to failure if the knowledge representation cannot identify this kind of apple and there is no knowledge of antioxidants, nor how the two are related. Returning the heuristic (e.g., that the context is not in the knowledge base) can lead to more refined questions. Finally, in the event that a negative determination is made in step 72 (i.e., the question context cannot be identified), step 84 occurs, wherein the system bundles the input data and generates a failure notification F3. It is noted that, in all of the failure paths (failure notifications F1, F2, and F3), the input data provenance is augmented with a bundle of heuristics that produced the unsuccessful result. Importantly, the failure of one question within the framework of a composite question may still allow success at a higher level (with an optional qualifier).
The processing steps 48 of
Additionally, it is noted that the various bundles generated by the processing steps of
The domain heuristics of the data structure 30 discussed in connection with
It is noted that the systems and methods discussed herein in connection with
In the natural (non-digital) world, information is only as accurate and reliable as the methodology employed by the sentient reporter and, more critically, by the media through which that information is presented. Historians assert that at every stage from observation to recording to reporting, the quality of the initial observation is modified by cultural, social, and psychological norms, biases, and assumptions. Information can be trusted only when its origins and transformations can be examined or referenced. Attribution is essential to reliable information retrieval, and information provenance is key to good decision-making. In computing information systems, informative responses to user queries via visualization-based user interfaces must provide a means for interactively examining the provenance of that information—often referred to as “drilling down.” This is especially critical to summation techniques within conversational AI systems. The customized visualization interface described in
Current digital technologies provide the illusion that quantified data is precise, and that current data retrieval, analysis, and reporting techniques maintain that precision. This is a significant problem when inappropriate visualization techniques are applied to data that may be incomplete or aggregated without attention to its provenance chain. It is not sufficient to provide links to sources; rather, it is necessary to allow a user to examine the aggregation process directly. This is particularly problematic for individuals making personal decisions based on incomplete information sourced both from their personal observations and from external information sources (e.g., media, the Internet, libraries, domain professionals (both through computer-based media and direct communication), etc.). Standard visualization techniques, ranging from lists and charts to graphs, animations, and video recordings, may be useful for demonstrating phenomena, but are equally able to obfuscate and mislead. The visualization interface of
To articulate the novelty of this approach, the problem of representing weight loss goals is presented herein, but of course, other types of visualizations are possible. Weight loss, despite the overwhelming information available on the Internet, is a simple calculation for an individual: the calories consumed in a day must be consistently less than an individualized calculated maintenance amount. “Diet programs” have promoted this concept as far back as the 18th century. Complex models, methods, and visualizations are available for assisting individuals in mapping their personal consumption to goals determined by generalized (and often conflicting) assumptions and calculations.
At their core, existing visualizations represent calorie consumption goals as precise quantitative data. Data aggregation without provenance hides that precise target behind rules prescribing “good”, “neutral”, and “bad” foods, and rarely, if ever, provides transparency for these categorizations. The information is incomplete, but is represented visually as complete and absolute as a goal to achieve. Similarly, actual calories consumed are assumed to be precisely measured, when the value is always an estimate, as the USDA nutrition website asserts. For example, a “Honey Crisp Apple” listed on the USDA FoodData website is calibrated to 100 g based on a well-established methodology. But at 60 calories/100 grams, a precise calorie count for an actual apple could vary significantly. In a typical day with a target of 2000 calories (however well disguised by the weight loss industry), the accumulation of those small differences can lead to a percentage error that can indeed impact calorie deficiency goals. Forgetting to record a candy bar purchased during a commute home can similarly thwart precise comparison of targets and actual values.
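A hypothetical numeric illustration of this accumulation, using the 60 cal/100 g listing above with assumed real apple weights:

```python
# A 60 cal/100 g database listing recorded as a "medium apple" of 100 g,
# compared against hypothetical actual apple weights over one day.
listed_cal_per_100g = 60
actual_weights_g = [140, 182, 210, 165, 195]   # assumed real weights

recorded = listed_cal_per_100g * len(actual_weights_g)   # five 100 g entries
actual = sum(listed_cal_per_100g * w / 100 for w in actual_weights_g)
hidden_error = actual - recorded

print(f"recorded={recorded} cal, actual={actual:.0f} cal, "
      f"hidden error={hidden_error:.0f} cal")
# The hidden error in this example exceeds a 200-calorie daily deficit target.
```

In this assumed scenario, the journal records 300 calories while the apples actually supplied over 500, so the unrecorded difference alone is larger than the entire daily deficit the individual is trying to achieve.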
Visualizing incomplete or inaccurate data as precise data points leads to misinterpretation that, in turn, can lead to failed expectations. For example, if an established recommended calorie intake is 2000 calories and the recommended daily deficit is 200 calories, the prediction is that an individual will lose about a pound per week in a healthy manner. If weight is not lost at that rate, it may very well be that the data recording is not sufficiently accurate for these precise targets, and the visualizations are unable to demonstrate that inaccuracy. Visualizations that support “drilling down” into the provenance of data can provide explanations for failed outcomes that do not blame the victim (e.g., insufficient willpower, failure to take medication properly, failure to follow an exercise regimen, etc.). The visualization interface described herein removes the value judgment and provides, at a glance, a visualization of the relationship between, and the individual provenance of, two aggregate and potentially inaccurate values. Depending on the immediate needs of the user, the visualization interface can produce multiple viewpoints of a data set.
As noted above, the systems and methods (and their visualizations) could be applied to a variety of domains, including any domain in which target outcomes and actual recording are compared. Health care domains include, but are not limited to, protein versus calories, nutrition balance, physical therapy, and habit formation. Energy consumption domains include, but are not limited to, total energy use versus energy produced by solar panels, car maintenance and fuel consumption, etc. Education domains include, but are not limited to, normative student outcomes versus actual performance.
Still another domain having applicability to the systems and methods of the present disclosure is personal or small business finance, e.g., making financial information accessible to someone who is not familiar or comfortable with traditional financial reports. Examples include:
The systems and methods herein can perform data comparison where a difference provides essential insight into the question or problem and can be visualized. For example, student progress ranging from a curricular activity to standardized test results can inform comparison of a student's performance to a normative target. A numeric value out of context provides little insight into what knowledge or skills a student has gained. Providing a difference measure between a norm and a student's performance is enhanced with the potential to ‘drill down’ from an aggregate ‘score’ or ‘rubric assignment’ to examine the component performances to directly identify causes for concern or potential for enrichment.
The visualization components of the systems and methods of the present invention include a triple of pieces: two data objects that are retrieved from one or more resources, and a difference object that is defined as the quantitative difference between the two data objects. Provenance of the data objects, as well as that of the difference object, is maintained throughout, to support methods/algorithms to “drill down” into the data analysis behind each of the data objects, to “build up” new pieces from previously defined data and difference objects, and to combine pieces into “rows” and “blocks” to form more complex representations. Finally, a data piece “summary” object can be generated from rows and blocks of pieces. Enhanced visualization ability comes from domain-specific heuristics for choosing the order of the two data pieces (top/bottom or left/right), the position of the difference piece relative to the smaller data object (before or after), color (or image), opacity, relative depth of the piece, and piece orientation (horizontal/vertical).
A unique aspect of the definition is that each visual attribute includes its provenance, which can be used to visually examine its origins. The surrounding system can use this to “drill down” within any piece providing dashboard capability to provide multi-modal visualizations as well as interaction to allow users to construct sophisticated visualizations that combine visualization pieces with other informative visual modalities such as traditional graphs, text boxes, and animations. The attributes and value pairs are intentionally implemented as natural language-based tokens so that accessibility functions can be employed to render in appropriate modalities for the visually impaired.
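For illustration only, an attribute carrying its own provenance as a natural-language name/value token pair might be modeled as follows (field and function names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    """A visual attribute as a natural-language name/value token pair,
    carrying the provenance of the rule or input that set its value."""
    name: str           # e.g., "color", "opacity", "orientation"
    value: str          # e.g., "aqua", "0.8", "horizontal"
    provenance: str     # the heuristic or source behind this value

def describe(attrs):
    """Render attributes as text tokens, e.g., for accessibility modalities."""
    return "; ".join(f"{a.name} is {a.value}" for a in attrs)

piece_attrs = [
    Attribute("color", "aqua", "heuristic: recommended breakfast datum"),
    Attribute("orientation", "horizontal", "default layout rule"),
]
```

Because each attribute is a plain-text token pair, a screen reader or other accessibility modality can consume `describe(piece_attrs)` directly, while the `provenance` field supports “drilling down” into why each value was chosen.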
The subclasses 178-182 include:
The following example is given in connection with a visualization of a meal. When the meal is breakfast, the actual datum is blue and the recommended is aqua. For any meal:
All of the other attributes are hard-coded to values for this illustration, but could be determined by rules such as:
The visualization object attributes can be determined by static (compile time) or run time (user initiated) rules established in the domain heuristic object.
Importantly, the interface supports multiple layers of pieces. For example, selecting breakfast from a day composite piece will instantiate a new window with all of the functionality of the parent window. Parent-child relationships are maintained behind the scenes, but free motion within any window supports user-initiated organizations. The interaction objects also support ‘corralling’ visualization pieces into a row that in turn can be combined into a block. The “pieces” attribute of a piece represents a row or block depending on the structure of the pieces.
In step 312, an event trigger occurs, which specifies two data items to be compared (data1 and data2) and also includes (as input) provenance that impacts domain heuristics. The “build difference piece” method 314 is then instantiated, and in step 316, the system creates a new difference piece. Additionally, a first datum piece (datum1) is created with a length of data1, and a second datum piece (datum2) is created with a length of data2. Next, in step 318, a determination is made as to whether data1 is greater than data2. If so, step 320 occurs, wherein the length of the difference piece is set to the length of datum1. Then, in step 322, the dominant piece is set as datum1, and the subordinate piece is set as datum2. Then, step 324, discussed below, occurs.
In the event that a negative determination is made in step 318, step 330 occurs, wherein a determination is made as to whether data1 is less than data2. If not (implying that data1 and data2 are equal and there is no difference), step 336 occurs, wherein the system applies rendering heuristics to datum1 and datum2. Otherwise, step 332 occurs, wherein the length of the difference piece is set to the length of datum2. Then, in step 334, the dominant piece is set to datum2, and the subordinate piece is set to datum1.
In step 324, the system creates a difference piece having a length equal to the difference in lengths between the dominant piece and the subordinate piece. In step 326, the system applies rendering heuristics to the dominant piece, the subordinate piece, and the difference piece. Finally, in step 328, the system returns a new difference piece.
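The flow described above might be sketched as follows; the data representation and the rendering callback are assumptions, and the difference length follows the net effect of the dominant/subordinate comparison:

```python
def build_difference_piece(data1, data2, render=lambda *pieces: None):
    """Sketch of the 'build difference piece' method described above.
    data1 and data2 are numeric values in the same unit of measure;
    render stands in for the domain rendering heuristics."""
    datum1 = {"label": "datum1", "length": data1}
    datum2 = {"label": "datum2", "length": data2}
    if data1 > data2:
        dominant, subordinate = datum1, datum2
    elif data1 < data2:
        dominant, subordinate = datum2, datum1
    else:
        # Equal values: no difference piece; render the two data pieces only.
        render(datum1, datum2)
        return None
    # Difference length is the dominant length minus the subordinate length.
    difference = {"label": "difference",
                  "length": dominant["length"] - subordinate["length"]}
    render(dominant, subordinate, difference)
    return difference
```

A surrounding system would supply `render` with its own heuristics for order, color, opacity, and orientation, and would attach provenance to each of the three pieces.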
The domain heuristic object is an “API” for the system that mediates between an external decision-making system and the visualization system. Consequently, the API consists of methods to retrieve required attribute values for a piece as summaries. When a piece is being built, it calls the relevant Domain Heuristic method whose implementation is defined by the surrounding system. Information querying is two way: the surrounding system can both request and provide information so that visualization decisions are not static but can be updated dynamically in real time.
Like the Domain Heuristics Object, the Interaction Object is an “API” for the system. The abstract object is a generic display that has a position in a visual rendering system, such as a web window, and an interaction API definition through which to manipulate both the position and size of an instantiated display and the internal contents of the display. Each piece has a provenance, either input data (that also has a provenance) or the originating instantiation of the piece (via build methods). If the surrounding system maintains a provenance discipline (e.g., provides a provenance with raw input data), then an appropriate display hierarchy can be constructed by the surrounding system.
The visualization system described herein illustrates the difference between two data values with the same unit of measure. The power of the system is the recursive inclusion of data provenance. The developer of the surrounding system can provide displays and interactions that appropriately constrain or allow examination and reorganization of the input data without either overwhelming the user or misrepresenting the data and its origins.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art can make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is desired to be protected by Letters Patent is set forth in the following claims.
The present application claims the priority of U.S. Provisional Application Ser. No. 63/545,319, filed on Oct. 23, 2023, the entire disclosure of which is expressly incorporated herein by reference.