Data visualizations can provide a powerful way to convey information. In particular, visualizing data in a meaningful or compelling way can be influential and facilitate decision making. Many existing data analytics and visualization tools are sophisticated. Creating a meaningful, or compelling, data visualization using such visualization tools, however, can be difficult and tedious. For example, many data consumers have limited experience with data science and/or graphical designs, making generation of data visualizations difficult. An extensive amount of data and data visualizations can make it even more time-consuming to identify specific data and an appropriate manner in which to present the data. Accordingly, although such existing data analytics and visualization tools are powerful, they may be difficult and inefficient for many data consumers to use.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Various aspects of the technology described herein are generally directed to systems, methods, and computer storage media for, among other things, facilitating automated generation of visual designs and insights via natural language processing. In this regard, embodiments described herein facilitate automated generation of data stories and data summaries via user queries. In this regard, a user may provide or input a query (e.g., a natural language query) and, based on the input, obtain an automatically generated data story and/or data summary. As such, a user only needs a high-level idea or understanding about the data, design, and/or insights desired. Accordingly, the user need not have a strong understanding of the data or the visualization technologies in order to generate a meaningful visual design and a summary thereof. Further, user feedback can be obtained and used to refine the data story and/or data summary in an efficient and effective manner.
In operation, at a high level, to generate a data story, a set of relevant facts are identified based on the user query. A particular set of the relevant facts can be selected (e.g., using a maximum margin relevance algorithm) to generate a data story to present to the user in response to the user query. In some cases, a user may provide feedback indicating preferences associated with the data story. For instance, the user may indicate a particular desired or undesired attribute associated with a fact included in the data story. Based on the user feedback, the query may be refined, and the refined query can be used to identify a new set of relevant facts, which can then be used to select facts for a data story. Such an iterative process may continue until a desired data story is attained. Further, the set of relevant facts, as identified based on the user query (or refined user query), can be used to provide as input (via a prompt) to a machine learning model, such as a large language model. The machine learning model generates a data summary that summarizes the dataset from which the data story was generated. In some cases, contextual facts are used as input, via a prompt, to a machine learning model to generate a data summary. In this way, a more meaningful summary of the dataset can be generated. In some embodiments, to limit the contextual facts included in the prompt, the particular contextual facts may be selected based on a fact score (e.g., based on diversity and importance).
The technology described herein is described in detail below with reference to the attached drawing figures, wherein:
The technology described herein is described with specificity to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
As data is becoming increasingly pervasive and plentiful, many individuals seek to use such data to provide meaningful data visualizations to others. Individuals oftentimes have unique perspectives and ideas on how to generate a meaningful visualization, or otherwise provide a story from data. Visualizing data in a meaningful or compelling way can be influential and facilitate decision making.
Many existing data analytics and visualization tools are sophisticated. Creating a meaningful, or compelling, data visualization using such visualization tools, however, can be difficult. For example, many data consumers have limited experience with data science and/or graphical designs. Accordingly, although such existing data analytics and visualization tools are powerful, they may be difficult and inefficient for many data consumers to use.
Such difficulties and inefficiencies may transpire at many steps in the data visualization workflow, including exploring data, identifying insights, and generating and customizing designs. By way of example, with regard to exploring data, users often have only high-level ideas of the desired information. However, conventional data visualization authoring tools require users to specify data fields for use in generating charts. It may be difficult for many users to map their high-level ideas to specific data fields. With regard to finding insights, statistical insights such as distributions, outliers, and correlations are one approach that users may use to drive data exploration and tell compelling stories. However, without data science knowledge and programming skills, it may be difficult for users to discover data insights. With regard to customizing charts, users often have evolving design needs. Initially, a user may desire to view a line chart. Subsequently, the user may desire to add filters and change colors. Existing tools require users to translate their high-level design concepts like “line chart” or “add a filter” to manual user interface actions, such as selecting a button, selecting from a menu, etc. As another example, a strong chart title, highlighting, and annotations facilitate understanding a visualization. Adding highlighting and annotations is a tedious task, particularly when there are a lot of marks. For instance, using a mouse to select a particular line to highlight among numerous lines (e.g., 50+) may be difficult. Users, however, often do not have the expertise or time to learn sophisticated user interface tools.
Accordingly, manual visualization authoring tools that utilize a manual workflow for data exploration and visualization can be difficult and time-consuming. As described, an analyst may need to select which variables to explore, decide what kind of visualization charts to use, inspect if useful insights exist, and repeat. Such tools may be too tedious for non-experts who have limited data science knowledge or graphic design skills.
Further, conventional automated analytics tools can also be tedious and difficult for non-experts. Conventional automated analytics tools may select interesting data variables or generate appropriate visual representations. For example, one such tool can automatically recommend chart types based on users' variable selections. Although such automated analytics tools can provide helpful guidance, they lack an input channel for users to communicate their intents so as to get desired results quickly. Further, any inputs require the users to be familiar with the data and specify concrete data queries. However, non-expert users often only have vague ideas or questions. It may be difficult for non-expert users to map their ideas to the concrete queries. Instead, upon reviewing visualizations in a data story, the user can only interact with the user interface to make simple refinements to the data story. Such a trial-and-error approach, however, generally results in iterative modifications to attain the user's goals, thereby resulting in unnecessary utilization of computing device resources.
Moreover, even when a desired visualization, such as a chart, is generated (e.g., manually or in an automated manner) based on a dataset, there is a lack of communication between the dataset and the analyst, as visualizations often do not provide a lot of information. Accordingly, analysts may need to review the charts and derive information and insights therefrom. For example, to present such visualizations to a particular audience, an analyst may need to review each of the visualizations and generate a natural language summary that can be presented or distributed to provide a more holistic and contextual description of the visualizations. In addition to a contextually inadequate description that might be manually generated, it consumes computing resources to locate desired visualizations, review the dataset, and generate various summaries and insights to provide a summary of the data, including relevant visualizations. For example, navigation to various visuals, the dataset, and analytic tools may be needed to generate a summary of data that includes various visualizations. Further, the user may need to interact with the summary to make refinements thereto. Such a trial-and-error approach, however, generally results in iterative modifications to attain the user's goals, thereby resulting in unnecessary utilization of computing device resources.
As such, embodiments described herein facilitate automated generation of data stories and data summaries via user queries. In this regard, a user may provide or input a query (e.g., a natural language query) and, based on the input, obtain an automatically generated data story and data summary. As such, a user only needs a high-level idea or understanding about the data, design, and/or insights desired. Accordingly, the user need not have a strong understanding of the data or the visualization technologies in order to generate a meaningful visual design and a summary thereof. Further, user feedback can be obtained and used to refine the data story and/or data summary in an efficient and effective manner.
At a high level, implementations described herein employ queries, such as natural language queries, to facilitate creation of data stories and data summaries. In particular, a user may simply input ideas or thoughts (e.g., via a verbal or written input). Based on the input, a data story and data summary can be automatically generated. To generate a data story, a set of relevant facts are identified based on the user query. A particular set of the relevant facts can be selected (e.g., using a maximum margin relevance algorithm) to generate a data story to present to the user in response to the user query. In some cases, a user may provide feedback indicating preferences associated with the data story. For instance, the user may indicate a particular desired or undesired attribute associated with a fact included in the data story. Based on the user feedback, the query may be refined and the refined query can be used to identify a new set of relevant facts, which can then be used to select facts for a data story. Such an iterative process may continue until a desired data story is attained.
Further, the set of relevant facts, as identified based on the user query (or refined user query), can be used to provide as input (via a prompt) to a machine learning model, such as a large language model. The machine learning model can be used to generate a data summary that summarizes the dataset from which the data story was generated. In some cases, contextual facts can be used as input, via a prompt, to a machine learning model to generate a data summary. A contextual fact generally includes context to represent a fact. In embodiments, a contextual fact can include a fact caption and contextual data associated with the corresponding fact. In this way, a more meaningful summary of the dataset can be generated. As described herein, in some cases, the machine learning model, such as a large language model (LLM), includes a limit to the size of input (e.g., via a token limit). Accordingly, in some embodiments, to limit the contextual facts included in the prompt, the particular contextual facts may be selected based on a fact score. In some examples, a fact score is generated for a fact in accordance with diversity of the fact in a set of relevant facts, as well as importance of the fact. In this way, the facts or contextual facts included in the prompt include diverse and important facts.
Advantageously, embodiments described herein facilitate efficient and effective generation of data stories and data summaries. In particular, the technology described herein enables usage of a reduced amount of computer resources as it more specifically analyzes aspects of interest to a user (e.g., input via a query and user feedback). For example, as described herein, a data story generated for a user is based on the user query and refined in accordance with user feedback associated with the data story, or aspects associated therewith. Further, in addition to generating a desired data story, aspects of the technology automatically generate a data summary that summarizes the data story and/or the dataset from which the data story was generated. The automated data summary also corresponds to a user's interest, as the data included in the prompt to generate the data summary is based on facts identified as relevant to the user query, or a refined user query (e.g., based on user feedback in association with the data story). Generating a data summary in accordance with a user's preferences and interests enables usage of a reduced amount of computer resources, for instance, as computing resources are not unnecessarily consumed by an iterative manual implementation.
Various terms are used throughout the description of embodiments provided herein. A brief overview of such terms and phrases is provided here for ease of understanding, but more details of these terms and phrases are provided throughout.
A data story generally refers to a presentation of data related to a topic in a visually appealing and easy-to-understand manner, or otherwise a visual story of data. A data story may include any number of fact visualizations to convey information. A fact visualization generally refers to any visualization of data that can illustrate data and, in some cases, provide captions or insights associated therewith. For example, fact visualizations may include charts, graphs, captions, and/or insights corresponding therewith. In this regard, a set or collection of fact visualizations may be used to present a story regarding a topic.
A fact or a data fact represents a piece of information extracted from a dataset. In some cases, a data fact measures a collection of data items in a subspace of an input dataset based on a measurable data field. Facts can be extracted from a dataset, such as a tabular dataset. The fact can be represented in a data story using a fact visualization (e.g., including a graphical design and a caption). A fact can also be represented using a fact tuple.
A fact tuple refers to a representation of a fact using an ordered sequence of values. In this regard, a fact fi ∈ F is defined by a fact tuple using various attributes, such as type, subspace, measure, breakdown, focus, and/or aggregate.
A fact caption refers to natural language text representing or describing a fact. In embodiments, a fact caption is generated using a template to provide a natural language text (e.g., phrase, sentence, or set of sentences) to describe a fact.
A contextual fact generally refers to a fact that includes context associated with the fact. In this regard, additional contextual data for each fact fi in the set of relevant facts, FR, is identified and appended, supplemented, or aggregated with the corresponding fact. The fact may be represented via a fact caption. In this regard, a contextual fact includes a fact caption and contextual data associated therewith. Contextual data may be in various formats. As one example, contextual data may be in the form of numerical or categorical values related to various attributes associated with the fact tuple, such as, for example, subspaces, measures, breakdowns, aggregates, focus, etc.
A relevant fact refers to a fact that corresponds or matches with a user query (or a refined user query). In some cases, to identify relevant facts, a key phrase search can be performed. In this regard, key phrases can be extracted from the query for use in performing the search. In other cases, the entire user query is used to perform the similarity search to identify relevant facts. In yet other cases, a similarity search can be performed using both key phrases and the entire user query to identify relevant facts.
A data summary refers to a natural language summary of a dataset, for example, that was used to generate a data story. A data summary is generally provided in a natural language form along with visualizations (e.g., fact visualizations that may include a graphical design and caption). A data summary provides a more comprehensive experience than a data story (which generally includes a sequence of visualizations and corresponding captions). For example, coverage in a data story can be limited, and templatized captions may be less engaging. On the other hand, a data summary provides information and insights regarding the dataset along with the visualizations. Such a summary is easy to follow and interpret, as it presents information succinctly with context and provides a condensed and efficient presentation of key findings and trends, enabling the audience to grasp main insights and conclusions.
Referring initially to
A data summary refers to a natural language summary of the data along with visualizations. A data summary provides a more comprehensive experience than a data story. For example, a data summary provides a natural language description of the dataset from which the data story is generated. In this regard, the data summary provides more context, key findings, and trends, in addition to being described in a manner that is easy to understand and interpret. By way of example only, an example data summary 126 is provided in
The network environment 100 includes a user device 110, a data fact engine 112, a data store 114, data sources 116a-116n (referred to generally as data source [s] 116), and a data analysis service 118. The user device 110, the data fact engine 112, the data store 114, the data sources 116a-116n, and the data analysis service 118 can communicate through a network 122, which may include any number of networks such as, for example, a local area network (LAN), a wide area network (WAN), the Internet, a cellular network, a peer-to-peer (P2P) network, a mobile network, or a combination of networks.
The network environment 100 shown in
The user device 110 can be any kind of computing device capable of facilitating generation of data stories and data summaries via user queries. For example, in an embodiment, the user device 110 can be a computing device such as computing device 1700, as described above with reference to
The user device can include one or more processors and one or more computer-readable media. The computer-readable media may include computer-readable instructions executable by the one or more processors. The instructions may be embodied by one or more applications, such as application 120 shown in
User device 110 can be a client device on a client-side of operating environment 100, while data fact engine 112 and/or data analysis service 118 can be on a server-side of operating environment 100. Data fact engine 112 and/or data analysis service 118 may comprise server-side software designed to work in conjunction with client-side software on user device 110 so as to implement any combination of the features and functionalities discussed in the present disclosure. An example of such client-side software is application 120 on user device 110. This division of operating environment 100 is provided to illustrate one example of a suitable environment, and it is noted there is no requirement for each implementation that any combination of user device 110, data fact engine 112, and/or data analysis service 118 remain as separate entities.
In an embodiment, the user device 110 is separate and distinct from the data fact engine 112, the data store 114, the data sources 116, and the data analysis service 118 illustrated in
As described, a user device, such as user device 110, can facilitate generation and/or presentation of data stories and data summaries via user queries. A user device 110, as described herein, is generally operated by an individual or entity interested in viewing visualizations of data (e.g., in the form of graphs, charts, insights, etc.). As can be appreciated, a user interested in viewing data visualizations need not be an individual or entity associated with capturing or providing a dataset from which the data visualizations are generated. For example, in some cases, a user desiring to view data stories and/or data summaries may be an individual gathering insights of trends of data provided by another entity (e.g., in a collaborative environment or obtained via the Internet).
In some cases, automated data story and/or data summary generation may be initiated at the user device 110. For example, a user may input or provide a user query, such as a natural language query. A natural language query generally refers to any query having natural language. In this regard, a user can speak or type at will to provide aspects of a desired data visualization. Various aspects a user may indicate as desired may include dataset attributes, design attributes, and insight attributes. Dataset attributes refer to attributes related to the data. For example, a user may specify particular fields, variables, or dimensions of interest. Design attributes refer to attributes related to a visualization design. For example, a user may specify a particular type of desired chart. Insight attributes refer to attributes related to insights. For example, a user may specify a type of information desired to be interpreted from the data.
As can be appreciated, in some cases, a user of the user device 110 that may initiate a data story and/or data summary is a user that can view the data. In additional or alternative cases, an administrator, programmer, or other individual associated with an organization may initiate generation of a data story and/or data summary, but not necessarily be a consumer or viewer of such information.
A user query or input (e.g., a natural language query) may be provided via an application 120 operating on the user device 110. In this regard, the user device 110, via an application 120, might allow a user to input, select, or otherwise provide a query. The application 120 may facilitate the inputting of a query in a verbal form of communication or a textual form of communication. The user device 110 can include any type of application, which may be a stand-alone application, a mobile application, a web application, or the like. In some cases, the functionality described herein may be integrated directly with an application or may be an add-on, or plug-in, to an application.
Such a query may be input at the user device 110 in any manner. For instance, upon accessing a particular application (e.g., a data analytics application), a user may be presented with, or navigate to, a text input tool to input a query. As another example, a user may select an icon to initiate input via a voice query. Irrespective of a type of input, a user may provide various aspects including dataset attributes, design attributes, and/or insight attributes.
The user device 110 can communicate with the data fact engine 112 to provide queries, initiate generation of data stories and data summaries, and/or obtain data stories and data summaries. In embodiments, for example, a user may utilize the user device 110 to initiate a generation of a data story and/or data summary via the network 122. For instance, in some embodiments, the network 122 might be the Internet, and the user device 110 interacts with the data fact engine 112 (e.g., directly or via data analysis service 118) to initiate generation of a data story and/or data summary. In other embodiments, for example, the network 122 might be an enterprise network associated with an organization. It should be apparent to those having skill in the relevant arts that any number of other implementation scenarios may be possible as well.
With continued reference to
As described, in some cases, the data fact engine 112 can receive queries for generating data stories and/or data summaries via the user device 110 (or other device). User queries, such as natural language queries, received from a device, such as user device 110, can include various attributes (e.g., data attributes, design attributes, and/or insight attributes) manually or explicitly input by the user (input queries or selections). Generally, the data fact engine 112 can receive queries from any number of devices. In accordance with receiving a query (e.g., via the user device 110), the data fact engine 112 can access and utilize data to generate a data story(s) and/or data summary(s). As described, in various embodiments, a user-provided attribute(s) is not required. For example, default attributes (e.g., a default or identified dataset attribute, design attribute, and/or insight attribute) can be used to generate a data story(s) and/or data summary(s).
In accordance with a user query, the data fact engine 112 can use data from a dataset to generate a data story and/or data summary. Such data may be any data that can be used to generate a data story and/or data summary. A dataset used for forming a data story and/or data summary can be any type of data that may be analyzed. By way of example and not limitation, data within a dataset may include data that is sensed or determined from one or more sensors, such as location information of mobile device(s), smartphone data, activity information (for example: app usage; online activity; searches; browsing certain types of webpages; listening to music; taking pictures; voice data such as automatic speech recognition; activity logs; communications data including calls, texts, instant messages, and emails; website posts; or other user data associated with communication events) including user activity that occurs over more than one device, user history, session logs, application data, contacts data, calendar and schedule data, notification data, social network data, news, online gaming data, e-commerce activity, sports data, research data, health data, and nearly any other source of data that may be used to generate data stories and/or data summaries, as described herein.
A dataset to use may be a default dataset or a specified dataset. For example, as a user navigates to a particular dataset, the dataset may be used to generate a data story and/or data summary. As another example, a user may specify or select a particular dataset for generating or viewing a data story and/or data summary.
Such data can be initially collected at remote locations or systems and transmitted to data store 114 for access by data fact engine 112. In accordance with embodiments described herein, data collection may occur at data sources 116. In some cases, data sources 116, or a portion thereof, may be user devices, that is, computing devices operated by a user. As such, user devices, or components associated therewith, can be used to collect various types of data. For example, in some embodiments, data may be obtained and collected at a user device via one or more sensors, which may be on or associated with one or more user devices and/or other computing devices. As used herein, a sensor may include a function, routine, component, or combination thereof for sensing, detecting, or otherwise obtaining information, such as visualization data, and may be embodied as hardware, software, or both.
In addition or in the alternative to data sources 116 including user devices, data sources 116 may include servers, data stores, or other components that collect data, for example, from user devices. For example, in interacting with a user device, datasets may be captured at data sources 116 and, thereafter, such data can be provided to the data store 114 and/or data fact engine 112. In this regard, dataset contributors may operate a client device and provide a dataset to the data source 116. Although generally discussed as data provided to the data store 114 and/or data fact engine 112 via data sources 116 (e.g., a user device or server, data store, or other component in communication with user device), data may additionally or alternatively be obtained at and provided from the data analysis service 118, or other external server, for example, that collects data. Datasets, or data associated therewith, can be obtained at a data source periodically or in an ongoing manner (or at any time) and provided to the data store 114 and/or data fact engine 112 to facilitate generation of data stories and/or data summaries.
In accordance with embodiments described herein, and as more fully described below with reference to
In operation, to generate a data story, a set of relevant facts are identified based on the user query. A particular set of the relevant facts can be selected (e.g., using a maximum margin relevance algorithm) to generate a data story to present to the user in response to the user query. In some cases, a user may provide feedback indicating preferences associated with the data story. For instance, the user may indicate a particular desired or undesired attribute associated with a fact included in the data story. Based on the user feedback, the query may be refined, and the refined query can be used to identify a new set of relevant facts, which can then be used to select facts for a data story. Such an iterative process may continue until a desired data story is attained.
Further, the set of relevant facts, as identified based on the user query (or refined user query), can be used to provide as input (via a prompt) to a machine learning model, such as a large language model. The machine learning model can be used to generate a data summary that summarizes the dataset from which the data story was generated. In some cases, contextual facts can be used as input, via a prompt, to a machine learning model to generate a data summary. A contextual fact generally includes context to represent a fact. In embodiments, a contextual fact can include a fact caption and contextual data associated with the corresponding fact. In this way, a more meaningful summary of the dataset can be generated. As described herein, in some cases, the machine learning model, such as an LLM, includes a limit to the size of input (e.g., via a token limit). Accordingly, in some embodiments, to limit the contextual facts included in the prompt, the particular contextual facts may be selected based on a fact score. In some examples, a fact score is generated for a fact in accordance with diversity of the fact in a set of relevant facts as well as importance of the fact. In this way, the facts or contextual facts included in the prompt include diverse and important facts.
In some cases, the data story and/or data summary can be provided to the user device 110 for display to the user. In other cases, the data analysis service 118 may use such data to perform further analysis and/or render or provide such data to the user device 110. The data analysis service 118 may be any type of server or service that can analyze data, render data, and/or provide information to user devices. Although data analysis service 118 is shown separate from the data fact engine 112, as can be appreciated, the data fact engine can be integrated with the data analysis service 118 or other service. The user device 110 can present received data or information in any number of ways, and is not intended to be limited herein. As an example, a data story 124 and/or data summary 126 can be presented via application 120 of the user device. The data story 124 and/or data summary 126 may be presented concurrently, sequentially, etc. In some cases, the data story 124 and/or data summary 126 are presented automatically. In other cases, the data story 124 and/or data summary 126 are presented in response to a user selection to view such data.
Advantageously, utilizing implementations described herein enables generation of data stories and/or data summaries to be performed in an efficient and more accurate manner (e.g., in accordance with user desires). Further, the data stories and/or data summaries can dynamically adapt to align with information desired by the user (e.g., based on user queries and refinements thereof). As such, a user can view desired information and can assess the information accordingly.
Turning now to
In operation, the data fact engine 212 is generally configured to manage generation of data stories and data summaries. In embodiments, the data fact engine 212 includes a data fact manager 220, a data story generator 222, a user query refiner 224, a data summary manager 226, and a data provider 228. According to embodiments described herein, the data fact engine 212 can include any number of other components not illustrated. In some embodiments, one or more of the illustrated components 220, 222, 224, 226, and 228 can be integrated into a single component or can be divided into a number of different components. Components 220, 222, 224, 226, and 228 can be implemented on any number of machines and can be integrated, as desired, with any number of other functionalities or services.
The data fact manager 220 is generally configured to manage data facts associated with a dataset. In particular, data facts can be identified in association with a dataset from which a data story or set of data stories are generated. In this regard, the data fact manager 220 can extract facts from a dataset, such as a tabular dataset. As described, a data fact or fact represents a piece of information extracted from a dataset. In embodiments, a data fact is designed to measure a collection of data items in a subspace of an input dataset based on a measurable data field. In some cases, a fact fi ∈ F is defined or represented by a fact tuple, such as:
fi = {type(ti), subspace(si), measure(mi), breakdown(bi), focus(xi)}.
In this fact tuple, type (denoted as ti) indicates the type of information described by the fact. Various fact types may include, for example, difference, proportion, trend, categorization, distribution, rank, association, extreme, and outlier. Subspace (denoted as si) describes the data scope of the fact, which is defined by a set of data filters in the form: {{F1=V1}, . . . , {Fk=Vk}}, where Fi indicates a data field and Vi indicates a corresponding value selected to filter the data. In some cases, a subspace is the entire dataset. Measure (denoted as mi) may include a numerical data field based on which a data value can be retrieved, or a derived value (e.g., count, sum, average, minimum, or maximum) can be calculated by aggregating the subspace of each data group. Breakdown (denoted as bi) is a set of temporal or categorical data fields based on which the data items in the subspace are further divided into groups. Focus (denoted as xi) indicates a set of specific data items in the subspace that require attention. A fact tuple may include alternative or additional fields. For example, a fact tuple may include an aggregate, which indicates a manner in which to calculate a measure (e.g., total, average, minimum, maximum, etc.). For example, in some cases, fact tuples may include a derived value (denoted as Vd), such as a textual summary of the trend (e.g., “increasing” or “decreasing”), the specific difference value between two cases described by a difference fact, or the correlation coefficient computed for an association fact.
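As a minimal sketch (in Python; the dataclass and field names are illustrative assumptions rather than a required structure), a fact tuple might be represented as follows, using the Ford profit trend example discussed below:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Fact:
    """Illustrative fact tuple: type, subspace, measure, breakdown, focus, aggregate."""
    type: str                          # e.g., "trend", "rank", "outlier"
    subspace: dict = field(default_factory=dict)  # data filters {field: value}; empty = entire dataset
    measure: Optional[str] = None      # numerical field from which values are retrieved or derived
    breakdown: Optional[str] = None    # temporal or categorical field used to group items
    focus: Optional[list] = None       # specific data items requiring attention
    aggregate: Optional[str] = None    # how the measure is calculated, e.g., "total", "average"

# "Total profit on Ford cars is increasing over years",
# i.e., the tuple <TREND, Company=Ford, Profit, Year, Total>:
ford_trend = Fact(
    type="trend",
    subspace={"Company": "Ford"},
    measure="Profit",
    breakdown="Year",
    aggregate="Total",
)
```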
By way of example only, assume a fact corresponds to “Total profit on Ford cars is increasing over years.” Such a fact can be represented as a fact tuple as <TREND, Company=Ford, Profit, Year, Total>. Other examples of fact tuples (including only subspace, measure, and breakdown) include:
In some embodiments, a brute force search may be employed to extract facts, or fact tuples, from a dataset. For example, a brute force search may be executed to identify fact tuple possibilities (e.g., based on different filters for subspaces and/or measures). In some cases, the fact tuples may be identified in association with satisfying certain conditions. For example, one condition may be that there are at most two categorical filters and at most two temporal filters for the subspace. By way of example only, January 2020 is considered to include two filters-one filter for year and one filter for month.
In some cases, the data fact manager 220 generates importance scores for fact tuples. An importance score generally represents an extent of importance of a corresponding fact tuple. Such importance scores may be used to rank the fact tuples. Importance scores may be generated in any of a number of ways. As one example, the importance of a fact is defined as the product of a significance score of the fact and a self-information score of the fact, as follows:

Is(fi)=S(fi)·I(fi)
For this importance score, Is(fi), S(fi) is the significance score of a fact fi, and I(fi) is the self-information score of the fact fi. In embodiments, a normalization (e.g., min-max normalization) is applied to normalize the importance score of each fact type. In this regard, normalization is performed subject to the fact type. For example, importance scores associated with distribution facts are normalized with minimum and maximum values equal to those amongst the distribution facts. Normalized importance scores help prevent bias during optimization. A normalized importance score for a fact can be represented as follows:

IN(fi)=(Is(fi)−Min(Is(fT)))/(Max(Is(fT))−Min(Is(fT)))
In this approach, IN(fi) is the normalized importance score of fact fi, Max(Is(fT)) represents the maximum importance score of all facts of a fact type that is the same as that of fi, and Min(Is(fT)) represents the minimum importance score of all facts with a fact type that is the same as that of fi.
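A brief sketch of this per-fact-type min-max normalization (assuming facts are represented as dictionaries with precomputed importance scores; the function and key names are illustrative):

```python
from collections import defaultdict

def normalize_importance(facts):
    """Min-max normalize importance scores within each fact type, in place.

    Each fact is a dict with keys "type" and "importance", where
    importance = significance * self_information as described above.
    """
    scores_by_type = defaultdict(list)
    for f in facts:
        scores_by_type[f["type"]].append(f["importance"])

    for f in facts:
        lo = min(scores_by_type[f["type"]])
        hi = max(scores_by_type[f["type"]])
        # Guard against a fact type whose facts all share the same score.
        f["normalized_importance"] = (f["importance"] - lo) / (hi - lo) if hi > lo else 0.0
    return facts
```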
As described, a self-information score is used to generate an importance score for a fact tuple. The self-information score, I (fi), can be defined as the negative logarithm of the probability of occurrence of the fact, as follows:
I(fi)=−log2(P(fi)).
P (fi) indicates an occurrence probability of the fact given the input data. A fact with a lower occurrence probability in the data space has a higher self-information value as it reveals uncommon patterns, which are usually more meaningful and interesting. In some embodiments, the occurrence probability of the fact is determined by taking the product of occurrence probability of subspace, occurrence probability of breakdown for the given fact type, occurrence probability of measure for the given fact type, and occurrence probability of focus for the given subspace, as follows:
P(fi)=P(si)·P(bi|ti)·P(mi|ti)·P(xi|si)

In this occurrence probability of a fact, the probability of the subspace, P(si), may be represented as:
P(si)=(1/C(m, k))·P(F1=V1)· . . . ·P(Fk=Vk),

where k is the number of filters in the subspace, m is the total number of columns that can be used for subspace filters, and C(m, k) denotes the number of possible selections of k filter fields from the m columns. P(Fj=Vj) is the probability that the filter Fj takes the value Vj. A filter can represent a particular value of a subspace.
The occurrence probability of breakdown corresponds with a given fact type. In this regard, for a trend fact type, P(bi|ti)=1/T. For a categorization or distribution fact type, P(bi|ti)=1/C. For other fact types, P(bi|ti)=1/(C+T). C represents the number of categorical columns, and T represents the number of temporal columns. A temporal column may represent any time period, such as months, years, etc.
The occurrence probability of measure for the given fact type can be represented as P (mi|ti)=1/N for all fact types. In this occurrence probability, N represents the number of numerical columns.
The occurrence probability of focus for a given subspace may be represented as P (xi|si)=count (xi)/count (si), wherein count denotes the cardinality of the set. As can be appreciated, occurrence probabilities for facts can be represented using alternative or additional probabilities.
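Putting the probability terms above together, the following hedged sketch computes a fact's occurrence probability and self-information. The 1/C(m, k) factor in the subspace term follows the reconstruction above and is an assumption, as are the helper names:

```python
import math

def subspace_probability(filters, filter_value_probs, m):
    """P(si) for a subspace with k filters chosen from m candidate columns.

    `filters` maps field -> value; `filter_value_probs[field][value]` is
    P(Fj=Vj), the share of rows where that field takes that value.
    """
    k = len(filters)
    p = 1.0 / math.comb(m, k) if k else 1.0  # assumed selection term over k of m columns
    for field_name, value in filters.items():
        p *= filter_value_probs[field_name][value]
    return p

def breakdown_probability(fact_type, num_categorical, num_temporal):
    """P(bi|ti), conditioned on the fact type as described above."""
    if fact_type == "trend":
        return 1.0 / num_temporal
    if fact_type in ("categorization", "distribution"):
        return 1.0 / num_categorical
    return 1.0 / (num_categorical + num_temporal)

def self_information(p_subspace, p_breakdown, p_measure, p_focus):
    """I(fi) = -log2 P(fi), with P(fi) the product of the four probability terms."""
    p_fact = p_subspace * p_breakdown * p_measure * p_focus
    return -math.log2(p_fact)

# P(mi|ti) is 1/N (N numerical columns); P(xi|si) is count(xi)/count(si).
```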
As described, a significance score is also used to generate an importance score for a fact tuple. The significance score S (fi) ∈ [0,1] generally estimates the significance of the data patterns described by a fact. A significance score can be generated in any number of ways. In some embodiments, a significance score is defined separately for each fact type based on hypothesis tests. For example, a significance score can estimate the significance of data patterns described by a fact based on auto-insight techniques.
In some embodiments, the data fact manager 220 converts fact tuples to captions, such as natural language captions. A caption generally refers to a textual explanation of a visual design (e.g., chart or graph). Generally, a caption is a brief text portion. In some cases, the captions are generated using templates. For instance, template-based methods can be employed to generate captions for charts. To ensure readability and avoid ambiguity, a syntax may be defined for each fact type that regulates the generation results. By way of example, and with reference to
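By way of a hedged illustration only (the templates below are invented stand-ins, not the actual syntax defined for each fact type), template-based caption generation might look like:

```python
# One caption template per fact type; slots mirror the fact tuple fields.
CAPTION_TEMPLATES = {
    "trend": "The {aggregate} {measure} {subspace} is {derived_value} over {breakdown}s.",
    "extreme": "{focus} has the {derived_value} {aggregate} {measure} {subspace}.",
}

def caption_for(fact: dict) -> str:
    """Render a natural language caption for a fact via its type's template."""
    subspace = fact.get("subspace") or {}
    subspace_text = (
        "when " + " and ".join(f"{k} is {v}" for k, v in subspace.items())
        if subspace else "overall"
    )
    return CAPTION_TEMPLATES[fact["type"]].format(
        aggregate=fact.get("aggregate", "total"),
        measure=fact["measure"],
        subspace=subspace_text,
        breakdown=fact.get("breakdown", "category"),
        focus=fact.get("focus"),
        derived_value=fact.get("derived_value"),
    )

# caption_for({"type": "trend", "subspace": {"Company": "Ford"}, "measure": "Profit",
#              "breakdown": "Year", "aggregate": "total", "derived_value": "increasing"})
# -> "The total Profit when Company is Ford is increasing over Years."
```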
Returning to
Generating a fact corpus, or updating a fact corpus, can be performed at various times. In some cases, generating or updating a fact corpus is performed periodically, for example, on a weekly basis. In other cases, a fact corpus may be generated or updated upon an occurrence of an event (e.g., a user selection to view a data story and/or data summary).
The data story generator 222 is generally configured to generate data stories. Data stories can be generated in any of a number of ways. As one example, a data story is generated in accordance with a user query. In this regard, the data story generator 222 can obtain input data 250, which can include a user query 252, such as a natural language query. The user query, or portion thereof, may be input, selected, or otherwise provided in a textual form or a verbal form via a user interface. Such a user query(s) may indicate a manner in which to tailor or customize data visualizations. For example, as previously described, a user may indicate information related to desired data, visual design, and/or insight for a data visualization or data story via a natural language query in unstructured form.
The user query 252 may be or include a command, a question, a list of words and phrases, or the like. As can be appreciated, the user query may include any number of details related to various visualization aspects, such as data, visual design, and/or insight. As one example, a natural language query may be more general or vague. As another example, a natural language query may be more specific. User queries, or portions thereof, may be stored, for instance, at data store 214.
To generate a data story based on a user query, a searching or matching process may be employed using a fact corpus. As described, the fact corpus may include various facts, for example, in the form of fact captions, or representations thereof (e.g., vector embeddings). To perform the search, a contextual similarity search (e.g., a cosine similarity search) may be used to search fact captions (e.g., natural language templatized fact captions) or representations thereof (e.g., vector embeddings) in the fact corpus based on the user query. As such, in some cases, to perform the contextual similarity search, the user query may be converted to a vector embedding. In embodiments, the technology used to convert the user query to the vector embedding may be the same as that used to convert the fact captions to vector embeddings. For example, a pre-trained encoding model, such as SBERT, may be used to convert a user query to a corresponding vector embedding. Using a contextual similarity search to identify fact captions (e.g., represented as vector embeddings) relevant to the user query (e.g., represented as a vector embedding), a set of relevant facts (e.g., 150 to 200 facts) is extracted. Such relevant facts that correspond or match with the user query can be defined as a set of relevant facts to the user query. In some cases, to identify relevant facts, a key phrase search can be performed. In this regard, key phrases can be extracted from the query for use in performing the search. In other cases, the entire user query is used to perform the similarity search to identify relevant facts. In yet other cases, a similarity search can be performed using both key phrases and the entire user query to identify relevant facts. The similarity search (e.g., cosine similarity search) may generate similarity scores between the data facts and the user query (e.g., as represented via vector embeddings), which are then used to select relevant facts.
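As one possible sketch of this retrieval step using the sentence-transformers library (the model name and the top-k cutoff of 200 are assumptions consistent with the description above):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained SBERT-style model

def retrieve_relevant_facts(query, fact_captions, top_k=200):
    """Return (caption index, cosine similarity) pairs for the fact captions
    most similar to the user query."""
    query_vec = encoder.encode(query, convert_to_tensor=True)
    corpus_vecs = encoder.encode(fact_captions, convert_to_tensor=True)  # precomputable corpus
    scores = util.cos_sim(query_vec, corpus_vecs)[0]      # one score per fact caption
    ranked = scores.argsort(descending=True)[:top_k].tolist()
    return [(i, float(scores[i])) for i in ranked]
```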
Using the relevant facts identified from the user query, a data story is generated. The data story may include any number of facts, or fact visualizations (e.g., 15). In this regard, in some cases, a particular set of relevant facts, also referred to herein as data story facts, can be selected from among the relevant facts for inclusion in the data story. The facts to include in the data story may be selected from the relevant facts using any number of methods. In one example, maximum margin relevance (MMR) is used to select a number (e.g., a predetermined number or a number that exceeds a threshold) of facts, from the relevant facts, to include in the data story. MMR is a diversity-based ranking technique that maximizes the relevance and novelty of top-ranked items, such as the identified relevant facts. MMR generally reduces the redundancy of results while maintaining query relevance of results for already ranked relevant facts. In this regard, MMR selects facts according to a combined criterion of query relevance and novelty of information. In embodiments, novelty of information measures a degree of dissimilarity between the fact being considered and previously selected ones already in the ranked list. With MMR, a relevance parameter, λ, can be set to reflect a desired relevance. The relevance parameter may vary between 0 and 1, with a higher value indicating more relevance, thereby focusing more on the similarity scores, and a lower value indicating more diversity among the selected facts. For example, a relevance parameter of 0.5 indicates a balance between facts being similar to the user query and facts being diverse from one another. In another example, a text rank algorithm may be used to perform extractive summarization to retrieve data story facts from a set of relevant facts. Text rank generally measures the relationship between two or more words.
As an example of selecting facts for a data story from a set of relevant facts, assume a user query q is provided. The user query q is converted to a vector embedding using SBERT, such that the user query can be used to search a fact corpus including vector embeddings representing captions of facts. Using the vector embedding of the user query, a key phrase-based and an entire query-based similarity search is performed, thereby retrieving about 150-200 relevant facts from the precomputed fact corpus. Using the relevant facts, an MMR algorithm is used to select about 15 facts for inclusion in a data story. Using an MMR algorithm ensures both relevance and diversity in the data story.
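A compact sketch of the MMR selection over the retrieved fact embeddings (λ = 0.5 and the story size of 15 follow the examples above; the vector representation is assumed):

```python
import numpy as np

def mmr_select(query_vec, fact_vecs, n_select=15, lam=0.5):
    """Greedy maximal marginal relevance: each step picks the fact maximizing
    lam * sim(query, fact) - (1 - lam) * max similarity to already selected facts."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    relevance = [cos(query_vec, f) for f in fact_vecs]
    selected, remaining = [], list(range(len(fact_vecs)))
    while remaining and len(selected) < n_select:
        def mmr_score(i):
            redundancy = max((cos(fact_vecs[i], fact_vecs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into fact_vecs, in selection order
```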
In accordance with selecting a set of relevant facts (e.g., 15 facts), a data story is generated by sequentially arranging the selected relevant facts, or a portion thereof, to ensure better coherence among the facts in the data story. In this regard, to present a coherent data story, the data story generator 222 can coherently arrange the facts in the data story. Generating the sequence of relevant facts, or coherently arranging the facts, may be performed in any of a number of ways. As one example, a Jaccard similarity matrix between fact tuples in the data story may be generated. Thereafter, starting with a first fact amongst the D selected facts, the next most similar fact to the previously selected fact may be identified one by one using the similarity matrix.
By way of example, coherence of a data story can be calculated based on Jaccard similarity between consecutive fact tuples fi and fj selected for a data story. In this regard, for a generated data story, pairwise consecutive Jaccard similarities between fact tuples are determined and averaged to determine coherence of the data story. A higher coherence implies better logical flow between consecutive facts. In operation, initially, a most relevant fact may be identified. The most relevant fact may be identified using the maximum similarity score associated with the user query. Jaccard similarity can be determined between the most relevant fact (e.g., in association with the user query) and the remaining facts. The fact, among the remaining facts, that is most similar to the previous fact is selected as the next data fact to present in the data story. The newly selected fact can then be used to identify a next fact based on Jaccard similarities to the newly selected fact. This iterative approach can be applied until a sequential order of data facts is determined. Using this approach, facts can be arranged so there is more coherence or similarity between adjacently positioned facts in the data story.
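A short sketch of this greedy, coherence-based ordering (assuming each fact is reduced to the set of its tuple attribute values, e.g., {"trend", "Company=Ford", "Profit", "Year"}):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of fact tuple attribute values."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def order_story_facts(fact_attr_sets, start_index):
    """Begin with the most query-relevant fact (start_index), then repeatedly
    append the remaining fact most similar to the previously selected one."""
    order = [start_index]
    remaining = set(range(len(fact_attr_sets))) - {start_index}
    while remaining:
        last = fact_attr_sets[order[-1]]
        nxt = max(remaining, key=lambda i: jaccard(fact_attr_sets[i], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```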
In accordance with generating a data story, the data story generator 222 provides the data story, for example, for display to a user. In this regard, the data story is presented. In embodiments, the data story is presented in a manner that allows a user to provide relevance feedback, which can be used to refine the data story as well as generate a data summary, as described below.
In some cases, based on the contextual similarity search, data facts in the presented data story may not be relevant to or desired by the user (e.g., per the user query). As such, embodiments described herein may receive user feedback in relation to relevance of the data facts and use such feedback to refine the data story. To this end, the user query refiner 224 is used to refine the user query based on user feedback, also referred to herein as relevance feedback. In this regard, in accordance with presenting a data story, user feedback pertaining to the data story can be obtained. In particular, feedback associated with relevance of the data story, or portions thereof (e.g., fact visualizations), can be obtained. For example, a user may view various fact visualizations of a data story and provide feedback on fact visualizations the user does not deem relevant to the user query. As described herein, the refined user query may be used, for instance, to refine the data story.
User feedback may be elicited and/or provided in any number of ways. In some cases, each fact visualization may be presented in association with a relevance indicator(s) that can be used to indicate whether, or an extent to which, the fact visualization, or a portion thereof, is relevant (e.g., to the user query). For instance, relevance feedback may enable refining a data story based on binary feedback according to user interests. As an example, an “accept” relevance indicator may indicate the fact visualization is relevant, while a “decline” relevance indicator may indicate the fact visualization is irrelevant. In some cases, only a single relevance indicator may be presented, for example, that can be selected to indicate that the data fact is not relevant. In yet other cases, relevance indicators may be variable to select or input different levels or extents of relevance. For instance, a sliding scale may be presented and manipulated to indicate an extent of relevance to a user query.
Relevance feedback may be provided in association with a type of data fact or other attribute associated with the data fact (e.g., visual, caption, data type, etc.). In this regard, relevance can be in terms of “types” of facts a user is interested in, or attributes of the data a user is interested in. In some embodiments, the particular attribute for which the relevance feedback is provided may be specified by the user or identified based on the elicitation for the feedback. For example, a relevance indicator may specify a particular attribute for which the relevance feedback pertains. In other embodiments, the particular attribute for which the relevance feedback is provided may be automatically identified or determined.
Various implementations may be employed to refine a user query based on relevance feedback. In embodiments, the query is refined (e.g., mathematically refined) in the vector space. In some embodiments, to refine a user query based on relevance feedback, Rocchio's algorithm is used:
Qopt = α·Q0 + β·(1/|Dr|)·Σ(dj ∈ Dr) dj − γ·(1/|Dnr|)·Σ(dj ∈ Dnr) dj,

wherein Qopt represents a modified query vector, Q0 represents the original query vector, Dr and Dnr represent the sets of relevant and non-relevant documents, respectively, dj represents a document vector, and α, β, and γ are tunable weights. The modified query vector is based on a difference between centroids of relevant and non-relevant documents. In operation, the modified query vector is optimized to have more similarity with the relevant set and minimize similarity with the non-relevant set of documents.
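A minimal sketch of this Rocchio-style refinement over the embedding vectors (the α, β, and γ defaults below are conventional values and an assumption here):

```python
import numpy as np

def rocchio_refine(query_vec, relevant_vecs, nonrelevant_vecs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of facts the user accepted
    and away from the centroid of facts the user declined."""
    refined = alpha * np.asarray(query_vec, dtype=float)
    if len(relevant_vecs):
        refined = refined + beta * np.mean(relevant_vecs, axis=0)
    if len(nonrelevant_vecs):
        refined = refined - gamma * np.mean(nonrelevant_vecs, axis=0)
    return refined  # used for the next similarity search over the fact corpus
```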
In accordance with obtaining relevance feedback, the user query refiner 224 can refine the user query to adapt to the relevance feedback. For instance, assume a user indicated that a particular data fact is irrelevant or uninteresting to the user. In such a case, the user query (e.g., in the form of a vector embedding) is refined to reflect the lack of interest in the particular data fact such that the data fact may be removed from the data story. Based on the refined user query, a refined data story may be generated. In this regard, a new set of facts can be identified based on the refined user query, and the process may return to the data story generator 222 to generate a data story using the refined user query. For example, the refined user query may be used to identify a relevant set of facts. Using the relevant fact set, a particular set of facts can be selected for use in the data story. This iterative process may continue until a desired data story is attained.
The data summary manager 226 is generally configured to generate a data summary associated with a dataset and/or a data story generated in association with the dataset. A data summary generally refers to a summary of data. Such a summary can be in a natural language form and be more comprehensive than a data story. At a high level, the data summary manager 226 uses the dataset associated with the data story, including the various data facts in the data story, to generate a data summary. As one example, a large language model may be used to facilitate generation of data summaries. In such a case, the data story, including the various data facts in the data story, can be used in a prompt as input to the LLM to generate, as output, a data summary.
Data summary generation may be performed in any number of ways. In some cases, data summary manager 226 may include a prompt generator 230, a data summary generator 232, and a data summary analyzer 234. As can be appreciated, additional or alternative components may be used to manage data summaries, in accordance with various embodiments described herein. In some cases, data summary generation may be initiated automatically, for example, upon obtaining a user query, obtaining a refined user query, generating a data story, etc. In other cases, data summary generation may be initiated based on a user request. For example, in accordance with presenting a data story, a user may select (e.g., via a prompt, link, menu, etc.) to generate and/or view a data summary. Such a selection indicating a preference to view a data summary can initiate or trigger the data summary manager 226 to generate a corresponding data summary.
The prompt generator 230 generates prompts that may be used to initiate generation of data summaries. A prompt generally refers to an input, such as an input text, that can be provided to a data summary generator 232, such as an LLM, to generate an output in the form of a data summary(s). In embodiments, the prompt generally includes text to influence a machine learning model, such as an LLM, to generate text having a desired content and structure. A prompt typically includes text given to a machine learning model to be completed. In this regard, a prompt generally includes instructions and, in some cases, data to use in performing the analysis and/or examples of desired output. In accordance with embodiments described herein, a prompt may include various types of text data. In particular, a prompt includes text data corresponding to facts to use to generate a data summary. For example, data facts selected for a data story, or data associated therewith (e.g., fact captions), may be included in a prompt as input to an LLM to generate a data summary.
In various implementations, to facilitate a more enriched data summary, a more extensive set of facts may be used. In this regard, for input into the LLM, additional facts may be used to supplement the facts in the data story. As previously described, based on a user query or refined user query, a relevant fact set FR is formed, which may result in around 150 to 200 facts. In this regard, facts identified as relevant but not used in the data story may be additionally used as input into the LLM.
To make facts more informative, contextual facts corresponding with the relevant facts can be generated. A contextual fact generally refers to a fact that includes context associated with the fact. In this regard, additional contextual data for each fact fi in the set of relevant facts FR is identified and appended, supplemented, or aggregated with the corresponding fact. Contextual data may be in various formats. As one example, contextual data may be in the form of numerical or categorical values related to various attributes associated with the fact tuple, such as, for example, subspaces, measures, breakdowns, aggregates, focus, etc. In this regard, the prompt generator 230 may obtain contextual data associated with each fact (e.g., via a data store) and append the contextual data to the corresponding fact caption (e.g., obtained via a data store). Accordingly, a contextual fact can be generated for each fact, with each contextual fact including a fact caption and contextual data associated with the corresponding fact.
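As a minimal sketch, generating a contextual fact might resemble the following; the attribute names mirror the fact tuple attributes noted above, while the function and field names are illustrative assumptions.

```python
# Hedged sketch: a contextual fact is the fact caption with contextual data
# (attributes of the fact tuple) appended. Field names are illustrative.
def build_contextual_fact(fact_caption, fact_tuple):
    context_parts = []
    for attribute in ("subspace", "measure", "breakdown", "aggregate", "focus"):
        value = fact_tuple.get(attribute)
        if value is not None:
            context_parts.append(f"{attribute}: {value}")
    return f"{fact_caption} ({'; '.join(context_parts)})"
```

For example, build_contextual_fact("Gold medal counts peaked for swimming.", {"measure": "gold medals", "breakdown": "sport"}) would yield the caption followed by its measure and breakdown context.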
As such, the prompt generator 230 may include contextual facts in the prompt. For example, contextual facts can be generated for the identified relevant facts (e.g., 150 to 200 facts relevant to the user query or refined user query) and included in a prompt for generating a data summary.
In embodiments, the prompt generator 230 is configured to select a set of content or text, such as facts or contextual facts, to use in generating the prompt. Contextual facts, and/or other text data, may be selected based on any number or type of criteria. As one example, contextual facts may be selected to remain under a maximum number of tokens accepted by a data summary generator, such as an LLM. For example, assume an LLM has a 3,000-token limit. In such a case, text data totaling less than 3,000 tokens may be selected. In this regard, prompts may have a size limit, thereby limiting the number of contextual facts included in the prompt. As such, in some cases, it may not be possible to use all relevant facts in FR (e.g., extracted via a similarity search), along with the corresponding contextual data, in a prompt to an LLM due to the size limitations of the LLM. Hence, an optimal set of facts is selected to provide to the LLM to obtain a data summary.
Accordingly, in embodiments, the prompt generator 230 may be configured to select the contextual facts to include in a prompt to generate a data summary. To identify contextual facts to include, a fact score may be used. Such a fact score may indicate an extent or measure of an aspect used to assess which facts are optimal to include in the prompt. For example, a fact score may indicate relevance to a user query, informativeness, diversity, and/or the like.
In some embodiments, contextual facts that provide diversity and informativeness are desired for selection. In this regard, the prompt generator 230 may generate a fact score in association with each fact and/or corresponding contextual fact. For instance, for each fact of the relevant facts, fi ∈ FR, a fact score is generated and assigned. Such a fact score can be based on diversity of the fact in the set of all relevant facts extracted via similarity search and/or the importance score of the fact. A fact score, si, for a fact may be represented as:
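si = (1 − selfbleu(fi)) + IN (fi)

This is one plausible form, assuming an equal weighting of a diversity term, 1 − selfbleu(fi), and an importance term, IN (fi); other weightings of the two terms may be used.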
In this fact score, the Self-BLEU score of fact fi is calculated considering the set of relevant facts FR as the reference corpus. Self-BLEU refers to a metric used to evaluate the diversity of generated texts. Such a score can be calculated by treating one sentence as a hypothesis and the others as references and, thereafter, calculating the BLEU score for every generated sentence. A higher Self-BLEU score for fi denotes lower diversity, implying that there are many other facts in FR similar to fi. On the other hand, a lower Self-BLEU score for fi indicates higher diversity, implying fi is quite unique amongst the facts in the set of relevant facts, FR.
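As a concrete illustration, a minimal Self-BLEU computation over fact captions might resemble the following sketch, assuming the NLTK library; the tokenization and smoothing choices are illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hedged sketch: Self-BLEU of each fact caption, treating the caption as the
# hypothesis and all other captions in F_R as the reference corpus.
def self_bleu_scores(fact_captions):
    smooth = SmoothingFunction().method1
    scores = []
    for i, caption in enumerate(fact_captions):
        hypothesis = caption.split()
        references = [c.split() for j, c in enumerate(fact_captions) if j != i]
        if not references:  # a lone caption has no reference corpus
            scores.append(0.0)
            continue
        scores.append(
            sentence_bleu(references, hypothesis, smoothing_function=smooth)
        )
    return scores
```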
IN (fi) is the averaged normalized importance score of fact fi, as described herein. In this regard, the previously determined normalized importance score may be referenced (e.g., from data store 212) and used to generate a fact score. Generated in this way, the fact score accounts for both the diversity amongst selected facts and their importance scores. As described, diversity helps in avoiding redundancy in fact selection.
As described, the fact scores associated with various facts, such as relevant facts, may be used to select which corresponding contextual facts to include in a model prompt. In some cases, contextual facts are selected in accordance with a set of facts having the highest fact scores or, alternatively, the lowest fact scores. In other cases, selection of optimal facts is posed as an Integer Linear Programming (ILP) problem. Each fact fi in a set of relevant facts, FR, is assigned a fact score, as described above, and the objective is to maximize the total fact score of the selected facts. A constraint on the optimization problem is the token limit of the LLM (e.g., 3,000 for ChatGPT). In this way, ILP can be implemented to select contextual facts to include in the model prompt to generate a data summary.
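By way of illustration only, such an ILP selection under a token budget might be sketched as follows, assuming the PuLP solver library; the function and variable names are illustrative.

```python
import pulp

# Hedged sketch: pick the subset of contextual facts that maximizes the total
# fact score while fitting within the LLM's token limit.
def select_facts(fact_scores, token_counts, token_limit=3000):
    n = len(fact_scores)
    problem = pulp.LpProblem("fact_selection", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(n)]
    # Objective: total fact score of the selected facts.
    problem += pulp.lpSum(fact_scores[i] * x[i] for i in range(n))
    # Constraint: selected facts must fit within the prompt's token budget.
    problem += pulp.lpSum(token_counts[i] * x[i] for i in range(n)) <= token_limit
    problem.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if x[i].value() == 1]
```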
Although input size is generally described herein in terms of tokens (e.g., pieces of words, individual sets of letters within words, spaces between words, and/or other natural language symbols or characters), as can be appreciated, other measures of input size may be used. That is, input size need not be based on token sequence length and may instead be based on other data size parameters, such as bytes, number of words, etc.
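For instance, token counts (or alternative size parameters) might be computed as in the following sketch, which assumes the tiktoken library for an OpenAI-style tokenizer; the model name is illustrative.

```python
import tiktoken

# Hedged sketch: three interchangeable ways to measure prompt size.
def token_count(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def byte_count(text):
    return len(text.encode("utf-8"))

def word_count(text):
    return len(text.split())
```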
In accordance with identifying an optimal set of facts, or contextual facts, such information can be included in a prompt. In addition to the facts, or contextual facts, the prompt may include additional information. For example, to minimize hallucinations, the prompt can include a unique value assigned to each fact, or contextual fact, in the prompt. For example, each fact, or contextual fact, may be assigned a unique value that is indicated via a special token(s). For example, a unique random number may be appended within the special tokens @ and #, or other tokens. In this regard, the prompt can include explicit instructions to the LLM to specify the unique index value. In this way, the LLM is instructed to cite the corresponding fact it has used to form the summary. For example, the prompt may include an instruction that, when an aspect of information is included in the data summary, the corresponding fact is to be referred to or cited.
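A minimal sketch of assigning such unique values within the special tokens @ and # might resemble the following; the id range and function name are illustrative.

```python
import random

# Hedged sketch: tag each contextual fact with a unique random id wrapped in
# the special tokens @ and # so the LLM can cite its sources.
def tag_fact(contextual_fact, used_ids):
    fact_id = random.randint(10000, 99999)
    while fact_id in used_ids:
        fact_id = random.randint(10000, 99999)
    used_ids.add(fact_id)
    return fact_id, f"@{fact_id}# {contextual_fact}"
```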
In embodiments, the prompt may also include data context. Data context provides context about the dataset from which the facts, or contextual facts, were formed. For example, assume a dataset relates to the Olympics in 2020 in Tokyo. In this regard, the prompt may include a data context indicating that the dataset includes information about the Olympics in 2020 in Tokyo, as well as the columns of data included therein. Providing data context gives the LLM more information related to the data, thereby enabling the LLM to better understand the data and produce a better output.
Other types of information that may be included in a prompt include an instruction to generate a data summary, user data associated with a user viewing (or who is to view) the data summary, and/or the like, depending on the desired implementation or output. In accordance with embodiments herein, the instruction can specify to generate one or more data summaries for the provided or corresponding content.
In addition, a prompt may also include output attributes. Output attributes generally indicate desired aspects associated with an output, such as a data summary. For example, an output attribute may indicate a target temperature to be associated with the output. A temperature refers to a hyperparameter used to control the randomness of predictions. Generally, a low temperature makes the model more confident, while a higher temperature makes the model less confident. Stated differently, a higher temperature can result in more random output, which can be considered more creative. On the other hand, a lower temperature generally results in a more deterministic and focused output. A temperature may be a default value, a value based on user input, or a determined value. As another example, an output attribute may indicate a length of output. For example, a prompt may include an instruction for a desired number of paragraphs or sentences (e.g., in association with the data story). As another example, a prompt may include an instruction for a maximum number of characters or a target range of characters. As another example, an output attribute may indicate a number of facts to include in the data summary. As another example, an output attribute may indicate a target language for generating the output. For example, the text data may be provided in one language, and an output attribute may indicate to generate the output in another language. Any other instructions indicating a desired output are contemplated within embodiments of the present technology.
The prompt generator 230 may format the prompt in a particular form or data structure. One example of a data structure for a prompt is as follows:
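The sketch below illustrates one plausible prompt structure combining the elements described herein (data context, contextual facts tagged with unique source indicators, instructions, and output attributes); the field names, ids, facts, and dataset columns are illustrative assumptions, not the exact format used.

```python
# Hedged sketch of a prompt data structure; all field names and values are
# illustrative.
prompt = {
    "data_context": (
        "The dataset contains information about the Olympics in 2020 in "
        "Tokyo. Columns: country, sport, athlete, medal, date."
    ),
    "contextual_facts": [
        "@48217# Country A won the most gold medals overall "
        "(measure: gold medals; breakdown: country).",
        "@90355# Swimming accounted for the largest share of Country A's "
        "medals (subspace: Country A; measure: medals; breakdown: sport).",
    ],
    "instructions": (
        "Generate a data summary of the facts above. When using a fact, "
        "cite its unique index value in the form @id#."
    ),
    "output_attributes": {
        "temperature": 0.3,
        "max_paragraphs": 2,
        "language": "English",
    },
}
```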
The data summary generator 232 is generally configured to identify or generate data summaries. In this regard, the data summary generator 232 utilizes content in the form of text to generate a data summary(s) associated with a set of facts, or contextual facts. In embodiments, the data summary generator 232 takes, as input, a prompt or set of prompts generated by the prompt generator 230. Based on the prompt, the data summary generator 232 can generate a data summary or set of data summaries associated with the facts indicated in the prompt. For example, assume a prompt includes a set of contextual facts associated with a dataset. In such a case, the data summary generator 232 generates a data summary based on the set of contextual facts included in the prompt.
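By way of illustration only, invoking an LLM-based data summary generator might resemble the following sketch, assuming an OpenAI-style chat completion API; the model name and parameter values are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Hedged sketch: pass the generated prompt to an LLM and return the data
# summary text from the response.
def generate_summary(prompt_text, temperature=0.3):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_text}],
        temperature=temperature,  # output attribute controlling randomness
    )
    return response.choices[0].message.content
```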
As described, a data summary generally refers to a summary of a dataset representing various facts. In embodiments, the data summary is provided in a natural language format. The data summary can include any amount of text providing a natural language description as well as any number of visualizations desired for use in summarizing the data.
The data summary generator 232 may be or include any number of machine learning models or technologies. In some embodiments, the machine learning model is a Large Language Model (LLM). A language model is a statistical and probabilistic tool that determines the probability of a given sequence of words occurring in a sentence (e.g., via next sentence prediction [NSP] or masked language modeling [MLM]). In this way, it is a tool trained to predict the next word in a sentence. A language model is called a large language model when it is trained on an enormous amount of data. Some examples of LLMs are OPT, FLAN-T5, BART, Google's BERT, and OpenAI's GPT-2, GPT-3, and GPT-4. For instance, GPT-3 is a large language model with 175 billion parameters trained on 570 gigabytes of text. These models have capabilities ranging from writing a simple essay to generating complex computer code, all with limited to no supervision. Accordingly, an LLM is a deep neural network that is very large (billions to hundreds of billions of parameters) and understands, processes, and produces human natural language by being trained on massive amounts of text. In embodiments, an LLM generates representations of text, acquires world knowledge, and/or develops generative capabilities.
As such, as described herein, the data summary generator 232, in the form of an LLM, can obtain the prompt and, using the information in the prompt, generate a data summary(s) for a set of facts or contextual facts. In some embodiments, the data summary generator 232 takes the form of an LLM, but various other machine learning models can additionally or alternatively be used.
In embodiments, as described herein, the data summary generator 232 may be instructed to include the source indicator to indicate the source or reference associated with the contextual fact. For example, as described, for each contextual fact identified in the selected optimal set of contextual facts, a unique random number may be appended thereto to indicate a source reference. In generating the data summary, the source indicator may be included in association with the corresponding text to indicate the contextual fact, or fact, associated with that text in the data summary. In some cases, the data summary generator 232 may use the source indicator to generate a link to, or a reference for, a corresponding fact, contextual fact, fact caption, and/or fact visualization. For example, in association with a source indicator, a link may be generated that links to or indicates a fact visualization, including a fact caption, that corresponds with the text. In embodiments, the source indicator may be used to look up or identify an appropriate fact citation and/or obtain fact data that corresponds therewith (e.g., a fact visualization). Providing source indicators or citations helps prevent the LLM from hallucinating and promotes factuality. Further, referencing and obtaining the corresponding fact and/or fact visualization enables the data summary to be more engaging and comprehensive.
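As a minimal sketch, resolving the @id# source indicators in a generated summary back to their underlying facts might look like the following; the pattern and function names are illustrative.

```python
import re

# Hedged sketch: find @id# citations in the summary text and map them back to
# the corresponding facts (e.g., to render links or fact visualizations).
CITATION_PATTERN = re.compile(r"@(\d+)#")

def resolve_citations(summary_text, facts_by_id):
    cited = []
    for match in CITATION_PATTERN.finditer(summary_text):
        fact_id = int(match.group(1))
        if fact_id in facts_by_id:
            cited.append(facts_by_id[fact_id])
    return cited
```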
The data summary analyzer 234 is generally configured to analyze data summaries. In some cases, the data summary analyzer 234 may analyze the data summary to identify and/or perform any post-processing techniques. In this way, in accordance with obtaining a data summary, for example, via data summary generator 232, the data summary analyzer 234 may analyze the data summary to generate a data summary score(s) associated with the data summary. Various aspects or characteristics associated with the data summary may be measured or scored. For example, coherence, diversity, informativeness, and/or factuality may be measured or scored in association with a generated data summary.
Coherence of a data summary refers to its clarity, logical flow, and ease of understanding due to well-connected ideas and/or organized structure. In one embodiment, unsupervised text coherence scoring based on graph construction may be used to generate a coherence score associated with a data summary. In the graph construction, edges can be established between semantically similar sentences represented by vertices. Sentence similarity is calculated based on the cosine similarity of semantic vectors representing the sentences. In this regard, to measure coherence, the data summary analyzer 234 may construct a graph with nodes as sentences and edges as similarities between nodes. A threshold can be set to decide on the existence of an edge between a given pair of nodes. The coherence can be calculated as:
coherence = (1/N) Σi (1/Li) Σk eik

wherein N is the number of sentences in the summary, Li is the number of outgoing edges from the vertex vi (sentencei), and eik is the edge between vi and vk, calculated as the cosine similarity between the semantic vectors of vi and vk; that is, eik = cosinesim(vi, vk).
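A minimal sketch of this graph-based coherence score, assuming precomputed sentence embedding vectors, follows; the similarity threshold value is illustrative.

```python
import numpy as np

# Hedged sketch: nodes are sentences, edges are cosine similarities above a
# threshold; coherence averages each sentence's mean outgoing edge weight.
def coherence_score(sentence_vectors, threshold=0.3):
    n = len(sentence_vectors)
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        edges = []
        for k in range(n):
            if i == k:
                continue
            vi, vk = sentence_vectors[i], sentence_vectors[k]
            sim = float(np.dot(vi, vk) / (np.linalg.norm(vi) * np.linalg.norm(vk)))
            if sim >= threshold:  # an edge exists only above the threshold
                edges.append(sim)
        if edges:  # Li is the number of outgoing edges from sentence i
            total += sum(edges) / len(edges)
    return total / n
```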
For evaluating the diversity of a generated data summary, a Self-BLEU score of the data summary may be generated and used. The lower the Self-BLEU score, the higher the diversity. Self-BLEU scores can range from 0 to 1.
For evaluating the informativeness of a generated data summary, average normalized importance scores of facts cited by the LLM in the data summary may be generated and used. One example for determining informativeness includes:
informativeness = (Σi ci · IN (fi)) / (Σi ci)

wherein ci = 1 if fact fi is cited by the LLM in the summary (and ci = 0 otherwise), and IN (fi) is the normalized importance score of fact fi, as previously described.
Factuality refers to the correctness of information provided in a data summary. In some embodiments, factuality in the data summary can be defined by the ratio of facts cited by the LLM correctly (e.g., a corresponding sentence in the summary appropriately describes the fact) to the total number of facts cited by the LLM.
The data provider 234 is generally configured to provide data stories and/or data summaries, for example, for display or further processing.
In some cases, upon generating a data story(s) and/or data summary(s), the data provider 234 can provide such data, for example, for display via a user device. To this end, in cases in which the data fact engine 212 is remote from the user device, the data provider 234 may provide a data story(s) and/or data summary(s) to a user device for display to a user interested in the content.
Alternatively or additionally, the data story and/or data summary may be provided to a data store for storage, or to another component or service, such as an analysis service (e.g., data analysis service 118), for subsequent use.
The data story and/or data summary may be provided for display in any number of ways. In some examples, the data story and/or data summary may be automatically displayed upon being generated (e.g., concurrently displayed). For example, a data story may be presented with a corresponding data summary. In other cases, a user may select to view the data story and/or data summary. For instance, a link may be presented that, if selected, presents the data story and/or data summary (e.g., integrated with the content, or provided in a separate window or pop-up text box). As one example, a data story may be presented along with a link to open or present a corresponding data summary. Based on selection of the link, the data summary may be presented.
By way of example, a data story 1318 may be presented to a user along with an option to provide relevance feedback. Based on a user selection to refine data story 1318 in accordance with the relevance feedback, a refined data story 1320 is presented.
As described, various implementations can be used in accordance with embodiments described herein.
Turning to the flow diagrams, methods, such as method 1400, illustrate example implementations of the technology described herein.
Having briefly described an overview of aspects of the technology described herein, an exemplary operating environment in which aspects of the technology described herein may be implemented is described below in order to provide a general context for various aspects of the technology described herein.
Referring to the drawings in general, an exemplary operating environment for implementing aspects of the technology described herein is shown and designated generally as computing device 1700.
The technology described herein may be described in the general context of computer code or machine-usable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program components, including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. Aspects of the technology described herein may be practiced in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, and specialty computing devices. Aspects of the technology described herein may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
With continued reference to the drawings, computing device 1700 includes a bus 1710 that directly or indirectly couples the following devices: memory 1712, one or more processors 1714, one or more presentation components 1716, input/output (I/O) ports 1718, and I/O components 1720.
Computing device 1700 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computing device 1700 and includes both volatile and non-volatile, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program sub-modules, or other data.
Computer storage media includes RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Computer storage media does not comprise a propagated data signal.
Communication media typically embodies computer-readable instructions, data structures, program sub-modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
Memory 1712 includes computer storage media in the form of volatile and/or non-volatile memory. The memory 1712 may be removable, non-removable, or a combination thereof. Exemplary memory includes solid-state memory, hard drives, and optical-disc drives. Computing device 1700 includes one or more processors 1714 that read data from various entities such as bus 1710, memory 1712, or I/O components 1720. Presentation component(s) 1716 present data indications to a user or other device. Exemplary presentation components 1716 include a display device, speaker, printing component, and vibrating component. I/O port(s) 1718 allow computing device 1700 to be logically coupled to other devices including I/O components 1720, some of which may be built-in.
Illustrative I/O components include a microphone, joystick, game pad, satellite dish, scanner, printer, display device, wireless device, a controller (such as a keyboard and a mouse), a natural user interface (NUI) (such as touch interaction, pen [or stylus] gesture, and gaze detection), and the like. In aspects, a pen digitizer (not shown) and accompanying input instrument (also not shown but which may include, by way of example only, a pen or a stylus) are provided in order to digitally capture freehand user input. The connection between the pen digitizer and processor(s) 1714 may be direct or via a coupling utilizing a serial port, parallel port, and/or other interface and/or system bus known in the art. Furthermore, the digitizer input component may be a component separated from an output component such as a display device, or in some aspects, the usable input area of a digitizer may be coextensive with the display area of a display device, integrated with the display device, or may exist as a separate device overlaying or otherwise appended to a display device. Any and all such variations, and any combination thereof, are contemplated to be within the scope of aspects of the technology described herein.
An NUI processes air gestures, voice, or other physiological inputs generated by a user. Appropriate NUI inputs may be interpreted as ink strokes for presentation in association with the computing device 1700. These requests may be transmitted to the appropriate network element for further processing. An NUI implements any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition associated with displays on the computing device 1700. The computing device 1700 may be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations of these, for gesture detection and recognition. Additionally, the computing device 1700 may be equipped with accelerometers or gyroscopes that enable detection of motion. The output of the accelerometers or gyroscopes may be provided to the display of the computing device 1700 to render immersive augmented reality or virtual reality.
A computing device may include radio(s) 1724. The radio 1724 transmits and receives radio communications. The computing device may be a wireless terminal adapted to receive communications and media over various wireless networks. Computing device 1700 may communicate via wireless protocols, such as code-division multiple access (“CDMA”), global system for mobiles (“GSM”), or time-division multiple access (“TDMA”), as well as others, to communicate with other devices. The radio communications may be a short-range connection, a long-range connection, or a combination of both a short-range and a long-range wireless telecommunications connection. When we refer to “short” and “long” types of connections, we do not mean to refer to the spatial relation between two devices. Instead, we are generally referring to short range and long range as different categories, or types, of connections (i.e., a primary connection and a secondary connection). A short-range connection may include a Wi-Fi® connection to a device (e.g., mobile hotspot) that provides access to a wireless communications network, such as a WLAN connection using the 802.11 protocol. A Bluetooth connection to another computing device is a second example of a short-range connection. A long-range connection may include a connection using one or more of CDMA, GPRS, GSM, TDMA, and 802.16 protocols.
The technology described herein has been described in relation to particular aspects, which are intended in all respects to be illustrative rather than restrictive.