The field of the invention relates to computer implemented methods and systems of analysing, querying and interacting with data.
A portion of the disclosure of this patent document contains material, which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The ability to search, rapidly explore and gain meaningful insights across every dataset has the potential to transform the way ordinary and professional users interact with data. However, data is inherently imprecise and people's questions tend to be ambiguous. This is particularly the case when dealing with datasets from many different sources or when queries are complex.
Conventional database query systems require precision in their datasets and also precision in the queries being processed in order to produce exact outputs. Hence they cannot cope with the imperfection of the real world, such as imperfect data and ambiguity of a query.
Currently the conversion from structured data to a precise output still needs human oversight, where a series of entirely deterministic assumptions (often effected using multiple products or packages—e.g., data cleaning and querying) are performed and tracked manually making these assumptions and the associated decisions difficult to track, reverse or communicate.
Solutions to date require skilled data analysts and can be slow if data cleansing is needed first. Skilled data analysts are in many cases required to: clean data in order to reduce it to a precise form, translate ambiguous questions into a structured query language (SQL or equivalent), or manage the flow of information within the system, providing the context for how the data and query should be prepared with respect to each other. (i.e., every question potentially requires a re-examination of the data to ensure the two are consistent). In addition, the cleaning of the dataset and the translation of the query are performed by different entities (different people for example, but which may look superficially the same e.g. the same person using different disconnected programs with little ability to pass information about the assumptions made between them, or with a substantial time between performing the actions during which information is forgotten, or people using programs on the same machine, which even running on the same processor are by default unable to communicate). and no entity can be held accountable or have its actions verified by any other due to loss of information.
The solutions are therefore limited to small silos of specialists, are costly, time consuming and cannot scale: putting every dataset into context with every other scales as N2, where N is the number of datasets. It is certainly not possible for a single human user to hold the context for N>100 datasets simultaneously, and difficult for N>10. Therefore a different approach is needed.
Attempts to solve this problem through standardisation are also not scalable. Standardization has been shown to be ineffective even in fields that are well suited to it (e.g., even after 30 years of standardisation, the cleaning of dates and times in data is still a time-consuming process; and, while longitude and latitude are successfully used to denote a point on the earth, there is no universal adoption of a single geographical projection) and typically involves the loss of information. In addition, current solutions are ill suited to various fields that include complex and evolving concepts, or the interaction of multiple proprietary systems, where aiding communication outside of the system is often intentionally or unintentionally neglected e.g., the Internet of Things (IoT), the digital music industry or academic research.
There is a need for a system that would enable anyone to ask complex question and that would deal with unclean data automatically, while at the same time providing the most responsive and intuitive user experience.
Delivering on this has required a paradigm shift in the way we think of querying datasets and designing databases.
A first aspect of the invention is a computer-implemented method of querying a source dataset, in which:
(i) a user provides a query to a dataset querying system; and
(ii) the system automatically processes simultaneously and/or in a linked manner both the dataset and the query, so that processing the query influences the processing of the dataset, and/or processing the dataset influences the processing of the query.
A second aspect is a computer-implemented method of querying a source dataset, in which:
(i) a user provides a query to a dataset querying system; and
(ii) the system automatically processes the query and the dataset to derive a probabilistic inference of the intent behind the query.
A third aspect is a computer-implemented method of querying a source dataset, in which:
(i) a user provides a query to a dataset querying system; and
(ii) the system automatically processes the query and the dataset and dynamically generates a series of relevance-ranked attempts to answer that query or to infer the intent behind the query as an initial processing step; and
(iii) the user further expresses their intent by interacting with the relevance-ranked attempts to answer that query (e.g. enters a modified query, selects a part of a graph) and the system then iteratively improves or varies how it initially processed the query and the dataset, as opposed to processing in a manner unrelated to the initial processing step, to dynamically generate and display further relevance-ranked attempts to answer that query, to enable the user to iteratively explore the dataset or reach a useful answer.
Further aspects of the invention include computer-implemented systems for querying a source dataset that implement the above method-related aspects of the invention.
Aspects of the invention will now be described, by way of example(s), with reference to the following Figures, which each show features of an implementation of the invention:
An implementation of the invention relates to a system allowing anyone to write complex queries across a large number of curated or uncurated datasets. These inherently ambiguous or imperfect queries are processed and fed into a database that is structured to handle imprecise queries and imprecise datasets. Hence, the system natively handles ambiguity and surfaces something plausible or helpful and enables ordinary and professionals users to iterate rapidly and intuitively to as precise an answer as the dataset is capable of supporting.
Instead of attempting to obtain perfectly structured data and perfectly structured queries, we instead re-architect the entire querying and database stack to work with ambiguity—the complete opposite to conventional approaches which require precision in the query and are intolerant to imperfections in the datasets.
This will transform the way people interact with data: immersive searching and exploration of datasets will become as ubiquitous and pervasive as searching the web.
The product focussed on in this work is a desktop software application (app), an example of the use of which is given. However, the technology used in the application is not limited to this use-case, and could also be used, for example, for providing data online, for hosted and cloud-based server solutions in companies, and within mobile applications. Therefore, although in the description below, the database and the front end of the product are often described in the context of the desktop application, the technology should also be considered in its other uses.
We use the term dataset to cover any input of data to the system—for example: a csv file, an API connection, any collection of data that can be written in a relational/tabular form (e.g., tables in a SQL database, delimited text files, tables within spreadsheets), as well as any collection of data that can be written in a non-relational form (e.g., collections in a NoSQL database, nested or hierarchical files, prose, a newspaper article, pictures, sounds). Because we expansively define a dataset to include any input of data to the dataset querying system, it includes also any representation of a source dataset, including an index of that source dataset. The number and scale of the datasets that can be queried is technically unbounded (except by computational constraints)—in principle it can be extended to all datasets—such as all datasets existing in the world, including some or all web pages indexed by the Google search engine, some or all information on social networks, some or all personal information and the data generated by some or all IoT devices.
By a query we mean any input information by an end-user—for example: any type of query, precise or imprecise, a null query, a query in NL “natural language”, a gesture, a voice command, a keyword, a hint from another computer, any type of user behaviour or interaction such as a click, a visual exploration, a selection of settings, a touch or smell. It includes any interaction or user input to make the system (including the database that forms part of the system) do something and/or make the interpreter (see definition below) update its state (which typically happens as more information is provided to the interpreter).
A structured dataset is a dataset which has been created in or modified into a form which, the entity modifying it thinks is accurate and unambiguous, and which can be queried by a database.
A structured query is a query which has been created in or modified into a form which, the entity modifying it thinks is accurate and unambiguous, and which can be used by a database to act on a structured dataset.
Structured databases act by applying structured queries to structured datasets. These are the databases often found within organisations. Usually, a human intermediary converts a dataset to a structured dataset by cleaning and converts a query to a structured query by translating the query.
Structured datasets are a subset of datasets. An existing SQL database usually requires little or no cleaning for simple, self-contained analysis and is often considered to be a structured dataset, even though the conversion into a computer-usable form often introduces inaccuracy and ambiguity through the loss of information it entails. In a similar way structured queries, for example a SQL query, are a subset of queries. We could therefore describe datasets/queries which are not structured datasets/queries as unclean, imprecise or ambiguous.
An organisation is an entity which, as a whole, interacts with datasets and/or queries. The organisation may be made up of many more localised entities, for example employees within a company, individuals or machines (e.g. IoT devices). Organisation is used here to reflect the main contemporary use case in companies due to the nascence of individuals' interaction with data and of computers' autonomous interaction with data.
We define a dataset context to be the information which interacts with a dataset—is applied to it, or which is extracted from it—when cleaning. This could be expressed as a series of instructions or as an explanation of what has been performed. The context includes not only this information, but how to present that information to another entity, this can range from the encoding/character set/language to a relative visual layout.
We define the equivalent for a query context in the case of translation.
A dataset or query context is created as an organisation cleans a dataset or interprets a query. While some software exists to store the recipes used, this is not typical, and relates to the single source of truth problem. Broadly we can see a dataset as a structured dataset plus the dataset context, and a query as a structured query plus the query context.
A context may contain the following, but is not limited to:
We can see a context as being itself a dataset or containing datasets. This forms a second order problem in that if the dataset contexts created from a number of datasets are themselves not structured datasets a further cleaning into a new structured dataset and context will need to be performed, and so on, recursively.
Problematically, this is typically the situation in companies—the dataset context is stored by the employee who has performed the cleaning, as a new dataset which itself has a context, and so on, but where the eventual context resides in a place inaccessible to the rest of the organisation—thoughts or private notes.
We will describe both cleaning a dataset or translating a query as interpreting which is performed by an interpreter. Conventionally, these are a specially trained individual such as a data scientist, but we use it to refer to any entity. In our implementation, the interpreter is computer implemented. When an interpreter simultaneously operates on a dataset and a query it creates a context.
The Interpreter is an evolving component of the system, and the evolution of the interpreter leads to the evolution of answers. The level of information used to evolve the interpreter after an interaction is determined by the provenance and restrictions on the use of the data.
The properties, and therefore behaviour, of the interpreter may be determined by, but not limited to:
Running a structured database with a structured query acting on a structured dataset, presenting it and adding to this the context produces the answer.
The intent of the user is what they are trying to achieve in the system of queries and datasets we are considering. This is not necessarily explicitly discernible or inferable from their Queries and/or their Datasets in isolation, but is their goal.
Processing or cleaning a dataset has a broad meaning and may refer to any transformation that can be applied to a dataset or a portion of the dataset, such as but not limited to: finding and replacing values, deleting a column or a value, performing a function to produce a new column, transliteration of values, unit of measurement conversion, phonetic correction, format localisation, changing the type of a column or value, formatting a column or value or performing any function in relation to one or more columns. In essence, ‘cleaning’ or ‘cleansing’ the dataset means transforming the dataset such that the user's or a computer's understanding or comprehension of the dataset can be improved. As an example, the value of a column may be translated or formatted depending on the location of the user—e.g. if the user self-describes as being in New York, then the dataset could be ‘processed’ or ‘cleaned’ by joining additional map datasets that cover New York in to the dataset currently in use. Or currency amounts could be converted to USD. If the user self-describes as being a CEO, then the properties of the dataset could be altered to those which are more likely to appeal to a CEO; likewise, if the user self-describes as a data analyst, then properties of the dataset could be altered to those which are more likely to appeal to a data analyst. The scope of this term is hence significantly broader than ‘cleaning’ in the narrow sense, namely identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
We will refer also to processing or translating a query. This term should be expansively interpreted to cover any process that contributes to generating a structured query. Less obvious examples within this expansive scope include, where the query is a speech input, then translation includes the speech recognition and language processing. It includes literal language A to language B translation. It includes error correction. It includes interpretation of domain specific terms in the context required, e.g. “best” of a ranking is minimum, not maximum; “seasonally adjusted” in the US is usually defined differently to in the UK/Europe.
Since the query may also be anything which makes the interpreter update its state, we also extend it in general to mean any interaction of the user which causes the interpreter to update its state. While most of the interactions of the user with the interpreter will be some form of query in the usual sense of the word.—i.e. a question, and expression of intent to receive an answer, and hence the word “Query” is used here to aid reading, we emphasise that other interactions with the interpreter may be considered in the same way. A user being in the US (and therefore likely using USD and MMDDYYYY dates) could be inferred from the questions they ask and the datasets they use, but could also be known or inferred from another interaction; for example an explicit selection by the user of their locale as US when the software was installed, or the IP address of the user being noted as being in the US when the user accesses the system over the Internet.
We will use both the capitalised and non-capitalised forms to refer to these defined terms.
The examples and figures below illustrate the current practice and distinguish it from the proposed approach.
The examples illustrate the role of the interpreter and of intent sitting above datasets and queries. Currently, the interpreter of the dataset, who creates the structured dataset and the dataset context, is not the interpreter of the query, nor are they aware of what the query will be.
The following pair of examples in the house price data is given, where the data consists of the columns “price”, “date”, “locality”, “district”, “county” and an example row is: “100000”, “10-02-2014T00:00”, “Retford”, “Bassetlaw”, “Nottinghamshire”
In both cases no answer is explicitly obvious. The first may be attributed to laziness, the second may likely be a need to get the dataset into a recognised format for a structured database.
A common pitfall demonstrated by “official” datasets such as the house price data is to assume that a structured dataset is accurate, as this is the point of the process of forming structured datasets in organisations. From discussion of the workflow with representatives from a variety of organisations, this is seen to rarely be the case.
Firstly we will explain the status quo:
With reference to
The arrows to the bottom represent the output and the arrows show the flow.
With reference to
In our example, the interpreter has split the raw input of houses being registered into a structured dataset (D′) that we can download and a context, C1. The context is then hidden. In this case it contains, for example: “Murton (town, row 23568) looked funny to me, I thought Murton was in London so I've updated fields accordingly” or “Had to get datetimes to ISO for the SQL database—chose DD-MM-YYTHH:MM for that—HH:MM=00:00 is correct due to ‘Purchase of Houses Act 1983’”, however that context seems to reside entirely in one/many (over the past 20 years the data has been recorded) employees' heads at the Land Registry. Of course, the importance of either bit of information is very much dependent on the query—in the Bristol branch of Foxton's estate agents neither will make any difference (assuming they will never analyse house prices by hour of day or in London), whereas in the Merton branch the former point certainly will.
With reference to
In reality, something like
In essence the second interpreter is, due to uncertainty in the contents of C3, having to try to guess from D′ and C1 what D probably was, then re-interpret that probable D in the context of Q to get a new D′ and Q′. This obviously has a lot of room for error, and also is inefficient and not scalable, nullifying the point of cleaning D→D′ in the first place.
The two main problems are:
There are ways round this problem:
An implementation of the invention solves both these problems by using a single interpreter with knowledge of both D and Q: C, which replaces C1 and C2, is constructed and known internally to that interpreter. The setup is shown in
At a very high level, the system dynamically manipulates the dataset in response to a query. The query acting on one or more datasets triggers the creation of a structured query, a structured dataset and a single context. This process is being performed by a single interpreter. This contrasts in particular with the scenario where a dataset is interpreted by a first interpreter to a structured dataset and a context, followed by a second interpreter (possibly taking into account some or all of that context) interpreting a query to a structured query to act on that structured dataset, along with a new context. An aspect of the implementation hangs on the use of a computer implemented interpreter, otherwise the criticism of scalability remains. However, for a computer the time to run the query and the time to perform the D→D′ and Q→Q′ transformations are very similar. This is unlike a human, where the transformations are the bottleneck.
At a very high level, the system's scalability is therefore obtained by removing human elements from the processing of the dataset and from the processing of the query since the interaction with a dataset by a human is constrained by the time taken for the human to interact and the time taken for the machine to perform the query, not the time which would be taken for the machine to replicate the human's interactions. Hence, the scalability of the system may be as good as if a perfect dataset cleaning followed by distributed query translation model were used.
To explicitly show what happens in our system in the current example (see
In each case, the context provided to the user and the answer are directly related to the query and the dataset, making the minimum number of assumptions, therefore providing the minimum possible scope for error or confusion. This is due to the fact that the context and answer give a complete account of the actions of the system, and the queryer has complete control over the system's actions. A single interpreter has complete visibility of the system and can be held entirely accountable for whatever actions are performed. This contrasts with a scenario where cleaning of the dataset and the translation of the query are performed by different entities, neither of which can be held accountable or have its actions verified by the other due to loss of information.
We now move on to describe the computer implemented interpreter in more detail.
A probabilistic interpreter (which we shall just refer to as an interpreter below) is an interpreter which creates a series of possible {Q′, D′, C, Answer} with different weight/chances attached to them, allowing a ranking.
A probabilistic interpreter builds up a set of instructions as to how to proceed to create the list of {Q′, D′, C, Answer}, when seeing a query Q and dataset D, based on at least.
The Interpreter creates multiple possible sets of {Structured Query, Structured Dataset, Context, Answer} each with an associated probability. The probabilities are formed from a combination of probabilities local to individual aspects of the interpretation, from global properties of the interpretation, from a combination of probabilities local to the particular query and dataset and from probabilities from a stored knowledge of the interpreter.
At a very high level, the query is processed by an interpreter that aims for intent, not query, resolution. This contrasts with other solutions which provide an answer which satisfies the query asked “to the letter”, as it is often their stated aim and is often achieved by enforcing that aim even when it ignores a user's clear expression of intent, allowing no error in either understanding or execution by the user asking the query. This rigid enforcement of a meticulous response to the query asked provides a barrier to non-technical or non-expert users trying to query a database. Even after having rephrased their query into the language required by the database they are using, where concepts which are quite simple to think or express in natural language are often very difficult to write (for example a question like “Which London borough had the largest increase in house price in 2015” is difficult to ask a standard, SQL, database, yet very easy to write and understand in English), they can still be defeated by a single error, a misplacement of a space or comma for example, in executing the query.
Multiple Answers allow an interpreter to iterate its understanding of the user's Intent.
Multiple Answers are presented to the user, which may be used to:
Based on the user's interaction with the possible answers, the Interpreter can refine:
Only one answer could be returned, however this would significantly hinder the Interpreter's ability to learn off the user's behaviour, as well as the user's experience. The interpreter uses but is not limited to:
In human terms the probabilistic interpreter is learning what the queryer's intent is through its repeated interactions with them. The better the alignment between its understanding of the user's intent and the user's actual intent, the better the experience for the user.
As stated above, intent is what the user wants to see. This is not necessarily what they have asked for, verbatim. This distinguishes our approach from a standard analytic tool. Every aspect of the user's interaction with the system is part of the intent—what datasets they interact with, what they do ask, what they don't ask.
Through continual interaction with the user and learning from their response to the suggestions provided, the interpreter can be updated to provide the highest possible chance of matching the user's intent.
In the event that the user's intent is unclear, either due to the query or the dataset, the explicitly given intent is supplemented by the behaviour of the interpreter. The combination of learnt, deterministic and other behaviours can be simple or sophisticated.
The interpreter infers the user's intent not just from the most recent interaction (for example the query e.g. writing some words or clicking on a graph, and the currently loaded datasets) but the entire history of their interaction with the system.
At a very high level, the system enables the ‘exactness’ of the input from the user to scale with ‘exactness’ of the intent. Hence, if the user has a broad intent, they can express this, if they have an intent which is a very explicit query they can ask this. The remainder of the work to make Answers and Contexts from the intent expressed by the user and the available datasets is performed by the interpreter. In contrast, no matter what the user's intent, current analytical systems force them to construct a structured query, often having to add extraneous information which is not what they actually wish to ask
We begin with the house price example, before moving to another fictional dataset to provide further examples.
Now another example:
Note the contrast with the closest ‘exact’ structured queries for 5, 6, 7 which are: for 5 something like “SELECT * FROM ‘sales_figures’” i.e. just show the table, for 6, “SELECT * FROM ‘sales_figures’ WHERE ‘county’=‘Bedfordshire’”, and for 7, “SELECT * FROM ‘sales_figures’ WHERE ‘county’=‘Northumberland’”. These indeed capture the translation of the query to a structured query in the context of the dataset, but are far removed from the eventual intent of the user.
Other examples are given:
Creation of context with and without a query being entered:
Continuing to use a stored or iterative context
Additionally, the answer ignoring the interpreter's internal understanding is returned first, in order to avoid the failure mode that the user actually has the intent they have perfectly expressed in their query which may however seem very unlikely to the interpreter.
A null query also gives meaningful results. The datasets are interpreted using the interpreter's stored knowledge from e.g. previous behaviour.
At a very high level, the interpreter handles both content generation and display. The answers include both the data and its presentation and the context, which itself includes presentation to another entity. Hence, the presentation of the content is treated on the same level as the generation of the content. Particularly, with a dataset and without a query, a ‘homepage’ is created, where the manner of display of the information within the dataset is particularly important.
This system makes the exploration and visualisation of data as easy as possible for as many people as possible. The metrics by which we judge the product are different to those used in competitive products and drive the unique technical solution. This allows us to make progress and innovate in what might appear at first glance to be a full or saturated research area.
Key aspects of the system are the following, but not limited to:
This is achieved by cleaning the dataset and translating the query to create a single context using a single interpreter. The system therefore responds to the intent of the user (determined through the learning of the interpreter, taking hints from previous and current interaction of the user, the organisation and the world with the system) rather than their specific query being asked/dataset being analysed at that one time.
The interpreter therefore creates, uses and stores the context for rapid retrieval and modification. The context may include the dataset context and query context as defined above. The context is continuously updated and improved as the system iteratively resolves a query/intent provided by the user with further user input. The record of the interactively generated contexts can also be used between datasets, sessions or users to inform future query results, learning, on a local and global scale.
An innovation here is the realisation that we must have an entire system optimised and designed from the outset with that goal in mind. Many products could argue that they try to solve one of these problems. We have realised that the only way to solve them is to provide an integrated solution to all of the problems, substantially simultaneously.
Technologically, we have to provide an entire, integrated and consistent solution. Because of this, the vast majority of the code that is described here is proprietary and has significantly different functionality to other products in the field.
The different parts of the system are highly interconnected in order to optimise the system for the best response(s) globally (in contrast to a modular approach, in which at best local maxima can be found, even if parts of the system can operate probabilistically): for example, a natural language engine in the system can predict how interesting the results of multiple potential interpretations of a query will be to the user by firing several through the database, and it uses these results as part of its method to determine the best interpretation(s) it outputs to the user. When it is unclear to the system what the best outputs are, various options are given, allowing the user to iterate quickly to the desired result.
This implementation is a method of analysing data in which the data includes one or more raw datasets which have not necessarily, in whole or part, been cleaned for the purpose of facilitating querying and are in that sense ‘imprecise’, and queries which are not necessarily structured to conform to a structured querying language or protocol and are in that sense ‘imprecise’; and in which an imprecise query triggers the creation or recall of a new or different dataset from the imprecise raw dataset, the new dataset being a modified version of the imprecise raw dataset that permits or facilitates querying by the imprecise query.
The method is not limited to only “imprecise” datasets and queries, and may also be generalised to analyse structured datasets from structured queries, or may use any combinations of structured/unstructured data and query.
Another implementation is a system that includes an interpreter or probabilistic interpreter, a database; and a user interface to accept queries and display answers, such as charts, maps or other information, operating as an integrated system, where the system automatically processes (e.g. cleans) the dataset and processes (e.g. translates) the query simultaneously or in a linked manner, so that processing the query influences the processing of the dataset, and/or processing the dataset influences the processing of the query.
Optional features in an implementation of the invention includes any one or more of the following:
Whilst a NL front-end represents one key use-case that is discussed here, the following description and examples can be generalized to any use cases where inherently ambiguous datasets are interrogated or explored. Hence, any reference to a query or a NL query can be generalized to any type of query as defined above.
The following sections take us deeper into an implementation. References (§) in this section will refer to subsections of ‘2. Details of an implementation’. For example, 95 refers to section 2.5, and §§ 3.4-3.5 refers to sections 2.3.4 to 2.3.5.
The functionality of the application—aside from the interaction with the user and the display of the visuals—is provided by the program Bethe, which is written in C++. It connects to the front end by using Chrome's V8 runtime engine, which underlies Node.js (nodejs.org). The rest of the code in Bethe is proprietary and bespoke, using no libraries beyond the C++ 11 standard (which is supported in its entirety by all major proprietary and open source C++ compilers). As such, in the context of discussing components of Bethe, “Bethe uses” or “we use” does not hide the use of an underlying library in which any patentability and/or intellectual property could ultimately reside.
The vast majority of the material in the technical description of the document relates to Bethe, with the exception of parts of § 5, which relate to the front of the app. This is written in JavaScript and is used in the app with the help of Electron (electron.atom.io), which wraps a mixture of HTML, CSS and JavaScript into a desktop application.
Here we show an example of how this would be implemented for a user. The user can move in any order through this process, however we describe the three stages here in the natural order for a simple use of the app.
Here the user brings in the datasets they wish to use.
Front of app, § 5.1: The user drags and drops or pastes in a file/buffer (or, extending the current UX, opens a connection to a database, which may reside on the user's computer or an external server). The user is informed of the progress and success of parsing on various simple metrics, e.g., the determination of the file separators, as well as more complex metrics, e.g., the concatenation of the imported tables. In the progress of that they are shown a sample (e.g. top 10 lines plus the headers) of the table or a representation of the spreadsheet sent by Bethe. If mistakes have been made by Bethe these are then rectified by the user (for example, the separator can be changed and the file re-parsed).
Bethe: For machine-structured data (e.g., delimited text files or database extracts), Bethe uses the techniques in § 2.1, and for human-structured data (e.g. Microsoft Excel spreadsheets), the columns/data are identified using the parsers (Appendix A) and the preparation described in § 2.1.1 and § 2.1.2 is performed. The conclusions are relayed to the front of app, which then displays them to the user.
Result: The files have been loaded into the database and are ready for the user to query them.
Front of app, § 5.2: The tables can be seen, columns created, global filters applied and manipulation and cleaning of the data performed.
Bethe, §§ 3.4-3.5: here the database must create new columns, edit their values, perform SQL-style queries and help with cleaning.
Result: A table is now ready to be explored.
Front of app, § 5.3. The user experience (UX) is driven by natural language with feedback at a word and phrase level. Charts are displayed along with the interpretation of the query and suggestions.
Bethe: Interprets the queries (as described in § 4), providing feedback on the named entities and associated values (such as dates) and responding to feedback. Performs the queries, automatically joining. Makes suggestions based on the current tables and previous behaviour.
Here we show a sample workflow, to give a more concrete idea of what the user experience is in a simple case. While this does not show all features of the system we describe, or of a product based on it, it should serve as an indication how the system is used in practice.
Our user, working in a medium-sized UK company with offices around the country, is interacting with an implementation of the system in a desktop app. She decides to have a look at comma-separated values (CSV) files containing a list of employees first name, last name, office, date joined, region, productivity and salary from 100 offices, in 100 separate files. She drags and drops the files onto the import screen and they are imported. The app detects that they are similar and concatenates them—if they weren't, she could modify that and re-parse.
She is first taken to the “home page” of the data she has brought in. She actually has a specific intent as to what she is looking for—her intent is to look at the performance of John Smith relative to that of the other employees—is he providing good value?However, she finds the homepage useful to remind her of the columns in the dataset she has brought in and what their contents are and to see a few overall metrics. Since she has previously brought in an older version of this dataset, a few months ago, one of the streams on the homepage replicates some of her earlier queries.
She sticks with her original intent and first types “John Smith”. This gives her the exact result—a list of people with first name John and last name Smith—along with a number of other visual insights about the productivity, salary etc. within that group.
There are obviously several John Smiths in the company, and so she moves to the list of people. Based on her query and previous behaviour (having judged the column contents), the interpreter has formed a name column from the first and last name columns—she accepts this, adding it to the table. She wants the John Smith from Swindon and so clicks on that entry, she also wants to ignore interns with very low salaries and so applies a global, persistent, filter to the salaries column (excluding salaries below £10,000). The app flags up that one of the salaries is 100 times larger than any others—she corrects that error. It also flags that “date joined” has been assumed to be an integer—the dates are written as YYYYMM—201503 for example—and the interpreter has erred on the side of caution—she ignores that, since she's not looking at dates, and also ignores a couple of flagged up possible name misspellings since they're irrelevant for her intent.
The “John Smith” query now has a locked “Swindon” token appended to it and is now returning the exact result—the Swindon office John Smith's record—along with insights about John Smith in the context of the other employees in the Swindon office and in the West England region. She selects the graph of “productivity by employee in West England” and sees John Smith in context, followed by clicking on an insight in the sidebar and looking at Swindon's average productivity in context of the “West England” region. In general Swindon is less productive, but equally, she thinks they probably get paid less there.
To explore that, she changes her query and asks: “productivity salary” (in fact she types “productivity salry” but the autocomplete corrects the spelling). As well as the exact result, several other visuals are provided, including, near the top, “productivity and salary and office”—the interpreter flags up that this was highly ranked because “office” causes clustering on that graph, showing that “office” is a reasonable predictor of the relation between “productivity” and “salary”, whereas the graph with “region” instead of “office” has no real pattern. She realises that she should probably therefore look at John Smith relative to others in his office and really ignore the region—it not being as important. She zooms in on the Swindon office on the “productivity and salary and office” graph—which therefore restricts the insights to the Swindon office and other offices with similar salaries and productivities. Due to her previous queries regarding John Smith (Swindon) his point is highlighted on the graph.
She quickly exports the chart that she initially needed—John Smith has a reasonably high productivity for his office relative to his salary—so that's good news—but overall Swindon's productivity is low. Since she was looking at the Swindon office, the interpreter contextualises this to the West England region—“average salary and productivity by office in West England” is returned in the sidebar of the larger visual and the interpreter flags up that Swindon is a low outlier relative to other offices round there. This will be something to bring up with the West England manager.
To help aid the visual impact for that discussion—despite it not really being crucial—she asks to see the “average productivity by office in West England” and selects a map-type plot. The interpreter automatically joins the office names to its internal database of UK towns and plots those points on a map with different size points for the productivity in each office—which makes a nice slide. Since a fair amount has been done behind the scenes she quickly checks the method the interpreter has returned—the table of data looks reasonable, and the fuzzy join again has been done properly. She sees that the global filter is still on though and turns that off in the method since it would complicate the story for the West England manager—the map automatically updates.
This is not something she would have been able to make without the support from the software since she would have had to have found the data for the towns, fuzzily joined the offices to the towns (“Swindon Office”=“swindon”) and then plotted it on a map, which is well beyond the technical ability of an average business user.
She now returns to Swindon and decides to look at the 20 employees there—is it a case that the longer-term employees like John are simply getting lazy? She looks at “productivity year” and filters for Swindon. “date joined” was previously being interpreted as an integer—to respond to this query the interpreter dynamically ‘re-cleans’ the dataset with the knowledge that a date column is present (i.e. it now recognises 201503 as March 2015) and flags up this change in the method it returns to her. The “average productivity by year of ‘date joined’” graph shows a general rise with time—the younger people are keener—she looks at John in that context . . . .
Here we have shown a subset of functionality, including:
The technology discussed here could be applied to many other problems.
The discussion in this document has immediate relevance to many problems in the modern business environment in which a user interacts with a system or database through an interface, prime examples being document management systems and version control systems. Such systems are often in fact a database, however they are not transparently implemented as such to the user.
Considering both the premise of producing an accurate, quick and high quality result with little expertise and the natural language specifically, the techniques used (discussed in § 4) again have a broad applicability. We are not “using NL to understand a sentence” but rather “using NL to translate the sentence into a plausible database query” which is what makes the problem tractable. In addition to routine or structured tasks in the office environment, this approach could be used to control machinery or other systems with a known set of outputs, particularly providing at least a simple level of functionality to a non-expert user—“using NL to translate the sentence into a plausible system behaviour”—while not needing software bespoke to the system, just as the software here is not written for a single dataset or data source. In the case of data analysis, the software discussed here is unlikely to entirely displace a skilled data analyst, and in a similar way we would not expect an entire displacement of skilled control of any system.
A good example is provided by a safety system in a heavy mechanical setting which could use the ideas presented here to allow a person untrained in the usage of a specific piece of machinery (an attending fire-fighter for example) to obtain at least a basic level of control over it. If they needed to perform a simple task quickly, for example making the main component move in a specific direction to avoid further injury; the unstructured nature of the NL queries, the software's learnt behaviour based on previous usage of the system, flexibility as to the language of the query and concepts analogous to the implicit understanding of time or number (§ 4.1.2) would be beneficial.
A system that allows people to ask complex questions of a large number of datasets enables multiple propositions in a number of areas, such as:
Other examples of applications are the following:
Any data analytics product which aims to provide a full user experience for a non-expert must address the problem of importing data without requiring the user to intervene. Data divides broadly into two categories:
I. Machine structured data—this is generally quite simple to import.
II. Human structured data—this is often much more complex.
We recognise that speed is a key function to provide the user with confidence and insight. For large tables and servers this means importing a sample of the data so a user can identify key insights, then running the full query on external database.
The system interprets machine structured data by considering the likelihood of an individual entity's interpretation based on how well this fits with the interpretation of the other entities in the same column.
Machine structured data is easy to retain in such a format—for example, a CSV file is already a table. The challenge here lies in interpreting the entries.
Software for doing this currently is not very good—for example, Microsoft Excel (Excel) interprets each cell on an individual basis such that for a CSV file containing
. . .
. . .
in the same column, 50A1 and 50B1 are interpreted as strings and 50E1 as 500 (a float).
Also, having done that, it will not allow the conversion back to a “50E1” string. This may seem a contrived example, but this is precisely the problem with data cleansing—such examples require significant time and effort to find and rectify.
The ability to import machine-structured data reliably, and in a highly automated way, has a huge number of further applications across many areas of technology.
We must identify the character set in use. There are hundreds of different encodings of various sets of characters which are used around the world.
Tables to be imported are stored in various formats—the separator between the columns, the number of columns or rows and the presence of a header row are obvious to a human. Current database software often suffers from poor import capabilities, dominated by a single default behaviour (e.g., always assume a header row and “,” separator).
Bethe automatically identifies the size, separator and header row in a table with high accuracy.
In summary, and as is the case with much of the database program, the default behaviour is not a single, defined behaviour, but rather to use the most sensible settings given by evaluation of a set of simple, defined metrics. This provides a significant increase in the ease of importing data, removing a barrier to entry for non-technical users and inconvenience for a technical user.
Using the techniques above other machine readable formats, e.g., JSON, JSONstat, XML can be read in. These formats are generally newer and with a higher level of standardisation.
2.2.1.4 Import of Tables from Machine Databases
A command of the form SELECT * FROM <table name> will return the contents from a database table, which can then be treated in a way very similar to a CSV, with the added benefit that column types and other additional metadata are known.
2.2.1.5 Import of Parts of Tables from Machine Databases
We will often not want to import the entire table from a database due to security, bandwidth or memory constraints.
In the case of security constraints this must be decided by the user. For the other cases, we take a random sample of the database and use this to perform the operations described in the rest of this document. The SQL feedback (§ 4.8) can then be used to control the remote database to perform the ‘final’ query.
Often data is not provided in a static file.
Keeping the data constantly indexed automatically is key for the general user experience.
A streamed data source of a given format can be read in to a series of buffers, as is a normal file. The buffers are then parsed and the values appended to the appropriate columns. If a column is made ready for streaming we have two options:
Queueing of the incoming buffers can be performed using the Javascript front end, which is well designed for handling such structures. To solve the queueing problem in general for an incoming stream of data, a larger database-based solution, for example MongoDB (mongodb.com), is required. Practically, the speed of import into Bethe is faster than a high-speed internet connection and so such import problems will be confined to large organisations where existing system and security issues are likely to eclipse the technical challenge.
Many pieces of data are spread across multiple, similar files—for example the data for each month or for each county are stored separately; usually for reasons of being able to transfer the data and open it in table-size-limited programs. Bethe provides automatic multiple file import capabilities.
To enable the files to be imported, we must provide the following.
In the process of import, errors in the files can be detected and flagged for correction by the user.
Flagging for type errors and spelling errors come as a corollary of the indexing process. We can also consider the distribution of ordinal values to detect outliers.
Potential issues with data based on odd distributions of different columns may be flagged to the user, such as: gaps in an otherwise continues distribution of numbers; very large outliers; combinations of two different entity formats. Also relevant to providing suggestions.
The system provides the functionality required so that the user can obtain the results they require from the database.
A database must perform a few basic functions.
Perform simple queries
SELECT x FROM y WHERE z GROUP BY a ORDER BY b
We select columns x from table y where conditions z are satisfied. We group by a and finally order as specified by b. When we find an entry or number of entries satisfying z we want it/them to be found quickly—if there are 1 million entries it takes time to look through them all. There are various ways around this—we discuss ours in § 3.1.1. The same applies to grouping, which is discussed in § 3.1.2.
What is Davidson's salary? For Davidson, we look up Chris and then find Chris's salary. This is joining the two tables. However, often the tables are larger, more complex, more numerous (and chains of joins through the tables are required), do not match exactly, have duplication/ambiguity. Setting up the joining in those cases has typically been the preserve of data scientists—we give our solution in § 3.3.1.
In order to achieve the required user experience, queries must be as fast as possible. Two techniques are used to speed the interaction of the database with the data.
We use a hash-table index to allow fast access to individual records or groups of records.
The method is:
For a hashtable in general, step IV is not always the case and the realisation that a hashtable such that this is true is needed is important. The innovation in Step I is discussed in § 3.2.
We use a combination of a dictionary and a hash-table. The hash-table points to a series of elements which are represented in the column by integers (of 8, 16 or 32 bit length). If the column contains an equal mixture (say) of “YES” and “NO”, these have an average size of 20 bits, but can be represented by a column of 8 bit integers (this being the lowest easily representable number of bytes; an x byte integer can represent up to 2x possibilities, e.g., 256 for an 8-byte integer). This represents compression by a factor of 2.5. The hashtable is then used to point at both the dictionary and the column values.
Compression for in memory and on disk storage using the dictionary and hash-table. Also increases read-in speed from the proprietary format and speed for aggregation. (§ 3.3.1)
This allows a very fast check for existence, which is required for the natural language to be effective (NL, § 4).
Such compression is therefore beneficial to all aspects of the code. This feature is not shared by the more sophisticated, high-compression algorithms used in e.g. bzip, gzip etc.
For testing and demonstration purposes, the database has a simple SQL interface, allowing SELECTing a number of columns FROM a table, WHERE conditions are met, GROUPed BY a number of columns and ORDERed BY others. The language used is a subset with slight syntactic differences from the SQL standard.
Compound Searches
A method for enabling compound searches provides an order of filtering and searches are based on how the database is indexed.
For example, IN Lincoln AND price >100000. There are far fewer houses sold in Lincoln than below 100,000 and so it is optimal to apply that filter first (using the indexing) and then apply the second filter on the result. The judgement can be made since the number of elements of each type in many columns is known from the indexing and dictionary process, and the number of numbers to be returned can again be determined from indexing or a knowledge of the distribution.
The search implemented here is as quick or (from testing), quicker than any other well indexed database.
We must also be able to group data according to common features. A method for grouping uses the dictionary and re-indexing on the concatenation of column values.
This is either performed using a compressed representation of the dictionary onto the integers—for example, for three columns with dictionary sizes L, M and N, and dictionary numbers 1, m and n; we can use lMN+mN+n to map uniquely to an integer between 0 and LMN. For high cardinality columns, the database can also use a re-indexing on the concatenation of the column values.
Standard database functionality must be provided, here the advance comes in the choice of the methods.
Databases must be able to perform “simple” queries, in SQL notation: SELECT x FROM y WHERE z GROUP BY a ORDER BY b. We select columns x from table y where conditions z are satisfied. We group by a and finally order by b. All database languages must provide this functionality and it covers many common queries.
The Bethe database provides this functionality, with methods applied to add ease and speed for the user. We try to hide as much of the technical running of the database from the user as possible.
The solution in Bethe is to index using a string where as much information has been removed as possible while keeping it recognisable. It is far faster to find a few candidate strings (and is unlikely in most data sets due to the redundancy of language and the precision of codified values) and then check them for equality. Therefore the indexing and the searching is performed on severely de-informationed strings. In this case both “ST. HELEN'S” and “st helens” are indexed (and looked up) as “sthelens” and so the match is trivial. Another example is ile-de-France which is indexed by “iledefrance”-note the lack of the circumflex. Bethe will pick a single “ile-de-France” out of a list as fast as it picks “Ile-deFrance” or “ile de france” out and has been tested to do this in near-constant time on columns with tens of millions of rows.
With suitable internationalisation and transliteration this also allows a significant cross-character-set, e.g. simplified Chinese to Pinyin capability.
Hence, indexes of each column are stored in a reduced form such that lookups are of constant time and are approximate by default (e.g. “Ile de France” and “ile-de-france” are stored as the same entity. This fuzzy indexing enables the fuzzy joining described in the following subsections.
This allows the database to pull out a small number of candidate variables (which can then also be displayed to the user for ‘autocomplete’ or disambiguation) and then check for equality rather than look for the exact variable straight away.
Standard database functionality must be provided, here the advance comes in the choice of the methods.
Databases must be able to perform joins—this is where a column or columns of a table are used to link two tables together. Take, for example, a table with columns “employee” and “salary” and a separate table with columns “employee” and “department”; these could be joined and aggregated to produce the total pay in each department by joining on the employee column.
Joining is usually performed by explicit instruction from the user, however this is unsuited to the use case for our product—it scores highly on a metric of flexibility and exactness, but is completely outside of the capabilities of an average business user. While in the case of simple queries the correspondence between sentences and SQL is generally good, this is not the case here:
“Houses in Nottinghamshire for which the average price in their town is greater than the average price in Lincolnshire”
is conceptually simple, and can be executed by our application, yet is complex to write in SQL:
SELECT * FROM price_paid INNER JOIN (SELECT town, AVG(price) FROM price_paid WHERE county=“Nottinghamshire” GROUP BY town HAVING AVG(price) > (SELECT AVG(price) FROM price_paid WHERE county=“Lincolnshire”)) AS b ON price_paid.town=b.town WHERE county=“Nottinghamshire”
(where the SQL table name is price_paid).
Making a query with joins in is complex and involves a good understanding of database semantics.
Bethe provides joins by performing them automatically. This is done by a route-finding algorithm.
The fuzzy joining algorithm uses a combination of global and local variables to optimise across entity, column and dataset. It consists of continual refinement of the best guess join.
Fuzzy joining on-the-fly allows data to be stored in an unclean form. This means no information is lost, and a single source of truth is maintained.
Fuzzy joining allows the system to find dataset relationships upfront and improve the depth of insight offered up to users without their involvement more efficiently than if only exact joining were available.
Approximate joining between datasets is highly optimised and scalable. If new datasets are added the user can quickly be informed of possible joins to the existing datasets. By always storing the shortest length between two datasets, as is required anyway by standard algorithms for the matching through multiple joins, this extends trivially to multiple joins, as described in the following section.
Given various tables, we must find routes through them which allow the tables to be joined.
Due to the number of tables being relatively small, this is solved through a dynamic programming method. The overhead involved in finding the possible join between two tables is large and so this is stored when found. We also store the shortest route between any pair of tables—this allows an incremental change in the join (e.g., by the addition of another table) to be effected quickly.
Joins are found by performing a fuzzy match between two columns or two groups of columns. This match includes
Geography uses similar techniques to this, however with some more preparation of the geography database having been performed.
These factors are weighted together to provide a metric of the goodness of the join; and then the best join is made; or in the case of multiple chained joins the join with the best product of metrics for the different joins.
Fuzzy joining, and, though fuzzy joining a column to itself, fuzzy aggregation, is key functionality in a product trying to remove the burden of data cleaning.
This is technically straightforward.
The only challenge in this is speed, which is naturally limited by the complexity of the operation. For example, performing an exponential moving average on time series data might involve a pre-sort of the data, or pulling apart a string will take longer in UTF. In a similar way to the searching this is optimised as much as possible.
Most databases are disk-based; in-memory databases are gaining popularity due to the increase in speed and the increase in availability of high memory machines. This increase is nullified if swapping or hard faulting occurs.
In Bethe:
The database caches full queries without any limiting of the number of entries returned from the table to allow fluid browsing of results (e.g. a data table view where the user scrolls down).
Bethe implements caching of the results of queries and what the queries were.
This is used to:
The cache has a specified size (e.g., 5 elements) and is emptied in order of when the element of the cache was last used. In web-based applications this capability is supplemented by the caching on the reverse proxy server, which does not offer the contextualisation here, merely returning results of exactly identical queries to those already made.
Natural Language (NL): Take a string of words and provide a query of the database corresponding closely to the users intentions for the session interacting with the database. The key to this technology is the recognition that this is not the same as providing the best possible parse of a given input, rather we want a parse which robustly maps to a valid query, is non-pedantic, is well justified/explained and provides a solid jumping off point for future queries.
Our metric is therefore significantly different to that used in much of current NL research.
Feedback: This provides a key component to the NL and to the working of the product as a whole (though, since the NL is the primary means of interaction for the user with the database, much of the feedback centres around that). As discussed in § 1, a mainstay of the method is to make high-accuracy guesses as to what the user requires. Feedback allows the user to confirm or reject the assumptions the app has made allowing erroneous assumptions to be corrected in the short term and learning to occur in the long term.
The broad program is to firstly tokenise the sentence, then to find the query best corresponding to the sentence and perform it. However, at each step the inclusion of feedback allows us to move backwards and forwards through this process.
We use a series of helper parsers to provide specialised handling of named entities, operators and dates. This tokenises a sentence:
“Average price by town in Lincolnshire”
quickly to:
[operator] [column—numeric] by [column—word] in [element—county]
This is where the database technology is enabling—we must have a very responsive database to allow elements to be identified in a high cardinality dataset and we must identify them in a fuzzy way to avoid pedantry.
The helper parsers which handle in addition operators, dates, numbers and geographies are modified for differing locales.
We provide a consistent experience in that the same parsers are used in the database as in the NL. This means that if a date is recognised in the table then it will be recognised in the NL.
This may sound obvious, but in many programs the parsers used are different for different parts of the program—this leads to an inconsistent experience—for example a date is recognised in the table, but can then not be interacted with, or as discussed with the fuzzy matching, a foreign word is rendered, but can then not be searched for.
We must solve the problem of recognising what is an operator, column, record, date, etc. We assign a weighting to each word and phrase in the sentence as follows. Say we have two phrases A and B with lengths a and b. We begin with two numbers, W1=1 and W2=0. Consider each word in A:
I. Does it match a word in B?
II. Is it after the previous matched word?
III. How important is it that it is matched?
III is determined using a weighting w=exp(−[f+g]), based on the fractional frequency of the word in the Brown corpus, f, and the frequency of the word in the table itself, g. The value assigned to a word if it is present is v=(1+w) and v=1/(1+w) if not. If w>1, W2 increments by 1, and W1 is multiplied by v in all cases.
The output is then W×W2/[max(a, b)×(1+0.03ab)].
We emphasise that the choice of the formula and parameters here is determined from testing, seeing that sensible behaviour is produced.
Other uses of this scheme would include tokenisation in many other fields. Other people have solved this problem in various ways in the past; the value here would likely come if people became used to this being the “usual” method of tokenisation and so modified their behaviour.
2.4.4 Decide which Entities to Start with
Using the methods in § 4.3, we have given ratings to the entities in the sentence. We begin with the highest rated token, which is accepted. If that token overlaps with any lower-rated tokens, they are ignored. We then continue this process with the next highest rated, still existing, token. We then inform the UI of which tokens have been accepted, which have been ignored and their positions.
If an entity is initially wrongly named it will lead to either failure of the parse, or that the overall parse takes longer due to having to catch the error at a later, more complex stage. Allowing the user to see and interact with the named entities is therefore a first line of defense against an unsuccessful parse.
Here, again the database is enabling technology in that the named entities must be quickly interacted with by the user. The advance is in this speed of interaction combined with a good UX.
The sentence is tokenised by the NL routine so that individual words and phrases are recognised as being related to the table or operators and other database commands. Each phrase is given a rating by § 4.3 and the most likely initial parse is produced by § 4.4. The decisions made by the NL are then encoded (as 0 for not-accepted, 1 for accepted), passed to the front-end and displayed to the user, when 3 interactions may occur:
I causes no change; II raises the weighting assigned to that word, in the current parse (to a value higher than any weighting the NL can give), in the parses in this session referring to that table (by a small factor ˜1.1), and in its parses in future (by an even smaller factor ˜1.01); III does the opposite to II, with higher weights (see below); and IV causes a re-parse of the entire sentence using the new text.
Note how IV is given a different flag and causes no change in the weightings. This is because of logical inference—the parser is using a sentence and parsing technique S to imply a parse P; S⇒P. The inverse of this is, contrary to popular expectation, PS (not P—S), which means only rejections of the parse are useful for learning. II goes against this and hence is included so that learning can still be achieved in situations where rejection may be uncommon—this is intentionally designed to mimic human reinforcement learning and make the application seem more friendly.
Autocomplete is implemented throughout software applications on desktop and mobile. We use it here to show the user quickly the options available.
Our implementation of autocomplete is necessarily similar to others. However, in our particular case there are various additional advances, challenges and points of interest.
We implement a basic autocomplete functionality as the user types. This has some special features:
Much of people's interaction is made up of sentences which are agrammatical, vague, assume knowledge the computer cannot have or are technically structured. The parser must provide the best parse of such sentences in a way which maximises our metrics.
Current NL techniques are tested against a fundamentally different metric, namely the reproduction of parts-of-speech (POS) tags and dependency trees (DT) of a known, annotated corpus. These parsers are poorly suited to understanding queries in our typical input form—in the wider world, this is seen through ineffectiveness in parsing Twitter text, and represents a known unknown in the field.
Our parser uses techniques which are not seen in other parsers.
2.4.6.1 Performing the Conversion of the Sentence into a Query
We use a probabilistic, rule-based parsing to decide what query a sentence should be parsed to. In the example “Average price by town in Lincolnshire”, the interpretation
[operator] [column—numeric] by [column—word] in [element—county]
AVG(price) WHERE county=“lincolnshire” GROUP BY town
is assigned a high probability. The rules are:
The technology used here could be applied to any problem where the form of the output can be constrained. In this case that we are not “using NL to understand” a sentence but rather “using NL to translate the sentence into a plausible database query” is what makes the problem tractable. Examples might be:
I. Interacting other less ‘obvious’ databases (such as document management systems).
II. Controlling machinery or other systems with a known set of outputs.
The system is configured to run and store an interim table for each step of a query execution. This allows audit playback by the end-user and easy re-running based on tweaks.
When the user hits Enter and sends the query, we think we know (through tokenisation and feedback) what all the words are, and have then converted the sentence into a query. We can then provide feedback on how we have interpreted the query.
We provide a template interpretation of the extended SQL:
NL: “Houses in Nottinghamshire for which the average price in their town is greater than the average price in Lincolnshire”
Extended SQL: SELECT * FROM price_paid WHERE [AVG(price), town FROM price_paid GROUP BY town]>[AVG(price), county FROM price_paid WHERE county=“Lincolnshire” ] AND county=“Nottinghamshire”
Templated return: First we find the average price (and county since its filtered by) where county is Lincolnshire—we store this as table 1. Then we find the average price (and town since its grouped by) grouped by town—we store this as table 2.
This provides the following solutions:
I. It allows us to show the user what has been performed.
II. It allows the user to modify what has been performed.
III. It allows us to train the user to use recognised terms.
Once the query has been performed, the result is output visually and the method is described verbally.
We can simply expand our “Extended SQL” after the query has been performed and the joins have been seen to be correct for use by the user in a separate, conventional database program.
All of the technology here can be applied to voice-based commands, provided that the voice/writing text capability is sufficiently good.
Traditionally the accuracy of the voice/writing to text is a source of frustration, however in our case the information required by the named entities parser and the high speed of the database allows a “dictionary” to be provided to OCR/voice software which increases the accuracy of the parse.
Some features of the database or how the database is used in the code enable the suggestions.
Ensuring speed of experience is maintained even when datasets are large within the database. This includes:
When the user makes a query, Bethe suggests further queries based on that query.
This is performed by:
Suggestions are generated continuously—at any time, the UI can ask for the return of one or more suggestions from an updating list. This is limited by the wishes of the user.
By identifying named entities relating to the dataset or database in a user's input, the system can generate SQL queries which include all or some of these entities, with the possible addition of other entities. The order which the user receives the results is weighted based on metrics including the distribution of the results; metrics of statistical relevance; previous search results both by the user and the total user population; and proximity to the inferred intent of the user.
The more entities the user provides the more specific is the set of options for how these elements can combine in an SQL query. This helps the user very easily narrow down to a specific chart they need with minimal input and at speed.
The system can provide SQL queries which contain only the entities the user has requested, and also provide queries where all or some of the elements provided are included alongside others.
Results can be provided without any user input (i.e. without any user interaction with the input bar at all), or with non-tokenisable input (i.e. zero entities are inputted—an empty bar, random text, “Hello, how are you?” etc.) giving the illusion of the user browsing all possible manipulations of the dataset.
The immediate provision of results from zero or partial input, along with the NL interpretations provided as titles of the graphs allows a user to be vicariously taught to use the product, by aping the examples provided, at their own speed and without formal training.
Interface design hinges on providing end output first and then letting user tweak the early stages of the analytical process to a more precise output iteratively.
An example of what could be learnt is that Usually after grouping on a column synonymous to price and containing integer or float values a grouping by a low cardinality column which is not a geography is common.
The process of constructing the metric. The metric is formed from information including, but not restricted to:
Each of the subsections above provides a number of metrics, which together form a list {xi}. Some function f({xi}) (lower is better) is used to provide an overall metric on each suggestion which is then used to rank them for return to the user. A process such as f. and g. above is used to quickly approximate {xi} and hence f({xi}), particularly in cases where the value of f({xi} can be lower-bounded and so a suggestion can be shown to be low-ranked and therefore irrelevant.
The function f({xi}) can be approximated by simple choices of parameters which make the ranking good enough to be refined by user testing. The list {xi} can also be used as the input for a machine learning algorithm, this being a classic neural network problem, the neural network being a continuously refined definition of f( ). The initial approximation to f({xi}) being used overcomes the need for enormous amounts of training data before any reasonable results are obtained.
There are a large number of queries which could be produced, since we a sampling from a large space of possible queries on the database. Even restricting to simple SQL queries, this set would be far too large to compute f({ }) in each case. We therefore create the possible queries by moving randomly away from the query believed to best express the intent of the user.
The possible movements can be enumerated and are not limited to:
a. By using only the tokenised entities in the query,
b. By using the tokenised entities and adding to them before making the query,
c. By adding and removing entities before making the query,
d. By moving the entities around within the query.
This includes through b. creation of queries from the entry of no tokenisable entities. These define the radius (c.) around the query's intent.
By way of example, in a dataset containing price of an object and the town and region in which it was bought: ‘price’ with a filter of region=X can provide
The continuous involvement of the user in defining the intent, the specificity of their intent and, in refining the local definition of f({ }) enables the provision of suggestions to the user in real-time as their intent and understanding of the dataset is better understood.
When a raw dataset is imported into the database, the database ingests the raw dataset and stores it as contiguous streams in memory as a column-based database with an index residing on top of it, as described in the sections above. The indexing allows an enhanced user experience by enabling a fast access to individual records or group of records within the dataset.
As illustrated in
The suggestions that are displayed take into account the query context and dataset context. As an example, the suggestions may be based on the dataset content such as column entries or individual entries, headings, functions (e.g minimum, maximum, rank) and English words. The suggestions may also be generated from the knowledge stored on previous end-user interaction with the system such as end-user search queries (64). The suggestions are then ranked and the top (most relevant) suggestions are displayed on the dropdown menu. The system loops (63) until the end-user finishes typing (65) or clicks on a suggestion that is being displayed. The system continuously learns and updates the database, context and its knowledge as the user interacts with it.
As an example, an end-user may type “average price in Munchester” (i.e. a misspelling of Manchester) and the dropdown menu may display “average price in Munchester (town), average price in Munchester (district), average price in Manchester (town), average price in Manchester (district)”, the end-user may then choose to select “average price in Manchester (town)”.
As another example, an end-user may type “average price by town in Not” the dropdown menu may display “average price by town in Nottingham, Average price by town in Nottinghamshire, Average price by town in not”, the end-user may then choose to select “average price by town in Nottinghamshire”.
With reference to
With reference to
The steps presented in these diagrams have been simplified for illustration purposes and do not necessarily represent an exhaustive list of steps performed by the system.
The answers displayed to the end-user and the metrics used to ranking the answers are also continuously updated and improved with user feedback. The system may also return a stream of answers. Initially a finite number of answers may be displayed (limited by the amount of information which can be/should be displayed on the screen). These Answers may be categorised into one or more streams, which may be ordered by the probability/weight assigned to each answer.
As the metric of success for the application is to make the exploration of data as easy as possible for as many people as possible (§ 1.1), the user interface (UI) must allow non-technical users to use the product without the need for training. This precludes the use of complex menus and contrived drag-and-drop interfaces as seen in competitor products.
Examples of user interfaces are now given.
The import page allows the user to bring a source dataset into the application painlessly and see an overview of it quickly.
With reference to
Each source may contain one or more tables, which can be explored by clicking on the corresponding source. Tables can also quickly be previewed in full.
Each table may contain one or more columns, which can be seen by clicking on the corresponding table. Here the user can check and edit column names and types and quickly inspect the distribution of each column. Any errors and/or inconsistencies can then be rectified in the edit page (§ 5.2).
The edit page allows the user to manipulate data within a (single) table and apply global filters using natural language.
There are two primary views in the edit page: a table view and column view. The table view allows the user to see all rows and columns in the table quickly, and sort by any column.
With reference to
With reference to
With reference to
The explore page allows the user to query the data using natural language, with rich visuals produced directly from NL statements. A user may select to explore one, several or all of the tables created from the imported sources (§ 5.1). For cases where more than one table is selected, the application may automatically join tables (depending on the exact query), as described in § 3.3.
With reference to
The query processing routines detailed in § 4 underpin the search capability.
The types of visual produced may include, but are not limited to:
A method by which we infer the type of visual to use is detailed in § 5.5.
An interpretation allows the user to check that the computer has understood the query correctly and may take one of the following forms:
Users are able to switch quickly between these options.
Typically a user of a data analytics product either does not know the contents of the dataset or is unaware of what the most interesting features of it are. Competitor products do little to help with this, beyond providing a table view that allows the user to explore the data through simple inspection.
Making the “null query” results returned valuable may be dependent for example on learnt context from other datasets and stored ‘state’ of previous sessions. Not only is the homepage auto populated with content, the elements of the template are also selected based on the dataset itself—we are doing both content generation and template creation—it is an automated version of, for example, the BBC homepage—where a news stories is generated and presented in such a way to reflect the users' interests and the quality or relevance of answers.
While this is possible in our application, the user is also helped in exploring the data by the automatic provision of charts, both:
With reference to
Using a combination of rule-based and learnt behaviour, we provide:
Any of these charts can be clicked on by the user and explored more fully. Each chart is presented with a succinct NL description so that it can be quickly understood by the user.
Note that this functionality overlaps and interfaces with that described in § 4.10.
Let's say we have a table containing columns of types: i) numerical (corresponding to integer or float types); ii) categorical (category or range); iii) geography; and iv) date/time, which are inferred on import. A NL query is submitted by the user, which is parsed using the methods in § 4 and data is correspondingly returned from the database (§ 3), which itself contains columns of these four types. The way in which the data is displayed is determined as follows:
For example, say we have a table with columns “price” (numerical), “product” (categorical), “category” (categorical) and “date” (date/time). The queries
Although I-XII represent the default types of visual produced, the user may override these.
A key aspect of the experience is consistency. If an entry in a table can be parsed for example “5 Mar. 2016” is recognised as the date 05/03/2016—it must also be parsed if encountered in another dataset or in any other interaction of the user with the product. In our code the same parsers are used throughout, enabling this experience. This is not possible if an integrated interface and database is not used.
Date is a key concept, with applications in various fields. (For example, Python's dateutils library, which has a subset of the functionality of our dateparser and also does not use any of the innovations discussed below has been downloaded 40 million times: pypi-ranking.info/alltime.) The methods described below are not just an extension of existing techniques (e.g. to more periods) but the methods used for inference of ranges, repetition and financial concepts appear to be novel.
Our dateparser takes as an input either a string or a list of strings. It outputs the most likely dates represented in those strings.
In the case of a single string, we first tokenise it:
For example, “11 March 16”→{Y, D, H, T, S} m {Y, D, H, T, S}.
Alternatively, there could be a range of dates. There remains an ambiguity, however, in that the day and year is not recognised. We assign a probability based on:
This section summarises the most important high-level features (A->T); an implementation of the invention may include one or more of these high-level features, or any combination of any of these. Note that each high-level feature is therefore potentially a stand-alone invention and may be combined with any one or more other high-level feature or features or any of the ‘optional’ features; the actual invention defined in this particular specification is however defined by the appended claims.
A computer-implemented method of querying a source dataset, in which:
(i) a user provides a query to a dataset querying system; and
(ii) the system automatically processes simultaneously and/or in a linked manner both the dataset and the query, so that processing the query influences the processing of the dataset, and/or processing the dataset influences the processing of the query.
B. Answering the User's Intent, Rather than the Precise Query Asked
A computer-implemented method of querying a source dataset, in which:
(i) a user provides a query to a dataset querying system; and
(ii) the system automatically processes the query and the dataset to derive a probabilistic inference of the intent behind the query.
C. Iterative Process where Every Inferred Intent Generates Multiple Possible Answers Based Around that Intent
A computer-implemented method of querying a source dataset, in which:
(i) a user provides a query to a dataset querying system; and
(ii) the system automatically processes the query and the dataset and dynamically generates a series of relevance-ranked attempts to answer that query or to infer the intent behind the query as an initial processing step; and
(iii) the user further expresses their intent by interacting with the relevance-ranked attempts to answer that query (e.g. enters a modified query, selects a part of a graph) and the system then iteratively improves or varies how it initially processed the query and the dataset, as opposed to processing in a manner unrelated to the initial processing step, to dynamically generate and display further relevance-ranked attempts to answer that query, to enable the user to iteratively explore the dataset or reach a useful answer.
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system, and the query is processed by an interpreter that derives a probabilistic inference of intent, or interpretation, of the query.
Optional features in an implementation of the invention include any one or more of the following:
A computer-implemented method of querying a source dataset, in which an interpreter creates, uses or stores a ‘dataset context’ when it cleans the source dataset to generate the cleaned, structured dataset, the dataset context being the information applied to the source dataset or extracted from it, when the source dataset is cleaned.
Optional features in an implementation of the invention may include the following:
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and in which an interpreter creates, uses or stores a ‘query context’ when it analyses the query, the query context being the information applied to the query or extracted from it, when the query is translated to generate a structured query.
Optional features in an implementation of the invention include any one or more of the following:
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and in which a query (i) triggers joining across multiple source datasets and (ii) the dynamic creation of a different database or dataset using data from the joined source datasets, that different database or dataset being analysed to generate one or more answers to the query.
Optional features in an implementation of the invention include any one or more of the following:
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system, and the query is processed by an interpreter that derives a probabilistic inference of intent, or interpretation, of the query, and in which the interpreter generates a series of probability ranked structured datasets.
Optional features in an implementation of the invention include any one or more of the following:
The interpreter, when it receives a query, generates a series of a ranked set of {structured dataset, structured query, context, answer}, each set being assigned a probability and each set being specific to a computer-generated inference of the intent behind that query.
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and an interpreter dynamically manipulates the dataset in response to the query.
Optional features in an implementation of the invention include any one or more of the following:
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system, and the query is processed by an interpreter that infers or predicts properties of the result of the query before using the dataset or a database derived from the dataset.
Optional features in an implementation of the invention include any one or more of the following
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and any query, precise or imprecise, always results in an answer being presented to the user, even if it is a very low probability answer.
L. Multiple Answers Enable a User to Try More Meaningful Queries A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and in which the query results in a number or a range of answers being presented to the user, enabling the user to understand the data or metadata in the source dataset, or the content or structure of the source dataset, and to hence try a more meaningful query.
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and in which the degree of inaccuracy or imprecision of the query is assessed and the breadth of the search results or answers that are returned to the user is a function of this degree of inaccuracy or imprecision.
An optional feature in an implementation of the invention includes the following.
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and in which an interpreter automatically generates and displays a home or summary page from the dataset in response to a null query, and that home or summary page includes multiple charts or graphics or other information types.
Optional features in an implementation of the invention include any one or more of the following:
Computer implemented method for use in a database in which a database holds data from the source dataset, and the database is automatically structured to handle imprecise data queries and imprecise datasets.
Optional features in an implementation of the invention include any one or more of the following:
A computer-implemented method of querying a source dataset, in which a user provides a query to a dataset querying system and in which the query is tokenized and the entities in the query are each assigned a weighting based on the entity relationship with the dataset.
Computer implemented method for processing a string, or list of strings, in which a date parser takes as an input a string, or list of strings, converts it to the most likely date or dates represented in those strings and outputs a date information.
Optional features in an implementation of the invention include any one or more of the following:
A computer-implemented data query system that includes an interpreter or probabilistic interpreter as defined above; a database as defined above; and a user interface to accept queries and display answers, such as charts, maps or other information, operating as an integrated system, where the system automatically processes (e.g. cleans) the dataset and processes (e.g. translates) the query simultaneously or in a linked manner, so that processing the query influences the processing of the dataset, and/or processing the dataset influences the processing of the query.
Optional features in an implementation of the invention include any one or more of the following.
A computer-implemented data query system that includes an interpreter or probabilistic interpreter as defined above; a database as defined above; and a user interface to accept queries and display answers, such as charts, maps or other information, operating as an integrated system, where the system automatically processes (e.g. cleans) the dataset and processes (e.g. translates) the query simultaneously or in a linked manner, so that processing the query influences the processing of the dataset, and/or processing the dataset influences the processing of the query, in which the properties of the system can be recorded and stored, and in which the properties of the system (or ‘state’) can include any combination of one or more of the following: query, structured query, raw dataset, cleaned dataset, interpreter, dataset context, query context, answer or user behaviour.
Optional features in an implementation of the invention include any one or more of the following
The method or system is used as part of a web search process and the imprecise raw datasets are WWW web pages.
The method or system is used as part of a web search process and the imprecise raw datasets are the entirety of indexed WWW web pages.
The method or system is used as an IOT query or analysis system, and the imprecise raw datasets are the data generated by multiple IOT devices.
The method or system is used as an IOT query or analysis system, and the imprecise raw datasets are the data generated by multiple IOT devices using different metadata or labelling structures that attribute meaning to the data generated by the IOT devices.
The method or system is used as part of a web search process that serves answers and relevant advertising to an end-user in response to a query
The method or system is used to query property related data and the system joins the datasets from multiple sources, such as government land registry, estate agents, schools, restaurants, mapping, demographic, or any other source with data of interest to a house purchaser or renter
The method or system is used to query flights or travel data and the system joins the flight timetables and prices from various airlines and OTAs
The method or system is used to query entertainment data. The method of any preceding claim, when used to query restaurant data
The method or system is used to query hotel data.
The method or system is used to query financial data.
The method or system is used to query personal financial data.
The method or system is used to query sensitive personal data.
The method or system is used to query retail data.
The method or system is used to query customer data.
The method or system is used to query manufacturing data.
The method or system is used to query property data.
The method or system is used to query company accounts.
The method or system is used to query sensitive health data.
The method or system is used to query any business, operational, personal, geographical data, or any other kind of data.
The method or system is used to create a valuation for a dataset.
U. Key Dependent Claims that are Common Across Multiple High Level Features
Optional features in an implementation of the invention may include any one or more of the following:
It is to be understood that the above-referenced arrangements are only illustrative of the application for the principles of the present invention. Numerous modifications and alternative arrangements can be devised without departing from the spirit and scope of the present invention. While the present invention has been shown in the drawings and fully described above with particularity and detail in connection with what is presently deemed to be the most practical and preferred example(s) of the invention, it will be apparent to those of ordinary skill in the art that numerous modifications can be made without departing from the principles and concepts of the invention as set forth herein.
| Number | Date | Country | Kind |
|---|---|---|---|
| 1702216.1 | Feb 2017 | GB | national |
| 1702217.9 | Feb 2017 | GB | national |
| 1715083.0 | Sep 2017 | GB | national |
| 1715087.1 | Sep 2017 | GB | national |
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/GB2018/050380 | 2/12/2018 | WO | 00 |