This application claims priority to GB Application No. GB 2004231.3, filed on Mar. 24, 2020, the entire contents of which being fully incorporated herein by reference.
The invention relates to a database query and visualisation system; it enables a user to query a database, for example by inputting a natural language query into a user interface, and to receive a response displayed as a table, graph or other visual representation.
A typical, conventional data exploration tool has several drawbacks: it can take 20 hours or more to learn how to correctly use the tool—e.g. how to even correctly express a query; there are usually arbitrary limitations to the questions that can be asked; once a correctly formulated question has been asked, the tool can take 5 minutes to answer even a simple question.
The arbitrary limitations stem from the existing tools not making questions which are simple to ask, simple to compute an answer to. This makes sense in the context of the historical data analytics workflow, which has involved manipulating the data (pre-aggregating and feature engineering for example) to make this the case. This works well on traditional use cases, like a single, well-structured company database. However it does not work when the data is being bought in either from sources which have multiple uses (and so where pre-aggregation is unsuitable) or on the fly, where a data preparation workflow would be inappropriate.
Both the behaviour of notebooks in general (notebooks are already a known and well used concept in data analytics, for example Jupyter, Observable) and of semi-structured natural language or SQL to a database query in general (for an example open source implementation: ln2sql) have been implemented and described previously in the commercial, academic, open source and patenting literature; however both represent huge areas with large ranges of structural and implementation choices which can be made, particularly the latter case which is a problem that is known for being unsolved in the general case and where the limitations of popular implementations in e.g. home assistants are well documented.
There is a need for a database query and visualisation system that:
The invention is a database query and visualisation system that is configured (i) to provide an interface with which a user can specify one or more queries; (ii) to generate a response to each query which is displayed as a table, graph or other visual representation; and in which the system is configured to process each query as a layer, where a layer is a node in a directed graph, and a set of nodes forms a notebook.
In one implementation, the notebook is therefore made up of one or more layers; each layer generally corresponds to a step in the query. The notebook can be presented vertically down a scrollable page or as a dashboard. A query in a layer is represented as a combination of some or all of:
A query in a layer is then transformed by the system into an exact, unambiguous, query in a database query language, such as SQL.
This section contains a description of the workings of the core data exploration system used in one implementation of the invention, called the Count® system.
The aims of the Count data exploration system
The aims of the Count data exploration system experience are:
All of these are currently problems in people's data workflow with conventional data exploration systems. A typical data exploration tool can take 5 minutes to answer a simple question and 20 hours to learn from scratch—we are aiming to decrease both of these numbers by at least a factor of 10. As noted earlier, the arbitrary limitations stem from the existing tools not making things which are simple to ask, simple to do; this makes sense in the context of the historical data analytics workflow which has involved manipulating the data (pre-aggregating and feature engineering for example) to make this the case. This works well on traditional use cases like a single, well-structured company database, however it does not work when the data is being bought in either from sources which have multiple uses (and so where pre-aggregation is unsuitable) or on the fly, where a data preparation workflow would be inappropriate.
The user's interaction with the Count data exploration system centres around a reactive notebook (see Appendix D—discussed in more detail below)—this is a series of layers, each one of which generally corresponds to a step in their query and which can reference each other to answer increasingly complex questions. This provides a differentiation in behaviour from traditional products which do not allow this iterative approach.
Both in the interaction and in the implementation, there is a natural division between the behaviour of an individual layer and the behaviour of the notebook as a whole system.
The behaviour of an individual layer currently centres around a semi-free text input bar. Here the user can use either exactly written SQL, a mixture of SQL expressions and free text input, or free text input alone to specify their query, and they can reference things from the data they have bought in or in layers they have created further up the notebook. The response to the query, or the reason for an error, is then displayed to the user as a table, and possibly a graph, the behaviour of the layer is played back to them, and further queries are suggested. The way the layers, if they use or affect the result of another layer, behave with respect to each other is handled at a notebook level.
The development of this experience is driven by both the changing demographic of people analysing data and by the changing use cases due to the development of different data sources and use cases.
Both the behaviour of notebooks in general (notebooks are already a known and well used concept in data analytics, for example Jupyter, Observable) and of semi-structured natural language or SQL to a database query in general (for an example open source implementation: ln2sql) have been implemented and described previously in the commercial, academic, open source and patenting literature. However, both represent huge areas with large ranges of structural and implementation choices that can be made, particularly the latter which is a problem known for being unsolved in the general case and where the limitations of popular implementations in e.g. home assistants are well documented.
It appears that in other prior art realisations in this area, the work on notebook technology and on the technology used in a layer are disjoint. The Count data exploration tool looks, inter alia, at the interface between the notebook structure and the layer structure, by:
We begin with an introductory example to give some context to what is subsequently discussed in more detail.
Following that, we discuss the system from high to low level, at each stage describing the general working or flow of the system and giving details of the specific implementation we are using in the current experience.
The example is a demonstration of the workflow of the experience to provide some context for the description below.
We are looking at some data from the record of orders in a home and office supplies company (they have recorded the individual sales—the date of the order, the profit etc.); we are aiming to find out the spread in profits by the mode of shipping. The table contains columns including ‘Profit’ (a number) and ‘Ship Mode’ (a string).
Two trees are created—one has layers a and e in it; a being before e; and the other has b, c and d where b and c are used together to make d, the answer we are looking for, a and e are other bits of analysis we are doing around our main piece of work.
See Appendix D for the screen-grab of this example from a prototypal implementation.
Some key features are immediately visible:
Key Themes
Here, we highlight some important themes to help provide context and direction for the coming sections.
The Context of the Experience
The experience sits between the user and the source of data they are querying. This section describes what we are assuming about the system the experience is connected to. The assumptions are relatively minimal and therefore lead to the experience being applicable to a wide range of systems, including all standard database systems. We assume that the experience can be connected to a database. The database is assumed to have a deterministic interface, which will take an instruction and perform that manipulation on the underlying data returning the result if requested. The core experience is agnostic to the execution flow used within the database or nature of the language or format of the interface.
Specifically, we assume that we know the form of the output of the database for a given input, though we may not know the exact value of the data returned, the values within the tables. In relational databases this is commonly referred to as the database or table schema—the number of columns in the tables, sometimes the number of rows, and the columns' names and types along with the connections between the tables, and columns in different tables which are equivalent, these relationships are defined by keys, and the use of those keys to bring tables together by selecting rows with the same values of the key in the two tables are called joins.
If we refer to everything in the database—tables, columns and the entries in the database—as entities, a key point is that in the vast majority of practical cases the number of entries and the size of the information needed to be either processed, communicated or otherwise manipulated to find out about them vastly dwarfs the similar things for the schema.
We assume that the initial data has been loaded into the database, that we know the schema and that this is valid and ready for querying. This can either be provided or can be inferred—column types can be determined; joins can be found using record matching techniques and missing entities such as column names can be added.
The High Level System
We describe a notebook structure which consists of a series of layers which correspond to the user's queries. We describe how our assumptions about the database's behaviour and our control of the state allow us to use the database statelessly and with minimal or optional data retrieval.
The system is based around a notebook structure.
A notebook is a data structure which consists of a set of nodes connected in a directed graph. This graph becoming cyclic is a circular dependency and raises an error, and therefore we have a directed tree or trees formed from the nodes. By giving each node a unique, orderable, identifier we allow a strict ordering of nodes relative to a chosen node based on their distance along the tree from that node, with any ties being broken using the identifier. For any pair of nodes we can say that one comes before the other and can define a nearest neighbour to a node or group of nodes.
For any node we can provide a tree of all dependent nodes, or all nodes it is dependent on. If the notebook consists of more than one tree, we can split it into multiple notebooks, each based on one tree. The ordering of the nodes displayed to the user, or whether all or some are shown, is purely presentational.
Each node in the notebook is experienced as a layer and represents one query or one interaction by the user.
This ordering allows us to make the notebook reactive. This means that an update to one node precipitates the update of any other nodes which are affected by its value or behaviour. This can be achieved recursively on the set of updated nodes, which begins as the single user-updated node and spreads to its nearest neighbour at each step. This can be limited to nodes which come ‘after’ the updated node or can propagate back up the tree to nodes which come ‘before’. An update is this process by which the experience moves from one state to another.
There is a local state of the experience which consists of the current and past state of the notebook structure along with representational information for showing it to the user. The database also has a state, which (since it is typically a fully functional piece of software) can be very complex but is at least a list of the initial datasets and their contents.
Due to the size of the underlying data we need to have a many-to-one correspondence between different users' notebooks and the database, and therefore the experience is designed to hold all of the state, meaning that the database can be used as if it is entirely stateless (i.e., it does not hold any notebook state)—between each update the database could be entirely deleted and reloaded with an unchanged schema and contents without affecting the behaviour of the system.
The assumption about our knowledge of the database schema and the state allows us to simulate the vast majority of the behaviour of the database within the experience. For each query that is run we always know the number, types, names and key type of the columns returned and know if there will definitely be a single row (for example when non-grouped aggregation or filtering on a column with unique entries occurs). The only time we do not know whether the length of the returned table will be zero, 1 or greater than 1 is when a general filter or certain special functions have occurred. This allows many conclusions about the user's intent and the behaviour of the notebook to be determined without any interaction with the database, which is important for scalability.
Databases can have a variety of behaviour while fulfilling the conditions we need, and the experience acts to abstract those different database behaviours away from the user. These details are often problematic even to an expert user, particularly we abstract the strong typing of columns used in the majority of databases, the variation of function names, sampling, deferred scheduling, query parallelism, connecting to multiple databases or database copies and phrasing of queries to optimise caching. An example of the typing/function behaviour is that in SQLite strtimer(‘% Y’, column) returns a column containing the year of a series of dates in column (which is untyped and is cast on-the-fly by the database), whereas in PostgreSQL date_partCyear; column) is used and column must have a date or time type. This abstraction means that beyond unalterable performance differences, the user should not have to alter their behaviour and should not see different behaviour if different databases are used beneath the experience.
The ability and necessity of using a local state of the notebook is dictated by its use in a context where its underlying system is a database—this context is one where there are enough assumptions and restrictions on the behaviour of underlying system such that simulation of a significant fraction of its behaviour can be performed locally without having to replicate a significant fraction of that underlying system, but also where there is value in performing the local simulation that since it provides significant advantages in terms of speed and scalability relative to repeated interaction with the underlying system.
We note that these advantages would hold for other underlying systems where substantial parts of the behaviour are deterministically based on a small amount of information, but where the exact behaviour can only be derived by a costly direct testing of the system, or where there is benefit in having a distinct local state of the system when the global state can then be updated to replicate at a future time (see ‘Other Uses’).
The key interest in the system as a whole is how the state of the notebook informs the behaviour of the layers and how the state of the layers informs the behaviour of the notebook; we will discuss the behaviour from each direction.
Operation of a Layer
A layer is the basic atomic unit of the notebook, holding a local state between update operations precipitated by the user or the notebook.
A layer holds a state which includes, but is not limited to,
As alluded to in (2) and (3), the query the layer represents to the user and the query run by the database may be different; this is because the layer has to hold additional information regarding the key which sometimes has to be derived from knowledge of the entries in the database. The key is needed for the notebook to know how to bring the result of two layers together but need not be seen by the user. A simple example would be if we selected columns of a table which did not include the key, but then wanted to join another table to that new table—this can only be performed because the key column is present in the query run by the database and in the state of the layer, but not in the table the user sees or interacts with.
When two tables are brought together in a layer, a set of rules are used to determine which columns can be joined on and then perform the join, see Appendix A.
A layer is stateful and holds that state until some update event occurs. This can be caused by:
The following sections now describe those two interactions.
We also note the extension that an interaction with the layer or with the notebook by the user or by a notebook event can produce a query or other behaviour which can be better represented as more than one layer. In this case the layers may be created along with the necessary notebook machinery around them. For example
The User Interaction with the Layer
Interacting with a layer forms the bulk of the user's interaction with the system. The layer provides the user with an interface to query the database using entities from the data which have been imported from the results of previous layers.
This interface can take a huge number of forms, from free text input to click-based to VR/AR etc. —in each case we are assuming that the user has the ability to use that interface to give an intent to use given functions provided by the database on the concepts which they think are in it.
Any functionality provided by the database can be obtained unambiguously and with absolute certainty by a user who has knowledge of the database's operation. In particular:
The usual interfaces which connect directly to databases allow such functionality through a high level of structure and formalism, often to a level which is inappropriate for the query which is being asked.
In the example above, the value ‘Second Class’ only occurs in ‘Ship Mode’ in the table. Using SQL, we have to type (the quotation mark types are important):
However
Produces an error; no query is run
Such behaviour is surprising to a non-expert and frustrating to an expert, in particular it runs counter to the expectation from using web-based shallow search e.g. Google, where “secnod class” [sic] would be acceptable.
However, more ambiguous input can be entered through the same interface, allowing ambiguous input, yet having the ability to enter text which is interpreted exactly.
We describe implementations of this input system in Appendix B.
Suggestions as to further queries the user may wish to perform are produced based on the content of the layer and of previous layers, we describe implementations of these in Appendix F.
Graphical output of the query is automatically generated. We describe the rules for generating these charts in Appendix C.
It is known that the interpretation of the query is simple to produce since the query is known exactly, it is also simple to name newly produced columns since it is known exactly what operation produced them. For example the query average sales by week will produce a query of SELECT AVG (“Sales”), strfime(‘% W’, “Date”) FROM “Orders” GROUP BY strftime(‘% W’, “Date”) which can be simply played back as average sales and the week of date from the Orders table (for example) and the returned columns can be named average_sales and week_of date.
At any time, multiple results (tables, charts, interpretations, suggestions etc.) due to multiple different interpretations of the user's input or other intent may be displayed. The user may then select from these. The input system is designed in part to limit the number which need to be displayed (preferably to one) by helping the user to refine their intent during input.
Operation of a Notebook
When a layer updates, it has to ensure that any information it requires from other layers has updated as well. When a layer is updated we firstly must determine if the change is local, if it propagates and if so, if it causes a local or propagating change to other layers.
Local changes include a change to the presentation of the node, for example some changes to how the visual is displayed or how a table is seen to be sorted or how many rows are visible, or the inclusion of another column in the query which is unused elsewhere—they will cause no change to surrounding layers.
Propagating changes fire an event to surrounding layers—when a layer is updated in this way then every layer which depends on that layer must then update. For each layer asked to update:
Once this is complete, the notebook can be used to produce the query at each layer
The queries which are run could be run at any time i.e. after each instance of (5) or can be run at the end after all changes have been made.
The queries may be run using intermediate tables corresponding to the layers or can be written by nesting the queries so that no intermediate tables are formed—this leads to different performance in databases where table creation involves distinct behaviour, for example disk writes. Explicitly, in SQL:
may be compared to
or
Since the state at any time is known, and is very small, it is simple to have undo functionality.
The structure and operation of a notebook also lends itself to the inclusion of other types of layer, not linked to the notebook-layer interaction we are describing here. Layers could provide space for
These other layers can also provide entities which could be interacted with in other layers, for example the outputs of the special functions (3), the imported data (2) or some information or metadata regarding the media or text (1).
Other Uses
Other uses of such a system—one where a local stateful notebook structure
We will use the naming conventions used in SQL relational databases to allow connection with the literature and to provide examples, but this does not exclude implementations not using that technology.
Keys on tables relate entries between tables, describing which columns contain entries which are unique or which columns contain entries which are unique in another table.
A primary key is a column or group of columns which uniquely identifies any row in a table for example an employee id column.
A foreign key is a column or group of columns in a table which are a primary key in another table. The column of towns in a database of addresses can be linked to the column of towns which is the primary key of a table which is a list of towns and their crime rate.
A candidate key is the same as a primary key, but has not been chosen as the primary key (many databases enforce the requirement of one primary key, though this is not obvious to a lay user)
Tracking Keys
A query on a table can have three types—non-aggregating filtering, non-aggregating constructing and aggregating
There may be other candidate keys or foreign keys created, however the behaviour of these keys requires knowledge of the contents of the tables.
Any other queries can be seen as chains of these query types (for example binning is (2) followed by (3) and a SQL HAVING statement is (3) followed by (1), although these are often mixed together in database calls)
Despite the requirements on knowing the table contents to solve the general case, much of the behaviour in general use can be determined solely using the table schemas.
Performing JOINs
From the knowledge of the series of primary and foreign keys in the database at any time, JOINs can be performed automatically.
First take the tables in the database and construct a directed graph which has an edge pointing from one table to another if the primary key of the first links to a foreign key of the second.
Now, take two tables to be joined, and follow all possible paths from the tables through the directed graph. These paths either do not intersect, intersect in one place, or intersect in multiple places.
Note that multiple paths can intersect in the same place and provide different results since the behaviour of the eventual JOIN is dependent on restriction of the values passed along the paths.
We provide simple examples of where the paths cross using two base tables, Table A with two columns ‘price’ and ‘colour’ and Table B with ‘profit’, ‘town’ and ‘product’.
If they do not intersect, there are two possible behaviours:
For example: we try to join Table B with Table A, we return either an error, or a table which contains five columns ‘price’, ‘colour’, ‘profit’, ‘town’, and ‘product’.
If they intersect once, then,
If they intersect more than once, then perform the same operations as on a single intersection and return multiple options to the user.
Lack of Implementation Specificity
The above description has used the notation and database structure associated with a table-based relational database; however, the concepts are not confined to that implementation. In the other commonly used representation, an entity-relationship form which underlies graph databases and much of the research on human interfaces to databases, visualisation theory etc., the rules applied are very similar to those above, however when aggregating queries are performed, the results are classified by the primary key which is formed, the primary key defining an entity in the ER representation. The operation (*) is therefore just an operation on a single table. The other operations are also significantly simplified if the columns contain hierarchies, which is often the case (for example if product were replaced with county, with town to county being a many to one mapping), however if no hierarchies are present, there is little to recommend either implementation over the other. The use of an ER representation in the presence of hierarchies (or equivalently recording those hierarchies in the table representation) also reduces the load on the database since it allows the determination of candidate keys and of empty filters without using the database in many common cases.
The way in which the information is displayed to the user is independent of whichever underlying implementation is used.
Joining on Things which are not Usually Seen as Keys
A key concept in data is time, which is not usually seen as a key, however we track and allow interaction between time columns in the same way as with keys. Any time column can be joined to any other since we can say whether any two times or time ranges are equal, or contained within one another or disjoint. Joining may involve aggregation or projection and may therefore involve a choice of how to operate on other columns or on the time columns themselves. Time columns can also be seen as primary or candidate keys if the time intervals within them are unique and disjoint, and in a similar way to any other column can be used as one of the columns in a multiple column key.
Similar observations apply to any non-categorical column where the equality, containment or disjointness of any two entries can be specified. For example, numerical ranges, numbers, geographical points, lines, areas or volumes.
We describe manifestations and features of a system which can simultaneously handle ambiguous and entirely unambiguous input.
Input from the user is mapped via tokens, which include the entities from the database, entities from previous layers, special commands, functions, literals, words in the base language, for example English, and other input from the user e.g. annotations and comments
When the query is returned from the database, a series of chart types are selected for display to the user. The quality of the user experience is tied to the selection of charts and the order in which they are displayed.
The process:
See
See
The Database, Local State, Notebook and Observer are all on the server. The User is remote.
Note that:
Builders
Builders are deterministic suggestions based on the contents of a layer, and possibly one or more layers coming before it; examples include:
| Number | Date | Country | Kind |
|---|---|---|---|
| 2004231.3 | Mar 2020 | GB | national |