Data sets that do not have navigable relationships to each other can be joined by associating objects (entities) in one data set with objects that share a common attribute in the other data set.
Examples will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Various techniques exist for joining datasets and for enabling querying across joined datasets, including record linkage, relational databases, probabilistic databases, deductive databases, and multiplex graphs. Each of these techniques involves creating a model of each of the data sets to be joined. The term “model” is intended to refer to a simplified representation of the underlying entities in a system, their evolution over time, and their mutual interactions.
Record linkage techniques detect duplicated records in the same table or across different tables of a database. Many of these techniques permit a user to specify similarity functions according to which two items will be flagged as being the same. The rules that govern these similarity functions are usually hardcoded and it is therefore difficult for a non-expert user to adjust the similarity functions.
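To illustrate what a user-adjustable similarity function might look like, the following is a minimal sketch in which the similarity function and the threshold are passed in as parameters rather than hardcoded. The names `jaccard` and `flag_duplicates`, and the choice of a token-based Jaccard measure, are assumptions made for illustration only and are not taken from any particular record linkage technique.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity of two strings, ignoring case."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def flag_duplicates(
    rows: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Return index pairs whose similarity is at or above the threshold."""
    return [
        (i, j)
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
        if similarity(rows[i], rows[j]) >= threshold
    ]


# Hypothetical records: the first two share two of three tokens.
rows = ["Acme Corp Ltd", "acme corp", "Globex Inc"]
duplicates = flag_duplicates(rows, jaccard, threshold=0.6)
```

Because the similarity function is an argument, a different measure can be substituted without changing the flagging logic.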
A probabilistic database consists of: (1) a collection of incomplete relations R, which have missing or uncertain data, and (2) a probability distribution F across all possible complete versions of those relations, also called possible worlds. An incomplete relation is defined over a schema comprising a (non-empty) subset of deterministic attributes that includes all candidate and foreign key attributes in R, and a subset of probabilistic attributes. Deterministic attributes have no uncertainty associated with any of their values, whilst probabilistic attributes may contain missing or uncertain values. The probability distribution F of these missing or uncertain values is represented by a probabilistic graphical model, such as a Bayesian network or a Markov random field. Each possible database instance is a possible completion of the missing and uncertain data in R. A set of SQL expansions has been proposed to enable a probabilistic database to select the best process to use for creating a join between data sets within a single database management system. These expansions are, however, expressed in a highly imperative manner, which makes them difficult for a non-expert user to understand and employ.
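The notion of possible worlds can be sketched as follows. This is a simplified illustration that assumes the uncertain values are mutually independent, whereas the distribution F described above is in general represented by a graphical model that can capture dependencies. The function name `possible_worlds` and the sample data are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def possible_worlds(
    uncertain: Dict[str, Dict[str, float]]
) -> List[Tuple[Dict[str, str], float]]:
    """Enumerate every completion of the uncertain values with its probability.

    `uncertain` maps a record key to {candidate value: probability}.
    Independence between records is assumed (a simplification).
    """
    keys = list(uncertain)
    worlds = []
    for combo in product(*(uncertain[k].items() for k in keys)):
        world = {k: value for k, (value, _) in zip(keys, combo)}
        probability = 1.0
        for _, p in combo:
            probability *= p
        worlds.append((world, probability))
    return worlds


# Hypothetical probabilistic attribute "city" for two employee records.
worlds = possible_worlds(
    {"emp1": {"London": 0.7, "Paris": 0.3}, "emp2": {"Berlin": 1.0}}
)
```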
A deductive database is a database system that can make deductions (i.e., conclude additional facts) based on rules and facts stored in the deductive database. Deductive databases represent a mix between logic programming languages, such as Prolog, and relational databases. As a result, deductive databases can be queried using declarative language. Joins in a deductive database can be seen as templates that the logic inference process “takes down to earth” and maps to specific actions on the database. As with all database systems, joins in deductive databases comprise merely a result set, and are not part of the data model itself. Consequently, joins are recomputed for every query.
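The way a deductive database concludes additional facts from stored rules and facts can be sketched with a small forward-chaining loop. This is only an illustrative sketch: the `parent` and `ancestor` relations, the two rules, and the function name `infer_ancestors` are assumptions chosen for the example, and a real deductive database would evaluate rules written in a declarative language rather than in Python.

```python
from typing import Set, Tuple

Fact = Tuple[str, str]


def infer_ancestors(parent_facts: Set[Fact]) -> Set[Fact]:
    """Derive ancestor(x, z) facts from parent(x, y) facts.

    Rule 1: parent(x, y)                    -> ancestor(x, y)
    Rule 2: parent(x, y) and ancestor(y, z) -> ancestor(x, z)
    The rules are applied repeatedly until no new fact is produced.
    """
    ancestors = set(parent_facts)  # rule 1
    changed = True
    while changed:
        changed = False
        for x, y in parent_facts:
            for y2, z in list(ancestors):
                if y == y2 and (x, z) not in ancestors:
                    ancestors.add((x, z))  # rule 2
                    changed = True
    return ancestors


derived = infer_ancestors({("ann", "bob"), ("bob", "cat")})
```

Note that the derived facts exist only in the returned result set, which mirrors the observation above that such results are not part of the data model and are recomputed for every query.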
Multiplex graphs are data models which enable joins across graphs to be maintained, because the result of a join becomes part of the data model itself. This facilitates the building of queries that span a multiplex graph (or multiple multiplex graphs). However, the creation of multiplex graphs is a manual process that involves creating multiplex links in an ad hoc manner. A user explicitly models how the links spanning graphs are created and, responsive to changes to the underlying graphs, manually updates these links.
In the following description the term “equivalence” is used to refer to an entity or attribute of an entity in a first data set which is deemed to be the same as an entity or attribute of an entity in a second data set. The criteria used to determine whether entities or attributes are the same can vary, e.g. in dependence on the particular application, user preferences, etc., and thus a given pair of entities/attributes may comprise an equivalence in one example but not in another example.
In the following description the term “high-level” is used to refer to language which is strongly abstracted from the details of the computer or process which the language is being used to describe. A high-level language for the purposes of the specification is therefore to be understood as a query language which does not prescribe a sequence of commands to be followed to create a join, but instead is closer to the way a non-technical user would specify such an action. One example of this may use natural language elements. A high-level language can therefore easily be used without any detailed knowledge of the underlying computer system or process which will run the query.
Then, in block 102, information relating to a link to be created between the first data set and the second data set is received, e.g. by the processor. In some examples the information comprises a declarative query which provides a high-level description of the link to be created. The information may, for example, be in the form of a specification submitted by a user of the computer system. In some examples the information comprises a query written in a high-level, declarative query language. Since the language is declarative, rather than imperative, the information does not need to specify how the link is to be created (e.g. the exact manner in which equivalences between the first and second data sets are to be found).
For example, a declarative query used to specify a particular join could have the form:
The declarative language used by examples can provide flow processing abstractions for querying across linked dataset graphs, composable query fragments, and a macro inclusion system. In particular, the examples which use declarative language make nested aggregations and projections of database tables easy to understand and use.
In some examples the received information comprises information identifying the first data set and the second data set. In other words, the information specifies the data sources of the data sets which the user wants to link. These sources can be, for example, graphs, database tables, file repositories, etc. In some such examples the information specifies a hardware provision and a service provision for each data set.
The user can also indicate, in the specification, information relating to the equivalences on which the user wishes the created link to be based. Such information can comprise, for example: a type or set of types of entity that the user wishes an equivalence search to be restricted to; a type or set of types of entity that the user wishes to be considered by an equivalence search; an attribute or set of attributes that the user wishes an equivalence search to be restricted to; an attribute or set of attributes that the user wishes to be considered by an equivalence search; and/or a process to be used in an equivalence search (e.g. entropy-based determination of text similarity). Thus, in some examples the received information additionally comprises any or all of: information identifying a type of entity for which equivalences between the data sets to be linked are to be found; information identifying an attribute or a set of attributes for which equivalences between the data sets to be linked are to be found; information identifying transformations on such an attribute or a set of such attributes (e.g. a fast Fourier transform on an attribute carrying signal information); and information identifying a process to be used for finding equivalences.
In some examples the user can create a specification by completing a template, where a template is a form comprising fields that can be filled in with high-level information (as opposed to programming code or an imperative query, both of which are considered to comprise low-level information for the purposes of this specification). The completion of some of the fields in the template may be optional, such that a user can provide certain kinds of information if the user wishes to specify in more detail how a requested link is to be created, but the link creation process can still proceed without receiving these kinds of information. In some examples, if a field of the template is left blank by the user (i.e. the received information does not contain a certain type of information relating to the link to be created), the processor will consider all possible options relating to that type of information. For example, if a field “entity type” (in which a user can indicate, for example, whether equivalences between text, numbers, or both are to be considered) is left blank, the processor may by default consider both text and numbers when searching for equivalences.
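A template with optional fields and the "consider all options when blank" default could be represented along the following lines. This is a sketch under stated assumptions: the class `LinkTemplate`, its field names, the constant `ALL_ENTITY_TYPES`, and the helper `effective_entity_types` are hypothetical and are not defined by the present description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed set of entity types the processor knows how to compare.
ALL_ENTITY_TYPES: Tuple[str, ...] = ("text", "number")


@dataclass(frozen=True)
class LinkTemplate:
    """High-level specification of a link; only the data sets are mandatory."""

    first_data_set: str
    second_data_set: str
    entity_type: Optional[Tuple[str, ...]] = None  # blank: consider all types
    attributes: Optional[Tuple[str, ...]] = None   # blank: consider all attributes
    process: Optional[str] = None                  # blank: processor chooses


def effective_entity_types(template: LinkTemplate) -> Tuple[str, ...]:
    """Return the entity types to search, defaulting to all of them."""
    return template.entity_type if template.entity_type else ALL_ENTITY_TYPES


# A user who fills in only the two data sets gets both text and numbers.
minimal = LinkTemplate("customers_graph", "orders_table")
types_to_search = effective_entity_types(minimal)
```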
A template can be seen as a static (and often partial) version of the model representing the first and second data sets. A completed template represents the requested status of some of the possible equivalences between the first and second data sets, and the template does not take into account the existence of other possible equivalences. Consider, for example, the declarative query listed in paragraph 19 above:
In block 103, a link creation mechanism is selected (e.g. by the processor) based on the received information. In some examples the processor has access to a store of various link creation mechanisms from which the processor may select the most appropriate link creation mechanism for a given received specification. A link creation mechanism can be, for example, a process for finding equivalences between two data sets.
In some examples, the selection of a link creation mechanism is based on a description of that link creation mechanism.
In block 203 a link creation mechanism is selected based on its description as well as on the information received in block 202. In some examples selecting a link creation mechanism comprises, for each description, matching terms in the description with terms in the received information and selecting a link creation mechanism associated with a description having the highest number of matching terms. In some examples in which the provided descriptions comprise information about a complexity and/or a threshold of the described link creation mechanism, selecting a link creation mechanism comprises selecting a link creation mechanism having a relatively lower complexity, and/or a relatively higher threshold, than another link creation mechanism in the set. For example, if several descriptions contain the same number of matching terms, the link creation mechanism having the lowest complexity and/or the highest threshold will be selected from among the link creation mechanisms associated with the descriptions having equal highest numbers of matching terms. If it is not possible to identify a single link creation mechanism meeting predefined selection criteria, in some examples the assistance of a human operator will be sought (e.g. by generating an error message on a display of the computer system).
Thus, the performance of block 203 can be seen as the processor interpreting the descriptions and mapping them to the user provided specification, so as to find the available link creation mechanism that “best” matches what the user indicated in the specification.
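The term-matching selection described above might be sketched as follows. The sketch makes several assumptions that the description leaves open: descriptions and specifications are tokenised into lowercase alphanumeric terms, ties on the number of matching terms are broken first by lowest complexity and then by highest threshold, and `None` is returned when no single mechanism can be identified (the point at which a human operator would be consulted). The names `Mechanism`, `terms` and `select_mechanism` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Mechanism:
    name: str
    description: str
    complexity: int   # lower is preferred
    threshold: float  # higher is preferred


def terms(text: str) -> Set[str]:
    """Split text into a set of lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_mechanism(spec: str, mechanisms: List[Mechanism]) -> Optional[Mechanism]:
    """Pick the mechanism whose description best matches the specification."""
    spec_terms = terms(spec)

    def rank(m: Mechanism):
        matches = len(spec_terms & terms(m.description))
        return (matches, -m.complexity, m.threshold)

    ranked = sorted(mechanisms, key=rank, reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and rank(ranked[0]) == rank(ranked[1]):
        return None  # ambiguous: defer to a human operator
    return ranked[0]


# Hypothetical store of mechanisms and a hypothetical user specification.
store = [
    Mechanism("text_cluster", "clusters text attributes using similarity", 3, 0.8),
    Mechanism("exact_match", "match text attributes exactly", 1, 0.9),
    Mechanism("signal_fft", "compare signal attributes after a fourier transform", 5, 0.7),
]
chosen = select_mechanism("link customers on text attributes", store)
```

In this usage, the first two mechanisms each match two terms of the specification, so the tie-break on complexity decides between them.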
Example link creation mechanisms will now be described. In some examples, e.g. examples in which the received information does not comprise any indications as to how the user wishes equivalence relations to be found, or any indications of particular attributes or entities the user wishes to be considered (e.g. the received information is information identifying the first data set and the second data set), a link creation mechanism operates by converting all of the entity attributes in the first data set and all of the entity attributes in the second data set to text. A clustering process based on text similarity is then performed, e.g. by the processor, which generates pairs of attributes (i.e. comprising one attribute from each data set) having a level of text similarity which is greater than a predefined threshold. In some examples this threshold is configurable, e.g. by the user. In some examples the processor presents the generated pairs to the user and requests the user to confirm whether each pair is an equivalence.
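The default mechanism described above can be approximated by the following sketch, which converts every attribute value to text and keeps the cross-data-set pairs whose similarity exceeds a configurable threshold. Note that it substitutes a simple pairwise comparison, using the Python standard library's `difflib.SequenceMatcher`, for the clustering process mentioned in the description. The function name `candidate_pairs` and the sample attributes are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def candidate_pairs(
    first: Dict[str, object],
    second: Dict[str, object],
    threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Pair attributes across two data sets by text similarity.

    Each attribute value is converted to lowercase text; a pair is kept
    only if its similarity ratio is strictly greater than the threshold.
    """
    pairs = []
    for name_a, value_a in first.items():
        for name_b, value_b in second.items():
            score = SequenceMatcher(
                None, str(value_a).lower(), str(value_b).lower()
            ).ratio()
            if score > threshold:
                pairs.append((name_a, name_b, score))
    return pairs


# Hypothetical attributes of one entity in each data set.
pairs = candidate_pairs(
    {"customer_name": "Acme Corp", "zip": 94304},
    {"client": "ACME Corp", "postcode": "94304", "city": "Palo Alto"},
)
```

The resulting list is what would be presented to the user for confirmation in examples where each generated pair is confirmed manually.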
In a second block 402 the process determines the attribute identified by the attribute identifier for the first entity, and in a third block 403 the process determines the attribute identified by the attribute identifier for the second entity. Blocks 402 and 403 can be performed in any order, or simultaneously. In examples in which multiple attribute identifiers are input to the process, blocks 402 and 403 are performed in respect of each attribute identified by the input attribute identifiers.
Then, in block 404, the process determines a similarity of the first entity and the second entity by comparing the determined attribute of the first entity with the determined attribute of the second entity. In some examples performing block 404 comprises converting determined attributes to text elements, and comparing the determined attributes comprises determining the similarity of the text elements, e.g. using a clustering process based on text similarity. In some such examples, associations between the attribute and its text elements are stored for a configurable predetermined time period, which can reduce the computational overhead if a further equivalence finding process is performed during the predetermined time period.
In block 405 the process calculates a probability that the first entity and the second entity are related in a manner specified by the input relationship identifier, based on the determined similarity. In examples in which multiple attribute identifiers are input to the process, the similarity determination comprises comparing a pair of determined attributes corresponding to each input attribute identifier, and combining the results of these comparisons. In some examples block 405 comprises comparing a calculated probability to a predefined threshold, wherein a probability less than the threshold will result in the process determining that the first and second entities are not related in the manner specified by the input relationship identifier, and a probability greater than the threshold will result in the process determining that the first and second entities are related in the manner specified by the input relationship identifier.
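Blocks 402 to 405 can be sketched as a single function. The sketch makes assumptions that the description does not fix: attributes are compared as lowercase text using `difflib.SequenceMatcher`, the per-attribute results are combined by taking their mean, the combined similarity is used directly as the probability, and a probability exactly equal to the threshold is treated as "not related". The names `text_similarity`, `relation_probability` and `are_related` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence


def text_similarity(a: object, b: object) -> float:
    """Similarity ratio in [0, 1] of two values converted to lowercase text."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def relation_probability(
    first_entity: Dict[str, object],
    second_entity: Dict[str, object],
    attribute_ids: Sequence[str],
) -> float:
    """Blocks 402-405: look up each identified attribute on both entities,
    compare the pair, and combine the results (here: the mean)."""
    if not attribute_ids:
        raise ValueError("at least one attribute identifier is required")
    scores = [
        text_similarity(first_entity[attr], second_entity[attr])
        for attr in attribute_ids
    ]
    return sum(scores) / len(scores)


def are_related(
    first_entity: Dict[str, object],
    second_entity: Dict[str, object],
    attribute_ids: Sequence[str],
    threshold: float = 0.8,
) -> bool:
    """Decide the relationship by comparing the probability to a threshold."""
    return relation_probability(first_entity, second_entity, attribute_ids) > threshold


# Hypothetical entities compared on two attribute identifiers.
a = {"name": "Acme Corp", "city": "Palo Alto"}
b = {"name": "ACME Corp", "city": "Palo Alto"}
related = are_related(a, b, ["name", "city"])
```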
Returning to
The examples therefore provide a simple way for a user to find equivalent entities across multiple data sets. The examples permit the use of a high-level specification language which is accessible to non-experts. Furthermore, since the task of determining how equivalences are to be found can be performed automatically on the basis of a provided high-level specification, equivalences can be found quickly, accurately, and with little effort on the part of the user.
In block 503, a link creation mechanism is selected based on the received information and/or on the received second information. In some examples a single link creation mechanism is selected based on the received information and on the received second information. In some examples selecting a link creation mechanism comprises selecting a first link creation mechanism based on the received information and selecting a second link creation mechanism based on the received second information. In some examples performing block 503 comprises comparing terms in a description of an available link creation mechanism to terms in the received information and terms in the received second information, e.g. in the manner described above in relation to block 103 of
In block 504, each selected link creation mechanism is used to determine an equivalence between the first data set and the second data set, in the manner described above in relation to block 104 of
In some examples a processor performing the example method is to run received specifications in parallel whenever possible. When, in block 505, equivalence relations based on the determined equivalences are added to the first and second models, this can trigger the creation and/or removal of other equivalence relations in the models. In such cases the processor performs blocks 504 and 505 several times. The first pass comprises a parallel processing of all the received information, and subsequent passes comprise an analysis of the entities for which new equivalences were determined in previous passes. In some examples the number of passes after the initial pass is the same as the number of different items of information received and processed in parallel (i.e. for the example in
In some examples detecting a change comprises creating a watch process, e.g. by the processor of a computer system. In some examples in which the processor comprises a receiving process, the watch process and the receiving process comprise independent execution threads. The watch process may run continuously. In some examples a single watch process is to watch multiple entities, which may be involved in multiple equivalence relations. In some examples the creation of the watch process is based on watch information provided by a user. For example, a user can provide an input indicating an entity or multiple entities, and/or an entity attribute or set of entity attributes, that the user wishes to be observed by a watch process. In some examples the watch information is provided together with information relating to a link to be created between two data sets. In some examples the watch information is provided separately from information relating to a link to be created. In some examples the watch process is to watch all entities which are involved in equivalence relations.
In some examples the watch process is to observe attributes of an entity and to detect when any of these attributes change. A change can comprise, for example, the addition of an entity, the deletion of an entity, or a change in the value of an attribute of an entity (i.e. an update to the entity). In some examples new, deleted and updated entities are handled separately, which simplifies the change detection process and reduces the computational overhead. In some examples the output of the watch process is a list of entities whose “to-be-watched” attributes have changed.
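One way a watch process could detect changes is by comparing two snapshots of the watched entities, handling new, deleted and updated entities separately. The following is a minimal sketch of that comparison only; it does not model the independent execution thread. The snapshot representation (a mapping from entity identifier to attribute values) and the function name `detect_changes` are assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Snapshot = Dict[str, Dict[str, object]]


def detect_changes(
    previous: Snapshot,
    current: Snapshot,
    watched_attributes: Sequence[str],
) -> Dict[str, List[str]]:
    """Compare two snapshots and list changed entity identifiers.

    Added, deleted and updated entities are reported separately; an entity
    counts as updated only if one of its watched attributes changed.
    """
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    updated = sorted(
        entity_id
        for entity_id in previous.keys() & current.keys()
        if any(
            previous[entity_id].get(attr) != current[entity_id].get(attr)
            for attr in watched_attributes
        )
    )
    return {"added": added, "deleted": deleted, "updated": updated}


# Hypothetical snapshots in which only the "name" attribute is watched.
before = {"e1": {"name": "Acme", "note": "x"}, "e2": {"name": "Globex"}}
after = {"e1": {"name": "Acme Corp", "note": "x"}, "e3": {"name": "Initech"}}
changes = detect_changes(before, after, ["name"])
```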
In some examples in which a watch process is provided, the receiving process does not trigger the running of a link creation mechanism to find equivalences involving the changed entities. Such examples reduce the computational burden on the receiving process, enabling updates to the data sets to be processed quickly.
In response to a detection of a change (or multiple changes) relating to an entity involved in an equivalence relation, in block 603 the equivalence relation in which the watched entity is involved is updated in the first model and the second model. In some examples the watched entity may be involved in more than one equivalence relation, in which case block 603 comprises updating each equivalence relation in which the watched entity is involved. In some examples the updating comprises running a link creation mechanism to find new equivalences. Several passes may be necessary, as described above in relation to blocks 504 and 505 of
The examples therefore provide systems which enable a user to link two data sets merely by specifying some high-level preferences. The system automatically infers what an equivalence could mean for those data sets, in light of the high-level information provided by the user. Such examples are particularly suitable for non-technical users. Furthermore, in some of the examples equivalence relations created during the link creation process are maintained, enabling them to be used to enrich a result set generated when a user later queries one of the linked data sets. In some examples the equivalence relations are maintained and updated even in the face of changes to the underlying data contained in the linked data sets.
Examples in the present disclosure can be provided as methods, systems or machine readable instructions, such as any combination of software, hardware, firmware or the like. Such machine readable instructions may be included on a computer readable storage medium (including but not limited to disc storage, CD-ROM, optical storage, etc.) having computer readable program codes therein or thereon.
The present disclosure is described with reference to flow charts and/or block diagrams of the method, devices and systems according to examples of the present disclosure. Although the flow diagrams described above show a specific order of execution, the order of execution may differ from that which is depicted. Blocks described in relation to one flow chart may be combined with those of another flow chart.
It shall be understood that each flow and/or block in the flow charts and/or block diagrams, as well as combinations of the flows and/or diagrams in the flow charts and/or block diagrams can be realized by machine readable instructions.
The machine readable instructions may, for example, be executed by a general purpose computer, a special purpose computer, an embedded processor or processors of other programmable data processing devices to realize the functions described in the description and diagrams. In particular, a processor or processing apparatus may execute the machine readable instructions. Thus functional modules of the apparatus and devices may be implemented by a processor executing machine readable instructions stored in a memory, or a processor operating in accordance with instructions embedded in logic circuitry. The term ‘processor’ is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, or programmable gate array etc. The methods and functional modules may all be performed by a single processor or divided amongst several processors.
Such machine readable instructions may also be stored in a computer readable storage that can guide the computer or other programmable data processing devices to operate in a specific mode.
Such machine readable instructions may also be loaded onto a computer or other programmable data processing devices, so that the computer or other programmable data processing devices perform a series of operation steps to produce computer-implemented processing. The instructions executed on the computer or other programmable devices thus provide a step for realizing functions specified by flow(s) in the flow charts and/or block(s) in the block diagrams.
Further, the teachings herein may be implemented in the form of a computer software product, the computer software product being stored in a storage medium and comprising a plurality of instructions for making a computer device implement the methods recited in the examples of the present disclosure.
While the method, apparatus and related aspects have been described with reference to certain examples, various modifications, changes, omissions, and substitutions can be made without departing from the spirit of the present disclosure. It is intended, therefore, that the method, apparatus and related aspects be limited only by the scope of the following claims and their equivalents. It should be noted that the above-mentioned examples illustrate rather than limit what is described herein, and that those skilled in the art will be able to design many alternative implementations without departing from the scope of the appended claims.
The word “comprising” does not exclude the presence of elements other than those listed in a claim, “a” or “an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims.
The features of any dependent claim may be combined with the features of any of the independent claims or other dependent claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2015/061892 | 5/28/2015 | WO | 00 |