The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 19206246.1 filed on Oct. 30, 2019, which is expressly incorporated herein by reference in its entirety.
The present invention relates to a computer-implemented method and a system for processing data for generating data subsets from data sets, wherein a data set comprises a plurality of data elements.
To find a particular data set out of a plurality of data sets, recent efforts yielded data set search engines. Such search engines are for example configured to retrieve data sets that are relevant to a keyword query by matching the query with the description in the metadata of each data set.
Other data set search engines may present data set summaries, which are mainly composed of some metadata about the data set, such as provenance and license. Their utility in relevance judgment is limited: users have to analyze each data set in the search results to assess its relevance, which is a time-consuming process.
An object of the present invention is to provide a method and system for computing optimal data subsets, wherein a data subset is a representative subset of a data set, also known as a data snippet.
A data subset aims at concisely explaining to the user why the represented data set fulfils their demand. In particular, it can illustrate the main content of the data set and explain its relevance to the user's query.
The object may be achieved by the device and methods according to the example embodiments of the present invention.
In accordance with an example embodiment of the present invention, a computer-implemented method is provided for processing data for generating data subsets. The method includes the following steps:
receiving at least one data set that specifies a search result responsive to a search query, the data set including a plurality of data elements and the search query including at least one query term;
identifying a number of data elements in said data set, each data element characterized by a weight with regard to coverage of said query terms and/or coverage of a data schema of said data set and/or coverage of key data of said data set, wherein the data elements are identified such that the total weight of the identified data elements is maximized.
According to an example embodiment, the method further comprises the step of generating a data subset comprising the identified data elements.
According to an embodiment, a data element of said at least one data set is an RDF (Resource Description Framework) triple comprising a subject, a predicate, and an object. More particularly, the data set is a set of RDF triples denoted by T={t1,t2, . . . ,tn}, where each ti=⟨tis,tip,tio⟩ is a subject-predicate-object triple of RDF resources. The subject tis of a triple ti is an entity (i.e., a non-literal resource at the instance level) that appears in the data set. The predicate tip represents a property. The object tio is a value of tip, which can be a class, a literal, or another entity in the data set.
According to an embodiment of the present invention, the weight of a data element comprises a value for the coverage of the query terms and/or a value for the coverage of the data schema of the data set and/or a value for the coverage of key data of the data set.
According to an embodiment of the present invention, the value for the coverage of a query term is evaluated by 1/|Q| if said query term is instantiated in said data element, wherein Q represents the search query including the query term.
According to an embodiment of the present invention, the value for the coverage of the data schema for a data element is evaluated either by a relative frequency of a class observed in the data set if said class is instantiated in said data element or by a relative frequency of a property observed in the data set if said property is instantiated in said data element. More particularly, the relative frequency of a class c observed in the data set is given by
frqCls(c)=|{t∈T: tp=rdf:type and to=c}| / |{t∈T: tp=rdf:type}|,
where T represents the set of triples in the data set. Analogously, the relative frequency of a property p observed in the data set is given by
frqPrp(p)=|{t∈T: tp=p}| / |T|.
According to an embodiment of the present invention, the value for the coverage of key data is evaluated by a mean normalized out-degree and in-degree of an entity of said data set if said entity is instantiated in said data element. Central entities represent the key content of the data set. The data set can be regarded as a directed graph comprising nodes, called entities, and lines connecting the nodes, called edges. An entity e has an out-degree, given by d+(e), and an in-degree, given by d−(e), in the RDF graph representation of the data set. The in-degree represents the number of edges incoming to an entity, and the out-degree represents the number of edges outgoing from an entity.
The mean normalized out-degree and in-degree is given by
where Ent(T) is the set of entities that appear in T.
According to an example embodiment of the present invention, in said weight of a data element the value for the coverage of the query terms and/or the value for the coverage of the data schema of the data set and/or the value for the coverage of key data of the data set are weighted by multiplication with a weighting factor.
According to an example embodiment of the present invention, the data subset comprising the identified data elements maximizes an objective function
q(SD1)=Σw(x),
wherein SD1 represents said data subset and w represents the weight for x being a query term or a class in the data schema of the data set or a property in the data schema of the data set or an entity in the data set.
More particularly, the data subset maximizes the objective function
q(SD1)=Σ w(x), summed over all x∈⋃ cov(ti) with ti∈SD1,
wherein cov(ti) represents a set consisting of the query terms covered by ti, the class instantiated in ti, the property instantiated in ti, and the entities that appear in ti.
Preferably, the generation of data subsets can be formulated as a combinatorial optimization problem, aiming to find a data subset such that it contains the query terms and an instantiation of the most frequently used classes and properties in the data set and contains entities having the highest scores in the data set, wherein the optimization problem is solved by a data subset, which maximizes the objective function.
According to an example embodiment of the present invention, the step of identifying data elements comprises identifying a limited number of data elements.
According to an example embodiment of the present invention, said method further comprises prior to receiving the data set that specifies the search result responsive to the search query the step of receiving the search query and the step of conducting a search.
The present invention also concerns a data subset comprising the identified data elements, which maximizes the objective function q(SD1)=Σw(x).
The present invention also concerns a system for processing data for generating data subsets, wherein the system is configured to carry out the method according to any of the embodiments.
The present invention also concerns a computer program, wherein the computer program comprises computer readable instructions that when executed by a computer cause the computer to execute a method according to the embodiments.
The present invention also concerns the use of a method according to the embodiments and/or a system according to the embodiments and/or a computer program according to the embodiments for generating data subsets in a data set search engine.
Further advantageous embodiments are derived from the description below and the figures.
The method 100 includes the following steps:
receiving 110 at least one data set that specifies a search result responsive to a search query, the data set including a plurality of data elements and the search query including at least one query term; and
identifying 120 a number of data elements in said data set, each data element characterized by a weight with regard to coverage of said query terms and/or coverage of a data schema of said data set and/or coverage of key data of said data set, wherein the data elements are identified such that the total weight of the identified data elements is maximized.
According to the example embodiment of the present invention, the method 100 further comprises the step of generating 130 a data subset comprising the identified data elements.
According to an example embodiment of the present invention, the method 100 further comprises, prior to receiving 110 the data set that specifies the search result responsive to the search query, the step of receiving 140 the search query and the step of conducting 150 a search.
The system 200 is configured to carry out at least the steps 110, 120 and 130 of the method 100.
According to an example embodiment of the present invention, the system 200 is further configured to carry out the steps 140 and 150 of the method 100.
The system 200 receives 110 at least one data set D1 that specifies a search result responsive to a search query. The data set D1 includes a plurality of data elements and the search query includes at least one query term.
According to an example embodiment of the present invention, the system 200 receives a plurality of data sets D1, wherein each data set D1 specifies a search result responsive to a search query.
According to an example embodiment of the present invention, a search query is a set of query terms denoted by Q={q1,q2, . . . ,qm}.
According to an example embodiment of the present invention, the data elements of the data set D1 are RDF (Resource Description Framework) triples, each comprising a subject, a predicate, and an object. The data set D1 is a set of RDF triples denoted by T={t1,t2, . . . ,tn}, where each ti=⟨tis,tip,tio⟩ is a subject-predicate-object triple of RDF resources. The subject tis of a triple ti is an entity (i.e., a non-literal resource at the instance level) that appears in the data set D1. The predicate tip represents a property. The object tio is a value of tip, which can be a class, a literal, or another entity in the data set D1.
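The triple structure described above can be illustrated with a minimal Python sketch. The names `Triple`, `RDF_TYPE` and `entities`, the toy resources, and the convention that literals are written in double quotes are assumptions made for illustration only and are not part of the described embodiments.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    s: str  # subject: an entity of the data set
    p: str  # predicate: a property
    o: str  # object: a class, a literal, or another entity

RDF_TYPE = "rdf:type"  # assumed marker for class-membership triples

# Hypothetical toy data set T = {t1, ..., tn}
T = [
    Triple("ex:Berlin", RDF_TYPE, "ex:City"),
    Triple("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    Triple("ex:Germany", RDF_TYPE, "ex:Country"),
    Triple("ex:Berlin", "rdfs:label", '"Berlin"'),
]

def entities(triples):
    """Ent(T): all subjects, plus objects that are neither classes nor literals.

    Assumption: a literal is any object string starting with a double quote,
    and a class is any object of an rdf:type triple.
    """
    subjects = {t.s for t in triples}
    objects = {
        t.o for t in triples
        if t.p != RDF_TYPE and not t.o.startswith('"')
    }
    return subjects | objects
```

In this toy data set, only ex:Berlin and ex:Germany count as entities; ex:City and ex:Country are classes, and "Berlin" is a literal.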
The system 200 identifies 120 a number of data elements in said data set D1. Each data element is characterized by a weight with regard to coverage of said query terms and/or coverage of a data schema of said data set D1 and/or coverage of key data of said data set D1. According to the embodiment, the data elements are identified such that the total weight of the identified data elements is maximized.
According to an example embodiment of the present invention, the weight of a data element comprises a value for the coverage of the query terms and/or a value for the coverage of the data schema of the data set D1 and/or a value for the coverage of key data of the data set D1.
According to an example embodiment of the present invention, the value for the coverage of a query term is evaluated by 1/|Q| if said query term is instantiated in said data element, wherein Q represents the search query including the query term.
According to an example embodiment of the present invention, the value for the coverage of the data schema for a data element is evaluated either by a relative frequency frqCls of a class observed in the data set D1 if said class is instantiated in said data element or by a relative frequency frqPrp of a property observed in the data set D1 if said property is instantiated in said data element.
The relative frequency of a class c observed in the data set D1 is given by
frqCls(c)=|{t∈T: tp=rdf:type and to=c}| / |{t∈T: tp=rdf:type}|,
where T represents the set of triples in the data set D1. Analogously, the relative frequency of a property p observed in the data set D1 is given by
frqPrp(p)=|{t∈T: tp=p}| / |T|.
According to an example embodiment of the present invention, the value for the coverage of key data is evaluated by a mean normalized out-degree and in-degree of an entity of said data set D1 if said entity is instantiated in said data element.
The data set can be regarded as a directed graph comprising nodes, called entities, and lines connecting the nodes, called edges. An entity e has an out-degree, given by d+(e), and an in-degree, given by d−(e), in the RDF graph representation of the data set. The in-degree represents the number of edges incoming to an entity, and the out-degree represents the number of edges outgoing from an entity.
The mean normalized out-degree and in-degree is given by
where Ent(T) is the set of entities that appear in T.
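As a hedged illustration of the degree-based score, the following Python sketch counts out-degrees and in-degrees over the triples and averages the two normalized values per entity. Two points are assumptions rather than statements of the embodiments: each degree is normalized by the maximum degree observed over Ent(T), and every triple whose subject (or object) is an entity contributes one outgoing (or incoming) edge. The function name `mean_normalized_degree` is hypothetical.

```python
from collections import Counter

def mean_normalized_degree(triples, ents):
    """Return {entity: mean of normalized out-degree and in-degree}.

    triples: iterable of (subject, predicate, object) tuples
    ents:    the entity set Ent(T)
    Assumption: degrees are normalized by the maximum over Ent(T).
    """
    triples = list(triples)
    out_deg = Counter(s for s, _, _ in triples if s in ents)
    in_deg = Counter(o for _, _, o in triples if o in ents)
    max_out = max(out_deg.values(), default=0)
    max_in = max(in_deg.values(), default=0)
    scores = {}
    for e in ents:
        norm_out = out_deg[e] / max_out if max_out else 0.0
        norm_in = in_deg[e] / max_in if max_in else 0.0
        scores[e] = (norm_out + norm_in) / 2
    return scores

# Hypothetical toy graph: a -> b, a -> c, b -> a
scores = mean_normalized_degree(
    [("a", "p", "b"), ("a", "q", "c"), ("b", "p", "a")],
    {"a", "b", "c"},
)
```

In the toy graph, entity a has the highest out-degree and ties for the highest in-degree, so it receives the largest score.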
According to an example embodiment of the present invention, in said weight of a data element the value for the coverage of the query terms and/or the value for the coverage of the data schema of the data set and/or the value for the coverage of key data of the data set are each weighted by multiplication with a respective weighting factor α, β, γ.
According to an example embodiment of the present invention, the weight of x, being a query term or a class in the data schema of the data set or a property in the data schema of the data set or an entity in the data set, is given by
The weighting factors α, β, γ can be tuned to balance the value for the coverage of the query terms, the value for the coverage of the data schema of the data set, and the value for the coverage of key data of the data set.
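The role of the weighting factors can be sketched as follows. This is an illustrative sketch only: it assumes that α scales the query-term value, β scales both schema values (class and property frequency), and γ scales the key-data value, and that the query-term value is 1/|Q|. The function name `weight`, its parameters, and the `kind` labels are hypothetical.

```python
def weight(x, kind, query, frq_cls, frq_prp, degree,
           alpha=1.0, beta=1.0, gamma=1.0):
    """Weight w(x) of a coverable element x.

    kind:    one of "term", "class", "property", "entity" (assumed labels)
    query:   collection of query terms Q
    frq_cls: {class: relative frequency}
    frq_prp: {property: relative frequency}
    degree:  {entity: mean normalized out-degree and in-degree}
    """
    if kind == "term":
        return alpha / len(query)          # assumed value 1/|Q| per term
    if kind == "class":
        return beta * frq_cls.get(x, 0.0)
    if kind == "property":
        return beta * frq_prp.get(x, 0.0)
    if kind == "entity":
        return gamma * degree.get(x, 0.0)
    raise ValueError(f"unknown kind: {kind}")

# Example: emphasize query terms twice as much as the other facets
w_term = weight("city", "term", ["city", "germany"], {}, {}, {}, alpha=2.0)
w_cls = weight("ex:City", "class", ["city"], {"ex:City": 0.5}, {}, {}, beta=3.0)
```

Raising one factor relative to the others shifts the selection toward triples that cover the corresponding facet.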
The system 200 is further configured to generate 130 the data subset SD1, wherein the data subset SD1 comprises the identified data elements.
According to an example embodiment of the present invention, the data subset SD1 generated by the system 200 and comprising the identified data elements maximizes an objective function
q(SD1)=Σ w(x), summed over all x∈⋃ cov(ti) with ti∈SD1, subject to |SD1|≤k,
wherein k is a predefined number of identified data elements and cov(ti) is a set consisting of the query terms covered by a triple ti, the class instantiated in the triple ti, the property instantiated in the triple ti, and the entities that appear in the triple ti, the set cov(ti) corresponding to the triple ti.
The data subset SD1 generated according to the embodiments solves the combinatorial optimization problem, such that the total weight of the covered elements is maximized.
The system 200 comprises a storing unit 220. The storing unit 220 may further comprise a volatile memory 220a, in particular random access memory (RAM), and a non-volatile memory 220b, e.g., a flash EEPROM. The non-volatile memory 220b contains at least one computer program PRG1 for a computing unit 210 of the system 200, which controls the execution of the method according to the embodiments and/or any other operation of the system 200.
The system 200 may further comprise an interface unit 230 for receiving the data set D1 and/or the search query S from at least one external data source.
The data set D1 and the search query S can be stored in said volatile memory 220a of said storing unit 220.
For processing the step of generating 130 the data subset SD1, the system 200 is preferably configured to receive the objective function q(SD1) and the search query S.
According to a further embodiment of the present invention, the system 200 is configured to carry out the step of receiving 140 the search query and the step of conducting 150 a search.
According to a further embodiment of the present invention, the system 200 comprises suitable elements, for example a user interface, for receiving the search query and a communication interface for conducting the search (not shown in the figures).
The quality of a data subset SD1 generated according to the embodiments can be evaluated using one or more of the evaluation metrics coKyw, coSkm, and coDat, all of which provide values in the range [0, 1].
A metric coKyw evaluates the coverage of query terms. A resource r covers a query term q if r's textual form (e.g., the rdfs:label of an IRI or blank node, or the lexical form of a literal) contains a keyword match for q. A triple t covers a query term q, denoted by t≺q, if r covers q for any r∈{ts,tp,to}. For a data subset SD1, the coKyw metric evaluates its coverage of query terms:
coKyw(SD1)=(1/|Q|)·|{q∈Q: ∃t∈SD1, t≺q}|.
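The query-term coverage can be sketched in Python as the fraction of query terms covered by at least one triple of the subset. The sketch assumes that a "keyword match" is a case-insensitive substring test on the string form of each resource; a real system would typically match against labels and use proper tokenization. The names `covers` and `co_kyw` are hypothetical.

```python
def covers(triple, term):
    """True if any resource of the triple contains the term.

    Assumption: keyword match = case-insensitive substring match on the
    resource's string form.
    """
    needle = term.lower()
    return any(needle in str(r).lower() for r in triple)

def co_kyw(subset, query):
    """Fraction of query terms covered by at least one triple in subset."""
    query = list(query)
    if not query:
        return 0.0
    covered = sum(1 for q in query if any(covers(t, q) for t in subset))
    return covered / len(query)

# Hypothetical example: one of the two query terms is covered
snippet = [("ex:Berlin", "rdfs:label", '"Berlin"')]
score = co_kyw(snippet, ["berlin", "paris"])
```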
A metric coSkm evaluates the coverage of a data schema of the data set D1, wherein the data set is in the RDF format. The relative frequency of a class c observed in the data set D1 is given by
frqCls(c)=|{t∈T: tp=rdf:type and to=c}| / |{t∈T: tp=rdf:type}|.
Analogously, the relative frequency of a property p observed in the data set D1 is given by
frqPrp(p)=|{t∈T: tp=p}| / |T|.
For a data subset SD1, its coverage of the schema of the data set D1 is the harmonic mean (hm) of the total relative frequency of the classes and properties it contains:
coSkm(SD1)=hm(Σ frqCls(c) over c∈Cls(SD1), Σ frqPrp(p) over p∈Prp(SD1)),
where Cls(SD1) is the set of classes instantiated in SD1 and Prp(SD1) is the set of properties instantiated in SD1.
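A minimal Python sketch of the schema coverage follows. It assumes that class membership is expressed by rdf:type triples, that class frequency is taken relative to all rdf:type triples, that property frequency is taken relative to all triples, and that rdf:type itself counts as a property. All function names are hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"  # assumed class-membership predicate

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def frq_cls(triples):
    """Relative frequency of each class among the rdf:type triples."""
    type_objects = [o for _, p, o in triples if p == RDF_TYPE]
    counts = Counter(type_objects)
    return {c: n / len(type_objects) for c, n in counts.items()}

def frq_prp(triples):
    """Relative frequency of each property among all triples."""
    counts = Counter(p for _, p, _ in triples)
    return {p: n / len(triples) for p, n in counts.items()}

def co_skm(subset, dataset):
    """Harmonic mean of the total class and property frequency covered."""
    fc, fp = frq_cls(dataset), frq_prp(dataset)
    classes = {o for _, p, o in subset if p == RDF_TYPE}
    props = {p for _, p, _ in subset}
    return harmonic_mean(sum(fc.get(c, 0.0) for c in classes),
                         sum(fp.get(p, 0.0) for p in props))

# Hypothetical toy data set and a one-triple subset
D = [("a", RDF_TYPE, "C1"), ("b", RDF_TYPE, "C1"),
     ("c", RDF_TYPE, "C2"), ("a", "p", "b")]
value = co_skm([("a", RDF_TYPE, "C1")], D)
```

Here the subset covers class C1 (frequency 2/3) and property rdf:type (frequency 3/4), so the harmonic mean is 12/17.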
A metric coDat evaluates the coverage of key data of the data set D1. Central entities represent the key content of the data set D1. The data set D1 can be regarded as a directed graph comprising nodes, called entities, and lines connecting the nodes, called edges. An entity e has an out-degree, given by d+(e), and an in-degree, given by d−(e), in the RDF graph representation of the data set D1. The in-degree represents the number of edges incoming to an entity, and the out-degree represents the number of edges outgoing from an entity.
For a data subset SD1, its coverage of the entities in D1 is the harmonic mean (hm) of the mean normalized out-degree and in-degree of the entities it contains:
The method 100 for processing data for generating data subsets SD1 has been implemented and evaluated by reusing 387 query-data set pairs specified in Wang, X., Chen, J., Li, S., Cheng, G., Pan, J., Kharlamov, E., Qu, Y.: “A framework for evaluating snippet generation for dataset search,” in: ISWC 2019, https://doi.org/10.1007/978-3-030-30793-6_39. The data sets were collected from DataHub, and the queries included 42 real queries submitted to data.gov.uk and 345 artificial queries comprising i category names in DMOZ, referred to as DMOZ-i for i=1, 2, 3, 4. The method was tested on an Intel Core i7-8700K (3.70 GHz) with 10 GB memory for the JVM.
Algorithm 1 Greedy Algorithm
Input: A data set D1, a search query Q, and a size bound k
Output: An optimum data subset SD1⊆D1
Algorithm 1 presents the greedy algorithm for the optimization problem, which at each stage chooses a set that contains the maximum weight of uncovered elements. It achieves an approximation ratio of 1−1/e.
Assuming the marginal gain q(SD1∪{t})−q(SD1) is computed in O(1), the overall running time of a naive implementation of the algorithm is O(k·n), where n is the number of RDF triples t in D1. According to another embodiment, a priority queue can be used to hold candidate triples.
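The greedy strategy described above can be sketched in Python as follows. This is a naive illustration, not a reproduction of Algorithm 1: it assumes the coverage sets cov(t) and the weights w(x) are precomputed and passed in as dictionaries, and it recomputes every marginal gain in each round instead of using a priority queue. The name `greedy_subset` and the toy inputs are hypothetical.

```python
def greedy_subset(triples, cov, w, k):
    """Pick up to k triples, each time maximizing the weight of the
    elements that are not yet covered.

    triples: iterable of hashable triple identifiers
    cov:     {triple: set of elements it covers}
    w:       {element: weight}
    k:       size bound
    """
    selected, covered = [], set()
    candidates = list(triples)
    while candidates and len(selected) < k:
        # Marginal gain of each candidate: weight of newly covered elements
        gains = [sum(w[x] for x in cov[t] - covered) for t in candidates]
        best = max(range(len(candidates)), key=gains.__getitem__)
        if gains[best] <= 0:
            break  # no candidate adds uncovered weight
        t = candidates.pop(best)
        selected.append(t)
        covered |= cov[t]
    return selected

# Hypothetical toy instance
cov = {"t1": {"a", "b"}, "t2": {"b", "c"}, "t3": {"c"}}
w = {"a": 1, "b": 2, "c": 3}
snippet = greedy_subset(["t1", "t2", "t3"], cov, w, k=2)
total = sum(w[x] for x in set().union(*(cov[t] for t in snippet)))
```

In the toy instance, t2 is chosen first because it covers the heaviest uncovered elements, and t1 is chosen second because t3 adds nothing new once c is covered.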
Among all the 387 query-data set pairs, for 234 (60.47%) a data set snippet was generated within 1 second, and for 341 (88.11%) one was generated within 10 seconds. The median time was 0.51 seconds, showing promising performance for practical use.
The example method 100 was compared with four baseline methods, namely IlluSnip specified in Cheng, G., Jin, C., Ding, W., Xu, D., Qu, Y., “Generating illustrative snippets for open data on the web,” in: WSDM 2017, pp. 151-159 (2017); TA+C specified in Ge, W., Cheng, G., Li, H., Qu, Y., “Incorporating compactness to generate term-association view snippets for ontology search,” Inf. Process. Manage. 49(2), 513-528 (2013); PrunedDP++ specified in Li, R., Qin, L., Yu, J. X., Mao, R., “Efficient and progressive group steiner tree search,” in: SIGMOD 2016, pp. 91-106 (2016); and CES specified in Feigenblat, G., Roitman, H., Boni, O., Konopnicki, D., “Unsupervised query-focused multi-document summarization using the cross entropy method,” in: SIGIR 2017, pp. 961-964 (2017). The number k was set to 20.
According to an example embodiment of the present invention, the method 100 according to the embodiments and/or the system 200 according to the embodiments and/or the computer program PRG1 according to the embodiments are used for generating data subsets in a data set search engine. The generated data subset SD1 can help users to judge the relevance of a retrieved data set D1.
Number | Date | Country | Kind |
---|---|---|---|
19206246.1 | Oct 2019 | EP | regional |