COMPUTER IMPLEMENTED METHOD AND SYSTEM FOR PROCESSING DATA FOR GENERATING DATA SUBSETS

Information

  • Patent Application
  • 20210133252
  • Publication Number
    20210133252
  • Date Filed
    October 28, 2020
  • Date Published
    May 06, 2021
  • CPC
    • G06F16/9035
    • G06F16/9532
    • G06F16/9538
    • G06F16/90335
    • G06F16/9024
  • International Classifications
    • G06F16/9035
    • G06F16/9532
    • G06F16/901
    • G06F16/903
    • G06F16/9538
Abstract
A computer-implemented method for processing data for generating data subsets. The method includes: receiving at least one data set that specifies a search result responsive to a search query, the data set including a plurality of data elements and the search query including at least one query term; identifying a number of data elements in said data set, each data element characterized by a weight with regard to coverage of said query terms and/or coverage of a data schema of said data set and/or coverage of key data of said data set, wherein the data elements are identified such that the total weight of the identified data elements is maximized.
Description
CROSS REFERENCE

The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 19206246.1 filed on Oct. 30, 2019, which is expressly incorporated herein by reference in its entirety.


FIELD

The present invention relates to a computer implemented method and a system for processing data for generating data subsets from data sets wherein a data set comprises a plurality of data elements.


BACKGROUND INFORMATION

To find a particular data set out of a plurality of data sets, recent efforts yielded data set search engines. Such search engines are for example configured to retrieve data sets that are relevant to a keyword query by matching the query with the description in the metadata of each data set.


Other data set search engines may present data set summaries, which are mainly composed of some metadata about the data set, such as provenance and license. Their utility in relevance judgment is limited, with users having to analyze each data set in the search results to assess its relevance, which would be a time-consuming process.


An object of the present invention is to provide a method and system for computing optimal data subsets, wherein a data subset is a representative subset of a data set, also known as a data snippet.


A data subset aims to explain concisely to the user why the represented data set fulfils their demand. In particular, it can illustrate the main content of the data set and explain its relevance to the user's query.


SUMMARY

The object may be achieved by the device and methods according to the example embodiments of the present invention.


In accordance with an example embodiment of the present invention, a computer-implemented method is provided for processing data for generating data subsets. The method includes the following steps:


receiving at least one data set that specifies a search result responsive to a search query, the data set including a plurality of data elements and the search query including at least one query term;


identifying a number of data elements in said data set, each data element characterized by a weight with regard to coverage of said query terms and/or coverage of a data schema of said data set and/or coverage of key data of said data set, wherein the data elements are identified such that the total weight of the identified data elements is maximized.


According to an example embodiment, the method further comprises the step of generating a data subset comprising the identified data elements.


According to an embodiment, a data element of said at least one data set is a Resource Description Framework (RDF) triple comprising a subject, a predicate, and an object. More particularly, the data set is a set of RDF triples denoted by $T = \{t_1, t_2, \ldots, t_n\}$, where each $t_i = \langle t_i^s, t_i^p, t_i^o \rangle$ is a subject-predicate-object triple of RDF resources. The subject $t_i^s$ of a triple $t_i$ is an entity (i.e., a non-literal resource at the instance level) that appears in the data set. The predicate $t_i^p$ represents a property. The object $t_i^o$ is a value of $t_i^p$, which can be a class, a literal, or another entity in the data set.
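
As a rough illustration of this data model, the following Python sketch represents triples as named tuples and collects the entities that appear in a data set. The `Triple` class, the `ex:` identifiers, and the convention that literals are written in double quotes are assumptions made for this example only; they are not part of the described method.

```python
from typing import NamedTuple

RDF_TYPE = "rdf:type"


class Triple(NamedTuple):
    s: str  # subject: an entity
    p: str  # predicate: a property
    o: str  # object: a class, a literal, or another entity


def is_literal(term: str) -> bool:
    # Assumption of this sketch: literals are written in double quotes.
    return term.startswith('"')


def entities(triples):
    """Collect subjects plus objects that are neither classes nor literals."""
    found = set()
    for t in triples:
        found.add(t.s)
        if t.p != RDF_TYPE and not is_literal(t.o):
            found.add(t.o)
    return found


T = [
    Triple("ex:alice", RDF_TYPE, "ex:Person"),
    Triple("ex:acme", RDF_TYPE, "ex:Company"),
    Triple("ex:alice", "ex:worksFor", "ex:acme"),
    Triple("ex:alice", "ex:name", '"Alice"'),
]
print(sorted(entities(T)))
```

Here the class `ex:Person` and the literal `"Alice"` are excluded, so only the two instance-level resources remain.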


According to an embodiment of the present invention, the weight of a data element comprises a value for the coverage of the query terms and/or a value for the coverage of the data schema of the data set and/or a value for the coverage of key data of the data set.


According to an embodiment of the present invention, the value for the coverage of a query term is evaluated by

$$\frac{1}{|Q|},$$

if said query term is instantiated in said data element, wherein Q represents the search query including the query term.


According to an embodiment of the present invention, the value for the coverage of the data schema for a data element is evaluated either by a relative frequency of a class observed in the data set if said class is instantiated in said data element or by a relative frequency of a property observed in the data set if said property is instantiated in said data element. More particularly, the relative frequency of a class c observed in the data set is given by

$$\mathrm{frqCls}(c) = \frac{\left|\{t \in T : t^p = \texttt{rdf:type} \text{ and } t^o = c\}\right|}{\left|\{t \in T : t^p = \texttt{rdf:type}\}\right|},$$

where T represents the set of triples in the data set. Analogously, the relative frequency of a property p observed in the data set is given by

$$\mathrm{frqPrp}(p) = \frac{\left|\{t \in T : t^p = p\}\right|}{|T|}.$$
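
The two relative frequencies can be computed in a single pass over the triples. The following is a minimal sketch, assuming triples are plain `(s, p, o)` tuples of strings; the function name `schema_frequencies` and the `ex:` identifiers are hypothetical and chosen only for illustration.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def schema_frequencies(triples):
    """Return (frqCls, frqPrp) as dictionaries for a list of (s, p, o) tuples."""
    type_triples = [t for t in triples if t[1] == RDF_TYPE]
    cls_counts = Counter(o for _, _, o in type_triples)
    prp_counts = Counter(p for _, p, _ in triples)
    # Classes are normalized by the number of rdf:type triples,
    # properties by the total number of triples.
    frq_cls = {c: n / len(type_triples) for c, n in cls_counts.items()}
    frq_prp = {p: n / len(triples) for p, n in prp_counts.items()}
    return frq_cls, frq_prp


T = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:acme", RDF_TYPE, "ex:Company"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
]
frq_cls, frq_prp = schema_frequencies(T)
print(frq_cls)
print(frq_prp)
```

In this toy data set, two of the three `rdf:type` triples instantiate `ex:Person`, and three of the four triples use the property `rdf:type`.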





According to an embodiment of the present invention, the value for the coverage of key data is evaluated by a mean normalized out-degree and in-degree of an entity of said data set if said entity is instantiated in said data element. Central entities represent the key content of the data set. The data set can be regarded as a directed graph whose nodes are called entities and whose connecting lines are called edges. An entity e has an out-degree, given by $d^{+}(e)$, and an in-degree, given by $d^{-}(e)$, in the RDF graph representation of the data set. The in-degree represents the number of edges incoming to an entity, and the out-degree represents the number of edges outgoing from an entity.


The mean normalized out-degree and in-degree is given by

$$\frac{\log\left(d^{+}(x)+1\right)}{\sum_{e \in \mathrm{Ent}(T)} \log\left(d^{+}(e)+1\right)} + \frac{\log\left(d^{-}(x)+1\right)}{\sum_{e \in \mathrm{Ent}(T)} \log\left(d^{-}(e)+1\right)},$$

where Ent(T) is the set of entities that appear in T.
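
A possible way to compute this entity score is sketched below. It assumes that every triple contributes one outgoing edge to its subject and one incoming edge to its object, and that the entity set is supplied by the caller; both are simplifications made for this example, and the name `entity_scores` is hypothetical.

```python
import math
from collections import Counter


def entity_scores(triples, entities):
    """Sum of the normalized log out-degree and log in-degree per entity."""
    out_deg, in_deg = Counter(), Counter()
    for s, _, o in triples:
        # Assumption: each triple is one edge leaving its subject and
        # entering its object; only degrees of entities are read below.
        out_deg[s] += 1
        in_deg[o] += 1
    sum_out = sum(math.log(out_deg[e] + 1) for e in entities)
    sum_in = sum(math.log(in_deg[e] + 1) for e in entities)

    def part(deg, total):
        # Guard against a graph without any outgoing or incoming edges.
        return math.log(deg + 1) / total if total > 0 else 0.0

    return {e: part(out_deg[e], sum_out) + part(in_deg[e], sum_in) for e in entities}


T = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
]
ENTS = {"ex:alice", "ex:bob", "ex:acme"}
scores = entity_scores(T, ENTS)
print(scores)
```

In this example `ex:acme` receives all incoming edges among the entities and therefore obtains the highest score, while `ex:alice` and `ex:bob` share the out-degree part equally.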


According to an example embodiment of the present invention, in said weight of a data element the value for the coverage of the query terms and/or the value for the coverage of the data schema of the data set and/or the value for the coverage of key data of the data set are weighted by multiplication with a weighting factor.


According to an example embodiment of the present invention, the data subset comprising the identified data elements maximizes an objective function






q(SD1)=Σw(x),


wherein SD1 represents said data subset and w represents the weight for x being a query term or a class in the data schema of the data set or a property in the data schema of the data set or an entity in the data set.


More particularly, the data subset maximizes the objective function

$$q(\mathrm{SD1}) = \sum_{x \in \bigcup_{t_i \in S} \mathrm{cov}(t_i)} w(x),$$

with x being an element of $\mathrm{cov}(t_i)$, wherein $\mathrm{cov}(t_i)$ represents a set consisting of the query terms covered by $t_i$, the class instantiated in $t_i$, the property instantiated in $t_i$, and the entities that appear in $t_i$.
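
The objective sums each covered element once, even if several selected triples cover it. The short sketch below illustrates this; the triple labels `t1` and `t2`, the element names, and the weights are invented for the example.

```python
def objective(subset, cov, w):
    """Total weight of the union of the elements covered by the subset."""
    covered = set()
    for t in subset:
        covered |= cov[t]
    return sum(w[x] for x in covered)


# Hypothetical coverage sets and weights for two triples.
cov = {
    "t1": {"kw:city", "prp:ex:name", "ent:ex:berlin"},
    "t2": {"prp:ex:name", "ent:ex:paris"},
}
w = {"kw:city": 0.5, "prp:ex:name": 0.25, "ent:ex:berlin": 0.125, "ent:ex:paris": 0.125}

print(objective(["t1"], cov, w))
print(objective(["t2"], cov, w))
print(objective(["t1", "t2"], cov, w))
```

The value for both triples together is smaller than the sum of the two individual values, because the shared property `prp:ex:name` is counted only once.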


Preferably, the generation of data subsets can be formulated as a combinatorial optimization problem, aiming to find a data subset such that it contains the query terms and an instantiation of the most frequently used classes and properties in the data set and contains entities having the highest scores in the data set, wherein the optimization problem is solved by a data subset, which maximizes the objective function.


According to an example embodiment of the present invention, the step of identifying data elements comprises identifying a limited number of data elements.


According to an example embodiment of the present invention, said method further comprises prior to receiving the data set that specifies the search result responsive to the search query the step of receiving the search query and the step of conducting a search.


The present invention also concerns a data subset comprising the identified data elements, which maximizes the objective function q(SD1)=Σw(x).


The present invention also concerns a system for processing data for generating data subsets, wherein the system is configured to carry out the method according to any of the embodiments.


The present invention also concerns a computer program, wherein the computer program comprises computer readable instructions that when executed by a computer cause the computer to execute a method according to the embodiments.


The present invention also concerns the use of a method according to the embodiments and/or a system according to the embodiments and/or a computer program according to the embodiments for generating data subsets in a data set search engine.


Further advantageous embodiments are derived from the description below and the figures.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 depicts a schematic view of a computer-implemented method according to an example embodiment of the present invention.



FIG. 2 depicts a schematic view of a block diagram of a system according to an example embodiment of the present invention.



FIG. 3 depicts a schematic view of a block diagram of a system according to another example embodiment of the present invention.



FIG. 4 depicts average evaluation scores of data subset processed with different methods.



FIG. 5 depicts average evaluation scores of data subset processed with a method according to the example embodiments of the present invention.





DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS


FIG. 1 depicts a schematic view of a computer-implemented method 100 for processing data for generating data subsets. According to an example embodiment of the present invention, the method comprises the steps of:


receiving 110 at least one data set that specifies a search result responsive to a search query, the data set including a plurality of data elements and the search query including at least one query term and


identifying 120 a number of data elements in said data set, each data element characterized by a weight with regard to coverage of said query terms and/or coverage of a data schema of said data set and/or coverage of key data of said data set, wherein the data elements are identified such that the total weight of the identified data elements is maximized.


According to the example embodiment of the present invention, the method 100 further comprises the step of generating 130 a data subset comprising the identified data elements.


According to an example embodiment of the present invention, the method 100 further comprises, prior to receiving 110 the data set that specifies the search result responsive to the search query, the step of receiving 140 the search query and the step of conducting 150 a search.



FIG. 2 depicts a schematic view of a block diagram of a system 200 for processing data for generating data subsets.


The system 200 is configured to carry out at least the steps 110, 120 and 130 of the method 100.


According to an example embodiment of the present invention, the system 200 is further configured to carry out the steps 140 and 150 of the method 100.


The system 200 receives 110 at least one data set D1 that specifies a search result responsive to a search query. The data set D1 includes a plurality of data elements and the search query includes at least one query term.


According to an example embodiment of the present invention, the system 200 receives a plurality of data sets D1, wherein each data set D1 specifies a search result responsive to a search query.


According to an example embodiment of the present invention, a search query is a set of query terms denoted by $Q = \{q_1, q_2, \ldots, q_m\}$.


According to an example embodiment of the present invention, the data elements of the data set D1 are Resource Description Framework (RDF) triples comprising subjects, predicates, and objects. The data set D1 is a set of RDF triples denoted by $T = \{t_1, t_2, \ldots, t_n\}$, where each $t_i = \langle t_i^s, t_i^p, t_i^o \rangle$ is a subject-predicate-object triple of RDF resources. The subject $t_i^s$ of a triple $t_i$ is an entity (i.e., a non-literal resource at the instance level) that appears in the data set D1. The predicate $t_i^p$ represents a property. The object $t_i^o$ is a value of $t_i^p$, which can be a class, a literal, or another entity in the data set D1.


The system 200 identifies 120 a number of data elements in said data set D1. Each data element is characterized by a weight with regard to coverage of said query terms and/or coverage of a data schema of said data set D1 and/or coverage of key data of said data set D1. According to the embodiment, the data elements are identified such that the total weight of the identified data elements is maximized.


According to an example embodiment of the present invention, the weight of a data element comprises a value for the coverage of the query terms and/or a value for the coverage of the data schema of the data set D1 and/or a value for the coverage of key data of the data set D1.


According to an example embodiment of the present invention, the value for the coverage of a query term is evaluated by

$$\frac{1}{|Q|},$$

if said query term is instantiated in said data element, wherein Q represents the search query including the query term.


According to an example embodiment of the present invention, the value for the coverage of the data schema for a data element is evaluated either by a relative frequency frqCls of a class observed in the data set D1 if said class is instantiated in said data element or by a relative frequency frqPrp of a property observed in the data set D1 if said property is instantiated in said data element.


The relative frequency of a class c observed in the data set D1 is given by

$$\mathrm{frqCls}(c) = \frac{\left|\{t \in T : t^p = \texttt{rdf:type} \text{ and } t^o = c\}\right|}{\left|\{t \in T : t^p = \texttt{rdf:type}\}\right|},$$

where T represents the set of triples in the data set D1. Analogously, the relative frequency of a property p observed in the data set D1 is given by

$$\mathrm{frqPrp}(p) = \frac{\left|\{t \in T : t^p = p\}\right|}{|T|}.$$





According to an example embodiment of the present invention, the value for the coverage of key data is evaluated by a mean normalized out-degree and in-degree of an entity of said data set D1 if said entity is instantiated in said data element.


The data set can be regarded as a directed graph whose nodes are called entities and whose connecting lines are called edges. An entity e has an out-degree, given by $d^{+}(e)$, and an in-degree, given by $d^{-}(e)$, in the RDF graph representation of the data set. The in-degree represents the number of edges incoming to an entity, and the out-degree represents the number of edges outgoing from an entity.


The mean normalized out-degree and in-degree is given by

$$\frac{\log\left(d^{+}(x)+1\right)}{\sum_{e \in \mathrm{Ent}(T)} \log\left(d^{+}(e)+1\right)} + \frac{\log\left(d^{-}(x)+1\right)}{\sum_{e \in \mathrm{Ent}(T)} \log\left(d^{-}(e)+1\right)},$$

where Ent(T) is the set of entities that appear in T.


According to an example embodiment of the present invention, in said weight of a data element the value for the coverage of the query terms and/or the value for the coverage of the data schema of the data set and/or the value for the coverage of key data of the data set are each weighted by multiplication with a respective weighting factor α, β, γ.


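According to an example embodiment of the present invention, the weight of x, being a query term or a class in the data schema of the data set or a property in the data schema of the data set or an entity in the data set, is given by

$$w(x) = \begin{cases} \alpha \cdot \dfrac{1}{|Q|} & x \in Q, \\[2ex] \beta \cdot \mathrm{frqCls}(x) & x \in \mathrm{Cls}(\mathrm{D1}), \\[1ex] \beta \cdot \mathrm{frqPrp}(x) & x \in \mathrm{Prp}(\mathrm{D1}), \\[1ex] \gamma \cdot \left( \dfrac{\log\left(d^{+}(x)+1\right)}{\sum_{e \in \mathrm{Ent}(T)} \log\left(d^{+}(e)+1\right)} + \dfrac{\log\left(d^{-}(x)+1\right)}{\sum_{e \in \mathrm{Ent}(T)} \log\left(d^{-}(e)+1\right)} \right) & x \in \mathrm{Ent}(\mathrm{D1}). \end{cases}$$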











The weighting factors α,β,γ can be tuned, to balance between the value for the coverage of the query terms and the value for the coverage of the data schema of the data set and the value for the coverage of key data of the data set.
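
To make the role of the weighting factors concrete, the sketch below assembles a weight table from precomputed coverage values. The function name `element_weights`, the tagging of elements as `("kw", ...)`, `("cls", ...)`, `("prp", ...)` and `("ent", ...)`, and the numeric values are assumptions of this example; the description does not prescribe particular values for α, β, γ.

```python
def element_weights(query, frq_cls, frq_prp, ent_score,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """Build w(x) for query terms, classes, properties, and entities."""
    w = {}
    for q in query:
        w[("kw", q)] = alpha / len(query)
    for c, f in frq_cls.items():
        w[("cls", c)] = beta * f
    for p, f in frq_prp.items():
        w[("prp", p)] = beta * f
    for e, s in ent_score.items():
        w[("ent", e)] = gamma * s
    return w


# Hypothetical inputs; alpha > beta > gamma favours query-term coverage.
w = element_weights(
    query=["berlin", "population"],
    frq_cls={"ex:City": 1.0},
    frq_prp={"rdf:type": 0.5, "ex:population": 0.5},
    ent_score={"ex:berlin": 2.0},
    alpha=2.0, beta=1.0, gamma=0.5,
)
print(w)
```

Tagging each element with its kind keeps a query term and, for instance, an identically named class apart in the weight table.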


The system 200 is further configured to generate 130 the data subset SD1, wherein the data subset SD1 comprises the identified data elements.


According to an example embodiment of the present invention, the data subset SD1 generated by the system 200 and comprising the identified data elements maximizes an objective function

$$q(\mathrm{SD1}) = \sum_{x \in \bigcup_{t_i \in S} \mathrm{cov}(t_i)} w(x), \quad \text{subject to } |S| \le k,$$

wherein k is a predefined number of identified data elements and $\mathrm{cov}(t_i)$ is the set consisting of the query terms covered by a triple $t_i$, the class instantiated in the triple $t_i$, the property instantiated in the triple $t_i$, and the entities that appear in the triple $t_i$, each set $\mathrm{cov}(t_i)$ corresponding to one triple $t_i$.
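
One way to build the coverage set of a single triple is sketched below. It assumes `(s, p, o)` string tuples, a caller-supplied entity set, and a plain case-insensitive substring test as the keyword match; the helper name `cov` mirrors the notation above, but the matching rule and the element tags are simplifications introduced for this example.

```python
RDF_TYPE = "rdf:type"


def cov(triple, query, entities):
    """Elements covered by one triple: property, class, entities, query terms."""
    s, p, o = triple
    covered = {("prp", p)}
    if p == RDF_TYPE:
        covered.add(("cls", o))
    for r in (s, o):
        if r in entities:
            covered.add(("ent", r))
    # Simplified keyword match on the identifiers themselves.
    text = " ".join(triple).lower()
    for q in query:
        if q.lower() in text:
            covered.add(("kw", q))
    return covered


result = cov(("ex:alice", RDF_TYPE, "ex:Person"), ["person", "company"], {"ex:alice"})
print(sorted(result))
```

A real implementation would more likely match query terms against labels and literal values rather than raw identifiers.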


The data subset SD1 generated according to the embodiments solves the combinatorial optimization problem, such that the total weight of the covered elements is maximized.



FIG. 3 depicts a schematic view of a block diagram of a system 200 according to another example embodiment of the present invention. The system 200 comprises a computing unit 210, e.g. a microprocessor and/or microcontroller and/or programmable logic device, in particular FPGA, and/or application-specific integrated circuit, ASIC, and/or digital signal processor, DSP, and/or a combination thereof.


The system 200 comprises a storing unit 220. The storing unit 220 may further comprise a volatile memory 220a, in particular random access memory (RAM), and a non-volatile memory 220b, e.g., a flash EEPROM. The non-volatile memory 220b contains at least one computer program PRG1 for the computing unit 210, which controls the execution of the method according to the embodiments and/or any other operation of the system 200.


The system 200 may further comprise an interface unit 230 for receiving the data set D1 and/or the search query S from at least one external data source.


The data set D1 and the search query S can be stored in said volatile memory 220a of said storing unit 220.


For processing the step of generating 130 the data subset SD1, the system 200 is preferably configured to receive the objective function q(SD1) and the search query S.


According to a further embodiment of the present invention, the system 200 is configured to carry out the step of receiving 140 the search query and the step of conducting 150 a search.


According to a further embodiment of the present invention, the system 200 comprises suitable elements, for example a user interface, for receiving the search query and a communication interface for conducting the search (not shown in the figures).


The quality of the data subset SD1 generated according to the embodiments can be evaluated using one of the following evaluation metrics: coKyw, coSkm, and coDat, each of which provides values in the range [0, 1].


A metric coKyw evaluates the coverage of query terms. A resource r covers a query term q if r's textual form, e.g., the rdfs:label of an IRI or blank node, or the lexical form of a literal, contains a keyword match for q. A triple t covers a query term q, denoted by $t < q$, if r covers q for any $r \in \{t^s, t^p, t^o\}$. For a data subset SD1, the coKyw metric evaluates its coverage of query terms:

$$\mathrm{coKyw}(\mathrm{SD1}) = \frac{1}{|Q|} \cdot \left|\{q \in Q : \exists\, t \in S,\ t < q\}\right|.$$
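
The coKyw metric can be sketched in a few lines. The example below assumes `(s, p, o)` string tuples and uses a case-insensitive substring test on the terms as a stand-in for the keyword match on textual forms; the function names `covers` and `co_kyw` are hypothetical.

```python
def covers(triple, q):
    """Simplified match: q occurs in the text of any of the three resources."""
    return any(q.lower() in r.lower() for r in triple)


def co_kyw(subset, query):
    """Fraction of query terms covered by at least one triple of the subset."""
    if not query:
        return 0.0
    hits = sum(1 for q in query if any(covers(t, q) for t in subset))
    return hits / len(query)


subset = [("ex:alice", "ex:name", '"Alice"')]
print(co_kyw(subset, ["alice", "berlin"]))
```

Only one of the two query terms is matched by the single triple, so half of the query is covered.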






A metric coSkm evaluates the coverage of a data schema of the data set D1, wherein the data set is in the RDF format. The relative frequency of a class c observed in the data set D1 is given by

$$\mathrm{frqCls}(c) = \frac{\left|\{t \in T : t^p = \texttt{rdf:type} \text{ and } t^o = c\}\right|}{\left|\{t \in T : t^p = \texttt{rdf:type}\}\right|}.$$





Analogously, the relative frequency of a property p observed in the data set D1 is given by

$$\mathrm{frqPrp}(p) = \frac{\left|\{t \in T : t^p = p\}\right|}{|T|}.$$





For a data subset SD1, its coverage of the schema of the data set D1 is the harmonic mean (hm) of the total relative frequency of the classes and properties it contains:

$$\mathrm{coSkm}(\mathrm{SD1}) = \mathrm{hm}\left( \sum_{c \in \mathrm{Cls}(\mathrm{SD1})} \mathrm{frqCls}(c),\ \sum_{p \in \mathrm{Prp}(\mathrm{SD1})} \mathrm{frqPrp}(p) \right),$$

where Cls(SD1) is the set of classes instantiated in SD1 and Prp(SD1) is the set of properties instantiated in SD1.
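
A small sketch of coSkm follows, assuming the relative frequencies of the full data set have already been computed and are passed in as dictionaries. The function names and the frequency values are made up for the example.

```python
def harmonic_mean(a, b):
    # Defined as 0 when both arguments are 0 to avoid division by zero.
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


def co_skm(subset, frq_cls, frq_prp, rdf_type="rdf:type"):
    """Harmonic mean of the covered class and property frequency totals."""
    classes = {o for _, p, o in subset if p == rdf_type}
    props = {p for _, p, _ in subset}
    cls_total = sum(frq_cls[c] for c in classes)
    prp_total = sum(frq_prp[p] for p in props)
    return harmonic_mean(cls_total, prp_total)


# Hypothetical frequencies of the full data set.
frq_cls = {"ex:Person": 0.75, "ex:Company": 0.25}
frq_prp = {"rdf:type": 0.5, "ex:worksFor": 0.25, "ex:name": 0.25}
subset = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
]
print(co_skm(subset, frq_cls, frq_prp))
```

The subset covers classes with a total frequency of 0.75 and properties with a total frequency of 0.75, so the harmonic mean is 0.75 as well.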


A metric coDat evaluates the coverage of key data of the data set D1. Central entities represent the key content of the data set D1. The data set D1 can be regarded as a directed graph whose nodes are called entities and whose connecting lines are called edges. An entity e has an out-degree, given by $d^{+}(e)$, and an in-degree, given by $d^{-}(e)$, in the RDF graph representation of the data set D1. The in-degree represents the number of edges incoming to an entity, and the out-degree represents the number of edges outgoing from an entity.


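For a data subset SD1, its coverage of the entities in D1 is the harmonic mean (hm) of the mean normalized out-degree and in-degree of the entities it contains:

$$\mathrm{coDat}(\mathrm{SD1}) = \mathrm{hm}\left( \frac{1}{|\mathrm{Ent}(\mathrm{SD1})|} \cdot \sum_{e \in \mathrm{Ent}(\mathrm{SD1})} \frac{\log\left(d^{+}(e)+1\right)}{\max_{e' \in \mathrm{Ent}(\mathrm{D1})} \log\left(d^{+}(e')+1\right)},\ \frac{1}{|\mathrm{Ent}(\mathrm{SD1})|} \cdot \sum_{e \in \mathrm{Ent}(\mathrm{SD1})} \frac{\log\left(d^{-}(e)+1\right)}{\max_{e' \in \mathrm{Ent}(\mathrm{D1})} \log\left(d^{-}(e')+1\right)} \right).$$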
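
The coDat metric can be sketched as follows, assuming the out-degrees and in-degrees of all entities of D1 are available as dictionaries. The function names and the degree values are hypothetical and serve only to illustrate the normalization by the maximum over D1.

```python
import math


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


def co_dat(subset_entities, out_deg, in_deg):
    """out_deg and in_deg map every entity of the full data set to its degree."""
    if not subset_entities:
        return 0.0

    def mean_normalized(deg):
        top = max((math.log(d + 1) for d in deg.values()), default=0.0)
        if top == 0:
            return 0.0
        total = sum(math.log(deg[e] + 1) / top for e in subset_entities)
        return total / len(subset_entities)

    return harmonic_mean(mean_normalized(out_deg), mean_normalized(in_deg))


# Hypothetical degrees in the full data set D1.
out_deg = {"ex:alice": 3, "ex:bob": 1, "ex:acme": 0}
in_deg = {"ex:alice": 0, "ex:bob": 0, "ex:acme": 3}
print(co_dat({"ex:alice", "ex:acme"}, out_deg, in_deg))
```

In this example each of the two selected entities reaches the maximum in one direction and zero in the other, so both means are 0.5 and the harmonic mean is 0.5.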





The method 100 for processing data for generating data subsets SD1 has been implemented and evaluated by reusing 387 query data set pairs specified in Wang, X., Chen, J., Li, S., Cheng, G., Pan, J., Kharlamov, E., Qu, Y.: “A framework for evaluating snippet generation for dataset search,” in: ISWC 2019, https://doi.org/10.1007/978-3-030-30793-6_39. The data sets were collected from DataHub, and the queries included 42 real queries submitted to data.gov.uk and 345 artificial queries, each comprising i category names in DMOZ and referred to as DMOZ-i for i = 1, 2, 3, 4. The method was tested on an Intel Core i7-8700K (3.70 GHz) with 10 GB of memory for the JVM.


Algorithm 1 Greedy Algorithm


Input: A data set D1, a search query Q, and a size bound k


Output: An optimum data subset SD1⊆D1

    • 1: SD1 ← Ø;
    • 2: while |SD1| < k do
    • 3:     t* ← argmax over t ∈ (D1 \ SD1) of (q(SD1 ∪ {t}) − q(SD1));
    • 4:     SD1 ← SD1 ∪ {t*};
    • 5: end while
    • 6: return SD1;
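
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the evaluated implementation: coverage sets and weights are assumed to be given as dictionaries, the marginal gain q(SD1 ∪ {t}) − q(SD1) is computed as the weight of the newly covered elements, and the labels `t1` to `t3` with their weights are invented.

```python
def greedy_snippet(triples, cov, w, k):
    """Pick up to k triples, each time taking the largest marginal gain."""
    selected, covered = [], set()
    remaining = list(triples)
    while len(selected) < k and remaining:
        # Marginal gain: weight of elements not covered so far.
        best = max(remaining, key=lambda t: sum(w[x] for x in cov[t] - covered))
        selected.append(best)
        covered |= cov[best]
        remaining.remove(best)
    return selected


# Hypothetical coverage sets and element weights.
cov = {"t1": {"a", "b"}, "t2": {"b", "c"}, "t3": {"c"}}
w = {"a": 0.5, "b": 0.3, "c": 0.4}
print(greedy_snippet(["t1", "t2", "t3"], cov, w, k=2))
```

After `t1` is chosen, element `b` no longer contributes, so `t2` and `t3` both offer only the weight of `c`; the tie is resolved in favour of the earlier candidate.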


Algorithm 1 presents the greedy algorithm for the optimization problem, which at each stage chooses a set that contains the maximum weight of uncovered elements. It achieves an approximation ratio of

$$1 - \frac{1}{e}.$$
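
Read as a guarantee on the objective, this ratio can be written as follows, where SD1* denotes an optimal subset of at most k triples and SD1 with the subscript "greedy" denotes the output of Algorithm 1 (both symbols are introduced here only for illustration):

```latex
q\left(\mathrm{SD1}_{\mathrm{greedy}}\right) \;\ge\; \left(1 - \frac{1}{e}\right) \cdot q\left(\mathrm{SD1}^{*}\right) \;\approx\; 0.632 \cdot q\left(\mathrm{SD1}^{*}\right)
```

In other words, the greedy result reaches at least roughly 63% of the best achievable total weight.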





Assuming q(SD1∪{t})−q(SD1) is computed in O(1), the overall running time of a naive implementation of the algorithm is O(k·n), where n is the number of RDF triples t in D1. According to another embodiment, a priority queue can be used to hold the candidate triples.
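
One conceivable use of a priority queue is a lazy variant of the greedy selection, sketched below with Python's `heapq`. The lazy re-evaluation strategy is an assumption of this example and is not stated in the description; it relies on the fact that the marginal gain of a triple can only shrink as more elements become covered, so a stale queue entry is an upper bound on the current gain.

```python
import heapq


def lazy_greedy(triples, cov, w, k):
    """Greedy selection that refreshes a gain only when its triple is popped."""
    selected, covered = [], set()
    # heapq is a min-heap, so gains are negated; the index breaks ties.
    heap = [(-sum(w[x] for x in cov[t]), i, t) for i, t in enumerate(triples)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        _, i, t = heapq.heappop(heap)
        gain = sum(w[x] for x in cov[t] - covered)  # refresh the stale gain
        if not heap or gain >= -heap[0][0]:
            # Still at least as good as every upper bound left in the queue.
            selected.append(t)
            covered |= cov[t]
        else:
            heapq.heappush(heap, (-gain, i, t))
    return selected


# Hypothetical coverage sets and element weights.
cov = {"t1": {"a", "b"}, "t2": {"b", "c"}, "t3": {"c"}}
w = {"a": 0.5, "b": 0.3, "c": 0.4}
print(lazy_greedy(["t1", "t2", "t3"], cov, w, k=2))
```

Compared with rescanning all candidates in every round, this variant typically recomputes only a few gains per selected triple, although its worst-case behaviour is not better than the naive loop.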


Among all 387 query data set pairs, for 234 (60.47%) a data set snippet was generated within 1 second, and for 341 (88.11%) one was generated within 10 seconds. The median time was 0.51 seconds, showing promising performance for practical use.


The example method 100 was compared with four baseline methods, namely IlluSnip specified in Cheng, G., Jin, C., Ding, W., Xu, D., Qu, Y., “Generating illustrative snippets for open data on the web,” in: WSDM 2017. pp. 151-159 (2017), TA+C specified in Ge, W., Cheng, G., Li, H., Qu, Y., “Incorporating compactness to generate term-association view snippets for ontology search,” Inf. Process. Manage. 49(2), 513-528 (2013), PrunedDP++ specified in Li, R., Qin, L., Yu, J. X., Mao, R., “Efficient and progressive group steiner tree search,” in: SIGMOD 2016. pp. 91-106 (2016), and CES specified in Feigenblat, G., Roitman, H., Boni, O., Konopnicki, D., “Unsupervised query-focused multi-document summarization using the cross entropy method,” in: SIGIR 2017. pp. 961-964 (2017). The number k was set to 20.



FIG. 4 depicts a table, which presents the average scores of the three evaluation metrics coKyw, coSkm and coDat over all the query data set pairs. Compared with the baselines, the method 100 achieved the highest overall score of 0.708. In particular, the coverage of the method 100 of data schema coSkm=0.8651 and data coDat=0.4247 were at the top. The coverage of query terms coKyw=0.8352 was close to TA+C, PrunedDP++, and CES which are query-focused methods. Therefore, the method 100 achieved a satisfying trade-off between these evaluation metrics.
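
The reported overall score is consistent with the arithmetic mean of the three metric averages, assuming that this is how the scores were aggregated (the text does not state the aggregation explicitly):

```latex
\frac{0.8352 + 0.8651 + 0.4247}{3} = \frac{2.1250}{3} \approx 0.708
```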



FIG. 5 depicts a table, which breaks down the scores of the method 100 into groups of query data set pairs. The scores on different groups were generally consistent with each other, demonstrating the robustness of the method 100.


According to an example embodiment of the present invention, the method 100 according to the embodiments and/or the system 200 according to the embodiments and/or the computer program PRG1 according to the embodiments are used for generating data subsets in a data set search engine. The generated data subset SD1 can help users to judge the relevance of a retrieved data set D1.

Claims
  • 1. A computer-implemented method for processing data for generating data subsets, comprising the following steps: receiving at least one data set that specifies a search result responsive to a search query, the data set including a plurality of data elements and the search query including at least one query term; identifying a number of data elements in the data set, each data element of the data elements being characterized by a weight with regard to coverage of the query terms and/or coverage of a data schema of the data set and/or coverage of key data of the data set, wherein the data elements are identified such that a total weight of the identified data elements is maximized; and generating a data subset including the identified data elements.
  • 2. The method according to claim 1, wherein each data element of the data elements of the at least one data set is a Resource Description Framework (RDF) triple including subjects, predicates, and objects.
  • 3. The method according to claim 1, wherein the weight of the data element includes a value for the coverage of the query terms and/or a value for the coverage of the data schema of the data set and/or a value for the coverage of key data of the data set.
  • 4. The method according to claim 3, wherein the value for the coverage of the query term is evaluated by 1/|Q| when the query term is instantiated in the data element, wherein Q represents the search query including the query term.
  • 5. The method according to claim 3, wherein the value for the coverage of the data schema for the data element is evaluated either by a relative frequency of a class observed in the data set when the class is instantiated in the data element or by a relative frequency of a property observed in the data set when the property is instantiated in the data element.
  • 6. The method according to claim 3, wherein the value for the coverage of key data is evaluated by a mean normalized out-degree and in-degree of an entity of the data set when the entity is instantiated in the data element.
  • 7. The method according to claim 3, wherein in the weight of the data element, and/or the value for the coverage of the query terms and/or the value for the coverage of the data schema of the data set and/or the value for the coverage of key data of the data set are weighted by multiplication with a weighting factor.
  • 8. The method according to claim 1, wherein the data subset including the identified data elements maximizes an objective function q(SD1)=Σw(x), wherein SD1 represents the data subset and w represents the weight for each x, x being a query term or a class in the data schema of the data set or a property in the data schema of the data set or an entity in the data set.
  • 9. The method according to claim 1, wherein the step of identifying the data elements including identifying a limited number of data elements.
  • 10. The method according to claim 1, further comprising the following steps: prior to receiving the data set that specifies the search result responsive to the search query: receiving the search query, and conducting a search.
  • 11. A system configured to process data for generating data subsets, the system configured to: receiving at least one data set that specifies a search result responsive to a search query, the data set including a plurality of data elements and the search query including at least one query term; identifying a number of data elements in the data set, each data element of the data elements being characterized by a weight with regard to coverage of the query terms and/or coverage of a data schema of the data set and/or coverage of key data of the data set, wherein the data elements are identified such that a total weight of the identified data elements is maximized; and generating a data subset including the identified data elements.
  • 12. The method as recited in claim 1, wherein the method is used to generate the data subset in a data set search engine.
  • 13. The system as recited in claim 11, wherein the system is used to generate the data subset in a data set search engine.
Priority Claims (1)
Number Date Country Kind
19206246.1 Oct 2019 EP regional