Conventional search engines for searching electronic documents, such as web and company intranet pages, accept a search query from a user, and generate a list of search results containing one or more terms of the search query. The user typically views one or two of the results and then discards the results as needed.
For example, an employee of a company in China may wish to search the company intranet to find all human resource policies valid in China. The employee can achieve some results by querying “HR China Policy”. But there are some problems with this query. For example, the following related documents cannot be retrieved: (i) documents containing “Human Resources” instead of “HR”; and (ii) documents describing worldwide applicable policies not containing the term “China”.
It is known to associate data classification tags or keyword identifiers with an electronic document so as to represent content of the document. Such classification tags or identifiers have been shown to assist in identifying relevant documents when searching.
Furthermore, it is known to organize data classification tags in a hierarchical structure so as to represent one or more relationships between such tags. However, it is difficult to define a well organized hierarchical structure of data classification tags, especially for a general field of information. Accordingly, the definition and building of an organized tag architecture is typically restricted to experts.
Embodiments are described in more detail and by way of non-limiting examples with reference to the accompanying drawings, wherein
It should be understood that the Figures are merely schematic and are not drawn to scale. It should also be understood that the same reference numerals are used throughout the Figures to indicate the same or similar parts.
Hereinafter, a data classification tag representing the content of the document is referred to as a tag. Thus, a tag can be a keyword identifier which is associated with an electronic document so as to represent content of the document.
Referring to
Turning now to
Embodiments can combine tag information and content information for ranking search results. By using structured tags, semantic meanings and search query context can be accounted for to provide improved searching accuracy. Also, the use of free tags enables the implementation and searching of a simple and flexible tagging architecture in conjunction with a document database. Both user-defined and machine-generated tags may be catered for, thus enabling the use of flexible and accurate document data repositories and searching.
Turning now to
Specifically, the tagging module 110 comprises a structured tagging module 112 which is adapted to generate structured tags and a free tagging module 114 which is adapted to generate free tags. The structured tags generated are organized as hierarchical trees, directed graphs, or other structures so as to comprise information representing their relationship to at least one other tag. In this way, semantic meanings can be associated to the structured tags.
The structured tagging module 112 is adapted to provide the structured tags to the first data repository 120, whereas the free tagging module 114 is adapted to provide the free tags to the second data repository 130.
The structured tagging module 112 and the free tagging module 114 are each adapted to analyze an electronic document, to generate one or more tags based on the analysis, and to associate the one or more tags with the electronic document. Several methods can be used to generate such tags automatically.
Here, methods based on term frequency, part-of-speech and topic modeling are used to automatically generate free tags.
A term frequency based method extracts words that appear in a document with a high frequency and identifies the extracted words as free tags.
A part-of-speech based method extracts phrases which meet predefined part-of-speech combination rules and identifies the extracted phrases as free tags.
A topic modeling based method learns the probability distribution of words over topics from a corpus in advance, recognizes the topics discussed in a document, and returns the words with maximal probabilities on those topics as free tags.
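By way of non-limiting illustration, the term frequency based method described above might be sketched as follows. The function name, the stop-word list and the thresholds `max_tags` and `min_count` are assumptions made for this example only and are not taken from the described embodiments.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a practical system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "all"}


def term_frequency_tags(text, max_tags=3, min_count=2):
    """Return up to max_tags words that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # most_common() orders by descending frequency; rare words are dropped.
    return [w for w, c in counts.most_common(max_tags) if c >= min_count]


doc = ("The leave policy applies to all employees in China. "
       "The policy covers annual leave and sick leave.")
free_tags = term_frequency_tags(doc)
print(free_tags)
```

In this sketch only words that recur in the document become free tags, so a word appearing once (such as "China" in the sample text) would need a part-of-speech, topic modeling or user-defined route to be tagged.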
Rule-based or classification-based methods can be used to generate structured tags automatically. A rule-based method assigns a structured tag to a document according to predefined rules. A classification-based method assigns a structured tag to a document by means of document classification models which can be trained by machine learning methods, such as SVM (Support Vector Machine), ANN (Artificial Neural Network), Bayes, etc.
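As a minimal sketch of the rule-based method, each predefined rule may map a structured tag to keywords whose presence in a document triggers that tag. The rule table, the slash-separated tag paths and the function name below are hypothetical and serve only to illustrate the idea.

```python
import re

# Hypothetical rules: structured tag (a path in a tag tree) -> trigger keywords.
RULES = {
    "Policy/Human Resources": ("human resources", "hr"),
    "Region/Asia/China": ("china",),
}


def rule_based_tags(text, rules=RULES):
    """Assign every structured tag whose rule matches the document text."""
    lowered = text.lower()
    tags = []
    for tag, keywords in rules.items():
        # Word boundaries prevent "hr" from matching inside e.g. "three".
        if any(re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in keywords):
            tags.append(tag)
    return tags


structured_tags = rule_based_tags("Human Resources policy for staff in China")
print(structured_tags)
```

A classification-based method would replace the keyword test with the prediction of a trained model, while the assigned tags would remain positions in the same tag structure.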
Also, each of the structured tagging module 112 and the free tagging module 114 is adapted to generate a structured tag and free tag, respectively, in accordance with a user defined input. Specifically, a user-defined input US for the generation of a structured tag can be provided to the structured tagging module 112 via a suitable user interface (not shown). Also, a user-defined input UF for the generation of a free tag can be provided to the free tagging module 114 via another user interface (not shown). Moreover, a user is able to add, remove, edit, approve or disapprove a tag via the user-defined inputs US and UF.
It will be appreciated that the structured 112 and free 114 tagging modules are each adapted to generate user-defined tags in addition to automatically/machine generated tags. To maintain this distinction between user-defined tags and automatically/machine generated tags, these two types of tags are stored separately in each of the first 120 and second 130 data repositories.
Here, the structured tags are stored in two separate sub-repositories 122 and 124 of the first data repository 120, wherein the machine-generated structured tags are stored in a first sub-repository 122 of the first data repository 120, and wherein the user-defined structured tags are stored in a second sub-repository 124 of the first data repository 120. Similarly, the free tags are stored in two separate sub-repositories 132 and 134 of the second data repository 130, wherein the machine-generated free tags are stored in a first sub-repository 132 of the second data repository 130, and wherein the user-defined free tags are stored in a second sub-repository 134 of the second data repository 130.
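One possible in-memory arrangement of such repositories is sketched below. The class name `TagRepository`, its methods and the document identifiers are assumptions for illustration; the embodiments do not prescribe any particular storage layout.

```python
from collections import defaultdict


class TagRepository:
    """A repository holding machine-generated and user-defined tags separately."""

    def __init__(self):
        self.machine = defaultdict(set)  # document id -> machine-generated tags
        self.user = defaultdict(set)     # document id -> user-defined tags

    def add(self, doc_id, tag, source):
        if source not in ("machine", "user"):
            raise ValueError("source must be 'machine' or 'user'")
        getattr(self, source)[doc_id].add(tag)

    def tags_for(self, doc_id):
        """Return all tags of a document, regardless of who created them."""
        return self.machine.get(doc_id, set()) | self.user.get(doc_id, set())


structured_repo = TagRepository()  # plays the role of the first data repository 120
free_repo = TagRepository()        # plays the role of the second data repository 130

structured_repo.add("doc1", "Region/Asia/China", "machine")
structured_repo.add("doc1", "Policy/Human Resources", "user")
free_repo.add("doc1", "leave", "machine")
print(structured_repo.tags_for("doc1"))
```

Keeping the two origins apart allows the number of users and the number of machines that assigned a given tag to be counted independently.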
Referring to
As shown in
The ranking process uses a degree of relevance value based on attributes of the tags and documents. For example, one may define a relevance value RT(p,t) of a document p and associated tag t, wherein the value of RT(p,t) is defined by equation 1 as follows:
RT(p,t)=WN*NU(p,t)+(1−WN)*NM(p,t) (1),
where NU(p, t) is the number of users who associated document p with a tag t, NM(p, t) is the number of machines that associated document p with tag t, and WN is a factor that controls the weights of NU(p, t) and NM(p, t).
The relevance value RT(p) of a document p may then be defined as the sum of the relevance values RT(p,t) over all tags t associated with the document p, as represented by equation 2:
RT(p)=SUM(RT(p,t)) (2).
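Equations 1 and 2 can be evaluated directly, as in the following minimal sketch. The function names, the sample counts and the weight value of 0.75 are illustrative assumptions; the sum is taken here over every tag supplied for the document.

```python
def tag_relevance(n_users, n_machines, w_n):
    """Equation 1: RT(p,t) = WN*NU(p,t) + (1 - WN)*NM(p,t)."""
    return w_n * n_users + (1 - w_n) * n_machines


def document_relevance(tag_counts, w_n):
    """Equation 2: RT(p) as the sum of RT(p,t) over the tags of document p.

    tag_counts maps each tag t to the pair (NU(p,t), NM(p,t)).
    """
    return sum(tag_relevance(nu, nm, w_n) for nu, nm in tag_counts.values())


# Hypothetical counts for one document: tag -> (users, machines).
counts = {"china": (4, 1), "policy": (2, 2)}
relevance = document_relevance(counts, w_n=0.75)
print(relevance)  # 0.75*4 + 0.25*1 + 0.75*2 + 0.25*2 = 5.25
```

With WN above 0.5, as assumed here, tags assigned by users weigh more heavily than tags assigned by machines.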
Combining the results from either the tag organized navigation 140 process or the tag cloud navigation 150 process with the result of the ranking process 160, one or more of the highest ranked documents are selected in a filtering process 170 and presented to a user in output process 180.
Referring to
Firstly, a search query is received and processed in a search input process 200. The search query includes both content search information and tag search information. Consequently, two separate search processes are performed: a content search 210 and a tag search 220.
The content search 210 retrieves all documents whose contents match the input search query. The tag search 220 retrieves all documents whose tags match the input search query. For tags belonging to an organized tag architecture (i.e. structured tags), a tag expansion process 225 is executed before the tag search 220 so as to expand the tags to be searched.
Next, all retrieved documents are clustered 230 and ranked 240 according to the tag information and content information.
The tag based search result ranking process 240 combines a predetermined ranking result (such as PageRank result) with tag information. For example, one may define a rank value of R(p) of a document p according to equation 3 as follows:
R(p)=WS*RT(p)+(1−WS)*RO(p) (3),
wherein RT(p) is the relevance value between the tags associated with p and the query terms, RO(p) is a known ranking value of document p, and WS is a factor that controls the weights of RT(p) and RO(p).
The results from clustering 230 and ranking 240 processes are combined and one or more of the highest ranked documents are selected in a result filtering process 250. Finally, the selected documents are presented to the user in output process 260.
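Equation 3 and the subsequent selection of the highest ranked documents can be sketched as follows. The function names, the sample scores, the weight of 0.5 and the cut-off `top_k` are assumptions for illustration; in practice RT(p) and RO(p) would likely be brought to comparable scales before being combined, which this sketch omits.

```python
def rank_value(r_t, r_o, w_s):
    """Equation 3: R(p) = WS*RT(p) + (1 - WS)*RO(p)."""
    return w_s * r_t + (1 - w_s) * r_o


def select_top(scores, w_s, top_k):
    """Rank documents by R(p) and keep the top_k highest ranked ones.

    scores maps each document to the pair (RT(p), RO(p)).
    """
    ranked = sorted(scores, key=lambda d: rank_value(*scores[d], w_s),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical scores: document -> (tag relevance RT, known ranking value RO).
scores = {"a": (5.25, 1.0), "b": (1.0, 8.0), "c": (2.0, 2.0)}
selected = select_top(scores, w_s=0.5, top_k=2)
print(selected)
```

Raising WS shifts the ordering toward tag relevance: with the same sample scores and WS equal to 1, document "a" would be ranked first.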
Turning now to
It will be appreciated that embodiments provide advantages which can be summarized as follows:
Embodiments combine the advantages of structured tag architectures and free tag architectures.
User contributed tags can be used in conjunction with machine contributed tags. Sometimes, users may not be willing to define tags, so machine contributed tags can boost the tag results and prompt human users to add or modify existing tags.
Search results can be improved through the use of tag information/attributes. A data classification tag can be viewed as a kind of document content summarization tool or keyword identifier. Thus, ranking search results taking account of tag attributes has been shown to improve search result accuracy and quality.
It should be noted that the above-mentioned embodiments are illustrative, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. Embodiments can be implemented by means of hardware comprising several distinct elements. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN2009/073446 | 8/24/2009 | WO | 00 | 12/19/2011 |