1. Technology Field
The present disclosure relates to a method for recommending semantic annotations and a system thereof.
2. Description of Related Art
Transmitting or publishing information through documents is widely adopted. A document usually includes many words and several diagrams or tables. Typically, a keyword-based approach is used when searching a document. However, searching with keywords that reflect only general concepts may not always locate specific information. Therefore, document annotation technology is a common approach for improving the searchability of documents. If specific data or information is annotated into a document, the annotations can be used for searching, data mining, or manipulating a database.
The annotations in a document have to be readable by a computer or a machine; that is, the annotations must comply with a metadata protocol. Currently, the manual approach, called tagging, is still widely applied, but it is very laborious. As a result, how to annotate a document automatically according to a metadata protocol has received extensive attention. However, for a semi-structured or unstructured document, it is hard to obtain the semantic structure thereof. Therefore, how to develop a method that precisely recommends semantic annotations has become a major subject in the industry.
The exemplary embodiments of the disclosure are directed to a method and a system for recommending semantic annotations of a document.
According to an exemplary embodiment of the disclosure, a method for recommending semantic annotations is provided. The method includes: extracting a keyword of a main document; extracting a keyword of each of a plurality of sub documents; and generating a keyword similarity of each of the sub documents, wherein the keyword similarity of each of the sub documents is generated based on a degree of similarity between the keyword of the main document and the keyword of each of the sub documents. The method also includes: obtaining a plurality of words appearing in each of the sub documents and calculating a frequency of each of the words appearing in each of the sub documents; generating a semantic capacity of each of the sub documents according to the frequency of each of the words appearing in each of the sub documents; grouping the main document and at least one of the sub documents into a semantic document set based on the semantic capacities of the sub documents and the keyword similarities of the sub documents; and annotating the main document according to the semantic document set.
According to an exemplary embodiment of the disclosure, a system for recommending semantic annotations is provided. The system comprises a processor and a memory storing a plurality of instructions. The processor is coupled to the memory, and is configured to execute the instructions to extract a keyword of a main document; extract a keyword of each of a plurality of sub documents; and generate a keyword similarity of each of the sub documents, wherein the keyword similarity of each of the sub documents is generated based on a degree of similarity between the keyword of the main document and the keyword of each of the sub documents. The processor is also configured to execute the instructions to obtain a plurality of words appearing in each of the sub documents and calculate a frequency of each of the words appearing in each of the sub documents; generate a semantic capacity of each of the sub documents according to the frequency of each of the words appearing in each of the sub documents; group the main document and at least one of the sub documents into a semantic document set based on the semantic capacities of the sub documents and the keyword similarities of the sub documents; and annotate the main document according to the semantic document set.
As described above, the method and the system of the exemplary embodiments of the disclosure can precisely annotate a document based on information extracted from a semantic document set instead of a single document.
It should be understood, however, that this Summary may not contain all of the aspects and exemplary embodiments of the present disclosure, is not meant to be limiting or restrictive in any manner, and that the present disclosure as disclosed herein is and will be understood by those of ordinary skill in the art to encompass obvious improvements and modifications thereto.
These and other exemplary embodiments, features, aspects, and advantages of the present disclosure will be described and become more apparent from the detailed description of exemplary embodiments when read in conjunction with the accompanying drawings.
The accompanying drawings are included to provide a further understanding of the present disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
Reference will now be made in detail to the present preferred exemplary embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
Referring to
The system 100 includes a processor 120 and a memory 140. In the exemplary embodiment, the processor 120 is a central processing unit (CPU), and the memory 140 is a random access memory. However, the disclosure is not limited thereto; for example, the processor 120 may be a microprocessor, and the memory 140 may be a flash memory. A plurality of instructions are stored in the memory 140, and they are implemented as, but not limited to, a concept discovery module 142, a document filter module 144, a metadata matching module 146 and a user interface module 148. The processor 120 is configured to execute the modules in the memory 140 to annotate the input documents 102. The function of each of the modules will be described in detail below.
Referring to
The input documents 102 further include a plurality of sub documents. In step S204, the document filter module 144 collects, from the sub documents, documents whose semantic meanings are related to the concept 224. Then, the document filter module 144 generates the semantic document set 226 according to the collected documents. For example, if the concept 224 is about a person, the collected documents may contain descriptions of the person. In the exemplary embodiment, the document filter module 144 will annotate the input document 102 according to the semantic document set 226 instead of a single document.
In step S206, the document filter module 144 obtains a plurality of candidate words 228 from the semantic document set 226. The candidate words 228 are more informative than the other words in the semantic document set 226 and are highly likely to be annotated into the input document 102.
In step S208, the metadata matching module 146 matches the candidate words 228 with properties of the concept 224. For example, when the concept 224 is represented as an item type “person”, the properties of the concept 224 may be name, title, or address. Each property includes a property name and a property value. The metadata matching module 146 matches the candidate words 228 with the properties to identify the property names and property values and generate the properties 230.
In step S210, the metadata matching module 146 embeds the properties 230 into the input document 102 as annotations, thereby generating the annotated documents 104.
The user interface module 148 shows the annotated documents 104 on a screen (not shown). In other embodiments, the user interface module 148 shows only the recommended properties 230 on the screen; the disclosure is not limited thereto.
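For illustration only, the following Python sketch outlines one possible arrangement of steps S202 through S210. Every function name, data structure, and placeholder rule in the sketch is an assumption introduced here for readability; it is not the claimed implementation, and each placeholder is refined by the steps discussed below.

# Illustrative pipeline for steps S202-S210; all names and rules are assumptions.

def discover_concept(input_document):
    # Step S202 (assumed): derive a concept (item type) and a keyword for the document.
    return {"item_type": "person", "keyword": "person"}

def build_semantic_document_set(input_document, sub_documents, concept):
    # Step S204: collect sub documents whose semantic meanings relate to the concept.
    return [input_document] + [d for d in sub_documents if concept["keyword"] in d.lower()]

def choose_candidate_words(semantic_document_set):
    # Step S206: pick informative words from the semantic document set.
    return sorted({w for doc in semantic_document_set for w in doc.split()})

def match_properties(candidate_words, concept):
    # Step S208: match candidate words against the properties of the concept.
    return {"name": candidate_words[0] if candidate_words else ""}

def annotate(input_document, properties):
    # Step S210: embed the matched properties into the document as annotations.
    return {"document": input_document, "annotations": properties}

def recommend_annotations(input_document, sub_documents):
    concept = discover_concept(input_document)
    document_set = build_semantic_document_set(input_document, sub_documents, concept)
    candidates = choose_candidate_words(document_set)
    properties = match_properties(candidates, concept)
    return annotate(input_document, properties)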
Referring to
Referring to
In addition, the document filter module 144 generates a keyword similarity of each of the sub documents. In detail, the keyword similarity is generated based on a degree of similarity between the keyword of the main document 402 and the keyword of each of the sub documents. For example, the document filter module 144 compares a keyword of the main document 402 with a keyword of the document 404 to generate a keyword similarity of the document 404. If the generated keyword similarity is larger than a similarity threshold, the document filter module 144 will group the document 404 into the semantic document set 226. On the other hand, if the document filter module 144 compares a keyword of the main document 402 with a keyword of the document 406 to generate a keyword similarity and determines that the keyword similarity is smaller than the similarity threshold, the document filter module 144 will not group the document 406 into the semantic document set 226.
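The disclosure does not fix a particular similarity measure, so the following sketch assumes a simple Jaccard similarity over keyword sets purely for illustration; the threshold value is likewise an assumed example.

def keyword_similarity(main_keywords, sub_keywords):
    # Assumed measure: Jaccard similarity between the two keyword sets.
    main_keywords, sub_keywords = set(main_keywords), set(sub_keywords)
    if not main_keywords or not sub_keywords:
        return 0.0
    return len(main_keywords & sub_keywords) / len(main_keywords | sub_keywords)

# A sub document is grouped into the semantic document set only when its
# keyword similarity exceeds the similarity threshold (value assumed here).
SIMILARITY_THRESHOLD = 0.2
grouped = keyword_similarity({"iverson", "basketball"},
                             {"iverson", "basketball", "nba"}) > SIMILARITY_THRESHOLD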
Moreover, the document filter module 144 also obtains a semantic capacity of each of the sub documents in the semantic document set 226. A semantic capacity is a degree indicating how noticeable a document is, and is used to filter out the documents that are not noticeable. For example, if one document is a biography of a person and another document is a web page of a social network of the same person, the semantic capacity of the former will be larger than that of the latter. If the semantic capacity of a sub document is lower than a capacity threshold, the document filter module 144 will not group the sub document into the semantic document set 226.
To generate a semantic capacity, the document filter module 144 obtains a plurality of words appearing in each of the sub documents and calculates a frequency of each of the words. Then, the document filter module 144 generates a semantic capacity of each of the sub documents according to the frequency of each of the words appearing in each of the sub documents. To be specific, the frequencies of the words appearing in one of the sub documents include a first frequency and a second frequency. The document filter module 144 generates the semantic capacity of the sub document according to a difference between the first frequency and the second frequency. If the difference is large, it means that the content of the sub document is focused on only a few words, which makes the semantic capacity of the sub document large.
Referring to
ΔRank(F(k+1)) ~ F(k+1) − F(k), k ∈ {0, H}   (1)
Wherein ΔRank(F(k+1)) is the random variable, F(k+1) and F(k) are the (k+1)th frequency and the kth frequency, respectively, and H is the ranking threshold 506. The document filter module 144 calculates the variance of the random variable and takes the variance as the semantic capacity. Accordingly, if the variance of a sub document is smaller than the capacity threshold, the document filter module 144 will not group the sub document into the semantic document set 226.
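A minimal sketch of the calculation above, assuming plain word counting for the frequencies and an illustrative default value for the ranking threshold H; the tokenization and the threshold value are assumptions.

from collections import Counter
import re

def semantic_capacity(text, ranking_threshold=10):
    # Rank the word frequencies of the sub document in descending order and
    # keep the top H+1 of them, where H is the ranking threshold 506.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    freqs = sorted(counts.values(), reverse=True)[:ranking_threshold + 1]
    # Formula (1): the random variable is the difference between adjacent
    # ranked frequencies, i.e. F(k+1) - F(k) for k in {0, H}.
    diffs = [freqs[k + 1] - freqs[k] for k in range(len(freqs) - 1)]
    if not diffs:
        return 0.0
    # The semantic capacity is taken as the variance of the random variable.
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)

A sub document whose returned variance falls below the capacity threshold would then be excluded from the semantic document set 226.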
Referring to
In step S606, the document filter module 144 obtains a first document set related to the chosen concept and a second document set not related to the chosen concept. For example, the chosen concept is “person” and the corresponding keyword is “Bob”. The document filter module 144 searches documents from the external database 324 according to the word “Bob” to generate the first document set. The document filter module 144 may choose another keyword (also referred to as a second keyword) not related to the chosen concept “person”. For example, the second keyword is “plant”. The document filter module 144 searches documents from the external database 324 according to the second keyword to generate the second document set.
In step S608, the document filter module 144 calculates invert document factors of the words in the unanalyzed document chosen in step S604 according to the first document set and the second document set. In detail, the chosen document has a plurality of words. Taking a first word among these words as an example, the document filter module 144 calculates a first invert document factor of the first word according to the first document set. Then, the document filter module 144 calculates a second invert document factor of the first word according to the second document set. To be specific, an invert document factor is a numerical statistic that reflects how important a word is to a document set.
In step S610, the document filter module 144 selects the candidate words 228. In detail, if the difference between the first invert document factor and the second invert document factor is larger than a difference threshold 620, the first word is chosen as one of the candidate words 228. For example, the process can be described by formula (2).
W(c) = IDF(c|A) − IDF(c|B) > Z   (2)
Wherein c is the first word, A is the first document set, B is the second document set, Z is the difference threshold, IDF(·) is a function for calculating invert document factors, and W(c) is the difference between the first invert document factor and the second invert document factor.
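A minimal sketch of formula (2), assuming the common logarithmic inverse-document-frequency form for IDF; the exact IDF definition and the value of the difference threshold Z are assumptions, since the disclosure only requires a statistic reflecting how important a word is to a document set.

import math

def invert_document_factor(word, document_set):
    # Assumed form: log(N / (1 + number of documents containing the word)),
    # where N is the number of documents in the document set.
    if not document_set:
        return 0.0
    containing = sum(1 for doc in document_set if word in doc.lower().split())
    return math.log(len(document_set) / (1 + containing))

def select_candidate_words(words, first_document_set, second_document_set, z):
    # Formula (2): keep a word c when W(c) = IDF(c|A) - IDF(c|B) > Z, where A is
    # the first (concept-related) document set and B is the second document set.
    return [c for c in words
            if invert_document_factor(c, first_document_set)
            - invert_document_factor(c, second_document_set) > z]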
In step S612, the document filter module 144 determines whether all the documents in the semantic document set 226 are analyzed. If not, the document filter module 144 goes back to the step S604. Otherwise, the document filter module 144 goes to the step S614. In step S614, the document filter module 144 sets all the documents in the semantic document set 226 as unanalyzed documents.
In step S616, the document filter module 144 determines whether all the concepts are analyzed. If not, the document filter module 144 goes back to the step S602. Otherwise, the process of selecting the candidate words 228 ends.
Referring to
Referring to
In step S804, the metadata matching module 146 determines whether all the property names are matched. As discussed above, not all the property names can be matched by the candidate words 228. Therefore, if a property name (also referred to as a first property name) is not matched, in step S806, the metadata matching module 146 then tries to match the first property name to the words in the semantic document set 226. For example, the metadata matching module 146 searches every word in the documents of the semantic document set 226 to match the first property name. Then, the metadata matching module 146 generates the property names 820 matching the metadata protocol 222. It should be noted that, since the property names 820 correspond to words in a document, the locations of the property names 820 are taken as the locations of the corresponding words.
In step S808, the metadata matching module 146 selects property values from the candidate words 228. Since a property name has been located, a corresponding property value can be found near the location of the property name. Taking a second property name as an example, the metadata matching module 146 selects a second candidate word among the candidate words, wherein a location of the second candidate word is closest to a location of the second property name. Then, the metadata matching module 146 recommends or assigns the second candidate word as the property value corresponding to the second property name. In another exemplary embodiment, the metadata matching module 146 obtains a third property name, wherein the location of the second property name is next to a location of the third property name. The metadata matching module 146 also obtains a fourth property name, wherein a location of the fourth property name is next to the location of the second property name. To be specific, the location of the fourth property name just succeeds the location of the second property name, and the location of the third property name just precedes the location of the second property name. The metadata matching module 146 then obtains a second candidate word located between the third property name and the fourth property name, and recommends or assigns the second candidate word as the property value corresponding to the second property name. After that, the metadata matching module 146 generates the properties 230, in which all the property names and property values are found.
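As an illustrative sketch only, word locations can be modeled as token offsets; the data layout and function names below are assumptions. The first function selects the candidate word closest to a located property name, and the second follows the alternative embodiment by restricting the choice to candidate words lying between the preceding and succeeding property names.

def closest_candidate(property_name_location, candidate_locations):
    # candidate_locations maps each candidate word to an assumed token offset;
    # pick the candidate word whose location is closest to the property name.
    return min(candidate_locations,
               key=lambda word: abs(candidate_locations[word] - property_name_location))

def candidate_between(third_name_location, fourth_name_location, candidate_locations):
    # Alternative embodiment: choose a candidate word located between the
    # preceding (third) and the succeeding (fourth) property names.
    between = {word: location for word, location in candidate_locations.items()
               if third_name_location < location < fourth_name_location}
    if not between:
        return None
    return closest_candidate((third_name_location + fourth_name_location) / 2, between)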
Referring to
In step S904, the metadata matching module 146 determines whether any concept (item type) has not been processed. If a concept is not processed, in step S906, the metadata matching module 146 selects the unprocessed concept and sets a pointer at the beginning of the document. In step S908, the metadata matching module 146 determines whether the pointer is at the end of the document.
If the pointer is not at the end of the document, in step S910, the metadata matching module 146 tries to add tags and then moves the pointer forward. In detail, for every property value, the metadata matching module 146 adds the corresponding property name as a tag. If a property value is a text node between two tags, the property name is added to that tag as an annotation. If a property value is a part of pure text or crosses several node sectors, the metadata matching module 146 creates a virtual tag in the global scope as an annotation. For example, the original text of “<p><b>Allen Ezail Iverson</b> (born Jun. 7, 1975) is an American professional <a href="/wiki/Basketball" title="Basketball">basketball</a> player” could be annotated as “<p><b itemprop="name">Allen Ezail Iverson</b> (born Jun. 7, 1975) is an American professional <span itemprop="role"><a href="/wiki/Basketball" title="Basketball">basketball</a> player</span>.</p>”. After that, the metadata matching module 146 moves the pointer forward and goes back to the step S908.
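The annotated markup above can be produced with straightforward string handling. The following sketch is illustrative only and is not the claimed tagging procedure: it assumes the property value occurs literally in the markup, either as the text node of a simple tag pair or as part of plain text, and approximates the virtual-tag case with a wrapping span element.

def add_itemprop(html, property_value, property_name):
    # Case 1 (assumed): the value is the text node between a simple pair of tags,
    # e.g. <b>Allen Ezail Iverson</b>; add itemprop to the existing opening tag.
    for tag in ("b", "i", "a", "span", "td"):
        node = "<{0}>{1}</{0}>".format(tag, property_value)
        if node in html:
            annotated = '<{0} itemprop="{2}">{1}</{0}>'.format(tag, property_value, property_name)
            return html.replace(node, annotated, 1)
    # Case 2: the value is part of pure text or crosses several node sectors;
    # wrap it in a virtual <span> tag that carries the annotation.
    return html.replace(
        property_value,
        '<span itemprop="{0}">{1}</span>'.format(property_name, property_value), 1)

For instance, calling add_itemprop on the original text above with the value "Allen Ezail Iverson" and the property name "name" yields the <b itemprop="name"> form shown in the annotated example.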
If the pointer is at the end of the document, the metadata matching module 146 goes back to the step S904. If every concept is processed, in step S912, the metadata matching module 146 saves the document as an annotated document, and generates the annotated documents 104.
After that, the user interface module 148 creates a user interface on a screen, and shows the annotated documents 104 on the screen. The user interface module 148 may also create another user interface and show only the properties 230 on that user interface. A user may confirm the properties 230 shown on the interface by clicking a confirm button, but the disclosure is not limited thereto.
It should be noted that, in the first exemplary embodiment, an example of recommending semantic annotations for web pages is described. However, the present disclosure is not limited thereto. In the second exemplary embodiment, general documents, such as Portable Document Format (PDF) files or Microsoft Word documents, may be annotated.
Hardware components of the second exemplary embodiment are substantially similar to those disclosed in the first exemplary embodiment, and the components described in the first exemplary embodiment are also used to describe the second exemplary embodiment.
Referring to
In step S1006, the document filter module 144 generates a keyword similarity of each of the sub documents, wherein the keyword similarity of each of the sub documents is generated based on a degree of similarity between the keyword of the main document and the keyword of each of the sub documents. Herein, the manner of generating a keyword similarity of a document is similar to the manner described in the first exemplary embodiment, and therefore it will not be repeated.
In step S1008, the document filter module 144 obtains a plurality of words appearing in each of the sub documents and calculates a frequency of each of the words appearing in each of the sub documents.
In step S1010, the document filter module 144 generates a semantic capacity of each of the sub documents according to the frequency of each of the words appearing in each of the sub documents. Herein, the manner of generating a semantic capacity of a document is similar to the manner described in the first exemplary embodiment, and therefore it will not be repeated.
In step S1012, the document filter module 144 groups the main document and at least one of the sub documents into a semantic document set based on the semantic capacities of the sub documents and the keyword similarities of the sub documents. Herein, the manner of grouping documents into a semantic document set is similar to the manner described in the first exemplary embodiment, and therefore it will not be repeated.
In step S1014, the metadata matching module 146 annotates the main document according to the semantic document set. Herein, the manner of annotating the main document according to the semantic document set is similar to the manner described in the first exemplary embodiment, and therefore it will not be repeated.
As described above, the method and system for recommending semantic annotations in the above exemplary embodiments annotate a document according to a semantic document set instead of a single document, and the sub documents grouped into the semantic document set are determined according to the semantic capacity of each sub document. Therefore, the document can be annotated more precisely with respect to the conceptual topics related to the semantic document set 226.
The previously described exemplary embodiments of the present disclosure have the aforementioned advantages, but the aforementioned advantages are not required in all versions of the present disclosure.
It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present disclosure without departing from the scope or spirit of the present disclosure. In view of the foregoing, it is intended that the present disclosure cover modifications and variations of this disclosure provided they fall within the scope of the following claims and their equivalents.