This Non-provisional application claims priority under 35 U.S.C. §119(a) on Patent Application No(s). 104104845 filed in Taiwan, Republic of China on Feb. 12, 2015, the entire contents of which are hereby incorporated by reference.
1. Field of Invention
The present invention relates to a system and method for obtaining information and, in particular, to a system and method for obtaining generalized term information, synonym information or homonym information.
2. Related Art
In most articles, especially Chinese articles, frequently repeated terms are usually abbreviated. For example, the term “台灣鐵路局” (Taiwan Railways Administration, pinyin: tai wan tie lu ju) is abbreviated as “台鐵局” (pinyin: tai tie ju). Moreover, generic terms multiply and change with history, culture and frequency of use. For instance, since “Facebook” became popular, people in Taiwan simply call it “FB” or “臉書” (pinyin: lian shu). Newly created synonyms and abbreviations improve the efficiency and convenience of communication and enrich emotional expression. However, they pose a difficult problem for word and terminology processing and can seriously degrade the results of search engines.
For example, when a user wants to know about the term “三軍” (army, pinyin: san jun) and searches for it on Google, the search results show a lot of information related to “三軍總醫院” (Tri-Service General Hospital, pinyin: san jun zong yi yuan). Most of these results are not what the user wants, so the user may spend a lot of time picking the desired information out of the search results. Similar problems arise in many situations. In brief, generalized terms and abbreviations decrease the searching efficiency of the search engine, thereby increasing the time the user spends finding the desired answers.
In view of the foregoing, an objective of this invention is to provide a system, a method and an application for obtaining information that can improve the searching efficiency, thereby providing correct information with respect to the query term(s).
The present invention discloses a system for obtaining information, which includes a term creating unit, a term mapping unit, a database group and a user interface unit. The term creating unit links to a first server, which contains at least one first text file, and analyzes the first text file to generate at least one extracted term. The term mapping unit links to the term creating unit and a second server, which contains a plurality of second text files, and compares the extracted term with the second text files to determine whether to execute a generalized term extraction procedure, a synonym extraction procedure, or a homonym extraction procedure so as to correspondingly generate generalized term information, synonym information or homonym information. The database group links to the term creating unit and the term mapping unit, and stores the extracted term and the generated generalized term, synonym or homonym information. The user interface unit links to the database group and receives a query term. When the query term matches the extracted term, the user interface unit provides the generalized term, synonym or homonym information.
In addition, this invention also discloses a method for obtaining information, which includes the following steps of: retrieving at least a first text file from a first server; analyzing the first text file to generate at least an extracted term; accessing a second server containing a plurality of second text files; comparing the extracted term with the second text files; and when at least one of the second text files contains the extracted term, executing a generalized term extraction procedure, a synonym extraction procedure, or a homonym extraction procedure so as to correspondingly generate generalized term information, synonym information or homonym information.
In one embodiment, the method for obtaining information further includes the steps of: when receiving a query term, comparing the query term to the extracted term to determine whether the query term matches the extracted term or not; and when the query term matches the extracted term, providing the generalized term, synonym or homonym information.
In one embodiment, the first server is a news server, and the first text file is a source code file of a news webpage.
In one embodiment, the step of generating the extracted term at least includes: retrieving a text content of the first text file; and executing a segmentation process with regard to the text content of the first text file so as to generate the extracted term.
In one embodiment, the segmentation process includes a lexicon segmentation method, a statistical segmentation method or a hybrid segmentation method.
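As an illustration of the lexicon segmentation method, the following is a minimal sketch of forward maximum matching, one common lexicon-based approach. The function name, the `max_len` parameter and the toy lexicon are assumptions made for this example and are not taken from the specification, which does not prescribe a particular algorithm.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy lexicon segmentation (hypothetical helper).

    At each position, take the longest substring (up to max_len characters)
    that appears in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, assumed for illustration only.
LEXICON = {"三軍", "總醫院", "醫院"}
tokens = forward_max_match("三軍總醫院", LEXICON)
```

A statistical or hybrid method would replace or supplement the lexicon lookup with frequency information, which this sketch does not attempt.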
In one embodiment, the second server is an open edit information server, and the second text file is an editable information webpage.
In one embodiment, the method for obtaining information further includes the steps of: determining whether the extracted term contains a number in a Chinese word; and if yes, executing the generalized term extraction procedure.
In one embodiment, when the text content of one of the second text files contains the extracted term, the generalized term extraction procedure includes: searching a location of the extracted term in the second text file; determining whether at least one specific character exists behind the extracted term in the second text file; if yes, determining whether the total number of the specific characters behind the extracted term matches the number in the Chinese word; and when the total number of the specific characters matches the number in the Chinese word, extracting the terms in front of and behind the specific characters as the generalized term information.
In one embodiment, the specific character is a Chinese back sloping comma. In one embodiment, the step of determining whether the total number of the specific characters behind the extracted term matches the number in the Chinese word is to determine whether the total number of the Chinese back sloping commas between the terms equals the number in the Chinese word minus one.
In one embodiment, when the text content of one of the second text files contains the extracted term, the synonym extraction procedure includes: searching a location of the extracted term in the second text file; and extracting the first term of the paragraph containing the extracted term as the synonym information.
In one embodiment, when the text content of one of the second text files contains the extracted term, the synonym extraction procedure includes: searching a location of the extracted term in the second text file; and extracting boldfaced words in the paragraph containing the extracted term as the synonym information.
In one embodiment, when the text content of one of the second text files contains the extracted term, the synonym extraction procedure includes: extracting a term located at a specific position in the second text file as the synonym information according to an editing rule of the second text file.
In one embodiment, when more than one of the second text files contains the extracted term, the homonym extraction procedure includes: processing the contents of the multiple second text files according to a term combination rule so as to generate the homonym information.
In one embodiment, the method for obtaining information further includes a step of: modifying the generalized term, synonym or homonym information according to an agreement score; or modifying the generalized term, synonym or homonym information according to an input content.
In addition, the present invention further discloses a storage device storing an application, which is executed by a computer for performing the following steps of: retrieving at least a first text file from a first server; analyzing the first text file to generate at least an extracted term; accessing a second server containing a plurality of second text files; comparing the extracted term with the second text files; and when at least one of the second text files contains the extracted term, executing a generalized term extraction procedure, a synonym extraction procedure, or a homonym extraction procedure so as to correspondingly generate generalized term information, synonym information or homonym information.
Moreover, this invention also discloses a method for obtaining information, which includes the following steps of: receiving a query term; and when the query term contains a number in a Chinese word, providing generalized term information obtained according to a generalized term extraction procedure.
In one embodiment, the method for obtaining information further includes a step of: when the query term does not contain a number in a Chinese word, providing synonym information or homonym information obtained according to a synonym extraction procedure or a homonym extraction procedure.
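The query-side behavior described above can be sketched as a simple dispatch on whether the query term contains a Chinese numeral. This is a minimal sketch under stated assumptions: the `store` dictionary layout, the function names and the numeral set are hypothetical, and a bare character test is naive because characters such as “一” also occur in words that do not express a count.

```python
# Characters treated as "a number in a Chinese word" (assumed set).
CHINESE_NUMERALS = set("二兩三四五六七八九十百千")


def contains_chinese_numeral(term):
    """Return True if any character of the term is a Chinese numeral."""
    return any(ch in CHINESE_NUMERALS for ch in term)


def answer_query(query, store):
    """Look up a query term in a hypothetical store of the form
    {term: {"generalized": [...], "synonym": [...], "homonym": [...]}}.
    """
    record = store.get(query)
    if record is None:
        # The query term matches no extracted term.
        return {}
    if contains_chinese_numeral(query):
        return {"generalized": record.get("generalized", [])}
    return {
        "synonym": record.get("synonym", []),
        "homonym": record.get("homonym", []),
    }
```

For instance, with a store holding an entry for “三軍”, `answer_query("三軍", store)` would return only the generalized term information, while a query without a numeral would return the synonym and homonym lists.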
As mentioned above, the method for obtaining information of this invention can retrieve at least one extracted term from the first text file of a first server, compare the extracted term with the second text files of a second server, and then execute a generalized term extraction procedure, a synonym extraction procedure, or a homonym extraction procedure according to the comparison result. As a result, this invention can improve the searching efficiency, thereby providing correct information with respect to the query term(s).
The invention will become more fully understood from the detailed description and accompanying drawings, which are given for illustration only, and thus are not limitative of the present invention, and wherein:
The present invention will be apparent from the following detailed description, which proceeds with reference to the accompanying drawings, wherein the same references relate to the same elements.
As shown in
In addition, the term mapping unit 14 links to a second server 22, which contains a plurality of second text files 222. In some embodiments, the second server 22 is an open edit information server, such as the Wikipedia server. Correspondingly, these second text files 222 can be multiple editable information webpages, such as the information webpages of Wikipedia. Although the following embodiments are all based on Wikipedia, it should be understood that the second server 22 can also be another kind of server, such as the Baidu server, the Wikipedia Taiwan server, and the like.
In
Referring to
Afterwards, the step S508 is executed to determine whether at least one specific character exists behind the extracted term. In this embodiment, the specific character is, for example, “、” (a Chinese back sloping comma), “或” (or, pinyin: huo), “以及” (and, pinyin: yi ji), or “和” (and, pinyin: he). If the step S508 determines that at least one of the above-mentioned specific characters exists behind the extracted term in the second text file, the step S510 is executed to determine whether the total number of the specific characters behind the extracted term matches the number in the Chinese word. It should be noted that “matches” here does not require the total number of the specific characters (Chinese back sloping commas) to be equal to the number in the Chinese word. In general, the total number of the specific characters (the consecutive Chinese back sloping commas in the text content) is equal to the number in the Chinese word minus one. This embodiment is described in further detail below.
If the step S510 determines that the total number of the specific characters matches the number in the Chinese word, the step S512 is executed to extract all the terms in front of and behind the specific characters (the consecutive Chinese back sloping commas) as the generalized term information.
For example, when the extracted term 122 is “三軍” (army, pinyin: san jun), the term mapping unit 14 determines that the extracted term 122 contains a number in a Chinese word, “三” (three, pinyin: san). Accordingly, the term mapping unit 14 executes the generalized term extraction procedure, searching the Wikipedia server for the webpages containing and/or related to the term “三軍”. The procedure then locates the term “三軍” in the found webpage and determines whether at least one Chinese back sloping comma exists behind the term “三軍”.
In practice, the found webpage containing the term “三軍” (the matched second text file 222) includes the following description: “三軍常稱為上軍、中軍、下軍” (army generally includes a senior army, an intermediate army and a lower army; pinyin: san jun chang cheng wei shang jun zhong jun xia jun).
In this case, the number of Chinese back sloping commas “、” behind the term “三軍” is 2 (equal to 3−1). Accordingly, it is determined that the total number of Chinese back sloping commas (2) behind the extracted term matches the number in the Chinese word (3). As a result, the term mapping unit 14 extracts the terms (“上軍”, “中軍” and “下軍”) in front of and behind the Chinese back sloping commas (“、”) as the generalized term information and then stores the extracted generalized term information in the database group 16.
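The steps S508 to S512 can be sketched in code as follows. This is only an illustration under several assumptions: the Chinese sample sentence is reconstructed from the pinyin given in the description, only the back sloping comma “、” is handled as the specific character, the sentence is assumed to end at common full-width punctuation, and the first list item is trimmed to the length of the second item, which is a heuristic of this sketch rather than a rule stated in the specification.

```python
import re

# Mapping from Chinese numerals to counts (assumed, not exhaustive).
NUMERALS = {"二": 2, "兩": 2, "三": 3, "四": 4, "五": 5,
            "六": 6, "七": 7, "八": 8, "九": 9, "十": 10}
SEPARATOR = "、"  # Chinese back sloping comma


def numeral_in(term):
    """Return the count expressed by the first numeral in the term, or None."""
    for ch in term:
        if ch in NUMERALS:
            return NUMERALS[ch]
    return None


def extract_generalized(term, text):
    """Hypothetical sketch of the generalized term extraction procedure."""
    count = numeral_in(term)
    start = text.find(term)
    if count is None or start < 0:
        return []
    # Text behind the extracted term, cut at the end of the sentence.
    tail = text[start + len(term):]
    end = re.search(r"[。！？；\n]", tail)
    if end:
        tail = tail[:end.start()]
    parts = tail.split(SEPARATOR)
    # count items are separated by (count - 1) separators.
    if len(parts) != count or not all(parts):
        return []
    # Heuristic: keep only the trailing characters of the first part,
    # using the length of the second item as the expected item length.
    parts[0] = parts[0][-len(parts[1]):]
    return parts


items = extract_generalized("三軍", "三軍常稱為上軍、中軍、下軍。")
```

In a fuller implementation, the boundaries of the first and last items would more plausibly come from the segmentation process than from a length heuristic.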
Referring to
In one embodiment of the invention, the user interface unit 18 is a webpage browser such as Chrome, Firefox, Safari, IE or the like. However, in other embodiments, the system for obtaining information can be a plug-in module or software cooperating with the above-mentioned webpage browser.
Please referring to
For example, when the extracted term 122 of
Then, the term mapping unit 14 extracts the first term of the paragraph (as in the above Chinese paragraph) containing the extracted term 122 as the synonym information. In this case, the term “國立雲林科技大學” (National Yunlin University of Science and Technology; pinyin: guo li yun lin ke ji da xue) is extracted as the synonym information.
In other embodiments, the synonym extraction procedure may further include a step of: extracting a term located at a specific position in the matched second text file as the synonym information according to an editing rule of the second text file.
For example, when the extracted term 122 is “國立雲林科技大學” (National Yunlin University of Science and Technology; pinyin: guo li yun lin ke ji da xue), the term mapping unit 14 can find the matched second text file 222 as shown above and determine that the extracted term 122 is the title of the matched second text file 222. Accordingly, the term mapping unit 14 extracts the terms that follow, such as “雲科大” (NYU, pinyin: yun ke da) and “雲科” (Yun Tech, pinyin: yun ke), as the synonym information.
In addition, after examining the editing structure of Wikipedia, it is discovered that the Wikipedia uses Infobox to record a lot of structural information (as shown in
The above embodiments disclose the steps of several synonym extraction procedures. This invention can execute one or a combination of the above-mentioned embodiments to perform the synonym extraction procedure. In addition, persons skilled in the art can execute other synonym extraction procedures without departing from the spirit of the invention.
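One of the synonym extraction procedures, extracting the boldfaced words in the paragraph containing the extracted term, can be sketched as follows. The sketch assumes the second text file is available as MediaWiki wikitext, where bold text is wrapped in three apostrophes, and that paragraphs are separated by blank lines. The function name and the sample text are hypothetical and written only for this example.

```python
import re

# Bold text in wikitext is written as '''text'''.
BOLD = re.compile(r"'''(.+?)'''")


def bold_synonyms(term, wikitext):
    """Return boldfaced terms from the first paragraph containing `term`,
    excluding the term itself (hypothetical helper)."""
    for paragraph in wikitext.split("\n\n"):
        if term in paragraph:
            return [t for t in BOLD.findall(paragraph) if t != term]
    return []


# Sample wikitext, assumed for illustration only.
SAMPLE = (
    "'''國立雲林科技大學'''，簡稱'''雲科大'''、'''雲科'''，位於雲林縣。"
    "\n\n"
    "其他段落。"
)
synonyms = bold_synonyms("雲科大", SAMPLE)
```

Because the first boldfaced term of an article's lead paragraph is usually its title, the first element of the result also approximates the "first term of the paragraph" variant of the procedure.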
In addition, when the term mapping unit 14 determines that more than one second text file 222 contains the extracted term 122, a homonym extraction procedure will be executed. In this embodiment, the term mapping unit 14 processes the contents of all matched second text files according to a term combination rule so as to generate the homonym information.
On the contrary, if the paragraph of the matched second text file 222 containing the extracted term 122 also contains a restricted term, the step S1106 is executed to combine the restricted term and the extracted term 122 and add the combined term into the homonym information.
For example, when the extracted term 122 is “小甜甜” (pinyin: xiao tian tian), the term mapping unit 14 searches the Wikipedia Taiwan server and finds the webpage relating to a Japanese historical romance novel, manga, and anime series and the webpage relating to a Taiwanese performer. In the webpage containing the term “小甜甜” that relates to the Japanese historical romance novel, manga, and anime series, the paragraph containing the extracted term 122 does not include any restricted term. Accordingly, the term mapping unit 14 directly adds the term “小甜甜” into the homonym information. Alternatively, if the preset restricted terms include “漫畫” (manga; pinyin: man hua) or “卡通” (anime (cartoon); pinyin: ka tong), the term mapping unit 14 can find the corresponding restricted term in the paragraph. In this case, the term mapping unit 14 will add the term “漫畫小甜甜” (manga Candy Candy; pinyin: man hua xiao tian tian) and/or “卡通小甜甜” (anime (cartoon) Candy Candy; pinyin: ka tong xiao tian tian) into the homonym information.
Similarly, if the restricted terms include “藝人” (performer; pinyin: yi ren), the term mapping unit 14 can find this restricted term in the paragraph containing the extracted term 122 in the webpage relating to a Taiwanese performer. Accordingly, the term mapping unit 14 adds the term “藝人小甜甜” (performer; pinyin: yi ren xiao tian tian) into the homonym information. In this case, the homonym information contains the terms “小甜甜” and “藝人小甜甜”, or the terms “漫畫小甜甜” (and/or “卡通小甜甜”) and “藝人小甜甜”.
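The term combination rule illustrated by this example can be sketched as follows: for each paragraph containing the extracted term, any restricted term found in that paragraph is prefixed to the extracted term, and the bare term is used when no restricted term is found. The function name and the sample paragraphs are assumptions made for this sketch; the paragraphs are not quotations from the actual webpages.

```python
def build_homonyms(term, paragraphs, restricted_terms):
    """Combine restricted terms with the extracted term (hypothetical sketch).

    paragraphs: one paragraph per matched second text file.
    restricted_terms: preset qualifiers such as "漫畫" or "藝人".
    """
    homonyms = []
    for paragraph in paragraphs:
        if term not in paragraph:
            continue
        qualifiers = [r for r in restricted_terms if r in paragraph]
        # Prefix each qualifier to the term, or keep the bare term.
        candidates = [q + term for q in qualifiers] if qualifiers else [term]
        for candidate in candidates:
            if candidate not in homonyms:
                homonyms.append(candidate)
    return homonyms


# Sample paragraphs, assumed for illustration only.
PARAGRAPHS = ["小甜甜是日本的漫畫與卡通作品。", "小甜甜是臺灣的藝人。"]
all_qualified = build_homonyms("小甜甜", PARAGRAPHS, ["漫畫", "卡通", "藝人"])
only_performer = build_homonyms("小甜甜", PARAGRAPHS, ["藝人"])
```

With all three restricted terms preset, every homonym is qualified. With only “藝人” preset, the first paragraph contributes the bare term, matching the two outcomes described above.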
In order to improve the correctness of the search results, some embodiments of the invention may provide an agreement score mechanism that enables user interaction.
Besides, some embodiments of the invention allow the user to add new terms into the generalized term, synonym and homonym information (list).
In summary, this invention can retrieve the extracted term from the first server and compare the extracted term with the second text files of the second server so as to obtain the desired generalized term, synonym and homonym information. Accordingly, this invention can improve the searching efficiency, thereby providing correct information with respect to the query term(s).
Although the invention has been described with reference to specific embodiments, this description is not meant to be construed in a limited sense. Various modifications of the disclosed embodiments, as well as alternative embodiments, will be apparent to persons skilled in the art. It is, therefore, contemplated that the appended claims will cover all modifications that fall within the true scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
104104845 | Feb 2015 | TW | national |