1. Technical Field
Embodiments of the present disclosure relate to query processing, and more specifically to techniques for searching web pages.
2. Description of Related Art
People seeking information usually search the Internet using a web browser. A user typically begins a search by pointing the web browser at a website associated with a search engine. The search engine allows the user to request web pages containing information related to a particular search term or phrase.
Although search terms and phrases guide the search, it remains challenging for users to find the target web pages among the hundreds or even thousands of web pages returned.
In general, the word “module,” as used hereinafter, refers to logic embodied in hardware or firmware, or to a collection of software instructions, written in a programming language, such as, for example, Java, C, or assembly. One or more software instructions in the modules may be embedded in firmware. It will be appreciated that modules may comprise connected logic units, such as gates and flip-flops, and may comprise programmable units, such as programmable gate arrays or processors. The modules described herein may be implemented as software modules, hardware modules, or both, and may be stored in any type of non-transitory computer-readable storage medium or other computer storage device.
The client devices 2 may include, but are not limited to, smart phones, personal digital assistants (PDAs), notebooks, and desktops. Each of the client devices 2 includes a web browser which can be pointed at a website associated with a search engine to request web pages containing information related to search keywords from the web server 3.
The search system 10 includes a plurality of function modules, such as a keyword obtaining module 100, a related keyword analysis module 101, a related user analysis module 102, a displaying module 103, and a storage module 104. The function modules 100-104 may include computerized codes in the form of one or more programs, which provide at least the functions needed to execute the steps illustrated in
The storage device 20 may include some type(s) of non-transitory computer-readable storage medium, such as a hard disk drive, a compact disc, a digital video disc, or a tape drive. The storage device 20 stores the computerized codes of the function modules of the search system 10.
The control device 30 may be, for example, a processor, an application-specific integrated circuit (ASIC), or a field programmable gate array (FPGA). The control device 30 may execute the computerized codes of the function modules of the search system 10 to realize the functions of the search system 10.
In step S01, the keyword obtaining module 100 obtains a keyword (hereinafter referred to as the first keyword) from a search engine of one of the client devices 2 operated by a user, and the storage module 104 records the first keyword and information of the user into the storage device 20. The information of the user may be a username of the user, an Internet Protocol (IP) address of the client device 2 of the user, and other information. In one embodiment, when a user A opens a website associated with a search engine using a client device 2, and inputs a keyword, such as “computer,” into the search engine, the keyword obtaining module 100 obtains the keyword “computer,” and then the storage module 104 records the keyword “computer” and the user A into the storage device 20.
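The recording performed in step S01 can be sketched as follows. This is a minimal illustration only: the `SearchLog` class, its method names, and the sample IP address are hypothetical and stand in for whatever record format the storage module 104 actually uses in the storage device 20.

```python
from collections import defaultdict


class SearchLog:
    """Hypothetical in-memory stand-in for the records kept in the storage device 20."""

    def __init__(self):
        # Maps each keyword to the list of user records that searched for it.
        self._records_by_keyword = defaultdict(list)

    def record(self, keyword, user, ip_address=None):
        """Store the keyword together with the information of the user (step S01)."""
        self._records_by_keyword[keyword].append({"user": user, "ip": ip_address})

    def entries(self, keyword):
        """Return a copy of the user records stored for a keyword (empty if none)."""
        return list(self._records_by_keyword.get(keyword, []))


log = SearchLog()
# User A enters the keyword "computer"; the IP address is an illustrative value.
log.record("computer", "user A", ip_address="192.0.2.10")
```

Keying the records by keyword is one possible design choice; it makes the later lookup of users who entered the same keyword a single dictionary access.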
When the search engine returns a search result including a plurality of web pages related to the first keyword, in step S02, the related keyword analysis module 101 selects a number of first web pages from the search result. The number of the first web pages may be N, where N is a positive integer.
In step S03, the related keyword analysis module 101 identifies phrases appearing in the first web pages, and computes a weighting of each of the phrases in the first web pages. The phrases may be single words, for example, “computer,” “network,” and so on, or may be compound words, for example, “computer network,” “authorized user,” and so on. In one embodiment, the weighting of each of the phrases is computed using a weighting algorithm, such as the term frequency-inverse document frequency (tf-idf) algorithm. The tf-idf value is a numerical statistic which reflects how important a phrase is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a phrase appears in the document, but is offset by the frequency of the phrase in the corpus, which compensates for the fact that some phrases are generally more common than others. For example, when the number of phrases appearing in a single web page is 100, and the phrase “computer” appears 3 times in this single web page, the term frequency (tf) value of the phrase “computer” in the web page is 3/100, namely 0.03. When the phrase “computer” appears in 1,000 web pages, and the total number of web pages is 10,000,000, the inverse document frequency (idf) of the phrase “computer” is the base-10 logarithm log(10,000,000/1,000), namely 4. Thus, the weighting of the phrase “computer” for this web page, relative to the total web pages, is 0.03*4, namely 0.12.
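The worked tf-idf example above can be reproduced with the short sketch below. The function names are hypothetical, and the base-10 logarithm is assumed because it is the base that yields the value 4 given in the example.

```python
import math


def term_frequency(occurrences, total_phrases):
    """tf: share of the phrases in one web page taken up by the given phrase."""
    return occurrences / total_phrases


def inverse_document_frequency(total_pages, pages_with_phrase):
    """idf: base-10 logarithm of (total pages / pages containing the phrase)."""
    return math.log10(total_pages / pages_with_phrase)


# Figures taken from the example: 3 occurrences among 100 phrases,
# and 1,000 of 10,000,000 web pages containing the phrase "computer".
tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
weighting = tf * idf  # approximately 0.12, as in the example
```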
In step S04, the related keyword analysis module 101 ranks the identified phrases according to the weightings, and selects one or more of the phrases having the highest weightings. In one embodiment, the number of the selected phrases is R, where R is a positive integer.
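One way to carry out the ranking and selection of step S04 is sketched below. The function name, the sample weightings, and the alphabetical tie-breaking rule are assumptions made for illustration; the disclosure does not specify how ties are resolved.

```python
def select_top_phrases(weightings, r):
    """Return the R phrases with the highest weightings (step S04).

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    ranked = sorted(weightings.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:r]]


# Illustrative weightings, not taken from the disclosure.
weightings = {
    "computer": 0.12,
    "network": 0.08,
    "authorized user": 0.05,
    "the": 0.001,
}
selected_phrases = select_top_phrases(weightings, 2)
```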
In step S05, the related user analysis module 102 obtains related users who have previously requested web pages related to the first keyword using the search engine. For example, when the first keyword inputted into the search engine by the user is “computer,” the related user analysis module 102 obtains other users who have previously inputted “computer” into the search engine, and all such users are considered to be the related users. As mentioned above in step S01, when a user inputs a keyword into the search engine, the keyword obtaining module 100 obtains the keyword and the storage module 104 records the keyword and the user into the storage device 20. Thus, the related user analysis module 102 can obtain the related users according to the records in the storage device 20.
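The lookup of related users in step S05 might be sketched as follows, assuming the records in the storage device 20 can be read as simple (keyword, user) pairs. The function name and the sample records are hypothetical.

```python
def find_related_users(records, first_keyword, current_user):
    """Return the other users who previously entered the first keyword (step S05).

    `records` is assumed to be an iterable of (keyword, user) pairs in the
    order in which they were recorded; each related user is listed once.
    """
    related = []
    for keyword, user in records:
        if keyword == first_keyword and user != current_user and user not in related:
            related.append(user)
    return related


# Illustrative records, not taken from the disclosure.
records = [
    ("computer", "user B"),
    ("network", "user C"),
    ("computer", "user D"),
    ("computer", "user B"),
    ("computer", "user A"),
]
related_users = find_related_users(records, "computer", "user A")
```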
In step S06, the related user analysis module 102 selects one of the related users, and obtains a number of second web pages which the selected related user has previously browsed, from the search result returned according to the first keyword. The number of the second web pages may be M, where M is a positive integer. In one embodiment, when a user browses a web page by clicking a link to the web page, the web page can be marked with a tag indicating that the user has previously browsed the web page. The tag may be, for example, “user A, true,” indicating that the user A has previously browsed this web page.
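The tagging of browsed web pages and the selection of the second web pages in step S06 can be illustrated with the sketch below. The nested-dictionary tag store, the function names, and the example URLs are assumptions; the disclosure only states that a tag such as “user A, true” is associated with a browsed page.

```python
def mark_browsed(tags, page_url, user):
    """Tag a web page as browsed by a user, in the spirit of "user A, true"."""
    tags.setdefault(page_url, {})[user] = True


def second_web_pages(tags, search_result, user, m):
    """Return up to M pages of the search result that the user has browsed."""
    browsed = [url for url in search_result if tags.get(url, {}).get(user, False)]
    return browsed[:m]


# Illustrative search result for the first keyword; the URLs are placeholders.
search_result = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
tags = {}
mark_browsed(tags, "https://example.com/a", "user B")
mark_browsed(tags, "https://example.com/c", "user B")
pages_of_user_b = second_web_pages(tags, search_result, "user B", 5)
```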
In step S07, the related user analysis module 102 identifies phrases appearing in the second web pages, computes a phrase intersection between the phrases of the second web pages and the selected phrases of the first web pages, counts the phrases in the phrase intersection, and computes an evaluation value of the selected related user according to the number of the selected phrases and the number of the phrases in the phrase intersection. In one embodiment, the evaluation value V equals S divided by R (V=S/R), where S is the number of the phrases in the phrase intersection and R is the number of the selected phrases.
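The evaluation value V=S/R of step S07 can be computed as in the following sketch. The function name and the sample phrases are hypothetical, and the selected phrases are assumed to be distinct so that R equals the size of the set.

```python
def evaluation_value(selected_phrases, second_page_phrases):
    """Compute V = S / R for one related user (step S07).

    R is the number of selected phrases from the first web pages, and S is
    the number of those phrases that also appear in the second web pages.
    """
    selected = set(selected_phrases)
    if not selected:
        raise ValueError("at least one selected phrase is required (R > 0)")
    intersection = selected & set(second_page_phrases)
    return len(intersection) / len(selected)


# Illustrative phrases: R = 4 selected phrases, of which S = 2 are shared.
selected = ["computer", "network", "authorized user", "firewall"]
phrases_in_second_pages = ["computer", "firewall", "printer"]
v = evaluation_value(selected, phrases_in_second_pages)
```

Under this definition V ranges from 0 (no shared phrases) to 1 (every selected phrase also appears in the second web pages).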
In step S08, the related user analysis module 102 determines whether any of the related users has not yet been selected. If any related user has not been selected, the process goes back to step S06. Otherwise, when all the related users have been selected, the process goes to step S09.
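The loop formed by steps S06 through S08 can be sketched as below, under the assumption that the phrases of each related user's second web pages have already been identified. The function name, the sample data, and the final ordering by evaluation value are illustrative only.

```python
def evaluate_related_users(selected_phrases, phrases_by_user):
    """Compute the evaluation value V = S / R for every related user.

    `phrases_by_user` is assumed to map each related user to the phrases
    identified in that user's second web pages; `selected_phrases` is
    assumed to be non-empty.
    """
    selected = set(selected_phrases)
    pending = list(phrases_by_user)  # related users not yet selected
    values = {}
    while pending:  # step S08: repeat while any related user remains
        user = pending.pop(0)  # step S06: select one related user
        shared = selected & set(phrases_by_user[user])  # step S07
        values[user] = len(shared) / len(selected)
    return values


# Illustrative data, not taken from the disclosure.
selected = ["computer", "network", "firewall", "router"]
phrases_by_user = {
    "user D": ["computer"],
    "user B": ["computer", "network", "router"],
    "user E": ["printer"],
}
values = evaluate_related_users(selected, phrases_by_user)
# Order the related users from the highest evaluation value to the lowest.
ranked_users = sorted(values, key=values.get, reverse=True)
```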
In step S09, the displaying module 103 presents a help page which shows the search histories of the related users who have higher evaluation values. Referring to
It should be emphasized that the above-described embodiments of the present disclosure, including any particular embodiments, are merely possible examples of implementations, set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) of the disclosure without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
101133596 | Sep 2012 | TW | national |