1. Technical Field
Embodiments of the present disclosure relate to query processing, and more specifically to techniques for searching web pages.
2. Description of Related Art
People seek information from the Internet using a web browser. A person begins his/her search for information by pointing his/her web browser at a website associated with a search engine. The search engine allows a user to request web pages containing information related to a particular search word or phrase.
Although the search words and phrases may be used by the search engine to guide the search, it is challenging for users to find the target web pages among hundreds or even thousands of web pages.
In general, the word “module,” as used hereinafter, refers to logic embodied in hardware or firmware, or to a collection of software instructions, written in a programming language, such as, for example, Java, C, or assembly. One or more software instructions in the modules may be embedded in firmware. It will be appreciated that modules may comprise connected logic units, such as gates and flip-flops, and may comprise programmable units, such as programmable gate arrays or processors. The modules described herein may be implemented as software and/or hardware modules and may be stored in any type of non-transitory computer-readable storage medium or other computer storage device.
The client devices 2 may include, but are not limited to, smart phones, personal digital assistants (PDAs), notebooks, and desktops. Each of the client devices 2 includes a web browser which can be pointed at a website associated with a search engine to request web pages containing information related to search items inputted by a user. The search items may be words, phrases, or pictures. In the present embodiment, the search items are pictures.
The picture database 4 is an organized collection of embedded pictures in web pages which can be distributed by the web server 3. Each of the pictures in the picture database 4 has related information, including a web site of the web page containing the picture, and a location of the picture in the web page.
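As an illustration only, one entry of the picture database 4 might be modeled as the record below. The class name `PictureRecord`, its field names, and the use of a character offset to represent the location of the picture are assumptions made for this sketch; the disclosure does not specify a storage format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PictureRecord:
    """Hypothetical record for one picture in the picture database."""
    picture_id: str   # assumed identifier of the picture
    web_site: str     # web site of the web page containing the picture
    location: int     # assumed: character offset of the picture in the page text


def find_record(database: Dict[str, PictureRecord], picture_id: str) -> Optional[PictureRecord]:
    """Look up the related information of a picture, or None if it is unknown."""
    return database.get(picture_id)


# Example database holding a single picture.
database = {
    "pic-1": PictureRecord("pic-1", "http://example.com/page.html", 120),
}
```

Keeping the web site and the location in the same record lets later steps resolve both with a single lookup.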
The search system 10 includes a plurality of function modules, such as a receiving module 100, an analyzing module 101, a locating module 102, a computing module 103, and a retrieving module 104. The function modules 100-104 may include computerized codes in the form of one or more programs, which provide at least the functions needed to execute the steps illustrated in
The storage device 20 may include some type(s) of non-transitory computer-readable storage medium, such as a hard disk drive, a compact disc, a digital video disc, or a tape drive. The storage device 20 stores the computerized codes of the function modules of the search system 10.
The control device 30 may be a processor, an application-specific integrated circuit (ASIC), or a field programmable gate array (FPGA), for example. The control device 30 may execute the computerized codes of the function modules of the search system 10 to realize the functions of the search system 10.
In step S01, the receiving module 100 obtains a picture that a user has input into a search engine from one of the client devices 2. In one embodiment, when a user A opens a website associated with a search engine using the client device 2, and inputs a picture into the search engine, the receiving module 100 obtains the picture from the search engine.
In step S02, the analyzing module 101 analyzes basic features of the received picture, and computes similarities between the received picture and each picture in the picture database 4 according to the basic features. The basic features of the received picture include, but are not limited to, colors, an outline, and a shape of the received picture. In one embodiment, the analyzing module 101 uses a Scale Invariant Feature Transform (SIFT) method to analyze the basic features of the received picture.
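The similarity computation of step S02 can be sketched as follows. This sketch does not implement SIFT; as a simpler stand-in, it assumes that each picture has already been reduced to a numeric feature vector (for example, a color histogram) and compares two such vectors by cosine similarity. The function name `cosine_similarity` and the choice of cosine similarity are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector carries no feature information.
        return 0.0
    return dot / (norm_a * norm_b)
```

For non-negative features such as histogram counts, the result lies between 0 (no shared features) and 1 (identical direction), which makes it convenient to compare against a threshold.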
In step S03, the analyzing module 101 selects pictures from the picture database 4 according to the similarities. In one embodiment, the analyzing module 101 selects the pictures which have high similarities with the received picture from the picture database 4.
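One possible reading of "high similarities" in step S03 is sketched below: pictures whose similarity meets a threshold are kept, and at most a fixed number of the best matches are returned. The function name `select_similar`, the threshold of 0.8, and the limit of 5 are hypothetical values chosen for this sketch and are not stated in the disclosure.

```python
from typing import Dict, List


def select_similar(similarities: Dict[str, float],
                   threshold: float = 0.8,
                   top_k: int = 5) -> List[str]:
    """Return ids of the pictures with similarity >= threshold, best first."""
    candidates = [(pid, s) for pid, s in similarities.items() if s >= threshold]
    # Sort by descending similarity; break ties by id for a stable result.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [pid for pid, _ in candidates[:top_k]]
```

For instance, `select_similar({"a": 0.95, "b": 0.5, "c": 0.85})` keeps pictures "a" and "c" and discards "b".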
In step S04, the locating module 102 finds web pages which contain the selected pictures. As mentioned above, the picture database 4 stores the pictures, and also stores related information of the pictures, including web sites of the web pages containing the pictures, and a location of the picture in the web pages. Thus, the locating module 102 finds the web pages containing the selected pictures according to the web sites.
In step S05, the locating module 102 finds locations of the selected pictures in the web pages, and obtains textual content around the selected pictures in the web pages. The locating module 102 finds the locations of the selected pictures according to the related information of the selected pictures that is stored in the picture database 4.
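Obtaining the textual content around a selected picture might be sketched as below, under the assumption that the web page has already been reduced to plain text and that the stored location is a character offset into that text. The function name `text_around` and the window size are hypothetical.

```python
def text_around(page_text: str, offset: int, window: int = 200) -> str:
    """Return up to `window` characters on each side of the picture offset."""
    start = max(0, offset - window)
    end = min(len(page_text), offset + window)
    return page_text[start:end]
```

A fixed character window is only one option; an implementation could equally take the enclosing paragraph or the caption of the picture.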
In step S06, the computing module 103 computes weightings of words and phrases in the textual content in each of the web pages. The words include, for example, “computer,” “network,” and so on; the phrases include, for example, “computer network,” “authorized user,” and others. In one embodiment, the weighting of each of the words and phrases is computed using a weighting algorithm, such as a term frequency-inverse document frequency (tf-idf) algorithm. The tf-idf value is a numerical statistic which reflects how important a word or phrase is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a word or phrase appears in the document, but is offset by the frequency of the word or phrase in the corpus, which compensates for the fact that some words or phrases are used more commonly than others. For example, when the number of words and phrases appearing in a single web page is 100, and the word “computer” appears 3 times in this single web page, the term frequency (tf) value of the word “computer” in the web page is 3/100, namely 0.03. When the word “computer” appears in 1,000 web pages, and the total number of web pages is 10,000,000, the inverse document frequency (idf) of the word “computer” is log(10,000,000/1,000), namely 4, taking the logarithm to base 10. Thus, the weighting of the word “computer” is 0.03*4, namely 0.12.
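The arithmetic of the above example can be reproduced with the short sketch below. The function names are illustrative, and the base-10 logarithm is assumed because it yields the value 4 given in the example.

```python
import math


def term_frequency(term_count: int, total_terms: int) -> float:
    """tf: occurrences of the term divided by all terms in the web page."""
    return term_count / total_terms


def inverse_document_frequency(total_pages: int, pages_with_term: int) -> float:
    """idf: base-10 logarithm of total pages over pages containing the term."""
    return math.log10(total_pages / pages_with_term)


def tf_idf(term_count: int, total_terms: int,
           total_pages: int, pages_with_term: int) -> float:
    """Weighting of a term: tf multiplied by idf."""
    return (term_frequency(term_count, total_terms)
            * inverse_document_frequency(total_pages, pages_with_term))


# The word "computer": 3 of 100 terms, found in 1,000 of 10,000,000 pages.
weighting = tf_idf(3, 100, 10_000_000, 1_000)
```

Here `weighting` evaluates to approximately 0.12, matching the example. A word that appears in nearly every web page has an idf close to 0, so its weighting stays small however often it occurs in a single page.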
In step S07, the computing module 103 adjusts the weightings of the words and phrases according to the locations of the selected pictures in the web pages. In one embodiment, if a selected picture appears in the first page of a web page, the selected picture can be deemed important; thus, the computing module 103 may adjust the weightings of the words and phrases in the textual content around this selected picture by multiplying them by a coefficient of 1.1. Step S07 may be omitted in another embodiment.
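The adjustment of step S07 can be sketched as a simple scaling of the weightings. The function name `adjust_weightings` and the boolean flag `picture_is_important` are assumptions of this sketch; how importance is decided from the location is left to the embodiment. The coefficient of 1.1 is taken from the example above.

```python
from typing import Dict


def adjust_weightings(weightings: Dict[str, float],
                      picture_is_important: bool,
                      coefficient: float = 1.1) -> Dict[str, float]:
    """Scale term weightings when the nearby picture is deemed important."""
    if not picture_is_important:
        # Return a copy so callers can modify the result safely.
        return dict(weightings)
    return {term: weight * coefficient for term, weight in weightings.items()}
```

With the weighting of 0.12 computed for "computer" in the tf-idf example, an important picture raises it to approximately 0.132.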
In step S08, the retrieving module 104 selects one or more of the words and phrases according to the weightings of the words and phrases.
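One way to carry out the selection of step S08 is to rank the words and phrases by weighting and keep the highest-ranked ones, as sketched below. The function name `select_terms` and the default of three terms are hypothetical; the disclosure does not fix how many words and phrases are selected.

```python
from typing import Dict, List


def select_terms(weightings: Dict[str, float], top_n: int = 3) -> List[str]:
    """Return the top_n words and phrases with the highest weightings."""
    # Sort by descending weighting; break ties alphabetically.
    ranked = sorted(weightings.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]
```

The selected terms could then be joined with spaces to form the query submitted to the search engine.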
In step S09, the retrieving module 104 inputs the selected words and phrases into the search engine, receives a search result accordingly, and displays the search result on the client device 2.
It should be emphasized that the above-described embodiments of the present disclosure, including any particular embodiments, are merely possible examples of implementations, set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) of the disclosure without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
101143516 | Nov 2012 | TW | national |