Field of the Invention
This invention relates to systems and methods for providing relevant search results.
Background of the Invention
The objective of a search engine is to provide the most relevant results to a user. Many algorithms have been devised to achieve this goal. In particular, some prior approaches determine a relevant result from prior responses to search results and the queries that were used to identify those search results. For example, metrics such as click-through rate may be used to determine user response to search results.
The systems and methods disclosed herein provide an improved approach for providing search results in response to a query.
In order that the advantages of the invention will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered limiting of its scope, the invention will be described and explained with additional specificity and detail through use of the accompanying drawings, in which:
It will be readily understood that the components of the present invention, as generally described and illustrated in the Figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the invention, as represented in the Figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of certain examples of presently contemplated embodiments in accordance with the invention. The presently described embodiments will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout.
Embodiments in accordance with the present invention may be embodied as an apparatus, method, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Any combination of one or more computer-usable or computer-readable media may be utilized. For example, a computer-readable medium may include one or more of a portable computer diskette, a hard disk, a random access memory (RAM) device, a read-only memory (ROM) device, an erasable programmable read-only memory (EPROM or Flash memory) device, a portable compact disc read-only memory (CDROM), an optical storage device, and a magnetic storage device. In selected embodiments, a computer-readable medium may comprise any non-transitory medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on a computer system as a stand-alone software package, on a stand-alone hardware unit, partly on a remote computer spaced some distance from the computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions or code. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a non-transitory computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Referring to
The server system 102a may host or access a database 104a to facilitate responding to queries. The database 104a may store a corpus dictionary 106. The corpus dictionary 106 may be a list of terms obtained from a reference corpus such as a dictionary including words and definitions or an encyclopedia including articles on various subjects. For example, the corpus dictionary 106 may be obtained from analysis of a reference corpus 104b that is accessible online, such as through one or more other servers 102b. The corpus dictionary 106 may be a list of terms taken from the titles of articles available in the reference corpus or otherwise obtained from the content of the reference corpus.
The database 104a may additionally store queries 108. Queries 108 may be queries received by the server system 102a or queries included in referrals from a search engine hosted by another system. For example, a third-party search engine may present a link to a web interface hosted by the server system 102a in response to a query received by the third-party search engine. Upon a user selecting the link, the third-party search engine may transmit a request for the web interface along with the query that resulted in the presentation of the link. Queries 108 may be stored indefinitely, or the database 104a may store only those queries received in the last N days, weeks, months, or years, where N is a value greater than zero.
The database 104a may further store a query dictionary 110 generated based on the queries 108. Methods by which the query dictionary 110 is generated and used are described in greater detail below.
The server system 102a may receive queries from mobile devices 112 (smart phones, tablet computers, wearable computers, etc.) and from desktop or laptop computers 114. The mobile devices 112 and computers 114 may communicate with the server system 102a by means of a network 116 such as the Internet, a local area network (LAN), a wide area network (WAN), or some other network connection. The mobile devices 112 and computers 114 may communicate with the server system 102a by means of any wired or wireless protocol.
Computing device 200 includes one or more processor(s) 202, one or more memory device(s) 204, one or more interface(s) 206, one or more mass storage device(s) 208, one or more Input/Output (I/O) device(s) 210, and a display device 230 all of which are coupled to a bus 212. Processor(s) 202 include one or more processors or controllers that execute instructions stored in memory device(s) 204 and/or mass storage device(s) 208. Processor(s) 202 may also include various types of computer-readable media, such as cache memory.
Memory device(s) 204 include various computer-readable media, such as volatile memory (e.g., random access memory (RAM) 214) and/or nonvolatile memory (e.g., read-only memory (ROM) 216). Memory device(s) 204 may also include rewritable ROM, such as Flash memory.
Mass storage device(s) 208 include various computer readable media, such as magnetic tapes, magnetic disks, optical disks, solid-state memory (e.g., Flash memory), and so forth. As shown in
I/O device(s) 210 include various devices that allow data and/or other information to be input to or retrieved from computing device 200. Example I/O device(s) 210 include cursor control devices, keyboards, keypads, microphones, monitors or other display devices, speakers, printers, network interface cards, modems, lenses, CCDs or other image capture devices, and the like.
Display device 230 includes any type of device capable of displaying information to one or more users of computing device 200. Examples of display device 230 include a monitor, display terminal, video projection device, and the like.
Interface(s) 206 include various interfaces that allow computing device 200 to interact with other systems, devices, or computing environments. Example interface(s) 206 include any number of different network interfaces 220, such as interfaces to local area networks (LANs), wide area networks (WANs), wireless networks, and the Internet. Other interface(s) include user interface 218 and peripheral device interface 222. The interface(s) 206 may also include one or more user interface elements 218. The interface(s) 206 may also include one or more peripheral interfaces such as interfaces for printers, pointing devices (mice, track pad, etc.), keyboards, and the like.
Bus 212 allows processor(s) 202, memory device(s) 204, interface(s) 206, mass storage device(s) 208, and I/O device(s) 210 to communicate with one another, as well as other devices or components coupled to bus 212. Bus 212 represents one or more of several types of bus structures, such as a system bus, PCI bus, IEEE 1394 bus, USB bus, and so forth.
For purposes of illustration, programs and other executable program components are shown herein as discrete blocks, although it is understood that such programs and components may reside at various times in different storage components of computing device 200, and are executed by processor(s) 202. Alternatively, the systems and procedures described herein can be implemented in hardware, or a combination of hardware, software, and/or firmware. For example, one or more application specific integrated circuits (ASICs) can be programmed to carry out one or more of the systems and procedures described herein.
Referring to
The method 300 may include generating 302 the corpus dictionary 106 by analyzing the reference corpus 104b. For example, the titles of some or all articles in the reference corpus 104b may be retrieved and included as entries in the corpus dictionary 106. Other criteria may also be used to select strings of one or more words from among the titles of articles of the corpus and/or the content of the articles themselves.
The method 300 may further include retrieving 304 recent queries, such as queries from the last M days, weeks, or months. For example, queries from the month preceding the time of performing step 304 may be used. The top queries from among the recent queries may be identified, and the corpus dictionary may then be augmented 308 with terms from those top queries. For example, the top P queries having the highest number of occurrences among the recent queries may be identified. Each query may include one or more strings of one or more words. Those strings of one or more words found in the top queries may be added to the corpus dictionary if not already present in the dictionary. Steps 302-308 may be performed independently of, and with a different frequency from, the following steps 310-318.
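The augmentation described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `augment_corpus_dictionary`, the lowercase normalization, and the choice to add every contiguous string of words from each top query are assumptions made for the example.

```python
from collections import Counter


def augment_corpus_dictionary(corpus_dictionary, recent_queries, top_p):
    """Return a copy of the dictionary extended with word strings from the top-P queries.

    Assumptions (not from the source): queries are lowercased before counting,
    and every contiguous string of one or more words in a top query is added.
    """
    counts = Counter(q.strip().lower() for q in recent_queries if q.strip())
    augmented = set(corpus_dictionary)
    for query, _ in counts.most_common(top_p):
        words = query.split()
        # Enumerate every contiguous run of words in the query.
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                augmented.add(" ".join(words[i:j]))  # set membership avoids duplicates
    return augmented


if __name__ == "__main__":
    recent = ["sony tv", "Sony TV", "red shoes"]
    print(sorted(augment_corpus_dictionary({"tv"}, recent, top_p=1)))
```

Because the result is a set, strings already present in the dictionary are left unchanged, which matches the "if not already present" condition.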
At step 310, start-side and end-side occurrences of sub-queries may be identified among the recent queries. In particular, a start-side sub-query of a query may be a series of one or more contiguous words from the query that includes the starting word of the query. Likewise, an end-side sub-query may be a series of one or more contiguous words from the query that includes the ending word of the query. For purposes of this disclosure, a word may be any series of characters including alphanumeric characters. Words may also be identified from the query by means of whitespace characters, commas, or other delimiting characters separating them. The method by which the query is parsed to identify words may include any method known in the art.
In some languages, text is read from left to right. Accordingly, each start-side sub-query may include the left-most word of a query (i.e., an l-gram) and each end-side sub-query may include the right-most word of the query (i.e., an r-gram). If the query is received in a language that is read from right to left, the starting and ending words may be identified oppositely.
The method 300 may further include computing 312 a start-side count and computing 314 an end-side count. Specifically, the number of occurrences of each start-side sub-query may be computed as the start-side count of that sub-query. Likewise, the number of occurrences of each end-side sub-query may be computed as the end-side count of that sub-query.
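The counting of steps 310-314 might be sketched as follows for a left-to-right language. The function name `side_counts`, the whitespace tokenization, and the lowercase normalization are assumptions for illustration. Note that under the definitions above the full query is both a start-side and an end-side sub-query of itself.

```python
from collections import Counter


def side_counts(queries):
    """Count start-side (l-gram) and end-side (r-gram) sub-queries.

    Assumes whitespace-delimited words and a left-to-right language.
    """
    start_counts, end_counts = Counter(), Counter()
    for query in queries:
        words = query.lower().split()
        for k in range(1, len(words) + 1):
            start_counts[" ".join(words[:k])] += 1   # first k words
            end_counts[" ".join(words[-k:])] += 1    # last k words
    return start_counts, end_counts


if __name__ == "__main__":
    start, end = side_counts(["24 inch sony tv", "sony tv", "samsung tv"])
    print(end["tv"], end["sony tv"], start["sony tv"], start["tv"])
```

In this small sample, "tv" occurs three times as an end-side sub-query and never as a start-side sub-query.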
The method 300 may further include determining 316 a maximum of all the start-side and end-side counts. Specifically, among the set of values including all of the start-side and end-side counts, the largest value may be identified. The largest value may be referred to hereinafter as “mcount.” For each sub-query, a score may be calculated 318. For example, the score may be calculated as:
$$\text{score} = \sqrt{\frac{\text{end\_count}}{\text{mcount}}}\left(1 - \sqrt{\frac{\text{start\_count}}{\text{mcount}}}\right)$$
where “end_count” is the end-side count of the sub-query and “start_count” is the start-side count of the sub-query. That is to say, a sub-query may occur as a start-side sub-query or an end-side sub-query. Accordingly, the score for a sub-query may be based on both the end-side and start-side counts of the sub-query. As is readily apparent, the score increases with increasing end-side count of the sub-query and decreases with increasing start-side count of the sub-query. Other functions that achieve this correspondence may also be used. For example, the square root operations may be replaced with some other function or exponent. In addition, rather than including a term of the form $1 - \sqrt{\text{start\_count}/\text{mcount}}$, this term may be replaced with a term of the form
$$1 - \left(\frac{\text{start\_count}}{\text{mcount}}\right)^{x}$$
where x is ½ or some other integer or fractional value.
The score as calculated above may advantageously distinguish sub-queries located at the beginning of a query from sub-queries located at the end of the query. In particular, words that occur at the end of a grammatically phrased query may be more likely to indicate a user's intent. For example, given the query “24 inch Sony TV,” “TV” is the operative term and all results should include it prominently. In contrast, the leading terms “24” and “inch” are more general.
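A minimal sketch of steps 316-318 follows. It assumes the score is the product of a term that grows with end_count/mcount and a term of the form 1 − (start_count/mcount)^x, with x = ½ giving the square-root variant. The function name `sub_query_scores`, the `exponent` parameter, and the use of the same exponent on both terms are illustrative choices rather than details fixed by the description.

```python
def sub_query_scores(start_counts, end_counts, exponent=0.5):
    """Score each sub-query from its start-side and end-side counts.

    Assumed form: (end/mcount)**x * (1 - (start/mcount)**x), with x = 0.5
    reproducing the square-root variant. Scores fall between 0 and 1.
    """
    all_counts = list(start_counts.values()) + list(end_counts.values())
    if not all_counts:
        return {}
    mcount = max(all_counts)  # largest of all start- and end-side counts
    scores = {}
    for sub_query in set(start_counts) | set(end_counts):
        end_ratio = end_counts.get(sub_query, 0) / mcount
        start_ratio = start_counts.get(sub_query, 0) / mcount
        scores[sub_query] = (end_ratio ** exponent) * (1 - start_ratio ** exponent)
    return scores


if __name__ == "__main__":
    start = {"24": 1, "sony tv": 1}
    end = {"tv": 4, "sony tv": 1}
    for sub_query, score in sorted(sub_query_scores(start, end).items()):
        print(sub_query, round(score, 3))
```

With these sample counts, "tv" (frequent at the end, never at the start) receives the highest score, while "24" (seen only at the start) receives zero.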
Referring to
The method 400 may include receiving 402 a query, such as from a mobile device 112 or computing device 114. The query may be parsed to identify 404 one or more n-grams. An n-gram may be any series of one or more characters. An n-gram may include one or more words or other characters separated by whitespace, commas, or some other delimiter. An “n-gram” as used herein may include an n-gram as this term is known in the art.
The method 400 may further include identifying 406 the longest matches in the corpus dictionary for one or more of the n-grams. In particular, for a particular series of characters in the query, the longest n-gram that includes that series of characters and has a matching entry in the corpus dictionary may be selected to represent that series of characters in subsequent steps of the method 400. For example, given a query including “16 GB iPhone 4s,” possible n-grams may include “16,” “16 GB,” “iPhone,” “iPhone 4s,” and “16 GB iPhone 4s,” among others. Since the 16 GB iPhone 4s exists, an entry for it may exist in the corpus dictionary. Accordingly, the n-gram “16 GB iPhone 4s” may be selected to represent the characters of this string since it is the longest n-gram for which an entry exists.
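One way to realize the longest-match selection of step 406 is a greedy left-to-right scan, sketched below. The greedy strategy, the case-insensitive comparison, and the function name `longest_dictionary_matches` are assumptions; the description only requires that the longest n-gram with a dictionary entry be chosen.

```python
def longest_dictionary_matches(query, dictionary):
    """Greedily pick, at each position, the longest word n-gram found in the dictionary.

    Assumes whitespace-delimited words and case-insensitive matching.
    """
    words = query.split()
    entries = {entry.lower() for entry in dictionary}
    matches = []
    i = 0
    while i < len(words):
        # Try the longest candidate starting at word i first, then shrink.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate.lower() in entries:
                matches.append(candidate)
                i = j  # continue after the matched span
                break
        else:
            i += 1  # no dictionary entry starts at this word; skip it
    return matches


if __name__ == "__main__":
    corpus_dictionary = {"16 GB", "iPhone", "iPhone 4s", "16 GB iPhone 4s", "case"}
    print(longest_dictionary_matches("16 GB iPhone 4s case", corpus_dictionary))
```

A greedy scan can miss a longer match that begins inside an earlier, shorter one, so an implementation that needs the best overall segmentation might use dynamic programming instead.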
The n-grams identified at step 406 may then be evaluated to identify 408 those identified n-grams that are noun phrases. In particular, the query itself and the context in which the identified n-grams occur may be evaluated to identify those n-grams that are used as nouns or noun phrases in the query. In particular, natural language processing (NLP) and part of speech (POS) identification techniques known in the art may be used to identify those n-grams that are used as nouns in the query.
The noun phrases identified at step 408 may then be ranked according to scores associated with the noun phrases. In particular, a noun phrase may have occurred as a sub-query in the queries analyzed according to the method 300 and may therefore have a corresponding score. If a noun phrase has not occurred as a start-side or end-side sub-query in the queries analyzed, the score for that noun phrase may be zero.
Accordingly, the noun phrases may be ranked 410 from highest score to lowest. The noun phrases may then be weighted according to their rank, with higher ranked noun-phrases being weighted more than noun-phrases with lesser rank. For example, the range of possible scores (zero to one) may be divided up into R sub-ranges or bins r1, r2, r3 . . . rR. Each noun phrase may then be assigned to the range within which its score falls. Of the possible ranges, some or all of them may have a noun-phrase assigned thereto. Accordingly, each range having at least one noun-phrase assigned thereto may be assigned a rank according to the value of the range. The highest range with an assigned noun phrase is ranked highest, the next highest range with an assigned noun-phrase is ranked second, and so on until the lowest range having an assigned noun phrase is ranked last. Each noun phrase within a range may then be assigned 412 a weight according to the rank of that range. The highest ranked range is given a higher weight than the second highest ranked range, the second highest ranked range is given a higher weight than the third highest ranked range, and so on.
For example, where P(i) is the rank assigned to range i, all noun phrases assigned to range i may receive a weight according to some function W(P(i)). W(P) may be a linear range that decreases linearly from a highest weight W(1) assigned to the noun phrases of the highest ranked range. Alternatively W(P) may decrease exponentially or according to some polynomial or other function that decreases with increasing rank number (i.e. 2nd place, 3rd place 4th place, etc.).
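The binning, ranking, and weighting of steps 410-412 might look like the following sketch. It assumes equal-width bins over the score range zero to one and a linearly decreasing W(P). The function name `weight_noun_phrases` and the parameters `num_bins`, `top_weight`, and `step` are hypothetical.

```python
def weight_noun_phrases(scores, num_bins=5, top_weight=1.0, step=0.2):
    """Assign each noun phrase a weight from the rank of the score bin it falls in.

    Assumes scores in [0, 1], equal-width bins, and a linear W(P):
    W(P) = top_weight - step * (P - 1), floored at zero.
    """
    def bin_index(score):
        # A score of exactly 1.0 belongs to the top bin.
        return min(int(score * num_bins), num_bins - 1)

    bins = {phrase: bin_index(score) for phrase, score in scores.items()}
    # Only bins that actually hold a noun phrase are ranked, highest bin first.
    occupied = sorted(set(bins.values()), reverse=True)
    rank_of_bin = {b: rank for rank, b in enumerate(occupied, start=1)}
    return {
        phrase: max(top_weight - step * (rank_of_bin[b] - 1), 0.0)
        for phrase, b in bins.items()
    }


if __name__ == "__main__":
    scores = {"tv": 0.9, "hd": 0.95, "sony": 0.45, "24 inch": 0.05}
    print(weight_noun_phrases(scores))
```

Because empty bins are skipped when ranks are assigned, "sony" is ranked second even though its bin is two below the top bin.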
The method 400 may further include performing 414 a search using the query with the identified noun phrases weighted according to the weights assigned at step 412. In some embodiments, performing a weighted search may include returning only results that include the highest weighted noun phrase and omitting others from the returned results. The results of the search may be returned to the user or computer system from which the query was received at step 402.
Various search engine algorithms known in the art permit terms or phrases to be assigned weights and identify and rank search results according to the inclusion and usage of the terms and the weights assigned thereto. One example is the APACHE SOLR search platform of the APACHE LUCENE project. Accordingly, performing 414 a weighted search and returning 416 results may be performed using such an algorithm. The documents searched at step 414 may be web pages, a corpus of product records, or some other collection of documents.
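As a rough illustration of handing the weights to such an engine, the sketch below builds a query string in the Lucene-style query syntax, where `"phrase"^w` boosts a phrase and a leading `+` marks a clause as required. The function name `build_boosted_query` and the `require_top` flag are hypothetical, and sending the string to a particular Solr or Lucene deployment is left out.

```python
def build_boosted_query(weights, require_top=False):
    """Build a Lucene-style query string with one boosted phrase clause per noun phrase.

    If require_top is True, the highest weighted phrase is marked as required,
    so only results containing it would be returned.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    clauses = []
    for position, (phrase, weight) in enumerate(ranked):
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
        clause = f'"{escaped}"^{weight:g}'
        if require_top and position == 0:
            clause = "+" + clause
        clauses.append(clause)
    return " ".join(clauses)


if __name__ == "__main__":
    print(build_boosted_query({"tv": 1.0, "sony": 0.8}))
    print(build_boosted_query({"tv": 1.0, "sony": 0.8}, require_top=True))
```

The `require_top` variant corresponds to the embodiment in which only results that include the highest weighted noun phrase are returned.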
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative, and not restrictive. The scope of the invention is, therefore, indicated by the appended claims, rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Number | Date | Country | |
---|---|---|---|
20160034584 A1 | Feb 2016 | US |