At least one embodiment of the present invention pertains to network-oriented information search technology, and more particularly, to a technique for quickly providing relevant facts to a user in response to a search query on a network.
Network-oriented information search technologies have undergone rapid maturation and improvements in recent years. These technologies are often quite effective for some purposes. Nonetheless, known search technologies still have certain shortcomings.
At least one well-known network search technology in use today continuously “crawls” the Internet to identify new or updated web pages and other types of information resources (e.g., video clips, audio files, photos, etc.). The search engine creates and continuously updates an index of these resources. In response to a search query from a user, the search engine processes the query against the index by using one or more search algorithms and produces a set of hyperlinks, i.e., uniform resource locators (URLs). These hyperlinks represent the information resources found by the search algorithm to be most relevant to the query; as such, the hyperlinks are provided to the user in response to the query. Sometimes each URL is shown along with a small amount of contextual information, such as a snippet of text that includes terms from the query as they appear within the referenced resource. The user then examines these URLs, along with any contextual information provided, and decides which of them, if any, are worth selecting (e.g., clicking on) to access and examine the corresponding resources.
A shortcoming of this search technology, however, is that it often provides too little information and requires too much effort from the user. Frequently the user is looking for the answer to a specific question or for a fairly specific piece of information, even though he may not know what that information looks like when he forms the query. With this known search technology, the user has to review the provided URLs and associated contextual information to determine which corresponding resources, if any, are worth actually retrieving. The user then has to click on them one at a time to access and examine each corresponding resource, and then determine the relevance of each resource and try to glean from it the information for which he was searching.
This process can involve a considerable amount of time and effort on the part of the user, depending on the nature of the search. Even with extremely effective search algorithms, the amount of time and effort required to actually obtain the sought-after information may be undesirable from the user's perspective. This is even more likely if the user is searching from a small-footprint mobile communication device, such as a smartphone or personal digital assistant (PDA), whose relatively small user interface can make it difficult to navigate and examine multiple levels of information effectively.
Another type of known search technology is the extensible markup language (XML) document query system. These systems are specially designed for operating on XML markup; as such, they are not well suited for identifying relevant information in standard human sentences, such as may be found in web pages, for example.
The technique introduced here includes a system and method for quickly providing relevant facts to a user of a search engine, directly in response to a search query. The technique eliminates the need for the user to review a list of links to determine which corresponding information resources, if any, are worth actually retrieving and to then click on them one at a time to review each corresponding information resource and to try to glean from them the sought-after information.
In certain embodiments, in response to a search query the system initially identifies a set of network locators, such as URLs, that are deemed relevant to the search query, including at least one network locator. This may involve invoking a set of third-party search application program interfaces (APIs). Each identified network locator corresponds to a separate information resource, such as a web page, stored on a network, such as the Internet. The system then retrieves the information resource (or resources) corresponding to each network locator so identified.
The system then processes the retrieved set of information resources to extract an information item from the set of information resources, and returns that information item to the user as a response to the search query. This returned information item is called a “fact” here and may be in the form of a standard sentence in a language used for spoken and written communication among humans, e.g., English, French, etc.
In certain embodiments, processing the set of information resources to extract the information item comprises: producing a normalized document for each information resource in the retrieved set of information resources; producing a “gobbet” set, including at least one gobbet, from each such normalized document; selecting at least one gobbet from the gobbet set; and creating the above-mentioned information item for output to the user, from the selected at least one gobbet.
A “gobbet”, as the term is used here, is a fragment of information extracted from its original source and context. In certain embodiments a separate gobbet is generated for each paragraph and for each individual sentence in each normalized document generated from the retrieved information resources. A gobbet can be represented as a data object in the system, which can include a gobbet identifier, a network locator corresponding to a source of the gobbet, and various content items, including a subject phrase and a verb phrase.
In certain embodiments, processing the set of information resources to extract the information item further comprises storing and indexing, in a gobbet repository, each gobbet in the gobbet set produced from the query. It may include querying the gobbet repository with the user query to retrieve a result gobbet set including at least one gobbet, then forming a fact from the result gobbet set, and then returning the fact as a response to the user's search query, for output to the user.
Other aspects of the technique will be apparent from the accompanying figures and from the detailed description which follows.
One or more embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
To facilitate description, the technique introduced here is generally described here by using URLs as examples of network locators, web pages as examples of information resources, and the Internet as an example of the target information base to be searched. However, various embodiments of the technique introduced here may alternatively (or additionally) handle other types of network locators, information resources and/or target information bases.
A user (not shown) of a client system 3 forms a search query, which is transmitted by the client system 3 to the search system 1 via the network 2 using any known or convenient protocol(s), such as hypertext transfer protocol (HTTP). The search query can be in the form of, for example, a conventional keyword search of the type used with conventional search engines known today, such as Google, Yahoo, etc. The search query can be, but is not necessarily, in the form of a natural language search.
In one embodiment, in response to the user's search query, the search system initially identifies a set of URLs that are deemed relevant to the search query, including at least one network locator. This may involve generating and using a secondary query to invoke the published, well-known API of one or more secondary (third-party) information sources 4. The secondary information sources can include, for example, any one or more of: Twitter Recommender, Yahoo Boss, Google, Reuters, or any other information source that can provide a list of references (e.g., URLs) to information resources in response to a search query. Each such secondary information source 4 returns a set of one or more URLs in response to the secondary query. Note that the secondary query may be identical to the user query, or it may be a modified version of the user query (e.g., if necessitated by the particular API of any of the secondary information sources 4). Each URL returned to the search system 1 in response to the secondary query represents a separate information resource, such as a web page, stored on the network 2 at one or more primary information sources 5.
Still in response to the user's query, the search system 1 then retrieves the information resource (or resources) corresponding to each of the returned URLs. In some cases, the search system 1 may also access and retrieve additional information resources, such as those referenced by hyperlinks in the retrieved information resources, as explained below. The search system 1 then processes the retrieved set of information resources to extract from them one or more “facts” relevant to the user's search query, and the extracted fact or facts are then returned to the client system 3 as a response to the user's query. A “fact” can be a standard sentence in a language used for spoken and written communication among humans, e.g., English, French, etc. The term “fact” is used here merely for convenience, since it connotes a complete yet concise unit of information; it does not imply anything about the truth or falsity of the information to which it pertains.
As an example of how the search system 1 operates, in response to the illustrative user search query, “highest city in the world”, the system 1 might return the following fact:
The system 1 in this example has located a sentence identifying La Rinconada in Peru as the highest city in the world; it has computed the most useful enclosing context (in this case the next two sentences of the original article) and then attached a citation to the original source (the web site “gadling.com”), as well as the most likely publication date (Mar. 15, 2009). The system 1 may also display, next to or after the result, a set of buttons that allow the user to provide feedback (e.g., a button to share on Facebook, a thumbs up icon to record a positive response, a thumbs down icon to record a negative response, and a star icon to record a “favoriting” response). The most likely publication date is determined by matching a by-line (in this example the article at gadling.com contains the by-line “by Kraig Becker (RSS feed) on Mar. 15, 2009 at 10:00AM”).
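The by-line matching described above can be illustrated with a short sketch. It assumes a single by-line shape of the form “by <author> ... on <month> <day>, <year>”; the regular expression, the month table and the function name `extract_byline_date` are illustrative assumptions and are not taken from the described system.

```python
import re
from datetime import date

# Hypothetical by-line pattern: "by <Author Name> ... on <Mon>. <day>, <year>".
BYLINE_RE = re.compile(
    r"\bby\s+[A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*)*"   # capitalized author name
    r".*?\bon\s+(?P<month>[A-Z][a-z]{2})[a-z]*\.?\s+"
    r"(?P<day>\d{1,2}),\s+(?P<year>\d{4})"
)

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def extract_byline_date(text):
    """Return the publication date found in a by-line, or None if no by-line matches."""
    match = BYLINE_RE.search(text)
    if match is None or match.group("month") not in MONTHS:
        return None
    try:
        return date(int(match.group("year")),
                    MONTHS[match.group("month")],
                    int(match.group("day")))
    except ValueError:  # e.g. "Feb. 31, 2009"
        return None


byline = "by Kraig Becker (RSS feed) on Mar. 15, 2009 at 10:00AM"
print(extract_byline_date(byline))
```

A practical implementation would likely try several by-line shapes in turn and fall back to other date cues when none matches.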
The search system 1 also includes a verb phrase repository 26, a gobbet repository 27 and a gobbet index 28. The gobbet repository 27 and gobbet index 28 (at least) also can be combined. Note that normally the functionality of all of the above mentioned elements is invoked in response to each user search query, as described below.
The markup processor 21 is the first stage of the search system 1 and has three main functions: First, it receives the user search query from a client system 3 (
In some instances, the markup processor 21 may also access and retrieve information resources that are “depth 2” or even deeper, i.e., web pages and/or other resources that are linked-to by the information resources retrieved in step 304. In one embodiment, the markup processor 21 will do so if the initial (depth 1) resource is a “hub” but not if it is an “authority” (as those terms are defined in the Hyperlink-Induced Topic Search (HITS) link analysis algorithm).
The text processor 22 receives each of the markup language documents from the markup processor 21 (305) and, for each one, performs a normalization process to produce a corresponding normalized markup language document (306). The normalization process generally puts each markup language document into a canonical format, which facilitates processing by subsequent stages of the search system 1. For example, the normalization process strips out information that is not needed, such as advertising, detailed page formatting information, and embedded scripts. Information that is retained includes, for example, the basic substantive content of the markup language document as well as all lists and key/value pairs (if any) in the markup language document, the most likely publication date, and relevant images and videos. In addition, the normalization process may also fix obvious spelling errors and/or address other formatting issues. An example of a normalized markup language document is described below and illustrated in
The sentence processor 23 receives each normalized document from the text processor 22 and, for each normalized document, performs a linguistic analysis to generate and output a gobbet set (307), where each gobbet set contains one or more gobbets. A “gobbet”, as the term is used here, is a fragment of information extracted from its original source and context. In one embodiment each gobbet represents a single sentence or paragraph, and a gobbet set includes a separate gobbet for each paragraph and for each individual sentence in the corresponding normalized document. A gobbet that represents a sentence is called a “sentence gobbet” herein, and a gobbet that represents a paragraph is called a “paragraph gobbet” herein.
A gobbet can be represented as a data object in the search system 1, which can include a gobbet identifier, a network locator (e.g., a URL) corresponding to a source of the gobbet (e.g., a web page), and various content items, including, in the case of a sentence gobbet, a subject phrase, a verb phrase and an object phrase (if any) of the sentence that the gobbet represents. Further details and an example of a gobbet are described below.
The GSI module 24 indexes and stores, in the gobbet repository 27, each gobbet in each gobbet set that resulted from the user's query. More specifically, the GSI module 24 generates a set of terms found in each gobbet of each gobbet set (308), then indexes all of the terms and stores all of the gobbets in the gobbet repository 27 (309). Each term is stored and indexed in the gobbet index 28 so that the gobbet or gobbets in which it appears can be quickly and easily identified. This is an application of inverted file indexing, with the gobbets treated as the files. The index comprises an index of terms and, for each such term, an associated term list containing all of the gobbet IDs of gobbets that contained that term. The index of terms is organized in memory in such a way that a given term can be directly addressed; specifically, the corresponding term list (if any) can be retrieved in a constant amount of time irrespective of the size of the index. This is accomplished through the use of memory-mapped hash tables. Term lists are sequentially accessed but include a super-structure (a skip list), which allows skipping past blocks of gobbet IDs that fail to match user queries.
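The inverted-index organization described above can be sketched as follows. This is a minimal in-memory illustration, assuming a plain Python dictionary in place of the memory-mapped hash tables and a simple lowercase tokenizer; the class name `GobbetIndex` and the `tokenize` helper are hypothetical, and the skip-list super-structure is omitted.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class GobbetIndex:
    """Maps each term to the ascending list of gobbet IDs containing it."""

    def __init__(self):
        self._term_lists = defaultdict(list)

    def add(self, gobbet_id, text):
        # Gobbets are assumed to be added in ascending ID order,
        # which keeps every term list sorted.
        for term in set(tokenize(text)):
            self._term_lists[term].append(gobbet_id)

    def lookup(self, term):
        # Dictionary access is average-case constant time,
        # independent of the number of indexed terms.
        return self._term_lists.get(term, [])


index = GobbetIndex()
index.add(1, "La Rinconada is the highest city in the world")
index.add(2, "Lhasa is a high city")
print(index.lookup("city"))
```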
The processing to this point can be separated from the remaining steps as an independent process, in which any fixed set of queries can be pre-processed to create a gobbet index and gobbet repository for future use.
The fact query module 25 identifies (310) the terms that are contained in the user's query and then uses the gobbet index 28 to look up (311) the gobbet or gobbets that contain those terms; each gobbet so identified is referred to herein as a “fact”. The fact query module 25 then retrieves these gobbets from the gobbet repository 27 and collects them into a fact set (312), which is returned to the requesting client system 3 to be output to the user (313).
Operation of the search system 1 is further described now with reference to
The normalized document has at least a body portion (denoted by the “<body>” tag), as can be seen from
The metadata elements in the normalized document can include, for example, the name of the author, the publication date of the document, and any information from the document that appears to be in the form of a key-value pair. In one embodiment the presence of a colon (“:”) is considered to be an indicator of a key-value pair. Another function of the normalization process is to keep track of and preserve the various section headings and their hierarchical relationships, if any, in the document.
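The colon heuristic for key-value pairs can be sketched in a few lines. The function name `parse_key_value` is hypothetical, and the sketch assumes that a line qualifies only when non-empty text appears on both sides of the first colon; a real implementation would probably add further checks, since URLs and times of day also contain colons.

```python
def parse_key_value(line):
    """Split a line at its first colon; return (key, value) or None."""
    key, separator, value = line.partition(":")
    key, value = key.strip(), value.strip()
    if not separator or not key or not value:
        return None
    return key, value


print(parse_key_value("Born: 11 May 1918 in Far Rockaway, New York, USA"))
```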
Next, the sentence processor 23 performs operations 703, 704 and 705, for each sentence in the normalized document. At 703 the sentence processor 23 identifies all of the verb phrases in a given sentence. A verb phrase contains one or more words, including a single verb. To identify the verb phrases in the sentence, the sentence processor 23 tries to match one or more words in the sentence with contents of the verb phrase repository 26.
The verb phrase repository 26 is a text repository (e.g., a file or database) that preferably contains every conceivable form of every verb phrase in a given language (infinitive, gerund, all participles, etc.). For example, for the verb “to abide”, the verb phrase repository 26 would include at least the following entries:
abide
abided
were abiding
was abided
had been abiding
am abiding
are abiding
is abiding
have abided
have been abided
has been abided
would abide
is going to abide
will be abiding
am going to be abiding
are going to be abiding
would be abided
is going to be abided
will have abided
am going to have abided
are going to have abided
would have been abiding
is going to have been abiding
will have been abided
am going to have been abided
are going to have been abided
After identifying all of the verb phrases in the sentence, at 704 the sentence processor 23 identifies the dominant verb phrase in the sentence. The dominant verb phrase is the verb phrase that is deemed to be most important to the meaning of the sentence. If the sentence contains only one verb phrase, then that verb phrase is the dominant verb phrase. On the other hand, consider for example the following sentence: “While walking to the store this morning, I ran into a good friend whom I hadn't seen in many years.” This sentence contains three separate verb phrases: 1) “while walking to the store this morning”, 2) “ran into a good friend” and 3) “hadn't seen in many years”. The second verb phrase, “ran into a good friend”, is the one that is most significant to the meaning of the sentence and is therefore the dominant verb phrase in the sentence; the other two verb phrases are ancillary, because they merely qualify the dominant verb phrase.
For example, in response to a user query, “Feynman Manhattan Project”, the system may find a document containing the following sentence:
The sentence processor 23 decides which among the apparent verb phrases “began”, “developing”, “to separate”, “went to” and “to work with” is the dominant verb phrase. In this case the sentence processor 23 picks the verb “began”, with “developing” and “to separate” deemed as qualifying terms, and “went to” and “to work with” appearing in a subordinate clause. The sentence processor 23 recognizes and records that this particular sentence occurs within the following paragraph:
The sentence processor also recognizes and records that this particular sentence occurs within a context that includes a sequence of nested titles:
Feynman biography
Richard Phillips Feynman
The sentence processor 23 further recognizes and records that the enclosing document contains two relevant key-value pairs:
Born: 11 May 1918 in Far Rockaway, New York, USA
Died: 15 Feb. 1988 in Los Angeles, Calif., USA
When a sentence contains more than one verb phrase, the sentence processor 23 applies a set of criteria to identify the dominant verb phrase. For this purpose, the verbs in the verb phrase repository 26 are ranked in degree of dominance. In general, any form of the verb “to be” is considered more dominant than any other verb. After forms of “to be”, commonly used (“common”) verbs are considered more dominant than less commonly used (“uncommon”) verbs. Whether a verb is deemed “common” or “uncommon” can be based on an arbitrary threshold, such as the frequency of use of that verb in the corresponding language. Various statistics in this regard have been published. If two or more verb phrases in a sentence have the same degree of dominance, then the length of the verb phrases is used as a secondary criterion to determine the dominant one, with a longer verb phrase being considered dominant over a shorter verb phrase, as discussed further below. If two or more verb phrases in a sentence have equal degrees of dominance and length, the one that occurs earlier in the sentence is considered to be more dominant.
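The three-level ordering described above (dominance tier, then length, then position) can be expressed as a sort key. The sketch below assumes that each candidate verb phrase arrives with the lemma of its main verb and its word position in the sentence; the `COMMON_VERBS` set, the use of word count as the length measure, and the function names are illustrative assumptions.

```python
# Hypothetical set of "common" verbs; a real list would come from
# published frequency statistics for the language.
COMMON_VERBS = {"begin", "go", "have", "run", "see", "walk"}


def dominance_key(candidate):
    """Smaller keys are more dominant."""
    phrase, lemma, position = candidate
    if lemma == "be":
        tier = 0          # any form of "to be" outranks every other verb
    elif lemma in COMMON_VERBS:
        tier = 1          # common verbs outrank uncommon ones
    else:
        tier = 2
    # Ties on tier are broken by longer phrase, then by earlier position.
    return (tier, -len(phrase.split()), position)


def dominant_verb_phrase(candidates):
    return min(candidates, key=dominance_key)[0]


candidates = [("walking", "walk", 1), ("ran into", "run", 6), ("hadn't seen", "see", 12)]
print(dominant_verb_phrase(candidates))
```

In this example all three verbs fall in the same tier, the two two-word phrases tie on length, and the earlier of them is selected.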
In one embodiment, to improve performance (speed), the verb phrase repository 26 is partitioned before run time into multiple tiers by degree of dominance (importance). For example, as shown in
In one embodiment, the sentence processor 23 tries to match words in the sentence with contents of the verb phrase repository by comparing a sliding n-gram in the sentence (a set of n consecutive words in the sentence) to the verb phrase repository 26.
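A sliding n-gram comparison of this kind might look like the following sketch. It assumes the verb phrase repository is available as a set of lowercase phrases and that the longest match at each position is preferred, with scanning resuming after the matched words; both choices, and the name `find_verb_phrases`, are assumptions made for illustration.

```python
def find_verb_phrases(words, repository, max_n=4):
    """Return (start_index, phrase) pairs for repository phrases found in words."""
    matches = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest window first, shrinking down to a single word.
        for n in range(min(max_n, len(words) - i), 0, -1):
            ngram = " ".join(words[i:i + n]).lower()
            if ngram in repository:
                matches.append((i, ngram))
                i += n          # resume after the matched phrase
                matched = True
                break
        if not matched:
            i += 1
    return matches


repository = {"abide", "is abiding", "will be abiding", "is going to abide"}
sentence = "he is going to abide by the rules".split()
print(find_verb_phrases(sentence, repository))
```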
In the example of
Referring again to
Referring again to the illustrative web page in
In one embodiment, a gobbet is a data object that includes both content items and context items. The content items can include, for example, the subject phrase of the corresponding sentence, the dominant verb phrase of the sentence, and the object phrase (if any) of the sentence. The context items are metadata which can include, for example: a gobbet identifier (ID) that uniquely identifies the gobbet within the search system; the URL of the markup language document from which the sentence was extracted; one or more implied subjects of the sentence (e.g., any heading, or any one of the chain of headings, that enclose the paragraph in which the sentence resides); a timestamp indicating when the source document was fetched; a parent gobbet ID indicating which gobbet, if any, is the parent of this gobbet (e.g., for a sentence gobbet, the parent gobbet is the gobbet representing the paragraph which includes that sentence); a quality indicator (which may indicate the degree of relevance of the gobbet to a particular query, and may be assigned by the fact query module after the gobbet has been indexed); and an application-opaque ID (i.e., opaque to the search system). Each gobbet is stored in the gobbet repository, indexed by its gobbet ID.
In the above example:
1. ‘Timestamp’ is recorded as a Unix timestamp, namely, as seconds elapsed since midnight Coordinated Universal Time (UTC) of Jan. 1, 1970, not counting leap-seconds.
2. ‘Quality’ is recorded on an arbitrary (but consistent) scale with 0 being the highest quality and larger numeric values indicating lesser quality.
3. ‘Appid’ is an opaque, application-dependent identifier that can be used flexibly to record a small amount (e.g., 64 bits) of arbitrary information about any given gobbet.
4. ‘Parent’ is the gobbet ID in the current gobbet repository of the enclosing gobbet (if any) of the given gobbet.
5. ‘Trace’ is a packed number (e.g., 64 bits) encoding information related to the quality of the gobbet, as explained in more detail below.
6. ‘url’ is the Uniform Resource Locator of the enclosing document.
7. ‘loc’ is the position of the sentence/paragraph/image/video/key-value pair within the normalized document, represented as a pair (paragraph number; sentence number).
8. ‘img’ is the URL (Uniform Resource Locator) of any image associated with the gobbet.
9. ‘implied-list’ is the list of enclosing titles.
10. ‘Head’ is the sentence subject.
11. ‘Verb’ is the dominant verb phrase.
12. ‘Rest’ is the sentence predicate.
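The fields listed above can be gathered into a single record type. The following is a sketch only: the Python types, the default values and the example URL are assumptions, and the subject/verb/predicate split shown in the usage example is illustrative rather than output of the described sentence processor.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Gobbet:
    gobbet_id: int
    url: str                      # 'url': enclosing document URL
    head: str                     # 'Head': sentence subject
    verb: str                     # 'Verb': dominant verb phrase
    rest: str                     # 'Rest': sentence predicate
    loc: Tuple[int, int] = (0, 0)  # 'loc': (paragraph number, sentence number)
    implied_list: List[str] = field(default_factory=list)  # enclosing titles
    timestamp: int = 0            # Unix timestamp of the fetch
    quality: int = 0              # 0 is the highest quality
    appid: int = 0                # opaque application-dependent value
    parent: Optional[int] = None  # ID of the enclosing gobbet, if any
    trace: int = 0                # packed 64-bit quality information
    img: Optional[str] = None     # URL of an associated image, if any

    def sentence(self):
        """Reassemble the sentence text from its three content items."""
        return f"{self.head} {self.verb} {self.rest}"


gobbet = Gobbet(
    gobbet_id=42,
    url="http://example.com/highest-cities",   # hypothetical URL
    head="La Rinconada",
    verb="is",
    rest="the highest city in the world",
    loc=(3, 1),
)
print(gobbet.sentence())
```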
The ‘Trace’ is, in one embodiment, a packed 64-bit structure that includes the following items:
1. ‘topic’ (bits 58 . . . 63)—a penalty score assessed for weak resemblance to the topic sentence of the enclosing paragraph.
2. ‘rank’ (bits 53 . . . 57)—a penalty score assessed for low page rank of the enclosing document.
3. ‘traffic’ (bits 46 . . . 51)—a penalty score assessed for low web traffic to the enclosing document.
4. ‘ambiguity’ (bits 40 . . . 45)—a penalty score assessed for high levels of verb ambiguity in the sentence.
5. ‘depth’ (bits 30 . . . 33)—a penalty score assessed depending on how deep into an enclosing paragraph the sentence (from which the gobbet is derived) appears.
6. ‘head’ (bits 28 . . . 29)—a penalty score assessed for sentences with very short subject phrases.
7. ‘pred’ (bits 26 . . . 27)—a penalty score assessed for sentences with very short predicate phrases.
8. ‘site’ (bits 22 . . . 25)—a boost score assessed for certain (authoritative) sites, for example nytimes.com, wikipedia.org.
9. ‘query_type’ (bits 16 . . . 21)—records the type of query that returned this gobbet. ‘query_type’ can have the following values, which are explained in detail below:
10. ‘reputation’ (bits 10 . . . 15)—records the authority of the original source (URL) author (individual or organization).
11. ‘rest’ (bits 0 . . . 9)—labels the remaining unallocated bits of the trace structure.
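Packing and unpacking such a structure reduces to shifts and masks. The sketch below uses the bit ranges exactly as listed above and leaves every bit not assigned to a named score at zero; the function names `pack_trace` and `unpack_trace` and the dictionary-based interface are assumptions for illustration.

```python
# (low bit, high bit) for each named score, as listed in the text.
TRACE_FIELDS = {
    "topic": (58, 63),
    "rank": (53, 57),
    "traffic": (46, 51),
    "ambiguity": (40, 45),
    "depth": (30, 33),
    "head": (28, 29),
    "pred": (26, 27),
    "site": (22, 25),
    "query_type": (16, 21),
    "reputation": (10, 15),
}


def pack_trace(values):
    """Pack a {field name: score} mapping into one 64-bit integer."""
    trace = 0
    for name, value in values.items():
        low, high = TRACE_FIELDS[name]
        width = high - low + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        trace |= value << low
    return trace


def unpack_trace(trace):
    """Return every named score stored in a packed trace value."""
    return {
        name: (trace >> low) & ((1 << (high - low + 1)) - 1)
        for name, (low, high) in TRACE_FIELDS.items()
    }


packed = pack_trace({"topic": 3, "site": 9, "query_type": 4})
print(unpack_trace(packed)["site"])
```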
As noted above, after generating a gobbet set (
To index the terms, in one embodiment a hash function is applied to each term to generate a hash value, which is used as an index value into the gobbet index. Each entry in the gobbet index represents one term and includes the hash value of that term and the gobbet ID of each gobbet that includes that term. The hash value is used as an index to locate that entry later.
After the terms are indexed and the gobbets are stored, the fact query module 25 queries the gobbet index 28 with the user query to retrieve a term set (
Referring to
a. Products
b. Ticker symbols
c. Music-related
d. Current news
e. Geographic
f. Weather
g. Subject-Verb phrase
The query parse module 102 determines whether the user query consists of a combination of these categories. For example, a geographically localized product query such as “best pizza in Palo Alto” will be parsed into three segments: “best”, “pizza” (a product) and “Palo Alto” (a location). The query parse module 102 operates by matching a sequence of regular expressions against the user query. If a given regular expression matches, for example, a product pattern, then the query parse module 102 removes the portion of the query that matches this pattern, and continues to match against the remainder of the query. The query parse module 102 continues in this manner, removing matching segments, until either the query is exhausted or the set of patterns is exhausted. Each extracted segment of the query is labeled by the category that it matched. The unmatched remainder of the query (which may be the entire query) is also returned.
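The match-and-remove loop described above can be sketched as follows. The two category patterns are tiny hypothetical stand-ins (a real system would need far richer product and location patterns), and the function name `parse_query` is likewise an assumption.

```python
import re

# Hypothetical category patterns, tried in order.
CATEGORY_PATTERNS = [
    ("product", re.compile(r"\b(pizza|coffee|laptop)\b", re.IGNORECASE)),
    ("geographic", re.compile(r"\bin\s+(palo alto|san francisco)\b", re.IGNORECASE)),
]


def parse_query(query):
    """Return (labeled segments, unmatched remainder) for a user query."""
    remainder = " ".join(query.split())
    segments = []
    for category, pattern in CATEGORY_PATTERNS:
        if not remainder:          # query exhausted
            break
        match = pattern.search(remainder)
        if match:
            segments.append((category, match.group(1)))
            # Remove the matched portion and keep matching against the rest.
            stripped = remainder[:match.start()] + " " + remainder[match.end():]
            remainder = " ".join(stripped.split())
    return segments, remainder


print(parse_query("best pizza in Palo Alto"))
```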
The query parse module 102 generates a query plan. The query plan consists of a list of very specific queries derived from the original user query. The plan queries define subsets of the gobbet repository that match gobbet-specific conditions.
head-phrase:highest_city_in_the_world (1)
head:highest_city_in_the_world (2)
head:highest+head:city+head:in+head:the+head:world (3)
url:highest+url:city+url:world (4)
highest_city+city_in+in_the+the_world (5)
highest_city+in_the+world (6)
implied:highest+implied:city+implied:in+implied:the+implied:world (7)
head:highest+city+in+the+world (8)
implied:highest+city+in+the+world (9)
highest+city+in+the+world (10)
highest|city|world (11)
Plan query (1), the head-exact-phrase-query, defines a query that matches the user query completely and exactly within the subject portion of one gobbet. Plan query (2), the head-phrase-query, defines a query that matches the user query phrase anywhere within the subject portion of one gobbet. Plan query (3), the head-query, defines a query that matches each term of the user query independently within the subject portion of one gobbet. Plan query (4), the URL-query, defines a query that matches the non-stop-word terms of the user query within the path portion of the enclosing document URL of one gobbet. Stop words are very common words, typically articles, prepositions and conjunctions, which do not add specificity to the query. In the example of “highest city in the world”, “in” and “the” are stop words and can be removed from the query when matching against the document URL. Plan query (5), the phrase-query, defines a query that matches overlapping bi-grams formed from the user query anywhere in one gobbet. Plan query (6), the weak-phrase-query, defines a query that matches non-overlapping bi-grams anywhere in one gobbet. Plan query (7), the implied-(title)-query, defines a query that matches each of the user query terms anywhere within the title-list of one gobbet. Plan query (8), the mixed-and-query, defines a query that matches the leading term of the user query within the subject portion of one gobbet, and the remaining terms of the user query anywhere within that gobbet. Plan query (9), the mixed-implied-and-query, defines a query that matches the leading term of the user query within the title-list portion of one gobbet, and the remaining terms of the user query anywhere within that gobbet. Plan query (10), the and-query, defines a query that matches each of the user query terms anywhere within one gobbet. Plan query (11), the or-query, defines a query that matches any one of the non-stop-word terms of the user query anywhere within one gobbet.
All plan queries, with the exception of (11), the or-query, include conjunctions. That is to say the plus sign “+” in the query is taken to mean “AND”. The constituents of each plan query are called elementary plan queries. For example, “url:highest” is an elementary plan query. It defines a subset consisting of all the gobbets containing the term “highest” anywhere within the path portion of the URL.
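The eleven plan queries can be derived mechanically from the user query. The sketch below reproduces the plan listed above for “highest city in the world”; the stop-word set is a small illustrative assumption, the function name `build_query_plan` is hypothetical, and queries of fewer than two terms are not handled specially beyond returning an empty plan for an empty query.

```python
STOP_WORDS = {"a", "an", "and", "in", "of", "or", "the"}  # illustrative only


def build_query_plan(query):
    """Return plan queries (1) through (11) as strings, most specific first."""
    terms = query.lower().split()
    if not terms:
        return []
    content = [t for t in terms if t not in STOP_WORDS]
    phrase = "_".join(terms)
    overlapping = [f"{a}_{b}" for a, b in zip(terms, terms[1:])]
    non_overlapping = ["_".join(terms[i:i + 2]) for i in range(0, len(terms), 2)]
    first, remaining = terms[0], terms[1:]
    return [
        f"head-phrase:{phrase}",                      # (1) head-exact-phrase-query
        f"head:{phrase}",                             # (2) head-phrase-query
        "+".join(f"head:{t}" for t in terms),         # (3) head-query
        "+".join(f"url:{t}" for t in content),        # (4) URL-query
        "+".join(overlapping),                        # (5) phrase-query
        "+".join(non_overlapping),                    # (6) weak-phrase-query
        "+".join(f"implied:{t}" for t in terms),      # (7) implied-(title)-query
        "+".join([f"head:{first}"] + remaining),      # (8) mixed-and-query
        "+".join([f"implied:{first}"] + remaining),   # (9) mixed-implied-and-query
        "+".join(terms),                              # (10) and-query
        "|".join(content),                            # (11) or-query
    ]


for plan_query in build_query_plan("highest city in the world"):
    print(plan_query)
```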
Referring again to
The gobbet ID list set intersector 106 processes a collection of input gobbet ID lists 105 and outputs the list of gobbet IDs common to all the input ID lists. Considering each input gobbet ID list as defining a subset of gobbets (with the corresponding IDs), the gobbet ID list set intersector 106 returns the result gobbet ID list 107 representing exactly the intersection of this collection of input sets. The gobbet ID list set intersector 106 performs a multi-way merge operation on the gobbet ID lists, which are ordered, compressed lists of unsigned integer values.
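A multi-way merge intersection over sorted ID lists can be sketched as follows. The sketch assumes plain, uncompressed Python lists in ascending order with no duplicates; decompression is out of scope, and the function name `intersect_id_lists` is hypothetical.

```python
def intersect_id_lists(id_lists):
    """Return the IDs present in every input list (inputs sorted ascending)."""
    if not id_lists or any(not ids for ids in id_lists):
        return []
    cursors = [0] * len(id_lists)
    result = []
    while True:
        heads = []
        for k, ids in enumerate(id_lists):
            if cursors[k] >= len(ids):
                return result      # one list is exhausted: no more common IDs
            heads.append(ids[cursors[k]])
        candidate = max(heads)
        if all(head == candidate for head in heads):
            result.append(candidate)
            cursors = [c + 1 for c in cursors]
        else:
            # Advance every list to its first ID that is >= the candidate.
            for k, ids in enumerate(id_lists):
                while cursors[k] < len(ids) and ids[cursors[k]] < candidate:
                    cursors[k] += 1


list_a = [1, 3, 5, 10, 15, 30, 200, 201, 211, 250, 251, 252, 305, 500, 510]
list_b = [201, 202, 203, 250, 260, 270, 301, 302, 303, 304, 305]
print(intersect_id_lists([list_a, list_b]))
```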
The gobbet ID lists in some embodiments may contain skip lists that allow accelerated comparisons between pairs of gobbet ID lists. A skip list comprises a set of pointers mixed into the gobbet ID lists at regular or random intervals, each of which defines a jump value and a jump location. For example, the simple gobbet ID list:
(1, 3, 5, 10, 15, 30, 200, 201, 211, 250, 251, 252, 305, 500, 510) (A)
can be improved by adding the following skip list entries:
([200:5], 1, 3, 5, 10, 15, 200, [300:6], 201, 211, 250, 251, 252, 305, 500, 510)
Skip list entries make it possible to accelerate the comparison between two gobbet ID lists when looking for common entries. For example, if a second gobbet ID list
(201, 202, 203, 250, 260, 270, 301, 302, 303, 304, 305) (B)
were compared to list (A), the skip entry [200:5] records the information that the first gobbet ID equal to or greater than 200 occurs five steps past the first entry, and allows the comparison processor to skip the first six entries (including the skip entry itself) of list (A) when comparing it to list (B).
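The acceleration that skip entries provide can be illustrated with a simplified variant. In this sketch the skip pointers are kept in a side table of (value, index) pairs placed at a fixed interval, rather than interleaved with the IDs as in the bracketed notation above; that layout, the interval of four, and the function names are all assumptions made to keep the example short.

```python
def build_skips(ids, step=4):
    """Record (ID value, index) at every step-th position of a sorted list."""
    return [(ids[i], i) for i in range(step, len(ids), step)]


def seek(ids, skips, pos, target):
    """Return the first index >= pos whose ID is >= target (len(ids) if none)."""
    for value, index in skips:
        if value > target:
            break
        if index > pos:
            pos = index            # every ID before index is below target
    while pos < len(ids) and ids[pos] < target:
        pos += 1
    return pos


def intersect_with_skips(a, b):
    """Intersect two sorted ID lists, jumping over non-matching blocks."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = seek(a, skips_a, i, b[j])
        else:
            j = seek(b, skips_b, j, a[i])
    return common


list_a = [1, 3, 5, 10, 15, 30, 200, 201, 211, 250, 251, 252, 305, 500, 510]
list_b = [201, 202, 203, 250, 260, 270, 301, 302, 303, 304, 305]
print(intersect_with_skips(list_a, list_b))
```

The first call to `seek` on list (A) with target 201 jumps past the low-valued block in one step instead of comparing each entry in turn.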
The gobbet id list set intersector 106 is applied at each stage of the query plan evaluation to compute the gobbet ID list corresponding to the conjunctive condition defined by that stage of the query plan. For example, plan query (4), “url:highest+url:city+url:world” requires intersecting three gobbet ID lists corresponding to the three terms “url:highest”, which returns a gobbet ID list comprising all the gobbets in the gobbet repository containing “highest” anywhere in the path portion of the URL, “url:city”, which returns a gobbet id list comprising all the gobbets in the gobbet repository containing “city” anywhere in the path portion of the URL, and “url:world”, which returns a gobbet ID list comprising all the gobbets in the gobbet repository containing “world” anywhere in the path portion of the URL. The output of this stage of the query plan processing is the gobbet ID list including all the gobbets in the gobbet repository that contain all three terms anywhere in the path portion of the URL.
The query plan process (
The gobbet repository lookup module 108 processes an input gobbet ID list 107 and outputs a set of gobbets 109 corresponding to the input IDs. The gobbet repository lookup module 108 maintains a two-level structure including: (1) a directly indexed fixed-width memory-mapped vector of gobbet-representatives, and (2) a memory-mapped heap of variable-width strings associated with each gobbet. The gobbet-representative consists of a number of fixed-width fields corresponding one-to-one with the fields of a gobbet, with the difference that the variable-width gobbet fields, namely the URL, location, image, title list, subject, verb, and predicate, are all represented in the gobbet-representative as fixed-width offsets into the secondary memory-mapped heap of strings. Heap offsets are used to fetch a chunk of the heap of fixed maximum size. Strings within the heap are zero-delimited. The actual length of a string retrieved from the heap can be determined by scanning the maximum-length chunk for the first occurrence of a null (0) character, which conventionally defines the end of the string.
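The two-level structure can be illustrated with the following sketch. It makes several assumptions for brevity: only four of the variable-width fields are modeled (URL, subject, verb, predicate), each offset is a 32-bit little-endian integer, the gobbet ID is used directly as the vector index, and plain bytes objects stand in for the memory-mapped regions. The names REP, MAX_CHUNK, build, read_string, and lookup are hypothetical.

```python
import struct
from typing import List, Tuple

# Assumed record layout: four 32-bit heap offsets (url, subject, verb, predicate).
REP = struct.Struct("<4I")
MAX_CHUNK = 64  # hypothetical fixed maximum chunk size, in bytes


def build(gobbets: List[Tuple[str, str, str, str]]) -> Tuple[bytes, bytes]:
    """Pack gobbets into a fixed-width vector plus a zero-delimited string heap."""
    vector = bytearray()
    heap = bytearray()
    for fields in gobbets:
        offsets = []
        for text in fields:
            offsets.append(len(heap))
            heap += text.encode("utf-8") + b"\x00"
        vector += REP.pack(*offsets)
    return bytes(vector), bytes(heap)


def read_string(heap: bytes, offset: int) -> str:
    """Fetch a fixed-size chunk and cut it at the first null byte."""
    chunk = heap[offset:offset + MAX_CHUNK]
    end = chunk.find(b"\x00")
    if end == -1:
        # No terminator inside the chunk: the string is longer than MAX_CHUNK
        # and is truncated in this sketch.
        end = len(chunk)
    return chunk[:end].decode("utf-8", errors="replace")


def lookup(vector: bytes, heap: bytes, gobbet_id: int) -> Tuple[str, ...]:
    """Directly index the fixed-width vector, then resolve each heap offset."""
    offsets = REP.unpack_from(vector, gobbet_id * REP.size)
    return tuple(read_string(heap, off) for off in offsets)


SAMPLE = [
    ("http://example.com/a", "Tokyo", "is", "a large city"),
    ("http://example.com/b", "K2", "is", "a mountain"),
]
VECTOR, HEAP = build(SAMPLE)
second = lookup(VECTOR, HEAP, 1)
```

Because every representative has the same width, locating record N requires only a multiplication, and the variable-width text is touched only when a field is actually read.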
The context resolution module 110 processes an input set of gobbets 109 and outputs an ordered subset of those gobbets and the final form of the fact query response to the original user query 101. The context resolution module 110 applies one or more regular expression and/or Bloom filter pattern-matching steps to eliminate non-English, non-relevant, and offensive gobbets from the input set. It also looks for cases of multiple input gobbets from the same paragraph of the same document. When three or more gobbets occur close together within the same enclosing paragraph, the context resolution module 110 replaces the subset of all gobbets pertaining to the enclosing paragraph with a single gobbet representing the entire paragraph.
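The filtering and paragraph-merging behavior might look like the following sketch. The Gobbet class, the placeholder BLOCKED pattern, and the paragraphs mapping that supplies the full paragraph text are all hypothetical. For simplicity the sketch treats any three gobbets in the same paragraph as occurring closely, and it omits the Bloom filter and language checks.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gobbet:
    url: str        # document the gobbet came from
    paragraph: int  # index of the enclosing paragraph in that document
    text: str


# Placeholder pattern standing in for the real filtering rules.
BLOCKED = re.compile(r"\b(?:badword)\b", re.IGNORECASE)


def resolve_context(
    gobbets: List[Gobbet],
    paragraphs: Dict[Tuple[str, int], str],
    threshold: int = 3,
) -> List[Gobbet]:
    """Drop filtered gobbets, then merge crowded paragraphs into one gobbet."""
    kept = [g for g in gobbets if not BLOCKED.search(g.text)]

    groups: Dict[Tuple[str, int], List[Gobbet]] = defaultdict(list)
    for g in kept:
        groups[(g.url, g.paragraph)].append(g)

    out: List[Gobbet] = []
    emitted = set()
    for g in kept:  # preserve the input ordering
        key = (g.url, g.paragraph)
        if len(groups[key]) >= threshold:
            if key not in emitted:
                emitted.add(key)
                # One gobbet representing the entire enclosing paragraph.
                out.append(Gobbet(g.url, g.paragraph, paragraphs[key]))
        else:
            out.append(g)
    return out


INPUT = [
    Gobbet("u1", 0, "a"),
    Gobbet("u1", 0, "b"),
    Gobbet("u2", 3, "c"),
    Gobbet("u1", 0, "d"),
    Gobbet("u2", 4, "some badword here"),
]
PARAGRAPHS = {("u1", 0): "a b d, the full paragraph"}
resolved = resolve_context(INPUT, PARAGRAPHS)
```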
a. qt_head_exact_phrase
b. qt_head_phrase
c. qt_head
d. qt_url
e. qt_phrase
f. qt_weak_phrase
g. qt_implied
h. qt_mixed_and
i. qt_mixed_implied_and
j. qt_and
k. qt_or
l. qt_widget
m. qt_tophit
n. qt_video
o. qt_image
p. qt_keyval
The fact query module 25 evaluates these queries in priority order (a) through (p), either sequentially or concurrently, and stops when it has found a sufficient number of useful gobbets. The number of gobbets considered “sufficient” can be determined empirically and can be set to any finite value.
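The sequential form of this early-stopping evaluation can be sketched as below. The evaluate callback, which is assumed to return the gobbets found for one query type, and the default threshold of 10 are hypothetical. Concurrent evaluation is not shown.

```python
from typing import Callable, Hashable, Iterable, List

# Query types in the priority order (a) through (p) listed above.
QUERY_TYPES = [
    "qt_head_exact_phrase", "qt_head_phrase", "qt_head", "qt_url",
    "qt_phrase", "qt_weak_phrase", "qt_implied", "qt_mixed_and",
    "qt_mixed_implied_and", "qt_and", "qt_or", "qt_widget",
    "qt_tophit", "qt_video", "qt_image", "qt_keyval",
]


def run_fact_query(
    evaluate: Callable[[str, str], Iterable[Hashable]],
    user_query: str,
    sufficient: int = 10,
) -> List[Hashable]:
    """Try each query type in priority order until enough gobbets are found."""
    found: List[Hashable] = []
    seen = set()
    for query_type in QUERY_TYPES:
        for gobbet in evaluate(query_type, user_query):
            if gobbet not in seen:  # ignore gobbets already returned earlier
                seen.add(gobbet)
                found.append(gobbet)
        if len(found) >= sufficient:
            break  # enough useful gobbets; skip lower-priority query types
    return found


# Hypothetical stand-in for the real per-type evaluation.
CALLS: List[str] = []
FAKE_RESULTS = {"qt_head_phrase": [1, 2], "qt_head": [2, 3], "qt_url": [4]}


def fake_evaluate(query_type: str, user_query: str) -> List[int]:
    CALLS.append(query_type)
    return FAKE_RESULTS.get(query_type, [])


gobbets_found = run_fact_query(fake_evaluate, "highest city in the world", sufficient=3)
```

In this example the loop stops after the third query type, so the lower-priority types are never evaluated.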
The processor(s) 121 is/are the central processing unit (CPU) of the processing system 120 and, thus, control the overall operation of the processing system 120. In certain embodiments, the processor(s) 121 accomplish this by executing software or firmware stored in memory 122. In other embodiments, a processor 121 can be special-purpose, hardwired (non-programmable) circuitry. Thus, a processor 121 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs), programmable controllers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), trusted platform modules (TPMs), or the like, or a combination of such devices.
The memory 122 is or includes the main memory of the processing system 120. The memory 122 represents any form of random access memory (RAM), read-only memory (ROM), flash memory, or the like, or a combination of such devices. In use, the memory 122 may contain, among other things, code 126 for executing some or all of the operations described above.
Also connected to the processor(s) 121 through the interconnect 123 are a network adapter 124 and a storage adapter 125. The network adapter 124 provides the processing system 120 with the ability to communicate with remote devices, such as a client system 3, over the network 2 and may be, for example, an Ethernet adapter or Fibre Channel adapter. The storage adapter 125 allows the processing system 120 to access a mass storage subsystem (not shown) and may be, for example, a Fibre Channel adapter or SCSI adapter. The mass storage subsystem can be used to store, among other things, the verb phrase repository 26, the gobbet index 27 and the gobbet repository 28.
The techniques introduced above can be implemented by programmable circuitry programmed/configured by software and/or firmware, or entirely by special-purpose circuitry, or in a combination of such forms. Such special-purpose circuitry (if any) can be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), etc.
Software or firmware to implement the techniques introduced here may be stored on a machine-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors.
A “machine-readable medium”, as the term is used herein, includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA), manufacturing tool, any device with one or more processors, etc.). For example, a machine-accessible medium includes recordable/non-recordable media (e.g., read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; etc.), etc.
References in this specification to “an embodiment”, “one embodiment”, or the like, mean that the particular feature, structure or characteristic being described is included in at least one embodiment of the present invention. Occurrences of such phrases in this specification do not necessarily all refer to the same embodiment. On the other hand, different embodiments may not be mutually exclusive either.
Although the present invention has been described with reference to specific exemplary embodiments, it will be recognized that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense.
This application claims the benefit of U.S. Provisional Patent Application No. 61/295,532, filed on Jan. 15, 2010, which is incorporated herein by reference.
Number | Date | Country
---|---|---
61295532 | Jan 2010 | US