This application claims priority to Indian Patent Application No. 1007/CHE/2014, filed on Feb. 27, 2014, the content of which is incorporated by reference herein in its entirety.
Search query expansion may be used to reformulate an initial search query to include related search queries, and to use the initial search query and the related search queries when performing a search. Search query expansion may improve the relevance of search results.
According to some possible implementations, a device may receive information that identifies a search query to be used to search a text. The search query may include a first multi-word term. The device may determine, based on user input, one or more search query expansion techniques to be performed to expand the search query. The device may perform the one or more search query expansion techniques to generate a set of expanded search queries based on the search query and the text. The set of expanded search queries may include a second multi-word term. The device may search the text, using the set of expanded search queries, to identify a plurality of sections of the text that include an expanded search query included in the set of expanded search queries. The device may provide search results that identify the plurality of sections of the text based on searching the text.
According to some possible implementations, a computer-readable medium may store one or more instructions that, when executed by one or more processors, cause the one or more processors to: receive information that identifies a search query to be used to search a text; provide information that identifies a plurality of search query expansion techniques for expanding the search query; receive a selection of one or more search query expansion techniques, of the plurality of search query expansion techniques, to be performed to expand the search query; perform the one or more search query expansion techniques to generate a set of expanded search queries based on the search query and the text; search the text, using the set of expanded search queries, to identify a plurality of sections of the text that include an expanded search query included in the set of expanded search queries; and provide search results that identify the plurality of sections of the text based on searching the text.
According to some possible implementations, a method may include receiving, by a device, information that identifies a search query to be used to search a text. The method may include determining, by the device, one or more search query expansion techniques to be performed to expand the search query. The method may include performing, by the device, the one or more search query expansion techniques using the search query and the text. The method may include determining, by the device, a plurality of expanded search queries based on performing the one or more search query expansion techniques, where one or more of the plurality of expanded search queries are included in the text. The method may include providing, by the device, information that identifies a set of expanded search queries included in the plurality of expanded search queries. The method may include receiving, by the device, input that modifies the set of expanded search queries. The method may include generating, by the device, a modified set of expanded search queries based on the input that modifies the set of expanded search queries. The method may include searching the text, by the device and using the modified set of expanded search queries, to identify a plurality of sections of the text that include an expanded search query included in the modified set of expanded search queries. The method may include providing, by the device, search results that identify the plurality of sections of the text based on searching the text.
The following detailed description of example implementations refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements.
A text document may include terms that are related to one another, but not identical. A user may use a search query to search the text document for a particular term. Search results generated using the search query may be limited to results that include the particular term (e.g., an exact match), and may not include results that include related terms (e.g., non-exact matches). Thus, search results generated in this manner may be incomplete, in that the search results may omit results that are relevant but that do not include an exact match of the search query.
Implementations described herein permit a user to select one or more techniques to expand an initial search query to include a set of related search queries. To perform the techniques, a client device may search a text document, using the initial search query, to identify terms that are related to the initial search query (e.g., terms that may be misspellings of the search query, terms that may be proper spellings of the search query, terms that may be semantically related to the search query, terms that may be aliases of the search query, terms that are contained within the search query, etc.). The related terms may be included in the set of related search queries, which may be used to search the text document to generate search results. The search results may identify sections of the text document that include a search query included in the set of related search queries. In this way, the user may discover a section of the text document that is relevant to a search query, even though the section may not include a term that is an exact match of the search query.
As further shown in
As shown in
Client device 210 may include one or more devices capable of receiving, generating, storing, processing, and/or providing text and/or information associated with text (e.g., a search query, a set of expanded search queries, etc.). For example, client device 210 may include a computing device, such as a desktop computer, a laptop computer, a tablet computer, a server, a mobile phone (e.g., a smart phone, a radiotelephone, etc.), or a similar device. In some implementations, client device 210 may receive a search query, and may process text to expand the search query. Additionally, or alternatively, client device 210 may search text using a search query and/or a set of expanded search queries. In some implementations, client device 210 may receive information from and/or transmit information to server device 220 (e.g., text and/or information associated with text).
Server device 220 may include one or more devices capable of receiving, generating, storing, processing, and/or providing text and/or information associated with text. For example, server device 220 may include a computing device, such as a server, a desktop computer, a laptop computer, a tablet computer, or a similar device.
Network 230 may include one or more wired and/or wireless networks. For example, network 230 may include a cellular network, a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), an ad hoc network, an intranet, the Internet, a fiber optic-based network, and/or a combination of these or other types of networks.
The number of devices and networks shown in
Bus 310 may include a component that permits communication among the components of device 300. Processor 320 may include a processor (e.g., a central processing unit, a graphics processing unit, an accelerated processing unit), a microprocessor, and/or any processing component (e.g., a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), etc.) that interprets and/or executes instructions. Memory 330 may include a random access memory (RAM), a read only memory (ROM), and/or another type of dynamic or static storage device (e.g., a flash, magnetic, or optical memory) that stores information and/or instructions for use by processor 320.
Input component 340 may include a component that permits a user to input information to device 300 (e.g., a touch screen display, a keyboard, a keypad, a mouse, a button, a switch, etc.). Output component 350 may include a component that outputs information from device 300 (e.g., a display, a speaker, one or more light-emitting diodes (LEDs), etc.).
Communication interface 360 may include a transceiver-like component, such as a transceiver and/or a separate receiver and transmitter, that enables device 300 to communicate with other devices, such as via a wired connection, a wireless connection, or a combination of wired and wireless connections. For example, communication interface 360 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi interface, a cellular network interface, or the like.
Device 300 may perform one or more processes described herein. Device 300 may perform these processes in response to processor 320 executing software instructions included in a computer-readable medium, such as memory 330. A computer-readable medium is defined herein as a non-transitory memory device. A memory device includes memory space within a single physical storage device or memory space spread across multiple physical storage devices.
Software instructions may be read into memory 330 from another computer-readable medium or from another device via communication interface 360. When executed, software instructions stored in memory 330 may cause processor 320 to perform one or more processes described herein. Additionally, or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to perform one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.
The number of components shown in
As shown in
A search query may include a term. A term, as used herein, may refer to a set of characters, such as a single character, multiple characters (e.g., a character string), a combination of characters (e.g., in a particular order) that form a word, a combination of characters that form multiple words (e.g., a phrase, a sentence, a paragraph, etc.), a combination of characters that form an acronym, a combination of characters that form an abbreviation of a word, a combination of characters that form a misspelled word, etc. In some implementations, a user may denote a multi-word term included in a search query by using a delimiter, such as single quotes (e.g., ‘multi-word term’), double quotes (e.g., “multi-word term”), or the like. Additionally, or alternatively, client device 210 may determine separate words or strings included in a search query based on a delimiter, such as a space, a comma, a semicolon, a single quote, a double quote, the word “and,” the word “or,” etc.
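The delimiter handling described above can be sketched as follows. This is a minimal illustration, not the described implementation: the function names `parse_query` and `_split_plain` are hypothetical, and the sketch assumes straight (ASCII) quotes rather than typographic quotes.

```python
import re

# Quoted spans are kept together as multi-word terms.
_QUOTED = re.compile(r'"([^"]+)"|\'([^\']+)\'')
# Unquoted text is split on whitespace, commas, and semicolons.
_SPLIT = re.compile(r"[\s,;]+")


def _split_plain(text: str) -> list:
    """Split unquoted text into words, dropping the delimiter words 'and'/'or'."""
    return [w for w in _SPLIT.split(text) if w and w.lower() not in {"and", "or"}]


def parse_query(query: str) -> list:
    """Return the terms of a query, preserving quoted multi-word terms."""
    terms = []
    pos = 0
    for match in _QUOTED.finditer(query):
        terms.extend(_split_plain(query[pos:match.start()]))
        terms.append(match.group(1) or match.group(2))
        pos = match.end()
    terms.extend(_split_plain(query[pos:]))
    return terms


print(parse_query("design and management of 'web site'"))
```

An apostrophe inside a word (e.g., a contraction) would be misread as a delimiter by this sketch, so a fuller implementation may need a stricter quoting rule.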
As further shown in
In some implementations, a user may input information identifying the text or a memory location at which the text is stored (e.g., local to and/or remote from client device 210). Based on the user input, client device 210 may retrieve the text. Additionally, or alternatively, client device 210 may provide a user interface via which a user may input text, and client device 210 may obtain the text based on the information input via the user interface.
Client device 210 may use the text to expand an initial search query, as described in more detail elsewhere herein. For example, client device 210 may apply one or more search query expansion techniques that use the initial search query and the text to expand the initial search query.
As further shown in
A search query expansion technique may include, for example, a misspelling analysis that determines whether a first term (e.g., a term included in the text) and a second term (e.g., a term included in the search query) are potential misspellings of one another, a semantic relatedness analysis that identifies a measure of semantic relatedness between the first term and the second term, an alias analysis that determines whether the first term and the second term are alias terms, a containment analysis that determines whether a set of characters included in the first term is also included in the second term, etc. Client device 210 may receive information that identifies one or more of these or other search query expansion techniques to be performed to expand an initial search query.
For example, client device 210 may provide, via a user interface, information that identifies a set of search query expansion techniques. Client device 210 may receive user input that specifies one or more (e.g., a subset, the entire set, etc.) of the search query expansion techniques to be performed to expand the initial search query. Client device 210 may apply the specified search query expansion technique(s) to the initial search query and a specified text to generate a set of expanded search queries, as described in more detail elsewhere herein.
Additionally, or alternatively, client device 210 may receive information that identifies weights to be assigned to different search query expansion techniques. Client device 210 may assign the identified weights to the search query expansion techniques when expanding the search query and/or scoring search results, as described in more detail elsewhere herein. Client device 210 may assign a same weight value or different weight values to different search query expansion techniques. Additionally, or alternatively, client device 210 may calculate a score using a set of search query expansion techniques and/or weight values, and may use the score to expand an initial search query and/or score a search result. In some implementations, the weight value may weight a search result relative to a perfect match (e.g., where every character of a search result matches every character of a search query). For example, a weight value of 0.5 for a semantic relatedness technique may give terms determined to be semantically related half as much weight as if the terms were perfect matches. In some implementations, a user may input a weight value as any real number in a range of numbers greater than or equal to zero.
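As one possible illustration of the weighting described above, the sketch below merges user-supplied weights with defaults and scales a match relative to a perfect match. The technique names, the default values other than the 0.5 semantic relatedness example, and the function names are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Hypothetical defaults; a perfect (exact) match carries a weight of 1.0.
DEFAULT_WEIGHTS = {
    "exact": 1.0,
    "misspelling": 0.8,
    "semantic": 0.5,
    "alias": 0.7,
    "containment": 0.6,
}


def resolve_weights(user_weights: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Overlay user-supplied weights on the defaults; weights must be >= 0."""
    weights = dict(DEFAULT_WEIGHTS)
    for technique, value in (user_weights or {}).items():
        if value < 0:
            raise ValueError(f"weight for {technique!r} must be >= 0")
        weights[technique] = float(value)
    return weights


def weighted_score(match_strength: float, technique: str, weights: Dict[str, float]) -> float:
    """Scale a match strength (1.0 for a perfect match) by the technique weight."""
    return weights[technique] * match_strength


weights = resolve_weights({"semantic": 0.5})
print(weighted_score(1.0, "semantic", weights))  # half the weight of a perfect match
```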
As further shown in
A search result option may include, for example, an indication of a quantity of search results to be provided for display (e.g., for each initial search query), an indication of whether search results are to be ranked (and/or how search results are to be ranked), an indication of whether relevance scores are to be provided for display in association with search queries (e.g., a relevance score that indicates a degree to which a search result is relevant to an initial search query), an indication of whether search results are to overlap across multiple queries (e.g., whether the same search result, or section of a text, is to be provided as a result for more than one initial search query), etc.
As further shown in
Although
As shown in
As shown by reference number 520, the user may interact with a delete query input mechanism to remove a search query from the set of initial search queries to be expanded. As shown by reference number 525, the user may interact with an update query input mechanism to modify an initial search query included in the set of initial search queries to be expanded. For example, the user may select “design and management of ‘web sit’,” may select the update query input mechanism, and may correct the misspelled search query to “design and management of ‘web site.’”
As shown by reference number 530, the user may select a text to be used to expand the initial search queries. As shown, assume that the user identifies two documents, shown as “Document A” and “Document B,” to be used to expand the set of initial search queries. As further shown, assume that the user interacts with an input mechanism to continue to configure search options.
As shown in
As shown by reference number 550, assume that the user specifies that client device 210 is to perform a containment analysis, a misspelling analysis, a semantic relatedness analysis, and an alias analysis to expand the initial search queries. As shown by reference number 555, assume that the user inputs different weight values to be applied to the different query expansion techniques to expand the initial search queries and/or score search results. In some implementations, the user may select to apply default weight values to the search query expansion techniques, and client device 210 may determine default weight values to be applied to the search query expansion techniques (e.g., based on information stored in a data structure).
As shown by reference number 560, the user may interact with an add or update queries input mechanism to return to the user interface shown in
As indicated above,
As shown in
Determining to prepare the text may include determining text sections to be processed, in some implementations. For example, client device 210 may partition the text into sections, and may process particular sections of the text. In some implementations, client device 210 may determine sections of the text to process based on a user interaction, based on an indication from server device 220, or the like.
As further shown in
As further shown in
In some implementations, client device 210 may determine one or more unique identifiers to be associated with sections of the text. In some implementations, client device 210 may generate a data structure storing section identifiers. For example, client device 210 may generate a list of section identifiers D of size d (e.g., with d elements), where d is equal to the number of unique sections in the text (e.g., where section identifier list D=[Sec1, Sec2, . . . , Secd]). In some implementations, client device 210 may label sections of the text based on processing the text. For example, client device 210 may process the text to identify the sections (e.g., based on a delimiter). Additionally, or alternatively, client device 210 may receive an indication of the sections, such as a set of section tags, a user identification of the sections, or the like.
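A minimal sketch of delimiter-based sectioning is shown below. It assumes that a blank line separates sections, which is only one possible delimiter, and the function name `split_into_sections` is hypothetical.

```python
import re


def split_into_sections(text: str, delimiter: str = r"\n\s*\n") -> dict:
    """Partition text on a delimiter pattern (blank lines by default, an
    assumption) and label each non-empty section Sec1, Sec2, ..."""
    parts = [p.strip() for p in re.split(delimiter, text) if p.strip()]
    return {f"Sec{i}": part for i, part in enumerate(parts, start=1)}


text = (
    "The report generation subsystem logs the process steps.\n\n"
    "The analytics module collects data generated by RepGenMod.\n\n"
    "The report processing module performs log reporting."
)
sections = split_into_sections(text)
D = list(sections)  # list of section identifiers
print(D, len(D))
```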
As further shown in
As an example, client device 210 may receive a list of part-of-speech tags (POS tags) and tag association rules for tagging words in the text with the POS tags based on the part-of-speech of the word. Example part-of-speech tags include NN (noun, singular or mass), NNS (noun, plural), NNP (proper noun, singular), NNPS (proper noun, plural), VB (verb, base form), VBD (verb, past tense), VBG (verb, gerund or present participle), VBP (verb, non-third person singular present tense), VBZ (verb, third person singular present tense), VBN (verb, past participle), RB (adverb), RBR (adverb, comparative), RBS (adverb, superlative), JJ (adjective), JJR (adjective, comparative), JJS (adjective, superlative), etc.
As an example, client device 210 may receive text that includes the following sentence:
Client device 210 may tag the sentence with POS tags, as follows:
In the above tagged sentence, DT may represent a determiner tag (e.g., used to tag articles like a, an, and the), NN may represent a singular noun or mass noun tag (e.g., used to tag singular or mass nouns), and VB may represent a base-form verb tag (e.g., used to tag verbs in base form). These tags are provided as an example, and client device 210 may use additional or other tags in some implementations, as described elsewhere herein.
In some implementations, client device 210 may further process the tagged text to associate additional or alternative tags with groups of words that meet certain criteria. For example, client device 210 may associate an entity tag (e.g., ENTITY) with noun phrases (e.g., consecutive words with a noun tag, such as /NN, /NNS, /NNP, /NNPS, etc.). Client device 210 may apply entity tags and/or action tags to the tagged text, as follows:
As can be seen, the nouns “gasoline” and “engine” have been combined into a single term “gasoline engine” (e.g., set off by braces { }), and have been tagged with an entity tag. In some implementations, client device 210 may only process terms with particular tags, such as noun tags, entity tags, verb tags, etc., when expanding a search query and/or performing a search.
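The entity tagging step can be sketched as merging runs of consecutive noun-tagged words. The sketch assumes that a run needs at least two words to become an entity, and the input sentence is a hypothetical example rather than the sentence from the text.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def tag_entities(tagged_words):
    """Merge each run of two or more consecutive noun-tagged words into a
    single term tagged ENTITY; leave all other words unchanged."""
    result, run = [], []

    def flush():
        # Emit the pending noun run, merged if it has at least two words.
        if len(run) >= 2:
            result.append((" ".join(word for word, _ in run), "ENTITY"))
        else:
            result.extend(run)
        run.clear()

    for word, tag in tagged_words:
        if tag in NOUN_TAGS:
            run.append((word, tag))
        else:
            flush()
            result.append((word, tag))
    flush()
    return result


# Hypothetical tagged sentence used only for illustration.
sentence = [("the", "DT"), ("gasoline", "NN"), ("engine", "NN"),
            ("powers", "VBZ"), ("the", "DT"), ("generator", "NN")]
print(tag_entities(sentence))
```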
As further shown in
In some implementations, client device 210 may receive information that identifies stop tags or stop terms. The stop tags may identify tags associated with terms that are not to be included in the list of unique terms. Similarly, the stop terms may identify terms that are not to be included in the list of unique terms. When generating the list of unique terms, client device 210 may only add terms to the list that are not associated with a stop tag or identified as a stop term.
Additionally, or alternatively, client device 210 may convert terms to a root form when adding the terms to the list of unique terms. For example, the terms “processes,” “processing,” “processed,” and “processor” may all be converted to the root form “process.” Similarly, the term “devices” may be converted to the root form “device.” Thus, when adding terms to the list of unique terms, client device 210 may convert the terms “processing device,” “processed devices,” and “processor device” into the root form “process device.” Client device 210 may add the root term “process device” to the list of unique terms.
Generating a term corpus may include generating a data structure that stores terms extracted from the text, in some implementations. For example, client device 210 may generate a list of terms TermList of size t (e.g., with t elements), where t is equal to the number of unique terms in the text (e.g., where unique terms list TermList=[term1, term2, . . . , termt]).
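The unique term list might be built as in the sketch below. The lookup table `ROOTS` is a hypothetical stand-in for a real stemmer or lemmatizer, and the stop tags and stop terms shown are assumed examples.

```python
# Hypothetical root-form lookup standing in for a real stemmer.
ROOTS = {
    "processes": "process", "processing": "process",
    "processed": "process", "processor": "process",
    "devices": "device",
}
STOP_TERMS = {"the", "of", "and", "a", "an"}  # assumed stop terms
STOP_TAGS = {"DT", "IN"}                      # assumed stop tags


def to_root(term: str) -> str:
    """Convert each word of a (possibly multi-word) term to its root form."""
    return " ".join(ROOTS.get(word, word) for word in term.lower().split())


def build_term_list(tagged_terms):
    """Return the unique root-form terms in first-seen order, skipping any
    term that carries a stop tag or is a stop term."""
    term_list, seen = [], set()
    for term, tag in tagged_terms:
        if tag in STOP_TAGS or term.lower() in STOP_TERMS:
            continue
        root = to_root(term)
        if root not in seen:
            seen.add(root)
            term_list.append(root)
    return term_list


TermList = build_term_list([
    ("the", "DT"),
    ("processing device", "ENTITY"),
    ("processed devices", "ENTITY"),
    ("processor device", "ENTITY"),
])
print(TermList)
```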
As further shown in
As further shown in
where C[i,j] represents the co-occurrence matrix value (e.g., a frequency quantity) for a particular term in a particular section, d represents the total number of sections, and n_i represents the number of sections that include term_i.
In some implementations, when client device 210 determines that a semantic relatedness analysis is to be performed, client device 210 may map the co-occurrence matrix to a lower-dimensional latent semantic space. The lower-dimensional latent semantic space may represent a semantic relatedness between terms and sections included in the text. For example, terms and sections with a high semantic relatedness may map to closer locations in the lower-dimensional latent semantic space than terms and sections with a low semantic relatedness.
As an example, client device 210 may apply singular value decomposition (SVD) to co-occurrence matrix C, to determine matrices U, Σ, and V^T, such that:
C = UΣV^T,
where C represents the co-occurrence matrix (e.g., with or without merged rows and/or with or without adjusted values), U represents a t×t unitary matrix, Σ represents a t×d rectangular diagonal matrix with nonnegative real numbers on the diagonal, and V^T (the conjugate transpose of V) represents a d×d unitary matrix. The diagonal values of Σ (e.g., Σ[i,i]) may be referred to as the singular values of matrix C.
Client device 210 may determine a truncation value k for reducing the size of matrix U, which may be useful for calculating a semantic relatedness score for two terms, as discussed in more detail elsewhere herein. Client device 210 may determine a quantity of non-zero singular values (e.g., the quantity of non-zero entries in Σ), which may be referred to as the rank r of matrix C, and may set the truncation value k equal to the rank r of matrix C. Alternatively, client device 210 may set the truncation value k equal to (t×d)^0.2. In some implementations, client device 210 may set the truncation value k as follows:
Client device 210 may truncate the matrix U by removing columns from U that are not included in the first k columns (e.g., the truncated matrix U may only include columns 1 through k of the original matrix U). The rows in truncated matrix U may correspond to term vectors in the latent semantic analysis (LSA) space.
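In matrix notation, the decomposition and truncation described above may be summarized as follows. The summary assumes that C has real entries (consistent with frequency counts), and the symbols U_k and u_j (the j-th column of U) are labels introduced here for illustration.

```latex
C = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{t \times t},\quad
\Sigma \in \mathbb{R}^{t \times d},\quad
V^{T} \in \mathbb{R}^{d \times d}

U_k = \begin{bmatrix} u_1 & u_2 & \cdots & u_k \end{bmatrix} \in \mathbb{R}^{t \times k}

\text{term vector for } \mathrm{term}_i = (U_k)_{i,:} \in \mathbb{R}^{k}
```

Each row of U_k thus gives a k-dimensional vector for one term, so two terms may be compared in a space with k dimensions rather than t.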
As further shown in
Although
As shown in
The report generation subsystem logs the process steps.
The analytics module collects data generated by RepGenMod.
As further shown, Document B may include the following text:
The report processing module performs log reporting.
The text in Document B may be tagged as follows:
The {report processing module}/SYSTEM performs {log reporting}/PROCESS.
Thus, the term “report processing module” may be tagged with a SYSTEM tag, and the term “log reporting” may be tagged with a PROCESS tag.
As further shown, the POS Tag List may include the following tags:
NN: Noun, singular or mass
NNS: Noun, plural
VB: Verb, base form
VBD: Verb, past tense
DT: Determiner
IN: Preposition.
The User Tag List may include the following tags:
SYSTEM: System entity
PROCESS: Process entity
The User Tag List may be associated with Document B (e.g., the SYSTEM and PROCESS tags). Client device 210 may receive text 702 and tag lists 704, and may process text 702 using tag lists 704.
As shown in
Client device 210 may process the text sections to associate the tags from tag lists 704 with words in the text sections. For example, client device 210 may tag section 1 as follows:
Client device 210 may tag section 2 and section 3 in a similar manner, as shown in tagged text sections 708. When associating tags with section 3, client device 210 may, in some implementations, tag untagged terms (e.g., “the” and “performs”). Additionally, or alternatively, client device 210 may ignore terms that have already been tagged (e.g., “report processing module” and “log reporting”), or may add additional tags (e.g., POS tags) to terms that have already been tagged.
As shown by reference number 710, client device 210 may process tagged text sections 708 by tagging noun phrases with an entity tag to generate entity-tagged text sections 712. A noun phrase may include two or more consecutive words that have each been tagged with a noun tag (e.g., NN, NNS, etc.). For example, client device 210 may tag the noun phrase “report/NNS generation/NN subsystem/NN” with an entity tag, and may optionally remove the noun tags, to generate the entity-tagged phrase “{report generation subsystem}/ENTITY.” Similarly, client device 210 may tag the noun phrases “process steps” and “analytics module” with an entity tag, as shown.
As shown in
As further shown in
As shown in
As shown in
In some implementations, client device 210 may only add terms to unique term list 740 (e.g., from a root temporary list) that are not already included in unique term list 740. For example, assume that unique term list 740 includes the term “report generate subsystem,” as shown. Based on unique term list 740 including the term “report generate subsystem,” client device 210 may not add terms with the same roots (e.g., “report generate subsystem”) to unique term list 740, such as the terms “report generating subsystems,” “reporting generation subsystem,” or the like.
As indicated above,
As shown in
Client device 210 may generate a query term list QTokens that includes a list of terms included in the initial search query (e.g., a single term in the case of a single-word search query, a single word and/or multiple words in the case of a multi-word search query, etc.). In some implementations, client device 210 may prevent terms associated with a stop tag or identified as a stop term from being included in the query term list (e.g., of, the, and, etc.). Additionally, or alternatively, client device 210 may convert terms to a root form when adding the terms to the query term list. In some implementations, when client device 210 determines that there are multiple initial search queries to be expanded (e.g., when a user has input multiple search queries), client device 210 generates a query term list QTokens[q] for each initial search query q. In some implementations, the query term list may identify a quantity of times that each term is included in a search query.
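One way to represent a query term list that also records term counts is a counter keyed by root-form term, as sketched below. The stop terms come from the examples in the text. The function names are hypothetical, and lowercasing is used here as a placeholder for root-form conversion.

```python
from collections import Counter

STOP_TERMS = {"of", "the", "and"}  # example stop terms from the text


def build_query_tokens(query_terms, to_root=str.lower):
    """Return a Counter mapping each root-form query term to the number of
    times it appears in the query, excluding stop terms."""
    tokens = Counter()
    for term in query_terms:
        root = to_root(term)
        if root not in STOP_TERMS:
            tokens[root] += 1
    return tokens


def build_all_query_tokens(queries):
    """Build QTokens[q] for each initial search query q."""
    return {q: build_query_tokens(terms) for q, terms in queries.items()}


QTokens = build_all_query_tokens({
    "q1": ["design", "management", "of", "web site"],
    "q2": ["report", "Report", "module"],
})
print(QTokens["q1"], QTokens["q2"])
```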
As further shown in
A search query expansion technique may include a misspelling analysis, a semantic relatedness analysis, an alias analysis, a containment analysis, etc. Except as otherwise described herein, client device 210 may perform a single search query expansion technique, or may perform any combination of multiple search query expansion techniques. When performing a combination of multiple search query expansion techniques, client device 210 may perform the multiple search query expansion techniques in any order, except as otherwise described herein.
As further shown in
In some implementations, client device 210 may use a language database, such as a dictionary, to determine whether a term is a misspelled term. When a term is included in the language database, client device 210 may determine that the term is not a misspelled term. When the first term or the second term is not included in the language database, client device 210 may calculate a Levenshtein distance (e.g., edit distance) of the terms to determine whether the terms are misspelled terms.
Levenshtein distance may refer to the smallest number of insertion, deletion, and/or substitution operations required to modify a first term to generate a second term. For example, the terms “environment” and “wenvironment” have a Levenshtein distance of one (e.g., an insertion of a single character “w” at the beginning of the term “environment”). Similarly, the terms “environment” and “nvironment” have a Levenshtein distance of one (e.g., a deletion of a single character “e” at the beginning of the term “environment”). As another example, the terms “environment” and “winvironment” have a Levenshtein distance of two (e.g., an insertion of “w” and a substitution of “e” with “i”).
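The Levenshtein distance can be computed with a standard two-row dynamic programming routine, sketched below. The helper `is_possible_misspelling` and its default threshold of 1 are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_possible_misspelling(a: str, b: str, threshold: int = 1) -> bool:
    """Treat two distinct terms as possible misspellings of one another when
    their distance is at most the (assumed) threshold."""
    return 0 < levenshtein(a, b) <= threshold


print(levenshtein("environment", "wenvironment"))
print(levenshtein("environment", "winvironment"))
```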
In some implementations, client device 210 may determine that terms are misspelled terms if the Levenshtein distance of the terms satisfies a threshold value (e.g., if the Levenshtein distance is less than a threshold value, such as 2, and/or is equal to a threshold value, such as 1). Additionally, or alternatively, when client device 210 analyzes multi-word terms, client device 210 may determine that the multi-word terms are misspelled terms if the average Levenshtein distance of the words included in the multi-word terms satisfies a threshold value (e.g., is equal to one, is less than or equal to two, etc.). When the terms include a different quantity of words, client device 210 may only consider corresponding words when calculating the average Levenshtein distance, in some implementations.
For example, consider two multi-word terms:
For example, consider two terms with three words each:
Based on this calculation, client device 210 may determine that the terms {Environmental Protection Agency} and {Envirnmental Protecton Agency} are misspelled terms (e.g., since βavg≦1).
Additionally, or alternatively, when the terms include a different number of words, client device 210 may compare the number of words in each term to determine whether the terms are misspelled terms. For example, client device 210 may determine the difference between the number of words in the terms (e.g., m−n), and may determine that the terms are possible misspelled terms when the difference is less than or equal to a threshold (e.g., 1). For example, consider the terms:
In the above example, the number of words n in u1 is equal to 3 (e.g., n=3), and the number of words m in u2 is equal to 4 (e.g., m=4). Client device 210 may determine that the difference between the number of words in u1 and u2 satisfies a threshold (e.g., m−n≤1). Based on this determination, client device 210 may remove words from the larger term (e.g., u2) that do not correspond to words in the smaller term (e.g., u1), and may determine the average Levenshtein distance of the remaining words. For example, client device 210 may remove the word “US” from u2, and may determine the average Levenshtein distance between {Environmental Protection Agency} and {Envirnmental Protecton Agency}, as described above.
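The multi-word comparison can be sketched as below. The text does not state how corresponding words are identified, so the sketch assumes a greedy rule that pairs each word of the shorter term with its closest unused word in the longer term. The function names and default thresholds are likewise assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def average_distance(u1: str, u2: str, max_word_diff: int = 1):
    """Average per-word Levenshtein distance between two multi-word terms,
    or None when the word counts differ by more than max_word_diff."""
    short, long_ = sorted((u1.lower().split(), u2.lower().split()), key=len)
    if not short or len(long_) - len(short) > max_word_diff:
        return None
    remaining = list(long_)
    total = 0
    for word in short:
        # Assumed alignment rule: pair with the closest unused word.
        best = min(remaining, key=lambda other: levenshtein(word, other))
        total += levenshtein(word, best)
        remaining.remove(best)
    return total / len(short)


def are_possible_misspellings(u1: str, u2: str, threshold: float = 1.0) -> bool:
    """True when the average distance exists and is at most the threshold."""
    average = average_distance(u1, u2)
    return average is not None and average <= threshold


print(average_distance("US Environmental Protection Agency",
                       "Envirnmental Protecton Agency"))
```

In this example the unmatched word “US” is left over and ignored, and the remaining word pairs have distances of 1, 1, and 0, for an average of 2/3.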
As further shown in
In some implementations, client device 210 may use a language database to determine whether two terms are semantically related. The language database may include, for example, an indication of sets of words that are synonyms of one another (e.g., SynSets in a WordNet database). Client device 210 may receive a list of term pairs u1, u2 where u1 is included in QTokens[q] and u2 is included in TermList. Client device 210 may use the language database to compare pairs of terms u1, u2. For example, client device 210 may compare terms u1 and u2:
Client device 210 may determine whether at least one word in u1 and at least one word in u2 are included in the language database (e.g., in a WordNet database, which indicates that the word is a dictionary word). If not, client device 210 may skip a glossary analysis of the term. Otherwise, client device 210 may perform the glossary analysis.
To perform the glossary analysis, client device 210 may use the language database to determine a list of synonyms for each word of u1 and u2 based on the tags of u1 and u2 (e.g., part of speech tags). For example, client device 210 may generate a list Lij that includes synonym pairs, for the word wij, that correspond to tag pij. For each pair (w1i, w2j) from the list (w11, w21), . . . , (w1n, w2m), client device 210 may calculate:
In other words, client device 210 may set rij equal to one when L1i∩L2j is not an empty set. Otherwise, client device 210 may set rij equal to zero.
Client device 210 may determine a synonym score s for the pair of terms. In some implementations, the synonym score s may be based on a quantity of times that at least one synonym of a word in the first term u1 matches a synonym of a word (or the word itself) in the second term u2. For example, the synonym score s may include a quantity of times that a synonym set of a word in the first term u1 shares a word with (e.g., overlaps with) a synonym set of a word in the second term u2. In some implementations, a synonym set of a word may include the word. As another example, the synonym score s may include a quantity of shared synonyms between words in the first term u1 and words in the second term u2. In some implementations, client device 210 may calculate the synonym score as follows:
Client device 210 may compare the synonym score (e.g., γ or s) to a glossary threshold δ (e.g., a threshold between 0 and 1). In some implementations, client device 210 may determine the glossary threshold δ based on user input. Additionally, or alternatively, client device 210 may determine the glossary threshold δ based on characteristics of the terms u1 and/or u2.
In some implementations, client device 210 may determine the glossary threshold δ based on a quantity of words of the first and/or second term that are included in the language database. As an example, assume that the first term u1 includes a single word w1, and the second term u2 includes two or more words w2, w3, . . . , wm. If the single word w1 of u1 is included in the language database, client device 210 may set the glossary threshold δ to a first value (e.g., 1). If the single word w1 of u1 is not included in the language database, client device 210 may set the glossary threshold δ to a second value (e.g., 0).
Alternatively, if the single word w1 of the first term u1 is not included in the language database, client device 210 may set the glossary threshold δ based on a quantity of words of the second term u2 included in the language database. For example, if two or more of the words (e.g., if all of the words) of the second term u2 are included in the language database, then client device 210 may set the glossary threshold δ to the first value (e.g., 1). Otherwise, if fewer than two of the words of the second term u2 are included in the language database, then client device 210 may set the glossary threshold δ to the second value (e.g., 0).
In some implementations, client device 210 may determine a quantity of words (e.g., a quantity of unique words) included in both terms (e.g., m+n), and may determine the glossary threshold δ based on the quantity of words included in both terms. For example, client device 210 may set the glossary threshold δ to the first value (e.g., 1) if a quantity or percentage of the words included in the language database satisfies a threshold (e.g., 4 out of 5 words included in the database, where the threshold of 4 is determined by m+n−1). Otherwise, if the quantity or percentage of words included in the language database does not satisfy the threshold, then client device 210 may set the glossary threshold δ to the second value (e.g., 0).
In some implementations, client device 210 may determine a quantity of shared words included in both terms (e.g., a quantity of words included in the first term and also included in the second term), and may determine the glossary threshold δ based on the quantity of shared words. For example, if all of the words of the first term are included in the second term, then client device 210 may set the glossary threshold δ to the second value (e.g., 0). Alternatively, if the quantity of shared words satisfies a threshold (e.g., n−1), then client device 210 may set the glossary threshold δ to the second value (e.g., 0). Otherwise, if the quantity of shared words does not satisfy the threshold, then client device 210 may set the glossary threshold δ to the first value (e.g., 1).
In some implementations, setting the glossary threshold δ to the first value (e.g., 1) may cause client device 210 to prevent the term u2 (e.g., included in TermList) from being added to the list of synonym terms, Lsyn. Conversely, setting the glossary threshold δ to the second value (e.g., 0) may cause client device 210 to add the term u2 to the list of synonym terms, Lsyn.
Once client device 210 has determined the glossary threshold δ, client device 210 may compare the synonym score (e.g., γ or s) to the glossary threshold δ. Based on whether the synonym score, associated with a pair of terms, satisfies the glossary threshold δ, client device 210 may add or prevent addition of the term u2 to the list of synonym terms, Lsyn. For example, if the synonym score is greater than or equal to the glossary threshold, then client device 210 may add the term u2 to the list of synonym terms, Lsyn.
As an example, assume that client device 210 has set the glossary threshold δ=⅔. Further assume that u1={US/NNP Environmental/JJ Protection/NN Agency/NN} and u2={Climatic/JJ Safeguard/NN Bureau/NN}, so that n=4 and m=3. Assume that in the language database, the following pairs of words are determined to be synonyms: {Environmental, Climatic}, {Protection, Safeguard}, and {Agency, Bureau}. Based on determining these three matches, client device 210 may set the synonym score s=3. Client device 210 may calculate a new synonym score γ=(2×s)/(n+m)=6/7. Because 6/7>⅔, client device 210 may add u2 to the list of synonym terms, Lsyn.
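As a non-limiting illustration, the glossary analysis of this example may be sketched as follows. The dictionary TOY_SYNSETS is a hypothetical stand-in for a language database such as WordNet, part-of-speech tags are ignored for brevity, and the function names are assumptions introduced only for this sketch.

```python
# Hypothetical stand-in for a language database; each word maps to a
# synonym set that includes the word itself.
TOY_SYNSETS = {
    "environmental": {"environmental", "climatic"},
    "climatic": {"climatic", "environmental"},
    "protection": {"protection", "safeguard"},
    "safeguard": {"safeguard", "protection"},
    "agency": {"agency", "bureau"},
    "bureau": {"bureau", "agency"},
}


def synonyms(word):
    """Synonym set of a word; a word absent from the database maps to itself only."""
    w = word.lower()
    return TOY_SYNSETS.get(w, {w})


def synonym_score(u1, u2):
    """s: number of word pairs (w1i, w2j) whose synonym sets overlap (rij = 1)."""
    return sum(1 for w1 in u1 for w2 in u2 if synonyms(w1) & synonyms(w2))


def normalized_synonym_score(u1, u2):
    """gamma = (2 x s) / (n + m), where n and m are the word counts of u1 and u2."""
    return 2 * synonym_score(u1, u2) / (len(u1) + len(u2))


def is_synonym_term(u1, u2, glossary_threshold):
    # The term u2 is added to Lsyn when the score is >= the glossary threshold.
    return normalized_synonym_score(u1, u2) >= glossary_threshold


u1 = ["US", "Environmental", "Protection", "Agency"]
u2 = ["Climatic", "Safeguard", "Bureau"]
gamma = normalized_synonym_score(u1, u2)
```

With these inputs the sketch yields s=3 and γ=6/7, so the term u2 would be added to Lsyn for a glossary threshold of ⅔.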
Additionally, or alternatively, client device 210 may calculate a latent semantic similarity score to determine whether two terms are semantically related. Client device 210 may calculate a latent semantic similarity score for a pair of terms u1, u2. The latent semantic similarity score may be calculated as the cosine of the angular distance between the term vectors U[u1] and U[u2], and may be calculated as follows:
The latent semantic similarity score may be calculated as SemSim[i, j]=Cosine(V[i],V[j],k), where V[i] and V[j] are section vectors from the truncated matrix V, where i and j are included in [1, . . . , d], and where i<j. The SemSim score may take a value in the range [−1, 1], where −1 indicates that the terms are antonyms, 0 indicates that the terms are statistically independent, and 1 indicates that the terms are synonyms.
In some implementations, client device 210 may set a semantic threshold value δsem (e.g., 0.9) and/or may receive information that identifies a semantic threshold value. Client device 210 may compare SemSim[u1, u2] to the semantic threshold value to determine whether the term u2 should be added to the list of synonym terms, Lsyn. For example, if SemSim[u1, u2]>δsem, then client device 210 may add the term u2 to the list of synonym terms, Lsyn.
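As a non-limiting illustration, the cosine comparison against the semantic threshold may be sketched as follows. It is assumed here that Cosine(x, y, k) denotes the cosine of the angle between the first k components of the two vectors; the function names and the zero-vector handling are assumptions of this sketch.

```python
import math


def cosine(x, y, k):
    """Cosine of the angle between x and y, using their first k components."""
    xs, ys = x[:k], y[:k]
    dot = sum(a * b for a, b in zip(xs, ys))
    norm_x = math.sqrt(sum(a * a for a in xs))
    norm_y = math.sqrt(sum(b * b for b in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_x * norm_y)


def is_semantically_related(vec1, vec2, k, semantic_threshold=0.9):
    # The term is added to Lsyn when the similarity exceeds the threshold.
    return cosine(vec1, vec2, k) > semantic_threshold
```

For example, is_semantically_related([3, 4], [3, 4], 2) evaluates to True, while two orthogonal vectors such as [1, 0] and [0, 1] produce a similarity of 0 and are not treated as related.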
In some implementations, client device 210 may determine not to perform a semantic relatedness analysis. In this case, client device 210 may calculate a similarity score as SemSim[i, j]=Cosine(CT[i],CT[j],t), where CT[i] and CT[j] are section vectors from the transposed co-occurrence matrix CT, where i and j are included in [1, . . . , d], and where i<j.
As further shown in
As an example, client device 210 may analyze two or more terms to determine whether the terms are short form alias terms (e.g., an acronym, an abbreviation, etc.). In some implementations, client device 210 may use a language database to determine whether the terms are short form alias terms. When a term is included in the language database, client device 210 may determine that the term is not a short form alias term. Alternatively, when a term is included in the language database, client device 210 may determine that the term is a possible short form alias term if the term appears in capital letters in the text (e.g., “ACT” being an acronym), and/or appears before a period in the text (e.g., “pot.” being an abbreviation of potential).
Client device 210 may determine that two terms are short form alias terms by determining that a first term, SF, is shorter in length than the second term, LF (e.g., SF includes a smaller number of characters than LF), and/or by determining that SF and LF begin with the same character (e.g., the same letter). In some implementations, client device 210 may modify SF and/or LF by removing a period from SF and/or LF (e.g., “env. prot. agency” may be modified to “env prot agency”).
In some implementations, client device 210 may determine that SF and LF are short form alias terms based on determining that SF is an acronym of LF. Client device 210 may determine that SF is an acronym of LF by determining that each letter in SF matches a corresponding first letter of each word in LF. For example, client device 210 may determine that “EPA” is a short form alias term of “Environmental Protection Agency” because each letter of “EPA” matches a corresponding first letter of each word in “Environmental Protection Agency.” Additionally, or alternatively, client device 210 may determine that SF includes all capital letters before considering SF as a possible acronym. Additionally, or alternatively, client device 210 may determine that SF appears in the text enclosed by parentheses before considering SF as a possible acronym. Additionally, or alternatively, client device 210 may determine that SF appears in the text within a threshold number of words of LF before considering SF as a possible acronym (e.g., SF appears in parentheses immediately after LF in the text).
In some implementations, client device 210 may determine that SF and LF are short form alias terms based on determining that SF is a prefix of LF. Client device 210 may determine that SF is a prefix of LF by determining that a threshold number of letters at the beginning of SF match corresponding letters at the beginning of LF. For example, client device 210 may determine that “env” is a short form alias term of “environment” because the first three letters of “env” match the first three letters of “environment.” Additionally, or alternatively, client device 210 may determine that SF ends with a period before considering SF as a possible prefix (e.g., an abbreviation).
When LF and/or SF is a multi-word term, client device 210 may determine that SF and LF are short form alias terms based on determining that multiple words in SF are prefixes of corresponding words in LF. Client device 210 may determine that SF is a short form alias term of LF based on a threshold number of words in SF (e.g., all of the words) being prefixes of corresponding words in LF. For example, client device 210 may determine that “env. prot. ag” is a short form alias term of “environmental protection agency” by determining that “env” is a prefix of “environmental,” “prot” is a prefix of “protection,” and “ag” is a prefix of “agency.”
In some implementations, client device 210 may determine that SF and LF are short form alias terms based on determining that SF can be generated from LF by deleting characters from LF. For example, client device 210 may determine that “mtc” is a short form alias term of “matching” because the “mtc” can be generated from “matching” by deleting characters from “matching.”
When LF and/or SF is a multi-word term, client device 210 may determine that SF and LF are not short form alias terms when a residual string, determined based on generating SF from LF by deleting characters from LF, includes a particular character, such as a space. The residual string may include a string of characters in LF that immediately follow the last matching character (e.g., the last matching character between LF and SF), up to and including the last character of LF. For example, assume that SF=“pdef” and LF=“period defined.” The residual string of this example is “fined.” This residual string does not include a space, so client device 210 may consider “pdef” and “period defined” as short form alias terms (e.g., based on being able to generate SF from LF by deleting characters from LF). As another example, assume that SF=“web sit” and LF=“web site exchange.” The residual string of this example is “e exchange.” This residual string includes a space, so client device 210 may not consider “web sit” and “web site exchange” as short form alias terms.
Client device 210 may use one or more of the above techniques to determine whether terms are short form alias terms. In some implementations, client device 210 may first determine whether SF is an acronym of LF. If client device 210 determines that SF is not an acronym of LF, client device 210 may then determine whether SF is a prefix of LF (or whether multiple words in SF are prefixes of corresponding words in LF). If client device 210 determines that SF is not a prefix of LF (or that multiple words in SF are not prefixes of corresponding words in LF), client device 210 may then determine whether SF can be generated from LF by deleting characters from LF. If client device 210 determines that SF can be generated from LF by deleting characters from LF, client device 210 may determine whether a residual string includes a particular character (e.g., a space). In performing the analysis in this manner, client device 210 may determine whether SF and LF are short form alias terms without being required to perform every analysis for each pair of terms SF and LF.
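As a non-limiting illustration, the ordered analysis described above may be sketched as follows. The function names are hypothetical. The sketch assumes that both preliminary conditions (SF shorter than LF, and the same first character) are required, that comparisons ignore case, and that characters are matched greedily from left to right when testing whether SF can be generated by deleting characters from LF.

```python
def is_acronym(sf, lf):
    """Each letter of SF matches the first letter of the corresponding word in LF."""
    words = lf.split()
    return len(sf) == len(words) and all(
        c.lower() == w[0].lower() for c, w in zip(sf, words))


def is_prefix_form(sf, lf):
    """Each word of SF is a prefix of the corresponding word in LF."""
    sf_words, lf_words = sf.split(), lf.split()
    return len(sf_words) == len(lf_words) and all(
        l.lower().startswith(s.lower()) for s, l in zip(sf_words, lf_words))


def residual_after_deletion(sf, lf):
    """Return the part of LF after the last matched character, or None if
    SF cannot be generated from LF by deleting characters."""
    pos = 0
    lowered = lf.lower()
    for ch in sf.lower():
        pos = lowered.find(ch, pos)
        if pos < 0:
            return None
        pos += 1
    return lf[pos:]


def is_short_form_alias(sf, lf):
    sf = sf.replace(".", "")  # e.g., "env. prot. ag" becomes "env prot ag"
    if not sf or len(sf) >= len(lf) or sf[0].lower() != lf[0].lower():
        return False
    if is_acronym(sf, lf):
        return True
    if is_prefix_form(sf, lf):
        return True
    residual = residual_after_deletion(sf, lf)
    # A residual string containing a space rules out the pair.
    return residual is not None and " " not in residual
```

Because the checks run in order and return as soon as one succeeds, a pair such as “EPA” and “Environmental Protection Agency” is accepted by the acronym check alone, while “web sit” and “web site exchange” is rejected only after the residual string “e exchange” is found to include a space.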
As another example, client device 210 may analyze two or more terms to determine whether the terms are explicit alias terms. In some implementations, client device 210 may determine whether the terms are explicit alias terms based on an alias character pattern, such as “is also known as.” Client device 210 may receive information (e.g., from a user and/or from another device) that identifies one or more alias character patterns to use to determine whether terms are explicit alias terms. Example alias character patterns include: “aka,” “also known as,” “sometimes also known as,” “generally also known as,” “generally known as,” “better known as,” “will be referred to as,” “will be referred to henceforth as,” “also called,” “also called as,” “will be used instead of,” “will be mentioned as,” “written as,” “will be written as,” “is an alias of,” etc.
Client device 210 may determine that two terms are explicit alias terms when an alias character pattern is included in the text in between the terms, and/or within a threshold number of words between the terms. For example, using the alias character pattern “also known as,” client device 210 may determine that “hot dog” and “ballpark frank” are explicit alias terms, based on any of the following being included in the text:
As can be seen, additional words may appear before, after, and/or within the alias character pattern. Client device 210 may determine that the two terms are explicit alias terms based on the number of additional words, appearing before, after, and/or within the alias character pattern, satisfying a threshold (e.g., less than 3).
If client device 210 determines that a term v included in TermList is an alias of a term u included in QTokens[q] (e.g., a short form alias, an explicit alias, etc.), then client device 210 may add the term v to a list of alias terms, Lalias.
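As a non-limiting illustration, the explicit alias analysis may be sketched as follows. The function names and the sample sentences are hypothetical. The sketch assumes that the first term precedes the second term in the text, that only the first occurrence of each term is examined, and that the threshold of fewer than three additional words is expressed as max_extra=2.

```python
import re


def tokenize(s):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", s.lower())


def find_sequence(seq, sub, start=0):
    """Index of the first occurrence of the word list sub in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def is_subsequence(needle, haystack):
    """True if the words of needle appear in haystack in order (gaps allowed)."""
    it = iter(haystack)
    return all(word in it for word in needle)


def are_explicit_aliases(text, term_a, term_b,
                         pattern="also known as", max_extra=2):
    t, a, b, p = (tokenize(x) for x in (text, term_a, term_b, pattern))
    i = find_sequence(t, a)
    if i < 0:
        return False
    j = find_sequence(t, b, i + len(a))
    if j < 0:
        return False
    between = t[i + len(a):j]
    # Pattern words must appear in order; the remaining words are "additional".
    return is_subsequence(p, between) and len(between) - len(p) <= max_extra


match = are_explicit_aliases("A hot dog is also known as a ballpark frank.",
                             "hot dog", "ballpark frank")
```

In the sample sentence, the words “is” and “a” are the only additional words around the alias character pattern, so the pair satisfies the threshold and the term could be added to Lalias.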
As further shown in
For each term u included in ExQTokens[q], client device 210 may determine whether any term v in TermList includes any word included in term u. If client device 210 determines that a term v included in TermList includes any word included in term u from ExQTokens[q], then client device 210 may add the term v to a list of containment terms, Ltc.
As an example, assume that the term “record management system” is a term v included in TermList. Client device 210 may add this term “record management system” to the list of containment terms, Ltc, for example search queries that include “Management system should print access log” (e.g., since the words management and system are included in both v and u), “Managmnt systm” (e.g., since the misspelled terms are included in the extended query list), and “sys should print access log” (e.g., since the short form “sys” is included in the extended query list). Once client device 210 has finished performing the containment analysis, client device 210 may extend the extended query list to include terms in the list Ltc.
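As a non-limiting illustration, the containment analysis may be sketched as follows. The function name is hypothetical, words are compared without regard to case, and no stop-word filtering is applied, so a practical implementation might exclude common words before comparing.

```python
def containment_terms(ex_q_tokens, term_list):
    """Return the terms v in term_list that share at least one word with
    any term u in the extended query list."""
    query_words = {w.lower() for u in ex_q_tokens for w in u.split()}
    return [v for v in term_list
            if query_words & {w.lower() for w in v.split()}]


l_tc = containment_terms(["Management system", "access log"],
                         ["record management system", "user profile"])
```

Here “record management system” is returned because it shares the words “management” and “system” with a term of the extended query list, while “user profile” shares no word and is omitted.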
As further shown in
As further shown in
As an example, client device 210 may generate a query array Q of size 1×t (e.g., [Q]1×t), and may initialize all elements of Q to zero. For each term u in the extended query list ExQTokens[q], client device 210 may determine whether there is a term v in the TermList that matches u. For a particular term TermList[k], if TermList[k]=u, then client device 210 may calculate a frequency score for the term, and may store the frequency score in Q[k]. Client device 210 may calculate the frequency score Q[k] based on a quantity of occurrences of term u in the list ExQTokens[q] and a combined weight p calculated for term v. The combined weight p may be calculated as a sum of one or more individual weights associated with a search query expansion technique. For example, p may be calculated as:
p=∂mis+∂alias+∂tc
In the above expression, client device 210 may set ∂mis equal to a weight assigned to a misspelling analysis if the term u is included in the misspelling list Lmis, and may set ∂mis equal to zero if the term u is not included in the misspelling list Lmis. Similarly, client device 210 may set ∂alias equal to a weight assigned to an alias analysis if the term u is included in the alias list Lalias, and may set ∂alias equal to zero if the term u is not included in the alias list Lalias. Similarly, client device 210 may set ∂tc equal to a weight assigned to a containment analysis if the term u is included in the containment list Ltc, and may set ∂tc equal to zero if the term u is not included in the containment list Ltc. Client device 210 may determine the weights based on user input, as described herein in connection with block 430 of
Client device 210 may calculate the quantity of occurrences f of term u within the expanded search query list ExQTokens. Once client device 210 has determined the combined weight p and the quantity of occurrences f for a particular term, client device 210 may calculate the frequency score Q[k] for the term as follows:
Q[k]=f×p
Client device 210 may apply information theoretic weighting to the frequency score Q[k] to weight the score in relation to the text as a whole. For example, for each k in [0, t−1], client device 210 may calculate an information theoretic weighted frequency score as follows:
Q[k]=Q[k]×IDFk
In the above expression, IDFk may be calculated as follows:
where d is the total number of sections of the text and nk is the total number of sections where the kth term appears.
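As a non-limiting illustration, the calculation of the weighted query array Q may be sketched as follows. The function and parameter names are hypothetical. The sketch assumes an inverse document frequency of the common form log(d/nk) with a natural logarithm, which is an assumption rather than a statement of the expression used above, and it applies p literally as the sum of the weights of the lists that contain the term.

```python
import math
from collections import Counter


def weighted_query_array(ex_q_tokens, term_list, technique_lists, weights,
                         sections_with_term, d):
    """Q[k] = f x p x IDFk for each term TermList[k] found in ExQTokens[q]."""
    counts = Counter(ex_q_tokens)
    q = [0.0] * len(term_list)           # all elements initialized to zero
    for k, term in enumerate(term_list):
        f = counts.get(term, 0)          # occurrences of the term in ExQTokens
        if f == 0:
            continue
        # p: sum of the weights of every technique list that contains the term
        p = sum(weights[name]
                for name, members in technique_lists.items()
                if term in members)
        n_k = sections_with_term.get(term, 0)
        idf = math.log(d / n_k) if n_k else 0.0  # assumed IDF form
        q[k] = f * p * idf
    return q


q = weighted_query_array(
    ex_q_tokens=["web site", "ws", "ws"],
    term_list=["web site", "ws", "log"],
    technique_lists={"mis": set(), "alias": {"web site", "ws"}, "tc": {"web site"}},
    weights={"mis": 0.5, "alias": 0.8, "tc": 0.3},
    sections_with_term={"web site": 10, "ws": 5, "log": 20},
    d=40,
)
```

Under this literal reading, a term that appears in none of the lists receives p=0 and therefore a frequency score of zero.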
If client device 210 determines that a latent semantic search is to be performed (e.g., based on user input), then client device 210 may map Q to the latent semantic analysis space, such as by calculating the following:
$[Q_{new}]_{1 \times k} = [Q]_{1 \times t}\,[U]_{t \times k}\,[\Sigma_{k \times k}]^{-1}$
In the above expression, Qnew represents the mapped matrix Q, U represents a t×k unitary matrix (e.g., described elsewhere herein in connection with singular value decomposition), and Σ−1 represents the matrix inverse of sigma matrix Σ (e.g., described elsewhere herein in connection with singular value decomposition) when only the first k rows and the first k columns are selected.
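As a non-limiting illustration, the mapping of Q to the latent semantic analysis space may be sketched as follows. Because the sigma matrix of a singular value decomposition is diagonal, its inverse is applied here by dividing by each of the first k singular values. The function name is hypothetical, and the singular values are assumed to be nonzero.

```python
def map_to_latent_space(q, u, singular_values):
    """Compute Qnew (length k) = Q (1 x t) . U (t x k) . inverse(Sigma (k x k)).

    q: list of t query weights
    u: t rows, each holding the first k columns of the unitary matrix U
    singular_values: the first k diagonal entries of Sigma (assumed nonzero)
    """
    k = len(singular_values)
    return [
        sum(q[i] * u[i][j] for i in range(len(q))) / singular_values[j]
        for j in range(k)
    ]


q_new = map_to_latent_space([1.0, 2.0, 0.0],
                            [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                            [2.0, 4.0])
```

In this small example, t=3 and k=2, and the mapped query is [0.5, 0.5].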
Client device 210 may calculate a cosine similarity between a text section Seci, included in truncated matrix Vd×k, and the search query q included in query vector Qnew, as follows:
Closeness[q,i]=Cosine(Qnew,Seci,k)
If client device 210 determines that a latent semantic search is not to be performed (e.g., based on user input), then client device 210 may map Q to the tf-idf space as [Qnew]1×t=[Q]1×t. Similarly, client device 210 may calculate a cosine similarity between a text section Seci, included in transposed co-occurrence matrix CTd×t, and the query vector Qnew, as follows:
Closeness[q,i]=Cosine(Qnew,Seci,t)
For each text section Seci included in a result Resultq for search query q, client device 210 may determine a relevance score. As an example, client device 210 may calculate the relevance score as follows:
In the above expression, CQ[i,q] may represent a clustering quality of the ith search result Seci with respect to query q in the query list. CQ[i,q] may represent a measure of how strongly Seci clusters around the query (e.g., how many sections, which are similar to the query, are also similar to Seci). The variable α may represent a configurable weight value between zero and one (e.g., with a default value of 0.5). Client device 210 may use the relevance score when ranking search results.
As further shown in
As an example, if client device 210 determines (e.g., based on user input) that search results are not to overlap, then client device 210 may distribute the search results (e.g., the text sections) into groups around each search query based on a relevance score between a text section and the search query. For example, client device 210 may associate a text section with a search query with which the text section has the highest relevance score (e.g., as compared to relevance scores between the text section and other search queries). Client device 210 may provide a list of search results associated with each search query, and may rank the search results from highest relevance score to lowest relevance score, ensuring that no search result is provided in association with more than one search query.
If client device 210 determines that search results are to overlap, then client device 210 may provide a list of search results associated with each search query, and may rank the search results from highest relevance score to lowest relevance score for a particular search query. In this case, client device 210 may permit a search result to be provided in association with more than one search query.
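As a non-limiting illustration, the grouping of search results with and without overlap may be sketched as follows. The function name and the input layout (a mapping from each search query to the relevance scores of text sections) are hypothetical, and ties are assumed to be resolved in favor of the search query encountered first.

```python
def group_results(relevance, overlap=False):
    """relevance: {query: {section: score}}.
    Returns {query: [(section, score), ...]} ranked from highest to lowest."""
    if overlap:
        # A section may be listed under every query for which it has a score.
        groups = {q: list(scores.items()) for q, scores in relevance.items()}
    else:
        # Each section is kept only under the query with its highest score.
        best = {}
        for q, scores in relevance.items():
            for section, score in scores.items():
                if section not in best or score > best[section][1]:
                    best[section] = (q, score)
        groups = {q: [] for q in relevance}
        for section, (q, score) in best.items():
            groups[q].append((section, score))
    return {q: sorted(items, key=lambda item: item[1], reverse=True)
            for q, items in groups.items()}


scores = {
    "q1": {"Section 14": 0.9, "Section 10": 0.4},
    "q2": {"Section 14": 0.6, "Section 10": 0.7},
}
non_overlapping = group_results(scores)
overlapping = group_results(scores, overlap=True)
```

Without overlap, Section 14 is listed only under q1 and Section 10 only under q2. With overlap, both sections are listed under both search queries, each ranked by its relevance score for that search query.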
In some implementations, client device 210 may determine not to cluster search results. In this case, client device 210 may provide a list of search results sorted based on relevance scores.
In some implementations, client device 210 may determine to cluster search results based on relevancy. In this case, client device 210 may create relevancy categories that include search results with a relevance score that falls within a particular range. For example, client device 210 may create relevancy categories of high relevance (e.g., search results with a relevance score and/or cosine similarity between 0.8 and 1), medium-high relevance (e.g., search results with a relevance score and/or cosine similarity between 0.5 and 0.8), average relevance (e.g., search results with a relevance score and/or cosine similarity between 0.25 and 0.5), low relevance (e.g., search results with a relevance score and/or cosine similarity between 0 and 0.25), and no relevance (e.g., search results with a relevance score and/or cosine similarity between −1 and 0).
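As a non-limiting illustration, the relevancy categories may be sketched as follows. The example ranges above share their endpoints, so the sketch assumes that each range includes its lower bound and that a score of exactly zero falls in the low relevance category. The function names are hypothetical.

```python
def relevancy_category(score):
    """Map a relevance score (or cosine similarity) in [-1, 1] to a category."""
    if score >= 0.8:
        return "high relevance"
    if score >= 0.5:
        return "medium-high relevance"
    if score >= 0.25:
        return "average relevance"
    if score >= 0:
        return "low relevance"
    return "no relevance"


def categorize(results):
    """results: {section: score} -> {category: [sections ranked by score]}."""
    categories = {}
    for section, score in sorted(results.items(),
                                 key=lambda item: item[1], reverse=True):
        categories.setdefault(relevancy_category(score), []).append(section)
    return categories


clusters = categorize({"Section 14": 0.92, "Section 10": 0.61, "Section 3": -0.2})
```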
In some implementations, client device 210 may determine to cluster search results based on a degree of relatedness of search results. In this case, client device 210 may rank search results from highest to lowest relevancy scores to generate a cluster list CQ. In some implementations, client device 210 may remove search results from the cluster list (e.g., may remove a threshold quantity of search results with the lowest relevancy scores). For each pair of search results included in CQ, client device 210 may determine a combined relevance score for the pair. For example, client device 210 may determine a combined relevance score by summing the relevance scores for each search result in the pair. Client device 210 may select a particular quantity of search result pairs (e.g., one-quarter of the search results included in CQ), and may include these search results in the list Ctop.
Client device 210 may calculate a weighted clustering coefficient (WCC) value for each search result Rl included in CQ. Client device 210 may initialize the WCC value for the search result Rl by setting the WCC value to zero, and may determine a maximum edge weighted sum, as follows:
where |CQ| represents the number of elements included in CQ.
Client device 210 may determine two other search results Rl1 and Rl2 included in CQ. Client device 210 may determine whether both of the other search results Rl1 and Rl2 are included in the list Ctop. If both of the other search results Rl1 and Rl2 are included in the list Ctop, then client device 210 may determine whether either of the search result pairs (Rl, Rl1) or (Rl, Rl2) is included in the list Ctop. If either of these pairs is included in the list Ctop, then client device 210 may update the WCC value for Rl as follows:
Client device 210 may continue to update the WCC value for Rl until all other search results in Ctop have been analyzed. After analyzing all search result values, client device 210 may normalize the WCC value for Rl as follows:
Client device 210 may calculate a WCC value for each search result, and may sort the search results from highest to lowest WCC value. Client device 210 may select a top quantity of search results with the highest WCC values (e.g., the top quartile), and may center the remaining search results around these top search results. For example, client device 210 may use a k-means clustering technique to cluster a search result into a cluster with which the search result has a highest average (e.g., mean) similarity, as compared to other clusters.
For example, assume that L=[Rl1, . . . , Rlm] represents the top quartile of search results with the highest WCC scores. From these search results, client device 210 may create initial clusters Y1={Rl1}, Y2={Rl2}, . . . , Ym={Rlm}. Then, for each Rl included in CQ, client device 210 may calculate a mean similarity of Rl with respect to each cluster Y, and may add Rl to the cluster with which Rl has the highest mean similarity. Client device 210 may calculate the mean similarity of Rl with respect to cluster Yk as follows:
In some implementations, client device 210 may iterate through all of the search results a threshold quantity of times to determine final clusters. Additionally, or alternatively, client device 210 may iterate through all of the search results until there is no change in the elements included in the clusters. Client device 210 may provide the clusters for display. In this way, the user may be able to see clusters of search results, associated with a search query, that are related to one another.
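As a non-limiting illustration, the assignment of search results to clusters seeded by the top search results may be sketched as follows. This is a simplified mean-similarity assignment rather than a complete k-means implementation. The function names are hypothetical, the mean similarity is assumed to be the average pairwise similarity between a search result and the other members of a cluster, and the seed results are assumed to remain in their initial clusters.

```python
def mean_similarity(result, members, similarity):
    """Average similarity between result and the other members of a cluster."""
    others = [m for m in members if m != result]
    if not others:
        return 0.0
    return sum(similarity(result, m) for m in others) / len(others)


def cluster_results(results, seeds, similarity, max_iterations=10):
    """Assign each result to the seeded cluster with the highest mean similarity,
    repeating until no assignment changes or max_iterations is reached."""
    assignment = {seed: idx for idx, seed in enumerate(seeds)}
    for _ in range(max_iterations):
        changed = False
        for r in results:
            if r in seeds:
                continue  # assumption: seeds stay pinned to their own cluster
            clusters = [[m for m, c in assignment.items() if c == idx]
                        for idx in range(len(seeds))]
            best = max(range(len(seeds)),
                       key=lambda idx: mean_similarity(r, clusters[idx], similarity))
            if assignment.get(r) != best:
                assignment[r] = best
                changed = True
        if not changed:
            break
    return [[m for m, c in assignment.items() if c == idx]
            for idx in range(len(seeds))]


# Toy data: results are points on a line; similarity decreases with distance.
final_clusters = cluster_results(
    results=[0.0, 0.1, 0.9, 1.0],
    seeds=[0.0, 1.0],
    similarity=lambda a, b: 1 - abs(a - b),
)
```

In this toy example the result 0.1 joins the cluster seeded by 0.0 and the result 0.9 joins the cluster seeded by 1.0, and the second pass produces no change, so iteration stops.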
Although
As shown in
For example, assume that client device 210 performed the selected “alias” search query expansion technique on the term “web_sit” within the search query “design management web_sit.” Assume that client device 210 applied the search query expansion technique to a text, and identified the terms “web site” and “ws” as being expanded search queries. As shown by reference number 920, client device 210 may provide information that identifies these terms for display. As further shown in
As shown in
As shown in
As shown in
As shown in
As shown by reference number 950, client device 210 may provide a list of ranked search results that identifies text sections that included the expanded initial search query. For example, for the initial search query “design and management of ‘web sit’,” Section 14 of the text was the most relevant match, followed by Section 10, Section 18, etc. As shown by reference number 955, client device 210 may provide information that identifies a percentage of the text sections that included the expanded initial search query. In this case, client device 210 has searched 41 text sections, and 17 of them matched the expanded initial search query “design and management of ‘web sit’,” for a total of 34.1%. As further shown in
As shown by reference number 960, client device 210 may provide an input mechanism (e.g., a button, a link, a menu item, etc.) that permits the user to cause client device 210 to display clustered search results. Assume that the user interacts with this input mechanism.
As shown in
As indicated above,
The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations.
As used herein, the term component is intended to be broadly construed as hardware, firmware, or a combination of hardware and software.
Some embodiments are described herein in connection with thresholds. As used herein, satisfying a threshold may refer to a value being greater than the threshold, more than the threshold, higher than the threshold, greater than or equal to the threshold, less than the threshold, fewer than the threshold, lower than the threshold, less than or equal to the threshold, equal to the threshold, etc.
It will be apparent that systems and/or methods, as described herein, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described without reference to the specific software code—it being understood that software and hardware can be designed to implement the systems and/or methods based on the description herein.
Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.
No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles “a” and “an” are intended to include one or more items, and may be used interchangeably with “one or more.” Also, as used herein, the term “set” is intended to include one or more items, and may be used interchangeably with “one or more.” Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has,” “have,” “having,” and the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise.
Number | Date | Country | Kind |
---|---|---|---|
1007/CHE/2014 | Feb 2014 | IN | national |
Number | Date | Country | |
---|---|---|---|
20150242493 A1 | Aug 2015 | US |