This invention relates to query answering systems. More particularly, this invention relates to a query answering system that uses semantic searching methods.
The marriage of Natural Language Processing (NLP) with Information Retrieval (IR) has long been desired by researchers. NLP techniques are used intensively in query answering (QA) systems; indeed, QA has been a playground for Artificial Intelligence since the 1960s. NLP techniques such as ontologies, syntactic parsing, and information extraction can commonly be found in a good QA system.
Although QA has been successful in domain-specific areas and in small document collections such as TREC, large-scale open-domain QA is still very difficult because current NLP techniques are too expensive for massive databases like the Internet. Therefore, some commercial systems resort to simplified NLP, in which, for example, an attempt is made to map user queries (questions) to previously hand-picked question/answer sets. Thus, the performance of such systems is limited by their QA database.
Recently, a fast convolutional neural network approach for semantic extraction called SENNA has been described, which achieves state-of-the-art performance for Propbank labeling while running hundreds of times faster than competing methods.
A method is disclosed herein for searching for information contained in a database of documents. The method comprises predicting, in a first computer process, semantic data for sentences of the documents contained in the database, querying the database for information with a semantically-sensitive query, predicting, in a real time computer process, semantic data for the query, and determining, in a second computer process, a matching score against all the documents in the database, which incorporates the semantic data for the sentences and the query.
Also disclosed herein is a system for searching for information contained in a database of documents. The system comprises a central processing unit and a memory communicating with the central processing unit. The memory comprises instructions executable by the processor for predicting, in a first computer process, semantic data for sentences of the documents contained in the database, querying the database for information with a semantically-sensitive query, predicting, in a real time computer process, semantic data for the query, and determining, in a second computer process, a matching score against all the documents in the database, which incorporates the semantic data for the sentences and the query.
Disclosed herein is a semantic searching method for use in a query answering (QA) system. The semantic searching method efficiently indexes and retrieves sentences based on a semantic role matching technique that uses a neural network architecture similar to SENNA. First, semantic roles (metadata) are computed offline on a large database of documents (e.g., general web data used by search engines or special collections, such as Wikipedia), and the metadata is stored along with the word information necessary for indexing. At query time, semantic roles are predicted for the query online, and a matching score that incorporates this semantic information is computed against all the documents in the database.
The offline processing and online labeling are performed using fast prediction methods. Therefore, given the indices computed offline, the retrieval time of the present system is not much longer than that of simple IR models such as a vector space model, while the indexing itself is affordable for daily crawling. More importantly, the neural network framework described herein is general enough to be adapted to other NLP features as well, such as named entity recognition and coreference resolution.
In the offline process, documents are collected in step 10 and processed in step 13 with a sentence splitter, which separates each document into discrete sentences. The sentences at the output 14 of the sentence splitter are processed by a tagger in step 11. In the present disclosure, the tagger comprises a neural network. The neural network tagger computes part-of-speech (POS) tags and semantic role tag predictions for each sentence. The sentences, together with their POS tags and role labeling information at the output 15 of the neural network tagger, are indexed at step 101 and stored in a database in step 19 as a forward index and an inverted index.
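The offline flow described above (split, tag, index) can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `split_sentences`, `stub_tagger`, and `build_indices` are hypothetical, the splitter is a naive regular expression, and the stub tagger emits placeholder labels where the neural network tagger would predict POS tags and semantic roles.

```python
import re
from collections import defaultdict


def split_sentences(document):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def stub_tagger(sentence):
    """Placeholder for the neural network tagger.

    Returns (word, pos_tag, role_tag) triples with dummy tags.
    """
    words = re.findall(r"\w+", sentence.lower())
    return [(w, "POS?", "ROLE?") for w in words]


def build_indices(documents, tagger=stub_tagger):
    """Build a forward index (sentence id -> tagged sentence) and an
    inverted index (word -> ids of the sentences that contain it)."""
    forward = {}
    inverted = defaultdict(set)
    sentence_id = 0
    for document in documents:
        for sentence in split_sentences(document):
            tagged = tagger(sentence)
            forward[sentence_id] = {"text": sentence, "tokens": tagged}
            for word, _pos, _role in tagged:
                inverted[word].add(sentence_id)
            sentence_id += 1
    return forward, dict(inverted)


forward_index, inverted_index = build_indices(
    ["The cat sat on the mat. The dog ate fish.", "A cat ate fish!"]
)
```

In this sketch the forward index keeps the per-sentence tags needed for scoring, while the inverted index allows candidate sentences to be looked up by word without scanning every sentence.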
In the online process, a user query 17 entered into browser 20 is sent by web server 30 to the neural network tagger for processing on the fly. The neural network tagger computes part-of-speech (POS) tags and semantic role tag predictions for the query. In step 12, the web server calculates the similarity between the query and the sentences stored in the database, which contains the forward and inverted indices calculated during the offline process. The calculation uses the query's POS tags and role labeling information at the output 16 of the neural network tagger and each sentence's syntactic and/or role labeling information stored in the forward and inverted indices. The web server then ranks the similarity calculations. The result of the similarity ranking is displayed by the browser 20 for viewing by the user in step 18. The result comprises the top-ranked sentences, which are the most likely to match the semantic meaning of the query 17.
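The online retrieve-and-rank step can be sketched as follows. This is an illustration under stated assumptions only: the tagged sentences and the query are hard-coded stand-ins for tagger output, the role names `A0`, `V`, and `A1` are illustrative, the names `build_inverted` and `rank` are hypothetical, and the score (a count of shared word/role pairs) is a placeholder, not the similarity measure of the disclosure.

```python
from collections import defaultdict

# Hypothetical pre-tagged sentences standing in for the offline tagger output.
SENTENCES = {
    0: [("cat", "A0"), ("ate", "V"), ("fish", "A1")],
    1: [("fish", "A0"), ("ate", "V"), ("cat", "A1")],
    2: [("dog", "A0"), ("sat", "V")],
}


def build_inverted(forward):
    """Map each word to the ids of the sentences that contain it."""
    inverted = defaultdict(set)
    for sentence_id, tokens in forward.items():
        for word, _role in tokens:
            inverted[word].add(sentence_id)
    return inverted


def rank(query_tokens, forward, inverted, top_k=2):
    """Return up to top_k (sentence_id, score) pairs, best first."""
    # Candidate sentences share at least one word with the query.
    candidates = set()
    for word, _role in query_tokens:
        candidates |= inverted.get(word, set())
    # Placeholder score (assumption): number of shared (word, role) pairs.
    query_pairs = set(query_tokens)
    scored = [(len(query_pairs & set(forward[sid])), sid) for sid in candidates]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(sid, score) for score, sid in scored[:top_k]]


inverted = build_inverted(SENTENCES)
query = [("cat", "A0"), ("ate", "V"), ("fish", "A1")]
results = rank(query, SENTENCES, inverted)
```

Sentences 0 and 1 contain the same words, but only sentence 0 assigns them the same roles as the query, so it ranks first. A purely word-based score would not separate the two.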
In the second layer of the neural network tagger, represented by steps 22 and 23, the vectors for each word in the sentence/query are transformed by convolution to a matrix (step 22) which includes, for example, k+1 columns (vectors) and 200 rows (step 23). Using this example, the convolutional second layer is capable of outputting 200 features for every window of, for example, 3 adjoining words in the sentence/query. The convolutional second layer enables the neural network to deal with variable length sentences/queries. So for a sentence/query of length n the second layer of the neural network tagger outputs (n−2)×200 features.
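The window-based convolution can be sketched in plain Python as below. The function name `conv_layer`, the 4-dimensional word vectors, the random untrained weights, and the zero bias are assumptions made for illustration; only the window of 3 words and the 200 output features come from the example above.

```python
import random


def conv_layer(word_vectors, weights, bias, window=3):
    """Apply the same linear map to every window of adjoining words.

    word_vectors: n vectors of equal length d
    weights:      one row of length window * d per output feature
    Returns n - window + 1 columns, each with len(weights) features.
    """
    n = len(word_vectors)
    columns = []
    for start in range(n - window + 1):
        # Concatenate the vectors of the words inside this window.
        x = [v for vec in word_vectors[start:start + window] for v in vec]
        columns.append([
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)
        ])
    return columns


rng = random.Random(0)
DIM, FEATURES, WINDOW = 4, 200, 3  # DIM is an illustrative word-vector size
sentence = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(7)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(WINDOW * DIM)]
           for _ in range(FEATURES)]
bias = [0.0] * FEATURES

features = conv_layer(sentence, weights, bias, WINDOW)
```

For the 7-word input, `features` holds 5 columns of 200 values each, which matches the (n−2)×200 count for a window of 3. Inputs shorter than the window produce no columns in this sketch; how such inputs are handled (for example, by padding) is not specified here.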
In the third layer of the neural network tagger, the columns or vectors (each of which comprises 200 rows) of the resulting matrix from the second layer are converted into a single vector in step 24 by examining each row of the columns to find the largest (maximum) value, and constructing the single vector from the maximum value in each row of the columns. The max function is applied over that feature set with the intuition that the most pertinent locations in the sentence for various learned features are identified at this layer of the neural network tagger. Independent of sentence/query length, this exemplary embodiment of the neural network tagger outputs 200 features. The single vector is processed by the fourth and fifth layers of the neural network tagger in step 25 to get the role label of the word w. Specifically in this exemplary embodiment, 100 hidden units are applied to the single vector in the fourth layer of the neural network and the fifth layer of the neural network predicts possible outputs (classes). The fifth layer may be a linear layer that predicts 24 possible classes. The neural network tagger is trained by backpropagation to minimize training error.
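The pooling and classification layers can be sketched as follows. The layer sizes (200 features, 100 hidden units, 24 classes) follow the exemplary embodiment; the names `max_over_time`, `linear`, and `predict_class`, the tanh nonlinearity in the hidden layer, and the random untrained weights are assumptions made for illustration, and training by backpropagation is omitted.

```python
import math
import random


def max_over_time(columns):
    """For each feature row, keep the maximum value across all columns."""
    return [max(values) for values in zip(*columns)]


def linear(x, weights, bias):
    """Plain linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict_class(columns, w_hidden, b_hidden, w_out, b_out):
    """Pool, apply the hidden layer (tanh is an assumption), then score classes."""
    pooled = max_over_time(columns)
    hidden = [math.tanh(h) for h in linear(pooled, w_hidden, b_hidden)]
    scores = linear(hidden, w_out, b_out)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, pooled


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


FEATURES, HIDDEN, CLASSES = 200, 100, 24
w_hidden, b_hidden = rand_matrix(HIDDEN, FEATURES), [0.0] * HIDDEN
w_out, b_out = rand_matrix(CLASSES, HIDDEN), [0.0] * CLASSES

short_input = rand_matrix(3, FEATURES)  # 3 window columns
long_input = rand_matrix(9, FEATURES)   # 9 window columns
label_short, pooled_short = predict_class(short_input, w_hidden, b_hidden, w_out, b_out)
label_long, pooled_long = predict_class(long_input, w_hidden, b_hidden, w_out, b_out)
```

Both inputs are reduced to a 200-value vector before the hidden layer, which is what makes the later layers independent of sentence or query length.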
In an exemplary embodiment, the similarity between the query and the sentences stored in the forward and inverted indices may be calculated by the web server during the online process using the following method. First, the similarity between a query Q and a sentence S may be represented as Equations (1) and (2).
A predicate-argument structure (PAS) of a sentence is the part of the sentence that has a meaningful role for a certain verb. For example, the sentence “the cat sat on the mat and ate fish” has two PASs: 1) “the cat sat on the mat”; and 2) “the cat ate fish”. For each PAS a in query Q, the similarity is calculated with all PASs in sentence S, and the maximum is taken as the similarity between a and S. The similarity SENmatch(Q,S) is a weighted sum of the similarities between each query PAS and S. The weight is the relative importance of the verb in the query PAS, indicated by its inverse document frequency (IDF).
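The weighted-sum-of-maxima computation described above can be sketched as follows. The sketch rests on several assumptions: the function names `sen_match` and `toy_pas_match`, the dictionary representation of a PAS, and the IDF values are hypothetical; "relative importance" is read as the verb's IDF divided by the sum of the IDFs of all query verbs; and the PAS-level similarity is passed in as a function, with a simple word-overlap stand-in used for the demonstration.

```python
def sen_match(query_pas, sentence_pas, verb_idf, pas_match):
    """Weighted sum, over query PASs, of the best match among sentence PASs."""
    total_idf = sum(verb_idf[a["verb"]] for a in query_pas)
    if total_idf == 0 or not sentence_pas:
        return 0.0
    score = 0.0
    for a in query_pas:
        # Similarity between PAS a and sentence S: maximum over the PASs of S.
        best = max(pas_match(a, b) for b in sentence_pas)
        # Weight (assumption): IDF of the verb, normalized over query verbs.
        score += (verb_idf[a["verb"]] / total_idf) * best
    return score


def toy_pas_match(a, b):
    """Stand-in PAS similarity (assumption): word-set overlap ratio."""
    union = a["words"] | b["words"]
    return len(a["words"] & b["words"]) / len(union) if union else 0.0


query = [
    {"verb": "sat", "words": {"cat", "sat", "mat"}},
    {"verb": "ate", "words": {"cat", "ate", "fish"}},
]
sentence = [{"verb": "ate", "words": {"cat", "ate", "fish"}}]
verb_idf = {"sat": 1.0, "ate": 3.0}  # illustrative values

score = sen_match(query, sentence, verb_idf, toy_pas_match)
```

Here the "ate" PAS matches perfectly and carries three quarters of the weight, while the "sat" PAS overlaps in one word out of five, giving 0.75 × 1.0 + 0.25 × 0.2 = 0.8.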
The similarity between PASs a and b (PASmatch(a,b)) is a variant of the classical cosine measure, with two modifications. In the first modification, the vector Vsemtfidf(x) for a PAS x is a vector of size |W|·|R|, where |W| is the size of the vocabulary and |R| is the number of possible semantic roles. In other words, the vector has an entry for each combination of a word and a semantic role. If there is a word w (w is a unique integer id of a word, starting with 0) with a role r (r is a unique integer id for a semantic role, starting with 0) in the PAS x, then the entry with index (w·|R|+r) is set to the IDF of word w; otherwise the entry is set to 0. In the second modification, the similarity between two PASs is considered only if their verbs are synonyms, as indicated by the term I(a,b) in Eq. (2).
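The word/role vector and the gated cosine measure can be sketched as follows. The names `sem_vector`, `cosine`, and `pas_match`, the sparse dictionary representation, the vocabulary, the role ids, the IDF values, and the synonym table are all hypothetical. Treating identical verbs as synonyms and using a standard normalized cosine are assumptions, since the exact form of the variant is given by the equations rather than by this sketch.

```python
import math


def sem_vector(pas, word_ids, role_ids, idf):
    """Sparse form of Vsemtfidf: entry w * |R| + r holds the IDF of word w."""
    num_roles = len(role_ids)
    return {
        word_ids[word] * num_roles + role_ids[role]: idf[word]
        for word, role in pas["args"]
    }


def cosine(u, v):
    """Cosine of two sparse vectors stored as {index: value} dictionaries."""
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pas_match(a, b, word_ids, role_ids, idf, synonyms):
    """Cosine of the word/role vectors, gated by the verb indicator I(a, b)."""
    verbs_agree = (
        a["verb"] == b["verb"]
        or b["verb"] in synonyms.get(a["verb"], set())
        or a["verb"] in synonyms.get(b["verb"], set())
    )
    if not verbs_agree:
        return 0.0
    return cosine(sem_vector(a, word_ids, role_ids, idf),
                  sem_vector(b, word_ids, role_ids, idf))


word_ids = {"cat": 0, "fish": 1, "ate": 2, "sat": 3}
role_ids = {"A0": 0, "V": 1, "A1": 2}  # illustrative role inventory
idf = {"cat": 1.0, "fish": 2.0, "ate": 1.5, "sat": 1.5}
synonyms = {}  # hypothetical synonym table, empty here

cat_ate_fish = {"verb": "ate", "args": [("cat", "A0"), ("ate", "V"), ("fish", "A1")]}
fish_ate_cat = {"verb": "ate", "args": [("fish", "A0"), ("ate", "V"), ("cat", "A1")]}
cat_sat = {"verb": "sat", "args": [("cat", "A0"), ("sat", "V")]}

same = pas_match(cat_ate_fish, cat_ate_fish, word_ids, role_ids, idf, synonyms)
swapped = pas_match(cat_ate_fish, fish_ate_cat, word_ids, role_ids, idf, synonyms)
other_verb = pas_match(cat_ate_fish, cat_sat, word_ids, role_ids, idf, synonyms)
```

Swapping the roles of "cat" and "fish" moves their weights to different vector entries, so only the verb entry still overlaps and the score drops well below 1. The "sat" PAS scores 0 even though it shares the entry for "cat" as A0, because the verbs are not synonyms.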
One skilled in the art will recognize that both the web server and the computer system that performs the offline process, as described herein, may each comprise any suitable computer system. The computer system may include, without limitation, a mainframe computer system, a workstation, a personal computer system, a personal digital assistant (PDA), or other device or apparatus having at least one processor that executes instructions from a memory medium.
The computer system may include one or more memory mediums on which one or more computer programs or software components may be stored. The one or more software programs that are executable to perform the methods described herein may be stored in the memory medium. The one or more memory mediums may include, without limitation, CD-ROMs, floppy disks, tape devices, random access memories such as, but not limited to, DRAM, SRAM, EDO RAM, and Rambus RAM, non-volatile memories such as, but not limited to, hard drives and optical storage devices, and combinations thereof. In addition, the memory medium may be entirely or partially located in one or more associated computers or computer systems which connect to the computer system over a network, such as the Internet.
The methods described herein may also be executed in hardware, a combination of software and hardware, or in other suitable executable implementations. The methods implemented in software may be executed by the processor of the computer system or the processor or processors of the one or more associated computers or computer systems connected to the computer system.
While exemplary drawings and specific embodiments of the present invention have been described and illustrated, it is to be understood that the scope of the present invention is not to be limited to the particular embodiments discussed. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by workers skilled in the arts without departing from the scope of the present invention as set forth in the claims that follow and their structural and functional equivalents.
This application claims the benefit of U.S. Provisional Application No. 61/026,844, filed Feb. 7, 2008, the entire disclosure of which is incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20090204605 A1 | Aug 2009 | US |
Number | Date | Country | |
---|---|---|---|
61026844 | Feb 2008 | US |