A typical business enterprise has a relatively large amount of information, such as emails, wikis, web pages, relational databases, and so forth, which may preferably be searched in a cost efficient manner by users of the enterprise to produce positive business outcomes. The information for the enterprise may be stored as structured data, such as data contained in relational databases, as well as unstructured data, such as data present in documents, web pages and emails.
Search queries may be used to find relevant documents in an enterprise's collection of documents. For example, an enterprise user (an employee of the enterprise, for example) may experience an information technology (IT) support problem; and in the interest of acquiring “self-help” information from the enterprise's collection of documents, the user may construct a search query that describes the nature of the problem and submit the query to an enterprise search engine in an attempt to retrieve relevant documents to solve the IT support problem.
As a more specific example, the user may experience the problem of not being able to access the enterprise intranet with the user's personal computer (PC); and the user may construct and submit an unstructured search query to search the enterprise's knowledge document collection, which may be, for example, a set of “how-to” documents and documents containing answers to frequently asked questions. In this context, an “unstructured query” means a query that does not have a predefined format. For example, the unstructured query may be a natural language-based query. As the user may not initially know what could be causing the problem or even which hardware/software components are related to the problem, the user, having a host computer name of “XYZ.A.com,” may submit (as an example) the following unstructured query: “XYZ cannot access intranet.”
The foregoing example search query centers on an entity, i.e., a computer called “XYZ”; and the user expects, as a result of this query, to retrieve relevant documents about possible causes why the user's XYZ computer cannot access the enterprise intranet. However, because the enterprise's knowledge documents may seldom contain information pertaining to specific IT assets such as the “XYZ” computer, there may be many documents found containing the terms “cannot access intranet” and relatively few documents found containing the terms “XYZ computer.” Therefore, in a potentially complex iterative process, the user may review many documents (some potentially relevant and others potentially not) that are returned in response to the query, perform a computer check to verify each possible cause, and reformulate the query with additional knowledge gained from the first set of retrieved documents in an attempt to retrieve more relevant documents.
Referring to
More specifically, techniques and systems are disclosed herein for purposes of performing entity-centric query expansion. In this manner, as further disclosed herein, the search engine 40 refines a given unstructured query 30 that targets a data collection 80 of the enterprise system 10 to effectively narrow the scope of the search in an effort to find more relevant documents based at least in part on 1. the entity(ies) that are mentioned in the search query 30; and 2. the relationships among the mentioned entity(ies) and entities that are contained in the data collection 80.
The data collection 80 contains structured and unstructured information. The unstructured information contains web pages, application-generated documents, emails, wikis, and so forth. In general, the structured information contains data arranged in specific, defined relations, such as information that is contained in tables in relational databases, for example. As described below, the unstructured information and the structured information are sources that contain rich information, which the search engine 40 exploits to improve search accuracy.
In this manner, continuing the example above in which an enterprise user searches for self-help IT information for the user's intranet connection problem, the data collection 80 may include a relational database (i.e., structured information) that contains two tables that are particularly relevant to the search query 30: an asset table containing information about the IT assets of the enterprise; and a dependency table containing information about the dependencies, or relationships, between the IT assets.
As a more specific example, the user's XYZ.A.com computer may be an asset that is listed in the asset table using an “XYZ.A.com” description. The asset table may further specify that the XYZ.A.com computer has an associated identification (ID) of “A103” and is of the category “PC.” The dependency table may specify that the A103 asset is related to an asset that has an ID of “A101,” and the asset table may describe the “A101” asset as being a proxy server that has the name “proxy.A.com” for all PCs. Therefore, based on the join relations between the above-described asset and dependency tables, “proxy.A.com” is the web proxy server for all the PCs, including the user's “XYZ.A.com” computer.
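The join relations described above may be sketched as follows; this is a minimal illustration in which the table layouts, column names, and SQL statements are assumptions for illustration rather than part of any disclosed schema:

```python
import sqlite3

# Hypothetical asset and dependency tables; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE asset (id TEXT PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE dependency (asset_id TEXT, depends_on TEXT);
INSERT INTO asset VALUES ('A103', 'XYZ.A.com', 'PC');
INSERT INTO asset VALUES ('A101', 'proxy.A.com', 'proxy server');
INSERT INTO dependency VALUES ('A103', 'A101');
""")

# Join the asset and dependency tables to find the server on which
# the user's PC depends.
row = conn.execute("""
    SELECT s.description
    FROM asset a
    JOIN dependency d ON d.asset_id = a.id
    JOIN asset s ON s.id = d.depends_on
    WHERE a.description = 'XYZ.A.com'
""").fetchone()
print(row[0])  # proxy.A.com
```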
Continuing the example, unstructured data of the data collection 80 may be used to further augment the information gleaned from the structured information. For example, the data collection 80 may contain an unstructured data document, which contains the language, “employees need to install ActivKey to access intranet from their PCs.” Thus, the unstructured data sets forth a relationship between “PC” and “ActivKey.”
As described herein, the search engine 40 uses the entity(ies) mentioned in the search query 30 (called “entity mentions” herein, such as “XYZ computer” for the example) along with relationships derived from entities of the structured and unstructured data (such as the above-described relationships between the PC, ActivKey and proxy.A.com entities, in the example) to further enhance the search to obtain more relevant documents. For example, using this additional information, the search engine 40 may find the following relevant documents that may be helpful in solving the user's IT problem: a first document stating, “ActivKey is required for authentication to connect to the network”; a document stating, “configure the proxy of your browser to proxy.A.com”; and an email stating, “employees cannot access intranet for 2 hours due to network failures on September 10.”
As a more specific example, in accordance with example implementations, the search engine 40 uses previously-identified related entities in the structured and unstructured data to refine a given unstructured search query 30. In this manner, the structured data contains explicit information about relations among entities, such as key-foreign key relationships. However, the entity relationship information may also be “hidden” in the unstructured data. As described herein, conditional random fields models are applied to learn a domain-specific entity recognizer, and the entity recognizer is applied to documents and queries to identify entities in the unstructured information. If two entities co-occur in the same document, they are related, and the relations may be discovered from the context terms surrounding their occurrences.
The search engine 40 uses the entities and relations identified in both structured and unstructured data along with a general ranking strategy to systematically integrate the entity relationships from both data types to rank the entities that have relationships with the query entity(ies). Intuitively, related entities are relevant not only to the entity(ies) mentioned in the query but are also relevant to the query as a whole. Thus, in accordance with example implementations, the ranking strategy is determined by not only the relationships between entities, but also the relevance of the related entities for the given query and the confidence of the entity identification results.
The search engine 40 uses the related entities and their relations for query refinement. In particular, depending on the particular implementation, the search engine 40 may employ one or several of the following three options to refine the query 30: 1. use related entities; 2. use relations between the related entities and query entities; and 3. use the relations between query entities.
Still referring to
For the example of
It is noted that the physical machine 20 is an actual machine that is made up of actual hardware and software. For example, in accordance with some implementations, the physical machine 20 contains one or multiple central processing units (CPUs) 22, which individually or collectively execute machine executable instructions 26 that are stored in a memory 24 for purposes of forming the search engine 40. The memory 24 may be any non-transitory memory, such as memory formed from semiconductor devices, magnetic storage, optical storage, removable media, volatile memory, non-volatile memory, and so forth.
The physical machine 20 may contain other hardware, such as, for example, a network interface 28, user input devices, user display devices, and so forth. Moreover, although the physical machine 20 is depicted in
Turning now to more specific details, referring to
As depicted in
In the following discussion of the more specific details of the query expansion, the following notations are used. “Q” denotes an entity-centric unstructured query, such as the query 30. “EQ” denotes the set of entity mentions in query Q. “ER” denotes the related entities for query Q (such as related entities 160). “QE” denotes the expanded query of Q (such as expanded query 190). “D” denotes an enterprise data collection (such as data collection 80). “DTEXT” denotes the unstructured information in D, and “DDB” denotes the structured information in D. “ei” denotes an entity in the structured information DDB. “em” denotes an entity mention in the unstructured information DTEXT. “EM(T)” denotes the set of entity mentions in the text T. “E(em)” denotes the set of top K similar candidate entities from the structured information DDB for entity mention em.
In response to the query 30, the search engine 40, in general, first retrieves a set of entities ER relevant to query Q. Intuitively, the relevance score of an entity is determined by the relationships between the entity and the entities in the query. The entity relationship information exists both explicitly in the structured data 120 as well as implicitly in the unstructured data 110. To identify entities in the unstructured data 110, the documents 112 of the unstructured data 110 are traversed offline (examined by the search engine 40 before the particular query Q is processed, for example) for purposes of identifying whether a given document 112 contains any occurrences of entities in the structured data 120. A similar strategy may be used to identify the entity mentions EQ in query Q, and then, the search engine 40 uses a ranking strategy to retrieve the related entities ER for the given query Q based on the relationships between ER and EQ.
The related entities ER are then used to estimate the entity relation model from both the structured data 120 and the unstructured data 110; and then the related entities 160 and entity relation model 170 are used to formulate the expanded query QE. Because the expanded query QE contains related entities and their relations, the retrieval performance is enhanced.
Thus, referring to
Because structured information is designed based on entity relationship models, it may be rather straightforward to identify entities and their relationships therein. However, it may be more challenging to identify entities and corresponding relationships in unstructured information, which does not contain information about the semantic meanings of text fragments. First discussed below is a technique to identify entities in unstructured information, and next discussed is a general ranking strategy to rank the entities based on the relationships in both unstructured and structured information.
Unlike structured information, unstructured information does not have semantic meanings associated with each piece of text. As a result, entities are not explicitly identified in the documents and are often represented as sequences of terms. Moreover, the mentions of an entity could have more variants in unstructured data. For example, entity “Microsoft Outlook 2003” could be mentioned as “MS Outlook 2003” in one document but as “Outlook” in another.
The majority of entities in enterprise data are domain specific entities, such as IT assets. These domain specific entities have more variations than the common types of entities. To identify entity mentions in unstructured information, a model is trained based on conditional random fields with various features including dictionary, regular expression and part of speech tags. Specifically, the model makes a binary decision for each term in a document, as the term will be labeled as either an entity term or not.
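The binary, per-term labeling decision may be sketched as follows; as a hedged stand-in for a trained conditional random fields model, this sketch uses only dictionary and regular-expression features, and the dictionary entries and hostname pattern are assumptions:

```python
import re

# Stand-in for the trained CRF labeler: dictionary and regular-expression
# features only; the entries and pattern below are assumed for illustration.
DICTIONARY = {"outlook", "exchange", "activkey"}
HOSTNAME_RE = re.compile(r"^[\w-]+(\.[\w-]+)+$")  # e.g., XYZ.A.com

def label_terms(document):
    """Return (term, is_entity_term) pairs: one binary decision per term."""
    labels = []
    for term in document.split():
        is_entity = term.lower() in DICTIONARY or bool(HOSTNAME_RE.match(term))
        labels.append((term, is_entity))
    return labels

print(label_terms("XYZ.A.com cannot access intranet"))
```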
After identifying entity mentions in the unstructured data (denoted as “em”), the entity mentions are compared with the entities in the structured data (denoted as “e”) for purposes of integrating the unstructured and structured data. Specifically, a list of candidate entities from the structured data is first constructed. Given an entity mention in a document, a string similarity is determined between the entity mention and the entities on the candidate list so that the most similar candidates are selected. To minimize the impact of entity identification errors, one entity mention is mapped to multiple candidate entities, i.e., the top K candidates with the highest similarities. Each mapping between entity mention em and a candidate entity e is assigned a mapping confidence score, i.e., c(em, e), which may be computed using, for example, the technique that is set forth in W. W. Cohen, P. Ravikumar, and S. E. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks.”
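The mapping of an entity mention to its top K most similar candidate entities may be sketched as follows; Python's difflib similarity ratio is used here as a stand-in for the string distance metrics of the cited technique, and the candidate list is assumed:

```python
from difflib import SequenceMatcher

def confidence(mention, entity):
    """c(em, e): a string similarity in [0, 1]. SequenceMatcher is a
    stand-in for the string distance metrics cited in the source."""
    return SequenceMatcher(None, mention.lower(), entity.lower()).ratio()

def top_k_candidates(mention, candidates, k=2):
    """Map one entity mention to the top K candidates with the highest
    similarities, each paired with its mapping confidence score."""
    scored = [(e, confidence(mention, e)) for e in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

# Assumed candidate list drawn from the structured data.
candidates = ["Microsoft Outlook 2003", "Exchange Server", "ActivKey"]
print(top_k_candidates("MS Outlook 2003", candidates))
```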
The next challenge relates to ranking candidate entities for a given query based on their entity relationships. The underlying assumption is that the relevance of a candidate entity for the query is determined by the relationships between the candidate entity and the entities mentioned in the query. If a candidate entity is related to more entities in the query, the entity should have a higher relevance score. Formally, the search engine 40 may determine the relevance score of a candidate entity e for a query Q as follows:
Recall that, for every entity mention in the query, there may be multiple (i.e., K) possible matches from the entity candidate list, and each of the matches is associated with a confidence score. The relevance score of candidate entity e for a query entity mention emiQ may be computed using the weighted sum of the relevance scores between e and the top K matched candidate entities of the query entity mention. Thus, Eq. 1 may be rewritten as follows:
where “E(em)” denotes the set of K candidate entities for entity mention emiQ in the query; “ejQ” denotes a matched candidate entity; “Re(ejQ, e)” represents the relevance score between query entity ejQ and a candidate entity e based on their relationships in collection D; and “c(emiQ, ejQ)” represents the string similarity between emiQ and ejQ.
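The weighted sum described above may be sketched as follows, with c, Re, and E passed in as functions; the toy scores in the usage example are illustrative assumptions:

```python
def relevance(query_mentions, e, E, c, Re):
    """S(Q, e): for each query entity mention, sum the relevance scores
    Re(ej, e) of its top-K matched candidates ej, each weighted by the
    mapping confidence c(em, ej), then sum over all mentions."""
    return sum(
        sum(c(em, ej) * Re(ej, e) for ej in E(em))
        for em in query_mentions
    )

# Toy example: one query mention with two candidate matches (K = 2).
E = lambda em: ["e1", "e2"]                      # top-K candidate entities
c = lambda em, ej: {"e1": 0.9, "e2": 0.1}[ej]    # mapping confidences
Re = lambda ej, e: {"e1": 1.0, "e2": 0.5}[ej]    # pairwise relevance scores
print(relevance(["XYZ"], "proxy.A.com", E, c, Re))  # 0.9*1.0 + 0.1*0.5 = 0.95
```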
The characteristics of both unstructured and structured information may be used to determine a relevance score between two entities (called “Re(eQ, e)”) based on their relationships.
More specifically, in relational databases, every table corresponds to one type of entity, and every tuple in a table corresponds to an entity. The database schema describes the relations between different tables as well as the meanings of their attributes.
Two types of entity relationships are considered. First, if two entities are connected through foreign key links between two tables, these entities have the same relation as the one specified between the two tables. For example, as shown in the example of
Second, if the name of one entity is mentioned in the text of the attribute fields of another entity, the two entities have a field mention relation.
The following discusses how to compute the relevance scores between entities based on these two relation types.
The relevance scores based on foreign key relations may be computed as follows:
and the relevance scores based on field mention relations may be computed as follows:
where “e.text” denotes the union of text in the attribute fields of e.
The final ranking score may be determined by integrating the two types of relevance score through linear interpolation, as described below:
ReDB(eQ, e)=αReLINK(eQ, e)+(1−α)ReFIELD(eQ, e), Eq. 5
where “α” represents a coefficient to control the influence of the two components.
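The interpolation of Eq. 5 may be sketched as a small function; the default value of α is an assumption, as no particular value is specified above:

```python
def re_db(re_link, re_field, alpha=0.5):
    # Eq. 5: linearly interpolate the foreign-key (ReLINK) and field
    # mention (ReFIELD) relevance scores; alpha controls their influence.
    return alpha * re_link + (1 - alpha) * re_field
```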
Unlike in the structured data where entity relationships are specified in the database schema, there is no explicit entity relationship in unstructured data. Since the co-occurrences of entities may indicate certain semantic relations between these entities, the co-occurrence relationships may be used.
After identifying entities from unstructured data and connecting them with candidate entities as described above, the information about co-occurrences of entities in the document sets may be determined. In general, if an entity co-occurs with a query entity in more documents and the context of the co-occurrences is more relevant to the query, the entity should have a higher relevance score.
Formally, the relevance score may be computed as follows:
where “d” denotes a document in the enterprise collection, and
“WINDOW(emQ, em, d)” represents the context of the two entity mentions in the document d. The basic assumption is that the relations between the two entities may be captured through their context. Thus, the relevance between the query and the context terms can be used to model the relevance of the relationships between two entities for the given query. The window size may be set to a predefined threshold based on preliminary results. If the distance between two entities is longer than the window size, the entities may be considered to be non-related. Note that S(Q, WINDOW(emQ, em, d)) measures the relevance score between the query and the context of the two entity mentions. Because both Q and WINDOW(emQ, em, d) essentially are bags of words, the relevance score between them may be estimated by existing document retrieval models.
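The window-based co-occurrence score may be sketched as follows; the simple term-overlap score stands in for a full retrieval model S(Q, WINDOW(emQ, em, d)), and the default window size and whitespace tokenization are assumptions:

```python
def window_terms(doc_terms, em_q, em, window):
    """Context of the two entity mentions in a document, or None if they
    are farther apart than the window (treated as non-related). For
    simplicity, only the first occurrence of each mention is considered."""
    if em_q not in doc_terms or em not in doc_terms:
        return None
    i, j = doc_terms.index(em_q), doc_terms.index(em)
    if abs(i - j) > window:
        return None
    lo, hi = min(i, j), max(i, j)
    return doc_terms[lo:hi + 1]

def overlap_score(query_terms, context):
    # Stand-in for S(Q, WINDOW(emQ, em, d)): fraction of query terms
    # appearing among the context terms.
    return len(set(query_terms) & set(context)) / len(set(query_terms))

def cooccur_relevance(query_terms, em_q, em, docs, window=10):
    # Sum the context relevance over all documents in which the two
    # mentions co-occur within the window.
    total = 0.0
    for doc in docs:
        ctx = window_terms(doc.split(), em_q, em, window)
        if ctx is not None:
            total += overlap_score(query_terms, ctx)
    return total
```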
The related entities and their relations may be utilized to improve the performance of document retrieval. Related entities, which are relevant to the query but are not directly mentioned in the query, as well as the relations between the entities, may serve as complementary information to the original query terms. Therefore, integrating the related entities and their relations into the query may aid in covering more information aspects and thus, improve the performance of document retrieval.
Language modeling may be used as a framework for document retrieval. One such retrieval model is the “KL-divergence” model, where the relevance score of document D for query Q may be estimated based on the distance between the document and query models, as described below:
To further improve the performance, the original query model may be updated using feedback documents as described below:
θQnew=(1−λ)θQ+λθF, Eq. 8
where “θQ” represents the original query model, “θF” represents the estimated feedback query model based on feedback documents, and “λ” represents a weighting factor to control the influence of the feedback model.
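The interpolation of Eq. 8 may be sketched as follows, with each language model represented as a word-to-probability mapping; that representation is an assumption for illustration:

```python
def interpolate(theta_q, theta_f, lam):
    """Eq. 8: theta_Q_new = (1 - lambda) * theta_Q + lambda * theta_F.
    Each model maps a word to its probability; missing words count as 0."""
    words = set(theta_q) | set(theta_f)
    return {w: (1 - lam) * theta_q.get(w, 0.0) + lam * theta_f.get(w, 0.0)
            for w in words}

# With lambda = 0.25, the feedback model contributes a quarter of the mass.
print(interpolate({"intranet": 1.0}, {"activkey": 1.0}, lam=0.25))
```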
The query model is updated using the related entities and their relationships. More specifically, the query model may be updated as follows:
θQnew=(1−λ)θQ+λθER, Eq. 9
where “θQ” represents the query model, “θER” represents the estimated expansion model based on related entities and their relations, and “λ” controls the influence of θER. Given a query Q, the relevance score of a document D may be computed as follows:
where “w” represents the set of shared words between the query Q and the document D.
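Because Eq. 10 itself is not reproduced above, the following sketch assumes the common rank-equivalent form of the KL-divergence score, which sums p(w|θQ)·log p(w|θD) over the shared words:

```python
import math

def kl_score(theta_q, theta_d):
    """Rank-equivalent KL-divergence retrieval score: sum over the words w
    shared by the query and document models of p(w|theta_Q) * log p(w|theta_D).
    This common form is an assumption, as the source's Eq. 10 is not shown."""
    return sum(p_q * math.log(theta_d[w])
               for w, p_q in theta_q.items() if w in theta_d)
```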
Disclosed below is a way that may be used by the search engine 40 to estimate p(w|θER) based on related entities and their relationships, in accordance with an example implementation.
The top ranked related entities ER provide useful information to better reformulate the original query Q. Here, a “bag-of-terms” representation is used for entity names, and the name list of related entities may be regarded as a collection of short documents. The expansion model based on the related entities may be estimated as follows:
where “ERL” represents the top L ranked entities from ER, “N(e)” represents the name of the entity e and “w” represents a word in the vocabulary.
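The estimate described above may be sketched as a maximum likelihood model over the bag of terms in the related entity names; lowercasing and whitespace tokenization are assumptions:

```python
from collections import Counter

def expansion_model(entity_names):
    """p(w | theta_ER) over the names of the top-L related entities,
    treated as a collection of short documents: the maximum likelihood
    estimate is each term's frequency over the total term count."""
    counts = Counter(w.lower() for name in entity_names for w in name.split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

print(expansion_model(["Exchange Server", "proxy server"]))
```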
Although the names of related entities provide useful information, the names may be short, and their effectiveness to improve retrieval performance may be relatively limited. However, the relations between entities may provide additional information that may be useful for query reformulation. For example, two relation types may be used: 1. external relations, which are the relationships between a query entity and its related entities; and 2. internal relations, which are the relationships between two query entities. For example, consider the query “XYZ cannot access intranet,” which contains one entity, “XYZ.” The external relation with a related entity, e.g., “ActivKey,” would be: “ActivKey is required for authentication of XYZ to access the intranet.” Consider another query, “Outlook cannot connect to Exchange Server.” For this example query, there are two entities, “Outlook” and “Exchange Server,” and these entities have an internal relation, which is “Outlook retrieves email messages from Exchange Server.”
Thus, a language model is estimated based on the relations between entities. As discussed earlier, the relationship information exists as attribute names in structured data and as co-occurrence documents in unstructured data. To estimate the model, the relationship information is pooled together, and maximum likelihood estimation is used to estimate the model.
Specifically, given a pair of entities, the relation information from the enterprise collection D is first determined, and then, the relation model may be estimated as follows:
p(w|θR(e1,e2))=pML(w|CONTENT(e1,e2)), Eq. 12
where “CONTENT(e1, e2)” represents the union of attribute names about the relationship between the entities or the set of documents mentioning both entities; and “pML” represents the maximum likelihood estimate of the document language model.
Thus, given a query Q with an EQ set of query entities and “ERL” as a set of top L related entities, the external relation model may be estimated by taking the average over all the possible entity pairs, as set forth below:
where “|EQ|” denotes the number of entities in the set EQ. Note that |ERL|≤L, because some queries may have fewer than L related entities.
The internal relation model may be estimated as follows:
Note that
as the co-occurrences of different entities are counted.
Referring to
The technique 300 further includes refining (block 320) the query based on external relations among the query entities and a selected set of candidate entities. Moreover, the query may be refined, pursuant to block 324, based on internal relations among the query entities. Lastly, the relevance scores of documents in the collection may be determined, pursuant to block 328, based on the refined query.
While a limited number of examples have been disclosed herein, those skilled in the art, having the benefit of this disclosure, will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2012/061034 | 10/19/2012 | WO | 00 |