Machine learning algorithms may be employed for a variety of purposes. For example, a machine learning algorithm may be used to categorize data, form clusters of entities having similar characteristics, make recommendations relating to content, rank results in an Internet search, analyze data in an enterprise, and so on.
Machine learning algorithms typically employ vectors to represent entities that are the subject of the “learning.” However, in certain cases traditional techniques that were employed to construct these vectors could be quite difficult to use, as they may involve a great deal of knowledge and experience. Therefore, these traditional techniques were often limited to sophisticated users that had this knowledge and experience.
Feature vector construction techniques are described. In one or more implementations, an input is received at a computing device that describes a graph query that specifies one of a plurality of entities to be used to query a knowledge base graph. A feature vector is constructed, by the computing device, having a number of indicator variables, each of which indicates observance of a sub-graph feature represented by a respective indicator variable in the knowledge base graph.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different instances in the description and the figures may indicate similar or identical items.
Machine learning algorithms for tasks like categorization, clustering, recommendations, ranking, and so on may operate on entities (e.g., documents, people, tweets, chemical compounds, and so on) represented using feature vectors. However, traditional techniques used to construct feature vectors suitable for use by the machine learning algorithms may involve specialized knowledge and experience.
Feature vector construction techniques are described herein. In one or more implementations, these techniques leverage knowledge about entities and corresponding relationships that is aggregated in the form of knowledge base graphs, e.g., triple-stores. These knowledge base graphs may represent knowledge in terms of a graph whose nodes represent entities and whose edges represent relationships between such entities. Such a representation of the entities may operate as a source for automatically constructing features describing the entities in the knowledge base graph. Further discussion of techniques that may be used to construct these feature vectors may be found in relation to the following sections.
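For instance, a knowledge base graph configured as a triple-store may be sketched as follows. This is a minimal illustration in Python; the variable and function names (e.g., “knowledge_base,” “neighbors”) and the sample facts are hypothetical and shown only to make the representation concrete.

# A toy triple-store: each entry is a (node, relationship, node) triple.
knowledge_base = [
    ("Albert_Einstein", "isA", "physicist"),
    ("Albert_Einstein", "isA", "philosopher"),
    ("Albert_Einstein", "bornInYear", "1879"),
]

def neighbors(kb, entity):
    # Return the edges (relationships) in which the given entity node participates.
    return [(s, p, o) for (s, p, o) in kb if s == entity or o == entity]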
The following discussion starts with a section describing an example environment and system that is operable to employ the feature vector construction techniques described herein. Example implementations are then described, along with an example procedure. It should be readily apparent that the example implementation and procedure are not limited to performance in the example environment and vice versa, as a wide variety of environments, implementations, and procedures are contemplated without departing from the spirit and scope thereof.
The computing device 102 may also include an entity component (e.g., software) that causes hardware of the computing device 102 to perform operations, e.g., processors, functional blocks, and so on. For example, the computing device 102 may include a computer-readable medium that may be configured to maintain instructions that cause the computing device, and more particularly the hardware of the computing device 102, to perform operations. Thus, the instructions function to configure the hardware to perform the operations and in this way result in transformation of the hardware to perform functions. The instructions may be provided by the computer-readable medium to the computing device 102 through a variety of different configurations.
One such configuration of a computer-readable medium is a signal-bearing medium and thus is configured to transmit the instructions (e.g., as a carrier wave) to the hardware of the computing device, such as via the network. The computer-readable medium may also be configured as a computer-readable storage medium and thus is not a signal-bearing medium. Examples of a computer-readable storage medium include random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions and other data.
The computing device 102 is illustrated as including a knowledge base graph 104, a vector construction module 106, one or more feature vectors 108, and a machine learning module 110. Although these components are described as being included in the computing device 102, functionality and data represented by these respective components may be further divided, combined, distributed, e.g., across a network 112, and so on.
The knowledge base graph 104 in this example represents entities 114 and relationships 116 between the entities 114. For example, the knowledge base graph 104 may be configured to represent pair-wise relationships, such as nodes and edges as further described beginning in relation to
The vector construction module 106 is representative of functionality of the computing device 102 to construct one or more feature vectors 108 from the knowledge base graph 104. The entities 114 of the knowledge base graph 104, for instance, may have a plurality of different types. For example, an entity “Albert_Einstein” may have a type “physicist” as well as a type “philosopher.” Accordingly, graph queries may be constructed and utilized by the vector construction module 106 to serve as a basis for constructing the feature vectors 108.
The feature vectors 108 formed by the vector construction module 106 may be utilized for a variety of purposes. For example, a machine learning module 110 may employ machine learning algorithms for tasks like categorization, clustering, recommendations, ranking, and so on using the feature vectors 108. Thus, the feature vector 108 may have a wide variety of different uses, further discussion of which may be found in relation to the following figure.
The query construction module 202, for instance, is representative of functionality to construct a graph query 204. A user, for instance, may interact with a user interface 118 of the computing device 102 of
The graph query 204 may specify an entity “E” of type “T.” The graph query 204 may then be used by the vector processing module 206 to return sub-graphs of a knowledge database graph “KB.” In the illustrated example, the knowledge database graph 118 represents a document 210 having a plurality of words 212, although other knowledge database graphs are also contemplated as previously described.
The sub-graphs returned by the vector processing module 206 contain the entity “E” as specified by the graph query 204. Further, in one or more implementations a number of sub-graphs for entity “E” that are returned is restricted by a number of types to which the entity “E” belongs.
The vector processing module 206 is also configured to construct a set including each possible returned sub-graph for the entity “E” of type “T” as a set of sub-graphs for entity “E” (the entity of interest). In an implementation, the feature vector 108 constructed from this information by the vector processing module 206 has a length equal to a number of the possible sub-graph features available for entity “E” of type “T.” The feature vector 108 is formed to include indicator variables that describe observance of a feature represented by the respective indicator variables.
In one or more implementations, the feature vector 108 is configured as a binary feature vector having indicator variables that contain a “1” if a corresponding sub-graph feature is present and a “0” if a corresponding sub-graph feature is not present. It should be readily apparent that a wide variety of transform functions may be employed by the vector construction module 106 to form the feature vector 108 without departing from the spirit and scope thereof.
For example, suppose the knowledge base graph (KB) is configured to represent entities and pair-wise relationships in terms of a graph where the nodes represent the entities and the edges represent the relationships. Feature vector representations may then be formed by the vector construction module 106 for a subset of entities in the knowledge base graph (KB) from each entity's local context in the knowledge base graph. To this end, the graph query language 208 (e.g., NAGA or SPARQL) may be used to form a graph query 204. In one or more implementations, the graph query 204 effectively describes a template for sub-graphs to be returned for the query. Continuing with the previous example, the techniques described herein may take a knowledge base graph 104 “KB,” a graph query 204 “GQ,” an entity type “T,” and an entity “E” of type “T” to return a binary feature vector having a form as follows:
FV(KB, GQ, T, E) for entity E.
As previously described, the feature vector “FV” may be constructed as a vector of indicator variables. Each of the indicator variables may be used to indicate observance of a corresponding feature, such as whether a given sub-graph feature is observed for an entity “E” or not.
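As a minimal sketch of such an indicator-variable vector (hedged Python for illustration; the inputs are assumed to have been computed elsewhere, e.g., by the vector construction module 106, and the function name is hypothetical):

def fv(possible_subgraph_features, observed_subgraph_features):
    # FV(KB, GQ, T, E): one indicator variable per possible sub-graph feature
    # for type "T"; the entry is 1 if the corresponding sub-graph feature is
    # observed for entity "E" in the knowledge base graph, and 0 otherwise.
    observed = set(observed_subgraph_features)
    return [1 if feature in observed else 0 for feature in possible_subgraph_features]

# For example, fv(["f1", "f2", "f3"], ["f2"]) yields [0, 1, 0].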
Consider now an example of constructing feature vectors for documents 210 based on a “bag-of-words” representation as illustrated in
?W isA Word
D containsWord ?W
In order to construct the feature vector 108, a vocabulary is first determined to find which words the document 210 may contain. The following query returns each of the document/word pairs such that the word is contained in the document.
?D isA Document
?W isA Word
?D containsWord ?W
The feature vector 108 may be constructed in this example as a binary feature vector such that an indicator variable (e.g., an entry) is included for each word in the vocabulary, and the entries take a value of “1” if the corresponding word is present and “0” otherwise.
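The following is a minimal sketch of this bag-of-words construction, assuming the document/word pairs returned by the query above are available as tuples; the sample data and names are hypothetical and shown only for illustration.

# Hypothetical document/word pairs returned by the "?D containsWord ?W" query above.
pairs = [("doc1", "graph"), ("doc1", "vector"), ("doc2", "graph")]

# The vocabulary is the set of distinct words across the returned pairs.
vocabulary = sorted({word for (_, word) in pairs})

def bag_of_words_vector(document, pairs, vocabulary):
    # One binary entry per word in the vocabulary: "1" if the document
    # contains the word and "0" otherwise.
    contained = {word for (doc, word) in pairs if doc == document}
    return [1 if word in contained else 0 for word in vocabulary]

# In this illustration, bag_of_words_vector("doc1", pairs, vocabulary) yields
# [1, 1] over the vocabulary ["graph", "vector"], while "doc2" yields [1, 0].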
The discussion above is but a simple example of how to construct feature vectors 108 from a knowledge base graph 104. Based on the type system/isA relationship, feature vectors 108 can be constructed which allow a machine learning algorithm to generalize across entities that share a type. Also, by introducing wildcard (e.g., dummy) variables, features may be constructed based on many-to-one lookup tables such as mappings from IP address to geo-location or similar.
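As one hedged illustration of such a lookup-table feature, assuming hypothetical “hasIPAddress” and “locatedIn” relationships in the knowledge base graph, a wildcard variable may stand in for the intermediate IP address:

# Graph query template, where "$ip" plays the role of the wildcard (dummy) variable:
#   E hasIPAddress $ip,
#   $ip locatedIn $loc
def geo_location_features(kb, entity, locations):
    # Collect the geo-locations reachable from the entity through any IP address.
    ips = {o for (s, p, o) in kb if s == entity and p == "hasIPAddress"}
    observed = {o for (s, p, o) in kb if s in ips and p == "locatedIn"}
    # One binary indicator per known geo-location.
    return [1 if loc in observed else 0 for loc in locations]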
Generally, any of the functions described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or a combination of these implementations. The terms “module” and “functionality” as used herein generally represent hardware, software, firmware, or a combination thereof. In the case of a software implementation, the module, functionality, or logic represents instructions and hardware that performs operations specified by the hardware, e.g., one or more processors and/or functional blocks.
The instructions can be stored in one or more computer-readable media. As described above, one such configuration of a computer-readable medium is a signal-bearing medium and thus is configured to transmit the instructions (e.g., as a carrier wave) to the hardware of the computing device, such as via the network 112. The computer-readable medium may also be configured as a computer-readable storage medium and thus is not a signal-bearing medium. Examples of a computer-readable storage medium include random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions and other data. The features of the techniques described below are platform-independent, meaning that the techniques may be implemented on a variety of commercial computing platforms having a variety of hardware configurations.
As previously described, these techniques may be applied to a variety of different knowledge base graphs 118 that may describe a variety of different data, such as web pages, social network services, Yago, DBPedia, Linked Open Data (LOD), product catalogs of business entities, and so on, and may use a variety of frameworks for knowledge representation, such as RDF, RDFS, OWL, and so forth. Thus, these techniques may be used to navigate through large collections of disparate information, such as the World Wide Web, which bears the potential of being the world's most comprehensive knowledge base. For example, the Web includes a multitude of valuable scientific and cultural content, news and entertainment, community opinions, and advertisements. However, this data may also include a variety of other data having limited value, such as spam and junk. Unfortunately, the useful and limited-value data may form an amorphous collection of hyperlinked web pages. Accordingly, typical keyword-oriented search engines merely provide best-effort heuristics to find relevant “needles” in this “haystack.”
For example, entities in the knowledge base graph 118 may have a plurality of types. Suppose a query is contemplated to locate physicists who were born in the same year as Albert Einstein. Using traditional search techniques, it is difficult if not impossible to formulate this query in terms of keywords. Additionally, the answer to this question may be distributed across multiple web pages, so that a traditional search engine may not be able to find it. Further, the keywords “Albert Einstein” may stand for different entities, e.g., the physicist Albert Einstein, the Albert Einstein College of Medicine, and so on. Therefore, posing this query to traditional search engines (by using the keywords “physicist born in the same year as Albert Einstein”) may yield pages about Albert Einstein himself, along with pages about the Albert Einstein College of Medicine. This example highlights the limitations found in traditional search engines.
Using the techniques described herein, however, a knowledge base graph 118 may be leveraged with binary predicates, such as “Albert_Einstein isA physicist” or “Albert_Einstein bornInYear 1879,” to overcome the previous limitations. Combined with an appropriate query language and ranking strategies, users may be able to express queries with semantics and retrieve precise information in return.
For example, these techniques may be employed by a semantic search engine, such as NAGA. The semantic search engine may follow a data model of a graph, in which the nodes represent entities and the edges represent relationships between the entities as previously described. An edge in the graph with its two end-nodes may be referred to as a “fact.” Facts may be extracted from various sources, such as Web-based data sources, social network services, enterprise systems, and so on.
An example of a knowledge base graph 118 is illustrated in an example 300 of
In order to query the knowledge base graph 118, a graph query language 208 may be used as previously described. In implementations, the graph query language 208 allows the formulation of queries with semantic information.
Friends: $x isFriend John
The vector construction module 106 may then process the graph context 302 of the knowledge database graph 118 of
Number of Friends: |$x isFriend John|
Thus, this graph query 502 is configured to determine how many other entities in the knowledge database 118 are indicated as friends of John 304. Accordingly, the vector construction module 106 may process the graph context 302 of the knowledge database graph 118 of
$x isFriend John,
$x isofAge $y
Thus, this graph query 502 is configured to determine how many other entities in the knowledge database 118 are indicated as friends of John 304 and have a particular age. Accordingly, the vector construction module 106 may process the graph context 302 of the knowledge database graph 118 of
$x isFriend John,
$x isofAge 25
Thus, this graph query 702 is configured to determine how many other entities in the knowledge database 118 are indicated as friends of John 304 and are twenty-five years of age. Accordingly, the vector construction module 106 may process the graph context 302 of the knowledge database graph 118 of
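A minimal sketch of how such count-based graph queries may be evaluated against the graph context 302 is shown below; the triple representation and the sample entities other than John are hypothetical and shown only for illustration.

# A toy graph context: friends of John and their ages (hypothetical sample data).
graph_context = [
    ("Sara", "isFriend", "John"),
    ("Bob", "isFriend", "John"),
    ("Sara", "isofAge", "25"),
    ("Bob", "isofAge", "30"),
]

def count_friends(kb, person):
    # |$x isFriend John|: count the entities indicated as friends of the person.
    return sum(1 for (x, p, o) in kb if p == "isFriend" and o == person)

def count_friends_of_age(kb, person, age):
    # $x isFriend John, $x isofAge 25: count friends of the person with the given age.
    friends = {x for (x, p, o) in kb if p == "isFriend" and o == person}
    return sum(1 for (x, p, o) in kb if p == "isofAge" and o == age and x in friends)

# In this illustration, count_friends(graph_context, "John") yields 2 and
# count_friends_of_age(graph_context, "John", "25") yields 1.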
Thus, as described above, the graph query language 208 may be used to support complex graph queries 204 with regular expressions over relationships on edge labels. These techniques may be employed in a variety of ways, such as to implement a graph-based knowledge representation model for knowledge extraction from Web-based corpora, data describing enterprise systems, and so on.
The following discussion describes feature vector construction techniques that may be implemented utilizing the previously described systems and devices. Aspects of each of the procedures may be implemented in hardware, firmware, or software, or a combination thereof. The procedures are shown as a set of blocks that specify operations performed by one or more devices and are not necessarily limited to the orders shown for performing the operations by the respective blocks. In portions of the following discussion, reference will be made to the environment 100 of
A graph query is formed that specifies an entity and a type (block 804). For example, a graph query language 208 may be employed to form a graph query 204 that may be used as a template for the feature vector 108.
Sub-graphs are found, in the knowledge base graph, which contain the entity (block 806). Further, a number of sub-graphs for the entity that are found may be restricted by a number of types to which the entity belongs (block 808). The vector construction module 106, for instance, may process the knowledge database graph 118 to find sub-graphs that contain the entity.
A set of the found sub-graphs is located for the type (block 810), e.g., by the vector construction module 106, as those sub-graphs that include the type specified by the graph query 204.
A feature vector is constructed (block 812). For example, the feature vector may have a length that corresponds to a number of possible sub-graph features available for the type (block 814). The feature vector may also be configured as a binary feature vector and contain an indicator for each of the possible sub-graph features that describes whether the feature is available or not available (block 816). Examples of such feature vectors were previously described in relation to
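The procedure just described may be sketched end-to-end as follows. This is a minimal, hedged illustration over a toy knowledge base graph; the sample triples and the treatment of single edges as sub-graph features are simplifying assumptions made only for this example.

# Toy knowledge base graph (hypothetical sample triples).
kb = [
    ("Albert_Einstein", "isA", "physicist"),
    ("Albert_Einstein", "bornInYear", "1879"),
    ("Niels_Bohr", "isA", "physicist"),
    ("Niels_Bohr", "bornInYear", "1885"),
]

def construct_feature_vector(kb, entity, entity_type):
    # Block 806: find the sub-graphs (here, single edges) that contain the entity.
    entity_edges = {(p, o) for (s, p, o) in kb if s == entity}
    # Blocks 808/810: the set of found sub-graphs is located for the type by
    # enumerating the edges of every entity of that type.
    typed_entities = {s for (s, p, o) in kb if p == "isA" and o == entity_type}
    possible = sorted({(p, o) for (s, p, o) in kb if s in typed_entities})
    # Blocks 812-816: a binary feature vector whose length corresponds to the
    # number of possible sub-graph features for the type, with an indicator
    # describing whether each feature is available for the entity or not.
    return [1 if feature in entity_edges else 0 for feature in possible]

# In this illustration, construct_feature_vector(kb, "Albert_Einstein", "physicist")
# yields [1, 0, 1] over the features [("bornInYear", "1879"),
# ("bornInYear", "1885"), ("isA", "physicist")].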
Although the invention has been described in language specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed invention.