The present application claims priority to, and the benefit of, EP Patent Application No. 23205300.9, filed 23 Oct. 2023, the contents of which are incorporated herein by reference for all purposes.
The technical field of the present disclosure is information protection. Specifically, disclosed aspects relate to detecting data leakage in the context of information security, wherein sensitive or confidential data has been transmitted or otherwise distributed to an unauthorized destination. Further aspects relate to detecting dangerous information that has been made available to the public and could be harmful to a person. Detecting such dangerous information may contribute to user safety.
Leaked data may have been distributed by a trusted agent or another insider. Conventionally, data leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If the copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Another approach to detecting leaked data is the use of pattern matching in an intrusion detection system. Yet another approach may include search operations on encrypted data.
In some cases, data leakage or dangerous information may be detected by interacting with a language model, for example, a large language model based on a generative pre-trained transformer, such as ChatGPT. In some cases, the language model is a probability distribution over sequences of words. The large language model may have at least one million parameters or at least one billion parameters.
Dangerous information may include incorrect or false instructions for using a medicine, instructions on how to make a bomb that should not have been made public, or incorrect instructions on how to build a machine that could lead to injury.
Such a language model may be trained in an unsupervised manner on large quantities of unlabeled human-readable text. More specifically, the language model may be trained on at least one million, at least 10 million, at least 100 million or at least one billion samples. A sample may be a document or a web page. At least a portion of the samples may be publicly available or otherwise available to unauthorized users.
In addition or alternatively, the language model may include a neural network. The neural network may be a deep neural network, for example, a neural network having one or more hidden layers. The neural network may have at least one million parameters (e.g., weights and biases) or at least one billion parameters. The neural network may have been trained on unlabeled (i.e., unannotated) text using unsupervised (that is, self-supervised) learning.
In addition or alternatively, the neural network may include a transformer that uses self-attention, thereby differentially weighting the significance of each part of input data provided to the neural network. Input to the neural network may be parsed into tokens and the tokens may be processed simultaneously by calculating weights for the tokens in successive layers of the neural network. The neural network may be designed to process sequential input data. The neural network may include weights (e.g., soft weights) that can be changed during runtime.
Accordingly, it may be desirable to interact with an application programming interface for a social media application or a language model in order to detect leaked data and/or to detect dangerous information.
According to an aspect, a computer implemented method for detecting data leakage and/or detecting dangerous information is provided. The method comprises receiving a knowledge graph and extracting data from at least one network service. The method further comprises identifying statements in the extracted data. The method further comprises, for each identified statement: determining whether the identified statement is public or private using the knowledge graph, and/or determining whether the identified statement is true (i.e., correct) or false (i.e., incorrect) using the knowledge graph.
The knowledge graph may be associated with or belong to an enterprise. The knowledge graph may store knowledge regarding skills, processes, experience, capabilities and insights that are relied upon in day-to-day operations of the enterprise. Contents of the knowledge graph may also include enterprise specific acronyms, departments of the enterprise and product specifications. The knowledge in the knowledge graph may enable the enterprise to react to business situations in a fast, professional and flexible manner. The knowledge graph may be expensive and labor intensive to construct and maintain.
The knowledge graph (i.e., semantic web and/or web of linked data) may be specified using the resource description framework (RDF) or an extension of the resource description framework, such as RDF-star (i.e., RDF*). RDF-star enables descriptions to be added to edges in the knowledge graph, wherein the descriptions may include scores, weights, temporal aspects and provenance. RDF-star enables statements about statements in the knowledge graph to be created. Specifically, it is possible to attach metadata to describe edges in the knowledge graph in addition to attaching statements to nodes in the knowledge graph. In the present disclosure, references to RDF may be understood to refer to RDF-star.
The knowledge graph may include a plurality of statements. The knowledge graph may be implemented as a directed graph. The statements may be referred to as triples or triple statements, with subject, predicate and object components. The subject, the object, and the predicate may be referred to as components of their respective statement. The subject and/or the predicate may be an RDF resource (e.g., the subject and/or the predicate may have the resource property of RDF and may be a type or a label). The object may be a literal (e.g., an RDF literal) having a defined data type, such as string, integer, Boolean or double (as defined in the extensible markup language (XML) schema definition language (XSD)). Regarding RDF, please refer to the RDF specification, "Concepts and Abstract Syntax", https://www.w3.org/TR/rdf11-concepts/.
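As a minimal illustration of this triple structure (a sketch using the rdflib Python library; the resource names and the literal value are hypothetical):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# The subject and the predicate are RDF resources (URIs); the object is a
# literal with a defined data type (here: xsd:integer).
g.add((EX.SAP_SE, EX.numberOfEmployees, Literal(100000, datatype=XSD.integer)))

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```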
In some cases, the statements of the knowledge graph may be close to human language.
The following exemplary three statements may be included in the knowledge graph:
In addition or alternatively, the following four statements may be included in the knowledge graph:
In addition or alternatively, the following four statements may be included in the knowledge graph:
In some cases, the knowledge graph includes a plurality of nodes connected by edges. The nodes may represent real world entities and the edges may represent relations between entities or relations between entities and types (i.e., classes) of the entities. Hence, predicates can be distinguished depending on whether they connect two entities or an entity and an entity type. The entities may also be referred to as resources. For each statement, the subject may correspond to a node of the knowledge graph, the object may correspond to a different node of the knowledge graph and an edge of the knowledge graph corresponding to the predicate may connect the subject node to the object node.
The nodes may have corresponding classes, such that each of the nodes has a corresponding class. The corresponding classes may be part of (or organized in) a schema (i.e., a data schema or an ontology). The schema may be defined in the RDF or the web ontology language. The following are examples of classes:
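(The original listing is not reproduced here; based on the discussion that follows, the class declarations presumably resembled the following sketch, with Turtle embedded in Python via rdflib.)

```python
from rdflib import Graph

# Hypothetical reconstruction of the elided class examples: ":State" is a
# class and ":EuropeanState" is a subclass of ":State".
turtle = """
@prefix :     <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:State         a rdfs:Class .
:EuropeanState a rdfs:Class ;
               rdfs:subClassOf :State .
"""
g = Graph()
g.parse(data=turtle, format="turtle")
```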
Hence “:State” is a resource that is a class, more specifically, an RDF class. The class “:EuropeanState” is another resource that is a class, more specifically, a subclass of “:State”. Hence, hierarchies of classes are possible. Moreover, multiple inheritance is also possible.
In addition or alternatively, the knowledge graph may be labeled and multi-relational. Accordingly, both the nodes and edges may have labels and the edges may have directions. The objects of the statements may be labels of the knowledge graph. The knowledge graph may be multi-relational in the sense that the edges have different labels. The nodes of the knowledge graph may be subjects or objects and the edges may be predicates.
In some cases, the schema includes properties. Each of the properties may apply to at least one of the classes in the schema. At least one of the properties may have a domain and/or a range. Each of the properties may be used by (or apply to) at least one statement. The domain (e.g., rdfs:domain) may specify a class to which a subject belongs and the range (for example, rdfs:range) may specify a class to which an object belongs. More specifically, the domain may specify the class to which the subject of the statement belongs and the range may specify the class to which the object of the statement belongs. With regard to the RDF Schema, please refer to the W3C RDF Schema specification, https://www.w3.org/TR/rdf-schema/.
The following are examples of properties:
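(The original listing is not reproduced here; based on the discussion that follows and the materialization example below, the property declarations presumably resembled this sketch.)

```python
from rdflib import Graph

# Hypothetical reconstruction of the elided property examples, including the
# subproperty relation and the domain and range used later in the disclosure.
turtle = """
@prefix :     <http://example.org/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:locatedIn a rdf:Property .
:capitalOf a rdf:Property ;
           rdfs:subPropertyOf :locatedIn ;
           rdfs:domain :City ;
           rdfs:range  :Country .
"""
g = Graph()
g.parse(data=turtle, format="turtle")
```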
Hence, “:locatedIn” and “:capitalOf” are properties. Moreover, “:capitalOf” is a subproperty of “:locatedIn”. Hence, properties can also form hierarchies. The statement “:EuropeanState rdfs:subClassOf :State” indicates that “:EuropeanState” is a subclass in a class hierarchy including the class “:State” and the subclass “:EuropeanState”.
Hence, the schema may provide a vocabulary for the knowledge graph. The knowledge graph may have predefined property prefixes, which can indicate whether a node (that is, a subject or object) is an instance of a class or a class itself (for example, a node may be a class if the node has the prefix “dbo,” which represents the DBpedia ontology, and a node may be an instance if the node has the prefix “dbr,” which represents a DBpedia resource). In certain cases, the directed graph can use a component of a URI to differentiate between instances and classes.
The knowledge graph may include statements which (explicitly or implicitly) indicate that certain nodes are classes. Accordingly, whether a specific node represents an instance or a class can depend on the underlying model (i.e., definitions in the schema). For example, whether a node is a class (and included in the schema of the knowledge graph) or an instance (and thus not included in the schema of the knowledge graph) can be determined by checking the rdf:type property. If the rdf:type is owl:Class, then the node is a class and is included in the schema; otherwise, the node is an instance (that is, an instance of a class) and is not included in the schema.
In addition or alternatively, the statements from the knowledge graph may cover a plurality of topical domains. Each statement may be identified by at least one uniform resource identifier (URI). At least one of the nodes and edges may be identified by a URI or an internationalized resource identifier (IRI). More specifically, the nodes and edges may each be identified by a URI or an IRI. In some cases, one or more of the subject, the object and the predicate may be a URI. Some nodes (for example, nodes corresponding to objects) may be identified via a literal rather than a URI. The knowledge graph may be represented using the RDF. This graph may have a structure that is similar to known knowledge graphs such as DBpedia, Wikidata, BabelNet, DBkWik, Freebase and DBnary.
Compared to relational databases, the knowledge graph has a more flexible data structure because the types of data provided by the knowledge graph can vary. For example, properties associated with different instances can differ even though these instances share the same class (e.g., “SAP_SE” and “BASF_SE” can have different property data although they share the same class “Company”). A relational database can be represented in a knowledge graph format, i.e., the knowledge graph may be a higher-level abstraction of the relational database.
In certain examples, the nodes of the knowledge graph can be organized in a hierarchical structure where a lower-level node (representing a more specific object) may be connected to a higher-level node (representing a more generic object) by one or more edges. The lower-level node (or the lower-level object it represents) may be a descendant of the higher-level node (or the higher-level object it represents), and the higher-level node (or the higher-level object it represents) may be an ancestor of the lower-level node (or the lower-level object it represents).
The method may further comprise receiving one or more rules corresponding to the knowledge graph. The rules may be reasoning, logic, inference or RDF schema rules. The method may further comprise materializing the knowledge graph by applying the rules to statements of the knowledge graph to compute additional statements. Materializing the knowledge graph may be described as adding context data or references to context data to the knowledge graph. Materializing the knowledge graph may be implemented by applying reasoning or applying the rules to the knowledge graph. Numbered examples of rules are the following:
An example of materializing (i.e., reasoning) follows. The materializing is based on the following statement:
Accordingly, materializing may include combining a statement with one or more properties. More specifically, materializing may include combining a statement with properties (e.g., property restrictions) that limit the subject or object of the statement. The combinations may be used to determine further statements, e.g., classes that the subject of the statement is an instance of and/or classes that the object of the statement is an instance of. Materializing may include determining statements that can be implicitly derived from the knowledge graph and adding the determined statements to the knowledge graph. Three numbered examples of reasoning follow:
Each of the three examples above combines the statement “:Madrid :capitalOf :Spain” with a different property in order to compute (i.e., derive) an additional statement. In the first example, the statement “:Madrid :capitalOf :Spain” is combined with the property “:capitalOf rdfs:domain :City” to compute “:Madrid a :City”, which indicates that the subject of the statement, “:Madrid”, belongs to (i.e., is an instance of) the class “:City”. In the second example, “:Madrid :capitalOf :Spain” is combined with the property “:capitalOf rdfs:range :Country” to compute “:Spain a :Country”, which indicates that the object of the statement, “:Spain”, is an instance of the class “:Country”. In the third example, “:Madrid :capitalOf :Spain” is combined with the property “:capitalOf rdfs:subPropertyOf :locatedIn” to compute “:Madrid :locatedIn :Spain”, which indicates that the subject “:Madrid” has the property “:locatedIn” with respect to the object “:Spain”.
Continuing the example, given the statement and the properties before materialization, the following SPARQL Protocol and RDF Query Language (SPARQL) query would return FALSE:
After materialization, the same SPARQL query would return TRUE. Materializing the knowledge graph may increase the effectiveness and/or usability of the knowledge graph for detecting data leakage and detecting dangerous information. The knowledge graph may be materialized before the knowledge graph is received.
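A sketch of the whole materialization example, assuming the elided query was a SPARQL ASK for “:Madrid :locatedIn :Spain” and using the rdflib and owlrl Python libraries (the owlrl reasoner stands in for whatever rule engine applies the rules in a given implementation):

```python
from rdflib import Graph
from owlrl import DeductiveClosure, RDFS_Semantics

turtle = """
@prefix :     <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:capitalOf rdfs:domain :City ;
           rdfs:range  :Country ;
           rdfs:subPropertyOf :locatedIn .

:Madrid :capitalOf :Spain .
"""
g = Graph()
g.parse(data=turtle, format="turtle")

ask = ("ASK { <http://example.org/Madrid> "
       "<http://example.org/locatedIn> <http://example.org/Spain> }")
print(bool(g.query(ask).askAnswer))  # False before materialization

# Materialize: apply the RDFS entailment rules (domain, range, subproperty)
# to compute the implicitly derivable statements and add them to the graph.
DeductiveClosure(RDFS_Semantics).expand(g)
print(bool(g.query(ask).askAnswer))  # True after materialization
```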
In addition or alternatively, the method may further comprise sorting the knowledge graph such that nodes of the knowledge graph are grouped together with their neighbors. The sorting may be carried out after materializing the knowledge graph. Sorting may comprise determining a list of nodes in the knowledge graph and adding a randomly selected node to the list of nodes. For each node in the list of nodes, the connected nodes may be determined. For each of the connected nodes, if the respective connected node is in the knowledge graph, the respective connected node may be added to the list of nodes. After a node has been processed, the node may be removed from the list of nodes.
Pseudocode for the sorting algorithm described in the preceding paragraph is provided below:
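(The original pseudocode listing is not reproduced here; the following Python sketch is consistent with the description above, assuming the graph maps each node to its Neighbor(edge, node) pairs.)

```python
import random
from collections import namedtuple

Neighbor = namedtuple("Neighbor", ["edge", "node"])  # (predicate, object)

def serialize(subject, predicate, obj):
    # Verbalize one triple in an N-Triples-like form.
    return f"{subject} {predicate} {obj} ."

def sort_graph(graph):
    """Group each node with its neighbors so that neighboring statements
    relate to the same or similar topics."""
    remaining = set(graph)  # nodes of the knowledge graph not yet visited
    output = []             # serialized, topically grouped statements
    while remaining:
        # Seed the list of nodes with a randomly selected node.
        todo = [random.choice(sorted(remaining))]
        while todo:
            r = todo.pop(0)              # remove the node from the list
            if r not in remaining:
                continue
            remaining.remove(r)
            for n in graph.get(r, []):   # determine the connected nodes
                output.append(serialize(r, n.edge, n.node))
                if n.node in remaining:  # node is in the knowledge graph
                    todo.append(n.node)
    return output
```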
The serialize( ) function above may verbalize a respective triple (“r”—subject, “n.edge”—predicate, “n.node”—object). In other words, the serialize( ) function may translate a statement from the knowledge graph into a serialization format, such as RDF/XML, RDFa, Notation3 (.n3), Turtle (.ttl), N-Triples, or JSON-LD.
The sorting algorithm may be referred to as a clustering algorithm and may ensure topicality, i.e., that sorted statements are close to each other in the sense that they relate to similar topics or the same topic. In other words, neighboring statements are semantically similar. This may increase the effectiveness of the knowledge graph in detecting data leakage and detecting dangerous information.
In some cases, determining whether the identified statement is true or false using the knowledge graph comprises determining whether a subject-predicate combination of the identified statement has a maximum cardinality, i.e., a maximum number of objects. When the identified statement has a maximum cardinality and the maximum cardinality of the subject-predicate combination of the identified statement is exceeded, the method may comprise determining that the identified statement is false. For example, the maximum cardinality may be 1. When the subject-predicate combination has a maximum cardinality of 1, then the subject-predicate combination may only be true in combination with one object and may be false in combination with any other object.
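A minimal sketch of such a cardinality check (the lookup tables are hypothetical stand-ins for schema knowledge and for the statements already in the knowledge graph):

```python
# Maximum cardinality per (subject, predicate) combination; here the
# hypothetical schema allows at most one object for (:Madrid, :capitalOf).
MAX_CARDINALITY = {(":Madrid", ":capitalOf"): 1}

# Objects already known for each (subject, predicate) combination.
KNOWN_OBJECTS = {(":Madrid", ":capitalOf"): {":Spain"}}

def exceeds_cardinality(statement):
    """Return True if the statement exceeds the maximum number of objects
    allowed for its subject-predicate combination (i.e., is false)."""
    subject, predicate, obj = statement
    limit = MAX_CARDINALITY.get((subject, predicate))
    if limit is None:
        return False  # no cardinality restriction known
    existing = KNOWN_OBJECTS.get((subject, predicate), set())
    return obj not in existing and len(existing) >= limit

print(exceeds_cardinality((":Madrid", ":capitalOf", ":France")))  # True
print(exceeds_cardinality((":Madrid", ":capitalOf", ":Spain")))   # False
```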
In addition or alternatively, the at least one network service may include one or more of the following: a web service, a social networking service, a service providing an interface to a language model. The network service may run on a server and listen for network requests on a port. The web service may use the hypertext transfer protocol (HTTP) and may listen for network requests on port 80 and/or 443.
Extracting data from the at least one network service may comprise accessing at least one application programming interface via the at least one network service. Extracting data from the at least one network service may comprise web scraping using HTTP, wherein the web scraping may comprise web crawling.
In the context of the social networking service, the application programming interface may provide access to settings and allow updates to the settings. The application programming interface may further allow access and updates to content, to create and get posts, to get comments or insights and perform other update actions. The application programming interface may also enable searching, filtering and data extraction.
Web scraping may comprise extracting content from a website, specifically underlying HTML code and/or data stored in a database. Web crawling may comprise systematically browsing webpages as well as indexing, updating and retrieving webpage content.
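A minimal sketch of both extraction paths, assuming the requests and BeautifulSoup libraries and hypothetical URLs:

```python
import requests
from bs4 import BeautifulSoup

API_URL = "https://api.example.org/search"  # hypothetical API endpoint

def extract_via_api(query: str) -> dict:
    """Access an application programming interface via the network service."""
    response = requests.get(API_URL, params={"q": query}, timeout=10)
    response.raise_for_status()
    return response.json()

def scrape_page(url: str) -> str:
    """Web scraping: extract visible text from a page's underlying HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser").get_text(" ", strip=True)
```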
Examples of social networking services include Facebook, LinkedIn and Twitter (or X). Examples of language models include ChatGPT, PaLM, Megatron, Titan and Chinchilla.
The method may further comprise computing a vector for each uniform resource identifier (URI) of the knowledge graph, such that each of the uniform resource identifiers of the knowledge graph has a corresponding vector. The method may further comprise creating a copy of the knowledge graph including the computed vectors. The method may further comprise modifying the copy of the knowledge graph to include a number of true statements (i.e., true facts) that is equal to a number of false statements (i.e., false facts) in the copy of the knowledge graph. In the context of the present disclosure, the statements may also be referred to as facts.
The combination of a URI, a vector and an endpoint identifier may be referred to as an embedding. In this context, an endpoint may be a URL determining a web service or API that can be used to retrieve information from a knowledge graph. The knowledge graph may be deployed on a graph database server and the graph database server may have a software layer that exposes the web service or API. An exemplary endpoint is as follows:
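(The original exemplary endpoint is not reproduced here; a hypothetical endpoint exposed by a graph database server might look as follows.)

```python
# Hypothetical SPARQL endpoint URL for a knowledge graph repository.
ENDPOINT = "https://graphdb.example.org:7200/repositories/enterprise-kg"
```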
A purpose of computing the vectors for each URI of the knowledge graph may be so that the vectors can be used to train a machine learning model. Computing the vectors may be referred to as training embeddings. Training embeddings may be a machine learning task of learning a low-dimensional representation of the knowledge graph's entities and relations (edges) while preserving the semantic meaning of the entities and relations. The embeddings may be used for link prediction, triple classification, entity recognition, clustering and relation extraction.
A purpose of modifying the copy of the knowledge graph so that the number of true statements is equal to the number of false statements may be so that the copy of the knowledge graph can be used to train the machine learning model. Specifically, it might not be possible to train the machine learning model properly if the knowledge graph only has true statements or there is a substantial difference between the number of true statements and the number of false statements.
Modifying the copy of the knowledge graph to include a number of true statements that is equal to a number of false statements may comprise, when the copy of the knowledge graph has more false statements than true statements, randomly deleting false statements. When the copy of the knowledge graph has more true statements than false statements, the modifying may comprise adding false statements to the copy of the knowledge graph by randomly selecting and combining subjects, predicates, and objects from different statements in the knowledge graph. Each component of a statement in the copy of the knowledge graph may have a corresponding computed vector. For each statement in the copy of the knowledge graph, the method may comprise computing, using the computed vectors corresponding to the statements, a similarity value measuring the similarity of the statement to each of the other statements in the copy of the knowledge graph. The similarity value may be computed using a cosine similarity measurement.
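A sketch of this balancing step, assuming statements are represented as (subject, predicate, object) tuples:

```python
import random

def balance(true_facts, false_facts, all_facts):
    """Equalize the numbers of true and false statements in the copy."""
    true_facts, false_facts = list(true_facts), list(false_facts)
    if len(false_facts) > len(true_facts):
        # Randomly delete surplus false statements.
        random.shuffle(false_facts)
        false_facts = false_facts[: len(true_facts)]
    existing = set(true_facts) | set(false_facts)
    while len(false_facts) < len(true_facts):
        # Create a false statement by randomly recombining subjects,
        # predicates, and objects from different statements.
        s, p, o = (random.choice(all_facts)[0],
                   random.choice(all_facts)[1],
                   random.choice(all_facts)[2])
        if (s, p, o) not in existing:
            false_facts.append((s, p, o))
            existing.add((s, p, o))
    return true_facts, false_facts
```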
Accordingly, the similarity value may be computed by taking the dot product of the computed vectors divided by the product of the lengths of the vectors. Hence, two proportional vectors may have a cosine similarity of 1, two orthogonal vectors may have a similarity of 0 and two opposite vectors may have a similarity of −1.
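For instance, a minimal cosine similarity helper (numpy) reproducing the three cases just mentioned:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Dot product of the vectors divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))    #  1.0 (proportional)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))    #  0.0 (orthogonal)
print(cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0])))  # -1.0 (opposite)
```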
The vectors may be computed such that each relation (i.e., predicate) in the knowledge graph is characterized by two vectors, a normal vector of a hyperplane and a translation vector on the hyperplane. The hyperplane may be normal to an embedding space of the vectors. The vectors may be computed using a pure translational model, such as TransH, TransD, TransA or TransE. Further information regarding pure translational models may be found in “Knowledge Graph Embedding by Translating on Hyperplanes”, Wang et al., 2014.
A benefit of computing the vectors with TransH may be that there will be a small distance (e.g., close to 0, where 0 is perfect similarity) between true statements and a large distance (e.g., close to 2, where 2 is perfect dissimilarity) between false statements. The distance may be a cosine distance (i.e., the complement of the cosine similarity, or 1 − the cosine similarity). Close to 0 may be less than 0.1 or less than 0.2. Close to 2 may be greater than 1.8 or greater than 1.9.
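A sketch of the TransH scoring idea described above (numpy; the input vectors are assumed to come from a trained model):

```python
import numpy as np

def transh_score(h, t, w_r, d_r):
    """Project head h and tail t onto the relation-specific hyperplane with
    normal vector w_r, then measure how well the translation vector d_r
    carries the projected head to the projected tail. Lower is more plausible."""
    w_r = w_r / np.linalg.norm(w_r)    # ensure a unit normal vector
    h_perp = h - np.dot(w_r, h) * w_r  # projection of the head onto the hyperplane
    t_perp = t - np.dot(w_r, t) * w_r  # projection of the tail onto the hyperplane
    return float(np.linalg.norm(h_perp + d_r - t_perp))
```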
The method may further comprise training a statistical model using the computed similarity values and information indicative of whether the similarity values correspond to true or false statements in the copy of the knowledge graph. The statistical model may be a logistic regression model. Determining whether the identified statement is true or false using the knowledge graph may comprise applying the statistical model to the identified statement. When the statistical model returns a probability for an identified statement that is greater than a specified threshold, determining that the identified statement is true. When the statistical model returns a probability for the identified statement that is less than or equal to the specified threshold, determining that the identified statement is false.
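A minimal training sketch, assuming scikit-learn and hypothetical similarity values and labels (1 for true statements, 0 for false statements):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One similarity value per statement in the balanced copy of the knowledge
# graph (hypothetical numbers), with the corresponding true/false labels.
similarities = np.array([[0.97], [0.91], [0.88], [0.24], [0.11], [0.31]])
labels = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(similarities, labels)

def is_true(similarity: float, threshold: float = 0.5) -> bool:
    """Apply the statistical model to an identified statement's similarity."""
    probability = model.predict_proba([[similarity]])[0, 1]
    return probability > threshold
```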
The method may further comprise identifying sentences in the extracted data, comprising carrying out term frequency-inverse sentence frequency with stemming and lemmatization on the extracted data. For further details on term frequency-inverse sentence frequency with stemming and lemmatization see “Sentence retrieval using Stemming and Lemmatization with Different Length of the Queries”, Boban et al., 11 Jun. 2020. The method may further comprise identifying, using named entity recognition, entities in the extracted data. Named entity recognition may be a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories.
The method may further comprise identifying, via relation extraction, implied relations between the identified entities. The relation extraction may include deriving structured text from the extracted data. A result of the named entity recognition and the relation extraction may be one or more entity1-predicate-entity2 combinations.
The method may further comprise mapping, via entity linking, the identified entities to the knowledge graph. The mapping may comprise named entity disambiguation. The method may further comprise mapping predicates of the identified entity1-predicate-entity2 combinations to properties of the knowledge graph. The method may further comprise providing, based on the mapping of identified entities to the knowledge graph and the mapping of predicates to properties of the knowledge graph, at least one result statement, each result statement including a subject-predicate-object triple, each component of the subject-predicate-object triple including a uniform resource identifier of the knowledge graph.
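A deliberately simplified sketch of this pipeline, assuming the spaCy library and its small English model; the lookup tables that stand in for entity linking and predicate mapping are hypothetical:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed

# Hypothetical mappings from surface forms to knowledge graph URIs.
ENTITY_URIS = {"Madrid": ":Madrid", "Spain": ":Spain"}
PREDICATE_URIS = {"capital": ":capitalOf"}

def identify_statements(text: str):
    """Sentence identification, named entity recognition, naive relation
    extraction, entity linking, and predicate mapping in one pass."""
    results = []
    for sentence in nlp(text).sents:
        entities = [ent.text for ent in sentence.ents if ent.text in ENTITY_URIS]
        if len(entities) < 2:
            continue
        for keyword, predicate_uri in PREDICATE_URIS.items():
            if keyword in sentence.text.lower():
                # One entity1-predicate-entity2 combination, mapped to URIs.
                results.append((ENTITY_URIS[entities[0]], predicate_uri,
                                ENTITY_URIS[entities[1]]))
    return results

print(identify_statements("Madrid is the capital of Spain."))
# [(':Madrid', ':capitalOf', ':Spain')]
```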
In addition or alternatively, determining whether the identified statement is true or false (e.g., 0 or 1) and/or determining whether the identified statement is public or private comprises comparing the identified statement with statements in the knowledge graph.
In some cases, comparing the identified statement with the statements in the knowledge graph may comprise determining (e.g., finding) at least one comparable statement for the identified statement. The comparable statement for the identified statement may have at least one value in common with the identified statement. Specifically, the comparable statement for the identified statement may have at least one of a subject, a predicate, and an object in common with the identified statement. For example, the comparable statement may have the same subject as the identified statement but a different predicate and a different object than the identified statement. A plurality of the statements in the knowledge graph may include metadata indicating whether the respective statement is (i.e., the statement is tagged as) true or false, and/or a plurality of the statements in the knowledge graph may include metadata indicating whether the statement is (i.e., the statement is tagged as) public or private.
When one of the statements in the knowledge graph does not include metadata indicating whether the statement is true or false, the method may further comprise determining that the statement is true. Determining whether the identified statement is true or false may comprise identifying a maximum cardinality of the at least one comparable statement and determining whether the identified statement exceeds the maximum cardinality. In some cases, if the identified statement exceeds the maximum cardinality, then the identified statement is determined to be false.
Each statement of the knowledge graph may include a subject-predicate-object triple, wherein each subject-predicate-object triple is composed of a subject, a predicate, and an object. The subject, the predicate and the object may each be represented (i.e., identified) by a uniform resource identifier.
Extracting data from the at least one network service may comprise randomly selecting a statement from the knowledge graph. The method may further comprise constructing a query string based on a subject and a predicate of the statement and calling, using a uniform resource locator (URL), at least one application programming interface (API) via the at least one network service and using the query string.
Extracting data from the at least one network service may comprise building a natural language processing question based on a randomly (i.e., pseudo-randomly) selected statement from the knowledge graph. The method may further comprise issuing a call to the at least one application programming interface using the question.
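A sketch of both crawling paths (the statement labels and API URLs are hypothetical; real social networking and language model APIs differ):

```python
import random
import requests

def build_query_string(statement):
    """Construct a query string from the labels of subject and predicate."""
    subject_label, predicate_label, _ = statement
    return f"{subject_label} {predicate_label}"

def build_question(statement):
    """Build a natural language processing question for a conversational API."""
    subject_label, predicate_label, _ = statement
    return f"What is the {predicate_label} of {subject_label}?"

# Pseudo-randomly select a statement (here from a one-element list).
statement = random.choice([("Spain", "capital", "Madrid")])

# Standard API: call using the query string.
requests.get("https://api.social.example.org/search",
             params={"q": build_query_string(statement)}, timeout=10)

# Conversational API: issue a call using the question.
requests.post("https://api.llm.example.org/chat",
              json={"prompt": build_question(statement)}, timeout=30)
```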
According to a second aspect, a computer program is provided. The computer program may be tangibly embodied in a computer readable medium. The computer program comprises instructions that, when the program is executed by a computer, cause the computer to carry out any one of the methods described above.
According to a third aspect, a computer system for detecting data leakage and/or detecting dangerous information is provided. The computer system may include a memory and at least one processor. The memory may include instructions, which, when executed, cause the at least one processor to receive a knowledge graph, and extract data from at least one network service. The instructions, when executed, may also cause the at least one processor to identify statements in the extracted data. For each identified statement, the instructions, when executed, may cause the at least one processor to determine whether the identified statement is public or private using the knowledge graph, and/or determine whether the identified statement is true or false using the knowledge graph.
The subject matter described in this disclosure can be implemented as a method or on a device, possibly in the form of one or more computer programs (e.g., computer program products). Such computer programs may cause a data processing apparatus to perform one or more operations described in the present disclosure.
The subject matter described in the present disclosure can be implemented in a data signal or on a machine readable medium, where the medium is embodied in one or more information carriers, such as a CD-ROM, a DVD-ROM, a semiconductor memory, or a hard disk. In particular, disclosed subject matter may be tangibly embodied in a non-transitory machine (computer) readable medium.
In addition, the subject matter described in the present disclosure can be implemented as a system including a processor, and a memory coupled to the processor. The memory may encode one or more programs to cause the processor to perform one or more of the methods described in the application. Further subject matter described in the present disclosure can be implemented using various machines.
Details of one or more implementations are set forth in the exemplary drawings and description that follow. Other features will be apparent from the description, the drawings, and from the claims.
In the following text, a detailed description of examples will be given with reference to the drawings. Various modifications to the examples may be made. In particular, one or more elements of one example may be combined and used in other examples to form new examples.
The knowledge graph 100 includes a statement 112 (i.e., a triple statement) having a subject “dbr:SAP_SE”, a predicate “dbo:foundationPlace” and an object “dbr:Germany”, each of which is a URI defined in RDF. An exemplary serialization of the statement 112 is “dbr:SAP_SE dbo:foundationPlace dbr:Germany”. A schema of the knowledge graph 100 may be defined via RDF Schema (RDFS) or the Web Ontology Language (OWL) from the World Wide Web Consortium (W3C).
An API crawler may determine, for each API, whether the API is of a standard type or a conversational type and may use this information to issue a corresponding request to the API. The conversational type may correspond to APIs for language models. The standard type may correspond to other APIs, e.g., for social network services, web services, and search engine services. For example, if the API is standard, the API crawler may select a statement from the knowledge graph 100 and use label information of a subject and a predicate of the statement to construct a query for the API. The statement may be randomly selected from the knowledge graph 100. The API (e.g., an API for a social network service) may be called using the query.
If the API is conversational, the API crawler may build a natural language processing question from a randomly selected statement of the knowledge graph 100. The natural language processing question may be constructed as described in the European patent application entitled, “Providing Data from a Directed Graph to a Language Model”, docket number S15972EU, application Ser. No. 23/177,538.8, filed with the European patent office on Jun. 6, 2023. The API crawler may issue a call to an API (e.g., an API for a language model) using the natural language processing question. A URI for the API may be determined based on configuration settings provided as discussed above with respect to
A response provided via the conversational API (e.g., from the language model) may be as follows:
From the response above, the following statement may be identified:
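(The response and statement listings are not reproduced here; hypothetical stand-ins consistent with the surrounding discussion might be the following.)

```python
# Hypothetical response text obtained via the conversational API, and the
# statement identified from it.
response_text = "The capital of Spain is Barcelona."
identified_statement = (":Barcelona", ":capitalOf", ":Spain")
```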
The above-identified statement may be compared with statements in the knowledge graph to determine that the above-identified statement is false.
Responses obtained by the API crawler may be forwarded to a statement identifier of the statement checking engine. The statement identifier may include a plurality of modules. Specifically, the statement identifier may include a sentence identifier module, a named entity recognition module, a relation extraction module, an entity linking module, and a predicate mapping module. The sentence identifier module may identify sentences in the extracted data, for example, by carrying out term frequency-inverse sentence frequency with stemming and lemmatization on the extracted data. The named entity recognition module may identify, using named entity recognition, entities in the extracted data. The named entity recognition may be a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories. The relation extraction module may identify implied relations between the identified entities. Relation extraction may include deriving structured text from the extracted data. The relation extraction module may produce a plurality of entity1-predicate-entity2 combinations. The predicate mapping module may map the entity1-predicate-entity2 combinations to properties of the knowledge graph. In algorithms 1 to 4 below, statements are referred to as facts.
The statement identifier may operate according to algorithm 1, as shown below:
The input text for algorithm 1 may be data extracted from at least one network service, e.g., via an API accessible via the network service. The knowledge graph instances may be different or distinct knowledge graphs.
The statement checking engine may also include a statement tagger. The statement tagger may depend on a machine learning model. The machine learning model may be trained after the configuration discussed in connection with
A goal of training embeddings, as exemplified in step 1 of algorithm 2, is that given a subject s and a predicate p, if s and p are added together they produce a vector close to (e.g., at a small cosine distance from) the embedding for a corresponding object o.
A goal of algorithm 2 is to produce a vector for each resource (i.e., each node and each edge) of the knowledge graph, with some exceptions. For example, if a node or edge only appears in a false fact, the node or edge will have no embedding. It may be that only facts with resources having embeddings will be used for algorithm 3. In other words, facts with resources not having embeddings will not be used for the machine learning (e.g., logistic regression) model.
s, p, o may refer to a subject, a predicate, and an object, respectively. se, pe, oe may refer to a subject embedding, a predicate embedding, and an object embedding, respectively. cs may refer to cosine similarity. Accordingly, step 1 of algorithm 2 may involve computing a vector for each uniform resource identifier of the knowledge graph 100, such that each of the uniform resource identifiers of the knowledge graph has a corresponding vector. Step 2 of algorithm 2 may include creating a copy of the knowledge graph 100 including the computed vectors. Step 2 may further include modifying the copy of the knowledge graph 100 to include a number of true statements that is equal to a number of false statements. Step 3 of algorithm 2 may include, for each statement in the copy of the knowledge graph, computing, using the computed vectors corresponding to the statements, a similarity value measuring the similarity of the subject and the predicate of a statement to the object of the statement in the copy of the knowledge graph. In other words, step 3 of algorithm 2 may include computing a similarity value measuring the similarity of ES + EP and EO, where ES is a subject embedding of subject S, EP is a predicate embedding for predicate P and EO is an object embedding for object O. In the example shown above, the similarity is computed using a cosine similarity measurement.
Step 4 of algorithm 2 may include training a statistical model using the computed similarity values from step 3 as well as information indicative of whether the similarity values correspond to true or false statements in the copy of the knowledge graph, i.e., whether the similarity values between the predicted and actual object embeddings correspond to true or false statements in the copy of the knowledge graph. In the example above, the statistical model is implemented as a logistic regression model. Step 4 may output a probability for the identified statement.
Accordingly, the vectors β0 and β1 were trained so that, in general, when there is a subject S and a predicate P, for which the respective embeddings are ES and EP, then, ES+EP should be “mathematically close” (e.g., have a cosine distance close to 0 or a cosine distance of less than 0.1 or less than 0.2) to EO, where EO is an embedding for O, and S, P, O is a True statement (i.e., fact) in the knowledge graph.
For example:
Continuing the example, the corresponding knowledge graph embeddings may include the following.
Further continuing the example, the knowledge graph resources associated with the embeddings Ex, Ey and Ez may be the following:
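(The numeric listings are not reproduced here; the following hypothetical values, not the originals, illustrate the relationship among Ex, Ey, Ez and the trained parameters.)

```python
import numpy as np

def cs(u, v):  # cosine similarity
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, e.g., for a subject, a predicate, and an object.
Ex = np.array([0.40, 0.10, 0.20])
Ey = np.array([0.10, 0.30, 0.15])
Ez = np.array([0.52, 0.41, 0.33])

similarity = cs(Ex + Ey, Ez)    # close to 1 for a true statement (~0.999)
beta0, beta1 = -4.0, 9.0        # hypothetical logistic regression parameters
probability = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * similarity)))
print(similarity, probability)  # probability well above a 0.5 threshold
```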
Further operations of the statement tagger may be implemented according to algorithm 3, e.g., as follows:
In the example of the statement tagger above, f is a fact from the knowledge graph G and f′ is another fact from the knowledge graph G. f and f′ have the same subject and predicate but different objects.
Accordingly, algorithm 2 may implement machine learning model training in preparation for statement tagging and algorithm 3 may implement statement tagging based on the machine learning model training carried out in algorithm 2. In algorithm 3, G may be a knowledge graph (e.g., the knowledge graph 100) and F may be an identified statement in data extracted from the network service using the API. β0 and β1 may be the Logistic Regression parameters output by algorithm 2.
Algorithm 3 includes two conditional statements that involve comparing an identified statement (i.e., fact F) with statements in the knowledge graph 100. If the identified statement matches a statement in the knowledge graph, it is determined whether the statement in the knowledge graph is private or false. If the statement in the knowledge graph is private (e.g., the knowledge graph includes metadata indicating that the respective statement is private), then algorithm 3 returns “private”. If the statement in the knowledge graph is false (e.g., the knowledge graph includes metadata indicating that the respective statement is false), then algorithm 3 returns “false”. According to algorithm 3, determining whether the identified statement is true or false includes identifying a maximum cardinality of a comparable statement in the knowledge graph (e.g., a statement with the same subject and predicate and a different object) and determining whether the identified statement exceeds the maximum cardinality. In the example of algorithm 3, the maximum cardinality is 1.
When the probability for the identified statement is greater than a specified threshold (e.g., 0.5 as shown above in algorithm 3), it may be determined that the identified statement is true. When the probability for the identified statement is less than or equal to the specified threshold, it may be determined that the identified statement is false.
Algorithm 3 may return false by default. This may be in the interest of improving security in the case of detecting leaks or improving safety in the case of detecting dangerous information.
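A sketch of the overall tagging flow of algorithm 3 under the assumptions above (knowledge graph lookups are simplified to set and dictionary operations; beta0 and beta1 are the parameters output by algorithm 2):

```python
import numpy as np

def tag_statement(F, G, metadata, embeddings, beta0, beta1, threshold=0.5):
    """Tag an identified statement F = (s, p, o) as private/true/false.
    G is a set of (s, p, o) triples; metadata maps a triple to tags such as
    "private" or "false"; embeddings maps each URI to its vector."""
    s, p, o = F
    if F in G:
        tags = metadata.get(F, set())
        if "private" in tags:
            return "private"
        if "false" in tags:
            return "false"
        return "true"  # statements without metadata are taken to be true
    # Comparable statement: same subject and predicate, different object.
    # With a maximum cardinality of 1, any further object makes F false.
    if any(f[:2] == (s, p) and f[2] != o for f in G):
        return "false"
    if s in embeddings and p in embeddings and o in embeddings:
        se, pe, oe = embeddings[s], embeddings[p], embeddings[o]
        x = float(np.dot(se + pe, oe) /
                  (np.linalg.norm(se + pe) * np.linalg.norm(oe)))
        probability = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
        if probability > threshold:
            return "true"
    return "false"  # default, in the interest of security and safety
```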
False or private statements detected via algorithm 3 may cause corresponding warnings to be generated. These warnings may be shown to the user in the monitor of
In the example above “user” corresponds to a login id and “pass” corresponds to a password.
The logical data model may include a plurality of connection configurations. There may be exactly one connection configuration for each API configuration, while each connection configuration corresponds to at least one API configuration. Each connection configuration may include an endpoint identifier, an endpoint, a login, a password, and a plurality of APIs. Each connection configuration may correspond to a network service and the endpoint may be a network service. The endpoint identifier may be a combination of an Internet protocol address and a port number.
An exemplary connection configuration is as follows:
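(The exemplary configuration is not reproduced here; a hypothetical connection configuration with the fields of the logical data model might look as follows.)

```python
# Hypothetical connection configuration; all values are placeholders.
connection_configuration = {
    "endpoint_id": "203.0.113.10:7200",  # IP address and port number
    "endpoint": "https://graphdb.example.org:7200/repositories/enterprise-kg",
    "login": "user",
    "password": "pass",
    "apis": ["social-network-api", "llm-chat-api"],
}
```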
Each connection configuration may correspond to at least one set of embeddings. Each set of embeddings may correspond to exactly one connection configuration. The combination of an endpoint identifier, a URI and a vector may be referred to as an embedding container, where the vector may be an embedding. The embeddings may be computed from identified statements and compared to embeddings computed from the knowledge graph in order to determine whether a respective identified statement is true or false and/or to determine whether the respective identified statement is public or private. The embeddings shown in the logical data model may be embeddings that have been computed from the knowledge graph. An example of an embedding container follows:
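(The example embedding container is likewise not reproduced here; a hypothetical one might be the following.)

```python
# Hypothetical embedding container: endpoint identifier, URI, and vector.
embedding_container = {
    "endpoint_id": "203.0.113.10:7200",
    "uri": "http://example.org/Madrid",
    "vector": [0.40, 0.10, 0.20],
}
```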
Each connection configuration may be associated with exactly one model. Each model may include an endpoint identifier, as well as the vectors β0 and β1, as discussed in connection with algorithm 2 above. There may be one machine learning model (e.g., logistic regression model) per knowledge graph endpoint.
A cloud computing environment (i.e., a cloud environment or cloud) may have one or more of the following characteristics: scalability, multi-tenancy, performance monitoring, virtual resources that are dynamically assignable to different users according to demand, multiple redundant sites, multiple virtual machines, and network accessibility (e.g., via the Internet) from multiple locations (e.g., via a web browser on devices such as a mobile device or a PC).
In comparison to an on-premises computing environment, the cloud computing environment may have a higher ratio of virtual resources to physical resources (e.g., a higher ratio of virtual machines to physical machines). For example, the ratio of virtual resources (e.g., machines) to physical resources may be at least 10 to 1, at least 20 to 1 or at least 30 to 1 in the cloud computing environment. In contrast, an on-premises computing environment may have less than four virtual resources (e.g., machines) per physical resource.
The cloud computing environment may include a public cloud 850 (i.e., public cloud computing infrastructure) and/or a private cloud 852 (i.e., private cloud computing infrastructure). The public cloud computing infrastructure 850 may involve sharing hardware, storage and/or network resources among multiple organizations or tenants. Services and/or applications may be accessed and managed using a web browser. The private cloud computing infrastructure 852 may include resources exclusively used by one organization or group of users. In comparison to the public cloud computing infrastructure 850, the private cloud infrastructure 852 may provide more flexibility and control; however, the private cloud infrastructure 852 may be more expensive. In both cases, the public cloud computing infrastructure 850 and the private cloud computing infrastructure 852 may be hosted by a service provider, e.g., Microsoft (Azure), Amazon (AWS) or SAP Business Technology Platform.
In the cloud configuration of
The personal computer 720 may further include a hard disk drive 732 for reading from and writing to a hard disk (not shown), and an external disk drive 734 for reading from or writing to a removable disk 736. The removable disk may be a magnetic disk for a magnetic disk drive or an optical disk such as a CD-ROM for an optical disk drive. The hard disk drive 732 and the external disk drive 734 are connected to the system bus 726 by a hard disk drive interface 738 and an external disk drive interface 740, respectively. The drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 720. The data structures may include relevant data for the implementation of the method for detecting data leakage and/or detecting dangerous information. The relevant data may be organized in a database, for example a relational or object database.
Although the exemplary environment described herein employs a hard disk (not shown) and an external disk 736, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories, read only memories, and the like, may also be used in the exemplary operating environment.
A number of program modules may be stored on the hard disk, external disk 736, ROM 730 or RAM 728, including an operating system (not shown), one or more application programs 744, other program modules (not shown), and program data 746. The application programs may include at least a part of the functionality as depicted in
A user may enter commands and information, as discussed below, into the personal computer 720 through input devices such as keyboard 748 and mouse 750. Other input devices (not shown) may include a microphone (or other sensors), joystick, game pad, scanner, or the like. These and other input devices may be connected to the processing unit 722 through a serial port interface 752 that is coupled to the system bus 726, or may be connected by other interfaces, such as a parallel port interface 754, game port or a universal serial bus (USB). Further, information may be printed using printer 756. The printer 756 and other parallel input/output devices may be connected to the processing unit 722 through parallel port interface 754. A monitor 758 or other type of display device is also connected to the system bus 726 via an interface, such as a video input/output 760. In addition to the monitor, computing environment 720 may include other peripheral output devices (not shown), such as speakers or other audible output.
The computing environment 720 may communicate with other electronic devices such as a computer, telephone (wired or wireless), personal digital assistant, television, or the like. To communicate, the computer environment 720 may operate in a networked environment using connections to one or more electronic devices.
When used in a LAN networking environment, the computing environment 720 may be connected to the LAN 764 through a network I/O 768. In a networked environment, program modules depicted relative to the computing environment 720, or portions thereof, may be stored in a remote memory storage device resident on or accessible to remote computer 762. Furthermore, other data relevant to the method for detecting data leakage and/or detecting dangerous information (described above) may be resident on or accessible via the remote computer 762. The network connections shown are exemplary and other means of establishing a communications link between the electronic devices may be used.
The above-described computing system is only one example of the type of computing system that may be used to implement the method for detecting data leakage and/or detecting dangerous information.