The invention relates to communications.
Malicious software (malware) refers to software used to disrupt or modify computer or network operations, collect sensitive information or gain access to a private computer or network system. Malware has a malicious intent, acting against the requirements of a user or network operator. Malware may be intended to steal information, gain free services, harm an operator's business or spy on the user for an extended period without the user's knowledge, or it may be designed to cause harm. The term malware may be used to refer to a variety of forms of hostile or intrusive software, including mobile computer viruses, worms, Trojan horses, ransomware, spyware, adware, scareware and/or other malicious programs. It may comprise executable code (or the ability to download such code), scripts, active content and/or other software. Malware may be disguised as, or embedded in, non-malicious files.
According to an aspect, there is provided the subject matter of the independent claims. Embodiments are defined in the dependent claims.
One or more examples of implementations are set forth in more detail in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
In the following, the invention will be described in greater detail by means of preferred embodiments with reference to the accompanying drawings, in which
The following embodiments are exemplary. Although the specification may refer to “an”, “one”, or “some” embodiment(s) in several locations, this does not necessarily mean that each such reference is to the same embodiment(s), or that the feature only applies to a single embodiment. Single features of different embodiments may also be combined to provide other embodiments. Furthermore, the words “comprising” and “including” should be understood as not limiting the described embodiments to consist of only those features that have been mentioned, and such embodiments may also contain features/structures that have not been specifically mentioned.
Mass surveillance of core network and roaming interfaces is seen as a tool to detect terrorist activities or to counteract attacks on critical communication infrastructure. In mass surveillance systems everybody is under suspicion to some degree. Thus the principle of innocent until proven guilty does not seem to apply to modern surveillance technology usage. On the other hand, criminals may easily benefit from communication networks that are not protected. Too much data collection means that the privacy of the user is compromised and network nodes may be hacked (or become a national security agency (NSA) target) because of the data stored. If too little data is collected, then data scanning for malware detection does not work, and the network is vulnerable. The larger the amount of data, the slower the data checking, and thus the less efficient potential countermeasures are (due to the delay). The consumer perception of a company/device/system collecting large amounts of data is very negative with regard to privacy.
Let us now describe an embodiment of the invention for data scanning with reference to
Referring to
Let us now describe some embodiments of block 202 with reference to
An embodiment enables selecting data fields to be processed, stored and released for further processing by a data scanning entity, whilst respecting privacy laws and avoiding abusive collection of personal data. If too much data is collected in some network nodes, those nodes may become potential targets of attackers. Thus, mechanisms are provided for partitioning the data fields with respect to the mode of operation.
In an embodiment, the actual data scanning (block 204) is carried out in the same network node as the optimizing (block 202) of the data scanning. In that case, the transmission of the output message (step 203) may not be needed.
An embodiment provides a mechanism where the processing and collecting of data may be temporarily increased to support greater fidelity of data scanning, such as malware detection, spam detection, terrorist identification, network statistics detection and/or other detection, in a justifiable and privacy law compliant manner.
An embodiment provides a method for reducing the number of fields and applying privacy tools to a set of collected data (obtained e.g. from a data scanning entity or a radio measurement system). A classification mechanism for data usage, privacy sensitivity and risk is included. Thus user privacy is preserved, while still enabling user protection against criminals or unauthorized intruders.
The relevant part of the data is extracted from the large set of network data, such that data scanning is still possible. Malware detection may rely on the signature of the malware (its fingerprint), which is applied to the extracted data set.
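By way of illustration, the following minimal Python sketch shows how a malware signature, expressed here as simple patterns over a reduced set of fields, might be applied to an extracted data set. The field names, patterns and records are hypothetical and do not come from any real signature database:

```python
import re

# Hypothetical malware signature: patterns over a minimal set of fields.
SIGNATURE = {
    "destination_ip": re.compile(r"^203\.0\.113\."),  # example command-and-control subnet
    "protocol": re.compile(r"^(irc|smtp)$"),
}

def extract_fields(record, fields):
    """Keep only the fields the signature actually needs."""
    return {f: record[f] for f in fields if f in record}

def matches_signature(record, signature):
    """A record matches if every field pattern of the signature matches."""
    return all(f in record and p.match(str(record[f]))
               for f, p in signature.items())

traffic = [
    {"destination_ip": "203.0.113.7", "protocol": "irc", "imsi": "262019876543210"},
    {"destination_ip": "198.51.100.1", "protocol": "http", "imsi": "262011234567890"},
]

for rec in traffic:
    reduced = extract_fields(rec, SIGNATURE.keys())  # the IMSI is never released
    print(reduced, "->", "MATCH" if matches_signature(reduced, SIGNATURE) else "clean")
```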
An embodiment comprises a classification step for identifying privacy relevance (labelling). The fields of a data set are classified according to usage (an input from product and service usage). The fields of a data set are classified according to information type (what data is included). The fields of the data set are classified according to the overall identifiability of that particular data set (privacy law).
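A minimal sketch of such labelling, with hypothetical field names and label values, might look as follows:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldLabel:
    name: str         # data field name
    usage: str        # what the field is used for (input from product/service usage)
    info_type: str    # what kind of data the field contains
    sensitivity: str  # privacy sensitivity, e.g. "low" or "high" (jurisdiction-dependent)

# Hypothetical labelling of a few network data fields:
LABELS = [
    FieldLabel("imsi", "subscriber identification", "identifier", "high"),
    FieldLabel("destination_ip", "malware detection", "address", "low"),
    FieldLabel("protocol", "malware detection", "enum", "low"),
]

# A crude proxy for the overall identifiability of the data set:
identifiability = sum(1 for label in LABELS if label.info_type == "identifier")
print(identifiability)  # 1
```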
An embodiment comprises a procedure for defining a privacy relevance output. The sensitivity of the fields is calculated according to a metric calculated over selected properties. A partial order of the data fields is formed according to the sensitivity, and a partial order of data subsets is formed according to the sensitivity. The fields of the data set are also classified according to usage alone, and a partial order of the data fields is formed according to usage. The cross product (combination) of the two partial orders (i.e. the partial order of the data fields and the partial order of the data subsets) is mapped according to the risk, the data fields are partitioned into various data scanning categories, and the operation of the data scanning entity is rated into the various data scanning categories.
An embodiment comprises acting according to the privacy relevance procedure output. A minimum set of fields is selected from each of the data scanning categories corresponding to the operation of the data scanning entity. The default mode of the data scanning entity for data collection is set to be the minimum set of fields that satisfies the lowest risk level corresponding to the required usage of data for, ostensibly, data scanning/malware detection purposes.
Sorting (i.e. classification) and labelling of the data fields is carried out based on reducing the information content in terms of sensitivity and identifiability (i.e. privacy-wise the data becomes less sensitive) of the data set over the required usages of that information as defined by the malware signature. This also applies to other types of user data collection, e.g. collecting of radio measurements, SON (self-organizing networks) data or MDT (minimization of drive tests) data. Thus performing data scanning on sensitive and/or private data may at least partly be prevented.
Classifying according to the usage may be based on code investigation. This may comprise attaching, during programming, to each piece of data, information on what the piece of data is actually used for. Based on the code, it may thus be seen which data is used where (for what purpose) and what data is required to get a service running. This may require input and knowledge about the services that are going to be performed.
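A minimal sketch, assuming a Python code base, of attaching usage information at programming time; the decorator, registry and service names are hypothetical:

```python
USAGE_REGISTRY = {}  # data field -> set of services that use it

def uses_fields(service, *fields):
    """Decorator recording, at programming time, which fields a service uses."""
    def wrap(func):
        for f in fields:
            USAGE_REGISTRY.setdefault(f, set()).add(service)
        return func
    return wrap

@uses_fields("malware_detection", "destination_ip", "protocol")
def scan(record): ...

@uses_fields("billing", "msisdn", "imsi")
def charge(record): ...

# The registry now shows which data is used where and what each service requires:
print(USAGE_REGISTRY)
```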
Classifying according to the information type may be based on investigating the field types and their variables, for example, what kind of data they contain, whether they are names or IP (internet protocol) addresses, and what certain strings represent. Herein, each data field is assigned an information type.
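A sketch of assigning an information type by inspecting field values; the rules below are illustrative assumptions only:

```python
import re

# Hypothetical rules mapping value patterns to information types.
TYPE_RULES = [
    ("ipv4_address", re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")),
    ("imsi",         re.compile(r"^\d{14,15}$")),
    ("name",         re.compile(r"^[A-Za-z][A-Za-z .'-]+$")),
]

def infer_info_type(value):
    for info_type, pattern in TYPE_RULES:
        if pattern.match(value):
            return info_type
    return "unknown"

print(infer_info_type("192.0.2.15"))       # ipv4_address
print(infer_info_type("262019876543210"))  # imsi
```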
Classifying according to the sensitivity may be based on local legislation and/or on evaluating which data actually is sensitive and which is not. For example, in the USA, phone location information is not privacy sensitive, while in the European Union (EU) it is. Herein, a sensitivity level is assigned to each data field. The data that is labelled sensitive may be referred to as “S-data” (sensitive=high, non-sensitive=low; see
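A sketch of a jurisdiction-dependent sensitivity assignment, following the location example above; the table values are assumptions:

```python
# (information type, jurisdiction) -> sensitivity level
SENSITIVITY = {
    ("location", "EU"):  "high",  # privacy sensitive in the EU
    ("location", "USA"): "low",   # not privacy sensitive in the USA
    ("imsi", "EU"):      "high",
    ("imsi", "USA"):     "high",
}

def sensitivity(info_type, jurisdiction):
    # Default to the safe side when a combination is unknown.
    return SENSITIVITY.get((info_type, jurisdiction), "high")

print(sensitivity("location", "EU"))   # high -> "S-data"
print(sensitivity("location", "USA"))  # low
```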
Once the extracted data is created, the information contained therein is classified according to its information type and usage, independently of the machine type. When the data classified and labelled according to the usage, the information type and the sensitivity is combined, an exemplary output may be classified and labelled as illustrated below in Table 1.
A privacy relevance procedure comprises deciding, based on the obtained usage, sensitivity and information type of each element, how to minimize the amount of data so that it is still possible to run the service (e.g. malware detection) over it successfully.
The sensitivity is calculated from a combination of the usage and information type along with the combined identifiability of the data calculated from the entire data set.
Regarding partial order creation for the sensitivity, the combinations of data such as {destination IP address, protocol, IMSI} may form one set. {Destination IP address, protocol} may form another set. The set {destination IP address, protocol, IMSI} is more sensitive (according to the calculated sensitivity value), and the set {destination IP address, protocol} is less sensitive. Thus these groups of data may be sorted into an order by their sensitivity.
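A minimal sketch of this ordering, assuming hypothetical per-field sensitivity scores; a superset of fields is at least as sensitive as any of its subsets:

```python
FIELD_SENSITIVITY = {"destination_ip": 1, "protocol": 0, "imsi": 3}  # assumed scores

def set_sensitivity(fields):
    """Sensitivity value of a group of fields (here: a simple sum)."""
    return sum(FIELD_SENSITIVITY.get(f, 0) for f in fields)

def less_or_equal(a, b):
    """Partial order: a <= b if a is a subset of b (hence no more sensitive)."""
    return set(a) <= set(b)

small = {"destination_ip", "protocol"}
large = {"destination_ip", "protocol", "imsi"}
print(set_sensitivity(small), set_sensitivity(large))  # 1 4
print(less_or_equal(small, large))  # True: {dst IP, protocol} is the less sensitive set
```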
Alternatively or in addition to creating the partial order over the sensitivity, other fields, annotations and calculated values may also be incorporated into the ordering metric. {Destination IP address, protocol, IMSI}>{destination IP address, protocol} may form the partial order (or lattice) of each field, for example:
Regarding partial order creation for the usage, the partial order for the usage is calculated. For a basic service, only a few data fields are required e.g. {MSISDN, TMSI}, but for a high value service more data may be required e.g. {MSISDN, TMSI, PIN}. {MSISDN, TMSI, PIN}>{MSISDN, TMSI} gives a partial order for the usage (i.e. similar to that of the partial order for the sensitivity).
The two partial orders (usage, sensitivity) do not yet indicate which data fields really are at high risk and need to be protected thoroughly, and which data fields are less important. The data set on top (the first data set) is more sensitive than the other sets. A mapping to a data scanning category is made over these by combining the partial orders and the risk, wherein an exemplary intersection of the field lattice, the usages and the data scanning categories is illustrated in
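The combination might be sketched as follows, with the two orders represented as ranked lists and a hypothetical risk function over their cross product:

```python
from itertools import product

# Field sets ordered by sensitivity and by usage (low rank -> less sensitive/basic service).
SENSITIVITY_ORDER = [frozenset({"destination_ip", "protocol"}),
                     frozenset({"destination_ip", "protocol", "imsi"})]
USAGE_ORDER = [frozenset({"msisdn", "tmsi"}),          # basic service
               frozenset({"msisdn", "tmsi", "pin"})]   # high value service

def risk(s_rank, u_rank):
    """Hypothetical risk function over the cross product of the two orders."""
    return s_rank + u_rank  # yields 0..2 here

CATEGORIES = ["low", "medium", "high"]  # data scanning categories

for (s_rank, s_set), (u_rank, u_set) in product(enumerate(SENSITIVITY_ORDER),
                                                enumerate(USAGE_ORDER)):
    print(sorted(s_set | u_set), "->", CATEGORIES[risk(s_rank, u_rank)], "category")
```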
Thus it is possible to determine the required data combinations, whether privacy requirements are satisfied and what information is involved. To take action, the determined data is collected and sent to the data scanning entity. Taking the set of fields in the intersection of the required usages and, for example, a medium data scanning category provides the set of data fields with maximum privacy with respect to some risk criteria (these usages may then be mapped to a particular mode of operation of the data scanning entity). As the level of risk to be tolerated for the situation at hand increases, the set of fields is taken from a higher data scanning category.
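Selecting the default field set for a given mode might then be sketched as follows; the candidate sets and their categories are hypothetical:

```python
CANDIDATES = [  # (field set, data scanning category)
    (frozenset({"destination_ip", "protocol"}), "low"),
    (frozenset({"destination_ip", "protocol", "imsi"}), "medium"),
    (frozenset({"destination_ip", "protocol", "imsi", "msisdn"}), "high"),
]
RISK_RANK = {"low": 0, "medium": 1, "high": 2}

def default_field_set(tolerated, required_fields):
    """Smallest set covering the required usage within the tolerated risk level."""
    ok = [fs for fs, cat in CANDIDATES
          if RISK_RANK[cat] <= RISK_RANK[tolerated] and required_fields <= fs]
    return min(ok, key=len) if ok else None

print(sorted(default_field_set("low", {"destination_ip"})))
# ['destination_ip', 'protocol'] -> minimum set at the lowest risk level
```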
Alternatively or in addition, reduction of data or addition of noise (differential privacy, l-diversity, t-closeness, k-anonymity) may be used as mechanisms for controlling the sensitivity and risk characteristics of the data fields.
Regarding the fidelity of data for the data scanning, typical data scanning assumes access to a wide range of fields and content. This is in contradiction with various privacy laws, and runs a number of risks such as accusations of surveillance and the potential for over-collection of data. Data scanning is also a rather imprecise process, with a number of false positives and false negatives even in the above situation. Reducing the fidelity of the data by removing fields, hashing certain content, and introducing noise and diversity still allows the data to be used statistically, but individual records are no longer attributable to unique persons. This reduced-fidelity data is more privacy compliant and may thus be sufficient to satisfy privacy laws. An increase in fidelity, and the resulting risk to the network and the consumer, may then be better justified under these circumstances.
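A sketch of such fidelity reduction; the noise below is crude and not calibrated differential privacy, and all field names are assumptions:

```python
import hashlib
import random

def reduce_fidelity(record, drop=(), hash_fields=(), noisy_fields=(), scale=1.0):
    """Remove fields, hash identifying content and add noise to numeric fields."""
    out = {}
    for field, value in record.items():
        if field in drop:
            continue  # field removed entirely
        if field in hash_fields:
            out[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        elif field in noisy_fields:
            out[field] = value + random.gauss(0, scale)  # crude noise, not calibrated DP
        else:
            out[field] = value
    return out

record = {"imsi": "262019876543210", "destination_ip": "203.0.113.7", "bytes": 4096}
print(reduce_fidelity(record, drop=("imsi",), hash_fields=("destination_ip",),
                      noisy_fields=("bytes",), scale=10.0))
```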
If the malware detector detects potential malware, the classification and filtering may be changed to a less restrictive operation mode, such that more data is made available, with greater privacy risk but greater fidelity.
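A trivial sketch of such a mode change; the mode names are hypothetical:

```python
MODES = ["default", "elevated", "full"]  # increasing fidelity and privacy risk

def next_mode(current, potential_malware_detected):
    """Move to a less restrictive mode when potential malware is detected."""
    if potential_malware_detected and current != MODES[-1]:
        return MODES[MODES.index(current) + 1]
    return current

mode = next_mode("default", potential_malware_detected=True)
print(mode)  # elevated -> more fields released, greater fidelity, greater risk
```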
Another possible mode of operation is one where the data scanning entity operates normally but unfiltered traffic is presented to an access-restricted node, e.g. encrypted storage, such that it is not possible to read or tamper with the highly sensitive data. Thus at least part of the privacy sensitive data may be directed into encrypted storage to prevent data scanning from being performed on said privacy sensitive data, and, if required, the privacy sensitive data may be retrieved from the encrypted storage in order to allow data scanning to be performed on it.
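A sketch of diverting unfiltered data into encrypted storage, assuming the third-party Python "cryptography" package; key handling is simplified here to a single in-memory key:

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # in practice held only by the access-restricted node
box = Fernet(key)

sensitive = b'{"imsi": "262019876543210", "payload": "..."}'
stored = box.encrypt(sensitive)  # data scanning cannot read or tamper with this

# Later, if justified (e.g. confirmed malware), the data may be retrieved:
print(box.decrypt(stored))
```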
The classification and filtering may be carried out at any part of the network. For example, the classification and filtering may comprise centralised processing of the data, edge processing for initial classification and marking of the data with the malware detector being placed in-line at a different point, e.g. at a Gn interface, and/or edge processing as before and tagging of network packets such that these may be identified by utilizing SDN (software-defined networking) flow-table pattern matching.
An embodiment provides two ontologies for the classification of data: information type and usage. Other ontologies may also be applied, either in the sensitivity and identifiability calculations or in the risk calculation, or as additional partial field order calculations over the system as a whole. Such ontologies include but are not limited to: provenance, purpose (primary data vs. secondary data), identity characteristics, jurisdiction (including source, routing properties, etc.), controller classification, processor classification, data subject classification, personally identifiable information (PII) classification (including sensitive PII classification, e.g. HIPAA health classifications), personal data classification (including sensitive personal data classification), traffic data, and/or management data.
Further ontologies may be included into the calculations by constructing a final metric as a combination of the ontologies; for example, when calculating the sensitivity, the metric may be a function f(usage × information type), which may be generalised into a function f(ontology1 × ontology2 × ontology3 × … × ontologyN). Further ontologies may also be included into the calculations by constructing the cross product of two or more of the calculations. For example, the cross product of the partial orders of the usage against the sensitivity, Ls × Lu, may be generalised into L1 × L2 × … × Ln.
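A sketch of both generalisations; the scores, the combination function f and the rank tuples are all assumptions:

```python
# N ontologies, each assigning a score to its label; f combines them.
ontologies = {
    "usage":        {"billing": 2, "malware_detection": 1},
    "info_type":    {"identifier": 3, "address": 1},
    "jurisdiction": {"EU": 2, "USA": 1},
}

def metric(scores):
    """Hypothetical combination function f(ontology1 x ... x ontologyN)."""
    return sum(scores)

labels = ("billing", "identifier", "EU")
print(metric(o[l] for o, l in zip(ontologies.values(), labels)))  # 7

# Cross product L1 x L2 x ... x Ln of partial orders, compared componentwise:
def leq(p, q):
    return all(a <= b for a, b in zip(p, q))

print(leq((0, 0, 1), (1, 0, 2)))  # True: comparable in the product order
print(leq((1, 0, 0), (0, 1, 0)))  # False: incomparable elements
```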
An embodiment enables a technical implementation and handling of network communication traffic such that the network provider is able to protect user data in the core network (e.g. P-CSCF, S-CSCF, HSS) against malicious activities in the communication networks without mass surveillance and without loss of the users' right to privacy.
An embodiment provides a mechanism that brings privacy compliance and the consumer perception of the data collection more in line with what is expected, meaning justified collection, processing and usage of data, and compliance with local privacy legislation.
An embodiment enables data scanning by processing the data sets with respect to their content, usage and data scanning categorisation.
Let us now describe an embodiment for optimizing data scanning with reference to
An embodiment provides an apparatus comprising at least one processor and at least one memory including a computer program code, wherein the at least one memory and the computer program code are configured, with the at least one processor, to cause the apparatus to carry out the procedures of the above-described network element or the network node. The at least one processor, the at least one memory, and the computer program code may thus be considered as an embodiment of means for executing the above-described procedures of the network element or the network node.
The processing circuitry 10 may comprise the circuitries 12 to 19 as subcircuitries, or they may be considered as computer program modules executed by the same physical processing circuitry. The memory 20 may store one or more computer program products 24 comprising program instructions that specify the operation of the circuitries 12 to 19. The memory 20 may further store a database 26 comprising definitions for traffic flow monitoring, for example. The apparatus may further comprise a radio interface (not shown in
As used in this application, the term ‘circuitry’ refers to all of the following: (a) hardware-only circuit implementations such as implementations in only analog and/or digital circuitry; (b) combinations of circuits and software and/or firmware, such as (as applicable): (i) a combination of processor(s) or processor cores; or (ii) portions of processor(s)/software including digital signal processor(s), software, and at least one memory that work together to cause an apparatus to perform specific functions; and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of ‘circuitry’ applies to all uses of this term in this application. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor, e.g. one core of a multi-core processor, and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular element, a baseband integrated circuit, an application-specific integrated circuit (ASIC), and/or a field-programmable gate array (FPGA) circuit for the apparatus according to an embodiment of the invention.
The processes or methods described above in connection with
The present invention is applicable not only to the cellular or mobile communication systems defined above but also to other suitable communication systems. The protocols used, the specifications of cellular communication systems, their network elements, and terminal devices develop rapidly. Such development may require additional changes to the described embodiments. Therefore, all words and expressions should be interpreted broadly; they are intended to illustrate, not to restrict, the embodiments.
It will be obvious to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The invention and its embodiments are not limited to the examples described above but may vary within the scope of the claims.