A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates to systems and methods for maintaining privacy, including a differentially private solution for traffic monitoring.
In recent years, privacy research has been gaining ground in vehicular communication technologies. Collecting data from connected vehicles presents a range of opportunities for government authorities and other entities to perform data analytics. Although many researchers have explored some privacy solutions for vehicular communications, the conditions to deploy the technology are still maturing, especially when it comes to privacy for sensitive data aggregation analysis.
This description and the accompanying drawings that illustrate aspects, embodiments, implementations, or applications should not be taken as limiting—the claims define the protected invention. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail as these are known to one skilled in the art. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent to one skilled in the art, however, that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
In recent times, there has been a surge in digital technologies embedded in physical objects, leading to what is today known as the Internet of Things (IoT). This trend has also reached the automotive industry, which has shown a growing interest in exploring interaction models such as Vehicle-to-Vehicle (V2V), Vehicle-to-Infrastructure (V2I), and Vehicle-to-Pedestrian (V2P), collectively referred to as Vehicle-to-Everything (V2X) communications.
The V2X communications technology is a cornerstone for the development of Intelligent Transportation Systems (ITS). Mobility is a major concern in any city, and deploying ITS can make cities more efficient. ITS are an indispensable component of smart cities, achieving traffic efficiency while minimizing traffic problems. The adoption of ITS is widely accepted, and ITS are used in many countries today. Because of its endless possibilities, ITS has become a multidisciplinary field of work, and many organizations around the world have developed solutions to provide ITS applications to meet demand.
Indeed, the U.S. Department of Transportation has initiated a “connected vehicles” program “to test and evaluate technology that will enable cars, buses, trucks, trains, roads and other infrastructure, and our smartphones or other devices to ‘talk’ to one another. Cars on the highway, for example, would use short-range radio signals to communicate with each other so every vehicle on the road would be aware of where other nearby vehicles are. Drivers would receive notifications and alerts of dangerous situations, such as someone about to run a red light as they [are] nearing an intersection or an oncoming car, out of sight beyond a curve, swerving into their lane to avoid an object on the road.” U.S. Department of Transportation at https://www.its.dot.gov/cv_basics/cv_basics_what.htm. “Connected vehicles could dramatically reduce the number of fatalities and serious injuries caused by accidents on our roads and highways. [They] also promise to increase transportation options and reduce travel times. Traffic managers will be able to control the flow of traffic more easily with the advanced communications data available and prevent or lessen developing congestion. This could have a significant impact on the environment by helping to cut fuel consumption and reduce emissions.” In some embodiments, the V2X environment for an ITS can comprise or be implemented with a Security Credential Management System (SCMS) infrastructure. The SCMS was developed in cooperation with the U.S. Department of Transportation and the automotive industry.
Each vehicle 110V may, for example, broadcast its location, speed, acceleration, route, direction, weather information, etc. Such broadcasts can be used to obtain advance information on traffic jams, accidents, slippery road conditions, and allow each vehicle to know where the other vehicles are, and so on. In response, vehicle recipients of such information may alert their drivers, to advise the drivers to stop, slow down, change routes, take a detour, and so on. The traffic lights can be automatically adjusted based on the traffic conditions broadcast by the vehicles and/or other objects 110.
With the emergence of the V2X communication and ITS technology, there is an inherent increase in vehicle safety, thus saving lives and fostering a safer driving experience. This technology allows vehicles to communicate with multiple devices on-the-go and when stationary, thereby introducing an entirely new set of communication infrastructure, applications, services, etc. Furthermore, it is perceived as one of the building blocks that can propel the quicker adoption of autonomous vehicles and smart cities.
Applications in ITS are broad, encompassing areas such as safety, cooperative driving, and traffic optimization, among others. Although their use is not limited to traffic congestion control and information, the introduction of information and communication technologies, especially in vehicles, is generally considered a means to achieve efficient, safe, and sustainable mobility. Specifically, collecting data from connected vehicles presents opportunities for aggregated data analysis: investigating driver behavior for vehicle manufacturers and insurers, monitoring traffic conditions for governmental agencies involved in tolling or traffic management, and developing new services as needed.
While connected vehicles, V2X, and ITS technology offer the promise of increased safety, traffic flow, efficiency, etc., the large scale deployment of such technologies also requires addressing some challenges, especially security and privacy concerns. For example, in a V2X and ITS environment, information and data for connected vehicles will necessarily be generated and collected, leading to concerns about how such collected information can be used while preserving the privacy of individual vehicles and their drivers.
Operation of computing device 150 is controlled by processor 150P, which may be implemented as one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), graphics processing units (GPUs), tensor processing units (TPUs), and/or the like in computing device 150.
Memory 150S may be used to store software executed by computing device 150 and/or one or more data structures used during the operation of computing device 150. Memory 150S may include one or more types of machine-readable media. Some common forms of machine-readable media may include a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, EEPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 150P and/or memory 150S may be arranged in any suitable physical arrangement. In some embodiments, processor 150P and/or memory 150S may be implemented on the same board, in the same package (e.g., system-in-package), on the same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 150P and/or memory 150S may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 150P and/or memory 150S may be located in one or more data centers and/or cloud computing facilities. In some examples, memory 150S may include non-transitory, tangible, machine-readable media that include executable code that when run by one or more processors (e.g., processor 150P) may cause the computing device 150, alone or in conjunction with other computing devices in the environment, to perform any of the methods described further herein.
The computing device or equipment 150 may include a user interface 150i, e.g., such as is present in a smartphone, an automotive information device, or some other type of device, for use by pedestrians, vehicle drivers, passengers, traffic managers, and possibly other people.
Wireless communication equipment 150W of computing device 150 may comprise or be implemented with one or more radios, chips, antennas, etc. for allowing the device 150 to send and receive signals for conveying information or data to and from other devices. Under the control of processor 150P, wireless communication equipment 150W may provide or support communication over Bluetooth, Wi-Fi (e.g., IEEE 802.11p), and/or cellular networks with 3G, 4G, or 5G support.
The vehicle 110V includes on-board equipment (OBE) or on-board unit (OBU) 304 with one or more sensors—such as accelerometers, brake monitors, object detectors, LIDAR, etc.—for sensing conditions within and around vehicles 110V, such as sudden braking, wheel spin, potential collisions, etc. Using these sensors, the vehicle 110V may, for example, detect the icy road patch at scene 308. The sensors supply information to the OBE's computing device or equipment 150 (
Different pieces of equipment on the vehicle 110V communicate by exchanging Basic Safety Messages (BSM) and/or other messages with each other and other vehicles. The BSM messages are described in detail in Whyte et al., "A security credential management system for V2V communications," IEEE Vehicular Networking Conference, 2013, pp. 1-8, and CAMP, "Security credential management system proof-of-concept implementation—EE requirements and specifications supporting SCMS software release 1.1," Vehicle Safety Communications Consortium, Tech. Rep., May 2016 (available: https://www.its.dot.gov/pilots/pdf/SCMS_POC_EE_Requirements.pdf), both of which are incorporated by reference.
A vehicle or other object 110 can obtain its location, for example, by using GPS satellites 1170 or cellular triangulation. The vehicle 110V may also include communication equipment 150W, which, in some embodiments, can include a Direct Short Range Communications (DSRC) radio and non-DSRC radio equipment such as a mobile phone. The vehicle may thus communicate through a cellular system or other roadside equipment (RSE) 110RSE directly, i.e., without intermediate network switches. RSE may alternately be referred to as a roadside unit (RSU). In some embodiments, the RSE can be implemented with or in a base station (BS) proximate a road. An RSE may include some of the same or similar equipment as vehicle 110V, including computing devices 150, sensors, user interfaces, communication equipment, etc. The RSE may act as a gateway to other networks, e.g., the Internet. Using the communication equipment 150W, vehicle 110 can communicate BSM messages and other information to other vehicles, entities, or objects 110 in the V2X or connected vehicle environment. Thus, vehicle 110V/150 may inform the other parts of the environment or ITS of the icy patch at scene 308. Likewise, another vehicle 110 may be located in scene 1020 and may alert other vehicles of winter maintenance operations at that scene.
A traffic management system 110L may comprise equipment—e.g., stoplights, crosswalk lights, etc. located in or near roads, highways, crosswalks, etc.—to manage or control traffic of vehicles, persons, or other objects and entities. Traffic management system 110L may include some of the same or similar equipment as vehicle 110V, including computing devices 150, sensors, user interfaces, communication equipment, etc.
Computer systems 316 process, aggregate, generate or otherwise operate on information sent to or received from vehicles 110V, traffic management systems 110L, and other objects or entities 110 in the V2X or connected vehicle technology environment, along with their respective computing devices 150. Also shown is a traveler information system 318. Computer systems 316 can be implemented in or incorporate, for example, one or more servers. These computer systems 316, for example, provide or support location and map information, driving instructions, traffic alerts and warnings, information about roadside services (e.g., gas stations, restaurants, hotels, etc.). The computer systems 316 may receive information from the various vehicles, entities, and objects 110 in the environment, process and communicate information or instructions throughout the environment to manage the objects, e.g., by adjusting signaling on traffic lights, rerouting traffic, posting alerts or warnings, etc.
In some embodiments, one or more of the various objects, entities, equipment, computers, and infrastructure shown in
In some embodiments, the underlying technology used in VANETs can include or encompass Dedicated Short Range Communication (DSRC)/Wireless Access in Vehicular Environments (WAVE), with radio communication provided by IEEE 802.11p, and cellular networks with 3G or 4G support. A vehicle 110V periodically sends beacons to its neighbors, which contain data such as identification, timestamp, position, speed, direction, and acceleration, among approximately 7,700 other signals collected by the vehicle's sensors. A beacon contains sensitive information, which may be used in many applications of interest to industry, companies, or government. One widely used application is traffic management.
Analyzing the voluminous data and information generated and collected in an ITS can bring enormous social benefits, but it also brings concerns about data breaches and leakage. The main challenge for entities performing statistical analyses on sensitive data is to release aggregated information about a population while protecting the privacy of its individuals. Disclosure of this data poses a serious threat to the privacy of individual contributors, which creates a liability for industry and governments.
Differential privacy has become increasingly accepted as a privacy technique of choice. Differential privacy technology can help to discover the usage patterns of a large number of users without compromising individual privacy. To obscure an individual's identity, differential privacy adds mathematical noise to a small sample of the individual's usage pattern. In the context of analyses over a database (statistics or machine learning), differential privacy is a strong mathematical definition of privacy. Its definition allows a useful analysis to be performed on a data set while protecting the privacy of the contributors to that data set.
According to some embodiments, systems and methods are provided for an instance-based data aggregation solution for traffic management that satisfies the differential privacy definition. In some examples, a simple or basic approach to compute the average speed is evaluated and then an enhanced solution with an instance-based technique is provided to mitigate the negative impact on accuracy. In some embodiments, the systems and methods of the present disclosure use a sample-and-aggregate framework to construct a new instance that has low sensitivity for the median function. This disclosure provides a detailed evaluation of privacy-preserving techniques based on differential privacy applied to traffic monitoring. The systems and methods of the present disclosure have been validated through simulations in typical traffic congestion scenarios. The results show that for typical instances (e.g., under-dispersed), the systems and methods provide a significant reduction in the number of outliers, considering a deviation tolerance from the original reported average speed.
Differential privacy emerged from the problem of performing statistical studies on a population while attempting to maintain the privacy of its individuals. The definition of differential privacy models the risk of disclosing data from any individual belonging to a database by performing statistical analyses on it. The definition says that, using a randomized algorithm on two databases differing by only one element, the probabilities of producing the same result are bounded by a constant factor. For example, imagine that there are two otherwise identical databases, but one has your information in it, and the other does not. Differential privacy ensures that the probability that a statistical query will produce a given result is (nearly) the same whether the query is conducted on the first or the second database. In other words, a differentially private algorithm will behave similarly on similar input databases.
Definition 1. Differential privacy. A randomized algorithm A taking inputs from the domain Dn gives (ϵ, δ)-differential privacy if, for all data sets D1, D2∈Dn differing on at most one element, and all U⊆Range(A), denoting the set of all possible outputs of A,

Pr[A(D1)∈U] ≤ e^ϵ·Pr[A(D2)∈U] + δ, (1)

where the probability space is over the coin flips of the mechanism A and, when the guarantee is expressed as a ratio of probabilities, 0/0 is defined as 1.
Two fundamental parameters control the level of privacy in a differentially private algorithm. The privacy loss parameter, denoted by ϵ, is the main parameter. This parameter ϵ can be thought of as the magnitude of the constant factor that determines the indistinguishability between two databases differing in one element. In other words, parameter ϵ is a relative measure of privacy breach risk. It quantifies the contribution of each individual to the output of the analysis and controls the trade-off between privacy and utility. The second parameter is the relaxation parameter, denoted by δ. This parameter allows negligible leakage of information from individuals in an analysis performed on a database. In other words, an (ϵ, δ)-differential privacy algorithm requires that an (ϵ, 0)-differential privacy algorithm be satisfied with a probability of at least 1−δ; that is, the (ϵ, 0)-differential privacy guarantee can be violated for some tuples, and the probability of that occurring is bounded by δ.
The protection of individuals' privacy in a database is achieved by masking the contribution (presence or absence) of any single individual in the analysis, making it infeasible to infer any information specific to an individual. In this way, it is sufficient to mask an upper bound on the attribute of interest in the related database. This upper bound is known as the global sensitivity. In other words, global sensitivity is related to an analysis function; it is the maximum difference between the analyses performed over two databases differing in only one element:
Definition 2. Global sensitivity. For ƒ: Dn→Rd, the global sensitivity Δƒ of ƒ is

Δƒ = max over D1, D2∈Dn with d(D1, D2)=1 of ‖ƒ(D1)−ƒ(D2)‖1, (2)

where Dn is the domain of all databases of size n and d(D1, D2)=1 means that the databases D1 and D2 differ by at most one element.
A differentially private analysis protects the privacy of an individual by adding carefully-tuned random noise when producing statistics. One of the main models of computation, in which a differentially private algorithm works, is the centralized model. In the centralized model, also known as output perturbation, there is a trusted party that has access to individuals' data without perturbation and uses it to release noisy aggregate analyses.
In order to add carefully-tuned random noise to the computation, two of the main primitives satisfying differential privacy are the Laplace and exponential mechanisms. The Laplace mechanism is the first and probably most widely used mechanism. This mechanism is based on sampling continuous random variables from a Laplace distribution. This distribution presents the following probability density function:

h(x, μ, b) = (1/(2b))·e^(−|x−μ|/b), (3)
where b>0 is the scale parameter and μ is the location parameter. In order to get an independent and identically distributed random variable from a Laplace distribution, its probability density function must be calibrated by centering the location parameter at zero and setting the scale parameter as the ratio between the global sensitivity (Δƒ) and the privacy loss parameter (ϵ). In the centralized model of computation, the Laplace mechanism works by computing the value of the aggregate function ƒ over a database D, sampling a random variable Y from Laplace distribution and adding it to the computation. That is, M(D)=ƒ(D)+Y, where Y˜h(x, 0, Δƒ/ϵ).
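For purposes of illustration only, and not by way of limitation, the Laplace mechanism described above may be sketched in Python as follows; the function and variable names here are expository assumptions rather than elements of any claimed method:

    import numpy as np

    def laplace_mechanism(f, database, global_sensitivity, epsilon):
        # M(D) = f(D) + Y, with Y sampled from Laplace(mu=0, b=global_sensitivity/epsilon)
        scale = global_sensitivity / epsilon
        noise = np.random.laplace(loc=0.0, scale=scale)
        return f(database) + noise

    # Example: a noisy sum of reported speeds, where the global sensitivity
    # is the maximum allowed road speed (each speed lies in [0, 33.33]).
    speeds = [27.0, 31.5, 29.8, 30.2]
    noisy_sum = laplace_mechanism(sum, speeds, global_sensitivity=33.33, epsilon=0.5)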
On the other hand, the exponential mechanism is used to handle both numerical and categorical analyses. The exponential mechanism may work well for situations in which it is desirable to output the best response among finite (countable) options over an arbitrary range, but where adding noise directly to the output of the analysis would compromise its quality. Due to the finite set of output options, the exponential mechanism for categorical analysis is discrete, and is defined as follows.
Definition 3. Exponential mechanism. For any quality function q: (Dn×O)→R and a privacy parameter ϵ, the exponential mechanism outputs an element o∈O with probability

Pr[o] ∝ e^(ϵ·q(D, o)/(2·Δq)), (4)

where O is the set of all possible outputs and

Δq = max over o∈O and D1, D2∈Dn with d(D1, D2)=1 of |q(D1, o)−q(D2, o)|

is the sensitivity of the quality function.
It has been observed that the Laplace mechanism can be viewed as a special case of the exponential mechanism, by using the quality function q(D, o)=−|ƒ(D)−o|, which provides Δq=Δƒ. In fact, the Laplace distribution is known as the double exponential distribution, because it can be thought of as two exponential distributions spliced together by an additional location parameter. In this way, considering the case of numerical analysis, it is sufficient to assume q(D, o)=−|ƒ(D)−o| for the exponential mechanism, where an output o with quality zero gives the true value of the analysis.
Definition 4. Monotonic function. A function ƒ performed over a database is monotonic if the addition of an element to the database cannot cause the value of the function to decrease. That is, ƒ(D1)≥ƒ(D2) if d (D1, D2)=1 and |D1|≥|D2|, and vice-versa.
It has been proven that if a quality function is monotonic, then the exponential mechanism can output o∈O with probability

Pr[o] ∝ e^(ϵ·q(D, o)/Δq),

dropping the factor of 2 in the denominator of the exponent.
The exponential distribution presents the following probability density function:
h(x, λ) = λ·e^(−λx), (5)
where λ>0 is the rate parameter.
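By way of a non-limiting illustration, one possible realization of the exponential mechanism of Definition 3 over a finite output set is sketched below; the quality function and all names are hypothetical:

    import numpy as np

    def exponential_mechanism(database, outputs, quality, sensitivity, epsilon, monotonic=False):
        # Select o with probability proportional to exp(eps * q(D, o) / (k * dq)),
        # where k = 1 for a monotonic quality function and k = 2 otherwise.
        k = 1.0 if monotonic else 2.0
        scores = np.array([quality(database, o) for o in outputs], dtype=float)
        # Subtract the maximum score before exponentiating, for numerical stability.
        weights = np.exp(epsilon * (scores - scores.max()) / (k * sensitivity))
        return outputs[np.random.choice(len(outputs), p=weights / weights.sum())]

    # Example: privately choose a count near a true value of 55, scoring
    # each candidate by its closeness to the true value (sensitivity 1).
    chosen = exponential_mechanism(None, list(range(101)),
                                   lambda D, o: -abs(55 - o),
                                   sensitivity=1.0, epsilon=0.5)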
Composability. The composition theorems are useful for understanding how to combine multiple mechanisms when designing differentially private algorithms. The privacy loss parameter ϵ degrades across repeated analyses over databases containing the same elements. As such, it is often referred to as the privacy budget, since it must be divided and consumed by a sequence of differentially private algorithms serving a sequence of analyses. There are two main composition theorems: sequential and parallel composition.
Theorem 1. Sequential composition. Let A1(D), . . . , Ak(D) be k algorithms that satisfy (ϵ1, δ1), . . . , (ϵk, δk)-differential privacy, respectively. Then, an algorithm A such that A(D)=A[A1(D), . . . , Ak(D)] is (ϵ1+ . . . +ϵk, δ1+ . . . +δk)-differentially private.
Theorem 2. Parallel composition. Given a deterministic partitioning ƒ, such that D1, . . . , Dk are the resulting partitions of ƒ over D, let A1(D1), . . . , Ak(Dk) be k algorithms that satisfy (ϵ1, δ1), . . . , (ϵk, δk)-differential privacy, respectively. Then, A(D)=A[A1(D1), . . . , Ak(Dk)] is (max{ϵ1, . . . , ϵk}, max{δ1, . . . , δk})-differentially private.
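As a minimal illustrative sketch of how these theorems translate into budget accounting (the PrivacyAccountant class below is an expository construct, not an element of the disclosed methods):

    class PrivacyAccountant:
        # Track (epsilon, delta) consumption under the composition theorems.
        def __init__(self, epsilon_budget, delta_budget=0.0):
            self.epsilon_budget = epsilon_budget
            self.delta_budget = delta_budget

        def fits_sequential(self, analyses):
            # Theorem 1: costs add up when analyses touch the same elements.
            return (sum(e for e, _ in analyses) <= self.epsilon_budget and
                    sum(d for _, d in analyses) <= self.delta_budget)

        def fits_parallel(self, analyses):
            # Theorem 2: costs are the maxima over disjoint partitions.
            return (max(e for e, _ in analyses) <= self.epsilon_budget and
                    max(d for _, d in analyses) <= self.delta_budget)

    # Example: a Count analysis costing (0.15, 0) followed by a Sum analysis
    # costing (ln(2)-0.15, 0) on the same prefix must fit a per-event budget
    # of epsilon = ln(2) under sequential composition.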
In one embodiment of a differential privacy framework, the noise magnitude depends on the global sensitivity (Δƒ, Definition 2), but not on the instance D. For many functions, such as the median, this framework yields high noise, compromising the utility of the analysis. Two frameworks have been proposed that allow noisy analyses to be performed with magnitude proportional to the instance in question. These frameworks are known as smooth sensitivity, and sample and aggregate.
Local Sensitivity. Local sensitivity is an instance-specific measure of sensitivity: it depends directly on the instance in question. Local sensitivity allows or enables adding significantly less noise as compared to calibrating with the global sensitivity. In some embodiments, local sensitivity is defined as follows.
Definition 5. Local sensitivity. For ƒ: Dn→Rd and D1∈Dn, the local sensitivity of ƒ at D1 is

LSƒ(D1) = max over D2∈Dn with d(D1, D2)=1 of ‖ƒ(D1)−ƒ(D2)‖1. (6)
However, calibrating noise with the local sensitivity alone does not satisfy differential privacy, since the local sensitivity can change abruptly when the instance changes, revealing information about the instance.
Smooth Sensitivity. The idea behind the smooth sensitivity framework is to find the smallest upper bound on the local sensitivity such that adding noise proportional to this upper bound is safe. This upper bound is known as the smooth sensitivity. It is a measure of the variability of a function ƒ over the entire neighborhood of the instance in question:

Definition 6. Smooth sensitivity. For β>0, the β-smooth sensitivity of ƒ is

Sƒ,β(D) = max over D′∈Dn of ( LSƒ(D′)·e^(−β·d(D, D′)) ). (7)
One can add noise proportional to Sƒ,β(D)/α, i.e., release A(D)=ƒ(D)+(Sƒ,β(D)/α)·Z, where Z is a random variable sampled from an admissible noise distribution and α, β are parameters of the noise distribution.
Let a database D={d1, . . . , dn} be in non-decreasing order and ƒmed=median(D), where di∈R, with di=0 for i≤0 and di=Δƒ for i>n. It has been proven that the β-smooth sensitivity of the Median function is

Smed,β(D) = max over k=0, . . . , n of ( e^(−k·β) · max over t=0, . . . , k+1 of ( d(m+t) − d(m+t−k−1) ) ), (8)

where m=(n+1)/2 is the rank of the median element for odd n. It can be computed in time O(n²).
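An illustrative sketch of this O(n²) computation, under the stated conventions (1-indexed ranks, odd n, di=0 for i≤0 and di=Δƒ for i>n), might read as follows; the name smooth_sensitivity_median and the argument gs (the global sensitivity) are expository assumptions:

    import math

    def smooth_sensitivity_median(data, beta, gs):
        # beta-smooth sensitivity of the median, per Eq. (8); gs is the
        # global sensitivity (e.g., the maximum allowed road speed).
        d = sorted(data)
        n = len(d)
        m = (n + 1) // 2  # 1-indexed rank of the median element (odd n)

        def elem(i):
            # d_i with boundary conventions d_i = 0 for i <= 0, d_i = gs for i > n.
            if i <= 0:
                return 0.0
            if i > n:
                return gs
            return d[i - 1]

        best = 0.0
        for k in range(n + 1):
            widest = max(elem(m + t) - elem(m + t - k - 1) for t in range(k + 2))
            best = max(best, math.exp(-k * beta) * widest)
        return best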
Sample and Aggregate Framework. In the sample and aggregate framework, a database D is randomly partitioned into small databases D′={D1, . . . , Dm}, the function is evaluated over each partition to obtain ƒ*(Di), and the partial results are combined by a differentially private aggregation function, such as a noisy median. The intuition behind the technique is that changing a single point in D will change very few of the small databases di′∈D′, and hence very few evaluations ƒ*(D′). The output ƒ*(D′) will be close to ƒ(D) if ƒ can be approximated well on random partitions. This evaluation is quantified by the following definition.
Definition 7. Good approximation. A function ƒ: Dn→R is well approximated from random partitions {D1, . . . , Dm} of a database D if
Pr{dM[ƒ(Di), ƒ(D)]≤r}≥¾, (9)
where dM is some metric, r is a ratio of accuracy and i∈{1, . . . , m}.
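For illustration, Definition 7 can be checked empirically over a concrete set of random partitions; the helper below is a hypothetical sketch using the absolute deviation as the metric dM:

    def well_approximated(f, partitions, full_db, r, metric=lambda a, b: abs(a - b)):
        # Definition 7: f is well approximated from random partitions if at
        # least 3/4 of the partition evaluations fall within accuracy r of
        # f evaluated on the full database.
        reference = f(full_db)
        hits = sum(1 for p in partitions if metric(f(p), reference) <= r)
        return hits / len(partitions) >= 0.75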
The level of privacy refers to the degree of protection afforded to an individual or entity by differentially private mechanisms. In other words, what should be protected: the entity itself, or an action of that entity? Two levels of privacy have been described: event-level privacy and user-level privacy.
Event-level Privacy. In event-level privacy, privacy protection is centered on an event, i.e., it protects the privacy of individual accesses. Thus, the data set is an unbounded stream of events. An event may be an interaction between a particular person and an arbitrary term. If the data set is dynamic, i.e., the attribute changes for each interaction, then an event is unique and its ID (identification) is the combination of timestamp, user ID, and attribute value. Otherwise, the data set is static and an event should occur only once for the same particular person. In this latter case, if an interaction for the same particular person occurs more than once, events will be cumulative, and user-level privacy will be dealt with by the composition theorems.
User-level Privacy. Privacy protection in user-level privacy is centered on a user. That is, user-level privacy protects the presence or absence of an individual in a stream, independent of the number of times the individual arises, should the individual actually be present at all. In any time interval on the stream, several interactions between a particular person and an arbitrary term may arise. In this case, the privacy loss parameter ϵ should be monitored and bounded. This implies an upper bound on the privacy loss of a particular person due to participation in the statistical study. In order to ensure that differential privacy is satisfied, the consumption of the privacy budget should be checked over a period of time.
This section describes a simple or basic approach, e.g., to calculate average speed, in a differentially private way through a prefix of a finite length formed from an unbounded data stream containing beacons reported by vehicles crossing a road segment. A simple solution was initially presented by Kargl et al., “Differential privacy in intelligent transportation systems,” In: WiSec '13 Proceedings of the sixth ACM Conference on Security and Privacy in Wireless and Mobile Networks, pp. 107-112, ACM, Budapest, Hungary (2013), the entirety of which is incorporated by reference herein, that considered the original framework of differential privacy.
According to some embodiments, an enhanced or improved solution focuses instead on event-level privacy, following the problem statement, and adds noise proportional to the global sensitivity in a centralized model through the Laplace distribution. In the version according to some embodiments of the present disclosure, the size of the prefix is calculated in a differentially private way by using the exponential mechanism, since negative values are not of interest.
Firstly, the method 500 starts with an empty set called prefix used to store beacons received by RSU. At a process 502, the RSU initializes the prefix list.
Next, at a process 504, the RSU starts collecting or receiving data (for events e), adding (or appending) each of them to the prefix. This continues with the RSU receiving and appending data for the remaining events.
The control of collection is made by a differentially private Count function, at a process 506, which uses the exponential mechanism. The Count function, in some examples, is given by or performed according to a method 600 (Algorithm 2), as shown in
At a process 508, the privacy loss parameter ϵc of the Count function is then deducted from the privacy budget ϵ of each event.
After collecting enough data to compose an aggregation, at a process 510, method 500 selects the most recent beacons to calculate the average speed. In some examples, the average speed is calculated as follows: i) at a process 512, calculate the noisy sum from N latest reported speeds through the Laplace mechanism; then, ii) at a process 514, compute the average speed of the road segment as the ratio between the noisy sum and the size of the aggregation. The Sum function, in some examples, is given by or performed according to a method 700 (Algorithm 3), as shown in
Finally, at a process 516, the privacy loss parameter ϵs of the Sum function is deducted from the privacy budget ϵ for each event in the aggregation.
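For purposes of illustration, the basic approach of methods 500, 600, and 700 might be rendered in simplified form as follows; this sketch makes several expository assumptions (the names, the candidate range of the Count function, and the treatment of the prefix) that are not elements of the methods themselves:

    import numpy as np

    def noisy_count(true_length, eps_c):
        # Algorithm 2 sketch: a differentially private Count via the
        # exponential mechanism over non-negative candidates; the quality is
        # closeness to the true length (sensitivity 1, monotonic, so the
        # factor of 2 in the exponent can be dropped).
        candidates = np.arange(0, 2 * true_length + 2)
        scores = -np.abs(candidates - true_length).astype(float)
        weights = np.exp(eps_c * (scores - scores.max()))
        return int(np.random.choice(candidates, p=weights / weights.sum()))

    def noisy_average_speed(prefix, n, gs, eps_s):
        # Algorithms 1/3 sketch: Laplace-noised sum of the N latest reported
        # speeds divided by N; gs is the maximum allowed road speed.
        latest = prefix[-n:]
        noisy_sum = sum(latest) + np.random.laplace(0.0, gs / eps_s)
        return noisy_sum / n

    # Illustrative use: collect beacons until the noisy count reaches N = 55,
    # then release the noisy average with eps_s = ln(2) - 0.15.
    prefix = [28.0, 30.5, 31.0, 29.4, 27.8] * 11
    if noisy_count(len(prefix), eps_c=0.15) >= 55:
        average = noisy_average_speed(prefix, n=55, gs=33.33, eps_s=np.log(2) - 0.15)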
A security analysis of the methods 500, 600, 700 (Algorithms 1, 2 and 3) for the basic or simple approach is provided below.
This section describes an enhanced approach, e.g., to compute or calculate the average speed on a road segment, that meets the differential privacy definition while providing accurate aggregate information. This approach was inspired by the observation that most speed values are close to the average when measured over a short time interval and road segment, but there exist anomalies (a few values outside of this range). Thus, the hypothesis is that cropping a range in the original prefix can eliminate anomalies and produce accurate analyses, since it allows us to introduce less, but still significant, noise to protect the maximum element in that instance. However, the noise magnitude might reveal information about the prefix. That is, the choice of range is itself sensitive data leaking information about events in the prefix and, as such, should be made by a differentially private algorithm.
In some embodiments, the enhanced solution is based on the sample-and-aggregate framework. Details of such a framework are provided in Nissim et al., "Smooth sensitivity and sampling in private data analysis," In: Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, pp. 75-84 (2007), the entirety of which is incorporated by reference herein. This sample-and-aggregate framework considers the instance-based additive noise problem and allows or enables adding significantly less noise in typical instances, where most speed values are close to the average, while maintaining privacy requirements.
The approach proposed herein is presented in
In some embodiments, method 800 (Algorithm 4) is similar to method 500 (Algorithm 1). One difference between method 500 and method 800 (respectively, the basic and enhanced approaches) is that, while the basic approach adds noise proportional to the global sensitivity of the Sum function, the enhanced approach adds noise proportional to the smooth sensitivity of the Median function.
The method 800 receives as input a privacy budget ϵ related to each event received in the RSU, the aggregation size N to calculate the average speed, the global sensitivity of the Sum function (the maximum allowed speed value in the road segment), and the privacy loss parameters ϵc and ϵs for the Count and Sum functions. Method 800 receives as additional inputs the relaxation budget parameter δ differing from zero, the number of partitions M over the aggregation list, and the privacy and relaxation parameters ϵm and δm for the Median function. The global sensitivity of the Median function is the same as that of the Sum function (the maximum allowed speed value in the road segment).
Method 800 starts with an empty set called prefix used to store beacons received by RSU. At a process 802, the RSU initializes the prefix list.
Next, at a process 804, the RSU starts collecting or receiving data (for events e), adding (or appending) each of them to the prefix. This continues with the RSU receiving and appending data for the remaining events.
The control of collection is made by a differentially private Count function, at a process 806, which uses the exponential mechanism. The Count function, in some examples, is given by or performed according to a method 600 (Algorithm 2), as shown in
At a process 808, the privacy loss parameter ϵc of the Count function is then deducted from the privacy budget ϵ of each event.
After receiving all beacons from vehicles crossing the road segment and adding them to the prefix list, the aggregation set is composed through selection of most recent events. At a process 810, method 800 selects the most recent beacons to calculate the average speed.
At a process 812, method 800 calculates the average speed using the sample and aggregate framework or approach. In some examples, the sample and aggregate framework is given by or performed according to a method 900 (Algorithm 5), as shown in
Referring to
In some examples, at a process 904, each partition is composed (or extracted) of uniformly distributed samples of size N/M, drawn without replacement. For each partition, at a process 906, the average speed is calculated and the result is stored in a set called average speeds.
Once this set is filled with M average speeds, at a process 908, the average speeds set is sorted, e.g., in non-decreasing order.
At a process 910, the smooth sensitivity of the median function is calculated over the average speeds set as an instance. In some examples, the Smooth Median function is given by or performed according to a method 1000 (Algorithm 6), as shown in
Referring to
Returning again to
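A simplified, non-limiting sketch of the sample-and-aggregate average of method 900 is given below. It reuses the smooth_sensitivity_median sketch presented earlier and assumes the Laplace calibration derived in Lemma 3 below (α=ϵm/2, β=ϵm/(2 ln(1/δm))); all names are expository:

    import numpy as np

    def saa_average_speed(aggregation, m, gs, eps_m, delta_m):
        # Algorithm 5 sketch: split the aggregation into M uniform partitions
        # without replacement, average each, sort, and release the median of
        # the partition averages with noise scaled to its smooth sensitivity.
        shuffled = np.random.permutation(aggregation)
        partitions = np.array_split(shuffled, m)  # each of size roughly N/M
        averages = sorted(float(np.mean(p)) for p in partitions)

        beta = eps_m / (2.0 * np.log(1.0 / delta_m))
        s_smooth = smooth_sensitivity_median(averages, beta, gs)
        alpha = eps_m / 2.0
        median = averages[len(averages) // 2]
        return median + (s_smooth / alpha) * np.random.laplace(0.0, 1.0)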
A security analysis of the methods 800, 900, and 1000 (Algorithms 4, 5 and 6) for the enhanced approach is provided below.
In this section, we describe a hybrid approach to calculate the average speed on a road segment satisfying the definition of differential privacy. This approach combines the original differential privacy framework (ODP) with the sample and aggregate framework (SAA). The adoption of the latter was inspired by the hypothesis that most speed values are close to the average when measured over a short time interval and road segment, yielding some well-behaved instances. The hybrid approach is justified by the dynamism of the application, which can yield misbehaved instances leading to very high sensitivity in the SAA framework.
The noise magnitudes from the original and smooth sensitivity techniques are not related. While the differences between the instance and its neighbors are taken into account to obtain the noise magnitude in the smooth sensitivity technique, the original technique considers only the global sensitivity, without examining the instance itself. The core of our contribution is to propose a formulation relating these techniques in order to obtain the lowest noise magnitude, which results in more accurate analyses.
From now on, we will refer to the collected set of beacons as a prefix, a finite length chain from an unbounded stream of beacons. In the hybrid approach, we calculate the noisy prefix size by using the exponential mechanism. To calculate the average speed, we use the Laplace mechanism in both ODP and SAA frameworks.
One way to calculate the differentially private average function using the ODP framework is to add a random variable, sampled from the Laplace distribution, to the true sum function and then divide it by the set size N to obtain the average. In this case, the scale parameter is set as

bs = Δƒ/ϵ.
The method 1500 (Algorithm 1A) of
On the other hand, using the SAA framework, we can divide the prefix into random partitions and evaluate the average function over each partition. After this process, we sort the resulting data set and select the central element (the median) as the average speed. One idea is to reduce the impact of anomalies present in the prefix when calculating the aggregation. This allows us to introduce less, but still significant, noise to protect the maximum element in well-behaved instances.
The Hybrid approach is based on the following lemma and theorem.
Lemma 2A. Let a prefix P={x1, x2, . . . , xn−1, xn} be a set of points over R, such that xi∈[0, Δƒ] for all i. Sampling a random variable from the Laplace distribution with scale parameter set as

ba = Δƒ/(N·ϵ)

and adding it to the true average function is equivalent to Algorithm 1A, both performed over P.
Proof. Consider the cumulative distribution function of the Laplace distribution with mean μ=0. Suppose S is the sum of P and rs=λ·S represents a proportion of S. The probability of sampling any value greater than rs is given by

ps = (1/2)·e^(−rs/bs). (6A)

Now, suppose A is the average of P and ra=λ·A represents a proportion of A. The probability of sampling any value greater than ra is given by

pa = (1/2)·e^(−ra/ba). (7A)

In order to conclude the proof, we need to determine ba. It is a fact that S=A·N. Thus, we have rs=λ·A·N, which results in rs=ra·N. By substituting this in Eq. (6A) and equating to Eq. (7A), i.e., ps=pa, we obtain

ba = bs/N.
Based on Lemma 2A, the following construction (Algorithm 3A), as shown in the method 1700 of
Theorem 1A. Let a prefix P={x1, x2, . . . , xn−1, xn} be a set of points over R, such that xi∈[0, Δƒ] for all i. Then, the method 1600 (Algorithm 2A) provides more accurate results than the method 1700 (Algorithm 3A) if

Sƒ,β(P)/α ≤ Δƒ/(N·ϵ),

both performed over P.
Proof. Let bSAA and bODP be the scale parameters of the Laplace distribution in the methods 1600 (Algorithm 2A) and 1700 (Algorithm 3A), respectively. Then, we obtain

bSAA = Sƒ,β(P)/α, (8A)

bODP = Δƒ/(N·ϵ). (9A)

Rearranging Eqs. (8A) and (9A) and setting bODP as an upper bound on bSAA, we get

Sƒ,β(P)/α ≤ Δƒ/(N·ϵ). (10A)

In order to prove this theorem, assume for the sake of contradiction that method 1700 (Algorithm 3A) provides more accurate results than method 1600 (Algorithm 2A), both performed over P. Then, bODP is less than bSAA. By Eq. (10A), this is a contradiction.

Therefore, if Eq. (10A) is the premise, method 1600 (Algorithm 2A) provides more accurate results than method 1700 (Algorithm 3A).
From Theorem 1A and Lemma 2A, the noise magnitude of the Hybrid approach is formulated as follows:

bhybrid = min{ Sƒ,β(P)/α, Δƒ/(N·ϵ) }.
The method 1800 starts by checking the privacy budget of the privacy loss and relaxation parameters. After that, it initializes an empty list called beacons used to store all beacons received through the base station. Next, the base station starts collecting data (beacons/events) adding each of them to the list. The collection control is made by a differentially private Count function which uses the exponential mechanism, method 1600 (Algorithm 2A,
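One possible reading of this hybrid selection is sketched below, for illustration only. It assumes the two candidate scale parameters of Theorem 1A, picks whichever yields the lower noise magnitude, and reuses the smooth_sensitivity_median sketch presented earlier; all names are expository:

    import numpy as np

    def hybrid_average_speed(prefix, n, m, gs, eps, eps_m, delta_m):
        # Hybrid sketch: compute both candidate Laplace scales and keep the
        # smaller one, i.e., b_hybrid = min(b_ODP, b_SAA).
        latest = list(prefix[-n:])

        b_odp = gs / (n * eps)  # ODP scale for the average (Lemma 2A)

        shuffled = np.random.permutation(latest)
        averages = sorted(float(np.mean(p)) for p in np.array_split(shuffled, m))
        beta = eps_m / (2.0 * np.log(1.0 / delta_m))
        b_saa = 2.0 * smooth_sensitivity_median(averages, beta, gs) / eps_m

        if b_odp <= b_saa:
            # Misbehaved instance: the instance-based scale is worse, so
            # fall back to the original framework over the true average.
            return float(np.mean(latest)) + np.random.laplace(0.0, b_odp)
        return averages[len(averages) // 2] + np.random.laplace(0.0, b_saa)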
This section presents and discusses the results obtained from the evaluation of the basic and enhanced approaches for average speed calculation. Since the evaluation focuses on the accuracy of the proposed solutions, the two fundamental parameters (the privacy loss parameter ϵ and the relaxation parameter δ) were fixed and calibrated. For this evaluation, we set the privacy loss parameter ϵ for each aggregation function with the following values: ln(2)−0.15 for the Sum function, 0.15 for the Count function, and ln(2)−0.15 for the Median function. Since the aggregation set size for this evaluation has been defined as 55, it is sufficient to calibrate the relaxation parameter δ at 0.01, which is a negligible value relative to the size of the aggregation set.
In order to evaluate the approaches, the analysis adopts the open-source traffic mobility simulator (SUMO) and the discrete event-based simulator (OMNeT++). In addition, the open-source framework for running vehicular network simulations (Veins) is used as an interface between the two simulators. The evaluation is made on two simple synthetic scenarios that try to simulate real traffic jam situations.
We adopt the absolute deviation as a utility metric and build a filter with a deviation tolerance (margin of error) of 10% of the original reported average speed a, to absorb deviations caused by the introduction of noise. In other words, we desire that the reported noisy average speed n, Eq. (10), stay within a confidence interval with a confidence level of 95%; any reported measurement outside of this range is considered an outlier.
n = a±(0.1·a). (10)
As a result, we calculate the number of outliers obtained in a simulation time window and present the behavior of the real average speed as well as the approximations of the two solutions or approaches (basic and enhanced). In addition, we show the quality of the original and derived instances by presenting two standardized measures of dispersion, besides the approximation of random partitions given by Definition 7. We use the relative deviation as the metric to evaluate the random partitions in this definition, with the ratio of accuracy r fixed at 0.01, thus meeting the requirements of our utility metric. These numerical and graphical results for both approaches are presented below, organized by scenario.
The first evaluation is made in a synthetic scenario 1100 containing simple SUMO features. As shown in
In the first traffic condition, we consider that all cars are traveling at the maximum speed of the road segment, with a car insertion period of 1 second, so that there is no congestion; this is considered an ideal condition. The second traffic condition differs from the first by a car insertion period of 0.1 second, forcing a traffic jam. In this setting, cars automatically reduce their speeds to avoid collisions.
We return to the car insertion period of 1 second in the third setting, and force all cars to travel at a maximum speed of 11.11 m/s, even though the maximum road speed is 33.33 m/s. In the fourth and last setting, we only modify the car insertion period to 0.1 second, relative to the third setting.
Summarized results for the first scenario appear in Table 1, shown in
The numerical results in Table 1 show that the basic or simple approach works well in ideal scenarios (Setting 1), where all or most cars are traveling at or close to the maximum road speed (the global sensitivity), producing an average speed close to the global sensitivity. When the speeds of vehicles move away from the maximum road speed because of congestion (Settings 2, 3, or 4), the basic approach yields more outliers due to the distance between the average and the maximum road speed.
On the other hand, the enhanced approach presents good results in Settings 1 and 3, but its performance may be negatively affected in Settings 2 and 4. The amount of noise added to the smooth median depends on the Euclidean distance between an element and the values of its neighbors in the instance; consequently, the enhanced approach presents good results when the instances are well behaved (Settings 1 and 3), that is, instances with low variance. Instances with high variance yield average speed calculations distant from most elements in the instance.
In Setting 2, the number of outliers increases drastically: the basic approach presents about 49% outliers, and the enhanced approach about 67%. This jump is due to the number of vehicles inserted into the scenario in a short period of time, causing congestion.
Settings 3 and 4 show the same behavior in their results. Forcing vehicles to travel at a maximum speed of 11.11 m/s severely degrades the accuracy of the basic approach.
The enhanced approach presents good results for Setting 3, where we get only 15.84% outliers. This result is due to the good behavior of the original instances, which are under-dispersed in Setting 3, with most index-of-dispersion values below 0.5. In Setting 4, the number of outliers is about 64%. This result is due to the misbehavior of the original instances, caused by the car insertion period of 0.1 second. The original instances in Setting 4 are classified as over-dispersed, presenting an index of dispersion between 1.5 and 2.
The second scenario 1300 is slightly more complex than the first. As shown in
In this scenario, during the simulation time window, we evaluate the average speed measurements from each RSU, which are related to each road segment. RSU 7 is attached to the first road segment, before the exit road. RSU 8 is fixed after the exit road. RSU 9 is attached to the exit road. We consider that all cars can travel at the maximum speed of the road segments, and the insertion time period of vehicles is set to 1 second.
Numerical results appear in Table 2 shown in
In RSU 7, the simple or basic approach gets almost 60% outliers. The enhanced approach reaches about 66%, an even higher value than the simple approach.
RSUs 8 and 9 show the same behavior, both with practically constant average speed values of about 13.77 m/s. The simple approach degrades under this behavior, presenting more than 30% outliers in RSU 8 and about 26% in RSU 9. On the other hand, the enhanced approach benefits from this behavior, presenting no outliers in either RSU, as shown in Table 2 (
The security of the simple or basic approach is supported by the following Lemmas 1 and 2, and Theorem 3. In Lemma 1, we prove that the randomized Count function presented in Algorithm 2 (
Lemma 1. Let a prefix P={x1, x2, . . . , xn−1, xn} be a set of points over R such that xi∈[0, Δƒ] for all i, and let |P| be the length of the prefix. Then, Algorithm 2 satisfies (ϵc, 0)-differential privacy.
Proof. Assume that, without loss of generality, A represents Algorithm 2. Let P1 and P2 be two neighbouring prefixes differing by at most one event. From Eq. (1) in the differential privacy definition, we have to evaluate two cases: when the ratio is greater than 1, and when it is less than or equal to 1. Since the quality function of the Count function is monotonic:
Lemma 2. Let S be an aggregation set of points from a prefix P={x1, x2, . . . , xn} over R such that xi∈[0, Δƒ] for all i. Then, Algorithm 3 satisfies (ϵs, 0)-differential privacy.
Proof. Assume now, without loss of generality, that A represents Algorithm 3. Let S1 and S2 be two neighboring aggregations differing by at most one event. From the definition of differential privacy:
We will solve this ratio in two parts. First, considering the numerator of Eq. (13), we have to evaluate two cases: when x≥0 and when x<0.
Now, considering the denominator of Eq. (13), we have to evaluate the cases when x≥−Δƒ and when x<−Δƒ.
By replacing Eq. (14) and Eq. (16) in Eq. (13), we obtain
Similarly, by substituting Eq. (15) and Eq. (17) in Eq. (13), we have
Theorem 3. Let a prefix P={x1, x2, . . . , xn−1, xn} be a set of points over R such that xi∈[0, Δƒ] for all i. Then, Algorithm 1 satisfies (ϵ, 0)-differential privacy.
Proof. From Lemma 1 and 2 we have that Algorithms 2 and 3 are differentially private. We now show that their combination preserves (ϵc+ϵs, 0)-differential privacy.
Assume, without loss of generality, that A, B and C are random algorithms representing Algorithm 2, 3 and their combination, respectively. Let P1 and P2 be two neighboring prefixes differing by at most one event. From the definition of differential privacy:
From Algorithm 1, we have the combination of Algorithms 2 and 3 when ϵc+ϵs≤ϵ. Therefore, in this case, we have that Algorithm 1 satisfies (ϵ, 0)-differential privacy.
To demonstrate the security of the enhanced approach, we show that the Smooth Median function, presented in Algorithm 6 (
Definition 8. Admissible Noise Distribution. A probability distribution h on R is (α, β)-admissible for α(ϵm, δm) and β(ϵm, δm) if, for a random variable Z sampled from h, it satisfies the following inequalities:

Pr[Z∈U] ≤ e^(ϵm/2)·Pr[Z∈U+Δ] + δm/2, (23)

Pr[Z∈U] ≤ e^(ϵm/2)·Pr[Z∈e^λ·U] + δm/2, (24)

for all ‖Δ‖≤α, all |λ|≤β, and all subsets U⊆R.
This definition states that a probability distribution that does not change too much under translation and dilation can be used to add noise proportional to Sƒ,β*.
Lemma 3. The Laplace distribution on R, Eq. (3), with location parameter μ=0 and scale parameter b, is (α, β)-admissible with α=b(ϵm/2) and β=ϵm/(2 ln(1/δm)).
Proof. From Definition 8, we can obtain the α and β parameters. Since the Laplace distribution is not a heavy-tailed distribution, δm>0.
Considering Eq. (23), we have
Considering the numerator of Eq. (25), we have to evaluate the interval [c, d] in two cases,
Now, considering the denominator of Eq. (25), we have
By substituting Eq. (26) and Eq. (28) in Eq. (25) we obtain
When δm tends to zero in Eq. (30), the ratio tends to 1. Thus, assuming a very small δm (negligible), we get
Similarly, by replacing Eq. (27) and Eq. (29) in Eq. (25) we get the same result, Δ≤b (ϵm/2).
Therefore, it is sufficient to admit α=b (ϵm/2), so that the translation property is satisfied with probability
Considering Eq. (24), we have
The numerator of Eq. (33) is given by Eqs. (26) and (27). On the other hand, the denominator of Eq. (33) is given by evaluating the interval [c, d] in two cases,
By replacing Eq. (26) and Eq. (34) in Eq. (33) we obtain
From an analysis of Eq. (36), we can conclude that, regardless of the values of b, c, and d, where d>c, the ratio tends to zero for high values of λ. This is because the value of δm is negligible. As λ tends to zero, the ratio tends to 1. Thus, an acceptable upper bound for λ, so that Eq. (36) is satisfied with high probability, is ϵm/(2 ln(1/δm)). This value tends to zero for very small values of δm.
Similarly, by replacing Eq. (27) and Eq. (35) in Eq. (33) we obtain the same result, λ≤ϵm/(2 ln(1/δm)).
Therefore, to satisfy the dilation property with probability
it is enough to assume β=ϵm/(2 ln(1/δm)).
Lemma 4. Let Y be a random variable sampled from a Laplace distribution. Then, Algorithm 6 is ϵm differentially private with probability 1−δm.
Proof. The proof follows by combination of Definition 8 and Lemma 3.
Theorem 4. Let S be an aggregation set of points from a prefix P={x1, x2, . . . , xn} over R such that xi∈[0, Δƒ] for all i. Then, Algorithm 5 satisfies (ϵm, δm)-differential privacy and yields accurate aggregation results.
Proof. Our construction is based on uniformly distributed samples from the aggregation set. These random samples are extracted without replacement, producing partitions of size N/M from the aggregation set. From these, a set of size M is constructed by calculating the average speed over the partitions. Finally, to calculate the smooth sensitivity of the Median function from Eq. (8), the aggregate set must be sorted in non-decreasing order. Thus, Algorithm 5 (
If a function ƒ can be approximated well over random partitions of a database, then a differentially private version of ƒ can be released with significantly less noise. The accuracy of this approximation can be measured following Definition 7. In fact, in this case, changing a single element in the aggregation set does not significantly affect the result of Algorithm 5, since most values in the aggregation set will be close to the average.
Therefore, the proof of this theorem follows by a combination of Lemma 4, Theorem 2 and Definition 7.
Theorem 5. Let a prefix P={x1, x2, . . . , xn−1, xn} be a set of points over R such that xi∈[0, Δƒ] for all i. Then, Algorithm 4 satisfies (ϵ, δ)-differential privacy.
Proof. From Lemma 1 and Theorem 4, we have that Algorithms 2 and 5 satisfy (ϵc, 0)- and (ϵm, δm)-differential privacy, respectively. Thus, by Theorem 1, Algorithm 4 satisfies (ϵc+ϵm, δm)-differential privacy. Therefore, since in Algorithm 4 the combination of Algorithms 2 and 5 occurs when ϵc+ϵm≤ϵ and δm≤δ, Algorithm 4 is (ϵ, δ)-differentially private.
According to some embodiments, an instance-based data aggregation solution is disclosed herein for traffic monitoring based on differential privacy, focusing on event-level privacy. In some embodiments, an enhanced approach for a differentially private solution (e.g., for average speed calculation) uses, employs, or is implemented with the smooth sensitivity and sample and aggregate frameworks. Experimental results have shown that the enhanced approach is superior to a basic or simple approach for differential privacy in situations that present at least light congestion with under-dispersed instances, following the hypothesis that vehicles travel at similar speeds within a short window of time and space.
The embodiments described above illustrate but do not limit the invention. For example, the techniques described for vehicles can be used by other mobile systems, e.g., pedestrians' smartphones or other mobile systems equipped with computer and communication systems 150. The term "vehicle" is not limited to terrestrial vehicles, but includes aircraft, boats, space ships, and possibly other types of mobile objects. The vehicle techniques can also be used by non-mobile systems, e.g., they can be used on a computer system.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures typically represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
The present application claims priority to U.S. Provisional Patent Application No. 62/964,694, “ENHANCED DIFFERENTIALLY PRIVATE SOLUTION FOR TRAFFIC MONITORING,” filed on 23 Jan. 2020, which is incorporated herein by reference.