This invention relates, generally, to privacy preservation and action recognition systems. More specifically, it relates to a self-supervised privacy preservation action recognition system that minimizes private information leakage during action recognition without requiring privacy labels for individual frames.
Recent advances in action recognition have enabled a wide range of real-world applications, such as video surveillance, smart shopping systems, and elderly monitoring systems. Most of these video understanding applications involve extensive computation, for which a user must share the video data with a cloud computation server. However, in the process of sharing a video or other multimedia segment with the cloud server for the utility action recognition task, the user also shares private visual information contained within the shared segment, such as subject gender, skin color, clothing, background objects, and other potentially personally identifying information.
As shown in
Turning to section B of
More recently, a novel approach was proposed to remove privacy features by learning an anonymization function through an adversarial training framework 28, which requires both action labels 18 and privacy labels 20 from the video 12. Although the method achieves a good trade-off between action recognition and privacy preservation, it has two main problems. First, it is not feasible to annotate a video dataset for privacy attributes, as conceded by the reference, which notes the immense effort required to annotate privacy attributes for the small-scale video dataset PA-HMDB, which includes only 515 videos. Second, an anonymization function learned from known privacy attributes may not generalize to anonymizing novel privacy attributes. For example, as shown in section C of
Accordingly, what is needed is a self-supervised privacy preservation action recognition system that allows action recognition while minimizing private information leakage in an output image without requiring supervision or input privacy labels. However, in view of the art considered as a whole at the time the present invention was made, it was not obvious to those of ordinary skill in the field of this invention how the shortcomings of the prior art could be overcome.
While certain aspects of conventional technologies have been discussed to facilitate disclosure of the invention, Applicant in no way disclaims these technical aspects, and it is contemplated that the claimed invention may encompass one or more of the conventional technical aspects discussed herein.
The present invention may address one or more of the problems and deficiencies of the prior art discussed above. However, it is contemplated that the invention may prove useful in addressing other problems and deficiencies in a number of technical areas. Therefore, the claimed invention should not necessarily be construed as limited to addressing any of the particular problems or deficiencies discussed herein.
In this specification, where a document, act or item of knowledge is referred to or discussed, this reference or discussion is not an admission that the document, act or item of knowledge or any combination thereof was at the priority date, publicly available, known to the public, part of common general knowledge, or otherwise constitutes prior art under the applicable statutory provisions; or is known to be relevant to an attempt to solve any problem with which this specification is concerned.
The present invention includes both the application of an anonymization model and the process, method, and system by which that model is created. As shown in
An embodiment of the invention encompasses a self-supervised privacy preservation action recognition system, incorporating a computer processor configured to process a video composed of several frames. The system applies a learnable transformation anonymization function to the video, aimed at eliminating spatial cues from the frames while retaining necessary data for action recognition. This function stems from an anonymization model developed through iterative application to a training video dataset.
Within the system, there are designated branches: an action recognition branch and a self-supervised privacy removal branch. The system's procedure entails sequentially freezing these branches to allow specific weight adjustments within the anonymization branch, directed by the processor. This includes producing a first batch output, evaluating this against the branches for action recognition and privacy contrastive loss, generating a second batch output, and adjusting the weights based on these evaluations.
The function employs an encoder-decoder network structure with sigmoid activation layers to control the output values. For action recognition, a pre-trained three-dimensional convolutional neural network (3D-CNN) is used, while for privacy removal, a pre-trained two-dimensional convolutional neural network (2D-CNN) is utilized.
The system includes a frame sampling module for selecting frame pairs based on their temporal separation to facilitate the evaluation of privacy-preservation efficacy. Data augmentation techniques are applied to enhance the model's adaptability. The system utilizes a minimax optimization protocol to adjust the model parameters, targeting the balance between action recognition accuracy and privacy contrastive loss.
After the anonymization step, the learned anonymization module applies its transformation to a test video to meet specified thresholds for action recognition accuracy and privacy preservation. The processor may also forward this test video to a cloud server for additional action recognition processing.
The training aspect involves same-dataset and cross-dataset training protocols to improve the model's versatility. Furthermore, the system's capability extends to evaluating the anonymization and action recognition models against datasets depicting actions not previously encountered in the training, assessing the system's effectiveness in new contexts. This approach ensures maintenance of action recognition while enhancing privacy in video content.
For a fuller understanding of the invention, reference should be made to the following detailed description, taken in connection with the accompanying drawings, in which:
In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings, which form a part thereof, and within which are shown by way of illustration specific embodiments by which the invention may be practiced. It is to be understood that other embodiments may be utilized, and structural changes may be made without departing from the scope of the invention.
As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. As used in this specification and the appended claims, the term “or” is generally employed in its sense including “and/or” unless the context clearly dictates otherwise.
All numerical designations, such as measurements, efficacies, physical characteristics, forces, and other designations, including ranges, are approximations which may be varied up or down by increments of 1.0 or 0.1, as appropriate. It is to be understood, even if not always explicitly stated, that all numerical designations are preceded by the term “about.” As used herein, “about” or “approximately” refers to being within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined. As used herein, the term “about” refers to ±15% of the numerical value; it should be understood that a numerical value having an associated range with a lower boundary greater than zero must be a non-zero value, and the term “about” should be understood to include only non-zero values in such scenarios.
As shown in section D of
A key aspect of the system is learning an anonymization function such that privacy attributes deteriorate in the output without requiring privacy labels during training, while maintaining the performance of the action recognition task on the modified output. For example, consider a video dataset X with action recognition as a utility task T and privacy attribute classification as a budget task B; the goal of the system is to maintain the performance of T while minimizing the budget B to preserve privacy. This minimax objective is achieved by learning an anonymization function ƒA, which transforms (or anonymizes) the original raw data X. Assuming that the final system has any action target model ƒT′ and any privacy target model ƒB′, the goal of privacy preserving training is to find an optimal point of ƒA (referred to as ƒA*) satisfying the following two criteria, C1 (Equation 1) and C2 (Equation 2):
C1: LT(ƒT′(ƒA*(X)), YT) ≈ LT(ƒT′(X), YT)   (1)
where T denotes the utility task and LT is the loss function, which is the standard cross entropy in the case of a single action label YT or the binary cross entropy in the case of multi-label actions YT; and
C2: LB(ƒB′(ƒA*(X))) ≫ LB(ƒB′(X))   (2)
where B denotes the privacy budget and LB is the self-supervised loss of the system framework; in the case of a supervised framework, which requires privacy label annotations denoted YB, LB is the binary cross entropy.
Increasing a self-supervised loss LB results in deteriorating all useful information regardless of whether or not the information involves privacy attributes. However, the useful information for the action recognition function is preserved via criterion C1. Combining criteria C1 and C2, the privacy preserving optimization equation can be expressed through Equation 3, in which the negative sign before LB indicates optimization by maximizing LB:
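As a hedged, illustrative sketch only (not the exact form of Equation 3), the combined objective described above—minimizing LT while maximizing LB with relative weight ω—can be written as the following PyTorch-style function; the function name, the default ω value, and the assumption of single-label cross entropy for LT are assumptions, not the patented implementation.

```python
import torch.nn.functional as F

def anonymization_objective(action_logits, action_labels,
                            privacy_contrastive_loss, omega=0.5):
    """Combined objective minimized with respect to the anonymizer weights: L_T - omega * L_B."""
    l_t = F.cross_entropy(action_logits, action_labels)  # utility loss L_T (single-label cross entropy)
    return l_t - omega * privacy_contrastive_loss        # minus sign: the contrastive loss L_B is maximized
```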
Referring to
In the second phase, Step-2: ƒB, ƒT update 40, a new Input Video Batch (Xi+1) 43 is introduced. Here, the anonymization function ƒA 56 is frozen, and the focus shifts to updating the action recognition model 52 to minimize LT and the privacy model ƒB 58 to minimize LB, the NT-Xent contrastive loss 54. Keeping ƒB 58 up to date preserves its ability to capture privacy-sensitive features, against which the anonymization function is trained, while the update to the action recognition model 52 maintains its accuracy. This iterative approach balances privacy preservation with accurate action recognition. The anonymization function is a learnable transformation function, which transforms the video in such a way that the transformed output is useful for learning action classification with any target model ƒT′, but not useful for learning any privacy target model ƒB′. An encoder-decoder neural network is used as the anonymization function. ƒA is initialized as an identity function by training the function using the L1 reconstruction loss of Equation 4:
where x is the input image, x̂ is the sigmoid output of the ƒA logits, C denotes the input channels, H denotes the input height, and W denotes the input width.
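As a minimal sketch of the identity initialization described above—training the encoder-decoder ƒA with an L1 reconstruction loss on its sigmoid output—the following PyTorch-style code is illustrative only; the function name, data-loader interface, and (B, C, H, W) input layout are assumptions, while the 100 epochs and 1e-3 learning rate follow the experimental settings described later.

```python
import torch
import torch.nn as nn

def init_anonymizer(f_a: nn.Module, loader, epochs: int = 100, lr: float = 1e-3):
    """Train f_A toward the identity mapping with an L1 reconstruction loss."""
    opt = torch.optim.Adam(f_a.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, _ in loader:                  # frames: (B, C, H, W) floats in [0, 1]
            x_hat = torch.sigmoid(f_a(frames))    # sigmoid output of the f_A logits
            loss = (x_hat - frames).abs().mean()  # L1 loss averaged over B*C*H*W
            opt.zero_grad()
            loss.backward()
            opt.step()
    return f_a
```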
An embodiment of the schematic of the self-supervised removal branch is shown in
where h(u, v) = exp(uᵀv/(∥u∥∥v∥τ)) computes the similarity between vectors u and v with an adjustable temperature parameter τ, and where [j≠i]∈{0,1} is an indicator function that equals 1 if j≠i.
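A hedged sketch of an NT-Xent-style contrastive computation built from the similarity h(u, v) defined above is shown below; the function name, tensor shapes, and batching convention are assumptions, while the temperature τ = 0.1 follows the experimental setting described later.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) projections of two frames drawn from the same N videos."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D) unit vectors
    sim = torch.exp(z @ z.t() / tau)                            # all pairwise h(u, v)
    sim = sim * (1 - torch.eye(z.shape[0], device=z.device))    # indicator [j != i]: drop self-similarity
    pos = torch.exp(F.cosine_similarity(z1, z2, dim=1) / tau)   # positive pairs h(z_i, z_i')
    pos = torch.cat([pos, pos], dim=0)                          # same positive term for both views
    return -torch.log(pos / sim.sum(dim=1)).mean()
```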
A goal of the system is to optimize the self-supervised system framework with the objective as shown in
Consider the anonymization function ƒA parameterized by θA, and the auxiliary models ƒB and ƒT parameterized by θB and θT, respectively. Let αA, αB, and αT represent the learning rates for θA, θB, and θT, respectively; θA is initialized as given in Equation 6 until ƒA reaches the threshold reconstruction performance thA0 on the validation set:
θA ← θA − αA∇θA L1(θA)   (6)
Once θA is initialized, it is utilized for the initialization of θT and θB as shown in Equations 7 and 8 below, until their losses reach the threshold values thB0 and thT0:
θT ← θT − αT∇θT LT(θT)   (7)
θB ← θB − αB∇θB LB(θB)   (8)
After the initialization, a two-step iterative optimization process takes place. The first step, as depicted in the left side of
θA ← θA − αA∇θA(LT(θA) − ωLB(θA))   (9)
where ω∈(0,1) is the relative weight of the self-supervised contrastive loss (SSL), LB, with respect to the supervised action classification loss, LT, with the negative sign before LB indicating a maximization. During the second step, as shown in the right side of
Consider a model ƒB initialized with SSL pretraining. In an embodiment, ƒB is frozen while the contrastive loss is maximized, so that the input to ƒB is changed to decrease agreement between frames of the same video. Since frames of the same video are known to share substantial semantic information, minimizing agreement between frames of the same video destroys (or unlearns) most of the semantic information of the input video. Said another way, maximizing the contrastive loss destroys all highlighted attention-map regions of the video frames. Since this unlearned generic semantic information contains privacy attributes related to humans, scenes, and objects, the private information is removed from the input video; however, the system ensures that the semantic information relevant to action recognition remains in the output video through the action recognition branch of the system.
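The two-step alternation described above can be sketched as follows, reusing the nt_xent helper from the earlier sketch; the module names, the frame/clip handling (in practice the 2D anonymizer would be applied frame-wise to a clip), the omega value, and the optimizer split are assumptions rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def train_iteration(f_a, f_t, f_b, opt_a, opt_tb, batch_i, batch_i1, omega=0.5):
    # ---- Step 1: freeze f_T and f_B, update the anonymizer f_A ----
    set_requires_grad(f_a, True); set_requires_grad(f_t, False); set_requires_grad(f_b, False)
    clip, labels, (frame_a, frame_b) = batch_i
    l_t = F.cross_entropy(f_t(torch.sigmoid(f_a(clip))), labels)
    l_b = nt_xent(f_b(torch.sigmoid(f_a(frame_a))), f_b(torch.sigmoid(f_a(frame_b))))
    (l_t - omega * l_b).backward()        # minus sign: the contrastive loss is maximized
    opt_a.step(); opt_a.zero_grad()

    # ---- Step 2: freeze f_A, update f_T and f_B on the next batch ----
    set_requires_grad(f_a, False); set_requires_grad(f_t, True); set_requires_grad(f_b, True)
    clip, labels, (frame_a, frame_b) = batch_i1
    with torch.no_grad():                 # anonymized inputs; no gradient into f_A
        anon_clip = torch.sigmoid(f_a(clip))
        anon_a, anon_b = torch.sigmoid(f_a(frame_a)), torch.sigmoid(f_a(frame_b))
    loss = F.cross_entropy(f_t(anon_clip), labels) + nt_xent(f_b(anon_a), f_b(anon_b))
    loss.backward()
    opt_tb.step(); opt_tb.zero_grad()
```

Here opt_a is assumed to hold only the anonymizer's parameters and opt_tb the parameters of ƒT and ƒB, so each step only moves the weights that the corresponding phase is meant to update.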
Training of the supervised privacy preserving action recognition method requires a video dataset Xt with action labels YTt and privacy labels YBt, where t denotes the training set. Since the self-supervised privacy removal framework does not require privacy labels, the system does not utilize YBt. Once training is finished, the anonymization function is frozen as ƒA*, and the auxiliary models ƒT and ƒB are discarded. To evaluate the quality of the learned anonymization, ƒA* is utilized to train: 1) a new action classifier ƒT′ over the training set (ƒA*(Xt), YTt); and 2) a new privacy classifier ƒB′ over (ƒA*(Xt), YBt). The system does not utilize privacy labels for training ƒA in any protocol; privacy labels are used only for evaluation purposes to train the target model ƒB′. Once the target models ƒT′ and ƒB′ finish training on the anonymized version of the training set, they are evaluated on the test sets (ƒA*(Xe), YTe) and (ƒA*(Xe), YBe), respectively, where e denotes the evaluation/test set. Test set performance of the privacy classifier is denoted as AB1 (classwise-mAP) or AB2 (classwise-F1). The same-dataset training and evaluation protocol is shown in
In
Moving to
Finally, in
In practice, a suitably large video dataset with both action and privacy labels does not exist. In prior work, the supervised training process was remedied by a cross-dataset training and/or evaluation protocol utilizing two different datasets: 1) an action annotated dataset (Xactiont, YTt) to optimize ƒA and ƒT; and 2) a privacy annotated dataset (Xprivacyt, YBt) to optimize ƒA and ƒB. After learning ƒA from these different training sets, ƒA is frozen as ƒA*. A new action classifier ƒT′ is trained on an anonymized version of the action annotated dataset (ƒA*(Xactiont), YTt), and a new privacy classifier ƒB′ is trained on the anonymized version of the privacy annotated dataset (ƒA*(Xprivacyt), YBt). Once the target models ƒT′ and ƒB′ finish training on the anonymized versions of the training sets, they are evaluated on the test sets (ƒA*(Xactione), YTe) and (ƒA*(Xprivacye), YBe). The cross-dataset training and evaluation protocol is shown in
In the second phase, shown in
The final evaluation phase in
For the two protocols discussed above, the same training set Xt (Xactiont and Xprivacyt) is used for the target models ƒT′ and ƒB′ and for learning the anonymization function ƒA. However, a learned anonymization function ƒA* is expected to generalize to any action or privacy attributes. To evaluate generalization to novel privacy attributes, an anonymized version of a novel privacy training set ƒA*(Xprivacynt), such that YBnt∩YBt=ϕ (where nt represents novel training), is used to train the privacy target model ƒB′, and its performance is measured on the novel privacy test set ƒA*(Xprivacyne) (where ne represents novel evaluation); an analogous procedure with a novel action set and the action target model ƒT′ evaluates generalization to novel actions. The novel action and privacy attribute protocol training and evaluation is shown in
Experimental Methods
Different datasets were evaluated using the system and method described in detail above. These datasets include the UCF101 set; the HMDB51 set; the PA-HMDB set; the P-HVU subset of the LSHVU set (with the subset selected to include action, object, and scene labels for creating the train/test split); and the VISPR set (with subsets shown in Table 7).
For a default experiment setting, UNet is utilized as ƒA; R3D-18 is utilized as ƒT; and ResNet-50 is utilized as ƒB. For a fair evaluation, results of different methods are reported using the same training augmentations and model architectures. Two different sets of augmentations were applied depending on the loss function: 1) for supervised losses, standard augmentations (random crop, random scaling, horizontal flip, and random gray-scale conversion) with less strength were used; 2) for self-supervised loss, in addition to the standard augmentations with more strength, random color jitter, random cutout, and random color drop were used. To ensure temporal consistency in a clip, the same augmentation was applied on all frames of the clips. All video frames or images are resized to 112×112; input videos include 16 frames with a skip rate of 2.
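A hedged sketch of the clip sampling just described (16 frames at a skip rate of 2, resized to 112×112) is shown below; the augmentations, which must be applied identically to all frames of a clip for temporal consistency, are omitted, and the function name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_clip(video: torch.Tensor, num_frames: int = 16, skip_rate: int = 2, size: int = 112):
    """video: (T, C, H, W) float tensor; returns a (num_frames, C, size, size) clip."""
    span = num_frames * skip_rate
    start = torch.randint(0, max(video.shape[0] - span, 1), (1,)).item()
    idx = torch.arange(start, start + span, skip_rate).clamp(max=video.shape[0] - 1)
    clip = video[idx]                                              # 16 frames with a skip rate of 2
    return F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
```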
A base learning rate of 1e-3 was used with a learning rate scheduler, which drops the learning rate to 1/10th of its value on a loss plateau. For the self-supervised privacy removal branch, the 128-D output was used as the representation vector to compute the contrastive loss with temperature τ=0.1. For the RotNet experiment, 4 rotations were used: {0, 90, 180, 270}.
An Adam optimizer was used to optimize the parameters of the different neural networks. For initialization, ƒA was trained for 100 epochs using the L1 reconstruction loss, the action recognition auxiliary model ƒT was trained using cross-entropy loss for 150 epochs, and the privacy auxiliary model ƒB was trained using NT-Xent loss for 400 epochs. The training phase of the anonymization function ƒA is carried out for 100 epochs, whereas the target utility model ƒT′ and target privacy model ƒB′ are trained for 150 epochs.
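The optimizer and scheduler setup described above can be sketched as follows; the ReduceLROnPlateau patience value is an assumption, while the Adam optimizer, the 1e-3 base learning rate, and the drop to 1/10th of the rate on a loss plateau follow the text.

```python
import torch

def make_optimizer(model: torch.nn.Module, base_lr: float = 1e-3):
    """Adam with base LR 1e-3; the LR is divided by 10 when the monitored loss plateaus."""
    opt = torch.optim.Adam(model.parameters(), lr=base_lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", factor=0.1, patience=5)
    return opt, sched

# Per-epoch usage (epoch_loss is whichever loss is being monitored):
#   opt, sched = make_optimizer(f_a)
#   ...train one epoch, compute epoch_loss, then: sched.step(epoch_loss)
```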
The macro-average of classwise mean average precision (cMAP) is used to evaluate the performance of the target privacy model ƒB′. The results are also reported as the average F1 score across privacy classes, with the F1 score for each class computed at a confidence threshold of 0.5. For action recognition, top-1 accuracy is used, computed from the model's video-level prediction and the ground truth. A video-level prediction is the average prediction over 10 equidistant clips from a video.
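A hedged sketch of the video-level top-1 computation described above (averaging predictions over 10 equidistant clips) follows; the function name and the clip tensor layout are assumptions.

```python
import torch

@torch.no_grad()
def video_top1(model: torch.nn.Module, clips: torch.Tensor) -> int:
    """clips: (10, C, T, H, W), ten equidistant clips drawn from one video."""
    probs = torch.softmax(model(clips), dim=1)   # (10, num_classes) per-clip predictions
    return int(probs.mean(dim=0).argmax())       # average over clips, then take top-1
```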
For the downsampling baselines, down-sampled versions of the input resolution, by factors of 2× and 4×, are used in training and testing. The obfuscation baselines use an MS-COCO-pretrained YOLO object detector to detect the person category. The detected persons are removed using two different obfuscation strategies: 1) blackening the detected bounding boxes, and 2) applying Gaussian blur within the detected bounding boxes at two different strengths.
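A hedged sketch of the two obfuscation strategies follows; the person bounding boxes are assumed to come from a detector such as the MS-COCO-pretrained YOLO model mentioned above (not shown here), and the kernel size and blur strength are assumptions.

```python
import torch
import torchvision.transforms.functional as TF

def obfuscate(frame: torch.Tensor, person_boxes, mode: str = "blur", sigma: float = 10.0):
    """frame: (C, H, W) float tensor in [0, 1]; person_boxes: iterable of (x1, y1, x2, y2) ints."""
    out = frame.clone()
    for x1, y1, x2, y2 in person_boxes:
        if mode == "black":
            out[:, y1:y2, x1:x2] = 0.0                     # blacken the detected person box
        else:
            region = out[:, y1:y2, x1:x2]
            out[:, y1:y2, x1:x2] = TF.gaussian_blur(region, kernel_size=[21, 21], sigma=sigma)
    return out
```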
The two existing protocols from prior work were used to evaluate known action and privacy attributes: UCF101-VISPR cross-dataset training and evaluation, and HMDB51-VISPR cross-dataset training with PA-HMDB evaluation. For the UCF101-VISPR protocol, Xactiont=UCF101 trainset and Xactione=UCF101 testset, with Xprivacyt=VISPR trainset and Xprivacye=VISPR testset. For the HMDB51-VISPR cross-dataset training and PA-HMDB evaluation, Xactiont=HMDB51 trainset and Xactione=PA-HMDB, with Xprivacyt=VISPR trainset and Xprivacye=PA-HMDB.
In addition, a new protocol of the system and method described in detail above is evaluated using the P-HVU dataset with same-domain training and testing. In this protocol, the utility task is multi-label action recognition, and privacy is defined in terms of object and scene multi-label classification. Here, Xt=P-HVU trainset and Xe=P-HVU testset.
As shown in Table 1, the self-supervised framework from the system and method achieves a comparable action-privacy trade-off in the case of known action and privacy attributes. Other methods, such as Downsample-4×, Obf-blackening, and Obf-StrongBlur achieve commendable privacy removal but at a cost of action recognition performance.
To evaluate the case of novel actions and novel privacy attributes, the following sets are used: 1) for actions, Xactiont=UCF101 trainset, Xactionnt=HMDB51 trainset, and Xactionne=HMDB51 testset/PA-HMDB; and 2) for privacy, Xprivacyt=VISPR-1 trainset, Xprivacynt=VISPR-2 trainset, and Xprivacyne=VISPR-2 testset. From the left part of Table 2 and from the graph of
The following sets were used to evaluate detection of novel privacy attributes from scenes to objects: known action set Xactiont=P-HVU trainset, Xactione=P-HVU testset, Xprivacyt=P-HVU trainset Object, Xprivacynt=P-HVU trainset Scene, and Xprivacyne=P-HVU testset Scene.
From the right side of Table 2, it is observed that, when testing the learned anonymization from scenes to objects, the supervised method achieves a result similar to Obf-StrongBlur and removes only approximately 46% of the raw data's privacy, whereas the self-supervised framework of the system and method removes approximately 88% of the raw data's object privacy. This privacy removal gain depends on the amount of domain shift in the novel privacy attributes. In VISPR-1→2, the domain shift is very small, and the supervised method is therefore able to generalize, performing within approximately 5% of the self-supervised system and method; however, in P-HVU Scene→Obj, the domain shift is large, and the supervised method struggles to generalize, performing significantly worse than the self-supervised system and method (i.e., greater than approximately 40% worse).
The second row in Table 3 shows the results using only an encoder-decoder based model ƒA without any privacy removal branch ƒB; however, such style changes alone fail to anonymize privacy information. A pretrained, frozen SSL model was then used to anonymize the privacy information via Equation 9. This method of freezing ƒB is able to remove the privacy information only to a small extent (<2%); the biggest boost in privacy removal (7%) resulted from updating ƒB with every update to ƒA, as can be seen in the fourth row of Table 3. This shows the importance of updating ƒB in the second step (Equation 8) of the minimax optimization. Said another way, if ƒB is not updated along with ƒA, then ƒA can trivially maximize the loss of the static ƒB, leading to poor privacy removal. In addition, a spatio-temporal SSL framework was tested as the privacy removal branch; however, removing spatio-temporal semantics from the input video leads to severe degradation in action recognition performance.
In order to experiment with various temporal samplers (SF) for choosing a pair of frames from a video, the duration (distance) between the two frames is changed, as shown in Table 4. The chosen pair of frames from a video forms the positive term of the contrastive loss (Equation 5). In the default experiment setting, a pair of frames is randomly selected from a video, as shown in the first row of Table 4. It is observed that mining positive frames from greater distances decreases the anonymization capability, because dissimilar positives in the contrastive loss lead to poorly learned representations [11, 35].
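A hedged sketch of a temporal frame-pair sampler of the kind described above follows; the parameter names and the uniform-sampling default are assumptions.

```python
import torch

def sample_frame_pair(video: torch.Tensor, max_distance=None):
    """video: (T, C, H, W); returns two frames forming the positive pair."""
    t = video.shape[0]
    i = torch.randint(0, t, (1,)).item()
    if max_distance is None:                     # default: any two frames of the video
        j = torch.randint(0, t, (1,)).item()
    else:                                        # restrict the temporal separation
        lo, hi = max(0, i - max_distance), min(t - 1, i + max_distance)
        j = torch.randint(lo, hi + 1, (1,)).item()
    return video[i], video[j]
```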
As shown in Table 5, three different 2D SSL schemes were explored using Equation 5. It was observed that NT-Xent and MoCo achieve comparable performance; however, the RotNet framework provides suboptimal performance in both utility and privacy, likely because RotNet encourages learning global representations, so the privacy removal branch heavily removes global information from the input, leading to a drop in action recognition performance.
To understand the effect of the auxiliary model ƒB in the training process of ƒA, different privacy models ƒB were tested, and the performance of each learned ƒA* is reported in the same evaluation setting, as shown in Table 6 (with additional results shown in Table 13). It was observed that using a better architecture for ƒB results in better anonymization. There is no significant effect of using different architectures of ƒT in learning ƒA.
In addition, the self-supervised privacy preserving action recognition framework was tested by directly evaluating ƒA* using a privacy target model pretrained on raw data, rather than learning a target privacy model ƒB′ from the anonymized version of the training data. Results are shown in Table 8, in which a ResNet-50 model was used as a privacy target model which is pretrained on raw training data of the respective evaluation set.
Compared to other methods, the supervised and the self-supervised system and method achieved a substantial drop in privacy classification, which is desired to prevent privacy leakage; in addition, the self-supervised system and method achieved competitive cMAP performance compared with the supervised baseline and outperformed it in F1 score.
Moreover, using R3D-18 as the auxiliary action recognition model ƒT in training the anonymization function, the learned ƒA* was evaluated by training different action recognition (utility) target models, such as R3D-18, C3D, and R2plus1D-18, both from scratch and with Kinetics-400 pretraining. Results are shown in Table 9, in which the self-supervised system and method maintains action recognition performance on any utility action recognition model.
ResNet-50 was also used as an auxiliary privacy model ƒB to evaluate the learned anonymization ƒA* on target privacy classifiers ƒB′ like ResNet18/50/34/101/152 and MobileNet-V1 with and without ImageNet pretraining.
As shown in Table 10, the self-supervised system and method protects privacy leakage regardless of choice of target privacy model. Using ImageNet pretraining as shown in Table 11, privacy leakage increases in all methods; however, the relative drop to the raw data baseline is improved in the self-supervised system and method.
In a practical scenario, an action recognition target model ƒT′ is initialized from the Kinetics400 raw data pretrained checkpoint. Since an intruder has no direct access to the learned anonymization function in a practical setting, the raw data pretrained privacy classifier can be considered as a target privacy model ƒB. Results are shown in the graph shown in
At the cost of a small drop in action recognition performance, the self-supervised system and method obtains approximately 66% reduction in privacy leakage as compared to the raw data baseline.
Various test set videos of UCF101 were used to visualize the transformation due to the learned anonymization function ƒA*. The sigmoid function after ƒA* ensures a range of (0,1) for the output image. The output was visualized at different stages of anonymization training, as shown in
In
Overall, these figures collectively demonstrate the self-supervised privacy preservation framework's ability to learn and apply context-sensitive anonymization across different actions and scenarios. The system effectively balances the dual objectives of maintaining action recognition accuracy while significantly reducing privacy risks, showcasing its potential application in sensitive environments where both privacy and action understanding are paramount.
A self-supervised model focuses on holistic spatial semantics, whereas a supervised privacy classifier focuses on the specific semantics of the privacy attributes. To bolster this observation, attention maps of a ResNet-50 model were visualized, with the model trained in: 1) a supervised manner using binary cross entropy loss on VISPR-1; and 2) a self-supervised manner using NT-Xent loss. The method of Zagoruyko and Komodakis was used to generate model attention from the third convolutional block of the ResNet model.
In
A self-supervised privacy preserving action recognition framework that does not require privacy labels during training achieves competitive performance compared to supervised baselines for the known action-privacy attributes. In addition, the self-supervised framework for the system and method achieves better generalization to novel action-privacy attributes compared to the supervised baseline.
2D-CNN Backbone means a neural network architecture designed for processing two-dimensional data inputs, such as images or video frames, through convolutional layers that extract spatial features essential for tasks like action recognition and privacy-sensitive information identification within a privacy preservation action recognition system.
3D-CNN Backbone means a neural network architecture that extends the 2D convolutional network concept into three dimensions, allowing it to process temporal information in video sequences by analyzing spatial features across consecutive frames, thereby enhancing its ability to recognize complex actions.
Ablation means a systematic process of removing or modifying components of a machine learning model or its training procedure to evaluate the impact of those components on the model's overall performance, often used to understand the contribution of specific features or techniques within privacy preservation action recognition systems.
Action Recognition means the computational task of analyzing sequences of video frames to identify and categorize human actions or activities depicted, utilizing algorithms that can differentiate between various movements while minimizing the impact on individual privacy.
Adam Optimizer means an optimization algorithm used in training neural networks, combining the advantages of two other extensions of stochastic gradient descent: Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp), to update network weights iteratively based on training data.
Annotations are labels or data added to training samples to provide ground truth information for supervised learning tasks. In the context of action recognition and privacy preservation, annotations might include action labels for recognizing activities and privacy labels for identifying sensitive information, although the latter is not required in self-supervised settings.
Anonymization Branch means a specific part of a privacy preservation action recognition system responsible for applying transformations to video segments to conceal or remove identifiable information, ensuring the video's usability for action recognition while enhancing privacy.
Anonymization Function means a learnable algorithm within a privacy preservation action recognition system that modifies video data to obscure or eliminate spatial cues and identifiable markers without compromising the video's utility for recognizing actions.
Base Learning Rate means the initial rate of learning for a training process before any adjustments by a learning rate scheduler or other modifications. It sets the starting speed at which model weights are updated during training.
Binary Cross Entropy refers to a loss function used primarily for binary classification problems. It measures the difference between two probability distributions for the classification task, typically the predicted probabilities and the actual binary outcomes.
Branch in the context of neural networks, particularly in privacy preservation and action recognition systems, denotes a segment or pathway of the network designed to perform a specific subset of the overall task, such as action recognition or privacy information removal. Each branch operates on the input data or the features extracted by previous layers to contribute to the system's final output.
Bounding Boxes mean rectangular borders drawn around objects of interest in images or video frames, used to identify and locate these objects precisely within the space of the frame, often in tasks involving object detection, tracking, and recognition.
Budget Task is related to the secondary objective in a learning model aimed at minimizing resource usage, such as computational cost or data privacy. In privacy-preserving contexts, it often refers to reducing the amount of private information that can be inferred from the model's outputs.
Cloud Server typically refers to virtualized server resources accessible over the internet, offering scalable computing power and storage. In the context of action recognition, cloud servers may process and analyze video data, benefiting from their computational resources for intensive tasks.
Computer Processor means the hardware within a computing system that executes the operations and instructions of software applications, including those of privacy preservation action recognition systems, by performing calculations and controlling the system's components.
Contrastive Loss Function means a loss calculation used to train models by maximizing the similarity between pairs of data points that should be similar (positive pairs) and maximizing the dissimilarity between pairs that should be different (negative pairs), especially in self-supervised learning scenarios for privacy preservation.
Convolutional Neural Network (CNN) is a class of deep neural networks, most commonly applied to analyzing visual imagery. CNNs use a mathematical operation called convolution and have layers that act as filters, progressively extracting higher-level features from the input image or video frames.
Cross-Entropy Loss means a loss function commonly used in classification tasks, which measures the performance of a classification model whose output is a probability value between 0 and 1, providing a gauge of how well the model's predicted probabilities match the actual labels.
Cross-Dataset Training refers to a training protocol that employs separate datasets for optimizing different aspects of a privacy preservation action recognition system: one dataset annotated for action recognition and another annotated for privacy attributes. This method allows the system to learn generalized anonymization and recognition capabilities by exposing it to a wider variety of data scenarios, enhancing its adaptability to different privacy considerations and action types across datasets.
Datasets mean collections of video data curated for the purpose of training, validating, and testing machine learning models, particularly those involved in action recognition, where each video is typically labeled with one or more actions depicted in the sequence.
Data Augmentation involves artificially increasing the diversity of a training dataset by applying random transformations (e.g., rotation, scaling, cropping) to the original data. This technique helps improve the robustness and generalization ability of machine learning models.
Downsampling means the process of reducing the resolution of video frames or images by eliminating pixel data, which can help in reducing the detail level of privacy-sensitive information, though it may also affect the clarity of action recognition.
Encoding for Video refers to the process of converting raw video data into a structured representation using a neural network to obtain meaningful embeddings. This transformation retains the essential visual information required for tasks like action recognition and privacy preservation.
Epochs mean complete passes over the entire dataset used in training a machine learning model, where each epoch involves presenting all training samples to the model, allowing it to learn from the data iteratively and adjust its parameters to improve performance.
Frame means a single image or picture that constitutes one of the many sequential elements making up a video segment, serving as the basic unit for video analysis in tasks such as action recognition and privacy-sensitive content modification.
Frame Sampling refers to the process of selecting specific frames from a video sequence. This technique is used in video processing and analysis to reduce computational load or focus on significant moments within a video for tasks like action recognition.
Frozen in machine learning and deep learning, refers to the state of model parameters (or layers) that are not updated during a particular phase of training. Freezing parts of the model helps in fine-tuning specific sections while keeping others constant.
Gaussian Blur means a filtering technique applied to images or video frames to reduce detail and noise by averaging the pixels within a region, based on a Gaussian function, often used in privacy preservation to obscure faces or other identifiable features.
Learnable Transformation means a type of function, specifically utilized in the context of anonymization, that is capable of modifying input data (e.g., video segments) in a way that can be iteratively adjusted or optimized through the process of training. Such transformations aim to preserve essential information for a given task, like action recognition, while removing or obfuscating information that could compromise privacy.
Learning Rate Scheduler is a tool or strategy used in training machine learning models to adjust the learning rate during training. It modifies the learning rate based on predefined rules or metrics, often to improve training efficiency and model performance.
Minimax is an optimization strategy used in various contexts, including machine learning. It involves minimizing the possible losses for a worst-case scenario when an adversary is trying to maximize those losses. In privacy preservation, it could relate to adjusting models to minimize privacy risks while considering the maximization of such risks under adversarial conditions.
Multilayer-Projection Head is a component of neural networks, especially in self-supervised learning frameworks, that projects features extracted by the network into a space where learning objectives, such as contrastive loss, are applied. This component usually consists of several layers, including non-linearities, to transform the feature representation effectively.
Neural Networks mean computational models inspired by the human brain's structure, consisting of layers of interconnected nodes or “neurons” that process input data through weighted connections, learning to perform tasks by adjusting those weights based on experience.
Noisy Features Baseline means a reference model or technique within the context of privacy preservation action recognition systems that introduces randomness or “noise” into the data or feature representation, serving as a baseline to evaluate the impact of noise on the system's ability to recognize actions and preserve privacy.
Novel Action involves the system's evaluation on video segments depicting actions not present in the training dataset, testing the system's ability to generalize its action recognition capabilities to unseen scenarios. This protocol assesses the robustness of the learned anonymization and action recognition models against new, previously unobserved actions, ensuring the system maintains high performance in real-world applications where it may encounter a diverse array of human activities.
Obfuscation Baselines mean standard methods or procedures used to compare the effectiveness of various techniques in obscuring or hiding identifiable information within video segments, often involving the application of specific transformations like blurring or pixelation to enhance privacy.
Optimization of Model involves the process of iteratively adjusting the model parameters, including weights, to minimize or maximize a defined objective function. In the context of privacy preservation action recognition, the optimization process aims to find the best balance between action recognition performance and the level of privacy preservation.
Pretraining means the process of training a neural network model on a large dataset before fine-tuning it on a smaller, task-specific dataset, aiming to initialize the model with knowledge that enhances its performance on the specific task, such as action recognition in privacy preservation systems.
Privacy Classifier is a model or function used to identify and classify private or sensitive information within data. In the context of privacy-preserving action recognition, it might aim to detect and categorize private attributes like identity or location.
PyTorch means an open-source machine learning library based on the Torch library, widely used for applications in computer vision and natural language processing, offering dynamic computation graphs that facilitate building and training neural network models.
Raw Data means the original, unprocessed digital information captured by video recording devices, serving as the input for privacy preservation action recognition systems before any modifications for action recognition or privacy enhancement are applied.
Representation Space refers to the high-dimensional space where features extracted by neural networks reside. In this space, semantic similarities and differences among data samples can be quantified, aiding tasks like classification and recognition.
ResNet-50 Model means a specific configuration of the Residual Network architecture designed for image recognition tasks, comprising 50 layers, including convolutional layers, and employing shortcut connections to facilitate the training of deeper networks by addressing vanishing gradient issues.
Same-Dataset Training means a training protocol where a single dataset, annotated for both action recognition and privacy considerations, is used to train the components of a privacy preservation action recognition system. This approach ensures that the system learns to anonymize video data and recognize actions based on the same set of video segments, allowing for a unified optimization process where action recognition and privacy preservation goals are aligned within the context of the same data distribution.
Self-Supervised refers to a learning paradigm where the system utilizes the input data itself to generate labels or learning signals, bypassing the need for explicitly provided external ground-truth labels. In privacy preservation action recognition, this approach allows for the learning of effective feature representations and anonymization functions without reliance on privacy labels.
Semantic Information relates to the meaning or context derived from data, such as objects, actions, or scenes within images or videos. In privacy preservation, certain semantic information may be sensitive and targeted for removal or obfuscation to protect privacy.
Spatial Cues mean visual information within video frames that indicate the physical arrangement and characteristics of objects and individuals, which privacy preservation action recognition systems seek to modify or remove for enhancing privacy while retaining action-related information.
Spatio-Temporal refers to the combined aspects of space and time, especially concerning video or sequences of images. In action recognition, spatio-temporal features are useful for understanding motions and activities that unfold over time across different spatial regions of the frames.
Supervised Adversarial Framework means a training setup where a model is taught to generate outputs that are indistinguishable from real data, while another model simultaneously tries to distinguish between the model's output and real data, often used in contexts requiring the differentiation between genuine and modified content, such as privacy-enhanced videos.
Temporal Diversity signifies the variation in content or information across different time points or frames within a video. Capturing temporal diversity is relevant for understanding dynamic scenes and actions in video analysis tasks.
Temporal Frame Sampler means a component of privacy preservation action recognition systems that selects specific frames or pairs of frames from video segments based on temporal criteria, optimizing the input for processes like contrastive loss evaluation and ensuring diverse representation for effective learning.
Temporal Separation describes the distinction or interval between two points in time or frames within a video sequence. In video analysis, understanding or manipulating temporal separation can be important for recognizing actions that occur over varying durations.
Training Video means video segments specifically designated for the training phase of machine learning models, where these videos are used to teach the models to accurately recognize actions and effectively anonymize content for privacy protection, providing experiential data foundational to the models' learning processes.
Vector Pairs in the context of self-supervised learning for privacy preservation, typically refer to the feature vectors representing different data samples (e.g., frames from the same or different videos) used in contrastive learning. These pairs are compared to learn representations that capture essential information while discarding irrelevant or sensitive data.
Video means a sequence of images or frames displayed in succession at a certain rate, creating the illusion of motion, and serving as the primary medium for both the input and output of privacy preservation action recognition systems, where actions are analyzed and privacy-sensitive information is managed.
Weights represent the parameters within neural networks that are adjusted during training. They determine the significance of input features in producing the correct output. In privacy preservation action recognition, weights are optimized to balance the trade-off between maintaining action recognition accuracy and achieving effective privacy preservation.
All referenced publications are incorporated herein by reference in their entirety. Furthermore, where a definition or use of a term in a reference, which is incorporated by reference herein, is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
The advantages set forth above, and those made apparent from the foregoing description, are efficiently attained. Since certain changes may be made in the above construction without departing from the scope of the invention, it is intended that all matters contained in the foregoing description or shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
It is also to be understood that the following claims are intended to cover all of the generic and specific features of the invention herein described, and all statements of the scope of the invention that, as a matter of language, might be said to fall therebetween.
The present application is a continuation of U.S. Provisional Patent Application Ser. No. 63/455,451 entitled “Self-Supervised Privacy Preservation Action Recognition System,” filed Mar. 29, 2023.
Number | Name | Date | Kind |
---|---|---|---|
11475158 | Zhang | Oct 2022 | B1 |
20230252173 | Amonkar | Aug 2023 | A1 |
20230260652 | Azizi | Aug 2023 | A1 |
20230419170 | Chopde | Dec 2023 | A1 |
20240119289 | Gonzalez Sanchez | Apr 2024 | A1 |
Entry |
---|
Wu et al. “Privacy-Preserving Deep Action Recognition: An Adversarial Learning Framework and a New Dataset”, Mar. 21, 2021. pp. 1-18 (Year: 2021). |
Zhang et al., Privacy Preserving Automatic Fall Detection for Elderly Using RGBD Cameras. 2012. In: Miesenberger, K., Karshmer, A., Penaz, P., Zagler, W. (eds) Computers Helping People with Special Needs. ICCHP 2012. Lecture Notes in Computer Science, vol. 7382. |
Kumawat et al. Privacy-Preserving Action Recognition via Motion Difference Quantization. 2022. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision—ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol. 13673. |
Wu et al. Privacy-Preserving Deep Action Recognition: An Adversarial Learning Framework and A New Dataset Apr. 1, 2022. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, No. 4, pp. 2126-2139. |
Ryoo et al. Privacy-Preserving Human Activity Recognition from Extreme Low Resolution. 2017. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). |
Wu et al. Towards Privacy-Preserving Visual Recognition via Adversarial Training: A Pilot Study. 2018. Proceedings of the European Conference on Computer Vision (ECCV), pp. 606-624. |
Ren et al. Learning to Anonymize Faces for Privacy Preserving Action Detection. 2018. Proceedings of the European Conference on Computer Vision (ECCV), pp. 620-636. |
Number | Date | Country | |
---|---|---|---|
20240331389 A1 | Oct 2024 | US |
Number | Date | Country | |
---|---|---|---|
63455451 | Mar 2023 | US |