The present invention relates generally to computing systems and, specifically, has application in the detection of synthetic content in videos, e.g., deep fakes.
Synthetic video is the term used for any computer-generated video that has been manipulated to appear “real.” Most people have also adopted the term “deep fake” to refer to any content—often video or audio in nature—that has been manipulated to look like something it is not.
The proliferation of scams using deepfakes is a problem that can affect the entire world population, and video calls are a primary target.
Therefore, it is important to create applications that detect deepfakes, as in the very near future, it will be essential to know whether we are talking with a real person or whether the video that has been sent to us contains images of a real person. Deepfake algorithms can create fake images and videos that humans cannot distinguish from authentic ones.
The famous deep fake Obama video warning us about deep fakes going viral may be the most innocent example, but there are far more devastating real-world examples which impact society by introducing inauthentic content.
The threats associated with the advancement of Artificial Intelligence in the field of deep fakes require developing tools to help detect them. So far, only proofs of concept and academic papers are available, but end users require simple tools that can be executed on any type of device, even mobile phones with limited computation power.
Some existing solutions use Machine Learning. For example, “Deep Learning for Deepfakes Creation and Detection: A Survey” by Thanh Thi Nguyen et al. (Computer Vision and Image Understanding, Volume 223, October 2022, 103525) discloses a survey of algorithms used to create deepfakes and, more importantly, methods proposed to detect deepfakes in the literature to date. The survey shows that deepfakes can be created more easily than ever before with the support of deep learning, and more quickly thanks to the development of social media platforms. Deepfake detection is normally deemed a binary classification problem where classifiers are used to classify between authentic videos and tampered ones. This kind of method requires a large database of real and fake videos to train classification models. Fake videos are increasingly available, but their number is still limited in terms of setting a benchmark for validating various detection methods.
The biggest problem with existing deepfake detection solutions is that they are based on Machine Learning (ML) algorithms, which detect anomalies, artifacts, etc., in the images generated with deepfake techniques, but the complexity of these solutions and the large computational processing involved often render them inaccessible and impractical; some models even need to be executed with the help of multiple GPUs (graphics processing units). For this reason, if the user has a device with limited processing power, it is not possible for the user to use these ML algorithms. Moreover, numerous deepfake detection applications are limited to GitHub repositories, featuring intricate implementations that pose considerable hurdles for non-expert users.
In summary, the surge of deep fake-driven scams is an emerging global issue that poses a significant threat to the world's population, with video calls being a primary target, so discerning the authenticity of video content becomes paramount.
Therefore, there is a need to provide a detector of deepfakes that does not use Machine Learning techniques.
The problems found in prior art techniques are generally solved or circumvented, and technical advantages are generally achieved, by the disclosed embodiments which provide a lightweight deepfake detector configured to analyze a video in real time for detecting indicia of synthetic origin.
The present invention allows unmasking deepfakes, transcending the confines of conventional machine learning techniques and harnessing the power of intricate mathematical calculations involving 3D vectors. The present invention provides a deepfake detection method which leverages motion analysis and spatial elements, steering clear of traditional machine learning techniques, by scrutinizing critical factors such as head movement, facial symmetry, and blink rates to ascertain the veracity of video content, distinguishing between genuine and tampered footage with remarkable precision. The true allure of this invention lies in its speed and accuracy, eliminating the need for cumbersome machine-learning models or heavy computational resources. As a result, the present invention paves the way for a more efficient and seamless deepfake detection process, empowering users to swiftly unmask deceptively manipulated content in the digital age.
An aspect of the present invention refers to a computer-implemented method for detecting synthetic content in videos which is defined by the independent claim. The dependent claims define advantageous embodiments.
The present description should therefore be interpreted as extending the disclosures of the references cited in the background of the invention, and therefore the scope of this disclosure is not limited to detection of deepfake video in the particular manner described, and is rather extended to the application of components of the present disclosure to augment the existing technologies.
The method in accordance with the above-described aspects of the invention has a number of advantages with respect to the aforementioned prior art, which can be summarized as follows:
To complete the description that is being made and with the object of assisting in a better understanding of the characteristics of the invention, in accordance with a preferred example of practical embodiment thereof, accompanying said description as an integral part thereof, is a set of drawings wherein, by way of illustration and not restrictively, the following has been represented:
The present invention may be embodied in other specific systems and/or methods. The described embodiments are to be considered in all respects as only illustrative and not restrictive. In particular, the scope of the invention is indicated by the appended claims rather than by the description and figures herein. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
The embodiments of the present invention propose deep fake detection methods based on mathematical calculations of 3D (three-dimensional) vectors by checking their translation, rotation, etc., trajectories. These trajectories make it possible to determine whether the head motion is natural, whether the head remains still too long or keeps looking at a stationary point, to determine the symmetry of the face, to count blinks, etc.
The embodiments of the present invention can be implemented in a software application that can analyze any software application running on a personal computer (PC), mobile phone or any other smart device, which is capable of playing a video or handling a live call from any videoconferencing platform. Videoconference calls are the ideal place for an attempt to fool someone using deepfakes.
The proposed fake detection method can obtain the image to be analyzed from a webcam or other camera or video capturing means. In a first analysis performed by at least one processor, one or more body parts of a subject (e.g., head, the whole face, eyes, mouth or lips, hands, torso, etc.) are detected. In addition, the method can obtain information captured from, for example, a video executed with a player. The obtained images and information are optimized so that the method requires less computational capacity than approaches that use Machine Learning techniques. The method comprises calculating 3D vectors, their position, and movement to detect different spatial positions of the head, eyes, hands, etc.
In a possible implementation, the method can be programmed in Python in a Windows environment where a user interface is created to interact with the users, which provides simplicity when performing the detection operations. The detection and possible confirmation of whether the interlocutor is a real human are based on several criteria as detailed further below. As already mentioned, a main goal is to implement a deepfake detector application as light as possible so that it runs on any mid-range user PC.
The method of this invention comprises two main types of verifications, a first one related to the head pose, and a second one related to the eye blinking. In turn, the head pose verification can be implemented by i) analyzing static motion factors, ii) analyzing facial symmetry, iii) detecting head abrupt turns, iv) detecting other parts of the body and their motion, v) checking with the focal point, vi) emotion detection, and/or vii) lip movement analysis.
The first two implementations of head pose verification, i) static motion factors and ii) facial symmetry, use 3D vector calculations to perform different computations. The verification process described here is called “Head Pose verification” but goes beyond its definition, as it is not only centered on the head but can be expanded to all parts of the human body.
The goal of the head pose verification is to offer a new approach to detect anomalies in real time, e.g., during a video call, to check that the person in front of the user(s) is 100% human. Furthermore, this new approach mainly uses 3D positioning of various points of the face, eyes, mouth, and even hands and torso. All this information is collected in real time from the webcam or directly from a previously recorded video. Once these positions have been estimated in a 3D space, they are stored in matrices for further calculation or processing. In addition, the method of head pose verification computes a focal point projection that starts from the nose to detect the exact point on the screen where the person under analysis is looking. All this information is accessible in real time, so the second part is to designate those movement patterns that are not classified as “human.” It is also important to note that the whole system is adjusted to the frames per second of the image capture, as it is better to calibrate the anomaly detection process by having the frame rate under control. The calibration by frames, in addition to the time allotted for checking, determines the final result. A real-time check is performed to determine whether the 3D points of the face and the projection fail to meet certain requirements, in which case they are labeled as “anomalous.” Finally, depending on the duration of the test, a weighted result is offered according to the detected anomalies.
On the other hand, the blink verification uses the same basic approach as above, i.e., obtain the 2D points of the eye, label them, and check when these points come close to each other, this closeness implying a blink. Each blink is counted within the assigned framerate and checking time. The final count is then checked against the tolerances typical of human beings, with the tolerance values previously defined by using several thresholds.
In order to calculate the movements of the different captured 3D points, geometry and linear algebra techniques are used; in particular, the camera projection matrix, which relates the 3D points in space to their 2D projections in the image, is used. Using this matrix, the relative position and orientation of the camera in each frame of the video are calculated and then the motions of the different points are inferred.
The translation is simply the movement of an object along a straight line in space in the x, y, and z directions. The translation of a point (x, y, z) in terms of vectors can be represented as:

(x′, y′, z′) = (x, y, z) + [tx, ty, tz] = (x + tx, y + ty, z + tz)
where [tx, ty, tz] are the translational distances in the x, y, and z directions. Rotation is the motion of an object around an axis in space. The rotation of a point can be represented in terms of rotation matrices. A rotation matrix is a 3×3 matrix that defines how points are rotated around each x, y, and z-axis. The rotation matrix depends on the rotation angle and the axis around which it is rotated. The rotation of a point (x, y, z) around the x-axis by an angle theta can be represented in terms of matrices such as:

| x′ |   | 1      0            0       |   | x |
| y′ | = | 0  cos(theta)  −sin(theta)  | · | y |
| z′ |   | 0  sin(theta)   cos(theta)  |   | z |
The same can be computed for the y and z axes and the rotation matrices can be combined to make rotations around multiple axes.
To merge translational and rotational motions, a homogeneous transformation matrix can be used: a 4×4 matrix that includes translation and rotation. The homogeneous transformation matrix can be represented as follows:

T = | R  t |
    | 0  1 |
where R is the 3×3 rotation matrix, and t is the translation vector. The transformation matrix is applied to a point in homogeneous coordinates [x, y, z, 1] to obtain the transformed point in homogeneous coordinates [x′, y′, z′, 1]. The first three components of this homogeneous vector can then be divided by the fourth component to obtain the transformed coordinates of the point.
In summary, the motions of the different captured 3D points can be calculated using a combination of translation and rotation, and these motions can be represented using homogeneous transformation matrices. These are the base calculations to detect anomalies in human motion.
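By way of illustration, the following sketch shows these base calculations in Python with NumPy (the function names and the example values are illustrative assumptions, not a fixed implementation):

    # Illustrative sketch: composing rotation and translation with a 4x4
    # homogeneous transformation matrix, as described above.
    import numpy as np

    def rotation_x(theta):
        # Rotation matrix around the x-axis by angle theta (radians).
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[1, 0, 0],
                         [0, c, -s],
                         [0, s, c]])

    def homogeneous_transform(R, t):
        # Build a 4x4 homogeneous matrix from a 3x3 rotation R and a
        # translation vector t = [tx, ty, tz].
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = t
        return T

    def transform_point(T, p):
        # Apply T to a 3D point p in homogeneous coordinates [x, y, z, 1]
        # and divide by the fourth component to recover (x', y', z').
        ph = T @ np.array([p[0], p[1], p[2], 1.0])
        return ph[:3] / ph[3]

    # Example: rotate 10 degrees around x and translate by (1, 2, 3).
    T = homogeneous_transform(rotation_x(np.radians(10)), [1.0, 2.0, 3.0])
    print(transform_point(T, (10.0, 20.0, 30.0)))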
HEAD POSE VERIFICATION

The basic scheme of operation of head pose verification (head position) is shown in
The picture of
It is also possible to create patterns of different poses and postures to be able to detect them later, as well as templates with complete movement sequences whose correspondence to real ones can then be checked. This way, a collection of different movement patterns can be generated to be used in checking whether the subject behind the camera is a natural person. For example, a recording sequence can be obtained as reference information by a particular application that takes screenshots of the different movements for a time period and with different shots (to capture possible variations). This capture can be labeled as, for example, “chin scratching movement”. Once the sequence is taken with several shots, it is stored in matrices, which can be used to check against the points captured in real time. The captured points are compared with the points stored in these matrices and the method checks if there is a match. If the captured points and the stored points match, the method can interpret that the captured points correspond to human movements; a mismatch gives a clue that the capture may be a deepfake.
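By way of illustration, a minimal sketch of such a template check is given below (the matrix shapes, the function name and the tolerance value are assumptions; a more robust implementation could tolerate speed differences between sequences, e.g., with dynamic time warping):

    # Illustrative sketch: comparing a captured landmark sequence with a
    # stored movement template (e.g., labeled "chin scratching movement").
    # Both are matrices of shape (frames, points, 3).
    import numpy as np

    def matches_template(captured, template, tolerance=0.05):
        # Mean Euclidean distance between corresponding 3D points, frame by
        # frame; a small mean distance is interpreted as a match.
        if captured.shape != template.shape:
            return False
        distances = np.linalg.norm(captured - template, axis=-1)
        return float(distances.mean()) < tolerance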
In a practical example of the method, and specifically according to an exemplary implementation of the first criterion explained above, that is, detecting that the head is too static, the focal projection is used to detect that the head is too still or without movement. The focal projection can be obtained in the following way:
For instance, in order to project a point P on the Z-axis, given the position of point P by coordinates (x, y, z) in three-dimensional space, the x and y coordinates can be ignored and only the z coordinate is used. That is, the projection of P on the Z-axis (Pz) is calculated as:

Pz = z
This means that the position of the nose on the Z-axis can be obtained by simply extracting the z-coordinate of the nose point in three-dimensional space.
For example, if the position of the nose in three-dimensional space is (x,y,z)=(10, 20, 30), then the projection of the nose on the Z-axis is the z-coordinate, i.e., Pz=30. In this way, the projection of the nose on a specific reference point can be easily obtained to detect unusual movements on that axis.
The example focuses only on the Z-axis but can also be applied to the X-axis or the Y-axis in a simple way using the same logic.
Once the nose projection is obtained, the following algorithm can be used to check whether the head is practically stationary or its movements are very static/still:
A possible implementation in pseudocode is the following:
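By way of illustration, a possible reading of such pseudocode is the following sketch (the stillness threshold and the checking time are configuration values):

    capture the nose projection Pz in every frame
    for each pair of consecutive frames:
        compute the difference |Pz(i+1) - Pz(i)|
    count how many differences are below the stillness threshold
    if most differences stay below the threshold for the whole checking time:
        flag the head as "too static" (possible anomaly)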
Or in more technical pseudocode:
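A possible Python rendering of this check is the following sketch (the function name, the default threshold and the stillness ratio are assumptions to be calibrated):

    def is_head_too_static(nose_positions, threshold=5.0, still_ratio=0.9):
        # nose_positions: list of per-frame nose projections, either scalars
        # (e.g., Pz values) or (x, y, z) tuples.
        diffs = []
        for previous, current in zip(nose_positions, nose_positions[1:]):
            if isinstance(current, (int, float)):
                diffs.append(abs(current - previous))
            else:
                diffs.append(sum((c - p) ** 2
                                 for c, p in zip(current, previous)) ** 0.5)
        if not diffs:
            return False
        still = sum(1 for d in diffs if d < threshold)
        # Flag an anomaly when most consecutive differences are below threshold.
        return still / len(diffs) >= still_ratio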
Another code implementation corresponding to the detection of a head that is too static, with hardly any movement, is shown below.
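A vectorized NumPy variant can serve as a sketch of such an implementation (again with assumed parameter values):

    # Equivalent logic, vectorized with NumPy.
    import numpy as np

    def is_head_too_static_np(positions, threshold=5.0, still_ratio=0.9):
        # positions: array-like of per-frame positions, scalar or 3D.
        positions = np.asarray(positions, dtype=float).reshape(len(positions), -1)
        magnitudes = np.linalg.norm(np.diff(positions, axis=0), axis=1)
        return magnitudes.size > 0 and np.mean(magnitudes < threshold) >= still_ratio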
Next, more details about the implementation of the detection of the above-mentioned anomalies for head pose verification are disclosed below.
i) Static Motion Factors

A set of position vectors (x, y, z) of a person's body part is obtained at different points in time, in real time (e.g., using MediaPipe solutions, which can work with single images or a continuous stream of images and output body pose landmarks in image coordinates and in 3-dimensional world coordinates). To detect if the movements are too static, the differences between consecutive position vectors are computed and the magnitude of these difference vectors is measured. If the magnitude of most of these difference vectors is below a certain threshold, an output of the anomaly detection (1010) concludes that the person is not moving much and their movement is too static. Thus, the anomaly detection (1010) considering static motion factors comprises the following steps:
There are many other approaches that can be implemented to detect this type of anomaly. Preferable values for the threshold range from 5 to 10 pixels. However, different embodiments are possible, with different threshold values or with the threshold set in units other than pixels, for instance areas comprising several pixels.
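By way of illustration, the following sketch shows how the real-time position vectors can be obtained with the (legacy) MediaPipe FaceMesh solution mentioned above (the use of landmark index 1 as the nose tip follows the FaceMesh topology; the capture loop and error handling are simplified):

    import cv2
    import mediapipe as mp

    mp_face_mesh = mp.solutions.face_mesh
    cap = cv2.VideoCapture(0)  # webcam capture
    nose_track = []
    with mp_face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True) as face_mesh:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB images; OpenCV delivers BGR.
            results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_face_landmarks:
                nose = results.multi_face_landmarks[0].landmark[1]
                # Normalized (x, y) image coordinates plus relative depth z.
                nose_track.append((nose.x, nose.y, nose.z))
    cap.release()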
ii) Facial Symmetry

The detector can also check that the structure and composition of all the points that make up the face are not altered due to a possible manipulation with some deepfake technique. The distances between symmetric points on the face, as shown in
Di = D(Si, S′i), for i = 1, 2, ..., M, denoting the initial distances between the symmetric points.
Preferable values for the threshold range from 2 to 10 pixels. However, different embodiments are possible, with different threshold values or with the threshold set in units other than pixels, for instance areas comprising several pixels.
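A minimal sketch of this symmetry check is given below (the landmark pairs, the function name and the default threshold are illustrative assumptions):

    # Illustrative sketch: distances between pairs of symmetric facial
    # landmarks are compared against the distances Di measured at the start.
    import numpy as np

    def symmetry_anomaly(points, pairs, initial_distances, threshold=2.0):
        # points: current landmark positions, indexable by landmark id.
        # pairs: list of (i, j) index pairs for symmetric points Si, S'i.
        # initial_distances: the Di values recorded at the beginning.
        for (i, j), d0 in zip(pairs, initial_distances):
            d = np.linalg.norm(np.asarray(points[i]) - np.asarray(points[j]))
            if abs(d - d0) > threshold:
                return True  # symmetry altered: possible manipulation
        return False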
iii) Head Abrupt Turns
To detect sudden movements of the head, the process can rely on the data described in the static movement detection point and add a factor that calculates the translation velocity (and acceleration) of some vectors, compares them and checks that they are within limits. The velocity and acceleration of these movements are calculated using the real-time position vectors (x, y, z) and compared with established thresholds to determine whether the movements are too fast or too strongly accelerated.
P(i+1) − P(i) represents the difference vector calculated above.
Preferable values for these thresholds are:
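By way of illustration, a sketch of this check follows (the values v_max and a_max below are assumed placeholders to be calibrated to the frame rate in use, not the preferable threshold values themselves):

    # Illustrative sketch: per-frame velocity and acceleration magnitudes
    # compared against thresholds to flag abrupt head turns.
    import numpy as np

    def abrupt_turn(positions, fps, v_max=50.0, a_max=200.0):
        # positions: array of shape (frames, 3) with (x, y, z) per frame.
        dt = 1.0 / fps
        velocity = np.diff(np.asarray(positions, dtype=float), axis=0) / dt
        acceleration = np.diff(velocity, axis=0) / dt
        v_mag = np.linalg.norm(velocity, axis=1)
        a_mag = np.linalg.norm(acceleration, axis=1)
        # Any frame exceeding either limit is reported as an anomaly.
        return bool((v_mag > v_max).any() or (a_mag > a_max).any())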
iv) Detecting Other Parts of the Body and Their Motion

This type of detection simply modifies the source points to detect different states of motion, focusing, for example, on the hands, torso, etc. To do this, the key points at these parts of the body are detected and the techniques already discussed above are applied.
An interesting approach is to see how the person uses or moves their hands, shown in
v) Checking with the Focal Point
The focal point where the person is looking is calculated to be used as a base and variant for the abovementioned detection techniques for head pose verification. If the detection is focused only on the focal point, the consistency between the movements of the eyes and the direction of the focal point can be monitored. In a normal situation, the focal point also changes coherently when the person's eyes (or nose) move. If there are inconsistencies between the movement and the focal point, it could be a sign of a deepfake. Thus, the anomaly detection (1010) considering the focal point comprises the following steps:
Get each eye's real-time position vectors (x, y, z), denoting the left eye position as PL and the right eye position as PR.
All this calculation can be done in the same way but with the focal point centered on the nose.
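A minimal sketch of this consistency check is shown below (the use of cosine similarity between motion vectors, the function name and the similarity threshold are assumptions; the same logic applies when the focal point is centered on the nose):

    # Illustrative sketch: the midpoint of the two eyes is tracked and its
    # frame-to-frame motion is compared with the motion of the projected
    # focal point; incoherent motion is a possible deepfake sign.
    import numpy as np

    def focal_consistency(eye_left, eye_right, focal_points, min_similarity=0.3):
        # eye_left, eye_right, focal_points: arrays of shape (frames, 3).
        midpoint = (np.asarray(eye_left) + np.asarray(eye_right)) / 2.0
        eye_motion = np.diff(midpoint, axis=0)
        focal_motion = np.diff(np.asarray(focal_points), axis=0)
        # Cosine similarity between eye motion and focal motion per frame.
        num = (eye_motion * focal_motion).sum(axis=1)
        den = (np.linalg.norm(eye_motion, axis=1)
               * np.linalg.norm(focal_motion, axis=1)) + 1e-9
        cos_sim = num / den
        # Low average similarity: eyes and focal point move incoherently.
        return float(cos_sim.mean()) < min_similarity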
vi) Emotion Detection

To implement this analysis, it is mandatory to create a database of vector patterns that identify the person's mood. Generic patterns or templates for laughter, sadness, etc. are calculated. To do so, these different states must be stored in matrices to check them later during detection. This information can be compared with real-time facial expressions to detect inconsistencies that may indicate a deepfake. The anomaly detection (1010) considering the person's mood or emotion comprises the following steps:
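By way of illustration, a minimal sketch of comparing real-time facial vectors against the stored mood templates is given below (the template storage format, the function name and the distance threshold are assumptions):

    # Illustrative sketch: nearest-template matching of the current facial
    # expression against stored mood templates (laughter, sadness, etc.).
    import numpy as np

    def closest_emotion(face_vector, templates, max_distance=0.5):
        # templates: dict mapping labels ("laughter", "sadness", ...) to
        # previously stored landmark vectors.
        best_label, best_dist = None, float("inf")
        for label, template in templates.items():
            dist = np.linalg.norm(np.asarray(face_vector) - np.asarray(template))
            if dist < best_dist:
                best_label, best_dist = label, dist
        # No template close enough: expression inconsistent with known moods,
        # which may indicate a deepfake.
        return best_label if best_dist <= max_distance else None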
vii) Lip Movement Analysis

To detect deepfakes using lip data in (x, y, z) vectors, a similar approach as before can be followed, focusing on the consistency and coherence of lip movements and their relationship to speech or expressions. Thus, the anomaly detection (1010) taking into account the lips and their movements comprises the following steps:
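A minimal sketch of such a lip coherence check is shown below (the measure of mouth opening, the function name and the jump threshold are assumptions):

    # Illustrative sketch: the mouth opening (distance between an upper and
    # a lower lip point) should vary smoothly over time; large jumps between
    # consecutive frames are flagged as incoherent lip movement.
    import numpy as np

    def lip_anomaly(upper_lip, lower_lip, jump_threshold=0.1):
        # upper_lip, lower_lip: arrays of shape (frames, 3) with (x, y, z).
        opening = np.linalg.norm(np.asarray(upper_lip) - np.asarray(lower_lip),
                                 axis=1)
        jumps = np.abs(np.diff(opening))
        return bool((jumps > jump_threshold).any())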
BLINK VERIFICATION

The basic operating scheme of Blink Verification is shown in
For Blink Verification, it is also essential to consider the speed, as too high or too low a speed also reflects anomalies. Once appropriate thresholds are assigned, blinking is detected by using the Eye Aspect Ratio (EAR) model. To use the EAR model to detect blinks in videos, the following steps have been followed:
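While the detailed steps depend on the chosen thresholds, the EAR itself is commonly defined over six landmarks p1 to p6 of an eye (p1 and p4 at the corners, p2 and p3 on the upper lid, p6 and p5 on the lower lid) as:

EAR = (‖p2 − p6‖ + ‖p3 − p5‖) / (2 · ‖p1 − p4‖)

The EAR remains roughly constant while the eye is open and drops sharply towards zero when the eye closes, so a blink can be detected as a short dip of the EAR below a threshold.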
Some of the EAR thresholds tested in an implementation example are as follows:
On the other hand, some thresholds based on the number of blinks that have been tested in an implementation example are as follows:
Another example of code implementation to detect the blinks and calculate the blinks per minute from the EAR is disclosed as follows:
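A minimal sketch of such an implementation is given below (the landmark ordering p1..p6, the EAR threshold and the minimum number of consecutive closed frames are assumptions to be calibrated against the capture frame rate):

    # Illustrative sketch: blink counting with the EAR model.
    import numpy as np

    def eye_aspect_ratio(eye):
        # eye: array of six (x, y) landmarks ordered p1..p6.
        p1, p2, p3, p4, p5, p6 = np.asarray(eye, dtype=float)
        vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
        horizontal = 2.0 * np.linalg.norm(p1 - p4)
        return vertical / horizontal

    def count_blinks(eye_frames, fps, ear_threshold=0.21, min_frames=2):
        # eye_frames: per-frame eye landmark sets; fps: capture frame rate.
        blinks, below = 0, 0
        for eye in eye_frames:
            if eye_aspect_ratio(eye) < ear_threshold:
                below += 1
            else:
                if below >= min_frames:  # eye re-opened after a valid closure
                    blinks += 1
                below = 0
        minutes = len(eye_frames) / fps / 60.0
        return blinks, (blinks / minutes if minutes > 0 else 0.0)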
This blink detection method works more accurately and faster than other methods based on Machine Learning models.
The method provides, in real-time, at least a result related to the detection of a deepfake or synthetic content in the video images.
All or a subset of the verification data that have been described as the output of the detection method can be used to produce an evaluation of whether the person appearing in the video is a real human or, on the contrary, is a deep fake, that is, a synthetic video generated by an artificial intelligence. This evaluation can be a confidence score, for example “real human with 97% probability”, a binary evaluation human/deep fake, or a warning of the suspicion that it is a deep fake.
A possible implementation of the method conducts a subset of the described possible verifications and provides a binary evaluation of deep fake when the number of verifications yielding a result suspicious of not being human is higher than a threshold.
This threshold can be as low as a single criterion for increased security, meaning that a single verification providing doubtful results can trigger an evaluation of the video as a deep fake. It should be noted that this binary output is subject to false positives and false negatives, that is, real humans classified as deep fakes and the opposite.
Instead of a binary evaluation, the outcome of another possible embodiment can be a warning of the suspicion that it is a deep fake. The information given to the user is therefore less categoric, and thus can be clearer in terms of setting the expectations of users of the detection method. As in the case of the binary evaluation, a subset of the possible verifications described before is conducted. In this case, a warning of the possibility of a deep fake is provided to the user, preferably when at least one of the verifications provides a result suspicious of not being human.
In another possible embodiment, there may be some criteria that can more certainly determine that the video is a deep fake, for instance the head abrupt turns, whilst others may be less categoric. An example of the latter is the blink frequency criterion, which, due to the different behavior of each human being, may not provide a categoric classification of deep fakes. According to this embodiment, one of the former, more categoric criteria is sufficient to classify a video as a deep fake or to provide a warning about the video being suspected of being a deep fake (not human), but suspicious results in more than one of the latter criteria may be needed to provide this classification of deep fake or the warning.
Another possible embodiment provides a confidence score of the video being a real human, for instance as a percentage (0-100%). This score is the result of a weighted average of the confidence scores provided by each individual criterion, where each of the criteria has a weight proportional to its relevance:

Weighted Confidence Score = Σi (wi × si) / Σi wi

where:

wi is the weight assigned to criterion i, and

si is the confidence score (0-100%) provided by criterion i.
A weighted confidence score of 0 means total certainty of the video being a deep fake, and a growing weighted confidence score means increasing certainty of the video being a real human, up to a weighted confidence score of 100%, which means total certainty of the video being a real human.
Those criteria being more decisive have a higher weight than those less relevant. For instance, the blink frequency criterion may have a lower weight than the head abrupt turns criterion.
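By way of illustration, a minimal sketch of this weighted average is given below (the criterion names, weights and scores are assumed example values, not preferred ones):

    # Illustrative sketch: weighted confidence score over the per-criterion
    # scores s_i (0-100) and weights w_i proportional to their relevance.
    def weighted_confidence(scores, weights):
        total_weight = sum(weights.values())
        return sum(scores[name] * weights[name] for name in scores) / total_weight

    weights = {"head_abrupt_turns": 3.0, "facial_symmetry": 2.0,
               "blink_frequency": 1.0}
    scores = {"head_abrupt_turns": 95.0, "facial_symmetry": 90.0,
              "blink_frequency": 60.0}
    print(f"real human with {weighted_confidence(scores, weights):.0f}% probability")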
Several factors can affect the final results of the deepfake detection or final detection outcome.
Firstly, the framerate plays a significant role, as not all video conference connections offer the same speed. This factor is essential for determining appropriate guidelines or timing to achieve optimal results. Consequently, the detection method relies on two key elements: the framerate and the checking time. The better the framerate and the more testing time allocated, the more accurate the final analysis.
Secondly, time is another critical aspect that defines the duration of the deepfake check. As mentioned, allocating sufficient time is vital for obtaining an excellent final result. Additionally, the webcam's resolution, quality, and the interpretation of the natural or artificial lighting surrounding the individual are crucial factors. A low-quality webcam, or one poorly aligned with the person in front of it, can significantly impact the final detection outcome. Lighting also plays a pivotal role and directly affects the final result; these factors can cause distortions in the final estimates obtained by the abovementioned methods.
In conclusion, achieving the best possible deepfake detection results necessitates considering various factors, such as framerate, checking time, webcam quality, and lighting conditions. By carefully considering these elements, we can enhance the accuracy and reliability of our deepfake detection approach and better safeguard users against the threat of manipulated content.
There are several possible improvements for the above-described embodiments of deepfake detection based on head movements and vector-detected blinking to increase their accuracy:
Note that in this text, the term “comprises” and its derivations (such as “comprising”, etc.) should not be understood in an excluding sense, that is, these terms should not be interpreted as excluding the possibility that what is described and defined may include further elements, steps, etc.
Number | Date | Country | Kind |
---|---|---|---|
23382153.7 | Feb 2023 | EP | regional |
23179069.2 | Jun 2023 | EP | regional |