This application claims priority from European patent application no. 23306848.5, filed Oct. 23, 2023, the contents of which are incorporated herein by reference.
This disclosure relates to the quantification of distortions in images, and to automated facial authentication or identification systems in which the facial identification algorithms are trained using data sets generated to include the quantified distortion estimate data, in particular to address potential performance issues of the algorithms when dealing with input images which are distorted due to mechanisms including perspective distortion.
The use of biometrics in the authentication or identification of individuals has been gaining traction in recent years, in particular given the advances in facial recognition and image processing techniques. An application for which such use can be readily adopted is the identification or registration of passengers, particularly in airports, where there are already self-serve kiosks at which passengers can complete other functions such as checking into flights, printing boarding passes, or printing baggage tags. With the advance of computing and camera technologies, facial biometric verification may also increasingly be used in other scenarios, such as building access control.
In a facial biometric identification system, an image of a person's face is captured, analysed and compared with a database of registered faces to determine whether there is a match. Based on the result of this determination, the system ascertains the identity of the person. However, the acquisition process whereby the facial image is captured may introduce distortions into the captured image, resulting in the captured facial image being a distorted presentation of the person's face. Some existing algorithms which make the biometric matching determination are designed to deal with intrinsic distortions such as lens distortions. However, the algorithms do not handle perspective distortions well, and their performance degrades when they operate on input images that include perspective distortions but need to match these images with undistorted reference images such as passport photos. It is possible to improve the performance of biometric matching algorithms which utilise machine-learning techniques in handling images with perspective distortions, but this improvement requires that useful training data be available. However, currently available data providing information regarding the amounts of distortion in images and the distances at which the images are acquired is limited. An example is the Caltech Multi-Distance Portrait (CMDP) dataset, which includes frontal portraits of 53 individuals, where multiple portraits are taken of each individual with the camera positioned at different distances from the person. The limited availability of data of this type impacts the ability to train and improve machine-learning based facial recognition or biometric matching algorithms.
It is to be understood that, if any prior art is referred to herein, such reference does not constitute an admission that the prior art forms a part of the common general knowledge in the art in any country.
In a first aspect, there is disclosed herein a method of estimating an amount of distortion in an input image, comprising: identifying landmarks in an input object shown in the input image, the landmarks including an anchor landmark and a plurality of non-anchor landmarks; determining distances between the non-anchor landmarks and the anchor landmark in respect of the input object; determining distances between each of a plurality of non-anchor landmarks and an anchor landmark in respect of a reference object, the non-anchor landmarks and the anchor landmark in respect of the reference object being the same as the non-anchor landmarks and the anchor landmark identified in respect of the input object; comparing distances determined in respect of the input object with corresponding distances determined in respect of the reference object; and determining the amount of distortion on the basis of the comparison.
The anchor landmark may be centrally located amongst the identified landmarks.
Comparing distances determined in respect of the input object with corresponding distances determined in respect of the reference object can comprise: determining a plurality of differences, each difference being determined between a respective pair of corresponding distances, the respective pair of corresponding distances comprising a respective one of the distances determined in respect of the input object and a corresponding distance determined in respect of the reference object.
The method can comprise: statistically fitting the differences at locations of the landmarks in respect of the input object or the reference object, to a difference model; and determining the amount of distortion on the basis of the difference model.
The difference model can be a Gaussian or Gaussian-like model.
The amount of distortion can be an integral value calculated from the difference model.
The amount of distortion can be associated with an image capture parameter which at least partially characterises how the input image is captured. The image capture parameter can be a distance from which the image is acquired, and the distortion can be a perspective distortion associated with the distance.
The reference object can be shown in a reference image, being an available image of the imaged object shown in the input image. The reference image may be an image which is known to have substantially no distortion.
The reference object can be determined from a statistically representative image or a statistically representative set of landmarks, for an object which is of a same type as the object shown in the input image.
The imaged object can be a face of a person.
In a second aspect, there is disclosed herein a method of generating an image dataset from an input image of an imaged object, comprising: generating a three-dimensional model of the imaged object from the input image; generating a plurality of synthesized images which are two-dimensional representations of the generated three-dimensional model, obtained by simulating a plurality of image captures of the three-dimensional model, wherein for each of the plurality of simulated image captures, a theoretical value for a capture parameter at least partially characterising the simulated image capture is set to a respective one of a plurality of different values, wherein the input image is used as a reference image. The method comprises, for each synthesized image, calculating a distortion amount in the synthesized image compared with the reference image; and determining a calibrated capture parameter value for each synthesized image, on the basis of the theoretical capture parameter value or the calculated distortion amount of the synthesized image or both, distortion amounts calculated from real acquired images, and known capture parameter values used to acquire the real acquired images. The training image data set comprises the synthesized images and their corresponding calibrated capture parameter values.
The calculation of the distortion amounts in the synthesized image can be in accordance with the distortion estimation method described in the first aspect above, where an object shown in the reference image is the reference object.
Generating the three-dimensional model can comprise modifying an initial model generated using the input image, to correct an artefact in the initial model, apply enhancement, or both.
Modifying the initial model can comprise in-painting a void region in the initial model caused by a two-dimension to three-dimension transition during an initial model generation from the input image.
Determining the calibrated capture parameter value for each synthesized image can comprise referencing a reference relationship model describing a relationship between distortion amounts in the real acquired images and the known capture parameter values.
Determining the calibrated capture parameter value can further comprise: determining a theoretical relationship model describing a relationship between calculated distortion amounts in the synthesized images and the theoretical values of the capture parameter used to generate the synthesized images; adjusting the theoretical relationship model to find a best fit to the reference relationship model, wherein the adjusted relationship model providing the best fit is determined as the calibrated relationship model; and for each synthesized image, finding its calibrated capture parameter value on the basis of its distortion amount and the calibrated relationship model.
The reference relationship model can be predetermined.
The training image data set can comprise the real acquired facial images.
In a third aspect, there is disclosed herein a method of training a facial recognition algorithm, comprising obtaining one or more input facial images, using each of the input facial images to generate a respective dataset in accordance with the method mentioned above, and using the datasets as training data. The datasets can be utilised by being included in an existing set of training data, to augment the existing training data. The use of the generated datasets in the training data helps to train the facial recognition algorithm to deal with perspective distortion in input images.
The training data can be provided for training a generative adversarial network for training or augmenting the facial recognition algorithm. The generative adversarial network trained using the training data is expected to be able to generate more realistic or higher-resolution distorted images, which can be used to train the facial recognition algorithm.
In a fourth aspect, there is disclosed herein a method of determining a value of an image capture parameter which at least partially characterises how an input image is captured or generated, comprising: determining an amount of distortion in the input image, in accordance with the distortion estimation method mentioned above in the first aspect; and determining the value of the image capture parameter on the basis of the determined amount of distortion and a relationship model describing a relationship between distortion amounts in images and values of the image capture parameter.
The relationship model can be determined on the basis of distortion amounts in real acquired images and known capture parameter values used to acquire the real acquired images.
The relationship model can be a calibrated relationship model, the calibrated relationship model being generated by: generating one or more datasets, each using a respective one of one or more training images showing imaged objects of a same type as the imaged object in the input image; and determining the calibrated relationship model using the distortion amounts and the corresponding calibrated capture parameter values from the datasets. The generation of the datasets may be in accordance with the second aspect disclosed above.
The capture parameter can be a distance between the imaged object and a camera that acquired the input image of the imaged object.
In a fifth aspect, there is disclosed herein a facial identification apparatus, comprising a processor configured to execute machine instructions which implement a facial recognition algorithm to identify a user identity from a facial image of a user, wherein the facial recognition algorithm is trained in accordance with the training method mentioned above in the third aspect.
The facial image can be acquired using a local device used or accessed by the user and provided to the facial identification apparatus over a communication network.
The processor can be provided by a server system hosting a biometric matching service.
This patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Embodiments are described with reference to the accompanying figures.
In the following detailed description, reference is made to accompanying drawings which form a part of the detailed description. The illustrative embodiments described in the detailed description, depicted in the drawings, are not intended to be limiting. Other embodiments may be utilised and other changes may be made without departing from the spirit or scope of the subject matter presented. It will be readily understood that the aspects of the present disclosure, as generally described herein and illustrated in the drawings can be arranged, substituted, combined, separated and designed in a wide variety of different configurations, all of which are contemplated in this disclosure.
Aspects of the disclosure will be described herein, with examples in the context of facial biometric identification of persons such as air-transit passengers or other travellers. However, the technology disclosed is applicable to any other automated systems utilising facial biometrics. Moreover, the disclosed technology may be generalised to the processing of images of objects other than faces, for applications in which distortions in the images need to be quantified or addressed, such as applications for identification or recognition of the objects.
In the context of air travel, the capture and biometric analysis of a passenger's face can occur at various points, such as at flight check-in, baggage drop, security, boarding, etc. For example, in a typical identification system utilising facial biometric matching, the identification system includes image analysis algorithms which rely on input images taken with cameras at the checkpoints or with cameras in the passengers' mobile devices.
Facial images taken with the passengers' mobile devices may be "selfie" images, which tend to be distorted in comparison to the passenger's face, due to the camera being held at a relatively short distance from the face. It is also possible that at a biometric checkpoint, the passenger stands close enough to the camera at the checkpoint that distortion occurs. This type of distortion, resulting from a short distance between the camera "viewpoint" and the imaged object (i.e., the passenger's face), is referred to as "perspective distortion". In the case of facial images, perspective distortions typically result in a dilation of the nose region of the face and a compression of the side regions of the face such as the ears, when the images are acquired from a close distance, which is often the case when the images are acquired using a webcam or selfie camera. Other types of distortion mechanisms may be due to the design of the camera lens, such as barrel or pincushion distortions. These distortions cause the acquired input image to be distorted relative to what the passenger's face would look like in the absence of such distortion mechanisms.
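To give a rough, illustrative sense of the magnitude involved (the numbers here are assumptions chosen only for the example), consider a simple pinhole camera model in which the imaged size of a feature is proportional to 1/D, where D is the distance from the camera to that feature. If the tip of the nose lies about 0.03 m closer to the camera than the plane of the ears, then at a selfie-like capture distance of 0.3 m the nose is magnified relative to the ears by a factor of roughly 0.3/0.27 ≈ 1.11, i.e., around 11%, whereas at a capture distance of 1.5 m the factor is roughly 1.5/1.47 ≈ 1.02, i.e., around 2%. This is why images acquired from a close distance exhibit noticeably dilated central features and compressed peripheral features, while the effect largely disappears at larger capture distances.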
When such a distorted input image is provided to a biometric identification system, if the system has not been trained to deal with input images with built-in distortions, the identification system may not perform as expected.
Disclosed herein are a method and system for handling input images with potential distortions. In a first aspect, embodiments of the present invention include a method and system of generating datasets comprising images with different amounts of distortion in the images, and calibrated estimates of the distances at which the images are captured (hereinafter, "capture distances") which caused the distortions. The generated datasets may be used for purposes such as training facial recognition algorithms or facial biometric matching algorithms. The generation of the datasets is automated, meaning it is not necessary to manually take real photos of individuals.
At step 206, a plurality of synthesized two-dimensional images of the 3D model will be generated, each by simulating the capture of an image of the 3D model, where a capture parameter at least partially characterising the image capture is set at a corresponding value, and rendering the simulated captured image. A perspective camera model can be used to simulate the “virtual camera” capturing the 3D model at a particular fixed field of view. The simulated capture parameter value is changed to different values in order to generate synthesized images virtually captured at different perspectives.
In the context of facial biometric identification, the rendered image is the simulated portrait of the 3D face, and the capture parameter is the theoretical capture distance. That is, the system simulates the image capture of the 3D facial object model at various capture distances, to synthesize facial images with different capture perspectives.
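By way of a non-limiting illustrative sketch only, the simulated captures of step 206 may be approximated by perspectively projecting points of the 3D model with a simple pinhole ("perspective camera") model at a fixed field of view, while the theoretical capture distance is varied. The function and variable names below, as well as the particular field of view and distance values, are assumptions made solely for this illustration and are not prescribed by the process.

import numpy as np

def simulate_capture(points_3d, capture_distance, image_size=512, fov_deg=60.0):
    # Project 3D model points (N x 3, in metres, centred on the imaged object)
    # onto a virtual image plane for a pinhole camera placed capture_distance
    # metres from the model, with a fixed field of view.
    focal_px = (image_size / 2.0) / np.tan(np.radians(fov_deg) / 2.0)
    depth = capture_distance - points_3d[:, 2]   # distance of each point from the virtual camera
    u = focal_px * points_3d[:, 0] / depth + image_size / 2.0
    v = focal_px * points_3d[:, 1] / depth + image_size / 2.0
    return np.stack([u, v], axis=1)

# Synthesize "captures" of the same 3D model at several theoretical capture distances.
model_points = np.random.default_rng(0).normal(scale=0.08, size=(68, 3))  # placeholder 3D landmark points
theoretical_distances = [0.3, 0.5, 0.8, 1.2, 2.0]  # metres (assumed example values)
synthesized_landmarks = {d: simulate_capture(model_points, d) for d in theoretical_distances}

In a full implementation the entire textured 3D model would be rendered into a synthesized image rather than only a set of points, but the projection geometry, and the way the capture distance alters the relative positions of features, is the same.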
At step 208, the estimated amounts of distortion in the synthesized images are determined, in comparison to the seed image used to generate the synthesized images. Correlating the synthesized images with their respective theoretical capture parameter values, i.e., capture distances, will provide a dataset from the input image. However, in some embodiments, the theoretical capture distances will be calibrated against real images and their known capture distances, before being correlated with the synthesized images. This helps to remove potential biases within the system or the perspective camera model used which may skew the relationship between the theoretical capture distances and the associated amounts of distortion in the synthesized images, in comparison to what may be observed in the real world.
At step 210, on the basis of the estimated amounts of distortion in each synthesized image, the theoretical capture distance associated with that synthesized image is calibrated. This calibration utilises the relationship between distortions observed in real acquired images and their known capture distances. The same algorithms will be used for the calculation of the amounts of distortions in the real images and for the calculation of the amounts of distortions in the synthesized images. At step 212, the synthesized images are correlated with their respective calibrated capture distances, in order to produce the output dataset.
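A minimal sketch of one possible calibration for step 210 is given below, under the assumption that the real-image relationship between distortion amount and capture distance is fitted to a simple functional form and then inverted, so that each synthesized image's distortion amount is mapped to a calibrated capture distance. The functional form, names and numeric values are illustrative assumptions only; other relationship models and fitting strategies may equally be used.

import numpy as np
from scipy.optimize import curve_fit

def distortion_model(distance, a, b):
    # Assumed functional form: distortion roughly inversely proportional to capture distance.
    return a / distance + b

# Known capture distances (metres) and measured distortion amounts for real acquired images.
real_distances = np.array([0.3, 0.5, 0.8, 1.2, 2.0])
real_distortions = np.array([0.42, 0.25, 0.15, 0.10, 0.06])   # placeholder example values

(a_fit, b_fit), _ = curve_fit(distortion_model, real_distances, real_distortions)

def calibrated_distance(synth_distortion):
    # Invert the fitted real-image relationship so that a synthesized image's
    # distortion amount is mapped onto the capture distance implied by real data.
    return a_fit / (synth_distortion - b_fit)

# Example: calibrate the capture distances of three synthesized images.
synthesized_distortions = [0.38, 0.22, 0.12]   # placeholder values from step 208
calibrated = [calibrated_distance(x) for x in synthesized_distortions]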
Different methods of quantifying distortion may be used. For instance, to calculate the distortion in a synthesized image compared with the seed image used to generate the synthesized image, the distortion may be quantified by applying a direct comparison between the pixels of the images. In another example, the two images may each be processed to identify particular areas, landmarks, or features. The areas, landmarks, or features respectively identified from the two images are then compared to determine the distortion, e.g., by checking a difference in the respective sizes or dimensions, or a difference in the respective locations of the features in relation to the corresponding images.
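As a minimal illustration of the first option, a direct comparison between pixels could be as simple as a mean absolute pixel difference between equally sized images; this is only one assumed metric, shown for concreteness.

import numpy as np

def pixel_distortion(input_img, reference_img):
    # Mean absolute per-pixel difference between two equally sized greyscale
    # images, normalised to [0, 1]; an illustrative direct-comparison metric.
    a = np.asarray(input_img, dtype=np.float64)
    b = np.asarray(reference_img, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("images must have the same dimensions for a direct comparison")
    return float(np.mean(np.abs(a - b)) / 255.0)

The landmark-based comparison of the second option is generally more robust to misalignment, and is described in detail below.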
In the real world, even where there are no distortions associated with a person's facial image, that facial image may still look different to the person's portrait in the passport or another identity document. This is because in some cases the person may be tilting their head, or their face may not be centred with respect to the centre of the camera's field of view, resulting in a facial image which is rotated or shifted with respect to the centre of the camera's field of view, in addition to any distortion that may be introduced due to the lens used or the capture perspective. In some scenarios, a traveller may place his or her passport in a slightly misaligned manner into the scanner of an airport checkpoint, so that a resulting scan of the passport photo will be misaligned relative to the field of view of the scanner. Thus, preferably, the quantification of the distortion will not be significantly affected by such rotations or translations.
Embodiments according to an aspect disclosed herein provide a method for estimating the distortion in a first, “input”, image in comparison to a second, “reference” image. The method for providing the estimation is designed so that the calculated estimate will not be significantly affected by possible rotations or translations in the input image or the reference image.
Alternatively, the reference image 304 may be a statistically representative image for an object of the type shown in the input image 302. In the context of facial images, the reference image 304 would be the image of a "face" considered to be statistically representative of all faces, or of the faces of people identified to be in the same category or categories (e.g., gender, race, age or age bracket, etc.) as the person whose face is shown in the input image 302, e.g., female, Caucasian, child. This is useful when there is no undistorted image available to use as the reference. Different ways of obtaining the statistically representative image may be used, such as by creating an "average" image, or a composite image where each feature shown in the composite image is a statistical representative (e.g., average) of that particular type of feature (e.g., face shape, nose shape, eye separation, mouth width, lip thickness, etc.).
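One simple way to obtain such a statistically representative reference, sketched below under the assumption that the landmark sets of a population of faces have already been detected and aligned to a common scale and frame, is to average the corresponding landmark coordinates; the alignment step is omitted for brevity and the function name is illustrative only.

import numpy as np

def representative_landmarks(aligned_landmark_sets):
    # aligned_landmark_sets: list of (N x 2) arrays of landmark coordinates,
    # one per person, already aligned to a common scale and frame.
    # Returns the mean landmark positions, usable as a statistically
    # representative reference set of landmarks.
    stacked = np.stack([np.asarray(s, dtype=np.float64) for s in aligned_landmark_sets])
    return stacked.mean(axis=0)

As noted further below, when such a representative reference is used, these landmark positions (and the anchor distances derived from them) can be precomputed and stored rather than recalculated for every input image.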
The distortion estimation processing 310 in this embodiment is based on landmarks. At step 312, a plurality of landmarks of the object shown in the input image are identified, and the same set of landmarks of the object shown in the reference image are also identified. Appropriate object detection (e.g., face detection) algorithms will be used to locate the object in the input image and in the reference image prior to the landmark identification. In the context of facial images, this means identifying a set of biometric facial landmark points in respect of the face shown in the input image. The same landmarks are also identified in respect of the face shown in the reference image. The set of landmarks will include an anchor landmark A, with the rest being non-anchor landmarks P1, P2, etc. The anchor landmark A may be a centrally located landmark amongst the identified landmarks, such as a landmark on the nose of a face.
Next, at step 314, for each of the input image and the reference image, the distance (L) of each of the non-anchor landmarks from the anchor landmark in the same image will be calculated. This calculation results in L(P1, A)_input, L(P2, A)_input, . . . etc. for landmarks P1, P2, . . . etc. in respect of the face (object) shown in the input image, and L(P1, A)_ref, L(P2, A)_ref, . . . etc. for landmarks P1, P2, . . . etc. in respect of the face (i.e., object) shown in the reference image. At step 316, the distances between the same landmark pairs on the input image and on the reference image are compared. Thus, for a given landmark P, the change in its distance to the anchor A, from the reference image to the input image, can be expressed as ΔP = L(P, A)_input − L(P, A)_ref. The difference ΔP represents how much the landmark P has shifted radially with respect to the anchor due to the distortion, i.e., from the reference image to the input image. The differences ΔP, P = P1, P2, . . . , PN may therefore also be considered as "radial" errors in the input image at each non-anchor landmark point, due to the effect of the distortion.
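A minimal sketch of steps 314 and 316 is given below, assuming the landmarks have already been detected as (x, y) coordinates and that the anchor landmark is stored at a known index; the index convention and function name are assumptions made only for this illustration.

import numpy as np

def radial_errors(input_landmarks, reference_landmarks, anchor_index=0):
    # For each non-anchor landmark P, compute
    #   delta_P = L(P, A)_input - L(P, A)_ref,
    # i.e. the change in the landmark's distance to the anchor A between the
    # reference image and the input image.
    inp = np.asarray(input_landmarks, dtype=np.float64)
    ref = np.asarray(reference_landmarks, dtype=np.float64)
    dist_input = np.linalg.norm(inp - inp[anchor_index], axis=1)
    dist_ref = np.linalg.norm(ref - ref[anchor_index], axis=1)
    deltas = dist_input - dist_ref
    return np.delete(deltas, anchor_index)   # drop the anchor's own (zero) entry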
At step 318, the results from the comparison, i.e., the radial errors at the landmarks ΔP, P=P1, P2 . . . PN, are used to estimate an amount of distortion in the input image relative to the reference image.
The calculation of the distortion estimate at step 318 may be done differently in different implementations. Statistical methods can be applied to the comparison results from step 316 in order to obtain a quantified estimate of the distortion. For example, the radial errors at the landmarks ΔP, P=P1, P2 . . . PN, can be “attached” to the coordinates of the landmark positions in the input image. The difference ΔP in the distances can therefore be expressed as a function of the landmark's position ΔP=f(xP,yP) in the input image.
In some examples, the calculation involves statistically modelling the radial errors as a multivariate function, and then quantifying the distortion on the basis of the model, such as by calculating particular parameters associated with the model.
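A minimal sketch of one such calculation is given below, under the assumption that a Gaussian-like surface centred at the anchor landmark is fitted to the radial errors at the (anchor-centred) landmark coordinates, and that the distortion amount is taken as the volume under the fitted surface; the initial parameter guesses and names are illustrative assumptions only.

import numpy as np
from scipy.optimize import curve_fit

def gaussian_surface(xy, amplitude, sigma):
    # Gaussian-like surface centred at the anchor landmark (taken as the origin).
    x, y = xy
    return amplitude * np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))

def distortion_amount(anchor_centred_xy, radial_errs, initial_sigma=50.0):
    # Fit the Gaussian surface to the radial errors at the landmark coordinates
    # and return the volume under the fitted surface as a scalar distortion estimate.
    x, y = anchor_centred_xy[:, 0], anchor_centred_xy[:, 1]
    p0 = (max(float(np.max(np.abs(radial_errs))), 1e-6), initial_sigma)
    (amplitude, sigma), _ = curve_fit(gaussian_surface, (x, y), radial_errs, p0=p0)
    # The integral of amplitude * exp(-(x^2 + y^2) / (2 * sigma^2)) over the
    # plane is 2 * pi * amplitude * sigma^2.
    return abs(2.0 * np.pi * amplitude * sigma ** 2)

Under this sketch, a flat error surface yields a distortion amount close to zero, while a pronounced central bump, such as that produced by perspective dilation around the nose, yields a large value, consistent with the behaviour of the example image pairs discussed below.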
As described above, in the error estimation process 310, the comparison is made between the distances of corresponding landmark-anchor pairs on the reference image and on the input image, to find the "radial error" associated with each landmark. A rotation or a translation, or both, of the imaged object relative to the field of view of the camera is not expected to affect, or significantly affect, the inter-landmark distances within the same image. Therefore, the results from the comparison will not be significantly impacted by rotations or translations of the imaged object (i.e., face) relative to a centre of the camera field of view, in the input image or the reference image. Consequently, the estimated distortion amount will also be robust against rotations or translations. This robustness is illustrated by an example in which a reference image 502 is compared against input images 504, 514, 524 and 534, described below.
Images 506, 516, 526, 536 show the overlays of landmark points of each pair of reference and input images, respectively images 502 and 504, images 502 and 514, images 502 and 524, and images 502 and 534. Graphs 508, 518, 528, 538 respectively show the Gaussian or Gaussian-like surfaces obtained by approximating the radial errors between the landmark points of each of the four image pairs. As can be seen from the landmark overlay image 526, even though the landmarks in image 524 have individually shifted the most from their respective positions in image 502, there is no "radial error" in image 524, because image 524 is identical to image 502 apart from a shift within the frame, so from image 502 to image 524 the non-anchor landmarks have not changed their positions relative to the anchor. The graph 528, which shows a Gaussian surface centred at the anchor to statistically approximate a model for the radial errors, is accordingly almost flat. On the other hand, the surface 518 approximating a model for the radial errors in image 514 is much more pronounced, having central values which differ markedly from the border values. The estimated distortion amount calculated from graph 518 is significantly larger than those calculated from graphs 508, 528 and 538.
In one example arrangement, a device 702 is configured to execute machine instructions which implement one or more of the processes described above.
For example, the machine instructions include a module 720 for building synthesized images from the reference image and obtaining calibrated capture distances in respect of these images so that they can be provided to a data repository 722. The data repository 722 may be located within a server system 724 with which the device 702 is in communication. The server system 724 may include a processor or processor system 726 configured to execute program code to, e.g., train a facial recognition system or train a generative adversarial network that will train the facial recognition system. The facial recognition system thus trained is expected to be better able to deal with input images which contain distortions. The data in the data repository may instead or also be used as testing or validation data, and the processor or processing system 726 may be configured to execute program code to use the data to test or validate a facial recognition system, such as by testing its performance generally, or assessing how well it performs in recognising diverse faces of different genders, ages, ethnicities, etc.
The device 702 can be used in an air-travel context. For example it can be a facial identification apparatus for biometrically identifying travellers. Such an apparatus may be used as a touch point at an airport, e.g., at check-in, bag-drop, boarding gate, etc, where the input facial image acquired by or provided to the touch point will be used to identify the traveller.
As another example, the facial identification apparatus is provided by the processing device of a server or server arrangement, configured to receive facial images from the travellers' devices such as mobile phones or tablets. In this example the facial identification apparatus does not necessarily need to have its own camera or scanner. The server or server arrangement may host the biometric matching service for determining the identity of the traveller. A successful identification result can be communicated to a departure control system to obtain boarding pass information. The boarding pass information may then be communicated to the traveller's device.
Because the facial recognition algorithm(s) used to perform the identification has been trained using datasets with images captured, or simulated as being captured, at different distances from the camera, it will be better able to correctly identify the traveller, particularly where the input image has been acquired by a wide field of view camera which permits a close capture distance, as is typically the case when the input image has been captured using a selfie camera.
Variations and modifications may be made to the parts previously described without departing from the spirit or ambit of the disclosure. For example, in embodiments where the distortion is quantified based on landmarks, the landmarks and/or the quantities associated with the landmarks on the reference object can be predetermined, rather than computed anew each time. For example, where the reference object is a "statistically representative" reference object, it may not be necessary to utilise a reference image showing the reference object in the algorithm. Rather, the landmarks for a statistically representative object, the radial distances between each non-anchor landmark and the anchor landmark, or both, can be predetermined and stored so that they are already available during the distortion estimation calculations. This variation may be made to all embodiments wherein the determination of a distortion estimate involves using landmarks for a statistically representative object.
In the claims which follow and in the preceding description, except where the context requires otherwise due to express language or necessary implication, the word "determine" and its variants such as "determining" are used to mean the ascertainment of a particular quantitative or qualitative evaluation, parameter, data item, model, or any other item of information, whether this information is predetermined and is obtained by reading it from a memory or data storage location or by receiving it as an input, or whether the information is obtained by carrying out or implementing algorithms to calculate or arrive at the information in cases where it is not already predetermined.
In the claims which follow and in the preceding description, except where the context requires otherwise due to express language or necessary implication, the word “comprise” or variations such as “comprises” or “comprising” is used in an inclusive sense, i.e. to specify the presence of the stated features but not to preclude the presence or addition of further features in various embodiments of the disclosure.