This invention concerns improvements in and relating to identifier investigation, particularly, but not exclusively, in relation to the comparison of biometric identifiers or markers, such as prints from a known source, with biometric identifiers or markers, such as prints from an unknown source. The invention is applicable to fingerprints, palm prints and a wide variety of other prints or marks, including retina images.
It is useful to be able to capture, process and compare identifiers with a view to obtaining useful information as a result. In the context of fingerprints, the useful result may be evidence to support a person having been at a crime scene.
Problems exist with present methods in terms of their accuracy and speed.
The present invention has amongst its potential aims to process a representation of an identifier so as to produce a processed representation which more accurately represents the identifier. The potential aims may include a faster process for the representation of an identifier.
According to a first aspect of the present invention we provide a method of processing a representation of an identifier, the method including:
obtaining a representation of an identifier, the representation including one or more components;
defining a part of the representation of the identifier as being within a neighborhood by reference to a boundary for that neighborhood;
determining any ends for the components which ends fall within the boundary to the neighborhood;
determining any limits for the components, a limit being an element of the component which coincides with the boundary to the neighborhood and/or being an element of the component which forms the junction with one or more of the components;
generating a processed representation of the identifier;
The first aspect of the present invention may include features, options or possibilities set out elsewhere in this application, including in the other aspects of the invention. In particular, the first aspect may include the following.
The representation of the identifier may have been captured. The representation may be captured from a crime scene and/or an item and/or a location and/or a person. The representation may have been captured by scanning and/or photography.
The method may process an already processed representation of an identifier. The already processed representation may have been processed to convert a colour and/or shaded representation into a black and white representation. The already processed representation may have been processed using Gabor filters.
The method may process a representation of an identifier which has been altered in format. The alteration in format may involve converting the representation into a skeletonised format. The alteration in format may involve converting the representation into a format in which the representation is formed of components, preferably linked data element sets. The alteration may convert the representation into a representation formed of single pixel wide lines.
The method of the first aspect may provide processing which cleans the representation, particularly when provided according to the second aspect of the present invention and/or its options, possibilities and features.
The method of the first aspect may provide processing which heals the representation, particularly when provided according to the third aspect of the present invention and/or its options, possibilities and features.
The method may provide for cleaning followed by healing.
The identifier may be a biometric identifier or other form of marking. The identifier may be a fingerprint, palm print, ear print, retina image or a part of any of these.
The representation of the identifier may be obtained direct or after processing of the type provided above.
The representation may be formed of a plurality of components, particularly in the form of linked data element sets.
One or more of the components may be in the form of linked data elements, for instance to form linked data element sets. One or more of the components may be formed of a plurality of data elements which are connected to one another. A linked data element set may be formed of a plurality of data elements which are connected to one another. A plurality of the data elements in a linked data element set may be connected to two adjoining data elements. One or two data elements in a linked data element set may be connected to only one other data element, for instance the data element defining a ridge end in a fingerprint. One or two of the data elements in a linked data element set may be connected to three data elements, for instance the data element defining a bifurcation in a fingerprint.
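The connection counts described above can be illustrated with a minimal sketch. It assumes the skeleton is held as a set of (row, column) pixel coordinates and that data elements are connected only horizontally or vertically (4-connectivity). Both choices, and all names used, are assumptions made for the example. A practical skeleton would often need 8-connectivity with extra care around junctions.

```python
from typing import Dict, Set, Tuple

Pixel = Tuple[int, int]

# 4-connected neighbour offsets (an assumption; the text does not fix the connectivity).
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def classify_elements(skeleton: Set[Pixel]) -> Dict[Pixel, str]:
    """Label each data element by the number of data elements it is connected to."""
    labels: Dict[Pixel, str] = {}
    for row, col in skeleton:
        degree = sum((row + dr, col + dc) in skeleton for dr, dc in OFFSETS)
        if degree == 1:
            labels[(row, col)] = "end"        # e.g. a ridge end
        elif degree == 2:
            labels[(row, col)] = "interior"   # connected to two adjoining elements
        elif degree >= 3:
            labels[(row, col)] = "junction"   # e.g. a bifurcation
        else:
            labels[(row, col)] = "isolated"
    return labels


# A horizontal ridge with a short branch rising from its middle.
skeleton = {(2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (1, 2), (0, 2)}
labels = classify_elements(skeleton)
```

In this example the element at (2, 2) is labelled as a junction and the elements at (2, 0), (2, 4) and (0, 2) are labelled as ends.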
The part of the representation being within the neighborhood is preferably less than the whole of the representation. The part may be less than 10% of the whole, preferably less than 5% of the whole, more preferably less than 1% of the whole and ideally less than 0.1% of the whole.
The neighborhood may have a shape defined by the boundary. The shape may be circular or square or rectilinear. The neighborhood may have a pre-determined area and/or shape and/or size. The neighborhood area and/or shape and/or size may be varied between parts of the representation and/or between different processings of the representation.
An end may be a part of a component within the boundary of the neighborhood and/or a part of the component only connected to another part of the component in a single place. An end may be a representation of a ridge end and/or apparent ridge end which falls within the boundary of the neighborhood. A neighborhood may contain no or one or more ends. An end may be in the form of an end element. An end data element may be one within the boundary of the neighborhood and/or only connected to one other data element. An end data element may be one representing a ridge end and/or apparent ridge end which falls within the boundary of the neighborhood. A neighborhood may contain no or one or more end data elements.
A limit may be the part of a component which crosses the boundary. The limit may be connected to a part of the component on the inside of the boundary and a part of the component on the outside of the boundary. The limit may be connected to one or more other parts of the component across the boundary and/or outside the neighborhood. One or more other parts of the component across the boundary and/or outside the neighborhood may not be considered part of the component. They may be considered part of a component in respect of the processing of another part of the representation. A limit may be one point on a representation of a continuous ridge. The boundary may coincide with no or one or more such limits. A limit may be in the form of a limiting data element. A limiting data element may be the data element of a linked data element set which crosses the boundary. The limiting data element may be connected to a data element on the inside of the boundary and a data element on the outside of the boundary. The limiting data element may be connected to one or more other data elements across the boundary and/or outside the neighborhood. One or more other data elements across the boundary and/or outside the neighborhood may not be considered part of the linked data element set. They may be considered part of a linked data element set in respect of the processing of another part of the representation. A limiting data element may be one point on a representation of a continuous ridge. The boundary may coincide with no or one or more such limiting data elements.
A limit may be a part of a component which meets another part of another component. The limit may be connected to three other parts of components. Preferably one of the three parts is a part of the component and the other parts are parts of other components. A limit may represent a bifurcation or apparent bifurcation. A neighborhood may contain no or one or more limits of this nature. A limit may be in the form of a limiting data element. A limiting data element may be a data element of a linked data element set which meets another data element of another linked data element set. The limiting data element may be connected to three other data elements. Preferably one of the three data elements is in the linked data element set and the other data elements are in another linked data element set or other linked data element sets. A limiting data element may represent a bifurcation or apparent bifurcation. A neighborhood may contain no or one or more limiting data elements of this nature.
The processed representation of the identifier may contain components in a form not present in the representation before processing, for instance due to healing. The processed representation of the identifier may contain components which include further parts not present in the representation before processing, for instance due to healing. Preferably any new parts are part of one or more components. The processed representation of the identifier may not contain parts and/or components present in the representation before processing, for instance due to cleaning. The processed representation of the identifier may contain data elements not present in the representation before processing, for instance due to healing. The processed representation of the identifier may contain linked data element sets which include further data points not present in the representation before processing, for instance due to healing. Preferably any new data elements are part of one or more linked data element sets. The processed representation of the identifier may not contain data elements and/or linked data element sets present in the representation before processing, for instance due to cleaning.
Consideration of one or more inter-relationships between the ends and/or limits of components may be provided. The inter-relationships may be between parts of the same components and/or may be between parts of different components.
The inter-relationship may be that where a component has two ends within the boundary that that component is omitted from the processed representation. The inter-relationship may be that where a component has an end and a limit which forms the junction with one or more of the other components, within the boundary, that that component is omitted from the processed representation. The inter-relationship may be that where a component has an end, within the boundary, and a limit which coincides with the boundary to the neighborhood that that component is present in the processed representation. The inter-relationship may be that where a component has two limits which coincide with the boundary to the neighborhood that that component is present in the processed representation. Two of these inter-relationships, preferably three of them and ideally all four of them may be applied in the method, particularly for the purposes of cleaning the representation.
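The four cleaning inter-relationships can be summarised as a small decision function. The sketch below is illustrative only: the constant and function names are hypothetical, and the treatment of combinations the text does not mention (for instance a component running between two junctions) is an assumption, here defaulting to retention.

```python
# Kinds of extent a component can have within one neighborhood (illustrative names).
END = "end"            # an end falling within the boundary
JUNCTION = "junction"  # a limit forming a junction with another component
BOUNDARY = "boundary"  # a limit coinciding with the boundary of the neighborhood


def keep_component(first_extent: str, second_extent: str) -> bool:
    """Decide whether a component is present in the processed representation."""
    extents = sorted((first_extent, second_extent))
    if extents == [END, END]:
        return False   # two ends within the boundary: omitted
    if extents == [END, JUNCTION]:
        return False   # an end and a junction within the boundary: omitted
    # An end with a boundary limit, or two boundary limits: retained.
    # Any other combination is retained by assumption.
    return True


decisions = {
    "island": keep_component(END, END),
    "spur": keep_component(END, JUNCTION),
    "ridge_end": keep_component(END, BOUNDARY),
    "through_ridge": keep_component(BOUNDARY, BOUNDARY),
}
```

Because the decision depends only on the kinds of the two extents found in the neighborhood, no measurement of the length of the component is needed.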
The determination may involve, for components having an end, generating a line extending between the end and the end or limit forming the other extent of the component. In such a case, one or more of the inter-relationships set out in the next paragraph may be applied. Preferably both of the inter-relationships are applied. The inter-relationship or inter-relationships may be applied together with one or more of the inter-relationships provided in the previous paragraph.
The inter-relationship may be that, where the direction of the generated line for the first component and the direction of the generated line for the second component match within limits, that the processed representation includes the end of the first component being joined to the end of the second component. The inter-relationship may be that, where the direction of the generated line for the first component and the direction of the generated line for the second component do not match within limits, that the processed representation includes the end of the first component not being joined to the end of the second component.
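A minimal sketch of the direction test follows. It assumes that each generated line is treated as undirected, so that its direction is an orientation between 0 and π, and that the limits are a single angular tolerance. The tolerance value, the coordinates and the function names are all hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def line_orientation(end: Point, other_extent: Point) -> float:
    """Orientation in [0, pi) of the line generated between the two extents."""
    angle = math.atan2(other_extent[1] - end[1], other_extent[0] - end[0])
    return angle % math.pi


def directions_match(first: float, second: float, limit: float) -> bool:
    """True when two orientations differ by no more than the angular limit."""
    difference = abs(first - second) % math.pi
    difference = min(difference, math.pi - difference)  # orientations wrap at pi
    return difference <= limit


ANGLE_LIMIT = math.radians(10.0)  # hypothetical tolerance

# First component: end at (10, 1), other extent at (0, 0).
first = line_orientation((10.0, 1.0), (0.0, 0.0))
# Second component: end at (14, 1), other extent at (24, 3), roughly in line.
second = line_orientation((14.0, 1.0), (24.0, 3.0))
# Third component: end at (14, 1), other extent at (14, 11), roughly perpendicular.
third = line_orientation((14.0, 1.0), (14.0, 11.0))

join_first_to_second = directions_match(first, second, ANGLE_LIMIT)
join_first_to_third = directions_match(first, third, ANGLE_LIMIT)
```

Under these assumptions the first and second components would be joined at their ends, while the first and third would not.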
Preferably the processed representation is formed of a series of parts to which the method has been applied in turn. A series of adjoining or overlapping neighborhoods may be used to process a series of parts of the representation. At least 50% of the representation may be so processed. Preferably at least 75% of the representation is so processed and ideally all of the representation is so processed. The neighborhood used to process one part of the representation may be of the same area and/or shape and/or size as the neighborhood used to process another part of the representation.
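One way of stepping a neighborhood across the whole representation is sketched below, assuming square neighborhoods on a pixel grid. The function name and parameters are hypothetical. A step equal to the size gives adjoining neighborhoods and a smaller step gives overlapping ones.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom and right exclusive


def neighborhoods(height: int, width: int, size: int, step: int) -> Iterator[Window]:
    """Yield square neighborhoods covering a height x width representation."""
    for top in range(0, height, step):
        for left in range(0, width, step):
            # Windows at the lower and right edges are clipped to the representation.
            yield top, left, min(top + size, height), min(left + size, width)


adjoining = list(neighborhoods(4, 4, size=2, step=2))
overlapping = list(neighborhoods(4, 4, size=3, step=2))
```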
The processing of one neighborhood may result in one or more parts and/or data elements and/or components and/or linked data element sets being retained, which are not present in the eventual processed representation because of the processing of one or more other neighborhoods.
The processed representation may be subjected to one or more further steps. The one or more further steps may include the extraction of data from the processed representation, particularly as set out in detail in applicant's UK patent application no 0502990.5. One or more further steps in which the processed representation is placed in a form for comparison may be provided. The form for comparison may particularly be that set out in detail in applicant's UK patent application number 0502902.0 of 11 Feb. 2005 and/or UK patent application number 0422786.46 of 14 Oct. 2004. The form for comparison may allow the representation to be compared with one or more other representations. The one or more other representations may have been processed according to the present invention. The method of comparison may particularly be that set out in applicant's UK patent application number 0502900.4 filed 11 Feb. 2005 and/or UK patent application number 0422784.9 filed 14 Oct. 2004. The comparison may provide an indication of the likelihood of the representation and other representation coming from the same source.
According to a second aspect of the invention we provide a method of processing a representation of an identifier, the method including:
obtaining a representation of an identifier, the representation including one or more linked data element sets;
defining a part of the representation of the identifier as being within a neighborhood by reference to a boundary for that neighborhood;
determining any end data elements for the linked data element sets which fall within the boundary to the neighborhood;
determining any limiting data elements for the linked data element sets, a limiting data element being one which coincides with the boundary to the neighborhood and/or being one which forms the junction with one or more of the other linked data element sets;
generating a processed representation of the identifier;
The second aspect of the present invention may include features, options or possibilities set out elsewhere in this application, including in the other aspects of the invention.
According to a third aspect of the invention we provide a method of processing a representation of an identifier, the method including:
obtaining a representation of an identifier, the representation including one or more linked data element sets;
defining a part of the representation of the identifier as being within a neighborhood by reference to a boundary for that neighborhood;
determining any end data elements for the linked data element sets which fall within the boundary to the neighborhood;
determining any limiting data elements for the linked data element sets, a limiting data element being one which coincides with the boundary to the neighborhood and/or being one which forms the junction with one or more of the other linked data element sets;
for linked data element sets having an end data element, generating a line extending between the end data element and the end or limiting data element forming the other extent of the linked data element set;
generating a processed representation of the identifier;
The third aspect of the present invention may include features, options or possibilities set out elsewhere in this application, including in the other aspects of the invention.
The limits may be expressed in terms of an angle. The limits may be constant between neighborhoods and/or different processings of the representation.
The end data element of a first linked data element set may be joined to the end data element of a second linked data element set only if the first linked data element set and second linked data element set are within a certain distance range of one another. The end data element of a first linked data element set may not be joined to the end data element of a second linked data element set if the distance between the first linked data element set and the second linked data element set is above a certain distance.
The method of the third aspect of the invention may be applied to the representation after it has had the method of the second aspect of the invention applied to it.
Various embodiments of the invention will now be described, by way of example only, and with reference to the accompanying figures in which:—
a is a schematic illustration of a part of a basic skeletonised print;
b is a schematic illustration of the print of
a illustrates minutia and direction information from a mark and a suspect;
b illustrates the presentation of the direction information in a format for comparison;
c illustrates the information of
A variety of situations call for the comparison of markers, including biometric markers. Such situations include a fingerprint, palm print or other such marking, whose source is known, being compared with a fingerprint, palm print or other such marking, whose source is unknown. Improvements in this process to increase speed and/or reliability of operation are desirable.
In the context of forensic science in particular, the consideration of the unknown source fingerprint may require the consideration of a partial print or print produced in less than ideal conditions. The pressure applied when making the mark, substrate and subsequent recovery process can all impact upon the amount and clarity of information available.
Process Overview
The overall process of the comparison is represented schematically in
After the recovery of the fingerprint and its representation, which may be achieved in one or more of the conventional manners, a representation of the fingerprint is captured. This may be achieved by the consideration of a photograph or other representation of a fingerprint which has been recovered.
In the next stage, the representation is enhanced. The representation is processed to represent it as a purely black and white representation. Thus any colour or shading is removed. This makes subsequent steps easier to operate. The preferred approach is to use Gabor filters for this purpose, but other possibilities exist.
Following on from this part of the stage, the enhanced representation is converted into a format more readily processed. This skeletonisation includes a number of steps. The basic skeletonisation is readily achieved, for instance using a function within the Matlab software (available from The MathWorks Inc). A section of the basic skeleton achieved in this way is illustrated in
Once the enhanced representation of the recovered fingerprint has been processed to give a clean and healed representation, the data from it to be compared with the other print can be considered. To do this involves first the extraction of representation data which accurately reflects the configuration of the fingerprint present, but which is suitable for use in the comparison process. The extraction of representation data stage is explained in more detail below, but basically involves the use of one of a number of possible techniques.
The first of the possible techniques, see
In a second technique, developed by the applicant, the positions of features are defined and the positions of a group of these are considered to define a center. The center defines one apex of the triangles, with adjoining features defining the other apexes.
To facilitate the comparison stage, the representation data extracted is formatted before it is used in the comparison stage. This basically involves presenting the information characteristic of the triangles, quadrilaterals or other polygons being considered when the data is extracted in a format mathematically coded for use in the comparison stage. Further details of the format are described below.
Now that the fingerprint has been expressed as representation data, it can be compared with the other fingerprint(s). The comparison stage is based on different representation data being compared to that previously suggested. Additionally, in making the comparison, the technique goes further than indicating that the known and unknown source prints came from the same source or that they did not. Instead, an expression of the likelihood that they came from the same source is generated. In the preferred forms, one or both of the two different models (a data driven approach and a model driven approach) both described in more detail below are used.
Having provided an overview of the entire process, the stages and steps in them will now be discussed in more detail.
Cleaning and Healing Steps of the Skeletonisation Stage
Some existing attempts at interpreting the basic skeleton to give an improved version have been made.
In the situation illustrated in
The existing interpretation considers the length of the ridge island 40. If the length is equal to or greater than a predetermined length value then it is deemed a true ridge island and is left. If the length is less than the predetermined length then the ridge island is discarded. In a similar manner, the length from the bifurcation point 43 to the ridge end 44 is considered. Again if it is equal to or greater than the predetermined length it is kept as a ridge with its attendant features. If it is shorter than the predetermined length it is discarded. This approach is slow in terms of its processing as the length in all cases is measured by starting at the feature and then advancing pixel by pixel until the end is reached. The speed is a major issue as a lot of such features need to be considered within a print.
The new approach now described has amongst its aims to provide a reliable, faster means for handling such a situation. Instead of advancing pixel by pixel, the new approach illustrated in
When further neighborhoods are considered, it may of course be that the feature 52 is itself part of a data set with both of its features within that neighborhood, whereupon it too will be discarded. If, however, it is the end of a ridge of significant length then for all neighborhoods considered its data set will start with the feature and end with a crossing and so be kept.
This approach can be used to address all ridge ends and attendant bifurcation features within the print to be cleaned.
As well as addressing “extra” data by cleaning, the present invention also addresses the type of situation illustrated in
Not only is it desirable to address this type of situation, but it also must be done in a way which does not detract from the accuracy of the subsequent process, and in particular the generation of the representative data which follows. This is particularly important in the case where the “direction” is a part of the representative data generated, as proposed for the embodiment of the invention detailed below.
To ensure that the “direction” information is not impaired it must be accurately determined and maintained. The pixel by pixel approach of the type used above for cleaning suggests taking a feature and then moving pixel by pixel away from it for a given length. A projected line between the feature and the pixel the right length away then gives the angle. Again the pixel by pixel approach is laborious and time consuming.
The approach of the present invention is illustrated in
The direction of data set W is defined by a line drawn between ridge end 71 and crossing 73. A similar determination can be made for the direction of the other data sets.
Once the directions for data sets have been obtained, the type of situation shown in
The approach taken in the present invention allows faster processing of the cleaning and healing stage, in a manner which is accurate and is not to the detriment of subsequent stages and steps.
Extraction of Representation Data
Preferably after the above mentioned processing, the necessary data from it to be compared with the other print can be extracted in a way which accurately reflects the configuration of the fingerprint present, but which is suitable for use in the comparison process.
It is possible to fix coordinate axes to the representation and define the features/directions taken relative to that. However, this leads to problems when considering the impact of rotation and a high degree of interrelationship being present between data.
Instead of this approach, with reference to
Whilst this one approach is suitable for use in the new mathematical coding of the information extracted set out below, the use of Delaunay triangulation does not extract the data in the most robust way.
In the alternative approach, developed by the applicant, an entirely new approach is taken. Referring to
Having established the series of features, the position of each of these features is considered and used to define a centre 124. Preferably, and as illustrated in this embodiment, this is done by considering the X and Y position of each of the features and obtaining a mean for each. The mean X position and mean Y position define the centre 124 for that group of features 120a through 120l. Other approaches to the determination of the centre are perfectly useable. Instead of defining triangles with features at each apex, the new approach uses the centre 124 as one of the apexes for each of the triangles. The other two apexes for first triangle 126 are formed by features 120a and 120b. The next triangle 128 is formed by centre 124, feature 120b and 120c. Other triangles are formed in a similar way, preferably moving around the centre 124 in sequence. The set of triangles formed in this approach is a unique, simple and easy to describe data set. The approach is more robust than the Delaunay triangulation described previously, particularly in relation to distortion. Furthermore, the improvement is achieved without massively increasing the amount of data that needs to be stored and/or the computing power needed to process it. For comparison purposes,
Either the first, Delaunay triangulation based, approach or the second, radial triangulation, approach extracts data which is suitable for formatting according to the preferred approach of the present process.
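The radial triangulation can be sketched as follows. The centre is taken as the mean X and mean Y position of the features, as described above. Ordering the features by their angle around the centre is an assumption made here to give the "in sequence" traversal, and the function name and coordinates are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def radial_triangles(features: List[Point]) -> Tuple[Point, List[Triangle]]:
    """Return the centre and the triangles formed by the centre and adjoining features."""
    count = len(features)
    centre = (sum(x for x, _ in features) / count,
              sum(y for _, y in features) / count)
    # Assumed ordering: sort features by angle around the centre.
    ordered = sorted(features,
                     key=lambda p: math.atan2(p[1] - centre[1], p[0] - centre[0]))
    triangles = [(centre, ordered[k], ordered[(k + 1) % count]) for k in range(count)]
    return centre, triangles


features = [(4.0, 4.0), (0.0, 0.0), (0.0, 4.0), (4.0, 0.0)]
centre, triangles = radial_triangles(features)
```

Each feature contributes exactly one triangle, so the amount of data grows only linearly with the number of features in the group.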
Format of Representative Data
Having considered the print in one of the above mentioned ways to extract the representative data, the data must be suitably mathematically coded to allow the comparison process and here a different approach is taken to that considered before. The approach presents the extracted data in vector form, and so allows easy comparison between expressions of different representations.
Particularly with reference to the first approach, for a given triangle, a number of pieces of information are taken and used to form a feature vector. The information is: the type of the minutia feature each node represents (three pieces of information in total); the relative direction of the minutia features (three pieces of information in total); and the distances between the nodes (three pieces of information in total). Thus the feature vector is formed of nine pieces of information. The type of minutia can be either ridge end or bifurcation. The direction, a number between 0 and 2π radians, is calculated relative to the orientation, a number between 0 and π radians, of the opposing segment of the triangle as reference and so the parameters of the triangle are independent from the image.
In particular the feature vector may be expressed as:
FV=[GP, Reg, {T1, A1, D1,2, T2, A2, D2,3, T3, A3, D3,1}]
where
GP is the general pattern of the fingerprint;
Reg is the region of the fingerprint the triangle is in;
T1 is the type of minutia 1;
A1 is the direction of the minutia at location 1 relative to the direction of the opposing side of the triangle;
D1,2 is the length of the triangle side between minutia 1 and minutia 2;
T2 is the type of minutia 2;
A2 is the direction of the minutia at location 2 relative to the direction of the opposing side of the triangle;
D2,3 is the length of the triangle side between minutia 2 and minutia 3;
T3 is the type of minutia 3;
A3 is the direction of the minutia at location 3 relative to the direction of the opposing side of the triangle;
D3,1 is the length of the triangle side between minutia 3 and minutia 1.
To avoid the same feature vector representing two symmetrical triangles, the features are recorded for all the triangles in the same order (either clockwise or anticlockwise). A rule of starting with the furthest feature to the left is used, but other such rules could be applied.
As each triangle considered is independent of the others and is also independent of the print image this addresses the problem of rotational issues in the comparison.
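The nine triangle-specific entries of the feature vector can be assembled as in the sketch below. It assumes the three minutiae have already been put into the agreed order, that each minutia carries a direction measured in the image frame, and that GP and Reg are supplied separately. The `Minutia` type, the function names and the sample values are hypothetical.

```python
import math
from typing import List, NamedTuple


class Minutia(NamedTuple):
    x: float
    y: float
    kind: str         # "ridge_end" or "bifurcation"
    direction: float  # radians in [0, 2*pi), measured in the image frame


def side_orientation(p: Minutia, q: Minutia) -> float:
    """Orientation in [0, pi) of the triangle side joining p and q."""
    return math.atan2(q.y - p.y, q.x - p.x) % math.pi


def triangle_entries(m1: Minutia, m2: Minutia, m3: Minutia) -> List[object]:
    """Return [T1, A1, D12, T2, A2, D23, T3, A3, D31] for one triangle."""
    nodes = [m1, m2, m3]
    entries: List[object] = []
    for i in range(3):
        current, following, remaining = nodes[i], nodes[(i + 1) % 3], nodes[(i + 2) % 3]
        # Direction of the minutia relative to the side opposite it.
        relative = (current.direction - side_orientation(following, remaining)) % (2 * math.pi)
        # Length of the side from this minutia to the next one in order.
        length = math.hypot(following.x - current.x, following.y - current.y)
        entries += [current.kind, relative, length]
    return entries


m1 = Minutia(0.0, 0.0, "ridge_end", math.pi / 2)
m2 = Minutia(3.0, 0.0, "bifurcation", 0.0)
m3 = Minutia(0.0, 4.0, "ridge_end", math.pi)
entries = triangle_entries(m1, m2, m3)
```

Only differences between positions and angles defined within the triangle are used, so translating the whole print leaves the entries unchanged.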
Advantageously the second data extraction approach described above is also suited to be mathematically coded using the vector format and so allow comparison with data extracted from other representations. The pieces of information used to form the feature vector in this case are: the general pattern of the fingerprint; the type of minutia; the direction of the minutia relative to the image; the radius of the minutia from the centre or centroid; the length of the polygon side between a minutia and the minutia next to it; the surface area of the triangle defined by the minutia, the minutia next to it and the centroid.
In particular the vector may be expressed as:
FV=[GP, {T1, A1, R1, L1,2, S1}, . . . , {Tk, Ak, Rk, Lk,k+1, Sk}, . . . , {TN, AN, RN, LN,1, SN}]
where
GP is the general pattern of the fingerprint;
Tk is the type of minutia k;
Ak is the direction of minutia k relative to the image;
Lk,k+1 is the length of the polygon side between minutia k and minutia k+1;
Sk is the surface area of the triangle defined by minutia k, k+1 and the centroid; and
Rk is the radius between the centroid and the minutia k.
When compared with the expression of the vector set out above in the context of the first data extraction approach, it should be noted that the region of the fingerprint is no longer considered. The set of features can extend across region boundaries and so it is potentially not appropriate to consider one region in the vector. The region could still be considered, however, and the expression set out below is a suitable one in that context, with the region designated Reg and the other symbols having the meanings outlined above. Note a separate region is possible for each minutia.
FV=[GP, {T1, A1, R1, Reg1, L1,2, S1}, . . . , {Tk, Ak, Rk, Regk, Lk,k+1, Sk}, . . . , {TN, AN, RN, RegN, LN,1, SN}]
Using the types of format described above, it is possible to present the data extracted from the representations in a format particularly useful to the comparison stage.
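For the radial approach, the entries Tk, Ak, Rk, Lk,k+1 and Sk can be computed as in the sketch below. It assumes the minutiae are already listed in sequence around the centroid and that the centroid is the mean position. The `Minutia` type, the function name and the sample values are hypothetical.

```python
import math
from typing import List, NamedTuple


class Minutia(NamedTuple):
    x: float
    y: float
    kind: str         # "ridge_end" or "bifurcation"
    direction: float  # direction relative to the image, in radians


def radial_feature_vector(general_pattern: str, minutiae: List[Minutia]) -> list:
    """Return [GP, (T1, A1, R1, L12, S1), ..., (TN, AN, RN, LN1, SN)]."""
    count = len(minutiae)
    cx = sum(m.x for m in minutiae) / count
    cy = sum(m.y for m in minutiae) / count
    vector: list = [general_pattern]
    for k in range(count):
        a, b = minutiae[k], minutiae[(k + 1) % count]  # the last minutia pairs with the first
        radius = math.hypot(a.x - cx, a.y - cy)
        side = math.hypot(b.x - a.x, b.y - a.y)
        # Area of the triangle (centroid, a, b) from the cross product.
        area = abs((a.x - cx) * (b.y - cy) - (b.x - cx) * (a.y - cy)) / 2.0
        vector.append((a.kind, a.direction, radius, side, area))
    return vector


square = [
    Minutia(0.0, 0.0, "ridge_end", 0.0),
    Minutia(4.0, 0.0, "bifurcation", 1.0),
    Minutia(4.0, 4.0, "ridge_end", 2.0),
    Minutia(0.0, 4.0, "bifurcation", 3.0),
]
fv = radial_feature_vector("loop", square)
```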
Comparison Approaches
A number of different approaches are possible for the comparison between a feature vector of the above mentioned type which represents the print from an unknown source and a feature vector which represents the print from the known source. A match/not match result may simply be stated. However, substantial benefits exist in making the comparison in such a way that a measure of the strength of a match can be stated.
Likelihood Ratio Approach
One general type of approach which allows the comparison to be expressed in terms of a measure of the strength of the match is the use of a likelihood ratio.
The likelihood ratio is the quotient of two probabilities, one being that of the two feature vectors conditioned on their being from the same source, the other being that of the two feature vectors conditioned on their being from different sources. Feature vectors obtained according to the first data extraction approach and/or second extraction approach described above can be compared in this way, the differences being in the data represented in the feature vectors rather than in the comparison stage itself.
In each case, therefore, the approach can be derived from the expression:
Where the feature vector fv contains the information extracted from the representation and formatted. The addition of the subscript s to this abbreviation denotes that a feature vector comes from the suspect, and the addition of the subscript m denotes that a feature vector originates from the crime scene. The symbol fvs then denotes a feature vector from the known source or suspect, and fvm denotes the feature vector originating from an unknown source at the crime scene. For modelling purposes it is useful to classify a feature vector into discrete quantities (which may include general pattern, region, type, and other data) and continuous quantities (which may include the distances between minutiae, relative directions and other data).
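Read literally, the verbal definition of the likelihood ratio given above corresponds to a quotient of the following general form. This is a sketch under that reading only; the joint-probability notation is an assumption, with Hp and Hd standing for the same-source and different-source hypotheses respectively.

```latex
LR = \frac{\Pr(fv_s,\, fv_m \mid H_p)}{\Pr(fv_s,\, fv_m \mid H_d)}
```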
The preferred forms for the quotient in the context of the first approach and second approach are discussed in more detail below in the context of their use in the data driven approach to the comparison stage.
Within the general concept of a likelihood ratio approach, a number of ways of implementing such an approach exist. One such approach which allows the comparison to be expressed in terms of a measure of the strength of the match is through the use of a data driven approach.
Data Driven Approach
In general terms, the data driven approach involves the consideration of a quotient defined by a numerator which considers the variation in the data which is extracted from different representations of the same fingerprint and by a denominator which considers the variation in the data which is extracted from representations of different fingerprints. The output of the quotient is a likelihood ratio.
In order to quantify the likelihood ratio, the feature vector for the first representation (the crime scene) and the feature vector for the second representation (the suspect) are obtained, as described above. The difference between the two vectors is effectively the distance between them. Once the distance has been obtained, it is compared with two different probability distributions obtained from two different databases.
In the first instance, the probability distribution for these distances is estimated from a database of prints taken from the same finger. A large number of pairings of prints are taken from the database and the distance between them is obtained. This involves a similar approach to that described above. Each of the prints has data extracted from it and that data is formatted as a feature vector. The differences between the two feature vectors give the distance between that pairing. Repeating this process for a large number of pairings gives a range of distances with different frequencies of occurrence. A probability distribution reflecting the variation between prints of the same finger is thus obtained.
Ideally, the database would be obtained from a number of prints taken from the same finger of the suspect. However, the approach can still be applied where the prints are taken from the same finger, but that finger is someone's other than the suspect's. This database needs to reflect how a print (more particularly the resulting triangles and their respective feature vectors) from the same finger changes with pressure and substrate. This database is formed from a significant number of sets of information, each set being a large number of prints taken from the same finger under the full range of conditions encountered in practice. The database is populated by the identification, by an operator, of corresponding triangles in several applications of the same finger. Alternatively, a smaller set of prints can be processed as described above and distortion functions can then be calculated. The preferred method is thin plate splines, but other methods exist. The distortion function can then be applied to other prints to simulate further sets of data.
In the second instance, the probability distribution for these distances is estimated from a database of prints taken from different fingers. Again a large number of pairings of prints are taken from the database and the distance between them obtained. The extraction of data, formatting as a feature vector, calculation of the distance using the two feature vectors and determination of the distribution is performed in the same way, but uses the different database.
This different database needs to reflect how a print (more particularly the resulting triangles and their respective feature vectors) from a number of different fingers varies between fingers and, potentially, with various pressures and substrates involved. Again, the database is populated by the identification, by an operator, of triangles in the various representations obtained from the different fingers of different persons.
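The two sets of distances described above can be gathered as sketched below. This is an illustration under stated assumptions: each database is represented as a mapping from a finger identifier to a list of feature vectors, feature vectors are plain numeric sequences, and squared_distance is only a placeholder for whichever distance measure is actually used. All names are hypothetical.

```python
from itertools import combinations

def squared_distance(fv_a, fv_b):
    """Placeholder distance: sum of squared term-by-term differences."""
    return sum((a - b) ** 2 for a, b in zip(fv_a, fv_b))

def within_finger_distances(database, distance=squared_distance):
    """Distances between pairs of feature vectors from the same finger."""
    out = []
    for vectors in database.values():
        out.extend(distance(a, b) for a, b in combinations(vectors, 2))
    return out

def between_finger_distances(database, distance=squared_distance):
    """Distances between pairs of feature vectors from different fingers."""
    out = []
    for (_, vectors_a), (_, vectors_b) in combinations(database.items(), 2):
        out.extend(distance(a, b) for a in vectors_a for b in vectors_b)
    return out
```

The frequencies of the values returned by each function give the within finger and between finger distributions respectively.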
Having established the manner in which the databases and probability distributions are obtained, the comparison of a crime scene print against a suspect print is considered further.
The numerator may thus be thought of as considering a first representation obtained from a crime scene or an item linked to a crime, against a second representation from a suspect through an approach involving:
comparing the difference of the feature vector value of the first representation and the feature vector value of the second representation with the probability distribution.
The denominator may thus be thought of as considering the second representation obtained from a suspect against a series of representations taken from a population through an approach involving:
Applying the data driven approach, and in the context of the first data extraction approach (Delaunay triangulation), and after some algebraic operations, a probability for the numerator of the likelihood ratio is computed using the following formula:
Num=Σ{Pr(d(fvs,c,fvm,c)|fvs,d,fvm,d,Hp): for all fvs,d and fvm,d such that fvs,d=fvm,d}
where
fv means feature vector, c means continuous, d means discrete, m means mark and s means suspect and therefore:
fvm,c: continuous data of the feature vector from the mark
fvm,d: discrete data of the feature vector from the mark
fvs,c: continuous data of the feature vector from the suspect
fvs,d: discrete data of the feature vector from the suspect
d(fvs,c, fvm,c) is the distance measured between the continuous data of the two feature vectors from the mark and the suspect
Hp is the prosecution hypothesis, that is the two feature vectors originate from the same source.
Notice that conditioning on Hp means that fvs,c and fvm,c become measurements extracted from the same finger of the same person. The subscript in the summation symbol means that the probabilities on the right-hand side of the equation are added up for all the cases where the values of the discrete quantities of the feature vectors coincide. On some occasions some or all of the discrete variables are present in the fingermark. For these cases the index of the summation is replaced by values of the quantities that are not present. The summation symbol is removed when all discrete quantities are present in the fingermark.
The expression d(fvs,c, fvm,c) denotes a distance between the continuous quantities of the feature vectors for the prints. The continuous quantities in a feature vector are the lengths of the triangle sides and the minutia directions relative to the opposite side of the triangle. There are a number of distance measures that can be used, but the distance measure described below is preferred. This distance measure is computed by first subtracting term by term. The result is a vector containing nine quantities. This is then normalised to ensure that the length and angle are given equal weighting. By taking the sum of the squares of the distances from all the feature vectors considered in this way a single value is obtained.
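A distance of this kind can be sketched as follows. The normalisation constants (length_scale and angle_scale) and the wrapping of angular differences are assumptions made for the example; the description only requires that lengths and angles be given equal weighting, and does not specify how.

```python
import math

def normalised_distance(lengths_s, angles_s, lengths_m, angles_m,
                        length_scale=1.0, angle_scale=math.pi):
    """Sum of squared, normalised term-by-term differences."""
    total = 0.0
    for length_s, length_m in zip(lengths_s, lengths_m):
        total += ((length_s - length_m) / length_scale) ** 2
    for angle_s, angle_m in zip(angles_s, angles_m):
        # Assumption: wrap the angular difference into [-pi, pi] so that
        # directions either side of the 0 / 2*pi boundary compare as close.
        diff = math.atan2(math.sin(angle_s - angle_m),
                          math.cos(angle_s - angle_m))
        total += (diff / angle_scale) ** 2
    return total
```

In practice length_scale would be chosen so that a typical length difference and a typical angle difference contribute comparably to the total.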
In such a case, and after some algebraic operations, a probability for the denominator of the likelihood ratio is computed using the following formula,
Den=Σ{Pr(d(fvs,c,fvm,c)|fvs,d,fvm,d,Hd)Pr(fvm,d|Hd): for all fvs,d and fvm,d such that fvs,d=fvm,d}
where
fv means feature vector, c means continuous, d means discrete, m means mark and s means suspect, and therefore:
fvm,c: continuous data of the feature vector from the mark
fvm,d: discrete data of the feature vector from the mark
fvs,c: continuous data of the feature vector from the suspect
fvs,d: discrete data of the feature vector from the suspect
d(fvs,c, fvm,c) is the distance measured between the continuous data of the two feature vectors from the mark and the suspect
Hd is the defence hypothesis, that is the two feature vectors originate from different sources.
Several distance measures exist but the one described above is preferred. The subscript in the summation symbol means that the probabilities on the right-hand side of this equation are added up for all the cases where the values of the discrete quantities of the feature vectors coincide. On some occasions some or all of the discrete variables are present in the fingermark. For these cases the index of the summation is replaced by values of the quantities that are not present. The summation symbol is removed when all discrete quantities are present in the fingermark.
Conditioning on Hd, that is “the prints originated from different sources”, the feature vectors come from different fingers of different people. The probability distribution for distances d(fvs,c, fvm,c) can be estimated from a reference database of fingerprints. This database needs to reflect how much variability there is in respect of all prints (again more particularly the resulting triangles and their feature vectors) between different sources. This database can readily be formed by taking existing records of different source fingerprints and analysing them in the above mentioned way.
The second factor Pr(fvm,d|Hd) is a probability distribution of discrete variables including general pattern. A probability distribution for general pattern was computed based on frequencies compiled by the FBI for the National Crime Information Center in 1993. These data can be found at http://home.att.net/~dermatoglyphics/mfre/. A probability distribution for the remaining discrete variables can be estimated from a reference database using a number of methods. A probability tree is preferred because it can more efficiently code the asymmetry of this distribution, for example, the number of regions depends on the general pattern.
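A two-level probability tree of this kind can be sketched as below. This is an illustration only: it estimates both levels from a list of (general pattern, region) records, whereas the description takes the general pattern frequencies from published figures. The function names and record layout are hypothetical. Because the second level is stored per general pattern, each pattern can carry its own set of regions, which is the asymmetry referred to above.

```python
from collections import Counter, defaultdict

def fit_probability_tree(records):
    """Estimate Pr(GP) and Pr(region | GP) from (general_pattern, region) pairs."""
    gp_counts = Counter(gp for gp, _ in records)
    region_counts = defaultdict(Counter)
    for gp, region in records:
        region_counts[gp][region] += 1
    total = sum(gp_counts.values())
    p_gp = {gp: count / total for gp, count in gp_counts.items()}
    # Second level of the tree: one conditional table per general pattern.
    p_region = {gp: {region: count / gp_counts[gp]
                     for region, count in counts.items()}
                for gp, counts in region_counts.items()}
    return p_gp, p_region

def discrete_probability(tree, gp, region):
    """Pr(GP = gp, region) = Pr(gp) * Pr(region | gp); 0.0 if never observed."""
    p_gp, p_region = tree
    return p_gp.get(gp, 0.0) * p_region.get(gp, {}).get(region, 0.0)
```

Further discrete variables, such as minutia type, would add further levels conditioned on the branches above them.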
Again applying the data driven approach, and in the context of the second data extraction approach (radial triangulation), a probability for the numerator of the likelihood ratio is computed using the following formula:
Num=Pr(d(fvs, fvm)|Hp)
where
d(fvs, fvm) is the distance measured between discrete and continuous data of the two feature vectors from the mark and suspect;
Hp is the prosecution hypothesis, that is the two vectors originate from the same source.
The probability for the denominator is computed using the following formula:
Den=Pr(d(fvs, fvm)|Hd)
where
Hd is the defence hypothesis, that is the two vectors originate from different sources.
In each case, similar approaches to those detailed above can be used to generate the relevant probability distributions.
In the second approach, it is possible to measure the distance between feature vectors in the manner described above for the first data extraction approach, in respect of each orientation of the polygon in the mark and suspect representations. However, the larger number of minutiae which may now be considered in a feature vector (for instance 12) means that there are many more rotations of the feature vector to consider (for instance 12 rotations), compared with the more practical three of the first approach. The use of a greater number of minutiae is desirable as this increases the discriminating power of the process. Investigations to date suggest that by the time 12 minutiae are being considered, there is little or no overlap between the within finger distribution and between finger distributions illustrated in
In a modification, therefore, a feature vector is first considered against another feature vector in terms of only part of the information it contains. In particular, the information apart from the minutia direction can be compared. In the comparison, the data set included in one of the vectors is fixed in orientation and the data set included in the other vector with which it is being compared is rotated. If the data set relates to three minutiae then three rotations are considered; if it relates to twelve then twelve rotations are used. The extent of the fit at each position is considered and the best fit rotation obtained. This leads to the association of minutiae pairs across both feature vectors.
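The rotation search described above can be sketched as follows. This is a minimal illustration, assuming each feature vector is reduced to a list of per-minutia tuples of numeric quantities (for example radius, side length and area) with the direction left out, and that both lists have the same length. The names best_rotation and entry_distance are hypothetical.

```python
def entry_distance(entry_a, entry_b):
    """Squared difference between two per-minutia tuples (direction excluded)."""
    return sum((a - b) ** 2 for a, b in zip(entry_a, entry_b))

def best_rotation(fixed, other, distance=entry_distance):
    """Try every cyclic rotation of `other` against `fixed`; keep the best fit.

    Returns (shift, score, pairs), where pairs associates index k of `fixed`
    with the matching index of `other` under the best rotation.
    """
    n = len(other)
    best_shift, best_score = 0, float("inf")
    for shift in range(n):
        rotated = other[shift:] + other[:shift]
        score = sum(distance(a, b) for a, b in zip(fixed, rotated))
        if score < best_score:
            best_shift, best_score = shift, score
    pairs = [(k, (k + best_shift) % n) for k in range(n)]
    return best_shift, best_score, pairs
```

Only N rotations are scored for N minutiae, and the returned pairs can then be used to compare the minutia directions for the associated minutiae alone.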
In respect of the best fit rotation, in each case, the process then goes on to compare the remaining data in each set, the minutia direction. To achieve this, the minutiae directions are made independent of the orientation of the print on the image. The approach taken on direction is described with reference to
In effect, the match between the polygons is being considered in terms of the minutia type, distance between minutiae, radius between the minutia and the centroid, surface area of the triangle defined between the minutiae and the centroid, and minutia direction. All of these considerations serve to complement one another in the comparison process. One or more may be omitted, however, and a practical comparison still be carried out.
The comparison provides a distance which can be considered against the two distributions in the manner previously described with reference to
Assessing a Comparison Using the Data Driven Approaches
Having extracted the data, formatted it in feature vector form and compared two feature vectors to obtain the distance between them, that distance is compared with the two probability distributions obtained from the two databases to give the assessment of match between the first and second representation.
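The assessment step can be sketched as follows. This is an illustration under assumptions: the two probability distributions are approximated by simple fixed-width histograms of the within finger and between finger distance samples, which is one of several possible estimators and is not stated to be the one used. The names histogram_density and likelihood_ratio are hypothetical.

```python
def histogram_density(samples, bin_width):
    """Return a function estimating the probability density of `samples`."""
    counts = {}
    for value in samples:
        index = int(value // bin_width)
        counts[index] = counts.get(index, 0) + 1
    n = len(samples)

    def density(x):
        return counts.get(int(x // bin_width), 0) / (n * bin_width)

    return density

def likelihood_ratio(distance, within_samples, between_samples, bin_width=1.0):
    """Quotient of the within finger and between finger densities at `distance`."""
    numerator = histogram_density(within_samples, bin_width)(distance)
    denominator = histogram_density(between_samples, bin_width)(distance)
    if denominator == 0.0:
        # No between finger sample falls in this bin; the ratio is unbounded
        # if the numerator is positive and undefined otherwise.
        return float("inf") if numerator > 0.0 else float("nan")
    return numerator / denominator
```

A value above one lends support to the same-source proposition and a value below one to the different-source proposition, with the magnitude indicating the strength of that support.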
The databases used to define the two probability distributions preferably reflect the number of minutiae being considered in the process. Thus different databases are used where three minutiae are being considered than where twelve minutiae are being considered. The manner in which the databases are generated and applied is, generally speaking, the same; variations in the way the distances are calculated are possible without changing the operation of the database set up and use. Equally, it is possible to form the various databases from a common set of data, but with that data being considered using a different number of minutiae to form the database specific to that number of minutiae.
The databases may be generated in advance in respect of the numbers of minutiae expected to be considered in practice, for instance 3 to 12, with the relevant databases being used for the number of minutiae being considered in a particular case, for instance 6. Pre-generation of the databases avoids any delays whilst the databases are generated. However, it is also possible to have to hand the basic data which can be used to generate the databases, and to generate the database required in a specific case in response to the number of minutiae which need to be considered. Thus, a mark may be best considered using six minutiae, and the desire to consider this mark would lead to the database for six minutiae being generated from the basic database of fingerprint representations. The data set size which needs to be stored would be reduced as a result.
In certain circumstances it is also possible to generate the probability distributions in advance. This can occur, for instance, where the within finger variation is being considered and that is considered on the basis of a single (or several) finger(s) not from the suspect. In the case of the model based approach, discussed below, it is possible to generate and store both probability distributions in advance.
Significant benefits from this overall approach arise due to: incorporating distortion and clarity in the numerator of the likelihood ratio; introducing the distance measure between the quantities in the feature vector; the use of a probability distribution for the distances between feature vectors from the same source and its estimation from dedicated sets of data of replicates of the same finger; the use of a probability distribution for the distances between prints from different sources and its estimation from a reference database containing prints from different sources.
The description presented here exemplifies the use of this methodology, but the methodology is readily adapted for use in other forms. For instance, the Delaunay triangulation form could be extended to cover more than three minutiae.
Model Based Approach
Within the general concept of a likelihood ratio approach, another approach which allows the comparison to be expressed in terms of a measure of the strength of the match is through the use of a model based approach.
In such an approach, and after some algebraic operations a probability for the numerator of the likelihood ratio is computed using the following formula,
Num=Σ{Pr(fvm,c|fvs,c,fvs,d,fvm,d,Hp): for all fvs,d and fvm,d such that fvs,d=fvm,d}
where
fv means feature vector, c means continuous, d means discrete, m means mark and s means suspect, and therefore:
fvm,c: continuous data of the feature vector from the mark
fvm,d: discrete data of the feature vector from the mark
fvs,c: continuous data of the feature vector from the suspect
fvs,d: discrete data of the feature vector from the suspect
d(fvs,c, fvm,c) is the distance measured between the continuous data of the two feature vectors from the mark and the suspect
Hp is the prosecution hypothesis, that is the two feature vectors originate from the same source;
As noted before, when conditioning on Hp, the continuous quantities fvs,c and fvm,c become measurements of the same finger and person. The subscript in the summation symbol means that the probabilities on the right-hand side of the equation are added up for all the cases where the values of the discrete quantities of the feature vectors coincide. On some occasions some or all of the discrete variables are present in the fingermark. For these cases the index of the summation is replaced by values of the quantities that are not present. The summation symbol is removed when all discrete quantities are present in the fingermark.
The probability distribution for fvs,c is computed using a Bayesian network estimated from a database of prints taken from the same finger, as described above. Many algorithms exist for estimating the graph and conditional probabilities in a Bayesian network, but the preferred algorithms are the NPC algorithm for estimating an acyclic directed graph, see Steck H., Hofmann, R., and Tresp, V. (1999). Concept for the PRONEL Learning Algorithm, Siemens AG, Munich, and/or the EM algorithm for estimating the conditional probability distributions, see S. L. Lauritzen (1995). The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis, 19:191-201. The contents of both documents, particularly in relation to the algorithms they describe, are incorporated herein by reference.
Further explanation of the use of Bayesian networks follows below.
The manner in which the first representation is considered against the second representation, through the use of a probability distribution, is as described above, save for the probability distribution being computed using the Bayesian network approach rather than a series of example representations of the second representation.
Using this approach and after some algebraic operations a probability for the denominator of the likelihood ratio is computed using the following formula,
Den=Σ{Pr(fvm,c|fvm,d,Hd)Pr(fvm,d|Hd): for all fvs,d and fvm,d such that fvs,d=fvm,d}
where
fv means feature vector, c means continuous, d means discrete, m means mark and s means suspect, and therefore:
fvm,c: continuous data of the feature vector from the mark
fvm,d: discrete data of the feature vector from the mark
fvs,c: continuous data of the feature vector from the suspect
fvs,d: discrete data of the feature vector from the suspect
d(fvs,c, fvm,c) is the distance measured between the continuous data of the two feature vectors from the mark and the suspect
Hd is the defence hypothesis, that is the two feature vectors originate from different sources.
The subscript in the summation symbol means that the probabilities on the right-hand side of the equation are added up for all the cases where the values of the discrete quantities of the feature vectors coincide. On some occasions some or all of the discrete variables are present in the fingermark. For these cases the index of the summation is replaced by values of the quantities that are not present. The summation symbol is removed when all discrete quantities are present in the fingermark.
The probability distribution in the first factor on the right-hand side of the equation above is computed with a Bayesian network estimated from a database of feature vectors extracted from different sources. There are many methods for estimating Bayesian networks, as noted above, but the preferred methods are the NPC algorithm of Steck et al., 1999 for estimating an acyclic directed graph and/or the EM algorithm of Lauritzen, 1995 for the conditional probability distributions. There is a Bayesian network for each combination of values of the discrete variables. The second factor Pr(fvm,d|Hd) is estimated in the same manner as described for the data driven approach above.
Again the approach to considering the second representation against the population representations is as detailed above, save for the probability distribution being computed using the Bayesian network approach.
Assessing a Comparison Using the Model Based Approach
Given a feature vector from a known source fvs and from an unknown source fvm, the numerator is given by the equation above and is calculated with a Bayesian network dedicated to modelling distortion. The second factor in the denominator is calculated in the same manner as with the data driven approach. The first factor is computed using Bayesian networks. A Bayesian network is selected for the combination of values of fvm,d, which is then used for computing a probability Pr(fvm,c|fvm,d,Hd). This process is repeated for all values in the index of the summation. The likelihood ratio is then obtained by computing the quotient of the numerator over the denominator.
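The way the numerator and denominator are assembled can be sketched as follows. This is only an outline under assumptions: the Bayesian networks themselves are represented by plain callables returning a probability, one shared callable for the numerator and one per combination of discrete values for the denominator. All names (model_based_lr, numerator_model, denominator_models, discrete_prior) are hypothetical, and no particular Bayesian network library is implied.

```python
def model_based_lr(fv_s_c, fv_m_c, candidate_discrete, numerator_model,
                   denominator_models, discrete_prior):
    """Likelihood ratio summed over the discrete value combinations.

    candidate_discrete: combinations in the index of the summation (a single
        combination when all discrete quantities are present in the mark).
    numerator_model(fv_m_c, fv_s_c, d): Pr(fvm,c | fvs,c, fvs,d, fvm,d, Hp).
    denominator_models[d](fv_m_c): Pr(fvm,c | fvm,d = d, Hd).
    discrete_prior[d]: Pr(fvm,d = d | Hd).
    """
    numerator = sum(numerator_model(fv_m_c, fv_s_c, d)
                    for d in candidate_discrete)
    denominator = sum(denominator_models[d](fv_m_c) * discrete_prior[d]
                      for d in candidate_discrete)
    return numerator / denominator
```

Selecting denominator_models[d] inside the sum mirrors the selection of one Bayesian network for each combination of discrete values.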
Significant benefits from this approach arise due to: using Bayesian networks for computing the numerator and denominator of the likelihood ratio; estimating Bayesian networks for the numerator from dedicated databases containing replicates of the same finger under several distortion conditions; estimating Bayesian networks for the denominator from dedicated databases containing prints from different fingers and people.
The description above is an example of using Bayesian networks for calculating the likelihood ratio, but the invention is not limited to it. Another example is estimating one Bayesian network per general pattern. This invention can also be used for more than three minutiae by defining suitable feature vectors.
As mentioned above, in order to estimate the numerator and denominator in the above likelihood ratio consideration, it is possible to use a Bayesian network representation to specify a probability distribution. For brevity of explanation the concept of a Bayesian network is presented through an example. A Bayesian network is an acyclic directed graph together with conditional probabilities associated with the nodes of the graph. Each node in the graph represents a quantity and the arrows represent dependencies between the quantities.
p(x,y,z)=p(x)p(y|x)p(z|y) for all x,y,z
and so the joint distribution is completely specified by the graph and the conditional probability distributions {p(x): for all x}, {p(y|x): for all x and y} and {p(z|y): for all z and y}. A detailed presentation on Bayesian networks can be found in a number of books, such as Cowell, R. G., Dawid A. P., Lauritzen S. L. and Spiegelhalter D. J. (1999) “Probabilistic networks and expert systems”.
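The factorisation above can be illustrated with a small sketch. The three quantities are taken to be discrete and the conditional probability tables are stored as nested dictionaries; both the representation and the function names are assumptions made for the example.

```python
from itertools import product

def joint(p_x, p_y_given_x, p_z_given_y, x, y, z):
    """p(x, y, z) = p(x) p(y|x) p(z|y) for the chain x -> y -> z."""
    return p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]

def total_probability(p_x, p_y_given_x, p_z_given_y):
    """Sum of the joint over every (x, y, z); 1 when the tables are valid."""
    ys = next(iter(p_y_given_x.values())).keys()
    zs = next(iter(p_z_given_y.values())).keys()
    return sum(joint(p_x, p_y_given_x, p_z_given_y, x, y, z)
               for x, y, z in product(p_x, ys, zs))
```

Only the three small tables need to be stored, rather than one entry for every combination of x, y and z, which is the economy the graph provides.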
Number | Date | Country | Kind |
---|---|---|---|
0422786.4 | Oct 2004 | GB | national |
0502893.3 | Feb 2005 | GB | national |