The present invention relates to computer content searching, especially for extracts common to two files.
More especially, this involves searching for at least one extract common to a first file and to a second file, in the form of binary data.
The techniques known at present propose a search for identical content, generally data item by data item. For applications involving large files, the slowness of such a search becomes crippling.
The present invention aims to improve the situation.
Accordingly, it proposes a method of searching content which comprises a prior preparation of the first file at least, comprising the following steps:
In step b), a data packet is assigned the state:
In a preferred embodiment, a processing prior to step b) is applied to the data of a file, said processing comprising the following steps:
Advantageously, the application of said digital filter amounts to:
The low-pass filter operates on a frequency band comprising substantially the interval:
[−Fe/(2(k−1)), +Fe/(2(k−1))],
where Fe is said sampling frequency,
and k is the number of samples that a packet comprises.
Advantageously, said digital filter comprises a predetermined number of coefficients of like value,
and the frequency response of the associated low-pass filter is expressed, as a function of frequency f, by an expression of the type:
sin(π·f·T)/(π·f·T),
where sin( ) is the sine function, and with:
Preferably, said digital filter is a mean value filter with a predetermined number of coefficients; the difference between two successive filtered samples is proportional to the difference between two unfiltered samples, respectively of a first rank and of a second rank, which are spaced apart by said predetermined number of coefficients; and the calculation of said filtered samples is performed by utilizing this relation to reduce the number of calculation operations to be performed.
Said predetermined number of coefficients of the filter is greater than or equal to 2k−1, where k is the number of samples that a packet comprises, which value may be designated hereinafter by the term index ratio.
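By way of purely illustrative example, the reduction of the number of operations evoked above can be sketched in Python with a sliding-sum mean value filter (the function name, the normalization by K and the list-based representation are choices of this sketch, not features imposed by the method):

```python
def mean_filter(samples, K):
    """Mean value filter of K equal coefficients, computed with a sliding
    sum: each new filtered sample is obtained from the previous one by
    adding the incoming unfiltered sample and subtracting the outgoing
    one, i.e. two operations per output instead of K."""
    if len(samples) < K:
        return []
    out = []
    acc = sum(samples[:K])            # the full sum is computed only once
    out.append(acc / K)
    for i in range(1, len(samples) - K + 1):
        acc += samples[i + K - 1] - samples[i - 1]   # sliding-window update
        out.append(acc / K)
    return out
```

The relation invoked in the text is visible here: two successive filtered samples differ by (f[n+K] − f[n])/K, that is, by the difference of two unfiltered samples spaced apart by the number of coefficients.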
Preferably:
Advantageously, for any filtered sample rn of given order n, said reference value is calculated by averaging the values of the unfiltered samples fk over a chosen number of unfiltered consecutive samples about an unfiltered sample fn of the same given order n.
The values of the filtered samples are made relative, for comparison, to a zero threshold value,
and the filtered samples r′n are expressed by a sum of the type:
In an advantageous embodiment, said sum is applied to the unfiltered samples fn a plurality of times, according to a processing performed in parallel, while respectively varying the number of coefficients K. This measure then makes it possible to determine a plurality of digital signatures, substantially statistically independent.
In a particular embodiment, the fuzzy states associated with the first file at least are each coded on at least two bits.
In this embodiment, the fuzzy states determined for the smallest number of coefficients K are coded on the least significant bits, and the fuzzy states determined for larger numbers of coefficients K are coded on the subsequent bits, up to a chosen total number of bits. It will be understood that this chosen number may advantageously be adapted to the binary data size used by the microprocessors of computer entities for comparison logic operations.
Preferably, each filtered sample rn is expressed as a sum of the type:
This measure advantageously makes it possible to ensure an overlap of a packet of k data which is used for the calculation of a single digital signature data item.
In this embodiment,
For an application in which the two files to be compared comprise data representative of alphanumeric characters, in particular of text and/or of a computer or genetic code,
the method advantageously comprises:
For the optimization of the chosen number k of samples per packet, account is advantageously taken of a total number:
The method advantageously provides for a step in the course of which a cue relating to a minimum desired size of the common extracts searched for is obtained and used to optimize said chosen number k of samples per packet. This optimal number k of samples per packet varies substantially as said minimum size, so that the larger the desired minimum size of the common extracts searched for, the more the total number of comparison operations decreases, and therefore the shorter the duration of the search for common extracts.
For other applications such as searching of content of audio, video or other files, the search for common extracts preferably consists of a single group of steps comprising the formation of the digital signatures and their comparison. The number of data items per packet is then optimized by initially fixing a confidence index characterizing an acceptable threshold of probability of false detection of common extracts.
In a preferred general embodiment, for the first file:
In this embodiment,
Of course, the method within the meaning of the present invention is implemented by computer means such as a computer program product, described later. In this regard, the invention is also aimed at such a computer program product, as well as a device, such as a computer entity, comprising such a program in one of its memories. The invention is also aimed at a system of computer entities of this type, that communicate, as will be seen later.
This computer program is capable in particular of generating a digital signature of a file of binary data, this digital signature thereafter being compared with another signature for the search for common extract. It will be understood that the digital signature of any data file, which signature is formulated by the method within the meaning of the invention, is an essential means for undertaking the comparison step. In this regard, the present invention is also aimed at the data structure of this digital signature.
Other characteristics and advantages of the invention will become apparent on examining the detailed description hereinbelow, and the appended drawings in which:
The method within the meaning of the invention consists in inter-comparing computer files so as to search therein for all the possible common extracts. The examination pertains directly to the binary representation of the data which constitute the files and, advantageously, does not therefore require prior knowledge of the format of the files. Moreover, the files to be compared may be of any nature, such as for example, text files, multimedia files comprising sounds or images, data files, or the like.
Each file is represented in the form of a one-dimensional array in which the binary data are arranged with the same order as that used for storage on disk. The binary data are bytes (8-bit words). The array is therefore of the same size as that of the file, in bytes. Each cell of the array is labeled by an address. According to the conventions used in programming, the address 0 points to the first cell of the array, the address 1 to the next cell, and so on and so forth.
The term “extract”, especially in the formula “common extract”, is understood as follows. It denotes a sequence of consecutive data, said sequence being obtained by copying the binary data of a file commencing from a determined start address. This sequence is itself represented in the form of a binary data array with which is associated a start address which makes it possible to label the extract in the original file. It is indicated that the binary data are bytes (8-bit words). Each data item is represented by the integer number (lying between 0 and 255) which is obtained by the base-2 weighted sum of the bits of the byte:
B0 + 2¹B1 + . . . + 2⁷B7
The array therefore clearly has the same size as that of the extract (in bytes). This size of extract may lie between 1 and that of the file.
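As a purely illustrative check of this base-2 weighting (the helper `byte_value` is hypothetical, not part of the method):

```python
def byte_value(bits):
    """Integer value of a byte given its bits [B0, ..., B7], B0 being the
    least significant: B0 + 2*B1 + ... + 2**7 * B7."""
    assert len(bits) == 8
    return sum(b << i for i, b in enumerate(bits))
```

For example, the bits of the ASCII code of “A” (65) are B0 = 1 and B6 = 1, all others being zero.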
In the example of a document stored in a file in text format, an extract could for example be a word, a phrase or a page of text.
For the method within the meaning of the invention, the expression “extract common to two files” is understood as follows. This entails a sequence of consecutive data whose content is fixed and which may be obtained either by copying the binary data of the first file commencing from a determined start address, or by copying the binary data of the second file commencing from another determined start address. Stated otherwise, if an extract is lifted from each file commencing from the labeled start positions, the condition of common extract will be achieved if there is perfect identity of the contents carried by the first binary data item of each extract, then of those carried by the next binary data item, and so on and so forth. Typically, in the case of text format files, each byte carries the ASCII code of a printable character (Latin alphabet, digit, punctuation mark, and the like). The perfect identity of the contents of two bytes is therefore equivalent to perfect identity of the characters coded by these bytes. Any common extract found is labeled by a pair of start addresses (one per file) and by a size expressed as a number of bytes.
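The common extract condition just stated can be sketched, purely illustratively, as a byte-for-byte identity test (the function name is hypothetical):

```python
def is_common_extract(f1, a1, f2, a2, size):
    """True if the `size` bytes of f1 commencing at start address a1 are
    identical, byte for byte, to the `size` bytes of f2 commencing at a2."""
    if a1 + size > len(f1) or a2 + size > len(f2):
        return False          # the extract must fit inside each file
    return all(f1[a1 + i] == f2[a2 + i] for i in range(size))
```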
Described hereinbelow is an exemplary extract taken from a short text file. The text chosen is “Le lièvre et La tortue”. Its representation in the form of a file in text mode is represented by way of example in the array below. The size of the file is 22 bytes. The binary data (bytes) carry the ASCII codes which are associated with each character of the text and are displayed in integer mode.
The “lièvre” extract is found in the file. Its representation in the form of a data array is in the next array. It occupies 6 binary data items. Its start position in the file is the address 3.
An example of extracts common to two short text files is now described. The texts chosen are “Le lièvre” and “La tortue”. The representations in the form of files in text mode are those of the array below. The size of each file is 9 bytes. The binary data (bytes) are displayed in integer mode.
There are therefore five extracts common to the files. They are presented in ascending order of start addresses on the first file:
It is indicated that the characters “L” and “l” are distinct since the values of their ASCII codes are different.
In order to avoid a profusion of search results, a value of the minimum size of the common extracts to be found is used as selection criterion. It is easily understood that the probability of finding extracts decreases as the size of the extracts to be searched for increases. Consequently, if two files are intercompared, the number of common extracts found will decrease as the minimum size of the extracts to be found increases.
With the same aim, one tries moreover to eliminate the search results which overlap. This processing is advised but is not indispensable. Its complete implementation in fact requires storing the whole set of search results so as to be able to eliminate therefrom those which are overlapped by other search results.
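The elimination of overlapped results can be sketched as follows (illustrative only; each search result is held as a (start1, start2, size) triple, and an extract is discarded when a strictly larger extract covers it on both files):

```python
def remove_overlapped(extracts):
    """Drop every common extract that is entirely covered, on both files,
    by another, strictly larger common extract."""
    def covered(e, f):
        (a1, a2, s), (b1, b2, t) = e, f
        return (t > s and b1 <= a1 and b2 <= a2
                and a1 + s <= b1 + t and a2 + s <= b2 + t)
    return [e for e in extracts
            if not any(covered(e, f) for f in extracts if f != e)]
```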
Described hereinbelow is another example of extracts, common to two short text files. The texts chosen are “Un mouton” and “Des moutons”. The minimum size of the common extracts searched for is 6 bytes. The binary data (bytes) are displayed in integer mode.
The representations in the form of files in text mode are in the array below.
An extract common to the files is found: “mouton” at position (2, 3) and of size 7.
As indicated above, the “ ” (space) character is treated as a data item. Two common extracts of size 6 are eliminated from the search results since they are overlapped by the extract “mouton” of larger size (7). We have:
These basic principles being defined, a so-called “conventional” search algorithm using said principles is now described. Globally, the search strategy implemented is to examine all the possible pairs of start positions which can be taken by a common extract on the two files to be compared. The algorithm described here is designated by the term “conventional”. However, this designation does not necessarily imply that it can be found in the prior art. It should simply be understood that the algorithm within the meaning of the present invention performs extra operations, in particular for formulating digital signatures, which will be described later.
For each value of pair of start positions (one start position per file), a comparison is performed between the extracts which can be lifted from each file. This comparison indicates whether the common extract condition is achieved and determines the maximum size of the common extract found for the pair of start positions that is considered. As appropriate, this size is finally compared with the value of the minimum size of the common extracts to be found.
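Purely by way of illustration, this exhaustive strategy may be sketched in Python (a naive rendering, not the claimed implementation; overlapped results are not eliminated here):

```python
def common_extracts(f1, f2, min_size):
    """Conventional search: for every pair of start positions (n, m),
    measure the longest run of identical bytes from there and keep it if
    it reaches the minimum size of the common extracts to be found."""
    results = []
    for n in range(len(f1)):
        for m in range(len(f2)):
            size = 0
            while (n + size < len(f1) and m + size < len(f2)
                   and f1[n + size] == f2[m + size]):
                size += 1
            if size >= min_size:
                results.append((n, m, size))
    return results
```

On the “Un mouton” / “Des moutons” example described later, this sketch finds the extract of size 7 at (2, 3) together with the overlapped extract of size 6 at (3, 4).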
For any pair of start positions on the files, one and the same succession of steps is used to identify the existence of a common extract. The pairs of start positions are tested with the following predefined order:
In the case where the search has been stopped so as to display a common extract found at the position (n, m), the search for other common extracts resumes commencing from the next pair of start positions:
Referring to
Described hereinbelow is a two-dimensional representation using an array represented in
The vertical axis A1 carries the addresses of the data of the first file. The horizontal axis A2 carries the addresses of the data of the second file. Each cell (m, n) of the array represents a pair of start positions to be evaluated to search for a common extract.
For the example, the size of the first file equals 6 (addresses 0 to 5) and that of the second file equals 10 (addresses 0 to 9). The arrows F in the array indicate the direction of movement which is used to test the whole set of possible pairs of start positions of common extracts to be found.
The example represented in
As the computer programming tools impose constraints on the size of the data arrays that can be used in programs, a computer program employing this algorithm preferentially proceeds to a prior splitting of the files into consecutive data blocks of reduced size (the split takes account of necessary overlaps between blocks making it possible to guarantee the test of the whole set of pairs of start positions of common extracts to be searched for). The algorithm is then applied to the whole set of possible combinations of pairs of data blocks. The order of comparison of the pairs of data blocks is analogous to that described previously, namely via the pairs of start positions of extracts. However, simply here, the comparison pertains to blocks of data rather than pertaining to isolated data. Typically, the first block of the first file is compared with the first block of the second file, then with the subsequent blocks of the second file. The next block of the first file is then compared with the first block of the second file, followed by the subsequent blocks of the second file, . . . , and so on and so forth until the last block of each file is reached.
In terms of performance, the execution time of the search engine program in “full text” mode (that is to say by analysis of the entirety of the content of the files) depends essentially on the number of comparisons to be performed between data. This parameter is the most important one but is not the only one since account must be taken also of the speed of transfer of the data between disk and random access memory (RAM), and then between RAM memory and microprocessor. The minimum number of comparisons to be performed between data to accomplish the search for a common extract of size 1 is equal to the product:
(size of the first file)×(size of the second file)
For the search for common extracts of minimum size n, the search algorithm is optimized so as to eliminate the end-of-file positions from the possible pairs of start positions to be analyzed. In this case, the minimum number of comparisons between data to be performed is reduced to the product:
(size of the first file − n + 1) × (size of the second file − n + 1)
For large size files, the value of this number remains close to that of the product of the sizes of the files.
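These counts may be checked with a trivial sketch (illustrative only; the function name is hypothetical):

```python
def comparison_count(size1, size2, n=1):
    """Minimum number of data comparisons for the search for common
    extracts of minimum size n, end-of-file start positions excluded."""
    return (size1 - n + 1) * (size2 - n + 1)
```

For two files of one megabyte and n = 100, the count remains within about 0.02% of the plain product of the file sizes, illustrating the remark above.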
The program according to the conventional search algorithm uses this value to estimate the total duration and the speed of search by interpolation of the number of pairs of start positions already tested and of the search time elapsed.
The algorithm for searching for common extracts within the meaning of the present invention is now described.
Globally, one seeks to improve the search performance by reducing, relative to the conventional algorithm, the number of comparison operations to be performed between data. The approach employed here is to perform the searches in two passes: a coarse search on the files, which rapidly eliminates file portions that do not comprise any common extracts, followed by a fine search on the remaining file portions, using an algorithm much like the conventional algorithm described above. However, as will be seen later for certain cases of files, the second pass is not always necessary and is preferentially used for text files to be compared.
For the coarse search, the algorithm within the meaning of the invention implements an advantageous calculation of digital signatures on the files to be compared. The “digital signatures” may be regarded as files or as arrays of data whose size is less than that of the files from which these signatures emanate.
Digital signatures have the property of being able to be used as indices of the files which are associated with them. Furthermore, a mathematical relation makes it possible to match up any extract of a digital signature with a corresponding precise portion of the file which is associated with it. Moreover, the start position of a digital signature extract corresponds to a fixed number of start positions of extracts on the file which is associated with the digital signature. Conversely, beyond a certain size of extract, any data extract taken from a file may be associated with an extract of the digital signature. Digital signatures also have the property of being able to be compared with one another to identify common extracts of signatures.
It is indicated however that the definition of the common extracts of digital signatures and the mathematical operations used to perform the comparisons of digital signatures are different from those which were described hereinabove in respect of the search for extracts common to files. The index properties of digital signatures are utilized to interpret the results of the search for common extracts of signatures. Specifically, for a determined pair of start positions (one per digital signature), the absence of any common extract is conveyed mathematically by an absence of common extract between two portions of file (one portion per file associated with each digital signature). Inversely, a common extract found between two digital signatures is conveyed by the possible existence of an extract common to two portions of files (one portion per file associated with each signature).
The search for the extracts common to files is performed only on the file portions which are labeled by the positive results of search for common extracts of digital signatures. Any common extract of digital signatures is labeled by a pair of start positions in each signature, and each signature start position correspondingly matches with a file portion delimited by a fixed integer number (N) of start positions in the file. Each common extract of digital signatures which is found is therefore manifested as a search for common extract between files on a reduced set of (N×N) pairs of start positions to be tested. Inversely, each pair of start positions which is characterized by an absence of common extract of digital signatures is manifested as a saving of search of common extract between files on a set of (N×N) pairs of start positions to be tested.
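The (N × N) set evoked above may be sketched as follows (illustrative; `candidate_pairs` is a hypothetical helper mapping one pair of signature start positions to the file start positions it covers):

```python
def candidate_pairs(p, q, N):
    """File start-position pairs to re-test after a common signature
    extract is found at signature start positions (p, q), for an index
    ratio N: each signature position covers N consecutive file positions."""
    return [(i, j)
            for i in range(p * N, (p + 1) * N)
            for j in range(q * N, (q + 1) * N)]
```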
The calculation of the digital signatures conditions the value of minimum size of the common extracts to be found between files. The fixed number (N) of positions of start of extract on the file matching each digital signature data item is an adjustable parameter of the processing for calculating the digital signatures.
The value of the minimum size of the common extracts of files which may be found with the coarse search algorithm is determined on the basis of this number by means of a mathematical formula that will be described in detail hereinbelow. This value increases as that of the fixed number N of positions increases. Hereinafter, this number N is designated by the term “index ratio”.
It will be seen later and in detail that the algorithm for searching for common extracts of digital signatures has some similarities with the conventional algorithm for searching for extracts common to files.
It is indicated simply here that the search strategy implemented is to examine all the possible pairs of start positions that can be taken by a common extract on the two digital signatures to be compared. The minimum size of the common extract of digital signatures to be found is determined by means of a mathematical formula that will be described later, on the basis of the value of the index ratio and of the minimum size of the common extracts of files to be found.
For each value of pair of start positions (one start position per digital signature), a comparison is performed between the extracts which can be lifted from each digital signature.
Thus, globally, the algorithm within the meaning of the invention chains together the following search steps:
The principle of the algorithm within the meaning of the invention is now described in greater detail. Referring to
Represented in
Operations of calculation and of comparison of the digital signatures are now described in detail.
The calculation of the data of digital signatures uses a mathematical theory of fuzzy logic.
Customarily, binary logic uses a data bit to code two logic states. The code 0 is associated with the state “false”, while the code 1 is associated with the state “true”.
Binary logic employs a set of logic operations for comparing between binary states, as is represented on the truth tables of
An 8-bit data item (one byte) can store 8 independent binary states.
Compared to binary logic, fuzzy logic uses two extra states, namely the undetermined state “?” (at one and the same time true and false) and the prohibited state “X” (neither true nor false).
The 4 fuzzy logic states are coded on two bits, as is represented in
An 8-bit data item (one byte) can thus store 4 independent fuzzy states.
Fuzzy logic employs a set of logic operations for comparing between fuzzy states such as represented in
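The four fuzzy states lend themselves to a compact model in which each state is the subset of {0, 1} it admits: “0” and “1” are singletons, “?” admits both values at once, and “X” admits neither. Under this model (an assumption of this sketch; the patent's truth tables are carried by its figures), fuzzy OR is set union and fuzzy AND is set intersection:

```python
# Fuzzy states as subsets of {0, 1}: "?" admits both truth values, "X" none.
STATES = {"0": {0}, "1": {1}, "?": {0, 1}, "X": set()}
NAMES = {frozenset(v): k for k, v in STATES.items()}

def fuzzy_or(a, b):
    """Fuzzy OR modelled as set union of the admitted values."""
    return NAMES[frozenset(STATES[a] | STATES[b])]

def fuzzy_and(a, b):
    """Fuzzy AND modelled as set intersection of the admitted values."""
    return NAMES[frozenset(STATES[a] & STATES[b])]
```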
It is indicated that, in the context of the invention, the calculation of digital signatures uses the OR operation to determine a fuzzy state common to a block of consecutive data of the file associated with the signature. At the outset, a binary state (0 or 1) is associated with each address of a data item, in a block of data of the file. The size of the data block is equal to the index ratio, as indicated hereinabove. The binary states are thereafter intercompared to determine the fuzzy state “0”, “1” or “?” of a data item of the digital signature. A digital signature data item is thereafter associated with the data block of the file.
Thereafter, the comparison of the digital signatures, properly speaking, uses the AND operation to determine whether or not it is possible to have an extract common to the files. The decisions are therefore taken as a function of the fuzzy logic state which is taken by the result of the AND operation applied to pairs of data of digital signatures.
The prohibited state X signifies that there is no common extract between the files in the data zones which are associated with the current pair of positions of start of common extract of digital signatures (with one block per digital signature data item). This case will be described in detail later. The states “0”, “1” or “?” signify inversely that there is a possibility of common extract between the files in the data zones which are associated with the current pair of positions of start of common extract of digital signatures.
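The chaining of the two operations may be sketched as follows (illustrative only: the threshold law standing in for the elided binary-state determination law, and the function names, are assumptions of this sketch):

```python
def binary_states(data, threshold=109):
    """One binary state per byte: 1 if the byte's integer value is at
    least `threshold`, else 0 (a hypothetical determination law)."""
    return [1 if b >= threshold else 0 for b in data]

def signature(states, N):
    """Fuzzy signature: each block of N consecutive binary states (N
    being the index ratio) is reduced by OR to '0', '1' or '?' (mixed)."""
    sig = []
    for i in range(0, len(states) - N + 1, N):
        block = set(states[i:i + N])
        sig.append("?" if len(block) > 1 else str(block.pop()))
    return sig

def may_match(a, b):
    """AND comparison of two signature data items: only '0' against '1'
    yields the prohibited state X, ruling out any common extract in the
    associated file zones."""
    return {a, b} != {"0", "1"}
```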
Referring to
In the example of
Examples of calculating digital signatures, with a chosen text, “La tortue”, are described hereinbelow. Each character of the text is coded on a byte employing the ASCII code. Each ASCII code is represented by the value of the integer number which is coded by the 8 bits of the byte. This number lies between 0 and 255. The binary states which are associated with each data address are determined, by way of example through a law of the type:
The array of
Represented in
Described hereinbelow are the mathematical laws used for the calculation of the digital signatures, in a preferred embodiment. The description which follows supplements the first aforesaid step of calculating a binary signature of the search algorithm within the meaning of the invention and describes the mathematical laws which are used to determine the binary states which are associated with each data address of the file. In the examples above, each binary state of digital signature is determined by a simple law which rests upon the comparison of the integer value of the code of each byte of the file with an integer reference value. The benefit of this law is limited however, since each binary signature data item characterizes only a single data item of a file at a time. The interpretation of the result of the comparisons between data of fuzzy signatures (which are obtained in the second step of the calculation) is thus limited to the possible existence of extracts common to the files of size 1. The possible absence or existence of an extract common to the files of size greater than 1 cannot be detected by a single operation of comparison between fuzzy signature data. To remedy this situation, the mathematical laws for determining the states of the binary signature are chosen in such a way that each data item of a binary signature characterizes an extract of preferentially fixed size of the file. The size of the data extracts is a parameter of the mathematical law for determining the states of the binary signature. The value of this parameter is always greater than or equal to that of the index ratio. 
By virtue of this condition, the result of a comparison between a pair of fuzzy signatures data may be interpreted either through the absence or through the possible existence of a common extract of file of size at least equal to the index ratio (N) from among the set (N×N) of pairs of positions of start of common extract of file which is associated with the pair of fuzzy signatures data.
Likewise, a common extract found of size K between digital signatures is interpreted through the possible existence of a common extract of file of size at least equal to N×K from among the set (N×N) of pairs of positions of start of common extract of file which is associated with the pair of start positions of the common extract found of digital signatures.
It will also be understood that the proportion of “?” fuzzy states increases as the size of the index ratio increases. Consequently, the step of searching for common extracts between digital signatures becomes much less selective when the index ratio increases. Specifically, if the data of a digital signature are all equal to the “?” state, the comparison of this signature with another digital signature will not eliminate any pair of start positions of extract to be searched for on the files which are associated with the signatures. To remedy this situation, the law for determining the binary states must be chosen in such a way that the step of calculating the fuzzy states (by comparing blocks of binary states) generates a small proportion of “?” states and inversely a high proportion of “0” or “1” states.
Described hereinbelow is a processing for improving the selectivity of the digital signatures. The explanations which follow use results of mathematical theories from the areas of the algebra of transformations and digital signal processing.
It is recalled that the Fourier transformation is a mathematical transformation which matches a function f(t) of the variable t with another function F(f) of the variable f according to the following formula:
A property of the Fourier transformation is reciprocity, making it possible to obtain the function f(t) backwards from F(f) through the following formula:
This formula indicates that any real function f(t) may be decomposed into an infinite sum of pure cosinusoid functions of frequency f, of amplitude 2.|F(f)| and of phase φ(f).
The variations of the function cos(2πft+φ) are represented in
The latter property will be exploited for the choice of the laws for determining the binary signatures. A law States(t,p) for determining fuzzy states with two variables is associated with the function s(t)=cos(2πft+φ). We put T=1/f.
The law States(t,p) is defined for any real value of t and for any positive real value of the parameter p (to be compared with the aforesaid index ratio):
States(t,p)=1 if ∀ x ε[t, t+p], s(x)>0
States(t,p)=0 if ∀ x ε[t, t+p], s(x)<0
States(t,p)=? otherwise
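This law may be evaluated numerically, purely by way of illustration (the dense grid stands in for the “for all x” condition, and the default parameters are arbitrary choices of this sketch):

```python
import math

def state_s(t, p, f=1.0, phi=0.0, samples=200):
    """Fuzzy state of s(x) = cos(2*pi*f*x + phi) over the window [t, t+p]:
    '1' if s stays positive, '0' if it stays negative, '?' otherwise."""
    vals = [math.cos(2 * math.pi * f * (t + p * i / samples) + phi)
            for i in range(samples + 1)]
    if all(v > 0 for v in vals):
        return "1"
    if all(v < 0 for v in vals):
        return "0"
    return "?"
```

With f = 1 (hence T = 1), the cosine is positive on (−T/4, T/4): a window held inside that interval yields the state 1, while a window straddling T/4 yields the state ?.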
Represented in
Represented in
Hereinbelow, the following notation is used:
The following results are obtained for the law States(t,p):
It is again recalled that the probabilities of drawing the fuzzy states were obtained after applying the law States(t,p) for determining the fuzzy states to the function s(t)=cos(2πft+φ). It will also be remarked that the probability of drawing the fuzzy states does not depend on the phase φ of the function s(t)=cos(2πft+φ).
Referring to
We will now seek to apply this observation to the comparison of binary data within the meaning of the invention.
The sampling of a function f(t) of the variable t consists in logging the values which are taken by this function at instants tn which are spaced apart by a fixed interval Te.
The following notation is used:
In the theory of signal processing, Shannon's theorem shows that the original of a function f(t) can be obtained backwards from the samples fn if the frequency spectrum of the Fourier transform F(f) associated with f(t) is strictly bounded by the interval [−Fe/2, Fe/2], with Fe=1/Te.
Under this condition, the function f(t) is obtained after applying an ideal low-pass filtering in the frequency band [−Fe/2, Fe/2] to the Fourier transform of the sampled signal F(f).
In what follows, it is considered that the data files exhibit samples fn of a function f(t) which satisfies the above conditions. In particular, each data address corresponds to a sample number n. Each data item stores the value of a sample (typically an integer coded on the bits of a byte).
The Fourier transform of the signal associated with the samples fn of a data file is as follows:
It will be noted that the choice of the sampling period Te is free here.
The Fourier transform is also expressed in this case by the following simplified formula:
The Fourier transform F(f) of the original of the function f(t) which is associated with the samples fn is obtained by applying Shannon's theorem:
F(f)=F̂(f)/Fe for fε[−Fe/2, Fe/2]
F(f)=0 for the other values of f
The function f(t) which is associated with the samples fn is obtained by applying the inverse Fourier transform.
and is finally expressed in the form of a finite sum of terms as:
f(x)=sin(x)/x, where x=πFe(t−nTe), i.e.:
Represented in
It is indicated that the above relations between a function f(t) and a set of samples fn=f(nTe) apply for any function which satisfies the Shannon conditions.
They therefore also apply for the function s(t)=cos(2πft+φ) if the following condition holds:
f ∈ [−Fe/2, Fe/2]
It is then possible to represent s(t) by an infinite set of samples sn taken over s(t) at the instants tn=nTe.
We recall the law States(t,p) defined above for any real value of t and for any positive real value of p:
States(t,p)=1 if ∀ x ∈ [t, t+p], s(x)>0
States(t,p)=0 if ∀ x ∈ [t, t+p], s(x)<0
States(t,p)=? otherwise
The properties of this law may be transposed simply into the domain of the samples sn if we are interested in the following law for determining fuzzy states, defined over a sequence of k consecutive samples {sn, sn+1, . . . , sn+k−1}:
States(n,k)=1 if ∀ i ∈ {0, …, k−1}, sn+i>0
States(n,k)=0 if ∀ i ∈ {0, …, k−1}, sn+i<0
States(n,k)=? otherwise
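The law above can be sketched as a small routine; this is a minimal illustration in which the function name and the list-based representation of the samples are assumptions, not from the source:

```python
def state(samples, n, k):
    """Fuzzy state over the k consecutive samples starting at index n:
    1 if all are strictly positive, 0 if all are strictly negative,
    '?' (undetermined) otherwise."""
    window = samples[n:n + k]
    if all(s > 0 for s in window):
        return 1
    if all(s < 0 for s in window):
        return 0
    return "?"
```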
The probabilities of drawing the fuzzy states associated with the law States(n,k) are obtained simply on the basis of the law States(t,p) by replacing p by (k−1)Te.
We thus obtain the graphical representation of the probabilities of drawing the states 1 or 0 of the law States(n,k) as a function of the frequency of the function s(t) associated with the samples sn.
In the example of
The definition of the laws for determining fuzzy states will be extended to the case of any function f(t) which satisfies Shannon's conditions. In this general case, the law Statef(t,p) is defined for any real value of t and for any positive real value of p:
Statef(t,p)=1 if ∀ x ∈ [t, t+p], f(x)>0
Statef(t,p)=0 if ∀ x ∈ [t, t+p], f(x)<0
Statef(t,p)=? otherwise
This law for determining the fuzzy states is also transposed into the domain of the samples fn over sequences of k consecutive samples {fn, fn+1, . . . , fn+k−1}:
Statef(n,k)=1 if ∀ i ∈ {0, …, k−1}, fn+i>0
Statef(n,k)=0 if ∀ i ∈ {0, …, k−1}, fn+i<0
Statef(n,k)=? otherwise
Contrary to the particular case already treated where f(t) is a pure sinusoid of frequency f, there is no simple mathematical relation which makes it possible here to calculate the probabilities of drawing fuzzy states on the basis of the Fourier transform F(f).
On the other hand, we can harness the properties of the probabilities of drawing the fuzzy states associated with the laws States(n,k) and States(t,p) to deduce that the application of a low-pass filtering to any function f(t) is conveyed by the increasing of the probabilities of drawing the states 0 and 1 and by the decreasing of the probability of drawing the state ? which are associated with the laws Statef(n,k) and Statef(t,p).
In the case of the law Statef(n,k), we know that if the function f(t) is a pure sinusoid of frequency f, with f>Fe/2(k−1) and k>1, then:
P1(f,k)=P0(f,k)=0
P?(f,k)=1
If we apply an ideal low-pass filtering in the frequency band [−Fe/2(k−1), Fe/2(k−1)] to a function f(t), it is understood that the probabilities of drawing the states 1 and 0 will increase since each frequency component Rk(f) of the result signal rk(t) contributes to the final result with a non zero individual probability of drawing the states 0 or 1.
This assertion can be demonstrated in the case of a random noise function b(t) for which the amplitude of the spectrum B(f) is constant in the frequency band [−Fe/2, Fe/2]. For such a function, we know that the probabilities of drawing a single sample are:
P1b(k=1)=P0b(k=1)=½
P?b(k=1)=0
For 2 consecutive samples, we obtain:
P1b(k=2)=P0b(k=2)=(½)²
P?b(k=2)=1−P1b−P0b=1−2×(½)²
And for n consecutive samples, we obtain:
P1b(k=n)=P0b(k=n)=(½)ⁿ
P?b(k=n)=1−P1b−P0b=1−2×(½)ⁿ
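These probabilities can be checked by exhaustively enumerating the sign patterns of k independent symmetric noise samples; a sketch for illustration (the function name is an assumption):

```python
from fractions import Fraction
from itertools import product

def noise_state_probabilities(k):
    """Exact probabilities of drawing the states 1, 0 and ? over k
    consecutive samples of a symmetric random noise: each sample is
    positive or negative with probability 1/2, independently."""
    p1 = p0 = Fraction(0)
    weight = Fraction(1, 2 ** k)              # each sign pattern is equiprobable
    for signs in product((1, -1), repeat=k):  # all 2**k sign patterns
        if all(s > 0 for s in signs):
            p1 += weight
        elif all(s < 0 for s in signs):
            p0 += weight
    return p1, p0, 1 - p1 - p0                # (P1, P0, P?)
```

For k=2 this returns (¼, ¼, ½), matching the values above.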
Thus, for a large number of successive samples, the probabilities of drawing the states “0” and “1” tend to 0 while the probability of drawing the undetermined state “?” tends to 1. We now consider a function rn(t) which is obtained by applying an ideal low-pass filtering to the function b(t) in the frequency band [−Fe/2(n−1), Fe/2(n−1)]. We have then observed that the representation of the spectra of Rn(f), of P1(f,n), of P0(f,n) and of P?(f,n) is obtained by simple homothety of the spectra of R2(f), of P1(f,2), of P0(f,2) and of P?(f,2), as shown by
From this we deduce that there is equality between the probabilities of drawing n consecutive samples of the filtered noise signal rn(t) and the probability of drawing 2 consecutive samples of the unfiltered noise signal b(t). The probabilities of drawing a 1 state or a 0 state for n consecutive samples of the filtered noise signal rn(t) equal ¼, while the probability of drawing a “?” state for n consecutive samples of the filtered noise signal rn(t) equals ½.
In conclusion, the selectivity of the digital signatures is improved by applying a low-pass filtering to the function f(t) which is associated with the samples fn=f(nTe).
The processing steps and the relations between data of files, samples and functions may be summarized as represented in
F(f)=0 for f∉[−Fe/2,Fe/2]
By applying a low-pass filter (step 135′) to this function F(f), we obtain the function R(f) corresponding to the Fourier transform of the function r(t) (step 133′), whose samples rn are such that rn=r(n·Te)=r(n/Fe) according to Shannon's theorem (step 133).
In practice, in step 135 a low-pass digital filter will preferentially be applied directly to the samples fn to obtain the samples rn in step 133. This digital filter will be described in detail later. Finally, a law for determining fuzzy states is applied to the filtered samples rn to obtain the digital signature data sn/k=Stater(n,k), over k consecutive samples {rn, rn+1, . . . , rn+k−1}, n being a multiple of k (step 134).
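The chain of steps 135 and 133/134 can be summarized as below; this is only a sketch in which filter_fn and state_fn are hypothetical stand-ins for a low-pass digital filter and a law for determining fuzzy states:

```python
def signature(samples, k, filter_fn, state_fn):
    """Signature pipeline sketch: low-pass filter the samples fn to get
    the samples rn, then apply the fuzzy-state law over each
    non-overlapping packet of k consecutive filtered samples
    (n a multiple of k)."""
    filtered = filter_fn(samples)
    return [state_fn(filtered, n, k)
            for n in range(0, len(filtered) - k + 1, k)]
```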
As indicated hereinabove, these steps of
In what follows, the following notation is adopted:
Borel's theorem gives the relation:
R(f)=Filter(f)×F(f)
This relation is conveyed on the functions r(t), filter(t) and f(t) by a formula of the type:
If we consider the functions which are associated with the samples (and which satisfy Shannon's conditions), this relation becomes:
The digital filtering therefore consists in defining a set of coefficients filterk that will be used to calculate each sample rn by applying the above formula.
In practice, we try to approximate a predefined filter template by limiting the size of the set of coefficients filterk. The compromise to be found depends on the following factors:
If the number of coefficients equals K, each calculation of a sample rn is conveyed by K multiplication operations and by (K−1) addition operations.
For the digital filters used by the search algorithm within the meaning of the invention, the main criterion adopted is the speed of calculation of the samples rn.
In a preferred embodiment, the choice pertains to a particular family of filters termed “mean value” filters for which the coefficients of the digital filter are identical, so that:
The equation of the digital filter simplifies into the following form:
For this filter with 2K+1 coefficients, the calculation of a sample rn is thus now conveyed by only 2K+1 addition operations, and by a multiplication operation if the term Cst is different from the value 1.
It is remarked moreover that the sample r(n+1) may be obtained simply from r(n) through the relation r(n+1)=r(n)+Cst×(f(n+K+1)−f(n−K)).
In a particularly advantageous manner, by applying this latter relation, the calculation of each new sample r(n+1) is now conveyed by only two addition operations.
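The incremental relation above lends itself to a sliding-window implementation; a minimal sketch, in which the function name and the returned index range are assumptions:

```python
def mean_value_filter(f, K, cst=1):
    """Mean-value filter with 2K+1 identical coefficients, computed
    incrementally: after the first window, each new output needs only
    two additions, via r[n+1] = r[n] + cst*(f[n+K+1] - f[n-K]).
    Returns the outputs r[n] for n = K .. len(f)-K-1 (full windows only)."""
    if len(f) < 2 * K + 1:
        return []
    r = [cst * sum(f[:2 * K + 1])]           # r[K], computed directly
    for n in range(K, len(f) - K - 1):       # slide the window by one
        r.append(r[-1] + cst * (f[n + K + 1] - f[n - K]))
    return r
```

With cst=1 the filter is a moving sum; dividing cst by (2K+1) would give the mean proper, at the price of the division noise discussed later in the text.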
The frequency response of the mean value digital filter is obtained from the Fourier transform of the following summation operator σ(t):
The filtering of f(t) by the operator σ(t) is then conveyed by the formula:
The frequency response of the operator σ(t) is Σ(f) with:
We finally obtain:
The frequency response of the mean value filter is obtained by subsequently dividing that of the summation operator Σ(f) by T.
The frequency response of the mean value digital filter over K consecutive samples is thereafter obtained by replacing T by (K−1)Te, i.e.:
According to the parity of K, two equations for a digital filter are used for the calculation of the samples rn.
Represented in
We know moreover that the application of an ideal low-pass filtering in the frequency band [−Fe/2(n−1), Fe/2(n−1)] is conveyed by the following probabilities of drawing fuzzy states calculated over sequences of n consecutive samples:
We can approximate an ideal low-pass filtering template by choosing a mean value digital filter whose zero cutoff frequency occurs at f=Fe/2(n−1): this condition is attained for K=2n−1.
In practice, the application of a mean value digital filter is of course conveyed by probabilities of drawing fuzzy states which differ from the probabilities obtained with an ideal low-pass filter. The determination of the value of K is done empirically knowing that the probabilities obtained with K=2n−1 will be close to those of the ideal filter, and that the probabilities of drawing P1 and P0 also increase with the value of K.
Described hereinbelow are the adaptations made to the laws for determining fuzzy states, in particular as a function of the foregoing.
The calculations of probabilities on the drawing of fuzzy states are based on the hypothesis that the data of files represent the values of samples of a signal f(t) of zero mean value. This condition is again conveyed by the following relation:
The results obtained on the probabilities of drawing fuzzy states are therefore valid only if this condition is satisfied for the samples fn:
In the case of a file of samples of size N, this condition becomes:
Now, the above conditions of zero mean value are not systematically satisfied when the values of the samples are determined from the binary data of a file. These conditions are for example not satisfied if the “unsigned integer” coding law is used to represent the values of the samples associated with the data of a file. Specifically, in this case each byte represents an integer lying between 0 and 255, this leading to a mean sample value of 127.5 for a file of random content.
To alleviate this problem, a reference value parameter Vref is introduced as follows into the law for determining fuzzy states over the sequences of k consecutive samples {rn, rn+1, . . . , rn+k−1} which were obtained by digital filtering on the basis of the samples fn:
Stater(n,k)=1 if ∀ i ∈ {0, …, k−1}, rn+i≥Vref
Stater(n,k)=0 if ∀ i ∈ {0, …, k−1}, rn+i<Vref
Stater(n,k)=? otherwise
The choice of the value Vref is then made so as to best approximate the mean value taken by the samples fn of the data file.
In the case where the search application is targeted at the comparison of files of like nature, such as for example text files, the value of Vref must be fixed in full knowledge of the law for coding the data of the file and the probabilities of drawing each code.
For the full text computer search program, in a preferred embodiment, it is considered that the format of the files to be compared is not known in advance. The value of Vref is therefore determined by carrying out a prior analysis of the files to be compared. For this embodiment, the value Vrefn is calculated for each sample rn by performing a mean value calculation over a sequence of samples fk of fixed size Kref, centered on fn, with:
Knowing that the samples rn were already obtained by a mean value calculation over sequences of K consecutive samples fk, the size of the sequence Kref (used for the calculation of Vrefn) is chosen greater than that of K (used for the calculation of the samples rn).
The law for determining the fuzzy states over the sequences of k consecutive samples {rn, rn+1, . . . , rn+k−1} then becomes:
Stater(n,k)=1 if ∀ i ∈ {0, …, k−1}, rn+i≥Vrefn+i
Stater(n,k)=0 if ∀ i ∈ {0, …, k−1}, rn+i<Vrefn+i
Stater(n,k)=? otherwise
This law simplifies by putting r′n=(rn−Vrefn). Then:
Stater(n,k)=1 if ∀ i ∈ {0, …, k−1}, r′n+i≥0
Stater(n,k)=0 if ∀ i ∈ {0, …, k−1}, r′n+i<0
Stater(n,k)=? otherwise
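The reference-adjusted law amounts to applying the plain law to the differences r′n=rn−Vrefn; a small sketch (the function name is an assumption):

```python
def state_with_reference(r, vref, n, k):
    """Fuzzy-state law with a per-sample reference value Vref_n:
    state 1 if every r[n+i] - vref[n+i] is >= 0 over the packet,
    state 0 if every difference is < 0, '?' otherwise."""
    diffs = [r[n + i] - vref[n + i] for i in range(k)]
    if all(d >= 0 for d in diffs):
        return 1
    if all(d < 0 for d in diffs):
        return 0
    return "?"
```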
For K even and Kref even, the formula for the digital filter is:
We recall that the frequency response of the digital filter associated with the calculation of the samples r′n is obtained simply from that of Σavg(K,f):
Filter(f)=Σavg(K,f)−Σavg(Kref,f)
The choice of the value of K is made in such a way that the zero cutoff frequency of the filter is less than or equal to that which would have to be used for an ideal low-pass filter which makes it possible to obtain probabilities of drawing 1 or 0 states equal to ¼. It is recalled that this ideal low-pass filter cutoff frequency is obtained as a function of the index ratio k by the formula Fe/(2.(k−1)) and that this condition is attained on Σavg(K,f) for K smaller than or equal to 2k−1. The choice of Kref is made in such a way as to be greater than K, without now being too high.
For the preferential embodiment of the full text computer search program, the values to be used for K and Kref are adjusted automatically as a function of the value k desired for the index ratio. The values of K and of Kref are chosen as a multiple of k, thereby facilitating the data address calculations, hence:
K=interv×k and Kref=intervref×k
The response of the adjusted digital filter for an index ratio k is
Filter(k,f)=Σavg(interv×k,f)−Σavg(intervref×k,f)
For the embodiment of the full text computer search program, four laws for determining fuzzy states are used simultaneously, in a particular embodiment.
The fuzzy states determined by the first law are coded on the 2 least significant bits of each digital signature data item. The fuzzy states determined by the second law are coded on the next 2 bits, and so on, until the 8 bits (hence 1 byte) of each digital signature data item are completely occupied.
The four laws are characterized by a set of parameters interv1, interv2, interv3, interv4 and intervref. The same parameter intervref is used for each law. For an index ratio k, the default choice falls on the following set of digital filters associated with each law for determining fuzzy states:
Filter1(k,f)=Σavg(2k,f)−Σavg(14k,f)
Filter2(k,f)=Σavg(3k,f)−Σavg(14k,f)
Filter3(k,f)=Σavg(5k,f)−Σavg(14k,f)
Filter4(k,f)=Σavg(7k,f)−Σavg(14k,f)
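The packing of the four laws' states into one signature byte, two bits per law with the first law on the least significant bits, can be sketched as below. The particular 2-bit coding of each state (0 → 00, 1 → 01, ? → 10, X → 11) is an assumption for illustration; the text does not specify it:

```python
STATE_CODE = {0: 0b00, 1: 0b01, "?": 0b10, "X": 0b11}  # assumed 2-bit coding

def pack_signature_byte(states):
    """Pack the fuzzy states produced by the four laws into one byte,
    law 1 on bits 0-1, law 2 on bits 2-3, and so on."""
    byte = 0
    for law_index, s in enumerate(states):   # law 1 comes first
        byte |= STATE_CODE[s] << (2 * law_index)
    return byte
```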
To avoid the calculation noise caused by the divisions, in an advantageous embodiment, we first calculate the sums, then perform the sign tests on the terms rn by multiplying the first sum by Kref and the second sum by K.
We now describe a complete optimization for the application to a full text search engine.
This optimization begins with the determination of an appropriate index ratio.
To be independent of the particular choices which could be made for the embodiment of the low-pass digital filters (
As indicated in relation to
Illustrated in
sn/k=(ebn or ebn+1 or ebn+2 or . . . or ebn+k−1)
while advantageously complying with the same start addresses of the samples fn.
For the application to the full text search engine, the value k of the index ratio conditions the value of minimum size of extracts common to two files which may be detected by carrying out a search of common extracts of digital signatures. This minimum size of common extract of a file is obtained when the size of the extract common to the digital signatures is equal to 1. In this case, the condition for detecting the common file extract requires that the group of consecutive data of the extract to be found covers the group of consecutive data used for the calculation of each digital signature data item.
Denoting by text the size of the common file extract to be found and by tsign the size of the group of data used for the calculation of an index data item, we demonstrate the relation text≧tsign+(k−1).
Represented in
It is observed that the overlap conditions depend on the phase of the start address of the data extract which will be searched for. In the most favorable case, the start address of the extract coincides with the address of the first data item of a data group used for the calculation of a digital signature data item. In this case, the start address of the extract is n−I1 (with n a multiple of k) and the minimum size of the extract for overlap is I1+I2+k. In the least favorable case, the start address of the extract coincides with the address +1 of the first data item of a data group used for the calculation of a digital signature data item. In this case, the start address of the extract is n−I1−(k−1) (with n a multiple of k) and the minimum size of the extract for overlap equals I1+I2+k+(k−1).
In all cases, the overlap condition for a data group used for the calculation of a single digital signature data item is satisfied if the size of the extract to be found is greater than or equal to (I1+I2+2k−1). Conversely, if the size of extract to be found is equal to (I1+I2+2k−1), the extract does indeed overlap a data group used for the calculation of a single data item of a digital signature.
The reasoning can be extended to the case of the overlapping of a data group used for the calculation of an extract of digital signatures data of size TES. In the most favorable case, the start address of the extract coincides with the address of the first data item of a data group used for the calculation of TES consecutive data of the digital signature. If the start address of the extract equals n−I1 (with n a multiple of k), the minimum size of the extract for overlap equals I1+I2+k.TES.
In the least favorable case where the start address of the extract coincides with the address +1 of the first data item of a data group used for the calculation of TES data of a digital signature, the start address of the extract equals n−I1−(k−1) (with n a multiple of k) and the minimum size of the extract for overlap=I1+I2+k.TES+(k−1).
In all cases, the overlap condition for a data group used for the calculation of TES consecutive data of a digital signature is satisfied if the size of the extract to be found is greater than or equal to (I1+I2+k(TES+1)−1).
On the basis of the above formulae, inverse reasoning is applied to determine the values of the index ratio k which can be used to search for a common extract of files of size TEF. The following relations must then be satisfied:
TEF≧I1+I2+k(TES+1)−1, and
TES≧1 (which is simply the minimum size of common extract of digital signatures)
The minimum value for k is kmin=2, otherwise there is of course no improvement to be expected in the search speed.
Finally, from this we deduce the minimum size value usable for TEF
TEF mini=I1+I2+2(TES+1)−1
It will be noted that for TES=1, TEF mini=I1+I2+3
The maximum value for k is obtained backwards by taking TES=1, then:
kmax=integer part of [(TEF−I1−I2+1)/2]
For any value of k lying between kmin and kmax, we deduce the size of the common extract of signatures TES which will condition the detection of a possible extract common to the files of size TEF:
TES≦integer part of [(TEF−I1−I2+1)/k]−1
The formulae may be adapted to the particular case of “default” digital filters adjusted for an index ratio k, as was seen previously. It then suffices to replace I1 by (intervref×k)/2 and I2 by I1−1. We obtain the following relation between TEF, TES, k and intervref:
TEF≧k(intervref+TES+1)−2
The minimum size value usable for TEF is obtained for k=2 and TES=1 and we deduce TEF mini=2.intervref+2
For TEF fixed, we deduce the span of licit values for the index ratio k:
kmin=2≦k≦kmax=integer part [(TEF+2)/(intervref+2)]
For any value of k lying between kmin and kmax, we deduce the size of the common extract of signature TES which will condition the detection of a possible extract common to the files of size TEF:
TES≦integer part of [(TEF+2)/k]−(intervref+1)
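For the default adjusted filters, the relations above can be turned into a small helper (the function name is illustrative) that lists, for each licit index ratio k, the maximum usable size TES of common signature extract:

```python
def usable_index_ratios(TEF, intervref):
    """For the default adjusted filters, return {k: TES_max} for every
    licit index ratio k, using the relations of the text:
      kmin = 2, kmax = floor((TEF + 2) / (intervref + 2)),
      TES  = floor((TEF + 2) / k) - (intervref + 1)."""
    kmax = (TEF + 2) // (intervref + 2)
    return {k: (TEF + 2) // k - (intervref + 1)
            for k in range(2, kmax + 1)}
```

For example, with TEF=100 and intervref=14, the licit ratios run from k=2 (TES up to 36) down to k=6 (TES=1 or 2 only).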
Thus, the detection of a common extract of files of size TEF may be obtained by comparing digital signatures using various values of index ratio k. For a determined value TEF, we deduce a span of usable values for k: from kmin to kmax. For each usable value of k, we then determine a value TES of maximum size of common extract of digital signatures which guarantees the detection of a common extract of files of size TEF.
We shall now examine how to choose the value of k (in the licit span kmin, kmax) to get the fastest search speed.
As indicated previously, for the application to a full text search engine, the search is done in two passes:
For the evaluation of the number of comparison operations to be performed for the two search passes, the following simplifying hypotheses are adopted in a first approach:
The probability of drawing a common extract of files of size 1 is denoted PF. The probability of drawing a common extract of files of size 2 is then PF². Finally, the probability of drawing a common extract of files of size TEF is PF^TEF.
Subsequently, the probability of drawing a common extract of digital signatures of size 1 is denoted PS. The probability of drawing a common extract of digital signatures of size 2 is PS². The probability of drawing an extract of size TES is PS^TES.
Moreover the following notation is adopted:
We firstly evaluate the number Total1 of comparisons to be performed for the first step of “coarse” searching for common extracts of digital signatures of size greater than or equal to TES. The number of possible pairs of start positions of common extract of digital signatures is equal to TS1×TS2. For a value of index ratio k, the sizes TS1 and TS2 are deduced from the sizes TF1 and TF2 through the relations:
TS1=TF1/k and TS2=TF2/k
For each possible pair of start positions of common extract of digital signatures, we compare the first data of an extract. In case of correlation, the comparison is continued with the second data of an extract, and so on until the requested extract size TES is attained.
For each test, the mean number of comparison operations is obtained from the probability of drawing PS, with:
In total, we therefore obtain 1+PS+ . . . +PS^(TES−1), i.e. (1−PS^TES)/(1−PS) operations. The value of Total1 is deduced therefrom by multiplication by (TS1×TS2), i.e.:
Total1=(TF1×TF2)×(1−PS^TES)/(k²×(1−PS))
We now evaluate the number Total2 of comparisons to be performed for the second step of “targeted” searching for common extracts of the files of size TEF from among the set of pairs of start positions of extracts of files in conjunction with the digital signatures common extracts found in the previous step of coarse searching. For a digital signatures common extract labeled by a pair of start addresses (n1, n2), the start addresses to be tested on the first file lie between (k.n1+I2+k.TES−TEF) and (k.n1−I1), i.e. in total, Na=(TEF−I1−I2−k.TES+1) possible addresses (
The value of TEF may moreover be bracketed by the following relation when the largest possible value for k is used:
I1+I2+k(TES+1)−1≦TEF<I1+I2+k(TES+2)−1
From this we deduce that k≦Na<2k.
The same reasoning applies to the start addresses to be tested on the second file by substituting n2 for n1.
There are therefore a total of Na² pairs of start positions of common extracts of files to be evaluated. The mean number of comparisons to be performed to search for a common extract of files of size TEF is obtained from the probability of drawing PF, by applying reasoning analogous to that of the coarse search step:
Na²×(1−PF^TEF)/(1−PF)
The mean number of digital signatures common extracts found in the first step is obtained from the probability of drawing PS and the sizes of the signatures TS1 and TS2:
TS1×TS2×PS^TES
We replace TS1 by TF1/k and TS2 by TF2/k and we finally obtain Total2 by product of the latter expressions:
Total2=(TF1×TF2)×(Na²/k²)×PS^TES×(1−PF^TEF)/(1−PF)
We have already shown that 1≦Na/k<2. From this we deduce the following relations:
Total2≧(TF1×TF2)×PS^TES×(1−PF^TEF)/(1−PF) and
Total2<4×(TF1×TF2)×PS^TES×(1−PF^TEF)/(1−PF)
It is indicated that the sign “×” signifies here “multiplied by”.
Finally, the evaluation of the number Total3 of comparison operations to be performed for the two search passes is obtained by simple addition of Total1 and of Total2, i.e.:
Total3=(TF1×TF2)×(1−PS^TES)/(k²×(1−PS))+(TF1×TF2)×(Na/k)²×PS^TES×(1−PF^TEF)/(1−PF)
For large values of TEF and TES, the relation may be approximated by:
Total3=(TF1×TF2)×[(1/(k²×(1−PS)))+((Na/k)²×PS^TES/(1−PF))]
The total number of comparisons to be performed with the reference search algorithm is close to TF1×TF2. The ratio between the latter number and Total3 gives an estimate of the search speed gain obtained by using the algorithm within the meaning of the invention:
Gain=1/[(1/(k²×(1−PS)))+((Na/k)²×PS^TES/(1−PF))]
When the second term of the sum is less than the first term (in 1/k²), it will be noted that a gain greater than k²×(1−PS)/2 is obtained.
It is indicated incidentally that, to obtain the effective gain in search speed, it is also necessary to deduct the actual times for calculating the digital signatures.
As will be seen with reference to
It is recalled that in the general case, TES=integer part of [(TEF−I1−I2+1)/k]−1
In the case of optimized mean value digital filters,
TES=integer part of [(TEF+2)/k]−(intervref+1)
It is apparent that the value of k to be used to obtain the minimum value of this function cannot be determined through a simple mathematical relation. However, as the set of possible values of k is reduced, the optimal value of k is determined empirically. For each possible value of k (between kmin and kmax), we calculate the value of Total3 as a function of k and we retain the value of k which produces the smallest value of Total3.
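This empirical determination of k can be sketched as follows, using the approximate Total3 formula and the default-filter relations for TES and Na; the helper name and parameter set are assumptions for illustration:

```python
def best_index_ratio(TEF, intervref, PS, PF, TF1, TF2):
    """Try every licit k, evaluate the approximate Total3 of the text,
    and keep the k giving the smallest value. Uses the default-filter
    relations TES = floor((TEF+2)/k) - (intervref+1) and
    Na = TEF - k*intervref - k*TES + 2."""
    best_k, best_total = None, None
    kmax = (TEF + 2) // (intervref + 2)
    for k in range(2, kmax + 1):
        TES = (TEF + 2) // k - (intervref + 1)
        Na = TEF - k * intervref - k * TES + 2
        total3 = TF1 * TF2 * (1.0 / (k * k * (1.0 - PS))
                              + (Na / k) ** 2 * PS ** TES / (1.0 - PF))
        if best_total is None or total3 < best_total:
            best_k, best_total = k, total3
    return best_k
```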
However, the evaluation of the number of comparison operations to be performed is more accurate if we also correct the model used for the calculation of the probabilities of drawing common extracts of digital signatures. Specifically, the probabilities of drawing the data of digital signatures are not mutually independent, since there is a sizeable overlap between the span of the file data which are used for the calculation of a digital signature data item of address (n/k) and that of the file data which are used for the calculation of the next data item of a digital signature of address (n/k)+1.
In the general case of a low-pass digital filter with (I1+I2+1) coefficients, the fuzzy states taken by the digital signature data of addresses (n/k) and ((n/k)+j) will be independent if there is no overlap between the spans of file data which are used for their determination. This condition is satisfied if (n+I2+k−1)<(n+k.j−I1−k+1), i.e. if j>(I1+I2+2k−2)/k.
In the particular case of the default digital filters adjusted for an index ratio k, we simply substitute (k×intervref−1) for (I1+I2) in the above equation. The condition of independence is then satisfied if j>(intervref+2)−3/k, stated otherwise, if the discrepancy in addresses between the digital signatures data equals at least (intervref+2).
To take account of the dependency of the fuzzy states taken by consecutive data of a digital signature, the probabilities model is modified as indicated below.
The probability of drawing a common extract of digital signatures of size 1, assumed independent, is denoted PSI. The probability of drawing a common extract of digital signatures of size 2 is equal to the probability PSI of drawing an extract of size 1, multiplied by the conditional probability PSD (D standing for dependent) of drawing another extract of size 1 consecutively following a previously found extract of size 1. This probability of drawing is thus PSI×PSD. The probability of drawing a common extract of digital signatures of size 3 becomes PSI×PSD². Finally, the probability of drawing an extract of size TES becomes PSI×PSD^(TES−1). The following relation may be demonstrated between PSI and PSD: PSD^(intervref+2)<PSI
On the basis of this new model of probabilities, we re-evaluate the formulae for calculating the numbers Total1 and Total2:
Total1=[(TF1×TF2)/k²]×[1+(PSI×(1−PSD^(TES−1))/(1−PSD))]
Total2=(TF1×TF2)×(Na/k)²×PSI×PSD^(TES−1)×(1−PF^TEF)/(1−PF)
For high values of TEF and TES, the formulae may be approximated as follows:
Total1=[(TF1×TF2)/k²]×[1+(PSI/(1−PSD))]
Total2=(TF1×TF2)×(Na/k)²×PSI×PSD^(TES−1)/(1−PF)
And Total3=(TF1×TF2)×[(1+(PSI/(1−PSD)))/k²+(Na/k)²×PSI×PSD^(TES−1)/(1−PF)]
In a preferred embodiment, the values of PSI and PSD are determined in advance by statistical analysis of the results of comparisons between digital signatures obtained with files of large size. For this purpose, a specific statistical analysis program standardizes the values to be used for PSI and PSD.
For the set of 4 default digital filters (
Represented in
We now describe the improvement in the selectivity of the search for common extracts of digital signatures, still for a full text search engine.
In the simple case where the digital signatures data each carry only a single fuzzy logic state, the probability PSI of detecting a common extract of digital signatures of size 1 can be deduced from the probabilities of drawing the states “0”, “1” and “?”.
We denote by P0 the probability of drawing the state 0, by P1 that of the state 1 and by P? that of the state ?.
For a given pair of start positions of extracts of digital signatures to be evaluated, the conditions for detecting a common extract of digital signatures of size 1 are as follows:
For a given pair of start positions of extracts of digital signatures to be evaluated, the probabilities of detecting a common extract of digital signatures of size 1 are determined as follows for each situation presented above:
The probability of detection PSI is obtained by addition of the probabilities of each situation:
PSI=P0×(P0+P?)+P1×(P1+P?)+P?
The formula for determining PSI may again be simplified by replacing (P0+P?) by (1−P1), (P1+P?) by (1−P0), and (P0+P1+P?) by 1, and:
PSI=P0×(1−P1)+P1×(1−P0)+P?=1−2×P0×P1
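The detection probability can be checked by enumerating all compatible pairs of fuzzy states; a small sketch (the function name is assumed):

```python
def psi(p0, p1):
    """PSI by summing over all compatible pairs of fuzzy states
    ('?' matches anything, 0 and 1 match themselves or '?');
    algebraically this reduces to 1 - 2*P0*P1."""
    pq = 1.0 - p0 - p1                       # P?
    prob = {0: p0, 1: p1, "?": pq}
    compatible = lambda a, b: a == b or a == "?" or b == "?"
    return sum(prob[a] * prob[b]
               for a in prob for b in prob if compatible(a, b))
```

For P0=P1=¼ and P?=½ this gives ⅞, the value quoted below for the ideal filter.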
The maximum value of PSI equals 1. It is obtained for P0=0 or P1=0. This situation is to be proscribed, since, in this case, the search for common extracts of digital signatures has no selectivity.
The minimum value of PSI equals ½. It is obtained for P?=0 and P0=P1=½. This situation is ideal and may be approximated if we use a default adjusted digital filter with high values for the parameters intervref and interv, as was seen above.
For mean value digital filters, the value of PSI is obtained statistically by analyzing the intercomparison of digital signatures of large size. It has been shown that the application of an ideal filter of cutoff Fe/2(k−1) is conveyed by probabilities P0=P1=¼ and P?=½. It follows that PSI=⅞.
We therefore use digital filters that are more selective so that PSI<⅞, in a preferential embodiment.
In the general case where the digital signatures data each carry 4 fuzzy logic states (with the supplementary state “X” (prohibited)), the probability PSI of detecting a common extract of digital signatures of size 1 is evaluated on the basis of the previous results. We denote by PSI1 the probability of detecting a common extract of digital signatures of size 1 based only on a comparison of the states taken by the first law for determining fuzzy states. We denote by PSI2, PSI3 and PSI4 the analogous detection probabilities associated with the following laws for determining fuzzy states (laws 2, 3 and 4). If the laws are mutually independent, PSI=PSI1×PSI2×PSI3×PSI4. In practice, there is a dependence between the laws and the value of PSI obtained by statistical analysis is greater than the previous product.
Thus, the determination of each fuzzy state of a digital signature is performed by a prior calculation of a set of k consecutive binary states. In the case of a search for common extracts of files, it will be remarked that the detection of a possible common extract between the files will always be guaranteed if:
It is indicated indeed that, in a preferred embodiment, a digital signature carrying fuzzy states (0, 1 or ?) (first file) is in fact intercompared with a digital signature carrying only binary states (0 or 1) (second file). It is shown below that the selectivity of the search is thereby improved, since the probabilities of detecting extracts common to the digital signatures are simply decreased.
For a given pair of start positions of extracts of digital signatures to be evaluated, the conditions for detecting a common extract of digital signatures of size 1 are as follows:
We take as notation P0′ and P1′ for the probabilities of drawing the binary states carried by the digital signature data items associated with the second file. We have the following relations:
P0′+P1′=1
P0≦P0′≦P0+P?
P1≦P1′≦P1+P?
For a given pair of start positions of extracts of digital signatures to be evaluated, the probabilities of detecting a common extract of digital signatures of size 1 are determined as follows for each situation presented above:
The probability of detection PSI′ is obtained by addition of the probabilities of each situation:
PSI′=P0×P0′+P1×P1′+P?
≦P0×(P0+P?)+P1×(P1+P?)+P?
≦PSI
The relation PSI′≦PSI therefore implies an improvement in the selectivity of the search by carrying out the comparison between a signature carrying fuzzy states and a signature carrying only binary states.
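The bound PSI′≦PSI may be checked numerically under the ideal-filter probabilities P0=P1=¼, P?=½ (an illustrative sketch; the function names are ours):

```cpp
#include <cassert>

// Sketch of the two detection probabilities for a size-1 comparison.
// psiPrime: fuzzy signature (P0, P1, P?) against binary signature (P0', P1').
// psiMax:   upper bound, reached when P0' = P0 + P? and P1' = P1 + P?.
double psiPrime(double p0, double p1, double pq, double p0b, double p1b) {
    return p0 * p0b + p1 * p1b + pq;   // '?' matches either binary state
}
double psiMax(double p0, double p1, double pq) {
    return p0 * (p0 + pq) + p1 * (p1 + pq) + pq;
}
```

With P0=P1=¼, P?=½ and P0′=P1′=½, one finds PSI′=¾, which is indeed less than PSI=⅞.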
It will be remarked that for a common extract of digital signatures that is labeled by a pair of start addresses (n1, n2), the start addresses to be tested on the files must take account of the use of a binary digital signature for the search. In the case where the fuzzy digital signature is calculated on the basis of the first file, the start addresses to be tested lie between (k×n1+I2+k×TES−TEF) and (k×n1−I1), i.e. in total:
Naf=(TEF−I1−I2−k×TES+1) possible addresses.
In the case where the binary digital signature is calculated on the basis of the second file, the start addresses to be tested lie between:
(k×n2+I2+k×(TES−1)−(TEF−1)) and (k×n2−I1), i.e. in total:
Nab=(TEF−I1−I2−k×(TES−1)) possible addresses.
For a default digital filter with parameter intervref, we obtain:
Naf=TEF−k×intervref−k×TES+2
Nab=TEF−k×intervref−k×(TES−1)+1
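The counts Naf and Nab may be sketched as follows (illustrative code; it merely transcribes the two formulas above, under the assumption that I1+I2=k×intervref−1 for the default filter):

```cpp
#include <cassert>

// Numbers of start addresses to be tested on the files for a common
// extract of signatures, default mean value filter of parameter intervref.
int naf(int TEF, int TES, int k, int intervref) {
    return TEF - k * intervref - k * TES + 2;       // fuzzy signature on file 1
}
int nab(int TEF, int TES, int k, int intervref) {
    return TEF - k * intervref - k * (TES - 1) + 1; // binary signature on file 2
}
```

It may be noted that Nab−Naf=k−1, whatever the other parameters.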
Described hereinbelow is a standardization of the probability laws associated with the digital filters. The table below logs the probabilities PSI and PSD of mean value digital filters obtained by comparing two text files of large size (300 kilobytes).
It is noted that:
The probabilities logged for the aggregate of 4 filters (interv=2, 3, 5, and 7) are greater than the product of the probabilities logged individually for each filter. It will therefore be understood that there is interdependency of the probabilities associated with each law.
To better approximate a situation of independence of the probabilities, it is possible to envisage proceeding as follows to adapt the realization of the functions for calculating the digital signature:
For high values of TEF (and TES), the mathematical model for estimating the numbers of comparison operations to be performed for the search gives good results on the automatic determination of an optimal value of index ratio to be used.
For low values of TEF (and TES), the mathematical estimation model does not give good results, since the search processes are no longer devoted principally to comparison operations.
For each common extract of digital signatures that is found, a program triggers the call to a function for targeted searching for common extract of a file over a restricted span of pairs of start addresses on the files. With each call, the function carries out a certain number of tests of validity of the call parameters and of initialization of local variables. With each call, this function performs an operation of reading on each file of the data to be compared whose speed depends on the performance of the hard disk and the bus of the computer. To take account of the impact of these additional processing times, a further corrected mathematical model is used which adds, in the step of targeted searching for common extracts of a file, comparison operations in numbers that are representative of the call times of the targeted search function and of the reading times for the data to be compared. Typically, the number added to Total2 is of the form:
[((TF1×TF2)/k²)×PSI×PSD^(TES−1)]×[A×B×k],
where
The values of the parameters A and B depend on the characteristics of the computer used for the execution of the program and are determined empirically.
Described hereinbelow are the performance evaluation results obtained with a computer comprising a 1 GHz Pentium III processor, 128 MB of RAM and a 20 GB hard disk (running under the Windows 98® operating system).
The performance was logged with the execution of a full-text computer search program developed specifically in the C++ language. The program offers the choice of using a “conventional” algorithm or an algorithm within the meaning of the invention to perform a search for extracts common to the two files. The execution times of the algorithm within the meaning of the invention also include the times for calculating the digital signatures.
In order to avoid falsifying the performance measurements, particular attention should be paid to the choice of files used to perform the searches. Specifically, in the course of tests it has transpired that the data files associated with everyday software such as Word®, Excel®, PowerPoint®, or the like have storage formats which lead to the existence of numerous consecutive data spans initialized to the same value 0 (0x00). As the size of these spans is of the order of several hundred data items, the probability model used for the embodiment of the prototype search program is falsified. Adaptations of this model may be investigated on a case by case basis, such as for example the ignoring in the targeted search function of the data value pair (0,0) as start position of a common extract.
The choice of the type of text file falls above all on text documents of large size in the HTML format. The search speed is expressed in millions of comparison operations per second (Mega ops/sec). The first file has a size of 213275 bytes and the second file a size of 145041 bytes. The table below shows the results obtained.
Other applications of searching for probable common extracts are now described. In certain areas of application, the criteria for detecting common extracts of files differ from the perfect identity of the extracts to be found. Such is the case in particular for data files representative of the digitization of a signal, such as for example audio files (with a .wav extension for example).
It is known that the value of the samples obtained will depend on the phase of the sampling clock. It is known moreover that the digitizing device introduces other errors into the values of the samples (noise, clock jitter, dynamic swing, or the like).
For these applications, the principle of the search algorithm within the meaning of the invention may be adapted so as to confine itself solely to the step of coarse searching between files. The steps envisaged may therefore be summarized as follows:
In what follows we shall show how it is possible to define a criterion for detecting a common extract with the aid of probabilities.
We showed previously, within the framework of the optimization of the value of the index ratio, that the number of comparison operations for searching between digital signatures is estimated at:
Total1=[(TF1×TF2)/k²]×[1+PSI×(1−PSD^(TES−1))/(1−PSD)]
We also showed that the probability of drawing a common extract of digital signatures equals PSI×PSD^(TES−1).
The probable number of common extracts of minimum size TEF which will be found by the intercomparison of two files of respective sizes TF1 and TF2 therefore becomes:
N=[(TF1×TF2)/k²]×PSI×PSD^(TES−1), with
TES=integer part of [(TEF−I1−I2+1)/k]−1
The optimization of the value of k depends on the compromise between the search speed (inversely proportional to Total1) which grows as k increases (it is therefore beneficial to use high values for k) and the number N which grows as k increases (the value of k must therefore be lowered if one wishes to limit the number of probable common extracts detected).
The optimization of the value of k is done by fixing in advance a target value Nc for N and a value of minimum size of extract to be found TEF. On the basis of these parameters, the value of N is evaluated for all the permitted values of k and the value of k which makes it possible to best approximate the value Nc is retained.
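The optimization described above may be sketched as follows (an illustration only: PSI and PSD are taken here as constants independent of k, whereas in practice they result from the statistical analysis of the filters used):

```cpp
#include <cmath>
#include <cassert>

// Sketch: evaluate N for each permitted value of k and keep the k whose
// N best approximates the target value Nc fixed in advance.
int best_k(double TF1, double TF2, int TEF, int I1, int I2,
           double psi, double psd, double Nc, int kmin, int kmax) {
    int best = kmin;
    double bestGap = -1.0;
    for (int k = kmin; k <= kmax; ++k) {
        int TES = (TEF - I1 - I2 + 1) / k - 1;   // integer part, then -1
        if (TES < 1) continue;                   // extract too small for this k
        double N = (TF1 * TF2) / (double)(k * k)
                 * psi * std::pow(psd, TES - 1);
        double gap = std::fabs(N - Nc);
        if (bestGap < 0.0 || gap < bestGap) { bestGap = gap; best = k; }
    }
    return best;
}
```

The parameter values in the test below are purely illustrative.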
This search procedure introduces an inaccuracy in the start positions of the probable common extracts found. In the case of a search for common extracts between a fuzzy signature and a binary signature (corresponding to a preferred embodiment), the inaccuracy in the start position of the probable common extract of files is of the order of +k or −k in the file associated with the fuzzy signature, and of the order of +k or −2k in the file associated with the binary signature.
The effective probability of detecting a common extract of digital signatures may be approximated by an analysis of the variations taken by the states of the extract on the fuzzy signature. Advantageously, the preferred embodiment evaluates a ceiling probability by detecting the number of transitions occurring between data in the 0 state and in the 1 state, thereby making it possible to filter from the search result the common extracts whose measured probability is greater than a predefined threshold, and thus to avoid perverting the statistical probability model (PSI×PSD^(TES−1)) used to optimize the search parameters.
In the case of audio files, the search for audio extracts common to two recording files may therefore be summarized as follows. We begin with a prior calculation of digital signatures associated with each file. On completion of this first step, we can regard a digital signature file as being a succession of logic states which characterize consecutive time spans of fixed duration of the audio signal. Typically, if one chooses a time span duration of one second for each digital signature data item, the processing of an audio file of an hour is conveyed by the creation of a file of digital signatures of 3600 data items (one per second). The first data item of the signature characterizes the first second of recording, the second data item the second second, and so on and so forth.
The search for common audio extracts is then performed by intercomparing the data of digital signatures which were calculated on the basis of each audio recording. Any common extract is characterized by a pair of groups of N consecutive data of digital signatures (the first group of data items of signatures being associated with the first audio file and the second group being associated with the second audio file) and for which groups there is a compatibility between the N consecutive fuzzy logic states of the first group with the N consecutive fuzzy logic states of the second group.
The address of the first data item of the digital signature of the first group G1 makes it possible to label the temporal position of the common extract in the first audio file. The address of the first data item of the digital signature of the second group G2 makes it possible to label the temporal position of the start of the common extract in the second audio file. The number N (of consecutive data found in conjunction) makes it possible to deduce the duration of the extract found by simple multiplication with the duration of the time spans associated with each digital signature data item.
For example, assuming that digital signatures have been calculated on a first file audio1 of one hour and on a second file audio2 of one hour while fixing on a time span duration of 1 second per digital signature data item, in the case where the result of the search gives a common extract of digital signatures of 20 consecutive data items which is labeled by the address 100 in signature 1 and by the address 620 in signature 2, this search result would be conveyed by an audio common extract of a duration of 20 seconds, labeled by a start timing of 1 minute 40 seconds on the file audio1 and by a start timing of 10 minutes 20 seconds on the file audio2.
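The conversion from signature addresses to timings in the example above reduces to simple multiplications (illustrative sketch; the structure and names are ours):

```cpp
#include <cassert>

// Converts a common extract of signatures (start addresses addr1 and addr2,
// size n in data items) into timings on the audio files, for a span of
// spanSeconds per signature data item.
struct AudioExtract { int startSec1, startSec2, durationSec; };

AudioExtract locate(int addr1, int addr2, int n, int spanSeconds) {
    return { addr1 * spanSeconds, addr2 * spanSeconds, n * spanSeconds };
}
```

Address 100 with 1-second spans gives a start at 100 s, i.e. 1 minute 40 seconds, as in the example above.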
Contrary to the search for extracts by identicalness in text files, there are no other steps in the processing which make it possible to remove the doubt as to the identification of the extracts which are logged in the step of comparing the digital signatures. The mathematical algorithm which is used for the calculation of the digital signatures guarantees that if there exists a common extract between the two audio files, a common extract will then be detected between the digital signatures. However, the converse is false: there is a possibility of detecting common extracts of digital signatures which do not correspond to audio common extracts.
In order to have available a confidence index for the search results, the processing uses a probability model which makes it possible to calculate a false detections error rate. The model consists in calculating the probability of matching up a group of N consecutive data items of digital signatures which is representative of an audio extract with another group of N consecutive data items of digital signatures whose values are random and representative of a random audio signal.
The probability P(N) of detecting a common extract of N data of digital signatures is then expressed in the form P^N, P being the probability of drawing a common extract of size 1. In practice, and given the simultaneous use of several fuzzy logic states, P is less than ½ and P(N) is therefore bounded above by (½)^N. Given that 2^10 may be approximated by 10^3, we easily deduce the probability of false detection of a common extract of N data items: P(10)<10⁻³, P(20)<10⁻⁶, . . . .
To evaluate the probable number of false detections which will be associated with the comparison of two audio files, we have to multiply this value P(N) by the total number of pairs of start positions of extracts of digital signatures which is tested during the step of comparing the digital signatures. If we take S1 as notation for the number of data items of digital signatures of the file audio1 and S2 for the file audio2, the probable number of false detections becomes P(N)×S1×S2.
As indicated above, this number is divided by 2, each time that the size of the digital signatures common extracts searched for is increased by 1 (and divided by 1000 if the size is increased by 10).
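The probable number of false detections may be sketched numerically (an illustration under the bound P(N)≦(½)^N stated above; the function name is ours):

```cpp
#include <cassert>
#include <cmath>

// Upper bound on the probable number of false detections when comparing
// S1 and S2 signature data items, for common extracts of N data items:
// P(N) * S1 * S2 with P(N) bounded above by (1/2)^N.
double falseDetections(int N, double S1, double S2) {
    return std::pow(0.5, N) * S1 * S2;
}
```

Each extra data item of minimum size halves the bound; 10 extra data items divide it by 1024, i.e. approximately 1000, as indicated above.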
To hone the algorithm for detecting musical extracts, the minimum size of common extract of signatures has been adjusted to 50 data items, thereby guaranteeing a false detection probability of less than 10⁻¹⁵. This choice takes account of the non-randomness of the audio signals processed, which in the case of music comprise numerous repetitive spans (refrains, and the like). This size may of course be adapted as required by other applications, either to increase or to decrease the acceptable error rate.
On the basis of this minimum size of extract, the program determines, backwards, the minimum duration of the extracts to be searched for as a function of the value of duration associated with each data item of the signature (the inverse of the frequency of the signature data).
For a digital signature frequency of 25 Hz (25 data items per second), the program makes it possible to search for audio extracts of a minimum duration of 2 seconds (50× 1/25 s). For a digital signature frequency of 5 Hz (5 data items per second), the program makes it possible to search for audio extracts of a minimum duration of 10 seconds (50×⅕ s). For a digital signature frequency of 1 Hz (1 data item per second), the program makes it possible to search for audio extracts of a minimum duration of 50 seconds.
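These minimum durations follow directly from the minimum signature size of 50 data items (a sketch that merely encodes the multiplication stated above):

```cpp
#include <cassert>

// Minimum searchable extract duration (in seconds) for a given digital
// signature frequency, with a minimum common extract of minSize data items.
double minDurationSeconds(int minSize, double signatureHz) {
    return minSize / signatureHz;
}
```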
In practice, it is the application which fixes the threshold value of minimum duration of audio extract to be searched for. For applications in the monitoring of advertising, the requirement is to detect extracts of television or radio spots of 5 s. For applications in the recognition of musical titles, the requirement is to detect extracts of the order of 15 s. For applications in the recognition of television programs (films, series, etc.), the requirement is to detect extracts of the order of a minute.
It is indicated moreover that in the application to audio, video, or other files, where the first and second files are files of samples of digitized signals, the method within the meaning of the invention advantageously comprises a step of preprocessing of the data, for example by subband filtering, and a taking into account of the data associated with signal portions of higher level than a noise reference, so as to limit the effects of different equalizations between the first and second files.
Moreover, the method advantageously provides for a step of consolidating the search results, preferably by adjusting relative sizes of the packets of the first and second files, in such a way as to tolerate a discrepancy in respective speeds of retrieval of the first and second files.
Finally, it is indicated that one at least of the first and second files may be, in this application, a data stream, and the method of searching for common extracts is then executed in real time.
A specific program, written in the C++ language, is being developed to perform the search for common extracts on microcomputers equipped with a 32-bit Windows operating system. It makes it possible to select two files to be compared, to define the minimum size of the common extracts to be found therein, and then to instigate the search.
When the search is instigated, the program advantageously displays an execution monitoring window. This window indicates the time elapsed since the start of the search and estimations of the total duration and of the speed of search. It also makes it possible to abandon the search if it transpires that its duration is deemed to be too long. The search is interrupted as soon as a common extract has been found. The size of the extract found and its position in each file are then displayed. The program performs the analysis of the files following a predefined order. The principle is to test each pair of start positions that may be taken by a common extract in the files.
Its implementation is described in the presentations hereinbelow of the search algorithms. It is indicated that the search may be resumed so as to find other extracts common to the files. In this case, the search is resumed from the pair of start positions of the last common extract found and following the predefined order of analysis of the files. The search is stopped when the files have been analyzed completely. The stopping conditions are then displayed so as to indicate as appropriate that there is no extract common to the files or that there is no other extract common to the files.
The program offers the choice between two algorithms for performing searches: a conventional search algorithm and an algorithm within the meaning of the invention.
The program thus makes it possible to compare on one and the same microcomputer the performance of the two algorithms, and to do so for any search configuration, in terms of minimum size of the common extracts to be searched for, of size of the files, of nature of the files, or the like.
The performance evaluation criterion is the swiftness of execution of the algorithms. The execution monitoring windows make it possible to recover the estimations such as the duration of execution to accomplish the search, the search speed, and the like.
With the conventional algorithm, it emerges that the search speed is practically constant and does not depend on the minimum size of the common extracts to be found. It is expressed as a number of operations of comparison of binary data (bytes) per second which are performed by the computer. Its value is always less than the clock frequency of the microprocessor.
On the other hand, with the algorithm within the meaning of the invention, the search speed varies as a function of the minimum size of the common extracts to be found. It is expressed by an estimation of the number of operations of comparison of binary data (bytes) per second which would be performed by the computer if the conventional algorithm were used. Thus, the more the minimum size of the common extracts to be found increases, the more the speed increases. Its value may exceed that of the clock frequency of the microprocessor.
Represented in
Represented in
Represented in
One of the entities at least (PC0 and/or PC2) comprises a memory (respectively MEM1 and/or MEM2) suitable for storing the computer program product as described hereinabove, for the search for common extract between the first and second files.
In this regard, the present invention is also aimed at such an installation.
Here, the entity storing this computer program product is then capable of performing a remote update of one of the first and second files with respect to the other of the first and second files, by first comparing the first and second files. Thus, one of the entities may have altered a computer file through new entries of data or other modifications in a certain period (a week, a month, or the like). The other computer entity, which, in this application, has to provide for the storage and regular updating of the files output by the first entity, receives these files.
Rather than completely transferring the files to be updated from the first entity to the second entity, the first entity labels by the method within the meaning of the invention the data extracts which are common between two versions of the same file, the new version which has been modified by adding or deleting data, and the old version which has been previously transmitted to the other entity and of which the first entity has kept a backup locally. This comparison within the meaning of the invention makes it possible to create a file of differences between the new version and the old version of the file which comprises information regarding the position and size of the common data extracts which may be used to partially reconstruct the new version of the file on the basis of the data of the old version of the file, and which comprises the data supplements which must be used to complete the reconstruction of the new version of the file. The updating of the file is then performed by carrying out a transmission of the file of differences to the second entity, then by thereafter applying a local processing to the second entity for reconstructing the new version of the file by combining the old version of the file and said file of differences.
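The reconstruction of the new version from the old version and the file of differences may be sketched as follows (a hypothetical record layout; the text does not specify the actual format of the file of differences, so the structure and names below are purely illustrative):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of a file of differences: either a span copied
// from the old version (position and size of a common extract), or
// literal supplementary data to complete the reconstruction.
struct DiffRecord {
    bool fromOld;           // true: copy a span from the old version
    size_t pos, size;       // span in the old version (if fromOld)
    std::string literal;    // supplementary data (if !fromOld)
};

std::string reconstruct(const std::string& oldFile,
                        const std::vector<DiffRecord>& diff) {
    std::string out;
    for (const auto& r : diff)
        out += r.fromOld ? oldFile.substr(r.pos, r.size) : r.literal;
    return out;
}
```

Only the supplementary data and the (position, size) pairs travel over the network; the common extracts are taken from the local backup of the old version.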
The application of the method within the meaning of the invention makes it possible to considerably reduce the processing times necessary for generating said file of differences and makes it possible to reduce the volume of data to be transferred (and hence the transfer cost and time) to perform the remote updating of bulky computer files that have undergone only few modifications, in particular when such files comprise data relating to accounts, banking or the like.
The computer entities may take the form of any computing device (computer, server, or the like) comprising a memory for storing (at least momentarily) the first and second files, for the search for at least one common extract between the first file and the second file. They are then equipped with a memory also storing the instructions of a computer program product of the type described above. In this regard, the present invention is also aimed at such a computing device.
It is also aimed at a computer program product, intended to be stored in a memory of a central unit of a computer such as the aforesaid computing device or on a removable medium intended to cooperate with a reader of said central unit. This program product comprises instructions for conducting all or part of the steps of the processing within the meaning of the invention, described hereinabove.
The present invention is also aimed at a data structure intended to be used for a search of at least one extract common to a first and a second file, the data structure being representative of the first file, provided that this data structure is obtained by applying the processing within the meaning of the invention so as to form a digital signature. In particular, this data structure is obtained by implementing steps a) and b) of the method stated hereinabove and comprises a succession of addresses identifying addresses of the first file and to each of which is assigned a fuzzy logic state from among the states: “true” (1), “false” (0) and “undetermined” (?).
Number | Date | Country | Kind |
---|---|---|---
0403556 | Apr 2004 | FR | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---
PCT/FR05/00673 | 3/18/2005 | WO | 7/17/2007 |