Pattern matching in computing applications involves locating instances of a shorter sequence (such as a string)—or an approximation thereof—within an equal or larger sequence. This is particularly useful in the analysis of time series data, such as for data mining.
Various pattern matching algorithms exist, each suitable for specific applications.
Embodiments of the invention will now be described by way of example only with reference to the drawings in which:
There will be described a method for comparing a first data set with a second data set, each comprising one or more corresponding segments. The method comprises determining the difference between corresponding pairs of end points of corresponding segments, and deeming the first data set to match the second data set if the difference is less than a predetermined tolerance for all of the corresponding pairs of end points, and deeming the first data set not to match the second data set if the difference is greater than the predetermined tolerance for any one of the corresponding pairs of end points. If the difference between a corresponding pair of end points equals the predetermined tolerance, the method may treat this either as consistent with matching or as inconsistent with matching, according to user preference, application or otherwise.
The method may include determining the difference for all of the end points of the segments, then identifying whether the difference exceeds the predetermined tolerance for any of the end points of the segments. Thus, the difference may be determined for all the segments (and both ends thereof) before checking whether any difference value exceeds the tolerance (hence indicative of a mismatch) or whether all the difference values are less than the tolerance (hence indicative of a match).
The method may comprise determining the difference until either the difference has been determined to be less than the predetermined tolerance for all of the corresponding pairs of end points or the difference has been determined to be greater than the predetermined tolerance for any one of the corresponding pairs of end points. Thus, rather than determining the difference for every pair of end points and then checking against the tolerance, the determination of differences can stop as soon as any single pair of end points is found to exceed the tolerance.
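The two variants just described can be illustrated with a minimal Python sketch. It is not taken from the embodiment: the function names and the representation of a segment as a (start value, end value) pair are assumptions made for illustration, segments are assumed to be paired by position, and a difference exactly equal to the tolerance is assumed to count as a match (the description leaves that choice open).

```python
from typing import Sequence, Tuple

# Hypothetical representation: each segment is (value at start, value at end).
Segment = Tuple[float, float]


def matches_exhaustive(first: Sequence[Segment],
                       second: Sequence[Segment],
                       tolerance: float) -> bool:
    """Compute every end-point difference first, then check the tolerance."""
    differences = []
    for (a_start, a_end), (b_start, b_end) in zip(first, second):
        differences.append(abs(a_start - b_start))
        differences.append(abs(a_end - b_end))
    # Assumption: a difference equal to the tolerance counts as a match.
    return all(d <= tolerance for d in differences)


def matches_early_exit(first: Sequence[Segment],
                       second: Sequence[Segment],
                       tolerance: float) -> bool:
    """Stop as soon as one pair of end points exceeds the tolerance."""
    for (a_start, a_end), (b_start, b_end) in zip(first, second):
        if abs(a_start - b_start) > tolerance:
            return False
        if abs(a_end - b_end) > tolerance:
            return False
    return True
```

Both functions return the same answer for the same inputs; the second simply avoids computing differences that cannot change the outcome once a mismatch has been seen.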
The method may include identifying a maximum and a minimum value in each of the segments of the first data set and of the second data set, performing a comparison of the maxima of the pairs of corresponding segments, the minima of the pairs of corresponding segments, or both the maxima and the minima of the pairs of corresponding segments, and deeming the first data set not to match the second data set if a mismatch is identified.
A time series query method for analysing time series data (referred to below as reference data) is illustrated by means of a flow diagram in
Thus, at step 102 (see
At step 104, the input pattern and the reference data set are smoothed to eliminate minor fluctuations in the data that are regarded as noise. Thus, in the case of the reference data, a window is defined about each reference data point, the average value over that sliding window is determined, and that average value is used as the new value of the respective point, thereby reducing such fluctuations. The input pattern is processed in the same manner.
The size of the window defined about each data point dictates how much proximity is acceptable, and is specified by the user. Some users may wish to identify only regions of high similarity between the reference data and the input pattern, and will therefore employ a small window size. Users content to locate less close matches will employ a larger window size.
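The smoothing of step 104 may be sketched as follows. This is an illustrative assumption rather than the embodiment's implementation: the window is taken to be centred on each point, its size is expressed as a hypothetical `half_width` parameter, and the window is simply truncated at the ends of the data, none of which is specified in the description.

```python
from typing import List, Sequence


def smooth(values: Sequence[float], half_width: int) -> List[float]:
    """Replace each point by the mean of a window centred on it.

    Assumption: the window covers `half_width` points on either side and is
    truncated where it would run past the start or end of the data.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

With `half_width` set to 0 the data are returned unchanged; larger values average over more neighbours and so suppress more of the fluctuation.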
At steps 106 and 108, segmentation is performed in order to reduce the number of comparison points so that matching is faster. Thus, referring to
At step 108 the mid-line 208 of the tunnel 202 that was fitted to the reference data 206 is determined and output as an output segment for use in place of the smoothed reference pattern 204. (The mid-line 208 is also stored for future use.) Similarly, the mid-line of the tunnel fitted to the input pattern is determined and output as an output segment for use in place of the smoothed input pattern; this mid-line can, but generally will not, be stored for future use.
The width of the tunnel is, in each case, specified by the user. It equals the vertical distance 210 between the top of the tunnel and the bottom of the tunnel. Its width is chosen according to the level of matching desired between the reference data and the input pattern. Thus, the smaller the width of the tunnel, the more closely must the reference data match the input pattern if a match is to be deemed to exist during the subsequent pattern matching proper.
At step 110, the input pattern is scaled to the reference data in the current time window. This is done because comparisons of two patterns (i.e. data sets) have little meaning if the absolute scales of the data differ significantly. Hence at this step the input pattern is scaled by multiplying each point such that its average becomes equal to the sliding average of the reference data.
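The scaling of step 110 can be sketched in a few lines. The function name and arguments below are hypothetical, and the sketch assumes that the "sliding average" is the mean of the reference data in the current time window and that the input pattern has a non-zero mean.

```python
from typing import List, Sequence


def scale_to_reference(pattern: Sequence[float],
                       reference_window: Sequence[float]) -> List[float]:
    """Multiply every pattern point by one factor so the means coincide."""
    pattern_mean = sum(pattern) / len(pattern)
    reference_mean = sum(reference_window) / len(reference_window)
    if pattern_mean == 0:
        # Assumption: a zero-mean pattern cannot be scaled multiplicatively.
        raise ValueError("input pattern has zero mean")
    factor = reference_mean / pattern_mean
    return [p * factor for p in pattern]
```

Because a single multiplicative factor is used, the shape of the input pattern is preserved while its overall level is brought to that of the reference data in the window.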
At step 112, the local maximum (or peak) and local minimum (or trough) in the input pattern (denoted Pi and Ti respectively) and, similarly, the local maximum and local minimum in the reference data (denoted Pr and Tr respectively) are located for the current (initially, first) time window. This is illustrated schematically in
These properties of each cycle of a sinusoidal curve (i.e. only one peak and one trough, and every other point having at least one other point with the same amplitude) mean that it is quicker, when comparing sinusoidal curves, to find a mismatch than to find a match (which requires an exhaustive point-by-point comparison). Further, since the number of peaks and troughs is minimal, there is a high probability that these points will mismatch if a mismatch is indeed to be found. Hence, by representing both data sets as sinusoidal curves, mismatches can be located promptly.
Thus, by initially comparing the peaks and troughs of both the input pattern and the reference data, many mismatches can be quickly identified in this phase, which leads to faster jumps and hence faster matching. If all the peaks and troughs are found to match, then matching need only be further checked in respect of sub-segment end-points.
Hence, at step 114 the method compares corresponding peaks (or maxima) in the input pattern and reference data and, at step 116, tests whether the corresponding peaks match. If they do not match, the time window is advanced by one segment at step 118 and processing returns to step 110. If a match is found at step 116, processing continues at step 120, where corresponding troughs (or minima) in the input pattern and reference data are compared. At step 122, the method tests whether these corresponding troughs match; if not, processing continues at step 118, where the time window is advanced by one segment, and then returns to step 110.
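The peak and trough pre-check of steps 114 to 122 might be sketched as below. The criterion for two peaks (or troughs) to "match" is an assumption here: the sketch compares amplitudes only and accepts a difference up to a hypothetical `tolerance`, whereas the embodiment may also take the positions of the extrema into account.

```python
from typing import Sequence


def peaks_and_troughs_match(pattern: Sequence[float],
                            reference_window: Sequence[float],
                            tolerance: float) -> bool:
    """Cheap pre-check performed before any segment-by-segment comparison."""
    # Steps 114/116: compare the peaks (maxima) first.
    if abs(max(pattern) - max(reference_window)) > tolerance:
        return False  # the caller would advance the window by one segment
    # Steps 120/122: only if the peaks agree, compare the troughs (minima).
    return abs(min(pattern) - min(reference_window)) <= tolerance
```

A `False` result corresponds to the branch to step 118; a `True` result corresponds to proceeding to sub-segmentation at step 124.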
If the corresponding troughs are found to match at step 122, processing continues at step 124, where sub-segmentation is performed in the current time window. Referring to the schematic plot of an exemplary time window 400 of
Once the sub-segmentation has been completed, the actual pattern matching is performed. This involves the following steps 126 to 134.
At step 126 (see
At step 128, the method checks whether, for this pair of segments, the differences between the end-points are both less than or equal to a tolerance T, that is, whether this pair of corresponding segments match to within that tolerance. If so, processing passes to step 130, where the method checks whether the segment pair just compared at steps 126 and 128 was the last pair of corresponding segments in the current time window. If not, the method continues at step 132 where it advances to the next pair of corresponding segments in the current time window, then returns to step 126. Progressively, therefore, all the pairs of corresponding segments in the current time window are compared as long as no mismatches are found.
If, at step 130, it is determined that the last segment pair has just been compared, the method continues at step 134, where a match is held to have been found, and the input pattern 402 is considered to match the reference data 404 in that time window. Processing then continues at step 136, where the current time window is advanced by the width of the lowest segment (that is, the lowest sub-segment defined at step 124), and the method then continues at step 122.
If, at step 128, the method determines that, for the instant pair of segments, the difference between either pair of end-points is greater than the tolerance T, the input pattern 402 and the reference data 404 are considered not to match in that time window and the method continues at step 138, where a match is held not to have been found.
In this embodiment at steps 126 to 132, the pairs of corresponding segments are compared from left to right as shown in
In addition, it will be appreciated by those in the art that it is sufficient to compare only the end-points of the segments to determine whether corresponding segments match because, if the end-points of the segments match according to this test, then all the points in the segment necessarily match. Thus, the criterion for finding a match may be described as requiring that all the points in all the segments match, but according to this embodiment, this is established by comparing only end-points. In a computing environment this considerably reduces computing time overhead.
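The sufficiency of end-point comparison can be checked numerically under one stated assumption: that each segment is a straight line and that corresponding segments span the same time interval. The difference between two such lines is itself linear in time, so its absolute value is largest at one of the two ends. The helper below is purely illustrative and samples the interior of a segment pair to confirm this.

```python
def max_abs_difference(a_start: float, a_end: float,
                       b_start: float, b_end: float,
                       samples: int = 101) -> float:
    """Largest |a(t) - b(t)| over evenly spaced points of two straight segments."""
    worst = 0.0
    for k in range(samples):
        t = k / (samples - 1)                 # t runs from 0 to 1 inclusive
        a = a_start + t * (a_end - a_start)   # point on the first segment
        b = b_start + t * (b_end - b_start)   # point on the second segment
        worst = max(worst, abs(a - b))
    return worst


# Hypothetical example values, chosen only for illustration.
interior_worst = max_abs_difference(0.0, 1.0, 0.2, 0.7)
endpoint_worst = max(abs(0.0 - 0.2), abs(1.0 - 0.7))
```

For these example values the sampled maximum does not exceed the larger of the two end-point differences, which is why two comparisons per segment pair can stand in for a point-by-point comparison.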
From step 138 (i.e. a match is held not to have been found in the current time window), the method continues at step 140. At this step, the method of this embodiment determines whether the input pattern 402 and the reference data 404 were held not to match owing to a mismatch at the start of a pair of corresponding segments or at the end of those corresponding segments.
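The comparison loop of steps 126 to 132, together with the start-or-end determination of step 140, could be sketched as follows. The function name, the (start value, end value) representation of a segment, and the returned tuple are assumptions for illustration; segments are compared in order, and a difference equal to the tolerance is treated as a match, as at step 128.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representation: each segment is (value at start, value at end).
Segment = Tuple[float, float]


def first_mismatch(pattern_segments: Sequence[Segment],
                   reference_segments: Sequence[Segment],
                   tolerance: float) -> Optional[Tuple[int, str]]:
    """Return None on a match, else (segment index, "start" or "end")."""
    pairs = zip(pattern_segments, reference_segments)
    for index, ((p_start, p_end), (r_start, r_end)) in enumerate(pairs):
        if abs(p_start - r_start) > tolerance:
            return index, "start"   # mismatch at the start of the segments
        if abs(p_end - r_end) > tolerance:
            return index, "end"     # starts agree, ends diverge
    return None                     # every pair matched within tolerance
```

A result of `None` corresponds to step 134 (match found); any other result corresponds to step 138, with the second element indicating which branch of step 140 applies.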
If the mismatched segments were mismatched at their starts, the method continues at step 136, at which—as described above—the current time window is advanced by the width of its lowest (sub-)segment and the method then continues at step 122.
If the mismatched segments were not mismatched at their start points but were at their end points, the method continues at step 142. Clearly, if the corresponding segments that were held not to match were not mismatched at their start points but were at their end points they must be diverging in the increasing time direction. Such a situation is depicted in
Thus, at step 142 the method advances in an increasing time direction by one segment. At step 144, the method determines whether the instant corresponding segments (i.e. of the input pattern and of the reference data) converge and whether the start point 506a of the entire input pattern is within tolerance T of the end point of the instant segment of the reference data. In the example of
If either or both of these conditions are not satisfied, the method returns to step 142. If both conditions are satisfied, the method continues at step 146, at which the input pattern is advanced in a time increasing direction to the end point of the segment (510 in
Hence, in the example shown in
Thus, by advancing the input pattern (502 in
Next, at step 148 a new segment 512 of width |t′| is defined, extending from the time translated start point of the input pattern to the end point of the reference data segment (510 in
Reference data (in the form of Hewlett-Packard stock indices over 5 years) was searched for matches with input patterns of various lengths, using both the technique described in Keogh and Smyth (A probabilistic approach to fast pattern matching in time series databases, Proc. of the 3rd International Conference of Knowledge Discovery and Data Mining (1997) 24-30) and that of this embodiment. The number of comparisons made in each case is tabulated in Table 1. This table also includes the percentage improvement in the number of comparisons obtained by employing the method of this embodiment. This percentage improvement was calculated as:
% improvement = (M − N) × 100 / N
where M is the number of comparisons required according to the method of Keogh and Smyth and N is the number of comparisons required according to the method of this embodiment.
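As a worked illustration of this formula, the short Python sketch below evaluates it for made-up comparison counts. The numbers are hypothetical and are not taken from Table 1.

```python
def percent_improvement(m: int, n: int) -> float:
    """(M - N) * 100 / N, with M and N as defined in the text."""
    if n <= 0:
        raise ValueError("N must be a positive number of comparisons")
    return (m - n) * 100 / n


# Hypothetical counts, for illustration only: M = 150, N = 100.
example = percent_improvement(150, 100)
```

Here the result is 50.0, meaning that the reference method needed 50% more comparisons than the method being evaluated; note that the improvement is expressed relative to N, not to M.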
From the results in Table 1, it can be seen that the method of this embodiment provides better results than that of Keogh and Smyth. Further, it will be observed that the improvement increases with the length of the input pattern.
Referring to
The foregoing description of the exemplary embodiments is provided to enable any person skilled in the art to make or use the present technique. While the present technique has been described with respect to particular illustrated embodiments, various modifications to these embodiments will readily be apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. It is therefore desired that the present embodiments be considered in all respects as illustrative and not restrictive. Accordingly, the present invention is not intended to be limited to the embodiments described above but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Number | Date | Country | Kind |
---|---|---|---|
IN2875/DEL/2005 | Oct 2005 | IN | national |