This application is related to the following commonly owned copending U.S. patent applications:
Ser. No. 10/393,139 entitled “Method and Apparatus For Finding Repeated Substrings In Pattern Recognition” filed Mar. 20, 2003, and
Ser. No. 10/393,146 entitled “Method and Apparatus For Performing Fast Closest Match In Pattern Recognition” filed Mar. 20, 2003, which are hereby incorporated by reference herein.
The present invention relates in general to pattern recognition systems, and in particular, to pattern recognition systems for finding each occurrence of a reference pattern (RP) within an input pattern (IP) with repeating substrings.
Recognizing patterns within a set of data is important in many fields, including speech recognition, image processing, and seismic data analysis. Some image processors collect image data and then pre-process the data to prepare it to be correlated to reference data. Other systems, like speech recognition, operate in real time, where the input data is compared to reference data as it arrives to recognize patterns. Once the patterns are “recognized” or matched to a reference, the system may output the reference. For example, a speech recognition system may output equivalent text corresponding to input speech patterns. Other systems, like biological systems, may use similar techniques to determine sequences in molecular strings like DNA.
In some systems there is a need to find RPs that are imbedded in a continuous data stream. In non-aligned data streams, there are some situations where occurrences of the RP may be missed if only a single byte-by-byte comparison is implemented. RPs may be missed when there are repeated or nested repeating substring patterns in the input stream or the IP being matched. An RP containing the desired sequence is loaded into storage where each element of the sequence has a unique address. An address register is loaded with the address of the first element of the RP that is to be compared with the first element of the IP. This address register is called a “pointer.” In the general case a pointer may be loaded with an address that may be either incremented (increased) or decremented (decreased). The value of the element pointed to by the pointer is retrieved and compared with input elements that are clocked or loaded into a comparator.
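The missed-occurrence situation described above can be illustrated with a minimal Python sketch. The function name `single_pointer_find`, the list-index stand-in for a storage address, and the example strings are all hypothetical and are not taken from the specification; the sketch assumes a single pointer that is reset on a mismatch and that never re-reads earlier input elements.

```python
def single_pointer_find(rp, ip):
    """Scan ip once with a single pointer into rp (hypothetical sketch).

    Returns the 0-based start indices of the occurrences it reports.
    """
    ptr = 0      # index into rp; stands in for the pointer address
    hits = []
    for i, ie in enumerate(ip):
        if ie == rp[ptr]:
            ptr += 1
            if ptr == len(rp):
                hits.append(i - len(rp) + 1)
                ptr = 0
        else:
            # Mismatch: reset, then test the current element against the first RE.
            ptr = 1 if ie == rp[0] else 0
    return hits


# The RP "AAB" occurs in "AAAB" starting at index 1, but the repeated "A"
# leaves the single pointer out of step with the input, so nothing is reported.
missed = single_pointer_find("AAB", "AAAB")
found_when_aligned = single_pointer_find("AAB", "XAAB")
```

In this sketch the occurrence is reported only when the pointer happens to be aligned with the start of the pattern in the input.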
There is, therefore, a need for a method and an apparatus to ensure that imbedded patterns in an input data stream are not missed because the position of the pointer to the RP does not coincide with the start of the desired pattern in the input pattern.
A method and apparatus for finding an embedded pattern in a data stream uses dual pointers, a first pointer and a second pointer. The dual pointers are used to identify which elements of the RP to compare to a corresponding input element of the IP presently being read. Initially both pointers point to (contain the pointer address of) the first reference element of the RP. When the first reference element in the RP matches the input element being read from the input data stream, a match is recorded and the first pointer is incremented and moves to the next reference element in the RP. This continues until the reference element in the RP pointed to by the first pointer does not match the input element being read from the input data stream. The reference element in the RP being pointed to by the second pointer is also compared to the input elements in the input data stream; however, when the first and second pointers have the same address, only the first pointer is incremented. Therefore, when the first pointer reference element fails to match an input element, a match of the second pointer reference element causes the second pointer to be incremented and the second pointer becomes the activated pointer (one whose pointer address is incremented on a compare).
When the first pointer reference element fails to match the input data stream, its pointer address is reset to the first reference element in the RP. Comparison will continue between reference elements pointed to by the second pointer and input elements in the input data stream. If the second pointer reference element fails to match a current input element in the input data stream, then the first pointer is incremented if it points to a reference element that compares with the current input element. This alternating will continue until the IP is exhausted or the desired imbedded reference pattern is found in the input pattern. In this manner, imbedded patterns in the input data stream will not be missed because a single pointer to the RP was not in synchronization with the input data pattern.
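The dual-pointer scheme summarized above may be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the name `dual_pointer_find` is hypothetical, list indices stand in for pointer addresses (index 0 for the first reference element), a pointer that fails to match is reset without re-testing the current input element in the same cycle, and the search stops at the first complete match.

```python
def dual_pointer_find(rp, ip):
    """Return the 0-based start index of the first occurrence of rp in ip
    found by the two-pointer scan, or None (hypothetical sketch)."""
    n = len(rp)
    if n == 0:
        return None
    p1 = p2 = 0  # both pointers initially hold the address of the first RE
    for i, ie in enumerate(ip):
        m1 = rp[p1] == ie   # first pointer reference element compares
        m2 = rp[p2] == ie   # second pointer reference element compares
        if m1 and m2 and p1 == p2:
            p1 += 1         # same address: only the first pointer is incremented
        else:
            # Each pointer advances on a match and is reset to the first RE otherwise.
            p1 = p1 + 1 if m1 else 0
            p2 = p2 + 1 if m2 else 0
        if p1 == n or p2 == n:
            return i - n + 1    # one pointer has walked the whole RP
    return None


start = dual_pointer_find("AAB", "AAAB")
```

With the hypothetical input "AAAB" and RP "AAB", the first pointer falls out of step at the third "A" while the second pointer, one element behind, completes the match that begins at index 1.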
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, it will be obvious to those skilled in the art that the present invention may be practiced without such specific details. In other instances, well-known circuits may be shown in block diagram form in order not to obscure the present invention in unnecessary detail. For the most part, details concerning timing considerations and the like have been omitted inasmuch as such details are not necessary to obtain a complete understanding of the present invention and are within the skills of persons of ordinary skill in the relevant art.
Refer now to the drawings wherein depicted elements are not necessarily shown to scale and wherein like or similar elements are designated by the same reference numeral through the several views. In the following, a pointer contains the address of an element that is used for some operation. The element at the pointer address could be in general another address or data. To simplify the explanation, the nth element in the RP may be simply written as reference element n or REn (e.g., the first element in the RP is RE1). Likewise, the first element in the input pattern (IP) is written simply as IE1. As explained in the Background, a pointer contains an address of an element stored in an addressable storage unit. To make a comparison between an element in a stored RP and an input element in an input data pattern, a reference pointer is first loaded with the address of the desired element. The reference element is retrieved and loaded into a comparator. The input element is then clocked into the comparator where a comparison is made. Typically, the results of this comparison determine what is done with the address stored in the address pointer (incremented, reloaded, decremented). In this explanation, the terminology “a pointer compares” means that the “element at the address contained in the pointer compares.” Unless stated otherwise, when it is said that a pointer was incremented, this means the address stored in the pointer was increased by one so that the pointer now contains the address of the next sequential element. Resetting or reloading a pointer means that a new address replaces the present address in the pointer. Resetting a pointer to a reference element N means that the pointer now contains the storage address of reference element N and not the value of reference element N itself.
To distinguish between a comparison of the elements pointed to by pointers and a comparison of the addresses of the pointers themselves, the term “element” (e.g., reference element) refers to the element pointed to by the pointer and “pointer address” refers to the address contained in the pointer. Stating that pointer address A compares to pointer address B means that the two pointers A and B contain the same address of an element. In the following, stating that pointer A compares to pointer B or pointer A compares to element C means that the element pointed to by the address in pointer A compares to the element pointed to by the address in pointer B or to the element C itself.
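The distinction between comparing pointer addresses and comparing the elements they point to can be shown with a short Python sketch. The storage contents and the function names are hypothetical, and list indices stand in for storage addresses.

```python
rp_storage = ["A", "B", "A"]   # hypothetical RP storage; index = storage address


def pointer_addresses_compare(addr_a, addr_b):
    """True when the two pointers contain the same address."""
    return addr_a == addr_b


def pointers_compare(storage, addr_a, addr_b):
    """True when the elements at the two pointer addresses are equal."""
    return storage[addr_a] == storage[addr_b]


# Pointers holding addresses 0 and 2 contain different addresses,
# yet the pointers compare because both addressed elements are "A".
same_address = pointer_addresses_compare(0, 2)
same_element = pointers_compare(rp_storage, 0, 2)
```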
A reference element (RE) is one of the elements in a reference pattern (RP). In the following, “pointer reference element” may be used at times to refer to the particular RE pointed to by a particular pointer. For example, the RE pointed to by a first pointer may be referred to as the “first pointer reference element” to distinguish it from an RE pointed to by a second pointer. This may be done to clarify the explanation. When it is stated that a pointer is “activated,” it means that its pointer address is incremented if it matches a current IE.
At clock cycle 1, PTR 108 and PTR 208 both point to RE1 of RP 303. In this embodiment, when both pointers point to the same element in RP 303, PTR 108 is first activated (moves in response to the result of a compare cycle). At clock cycle 1, IE1 is an “A” and it compares to the “A” of RE1. A match is recorded and PTR 108 is moved to RE2, which is a “B”. PTR 108 continues to match the elements of IP 302 until it does not match a sequential element in IP 302. In this embodiment, PTR 208 is likewise comparing to IP 302; however, it does not start comparison until one clock cycle after PTR 108 if they are at the same RE. It is not incremented unless it matches an element of IP 302 and it is not pointing to the same element of RP 303 as PTR 108. At clock cycle 4, PTR 208 points to RE1 and matches IE4. Also, PTR 208 and PTR 108 point to different elements of RP 303. PTR 108 continues to match IP 302 until clock cycle 7, where IE7 does not match RE7. At this point, PTR 108 resets back to RE1 and again looks for matches in IP 302. PTR 208 started to match IP 302 at clock cycle 4. At clock cycle 7, PTR 208 is pointing to RE4 and IE7 matches RE4. PTR 208 matches IP 302 from clock cycle 7 until clock cycle 17. Meanwhile, PTR 108 (again at RE1) finds a match in IP 302 at clock cycle 11, where RE1 and IE11 match and PTR 108 and PTR 208 are not at the same element in RP 303. PTR 208 and PTR 108 continue to match IP 302 until clock cycle 17, where PTR 208 is at RE14. At this point, IE17 (D) fails to match RE14 (E). PTR 208 is reset back to RE1 and PTR 108 continues to match IP 302. PTR 108 continues to match elements of IP 302 until clock cycle 24, where the entire pattern of RP 303 is found. PTR 208 again matches IP 302 at clock cycle 18 and continues to match through clock cycle 24, where it is at RE14.
If the result of the test in step 402 is YES, then in step 403 a test is done to determine if IE (I) also matches the RE pointed to by PTR 208. If the result of the test in step 403 is NO, then PTR 108 is incremented and PTR 208 is reset to RE1 in step 409. Then in step 414 the successful matches are accumulated into a match sequence for PTR 108. A branch is again taken to step 412. If the result of the test in step 403 is YES, then a test is done to determine if PTR 108 and PTR 208 both point to the same RE in step 404. If the result of the test in step 404 is YES, then in step 405 PTR 108 is incremented and PTR 208 is held and a branch is taken again to step 412. If the result of the test in step 404 is NO, then both PTR 108 and PTR 208 are incremented in step 410. Then in step 415, the successful matches are accumulated into match sequences for PTR 108 and PTR 208. At this time, it is not known which of the pointers will end up finding a match in IP 302. In this case, both PTRs are comparing elements (different elements) of RP 303 to IP 302. From step 410, a branch is taken to step 412. When a branch is taken to step 412, a test is done to determine if the last IE in IP 302 has been read to be compared. If the result of the test in step 412 is YES, then a branch is taken to step 400 and a new IP may be started. If the result of the test in step 412 is NO, then there remain IEs in IP 302 that may need to be processed. Then in step 417, a test is done to determine if either of the accumulated match sequences has a complete match to the RP 303. If the result of the test in step 417 is NO, then in step 411 a branch is taken back to step 401 where index I is incremented to read in the next IE (I). If the result of the test in step 417 is YES, then a match is found and a branch is taken back to step 400 where a new pattern recognition process may be started.
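The branch taken when the test in step 402 is YES can be sketched as a small Python transition function. The sketch covers only steps 403, 404, 405, 409 and 410 as described above; the function name `step_402_yes`, the 0-based indices standing in for pointer addresses (index 0 for RE1), and the returned step label are assumptions made for illustration.

```python
def step_402_yes(p1, p2, ie, rp):
    """Return (new_p1, new_p2, step) for the YES branch of step 402.

    p1 and p2 are 0-based indices into rp standing in for the addresses
    held by PTR 108 and PTR 208 (hypothetical sketch).
    """
    if rp[p1] != ie:
        raise ValueError("step 402 must have answered YES for this branch")
    if rp[p2] != ie:                 # step 403: NO
        return p1 + 1, 0, "409"      # increment PTR 108, reset PTR 208 to RE1
    if p1 == p2:                     # step 404: YES
        return p1 + 1, p2, "405"     # increment PTR 108, hold PTR 208
    return p1 + 1, p2 + 1, "410"     # step 404: NO, increment both pointers


same_re = step_402_yes(0, 0, "A", "AAB")
different_re = step_402_yes(1, 0, "A", "AAB")
only_first = step_402_yes(1, 0, "B", "ABC")
```

Each call returns the updated pointer indices together with the number of the step that produced them, after which the flow described above continues at step 412.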
To initialize the system 600, both P1R 608 and P2R 609 are loaded with the pointer address (PA) corresponding to RE1 in RP storage 604. At the beginning of a pattern recognition cycle, the P1A 622 and P2A 623 are used to load a first reference element RE1 corresponding to P1A 622 into P1 comparator 606 and RE2 corresponding to P2A 623 into P2 comparator 607. PE1 and PE2 are compared to an IE in P1 comparator 606 and P2 comparator 607. P1A 622 and P2A 621 are compared in controller 603 to determine if the pointers point to the same RE. The results of the P1A 622 and P2A 621 compare are coupled to compare logic 610 in signals 620. Compare logic 610 generates increment signals, increment P2A 619 and increment P1A 618, in response to compare signals 616 and 617 and the result of the P1A 622 and P2A 621 compare (in signals 620). When P1A 622 and P2A 621 are incremented, they access new REs from RP 611 stored in RP storage 604. Compare logic 610 has a P1M counter 630 to keep track of consecutive matches for REs accessed by pointer one (P1) and a P2M counter 631 to keep track of consecutive matches for P2. The P1M counter 630 and P2M counter 631 are reset to zero if the respective RE 614 accessed by P1 or RE 614 accessed by P2 fails to match an IE 615 being clocked. A number R, defining the total number of REs in RP 611, is sent to compare logic 610 by controller 603. If either count value in P1M counter 630 or the P2M counter 631 compares to R, then an occurrence of RP 611 has been found in the IP 602. Compare logic 610 also keeps track of which particular IE 615 generates a match to an RE 614 accessed by P1 or RE 614 accessed by P2 so that the location of RP 611 in the IP 602 is determined. IP unit 605 may store addresses of IP 602 each time there is a match in response to signals 626.
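The counter behavior described for compare logic 610 may be modeled in software roughly as follows. This Python sketch is an assumption-laden illustration, not the circuit itself: the class `CompareLogic`, its `clock` method and the driver `run` are hypothetical names, booleans stand in for the compare and increment signals, and the sketch assumes that a held second pointer does not advance its counter.

```python
class CompareLogic:
    """Software stand-in for the match counters (hypothetical sketch)."""

    def __init__(self, r):
        self.r = r      # R: total number of REs in the RP
        self.p1m = 0    # consecutive matches for pointer one
        self.p2m = 0    # consecutive matches for pointer two

    def clock(self, p1_match, p2_match, same_address):
        """Return (increment_p1, increment_p2, found) for one input element."""
        inc_p1 = p1_match
        inc_p2 = p2_match and not same_address   # same address: only P1 advances
        self.p1m = self.p1m + 1 if inc_p1 else 0
        self.p2m = self.p2m + 1 if inc_p2 else 0
        found = self.p1m == self.r or self.p2m == self.r
        return inc_p1, inc_p2, found


def run(rp, ip):
    """Drive the counters; return the index of the IE that completes a match, or None."""
    logic = CompareLogic(len(rp))
    p1a = p2a = 0    # pointer addresses, as 0-based indices into rp
    for i, ie in enumerate(ip):
        p2_match = rp[p2a] == ie
        inc_p1, inc_p2, found = logic.clock(rp[p1a] == ie, p2_match, p1a == p2a)
        if found:
            return i
        p1a = p1a + 1 if inc_p1 else 0
        if inc_p2:
            p2a += 1
        elif not p2_match:
            p2a = 0      # reset on a failed compare; otherwise the pointer is held
    return None


last_ie = run("AAB", "AAAB")
```

For the hypothetical input "AAAB" and RP "AAB", the second counter reaches R = 3 at the fourth input element, so the occurrence begins R - 1 elements earlier.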
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Number | Date | Country |
---|---|---|
20040184661 A1 | Sep 2004 | US |