1. Field of Invention
The present invention relates to information processing, and more particularly to a rapid string matching method.
2. Description of Related Arts
Applications such as text editors, search engines, data processing and communication systems usually need to search for, locate and count occurrences of a target string within a long source string at high speed. Suppose that the source string has a length of m and the target string has a length of n. A naïve string matching algorithm, such as a straightforward implementation of strstr( ) from the C standard library, compares the strings character by character from head to end, which causes much repeated matching of the characters of the target string and is inefficient; its worst-case time complexity is O(m*n). An improved matching algorithm, such as the Knuth-Morris-Pratt (KMP) algorithm, reduces the repeated matching of the characters of the target string and is therefore more efficient than the naïve string matching algorithm, but it still examines the whole m-length source string, so the efficiency remains to be further improved.
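To make the O(m*n) worst case concrete, the following is a minimal sketch of a naïve matcher that also counts character comparisons. The function name `naive_find` and the comparison counter are illustrative assumptions, not part of any library or of the original disclosure.

```python
def naive_find(source: str, target: str) -> tuple:
    """Return (index of the first match or -1, number of character comparisons)."""
    m, n = len(source), len(target)
    comparisons = 0
    # Try every possible starting position of the target in the source.
    for i in range(m - n + 1):
        j = 0
        while j < n:
            comparisons += 1
            if source[i + j] != target[j]:
                break  # mismatch: abandon this starting position
            j += 1
        if j == n:
            return i, comparisons  # all n characters matched
    return -1, comparisons


# Worst-case style input: every window matches until its last character.
print(naive_find("a" * 10, "aaab"))  # prints (-1, 28), i.e. (m - n + 1) * n comparisons
```

With m = 10 and n = 4, each of the 7 starting positions costs 4 comparisons, which is the repeated work the paragraph above refers to.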
An object of the present invention is to provide a rapid string matching method which improves the efficiency of matching and searching for a target string.
Accordingly, in order to accomplish the above object, the present invention provides a rapid string matching method comprising steps of:
(1) pre-treating a target string to obtain a simple hash table of the characters of the target string, so that determining whether an arbitrary character belongs to the target string takes constant time, O(1);
(2) starting matching, searching a source string for a character matching the first character of the target string, and ending the search when the end of the source string is reached;
(3) if the searched character of the source string matches the first character of the target string, going to step (5);
(4) if the searched character of the source string does not match the first character of the target string, moving a character pointer of the source string to the next character and going to step (2);
(5) checking whether the last character of the part of the source string which starts at the searched character matching the first character of the target string and has the length of the target string belongs to the target string; if yes, going to step (6); if no, going to step (8);
(6) checking whether the target string is wholly or partially within the part of the source string which starts at the searched character matching the first character of the target string and has the length of the target string, and whether the whole of that part is matched; if yes, going to step (7); if no, going to step (8);
(7) moving the character pointer forward by the length of the target string from a character which re-matches the first character of the target string, and going to step (2);
(8) moving the character pointer forward by the length of the target string from the searched character matching the first character of the target string, and going to step (2).
In the rapid string matching method of the present invention, the target string is pre-treated; as soon as a character of the source string matches the first character of the target string, the character at the position corresponding to the last character of the target string is checked.
The present invention relies on events that occur with high probability in practice. The source string refers to the string to be searched. The target string refers to the string to be found.
Firstly, the characters of the string are not random characters.
Secondly, since the characters are not random, a certain association exists among them, especially between neighboring characters. For example, next to a consonant, a vowel is more likely to appear than another consonant; in Chinese, it is far more probable for “” to be followed by “” than by “”. When two or more adjacent characters are matched, the characters next to the matched part are more likely to match than the characters far away from it. In other words, the characters further away from the matched part have a relatively higher probability of not matching.
Thirdly, even for a totally random string, the probability that an arbitrary character belongs to the target string is far lower than the probability that it does not. The last-character check is defined as testing whether the last character of the part of the source string being compared belongs to the target string. If the first character of the target string is matched but this check fails (as stated above, failure is more probable than success), the part of the source string from the first character to the last character cannot match, because every character of a matching part must belong to the target string, so there is no need to compare the characters one by one. Therefore, the part of the source string from the first character to the last character can be skipped directly, and the matching continues from the character next to the last character.
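As a rough illustration of the third point, assume (this assumption and the symbols σ and k are not in the original) that source characters are drawn independently and uniformly from an alphabet of size σ, and that the target string T of length n contains k distinct characters, with k ≤ n. The probability that an arbitrary source character c belongs to the target string is then:

```latex
P(c \in T) = \frac{k}{\sigma} \le \frac{n}{\sigma}
```

For instance, with σ = 256 and n = 8 this probability is at most 8/256 ≈ 3.1%, so under these assumptions the last-character check fails, and the part is skipped, at least about 96.9% of the time. Real text is not uniform, so the actual figure depends on the data.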
Fourthly, within the source string, non-matching parts far outnumber matching parts, and thus it is more meaningful to improve the efficiency of handling non-matching parts than that of handling matching parts.
Fifthly, on this basis, the method of the present invention speeds up the rejection of non-matching parts by checking the last character, and thereby speeds up the search for matching parts. The better the input satisfies the aforesaid conditions, the higher the efficiency.
Sixthly, even for pattern matching among random characters, the method of the present invention is more efficient than the naïve algorithm.
Tests show that the method of the present invention has an average time efficiency advantage of no less than 20% over the naïve string matching algorithm.
These and other objectives, features, and advantages of the present invention will become apparent from the following detailed description, the accompanying drawings, and the appended claims.
Referring to
A rapid string searching method, according to the preferred embodiment of the present invention, comprises step (1) of: pre-treating a target string to obtain a simple hash table for rapid lookup, so that determining whether an arbitrary character belongs to the target string takes constant time, O(1), which is executed as the following program.
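As a minimal sketch of such a table, not the program of the embodiment, the pre-treatment might look as follows. It assumes 8-bit characters, so that a 256-entry array indexed by byte value can serve as the "simple hash table"; the function name `build_membership_table` is hypothetical.

```python
def build_membership_table(target: bytes) -> bytearray:
    """Return a 256-entry table where table[b] == 1 if byte b occurs in target."""
    table = bytearray(256)  # all entries start at 0
    for b in target:        # iterating over bytes yields integers 0..255
        table[b] = 1
    return table


table = build_membership_table(b"needle")
# Membership is a single array lookup, independent of the target length.
print(table[ord("e")], table[ord("x")])  # prints: 1 0
```

Building the table costs one pass over the target string; each later membership query is a single index operation.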
The rapid string searching method further comprises searching the source string for the target string, comprising steps of:
(2) searching for a character matching the first character of the target string, and ending the search when the end of the source string is reached;
(3) if the searched character of the source string matches the first character of the target string, going to step (5);
(4) if the searched character of the source string does not match the first character of the target string, moving a character pointer of the source string to the next character and going to step (2);
(5) checking whether the last character of the part of the source string which starts at the searched character matching the first character of the target string and has the length of the target string belongs to the target string; if yes, going to step (6); if no, going to step (8);
(6) checking whether the target string is wholly or partially within the part of the source string which starts at the searched character matching the first character of the target string and has the length of the target string, and whether the whole of that part is matched; if yes, going to step (7); if no, going to step (8);
(7) moving the character pointer forward by the length of the target string from a character which re-matches the first character of the target string, and going to step (2);
(8) moving the character pointer forward by the length of the target string from the searched character matching the first character of the target string, and going to step (2).
Steps (2)-(8) are executed as the following program.
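Steps (2)-(8) can be sketched roughly as follows. This is an illustrative Python rendering under stated assumptions, not the program of the embodiment: the name `rapid_find_all` is hypothetical, characters are assumed to be 8-bit, and occurrences are reported without overlap, following step (7). One deliberate deviation is labeled in the code: when the last-character check passes but the full comparison fails, the sketch advances by one character rather than by the target length, because a longer jump at that point could pass over an occurrence (for example "ab" in "aab").

```python
def rapid_find_all(source: bytes, target: bytes) -> list:
    """Return start indices of non-overlapping occurrences of target in source."""
    n = len(target)
    if n == 0 or n > len(source):
        return []
    # Step (1): membership table, one entry per possible byte value.
    in_target = bytearray(256)
    for b in target:
        in_target[b] = 1
    matches = []
    i = 0
    last_start = len(source) - n  # last position where a full part still fits
    while i <= last_start:
        if source[i] != target[0]:
            i += 1                # step (4): first character differs
            continue
        if not in_target[source[i + n - 1]]:
            i += n                # steps (5) and (8): no match can include that character
            continue
        if source[i:i + n] == target:
            matches.append(i)     # step (6) succeeded
            i += n                # step (7): continue after the matched part
        else:
            i += 1                # ASSUMPTION: conservative advance, see lead-in
    return matches


print(rapid_find_all(b"axxxxab ab", b"ab"))  # prints [5, 8]
```

In the usage line, the first "a" is followed by "x", which is not in the target, so the sketch jumps two characters at once instead of comparing the part character by character.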
(1) As shown in
(2) As shown in
(3) As shown in
(4) As shown in
(5) The next round of string matching continues by returning to (1).
Thus, whether or not the whole target string is matched, each round advances the matching position by at least the length of the target string, so as to improve the efficiency.
The string searching method of the present invention has a higher efficiency than common searching algorithms, including the naïve string searching algorithm, so as to speed up string matching. In a simple test, a paragraph of text containing program source code is searched for a designated string. The results are as follows (Pentium® Dual-Core CPU E5800 @ 3.20 GHz, 4 GB, no compiler optimization).
When searching for a string that exists in the text, the time efficiency of the method of the present invention is 24% higher than that of the naïve string searching algorithm.
When searching for a string that does not exist in the text, the time efficiency of the method of the present invention is 22% higher than that of the naïve string searching algorithm.
For some special strings, the method of the present invention achieves a greater improvement, nearly 50% in the above simple test.
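Readers who wish to run a comparable measurement on their own data could start from a small timing harness such as the sketch below. It is an assumption-laden illustration, not the test program behind the figures above: the names `best_time` and `builtin_find` are hypothetical, the sample text is invented, and Python's built-in `bytes.find` merely stands in for whichever search functions are to be compared.

```python
import time


def best_time(search, source, target, repeats=5):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search(source, target)
        best = min(best, time.perf_counter() - start)
    return best


def builtin_find(source, target):
    # Stand-in search function; replace with the implementations under test.
    return source.find(target)


# Invented sample: repeated source-code-like text and a string that is absent.
text = b"int main(void) { return 0; } " * 1000
elapsed = best_time(builtin_find, text, b"no_such_token")
print(f"{elapsed:.6f} s")
```

Taking the minimum over several repetitions reduces the influence of scheduling noise; the absolute numbers depend on the machine and interpreter.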
One skilled in the art will understand that the embodiment of the present invention as shown in the drawings and described above is exemplary only and not intended to be limiting.
It will thus be seen that the objects of the present invention have been fully and effectively accomplished. Its embodiments have been shown and described for the purposes of illustrating the functional and structural principles of the present invention and are subject to change without departure from such principles. Therefore, this invention includes all modifications encompassed within the spirit and scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
201310287683.9 | Jul 2013 | CN | national |
This is a U.S. National Stage under 35 U.S.C. 371 of the International Application PCT/CN2013/081309, filed Aug. 12, 2013, which claims priority under 35 U.S.C. 119(a-d) to CN 201310287683.9, filed Jul. 09, 2013.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2013/081309 | 8/12/2013 | WO | 00 |