This invention relates to searching for and identifying strings in data.
Searching for a given string in large sets of data has conventionally been done by reading each set of data (or “record”) from data storage and transferring it to a server or host system, which searches each and every record, typically in sequence. If the search is to be conducted on a large number of magnetic tapes, the process can consume a great deal of time and computation. Magnetic tape is typically a high capacity data storage medium, and the data is typically compressed to increase the capacity further. For one magnetic tape drive and one server to read and then search an entire set of magnetic tape cartridges could be prohibitively time consuming. For example, it might take as much as 2 hours to mount, load and then completely read and search a tape cartridge, and thus 1000 tape cartridges would take 2000 hours, or just over 83 days. To reduce the time, multiple servers can be assigned to do the job in parallel. Another solution is to keep an index of the data as it is stored or catalogued. This works only so long as the index covers all the terms of interest, such that a server or host system can process the search against the index.
It has been suggested that, if the data were stored on hard disk drives for data mining, the hard disk drives would have low-level search intelligence, and the database application would break searches into individual commands, which would be sent simultaneously to all drives to conduct a direct search of the data. However, when the data is already stored on magnetic tape, substantial time is required to access, read and transfer the data from magnetic tape to the host and/or to hard disk drives, and, further, many searches are not so simple.
Logic, magnetic tape drives, and service methods are provided for searching data.
In one embodiment, a plurality of string comparison engines are configured to search data and to indicate matches to search terms; and an identification engine is configured to identify patterns of the matches indicated by selected string comparison engines.
In a further embodiment, the string comparison engines are configured to search a common set of data in parallel.
In a still further embodiment, at least one of the string comparison engines comprises at least one mask configured to modify specific search terms.
In another embodiment, at least one of the string comparison engines is configured to search the data on a byte-by-byte basis. In a further embodiment, at least one string comparison engine is configured to search the bytes of data employing a bit mask for each byte and a byte mask. In a still further embodiment, at least one string comparison engine is configured to search two consecutive bytes of the data in parallel.
In another embodiment, the identification engine comprises a Boolean look-up table.
In another embodiment, a magnetic tape drive comprises a tape drive system for moving a magnetic tape longitudinally; at least one read channel configured to read data recorded on a magnetic tape as the tape is moved longitudinally by the tape drive system; and a search engine configured to search data read by the read channel(s) and to identify matches of strings of data to search terms. In a further embodiment, the magnetic tape drive additionally comprises at least one decompressor configured to decompress the data read by the read channel(s); and the search engine is configured to search the decompressed data. The search engine may further comprise the embodiments of logic discussed above.
In another embodiment, a service method of searching data comprises searching a common set of data in parallel and indicating matches of strings of the data to search terms; and identifying patterns of selected matches.
In a further embodiment, the data is decompressed prior to searching, such that the searching comprises searching the decompressed data.
In a still further embodiment, the patterns are identified by looking-up patterns of selected matches in a Boolean look-up table.
In another embodiment, where the data is stored on a plurality of magnetic tape cartridges, data is read from magnetic tape cartridges in a plurality of magnetic tape drives; and the data is searched by the plurality of magnetic tape drives, indicating matches of strings, and the plurality of magnetic tape drives identify patterns of selected matches.
For a fuller understanding of the present invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings.
This invention is described in preferred embodiments in the following description with reference to the Figures, in which like numbers represent the same or similar elements. While this invention is described in terms of the best mode for achieving this invention's objectives, it will be appreciated by those skilled in the art that variations may be accomplished in view of these teachings without deviating from the spirit or scope of the invention.
Referring to
Referring to
A read/write system is provided for reading and writing information to the magnetic tape, and, for example, may comprise a read/write and servo head system 18 with a servo system for moving the head laterally of the magnetic tape 11, a read/write and servo control 19, and a drive motor system 20 which moves the magnetic tape 11 longitudinally between the cartridge reel 13 and the take up reel 16 and across the read/write and servo head system 18. The read/write and servo control 19 controls the operation of the drive motor system 20 to move the magnetic tape 11 across the read/write and servo head system 18 at a desired velocity, and, in one example, determines the location of the read/write and servo head system with respect to the magnetic tape 11. In one example, the read/write and servo head system 18 and read/write and servo control 19 employ servo signals on the magnetic tape 11 to determine the location of the read/write and servo head system, and in another example, the read/write and servo control 19 employs at least one of the reels, such as by means of a tachometer, to determine the location of the read/write and servo head system with respect to the magnetic tape 11. The read/write and servo head system 18 and read/write and servo control 19 may comprise one or more read channels and one or more write channels, and may comprise hardware and any suitable form of logic, including a processor operated by software, or microcode, or firmware, or may comprise hardware logic, or a combination.
A control system 24 communicates with the memory interface 17, and communicates with the read/write system, e.g., at read/write and servo control 19. The control system 24 may comprise any suitable form of logic, including a processor operated by software, or microcode, or firmware, or may comprise hardware logic, or a combination.
The illustrated and alternative embodiments of magnetic tape drives are known to those of skill in the art, including those which employ dual reel cartridges.
The control system 24 typically communicates with one or more host systems 25, and operates the magnetic tape drive 15 in accordance with commands originating at a host. Alternatively, the magnetic tape drive 15 may form part of a subsystem, such as a library, and may also receive and respond to commands from the subsystem.
In one embodiment of the present invention, a search engine 30 is configured to search data read by the read channel(s) 18, 19 and to identify matches of strings of data to search terms. The search engine 30 may comprise any suitable form of logic, including hardware logic, such as VLSI, a processor operated by software, or microcode, or firmware, or a combination. In a further embodiment, the magnetic tape drive additionally comprises at least one decompressor, for example, embodied in the read channel(s) 18, 19, configured to decompress the data read by the read channel(s); and the search engine 30 is configured to search the decompressed data.
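The arrangement of a decompressor feeding a search engine may be illustrated by a minimal software sketch. It assumes zlib as a stand-in for the drive's decompressor, which this description does not specify, and the function name search_compressed_stream is hypothetical. The sketch decompresses the data piece by piece as it would arrive from the read channel and searches the decompressed bytes, carrying over a short tail so that a string straddling two pieces is still found.

```python
import zlib


def search_compressed_stream(chunks, term):
    """Return offsets (in the decompressed data) where `term` occurs.

    `chunks` is an iterable of compressed byte blocks, as they might arrive
    from a read channel. zlib is only a stand-in for the drive's decompressor.
    `term` is assumed to be a non-empty bytes object.
    """
    decompressor = zlib.decompressobj()
    keep = len(term) - 1   # bytes to carry over between pieces
    tail = b""             # end of the previous piece, not yet ruled out
    base = 0               # decompressed offset of tail[0]
    hits = []

    def scan(piece):
        nonlocal tail, base
        window = tail + piece
        start = window.find(term)
        while start != -1:
            hits.append(base + start)
            start = window.find(term, start + 1)
        # Keep only the last len(term) - 1 bytes; a full match cannot fit there,
        # so no match is ever reported twice.
        cut = max(len(window) - keep, 0)
        base += cut
        tail = window[cut:]

    for chunk in chunks:
        scan(decompressor.decompress(chunk))
    scan(decompressor.flush())
    return hits
```

Only the match offsets leave the function, which mirrors the idea that the drive reports matches rather than transferring every record to the host.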
Referring additionally to
Having the magnetic tape drives conduct the searches of large databases of data stored on magnetic tape frees the host(s) for other work and places the searches in proximity to the databases. For example, the magnetic tape drives may be located in a library which houses the magnetic tape cartridges storing the database. Further, a number of magnetic tape drives can conduct the searches simultaneously. Both the proximity to the data and the number of magnetic tape drives in parallel allow the search to be conducted efficiently.
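As an illustration only, the division of work among several drives may be modeled in software. The sketch below assumes each cartridge is represented by a bytes object and each drive by a worker thread; the names drive_search and library_search are hypothetical and do not correspond to any drive or library command.

```python
from concurrent.futures import ThreadPoolExecutor


def drive_search(label, data, term):
    """Model of one drive: scan its own cartridge and report only the offsets."""
    offsets = []
    start = data.find(term)
    while start != -1:
        offsets.append(start)
        start = data.find(term, start + 1)
    return label, offsets


def library_search(cartridges, term, n_drives=4):
    """Search every cartridge, using up to `n_drives` modeled drives at once.

    `cartridges` maps a cartridge label to its (already decompressed) contents.
    """
    with ThreadPoolExecutor(max_workers=n_drives) as pool:
        futures = [
            pool.submit(drive_search, label, data, term)
            for label, data in cartridges.items()
        ]
        return dict(future.result() for future in futures)
```

The host in this model receives one short list of offsets per cartridge instead of the cartridge contents.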
An embodiment of a search engine 30 in accordance with the present invention is illustrated in
An embodiment of a string comparison engine (e.g. engine 31) is illustrated in
Bit and byte masks 61 may be applied to modify specific search terms 51. The masks and search terms for each consecutive set of two bytes are applied at inputs 93A-93P to the comparison blocks 92A-92P. Examples of masks will be discussed subsequently.
The bytes to be searched for are compared in the string comparison blocks. The current byte is compared to the older byte, as byte 90, and the previous byte is compared to the newer byte, as byte 91. A match of both bytes results in a carry out of the first or second comparison blocks, and, in subsequent comparison blocks, a match of both bytes and the match carry in results in a carry out to the comparison block two blocks to the right. A first comparison block 94 only compares the first byte of the string to be matched to the newer byte of the incoming string. This allows a match to start at the second of the two bytes.
In the example, each comparison block works independently. For example, if the string to match is “THTHE” and the incoming bytes are “TH”, then, assuming that there was no match to the previous bytes, the only match will be at the first comparison block, since the carry in to the other comparison blocks will be off. When the second “TH” arrives, there will be matches in the first and third comparison blocks. This allows strings to be continually matched no matter where in the sequence the characters are input.
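The carry behavior described above may be modeled in software. The following is a sketch under stated assumptions, not the hardware itself: it processes the data two bytes per step, treats block k as comparing string bytes k and k+1 to the older and newer incoming bytes, passes the carry two blocks to the right, and models the first comparison block as one that compares only the first string byte to the newer byte. Masks are omitted, and the function name is hypothetical.

```python
def search_two_bytes_per_cycle(data, term):
    """Return start offsets of `term` in `data`, consuming two bytes per step.

    carry[k] is True when term[:k] matched the bytes just before the current
    pair, i.e. the carry in to the block that compares term[k] and term[k+1].
    `term` is assumed to be a non-empty bytes object.
    """
    n = len(term)
    carry = [False] * (n + 1)
    ends = []  # offsets just past each match
    for pos in range(0, len(data), 2):
        older = data[pos]
        newer = data[pos + 1] if pos + 1 < len(data) else None
        nxt = [False] * (n + 1)
        for k in range(n):
            carry_in = (k == 0) or carry[k]   # block 0 may always start a match
            if not carry_in or term[k] != older:
                continue
            if k + 1 == n:
                ends.append(pos + 1)          # match through the first byte only
            elif newer is not None and term[k + 1] == newer:
                if k + 2 == n:
                    ends.append(pos + 2)      # match through both bytes
                else:
                    nxt[k + 2] = True         # carry out, two blocks to the right
        # First-byte-only block: lets a match start at the newer byte.
        if newer is not None and term[0] == newer:
            if n == 1:
                ends.append(pos + 2)
            else:
                nxt[1] = True
        carry = nxt
    return sorted(end - n for end in ends)
```

With the string “THTHE” and the data “XTHTHTHE”, the model reports a single match starting at offset 3, even though the match begins at the second byte of a pair.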
In the example of
As an example, the bit mask may comprise an 8 bit value that could apply to all bytes in the string. This allows for case independent searches. Where there is a “1”, the bit must match. Where there is a “0”, this is a “don't care” condition. For example:
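A “11011111” bit mask ignores the one bit that distinguishes upper case from lower case letters in ASCII, so a letter in the search term matches in either case.
A “10111111” bit mask does the same for EBCDIC, where the case of a letter is distinguished by a different bit.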
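The effect of the bit mask may be sketched as a masked comparison: the data byte and the search term byte are compared only in the bit positions where the mask holds a “1”. The function name masked_equal is hypothetical, and Python's cp037 codec is used here as one representative EBCDIC code page.

```python
ASCII_CASE_MASK = 0b11011111    # "don't care" on the ASCII case bit
EBCDIC_CASE_MASK = 0b10111111   # "don't care" on the EBCDIC case bit


def masked_equal(data_byte, term_byte, bit_mask):
    """True when the two bytes agree in every bit position where the mask is 1."""
    return ((data_byte ^ term_byte) & bit_mask) == 0


# ASCII: 'a' (0x61) and 'A' (0x41) differ only in the masked-out bit.
ascii_same_letter = masked_equal(ord("a"), ord("A"), ASCII_CASE_MASK)

# EBCDIC (code page 037): 'a' is 0x81 and 'A' is 0xC1.
ebcdic_a = "a".encode("cp037")[0]
ebcdic_A = "A".encode("cp037")[0]
ebcdic_same_letter = masked_equal(ebcdic_a, ebcdic_A, EBCDIC_CASE_MASK)
```

Because the mask applies to every byte value, it also equates non-letter pairs that differ only in the masked bit, such as “@” and “`” in ASCII.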
The byte mask, for example, is a 2 bit field for each byte in the string. The two bits may be encoded in the following manner:
“11”—Byte must match exactly, bit for bit.
“10”—Byte must match, but based on the bit mask for this string.
“01”—byte must exist in this location, but its value is a “don't care”. This is as though the bit mask were all zeros.
“00”—Not a valid byte in this position. Used when the string to search for contains fewer bytes than the maximum length search string. Note that bytes cannot be skipped, as the carry in will not propagate. Therefore, this byte mask will signal the end of a search.
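The four byte mask codes may be sketched as follows. This is a software illustration under assumptions: the constant and function names are hypothetical, the search term is checked at one fixed offset rather than against a stream, and a “00” code is taken to end the term so that everything before it decides the match.

```python
EXACT = 0b11         # byte must match exactly, bit for bit
USE_BIT_MASK = 0b10  # byte must match under the string's bit mask
ANY_BYTE = 0b01      # a byte must be present, value is "don't care"
NOT_VALID = 0b00     # no byte in this position: end of the search term


def byte_matches(data_byte, term_byte, byte_mask, bit_mask):
    """Apply one 2-bit byte mask code to a single byte position."""
    if byte_mask == EXACT:
        return data_byte == term_byte
    if byte_mask == USE_BIT_MASK:
        return ((data_byte ^ term_byte) & bit_mask) == 0
    if byte_mask == ANY_BYTE:
        return True
    return False  # NOT_VALID never matches a byte


def term_matches_at(data, offset, term, byte_masks, bit_mask):
    """Check a fixed-length term slot against `data` starting at `offset`."""
    for i, (term_byte, byte_mask) in enumerate(zip(term, byte_masks)):
        if byte_mask == NOT_VALID:
            return True   # end of term reached; all earlier bytes matched
        if offset + i >= len(data):
            return False  # a byte is required here but the data has ended
        if not byte_matches(data[offset + i], term_byte, byte_mask, bit_mask):
            return False
    return True
```

For example, a four-byte slot holding “the” followed by an unused position could use the codes “10”, “01”, “11”, “00” to accept “T”, any byte, and then exactly “e”.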
An example of the equations for VLSI logic used to match the strings:
There are two cases for determining the match within the comparison block. In the first case, there is a carry in and the first byte matches, but not the second. In the second case, there is a carry in and both bytes match. In the first case, the carry out will not propagate, but if the second byte was not a valid byte to search for, the match could occur here:
MatchGTEQ1<=Str0EQ AND NOT(Str1EQ) AND carryin; --match thru the first byte.
MatchGTEQ2<=Str0EQ AND Str1EQ AND carryin; --match thru both bytes.
We can determine if there was a match by also using the flag from the next byte to determine if it was valid, as the flag from the older byte box of comparison block 92A to the newer byte comparison block 94 of
The carry out to the next comparison block is the latched version of MatchGTEQ2. This signifies both bytes matched and the carry in was active.
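The two match equations and the carry out may be mirrored in software for checking against a truth table. This sketch models a single comparison block for a single cycle; the latch is represented simply by returning the value that would be presented as carry in on the following cycle, and the function and variable names are hypothetical. The derivation of the overall string match from the valid-byte flags is not modeled.

```python
def comparison_block(str0_eq, str1_eq, carry_in):
    """Evaluate one comparison block for one cycle.

    Returns (match_thru_first, match_thru_both, carry_out_next_cycle).
    """
    match_thru_first = str0_eq and (not str1_eq) and carry_in
    match_thru_both = str0_eq and str1_eq and carry_in
    # The carry out is the latched "both bytes matched" signal, so it becomes
    # visible to the block two positions to the right on the next cycle.
    carry_out_next_cycle = match_thru_both
    return match_thru_first, match_thru_both, carry_out_next_cycle
```

The two match signals are mutually exclusive, and neither can be active without a carry in.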
If any strmatch from any of the comparison blocks is active, then the string match for the overall string is set. These string matches go into the identification engine 40, 42 of
In the example of
The Boolean look up table 42 is able to perform complex pattern matching. This is a table that contains 2**N bits, where N is the number of strings that can be searched for. In the instant example, there is a maximum of 8 strings that can be searched for, so the table is 2**8, or 256 bits. Each location of the table can be envisioned as being encoded by 8 bits. Thus, location 0 is “00000000”, and location 3 is “00000011”.
To create a Boolean equation, for example, of:
(str1 AND str2) OR (str3 AND str4),
to determine if there are any matches, the look up table 42 is filled with a “1” in each location where both bits 1 and 2 are a “1” and also with a “1” where both bits 3 and 4 are a “1”. Thus, in this case to match str1 AND str2, location 3, 7, 11, etc. will all be filled with a “1”. And to match str3 AND str4, locations 12 thru 15, 28-31, etc. would be filled with a “1”.
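Filling the table may be sketched in software as evaluating the Boolean equation once for every location. The sketch assumes that str1 corresponds to the least significant bit of the location, which is consistent with str1 AND str2 selecting location 3; the function name build_lookup_table is hypothetical.

```python
def build_lookup_table(equation, n_strings=8):
    """Return a list of 2**n_strings bits, one per combination of string matches.

    `equation` receives a list `s` in which s[1]..s[n_strings] are booleans
    for str1..strN (s[0] is unused so the indices follow the text).
    """
    table = []
    for location in range(2 ** n_strings):
        s = [None] + [bool((location >> i) & 1) for i in range(n_strings)]
        table.append(1 if equation(s) else 0)
    return table


# (str1 AND str2) OR (str3 AND str4)
table = build_lookup_table(lambda s: (s[1] and s[2]) or (s[3] and s[4]))
ones = [location for location, bit in enumerate(table) if bit]
```

For this equation the list of filled locations begins 3, 7, 11, 12, 13, 14, 15, and 112 of the 256 locations hold a “1”.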
Now the strmatch bits from the outputs 71-78 of each string comparison engine 61-68 are decoded by decoder 40. This decoded value is used as an index into the Boolean look up table 42. If that location contains a “1”, then the Boolean equation has been satisfied.
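The decode and look up step may be sketched as packing the eight strmatch bits into an index and reading one bit from the table. The sketch assumes str1 is the least significant bit of the index; the names decode and equation_satisfied are hypothetical, and the table is filled inline for the equation (str1 AND str2) OR (str3 AND str4).

```python
def decode(strmatch):
    """Pack strmatch flags (strmatch[0] is str1) into a table index."""
    index = 0
    for i, hit in enumerate(strmatch):
        if hit:
            index |= 1 << i
    return index


def equation_satisfied(table, strmatch):
    """True when the table holds a 1 at the location selected by the matches."""
    return table[decode(strmatch)] == 1


# Table for (str1 AND str2) OR (str3 AND str4), str1 = least significant bit.
TABLE = [
    1 if (loc & 0b0011) == 0b0011 or (loc & 0b1100) == 0b1100 else 0
    for loc in range(256)
]

# str1 and str2 matched: index 3, equation satisfied.
both_first = equation_satisfied(TABLE, [True, True, False, False, False, False, False, False])
# str1 and str3 matched: index 5, equation not satisfied.
mixed = equation_satisfied(TABLE, [True, False, True, False, False, False, False, False])
```

A single table read thus replaces the evaluation of the Boolean equation, whatever its complexity.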
As an example, suppose that the following Boolean equation is to be searched for:
(str1 AND NOT str2) OR (NOT str4 AND NOT str5),
A “1” would be entered in each location where bit 1 is a “1” and bit 2 is a “0”; and a “1” in each location where bit 4 is a “0” and bit 5 is a “0”. So for str1 AND NOT str2, any location of the form “xxxxxx01” in the 256 bit look up table would contain a “1”; and for NOT str4 AND NOT str5, any location of the form “xxx00xxx” would contain a “1”.
A service method in accordance with an embodiment of the present invention is depicted by the flow chart of
Those of skill in the art will understand that changes may be made with respect to the method and operation of the described and the illustrated components. Further, those of skill in the art will understand that differing specific component arrangements may be employed than those illustrated herein.
While the preferred embodiments of the present invention have been illustrated in detail, it should be apparent that modifications and adaptations to those embodiments may occur to one skilled in the art without departing from the scope of the present invention as set forth in the following claims.