The longest common subsequence problem is the problem of finding the longest subsequence common to all sequences in a set of sequences (at least two but possibly more sequences, each a “component sequence”). It differs from problems of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences.
A common subsequence of two or more sequences, each consisting of one or more items, is defined as a sequence of items that appears, in the same order, in each of the component sequences. The longest common subsequence is defined as the set of one or more common subsequences that have the greatest length. The numerous practical applications for, and desirability of efficiently deriving, a longest common subsequence are well documented in the literature.
However, a need has arisen for means for obtaining not only the longest common subsequence, but also the set of one or more common subsequences. A need has also arisen for means for obtaining the set of one or more common subsequences that are of at least a certain minimum length. A need has also arisen for means for obtaining the set of one or more common subsequences that are of at least a certain minimum density. A need has also arisen for means for obtaining the set of one or more common subsequences that are of at least a certain minimum length and a certain minimum density.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential characteristics of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
One example embodiment includes a method of finding common subsequences in a set of two or more component sequences. The method includes obtaining two or more component sequences and identifying the location(s) of one or more distinct items that occur at least once within each of the two or more component sequences. The method also includes placing the location(s) within each component sequence of each commonly-occurring distinct item in a location n-tuple and storing each location n-tuple in a location n-tuple container. The method further includes sorting the entries in the location n-tuple container and placing each of the location n-tuples in the location n-tuple container into a tier in a tier set. The method additionally includes obtaining any desired information regarding common subsequences.
Another example embodiment includes a method of finding common subsequences in a set of two or more component sequences. The method includes obtaining two or more component sequences and identifying the location(s) of one or more distinct items that occur at least once within each of the two or more component sequences. Identifying the location(s) of one or more distinct items that occur at least once within each of the two or more component sequences includes iteratively identifying each item within the component sequence and placing a new entry for the item in a location index associated with the component sequence when the item has not been encountered previously in the component sequence. Identifying the location(s) of one or more distinct items that occur at least once within each of the two or more component sequences also includes adding the current location of the item to an existing entry for the item in a location index associated with the component sequence when the item has been encountered previously in the component sequence. The method also includes adding one or more location indexes associated with one or more component sequences to a location index set and using the location index set to identify the locations of one or more distinct items that occur at least once within each of the two or more component sequences. The method moreover includes placing the location(s) within each component sequence of each commonly-occurring distinct item in a location n-tuple and storing each location n-tuple in a location n-tuple container. The method further includes sorting the entries in the location n-tuple container and placing each of the location n-tuples in the location n-tuple container into a tier in a tier set. The method additionally includes obtaining any desired information regarding common subsequences.
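The indexing and n-tuple generation steps described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names are hypothetical, positions are assumed to be 1-based, and plain lexicographic ordering is assumed for the sort of the location n-tuple container.

```python
from collections import defaultdict
from itertools import product


def build_location_index(sequence):
    """Map each distinct item to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, item in enumerate(sequence, start=1):  # 1-based (assumption)
        # First sighting creates the entry; later sightings append to it.
        index[item].append(position)
    return dict(index)


def location_ntuples(sequences):
    """Return the sorted location n-tuples of items common to every sequence."""
    # One location index per component sequence forms the location index set.
    indexes = [build_location_index(seq) for seq in sequences]
    common_items = set(indexes[0]).intersection(*indexes[1:])
    container = []
    for item in common_items:
        # One n-tuple per combination of the item's locations across sequences.
        container.extend(product(*(index[item] for index in indexes)))
    return sorted(container)  # lexicographic order (assumption)
```

For instance, `location_ntuples(["ABCA", "BAC"])` produces two n-tuples for the item A, because A occurs twice in the first sequence and once in the second.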
Another example embodiment includes a method of placing a location n-tuple into a tier in a tier set. When the tier set is empty, the method includes creating a new tier, placing the location n-tuple into the newly-created tier and adding the newly-created tier to the tier set. When the tier set is not empty, the method includes determining the correct tier for the location n-tuple and placing the location n-tuple into the correct tier.
These and other objects and features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.
To further clarify various aspects of some example embodiments of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It is appreciated that these drawings depict only illustrated embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
Reference will now be made to the figures wherein like structures will be provided with like reference designations. It is understood that the figures are diagrammatic and schematic representations of some embodiments of the invention, and are not limiting of the present invention, nor are they necessarily drawn to scale.
A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. For example, the sequence {A, B, D} is a subsequence of {A, B, C, D, E, F}. A subsequence should not be confused with a substring, which is a refinement of the definition of subsequence that includes the additional requirement that elements in the substring must occupy consecutive positions within the underlying string. For example, {A, B, C, D} is a substring of the string {A, B, C, D, E, F}.
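The distinction between a subsequence and a substring can be checked mechanically. The following sketch uses hypothetical helper names and is offered only to illustrate the two definitions above.

```python
def is_subsequence(candidate, sequence):
    """True if candidate can be derived from sequence by deleting items."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match,
    # so items must be found in their original order.
    return all(item in remaining for item in candidate)


def is_substring(candidate, sequence):
    """True if candidate occupies consecutive positions within sequence."""
    n = len(candidate)
    return any(list(sequence[i:i + n]) == list(candidate)
               for i in range(len(sequence) - n + 1))
```

With the sequences from the text, `is_subsequence("ABD", "ABCDEF")` holds while `is_substring("ABD", "ABCDEF")` does not, and `is_substring("ABCD", "ABCDEF")` holds.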
For example, the length of the longest common subsequence is equal to the number of tiers created and may be obtained if desired. E.g., if 5 tiers have been created then the longest common subsequence is exactly five items long. The actual location n-tuples within the tiers are irrelevant to the length determination. As noted above, if the length of the longest common subsequence is less than a desired minimum length then no minimum length common subsequence can exist.
In addition, if a common subsequence (or set of common subsequences if more than one) is desired it can be recovered from the tier set. Each potential common subsequence must include precisely one location n-tuple from each of one or more tiers such that the location n-tuple from each tier is unambiguously smaller than the location n-tuple from each subsequently-created tier if any (the “increasing order requirement”). That is, the location n-tuple from tier[0] is unambiguously smaller than the location n-tuple from tier[1] and the location n-tuple from tier[1] is unambiguously smaller than the location n-tuple from tier[2], and so forth for each tier. Moreover, the total number of potential common subsequences that may be identified among any set of tiers is equal to the product of the number of location n-tuples in each such tier (e.g., if there are three tiers and if tier[0] contains 2 location n-tuples, tier[1] contains 3 location n-tuples and tier[2] contains 1 location n-tuple then the total number of potential common subsequences is 2*3*1=6). One of skill in the art will appreciate that potential common subsequences may include location n-tuples from non-sequential tiers. For example, if seven tiers have been created, a potential common subsequence may be identified by selecting precisely one location n-tuple from each of the following tiers: tier[0], tier[1], tier[3], tier[5] and tier[6]. Thus, each potential common subsequence can be identified and examined to ensure that it satisfies the increasing order requirement, eliminating any that do not and thus leaving only valid common subsequences. In addition, any duplicate common subsequences may be eliminated.
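The recovery procedure just described can be sketched as below. The sketch assumes that one location n-tuple is "unambiguously smaller" than another when it is strictly smaller in every component sequence; the function names and the `min_length` parameter are hypothetical and introduced only for illustration.

```python
from itertools import combinations, product


def unambiguously_smaller(a, b):
    """Assumed meaning: a precedes b in every component sequence."""
    return all(x < y for x, y in zip(a, b))


def valid_common_subsequences(tiers, min_length=1):
    """Yield chains of n-tuples that satisfy the increasing order requirement."""
    for size in range(min_length, len(tiers) + 1):
        # Chosen tiers keep their creation order but need not be adjacent.
        for chosen in combinations(tiers, size):
            # One n-tuple from each chosen tier forms a potential subsequence.
            for candidate in product(*chosen):
                # Checking neighbours suffices because the strict
                # component-wise ordering is transitive.
                if all(unambiguously_smaller(a, b)
                       for a, b in zip(candidate, candidate[1:])):
                    yield candidate
```

Passing `min_length=len(tiers)` restricts the output to chains that use every tier. Duplicate elimination is not shown.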
Further, if the longest common subsequence set is desired then the same method as above can be used except that only any common subsequences that include precisely one location n-tuple from each tier need be identified and/or recreated.
Further, if the minimum length common subsequence set is desired then the same method as above can be used except that only common subsequences at or above the minimum length need be identified and/or recreated. For example, if 7 tiers have been created and the minimum desired subsequence length is 5 items then only common subsequences which span at least 5 tiers need be identified and/or recreated.
Further, if the minimum density common subsequence set is desired then the same method as above can be used except that only common subsequences which are at or above the minimum density need be identified and/or recreated. The density of a common subsequence is defined as the length of the common subsequence divided by the longest distance between items (including the first and last item) in any component sequence. That is, density = LCS/D = LCS/(IBFL + 2) = LCS/(PLI − PFI + 1), where LCS is the length of the common subsequence, D is the longest distance between items (including the first and last item) in any component sequence, IBFL is the number of items between the first item and the last item, PLI is the position of the last item and PFI is the position of the first item. For example, if the length of the common subsequence is five items and in one component sequence the first item is at position 4 and the last item is at position 15 then the distance between items is 12 and the number of items between the first item and the last item is 10. Therefore, the density = 5/12 = 5/(10 + 2) ≈ 0.42.
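The density calculation can be illustrated with a short sketch. It assumes a common subsequence is represented as a list of location n-tuples ordered from first item to last; the function name is hypothetical.

```python
def density(chain):
    """Length of the chain divided by its longest span in any component sequence."""
    first, last = chain[0], chain[-1]
    # Span within one component sequence is PLI - PFI + 1,
    # which counts both the first and the last item.
    longest_span = max(pli - pfi + 1 for pfi, pli in zip(first, last))
    return len(chain) / longest_span
```

For a five-item chain whose first component runs from position 4 to position 15, the longest span is 12 and the function returns 5/12, in agreement with the example above.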
Finally, if the minimum length, minimum density common subsequence set is desired then the same method as above can be used except that only common subsequences which are at or above both the minimum length and the minimum density need be identified and/or recreated.
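A combined length and density test might be sketched as a single predicate applied to each candidate. The name `meets_thresholds` and its parameters are hypothetical, and a candidate is assumed to be a non-empty list of location n-tuples ordered from first item to last.

```python
def meets_thresholds(chain, min_length=1, min_density=0.0):
    """True if chain is at or above both the minimum length and minimum density."""
    if len(chain) < min_length:
        return False
    first, last = chain[0], chain[-1]
    # Density uses the longest span (PLI - PFI + 1) over all component sequences,
    # so a single sparse component sequence is enough to reject the chain.
    longest_span = max(pli - pfi + 1 for pfi, pli in zip(first, last))
    return len(chain) / longest_span >= min_density
```

A four-item chain that spans positions 1 through 10 in one component sequence has density 4/10 = 0.4, so it fails a minimum density of 0.5 regardless of how compact it is in the other component sequences.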
Continuing the above example, if the location n-tuple to be placed is compared to a first location n-tuple in tier[2] but the first location n-tuple is not unambiguously smaller then comparisons continue. If the location n-tuple to be placed is then compared to a second location n-tuple in tier[2] and the second location n-tuple is unambiguously smaller then a new tier (tier[3]) is created, the location n-tuple to be placed is placed in tier[3], tier[3] is added to the tier set and comparisons cease. However, if none of the location n-tuples in tier[2] are unambiguously smaller than the location n-tuple to be placed then the location n-tuple to be placed is compared to the location n-tuples in tier[1] (and if then any tier[1] location n-tuple is found to be unambiguously smaller than the location n-tuple to be placed then the location n-tuple to be placed is placed in tier[2] and comparisons cease). If no tier contains a location n-tuple that is unambiguously smaller than the location n-tuple to be placed then the location n-tuple to be placed is placed in the first-created tier (tier[0] in the above example). The method may be repeated until all location n-tuples in the location n-tuple container have been placed into the tier set.
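The placement procedure can be sketched as follows. This is one possible reading of the steps above, not a definitive implementation: the function names are hypothetical, the tier set is modeled as a list of lists, and "unambiguously smaller" is assumed to mean strictly smaller in every component sequence.

```python
def unambiguously_smaller(a, b):
    """Assumed meaning: a precedes b in every component sequence."""
    return all(x < y for x, y in zip(a, b))


def place(tiers, ntuple):
    """Place ntuple one tier above the newest tier holding a smaller n-tuple."""
    # Scan from the most recently created tier back toward tier[0].
    for level in range(len(tiers) - 1, -1, -1):
        if any(unambiguously_smaller(other, ntuple) for other in tiers[level]):
            target = level + 1
            break
    else:
        target = 0  # nothing smaller anywhere, or the tier set is empty
    if target == len(tiers):
        tiers.append([])  # create a new tier and add it to the tier set
    tiers[target].append(ntuple)


def build_tiers(sorted_ntuples):
    """Place every n-tuple from the sorted container into the tier set."""
    tiers = []
    for ntuple in sorted_ntuples:
        place(tiers, ntuple)
    return tiers
```

The number of tiers returned by `build_tiers` then gives the length of the longest common subsequence, as discussed earlier.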
The following example is provided for illustrative purposes only and without intent or effect to limit the scope of the invention. It does not purport to illustrate all of the steps (either required or optional) nor every sub-part of, nor state nor condition applicable to, those steps (either required or optional) illustrated.
Assume three Sequences, S1, S2 and S3 as follows:
These same Sequences may alternately be depicted as follows:
After a location index has been created (element 104 of
After location n-tuples have been generated for each possible combination of the locations within S1, S2 and S3 of each commonly-occurring distinct item and each such location n-tuple has been added to the location n-tuple container (element 114 of
Because the entries in the location n-tuple container are not already in location n-tuple sorted order, they must be sorted. After the entries in the location n-tuple container are sorted (element 116 of
The tier set is initially empty. After the first location n-tuple in the sorted location n-tuple container is placed (elements 302 and 304 of
The second location n-tuple in the sorted location n-tuple container is then placed. Because the first location n-tuple is not unambiguously smaller than the second (since the corresponding positions in S1 and S2 are the same), the second location n-tuple is placed in the same tier as the first (element 312 of
The third location n-tuple in the sorted location n-tuple container is then placed in the tier set. At this point, at least one entry (and, in fact, two) in the tier set is unambiguously smaller than the third location n-tuple and hence a most recently created tier containing an unambiguously smaller location n-tuple is identified (elements 306 and 308 of
The fourth and fifth location n-tuples in the sorted location n-tuple container are then placed (element 312 of
The sixth location n-tuple in the sorted location n-tuple container is then placed (element 310 of
After placement of the remaining location n-tuples in the sorted location n-tuple container, the tier set might be depicted as follows:
Because there are six tiers in the tier set, the length of the longest common subsequence of S1, S2 and S3 is equal to six. Notice also that the tier containing the location n-tuple {7, 6, 8} consists only of this one entry. Consequently, the item in the component sequences S1, S2 and S3 that is associated with this location n-tuple (I) is guaranteed to be included as part of the longest common subsequence. It is also guaranteed to be included as part of any common subsequence of length 4 or greater.
If the set of potential common subsequences is generated an example of a potential common subsequence that is a valid common subsequence is the following:
An example of a potential common subsequence that is not a valid common subsequence is the following:
This potential common subsequence does not satisfy the increasing order requirement because the location n-tuple {3, 1, 6} is not unambiguously smaller than the location n-tuple {6, 4, 5}.
If the set of valid longest common subsequences is generated the result might be depicted as follows:
An example of a potential longest common subsequence that is not a valid longest common subsequence is the following:
This potential longest common subsequence does not satisfy the increasing order requirement because the location n-tuple {4, 3, 1} is not unambiguously smaller than the location n-tuple {4, 3, 7}.
If the original sequence item longest common subsequence set is generated the result might be depicted as follows:
If the original sequence item longest common subsequence set is de-duplicated the result might be depicted as follows:
If the minimum length had been set to 5 and the set of potential minimum length common subsequences is generated an example of a valid minimum length common subsequence is the following:
An example of a potential minimum length common subsequence that is not a valid minimum length common subsequence is the following:
The length of this potential minimum length common subsequence does not equal or exceed the minimum length (5).
If the minimum density had been set to 0.5 and the set of potential minimum density common subsequences is generated an example of a valid minimum density common subsequence is the following:
An example of a potential minimum density common subsequence that is not a valid minimum density common subsequence is the following:
This potential minimum density common subsequence does not contain the requisite minimum density (0.5) with respect to sequence S3, for the following reason. The location in S3 associated with the first location n-tuple in this potential minimum density common subsequence is 2. The location in S3 associated with the last location n-tuple in this potential minimum density common subsequence is 6. The number of items between these two location n-tuples in S3 is 3. The length of this potential minimum density common subsequence (2) divided by the sum of 2 plus the number of items between (3) is equal to 0.4, which does not equal or exceed the minimum density (0.5). Thus, this potential minimum density common subsequence does not satisfy the minimum density requirement with respect to sequence S3 even though this potential minimum density common subsequence does satisfy the minimum density requirement with respect to sequences S1 and S2.
If the minimum length had been set to 5 and the minimum density had been set to 0.5 and the set of potential minimum length, minimum density common subsequences is generated an example of one valid minimum length, minimum density common subsequence is the following:
An example of a potential minimum length, minimum density common subsequence that is not a valid minimum length, minimum density common subsequence is the following:
The length of this potential minimum length, minimum density common subsequence (4) does not equal or exceed the requisite minimum length (5). It also does not contain the requisite minimum density (0.5) with respect to sequence S2, for the following reason. The location in S2 associated with the first location n-tuple in this potential minimum length, minimum density common subsequence is 1. The location in S2 associated with the last location n-tuple in this potential minimum length, minimum density common subsequence is 10. The number of items between these two location n-tuples in S2 is 8. The length of this potential minimum length, minimum density common subsequence (4) divided by the sum of 2 plus the number of items between (8) is equal to 0.4, which does not equal or exceed the minimum density (0.5). Thus, this potential minimum length, minimum density common subsequence does not meet the minimum density requirement with respect to sequence S2 even though this potential minimum length, minimum density common subsequence does satisfy the minimum density requirement with respect to sequences S1 and S3.
One skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, mobile phones, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
With reference to
The computer 420 may also include a magnetic hard disk drive 427 for reading from and writing to a magnetic hard disk 439, a magnetic disk drive 428 for reading from or writing to a removable magnetic disk 429, and an optical disc drive 430 for reading from or writing to a removable optical disc 431 such as a CD-ROM or other optical media. The magnetic hard disk drive 427, magnetic disk drive 428, and optical disc drive 430 are connected to the system bus 423 by a hard disk drive interface 432, a magnetic disk drive interface 433, and an optical drive interface 434, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-executable instructions, data structures, program modules and other data for the computer 420. Although the exemplary environment described herein employs a magnetic hard disk 439, a removable magnetic disk 429 and a removable optical disc 431, other types of computer-readable media for storing data can be used, including magnetic cassettes, flash memory cards, digital versatile discs, Bernoulli cartridges, RAMs, ROMs, and the like.
Program code means comprising one or more program modules may be stored on the hard disk 439, magnetic disk 429, optical disc 431, ROM 424 or RAM 425, including an operating system 435, one or more application programs 436, other program modules 437, and program data 438. A user may enter commands and information into the computer 420 through keyboard 440, pointing device 442, or other input devices (not shown), such as a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 421 through a serial port interface 446 coupled to system bus 423. Alternatively, the input devices may be connected by other interfaces, such as a parallel port, a game port or a universal serial bus (USB). A monitor 447 or another display device is also connected to system bus 423 via an interface, such as video adapter 448. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.
The computer 420 may operate in a networked environment using logical connections to one or more remote computers, such as remote computers 449a and 449b. Remote computers 449a and 449b may each be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically include many or all of the elements described above relative to the computer 420, although only memory storage devices 450a and 450b and their associated application programs 436a and 436b have been illustrated in
When used in a LAN networking environment, the computer 420 can be connected to the local network 451 through a network interface or adapter 453. When used in a WAN networking environment, the computer 420 may include a modem 454, a wireless link, or other means for establishing communications over the wide area network 452, such as the Internet. The modem 454, which may be internal or external, is connected to the system bus 423 via the serial port interface 446. In a networked environment, program modules depicted relative to the computer 420, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing communications over wide area network 452 may be used.
One skilled in the art will appreciate that, for this and other processes and methods disclosed herein, the functions performed in the processes and methods may be implemented in differing order. Furthermore, the outlined steps and operations are only provided as examples, and some of the steps and operations may be optional, combined into fewer steps and operations, or expanded into additional steps and operations without detracting from the essence of the disclosed embodiments.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
This application claims the benefit of and priority to U.S. Provisional Patent Application Ser. No. 62/073,128 filed on Oct. 31, 2014, U.S. Provisional Patent Application Ser. No. 62/083,842 filed on Nov. 24, 2014, and U.S. Provisional Patent Application Ser. No. 62/170,095 filed on Jun. 2, 2015, each of which applications is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62073128 | Oct 2014 | US
62083842 | Nov 2014 | US
62170095 | Jun 2015 | US