This disclosure relates generally to the detection of groups of interest. More particularly, this disclosure relates to a system and method for the detection of groups of interest from travel data.
Many human activities among multiple individuals require coordination in various forms. In some cases, direct face-to-face coordination is needed among a group of people. This coordination may be required repeatedly over the course of time. Assuming an adversary organization is engaging in such coordinated activity, it seems reasonable to speculate that patterns resulting from such repeated coordination could be detectable and hence would allow for the discovery of adversarial groups. One example of where such group of interest detection is desirable is with detecting coordinated group activity based on travel data. It is assumed that travel data in the form of traveler ID, destination and travel date are available for a large set of N people (e.g. flight records for many airlines).
However, the methods and systems currently available for such detection are inadequate in several respects. Many systems are too parameter-constrained to detect groups of interest effectively. Others are resource-limited, and their detection methodology is slow and inaccurate, producing a high rate of false positives. In one aspect, the problem of using travel data to detect groups of people who travel to a common destination within a time interval (i.e., co-travel) multiple times is equivalent to detecting complete bipartite sub-graphs in a bipartite graph, a problem known to be NP-complete. A number of approaches have been attempted in the industry, none achieving a level of reliability sufficient for responsible use. One particular problem needing attention is the detection of groups of travelers who co-travel with each other K times (but not necessarily all at the same time and same location); this is equivalent to clique detection (albeit on a smaller graph), another known NP-complete problem.
Accordingly, there exists a long felt need for an improved system and method for the detection of groups of interest from travel data and/or other types of data that alleviates the inherent problems known in the systems and methods for group detection currently being employed in the various industries today.
According to one embodiment of the present disclosure applied to travel data, a system for detecting a group of interest (GoI) based on a suspect traveler and a co-travel count threshold is presented. The system comprises a database of traveler names, each having respective destinations and corresponding travel dates, and a detection module in communication with said database. The detection module is operable to search the database to determine traveler names having a co-travel count based on the suspect traveler. From the traveler names having a co-travel count, the detection module is operable to form a co-travel group based on the traveler names having respective co-travel counts greater than or equal to the co-travel count threshold. From the co-travel group, the detection module is operable to determine co-travel within said co-travel group. The detection module is then further operable to identify cliques within the co-travel group based on the co-travel. From the cliques so identified, the detection module determines the maximal clique to thereby detect the GoI.
Accordingly, some embodiments of the disclosure may provide numerous technical advantages. Some embodiments may benefit from some, none or all of these advantages. For example, a potential technical advantage of one embodiment of the disclosure may be an improved and more efficient system and method for detecting groups of interest in information that requires fewer computational resources and less time. Another potential technical advantage of one embodiment of the disclosure is that it may provide for an improved system and method for detecting groups of interest in information having more reliable and consistent detection results.
Another example of a potential technical advantage of one embodiment of the present disclosure is that it may alleviate problems associated with false positive detections or otherwise false candidate counts, that is, detected groups of interest that include some individuals who are not truly members. Many current group detection systems simply live with these false detections and spend additional resources identifying and removing them.
Although specific advantages have been disclosed hereinabove, it will be understood that various embodiments may include all, some, or none of the disclosed advantages. Additionally, other technical advantages not specifically cited may become apparent to one of ordinary skill in the art following review of the ensuing drawings and their associated detailed description. The foregoing has outlined rather broadly some of the more pertinent and important advantages of the present disclosure in order that the detailed description of the disclosure that follows may be better understood so that the present contribution to the art can be more fully appreciated. It should be appreciated by those skilled in the art that the conception and the specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the present disclosure as set forth in the appended claims.
For a fuller understanding of the nature and possible advantages of the present disclosure, reference should be had to the following detailed description taken in connection with the accompanying drawings in which:
Similar reference characters refer to similar parts throughout the several views of the drawings.
In referring now to
In the particular embodiment of
The database 110, in the particular embodiment of
The processing system 132, in the embodiment of
The database 110, in one embodiment, is preferably comprised of various travel information including traveler names with respective destinations and corresponding travel dates. However, it should be understood by those skilled in the art that database 110 may be implemented having any number of additional types of information, some of which may not be related to travel. For example, in an alternative embodiment of a more general nature, database 110 may simply be comprised of a plurality of entries, each having one or more attributes and pertaining to any number of other types of information. The detection module 120 is in communication with the database 110 via the network 130 and is operable to perform a series of steps to detect a group of interest based on an established co-travel count threshold and a selected suspect traveler.
In referring now to
Being an important parameter to system 100, the co-travel count threshold must first be established. This may be accomplished by way of running simulations of test cases with known groups, looking at the false positive results, and then choosing the best fit value for the co-travel count threshold that produces an acceptable false positive rate. However, in order to accomplish this, a random travel model must first be created and then tested by running simulations for a small group size and a minimum number of meetings. In an alternative embodiment, this may be accomplished similarly by choosing the best fit value for an attribute count threshold that produces an acceptable false positive rate. It should be understood by one skilled in the art that, when dealing with other types of information, random models for such other types of information apply equally and can likewise be created.
For a random travel model, assume a population of “N” travelers can travel to any of “L” destinations. In each of “T” time intervals, a total of “F” flights occur, each of which travels to one of the “L” locations chosen uniformly at random. Each flight contains “Nf” passengers selected uniformly at random (with replacement) from the traveler population. Against this general background of uniform random travel we desire to detect a group of interest (“GoI”) defined as follows: 1) “m-k-Group of Interest” (m-k-GoI) means a set of travelers of size m that has co-traveled at least k times; 2) “Co-travel event” means that a group co-travels when every member of the group arrives at the same destination in the same time interval; and 3) “Weak k-co-travel event” means that a group weakly k-co-travels when every member of the group k-co-travels with every other member of the group (not necessarily at the same time and location).
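The random travel model described above can be illustrated with a minimal Python sketch. The function name `simulate_random_travel`, the record layout (traveler, destination, interval) and the use of integer identifiers are assumptions made for illustration only and are not part of the disclosure.

```python
import random

def simulate_random_travel(N, L, T, F, Nf, seed=0):
    """Generate background travel records under the uniform random model.

    N: population size, L: number of destinations, T: time intervals,
    F: flights per interval, Nf: passengers per flight.
    Returns a list of (traveler, destination, interval) tuples
    (hypothetical record layout).
    """
    rng = random.Random(seed)
    records = []
    for interval in range(T):
        for _ in range(F):
            destination = rng.randrange(L)      # destination chosen uniformly
            for _ in range(Nf):
                traveler = rng.randrange(N)     # uniform, with replacement
                records.append((traveler, destination, interval))
    return records

records = simulate_random_travel(N=1000, L=10, T=5, F=4, Nf=20)
```

Each call produces T x F x Nf records, so the small example above yields 400 background records against which planted groups could later be compared.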
The variables and events of the model are presented in Table 1 and Table 2.
The above definition of “GoI” hinges on the co-travel concept. Clearly, in order to reliably detect a “GoI”, it is necessary to reliably distinguish genuine “GoIs” from “GoIs” resulting from random chance. To accomplish this, a confidence threshold must be determined. Our starting point is the co-travel event, as this represents the atomic event of analysis, indicating association between a set of travelers.
Distribution of gc
The probability of group co-travel is the joint probability that all group members travel in a given time interval to the same destination. Assuming independence between travelers, and using the definitions given above, the probability of co-travel of a group “g” in a time interval “i” can be expressed as:
The probability that traveler “i” travels within a unit time interval to location “l” as dictated by our uniform distribution is given by:
Thus Equation (1) can be expressed as:
Define ratio “r”:
This ratio represents the proportion of total travelers that travel in a unit interval. Using this definition, Equation (2) becomes:
Equation (3) represents the probability of group co-travel. The probability of co-travel k times in T unit intervals is determined by the binomial distribution, where Equation (3) gives the probability of success in a single interval. Under this distribution, the probability of k co-travel events is given by:
The expected number of successes out of T trials for a binomial distribution with success probability “p” is given by Tp. Thus, the expected number of times group “g” co-travels is given by:
The probability of a group “g” k-co-traveling is the probability of this group co-traveling k or more times. Using Equation (4), the probability of this event can be expressed as:
or, equivalently:
The single-group k-co-travel probability, Equation (6), is a constant for all groups of a given size “m”. Thus, the probability of “n” groups of size “m” k-co-traveling can be expressed as:
is the number of groups of size “m”. From Equation (7) it follows that the expected number of k-co-traveling groups of size “m” is given by:
$$E[c_m] = N_m \, p_{|g|,k} \qquad \text{Equation (8)}$$
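As a numerical illustration of Equations (6) and (8), the following Python sketch evaluates a binomial tail probability and multiplies it by the number of groups of size m. It assumes that the single-interval co-travel probability `p` is supplied by the caller and that the number of groups of size m is the binomial coefficient C(N, m); the function names are hypothetical.

```python
from math import comb

def k_co_travel_probability(p, T, k):
    """Probability of at least k successes in T independent intervals,
    each with per-interval group co-travel probability p (binomial tail)."""
    return sum(comb(T, j) * p**j * (1 - p)**(T - j) for j in range(k, T + 1))

def expected_group_count(N, m, p, T, k):
    """Expected number of k-co-traveling groups of size m.

    Assumes the number of groups of size m is C(N, m), so that
    E[c_m] = C(N, m) * P(group k-co-travels).
    """
    return comb(N, m) * k_co_travel_probability(p, T, k)
```

For instance, `expected_group_count(N, 2, p, T, k)` gives the expected number of k-co-traveling pairs, which is the quantity denoted E[c2] in the text.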
Based upon Equation (8), the expected number of k-co-traveling groups can be bounded as follows:
Substituting values into Equation (9), we have:
Defining “pk,m” as follows:
then an upper bound for “pk,m” can be expressed as:
and thus:
Using the definition of “pk,m” above, the ratio of “E[cm+1] to E[cm]” can be expressed as:
Using Equation (10), Equation (11) can be expressed as:
From Equation (9), the expected number of groups is:
Using Equation (12), the expected number can then be expressed as:
Note that the product in Equation (12) has a maximum value for “m=2”. Define “a” as follows:
Using this definition, Equation (13) can then be expressed as:
$$E[c] \le E[c_2]\,\left(1 + a + a^2 + \cdots + a^{N-2}\right) \qquad \text{Equation (14)}$$
Using the closed-form expression for the sum of a finite geometric series, Equation (14) can be expressed as:
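For reference, the standard identity for a finite geometric series with ratio a ≠ 1 leads to the bound sketched below. The identity itself is standard; treating the resulting expression as the form of Equation (15) is an assumption based on Equation (14).

```latex
1 + a + a^2 + \cdots + a^{N-2} = \frac{1 - a^{N-1}}{1 - a}, \qquad a \neq 1
\quad\Longrightarrow\quad
E[c] \le E[c_2]\,\frac{1 - a^{N-1}}{1 - a}
```

When 0 ≤ a < 1, the right-hand side is in turn bounded by E[c_2]/(1 − a), which is independent of N.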
Evaluation of Equation (15) requires the determination of the expected number of k-co-traveling groups of size 2 (namely, “E[c2]”) using Equations (8) and (6). Expected group counts for various values of L and N are shown as contours in a 2-dimensional space over k and “r” in
First, the detection of weak co-travel is considered in view of the random travel model. Using the “weak” co-travel definition, consider the following graph formulation for “weak” m-k-GoI detection:
G = (V, E), where
V = the set of all travelers who have k-co-traveled with at least m−1 other travelers
E = {(vi, vj) | traveler vi k-co-traveled with traveler vj}.
Finding an m-k-GoI is equivalent to finding a complete sub-graph with at least “m” vertices in G. This is the clique decision problem. A clique is a sub-graph in which every node is connected to every other node in the sub-graph; under the above definition of nodes and edges, a clique is equivalent to a “GoI”. However, the clique detection problem is known to be NP-complete, which implies that an efficient (time-wise) solution can be guaranteed only for small problems. Therefore, in order to mitigate the NP-complete condition, we look to detecting a “GoI” using a suspect-based search in which cliques are determined from a smaller graph.
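The graph formulation above can be sketched in Python as follows. The input `pair_counts`, a mapping from traveler pairs to their number of co-travel events, and the function name `build_co_travel_graph` are assumptions for illustration; how the pair counts are obtained is outside this sketch.

```python
from collections import defaultdict

def build_co_travel_graph(pair_counts, k, m):
    """Build G = (V, E) as an adjacency mapping {traveler: set(neighbors)}.

    An edge joins two travelers who co-traveled at least k times; a traveler
    is kept as a vertex only if it has such an edge to at least m - 1 others.
    """
    adjacency = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if a != b and count >= k:
            adjacency[a].add(b)
            adjacency[b].add(a)
    vertices = {v for v, nbrs in adjacency.items() if len(nbrs) >= m - 1}
    # Restrict edges to the retained vertices.
    return {v: adjacency[v] & vertices for v in vertices}
```

Any clique of at least m vertices in the returned graph corresponds to a candidate weak m-k-GoI under these assumptions.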
Here, a suspect-based “GoI” detection algorithm is proposed. In this approach, an initial suspect traveler is used to obtain a list of candidate “GoI” partners (i.e., traveler names). Against this candidate set of traveler names, a search for cliques is performed to identify the maximal clique in the candidate set of traveler names. Although the need for clique detection remains even in this suspect-based case, the clique search is performed against a much smaller graph than that required in the general case. To complete the determination of the co-travel count threshold in this case, one must first understand the rest of the method of
In referring now to
In an alternative embodiment of a more general nature, addressing other types of information, a comparable step to step 204 may take the form of the detection module 120 searching the information in database 110 to determine entries having an attribute count based on the suspect entry. In such embodiment, the detection module 120 may accomplish this by way of searching the database 110 and matching the attributes for each entry with the attributes of the suspect entry to determine common attribute occurrences and, for each entry having one or more common attribute occurrences, calculating an attribute count equal to the number of common attribute occurrences for that entry. For a common attribute occurrence to occur, an entry must have an attribute identical to an attribute of said suspect entry.
From step 204, the process then moves on to step 206. At step 206, the detection module 120 then takes the list of traveler names having a co-travel count and forms a co-travel group based on those traveler names having respective co-travel counts greater than or equal to the co-travel count threshold. Alternatively, in another embodiment, a comparable step to step 206 may take the form of detection module 120 taking the list of entries having an attribute count and forming a subgroup based on those entries having respective attribute counts greater than or equal to the attribute count threshold. From step 206, the process moves on to step 208.
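Steps 204 and 206 can be illustrated with a short Python sketch. It assumes travel records are (name, destination, date) tuples and that a co-travel occurrence means an exact match on both destination and date with a trip of the suspect; the function name `co_travel_group` is hypothetical.

```python
from collections import Counter

def co_travel_group(records, suspect, threshold):
    """Return (counts, group) for a suspect traveler.

    counts: number of distinct trips each traveler shares with the suspect.
    group:  travelers whose count is >= threshold (the co-travel group).
    """
    suspect_trips = {(dest, date) for name, dest, date in records
                     if name == suspect}
    counts = Counter()
    seen = set()  # guards against duplicate records inflating a count
    for record in records:
        name, dest, date = record
        if name != suspect and (dest, date) in suspect_trips and record not in seen:
            seen.add(record)
            counts[name] += 1
    group = {name for name, c in counts.items() if c >= threshold}
    return counts, group
```

Only travelers who share at least one trip with the suspect receive a count, so the thresholding in step 206 operates on a list that is typically far smaller than the full traveler population.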
At step 208, the detection module 120 then determines co-travel within the co-travel group. In one embodiment, detection module 120 may accomplish this step 208 by way of searching the database 110 and matching the destinations and corresponding travel dates for each traveler name in the co-travel group with the destinations and corresponding travel dates associated with each of the other traveler names in the co-travel group to determine co-travel occurrences within the co-travel group. In an alternative embodiment, a comparable step to step 208 may take the form of detection module 120 determining common attributes within the subgroup. This may be accomplished in the alternative embodiment by way of searching the database 110 and matching the attributes for each entry in the subgroup with the attributes associated with each of the other entries in the subgroup to determine common attribute occurrences within the subgroup.
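A minimal Python sketch of step 208 follows, again assuming (name, destination, date) records and exact matching of destination and date. The function name `pairwise_co_travel` and the choice to key results by sorted name pairs are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_co_travel(records, group):
    """Count co-travel occurrences between every pair of group members.

    Returns {(a, b): count} with a < b, where count is the number of
    distinct (destination, date) trips the two travelers share.
    """
    trips = defaultdict(set)
    for name, dest, date in records:
        if name in group:
            trips[name].add((dest, date))
    return {(a, b): len(trips[a] & trips[b])
            for a, b in combinations(sorted(group), 2)}
```

The resulting pair counts could then be compared against the co-travel count threshold to decide which pairs are joined by an edge in the graph representation.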
Now that the co-travel has been determined for the traveler names within and among the co-travel group, the process moves on to step 210. At step 210, the detection module 120 then identifies cliques within the co-travel group based on the co-travel determined from step 208. In one embodiment, detection module 120 may accomplish this step 210 by way of first forming a graph representation of the co-travel among the co-travel group wherein the graph representation includes nodes for each traveler name and edges running between the nodes having co-travel occurrences. From the graph representation, the detection module 120 identifies one or more sets of nodes formed of nodes interconnected by equal edges. Each such set of nodes forms one clique.
In an alternative embodiment, a comparable step to step 210 may take the form of detection module 120 identifying cliques within a subgroup based on determined common attribute occurrences. This may be accomplished by way of forming a graph representation of the common attributes among said subgroup, the graph representation including nodes for each entry and edges running between nodes having common attribute occurrences. From the graph representation, the detection module 120 then identifies one or more sets of nodes formed of nodes interconnected by equal edges. Each said set of nodes forms one clique.
From step 210, the process moves to step 212. At step 212, the detection module 120 determines the maximal clique from the cliques identified in step 210. In one embodiment, detection module 120 may accomplish this step 212 by way of determining which clique (i.e., set of nodes), contains the most nodes. The maximal clique thereby forms the “GoI” based on the suspect traveler.
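Steps 210 and 212 can be sketched in Python as shown below. The disclosure does not name a clique enumeration algorithm; the basic Bron-Kerbosch recursion is used here purely as one well-known choice. The graph is assumed to be an adjacency mapping {node: set(neighbors)}, and the function names are hypothetical.

```python
def maximal_cliques(adjacency):
    """Yield every maximal clique using the basic Bron-Kerbosch recursion."""
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield clique
            return
        for v in list(candidates):
            yield from expand(clique | {v},
                              candidates & adjacency[v],
                              excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}
    yield from expand(set(), set(adjacency), set())

def largest_clique(adjacency):
    """Return the clique containing the most nodes (step 212)."""
    return max(maximal_cliques(adjacency), key=len, default=set())
```

Because the search runs only on the co-travel group derived from the suspect, the exponential worst case of clique enumeration applies to a small graph rather than to the full traveler population.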
In step 204 of the method of
From the foregoing complexity analysis, it is clear that the feasibility of the method of
where vi is an indicator variable defined as follows:
Since the probability term in Equation (16) is constant, we can express the expected number of traveler names having a co-travel count as:
Thus our co-travel count threshold “k” determines the expected number of traveler names having a co-travel count via Equation (17). Assuming it is desirable to keep the list of traveler names having a co-travel count to a size on the order of 10^2 or smaller, and given that N is on the order of 10^6 or larger, the co-travel count threshold “k” needs to be such that the probability of co-travel is no larger than on the order of 10^-4. From
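The threshold selection reasoning above can be illustrated with the following Python sketch. It assumes that the expected candidate list size is approximately (N − 1) times the probability that a single traveler k-co-travels with the suspect, and that this probability is a binomial tail over T intervals with per-interval pair probability `p_pair`. Both the function names and this reading of Equation (17) are assumptions.

```python
from math import comb

def binomial_tail(p, T, k):
    """Probability of at least k successes in T trials with success probability p."""
    return sum(comb(T, j) * p**j * (1 - p)**(T - j) for j in range(k, T + 1))

def smallest_threshold(N, p_pair, T, max_list_size):
    """Smallest k for which the expected candidate list stays within bounds.

    Assumes expected list size = (N - 1) * P(pair co-travels >= k times).
    Returns None if no k in 0..T satisfies the bound.
    """
    for k in range(T + 1):
        if (N - 1) * binomial_tail(p_pair, T, k) <= max_list_size:
            return k
    return None
```

With N on the order of 10^6 and a target list size on the order of 10^2, the returned k is the first threshold at which the tail probability falls to roughly 10^-4 or below.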
For each trial, a known suspect traveler was identified along with a set of G-1 accomplices. A set of k destinations were randomly selected from the set of L destinations. G travel records of the form (Traveleri, Destinationj, Datej) were added to the dataset for i=1 . . . G and j=1 . . . k to simulate the coordinated travel of the “GoI” members.
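The construction of a trial dataset can be sketched in Python as follows. The text does not state how the meeting dates were chosen, so this sketch assumes they are sampled without replacement from a caller-supplied list; the function name `inject_goi` and the record layout are likewise assumptions.

```python
import random

def inject_goi(records, members, destinations, dates, k, seed=0):
    """Add coordinated travel for a planted group to background records.

    members: the suspect plus accomplices (G travelers in total).
    Picks k destinations at random and pairs each with a date (the date
    selection is an assumption), then adds one record per member per meeting.
    Returns (new_records, meetings).
    """
    rng = random.Random(seed)
    meeting_places = rng.sample(destinations, k)
    meeting_dates = rng.sample(dates, k)
    meetings = list(zip(meeting_places, meeting_dates))
    planted = [(member, dest, date)
               for dest, date in meetings
               for member in members]
    return records + planted, meetings
```

Each trial then adds G x k planted records, after which the detection steps can be run with the known suspect to check whether the planted group is recovered.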
A total of 3,000 random trials were performed. For each trial, a binary variable was returned indicating whether the method of
In 2,992 cases, the method of
The simulation results confirm the overall theoretical prediction: the method of
The present disclosure includes that contained in the appended claims, as well as that of the foregoing description. Although this disclosure has been described in its preferred form in terms of certain embodiments with a certain degree of particularity, alterations and permutations of these embodiments will be apparent to those skilled in the art. Accordingly, it is understood that the above descriptions of exemplary embodiments do not define or constrain this disclosure, and that the present disclosure of the preferred form has been made only by way of example and that numerous changes, substitutions, and alterations in the details of construction and the combination and arrangement of parts may be resorted to without departing from the spirit and scope of the invention.