The present disclosure relates generally to data processing and, in an example embodiment, to circular transaction path detection.
In many different business markets, such as stock markets, commodities futures markets, and the like, large numbers of transactions, in which a buyer purchases some quantity of one or more items from a seller, may occur with great frequency on a daily basis. With advances in electronics technology, the volume and speed of such transactions have increased by leaps and bounds. While the overwhelming majority of such transactions are performed in the course of legal and ethical business dealings, some small percentage of such transactions represents illegal or fraudulent activity. In one example, two or more individuals or entities may attempt to generate public interest in a corporate stock by engaging in transactions of the stock that are primarily intended to greatly increase the trading volume of the stock, making the stock appear to be more valuable than under more typical trading circumstances and thus potentially driving up the price of the stock in a fraudulent manner. To generate such volume, one or more trading parties may buy and sell the stock multiple times among themselves, such as in a circular fashion, according to some prior arrangement or plan. Such circular trading may be considered illegal or fraudulent if performed specifically to manipulate the price of the stock.
Such circular trading may be employed for other reasons as well, such as for tax evasion or money laundering purposes. Further, such trading need not be limited to stocks, but may occur with respect to commodities futures, national or regional currencies, or any item of interest that may be bought or sold in a marketplace.
Given the extremely large number of transactions that may occur within any market over a particular time period, such as a day, week, or month, detection of such potentially fraudulent circular transactions may be difficult and time-consuming, even with the use of specialized computer programs designed specifically for that purpose running on high-speed processing systems.
The present disclosure is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
The description that follows includes illustrative systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative embodiments. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be evident, however, to those skilled in the art that embodiments of the inventive subject matter may be practiced without those specific details. In general, well-known instruction instances, protocols, structures, and techniques have not been shown in detail.
The system 100 then may process the directed graph to detect circular transaction paths of a particular path length of interest. For example, the system 100 may process the directed graph to detect all circular transaction paths of length two. A circular transaction path of length two would be indicated, for example, by a first transaction from a first party to a second party, and a second transaction from the second party back to the first party. In some markets, a circular transaction path of length two may be the most common way in which transactional volume of a stock or other commodity is generated for the purposes of increasing the stock price while the original owner of the stock effectively retains ownership of the stock. In other examples, circular transaction paths of length three or more may be employed for the same purpose while making detection of the circular transaction path more difficult.
As shown in
The graph generation module 102 may utilize the transaction data 112 of the data storage 110 to generate one or more directed graphs representing the various transactions that have occurred during some previously occurring period of time. In one example, the transaction data 112 may include information regarding each transaction of interest, such as the item or items that were the subject of the transaction, the monetary amount of the transaction, the date and time of the transaction, the parties involved in the transaction, and the roles of the parties in the transaction. Other types of information regarding the transaction not specifically enumerated herein may also be stored as the transaction data 112.
The graph generation module 102 may access and process the transaction data 112 to generate one or more directed graphs representing a plurality of the transactions, as mentioned above. In one implementation, the graph generation module 102 may filter the transaction data 112 according to the parties or items involved in the transactions, the amounts of the transactions, the time periods during which the transactions occurred, and/or other factors or attributes regarding the transactions. In an example, the resulting directed graph includes nodes representing parties to the various transactions, and the directed edges connecting the nodes represent the transactions themselves. The direction of the directed edges may correlate with the direction in which the item or target of the transaction, such as a product, service, security, commodity, or other item of interest, changed hands, or the direction in which the monetary value being offered for the item of interest was transferred. As shown in
After the graph generation module 102 generates the one or more directed graphs, the circular path search module 104 may access the graph information 114 describing the one or more directed graphs and process, analyze, or search the one or more graphs to detect circular transaction paths of some predetermined length. Examples of how this processing or searching may be performed are discussed in greater detail below.
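The graph generation described above can be illustrated with a minimal Python sketch. The `Transaction` record, its field names, and the `build_directed_graph` function are hypothetical names introduced only for this illustration; the sketch assumes that edges point from seller to buyer (the direction in which the item changed hands) and that filtering is done by item and minimum amount, which is just one of the filtering choices mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record layout; the field names are assumptions.
    seller: str
    buyer: str
    item: str
    amount: float


def build_directed_graph(transactions, item=None, min_amount=0.0):
    """Return {party: set of parties}, with one edge seller -> buyer.

    Transactions are optionally filtered by item and by minimum amount.
    """
    graph = defaultdict(set)
    for t in transactions:
        if item is not None and t.item != item:
            continue  # filtered out: different item
        if t.amount < min_amount:
            continue  # filtered out: amount too small
        graph[t.seller].add(t.buyer)
        graph.setdefault(t.buyer, set())  # every party becomes a node
    return dict(graph)


# Made-up sample data for illustration only.
transactions = [
    Transaction("A", "B", "XYZ", 100.0),
    Transaction("B", "A", "XYZ", 101.0),
    Transaction("B", "C", "XYZ", 5.0),
    Transaction("C", "A", "OTHER", 50.0),
]
graph = build_directed_graph(transactions, item="XYZ", min_amount=10.0)
```

In this made-up data set, only the two larger "XYZ" transactions survive the filter, leaving a two-node graph in which A and B trade with each other in both directions.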
Within each strongly connected component of the directed graph, each circular path having a length equal to the circular path length of interest is discovered (operation 208). In some examples described more fully below, this discovery or searching process may include the elimination of possible search paths according to one or more preset rules. Within each discovered circular path, the transactions represented by the directed edges of the discovered circular path may then be denoted as related transactions (operation 210). In one implementation, the related transactions may be viewed as potentially fraudulent transactions, as discussed above. In other examples, operations 204-210 may be repeated using the same directed graph in light of different circular path lengths if a number of different lengths of circular paths are of interest.
While the operations 202 through 210 of the method 200 of
In accordance with at least some of the embodiments described above, circular transaction paths of one or more lengths that may be indicative of fraudulent transactional activity may be detected in an efficient manner. More specifically, by representing the transactions as a directed graph and subdividing the graph into strongly connected components, portions of the path discovery process may be apportioned among multiple processors or processing threads in order to reduce the overall execution time. Also, one or more rules may be employed to reduce the overall number of paths searched, thus decreasing the discovery execution time while maintaining accuracy of the circular path discovery process. Other possible aspects and advantages may be ascertained from the discussion of the various embodiments presented below.
In
As shown in
In one example, the SCCs 310 are identified via one of the algorithms available in the art for such a purpose. In one example, the algorithm employed is Tarjan's Algorithm, proposed by Robert E. Tarjan (Tarjan, R. E. (1972). “Depth-first search and linear graph algorithms,” SIAM Journal on Computing 1 (2): 146-160). Tarjan's Algorithm, which is well-known in the art of graph theory, utilizes only a single depth-first search, in which each path in the graph is explored to its conclusion before backtracking to explore other paths. In other implementations, other algorithms for identifying the strongly connected components, such as Kosaraju's Algorithm and the Path-Based Strong Component Algorithm, which are also known in the art of graph theory, may be utilized.
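For illustration, a compact recursive form of Tarjan's Algorithm is sketched below in Python. This is a textbook-style rendering rather than the implementation of any particular embodiment; the function name, the adjacency-dictionary input format, and the sample graph are assumptions made for the sketch. Because it is recursive, very large graphs would likely need an iterative variant.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm: return a list of SCCs, each a sorted list of nodes.

    `graph` maps each node to an iterable of successor nodes.
    """
    index_of = {}      # discovery index of each visited node
    lowlink = {}       # lowest discovery index reachable from the node
    stack = []         # nodes of the components still being built
    on_stack = set()
    components = []
    counter = 0

    def visit(v):
        nonlocal counter
        index_of[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index_of:
                visit(w)  # explore the path to its conclusion first
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index_of[w])
        if lowlink[v] == index_of[v]:
            # v is the root of an SCC: pop the whole component.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(sorted(component))

    for v in graph:
        if v not in index_of:
            visit(v)
    return components


# Hypothetical five-node graph, for illustration only.
example = {"A": ["B"], "B": ["E"], "E": ["A", "F"], "F": ["G"], "G": ["F"]}
sccs = sorted(strongly_connected_components(example))
```

In this hypothetical graph the edge from E to F connects two components but belongs to neither, since no path leads from F or G back to E.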
Once the SCCs 310 are identified, the paths within each SCC 310 may be searched to discover the various circular paths having a particular or desired length of interest. In one example, each node 302 of an SCC 310 may be employed as a starting node 302 from which to begin a search for a set of paths within the SCC 310 that form a circular path of the desired length. While each possible circular path may be searched in this manner, some paths discovered may be duplicates of others. For example, a path from Node A to Node B and back is the same as a path from Node B to Node A and back. In addition, some paths may be eliminated altogether due to their impossibility of being included as part of a circular path. As a result, one or more searching rules may be employed to reduce the overall amount of searching compared to a rote searching of all available paths.
In
Circular paths in the SCC 310 are then searched starting at each node 302 having a rank at least as high as the circular path length of interest (operation 408). For example, in an SCC 310 having Nodes A, B, C, D, and E, and with a circular path length of interest equal to three, searches may be performed using Nodes C, D, and E (the top three ranked nodes) as starting nodes, and Nodes A and B would be ineligible as starting nodes 302 for circular path searches. Further, during a search using a particular starting node 302, all paths that include other nodes 302 in the SCC 310 with a higher rank than the starting node 302 may be eliminated as potential circular paths (operation 410). As a result, a search for a circular path starting from Node C can ignore any paths that include Nodes D or E. The various operations 404-410 of
In method 400C (Rule Three (Case One)) of
Similarly, in method 400D (Rule Three (Case Two)) of
In a specific example of application of the rules embodied by the methods 400A, 400B, 400C, and 400D of
Given a circular path length of interest of two, applying Rule One does not eliminate any potential starting nodes 302, since none of the SCCs 310 includes fewer than two nodes 302. In applying Rule Two, any nodes 302 with an identifier of a rank that is less than two can be eliminated as starting nodes 302. As a result, Node A of SCC 310A, Node F of SCC 310B, and Node C of SCC 310C are eliminated on that basis. Rule Three may then be applied to remaining Nodes B, D, E, G, and H. Using Rule Three, Node B may then be eliminated as a starting node 302. As a result, half of the eight possible starting nodes 302 in the directed graph 300 are eliminated before searching begins in earnest.
Further, when performing the search operation with each of Nodes D, E, G, and H, Rule Two can be further applied to stop searching a particular path if a node 302 is encountered that has a higher rank than the starting node 302. In the case of Node D, for example, the outgoing edge 304 of Node D that connects to Node H may be ignored since Node H is of a higher rank than Node D.
Proceeding with searching each of the remaining paths using Nodes D, E, G, and H as starting nodes 302 yields three circular paths of length two: a circular path from Node G to Node F and back, a circular path from Node D to Node C and back, and a circular path from Node H to Node D and back. In the case of Node E, no circular path of length two from Node E to either Node A or Node B and back is found. Thus, three circular paths of length two are found in the directed graph 300.
In another example,
Applying Rule One in this example results in both Nodes F and G of SCC 310B being eliminated, as an SCC 310 with only two nodes 302 cannot produce a circular path of length three. Also, in applying Rule Two, only the highest-ranked node 302 in each of SCC 310A and SCC 310C (e.g., Node E in SCC 310A and Node H in SCC 310C) remains as a potential starting node 302 for the search procedures to follow, as SCC 310A and SCC 310C have three nodes 302 apiece. Further, the use of Rule Three is unnecessary in this example as at most one eligible starting node 302 remains in each SCC 310.
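The effect of Rule One and Rule Two on the set of eligible starting nodes in the two examples above can be reproduced with the short Python sketch below. The function name is hypothetical, Rule Three is deliberately omitted, and the sketch assumes that rank within an SCC follows the alphabetical order of the node identifiers, as the examples suggest. The SCC memberships are those given in the text.

```python
def eligible_starting_nodes(sccs, n):
    """Apply Rule One and Rule Two to each SCC for path length n.

    Rule Three is not modeled here.
    """
    eligible = []
    for scc in sccs:
        # Rule One: an SCC with fewer than n nodes cannot contain
        # a circular path of length n.
        if len(scc) < n:
            continue
        # Assumption: rank 1 is the lowest identifier in the SCC.
        ranked = sorted(scc)
        # Rule Two: the lowest n-1 ranked nodes are ineligible starters.
        eligible.extend(ranked[n - 1:])
    return sorted(eligible)


# SCC 310A, SCC 310B, and SCC 310C as described in the text.
sccs = [["A", "B", "E"], ["F", "G"], ["C", "D", "H"]]
length_two = eligible_starting_nodes(sccs, 2)
length_three = eligible_starting_nodes(sccs, 3)
```

For a path length of two this leaves Nodes B, D, E, G, and H (before Rule Three removes Node B), and for a path length of three it leaves only Nodes E and H, matching the examples.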
In using Node E of SCC 310A as a starting node 302, searching for a circular path of length three results in the path of Node E to Node A to Node B, and then back to Node E. However, utilizing Node H as a starting node 302 results in no circular path of length three being available, as only a circular path from Node H to Node D and back may be discovered.
For each remaining SCC 310, the outgoing edges 304 of each node 302 in the SCC 310 that end in another node 302 within the same SCC 310 are noted in an outgoing edge list for that node 302. In some examples, the outgoing edge list for each node 302 may be sorted based on the identifier of the node 302 at which each outgoing edge 304 terminates, as such sorting may aid in identifying those edges 304 which may be eliminated due to their rank being higher than that of the starting node 302 according to Rule Two during a circular path search. In one example, multiple processors or processing threads may be employed to perform the edge list building and sorting operations according to individual SCCs 310 or nodes 302.
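A minimal Python sketch of the edge-list preparation follows. The function names and the sample edges are assumptions made for illustration (only the edge from Node D to Node H is taken from the example above); the binary search uses `bisect_right` from the Python standard library to find where the eligible targets end.

```python
from bisect import bisect_right


def build_edge_lists(graph, scc):
    """For each node of the SCC, list its outgoing edges that end inside
    the same SCC, sorted by the identifier of the terminating node."""
    members = set(scc)
    return {
        v: sorted(w for w in graph.get(v, ()) if w in members)
        for v in scc
    }


def eligible_targets(edge_list, start):
    """Return the terminating nodes that do not outrank the starting node.

    Because the list is sorted, a binary search finds the cut-off.
    """
    return edge_list[:bisect_right(edge_list, start)]


# Hypothetical edges; D -> G leaves the SCC and is therefore dropped.
graph = {"C": ["D"], "D": ["H", "G", "C"], "H": ["D"], "G": []}
edge_lists = build_edge_lists(graph, ["C", "D", "H"])
from_d = eligible_targets(edge_lists["D"], "D")
```

With Node D as the starting node, the sorted list for Node D is cut off before Node H, so only the edge to Node C is followed.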
In the main function 800A, each of the nodes 302 in each SCC 310 may also be sorted in descending rank according to their identifiers. Based on this ranking, the main function 800A may then mark the last, or lowest, N−1 nodes 302 as being ineligible as starting nodes 302 according to Rule Two. Also, sorting the nodes 302 in this manner allows one or more processing threads to process the higher-ranked nodes 302 being used as starting nodes 302 for search operations first, as the higher-ranked nodes 302 tend to consume the most searching operations compared to lower-ranked nodes 302 under Rule Two, as described above. In some examples, the main function 800A may then apply Rule Three (e.g., either or both of Case One and Case Two) to each remaining eligible starting node 302 in each remaining SCC 310 to determine if more nodes 302 may be eliminated from the group of nodes 302 eligible as starting nodes 302 for search purposes.
At this point, the main function 800A may initiate a number of searching threads via calls to SearchThread 800B, illustrated in the pseudo-code of
Continuing with
Continuing with
SearchCircularPaths 800C may then add the starting node 302 start to a data structure PartialPath, which tracks the nodes 302 that constitute the current path being searched. SearchCircularPaths 800C may then call another function, SearchCircularPathsHelper 800D of
In SearchCircularPaths 800C, when execution control returns from SearchCircularPathsHelper 800D, Results includes the circular paths of length N, if any, found in SearchCircularPathsHelper 800D. SearchCircularPaths 800C then returns Results to its associated SearchThread 800B, which may in turn add Results to a centralized data structure that contains all discovered circular paths of length N from all SearchThreads 800B.
As depicted in
If, instead, the length of PartialPath is not equal to N, more searching to complete a circular path may be undertaken. In this example, SearchCircularPathsHelper 800D attempts to locate the next node terminating one of the current node 302 v's outgoing edges 304 having an identifier or rank greater than that of the starting node 302 start, as indicated under Rule Two. To accomplish this task, SearchCircularPathsHelper 800D determines the next index of current node 302 v's sorted outgoing edge list that is associated with a node 302 that has a greater rank than the starting node 302 start. This determination is made via a call to a function GetUpperBound (not described in pseudo-code herein), which, in one example, is a binary search routine. In
SearchCircularPathsHelper 800D may then mark the current node 302 v as being visited by marking its element in the NodeVisited array as TRUE. SearchCircularPathsHelper 800D may then initiate a search for a circular path from the current node 302 v for each outgoing edge of current node 302 v terminated by a node 302 having a rank no higher than the starting node 302 start, as indicated by Rule Two. To accomplish this task, SearchCircularPathsHelper 800D accesses the next eligible terminating node 302, or end node 302, from the current node 302 v's outgoing edge list and determines if that node 302 has been visited during this search by checking the appropriate element of NodeVisited. If this terminating node 302 has not been visited already along this path, that node 302 is added to PartialPath, and another call is made to SearchCircularPathsHelper 800D with the terminating node 302 being designated as the current node 302 for that function call.
The search may then continue, with each successive node 302 in the search of a path resulting in another call to SearchCircularPathsHelper 800D. If the search results in a circular path of length N being found, the PartialPath constructed to that point is added to Results as the circular path. If, instead, the search is terminated before a circular path is found by encountering the end of the path or by encountering a node 302 that has already been designated as part of the path, the last call to SearchCircularPathsHelper 800D designates the element of NodeVisited associated with its current node 302 v as FALSE, and returns to the previous instantiation of SearchCircularPathsHelper 800D. In turn, the previous instantiation of SearchCircularPathsHelper 800D removes from PartialPath its terminating node 302 (e.g., the current node 302 v for the instantiation of SearchCircularPathsHelper 800D just returned from), marks the current node 302 v for the current instantiation of SearchCircularPathsHelper 800D in NodeVisited as FALSE, and returns, thus returning back up the path in search of an alternate path. Thus, the search for the next circular path of length N progresses in depth-first fashion until all potential paths from the starting node 302 start have been explored.
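The depth-first search with Rule Two pruning described above may be sketched in Python as follows. This is an approximation of the described behavior, not a transcription of the pseudo-code in the figures: the function names are hypothetical, the standard-library `bisect_right` stands in for GetUpperBound, and the sample edge lists are assumptions chosen to be consistent with the circular paths reported in the examples above. The sketch expects edge lists that are already restricted to one SCC and sorted by identifier.

```python
from bisect import bisect_right


def search_circular_paths(edge_lists, start, n):
    """Return every circular path of length n that starts at `start` and
    visits no node ranked above `start` (Rule Two).

    `edge_lists` maps each node of one SCC to its sorted list of
    terminating nodes within that SCC.
    """
    results = []
    node_visited = {v: False for v in edge_lists}
    partial_path = [start]

    def helper(v):
        if len(partial_path) == n:
            # The path is circular if an edge leads from v back to start.
            if start in edge_lists[v]:
                results.append(list(partial_path))
            return
        node_visited[v] = True
        targets = edge_lists[v]
        # Binary search for the first target ranked above the start node.
        upper = bisect_right(targets, start)
        for w in targets[:upper]:
            if not node_visited[w]:
                partial_path.append(w)
                helper(w)
                partial_path.pop()  # back up and try an alternate path
        node_visited[v] = False

    helper(start)
    return results


# Hypothetical edge lists for two three-node SCCs, for illustration only.
scc_a = {"A": ["B"], "B": ["E"], "E": ["A"]}
scc_c = {"C": ["D"], "D": ["C", "H"], "H": ["D"]}
```

With these assumed edges, starting at Node E finds one circular path of length three and none of length two, while starting at Node H finds a circular path of length two through Node D and none of length three.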
While
As a result of at least some of the embodiments discussed herein, as the number of processors P increases, or as the circular path length of interest decreases, or both, the average computational complexity of the algorithms described herein decreases and performance improves. Overall, this level of performance may represent a vast improvement over other methods that do not systematically reduce the number of searches performed or are not able to employ multiple processors in an efficient and parallel manner.
Further, in some implementations, some enhancements may be made to the functions and associated pseudo-code of
Thus, in view of at least some of the embodiments described herein, the searching of circular transaction paths of some length of interest may be facilitated by representing the transactions as a directed graph and employing one or more techniques for dividing the overall computational work into separate, identifiable portions for processing and searching possible paths using multiple processors operating in parallel. Further, the implementation of one or more rules, as described herein, may eliminate a significant number of duplicate searches by eliminating at least some nodes from which individual path searches may begin, as well as reduce the amount of processing or computation to complete searches that have already been initiated by terminating searching along some paths based on the identity of the starting node for those paths.
While the embodiments described herein are directed to transactions between parties, other types of interactions between parties, such as, for example, social or business networking connections made between people or parties (e.g., connections established between people on Facebook® or other social or business networking sites), may also be represented as a directed or undirected graph in order to detect social or business connections that form circular paths of a specific length of interest.
The machine is capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The example of the processing system 900 includes a processor 902 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 904 (e.g., random access memory), and static memory 906 (e.g., static random-access memory), which communicate with each other via bus 908. The processing system 900 may further include a video display unit 910 (e.g., a plasma display, a liquid crystal display (LCD), or a cathode ray tube (CRT)). The processing system 900 also includes an alphanumeric input device 912 (e.g., a keyboard), a user interface (UI) navigation device 914 (e.g., a mouse), a disk drive unit 916, a signal generation device 918 (e.g., a speaker), and a network interface device 920.
The disk drive unit 916 (a type of non-volatile memory storage) includes a machine-readable medium 922 on which is stored one or more sets of data structures and instructions 924 (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. The data structures and instructions 924 may also reside, completely or at least partially, within the main memory 904, the static memory 906, and/or within the processor 902 during execution thereof by processing system 900, with the main memory 904 and processor 902 also constituting machine-readable, tangible media.
The data structures and instructions 924 may further be transmitted or received over a computer network 950 via network interface device 920 utilizing any one of a number of well-known transfer protocols (e.g., HyperText Transfer Protocol (HTTP)).
Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more computer systems (e.g., the processing system 900) or one or more hardware modules of a computer system (e.g., a processor 902 or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.
In various embodiments, a hardware module may be implemented mechanically or electronically. For example, a hardware module may include dedicated circuitry or logic that is permanently configured (for example, as a special-purpose processor, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware module may also include programmable logic or circuitry (for example, as encompassed within a general-purpose processor 902 or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (for example, configured by software) may be driven by cost and time considerations.
Accordingly, the term “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired) or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where the hardware modules include a general-purpose processor 902 that is configured using software, the general-purpose processor 902 may be configured as respective different hardware modules at different times. Software may accordingly configure a processor 902, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.
Modules can provide information to, and receive information from, other modules. For example, the described modules may be regarded as being communicatively coupled. Where multiples of such hardware modules exist contemporaneously, communications may be achieved through signal transmissions (such as, for example, over appropriate circuits and buses) that connect the modules. In embodiments in which multiple modules are configured or instantiated at different times, communications between such modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple modules have access. For example, one module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further module may then, at a later time, access the memory device to retrieve and process the stored output. Modules may also initiate communications with input or output devices, and can operate on a resource (for example, a collection of information).
The various operations of example methods described herein may be performed, at least partially, by one or more processors 902 that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors 902 may constitute processor-implemented modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example embodiments, include processor-implemented modules.
Similarly, the methods described herein may be at least partially processor-implemented. For example, at least some of the operations of a method may be performed by one or more processors 902 or processor-implemented modules. The performance of certain of the operations may be distributed among the one or more processors 902, not only residing within a single machine but deployed across a number of machines. In some example embodiments, the processors 902 may be located in a single location (e.g., within a home environment, within an office environment, or as a server farm), while in other embodiments, the processors 902 may be distributed across a number of locations.
While the embodiments are described with reference to various implementations and exploitations, it will be understood that these embodiments are illustrative and that the scope of claims provided below is not limited to the embodiments described herein. In general, the techniques described herein may be implemented with facilities consistent with any hardware system or hardware systems defined herein. Many variations, modifications, additions, and improvements are possible.
Plural instances may be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the claims. In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the claims and their equivalents.