METHOD AND SYSTEM FOR DISTRIBUTED PROCESSING

Information

  • Patent Application
  • Publication Number: 20130204941
  • Date Filed: February 05, 2013
  • Date Published: August 08, 2013
Abstract
Nodes at first, second, and third locations have the same first-axis coordinates, while nodes at the first, fourth, and fifth locations have the same second-axis coordinates. First transmission transmits data elements from the node at the first location to nodes at the second and fourth locations, as well as to the node at either the third location or the fifth location. Second transmission transmits data elements from nodes at the second locations to nodes at the first, fourth, and fifth locations. Third transmission transmits data elements from nodes at the third locations to nodes at the first, second, and fourth locations. These three transmissions are performed with each node location selected as the base point on a diagonal line. The nodes execute a data processing operation by using the data elements assigned thereto and the data elements received as a result of the first to third transmissions.
Description
CROSS-REFERENCE TO RELATED APPLICATION

This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2012-022905, filed on Feb. 6, 2012, the entire contents of which are incorporated herein by reference.


FIELD

The embodiments discussed herein relate to a method and system for distributed processing.


BACKGROUND

Distributed processing systems are widely used today to process large amounts of data by running programs therefor on a plurality of nodes (e.g., computers) in parallel. These systems may also be referred to as parallel data processing systems. Some parallel data processing systems use high-level data management software, such as parallel relational databases and distributed key-value stores. Other parallel data processing systems operate with user-implemented parallel processing programs, without relying on high-level data management software.


The above systems may exert data processing operations on a set (or sets) of data elements. In the technical field of relational databases, for example, a join operation acts on each combination of two data records (called “tuples”) in one or two designated data tables. Another example of data processing is matrix product and other operations that act on one or two sets of vectors expressed in matrix form. Such operations are used in the scientific and technological fields.


It is preferable that the nodes constituting a distributed processing system are utilized as efficiently as possible to process a large number of data records. To this end, there has been proposed, for example, an n-dimensional hypercubic parallel processing system. In operation of this system, two datasets are first distributed uniformly to a plurality of cells. The data is then broadcast from each cell to other cells within a particular range before starting computation of a direct product of the two datasets. Another example is a parallel computer including a plurality of computing elements organized in the form of a triangular array. This array of computing elements is subdivided to form a network of smaller triangular arrays.


Yet another example is a parallel processor device having a first group of processors, a second group of processors, and an intermediate group of processors between the two. The first group divides and distributes data elements to the intermediate group. The intermediate group sorts the data elements into categories and distributes them to the second group so that the processors of the second group each collect data elements of a particular category. Still another example is an array processor that includes a plurality of processing elements arranged in the form of a rectangle. Each processing element has only one receive port and only one transmit port, such that the elements communicate via limited paths. Further proposed is a parallel computer system formed from a plurality of divided processor groups. Each processor group performs data transfer in its local domain. The data is then transferred from group to group in a stepwise manner.


There is proposed still another distributed processing system designed for solving computational problems. A given group of processors is divided into a plurality of subsystems having a hierarchical structure. A given computational problem is also divided into a plurality of subproblems having a hierarchical structure. These subproblems are assigned to different subsystems, so that the given problem is solved by the plurality of subsystems as a whole. This distributed processing system implements communication between two subsystems under the condition that the processors in one subsystem are only allowed to communicate with their associated counterparts in the other subsystem. Suppose, for example, that one subsystem includes processors #000 and #001, while another subsystem includes processors #010 and #011. Processor #000 communicates with processor #010, and processor #001 communicates with processor #011. The inter-processor communication may therefore take two stages: for example, communication from subsystem to subsystem, followed by closed communication within a subsystem.


The following is a list of documents pertinent to the background techniques:


Japanese Laid-open Patent Publication No. 2-163866


Japanese Laid-open Patent Publication No. 6-19862


Japanese Laid-open Patent Publication No. 9-6732


International Publication Pamphlet No. WO 99/00743


Japanese Laid-open Patent Publication No. 2003-67354


Shantanu Dutt and Nam Trinh, “Are There Advantages to High-Dimension Architectures?: Analysis of K-ary n-cubes for the Class of Parallel Divide-and-Conquer Algorithms”, Proceedings of the 10th ACM (Association for Computing Machinery) International Conference on Supercomputing (ICS), 1996


As in the case of join operations mentioned above, some classes of data processing operations may use the same data elements many times. When a plurality of nodes are used in parallel to execute this type of operation, one or more copies of data elements have to be transmitted from node to node. Here the issue is how efficiently the nodes obtain data elements for their local operations.


Suppose, for example, that the nodes exert a specific processing operation on every possible combination pattern of two data elements in a dataset. The data processing calls for complex scheduling of tasks, and it is not easy in such cases for the nodes to duplicate and transmit their data elements in an efficient way.


SUMMARY

According to an aspect of the embodiments discussed herein, there is provided a method for distributed processing including the following acts: assigning data elements to a plurality of nodes sitting at node locations designated by first-axis coordinates and second-axis coordinates in a coordinate space, the node locations including a first location that serves as a base point on a diagonal line of the coordinate space, second and third locations having the same first-axis coordinates as the first location, and fourth and fifth locations having the same second-axis coordinates as the first location; performing first, second, and third transmissions, with each node location on the diagonal line which is selected as the base point, wherein: the first transmission transmits the assigned data elements from the node at the first location as the base point to the nodes at the second and fourth locations, as well as to the node at either the third location or the fifth location, the second transmission transmits the assigned data elements from the nodes at the second locations to the nodes at the first, fourth, and fifth locations, and the third transmission transmits the assigned data elements from the nodes at the third locations to the nodes at the first, second, and fourth locations; and causing the nodes to execute a data processing operation by using the data elements assigned thereto by the assigning and the data elements received as a result of the first, second, and third transmissions.


The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.


It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.





BRIEF DESCRIPTION OF DRAWINGS


FIG. 1 illustrates a distributed processing system according to a first embodiment;



FIG. 2 illustrates a distributed processing system according to a second embodiment;



FIG. 3 illustrates an information processing system according to a third embodiment;



FIG. 4 is a block diagram illustrating an exemplary hardware configuration of nodes;



FIG. 5 illustrates an exhaustive join;



FIG. 6 illustrates an exemplary execution result of an exhaustive join;



FIG. 7 illustrates a node coordination model according to the third embodiment;



FIG. 8 is a graph illustrating how the amount of transmit data varies with the row dimension of nodes;



FIGS. 9A, 9B, and 9C illustrate exemplary methods of relaying from node to node;



FIG. 10 is a block diagram illustrating an exemplary software structure according to the third embodiment;



FIG. 11 is a flowchart illustrating an exemplary procedure of joins according to the third embodiment;



FIG. 12 is a first diagram illustrating an exemplary data arrangement according to the third embodiment;



FIG. 13 is a second diagram illustrating an exemplary data arrangement according to the third embodiment;



FIG. 14 is a third diagram illustrating an exemplary data arrangement according to the third embodiment;



FIG. 15 illustrates an information processing system according to a fourth embodiment;



FIG. 16 illustrates a node coordination model according to the fourth embodiment;



FIG. 17 is a block diagram illustrating an exemplary software structure according to the fourth embodiment;



FIG. 18 is a flowchart illustrating an exemplary procedure of joins according to the fourth embodiment;



FIG. 19 is a first diagram illustrating an exemplary data arrangement according to the fourth embodiment;



FIG. 20 is a second diagram illustrating an exemplary data arrangement according to the fourth embodiment;



FIG. 21 is a third diagram illustrating an exemplary data arrangement according to the fourth embodiment;



FIG. 22 is a fourth diagram illustrating an exemplary data arrangement according to the fourth embodiment;



FIG. 23 illustrates a triangle join;



FIG. 24 illustrates an exemplary result of a triangle join;



FIG. 25 illustrates a node coordination model according to a fifth embodiment;



FIG. 26 is a flowchart illustrating an exemplary procedure of joins according to the fifth embodiment;



FIG. 27 is a first diagram illustrating an exemplary data arrangement according to the fifth embodiment;



FIG. 28 is a second diagram illustrating an exemplary data arrangement according to the fifth embodiment;



FIG. 29 illustrates a node coordination model according to a sixth embodiment;



FIG. 30 is a flowchart illustrating an exemplary procedure of joins according to the sixth embodiment;



FIG. 31 is a first diagram illustrating an exemplary data arrangement according to the sixth embodiment;



FIG. 32 is a second diagram illustrating an exemplary data arrangement according to the sixth embodiment;



FIG. 33 illustrates a node coordination model according to a seventh embodiment;



FIG. 34 is a flowchart illustrating an exemplary procedure of joins according to the seventh embodiment;



FIG. 35 is a first diagram illustrating an exemplary data arrangement according to the seventh embodiment;



FIG. 36 is a second diagram illustrating an exemplary data arrangement according to the seventh embodiment;



FIG. 37 is a flowchart illustrating an exemplary procedure of joins according to an eighth embodiment;



FIG. 38 is a first diagram illustrating an exemplary data arrangement according to the eighth embodiment;



FIG. 39 is a second diagram illustrating an exemplary data arrangement according to the eighth embodiment;



FIG. 40 illustrates a node coordination model according to a ninth embodiment;



FIG. 41 is a flowchart illustrating an exemplary procedure of joins according to the ninth embodiment;



FIG. 42 is a first diagram illustrating an exemplary data arrangement according to the ninth embodiment;



FIG. 43 is a second diagram illustrating an exemplary data arrangement according to the ninth embodiment;



FIG. 44 is a third diagram illustrating an exemplary data arrangement according to the ninth embodiment;



FIG. 45 illustrates a node coordination model according to a tenth embodiment;



FIG. 46 is a flowchart illustrating an exemplary procedure of joins according to the tenth embodiment;



FIG. 47 is a first diagram illustrating an exemplary data arrangement according to the tenth embodiment;



FIG. 48 is a second diagram illustrating an exemplary data arrangement according to the tenth embodiment;



FIG. 49 is a third diagram illustrating an exemplary data arrangement according to the tenth embodiment;



FIG. 50 is a fourth diagram illustrating an exemplary data arrangement according to the tenth embodiment; and



FIG. 51 is a fifth diagram illustrating an exemplary data arrangement according to the tenth embodiment.





DESCRIPTION OF EMBODIMENTS

Several embodiments will be described below with reference to the accompanying drawings.


(a) First Embodiment


FIG. 1 illustrates a distributed processing system according to a first embodiment. The illustrated distributed processing system includes a plurality of nodes 1a, 1b, 1c, and 1d and communication devices 3a and 3b.


Nodes 1a, 1b, 1c, and 1d are information processing apparatuses configured to execute data processing operations. Each node 1a, 1b, 1c, and 1d may be organized as a computer system including a processor such as a central processing unit (CPU) and data storage devices such as random access memory (RAM) and hard disk drives (HDD). For example, the nodes 1a, 1b, 1c, and 1d may be what are known as personal computers (PC), workstations, or blade servers. Communication devices 3a and 3b are network relaying devices designed to forward data from one place to another place. For example, the communication devices 3a and 3b may be layer-2 switches. These two communication devices 3a and 3b may be interconnected by a direct link as seen in FIG. 1, or may be connected via some other network devices in a higher level of the network hierarchy.


Two nodes 1a and 1b are linked to one communication device 3a and form a group #1 of nodes. Another two nodes 1c and 1d are linked to the other communication device 3b and form another group #2 of nodes. Each group may include three or more nodes. Further, the distributed processing system may include more groups of nodes. Each such node group may be regarded as a single node, and is hence referred to as a virtual node. There are node-to-node relationships between every two groups of nodes in the system. For example, one node 1a in group #1 is associated with one node 1c in group #2, while the other node 1b in group #1 is associated with the other node 1d in group #2.


A plurality of data elements constituting a dataset are assigned to the nodes 1a, 1b, 1c, and 1d in a distributed manner. These data elements may previously be assigned before a command for initiating data processing is received. Alternatively, the distributed processing system may be configured to assign data elements upon receipt of a command that initiates data processing. Preferably, data elements are distributed as evenly as possible across the plurality of nodes used for the subsequent data processing, without redundant duplication (i.e., without duplication of the same data in different nodes). The distributed data elements may belong to a single dataset, or may belong to two or more different datasets. In other words, the distributed data elements may be of a single category, or may be divided into two or more categories.


Subsequent to the above initial data assignment, the nodes 1a, 1b, 1c, and 1d duplicate the data elements in preparation for the parallel data processing. That is, the data elements are copied from node to node, such that the nodes 1a, 1b, 1c, and 1d obtain a collection of data elements that they use in their local data processing. According to the first embodiment, the distributed processing system performs this duplication processing in the following two stages: (a) first stage where data elements are copied from group to group, and (b) second stage where data elements are copied from node to node in each group.


Group #1 receives data elements from group #2 in the first stage. More specifically, one node 1a in group #1 receives data elements from its counterpart node 1c in group #2 by communicating therewith via the communication devices 3a and 3b. Another node 1b in group #1 receives data elements from its counterpart node 1d in group #2 by communicating therewith via the communication devices 3a and 3b. Group #2 may similarly receive data elements from group #1 in the first stage. The nodes 1c and 1d in group #2 respectively communicate with their counterpart nodes 1a and 1b via the communication devices 3a and 3b to receive their data elements.


In the second stage, group #1 locally duplicates data elements. More specifically, the nodes 1a and 1b in group #1 have their data elements, some of which have initially been assigned to each group, and the others of which have been received from group #2 or other groups, if any. The nodes 1a and 1b transmit and receive these data elements to and from each other. The decision of which node to communicate with which node is made on the basis of logical connections of nodes in group #1. For example, one node 1a receives data elements from the other node 1b, including those initially assigned thereto, and those received from the node 1d in group #2 in the first stage. Likewise, the latter node 1b receives data elements from the former node 1a, including those initially assigned thereto, and those received from the node 1c in group #2 in the first stage. The nodes 1c and 1d in group #2 may duplicate their respective data elements in the same way.
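

The following is a minimal sketch of the two-stage duplication described above, for the four nodes 1a to 1d of FIG. 1. The group membership, counterpart pairing, and element names are illustrative assumptions used only to make the flow concrete; they are not prescribed by the embodiment.

```python
# A hedged sketch of the two-stage duplication in the first embodiment.
# Group membership, counterpart pairing, and element names are assumptions.
groups = {"#1": ["1a", "1b"], "#2": ["1c", "1d"]}
counterpart = {"1a": "1c", "1b": "1d", "1c": "1a", "1d": "1b"}

# each node starts with one initially assigned data element named after itself
held = {n: {f"elem_{n}"} for members in groups.values() for n in members}

# first stage: each node receives the data elements of its counterpart node in
# the other group (this traffic crosses the two communication devices)
after_stage1 = {n: held[n] | held[counterpart[n]] for n in held}

# second stage: nodes in the same group exchange everything they now hold
# (this traffic stays within the domain of a single communication device)
after_stage2 = {}
for members in groups.values():
    merged = set().union(*(after_stage1[n] for n in members))
    for n in members:
        after_stage2[n] = set(merged)

print(after_stage2)   # in this tiny example every node ends with all four elements
```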


The four nodes 1a, 1b, 1c, and 1d execute data processing operations on the data elements collected through the two-stage duplication described above. As noted above, the current data elements in each node include those initially assigned thereto, those received in the first stage from its associated nodes in other groups, and those received in the second stage from nodes in the same group. For example, such data elements in a node may constitute two subsets of a given dataset. When this is the case, the node may apply the data processing operations on every combination of two data elements that respectively belong to these two subsets. As another example, data elements in a node may constitute one subset of a given dataset. When this is the case, the node may apply the data processing operations on every combination of two data elements both belonging to that subset.


According to the first embodiment, the proposed distributed processing system forms a plurality of nodes 1a, 1b, 1c, and 1d into groups, with consideration of their connection with the communication devices 3a and 3b. This feature enables the nodes to duplicate and deliver the data elements used in their local data processing in an efficient way.


Another possible method may be, for example, to propagate data elements of one node 1c successively to other nodes 1d, 1a, and 1b, as well as propagating those of another node 1d successively to other nodes 1a, 1b, and 1c. In that method, however, the delay times of communication between nodes 1a and 1b or between nodes 1c and 1d are different from those between nodes 1b and 1c or between nodes 1d and 1a, because the former involves only a single intervening communication device whereas the latter involves two intervening communication devices. In contrast, the proposed method delivers data elements in two stages, first via two or more intervening communication devices, and second within the local domain of each communication device. This method makes it easier to parallelize the operation of communication.


While the above-described first embodiment forms a single layer of groups, it is possible to form two or more layers of nested groups. Where appropriate in the system operations, the two communication devices 3a and 3b in the first embodiment may be integrated into a single device, so that the nodes 1a and 1b in group #1 are connected with the nodes 1c and 1d in group #2 via that single communication device.


As will be described later in a third embodiment and other subsequent embodiments, the groups of nodes execute exhaustive joins and triangle joins in a parallel fashion. It is noted that the same concept of node grouping discussed above in the first embodiment may also be applied to other kinds of parallel data processing operations. For example, the proposed concept of node grouping may be combined with the parallel sorting scheme of Japanese Patent No. 2509929, the invention made by one of the applicants of the present patent application. Other possible applications may include, but are not limited to, the following processing operations: hash joins in the technical field of database, grouping of data with hash functions, deduplication of data records, mathematical operations (e.g., intersection and union) of two datasets, and merge joins using sorting techniques.


In general, the above-described concept of node grouping is applicable to computational problems that may be solved by using a so-called “divide-and-conquer algorithm.” This algorithm works by breaking down a problem into two or more sub-problems of the same type and combining the solutions to the sub-problems to give a solution to the original problem. A network of computational nodes solves such problems by exchanging data elements from node to node.


(b) Second Embodiment


FIG. 2 illustrates a distributed processing system according to a second embodiment. The illustrated distributed processing system of the second embodiment is formed from a plurality of nodes 2a to 2i. These nodes 2a to 2i may each be an information processing apparatus for data processing or, more specifically, a computer including a processor(s) (e.g., CPU) and storage devices (e.g., RAM and HDD). The nodes 2a to 2i may all be linked to a single communication device (e.g., layer-2 switch) or may be distributed in the domains of different communication devices.


The nodes 2a to 2i are sitting at different node locations designated by first-axis coordinates and second-axis coordinates in a coordinate space. The first axis and second axis may be X axis and Y axis, for example. In this coordinate space, the nodes are logically arranged in a lattice network. Out of these nine nodes 2a to 2i, three nodes 2a, 2e, and 2i are located on a diagonal line of the coordinate space, which runs from the top-left corner to the bottom-right corner in FIG. 2. Suppose now that one location #1 on the diagonal line is set as a base point. Then relative to this base-point location #1, two more locations #2 and #3 are defined as having first-axis coordinate values equal to that of location #1. Likewise, another two locations #4 and #5 are defined as having second-axis coordinate values equal to that of location #1. It is noted that each of those locations #2, #3, #4, and #5 may actually be a plurality K of locations for K nodes, where K is an integer greater than zero. Referring to the example of FIG. 2, the base point (or location #1) is set to the top-left node 2a. Then node 2b is at location #2, node 2c at location #3, node 2d at location #4, and node 2g at location #5, where K=1.


Each node 2a to 2i receives one or more data elements of a dataset. These data elements may previously be assigned before reception of a command that initiates data processing. Alternatively, the distributed processing system may be configured to assign data elements upon receipt of a command that initiates data processing. Preferably, data elements are distributed as evenly as possible over the plurality of nodes to be used in the requested data processing, without duplication of the same data in different nodes. The distributed data elements may be of a single category (i.e., belong to a single dataset).


The data elements are duplicated from node to node during a specific period between their initial assignment and the start of parallel processing, so that the nodes 2a to 2i collect data elements for their own use. More specifically, the distributed processing system of the second embodiment executes the following first to third transmissions for each different base-point node (or location #1) on a diagonal line of the coordinate space.


In the first transmission, the node at location #1 transmits its local data elements to other nodes at locations #2 and #4. When, for example, the base point is set to the top-left node 2a, the data element initially assigned to the base-point node 2a is copied to other nodes 2b and 2d. The first transmission further includes selective transmission of data elements of the node at location #1 to either the node at location #3 or the node at location #5. Referring to the example of FIG. 2, the data element of the base-point node 2a is copied to either the node 2c or the node 2g. As a result of the above, some of the data elements in the base-point node 2a are duplicated to the nodes 2b, 2d, and 2c, while the others are duplicated to the nodes 2b, 2d, and 2g. In the case where the base-point node 2a has a plurality of data elements to duplicate, it is preferable to equalize the two nodes 2c and 2g as much as possible in terms of the number of data elements that they may receive from the base-point node 2a. For example, the difference between these nodes 2c and 2g in the number of data elements is managed so as not to exceed one.


In the second transmission, the node at location #2 transmits its local data elements to other nodes at locations #1, #4, and #5. For example (assuming the same base-point node 2a), the data element initially assigned to the node 2b is copied to other nodes 2a, 2d, and 2g.


In the third transmission, the node at location #3 transmits its local data elements to other nodes at locations #1, #2, and #4. For example (assuming the same base-point node 2a), the data element initially assigned to the node 2c is copied to other nodes 2a, 2b, and 2d.


In the case where K is greater than one (i.e., there are K nodes at K locations #2), each of the K nodes at locations #2 transmits its data elements to the other (K−1) peer nodes in the second transmission. The same applies to the K nodes at locations #3 in the third transmission.


As a result of the above-described three transmissions, the base-point node at diagonal location #1 now has a collection of data elements initially assigned to nodes at locations #1, #2, and #3 sharing the same first-axis coordinate. The nodes at locations #3 and #5 have different fractions of those data elements in the base-point node at location #1, whereas the nodes at locations #2 and #4 have the same data elements as those in the node at location #1.
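

A minimal simulation may help make the above transmissions concrete. The sketch below assumes the 3×3 lattice of FIG. 2 with one data element per node and an alternating rule for splitting the base-point elements between locations #3 and #5; the node names and these rules are illustrative assumptions, not part of the embodiment.

```python
# A hedged simulation of the first to third transmissions on the 3x3 lattice of
# FIG. 2 (K = 1). Node names, one element per node, and the alternating rule for
# splitting between locations #3 and #5 are illustrative assumptions.
grid = [["2a", "2b", "2c"],
        ["2d", "2e", "2f"],
        ["2g", "2h", "2i"]]

initial = {name: {f"d_{name}"} for row in grid for name in row}
received = {name: set() for row in grid for name in row}

for d in range(3):                                            # each diagonal node as base point
    base = grid[d][d]                                         # location #1
    loc2, loc3 = [grid[d][j] for j in range(3) if j != d]     # same row:    locations #2 and #3
    loc4, loc5 = [grid[i][d] for i in range(3) if i != d]     # same column: locations #4 and #5

    for k, elem in enumerate(sorted(initial[base])):          # first transmission
        for target in (loc2, loc4, loc3 if k % 2 == 0 else loc5):
            received[target].add(elem)
    for elem in initial[loc2]:                                # second transmission
        for target in (base, loc4, loc5):
            received[target].add(elem)
    for elem in initial[loc3]:                                # third transmission
        for target in (base, loc2, loc4):
            received[target].add(elem)

for name in sorted(initial):
    print(name, sorted(initial[name] | received[name]))
```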


Each node then executes data processing by using their own collections of data elements, which include those initially assigned thereto and those received as a result of the first to third transmissions described above. For example, diagonal nodes may execute data processing with each combination of data elements collected by the diagonal nodes as base-point nodes. Non-diagonal nodes, on the other hand, may execute data processing by combining two sets of data elements collected by setting two different base-point nodes on the diagonal line.


According to the second embodiment, the proposed distributed processing system propagates data elements from node to node in an efficient way, after assigning data elements to a plurality of nodes 2a to 2i. Particularly, the proposed system enables effective parallelization of data processing operations that are exerted on every combination of two data elements in a dataset. The second embodiment duplicates data elements to the nodes 2a to 2i without excess or shortage, besides distributing the load of data processing as evenly as possible.


(c) Third Embodiment


FIG. 3 illustrates an information processing system according to a third embodiment. This information processing system is formed from a plurality of nodes 11 to 16, a client 31, and a network 41.


The nodes 11 to 16 are computers connected to a network 41. More particularly, the nodes 11 to 16 may be PCs, workstations, or blade servers, capable of processing data in a parallel fashion. While not explicitly depicted in FIG. 3, the network 41 includes one or more communication devices (e.g., layer-2 switches) to transfer data elements and command messages. The client 31 is a computer that serves as a terminal console for the user. For example, the client 31 may send a command to one of the nodes 11 to 16 via the network 41 to initiate a specific data processing operation.



FIG. 4 is a block diagram illustrating an exemplary hardware configuration of nodes. The illustrated node 11 includes a CPU 101, a RAM 102, an HDD 103, a video signal processing unit 104, an input signal processing unit 105, a disk drive 106, and a communication unit 107. While FIG. 4 illustrates one node 11 alone, this hardware configuration may similarly apply to the other nodes 12 to 16 as well as to the client 31. It is noted that the video signal processing unit 104 and input signal processing unit 105 may be optional (i.e., add-on devices to be mounted when a need arises) as in the case of blade servers. That is, the nodes 11 to 16 may be configured to work without those processing units.


The CPU 101 is a processor that controls the node 11. The CPU 101 reads at least part of program files and data files stored in the HDD 103 and executes programs after loading them on the RAM 102. The node 11 may include a plurality of such processors.


The RAM 102 serves as volatile temporary memory for at least part of the programs that the CPU 101 executes, as well as for various data that the CPU 101 needs when executing the programs. The node 11 may include other types of memory devices than RAM.


The HDD 103 serves as a non-volatile storage device to store program files of the operating system (OS) and applications, as well as data files used together with those programs. The HDD 103 writes and reads data on its internal magnetic platters in accordance with commands from the CPU 101. The node 11 may include a plurality of non-volatile storage devices such as solid state drives (SSD) in place of, or together with the HDD 103.


The video signal processing unit 104 produces video images in accordance with commands from the CPU 101 and outputs them on a screen of a display 42 coupled to the node 11. The display 42 may be, for example, a cathode ray tube (CRT) display or a liquid crystal display.


The input signal processing unit 105 receives input signals from input devices 43 and supplies them to the CPU 101. The input devices 43 may be, for example, a keyboard and a pointing device such as a mouse or touchscreen.


The disk drive 106 is a device used to read programs and data stored in a storage medium 44. The storage medium 44 may include, for example, magnetic disk media such as flexible disk (FD) and HDD, optical disc media such as compact disc (CD) and digital versatile disc (DVD), and magneto-optical storage media such as magneto-optical disk (MO). The disk drive 106 transfers programs and data read out of the storage medium 44 to, for example, the RAM 102 or HDD 103 according to commands from the CPU 101.


The communication unit 107 is a network interface for the CPU 101 to communicate with other nodes 12 to 16 and client 31 (see FIG. 3) via a network 41. The communication unit 107 may be a wired network interface or a radio network interface.


The following section will now describe exhaustive joins executed by the information processing system according to the third embodiment. Exhaustive joins may sometimes be treated as a kind of simple joins.


Specifically, an exhaustive join acts on two given datasets A and B as seen in equation (1) below. One dataset A is a collection of m data elements a1, a2, . . . , am, where m is a positive integer. The other dataset B is a collection of n data elements b1, b2, . . . , bn, where n is a positive integer. Preferably, each data element includes a unique identifier. That is, data elements are each formed from an identifier and a data value(s). As seen in equation (2) below, the exhaustive join yields a dataset by applying a map function to every ordered pair of a data element “a” in dataset A and a data element “b” in dataset B. The map function may return no output data elements or may output two or more resulting data elements, depending on the values of the arguments a and b.






A = \{a_1, a_2, \ldots, a_m\}

B = \{b_1, b_2, \ldots, b_n\}    (1)


x\text{-}\mathrm{join}(A, B, \mathrm{map}) = \{\mathrm{map}(a, b) \mid (a, b) \in A \times B\}    (2)



FIG. 5 illustrates an exhaustive join. As can be seen from FIG. 5, an exhaustive join may be interpreted as an operation that applies a map function to the direct product between two datasets A and B. For example, the operation selects one data element a1 from dataset A and another data element b1 from dataset B and uses these data elements a1 and b1 as arguments of the map function. As mentioned above, function map(a1, b1) may not always return output values. That is, it is possible that the map function returns nothing when the data elements a1 and b1 do not satisfy a specific condition. Such operation of the map function is exerted on all of the (m×n) ordered pairs.


Exhaustive joins may be implemented as a software program using an algorithm known as “nested loop.” For example, an outer loop is configured to select one data element a from dataset A, and an inner loop is configured to select another data element b from dataset B. The inner loop repeats its operation by successively selecting n data elements b1, b2, . . . , bn in combination with a given data element ai of dataset A. FIG. 5 depicts a plurality of such map operations. Since these operations are independent of each other, it is possible to parallelize the execution by allocating a plurality of nodes to them.
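

As a rough illustration, such a nested-loop exhaustive join might be sketched as follows. The names x_join, dataset_a, dataset_b, and map_fn are assumptions chosen for readability; they do not appear in the embodiments.

```python
# A hedged sketch of an exhaustive join implemented as a nested loop.
def x_join(dataset_a, dataset_b, map_fn):
    """Apply map_fn to every ordered pair (a, b) in A x B and collect the results."""
    results = []
    for a in dataset_a:                    # outer loop over dataset A
        for b in dataset_b:                # inner loop over dataset B
            results.extend(map_fn(a, b))   # map_fn returns a (possibly empty) list
    return results
```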



FIG. 6 illustrates an exemplary execution result of an exhaustive join. In this example of FIG. 6, dataset A includes four data elements a1 to a4, and dataset B includes four data elements b1 to b4. Each of these data elements of datasets A and B represents the name and age of a person.


In operation, the exhaustive join applies a map function to each of the sixteen ordered pairs organized in a 4×4 matrix. The map function in this example is, however, configured to return a result value on the conditions that (i) the age field of data element a has a greater value than that of data element b, and (ii) their difference in age is five or smaller. Because of these conditions, the map function returns a resulting data element for four ordered pairs (a1, b1), (a2, b2), (a2, b3), and (a3, b4) as seen in FIG. 6, but no outputs for the remaining ordered pairs.
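

A map function of the kind used in FIG. 6 might look like the sketch below. The name-and-age record layout follows the description above, while the concrete dictionary field names are assumptions.

```python
# A hedged sketch of the conditional map function described for FIG. 6:
# a result is produced only when a is older than b by at most five years.
# The field names "name" and "age" are assumptions for illustration.
def age_gap_map(a, b):
    if a["age"] > b["age"] and a["age"] - b["age"] <= 5:
        return [(a["name"], b["name"])]
    return []                     # no output for pairs failing the conditions

# Combined with the x_join sketch above, something like
#   results = x_join(dataset_a, dataset_b, age_gap_map)
# would yield one result per matching ordered pair, as in FIG. 6.
```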


Datasets may be provided in the form of, for example, tables as in a relational database, a set of (key, value) pairs in a key-value store, files, and matrixes. Data elements may be, for example, tuples of a table, pairs in a key-value store, records in a file, vectors in a matrix, and scalars.


An example of the above-described exhaustive join will now be described below. This example will handle a matrix as a set of vectors. Specifically, equation (3) represents a product of two matrixes A and B, where matrix A is treated as a set of row vectors, and matrix B as a set of column vectors. Equation (4) then indicates that matrix product AB is obtained by calculating an inner product for every possible combination of a row vector of matrix A and a column vector of matrix B. This means that matrix product AB is calculated as an exhaustive join of two sets of vectors.










A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
  = \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix}, \qquad
B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
  = \begin{pmatrix} \mathbf{b}_1 & \mathbf{b}_2 \end{pmatrix}    (3)


AB = \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix}
     \begin{pmatrix} \mathbf{b}_1 & \mathbf{b}_2 \end{pmatrix}
   = \begin{pmatrix} \mathbf{a}_1 \cdot \mathbf{b}_1 & \mathbf{a}_1 \cdot \mathbf{b}_2 \\ \mathbf{a}_2 \cdot \mathbf{b}_1 & \mathbf{a}_2 \cdot \mathbf{b}_2 \end{pmatrix}    (4)
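

Under these definitions, a matrix product can indeed be computed with the same nested-loop join, as in the hedged sketch below. The helper names and the 2×2 test values are assumptions for illustration; the x_join here is a compact form of the nested-loop join sketched earlier.

```python
# A hedged sketch of equation (4): the matrix product AB computed as an
# exhaustive join of the row vectors of A with the column vectors of B.
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

def x_join(set_a, set_b, map_fn):          # compact form of the nested-loop join
    return [r for a in set_a for b in set_b for r in map_fn(a, b)]

rows_of_a = list(enumerate(A))                                        # row vectors a_i
cols_of_b = [(j, [row[j] for row in B]) for j in range(len(B[0]))]    # column vectors b_j

def inner_product_map(row, col):
    i, a_vec = row
    j, b_vec = col
    return [((i, j), sum(x * y for x, y in zip(a_vec, b_vec)))]       # one entry of AB

print(sorted(x_join(rows_of_a, cols_of_b, inner_product_map)))
# [((0, 0), 19), ((0, 1), 22), ((1, 0), 43), ((1, 1), 50)]
```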








FIG. 7 illustrates a node coordination model according to the third embodiment. The exhaustive join of the third embodiment handles a plurality of participating nodes as if they are logically arranged in the form of a rectangular array. That is, the nodes are organized in an array with a height of h nodes and a width of w nodes. In other words, h represents the number of rows (or row dimension), and w represents the number of columns (or column dimension). The node sitting at the i-th row and j-th column is represented as nij in this model. As will be described below, the information processing system determines the row dimension h and the column dimension w when it starts data processing. The row dimension h may also be referred to as the number of vertical divisions. Similarly, the column dimension w may be referred to as the number of horizontal divisions.


At the start of parallel data processing, the data elements of datasets A and B are divided and assigned to a plurality of nodes logically arranged in a rectangular array. Each node receives and stores data elements in a data storage device, which may be a semiconductor memory (e.g., RAM 102) or a disk drive (e.g., HDD 103). This initial assignment of datasets A and B is executed in such a way that the nodes will receive amounts of data that are as equal as possible. This policy is referred to as evenness. The assignment of datasets A and B is also executed in such a way that a single data element will never be assigned to two or more nodes. This policy is referred to as uniqueness.


Assuming that both the evenness and uniqueness policies are perfectly applied, each node nij obtains subsets Aij and Bij of datasets A and B as seen in equation (5). The number of data elements included in a subset Aij is calculated by dividing the total number of data elements of dataset A by the total number N of nodes (N=h×w). Similarly, the number of data elements included in a subset Bij is calculated by dividing the total number of data elements of dataset B by the total number N of nodes.












A = \bigcup_{i=1}^{h} \bigcup_{j=1}^{w} A_{ij}, \qquad |A_{ij}| = \frac{|A|}{N}

B = \bigcup_{i=1}^{h} \bigcup_{j=1}^{w} B_{ij}, \qquad |B_{ij}| = \frac{|B|}{N}    (5)







Row subset Ai of dataset A is now defined as the union of subsets Ai1, Ai2, . . . , Aiw assigned to nodes ni1, ni2, . . . , niw having the same row number. Likewise, column subset Bj is defined as the union of subsets B1j, B2j, . . . , Bhj assigned to nodes n1j, n2j, . . . , nhj having the same column number. As seen from equation (6), dataset A is a union of h row subsets Ai, and dataset B is a union of w column subsets Bj.










A = \bigcup_{i=1}^{h} A_i, \qquad B = \bigcup_{j=1}^{w} B_j    (6)







The exhaustive join of two datasets A and B may now be rewritten by using their row subsets Ai and column subsets Bj. That is, this exhaustive join is divided into h×w exhaustive joins as seen in equation (7) below. Here each node nij may be configured to execute an exhaustive join of one row subset Ai and one column subset Bj. The original exhaustive join of two datasets A and B is then calculated by running the computation in those h×w nodes in parallel. Initially the data elements of datasets A and B are distributed to the nodes under the evenness and uniqueness policies mentioned above. The node nij in this condition then receives subsets of dataset A from other nodes with the same row number i, as well as subsets of dataset B from other nodes with the same column number j.













x\text{-}\mathrm{join}(A, B, \mathrm{map})
  = \bigcup_{i=1}^{h} \bigcup_{j=1}^{w} \{\mathrm{map}(a, b) \mid (a, b) \in A_i \times B_j\}
  = \bigcup_{i=1}^{h} \bigcup_{j=1}^{w} x\text{-}\mathrm{join}(A_i, B_j, \mathrm{map})    (7)







As described above, data elements have to be duplicated from node to node before each node begins an exhaustive join locally with its own set of data elements. The information processing system therefore determines the optimal row dimension h and optimal column dimension w to minimize the amount of data transmitted among N nodes deployed for distributed execution of exhaustive joins.


The amount c of data transmitted or received by each node is calculated according to equation (8), assuming that subsets of dataset A are relayed in the row direction whereas subsets of dataset B are relayed in the column direction. For simplification of the mathematical model and algorithm, it is also assumed that each node receives data elements not only from other nodes, but also from itself. The amount c of transmit data (=the amount of receive data) is added up for N nodes, thus obtaining the total amount C of transmit data as seen in equation (9). More specifically, the total amount C of transmit data is a function of the row dimension h, when the number N of nodes and the number of data elements of each dataset A and B are given.









c = w\,|A_{ij}| + h\,|B_{ij}| = w\,\frac{|A|}{N} + h\,\frac{|B|}{N}    (8)


C = Nc = w\,|A| + h\,|B| = \frac{N\,|A|}{h} + h\,|B|    (9)








FIG. 8 gives a graph illustrating how the amount of transmit data varies with the row dimension h of nodes. The graph of FIG. 8 plots the values calculated on the assumptions that 10000 nodes are configured to process datasets A and B each containing 10000 data elements. The illustrated curve of total amount C hits its minimum point when h=100. The row dimension h at this minimum point of C is calculated by differentiating equation (9). The solution is seen in equation (10). It is noted that the row dimension practically has to be a divisor of N. For this reason, the value of h is determined as follows: (a) h=1 when equation (10) yields a value of one or zero, (b) h=N when equation (10) yields a value of N or more, and (c) h is otherwise set to a divisor of N that is closest to the value of equation (10) below. In the last case (c), there may be two closest divisors (i.e., one is greater than and the other is smaller than the calculated value). When this is the case, h is set to the one that minimizes the total amount C of transmit data.


The total number N of nodes is previously determined on the basis of, for example, the number of available nodes, the amount of data to be processed, and the response time that the system is supposed to achieve. Preferably, the total number N of nodes has many divisors since the above parameter h is selected from among those divisors of N. For example, N may be a power of 2. It is not preferable to select prime numbers or other numbers having few divisors. If the predetermined value of N does not satisfy this condition, N may be changed to a smaller number having many divisors. For example, the new value of N may be the largest power of 2 below the original N.









h = \sqrt{\frac{|A|\,N}{|B|}}    (10)
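

The selection of h described above might be coded as in the following sketch. The function name and variables are assumptions; the tie-breaking order follows the rule stated in the text (closest divisor first, then the smaller total amount C).

```python
# A hedged sketch of choosing the row dimension h from equation (10) and the
# divisor rule described above.
import math

def choose_row_dimension(n_nodes, size_a, size_b):
    h_opt = math.sqrt(size_a * n_nodes / size_b)            # equation (10)
    if h_opt <= 1:
        return 1
    if h_opt >= n_nodes:
        return n_nodes
    divisors = [d for d in range(1, n_nodes + 1) if n_nodes % d == 0]
    def total_c(h):                                         # equation (9)
        return n_nodes * size_a / h + h * size_b
    # divisor closest to the value of equation (10); ties broken by the smaller C
    return min(divisors, key=lambda d: (abs(d - h_opt), total_c(d)))

print(choose_row_dimension(10000, 10000, 10000))   # 100, the minimum point in FIG. 8
print(choose_row_dimension(6, 6, 12))              # 2, as in the example of FIG. 12
```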







The following section will now describe how to relay data elements from node to node. While the description assumes that data elements are passed along in the row direction, the person skilled in the art would appreciate that the same method applies also to the column direction.



FIGS. 9A, 9B, and 9C illustrate three methods for the nodes to relay their data. Referring first to method A illustrated in FIG. 9A, the leftmost node n11 sends a subset A11 to the second node n12 in order to propagate its assigned data to other nodes n12, . . . , n1w. The second node n12 then duplicates the received subset A11 to the third node n13. Similarly the subset A11 is transferred rightward until it reaches the rightmost node n1w. Other subsets initially assigned to the intermediate nodes may also be relayed rightward in the same way. According to this method A, the originating node does not need to establish connections with every receiving node, but has only to set up a connection to the next node, because data elements are relayed by a repetition of such a single connection between two adjacent nodes. This nature of method A contributes to a reduced load of communication. The method A is, however, unable to duplicate data elements in the leftward direction.


Referring next to method B illustrated in FIG. 9B, the rightmost node n1w establishes a connection back to the leftmost node n11, thus forming a circular path of data. Each node nij transmits its initially assigned subset Aij to the right node, thereby propagating a copy of subset Aij to other nodes on the same row. Each data element bears an identifier (e.g., address or coordinates) of the source node so as to prevent the data elements from circulating endlessly on the path.
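

A round-based simulation of method B for a single row might look like the following sketch. The round structure and subset names are assumptions made for illustration; stopping after w−1 hops plays the role of the source-node identifier check described above.

```python
# A hedged sketch of the circular relaying of method B for one row of w nodes.
w = 4
own = [{f"A1{j + 1}"} for j in range(w)]          # subset A_1j initially at node n_1j
collected = [set(s) for s in own]                 # data elements held by each node
in_flight = [set(s) for s in own]                 # elements each node forwards next

for _ in range(w - 1):
    forwarded = [set() for _ in range(w)]
    for j in range(w):
        right = (j + 1) % w                       # circular path back to the leftmost node
        collected[right] |= in_flight[j]
        forwarded[right] = in_flight[j]           # relayed again in the next round
    in_flight = forwarded

print(collected)    # every node now holds the whole row subset {A11, ..., A14}
```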


Referring lastly to method C illustrated in FIG. 9C, the nodes establish a rightward connection from the leftmost node n11 to the rightmost node n1w and a leftward connection from the rightmost node n1w to the leftmost node n11. For example, the leftmost node n11 sends its subset A11 to the right node. The rightmost node n1w, on the other hand, sends its subset A1w to the left node. Intervening nodes nij send their respective subsets Aij to both their right and left nodes. This method C may be modified to form a circular path as in method B.


The third embodiment preferably uses method B or method C to relay data for the purpose of exhaustive joins. As an alternative method, data elements may be duplicated by broadcasting them in a broadcast domain of the network. This method is applicable when the sending node and receiving nodes belong to the same broadcast domain. The foregoing equation (10) may similarly be used in this case to calculate the optimal row dimension h, taking into account the total amount of receive data.


When data elements are sent from a first node to a second node, the second node sends a response message such as acknowledgment (ACK) or negative acknowledgment (NACK) back to the first node. In the case where the second node has some data elements to send later to the first node, the second node may send the response message not immediately, but together with those data elements.



FIG. 10 is a block diagram illustrating an exemplary software structure according to the third embodiment. This block diagram includes a client 31 and two nodes 11 and 12. The former node 11 includes a receiving unit 111, a system control unit 112, a node control unit 114, one or more execution units 115, and a data storage unit 116. This block structure of the node 11 may also be used to implement other nodes 12 to 16. For example, the illustrated node 12 includes a receiving unit 121, a node control unit 124, one or more execution units 125, a data storage unit 126, and a system control unit (omitted in FIG. 10). FIG. 10 also depicts the client 31 with its requesting unit 311. The data storage units 116 and 126 may be implemented as reserved storage areas of RAM or HDD, while the other blocks may be implemented as program modules.


The client 31 includes a requesting unit 311 that sends a command in response to a user input to start data processing. This command is addressed to the node 11 in FIG. 10, but may alternatively be addressed to any of the other nodes 12 to 16.


The receiving unit 111 in the node 11 receives commands from a client 31 or other nodes. The computer process implementing this receiving unit 111 is always running on the node 11. When a command is received from the client 31, the receiving unit 111 calls up its local system control unit 112. Further, when a command is received from the system control unit 112, the receiving unit 111 calls up its local node control unit 114. The receiving unit 111 in the node 11 may also receive a command from a peer node when its system control unit is activated. In response, the receiving unit 111 calls up the node control unit 114 to handle that command. The node knows the addresses (e.g., Internet Protocol (IP) addresses) of receiving units in other nodes.


The system control unit 112 controls overall transactions during execution of exhaustive joins. The computer process implementing this system control unit 112 is activated upon call from the receiving unit 111. Each time a specific data processing operation (or transaction) is requested from the client 31, one of the plurality of nodes activates its system control unit. The system control unit 112, when activated, transmits a command to the receiving unit (e.g., receiving units 111 and 121) of nodes to invite them to participate in the execution of the requested exhaustive join. This command calls up the node control units 114 and 124 in the nodes 11 and 12.


The system control unit 112 also identifies logical connections between the nodes and sends a relaying command to the node control unit (e.g., node control unit 114) of a node that is supposed to be the source point of data elements. This relaying command contains information indicating which nodes are to relay data elements. When the duplication of data elements is finished, the system control unit 112 issues a joining command to execute an exhaustive join to the node control units of all the participating nodes (e.g., node control units 114 and 124 in FIG. 10). When the exhaustive join is finished, the system control unit 112 so notifies the client 31.


The node control unit 114 controls information processing tasks that the node 11 undertakes as part of an exhaustive join. The computer process implementing this node control unit 114 is activated upon call from the receiving unit 111. The node control unit 114 calls up the execution unit 115 when a relaying command or joining command is received from the system control unit 112. Relaying commands and joining commands may come also from a peer node (or more specifically, from its system control unit that is activated). The node control unit 114 similarly calls up its local execution unit 115 in response. The node control unit 114 may also call up its local execution unit 115 when a reception command is received from a remote execution unit of a peer node.


The execution unit 115 performs information processing operations requested from the node control unit 114. The computer process implementing this execution unit 115 is activated upon call from the node control unit 114. The node 11 is capable of invoking a plurality of processes of the execution unit 115. In other words, it is possible to execute multiple processing operations at the same time, such as relaying dataset A in parallel with dataset B. This feature of the node 11 works well in the case where the node 11 has a plurality of processors or a multiple-core processor.


When called up in connection with a relaying command, the execution unit 115 transmits a reception command to the node control unit of an adjacent node (e.g., node control unit 124 in FIG. 10). In response, the adjacent node makes its local execution unit (e.g., execution unit 125) ready for receiving data elements. The execution unit 115 then reads out assigned data elements of the node 11 from the data storage unit 116 and transmits them to its counterpart in the adjacent node.


When called up in connection with a reception command, the execution unit 115 receives data elements from its counterpart in a peer node and stores them in its local data storage unit 116. The execution unit 115 also forwards these data elements to the next node unless the node 11 is their final destination, as in the case of relaying commands. When called up in connection with a joining command, the execution unit 115 locally executes an exhaustive join with its own data elements in the data storage unit 116 and writes the result back into the data storage unit 116.


The data storage unit 116 stores some of the data elements constituting datasets A and B. The data storage unit 116 initially stores data elements belonging to subsets A11 and B11 that are assigned to the node 11 in the first place. Then subsequent relaying of data elements causes the data storage unit 116 to receive additional data elements belonging to a row subset A1 and a column subset B1. Similarly, the data storage unit 126 in the node 12 stores some data elements of datasets A and B.


Generally, when a first module sends a command to a second module, the second module performs requested information processing and then notifies the first module of completion of that command. For example, the execution unit 115 notifies the node control unit 114 upon completion of its local exhaustive join. The node control unit 114 then notifies the system control unit 112 of the completion. When such completion notice is received from every node participating in the exhaustive join, the system control unit 112 notifies the client 31 of completion of its request.


The above-described system control unit 112, node control unit 114, and execution unit 115 may be implemented as, for example, a three-tier internal structure made up of a command parser, optimizer, and code executor. Specifically, the command parser interprets the character string of a received command and produces an analysis tree representing the result. Based on this analysis tree, the optimizer generates (or selects) optimal intermediate code for execution of the requested information processing operation. The code executor then executes the generated intermediate code.



FIG. 11 is a flowchart illustrating an exemplary procedure of joins according to the third embodiment. As previously mentioned, the number N of participating nodes may be determined by the system control unit 112 before the process starts with step S11, based on the total number of nodes in the system. Each step of the flowchart will be described below.


(S11) The client 31 has specified datasets A and B as input data for an exhaustive join. The system control unit 112 divides those two datasets A and B into N subsets Aij and N subsets Bij, respectively, and assigns them to a plurality of nodes 11 to 16. As an alternative, the nodes 11 to 16 may assign datasets A and B to themselves according to a request from the client 31 before the node 11 receives a start command for data processing. As another alternative, the datasets A and B may be given as an output of previous data processing at the nodes 11 to 16. In this case, the system control unit 112 may find that the assignment of datasets A and B has already been finished.


(S12) The system control unit 112 determines the row dimension h and column dimension w by using a calculation method such as equation (10) discussed above, based on the number N of participating nodes (i.e., nodes used to execute the exhaustive join) and the number of data elements of each given dataset A and B.


(S13) The system control unit 112 commands the nodes 11 to 16 to duplicate their respective subsets Aij in the row direction, as well as duplicate subsets Bij in the column direction. The execution unit in each node relays subsets Aij and Bij in their respective directions. The above relaying may be achieved by using, for example, the method B or C discussed in FIGS. 9B and 9C. The above duplication of subsets permits each node nij to obtain row subset Ai and column subset Bj.


(S14) The system control unit 112 commands all the participating nodes 11 to 16 to locally execute an exhaustive join. In response, the execution unit in each node executes an exhaustive join locally (i.e., without communicating with other nodes) with the row subset Ai and column subset Bj obtained at step S13. The execution unit stores the result in a relevant data storage unit. Such local exhaustive joins may be implemented in the form of a nested loop, for example.


(S15) The system control unit 112 sees that every participating node 11 to 16 has finished step S14, thus notifying the requesting client 31 of completion of the requested exhaustive join. The system control unit 112 may also collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 112 may allow the result data to stay in the nodes, so that the subsequent data processing can use it as input data. It may be possible in the latter case to skip the step of assigning initial data elements to participating nodes.



FIG. 12 is a first diagram illustrating an exemplary data arrangement according to the third embodiment. This example assumes that six nodes n11, n12, n13, n21, n22, and n23 (11 to 16) are configured to execute an exhaustive join of datasets A and B. Dataset A is formed from six data elements a1 to a6, while dataset B is formed from twelve data elements b1 to b12. Each node nij is equally assigned one data element from dataset A and two data elements from dataset B. In other words, the former is subset Aij, and the latter is subset Bij. For example, node n11 receives the following two subsets: A11={a1} and B11={b1, b2}. Here the number N of participating nodes is six. As dataset A includes six data elements, and dataset B includes twelve data elements, the foregoing equation (10) gives h=2 for the row dimension (and hence w=3 for the column dimension).
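

Equation (10) itself appears earlier in this description and is not repeated here. As a rough illustration only, the sketch below picks h as a divisor of N close to the square root of N·|A|/|B|, which balances the sizes of the row and column subsets each node must collect; this heuristic reproduces the dimensions of the worked example above (h=2, w=3) but should not be read as the literal equation (10).

import math

def choose_dimensions(num_nodes, size_a, size_b):
    # Pick (h, w) with h * w == num_nodes.  Illustrative assumption: h is the
    # divisor of num_nodes closest to sqrt(num_nodes * size_a / size_b).
    target = math.sqrt(num_nodes * size_a / size_b)
    divisors = [d for d in range(1, num_nodes + 1) if num_nodes % d == 0]
    h = min(divisors, key=lambda d: abs(d - target))
    return h, num_nodes // h

print(choose_dimensions(6, 6, 12))   # (2, 3), matching the arrangement of FIG. 12
print(choose_dimensions(4, 12, 16))  # (2, 2)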


Upon determination of row dimension h and column dimension w, each subset Aij is duplicated by the nodes having the same row number (i.e., in the row direction), and each subset Bij is duplicated by the nodes having the same column number (i.e., in the column direction). For example, subset A11 assigned to node n11 is copied from node n11 to node n12, and then from node n12 to node n13. Subset B11 assigned to node n11 is copied from node n11 to node n21.



FIG. 13 is a second diagram illustrating an exemplary data arrangement according to the third embodiment. As a result of the above duplication of data elements, each node nij now contains a row subset Ai and a column subset Bj in their entirety. For example, nodes n11, n12, and n13 have obtained a row subset A1={a1, a2, a3}, and nodes n11 and n21 have obtained a column subset B1={b1, b2, b3, b4}.



FIG. 14 is a third diagram illustrating an exemplary data arrangement according to the third embodiment. Each node nij locally executes an exhaustive join with the above row subset Ai and column subset Bj. For example, node n11 selects one data element a from row subset A1={a1, a2, a3} and one element b from column subset B1={b1, b2, b3, b4} and subjects these two data elements to the map function. By repeating this operation, node n11 applies the map function to all the twelve ordered pairs (i.e., 3×4 combinations) of data elements. As seen in FIG. 14, six nodes n11, n12, n13, n21, n22, and n23 equally process twelve different ordered pairs. These six nodes as a whole cover all the 72 (=6×12) ordered pairs produced from datasets A and B, without redundant duplication.
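

The local exhaustive join performed by each node (the nested loop mentioned at step S14) can be sketched as follows. The map function here simply returns the ordered pair so that the pair count can be checked; in general it may return zero, one, or several result elements.

def local_exhaustive_join(row_subset, column_subset, map_func):
    # Apply map_func to every ordered pair (a, b), a taken from the row subset
    # Ai and b taken from the column subset Bj, and collect the results.
    results = []
    for a in row_subset:
        for b in column_subset:
            results.extend(map_func(a, b))
    return results

# Node n11 of FIG. 14: A1 = {a1, a2, a3} and B1 = {b1, b2, b3, b4}.
A1 = ["a1", "a2", "a3"]
B1 = ["b1", "b2", "b3", "b4"]
pairs = local_exhaustive_join(A1, B1, lambda a, b: [(a, b)])
print(len(pairs))  # 12 ordered pairs, as described above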


The proposed information processing system of the third embodiment executes an exhaustive join of datasets A and B efficiently by using a plurality of nodes. Particularly, the system starts execution of an exhaustive join with the initial subsets of datasets A and B that are assigned evenly (or near evenly) to a plurality of participating nodes without redundant duplication. The nodes are equally (or near equally) loaded with data processing operations with no needless duplication. For these reasons, the third embodiment enables scalable execution of exhaustive joins (where overhead of communication is neglected). That is, the processing time of an exhaustive join decreases to 1/N when the number of nodes is multiplied N-fold.


(d) Fourth Embodiment

This section describes a fourth embodiment with the focus on its differences from the third embodiment. For their common elements and features, see the previous description of the third embodiment. To execute exhaustive joins, the fourth embodiment uses a large-scale information processing system formed from a plurality of communication devices interconnected in a hierarchical way.



FIG. 15 illustrates an information processing system according to the fourth embodiment. The illustrated information processing system includes virtual nodes 20, 20a, 20b, 20c, 20d, and 20e, a client 31, and a network 41.


Each virtual node 20, 20a, 20b, 20c, 20d, and 20e includes at least one switch (e.g., layer-2 switch) and a plurality of nodes linked to that switch. For example, one virtual node 20 includes four nodes 21 to 24 and a switch 25. Another virtual node 20a includes four nodes 21a to 24a and a switch 25a. Each such virtual node may be handled logically as a single node when the system executes exhaustive joins.


The above six virtual nodes are equal in the number of nodes that they include. The number of constituent nodes has been determined in consideration of their connection with a communication device and the like. However, this number of constituent nodes may not necessarily be the same as the number of nodes that participate in a particular data processing operation. As discussed in the third embodiment, the latter number may be determined in such a way that the number of nodes will have as many divisors as possible. The constituent nodes of a virtual node are associated one-to-one with those of another virtual node. Such one-to-one associations are found between, for example, nodes 21 and 21a, nodes 22 and 22a, nodes 23 and 23a, and nodes 24 and 24a. While FIG. 15 illustrates an example of virtualization into a single layer, it is also possible to build a multiple-layer structure of virtual nodes such that one virtual node includes other virtual nodes.



FIG. 16 illustrates a node coordination model according to the fourth embodiment. To execute an exhaustive join, the fourth embodiment handles a plurality of virtual nodes as if they are logically arranged in the form of a rectangular array. That is, the virtual nodes are organized in an array with a height of H virtual nodes and a width of W virtual nodes. In other words, H represents the number of rows (or row dimension), and W represents the number of columns (or column dimension). The row dimension H and column dimension W are determined from the number of virtual nodes and the number of data elements constituting each dataset A and B, similarly to the way described in the previous embodiments. Further, in each virtual node, its constituent nodes are logically organized in an array with a height of h nodes and a width of w nodes. The row dimension h and column dimension w are determined as common parameters that are applied to all virtual nodes. Specifically, the dimensions h and w are determined from the number of nodes per virtual node and the number of data elements constituting each dataset A and B.


The virtual node at the i-th row and j-th column is represented as ijn in the illustrated model, where the superscript indicates the coordinates of the virtual node. Within a virtual node, the node at the i-th row and j-th column is represented as nij, where the subscript indicates the coordinates of the node; a fully qualified name prefixes the virtual node coordinates, as in 11n11. At the start of an exhaustive join, the system assigns datasets A and B to all nodes n11, . . . , nhw included in all participating virtual nodes 11n, . . . , HWn. That is, the data elements are distributed evenly (or near evenly) across the nodes without redundant duplication.


The data elements initially assigned above are then duplicated from virtual node to virtual node via two or more different intervening switches. Subsequently the data elements are duplicated within each closed domain of the virtual nodes. There is a recursive relationship between the data duplication among virtual nodes and the data duplication within a virtual node. Specifically, subsets of dataset A are duplicated across virtual nodes with the same row number, while subsets of dataset B are duplicated across virtual nodes with the same column number. Then within each virtual node, subsets of dataset A are duplicated across nodes with the same row number, while subsets of dataset B are duplicated across nodes with the same column number. Communication between two virtual nodes is implemented as communication between “associated nodes” (or nodes at corresponding relative positions) in the two. For example, when duplicating data elements from virtual node 11n to virtual node 12n, this duplication actually takes place from node 11n11 to node 12n11, from node 11n12 to node 12n12, and so on. There are no interactions between non-associated nodes.
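

Because inter-virtual-node communication always happens between associated nodes, the peer of a node can be derived from its position alone. The helpers below are a hypothetical illustration of that rule; node addresses are modeled as (virtual row, virtual column, local row, local column) tuples, and wrap-around at the edges of the virtual-node array is ignored for brevity.

def row_duplication_peer(address):
    # Peer receiving a subset of dataset A duplicated in the row direction:
    # only the virtual column advances; the local position (i, j) is preserved.
    I, J, i, j = address
    return (I, J + 1, i, j)

def column_duplication_peer(address):
    # Peer receiving a subset of dataset B duplicated in the column direction.
    I, J, i, j = address
    return (I + 1, J, i, j)

# Duplicating from virtual node 11n to 12n: node 11n11 sends to 12n11,
# node 11n12 sends to 12n12, and so on; non-associated nodes never interact.
print(row_duplication_peer((1, 1, 1, 1)))     # (1, 2, 1, 1)
print(column_duplication_peer((1, 1, 1, 2)))  # (2, 1, 1, 2)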



FIG. 17 is a block diagram illustrating an exemplary software structure according to the fourth embodiment. The illustrated node 21 includes a receiving unit 211, a system control unit 212, a virtual node control unit 213, a node control unit 214, an execution unit(s) 215, and a data storage unit 216. This block structure of the node 21 may also be used to implement other nodes, including nodes 22 to 24 and nodes 21a to 24a in FIG. 15. For example, another node 22 illustrated in FIG. 17 includes a receiving unit 221, a node control unit 224, an execution unit(s) 225, and a data storage unit 226. While not explicitly depicted, this node 22 further includes its own system control unit and virtual node control unit. Yet another node 21a illustrated in FIG. 17 includes a receiving unit 211a and a virtual node control unit 213a. While not explicitly depicted, this node 21a further includes its own system control unit, node control unit, execution unit(s), and data storage unit. Still another node 22a illustrated in FIG. 17 includes a receiving unit 221a. While not explicitly depicted, this node 22a further includes its own system control unit, virtual node control unit, node control unit, execution unit(s), and data storage unit. As in the foregoing third embodiment, the data storage units 216 and 226 may be implemented as reserved storage areas of RAM or HDD, while the other blocks may be implemented as program modules.


The following description assumes that the node 21 is supposed to coordinate execution of data processing requests from a client 31. It is also assumed that the node 21 controls the virtual node 20 to which the node 21 belongs, and that the node 21a controls the virtual node 20a to which the node 21a belongs.


The receiving unit 211 receives commands from a client 31 or other nodes. The computer process implementing this receiving unit 211 is always running on the node 21. When a command is received from the client 31, the receiving unit 211 calls up its local system control unit 212. When a command is received from the system control unit 212, the receiving unit 211 calls up its local virtual node control unit 213 in response. Further, when a command is received from the virtual node control unit 213, the receiving unit 211 calls up its local node control unit 214 in response. The receiving unit 211 in the node 21 may also receive a command from a peer node when its system control unit is activated. In response, the receiving unit 211 calls up the virtual node control unit 213 to handle that command. Further, the receiving unit 211 may receive a command from a peer node when its virtual node control unit is activated. In response, the receiving unit 211 calls up the node control unit 214 to handle that command.


The system control unit 212 controls a plurality of virtual nodes as a whole when they are used to execute an exhaustive join. Each time a specific data processing operation (or transaction) is requested from the client 31, only one of those nodes activates its system control unit. Upon activation, the system control unit 212 issues a query to a predetermined node (representative node) in each virtual node to request information about which node will be responsible for the control of that virtual node. The node in question is referred to as a "deputy node." The deputy node is chosen on a per-transaction basis so that the participating nodes share their processing load of an exhaustive join. The system control unit 212 then transmits a deputy designating command to the receiving unit of the deputy node in each virtual node. For example, this command causes the node 21 to call up its virtual node control unit 213, as well as causing the node 21a to call up its virtual node control unit 213a.


Subsequent to the deputy designating command, the system control unit 212 transmits a participation request command to the virtual node control unit in each virtual node (e.g., virtual node control units 213 and 213a). The system control unit 212 further determines logical connections among the virtual nodes, as well as among their constituent nodes, and transmits a relaying command to the virtual node control unit in each virtual node. When the duplication of data elements is finished, the system control unit 212 transmits a joining command to the virtual node control unit in each virtual node. When the exhaustive join is finished, the system control unit 212 so notifies the client 31.


The virtual node control unit 213 controls a plurality of nodes 21 to 24 belonging to the virtual node 20. The computer process implementing this virtual node control unit 213 is activated upon call from the receiving unit 211. Each time a specific data processing operation (or transaction) is requested from the client 31, only one constituent node in each virtual node activates its virtual node control unit. The activated virtual node control unit 213 may receive a participation request command from the system control unit 212. When this is the case, the virtual node control unit 213 forwards the command to the receiving unit of each participating node within the virtual node 20. For example, the receiving units 211 and 221 receive this participation request command, which causes the node 21 to call up its node control unit 214 and the node 22 to call up its node control unit 224.


The virtual node control unit 213 may also receive a relaying command from the system control unit 212. The virtual node control unit 213 forwards this command to the node control unit (e.g., node control unit 214) of a particular node that is supposed to be the source point of data elements to be relayed. The virtual node control unit 213 may further receive a joining command from the system control unit 212. The virtual node control unit 213 forwards the command to the node control unit of each participating node within the virtual node 20. For example, the node control units 214 and 224 receive this joining command.


The node control unit 214 controls information processing tasks that the node 21 undertakes as part of an exhaustive join. The computer process implementing this node control unit 214 is activated upon call from the receiving unit 211. The node control unit 214 calls up the execution unit 215 when a relaying command or joining command is received from the virtual node control unit 213. Relaying commands and joining commands may come also from a peer node of the node 21 (or more specifically, from the virtual node control unit activated in a peer node). The node control unit 214 similarly calls up its local execution unit 215 in response. The node control unit 214 may also call up its local execution unit 215 when a reception command is received from a remote execution unit of a peer node.


The execution unit 215 performs information processing operations requested from the node control unit 214. The computer process implementing this execution unit 215 is activated upon call from the node control unit 214. The node 21 is capable of invoking a plurality of processes of the execution unit 215. When called up in connection with a relaying command, the execution unit 215 transmits a reception command to the node control unit of a peer node (e.g., node control unit 224 in node 22). The execution unit 215 then reads data elements out of the data storage unit 216 and transmits them to its counterpart in the adjacent node (e.g., execution unit 225).


When called up in connection with a reception command, the execution unit 215 receives data elements from its counterpart in a peer node and stores them in its local data storage unit 216. The execution unit 215 forwards these data elements to another node unless the node 21 is their final destination. Further, when called up in connection with a joining command, the execution unit 215 locally executes an exhaustive join with the collected data elements and writes the result back into the data storage unit 216.


The data storage unit 216 stores some of the data elements constituting datasets A and B. The data storage unit 216 initially stores data elements that belong to some subsets assigned to the node 21 in the first place. Then subsequent relaying of data elements, both between virtual nodes and within a single virtual node 20, causes the data storage unit 216 to receive additional data elements belonging to relevant row and column subsets. Similarly, the data storage unit 226 in the node 22 stores some data elements of datasets A and B.



FIG. 18 is a flowchart illustrating an exemplary procedure of joins according to the fourth embodiment. As previously mentioned, the number N of participating nodes may be determined by the system control unit 212 before the process starts with step S21, based on the total number of nodes in the system. Each step of the flowchart will be described below.


(S21) The client 31 has specified datasets A and B as input data for an exhaustive join. The system control unit 212 divides those two datasets A and B into as many subsets as the number of participating virtual nodes, and assigns them to those virtual nodes. Then in each virtual node, the virtual node control unit subdivides the assigned subset into as many smaller subsets as the number of participating nodes in that virtual node and assigns the divided subsets to those nodes. The input datasets A and B are distributed to a plurality of nodes as a result of the above operation. As an alternative, the assignment of datasets A and B may be performed upon request from the client 31 before the node receives a start command for data processing. As another alternative, the datasets A and B may be given as an output of previous data processing at these nodes. In this case, the system control unit 212 may find that datasets A and B have already been distributed to relevant nodes.


(S22) The system control unit 212 determines the row dimension H and column dimension W by using a calculation method such as equation (10) described previously, based on the number N of participating virtual nodes and the number of data elements of each given dataset A and B.


(S23) The system control unit 212 commands the deputy node of each virtual node to duplicate data elements among the virtual nodes. The virtual node control unit in the deputy node then commands each node within the virtual node to duplicate data elements to other virtual nodes. The execution units in such nodes relay the subsets of dataset A in the row direction by communicating with their associated nodes in other virtual nodes sharing the same row number. These execution units also relay the subsets of dataset B in the column direction by communicating with their associated nodes in other virtual nodes sharing the same column number.


Steps S22 and S23 may be executed recursively in the case where the virtual nodes are nested in a multiple-layer structure. This recursive operation may be implemented by causing the virtual node control unit to inherit the above-described role of the system control unit 212. The same may apply to step S21.


(S24) The system control unit 212 determines the row dimension h and column dimension w by using a calculation method such as equation (10) described previously, based on the number of participating nodes per virtual node and the number of data elements per virtual node at that moment.


(S25) The system control unit 212 commands the deputy node of each virtual node to duplicate data elements within that virtual node. The virtual node control unit in the deputy node then commands the nodes constituting the virtual node to duplicate their data elements to each other. The execution unit in each constituent node transmits subsets of dataset A in the row direction, which include the one initially assigned thereto and those received from other virtual nodes at step S23. The execution unit in each constituent node also transmits subsets of dataset B in the column direction, which include the one initially assigned thereto and those received from other virtual nodes at step S23.


(S26) The system control unit 212 commands the deputy node in each virtual node to locally execute an exhaustive join. In the deputy nodes, their virtual node control unit commands relevant nodes in each virtual node to locally execute an exhaustive join. The execution unit in each node locally executes an exhaustive join between the row and column subsets collected through the processing of steps S23 and S25, thus writing the result in the data storage unit in the node.


(S27) Upon completion of the data processing of step S26 at every participating node, the system control unit 212 notifies the client 31 of completion of the requested exhaustive join. The system control unit 212 may further collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 212 may allow the result data to stay in the nodes.



FIG. 19 is a first diagram illustrating an exemplary data arrangement according to the fourth embodiment. This example assumes that an exhaustive join is executed by six virtual nodes 11n, 12n, 13n, 21n, 22n, and 23n (virtual nodes 20, 20a, 20b, 20c, 20d, and 20e in FIG. 15). Dataset A is formed from 24 data elements a1 to a24, while dataset B is formed from 48 data elements b1 to b48. According to the foregoing equation (10), the row dimension H is calculated to be 2 (and the column dimension W to be 3), since the number of virtual nodes is six, and the number of data elements is 24 for dataset A and 48 for dataset B. In other words, each virtual node as a whole is assigned two subsets, Aij and Bij. For example, virtual node 11n is assigned subset A11 and subset B11.



FIG. 20 is a second diagram illustrating an exemplary data arrangement according to the fourth embodiment. Virtual nodes 11n, 12n, 13n, 21n, 22n, and 23n in the present example are each formed from four nodes n11, n12, n21, and n22. Each of those nodes has been equally assigned a subset of each source dataset, including one data element from dataset A and two data elements from dataset B. For example, node 11n11 has been assigned data elements a1, b1, and b2.


The initial assignment of data elements has been made, and the row dimension H and column dimension W of virtual nodes are determined. Now the virtual nodes having the same row number duplicate their subsets of dataset A in the row direction, while the virtual nodes having the same column number duplicate their subsets of dataset B in the column direction. In this duplication, data elements are copied from one node in a virtual node to its counterpart in another virtual node. For example, data element a1 initially assigned to node 11n11 is copied from node 11n11 to node 12n11, and then from node 12n11 to node 13n11. Further, data elements b1 and b2 initially assigned to node 11n11 are copied from node 11n11 to node 21n11. No copying operations take place between non-associated nodes in this phase. For example, neither node 12n12 nor node 13n12 receives data element a1 from node 11n11.



FIG. 21 is a third diagram illustrating an exemplary data arrangement according to the fourth embodiment. This example depicts the state after the above node-to-node data duplication is finished. That is, each node has obtained three data elements of dataset A and four data elements of dataset B. For example, node 11n11 now has data elements a1, a3, and a5 of dataset A and data elements b1, b2, b5, and b6 of dataset B. According to the foregoing equation (10), the row dimension h is calculated to be 2 since the number of nodes per virtual node is 4, and the number of data elements per virtual node is 12 for dataset A and 16 for dataset B.


Now that the row dimension h and column dimension w of virtual nodes are determined, further duplication of data elements is performed within each virtual node. That is, the nodes having the same row number duplicate their respective subsets of dataset A in the row direction, including those received from other virtual nodes. Similarly the nodes having the same column number duplicate their subsets of dataset B in the column direction, including those received from other virtual nodes. For example, one set of data elements a1, a3, and a5 collected in node 11n11 is copied from node 11n11 to node 11n12. Also, another set of data elements b1, b2, b5, and b6 collected in node 11n11 is copied from node 11n11 to node 11n21. The nodes in a virtual node do not have to communicate with nodes in other virtual nodes during this phase of local duplication of data elements.



FIG. 22 is a fourth diagram illustrating an exemplary data arrangement according to the fourth embodiment. This example depicts the state after the above node-to-node data duplication is finished. Each node has obtained a row subset of dataset A and a column subset of dataset B. For example, the topmost six nodes 11n11, 11n12, 12n11, 12n12, 13n11, 13n12 have six data elements a1 to a6 as a row subset. The leftmost four nodes 11n11, 11n21, 21n11, 21n21 have eight data elements b1 to b8 as a column subset. The resultant distribution of data elements in those 24 nodes is identical to what would be obtained without virtualization of nodes.


Each node executes an exhaustive join with its local row subset and column subset obtained above. For example, node 11n11 selects one data element out of six data elements a1 to a6 and one element out of eight data elements b1 to b8 and subjects this combination of data elements to a map function. By repeating this operation, the node 11n11 applies the map function to 48 ordered pairs (i.e., 6×8 combinations) of data elements. All nodes equally process their own 48 ordered pairs, as seen in FIG. 22. These twenty-four nodes as a whole cover all the 1152 (=24×48) ordered pairs produced from datasets A and B, without redundant duplication.


The proposed information processing system of the fourth embodiment provides advantages similar to those of the foregoing third embodiment. The fourth embodiment may further reduce unintended waiting times during inter-node communication by taking into consideration the unequal communication delays due to different physical distances between nodes. That is, the proposed system performs relatively slow communication between virtual nodes in the first place, and then proceeds to relatively fast communication within each virtual node. This feature of the fourth embodiment makes it easy to parallelize the communication, thus realizing a more efficient procedure for duplicating data elements.


(e) Fifth Embodiment

This section describes a fifth embodiment with the focus on its differences from the third and fourth embodiments. See the previous description for their common elements and features. As will be described below, the fifth embodiment executes "triangle joins" instead of exhaustive joins. This fifth embodiment may be implemented in an information processing system with a structure similar to that of the third embodiment discussed previously in FIGS. 3, 4, and 10. Triangle joins may sometimes be treated as a kind of simple join.


A triangle join is an operation on a single dataset A formed from m data elements a1, a2, . . . , am (m is an integer greater than one). As seen in equation (11) below, this triangle join yields a new dataset by applying a map function to every unordered pair of two data elements ai and aj in dataset A with no particular relation between them. As in the case of exhaustive joins, the map function may return no output data elements or may output two or more resulting data elements, depending on the values of the arguments ai and aj. According to the definition of a triangle join seen in equation (11), the map function may be applied to a combination of the same data element (i.e., in the case of ai=aj). It is possible to define a map function that excludes such combinations.






t-join(A,map)={map(ai,aj)|ai,ajεA,i≦j}  (11)



FIG. 23 illustrates a triangle join. Since triangle joins operate on unordered pairs of data elements, there is no need for calculating both map(ai, aj) and map(aj, ai). The map function is therefore applied to a limited number of combinations of data elements as seen in the form of a triangle when a two-dimensional matrix is produced from the data elements of dataset A. Specifically, the map function is executed on m(m+1)/2 combinations or m(m−1)/2 combinations of data elements. This means that the amount of data processing is nearly halved by using a triangle join in place of an exhaustive join of dataset A itself.
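

In code, a triangle join differs from an exhaustive join only in that the inner loop starts at the current index, so each unordered pair is visited exactly once. The following sketch follows equation (11); passing include_diagonal=False excludes combinations of the same data element.

def triangle_join(dataset, map_func, include_diagonal=True):
    # Apply map_func to every unordered pair (ai, aj) with i <= j, per equation (11).
    results = []
    m = len(dataset)
    for i in range(m):
        start = i if include_diagonal else i + 1
        for j in range(start, m):
            results.extend(map_func(dataset[i], dataset[j]))
    return results

A = ["a1", "a2", "a3", "a4"]
print(len(triangle_join(A, lambda x, y: [(x, y)])))         # 10 = 4*(4+1)/2
print(len(triangle_join(A, lambda x, y: [(x, y)], False)))  # 6  = 4*(4-1)/2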


For example, a local triangle join on a single node may be implemented as a procedure described below. It is assumed that the node reads data elements on a block-by-block basis, where one block is made up of one or more data elements. It is also assumed that the node is capable of storing up to α blocks of data elements in its local RAM. When executing a triangle join of dataset A, the node loads the RAM with the topmost (α−1) blocks of data elements. For example, the node loads its RAM with two data elements a1 and a2 (assuming that α=3 and that each block holds a single data element). The node then executes a triangle join with these (α−1) blocks in RAM. For example, the node subjects three combinations (a1, a1), (a1, a2), and (a2, a2) to the map function.


Subsequently the node loads the next one block into RAM and executes an exhaustive join between the previous (α−1) blocks and the newly loaded block. For example, the node loads a data element a3 into RAM and applies the map function to two new combinations (a1, a3) and (a2, a3). The node similarly processes the remaining blocks one by one until the last block is reached, while maintaining the topmost (α−1) blocks in its RAM. Upon completion of an exhaustive join between the topmost (α−1) blocks and the last one block, the node then flushes the existing (α−1) blocks in RAM and loads the next (α−1) blocks. For example, the node loads another two data elements a3 and a4 into RAM. With these new (α−1) blocks, the node executes a triangle join and exhaustive join in a similar way. The node iterates these operations until all possible (α−1) blocks are finished. It is noted that the final cycle of this iteration may not fully load (α−1) blocks, depending on the total number of blocks.
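

The block-by-block procedure just described can be sketched as below. The function keeps at most (α−1) blocks resident, runs a triangle join among them, and then runs an exhaustive join between the resident elements and each later block; the block layout (one data element per block) mirrors the running example and is otherwise an arbitrary choice.

def blockwise_triangle_join(blocks, map_func, alpha):
    # blocks: list of blocks, each a list of data elements.
    # alpha : RAM capacity in blocks; (alpha - 1) blocks stay resident at a time.
    results = []
    group_size = alpha - 1
    for start in range(0, len(blocks), group_size):
        # Load the next (alpha - 1) blocks into RAM (fewer in the final cycle).
        resident = [e for block in blocks[start:start + group_size] for e in block]
        # Triangle join among the resident data elements.
        for i in range(len(resident)):
            for j in range(i, len(resident)):
                results.extend(map_func(resident[i], resident[j]))
        # Exhaustive join between the resident elements and every later block.
        for block in blocks[start + group_size:]:
            for a in resident:
                for b in block:
                    results.extend(map_func(a, b))
    return results

blocks = [["a1"], ["a2"], ["a3"], ["a4"], ["a5"]]
out = blockwise_triangle_join(blocks, lambda x, y: [(x, y)], alpha=3)
print(len(out))  # 15 = 5*(5+1)/2 unordered pairs, each produced exactly once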



FIG. 23 depicts a plurality of such map operations. Since these operations are independent of each other, it is possible to parallelize their execution by assigning a plurality of nodes, just as in the case of exhaustive joins.



FIG. 24 illustrates an exemplary result of a triangle join. In this example of FIG. 24, dataset A is formed from four data elements a1 to a4, each including X-axis and Y-axis values representing a point on a plane. When two specific data elements are given as its arguments, the map function calculates the distance between the two corresponding points on the plane. A triangle join applies such a map function to 10 (=4×(4+1)/2) combinations of data elements. Alternatively, the combinations may be reduced to 6 (=4×(4−1)/2) in the case where the map function is not applied to combinations of the same data element (i.e., in the case of ai=aj).
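

The point-distance example of FIG. 24 may be written as the map function below. The coordinates are made up for illustration; with four points, the triangle join applies the function to ten combinations (or six when identical pairs are excluded).

import math

def distance_map(p, q):
    # Map function of FIG. 24: the distance between two points on a plane.
    (name_p, xp, yp), (name_q, xq, yq) = p, q
    return [((name_p, name_q), math.hypot(xp - xq, yp - yq))]

# Hypothetical coordinates for data elements a1 to a4.
A = [("a1", 0.0, 0.0), ("a2", 3.0, 0.0), ("a3", 0.0, 4.0), ("a4", 3.0, 4.0)]

pairs = [(A[i], A[j]) for i in range(len(A)) for j in range(i, len(A))]
distances = [r for p, q in pairs for r in distance_map(p, q)]
print(len(distances))  # 10 = 4*(4+1)/2 distances, including zero-length ones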



FIG. 25 illustrates a node coordination model according to the fifth embodiment. The triangle joins discussed in this fifth embodiment handle a plurality of participating nodes as if they are logically arranged in the form of an isosceles right triangle. The nodes are organized in a space with a height of h nodes (max) and a width of h nodes (max), such that (h−i+1) nodes are horizontally aligned in the i-th row, while j nodes are vertically aligned in the j-th column. The node sitting at the i-th row and j-th column is represented as nij in this model. The information processing system determines the row dimension h, depending on the number N of nodes used for its data processing. For example, the row dimension h may be selected as the maximum integer that satisfies h(h+1)/2<=N. In this case, a triangle join is executed by h(h+1)/2 nodes (six nodes when h=3, as in the example below).


The data elements of dataset A are initially distributed across h nodes n11, n22, . . . , nhh on a diagonal line including the top-left node n11 (i.e., the base of the isosceles right triangle). As in the case of exhaustive joins, these data elements are placed evenly (or near evenly) on these nodes without redundant duplication. At this stage, no data elements are assigned to any nodes other than those on the diagonal line. For example, subset Ai is assigned to node nii as seen in equation (12). Here the number of data elements of subset Ai is determined by dividing the total number of elements in dataset A by the row dimension h.










A=A1∪A2∪ . . . ∪Ah, |Ai|=|A|/h  (12)








FIG. 26 is a flowchart illustrating an exemplary procedure of joins according to the fifth embodiment. Each step of the flowchart will be described below.


(S31) The system control unit 112 determines the row dimension h based on the number of participating nodes (i.e., those assigned to the triangle join), and defines logical connections of those nodes.


(S32) The client 31 has specified dataset A as input data for a triangle join. The system control unit 112 divides this dataset A into h subsets A1, A2, . . . , Ah and assigns them to h nodes including node n11 on the diagonal line. These nodes may be referred to as "diagonal nodes" as used in FIG. 26. As an alternative, the node 11 may assign dataset A to these nodes according to a request from the client 31 before the node 11 receives a start command for data processing. As another alternative, dataset A may be given as an output of previous data processing at these nodes. In this case, the system control unit 112 may find that dataset A has already been assigned to relevant nodes.


(S33) The system control unit 112 commands each diagonal node nii to duplicate its assigned subset Ai in both the rightward and upward directions. The relaying of data subsets begins at each diagonal node nii, causing the execution unit in each relevant node to forward subset Ai rightward and upward, but not downward or leftward. The above relaying may be achieved by using, for example, the method A discussed in FIG. 9A. The duplication of subset Ai permits non-diagonal nodes nij to receive a copy of subset Ai (Ax) initially assigned to node nii, as well as a copy of subset Aj (Ay) initially assigned to node njj. The diagonal nodes nii, on the other hand, receive no extra data elements from other nodes.


(S34) The system control unit 112 commands the diagonal nodes to locally execute a triangle join. In response, the execution unit in each diagonal node nii executes a triangle join with its local subset Ai and stores the result in a relevant data storage unit. The system control unit 112 also commands non-diagonal nodes to locally execute an exhaustive join. The non-diagonal nodes nij locally execute an exhaustive join of subset Ax and subset Ay respectively obtained in the above rightward relaying and upward relaying. The non-diagonal nodes nij store the result in relevant data storage units.


(S35) The system control unit 112 sees that every participating node has finished step S34, thus notifying the requesting client 31 of completion of the requested triangle join. The system control unit 112 may also collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 112 may allow the result data to stay in the nodes.



FIG. 27 is a first diagram illustrating an exemplary data arrangement according to the fifth embodiment. This example assumes that six nodes n11, n12, n13, n22, n23, and n33 are configured to execute a triangle join, where dataset A is formed from nine data elements a1 to a9. Each diagonal node nii is assigned a different subset Ai including three data elements. For example, node n11 is assigned a subset A1={a1, a2, a3}. Duplication of data elements then begins at each diagonal node nii, causing other nodes to forward subset Ai in both the rightward and upward directions. For example, subset A1 assigned to node n11 is copied from node n11 to node n12, and then from node n12 to node n13. Similarly, subset A2 assigned to node n22 is copied upward from node n22 to node n12, as well as rightward from node n22 to node n23.



FIG. 28 is a second diagram illustrating an exemplary data arrangement according to the fifth embodiment. This example depicts the state after the above duplication of data elements is finished. That is, the diagonal nodes nii maintain their initially assigned subset Ai alone. In contrast, the non-diagonal nodes nij have received subset Ai from the nodes on their left and subset Aj from the nodes below them. For example, node n13 has obtained subset A1={a1, a2, a3} and subset A3={a7, a8, a9}.


The diagonal nodes nii locally execute a triangle join with their respective subset Ai. For example, node n11 applies the map function to six combinations derived from A1={a1, a2, a3}. On the other hand, the non-diagonal nodes nij locally execute an exhaustive join with their respective subsets Ai and Aj. For example, node n13 applies the map function to nine (=3×3) ordered pairs derived from subsets A1 and A3, one element from subset A1={a1, a2, a3} and the other element from subset A3={a7, a8, a9}. As can be seen from FIG. 28, the illustrated nodes perfectly cover the 45 possible combinations of data elements of dataset A, without redundant duplication.
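

The division of labor in the fifth embodiment can be summarized by a small helper that, given the row dimension h, lists which subsets each node ends up with after step S33 and which kind of join it runs at step S34. This is a planning sketch only; it does not model the actual relaying.

def fifth_embodiment_plan(h):
    # Nodes form an isosceles right triangle: node (i, j) exists for 1 <= i <= j <= h.
    # Diagonal node (i, i) keeps only subset Ai and runs a local triangle join;
    # node (i, j) with i < j receives Ai from the left and Aj from below and
    # runs a local exhaustive join of Ai x Aj.
    plan = {}
    for i in range(1, h + 1):
        for j in range(i, h + 1):
            if i == j:
                plan[(i, j)] = ("triangle join", ["A%d" % i])
            else:
                plan[(i, j)] = ("exhaustive join", ["A%d" % i, "A%d" % j])
    return plan

for node, work in sorted(fifth_embodiment_plan(3).items()):
    print(node, work)
# (1, 1) ('triangle join', ['A1']), (1, 3) ('exhaustive join', ['A1', 'A3']), ...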


According to the fifth embodiment described above, the proposed information processing system executes triangle joins of dataset A in an efficient way, without needless duplication of data processing in the nodes.


(f) Sixth Embodiment

This section describes a sixth embodiment with the focus on its differences from the foregoing third to fifth embodiments. See the previous description for their common elements and features. The sixth embodiment executes triangle joins in a different way from the one discussed in the fifth embodiment. This sixth embodiment may be implemented in an information processing system with a structure similar to that of the third embodiment discussed in FIGS. 3, 4, and 10.



FIG. 29 illustrates a node coordination model according to the sixth embodiment. To execute a triangle join, this sixth embodiment handles a plurality of participating nodes as if they are logically arranged in the form of a square array. That is, the nodes are organized in an array with a height and width of h nodes. The information processing system determines this row dimension h, depending on the number of nodes used for its data processing. For example, the row dimension h may be selected as the maximum integer that satisfies h2<=N. In this case, a triangle join is executed by h2 nodes. Data elements of dataset A are initially distributed over h nodes n11, n22, . . . , nhh on a diagonal line including node n11, similarly to the fifth embodiment.



FIG. 30 is a flowchart illustrating an exemplary procedure of joins according to the sixth embodiment. Each step of the flowchart will be described below.


(S41) The system control unit 112 determines the row dimension h based on the number of participating nodes (i.e., nodes used to execute a triangle join), and defines logical connections of those nodes.


(S42) The client 31 has specified dataset A as input data for a triangle join. The system control unit 112 divides this dataset A into h subsets A1, A2, . . . , Ah and assigns them to h diagonal nodes including node n11. As an alternative, the node 11 may assign dataset A to these nodes according to a request from the client 31 before the node 11 receives a start command for data processing. As another alternative, dataset A may be given as an output of previous data processing at these nodes. In this case, the system control unit 112 may find that dataset A has already been assigned to relevant nodes.


(S43) The system control unit 112 commands each diagonal node nii to duplicate its assigned subset Ai in both the row and column directions. The execution unit in each diagonal node nii transmits all data elements of subset Ai in both the rightward and downward directions. The execution unit in each diagonal node nii further divides the subset Ai into two halves as evenly as possible in terms of the number of data elements. The execution unit sends one half leftward and the other half upward. The above relaying may be achieved by using, for example, the method C discussed in FIG. 9C. As a result of this step, some non-diagonal nodes nij obtain a full copy of subset Ai (Ax) initially assigned to node nii, as well as half a copy of subset Aj (Ay) initially assigned to node njj. The other non-diagonal nodes nij obtain half a copy of subset Ai (Ax), together with a full copy of subset Aj (Ay). The diagonal nodes nii, on the other hand, receive no data elements from other nodes.


(S44) The system control unit 112 commands the diagonal nodes to locally execute a triangle join. In response, the execution unit in each diagonal node nii executes a triangle join of its local subset Ai and stores the result in a relevant data storage unit. The system control unit 112 also commands non-diagonal nodes to locally execute an exhaustive join. The non-diagonal nodes nij locally execute an exhaustive join of subset Ax and subset Ay respectively obtained in the above row-wise relaying and column-wise relaying. The non-diagonal nodes nij store the result in relevant data storage units.


(S45) The system control unit 112 sees that every participating node has finished step S44, thus notifying the requesting client 31 of completion of the requested triangle join. The system control unit 112 may also collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 112 may allow the result data to stay in the nodes.



FIG. 31 is a first diagram illustrating an exemplary data arrangement according to the sixth embodiment. This example assumes that nine nodes n11, n12, . . . , n33 are configured to execute a triangle join, where dataset A is formed from nine data elements a1 to a9. The diagonal nodes nii have each been assigned a subset Ai including three data elements, as in the case of the fifth embodiment.


Duplication of data elements then begins at each diagonal node nii, causing other nodes to forward subset Ai in both the rightward and downward directions. In addition to the above, the subset Ai is divided into two halves, and one half is duplicated in the leftward direction while the other half is duplicated in the upward direction. For example, data elements a4, a5, and a6 assigned to node n22 are wholly copied from node n22 to node n23, as well as from node n22 to node n32. The data elements {a4, a5, a6} are divided into two halves, {a4} and {a5, a6}, where some error may be allowed in the number of elements. The former half is copied from node n22 to node n21, and the latter half is copied from node n22 to node n12.



FIG. 32 is a second diagram illustrating an exemplary data arrangement according to the sixth embodiment. This example depicts the state after the above duplication of data elements is finished. That is, the diagonal nodes nii maintain their initially assigned subset Ai alone. The non-diagonal nodes nij, on the other hand, have obtained a subset Ax (i.e., a full or half copy of subset Ai) from their adjacent nodes on the same row. The non-diagonal nodes nij have also obtained a subset Ay (i.e., a full or half copy of subset Aj) from their adjacent nodes on the same column. For example, node n13 now has all data elements {a1, a2, a3} of one subset A1, together with two data elements a8 and a9 out of another subset A3.


The diagonal nodes nii locally execute a triangle join of subset Ai similarly to the fifth embodiment. The non-diagonal nodes execute an exhaustive join locally with the subset Ax and subset Ay that they have obtained. For example, node n13 applies the map function to six (=3×2) ordered pairs, by combining one data element selected from {a1, a2, a3} with another data element selected from {a8, a9}. As can be seen from FIG. 32, the method proposed in the sixth embodiment may be regarded as a modified version of the foregoing fifth embodiment. That is, one half of the data processing tasks in non-diagonal nodes is delegated to the nodes located below the diagonal line. The nine nodes completely cover the 45 combinations of data elements of dataset A, without redundant duplication.
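

The duplication rule of step S43 can be expressed compactly: a node to the right of its row's diagonal node holds a full copy of that row subset, a node to its left holds a half, and the same applies below and above the diagonal in the column direction. The sketch below reproduces FIG. 32 under the assumption that the first half of a subset is the part sent leftward and the remainder is sent upward (the embodiment only requires the split to be as even as possible).

def sixth_embodiment_copies(subsets):
    # subsets maps the diagonal index i (1-based) to the elements of subset Ai.
    # Returns {(row, column): (Ax, Ay)} for every non-diagonal node.
    h = len(subsets)
    received = {}
    for i in range(1, h + 1):
        for j in range(1, h + 1):
            if i == j:
                continue
            a_row = subsets[i]                      # subset of the diagonal node on row i
            a_col = subsets[j]                      # subset of the diagonal node on column j
            left_half = a_row[:len(a_row) // 2]     # half relayed leftward along row i
            up_half = a_col[len(a_col) // 2:]       # half relayed upward along column j
            ax = a_row if j > i else left_half      # full copy to the right of the diagonal
            ay = a_col if i > j else up_half        # full copy below the diagonal
            received[(i, j)] = (ax, ay)
    return received

subsets = {1: ["a1", "a2", "a3"], 2: ["a4", "a5", "a6"], 3: ["a7", "a8", "a9"]}
print(sixth_embodiment_copies(subsets)[(1, 3)])
# (['a1', 'a2', 'a3'], ['a8', 'a9']) -- full A1 plus half of A3, as node n13 in FIG. 32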


The proposed information processing system of the sixth embodiment executes a triangle join of dataset A efficiently by using a plurality of nodes. Particularly, the sixth embodiment is advantageous in its ability of distributing the load of data processing to the nodes as evenly as possible.


(g) Seventh Embodiment

This section describes a seventh embodiment with the focus on its differences from the foregoing third to sixth embodiments. See the previous description for their common elements and features. The seventh embodiment executes triangle joins in a different way from those discussed in the fifth and sixth embodiments. This seventh embodiment may be implemented in an information processing system with a structure similar to that of the third embodiment discussed in FIGS. 3, 4, and 10.



FIG. 33 illustrates a node coordination model according to the seventh embodiment. To execute a triangle join, this seventh embodiment handles a plurality of participating nodes as if they are logically arranged in the form of a square array. Specifically, the array of nodes has a height and width of 2k+1, where k is an integer greater than zero. In other words, each side of the square has an odd number of nodes which is greater than or equal to three. The information processing system determines a row dimension h=2k+1, depending on the number of nodes available for its data processing. For example, the row dimension h may be selected as the maximum odd number that satisfies h2<=N. In this case, a triangle join is executed by h2 nodes. The triangle joins discussed in this seventh embodiment assume that these nodes are connected logically in a torus topology. For example, node ni1 is located at the right of node nih. Node n1j is located below node nhj. Data elements of dataset A are initially distributed over h nodes n11, n22, . . . , nhh on a diagonal line including node n11.



FIG. 34 is a flowchart illustrating an exemplary procedure of joins according to the seventh embodiment. Each step of the flowchart will be described below.


(S51) The system control unit 112 determines the row dimension h=2k+1 based on the number of participating nodes (i.e., nodes used to execute a triangle join), and defines logical connections of those nodes.


(S52) The client 31 has specified dataset A as input data for a triangle join. The system control unit 112 divides this dataset A into h subsets A1, A2, . . . , Ah and assigns them to h diagonal nodes including node n11. As an alternative, the node 11 may assign dataset A to these nodes according to a request from the client 31 before the node 11 receives a start command for data processing. As another alternative, dataset A may be given as an output of previous data processing at these nodes. In this case, the system control unit 112 may find that dataset A has already been assigned to relevant nodes.


(S53) The system control unit 112 commands each diagonal node nii to duplicate its assigned subset Ai in both the row and column directions. In response, the execution unit in each diagonal node nii transmits a copy of subset Ai to the node on the right of node nii, as well as to the node immediately below node nii.


Subsets are thus relayed in the row direction. During this course, the execution units in the first to k-th nodes located on the right of node nii receive subset Ai from their left neighbors. The execution units in the (k+1)th to (2k)th nodes receive one half of subset Ai from their left neighbors. These subsets are referred to collectively as Ax. Subsets are also relayed in the column direction. During this course, the execution units in the first to k-th nodes below node nii receive subset Ai from their upper neighbors. The execution units in the (k+1)th to (2k)th nodes receive the other half of subset Ai from their upper neighbors. These subsets are referred to collectively as Ay. The above relaying may be achieved by using, for example, the method B discussed in FIG. 9B.


As a result of step S53, some non-diagonal nodes nij obtain a full copy of subset Ai (Ax) initially assigned to node nii, as well as half a copy of subset Aj (Ay) initially assigned to node njj. The other non-diagonal nodes nij obtain half a copy of subset Ai (Ax), together with a full copy of subset Aj (Ay). The diagonal nodes nii, on the other hand, receive no data elements from other nodes.


(S54) The system control unit 112 commands the diagonal nodes to locally execute a triangle join. In response, the execution unit in each diagonal node nii executes a triangle join of its local subset Ai and stores the result in a relevant data storage unit. The system control unit 112 also commands non-diagonal nodes to locally execute an exhaustive join. The non-diagonal nodes nij locally execute an exhaustive join of subset Ax and subset Ay respectively obtained in the above row-direction relaying and column-direction relaying. The non-diagonal nodes nij store the result in relevant data storage units.


(S55) The system control unit 112 sees that every participating node has finished step S54, thus notifying the requesting client 31 of completion of the requested triangle join. The system control unit 112 may further collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 112 may allow the result data to stay in the nodes.



FIG. 35 is a first diagram illustrating an exemplary data arrangement according to the seventh embodiment. The illustrated arrangement is a case of k=1, where nine (3×3) nodes n11, n12, . . . , n33 are configured to execute a triangle join. Dataset A is formed from nine data elements a1 to a9. Each diagonal node is assigned a different subset including three data elements.


For example, data elements a1, a2, and a3 assigned to one diagonal node n11 are wholly copied to node n12, and one half of them (e.g., data element a3) are copied to node n13. The same data elements a1, a2, and a3 are wholly copied to node n21, and the other half of them (e.g., data elements a1 and a2) are copied to node n31. Similarly, data elements a4, a5, and a6 assigned to another diagonal node n22 are wholly copied to node n23, and one half of them (e.g., data element a4) are copied to node n21. The same data elements a4, a5, and a6 are wholly copied to node n32, and the other half of them (e.g., data elements a5 and a6) are copied to node n12. Further, data elements a7, a8, and a9 assigned to yet another diagonal node n33 are wholly copied to node n31, and one half of them (e.g., data element a7) are copied to node n32. The same data elements a7, a8, and a9 are wholly copied to node n13, and the other half of them (e.g., data elements a8 and a9) are copied to node n23.



FIG. 36 is a second diagram illustrating an exemplary data arrangement according to the seventh embodiment. This example depicts the state after the above duplication of data elements is finished. That is, the diagonal nodes nii maintain their initially assigned subset Ai alone. The non-diagonal nodes nij, on the other hand, have obtained a subset Ax (i.e., a full or half copy of subset Ai) from their left neighbors. The non-diagonal nodes nij have also obtained a subset Ay (i.e., a full or half copy of subset Aj) from their upper neighbors. The diagonal nodes nii locally execute a triangle join of subset Ai similarly to the fifth and sixth embodiments. The non-diagonal nodes locally execute an exhaustive join of the subset Ax and subset Ay that they have obtained.
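

Under the torus topology, which nodes receive a full copy and which receive a half can likewise be computed from the offsets alone: the first k nodes to the right of (or below) a diagonal node receive the full subset, and the next k nodes receive one half. The sketch below follows that rule; the particular elements placed in each half are an illustrative choice, since the embodiment only requires a near-even split (FIG. 35, for instance, uses data element a3 as the half of subset A1).

def seventh_embodiment_copies(subsets, k):
    # subsets maps the diagonal index i (1-based) to the elements of subset Ai.
    # Returns {(row, column): {"Ax": [...], "Ay": [...]}} for the non-diagonal nodes.
    h = 2 * k + 1
    received = {}
    for i in range(1, h + 1):
        a = subsets[i]
        right_far_half = a[:len(a) // 2]          # half relayed to the far nodes on row i
        down_far_half = a[len(a) // 2:]           # the other half, for the far nodes on column i
        for step in range(1, 2 * k + 1):
            target = (i - 1 + step) % h + 1       # wrap-around index on the torus
            received.setdefault((i, target), {})["Ax"] = a if step <= k else right_far_half
            received.setdefault((target, i), {})["Ay"] = a if step <= k else down_far_half
    return received

subsets = {1: ["a1", "a2", "a3"], 2: ["a4", "a5", "a6"], 3: ["a7", "a8", "a9"]}
print(seventh_embodiment_copies(subsets, k=1)[(1, 3)])
# {'Ax': ['a1'], 'Ay': ['a7', 'a8', 'a9']} -- half of A1 and a full copy of A3,
# the same sizes as node n13 holds in FIG. 36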


The proposed information processing system of the seventh embodiment provides advantages similar to those of the foregoing sixth embodiment. Another advantage of the seventh embodiment is that the amount of transmit data of the diagonal nodes is equalized or nearly equalized. For example, the nodes n11, n22, and n33 in FIG. 35 transmit the same amount of data. This feature of the seventh embodiment makes it more efficient to duplicate data elements from node to node.


(h) Eighth Embodiment

This section describes an eighth embodiment with the focus on its differences from the foregoing third to seventh embodiments. See the previous description for their common elements and features. The eighth embodiment executes triangle joins in a different way from those discussed in the fifth to seventh embodiments. This eighth embodiment may be implemented in an information processing system with a structure similar to that of the third embodiment discussed in FIGS. 3, 4, and 10.


The eighth embodiment handles a plurality of participating nodes of a triangle join as if they are logically arranged in the same form discussed in FIG. 33 for the seventh embodiment. The difference lies in its initial distribution of data elements. That is, the eighth embodiment first distributes a given dataset A evenly (or near evenly) among the participating nodes, without redundant duplication. For example, subset Aij is assigned to node nij in the way described in equation (13). The number of data elements per subset is determined by dividing the total number of elements of dataset A by the number N of nodes, where N=h2=(2k+1)2.










A=A11∪A12∪ . . . ∪Ahh, |Aij|=|A|/h2  (13)








FIG. 37 is a flowchart illustrating an exemplary procedure of joins according to the eighth embodiment. Each step of the flowchart will be described below.


(S61) The system control unit 112 determines the row dimension h=2k+1 based on the number of participating nodes, and defines logical connections of those nodes.


(S62) The client 31 has specified dataset A as input data. The system control unit 112 divides that dataset A into N subsets and assigns them to a plurality of nodes, where N=(2k+1)2. As an alternative, the node 11 may assign dataset A to these nodes according to a request from the client 31 before the node 11 receives a start command for data processing. As another alternative, dataset A may be given as an output of previous data processing at these nodes. In this case, the system control unit 112 may find that dataset A has already been assigned to relevant nodes.


(S63) The system control unit 112 commands the nodes to initiate “near-node relaying” and “far-node relaying” with respect to the locations of diagonal nodes. The execution unit in each node relays subsets of dataset A via two paths. Non-diagonal nodes are classified into near nodes and far nodes, depending on their relative locations to a relevant diagonal node nii. More specifically, the term “near nodes” refers to node ni(i+1) to node ni(i+k), i.e., the first to k-th nodes sitting on the right of diagonal node nii. The term “far nodes” refers to node ni(i+k+1) to node ni(i+2k), i.e., the (k+1)th to (2k)th nodes on the right of diagonal node nii. As mentioned above, the participating nodes are logically arranged in a square array and connected with each other in a torus topology.


Near-node relaying delivers data elements along a right-angled path (path #1) that runs from node n(i+2k)i up to node nii and then turns right to node ni(i+k). Far-node relaying delivers data elements along another right-angled path (path #2) that runs from node n(i+k)i up to node nii and turns right to node ni(i+2k). Subsets Aii assigned to the diagonal nodes nii are each divided evenly (or near evenly) into two halves, such that their difference in the number of data elements does not exceed one. One half is then duplicated to the nodes on path #1 by the near-node relaying, while the other half is duplicated to the nodes on path #2 by the far-node relaying. The near-node relaying also duplicates subsets Ai(i+1) to Ai(i+k) of near nodes to other nodes on path #1. The far-node relaying also duplicates subsets Ai(i+k+1) to Ai(i+2k) of far nodes to other nodes on path #2.


The above-described relaying of data subsets from a diagonal node, near node, and far node is executed as many times as the number of diagonal nodes, i.e., h=2k+1. These duplicating operations permit each node to collect as many data elements as those obtained in the seventh embodiment.


The above duplication method of the eighth embodiment may be worded in a different way as follows. The proposed method first distributes initial subsets of dataset A evenly to the participating nodes. Then the diagonal node on each row collects data elements from other nodes, and redistributes the collected data elements so that the duplication process yields a final result similar to that of the seventh embodiment.


(S64) The system control unit 112 commands the diagonal nodes to locally execute a triangle join. In response, the execution unit in each diagonal node nii locally executes a triangle join of the subsets collected through the above relaying and stores the result in a relevant data storage unit. The system control unit 112 also commands non-diagonal nodes to locally execute an exhaustive join. The non-diagonal nodes nij locally execute an exhaustive join between the subsets Ax collected through the above relaying performed with reference to a diagonal node nii and the subsets Ay collected through the above relaying performed with reference to another diagonal node njj. The non-diagonal nodes nij store the result in relevant data storage units.
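The two local join kernels of step S64 may be sketched as follows. The map_fn argument is an assumed placeholder for whatever per-pair operation the join applies; the sketch is illustrative rather than a definitive implementation of the embodiment.

```python
# Illustrative sketch of the two local join kernels run in step S64.
# `map_fn` is an assumed placeholder for the per-pair operation.

from itertools import combinations_with_replacement, product

def local_triangle_join(subset, map_fn):
    """Diagonal node n_ii: apply map_fn to every unordered pair, self-pairs included."""
    return [map_fn(x, y) for x, y in combinations_with_replacement(subset, 2)]

def local_exhaustive_join(subset_x, subset_y, map_fn):
    """Non-diagonal node n_ij: apply map_fn to every pair drawn from Ax x Ay."""
    return [map_fn(x, y) for x, y in product(subset_x, subset_y)]

if __name__ == "__main__":
    pair = lambda x, y: (x, y)
    print(len(local_triangle_join(["a1", "a2", "a3"], pair)))                   # 6 pairs
    print(len(local_exhaustive_join(["a1", "a2", "a3"], ["a5", "a6"], pair)))   # 6 pairs
```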


(S65) The system control unit 112 sees that every participating node has finished step S64, thus notifying the requesting client 31 of completion of the requested triangle join. The system control unit 112 may further collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 112 may allow the result data to stay in the nodes.



FIG. 38 is a first diagram illustrating an exemplary data arrangement according to the eighth embodiment. Seen in FIG. 38 is the case of k=1, where nine (3×3) nodes n11, n12, . . . , n33 are configured to execute a triangle join. Dataset A is formed from nine data elements a1 to a9, and the nodes nij are initially assigned different subsets Aij each including a single data element.


Subset A11 of node n11 is duplicated to other nodes n12, n21, and n31 through near-node relaying. Subset A11 is not subjected to far-node relaying in this example because subset A11 contains only one data element. Subset A12 of node n12 is duplicated to other nodes n11, n21, and n31 through near-node relaying. Subset A13 of node n13 is duplicated to other nodes n11, n12, and n21 through far-node relaying.


Subset A22 of node n22 is duplicated to other nodes n23, n32, and n12 through near-node relaying. The subset A22 is not subjected to far-node relaying in this example because it contains only one data element. Subset A23 of node n23 is duplicated to other nodes n22, n32, and n12 through near-node relaying. Subset A21 of node n21 is duplicated to other nodes n22, n23, and n32 through far-node relaying.


Subset A33 of node n33 is duplicated to other nodes n31, n13, and n23 through near-node relaying. The subset A33 is not subjected to far-node relaying in this example because it contains only one data element. Subset A31 of node n31 is duplicated to other nodes n33, n13, and n23 through near-node relaying. Subset A32 of node n32 is duplicated to other nodes n33, n31, and n13 through far-node relaying.



FIG. 39 is a second diagram illustrating an exemplary data arrangement according to the eighth embodiment. This example depicts the state after the above duplication of data elements is finished. That is, the diagonal nodes nii have collected data elements initially assigned to the nodes ni1 to nih on the i-th row. The non-diagonal nodes nij, on the other hand, have obtained a subset Ax collected through the above relaying performed with reference to a diagonal node nii, as well as a subset Ay collected through the above relaying performed with reference to another diagonal node njj. The diagonal nodes nii locally execute a triangle join of the collected subset in a similar way to the foregoing fifth to seventh embodiments. The non-diagonal nodes locally execute an exhaustive join of the two subsets Ax and Ay obtained above.
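As a small arithmetic check of this arrangement for the k=1 case, the per-row triangle joins and the cross-row exhaustive joins together account for every combination of the nine data elements exactly once. The following sketch only verifies the counts, not the communication itself.

```python
# Count check for the FIG. 39 arrangement (k = 1, h = 3, |A| = 9):
# three per-row triangle joins plus three cross-row exhaustive joins
# should cover all C(9, 2) + 9 = 45 combinations of dataset A.

from math import comb

h, elements_per_row = 3, 3
per_row_triangle = comb(elements_per_row, 2) + elements_per_row   # 6 per diagonal node
cross_row_pairs = elements_per_row * elements_per_row             # 9 per pair of rows
total = h * per_row_triangle + comb(h, 2) * cross_row_pairs       # 3*6 + 3*9
assert total == comb(9, 2) + 9 == 45
print(total)
```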


The proposed information processing system of the eighth embodiment provides advantages similar to those of the foregoing seventh embodiment. The eighth embodiment is configured to assign data elements, not only to diagonal nodes, but also to non-diagonal nodes, as evenly as possible. This feature of the eighth embodiment reduces the chance for non-diagonal nodes to enter a wait state in the initial stage of data duplication, thus enabling more efficient duplication of data elements among the nodes.


(h) Ninth Embodiment

This section describes a ninth embodiment with the focus on its differences from the foregoing third to eighth embodiments. See the previous description for their common elements and features. To execute triangle joins, the ninth embodiment uses a large-scale information processing system formed from a plurality of communication devices interconnected in a hierarchical way. This information processing system of the ninth embodiment may be implemented on a hardware platform of FIG. 4, configured with a system structure similar to that of the fourth embodiment discussed previously in FIGS. 15 and 17.



FIG. 40 illustrates a node coordination model according to the ninth embodiment. When executing triangle joins, the ninth embodiment handles a plurality of virtual nodes as if they are logically arranged in the form of a right triangle. That is, the virtual nodes are organized in a space with a height of H (max) and a width of H (max), such that (H−i+1) virtual nodes are horizontally aligned in the i-th row while j virtual nodes are vertically aligned in the j-th column. The information processing system determines the row dimension H, depending on the number N of virtual nodes used for its data processing. For example, the row dimension H may be selected as the maximum integer that satisfies H²<=N. In this case, a triangle join is executed by H² virtual nodes. The total number of virtual nodes contained in the system and the total number of nodes per virtual node are determined taking into consideration the number of participating nodes, connection with network devices, the amount of data to be processed, expected response time of the system, and other parameters.
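The stated selection rule for H can be expressed in a one-function Python sketch. The function name is an assumption, and math.isqrt is used here merely as a convenient way of taking the integer square root.

```python
# Illustrative sketch: pick the row dimension H as the largest integer
# satisfying H*H <= N, as stated for the ninth embodiment.

from math import isqrt

def row_dimension(n_virtual_nodes):
    return isqrt(n_virtual_nodes)

if __name__ == "__main__":
    print(row_dimension(10))   # 3: the largest H with H*H <= 10
    print(row_dimension(16))   # 4
```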


In the virtual nodes sitting on the illustrated diagonal line (referred to as "diagonal virtual nodes"), their constituent nodes are handled as if they are logically arranged in the form of a right triangle. That is, the nodes in such a virtual node are organized in a space with a height of h (max) and a width of h (max), such that (h−i+1) nodes are horizontally aligned in the i-th row while j nodes are vertically aligned in the j-th column. In non-diagonal virtual nodes, on the other hand, their constituent nodes are handled as if they are logically arranged in the form of a square array. That is, the nodes in such a virtual node are organized as an array of h×h nodes. This row dimension h is common to all virtual nodes. For example, the row dimension h may be selected as the maximum integer that satisfies h²<=M, where M is the number of nodes constituting a virtual node. In this case, each virtual node contains h² nodes.


At the start of a triangle join, the system divides and assigns dataset A to all nodes n11, . . . , nhh included in all participating virtual nodes 11n, . . . , HHn. That is, the data elements are distributed evenly (or near evenly) to those nodes without needless duplication. Similarly to the foregoing fourth embodiment, the initially assigned data elements are then duplicated from virtual node to virtual node via two or more different intervening switches. Subsequently the data elements are duplicated within each closed domain of the virtual node. Communication between two virtual nodes is implemented as communication between "associated nodes" in them. While FIG. 40 illustrates an example of virtualization into a single layer, it is also possible to build a multiple-layer structure of virtual nodes such that one virtual node includes other virtual nodes.
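The notion of associated nodes can be sketched as a simple addressing rule: a node communicates with the node occupying the same local position in the counterpart virtual node. The tuple-based identifiers and the helper name below are assumptions made for this illustration only.

```python
# Illustrative sketch of "associated node" addressing between virtual nodes.
# A node is identified here by (virtual_row, virtual_col, row, col); this
# naming scheme is an assumption made for the example, not the embodiment's.

def associated_node(node_id, target_virtual):
    """Return the counterpart node in another virtual node:
    same local (row, col), different virtual-node coordinates."""
    _vi, _vj, row, col = node_id
    ti, tj = target_virtual
    return (ti, tj, row, col)

if __name__ == "__main__":
    # Node n11 of virtual node 11n talks to node n11 of virtual node 12n,
    # as in the duplication of data element a1 described for FIG. 42.
    print(associated_node((1, 1, 1, 1), (1, 2)))   # (1, 2, 1, 1)
```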



FIG. 41 is a flowchart illustrating an exemplary procedure of joins according to the ninth embodiment. Each step of the flowchart will be described below.


(S71) Based on the number of virtual nodes available for computation of triangle joins, the system control unit 212 determines the row dimension H of virtual nodes and defines their logical connections. The system control unit 212 also determines the row dimension h of nodes as a common parameter of virtual nodes.


(S72) Input dataset A has been specified by the client 31. The system control unit 212 divides this dataset A into as many subsets as the number of diagonal virtual nodes, and assigns the resulting subsets to those virtual nodes. In each virtual node, the virtual node control unit subdivides the assigned subset into as many smaller subsets as the number of diagonal nodes in that virtual node and assigns the divided subsets to those nodes. The input dataset A is distributed to a plurality of nodes as a result of the above operation. As an alternative, the assignment of dataset A may be performed upon a request from the client 31 before the node 21 receives a start command for data processing. As another alternative, the dataset A may be given as an output of previous data processing at these nodes. In this case, the system control unit 212 may find that dataset A has already been assigned to relevant nodes.


(S73) The system control unit 212 commands the deputy node of each diagonal virtual node 11n, 22n, . . . , HHn to duplicate data elements to other virtual nodes. In response, the virtual node control unit of each deputy node commands diagonal nodes n11, n22, . . . , nhh to duplicate data elements to other virtual nodes in the rightward and upward directions. The execution unit in each diagonal node sends a copy of data elements to its corresponding node in the right virtual node. These data elements are referred to as a subset Ax. The execution unit also sends a copy of data elements to its corresponding node in the upper virtual node. These data elements are referred to as a subset Ay.


(S74) The system control unit 212 commands the deputy node of each diagonal virtual node 11n, 22n, . . . , HHn to duplicate data elements within the individual virtual nodes. In response, the virtual node control unit of each deputy node commands diagonal nodes n11, n22, . . . , nhh to duplicate their data elements to other nodes in the rightward and upward directions. The relaying of data subsets Ax and Ay begins at each diagonal node, causing the execution unit in each relevant node to forward data elements in the rightward and upward directions.


(S75) The system control unit 212 commands the deputy node of each non-diagonal virtual node to duplicate data elements within the individual virtual nodes. In response, the virtual node control unit in each deputy node commands the diagonal nodes n11, n22, . . . , nhh to send a copy of subset Ax in the row direction, where subset Ax has been received from the left virtual node. Similarly the virtual node control unit commands the diagonal nodes n11, n22, . . . , nhh to send a copy of subset Ay in the column direction, where subset Ay has been received from the lower virtual node. The execution unit in each node relays subsets Ax and Ay in their specified directions. Steps S74 and S75 may be executed recursively in the case where the virtual nodes are nested in a multiple-layer structure. This recursive operation may be implemented by causing the virtual node control unit to inherit the above-described role of the system control unit 212. The same may apply to step S72.
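Step S75 can be pictured with the following sketch of row-wise and column-wise relaying inside one h×h virtual node. The dictionary-of-buffers model and the function name duplicate_within_square are assumptions for this illustration, not part of the embodiment.

```python
# Illustrative sketch of step S75: inside a non-diagonal virtual node the
# diagonal nodes broadcast subset Ax along their row and subset Ay along
# their column. Nodes are modelled as a dict of per-node buffers.

def duplicate_within_square(h, ax_per_diagonal, ay_per_diagonal):
    """Return {(i, j): {'Ax': [...], 'Ay': [...]}} after row/column relaying."""
    buffers = {(i, j): {"Ax": [], "Ay": []} for i in range(1, h + 1)
                                            for j in range(1, h + 1)}
    for d in range(1, h + 1):
        for j in range(1, h + 1):            # row-wise relay of Ax from n_dd
            buffers[(d, j)]["Ax"].extend(ax_per_diagonal[d])
        for i in range(1, h + 1):            # column-wise relay of Ay from n_dd
            buffers[(i, d)]["Ay"].extend(ay_per_diagonal[d])
    return buffers

if __name__ == "__main__":
    # h = 2 case of FIG. 43: the diagonal nodes of virtual node 12n hold
    # Ax = {a1}/{a2} and Ay = {a3}/{a4} before the internal relaying.
    result = duplicate_within_square(2, {1: ["a1"], 2: ["a2"]},
                                        {1: ["a3"], 2: ["a4"]})
    print(result[(1, 2)])   # {'Ax': ['a1'], 'Ay': ['a4']}, matching node 12n12
```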


(S76) The system control unit 212 commands the deputy node of each diagonal virtual node 11n, 22n, . . . , HHn to execute a triangle join. In response, the virtual node control unit in each deputy node commands the diagonal nodes n11, n22, . . . , nhh to execute a triangle join, while instructing the non-diagonal nodes to execute an exhaustive join. The execution unit in each diagonal node locally executes a triangle join of its own subset and stores the result in a relevant data storage unit. The execution unit of each non-diagonal node locally executes an exhaustive join between subsets Ax and Ay and stores the result in a relevant data storage unit.


The system control unit 212 also commands the deputy node of each non-diagonal virtual node to execute an exhaustive join. In response, the virtual node control unit of each deputy node commands each node in the relevant virtual node to execute an exhaustive join. The execution unit of each node locally executes an exhaustive join between subsets Ax and Ay and stores the result in a relevant data storage unit.


(S77) The system control unit 212 sees that every participating node has finished step S76, thus notifying the requesting client 31 of completion of the requested triangle join. The system control unit 212 may further collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 212 may allow the result data to stay in the nodes.



FIG. 42 is a first diagram illustrating an exemplary data arrangement according to the ninth embodiment. This example includes three virtual nodes 11n, 12n, and 22n configured to execute a triangle join. The diagonal virtual nodes 11n and 22n each contain three nodes n11, n12, and n22, whereas the non-diagonal virtual node 12n contains four nodes n11, n12, n21, and n22. It is assumed that dataset A is formed from four data elements a1 to a4. Referring now to the two diagonal virtual nodes, their diagonal nodes 11n11, 11n22, 22n11, and 22n22 are each assigned one data element. In other words, one virtual node 11n, as a whole, is assigned a subset A1={a1, a2}, and another virtual node 22n is assigned another subset A2={a3, a4}.


The assigned data elements are duplicated from virtual node to virtual node. More specifically, data element a1 of node 11n11 is copied to its counterpart node 12n11, and data element a2 of node 11n22 is copied to its counterpart node 12n22. Further, data element a3 of node 22n11 is copied to its counterpart node 12n11, and data element a4 of node 22n22 is copied to its counterpart node 12n22. No copy is made to non-associated nodes in this phase.



FIG. 43 is a second diagram illustrating an exemplary data arrangement according to the ninth embodiment. This example depicts the state after the above node-to-node data duplication is finished. That is, the two diagonal nodes 12n11 and 12n22 in non-diagonal virtual node 12n have obtained two data elements for each.


Then in each virtual node, the data elements of each diagonal node are duplicated to other nodes. In one diagonal virtual node 11n, node 11n11 copies its data element a1 to node 11n12, and node 11n22 copies its data element a2 to node 11n12. In another diagonal virtual node 22n, node 22n11 copies its data element a3 to node 22n12, and node 22n22 copies its data element a4 to node 22n12.


Also in non-diagonal virtual node 12n, node 12n11 copies its data element a1 to node 12n12, and node 12n22 copies its data element a2 to node 12n21. These data elements a1 and a2 are what the diagonal nodes 12n11 and 12n22 have obtained as a result of the above relaying in the row direction. Further, node 12n11 copies its data element a3 to node 12n21, and node 12n22 copies its data element a4 to node 12n12. These data elements a3 and a4 are what the diagonal nodes 12n11 and 12n22 have obtained as a result of the above relaying in the column direction.



FIG. 44 is a third diagram illustrating an exemplary data arrangement according to the ninth embodiment. This example depicts the state after the internal data duplication in virtual nodes is finished. That is, a single data element resides in diagonal nodes 11n11, 11n22, 22n11, and 22n22 in the diagonal virtual nodes, whereas two data elements reside in the other nodes. The former nodes locally execute a triangle join, while the latter nodes locally execute an exhaustive join. As can be seen from FIG. 44, the illustrated nodes completely cover the ten possible combinations of data elements of dataset A, without redundant duplication.
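The complete coverage stated above can be checked mechanically. In the sketch below, the node contents are hard-coded from the arrangement described for FIGS. 42 to 44; the sketch merely verifies that the ten combinations are each produced exactly once and is not part of the embodiment.

```python
# Coverage check for the arrangement of FIG. 44 (dataset A = {a1..a4}).
# Node contents are hard-coded from the figures' description; the check
# confirms that all C(4, 2) + 4 = 10 combinations appear exactly once.

from itertools import combinations_with_replacement, product
from collections import Counter

triangle_nodes = {            # diagonal nodes of the diagonal virtual nodes
    "11n11": ["a1"], "11n22": ["a2"], "22n11": ["a3"], "22n22": ["a4"],
}
exhaustive_nodes = {          # every other node: (subset Ax, subset Ay)
    "11n12": (["a1"], ["a2"]), "22n12": (["a3"], ["a4"]),
    "12n11": (["a1"], ["a3"]), "12n22": (["a2"], ["a4"]),
    "12n12": (["a1"], ["a4"]), "12n21": (["a2"], ["a3"]),
}

covered = Counter()
for subset in triangle_nodes.values():
    for x, y in combinations_with_replacement(subset, 2):
        covered[frozenset((x, y))] += 1
for ax, ay in exhaustive_nodes.values():
    for x, y in product(ax, ay):
        covered[frozenset((x, y))] += 1

assert len(covered) == 10 and set(covered.values()) == {1}
print(sorted(tuple(sorted(c)) for c in covered))
```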


The proposed information processing system of the ninth embodiment provides advantages similar to those of the foregoing fifth embodiment. The ninth embodiment may further reduce unintended waiting times during inter-node communication by taking into consideration the unequal communication delays due to different physical distances between nodes. That is, the proposed system performs relatively slow communication between virtual nodes in the first place, and then proceeds to relatively fast communication within each virtual node. This feature of the ninth embodiment makes it easy to parallelize the communication, thus realizing a more efficient procedure for duplicating data elements.


(i) Tenth Embodiment

This section describes a tenth embodiment with the focus on its differences from the foregoing third to ninth embodiments. See the previous description for their common elements and features. The tenth embodiment executes triangle joins in a different way from the one discussed in the ninth embodiment. This tenth embodiment may be implemented in an information processing system with a structure similar to that of the ninth embodiment.



FIG. 45 illustrates a node coordination model according to the tenth embodiment. When executing triangle joins, the tenth embodiment handles a plurality of virtual nodes as if they are logically arranged in the form of a square array. Specifically, the array of virtual nodes has a height and width of H=2K+1, where K is an integer greater than zero. The information processing system determines the row dimension H, depending on the number N of virtual nodes available for its data processing. This determination may be made by using the method described in the ninth embodiment, taking into account that the row dimension H is an odd number in the case of the tenth embodiment. Further, the tenth embodiment handles these virtual nodes as if they are logically connected in a torus topology. More specifically, it is assumed that virtual node i1n sits on the right of virtual node iHn, and that virtual node 1jn is immediately below virtual node Hjn.
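Under the assumption that the rule of the ninth embodiment is simply restricted to odd values, the row dimension H=2K+1 may be chosen as in the following sketch; the function name and the exact restriction are illustrative assumptions.

```python
# Illustrative sketch: choose the row dimension H = 2K + 1 as the largest
# odd integer whose square does not exceed the number of available virtual
# nodes (the ninth embodiment's rule restricted to odd values).

from math import isqrt

def odd_row_dimension(n_virtual_nodes):
    h = isqrt(n_virtual_nodes)
    return h if h % 2 == 1 else h - 1

if __name__ == "__main__":
    print(odd_row_dimension(9))    # 3 (K = 1), as in FIG. 47
    print(odd_row_dimension(30))   # 5 (K = 2)
```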


Each virtual node includes a plurality of nodes logically arranged in the form of a square array with a width and height of h=2k+1. This row dimension parameter h is common to all virtual nodes. The information processing system determines the row dimension h, depending on the number of nodes per virtual node. The determination may be made by using the method described in the ninth embodiment, taking into account that the row dimension h is an odd number in the case of the tenth embodiment. Further, the nodes in each virtual node are handled as if they are logically connected in a torus topology. Dataset A is divided and assigned across all the nodes n11, . . . , nhh included in participating virtual nodes 11n, . . . , HHn, so that the data elements are distributed evenly (or near evenly) to those nodes without needless duplication. The data elements are then duplicated from virtual node to virtual node. Subsequently the data elements are duplicated within each closed domain of the virtual nodes.



FIG. 46 is a flowchart illustrating an exemplary procedure of joins according to the tenth embodiment. Each step of the flowchart will be described below.


(S81) Based on the number of virtual nodes available for computation of triangle joins, the system control unit 212 determines the row dimension H of virtual nodes and defines their logical connections. The system control unit 212 also determines the row dimension h of nodes as a common parameter of virtual nodes.


(S82) The system control unit 212 divides dataset A specified by the client 31 into as many subsets as the number of virtual nodes that participate in a triangle join. The system control unit 212 assigns the resulting subsets to those virtual nodes. In each virtual node, the virtual node control unit subdivides the assigned subset into as many smaller subsets as the number of nodes in that virtual node and assigns the divided subsets to those nodes. The input dataset A is distributed to a plurality of nodes as a result of the above operation. As an alternative, the assignment of dataset A may be performed upon a request from the client 31 before the node 21 receives a start command for data processing. As another alternative, the dataset A may be given as an output of previous data processing at these nodes. In this case, the system control unit 212 may find that dataset A has already been assigned to relevant nodes.


(S83) The system control unit 212 commands the deputy node in each virtual node to initiate "near-node relaying" and "far-node relaying" among the virtual nodes, with respect to the locations of diagonal virtual nodes. In response, the virtual node control unit of each deputy node commands each node in relevant virtual nodes to execute these two kinds of relaying operations. The execution units in such nodes relay the subsets of dataset A by communicating with their counterparts in other virtual nodes.


The near-node relaying among virtual nodes delivers data elements along a right-angled path (path #1) that runs from virtual node (i+2k)in up to virtual node iin and then turns right to virtual node i(i+k)n. The far-node relaying, on the other hand, delivers data elements along another right-angled path (path #2) that runs from virtual node (i+k)in up to virtual node iin and turns right to virtual node i(i+2k)n. Subsets assigned to the diagonal virtual nodes iin are each divided evenly (or near evenly) into two halves, such that their difference in the number of data elements does not exceed one. One half is then duplicated to the virtual nodes on path #1 by the near-node relaying, while the other half is duplicated to the virtual nodes on path #2 by the far-node relaying. The near-node relaying also duplicates subsets of virtual nodes i(i+1)n to i(i+k)n to other virtual nodes on path #1. The far-node relaying also duplicates subsets of virtual nodes i(i+k+1)n to i(i+2k)n to other virtual nodes on path #2.


(S84) The system control unit 212 commands the deputy node of each diagonal virtual node 11n, 22n, . . . , HHn to duplicate data elements within the individual virtual nodes. In response, the virtual node control unit in each deputy node commands the nodes in the relevant virtual node to execute “near-node relaying” and “far-node relaying” with respect to the locations of diagonal nodes. The execution unit in each node duplicates data elements, including those initially assigned thereto and those received from other virtual nodes, to other nodes by using the same method discussed in the eighth embodiment.


(S85) The system control unit 212 commands the deputy node of each non-diagonal virtual node to duplicate data elements within the individual virtual nodes. In response, the virtual node control unit in each deputy node commands the nodes in the relevant virtual node to execute relaying in both the row and column directions. The execution unit in each node relays subset Ax in the row direction and subset Ay in the column direction. Here, the subset Ax is a collection of data elements received during the course of relaying from one virtual node, and the subset Ay is a collection of data elements received during the course of relaying from another virtual node. In other words, data elements are duplicated within a virtual node in a similar way to the duplication in the case of exhaustive joins. Steps S84 and S85 may be executed recursively in the case where the virtual nodes are nested in a multiple-layer structure. This recursive operation may be implemented by causing the virtual node control unit to inherit the above-described role of the system control unit 212. The same may apply to step S82.


(S86) The system control unit 212 commands the deputy node of each diagonal virtual node 11n, 22n, . . . , HHn to execute a triangle join. In response, the virtual node control unit in each such deputy node commands the diagonal nodes n11, n22, . . . , nhh to execute a triangle join, while instructing the non-diagonal nodes to execute an exhaustive join. The execution unit in each diagonal node locally executes a triangle join of its own subset and stores the result in a relevant data storage unit. The execution unit in each non-diagonal node locally executes an exhaustive join between subsets Ax and Ay and stores the result in a relevant data storage unit. The subset Ax is a collection of data elements received during the course of relaying from one virtual node, and the subset Ay is a collection of data elements received during the course of relaying from another virtual node.


The system control unit 212 also commands the deputy node of each non-diagonal virtual node to execute an exhaustive join. In response, the virtual node control unit of each such deputy node commands the nodes in the relevant virtual node to execute an exhaustive join. The execution unit in each non-diagonal node locally executes an exhaustive join between subsets Ax and Ay and stores the result in a relevant data storage unit. The subset Ax is a collection of data elements received during the course of relaying in the row direction, and the subset Ay is a collection of data elements received during the course of relaying in the column direction.


(S87) The system control unit 212 sees that every participating node has finished step S86, thus notifying the requesting client 31 of completion of the requested triangle join. The system control unit 212 may further collect result data from the data storage units of nodes and send it back to the client 31. Or alternatively, the system control unit 212 may allow the result data to stay in the nodes.



FIG. 47 is a first diagram illustrating an exemplary data arrangement according to the tenth embodiment. In this example, nine virtual nodes 11n, 12n, . . . , 33n are logically arranged in a 3×3 array to execute a triangle join. Dataset A is assigned to these nine virtual nodes in a distributed manner. In other words, dataset A is divided into nine subsets A1 to A9 and assigned to nine virtual nodes 11n, 12n, . . . , 33n, respectively, where one virtual node acts as if it is a single node.



FIG. 48 is a second diagram illustrating an exemplary data arrangement according to the tenth embodiment. The foregoing relaying method of the eighth embodiment is similarly applied to the nine virtual nodes to duplicate their assigned data elements. As previously described, duplication of data elements between two virtual nodes is implemented as that between each pair of corresponding nodes.


More specifically, subset A1 assigned to virtual node 11n is divided into two halves, one being copied to virtual nodes 12n, 21n, and 31n by near-node relaying, the other being copied to virtual nodes 12n, 13n, and 21n by far-node relaying. Subset A2 assigned to virtual node 12n is copied to virtual nodes 11n, 21n, and 31n by near-node relaying. Subset A3 assigned to virtual node 13n is copied to virtual nodes 11n, 12n, and 21n by far-node relaying.


Similarly to the above, subset A5 assigned to virtual node 22n is divided into two halves, one being copied to virtual nodes 23n, 32n, and 12n by near-node relaying, the other being copied to virtual nodes 23n, 21n, and 32n by far-node relaying. Subset A6 assigned to virtual node 23n is copied to virtual nodes 22n, 32n, and 12n by near-node relaying. Subset A4 assigned to virtual node 21n is copied to virtual nodes 22n, 23n, and 32n by far-node relaying.


Further, subset A9 assigned to virtual node 33n is divided into two halves, one being copied to virtual nodes 31n, 13n, and 23n by near-node relaying, the other being copied to virtual nodes 31n, 32n, and 13n by far-node relaying. Subset A7 assigned to virtual node 31n is copied to virtual nodes 33n, 13n, and 23n by near-node relaying. Subset A8 assigned to virtual node 32n is copied to virtual nodes 33n, 31n, and 13n by far-node relaying.



FIG. 49 is a third diagram illustrating an exemplary data arrangement according to the tenth embodiment. In this example, nine virtual nodes 11n, 12n, . . . , 33n are each formed from nine nodes n11, n12, . . . , n33. Dataset A is formed from 81 data elements a1 to a81. This means that every node is uniformly assigned one data element. For example, one data element a1 is assigned to node 11n11, and another data element a81 is assigned to node 33n33.


Upon completion of initial assignment of data elements, the near-node relaying and far-node relaying are performed among the associated nodes of different virtual nodes, with respect to the locations of diagonal virtual nodes 11n, 22n, and 33n. For example, data element a1 assigned to node 11n11 is copied to nodes 12n11, 21n11, and 31n11 by near-node relaying. This node 11n11 does not undergo far-node relaying because it contains only one data element. Data element a4 assigned to node 12n11 is copied to nodes 11n11, 21n11, and 31n11 by near-node relaying. Data element a7 assigned to node 13n11 is copied to nodes 11n11, 12n11, and 21n11 by far-node relaying.



FIG. 50 is a fourth diagram illustrating an exemplary data arrangement according to the tenth embodiment. Specifically, FIG. 50 depicts the result of the duplication of data elements discussed above in FIG. 49. Note that the numbers seen in FIG. 50 are the subscripts of data elements. For example, node 11n11 has collected three data elements a1, a4, and a7. Node 12n11 has collected data elements a1, a4, and a7 as subset Ax and data elements a31 and a34 as subset Ay. Node 13n11 has collected one data element a7 as subset Ax and three data elements a55, a58, and a61 as subset Ay. Upon completion of data duplication among virtual nodes, local duplication of data elements begins in each virtual node.


Specifically, the diagonal virtual nodes 11n, 22n, and 33n internally duplicate their data elements by using the same techniques as in the triangle join of the eighth embodiment. Take node 11n11, for example. This node 11n11 has collected three data elements a1, a4, and a7. The first two data elements a1 and a4 are then copied to nodes 11n12, 11n21, and 11n31 by near-node relaying, while the last data element a7 is copied to nodes 11n12, 11n13, and 11n21 by far-node relaying. Data elements a2, a5, and a8 of node 11n12 are copied to nodes 11n11, 11n21, and 11n31 by near-node relaying. Data elements a3, a6, and a9 of node 11n13 are copied to nodes 11n11, 11n12, and 11n21 by far-node relaying.


In addition to the above, the non-diagonal virtual nodes internally duplicate their data elements in the row and column directions by using the same techniques as in the exhaustive join of the third embodiment. For example, data elements a1, a4, and a7 (subset Ax) of node 12n11 are copied to nodes 12n12 and 12n13 by row-wise relaying. Data elements a31 and a34 (subset Ay) of node 12n11 are copied to nodes 12n21 and 12n31 by column-wise relaying.



FIG. 51 is a fifth diagram illustrating an exemplary data arrangement according to the tenth embodiment. Specifically, FIG. 51 depicts an exemplary result of the above-described duplication of data elements. For example, node 11n11 has collected data elements a1 to a9. Node 11n12 has collected data elements a1 to a9 (subset Ax) and data elements a11, a12, a14, a15, and a18 (subset Ay). Node 12n11 has collected data elements a1 to a9 (subset Ax) and data elements a31, a34, a40, a43, a49, and a52 (subset Ay).


In each diagonal virtual node 11n, 22n, and 33n, the diagonal nodes n11, n22, and n33 locally execute a triangle join with the collected subsets. The non-diagonal nodes, on the other hand, locally execute an exhaustive join of subsets Ax and Ay that they have collected. For example, diagonal node 11n11 applies the map function to 45 possible combinations derived from its data elements a1 to a9. Node 11n12 applies the map function to 45 ordered pairs by selecting one of the nine data elements a1 to a9 and one of the five data elements a11, a12, a14, a15, and a18. Node 12n11 applies the map function to 54 ordered pairs by selecting one of the nine data elements a1 to a9 and one of the six data elements a31, a34, a40, a43, a49, and a52.


As can be seen from the above description, the tenth embodiment duplicates data among virtual nodes for triangle joins. Then the diagonal virtual nodes internally duplicate data for triangle joins in a recursive manner, whereas the non-diagonal virtual nodes internally duplicate data for exhaustive joins. With the duplicated data, the diagonal nodes in each diagonal virtual node (i.e., diagonal nodes when the virtualization is canceled) locally execute a triangle join, while the other nodes locally execute an exhaustive join. In the example of FIG. 51, 3321 combinations of data elements derive from dataset A. The array of 81 nodes covers all these combinations, without redundant duplication.
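The pair counts quoted above can be checked with a few lines of arithmetic; the sketch below merely confirms the figures of 45, 45, 54, and 3321 and is not part of the embodiment.

```python
# Arithmetic check of the figures quoted for FIG. 51: per-node pair counts
# and the total of 3321 combinations covered by the 81 nodes.

from math import comb

assert comb(9, 2) + 9 == 45        # node 11n11: triangle join over a1..a9
assert 9 * 5 == 45                 # node 11n12: |Ax| = 9, |Ay| = 5
assert 9 * 6 == 54                 # node 12n11: |Ax| = 9, |Ay| = 6
assert comb(81, 2) + 81 == 3321    # all combinations of the 81-element dataset A
print("counts consistent")
```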


The above-described tenth embodiment makes it easier to parallelize the communication even in the case where a triangle join is executed by a plurality of nodes connected via a plurality of different switches. The proposed information processing system enables efficient duplication of data elements similarly to the ninth embodiment. The tenth embodiment is also similar to the eighth embodiment in that data elements are distributed to a plurality of nodes as evenly as possible for execution of triangle joins. It is therefore possible to use the nodes efficiently in the initial phase of data duplication.


According to an aspect of the embodiments, the proposed techniques enable efficient transmission of data elements among the nodes for their subsequent data processing operations.


All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.

Claims
  • 1. A method of distributed processing, comprising: assigning data elements to a plurality of nodes sitting at node locations designated by first-axis coordinates and second-axis coordinates in a coordinate space, the node locations including a first location that serves as a base point on a diagonal line of the coordinate space, second and third locations having the same first-axis coordinates as the first location, and fourth and fifth locations having the same second-axis coordinates as the first location;performing first, second, and third transmissions, with each node location on the diagonal line which is selected as the base point, wherein: the first transmission transmits the assigned data elements from the node at the first location as the base point to the nodes at the second and fourth locations, as well as to the node at either the third location or the fifth location,the second transmission transmits the assigned data elements from the nodes at the second locations to the nodes at the first, fourth, and fifth locations, andthe third transmission transmits the assigned data elements from the nodes at the third locations to the nodes at the first, second, and fourth locations; andcausing the nodes to execute a data processing operation by using the data elements assigned thereto by the assigning and the data elements received as a result of the first, second, and third transmissions.
  • 2. The method according to claim 1, wherein: the plurality of nodes includes at least one diagonal node sitting on the diagonal line and at least one non-diagonal node sitting off the diagonal line;the diagonal node exerts the data processing operation on each combination of data elements collected by the diagonal node as the node at the first location; andthe non-diagonal node exerts the data processing operation on each combination of data elements selected from two sets of data elements collected by setting two different base points on the diagonal line.
  • 3. The method according to claim 2, wherein: the coordinate space has dimensions of (2K+1) nodes by (2K+1) nodes, where K is an integer greater than zero; andthe plurality of nodes include K nodes at the second locations, K nodes at the third locations, K nodes at the fourth locations, and K nodes at the fifth locations.
  • 4. A distributed processing system comprising: a plurality of nodes sitting at node locations designated by first-axis coordinates and second-axis coordinates in a coordinate space, the node locations including a first location that serves as a base point on a diagonal line of the coordinate space, second and third locations having the same first-axis coordinates as the first location, and fourth and fifth locations having the same second-axis coordinates as the first location,wherein the nodes are configured to perform a procedure including:performing first, second, and third transmissions, with each node location on the diagonal line which is selected as the base point, wherein: the first transmission transmits data elements from the node at the first location as the base point to the nodes at the second and fourth locations, as well as to the node at either the third location or the fifth location,the second transmission transmits data elements from the nodes at the second locations to the nodes at the first, fourth, and fifth locations, andthe third transmission transmits data elements from the nodes at the third locations to the nodes at the first, second, and fourth locations; andexecuting a data processing operation by using the data elements assigned thereto by the assigning and the data elements received as a result of the first, second, and third transmissions.
Priority Claims (1)
Number Date Country Kind
2012-022905 Feb 2012 JP national