This disclosure generally relates to performing physical computations, and more specifically with simulations of rigid and flexible bodies using an array processor.
Physics computations are broadly used in games and may be based on the rules of classical mechanics to compute the state of bodies at each point in time. As long as bodies do not collide, inertia, gravitation and acceleration may be used to compute the next state of bodies. Such computations may be performed in parallel by means of dedicated parallel hardware. An example of such computation is shown in U.S. patent application Ser. No. 10/715,440, publication number 20050075849, entitled “Physics Processing Unit” by Maher et al., and which is hereby incorporated by reference for all of the useful information it may contain.
Real world simulation, however, do involve collisions between pairs of bodies. For a colliding pair of bodies, contact resolution involves finding the contact point and computing the forces acting at on that point on each body in the pairs of bodies. These additional forces are integrated with the rest of each body parameters for each body in the pairs of bodies, and are used to compute its next state.
Currently available methods, based on parallel hardware, cannot efficiently detect collisions in parallel because of the huge number of pairs of bodies typical in many scenarios in general, and in games in particular. Conventionally, this complex problem is solved as a sequential process by sorting the bodies in the scene three times, according to their projection on the three axes in the three dimensional space, and detecting collisions in each axis separately. This complex problem may be accomplished in the order of O(N log N) times.
Alternatively, dedicated data structures are built to maintain pairs of colliding candidates, with an amortized update time of O(log N). If there is a separation between two bodies in their respective projection on at least one of the axes, according to this method these two bodies cannot collide. Such a broad phase collision detection process, for culling away objects that cannot possibly collide, rules out the vast majority of body pairs, leaving a narrow phase collision detection process, for accurate collision detection, to be performed only on a fraction of the body pairs. This process may be parallelized easily by existing hardware.
However, either the dependency on a sequential process for the broad phase collision detection, or the application of narrow phase collision detection on every possible pair of bodies in the scene, create a severe bottleneck to the acceleration of physics computations in general and real world simulation for games in particular.
It would be advantageous to provide a solution that overcomes the deficiencies in conventional approaches to the above described problem. It would be further advantageous that such a solution would perform in an order that is lower than the best order provided by conventional technologies.
To realize some of the advantages described above, there is provided an apparatus for physical properties computation comprising an array processor. The array processor comprises of a plurality of processing elements, said processing elements arranged in a grid. A processing unit (PU) is coupled to the array processor. A local memory is coupled to the PU. The PU broadcasts data to rows of said processing elements in said grid, and performs physical computations in an order of complexity of O((√N) log N).
More specifically, the grid is a n-by-n 1-grid.
More specifically, at least a processing element of said plurality of processing elements further comprises a local register file.
More specifically, the physical computations comprise real-world simulations.
Still more specifically, the real-world simulations are simulations for games.
Still more specifically, the real-world simulations further comprise concurrent processing of broad phase collision detection.
Still more specifically, the real-world simulation further comprise sorting in concurrent the results of said broad phase collision detection for each axis from a starting point.
Still more specifically, sorting the results further comprises a concurrent two-dimensional array sorting.
Still more specifically, said array sorting is a Shear sorting.
Still more specifically, the real-world simulation further comprise parallel processing of narrow phase collision detection.
Still more specifically, the narrow phase collision detections comprises triangle-triangle intersection.
Another aspect of the disclosed teachings is a computerized method for performing physical properties computation comprising loading processing elements of an array processor with a description of bodies. A broad phase collision detection for each axes by using concurrent processing on the array processor. The results of the broad phase collision detection are stored for each axis from a starting point by using concurrent processing on said array processor. A narrow phase collision detection is performed on the sorted results by using parallel processing on said array processor. The array processors are coupled to a processing unit (PU) which is further coupled to a local memory and the array processors performs the physical properties computation in an order of complexity of O((√N) log N).
More specifically, the technique further comprises determining for each colliding pair from said bodies forces subjected on each such body by using parallel processing on said array processor. The steps of the method are repeated if additional states are to be determined.
For better understanding of the disclosed teachings and to show how the same may be carried into effect, reference will now be made, purely by way of example, to the accompanying drawings. The particulars shown in the figures are by way of example and for purposes of illustrative discussion of the teachings, and are presented in the cause of providing what is believed to be most useful and readily understood description of the principles and conceptual aspects of the teachings. In this regard, no attempt is made to show structural details in more detail than is necessary for a fundamental understanding of the teaching, the description taken with the drawings making apparent to those skilled in the art how the teaching maybe embodied in practice. In the accompanying drawings:
FIG. 1—is a high-level block diagram an aspect of the disclosed teachings;
FIG. 2—is a top-level architecture of an exemplary graph-processing unit;
FIG. 3—is an architecture of an exemplary control unit;
FIG. 4—is an architecture of propagation unit;
FIG. 5—is a high-level block diagram of the disclosed teachings;
FIG. 6—is a top-level architecture of the array of processor elements;
FIG. 7—is a flowchart describing the method for performing physical computations in accordance with the disclosed teachings.
An apparatus and a top-to-bottom concurrent method for classical mechanics computations on a concurrent processor is shown. In one embodiment it is used on rigid bodies that are approximated by decomposition into geometrical bodies for which intersection computation is easy. Decomposition may be into triangles in a two dimensional space, while in another embodiment the decomposition may be into triangular pyramids. One aspect of the invention is a concurrent method and apparatus for physics computations, performed in an order of complexity of O((√N) log N).
An exemplary system dedicated for such artificial intelligence (AI) tasks is described herein. This system could be embodied in a semiconductor chip. The system contains some or all of the followings: processors, configurable program memory, data memory, bus interface, dedicated logic for processing AI techniques, and a graph-processing unit.
The graph-processing unit holds a network of interconnected node, each of which comprises at least one digitally programmable delay. The network represents the weighted graph, where the delays act as the edges.
The delay is formed by a single counter in each node, a dedicated memory, also referred to as edges memory, and a comparator element on each edge between nodes. The edge is triggered once the memory is equal to the counter. This physical realization of a weighted graph is then used for searching minimal paths in a reduced time by injecting an electromagnetic pulse at the start node and letting it propagate through the entire network in parallel in accordance with the predetermined delays. Resetting all these counters allows the performing of a new test without the need to reload the graph representation.
The disclosed teaching is further aimed at allowing access to the system in one or more of the following manners: configuring the graph processing unit with one or more terrain representations (raster maps, navmesh, etc.), search-path queries, and terrain analysis queries, and other appropriate applications. The results are stored and accessible to the computer program. The graph-processing unit is supplemented with an embedded processor and dedicated logic for performing post-processing of the path-searching and terrain analysis queries. Accordingly the graph processing unit may be used iteratively to process the results with the aid of additional data memory.
In an embodiment of the disclosed teaching the embedded processor/s are used to manage and run the queries in batch mode. Each query in the batch is decoded, executed and answers stored in memory, the answers can be retrieved together or separately, as may be necessary. The processor can also change the order of queries in the batch for optimization. For example, it is possible to gather all queries for the same map, and units of the same size, into a single query.
The disclosed teaching further allows for the finding of the T-connected region connectivity in a highly efficient manner, by allowing the embedded processor to halt the propagation in the propagation unit inside the graph-processing unit after time T, and retrieve the nodes that the signal arrived at.
An exemplary implementation of a system with an architecture embodying the disclosed teaching is presented herein. Such a system contains processor/s, graph-processing unit, AI dedicated logic and peripherals (memory, interfaces, etc.). It is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrations in the drawings.
Reference is now made to
The AI dedicated logic 150 is similarly connected to local bus 170 and performs preprocessing and post-processing of the AI queries, for example path-smoothing or string-pooling queries. The specific protocol used to connect bus interface unit 160 with its respective user may vary and should not be considered as limiting the scope of the disclosed teaching.
In referring to
With reference to
Execution ends when either test-counter 315 reaches a predefined time T in time limited queries, or, when the propagating signal in propagation unit 22 reaches a destination node indicated by destination arrival signal 360. At that point, query execution control 310 disables signal propagation of propagation unit 220 by stopping to assert propagation enable signal 350. Query execution control 310 then activates read result signal 370 to result retrieve unit 320, thereby indicating the ability to start reading the results from the propagation unit 220.
Reference is now made to
Each path-finding query begins by first resetting the state of all the nodes 410 by asserting signal 330, followed by selecting the nodes from which the signal will start to propagate. Thereafter propagation is enabled by asserting propagation enable signal 350. Each node 410 contains a counter that represents the time passed since the node was first reached, and holds information about the neighbor node from which the signal first arrived. This allows, once test execution is complete, the back tracing of the shortest-path to every node from the origins of propagation. Multiple tests of the same terrain representation can be achieved consuming minimal time by simply repeating the test flow once for each new test.
In order to better understand the description of the inventions disclosed herein a brief description of an array processor as described in
It should be noted that a graph, denoted by G, is a mathematical object defined by two sets, namely, a set of vertices denoted by V(G) and a set of edges denoted by E(G), where E(G) is a set of pairs (u,v) where both u and v are in V(G). A graph H is called a subgraph of G if V(H) is a subset of V(G). An m·n d-grid G is a graph of m·n vertices where V(G) is the set {(i,j): 1≦i≦m, 1≦j≦n}, and where E(G) is the set {((i,j),(k,l)): 1≦i,k≦n, 1≦j,l≦m, |i−k|≦d, |j−l|≦d, i#k or j#1}. For the scope of this disclosure, any graph isomorphic to an m·n d-grid is also considered an m·n d-grid. Concurrent processors, such as an array processor, may be arranged as graphs, wherein every processing element is uniquely mapped to a vertex in the graph, and two processing elements may access each others internal data only if there is an edge between their corresponding vertices in the graph. In the present invention, concurrent processors that are arranged as grids are considered, and are referred to as array processors.
The method disclosed herein is a few step long process, the process being performed by a device that consists of an array processor, local memory, and a central processing unit (CPU), as shown for example in
The method for physics simulation starts with N=n2 bodies whose descriptions are stored in the local memory of the device, and further described with respect to the exemplary and non-limiting implementation shown in
The next step of the method is a narrow phase collision detection (S740), that is performed using the parallel processing capabilities of the system. It performs a more accurate method, such as the Tropp-Tal-Shimshoni algorithm, for triangle-triangle intersection. This algorithm checks every pair of collision candidates that survive the broad phase collision detection as possible candidates. This step is performed by loading in each processing element of the array processor a pair of bodies, and applying the Tropp-Tal-Shimshoni algorithm in parallel. In one embodiment, the loading of pairs may be done by moving a copy of every body's data back in sorted list one step, and performing Tropp-Tal-Shimshoni algorithm for the pairs that survived the broad phase collision detection. In another embodiment, all bodies are uploaded from the array processor to the local memory of the device, and then only pairs of bodies that survived the broad phase collision detection are downloaded back to the array processor, each pair in a different processing element.
The Tropp-Tal-Shimshoni algorithm, as well as many other algorithms for triangle-triangle intersection, performs many inner products. In one embodiment of the present invention, an inner product operation on floating point coordinates may be performed by first adjusting all mantissas according to the exponent of the result. This may be done for two three-dimensional vectors u and v, by taking the maximum over i of exponent(ui)+exponent(vi). The mantissas are then adjusted accordingly, for example, if i is the coordinate for which exponent(ui)+exponent(vi) is maximal, and exponent(ui)+exponent(vi)-exponent(uj)-exponent(vj) is 6, then uj and vj are each shifted right by 3. The inner product is then performed on the mantissas as integers without the need to normalize after every addition and every multiplication.
Finally, when collision detection is performed accurately, the method continues to compute for colliding pairs the forces they are subject to (S370). S370 is performed using the parallel processing capabilities of the system. Between any pair of colliding surfaces, forces orthogonal to the surfaces are applied. The computation of these forces is also done in parallel, following the narrow phase collision detection. Once this is done, the system is set for the computation of its next state (S370).
In one embodiment, the physics computations continue by integration of all parameters applied to bodies. Next state of every body in the system is computed using motion formulae and constraints. These computations are applied to each body separately, thus, are carried out completely in parallel, where each body's next state is computed by a different processing element in the array processor.
With respect to the disclosed invention and without limiting the scope of the invention, the following is disclosed: (1) a method for embedding of N=n2 bodies in a concurrent processor arranged as an n·n 1-grid G such that physical computation are performed concurrently; (2) a method according to (1), wherein physical computations are real world simulations; (3) a method according to (2), wherein real world simulations are real world simulations for games; (4) a method and apparatus for concurrent physics computations; (5) a method and apparatus according to (4), wherein physical computations are real world simulations; (6) a method and apparatus according to (5), wherein real world simulations are real world simulations for games; (7) a method and apparatus according to (5) wherein real world simulations comprise broad phase collision detection; (8) a method and apparatus according to (7) wherein broad phase collision detection comprises sorting according to each axis of the three-dimensional space by means of a concurrent two-dimensional array sorting; (9) a method and apparatus according to (8) wherein sorting is Shear sorting; (10) a method and apparatus according to (5) wherein real world simulations comprise narrow phase collision detection; (11) a method and apparatus according to S10 wherein narrow phase collision comprises triangle-triangle intersection; (12) a method and apparatus according to (5) wherein real world simulations comprise contact resolution; (13) a method and apparatus according to (5) wherein real world simulations comprise classical mechanics computations; (14) a method and apparatus according to (13) wherein classical mechanics computations comprise motion formulae computations; and, (15) a method and apparatus according to S13 wherein classical mechanics computations comprise Newton laws constraints.
The invention disclosed hereinabove may be implemented in software, firmware, hardware, or any combination thereof. When implemented as software or certain types of firmware, the software contains instructions executable on a system, for example, a computer, that is enabled to execute the instructions so as to perform the methods and outcomes thereof, of the disclosed inventions. Other implementations of the principles disclosed hereinabove are also envisioned by those of regular skill-in-the-art, and are specifically and intentionally included as part of the disclosure. Therefore, the scope of the inventions should be only limited by the scope of the respective claims.
This application is a Continuation of U.S. patent application Ser. No. 12/785,837, filed on May 24, 2010, which is a Continuation-in-part of U.S. patent application Ser. No. 12/207,680, filed on Sep. 10, 2008, which is a divisional patent application of now U.S. Pat. No. 7,440,447, and that further claims priority from U.S. provisional patent application 60/555,975 filed on 25 Mar. 2004, all of which Applications are incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
60555975 | Mar 2004 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 11089029 | Mar 2005 | US |
Child | 12207680 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 12785837 | May 2010 | US |
Child | 14070968 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 12207680 | Sep 2008 | US |
Child | 12785837 | US |