The subject matter presented herein generally relates to analyzing scenes and predicting evolution of the scene over time. More particularly, certain aspects presented herein relate to analysis of multi-object events to predict how multi-object movement evolves over time.
Understanding complex dynamic scenes, for example in team sports, is a challenging problem. This is partly because an event, such as in a game, involves not only the local behaviors of individual objects but also structural global movements. Seeing only video footage or other positioning data, it is difficult to understand the overall development of the scene and predict future events.
In summary, one aspect provides a method for predicting evolution of motions of active objects comprising: accessing active object position data stored in a memory device, said active object position data including positioning information of a plurality of individual active objects; and using one or more processors to perform: extracting a plurality of individual active object motions from the active object position data; constructing a motion field using the plurality of individual active object motions; and using the motion field to predict one or more points of convergence at one or more spatial locations that active objects are proceeding towards at a future point in time.
Another aspect provides a computer program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code configured to access active object position data, said active object position data including positioning information of a plurality of individual active objects; computer readable program code configured to extract a plurality of individual active object motions from the active object position data; computer readable program code configured to construct a motion field using the plurality of individual active object motions; and computer readable program code configured to use the motion field to predict one or more points of convergence at one or more spatial locations that active objects are proceeding towards at a future point in time.
A further aspect provides a system comprising: one or more processors; and a memory device operatively connected to the one or more processors; wherein, responsive to execution of program instructions accessible to the one or more processors, the one or more processors are configured to: access active object position data, said active object position data including positioning information of a plurality of individual active objects; and extract a plurality of individual active object motions from the active object position data; construct a motion field using the plurality of individual active object motions; and use the motion field to predict one or more points of convergence at one or more spatial locations that active objects are proceeding towards at a future point in time.
The foregoing is a summary and thus may contain simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting.
For a better understanding of the embodiments, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings. The scope of the invention will be pointed out in the appended claims.
It will be readily understood that the components of the embodiments, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations in addition to the described example embodiments. Thus, the following more detailed description of the example embodiments, as represented in the figures, is not intended to limit the scope of the claims, but is merely representative of those example embodiments.
Reference throughout this specification to “embodiment(s)” (or the like) means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases “according to embodiments” or “an embodiment” (or the like) in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in different embodiments. In the following description, numerous specific details are provided to give a thorough understanding of example embodiments. One skilled in the relevant art will recognize, however, that aspects can be practiced without certain specific details, or with other methods, components, materials, et cetera. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obfuscation.
Moreover, while example embodiments are described in detail herein with reference to a particular type of scene (a sporting event), and with reference to a particular type of objects (players within the sporting event), these are merely non-limiting examples. It will be readily understood by those having ordinary skill in the art that embodiments are equally applicable to other scenes and objects, such as large crowds of people at a public event (plays, concerts and the like). Moreover, embodiments are described in detail herein with reference to use of computer vision; however, it will be readily understood by those having ordinary skill in the art that the techniques described in detail with reference to the example embodiments may be applied to data from other sources than computer vision. For example, a motion field may be derived by computer vision, but may equally be derived from other sources such as sensors, for example worn radio frequency devices, embedded pressure sensors in the ground, radar, hand annotated video, and the like.
For predicting evolution in dynamic scenes, such as in a sporting event, higher-level information can be deduced by tracking and analyzing the objects' movements, not only individually, but also as a group. It should be noted that the term “object(s)” used throughout this description takes the meaning of an “active object”, such as an object having an internal source of energy for controlling motion. Non-limiting examples of active objects include human beings, animals, or even machines that move about independently, such as robots. Herein are described embodiments that build a global flow field from objects' ground-level motions. Embodiments operate on the concept that flow on the ground reflects the intentions of the group of individual objects based on the context of the scene, and that this can be used for understanding and estimating future events.
The description now turns to the figures. The illustrated example embodiments will be best understood by reference to the figures. The following description is intended only by way of example and simply illustrates certain example embodiments representative of the invention, as claimed.
Consider for example the soccer scene in
Some primary characteristics of example embodiments described herein are thus extracting ground-level motion from individual objects' movement, which may be captured from multiple-views; generating a flow field from a sparse set of individual objects' motions (a motion field on the ground); detecting the locations where the motion field converges; and inferring the scene evolution. Various example applications for embodiments are noted throughout, and again the sport of soccer is simply used as a representative context.
Referring generally to
To get accurate tracks from multi-view video, the following challenges are addressed. View dependent analysis for player tracking using multiple cameras suffers from a data fusion problem. In addition, flow analysis may be sensitive to the perspective distortion of different views. To address these issues, an embodiment analyzes the game scenes from a top down warped image of the ground plane. The top down view is constructed by combining the warped footage (images) of each of the multiple cameras. An embodiment thus first extracts a multi-view consistent player location in the top down view by optimizing the geometric constraints (aligning the warped views).
This allows for extraction of the individual player's ground level motion. Through spatial and temporal interpolation, an embodiment combines these motions to create a dense motion field on the ground-plane. An embodiment analyzes the motion field to detect and localize important regions (referred to herein as points of convergence).
Some notations used herein are defined. Assume that there are N cameras. Let I_k (1 ≤ k ≤ N) refer to a frame of each camera and I_k^top refer to a top down image where each I_k is warped through the homography H_k^top. Additionally, x ∈ I^top denotes that x follows the coordinate of a top down view (ground field).
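As a minimal sketch of this notation, the mapping of a point from a camera view into the top down view through the homography H_k^top can be expressed in a few lines. The function name and the use of NumPy are assumptions of this example, not part of the embodiments.

```python
import numpy as np

def to_top_view(points, H_k_top):
    """Map image points x in view I_k to the top down view I_k^top
    via the homography H_k^top (hypothetical helper for illustration)."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coordinates
    warped = homog @ H_k_top.T                        # apply H_k^top
    return warped[:, :2] / warped[:, 2:3]             # back to Euclidean coordinates

# With the identity homography, points are unchanged
out = to_top_view([[10.0, 20.0]], np.eye(3))
```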
Extracting Individual Ground-Level Motion
To construct a flow field, an embodiment first extracts the ground-level motion of individual players. At each time t, this motion is defined as the velocity vector [u v]^T representing a player's movement on the ground at a 2D location x = (x, y) ∈ I^top.
To find the motion, an embodiment first detects the ground position of each player x (an optimized location near the feet in the example of soccer) at a given time t. Then, a search is made for a corresponding position in a previous frame at time t−a (where a>1 for stability and may for example be set to 5 frames). The motion velocity at time t is the difference between the two positions. Note that the individual motion is reconstructed at each time separately and does not require explicit tracking since it is only used to construct the flow field.
To find the 2D location of players on the ground, an embodiment may make use of the fact that each view (from multiple cameras) has its own vertical vanishing point (VVP) v_k, the point in the image toward which vertical lines in the scene appear to converge. In
The VVPs projected onto the ground view (top down warped image) are denoted as v̂_k = H_k^top·v_k (1 ≤ k ≤ N). In
Using background subtraction, for each pixel in each view, a confidence measure of that pixel being part of the foreground (in this example, a player) or background (in this example, a grass field) may be defined. Combining all measures from all views on the ground plane by summing their projections and normalizing, a position confidence map, PC: I^top → [0, 1], is obtained, where PC(x) is the probability that x ∈ I^top is part of the foreground.
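The combination step can be sketched as below, assuming the per-view foreground confidences have already been warped to the ground plane; normalizing by the maximum accumulated value is an assumption of this example, as the text does not specify the normalization.

```python
import numpy as np

def position_confidence(fg_maps_top):
    """Combine per-view foreground confidence maps into PC: I^top -> [0, 1].

    fg_maps_top: list of N same-shape arrays, each the per-pixel foreground
                 confidence of one camera view warped to the top-down plane.
    """
    combined = np.sum(fg_maps_top, axis=0)   # sum projections from all views
    peak = combined.max()
    if peak > 0:
        combined = combined / peak           # normalize into [0, 1]
    return combined

pc = position_confidence([np.array([[0.2, 0.8]]), np.array([[0.3, 0.7]])])
```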
In
Referring to
An embodiment defines a function G(x) and searches for its minimum inside W_init. Function G(x) is the weighted summation of the distance between a set of foreground sample points x̃_{i,k} and a line axis established by x ∈ W_init and each projected vertical vanishing point v̂_k:

G(x) = Σ_{k=1}^{N} (1/n_k) Σ_{i=1}^{n_k} PC(x̃_{i,k}) · d(x̃_{i,k}, ℓ(x, v̂_k))   (1)

where d(·, ℓ(x, v̂_k)) is the distance from a sample point to the line through x and v̂_k,
where n_k is the number of foreground samples along each direction k, and PC(x̃_{i,k}), the probability of being foreground, is used as the weight for each of the foreground sample points.
The evaluation based on G(x) may be performed over all directions simultaneously (in this case N=3). The optimal ground-level position of the player is x_opt = argmin_{x ∈ W_init} G(x).
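The search for x_opt can be sketched as follows, assuming candidate positions in W_init are enumerated on a grid and that foreground samples with their PC weights have already been collected per view. All helper names here are hypothetical.

```python
import numpy as np

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    d = b - a
    return abs(d[0] * (p[1] - a[1]) - d[1] * (p[0] - a[0])) / np.linalg.norm(d)

def G(x, samples_per_view, weights_per_view, vvp_per_view):
    """Weighted sum of distances between foreground samples and the axis
    through candidate position x and each projected VVP."""
    total = 0.0
    for samples, weights, vvp in zip(samples_per_view, weights_per_view, vvp_per_view):
        n_k = len(samples)  # dividing by n_k balances views with different counts
        for s, w in zip(samples, weights):   # w = PC(sample), foreground probability
            total += w * point_line_distance(s, x, vvp) / n_k
    return total

def best_ground_position(candidates, samples, weights, vvps):
    """x_opt = argmin over the search window W_init of G(x)."""
    return min(candidates, key=lambda x: G(x, samples, weights, vvps))
```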
To find the corresponding position x_opt^{t−a} of the player in the previous frame t−a, an embodiment may establish a search window W_opt centered around x_opt. An embodiment may use a combination of the geometric constraints G(x)^{t−a} on the previous top down frame I_{t−a}^top and a color proximity measure C(x)^{t−a}:

x_opt^{t−a} = argmin_{x ∈ W_opt} [G(x)^{t−a} + β·C(x)^{t−a}]   (2)
C(x)^{t−a} is a normalized Bhattacharyya distance of the color (HSV) histograms between the two sets of foreground samples used for x_opt^t and x^{t−a} ∈ W_opt^{t−a}, respectively. The weighting factor β is usually very small (for example, 0.1). The use of color similarity reduces the chance of matching a different player. Once x_opt^{t−a} is found, the motion (velocity vector) at x_opt can be defined as:

[u v]^T = x_opt^t − x_opt^{t−a}   (3)
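The color proximity term can be sketched as below; normalizing the histograms and taking the distance as the square root of one minus the Bhattacharyya coefficient are common conventions assumed here rather than details specified by the text.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance between two normalized color histograms."""
    h1 = np.asarray(h1, dtype=float); h1 = h1 / h1.sum()
    h2 = np.asarray(h2, dtype=float); h2 = h2 / h2.sum()
    bc = np.sum(np.sqrt(h1 * h2))            # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))       # 0 = identical, 1 = disjoint

def match_cost(g_prev, c_prev, beta=0.1):
    """Combined geometric + color cost for a candidate in frame t-a.
    beta weights the color term; the text notes it is usually small (0.1)."""
    return g_prev + beta * c_prev
```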
Dense Motion Field Construction
For motion extraction, an embodiment outputs a sparse set of motions on the ground plane. To generate a ground-level dense flow field, the sparse motions are combined using Radial Basis Functions. Also, the flow is temporally interpolated using a weighted set of motions over time. As described herein, the motion at a location x = (x, y) ∈ I^top is defined by a velocity vector [u v]^T.
Assume that N_k individual players are detected at a given frame k; the set of their positions is then denoted as {x_1^k, x_2^k, . . . , x_{N_k}^k}.
A temporal kernel of size p is defined using a half Gaussian function. By applying the kernel to each entry of velocity over time, two N_k × 1 vectors may be constructed, which are temporally smoothed versions of u_i^k to u_i^{k−p+1} and v_i^k to v_i^{k−p+1}, respectively: U = [U_1, U_2, . . . , U_{N_k}]^T and V = [V_1, V_2, . . . , V_{N_k}]^T.
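The temporal smoothing can be sketched as below; the kernel width sigma is an assumed parameter, since the text specifies only a half Gaussian kernel of size p.

```python
import numpy as np

def half_gaussian_kernel(p, sigma=2.0):
    """Half-Gaussian temporal kernel of size p: the most recent frame gets
    the largest weight and older frames decay."""
    t = np.arange(p, dtype=float)        # 0 = current frame, p-1 = oldest
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()                   # normalize so weights sum to 1

def smooth_velocity(history, kernel):
    """Temporally smoothed entry, e.g. U_i from u_i^k ... u_i^{k-p+1}.

    history: sequence of p velocity values, newest first.
    """
    return float(np.dot(kernel, history))
```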
The problem now may be stated as follows: given a collection of n scattered 2D points {x_1, x_2, . . . , x_n} on the ground plane, with associated scalar velocity values {U_1, . . . , U_n} and {V_1, . . . , V_n}, construct a smooth velocity field that matches each of these velocities at the given locations. One can think of this solution as scalar-valued functions f(x) and g(x) such that f(x_i) = U_i and g(x_i) = V_i, respectively, for 1 ≤ i ≤ n.
For the case of interpolating the velocity in the x-direction, the interpolation function may be expressed as:
f(x) = c(x) + Σ_{i=1}^{n} λ_i φ(‖x − x_i‖)   (4)
In the above equation, c(x) is a first order polynomial that accounts for the linear and constant portions of f, λ_i is a weight for each constraint, and x_i are the locations of the scattered points (nodes). Specifically, the radial function φ is chosen as a thin plate spline, φ(r) = r² log r, as it gives C¹ continuity for smooth interpolation of the velocity field.
To solve for the set of weights λ_i so that the interpolation satisfies the constraints f(x_i) = U_i, equation (4) is evaluated at each node (for example, U_i = c(x_i) + Σ_{j=1}^{n} λ_j φ(‖x_i − x_j‖)).
Since the equation is linear in the unknowns, it can be formulated as a linear system: A·λ + P·c = U, together with the side conditions P^T·λ = 0, where λ = [λ_1, . . . , λ_n]^T, c = [c_1 c_2 c_3]^T, A is the n × n matrix with entries a_ij = φ(‖x_i − x_j‖), and P is the n × 3 matrix whose i-th row is [1 x_i y_i].
Once the system is solved, the interpolated velocity in the x-direction at any location x_a = (x_a, y_a) ∈ I^top can be evaluated as u_a = c_1 + c_2·x_a + c_3·y_a + Σ_{i=1}^{n} λ_i φ(‖x_a − x_i‖). The velocity in the y-direction is interpolated similarly. For a temporally smoother transition, the flow may be smoothed with 1×5 box filters. Such a flow is referred to herein as the motion field on the ground, and is denoted as Φ(x) = f(x)i + g(x)j = u·i + v·j. This is illustrated in
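The fit-then-evaluate pipeline above can be sketched as below. The direct dense solve of the thin plate spline system is an assumption of this example; for large n a structured solver would typically replace it.

```python
import numpy as np

def tps_phi(r):
    """Thin plate spline radial basis, phi(r) = r^2 log r, with phi(0) = 0."""
    return np.where(r > 0, r * r * np.log(np.maximum(r, 1e-12)), 0.0)

def fit_tps(points, values):
    """Solve for weights lambda and polynomial c = [c1, c2, c3] so that
    f(x) = c1 + c2*x + c3*y + sum_i lambda_i phi(||x - x_i||) matches the
    sampled velocities at the scattered player positions."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    A = tps_phi(r)                               # n x n matrix a_ij
    P = np.hstack([np.ones((n, 1)), pts])        # n x 3, rows [1, x_i, y_i]
    # Block system [[A, P], [P^T, 0]] [lambda; c] = [values; 0]
    M = np.zeros((n + 3, n + 3))
    M[:n, :n] = A
    M[:n, n:] = P
    M[n:, :n] = P.T
    rhs = np.concatenate([np.asarray(values, dtype=float), np.zeros(3)])
    sol = np.linalg.solve(M, rhs)
    return sol[:n], sol[n:]                      # lambda, c

def eval_tps(x, points, lam, c):
    """Interpolated velocity at any ground location x."""
    x = np.asarray(x, dtype=float)
    pts = np.asarray(points, dtype=float)
    r = np.linalg.norm(pts - x, axis=1)
    return c[0] + c[1] * x[0] + c[2] * x[1] + float(np.dot(lam, tps_phi(r)))
```

Because the polynomial part reproduces linear fields exactly, velocities sampled from a linear field are recovered everywhere, not just at the nodes.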
Detecting Points of Convergence
Using the sport of soccer as a representative example, the motion field has been defined as a global or group tendency reflecting the play (or the strategy or intention of the players). In this context, a point of convergence (POC) is defined as the spatial location that play evolution is proceeding toward in the near future. Embodiments provide for detection of POC(s) of the game by finding locations where the motion field merges.
Point of convergence detection may be implemented in two steps. First, the motion field on the ground, Φ, is used to propagate a confidence measure forward to calculate an importance table, Ψ, whose size is the same as I^top. Then, the accumulated confidences are clustered and a Gaussian Mixture Model is used to detect POC clusters.
The confidence value is defined as the local magnitude of velocity at any location on the ground. In a first step, this value is propagated (copied) at a fixed time t from each starting location through Φ. Then, these values are accumulated along the trace in an importance table Ψ. Given a location x = (i, j) ∈ I_t^top, Ψ is calculated by performing a forward propagation recursively based on the motion field Φ. The magnitude of the velocity, ρ_ij² = u_ij² + v_ij², is propagated by updating Ψ as follows: Ψ(i+u_ij, j+v_ij) = Ψ(i+u_ij, j+v_ij) + ρ_ij. This forward propagation is continued along the motion field until the attenuation that is proportional to ρ_ij is smaller than ε (converges close to zero). Consequently, locations having a large ρ in Φ can have a large influence on far away locations as long as the motion field moves in that direction.
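The forward propagation can be sketched as below; the multiplicative decay factor is an assumed stand-in for the attenuation proportional to ρ described in the text, and nearest-cell rounding is an assumed discretization.

```python
import numpy as np

def importance_table(u, v, eps=1e-3, decay=0.9):
    """Forward-propagate velocity magnitudes along the motion field Phi and
    accumulate them into an importance table Psi."""
    h, w = u.shape
    psi = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            val = np.hypot(u[i, j], v[i, j])   # rho_ij, the local magnitude
            ci, cj = float(i), float(j)
            while val > eps:
                ii, jj = int(round(ci)), int(round(cj))
                if not (0 <= ii < h and 0 <= jj < w):
                    break                      # trace left the ground plane
                if np.hypot(u[ii, jj], v[ii, jj]) < eps:
                    break                      # the field vanishes: stop
                ci, cj = ci + u[ii, jj], cj + v[ii, jj]  # follow the flow
                ii, jj = int(round(ci)), int(round(cj))
                if 0 <= ii < h and 0 <= jj < w:
                    psi[ii, jj] += val         # deposit the confidence
                val *= decay                   # attenuate along the trace
    return psi
```

On a field whose vectors all point toward one cell, that cell accumulates the largest value in Ψ, which is what the clustering step then picks up.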
Thus, the accumulated distribution of confidence is computed by determining confidence propagation for any location in I^top. To determine the location and the number of POCs at a given frame k, meanshift clustering may be applied to find an optimal number of clusters. Based on the initial means and the number of clusters (modes), a Gaussian Mixture Model is fit to the distribution of those regions using Expectation Maximization (EM). Note that POC detection is different from classical singular (critical) point detection: a POC is a global measurement of the flow, while a critical point is a local extremum of the velocity potential and the stream functions.
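A sketch of the two-stage clustering using scikit-learn, which is an assumed dependency; the text names the algorithms (meanshift, then a Gaussian mixture fit with EM) but no library. Points are assumed to be sampled from the importance table Ψ in proportion to accumulated confidence.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.mixture import GaussianMixture

def detect_pocs(samples, bandwidth=1.0):
    """Find POC clusters: mean-shift picks the number of modes and their
    initial means, then a Gaussian mixture is refined with EM."""
    X = np.asarray(samples, dtype=float)
    ms = MeanShift(bandwidth=bandwidth).fit(X)
    centers = ms.cluster_centers_
    gmm = GaussianMixture(n_components=len(centers), means_init=centers,
                          random_state=0).fit(X)
    return gmm.means_, gmm.covariances_
```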
The divergence of the motion field Φ(x) at the location x ∈ I^top is defined as ∇·Φ = ∂u/∂x + ∂v/∂y.
If ∇·Φ is negative, the flux of the motion field across the boundary of the region is inward; if positive, it is outward. Thus, if the motion field flows toward the boundary of a specific region, the divergence of the region becomes negative.
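On the discrete ground-plane grid, the divergence can be approximated with finite differences; a purely inward field such as u = −x, v = −y has constant divergence −2, the signature of a sink.

```python
import numpy as np

def divergence(u, v):
    """Divergence of the motion field Phi = (u, v): du/dx + dv/dy,
    approximated by finite differences on the ground-plane grid."""
    du_dx = np.gradient(u, axis=1)   # x varies along columns
    dv_dy = np.gradient(v, axis=0)   # y varies along rows
    return du_dx + dv_dy
```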
In practice, many of the detected POCs exist in regions where the local measurement of divergence becomes negative because the POC proceeds in the direction of the motion field flow. Therefore, in many cases, a POC exists where there is a significant singular sink point. However, if the majority of flows in a specific scene are not regular enough to construct an apparent local extremum, determining a POC by detection of singularities will fail. In such cases, a forward-propagation method (as described herein) can still locate regions of major global flows that signify positions where the play evolution may proceed to.
Embodiments may be used in a wide variety of applications. Two example applications in the sports domain may include automatic device control (such as controlling cameras, lighting devices, et cetera) and visualization analysis. For automatic camera control (which may include camera positioning and/or selection), embodiments may be utilized to estimate where the important events will happen without accessing future video frames. This is important for automated live broadcasting.
For example, camera selection may include determining one or more points of convergence in a scene, and then utilizing the point(s) to provide camera selection(s) from among a plurality of cameras. Such camera selection may be implemented in an automated way, such that a camera positioned most appropriately to capture a point of convergence is automatically selected to provide video for a broadcast. Similarly, a proposed camera selection may be provided such that a producer may choose from among cameras capturing the point of convergence, which he or she may then select manually. With regard to the example of camera positioning, to mimic human camera operators, an algorithm should control the pan and zoom of cameras and direct them towards regions of importance while maintaining smooth transitions and good field of view (FOV) framing.
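A minimal sketch of the automated selection step, assuming camera positions are known on the ground plane. Choosing the nearest camera is an assumption of this example; a production selector would also weigh orientation and field of view.

```python
import numpy as np

def select_camera(poc, camera_positions):
    """Propose the camera best placed to cover a point of convergence,
    here simply the nearest camera on the ground plane."""
    poc = np.asarray(poc, dtype=float)
    cams = np.asarray(camera_positions, dtype=float)
    dists = np.linalg.norm(cams - poc, axis=1)
    return int(np.argmin(dists))     # index of the proposed camera
```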
As an example, referring to
Thus, for example, embodiments may be utilized to provide POC detection to forecast future important events, and to control a camera by using cropped windows. Note that (1) this does not necessarily mean that the ball is always centered in the frame, which may provide richer context of the scene, and (2) the movement of the FOV may be smoothed based on the play evolution.
For sports visualization and analysis, embodiments similarly provide for tracking the location, the number, and the size of POC(s) for a good indication of interesting and important events during a game. This may be a useful tool for analyzing games (for example, by coaches and trainers, or by broadcasters during a live game) to show novel views of the game. Thus, the various applications may include visualization and analysis such as steering a crop window in the video images; defining a region of interest to a virtual camera (for example, taking a region out of an image to synthesize a new view); assisting in play analysis associated with the video images; visualizing and analyzing player movements from the video images; providing one or more virtual indicators (for example, arrows) showing where one or more players are moving; and/or providing predictive visualization information.
As described herein, evaluations of sports scenes captured by video images are used as non-limiting examples. Embodiments may access data from other sources, such as positioning data for objects tracked in a variety of ways. These objects may be tracked as they take part in other types of evolving scenes, such as movement of large crowds and the like.
It will be readily understood that embodiments may be implemented as a system, method, apparatus or computer program product. Accordingly, various embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects. Furthermore, embodiments may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied therewith.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a non-signal computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. A computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations of various embodiments may be written in any combination of one or more programming languages, including an object oriented programming language such as Java™, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer (device), partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. The remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer.
It will be understood that the embodiments can be implemented by a computer executing a program of instructions. These computer program instructions may be provided to a processor of a special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the functions/acts specified.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, et cetera, to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified.
Referring to
Components of computer 810 may include, but are not limited to, at least one processing unit 820, a system memory 830, and a system bus 822 that couples various system components including the system memory 830 to the processing unit(s) 820. The computer 810 may include or have access to a variety of computer readable media. The system memory 830 may include computer readable storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) and/or random access memory (RAM). By way of example, and not limitation, system memory 830 may also include an operating system, application programs, other program modules, and program data.
A user can interface with (for example, enter commands and information) the computer 810 through input devices 840. A monitor or other type of device can also be connected to the system bus 822 via an interface, such as an output interface 850. In addition to a monitor, computers may also include other peripheral output devices. The computer 810 may operate in a networked or distributed environment using logical connections (network interface 860) to other remote computers or databases (remote device(s) 870). The logical connections may include a network, such as a local area network (LAN) or a wide area network (WAN), but may also include other networks/buses.
Thus, it is to be understood that certain embodiments provide systems, methods, apparatuses and computer program products configured for analyzing scenes. Certain embodiments focus on methods to build a global flow field from objects' (such as players) ground-level motions. Certain embodiments utilize the flow on the ground as it reflects the intentions of the group of individual objects based on the context (such as a game), and use this for understanding and estimating future events. Various example embodiments have been described in further detail herein. The details regarding the example embodiments provided are not intended to limit the scope of the invention but are merely illustrative of example embodiments.
This disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limiting. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to explain principles and practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.
Although illustrative embodiments of the invention have been described herein, it is to be understood that the embodiments of the invention are not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the disclosure.
This application claims priority to U.S. Provisional Application Ser. No. 61/319,242, entitled “SYSTEMS AND METHODS FOR UTILIZING MOTION FIELDS TO PREDICT EVOLUTION IN DYNAMIC SCENES”, which was filed on Mar. 30, 2010, and which is incorporated by reference herein.
Number | Date | Country | |
---|---|---|---|
20110242326 A1 | Oct 2011 | US |
Number | Date | Country | |
---|---|---|---|
61319242 | Mar 2010 | US |