Various aspects of the present disclosure relate generally to machine learning for sports applications. In particular, various aspects relate to systems and methods for performing high-fidelity sports tracking based on a transformer-based model and a diffusion model.
With the rising popularity of sports, there is an increased desire for accurate, granular predictions of what will occur during a sporting event. For example, predicting the number of passes or shots that a particular soccer player (e.g., Lionel Messi) will have in a given game (e.g., a World Cup final), both prior to and during the game, can be of particular interest to members of the media, broadcasts (whether on the primary feed or a second-screen experience), sportsbooks, and fantasy/gamification applications. Existing solutions are unable to accurately make such predictions. In particular, existing solutions may be unable to accurately predict the trajectory of one or more players in a game due to, for example, long-term occlusions and tracking errors in broadcast footage. Hence, new solutions are needed.
Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art, or suggestions of the prior art, by inclusion in this section.
In some aspects, the techniques described herein relate to a method for tracking one or more individuals during a sporting event, the method including: receiving, as an input, geospatial data of a sporting event; receiving, as an input, labeled event data based on sports broadcast footage of the sporting event; performing multi-object tracking of one or more agents of the received geospatial data to determine one or more vectors; inputting the labeled event data and one or more vectors into a diffusion model; and determining, using the diffusion model, one or more trajectory sequences for the one or more agents.
In some aspects, the techniques described herein relate to a method, wherein the labeled event data includes a sequential stream of one or more major events throughout a sporting event, the major events including at least one of a pass, shot, tackle, foul, turnover, penalty, goal, score, or substitution from the sporting event, and wherein the geospatial data includes sports broadcast footage, in-venue footage, global positioning system (GPS) data, near field communication (NFC) data, and/or radio-frequency identification (RFID) data.
In some aspects, the techniques described herein relate to a method, wherein the one or more vectors include at least one of an agent's two-dimensional coordinates on a sporting event's field, an agent position, an agent team, an indicator indicating that the agent is a ball, or player visibility information.
In some aspects, the techniques described herein relate to a method, wherein the event data is represented as a two-dimensional spatiotemporal grid, the grid representing a stacking of each player's events.
In some aspects, the techniques described herein relate to a method, wherein the diffusion model applies spatiotemporal axial attention on the received event data and one or more vectors, where self-attention is applied across the temporal and spatial axes separately.
In some aspects, the techniques described herein relate to a method, wherein the diffusion model includes: an event encoder; and a tracking decoder, wherein the event encoder encodes the labeled event data and the tracking decoder conditionally decodes trajectory sets.
In some aspects, the techniques described herein relate to a method, wherein the event encoder embeds the event data, embedding the event data further including: tokenizing the labeled event data using a linear projection; applying sinusoidal positional embeddings to specify temporal occurrences of the event data; processing the event data with stacked encoders; and outputting event embeddings.
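The embedding steps recited above can be illustrated with a minimal NumPy sketch. The projection matrix, model width, and embedding constants below are illustrative assumptions, not parameters fixed by the disclosure; a trained transformer would learn the projection.

```python
import numpy as np

def sinusoidal_embedding(length: int, dim: int) -> np.ndarray:
    """Sinusoidal positional embeddings (illustrative constants)."""
    pos = np.arange(length)[:, None]            # (L, 1) event indices
    i = np.arange(dim // 2)[None, :]            # (1, D/2) frequency indices
    angles = pos / (10000.0 ** (2 * i / dim))   # (L, D/2)
    emb = np.zeros((length, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

def embed_events(events: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Tokenize raw event features with a linear projection, then add
    positional embeddings marking when each event occurred."""
    tokens = events @ proj                      # linear projection to model dim
    return tokens + sinusoidal_embedding(len(events), proj.shape[1])

rng = np.random.default_rng(0)
events = rng.normal(size=(16, 8))   # 16 events, 8 raw features each (assumed)
proj = rng.normal(size=(8, 32))     # hypothetical projection to model dim 32
emb = embed_events(events, proj)    # (16, 32) event embeddings
```

The stacked transformer encoders recited in the claim would then process `emb`; they are omitted here for brevity.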
In some aspects, the techniques described herein relate to a method, wherein the tracking decoder uses attention to embed and fuse the one or more vectors with the event embeddings.
In some aspects, the techniques described herein relate to a method, further including: a second tracking decoder; and a transpose temporal convolution, the temporal convolution being configured to expand trajectories to their initial temporal dimensionality.
In some aspects, the techniques described herein relate to a system for tracking one or more individuals during a sporting event, the system including: a non-transitory computer readable medium configured to store processor-readable instructions; and a processor operatively connected to the non-transitory computer readable medium and configured to execute the instructions to perform operations including: receiving, as an input, geospatial data of a sporting event; receiving, as an input, labeled event data based on sports broadcast footage of the sporting event; performing multi-object tracking of one or more agents of the received geospatial data to determine one or more vectors; inputting the labeled event data and one or more vectors into a diffusion model; and determining, using the diffusion model, one or more trajectory sequences for the one or more agents.
In some aspects, the techniques described herein relate to a system, wherein the labeled event data includes a sequential stream of one or more major events throughout a sporting event, the major events including at least one of a pass, shot, tackle, foul, turnover, penalty, goal, score, or substitution from the sporting event, and wherein the geospatial data includes sports broadcast footage, in-venue footage, global positioning system (GPS) data, near field communication (NFC) data, and/or radio-frequency identification (RFID) data.
In some aspects, the techniques described herein relate to a system, wherein the one or more vectors include at least one of an agent's two-dimensional coordinates on a sporting event's field, an agent position, an agent team, an indicator indicating that the agent is a ball, or player visibility information.
In some aspects, the techniques described herein relate to a system, wherein the event data is represented as a two-dimensional spatiotemporal grid, the grid representing a stacking of each player's events.
In some aspects, the techniques described herein relate to a system, wherein the diffusion model applies spatiotemporal axial attention on the received event data and one or more vectors, where self-attention is applied across the temporal and spatial axes separately.
In some aspects, the techniques described herein relate to a system, wherein the diffusion model includes: an event encoder; and a tracking decoder, wherein the event encoder encodes the labeled event data and the tracking decoder conditionally decodes trajectory sets.
In some aspects, the techniques described herein relate to a system, wherein the event encoder embeds the event data, embedding the event data further including: tokenizing the labeled event data using a linear projection; applying sinusoidal positional embeddings to specify temporal occurrences of the event data; processing the event data with stacked encoders; and outputting event embeddings.
In some aspects, the techniques described herein relate to a system, wherein the tracking decoder uses attention to embed and fuse the one or more vectors with the event embeddings.
In some aspects, the techniques described herein relate to a system, further including: a second tracking decoder; and a transpose temporal convolution, the temporal convolution being configured to expand trajectories to their initial temporal dimensionality.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium configured to store processor-readable instructions for tracking one or more individuals during a sporting event, wherein the instructions, when executed, perform operations including: receiving, as an input, geospatial data of a sporting event; receiving, as an input, labeled event data based on sports broadcast footage of the sporting event; performing multi-object tracking of one or more agents of the received geospatial data to determine one or more vectors; inputting the labeled event data and one or more vectors into a diffusion model; and determining, using the diffusion model, one or more trajectory sequences for the one or more agents.
In some aspects, the techniques described herein relate to a non-transitory computer readable medium, wherein the labeled event data includes a sequential stream of one or more major events throughout a sporting event, the major events including at least one of a pass, shot, tackle, foul, turnover, penalty, goal, score, or substitution from the sporting event, and wherein the geospatial data includes sports broadcast footage, in-venue footage, global positioning system (GPS) data, near field communication (NFC) data, and/or radio-frequency identification (RFID) data.
Additional objects and advantages of the disclosed aspects will be set forth in part in the description that follows, and in part will be apparent from the description, or may be learned by practice of the disclosed aspects. The objects and advantages of the disclosed aspects will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed aspects, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate various exemplary embodiments and together with the description, serve to explain the principles of the disclosed embodiments.
Notably, for simplicity and clarity of illustration, certain aspects of the figures depict the general configuration of the various embodiments. Descriptions and details of well-known features and techniques may be omitted to avoid unnecessarily obscuring other features. Elements in the figures are not necessarily drawn to scale; the dimensions of some features may be exaggerated relative to other elements to improve understanding of the example embodiments.
Various aspects of the present disclosure relate generally to machine learning for sports applications. In particular, various aspects relate to systems and methods for a transformer network for generating trajectories of players.
According to embodiments disclosed herein, a guided diffusion model may receive as input broadcast tracking data and event data for a sporting event. The guided diffusion model may generate high-fidelity tracking data based on the received input data. The diffusion model may include an event encoder and a tracking decoder that may embed and fuse the event and broadcast tracking data received. The output embeddings may be fed to score-based diffusion models to generate trajectories of one or more players in a sporting event.
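The encode-then-fuse flow described above can be sketched as follows. This is a toy reduction with randomly initialized stand-in parameters (`W_event`, `W_query`, and the model width are assumptions for illustration); a trained, multi-layer transformer would replace both functions.

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL = 16  # illustrative model width

# Stand-in parameters; a trained transformer would learn these.
W_event = rng.normal(size=(8, D_MODEL))
W_query = rng.normal(size=(6, D_MODEL))

def event_encoder(events: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer event encoder."""
    return np.tanh(events @ W_event)            # (num_events, D_MODEL)

def tracking_decoder(track_vectors: np.ndarray, event_emb: np.ndarray) -> np.ndarray:
    """Stand-in for the attention-based tracking decoder: each agent
    vector attends over the event embeddings (scaled dot-product)."""
    q = track_vectors @ W_query                 # (num_agents, D_MODEL)
    scores = q @ event_emb.T / np.sqrt(D_MODEL)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # softmax over events
    return w @ event_emb                        # fused per-agent context

events = rng.normal(size=(10, 8))   # 10 labeled events, 8 features each
tracks = rng.normal(size=(22, 6))   # 22 agents, 6-dim tracking vectors
fused = tracking_decoder(tracks, event_encoder(events))
```

The fused per-agent embeddings are the kind of conditioning signal that would be fed to the score-based diffusion model to generate trajectories.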
As used herein, a “machine learning model” generally encompasses instructions, data, and/or a model configured to receive input, and apply one or more of a weight, bias, classification, or analysis on the input to generate an output. The output may include, for example, a classification of the input, an analysis based on the input, a design, process, prediction, or recommendation associated with the input, or any other suitable type of output. A machine learning model is generally trained using training data, e.g., experiential data and/or samples of input data, which are fed into the model in order to establish, tune, or modify one or more aspects of the model, e.g., the weights, biases, criteria for forming classifications or clusters, or the like. Aspects of a machine learning model may operate on an input linearly, in parallel, via a network (e.g., a neural network), or via any suitable configuration.
The execution of the machine learning model may include deployment of one or more machine learning techniques, such as linear regression, logistic regression, random forest, gradient boosted machine (GBM), deep learning, and/or a deep neural network. Supervised and/or unsupervised training may be employed. For example, supervised learning may include providing training data and labels corresponding to the training data, e.g., as ground truth. Unsupervised approaches may include clustering, classification or the like. K-means clustering or K-Nearest Neighbors may also be used, which may be supervised or unsupervised. Combinations of K-Nearest Neighbors and an unsupervised cluster technique may also be used. Any suitable type of training may be used, e.g., stochastic, gradient boosted, random seeded, recursive, epoch or batch-based, etc.
While several of the examples herein involve certain types of machine learning, it should be understood that techniques according to this disclosure may be adapted to any suitable type of machine learning. It should also be understood that the examples above are illustrative only. The techniques and technologies of this disclosure may be adapted to any suitable activity.
While soccer and various aspects relating to soccer (e.g., a predicted trajectory for one or more players during a game) are described in the present aspects as illustrative examples, the present aspects are not limited to such examples. For example, the present aspects can be implemented for other sports or activities, such as football, basketball, baseball, tennis, golf, cricket, rugby, and so forth.
The uniform, complete, and scalable digitization of sports broadcast videos into tracking data (player and ball locations through time) may be considered a landmark challenge within computer vision for sport. Conventional vision-centric systems can partially track players from broadcast footage; however, visual constraints of broadcast footage cause this tracking data to suffer from frequent long-term occlusions and/or from tracking errors. The systems and methods described herein address this issue by utilizing a diffusion-based multi-agent trajectory generation model which jointly imputes occluded behaviors and de-noises tracking errors. A multi-agent trajectory generation model may fuse large amounts of event data (e.g., soccer's temporal semantic data stream) with raw broadcast tracking data. The multi-agent trajectory generation model described herein may both generate highly realistic behaviors and allow the injection of domain-specific constraints via guidance.
With the commercialization and popularity of sports, tracking data (e.g., player and ball locations through time) may facilitate deeper analysis of individual and collective performance. Traditionally, sports tracking data has been captured by multi-camera in-venue tracking systems that continuously monitor the locations of all active players and the ball, generating complete tracking. However, the cost of installing and operating these systems has limited their adoption to only the most high-profile leagues.
In a sporting event, complete tracking data for all players when an object (e.g., a ball) is in play may be crucial for tactical analysis of players and teams. Furthermore, given the sparseness of high-leverage events in sporting events (e.g., goal-scoring opportunities), having complete tracking data for each of these events may be essential for down-stream analysis. Conventional broadcast tracking systems may be vision-centric, relying predominantly on visual perception. However, with such visual perception, complete tracking data cannot be generated from broadcast footage (e.g., using vision alone). A drawback of conventional vision-centric systems may be caused by the camera's limited receptive field, where important context leading up to a high-probability goal-scoring opportunity may be missed. This example may be representative of the impact that occlusions (e.g., where a subset of agents cannot be visually perceived) have on broadcast tracking systems. Occlusions may have diverse sources, caused by the broadcast camera's limited monocular receptive field, close-ups, replays, and alternative camera angles. During a sporting event such as soccer, the majority of occlusions may be short-term (≤10 seconds); however, many occlusions last for much longer periods of time, with some occlusions lasting over 60 seconds (e.g., as illustrated in
The systems and techniques described herein may utilize generative modelling to jointly impute and denoise broadcast tracking data. The systems and techniques described herein may be applicable to any sport. For example, the systems and techniques described herein may be applicable to sports with inherently structured game play, such as where players adhere to short- and long-term policies within the adversarial team-based environment and/or where rigid single- and multi-agent spatiotemporal behaviors recur within and across games. The systems and techniques described herein may be configured to learn such structures, and multi-agent trajectory generation methods may be used to impute the behaviors of occluded agents and correct erroneous tracking data.
Drawbacks of conventional techniques include a challenge within multi-agent trajectory modelling. First, for example, most trajectory modelling tasks may be completed over independently segmented short-term windows (approximately <10 seconds), where trajectory start and/or end anchors may be observed. This framework may not be suitable for broadcast tracking, due to the long duration of soccer matches (e.g., 90+ minutes) and the high frequency of long-term occlusions that exceed approximately 10 seconds. Second, for example, rather than performing only forecasting (e.g., predicting future trajectories) or imputation (e.g., generating occluded behaviors), in the system described herein trajectories may be jointly forecasted, imputed, and denoised.
The system and techniques described herein may include learning a given sport's coarse and/or long-term latent structure. The system and techniques described herein may include a multimodal foundational architecture. For example, a system or one or more components may be transformer-based, where partially observed multi-agent trajectories (e.g., based on raw broadcast tracking data) are fused with a coarse semantic information stream (e.g., event data). Event data may refer to the sequential stream of all events or all major events throughout a given sports match (e.g., pass, shot, tackle, foul, turnover, penalty, goal, score, substitution, etc.). Event data may provide an essential signal for reconstructing the sections of games that are not covered by raw broadcast tracking data. The system described herein may include joint modelling of event and tracking data. This may represent a paradigm shift away from conventional systems which uni-modally process sporting trajectories. Given the long durations of agent occlusions during the match, a key benefit of the system and techniques described herein may be the large temporal context that the system is able to ingest. For example, the system may adapt the use of spatiotemporal axial attention and temporal convolutions to jointly process tracking data with event context at a time (e.g., three minutes of tracking data with ten minutes of event context).
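The spatiotemporal axial attention mentioned above can be reduced to a minimal sketch: self-attention is applied along the temporal axis (independently per agent) and then along the spatial/agent axis (independently per frame), rather than over the full time-agent product. The unweighted attention (queries = keys = values) below is an illustrative simplification; learned projections and event conditioning are omitted.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over a (length, features) array,
    with queries, keys, and values all equal to x (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over positions
    return w @ x

def axial_attention(x: np.ndarray) -> np.ndarray:
    """Axial attention over a (time, agents, features) tensor: attend
    across time independently per agent, then across agents per frame."""
    xt = np.stack([self_attention(x[:, a]) for a in range(x.shape[1])], axis=1)
    return np.stack([self_attention(xt[t]) for t in range(x.shape[0])], axis=0)

x = np.random.default_rng(2).normal(size=(12, 22, 8))  # 12 frames, 22 agents
y = axial_attention(x)  # same shape; two axis-wise passes instead of one joint pass
```

Factoring attention this way is what makes the large temporal context (e.g., minutes of tracking plus event history) tractable: cost scales with each axis separately rather than with their product.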
The system described herein may, for example, incorporate a conditional diffusion model capable of generating multi-agent tracking data. The conditional diffusion model may be configured to learn the conditional probability density function over multi-agent trajectories, which may enable the synthesis of highly realistic multi-agent tracking data.
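The diffusion mechanics can be sketched with a DDPM-style forward/reverse pair. The linear noise schedule, step count, and zero-conditioning are illustrative assumptions; in the described system a trained, event-conditioned network would supply the noise estimate `eps_hat`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative linear noise schedule; the disclosure does not fix one.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0: np.ndarray, t: int):
    """Forward diffusion: noise clean trajectories x0 to step t."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

def p_step(xt: np.ndarray, t: int, eps_hat: np.ndarray) -> np.ndarray:
    """One reverse step given a noise estimate eps_hat (a trained,
    event-conditioned network would predict eps_hat in practice)."""
    coef = betas[t] / np.sqrt(1.0 - alphas_bar[t])
    mean = (xt - coef * eps_hat) / np.sqrt(1.0 - betas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.normal(size=xt.shape)
    return mean

x0 = rng.normal(size=(22, 50, 2))  # 22 agents, 50 frames, (x, y) coordinates
xt, eps = q_sample(x0, T - 1)
x_prev = p_step(xt, T - 1, eps)    # one reverse step with oracle noise
```

Guidance (injecting domain-specific constraints) would modify the reverse mean at each step; that term is omitted from this sketch.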
The system described herein may be configured to fuse large amounts of tracking and semantic context. By utilizing a diffusion model, realistic trajectories may be generated while injecting domain-specific constraints via guidance.
To improve upon prior approaches, one or more techniques described herein utilize a transformer-based neural network to predict player positions. For example, using a transformer-based neural network, the present system can generate or simulate the remainder of a given match at the player trajectory level. For example, the present system may generate player sequences based on an underlying representation of players. A benefit of this approach may be that it can be used as an assistive aid in the data collection process in combination with or side-by-side with tracking systems (e.g., computer vision based player and ball tracking). Such an approach can also be used to highlight a potentially erroneous data point for assessment (e.g., assessment by human operators or by an automated system).
Tracking system 102 may be in communication with and/or may be positioned in, adjacent to, or near a venue 106. Non-limiting examples of venue 106 include stadiums, fields, pitches, and courts. Venue 106 includes agents 112A-N (players). Tracking system 102 may be configured to record the motions and actions of agents 112A-N on the playing surface, as well as one or more other objects of relevance (e.g., ball, referees, etc.). Although environment 100 depicts agents 112A-N generally as players, it will be understood that in accordance with certain implementations, agents 112A-N may correspond to players, officials, coaches, objects, markers, and/or the like.
In some aspects, tracking system 102 may be an optically-based system using, for example, camera 103. While one camera is depicted, additional cameras are possible. For example, a system of six stationary, calibrated cameras, which project the three-dimensional locations of players and the ball onto a two-dimensional overhead view of the court, may be used.
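The projection from camera view to overhead view performed by such calibrated cameras can be illustrated with a planar homography. The matrix values below are made up for illustration; a calibrated camera system would supply the real mapping.

```python
import numpy as np

# Illustrative homography from image pixels to overhead pitch coordinates;
# a calibrated camera system would supply the actual matrix.
H = np.array([[0.1, 0.0, -5.0],
              [0.0, 0.1, -2.0],
              [0.0, 0.0,  1.0]])

def image_to_pitch(points_px: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Project 2D image points onto the ground plane via a homography."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])  # homogeneous
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                       # dehomogenize

players_px = np.array([[100.0, 40.0], [250.0, 60.0]])  # detected player feet
pitch_xy = image_to_pitch(players_px, H)               # overhead-view coordinates
```

With multiple calibrated cameras, the per-camera projections are combined so that every point of the playing surface is covered by at least one view.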
In another example, a mix of stationary and non-stationary cameras may be used to capture motions of all agents 112A-N on the playing surface, as well as one or more objects of relevance. Utilization of such a tracking system (e.g., tracking system 102) may result in many different camera views of the court (e.g., high sideline view, free-throw line view, huddle view, face-off view, end zone view, etc.). In some aspects, tracking system 102 may correspond to or use a broadcast feed of a given match. In such aspects, each frame of the broadcast feed may be stored in a game file.
Tracking system 102 may be configured to communicate with computing system 104 via network 105. Computing system 104 may be configured to manage and analyze the data captured by tracking system 102. Computing system 104 may include a web client application server 114, a pre-processing agent 116 (e.g., a processor and/or preprocessor), a data store 118, and a third-party Application Programming Interface (API) 138. An example of computing system 104 is depicted with respect to
Pre-processing agent 116 may be configured to process data retrieved from data store 118 or tracking system 102 prior to input to predictor 126. The pre-processing agent 116, predictor 126, and/or prediction model analysis engine 122 may be comprised of one or more software modules. The one or more software modules may be collections of code or instructions stored on a media (e.g., memory of organization computing system 104) that represent a series of machine instructions (e.g., program code) that implements one or more algorithmic steps. Such machine instructions may be the actual computer code the processor of organization computing system 104 interprets to implement the instructions or, alternatively, may be a higher level of coding of the instructions that is interpreted to obtain the actual computer code. The one or more software modules may also include one or more hardware components. One or more aspects of an example algorithm may be performed by the hardware components (e.g., circuitry) itself, rather than as a result of the instructions.
Data store 118 may be configured to store different kinds of data. In an example, data store 118 can store raw tracking data received from tracking system 102. The data store 118 can include historical game data, live data, features, and/or predictions. The historical game data can include historical team and player data for one or more sporting events. Live data can include data received from tracking system 102, e.g., in real time or near real time. Game data may include broadcast data or content related to a game (e.g., a match, a competition, a round, etc.) and/or may include tracking data generated by tracking system 102 or in response to data generated by tracking system 102. Data store 118 may be configured to store features (e.g., feature vectors) generated for a specific sporting event that incorporate player, team, and match features. Data store 118 may further be configured to store event data for a sporting event.
According to aspects disclosed herein, data store 118 may receive and/or store a game file. A game file may include one or more game data types. A game data type may include, but is not limited to, position data (e.g., player position, object position, etc.), change data (e.g., changes in position, changes in players, changes in objects, etc.), trend data (e.g., player trends, position trends, object trends, team trends, etc.), play data, etc. A game file may be a single game file or may be segmented (e.g., grouped by one or more data types, grouped by one or more players, grouped by one or more teams, etc.). Pre-processing agent 116 and/or data store 118 may be operated (e.g., using applicable code) to receive tracking data in a first format, store game files in a second format, and/or output game data (e.g., to predictor 126) in a third format. For example, pre-processing agent 116 may receive an intended destination for game data (or data stored in data store 118 in general) and may format the data into a format acceptable by the intended destination.
Predictor 126 includes one or more machine-learning models 128A-N. Examples include a transformer neural network that may include one or more encoders and/or decoders. The transformers may be configured to generate tracking data based on broadcast tracking data and on event data. The transformers may be further configured to generate prediction(s) for the trajectory of one or more players during a match. The transformer neural network may be configured to fuse partially observed multi-agent trajectories (e.g., raw broadcast tracking data) with a sport's coarse semantic information stream (e.g., event data). The transformer neural network may further include a diffusion model capable of generating multi-agent tracking data. The transformer-based neural network may be configured to generate or simulate the remainder of a given match at the player trajectory level. For example, instead of generating trajectories for a possession, the transformer network may be configured to generate trajectories for multiple possessions and even for the remainder of a sporting event. Further, the transformer network may be further configured to generate event data for the game. In this manner, the transformer network may be used to generate the commentary of a game via text/speech or 3D models of player behaviors.
Client device 108 may be in communication with computing system 104 via network 105. Client device 108 may be operated by a user. For example, client device 108 may be a mobile device, a tablet, a desktop computer, or any computing system having the capabilities described herein. Users may include, but are not limited to, individuals such as, for example, subscribers, clients, prospective clients, or customers of an entity associated with computing system 104, such as individuals who have obtained, will obtain, or may obtain a product, service, or consultation from an entity associated with computing system 104.
Client device 108 may include one or more applications 109. Application 109 may be representative of a web browser that allows access to a website or a stand-alone application. Client device 108 may access application 109 to access one or more functionalities of computing system 104. Client device 108 may communicate over network 105 to request a webpage, for example, from web client application server 114 of computing system 104. For example, client device 108 may be configured to execute application 109 to access content managed by web client application server 114. The content that is displayed to client device 108 may be transmitted from web client application server 114 to client device 108, and subsequently processed by application 109 for display through a graphical user interface (GUI) of client device 108.
Client device 108 may include display 110. Examples of display 110 include, but are not limited to, computer displays, Light Emitting Diode (LED) displays, and so forth. Output or visualizations generated by application 109 (e.g., a GUI) can be displayed on or using display 110.
Functionality of sub-components illustrated within computing system 104 can be implemented in hardware, software, or some combination thereof. For example, software components may be collections of code or instructions stored on a media such as a non-transitory computer-readable medium (e.g., memory of computing system 104) that represent a series of machine instructions (e.g., program code) that implements one or more method operations. Such machine instructions may be the actual computer code the processor of computing system 104 interprets to implement the instructions or, alternatively, may be a higher level of coding of the instructions that is interpreted to obtain the actual computer code. The one or more software modules may also include one or more hardware components. Examples of components include processors, controllers, signal processors, neural network processors, and so forth.
Network 105 may be of any suitable type, including individual connections via the Internet, such as cellular or Wi-Fi networks. In some aspects, network 105 may connect terminals, services, and mobile devices using direct connections, such as radio frequency identification (RFID), near-field communication (NFC), Bluetooth™, low-energy Bluetooth™ (BLE), Wi-Fi™, ZigBee™, ambient backscatter communication (ABC) protocols, USB, WAN, or LAN. Because the information transmitted may be personal or confidential, security concerns may dictate one or more of these types of connection be encrypted or otherwise secured. In some aspects, however, the information being transmitted may be less personal, and therefore, the network connections may be selected for convenience over security.
Network 105 may include any type of computer networking arrangement used to exchange data or information. For example, network 105 may be the Internet, a private data network, virtual private network using a public network and/or other suitable connection(s) that enables components in computing environment 100 to send and receive information between the components of environment 100.
The system described herein may utilize a transformer-based neural network to fuse multi-agent trajectories with a sport's semantic event stream data. The system may implement a score-based diffusion framework as described below.
The system 200 may, for example, include video input data 202 of a sports broadcast. The video input data 202 may, for example, have a limited receptive field. For example, occlusions may occur where a subset of players cannot be visually displayed in the video input data 202. These occlusions may occur from diverse sources, caused by the broadcast camera's limited monocular receptive field, close-ups, replays, and alternative camera angles. The video input data (e.g., broadcast data) may, for example, be a subset of geospatial data. Geospatial data may be any content, information, or feed that may allow tracking of one or more objects, as further discussed herein. For example, geospatial data may refer to broadcast footage, in-venue footage, global positioning system (GPS) data, radio-frequency identification (RFID) data, near field communication (NFC) data, triangulation data, and/or the like. Geospatial data and subsequently processed geospatial data (e.g., processed by video analysis system 204) may, for example, be received as input by the diffuser 210 described herein. Video input data may refer to broadcast footage or an in-venue computer vision system output, which may be or include, for example, raw video content. An in-venue computer vision system may, for example, record video footage of an entire field of play throughout an entire match.
The video input data 202 may for example be input into a video analysis system 204. The video analysis system 204 may for example perform one or more functions. First, the video analysis system 204 may implement one or more computer vision algorithms to determine broadcast tracking data 206. The broadcast tracking data 206 may for example be output as multi-agent trajectories for each of the players in a match. The one or more computer vision algorithms may be configured to (1) detect players in a sporting event; (2) classify the detected players into one or more teams; (3) assign a "logical identity" to the identified players in order to maintain identity and track players over a temporal sequence; (4) identify a ground plane of the sporting event; and/or (5) identify the assigned number of each player on the field. The one or more computer vision algorithms may further provide a tracking of identified players over time. The broadcast tracking data 206 may for example be stored in a JavaScript Object Notation (JSON) file. The broadcast tracking data 206 may for example include the two-dimensional tracking of one or more players in a match, each player's respective team, and each player's respective identifying number (e.g., a player's respective jersey number).
The broadcast tracking data 206 may be based on publicly available broadcast data and/or footage related to a sports event generated or broadcasted at least in part using one or more cameras or camera systems. Broadcast tracking data 206 may be generated using tracking system 102 of
The second function of the video analysis system 204 may be to determine event data 208. The event data 208 may refer to the sequential stream of all major events throughout the match (e.g., pass, shot, tackle, foul, turnover, penalty, goal, score, substitution, etc.). Event data 208 may provide an essential signal for reconstructing the sections of games that are not covered by raw broadcast tracking data. Event data 208 may for example be automatically detected by a computing system or input from a user reviewing the video input data 202. For example, event data 208 may be input by a user viewing video input data 202 (e.g., a broadcast feed). The event data 208 may be unified into a two-dimensional spatiotemporal grid. This may be performed by stacking (with padding) each player's events, forming an event stream s∈R^(L×E×D).
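The stacking-with-padding step described above can be sketched as follows. This is a minimal illustration under assumed names (`build_event_grid`, per-player event feature lists); the actual feature encoding of each event is implementation-specific.

```python
import numpy as np

def build_event_grid(events_per_player, num_players, max_events, feat_dim):
    """Stack each player's event features into a zero-padded grid s of
    shape (max_events, num_players, feat_dim), i.e., an event stream
    s in R^(L x E x D).  Illustrative sketch only."""
    s = np.zeros((max_events, num_players, feat_dim), dtype=np.float32)
    for p, events in enumerate(events_per_player):
        n = min(len(events), max_events)  # truncate overly long streams
        if n:
            s[:n, p, :] = np.asarray(events[:n], dtype=np.float32)
    return s
```

Players with fewer events than `max_events` are simply left zero-padded, so every player occupies the same number of rows in the grid.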
The determined broadcasting tracking data 206 and the event data 208 may for example be input to diffuser 210. Diffuser 210 may for example be depicted by the diffusion neural network 500 of
As discussed above, processed geospatial data may be received as input by the diffuser 210 (e.g., in place of or in addition to video data 202). For example, the geospatial data may be based on wearable technology worn by the one or more agents on the field. For example, GPS, RFID, and/or NFC data may be received by the system 200. GPS, RFID, and/or NFC data may correspond to location data tracked using GPS sensors, satellite tracking, proximity sensors, tags, and/or the like. Such location data may provide useful context to the system 200 when sensor information (e.g., broadcast data, in-venue sensor information, etc.) is noisy or missing. The geospatial data may also be based on an in-venue computer vision system. The in-venue computer vision data may be utilized to denoise the input (or merge together in the event data 208). The event data 208 may be received in and/or transformed into the frame of reference which is being tracked. For example, the event data may have a frame of reference of (0, 0, 100, 100) while the field coordinates may be (0, 0, 106, 68). Accordingly, the event data may be transformed into the (0, 0, 100, 100) frame of reference using any applicable scaling technique such as a transformation, transfer, normalization, and/or the like.
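The coordinate-frame alignment above can be sketched as a simple linear rescaling from field coordinates (0, 0, 106, 68) into the (0, 0, 100, 100) frame. The function name and the assumption of a shared (0, 0) origin are illustrative; real pipelines may also need to handle origin offsets or axis flips.

```python
import numpy as np

def rescale_coords(xy, src_extent=(106.0, 68.0), dst_extent=(100.0, 100.0)):
    """Linearly rescale (x, y) locations from a source frame of
    reference, e.g., field coordinates (0, 0, 106, 68), into a
    destination frame, e.g., (0, 0, 100, 100).  Minimal sketch."""
    xy = np.asarray(xy, dtype=np.float64)
    scale = np.array([dst_extent[0] / src_extent[0],
                      dst_extent[1] / src_extent[1]])
    return xy * scale
```

For example, the field's center point (53, 34) maps to (50, 50) in the event stream's frame.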
Further, the diffuser 210 may be configured to receive labelled input such as human labelled inputs (e.g., only event data 208). The system 200 may be configured to impute the position of one or more objects based on event data 208 (e.g., based only on event data 208). Such a labelled input may be received, for example, in text form and may be converted to tracking data based on analysis of the text and/or based on providing the text to a machine learning model trained to output tracking data based on labeled text inputs. In another example, the system may be configured to impute the event data 208 based on one or more inputs discussed herein (for example, the frame or time interval at which an event occurred).
Denoising diffusion models may be implemented by the system 200 described herein. Such diffusion models may consider the family of distributions p(x, σ) where Gaussian noise of standard deviation σ is added to a data distribution p_data(x) with standard deviation σ_data. Where the Gaussian noise standard deviation is maximized (i.e., σ_max), this perturbed data distribution may be virtually indistinguishable from pure Gaussian noise. Samples from the data distribution may thus be generated by iteratively denoising x_0∼N(0, σ_max²I) over the range σ_max, . . . , σ_(N-2), σ_(N-1) such that x_i∼p(x_i, σ_i). Score-based diffusion models may frame this reverse diffusion process as an ordinary differential equation (ODE) where the derivative of the noised sample x is given by:
Where ∇_x log p(x, σ) gives the score function, σ(t) is the noise level at diffusion step t, and σ̇(t) is the time derivative of σ. The score function may be a vector field that gives the direction in which the probability density function grows most quickly, from which the underlying probability density function can be inferred. The probability distribution's score function can be obtained by training a conditional denoising model D_θ(x, σ, c) parameterized by θ to minimize the L2 reconstruction loss between the perturbed and original data sample,
Where q denotes the distribution of σ during training and y=x+n. Following this definition, the score is given by:
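In the score-based diffusion literature, the score of the perturbed distribution is commonly related to the trained denoiser by the following identity; this is stated here as the standard relation under the definitions above, not as a verbatim reproduction of the application's equation:

```latex
\nabla_x \log p(x; \sigma) \;=\; \frac{D_\theta(x, \sigma, c) - x}{\sigma^2}
```

Intuitively, the denoiser's correction toward the clean signal, rescaled by the noise variance, points in the direction of increasing probability density.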
Training and preconditioning may be implemented for a diffuser model used herein. Such models (e.g., deep models) may learn most effectively when their inputs and outputs are scaled to have unit variance. Furthermore, at low values of σ it may be easier to predict the noise level n, whereas at high values of σ it is easier to predict the clean original signal x. Consequently, rather than directly returning the raw output of the denoiser neural network, the diffuser described herein (e.g., diffuser 210) may add preconditioning terms to scale the variance of the model's inputs, and a skip connection to enable the model to adaptively predict either the noise level or the clean signal for different levels of σ. The denoiser can be written as:
Such that F_θ is the raw neural network's output, c_input modulates the perturbed trajectory's variance, c_noise modulates the noise's variance, c_out modulates the output's variance, and c_skip modulates the skip connection. To normalize losses over the σ range, the per-sample reconstruction losses are scaled by the term λ(σ)=1/c_out², where c_out modulates the variance of the raw neural network output as described above.
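The preconditioned denoiser above can be sketched as follows. The coefficient formulas follow the common EDM-style parameterization from the score-based diffusion literature; treat the exact constants as an assumption rather than the application's precise choice.

```python
import numpy as np

def precondition_denoiser(raw_net, x, sigma, sigma_data=0.5):
    """Wrap a raw network F_theta with preconditioning terms:
    D(x, sigma) = c_skip * x + c_out * F(c_input * x, c_noise).
    Coefficient formulas are EDM-style assumptions for illustration."""
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / np.sqrt(sigma**2 + sigma_data**2)
    c_input = 1.0 / np.sqrt(sigma**2 + sigma_data**2)
    c_noise = 0.25 * np.log(sigma)           # noise-level conditioning
    return c_skip * x + c_out * raw_net(c_input * x, c_noise)
```

Note how at low σ the skip term dominates (the model effectively predicts the noise), while at high σ the network output dominates (the model effectively predicts the clean signal), matching the adaptive behavior described above.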
Constrained sampling may be applied by the diffuser described herein. The diffusion model described herein (e.g., diffuser 210) may learn the conditional score function ∇y log p(y, σ, c) of the probability distribution of multi-agent trajectory sets. However, it is often preferable to sample from the joint score function:
Where the second term represents the constraint gradient score for manifold q over y. This constraint manifold may represent any loss function: L:RT×E×2→R that can be differentiated with respect to y. Scaled by hyper parameter a, the constraint gradient score can be calculated as:
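The constraint gradient score above is the gradient of a differentiable loss L: R^(T×E×2) → R with respect to the trajectory y, scaled by a hyperparameter α. As an illustrative sketch (in practice an autodiff framework would supply this gradient exactly), it can be approximated by finite differences:

```python
import numpy as np

def constraint_grad_score(loss_fn, y, alpha=1.0, eps=1e-5):
    """Approximate alpha * dL/dy for a scalar constraint loss over a
    trajectory tensor y via central finite differences.  This sketch
    only illustrates the quantity added to the learned conditional
    score during constrained sampling."""
    grad = np.zeros_like(y)
    it = np.nditer(y, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        y_hi = y.copy(); y_hi[idx] += eps
        y_lo = y.copy(); y_lo[idx] -= eps
        grad[idx] = (loss_fn(y_hi) - loss_fn(y_lo)) / (2 * eps)
    return alpha * grad
```

During guided sampling, this term is simply added to the learned conditional score at each denoising step.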
With the ODE dynamics described in equation (1) above, sampling from the diffuser 210 may be performed using, for example, approximately 128 inference steps of the Heun sampler.
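A deterministic Heun (second-order) sampler over the ODE dynamics can be sketched as below. This is a minimal illustration assuming a decreasing noise schedule ending at zero and an available score function; the actual diffuser uses roughly 128 steps with its learned score.

```python
import numpy as np

def heun_sample(score_fn, x, sigmas):
    """Integrate the probability-flow ODE dx/dsigma = -sigma * score
    from sigmas[0] down to sigmas[-1] with Heun's method (an Euler
    predictor plus a trapezoidal corrector).  Illustrative sketch."""
    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        d_cur = -s_cur * score_fn(x, s_cur)       # ODE derivative
        x_euler = x + (s_next - s_cur) * d_cur    # Euler predictor
        if s_next > 0:                            # Heun corrector
            d_next = -s_next * score_fn(x_euler, s_next)
            x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)
        else:                                     # final step to sigma=0
            x = x_euler
    return x
```

For a toy Gaussian score (data concentrated at the origin), the sampler contracts an initial noise sample toward zero as σ decreases.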
In order to prepare (e.g., train) and/or validate diffuser 210, the diffuser 210 is provided access to multiple streams of spatiotemporal data such as video input data 202 (e.g., including broadcast tracking data 206 and/or event data 208) and may be provided in-venue tracking data. Such streams may be represented as spatiotemporal grids which consist of a temporal dimension T specifying the length of trajectories, a spatial dimension (e.g., of size E=23) denoting the number of agents (e.g., two teams of 11 and one ball), followed by a feature dimension. The perturbed in-venue trajectories may be written as y∈R^(T×E×2), where each observation specifies the agent's perturbed 2D location. Similarly, the broadcast tracking stream is represented as b∈R^(T×E×D).
The diffuser 210 may apply spatiotemporal axial attention. The diffuser 210 may process the modalities in a way that maintains their underlying spatiotemporal structure. While spatiotemporal data has a clear temporal total ordering (i.e., chronologically), no such natural ordering may exist over agents spatially. In soccer, because there are two teams each with 10 outfield players with no natural ordering, there may be (10!)² possible permutations of agent indices. To avoid a combinatorial increase in complexity, the spatial dimension of spatiotemporal grids may be processed in a permutation equivariant manner. That is, for example, the following equality may hold for every permutation P of agent indices:
Where y^P and c^P may represent permutations of the agent indices for the perturbed in-venue tracking and contextual vectors respectively.
This property may be obtained using spatiotemporal axial attention, where self-attention is applied across temporal and spatial axes separately (e.g., as depicted in
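The axial factorization can be sketched as below: attention is applied along the temporal axis independently for each agent, then along the agent axis independently for each timestep. Projections are omitted for brevity, and the single-head attention is an illustrative stand-in for the network's learned attention modules.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head dot-product self-attention over the first axis of x
    (shape (N, d)), with identity projections for brevity."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def axial_attention(grid):
    """Apply attention along the temporal axis (per agent), then along
    the agent axis (per timestep) of a (T, E, d) grid.  Because the
    spatial step treats agents as an unordered set, permuting agent
    indices permutes the output identically (permutation equivariance)."""
    T, E, d = grid.shape
    out = np.stack([self_attention(grid[:, e]) for e in range(E)], axis=1)
    out = np.stack([self_attention(out[t]) for t in range(T)], axis=0)
    return out
```

Permuting the agent axis of the input and permuting the output of the unpermuted input yield the same tensor, which is exactly the equivariance property described above.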
The diffusion neural network 500 may be configured to embed and fuse the event data (e.g., event data 208) and broadcast tracking context c (e.g., broadcast tracking data 206). The diffusion neural network 500 may include an event encoder 508 as depicted in
Event data's low temporal dimensionality relative to tracking data may mean that separately encoding these modalities increases the amount of event data able to be encoded at a time. The increased context may improve the accuracy of denoised trajectories.
The diffusion neural network 500 may include an event encoder 508.
The diffusion neural network 500 may include a tracking decoder 514.
The combination of broadcast tracking's high temporal dimensionality and self-attention's quadratic blowup with respect to sequence length may have traditionally limited the length of multi-agent trajectories that can be processed by transformers. This may be addressed by the system described herein, where large amounts of temporal context may be leveraged to denoise long-term occlusions and tracking errors. For example, tracking decoder 514 may compress the broadcast tracking stream to b_conv∈R^((T//k)×E×d) by applying temporal convolutions of kernel size and stride k, dimensionality d, and zero padding. By increasing the amount of trajectory context used, the accuracy of generated trajectories may improve.
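The temporal compression step can be sketched as follows, using a stride-k average over time as a stand-in for the learned stride-k temporal convolution (the averaging kernel and zero padding to a multiple of k are illustrative assumptions):

```python
import numpy as np

def compress_tracking(b, k=5):
    """Compress a broadcast tracking stream b of shape (T, E, D) along
    the temporal axis with a kernel-size-k, stride-k average, zero
    padding T up to a multiple of k.  Output shape: (ceil(T/k), E, D).
    A learned Conv1d would replace the fixed averaging kernel."""
    T, E, D = b.shape
    pad = (-T) % k
    if pad:
        b = np.concatenate([b, np.zeros((pad, E, D), b.dtype)], axis=0)
    return b.reshape(-1, k, E, D).mean(axis=1)
```

Compressing T by a factor of k reduces the cost of temporal self-attention by roughly k², which is what makes 180-second contexts tractable.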
Following this compression, sinusoidal positional embeddings may be added to specify the temporal ordering of convolved trajectory tokens. Tokens may then be processed with a temporal attention operation 536 followed by spatial attention 538. Next, embedded tokens temporally cross attend with event encodings (e.g., each agent may only attend to its own event tokens). This may occur at a temporal x-attention operation 540. This allows agents to be jointly modelled, avoiding an ego-centric representation as has been done previously. The final modules within the tracking decoder are the normalization 544 and feedforward layers 542 standard to Transformers. Module 502 (e.g., via tracking decoder 514) may return Z_b∈R^((T//k)×E×d), which represents joint encodings of each agent's event and broadcast tracking streams. This output may for example be the output of the module 502.
The diffusion neural network 500 may not directly utilize the embeddings output by the module 502 to deterministically predict behaviors of players, but rather may generate trajectories via score-based diffusion models. Score-based diffusion models may learn the conditional joint probability distribution over multi-agent trajectory sets, generating much more collectively realistic and controllable behaviors than models trained to independently regress agent locations. Specifically, a trained/activated version of diffuser 210 may represent a denoising diffusion neural network F_θ(y, σ, c). This network may form the parameterized section of the denoising function shown in Equation (4) and trained with the objective from Equation (2). The perturbed trajectories 516 y∈R^(T×E×2) of
At inference time (e.g., when using a trained diffuser 210 or diffusion neural network 500), the system may adopt constrained trajectory sampling to allow greater control of generated behaviors. The diffusion neural network 500 may utilize this control to enforce physical constraints specific to sport. Events (e.g., event stream 504) may be labelled by any automated system or by humans at 1-second intervals, and as a result, pass events may frequently be out-of-phase with the tracking stream. A pass guidance term encourages the ball and the player passing the ball to be in close proximity when passes occur. This guidance term aims to reduce the temporal misalignment between event and tracking streams. Formally, Lpass may be defined as follows,
Where [(t_i, e_i)]_(i=0)^L may denote the sequence of labelled pass events, each including a frame t_i and a passing player e_i.
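An Lpass-style guidance loss can be sketched as below. The function name, the windowing scheme, and the use of the minimum distance within the window are assumptions for illustration; they capture the stated idea that 1-second event labels may be out of phase with tracking, so the ball and passer need only be close at some frame near the labelled pass.

```python
import numpy as np

def pass_guidance_loss(traj, pass_events, ball_idx, window=5):
    """Sketch of a pass guidance loss: for each labelled pass
    (frame t_i, passer index e_i), penalize the squared ball-passer
    distance at the closest frame within a small window around t_i.
    traj has shape (T, E, 2)."""
    T = traj.shape[0]
    loss = 0.0
    for t_i, e_i in pass_events:
        lo, hi = max(0, t_i - window), min(T, t_i + window + 1)
        d2 = ((traj[lo:hi, e_i] - traj[lo:hi, ball_idx]) ** 2).sum(axis=-1)
        loss += d2.min()   # only the closest frame in the window counts
    return loss
```

Because this loss is differentiable with respect to the trajectory, its gradient can serve as the constraint gradient score during guided sampling.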
The diffusion neural network 500 may be evaluated to determine accuracy. To evaluate the coarse and granular realism of generated trajectories, three metrics may be utilized. Coarse behaviors may be evaluated using Average Displacement Error (ADE): the average distance (m) between the generated and real locations of agents. This metric is reported as a total average across all trajectories, for all trajectory segments that are obscured by short-term occlusions (STO) (≤10 seconds) and long-term occlusions (LTO) (>10 seconds). To quantify the granular realism of generated trajectories, two forms of trajectory generation failures may be defined. First, the velocity failure rate may be defined as the average proportion of players that exhibit an instantaneous velocity above a threshold of 12 m/s (defined as the maximum human sprinting velocity) over three-minute windows. Second, the pass failure rate may report the percentage of passes that occur outside a 2.5 m radius of the passing player. Only passes where this property is maintained in the in-venue tracking may be considered for the evaluations. These metrics may be averaged across games.
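The first two metrics can be sketched directly from their definitions; these helper names and the 5 Hz default are illustrative assumptions.

```python
import numpy as np

def ade(pred, true):
    """Average Displacement Error: mean Euclidean distance between
    generated and ground-truth locations, both of shape (T, E, 2)."""
    return float(np.linalg.norm(pred - true, axis=-1).mean())

def velocity_failure_rate(traj, fps=5.0, v_max=12.0):
    """Proportion of per-frame agent speeds exceeding the 12 m/s human
    sprinting threshold; traj has shape (T, E, 2) sampled at fps Hz."""
    v = np.linalg.norm(np.diff(traj, axis=0), axis=-1) * fps
    return float((v > v_max).mean())
```

The pass failure rate would similarly compare ball-passer distances at labelled pass frames against the 2.5 m radius.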
The evaluation may utilize a dataset containing in-venue tracking, broadcast tracking, and event data from, for example, approximately 124 games of professional soccer. In-venue tracking may have been obtained using on-location multi-camera tracking systems, and therefore represents the ground-truth behaviors. Broadcast tracking may have been generated by commercial computer vision tracking systems gathered from publicly accessible broadcast footage. These systems may be composed of object detectors, tracklet association algorithms, and camera calibration models. Event data may be provided at scale by human labelling. In reference to an experiment, an evaluation used 109 games for training and 15 games for evaluation. Of the training games, the system had access to broadcast tracking for only 9 games, and for the remaining games, broadcast tracking was synthesized from in-venue tracking. A heuristics-based approach was used to simulate multiple forms of tracking errors (e.g., object detection error, player misidentifications, camera calibration error) and occlusions (e.g., due to the camera's limited receptive field, close-ups, stoppages).
The setting of the system described herein may be vastly different from conventional trajectory modelling settings in terms of the lengths of occlusion, available data streams, and generative objective (i.e., denoising, imputation, and forecasting). Consequently, according to the experiment discussed above, three baselines were devised to evaluate the diffusion neural network 500. First, a Linear Interpolator was used which imputed occluded behaviors by linearly transitioning players from their last to next visible location. Second, a Vanilla Transformer was used which denoised each agent's trajectory independently using the player's broadcast tracking and event streams. Third, a Spatiotemporal Axial Transformer was used. The Spatiotemporal Axial Transformer resembles a Transformer encoder with the multi-headed self-attention implemented as a temporal attention module followed by a spatial attention module. This architecture is able to represent inter-agent dependencies. The Vanilla Transformer and Spatiotemporal Axial Transformer jointly processed 30 seconds of broadcast tracking and event context. These two transformer baselines made no assumptions about trajectory visibility, ingested extended trajectory context, and fused event data with tracking data, and therefore represent much stronger baselines than previous approaches from conventional systems.
Implementation of the diffusion neural network 500 may include downsampling the broadcast and in-venue tracking streams to 5 Hz. According to the experiment, the diffuser utilized 180 seconds of trajectory and event context, where each segment also contains 500 seconds of event data before and after the main window. Each attention module of the diffusion neural network 500 used a hidden dimensionality of 128, a feedforward dimensionality of 512, and 8 attention heads. For module 502, the event encoder 508 and tracking decoder 514 have 2 and 4 layers, respectively, and the second tracking decoder 520 has 8 layers. Each tracking decoder uses a temporal convolution (e.g., temporal convolution 518) of stride and kernel size k=5. Each module 502 may be pre-trained to reconstruct in-venue tracking via an L2 loss objective, before being fine-tuned on the objective (utilizing equation (2) described above).
The quantitative performance of an exemplary use of the diffusion neural network 500 may be compared against the three baselines (linear interpolator, vanilla transformer, and the Spatiotemporal Axial Transformer model) discussed as shown in table 1 below.
The first two baselines (Linear Interpolator and Vanilla Transformer) may have poor performance in the ADE metrics, as both process each agent's trajectory independently and may not use vital inter-agent context for denoising and imputing agent behaviors. In contrast, as the Spatiotemporal Axial Transformer models inter-agent dependencies, it may have much stronger performance in terms of reconstructing agent locations. Each baseline however exhibits a high proportion of velocity failures. The Linear Interpolator only imputes missing behaviors, and therefore cannot correct the velocity failures present in broadcast tracking data (e.g., caused by frequent camera calibration errors and misidentifications). While the Vanilla and Spatiotemporal Axial Transformers may be able to denoise the unrealistic behaviors in raw broadcast tracking, they may be limited by their training regime. These models may be trained to minimize L2 reconstruction loss, and although they typically generate locations that are independently reasonable, these locations often do not collectively exhibit realistic human motion. The exemplary use case of the diffusion neural network 500 has both lower ADE metrics and velocity failure rates. The exemplary use case of the diffusion neural network 500 may for example be depicted in
Next, the key components of the diffusion neural network 500 may be ablated to determine their relative impact on the model's performance. The quantitative results for the ablation experiment (ablation study) are depicted in table 2 below:
The ablation study included ablating diffusion. Examining the ablation, the ablation study first ablated the use of diffusion by directly using outputs of the pre-trained module 502. This model may have strong performance in terms of the average displacement error (ADE) metrics, outperforming the base architecture in terms of reconstructing short-term occlusions (STOs) and long-term occlusions (LTOs). Primarily, this may be because the Event2Tracking module 502 is trained to directly minimize L2 reconstruction loss. However, as seen in the baseline architectures, training a multi-agent trajectory generative model via reconstruction loss may result in unrealistic collective trajectories, as exemplified by the high velocity failure rate.
The ablation study included long trajectory context. In this ablation, 30-second trajectory segments were used instead of 180 seconds, examining the importance of extending the denoising temporal horizon. Shortening the trajectory context may decrease the accuracy of reconstructed locations, as is exhibited by the considerably weaker ADE metrics. This ablation also produced approximately twice as many velocity failures as the base architecture, further reinforcing the importance of utilizing large amounts of trajectory context.
The ablation study included an expanded event context. This included reducing the event temporal horizon to match the 180 seconds of the trajectory segment. Reducing this context window resulted in weaker performance in terms of each ADE metric and velocity failure rates. This may further demonstrate the advantage of ingesting as much context as possible when denoising sporting trajectories.
The impact of "pass guidance" on an exemplary use of the diffusion neural network 500 was evaluated. This was performed by quantitatively comparing the performance of an exemplary use of the diffusion neural network 500 both without and with the Pass Guidance sampling, as depicted in Table 3.
While the exemplary use of the diffuser has shown strong performance with respect to the ADE and velocity failure rate metrics, it may have a high pass failure rate. Possible causes of this may include the pitch's vast size, noisy broadcast tracking, and synchronization issues between the event and tracking streams. However, with Pass Guidance the pass failure rate was considerably decreased, while maintaining strong performance in the other metrics, displaying the granular control that guided diffusion models provide.
The trajectory tails depicted in
Accordingly, the diffusion neural network 500 may be capable of producing complete game tracking without in-venue vision systems, representing an important step in providing scalable and uniform game analysis across a given sport. The diffusion neural network 500 may include several key technical components. First, the module 502's architecture may be a multimodal foundational model. Second, the diffusion neural network 500 may integrate a foundational model with guided diffusion, showing that this setup greatly improves the granular realism and controllability of generated behaviors. The diffusion neural network 500 may be configured to apply the module 502 architecture on other game analysis tasks that require both trajectory and semantic perception.
Due to the ease of set-up and low cost, adversarial multi-agent behaviors are conventionally captured via video (e.g., broadcast video, as described herein). Although valuable, apart from human end-users being able to view the video, the raw pixel information may have little utility for downstream analysis purposes. Compressing raw-pixel information into tracking data (e.g., spatial locations of agents) may provide a compact yet interpretable mid-level representation of agent behaviors. A key problem which may not be solved, using conventional techniques, is how to impute missing tracking segments caused by long-term occlusions. Conventional systems may be unable to impute these forms of long-term occlusions, especially when the starts and ends of analysis streams are unavailable.
The systems and methods described herein may address these limitations with a Multi-Agent Masked Autoencoder (MAT-MAE) system, which may be robust to diverse forms of multi-agent occlusions. The methods may include imputing long-term occlusions by modelling the distant temporal inter-dependencies that may exist between both trajectory and coarse semantic data-streams. The systems disclosed herein may utilize basketball broadcast tracking data in an exemplary embodiment. The system may outperform several baselines. An exemplary use of the system may increase the proportion of visible frames in a basketball game by, for example, approximately 26.75%, from 59.48% to 86.23%, based on experiments (e.g., such as those described herein).
Video and sensor data of sporting events has grown in recent years, allowing for the analysis of fine-grained behaviors in sporting events. This may include behavior analysis of single agents (e.g., motion capture), multiple agents (e.g., autonomous vehicles and human trajectory forecasting), and multiple adversarial agents (e.g., sports and games). An emerging problem within fine-grained multi-agent behavior analysis may be that of trajectory imputation, where multiple agents' movements are reconstructed from partial observations, as discussed herein. This problem may be critical in tracking systems with limited visual perception, which are unable to track agents that are out-of-view (e.g., those based on broadcast tracking). In these cases, trajectory generation techniques may be used to impute occluded appearance information.
Broadcast tracking in sports may form a valuable test-bed in which the approaches discussed herein can be validated. Unlike traditional in-venue tracking systems which have continual, unimpeded observation of the entire match, broadcast tracking systems may track players directly from broadcast footage. Although this may function well when all players are in-view, broadcast footage contains diverse sources of both partial and full occlusions, as discussed herein.
The methods described herein may partition occlusion into two classes: short-term and long-term occlusion. Short-term occlusion (STO) may occur when a subset of players are out-of-view for a short period of time (e.g., ≤5 seconds). Long-term occlusion (LTO) may refer to the cases where no appearance information is observed for longer periods of time (e.g., >5 seconds), due to advertisements, on-screen graphics, or close-ups of players, coaches, or fans, for example.
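The STO/LTO partition above can be sketched as a scan over a per-frame visibility mask; the 5 Hz sampling rate and helper name are illustrative assumptions.

```python
def classify_occlusions(visible, fps=5.0, sto_max_s=5.0):
    """Split a per-frame visibility sequence into short-term (STO) and
    long-term (LTO) occlusion segments using the 5-second threshold
    described above.  Returns lists of (start, end) frame pairs, with
    the end index exclusive."""
    sto, lto, start = [], [], None
    for t, v in enumerate(list(visible) + [True]):   # sentinel flushes a trailing gap
        if not v and start is None:
            start = t                                 # occlusion begins
        elif v and start is not None:
            seg = (start, t)                          # occlusion ends
            (sto if (t - start) / fps <= sto_max_s else lto).append(seg)
            start = None
    return sto, lto
```

At 5 Hz, a 2-frame gap (0.4 s) is classified as an STO while a 30-frame gap (6 s) is classified as an LTO.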
Conventional systems that utilize multi-agent trajectories in sports may use the coarse spatial structure that exists within sports to approximate locations of occluded agents. Conventional systems have explored trajectory generation methods to impute occluded appearance information. Exemplary conventional systems may exhibit acceptable performance when imputing STOs by fusing bidirectional multi-agent context. However, this approach may be considerably less suited to imputing LTOs. In an exemplary conventional system, a non-autoregressive long-term imputation method may be implemented. This method may not support partial occlusions (e.g., where a subset of players are occluded at a single frame), which occur frequently in broadcast footage. Finally, conventional approaches model fixed-length trajectories where the starts and ends of trajectories are strictly visible, which may be highly unrealistic assumptions for a sport's broadcast tracking data.
Unlike conventional systems which impute behaviors using only trajectory information, a sport's semantic data-streams may also be leveraged to more accurately predict behaviors as described herein. One such data-stream is event-detection data, which may specify the timestamps and player identities of the match's on-ball events (e.g., pass, rebound, dribble). Semantic information may provide a coarse reconstruction of the game's granular multi-agent behaviors, and may therefore be used to contextualize the large portions of games where trajectory information is fully occluded. As a result, semantic information may substantially increase the capacity to accurately impute LTOs.
The multi-agent trajectory imputation setting may be deeply analogous to the masked modelling framework, where the reconstruction of a partially masked input may be used as a self-supervised pre-training task. Additionally, when fusing sparse semantic information and heavily occluded trajectory data, attention-based models such as transformers may have the beneficial property of being able to model long-term temporal inter-agent dependencies.
The Multi-Agent Trajectory Masked Autoencoder (MAT-MAE) 806 as depicted in
MAT-MAE 806 may be configured to perform long-term trajectory imputation of diverse classes of occlusion. MAT-MAE 806 may produce a trajectory imputation framework that leverages both trajectory streams (e.g., based on broadcast tracking data 206) and semantic data-streams (e.g., based on event data 208). The MAE framework may be adapted to the imputation setting with diverse, unseen forms of LTOs. MAT-MAE 806 may explicitly represent the permutation-invariant relational dynamics of the team-based multi-agent setting.
Multi-agent trajectory imputation may include the task of predicting the occluded locations of agents within a trajectory sequence. Using basketball as an example, all trajectory sets may include K=11 agents (e.g., Agents 112A-N), consisting of five offensive players, five defensive players, and a single ball. In addition to partially occluded trajectory information, the system may also leverage the stream of fully visible on-ball player events. Each event includes a timestamp, the player who performed the event, and the event type (including field-goal make, field-goal miss, offensive rebound, defensive rebound, turnover, pass, inbound pass, block, assist, and dribble). Additionally, for certain events supplementary coarse spatial information may be provided. For field-goals, whether the shot was a three-point, mid-range, or close-range attempt may be specified. For inbound passes, the area of the court the pass originated from may be included (i.e., baseline, front-court, backcourt).
The problem addressed by MAT-MAE 806 may be formalized, as further discussed herein. Each possession may be represented as a set of entity trajectories X={X_k}_(k=0)^(K-1), where agent k's trajectory consists of T observations X_k=[x_k^t]_(t=0)^(T-1). At each timestep t, entity k's observation is denoted as x_k^t∈R^3, which includes the entity's (x, y) location and its event category, i.e., (x, y, event-category). Occlusions may be defined by a mask m∈R^(T×K×3), where m_(t,k,d)=1 specifies that dimension d of entity k's observation at timestep t is occluded. Within this framework, m_(t,k,2)=0, meaning that player events may be strictly visible. The imputation task may be to use the masked input to generate the trajectory set X̂={X̂_k}_(k=0)^(K-1), where entity k's generated trajectory is denoted as X̂_k=[x̂_k^t]_(t=0)^(T-1). For imputed trajectories, each prediction x̂_k^t∈R^2 specifies the generated (x, y) location of entity k at timestep t. The objective during training may be to minimize the L2 reconstruction loss of masked trajectory segments, i.e., L2(X̂·m, X·m), where · denotes the Hadamard product.
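The masked training objective above can be sketched directly; averaging over the number of masked entries is an illustrative normalization choice.

```python
import numpy as np

def masked_l2_loss(pred, target, mask):
    """L2 reconstruction loss over occluded entries only, mirroring
    L2(X_hat * m, X * m) with the Hadamard product.  Mask entries of 1
    mark occluded (to-be-imputed) dimensions; visible entries do not
    contribute to the loss."""
    diff = (pred - target) * mask
    denom = max(mask.sum(), 1.0)   # avoid division by zero when nothing is masked
    return float((diff ** 2).sum() / denom)
```

Because the event-category dimension is never masked (m_(t,k,2)=0), the loss is effectively computed over occluded (x, y) locations only.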
Multi-agent trajectories may be fundamentally non-Euclidean, as entities have no natural spatial ordering. Consequently, methods which optimally model sets of trajectories may be permutation-invariant. The system may utilize aspects of a permutation-invariant method for pedestrian trajectory prediction, where inter-agent (agent to other agents) and intra-agent (agent to itself) dependencies are modelled separately. However, in adversarial team-based environments, both inter-agent and intra-agent relationships may also depend on agents' team affiliations. Using basketball as an example, an offensive player's interactions with another agent may depend on whether that agent is another offensive player, a defensive player, or the ball. Furthermore, the manner in which an offensive player attends to themselves may be different from how a defensive player attends to themselves. As a result, both agent inter-attention and intra-attention may incorporate team identity.
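The team-conditioned relational encoding can be sketched as a lookup keyed on ordered pairs of team identities. The helper name and the use of a plain dictionary of learned embeddings are illustrative assumptions; the point is that the encoding depends only on team identities, not on how agents happen to be indexed within a team.

```python
def relational_encoding_table(team_ids, embed):
    """Build the pairwise relational encoding table for agents with
    team identities team_ids (e.g., 'off', 'def', 'ball').  embed maps
    each ordered (team_src, team_dst) pair to an encoding; diagonal
    entries (i == j) realize the per-team intra-agent encoding."""
    K = len(team_ids)
    return [[embed[(team_ids[i], team_ids[j])] for j in range(K)]
            for i in range(K)]
```

Because any two agents on the same team share a team identity, swapping their indices leaves the table's entries unchanged up to the same swap, which is the permutation-invariance property motivated above.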
Consequently, MAT-MAE 806 may include a permutation-invariant agent-based positional encoding method which models the relational dynamics of sport's adversarial environments. To reflect the impact of team identity on inter-agent attention, a separate relative encoding may be computed for each possible ordered pair of team identities. A unique intra-agent relative encoding for each team identity may also be computed. More formally, when attending between an agent at index k_src with team allegiance a_src and an agent at index k_dst with allegiance a_dst, the relational agent encoding γ_A is computed as,
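One possible, non-limiting realization of the relational agent encoding γ_A may use a learned table per ordered team pair plus a per-team intra-agent table. The team indexing, dimensionality, and random initialization below are illustrative assumptions (in practice the tables would be learned parameters):

```python
import numpy as np

# Hypothetical team indexing: 0 = offense, 1 = defense, 2 = ball.
N_TEAMS, DIM = 3, 8
rng = np.random.default_rng(0)
inter = rng.normal(size=(N_TEAMS, N_TEAMS, DIM))  # one encoding per ordered team pair
intra = rng.normal(size=(N_TEAMS, DIM))           # one encoding per team identity

def relational_agent_encoding(k_src, a_src, k_dst, a_dst):
    """gamma_A: intra-agent encoding when an agent attends to itself,
    else the encoding for the ordered (a_src, a_dst) team pair."""
    if k_src == k_dst:
        return intra[a_src]
    return inter[a_src, a_dst]
```

Because the encoding depends only on team identities and self-versus-other, not on arbitrary agent indices, attention remains permutation-invariant within each team.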
Autoencoding may be implemented by MAT-MAE 806. The MAT-MAE 806 may include the ability to represent the long-term inter-agent dependencies that are present in long-term occlusions (LTOs) with sparse semantic information.
Within the MAT-MAE 806 architecture, individual tokens may represent an agent k's observation at timestep t. To include the spatiotemporal inductive biases present in multi-agent trajectories, the system may use Shaw's relative positional encodings. Different encoding methods may be employed for the agent and temporal dimensions of multi-agent trajectories. Temporally, the system may utilize learned positional encodings with a maximum relative attention window rel_max = ±40 frames, i.e., ±8 seconds where tracking data is sampled at 5 Hz. These relative temporal encodings γ_T between a source timestep t_src and a destination timestep t_dst may be computed as,
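A minimal sketch of clipped relative temporal encodings in the style of Shaw's method may be as follows. The table is randomly initialized here for illustration; in practice it would be a learned parameter:

```python
import numpy as np

REL_MAX = 40   # +/- 40 frames = +/- 8 seconds at 5 Hz
DIM = 8
rng = np.random.default_rng(0)
# One learned vector per relative offset in [-REL_MAX, REL_MAX].
temporal_table = rng.normal(size=(2 * REL_MAX + 1, DIM))

def temporal_encoding(t_src, t_dst):
    """gamma_T: encoding of the offset t_dst - t_src, clipped to the window."""
    offset = np.clip(t_dst - t_src, -REL_MAX, REL_MAX)
    return temporal_table[offset + REL_MAX]
```

Offsets beyond the ±40-frame window share the boundary encoding, so the table size stays fixed regardless of sequence length.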
For agent relative positional encoding, the relational agent encoding method described herein is implemented. Consequently, the relative positional encoding γ between a source token and a destination token may combine the relative temporal encoding γ_T with the relational agent encoding γ_A.
The MAE may utilize a symmetric autoencoder. The encoder may process both the masked and non-masked tokens, due to the varying number of strictly visible semantic events present in trajectory sequences.
Conventional masked modeling approaches may use imputation as a pre-training method. These approaches may only investigate the impact of each masking policy in isolation. The systems described herein may extend these approaches by using a diverse set of synthetic stochastic masks during training, to enable powerful generalization to a set of unseen, diverse masks. For each batch during training, the system may randomly select one of the five following policies: random, timestep, block, starts, and ends. These masking policies may be depicted in
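The five masking policies may be sketched as follows; the ratios, block sizes, and exact sampling rules below are illustrative assumptions rather than the disclosed parameters:

```python
import numpy as np

def make_mask(policy, T, K, ratio=0.6, block=5, rng=None):
    """Synthetic occlusion mask (1 = occluded) for a (T, K) trajectory grid."""
    if rng is None:
        rng = np.random.default_rng()
    m = np.zeros((T, K), dtype=int)
    if policy == "random":                  # independent per-token masking
        m = (rng.random((T, K)) < ratio).astype(int)
    elif policy == "timestep":              # whole timesteps occluded for all agents
        steps = rng.choice(T, size=int(ratio * T), replace=False)
        m[steps, :] = 1
    elif policy == "block":                 # contiguous per-agent blocks
        for k in range(K):
            for _ in range(max(1, int(ratio * T) // block)):
                s = rng.integers(0, max(1, T - block))
                m[s:s + block, k] = 1
    elif policy == "starts":                # occlude the start of every trajectory
        m[: int(ratio * T), :] = 1
    elif policy == "ends":                  # occlude the end of every trajectory
        m[T - int(ratio * T):, :] = 1
    return m
```

Sampling a different policy per batch exposes the model to short gaps, long gaps, and missing trajectory boundaries alike, which is what supports generalization to unseen occlusion patterns.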
Implementation of MAT-MAE 806 may now be described, in reference to an experiment. Accordingly, it will be understood that MAT-MAE 806 may be implemented using techniques, ranges, data, inputs, and outputs similar to, though not necessarily the same as, those described in reference to the experiment.
For training, a dataset of 100 games of National Basketball Association (NBA) in-venue tracking data was used. The data was downsampled from 25 Hz to 5 Hz, and the games were separated into possessions. A possession begins when a team first establishes ownership of the ball in an active play, and concludes when the opposition establishes ownership of the ball or the play becomes inactive (e.g., due to an out-of-bounds). During training, the basketball possessions were partitioned into eight-second segments.
MAT-MAE 806 model(s) have been trained for 500 epochs, using a batch size of 64 and an optimizer with a learning rate of 1e-3 and default exponential decay hyperparameters b1=0.9 and b2=0.999. MAT-MAE 806 may be implemented as a symmetrical autoencoder with r-layer transformers, each with a hidden dimensionality of 64 and 4 attention heads. The training may be evaluated on a cluster of GPUs.
The training may be performed on fixed-length trajectories of eight seconds; during evaluation, an autoregressive policy may be implemented to impute trajectories of longer length. For example, eight-second sliding windows of context may have been implemented, where the system autoregressively updated four seconds at a time.
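The autoregressive sliding-window policy may be sketched as follows, where `impute_window` stands in for a trained MAT-MAE forward pass (a hypothetical interface assumed for illustration):

```python
import numpy as np

WIN, STEP = 40, 20  # 8 s context window, 4 s autoregressive update, at 5 Hz

def autoregressive_impute(traj, mask, impute_window):
    """Fill occluded frames of a (T, 2) trajectory with sliding windows.

    impute_window(window, window_mask) -> filled window of the same shape;
    a stand-in for the trained model. Frames imputed by earlier windows are
    unmasked, so later windows treat them as visible context (autoregression).
    """
    traj, mask = traj.copy(), mask.copy()
    for start in range(0, max(1, len(traj) - WIN + 1), STEP):
        w = slice(start, start + WIN)
        filled = impute_window(traj[w], mask[w])
        traj[w][mask[w]] = filled[mask[w]]  # keep visible frames untouched
        mask[w] = False                     # imputed frames become context
    return traj
```

Advancing by STEP = WIN/2 frames gives each imputed segment four seconds of already-resolved leading context, mirroring the described evaluation policy.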
MAT-MAE 806 may be validated by performing two experiments, each concerning a single game of heavily occluded college basketball tracking data. In the first experiment, the system may be evaluated based on its capacity to reconstruct synthetically masked, clean in-venue tracking data, where the synthetic masks represent the occlusions from broadcast footage. This experiment may facilitate a granular evaluation of each method's capacity to impute diverse, realistic forms of occlusion. In the second experiment, the system's ability to reconstruct noisy, naturally occluded broadcast tracking data was evaluated. This experiment may reflect a realistic real-world application of trajectory imputation methods; reconstructions may be evaluated using macro-level performance analysis metrics from the sporting domain.
During evaluation of the MAT-MAE 806, the imputation method may be applied to a single game of college basketball. For this game, both the complete in-venue tracking data and the naturally occluded broadcast tracking data may be processed. The game may be downsampled to 5 Hz and separated into possessions. In this game, there were a total of 116 possessions. The frequency of each class of occlusion is displayed in Table 5.
To evaluate the fine-grained reconstruction of imputed trajectories, the L2 reconstruction loss was computed, which may be the per-possession average distance between the ground-truth and generated locations in occluded sections of trajectories. This metric may be reported separately for each class of entity and each class of occlusion.
The following baselines may have been implemented in the experiment: (i) Linear: this baseline may perform linear interpolation using visible sections of trajectories. (ii) Bidirectional LSTM: this baseline may use forwards and backwards LSTM models to separately impute each player's trajectories. (iii) Graph Imputer: this baseline may include a stochastic GNN-based method which uses bidirectional multi-agent context to impute occlusions.
These baselines may not be able to natively complete imputation without trajectory bounds (e.g., the start and end points of trajectories). As a result, the system implemented a lookup-based method which may impute the first and final seconds of occluded trajectories. This lookup method may find a similar agent trajectory from a dataset of non-occluded trajectories, and may copy that trajectory's first/final second. Similar trajectories may be defined by both the notable starting event (e.g., baseline inbounds) and/or ending event (e.g., three-point attempt), and the entity's visible trajectory information.
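A non-limiting sketch of such a lookup-based boundary imputation may be as follows; the similarity rule and the five-frame (one second at 5 Hz) copy length are illustrative assumptions:

```python
import numpy as np

def lookup_boundary(query_visible, query_events, library, fps=5):
    """Impute the first second of an occluded trajectory by copying it from
    the most similar non-occluded trajectory in a library.

    Hypothetical similarity: match on the starting event label, then pick the
    candidate whose tail is closest to the query's visible positions.
    library: list of (events, traj) pairs, traj of shape (T, 2).
    """
    candidates = [(e, t) for e, t in library if e[0] == query_events[0]]
    if not candidates:           # fall back to position similarity alone
        candidates = library
    dists = [np.linalg.norm(t[-len(query_visible):] - query_visible)
             for _, t in candidates]
    _, best = candidates[int(np.argmin(dists))]
    return best[:fps]            # copy the first second of the best match
```

The copied boundary then supplies the trajectory start/end points that the interpolation-style baselines require.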
Various ablation studies were conducted for the experiment to investigate the relative contributions of the MAT-MAE 806 architecture's primary components. The first ablation study was applied to the relational agent encoding module: the model was compared to a method which uses absolute positional encoding of agents, randomly indexing agents within each team. The second and third ablations investigated the autoencoder architecture, utilizing conventional systems. These ablations explored both the impact of using an asymmetrical autoencoder with a shallow two-layer decoder, and the use of a shallow, symmetrical autoencoder with two-layer transformers. A further ablation investigated the impact of the synthetic masking policy, using both a random masking policy with a 60% masking ratio and a block masking policy with a block size of 5 and a masking ratio of 60%.
Further quantitative analysis was performed. Table 6 below depicts the results of the various baselines and ablations.
The system described herein had average L2 reconstruction losses of 2.02/2.20/2.01 and 5.09/6.95/4.30 for short-term occlusions (STO) and long-term occlusions (LTO), respectively, where losses are reported separately for each class of entity (offensive/defensive/ball). This outperformed each baseline method for each type of entity and class of occlusion.
Experiments were also performed on various ablations of the MAT-MAE 806 architecture. Overall, the base architecture displayed the strongest performance. However, the ablation that exclusively used synthetic block masks for training achieved strong performance for STO imputation, with results of 1.77/2.01/1.50, outperforming the base architecture for each class of entity. However, this ablation displayed considerably weaker performance than the base architecture in reconstructing LTO, highlighting the utility of diverse synthetic masks during training. Although the ablation that applied MAT-MAE 806 with a shallow autoencoder outperformed the base architecture in reconstructing defensive players with LTO, it displayed worse performance than the base architecture in all other metrics.
The system of
Qualitatively, based on the experiment depicted in
Another notable difference between MAT-MAE's 806 performance and the baselines is MAT-MAE's ability to generate trajectories that demonstrate the semantic actions from the event-detection data-stream. This is evident in the second example 1004 and third example 1006, where in the ground-truth, the ball clearly transitions between two offensive players, representing a pass. This same coarse behavior is represented in MAT-MAE's output, reflecting its ability to generate multi-agent trajectories that are conditionally dependent on discrete event data. This ability may be enabled through MAT-MAE's use of a transformer-based method that is able to perform long-term attention with partially occluded trajectory information and sparse semantic information. In contrast, this distinct behavior is not produced in the graph imputer's output, where instead the ball smoothly transitions from its initial location to its first visible location. At a high level, each of the baselines provides a method for smoothly fusing forwards and backwards context. As a result, these methods may be unable to generate behaviors where an entity's high-level intent changes suddenly in an occluded section of the trajectory, such as in passes.
Regarding the second experiment on MAT-MAE's 806 performance, broadcast tracking imputation was performed. The second experiment reflects the utility of this imputation method in downstream applications. As the system operates in the sports domain, the experiment uses two metrics that are of interest for fitness and performance evaluations: total distance travelled and average speed. These values are also reported across both teams.
The experiment implements a baseline that predicts player velocities according to entities' average velocities over the visible sections of the game. This average is computed separately for each type of entity (offensive, defensive, ball).
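This average-velocity baseline may be sketched as follows; the field units and the extrapolation rule are illustrative assumptions:

```python
import numpy as np

def avg_velocity_metrics(positions, visible, hz=5.0):
    """Baseline distance/speed estimate computed from visible frames only.

    positions: (T, 2) array of field coordinates (e.g., in feet).
    visible: (T,) boolean mask of visible frames.
    Returns (total_distance, average_speed), extrapolating the full game by
    assuming occluded frames move at the average visible speed.
    """
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    both_visible = visible[:-1] & visible[1:]   # steps with both endpoints seen
    avg_speed = steps[both_visible].mean() * hz # distance/frame -> distance/s
    total = avg_speed * (len(positions) / hz)   # extrapolate across all frames
    return total, avg_speed
```

Because this baseline cannot distinguish entity roles within a frame, the average is computed separately per entity type (offensive, defensive, ball) in the experiment.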
Examining the results from the experiment, at the team level the MAT-MAE 806 predicted total distance and average speed for the University of Michigan of 39,089 ft. and 5.90 ft./s respectively, and for Penn State of 39,628 ft. and 5.98 ft./s respectively. These metrics were substantially more accurate than those of the average velocity baseline. Furthermore, at the individual player level, the MAT-MAE 806 was able to more realistically predict total distances for 16 of the total 22 players when compared with the baseline. These strong results reflect possible downstream performance analysis tasks that would not be feasible without this imputation method.
The MAT-MAE 806 performs multi-agent trajectory imputation. Through leveraging sparse semantic information, the MAT-MAE 806 was able to generate high-fidelity behaviors in the presence of diverse forms of LTOs. MAT-MAE 806 demonstrated an impressive ability to reconstruct semantic events in trajectory space, and to predict agents' realistic initial states when the starts of trajectories were heavily occluded. Both quantitatively and qualitatively, the MAT-MAE 806 outperformed a range of baseline imputation methods. Using a single game of naturally occluded basketball broadcast tracking data, the MAT-MAE 806 was able to substantially increase the proportion of fully visible frames by 26.75 percentage points, from 59.48% to 86.23%. The MAT-MAE 806 demonstrated utility in downstream domain-specific applications for sport.
At step 1202, sports broadcast footage of a sporting event may be received as an input.
At step 1204, labeled event data of the sports broadcast footage may be received as an input. The labeled event data may include a sequential stream of one or more major events throughout a sporting event, the major events including at least one of a pass, shot, tackle, foul, turnover, penalty, goal, score, or substitution from the sporting event. The event data may be represented as a two-dimensional spatiotemporal grid, the grid representing a stacking of each player's events.
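The stacked event grid may be sketched as follows; the (timestep, player, event-type) tuple schema is an assumption for illustration:

```python
import numpy as np

def event_grid(events, T, K):
    """Represent labeled events as a (T, K) spatiotemporal grid whose cells
    hold event-category ids (0 = no event), one column per player.

    events: iterable of (t, player_index, event_type) tuples (assumed schema).
    """
    grid = np.zeros((T, K), dtype=int)
    for t, k, e in events:
        grid[t, k] = e
    return grid
```

Stacking each player's event stream along one axis and time along the other yields the two-dimensional grid described above.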
At step 1206, multi-object tracking of one or more agents of the received sports broadcast footage may be performed to determine one or more vectors. The one or more vectors may include at least one of an agent's two-dimensional coordinates on a sporting event's field, an agent position, an agent team, an indicator indicating that the agent is a ball, or player visibility information.
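For illustration, one such per-agent tracking vector may be represented by a structure such as the following (field names are assumptions, not the disclosed format):

```python
from dataclasses import dataclass

@dataclass
class AgentObservation:
    """One tracking vector per agent per frame, per the fields listed above."""
    x: float          # two-dimensional field coordinates
    y: float
    position: str     # playing position, e.g. "guard"
    team: int         # agent team identity
    is_ball: bool     # indicator that this agent is the ball
    visible: bool     # player visibility information
```

A multi-object tracker would emit one such record per detected agent per frame, forming the vector stream consumed at step 1208.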
At step 1208, the labeled event data and one or more vectors may be input into a diffusion model.
At step 1210, one or more trajectory sequences for the one or more agents may be determined using the diffusion model. The diffusion model may apply spatiotemporal axial attention on the received event data and the one or more vectors, where self-attention is applied across the temporal and spatial axes, separately. The diffusion model may include an event encoder; and a tracking decoder, wherein the event encoder encodes the labeled event data and the tracking decoder conditionally decodes trajectory sets. The event encoder may embed the event data, embedding the event data further including: tokenizing the labeled event data using a linear projection; applying sinusoidal positional embeddings to specify temporal occurrences of the event data; processing the event data with stacked encoders; and outputting event embeddings. The tracking decoder may use attention to embed and fuse the one or more vectors with the event embeddings. The diffusion model may further include a second tracking decoder; and a transpose temporal convolution, the transpose temporal convolution being configured to expand trajectories to their initial temporal dimensionality.
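Spatiotemporal axial attention, in which self-attention is applied separately along the temporal and agent axes, may be sketched as follows (single-head and without learned projections, purely for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention over a (N, D) sequence."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def axial_attention(tokens):
    """Axial attention over a (T, K, D) grid of agent-timestep tokens:
    attend along time for each agent, then along agents for each timestep."""
    T, K, _ = tokens.shape
    t_out = np.stack([self_attention(tokens[:, k]) for k in range(K)], axis=1)
    s_out = np.stack([self_attention(t_out[t]) for t in range(T)], axis=0)
    return s_out
```

Factoring attention per axis reduces the cost from attending over all T×K tokens jointly to two smaller attention passes, while still mixing information across both time and agents.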
The training data 1312 and a training algorithm 1320 may be provided to a training component 1330 that may apply the training data 1312 to the training algorithm 1320 to generate a trained machine learning model 1350. According to an implementation, the training component 1330 may be provided comparison results 1316 that compare a previous output of the corresponding machine learning model to apply the previous result to re-train the machine learning model. The comparison results 1316 may be used by the training component 1330 to update the corresponding machine learning model. The training algorithm 1320 may utilize machine learning networks and/or models including, but not limited to, a deep learning network such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Fully Convolutional Networks (FCN), and Recurrent Neural Networks (RNN), probabilistic models such as Bayesian Networks and Graphical Models, and/or discriminative models such as Decision Forests and maximum margin methods, or the like. The output of the flowchart 1310 may be a trained machine learning model 1350.
A machine learning model disclosed herein may be trained by adjusting one or more weights, layers, and/or biases during a training phase. During the training phase, historical or simulated data may be provided as inputs to the model. The model may adjust one or more of its weights, layers, and/or biases based on such historical or simulated information. The adjusted weights, layers, and/or biases may be configured in a production version of the machine learning model (e.g., a trained model) based on the training. Once trained, the machine learning model may output machine learning model outputs in accordance with the subject matter disclosed herein. According to an implementation, one or more machine learning models disclosed herein may continuously update based on feedback associated with use or implementation of the machine learning model outputs.
It should be understood that aspects in this disclosure are exemplary only, and that other aspects may include various combinations of features from other aspects, as well as additional or fewer features.
In general, any process or operation discussed in this disclosure that is understood to be computer-implementable, such as the processes illustrated in the flowcharts disclosed herein, may be performed by one or more processors of a computer system, such as any of the systems or devices in the exemplary environments disclosed herein, as described above. A process or process step performed by one or more processors may also be referred to as an operation. The one or more processors may be configured to perform such processes by having access to instructions (e.g., software or computer-readable code) that, when executed by the one or more processors, cause the one or more processors to perform the processes. The instructions may be stored in a memory of the computer system. A processor may be a central processing unit (CPU), a graphics processing unit (GPU), or any suitable types of processing unit.
A computer system, such as a system or device implementing a process or operation in the examples above, may include one or more computing devices, such as one or more of the systems or devices disclosed herein. One or more processors of a computer system may be included in a single computing device or distributed among a plurality of computing devices. A memory of the computer system may include the respective memory of each computing device of the plurality of computing devices.
The computer 1400 may also have a memory 1404 (such as RAM) storing instructions 1424 for executing techniques presented herein, for example the methods described with respect to
Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine-readable medium. “Storage” type media include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the mobile communication network into the computer platform of a server and/or from a server to the mobile device. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
While the disclosed methods, devices, and systems are described with exemplary reference to transmitting data, it should be appreciated that the disclosed aspects may be applicable to any environment, such as a desktop or laptop computer, an automobile entertainment system, a home entertainment system, etc. Also, the disclosed aspects may be applicable to any type of Internet protocol.
It should be appreciated that in the above description of exemplary aspects of the invention, various features of the invention are sometimes grouped together in a single aspect, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed aspect. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate aspect of this invention.
Furthermore, while some aspects described herein include some but not other features included in other aspects, combinations of features of different aspects are meant to be within the scope of the invention, and form different aspects, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed aspects can be used in any combination.
Thus, while certain aspects have been described, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, functionality may be added or deleted from the block diagrams and operations may be interchanged among functional blocks. Operations may be added or deleted to methods described within the scope of the present invention.
The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other implementations, which fall within the true spirit and scope of the present disclosure. Thus, to the maximum extent allowed by law, the scope of the present disclosure is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description. While various implementations of the disclosure have been described, it will be apparent to those of ordinary skill in the art that many more implementations are possible within the scope of the disclosure. Accordingly, the disclosure is not to be restricted except in light of the attached claims and their equivalents.
This application claims the benefit of U.S. Provisional Patent Application 63/477,932, filed Dec. 30, 2022, the entire contents of which are incorporated herein by reference for all purposes.
Number | Date | Country
63477932 | Dec 2022 | US

Number | Date | Country
Parent | 18401006 | Dec 2023 | US
Child | 18421539 | US