Light Weight Multi-Branch and Multi-Scale Person Re-Identification

Information

  • Patent Application
  • Publication Number
    20220351535
  • Date Filed
    December 20, 2019
  • Date Published
    November 03, 2022
Abstract
A system for lightweight multi-branch and multi-scale (LMBMS) re-identification is described herein. The system includes a convolutional neural network trained for person identification, wherein the convolutional neural network comprises a series of residual blocks that obtain input from a head network of the convolutional neural network. The system also includes a plurality of refine blocks, wherein one or more refine blocks take as input features from a residual block of the series of residual blocks, wherein the features are input at different scales and different resolutions and an output of the plurality of refine blocks is a plurality of features in a same feature space. A channel-wise attention mechanism may merge the plurality of features and generate final dynamic features.
Description
BACKGROUND

Multiple cameras can be used to capture activity in a scene. Subsequent processing of the captured images enables end users to view the scene and move throughout the scene over a full 360-degree range of motion. For example, multiple cameras may be used to capture a sports game and end users can move throughout the area of play freely. The end user may also view the game from a virtual camera.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram of a multiple player tracking system;



FIG. 2 is an illustration of bounding boxes extracted from frames captured from a single camera view;



FIG. 3 is an illustration of bounding boxes captured from multiple cameras at a timestamp t;



FIG. 4 is a block diagram of a structure of a lightweight multiple branch multiple scale model for re-identification;



FIG. 5 is a block diagram of the structure of a bottleneck and a refine block;



FIG. 6 is an illustration of a channel-wise attention mechanism;



FIG. 7 is an illustration of players;



FIG. 8 is a process flow diagram of a method that enables lightweight multi-branch and multi-scale (LMBMS) re-identification;



FIG. 9 is a block diagram illustrating a computing device that enables lightweight multi-branch and multi-scale re-identification; and



FIG. 10 is a block diagram showing computer readable media that stores code for enabling a lightweight multi-branch and multi-scale re-identification.





The same numbers are used throughout the disclosure and the figures to reference like components and features. Numbers in the 100 series refer to features originally found in FIG. 1; numbers in the 200 series refer to features originally found in FIG. 2; and so on.


DESCRIPTION OF THE EMBODIMENTS

Sporting events and other competitions are often broadcast for the entertainment of end users. These games may be rendered in a variety of formats. For example, a game can be rendered as a two-dimensional video or a three-dimensional video. The games may be captured using one or more high-resolution cameras positioned around an entire area of play. The plurality of cameras may capture an entire three-dimensional volumetric space, including the area of play. In embodiments, the camera system may include multiple super high-resolution cameras for volumetric capture. The end users can view the action of the game and move through the captured volume freely by being presented with a sequence of images representing the three-dimensional volumetric space. Additionally, an end user can view the game from a virtual camera that follows the action within the area by following the ball or a specific player in the three-dimensional volumetric space. As used herein, the area of play is the portion of space that is officially marked for game play. In examples, the area of play may be a field, court, or other ground used for game play.


Capturing a sporting event using multiple high-resolution cameras creates a three-dimensional (3D) scene for every play in real-time, which brings an immersive, interactive experience to end users. To replay exciting moments from any angle, a virtual camera can be controlled to follow a specific path to offer an immersive user experience. A core component of controlling a path of the virtual camera is a multiple camera (multi-camera) based multiple player tracking algorithm that is used to track every player within the area of play. In the multiple camera based multiple player tracking, player re-identification (Re-ID) associates the same player across frames captured by the same camera. A multi-camera association component may find the same player across frames from different cameras.


The present techniques enable a lightweight multi-branch and multi-scale (LMBMS) re-identification. A person may be re-identified by determining the correspondences of an identity of a query person across frames captured by the same camera, as well as frames captured from multiple cameras at a same time. Thus, re-identification can be used in single camera person tracking and multi-view association. The present techniques include re-identification that considers local information, employs multi-scale fusion, and generates dynamic features.


The local information may be local features extracted from shallow layers of a convolutional neural network. A multi-branch structure is built to realize multi-scale fusion in person Re-ID. Additionally, a channel-wise attention mechanism is deployed to generate robust, dynamic features. The LMBMS model according to the present techniques enables a good balance of performance and speed. For ease of description, the present techniques are described using a sporting event as a scene and players as persons that are re-identified. However, the present techniques may apply to any scene with any person or object re-identified.


Person Re-ID is especially challenging when applied to sporting events for several reasons. First, players of the same team wear almost identical jerseys during gameplay. Additionally, extreme interactions where players come into bodily contact with other players are very common in many sports. The extreme interactions and other issues, such as posture and occlusion, are more pronounced in sporting events when compared to existing Re-ID datasets. Traditional solutions focus on unified, abstract, and global information, and as such do not work well in sports applications and thus cannot resolve the aforementioned problems. The lightweight person Re-ID model with a multi-branch and multi-scale convolutional neural network of the present techniques addresses these challenges. For ease of description, the present techniques are described using basketball. However, the present techniques may apply to any scene and any sport, such as football, soccer, and the like.



FIG. 1 is a block diagram of a multiple player tracking system 100. The multiple player tracking system 100 includes a plurality of single player tracking modules 102 and a multi-view association module 104. In particular, the multiple player tracking system 100 includes single player tracking modules 102A . . . 102N, where there are N cameras and each single player tracking module corresponds to a camera. In embodiments, the camera system may include one or more physical cameras with 5120×3072 resolution, configured to capture the area of play. The number of cameras may be selected to ensure that the entire area of play is captured by at least three cameras. The plurality of cameras captures a real-time video stream at various poses. The plurality of cameras may capture the area of play at 30 frames per second (fps). The number of cameras selected may be different in different scenarios. For example, depending on the structure surrounding the area of play, each location may be captured by at least three cameras using a varying number of cameras.


In embodiments, the multiple-camera based player tracking process decouples the identity of a player from the location of a player. As illustrated, each single player tracking module 102A . . . 102N obtains a respective stream 106A . . . 106N. Each stream may be a video that contains multiple images or frames captured by a camera. For each single player tracking module 102A . . . 102N, a plurality of detection blocks 108A . . . 108N detects a same player in a single camera view. In embodiments, player detection executes on each decoded frame captured by each camera to generate all player positions in a two-dimensional (2D) image. Each player's position may be defined by a bounding box. For each of the detection blocks 108A . . . 108N, isolated bounding boxes are determined that define the location of each player in a stream corresponding to each single camera view. A plurality of tracking blocks 110A . . . 110N track each player detected at blocks 108A . . . 108N in the streams 106A . . . 106N. In embodiments, player tracking is executed via a location-based tracking module. A single camera multiple object tracking algorithm tracks all players for consecutive frames in each camera.
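The following is a minimal sketch of location-based association between detections in consecutive frames of a single camera. The greedy intersection-over-union (IoU) matching, the threshold value, and the function names are illustrative assumptions; the present techniques do not specify a particular single camera tracking algorithm.

```python
# Minimal sketch of location-based association between consecutive frames.
# Greedy IoU matching is an assumption for illustration only.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(prev_boxes, curr_boxes, iou_threshold=0.3):
    """Greedily match detections in the current frame to tracks in the previous frame."""
    matches = []
    used = set()
    for i, prev in enumerate(prev_boxes):
        best_j, best_iou = None, iou_threshold
        for j, curr in enumerate(curr_boxes):
            if j in used:
                continue
            overlap = iou(prev, curr)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    return matches
```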


The single camera player tracking results for each camera are input into a multi-view association module 104. Multiple player tracking results can be obtained in 3D space through multi-view association to connect the same player across multiple cameras. The multi-view association module 104 executes at the frame-level to associate players identified in each camera view across all camera views. Accordingly, the multi-view association module 104 uses the bounding boxes of a player at a timestamp t from multiple cameras to derive a location of the player in the area of play. The location may be a two-dimensional or a three-dimensional location in the captured volumetric space. In particular, the multi-view association module 104 associates bounding boxes in multiple camera views of an identical player by Re-ID features from each camera of the camera system.


Put another way, the multi-view association module 104 identifies the same player tracked in each camera view. In embodiments, a 3D position is computed for each player using projection matrices. In examples, a projection matrix is a camera parameter. The projection matrix enables the use of image coordinates from multiple cameras to generate corresponding three-dimensional locations. The projection matrix maps the two-dimensional coordinates of a player from multiple cameras at a timestamp t to a three-dimensional location within the three-dimensional volumetric space of the captured scene.
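The following is a minimal sketch of how projection matrices may be used to recover a three-dimensional location from two-dimensional image coordinates observed by multiple cameras at a timestamp t. The direct linear transform shown here is a standard triangulation approach used for illustration; the function name and example matrices are hypothetical and not taken from the present techniques.

```python
import numpy as np

def triangulate(points_2d, projection_matrices):
    """Recover a 3D point from 2D image coordinates observed by multiple cameras.

    points_2d: list of (u, v) coordinates of the same player at timestamp t.
    projection_matrices: list of 3x4 camera projection matrices (one per camera).
    """
    rows = []
    for (u, v), P in zip(points_2d, projection_matrices):
        P = np.asarray(P, dtype=np.float64)
        # Each observation contributes two linear constraints on the homogeneous 3D point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Solve A X = 0 in the least-squares sense via SVD; X is the homogeneous 3D point.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Example with two synthetic cameras observing the point (1, 2, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                    # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])    # camera shifted along x
print(triangulate([(0.2, 0.4), (0.0, 0.4)], [P1, P2]))           # approximately [1. 2. 5.]
```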


Through the use of images obtained from high resolution cameras, the present techniques are able to immerse an end user in a three-dimensional recreation of a sporting event or game. In embodiments, an end user is able to view gameplay from any point within the area of play. The end user is also able to view a full 360° of the game at any point within the area of play. Thus, in embodiments an end user may experience gameplay from the perspective of any player. The game may be captured via a volumetric capture method. For example, game footage may be recorded using a plurality of 5K ultra-high-definition cameras that capture height, width, and depth of data points to produce voxels (pixels with volume). Thus, a camera system according to the present techniques may include multiple super-high-resolution cameras to capture the entire playing area. After the game content is captured, a substantial amount of data is processed, and all viewpoints of a fully volumetric three-dimensional person or object are recreated. This information may be used to render a virtual environment in a multi-perspective three-dimensional format that enables users to experience a captured scene from any angle and perspective, and can provide a true six degrees of freedom. The ability to progress through the area of play enables an end user to “see” the same view as a player saw during real-time gameplay. The present techniques also enable game storytelling, assistant coaching, coaching, strategizing, player evaluation, and player analysis through the accurate location of players at all times during active gameplay. Re-identification as described herein can be applied to frames from a single camera during single camera tracking. Re-identification may also be applied to frames from multiple cameras at time t. Thus, in a multiple player tracking system, player Re-ID plays an important role in both single camera player tracking and multi-view association.


The diagram of FIG. 1 is not intended to indicate that the example system is to include all of the systems and modules shown in FIG. 1. Rather, the example system 100 can be implemented using fewer or additional camera systems and modules not illustrated in FIG. 1.



FIG. 2 is an illustration of a plurality of bounding boxes 200 extracted from frames captured by a single camera view. In single camera tracking, Re-ID is used to match a player in pairs between frames captured by the same camera. As illustrated in FIG. 2, a bounding box 202, bounding box 204, bounding box 206, and bounding box 208 are extracted from frames captured by a single camera at various timestamps. For each pair of frames captured by the single camera, all players are detected by determining isolated bounding boxes that define the location of each player and then matching features within the bounding boxes. Re-ID may be used to obtain an association of the bounding boxes of an identical player across frames captured by the single camera. In embodiments, each player is assigned a unique identifier across the frames captured by the single camera.


The diagram of FIG. 2 is not intended to indicate that the frames are limited to capturing a single player. Rather, the bounding boxes 200 are image patches derived from a frame, which can include multiple image patches or bounding boxes. The present techniques can be implemented using any number of bounding boxes not illustrated in FIG. 2.



FIG. 3 is an illustration of bounding boxes 300 captured from multiple cameras at a timestamp t. In FIG. 3, an area of play 302 is illustrated. A plurality of cameras is positioned around the area of play 302. In particular, camera 02, camera 04, camera 16, and camera 20 may be used to capture a plurality of players at a timestamp t. The isolated bounding box 304, bounding box 306, bounding box 308, and bounding box 310 are of a same player. These bounding boxes are associated across multiple camera views by matching each player in pairs among all cameras at the same timestamp t, as illustrated in FIG. 3.


Though the particular usage of player Re-ID in player tracking and multi-view association is different, their underlying philosophy is the same, i.e., determining whether a pair of image patches is the same player from any frames and any cameras. Accordingly, the present techniques identify the same player across multiple frames by matching the same player in image patches, such as bounding boxes. Traditional Re-ID solutions rely on a unified global feature and inefficient local features. Generally, deep layers of a neural network are used to produce a global feature, while shallow layers produce local features. Traditional models generate features using deep layers which may be referred to as a unified global feature. As players wear similar jerseys and share similar backgrounds when within the area of play, global features are not capable of distinguishing different players since there is very little difference between their global features. In addition, global features are traditionally extracted from deeper layers of a convolution neural network. These deep layers mostly contain high level semantic information and easily fall into pairing errors. Moreover, traditional solutions obtain local features by dividing an image into several independent parts, and then summarizing the local features independently. Independent summarization of local features causes errors due to the lack of context when extracting local features. Additionally, extracting local features without any context will weaken the most discriminative part of the images when all parts' features are merged. A discriminative part of a feature or discriminative features are those features that are most likely or always different between persons within bounding boxes.


The lightweight multi-branch and multi-scale (LMBMS) re-identification according to the present techniques extracts global features and abundant multi-scale multi-level local features. In embodiments, an LMBMS model consists of three branches. Each branch starts from a different part of the backbone network. As described herein, the backbone network may be an 18-layer Residual Network (ResNet-18). After several refine blocks in each branch, the features are generated at different scales and different levels. After that, features from all branches are merged. Finally, a channel-wise attention mechanism is applied to select efficient features dynamically. Features from some channels with a high identification are strengthened, and features from other channels with low identification are weakened. Features with a high identification may be the most discriminative features. Accordingly, the present techniques can focus on and use the most discriminative part of an image to distinguish and pair players. For example, two players on the same team may look the same, and the most discriminative part of features extracted from the two players will be the jersey number area. Put another way, the jersey number area will most likely or always be different between persons on the same team. The ability to strengthen and weaken particular channels may be learned from a dataset during training. The structure of the LMBMS Re-ID model is depicted in FIG. 4.


The diagram of FIG. 3 is not intended to indicate that the frames are limited to capturing a single player. Rather, the bounding boxes 300 are image patches derived from a frame, which can include multiple image patches or bounding boxes. The present techniques can be implemented using any number of bounding boxes not illustrated in FIG. 3.



FIG. 4 is a block diagram of a structure of a lightweight multiple branch multiple scale model 400 for re-identification. In embodiments, the lightweight multiple branch multiple scale model for re-identification is a neural network 402. The convolutional network backbone as described herein is an 18-layer Residual Network (ResNet-18). However, any classification model with a good balance of performance and speed can be used. As shown in Table 1, ResNet-18 has six parts. Different parts have different resolutions, and different resolutions present features at different levels (such as abstract and detailed) and different scales, wherein a different scale may refer to various sizes of the image.












TABLE 1

Head Net    conv(7*7*64, S2)
            Max Pool(3*3, S2, P1)
Block 1     conv(3*3*64)*4
Block 2     conv(3*3*128)*4
Block 3     conv(3*3*256)*4
Block 4     conv(3*3*512)*4
FC          conv(1*1*1000)

As indicated in Table 1, the portions of ResNet-18 used according to the present techniques include a head network (head net) and four residual blocks. Accordingly, for the head net 402, a 7×7 convolutional layer with 64 output channels and a stride of 2 is followed by a 3×3 maximum pooling layer with a stride of 2. Each residual block may have a number of convolutional layers as described with respect to FIG. 5. However, the residual blocks according to the present techniques include additional average pooling as described with respect to FIG. 5.


As defined in Table 1, residual block 1 includes four 3×3 convolutional layers with 64 output channels. Residual block 2 includes four 3×3 convolutional layers with 128 output channels. Residual block 3 includes four 3×3 convolutional layers with 256 output channels. Residual block 4 includes four 3×3 convolutional layers with 512 output channels. A fully connected layer flattens a 2D array into a 1D array and connects all input nodes to all output nodes.
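The following is a minimal PyTorch sketch of the head network of Table 1. The placement of batch normalization and ReLU follows the standard ResNet-18 design and is an assumption not spelled out in Table 1; the example input size is illustrative.

```python
import torch
from torch import nn

# Head network from Table 1: a 7x7, 64-channel convolution with stride 2,
# followed by 3x3 max pooling with stride 2 and padding 1.
head_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 256, 128)   # an example bounding-box crop
features = head_net(x)            # -> (1, 64, 64, 32): quarter resolution, 64 channels
print(features.shape)
```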


The LMBMS Re-ID model 400 uses head net 402, residual block 1 406, residual block 2 408, and residual block 3 410 for extracting features. For each residual block 1 406, residual block 2 408, and residual block 3 410, multi-level features are extracted with multiple scales at different resolutions. According to the present techniques, the features with a higher resolution contain more detailed local information when compared to features with a lower resolution, which have abstract global information. In embodiments, residual blocks may be positioned in a series, where the first residual block in the series is closest to the head net. The closer a residual block is to the head net, the shallower the residual block. Thus, shallow features may be obtained from layers closer to the head net. The farther the residual block is positioned in the series from the head net, the deeper the residual block. Deeper features may be obtained from layers farther from the head net. The LMBMS Re-ID model 400 derives features from bounding boxes input to the series of residual blocks, wherein features from deeper residual blocks are at a higher scale and a higher resolution when compared to features from shallower residual blocks.


In embodiments, each residual block forms the start of a branch of the LMBMS. Features extracted from each residual block may be input to a plurality of refine blocks. Thus, each branch of the LMBMS includes a residual block and one or more refine blocks. In embodiments, the number of refine blocks applied to features extracted from a residual block is dependent on a scale and resolution of the features. In particular, the number of refine blocks in each branch may be selected to guarantee that the output of each branch has a same size. Since each residual block may have a different output size, the refine blocks can generate a same size output for each branch. As illustrated in FIG. 4, features extracted from the residual block 1 406 may be processed by refine blocks 412A, 414A, and 416A. Features extracted from the residual block 2 408 may be processed by refine blocks 412B and 414B. Features extracted from the residual block 3 410 may be processed by a refine block 412C. The refine blocks illustrated in FIG. 4 unify different levels of features into the same feature space. A channel-wise attention mechanism 418 receives and pools the feature maps from the refine blocks into a feature map. The features may be reshaped and squeezed into a sequence. As used here, to squeeze into a sequence may refer to flattening a high-dimension array into a one-dimensional sequence. The output network (output net) may then serve as an output layer that performs a 1×1 convolution.
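The following is a small sketch of the resolution bookkeeping behind the refine-block counts in FIG. 4. It assumes the head network reduces resolution to one quarter, residual blocks 2 and 3 each halve it again (as in a standard ResNet-18), and every refine block halves it once more; under these assumptions, three, two, and one refine blocks bring the three branches to the same output size.

```python
# Spatial-resolution bookkeeping for the three branches (strides are assumptions
# consistent with Table 1 and a standard ResNet-18, not stated explicitly).
branches = {
    # branch: (downscale factor after its residual block, number of refine blocks)
    "residual block 1": (4, 3),
    "residual block 2": (8, 2),
    "residual block 3": (16, 1),
}

for name, (scale, num_refine) in branches.items():
    out_scale = scale * (2 ** num_refine)   # each refine block halves the resolution
    print(f"{name}: 1/{scale} input -> 1/{out_scale} after {num_refine} refine block(s)")
# Every branch ends at 1/32 of the original resolution, so the outputs can be merged.
```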


The diagram of FIG. 4 is not intended to indicate that the example system is to include all of the blocks and modules shown in FIG. 4. Rather, the example system 400 can be implemented using fewer or additional blocks and modules not illustrated in FIG. 4.



FIG. 5 is a block diagram of the structure of a bottleneck 502 and a refine block 504. The traditional bottleneck 502 includes a layer 502A that performs a 1×1 convolution, a layer 504A that performs a 3×3 convolution, and a layer 506A that performs a second 1×1 convolution. In execution, the layer 502A and the layer 506A function to reduce and then restore dimensions, while the layer 504A creates a bottleneck with smaller input/output dimensions than the convolution performed. Accordingly, the bottleneck 502 reduces channels of the input before performing the computationally expensive 3×3 convolution and restores the channels back to the original shape using a 1×1 convolution. The skip branch of the traditional bottleneck 502 includes a layer 508A that performs a 1×1 convolution.


The refine block 504 includes a layer 502B that performs a 1×1 convolution, a layer 504B that performs a 3×3 convolution, and a layer 506B that performs a second 1×1 convolution. The refine block 504 also includes a layer 508B in a skip branch that performs a 1×1 convolution. Similar to the traditional bottleneck 502, the refine block 504 can receive an input feature map of any resolution and generate a new feature map that is half the resolution and has four times the number of channels. An additional average pooling layer 510 in the skip branch abstracts more general information to focus on important, discriminative areas of the input image. As illustrated in FIG. 4, different numbers of refine blocks are stacked in each branch, which unifies different levels of features into the same feature space. In particular, different branches of the LMBMS may have features of different sizes and different resolutions that are not in the same feature space. In order to maintain the characteristic features, each refine block may transfer features into a same feature space.
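The following is a minimal PyTorch sketch of a refine block following FIG. 5 and the claim language, in which the main path of 1×1, 3×3, and 1×1 convolutions is multiplied with a skip path of average pooling and a 1×1 convolution. The strides, normalization, and activation choices are assumptions; only the convolution sizes, the average pooling, and the halved-resolution, four-times-channel output are taken from the description.

```python
import torch
from torch import nn

class RefineBlock(nn.Module):
    """Sketch of the refine block of FIG. 5: a bottleneck-style main path and a
    skip path with average pooling, combined by element-wise multiplication."""

    def __init__(self, in_channels):
        super().__init__()
        out_channels = 4 * in_channels
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.skip = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        # Multiply (rather than add) the two paths, as described for the refine block.
        return self.main(x) * self.skip(x)

block = RefineBlock(in_channels=64)
y = block(torch.randn(1, 64, 64, 32))
print(y.shape)  # (1, 256, 32, 16): half the resolution, four times the channels
```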


The diagram of FIG. 5 is not intended to indicate that the refine block 504 is to include all of the convolutions and pooling shown in FIG. 5. Rather, the example refine block 504 can be implemented using fewer or additional convolutional layers and pooling not illustrated in FIG. 5.



FIG. 6 is an illustration of a channel-wise attention mechanism 600. When features from all branches are collected as illustrated in FIG. 4, a channel-wise attention mechanism is applied to the features to obtain a weight distribution for each feature. The channel-wise attention mechanism can generate features dynamically and has a more efficient feature expression. In the channel-wise attention mechanism 600, block 602 performs adaptive pooling. In adaptive pooling, two-dimensional (2D) feature maps are pooled into a single value. At block 604, the single feature map value is reshaped and squeezed into a sequence. Thus, a high dimension array may be converted into a one-dimension array. A layer 606 is a fully connected layer with 64 output channels. A layer 608 is a fully connected layer with 128 output channels.


After the two fully connected convolution layers 606 and 608, a 128-dimension vector is obtained, where each dimension represents an importance of each channel. To obtain weight of each channel, a softmax layer 610 is applied. The softmax function is applied as follows:







\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=0}^{k} e^{x_j}}

The softmax layer 610 produces a unified weight distribution. To obtain the final features, the weight distribution output of the softmax layer 610 is multiplied by the original features. In embodiments, Re-ID features are treated as a distance metric. When used in single-camera multiple player tracking and multi-camera association, the features should be normalized in a unified space. In embodiments, batch normalization is added to the final output.
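The following is a minimal PyTorch sketch of the channel-wise attention mechanism of FIG. 6, assuming the merged multi-branch feature map has 128 channels so that the second fully connected layer produces one weight per channel. The class name and the use of average pooling for the adaptive pooling step are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ChannelWiseAttention(nn.Module):
    """Sketch of the channel-wise attention of FIG. 6."""

    def __init__(self, channels=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # pool each 2D feature map to a single value
        self.fc1 = nn.Linear(channels, 64)    # fully connected layer with 64 outputs
        self.fc2 = nn.Linear(64, channels)    # fully connected layer with 128 outputs
        self.bn = nn.BatchNorm2d(channels)    # normalize the final dynamic features

    def forward(self, x):
        b, c, _, _ = x.shape
        squeezed = self.pool(x).reshape(b, c)                 # reshape and squeeze to a sequence
        weights = F.softmax(self.fc2(self.fc1(squeezed)), dim=1)
        # Strengthen high-identification channels and weaken the rest.
        out = x * weights.reshape(b, c, 1, 1)
        return self.bn(out)

attention = ChannelWiseAttention(channels=128)
features = attention(torch.randn(2, 128, 8, 4))
print(features.shape)  # (2, 128, 8, 4)
```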


The diagram of FIG. 6 is not intended to indicate that the channel-wise attention mechanism 600 is to include all of the convolutions and pooling shown in FIG. 6. Rather, the example the channel-wise attention mechanism 600 can be implemented using fewer or additional fully connected layers and pooling not illustrated in FIG. 6.


To evaluate the LMBMS model in sports analysis, a dataset with 20,000 images containing 100 players from more than 50 basketball videos is used to train the model. FIG. 7 is an illustration of players 702, 704, 706, and 708. The players 702, 704, 706, and 708 have different sizes, poses, orientations, and backgrounds. In all, each player has 200 images from several camera angles in several sequences. 90% of the data is used as the training set, and the remaining 10% is the testing set.


During a training phase of the LMBMS model, training efficiency may be promoted as follows. First, in each epoch, examples of varying difficulty are selected: hard examples are made of different players from the same team, and easy examples are different players from different teams. Second, each image is assigned a fixed probability of augmentation to simulate an occlusion situation where more than one player is in a bounding box. Finally, because features are used when making pairs, a triplet loss for a metric task is combined with a cross entropy loss for a classification task to guide training. The losses can obtain good classification results during training as well as generate an appropriate feature mapping function. Triplet loss and cross entropy loss are as follows:











\mathrm{loss}_{\text{triplet}} = \sum_{i=0}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+ \quad \text{(Triplet Loss)}

\mathrm{loss}_{\text{cross-entropy}} = -\frac{1}{N} \sum_{i=0}^{N} \sum_{k=0}^{M} y_{ik} \log p_{ik} \quad \text{(Cross Entropy Loss)}

The final loss function is made of triplet loss, cross entropy loss and a constant weight, which balances the two types of loss:





\mathrm{loss}_{\text{LMBMS}} = \mathrm{loss}_{\text{triplet}} + \alpha \, \mathrm{loss}_{\text{cross-entropy}} \quad \text{(Loss function of LMBMS)}


where α is 3 in the exemplary dataset.
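The following is a minimal sketch of the combined training loss, with α = 3 weighting the cross entropy term as in the exemplary dataset. The triplet margin value, the tensor shapes, and the function name are illustrative assumptions.

```python
import torch
from torch import nn

# Triplet loss for the metric task plus cross entropy for the classification task,
# combined with a constant weight (alpha = 3 for the exemplary dataset).
triplet = nn.TripletMarginLoss(margin=0.3)   # margin value is an assumption
cross_entropy = nn.CrossEntropyLoss()
alpha = 3.0

def lmbms_loss(anchor_feat, positive_feat, negative_feat, logits, labels):
    """anchor/positive/negative features come from the Re-ID model; logits and
    labels drive the identity-classification head used during training."""
    return triplet(anchor_feat, positive_feat, negative_feat) + alpha * cross_entropy(logits, labels)

# Example usage with random tensors standing in for model outputs.
anchor = torch.randn(8, 128)
positive = torch.randn(8, 128)
negative = torch.randn(8, 128)
logits = torch.randn(8, 100)               # 100 player identities
labels = torch.randint(0, 100, (8,))
print(lmbms_loss(anchor, positive, negative, logits, labels).item())
```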


The present techniques keep a good balance between accuracy and speed. High accuracy guarantees that the model can obtain more accurate identification results in challenging situations when compared to traditional models. Moreover, fewer parameters guarantee that the model can run in real time. Traditionally, person Re-ID focuses on separating different identities within the same class, which is an extension of the classification task. First, these traditional techniques are typically based on the deeper layers of ResNet-50, which is a baseline in classification tasks. The deeper layers of ResNet-50 carry less local information, which is important when distinguishing people with almost the same appearance. Second, in traditional techniques that use local information, the image is divided into parts to perform multi-scale feature extraction. Such techniques cannot realize multi-scale fusion when they separate local information from similar backgrounds. Finally, traditional techniques do not have a good balance between performance and model complexity.



FIG. 8 is a process flow diagram of a method 800 that enables lightweight multi-branch and multi-scale (LMBMS) re-identification. The example method 800 can be implemented in the system 100 of FIG. 1, the computing device 900 of FIG. 9, or the computer readable media 1000 of FIG. 10. For example, the method 800 can be implemented using the single player tracking module 102, the multi-view association module 104, the CPU 902, the GPU 908, or the processor 1002.


At block 802, local information is extracted from shallow layers of a convolutional neural network. The local information may be local features extracted from images input to a convolutional neural network, such as ResNet-18. The images input to the ResNet-18 may be isolated bounding boxes that define a location of each player in a single camera view captured by a camera of a plurality of cameras. The bounding box may be described by a location of the bounding box within the image according to xy coordinates. The width (w) and the height (h) of the bounding box are also given.
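The following is a minimal sketch of extracting an image patch from a decoded frame given a bounding box described by x, y, w, and h. Treating (x, y) as the top-left corner is an assumption, and the function name is hypothetical.

```python
import numpy as np

def crop_bounding_box(frame, box):
    """Extract the image patch for one player from a decoded frame.

    frame: H x W x 3 array for one camera view.
    box:   (x, y, w, h) with (x, y) assumed to be the top-left corner.
    """
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

frame = np.zeros((3072, 5120, 3), dtype=np.uint8)    # one 5120x3072 camera frame
patch = crop_bounding_box(frame, (1200, 800, 128, 256))
print(patch.shape)  # (256, 128, 3) patch fed to the Re-ID network
```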


At block 804, multi-level features at multiple scales are derived at different resolutions from the extracted local information. In particular, local features extracted from residual block 1, residual block 2, and residual block 3 may be input into one or more refine blocks, wherein extraction of features from each residual block is the start of a branch in the LMBMS model. Features extracted from residual block 1 are processed through three refine blocks. In the ResNet-18, residual block 1 receives the output of the head net. Features extracted from residual block 2 are processed through two refine blocks. In the ResNet-18, residual block 2 receives the output of the residual block 1. Features extracted from residual block 3 are processed through one refine block. In the ResNet-18, residual block 3 receives the output of residual block 2. Accordingly, after several refine blocks in each branch, the features are generated at different scales and different levels.


At block 806, dynamic features are generated via a channel-wise attention mechanism. The channel-wise attention mechanism merges the features generated at different scales from each branch of the LMBMS model. In particular, efficient features are selected from the features generated at different scales and different levels. Features from some channels with high identification are strengthened, and features from other channels with low identification are weakened via a weight distribution derived for each feature. To obtain weight of each channel, a softmax function is applied. The softmax function produces a unified weight distribution. To obtain the final features, the weight distribution output of the softmax function is multiplied by the original features.


Once features are obtained, feature matching may be executed. In single camera Re-ID, feature matching occurs across pairs of frames from a single camera. In multiple-camera association, feature matching occurs across pairs of frames from different cameras of a camera system at timestamp t. In feature matching, a detected player in one frame is matched with the same player in other frames according to extracted features. In this manner, the Re-ID as described herein enables player identification even in extreme occlusions and other poor conditions.
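The following is a minimal sketch of feature matching between two sets of detections, whether from consecutive frames of one camera or from different cameras at timestamp t. The cosine similarity, the greedy assignment, the threshold, and the function name are illustrative assumptions and not a specified part of the present techniques.

```python
import numpy as np

def match_by_features(query_feats, gallery_feats, threshold=0.5):
    """Greedily match each query detection to the closest gallery detection by
    cosine similarity of their Re-ID features."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    similarity = q @ g.T                                # pairwise cosine similarity
    matches = []
    used = set()
    for i in np.argsort(-similarity.max(axis=1)):       # most confident queries first
        for j in np.argsort(-similarity[i]):
            if j not in used and similarity[i, j] >= threshold:
                matches.append((int(i), int(j)))
                used.add(int(j))
                break
    return matches

query = np.random.rand(5, 128)    # features from frame t (or camera A)
gallery = np.random.rand(6, 128)  # features from frame t+1 (or camera B)
print(match_by_features(query, gallery))
```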


This process flow diagram is not intended to indicate that the blocks of the example method 800 are to be executed in any particular order, or that all of the blocks are to be included in every case. Further, any number of additional blocks not shown may be included within the example method 800, depending on the details of the specific implementation.


Referring now to FIG. 9, a block diagram is shown illustrating a computing device that enables lightweight multi-branch and multi-scale (LMBMS) re-identification. The computing device 900 may be, for example, a laptop computer, desktop computer, tablet computer, mobile device, or wearable device, among others. The computing device 900 may include a central processing unit (CPU) 902 that is configured to execute stored instructions, as well as a memory device 904 that stores instructions that are executable by the CPU 902. The CPU 902 may be coupled to the memory device 904 by a bus 906. Additionally, the CPU 902 can be a single core processor, a multi-core processor, a computing cluster, or any number of other configurations. Furthermore, the computing device 900 may include more than one CPU 902. In some examples, the CPU 902 may be a system-on-chip (SoC) with a multi-core processor architecture. In some examples, the CPU 902 can be a specialized digital signal processor (DSP) used for image processing. The memory device 904 can include random access memory (RAM), read only memory (ROM), flash memory, or any other suitable memory systems. For example, the memory device 904 may include dynamic random-access memory (DRAM).


The computing device 900 may also include a vision processing unit or graphics processing unit (GPU) 908. As shown, the CPU 902 may be coupled through the bus 906 to the GPU 908. The GPU 908 may be configured to perform any number of graphics operations within the computing device 900. For example, the GPU 908 may be configured to render or manipulate graphics images, graphics frames, videos, or the like, to be displayed to a viewer of the computing device 900.


The CPU 902 may also be connected through the bus 906 to an input/output (I/O) device interface 912 configured to connect the computing device 900 to one or more I/O devices 914. The I/O devices 914 may include, for example, a keyboard and a pointing device, wherein the pointing device may include a touchpad or a touchscreen, among others. The I/O devices 914 may be built-in components of the computing device 900, or may be devices that are externally connected to the computing device 900. In some examples, the memory 904 may be communicatively coupled to I/O devices 914 through direct memory access (DMA).


The CPU 902 may also be linked through the bus 906 to a display interface 916 configured to connect the computing device 900 to display devices 918. The display devices 918 may include a display screen that is a built-in component of the computing device 900. The display devices 918 may also include a computer monitor, television, or projector, among others, that is internal to or externally connected to the computing device 900. The display devices 918 may also include a head mounted display.


The computing device 900 also includes a storage device 920. The storage device 920 is a physical memory such as a hard drive, an optical drive, a thumbdrive, an array of drives, a solid-state drive, or any combinations thereof. The storage device 920 may also include remote storage drives.


The computing device 900 may also include a network interface controller (NIC) 922. The NIC 922 may be configured to connect the computing device 900 through the bus 906 to a network 924. The network 924 may be a wide area network (WAN), local area network (LAN), or the Internet, among others. In some examples, the device may communicate with other devices through a wireless technology. For example, the device may communicate with other devices via a wireless local area network connection. In some examples, the device may connect and communicate with other devices via Bluetooth® or similar technology.


The computing device 900 further includes a lightweight multi-branch and multi-scale (LMBMS) re-identification module 928 for person identification. A local information extractor 930 may be configured to extract global features and abundant multi-scale, multi-level local features from input images. Branches of the LMBMS Re-ID module 928 may extract local features from shallow layers of ResNet-18. A multi-scale, multi-level feature generator 932 applies one or more refine blocks to the extracted features. A channel-wise attention mechanism 932 generates dynamic features from the refined features. The channel-wise attention mechanism merges the features generated at different scales from each branch of the LMBMS model. Once features are obtained, feature matching may be executed. In single camera Re-ID, feature matching occurs across pairs of frames from a single camera. In multiple-camera association, feature matching occurs across pairs of frames from different cameras of a camera system at timestamp t.


The block diagram of FIG. 9 is not intended to indicate that the computing device 900 is to include all of the components shown in FIG. 9. Rather, the computing device 900 can include fewer or additional components not illustrated in FIG. 9, such as additional buffers, additional processors, and the like. The computing device 900 may include any number of additional components not shown in FIG. 9, depending on the details of the specific implementation. Furthermore, any of the functionalities of the LMBMS Re-ID module 928, the multi-scale, multi-level feature generator 932, and the channel-wise attention mechanism 932 may be partially, or entirely, implemented in hardware and/or in the processor 902. For example, the functionality may be implemented with an application specific integrated circuit, in logic implemented in the processor 902, or in any other device. For example, the functionality of the LMBMS Re-ID module 928 may be implemented with an application specific integrated circuit, in logic implemented in a processor, in logic implemented in a specialized graphics processing unit such as the GPU 908, or in any other device.



FIG. 10 is a block diagram showing computer readable media 1000 that stores code for enabling a lightweight multi-branch and multi-scale (LMBMS) re-identification. The computer readable media 1000 may be accessed by a processor 1002 over a computer bus 1004. Furthermore, the computer readable medium 1000 may include code configured to direct the processor 1002 to perform the methods described herein. In some embodiments, the computer readable media 1000 may be non-transitory computer readable media. In some examples, the computer readable media 1000 may be storage media.


The various software components discussed herein may be stored on one or more computer readable media 1000, as indicated in FIG. 10. For example, a local information extractor module 1006, a multi-scale, multi-level feature module 1008, and a channel-wise attention module 1010 may be stored on the computer readable media 1000.


The local information extractor module 1006 may be configured to extract global features and abundant multi-scale, multi-level local features from input images. The multi-scale, multi-level feature module 1008 may be configured to apply one or more refine blocks to the extracted features. The channel-wise attention module 1010 may be configured to generate dynamic features from the refined features.


The block diagram of FIG. 10 is not intended to indicate that the computer readable media 1000 is to include all of the components shown in FIG. 10. Further, the computer readable media 1000 may include any number of additional components not shown in FIG. 10, depending on the details of the specific implementation.


EXAMPLES

Example 1 is a system for lightweight multi-branch and multi-scale (LMBMS) re-identification. The system includes a convolutional neural network trained for person identification, wherein the convolutional neural network comprises a series of residual blocks that obtain input from a head network of the convolutional neural network; a plurality of refine blocks, wherein one or more refine blocks take as input features from a residual block of the series of residual blocks, wherein the features are input at different scales and different resolutions and an output of the plurality of refine blocks is a plurality of features in a same feature space; and a channel-wise attention mechanism to merge the plurality of features and generate final dynamic features.


Example 2 includes the system of example 1, including or excluding optional features. In this example, feature matching is applied to the final dynamic features across pairs of frames from a single camera view for single camera person re-identification.


Example 3 includes the system of any one of examples 1 to 2, including or excluding optional features. In this example, feature matching is applied to the final dynamic features across pairs of frames, each from a different camera view at a timestamp t, for multiple camera person re-identification.


Example 4 includes the system of any one of examples 1 to 3, including or excluding optional features. In this example, the system includes one or more cameras that capture a plurality of players within an area of play; and a person detector that derives isolated bounding boxes for each player within the area of play, wherein the isolated bounding boxes are input to the convolutional neural network for person identification.


Example 5 includes the system of any one of examples 1 to 4, including or excluding optional features. In this example, the series of residual blocks derives features from bounding boxes input to the series of residual blocks, wherein features from deeper residual blocks are at a higher scale and a higher resolution when compared to features from shallower residual blocks.


Example 6 includes the system of any one of examples 1 to 5, including or excluding optional features. In this example, each refine block performs a 1×1 convolution, a 3×3 convolution, and a second 1×1 convolution on input features and multiplies a result of this series of convolutions with a result of average pooling and another 1×1 convolution applied to the input features in a skip branch of the refine block.


Example 7 includes the system of any one of examples 1 to 6, including or excluding optional features. In this example, a number of refine blocks applied to features extracted from a residual block is dependent on a scale and resolution of the features.


Example 8 includes the system of any one of examples 1 to 7, including or excluding optional features. In this example, the channel-wise attention mechanism obtains a weight distribution for each feature according to a softmax function and merges features input to the channel-wise attention mechanism to obtain a 128-dimension vector of the final dynamic features, where each dimension represents an importance of each channel.


Example 9 includes the system of any one of examples 1 to 8, including or excluding optional features. In this example, batch normalization is applied to the final dynamic features.


Example 10 includes the system of any one of examples 1 to 9, including or excluding optional features. In this example, the convolutional neural network is trained using a triplet loss for a metric task combined with a cross entropy loss for classification.


Example 11 is a method for lightweight multi-branch and multi-scale (LMBMS) re-identification. The method includes extracting local features from bounding boxes input to a convolutional neural network trained for person identification, wherein the convolutional neural network comprises a series of residual blocks that obtain input from a head network of the convolutional neural network; deriving multi-level features at multiple scales from the extracted local features via the series of residual blocks, wherein one or more refine blocks take as input features from each residual block of the series of residual blocks and output a plurality of features in a same feature space; and merging the plurality of features to generate final dynamic features.


Example 12 includes the method of example 11, including or excluding optional features. In this example, feature matching is applied to the final dynamic features across pairs of frames from a single camera view for single camera person re-identification.


Example 13 includes the method of any one of examples 11 to 12, including or excluding optional features. In this example, feature matching is applied to the final dynamic features across pairs of frames, each from a different camera view at a timestamp t, for multiple camera person re-identification.


Example 14 includes the method of any one of examples 11 to 13, including or excluding optional features. In this example, the method includes capturing a plurality of players within an area of play; and deriving isolated bounding boxes for each player within the area of play, wherein the isolated bounding boxes are input to the convolutional neural network for person identification.


Example 15 includes the method of any one of examples 11 to 14, including or excluding optional features. In this example, the series of residual blocks derives features from bounding boxes input to the series of residual blocks, wherein features from deeper residual blocks are at a higher scale and a higher resolution when compared to features from shallower residual blocks.


Example 16 includes the method of any one of examples 11 to 15, including or excluding optional features. In this example, each refine block performs a 1×1 convolution, a 3×3 convolution, and a second 1×1 convolution on input features and multiplies a result of this series of convolutions with a result of average pooling and another 1×1 convolution applied to the input features in a skip branch of the refine block.


Example 17 includes the method of any one of examples 11 to 16, including or excluding optional features. In this example, a number of refine blocks applied to features extracted from a residual block is dependent on a scale and resolution of the features.


Example 18 includes the method of any one of examples 11 to 17, including or excluding optional features. In this example, the method includes obtaining a weight distribution for each feature according to a softmax function and merging features input to a channel-wise attention mechanism to obtain a 128-dimension vector of the dynamic features, where each dimension represents an importance of each channel.


Example 19 includes the method of any one of examples 11 to 18, including or excluding optional features. In this example, batch normalization is applied to the final, dynamic features.


Example 20 includes the method of any one of examples 11 to 19, including or excluding optional features. In this example, the convolutional neural network is trained using a triplet loss for a metric task combined with a cross entropy loss for classification.


Example 21 is at least one computer readable medium for lightweight multi-branch and multi-scale (LMBMS) re-identification having instructions stored therein. The computer-readable medium includes instructions that direct the processor to extract local features from bounding boxes input to a convolutional neural network trained for person identification, wherein the convolutional neural network comprises a series of residual blocks that obtain input from a head network of the convolutional neural network; derive multi-level features at multiple scales from the extracted local features via the series of residual blocks, wherein one or more refine blocks take as input features from each residual block of the series of residual blocks and output a plurality of features in a same feature space; and merge the plurality of features to generate final dynamic features.


Example 22 includes the computer-readable medium of example 21, including or excluding optional features. In this example, feature matching is applied to the final dynamic features across pairs of frames from a single camera view for single camera person re-identification.


Example 23 includes the computer-readable medium of any one of examples 21 to 22, including or excluding optional features. In this example, feature matching is applied to the final dynamic features across pairs of frames, each from a different camera view at a timestamp t, for multiple camera person re-identification.


Example 24 includes the computer-readable medium of any one of examples 21 to 23, including or excluding optional features. In this example, the computer-readable medium includes capturing a plurality of players within an area of play; and deriving isolated bounding boxes for each player within the area of play, wherein the isolated bounding boxes are input to the convolutional neural network for person identification.


Example 25 includes the computer-readable medium of any one of examples 21 to 24, including or excluding optional features. In this example, the series of residual blocks derives features from bounding boxes input to the series of residual blocks, wherein features from deeper residual blocks are at a higher scale and a higher resolution when compared to features from shallower residual blocks.


Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular aspect or aspects. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.


It is to be noted that, although some aspects have been described in reference to particular implementations, other implementations are possible according to some aspects. Additionally, the arrangement and/or order of circuit elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some aspects.


In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.


It is to be understood that specifics in the aforementioned examples may be used anywhere in one or more aspects. For instance, all optional features of the computing device described above may also be implemented with respect to either of the methods or the computer-readable medium described herein. Furthermore, although flow diagrams and/or state diagrams may have been used herein to describe aspects, the techniques are not limited to those diagrams or to corresponding descriptions herein. For example, flow need not move through each illustrated box or state or in exactly the same order as illustrated and described herein.


The present techniques are not restricted to the particular details listed herein. Indeed, those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present techniques. Accordingly, it is the following claims including any amendments thereto that define the scope of the present techniques.

Claims
  • 1. A system for lightweight multi-branch and multi-scale (LMBMS) re-identification, the system comprising: a convolutional neural network trained for person identification, the convolutional neural network including a series of residual blocks that obtain input from a head network of the convolutional neural network; a plurality of refine blocks, one or more of the plurality of refine blocks to take as input features from a residual block of the series of residual blocks, the features are to be input at different scales and different resolutions, the plurality of refine blocks to output a plurality of features in a same feature space; and a channel-wise attention mechanism to merge the plurality of features and generate final dynamic features.
  • 2. (canceled)
  • 4. The system of claim 1, further including: one or more cameras that capture a plurality of players within an area of play; and a person detector that derives isolated bounding boxes for each player within the area of play, wherein the isolated bounding boxes are input to the convolutional neural network for person identification.
  • 5. The system of claim 1, wherein the series of residual blocks derives the features from bounding boxes input to the series of residual blocks, wherein features from deeper residual blocks are at a higher scale and a higher resolution when compared to features from shallower residual blocks.
  • 6. The system of claim 1, wherein each refine block performs a series of convolutions including a 1×1 convolution, a 3×3 convolution, and a second 1×1 convolution on the input features and multiplies a result of the series of convolutions with a result of average pooling and another 1×1 convolution applied to the input features in a skip branch of the refine block.
  • 7. (canceled)
  • 8. The system of claim 1, wherein the channel-wise attention mechanism obtains a weight distribution for each feature according to a softmax function and merges features input to the channel-wise attention mechanism to obtain a 128-dimension vector of the final dynamic features, where each dimension represents an importance of each channel.
  • 9-10. (canceled)
  • 11. A method for lightweight multi-branch and multi-scale (LMBMS) re-identification, the method comprising: extracting local features from bounding boxes input to a convolutional neural network trained for person identification, the convolutional neural network including a series of residual blocks that obtain input from a head network of the convolutional neural network; deriving multi-level features at multiple scales from the extracted local features via the series of residual blocks, one or more refine blocks to take as input features from each residual block of the series of residual blocks and to output a plurality of features in a same feature space; and merging the plurality of features to generate final dynamic features.
  • 12. The method of claim 11, further including applying feature matching to the final dynamic features across pairs of frames from a single camera view for single camera person re-identification.
  • 13. The method of claim 11, further including applying feature matching to the final dynamic features across pairs of frames, each from a different camera view at a timestamp t, for multiple camera person re-identification.
  • 14. The method of claim 11, further including: capturing a plurality of players within an area of play; and deriving isolated bounding boxes for each player within the area of play, wherein the isolated bounding boxes are input to the convolutional neural network for person identification.
  • 15. The method of claim 11, further including deriving, via the series of residual blocks, features from bounding boxes input to the series of residual blocks, wherein features from deeper residual blocks are at a higher scale and a higher resolution when compared to features from shallower residual blocks.
  • 16. The method of claim 11, further including performing, via each refine block, a 1×1 convolution, a 3×3 convolution, and a second 1×1 convolution on input features and multiplying a result of this series of convolutions with a result of average pooling and another 1×1 convolution applied to the input features in a skip branch of the refine block.
  • 17. The method of claim 11, further including applying a number of refine blocks to features extracted from a residual block based on a scale and resolution of the features.
  • 18. The method of claim 11, further including obtaining a weight distribution for each feature according to a softmax function and merging features input to a channel-wise attention mechanism to obtain a 128-dimension vector of the dynamic features, where each dimension represents an importance of each channel.
  • 19. The method of claim 11, further including applying batch normalization to the final, dynamic features.
  • 20. The method of claim 11, further including training the convolutional neural network using a triplet loss for a metric task combined with a cross entropy loss for classification.
  • 21. At least one computer readable medium comprising instructions that, in response to being executed on a computing device, cause the computing device to: extract local features from bounding boxes input to a convolutional neural network trained for person identification, the convolutional neural network including a series of residual blocks that obtain input from a head network of the convolutional neural network; derive multi-level features at multiple scales from the extracted local features via the series of residual blocks, wherein one or more refine blocks take as input features from each residual block of the series of residual blocks and output a plurality of features in a same feature space; and merge the plurality of features to generate final dynamic features.
  • 22. The computer readable medium of claim 21, wherein the instructions, when executed, cause feature matching to be applied to the final dynamic features across pairs of frames from a single camera view for single camera person re-identification.
  • 23. The computer readable medium of claim 21, wherein the instructions, when executed, cause feature matching to be applied to the final dynamic features across pairs of frames, each from a different camera view at a timestamp t, for multiple camera person re-identification.
  • 24. The computer readable medium of claim 21, wherein the instructions, when executed, cause the computing device to: derive isolated bounding boxes for each player of a plurality of players captured within the area of play, wherein the isolated bounding boxes are input to the convolutional neural network for person identification.
  • 25. The computer readable medium of claim 21, wherein the instructions, when executed, cause the computing device to derive features from bounding boxes input to the series of residual blocks, wherein features from deeper residual blocks are at a higher scale and a higher resolution when compared to features from shallower residual blocks.
PCT Information
Filing Document Filing Date Country Kind
PCT/CN2019/126906 12/20/2019 WO