This patent document relates generally to the field of machine learning. More particularly, the present document relates to creating a two-dimensional (2-D) graphical symbol for representing the semantic meaning of a video clip.
Machine learning is an application of artificial intelligence. In machine learning, a computer or computing device is programmed to think like a human being so that it can be taught to learn on its own. The development of neural networks has been key to teaching computers to think and understand the world in the way human beings do.
Video stream data contain a series of still images, typically 30 frames of images per second. Generally, a still image is a snapshot of an action, while a video stream shows the action itself. For example, a snapshot of a person swimming in a pool shows only a person in a swimming pool, while the video shows that the person is performing freestyle swim strokes. Recognizing the action contained in a video stream requires a video classification technique. Therefore, there is a need to efficiently recognize the action contained in a video stream via machine learning.
This section is for the purpose of summarizing some aspects of the invention and to briefly introduce some preferred embodiments. Simplifications or omissions in this section as well as in the abstract and the title herein may be made to avoid obscuring the purpose of the section. Such simplifications or omissions are not intended to limit the scope of the invention.
Systems and methods of creating two-dimensional (2-D) graphical symbols for representing the semantic meaning of a video clip are described.
According to one aspect of the disclosure, a video clip having Q frames of 2-D images is extracted from a video stream received in a computing system. The video stream includes a number of frames, with each frame containing a 2-D image in time order. A vector of P feature encoding values is obtained for each frame by a set of image transformations of each frame along with performing computations of a specific succession of convolution and pooling layers of a first Cellular Neural Networks or Cellular Nonlinear Networks (CNN) based deep learning model followed by operations of a nested invariance pooling layer. As a result, the vector of P feature encoding values represents the image of each frame with desired invariance (e.g., rotations, translations and scaling). Each feature encoding value is then converted from a real number to a corresponding integer value within a range designated for color display intensity in accordance with a quantization scheme. A 2-D graphical symbol that contains N×N pixels is formed by placing respective color display intensities into the N×N pixels according to a data arrangement pattern for representing all frames of the video clip in the form of P×Q feature encoding values, such that the 2-D graphical symbol possesses a semantic meaning of the video clip and the semantic meaning can be recognized via another CNN based deep learning model with trained filter coefficients. N and Q are positive integers, and P is a multiple of 512.
According to another aspect, the Q frames are sequentially chosen from the video stream.
According to yet another aspect, the Q frames are arbitrarily chosen from the video stream and rearranged in time order.
According to yet another aspect, the quantization scheme is a non-linear quantization based on K-means clustering of each of the P feature encoding values obtained using a training dataset.
According to yet another aspect, the quantization scheme is a linear quantization based on boundaries determined by empirical observations of all of the feature encoding values obtained using a training dataset.
According to yet another aspect, the data arrangement pattern for representing all frames of the video clip comprises arranging all of the P feature encoding values of each frame in a square format such that there are Q square images contained in the 2-D graphical symbol.
According to yet another aspect, the data arrangement pattern for representing all frames of the video clip comprises arranging each of the P feature encoding values of all Q frames in a rectangular format such that there are P rectangular images contained in the 2-D graphical symbol.
Objects, features, and advantages of the invention will become apparent upon examining the following detailed description of an embodiment thereof, taken in conjunction with the attached drawings.
These and other features, aspects, and advantages of the invention will be better understood with regard to the following description, appended claims, and accompanying drawings as follows:
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will become obvious to those skilled in the art that the invention may be practiced without these specific details. The descriptions and representations herein are the common means used by those experienced or skilled in the art to most effectively convey the substance of their work to others skilled in the art. In other instances, well-known methods, procedures, and components have not been described in detail to avoid unnecessarily obscuring aspects of the invention.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. As used herein, the terms “vertical”, “horizontal”, “diagonal”, “left”, “right”, “top”, “bottom”, “column”, “row”, and “diagonally” are intended to provide relative positions for the purposes of description, and are not intended to designate an absolute frame of reference. Additionally, as used herein, the terms “character” and “script” are used interchangeably.
Embodiments of the invention are discussed herein with reference to the accompanying drawing figures.
Referring first to the flowchart in the accompanying drawings, an example process 100 of creating a 2-D graphical symbol for representing the semantic meaning of a video clip is described.
Process 100 starts at action 102 by receiving a video stream in a computer system capable of performing computations of Cellular Neural Networks or Cellular Nonlinear Networks (CNN) based deep learning models, for example, the computing system 800 described below.
Then, at action 104, a video clip is extracted from the received video stream. The video clip contains a predetermined number of frames (i.e., Q frames). Q is a positive integer. The selection of Q frames may be conducted in a number of manners. In one embodiment, Q frames are sequentially chosen from the video stream. A first example video clip 220, in which the Q frames are chosen sequentially, is shown in the accompanying drawings.
In another embodiment, Q frames are chosen according to a criterion, for example, every other frame, as shown in the accompanying drawings.
In yet another embodiment, Q frames are arbitrarily chosen from the video stream and rearranged in time order, as shown in the accompanying drawings.
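The three frame-selection strategies above can be summarized with a short sketch. This is an illustrative Python fragment, not part of the disclosure; the helper name select_frames is hypothetical, and a video is modeled simply as a list of frames.

```python
import random

def select_frames(video, Q, mode="sequential"):
    """Pick Q frames from a video (a list of frames) and return them in time order."""
    if mode == "sequential":        # first Q consecutive frames
        indices = list(range(Q))
    elif mode == "every_other":     # a fixed criterion, e.g. every other frame
        indices = list(range(0, 2 * Q, 2))
    elif mode == "arbitrary":       # arbitrary choice, then rearranged in time order
        indices = sorted(random.sample(range(len(video)), Q))
    else:
        raise ValueError(mode)
    return [video[i] for i in indices]
```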
At action 106, each frame is then converted to a resolution suitable as an input image to a CNN based deep learning model that contains a specific succession of convolution and pooling layers. For example, Visual Geometry Group's VGG-16 model (shown in the accompanying drawings) accepts input images of 224-pixel by 224-pixel resolution.
Next, at action 108, a vector of P feature encoding values is obtained for each frame by a set of image transformations of each frame along with performing computations of a specific succession of convolution and pooling layers of a Cellular Neural Networks or Cellular Nonlinear Networks (CNN) based deep learning model (e.g., VGG-16, ResNet, MobileNet, etc.) followed by operations of a nested invariance pooling layer. P is a positive integer and a multiple of 512. In one embodiment, the feature encoding values are referred to as Compact Descriptors for Video Analysis (CDVA) in the MPEG-7 standard. MPEG stands for Moving Picture Experts Group, an international group that develops standards for encoding and compressing video images.
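A rough sketch of this encoding step is given below, assuming a generic convolutional backbone. The function backbone_features is a hypothetical placeholder for the convolution and pooling layers of a model such as VGG-16, and the transform set (identity, flip, downscale, crop) only stands in for the rotation, translation and scaling invariance mentioned in the summary above; the actual nested invariance pooling used by CDVA is more elaborate.

```python
import numpy as np

def backbone_features(image):
    """Hypothetical stand-in for a CNN backbone: returns a (H', W', P) feature map.
    A real implementation would run the frame through e.g. VGG-16 convolution layers."""
    P = 512
    return np.random.rand(7, 7, P)  # placeholder feature map

def encode_frame(frame):
    """Approximate nested invariance pooling: pool features over transformed copies."""
    transforms = [
        lambda im: im,                                        # identity
        lambda im: im[:, ::-1],                               # horizontal flip
        lambda im: im[::2, ::2],                              # coarse 2x downscale
        lambda im: im[:im.shape[0] // 2, :im.shape[1] // 2],  # crop (scale proxy)
    ]
    pooled = []
    for t in transforms:
        fmap = backbone_features(t(frame))     # (H', W', P)
        pooled.append(fmap.mean(axis=(0, 1)))  # spatial average pool -> (P,)
    return np.max(np.stack(pooled), axis=0)    # pool across transforms -> P values

frame = np.random.rand(224, 224, 3)
vector = encode_frame(frame)                   # vector of P = 512 feature encoding values
```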
A schematic diagram of this feature encoding operation is shown in the accompanying drawings.
Each feature encoding value is a real number and can be either positive or negative, for example, 0.26, −0.01, 0.12, etc.
At action 110, each feature encoding value is then converted from the real number to a corresponding integer value within a range designated for color display intensity in accordance with a quantization scheme. In one embodiment, the range designated for color display intensity is between 0 and 255 for grayscale display.
In one embodiment, the quantization scheme is based on K-means clustering of each of the P feature encoding values obtained using a training dataset.
In grayscale display, applying K-means clustering to each of the P feature encoding values would create 256 clustering centers, an example of which is shown in the accompanying drawings.
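As an illustration of this non-linear scheme, the sketch below clusters the training values of one feature dimension into 256 centers and then maps a new value to the index of its nearest center. It uses scikit-learn's KMeans purely as a convenient stand-in; the disclosure does not prescribe a particular clustering implementation, and the training statistics below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_feature_quantizer(training_values, levels=256):
    """Cluster one feature dimension's training values into `levels` centers (grayscale 0-255)."""
    km = KMeans(n_clusters=levels, n_init=10, random_state=0)
    km.fit(np.asarray(training_values, dtype=float).reshape(-1, 1))
    # Sort centers so that larger feature values map to brighter intensities.
    return np.sort(km.cluster_centers_.ravel())

def quantize_value(value, centers):
    """Map a real-valued feature encoding to the index (0..255) of its nearest center."""
    return int(np.argmin(np.abs(centers - value)))

training_values = np.random.randn(10000) * 0.1   # placeholder training statistics
centers = fit_feature_quantizer(training_values)
print(quantize_value(0.26, centers))             # integer in [0, 255]
```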
In another embodiment, the quantization scheme is a linear quantization scheme based on boundaries (i.e., min and max) determined by empirical observations of all of the feature encoding values obtained using a training dataset. A conversion formula is then used for converting each real number to a corresponding integer. An example of linear quantization for grayscale display uses the following conversion:
i(n, m)=(v(n, m)/(max−min))×256+128, if v(n, m) is within the range of [min, max]
i(n, m)=255, if v(n, m)>max
i(n, m)=0, if v(n, m)<min
where v(n, m) is the m-th feature encoding value of the n-th frame and i(n, m) is the corresponding color display intensity.
For example, with min=−0.3 and max=0.3, the first two feature encoding values of the first frame, 0.26 and −0.14, are converted as follows:
i(1, 1)=(0.26/(0.3−(−0.3)))×256+128=239
i(1, 2)=(−0.14/(0.3−(−0.3)))×256+128=68
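A direct transcription of this linear scheme into Python is shown below. Rounding to the nearest integer is assumed, which reproduces the 239 and 68 of the worked example, and the final clipping to [0, 255] is an added safeguard (not stated explicitly in the text) for values that land exactly on the max boundary.

```python
def linear_quantize(v, vmin=-0.3, vmax=0.3):
    """Convert a real-valued feature encoding v to a grayscale intensity in [0, 255]."""
    if v > vmax:
        return 255
    if v < vmin:
        return 0
    i = round(v / (vmax - vmin) * 256 + 128)
    return max(0, min(255, i))  # guard the boundary case v == vmax, which would give 256

assert linear_quantize(0.26) == 239   # matches the worked example above
assert linear_quantize(-0.14) == 68
```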
Referring back to process 100, at action 112, a 2-D graphical symbol that contains N×N pixels is formed by placing respective color display intensities into the corresponding pixels according to a data arrangement pattern that represents the P feature encoding values of all Q frames, such that the 2-D graphical symbol possesses the semantic meaning of the video clip. Each feature encoding value occupies at least one pixel. The resulting 2-D graphical symbol can be recognized via an image classification task using another trained CNN based deep learning model. In order to accomplish such an image classification task, labeled 2-D graphical symbols (e.g., symbols with the data arrangement patterns described herein) are used for training that CNN based deep learning model.
The data structure of an example 2-D graphical symbol 500 is shown in the accompanying drawings.
Due to the size of a 2-D graphical symbol, only a limited number of frames can be used in a video clip. For example, when N is 224 and P is 512, the maximum number of frames, Q, is 78 when a gap of at least one pixel is kept between rectangular images. Each of the 512 rectangular images then contains 6×13=78 usable pixels, one pixel per frame.
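The following sketch builds such a symbol from a P×Q matrix of intensities under one plausible tiling consistent with the numbers above (a 32×16 grid of 7-pixel by 14-pixel cells, each keeping a one-pixel gap and therefore 6×13 usable pixels). The disclosure does not fix this exact cell geometry, so treat the layout constants as assumptions.

```python
import numpy as np

def build_symbol(intensities, N=224, cols=32, rows=16, cell_w=7, cell_h=14, gap=1):
    """Place a (P, Q) matrix of 0-255 intensities into an N x N grayscale symbol.
    Feature p occupies one rectangular cell; frame q occupies one pixel inside that cell."""
    P, Q = intensities.shape                          # e.g. 512 features, up to 78 frames
    usable_w, usable_h = cell_w - gap, cell_h - gap   # 6 x 13 usable pixels per cell
    assert P <= cols * rows and Q <= usable_w * usable_h
    symbol = np.zeros((N, N), dtype=np.uint8)
    for p in range(P):
        top, left = (p // cols) * cell_h, (p % cols) * cell_w
        for q in range(Q):
            symbol[top + q // usable_w, left + q % usable_w] = intensities[p, q]
    return symbol

symbol = build_symbol(np.random.randint(0, 256, size=(512, 78)))   # 224 x 224 image
```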
To overcome such a limitation, two or more video clips may be extracted from a video stream 1610, as shown in the accompanying drawings.
Each video clip is then transformed into a 2-D graphical symbol by applying the 2-D graphical symbol creation method of process 100 described above.
Furthermore, the world we live in contains four dimensions: three spatial dimensions plus one temporal dimension. At any instant in time, a three-dimensional (3-D) object can be represented by a number of 2-D still images. For example, images of a 3-D object can be scanned via various technologies, for example, magnetic resonance imaging (MRI), computed tomography (CT), Light Detection and Ranging (LiDAR) and the like. Scanned 3-D object results are then represented by a number of 2-D image frames.
Referring now to the accompanying drawings, an example CNN based computing system 800 is described.
The CNN based computing system 800 may be implemented on integrated circuits as a digital semi-conductor chip (e.g., a silicon substrate in a single semi-conductor wafer) and contains a controller 810, and a plurality of CNN processing units 802a-802b operatively coupled to at least one input/output (I/O) data bus 820. Controller 810 is configured to control various operations of the CNN processing units 802a-802b, which are connected in a loop with a clock-skew circuit (e.g., clock-skew circuit 1540 described below).
In one embodiment, each of the CNN processing units 802a-802b is configured for processing imagery data, for example, 2-D graphical symbol 520 shown in the accompanying drawings.
In another embodiment, the CNN based computing system is a digital integrated circuit that is extendable and scalable. For example, multiple copies of the digital integrated circuit may be implemented on a single semi-conductor chip, as shown in the accompanying drawings.
All of the CNN processing engines are identical. For illustration simplicity, only a few (i.e., CNN processing engines 822a-822h, 832a-832h) are shown in the accompanying drawings.
Each CNN processing engine 822a-822h, 832a-832h contains a CNN processing block 824, a first set of memory buffers 826 and a second set of memory buffers 828. The first set of memory buffers 826 is configured for receiving imagery data and for supplying the already received imagery data to the CNN processing block 824. The second set of memory buffers 828 is configured for storing filter coefficients and for supplying the already received filter coefficients to the CNN processing block 824. In general, the number of CNN processing engines on a chip is 2^n, where n is an integer (i.e., 0, 1, 2, 3, . . . ).
The first and the second I/O data bus 830a-830b are shown here to connect the CNN processing engines 822a-822h, 832a-832h in a sequential scheme. In another embodiment, the at least one I/O data bus may have a different connection scheme to the CNN processing engines to accomplish the same purpose of parallel data input and output for improving performance.
More details of a CNN processing engine 842 in a CNN based integrated circuit are shown in the accompanying drawings. Each CNN processing engine 842 contains a CNN processing block 844, a first set of memory buffers 846 for imagery data, and a second set of memory buffers 848 for filter coefficients.
In order to achieve faster computations, a few computational performance improvement techniques have been used and implemented in the CNN processing block 844. In one embodiment, representation of imagery data uses as few bits as practical (e.g., 5-bit representation). In another embodiment, each filter coefficient is represented as an integer with a radix point. Similarly, the integer representing the filter coefficient uses as few bits as practical (e.g., 12-bit representation). As a result, 3×3 convolutions can then be performed using fixed-point arithmetic for faster computations.
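As an illustration of the fixed-point idea, the sketch below quantizes a real value to an integer with an implied radix point. The specific split of the 12 bits into a sign bit and 9 fractional bits is an assumption made here for illustration only; the text specifies just the overall bit widths.

```python
def to_fixed_point(x, total_bits=12, frac_bits=9):
    """Represent real value x as a signed integer with an implied radix point
    (frac_bits fractional bits), saturating at the representable range."""
    scaled = int(round(x * (1 << frac_bits)))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))

def from_fixed_point(q, frac_bits=9):
    return q / (1 << frac_bits)

q = to_fixed_point(0.26)            # coefficient 0.26 -> integer 133
print(q, from_fixed_point(q))       # 133, approximately 0.2598
```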
Each 3×3 convolution produces one convolution operations result, Out(m, n), based on the following formula:
Out(m, n)=Σ(1≤i, j≤3) In(m, n, i, j)×C(i, j)+b    Formula (1)
where:
In(m, n, i, j) is a 3-pixel by 3-pixel area of imagery data centered at pixel location (m, n);
C(i, j) represents one of the nine filter coefficients C(3×3); and
b is an offset or bias coefficient.
Each CNN processing block 844 produces Z×Z convolution operations results simultaneously, and all CNN processing engines perform simultaneous operations. In one embodiment, the 3×3 weight or filter coefficients are each 12-bit while the offset or bias coefficient is 16-bit or 18-bit.
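A plain floating-point rendering of Formula (1) is given below for reference. Zero padding is used here at the block border purely as a simplification, whereas the circuit obtains the needed border pixels from neighboring blocks as described further below, and the circuit computes all Z×Z results in parallel using the fixed-point representations above rather than a Python loop.

```python
import numpy as np

def conv3x3(imagery, C, b):
    """Compute Out(m, n) = sum over the 3x3 neighborhood of In(m, n, i, j) * C(i, j) + b
    for every pixel location of a Z x Z block of imagery data."""
    Z = imagery.shape[0]
    out = np.zeros((Z, Z))
    padded = np.pad(imagery, 1)                 # zero padding so the output stays Z x Z
    for m in range(Z):
        for n in range(Z):
            out[m, n] = np.sum(padded[m:m + 3, n:n + 3] * C) + b
    return out

out = conv3x3(np.random.rand(14, 14), C=np.random.rand(3, 3), b=0.1)  # Z = 14
```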
To perform 3×3 convolutions at each sampling location, an example data arrangement is shown in the accompanying drawings.
Imagery data are stored in a first set of memory buffers 846, while filter coefficients are stored in a second set of memory buffers 848. Both imagery data and filter coefficients are fed to the CNN processing block 844 at each clock of the digital integrated circuit. Filter coefficients (i.e., C(3×3) and b) are fed into the CNN processing block 844 directly from the second set of memory buffers 848. However, imagery data are fed into the CNN processing block 844 via a multiplexer MUX 845 from the first set of memory buffers 846. Multiplexer 845 selects imagery data from the first set of memory buffers based on a clock signal (e.g., pulse 852).
Otherwise, multiplexer MUX 845 selects imagery data from a first neighbor CNN processing engine (from the left side in the accompanying drawings).
At the same time, a copy of the imagery data fed into the CNN processing block 844 is sent to a second neighbor CNN processing engine (to the right side in the accompanying drawings).
After 3×3 convolutions for each group of imagery data are performed for a predefined number of filter coefficients, convolution operations results Out(m, n) are sent to the first set of memory buffers via another multiplexer MUX 847 based on another clock signal (e.g., pulse 851). An example clock cycle 850 is drawn for demonstrating the time relationship between pulse 851 and pulse 852. As shown, pulse 851 is one clock cycle ahead of pulse 852; as a result, the 3×3 convolution operations results are stored into the first set of memory buffers after a particular block of imagery data has been processed by all CNN processing engines through the clock-skew circuit 860.
After the convolution operations result Out(m, n) is obtained from Formula (1), an activation procedure may be performed. Any convolution operations result Out(m, n) that is less than zero (i.e., a negative value) is set to zero. In other words, only positive values of the output results are kept. For example, a positive output value of 10.5 is retained as 10.5, while −2.3 becomes 0. Activation introduces non-linearity into the CNN based integrated circuits.
If a 2×2 pooling operation is required, the Z×Z output results are reduced to (Z/2)×(Z/2). In order to store the (Z/2)×(Z/2) output results in corresponding locations in the first set of memory buffers, additional bookkeeping techniques are required to track proper memory addresses such that four (Z/2)×(Z/2) output results can be processed in one CNN processing engine.
An example 2×2 pooling operation is demonstrated in the accompanying drawings.
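The sketch below shows the activation and 2×2 pooling steps together on a Z×Z result block. Max pooling is assumed here for illustration; the text above specifies only that the Z×Z results are reduced to (Z/2)×(Z/2).

```python
import numpy as np

def activate(out):
    """Rectification: any negative convolution result is set to zero."""
    return np.maximum(out, 0.0)

def pool2x2(out):
    """Reduce a Z x Z block to (Z/2) x (Z/2) by taking the maximum of each 2x2 window."""
    Z = out.shape[0]
    return out.reshape(Z // 2, 2, Z // 2, 2).max(axis=(1, 3))

block = np.random.randn(14, 14)       # Z = 14
reduced = pool2x2(activate(block))    # 7 x 7 result, non-negative values only
```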
An input image generally contains a large amount of imagery data. In order to perform image processing operations, an example input image 1400 (e.g., 2-D graphical symbol 520 described above) is partitioned into Z-pixel by Z-pixel blocks, as shown in the accompanying drawings.
Although the invention does not require a specific characteristic dimension of an input image, the input image may need to be resized to fit into a predefined characteristic dimension for certain image processing procedures. In an embodiment, a square shape with (2^L×Z)-pixel by (2^L×Z)-pixel is required. L is a positive integer (e.g., 1, 2, 3, 4, etc.). When Z equals 14 and L equals 4, the characteristic dimension is 224. In another embodiment, the input image is a rectangular shape with dimensions of (2^I×Z)-pixel and (2^J×Z)-pixel, where I and J are positive integers.
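A small sketch of this sizing rule follows, checking that a (2^L×Z)-pixel square image splits evenly into Z-pixel by Z-pixel blocks; the choice of L=4 and Z=14 reproduces the 224-pixel characteristic dimension mentioned above.

```python
def characteristic_dimension(Z=14, L=4):
    """Square input side length (2^L x Z); 2^4 x 14 = 224."""
    return (2 ** L) * Z

def partition_into_blocks(side, Z=14):
    """Return the (row, col) origins of the Z x Z blocks tiling a side x side image."""
    assert side % Z == 0
    return [(r, c) for r in range(0, side, Z) for c in range(0, side, Z)]

side = characteristic_dimension()       # 224
blocks = partition_into_blocks(side)    # 16 x 16 = 256 blocks of 14 x 14 pixels
```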
In order to properly perform 3×3 convolutions at pixel locations around the border of a Z-pixel by Z-pixel block, additional imagery data from neighboring blocks are required.
When more than one CNN processing engine is configured on the integrated circuit, each CNN processing engine is connected to its first and second neighbor CNN processing engines via a clock-skew circuit. For illustration simplicity, only the CNN processing block and the memory buffers for imagery data are shown. An example clock-skew circuit 1540 for a group of example CNN processing engines is shown in the accompanying drawings.
The CNN processing engines are connected via the example clock-skew circuit 1540 to form a loop. In other words, each CNN processing engine sends its own imagery data to a first neighbor and, at the same time, receives a second neighbor's imagery data. Clock-skew circuit 1540 can be implemented in well-known manners. For example, each CNN processing engine is connected with a D flip-flop 1542.
Although the invention has been described with reference to specific embodiments thereof, these embodiments are merely illustrative of, and not restrictive of, the invention. Various modifications or changes to the specifically disclosed example embodiments will be suggested to persons skilled in the art. For example, whereas the number of feature encoding values has been shown and described as 512, other multiples of 512 may be used for achieving the same; for example, MobileNet contains 1024 feature encoding values. Furthermore, whereas data arrangement patterns containing square images have been shown in various examples, data arrangement patterns containing rectangular images may be used instead for accomplishing the same. In summary, the scope of the invention should not be restricted to the specific example embodiments disclosed herein, and all modifications that are readily suggested to those of ordinary skill in the art should be included within the spirit and purview of this application and scope of the appended claims.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/839,633, entitled “2-D Symbol For Graphically Representing Feature Encoding Values Of A Video Clip” and filed Apr. 27, 2019, the contents of which are hereby incorporated by reference in their entirety for all purposes.