This invention relates to artificial intelligence (AI).
There are different methods of pattern recognition, for example, based on artificial neural networks (ANN).
Many important problems, including handwritten digit recognition, the classical problem of pattern recognition, can be solved by ANN with good accuracy. According to Michael Nielsen's book "Neural Networks and Deep Learning" (Determination Press, 2015): "At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly". Michael Nielsen refers to the MNIST database (see https://pjreddie.com/projects/mnist-in-csv/) that includes a training set of 60,000 images of digits and a testing set of 10,000 images of digits written by 250 different people. That record of correctly classifying 99.79% of the 10,000 images still holds (as updated on Oct. 2, 2018; see http://neuralnetworksanddeeplearning.com/chap1.html).
The drawback of using ANN is the necessity of optimization procedures on training data sets, which are slow and do not guarantee the best result. For recognition of complex dynamic patterns, such as recognition of a person from LIDAR videos (see, for example, https://vimeo.com/219562254), both performance and accuracy of recognition are extremely important.
It is therefore the objective of the present invention to provide a device and a method for recognition/classification of static images (e.g., digits) and dynamic images (e.g., LIDAR videos) with high performance and a high level of accuracy (we claim 9,989 correct classifications out of 10,000 MNIST test images).
The proposed device comprises an optical device for recording static or dynamic images, e.g., a photo camera or a LIDAR system.
The proposed method is based on comparison of an image/video that is to be classified with AI maps (we introduce this notion below) calculated from a training set of images/videos.
Each training image is an image (28 pixels by 28 pixels) of a scanned handwritten digit, and each pixel of the image is a shade of gray. In the current patent application, all gray pixels are considered black.
If P is a white pixel in an image and N is the black pixel nearest to P, then the square of the distance between P and N is written into the cell of a 2D array (the AI map) that corresponds to P; for a black pixel, the value 0 is written.
Suppose that the difference between the x-coordinates of pixels P and N equals 41 and the difference between the y-coordinates equals 29 (see the accompanying figure). Then the number written into the cell of P is 41² + 29² = 1681 + 841 = 2522.
Now let us figure out the number that has to be written into the 2D array in the cell of pixel Q located one pixel to the right of P. In this case, pixel Q is white, because if Q were black, then Q (not N) would be the nearest black pixel to P. The number for Q (the squared distance from Q to its nearest black pixel) cannot be greater than 2441 = (41−1)² + 29², because N itself lies at exactly that squared distance from Q; if the number were greater, then N would be the nearest black pixel to Q and 2441 would be written into the cell of Q. At the same time, because N is the black pixel closest to P, there are no black pixels inside the circle of radius r = √2522 centered at P; since Q is one pixel away from P, the number for Q cannot be less than (r − 1)² = (√2522 − 1)² ≈ 49.22² ≈ 2422.6.
The integers in this range (from 2423 to 2441) that can be represented as a sum of two squares are shown in the accompanying figure; since the squared distance between two pixels is always a sum of two squares of integers, only these values are possible candidates for the cell of Q.
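As an illustration only (this Python snippet and its names are ours, not part of the application), the candidate values in the range can be enumerated directly:

    import math

    def is_sum_of_two_squares(n):
        # True if n = a*a + b*b for some non-negative integers a and b.
        for a in range(math.isqrt(n) + 1):
            b2 = n - a * a
            b = math.isqrt(b2)
            if b * b == b2:
                return True
        return False

    print([n for n in range(2423, 2442) if is_sum_of_two_squares(n)])
    # prints [2425, 2426, 2434, 2437, 2440, 2441]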
Starting from the very top left pixel of the image (as pixel P), then moving from pixel to pixel with a step of 1 to the right until the edge of the image, then down, then to the left, and so on, and performing on each move calculations similar to those described above, we fill the entire 2D array.
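For reference, the following brute-force sketch (our illustrative code, with names of our choosing) computes the same map directly; it does not reproduce the incremental scan described above, which narrows the search for each next pixel to a short range of candidate values:

    def build_ai_map(image):
        # image: 2D list with 1 for a black pixel and 0 for a white pixel;
        # the image is assumed to contain at least one black pixel.
        h, w = len(image), len(image[0])
        black = [(x, y) for y in range(h) for x in range(w) if image[y][x]]
        # Each cell receives the squared distance to the nearest black pixel
        # (0 for a black pixel itself).
        return [[min((x - bx) ** 2 + (y - by) ** 2 for bx, by in black)
                 for x in range(w)]
                for y in range(h)]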
To classify 10,000 MNIST test images, we compare each of them against each of the 60,000 training AI maps and calculate 10,000 × 60,000 = 600,000,000 distances.
To calculate the distance between a testing image and a training image, we overlay the testing image on the AI map of the training image. Each black pixel of the testing image falls into a cell of the training map. The number in this cell is the squared distance from that pixel of the testing image to the nearest black pixel of the training image (a constant-complexity lookup). The average of these numbers over all black pixels of the testing image is our definition of the distance between the testing image and the training image (linear complexity).
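A minimal sketch of this distance (the function and variable names are ours):

    def image_to_map_distance(test_image, ai_map):
        # Average, over the black pixels of the test image, of the value
        # stored in the training map: the squared distance from that pixel
        # to the nearest black pixel of the training image.
        values = [ai_map[y][x]
                  for y, row in enumerate(test_image)
                  for x, pixel in enumerate(row) if pixel]
        return sum(values) / len(values)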
For each testing image we calculate the 60,000 distances to the training images. One of these distances, say the distance to an image of digit i, is minimal. Then we classify the testing image as digit i. If the testing image indeed represents digit i, our classification is correct; if not, it is wrong. Testing all 10,000 testing images shows that we make only 11 errors (as opposed to the best known ANN result of 21 errors out of 10,000).
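The classification step then reads (a sketch reusing image_to_map_distance from above):

    def classify(test_image, training_maps, training_labels):
        # Nearest-neighbor rule: return the label of the training image
        # whose AI map gives the smallest distance to the test image.
        best = min(range(len(training_maps)),
                   key=lambda k: image_to_map_distance(test_image,
                                                       training_maps[k]))
        return training_labels[best]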
A further subject of the current invention is synchronization of the images. We compare not one image against another image (as in the case of digit classification above), but a sequence of N images (N frames of a templet video, see https://vimeo.com/219562254) against another sequence of N images (N frames of a surveillance video). We consider all N frames as a single image comprising N parts. It is important to synchronize the videos so that a part of an image with legs apart is not compared against a part of an image with legs together (see the accompanying figure).
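Under our reading of this passage (treating the composite of N frames exactly as a single image; the code and names are ours), the video distance can be sketched as:

    def video_distance(test_frames, templet_maps):
        # Pool the training-map values over the black pixels of all N
        # synchronized test frames, as for one composite image.
        total, count = 0, 0
        for frame, ai_map in zip(test_frames, templet_maps):
            for y, row in enumerate(frame):
                for x, pixel in enumerate(row):
                    if pixel:
                        total += ai_map[y][x]
                        count += 1
        return total / count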
We assume that each person has several different styles of movement (walking, fast walking, sport walking, running, etc.) and that within one style the person can move faster or slower but repeats the same motion pattern of that style. In other words, if you have two videos, one where the person walks normally and a second where the same person walks the same distance but 10% faster, and the second video was recorded with a 10% higher rate of frames per second, then these videos have the same number of frames and the same number of frames per step (35 in the example above). Moreover, if the videos start from the same position (e.g., the position with the maximum distance between the feet), it is absolutely the same video.
In our method, all templet videos for different persons are recorded so that F1/v1 = const1, where F1 is the rate of the templet video (frames per second) and v1 is the speed of the person. The speed of the person is measured with the same LIDAR, and the recording rate and the start of recording are adjusted automatically. All surveillance videos are also recorded so that F2/v2 = const1, where F2 is the rate of the surveillance video and v2 is the speed of the person under surveillance. As a result, F2/v2 = const1 = F1/v1, hence F2 = F1 · v2/v1. It means that the templet and the surveillance videos of the same person moving in the same style should coincide regardless of the person's speed (because the videos start from the same position).
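A worked numeric example (the frame rate and the speeds below are ours, chosen for illustration only):

    def surveillance_frame_rate(f1, v1, v2):
        # F2 = F1 * v2 / v1 keeps F2 / v2 equal to the templet
        # constant F1 / v1.
        return f1 * v2 / v1

    # A templet recorded at 30 frames per second with the person moving
    # at 1.0 m/s requires a surveillance rate of 30 * 1.1 / 1.0 = 33
    # frames per second for a person moving at 1.1 m/s.
    print(surveillance_frame_rate(30.0, 1.0, 1.1))  # 33.0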
The advantages of the proposed method are as follows: high accuracy (we have set a world record: 9,989 correct recognitions out of 10,000 MNIST test images, against the previous record of 9,979 out of 10,000); high performance (all training and testing take minutes on a regular CPU); and simplicity (any person with basic programming skills can verify our results and implement the method).