This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2013-172495, filed on Aug. 22, 2013, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to, for example, an image processing device used to detect the hand and fingers of a user, an image processing method, and an image processing program.
A method in which a document image is projected using a projector has long been used. In recent years, technologies have been developed that assist user operations by allowing interactive manipulation to be performed on the projected image through gestures, such as hand and finger movements. For example, an augmented reality (AR) technology has been developed in which, when an arbitrary word included in a projection image is indicated by the hand and fingers, an annotation or the like that is associated with the word is presented.
In the above-described interface, the position of the hand and fingers of the user has to be accurately identified by use of a camera that is fixed to an arbitrary location or a camera that is capable of moving freely. As a method for identifying the position of the hand and fingers, for example, C. Prema et al., "Survey on Skin Tone Detection using Color Spaces", International Journal of Applied Information Systems, 2(2):18-26, May 2012, published by Foundation of Computer Science, New York, USA, discloses a technology in which a skin-tone color component (color feature quantity) is extracted from a captured image to obtain a hand-area contour, and the position of the hand and fingers is identified from the hand-area contour.
In accordance with an aspect of the embodiments, an image processing device includes, a processor; and a memory which stores a plurality of instructions, which when executed by the processor, cause the processor to execute, acquiring an image including a first region of a user; extracting a color feature quantity or an intensity gradient feature quantity from the image; detecting the first region based on the color feature quantity or the intensity gradient feature quantity; and selecting whether the detecting is detecting the first region using either the color feature quantity or the intensity gradient feature quantity, based on first information related to the speed of movement of the first region calculated from a comparison of the first regions in a plurality of images acquired at different times.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
These and/or other aspects and advantages will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
First, the circumstances of an issue in the conventional technology will be described. This issue has been newly discovered by the present inventors as a result of close examination of the conventional technology and has not been known in the past. It has been found that erroneous detection occurs when the background of the wall surface or paper surface onto which a projection image is projected is of a skin-tone color. A reason for this is that the skin-tone area of the background is erroneously detected as the hand and fingers, making accurate identification of the position of the hand and fingers difficult. The issue therefore does not occur if the position of the hand and fingers of the user is able to be identified without depending on the background color.

In image processing in which the position of the hand and fingers of the user is detected, the following has been newly verified through close examination by the present inventors. For example, when an intensity gradient feature quantity, such as a histogram of oriented gradients (HOG) feature quantity or a local binary pattern (LBP) feature quantity, is used, the skin-tone area of the background and the hand and fingers may be accurately differentiated owing to the characteristics of the intensity gradient feature quantity. However, compared to the color feature quantity, the intensity gradient feature quantity involves a higher calculation load. A delay therefore occurs in the interactive manipulation performed on a projection image, for which prompt responsiveness is desired, and the operability of the image processing device decreases. In other words, although the intensity gradient feature quantity has high robustness, its calculation load is high, and in practical terms it is difficult to detect the position of the hand and fingers of the user using only the intensity gradient feature quantity. The color feature quantity, on the other hand, does not have high robustness, but its calculation load is low.
Focusing on the low calculation load of the color feature quantity and the high robustness of the intensity gradient feature quantity, the present inventors have newly found that, through dynamic selection of the color feature quantity and the intensity gradient feature quantity depending on various circumstances, the position of the hand and fingers of the user is able to be detected with high robustness and low calculation load without depending on the background color.
Taking into consideration the above-described technical features newly found through close verification by the present inventors, examples of an image processing device, an image processing method, and an image processing program according to an embodiment will be described in detail with reference to the drawings. The examples do not limit the disclosed technology.
The acquiring unit 2 is, for example, a hardware circuit based on wired logic. In addition, the acquiring unit 2 may be a functional module actualized by a computer program executed by the image processing device 1. The acquiring unit 2 acquires an image that has been captured by an external device. The resolution and the acquisition frequency of the images received by the acquiring unit 2 may be set to arbitrary values depending on the processing speed, processing accuracy, and the like required of the image processing device 1. For example, the acquiring unit 2 may acquire images having a resolution of VGA (640×480) at an acquisition frequency of 30 FPS (30 frames per second). The external device that captures the images is, for example, an image sensor. The image sensor is an imaging device, such as a charge coupled device (CCD) camera or a complementary metal oxide semiconductor (CMOS) camera. The image sensor captures, for example, an image including the hand and fingers of a user as a first region of the user. The image sensor may be included in the image processing device 1 as needed. The acquiring unit 2 outputs the acquired image to the extracting unit 3.
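As an illustrative sketch only, image acquisition at the resolution and frequency mentioned above may be written in Python as follows; the use of OpenCV's VideoCapture, the camera index 0, and the property settings are assumptions rather than part of the embodiment.

```python
# Minimal acquisition sketch (assumptions: OpenCV as the capture library, camera index 0).
import cv2

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)   # VGA resolution, as in the example above
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
cap.set(cv2.CAP_PROP_FPS, 30)            # acquisition frequency of 30 FPS

ret, frame = cap.read()                  # one acquired image, to be passed to the extracting unit
if ret:
    print(frame.shape)                   # (480, 640, 3) for a BGR image
cap.release()
```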
The extracting unit 3 is, for example, a hardware circuit based on wired logic. In addition, the extracting unit 3 may be a functional module actualized by a computer program executed by the image processing device 1. The extracting unit 3 receives an image from the acquiring unit 2 and extracts the color feature quantity or the intensity gradient feature quantity of the image. The extracting unit 3 may extract, for example, a pixel value in RGB color space as the color feature quantity. In addition, the extracting unit 3 may extract, for example, the HOG feature quantity or the LBP feature quantity as the intensity gradient feature quantity. The intensity gradient feature quantity may be, for example, a feature quantity that is capable of being calculated within a fixed rectangular area. In example 1, for convenience of explanation, the HOG feature quantity will mainly be described as the intensity gradient feature quantity. In addition, for example, the extracting unit 3 may extract the HOG feature quantity, serving as an example of the intensity gradient feature quantity, using a method disclosed in N. Dalal et al., "Histograms of Oriented Gradients for Human Detection", 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005. The extracting unit 3 outputs the extracted color feature quantity or intensity gradient feature quantity to the detecting unit 5. When the selecting unit 6 instructs the extraction of only one of the color feature quantity and the intensity gradient feature quantity, as described hereafter, the extracting unit 3 may extract only that feature quantity.
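A minimal Python sketch of the two kinds of extraction is given below; the use of OpenCV, the BGR input format, and the HOG window and cell sizes are illustrative assumptions only.

```python
# Feature extraction sketch (assumptions: BGR input frames, OpenCV HOGDescriptor,
# illustrative window/cell sizes).
import cv2

# HOG descriptor computed within a fixed rectangular area (window) of 64x128 pixels.
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

def extract_color_feature(frame_bgr):
    # The color feature quantity is simply the per-pixel value in RGB color space.
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

def extract_hog_feature(frame_bgr, rect):
    # The intensity gradient feature quantity is calculated within the given rectangular area.
    x, y, w, h = rect
    patch = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    patch = cv2.resize(patch, (64, 128))  # match the descriptor window size (width, height)
    return hog.compute(patch).ravel()
```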
The storage unit 4 is, for example, a semiconductor memory element, such as a flash memory, or a storage device, such as a hard disk drive (HDD) or an optical disc. The storage unit 4 is not limited to the types of storage devices described above, and may be a random access memory (RAM) or a read-only memory (ROM). The storage unit 4 does not have to be included in the image processing device 1. For example, various pieces of relevant data may be stored in a cache, memory, or the like (not illustrated) of each functional unit included in the image processing device 1. In addition, the storage unit 4 may be provided in an external device other than the image processing device 1 and accessed via a communication line, using a communication unit (not illustrated) provided in the image processing device 1.
In the storage unit 4, for example, a first feature quantity model (may also be referred to as a classifier) in which the feature quantity of the first region has been extracted in advance is stored in advance by preliminary learning. In addition, in the storage unit 4, various pieces of data acquired or held by each function of the image processing device 1 may be stored as needed. The first feature quantity model may be generated based on the above-described HOG feature quantity or LBP feature quantity. In example 1, the first feature quantity model is described as being generated based on the HOG feature quantity. Preliminary learning is performed using, for example, images (positive images) in which the target object (the hand and fingers serving as an example of the first region) is captured and images (negative images) in which the target object is not captured. Various publicly known classifier learning methods may be used, such as AdaBoost or support vector machine (SVM). For example, the classifier learning method using SVM that is disclosed in the above-mentioned N. Dalal et al., "Histograms of Oriented Gradients for Human Detection", 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005 may be used. The intensity gradient feature quantity is a feature quantity that is able to be calculated within a fixed rectangular area, as described above. Therefore, in each positive image, a rectangular area may be prescribed such that the first region (such as the hand and fingers of the user) is disposed with left-right symmetry, and the intensity gradient feature quantity may be calculated within the prescribed rectangular area. In addition, the fingertip position within the rectangular area may also be registered. Furthermore, in the preliminary learning of the classifier, an average value of the fingertip positions over all positive rectangular areas may be calculated as appropriate.
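The preliminary learning described above may be sketched as follows; the patch lists, the fingertip coordinate list, and the use of scikit-learn's LinearSVC in place of a generic SVM are assumptions for illustration only.

```python
# Preliminary-learning sketch (assumptions: positive_patches / negative_patches are lists of
# grayscale patches cropped to the prescribed rectangular areas, fingertip_positions holds the
# registered fingertip coordinates of the positive patches, and LinearSVC stands in for the SVM).
import numpy as np
import cv2
from sklearn.svm import LinearSVC

hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

def hog_of(patch_gray):
    return hog.compute(cv2.resize(patch_gray, (64, 128))).ravel()

def learn_first_feature_model(positive_patches, negative_patches, fingertip_positions):
    X = [hog_of(p) for p in positive_patches] + [hog_of(n) for n in negative_patches]
    y = [1] * len(positive_patches) + [0] * len(negative_patches)
    classifier = LinearSVC().fit(np.array(X), np.array(y))
    # Average fingertip position over all positive rectangular areas, as described above.
    average_fingertip = np.mean(np.asarray(fingertip_positions, dtype=float), axis=0)
    return classifier, average_fingertip
```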
The detecting unit 5 in
(Method for Detecting the First Region Using the Color Feature Quantity by the Detecting Unit 5)
A method by which the detecting unit 5 detects the first region using the color feature quantity will be described. The detecting unit 5 extracts a skin-tone area using the color feature quantity received from the extracting unit 3, and detects a hand area (the combined area of the fingers and the back of the hand) based on the skin-tone area using various publicly known methods. For example, the detecting unit 5 may detect the hand area using a method disclosed in Japanese Patent No. 3863809. After detecting the hand area, the detecting unit 5 may recognize the number of fingers in the hand area, and detect the fingers and the fingertip positions from the contour of the hand area. In addition, using a method described hereafter as appropriate, the detecting unit 5 may acquire a center-of-gravity position of the hand area. As the method for calculating the center-of-gravity position, the detecting unit 5 may, for example, calculate the center-of-gravity position Gt(xt, yt) as the average of the coordinates (xi,t, yi,t) of the pixels Pi within the area Ps extracted as the skin-tone area in the image of frame t, where the number of such pixels is Ns.
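A minimal sketch of the skin-tone extraction and the center-of-gravity calculation is shown below; the YCrCb color space and its numeric bounds are illustrative assumptions, not values prescribed by the embodiment.

```python
# Skin-tone area and center-of-gravity sketch (assumption: the YCrCb thresholds are illustrative).
import cv2
import numpy as np

def skin_mask(frame_bgr):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    return cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))  # example skin-tone range

def center_of_gravity(mask):
    # Gt is the mean of the coordinates (xi,t, yi,t) of the Ns pixels in the skin-tone area Ps.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```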
(Method for Detecting the First Region Using the Intensity Gradient Feature Quantity by the Detecting Unit 5)
A method by which the detecting unit 5 detects the first region using the intensity gradient feature quantity will be described. The detecting unit 5 in
In addition, the detecting unit 5 may perform detection of the hand and fingers serving as the first region using a score. First, the detecting unit 5 calculates the fingertip direction from the fingertip position identified using the color feature quantity. Here, the fingertip direction may, for example, be the direction perpendicular to the contour in the periphery of the fingertip position. Next, the detecting unit 5 sets a predetermined rectangular area based on the fingertip position and the fingertip direction. The detecting unit 5 matches the average fingertip position in the first feature quantity model obtained by preliminary learning with the fingertip position identified using the color feature quantity, and matches the direction of the rectangular area with the fingertip direction calculated earlier. Thereafter, for example, the detecting unit 5 calculates the intensity gradient feature quantity for the inside of the rectangular area using the HOG feature quantity. Next, based on the first feature quantity model, the detecting unit 5 estimates a fingertip likeness using the intensity gradient feature quantity extracted from the rectangular area. For example, with SVM, the output score ranges from −1 to 1; a negative value indicates that the object is not a finger, and a positive value indicates that the object is a finger. The detecting unit 5 performs threshold determination on the score. When the score is less than a predetermined threshold, the detecting unit 5 may reject the estimation result. When the score is the threshold or higher, the detecting unit 5 may accept the estimation result. The detecting unit 5 may detect the hand and fingers and calculate the position of the fingertip based on the estimation result.
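A simplified sketch of the score-based verification follows; the axis-aligned rectangle (rotation to the fingertip direction is omitted), the injected feature extractor, and the zero score threshold are assumptions, and the signed decision value of the classifier merely plays the role of the SVM score described above.

```python
# Fingertip verification sketch (assumptions: 'classifier' and 'average_fingertip' come from the
# preliminary learning, 'extract_feature' maps a grayscale patch to the same feature vector used
# during learning, the rectangle is axis-aligned, and SCORE_THRESHOLD = 0.0 is illustrative).
import numpy as np

SCORE_THRESHOLD = 0.0

def verify_fingertip(frame_gray, fingertip_xy, classifier, average_fingertip,
                     extract_feature, rect_size=(64, 128)):
    # Place the rectangle so that its registered average fingertip position coincides with the
    # fingertip position obtained from the color feature quantity.
    w, h = rect_size
    x = max(int(round(fingertip_xy[0] - average_fingertip[0])), 0)
    y = max(int(round(fingertip_xy[1] - average_fingertip[1])), 0)
    patch = frame_gray[y:y + h, x:x + w]
    if patch.size == 0:
        return False, 0.0
    score = float(classifier.decision_function(np.asarray([extract_feature(patch)]))[0])
    # Reject the estimation result when the score is below the threshold, accept it otherwise.
    return score >= SCORE_THRESHOLD, score
```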
In addition, to support rotational movement within the two-dimensional coordinates of the image acquired by the acquiring unit 2, the detecting unit 5 may perform the detection process on all of a plurality of rotation images that are rotated by a fixed interval (angle). Furthermore, the detecting unit 5 may limit the retrieval area for the intensity gradient feature quantity based on the skin-tone area extracted from the above-described color feature quantity, as needed. In other words, when even a single pixel of the skin-tone area extracted based on the color feature quantity is included within the rectangular area prescribed for the intensity gradient feature quantity extracted by the extracting unit 3, the detecting unit 5 performs a comparison determination with the HOG feature quantity in the first feature quantity model. When no skin-tone area is included, the detecting unit 5 does not perform the detection process. As a result of this process, the calculation load of the detecting unit 5 is able to be significantly reduced. The detecting unit 5 may identify the averaged fingertip position within the rectangular area detected as the first region (hand and fingers) as the fingertip. Furthermore, when a plurality of rectangular areas are detected, the detecting unit 5 may select the rectangular area of which the similarity with the first feature quantity model (may also be referred to as a classifier) is the highest.
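The rotation images and the skin-area pruning may be sketched as follows; the 15-degree rotation interval and the binary skin mask produced by the earlier sketch are illustrative assumptions.

```python
# Rotation-image and skin-area pruning sketch (assumptions: 15-degree interval, binary skin mask).
import cv2
import numpy as np

def rotated_images(frame_gray, step_deg=15):
    h, w = frame_gray.shape[:2]
    center = (w / 2.0, h / 2.0)
    for angle in range(0, 360, step_deg):
        m = cv2.getRotationMatrix2D(center, angle, 1.0)
        yield angle, cv2.warpAffine(frame_gray, m, (w, h))

def contains_skin(mask, rect):
    # Skip the HOG comparison when not even a single skin-tone pixel falls within the rectangle.
    x, y, w, h = rect
    return bool(np.any(mask[y:y + h, x:x + w] > 0))
```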
The selecting unit 6 in
Next, the technical significance of the first information and the details of the selection process performed by the selecting unit 6 will be described. First, the technical significance of the first information will be described. As a result of close verification, the present inventors have newly found a phenomenon that is commonly observed when the detection of the hand and fingers and the detection of the position of the fingertip are not accurately performed using the color feature quantity, which characteristically has a low calculation load. The phenomenon is that, as a result of the hand and finger area and the skin-tone area of the background overlapping, the number of fingers increases or decreases within a short amount of time, or the position of the fingertip changes significantly within a short amount of time. In other words, as a result of the hand and finger area and the skin-tone area of the background overlapping, an instance may occur in which the movement amount of the hand and fingers, serving as the first region, within an arbitrary amount of time (which may also be referred to as a third time that is the difference between a first time and a second time) becomes equal to or greater than a predetermined threshold (which may also be referred to as a first threshold).
In
As is understandable from
Next, the details of the selection process performed by the selecting unit 6 will be described. For convenience of explanation, in the following description, a state in which the detecting unit 5 detects the first region using the color feature quantity is referred to as a color feature quantity mode. A state in which the detecting unit 5 detects the first region using the intensity gradient feature quantity is referred to as an intensity gradient feature quantity mode.
In
(Determination Process Regarding the Increase and Decrease in the Number of Fingers)
Here, the details of the determination process regarding the increase and decrease in the number of fingers will be described. First, regarding the increase and decrease in the number of fingers, differentiation is desired between when the user intentionally increases the number of fingers (such as when the user extends a finger from a state in which the hand is fisted) and when the number of fingers increases due to erroneous detection as a result of the skin-tone area of the background and the hand and fingers overlapping. Therefore, when the number of fingers has changed at a certain time, the selecting unit 6 checks whether the number of fingers also changed within a fixed short time tm beforehand. For example, when the number of fingers has changed from two to one at time t [sec], the selecting unit 6 checks whether or not the number of fingers has changed from one to two after time t−tm [sec], that is, within the preceding tm seconds. If such a change has occurred, the selecting unit 6 determines that an erroneous increase or decrease in the number of fingers has occurred. The time tm may be set to a value that takes into consideration the speed at which a human is able to move a finger. For example, at 30 FPS, under the assumption that a person is realistically not able to increase then decrease (or decrease then increase) the number of fingers within 0.06 seconds, tm may be set to 0.06 (corresponding to about two frames). The time tm may be referred to as the third time. The above-described first threshold may be set, for example, to the change quantity in the number of fingers.
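A sketch of this check is given below; the per-frame history list and the treatment of tm as about two frames at 30 FPS are illustrative assumptions.

```python
# Finger-count bounce check sketch (assumptions: finger_counts is a per-frame history, newest
# last; 30 FPS and tm = 0.06 s, i.e. about two frames, follow the example above).
FPS = 30
TM_SECONDS = 0.06
WINDOW = max(int(round(TM_SECONDS * FPS)), 1)

def finger_count_bounced(finger_counts):
    """Return True when the latest change in the number of fingers is preceded, within tm,
    by a change in the opposite direction (treated as erroneous detection)."""
    if len(finger_counts) < WINDOW + 2:
        return False
    latest_change = finger_counts[-1] - finger_counts[-2]
    if latest_change == 0:
        return False
    recent = finger_counts[-(WINDOW + 2):-1]
    earlier_changes = [b - a for a, b in zip(recent, recent[1:])]
    return any(c * latest_change < 0 for c in earlier_changes)
```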
(Calculation Process for the Movement Amount of the Finger Vector)
Here, the details of the calculation process for the movement amount of the finger vector will be described. Regarding the movement amount of the finger vector, for example, the vector from the center of gravity of the back of the hand to each finger may be calculated, and the movement amount may be calculated from the vectors at the previous time and the current time. In addition to a size, the finger vector includes a direction component. Therefore, a movement of the finger of the user in an unexpected direction (such as the finger momentarily moving to the left or right while moving from a downward direction toward an upward direction) may be detected. In addition, if the movement of the fingertip position identified based on the color feature quantity were used to calculate the movement amount, a transition to intensity gradient feature quantity mode could occur merely because the hand and fingers move at a high speed, even in a state in which the transition is not desired. On the other hand, as a result of the determination being performed using the change quantity of the finger vector, a transition to intensity gradient feature quantity mode when the transition is not desired is able to be suppressed.
In the above-described expression (2), the term in the front half of the right side indicates the difference in the size of the finger vector from the previous frame; the closer this value is to zero, the less the size of the finger vector has changed. The term in the rear half of the right side indicates the angle [rad] formed by the vectors, in normalized form; the closer this value is to zero, the smaller the formed angle. In other words, the closer the finger vector change quantity var is to zero, the higher the reliability of the detection result from the detecting unit 5. Accordingly, when the change quantity of the finger vector falls below a certain threshold θ, the reliability of the detection result from the detecting unit 5 may be considered high. Various methods may be applied for setting the threshold θ. For example, a plurality of users may be asked in advance to move their hand and fingers in an area in which the background does not include the skin-tone color, and the maximum value of the finger vector change quantity var obtained at this time may be used. For example, when the speed of image processing by the image processing device 1 is 30 FPS, if the difference in the size of the finger vector from the previous frame is 0.25 and 30 degrees (π/6 [rad]) is set as the maximum value of the angle formed by the finger vectors, the threshold θ is 0.04. In addition, because the threshold indicates the ease with which intensity gradient feature quantity mode is entered, the threshold may be changed as appropriate depending on the intended use.
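Because expression (2) itself is not reproduced above, the following is only one plausible formulation consistent with the description: the relative size-difference term and the angle term normalized by π are assumed to be combined by multiplication, which reproduces the worked example of θ = 0.04 for a 0.25 size difference and a 30-degree angle. Both the combination rule and the normalization are assumptions.

```python
# Finger-vector change quantity sketch (assumed formulation of expression (2); see the note above).
import numpy as np

THETA = 0.04  # threshold from the worked example above

def finger_vector_change(prev_vec, curr_vec):
    prev_vec = np.asarray(prev_vec, dtype=float)
    curr_vec = np.asarray(curr_vec, dtype=float)
    size_prev = np.linalg.norm(prev_vec)
    size_curr = np.linalg.norm(curr_vec)
    # Front-half term: difference in the size of the finger vector from the previous frame.
    size_term = abs(size_curr - size_prev) / max(size_prev, 1e-9)
    # Rear-half term: angle formed by the two vectors, normalized by pi.
    cos_angle = np.clip(np.dot(prev_vec, curr_vec) / max(size_prev * size_curr, 1e-9), -1.0, 1.0)
    angle_term = float(np.arccos(cos_angle)) / np.pi
    return size_term * angle_term

def detection_is_reliable(prev_vec, curr_vec, theta=THETA):
    # The closer var is to zero, the higher the reliability of the color-based detection result.
    return finger_vector_change(prev_vec, curr_vec) < theta
```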
Next, in the first flowchart of the feature quantity selection performed by the selecting unit 6 in
In
The extracting unit 3 receives the image from the acquiring unit 2 and extracts the color feature quantity or the intensity gradient feature quantity of the image (step S1102). The extracting unit 3 may extract, for example, a pixel value in RGB color space as the color feature quantity. In addition, the extracting unit 3 may extract, for example, the HOG feature quantity or the LBP feature quantity as the intensity gradient feature quantity. When the selecting unit 6 instructs the extraction of only one of the color feature quantity and the intensity gradient feature quantity, as described hereafter, the extracting unit 3 may extract only the instructed feature quantity at step S1102. The extracting unit 3 then outputs the extracted color feature quantity or intensity gradient feature quantity to the detecting unit 5.
The detecting unit 5 receives, from the extracting unit 3, the color feature quantity or the intensity gradient feature quantity extracted by the extracting unit 3. The detecting unit 5 detects the first region based on the color feature quantity or the intensity gradient feature quantity (step S1103). At step S1103, the detecting unit 5 detects the first region using the color feature quantity or the intensity gradient feature quantity based on the selection by the selecting unit 6. In addition, the detecting unit 5 may detect the fingertip position of the hand and fingers serving as an example of the first region, as needed.
The selecting unit 6 selects whether the detecting unit 5 detects the hand and fingers using the color feature quantity or the intensity gradient feature quantity based on the first information, and instructs the detecting unit 5 accordingly (step S1104). In addition, at step S1104, the selecting unit 6 may instruct the extracting unit 3 to extract only one of the color feature quantity and the intensity gradient feature quantity, as appropriate. A detailed flow of the process at step S1104 corresponds to the flowcharts of the feature quantity selection described above.
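A compact sketch of the per-frame flow of steps S1101 to S1104 follows; the injected callables merely stand in for the acquiring, extracting, detecting, and selecting units described above and are not actual interfaces of the embodiment.

```python
# Per-frame flow sketch of steps S1101-S1104 (the callables are placeholder stand-ins).
def process_frame(acquire, extract_color, extract_gradient, detect, select, mode):
    frame = acquire()                                   # S1101: acquiring unit 2
    if mode == "color":
        feature = extract_color(frame)                  # S1102: only the selected feature quantity
        result = detect(feature, use_gradient=False)    # S1103: detecting unit 5
    else:
        feature = extract_gradient(frame)
        result = detect(feature, use_gradient=True)
    next_mode = select(result, mode)                    # S1104: selection based on the first information
    return result, next_mode
```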
In the image processing device in example 1, the position of the hand and fingers of the user is able to be accurately identified without depending on the background color. Furthermore, through dynamic selection of the color feature quantity and the intensity gradient feature quantity depending on various circumstances, the position of the hand and fingers of the user is able to be detected with high robustness and low calculation load without depending on the background color.
In example 2, a method is disclosed in which the calculation load is reduced and the processing speed is improved by restricting the scanning range over which the detecting unit 5 calculates the intensity gradient feature quantity.
When the movement amount of the finger vector between the preceding and current times is low, the position of the hand and fingers has not changed significantly from the preceding time. Therefore, as a result of the range of the rectangular area and the rotation area being restricted as described above, the search area is able to be significantly reduced. Furthermore, in example 2, the search area is restricted using the center-of-gravity position rather than the fingertip position. A reason for this is that, in example 2, the center of gravity is calculated from the extracted skin-tone area. Because a skin-tone area of a fixed size or larger is extracted, the center of gravity is acquired with relative stability. On the other hand, the fingertip position is estimated from the extracted skin-tone area based on the curvature of the contour. Therefore, depending on the state of the contour, a situation may occur in which the fingertip position is difficult to acquire stably. Because the image processing device 1 in example 2 restricts the search area using the center-of-gravity position rather than the fingertip position, stable operation is realized.
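A sketch of restricting the search area and the rotation range around the previously acquired center of gravity is shown below; the margin of 80 pixels and the ±30-degree rotation range are illustrative assumptions.

```python
# Search-area restriction sketch for example 2 (margin and rotation range are illustrative).
def restricted_search_area(prev_center, frame_shape, margin=80):
    # Only rectangles near the previous center of gravity are scanned, because the hand and
    # fingers cannot have moved far when the finger-vector movement amount is low.
    cx, cy = prev_center
    h, w = frame_shape[:2]
    x0, y0 = max(int(cx - margin), 0), max(int(cy - margin), 0)
    x1, y1 = min(int(cx + margin), w), min(int(cy + margin), h)
    return x0, y0, x1, y1

def restricted_rotation_angles(prev_angle_deg, spread=30, step=15):
    # Likewise, only rotation images near the previously detected angle are examined.
    return [a % 360 for a in range(prev_angle_deg - spread, prev_angle_deg + spread + 1, step)]
```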
In the image processing device in example 2, the position of the hand and fingers of the user is able to be accurately identified without depending on the background color. Furthermore, through dynamic selection of the color feature quantity and the intensity gradient feature quantity depending on various circumstances, the position of the hand and fingers of the user is able to be detected with high robustness and low calculation load without depending on the background color.
The computer 100 as a whole is controlled by a processor 101. A random access memory (RAM) 102 and a plurality of peripheral devices are connected to the processor 101 by a bus 109. The processor 101 may be a multiprocessor. The processor 101 is, for example, a central processing unit (CPU), a microprocessing unit (MPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), or a programmable logic device (PLD). Furthermore, the processor 101 may be a combination of two or more elements among the CPU, MPU, DSP, ASIC, and PLD.
The RAM 102 is used as a main storage device of the computer 100. The RAM 102 temporarily stores therein an operating system (OS) program and at least some of the application programs executed by the processor 101. In addition, the RAM 102 stores therein various pieces of data used for processes performed by the processor 101.
The peripheral devices connected to the bus 109 are a hard disk drive (HDD) 103, a graphic processing device 104, an input interface 105, an optical drive device 106, a device connection interface 107, and a network interface 108.
The HDD 103 magnetically writes and reads data to and from a magnetic disk provided therein. The HDD 103 is, for example, used as an auxiliary storage device of the computer 100. The HDD 103 stores therein an OS program, application programs, and various pieces of data. As the auxiliary storage device, a semiconductor storage device such as a flash memory may also be used.
A monitor 110 is connected to the graphic processing device 104. The graphic processing device 104 displays various images on the screen of the monitor 110 based on instructions from the processor 101. The monitor 110 is a display device using a cathode ray tube (CRT), a liquid crystal display device, or the like.
A keyboard 111 and a mouse 112 are connected to the input interface 105. The input interface 105 transmits, to the processor 101, signals sent from the keyboard 111 and the mouse 112. The mouse 112 is an example of a pointing device, and other pointing devices may be used. Examples of other pointing devices include a touch panel, a tablet, a touchpad, and a trackball.
The optical drive device 106 reads out data recorded on an optical disc 113 using laser light or the like. The optical disc 113 is a portable recording medium on which data is recorded such that it is readable by reflection of light. The optical disc 113 is a digital versatile disc (DVD), a DVD-RAM, a compact disc read-only memory (CD-ROM), a CD-recordable/rewritable (CD-R/RW), or the like. Programs stored on the optical disc 113, which is a portable recording medium, are installed on the image processing device 1 via the optical drive device 106. A predetermined program, once installed, becomes executable by the image processing device 1.
The device connection interface 107 is a communication interface for connecting peripheral devices to the computer 100. For example, a memory device 114 and a memory reader/writer 115 may be connected to the device connection interface 107. The memory device 114 is a recording medium provided with a communication function for communicating with the device connection interface 107. The memory reader/writer 115 is a device that writes data onto a memory card 116 or reads out data from the memory card 116. The memory card 116 is a card-type recording medium.
The network interface 108 is connected to a network 117. The network interface 108 performs transmission and reception of data with another computer or a communication device, over the network 117.
For example, the computer 100 executes a program recorded on a computer-readable recording medium and thereby actualizes the above-described image processing functions. A program in which the processing content to be performed by the computer 100 is written may be recorded on various recording media. The program may be configured by one or a plurality of functional modules. For example, the program may be configured by functional modules actualizing the processes performed by the acquiring unit 2, the extracting unit 3, the storage unit 4, the detecting unit 5, and the selecting unit 6 described above.
Each constituent element of each device that has been illustrated does not have to be physically configured as illustrated. In other words, the specific manner of distribution and integration of the devices is not limited to that illustrated. All or some of the devices may be configured to be functionally or physically distributed or integrated in arbitrary units depending on various loads, usage conditions, and the like. In addition, the various processes described in the above examples may be actualized by a computer, such as a personal computer or a workstation, executing programs prepared in advance.
Furthermore, an image sensor, such as a CCD or CMOS sensor, has been described as an example of an external device. However, the present embodiment is not limited thereto. The image processing device may include the image sensor.
According to the present embodiment, an example in which the hand and fingers are skin tone and the background is similar to the skin tone is described. However, the present embodiment is not limited thereto. For example, the present embodiment is able to be applied even when the hand and fingers are covered by a glove or the like, and a color similar to the color of the glove is used in the background.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.