The present disclosure relates to an image capturing system, and more particularly, to an image capturing system using voice-based focus control.
Autofocus is a common function of digital cameras in current electronic devices. For example, an application processor of a mobile electronic device may implement the autofocus function by dividing a preview image into several blocks and selecting the block having the most texture or detail as the focus region. However, if the block selected by the electronic device does not meet the user's expectation, the user needs to select the focus region manually. Therefore, a touch focus function has been proposed. The touch focus function allows the user to touch a block on a display touch panel of the electronic device that he/she would like to focus on, and the application processor then adjusts the focus region accordingly.
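Purely as an illustration (not part of the disclosure itself), the following minimal Python sketch shows one way the block-based autofocus selection described above could be approximated, assuming the preview image is available as a 2-D grayscale NumPy array and that gradient energy is used as the texture measure:

```python
import numpy as np

def select_focus_block(preview: np.ndarray, grid: int = 4):
    """Pick the grid cell with the most texture, measured by gradient energy.

    `preview` is assumed to be a 2-D grayscale image array; `grid` splits it
    into grid x grid blocks. Returns (row, col) of the chosen block.
    """
    h, w = preview.shape
    bh, bw = h // grid, w // grid
    best, best_score = (0, 0), -1.0
    for r in range(grid):
        for c in range(grid):
            block = preview[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].astype(float)
            # Simple texture measure: mean squared horizontal/vertical gradients.
            gy, gx = np.gradient(block)
            score = float(np.mean(gx ** 2 + gy ** 2))
            if score > best_score:
                best, best_score = (r, c), score
    return best

# Example: a synthetic preview where only the bottom-right block has detail.
img = np.zeros((256, 256))
img[192:, 192:] = np.random.rand(64, 64)
print(select_focus_block(img))  # expected (3, 3)
```

The grid size and the gradient-energy score are illustrative assumptions; an actual application processor may use different block partitioning and sharpness metrics.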
However, the touch focus function requires complicated and error-prone manual operations. For example, the user may have to hold the electronic device, touch the block to be focused on, and take a picture, all within a short period of time. Since the block may contain a number of objects, it can be difficult to know exactly which object the user wants to focus on, causing inaccuracy and ambiguity. Furthermore, when the user touches the display touch panel of the electronic device, such action may shake the electronic device or alter the field of view of the camera. In such a case, the region the user touches may no longer be the block the user actually wants to focus on, and consequently the photo taken may not be satisfactory. Therefore, finding a convenient means to select the focus region with greater accuracy when taking pictures has become an issue to be solved.
One embodiment of the present disclosure discloses an image capturing system. The image capturing system comprises an image-sensing module, a plurality of processors, a display panel, and an audio acquisition module. The processors comprise a first processor and a second processor. The first processor is configured to detect a plurality of objects in a preview image sensed by the image-sensing module and attach identification labels to the objects detected. The display panel is configured to display the preview image with the identification labels of the detected objects. The audio acquisition module is configured to convert an analog signal of a user's voice into digital voice data. At least one of the processors is configured to parse the digital voice data into user intent data. The second processor is configured to select a target from the detected objects in the preview image according to the user intent data and the identification labels of the detected objects, and control the image-sensing module to perform a focusing operation with respect to the target.
Another embodiment of the present disclosure discloses a method for adjusting focus. The method comprises sensing, by an image-sensing module, a preview image; detecting a plurality of objects in the preview image; attaching identification labels to the objects detected; displaying the preview image with the identification labels of the detected objects on a display panel; converting, by an audio acquisition module, an analog signal of a user's voice into digital voice data; parsing the digital voice data into user intent data; selecting a target from the detected objects in the preview image according to the user intent data and the identification labels of the detected objects; and controlling the image-sensing module to perform a focusing operation with respect to the target.
Since the image capturing system and the method for adjusting focus provided by the embodiments of the present disclosure allow a user to select a target or a specific subject to be focused on by means of voice-based focus control, the user can concentrate on holding and stabilizing the camera or the electronic device while composing a photo without touching the display panel for focusing, thereby simplifying the image-capturing process and avoiding shaking the image capturing system. Furthermore, since the objects in the preview image can be detected and labeled for the user to select from using the proposed voice-based focus control, the focusing operation can be performed with respect to the target directly and with greater accuracy.
A more complete understanding of the present disclosure may be derived by referring to the detailed description and claims when considered in connection with the Figures, where like reference numbers refer to similar elements throughout the Figures.
The following description accompanies drawings, which are incorporated in and constitute a part of this specification, and which illustrate embodiments of the disclosure, but the disclosure is not limited to the embodiments. In addition, the following embodiments can be properly integrated to complete another embodiment.
References to “one embodiment,” “an embodiment,” “exemplary embodiment,” “other embodiments,” “another embodiment,” etc. indicate that the embodiment(s) of the disclosure so described may include a particular feature, structure, or characteristic, but not every embodiment necessarily includes the particular feature, structure, or characteristic. Further, repeated use of the phrase “in the embodiment” does not necessarily refer to the same embodiment, although it may.
In order to make the present disclosure completely comprehensible, detailed steps and structures are provided in the following description. Obviously, the implementation of the present disclosure is not limited to the specific details known to persons skilled in the art. In addition, known structures and steps are not described in detail, so as not to unnecessarily obscure the present disclosure. Preferred embodiments of the present disclosure will be described below in detail. However, in addition to the detailed description, the present disclosure may also be widely implemented in other embodiments. The scope of the present disclosure is not limited to the detailed description but is defined by the claims.
In step S210, the image-sensing module 110 may capture the preview image IMG1, and in step S220, the first processor 140 may detect objects in the preview image IMG1. In some embodiments, the first processor 140 may be an artificial intelligence (AI) processor, and the first processor 140 may detect the objects according to a machine learning model, such as a deep learning model utilizing a neural network structure. For example, a well-known object detection algorithm, YOLO (You Only Look Once), proposed by Joseph Redmon et al. in 2015, may be adopted. In some embodiments, the first processor 140 may comprise a plurality of processing units, such as neural-network processing units (NPUs), for parallel computation so that object detection based on the neural network can be accelerated. However, the present disclosure is not limited thereto. In other embodiments, other suitable models for object detection may be adopted, and the structure of the first processor 140 may be adjusted accordingly.
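As a non-limiting sketch of the detection step S220, the Python code below uses a hypothetical `run_detector` callable standing in for whichever YOLO-style or other detection model is actually deployed; the callable and its (class name, confidence, box) return format are assumptions made only for illustration:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Detection:
    class_name: str                     # e.g. "human", "tree"
    confidence: float                   # detector score in [0, 1]
    box: Tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixels

def detect_objects(preview_image,
                   run_detector: Callable[[object], Sequence[Tuple[str, float, Tuple[int, int, int, int]]]],
                   min_confidence: float = 0.5) -> List[Detection]:
    """Run an arbitrary object detector over the preview image and keep only
    detections above a confidence threshold. `run_detector` is a hypothetical
    stand-in for the actual model executed on the first processor 140."""
    return [Detection(name, score, box)
            for name, score, box in run_detector(preview_image)
            if score >= min_confidence]
```

The confidence threshold and the structure of the returned detections are illustrative choices, not requirements of the disclosure.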
Furthermore, in some embodiments, to improve the accuracy of object detection, the preview image IMG1 captured by the image-sensing module 110 may be subjected to image processing to improve its quality. For example, the image capturing system 100 may be incorporated in a mobile device, and the second processor 150 may be an application processor of the mobile device. In such a case, the second processor 150 may include an image signal processor (ISP) and may perform image enhancement operations, such as auto white balance (AWB), color correction, or noise reduction, on the preview image IMG1 before the first processor 140 detects the objects in the preview image IMG1, so that the first processor 140 can detect the objects with greater accuracy.
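For illustration only, a minimal sketch of two such enhancement operations is given below, assuming the preview image is a floating-point RGB NumPy array in [0, 1]; the gray-world white balance and box-filter denoising shown here are simple stand-ins for whatever the ISP actually implements:

```python
import numpy as np

def gray_world_awb(rgb: np.ndarray) -> np.ndarray:
    """Very simple auto white balance: scale each channel so its mean matches
    the overall mean (gray-world assumption). `rgb` is HxWx3, float in [0, 1]."""
    means = rgb.reshape(-1, 3).mean(axis=0)
    gains = means.mean() / np.maximum(means, 1e-6)
    return np.clip(rgb * gains, 0.0, 1.0)

def box_denoise(rgb: np.ndarray, k: int = 3) -> np.ndarray:
    """Crude noise reduction: a k x k box filter applied per channel."""
    pad = k // 2
    padded = np.pad(rgb, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(rgb)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + rgb.shape[0], dx:dx + rgb.shape[1]]
    return out / (k * k)
```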
After the objects are detected, the first processor 140 may attach identification labels to the detected objects in step S230, and the display panel 130 may display the preview image IMG1 with the identification labels of the detected objects in step S240.
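One possible way to generate the identification labels of step S230 is sketched below; the "Name SerialNumber" format (e.g. "Human 1") is an assumption made for illustration, since the disclosure only requires that a name and/or a serial number be attached to each detected object:

```python
from collections import defaultdict

def attach_identification_labels(class_names):
    """Turn raw detector class names into per-object identification labels,
    e.g. ["human", "human", "tree"] -> ["Human 1", "Human 2", "Tree 1"]."""
    counters = defaultdict(int)
    labels = []
    for name in class_names:
        counters[name] += 1
        labels.append(f"{name.capitalize()} {counters[name]}")
    return labels

print(attach_identification_labels(["human", "human", "tree"]))
# ['Human 1', 'Human 2', 'Tree 1']
```

These labels may then be drawn next to the bounding boxes of the detected objects on the display panel 130.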
As shown in
In the present embodiment, when the user sees the preview image IMG1 and the identification labels of the objects shown on the display panel 130, the user may select a target from the detected objects by speaking the name and/or the serial number of the target as given in the object identification labels. Meanwhile, in step S250, the audio acquisition module 120 may receive an analog signal of the user's voice and convert the analog signal into digital voice data. In some embodiments, the image capturing system 100 may be incorporated in a mobile device, such as a smart phone or a tablet, and the audio acquisition module 120 may include a microphone that is used for a phone call function.
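A minimal sketch of the audio acquisition step S250 is shown below; the use of the third-party `sounddevice` package, the sample rate, and the fixed recording duration are all assumptions for illustration, not requirements of the audio acquisition module 120:

```python
import numpy as np
import sounddevice as sd  # one possible microphone back end; an assumption here

def record_voice(duration_s: float = 3.0, sample_rate: int = 16000) -> np.ndarray:
    """Capture the analog microphone signal and return it as 16-bit PCM samples
    (the "digital voice data"). Blocks until the recording is finished."""
    frames = int(duration_s * sample_rate)
    audio = sd.rec(frames, samplerate=sample_rate, channels=1, dtype="int16")
    sd.wait()  # wait for the recording to complete
    return audio[:, 0]
```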
After the user's voice is converted into digital voice data, the digital voice data may be parsed into the user intent data in step S252. In some embodiments, the user's voice may convey speech, and the user intent data may be derived by analyzing the content of the user's speech in the digital voice data.
In some embodiments, a speech recognition algorithm may utilize a machine learning model, such as a deep learning model, for parsing the digital voice data. The deep learning model has a multi-layer structure, and may take the features extracted by a previous layer and use them as the input of the next layer; thus, each new layer learns a transformation of the features learned before it. Since a deep learning model can learn to extract crucial features through training, it has been adopted in a variety of recognition algorithms in the field of computer science, for example, object recognition algorithms and speech recognition algorithms.
In some embodiments, since the first processor 140 may have a multi-core structure that is suitable for realizing algorithms utilizing machine learning models, the first processor 140 may also be utilized to realize the deep learning model for speech recognition, so as to parse the digital voice data into the user intent data in step S252. However, the present disclosure is not limited thereto. In some other embodiments, if the first processor 140 is not suitable for running the chosen speech recognition algorithm, the image capturing system 100 may further include a third processor that is compatible with the chosen speech recognition algorithm to perform step S252. In yet other embodiments, instead of a machine learning-based algorithm, a speech recognition algorithm using Gaussian mixture models (GMMs) in combination with hidden Markov models (HMMs) may be adopted. In such a case, the second processor 150 or another processor suitable for realizing the GMM models may be employed accordingly.
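For illustration of step S252, the sketch below assumes that a speech-to-text engine (of whatever kind) has already produced a transcript string, and shows one simple way to turn that transcript into "user intent data", here represented as a list of word segments that can later be matched against identification labels:

```python
import re

def parse_intent(transcript: str) -> list[str]:
    """Turn a speech-recognition transcript into simple user intent data:
    a list of lower-cased word segments, plus joined two-word segments so that
    multi-word labels such as "human 1" can be matched later."""
    words = re.findall(r"[a-z0-9]+", transcript.lower())
    segments = list(words)
    segments += [" ".join(pair) for pair in zip(words, words[1:])]
    return segments

print(parse_intent("Focus on human 1 please"))
# ['focus', 'on', 'human', '1', 'please', 'focus on', 'on human', 'human 1', '1 please']
```

The segment-based representation is an illustrative choice; the actual user intent data may take any form from which label matches can be determined.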
Furthermore, in some embodiments, the speech recognition may be performed by more than one processor.
In some embodiments, to reduce power consumption, the audio acquisition module 120 may only be enabled when a speak-to-focus function is activated. Otherwise, if the autofocus function already meets the user's requirement or the user chooses to adjust the focus by some other means, the speak-to-focus function may not be activated, and the audio acquisition module 120 can be disabled accordingly.
After the digital voice data is parsed into the user intent data in step S252, the second processor 150 may select the target in the preview image IMG1 according to the user intent data and the identification labels of the detected objects in step S260. For example, the second processor 150 may determine the target when the user intent data includes a data segment that matches the identification label of the target. For instance, if the second processor 150 determines that the user intent data includes a data segment matching the identification label of an object O1 in the preview image IMG1, such as the name "Tree" of the object O1, then the object O1 will be selected as the target.
Alternatively, if the second processor 150 determines that the user intent data includes a data segment matching the object name "Human 1" of an object O2 in the preview image IMG1, then the object O2 will be selected as the target. That is, the image capturing system 100 allows the user to select the target to be focused on by saying the object name and/or the serial number listed in the object identification labels. Therefore, when taking pictures, users can concentrate on holding and stabilizing the camera or the electronic device while composing a picture without touching the display panel 130 for focusing, thereby not only simplifying the image-capturing process but also avoiding shaking the image capturing system 100. Furthermore, since the objects in the preview image IMG1 can be detected and labeled for the user to select from, the selection operation based on voice input is more intuitive, and the focusing operation can be performed with respect to the target with greater accuracy.
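The matching performed in step S260 may be as simple as a lookup of the intent-data segments against the label set; the following sketch is one illustrative, non-limiting way to do so (exact, case-insensitive matching is an assumption, and fuzzy matching could be substituted):

```python
def select_target(intent_segments, identification_labels):
    """Return the index of the detected object whose identification label
    matches a segment of the user intent data, or None if nothing matches."""
    normalized = {label.lower(): idx for idx, label in enumerate(identification_labels)}
    for segment in intent_segments:
        if segment in normalized:
            return normalized[segment]
    return None

labels = ["Tree", "Human 1", "Human 2"]
print(select_target(["focus", "on", "human 1"], labels))  # 1, i.e. "Human 1"
```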
In some embodiments, to confirm the user's selection, the second processor 150 may change a visual appearance of the identification label of the object that the user selects via voice input. For example, the second processor 150 may select a candidate object from the objects in the preview image IMG1 when the user intent data includes a data segment that matches the identification label of the candidate object, and may change a visual appearance of the identification label of the candidate object so as to visually distinguish the candidate object from the rest of the objects in the preview image IMG1. For example, in some embodiments, the second processor 150 may change the color of the bounding box of the candidate object. Therefore, the user is able to check whether the candidate object is his/her target. In some embodiments, if the candidate object is not the one the user intends to focus on, the user may say the object name and/or the object serial number of the desired object again, and steps S240 to S260 may be performed repeatedly until the target is selected and confirmed.
In addition, to confirm that the candidate object selected by the image capturing system 100 is the correct target, the user may say a predetermined confirm command, for example but not limited to "yes" or "okay." In such a case, the audio acquisition module 120 may receive the analog signal of the user's voice and convert the analog signal into digital voice data so that speech recognition can be performed. When the user intent data is recognized to include a command segment matching the confirm command, the image capturing system 100 confirms that the candidate object is the target to be focused on.
Also, to allow the user to be visually aware of the object being selected via voice input, the second processor 150 may change a visual appearance of the identification label of the target once the target is selected through the above-described steps. For example, in some embodiments, the second processor 150 may change the color of the bounding box B1 of the object O1 that has been selected as the target. As a result, the user can distinguish the selected object from the other objects according to the colors of the identification labels. Since the image capturing system 100 can display the objects in a scene with their identification labels, the user may select the target from the labeled objects shown on the display panel 130 directly by saying the name and/or the serial number of the target. Therefore, any ambiguity caused by selecting among adjacent objects via hand touch can be avoided.
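A compact sketch of this candidate-highlight-and-confirm flow is given below; the specific confirm words, the use of colors as the changed visual appearance, and the mutable `box_colors` list standing in for the on-screen labels are all illustrative assumptions:

```python
CONFIRM_WORDS = {"yes", "okay", "ok"}  # assumed confirm commands

def confirmation_step(candidate_index, intent_segments, box_colors):
    """Highlight the candidate object's bounding box and, on a confirm command,
    promote the candidate to the final target. Returns the confirmed target
    index, or None while the system is still waiting for confirmation."""
    box_colors[candidate_index] = "yellow"     # visually distinguish the candidate
    if CONFIRM_WORDS & set(intent_segments):
        box_colors[candidate_index] = "green"  # confirmed target
        return candidate_index
    return None

colors = ["white", "white", "white"]
print(confirmation_step(1, ["yes"], colors), colors)  # 1 ['white', 'green', 'white']
```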
In some embodiments, the user may take pictures in a noisy environment or an environment full of people. In such cases, noises or the voices of other people may interfere with the image capturing system 100 when performing the speak-to-focus function. For example, if a person next to the user says the name of a certain object detected in the preview image IMG1, the image capturing system 100 may accidentally select this object as the target. To avoid such a case, before step S260, the method 200 may further verify the user's identity according to the characteristics of the user's voice, such as his/her voiceprint. Consequently, in step S260, the target will be selected only if the identity of the user is verified as valid and the user intent data includes a data segment that matches the identification label of the target.
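As an illustrative sketch of such a voiceprint check, the code below compares two speaker embeddings by cosine similarity; the existence of a speaker-embedding model that produces these vectors, and the similarity threshold, are assumptions made only for this example:

```python
import numpy as np

def is_same_speaker(enrolled_embedding: np.ndarray,
                    utterance_embedding: np.ndarray,
                    threshold: float = 0.75) -> bool:
    """Compare two voiceprint embeddings by cosine similarity. The embeddings
    are assumed to come from any speaker-embedding model (e.g. an x-vector or
    d-vector network); the threshold is illustrative, not from the disclosure."""
    a = enrolled_embedding / np.linalg.norm(enrolled_embedding)
    b = utterance_embedding / np.linalg.norm(utterance_embedding)
    return float(np.dot(a, b)) >= threshold
```

If the check fails, the data segment may simply be ignored so that bystanders cannot change the selected target.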
Once the target is selected, the second processor 150 may control the image-sensing module 110 to perform a focusing operation with respect to the target in step S270 for subsequent capturing operations.
In the present embodiment, after the focus of the image-sensing module 110 is adjusted with respect to the target, the second processor 150 may further track the movement of the target in step S280, and control the image-sensing module 110 to keep the target in focus in step S290. For example, the first processor 140 and/or other processor(s) may extract features of the target in the preview image IMG1 and locate or track the moving target by feature matching. In some embodiments, any suitable known focus tracking technique may be adopted in step S280. Consequently, after steps S280 and/or S290, when the user commands the image capturing system 100 to capture an image, the image-sensing module 110 captures the image while keeping focus on the target.
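One very simple, non-limiting way to re-locate the target in subsequent preview frames is sketched below; the use of a nearest-feature comparison (here with toy color-histogram-like vectors) is an assumption for illustration, and any established tracking technique may be used instead:

```python
import numpy as np

def track_target(target_feature: np.ndarray, new_features: list) -> int:
    """Re-locate the target among detections in a new preview frame by picking
    the detection whose feature vector (e.g. a color histogram) is closest to
    the target's stored feature. Returns the index of the best match."""
    distances = [np.linalg.norm(target_feature - f) for f in new_features]
    return int(np.argmin(distances))

# Example with toy 3-bin "histograms": the second detection matches best.
target = np.array([0.7, 0.2, 0.1])
candidates = [np.array([0.1, 0.8, 0.1]), np.array([0.68, 0.22, 0.10])]
print(track_target(target, candidates))  # 1
```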
In summary, the image capturing system and the method for adjusting focus provided by the embodiments of the present disclosure allow the user to select the target on which the image-sensing module should focus by saying the name and/or the serial number of the target shown on the display panel. Users can concentrate on holding and stabilizing the camera or the electronic device while composing a photo without touching the display panel for focusing, thereby not only simplifying the image-capturing process but also avoiding shaking the image capturing system. Furthermore, since the objects in the preview image can be detected and labeled for the user to select from using voice-based focus control, the focusing operation can be performed with respect to the target directly and with greater accuracy.
Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims. For example, many of the processes discussed above can be implemented in different methodologies and replaced by other processes, or a combination thereof.
Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the present disclosure, processes, machines, manufacture, compositions of matter, means, methods or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein, may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods and steps.