Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the invention are shown. Indeed, the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout.
In addition, while several embodiments of the method of the present invention are performed or used by a mobile terminal 10, the method may be employed by other than a mobile terminal. Moreover, the system and method of the present invention will be primarily described in conjunction with mobile communications applications. It should be understood, however, that the system and method of the present invention can be utilized in conjunction with a variety of other applications, both in the mobile communications industries and outside of the mobile communications industries.
The mobile terminal 10 includes an antenna 12 in operable communication with a transmitter 14 and a receiver 16. The mobile terminal 10 further includes a controller 20 or other processing element that provides signals to and receives signals from the transmitter 14 and receiver 16, respectively. The signals include signaling information in accordance with the air interface standard of the applicable cellular system, and also user speech and/or user generated data. In this regard, the mobile terminal 10 is capable of operating with one or more air interface standards, communication protocols, modulation types, and access types. By way of illustration, the mobile terminal 10 is capable of operating in accordance with any of a number of first, second and/or third-generation communication protocols or the like. For example, the mobile terminal 10 may be capable of operating in accordance with second-generation (2G) wireless communication protocols IS-136 (TDMA), GSM, and IS-95 (CDMA) or third-generation wireless communication protocol Wideband Code Division Multiple Access (WCDMA).
It is understood that the controller 20 includes circuitry required for implementing audio and logic functions of the mobile terminal 10. For example, the controller 20 may be comprised of a digital signal processor device, a microprocessor device, and various analog to digital converters, digital to analog converters, and other support circuits. Control and signal processing functions of the mobile terminal 10 are allocated between these devices according to their respective capabilities. The controller 20 thus may also include the functionality to convolutionally encode and interleave messages and data prior to modulation and transmission. The controller 20 can additionally include an internal voice coder, and may include an internal data modem. Further, the controller 20 may include functionality to operate one or more software programs, which may be stored in memory. For example, the controller 20 may be capable of operating a connectivity program, such as a conventional Web browser. The connectivity program may then allow the mobile terminal 10 to transmit and receive Web content, such as location-based content, according to a Wireless Application Protocol (WAP), for example.
The mobile terminal 10 also comprises a user interface including an output device such as a conventional earphone or speaker 24, a ringer 22, a microphone 26, a display 28, and a user input interface, all of which are coupled to the controller 20. The user input interface, which allows the mobile terminal 10 to receive data, may include any of a number of devices allowing the mobile terminal 10 to receive data, such as a keypad 30, a touch display (not shown) or other input device. In embodiments including the keypad 30, the keypad 30 may include the conventional numeric (0-9) and related keys (#, *), and other keys used for operating the mobile terminal 10. Alternatively, the keypad 30 may include a conventional QWERTY keypad. The mobile terminal 10 further includes a battery 34, such as a vibrating battery pack, for powering various circuits that are required to operate the mobile terminal 10, as well as optionally providing mechanical vibration as a detectable output.
In an exemplary embodiment, the mobile terminal 10 includes a media capturing module 36, such as a camera, video and/or audio module, in communication with the controller 20. The media capturing module 36 may be any means for capturing an image, video and/or audio for storage, display or transmission. For example, in an exemplary embodiment in which the media capturing module 36 is a camera module, the camera module 36 may include a digital camera capable of forming a digital image file from a captured image. As such, the camera module 36 includes all hardware, such as a lens or other optical device, and software necessary for creating a digital image file from a captured image. Alternatively, the camera module 36 may include only the hardware needed to view an image, while a memory device of the mobile terminal 10 stores instructions for execution by the controller 20 in the form of software necessary to create a digital image file from a captured image. In an exemplary embodiment, the camera module 36 may further include a processing element such as a co-processor which assists the controller 20 in processing image data and an encoder and/or decoder for compressing and/or decompressing image data. The encoder and/or decoder may encode and/or decode according to a JPEG standard format.
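By way of illustration only, the following Python sketch shows one way a digital image file might be formed from captured pixel data as described above. The use of the Pillow library, the function name, and its parameters are assumptions of this sketch and are not taken from the disclosure.

```python
# Illustrative sketch only: encode captured RGB pixel data to a JPEG file,
# as the camera module 36 is described as doing. Pillow is an arbitrary
# implementation choice; nothing in the text names a particular library.
from PIL import Image

def create_digital_image_file(raw_rgb_bytes: bytes, width: int, height: int,
                              path: str) -> None:
    """Form a digital image file from a captured image."""
    image = Image.frombytes("RGB", (width, height), raw_rgb_bytes)
    image.save(path, format="JPEG")  # encode according to a JPEG standard format
```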
The mobile terminal 10 may further include a user identity module (UIM) 38. The UIM 38 is typically a memory device having a processor built in. The UIM 38 may include, for example, a subscriber identity module (SIM), a universal integrated circuit card (UICC), a universal subscriber identity module (USIM), a removable user identity module (R-UIM), etc. The UIM 38 typically stores information elements related to a mobile subscriber. In addition to the UIM 38, the mobile terminal 10 may be equipped with memory. For example, the mobile terminal 10 may include volatile memory 40, such as volatile Random Access Memory (RAM) including a cache area for the temporary storage of data. The mobile terminal 10 may also include other non-volatile memory 42, which can be embedded and/or may be removable. The non-volatile memory 42 can additionally or alternatively comprise an EEPROM, flash memory or the like, such as that available from the SanDisk Corporation of Sunnyvale, California, or Lexar Media Inc. of Fremont, California. The memories can store any of a number of pieces of information and data used by the mobile terminal 10 to implement the functions of the mobile terminal 10. For example, the memories can include an identifier, such as an international mobile equipment identification (IMEI) code, capable of uniquely identifying the mobile terminal 10.
Referring now to the system illustrated in the accompanying drawings, one or more mobile terminals 10 may each include an antenna 12 for transmitting signals to and for receiving signals from a base site or base station (BS) 44. The BS 44 may be a part of one or more cellular or mobile networks, each of which includes elements required to operate the network, such as a mobile switching center (MSC) 46. The MSC 46 is capable of routing calls to and from the mobile terminal 10 when the mobile terminal 10 is making and receiving calls.
The MSC 46 can be coupled to a data network, such as a local area network (LAN), a metropolitan area network (MAN), and/or a wide area network (WAN). The MSC 46 can be directly coupled to the data network. In one typical embodiment, however, the MSC 46 is coupled to a GTW 48, and the GTW 48 is coupled to a WAN, such as the Internet 50. In turn, devices such as processing elements (e.g., personal computers, server computers or the like) can be coupled to the mobile terminal 10 via the Internet 50. For example, as explained below, the processing elements can include one or more processing elements associated with a computing system 52 (two of which are shown in the accompanying drawings), an origin server 54, and/or the like.
The BS 44 can also be coupled to a serving GPRS (General Packet Radio Service) support node (SGSN) 56. As known to those skilled in the art, the SGSN 56 is typically capable of performing functions similar to the MSC 46 for packet switched services. The SGSN 56, like the MSC 46, can be coupled to a data network, such as the Internet 50. The SGSN 56 can be directly coupled to the data network. In a more typical embodiment, however, the SGSN 56 is coupled to a packet-switched core network, such as a GPRS core network 58. The packet-switched core network is then coupled to another GTW 48, such as a GTW GPRS support node (GGSN) 60, and the GGSN 60 is coupled to the Internet 50. In addition to the GGSN 60, the packet-switched core network can also be coupled to a GTW 48. Also, the GGSN 60 can be coupled to a messaging center. In this regard, the GGSN 60 and the SGSN 56, like the MSC 46, may be capable of controlling the forwarding of messages, such as MMS messages. The GGSN 60 and SGSN 56 may also be capable of controlling the forwarding of messages for the mobile terminal 10 to and from the messaging center.
In addition, by coupling the SGSN 56 to the GPRS core network 58 and the GGSN 60, devices such as a computing system 52 and/or origin server 54 may be coupled to the mobile terminal 10 via the Internet 50, SGSN 56 and GGSN 60. In this regard, devices such as the computing system 52 and/or origin server 54 may communicate with the mobile terminal 10 across the SGSN 56, GPRS core network 58 and the GGSN 60. By directly or indirectly connecting mobile terminals 10 and the other devices (e.g., computing system 52, origin server 54, etc.) to the Internet 50, the mobile terminals 10 may communicate with the other devices and with one another, such as according to the Hypertext Transfer Protocol (HTTP), to thereby carry out various functions of the mobile terminals 10.
Although not every element of every possible mobile network is shown and described herein, it should be appreciated that the mobile terminal 10 may be coupled to one or more of any of a number of different networks through the BS 44. In this regard, the network(s) can be capable of supporting communication in accordance with any one or more of a number of first-generation (1G), second-generation (2G), 2.5G, third-generation (3G) and/or future mobile communication protocols or the like. For example, one or more of the network(s) can be capable of supporting communication in accordance with 2G wireless communication protocols IS-136 (TDMA), GSM, and IS-95 (CDMA). Also, for example, one or more of the network(s) can be capable of supporting communication in accordance with 2.5G wireless communication protocols GPRS, Enhanced Data GSM Environment (EDGE), or the like. Further, for example, one or more of the network(s) can be capable of supporting communication in accordance with 3G wireless communication protocols, such as a Universal Mobile Telecommunications System (UMTS) network employing Wideband Code Division Multiple Access (WCDMA) radio access technology. Some narrow-band AMPS (NAMPS), as well as TACS, network(s) may also benefit from embodiments of the present invention, as should dual- or higher-mode mobile stations (e.g., digital/analog or TDMA/CDMA/analog phones).
The mobile terminal 10 can further be coupled to one or more wireless access points (APs) 62. The APs 62 may comprise access points configured to communicate with the mobile terminal 10 in accordance with techniques such as, for example, radio frequency (RF), Bluetooth (BT), infrared (IrDA) or any of a number of different wireless networking techniques, including wireless LAN (WLAN) techniques such as IEEE 802.11 (e.g., 802.11a, 802.11b, 802.11g, 802.11n, etc.), WiMAX techniques such as IEEE 802.16, and/or ultra wideband (UWB) techniques such as IEEE 802.15 or the like. The APs 62 may be coupled to the Internet 50. As with the MSC 46, the APs 62 can be directly coupled to the Internet 50. In one embodiment, however, the APs 62 are indirectly coupled to the Internet 50 via a GTW 48. Furthermore, in one embodiment, the BS 44 may be considered as another AP 62. As will be appreciated, by directly or indirectly connecting the mobile terminals 10 and the computing system 52, the origin server 54, and/or any of a number of other devices, to the Internet 50, the mobile terminals 10 can communicate with one another, the computing system, etc., to thereby carry out various functions of the mobile terminals 10, such as to transmit data, content or the like to, and/or receive content, data or the like from, the computing system 52. As used herein, the terms “data,” “content,” “information” and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments of the present invention. Thus, use of any such terms should not be taken to limit the spirit and scope of the present invention.
Although not shown in the accompanying drawings, the mobile terminal 10 and the computing system 52 may additionally, or alternatively, be coupled to one another directly and communicate in accordance with, for example, RF, BT, IrDA or any of a number of different wireline or wireless communication techniques, including LAN, WLAN, WiMAX and/or UWB techniques.
An exemplary embodiment of the invention will now be described with reference to the accompanying drawings, in which certain elements of a system for utilizing speaker recognition in metadata-based content management are displayed. The system may be employed, for example, on the mobile terminal 10, although the system may alternatively be embodied on a variety of other devices, both mobile and fixed, and is therefore not limited to application on the mobile terminal 10.
Referring now to the drawings, the system for utilizing speaker recognition in metadata-based content management includes an input control module 70, an identity determining module 72, a characterization module 74 and an interface module 76, each of which is described more fully below.
The input control module 70 may be any device or means embodied in either hardware, software, or a combination of hardware and software that is capable of controlling when analysis of a speaker's voice for utilization in speaker recognition will occur. In an exemplary embodiment, the input control module 70 is in operable communication with the camera module 36. In this regard, the input control module 70 may receive an indication 78 from the camera module 36 that a content item is about to be created. For example, the indication 78 may be indicative of an intention to create a content item, which may be inferred when a camera application is launched, when lens cover removal is detected, or in any other suitable way. In an exemplary embodiment, the input control module 70 receives input audio 80 from areas proximate to the mobile terminal 10 and may begin recording audio data from the input audio 80 when the camera application is launched. Thus, an audio sample including audio data may be recorded before, during and after an image is captured. The audio sample, including either a portion of the recorded audio data or all of the recorded audio data, may then be communicated to the identity determining module 72 for speaker recognition processing. In an exemplary embodiment, audio data may be recorded during the entire time that the camera application is active; however, only a portion of the recorded audio data corresponding to a predetermined time period after and/or before content item creation may be communicated to the identity determining module 72 as recognition data 82 associated with the content item created. In other words, for example, the input control module 70 may communicate audio data corresponding to a predetermined time before and/or after an image is created to the identity determining module 72 in response to creation of the image. It should be noted that the recognition data 82 may be recorded as described above, or communicated in real time responsive to control by the input control module 70.
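To make the buffering behavior concrete, the following Python sketch illustrates one possible realization of the input control module 70: audio is buffered while the camera application is active, and only a window of audio recorded within a predetermined time before and after image capture is forwarded as recognition data. All class and method names, and the window lengths, are assumptions of this sketch rather than elements of the disclosure.

```python
# A minimal sketch, assuming fixed-size audio frames arriving at a known
# rate; names and window lengths are hypothetical, not from the disclosure.
import collections

class AudioRingBuffer:
    """Keep only the most recent `seconds` worth of audio frames."""
    def __init__(self, seconds: int, frames_per_second: int = 50):
        self.frames = collections.deque(maxlen=seconds * frames_per_second)

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)

    def snapshot(self) -> list:
        return list(self.frames)

class InputControl:
    """Buffer audio while the camera is active; on capture, forward a
    pre/post window of audio as recognition data (cf. recognition data 82)."""
    def __init__(self, pre_seconds: int = 5, post_seconds: int = 5):
        self.pre_buffer = AudioRingBuffer(pre_seconds)
        self.post_seconds = post_seconds
        self.active = False

    def on_camera_launched(self) -> None:
        # Corresponds to the indication 78 that a content item is about
        # to be created; recording begins here.
        self.active = True

    def on_audio_frame(self, frame: bytes) -> None:
        # Input audio 80 from areas proximate to the device.
        if self.active:
            self.pre_buffer.push(frame)

    def on_image_captured(self, record_post) -> list:
        # `record_post(seconds)` is assumed to block and return the frames
        # recorded during the post-capture window.
        return self.pre_buffer.snapshot() + record_post(self.post_seconds)
```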
The identity determining module 72 may be any device or means embodied in either hardware, software, or a combination of hardware and software that is capable of determining an identity of a speaker based on the recognition data 82 including voice data from the speaker. The identity determining module 72 may also be capable of determining corresponding identities for a plurality of speakers given voice data from the plurality of speakers. In an exemplary embodiment, the identity determining module 72 receives the recognition data 82 and compares voice data included in the recognition data 82 to voice models that may be stored in the identity determining module 72 or in another location. The voice models may include models of voices of any number of previously recorded speakers. The voice models may be produced by any means known in the art, such as by recording and sampling the voice patterns of respective speakers. The voice models may be stored, for example, in a speaker database 84 which may be a part of the identity determining module 72 or located remote from the identity determining module 72. As such, the speaker database 84 may include a representation of “long-term” statistical characteristics of speech for each speaker. The statistical characteristics may be gathered, for example, from phone conversations conducted with the speaker, or from previous recordings of the speaker conducted by the mobile terminal 10 or stored at the mobile terminal 10, a network server, a personal computer, a storage device, etc. Each of the voice models may correspond to a particular identity. For example, if a name of the speaker is known, then the name may form the identity for the speaker. Alternatively, a label of “unknown” or any other appropriate or distinctive label may form the identity for a particular speaker.
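By way of example only, the following sketch illustrates how such a comparison against stored voice models might proceed. The diagonal-Gaussian scoring over per-dimension "long-term" means and variances, and the acceptance threshold, are assumptions chosen for illustration; the disclosure does not prescribe a particular modeling technique.

```python
# Hedged sketch: score an audio sample's feature vectors against stored
# voice models holding long-term per-dimension statistics. The scoring
# rule and the threshold value are illustrative assumptions.
import math

class VoiceModel:
    def __init__(self, identity: str, means: list, variances: list):
        self.identity = identity
        self.means = means
        self.variances = variances  # assumed strictly positive

    def avg_log_likelihood(self, feature_vectors: list) -> float:
        """Average diagonal-Gaussian log-likelihood over the sample."""
        total = 0.0
        for vec in feature_vectors:
            for x, mu, var in zip(vec, self.means, self.variances):
                total += -0.5 * (math.log(2.0 * math.pi * var)
                                 + (x - mu) ** 2 / var)
        return total / max(len(feature_vectors), 1)

def identify_speaker(feature_vectors, speaker_database, threshold=-50.0):
    """Return the identity of the best-matching model, or None when no
    stored model scores above the (assumed) acceptance threshold."""
    best_identity, best_score = None, float("-inf")
    for model in speaker_database:
        score = model.avg_log_likelihood(feature_vectors)
        if score > best_score:
            best_identity, best_score = model.identity, score
    return best_identity if best_score >= threshold else None
```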
As stated above, the identity determining module 72 compares voice data from the recognition data 82 to the voice models in order to determine the identity of any speakers associated with the voice data. If one or more speakers in a particular segment of recognition data 82 cannot be identified, the user may be notified of the failure to recognize the speaker via the interface module 76. Additionally, the user may be given an option to assign a new identity for each of the one or more speakers that could not be identified. The assignment of the new identity may be performed manually, or in conjunction with any of the characterization mechanisms described below in conjunction with the characterization module 74. If one or more speakers in a particular segment of recognition data 82 can be correlated with a corresponding voice model, a metadata annotation 88 or other annotation based on the identity associated with the corresponding voice model may be assigned to the content item associated with the recognition data 82. The interface module 76 may then display the metadata annotation 88 of the identity when a corresponding content item 90 is highlighted or selected, for example, on the display 28 of the mobile terminal 10, as shown in the accompanying drawings.
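Continuing the sketch above, the annotation step might be glued together as follows. The dictionary-based metadata structure, the per-speaker segment representation, and the user-prompt callable are all hypothetical stand-ins for whatever the device actually provides.

```python
# Hypothetical glue for assigning the annotation 88: recognized identities
# become metadata tags, and unrecognized speakers trigger a user prompt
# that may assign a new identity.
def annotate_content_item(item_metadata: dict, speaker_segments: list,
                          identify, prompt_user_for_identity) -> dict:
    # `identify` maps a segment's features to an identity or None (e.g. the
    # identify_speaker sketch above); `prompt_user_for_identity` lets the
    # user label a speaker who could not be recognized.
    for features in speaker_segments:
        identity = identify(features)
        if identity is None:
            identity = prompt_user_for_identity(features)  # may return None
        if identity is not None:
            item_metadata.setdefault("speakers", []).append(identity)
    return item_metadata
```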
The interface module 76 may be any device or means embodied in either hardware, software, or a combination of hardware and software that is capable of presenting information associated with content items to the user, for example, on the display 28 of the mobile terminal 10. The information associated with the content items may include, for example, thumbnails of images corresponding to each content item and the metadata annotation 88 of a highlighted or selected content item, as shown in the accompanying drawings.
The interface module 76 may also provide the user with a mechanism by which to select a specific speaker as a search criterion. For example, data entry may be performed in a field as shown in the accompanying drawings, and content items having metadata matching the entered search criterion may then be presented to the user.
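A minimal sketch of such a search follows, assuming content items carry the speaker tags assigned above; the ContentItem record and its field names are stand-ins for whatever record the device keeps.

```python
# Sketch of the search behaviour of the interface module 76: filter stored
# content items by a speaker characterization entered by the user.
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    filename: str
    speakers: list = field(default_factory=list)  # metadata annotations

def search_by_speaker(items: list, characterization: str) -> list:
    """Return all content items tagged with the given speaker."""
    return [item for item in items if characterization in item.speakers]

# Usage: find every photo in a small library in which "Alice" spoke.
library = [ContentItem("img1.jpg", ["Alice"]),
           ContentItem("img2.jpg", ["Bob", "Alice"]),
           ContentItem("img3.jpg", ["Bob"])]
print([i.filename for i in search_by_speaker(library, "Alice")])
```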
The characterization module 74 may be any device or means embodied in either hardware, software, or a combination of hardware and software that is capable of assigning a characterization 96 to a particular speaker. The characterization 96 may be any user understandable identifier by which the particular speaker may be recognized by the user. For example, the characterization 96 may be a shortened version of the identity, a made-up label, etc. Alternatively, the characterization 96 may be associated with an object that is already known to the mobile terminal 10, such as a phonebook entry or a known device. Some embodiments of characterization assignment will now be discussed for purposes of providing examples, and not by way of limitation. Thus, the present invention should not be considered to be limited to the examples disclosed herein.
One exemplary characterization assignment may be manually performed. For example, a name corresponding to the identity, a nickname, a title, a label, or any other suitable identification mechanism may be manually assigned to correspond to a speaker. The user may manually assign the characterization 96 via the interface module 76. Such manual assignment could be performed, for example, by entering a textual characterization using the keypad 30 or another text entry device or by manually correlating the speaker to a phonebook entry. In order to make label selection easier, a short recording of the speaker's voice may be played before the manual labeling occurs.
Another exemplary characterization assignment may be automatically performed by the mobile terminal 10 or other device employing the present invention. For example, the speaker's voice may automatically be associated with an existing characterization of a corresponding phonebook entry. As such, during phone conversations, voices of both the user and the speaker may be recorded for voice modeling using the “long-term” statistical characteristics of the user and the speaker. Accordingly, a very good model can be achieved in this way. The characterization module 74 may then include a database or other correlation device to correlate a particular identity to an existing characterization of a corresponding phonebook entry. Thus, when the identity determining module 72 assigns an identity to a speaker that is recognized from a segment of recognition data 82, the characterization module 74 may automatically correlate the content item corresponding to the recognition data 82 with a phonebook entry corresponding to the identity of the speaker.
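For illustration, the "long-term" statistics might be accumulated incrementally over the course of phone conversations, for example with Welford's online mean/variance update; this algorithmic choice is an assumption of the sketch, not something the text prescribes.

```python
# Sketch: accumulate per-dimension running statistics of a contact's speech
# across phone calls; the resulting means/variances can seed a voice model
# tied to the corresponding phonebook entry. Welford's update is an
# illustrative choice.
class RunningStats:
    def __init__(self, dim: int):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim

    def update(self, feature_vector: list) -> None:
        self.n += 1
        for i, x in enumerate(feature_vector):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def variances(self) -> list:
        # Sample variance; degenerate until at least two updates are seen.
        return [m2 / max(self.n - 1, 1) for m2 in self.m2]
```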
As another alternative, automatic characterization assignment may be performed by associating the speaker with nearby devices. For example, if a speaker and a nearby device are simultaneously detected on multiple occasions, a reasonably high probability may exist that the speaker correlates to the device. Accordingly, when a sufficiently high probability of correlation is reached, a speaker-to-device correlation may be made and an existing characterization for the device may be assigned to the identity of the speaker whenever the speaker's voice is detected. Furthermore, the device may be associated with a phonebook entry, thereby allowing the identity of the speaker, once determined, to be correlated to an existing characterization for the phonebook entry via correlation of the speaker to the device, and the device to the phonebook entry.
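The co-occurrence reasoning described above might be realized as follows; the minimum observation count and the 0.8 ratio threshold are illustrative assumptions, and the class and method names are hypothetical.

```python
# Sketch: count how often a speaker and a nearby device (e.g., one seen
# over Bluetooth) are detected together, and declare a speaker-to-device
# correlation once the co-occurrence ratio passes an assumed threshold.
import collections

class DeviceCorrelator:
    def __init__(self, min_observations: int = 5, threshold: float = 0.8):
        self.speaker_seen = collections.Counter()
        self.together = collections.Counter()
        self.min_observations = min_observations
        self.threshold = threshold

    def observe(self, speaker: str, nearby_devices: list) -> None:
        self.speaker_seen[speaker] += 1
        for device in nearby_devices:
            self.together[(speaker, device)] += 1

    def correlated_device(self, speaker: str):
        """Return the device most likely associated with `speaker`, or None
        until a sufficiently high probability of correlation is reached."""
        seen = self.speaker_seen[speaker]
        if seen < self.min_observations:
            return None
        best = max(((d, c) for (s, d), c in self.together.items()
                    if s == speaker), key=lambda t: t[1], default=None)
        if best and best[1] / seen >= self.threshold:
            return best[0]
        return None
```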
As yet another alternative, embodiments of the present invention may be used in conjunction with face recognition devices that may be employed on the mobile terminal 10 or any other device capable of practicing the present invention. As such, the face recognition device may have the capability to correlate a person in an image with a particular existing characterization. The existing characterization may have been developed in response to face models created from video calls which can be associated with a corresponding phonebook entry. Alternatively, the existing characterization may have been developed by manually assigning a textual characterization to a particular image or thumbnail of a face. Face recognition typically involves using statistical modeling to create relationships between a face in an image and a known face, for example, from another image. Statistical modeling may also be used to create relationships between recognized faces and speakers. Thus, for example, if a face is discernable in a particular image which forms a content item having associated recognition data 82, the characterization module 74 may include software capable of employing both face recognition and speaker recognition techniques to develop a statistical probability that the speaker and the face are related. Thus, a face-to-speaker relationship may be determined. The face-to-speaker relationship may then be used to associate a speaker with an existing characterization associated with the face. Furthermore, the face may be correlated with a phonebook entry, such that the speaker can be correlated to an existing characterization associated with the phonebook entry via face recognition.
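As a toy illustration of the face-to-speaker statistical relationship, the two recognizers' confidences might be fused under an independence assumption; both the fusion rule and the decision threshold are assumptions made purely for illustration.

```python
# Naive sketch: fuse face-recognition and speaker-recognition confidences
# into a single face-to-speaker probability. Treating the two scores as
# independent is an assumption, not a method prescribed by the text.
def face_speaker_probability(p_face_is_person: float,
                             p_voice_is_person: float) -> float:
    return p_face_is_person * p_voice_is_person

def faces_match_speaker(p_face: float, p_voice: float,
                        decision_threshold: float = 0.5) -> bool:
    """Declare a face-to-speaker relationship when the fused probability
    exceeds the (assumed) decision threshold."""
    return face_speaker_probability(p_face, p_voice) >= decision_threshold
```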
As stated above, although the present invention was primarily described in the context of content items that are still images such as pictures or photographs, any content item that may be created at the mobile terminal 10 or any other device employing embodiments of the present invention is also envisioned. For example, in a situation where the content item is audio or video which includes audio content, the audio content in content items associated with either the audio or the video may be used as described above for assigning appropriate metadata or other tags to the content items based on the identity of the speaker as determined via the principles described above. In other words, when the content item is audio or video which includes audio material, there is no need to capture additional audio in order to employ embodiments of the present invention.
The method described below may be illustrated by flowcharts in which each block or step, and combinations of blocks or steps, can be implemented by various means, such as hardware, firmware and/or software including one or more computer program instructions. Accordingly, blocks or steps of the flowcharts support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that one or more blocks or steps of the flowcharts, and combinations of blocks or steps in the flowcharts, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
In this regard, one embodiment of a method for utilizing speaker recognition in metadata-based content management includes comparing an audio sample obtained at a time corresponding to creation of a content item to stored voice models at operation 100. At operation 110, an identity of a speaker is determined based on the comparison. If the audio sample does not correspond to any of the stored voice models, then a new voice model is stored corresponding to the audio sample and a new identity may be assigned at operation 115. A quality check regarding recording quality of the audio sample may be performed to ensure the audio sample meets a quality standard before any identity can be assigned to the speaker. As such, the quality standard may be chosen to create a reasonably high probability that the speaker recorded in the audio sample can be accurately compared to the stored voice models. A metadata tag is assigned to the content item based on the identity at operation 120. The method may include an additional operation of manually or automatically correlating the identity to an existing phonebook entry, device, or face recognition characterization. The method may also include associating a plurality of content items in a group with a particular characterization in response to each of the content items of the group having a same metadata tag. In an exemplary embodiment, the method includes providing a user interface configured to enable searching for content items based on the particular characterization and/or enable presentation of a list of characterizations.
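Tying the operations together, a hedged end-to-end sketch of operations 100 through 120 might look as follows. The callables stand in for components described above, and their names, along with the dictionary-based content item, are hypothetical.

```python
# Sketch of the method flow: quality check, comparison and identification
# (operations 100 and 110), enrollment of a new voice model when no match
# is found (operation 115), and metadata tagging (operation 120).
def manage_content(audio_features, content_item: dict,
                   identify, quality_ok, enroll_new_model) -> dict:
    # `identify` maps features to an identity or None (e.g. the
    # identify_speaker sketch above); `enroll_new_model` stores a new voice
    # model and returns the newly assigned identity.
    if not quality_ok(audio_features):
        # Sample fails the quality standard: no identity may be assigned.
        return content_item
    identity = identify(audio_features)              # operations 100 and 110
    if identity is None:
        identity = enroll_new_model(audio_features)  # operation 115
    content_item.setdefault("speakers", []).append(identity)  # operation 120
    return content_item
```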
It should be noted once again that although the preceding exemplary embodiment has been described in the context of image related content items, embodiments of the present invention may also be practiced in the context of any other content item. Furthermore, embodiments of the present invention may be advantageously employed for utilization of speaker recognition for metadata-based content management in numerous types of devices such as, for example, a mobile terminal, a personal computer, a remote or local server, a video recorder, a network attached storage device, etc. It should also be noted that embodiments of the present invention need not be confined to application on a single device, as described in exemplary embodiments above. In other words, some operations of a method according to embodiments of the present invention may be performed on one device, while other operations are performed on a different device. Similarly, one or more of the modules described above may be embodied on a different device. For example, processing operations, such as those performed in the identity determining module 72, the characterization module 74 and/or the speaker database 84, may be performed on one device, such as a server, while display operations are performed on a different device, such as a mobile terminal. Additionally, stored voice models may be located at one device, while a comparison between the voice models and recognition data occurs on a separate device. Furthermore, audio samples may be recorded or processed in real time, as stated above. However, a device obtaining the audio samples may, in any case, be separate from a device that stores the audio samples, which may in turn be separate from a device which processes the audio samples.
The above described functions may be carried out in many ways. For example, any suitable means for carrying out each of the functions described above may be employed to carry out the invention. In one embodiment, all or a portion of the elements of the invention generally operate under control of a computer program product. The computer program product for performing the methods of embodiments of the invention includes a computer-readable storage medium, such as the non-volatile storage medium, and computer-readable program code portions, such as a series of computer instructions, embodied in the computer-readable storage medium.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.